My first dataset on Kaggle and web scraping with Selenium

Hi,

It took me some time to figure out how to scrape JavaScript-rendered pages, but I managed! It would be amazing to get some feedback on the scraper and the data that I've posted on Kaggle.

I believe the dataset is very interesting because it contains numerous success metrics and stats that could determine the end result.

You can find the scraper and the raw data on my GitHub.

It would be nice to hear what I can improve in terms of code readability, etc.

3 Likes

Hi, @DavidMiedema

Thank you for sharing the code.

I think you can guess what I’m going to say about the improvements :slight_smile:
You did not have to use Selenium for this task. Here is a link that serves all the necessary data without JS: https://www.chessable.com/ajax/loadReps.php?v=a&search=&tag=tag_all&difficulty=diff_all&priceLow=0&priceHigh=x&section=a&language=language_all&fen=&page=3
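
For illustration, a minimal sketch of fetching that endpoint directly with requests instead of Selenium (this assumes the response is an HTML fragment; adjust the parsing if it turns out to be JSON):

import requests
from bs4 import BeautifulSoup

url = (
    "https://www.chessable.com/ajax/loadReps.php"
    "?v=a&search=&tag=tag_all&difficulty=diff_all"
    "&priceLow=0&priceHigh=x&section=a"
    "&language=language_all&fen=&page=3"
)

response = requests.get(url, timeout=10)
response.raise_for_status()

# Inspect the returned markup before writing any selectors.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify()[:500])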

The other thing that confuses me a bit is the way you process the individual cards. You create separate lists, search for elements, and then combine the lists into a DataFrame. Why not build a list of dictionaries instead, where each key corresponds to a column?

import time
from random import randint

import pandas as pd
from selenium.webdriver.common.by import By

# `driver` and `card_links` are created earlier in your script.
data_cards = []
for link in card_links:
    card_result = {"course_link": link}
    driver.get(link)
    print("surfed to: {}".format(driver.current_url))
    time.sleep(randint(1, 2))  # polite random delay between page loads
    card_result["type_author"] = driver.find_element(
        By.XPATH, '//span[@class="book-cover__author"]'
    ).text
    data_cards.append(card_result)

data = pd.DataFrame(data_cards)

This way you avoid problems when a single value is missing from one of the lists: with parallel lists, your code would fail at the very end, when forming the DataFrame. Besides, you do not have to type each column name more than once.
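
To illustrate the point with a tiny, self-contained example: pandas fills in missing keys as NaN when building a DataFrame from a list of dictionaries, so one incomplete card does not break anything.

import pandas as pd

# One card is missing 'type_author'; the DataFrame still builds fine.
rows = [
    {"course_link": "a", "type_author": "GM X"},
    {"course_link": "b"},
]
print(pd.DataFrame(rows))
#   course_link type_author
# 0           a        GM X
# 1           b         NaN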

1 Like

Well, this made me laugh. Thank you for all the feedback! It was most definitely a pain getting this to work, as it did crash twice at the end. Fortunately, I had tested it a few times on a single page first, which made for a nice workflow.

I agree, creating a list of dictionaries is much faster. It was just my first time working with one, and it was kind of overwhelming. But yes, it protects me much better against NaN values, which gives me a lot less to think about while parsing.

The thing that would be very interesting to me is how to get those “/ajax/loadReps.php?” links for every single page, e.g. I can't swing from link to link (feels like scraping Tarzan) that easily.

How would you approach that?

1 Like

I must say I have quite a few projects that use Selenium. :slight_smile: But it's usually a last resort.

The links are much simpler than that: you can manually determine how many pages there are on the site. Go to page 50, for example, and check whether there is any data on it. Then create a generator that yields the desired number of page URLs; if you work in a single thread, simply iterate over them, and if you use several threads, feed the URLs from the generator to the threads (see the sketch after the generator below).

pages = (f'https://www.chessable.com/ajax/loadReps.php?v=a&search=&tag=tag_all&difficulty=diff_all&priceLow=0&priceHigh=x&section=a&language=language_all&fen=&page={i}' for i in range(1, 20))
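
As an illustration of handing the generator to threads, a minimal sketch using concurrent.futures (the fetch helper is hypothetical; plug in your own parsing):

from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    # Download one page; what you do with the HTML is up to you.
    return requests.get(url, timeout=10).text

with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, pages))  # `pages` is the generator above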

Another, more common, way is to simply use a while loop, incrementing the page parameter until you reach a page with no items, or a page with no new items. The latter can happen if the site just keeps returning the last page's results once the page parameter is exceeded.
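
A minimal sketch of that while loop, assuming the endpoint returns an HTML fragment (the item selector is a placeholder and must be adapted to the real markup):

import requests
from bs4 import BeautifulSoup

BASE = ("https://www.chessable.com/ajax/loadReps.php"
        "?v=a&search=&tag=tag_all&difficulty=diff_all&priceLow=0&priceHigh=x"
        "&section=a&language=language_all&fen=&page={page}")

page = 1
previous_items = None
while True:
    html = requests.get(BASE.format(page=page), timeout=10).text
    # Placeholder selector: count whatever element represents one course card.
    items = [a.get("href") for a in BeautifulSoup(html, "html.parser").select("a")]
    if not items or items == previous_items:
        break  # empty page, or the same page repeated: we are done
    previous_items = items
    page += 1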

I would use the first approach, as I am used to working with multithreading, and I quite often come across situations where I need to get data from tens or hundreds of thousands of pages.

For this particular site, you can simply do this - https://www.chessable.com/ajax/loadReps.php?v=a&search=&tag=tag_all&difficulty=diff_all&priceLow=0&priceHigh=x&section=a&language=language_all&fen=&page=300.
You will see that there are only 384 elements, at 20 elements per page, which means only 19 full pages for the whole site.
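
One small arithmetic note on those numbers: 384 / 20 rounds up, so there may be a final partial page beyond the 19 full ones, and it is worth checking page 20 as well.

import math

total_items = 384  # observed count from the request above
per_page = 20
print(math.ceil(total_items / per_page))  # -> 20, i.e. 19 full pages + 1 partial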

1 Like