I hope this is not a problem, but I scraped the Community

Well, you seem to have a good grasp of English already, so I can imagine that you’ll only get better.

1 Like

Nicely done @otavios.s!
You’ve made yourself an optimized peer in the community :smiley:

I once implemented a web scraper with Selenium right after the web scraping course, and I had a lot of fun writing it. Selenium is pretty much required for dynamic websites. Building something like this is the best way to reinforce fresh knowledge.

If you see interest from peers in doing the same, it might be a good idea to create a Git repository and open-source your project.

Congrats on your automation project!

1 Like

Thank you @fedepereira!

And here’s the repository you suggested:

Great idea!

2 Likes

Wow, nice. I’ll follow you there on GitHub.

1 Like

Hi @otavios.s

You are good at picking up new skills and finding ways to automate tasks around you.

So here are a few things you could improve.

  1. You have no error handling. Selenium is rather unstable, and your code runs in an infinite loop, so after any error it will simply stop working and you won’t even know about it. (See the sketch after this list.)
  2. Besides headless, there are other Selenium settings that are recommended. By using them, you significantly reduce the risk of being blocked by the site. It’s a good thing the community doesn’t check for such a configuration. Here is a small list.
            options.add_argument("--no-sandbox")
            options.add_argument("--disable-dev-shm-usage")
            options.add_argument("--disable-gpu")
            options.add_argument("--user-agent={user-agent}")
  3. Your code does not close Selenium. If you don’t exit it correctly, the chromedriver process keeps hanging around, and in time that will become a problem, especially on a server.
  4. Do not use Selenium. It is very slow, it does not scale (yes, scaling is not always necessary), and it is very unstable. Its only advantage is that it is easy to learn.
  5. For example, you could use requests and go to https://community.dataquest.io/c/qa/44/l/latest.json to get a full JSON file as a result. Then, using the [‘topic_list’][‘topics’] key, you would get the latest questions. You will see that you have many more possibilities to process the data than with Selenium, e.g. adding filters.
  6. Do not use an infinite while + sleep loop to run the code at intervals. Let your OS do it for you: you could set up cron on a Unix system, or Task Scheduler on Windows.
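
Just to make points 1 to 3 concrete, here is a minimal sketch of how they fit together. The actual scraping and email steps are placeholders, since I don’t know exactly what your loop does inside:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")

    driver = webdriver.Chrome(options=options)
    try:
        # placeholder for the real work: load the page, find the new topics,
        # build and send the notification email
        driver.get("https://community.dataquest.io/c/qa/44/l/latest")
    except Exception as exc:
        # at the very least, log the failure somewhere you will actually see it
        print(f"Scrape failed: {exc}")
    finally:
        # always release the browser, otherwise chromedriver keeps piling up
        # in the process list
        driver.quit()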

I hope you find some of these recommendations useful to you.

4 Likes

Hi @moriturus7

Thanks for the tips.

Let me just comment on them:

1- Yes, I only implemented some error handling in the email function. Because it is just a simple project, I didn’t think it was necessary, but good point.

2- I had no idea about this. Good to know.

3- I forgot to quit Selenium :sweat_smile:. Well noted.

4- I’m having a hard time finding the right tools to use. I haven’t figured out yet when to use Selenium, requests, Scrapy, or BeautifulSoup.

5- I had no idea about this either.

6- Good point.

Anyway, I never meant to look like an expert. I’m just a beginner trying new things, and I get really happy when I accomplish something. Do you know where I could learn more about scraping?

I really appreciate you taking the time to help. Thanks!

1 Like

Unfortunately, I won’t be able to point you to good web scraping articles. Most articles on the subject are primitive and do not provide enough information.

I’ll probably be able to answer most of the questions you might have about web scraping.

About which libraries to use and for what purpose: here is another list of small recommendations. Apparently I just like lists :slight_smile:

  1. There are only two good reasons to use Selenium.
    1.1 If a site uses JS to authorize a cookie, in simple words, if you need to execute JS code to confirm your right to work with the site. Instead of dealing with the JS code, it is easier to use Selenium, get the cookies, and pass them on to a more suitable library.
    1.2 If the time it would take to figure out how the data on the page is generated is far more than you can afford. An example from my practice is scraping comments from Google Maps. Given the task I was doing, I would have spent a lot of time digging through the source code of the page, so I used Selenium. Then I was reminded once again that Selenium is a path of pain and suffering.

  2. You can always use requests (see the sketch after this list).

  3. I won’t give a strong recommendation on Scrapy. Scrapy is an asynchronous framework, let’s say requests on steroids. Because Scrapy takes over a lot of things, you have less control over the code. In fact, when I came across Scrapy, it could not offer me anything that could not be done with requests plus parallelism libraries.

  4. BeautifulSoup is just an HTML parser. You can use it together with Selenium if you don’t want to use the parsing that is built into Selenium. You can use it with requests if the target page returns HTML instead of JSON. You can use it in any situation where you need to work with the HTML structure to get data.
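
As a concrete illustration of point 2, here is roughly what the requests version of the community check could look like, using the latest.json endpoint mentioned earlier. The filtering logic is only an assumption about what you might want to do with the topics, and the posts_count field is an assumption about the JSON structure, so inspect the response first:

    import requests

    # the community publishes the latest Q&A topics as plain JSON
    url = "https://community.dataquest.io/c/qa/44/l/latest.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    topics = response.json()["topic_list"]["topics"]

    # example filter: keep topics that only contain the original post,
    # i.e. no replies yet
    unanswered = [t["title"] for t in topics if t.get("posts_count", 0) <= 1]
    print(unanswered)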

6 Likes

The work is awesome and inspiring. I will follow your post and try it myself. Thanks~

1 Like

Do you think Selenium is a good option when you need to interact with the page, for example to log in, type a search, or click somewhere?

Thank you for the great tips!

I don’t think that’s a good option.

In most cases, if you study how the site is built and how it works with its data, you will find a better solution than Selenium.

I think I will soon create a topic on how to scrape an interactive site with requests, as an example.

1 Like

Hey! This is amazing and very inspiring. I wish I was this smart. LoL

1 Like

Hi @moriturus7,

I’ve read most of your replies here, which are a bit critical of the more usual scraping frameworks like Scrapy and Selenium.

Unfortunately it was not all that clear to me; the only thing that is clear is that you know a lot about it. Did you by any chance write a tutorial about how to use requests on the dreaded JS-heavy sites? I would love to “always use requests.”

1 Like

Hi @DavidMiedema

Maybe my answers really are a little critical. I have nothing against Scrapy. The team did a lot of work and wrote a very good library, and they found a way to monetize their work. But Scrapy, like any framework, takes on too much extra work. Often, by using more focused tools, you will get results that are similar if not better, with more control over what is happening.

With Selenium, I really do believe it is used too often as the basis for building scrapers. It is very expensive, slow, and inefficient. Using Selenium should be justified by a real need, most often connected with bypassing protections built on JS.

I wrote a little article on scraping without Selenium. My English is bad enough that you’ll probably get more information by examining the code and screenshots.

The main thing to understand when working with sites built on JS: usually JS is used to build the frontend, that is, the interface you interact with. But that frontend has to get its data from the server somehow. So either the data is already on the page, or it is fetched with a separate request. That’s why, when working with JS sites, you should pay close attention to DevTools > Network in your browser, and study the source code of the page rather than the rendered elements.

Magic does not happen and data cannot just appear. So you can either reproduce that network request directly with requests, if the data is transferred in a separate call, or find the data in the page source and extract it from there.

3 Likes

Just a little example. There’s a website:
https://www.kindercare.com/our-centers/find-a-center
When you enter a ZIP code, you are taken to the results page. You can use Selenium to enter the data and wait for rendering, or you can look at the Network tab

and use the link https://www.kindercare.com/data/center-search?location=20005&distance=15&edpId=

By adjusting the distance and changing the ZIP code in the location parameter, you will be able to receive the data much faster using requests.
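
Here is roughly what that looks like with requests. The parameter values are taken from the URL above, but whether the endpoint answers with JSON or an HTML fragment is something to check in DevTools first:

    import requests

    # call the same endpoint the site's frontend uses, instead of driving a browser
    url = "https://www.kindercare.com/data/center-search"
    params = {
        "location": "20005",  # ZIP code to search around
        "distance": 15,       # search radius
        "edpId": "",
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()

    # whether this comes back as JSON or an HTML fragment is something to
    # verify in DevTools - Network; here we just look at the raw payload
    print(response.headers.get("Content-Type"))
    print(response.text[:500])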

3 Likes

Very cool stuff indeed. Thanks for the great explanations :smiley: This is indeed something I tried at some point before learning Selenium, but I was mostly looking at the screen and I could not find a single element I wanted.

Especially with 80 images and a lot of other junk coming in, it is a bit of a search to find that one link. I am definitely going to try and learn this, because it seems the cleanest way to me!

1 Like

Hey @otavios.s

I got a doubt and I am quite unsure on this point. Is there any way to convert the scraped web data into a pandas DataFrame?

Thanks for your help.

Best
K!

Absolutely!

In fact, I did this in this project in order to filter the new, unanswered topics:

tables = pd.read_html(driver.page_source)

This stores all the tables on a page as a list of DataFrames.

There are other ways to do it, but it depends on what data you’re scraping.
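
For instance, a minimal sketch of that pattern looks like this. The table index and the column name are just placeholders; they depend entirely on which page you scraped:

    import pandas as pd

    # pd.read_html parses every <table> in the HTML and returns a list of DataFrames
    tables = pd.read_html(driver.page_source)

    # pick out the table you want - index 0 is just an example
    topics = tables[0]

    # from here it's ordinary pandas, e.g. filtering on a hypothetical "Replies" column
    unanswered = topics[topics["Replies"] == 0]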

1 Like

That’s cool. Good to know this point.

I have another question for you. How do we deal with the error “HTTP Error 403: Forbidden” when reading a URL with pandas? How should we proceed in this case?

Thanks for your help.

Best
K!

1 Like

This error never happened to me. If you’re using urllib, that could be the problem. You should try requests instead. You can read more in the links below:
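
A 403 often just means the server rejected the default Python user agent, so one hedged workaround is to fetch the page yourself with requests and a browser-like User-Agent header, then hand the HTML to pandas. The URL and header value here are only placeholders:

    import pandas as pd
    import requests

    url = "https://example.com/some-page-with-tables"  # placeholder URL
    headers = {
        # a browser-like User-Agent; many sites return 403 for the default Python one
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    }

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    # parse the tables from the HTML we already downloaded instead of passing the URL
    tables = pd.read_html(response.text)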

1 Like