Importing and downloading a website

Why do I get a 404 page not found status code when I use requests.get("awebsite.com") on some sites but not others?

There could be several reasons.

The website you mentioned doesn't load for me at all (assuming it exists). And as per https://www.isitdownrightnow.com/, its server might be down.

But as I said, there could be different reasons depending on the website, which you will have to look into. For example, a straightforward search turns up Stack Overflow posts which could shed light on why this might happen and what you could do about it.


Nice resources you’ve provided here @the_doctor! I am “not there yet” but I’m finding it’s good to read about things you know nothing about so that by the time you do start to learn it, it already seems familiar. Thanks for that!

P.S. – I want to scrape things so badly! :laughing:


Have you previously used the resource in the first Stack Overflow reference listed?

No. I haven’t worked with web scraping or using the requests library in any significant way as of now.

Sure. If you have future questions, then I’m sure someone else can help you better than I did. Good luck!

Hi @hunter.kiely
A 404 error is one of the standard HTTP response status codes, and it usually means the page does not exist: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

But there are several reasons why you might get this response.

  1. The page really does not exist; you can check by opening it in your browser.
  2. 403 and 404 errors sometimes occur if the site is blocked for your region. Usually it is a 403, but that depends on the site's developers, and it can be a 404.
  3. In the example you provided, you do not use headers. The site may block you with a 404 error because it does not accept the default headers that requests sends. Try the following:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get('url', headers=headers)  # replace 'url' with the page you want
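
To confirm that the headers were the problem, you can compare the status code with and without them. A minimal sketch (the URL is a placeholder):

import requests

url = 'http://example.com/'  # placeholder: the page that returns 404 for you
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}

# One request with the requests defaults, one with a browser-like User-Agent
print(requests.get(url).status_code)                   # e.g. 404 if the default User-Agent is rejected
print(requests.get(url, headers=headers).status_code)  # e.g. 200 if the headers were the issue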

Most of the sites you want to scrape will require headers.

Is this needed for browser issues, or for getting blocked by the server?

Yes, many sites change their structure depending on the browser and device you connect from. The User-Agent sends the basic data the site needs to process the request. Quite often, if the User-Agent is not valid or is on a block list, the site simply returns an error message instead of the data.

If you can open the site in your browser but get an error with requests, then look into the headers.
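
To see what requests sends when you don't pass any headers yourself, you can print its defaults and the headers attached to a finished response. A minimal sketch (the URL is a placeholder):

import requests

# The headers requests uses when you pass none yourself;
# the User-Agent is 'python-requests/<version>', which some sites reject
print(requests.utils.default_headers())

response = requests.get('http://example.com/')  # placeholder URL
print(response.request.headers)  # the headers actually sent with this request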

What do you mean by the block list?

Some sites use different ways to protect against scraping.
One of the most common is blocking the User-Agents of known bots. That is, when the site receives a request whose headers specify such a User-Agent, it will always return an error.
But this usually happens with bots that have their own User-Agent - for example, AhrefsBot.
You can compare this to an IP block: when site administrators see that an abnormally large number of requests are made from some IP, they block access.
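
On the IP side, the simplest courtesy is to pause between requests so the traffic doesn't look abnormal. A minimal sketch (the URL list is hypothetical):

import time
import requests

urls = ['http://example.com/page/1', 'http://example.com/page/2']  # hypothetical pages

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(2)  # wait a couple of seconds so the request rate stays modest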

You can simply use the User-Agent from your own browser. Browser User-Agents are very rarely blocked, since blocking one risks losing millions of users.

I know, right? It's just an urge; you're a scrapist and you can't help it.
http://www.pythonscraping.com/pages/page3.html
http://books.toscrape.com/

I started out with these. After one month I had the details of all the books in a CSV. Such emotion. Good luck! The better you get at Python, the easier scraping will be.
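
For anyone who wants to see the shape of such a script: a minimal sketch against the first page of books.toscrape.com, assuming its current markup (each book in an article with class product_pod, the title in the h3 link's title attribute, the price in p.price_color) and using requests with BeautifulSoup:

import csv
import requests
from bs4 import BeautifulSoup

# Fetch the first catalogue page of the practice site
response = requests.get('http://books.toscrape.com/')
soup = BeautifulSoup(response.text, 'html.parser')

with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'price'])
    for book in soup.select('article.product_pod'):    # assumed page structure
        title = book.select_one('h3 a')['title']       # full title lives in the link's title attribute
        price = book.select_one('p.price_color').text  # e.g. '£51.77'
        writer.writerow([title, price])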


“Poco a poco,” as we say here in Mexico. :sunglasses:

I will definitely be asking some questions on scraping when I get there…and on the other end of it, I will definitely be sharing what I find/learn!

Thanks for the resources, I have bookmarked them for later.

Happy coding @DavidMiedema, hope you are healthy and safe.