Stuck with simple web scraping

I was trying to understand web scraping and got stuck on a little exercise. I have tried multiple things, but I’ve spent a significant amount of time with no idea what is wrong. So I thought I would ask.

I’m trying to scrape some info from this page. I’m specifically after the latitude and longitude info associated with the Direction button on a panel on that page.

I can use a regex to extract the latitude and longitude; I’m stuck on actually getting the URL with the coordinate info.
Using developer tools, I located the tag associated with the info.

I have tried the following:

Attempt 1:

from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

#Won't work. Access denied
response = requests.get("https://www.zomato.com/ahmedabad/shivala-village-sola/info")
content=response.content
content

Outcome: Access denied

Attempt 2:

from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# Start chromedriver
driver = webdriver.Chrome(executable_path="C:\\Users\\jesmax\\Documents\\chromedriver")

# Get the page
driver.get("https://www.zomato.com/ahmedabad/shivala-village-sola/info")
the_page = driver.page_source

# Save the page locally
with open("scraped_webpages/hotel.html", "w+", encoding="utf-8") as f:
    _ = f.write(the_page)

# Read the page back
with open("scraped_webpages/hotel.html", encoding="utf-8") as f:
    page = f.read()

# Parse the saved page
soup = BeautifulSoup(page, "html.parser")

# Tried the following:
url1 = soup.find_all('href', class_="sc-sVRsr jAZoWn")
# url1 = soup.find_all('a', class_="sc-sVRsr jAZoWn")
url1

Outcome 2.1: empty list

Attempt 2.2:

url1 = soup.find_all('a', class_="sc-sVRsr jAZoWn")
url1

Outcome 2.2: empty list again.

I’m not sure what I am doing wrong. Any help would be appreciated.

Hi @jesmaxavier:

Based on my limited knowledge of scraping and the output you got, I think you need to supply a cookie or authentication token from a previous login before doing the scraping (so that the scraper can somewhat “impersonate” you).
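
If it does turn out to be a login issue, one way to pass a cookie along with requests looks something like this (a minimal sketch; the cookie name and value here are hypothetical and would need to be copied from your browser’s developer tools after logging in):

import requests

session = requests.Session()
# "sessionid" is a hypothetical cookie name; copy the real name/value
# from your browser's developer tools after logging in
session.cookies.set("sessionid", "value-from-your-browser")
response = session.get("https://www.zomato.com/ahmedabad/shivala-village-sola/info")
print(response.status_code)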

You may find this useful:

1 Like

In the Selenium attempt, you’re searching for a class that doesn’t exist for me. I’m guessing that class name is generated on the fly, so you probably can’t count on it.

You’ll need to find another of the multiple ways available to get to this information using Selenium.
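
For example, one option is to locate the link by something more stable than a generated class name, such as the target of its href (a sketch, assuming the Direction button points at a google.com/maps URL):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.zomato.com/ahmedabad/shivala-village-sola/info")

# Find an anchor whose href contains "google.com/maps" instead of
# relying on the auto-generated class name
link = driver.find_element(By.XPATH, "//a[contains(@href, 'google.com/maps')]")
print(link.get_attribute("href"))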


As for the requests attempts, many websites employ some measures to mitigate web scraping. If you look in zomato.com/robots.txt, you’ll find the following:

User-agent: *
Disallow: /

This tells crawlers they are not allowed anywhere on the site. In practice, the site also rejects requests that identify themselves with requests’ default user agent, so specify a browser-like one:

>>> import requests
>>> url = "https://www.zomato.com"
>>> user_agent = "Mozilla/5.0"
>>> payload = {"user-agent": user_agent}
>>> requests.get(url, headers=payload).status_code
200

The 200 status code tells you the request was successful.

You can google something like “What is my user agent?” to figure out yours (it depends on the browser, operating system, etc.).
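
If you’d rather check a site’s robots.txt programmatically than read it by eye, the standard library’s urllib.robotparser can do it; a minimal sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.zomato.com/robots.txt")
rp.read()  # note: this fetch itself may be denied, in which case everything reports as disallowed

# Check whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "https://www.zomato.com/ahmedabad/shivala-village-sola/info"))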

1 Like

Hello,

I found a solution to your problem; it involves first finding the element with Selenium. Note that clickable things containing links show up in the HTML as href='linkaddress/blabla', and a link often has link text, which is what is actually displayed to the user in the browser. So the best way to find this element is the Selenium driver’s find_element_by_partial_link_text method.
If you are sure you have a complete match, you can use driver.find_element_by_link_text; either one would work in this case.

Apparently, this method is deprecated in favor of find_element(by=By.PARTIAL_LINK_TEXT, value=link_text), but I haven’t learned that syntax yet.
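
For reference, the equivalent call with the newer syntax would look something like this (a minimal sketch):

from selenium.webdriver.common.by import By

direction_button = driver.find_element(By.PARTIAL_LINK_TEXT, 'Direction')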

So we have:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.zomato.com/ahmedabad/shivala-village-sola/info")

direction_button = driver.find_element_by_partial_link_text('Direction')
# .size is a property, not a method; it returns a dict
direction_button_size = direction_button.size

print(direction_button_size)
# prints a dict like {'height': 36, 'width': 112}

Also note that the size of this button will differ from user to user, depending on the screen resolution your browser detects, so it will not always be a fixed value.
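
Since what the original question is really after is the link itself rather than the button’s size, reading the href attribute off the same element should get you there (a sketch, assuming the element is the Google Maps link):

# Continuing from the code above: read the URL the button points to
maps_url = direction_button.get_attribute('href')
print(maps_url)  # should contain the google.com/maps destination coordinates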

This is my first comment on the dataquest community forums, @dataquest admins holla at me, give me a discount because I’m broke =D

2 Likes

@Bruno
Thank you very much for your help! I got it to work and learnt a lot more thanks to your comment about the robots.txt.

You are right on this! The class name in the downloaded page was completely different from the one in the live page, but I kept using the class name from the live page.

In addition to the code that you have put up, the following code helps to get the location. Hopefully this is helpful to others working on a similar problem.

page = requests.get(url, headers=payload).content
soup = BeautifulSoup(page, "html.parser")

for a in soup.find_all('a', href=True):
    if "google.com/maps" in a['href']:
        # Regex to pull the location out of the Google Maps URL
        location1 = re.search("destination=(.*)", a['href']).group(1)
        break
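
If the destination parameter holds comma-separated coordinates (an assumption about the Google Maps URL format), you can then split them out:

# Assumes location1 looks like "lat,lng"
latitude, longitude = (float(x) for x in location1.split(","))
print(latitude, longitude)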

To others who might be working on web scraping: there is a lot more to it than meets the eye. There is an etiquette that needs to be followed when you are scraping. Check out this Stack Overflow post; it contains two links, and it’s worth reading before web scraping.

@masterryan.prof thanks for the info. My link wasn’t behind a login, but I can see how the access denied error might have suggested that.

@jaygbc12 thanks for your help, and welcome to the community. I have put the Selenium solution aside for now, just because it fetches the entire page for every request and I felt that was a bit of overhead. That said, it’s something I could use for websites that are stricter about web scraping.

2 Likes

Great contribution!

1 Like