Scraping Multiple Pages Data Length Issue

I tried to scrape data from multiple pages of an online shopping website in one run. Below is my code.
PS: This is my personal project

My Code:

for page in pages:
#     print(page)
    response = requests.get("https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page={}".format(page)).text #URL which want to scrape
    #content = response.content #To get the content
    soup = BeautifulSoup(response, 'html.parser')
    #print(soup.prettify())
    
    """ Now the below simple logic code helps us to scrape the data for each containers which we want from the all pages"""
    
    desc = soup.find_all('div', class_ = '_3wU53n') # Extracting descriptions of each laptop 
    for i in range(len(desc)):
        descriptions.append(desc[i].text)
    len(descriptions)
    
commonclass = soup.find_all('li', class_ = 'tVe95H') #This class is applicable for all the features which are written below
for i in range(0,len(commonclass)):
    p = commonclass[i].text #extracting the text from tags
    if('Core' in p):
        processors.append(p)
        #print(processors)
    elif('RAM' in p):
        ram.append(p)
        #print(ram)
    elif('Operating' in p):
        os.append(p)
        #print(os)
    elif('HDD' in p or 'SSD' in p):
        storage.append(p)
       # print(storage)
    elif('Display' in p):
        inches.append(p)
        #print(inches)
    elif('Warranty' in p):
        warranty.append(p)
        #print(warranty)
  
price = soup.find_all('div',class_ = '_1vC4OE _2rQ-NK') # Extracting price of each laptop
for i in range(len(price)):
    prices.append(price[i].text)
    len(prices)

rating = soup.find_all('div',class_ = 'hGSR34') # Extracting rating of each laptop
for i in range(len(rating)):
    ratings.append(rating[i].text)
    len(ratings)

exchange = soup.find_all('div',class_ = '_3_G5Wj') # Extracting exchange offer for each laptop
for i in range(len(exchange)):
    exchange_off.append(exchange[i].text)
    len(exchange_off)
print(len(descriptions))
print(len(processors))
print(len(ram))
print(len(os))
print(len(storage))
print(len(inches))
print(len(warranty))
print(len(prices))
print(len(ratings))
print(len(exchange_off))

What I expected to happen:
I expected to see the length of all the features

What actually happened:
I see the length of all the features with a lot of variance in numbers

504
23
24
25
24
25
22
24
35
59

My concern is that the printed lengths vary widely from one feature to another.
For example, there are 504 descriptions in total from all 21 pages, while other features such as processors, RAM, and storage have lengths like 23, 24, and 25. I would expect the other features to be in the 400+ range as well, but they are not. What causes this difference compared with the descriptions? Is this expected behaviour, or am I missing something in my code?

This is the first time I am exploring ‘Web Scraping’ and my objective in this project is to scrape all these web pages and collect the information about laptops from it which we can use for further analysis.

I am really looking forward to your help, community. Let me know if you need any further information.

Thanks in advance
Best
K!

Hello @prasadkalyan05!

I think this is happening because your code is not entirely inside the first for loop. It should be:

for page in pages:
    "everything else"

Only the part that extracts the product description is inside this for loop, so it is the only part that runs for every page. The rest of the scraper only scrapes the last page in the pages list, because that code is reached only after the loop over the entire pages list has finished. At that point soup still holds the last page that was parsed.

If you have everything inside the first loop I think it will work. If it doesn’t, could you share the website you’re trying to scrape?
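
To make the difference concrete, here is a minimal sketch that needs no network access. The `fake_pages` dictionary and both function names are hypothetical stand-ins invented for illustration; in the real scraper the feature strings would come from `soup.find_all(...)` on each page.

```python
# Hypothetical stand-in for the scraped site: page number -> feature strings.
# In the real scraper these would come from soup.find_all(...) on each page.
fake_pages = {
    1: ["Intel Core i5 Processor", "8 GB DDR4 RAM"],
    2: ["Intel Core i3 Processor", "4 GB DDR4 RAM"],
    3: ["AMD Ryzen 5 Quad Core Processor", "16 GB DDR4 RAM"],
}


def scrape_outside_loop(pages):
    """Buggy layout: the feature loop runs once, after the page loop ends."""
    processors, ram = [], []
    features = []
    for page in pages:
        features = fake_pages[page]  # overwritten on every iteration
    for text in features:  # only the last page's features are left here
        if "Core" in text:
            processors.append(text)
        elif "RAM" in text:
            ram.append(text)
    return processors, ram


def scrape_inside_loop(pages):
    """Fixed layout: the feature loop is nested, so it runs for every page."""
    processors, ram = [], []
    for page in pages:
        features = fake_pages[page]
        for text in features:
            if "Core" in text:
                processors.append(text)
            elif "RAM" in text:
                ram.append(text)
    return processors, ram


pages = [1, 2, 3]
print(len(scrape_outside_loop(pages)[0]))  # 1: processors from the last page only
print(len(scrape_inside_loop(pages)[0]))   # 3: processors from every page
```

The only change between the two functions is indentation, which is why the description count in the question grew with every page while the other counts stayed at roughly one page's worth.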

Also, I’d suggest that you do not use range(len()) in the for loops, as it makes your code harder to understand and less efficient.

For example, instead of this:

price = soup.find_all('div',class_ = '_1vC4OE _2rQ-NK') # Extracting price of each laptop
for i in range(len(price)):
        prices.append(price[i].text)

Try this:

prices = soup.find_all('div',class_ = '_1vC4OE _2rQ-NK') # Extracting price of each laptop
for price in prices:
        all_prices.append(price.text) # all_prices is the list that collects results across pages

One last thing: the len(ratings) in the code below is pointless, it’s not being assigned to anything and it’s not being printed either. And this happens in almost every for in your code.

rating = soup.find_all('div',class_ = 'hGSR34') # Extracting rating of each laptop
for i in range(len(rating)):
    ratings.append(rating[i].text)
    len(ratings)
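
As a rough sketch of the cleaned-up version, the loop below iterates over the tags directly and calls len() once, after the loop, where its result is actually used. The `rating_tags` list is a hypothetical stand-in for what `soup.find_all(...)` would return; it only mimics the `.text` attribute of BeautifulSoup tags.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the tags returned by soup.find_all(...);
# like BeautifulSoup tags, each one exposes a .text attribute.
rating_tags = [SimpleNamespace(text="4.3"), SimpleNamespace(text="4.1")]

ratings = []
for tag in rating_tags:
    ratings.append(tag.text)

# len() only has a visible effect when its result is used, for example printed.
print(len(ratings))  # 2
```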

Hope this helps you.


Hey @otavios.s

I think this is happening because your code is not entirely inside the first for loop. It should be:

Oops! I made a silly mistake here: my code was outside the for loop. I have fixed it now and it is working fine :)

One last thing: the len(ratings) in the code below is pointless, it’s not being assigned to anything and it’s not being printed either. And this happens in almost every for in your code.

I realised this after your explanation. Yes, the point is valid: the len() call serves no purpose in the code, so I modified it as you suggested.

Thanks a lot for clarifying my doubts. You are the best!

Cheers

Best
K!
