Web scraping without Selenium

You must all have seen the great project created by @otavios.s. After discussing the topic, I thought you might be interested in how to work with sites that have interactive elements without using selenium.

I’ll tell you a little bit about me.

More than 4 years ago I decided to change my field of activity. I started studying Python and everything related to data processing, from storage and collection to deep learning and web servers. It's much easier for me to learn something by working on real projects and tasks, so I started with web scraping: I took small projects, solved them, and moved on.
At the beginning of my journey, I used both Selenium and requests. Sometimes I gave in to the easy way that Selenium offered. But the more complicated the projects became, the more disadvantages I saw in using Selenium.

I will name a few of them.

  1. Stability - Sometimes you just cannot explain why Selenium stopped working. Did something happen to the web driver? Was the port busy? Is the browser profile you run in Selenium already open with your account? Is there a parade of planets today? It will just die and throw a pile of errors at you.
  2. Scalability - You can be very disappointed by the results when you launch Selenium in several threads.
  3. Speed - This is the most important one. Selenium does everything your browser does; even in headless mode it executes all the JS code. Nowadays, when almost every site loads activity trackers and JS whose only job is to load more JS, visiting a single URL triggers dozens of network calls.

That doesn’t mean Selenium is terrible and you should never use it. You must understand its shortcomings and assess the situation. Sometimes it can be very helpful. There are situations when I still need Selenium as one of the stages of web scraping.

Well, let’s cut to the chase.

Task

The site https://www.centurybatteries.com.au has a tool that, for a given series of vehicles, returns data on the recommended batteries. We need to build a table from this data.

Step 1. Study the structure and behavior of the site.

Before writing code, you need to learn how the site works with data: what happens when you press a button, what requests it sends and what parameters they need, and which pages you need to get the data from.

Your browser can help you with this. In it you need to open the developer tools and the Network tab.

So, here is what I did and what you can see.

  1. I opened the Network tab.
  2. If you are already on the target page and the Network tab is empty, that's fine. If not, go to the page with the tool, and when it is fully loaded, clear all of the entries the Network tab shows.
  3. I selected one of the items in the Make field.
  4. I see the site has sent a request for MakeModelSearch.

Here you need to remember a simple fact. When you load a page, you download something from the server, and everything you see is here, on your computer, in your browser. If the site needs new data, it has to ask the server for it. So everything you see on a site is either already inside the page, with JS code only changing how it is displayed, or the JS code in your browser requests updates from the server when you do something.

In this case the client (your browser) sends a request to the server for data. Let's see what that request looks like.

Here you can see the URL the site sent the request to and the type of request.
So our target URL is https://www.centurybatteries.com.au/CMSWebParts/Digicon/Elements/BatteryFinder/BatteryFinderService.asmx/MakeModelSearch.

Request type - POST

Next we see which headers are used in the request when accessing the server. I have highlighted the most important ones in this case.

Content-type - the type of content being requested. In some cases there are no defaults, and if the page returns JSON while you ask for HTML, you will get an error.

user-agent - the way the client introduces itself to the server. By default, the requests library uses a user-agent like python-requests/2.x, and if you ignore the usual header setup you may run into surprising errors, simply because the server is configured not to handle anything it does not recognize.
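
For example, passing your own headers with requests looks like this. The values below are the ones used later in this article; treat them as placeholders, since the exact set of headers a site needs may differ:

import requests

# a browser-like user-agent instead of the default "python-requests/x.y.z"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
           'content-type': 'application/json; charset=UTF-8'}

response = requests.get('https://www.centurybatteries.com.au/home', headers=headers, timeout=20)
print(response.status_code)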

In most cases, you can just copy all of the headers your browser sends, and if you're not happy with the results, turn them off one at a time and watch what changes.
Why not copy the cookie right away as well? In some cases cookies are used to authorize a user, and by hard-coding a cookie into your headers you may get the false impression that everything is already working. A few hours later your code will stop working. That's why cookies are a separate topic.

The last thing we see in the request parameters is the Request Payload: the data with which our client tells the server what it wants to get. You can see that it is linked to the fields in the table and the value in Make matches what I selected. This is the key to our solution.
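
For reference, the payload of this request looks roughly like this (it is the same string that is used later in the code to build the request; HONDA is simply the make I will use as an example):

{"culture":"en-AU","type":"Model","make":"HONDA","model":"","year":"","series":"","cc":"","horsePower":"","vehicleType":"All"}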

If you look at the Response tab, you can see exactly what the server responds to the request with. And it will be a very strange JSON. Why is it strange? Because it is a JSON that was packed into a string, and then another JSON was created by assigning that string to a single key.
But you can see that the data in it is already quite interesting.
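
Here is a minimal, self-contained sketch of that structure and how to unpack it (the key names are the ones that appear later in the code; the inner data is shortened to an empty list):

import json

# the body is JSON whose "d" key holds another JSON document packed into a string
raw = '{"d": "{\\"IndividualVehicles\\": []}"}'
outer = json.loads(raw)           # {'d': '{"IndividualVehicles": []}'}
inner = json.loads(outer['d'])    # {'IndividualVehicles': []}
print(inner)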

If you explore this JSON, you'll be pleasantly surprised: you have already received almost all the data except the battery model. So you don't need the other fields of the form anymore. You don't need to go through and combine manufacturers, series and years; you only need to make a request for every Make on the list to get the data on all the vehicles that belong to it. Already at this stage it will be many times faster than Selenium.

Where do we get the battery model data from?
Here, after repeating the actions of a normal user, we see that the final result of this table is a button that leads to a new page.

Let's move on to it.

And yes, you already know what we’re doing. Let’s study it!
It's much easier, because it is a GET request and there are no interactive elements; we just need to look at the browser's address bar.

Where https://www.centurybatteries.com.au/resources/battery-finder/fitment-options/ is the base URL.
71883 is a parameter that leads us to a page belonging to a specific vehicle.
Product Name is the data we need to get from here.

What are we interested in at this stage? Where the number 71883 came from. If you study the JSON we got in the previous step, you will see that it is the VehicleId.

What are we missing before we can write the code? We don't know where the list of Makes comes from. All we need to do is look at the source code of https://www.centurybatteries.com.au/home.

There we will see a battery-finder tag with a v-bind:makes attribute that contains the list of makes.

Thus, having investigated the site, we get the following algorithm for obtaining the data.

  1. Go to https://www.centurybatteries.com.au/home and get the list of makes, which is stored in the v-bind:makes attribute of the battery-finder tag.
  2. Using each item of that list, send a POST request to https://www.centurybatteries.com.au/CMSWebParts/Digicon/Elements/BatteryFinder/BatteryFinderService.asmx/MakeModelSearch, process the received JSON and get the data about the vehicles, including the VehicleId.
  3. Using the VehicleId, send GET requests to https://www.centurybatteries.com.au/resources/battery-finder/fitment-options/ and obtain the battery data.

Step 2. Write the basic code.

Our project is based on interacting with the site using requests and extracting the target data.

At this stage I recommend using Jupyter, as all the data is kept in RAM and you can experiment with the code until you get the final result.

At this point you should find out the following:

  1. Which settings you need to change to get results, and whether you need to change the headers.
  2. How to write code that extracts the data from the page and converts it into the desired format. In our case that will be dictionaries.

To start with, we can write a simple function for making requests. You can write 2 different functions for GET and POST requests, or one function that accepts the request type as a parameter.
Since I want to keep the project as simple and clear as possible, it will be 2 functions.

Function for GET Request

def get_req(url):
    for _ in range(5):
        try:
            response = requests.get(url, headers=headers, timeout=20)
            if response.status_code == 200:
                return response
        except Exception as e:
            print(e)
            continue
        print('Error status code', response.status_code, response.text)

Function for POST request

def post_req(url, data):
    for _ in range(5):
        try:
            response = requests.post(url, data=data.encode('utf-8'), headers=headers, timeout=20)
            if response.status_code == 200:
                return response
        except Exception as e:
            print(e)
            continue  
        print('Error status code', response.status_code, response.text)

As you can see, they are almost the same and could be replaced by a single function with a condition and extra parameters.

The main difference between GET and POST requests:
In the original REST API ideology, GET requests are used to get data, and they accept parameters inside the URL string. You will often see something like {url}?page=1&elem=20, where page=1 and elem=20 are the parameters being passed.
You can either build the string with the required parameters yourself, or pass params to the get method of requests.

As in the example below.

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)

POST requests in the REST ideology are used for sending data, and the parameters are passed inside the request body. But over time, while building different services, developers did whatever was convenient for them, so these boundaries became blurred. There are other types of requests as well, but you are unlikely to meet them often.

To transfer the data, you can either build the data string yourself and pass it to requests, or pass a dictionary and let the requests library build it for you. But don't forget that a Python dict cannot contain repeated keys, while a request body can.
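
A short sketch of both options against httpbin, the same test service as in the GET example above:

import json
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# let requests encode the dict as a form body for you
r1 = requests.post('https://httpbin.org/post', data=payload)

# or build the body yourself (here as a JSON string) and send it as-is
r2 = requests.post('https://httpbin.org/post', data=json.dumps(payload),
                   headers={'content-type': 'application/json'})
print(r1.status_code, r2.status_code)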

These functions provide the minimum required error handling. We make 5 attempts in case the server does not respond immediately. If we receive a response with a code other than 200 (the standard code of a successful response), we make new attempts. The timeout of 20 seconds is needed so that the code does not hang if the server has no limit on response processing time (I have encountered cases where a server could process a request for more than 6 hours).

There are many ways to improve these functions, but this is a good way to start without overloading the project with unneeded code.

Then we need 3 functions that get the data for the 3 links in our algorithm.

In every function I'll use the headers below as a global variable.

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
           'content-type': 'application/json; charset=UTF-8'}

The function for processing the page https://www.centurybatteries.com.au/home:

def get_makes():
    start_url = 'https://www.centurybatteries.com.au/home'
    response = get_req(start_url)
    page = BeautifulSoup(response.content, 'lxml')
    makes = page.find("battery-finder")['v-bind:makes']
    makes = ast.literal_eval(makes)
    return makes

We use a GET request. From the received response we take the binary representation of its body and pass it to BeautifulSoup.
I know there are sometimes questions about using .content or .text. In most cases it doesn't matter: .content is the byte representation of the response exactly as it was received, while .text is the representation already decoded by the requests library, which relies on the response's parameters to determine which encoding to convert it to. All the HTML parsers know how to handle either form.

But in some cases you need to handle the encoding and convert the data into text yourself. I prefer to use .content because it is the data in exactly the form the site sent it.
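
A small sketch of the difference, using httpbin again:

import requests

response = requests.get('https://httpbin.org/html')
raw_bytes = response.content          # bytes exactly as the server sent them
guessed_text = response.text          # str, decoded with the encoding requests detected
manual_text = response.content.decode(response.encoding or 'utf-8')  # or decode it yourself
print(type(raw_bytes), type(guessed_text))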

In addition, the line ast.literal_eval(makes) may catch your eye.

This is a function from the standard library that converts strings which are literally lists into actual lists. For example, the string '["A.E.B.I.","ABARTH","ABBEY","AC"]' will be converted to the list ["A.E.B.I.","ABARTH","ABBEY","AC"].
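
A quick snippet you can run to see it in action:

import ast

makes_string = '["A.E.B.I.", "ABARTH", "ABBEY", "AC"]'
makes = ast.literal_eval(makes_string)   # safely evaluates the literal into a real Python list
print(makes[0], len(makes))              # A.E.B.I. 4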

Here I want to draw your attention to something. Feel free to google; do not rush to write your own code for every small task. Until I faced this task, I did not know about this module and function, but I thought: has nobody done this before me? They had, and I simply used a solution that was so good it made it into the standard library. Would I have written it better? I don't think so.

Just remember: you're not a bad specialist if you don't know how to do something. But I wouldn't call you a good one if you don't even try to find a solution.

At this point we have a function that gives us all the Makes we should use with the next URL. To get the data we need, we just need to take one item from the list, submit a POST request, and process the results.

To investigate, we just take one of the list items. For example, HONDA

Our second function is designed to work with the POST request. It receives one of the Makes, generates the data for the POST request, passes it to the corresponding request function, and processes the results, leaving only the data we are interested in.

def get_vehicles(make):
    data = '''{"culture":"en-AU","type":"Model","make":"''' + make + '''","model":"","year":"","series":"","cc":"","horsePower":"","vehicleType":"All"}'''
    url = 'https://www.centurybatteries.com.au/CMSWebParts/Digicon/Elements/BatteryFinder/BatteryFinderService.asmx/MakeModelSearch'
    response = post_req(url, data)
    data = response.json()
    data = json.loads(data['d'])
    result_list = []
    for model in data.get('IndividualVehicles', []):
        result = {}
        result['vehicle_id'] = model['VehicleId']
        result['Make'] = model['Make']
        result['Model'] = model['Model']
        result['Year_From'] = model['YearFrom']
        result['Year_To'] = model['YearTo']
        result['Series'] = model['Series']
        result_list.append(result)
    return result_list

The data string into which we insert the Make is the same data that was sent in the POST request you saw in the Request Payload earlier. I just took the string as it was and replaced the Make value with the one I need.

You may also have noticed these 2 lines

    data = response.json()
    data = json.loads(data['d'])

The requests library has a method that converts the body of a JSON response for you, since this is very common when working with APIs.
But, as you may remember, in our case there is another JSON encoded as a string inside the answer. That is why we need to convert it into a more convenient structure using the standard json library.

We could just as easily return data.get('IndividualVehicles', []) as the result of this function, but here I want to show you an example of JSON processing in which I keep only the data I need.
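
If you want to try this function on its own (assuming the functions and headers defined above are already in your session), a quick check could look like this; the exact number of vehicles and their values depend on what the server returns:

vehicles = get_vehicles('HONDA')   # HONDA is the example make from earlier
print(len(vehicles))
print(vehicles[0])                 # a dict with vehicle_id, Make, Model, Year_From, Year_To, Series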

We have written 2 data processing functions. If you remember, in the third function, where we get the battery model, we use a GET request, and the VehicleId parameter clearly defines which vehicle we want the battery for.

Let's think about what we will pass to this function. Only the VehicleId? In that case we will get 2 sets of data which we then need to merge. Can we do it differently? We can pass the entire dictionary, right? Then let's do exactly that. To test this function you will need to pass a dictionary with the mandatory vehicle_id key (in case you noticed, I renamed this field when building the dictionaries in the second function so that it reads as a clear column name in the future table), as it is needed to build the request.

def get_vehicle_battery(vehicle):
    elem_id = vehicle['vehicle_id']
    url = f'https://www.centurybatteries.com.au/resources/battery-finder/fitment-options/{elem_id}'
    response = get_req(url)
    page = BeautifulSoup(response.content, 'lxml')
    elems = page.find_all(class_='fitment-options__option')
    for elem in elems:
        name = elem.find(class_='fitment-options__rating__text').text.strip()
        code = elem.find(class_='option__details__row').find(class_='option__details__data').text.strip()
        vehicle[name] = code

You may notice that there is no return in the function. Since we work with dictionaries, which are mutable objects, all the new data ends up in the dictionaries that are already in our list.
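
Here is a tiny illustration of the idea, independent of the scraping code (the battery code below is made up for the example):

def add_battery(vehicle):
    vehicle['Battery'] = 'XYZ123'   # hypothetical value, just to show the mutation

car = {'vehicle_id': 71883}         # 71883 is the VehicleId we saw earlier
add_battery(car)                    # no return value needed
print(car)                          # {'vehicle_id': 71883, 'Battery': 'XYZ123'}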

So in this step we studied the data behind each link and created the final function that retrieves the data we are interested in.

If you examine the data you get from all the links and try to wire them together with regular for loops, you will find the following: at the very beginning we only need to request 1 link to get the list of makes; in step 2 this becomes 526 links created from that list; and in the third step we need to request over 20000 links to get the battery data.

20,000 links… And if we wrote something badly, in some cases the data won't come back correctly and we'll have to check it again. I think you understand that this is going to take time, a lot of time. Which means we have to speed it up. Hence our next move: we are going to parallelize the processing.

Step 3. Parallel processing.

Python has 3 ways to ensure parallel processing of data.

  1. Threads
  2. Processes
  3. Coroutines (asynchronous code)

Since networking is an I/O operation, we can use either threads or asynchronous coroutines. But since asynchronous code is very specific and not yet mature enough in the Python ecosystem, I'll focus on threads.

And since the Python developers love us and want us to be happy (no, they don't send us beer… unfortunately), they created concurrent.futures. There is very good material on the subject in the Data Engineering path at Dataquest, at least there was when I took the course; I can't check what is there now. So I will go straight to the function that we will use.

def multi_threads(data, parse_function=None):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        result_data = []
        counter = 0        
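        # note: data[:100] submits only the first 100 items; remove the slice to process the full list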
        future_to_item = {executor.submit(parse_function, item): item for item in data[:100]}
        for future in concurrent.futures.as_completed(future_to_item):
            try:
                item_data = future.result()
                if item_data:
                    result_data.extend(item_data)
            except Exception as e:
                print(e, future_to_item[future])
            counter += 1
            if counter % 10 == 0:
                print(counter)
        return result_data

There’s only one final step left.

Step 4. Gather them all.
I'll just go ahead and show you the final code.

import concurrent.futures
import json
import ast

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
           'content-type': 'application/json; charset=UTF-8'}

workers = 5

def get_req(url):
    for _ in range(5):
        try:
            response = requests.get(url, headers=headers, timeout=20)
            if response.status_code == 200:
                return response
        except Exception as e:
            print(e)
            continue
        print('Error status code', response.status_code, response.text)
        
def post_req(url, data):
    for _ in range(5):
        try:
            response = requests.post(url, data=data.encode('utf-8'), headers=headers, timeout=20)
            if response.status_code == 200:
                return response
        except Exception as e:
            print(e)
            continue  
        print('Error status code', response.status_code, response.text)

def get_makes():
    start_url = 'https://www.centurybatteries.com.au/home'
    response = get_req(start_url)
    page = BeautifulSoup(response.content, 'lxml')
    makes = page.find("battery-finder")['v-bind:makes']
    makes = ast.literal_eval(makes)
    return makes

def get_vehicles(make):
    data = '''{"culture":"en-AU","type":"Model","make":"''' + make + '''","model":"","year":"","series":"","cc":"","horsePower":"","vehicleType":"All"}'''
    url = 'https://www.centurybatteries.com.au/CMSWebParts/Digicon/Elements/BatteryFinder/BatteryFinderService.asmx/MakeModelSearch'
    response = post_req(url, data)
    data = response.json()
    data = json.loads(data['d'])
    result_list = []
    for model in data.get('IndividualVehicles', []):
        result = {}
        result['vehicle_id'] = model['VehicleId']
        result['Make'] = model['Make']
        result['Model'] = model['Model']
        result['Year_From'] = model['YearFrom']
        result['Year_To'] = model['YearTo']
        result['Series'] = model['Series']
        result_list.append(result)
    return result_list

def get_vehicle_battery(vehicle):
    elem_id = vehicle['vehicle_id']
    url = f'https://www.centurybatteries.com.au/resources/battery-finder/fitment-options/{elem_id}'
    response = get_req(url)
    page = BeautifulSoup(response.content, 'lxml')
    elems = page.find_all(class_='fitment-options__option')
    for elem in elems:
        name = elem.find(class_='fitment-options__rating__text').text.strip()
        code = elem.find(class_='option__details__row').find(class_='option__details__data').text.strip()
        vehicle[name] = code
        
def multi_threads(data, parse_function=None):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        result_data = []
        counter = 0        
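        # note: data[:100] submits only the first 100 items; remove the slice to process the full list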
        future_to_item = {executor.submit(parse_function, item): item for item in data[:100]}
        for future in concurrent.futures.as_completed(future_to_item):
            try:
                item_data = future.result()
                if item_data:
                    result_data.extend(item_data)
            except Exception as e:
                print(e, future_to_item[future])
            counter += 1
            if counter % 10 == 0:
                print(counter)
        return result_data
    
def main():
    makes = get_makes()
    vehicles_data = multi_threads(makes, parse_function=get_vehicles)
    multi_threads(vehicles_data, parse_function=get_vehicle_battery)
    pd.DataFrame(vehicles_data).to_csv('result_data.csv')
    
if __name__ == "__main__":
    main()

Summary

As you can see, there is nothing scary about working with the network "from the inside" instead of relying on Selenium to do it all for you. Operating on the data directly, the way your browser does, often means you win more than you lose.
It took us a while to study the site before we started writing anything. But just imagine 20,000 links being processed with Selenium. It would take a long time. You can test it yourself: 1 thread with Selenium and 1 thread with requests, then try to speed them both up.
You can also compare the resource consumption in both variants.
And yes, with requests we just went through all the Makes. With Selenium we would need to repeat the whole user path, creating many combinations of parameters. And that means even more site visits and even more time.

You might have noticed that in some places I went into a lot of detail, and in other places I gave rather brief descriptions. I chose this approach to push you to try it all on your own and understand how the code works. Nothing teaches you more about Python than writing something yourself.

That's the end of this article. If you tried to do it all yourself, studied the responses from the different links, and tried to understand why I chose this or that solution… well, I think you're amazing. :slight_smile:


This is a great piece of work @moriturus7

Great job @moriturus7!

I will try to recreate what you’ve done.

Awesome work @moriturus7

I’ll try to explore this in more detail and will come back to you with my questions. Thanks again for sharing this with community!

Best
K!

I’ve fully completed this article. I hope you find something useful for yourself. And I’m happy to answer any questions you may have.


Hey, great post!
I'd just recommend anyone to take a deep look into Scrapy.
Depending on the page, BeautifulSoup is very limited, and Selenium may be a solution but not the best one. Take a look at this awesome Python library, I think you will love it.

Hey @raduspaimoc

Excuse me, but you're confusing the technologies.
BeautifulSoup is an HTML parser that parses an HTML page and lets you navigate through it and access different elements. It would make sense to compare BeautifulSoup with the lxml library.

Scrapy is a framework for scraping. It combines making requests, parallelization via the asynchronous Twisted library, and its own HTML selectors.

Simply put, it is wrong to compare BeautifulSoup and Scrapy, since BeautifulSoup performs only one task and does it well.

The work is very interesting. It is great for me to learn something new. I will try to follow your process and see how it works. Thanks~

@moriturus7

Great blog. A few questions.

I don’t understand why you put lxml in page = BeautifulSoup(response.content, 'lxml')

def get_makes():
    start_url = 'https://www.centurybatteries.com.au/home'
    response = get_req(start_url)
    page = BeautifulSoup(response.content, 'lxml')
    makes = page.find("battery-finder")['v-bind:makes']
    makes = ast.literal_eval(makes)
    return makes

Also what is the use of ast.literal_eval(makes) ?

Thank you

Hi @LuukvanVliet

By default, BeautifulSoup uses the standard Python HTML parser, "html.parser", and it doesn't always do its job well.

How fast and how correctly pages are processed depends on which library is used as the HTML parser. Sometimes "html.parser" can make four HTML tables out of one, break the sequence of tags, or merge tags, and then you will have to hunt for that error.

But the good side of BeautifulSoup is that it allows you to use third-party parsers, for example lxml. In fact, I don't usually use BeautifulSoup; I included it in this article because it is used in the Dataquest courses and is easier for most people to understand. I usually use the lxml library directly.
Here you can read about the supported parsers as well as their pros and cons: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
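
A quick way to see a parser difference for yourself (lxml has to be installed separately, e.g. pip install lxml); the snippet below is just an illustration with deliberately sloppy HTML:

from bs4 import BeautifulSoup

broken_html = "<p>first<p>second"   # unclosed tags on purpose
print(BeautifulSoup(broken_html, "html.parser").prettify())
print(BeautifulSoup(broken_html, "lxml").prettify())   # lxml wraps the result in <html><body>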

As for ast.literal_eval(makes), everything is much simpler. If you inspect the value inside this function, you will see that we get makes in the form '["A", "B", "C"]'. It is obvious that we have a list that arrived as a string. To avoid doing something like makes.replace('[', '').replace(']', '').replace('"', '').split(',') we can use ast.literal_eval() to get a proper list straight away.