Web scraping pipeline, would like your feedback

Hi all,

I have been learning Python for about 2 months on Dataquest, and today I created my first web scraping script. The script scrapes mortgage interest rates from a Dutch bank. It felt like a real achievement to scrape the data, transform it, and write it to a CSV file. I also scheduled the script so it automatically runs every day.

Could you give me some feedback on the way my code is structured? Would you use the same structure or do things differently?

from bs4 import BeautifulSoup
import requests
import pandas as pd
import datetime

# Download the rates page
response = requests.get('https://www.rabobank.nl/particulieren/hypotheek/hypotheekrente/rente-annuiteitenhypotheek-en-lineaire-hypotheek/')

# Parse the rate table out of the HTML
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find("div", attrs={"class": "tpa-table__scrollable-area"})
rows = table.findAll('tr')
data = [[td.findChildren(text=True) for td in tr.findAll("td")] for tr in rows]
data = [[u"".join(d).strip() for d in l] for l in data]

# The first scraped row holds the column names
df = pd.DataFrame(data)
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
df.rename(columns={"":"Jaren"}, inplace=True)

# Clean up: strip the 'jaar' suffix, melt to long format, and convert
# the comma-decimal percentage strings to fractions
df['Jaren'] = df['Jaren'].apply(lambda x: x.replace('jaar',''))
df = pd.melt(df, id_vars=['Jaren'], var_name="Voorwaarde", value_name="Rente %")
df["Rente %"] = df['Rente %'].apply(lambda x: float(x.strip().replace("%","").replace(",","."))/100)

# Stamp each row with the scrape time and append to the CSV
df["Timestamp"] = datetime.datetime.now().strftime("%d-%m-%Y %H:%M:%S")
df.to_csv(r'C:\Users\Desktop\rabo_rente.csv', mode='a', header=False, index=False)

Could someone provide me with some feedback on the structure?
I have heard it is better practice to define functions and create a main function that executes the whole script. Any tips?

Hi. I can give you some recommendations.

  1. Don’t use absolute paths. At some point you may want to move the folder or the script itself - use the os.path module to build paths relative to the script. In addition, this makes the script independent of the operating system it runs on.
  2. Use timeout and headers in requests. If there is a connection problem, a request without a timeout can wait indefinitely, and the whole script hangs with it. A user-agent header makes the request look like an ordinary browser visit, which some sites expect.
  3. And yes, dividing the code into functions is good practice: one function that makes the request, one that parses the HTML, one that builds the final results file and saves it, and a main function that runs when the file is started and calls the others. This gives you a fairly basic structure for web scraping that you can reuse in the future, and it also makes it easier to test individual functions if necessary (see the sketch below).

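A minimal sketch of what that split could look like for your script, keeping your parsing and transformation logic as-is (the function names are just my suggestions):

import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = ('https://www.rabobank.nl/particulieren/hypotheek/hypotheekrente/'
       'rente-annuiteitenhypotheek-en-lineaire-hypotheek/')


def fetch_page(url):
    # Download the page; the timeout stops a bad connection from hanging the script
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly if the site returns an error
    return response.text


def parse_rates(html):
    # Extract the rate table as a list of rows of cell strings
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find("div", attrs={"class": "tpa-table__scrollable-area"})
    rows = table.findAll('tr')
    data = [[td.findChildren(text=True) for td in tr.findAll("td")] for tr in rows]
    return [["".join(d).strip() for d in row] for row in data]


def build_dataframe(data):
    # The first row holds the column names; reshape to long format with a timestamp
    df = pd.DataFrame(data[1:], columns=data[0])
    df.rename(columns={"": "Jaren"}, inplace=True)
    df['Jaren'] = df['Jaren'].apply(lambda x: x.replace('jaar', ''))
    df = pd.melt(df, id_vars=['Jaren'], var_name="Voorwaarde", value_name="Rente %")
    df["Rente %"] = df['Rente %'].apply(
        lambda x: float(x.strip().replace("%", "").replace(",", ".")) / 100)
    df["Timestamp"] = datetime.datetime.now().strftime("%d-%m-%Y %H:%M:%S")
    return df


def save_csv(df, path):
    # Append today's rates; the file accumulates one snapshot per run
    df.to_csv(path, mode='a', header=False, index=False)


def main():
    html = fetch_page(URL)
    df = build_dataframe(parse_rates(html))
    save_csv(df, 'rabo_rente.csv')


if __name__ == '__main__':
    main()
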
Hi @moriturus7,

Thanks for your recommendations.

  1. I am not really sure how to interpret this. How would it change this line of my code? Could you provide an example based on it?
df.to_csv(r'C:\Users\Desktop\rabo_rente.csv', mode='a', header=False, index=False)
  2. Thanks for this one. Could you give an example of how this would look?

  3. I will change my code accordingly and get back to you so you can review it once again 🙂

  1. Example that saves the result in a data folder next to the script:
import os

# Build the output path relative to the script's own location
cd = os.path.dirname(os.path.abspath(__file__))
result_dir = os.path.join(cd, 'data', 'result')
os.makedirs(result_dir, exist_ok=True)  # create the folders if they don't exist yet
result_path = os.path.join(result_dir, 'rabo_rente.csv')
df.to_csv(result_path, mode='a', header=False, index=False)
  2. Example:
# Identify the request as a normal browser visit and stop waiting
# after 30 seconds if the site does not respond
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
url = 'https://www.rabobank.nl/particulieren/hypotheek/hypotheekrente/rente-annuiteitenhypotheek-en-lineaire-hypotheek/'
response = requests.get(url, timeout=30, headers=headers)

Hi @moriturus7,

Soon I will start building a data warehouse as a sample project.
A few questions on this, if I may:

What project structure would you advise? Would you advise one GitHub repository that includes all the scripts?
How would the directories look? Would you keep the functions in a separate file from the main function?

Hi.

For each individual project, create a separate repository. If at some point you develop a universal, flexible set of helper functions for yourself, you can put it in a separate repository that is installed as a library.
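If you do get to that point, a minimal sketch of how such a repository could be packaged (the package name and dependency list here are hypothetical):

# setup.py - 'scraping_helpers' is a made-up example name
from setuptools import setup, find_packages

setup(
    name='scraping_helpers',
    version='0.1.0',
    packages=find_packages(),
    install_requires=['requests', 'beautifulsoup4', 'pandas'],
)

Once packaged, pip can install it straight from the repository, for example with pip install git+https://github.com/<user>/scraping_helpers.git.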

At the moment you have one small script; you do not need to divide it into several files, as that would only make it harder to read.

I would recommend the following structure:

readme.md
requirements.txt
rabo_rente/
–rabo_rente.py

Splitting a script into several files is necessary when it solves several independent tasks, when the final code grows very large and you want separate files to group functions responsible for similar tasks, or when you use auxiliary scripts that may change as the project progresses (see the layout sketch below).
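If the project does grow to that point, a hypothetical layout could look like this (scraper.py and transform.py are made-up module names, following the function split suggested earlier):

# Layout:
#   rabo_rente/
#       main.py        (entry point, below)
#       scraper.py     (fetch_page, parse_rates)
#       transform.py   (build_dataframe, save_csv)

# main.py then only wires the pieces together
from scraper import URL, fetch_page, parse_rates
from transform import build_dataframe, save_csv


def main():
    df = build_dataframe(parse_rates(fetch_page(URL)))
    save_csv(df, 'rabo_rente.csv')


if __name__ == '__main__':
    main()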