Web scraping Wikipedia with BeautifulSoup

TL;DR: I wrote a function to scrape Wikipedia for movie budgets and I'd like some feedback.

While doing my Star Wars project, I did a basic scrape of movie data using the Wikipedia library. That gave me an idea for expanding the Fandango project: scrape movie budget data and analyse how the ratings shift with the budget, and with other factors I can scrape from the page.

But I need to scrape the budget data first, and the Wikipedia library can't deliver that. I imagine the Wikipedia API would be a good way to do it (I haven't touched that topic yet; it's on my list of things to do).

Still, I wanted to try scraping the data from a web page (because I've never done it). Here's how I did it; I'm curious about the feedback:

A sample of my df (I've made the titles a bit more URL-friendly):

FILM                            title_urled
Avengers: Age of Ultron (2015)  avengers_age_of_ultron_2015
Cinderella (2015)               cinderella_2015
Ant-Man (2015)                  ant_man_2015
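
The URL-friendly titles above could be produced along these lines. This is only a sketch: the helper name `urlify_title` is hypothetical, and the cleaning rule is an assumption inferred from the three sample rows, not necessarily what the notebook does.

```python
import re


def urlify_title(title):
    """Lowercase a title and collapse each run of other characters into one underscore."""
    # Assumption: rule inferred from the sample rows, not taken from the notebook.
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower())
    # Drop the underscore left behind by a leading or trailing symbol such as ")".
    return slug.strip("_")


print(urlify_title("Avengers: Age of Ultron (2015)"))  # avengers_age_of_ultron_2015
```

Note that this rule also replaces accented letters with underscores, so titles with non-ASCII characters would need extra handling.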

function:

from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup


def scrape_money(row):
    # 1. Input: one row of the DataFrame (e.g. when used with df.apply(scrape_money, axis=1)):
    search_query = row['title_urled']
    # 2. Put the title into Wikipedia search and extract the link to the first result (it's not the first link on the page!):
    url = "https://en.wikipedia.org/w/index.php?search="+search_query+"&title=Special:Search&profile=advanced&fulltext=1&ns0=1"
    html = urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
    web_links = soup.find_all("a")
    # This is the last-minute hack, in case there are no search results:
    if len(web_links) <= 43:
        return None
    # 3. The first result of our search query is actually the eleventh link on the results page:
    movie_path = web_links[10].get("href")
    # 4. Now fetch the movie page and scrape all of the infobox labels into a list:
    response = requests.get("https://en.wikipedia.org"+movie_path)
    parser = BeautifulSoup(response.content, 'html.parser')
    labels = parser.find_all("th", class_="infobox-label")
    # 5. Loop through the labels and find 'Budget'; its value is the td cell in the same row:
    for label in labels:
        if label.text == 'Budget':
            data = label.find_next_sibling("td")
            return data.text if data else None
    # 6. If we can't find the budget:
    return None
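
The function returns the budget as raw infobox text, so it will probably need converting to a number before any analysis. Here is a minimal sketch, assuming the text looks something like "$250 million" or "$95–100 million[2]". Those sample strings and the helper name `parse_budget` are illustrative assumptions, not taken from the notebook.

```python
import re

MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

# A dollar amount, an optional "-upper bound" for ranges, and an optional unit word.
BUDGET_PATTERN = re.compile(
    r"\$\s*([\d.,]+)(?:\s*[–-]\s*[\d.,]+)?\s*(thousand|million|billion)?",
    re.IGNORECASE,
)


def parse_budget(text):
    """Return the budget in dollars as a float, or None if the text can't be parsed.

    For a range such as "$95–100 million" the lower bound is returned
    (an arbitrary choice made for this sketch).
    """
    if not text:
        return None
    text = re.sub(r"\[[^\]]*\]", "", text)  # drop footnote markers such as [2]
    match = BUDGET_PATTERN.search(text)
    if not match:
        return None
    try:
        number = float(match.group(1).replace(",", ""))
    except ValueError:
        return None
    unit = (match.group(2) or "").lower()
    return number * MULTIPLIERS.get(unit, 1)


print(parse_budget("$250 million"))
print(parse_budget("$95–100 million[2]"))
```

Budgets quoted in other currencies, or without a dollar sign, come back as None here and would need their own handling.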

The full step-by-step instructions are in my Jupyter notebook:
scrape_wiki.ipynb (1.7 MB)

Notebook on GitHub

Looks great. I’m new to Web scraping and so far it has been super helpful. Awesome job.

Doesn't really work for me.
For some movies the href from movie_path = web_links[10].get("href") is link number 10 on the list; for others it is number 11 or 12.

It works on most of them, but if you visit the GitHub link, there's an updated second version that addresses that issue.
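
For anyone hitting the same problem, one way to avoid depending on the link position at all is to select the result links by CSS class instead of by index. This is a sketch only, and not the updated version from the notebook: it assumes each hit on the Wikipedia search results page sits inside an element with class `mw-search-result-heading` (worth confirming in the browser inspector), and the sample HTML and the helper name `first_result_path` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for a search results page.
SAMPLE_RESULTS_HTML = """
<a href="/wiki/Main_Page">Main page</a>
<ul class="mw-search-results">
  <li><div class="mw-search-result-heading"><a href="/wiki/Ant-Man_(film)">Ant-Man (film)</a></div></li>
  <li><div class="mw-search-result-heading"><a href="/wiki/Ant-Man">Ant-Man</a></div></li>
</ul>
"""


def first_result_path(html):
    """Return the href of the first search hit, or None if the page has no hits."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: result links are nested inside .mw-search-result-heading elements.
    link = soup.select_one(".mw-search-result-heading a")
    return link.get("href") if link else None


print(first_result_path(SAMPLE_RESULTS_HTML))  # /wiki/Ant-Man_(film)
```

Because the selector ignores navigation links, it returns None on a page with no hits, which would also replace the link-count check.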