Web scraping - data into web app

Has anyone ever used Beautiful Soup to scrape data into some sort of app that they built with Flask or Django? The reason I ask is that people keep telling me to learn R.

Hi @hunter.kiely

Welcome to our Dataquest Community.

Yes, I have used Beautiful Soup to scrape data from websites, but I haven't integrated it with any website yet using Flask or Django.

I scraped movie data from the IMDB website and mobile data from Flipkart. You can see my movie data notebook here.

Mobile Data notebook here:-
Flipkart_Mobile_Data.ipynb (14.1 KB)


What mobile data are you importing from Flipkart here?

You can get the URL in the requests.get('url') line.
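If it helps, this is roughly what that part of the notebook is doing. The URL below is just a placeholder; use the one from the notebook:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL - substitute the actual listing page used in the notebook
url = "https://www.flipkart.com/search?q=mobiles"

response = requests.get(url)
print(response.status_code)  # 200 means the request succeeded

# Parse the raw HTML so it can be searched
soup = BeautifulSoup(response.text, "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>
```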


Mobile data means the price, total reviews, total ratings, and star rating of a particular mobile.
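Roughly, the extraction loop looks like the sketch below, continuing from the `soup` object in the earlier snippet. The class names here are made up purely for illustration; you have to inspect the live page and use its actual classes:

```python
# Continuing from the `soup` object above.
# These class names are hypothetical - inspect the real page and substitute its classes.
for card in soup.find_all("div", class_="product-card"):
    name = card.find("div", class_="product-title").get_text(strip=True)
    price = card.find("div", class_="product-price").get_text(strip=True)
    stars = card.find("div", class_="product-rating").get_text(strip=True)
    reviews = card.find("span", class_="review-count").get_text(strip=True)
    print(name, price, stars, reviews)
```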

What exactly is coming off the flipkart website?
I think I'm getting status code 200.

In the next block of code, the output is bs4.BeautifulSoup. I need to digest this better; I'm confused about where you are getting each different group of data. I don't see the movie data now.

Flipkart_Mobile_Data.ipynb (19.8 KB)

I updated my notebook; go through this one. Download it and run all the cells locally on your computer. You will then see that a file, mobile_data.csv, is created containing the scraped data.

Like this
mobile_data.csv (2.2 KB)
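In case it helps to see that last step on its own, writing the rows out with pandas looks roughly like this (the rows below are placeholders, just to show the shape of the data):

```python
import pandas as pd

# Placeholder rows, only to illustrate the shape of the scraped data
rows = [
    {"name": "Phone A", "price": "9,999", "stars": "4.3", "total_ratings": "1,250", "total_reviews": "310"},
    {"name": "Phone B", "price": "14,499", "stars": "4.5", "total_ratings": "980", "total_reviews": "210"},
]

# Write everything out to the CSV file attached above
pd.DataFrame(rows).to_csv("mobile_data.csv", index=False)
```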

Let me know if you need more help.



That is my fault, I should have known that was an HTTP status code.

To answer your question, it would help to better understand what you're doing and why. What is the app? What is the data? What do you want the app to do?

Another thing that could help is to understand what R package people are directing you to, because that could help shed light on whether there is a good Python equivalent or whether learning R might indeed be the best option.

I'm pretty sure Shiny is the standard R package most people learn. I'm just talking about Beautiful Soup for scraping general data off of websites when there is no standard source of information. I don't have an example off the top of my head currently.

A better term would be an API. Say I want info on the price of flea and tick medications from several different retailers.

Even if you used Shiny to create your web app, you could still scrape the data using beautifulsoup (bs) if that was your preference. If the question is “can I use bs to scrape data that I then use to build something else” the answer is absolutely. If you’re working with an API rather than web scraping, you probably want the requests library rather than bs (although you use requests with bs to make the initial site call anyway).
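To illustrate the distinction: if a retailer exposed a price API, you wouldn't need Beautiful Soup at all, since requests alone handles it. The endpoint below is made up purely for illustration:

```python
import requests

# Hypothetical endpoint - a real retailer's API will document its own routes and parameters
response = requests.get(
    "https://api.example-retailer.com/v1/products",
    params={"category": "flea-and-tick", "limit": 20},
)
response.raise_for_status()

# API responses are typically JSON, so there is no HTML to parse
for product in response.json().get("products", []):
    print(product["name"], product["price"])
```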

My biggest question is what you want to do with that data — how do you want to display it?

My understanding is that Shiny is good for building interactive apps that present data in different ways. A Python equivalent (again, I don’t have direct experience here but from my understanding) might be Dash by Plotly.

If you’re looking to build a more fully-fledged web app, then you’re going to need some web development skills and something like django or flask is more appropriate.

The real question is, how do you keep updating live information to the back end of a web application with Beautiful Soup? Let's say I want to compare the price of flea and tick creams across a range of websites, along with their half-life. I'm thinking of something like CamelCamelCamel, but for dog and cat products, that updates automatically.

Are you saying this can just be done with the snap of a finger using Shiny? I guess you just have to use RStudio. There are many more steps with Python. The issue is, how does that integrate into a larger website using HTML5?

OK, so that’s a slightly different application to what I was thinking of. What you will need to do is set up some recurring task that scrapes those prices and stores them in a database, and then have the ‘app’ show the data from the database.

What the ‘app’ is could be Shiny, or it could be Dash, or it could be something custom built. My practical knowledge of dash and shiny is limited, but I believe both of them could surface data retrieved from a database.

I don’t believe that Shiny will automate the scraping and storing of that data; it’s mostly just used to retrieve and display the data, so there’s still that work to be done, whether you do it in R or in Python.

How can you make this recur? Like a loop?
What is the best database to use in this instance?
Do you all have something showing how to import from MySQL into a Jupyter notebook?

For the database, it really depends what you want to do. If things are simple, SQLite will suffice, but if you’re looking to be more robust then maybe Postgres.
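To give a feel for how little setup SQLite needs (it ships with Python, no server required) and how you'd pull the data back into a notebook with pandas, here's a sketch with a made-up table and row:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("prices.db")  # creates the file if it doesn't exist

# Example schema for scraped prices - adjust to whatever you actually collect
conn.execute(
    """CREATE TABLE IF NOT EXISTS prices (
           product TEXT,
           retailer TEXT,
           price REAL,
           scraped_at TEXT
       )"""
)
conn.execute(
    "INSERT INTO prices VALUES (?, ?, ?, ?)",
    ("Flea & Tick Cream", "example-retailer", 19.99, "2020-06-01T12:00:00"),
)
conn.commit()

# Reading it back into a DataFrame works the same way inside a Jupyter notebook
df = pd.read_sql("SELECT * FROM prices", conn)
print(df)
conn.close()
```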

For repeating tasks, Google is your friend. I googled around and looked at a bunch of Stack Overflow threads, but this seems like one of the better answers: https://stackoverflow.com/a/46738061/4691920
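That answer covers more robust options, but the simplest possible version is just a loop that sleeps between runs, roughly like this (scrape_prices and save_to_db are placeholders for your own functions):

```python
import time

def scrape_prices():
    # Placeholder: fetch and parse the retailer pages, return a list of rows
    return []

def save_to_db(rows):
    # Placeholder: insert the rows into your SQLite/Postgres table
    pass

while True:
    save_to_db(scrape_prices())
    time.sleep(60 * 60)  # wait an hour before scraping again
```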

Using my example, would Postgres be beneficial?

It’s almost always best to start simple and expand out from that. With that in mind, this is what I would do if I were approaching this (I’d favor the Python approach because I’m much more familiar with it, so I figure even if it’s 30% more complex I’m still in front):

  • Write a script that scrapes the data once.
  • Expand the script so that each time it grabs the data, it writes it to a SQLite database.
  • Put the functionality so far into a loop and acquire some test data into your database.
  • Using that test data, build a simple app using Dash that will display it (a rough sketch of this step follows the list).
  • Add functionality into the app that grabs the data periodically (using something like the link I posted above) and writes it to the database.
  • Iterate on the app.
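For the Dash step, a bare-bones version might look like the sketch below. It assumes the prices table from the SQLite example earlier and that dash, plotly, and pandas are installed:

```python
import sqlite3
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Load whatever the scraper has stored so far
conn = sqlite3.connect("prices.db")
df = pd.read_sql("SELECT * FROM prices", conn)
conn.close()

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Flea and tick price tracker"),
    dcc.Graph(figure=px.line(df, x="scraped_at", y="price", color="retailer")),
])

if __name__ == "__main__":
    app.run(debug=True)  # on older Dash versions this is app.run_server(debug=True)
```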

This would probably take me quite some time, but it lets me work iteratively towards my goal with little milestones along the way that would help me feel like I’m making progress.

To answer your question directly, I’ve chosen to use SQLite because in my imagination this thing is not going to get a ton of traffic for a while, and I can choose to upgrade the database later if and when I need it.