Web scraping: from a website to CSV through Jupyter 🚀 and databases (personal project)

Buchla_200

Greetings to everyone.

I want to introduce a personal project that I have been fully involved in over the last few months.

The idea.

Most of us, as analysts, start from a .csv file and work on it from there.

The question that naturally appears when you feel like exploring other topics is:

  • Q: How can I get a proper CSV about a topic I'm interested in?

  • My answer: do it yourself.


Since I joined Dataquest, I have understood that the best thing you can do is learn as much as you can about the different aspects of a project.

How did I collect the data I was interested in from a second-hand synthesizer classifieds site?

The beginning

At first, everything was a vague idea about what the steps to follow had to be: too many things and no apparent connection. But little by little, over the days, I made my way, learning through hours and hours of work and errors, and more errors solved, in front of my computer.

Here are some of the technical skills I've picked up along the way:

  • html
  • Requests
  • Anaconda (loading environments, saving… )
  • Python
  • Linux (terminal)
  • Jupyter
  • VScodium (Free/Libre Open Source Software Binaries of VS Code)
  • Debugging
  • Database design (basics)
  • MySQL and MySQL Workbench (and problems related to snap packages on Ubuntu)
  • PostgreSQL
  • SQLAlchemy
  • Git and Github (terminal)
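The heart of the "from web to csv" part is exactly what the html, Requests, and Python items above suggest: fetch listing pages, pull out the ad fields, and write them to a CSV. My actual notebooks aren't reproduced here, so this is only a minimal stdlib sketch of the idea; the HTML snippet, its class names, and the two example ads are invented, and a real scrape would first fetch each page with `requests.get`.

```python
# Minimal web-to-csv sketch: parse ad listings out of HTML and write a CSV.
# The markup and class names below are invented for illustration only.
import csv
import io
from html.parser import HTMLParser

SAMPLE_HTML = """
<ul>
  <li class="ad"><span class="title">Buchla 200e</span><span class="price">9500</span></li>
  <li class="ad"><span class="title">Moog Grandmother</span><span class="price">700</span></li>
</ul>
"""

class AdParser(HTMLParser):
    """Collect [title, price] rows from <span class="title"> / <span class="price">."""
    def __init__(self):
        super().__init__()
        self.ads = []          # finished [title, price] rows
        self._field = None     # field the parser is currently inside

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("title", "price"):
            self._field = cls
            if cls == "title":
                self.ads.append(["", ""])   # a title starts a new row

    def handle_data(self, data):
        if self._field == "title":
            self.ads[-1][0] = data.strip()
        elif self._field == "price":
            self.ads[-1][1] = data.strip()
        self._field = None

parser = AdParser()
parser.feed(SAMPLE_HTML)

# Write the scraped rows as CSV (an in-memory buffer here; a file path works the same).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "price"])
writer.writerows(parser.ads)
csv_text = buffer.getvalue()
```

In the real project a library like BeautifulSoup does the parsing far more comfortably than a hand-rolled `HTMLParser`; the stdlib version is used here only so the sketch is self-contained.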

The project can, without a doubt, be improved. The idea is to polish those details (descriptions of functions, English translation, and more) in the future, but now is the time to show it and work on something else.

Hispasonic (from web to csv I)

Apart from the data extraction stage, another thing I felt I needed to learn was how to move the contents of a static document, such as a CSV, into a database.

At first, SQLite was enough for me; then I began to play with MySQL, and finally, thanks to a book that covers data analysis through databases, I ventured into PostgreSQL.

These are the Jupyter notebooks that handle that part:

Now that the hard part is over, the experience has been super positive, so I encourage everyone who wants to improve themselves to choose a topic and finish it. I hope my project encourages you to do so.

  • Finally, I would like to thank the Dataquest team for the work they do and the quality of it; without them I would not have been able to get here.

A&E. Happy coding.

6 Likes

I am short of words. This is a very comprehensive data-gathering project that initially seems like a long read. However, going through your work was a rewarding experience for me. I learned so much. I encourage everyone to go through this project and learn how thorough data gathering can be.

The extra mile of making the dataset available in two database flavours is also exciting. The entire workflow is easy to understand and well explained. Thank you for taking the time to translate some of the Spanish words on the website to English. It made me follow along better.

Projects like this remind me of how incredible data skills are. As long as our curiosity continues, we can get relevant data to answer our questions and make it available for millions of others to explore for answers. This is amazing and worth all the time and effort you put into it! Well done!

5 Likes

@israelogunmola

I am short of words. This is a very comprehensive data-gathering project that initially seems like a long read. However, going through your work was a rewarding experience for me. I learned so much. I encourage everyone to go through this project and learn how thorough data gathering can be.

  • I am very, very grateful to you for taking the time to look at it.

The extra mile of making the dataset available in two database flavours is also exciting.

  • I imagined working in a company where I could not simply leave the data in a CSV. I had to create the databases, understand the process through Jupyter, and dump the contents of the CSV into those databases.

The entire workflow is easy to understand and well explained. Thank you for taking the time to translate some of the Spanish words on the website to English. It made me follow along better.

  • At first the project was going to be something personal, which is why I kept the variables in Spanish; then the thing kept growing and, of course… :face_with_hand_over_mouth:

Projects like this remind me of how incredible data skills are. As long as our curiosity continues, we can get relevant data to answer our questions and make it available for millions of others to explore for answers.

  • Yes! It's a very nice feeling to be able to :eyes: things in the data world.

This is amazing and worth all the time and effort you put into it!

  • Thanks, it takes time but it’s been worth it.

Well done!

  • Thanks again. :pray:

A&E.

3 Likes

Hey @Edelberth, thanks for sharing your incredible project with the Community! I’m really glad that you have decided to create a database from csv files. After looking through hundreds of projects, it’s the first time I see someone create a database :slight_smile: It’s truly an end-to-end project, well done!

My first suggestion is to create a better GitHub page by including information about the data and a brief description of your approach. You can also put the files into different directories (i.e., .ipynb into notebooks, .py into scripts, etc.).

Some suggestions about Hispasonic (from web to csv):

  • Clearly mention that the name of the website is Hispasonic and provide a link
  • Write a better introduction explaining what you aim to achieve, for example, by highlighting the data you want to scrape
  • It is not necessary to say what different packages are needed for
  • You can write much better docstrings for your functions using, for example, NumPy/pandas style. You can also have a look at my article about docstrings
  • Filter the amount of url repeated. - I think this should go under the number 2
  • You have some typos and punctuation errors. You can use Grammarly to check for them :slight_smile:
  • When you download all the ads, consider truncating the list of paths (after [18])
  • Insert terminal commands as code directly in MarkDown
  • The next step we must implement is all the possible brands of synthesizer manufacturers that we can find in the ads. - do you need to implement the brands?
  • Sometimes, your code style is inconsistent
  • The code cell [26] is pretty cryptic. You should probably write these comments in the previous code cell and explain the algorithm directly with the code on the side
  • You should better explain what’s happening in [27]. For instance, use some docstrings to describe the functions
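Two of these suggestions, filtering repeated URLs and NumPy-style docstrings, can be combined in one small helper. This is only an illustrative sketch; the function name and its use are my invention, not code from the reviewed notebooks.

```python
def unique_urls(urls):
    """Remove duplicate URLs while keeping their first-seen order.

    Parameters
    ----------
    urls : list of str
        Ad URLs collected from the listing pages (may contain repeats).

    Returns
    -------
    list of str
        The same URLs with duplicates dropped.
    """
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```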

Next, Hispasonic (from csv to MySQL II ):

  • Capitalize all titles
  • It is not clear to me whether you created a database previously without using Python and then just accessed it with Python? Could you clarify this? I’m referring to the cell [10]
  • What are you doing in the last code cell?

Finally, I have no comments on Hispasonic (from csv to PostgreSQL II ) except for better code styling in regards to the terminal code.

When are you planning on using SQL queries to gain insights from the data?

Congratulations again on finishing the project! Happy coding :grinning:

1 Like

Hey! @artur.sannikov96

Glad to see you, your comments are always welcome.

Hey @Edelberth, thanks for sharing your incredible project with the Community! I’m really glad that you have decided to create a database from csv files. After looking through hundreds of projects, it’s the first time I see someone create a database :slight_smile: It’s truly an end-to-end project, well done!

Thank you very, very much; I am very happy that you liked it. For me it has been very important to get here: on the one hand because I have known this page, and its people, for a long time, and on the other because I have tried to capture here the things I have been learning at Dataquest.

In relation to databases, there is one thing that always comes up in job ads, and that is SQL. So I installed the databases locally to understand them a little better and see if it was possible to access them through Jupyter.
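Accessing the database from Jupyter ends up looking like this: an SQL query run from Python, with the result landing back in Python objects. Again sqlite3 stands in for the locally installed MySQL/PostgreSQL servers, and the table, columns, and rows are invented examples.

```python
# Sketch: run an aggregate SQL query from Python/Jupyter.
# sqlite3 substitutes for a local MySQL/PostgreSQL server; data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (brand TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO ads VALUES (?, ?)",
    [("Moog", 700), ("Moog", 1200), ("Buchla", 9500)],
)

# Average asking price per brand, computed by the database itself.
avg_by_brand = dict(
    conn.execute("SELECT brand, AVG(price) FROM ads GROUP BY brand")
)
```

With pandas available, `pd.read_sql` over an SQLAlchemy engine returns the same query as a DataFrame, which is the natural next step for analysis in a notebook.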

  • My first suggestion is to create a better GitHub page by including information about the data and a brief description of your approach. You can also put the files into different directories (i.e., .ipynb into notebooks, .py into scripts, etc.).

  • Without a doubt. I have been learning GitHub as I went along with this project. However, I don't know how to reorganize what I have without losing the upload history. If you have any ideas, I'm all ears.

  • Clearly mention that the name of the website is Hispasonic and provide a link

Yes

  • Write a better introduction explaining what you aim to achieve, for example, by highlighting the data you want to scrape

    True, I had so internalized what I wanted to do that I didn’t even realize I hadn’t said it.

  • This first part of the project focuses on obtaining relevant ad information; the category I have focused on is the one for electronic musical instruments.
  • It is not necessary to say what different packages are needed for
    I did it for myself, so I wouldn't forget.

  • You can write much better docstrings for your functions using, for example, NumPy/pandas style. You can also have a look at my article about docstrings
    I take note, thanks.

  • Filter the amount of url repeated. - I think this should go under the number 2
    True. That, like many other things you have surely seen, is one of those details that, after days, you forget you have in front of your eyes. Noted.

  • You have some typos and punctuation errors. You can use Grammarly to check for them :slight_smile:
    :speak_no_evil:

  • When you download all the ads, consider truncating the list of paths (after [18])
    As I said before, this is one of those things that remains to be polished.

  • Insert terminal commands as code directly in MarkDown
    I did it that way because I didn't know how to highlight what I saw in the terminal; to be honest, I didn't even think about it.

  • The next step we must implement is all the possible brands of synthesizer manufacturers that we can find in the ads. - do you need to implement the brands?

    Of course; otherwise it is impossible to distinguish between the brand of the synthesizer and any comment or description in the advertisement. (I don't know if I understood the question correctly.)

  • Sometimes, your code style is inconsistent
    Yes, it is. I'm working on it.

  • The code cell [26] is pretty cryptic. You should probably write these comments in the previous code cell and explain the algorithm directly with the code on the side

    What I was trying to explain are the steps of the previous cell. In fact, that is how I explained to myself at the time how it worked, since sometimes one forgets.

  • You should better explain what’s happening in [27]. For instance, use some docstrings to describe the functions

    What I explain in this cell, or at least what I thought I explained (now I see how wrapped up I was in my own world), is how I built the dictionary from the list of names with one brand, two brands, and two repeated brands.
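The dictionary-building step described above (titles with one brand, two brands, or a repeated brand) could be sketched as follows. This is a hedged reconstruction, not the notebook's actual code: the brand list, function name, and ad titles are all invented for illustration.

```python
# Hedged sketch of the brand dictionary: for each ad title, record which known
# brands it mentions, counting a repeated brand only once. Data is invented.
KNOWN_BRANDS = ["Moog", "Buchla", "Korg"]

def brands_in_title(title):
    """Return the distinct known brands mentioned in an ad title, in order."""
    found = []
    for word in title.split():
        brand = word.strip(",.()").capitalize()   # normalize casing/punctuation
        if brand in KNOWN_BRANDS and brand not in found:
            found.append(brand)
    return found

ads = [
    "Moog Grandmother for sale",          # one brand
    "Trade Korg MS-20 for Moog gear",     # two different brands
    "Moog Minimoog Moog original case",   # repeated brand, counted once
]
brand_map = {title: brands_in_title(title) for title in ads}
```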

I take your opinion very much into account. Now I want to rest a little from this, because it has become my one and only topic.

I thank you, as always, for the time and effort. If I had done half of the things you suggest, your trip through the code would surely have been more pleasant.

Again, thank you for your observations; they will serve me now and in the future. Now I want to go through DQ and see if, with what I have learned, I can help others as you have helped me.

Thanks a lot.

A&E.

2 Likes

Are you using Git on your local machine? You should be able to move files with the git mv command. It also has the rm command if you know what I mean.

No, I just think your English is not correct, I could not understand what you meant :frowning:

You are welcome!

2 Likes

Hello @artur.sannikov96

Github

What happened was this: in the beginning I wanted to organize the projects into directories, and since I was uploading manually it seemed there were no problems, but when I learned to use Git through the terminal I realized the problem.

Now either I delete what I have done, and with it the upload history, or I create a new repo, this time better organized. Or I create an account on GitLab and then import all the content into GitHub… I'm still thinking about it, because it takes time and there are a lot of things to fix. At least that is how I feel.

No, I just think your English is not correct, I could not understand what you meant.

Yes, it's not good. Let's see if I can improve it, or at least stop trying to express myself at the same level as in my mother tongue and make things easier for myself.

A pleasure to communicate with you again.

A&E.

2 Likes