I hope this is not a problem, but I scraped the Community

Hello everyone!

[EDIT]
(The code is now available in this GitHub repository.)

So, I’ve been active in the Dataquest community for about a month now. I started to visit the community more often after I was accepted for the Covid-19 Financial Aid Scholarship. Being accepted into the program made me feel really grateful for being helped during these tough times everybody is going through, so I felt I should put in more effort to help others. That is why I started visiting the community to see if I could help someone. I thought that if I could help other students with their questions, I would be helping not only the student whose question I answered but also the platform that helped me in the first place.

I’m not a Python expert or an experienced data scientist or anything like that. I’m just a Dataquest student like most of you, so I was not sure whether I was capable of answering people’s questions. But as it turns out, I was.

As I kept answering questions, I noticed that I was also helping myself by doing it, since I had to revisit something I had already studied or even learn something new to answer a question. After a while, I caught myself typing community.dataquest.io into my browser several times a day and having fun doing it. So I thought of a way to optimize my time and the help I was providing: I wrote a web scraper that notifies me via email every time a new question is posted in the community.

The pros of this are:

• I practiced scraping;
• I no longer need to check the website manually all the time;
• Students can have their questions answered faster (if I’m capable of answering them, of course). And I know that when you are stuck it’s easy to get demotivated, especially if you rely on the community for answers and have to wait hours or days to maybe receive one that allows you to move on.

And the cons are… well, I don’t see any.

After I got my scraper working, I felt it would be good to share this idea and the code with everyone, and maybe help more people help others, so here I am now. Enough talking then, let’s code!

First, the code is written to use Google Chrome to scrape and Gmail to send the emails. You can, of course, use other browsers and email providers, but adapting the code is on you.

We’ll start by importing the libraries we’ll use. You’re probably already familiar with pandas and the sleep function from time. Other than those, we’ll use smtplib to send the emails and selenium, which is an extremely powerful tool, to scrape the website. If you’re into web scraping, selenium is a must.
Also, you need to download the Chrome webdriver (if you’re using Chrome) and place it in the same directory as your script (there’s a short sketch after the imports showing how to point selenium at the driver explicitly).

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd
import smtplib
from time import sleep
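
A quick note on the driver location: selenium looks for the chromedriver executable on your PATH, so if it complains that it can’t find the driver, you can point it at the file explicitly. This is a minimal sketch, assuming the driver binary sits next to the script; executable_path is the selenium 3 API, while selenium 4 uses a Service object instead:

    # Selenium 3: pass the driver location directly (assumes ./chromedriver exists)
    driver = webdriver.Chrome(executable_path='./chromedriver')

    # Selenium 4: wrap the driver location in a Service object instead
    from selenium.webdriver.chrome.service import Service
    driver = webdriver.Chrome(service=Service('./chromedriver'))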

Now, we’ll write the send_email function to send the emails. This function is pretty straightforward even if you’ve never worked with smtplib.
We’ll use the try and except clauses just so the script does not raise an error if it fails to connect to the Gmail server.

def send_email(subject, msg):
    try:
        mail_from = 'your email as string'
        password = 'your password as string'
        mail_to = 'the email address that will receive the email as string'
        # Connect to Gmail's SMTP server and upgrade the connection to TLS
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.ehlo()
        server.starttls()
        server.login(mail_from, password)
        # A blank line separates the subject header from the body
        message = f'Subject: {subject}\n\n{msg}'
        server.sendmail(mail_from, mail_to, message)
        server.quit()
        print('Email successfully sent!')
    except Exception:
        # Catching Exception (rather than a bare except) avoids swallowing KeyboardInterrupt
        print('Failed to send email.')
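
Once the placeholders are filled in with a real address and password, calling the function is as simple as:

    send_email('Test', 'Hello from the scraper!')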

Now, the scraping. We’ll use an infinite loop to keep the code running all the time, and at the end of the code we’ll use sleep to set how long we want the scraper to wait between each check for new posts. From now on, everything is inside the while loop.
So, first we set up selenium and instantiate the driver object. Then we tell the driver to get the website. If you set option.headless to False, you can actually watch your browser open and navigate to the website to scrape the data, which is really fun.

while True:
    my_url = 'https://community.dataquest.io/c/qa/44'
    option = Options()
    option.headless = True
    driver = webdriver.Chrome(options=option)
    driver.get(my_url)

Now that we are in the Community, the part of the page we are interested in is the list of topics. If you know HTML, you know that this is a table. If you don’t, you do not need to worry about it.

We’ll use pd.read_html to read the driver’s page source. This returns a list with all the tables on the page as dataframes:

    tables = pd.read_html(driver.page_source)
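
A small addition of my own here (not part of the original walkthrough): since the loop instantiates a fresh driver on every iteration, it’s worth quitting the browser as soon as we’ve captured the page source, otherwise headless Chrome processes pile up over time:

    driver.quit()  # close the browser; a new one is created on the next iteration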

As the page we’re scraping only has one table (the one we want), we’ll assign the first (and only) element of the list to our table variable:

    table = tables[0]

This is the table:

    Topic                                              ...  Activity
0   About the Q&A category  Q&A  Post technical qu...  ...        6d
1   Star Wars Survey Project - Converting Yes/No r...  ...        1m
2   Distance above and below the mean of a distrib...  ...       21m
3   Help : Reading from dictionary  Non-DQ Courses...  ...        1h
4   Categorical and Quantitative data  Non-DQ Courses  ...        2h
5   Problem to finish the exercise and mission 392...  ...        2h
6   Cannot connect with DB from Jupyter Notebook  ...  ...        5h
7   Intro courses not working (possible bug)  DQ C...  ...        7h
8   How does one solve “invalid literal for int() ...  ...        8h
9   Problem with saving takeaways  DQ Courses  dat...  ...        8h
10     Multiprocessing python  Non-DQ Courses  python  ...        9h
11  Guided Project: Analyzing _CIA _Factbook _Data...  ...        9h
12  Regex pattern to extract C excluding patterns ...  ...       10h
13  Series.fillna() does not have any purpose  DQ ...  ...       10h
14  Deriving the Derivative of the “Sum of Mean Sq...  ...       11h
15                  Working of the RE  Q&A  369-9 369  ...       11h
16         Question about DataFrame.style  DQ Courses  ...       12h
17  Exploring Ebay Cars Sales Data  DQ Courses  29...  ...       16h
18  Duplicated column name  DQ Courses  dataquest-...  ...       17h
19  2nd Question in Guided Project: Answering Busi...  ...       18h
20  Creating Box Plot with custom function  DQ Cou...  ...       18h
21  Unbale to understand the content for visualizi...  ...       20h
22  Groupby giving back inf values  DQ Courses  py...  ...       21h
23  Error in conditional statement code, most prob...  ...       22h
24  Unsupported operand type(s) for +=: ‘dict’ and...  ...       23h
25  For the “Employee Exit Surveys” Guided Project...  ...        1d
26  Removing Constants from an Equations Derivativ...  ...        1d
27  Predicting Bike Rentals - DT/RF not better tha...  ...        1d
28  Clean And Analyze Employee Exit Surveys part 7...  ...        1d
29  Np table in jobs.db or chinook.db on R studio ...  ...        1d

Although you can’t see them all here, the dataframe has all the columns from the Community’s topic list. Also, notice that the first topic is a pinned one that we are not interested in, as we are looking for new topics. We’ll then use slicing to select the nine topics after the pinned one (remember that iloc’s end index is exclusive, so 1:10 gives rows 1 through 9) and only the Topic, Replies, and Activity columns:

    table = table.iloc[1:10, [0, 2, 5]]
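
As a side note, assuming the parsed column names are exactly Topic, Replies, and Activity (which is what pd.read_html gave me here, but worth verifying on your own run), an equivalent alternative to the line above is to select the columns by name instead of by position, which is a bit more robust to layout changes:

    table = table.iloc[1:10][['Topic', 'Replies', 'Activity']]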

This is what we have now:

   Topic                                                Replies Activity
1  Star Wars Survey Project - Converting Yes/No r...        0       1m
2  Distance above and below the mean of a distrib...        3      21m
3  Help : Reading from dictionary  Non-DQ Courses...       10       1h
4  Categorical and Quantitative data  Non-DQ Courses        4       2h
5  Problem to finish the exercise and mission 392...        2       2h
6  Cannot connect with DB from Jupyter Notebook  ...        8       5h
7  Intro courses not working (possible bug)  DQ C...        1       7h
8  How does one solve “invalid literal for int() ...        1       8h
9  Problem with saving takeaways  DQ Courses  dat...        3       8h

Now, we need to split the Activity column into two columns: the first containing only the number, and the other containing the letter that represents the time unit (hours or minutes).

    table['time'] = table['Activity'].str[-1]
    table['Activity'] = table['Activity'].str[:-1].astype(int)

And now we have this:

   Topic                                              Replies  Activity time
1  Star Wars Survey Project - Converting Yes/No r...        0         1    m
2  Distance above and below the mean of a distrib...        3        21    m
3  Help : Reading from dictionary  Non-DQ Courses...       10         1    h
4  Categorical and Quantitative data  Non-DQ Courses        4         2    h
5  Problem to finish the exercise and mission 392...        2         2    h
6  Cannot connect with DB from Jupyter Notebook  ...        8         5    h
7  Intro courses not working (possible bug)  DQ C...        1         7    h
8  How does one solve “invalid literal for int() ...        1         8    h
9  Problem with saving takeaways  DQ Courses  dat...        3         8    h

I defined a new topic as one created no more than 10 minutes before the scraper runs and with no replies. So, we’ll create the new_topics dataframe by selecting only the rows that fulfill these requirements. Then we’ll use the shape attribute to assign the number of new topics to the variable num_new:

    new_topics = table[(table['time'] == 'm') & (table['Activity'] <= 10) & (table['Replies'] == 0)]
    new_topics = new_topics.reset_index(drop=True)
    num_new = new_topics.shape[0]

And then we have:

    Topic                                               Replies  Activity time
0  Star Wars Survey Project - Converting Yes/No r...        0         1    m

The work is basically done. We’ll use an if statement to check whether the number of new posts is greater than zero and, if so, set up the subject and the message and call the send_email function. If it is not, the script will just print 'No new topics found.'.
The subject contains the number of new posts, and the message’s body contains the new_topics dataframe so we can see the titles of the new topics. The message also contains the url, so we can just click it and go to the Community right away.

    if num_new > 0:
        subject = f'{num_new} new topics!'
        msg = f'New topics: \n {new_topics}\n\n {my_url}'
        send_email(subject, msg)
    else:
        print('No new topics found.')

After that, we’ll use sleep to make our code wait before it checks the website again. I set it to ten minutes, which I think is a fair amount of time for this.

    sleep(600)

Finally, for the email to be sent from your Gmail account, you must allow less secure apps in your account settings. I am not providing a link for this; just google “less secure apps google” and you’ll see how to do it.
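
As another side note (my suggestion, not something the original code does): rather than hardcoding your password inside send_email, you can read the credentials from environment variables. The variable names below are hypothetical, so use whatever you set in your own environment:

    import os

    mail_from = os.environ['SCRAPER_MAIL_FROM']      # hypothetical variable name
    password = os.environ['SCRAPER_MAIL_PASSWORD']   # hypothetical variable name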

And that’s it.

I hope you enjoyed this and that it can be useful somehow. My intention in posting was just to try to get more people to help others, or maybe make someone interested in web scraping (which is lots of fun) and, of course, to share what I’ve done. I also hope that Dataquest does not mind that I’m doing it. :sweat_smile:

Feel free to use this code for whatever you want; it’s absolutely free.

Cheers!

38 Likes

Hey @otavios.s

This really sounds interesting. I really liked the way you explained the procedure end to end. Maybe I’ll give it a shot on something else and explore web scraping stuff :slight_smile:

Thanks for sharing it.

Best
K!

1 Like

Well, this is awesome. Great work!

1 Like

This was such an exciting read @otavios.s! :heart_eyes:

I would strongly encourage you to take this one step further and publish this story in some Data Science publication!

5 Likes

Thank you everybody for the support!

Hey @nityesh, I read this post right before creating this topic. Do you think my post is worth a publication?

I’ll give it a shot!

2 Likes

Superb initiative! I’ll try it out :slight_smile:

1 Like

Yes, I do, @otavios.s :slight_smile:

2 Likes

Hey everybody and especially @nityesh:

My story was just published in Towards Data Science!

Thank you all for the feedback and encouragement!

And also this:

Our curators just read your story, How web scraping helped me going from learning to teaching , that you submitted for review. Based on its quality, they selected it to be recommended to readers interested in Machine Learning and Data Science across our homepage, app, topic pages, and emails.

10 Likes

YAAY!! Congratulations @otavios.s. This is amazing!!! :heart_eyes:

2 Likes

Hey @otavios.s

Congratulations! This is both inspiring and aspiring! :+1:

2 Likes

Hi @otavios.s.
I’m also new to the community and have also applied for the Dataquest scholarship. I would like to ask you a few questions:
1) Dataquest has said that it would let people know about the results on June 4th. So, did you receive the scholarship earlier?
2) Were you given the premium access which they said they would offer to people who finish courses initially and remain active in the community? If yes, how does one get it after getting the basic scholarship?
3) Do you think Dataquest falls short of other sites such as Datacamp, which has articles, exercises and videos, whereas Dataquest lacks video lectures?

Thank you!!

1 Like

Hello @jhnafrin06.

I applied for it in April. It was a different selection than the one happening now. My scholarship started on April 30th.

No. The premium access will be granted after the first three months of the scholarship to the ones they think did a good job and earned it.

I never took a Datacamp course. Here is what I can say about this:

  • I like Dataquest’s text lessons. I think they are as good as (or even better than) typical video lessons.
  • Dataquest is much more hands-on than Datacamp, and I don’t need to take a Datacamp course to know that. A quick online search will show you that Datacamp exercises consist basically of filling gaps in the code instead of writing the full code as we do here. Actually, that’s why I discarded Datacamp when I was deciding which course to take.

5 Likes

Great work!! @otavios.s
Thanks for sharing.

1 Like

Hey guys, in case anyone is interested, I did it again:

5 Likes

Great job @otavios.s!

I’m definitely interested in web scraping and will give it a go.

Reading your headline on Medium, have you thought about changing the “going” to “go” so it reads better?
“How web scraping helped me go from learning to teaching”.

2 Likes

Thank you @Udoka!

Well, I had never thought about that. English is not my native language, so sometimes I make some mistakes. I’ll take a look.

1 Like

Well, you seem to have a good grasp of English already, so I can imagine that you’ll only get better.

1 Like

Nicely done @otavios.s!
You’ve made yourself an optimized peer in the community :smiley:

I once implemented a web scraper with selenium right after the web scraping course, and I had a lot of fun writing it. Selenium is absolutely required for dynamic websites, and writing a scraper is the best thing one can do to reinforce fresh knowledge.

If you see some interest from peers doing the same, maybe it is a good idea to create a Git repository and open source your project.

Congrats on your automation project!

1 Like

Thank you @fedepereira!

And here’s the repository you suggested:

Great idea!

2 Likes