How I learnt Data Science in 175 days as a complete beginner

Happy New Year everyone! As the title suggests, this is an analysis project on my DataQuest journey. I’m really excited to have finished it just before the new year arrived! What better way to send off 2020 than a thorough look back at my focus of the year?

It has been a year of grief, but in the DataQuest community, I see people from all over the world trying their best to learn and make progress every day. So my motivation for this project is not only to revisit my own journey but, even more, to encourage beginners by giving them a peek at the road ahead. Please keep in mind that the time and effort needed to complete this path depend heavily on your personal situation. I will explain mine later in this article.

This project is also inspired by the people in this community, especially @otavios.s’s amazing project “I hope this is not a problem, but I scraped the Community”. I was introduced to Selenium and ChromeDriver thanks to his project. Yes, I also scraped the DQ website to get the full Data Scientist curriculum, and I hope that’s okay…

Before I go into the details of this project, I want to first share my findings.

The questions that get answered in this project:

  1. How many days did it take me to finish this path? (timespan, including intervals I didn’t spend studying)
  • 175 days. From June 19th, 2020 to December 11th, 2020.

  2. What’s my best learning streak and average learning streak?
  • My best learning streak was 20 days, and the average was 6.6875 days. From my personal experience, it’s important to get into the groove and keep going. I took a week-long break in October, and it took another week to get back to the same learning efficiency as before.

  3. How much time was spent in total?
  • I spent 306.4 hours in total finishing the path. This means that if I had studied 24/7, the path could have been finished in roughly 13 days. Instead, it took me 175 days. I’m sure the robots are laughing at us humans.

  4. How many hours did I spend on average in weeks I studied?
  • Assuming I studied 5 days a week on average, the 24 weeks I did study add up to 120 study days. This means I spent about 2.5 hours a day studying on DataQuest on average (306.4 hours / 120 days). That sounds about right, but note that it’s a rough estimate. I also spent quite some time in the community and reading extracurricular materials; those are not counted in this project.

  5. What’s the average time spent to finish a mission?
  • 111.43 minutes, in other words close to 2 hours. That looks like a dauntingly long time to finish a mission, but it also includes time spent on guided projects, which are definitely more time-consuming than regular learning missions. It’s not uncommon to spend days on a guided project. I wish I had more granular data on the time spent on each mission so I could see the averages for projects and non-project missions separately, but I don’t know if that data even exists.

  6. What are the speed bumps in the curriculum?
  • Steps 2, 4, 5, and 6 took more weeks to finish than the others. Among them, Steps 2 and 6 have the most missions, and Step 2 also has the most guided projects, so their length is expected. That leaves Steps 4 and 5 as the most time-consuming steps relative to their size. Between the two, Step 4 took more time than Step 5, which matches my memory pretty well. In Step 4 the time-consuming part was SQL, and in Step 5 it was the courses on probability.
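The headline numbers above can be re-derived with a few lines of arithmetic. This is only a sketch using the rounded figures quoted in this post (306.4 hours, 165 missions, 24 study weeks, and the assumed 5 study days per week); the variable names are made up for illustration. The per-mission average comes out marginally below the 111.43 minutes quoted above, presumably because the total hours are rounded here.

```python
# Figures quoted in the post; the 5-days-a-week value is an assumption, not measured data.
TOTAL_HOURS = 306.4
MISSIONS = 165
STUDY_WEEKS = 24
STUDY_DAYS_PER_WEEK = 5

# How long the path would take if studied around the clock.
days_nonstop = TOTAL_HOURS / 24

# Average hours per study day under the 5-days-a-week assumption.
hours_per_study_day = TOTAL_HOURS / (STUDY_WEEKS * STUDY_DAYS_PER_WEEK)

# Average minutes per mission (guided projects included).
minutes_per_mission = TOTAL_HOURS * 60 / MISSIONS

print(f"Non-stop days:       {days_nonstop:.1f}")
print(f"Hours per study day: {hours_per_study_day:.2f}")
print(f"Minutes per mission: {minutes_per_mission:.1f}")
```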

Now, a little context about my personal learning situations:

  • I started the Data Scientist path in Python on June 19th, 2020, and finished it on December 11th, 2020. I didn’t spend much time on the path in the last two weeks; most of it went to finishing the last two guided projects (which count as 2 missions) and to extracurricular projects. That’s probably why I didn’t get any learning progress emails after the end of November.
  • I used to be a digital marketing account manager and had close to no coding experience. I learned Python fundamentals from a data science course on Udemy for a couple of weeks right before I decided to switch to DataQuest.
  • I finished Andrew Ng’s Machine Learning course on Coursera a few weeks before starting the path. I learned basic Octave during that course.
  • I’m currently unemployed so I have a lot of spare time for learning.

A closer look at the project

A) Data collection (email parsing & web scraping)

The data I used in this project were collected from two sources:

  1. The progress data in this project comes from the weekly accomplishment email I get from DataQuest on Mondays if I made enough progress the previous week. It consists of:
    • date: Receiving date of the email. Always a Monday.
    • missions_completed: Number of missions completed.
    • missions_increase_pct: Percentage increase/decrease compared to last week on the number of missions completed.
    • minutes_spent: Minutes spent on learning.
    • minutes_increase_pct: Percentage increase/decrease compared to last week on the minutes spent.
    • learning_streak(days): Number of consecutive days spent on learning.
    • best_streak: Best learning streak.
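For readers unfamiliar with the two `*_increase_pct` fields, a week-over-week percentage change is usually computed as below. This is a generic sketch, not DataQuest’s actual formula, and the function name `pct_change` is hypothetical.

```python
def pct_change(previous, current):
    """Percentage increase (positive) or decrease (negative) from previous to current.

    Returns None when the previous value is 0, because the change is undefined.
    """
    if previous == 0:
        return None
    return (current - previous) / previous * 100


# Example: 10 missions last week, 15 this week.
print(pct_change(10, 15))   # 50.0
print(pct_change(200, 150)) # -25.0
```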

To get the weekly emails, I first created a label in my Gmail to group the emails I wanted and then went to Google Takeout to download them. You can choose the file format in the process; what I downloaded was a .mbox file. Python has a standard library module called mailbox for parsing this type of file. You will find the code used in this project in the GitHub link at the end of the post.
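As a rough idea of what reading the export looks like, here is a minimal sketch using the standard library `mailbox` module. It only lists the date and subject of each message; extracting the actual progress numbers depends on the email body layout and is left to the full project. The function name `list_messages` and the file name in the usage comment are hypothetical.

```python
import mailbox
from email.utils import parsedate_to_datetime


def list_messages(mbox_path):
    """Return a list of (received_date, subject) tuples for every message in an .mbox file."""
    rows = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            date_header = msg["Date"]
            # Messages without a Date header get None instead of raising an error.
            received = parsedate_to_datetime(date_header).date() if date_header else None
            rows.append((received, msg["Subject"]))
    finally:
        box.close()
    return rows


# Usage (hypothetical file name for the Google Takeout export):
# for received, subject in list_messages("dataquest_weekly.mbox"):
#     print(received, subject)
```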

A screenshot of the weekly accomplishment email

  2. The curriculum data in this project comes from the DataQuest dashboard for the Data Scientist path. It consists of 8 Steps, 32 courses, and 165 missions, including 22 guided projects, in hierarchical order.
    As mentioned at the beginning of the post, I used Selenium and ChromeDriver for the first time. The dashboard page where the curriculum information resides contains a grid of steps and collapsible lists of courses and missions, so there was auto-login and a lot of clicking involved. I will probably write another article on scraping this page later.
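To illustrate the hierarchical shape of the curriculum data (Step, then course, then mission), here is a small sketch that flattens a nested structure into one row per mission and counts the totals. The sample entries and the names `flatten_curriculum` and `summarize` are invented for illustration; they are not the real curriculum or the code from the project.

```python
# Toy data only: step -> course -> list of (mission title, is_guided_project).
sample_curriculum = {
    "Step 1": {
        "Course A": [("Mission 1", False), ("Mission 2", False), ("Guided Project 1", True)],
    },
    "Step 2": {
        "Course B": [("Mission 3", False)],
        "Course C": [("Mission 4", False), ("Guided Project 2", True)],
    },
}


def flatten_curriculum(curriculum):
    """Turn the nested step/course/mission structure into a flat list of row dicts."""
    rows = []
    for step, courses in curriculum.items():
        for course, missions in courses.items():
            for title, is_project in missions:
                rows.append(
                    {"step": step, "course": course, "mission": title, "guided_project": is_project}
                )
    return rows


def summarize(rows):
    """Count distinct steps and courses, plus all missions and guided projects."""
    return {
        "steps": len({r["step"] for r in rows}),
        "courses": len({(r["step"], r["course"]) for r in rows}),
        "missions": len(rows),
        "guided_projects": sum(r["guided_project"] for r in rows),
    }


rows = flatten_curriculum(sample_curriculum)
print(summarize(rows))
```

A flat table like this is easy to group by step, which is what the per-step counts later in the post rely on.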

B) Data Imputation

The weekly email dataset in this project is very small, with only 16 rows containing data from 16 weeks. But my learning span was in fact 26 weeks. There were weeks where I didn’t study at all, but still, for such a small dataset, I can’t really afford to lose 10 weeks of data.

Luckily, on the profile page, DataQuest provides the learning curve throughout a path. So I came up with an imputation strategy: fill in the blanks where possible, plot the existing data, compare it with the DataQuest-generated learning curve, and combine that with my personal experience (e.g. pictures and memories of taking vacations & slacking :slight_smile: ) to impute the missing numbers of missions completed. Then impute the minutes spent based on the average minutes spent on a mission. It’s explained in more detail in the project.
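The last two steps of that strategy can be sketched roughly as follows: insert a row for every missing Monday, take mission counts for those weeks from a manual estimate, and derive the minutes from the average minutes per mission in the observed weeks. This is a simplified illustration in plain Python with made-up numbers; the function names and the `imputed_missions` input are assumptions, and the project itself may do this differently.

```python
from datetime import date, timedelta


def fill_missing_weeks(observed, start, end):
    """Build one row per Monday from start to end; weeks without an email get None values.

    observed maps a Monday date to a (missions_completed, minutes_spent) tuple.
    """
    rows = []
    day = start
    while day <= end:
        missions, minutes = observed.get(day, (None, None))
        rows.append({"date": day, "missions": missions, "minutes": minutes, "imputed": False})
        day += timedelta(weeks=1)
    return rows


def impute(rows, imputed_missions):
    """Fill missing missions from manual estimates and missing minutes from the observed average."""
    known = [r for r in rows if r["minutes"] is not None and r["missions"]]
    avg_minutes_per_mission = sum(r["minutes"] for r in known) / sum(r["missions"] for r in known)
    for r in rows:
        if r["missions"] is None:
            # Weeks with no estimate are treated as weeks without study.
            r["missions"] = imputed_missions.get(r["date"], 0)
            r["minutes"] = round(r["missions"] * avg_minutes_per_mission)
            r["imputed"] = True
    return rows


# Made-up example: two observed weeks with one missing week in between.
observed = {date(2020, 6, 22): (10, 1000), date(2020, 7, 6): (5, 650)}
rows = fill_missing_weeks(observed, date(2020, 6, 22), date(2020, 7, 6))
rows = impute(rows, {date(2020, 6, 29): 4})
for r in rows:
    print(r)
```

Keeping an `imputed` flag on each row makes it easy to show estimated weeks differently in a plot.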

While I think the imputation was pretty successful (in serving the needs in this project), I wish we could have more data on our learning journey from DataQuest.

C) Visualizations in this project:

I used Plotly to plot all the visualizations in this project. I’m pretty happy with the Hours Spent vs Missions Completed plot below. It helped me make quite a few interesting observations and answered the curriculum related questions at the beginning of this post. Again, you can read the details in the GitHub link at the end of the post.

To share the plots in posts like this one, I also tried out Chart Studio. The plots below are hosted on the Chart Studio cloud and embedded using the HTML that Chart Studio generates.

  • My learning curve
dq_learning_curve
  • Hours spent weekly and the corresponding number of missions completed and the steps they belong to
dq_hour_mission_line
  • Number of missions and guided projects in each learning Step
dq_mission_num_scatter
  • Full curriculum table of the Data Scientist in Python path on DataQuest
curriculum_table

Apart from answering all the questions at the beginning of this post, I also want to add something for beginners of this course: what I’ve done in this project is mostly data collection, data cleaning, and imputation, which you will learn in the first 4 Steps. That means you will be equipped to do all of this halfway through the Data Scientist path!

P.S. If anyone has more questions regarding this project or the DQ Data Scientist path, feel free to ask me in the comments or reach me at veratsien@gmail.com. I will try my best to provide an answer. :relaxed:

Click here to view the full project.


Hi,
Great!
May I ask how much time it took you to do this personal project?


Hi @sergibtrader,

I’d say about a week’s worth of working time.

I think what took this long was mostly the process of designing this project.

  • I didn’t work on it every day and didn’t plan to include the web scraping part at the start. After I was done with the email parsing, it just didn’t feel like much data to work with. That’s when I decided to scrape the curriculum.
  • It took some time to come up with a data imputation strategy that I was happy with.
  • Also, it was the first time I had worked with email parsing and scraping with Selenium.