Is it possible to become a Kaggle competitor in 30 days without any prior programming or machine learning experience? Apparently, yes! That’s what Kaggle offered with the 30 Days of ML challenge in August 2021. Although the challenge has already ended, you can still start a challenge by participating in a different competition. In this article, I share my experiences with the challenge and give you some tips on how to create your own 30 Days of ML challenge!
30 Days of ML was advertised as a beginner-friendly challenge, suitable for people without any prior programming or data science knowledge and requiring only about an hour of time investment per day.
You can check out the full description here.
Before getting into the details of how I tackled the challenge, let me talk a bit about my background.
In short: I had almost no programming and data science experience.
I got interested in data science about two years ago, but I felt very intimidated and uncertain about it. (Could I ever get through the long list of prerequisites? Was I suited for this field? After investing a lot of time and energy, would I ever be hired, or would all my effort be in vain? Would I even like it?)
I made a few attempts to learn Python, but I found even simple concepts (e.g. functions) daunting and gave up, blaming a lack of time, the coronavirus, and so on. But let’s admit it: what was truly holding me back were my doubts, my fear of the unknown, and my reluctance to step out of my comfort zone and learn something that requires effort and perseverance.
This summer I decided to give learning another try and started Kaggle’s Python course in July (the same course as in the challenge). After getting stuck in Part 5 and realizing that I needed more practice, I joined Dataquest. Within a week, I found out about the 30 Days of ML challenge and decided to join.
The very first tasks were:
- to submit to the Titanic competition (there was a follow-along guide, so anyone could complete it just by copy-pasting code), and
- to join a Discord community.
It came as a shock that over 47 thousand people joined the Discord server, and who knows how many of them weren’t beginners! Even a Kaggle Grandmaster, Abhishek Thakur, joined, which surprised many of us at first, but it turned out he was there to guide us and teach us a few cool techniques through his YouTube channel (check it out at this link!).
Our task was to go through the Python, Intro to ML, and Intermediate ML micro-courses, with daily email reminders to keep us on track. Python wasn’t easy, and I’m very glad that I had covered most of it earlier, along with the DQ Python Fundamentals courses; otherwise, I would have fallen quite behind.
Kaggle courses are available for free on their website, so you don’t need to join the 30 Days of ML to complete them.
Then the email we had been waiting for the most arrived: the invite link for the competition. I was very excited to open it and see what the competition would be like.
The features of the dataset were anonymized: instead of interpretable features like Quality with values good, average, and poor, every column looked like cat1 with values A, B, and C.
Because I couldn’t extract anything meaningful from the data itself, I tried everything I could think of to see what worked. I compared the changes in the cross-validation score and the public leaderboard score, and only kept what improved both.
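The keep-only-what-improves-CV workflow can be sketched like this. This is a minimal illustration on synthetic data with scikit-learn, not the actual competition code; the model, dataset, and parameters are all placeholders (the competition metric was RMSE, which is what is computed here).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the competition data
X, y = make_regression(n_samples=400, n_features=10, noise=0.3, random_state=0)

def cv_rmse(model):
    # 5-fold cross-validated RMSE; lower is better
    scores = cross_val_score(
        model, X, y,
        cv=KFold(n_splits=5, shuffle=True, random_state=0),
        scoring="neg_root_mean_squared_error",
    )
    return -scores.mean()

baseline = cv_rmse(RandomForestRegressor(n_estimators=50, random_state=0))
candidate = cv_rmse(RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0))

# Keep the change only if the CV score improves
# (and, during the competition, only if the public score improved too)
print(f"baseline RMSE:  {baseline:.3f}")
print(f"candidate RMSE: {candidate:.3f}")
```

In the competition this comparison was done once per idea: if either the CV score or the public leaderboard score got worse, the change was reverted.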
I truly loved Abhishek’s videos, but they were a bit difficult for me to understand, and because of my lack of programming knowledge I was sure I would spend days trying to change even a tiny part of his code. So my approach was to follow his videos to get ideas, and then research easier ways to implement them. You can find links to most of the resources I used in my notebook.
However, the way I tuned my models was very beginner-like: tedious and time-consuming. I ran a GridSearch, then tuned the parameters near the values it returned, one by one or in groups of two or three, running GridSearch over and over again.
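That coarse-then-fine loop looks roughly like this. It’s a sketch with scikit-learn’s GradientBoostingRegressor standing in for XGBoost, on synthetic data; the grids and parameter names are illustrative, not the values actually used in the competition.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=0.5, random_state=0)

# Round 1: a coarse grid over a couple of parameters at a time
coarse = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 4], "learning_rate": [0.05, 0.1, 0.2]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
coarse.fit(X, y)
best_depth = coarse.best_params_["max_depth"]

# Round 2: freeze what round 1 found, search finer values near the rest
fine = GridSearchCV(
    GradientBoostingRegressor(random_state=0, max_depth=best_depth),
    param_grid={"learning_rate": [0.08, 0.1, 0.12]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
fine.fit(X, y)
print(fine.best_params_)
```

Repeating this refinement by hand, parameter by parameter, is exactly why the process felt so tedious.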
I was still proud of these models. Later, Abhishek shared a video about using Optuna to optimize hyperparameters, but most of the time it gave worse results than my hand-tuned parameters.
Finally, I used StackingRegressor to stack three of my XGBoost models tuned with GridSearch, an XGBoost model optimized with Optuna, and the RandomForest and GradientBoosting models from Abhishek’s video, with the default final estimator, RidgeCV, as the meta-model.
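Structurally, the final model looked something like the following simplified sketch. Sklearn’s GradientBoostingRegressor stands in for the XGBoost models, the hyperparameters are placeholders rather than the tuned competition values, and the data is synthetic; the one faithful detail is that StackingRegressor’s default final estimator really is RidgeCV.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)

X, y = make_regression(n_samples=300, n_features=8, noise=0.5, random_state=0)

stack = StackingRegressor(
    estimators=[
        # Stand-ins for the differently tuned boosted models
        ("gbm_a", GradientBoostingRegressor(max_depth=2, random_state=1)),
        ("gbm_b", GradientBoostingRegressor(max_depth=3, random_state=2)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=3)),
    ],
    # final_estimator defaults to RidgeCV, the meta-model mentioned above
)
stack.fit(X, y)
preds = stack.predict(X[:5])
print(preds)
```

The meta-model learns how to weight each base model’s out-of-fold predictions, which is why stacking can beat any single tuned model.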
It’s probably not necessary to say that I spent much more time than an hour a day.
If you are a beginner and didn’t understand most of what I wrote, don’t worry: I didn’t know about any of this two weeks ago either.
7573 teams participated in the competition. I was 889th on the public leaderboard, but dropped to 1122nd on the private one. At first, I felt a bit disappointed in myself; I thought I could do better. But then I realized that I finished in the top 15%, which is not bad for a beginner! And even if I had ended up in a much worse position, that would have been fine too, because I learned much more than I would have without this challenge. I’m very proud of myself and of everyone else who participated!
On the last day, I was playing with changing the seeds of my models and realized that modifying this seemingly insignificant parameter led to very different results. I googled how to average predictions across seeds, but it seemed too complicated for me yet (not to mention how much time it would have taken to run!). So I gave up on the idea and used the seed that gave me the best public score.
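In hindsight, the idea is simpler than it looked: train the same model several times with different seeds and average the predictions. Here is a minimal sketch on synthetic data (the model and seed count are illustrative; subsampling makes the seed actually matter for this estimator).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

seed_preds = []
for seed in range(5):  # the winning solution reportedly averaged 20 seeds
    model = GradientBoostingRegressor(subsample=0.8, random_state=seed)
    model.fit(X_train, y_train)
    seed_preds.append(model.predict(X_test))

# Averaging smooths out the run-to-run variance caused by the seed
final_pred = np.mean(seed_preds, axis=0)
print(final_pred[:3])
```

The cost is simply training the model once per seed, which is the running time I balked at during the competition.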
What makes me the happiest is that averaging 20 seeds helped someone win! Of course, he used other awesome techniques as well, but it makes me proud that my idea could have worked if I had known how to implement it! About three weeks into the challenge, I already felt that data science is what I truly want to do from now on, and the fact that I thought of part of a winning solution makes me very motivated to continue studying and improving.
After the competition ended, a friend of mine told me that Abhishek had indeed used multiple seeds in his videos, which could have been the solution to my problem. I’m not sure whether I missed that detail when I watched his videos or it was among the things I didn’t understand, so I’m going to rewatch them all.
But more importantly, I’m going back to the basics to get a bit more confident in using Python and learn more details of simple techniques.
After building solid foundations, I’m going to take more advanced Machine Learning courses and start participating in competitions again.
Why would you even do a fast-paced challenge like this? My biggest takeaway from this opportunity was getting a general idea of what data science is like. If you are a beginner and have as many doubts as I did, I recommend you do the same to get a feel of a machine learning project. Here is my advice on how:
- First and foremost, give yourself a bit more time and make it at least 40 days, preferably 60.
Two weeks is not enough to build a solid foundation. Even if you want to learn fast, take at least a month and cover all the basics. You can keep learning during the competition as well, but it will be more efficient if Python comes easily to you, so you can focus on more advanced topics. I know 60 days doubles the time investment, but it’s still much shorter than spending years studying only to realize that data science is not for you.
- Think of Kaggle’s micro-courses as summaries!
This relates to my previous point. Kaggle’s micro-courses are great, and you can start competing after taking only these, but you’ll need deeper knowledge to avoid spending many frustrating hours debugging. If you still prefer to follow the Kaggle courses, research the concepts you don’t understand, and definitely include Pandas, Data Visualization, and Feature Engineering. Otherwise, I recommend completing Steps 1 to 3 of the Data Scientist path on Dataquest. I found Dataquest’s courses more detailed and easier to understand, not to mention how many opportunities they provide for practice!
- Don’t do it alone!
I was very lucky to meet a handful of people with different backgrounds and we formed a small, supportive community. There was always someone there to help those who got stuck and give encouragement. I’m very grateful for getting to know them.
- Participate in a competition!
Kaggle’s Tabular Playground Series is very similar to the 30 Days of ML competition, and a new one starts on the first day of every month, so you can begin at any time of the year.
- It’s fine to use more experienced users’ notebooks!
Don’t be scared of using other people’s notebooks, like I was. But don’t just copy-paste their code: spend time understanding it, add your own ideas, and don’t forget to thank the author!
- Focus on learning!
It’s easy to get discouraged by focusing too much on the leaderboard, but keep in mind that the goal is improving yourself!
- Lastly, data science is super exciting, so take it easy, play with your code and have fun!
I am very grateful to Kaggle, Alexis Cook, Luca Massaron, Abhishek Thakur, and everyone else for organizing this challenge and guiding us.
You allowed me to get a glimpse of what a data science project is like without spending years studying, and helped me realize that it’s worth all the time and effort to continue walking this path. It was a life-changing opportunity for me, and I cannot thank you enough!