I joined the Driven Data competition! Thoughts


Well, after discovering the benchmark article thanks to the Dataquest Download email, I finally decided 3 days ago to enter the competition. After a lot of optimization work, I made my first submission one day later and obtained 70%, landing near the top 10% of the competitors! Let’s say I was very excited, since my first attempt had improved on their benchmark, which obtained only 30% without any kind of optimization. But the current leader of the competition is at 95%… and there are a lot of competitors at around 90%. So my mood switched to “You can do it!” mode, and I was definitely convinced that I would be able to improve my score with more effort.

2nd submission after some time: same score, around 70%.
3rd submission tonight: only 19%!
Looks like something went wrong after reconfiguring my training set features so many times.
Result: feeling very disappointed with such a bad score despite so much effort…

The competition will close very soon, so I wanted to share some thoughts about this experience.

A very frustrating thing: the dataset is very large, with more than 63K observations, and the most important feature column (long DNA sequence strings) averages around 4,800 characters per row (the longest row has 60,000!). So each time you try to construct new features based on the DNA sequences, if you are not very careful about what your code does, it may take hours and consume a lot of memory! Same with hyperparameter optimization, grid search CV, cross-validation folds, etc.: it takes so long that I have been forced to shut down the kernel very often. For those more experienced than me, how do you deal with time and memory constraints in machine learning? I have also built a pipeline with grid search, but there is no way my RAM (18GB) can handle the load without slowing the process down.
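One common way to keep a search like this manageable is to tune on a small, class-balanced subsample first and only refit the winning configuration on the full data. The snippet below is a minimal sketch of that idea in plain Python, not what I actually ran: the function name `stratified_subsample`, its parameters, and the toy labels `lab_a`, `lab_b`, `lab_c` are all made up for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, min_per_class=1, seed=0):
    """Return sorted row indices that keep roughly `fraction` of each class."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        # Keep at least `min_per_class` rows so rare classes are not dropped,
        # but never ask for more rows than the class actually has.
        n_keep = max(min_per_class, round(len(indices) * fraction))
        n_keep = min(n_keep, len(indices))
        chosen.extend(rng.sample(indices, n_keep))
    return sorted(chosen)


# Hypothetical, heavily imbalanced labels standing in for the real target.
labels = ["lab_a"] * 1000 + ["lab_b"] * 100 + ["lab_c"] * 3
subset = stratified_subsample(labels, fraction=0.1)
```

The returned indices can then be used to slice both the features and the target before running the grid search, so each fit touches about a tenth of the rows. The trade-off is that parameters found on a subsample are only an approximation of the best ones for the full dataset.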

I did improve the model, but basically just by searching for and finding the “best” parameters for the same model shown in the benchmark, so this is not very exciting per se since 90% of the work was already done (though I learned a lot about scikit-learn methods during the process, and this is good). After doing some research about DNA and browsing academic papers (I would need more time to read and understand them in depth), I became convinced that the true challenge for improving the score lies not in hyperparameter tuning but much more likely in constructing better features from the DNA sequences (google “how count k-mers” for example). Hyperparameter tuning is boring, it’s just a for loop (and maybe the number one reason we need it is that the model is not so good?). Building good features is another story and needs more creativity. So, to date, this is where I have failed to improve the model, somewhere between feature construction and feature selection. Now, my question is: how can we challenge real pros in their own research field (bioengineering here) in a competition, when they themselves have been struggling to analyze DNA for decades?
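For anyone wondering what counting k-mers looks like in practice, here is a minimal sketch in plain Python. A k-mer is simply a substring of length k, and the counts of all overlapping k-mers in a sequence can serve as features. The function names `count_kmers` and `kmer_features`, the toy sequence, and the tiny vocabulary are illustrative assumptions, not the features I actually submitted.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a sequence."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty range, hence an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_features(sequence, vocabulary, k):
    """Turn a sequence into relative k-mer frequencies in a fixed column order."""
    counts = count_kmers(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    # Dividing by the total makes long and short sequences comparable.
    return [counts[kmer] / total for kmer in vocabulary]


example = "ATGCATGC"                      # toy sequence, not competition data
counts = count_kmers(example, 3)          # ATG and TGC each appear twice
features = kmer_features(example, ["ATG", "CAT", "AAA"], 3)
```

Using a fixed vocabulary keeps every row the same length, which is what a model expects. With 4 possible bases there are 4^k possible k-mers, so the feature count grows quickly as k increases, and that brings back the memory problem described above.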

Maybe I have become too pessimistic. I made only 3 submissions when most competitors made dozens and dozens (the limit is 3 per day), so let’s say there is a small lesson in this story: try not to join the competition late!



PS: last free day today so let’s try to make at least 1 new submission:)


Update: I finally made 3 submissions on Monday, slightly improving my score (73.88%), and I currently rank 176/1182. Not a bad experience then, though I believe there is room to do better, maybe in easier competitions.

Teams that exceed this threshold (i.e. exceeding 75.6% on the private leaderboard ) will automatically be invited to submit a report for assessment by our panel of judges (see “Assessment”, below).

I missed eligibility by less than 2 percentage points, but it was my first competition. Everyone should try it!


This is very cool @WilfriedF! Congratulations!! :tada:


This is inspiring @WilfriedF! I think I would need to brush up on some skills using the DQ curriculum and perhaps do a few projects to prepare myself (coz I’m trained in Deep Learning, not so much on Random Forests). Thanks for documenting and sharing your experience. I think you did quite well for a first go and know that there will be many more opportunities to come so work hard and press on :muscle:. I am hoping to see you document many more of such experiences in future! :wink:

Cheers and Congratulations! :clap: :clap: :clap: :tada:


Thank you for the feedback @nityesh and @masterryan.prof.

Also, I saw that Driven Data allows you to join a competition as a team. I have never worked in a team, but it could be interesting. Keep an eye out! Next time I will try to join the competition earlier and thus have more time to work on it.


Thanks to @nityesh for sharing this thread in the weekly Community Champions post!

The competition ends today, and I finally beat the second benchmark (75.6%) with my last submission a few minutes ago!


So if I understand correctly, I will be invited to “submit a report for assessment by panel of judges”, though it sounds weird to me since I am only ranked 164/1211.


Confirmed, I have qualified for the so-called “Innovation Track”; it’s like another competition within the competition. Today I received an email from the organisers, saying:

You are now invited to submit a report demonstrating how your lab-of-origin prediction models excel in domains beyond raw accuracy […]

You’ll need to convince experts from a variety of fields that your submission represents valuable progress in solving real-world attribution problems, demonstrating in plain language how your approach is impressive beyond raw lab-of-origin accuracy.

This is too much! Really, I don’t believe that what I did “represents valuable progress”, and unfortunately I will not save the world with my painful model! Additionally, “reports should be at most four pages long and with at most two figures”. Pfffff!


WOOAH! This is so amazing @WilfriedF!! My heartiest congratulations to you! :tada: :partying_face:

And thank you so much for giving us the exciting “live updates” about this. :heart_eyes:

You know, you should definitely compile this experience of yours into an article. It’ll be a good read.