Well, after discovering the benchmark article thanks to the Dataquest Download email, I finally decided three days ago to enter the competition. After a lot of optimization work I made my first submission a day later and scored 70%, landing near the top 10% of competitors! Let's say I was very excited, since on my first attempt I had already beaten their benchmark, which scored only 30% without any optimization. But the current leader of the competition is at 95%… and there are a lot of competitors around the 90% mark. So my mood switched to "You can do it!" mode, and I was convinced I would be able to improve my score with more effort.
2nd submission after some time: same score around 70%.
3rd submission this night: only 19%!
Looks like something went wrong after reconfiguring my training-set features so many times.
Result: feeling very disappointed by such a bad score despite so much effort…
The competition will close very soon, so I wanted to share some thoughts about this experience.
A very frustrating thing: the dataset is large, with more than 63K observations, and the most important feature column (long strings of DNA sequence) averages around 4,800 characters, with a maximum of 60,000 in a single row! So each time you try to construct new features from the DNA sequences, if you are not very careful with your code, it can take hours and consume a lot of memory. Same with hyperparameter optimization, grid search with cross-validation folds, etc.: it takes so long that I have often been forced to shut down the kernel. For those more experienced than me: how do you deal with time and memory constraints in machine learning? I also built a pipeline with grid search, but there was no way for my RAM (18 GB) to support the load without slowing the whole process down.
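One workaround I have been experimenting with is replacing the all-at-once grid search with a plain loop that evaluates one parameter combination at a time and keeps only the best result, so fitted models can be garbage-collected between iterations. Here is a minimal stdlib-only sketch; `param_grid` and `evaluate` are hypothetical stand-ins (a real `evaluate` would fit the model and return a cross-validation score):

```python
import itertools

# Hypothetical parameter grid -- a stand-in for a real model's grid.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}

def evaluate(params):
    """Dummy score function: a real version would fit the model and
    cross-validate. Here it just rewards larger, deeper forests."""
    depth = params["max_depth"] or 10
    return params["n_estimators"] * depth

def lean_grid_search(grid, score_fn):
    """Evaluate one parameter combination at a time, keeping only the
    best score and params instead of every fitted model in memory."""
    best_score, best_params = float("-inf"), None
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

best_params, best_score = lean_grid_search(param_grid, evaluate)
```

It is just the "for loop" view of grid search, but writing it out by hand gives control over what survives each iteration, which matters when a single fitted pipeline already strains the RAM.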
I did improve the model, but basically by just searching for and finding the "best" parameters for the same model shown in the benchmark, so this is not very exciting per se, since 90% of the work was already done (though along the way I learned a lot about scikit-learn methods, which is good). After doing research about DNA and browsing academic papers (I would need more time to read and understand them in depth), I became convinced that the true challenge for improving the score lies not in hyperparameter tuning but much more likely in building better features from the DNA sequences (google "how to count k-mers", for example). Hyperparameter tuning is boring; it's just a for loop (and maybe the number-one reason we need it is that the model is not that good?). Building good features is another story and needs more creative resources. So, as of today, this is where I have failed to improve the model: somewhere in between feature construction and feature selection. Now, my question is: how can we compete with real pros in their own research field (bioengineering here) when they themselves have been fighting to analyze DNA for decades?
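For anyone curious what k-mer counting looks like, here is a minimal sketch with nothing but the standard library: slide a window of width k over each sequence and tally the substrings with a `Counter`. The generator at the end is my attempt at keeping memory under control, yielding one count dict per sequence instead of materializing the whole feature matrix at once:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Slide a window of width k over the sequence and count each
    k-mer in one pass (O(len(seq)) windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_features(sequences, k=3):
    """Yield one k-mer count dict per sequence, so only one row of
    features lives in memory at a time."""
    for seq in sequences:
        yield kmer_counts(seq, k)

counts = kmer_counts("ATGATGA", 3)
# windows: ATG, TGA, GAT, ATG, TGA
```

From there the count dicts can be fed to something like scikit-learn's `DictVectorizer` to get a sparse matrix, which is far lighter than a dense one when k grows and most k-mers never occur.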
Maybe I have turned too pessimistic. I made only 3 submissions, while most competitors made dozens and dozens (the limit is 3 per day), so let's say there is a small lesson in this story: try not to join a competition late!
PS: today is my last free day, so let's try to make at least one more submission :)