Explanation for Kaggle mission - why do we loop through k-values 49 times?


I’m hoping someone can help me with an explanation. I’ve been going through the Kaggle fundamentals missions and they’ve all been very interesting and well explained. However, there is now something I can’t understand.

In the Model Selection And Tuning mission, step 4 (Exploring Different K Values), we are told to use the Python range class to loop through odd values of K for the Titanic data set and see which one is most accurate. My problem is: why do we go up to 49? Is this an arbitrary number, and if I had a completely different data set, would it still make sense to go up to 49? Or is it specific to this data set because, e.g., it has 49 features?

I did run shape to see how many columns all_X has in this instance, and it says 37, so I'm not sure where 49 is coming from!

Thanks for any help

K represents the number of neighbors used in the K-nearest neighbors algorithm.
49 is arbitrary here; it's just meant to give you a range of results to compare. It also wouldn't make sense to have K >= the total number of rows in the data, since you can't have more neighbors than you have data points.
As for the effect of the number of columns on K selection, my intuition is that more columns make the data more sparsely separated, so a larger K reduces overfitting to any single neighbor.
This link demonstrates overfitting in KNN, but also more importantly bias-variance tradeoff: http://scott.fortmann-roe.com/docs/BiasVariance.html
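To make the loop concrete, here is a minimal sketch of the "try many k values" pattern. The mission's actual all_X / all_y data isn't available here, so this uses a synthetic scikit-learn dataset as a stand-in (an assumption); the key point is that range(1, 50, 2) produces the 25 odd numbers 1, 3, ..., 49, and 49 is just a convenient stopping point, not derived from the data.

```python
# Sketch of the k-selection loop, assuming scikit-learn is available.
# Synthetic data stands in for the mission's Titanic all_X / all_y.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# range(1, 50, 2) yields the odd values 1, 3, ..., 49 (25 iterations,
# not 49). Odd k values avoid ties in a binary classification vote.
scores = {}
for k in range(1, 50, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", round(scores[best_k], 3))
```

On a different data set you could pick a different upper bound, as long as k stays below the number of training rows.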


Thank you hanqi, that makes sense :slight_smile: The article is very interesting!