Weird RMSE with KFold Cross-Validation in the Car Prices Guided Project

Sure, a lot could be better. Any learning is welcome. Thanks in advance.
How does one explain that k-fold cross-validation is telling us that k=1 is optimal for k-NN? That doesn’t make sense to me. I’m also curious to know if anyone is working on utilities for automated data cleaning; I’ll do a separate post on that.

car_prices.ipynb (352.0 KB)


My plot is similar to what @veratsien got: https://community.dataquest.io/t/trying-out-plotly-and-its-slider-for-the-first-time-with-guided-project-predicting-car-prices/546208


Hi @ananth.ch,

Nice project you’ve got here. I like how you “think out loud” in the comments. :wink: And thanks for the link to csvtomd.com; that’s a very handy site to know.

From those comments, I see you have questions about how different features and values of k affect model performance. To answer your question:

Here is a good and thorough article I found on k-nearest-neighbors intuition. I believe you will find most, if not all, of the answers in it.

After reading the article, you will see that the basic idea of k in a KNN model is to adjust granularity. In a regressor, as you increase k, the prediction becomes the average of the k nearest target values. So when k=1 is optimal, it means that averaging over more neighbors pushes the prediction further away from the actual value. That can be a sign that your features are introducing more noise than useful signal.
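
To make that concrete, here is a minimal, self-contained sketch (synthetic stand-in data, not the project’s dataset, so the feature and target names are just placeholders) of how the RMSE of a `KNeighborsRegressor` moves as k grows:

```python
# Sketch only: synthetic "horsepower" -> "price" data stands in for the real features.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
horsepower = rng.uniform(50, 250, size=200)           # one informative feature
price = 120 * horsepower + rng.normal(0, 2000, 200)   # target with some noise
X = horsepower.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=1)

# The prediction for each test row is the average of the k nearest training prices,
# so RMSE changes as k (the granularity) grows.
for k in (1, 3, 5, 9, 15):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, knn.predict(X_test)) ** 0.5
    print(f"k={k:>2}  RMSE={rmse:,.0f}")
```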

Here’s a 3d plot for a model with 2 features to help with the intuition:

For example, in this particular project, imagine an extreme case where we introduce a highly irrelevant feature to the model, say the name of the owner. If you predict the car price based on horsepower and owner_name, then as you increase k you are averaging over more rows that were selected partly by an irrelevant feature, so owner_name introduces more noise. The way I think of it is: as horsepower pulls the prediction closer, owner_name pulls it away. If owner_name pulls the prediction away more than horsepower brings it closer to the actual value, it’s likely that k=1 would be optimal in this case. A rough experiment along these lines is sketched below.
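
Here is a hedged sketch of that experiment, again on synthetic data with hypothetical column names (`horsepower` and `owner_id` stand in for the real and irrelevant features). The exact numbers will vary from run to run; the point is only how to compare the RMSE-vs-k picture with and without the noisy column:

```python
# Sketch only: compare RMSE across k with and without an irrelevant feature.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
horsepower = rng.uniform(50, 250, size=300)            # informative feature
owner_id = rng.uniform(0, 250, size=300)               # irrelevant feature on a similar scale
price = 120 * horsepower + rng.normal(0, 2000, 300)

def rmse_by_k(X, y):
    """RMSE on one hold-out set for a few values of k."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)
    return {k: round(mean_squared_error(
                y_te,
                KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr).predict(X_te)) ** 0.5)
            for k in (1, 3, 5, 9, 15)}

print("horsepower only:       ", rmse_by_k(horsepower.reshape(-1, 1), price))
print("horsepower + owner_id: ", rmse_by_k(np.column_stack([horsepower, owner_id]), price))
```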

I’m not sure if I explained it clearly enough; the article I mentioned is definitely clearer and more thorough. Anyway, hope this helps!


Hi @ananth.ch,

I am curious about this :arrow_down: Do you have any references to share? Thanks.

Great

Hi Vera!
Thank you so much. I did learn something from your post and from the reference to Max’s article. I wonder what it takes to get to the level where you can offer such insights.

My OP wasn’t clear about my confusion. When we use our own 80:20 split, we see a certain optimal value of k (high teens) for those four features, but k-fold cross-validation tells us that k should be 1 for the same model. How can a simple train/test split and k-fold cross-validation be at such odds with each other? That’s what’s thrown me off…

For irrelevant features, yes, I completely agree. If the diversity of samples in a particular column is too high, I call that an “identifier”, not a “feature” :wink:


@ananth.ch Sorry, I misunderstood your question. I think the explanation above will still help in answering the question you actually have…

You probably already know this, but just to lay out the reasoning:

  • kfold performs cross-validation, which helps with generalization: averaging performance over several folds means the result is less tied to one lucky or unlucky split, so it rewards overfitting less.
  • When kfold is not performed, i.e. when we only use train_test_split once, our evaluation depends on a single hold-out set, so the model (and the hyperparameters we pick from it) is more likely to overfit than with kfold.

As mentioned in my previous comment:

So my guess is that when we don’t use k-fold cross-validation, we train and evaluate the model on just one split of the data, and the model overfits that split. As the k hyperparameter in KNN increases, it smooths the model away from that overfitting and gives a lower RMSE on the single hold-out set. With k-fold cross-validation, the averaging over folds already curbs the overfitting, so a smaller k (even k=1) can come out on top. You see where I’m going with this…
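
Here is a hedged sketch of that comparison on synthetic stand-in data (the notebook would use its own cleaned feature columns instead): the best k chosen from a single 80:20 split versus the best k when RMSE is averaged over KFold folds. Which k wins depends on the data; the sketch just shows how to line the two up:

```python
# Sketch only: best k from one 80:20 split vs. best k from 10-fold cross-validation.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))                     # four stand-in features
y = X @ np.array([40.0, 25.0, 10.0, 5.0]) + rng.normal(0, 3, 300)

ks = list(range(1, 26))

# Single 80:20 split: RMSE measured on one fixed hold-out set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
split_rmse = [mean_squared_error(
    y_te, KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr).predict(X_te)) ** 0.5
    for k in ks]

# KFold cross-validation: RMSE averaged over 10 different hold-out folds.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
cv_rmse = [np.mean(np.sqrt(-cross_val_score(
    KNeighborsRegressor(n_neighbors=k), X, y,
    scoring="neg_mean_squared_error", cv=kf)))
    for k in ks]

print("best k, single split:", ks[int(np.argmin(split_rmse))])
print("best k, 10-fold CV:  ", ks[int(np.argmin(cv_rmse))])
```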

Thank you for a really awesome question, by the way. This is a very interesting connection I hadn’t noticed before: the k in a KNN model tunes granularity, and from a results perspective it tunes overfitting, just as cross-validation does.