# Weird RMSE with KFold Cross-Validation in Car Prices Guided Project

Sure, a lot could be better. Any learning is welcome. Thanks in advance.
How does one explain that k-fold cross-validation is telling us that `k=1` is optimal for k-NN? It doesn't make sense. Also curious to know whether anyone is working on utilities for automated data cleaning; I'll do a separate post on that.

1 Like

Hi @ananth.ch,

Nice project you’ve got here. I like how you “think out loud” in the comments. And thanks for the link to csvtomd.com. A very handy site to know.

From those comments, I see you have questions about how different `features` and values of `k` affect the model's performance. To answer your question:

Here is a good and thorough article I found on k-neighbors intuition. I believe you will find most if not all the answers in it.

After reading the article, you'll see that the basic idea of `k` in a `KNN` model is to adjust the granularity. In a regressor, as you increase `k`, you take the average of the `k` nearest target values. So when `k=1` is optimal, it means that averaging in more values moves the prediction further from the actual value. That can be a sign that your features are introducing more noise than useful signal.
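To make that concrete, here's a tiny sketch with made-up numbers (not the project's data) of how a `KNeighborsRegressor` averages the `k` nearest targets, and how a noisy point drags the average away as `k` grows:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: price tracks horsepower, except one noisy outlier
X = np.array([[100], [110], [120], [130], [200]])
y = np.array([10_000.0, 11_000.0, 12_000.0, 13_000.0, 30_000.0])

# k=3: the prediction is the mean of the three nearest targets
knn3 = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(knn3.predict([[116]]))  # [12000.] -> mean of 11_000, 12_000, 13_000

# k=5: the outlier is now averaged in, dragging the prediction away
knn5 = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(knn5.predict([[116]]))  # [15200.] -> mean of all five targets
```

With clean features, larger `k` smooths out noise; with noisy data points in the neighborhood, larger `k` averages the noise in.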

Here’s a 3d plot for a model with 2 features to help with the intuition:

For example, in this particular project, imagine an extreme case where we introduce a highly irrelevant feature to the model, say the name of the owner. If you predict the car price based on the `horsepower` and `owner_name` features, then as you increase `k`, you are pulling in more data points matched by an irrelevant feature, introducing more noise from `owner_name`. The way I think of it is: as `horsepower` predicts, `owner_name` deviates. If `owner_name` deviates more than `horsepower` brings the prediction closer to the actual value, it's likely that `k=1` would be optimal in this case.
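Here's a small experiment in that spirit, with synthetic data standing in for the car set and a random column standing in for a hypothetical `owner_name`-style feature: adding the irrelevant column should worsen the cross-validated RMSE.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 300, size=200)
price = 100 * horsepower + rng.normal(0, 500, size=200)

# A random column standing in for an irrelevant feature like owner_name
noise = rng.uniform(0, 1000, size=200)

X_good = horsepower.reshape(-1, 1)
X_noisy = np.column_stack([horsepower, noise])

for name, X in [("horsepower only", X_good), ("+ irrelevant col", X_noisy)]:
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, price,
                             scoring="neg_root_mean_squared_error", cv=5)
    print(name, round(-scores.mean(), 1))
```

Because the noise column has a wider range than `horsepower`, it dominates the distance calculation, so the "nearest" neighbors are essentially random and the RMSE jumps (this is also why feature scaling matters for k-NN).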

Not sure if I explained it clearly enough; the article I mentioned is definitely clearer and more thorough. Anyway, hope this helps!

2 Likes

Great

Hi Vera!
Thank you so much. Did learn something from your post and the reference to Max’s article. I wonder what it takes to get to that level where you can offer such insights.

My OP wasn't clear about my confusion. When we use our own 80:20 split, we see a certain optimal value of k for k-NN (high teens) for those four features, but k-fold cross-validation on the same model tells us that k should be one. How can a simple train/test split and k-fold cross-validation be at such odds with each other? That's what's thrown me off…

For irrelevant features, yes, I completely agree. If the diversity of values in a particular column is too high, I call that an "identifier", not a "feature".

1 Like

@ananth.ch Sorry, I misunderstood your question. Although I think the explanation above will still help in answering the question you actually have…

You probably already know this, but just to show the deducing process:

• `kfold` performs cross-validation, which helps the model generalize, i.e. overfit less.
• When `kfold` is not performed, i.e. when we only use `train_test_split` once, our model is more likely to overfit than with `kfold`.
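One way to see the two effects side by side is to compute RMSE for several values of `k` under both schemes, a single `train_test_split` and 5-fold cross-validation. This is only a sketch on synthetic data, not the project's dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(50, 300, size=(200, 1))
y = 100 * X[:, 0] + rng.normal(0, 2000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

for k in [1, 5, 9, 15]:
    # RMSE from a single 80:20 split
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    split_rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5

    # RMSE averaged over 5 folds
    cv_rmse = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                               scoring="neg_root_mean_squared_error", cv=5).mean()
    print(f"k={k:2d}  split RMSE={split_rmse:8.1f}  cv RMSE={cv_rmse:8.1f}")
```

The cross-validated numbers are usually the more trustworthy estimate, since each depends on five different splits rather than one lucky (or unlucky) one.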

As mentioned in my previous comment:

So my guess is that when we don't use `kfold` cross-validation, we train our model on only one split of the data and it overfits. Increasing the `k` hyperparameter in `knn` then pulls the model away from overfitting and gives a lower RMSE. You see where I'm going with this…

Thank you for a really awesome question btw. This is a very interesting connection I hadn't noticed before: the `k` in a `knn` model tunes granularity, but from a results perspective it tunes overfitting, and so does cross-validation.