KNN Univariate Model - Underfit or Overfit - Python

Hi DQ community

Heads-up! This might be the lamest question when it comes to KNN :grimacing:

This is a general query, not specific to a mission, but the most relevant link I could think of is this:
https://app.dataquest.io/m/154/cross-validation/8/bias-variance-tradeoff

The plot below gives the RMSE values for all the features used for univariate modelling of the target column.

Instead of just predicting the target column for the test dataset (blue) and being done with it, I also applied the same modelling to the training dataset (red).

The smaller gaps at the peaks and the larger gaps for the non-peak features made me wonder whether this plot can feed into an overfitting/underfitting discussion about the model, in a general sense rather than in extreme technicality.

What I mean is: just by looking at this plot, should I be like “ohkaaaay… so far so good”, or be like “Uh-Oh… the dataset is as dramatic as I am! :scream: and we need more info / a cleaner training dataset to model future unknown data”?
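
In case it helps to picture it, here is a minimal sketch of the kind of loop behind the plot (not my exact code; `df` and `target` are placeholder names for my dataset and target column):

```python
# Minimal sketch: one univariate KNN model per feature, RMSE on both train and test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

train, test = train_test_split(df, test_size=0.25, random_state=1)

rmse_train, rmse_test = {}, {}
for col in train.columns.drop("target"):
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(train[[col]], train["target"])
    # RMSE on the held-out test set (blue in the plot)
    rmse_test[col] = np.sqrt(
        mean_squared_error(test["target"], knn.predict(test[[col]]))
    )
    # RMSE on the same data the model was fit on (red in the plot)
    rmse_train[col] = np.sqrt(
        mean_squared_error(train["target"], knn.predict(train[[col]]))
    )
```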

Hi @Rucha,

Even though I have finished the whole curriculum, I am not 100% sure about my answer, but hey, let’s see!

First, I need to make sure I understand the graph. It’s the columns on the x-axis, right?

So, every column has its own predictive value. Of course, some columns will cause the model to overfit and others will not.

It’s all about dropping the right ones. The columns with a big gap between the train and test RMSE suggest that there has been some overfitting.
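
To make that concrete, here is one rough way to rank the columns by that gap, assuming the `rmse_train` / `rmse_test` dictionaries from the loop in your post:

```python
import pandas as pd

# Larger positive gaps = test error much worse than train error = likely overfitting.
gaps = pd.Series({col: rmse_test[col] - rmse_train[col] for col in rmse_test})
print(gaps.sort_values(ascending=False))
```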

On the other hand, in the overfitting discussion there is also something to be said about overfitting to the test data. If you want the model you are training to be applicable in the real world, I would argue that scores that look too good can be a sign of overfitting as well.

For example, take the Titanic competition on Kaggle. Some people have really high scores, which is only possible because it is a fixed competition with fixed values. But let’s say (unlikely as it is) they build an exact replica of the Titanic and there is another crash. Will this give us similar results? I would be very interested in who would predict the survivors of the new Titanic best.

One last thing: your data is not that terrible. But KNN is actually very prone to overfitting! Just select the best columns and do not forget to perform cross-validation on the train set :relaxed:
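
For the cross-validation part, something along these lines should work (a sketch; `train`, `"some_column"`, and `"target"` are placeholders):

```python
# Sketch: k-fold cross-validation of a univariate KNN model on the train set only.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)
mse_scores = cross_val_score(
    knn, train[["some_column"]], train["target"],
    scoring="neg_mean_squared_error", cv=5
)
rmse_scores = np.sqrt(-mse_scores)
print(rmse_scores.mean(), rmse_scores.std())
```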


Hi @DavidMiedema,

Apologies for the delayed response. Yeah, I guess I cropped the feature names on the x-axis while taking the snapshot.

After your response I tried this again. It turns out I had some mistakes with the train-test-split part itself. I will test/experiment with a few more things and update this thread in case something interesting turns up!
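
For anyone landing here later, the split itself should look roughly like this (a sketch of the usual pattern, not the buggy code I had):

```python
# Keep the features and target aligned, and fix random_state so the split is reproducible.
from sklearn.model_selection import train_test_split

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
```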