Guided Project: Predicting Car Prices (KNN)

Hi there,

Thanks for reviewing my project. I’d like to ask whether my conclusion is correct and valid. My result suggests these optimal values:
a) k-nearest neighbors: 1
b) number of folds: 6
c) n-best features: 4

Questions:

  1. When I was doing cross-validation, I held all other variables constant and varied one at a time (so the result is a single dictionary), instead of writing a double loop within a function (which would produce a dictionary within a dictionary). Is this approach correct? (A sketch of both loop structures is below, after this list.)

  2. When my optimal number of nearest neighbors is 1 (lowest RMSE), what does that mean if, say, someone hands me a car whose features have extreme values? (See the toy example after this list.)
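
To make question 1 concrete, here is a minimal sketch of the two loop structures I mean. It assumes sklearn’s `KNeighborsRegressor`, `KFold`, and `cross_val_score`, with synthetic stand-in data in place of the actual cars dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data (the real project uses the cars dataset's numeric columns)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                      # 4 hypothetical features
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

def avg_rmse(k, n_folds):
    """Average cross-validated RMSE for a given neighbor/fold setting."""
    model = KNeighborsRegressor(n_neighbors=k)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=1)
    mses = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kf)
    return np.sqrt(np.abs(mses)).mean()

# Approach 1: hold the fold count constant, vary k -> one flat dictionary
flat = {k: avg_rmse(k, n_folds=6) for k in range(1, 10)}

# Approach 2: double loop over folds and k -> dictionary within a dictionary
nested = {n: {k: avg_rmse(k, n) for k in range(1, 10)} for n in range(2, 7)}
```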
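And for question 2, a toy example (again assuming sklearn’s `KNeighborsRegressor`, with made-up numbers) of what happens when k = 1 meets an extreme input:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: one feature (say, horsepower) and prices in dollars
X_train = np.array([[60], [90], [120], [150]])
y_train = np.array([8000, 12000, 16000, 20000])

knn1 = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)
knn3 = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

extreme = np.array([[500]])       # far outside the training range
print(knn1.predict(extreme))      # [20000.] copies the single nearest car's price
print(knn3.predict(extreme))      # [16000.] averages the 3 nearest prices instead
```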

Link to my project:
https://colab.research.google.com/drive/1XE7RWo2Jb_55DiXfUszTLiREoqB93jUs?usp=sharing

Link to the instructions:
https://app.dataquest.io/m/155/guided-project%3A-predicting-car-prices/6/next-steps

Thank you

Hello there, thank you for sharing your project. A question I have is why you chose a 50/50 split in your univariate and multivariate models instead of 75/25. I may have misunderstood, but I thought the 50/50 split is used for k-fold cross-validation, while 75/25 is used for the ‘training and simple validation process’ referred to in the instructions of the project. For the last bit of analysis with k-fold validation, I suspect an even split is generally required, so presumably you kept the even split of the data so the analysis is relatively consistent throughout the project?

Hi @KostasM, thanks for reviewing my project. That’s a great question. My thinking was: in my k-fold validation, the number of folds starts from 2, and with 2 folds each pass is a 50/50 split (equivalent to holdout validation). I also reviewed the solution and projects done by other members, and they used a 50/50 split. In line with that, I tried to keep it consistent throughout.
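
For example, a quick check with sklearn’s `KFold` (an assumption; the project can also split manually) shows that 2 folds produce two 50/50 train/test passes:

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=2, shuffle=True, random_state=1)
for train_idx, test_idx in kf.split(list(range(10))):
    print(len(train_idx), len(test_idx))  # prints "5 5" twice: a 50/50 split
```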

More training data generally means better accuracy, and since each k-fold model trains on a larger share of the data than a 50/50 holdout, the error from k-fold validation will generally be lower than that of train-and-test. After all, we reach the conclusion based on the optimal values (number of neighbors, folds, features) from k-fold validation; I only looked back to compare whether the values were the same. Sometimes I find the initial train-and-test analysis redundant.
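
As a sketch of the comparison I mean, assuming sklearn and synthetic stand-in data rather than the actual cars dataset:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                      # stand-in features
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = KNeighborsRegressor(n_neighbors=1)

# Single 50/50 train-and-test run
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)
holdout_rmse = np.sqrt(mean_squared_error(y_te, model.fit(X_tr, y_tr).predict(X_te)))

# 6-fold cross-validation: six runs, each training on ~83% of the data
kf = KFold(n_splits=6, shuffle=True, random_state=1)
mses = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kf)
cv_rmse = np.sqrt(np.abs(mses)).mean()

print(f"holdout RMSE: {holdout_rmse:.3f}, 6-fold average RMSE: {cv_rmse:.3f}")
```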

In the near future, when I want to predict a new observation, I’d simply plug the optimal values into the model. If the optimal number of folds is 4, that corresponds to a 75/25 split; if it’s 6, roughly an 83/17 split. Now it makes sense to manually split the data. I hope I am right. XD
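
The fold-to-split arithmetic is just (n - 1)/n of the data going to training, as this quick check shows:

```python
# With n folds, each model trains on (n - 1)/n of the data
for n in (2, 4, 6):
    train = (n - 1) / n * 100
    print(f"{n} folds -> {train:.0f}/{100 - train:.0f} train/test")
# 2 folds -> 50/50, 4 folds -> 75/25, 6 folds -> 83/17
```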