KNN using SelectKBest for univariate feature selection

Hello, I am sharing my guided project on predicting car prices using KNN. Here are some things that distinguish this project from the original solution:

  • It stratifies the train-test split by labeling each price as “normal” or as a “high” outlier (see the first sketch after this list).
  • It uses SelectKBest from scikit-learn for univariate feature selection.
  • It combines feature selection and hyperparameter optimization in a nested for-loop, then picks the best model from the results (see the second sketch after this list).
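
For context, here is a minimal sketch of the stratified-split idea. The 1.5 × IQR outlier cutoff, the `cars` DataFrame, and the column names are my assumptions, not necessarily what the project uses:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed setup: `cars` is a DataFrame with a numeric "price" column.
q1, q3 = cars["price"].quantile([0.25, 0.75])
iqr = q3 - q1

# Label each row "high" if its price is an upper outlier, else "normal".
# (The 1.5 * IQR rule is an assumption; the project may define outliers differently.)
cars["price_level"] = np.where(cars["price"] > q3 + 1.5 * iqr, "high", "normal")

# Stratifying on the label gives train and test sets the same share of outliers.
train, test = train_test_split(
    cars, test_size=0.2, stratify=cars["price_level"], random_state=1
)
```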
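And here is a sketch of how SelectKBest can be combined with a grid of neighbor counts in a nested loop. The f_regression score function and the search ranges are assumptions on my part:

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Assumed setup: X_train (numeric features) and y_train (prices) already exist.
results = {}
for n_features in range(1, X_train.shape[1] + 1):
    # Keep the n_features columns with the highest univariate scores.
    selector = SelectKBest(score_func=f_regression, k=n_features)
    X_selected = selector.fit_transform(X_train, y_train)
    for n_neighbors in range(1, 26):
        model = KNeighborsRegressor(n_neighbors=n_neighbors)
        # Mean RMSE over 5 folds (scikit-learn reports the score negated).
        rmse = -cross_val_score(
            model, X_selected, y_train,
            scoring="neg_root_mean_squared_error", cv=5,
        ).mean()
        results[(n_features, n_neighbors)] = rmse

# The (feature count, neighbor count) pair with the lowest mean RMSE.
best_n_features, best_n_neighbors = min(results, key=results.get)
```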

I would like feedback on the clarity and accuracy of my explanations. I would also like to know whether you agree or disagree with the final set of features and the k-value that I chose for my final model.

If possible, I would also like to know whether there is a convenient way to do stratified k-fold cross-validation for regression problems. I read about scikit-learn’s StratifiedKFold class, but it seems to work only for classification targets.

By the way, since the project is on my personal website, the code blocks are hidden and only the outputs are visible. You can open the code blocks by clicking the “Show Code” buttons.

Last mission screen URL:
https://app.dataquest.io/c/36/m/155/guided-project%3A-predicting-car-prices/6/next-steps

Link to my project: Predicting Car Prices using the K Nearest Neighbors Algorithm | MG Data Science


Update: Regarding my question about stratified cross-validation, I figured out a way to do it based on this article. To split the dataset into 5 folds, I sort the rows by price and then deal them out round-robin: the first of every 5 rows goes to fold 1, the second to fold 2, and so on. I wrote custom functions to perform this stratification; the full details are in the “K-Fold Cross-Validation” part of the project. This technique lowered my mean RMSE by a few hundred dollars and lowered the standard deviation of the RMSE across folds by over a thousand dollars.
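
Here is a minimal sketch of that fold-assignment logic. The function and variable names are mine, not the project’s:

```python
import numpy as np

def stratified_regression_folds(df, target="price", n_folds=5):
    """Sort rows by the target, then deal them out round-robin
    so every fold spans the full range of prices."""
    sorted_idx = df.sort_values(target).index.to_numpy()
    # Fold i takes rows i, i + n_folds, i + 2 * n_folds, ... of the sorted order.
    return [sorted_idx[start::n_folds] for start in range(n_folds)]

# Usage sketch: hold out each fold in turn and train on the rest.
folds = stratified_regression_folds(cars)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    train, test = cars.loc[train_idx], cars.loc[test_idx]
    # ... fit the KNN model on `train` and compute RMSE on `test` ...
```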