Guided Project: Predicting Car Prices

Hi everyone,

Here is my guided project on predicting car prices using the KNN algorithm.

Instead of iteratively modifying the functions, I added all the potentially useful parameters to them from the beginning and then tuned them. I used both train/test validation and k-fold cross-validation, and for each I built several univariate and multivariate models and estimated the error. The spaghetti plots, which usually look quite scary, were rather insightful this time and made it easy to find the model with the minimum error.
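A minimal sketch of the two validation schemes, assuming synthetic data in place of the actual cars dataset (the feature, prices, and numbers below are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(50, 300, size=(200, 1))            # stand-in for e.g. horsepower
y = X[:, 0] * 100 + rng.normal(0, 500, size=200)   # stand-in for price

# Train/test validation: fit on one half, measure RMSE on the held-out half.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
rmse_holdout = np.sqrt(np.mean((knn.predict(X_test) - y_test) ** 2))

# K-fold cross-validation: average RMSE over 10 folds.
scores = cross_val_score(
    KNeighborsRegressor(n_neighbors=5), X, y,
    scoring="neg_root_mean_squared_error", cv=10,
)
rmse_kfold = -scores.mean()
print(rmse_holdout, rmse_kfold)
```

Repeating this for several values of `n_neighbors` and plotting the RMSE curves gives the spaghetti plots mentioned above.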

Looking forward to receiving your feedback. Please let me know what can be improved in my project. Code efficiency, storytelling flow, correctness of conclusions, any errors or typos - anything you suggest will be of great use to me.

Many thanks in advance!

P.S. I took the cover picture of my project myself while traveling in Chile :grinning: Hope you’ll like it :yum:

https://app.dataquest.io/c/36/m/155/guided-project%3A-predicting-car-prices/3/univariate-model

Predicting Car Prices Using KNN Algorithm.ipynb (630.8 KB)



Looking primo! I’ve got two humble remarks:

  1. Doing this project, I found that every parameter / hyperparameter depends on another one. So say you deduced in step 4 that the best number of neighbors k is 6, and then moved on to step 5 to fiddle with the k-fold knob. I wouldn’t leave the k-neighbors knob fixed at 6 - I would keep testing it while also testing the k-fold value. In the lesson we’re sort of told to leave it fixed and move on (which is easier, and computationally cheaper). In future projects, GridSearchCV does all of that for us and checks every combination from the lists, like I’m describing. Hope that’s clear!
  • So in your last code cell I improved the RMSE value just by lowering the k-neighbors value.
  2. Remember the time I was flooding my notebook with lines of code for styling plots, and looking for a way to reduce that amount of repetitive code?
  • And you told me to use functions for that :slight_smile:
  3. Actually a third, very small one: everyone uses the same colors for plots. I know the visuals of a machine learning project are not the most important part at all, but a few color changes and suddenly your notebook looks different from the 100 other ones the recruiter / client / boss saw today.
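To show what I mean in remark 1 about tuning k and the fold count together, here’s a rough GridSearchCV sketch (synthetic data and illustrative parameter ranges, not the actual cars dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.uniform(50, 300, size=(200, 1))
y = X[:, 0] * 100 + rng.normal(0, 500, size=200)

# GridSearchCV tunes n_neighbors for a fixed fold count;
# to also vary the number of folds, wrap it in an outer loop.
best = {}
for n_folds in (3, 5, 10):
    grid = GridSearchCV(
        KNeighborsRegressor(),
        param_grid={"n_neighbors": range(1, 21)},
        scoring="neg_root_mean_squared_error",
        cv=KFold(n_splits=n_folds, shuffle=True, random_state=1),
    )
    grid.fit(X, y)
    # For each fold count, keep the best k and its average RMSE.
    best[n_folds] = (grid.best_params_["n_neighbors"], -grid.best_score_)
print(best)
```

The best k can turn out different for different fold counts, which is exactly why I wouldn’t fix it early.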

Hi Adam,

Thanks a lot for taking a look at my project and for your valuable feedback!

About the graphs, it’s funny that I first gave you this advice and now you’re returning it to me! :sweat_smile::joy: You’re right about that. I was actually going to introduce a function for all of my line plots, but then I noticed that, because of all those for-loops, I wouldn’t be able to include the plt.plot() call in it as well, and the last plot is also a bit “strange”. But it could still be a good idea to create at least a kind of “plot_decorations” function holding all those titles, labels, etc. In any case, it would help avoid code repetition.
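A minimal sketch of what such a “plot_decorations” helper could look like (the name and styling choices are purely illustrative, not the actual notebook code):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def plot_decorations(ax, title, xlabel, ylabel):
    """Apply the repeated title/label/grid styling in one call.
    (Hypothetical helper; adjust to the notebook's actual styling.)"""
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.grid(alpha=0.3)

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2.1, 1.8, 1.9])  # the for-loop plotting stays outside
plot_decorations(ax, "RMSE vs k", "k neighbors", "RMSE")
```

This keeps the plt.plot() calls (and their for-loops) in each cell while factoring out only the repetitive decoration code.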

About tuning k (that is, the number of neighbors) for different k-folds as well - that’s a cool suggestion too, thank you! And as for using dedicated machine learning tools rather than plain Python loops, I agree: it would make the code much more concise and elegant. I’m indeed planning to return to this project later, after finishing the whole path, and optimize my code.

Oh, the matplotlib default colours are definitely not my favourite either, particularly the first two: blue and orange :see_no_evil: I found out they were chosen with color-blindness accessibility in mind, but they look ugly to me anyway, and I often avoid them, especially when I need only 1-2 colours.

Thanks again, Adam, for all your great suggestions! Everything totally makes sense, really appreciated! :star_struck:
