# Guided Project: Car Price prediction with k-nearest neighbors algorithm

Hi community !

After a lot of work, I think the notebook is ready to be published!

I took some liberties with the Dataquest course, so don't be surprised if it doesn't follow the course exactly.

I'm not sure I am fully satisfied, but I did my best. Compared to the course, I:

• use the categorical features contained in the dataset
• use another scoring function: MAPE
• introduce 2 feature selection algorithms
• introduce elbow point detection (if you think the use of the elbow point is not appropriate here, please don't hesitate to leave a comment)

Finally, I realized that I should have used cross-validation to optimize the number of neighbors instead of optimizing the number of folds first, but it was too late to rewrite everything, and now I need to recharge my batteries after so much effort spent fixing details and trying to improve my English writing, pffff…
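For what it's worth, here is a minimal sketch (on synthetic stand-in data, not the actual notebook variables) of the fix I have in mind: cross-validating directly over `n_neighbors` with sklearn's `GridSearchCV`:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data; in the notebook this would be the
# standardized car features and prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X.sum(axis=1) + rng.normal(scale=0.2, size=150)

# Cross-validate directly over the number of neighbors.
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": range(1, 21)},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X, y)

best_k = grid.best_params_["n_neighbors"]
```

This picks the k with the lowest cross-validated RMSE in one step, instead of tuning the folds first.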

The notebook may be confusing at times and a bit hard to digest; sorry in advance if that's the case.

Any feedback is welcome of course, thanks!

Click here to view the jupyter notebook file in a new tab

4 Likes

Hi @WilfriedF,

Great job on the project! I love that you explored outside of the curriculum. I've learned quite a few new concepts reading your project. Special thanks for the resources on elbow detection.

I also admire the diligence you put into the code and style of the project. I don't know if you know this already, but since you are using docstrings for all your functions: further down the course there's actually a section on best practices, and specifically on how to format docstrings in the Google style.

While learning about MAPE, I came across this article that might help in understanding why the MAPE score didn’t behave as expected from the RMSE score. Quote from the article in the link by Agrima Bahl:

MAPE puts a larger penalty on negative errors. What this means is that for the same error, the error is higher when aᵗ < fᵗ than when aᵗ > fᵗ.

For example, for the actual value 100 and estimated value of 90, the MAPE is 0.10. For the same estimated value and actual value of 80, the MAPE is 0.125. Therefore when using MAPE as an objective function, the estimator prefers smaller values and can be biased towards negative errors.
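The asymmetry is easy to verify with a few lines (numbers taken from the example above):

```python
def mape(actual, forecast):
    """Absolute percentage error for a single observation."""
    return abs(actual - forecast) / actual

# Same absolute error of 10, same forecast of 90:
print(mape(100, 90))  # 0.10  (underestimation: actual > forecast)
print(mape(80, 90))   # 0.125 (overestimation: actual < forecast)
```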

I like the Prevalence of binary features plot, it's very informative. It also brought up another point I didn't consider either when doing this project. When one-hot encoding categorical features, there is one thing to keep in mind: the redundant information that comes from the nature of binary features. For instance, say we have a binary gender feature. When one-hot encoding it, it becomes two columns, female with ones and zeros, and male with ones and zeros. But these two columns carry essentially the same information, which can cause problems from perfect collinearity. So in these cases, we should probably drop one of them. Here's what the sklearn doc says on this matter:

drop {'first', 'if_binary'} or an array-like of shape (n_features,), default=None

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

Thanks for sharing your awesome project! I really learned a lot from it.

5 Likes

Hi @veratsien,

That's really great feedback you left me! Thank you for taking the time. I will read the article about MAPE carefully. Finally, someone found something about MAPE, and you are this person! Very interesting indeed: bias toward negative errors. In other words, a penalty when you overestimate the value… I will look at the charts with other eyes now, and maybe some new ideas will arise. The question is: why did Kibler et al. choose this scoring function instead of RMSE, the most popular error function in linear regression? Does it mean that MAPE is more valuable, for some reason I'm unaware of, for price prediction (for example, in finance)?
It also looks like this is not a scoring function used very often. I wanted to use it directly in `cross_val_score` alongside RMSE, but it was not working and I quickly gave up. You will have seen in the notebook why I use MAPE: only for comparison purposes, because I wanted to measure my model against some kind of past benchmark, a good way to check whether I was making the model more valuable or not.

Regarding the binary features, after reading you I didn't understand at first why I should remove some categories, since the number of new columns will always be equal to the number of categorical values… But now I get it: you mean when the attribute has only 2 categorical values, so they mutually exclude each other, right? This is very relevant indeed. I hadn't realized it before. And now looking at the dataset, it's very clear: the engine location, aspiration and fuel type attributes were already binary (doors has a third value, "unknown", because of missing values)! How did I miss that? Pffff… Thank you! I will make the changes you rightly point out later and see if it improves the model.

Regarding docstrings, yes, I did that course in the past, but it's true that I didn't take much care here to apply the usual guidelines perfectly. I should! But you know, in reality docstrings are an extra effort I don't enjoy so much, probably because it's still not a natural habit for me.

Best
Wilfried

2 Likes

@WilfriedF I’m glad my feedback helps.

The way I see it is to go case by case. For instance, if I were the one selling my house and wanted to decide on a listing price based on predictions, I would rather the prediction overestimate the value than underestimate it, so I would probably use RMSE. But if I were putting down an offer based on predictions, I would probably go with MAPE because I don't want to overpay.

This also reminds me of the choice of error metrics for classification problems. In a classification problem, you choose Precision, Recall, the F1 score, etc. case by case, depending on whether it's more important to be generous and identify all possible positives (like credit card fraud detection) or to not get it wrong (like determining what type of tumor this is).
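To make that trade-off concrete, here is a tiny hand-computed example (made-up labels, with 1 as the positive class, e.g. fraud):

```python
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # of the flagged cases, how many were real: 0.75
recall = tp / (tp + fn)     # of the real cases, how many were caught: 0.75
f1 = 2 * precision * recall / (precision + recall)
```

Pushing the threshold to flag more cases raises recall at the expense of precision, and vice versa, which is why the right metric depends on which error is costlier.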

I'm definitely in no place to complain about your docstring formatting, I don't do it at all… But I will try to start forming a better habit, especially for the more complicated functions. A lot of the functions you wrote for feature engineering can actually be reused in almost every project. This reminds me to write my own commonly used functions and just import them in future projects.

I was really inspired by your project, thanks for sharing!

1 Like

Hi @veratsien,

Yes, it's a good example of a real-life application. In a pure learning context, how do you determine whether 2700 is a good or bad RMSE score? Whether a MAPE of 11% is good or bad? Personally, I feel more comfortable with classification problems, because the answer for each observation is "yes" or "no": "yes", you have found a true positive; "no", this observation is not a true positive, etc. Look what happened to me here, maybe you will find it interesting: Be aware with scoring function: a "good" mean accuracy can hide a bad result!. I am definitely fascinated by scoring functions ahah.

It would be awesome if you reuse `forward_selection` and/or `backward_elimination`; let me know if they help to select good features!

One thing to note about the 2 functions: they end like this:

```
if len(list)>0:
    ....
return(knn_train_test(features, target, df), features)
```

But my first inspiration was in reality:

```
if len(list)>0:
    ....
else:
    return(knn_train_test(features, target, df), features)
```

But strangely it doesn't work (more precisely, it works, but returns `features` with the values found during the 2nd iteration). I understand that this is due to the recursion happening under the hood, and I am happy it's finally working without using `else`. But why it works in the first case rather than the second is still pretty mysterious to me! It leads me to think that maybe the functions are not 100% safe.
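One possible clue, sketched on a hypothetical function (not the actual `forward_selection`): in Python, the result of a recursive call is discarded unless it is explicitly returned, so every frame that falls through to a shared `return` hands back its own local state:

```python
def descend(items):
    # Without an `else`, every frame falls through to the same return,
    # so the top-level call returns its OWN `items`, not the deepest one.
    if len(items) > 1:
        descend(items[1:])  # the deeper result is computed, then discarded
    return items

print(descend([3, 2, 1]))  # [3, 2, 1]
```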

1 Like

Hi @WilfriedF,

You brought up a very interesting and important topic. Picking and evaluating error metrics is such an ‘art’ in my opinion. Learning the formula is really just a fundamental first step.

I think it's important to translate the error metric of a model into real-life indicators, in most cases for a regressor, the target unit. Take this project, for example: what does an RMSE score of 2700 mean? Knowing the formula, it means that if we take a new dataset of cars and make price predictions based on our model, the price error for each car will probably average around \$2700 higher or lower. Now we can put that in the context of car selling/buying, or another possible context depending on the reality.

I think for learning projects like this one, finding a benchmark like you did was also a great solution. I have wondered the same thing, whether 2700 is a good or bad RMSE score. What I did was really just go through the projects shared in the community and check out other people's scores.

You've probably learned a lot already from what you encountered in the post above, but I still want to share this with you, if you haven't done Andrew Ng's Machine Learning course on Coursera: there's a section on Error Metrics for Skewed Classes. Actually, since you are fascinated by scoring functions, the whole of week 6 of the course might interest you.

It took me a while to figure out your `forward_selection` function… It's definitely a bit of a brain-teaser. I apologize in advance if I've understood it wrong. From my understanding, the logic of your function is: first iteration --> it starts by looping through all individual features `all_features`, picks out the ones that have an RMSE lower than `min_rmse`, and appends only the last one (the one with the lowest RMSE) to the `features` list as the starting `features` for the next iteration. All following iterations --> it then loops through the remaining features, individually appended to `features`, picks out the combinations that return an RMSE lower than `min_rmse`, and keeps only the last combination as `features` for the next iteration. Whenever `len(selected_features) == 0`, the function exits and returns the RMSE score and features.

• My first question is: in a scenario where, in the very first iteration, no individual feature returns an RMSE lower than `min_rmse`, i.e. when `len(selected_features) == 0`, wouldn't that return `features` as an empty list right away? Whereas what should actually be returned are the features that produced the initial `min_rmse`; in your case in the project, all the features.
• Then the logic of adding only the last of `selected_features` could be simplified to just assigning `updated_features[i]` to `selected_feature` without an `s`, don't you think?

Your `forward_selection` function definitely keeps me thinking about how effective it is at finding the best combinations of features. Especially when the feature list gets longer, I think it's possible that the best 5-feature list isn't the best one to start from when searching for the best 6-feature list… I am definitely intrigued by your work, will test the functions out in future projects, and will let you know the results.

Sorry for a really long reply… it’s just that you got me thinking.

1 Like

Well, another amazing reply from you! Together we are deepening the subject, this is great.

For the `forward_selection` and `backward_elimination` functions, the idea comes from Learning Strategies for Opponent Modeling in Poker, a paper I read a few years ago. Opening it again now to take a screenshot, I have just realized that the authors also talk about another feature selection algorithm that I had totally missed in previous readings, called RELIEF-F, and it piqued my curiosity ahah. Dataquest should organize small implementation challenges, it would be fun!

Here's the screenshot:

The interesting thing (and the limitation of forward selection) is that while 6-14 is necessarily better than 12-14, nothing guarantees that 6-14 is better than, let's say, 6-12. Since the best univariate model here is 14, with 6 probably being the 2nd best univariate model, the choice of 14 made during the first step determines the whole remaining process. But searching through all possible combinations would lead here to 2^16 iterations, so it's a tradeoff.

Reading your answer, I think you understood it perfectly. In general terms, for `forward_selection`:

• `all_features` is constant
• `l` decreases by 1 at each iteration
• `updated_features` decreases by 1 item at each iteration
• `features` increases by 1 item at each iteration
• you prefer to start with an empty set of features, but it would work with any number of features as a starting point. If you want to force it to start with the "horsepower" feature, for example, you can.

Ahah, you are formidable, you're right: in the case of the project, since I used the full set of features to initialize `min_rmse`, if the function returns an empty list, the optimal solution given by the function would be the full feature set. But I think you will agree this is a very special case. It would just mean that the function didn't improve anything, and the user would know what to do, I think, since they always know their starting point. Or maybe I should write some kind of warning for this special case? Note that I could have initialized the function with any random `min_rmse`, or even `standardized_cars.price.max()`, since an error that large is very unlikely to occur.

I was not sure I followed you here, but once again I got it after some thinking time. Do you mean that instead of:

```
selected_features = []
for i in range(0,l):
```

Prefer:

```
selected_feature = ""
for i in range(0,l):
```

And go on with that variable inside the loop? Indeed, it's a welcome simplification, though it looks a bit unusual to me to overwrite strings. But thanks, you are definitely very attentive to detail!

Maybe the ugly part of the code is the `features.append`/`features.remove` pair; it's a bit tricky, but short of creating a copy of `features` and working with that copy inside the loop, I don't see any better solution (though I am certain there are several).

Unfortunately, the link sends to a 404 page!

A long reply too… But it was inevitable!

2 Likes

Exactly what I was wondering. I can settle for the trade-off explanation. Thanks for digging up the paper and sharing it. I skimmed it but can't promise to read it through.

I understand you have generalized the function for different use cases. My questions about your `forward_selection` function are more personal preferences, I guess. I would want to include the special cases and, like you said, maybe return a relevant message.

Inspired by your comment, I took the liberty to try and see if I can find another solution to achieve the same thing as your function without the `append()` and `remove()` methods. Here’s my solution:

```
def forward_selection(features, target, df, min_rmse):
    all_features = df.drop(target, axis=1, inplace=False).columns
    updated_features = [feature for feature in all_features if feature not in features]
    l = len(updated_features)

    # Create a list of all combos of adding each remaining feature to previous features
    feature_combos = [feature.append(updated_features[i]) for i in range(0, l)]

    # Create a dictionary with feature combos and their rmse score
    feature_rmse = {feature_combo: knn_train_test(features_combo, target, df) for feature_combo in feature_combos}

    # Find the minimum value in feature_rmse dictionary
    min_feature_rmse = min(feature_rmse)

    if min_feature_rmse < min_rmse:
        min_rmse = min_feature_rmse

    # Return the key of the minimum value in feature_rmse dictionary as the new features list
    features = min(feature_rmse, key=feature_rmse.get)

    # Rinse & repeat
    forward_selection(features, target, df, min_rmse)

    return (knn_train_test(features, target, df), features)
```

I think it does the same thing… Please let me know if there's a bug or something.

1 Like

Hi @veratsien,

Your function has `s` issues:

`features.append(updated_features[i]) for i in range(0, l)]` (`features` with an `s`)
`feature_rmse = {feature_combo: knn_train_test(feature_combo, target, df) for feature_combo in feature_combos}` (`feature_combo`, not `features_combo`)

But even after these changes, I get a `None` error.

I cannot test it very comfortably because I'm actually having nightmare `charmap` codec errors in my notebooks (nbconvert bugs, I believe).

But wait:

```
a_features = standardized_cars.columns[:-1].tolist()
l = len(a_features)
features = []
a = [features.append(a_features[i]) for i in range(0, l)]
a
```

Output:

```
[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None, ...]
```

Look, it's not able to `append` during a list comprehension? After the execution, `features` now contains all the features, but `a` only has `None` values.
Even if it were working, since you are changing the content of `features` on the fly, each new list added to `a` would have 1 more item (first list: 1 item, second list: 2 items, etc.; but we want l lists each containing 1 item during the first iteration), don't you think?

1 Like

@WilfriedF You are absolutely right about changing `features` on the fly causing a problem. Testing the code myself, there are so many obvious bugs in it, like lists can't be used as keys for dictionaries. Duh. And `list.append()` changes the list in place and doesn't return anything, which explains the `None` output. Also, the code to find the minimum value in a dictionary is not correct; it only returns the minimum key… The whole thing was just a big brain fart. I don't know what I was thinking.

I gave it another try and wrote something a bit different. The basic logic is to find the minimum rmse in every batch, then return the true minimum among them:

```
def forward_selection_alt(features, target, df, min_rmse, results):
    all_features = df.drop(target, axis=1).columns
    updated_features = [feature for feature in all_features if feature not in features]
    l = len(updated_features)

    if l > 0:
        # Initialize a dictionary with feature combos and their rmse score
        feature_rmse = dict()

        for i in range(0, l):
            # Get the base features for combo
            combo = [feature for feature in features]

            # Append each feature in updated_features to combo
            combo.append(updated_features[i])

            # Assign rmse of the combo to feature_rmse dictionary
            rmse = knn_train_test(combo, target, df)
            feature_rmse[tuple(combo)] = rmse

        # Get the item with the minimum value in feature_rmse dictionary
        min_feature_rmse = min(feature_rmse.items(),
                               key=lambda item: item[1])

        min_combo_rmse = min_feature_rmse[1]
        min_combo = min_feature_rmse[0]
        # print('batch best features: ', min_combo)
        # print('batch minimum rmse: ', min_combo_rmse, ' | ',
        #       'true minimum: ', min_rmse)
        # print('\n')
        results[min_combo] = min_combo_rmse

        if min_combo_rmse < min_rmse:
            min_rmse = min_combo_rmse

        # Use the key of the minimum value in feature_rmse dictionary as the new features list
        features = list(min_combo)

        # Rinse & repeat
        forward_selection_alt(features, target, df, min_rmse, results)

    best_result = min(results.items(),
                      key=lambda item: item[1])

    return (list(best_result[0]), best_result[1])
```

I had to convert the `combo` list to a `tuple` in the process and convert it back to a list for iteration, so it's not very elegant. The print output is too long to post here, but you would see the 'true minimum' curve, and it's sort of like finding the elbow.

In the process of testing, I found out that your function has a restriction on the initial `min_rmse` value. If you start with a value that's 'too small', it returns this error message: `ValueError: at least one array or dtype is required`, which is caused by what we discussed before:

The `features` list is empty, but the `knn_train_test` function needs `at least one array or dtype`.

I created a notebook for testing this function, with yours included in it. I will upload it below; you can check it out and play with it if you like:

I don’t know about you but my brain hurts a little. Jokes aside, this is definitely a fun coding challenge. And of course, it has been a fun discussion with you.

1 Like

Wow nice debug!

I got it: this is because the output is:

```
return(knn_train_test(features, target, df), features)
```

So indeed, with a very small `min_rmse`, `features` will be empty and it will produce an error. Well done! I think I misunderstood you the first time you pointed this out. Luckily, the fix is easy: just add a line `if len(features)>0:` etc.

I haven't analyzed your function yet, so my brain is not hurting so much for the moment, but it surely will when I look into it ahah! Give me some time and I'll come back.

1 Like

Hey @veratsien, I am back!

Ok, so I think I understand it perfectly now.
First, thanks again for your extra efforts, and also for this line:
`if l > 0:`
Very important too!

During the process, I quickly concluded that your function was doing the same thing as mine, so I was wondering: why is my version not able to find a result below 2700?
And then I understood: mine is more restrictive, since I make `rmse < min_rmse` an imperative condition for going on to the next iteration.
Your version is not doing exactly the same thing: you take the minimum at each iteration and compare it with `min_rmse`, but `rmse < min_rmse` is not an imperative condition for continuing the iteration. So, if I am not wrong, you will always make `len(all_features)` iterations (if you start with an empty list), whatever happens, and then return the best score. In that sense, your function explores the feature space more in depth!
If you go back to the screenshot in a previous post, the accuracy score always improves during forward iteration. That may be a coincidence, but this is why I thought I should stop if I wasn't improving any more. But I think your interpretation is perfectly valid too, so well done!

On the other hand, I am not sure that you have simplified my version, but this is a detail, and it's fine if you feel more comfortable with dictionary comprehensions (I confess I am not so much).

Just a quick comment here:
`combo = [feature for feature in features]`
Maybe a copy would be faster, no? Or did you deliberately do it that way because you know it's the best way?

To conclude, this is proof that when we look at something more in depth, there will always be interesting findings along the road.

Bravo!

Wilfried

1 Like

@WilfriedF I could not agree more with your conclusion.

You are absolutely right. My function basically exhausts all features in a specific order (the best feature combo in every iteration).

I think your function is more efficient in finding the ‘ballpark’, and mine goes a bit further.

I definitely did not simplify your code; it's more a new function built on your idea. I tried using `features.copy()`, but it somehow points `features` to the same value as `combo` in the second and later iterations, and again, `features` changes on the fly. That's why I used `combo = [feature for feature in features]` as an alternative. I'm not sure why it's happening… I will definitely look into it after my brain gets some rest.
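In the meantime, a quick sanity check in plain Python (no project code involved) suggests the usual shallow-copy idioms all behave the same, so the culprit is probably elsewhere:

```python
features = ["horsepower", "width"]

# The comprehension, list.copy(), and slicing should all produce an
# independent shallow copy of a list of strings.
for make_copy in (lambda f: [x for x in f], list.copy, lambda f: f[:]):
    combo = make_copy(features)
    combo.append("curb-weight")
    assert combo is not features                 # a new list object each time
    assert features == ["horsepower", "width"]   # original left untouched

print("all three copies are independent")
```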

1 Like

Hi @veratsien,

Some news here: I got a new idea, let's call it an "ensemble model". We just take the last dataframe in the notebook, `predictions_df`, based on 11 different feature sets, remove the 2 worst series commented on earlier in the project, and then merge them using `mean` or `median`. And we reach a new minimum for both MAPE and RMSE! The idea came from some conferences I watched about meteorology: to improve their model, they stress it by slightly changing the initial conditions, thereby obtaining several simulations instead of only one, so they can plot together the main prediction from the main model and probabilistic boxplots from the derived simulations.

```
predictions_df = predictions_df.drop(columns=["feature_sets5", "feature_sets7"])
plot_pred(standardized_cars["price"], predictions_df.median(axis=1), "Ensemble model")
```

It's not probabilistic here, since I merged all the models, but it would be perfectly possible to build a final result with a range of predicted prices and their corresponding probabilities.
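Here is a rough sketch of that idea on made-up model outputs (the notebook's real `predictions_df` columns would replace the random matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
# 11 models x 5 cars of hypothetical price predictions
preds = rng.normal(loc=13000, scale=800, size=(11, 5))

point = np.median(preds, axis=0)                    # main prediction per car
low, high = np.percentile(preds, [10, 90], axis=0)  # 80% spread per car

for car, (p, lo, hi) in enumerate(zip(point, low, high)):
    print(f"car {car}: {p:.0f} (range {lo:.0f} to {hi:.0f})")
```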

Well, still not interested in backward elimination or RELIEF-F? I looked quickly, and material about RELIEF-F can be found easily… For my part, with work for a living plus Christmas coming, etc., the period is not very favorable for working hard on these things, but RELIEF-F stays in a corner of my head.

Best,
W.

1 Like

Hi @WilfriedF,

Congratulations on an improved model! This reminds me of the `ensemble` class in sklearn. All quotes below are from the sklearn page in the link:

In averaging methods , the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.

Great job on coming up with the idea yourself! If you are interested, in the same ‘ensemble’ methodology, there’s another angle:

By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

I haven’t used boosting methods myself, but I think this is a great project to try out both and compare.

I just did a quick search too; it's definitely interesting. It seems to me that the Relief-F algorithm can be categorized under dimensionality reduction methods, and that's another big topic.

I’m in the same place as you with Christmas around the corner. An early Merry Christmas to you!

1 Like

Thank you @veratsien! Unfortunately, I cannot deposit my copyright ahah.

After more thought, I realized that by combining 12 models (the project ends with 12 models, not 11 as I wrongly said before), we made a kind of dimensionality reduction indeed. It's like having a new dataset with 12 attributes. So basically we can apply any kind of machine learning technique to it, and even feature selection again! And since we now have only 12 attributes, it is not so time- and memory-consuming to try each one of the `2^12` feature set combinations (well, I should subtract `1`, since the vector with only 0 values would result in a training set with only 0 values). This is what I did: cross-validation again, with default parameters, for each combination!

The only reservation I would have: maybe it's too much and it will increase overfitting bias? Or are we definitely safe from this point of view?

So I used `itertools` and `2^12` binary vectors this time to select the "features"; here's the "raw" code:

```
%%time
import itertools
num_cols = len(predictions_df.columns)

# create the 2**num_cols vectors
lst = list(itertools.product([0, 1], repeat=num_cols))
pure_strategies = np.zeros(shape=(2**num_cols, num_cols))
for i, val in enumerate(lst):
    pure_strategies[i][0:num_cols] = lst[i][0:num_cols]

# cross validation on each combination of features
results = np.zeros(shape=(2**num_cols,))
for i in range(0, len(pure_strategies)):
    strat = pure_strategies[i]*predictions_df
    strat["price"] = standardized_cars["price"]
    pred = cross_validation_predictions(strat.columns[:-1], "price", strat, k=5, n_splits=5)
    m = mean_absolute_percentage_error(standardized_cars["price"], pred)
    results[i] += m

# check the best result
n = np.argmin(results)
strat = pure_strategies[n]*predictions_df
print(pure_strategies[n])
```

We end up selecting only 6 of the 12 models, and it took only 1 minute to compute all the combinations. Final result:

Bingo! A big drop for both MAPE and RMSE, and now the model does better than Kibler et al. (1989)!

1 Like

@WilfriedF I really admire you for going the extra mile.

Your concern about overfitting isn’t baseless. If I understand you correctly, you built a new model based on the predictions from the 12 models and used the prediction as features, then used cross-validation to make new predictions?

I apologize in advance if I didn’t understand you correctly. My main question with this strategy is, it doesn’t seem to produce a generalized model. Imagine using this strategy on unseen data to predict prices. Again, you get 12 sets of predictions from 12 models, but since there’s no ‘right answer’ to compare with, how do you know which model combinations to pick from to form the final predictions?

1 Like

You have perfectly understood @veratsien

I go the extra mile because you inspire me by being a super interlocutor!
Also, I remember being very impressed by one of your previous projects, with strong mathematics: Build a neural network model from scratch.
And I'm waiting for the day you tell us that you finally found the way to finish that business!

Here I am not sure I follow you: what concern do you see with this method that, for example, would not be a concern with the averaging method? I can answer that here you end up with a vector indicating which combination returns the best MAPE, exactly: [1,1,0,1,0,1,0,0,1,0,1,0], that is, only 6 feature sets, which we can reuse on unseen data, don't you think? On unseen data, my idea is simply to reuse this vector to filter `predictions_df`, which stores the predictions of the 12 models. Or, if you prefer, you can also recalculate the best vector each time for the new unseen data. But maybe you are pointing at something else that I misunderstand?

1 Like

@WilfriedF I’m really glad we help each other learn by digging deep in our discussions. And thank you for your compliment on the neural network project. I forced myself to let go at the time cause I needed to step away from it and go back with fresh eyes. I will definitely go back and ‘finish the business’ one way or another.

So first, this is not possible, because the unseen data wouldn't contain the target column at all; we would be making predictions of it, and therefore we wouldn't be able to calculate the 'best vector'. There's nothing to compare with.

I think my problem is really with using predictions, i.e. the target column, as features to pick the best target (here, price) predictions. Intuitively it just feels like it would lead to overfitting. If you think about it, you are using the predictions from your 12 models as features, feeding them to a static function, either the average or the median, and picking the best RMSE based on that. There is no training involved, because the function (average or median) is static; there are no hyperparameters. And thus, if you use the same 6 models as features to compute the final predictions, you are relying on the specific relations in this specific dataset. I know I'm probably not doing a very good job explaining my thoughts here, but it's kind of subtle.

One thing you can do to check whether it's really overfitting is to compute the learning curve, although I'm not sure it will work on a KNN model, since it's not one of those algorithms that learn gradually. Anyway, here's a good article on learning curves.
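In case it helps, here is a minimal sketch with sklearn's `learning_curve` on synthetic data (the real project would pass the car features and prices instead):

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Error at increasing training-set sizes; a persistent gap between the
# train and validation curves is the usual overfitting signal.
sizes, train_scores, val_scores = learning_curve(
    KNeighborsRegressor(n_neighbors=5), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
    scoring="neg_root_mean_squared_error",
)

train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)
```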

1 Like