Solution Notebook - Predicting Car Prices

Guided Project Link: https://app.dataquest.io/m/155/guided-project%3A-predicting-car-prices/6/next-steps

Solution Notebook: https://github.com/dataquestio/solutions/blob/master/Mission155Solutions.ipynb

Hi Mary,
For the solution above, there is one place in your code I don’t understand. After calculating the average RMSE across the k values (k_vals=[1,3,5,7,9]) for each univariate model, we got this result:
horsepower 4219.377860
width 4618.910560
curb-weight 4730.075815
highway-mpg 5069.469256
length 5176.394904
city-mpg 5202.409003
wheel-base 5252.392462
compression-rate 7166.073599
bore 7222.472445
normalized-losses 7624.407151
stroke 8000.240467
peak-rpm 8119.365233
height 8163.346266

But why does your code choose ‘city-mpg’ ahead of ‘highway-mpg’, even though ‘highway-mpg’ has the lower RMSE?
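
To make the ordering explicit, here is a minimal sketch that ranks the features by the average RMSE values posted above. The dictionary name `avg_rmse` and the cut-off of five features are assumptions made for illustration; they are not taken from the solution notebook.

```python
# Average RMSE per feature, copied from the results posted above.
avg_rmse = {
    "horsepower": 4219.377860,
    "width": 4618.910560,
    "curb-weight": 4730.075815,
    "highway-mpg": 5069.469256,
    "length": 5176.394904,
    "city-mpg": 5202.409003,
    "wheel-base": 5252.392462,
    "compression-rate": 7166.073599,
    "bore": 7222.472445,
    "normalized-losses": 7624.407151,
    "stroke": 8000.240467,
    "peak-rpm": 8119.365233,
    "height": 8163.346266,
}

# Sort feature names from lowest (best) to highest (worst) average RMSE.
ranked = sorted(avg_rmse, key=avg_rmse.get)

for rank, feature in enumerate(ranked, start=1):
    print(f"{rank:>2}. {feature:<18} {avg_rmse[feature]:.2f}")

# Illustrative cut-off (an assumption): keep the five best features.
top_features = ranked[:5]
print(top_features)
```

Under this ranking ‘highway-mpg’ comes fourth and ‘city-mpg’ sixth, so a feature list built from the sorted order would include ‘highway-mpg’ first.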

Thanks in advance!

Hi @dogzerg12,

The content team has fixed this issue. Thank you for reporting it to us.

Best,
Sahil

Hi @Sahil,
And why don’t we use the engine-size attribute? It’s also a numerical value.

Hi @arredocana,

Great point! I will send this project for review. Thank you for bringing this up.

Best,
Sahil

Hi @Sahil, the solution notebook link is not showing the solution. Could you please look into it?

Hi @puriaseem,

I just checked this link:

And I am able to see the solution. Are you getting any error?

Best,
Sahil

@Sahil, thanks for the quick response. The link is working now.

Hi @Mary
Could you please explain why, in the final step of the solution, the top models chosen were 3, 4 and 5? Why was 2 ignored?

Thanks in advance

Hi @glonimi0,

The content team has fixed this issue. Thank you for letting us know about it.

Best,
Sahil

Hi,
I have some questions concerning the solution file.

  1. In the part where the optimal k value is calculated, the results favor k = 3, which gives the lowest RMSE for our key features (‘engine-size’ and ‘horsepower’):
'engine-size': {1: 3258.4861059962027,
  3: 2840.562805643501,
  5: 3238.4628296477176,
  7: 3563.086774256415,
  9: 3831.8244149840766},
 'horsepower': {1: 4170.054848037801,
  3: 4020.8492630885394,
  5: 4037.0377131537603,
  7: 4353.811860277134,
  9: 4515.135617419103},

Why is k then set to 5 in the later part that uses the knn_train_test function?

  2. Since the solution does not perform cross-validation, why is the training/test split 50/50 and not, e.g., 80/20?

  3. I am also wondering whether this data set is too small for this type of model. I have been playing with the parameters, and the result changes drastically depending on the train/test proportion (0.50, 0.70, 0.75, 0.80) and on the data cleaning (I removed all rows containing NaN values, roughly 10 entries, except for 'normalized-losses', where I estimated the missing values). The optimal k value varies from 3 to 13, and the best number of features (based on RMSE) also fluctuates substantially. Isn’t this model fated to be underfitted?
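
One common way to reduce the dependence on a single train/test split is k-fold cross-validation: every row is used for testing exactly once, and the RMSE is averaged over the folds before choosing k. The following is a minimal sketch in plain Python, not the solution notebook's code. The helper names `knn_predict` and `kfold_rmse`, the choice of five folds, and the toy data (which is not the cars data set) are all assumptions made for illustration.

```python
import math
import random


def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k training rows closest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def kfold_rmse(X, y, k, n_folds=5, seed=1):
    """Mean RMSE of a k-nearest-neighbours regressor across n_folds held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled row indices into n_folds roughly equal folds.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    fold_rmses = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        sq_errors = [
            (knn_predict(train_X, train_y, X[i], k) - y[i]) ** 2 for i in test_idx
        ]
        fold_rmses.append(math.sqrt(sum(sq_errors) / len(sq_errors)))
    return sum(fold_rmses) / len(fold_rmses)


# Toy data (an assumption, NOT the cars data set): one feature, noisy linear target.
rng = random.Random(0)
X = [[rng.uniform(0, 1)] for _ in range(200)]
y = [10_000 * row[0] + rng.gauss(0, 500) for row in X]

# Cross-validated RMSE for the same k values used in the thread.
cv_scores = {k: kfold_rmse(X, y, k) for k in [1, 3, 5, 7, 9]}
best_k = min(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k:", best_k)
```

Because each score is an average over several folds, the chosen k tends to move around less than it does with a single 50/50 or 80/20 split, although a data set of about 200 rows will still give noisy estimates.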

Why does the example solution not consider the symboling, num_doors and num_cylinders columns? I thought knn handles all numerical data (including discrete values). And isn’t it better to just delete the normalized_losses column, given that about 25% of its values are missing? Thanks for the help.
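
For anyone who wants to experiment with these columns, here is a rough sketch of the two checks involved. It assumes, as in the raw UCI automobile file, that the door and cylinder counts are spelled out as words and that missing entries appear as "?". The helper names, the lookup table and the four sample rows are hypothetical and only illustrate the idea.

```python
# Lookup table for spelled-out counts (an assumption about how the columns are stored).
WORD_TO_INT = {
    "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "eight": 8, "twelve": 12,
}


def word_to_int(value):
    """Return the integer for a spelled-out count, or None if it is missing or unknown."""
    return WORD_TO_INT.get(value)


def missing_fraction(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


# Hypothetical sample rows, not real records from the data set.
rows = [
    {"num_doors": "two", "num_cylinders": "four", "normalized_losses": None},
    {"num_doors": "four", "num_cylinders": "six", "normalized_losses": 164},
    {"num_doors": "?", "num_cylinders": "four", "normalized_losses": 158},
    {"num_doors": "four", "num_cylinders": "eight", "normalized_losses": None},
]

doors = [word_to_int(r["num_doors"]) for r in rows]
cylinders = [word_to_int(r["num_cylinders"]) for r in rows]
losses_missing = missing_fraction([r["normalized_losses"] for r in rows])

print(doors)
print(cylinders)
print(f"normalized_losses missing: {losses_missing:.0%}")
```

Once the counts are integers they can be normalized and passed to the model like any other numeric column. The missing fraction gives a concrete number to weigh when deciding whether to drop a column or fill in its gaps.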
