Guided Project Link: https://app.dataquest.io/m/155/guided-project%3A-predicting-car-prices/6/next-steps
Solution Notebook: https://github.com/dataquestio/solutions/blob/master/Mission155Solutions.ipynb
Hi Mary,
There is one place in the solution code above that I don’t understand. After calculating the average KNN RMSE (with k_vals = [1, 3, 5, 7, 9]) for each single-feature (univariate) model, we got this result:
horsepower 4219.377860
width 4618.910560
curb-weight 4730.075815
highway-mpg 5069.469256
length 5176.394904
city-mpg 5202.409003
wheel-base 5252.392462
compression-rate 7166.073599
bore 7222.472445
normalized-losses 7624.407151
stroke 8000.240467
peak-rpm 8119.365233
height 8163.346266
But why does your code choose ‘city-mpg’ ahead of ‘highway-mpg’, when ‘highway-mpg’ has the lower RMSE?
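To make the question concrete, here is a minimal sketch that ranks the features by the average RMSE values listed above. The names `avg_rmse`, `ranked` and `top_five` are my own and are not taken from the solution notebook.

```python
# Average RMSE per feature, copied from the list in this post.
avg_rmse = {
    "horsepower": 4219.377860,
    "width": 4618.910560,
    "curb-weight": 4730.075815,
    "highway-mpg": 5069.469256,
    "length": 5176.394904,
    "city-mpg": 5202.409003,
    "wheel-base": 5252.392462,
    "compression-rate": 7166.073599,
    "bore": 7222.472445,
    "normalized-losses": 7624.407151,
    "stroke": 8000.240467,
    "peak-rpm": 8119.365233,
    "height": 8163.346266,
}

# Sort feature names from lowest to highest average RMSE.
ranked = sorted(avg_rmse, key=avg_rmse.get)
top_five = ranked[:5]
print(top_five)
# ['horsepower', 'width', 'curb-weight', 'highway-mpg', 'length']
```

With this ordering, ‘highway-mpg’ is fourth and ‘city-mpg’ is sixth, so I would expect ‘highway-mpg’ to be picked first.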
Thanks in advance!
Hi @arredocana,
Great point! I will send this project for review. Thank you for bringing this up.
Best,
Sahil
Hi @puriaseem,
I just checked this link:
I am able to see the solution. Are you getting an error?
Best,
Sahil
Hi @Mary
Could you please explain why, in the final step of the solution, the top models chosen were 3, 4 and 5? Why was 2 ignored?
Thanks in advance
Hi @glonimi0,
The content team has fixed this issue. Thank you for letting us know about it.
Best,
Sahil
Hi,
I have some questions about the solution file.
'engine-size': {1: 3258.4861059962027, 3: 2840.562805643501, 5: 3238.4628296477176, 7: 3563.086774256415, 9: 3831.8244149840766}, 'horsepower': {1: 4170.054848037801, 3: 4020.8492630885394, 5: 4037.0377131537603, 7: 4353.811860277134, 9: 4515.135617419103},
Why is k set to 5 in the later part that uses the knn_train_test function?
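As a quick check, this is a small sketch that picks the k with the lowest RMSE for each of the two features quoted above. The names `k_rmse` and `best_k` are my own, and the numbers are only the ones shown in this post, not the full output.

```python
# RMSE per k value for two features, copied from the output quoted above.
k_rmse = {
    "engine-size": {
        1: 3258.4861059962027,
        3: 2840.562805643501,
        5: 3238.4628296477176,
        7: 3563.086774256415,
        9: 3831.8244149840766,
    },
    "horsepower": {
        1: 4170.054848037801,
        3: 4020.8492630885394,
        5: 4037.0377131537603,
        7: 4353.811860277134,
        9: 4515.135617419103,
    },
}

# For each feature, keep the k whose RMSE is smallest.
best_k = {feature: min(scores, key=scores.get) for feature, scores in k_rmse.items()}
print(best_k)
# {'engine-size': 3, 'horsepower': 3}
```

For both features the lowest RMSE is at k = 3, not k = 5, which is why I am asking.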
Since the solution does not perform cross-validation, why is the training/test split 50/50 and not, e.g., 80/20?
I am also wondering whether this data set is too small for this type of regression. I have been playing with the parameters, and the result changes drastically depending on the train/test proportion (0.50, 0.70, 0.75, 0.80) or on how the data is cleaned (I removed every row containing a NaN, roughly 10 entries, except in 'normalized-losses', where I estimated the missing values). The optimal k value varies from 3 to 13, and the best number of features (based on RMSE) also fluctuates substantially. Isn’t this model bound to underfit?
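One rough way to see how much the split matters is to compare single hold-out splits with k-fold cross-validation. The sketch below is only an illustration under my own assumptions: it uses synthetic data rather than the car data set, a hand-rolled one-feature KNN rather than a library implementation, and hypothetical helper names (`knn_predict`, `rmse`, `kfold_rmse`).

```python
import math
import random


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points nearest to x (one feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(train_x, train_y, test_x, test_y, k):
    """Root mean squared error of KNN predictions on the test rows."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2
              for x, y in zip(test_x, test_y)]
    return math.sqrt(sum(errors) / len(errors))


def kfold_rmse(xs, ys, k, folds=5):
    """Mean RMSE over the folds; every row is used for testing exactly once."""
    scores = []
    for f in range(folds):
        test_idx = set(range(f, len(xs), folds))
        tr_x = [x for i, x in enumerate(xs) if i not in test_idx]
        tr_y = [y for i, y in enumerate(ys) if i not in test_idx]
        te_x = [xs[i] for i in sorted(test_idx)]
        te_y = [ys[i] for i in sorted(test_idx)]
        scores.append(rmse(tr_x, tr_y, te_x, te_y, k))
    return sum(scores) / folds


# Synthetic stand-in (assumption): 200 rows, price grows with one feature plus noise.
rng = random.Random(1)
xs = [rng.uniform(50, 250) for _ in range(200)]
ys = [100 * x + rng.gauss(0, 3000) for x in xs]

# Single hold-out splits at the proportions mentioned above, with k fixed at 5.
holdout = {}
for frac in (0.50, 0.70, 0.75, 0.80):
    n = int(frac * len(xs))
    holdout[frac] = rmse(xs[:n], ys[:n], xs[n:], ys[n:], 5)

# 5-fold cross-validation over the k values used in the solution.
k_vals = [1, 3, 5, 7, 9]
cv_scores = {k: kfold_rmse(xs, ys, k) for k in k_vals}
best_k = min(cv_scores, key=cv_scores.get)

print(holdout)
print(cv_scores, best_k)
```

Because cross-validation averages over several test folds, I would expect its choice of k to move around less than the choice from any single split, but that is my assumption and not something the solution states.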
Why does the example solution not consider the symboling, num_doors and num_cylinders columns? I thought KNN handles all numerical data (including discrete values). And isn’t it better to simply delete the normalized_losses column, given that about 25% of its values are missing? Thanks for the help.