Guided Project: ML-Predicting Car Prices

Hello guys,
I’ve finished the guided project for Predicting Car Prices. I welcome any comments and suggestions on areas that need improvement.

Here’s the link to the project:

ML±+Predicting+Car+Prices.ipynb (269.8 KB)


Interesting, by changing just the random seed you’ve improved your results, nice one! Surprised they haven’t shown us how to turn that knob in the lesson.

Anyway, that gave me the idea to tweak your function:

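import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

def knn_train_test_univariate_v2(df, feature_col, target_col, n):
    # n is the random seed, so each value of n gives a reproducible split
    np.random.seed(n)
    # shuffle the rows, then split 50/50 into train and test sets
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index=shuffled_index)
    split_loc = int(0.5 * len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set = df.iloc[split_loc:].copy()
    # model building: fit on the train set, predict on the test set
    model = KNeighborsRegressor().fit(train_set[[feature_col]], train_set[target_col])
    predictions = model.predict(test_set[[feature_col]])
    rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))

    return rmse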

# we know that 'Engine_size' is the best col, so let's look for the best random seed:
seed_dictionary = {}
for n in range(100):
    seed_dictionary[n] = knn_train_test_univariate_v2(cars, 'Engine_size', 'Price', n)
# (seed, rmse) pair with the lowest rmse
min(seed_dictionary.items(), key=lambda x: x[1])


(19, 2604.2314886733743)

Awesome! I didn’t know we could do that.

Thanks a lot, Adam 😊

Right, here comes the bad news: I couldn’t figure out how to get the RMSE down to the levels from your model. Then I noticed you have one fewer record in the dataframe, and that one record makes the whole difference:

  • the way you’ve read the data, the first record is used as the column names; you’ve then replaced those with different column names, which essentially removes that record. That one record makes a big difference in the results
  • try reading in the df like this:
new_cols = ['Symbol', 'Normalized_loss', 'Make', 'Fuel_type', 'Aspiration', 'No_of_doors', 'Body_style', 'Drive_wheels',
           'Engine_loc', 'Wheel_base', 'Length', 'Width', 'Height', 'Curb_weight', 'Engine_type', 'No_of_cylinders',
           'Engine_size', 'Fuel_system', 'Bore', 'Stroke', 'Compression_ratio', 'Horse_power', 'Peak_rpm', 'City_mpg',
           'Highway_mpg', 'Price']

cars = pd.read_csv('', names=new_cols)
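
To see the effect in isolation, here’s a small sketch on made-up data. The three rows and the shortened `cols` list are placeholders I invented for illustration, not the real file or the full column list:

```python
import io
import pandas as pd

# Three made-up rows with no header line, standing in for the real file.
raw = "3,alfa-romero,13495\n1,audi,13950\n2,bmw,16430\n"
cols = ["Symbol", "Make", "Price"]  # shortened, hypothetical column list

# Default read: pandas treats the first data row as the header.
with_default_header = pd.read_csv(io.StringIO(raw))
with_default_header.columns = cols  # renaming does not bring that row back

# Passing names= tells pandas the file has no header row, so every row is kept.
with_names = pd.read_csv(io.StringIO(raw), names=cols)

print(len(with_default_header), len(with_names))
```

The first dataframe ends up one row shorter than the second, which is the same one-record gap described above.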

I’m pretty new to ML, but from what I understand we shouldn’t remove records that are perfectly fine just to improve the model (unless maybe they have some extreme values in them) - don’t know, hopefully that’s in the future lessons…


Thanks for this idea of finding the best random seed. However, I need some clarification on it.

I have experimented with the idea on different features in the dataset, and each feature has a different best random seed. Splitting the data using one of the ‘best random seeds’ has a massive effect on the RMSE.

Moving forward, is it really safe to change the random seed, or should we just keep it at 0 and accept the tradeoff?

Like I said, I am new to this, and I haven’t even finished this project yet. But from what I understand (and I think this hasn’t been emphasised enough so far in the ML lessons), we should be trying to tweak, turn, modify and test every model, then compare. So instead of choosing between changing the seed and keeping it, just put the seed into another variable and compare the results. Eventually you’ll end up with one solution that gives the best results.
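
A rough sketch of what comparing seeds could look like, on synthetic data rather than the cars dataset. Everything here (`X`, `y`, `holdout_rmse`, the made-up linear relationship) is an assumption for illustration only:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption): one feature, noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(60, 330, size=(200, 1))
y = 150 * X[:, 0] - 4000 + rng.normal(0, 3000, size=200)

def holdout_rmse(X, y, seed):
    """Shuffle with the given seed, split 50/50, return KNN RMSE on the test half."""
    order = np.random.default_rng(seed).permutation(len(X))
    half = len(X) // 2
    train, test = order[:half], order[half:]
    model = KNeighborsRegressor().fit(X[train], y[train])
    return np.sqrt(mean_squared_error(y[test], model.predict(X[test])))

# Treat the seed as one more variable and look at the whole spread.
rmses = np.array([holdout_rmse(X, y, seed) for seed in range(100)])
print(f"best seed: {rmses.argmin()}, best rmse: {rmses.min():.0f}")
print(f"mean: {rmses.mean():.0f}, std: {rmses.std():.0f}")
```

Looking at the mean and standard deviation next to the minimum gives a feel for how much of the "best seed" result is down to a lucky split, since the model and the data are the same in every run.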

I haven’t finished my project yet, but have a look at how many comparisons I’ve generated just for the first model (again, this is just a sketchbook, not a finished notebook; the comparison plots are at the end):
cars_ml on GitHub


Hi, thanks for sharing. It’s coming along really nicely.
Your approach is also very impressive; I would like to see the work once it’s done.