Cross-Validation and Hyperparameter Tuning - Predicting Car Prices

Screen Link: https://app.dataquest.io/m/155/guided-project%3A-predicting-car-prices/6/next-steps

Hi all,

Maybe it’s a stupid question, but I don’t understand how to optimize the K-Nearest Neighbors model using k-fold cross-validation.

Using train/test validation, I got an optimal k-value (with the lowest RMSE) for a multivariate model:

5 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower'], k=2: RMSE 2020.421278
4 features ['curb_weight', 'city_mpg', 'width', 'engine_size'], k=2: RMSE 2033.850822
11 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length', 'wheel_base', 'compression_ratio', 'bore', 'normalized_losses'], k=2: RMSE 2071.718283
6 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg'], k=3: RMSE 2099.756135
9 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg'], k=2: RMSE 2170.566203
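
For reference, the train/test validation behind those numbers was essentially the following (a rough sketch; the helper name and the 75/25 split are illustrative, and the full code is in the notebook attached below):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_train_test(df, features, target, k):
    # Normalize features to the 0-1 range, keeping the target column as-is
    target_col = df[target]
    df = (df - df.min()) / (df.max() - df.min())
    df[target] = target_col

    # Simple 75/25 holdout split (illustrative)
    split = int(len(df) * 0.75)
    train, test = df.iloc[:split], df.iloc[split:]

    model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    model.fit(train[features], train[target])
    predictions = model.predict(test[features])
    return np.sqrt(mean_squared_error(test[target], predictions))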

Following the next steps, I created a knn_train_validate function to use k-fold cross-validation, but I got stuck… :crazy_face:

For example, if I use this function on a univariate model to find the best k-value for each feature across several fold counts, the output is a dictionary with the feature name as the key and the fold count as a nested key (containing the k-values and their RMSEs):

knn_train_validate function:

def knn_train_validate(df, features, target, k_neighbors=[5], n_folds=[10]):
    
    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import KFold, cross_val_score
    
    # Normalize all columns to range from 0 to 1 except the target column
    target_col = df[target]
    df = (df - df.min()) / (df.max() - df.min())
    df[target] = target_col
    
    k_folds_rmses = dict()
    
    for fold in n_folds:
        
        kf = KFold(fold, shuffle=True, random_state=1)
        
        # Re-initialize per fold count so each entry in k_folds_rmses
        # gets its own dict instead of a reference to a shared one
        k_rmses = dict()
        
        for k in k_neighbors:
            model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
            # cross_val_score returns negative MSEs; take the absolute
            # value and the square root to get per-fold RMSEs
            mses = cross_val_score(model, df[features], df[target],
                                   scoring='neg_mean_squared_error', cv=kf)
            rmses = np.sqrt(np.absolute(mses))
            k_rmses[k] = np.mean(rmses)
        
        k_folds_rmses[fold] = k_rmses
    
    return k_folds_rmses

Univariate model:

k_folds_univariate = dict()

# features is the list of candidate numeric columns defined earlier in the notebook
for f in features:
    k_folds_univariate[f] = knn_train_validate(numeric_cars, [f], 'price',
                                               k_neighbors=[k for k in range(1, 6)],
                                               n_folds=[5, 7, 9, 10])

for k,v in k_folds_univariate.items():
    print(k,v)

Output:

curb_weight {5: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}, 7: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}, 9: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}, 10: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}}
city_mpg {5: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}, 7: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}, 9: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}, 10: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}}
....

  1. What number of folds should I use to get the optimal k-value for a univariate or multivariate model?
  2. Is the number of folds another hyperparameter I can optimize, like the k-neighbors value, or should I just choose a specific value?
  3. What is the workflow for using k-fold cross-validation to optimize this model?

Attached is the full code in Jupyter:
Cars Listings.ipynb (214.3 KB)


Hey there! I think I can try to answer your questions. I won’t go through your code line by line, but I can help out conceptually.

  1. There is no “optimal” number of folds to use in k-fold cross-validation. When we perform k-fold cross-validation, we are trying to estimate how well the machine learning model will perform on the test data by further dividing the training data into different folds.

  2. The number of folds is purely a cross-validation setting, not a model hyperparameter like the number of neighbors. As a rule of thumb, 5 or 10 folds is considered a good choice. Cross-validation helps us understand which hyperparameters might work best for the model, since it lets us experiment with different hyperparameter values without touching the test data.

  3. K-fold cross-validation should be performed after you decide which features you want to use in your data. As mentioned above, k-fold cross-validation will help you decide which hyperparameter value minimizes the RMSE on your training data. Once you decide what number of neighbors to use, you can build your final model and then evaluate it against the test data (see the sketch below).

The golden rule is that your machine learning model should only see the test data at the very last step. We use cross-validation to help with hyperparameter optimization.
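
To make that workflow concrete, here is a rough sketch of those steps (the feature list is borrowed from your question above, and numeric_cars is assumed to be the normalized dataframe from the project):

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

features = ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower']

# 1. Hold out a test set that the model never sees during tuning
train_X, test_X, train_y, test_y = train_test_split(
    numeric_cars[features], numeric_cars['price'], test_size=0.2, random_state=1)

# 2. Cross-validate on the training data only to choose n_neighbors
best_k, best_rmse = None, float('inf')
for k in range(1, 21):
    model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    mses = cross_val_score(model, train_X, train_y,
                           scoring='neg_mean_squared_error', cv=5)
    avg_rmse = np.sqrt(np.absolute(mses)).mean()
    if avg_rmse < best_rmse:
        best_k, best_rmse = k, avg_rmse

# 3. Fit the final model with the chosen k and evaluate it once on the test set
final_model = KNeighborsRegressor(n_neighbors=best_k, algorithm='brute')
final_model.fit(train_X, train_y)
test_rmse = np.sqrt(mean_squared_error(test_y, final_model.predict(test_X)))

(scikit-learn’s GridSearchCV automates step 2, if you’d rather not write the loop yourself.)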


Hi @christian5, thank you very much for your answer :slight_smile: but I’m still a little confused.

I understand what you’re saying. But in this guided project, we see that train/test validation can tell us both which features are relevant and which hyperparameter k works best.

For example, as I said above:

Using train/test validation, I got an optimal k-value (with the lowest RMSE) for a multivariate model:

5 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower'], k=2

In this case, the best model (for that data) would use these 5 features with k=2.

So, my question is:

Can we use K-fold cross-validation to know which are the relevant features and the optimal k hyperparameter?
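
In other words, would something like this nested search be a valid approach? (Just a sketch; the feature subsets and the k range are examples, and numeric_cars is assumed to be normalized already.)

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Candidate feature subsets (examples taken from my earlier results)
feature_sets = [
    ['curb_weight', 'city_mpg', 'width', 'engine_size'],
    ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower'],
]

results = {}
for features in feature_sets:
    for k in range(1, 10):
        model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
        mses = cross_val_score(model, numeric_cars[features], numeric_cars['price'],
                               scoring='neg_mean_squared_error', cv=5)
        results[(tuple(features), k)] = np.sqrt(np.absolute(mses)).mean()

# The (feature set, k) pair with the lowest average cross-validated RMSE
best_features, best_k = min(results, key=results.get)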