Screen Link: https://app.dataquest.io/m/155/guided-project%3A-predicting-car-prices/6/next-steps

Hi all,

Maybe it’s a stupid question, but I don’t understand how to optimize the K-Nearest Neighbors model using k-fold cross validation, or how to interpret the results.

Using train/test validation, I got an optimal k-value (the one with the lowest RMSE) for a multivariate model:

```
5 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower'], k=2 2020.421278
4 features ['curb_weight', 'city_mpg', 'width', 'engine_size'], k=2 2033.850822
11 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length', 'wheel_base', 'compression_ratio', 'bore', 'normalized_losses'], k=2 2071.718283
6 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg'], k=3 2099.756135
9 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length', 'wheel_base', 'compression_ratio'], k=2 2170.566203
```
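For context, my train/test validation was roughly along these lines (a simplified, self-contained sketch — the synthetic two-feature data here just stands in for `numeric_cars`, and the 75/25 split is my own choice):

```
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for numeric_cars: price roughly driven by two features
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'curb_weight': rng.uniform(1500, 4000, 200),
    'horsepower': rng.uniform(50, 250, 200),
})
df['price'] = 5 * df['curb_weight'] + 40 * df['horsepower'] + rng.normal(0, 500, 200)

# Normalize features to [0, 1]; keep the target on its original scale
features = ['curb_weight', 'horsepower']
df[features] = (df[features] - df[features].min()) / (df[features].max() - df[features].min())

# Shuffle, then split 75/25 into train and test sets
shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)
cut = int(len(shuffled) * 0.75)
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]

# Fit a model for each candidate k and record the test RMSE
rmses = {}
for k in range(1, 6):
    model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    model.fit(train[features], train['price'])
    preds = model.predict(test[features])
    rmses[k] = np.sqrt(mean_squared_error(test['price'], preds))

best_k = min(rmses, key=rmses.get)
print(best_k, rmses[best_k])
```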

Following the next steps, I created a `knn_train_validate` function to use k-fold cross validation, but I got stuck…

For example, when I use this function on a univariate model to find the best k-value for each feature across several fold counts, the output is a dictionary with the feature name as the key and the fold number as a nested key (holding the k-values and their RMSEs):

**knn_train_validate function**:

```
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score

def knn_train_validate(df, features, target, k_neighbors=[5], n_folds=[10]):
    # Normalize all columns to range from 0 to 1 except the target column
    target_col = df[target]
    df = (df - df.min()) / (df.max() - df.min())
    df[target] = target_col
    k_folds_rmses = dict()
    for fold in n_folds:
        kf = KFold(fold, shuffle=True, random_state=1)
        # Fresh dict per fold count; sharing one dict across folds would make
        # every entry of k_folds_rmses point to the same (last) results
        k_rmses = dict()
        for k in k_neighbors:
            model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
            mses = cross_val_score(model, df[features], df[target],
                                   scoring='neg_mean_squared_error', cv=kf)
            rmses = np.sqrt(np.absolute(mses))
            avg_rmse = np.mean(rmses)
            std_rmse = np.std(rmses)
            k_rmses[k] = avg_rmse
        k_folds_rmses[fold] = k_rmses
    return k_folds_rmses
```

**Univariate model:**

```
k_folds_univariate = dict()
for f in features:
    k_folds_univariate[f] = knn_train_validate(numeric_cars, [f], 'price',
                                               k_neighbors=[k for k in range(1, 6)],
                                               n_folds=[5, 7, 9, 10])
for k, v in k_folds_univariate.items():
    print(k, v)
```

**Output:**

```
curb_weight {5: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}, 7: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}, 9: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}, 10: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}}
city_mpg {5: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}, 7: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}, 9: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}, 10: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}}
....
```

- How many folds should I use to get the optimal k-value for a univariate or multivariate model?
- Is the number of folds another hyperparameter I can optimize, like the k-neighbors value, or should I just pick a specific value?
- What is the workflow for using k-fold cross-validation to optimize this model?
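For what it’s worth, here is a minimal sketch of the workflow I *think* is intended — fix the number of folds (e.g. 10) and grid-search only over k. The synthetic data and the `best_k` selection are my own assumptions, not from the course. Am I on the right track?

```
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the normalized feature columns and price target
rng = np.random.default_rng(1)
X = pd.DataFrame({
    'curb_weight': rng.uniform(0, 1, 200),
    'horsepower': rng.uniform(0, 1, 200),
})
y = 5000 * X['curb_weight'] + 3000 * X['horsepower'] + rng.normal(0, 300, 200)

# The fold count is fixed up front, not tuned alongside k
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Cross-validate each candidate k and keep its average RMSE across folds
avg_rmses = {}
for k in range(1, 10):
    model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    mses = cross_val_score(model, X, y,
                           scoring='neg_mean_squared_error', cv=kf)
    avg_rmses[k] = np.sqrt(np.abs(mses)).mean()

best_k = min(avg_rmses, key=avg_rmses.get)
print(best_k, avg_rmses[best_k])
```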

Attached is the full code in Jupyter:

Cars Listings.ipynb (214.3 KB)
