KNN cross-validation

Hello helpful community, I'd like to ask what the difference is between k-fold cross-validation done by instantiating a KFold class (from sklearn.model_selection import cross_val_score, KFold) and the k-fold cross-validation technique below, which manually splits the dataset into a number of folds.

import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor

kf = KFold(n_splits=5, shuffle=True, random_state=1)
knn = KNeighborsRegressor()
mses = cross_val_score(knn, dc_listings[["accommodates"]], dc_listings["price"],
                       scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.abs(mses))
avg_rmse = np.mean(rmses)

Manually splitting

from sklearn.metrics import mean_squared_error

fold_ids = [1, 2, 3, 4, 5]

def train_and_validate(df, folds):
    fold_rmses = []
    for fold in folds:
        model = KNeighborsRegressor()
        train = df[df["fold"] != fold].copy()
        test = df[df["fold"] == fold].copy()
        model.fit(train[["accommodates"]], train["price"])
        labels = model.predict(test[["accommodates"]])
        mse = mean_squared_error(test["price"], labels)
        rmse = np.sqrt(mse)
        fold_rmses.append(rmse)
    return fold_rmses

rmses = train_and_validate(dc_listings, fold_ids)
avg_rmse = np.mean(rmses)
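The manual version assumes dc_listings already carries a "fold" column, which isn't shown in the question. A minimal sketch of how such a column might be assigned, using a small hypothetical DataFrame in place of dc_listings (column names are kept, the data is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dc_listings
rng = np.random.default_rng(1)
df = pd.DataFrame({"accommodates": rng.integers(1, 9, 100),
                   "price": rng.uniform(50, 300, 100)})

# Shuffle row positions, then hand out fold ids 1..5 in roughly equal chunks
shuffled_positions = rng.permutation(len(df))
df["fold"] = 0
for fold_id, chunk in enumerate(np.array_split(shuffled_positions, 5), start=1):
    df.iloc[chunk, df.columns.get_loc("fold")] = fold_id
```

Because the positions are permuted before the fold ids are assigned, this mimics what shuffle=True does inside KFold.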

hi @sanctusdan

Have you been able to understand the difference by now as it’s been over 3 months since you raised this question? If yes, you may stop here and reply so.

Else, please continue reading…

If the data was not shuffled in the second case, then that is the difference. The KFold() instance has the option shuffle set to True, so the data gets shuffled before being split.

If the data in the second case was also shuffled, then there's no real difference: it's just manual code as compared to using an existing module.
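To make that concrete, here is a sketch (on made-up synthetic data, since dc_listings isn't available here) that runs the same KFold splits both through cross_val_score and through a manual loop, and checks that the per-fold RMSEs match:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data: one feature, one noisy target
rng = np.random.default_rng(0)
X = rng.uniform(1, 8, size=(100, 1))
y = 30 * X.ravel() + rng.normal(0, 10, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Route 1: cross_val_score with the KFold instance
mses = cross_val_score(KNeighborsRegressor(), X, y,
                       scoring="neg_mean_squared_error", cv=kf)
rmses_a = np.sqrt(np.abs(mses))

# Route 2: manual loop over the very same splits
rmses_b = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsRegressor()
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    rmses_b.append(np.sqrt(mean_squared_error(y[test_idx], preds)))

print(np.allclose(rmses_a, rmses_b))
```

When the folds are identical, the two routes produce identical scores; the manual version just reimplements what cross_val_score already does.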
