Fit/predict vs cross-validation routines

Hello DQ community! I hope you are all well and managing to enjoy some time off. During my holidays I squeezed in some learning time, and I have a question about the differences between the fit + predict approach and the cross-validation routines.

I am halfway through the “predicting car prices” project, and instead of using an approach based on:

  • split dataset in test / train
  • fit model with training data
  • predict target column on test data

I wanted to go directly with a cross-validation routine, which, if I got it right, does not require me to handle the train_test_split step: I just define a number of folds and let sklearn.model_selection.cross_val_score handle the heavy lifting.

In code terms this means:

UNIVARIATE KNN MODELS

# imports assumed to be available earlier in the notebook (not shown in the original post)
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_test_train_v1(df, train_col, target_col, neighbors = 5):
    #Prepare features matrix and target vector
    X = df[[train_col]]
    y = df[target_col]
    
    #Hold out 50% of the rows as a test set (a single train/test split)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    
    # Instantiate a KNN model using a k-neighbors value.
    knn = KNeighborsRegressor(n_neighbors = neighbors)
    
    #Fit the model with the training matrix and target column y_train
    knn.fit(X_train, y_train)
    
    #predict the values on the remaining test matrix
    predicted_labels = knn.predict(X_test)
    
    # Calculate and return RMSE.
    mse = mean_squared_error(y_test, predicted_labels)
    rmse = np.sqrt(mse)
    return rmse

Am I correct in saying that the code below does the same thing as the one above?
In the version above the size of the train/test samples is set by the test_size parameter, while in the cross-validation version it is inferred from the number of folds.

def knn_test_train_v2(df, train_col, target_col, neighbors = 5, folds = 2):
    #Prepare features matrix and target vector
    X = df[[train_col]]
    y = df[target_col]
    
    #Define the k-fold splitter for cross validation (number of folds, shuffled rows)
    kf = KFold(n_splits = folds, shuffle = True, random_state = 0)
        
    # Instantiate a KNN model using a k-neighbors value.
    knn = KNeighborsRegressor(n_neighbors = neighbors)
    
    #Perform cross validation with cross_val_score (under the hood it fits the model and predicts values for each fold)
    rmses = cross_val_score(knn, X, y, scoring = 'neg_root_mean_squared_error', cv = kf)
    
    #The scores come back as negative RMSE, so take the absolute value before averaging
    return round(np.mean(abs(rmses)), 2)
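
For context, I call both functions roughly like this (the cars DataFrame and the column names here are just placeholders for whichever feature I am testing):

#hypothetical call, assuming a `cars` DataFrame with 'horsepower' and 'price' columns
rmse_holdout = knn_test_train_v1(cars, 'horsepower', 'price', neighbors = 5)
rmse_cv = knn_test_train_v2(cars, 'horsepower', 'price', neighbors = 5, folds = 2)
print(rmse_holdout, rmse_cv)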

If my assumptions are correct, I would expect these two functions (with equal hyperparameters, i.e. the same number of folds and neighbors) to deliver the same output, but that's not the case. It might be related to the random_state, but I wanted to first understand whether I got the purpose of the two routines right (see the small check just below).
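
For reference, here is a small check I put together on a toy array (just a placeholder, not the cars data) to see how the two routines partition the rows when both get random_state = 0:

import numpy as np
from sklearn.model_selection import train_test_split, KFold

X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

#Holdout split used in v1
_, X_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)
print("train_test_split test rows:", sorted(X_test.ravel()))

#Folds used in v2
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    print("KFold fold", i, "test rows:", sorted(test_idx))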

Thanks to anyone who has the time to read through this during August! :smiley:

cheers,
Nick


Bump bump :slight_smile:

Hi @nlong,

Your approach is correct; the reason you are getting a different result is mostly due to the internal logic of the cross_val_score function.

Perhaps this post will help:

Best,
Sahil


Thank you @Sahil! I went through the article you attached, but it mentions differences due to index shuffling (enabled by default in train_test_split, not applied in cross_val_score unless explicitly requested).

In my case I used random_state = 0 for both functions, but I guess that doesn't even out the differences?


Hi @nlong,

Yes, it doesn't even out the differences. To get identical results you would have to replicate exactly the logic that cross_val_score uses internally. For example, in the first case you fit and score the model on a single holdout split with a test size of 50 percent:

train_test_split(X, y, test_size=0.5, random_state=0)

With KFold(n_splits=2) each test fold is also roughly 50 percent of the data, but cross_val_score fits and scores the model on both folds and you then average the two RMSE values, which will generally differ from the error of a single split. The two routines may also not shuffle the rows in exactly the same way, even with the same random_state. There can be other differences like that.
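
If it helps, here is a rough sketch (my own illustration, not the exact scikit-learn internals) of what your v2 function effectively does, so you can see that it performs one fit/predict cycle per fold and then averages the errors:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_manual_kfold(df, train_col, target_col, neighbors = 5, folds = 2):
    X = df[[train_col]]
    y = df[target_col]

    kf = KFold(n_splits = folds, shuffle = True, random_state = 0)
    rmses = []

    #Each fold takes a turn as the test set; the remaining rows are used for training
    for train_idx, test_idx in kf.split(X):
        knn = KNeighborsRegressor(n_neighbors = neighbors)
        knn.fit(X.iloc[train_idx], y.iloc[train_idx])
        predictions = knn.predict(X.iloc[test_idx])
        rmses.append(np.sqrt(mean_squared_error(y.iloc[test_idx], predictions)))

    #Averaging over the folds is the step your v1 function does not have
    return round(np.mean(rmses), 2)

Your v1 function, by contrast, scores the model on just one split, so even a similar partition of the rows will usually give a different number from the averaged cross-validation score.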

Best,
Sahil
