Hello DQ community! Hope you are all well and managing to enjoy some time off. During my holidays I squeezed in some learning time, and I have a question about the differences between the fit + predict and cross-validation routines.
I am halfway through the “predicting car prices” project, and instead of using an approach based on:
- split the dataset into train / test sets
- fit the model on the training data
- predict the target column on the test data
I wanted to go directly with a cross-validation routine, which, if I got it right, does not require me to handle the train_test_split step myself: I just define a number of folds and let sklearn.model_selection.cross_val_score handle the heavy lifting.
In code terms, this means:
UNIVARIATE KNN MODELS
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_test_train_v1(df, train_col, target_col, neighbors=5):
    # Prepare the features matrix and target vector
    X = df[[train_col]]
    y = df[target_col]
    # Split the data into a single 50/50 train/test holdout
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    # Instantiate a KNN model with the given k-neighbors value
    knn = KNeighborsRegressor(n_neighbors=neighbors)
    # Fit the model on the training matrix and target vector
    knn.fit(X_train, y_train)
    # Predict the values on the held-out test matrix
    predicted_labels = knn.predict(X_test)
    # Calculate and return the RMSE
    mse = mean_squared_error(y_test, predicted_labels)
    rmse = np.sqrt(mse)
    return rmse
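For reference, I call it like this (assuming numeric_cars is the cleaned dataframe from the project; the column names are just an example):

# Hypothetical usage: numeric_cars is the cleaned dataframe,
# 'horsepower' and 'price' are example columns from the project dataset
rmse = knn_test_train_v1(numeric_cars, 'horsepower', 'price')
print(rmse)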
Am I correct in saying that the code below does the same thing as the one above? The only difference I see is that in the version above the size of the train/test samples is set explicitly by the test_size parameter, while in the cross-validation version it is inferred from the number of folds.
from sklearn.model_selection import KFold, cross_val_score

def knn_test_train_v2(df, train_col, target_col, neighbors=5, folds=2):
    # Prepare the features matrix and target vector
    X = df[[train_col]]
    y = df[target_col]
    # Instantiate a KFold splitter for cross validation
    kf = KFold(n_splits=folds, shuffle=True, random_state=0)
    # Instantiate a KNN model with the given k-neighbors value
    knn = KNeighborsRegressor(n_neighbors=neighbors)
    # Perform cross validation with cross_val_score (under the hood it fits the
    # model and predicts values once per fold); sklearn returns negative RMSEs
    rmses = cross_val_score(knn, X, y, scoring='neg_root_mean_squared_error', cv=kf)
    # Flip the sign and return the average RMSE across folds
    return round(np.mean(abs(rmses)), 2)
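To check my understanding of what cross_val_score does under the hood, here is a minimal sketch of the manual loop I believe it is roughly equivalent to (just my mental model, not sklearn's actual internals; knn_manual_cv is a name I made up):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_manual_cv(df, train_col, target_col, neighbors=5, folds=2):
    X = df[[train_col]]
    y = df[target_col]
    kf = KFold(n_splits=folds, shuffle=True, random_state=0)
    fold_rmses = []
    # Each fold takes one turn as the test set; the remaining folds form the train set
    for train_idx, test_idx in kf.split(X):
        knn = KNeighborsRegressor(n_neighbors=neighbors)
        knn.fit(X.iloc[train_idx], y.iloc[train_idx])
        predictions = knn.predict(X.iloc[test_idx])
        fold_rmses.append(np.sqrt(mean_squared_error(y.iloc[test_idx], predictions)))
    # Average the per-fold RMSEs, like knn_test_train_v2 does
    return np.mean(fold_rmses)

Writing it out this way, I notice that with folds=2 the loop fits and scores the model twice (each half takes a turn as the test set), while knn_test_train_v1 only scores one direction of the 50/50 split.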
If my assumptions are correct, I would expect these two functions (with equal hyperparameters, i.e. the same number of folds and neighbors) to deliver the same output, but that's not the case. It might be related to the random_state, but I first wanted to understand whether I got the purpose of the two routines right.
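One thing I tried in order to test the random_state hypothesis: comparing the actual row indices each routine puts in the test set. A quick diagnostic sketch (it assumes X is the features matrix defined as in the functions above):

import numpy as np
from sklearn.model_selection import KFold, train_test_split

# X is assumed to be the features matrix from the functions above
indices = np.arange(len(X))
# Row indices that train_test_split sends to the test set
_, holdout_test_idx = train_test_split(indices, test_size=0.5, random_state=0)

# Row indices that each KFold fold sends to the test set
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for fold_number, (train_idx, test_idx) in enumerate(kf.split(X)):
    same = np.array_equal(np.sort(test_idx), np.sort(holdout_test_idx))
    print(f"fold {fold_number}: test rows match the holdout split? {same}")

If the test rows differ, the two routines are scoring the model on different subsets of the data, so I guess identical RMSEs shouldn't be expected even with the same random_state.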
Thanks to anyone who has the time to read through this during August!
cheers,
Nick