Guided Project: Predicting Bike Rentals: K-fold cross validation vs. nested cross validation (nested CV) vs. train/validate/test metrics

Screen Link: Learn data science with Python and R projects

As I was following along the “Predicting Bike Rentals” guided project and comparing scores between the linear regression model and decision tree regressor model, I began wondering if the way I do cross validation is correct, or if there are better strategies for cross validation.

Specifically, I looked at 2 strategies for generating error scores for my decision tree regressor model, each with a few variations of data splits:
1. GridSearchCV, followed by building a new model with the optimized parameters and running K-fold cross validation to get an average error
1a. Using all of the data for the grid search and cross validation, generating an average test error score and a training error score
1b. Using an 80/20 train/test split, with the training set used for the grid search and cross validation, generating an average validation error score, a test error score, and a training error score

2. Nested cross validation (nested CV), which passes the GridSearchCV estimator directly into cross_val_score.
2a. Using all of the data, generating a single average test error score
2b. Using an 80/20 train/test split, with the training set used for GridSearchCV and cross validation, generating an average validation error score, a test error score, and a training error score
2c. Same as 2b, but using the training set for GridSearchCV and the test set for cross_val_score, so there is no validation score

See the nested cross validation example on the scikit-learn website.
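
The core of that example is that the GridSearchCV object itself is passed to cross_val_score as the estimator, so every outer fold re-runs the inner grid search. A minimal sketch of that pattern (the parameter grid here is a trimmed placeholder, not my full grid; the full version is in 2a below):

# Nested CV sketch: the inner CV tunes hyperparameters inside GridSearchCV,
# the outer CV estimates generalization error by re-running the search per fold
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(DecisionTreeRegressor(random_state=1),
                      param_grid={'max_depth': [3, 5, 10]},
                      scoring='neg_mean_squared_error',
                      cv=inner_cv)

nested_mses = cross_val_score(search, bike_rentals[columns], bike_rentals['cnt'],
                              cv=outer_cv, scoring='neg_mean_squared_error')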

I don't advise running my code, as each grid search takes about 5 minutes on my laptop. The outputs are shown below each block.

My Code:

Split the data 80/20

# Imports used throughout
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Randomly split the data into training/validation and test data,
# where x denotes predictor features and y is the target
x_train, x_test, y_train, y_test = train_test_split(bike_rentals[columns], 
                                                    bike_rentals['cnt'],
                                                    train_size=0.80,
                                                    test_size=0.20,
                                                    random_state=1)

1a.

# Use the optimized parameters from gridsearch to build and test
# a decision tree regressor model

# Define the model using optimized parameter values
new_tree = DecisionTreeRegressor(max_depth=20, min_samples_leaf=5,
                                 min_samples_split=20, random_state=1)

# Define the data splitting strategy
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Perform cross validation and calculate scores
mses = cross_val_score(new_tree, bike_rentals[columns], bike_rentals['cnt'],
                       cv=kf, scoring='neg_mean_squared_error')

# Calculate the mean root mean squared error and standard deviation
# for the validation
avg_rmse = np.mean(np.sqrt(abs(mses)))
std_rmse = np.std(np.sqrt(abs(mses)))

# Train the model
new_tree.fit(x_train, y_train)

# Test the model on training data
train_prediction = new_tree.predict(x_train)
train_rmse = np.sqrt(mean_squared_error(y_train, train_prediction))

print('Train RMSE:', train_rmse)
print('Mean test RMSE:', avg_rmse, 'standard deviation:', std_rmse)

1a. output:

Train RMSE: 39.03289641393771
Mean test RMSE: 53.076708804611464 standard deviation: 1.5627723601274157

1b.

# Use the optimized parameters from gridsearch to build and test
# a decision tree regressor model

# Define the model using optimized parameter values
new_tree = DecisionTreeRegressor(max_depth=20, min_samples_leaf=5,
                                 min_samples_split=20, random_state=1)

# Define the data splitting strategy
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Perform cross validation and calculate scores
mses = cross_val_score(new_tree, x_train, y_train,
                       cv=kf, scoring='neg_mean_squared_error')

# Calculate the mean root mean squared error and standard deviation
# for the validation
avg_rmse = np.mean(np.sqrt(abs(mses)))
std_rmse = np.std(np.sqrt(abs(mses)))

# Train the model
new_tree.fit(x_train, y_train)

# Test the model on training data
train_prediction = new_tree.predict(x_train)
train_rmse = np.sqrt(mean_squared_error(y_train, train_prediction))

# Test the model on unseen test data
test_prediction = new_tree.predict(x_test)
test_rmse = np.sqrt(mean_squared_error(y_test, test_prediction))

print('Train RMSE:', train_rmse)
print('Mean validation RMSE:', avg_rmse, 'standard deviation:', std_rmse)
print('Test RMSE:', test_rmse)

1b. output:

Train RMSE: 36.92036495214353
Mean validation RMSE: 54.58728740299868 standard deviation: 1.68934008223517
Test RMSE: 51.86347477122018

2a.

# Optimize parameters for the decision tree regressor via 
# gridsearch cross validation, then take the optimized model
# and perform nested cross validation

# Define model, tree
tree = DecisionTreeRegressor(criterion='mse', random_state=1)

# Define parameters to iterate over
parameters = {'max_depth': [3,5,10,20,30,50],
              'min_samples_split': [2,5,8,10,15,20,30,50,100],
              'min_samples_leaf': [1,5,8,10,15,20,30,50,100]
              }

# Splitting strategy for inner loop
kf_inner = KFold(n_splits=5, shuffle=True, random_state=1)

# Define and begin search
gridsearch = GridSearchCV(tree, parameters, scoring='neg_mean_squared_error',
                          cv=kf_inner, n_jobs=-1)
gridsearch.fit(bike_rentals[columns], bike_rentals['cnt'])

print("Best parameters:")
# Get optimized value for each parameter we searched
best_parameters = gridsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
# Splitting strategy for outer loop
kf_outer = KFold(n_splits=5, shuffle=True, random_state=1)

# Perform nested cross validation and calculate scores
mses = cross_val_score(gridsearch, bike_rentals[columns], bike_rentals['cnt'],
                       cv=kf_outer, scoring='neg_mean_squared_error')

# Calculate the mean root mean squared error and standard deviation
avg_rmse = np.mean(np.sqrt(abs(mses)))
std_rmse = np.std(np.sqrt(abs(mses)))

print('Mean nested CV test RMSE:', avg_rmse, 'standard deviation:', std_rmse)

2a. output:

Best parameters:
	max_depth: 20
	min_samples_leaf: 5
	min_samples_split: 20
Mean nested CV test RMSE: 53.20798709133597 standard deviation: 1.6911009731982067

2b.

# Optimize parameters for the decision tree regressor via 
# gridsearch cross validation, then take the optimized model
# and perform nested cross validation

# Define model, tree
tree = DecisionTreeRegressor(criterion='mse', random_state=1)

# Define parameters to iterate over
parameters = {'max_depth': [3,5,10,20,30,50],
              'min_samples_split': [2,5,8,10,15,20,30,50,100],
              'min_samples_leaf': [1,5,8,10,15,20,30,50,100]
              }

# Splitting strategy for inner loop
kf_inner = KFold(n_splits=5, shuffle=True, random_state=1)

# Define and begin search
gridsearch = GridSearchCV(tree, parameters, scoring='neg_mean_squared_error',
                          cv=kf_inner, n_jobs=-1)
gridsearch.fit(x_train, y_train)

print("Best parameters:")
# Get optimized value for each parameter we searched
best_parameters = gridsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
# Splitting strategy for outer loop
kf_outer = KFold(n_splits=5, shuffle=True, random_state=1)

# Perform nested cross validation and calculate scores
mses = cross_val_score(gridsearch, x_train, y_train,
                       cv=kf_outer, scoring='neg_mean_squared_error')

# Calculate the mean root mean squared error and standard deviation
avg_rmse = np.mean(np.sqrt(abs(mses)))
std_rmse = np.std(np.sqrt(abs(mses)))

# Test the model on training data
train_prediction = gridsearch.predict(x_train)
train_rmse = np.sqrt(mean_squared_error(y_train, train_prediction))

# Test the model on unseen test data
test_prediction = gridsearch.predict(x_test)
test_rmse = np.sqrt(mean_squared_error(y_test, test_prediction))

print('Train RMSE:', train_rmse)
print('Mean nested CV validation RMSE:', avg_rmse, 'standard deviation:', std_rmse)
print('Test RMSE:', test_rmse)

2b. output:

Best parameters:
	max_depth: 30
	min_samples_leaf: 5
	min_samples_split: 15
Train RMSE: 36.92036495214353
Mean nested CV validation RMSE: 54.76673639251891 standard deviation: 2.047178984545772
Test RMSE: 51.86347477122018

2c.

# Optimize parameters for the decision tree regressor via 
# gridsearch cross validation, then take the optimized model
# and perform nested cross validation

# Define model, tree
tree = DecisionTreeRegressor(criterion='mse', random_state=1)

# Define parameters to iterate over
parameters = {'max_depth': [3,5,10,20,30,50],
              'min_samples_split': [2,5,8,10,15,20,30,50,100],
              'min_samples_leaf': [1,5,8,10,15,20,30,50,100]
              }

# Splitting strategy for inner loop
kf_inner = KFold(n_splits=5, shuffle=True, random_state=1)

# Define and begin search
gridsearch = GridSearchCV(tree, parameters, scoring='neg_mean_squared_error',
                          cv=kf_inner, n_jobs=-1)
gridsearch.fit(x_train, y_train)

print("Best parameters:")
# Get optimized value for each parameter we searched
best_parameters = gridsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
# Splitting strategy for outer loop
kf_outer = KFold(n_splits=5, shuffle=True, random_state=1)

# Perform nested cross validation and calculate scores
mses = cross_val_score(gridsearch, x_test, y_test,
                       cv=kf_outer, scoring='neg_mean_squared_error')

# Calculate the mean root mean squared error and standard deviation
avg_rmse = np.mean(np.sqrt(abs(mses)))
std_rmse = np.std(np.sqrt(abs(mses)))

# Test the model on training data (using the refit gridsearch model)
train_prediction = gridsearch.predict(x_train)
train_rmse = np.sqrt(mean_squared_error(y_train, train_prediction))

print('Train RMSE:', train_rmse)
print('Mean nested CV test RMSE:', avg_rmse, 'standard deviation:', std_rmse)

2c. output:

Best parameters:
	max_depth: 30
	min_samples_leaf: 5
	min_samples_split: 15
Train RMSE: 36.92036495214353
Mean nested CV test RMSE: 75.8395036920455 standard deviation: 7.3268907977333395

I generated training errors where possible because it's my understanding that a large gap between training and test error is generally a sign of overfitting (high variance).
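
As a side note, scikit-learn's cross_validate with return_train_score=True returns both the training-fold and validation-fold scores in one call, which would avoid the separate fit/predict steps I used above. A minimal sketch, reusing new_tree, kf, x_train and y_train from 1b:

from sklearn.model_selection import cross_validate

# return_train_score=True adds per-fold training scores to the results dict
cv_results = cross_validate(new_tree, x_train, y_train,
                            cv=kf, scoring='neg_mean_squared_error',
                            return_train_score=True)

train_rmse_per_fold = np.sqrt(-cv_results['train_score'])
valid_rmse_per_fold = np.sqrt(-cv_results['test_score'])

print('Mean train RMSE:', train_rmse_per_fold.mean())
print('Mean validation RMSE:', valid_rmse_per_fold.mean())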

1b and 2a seem like the best options of the bunch, but which one is better?

Which strategy is best? Which is wrong and in what way? Why use one over another? Is there a better strategy that I have not mentioned?

Any advice is welcome. Cheers.