I have questions related to Step 5 of the “Machine Learning in Python: Intermediate” mission, so if any of you already did it and understood it, it would be great to have your feedback.
In particular, I am confused about two things (the first more conceptual, the second more code-related):
In the description it says: “A good way to detect if your model is overfitting is to compare the in-sample error and the out-of-sample error, or the training error with the test error.” That makes sense to me. However, the exercise never uses this comparison to check for overfitting: for each train/test partition they compute only the test error as the mean squared error (using `y_test` and the predictions made from `X_test`), and never the training error. I thought that might still be fine if the goal were, e.g., to look at the variance of the test errors across the different partitions (the train/test splits produced by cross-validation in each iteration of the for loop): a large variance among those test errors would mean the error changes a lot depending on the partition, which suggests overfitting.
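As an aside, here is a minimal sketch of the train-vs-test comparison the description mentions, on synthetic data (the data and coefficients are made up for illustration, not from the mission):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for filtered_cars (illustration only).
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([2.0, -1.0, 0.5, 1.5, -0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lr = LinearRegression().fit(X_train, y_train)

# In-sample (training) error vs. out-of-sample (test) error.
train_mse = mean_squared_error(y_train, lr.predict(X_train))
test_mse = mean_squared_error(y_test, lr.predict(X_test))
# A test error much larger than the training error suggests overfitting.
print(train_mse, test_mse)
```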
But then, for reasons I don’t understand at all, they do not compute the variance over the errors from the different train/test partitions (i.e., across all the iterations of the for loop). Instead, within each specific partition (each loop iteration) they compute the variance of the PREDICTIONS (`var = np.var(predictions)`). That means that if I predict a constant model (for a specific partition, i.e., a specific loop iteration), my “variance” for that iteration is zero. And if the model I should predict varies a lot by itself, I obtain a large variance of the PREDICTIONS even when the error between my predictions and the real values is zero (`mean_squared_error(y_test, predictions)`). To me this makes no sense, because it is not related at all to the difference between what I predict and the real results; it only measures how much the predicted values themselves vary. Do you have any idea why they are doing it like that?
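To make the two cases described above concrete, here is a small self-contained example (the numbers are invented for illustration) contrasting `np.var(predictions)` with the MSE:

```python
import numpy as np

y_test = np.array([10.0, 20.0, 30.0])

# Case 1: constant predictions -> variance of predictions is zero
# even though the error is large.
constant_preds = np.array([20.0, 20.0, 20.0])
print(np.var(constant_preds))                   # 0.0
print(np.mean((y_test - constant_preds) ** 2))  # ~66.67

# Case 2: perfect predictions of a varying target -> MSE is zero
# but the variance of the predictions is large.
perfect_preds = y_test.copy()
print(np.var(perfect_preds))                    # ~66.67
print(np.mean((y_test - perfect_preds) ** 2))   # 0.0
```

So `np.var(predictions)` says nothing about accuracy; it only captures how spread out the predicted values are, which is exactly the concern raised above.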
Why did they not adopt the strategy explained in a previous lesson about cross-validation and the use of `cross_val_score`? How could we use `cross_val_score` to obtain the same results?
–> Here I attach their code/answer:
```python
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

def train_and_cross_val(cols):
    features = filtered_cars[cols]
    target = filtered_cars["mpg"]
    variance_values = []
    mse_values = []

    # KFold instance.
    kf = KFold(n_splits=10, shuffle=True, random_state=3)

    # Iterate over each fold.
    for train_index, test_index in kf.split(features):
        # Training and test sets.
        X_train, X_test = features.iloc[train_index], features.iloc[test_index]
        y_train, y_test = target.iloc[train_index], target.iloc[test_index]

        # Fit the model and make predictions.
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)

        # Calculate mse and variance values for this fold.
        mse = mean_squared_error(y_test, predictions)
        var = np.var(predictions)

        # Append to arrays to calculate overall average mse and variance values.
        variance_values.append(var)
        mse_values.append(mse)

    # Compute average mse and variance values.
    avg_mse = np.mean(mse_values)
    avg_var = np.mean(variance_values)
    return avg_mse, avg_var

two_mse, two_var = train_and_cross_val(["cylinders", "displacement"])
three_mse, three_var = train_and_cross_val(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin"])
```
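For the `cross_val_score` question, here is a sketch of how the average MSE part could be obtained with it, on synthetic data (since I don’t have `filtered_cars` here; the data is invented for illustration). Note that `cross_val_score` only returns one score per fold, not the predictions themselves, so the per-fold `np.var(predictions)` would still require the manual loop or `cross_val_predict`:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for filtered_cars[cols] / filtered_cars["mpg"].
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=10, shuffle=True, random_state=3)

# "neg_mean_squared_error" is the negated MSE (higher is better for
# scikit-learn scorers), so flip the sign to recover per-fold MSE.
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=kf)
avg_mse = np.mean(-neg_mse)
print(avg_mse)
```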