Machine Learning in Python: Intermediate - Course 5/8


I have questions related to Step 5 of the “Machine Learning in Python: Intermediate” mission, so if any of you have already done it and understood it, it would be great to have your feedback.
In particular, I am confused about two things (the first more conceptual, the second more code-related):

The description says that “A good way to detect if your model is overfitting is to compare the in-sample error and the out-of-sample error, or the training error with the test error”. That is fine with me. However, the exercise never uses such a comparison to check for overfitting: for each train/test partition they only compute the test error as the mean squared error (from y_test and the predictions made on X_test), so the training error is never considered. I thought this could still be fine if the goal were to look at, e.g., the variance of the test errors over the different partitions (the train/test sets produced by cross-validation in each iteration of the “for” loop): a large variance among these test errors would mean that the error changes a lot depending on the partition (overfitting).

But then I do not understand why they compute the variance not over the errors from the different train/test partitions (i.e. across all iterations of the loop) but, within each partition (each iteration), the variance of the PREDICTIONS. This means that if I predict a constant model for a given partition, my variance for that iteration is zero. Conversely, if I predict a model that itself varies a lot, I get a large variance of the PREDICTIONS (var = np.var(predictions)) even though the error between my predictions and the real values may be zero (mean_squared_error(y_test, predictions)). That makes no sense to me, because it is not at all related to the difference between what I predict and what the real results are; it only measures how much the predicted values themselves vary. Do you have any idea why they do it this way?
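A minimal sketch of this point, using made-up numbers rather than the lesson's data: a constant model has zero prediction variance but a large error, while a perfect model has zero error but a large prediction variance. The `mse` helper and the arrays are illustrative only.

```python
import numpy as np

# Made-up target values for one test fold (illustrative only).
y_test = np.array([10.0, 20.0, 30.0, 40.0])

# Case 1: a constant model that always predicts the mean of the targets.
constant_preds = np.full_like(y_test, y_test.mean())
# Case 2: a perfect model whose predictions equal the targets.
perfect_preds = y_test.copy()

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

constant_mse = mse(y_test, constant_preds)
constant_var = float(np.var(constant_preds))
perfect_mse = mse(y_test, perfect_preds)
perfect_var = float(np.var(perfect_preds))

print(constant_mse, constant_var)  # 125.0 0.0
print(perfect_mse, perfect_var)    # 0.0 125.0
```

So np.var(predictions) on its own says nothing about how close the predictions are to y_test.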

Why did they not adopt the strategy they explained in a previous lesson on cross-validation, using cross_val_score? How could we use cross_val_score to obtain these results?

Here is their code/answer:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np

# filtered_cars is the DataFrame prepared earlier in the lesson.
def train_and_cross_val(cols):
    features = filtered_cars[cols]
    target = filtered_cars["mpg"]

    variance_values = []
    mse_values = []

    # KFold instance.
    kf = KFold(n_splits=10, shuffle=True, random_state=3)

    # Iterate over each fold.
    for train_index, test_index in kf.split(features):
        # Training and test sets.
        X_train, X_test = features.iloc[train_index], features.iloc[test_index]
        y_train, y_test = target.iloc[train_index], target.iloc[test_index]

        # Fit the model and make predictions.
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)

        # Calculate mse and variance values for this fold.
        mse = mean_squared_error(y_test, predictions)
        var = np.var(predictions)

        # Append to the lists to calculate overall average mse and variance values.
        variance_values.append(var)
        mse_values.append(mse)

    # Compute average mse and variance values.
    avg_mse = np.mean(mse_values)
    avg_var = np.mean(variance_values)
    return avg_mse, avg_var

two_mse, two_var = train_and_cross_val(["cylinders", "displacement"])
three_mse, three_var = train_and_cross_val(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin"])


Wow, massive question. I wish there had been someone to help me answer these while I was learning. Firstly, could you edit your question to have the enclosing ``` at the right places so all the code can be indented correctly? Disclaimer: I haven’t worked through that lesson.

  1. I agree with your reasoning: it makes no sense at all to calculate the variance of the predictions. The variance defined in the bias-variance trade-off:–variance_tradeoff
    is the variance of the predictions for a single test data point made by multiple models with the same architecture and hyperparameters, each trained on a different set of training data. For a group of test points you can just average the bias and variance over all of them, but the concept really describes a single test point. What is desired is that models trained on varying sets of training data do not give wildly varying predictions when given the same features of a single test data point. This is taken as evidence that the model does not overfit to any particular training set. A second assumption of this methodology is that the features in the test dataset follow patterns similar to those in the varied training sets.
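The definition above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic (y = 3x + noise), and `make_data`, `n_models` and the sample sizes are hypothetical choices, not anything from the lesson. The key detail is that every model predicts on the same fixed `X_test`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_data(n):
    """Draw a fresh synthetic training set (assumption: y = 3x + noise)."""
    X = rng.uniform(0, 10, size=(n, 1))
    y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=n)
    return X, y

# One fixed set of test points, shared by every model.
X_test = np.linspace(0, 10, 20).reshape(-1, 1)

# Train the same kind of model on many different training sets.
n_models = 50
all_preds = np.empty((n_models, len(X_test)))
for i in range(n_models):
    X_train, y_train = make_data(100)
    model = LinearRegression().fit(X_train, y_train)
    all_preds[i] = model.predict(X_test)

# Variance across models, computed separately for each test point ...
per_point_variance = all_preds.var(axis=0)
# ... then averaged over the test points.
avg_variance = per_point_variance.mean()

# For contrast: the spread of ONE model's predictions across test points,
# which is what np.var(predictions) measures.
spread_of_one_model = np.var(all_preds[0])

print(avg_variance, spread_of_one_model)
```

The two numbers measure different things: the first is how much the prediction for a given point changes from one training set to another, the second is how much the predicted values differ between points.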

To make it clear, you are correct in saying

it is not at all related to the difference between what I predict and what the real results are

but I want to emphasize that the variance I’m talking about here does not care about the true y value either. If you look at the formula in the first link above, the variance part only depends on f_hat. It is the bias part that depends on f_true (something similar to the observed y_true values). Note that f_true is never known and is different from Y, which is the y_true you have in the data. You can simulate an experiment that builds multiple models to somehow “prove” that equation by inserting Y in place of f_true and seeing that LHS ~ Bias^2 + Var. In an actual dataset, the true function f_true is unavailable, so the bias cannot be calculated. If the test set y_true values were used to calculate the bias, that bias would include the irreducible error too, which is why I wrote ~ and not = here.
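For reference, here is the standard bias-variance decomposition at a single test point x, written with the names used above. The noise term ε and its variance σ² are notation assumed here for the irreducible error; the expectation is over training sets and noise.

```latex
% Assumption: Y = f_{true}(x) + \varepsilon, with E[\varepsilon] = 0 and Var(\varepsilon) = \sigma^2.
E\left[ \left( Y - \hat{f}(x) \right)^2 \right]
  = \underbrace{\left( E[\hat{f}(x)] - f_{true}(x) \right)^2}_{\text{Bias}^2}
  + \underbrace{E\left[ \left( \hat{f}(x) - E[\hat{f}(x)] \right)^2 \right]}_{\text{Var}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Only the middle term is the variance, and it contains neither Y nor f_true. Replacing f_true with Y in the first term folds σ² into the bias.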

On the lesson not using train-test-split as instructed, I interpret that as the author writing the course over a long period with breaks in between, so he forgot what he had promised to teach (it sometimes happens). For your information, train-test-split is also called holdout CV, as opposed to the k-fold CV done in the lesson. Note that k-fold CV does not demonstrate the variance in the bias-variance tradeoff, because the test set keeps changing: each test point is not predicted by multiple models, so that notion of variance does not apply here.

On interpreting CV scores, your idea that a high variance of errors among folds indicates overfitting looks valid. However, I would still want to check it against the training set errors. For example, 3 folds of test set error all show 0.8, but what if the training error on all 3 folds were 0.95? I am not sure if this could happen, but I’m not convinced that no variance among CV scores means no overfitting. Nevertheless, I would still agree that high variance among CV scores indicates overfitting.
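One way to check training errors alongside the CV test errors is scikit-learn's `cross_validate` with `return_train_score=True`. The sketch below uses synthetic data as a stand-in, since the lesson's filtered_cars DataFrame is not reproduced here; the coefficients and sample sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption for illustration only).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 1.0, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=3)
results = cross_validate(
    LinearRegression(), X, y,
    cv=kf,
    scoring="neg_mean_squared_error",  # scores are negated MSE values
    return_train_score=True,           # also report the in-sample score per fold
)

# Flip the sign to get ordinary (positive) MSE values, one per fold.
train_mse = -results["train_score"]
test_mse = -results["test_score"]

print("mean train MSE:", train_mse.mean())
print("mean test MSE: ", test_mse.mean())
print("variance of test MSE across folds:", test_mse.var())
```

A test MSE far above the train MSE would suggest overfitting, and the last line gives the variance of the errors across folds that the question asked about.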

  2. cross_val_score has a limited API: it only outputs the array of scores, one for each fold. So with cross_val_score there is no way to get at the predicted values that went through the metric to produce the scores. In this lesson they needed the predictions to calculate var, so they had to generate the folds manually and fit and predict themselves to get hold of the predictions.
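As a rough sketch of how the average MSE (but not the prediction variance) could be obtained with cross_val_score, again on synthetic stand-in data rather than filtered_cars: the manual loop is included only to show that it produces the same per-fold values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(0, 1.0, size=120)

kf = KFold(n_splits=10, shuffle=True, random_state=3)

# cross_val_score returns one negated MSE per fold and nothing else.
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)
avg_mse = -scores.mean()

# Manual loop over the same folds, where the predictions are accessible.
manual_mse = []
for train_idx, test_idx in kf.split(X):
    lr = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = lr.predict(X[test_idx])
    manual_mse.append(mean_squared_error(y[test_idx], predictions))

print(avg_mse, np.mean(manual_mse))
```

With the lesson's data, X would be filtered_cars[cols] and y would be filtered_cars["mpg"].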

Dear Hanqi,

Sorry for the massive question, haha! But thank you very much for your detailed answer!
I was not aware that the variance has to be computed on exactly the same test set! So this was very useful. Thanks!

Hi Jessica, could you mark this as solved so future students can find it if you think it answers your question?