132-5 cross_val_predict generating a different result than the DQ expected output

Screen Link: https://app.dataquest.io/m/132/overfitting/5/cross-validation

Your Code:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.metrics import mean_squared_error
import numpy as np

def train_and_cross_val(cols):
    features = cols
    target = 'mpg'
    kf = KFold(n_splits=10, shuffle=True, random_state=3)

    lr = LinearRegression()
    cv_score = np.absolute(cross_val_score(lr,filtered_cars[features],filtered_cars[target],
    predictions = cross_val_predict(lr,filtered_cars[features],filtered_cars[target],
    avg_mse = np.mean(cv_score)
    avg_var = np.var(predictions)
    return (avg_mse,avg_var)

two_mse, two_var = train_and_cross_val(["cylinders","displacement"])
three_mse, three_var = train_and_cross_val(["cylinders","displacement","horsepower"])
four_mse, four_var = train_and_cross_val(["cylinders","displacement","horsepower","weight"])
five_mse, five_var = train_and_cross_val(["cylinders","displacement","horsepower","weight","acceleration"])
six_mse, six_var = train_and_cross_val(["cylinders","displacement","horsepower","weight","acceleration","model year"])
seven_mse, seven_var = train_and_cross_val(["cylinders","displacement","horsepower","weight","acceleration","model year","origin"])

What I expected to happen:
Here I am using the cross_val_predict function, unlike the answer given by DQ.
What actually happened: The returned variance does not match the DQ answer, although the MSE does. Why is this happening?

Other details:

In your implementation of train_and_cross_val, the predictions list contains all of the predictions at once.

In Dataquest’s implementation, that array contains only the predictions for the specific fold being handled on each iteration. The MSE and variance are then computed for this fold and appended to mse_values and variance_values respectively.

After iterating over all folds, Dataquest’s implementation computes the mean of both mse_values and variance_values and returns these values.
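To make the difference concrete, here is a minimal sketch of the two approaches side by side. It is not Dataquest's actual code: the data is synthetic (a stand-in for filtered_cars), and the names per_fold_* and pooled_* are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for filtered_cars (assumption: not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

kf = KFold(n_splits=10, shuffle=True, random_state=3)

# Per-fold approach: compute MSE and variance inside each fold, then average.
mse_values, variance_values = [], []
for train_idx, test_idx in kf.split(X):
    lr = LinearRegression()
    lr.fit(X[train_idx], y[train_idx])
    fold_predictions = lr.predict(X[test_idx])
    mse_values.append(mean_squared_error(y[test_idx], fold_predictions))
    variance_values.append(np.var(fold_predictions))
per_fold_mse = np.mean(mse_values)
per_fold_var = np.mean(variance_values)

# Pooled approach: collect every out-of-fold prediction, then compute once.
pooled_predictions = cross_val_predict(LinearRegression(), X, y, cv=kf)
pooled_mse = mean_squared_error(y, pooled_predictions)
pooled_var = np.var(pooled_predictions)

print(per_fold_mse, pooled_mse)  # these agree (100 rows split into 10 equal folds)
print(per_fold_var, pooled_var)  # these generally differ
```

With 100 rows and 10 folds, every fold has the same size, so the two MSE figures agree while the two variance figures do not.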

The MSE values coincide because the mean of the per-fold MSE values equals the MSE over all the values, provided every fold has the same size. The same doesn’t hold for the variance.

I don’t have the time to give a mathematical proof of the above statement concerning the mean (let me know if you want one and I’ll write it some other time), but I’ll give an example that shows that the same doesn’t happen to the variance.

>>> import numpy as np
>>> a = [1, 5]
>>> b = [2, 3]
>>> var_a = np.var(a)
>>> var_b = np.var(b)
>>> np.mean((var_a, var_b))
2.125
>>> np.var(a+b)
2.1875

Since the last two printed values are different, we conclude that it isn’t necessarily the case that the mean of the variances equals the variance of the whole thing.
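As a rough illustration of where the gap comes from: for groups of equal size, the variance of the pooled values equals the mean of the group variances plus the variance of the group means. The sketch below checks this on the same two lists; the variable names are made up for the example.

```python
import numpy as np

a = [1, 5]
b = [2, 3]

# Average of the two within-group variances.
mean_of_variances = np.mean([np.var(a), np.var(b)])
# Variance of all four values together (a + b concatenates the lists).
pooled_variance = np.var(a + b)
# Variance of the two group means (3.0 and 2.5).
variance_of_means = np.var([np.mean(a), np.mean(b)])

print(mean_of_variances)                      # 2.125
print(pooled_variance)                        # 2.1875
print(mean_of_variances + variance_of_means)  # 2.1875
```

The extra term is zero only when all groups share the same mean, which is rarely the case for predictions on different folds.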


The discussion section explains Bruno’s statement.

It is important to note that the sample sizes must be the same for the grand mean to equal the simple mean over all objects in all samples. This may not hold if the number of observations is not evenly divisible by the number of folds.
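A small numeric check of this point, using made-up squared errors and a hypothetical helper mean_of_fold_mses: with equal fold sizes the average of the per-fold MSEs matches the pooled MSE, and with unequal sizes it only matches after weighting each fold by its size.

```python
import numpy as np

# Hypothetical squared errors for six test rows (made-up numbers).
squared_errors = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0])
pooled_mse = squared_errors.mean()

def mean_of_fold_mses(errors, fold_sizes):
    """Split errors into consecutive folds of the given sizes, then average the per-fold MSEs."""
    split_points = np.cumsum(fold_sizes)[:-1]
    folds = np.split(errors, split_points)
    return np.mean([fold.mean() for fold in folds])

equal_folds = mean_of_fold_mses(squared_errors, [3, 3])    # matches pooled_mse
unequal_folds = mean_of_fold_mses(squared_errors, [4, 2])  # 19.0, does not match

# Weighting each fold's MSE by its size recovers the pooled MSE.
weighted = np.average(
    [squared_errors[:4].mean(), squared_errors[4:].mean()], weights=[4, 2]
)

print(pooled_mse, equal_folds, unequal_folds, weighted)
```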

I have a question on this point about cross validation. In most sources, they would make predictions on the test set fold, then calculate some performance metric using the predictions and labels for the rows in that fold.
However, in https://www.amazon.com/Real-World-Machine-Learning-Henrik-Brink/dp/1617291927 they explain k-fold CV by finishing the predictions on all folds first, then calculating the performance metric using the predictions and labels of all rows at once.
Would you know why they may be doing that?

Back to this Dataquest exercise: what was the lesson driving at by calculating the mean/variance of the predictions rather than the mean/variance of the mean/variance aggregated from each fold?
This student had the same question: Machine Learning in Python: Intermediate - Course 5/8


Hey, Han. I’m not that familiar with the literature, I’d have to investigate, but it seems to me like the details of k-fold aren’t pinned down completely. In other words, k-fold refers to the general idea behind it, and then people can implement it in different ways — that’s my impression.

I’ll have to get back to you on this. I’ll be travelling for a few days and won’t have time to investigate and get context on this.