ML intermediate - Overfitting/cross validation - using cross_val_score

Hi,

I am trying to use cross_val_score in my function instead of iterating through the KFold instance.

Screen Link: https://app.dataquest.io/m/132/overfitting/5/cross-validation

Here is my function

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def train_and_cross_val(cols):
    lr = LinearRegression()
    features = filtered_cars[cols]
    target = filtered_cars["mpg"]
    kf = KFold(10, shuffle=True, random_state=3)
    variance_values = cross_val_score(lr, features, target, cv=kf)
    mses = cross_val_score(lr, features, target, scoring="neg_mean_squared_error", cv=kf)
    avg_mse = np.mean(mses)
    avg_var = np.mean(variance_values)

    return avg_mse, avg_var

I expected the function to return both avg_mse and avg_var correctly, but only avg_mse comes out right. The mean of the variances across the test folds is far too small.

How do I correctly access the variances of the predictions?

Many thanks,

Gopala

Did you get the solution?

Hi @gopalbhat50, @nitishkumarhardworke,

variance_values = cross_val_score(lr, features, target,cv=kf)

Here, cross_val_score is returning R² scores (the default score for LinearRegression), not variances. To calculate the variance of the predictions, take the variance of the output of cross_val_predict like this:

variance_values = np.var(cross_val_predict(lr, features, target,cv=kf))

Make sure to import that function by adding this line to your code:

from sklearn.model_selection import cross_val_predict
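
To see the difference between the two calls, here is a minimal sketch. It uses synthetic data as a stand-in for filtered_cars (an assumption, since the course dataset is not part of this thread), so the numbers will not match the exercise. It only illustrates that the default score is R² and that np.var over cross_val_predict gives one variance across all out-of-fold predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic stand-in for filtered_cars (an assumption, not the course data).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
target = 3.0 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(scale=0.5, size=200)

lr = LinearRegression()
kf = KFold(10, shuffle=True, random_state=3)

# With no scoring argument, cross_val_score uses the estimator's own
# score method, which for LinearRegression is R^2.
default_scores = cross_val_score(lr, features, target, cv=kf)
r2_scores = cross_val_score(lr, features, target, scoring="r2", cv=kf)

# cross_val_predict returns one out-of-fold prediction per row, so np.var
# yields a single variance over all rows rather than one value per fold.
predictions = cross_val_predict(lr, features, target, cv=kf)
overall_variance = np.var(predictions)

print(np.allclose(default_scores, r2_scores))
print(predictions.shape, overall_variance)
```

Each R² score is at most 1, which is why averaging them gives a number much smaller than a variance measured in squared mpg.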

Best,
Sahil

Worth pointing out that cross_val_predict will NOT work for screen 5 of the cross-validation mission. It returns the predictions for all 10 folds at once, whereas the exercise asks you to take one fold at a time, compute the variance of the predictions for each fold, and then average those variances at the end. I spent about an hour trying to make it work, so hopefully this helps someone else.

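Another point is that cross_val_score in the original question does return something related to variance: by default it gives each fold's R² score, which is a normalised measure (the share of the target's variance explained by the model, at most 1), hence “small”. It is not the variance of the predictions themselves, so I doubt it can simply be rescaled back, but I didn’t have time to look into that.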
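
For anyone who wants the per-fold numbers, here is a rough sketch of iterating through the KFold instance by hand. The function signature is my own adaptation (it takes NumPy arrays instead of column names), and the data is synthetic rather than filtered_cars, so treat both as assumptions rather than the course's reference solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def train_and_cross_val(features, target, n_splits=10):
    """Return the mean per-fold MSE and mean per-fold prediction variance."""
    kf = KFold(n_splits, shuffle=True, random_state=3)
    mse_values, variance_values = [], []
    for train_index, test_index in kf.split(features):
        # Fit on the training folds, predict on the held-out fold.
        lr = LinearRegression()
        lr.fit(features[train_index], target[train_index])
        predictions = lr.predict(features[test_index])
        # Record this fold's error and the variance of its predictions.
        mse_values.append(mean_squared_error(target[test_index], predictions))
        variance_values.append(np.var(predictions))
    return np.mean(mse_values), np.mean(variance_values)


# Synthetic stand-in for the course data (an assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
target = 3.0 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(scale=0.5, size=200)

avg_mse, avg_var = train_and_cross_val(features, target)
print(avg_mse, avg_var)
```

With a pandas DataFrame you would index rows with .iloc[train_index] and .iloc[test_index] instead of plain brackets.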
