cars['horsepower'] contains six '?' values. Why does the solution code work without explicitly addressing these values, while my code does not?

Screen Link: https://app.dataquest.io/m/132/overfitting/5/cross-validation

My Code:

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
import numpy as np

# print((cars['horsepower'] == '?').sum())
# cars['horsepower'] = cars['horsepower'].replace('?',np.nan).astype(float)
# cars['horsepower'] = cars['horsepower'].fillna(cars['horsepower'].mean())

def train_and_cross_val(cols):
    mses = []
    variances= []
    kf = KFold(10,shuffle=True,random_state=3)
    for train,test in kf.split(cars):
        lr = LinearRegression()
        lr.fit(cars[cols].iloc[train],cars['mpg'].iloc[train])
        predictions = lr.predict(cars[cols].iloc[test])
        mses.append(mean_squared_error(cars['mpg'].iloc[test],predictions))
        variances.append(np.var(predictions))
    return (np.average(mses),np.average(variances))
two_mse,two_var = train_and_cross_val(['cylinders','displacement'])
three_mse,three_var = train_and_cross_val(['cylinders','displacement','horsepower'])
four_mse,four_var = train_and_cross_val(['cylinders','displacement','horsepower','weight'])
five_mse,five_var = train_and_cross_val(['cylinders','displacement','horsepower','weight','acceleration'])
six_mse,six_var = train_and_cross_val(['cylinders','displacement','horsepower','weight','acceleration','model year'])
seven_mse,seven_var = train_and_cross_val(['cylinders','displacement','horsepower','weight','acceleration','model year','origin'])

If you run my code, you'll get an error because cars['horsepower'] contains six '?' values, so the series can't be converted to float and can't be passed to the LinearRegression object's fit function. If you uncomment the three commented lines at the top, they first print that there are indeed six '?' values, then replace those values with the mean of the remaining values. The code then runs smoothly, and the resulting variables are very close to the expected values shown in the variable inspector.
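To make the failure concrete, here is a minimal sketch on a toy series (the values are made up and are not taken from the real dataset). It shows that a single '?' blocks the float conversion, and that the replace-then-fillna approach from the commented lines gets around it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for cars['horsepower']; the real column has six '?' values.
hp = pd.Series(["130", "?", "95", "70"])

# Converting directly fails because '?' is not a valid number.
try:
    hp.astype(float)
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print("conversion failed:", err)

# Same idea as the commented lines: mark '?' as missing, convert, then impute the mean.
cleaned = hp.replace("?", np.nan).astype(float)
cleaned = cleaned.fillna(cleaned.mean())
print(cleaned)
```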

Copying and pasting the solution code works, but as far as I can tell, there isn't anything fundamentally different between the solution code and mine; that is, nothing in the solution code that would seem to have any effect on the '?' values.

You are working with the cars dataframe.

The solution is working with the filtered_cars dataframe. filtered_cars was created in the first Step of the Mission and has been used for each subsequent Step as well.

You can check that First Step to see how filtered_cars was defined.
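As a rough illustration of why the two dataframes behave differently, here is a sketch on toy data. The exact definition of filtered_cars is in the first Step; the version below is only an assumption that it drops the rows where horsepower is '?' and converts the column to float.

```python
import pandas as pd

# Toy stand-in for the cars dataframe (values are made up for illustration).
cars = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0, 30.0],
    "horsepower": ["130", "?", "95", "70"],
})

# Assumed definition, not copied from the Mission: drop '?' rows, then convert to float.
filtered_cars = cars[cars["horsepower"] != "?"].copy()
filtered_cars["horsepower"] = filtered_cars["horsepower"].astype(float)

print((cars["horsepower"] == "?").sum())           # '?' rows still present in cars
print((filtered_cars["horsepower"] == "?").sum())  # none left after filtering
print(filtered_cars["horsepower"].dtype)
```

With a dataframe built like this, passing 'horsepower' to fit works without any extra cleaning, which would explain why the solution code never touches the '?' values.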


Oof, such a silly mistake! I guess I got so caught up in the minutiae that I forgot about the obvious. Thank you!
