Cross validation 154-2

Hey there fellow learners!

Just a quick question here.

Screen Link:
https://app.dataquest.io/m/154/cross-validation/2/holdout-validation

My Code:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Swap the roles of the two halves between iterations
train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one

# Iteration one: train on split_one, test on split_two
knn = KNeighborsRegressor()
knn.fit(train_one[['accommodates']], train_one['price'])
prediction_one = knn.predict(test_one[['accommodates']])
msq_one = mean_squared_error(test_one['price'], prediction_one)
iteration_one_rmse = msq_one**0.5

# Iteration two: train on split_two, test on split_one
knn_two = KNeighborsRegressor()
knn_two.fit(train_two[['accommodates']], train_two['price'])
prediction_two = knn_two.predict(test_two[['accommodates']])
msq_two = mean_squared_error(test_two['price'], prediction_two)
iteration_two_rmse = msq_two**0.5

avg_rmse = np.mean([iteration_two_rmse, iteration_one_rmse])

What I expected to happen:

avg_rmse = 128.96254732948216

What actually happened:

avg_rmse = 123.7207888486061

So I seem to be slightly off here. I compared my code with the answer, and the difference is that the answer doesn't create knn_two at all. That seems a bit counter-intuitive to me, because the original knn was already trained on the rows that make up test_two, right? So re-using it would unfairly improve its performance on that test set. That's why I created a new knn.

I am a bit confused about why I shouldn't create a second knn.

Cheers!

Check the red text above. It happens all the time: a slight copy-paste error can lead to wrong results :smiley:

Oops @fedepereira!

I have corrected the mistake. It's not that I was stuck on the code; my question was more about the difference between the two approaches. I reverse engineered my mistake ;).

But I would guess that training two separate models should be better? Even though my RMSE came out further from the expected answer.

Hi @DavidMiedema,
sorry for my delay here, I was busy these days!
I just did it like you, i.e. two different models, each trained separately, and I get the pass without problems. If you replace that knn with knn_two you should get a pass as well.

I think both answers are valid, since the model doesn't keep any memory between fits: calling fit again simply replaces the stored training data. That's why both approaches give the same result. I just tried re-using the same model instance in both steps, running the fit and predict methods, and I get a pass. A minimal sketch of that single-instance version is below.
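
For reference, here's what I mean by re-using one instance (just an illustration, assuming the same split_one / split_two DataFrames and the 'accommodates' / 'price' columns from the exercise):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

knn = KNeighborsRegressor()

# Iteration one: fit on split_one, test on split_two
knn.fit(split_one[['accommodates']], split_one['price'])
predictions = knn.predict(split_two[['accommodates']])
iteration_one_rmse = mean_squared_error(split_two['price'], predictions) ** 0.5

# Iteration two: calling fit() again on the SAME instance replaces the data stored in iteration one
knn.fit(split_two[['accommodates']], split_two['price'])
predictions = knn.predict(split_one[['accommodates']])
iteration_two_rmse = mean_squared_error(split_one['price'], predictions) ** 0.5

avg_rmse = np.mean([iteration_one_rmse, iteration_two_rmse])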

Good question,

Since KNeighborsRegressor() returns an object, the line knn = KNeighborsRegressor() creates a new object (or model) and binds it to the name knn. It would be perfectly acceptable to call the second model something else (model2, Bob, … sldkfasld?), but when you run knn = KNeighborsRegressor() again, the name knn simply points to a brand-new object. Just like if we did:

L = [1, 2, 3]
L = 0
print(L)

we know that we will get 0 and NOT [1, 2, 3].
A benefit of using the same name is that it makes it easy to combine the two iterations into a single loop if you want, as sketched below.
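
A rough sketch of that loop version (again just an illustration, assuming the split_one / split_two DataFrames from the exercise):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rmse_values = []
# Each tuple is one holdout iteration: (training half, testing half)
for train_df, test_df in [(split_one, split_two), (split_two, split_one)]:
    knn = KNeighborsRegressor()  # the name knn is rebound to a fresh model on each pass
    knn.fit(train_df[['accommodates']], train_df['price'])
    predictions = knn.predict(test_df[['accommodates']])
    rmse_values.append(mean_squared_error(test_df['price'], predictions) ** 0.5)

avg_rmse = np.mean(rmse_values)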

Alternatively, you could also create a function instead of creating two objects:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one
rmse_values = []

features = ['accommodates']
# Train a model on train_df, evaluate it on test_df, and record the RMSE
def training_model(train_df, test_df):
    knn = KNeighborsRegressor(algorithm='auto', n_neighbors=5)
    train_target = train_df['price']
    knn.fit(X=train_df[features], y=train_target)  # fit() returns the estimator itself, no need to store it
    predictions = knn.predict(X=test_df[features])
    rmse_values.append(np.sqrt(mean_squared_error(y_true=test_df['price'], y_pred=predictions)))

training_model(train_one, test_one)
training_model(train_two, test_two)
avg_rmse = np.mean(rmse_values)
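
A small design tweak you might consider (my suggestion, not from the original post): have the function return the RMSE instead of appending to a module-level list, which keeps it self-contained:

def training_model(train_df, test_df):
    # Same idea, but return the RMSE instead of mutating a global list
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(train_df[features], train_df['price'])
    predictions = knn.predict(test_df[features])
    return np.sqrt(mean_squared_error(test_df['price'], predictions))

avg_rmse = np.mean([training_model(train_one, test_one),
                    training_model(train_two, test_two)])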