https://app.dataquest.io/m/140/multivariate-k-nearest-neighbors/8/calculating-mse-using-scikit-learn

I am new to scikit-learn and machine learning and have only just started the Data Scientist path. I usually work on the DataQuest screen as well as in a local Jupyter notebook. I'm getting different predictions and MSE/RMSE values and just wanted to confirm whether this is to be expected and not something I have done wrong.

DataQuest - first few predictions: [screenshot]

My first few predictions: [screenshot]

DataQuest MSE and RMSE: [screenshot]

My MSE and RMSE: [screenshot]

The code I use on DataQuest and in my local environment is exactly the same.

If differences are to be expected, is there any way in which I could set a random seed to avoid confusion moving forward?

I also noticed that the first predictions are made using the default 'minkowski' metric, whereas on the MSE/RMSE calculation screen the code switches to the 'euclidean' metric. Could anyone explain the difference?

My code:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# normalised_listings is the normalised DataFrame from the earlier screens
train_df = normalised_listings.iloc[0:2792].copy()
test_df = normalised_listings.iloc[2792:].copy()

knn = KNeighborsRegressor(algorithm="brute")
train_features = train_df[["accommodates", "bathrooms"]]   # training data - feature columns
train_target = train_df["price"]                           # training data - target column
knn.fit(train_features, train_target)
predictions = knn.predict(test_df[["accommodates", "bathrooms"]])

two_features_mse = mean_squared_error(test_df["price"], predictions)
two_features_rmse = np.sqrt(two_features_mse)
print(two_features_mse, two_features_rmse)

Thank you for your help!

Hi @Jac

I think the difference comes from the random shuffling of the rows (e.g. via np.random.permutation() on an earlier screen). Unless both environments use the same seed, set with np.random.seed() before shuffling, the row order, and therefore the train/test split, will differ, which would explain the different predictions and error values. I suppose that you are using the same dataset as given on the platform.
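
For example, something like this (a minimal sketch; the CSV name, variable names and seed value are assumptions, not the lesson's exact code) makes the shuffle, and therefore the train/test split, reproducible:

import numpy as np
import pandas as pd

# Hypothetical example: the file name and seed value are assumptions.
listings = pd.read_csv("dc_airbnb.csv")

# Seeding before the shuffle makes np.random.permutation return the
# same row order on every run, so the later iloc split is identical too.
np.random.seed(1)
shuffled_index = np.random.permutation(listings.index)
listings = listings.loc[shuffled_index]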

Yes, the default metric used by KNeighborsRegressor in the sklearn.neighbors package is 'minkowski', with a default p value of 2. With p set to 2, the Minkowski metric is the same as the Euclidean distance.

Minkowski is a general form of several other metrics: if p=1, it is equivalent to the Manhattan distance ('ManhattanDistance'), and if p=2, it is equivalent to the Euclidean distance ('EuclideanDistance'). You can see the code here. It uses a simple if/else to assign the concrete metric according to the p value.
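
If you want to check this on your own data, here is a quick sketch (reusing the train_features, train_target and test_df variables from your code above) that fits one regressor with the default Minkowski metric and one with metric="euclidean" and compares the predictions:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Default metric: "minkowski" with p=2.
knn_minkowski = KNeighborsRegressor(algorithm="brute")
knn_minkowski.fit(train_features, train_target)

# The same model with the metric set explicitly to "euclidean".
knn_euclidean = KNeighborsRegressor(algorithm="brute", metric="euclidean")
knn_euclidean.fit(train_features, train_target)

test_features = test_df[["accommodates", "bathrooms"]]
preds_minkowski = knn_minkowski.predict(test_features)
preds_euclidean = knn_euclidean.predict(test_features)

# Both metrics find the same neighbours, so the predictions match.
print(np.allclose(preds_minkowski, preds_euclidean))   # True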

Here is the formula for the Minkowski distance:

$$\left( \sum_{i=1}^n |x_i - y_i|^p \right)^{1/p}$$

And the Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$; for $n$ dimensions it can be written as $\sqrt{\sum_{i=1}^n (x_i - y_i)^2}$. Similarly, the Manhattan distance between these points is $|x_2 - x_1| + |y_2 - y_1|$.
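
To make the formulas concrete, here is a tiny NumPy check (with made-up points) that Minkowski with p=1 gives the Manhattan distance and p=2 gives the Euclidean distance:

import numpy as np

def minkowski(x, y, p):
    # (sum_i |x_i - y_i|^p)^(1/p)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

print(minkowski(x, y, p=1))           # 7.0 -> Manhattan: |4-1| + |6-2|
print(minkowski(x, y, p=2))           # 5.0 -> Euclidean: sqrt(3**2 + 4**2)
print(np.sqrt(np.sum((x - y) ** 2)))  # 5.0, matches the p=2 result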

I hope this helps.