Different Answers for MSE & RMSE? (Random Seed & Reproducibility)

Screen Link:
https://app.dataquest.io/m/140/multivariate-k-nearest-neighbors/9/using-more-features

Hello,

Can someone please explain why I am getting different values for MSE and RMSE?

I am following along with the lesson in Jupyter Notebook on my computer so I can freely experiment.

The DQ answer is:
MSE: 13322.432400455064
RMSE: 115.42284176217056

In Jupyter on my computer, I am getting:
MSE: 12702.843731513083
RMSE: 112.7068930079837

I am having trouble figuring out why. Could it be the data cleaning steps? I carried them over from the previous mission, so maybe something is off there.

I am even using the same code as the answer. I’ve attached the Jupyter Notebook below.

Thank you for your time

DQ SKLearn Different Answers.ipynb (6.8 KB)

OK, I think I just answered my own question:

I forgot to call np.random.seed(1), so when I shuffled the dataset using

dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]

it produced a different random ordering, and thus the 5 nearest neighbors were different. Is this correct?
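To check this, here is a minimal sketch of the shuffle step. The small DataFrame and the shuffled() helper are hypothetical stand-ins I made up for illustration (not the real dc_listings data or the lesson's code); the point is only that reseeding with the same value before the shuffle gives the same row order each time.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dc_listings (made-up prices, default 0..5 index).
listings = pd.DataFrame({"price": [100, 150, 80, 200, 120, 90]})

def shuffled(frame, seed=None):
    """Return the rows of `frame` in a random order.

    If `seed` is given, the global NumPy generator is reseeded first,
    which makes the resulting order repeatable.
    """
    if seed is not None:
        np.random.seed(seed)
    return frame.loc[np.random.permutation(len(frame))]

first = shuffled(listings, seed=1)
second = shuffled(listings, seed=1)

# Same seed before each shuffle, so both runs produce the same row order.
print(first.index.tolist() == second.index.tolist())  # True
```

Without the seed argument, each call draws from wherever the global generator currently is, so the order (and therefore the train/test split and the error metrics) can change from run to run.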

I have new questions now:

How does calling np.random.seed(1) affect dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]? Does np.random.permutation() somehow know what was passed into np.random.seed()?

So, essentially, do we always want to shuffle our dataset before passing it into Scikit-Learn?

How does this work in practice in terms of reproducibility? Do you want to call np.random.seed() every time?
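From experimenting a bit, my current understanding (please correct me if this is wrong) is that np.random.seed() reseeds a single module-level generator, and np.random.permutation() draws from that same generator. The sketch below is my own small test, not something from the lesson: it compares the global functions against an explicit np.random.RandomState(1), and also tries the newer np.random.default_rng() as a way to keep the seed local to one object.

```python
import numpy as np

# Global route: seed() resets the shared module-level generator,
# and permutation() then draws from that same shared generator.
np.random.seed(1)
from_global = np.random.permutation(10)

# Explicit route: a separate legacy generator created with the same seed.
from_local = np.random.RandomState(1).permutation(10)

# Both routes yield the same ordering, which suggests the global
# functions are just methods of one hidden RandomState instance.
same_stream = bool((from_global == from_local).all())
print(same_stream)  # True

# Alternative: a local Generator object. It is reproducible for a given
# seed, but its ordering is not expected to match the legacy one above.
perm_a = np.random.default_rng(1).permutation(10)
perm_b = np.random.default_rng(1).permutation(10)
print(bool((perm_a == perm_b).all()))  # True
```

If that is right, then setting the seed once at the top of the notebook would be enough for a full top-to-bottom run, while re-running individual cells out of order could still change the results.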

Thanks!