Multivariate KNN & Standardization vs Normalization

Screen Link: https://app.dataquest.io/m/140/multivariate-k-nearest-neighbors/4/normalize-columns

In this lesson, could you guys clarify that this is standardization, not normalization? In addition, can you clarify why you chose to standardize the data instead of normalizing it with min-max normalization, as you did for the weighted sum problem?

Hey, Melissa.

“Normalization” is sometimes used as an umbrella term for “adjusting values measured on different scales to a notionally common scale”. In my experience, though, it most commonly refers to min-max normalization.
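To make the distinction concrete, here is a minimal sketch of the two transformations in plain Python. The function names and the sample numbers are mine, not the lesson's, and I'm assuming the population standard deviation here (pandas' `std()` defaults to the sample version, so the lesson's numbers may differ slightly).

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Min-max normalization: the smallest value maps to 0, the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization (z-scores): subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

sample = [1, 2, 3, 4, 10]  # made-up numbers, for illustration only

scaled = min_max_scale(sample)   # bounded: every value lies in [0, 1]
zscores = standardize(sample)    # mean 0, standard deviation 1, but no fixed bounds

print(scaled)
print(zscores)
```

The key difference for what follows is that min-max output is bounded, while z-scores are not.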

This is to say that I’m inclined to agree with you, but I won’t commit to the opinion that using “normalization” here is actually wrong; it’s a defensible (if weak) point of view.

Now, regarding which to choose, I actually think that min-max scaling is more appropriate here; kNN relies heavily on distances, and standardization doesn’t bound the features to a common range, so it doesn’t allow you to properly compare different features.

For example, the maximum value of maximum_nights after standardizing is around 61. The maximum value of number_of_reviews post-standardization is roughly 12.

Using just these two features (and assuming the maximum is representative of the column, which isn’t necessarily the case; this is just an example to convey the idea), the former would crush the latter: it would have a much stronger influence on the distance.
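Here is a small sketch of that effect. The two columns below are made up (they only borrow the column names, not the lesson's data): one has a single large outlier, the other is evenly spread. After standardizing, their maxima end up far apart; after min-max scaling, both top out at 1.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-scores using the population standard deviation (an assumption)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Rescale linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical data: 99 listings with a 1-night cap plus one extreme outlier,
# and review counts spread evenly from 0 to 99.
maximum_nights = [1] * 99 + [1000]
number_of_reviews = list(range(100))

z_nights = standardize(maximum_nights)
z_reviews = standardize(number_of_reviews)
mm_nights = min_max_scale(maximum_nights)
mm_reviews = min_max_scale(number_of_reviews)

# Standardized maxima differ a lot (roughly 9.9 versus roughly 1.7 here).
print(max(z_nights), max(z_reviews))
# Min-max maxima are both exactly 1.
print(max(mm_nights), max(mm_reviews))
```

In a Euclidean distance, the standardized outlier column can contribute a much larger difference than the other column ever could, which is the "crushing" I mean.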

Min-max normalization allows us to escape from this problem. The content of this screen explains this pictorially in a different context.

At the end of the day, what performs better is the best choice, so you can even try both and see where it leads you.
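If you want to try both, something along these lines would do. This is only a sketch under several assumptions: the data is synthetic, the price formula and the `holdout_rmse` and `knn_predict` helpers are hypothetical, and I scale the whole column before splitting to keep it short. It is a plain-Python kNN, not the lesson's code.

```python
import math
import random
from statistics import mean, pstdev

def min_max_scale(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

def standardize(col):
    mu, sigma = mean(col), pstdev(col)
    return [(v - mu) / sigma for v in col]

def knn_predict(train_X, train_y, row, k=5):
    """Average the target of the k training rows closest to `row` (Euclidean distance)."""
    nearest = sorted((math.dist(row, x), y) for x, y in zip(train_X, train_y))[:k]
    return mean(y for _, y in nearest)

def holdout_rmse(columns, target, scaler, k=5, split=0.75):
    """Scale each feature column, split into train/test by position, return test RMSE."""
    rows = list(zip(*(scaler(c) for c in columns)))
    cut = int(len(rows) * split)
    train_X, test_X = rows[:cut], rows[cut:]
    train_y, test_y = target[:cut], target[cut:]
    sq_errors = [(knn_predict(train_X, train_y, r, k) - y) ** 2
                 for r, y in zip(test_X, test_y)]
    return math.sqrt(mean(sq_errors))

# Synthetic, made-up data: price depends on number_of_reviews, not on maximum_nights.
rng = random.Random(0)
number_of_reviews = [rng.randint(0, 100) for _ in range(200)]
maximum_nights = [rng.choice([7, 30, 1125]) for _ in range(200)]
price = [50 + 2 * r + rng.gauss(0, 10) for r in number_of_reviews]

features = [number_of_reviews, maximum_nights]
rmse_min_max = holdout_rmse(features, price, min_max_scale)
rmse_standardized = holdout_rmse(features, price, standardize)

print("min-max RMSE:     ", rmse_min_max)
print("standardized RMSE:", rmse_standardized)
```

Whichever scaler gives the lower error on your actual data is the one to keep; the result on this toy data says nothing about the lesson's dataset.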

I hope this helps.
