Introduction to KNN - Randomizing and sorting

Screen Link:

https://app.dataquest.io/m/139/introduction-to-k-nearest-neighbors/6/randomizing-and-sorting

Just wanted to get some help understanding the part where it says:

"If we sort by the distance column and then just select the first 5 living spaces, we would be biasing the result to the ordering of the dataset."

What does it mean when we say it “biases” the result to the ordering? Can I get a bit more detail on that? How does skipping the randomization step before calling sort_values bias our result?

Thanks!

Hey, Nico.

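I think the author was trying to avoid the consequences of stable sorting. A stable sort keeps tied rows in the order they had before sorting, so if several living spaces have the same distance, the ones that appear earlier in the dataset always come first. Taking the first 5 rows would then favor some listings purely because of where they sit in the dataset.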

However, the default sorting method used by pandas.DataFrame.sort_values is quicksort. Quicksort isn’t guaranteed to be stable, but the order it gives tied rows isn’t random either: it still depends on the order of the input. So to be safe, it’s better to randomize the dataset before sorting.
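Here is a rough sketch of the effect, in case it helps. The `listings` data and its `price` column are made up for illustration, and I'm passing `kind="mergesort"` (a stable sort) on purpose so the tie behaviour is easy to see. It is not the lesson's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: four of them tie at distance 0.
listings = pd.DataFrame({
    "distance": [0, 0, 0, 0, 1, 2],
    "price": [50, 60, 70, 80, 90, 100],
})

# A stable sort keeps tied rows in their original order, so the
# "first 3" are always the rows that appeared earliest in the dataset.
stable_top3 = listings.sort_values("distance", kind="mergesort").head(3)
print(stable_top3["price"].tolist())  # [50, 60, 70]

# Shuffle the rows first, then sort: which tied rows land in the top 3
# now depends on the shuffle rather than on the original ordering.
rng = np.random.default_rng(seed=1)
shuffled = listings.iloc[rng.permutation(len(listings))]
shuffled_top3 = shuffled.sort_values("distance", kind="mergesort").head(3)
print(shuffled_top3["price"].tolist())
```

Without the shuffle, the listing priced at 80 can never make it into the top 3 even though it is just as close as the other three. With the shuffle, each tied listing gets a fair chance.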
