
Fitting StandardScaler on the training data and only transforming the test set with the same parameters (does this lose model generalization?)

The standard scaler is said to be better than the min-max normalizer in terms of model generalization: the mean and standard deviation are not the same for every attribute and can vary between data sets, and it doesn't restrict new data to one exact range. If that is the case, why should the transformation be fit only to the training data, storing its mean and standard deviation, and then applied to the unseen test set with those same parameters? Won't this also give biased results in model evaluation?

Thank you in advance!

I am not quite sure what you mean here. Test data should be standardized as well.

Because, as you say, it's unseen data. In practice we don't, and shouldn't, know anything about it, so we standardize it using the mean and standard deviation computed from the training data. This also minimizes data leakage, because we don't want the model to be trained on information from the test set.
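To make the fit/transform split concrete, here is a minimal sketch in plain Python that mirrors by hand what StandardScaler's fit and transform steps do. The helper names `fit_scaler` and `transform` and the sample values are made up for illustration; they are not part of any library. The standard deviation is the population one, which is what StandardScaler uses.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation (hypothetical helper)."""
    return fmean(values), pstdev(values)

def transform(values, mean, std):
    """Apply the z-score using parameters that were learned elsewhere."""
    return [(v - mean) / std for v in values]

# Made-up sample data for a single feature.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [11.0, 25.0]

# Fit on the training data only: the test set never influences mean or std.
mean, std = fit_scaler(train)

# Transform both sets with the same stored parameters.
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(mean, std)
print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 by construction. The scaled test values generally won't, and that is fine: they are expressed in the training set's units, which is exactly what the model was trained on.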

The 2nd para in this post should also clarify some details around this.

My only doubt was about the advantage the z-score has of not capping values within a fixed range, as min-max does. I thought that if the mean and standard deviation were kept the same, we would somehow be capping new unseen data. But those are two different things: the z-score method still lets new data through in the same way, without capping it.
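A quick numeric check of that point, as a sketch with made-up values: reusing the training mean and standard deviation is just a shift and a rescale, so unseen values far outside the training range keep their distance instead of being cut off. For comparison, the same is done with a min-max formula using the training minimum and maximum and no clipping step; that choice is an assumption of this sketch.

```python
from statistics import fmean, pstdev

# Made-up sample data for a single feature.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
unseen = [2.0, 14.0, 40.0]  # two values lie well outside the training range

# Parameters come from the training data only.
mean, std = fmean(train), pstdev(train)
lo, hi = min(train), max(train)

# z-score with training parameters: nothing bounds the result.
z_scores = [(v - mean) / std for v in unseen]

# min-max with training parameters and no clipping (assumption of this sketch).
min_max = [(v - lo) / (hi - lo) for v in unseen]

print(z_scores)
print(min_max)
```

The value 40.0 ends up more than nine training standard deviations above the training mean, so nothing has been capped. Without an explicit clipping step, the min-max formula also leaves the [0, 1] interval for such values.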

Thank you for your clarification @the_doctor