minmax_scale() vs MinMaxScaler on train and holdout

I have a conceptual question about how min-max scaling (or any scaling, for that matter) should be used for preprocessing.

Approach 1:
On screen 2 we apply the function minmax_scale() separately to the train and holdout sets. Hence, different min and max values are used for the transformation of each dataset.

  • Is this the correct approach to min-max scaling?
  • Shouldn’t the values in the holdout set be scaled using the same min and max values learned from the train set?

Approach 2 (using MinMaxScaler from sklearn)
In fact, this is what the MinMaxScaler transformer in sklearn is designed to do, with the following steps:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()

  • Fitting the scaler on the train set and transforming it
    X_train_minmax = min_max_scaler.fit_transform(X_train)
  • Using this fitted instance to transform the holdout data
    X_test_minmax = min_max_scaler.transform(X_test)

Here we only use the transform method on the holdout set, which means that the scaler attributes learned from the train data are used for scaling.
Approach 1 is akin to applying fit_transform to both datasets.
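For concreteness, here is a minimal runnable sketch of the Approach 2 workflow (the toy X_train/X_test values are my own, not from the lesson):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # hypothetical toy data; MinMaxScaler expects 2-D input
    X_train = np.array([[3.0], [23.0], [45.0], [89.0]])
    X_test = np.array([[10.0], [30.0]])

    min_max_scaler = MinMaxScaler()
    X_train_minmax = min_max_scaler.fit_transform(X_train)  # learns min = 3, max = 89 from train
    X_test_minmax = min_max_scaler.transform(X_test)        # reuses the train min/max on test

    print(X_test_minmax)  # [[0.0814...], [0.3139...]] (scaled relative to the train range)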

Which of the above two approaches is correct (or better)?

This post explains Should we apply normalisation to test data as well?

Normalization is used to scale the data between a certain range.

Min-max scaling rescales the data to fall within the range [0, 1].
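For reference, min-max scaling applies x' = (x − min) / (max − min) to each value; a minimal sketch of the formula:

    import numpy as np

    def min_max(x):
        """Rescale a 1-D array to [0, 1] using its own min and max."""
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min())

    print(min_max([3, 45, 89]))  # [0.0, 0.488..., 1.0]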

Normalization is performed after splitting the data into training and test sets. That is, normalize the training set separately, and then normalize the test set separately.

The test set needs to be unseen data and not accessible at the training stage.

Using any information from the test set before or during training introduces a potential bias into the evaluation of the model’s performance.

The goal of normalisation is to help the Machine Learning algorithms converge faster.

According to scaling vs normalization,

Scaling is important in algorithms such as support vector machines (SVM) and k-nearest neighbors (KNN), where the distance between data points matters. For example, in a dataset containing product prices, without scaling, SVM might treat 1 USD as equivalent to 1 INR even though 1 USD = 65 INR.

In scaling, you’re changing the range of your data while in normalization you’re mostly changing the shape of the distribution of your data.

You need to normalize your data if you’re going to use a machine learning or statistics technique that assumes the data is normally distributed, e.g. t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian Naive Bayes.

Different algorithms make different assumptions. You have to ensure the assumptions hold by performing the appropriate type of normalisation/scaling on the data in the preprocessing stage.
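To make the quoted USD/INR point concrete, here is a small illustration (with hypothetical prices of my own) of how an unscaled feature dominates a Euclidean distance:

    import numpy as np

    # hypothetical rows: [price_in_inr, rating]
    a = np.array([6500.0, 4.0])  # a ~100 USD product
    b = np.array([65.0, 5.0])    # a ~1 USD product

    # without scaling, the price column dominates the distance entirely
    print(np.linalg.norm(a - b))  # ≈ 6435.0

    # after min-max scaling each column, both features contribute comparably
    X = np.vstack([a, b])
    X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # ≈ 1.414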

@alvinctk: Thanks for your response, but I think you misunderstood my question.
I don’t have a query about scaling vs. normalization. My query is about the approach used on the screen for applying min-max scaling.

Let me elaborate with example:
Approach 1 (as shown on the screen):
Let us consider a column age in train and test set with following values:
Train_Age: [23, 45, 12, 36, 89, 12, 3, 60]
test_age: [10, 9, 45, 12, 27, 32, 31, 30, 29]

Now when we apply the function minmax_scale() to the Train_Age data, scaling is done using min = 3 and max = 89.
If we then apply this function again to test_age, scaling is done using min = 9 and max = 45. Hence the min and max parameters used on the train and test sets are different.
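A minimal sketch of Approach 1 with these exact values:

    import numpy as np
    from sklearn.preprocessing import minmax_scale

    Train_Age = np.array([23, 45, 12, 36, 89, 12, 3, 60], dtype=float)
    test_age = np.array([10, 9, 45, 12, 27, 32, 31, 30, 29], dtype=float)

    # each call computes its own min/max: train uses (3, 89), test uses (9, 45)
    print(minmax_scale(Train_Age))  # e.g. 23 -> (23 - 3) / (89 - 3) ≈ 0.2326
    print(minmax_scale(test_age))   # e.g. 10 -> (10 - 9) / (45 - 9) ≈ 0.0278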

Approach 2:
Now contrast this with the 2nd approach available in scikit-learn, using the MinMaxScaler transformer:

from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()

  • Fitting the scaler on the train set (here the parameters min = 3 and max = 89 are learned)
    X_train_minmax = min_max_scaler.fit_transform(Train_Age)
  • Using this fitted instance on the holdout data
    min_max_scaler.transform(test_age)

As you can see, we use the transform() method on the test set rather than fit_transform(). This applies the same min and max parameters learned from the train set (min = 3 and max = 89) to the test set, so the results will be completely different from Approach 1.
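And a minimal sketch of Approach 2 with the same values (MinMaxScaler expects 2-D input, hence the reshape):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    Train_Age = np.array([23, 45, 12, 36, 89, 12, 3, 60], dtype=float).reshape(-1, 1)
    test_age = np.array([10, 9, 45, 12, 27, 32, 31, 30, 29], dtype=float).reshape(-1, 1)

    min_max_scaler = MinMaxScaler()
    min_max_scaler.fit(Train_Age)              # learns min = 3, max = 89 from train only
    print(min_max_scaler.transform(test_age))  # e.g. 10 -> (10 - 3) / (89 - 3) ≈ 0.0814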

I want to know if the approach shown on the screen in the lesson is correct.

Well, I don’t have access to the screen.

Use Approach 1. When scaling the data:

Define a mathematical function f to describe the min-max scaling such that f: x → y, where f is the function, x is the input to f, and y is its output.

By applying min-max scaling to the train and test data separately, the representation of the data before/after min-max scaling is not affected, because the transformation always maps through the function f. Inversely, the inverse of f maps y back to x.

You can treat the mapping of function f like shifting points to the right or left.
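As I read it, the point is that min-max scaling is an invertible affine map; a minimal sketch of f and its inverse (my own illustration, not from the original post):

    import numpy as np

    x = np.array([3.0, 23.0, 89.0])
    lo, hi = x.min(), x.max()

    def f(v):
        """f: x -> y, maps values into [0, 1] using the data's own min and max."""
        return (v - lo) / (hi - lo)

    def f_inv(y):
        """Inverse of f: maps scaled values back to the original range."""
        return y * (hi - lo) + lo

    y = f(x)
    print(y)         # [0.0, 0.2325..., 1.0]
    print(f_inv(y))  # recovers [3.0, 23.0, 89.0]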

@alvinctk
This is the screen I am referring to:
https://app.dataquest.io/m/186/feature-preparation%2C-selection-and-engineering/2/preparing-more-features
So if I go by Approach 1:

  • Can you please let me know when we should use the MinMaxScaler transformer (Approach 2), and whether that approach is incorrect? (Basically, can you apply fit_transform on test data as well?)
  • Is it correct to use different max and min values for scaling the train and test sets (rather than learning min and max from the train data and then applying the same values to the test data for scaling)?

Unable to access the locked mission. Maybe some other moderators can help with the mission.

Instead of questioning which approach is better/correct, it’s better to understand why we use a given approach.

The rationale behind choosing Approach 1 is mainly that the training stage must not have any knowledge of the test data set.

Hi @vinayak.naik87

minmax_scale and MinMaxScaler are equivalent, with the difference that the latter exposes the estimator API (please read the sklearn official docs for more info).

We can apply the fit_transform method to a test dataset, but then we would essentially defeat the purpose of the test data altogether. Consider fit_transform a shortcut that applies the fit and transform methods in one line of code instead of two.

Not sure if you have tried applying these methods on the train and test data, but you will get the same results. For example, take these steps:

  1. fit on train
  2. Series a = transform on test
  3. Series b = fit_transform on train
  4. Series c = transform on test (after step 3)

The results for series a and c will be the same; the sketch below demonstrates this. Let us know if this helps you.
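A minimal runnable sketch of these four steps (with hypothetical toy data) confirming that a and c match:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    train = np.array([[3.0], [45.0], [89.0]])  # hypothetical train data
    test = np.array([[10.0], [30.0]])          # hypothetical test data

    scaler = MinMaxScaler()
    scaler.fit(train)                # step 1: fit on train
    a = scaler.transform(test)       # step 2: transform on test
    b = scaler.fit_transform(train)  # step 3: fit_transform on train (refits on the same data)
    c = scaler.transform(test)       # step 4: transform on test again

    print(np.allclose(a, c))  # True: refitting on the same train data changes nothing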

As @alvinctk has already highlighted, it’s not just about which approach to take, but also why it should be taken. There is a user guide in the sklearn docs that explains the various methods of scaling data and how each approach handles outliers in the data.

@alvinctk - why is your post flagged?