Fit transform vs transform


Screen Link:

https://app.dataquest.io/m/26/clustering-basics/6/initial-clustering

This screen introduces fit_transform, but I am confused about why we would use it instead of transform.

I have searched the forums here and StackExchange, but none of the answers made sense to me.

I completed the screen and realised that fit fits the model to the data, as before.

So why not fit and then predict on the training data instead of using fit_transform?

fit_transform is a shortcut for fit followed by transform. If you look at the source code, the default fit_transform calls fit and then transform.
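A minimal sketch of that equivalence, assuming StandardScaler as the transformer and a small made-up array (neither comes from the lesson):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration only
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# One call: learn the mean/std and scale the same data
one_step = StandardScaler().fit_transform(X)

# Two calls: fit returns the fitted scaler, so transform can be chained
two_step = StandardScaler().fit(X).transform(X)

# Both routes should give the same scaled array
same = np.allclose(one_step, two_step)
print(same)
```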

Because you cannot transform without fitting first when the transformation requires a full pass over the rows (e.g. calculating the mean and standard deviation for standardization). fit is the step that calculates the statistics the transformation needs and stores them as instance attributes on the fitted object. For example, in https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.transform, you can see check_is_fitted(self) in the source https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/utils/validation.py#L955 which raises NotFittedError when you call transform on a transformer that has not been fitted.
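To see this for yourself, here is a small sketch (the data is made up) that calls transform before fit, catches the error, then fits and inspects the attributes that fit sets:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])  # made-up data
scaler = StandardScaler()

# transform before fit: the scaler has no mean/std to work with yet
try:
    scaler.transform(X)
    raised = False
except NotFittedError:
    raised = True

# fit computes the statistics and stores them on the object
scaler.fit(X)
learned_mean = scaler.mean_[0]    # mean of the column seen during fit
learned_scale = scaler.scale_[0]  # standard deviation used for scaling

# now transform works, using the stored statistics
scaled = scaler.transform(X)
```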

check_is_fitted(self) is similarly called when you call https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict (and probably any estimator.predict). predict returns self._decision_function(X), and _decision_function contains the check: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/linear_model/_base.py#L222
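The same behaviour can be checked on the predict side. A rough sketch with LinearRegression and toy data that follows y = 2x + 1 (the data is an assumption for illustration):

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])  # toy target: y = 2x + 1

model = LinearRegression()

# predict before fit: there are no coefficients yet
try:
    model.predict(X)
    predict_raised = False
except NotFittedError:
    predict_raised = True

# after fit, predict uses the learned coefficients
model.fit(X, y)
prediction = model.predict(np.array([[4.0]]))
```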

The usual pattern is pipeline.fit_transform on the training set, then pipeline.transform on the test set.
You can split fit_transform into separate fit and transform calls on the training set; it is just unnecessarily longer and more prone to programming mistakes.
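A sketch of that pattern, assuming a two-step preprocessing pipeline (SimpleImputer then StandardScaler) and tiny made-up train/test arrays. The point to notice is that the test set is filled and scaled with statistics learned from the training set only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data; the test set has a missing value on purpose
X_train = np.array([[0.0], [2.0], [4.0]])
X_test = np.array([[np.nan], [6.0]])

prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Training set: learn the statistics and apply them in one call
X_train_prepared = prep.fit_transform(X_train)

# Test set: only apply what was learned on the training set (no fit here)
X_test_prepared = prep.transform(X_test)

# The missing test value is filled with the training mean (2.0),
# so after scaling it lands at 0
train_mean = prep.named_steps["standardscaler"].mean_[0]
```

Calling fit_transform on the test set instead would recompute the statistics from the test data, which is one of the programming mistakes the single-call pattern helps you avoid.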

This paper explains the design considerations of sklearn and the terms it uses (transformer, estimator, predictor): https://arxiv.org/pdf/1309.0238.pdf

You can stop at fit if you are just studying statistics of the data, but usually people want to feed the data into a model to fit and predict. Some models require preprocessed data (not to run without error, but to give meaningful results), so you fit the preprocessing transformer/estimator (I haven't got a clear difference between these two), transform the data with it, and then fit the model and predict with it.
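Putting the steps in order, here is a minimal sketch that assumes StandardScaler for preprocessing and KMeans as the model (since the lesson is about clustering); the data and parameter values are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data: two obvious groups, with features on very different scales
X = np.array([
    [1.0, 100.0], [2.0, 110.0], [1.5, 105.0],
    [8.0, 900.0], [9.0, 910.0], [8.5, 905.0],
])

# Step 1: fit the preprocessing transformer and transform the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: fit the model on the transformed data
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X_scaled)

# Step 3: predict with the fitted model (cluster labels here)
labels = model.predict(X_scaled)
```

So transform gives you changed data to pass on, while predict gives you the model's output; which one you call after fit depends on whether the object is being used as a preprocessing step or as the final model.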
