Hi dataquest community, happy new year and I hope you are all well!

I was tackling the final ML project in the Data Scientist path (loan prediction) and was thinking about how to approach the problem, given that:

- We need to perform the obvious train/test split.
- The dataset is imbalanced (the target column `loan_status` is split 85/15), and I would like to even the odds with a method such as SMOTE.
- Missing values must be handled with some imputation strategy.

In which order should these operations be performed? Given that:

- As far as I understand, it is best practice to set a test set aside from the very beginning, before any imputation choices are made, mostly to avoid data leakage.
- If we split the data at the beginning, we will be splitting an imbalanced dataset with missing values; with a stratified sampling approach, the test set keeps the same 85/15 distribution.
- We can then perform SMOTE on the training set, BUT this only works if there are no null values, meaning imputation must happen first.
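To make the first point concrete, here is a minimal sketch of a stratified split. The data is synthetic (the 85/15 ratio and array names are just stand-ins for the loan dataset); `stratify=y` is what preserves the class balance in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 85/15 class split
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 850 + [1] * 150)  # 0 = paid, 1 = not paid

# stratify=y keeps the 85/15 ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())  # both ≈ 0.15
```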

So, ideally, is this sequence of events correct?

- `train_test_split` (stratified)
- Impute values on the train set only.
- SMOTE on the train set, so that it has a 50/50 share of classes (paid / not-paid loans), then `model.fit(X_train_SMOTED, y_train_SMOTED)`.
- Impute values on the test set (using the imputer fitted on `X_train`?), then `model.predict(X_test)`.

Assuming the above is correct, how should the imputation be managed?

- The simplest way would be to use a `SimpleImputer` across the whole training set, regardless of whether the value to impute belongs to class 1 or 0.
- Impute values based on the median of the specific class the observation (for which we are imputing the value) belongs to. I wouldn't know how to do this with sklearn, though.
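The second option is easier with pandas than with sklearn: compute per-class medians on the training set with `groupby(...).transform("median")` and fill with those. A sketch on a toy frame (column names are made up for illustration); one caveat is that this only works on the training set, since at prediction time the class label is unknown, so the test set would still need an overall train-set median.

```python
import numpy as np
import pandas as pd

# Toy training frame; "loan_status" is the class, "income" has missing values
train = pd.DataFrame({
    "loan_status": [0, 0, 0, 0, 1, 1],
    "income": [30.0, np.nan, 50.0, 40.0, np.nan, 10.0],
})

# Median of "income" computed separately within each class,
# broadcast back to the original row order
class_medians = train.groupby("loan_status")["income"].transform("median")

# Fill each missing value with the median of its own class
train["income"] = train["income"].fillna(class_medians)

print(train["income"].tolist())  # class-0 NaN -> 40.0, class-1 NaN -> 10.0
```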

The pic below shows what I am trying to explain.

Thanks for your help!

Cheers,

Nick