Methodological approach to classification on imbalanced datasets

Hi dataquest community, happy new year and I hope you are all well!

I was tackling the final ML project for the Data Scientist path (loan prediction), and I was thinking about how to approach the problem given that:

  • We need to perform the obvious train/test split
  • The dataset is imbalanced (the target column loan_status is split 85/15), and I would like to even the odds with a method such as SMOTE.
  • Missing values must be managed with some imputation strategy.

In which order should these operations be performed? Given that:

  • As far as I understand, it is best practice to put a test set aside from the very beginning, before any imputation choice is made, mostly to avoid data leakage.
  • If we split the data at the beginning, we will be splitting an imbalanced dataset with missing values (using a stratified sampling approach), so that the test set keeps the same 85/15 distribution.
  • We can then perform SMOTE on the training set, BUT this is doable only if there are no null values, meaning imputation must happen beforehand.

So, ideally, is this sequence of events correct?

  • train_test_split
  • impute values on the train set only
  • SMOTE on the train set, so that it has a 50/50 share of classes (paid / not paid loans)
  • model.fit(X_train_SMOTED, y_train_SMOTED)
  • impute values on the test set (using the imputer fitted on X_train?)
  • model.predict(X_test), then score against y_test
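
In code, a minimal sketch of that sequence might look like this (assuming imbalanced-learn's SMOTE and sklearn's SimpleImputer; X, y, and the classifier are placeholders):

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Stratified split first, so the test set keeps the 85/15 distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the imputer on the train set only; reuse its statistics on the test set.
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# Oversample the training set only, after imputation.
X_train_sm, y_train_sm = SMOTE(random_state=42).fit_resample(X_train_imp, y_train)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_sm, y_train_sm)
print(model.score(X_test_imp, y_test))  # evaluate on the untouched test set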

Assuming the above is correct, how should the imputation be managed?

  • The simplest way would be to use a SimpleImputer across the whole training set, regardless of whether the value to impute belongs to class 1 or 0.
  • Impute values based on the median of the specific class that the observation (for which we are imputing the value) belongs to. I wouldn’t know how to do this with sklearn though. :thinking:

The pic below shows what I am trying to explain.

Thanks for your help!
Cheers,
Nick

Hi @nlong !

This is such a good question and it’s a shame it went unanswered.

This is how I approach this problem.

So, ideally, is this sequence of events correct?

  • train_test_split
  • impute values on the train set only
  • SMOTE on the train set, so that it has a 50/50 share of classes (paid / not paid loans)
  • model.fit(X_train_SMOTED, y_train_SMOTED)

I agree 100% until this point. This is exactly what I do.

However, once I’ve gone through the whole process of developing the model, and this includes not only training but also cross-validation, feature selection, and hyperparameter tuning, I test the model on the test set as unbalanced as it is.

That’s because, the way I see it, the test is the final step of model development, and it’s where you see how your model performs against data it has never seen before, kind of like if it were working with real-life data in production.

And the real-life data, due to the nature of the problem your model is trying to solve, will naturally be unbalanced, so the results of this test will give you a better idea of how your model will perform when put in production.
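
For instance, a quick hedged sketch of that final check (model, X_test, and y_test are placeholder names; a classification report is just one sensible readout for an imbalanced test set):

from sklearn.metrics import classification_report

# Score on the test set exactly as it is, without rebalancing it.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))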

About missing values, I believe new values should be imputed even before the train_test_split, so that the data in both sets follow the same logic.

Then it will probably become an iterative process of choosing the method that generates the best results, whether it’s the mean, median, or any other.
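
One way to run that iteration, sketched here with sklearn pipelines and cross-validation (the strategies and the classifier are just example choices):

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Compare imputation strategies by cross-validated score on the training data.
for strategy in ("mean", "median", "most_frequent"):
    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy=strategy)),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    print(strategy, cross_val_score(pipe, X_train, y_train, cv=5).mean())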

Your idea of filling the null values with the mean (or median) of the specific class is a great choice and will probably work.

I’m not sure if you’re having trouble creating the code for this, but it can easily be done with pandas before you even get to sklearn; you just have to (see the sketch after this list):

  1. Split the DataFrame in two (one for each class).
  2. Calculate the mean (or median) of each DataFrame.
  3. Impute the values in each DataFrame.
  4. Concatenate them both together again.
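
A minimal pandas sketch of those four steps (assuming a DataFrame df with a binary loan_status target and numeric feature columns; all names are placeholders):

import pandas as pd

# 1. Split the DataFrame in two, one for each class.
paid = df[df["loan_status"] == 1].copy()
not_paid = df[df["loan_status"] == 0].copy()

# 2. and 3. Fill each class's nulls with that class's own median.
num_cols = df.select_dtypes("number").columns
paid[num_cols] = paid[num_cols].fillna(paid[num_cols].median())
not_paid[num_cols] = not_paid[num_cols].fillna(not_paid[num_cols].median())

# 4. Concatenate them back together, restoring the original row order.
df_imputed = pd.concat([paid, not_paid]).sort_index()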

Well, that’s my view on the problem; it doesn’t mean it’s correct. More perspectives are always welcome in the discussion.


Agree with this, based on my limited experience with machine and deep learning. It’s good to perform EDA to explore your dataset and weed out outliers, replacing them with “cleaner” data (assuming it’s a production model that will “see” real-world data), so that when you train your model it will be more competent at classifying the loan status (in this case).


Ciao @otavios.s, thanks for the answer!
Happy that I got it right. I actually did the same, namely, ran the model checks on the unbalanced test_set. It doesn’t make sense to rebalance it at all.

As for the imputation phase, I am not sure whether it is methodologically correct to impute values a priori. Given a distribution, if I take the mean / median of the WHOLE distribution, it means I am taking information from all parts of the dataset, even those that will eventually become the test set.
If you look around on either random data science boards or in certified textbooks (e.g. Aurélien Géron’s Hands-On Machine Learning), it is usually considered best practice to first split, and then perform imputation and scaling.
Géron, in his end-to-end ML project, goes down this path:

  • creating a stratified training set (its numerical columns are named housing_num)
  • then feeding it into a Pipeline as such (cell run 73):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# CombinedAttributesAdder is a custom transformer defined in Géron's notebook.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)

This is done by fitting the transformer with data from the training set; the operation is eventually repeated for the test set, but calling only the transform method. This ensures that imputation and scaling are applied to the missing values of the test set using exclusively statistics learned from the train set.

# Transform (not fit) the test set with the pipeline fitted on the train set.
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)

Does this make sense?

In the very end, if you are curious, this is the final output of the project.


Hi @nlong

That’s an excellent approach and it probably makes more sense than what I said, but I am really not sure how much of a difference this will make.

I’m familiar with Géron’s book, and following his approach is a great way to go, so I’m sure you’re heading in the right direction here.

By the way, your project is very very good, congratulations!


Thanks @otavios.s !

Yeah, I agree, this is more “form over function”, because in the very end I do not expect this to create major differences. Trying to build a wider understanding makes me feel more confident when and if I choose to cut some corners :smiley:
