Balancing an imbalanced dataset

Dear Support,

I need some help with a personal project. I am working on a logistic regression problem, and the target in the training data set is approval_status. Each value in the target column is either ‘approved’ or ‘not approved’.
The data set has 16,500 rows in total, of which 16,320 are approved and 180 are not approved. Is this an imbalanced data set? If so, how can it be balanced for use in machine learning prediction?

Also, does using pd.get_dummies(df) avoid the dummy variable trap?

Thanks for your help

Hi @ignatiusebigwai

Yes, your dataset is clearly imbalanced, since about 99% of your data (16,320 of 16,500 rows) belongs to one status. A balanced set would have a distribution close to 50/50.
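
To make the size of the imbalance concrete, here is a small sketch that works out the class shares from the counts given in the question. The variable names are only illustrative.

```python
# Counts taken from the question above.
n_approved = 16320
n_not_approved = 180
total = n_approved + n_not_approved

share_approved = n_approved / total
share_not_approved = n_not_approved / total
imbalance_ratio = n_approved / n_not_approved

print(f"approved:     {share_approved:.1%}")      # 98.9%
print(f"not approved: {share_not_approved:.1%}")  # 1.1%
print(f"ratio:        {imbalance_ratio:.0f}:1")   # 91:1
```

So there are roughly 91 approved rows for every not approved row.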

There are several ways to make the set balanced.

  1. The most obvious is to extend it with more data for the not approved class.
  2. You can take a sample of about 200 rows from approved and train the model on a dataset of 380 rows. If the patterns are simple enough, this may be enough for fairly good results.
  3. You can create synthetic (dummy) not approved rows if you know the most obvious values that affect it.
  4. You can train the model on the imbalanced set and then adjust the probability threshold at which a prediction is assigned to a status. For example, prediction >= 0.95 means approved, and everything else is not approved. So although your model will tend to predict approved more often, any uncertain prediction is treated as not approved.
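
Option 4 could look roughly like the sketch below. It assumes scikit-learn and uses randomly generated stand-in data (not your data), with 1 meaning approved and 0 meaning not approved. The helper `classify` is a hypothetical name, and 0.95 is just the example cut-off from above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption for illustration only):
# 1 = approved (majority), 0 = not approved (minority).
n_major, n_minor = 2000, 40
X = np.vstack([
    rng.normal(loc=1.0, scale=1.0, size=(n_major, 2)),
    rng.normal(loc=-1.0, scale=1.0, size=(n_minor, 2)),
])
y = np.concatenate([np.ones(n_major, dtype=int), np.zeros(n_minor, dtype=int)])

model = LogisticRegression().fit(X, y)

# Column order of predict_proba follows model.classes_, so look up class 1.
approved_col = list(model.classes_).index(1)
p_approved = model.predict_proba(X)[:, approved_col]

def classify(p, threshold):
    """Label a row approved (1) only if P(approved) reaches the threshold."""
    return (p >= threshold).astype(int)

default_pred = classify(p_approved, 0.5)   # the usual cut-off
strict_pred = classify(p_approved, 0.95)   # uncertain rows become not approved

print("flagged not approved @0.50:", int((default_pred == 0).sum()))
print("flagged not approved @0.95:", int((strict_pred == 0).sum()))
```

For brevity the sketch scores the same rows it was trained on. In practice the threshold is better chosen on a held-out validation set, for example by comparing precision and recall for the not approved class at several cut-offs.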

But I would first consider every possible option for increasing the number of not approved rows.

Hi Moriturus,

Thanks very much for your prompt response.
I will try all the options you mentioned to see which gives the best results.

Thanks once again

I’ll share my practical experience in this regard.

You can try all of the strategies below to see which one works best.

  1. Oversampling -> the minority class is increased to match the majority class.
  2. Undersampling -> the majority class is reduced to match the minority class.
  3. Oversampling + Undersampling -> the minority class is increased and the majority class decreased to a ratio at which both become equal.
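
The three strategies above can be sketched with plain random resampling in pandas. The toy frame, the column name `feature` and the `target` size of 200 are made up for illustration. Only `approval_status` and its two labels come from the question.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
df = pd.DataFrame({
    "feature": range(1000),
    "approval_status": ["approved"] * 980 + ["not approved"] * 20,
})

majority = df[df["approval_status"] == "approved"]
minority = df[df["approval_status"] == "not approved"]

# 1. Oversampling: draw minority rows with replacement up to the majority size.
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# 2. Undersampling: keep a random subset of majority rows of minority size.
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

# 3. Combination: shrink the majority and grow the minority to a common size.
target = 200  # arbitrary meeting point, chosen only for illustration
combined = pd.concat([
    majority.sample(n=target, random_state=0),
    minority.sample(n=target, replace=True, random_state=0),
])

for name, frame in [("over", oversampled), ("under", undersampled), ("both", combined)]:
    print(name, frame["approval_status"].value_counts().to_dict())
```

Whichever variant you use, resample only the training split. The validation and test data should keep the original class distribution so the scores reflect real conditions.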

If you want a simple solution, try oversampling with SMOTE (Synthetic Minority Over-sampling Technique). It is one of the most popular methods for imbalanced datasets.
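
To show the idea behind SMOTE, here is a simplified NumPy sketch: each synthetic point is placed at a random position on the line between a minority sample and one of its nearest minority neighbours. The function `smote_like` is a hypothetical name and this is not the reference implementation. It also assumes numeric features only. For real work, the imbalanced-learn package provides a `SMOTE` class in `imblearn.over_sampling` with a `fit_resample(X, y)` method.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances within the minority class only.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # random minority point
    pick = rng.integers(0, k, size=n_new)   # one of its k nearest neighbours
    neigh = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Made-up minority samples with two numeric features.
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 2))
synthetic = smote_like(X_minority, n_new=80)
print(synthetic.shape)  # (80, 2)
```

Because the new rows are interpolated rather than copied, the model sees more varied minority examples than with plain duplication. As with any resampling, apply it to the training split only.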

Below are a few links for your reference.