I need some help with a personal project. I am working on a logistic regression problem, and the target in the training data set is approval_status. The values in the target column are either ‘approved’ or ‘not approved’.
The data set has 16500 rows in total, of which 16320 are approved and 180 are not approved. Is this an imbalanced data set? If so, how can it be balanced for use in machine learning prediction?
Also, does using pd.get_dummies(df) avoid the dummy variable trap?
Yes, your dataset is clearly imbalanced: about 99% of the rows (16320 of 16500) belong to the same status. A set is considered balanced when the classes are split roughly 50/50.
There are several ways to make the set balanced.
The most obvious is to extend it with more data for the not approved status.
You can take a sample of about 200 rows from approved and train the model on a dataset of 380 rows. If the patterns are simple enough, this may be enough for fairly good results.
You can create synthetic not approved rows if you know which values most strongly affect the status.
You can train the model on the unbalanced set and then adjust the probability threshold at which a prediction is assigned to each status. For example, classify a row as approved only when the predicted probability of approved is >= 0.95, and as not approved in all other cases. The model will still tend to predict approved more often, but any uncertain case is then resolved in favour of not approved.
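Two of the options above, undersampling approved and shifting the threshold, can be sketched roughly as follows. This is only an illustration under assumptions: the data is randomly generated to match the 16320/180 split, the single feature "score" is hypothetical, and scikit-learn's LogisticRegression stands in for whatever model you actually use.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data: 16320 approved, 180 not approved.
rng = np.random.default_rng(0)
n_approved, n_not_approved = 16320, 180
df = pd.DataFrame({
    # Hypothetical feature; not approved rows are shifted so there is a pattern to learn.
    "score": np.concatenate([
        rng.normal(1.0, 1.0, n_approved),
        rng.normal(-1.0, 1.0, n_not_approved),
    ]),
    "approval_status": ["approved"] * n_approved + ["not approved"] * n_not_approved,
})

# Option A: undersample the majority class to about 200 rows, keep all 180 minority rows.
approved = df[df["approval_status"] == "approved"].sample(n=200, random_state=0)
not_approved = df[df["approval_status"] == "not approved"]
balanced = pd.concat([approved, not_approved]).sample(frac=1, random_state=0)  # shuffle

X_bal = balanced[["score"]]
y_bal = (balanced["approval_status"] == "approved").astype(int)  # 1 = approved
model_balanced = LogisticRegression().fit(X_bal, y_bal)

# Option B: train on the full unbalanced data, then raise the cut-off for "approved".
X = df[["score"]]
y = (df["approval_status"] == "approved").astype(int)
model_full = LogisticRegression().fit(X, y)

p_approved = model_full.predict_proba(X)[:, 1]  # column 1 = P(class 1) = P(approved)
threshold = 0.95
pred = np.where(p_approved >= threshold, "approved", "not approved")

print(balanced["approval_status"].value_counts())
print(pd.Series(pred).value_counts())
```

The 0.95 cut-off is just the example value from above. In practice you would pick it by checking precision and recall for not approved on held-out data rather than on the training rows used here.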
But I would try to consider all possible options for increasing the number of not approved rows.
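On the pd.get_dummies question: called with its defaults, pd.get_dummies(df) keeps one column for every category, so the dummy columns of a feature always sum to 1 and are collinear with the intercept. Passing drop_first=True drops one level per feature, which is the usual way to avoid the dummy variable trap. A small sketch, where the "region" column is a made-up example:

```python
import pandas as pd

# Hypothetical categorical feature, for illustration only.
df = pd.DataFrame({"region": ["north", "south", "east", "south"]})

all_levels = pd.get_dummies(df)                # one column per category (default)
reduced = pd.get_dummies(df, drop_first=True)  # first category dropped

print(list(all_levels.columns))  # ['region_east', 'region_north', 'region_south']
print(list(reduced.columns))     # ['region_north', 'region_south']

# With all levels kept, every row sums to exactly 1 across the dummy columns.
print((all_levels.sum(axis=1) == 1).all())
```

With drop_first=True the dropped level ("east" here) becomes the baseline that the other coefficients are measured against.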