Categorical independent variables in logistic regression

Hello, community! I am trying to work with logistic regression and have a few conceptual doubts. The answers on the internet are so scattered that at this point it has become even more confusing.

  1. When independent variables are categorical (with more than 2 categories), we do one-hot encoding. While doing so, we need to generate only k-1 variables for k categories to avoid the dummy variable trap. Do I have to do this every time? If yes, why am I coming across logistic models that use k variables for k categories?

  2. Regarding handling outliers and missing data: should this be done before the train-test split, after it, or does it depend?

I would really appreciate your answers.


Hi @sahiba.kaur.stats

Important, Perplexing & Interesting questions!

  1. As far as I understood, two things stand out for the k-1 vs. k categories question: multicollinearity and the type of classification model.
    Keeping all k categories has the potential to induce multicollinearity. Marital status is the typical example: with only “Single” and “Married”, the two dummies are perfectly collinear, since one is always 1 minus the other. That exact pairwise dependence won’t happen for “Widowed” or “Divorced” (along with the former two), but the full set of k dummies still always sums to 1, which duplicates the intercept. Here, as you said, multicollinearity will have to be dealt with for a regression model.
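To make the trap concrete, here is a minimal sketch using pandas (the marital-status data and the use of `get_dummies` are my own illustrative assumptions, not from the posts above):

```python
import pandas as pd

df = pd.DataFrame(
    {"marital": ["Single", "Married", "Widowed", "Divorced", "Single"]}
)

# All k dummies: every row sums to 1, so the columns together
# exactly duplicate the intercept column -> perfect multicollinearity.
full = pd.get_dummies(df["marital"])
print(full.sum(axis=1).tolist())  # each row sums to 1

# k-1 dummies: drop_first=True removes one category as the baseline,
# breaking the exact linear dependence with the intercept.
reduced = pd.get_dummies(df["marital"], drop_first=True)
print(list(reduced.columns))  # the alphabetically first category is dropped
```

The dropped category becomes the reference level: the remaining coefficients are interpreted relative to it.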

I came across a classification model that employed all k categories and showed that this did not impact the model significantly. This might be a silly question (as you may have already tried it), but is there a possibility to raise this question directly with the authors of the model/project?

  2. Although not a helpful answer (as if the first one was! :stuck_out_tongue:): yeah, “it depends”. Perhaps a clear definition and understanding of what an outlier is might help.

The basic idea of a test set is that the model should be evaluated on unseen data. If we have first filtered or transformed the whole dataset, say to obtain a normal distribution, we have already introduced some form of bias, because information from the test set leaked into the preprocessing. So even if the model fits the training data well and shows good accuracy on the test data, that will not help if the removed outliers were in fact valid data points that just did not fit the assumed distribution.
That is why you will find a lot of people suggesting: split first, then transform. Some may also suggest not dropping outliers at all, since they may have causes not included in the dataset or not yet identified.
With this method, all the transformations (fitted on the training data) are applied to the test data as well when the model is evaluated on it.
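The split-first-then-transform order can be sketched in scikit-learn like this (the random data and the choice of `StandardScaler` are illustrative assumptions on my part):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split FIRST, so the test set stays unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the transformation on the training data only...
scaler = StandardScaler().fit(X_train)

# ...then apply the SAME fitted transformation to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # no refit: uses the train mean/std
```

Calling `fit` on `X_train` only is what prevents the leakage described above; a `Pipeline` wrapping the scaler and the classifier enforces the same discipline automatically during cross-validation.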