Logistic Regression Feature Engineering

Hey guys,

I’m working on a personal project in which I need to apply logistic regression.
The following are some columns of the dataset:

I’m trying to improve the precision of the predictions. As you can see, the low and high margins depend on the age of the user and on the test. Would it be a good idea to create a new column that shows whether the result is between the margins or not?

Which is the better option, a column with 2 values or with 3?
Binary column:
0 -> Between low and high margins
1 -> Under or over margins
Ternary column:
0 -> Between low and high margins
1 -> Under low margin
2 -> Over high margin
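
For readers who want to try both encodings, here is a minimal sketch in plain Python. The function name `margin_flags` and the sample rows are made up for illustration, and it assumes that a result exactly equal to a margin counts as "between", which the post does not specify.

```python
def margin_flags(result, low, high):
    """Return (binary, ternary) codes for one row.

    Assumption: a result exactly equal to a margin counts as "between".
    """
    if result < low:
        ternary = 1  # under the low margin
    elif result > high:
        ternary = 2  # over the high margin
    else:
        ternary = 0  # between the margins
    binary = 0 if ternary == 0 else 1  # 1 means outside the margins, either side
    return binary, ternary


# Made-up rows in the order (RESULT, LOW_MARGIN, HIGH_MARGIN)
rows = [(5.0, 4.0, 6.0), (3.2, 4.0, 6.0), (7.5, 4.0, 6.0)]
flags = [margin_flags(r, lo, hi) for r, lo, hi in rows]
print(flags)
```

Note that the ternary code is a category, not a quantity: if it is fed to logistic regression as the numbers 0/1/2, the model treats "over" as twice "under". One-hot encoding the ternary column avoids that.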


If you create a new column based on the values of another column, this can lead to a multicollinearity problem, where one variable is highly correlated with (dependent on) another. Generally, you should check for multicollinearity and avoid it, not create it artificially.

So, I would avoid doing this.
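
One rough way to check how strongly a derived flag tracks its source column is to compute their Pearson correlation. The sketch below uses made-up values and fixed margins purely for illustration; the helper `pearson` is written out by hand so the example has no dependencies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up RESULT values and fixed margins (in the real data the margins vary)
results = [3.2, 4.5, 5.0, 5.8, 7.5, 8.1]
low, high = 4.0, 6.0
out_of_range = [0 if low <= r <= high else 1 for r in results]

r = pearson(results, out_of_range)
print(round(r, 3))
```

Pairwise correlation only picks up linear dependence between two columns, so a value near zero does not rule out a dependence that involves several columns at once.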

@lostmachine, thanks for warning me about the multicollinearity problem. My goal in creating this new column, though, is to drop the RESULT, LOW_MARGIN, and HIGH_MARGIN columns, so that the algorithm does not have to work out the relationship between the 3 columns itself. Could this help improve the precision?

I don’t think that engineering one feature and excluding 3 columns will greatly boost model accuracy, but as always in data science, you find out by trial and error, and the answer to your question depends heavily on the structure of your data set.
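
The trial-and-error comparison could look roughly like the sketch below, which cross-validates logistic regression on the raw columns and on the flag alone using scikit-learn. Everything here is synthetic: the data, the fixed margins, and a target that is deliberately built from the out-of-range flag with 10% of labels flipped. It shows the procedure only and says nothing about what will happen on the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for RESULT, LOW_MARGIN, HIGH_MARGIN (margins fixed here)
result = rng.normal(5.0, 1.5, n)
low = np.full(n, 4.0)
high = np.full(n, 6.0)
out_of_range = ((result < low) | (result > high)).astype(int)

# Synthetic target: the flag with 10% of labels flipped (an assumption)
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - out_of_range, out_of_range)

feature_sets = {
    "raw columns": np.column_stack([result, low, high]),
    "flag only": out_of_range.reshape(-1, 1),
}

scores = {}
for name, X in feature_sets.items():
    model = LogisticRegression(max_iter=1000)
    # 5-fold cross-validated precision; use scoring="accuracy" for accuracy
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="precision").mean()
    print(name, round(scores[name], 3))
```

On the real data, the same loop would be run with the actual columns and target, and the two mean scores compared.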