Why do we need to remove one of the dummy columns for each feature?

Screen Link:
https://app.dataquest.io/m/186/feature-preparation%2C-selection-and-engineering/9/final-feature-selection-using-rfecv

Hi, on page 9 of the feature preparation, selection and engineering mission of the Kaggle course, the following is stated: “#Apart from that, we should remove one of each of our dummy variables to reduce the collinearity in each. We’ll remove: Pclass_2, Age_categories_Teenager, Fare_categories_12-50, Title_Master, Cabin_type_A.”
Why? Are we just randomly removing one of the categories of each feature?
I don’t understand the point. Could somebody explain it to me? :slight_smile: Thanks!


Linear regression is done for two main purposes: prediction and inference.
Prediction means we care about whether the model produces accurate output y values. Inference means we care about the accuracy of the coefficients on the x variables, because we want to study how a unit increase in x changes y.

When doing prediction, collinearity is not a problem. When doing inference, collinearity makes the coefficient estimates unreliable. A full set of dummy columns for one feature is perfectly collinear by construction: in every row, exactly one dummy is 1 and the rest are 0, so the columns always sum to 1 and any one column can be computed from the others (the “dummy variable trap”). Dropping one column per feature removes that redundancy, and which column you drop is arbitrary; the dropped category simply becomes the baseline that the remaining coefficients are measured against.
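Here’s a minimal sketch (my own toy example, not from the course notebook) showing that redundancy in pandas. With all the dummy columns kept, every row sums to 1; `pd.get_dummies` with `drop_first=True` keeps k-1 columns instead, so the first level becomes the baseline:

```python
import pandas as pd

# Toy version of the Titanic Pclass feature (hypothetical data).
df = pd.DataFrame({"Pclass": [1, 2, 3, 2, 1]})

# All three dummies: each row sums to 1, so the columns are
# perfectly collinear with each other (and with an intercept).
full = pd.get_dummies(df["Pclass"], prefix="Pclass")
print(full.sum(axis=1).unique())  # [1] for every row

# drop_first=True keeps only k-1 columns; the dropped level
# (Pclass_1 here) becomes the baseline category.
reduced = pd.get_dummies(df["Pclass"], prefix="Pclass", drop_first=True)
print(reduced.columns.tolist())  # ['Pclass_2', 'Pclass_3']
```

The course drops Pclass_2 rather than the first level; that’s equally valid, since any one column per feature can serve as the baseline.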

It’s hard to find material that teaches when you don’t need to care about dropping dummies. I’m still searching.