Collinearity or Dummy Variable Trap and Overfitting

In the mission Feature Preparation, Selection and Engineering, we learn about the Dummy Variable Trap (caused by high collinearity between features), which can make the model overfit.
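
To make sure we're talking about the same thing, here is a minimal sketch of the trap with made-up data (using pandas one-hot encoding): if you keep every dummy column, the columns always sum to 1, so any one of them is a perfect linear combination of the others.

```python
import pandas as pd

# Made-up categorical column just to illustrate the dummy variable trap.
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

all_dummies = pd.get_dummies(df["color"])                    # keeps every level
safe_dummies = pd.get_dummies(df["color"], drop_first=True)  # drops one level

# With all levels kept, the dummy columns sum to 1 in every row,
# i.e. each column is a perfect linear combination of the others.
print(all_dummies.sum(axis=1).tolist())   # [1, 1, 1, 1]
print(safe_dummies.columns.tolist())      # ['green', 'red']
```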

The effect of collinearity is that your model will overfit - you may get great results on your test data set, but then the model performs worse on unseen data (like the holdout set).

Could you please explain why collinearity between features leads to overfitting? Thank you!

One of the main purposes of feature selection techniques is to pick the features that are statistically significant and carry the most predictive power, while avoiding multicollinearity at the same time. If two or more features are highly correlated and you include all of them, your model ends up fitting the noise they carry rather than any extra signal. In other words, your model memorizes the randomness (noise) in the collinear features. For instance, suppose your data includes both “height” and “weight” as features. Since these features are highly correlated, including both mostly adds noise, which leads the model to overfit. As a result, the model looks too optimistic on the training set and performs poorly on the unseen test set.
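
To illustrate this, here is a minimal sketch with simulated data (all numbers are made up, and the exact gap depends on the noise levels, the sample size and the random seed): several near-duplicate measurements of the same underlying quantity give ordinary least squares extra freedom to fit noise, so the training score climbs while the test score drops.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 60
size = rng.normal(0, 1, n)            # latent "body size"
y = 3 * size + rng.normal(0, 1, n)    # target depends only on the latent size

# Ten noisy, highly correlated measurements of the same latent quantity
# (think height, weight, shoe size, ...).
X_collinear = np.column_stack([size + rng.normal(0, 0.05, n) for _ in range(10)])
X_single = X_collinear[:, [0]]        # keep just one of the measurements

for name, X in [("10 collinear features", X_collinear), ("1 feature", X_single)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)
    model = LinearRegression().fit(X_tr, y_tr)
    print(name,
          "train R^2 = {:.2f}".format(model.score(X_tr, y_tr)),
          "test R^2 = {:.2f}".format(model.score(X_te, y_te)))
```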

Ok. Let’s say I use some kind of regularisation to avoid overfitting; in that case it shouldn’t be a problem. Why would my model capture more noise from collinear variables?

Regularization is a way to prevent overfitting when your model is complex; using it on a simple model leads to underfitting, where the model ignores the real underlying patterns in your data. One way to think about it: using all of the collinear features usually makes the model more complex, and complex models are prone to overfitting. You’re right that regularization is one way to keep that in check.
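
As a rough sketch of that idea, using the same kind of made-up collinear data as above (the alpha value is arbitrary here; in practice you would tune it with cross-validation): plain OLS spreads large positive and negative weights across the near-duplicate features, while ridge shrinks each weight back toward a small, stable value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 60
size = rng.normal(0, 1, n)
y = 3 * size + rng.normal(0, 1, n)
X = np.column_stack([size + rng.normal(0, 0.05, n) for _ in range(10)])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha picked arbitrarily for illustration

# OLS coefficients swing to large positive and negative values that cancel out;
# ridge spreads small, roughly equal weights across the near-duplicate features.
print("OLS coefficients:  ", np.round(ols.coef_, 1))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```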

That said, suppose you have an overwhelming number of features in your dataset. Using all of them can be too much, especially when some of them are correlated and some are statistically less significant. Feature selection and dimensionality reduction come in very handy in this situation: you can keep only the features that actually contribute, without compromising the quality of your model on the test set.
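
For the feature selection part, one simple and common approach is to drop one feature out of every highly correlated pair before modelling. Here is a minimal sketch; drop_correlated is just a hypothetical helper name and the 0.9 threshold is arbitrary.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Made-up example with the height/weight pair from above plus an unrelated feature.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 100)
df = pd.DataFrame({
    "height": height,
    "weight": 0.9 * height + rng.normal(0, 3, 100),   # nearly a rescaled copy of height
    "age": rng.integers(20, 60, 100),
})
print(drop_correlated(df).columns.tolist())   # ['height', 'age']
```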

I have found the answer to my question on Wikipedia:

So long as the underlying specification is correct, multicollinearity does not actually bias results; it just produces large standard errors in the related independent variables. More importantly, the usual use of regression is to take coefficients from the model and then apply them to other data. Since multicollinearity causes imprecise estimates of coefficient values, the resulting out-of-sample predictions will also be imprecise. And if the pattern of multicollinearity in the new data differs from that in the data that was fitted, such extrapolation may introduce large errors in the predictions.
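
To see the “imprecise estimates of coefficient values” part numerically, here is a minimal sketch with simulated data: refit OLS on bootstrap resamples and compare how much the coefficient on one collinear feature swings versus how stable the sum of the two coefficients stays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
size = rng.normal(0, 1, n)
X = np.column_stack([size + rng.normal(0, 0.05, n),    # "height"
                     size + rng.normal(0, 0.05, n)])   # "weight", nearly collinear with it
y = 3 * size + rng.normal(0, 1, n)

coefs = []
for _ in range(200):
    idx = rng.integers(0, n, n)                        # bootstrap resample of the rows
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
coefs = np.array(coefs)

# The individual coefficients are imprecise (large spread across resamples),
# but their sum, which is the part the data can actually pin down, is stable.
print("std of the 'height' coefficient:", round(float(coefs[:, 0].std()), 2))
print("std of the coefficient sum:     ", round(float(coefs.sum(axis=1).std()), 2))
```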
