On this mission:
https://app.dataquest.io/m/186/feature-preparation%2C-selection-and-engineering/9/final-feature-selection-using-rfecv
Right before implementing RFECV, the lesson states:
"""We can see that there is a high correlation between Sex_female/Sex_male and Title_Miss/Title_Mr/Title_Mrs. We will remove the columns Sex_female and Sex_male since the title data may be more nuanced.
Apart from that, we should remove one of each of our dummy variables to reduce the collinearity in each. We’ll remove:
Pclass_2
Age_categories_Teenager
Fare_categories_12-50
Title_Master
Cabin_type_A
"""
I understand why we need to remove one dummy column from each set: to avoid the dummy-variable trap (collinearity).
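To check that I have the collinearity point right, here is a minimal sketch on made-up data. The toy DataFrame and the variable names are my own assumptions, not taken from the lesson; only the Pclass_* column names mirror the ones quoted above.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
df = pd.DataFrame({"Pclass": [1, 2, 3, 2, 1, 3]})

# One-hot encode: produces Pclass_1, Pclass_2, Pclass_3.
dummies = pd.get_dummies(df["Pclass"], prefix="Pclass").astype("int64")

# Each row has exactly one 1, so the dummy columns always sum to 1.
row_sums = dummies.sum(axis=1)

# Drop one column from the set, as the lesson does with Pclass_2.
reduced = dummies.drop(columns=["Pclass_2"])

# The dropped column can be rebuilt from the remaining ones,
# so it carried no extra information.
rebuilt_pclass_2 = 1 - reduced.sum(axis=1)

print((row_sums == 1).all())
print((rebuilt_pclass_2 == dummies["Pclass_2"]).all())
```

If this is right, the same reconstruction works whichever single column is dropped from a set, so every choice removes the same redundancy.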
My questions are: How did Dataquest decide to remove these specific dummy variables from each set? Was it arbitrary? If not, can you show the steps taken to decide on removing these particular columns (as opposed to, e.g., Title_Dr or Cabin_type_C)?