How did Dataquest choose which dummy variables to eliminate? Kaggle Fundamentals: Feature Preparation, Selection and Engineering - 9

On this mission:
https://app.dataquest.io/m/186/feature-preparation%2C-selection-and-engineering/9/final-feature-selection-using-rfecv


Right before implementing RFECV, the lesson states:

“We can see that there is a high correlation between Sex_female/Sex_male and Title_Miss/Title_Mr/Title_Mrs. We will remove the columns Sex_female and Sex_male since the title data may be more nuanced.

Apart from that, we should remove one of each of our dummy variables to reduce the collinearity in each. We’ll remove:

  • Pclass_2
  • Age_categories_Teenager
  • Fare_categories_12-50
  • Title_Master
  • Cabin_type_A”
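
For reference, this is roughly how I picture that step in pandas. It is only a minimal sketch on a made-up toy frame, not the lesson's actual code: the `toy` DataFrame and its values are my own assumptions, and only the dummy column names come from the lesson.

```python
import pandas as pd

# Toy data with hypothetical values; the real mission uses the Titanic training set.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Sex": ["female", "male", "male", "female"],
})

# One-hot encode both columns into Pclass_* and Sex_* dummy columns.
dummies = pd.get_dummies(toy, columns=["Pclass", "Sex"])

# Drop both Sex dummies entirely and one Pclass dummy, following the lesson's list.
to_drop = ["Sex_female", "Sex_male", "Pclass_2"]
reduced = dummies.drop(columns=to_drop)

print(list(reduced.columns))
```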


I understand why we need to remove one dummy column from each set: to avoid the dummy variable trap (collinearity).
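
To check my understanding, here is a small sketch of the trap itself. It assumes a hypothetical three-level category (standing in for something like Pclass) and a model with an intercept; `design_rank` is a helper name I made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level category, standing in for something like Pclass.
s = pd.Series(["a", "b", "c", "a", "b", "c"], name="cat")
full = pd.get_dummies(s, prefix="cat").astype(float)

def design_rank(df):
    # Rank of the design matrix with an intercept column of ones prepended.
    X = np.column_stack([np.ones(len(df)), df.to_numpy()])
    return np.linalg.matrix_rank(X)

# All three dummies plus the intercept: 4 columns, but the dummies sum to the
# intercept column, so the matrix is rank-deficient.
print(full.shape[1] + 1, design_rank(full))

# Dropping any single dummy leaves 3 columns, and the rank matches the column count.
for col in full.columns:
    print(col, design_rank(full.drop(columns=col)))
```

As far as the exact linear dependence goes, dropping any one column per set seems to work equally well in this toy case, which is what prompts my question below. It says nothing about how Dataquest actually picked its columns.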

My questions are: How did Dataquest decide to remove these specific dummy variables from each set? Was it arbitrary? If not, can you show the steps taken to choose these columns (e.g. as opposed to removing Title_Dr, Cabin_type_C, etc.)?