Kaggle - feature selection


I was working on the 'Feature Preparation, Selection and Engineering' part of Kaggle, where the advice is to use the columns with the highest coefficients from the fit in order to improve accuracy. However, when I use all the columns I get slightly higher accuracy than when using only the chosen ones: accuracy with the chosen columns is 0.8148019521053229, while with all columns it is 0.8148399727613211.

How do I reconcile that?

The code I have is the following:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

columns = ['Age_categories_Missing', 'Age_categories_Infant',
           'Age_categories_Child', 'Age_categories_Teenager',
           'Age_categories_Young Adult', 'Age_categories_Adult',
           'Age_categories_Senior', 'Pclass_1', 'Pclass_2', 'Pclass_3',
           'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
           'SibSp_scaled', 'Parch_scaled', 'Fare_scaled']

all_X = train[columns]
all_y = train['Survived']

lr = LogisticRegression()
scores = cross_val_score(lr, all_X, all_y, cv=10)
accuracy = scores.mean()

Those two numbers are very, very similar! I don't know for certain, but I would guess that there is no statistically significant difference between them. That means it doesn't really matter whether you use all the features or just the chosen ones: using all of them doesn't hurt, but it doesn't really help, either.
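One way to check this, rather than eyeballing the two means, is to compare the per-fold cross-validation scores with a paired t-test. The sketch below uses synthetic data from `make_classification` as a stand-in for your `train[columns]` / `train['Survived']`, and the "chosen" subset is just the first five columns for illustration. (Note that CV folds are not fully independent, so the t-test is only an approximation, but it's a useful sanity check.)

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic data: 18 columns, 5 actually informative.
X, y = make_classification(n_samples=500, n_features=18, n_informative=5,
                           random_state=0)
X_subset = X[:, :5]  # hypothetical stand-in for the "chosen" columns

lr = LogisticRegression(max_iter=1000)
scores_all = cross_val_score(lr, X, y, cv=10)
scores_subset = cross_val_score(lr, X_subset, y, cv=10)

# Paired t-test on the per-fold scores: a large p-value means the gap
# between the two mean accuracies could easily be chance.
t_stat, p_value = ttest_rel(scores_all, scores_subset)
print(f"mean(all)={scores_all.mean():.4f}  mean(subset)={scores_subset.mean():.4f}")
print(f"p-value={p_value:.3f}")
```

If the p-value comes out large, the honest conclusion is exactly the one above: the two feature sets perform indistinguishably on this data.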

It is true that appropriate feature selection can improve accuracy, but that's not always the case. There are also other reasons you might want to keep the feature set smaller, including potentially:

  • reduced training time
  • reduced complexity of the model
  • reduced risk of overfitting