Kaggle - feature selection


I was working on the 'Feature Preparation, Selection and Engineering' part of Kaggle, where the advice is to use the columns with the highest coefficients from the fit in order to improve accuracy. However, when I use all the columns I get slightly higher accuracy than when using only the chosen ones: accuracy with the chosen columns is 0.8148019521053229, while with all columns it is 0.8148399727613211.

How do I reconcile that?

The code I have is the following:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

columns = ['Age_categories_Missing', 'Age_categories_Infant',
           'Age_categories_Child', 'Age_categories_Teenager',
           'Age_categories_Young Adult', 'Age_categories_Adult',
           'Age_categories_Senior', 'Pclass_1', 'Pclass_2', 'Pclass_3',
           'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
           'SibSp_scaled', 'Parch_scaled', 'Fare_scaled']

all_X = train[columns]
all_y = train['Survived']

lr = LogisticRegression()
scores = cross_val_score(lr, all_X, all_y, cv=10)
accuracy = scores.mean()

Those two numbers are very, very similar! I don't know for certain, but I would guess that there is no statistically significant difference between them. That means it doesn't really matter whether you use all the features or just the chosen ones: using all of them doesn't hurt, but it doesn't really help, either.
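One way to check this, rather than eyeballing the two means, is to compare the per-fold cross-validation scores with a paired t-test. The sketch below uses synthetic data from `make_classification` as a stand-in for your `train[columns]` / `train['Survived']`, and the "chosen" subset is just the first five columns for illustration. (Note that CV folds are not fully independent, so the t-test is only an approximation, but it's a useful sanity check.)

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic data: 18 columns, 5 actually informative.
X, y = make_classification(n_samples=500, n_features=18, n_informative=5,
                           random_state=0)
X_subset = X[:, :5]  # hypothetical stand-in for the "chosen" columns

lr = LogisticRegression(max_iter=1000)
scores_all = cross_val_score(lr, X, y, cv=10)
scores_subset = cross_val_score(lr, X_subset, y, cv=10)

# Paired t-test on the per-fold scores: a large p-value means the gap
# between the two mean accuracies could easily be chance.
t_stat, p_value = ttest_rel(scores_all, scores_subset)
print(f"mean(all)={scores_all.mean():.4f}  mean(subset)={scores_subset.mean():.4f}")
print(f"p-value={p_value:.3f}")
```

If the p-value comes out large, the honest conclusion is exactly the one above: the two feature sets perform indistinguishably on this data.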

It is true that appropriate feature selection can improve accuracy, but that's not always the case. There are also other reasons you might want to keep the feature set smaller, including potentially:

  • reduced training time
  • reduced complexity of the model
  • reduced risk of overfitting