Feature scaling - Predicting House Sale Prices

Screen Link: https://app.dataquest.io/m/240/guided-project%3A-predicting-house-sale-prices/3/feature-selection

Hi guys,

I used the rescaling technique in this project (as we did in this [mission](https://app.dataquest.io/m/236/feature-selection/5/removing-low-variance-features)) to normalize all the features and remove those with low variance.
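For anyone who wants to reproduce the idea without my dataset, here is a minimal sketch of what I mean by rescaling and then removing low-variance features. The helper name `rescale_and_filter`, the toy DataFrame and the 0.1 threshold are made up for illustration and are not part of the project; the guard for constant columns is my own assumption, since min-max scaling divides by zero when max equals min.

```python
import pandas as pd

def rescale_and_filter(df, cols, var_threshold):
    """Min-max rescale `cols` to [0, 1], then drop the ones whose
    rescaled variance is below `var_threshold` (hypothetical helper)."""
    out = df.copy()
    col_min = out[cols].min()
    col_range = out[cols].max() - col_min
    # Assumption: a constant column has zero range, which would give NaN
    # after division, so use 1 instead (the column becomes all zeros).
    col_range = col_range.replace(0, 1)
    scaled = (out[cols] - col_min) / col_range
    out[cols] = scaled
    # Sample variance (pandas default, ddof=1) of the rescaled columns
    variances = scaled.var()
    low_var = variances[variances < var_threshold].index
    return out.drop(columns=low_var)

# Toy data, not the housing dataset
toy = pd.DataFrame({
    "area": [50 + 10 * i for i in range(12)],   # evenly spread values
    "mostly_zero": [0] * 11 + [1],              # dummy-like, rarely 1
    "constant": [3] * 12,                       # no variance at all
    "SalePrice": [100000 + 5000 * i for i in range(12)],
})

feature_cols = ["area", "mostly_zero", "constant"]
result = rescale_and_filter(toy, feature_cols, var_threshold=0.1)
print(result.columns.tolist())  # ['area', 'SalePrice']
```

The target column is left out of `cols` on purpose, so `SalePrice` keeps its original units and the RMSE stays comparable between runs.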

But the RMSE does not improve; in fact, it increases tremendously:

  • RMSE without rescaling: 26532.30 (k=0) :+1:

  • RMSE with all normalized features: 54524673275928.02 (k=0)

  • RMSE with all features normalized and low-variance ones removed: 48016.50 (k=0)

  • RMSE with numerical features normalized (and low-variance ones removed): 54524673275928.02 (k=0)

  • RMSE with categorical features normalized (and low-variance ones removed): 31333.01 (k=0)

Has anyone tried this technique and gotten a better value?

Why is the technique useless in this case? Perhaps I made a mistake?

My Code:

  # Feature scaling with categorical columns
  #cat_cols = filtered_df.select_dtypes(['uint8']).columns

  #filtered_df[cat_cols] = (filtered_df[cat_cols]-filtered_df[cat_cols].min())\
  #   / (filtered_df[cat_cols].max()-filtered_df[cat_cols].min())

  #unit_df_var = filtered_df[cat_cols].var()
  #filtered_df.drop(unit_df_var[unit_df_var < 0.1].index, axis=1, inplace=True)

  # Feature scaling with all columns
  #target_col = filtered_df['SalePrice']
  #filtered_df = (filtered_df - filtered_df.min()) / (filtered_df.max() - filtered_df.min())
  #filtered_df['SalePrice'] = target_col
  #unit_df_var = filtered_df.var()
  #filtered_df.drop(unit_df_var[unit_df_var < 0.15].index, axis=1, inplace=True)

  # Feature scaling with numerical columns
  #num_cols = filtered_df.select_dtypes(['int','float']).columns.drop('SalePrice')

  #filtered_df[num_cols] = (filtered_df[num_cols]-filtered_df[num_cols].min())\
  # / (filtered_df[num_cols].max()-filtered_df[num_cols].min())

  #unit_df_var = filtered_df[num_cols].var()
  #filtered_df.drop(unit_df_var[unit_df_var < 0.001].index, axis=1, inplace=True)