Link to the mission:
Without the Dataquest guidance of plotting the three example features (Garage Area, Gr Liv Area, and Overall Cond), what would be the best workflow to choose the best single feature from the 20+ available?
What I did:
Using the data documentation, I made a list of all the numerical column names. I then made scatter plots of each numerical feature against the target (SalePrice) and calculated the correlation coefficient of each feature with the target.
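For reference, the correlation step can be sketched roughly as follows. This is a minimal sketch, not my exact code: the helper name rank_by_correlation is hypothetical, and the small DataFrame is made-up toy data used only so the snippet runs on its own (it is not the real housing data, so the numbers will not match the mission).

```python
import pandas as pd


def rank_by_correlation(df, features, target="SalePrice"):
    """Pearson correlation of each feature with the target,
    ordered from strongest to weakest by absolute value."""
    corrs = df[features].corrwith(df[target])
    order = corrs.abs().sort_values(ascending=False).index
    return corrs.reindex(order)


# Toy values, invented for illustration only (NOT the real dataset).
toy = pd.DataFrame({
    "Gr Liv Area": [900, 1200, 1500, 1800, 2400],
    "Garage Area": [200, 480, 300, 520, 600],
    "Overall Cond": [5, 7, 5, 6, 5],
    "SalePrice": [110000, 150000, 172000, 215000, 290000],
})

ranking = rank_by_correlation(
    toy, ["Gr Liv Area", "Garage Area", "Overall Cond"]
)
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations near the top instead of burying them at the bottom of the list.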
Overall Qual had the highest correlation (0.805), but it is a discrete variable. Of the continuous variables, Gr Liv Area had the highest correlation at 0.699 (somehow different from the mission’s 0.709).
It makes intuitive sense to me to use a continuous variable for a model, so Gr Liv Area seems like the best choice: it is continuous and has the highest correlation among the continuous features.
Why, then, did the instructions include a discrete variable among the three (randomly chosen?) example features? And what is the best way to approach univariate feature selection like this when there is no guidance to follow?