Link to the mission:
Without the Dataquest guidance of plotting the three example features (including Gr Liv Area and Overall Cond), what would be the best workflow for choosing the single best feature from the 20+ available?
What I did:
Using the data documentation, I made a list of all the numerical column names. I then made a scatter plot of each numerical feature against the target (SalePrice) and calculated each feature's correlation coefficient with it.
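For reference, here is a minimal sketch of the correlation step. It runs on synthetic stand-in data rather than the real housing file, so the numbers it prints are not the ones quoted in this post, and the helper name rank_by_correlation is my own invention, not something from the mission.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): NOT the real housing values.
rng = np.random.default_rng(0)
n = 500
gr_liv_area = rng.normal(1500, 500, n).clip(400, None)
overall_qual = rng.integers(1, 11, n)
overall_cond = rng.integers(1, 10, n)
sale_price = 60 * gr_liv_area + 20000 * overall_qual + rng.normal(0, 30000, n)

df = pd.DataFrame({
    "Gr Liv Area": gr_liv_area,
    "Overall Qual": overall_qual,
    "Overall Cond": overall_cond,
    "SalePrice": sale_price,
})


def rank_by_correlation(data, target):
    """Return absolute Pearson correlations with the target, highest first."""
    numeric = data.select_dtypes(include="number")
    corrs = numeric.corr()[target].drop(target)
    return corrs.abs().sort_values(ascending=False)


ranking = rank_by_correlation(df, "SalePrice")
print(ranking)
```

Taking the absolute value keeps strongly negative relationships from being ranked last, which matters if any feature moves opposite to the price.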
Overall Qual had the highest correlation (0.805), but it is a discrete variable. Of the continuous variables, Gr Liv Area had the highest correlation at 0.699 (somehow different from the mission’s 0.709).
It makes intuitive sense to me to use a continuous variable for the model, so Gr Liv Area looks like the best choice under two criteria: it is continuous, and it has the highest correlation among the continuous features.
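One way to check a choice like this beyond correlation would be to fit a one-feature linear model per candidate and compare the error on held-out rows. The sketch below assumes a simple first-rows/last-rows split and uses synthetic data again; holdout_rmse is a hypothetical helper name, and I am not claiming this is what the mission does.

```python
import numpy as np


def holdout_rmse(x, y, train_size):
    """Fit y = slope * x + intercept on the first train_size rows,
    then return the RMSE on the remaining rows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x[:train_size], y[:train_size], deg=1)
    predictions = slope * x[train_size:] + intercept
    return float(np.sqrt(np.mean((y[train_size:] - predictions) ** 2)))


# Synthetic stand-in data (assumption): NOT the real housing values.
rng = np.random.default_rng(1)
n = 600
features = {
    "Gr Liv Area": rng.normal(1500, 500, n),
    "Overall Cond": rng.integers(1, 10, n).astype(float),
}
sale_price = 100 * features["Gr Liv Area"] + rng.normal(0, 20000, n)

# Score every candidate feature the same way and keep the lowest error.
scores = {
    name: holdout_rmse(values, sale_price, train_size=400)
    for name, values in features.items()
}
best = min(scores, key=scores.get)
print(scores, best)
```

The same loop treats discrete and continuous features identically, so it would at least show whether a discrete feature actually predicts worse on unseen rows.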
Why, then, did the instructions include a discrete variable among the three (randomly chosen?) example features? And what is the best way to approach univariate feature selection like this when there is no guidance?