240-2 Question on choosing columns to drop in the Guided Project: "Predicting House Sale Prices"

Hi Guys,

One step in the Guided Project ‘Predicting House Sale Prices’ asks us to drop columns that “leak data about the final sale”. The target feature is the house price.

The answer was to drop four columns: “Mo Sold”, “Sale Condition”, “Sale Type”, and “Yr Sold”.
Below is the info for those four features.
What I don’t understand is how those four features leak data about the final sale. Why can’t we just convert them to categorical data and create dummy variables? Thanks, I really appreciate it.

Mo Sold: Month Sold (MM)

Sale Condition: Condition of sale
  Normal: Normal Sale
  Abnorml: Abnormal Sale - trade, foreclosure, short sale
  AdjLand: Adjoining Land Purchase
  Alloca: Allocation - two linked properties with separate deeds, typically condo with a garage unit
  Family: Sale between family members
  Partial: Home was not completed when last assessed (associated with New Homes)

Sale Type: Type of sale
  WD: Warranty Deed - Conventional
  CWD: Warranty Deed - Cash
  VWD: Warranty Deed - VA Loan
  New: Home just constructed and sold
  COD: Court Officer Deed/Estate
  Con: Contract 15% Down payment regular terms
  ConLw: Contract Low Down payment and low interest
  ConLI: Contract Low Interest
  ConLD: Contract Low Down
  Oth: Other

Yr Sold: Year Sold (YYYY)

Hey, Bin.

They leak data because they give away information about the sale itself, which is what you’re trying to predict.

Say you want to predict the sale price of a property. If you know which month it will be sold in, that’s an extra piece of information to help you. But in practice you don’t know that: the sale hasn’t occurred yet, so you don’t know when it will happen.

The same is true for any feature concerning the actual sale.
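
In case it helps, here is a minimal sketch of the drop step in pandas. The four column names come from the project answer quoted above. Everything else is an assumption for illustration: the tiny DataFrame and its values are made up, and “Gr Liv Area” and “SalePrice” are stand-in names for an ordinary feature and the target column.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT values from the real dataset.
# "Gr Liv Area" and "SalePrice" are assumed names for a feature and the target.
df = pd.DataFrame({
    "Gr Liv Area": [1656, 896, 1329],
    "Mo Sold": [5, 6, 6],
    "Yr Sold": [2010, 2010, 2010],
    "Sale Type": ["WD", "WD", "New"],
    "Sale Condition": ["Normal", "Normal", "Partial"],
    "SalePrice": [215000, 105000, 172000],
})

# Columns that are only known once the sale has actually happened.
leaky_cols = ["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"]

# Keep only what would be available before the sale, and separate the target.
features = df.drop(columns=leaky_cols + ["SalePrice"])
target = df["SalePrice"]

print(list(features.columns))
```

Dummy-coding those four columns would run without errors, but the model would then depend on inputs you cannot supply for a house that has not been sold yet.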


Thanks, Bruno. That’s so much clearer now. I really appreciate your help!
