Guided Project: Predicting House Sale Prices: Data Leakage

Hi guys,

In the Predicting House Sale Prices Guided Project, we drop columns such as “Mo Sold”, “Sale Condition”, “Sale Type”, “Yr Sold” as they consist of data regarding the actual sale and hence ‘leak data’.

However, in the suggested solution, why are the two created features “Years Before Sale” and “Years Since Remod” not considered leaky, since they make use of “Yr Sold” and wouldn’t be available on new data?

years_sold = df['Yr Sold'] - df['Year Built']
years_since_remod = df['Yr Sold'] - df['Year Remod/Add']

df['Years Before Sale'] = years_sold
df['Years Since Remod'] = years_since_remod

I agree about the feature “Years Before Sale”, but not about “Years Since Remod”. How would the years since remodeling leak?

Hi, I had the same concern. We drop 'Yr Sold' because it’s leaky, but then use it to create new features.
Regarding years_since_remod = df['Yr Sold'] - df['Year Remod/Add'], shouldn’t we use:
years_since_remod = df['Year Remod/Add'] - df['Year Built']

That way we aren’t using a leaky column, and it is consistent with one of the exercises in the guided courses.
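
For anyone who wants to compare the two approaches side by side, here is a minimal sketch on a toy DataFrame. The column names are the ones quoted in this thread; the row values are made up, and 'Years Built To Remod' is a hypothetical name I picked for the alternative feature, not something from the guided project.

```python
import pandas as pd

# Toy rows using the column names quoted in this thread; values are made up.
df = pd.DataFrame({
    "Year Built": [1960, 1995, 2005],
    "Year Remod/Add": [1960, 2003, 2007],
    "Yr Sold": [2010, 2008, 2009],
})

# Features from the suggested solution: both depend on 'Yr Sold'.
df["Years Before Sale"] = df["Yr Sold"] - df["Year Built"]
df["Years Since Remod"] = df["Yr Sold"] - df["Year Remod/Add"]

# Alternative proposed above: uses only columns known before any sale.
# 'Years Built To Remod' is a hypothetical name, not from the guided project.
df["Years Built To Remod"] = df["Year Remod/Add"] - df["Year Built"]

# Drop the sale-related column once the features are derived.
features = df.drop(columns=["Yr Sold"])
print(features)
```

Note that the alternative measures something different: how long after construction the house was remodeled, rather than how old the remodel was at the time of sale.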

Thanks