Looking to get a definition of 'Data Leakage'

Hi there, I have been working through the guided project for predicting house prices, and the instructions advise me to ‘remove any columns that leak information about the sale’ and give the year of sale as an example of leaky information.

I was wondering if anyone could provide some advice on what constitutes a variable that leaks information, as I have not encountered this term in any of the missions leading up to this one.

Mission link: https://app.dataquest.io/m/240/guided-project%3A-predicting-house-sale-prices/2/feature-engineering

Thank you very much,
Nick.


Hi @nick.creed98

I haven’t completed this project yet, so these links are part of my search on this topic. Perhaps they are helpful to you too:

I, too, am looking for a more in-depth discussion of data leakage. From @Rucha’s links, it seems that the issue with our housing data set is that we’ll essentially be using data from the future to predict prices of houses sold in the past. Is this correct? If so, I don’t see how eliminating the year sold resolves this issue. Wouldn’t we have to temporally align the data between the train and test sets?
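To make the "temporally align" idea concrete, here is a minimal sketch of a time-based split, where the model is only ever trained on sales that happened before the ones it is tested on. The rows, the field names (`year_sold`, `price`), and the `temporal_split` helper are all made up for illustration; they are not the project's actual columns or code.

```python
# Hypothetical sale records; field names are placeholders, not the project's columns.
sales = [
    {"year_sold": 2006, "price": 200000},
    {"year_sold": 2007, "price": 210000},
    {"year_sold": 2008, "price": 190000},
    {"year_sold": 2009, "price": 185000},
    {"year_sold": 2010, "price": 195000},
]


def temporal_split(rows, cutoff_year):
    """Put sales up to and including cutoff_year in train, later sales in test."""
    train = [r for r in rows if r["year_sold"] <= cutoff_year]
    test = [r for r in rows if r["year_sold"] > cutoff_year]
    return train, test


train, test = temporal_split(sales, cutoff_year=2008)

# Every training sale happened before every test sale.
latest_train_year = max(r["year_sold"] for r in train)
earliest_test_year = min(r["year_sold"] for r in test)
```

With a split like this, nothing from the test period can influence training, which is different from a random shuffle where past and future sales are mixed together.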

Trying to revive the topic. This is how I understand the issue at hand:

A Towards Data Science article gives a good explanation:

This is because the test set’s purpose is to simulate real-world, unseen data.

unseen data

So in our case, if we wanted to predict the future sale price of a house, we would never have data about the date of the sale, correct? But if we were to change that column into the age of the house at the moment of sale, then it should be OK, correct?
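For what it's worth, the transformation described above could be sketched like this. The field names (`year_built`, `year_sold`, `age_at_sale`) and the `add_age_at_sale` helper are assumptions for illustration, not the project's actual column names, and whether this fully addresses the leakage concern is still the open question in this thread.

```python
# Hypothetical records; field names are placeholders, not the project's columns.
houses = [
    {"year_built": 1990, "year_sold": 2008, "price": 180000},
    {"year_built": 2005, "year_sold": 2009, "price": 250000},
]


def add_age_at_sale(rows):
    """Replace the raw sale year with the house's age at the time of sale."""
    result = []
    for row in rows:
        # Copy every field except the raw sale year.
        new_row = {key: value for key, value in row.items() if key != "year_sold"}
        new_row["age_at_sale"] = row["year_sold"] - row["year_built"]
        result.append(new_row)
    return result


features = add_age_at_sale(houses)
```

One thing worth checking after a step like this is whether any row ends up with a negative age, which would point to a data entry problem rather than a real house.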