When is it ok to include forward-looking data during training?

I’m fairly new to data science, and I’ve taken it as a cardinal rule that my training data should never incorporate information that isn’t available at prediction time. When the training data is historical and time is a factor, I assume that including any information that arrived after a particular training instance risks leaking future knowledge into the model. I’ve found a case that feels ambiguous, though.

The specific situation involves predicting the availability of a location, using the history of visitors to that location as a feature. My labeled data (observations of availability) and my observations of the visit history both have dates throughout 2019. The feature I want to train on is each location’s average visitors per day. I can either restrict the average to the visits before each label’s date, or compute it for each location over all of 2019 (a sketch of both versions is below). Computing it over all of 2019 means the feature for each label includes future data. I’m wary of that approach, but when I filter to only the visits before each label, I have a lot less data to calculate the averages from.
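To make the two options concrete, here’s a rough pandas sketch of what I mean. The frames and column names (`visits`, `labels`, `location_id`, `date`, `visitors`, `available`) are just placeholders for my real data:

```python
import pandas as pd

# Hypothetical schemas (placeholders for my real data):
#   visits: one row per location per day -> location_id, date, visitors
#   labels: one availability observation  -> location_id, date, available
visits = pd.read_csv("visits_2019.csv", parse_dates=["date"])
labels = pd.read_csv("labels_2019.csv", parse_dates=["date"])

# Option A (leaky): each location's average over ALL of 2019, so every
# label sees visitor counts recorded after its own date.
full_year_avg = visits.groupby("location_id")["visitors"].mean()
labels["avg_visitors_leaky"] = labels["location_id"].map(full_year_avg)

# Option B (point in time): for each label, average only the visits
# strictly before that label's date.
visits = visits.sort_values("date")
visits["avg_so_far"] = visits.groupby("location_id")["visitors"].transform(
    lambda s: s.expanding().mean()
)
labels = labels.sort_values("date")
# merge_asof attaches the running average from the latest visit row
# before each label; allow_exact_matches=False keeps same-day visits out.
labels = pd.merge_asof(
    labels,
    visits[["location_id", "date", "avg_so_far"]],
    on="date",
    by="location_id",
    allow_exact_matches=False,
).rename(columns={"avg_so_far": "avg_visitors_pit"})
# Early labels with no prior visits come out NaN -- exactly the
# "a lot less data" problem I described above.
```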

What would go wrong if I include the forward-looking observations in calculating these averages? How would I know if I’ve made a mistake?

Much appreciated.

It sounds like you already know the right answer here. The tiny angel on one shoulder says “only include information you could have known at the moment you would have wanted to make the prediction,” and the tiny devil on the other shoulder says “but what about this information from the future?!?” Presumably the point of the model is to generalize, i.e. to be useful on some other data (using it right now, going forward, or applying it to a different dataset). If you won’t have access to the future data in the situation where you’d like to apply the model, then using the full-2019 average denies it the best chance of generalizing, because you’re introducing a known source of error. When you optimize a model, you’re trying to reduce total error! Listen to the angel!

Perhaps more directly to your point: you can see in your model the extent to which adding “future” data reduces your training or validation error, but I’m not familiar with any general way of estimating how much error it adds to the model’s ability to generalize. (If there were, I think it would be best practice to do a cost-benefit analysis before including forbidden data like this. Since, to my knowledge, there isn’t, the community leans toward avoiding this avoidable source of error.)
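That said, while there’s no general formula, you can get a crude empirical estimate for your specific case by simulating deployment: train on the leaky feature, then score a chronologically later test slice once with the leaky feature and once with the point-in-time version that’s all you’d actually have going forward. A minimal sketch, reusing the hypothetical `labels` frame and column names from your question and assuming a binary availability label:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Chronological split: the test period sits entirely after the training
# period, the way deployment would.
labels = labels.dropna(subset=["avg_visitors_leaky", "avg_visitors_pit"])
train = labels[labels["date"] < "2019-10-01"]
test = labels[labels["date"] >= "2019-10-01"]

# Train on the leaky, full-year feature.
model = LogisticRegression().fit(
    train[["avg_visitors_leaky"]].to_numpy(), train["available"]
)

# Validation as the devil sees it: the test feature also peeks ahead.
naive_auc = roc_auc_score(
    test["available"],
    model.predict_proba(test[["avg_visitors_leaky"]].to_numpy())[:, 1],
)

# Deployment as the angel sees it: only the point-in-time average
# exists when a real prediction has to be made.
honest_auc = roc_auc_score(
    test["available"],
    model.predict_proba(test[["avg_visitors_pit"]].to_numpy())[:, 1],
)

print(f"looks-like AUC: {naive_auc:.3f}  deploy-like AUC: {honest_auc:.3f}")
```

The gap between the two numbers is a rough estimate of the error the forward-looking feature would add once you could no longer compute it.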
