In the guided project “Predicting stock market”, Dataquest suggests trying the year, month, and day of the week as features for the model. I can understand how the month or day of the week could be useful to the model, but I don’t understand how the year can improve the model’s results. The year is always different.
When you say “the year is always different”, there seems to be an assumption that feature values must stay the same. Another way to put it: you expect future data in production to take one of the values seen in the current data during training.
If that assumption were enforced, users would have little confidence in the model’s ability to extrapolate beyond the domain/range of values a linear regression model is trained on (I never did the mission, but a quick scan shows this is the model used).
However, in practice it can indeed be sensible to use the year as a predictor. Imagine a company that sets a fixed absolute year-on-year revenue growth target. This would create a perfectly linear upward trend in its revenue, with the single predictor (year) fully determining future revenue. E.g. given (year, revenue) data up to (2002, 1.2 mil), it would be reasonable to predict for the never-before-seen year 2003.
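A minimal sketch of that scenario (the revenue figures here are made up for illustration — a fixed 0.1 mil added each year, ending at 1.2 mil in 2002):

```python
import numpy as np

# Hypothetical data: the company adds a fixed 0.1 mil every year,
# so revenue is a perfectly linear function of the year.
years = np.array([1998, 1999, 2000, 2001, 2002])
revenue = np.array([0.8, 0.9, 1.0, 1.1, 1.2])  # in millions

# Fit revenue = slope * year + intercept by least squares.
slope, intercept = np.polyfit(years, revenue, 1)

# Extrapolating to a never-before-seen year is reasonable here,
# because the relationship really is linear in the year.
pred_2003 = slope * 2003 + intercept
print(round(pred_2003, 2))  # → 1.3
```

Even though 2003 never appears in the training data, the model generalizes because the year carries real trend information, not because the exact value was seen before.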
Extending this discussion from numerical features to categorical features, the above assumption is hard to uphold in real life too. When business processes change and definitions of events/metrics change, something previously described with 2 categories could become 3, or vice versa. The point is that new, unseen data will keep arriving, so it is a stretch to expect future data to look like the current data.
One solution can be to retrain completely, or to incrementally train with the new categories…
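As a stopgap before retraining, an encoder can at least be made to tolerate unseen categories rather than crash. A sketch using scikit-learn’s `OneHotEncoder` (the category names here are invented for illustration):

```python
from sklearn.preprocessing import OneHotEncoder

# Training data knows only two categories of some event type.
train = [["online"], ["in_store"], ["online"]]

# handle_unknown="ignore" encodes an unseen category as all zeros
# instead of raising an error at prediction time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# A new third category appears in production data.
new = [["phone"], ["online"]]
print(enc.transform(new).toarray())
# "phone" → [0, 0] (unknown), "online" → [0, 1]
```

This only degrades gracefully — the model still learns nothing about the new category until it is retrained (or incrementally updated) on data that includes it.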