In the Machine Learning courses, the train and test data come from the same data set, but I am curious whether you can use “new data” as the test data. Specifically, I am looking to estimate vacation usage for employees in 2021. My training data would be the 2020 leave usage rates for employees, and I have the 2021 spreadsheet with current employees and updated features. Is it appropriate/valid to use the 2021 spreadsheet as the test data?
You can. That’s what a lot of production-level/real-world Machine Learning systems have to do: run inference on unseen data.
It is appropriate, as long as you still look into the issues that can come up when your model does not fit the new data well enough (overfitting or underfitting). This is especially important when working with time-series data, as you might be, because patterns you train on in the 2020 data might not exist in your 2021 data at all, which would leave you with a very poorly performing model.
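One rough way to look for that kind of mismatch before trusting the predictions is to compare each feature's distribution across the two years. The sketch below uses only the Python standard library and flags a feature when its 2021 mean has moved by more than some number of 2020 standard deviations. The feature names (`tenure_years`, `remote_days`), the sample rows, and the `0.5` threshold are all made up for illustration; they are not taken from your spreadsheets, and this is a simple sanity check rather than a full drift test.

```python
from statistics import mean, stdev


def drift_report(train_rows, new_rows, features, threshold=0.5):
    """Compare feature means between two data sets.

    A feature is flagged when the mean of the new data differs from the
    training mean by more than `threshold` training standard deviations.
    The threshold is an arbitrary illustrative choice.
    """
    report = {}
    for name in features:
        train_vals = [row[name] for row in train_rows]
        new_vals = [row[name] for row in new_rows]
        gap = abs(mean(new_vals) - mean(train_vals))
        spread = stdev(train_vals)
        if spread > 0:
            shift = gap / spread
        else:
            # Constant feature in training data: any change counts as a shift.
            shift = 0.0 if gap == 0 else float("inf")
        report[name] = {"shift": shift, "drifted": shift > threshold}
    return report


# Hypothetical rows standing in for the 2020 (training) and 2021 (new) data.
rows_2020 = [
    {"tenure_years": 2, "remote_days": 0},
    {"tenure_years": 5, "remote_days": 1},
    {"tenure_years": 8, "remote_days": 0},
    {"tenure_years": 3, "remote_days": 1},
]
rows_2021 = [
    {"tenure_years": 3, "remote_days": 4},
    {"tenure_years": 6, "remote_days": 5},
    {"tenure_years": 9, "remote_days": 4},
    {"tenure_years": 4, "remote_days": 5},
]

report = drift_report(rows_2020, rows_2021, ["tenure_years", "remote_days"])
for feature, result in report.items():
    print(feature, round(result["shift"], 2), result["drifted"])
```

With these made-up rows, `remote_days` is flagged and `tenure_years` is not. A flagged feature does not prove the model will fail, but it is a hint that the 2020 patterns may not carry over and that the 2021 predictions deserve a closer look.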