How to split time series data into training and test set?

What is the best way to split time series data into training and test sets? If I have historical sales data for, say, 3 or 5 years, splitting the data randomly could lead to misleading results, but I am not sure. Is there a better way to do it?

Here's a similar question asked on Stack Exchange:

Question: I want to be sure of something: is the use of k-fold cross-validation with time series straightforward, or does one need to pay special attention before using it?

Background: I'm modeling a time series of 6 years (with a semi-Markov chain), with a data sample every 5 minutes. To compare several models, I'm using 6-fold cross-validation by separating the data into the 6 years, so my training sets (to calculate the parameters) have a length of 5 years, and the test sets have a length of 1 year. I'm not taking the time order into account, so my different sets are:

• fold 1 : training [1 2 3 4 5], test [6]
• fold 2 : training [1 2 3 4 6], test [5]
• fold 3 : training [1 2 3 5 6], test [4]
• fold 4 : training [1 2 4 5 6], test [3]
• fold 5 : training [1 3 4 5 6], test [2]
• fold 6 : training [2 3 4 5 6], test [1]

I'm making the hypothesis that each year is independent of the others. How can I verify that? Is there any reference showing the applicability of k-fold cross-validation to time series?

Time series (and other intrinsically ordered data) can be problematic for cross-validation. If some pattern emerges in year 3 and persists through years 4-6, your model can pick up on it, even though it wasn't part of years 1 and 2.

An approach that's sometimes more principled for time series is forward chaining, where your procedure would be something like this:

• fold 1 : training [1], test [2]
• fold 2 : training [1 2], test [3]
• fold 3 : training [1 2 3], test [4]
• fold 4 : training [1 2 3 4], test [5]
• fold 5 : training [1 2 3 4 5], test [6]

That more accurately models the situation you'll see at prediction time, where you'll train on past data and predict on forward-looking data. It will also give you a sense of how your model's performance depends on the amount of training data.
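If you're working in Python, forward chaining is available out of the box as scikit-learn's `TimeSeriesSplit`. A minimal sketch, assuming the data is ordered from oldest to newest:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six observations standing in for the six years in the example.
X = np.arange(6).reshape(-1, 1)

# TimeSeriesSplit reproduces the forward-chaining folds above:
# every training set contains only observations that precede the test set.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print(f"training {train_idx + 1}, test {test_idx + 1}")
```

The printed folds match the list above, from `training [1], test [2]` through `training [1 2 3 4 5], test [6]`.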

You can improve this further by deleting a subset of the data so that the training and test data are approximately independent.

For cross-validation to work as a model selection tool, you need approximate independence between the training and the test data. The problem with time series data is that adjacent data points are often highly dependent, so standard cross-validation will fail. The remedy is to leave a gap between the test sample and the training samples, on both sides of the test sample. The reason you also need to leave a gap before the test sample is that dependence is symmetric whether you move forward or backward in time (think of correlation).

This approach is called hv cross-validation (leave v out, delete h observations on either side of the test sample) and is described in this paper. In your example, this would look like this:

• fold 1 : training [1 2 3 4 5h], test [6]
• fold 2 : training [1 2 3 4h h6], test [5]
• fold 3 : training [1 2 3h h5 6], test [4]
• fold 4 : training [1 2h h4 5 6], test [3]
• fold 5 : training [1h h3 4 5 6], test [2]
• fold 6 : training [h2 3 4 5 6], test [1]

Where the h indicates that h observations of the training sample are deleted on that side.
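As a sketch of the idea in Python (the function name `hv_split` and the window sizes here are made up for illustration; h should be chosen large enough to cover the dependence horizon of your series):

```python
import numpy as np

def hv_split(n, test_start, test_end, h):
    """Train/test indices for one hv-style fold: the test window is
    [test_start, test_end), and h observations on either side of it
    are deleted from the training set."""
    idx = np.arange(n)
    test = idx[test_start:test_end]
    keep = (idx < test_start - h) | (idx >= test_end + h)
    return idx[keep], test

# 6 years of monthly data (72 points); test on year 4 (months 36-47),
# deleting h = 3 months on each side of the test window.
train, test = hv_split(72, 36, 48, h=3)
print(len(train), len(test))  # 54 training points, 12 test points
```

Looping `test_start` over each year's window gives the six folds shown above.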

Since this is a time series data set, the temporal issue is of the utmost importance: you really cannot use future data to predict past events.


You're correct in thinking that one should be careful about how time series data is handled, because every row is an observation that depends in some way on the data preceding it.

Similar problems are tackled in the Predicting Bike Rentals Guided Project and the Stock Market Prediction Guided Project. In the latter project, feature engineering is performed to, for instance, create new columns based on the mean and standard deviation of prices over the previous 100 days. This helps capture each observation's dependence on the data that preceded it.
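In pandas, such lagged rolling features can be built with `rolling` plus a one-step `shift`, so each row only sees prices from before that day. A sketch with synthetic prices (only the 100-day window comes from the project's description; the column names are made up):

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices standing in for the real data set.
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"close": 100 + rng.normal(0, 1, 300).cumsum()},
    index=pd.date_range("2020-01-01", periods=300),
)

# shift(1) ensures each row's features use only *past* prices,
# never the current day's close (which would leak the target).
window = 100
prices["mean_100"] = prices["close"].rolling(window).mean().shift(1)
prices["std_100"] = prices["close"].rolling(window).std().shift(1)

# Drop the warm-up rows that lack a full window of history.
prices = prices.dropna()
```

Without the `shift(1)`, the current day's price would leak into its own features, which is exactly the kind of temporal leakage this thread warns about.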

The train and test sets in the Stock Market Prediction project were split like so: data from 1950-2012 was used to train the model, and data from 2013-2015 was then used as the test set.
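With a `DatetimeIndex`, that kind of cutoff split is a two-liner in pandas (the data below is synthetic; only the cutoff dates come from the project):

```python
import numpy as np
import pandas as pd

# Synthetic daily series covering the project's date range.
dates = pd.date_range("1950-01-01", "2015-12-31", freq="D")
df = pd.DataFrame({"close": np.arange(len(dates), dtype=float)}, index=dates)

# Everything through the end of 2012 trains the model; 2013-2015 is
# held out. Note that .loc slicing on a DatetimeIndex includes both
# endpoints, unlike positional slicing.
train = df.loc[:"2012-12-31"]
test = df.loc["2013-01-01":]

assert train.index.max() < test.index.min()  # no temporal overlap
```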


We have a `solved` feature that allows you to mark something as the "correct" answer, which helps future students with the same question quickly find the solution they're looking for.
Here's an article on how to mark posts as solved; I don't want to do this for you until I know that solution/explanation works.