
Train/test split to predict customer churn

I’m working on a model to predict churn. I understand the concept of training and testing, but I’ve never had to address this in a real-life situation.

Assume I have a dataset for a subscription-based business. I have 5K churned customers and 15K active customers. The general approach all ML courses show is to split the data 80/20, train, and test. We predict a target and compare it with the actual column, which makes sense.

But in my case, if I want to predict how many of these 15K active customers are likely to churn, how would I break my data down into train, test, and predict sets? Prediction has to be on the 15K active users, since those are the people we want to know will churn, so should I train only on the churned customers, or something else?

I’m a bit confused.

You shouldn’t train on the data that you want to test/predict on; that’s a mistake right there.

Now, to address the problem you have of trying to figure out which of these 15K customers are likely to churn, you need other data on which to train.

You can take the 5K churned customers and use those. “But these all churned. How does this help?”, you ask.

Before they churned, they were active users. Notice the reference to time. You need data on when they were active, and when they churned.

Then, to simplify, you can create a new column, churned_in_6_months, whose name describes its own data, I hope.

Then you can train/test on this dataset to predict churned_in_6_months (which will have both churned and non-churned customers, probably, depending on the industry, I suppose), and predict on the 15K active users.
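To make the labeling step concrete, here is a minimal sketch assuming a hypothetical history table for the churned customers with active_since and churn_date columns (the column names and data are illustrative, not from the original dataset):

```python
from datetime import date
import pandas as pd

# Hypothetical records for churned customers; names and values are illustrative.
churned = pd.DataFrame({
    "customer_id":  [1, 2, 3],
    "active_since": [date(2020, 1, 1), date(2020, 3, 1), date(2020, 6, 1)],
    "churn_date":   [date(2020, 4, 1), date(2021, 3, 1), date(2020, 8, 1)],
})

# Label: did this customer churn within ~6 months (182 days) of becoming active?
tenure_days = (pd.to_datetime(churned["churn_date"])
               - pd.to_datetime(churned["active_since"])).dt.days
churned["churned_in_6_months"] = (tenure_days <= 182).astype(int)
```

Note that even though everyone in this table eventually churned, the label still has both classes: customers who lasted longer than 6 months get a 0.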


Thanks for the reply. I’m not sure I understood you correctly; are you suggesting I create the training data only from the churned customers (5K) and test on the active (15K) customers?

Also, your point about the industry is quite right. I was asking about a subscription business with 1-year contracts.

No. You can’t test on the 15K active users because you don’t know what will happen. For you to “test” in this sense, you need to know what happened.

What you will do is predict on these 15K users once you have your model working.

Say you have the following columns for each subscription (I’m using “subscription” instead of “user” because each user can subscribe multiple times and one way to handle this is to have a row for each subscription instead of having a row for each user):

  • start_date
  • end_date

(For the 15K active users’ subscriptions, the end_date will be missing.)

As a starting point, you can look into which customers unsubscribed within the first year, so you create a new column, churned_1y, with a rule like:

if end_date - start_date > 1 year:
    churned_1y = 0
else:  # end_date - start_date <= 1 year
    churned_1y = 1
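A vectorized version of that rule, sketched with pandas on a hypothetical subscriptions table (approximating "1 year" as 365 days):

```python
import pandas as pd

# Illustrative subscriptions table; one row per subscription, as described above.
subs = pd.DataFrame({
    "start_date": pd.to_datetime(["2019-01-01", "2020-02-01"]),
    "end_date":   pd.to_datetime(["2020-06-01", "2020-07-01"]),
})

# churned_1y = 1 when the subscription ended within its first year.
one_year = pd.Timedelta(days=365)
subs["churned_1y"] = ((subs["end_date"] - subs["start_date"]) <= one_year).astype(int)
```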

Now you can create your model to train and test predictions on this new column.

Once you’re happy with your results, you apply them to the 15K active users.

Actually, what I’m asking is how to split the data into train and test sets, rather than which target/column variable to use.

What you will do is predict on these 15K users once you have your model working.

This is exactly what I’m trying to ask. How do we decide where to split the data into train and test? What should my train data be? Assuming a contract length of 1 year, I think splitting the data so that years 0 to t−1 are train and the last year is test seems to make more sense, but almost all tutorials split it directly 80/20, which I cannot figure out.

You can’t use any of the 15K users for this. Your train and test data must all come from the 5K churned users.

You can do a regular 80/20 randomized split.
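A minimal sketch of that split with scikit-learn, assuming you have already built a feature matrix and the churned_1y labels from the 5K churned customers (the random features here are stand-ins, not real churn data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for the features and labels of the 5K churned customers.
X = rng.normal(size=(5000, 4))
y = rng.integers(0, 2, size=5000)  # churned_1y labels (illustrative)

# Regular randomized 80/20 split; stratify keeps the class balance
# roughly the same in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Once you are happy with the test-set performance, you call model.predict (or predict_proba, for ranked risk scores) on the feature rows of the 15K active users.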