Customer churn occurs when a subscriber or regular customer cancels their subscription or stops doing business with a company. The churn rate therefore measures how many people stopped being clients of the company in a given time period.
Business photo created by yanalya - br.freepik.com
In business administration, churn is a very important metric of how well the business is doing. If the churn rate is high, then the business is losing a lot of clients and not performing well.
With the evolution of machine learning algorithms and data science, churn prediction has become a very important part of many companies’ strategies. If a company can accurately predict that a customer is about to churn, it can then act to prevent the churn. Keeping a client is usually cheaper than acquiring a new one.
In this article, I present a project in which I worked with a churn prediction dataset from a phone/internet company, available on Kaggle. The main goal was to build a machine learning model capable of accurately predicting whether a customer will churn based on the information available in the dataset. To accomplish that, I went through the following main steps:
- Exploratory data analysis;
- Data preparation;
- Training, tuning, testing, and evaluating machine learning models.
This article will only present the most relevant parts, insights, and results of the project. To see the code and everything else in detail, check the full project on GitHub.
The dataset contains sixteen categorical variables, three continuous variables, and the target variable, Churn. Some charts were plotted to analyze the impact of each variable on the target variable. In the image below, you can see how Churn is affected by each category of the categorical variables.
We can learn a lot from these charts. Here are some insights:
- Customers without dependents are two times more likely to churn.
- Customers that use paperless billing and optical fiber are more likely to churn.
- Customers with no online security or backup, no device protection, and no tech support are two to three times more likely to churn.
- Customers with no internet service are unlikely to churn.
- Customers with month-to-month contracts are almost four times more likely to churn than customers with yearly contracts. Customers with two-year contracts are very unlikely to churn.
- Customers that use electronic checks to pay their bills are more likely to churn.
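Comparisons like the ones above come down to computing the churn rate within each category of a variable. Here is a minimal sketch with pandas, assuming a DataFrame with a `Churn` column holding "Yes"/"No" values and a categorical column such as `Contract`. The helper name `churn_rate_by` is hypothetical, and the small table is made-up illustration data, not rows from the Kaggle dataset.

```python
import pandas as pd

# Made-up illustration data; the column names and values are assumptions,
# not rows from the actual dataset.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "Month-to-month", "One year", "One year",
                 "Two year", "Two year"],
    "Churn": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})


def churn_rate_by(frame, column, target="Churn", positive="Yes"):
    """Return the share of churned customers within each category of `column`."""
    churned = frame[target].eq(positive)  # boolean Series: True where the customer churned
    # The mean of a boolean Series is the fraction of True values per group.
    return churned.groupby(frame[column]).mean().sort_values(ascending=False)


rates = churn_rate_by(df, "Contract")
print(rates)
```

Dividing one category's rate by another's gives the "times more likely" figures quoted in the list.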
To perform a similar analysis with the continuous variables, the following scatter plots were plotted:
We can see there is a significant correlation between the tenure and Churn columns: the higher the tenure, the lower the chance that the customer will churn. The tenure variable refers to the number of time periods each customer has paid for the company’s service. Unfortunately, there’s not much we can say about the MonthlyCharges and TotalCharges variables.
Finally, it is important to check how unbalanced the dataset is to decide whether or not to balance it. An unbalanced dataset contains a significantly higher number of samples of one of the two classes, which can bias machine learning models toward the majority class.
As you can see in the image below, the dataset is not properly balanced, but it is not highly unbalanced either. Therefore, machine learning models were trained using both unbalanced and balanced data to see which yields the best results. The balancing was done using the random under-sampling technique.
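Random under-sampling randomly discards samples from the larger class until both classes have the same size. The sketch below shows the idea in plain Python. The function name `random_undersample` and the toy data are hypothetical, and the project itself may well have used a library for this step.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop samples from larger classes until every class
    has as many samples as the smallest one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Every class is cut down to the size of the smallest class.
    target_size = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target_size))
    kept.sort()  # preserve the original row order

    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up): 7 non-churners and 3 churners.
labels = ["No"] * 7 + ["Yes"] * 3
rows = [{"customer": i} for i in range(len(labels))]

balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))
```

Under-sampling is normally applied to the training data only, so that the test set keeps the real class proportions. The imbalanced-learn package offers the same technique as `RandomUnderSampler`.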
Models and Metrics
Four different machine learning algorithms were used, including SVM, Logistic Regression, and XGBoost.
Each algorithm was trained using both balanced and unbalanced data, making it possible to see which algorithm-data combination yields the best results.
The most important metric used is Recall. This metric is the number of positives correctly identified by the model divided by the total number of positive labels in the dataset. In this case, Recall is the number of churns identified correctly divided by the total number of churns.
Precision was also kept in sight as a secondary metric. Precision indicates the proportion of positives yielded by the models that are actually true positives.
For the problem this project was dealing with, Recall is more important because it’s preferable to have a model that does not miss any churns but sometimes classifies non-churns as churns than a model that never classifies non-churns as churns but misses a lot of churns. In other words, we prefer to be wrong about a non-churning customer than about a churning customer.
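To make the two metrics concrete, here is a small sketch that computes them from true and predicted labels, with 1 meaning "churn". The function name `recall_and_precision` and the label lists are made up for illustration.

```python
def recall_and_precision(y_true, y_pred, positive=1):
    """Compute recall and precision for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # churns caught
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # churns missed
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up labels: 4 real churners, of which the model catches 3,
# plus 2 non-churners wrongly flagged as churners.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

recall, precision = recall_and_precision(y_true, y_pred)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.75 precision=0.60
```

In this toy case the model finds three of the four churners (Recall 0.75) at the cost of two false alarms (Precision 0.60), which is the kind of trade-off the project accepts.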
In the image below you can see the results of each model for these two metrics.
Models that used balanced data yielded better results. So the balanced data was used in the rest of the project.
SVM, Logistic Regression, and XGBoost provided similar results in both metrics. Moving to the next step, these algorithms’ hyperparameters were tuned using grid search.
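As a rough illustration of this tuning step, the sketch below runs a grid search over an SVM with scikit-learn's `GridSearchCV`, using Recall as the selection metric. The synthetic data, the parameter grid, and the number of folds are all assumptions made for the example. They are not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the prepared churn features (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative grid; the values searched in the project are not stated.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# scoring="recall" makes the search pick the combination with the
# best cross-validated Recall on the positive class.
search = GridSearchCV(SVC(), param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refit on the whole training set with the best parameters, ready to be evaluated on the held-out test set.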
There was a great improvement in Recall after the SVM model was tuned. The Logistic Regression model showed a slight improvement after tuning, though not as large as that of the SVM model. There was a considerable improvement in the XGBoost model as well. The final Recall results for these three models are:
- SVM — 0.94
- Logistic Regression — 0.83
- XGBoost — 0.88
The test set was then used to evaluate the two best models to see if there’s any significant difference from the results yielded by the models during training. Both models performed as well on the test set as on the training set. SVM still shows the better Recall, so it would be the model chosen in a real-life situation.
In this article, we saw the results and insights produced by the step-by-step process of creating a machine learning model for churn prediction, and gained some intuition on how this is done in a real-life company and how important it can be.
If you have any questions, suggestions, or feedback, feel free to get in touch!