Data visualization using R | Customer Churn | Exploratory Data Analysis

thejasmine
6 min read · Dec 5, 2020


Photo by Chris Liverani on Unsplash

Data visualization is a powerful way to uncover patterns and turn messy data into useful insights for decision making. There are many tools for visualization, such as Tableau for creating interactive dashboards and Python’s seaborn package. Recently, I have been learning how to plot data using R, and I would like to share how to create charts with R’s ggplot2. I am going to break the post down into sections by chart type.

The code is available in this GitHub Repository.

Check out more of the data visualizations I created on Tableau Public.

Table of contents

  1. Introduction
  2. Understand the data and data preprocessing
  3. Comparison chart: Pie chart, Bar chart, Line chart
  4. Distribution chart: Scatter plot, Box plot, Histogram
  5. Conclusion

1. Introduction

What is ggplot2?

ggplot2 is a data visualization package for R. It can create many kinds of charts, such as line charts for comparing data and scatter plots for understanding the relationship between variables.

Here is the documentation for ggplot2.

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
Source: https://ggplot2.tidyverse.org/index.html
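To make the quote concrete, here is a minimal sketch of those three ingredients (data, aesthetic mapping, graphical primitive). It uses the `mpg` dataset that ships with ggplot2 rather than the churn data, and the titles and labels are only illustrative.

```r
library(ggplot2)

# mpg is bundled with ggplot2, so this runs without any external file
p <- ggplot(data = mpg,                  # 1. provide the data
            aes(x = displ, y = hwy)) +   # 2. map variables to aesthetics
  geom_point()                           # 3. choose the graphical primitive

# Further layers (labels, themes, more geoms) are added with `+`
p <- p + labs(title = "Engine size vs highway mileage",
              x = "Engine displacement (litres)",
              y = "Highway miles per gallon")

# Typing p at the console, or calling print(p), draws the chart
```

Every chart in the rest of this post follows the same pattern; only the data, the mapping, and the geom change.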

The cheat sheets provided on the website are super nice and highly recommended. :)

2. Understand the data and data preprocessing

The first thing we need to do is install the ggplot2 package in RStudio. We only have to install the package once, but we need to load it in every session where we use it.

#you only need to install the package if you don't have ggplot2 installed
install.packages('ggplot2', dependencies = TRUE)
#call library
library(ggplot2)

I used a telecommunication company’s customer churn dataset from Kaggle. It includes customer demographic data such as age range and gender, account information such as the plans customers selected, and a target variable indicating whether the customer left the company within the last month. The dataset contains 7,043 customer rows and 21 variables. The Churn column is the target variable for the analysis.

View(telco.df)
some raw data in the dataset
str(telco.df)

Data Cleaning Steps

  1. Check whether there is any null data in the dataset.
  2. Convert and relevel the senior citizen variable. I converted the column from integer to factor, then changed 0 and 1 to “No” and “Yes”.
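The two cleaning steps above can be sketched as follows. This is a minimal illustration, not the exact code from the repository: `telco.demo` is a tiny hypothetical stand-in for `telco.df`, and the column names `SeniorCitizen`, `TotalCharges`, and `Churn` are assumed to match the Kaggle file.

```r
# Tiny hypothetical stand-in for telco.df so the sketch runs on its own.
# With the real data you would load the Kaggle CSV instead, for example:
# telco.df <- read.csv("path/to/telco-churn.csv")  # hypothetical path
telco.demo <- data.frame(
  SeniorCitizen = c(0L, 1L, 0L, 0L, 1L),
  TotalCharges  = c(29.85, 1889.50, NA, 1840.75, 151.65),
  Churn         = c("No", "No", "Yes", "No", "Yes")
)

# Step 1: count missing values per column
na.counts <- colSums(is.na(telco.demo))
print(na.counts)

# Step 2: convert the 0/1 integer column to a factor labelled "No"/"Yes"
telco.demo$SeniorCitizen <- factor(telco.demo$SeniorCitizen,
                                   levels = c(0, 1),
                                   labels = c("No", "Yes"))

str(telco.demo$SeniorCitizen)
```

Setting `levels` and `labels` in a single `factor()` call keeps the mapping explicit, so 0 always becomes “No” regardless of the order in which values appear in the data.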

Let’s start to build some charts to understand more about customers!

3. Comparison chart

Pie chart

Let’s look at the share of churned customers with a simple pie chart!

It is clear that about 27% of customers churned last month. We can now break the churn rate down by different customer demographic variables.
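ggplot2 has no dedicated pie geom, so one common way to draw a pie chart is a single stacked bar wrapped into polar coordinates. The sketch below assumes that approach; it may differ from the code in the repository, and `churn.demo` is a hypothetical sample built to mirror the 27% figure.

```r
library(ggplot2)

# Hypothetical stand-in for telco.df: 27 churned customers out of 100
churn.demo <- data.frame(Churn = rep(c("Yes", "No"), times = c(27, 73)))

# Count customers per churn status and convert the counts to percentages
churn.share <- as.data.frame(table(Churn = churn.demo$Churn))
churn.share$Percent <- round(100 * churn.share$Freq / sum(churn.share$Freq), 1)

# One stacked bar, bent into a circle by coord_polar()
pie.chart <- ggplot(churn.share, aes(x = "", y = Freq, fill = Churn)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(Percent, "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Customer churn", x = NULL, y = NULL) +
  theme_void()
```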

Bar chart

#code snippet for bar chart
ggplot(telco.df, aes(x=variable, fill=Churn)) + geom_bar() + labs(title="your title", x="x axis", y="y axis")

For the first part, I would like to know more about the demographic data. Let’s look at age range, gender, partner, and dependents using bar charts. A bar chart is a good way to compare counts or values.

Customer demographic

For gender, the split between male and female customers is about fifty-fifty, and the churn rate is almost the same. Moving on to the senior citizen chart, we can see that most of the customers in the dataset are younger people. Lastly, about 50% of customers have a partner, and their churn rate is lower than that of customers who don’t have a partner.
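As a sketch of how the generic bar chart snippet applies to one demographic variable, the example below plots churn by partner status. The data frame `demo.df` is hypothetical and its 20% and 40% churn shares are made up for illustration; `position = "fill"` is an optional variant that makes churn rates easier to compare than raw counts.

```r
library(ggplot2)

# Hypothetical sample; with the real data, use telco.df instead
demo.df <- data.frame(
  Partner = rep(c("Yes", "No"), each = 10),
  Churn   = c(rep(c("Yes", "No"), times = c(2, 8)),   # with partner: 20% churn
              rep(c("Yes", "No"), times = c(4, 6)))   # without partner: 40% churn
)

# Stacked counts: how many customers fall in each group
count.bar <- ggplot(demo.df, aes(x = Partner, fill = Churn)) +
  geom_bar() +
  labs(title = "Customers by partner status",
       x = "Has partner", y = "Customers")

# position = "fill" rescales each bar to 1, so bars show churn share
rate.bar <- ggplot(demo.df, aes(x = Partner, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Churn share by partner status",
       x = "Has partner", y = "Share of customers")

# The same shares as a table; each row sums to 1
churn.rate <- prop.table(table(demo.df$Partner, demo.df$Churn), margin = 1)
print(churn.rate)
```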

Different internet service

Online security, online backup, device protection, tech support, streaming TV, and streaming movies are add-on services used by customers with internet service. The churn rate for customers who use these add-on services is lower than for those who don’t. For example, the churn rate for customers who have used tech support is much lower.

Line chart

#code snippet for line chart
ggplot(data = telco.df) + geom_line(aes(x=tenure, y=variable, color=Churn))
#frequency polygon of tenure; on ggplot2 3.4.0 or later, use linewidth=2 instead of size=2
ggplot(telco.df, aes(x=tenure, color=Churn)) + geom_freqpoly(size=2)

Customer tenure

A line chart displays the trend of a value over time. It also lets us compare the trends of several items in one graph. In this analysis, tenure is the x-axis. Let’s plot some charts to understand the relationship between contract type and customer tenure.

Most of the monthly contract customers who churn leave the company within the first few months. After that, the number of churned customers decreases. Customers on longer contracts are more loyal to the company.
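One way to compare contract types side by side is to facet the frequency polygon by contract. This is a sketch under assumptions: the column name `Contract` and its levels are assumed to match the Kaggle file, and `tenure.demo` is randomly generated sample data, not the real dataset.

```r
library(ggplot2)

# Hypothetical sample: short tenures for monthly contracts, longer for two-year
set.seed(1)
tenure.demo <- data.frame(
  tenure   = c(rpois(60, 5), rpois(60, 40)),
  Contract = rep(c("Month-to-month", "Two year"), each = 60),
  Churn    = sample(c("Yes", "No"), 120, replace = TRUE)
)

# One panel per contract type, one line per churn status
contract.lines <- ggplot(tenure.demo, aes(x = tenure, color = Churn)) +
  geom_freqpoly(binwidth = 6) +
  facet_wrap(~ Contract) +
  labs(title = "Customer tenure by contract type",
       x = "Tenure (months)", y = "Customers")
```

`binwidth = 6` groups tenure into half-year bins; a smaller value shows more detail but a noisier line.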

Next, the chart shows the average monthly charge by tenure. Overall, churned customers’ average monthly charge is higher.
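A chart like this needs one average per tenure value and churn status before plotting. The sketch below computes those averages with base R’s `aggregate()` and then draws them with `geom_line()`. The data in `charge.demo` is hypothetical, and the repository may compute the averages differently.

```r
library(ggplot2)

# Hypothetical sample with two tenure values
charge.demo <- data.frame(
  tenure         = c(1, 1, 1, 1, 2, 2, 2, 2),
  MonthlyCharges = c(70, 90, 40, 60, 80, 100, 50, 50),
  Churn          = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No")
)

# Average monthly charge for each tenure / churn combination
avg.charge <- aggregate(MonthlyCharges ~ tenure + Churn,
                        data = charge.demo, FUN = mean)

# One line per churn status
avg.line <- ggplot(avg.charge,
                   aes(x = tenure, y = MonthlyCharges, color = Churn)) +
  geom_line() +
  geom_point() +
  labs(title = "Average monthly charge by tenure",
       x = "Tenure (months)", y = "Average monthly charge")
```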

4. Distribution Chart

Scatter plot

A scatter plot displays the relationship between two numerical variables. In the dataset, monthly charges and total charges are numeric variables. We can use a scatter plot to learn more about their relationship.

#code snippet for scatter plot
ggplot(data=telco.df) + geom_point(aes(x=MonthlyCharges, y=TotalCharges))

Monthly charge vs Total charge

We can see that as the monthly charge goes up, the total charge increases.
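To back the visual impression with a number, a correlation coefficient can be computed alongside the plot. This is an optional addition rather than part of the original analysis; `scatter.demo` is hypothetical, and the `NA` is included only to show how missing values can be skipped.

```r
library(ggplot2)

# Hypothetical sample with one missing total charge
scatter.demo <- data.frame(
  MonthlyCharges = c(20, 35, 50, 65, 80, 95),
  TotalCharges   = c(200, 420, NA, 1950, 3200, 4750),
  Churn          = c("No", "Yes", "No", "No", "Yes", "No")
)

# Pearson correlation, using only rows where both values are present
charge.cor <- cor(scatter.demo$MonthlyCharges, scatter.demo$TotalCharges,
                  use = "complete.obs")
print(charge.cor)

# Same scatter plot, colored by churn; alpha helps when many points overlap
scatter <- ggplot(scatter.demo,
                  aes(x = MonthlyCharges, y = TotalCharges, color = Churn)) +
  geom_point(alpha = 0.6, na.rm = TRUE) +
  labs(title = "Monthly charge vs total charge",
       x = "Monthly charges", y = "Total charges")
```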

Box plot

A box plot displays the distribution of a variable. The line inside the box marks the median, and the length of the box (the interquartile range, or IQR) shows how widely the middle 50% of the data is spread around it.

#code snippet for box plot
ggplot(data=telco.df) + geom_boxplot(aes(x=Churn, y=TotalCharges, fill=Churn))

Customer churn vs tenure

Customers’ tenure ranges from 1 month to about 70 months. It is easy to see that customers who don’t churn have a longer tenure than churned customers.

Customer churn vs Total charges

Churned customers have a lower median and a narrower IQR. This is reasonable: most churned customers did not stay with the company for long, so their median total charge is lower. It’s interesting to see that there are some outliers among churned customers, whose total charges are much higher than those of most customers. The company may want to look into what caused those charges.
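The numbers behind a box plot can also be checked directly in base R. The sketch below computes the median and IQR per churn group and lists the points that fall beyond 1.5 times the box length, the same rule of thumb `geom_boxplot()` uses by default for drawing outliers. The values in `box.demo` are hypothetical.

```r
# Hypothetical sample: churned customers have low charges plus one extreme value
box.demo <- data.frame(
  Churn        = rep(c("No", "Yes"), each = 7),
  TotalCharges = c(200, 800, 1500, 2300, 3100, 4200, 5600,
                   50, 90, 150, 300, 450, 700, 6000)
)

# Median and interquartile range for each churn group
median.by.churn <- tapply(box.demo$TotalCharges, box.demo$Churn, median)
iqr.by.churn    <- tapply(box.demo$TotalCharges, box.demo$Churn, IQR)
print(median.by.churn)
print(iqr.by.churn)

# Values more than 1.5 box lengths away from the box are flagged as outliers
churned <- box.demo$TotalCharges[box.demo$Churn == "Yes"]
churned.outliers <- boxplot.stats(churned)$out
print(churned.outliers)
```

Listing the outliers this way makes it easy to pull the matching customer rows for a closer look.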

Histogram

A histogram looks similar to a bar chart, but it groups numeric data into bins. The height of each bar shows the number of observations in that bin.

Here, I would like to use histograms to see the monthly price distribution for different services.

#code snippet for histogram
ggplot(subset(telco.df, PhoneService %in% c("Yes") & InternetService %in% c("DSL", "Fiber optic")), aes(x = MonthlyCharges, fill = Churn)) + geom_histogram()

For customers who have both internet service and phone service, monthly charges range from 40 to 90. Customers on a phone-only plan seem to pay around 20 to 25 per month. The price range for internet service is wider than for phone-only service.
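The phone-only comparison can be sketched by changing the subset condition. This assumes that customers without internet service are coded as `"No"` in `InternetService`; `hist.demo` is a hypothetical sample, and `binwidth = 1` is an illustrative choice for such a narrow price range.

```r
library(ggplot2)

# Hypothetical sample covering phone-only and phone-plus-internet customers
hist.demo <- data.frame(
  PhoneService    = "Yes",
  InternetService = c("No", "No", "No", "DSL", "DSL",
                      "Fiber optic", "Fiber optic", "Fiber optic"),
  MonthlyCharges  = c(19.5, 20.2, 24.8, 45.0, 59.9, 70.4, 85.1, 89.9),
  Churn           = c("No", "No", "Yes", "No", "No", "Yes", "Yes", "No")
)

# Phone-only customers: phone service but no internet service
phone.only <- subset(hist.demo, PhoneService == "Yes" & InternetService == "No")

# Setting binwidth explicitly avoids relying on the default of 30 bins
phone.hist <- ggplot(phone.only, aes(x = MonthlyCharges, fill = Churn)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Monthly charges, phone-only customers",
       x = "Monthly charges", y = "Customers")

# Price range for the phone-only group
print(range(phone.only$MonthlyCharges))
```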

5. Conclusion

I have used different types of charts to learn more about the dataset. Here are some key takeaways.

  • About 27% of customers churned last month.
  • Churned customers’ tenure is shorter than that of customers who don’t churn.
  • Most of the customers in the dataset are younger people.
  • The churn rate for customers who use the add-on internet services is lower than for those who don’t.
  • Customers on the monthly contract have the highest churn rate.
  • More…

Data visualization is a good starting point for understanding data. Charts make it easy to see distributions and trends. I hope you enjoy doing data visualization as much as I do.

The code is available in this GitHub Repository.

Check out more of the data visualizations I created on Tableau Public.
