Chi-Square Analysis of Churn Rate & Exploratory Data Analysis in R

Research Question

Churn rate is the percentage of customers who leave a company within a given time frame. Managing churn is vital to a corporation's bottom line because, on average, acquiring a new customer costs approximately ten times as much as retaining an existing one. The dataset for this analysis (‘d207pa.csv’) contains anonymized customer demographics, purchasing behavior, tenure length, and churn status for 10,000 clients of a telecommunications business.

To identify potential targets for churn reduction, I asked: “Is there a significant difference in the churn rates for clients who are provided their internet service through DSL and those who are provided their internet service through fiber optic?” This analysis provides evidence that there is a statistically significant difference between the churn rates of DSL and fiber optic customers.

Potential benefit of this analysis:

This organization’s stakeholders would benefit from this analysis because it may show whether the type of internet service a client receives is associated with churn. If the churn rates of the two service types differ significantly, further analysis could identify the factors behind the discrepancy and, in turn, help lower the overall churn rate among the clientele.

Relevant Variables:

The variables relevant to my question are “InternetService” (the type of internet service provided) and “Churn” (whether the customer has left the company).

Process for creating a contingency table in R:

My research question involves categorical data, so I wrote code to use the chi-square technique for my analysis of this data.

First, I determined that my data did not contain any missing values by using the is.na function in R. Then, I decided to remove any observations that contained the category of “None” from my data set by creating a new data frame that did not include those observations. After creating my new data frame, I made a contingency table with the internet service and churn variables using the “table” function in R.

The code I wrote to complete the above calculations is:

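# Load the dataset
d207data <- read.csv("C:/Users/Ruth Wright/OneDrive/Desktop/d207pa.csv")

# Check both variables for missing values (TRUE marks a missing entry)
is.na(d207data$InternetService)
is.na(d207data$Churn)

# Copy into a data frame (read.csv already returns one)
d207datadf <- data.frame(d207data)

# Drop the customers whose internet service is "None"
testd207data <- d207datadf[!(d207datadf$InternetService=="None"),]
testd207data

# Contingency table of internet service type by churn status
table(testd207data$InternetService, testd207data$Churn)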
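The same steps can be written more compactly. The following is only a sketch: it uses a small made-up data frame (`toy`) in place of d207pa.csv, and the names `toy`, `na_counts`, `toy_subset`, and `toy_table` are hypothetical. It counts missing values per column rather than printing one logical value per row.

```r
# Toy stand-in for d207pa.csv; the values are invented for illustration
toy <- data.frame(
  InternetService = c("DSL", "Fiber Optic", "None", "DSL", "Fiber Optic", "None"),
  Churn           = c("Yes", "No", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column of interest
na_counts <- colSums(is.na(toy[, c("InternetService", "Churn")]))
na_counts

# Keep only the DSL and fiber optic customers
toy_subset <- toy[toy$InternetService != "None", ]

# Cross-tabulate service type (rows) against churn status (columns)
toy_table <- table(toy_subset$InternetService, toy_subset$Churn)
toy_table
```

Setting `stringsAsFactors = FALSE` keeps the columns as character vectors, so the filtered table has no empty "None" row.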

The output from that code:

Contingency Table:

Next, I performed the chi-square analysis with the contingency table by using the chisq.test function in R. Here is the code and its output:
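For readers unfamiliar with chisq.test, a call on a 2x2 table generally looks like the sketch below. The counts are invented for illustration and are not the values from this analysis; the object names `counts` and `result` are hypothetical.

```r
# Hypothetical 2x2 contingency table; counts are made up, not from d207pa.csv
counts <- matrix(c(30, 70,
                   20, 80),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(InternetService = c("DSL", "Fiber Optic"),
                                 Churn = c("Yes", "No")))

# Pearson chi-square test of independence.
# For 2x2 tables, chisq.test applies Yates' continuity correction by default.
result <- chisq.test(counts)

result$statistic  # chi-square test statistic
result$parameter  # degrees of freedom
result$p.value    # p-value
result$expected   # expected counts if service type and churn were independent
```

Inspecting `result$expected` is a quick way to confirm that the expected counts are large enough for the chi-square approximation to be reasonable.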

The justification for using the chi-square analysis technique is that the two variables needed to answer my research question, “InternetService” and “Churn”, are both categorical. A t-test or ANOVA requires a continuous outcome variable. Therefore, the chi-square method is the suitable way to answer my question about whether the type of internet service is associated with a higher churn rate.

Univariate Visualizations:

Categorical Variable Distributions

This bar plot is a visualization of the categorical data from the “Techie” variable. It represents the number of clients who self-identified in a survey as being technologically proficient, also known as a “techie.” The visualization shows that more than eighty percent of the 10,000 people surveyed do not consider themselves knowledgeable about current technology.
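A bar plot of this kind can be produced from a frequency table. The sketch below uses ten made-up responses rather than the real “Techie” column, and the names `techie`, `techie_counts`, and `techie_share` are hypothetical.

```r
# Made-up responses standing in for the "Techie" column
techie <- c("No", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No")

techie_counts <- table(techie)               # number of clients per category
techie_share  <- prop.table(techie_counts)   # proportion of clients per category

barplot(techie_counts,
        main = "Techie",
        xlab = "Self-identified techie",
        ylab = "Number of clients")

techie_share
```

Printing the proportions alongside the plot makes a statement such as "more than eighty percent" checkable against the data.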

This bar plot illustrates the categorical data from the “Marital” variable. The five marital categories are: Divorced, Married, Never Married, Separated, and Widowed. The clients are distributed fairly uniformly across the categories, although the number of people in the “Divorced” category is slightly higher than in the other four.

Continuous Variable Distributions

This is a histogram of the continuous data from the “Age” variable. The chart shows the distribution of clients across age groups in five-year increments. The lowest group contains people under the age of 20, and the highest group contains people over the age of 85. The distribution is mostly uniform across the age groups, with the exceptions of the lowest and highest groups, which would be expected within this context.
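A histogram with five-year bins can be sketched as follows. The ages are randomly generated for illustration, and the bin edges (15 to 90) are an assumption chosen to match the description above, not taken from the original plot.

```r
set.seed(1)
# Made-up ages; the real values come from the "Age" column
age <- sample(18:89, size = 200, replace = TRUE)

# Five-year bins: [15,20), [20,25), ..., [85,90]
age_breaks <- seq(15, 90, by = 5)

h <- hist(age,
          breaks = age_breaks,
          right  = FALSE,
          main   = "Age",
          xlab   = "Age (years)",
          ylab   = "Number of clients")

h$counts  # number of clients in each five-year bin
```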

This is a boxplot of the numeric data from the “Children” variable. This chart depicts the number of children under eighteen who live in each client’s household. The boxplot shows that the median number of children per household is 1, while some households have up to 10 children. The interquartile range spans 0 to 3 children, and the outliers are 8, 9, and 10 children per household.
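The center line of a boxplot marks the median, and the box edges mark the lower and upper hinges (approximately the quartiles). The sketch below, using made-up counts rather than the real “Children” column, shows how those values and the flagged outliers can be read from the object that boxplot returns.

```r
# Made-up numbers of children per household
children <- c(0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 8, 9, 10)

b <- boxplot(children,
             main = "Children",
             ylab = "Children per household")

# b$stats rows: lower whisker, lower hinge, median, upper hinge, upper whisker
b$stats

# Points beyond 1.5 times the box length from the hinges are flagged as outliers
b$out
```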

Bi-Variate Visualizations:

This stacked bar chart shows the churn rate for each type of internet service: DSL, Fiber Optic, or None. (Both variables are categorical.) The blue portion of each bar represents the customers who have left the company within the past month. Clients with DSL service have a higher churn rate than the other two categories: roughly one third of the DSL bar is blue, compared with about one quarter of each of the other two bars.
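A stacked bar chart of two categorical variables can be drawn directly from a two-way table. The data frame below is made up, and the colors and object names are hypothetical choices for illustration.

```r
# Toy data; values are invented for illustration
toy <- data.frame(
  InternetService = c("DSL", "DSL", "DSL",
                      "Fiber Optic", "Fiber Optic", "Fiber Optic", "Fiber Optic",
                      "None", "None", "None"),
  Churn = c("Yes", "No", "No",
            "Yes", "No", "No", "No",
            "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Rows become the stacked segments (churn status), columns become the bars
stack_counts <- table(toy$Churn, toy$InternetService)

mids <- barplot(stack_counts,
                col = c("grey80", "steelblue"),  # "No" in grey, "Yes" in blue
                legend.text = TRUE,
                main = "Churn by internet service",
                xlab = "Internet service",
                ylab = "Number of clients")
```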

This scatterplot is a visualization of the relationship between the length of time that the client has been buying services from the company and the client’s annual income. (Both of these variables contain continuous data.) The income levels range from $0 to a little over $225,000, and the tenure lengths range from 0 months to over 70 months. The points show no linear pattern, so the scatterplot gives no indication of a strong relationship between the client’s annual income and tenure length.
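A visual judgment of this kind can be complemented by a correlation coefficient, which measures the strength of a linear association. The sketch below uses randomly generated tenure and income values, not the real columns; the ranges are assumptions loosely based on the description above.

```r
set.seed(42)
# Made-up values standing in for the tenure and income columns
tenure <- runif(100, min = 0, max = 72)       # months
income <- runif(100, min = 0, max = 225000)   # dollars per year

plot(tenure, income,
     main = "Income vs. tenure",
     xlab = "Tenure (months)",
     ylab = "Annual income ($)")

# Pearson correlation: values near 0 suggest little linear association
r <- cor(tenure, income)
r
```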

Results: There IS a statistically significant difference in the churn rates.

Upon examination of the contingency table, I determined that the churn rate within the DSL clientele is around 32 percent, which is much higher than the churn rate within the fiber optic clientele, which is around 24 percent.
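Churn rates per service type are the row proportions of the contingency table. A minimal sketch with invented counts (not the counts from this analysis) and hypothetical object names:

```r
# Hypothetical contingency table; counts are made up for illustration
counts <- matrix(c(30, 70,
                   20, 80),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("DSL", "Fiber Optic"), c("Yes", "No")))

# margin = 1 divides each cell by its row total
rates <- prop.table(counts, margin = 1)

# Churn rate (percent) for each service type
round(100 * rates[, "Yes"], 1)
```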

The null hypothesis for this test is that there is no difference between the churn rates of customers who use DSL and customers who use Fiber Optic for their internet service. After running the chi-square test, I found that the p-value is less than 2.2e-16, which is far smaller than the widely accepted alpha level of 0.05. We therefore reject the null hypothesis in favor of the alternative hypothesis that the churn rate differs between customers of the two types of internet service.
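The decision rule amounts to a single comparison. In the sketch below the p-value is a hypothetical placeholder, not the result of this analysis. R's printed test summaries report any p-value below machine epsilon as "< 2.2e-16", which is why that figure appears so often in R output.

```r
alpha   <- 0.05    # conventional significance level
p_value <- 1e-20   # hypothetical p-value, for illustration only

# Machine epsilon for double-precision numbers (about 2.2e-16)
.Machine$double.eps

# Reject the null hypothesis when the p-value falls below alpha
reject_null <- p_value < alpha
reject_null
```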

Limitations of analysis:

Although the chi-square test indicates that a statistically significant difference exists between the churn rates of the two internet service types, the analysis cannot determine the cause of that difference. Correlation does not necessarily indicate causation. For instance, the test gives no basis for concluding that a client who switches to the other type of internet service becomes more or less likely to end their contract with this company. The analysis is limited to establishing that a difference exists.

Recommendations:

I recommend further research to determine whether customers who use the DSL internet service are as satisfied with their customer service as the Fiber Optic clients. This research could be carried out through surveys or questionnaires sent to both groups. The results could then be compared and analyzed to determine whether this company needs to invest more of its resources in customer service for the DSL clientele.
