Exploratory Analysis in Python

December 28, 2023 – Western Governors University

Describe a real-world organizational situation or issue.

Is there a statistically significant difference between the monthly charge of customers who churn and the monthly charge of customers who do not churn? An analysis of the churn data set will not only answer this question but can also help the internet service provider explore other insights within the data set. Such analysis stands to benefit the internet service provider by showing how to better retain its customers. This, in turn, will support the company’s financial well-being because retaining customers is much more cost-effective than acquiring new ones.

Only two variables within the data set will be necessary for answering the original question: 1) The quantitative continuous variable, MonthlyCharge, which is the amount charged to the customer for services rendered per month. 2) Churn, the categorical variable which indicates whether a customer discontinued service within the last month.

Additionally, we will explore the data set further by looking at tenure, contract length, and internet service, and at how these variables interact with each other and with churn.

Describe the data analysis.

Welch’s Two Sample T-Test was chosen for conducting the hypothesis test after doing the following analysis:

  1. Verified that MonthlyCharge was approximately normally distributed using histograms, statistical measures of center (mean, median, and mode), and QQ-plots (Hayden, n.d.a).
  2. Isolated and removed outliers using the IQR calculation method.
  3. Checked homogeneity of variance between the two samples using bartlett from SciPy (SciPy, 2023a). Bartlett was chosen specifically because the data was normally distributed.
  4. Because the difference in variance between the two samples was statistically significant, we used Welch’s T-Test, which does not require the variances to be equal (SciPy, 2023b).
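Steps 2 through 4 above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code from the attached notebook: the column names MonthlyCharge and Churn come from the data set description, while the generated values, the 1.5 × IQR fence multiplier, and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the churn data set (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "MonthlyCharge": np.concatenate([
        rng.normal(165, 45, 400),   # hypothetical churned customers
        rng.normal(150, 38, 1100),  # hypothetical retained customers
    ]),
    "Churn": ["Yes"] * 400 + ["No"] * 1100,
})

# Step 2: keep only rows inside the 1.5 * IQR fences (multiplier is an assumption).
q1, q3 = df["MonthlyCharge"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean = df[df["MonthlyCharge"].between(lower, upper)]

churned = clean.loc[clean["Churn"] == "Yes", "MonthlyCharge"]
retained = clean.loc[clean["Churn"] == "No", "MonthlyCharge"]

# Step 3: Bartlett's test for equal variances (it assumes normally distributed samples).
bartlett_stat, bartlett_p = stats.bartlett(churned, retained)

# Step 4: Welch's t-test is selected by passing equal_var=False.
t_stat, t_p = stats.ttest_ind(churned, retained, equal_var=False)

print(f"IQR fences: [{lower:.2f}, {upper:.2f}], rows kept: {len(clean)} of {len(df)}")
print(f"Bartlett: statistic={bartlett_stat:.3f}, p={bartlett_p:.4g}")
print(f"Welch:    statistic={t_stat:.3f}, p={t_p:.4g}")
```

A small Bartlett p-value suggests unequal variances, which is the situation in which Welch’s version of the test is usually preferred over the pooled-variance version.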

Please see attached code file “D207PA – Code.ipynb” for the code used in the analysis.

Output of IQR calculation:

Output of Measures of Center:

Output of Bartlett Check for Homogeneity of Variance:

Output of Welch’s T-Test:

Identify the distribution of two continuous variables and two categorical variables using univariate statistics.

Univariate statistical charts were created to view the distributions of both continuous and categorical variables while exploring the data set.

MonthlyCharge was a continuous variable, so a histogram was used to plot its distribution:

As we can see from the shape of the data, it appears to be approximately normally distributed: the data is shaped like a bell curve, with values most frequent in the center (approximately $150 monthly charge) and frequencies falling off on either side. This means that as the monthly charge increases or decreases from $150, the frequency decreases. There is also a slight right skew, which means the tail of the distribution extends further toward higher monthly charges than toward lower ones.

Univariate Statistics were calculated to further identify the MonthlyCharge distribution:
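As a rough illustration of the kind of summary involved, the sketch below computes measures of center, spread, and skewness with pandas. The series is synthetic and the numbers it prints are not the author’s results; rounding to whole dollars before taking the mode is an assumption made so that a continuous variable has a meaningful mode.

```python
import numpy as np
import pandas as pd

# Synthetic monthly charges (made-up values, for illustration only).
rng = np.random.default_rng(1)
monthly_charge = pd.Series(rng.normal(150, 40, 1000).round(2), name="MonthlyCharge")

summary = {
    "mean": monthly_charge.mean(),
    "median": monthly_charge.median(),
    # Mode of the charges rounded to whole dollars (smallest value if there is a tie).
    "mode": monthly_charge.round(0).mode().iloc[0],
    "std": monthly_charge.std(),
    # Positive skewness indicates a longer right tail, negative a longer left tail.
    "skew": monthly_charge.skew(),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

When the mean, median, and mode sit close together and the skewness is near zero, the distribution is roughly symmetric, which is consistent with a bell-shaped histogram.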

Tenure was also a continuous variable so a histogram was used to plot its distribution as well:

This data appears to follow a bimodal distribution because it clusters around two peaks that are separated by a gap in the data. The first peak is at the beginning of the Tenure range (about 5 months); the frequency decreases as Tenure approaches 30 months, then increases until it reaches the second peak around 65 months of Tenure.

Univariate Statistics were calculated to further identify the Tenure distribution:

Because Contract was a categorical variable, a column chart was used to illustrate its distribution:

The data is not uniformly distributed because the different categories have different frequencies. It is skewed towards Month-to-month because that is the most frequent category. Two Year Contract and One Year Contract were similar and about half as frequent as Month-to-month with Two Year being slightly more frequent than One year. Univariate Statistics were calculated to further identify the Contract distribution:

Internet Service was also a categorical variable so a column chart was used to illustrate its distribution as well:

This distribution is also not uniform because each category has a different frequency in the data, with Fiber Optic being the most frequent, then DSL, then None. Univariate statistics were used to further represent the distribution of the data:
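For a categorical variable, the univariate statistics are typically frequency counts and proportions. The sketch below shows one way to produce them with pandas value_counts; the toy Contract values are made up to loosely mirror the ordering described above and are not the actual counts from the data set.

```python
import pandas as pd

# Toy contract data (made-up counts, for illustration only).
contract = pd.Series(
    ["Month-to-month"] * 6 + ["Two Year"] * 3 + ["One year"] * 2,
    name="Contract",
)

counts = contract.value_counts()                          # absolute frequencies
shares = contract.value_counts(normalize=True).round(3)   # relative frequencies

table = pd.DataFrame({"count": counts, "share": shares})
print(table)
```

The same two calls would apply to the internet service column; only the series passed in changes.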

Identify the distribution of two continuous variables and two categorical variables using bivariate statistics.

Bivariate statistical charts were also created while exploring the dataset to view how both continuous and categorical variables were distributed together.

A scatterplot was used to view the distribution of Tenure and MonthlyCharge together (both continuous):

Tenure vs. Monthly Charge has a clustered distribution because the data is visually separated into two groups. The separation runs horizontally across the entire chart which means there are not many customers with a Tenure in the middle of the range.

Since it is an analysis of two continuous variables and the data is not normally distributed, Spearman’s Rho was used as a bivariate statistic to help describe the data:

The Spearman’s Rho value of -0.00469 is very close to zero, which indicates that there does not appear to be a meaningful monotonic relationship between the two variables.
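A minimal sketch of how Spearman’s Rho can be computed with scipy.stats.spearmanr is shown here. The two arrays are generated independently of each other, so a value near zero is expected; they are synthetic stand-ins, not the Tenure and MonthlyCharge columns from the data set.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
tenure = rng.uniform(1, 72, 500)            # hypothetical months of service
monthly_charge = rng.normal(150, 40, 500)   # drawn independently of tenure

# Spearman's Rho is a rank correlation, so it does not assume normality.
rho, p_value = stats.spearmanr(tenure, monthly_charge)
print(f"Spearman's rho = {rho:.4f}, p = {p_value:.4f}")

# For contrast: a perfectly monotonic but non-linear relationship gives rho = 1.
rho_monotonic, _ = stats.spearmanr(tenure, tenure ** 3)
print(f"Monotonic example rho = {rho_monotonic:.4f}")
```

Because the statistic works on ranks, it measures monotonic association rather than strictly linear association.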

A stacked column chart was used to view the distribution of Churn and Contract together (both categorical):

In this chart, two columns separate churn customers (Yes) from non-churn customers (No). Each stacked column shows the distribution of contract types within that group. Customers who did not churn had a more even distribution of contract types, with slightly more month-to-month contracts. Meanwhile, considerably more churn customers had month-to-month contracts than one-year or two-year contracts. A Cross Tabulation was created to help describe this comparison of two categorical variables:
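A cross tabulation of this kind can be built with pandas.crosstab. The sketch below uses a ten-row toy data frame with made-up values, so the counts are illustrative only; normalizing by column is one possible choice for reading the contract mix within each churn group.

```python
import pandas as pd

# Toy data (made-up rows, for illustration only).
df = pd.DataFrame({
    "Churn": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "Contract": [
        "Month-to-month", "Month-to-month", "Month-to-month", "One year",
        "Month-to-month", "Month-to-month", "One year", "One year",
        "Two Year", "Two Year",
    ],
})

# Raw counts: rows are contract types, columns are churn groups.
counts = pd.crosstab(df["Contract"], df["Churn"])

# Share of each contract type within each churn group (each column sums to 1).
col_share = pd.crosstab(df["Contract"], df["Churn"], normalize="columns")

print(counts)
print(col_share.round(2))
```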

Implications of the Analysis

Welch’s Two-Sample T-Test produced a p-value reported as 0.0, which in practice means a value too small to show at the reported precision rather than exactly zero. This indicates a statistically significant difference in monthly charges between churn and non-churn customers. Rejecting the null hypothesis indicates that the difference between the two samples is unlikely to be due to chance alone and that the data warrants further investigation.

The T-Test conducted is limited in that it only tells us that there was a significant difference between the two subsets of data. It does not tell us the nature of the relationship between MonthlyCharge and customer churn. Further analysis would be necessary to determine the monthly charge at which a customer becomes more likely to churn, and whether MonthlyCharge is a cause of such behavior at all or whether other factors are involved.

Exploring the effect size could be a good next step to view the actual practical impact of the difference between the two sets of data (Hayden, n.d.b). Another recommendation would be to conduct hypothesis testing on other variables to view what other interactions among variables could be at play.
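One common effect-size measure for a two-group comparison is Cohen’s d. The source does not say which measure would be used, so choosing Cohen’s d here is an assumption, as are the helper name cohens_d and the synthetic samples. Note also that the pooled standard deviation in this formula assumes similar variances, so with unequal variances the result should be read as a rough guide only.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic samples (made-up values, for illustration only).
rng = np.random.default_rng(3)
churned = rng.normal(165, 40, 400)
retained = rng.normal(150, 40, 1100)

d = cohens_d(churned, retained)
print(f"Cohen's d = {d:.3f}")
```

By a widely quoted rule of thumb, values around 0.2, 0.5, and 0.8 are described as small, medium, and large effects, although what counts as practically important depends on the business context.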

Web Sources

Hayden, L. (n.d.). Assumptions and Normal Distributions. Data Camp. Retrieved December 20, 2023, from https://campus.datacamp.com/courses/experimental-design-in-python/testing-normality-parametric-and-non-parametric-tests?ex=1.

NumFOCUS. (2023a). Pandas.DataFrame.boxplot. Pandas 2.1.4 Documentation. Retrieved December 21, 2023, from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html.

NumFOCUS. (2023b). Pandas.DataFrame.plot.bar. Pandas 2.1.4 Documentation. Retrieved December 21, 2023, from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html.

NumFOCUS. (2023c). Pandas.DataFrame.quantile. Pandas 2.1.3 Documentation. Retrieved December 21, 2023, from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html.

SciPy Community, The. (2023a). Scipy.stats.bartlett. SciPy v1.11.4 Manual. Retrieved December 21, 2023, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bartlett.html.

SciPy Community, The. (2023b). Scipy.stats.ttest_ind. SciPy v1.11.4 Manual. Retrieved December 21, 2023, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.

