MathsDataExplorationAssign3

Dataset Exploration Part - III
Tavleen Kaur
Math for Data Analytics
Student ID- 200573180
Table of Contents
Introduction
Dataset Description
Hypothesis Test Analysis
Univariate descriptive analysis
Frequency Distribution
Research Questions
Assumptions
Conclusion
Tracking Log
Introduction:
The Telco firm seeks to increase new customer acquisition while avoiding contract termination
by existing customers (churn). To grow its clientele, the company's growth rate (number of
new customers gained) must be higher than its churn rate (number of existing customers
lost). Better price offers, faster internet connections, and a safer online experience from
competitors are some of the reasons existing customers leave their telco providers. A high
churn rate adversely affects a company's profits and impedes growth. Our churn analysis
aims to give the telco firm insight into how well it is retaining its current customers, and
into the underlying causes of contract termination.
Dataset Description:
The dataset was obtained from the Maven Analytics website; the original data source is IBM
Cognos Analytics. It is a fictional sample of churn data describing 7,043 customers of a
fictional telecommunications company in California that provides phone and internet
services. It includes details about customer demographics, location, services, and current
status.
Hypothesis Test Analysis:

Chi Squared
A Chi-Square test of independence was conducted on the variables 'Gender' (categorical) and
'Contract' (categorical) within the dataset. The purpose of this analysis is to explore whether
there is a significant association between the gender of customers and the type of contract
they have. The Chi-Square test is appropriate here because both variables are categorical, so
the analysis compares counts rather than means and does not rely on normally distributed data.
Hypothesis:
Null Hypothesis (H0): There is no association between the gender of customers and the type
of contract.
Alternative Hypothesis (H1): There is a significant association between the gender of
customers and the type of contract.
Observed and expected counts for the Month-to-Month, One Year, and Two-Year contracts in
the gender categories Female and Male are compared in Figure (). If the observed counts for
any contract type deviated substantially from the counts expected under the null hypothesis,
for either gender, the test would return a significant result.
Results:
The Chi-Square test on 'Gender' and 'Contract' yielded a p-value of 0.47. Since the p-value
(0.47) is greater than 0.05, we fail to reject the null hypothesis. The result suggests
that there is not enough evidence to conclude that there is a significant association between
gender and the type of contract.
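The mechanics of the test can be sketched in a few lines of Python. This is only an illustration: the counts below are hypothetical and are not the report's data, and the function name is made up for the example. Expected counts are computed as (row total x column total) / grand total, and the p-value shortcut used here is valid only for 2 degrees of freedom, which is the case for a 2 x 3 Gender by Contract table.

```python
import math

# Hypothetical Gender x Contract counts, for illustration only
# (NOT the counts from the Telco dataset).
observed = {
    "Female": {"Month-to-Month": 180, "One Year": 75, "Two Year": 95},
    "Male": {"Month-to-Month": 170, "One Year": 85, "Two Year": 95},
}


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts)."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    grand_total = sum(row_totals.values())

    statistic = 0.0
    expected = {}
    for r in rows:
        expected[r] = {}
        for c in cols:
            # Expected count if gender and contract were independent.
            e = row_totals[r] * col_totals[c] / grand_total
            expected[r][c] = e
            statistic += (table[r][c] - e) ** 2 / e

    dof = (len(rows) - 1) * (len(cols) - 1)
    return statistic, dof, expected


statistic, dof, expected = chi_square_independence(observed)

# With exactly 2 degrees of freedom, the chi-square upper-tail
# probability has the closed form exp(-x / 2).
p_value = math.exp(-statistic / 2) if dof == 2 else None

print(f"chi-square = {statistic:.3f}, dof = {dof}, p = {p_value:.3f}")
```

With these made-up counts the p-value is well above 0.05, so the sketch reaches the same kind of decision as the report: fail to reject the null hypothesis.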

T-Test
The t-Test: One-Sample was performed on the 'Monthly Charge' variable, comparing it
against a hypothesized mean. The t-Test: Two-Sample Assuming Unequal Variances was
conducted on both 'Monthly Charge' and 'Total Charges.' The goal is to provide insights into
the statistical significance of differences in means between groups.
1). t-Test: One-Sample for Monthly Charge:
This test was chosen to assess whether the mean of the 'Monthly Charge' variable
significantly differs from a hypothesized mean of 80.
Hypothesis:
Null Hypothesis (H0): There is no significant difference between the observed mean
'Monthly Charge' and the hypothesized mean of 80.
Alternative Hypothesis (H1): There is a significant difference between the observed mean
'Monthly Charge' and the hypothesized mean of 80.
Result: The t-statistic is 0.56, and the p-value is 0.58 (two-tailed). Since the p-value is greater
than the common significance level of 0.05, we fail to reject the null hypothesis.
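As a rough sketch of what the one-sample test computes, the t-statistic is the difference between the sample mean and the hypothesized mean, divided by the standard error of the mean. The charges below are hypothetical values chosen for illustration, not the dataset's 'Monthly Charge' column, and the function name is an assumption.

```python
import math
import statistics


def one_sample_t(values, hypothesized_mean):
    """Return (t-statistic, degrees of freedom) for a one-sample t-test."""
    n = len(values)
    sample_mean = statistics.fmean(values)
    sample_sd = statistics.stdev(values)  # uses n - 1 in the denominator
    standard_error = sample_sd / math.sqrt(n)
    t_stat = (sample_mean - hypothesized_mean) / standard_error
    return t_stat, n - 1


# Hypothetical monthly charges (illustrative only).
charges = [79.0, 85.0, 80.0, 84.0, 77.0]
t_stat, dof = one_sample_t(charges, hypothesized_mean=80.0)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The two-tailed p-value is then read from the t distribution with n - 1 degrees of freedom; a small absolute t-statistic, as in the report's 0.56, corresponds to a large p-value.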
2). t-Test: Two-Sample for Monthly Charge and Total Charges:
This test aimed to investigate if there is a significant difference in means between 'Monthly
Charge' and 'Total Charges.'
Hypothesis:
Null Hypothesis (H0): There is no significant difference in means between 'Monthly Charge'
and 'Total Charges.'
Alternative Hypothesis (H1): There is a significant difference in means between 'Monthly
Charge' and 'Total Charges.'
Result: The t-statistic for 'Monthly Charge' vs. 'Total Charges' is -33.14, and the p-value is
close to zero. This indicates strong evidence against the null hypothesis, suggesting a
significant difference in means between the two variables. The accompanying graph makes
the disparity visible, showing how 'Monthly Charge' and 'Total Charges' vary within the
dataset.
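A two-sample t-test assuming unequal variances is commonly computed as Welch's t-test. The sketch below shows that calculation under the assumption that this is the variant used; the two samples are hypothetical and the function name is illustrative.

```python
import statistics


def welch_t(sample_a, sample_b):
    """Return (t-statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    se_sq_a = statistics.variance(sample_a) / n_a
    se_sq_b = statistics.variance(sample_b) / n_b

    mean_diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    t_stat = mean_diff / (se_sq_a + se_sq_b) ** 0.5

    # Degrees of freedom are adjusted because the variances are not pooled.
    dof = (se_sq_a + se_sq_b) ** 2 / (
        se_sq_a ** 2 / (n_a - 1) + se_sq_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical values (illustrative only, not the dataset's columns).
monthly_charge = [70.0, 80.0, 90.0, 60.0]
total_charges = [500.0, 1500.0, 2500.0, 3500.0]

t_stat, dof = welch_t(monthly_charge, total_charges)
print(f"t = {t_stat:.3f}, dof = {dof:.3f}")
```

The negative sign simply reflects the order of subtraction: the first sample's mean is smaller than the second's, as with the report's t-statistic of -33.14.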
3). F-Test: Two-Sample for Variances:
This test was chosen to assess whether there is a significant difference in the variances of
'Monthly Charge' and 'Total Charges.'
Hypothesis:
Null Hypothesis (H0): There is no significant difference in variances between 'Monthly
Charge' and 'Total Charges.'
Alternative Hypothesis (H1): There is a significant difference in variances between 'Monthly
Charge' and 'Total Charges.'
The extremely low p-value suggests strong evidence against the null hypothesis, indicating a
significant difference in variances between 'Monthly Charge' and 'Total Charges.' The
F-statistic being close to zero indicates that the variance of 'Monthly Charge' is significantly
smaller than the variance of 'Total Charges.'
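The F-statistic in a two-sample variance test is simply the ratio of the two sample variances, which is why a value near zero means the first variance is far smaller than the second. A minimal sketch with hypothetical values (not the dataset's columns; the function name is an assumption):

```python
import statistics


def variance_ratio(sample_a, sample_b):
    """Return (F = var(a) / var(b), numerator dof, denominator dof)."""
    f_stat = statistics.variance(sample_a) / statistics.variance(sample_b)
    return f_stat, len(sample_a) - 1, len(sample_b) - 1


# Hypothetical values (illustrative only).
monthly_charge = [70.0, 80.0, 90.0, 60.0]
total_charges = [500.0, 1500.0, 2500.0, 3500.0]

f_stat, dof_num, dof_den = variance_ratio(monthly_charge, total_charges)
print(f"F = {f_stat:.6f} on ({dof_num}, {dof_den}) degrees of freedom")
```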
Conclusion:
The t-Test: One-Sample did not find a significant difference between the observed mean
'Monthly Charge' and the hypothesized mean. However, the t-Test: Two-Sample indicated a
highly significant difference in means between 'Monthly Charge' and 'Total Charges.' The
F-Test: Two-Sample for Variances indicates a statistically significant difference in variances
between 'Monthly Charge' and 'Total Charges.' This suggests that the variability in 'Monthly
Charge' is considerably smaller compared to the variability in 'Total Charges.'

OR and RR
This section provides an analysis of the relative risk and odds ratio between the variables
'Internet Service' (categorized as 'Yes' and 'No') and 'Customer Status' (categorized as
'Churned' and 'Stayed'). The purpose of this analysis is to find the association between
having internet service and the likelihood of customer churn.
Relative Risk (RR): This measure provides insight into the risk of an event (customer churn) in
one group compared to another. It is particularly useful for understanding the proportional
impact of internet service on the outcome. Odds Ratio (OR): The odds ratio measures the
odds of an event happening in one group compared to the odds of it happening in another
group.
Interpretation:
Odds Ratio: An Odds Ratio of 0.18 suggests that the odds of customer churn for customers
with internet service are 0.18 times the odds for customers without internet service.
The confidence interval (0.15 to 0.22) indicates a statistically significant difference, as it does
not include 1. This implies a significant association between having internet service and
lower odds of customer churn.
Similarly, a Relative Risk of 0.25 indicates that the risk of customer churn for customers with
internet service is 0.25 times the risk for customers without internet service and the
confidence interval (0.21 to 0.30) is statistically significant, further supporting the conclusion
that having internet service is associated with a lower risk of customer churn.
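The two measures and their confidence intervals can be sketched from a 2 x 2 table as follows. The counts are hypothetical, and the intervals use the common log-scale (Wald) approximation; the report does not state which interval method was used, so that choice is an assumption.

```python
import math


def odds_ratio_and_relative_risk(a, b, c, d, z=1.96):
    """2 x 2 table: a, b = churned / stayed in group 1;
    c, d = churned / stayed in group 2. Returns point estimates
    and approximate 95% confidence intervals (log-scale Wald method)."""
    odds_ratio = (a * d) / (b * c)
    relative_risk = (a / (a + b)) / (c / (c + d))

    # Standard errors on the log scale.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))

    def interval(estimate, se):
        log_est = math.log(estimate)
        return math.exp(log_est - z * se), math.exp(log_est + z * se)

    return {
        "odds_ratio": odds_ratio,
        "or_ci": interval(odds_ratio, se_log_or),
        "relative_risk": relative_risk,
        "rr_ci": interval(relative_risk, se_log_rr),
    }


# Hypothetical counts (illustrative only, NOT the report's table).
result = odds_ratio_and_relative_risk(a=20, b=80, c=50, d=50)
for name, value in result.items():
    print(name, value)
```

As in the interpretation above, an interval that lies entirely below 1 indicates a statistically significant reduction in odds or risk for the first group relative to the second.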

ANOVA
Analysis of Variance (ANOVA) is used to assess the variation in average monthly GB
download and total revenue across different categories of internet type. Internet Type is a
categorical variable that has been encoded by category.
With three categories of internet type, ANOVA allows us to compare means across multiple
groups simultaneously. ANOVA is also suited to comparing means of continuous variables,
which makes it appropriate for the analysis of 'Avg Monthly GB Download' and 'Total
Revenue.' ANOVA assumes that the data within each group are normally distributed and
have equal variances. These assumptions should be assessed to ensure the validity of the
results.
Interpretation:
The F-statistic (1196.26) is the ratio of the variance between the groups to the variance
within the groups. The high F-statistic of 1196.26 suggests that there is substantial
variability in 'Avg Monthly GB Download' and 'Total Revenue' across the different categories
of internet types (Fiber Optic, DSL, and Cable).
P-value: the p-value is close to zero, indicating a very low probability of observing this
level of variability between groups if there were no true difference. The small p-value
therefore provides strong evidence against the null hypothesis.
Critical F-value: the critical F-value at the chosen significance level of 0.05 is 2.99762043.
The calculated F-statistic (1196.26) is far larger than the critical F-value, reinforcing
the evidence against the null hypothesis.
A bar graph in Figure ( ) visually represents the mean differences in 'Avg Monthly GB
Download' and 'Total Revenue' across the three internet types.
Fiber Optic and DSL exhibit notably lower average monthly download volumes compared to
Cable, indicating a potential difference in data consumption patterns among customers with
these internet types. The significantly higher bar for Cable suggests that customers with
Cable internet service tend to have a higher average monthly download volume. A similar
trend is observed in the 'Total Revenue' aspect. The bar corresponding to Cable is
considerably higher than those for Fiber Optic and DSL. This suggests that customers with
Cable internet not only have higher data consumption but also contribute more to the
overall revenue.

MANOVA
Multivariate Analysis of Variance (MANOVA) is used to assess the impact of 'Churn Category'
on two dependent variables: 'Avg Monthly GB Download' and 'Number of Dependents.' The
'Churn Category' variable has been encoded numerically to represent different reasons for
customer churn. The objective is to investigate whether there are significant differences in
the means of the two dependent variables across the various categories of churn.
MANOVA allows for the simultaneous analysis of multiple dependent variables, making it
suitable for examining the joint effect of 'Churn Category' on both 'Avg Monthly GB
Download' and 'Number of Dependents.' It considers the relationship between the entire
set of dependent variables and the independent variable, providing a comprehensive
understanding of potential group differences.
The p-values associated with each trace indicate that there is no significant difference in
means among the 'Churn Category' groups for 'Avg Monthly GB Download' and 'Number of
Dependents.'
The group covariance matrices provide a detailed look at the variability within each 'Churn
Category' group by presenting the covariance among the dependent variables—'Avg
Monthly GB Download' and 'Number of Dependents.' Covariance measures how much two
variables change together, offering insights into the relationships and variability specific to
each 'Churn Category.'
The negative covariances indicate an inverse relationship between 'Avg Monthly GB
Download' and 'Number of Dependents' within each 'Churn Category' group. A larger
covariance magnitude indicates that the two variables move together more strongly,
measured in their original units.
Understanding these covariances aids in assessing how 'Avg Monthly GB Download' and
'Number of Dependents' interact within each specific 'Churn Category.'
The correlation matrix reveals weak correlations between the variables, suggesting that
'Tenure in Months' has a negligible negative relationship with both 'Avg Monthly GB
Download' and 'Number of Dependents.' Additionally, there is a slight positive correlation
between 'Avg Monthly GB Download' and 'Number of Dependents.' These insights provide
an understanding of how these variables co-vary across the entire dataset.
The group means provide insights into the central tendencies of each 'Churn Category.'
Notably, the differences in means across categories suggest potential variations in customer
behavior. For instance, 'Competitor' and 'Attitude' categories exhibit higher mean values for
'Avg Monthly GB Download' and 'Number of Dependents' compared to other categories.
These matrices provide detailed information about the sum of squares and cross-products
among the variables in the MANOVA. The T matrix represents the total variation, the H
matrix represents the hypothesized variation due to the effect of 'Churn Category,' and the E
matrix represents the residual (error) variation. Understanding these matrices aids in
discerning the contribution of each component to the overall variability.
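The relationship between the three matrices (T = H + E) can be illustrated with a small sketch for two dependent variables. The data are hypothetical pairs of (Avg Monthly GB Download, Number of Dependents) for two churn categories, and Wilks' lambda is shown only as one common MANOVA statistic built from E and T; the report does not state which statistic its p-values refer to beyond "trace".

```python
def mean_vector(rows):
    """Mean of each of the two dependent variables."""
    n = len(rows)
    return (sum(x for x, _ in rows) / n, sum(y for _, y in rows) / n)


def sscp(rows, center):
    """2 x 2 sum of squares and cross-products of (x, y) rows about center."""
    sxx = sum((x - center[0]) ** 2 for x, _ in rows)
    syy = sum((y - center[1]) ** 2 for _, y in rows)
    sxy = sum((x - center[0]) * (y - center[1]) for x, y in rows)
    return [[sxx, sxy], [sxy, syy]]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical observations per churn category (illustrative only).
groups = {
    "Competitor": [(30.0, 1.0), (26.0, 0.0), (34.0, 2.0)],
    "Attitude": [(20.0, 0.0), (24.0, 1.0), (16.0, 2.0)],
}

all_rows = [row for rows in groups.values() for row in rows]

# T: total variation about the grand mean.
T = sscp(all_rows, mean_vector(all_rows))

# E: residual variation, summed over groups about each group's own mean.
E = [[0.0, 0.0], [0.0, 0.0]]
for rows in groups.values():
    g = sscp(rows, mean_vector(rows))
    E = [[E[i][j] + g[i][j] for j in range(2)] for i in range(2)]

# H: variation explained by the grouping, obtained as T - E.
H = [[T[i][j] - E[i][j] for j in range(2)] for i in range(2)]

# Wilks' lambda: values near 1 mean the grouping explains little.
wilks_lambda = det2(E) / det2(T)

print("T =", T)
print("H =", H)
print("E =", E)
print(f"Wilks' lambda = {wilks_lambda:.4f}")
```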
Box's Test evaluates the null hypothesis that the covariance matrices across groups are
equal. The small p-value (1.97956E-08) suggests significant differences in covariance
matrices among groups, providing evidence to reject the null hypothesis. Because equal
covariance matrices are an assumption of MANOVA, this result concerns the variability
within the groups rather than their means, and it indicates that the MANOVA findings for
'Avg Monthly GB Download' and 'Number of Dependents' across 'Churn Category' groups
should be interpreted with caution.
Univariate descriptive statistics:

Variable Name: Tenure
Statistical metric: Description
Mean: The mean tenure is 18.9 months, the average length of time customers stayed before
churning.
Standard Error: The standard error of 0.5 measures the variability in the sample mean. A
smaller standard error suggests that the sample mean is a more reliable estimate of the
population mean.
Median: The median tenure is 11.
Mode: The mode is 1. This means that a large number of individuals in the dataset have a
tenure of 1 month.
Standard Deviation: The standard deviation is 19.8. A higher standard deviation indicates
that tenures vary widely from the mean.
Sample Variance: The sample variance is 390.5.
Kurtosis: A kurtosis value of 0.0 suggests that the distribution of tenure values is
approximately mesokurtic, which means it has tails and a peak similar to a normal
distribution.
Skewness: A skewness value of 1.1 indicates that the distribution of tenure values is
positively skewed.
Range: The tenure range is 71.0.
Minimum and Maximum: The minimum tenure observed is 1.0, and the maximum tenure is 72.0.
Sum and Count: The sum of all tenure values is 29,979.0, and there are 1,585 observations in
the dataset.
Largest and Smallest Values: The largest and smallest values are 72 and 1.
Variable Name: Total Revenue
Statistical metric: Description
Mean: The average value is 2222.6.
Standard Error: The standard error is 63.9, which measures the uncertainty in the sample
mean as an estimate of the population mean.
Median: The median value is 1143.
Mode: The mode is 55.1.
Standard Deviation: The standard deviation is 2543.3, which suggests that the data points
for this variable vary considerably around the mean.
Sample Variance: The sample variance is 6,468,388.9.
Kurtosis: A kurtosis value of 1.0 suggests that the distribution of this variable is close to
mesokurtic, with tails only slightly heavier than those of a normal distribution.
Skewness: A skewness value of 1.4 indicates that the distribution of this variable is
positively skewed.
Range: The range is 11,148.5.
Minimum and Maximum: The minimum value observed is 46.9, and the maximum value is
11,195.4.
Sum and Count: The sum of all values is 3,522,882.9, and there are 1,585 observations in the
dataset.
Largest and Smallest Values: The largest and smallest values are 11,195.4 and 46.9.
Variable Name: Average monthly GB download
Statistical metric: Description
Mean: The average value is 23.5.
Standard Error: The standard error is 0.4, which suggests that the calculated sample mean
is a precise estimate of the population mean.
Median: The median value is 20.
Mode: The mode is 27.
Standard Deviation: The standard deviation is 17.7, which suggests that the data points for
this variable vary substantially around the mean value of 23.5.
Sample Variance: The sample variance is 314.8.
Kurtosis: A kurtosis value of 1.7 suggests that the distribution of this variable is
leptokurtic, meaning it has fatter tails and is more peaked than a normal distribution.
Skewness: A skewness value of 1.4 indicates that the distribution of this variable is
positively skewed.
Range: The range is 83.
Minimum and Maximum: The minimum value observed is 2 and the maximum value is 85.
Sum and Count: The sum of all values is 37,182.0, and there are 1,585 observations in the
dataset.
Largest and Smallest Values: The largest and smallest values are 85 and 2.
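Several of the reported figures are linked by simple formulas, so they can be cross-checked against each other. The sketch below uses only the rounded values reported for the three variables and assumes the usual definitions: mean = sum / count and standard error = standard deviation / sqrt(count).

```python
import math

# Rounded figures as reported in the descriptive statistics above.
reported = {
    "Tenure": {
        "sum": 29979.0, "count": 1585, "mean": 18.9, "sd": 19.8, "se": 0.5,
    },
    "Total Revenue": {
        "sum": 3522882.9, "count": 1585, "mean": 2222.6, "sd": 2543.3, "se": 63.9,
    },
    "Avg Monthly GB Download": {
        "sum": 37182.0, "count": 1585, "mean": 23.5, "sd": 17.7, "se": 0.4,
    },
}


def recompute(stats):
    """Recompute the mean and standard error from the other reported figures."""
    mean = stats["sum"] / stats["count"]
    standard_error = stats["sd"] / math.sqrt(stats["count"])
    return round(mean, 1), round(standard_error, 1)


for name, stats in reported.items():
    mean, standard_error = recompute(stats)
    print(
        f"{name}: mean {mean} (reported {stats['mean']}), "
        f"standard error {standard_error} (reported {stats['se']})"
    )
```

To one decimal place, the recomputed means and standard errors agree with the reported ones for all three variables.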
Frequency Distribution:
Variable: Description
Age: The data points for age are all the same or do not vary.
Tenure: The values in this column range from 1 to 70.
Total Revenue: The values in this column range from 101.34 to 9664.155.
Avg Monthly GB Download: The values in this column range from 5 to 82.
Tenure in months
Figure 4
Analysis: Customers are categorized into tenure groups based on their "Tenure in Months":
"0 to 20 months" (short-term customers), "20 to 50 months" (medium-term customers), and
"50 months and above" (long-term customers).
By the end of the "0 to 20 months" group, 64.86% of the customers have churned, while
35.14% have not yet churned.
By the end of the "20 to 50 months" group, 89.40% of the customers have churned, while
10.60% have not yet churned.
Hence, the analysis in Figure 4 suggests that a large proportion of customers (64.86%)
stopped using the service within the first 20 months of their tenure. By 50 months the
figure reaches 89.40%, which leaves only 10.60% of churn occurring after 50 months. This
implies that customers who stay longer (more than 50 months) are less likely to churn.
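The grouping behind Figure 4 can be sketched as follows. The tenure values are hypothetical, the function names are illustrative, and two things are assumptions rather than facts from the report: how boundary values (exactly 20 or 50 months) are assigned, and that the percentages are cumulative shares of churned customers, which is one way to read figures such as 64.86% and 89.40%.

```python
from collections import Counter

GROUP_ORDER = ["0 to 20 months", "20 to 50 months", "50 months and above"]


def tenure_group(months):
    """Assign a tenure (in months) to a group.
    Boundary handling is an assumption: 20 goes to the second group,
    50 to the third."""
    if months < 20:
        return GROUP_ORDER[0]
    if months < 50:
        return GROUP_ORDER[1]
    return GROUP_ORDER[2]


def cumulative_share(tenures):
    """Cumulative percentage of customers up to and including each group."""
    counts = Counter(tenure_group(m) for m in tenures)
    total = len(tenures)
    running = 0
    shares = {}
    for group in GROUP_ORDER:
        running += counts[group]
        shares[group] = round(100 * running / total, 2)
    return shares


# Hypothetical tenures of churned customers (illustrative only).
churned_tenures = [1, 3, 5, 8, 12, 15, 22, 35, 48, 60]
shares = cumulative_share(churned_tenures)
for group, share in shares.items():
    print(f"{group}: {share}% of churned customers so far")
```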

Age
Figure 5
Customers are categorized into four age groups: "0 to 20" (ages between 0 and 20 years),
"20 to 40" (ages between 20 and 40 years), "40 to 60" (ages between 40 and 60 years), and
"60 and more" (ages 60 years and older).
The reported percentages are 2.52% for "0 to 20", 33.94% for "20 to 40", 66.31% for
"40 to 60", and 100.00% for "60 and more". Because they rise to 100.00%, they read as
cumulative percentages of churned customers: 2.52% are aged 20 or younger, 33.94% are 40
or younger, and 66.31% are 60 or younger. The remaining third of churned customers are
aged 60 and more. Hence, the analysis in Figure 5 suggests that customers aged 60 and more
make up a substantial share of those who churned, that is, those who stopped using the
service.
Research questions:
1. Question: How does the presence or absence of online security features impact the
financial aspects of customer relationships, specifically in terms of monthly charges
and total charges?
Variables: Online Security, Monthly Charges, Total Charges
Rationale: This question helps to assess the financial effect of online security on
customer charges. These insights guide strategic decisions for enhancing customer
satisfaction and retention through effective promotion and integration of online security
features.
2. Question: How do gender and contract type influence customer preferences, and which
customers are more likely to continue using the services?
Variables: Gender, Contract
Rationale: Understanding how gender intersects with contract choices provides insight into
customer preferences, guiding targeted service strategies for improved satisfaction and
retention.
3. Question: How does the type of internet service influence whether customers
continue to use the services or leave?
Variables: Internet Service and Customer Status
Rationale: Investigating the correlation between internet service types and customer status
(churned or stayed) is essential for finding patterns that can inform strategic decisions. The
findings aim to enhance service offerings, ensuring customer satisfaction and reducing churn
rates.
Conclusion:
These are the conclusions:
• The Chi-squared analysis indicates that gender does not appear to be a
significant factor in determining the type of contract customers choose.
• The t-Test: One-Sample did not find a significant difference between the observed
mean 'Monthly Charge' and the hypothesized mean of $80. However, the t-Test:
Two-Sample indicated a highly significant difference in means between 'Monthly Charge'
and 'Total Charges.' The F-Test: Two-Sample for Variances indicates a statistically
significant difference in variances between 'Monthly Charge' and 'Total Charges.' This
suggests that the variability in 'Monthly Charge' is considerably smaller compared to
the variability in 'Total Charges.'
• The analysis of relative risk and odds ratio provides evidence of a significant
association between having internet service and a reduced likelihood of customer
churn. Customers with internet service exhibit lower odds and risks of churning
compared to those without internet service.
• The ANOVA results indicate a significant difference in means among the different
categories of internet types for 'Avg Monthly GB Download' and 'Total Revenue.'
Recommendation:
• Based on the Chi-squared analysis, it would be prudent to explore other factors that
could influence the choice of contract type among customers.
• Further exploration into the factors contributing to the observed differences in
'Monthly Charge' and 'Total Charges' is recommended. This could involve
investigating customer profiles or service usage patterns to better understand the
sources of variation in charges. Understanding the reasons for variations in charges
could provide valuable insights for service optimization or customer segmentation
strategies.
• These findings suggest that internet service may play a protective role in customer
retention. Marketing efforts or service improvements related to internet services
may be explored to enhance customer satisfaction and reduce churn.
• Investigating the factors contributing to the observed differences in 'Avg Monthly GB
Download' and 'Total Revenue' for different internet types could provide valuable
insights for strategic decision-making.
Tracking Log
Project Name: Telco customer churn
Date: 1st November 2023 – 8th November 2023
Task 1: Hypothesis Test Analysis
Description: In this task, hypothesis tests have been conducted to examine the statistical
significance of relationships within our dataset. Specifically, we focused on investigating the
association between variables, such as gender and contract types, using appropriate
statistical tests.
Data Source: I obtained the dataset from the Maven Analytics website; the original data
source is IBM Cognos Analytics.
Tools/Software Used: I used Excel to perform the hypothesis tests.
Steps Taken:
• Identified and selected the variables for hypothesis testing, considering their
relevance to our research objectives.
• Formulated null and alternative hypotheses to test the significance of the identified
relationships.
• Chose the appropriate statistical test based on the nature of the variables and the
research question.
• Executed the chosen hypothesis test on the selected variables. This involved
calculating relevant test statistics, such as chi-square values, p-values, and degrees of
freedom.
Challenges/Issues Encountered: There were no significant challenges or issues encountered
during this phase.
Date: 10th November 2023
Task 2: Update Research Questions
Description: In this task, I defined research questions to guide the analysis and
exploration of the telco company's customer dataset.
Steps Taken:
• Reviewed the dataset's content, variables, and findings from the hypothesis test
analysis.
• Brainstormed potential research questions that align with the project's objectives
and the dataset's variables.
• Evaluated the relevance and feasibility of each research question.
• Refined the research questions to ensure they are clear, specific, and actionable.