Dataset Exploration Part - III
Tavleen Kaur
Math for Data Analytics
Student ID- 200573180

Table of Content
Introduction
Dataset Description
Hypothesis Test Analysis
Univariate descriptive analysis
Frequency Distribution
Research Questions
Assumptions
Conclusion
Tracking Log

Introduction:
The Telco firm seeks to increase new customer acquisition while avoiding contract termination by existing customers (churn). To grow its clientele, the company's growth rate (the number of new customers gained) must be higher than its churn rate (the number of existing customers lost). Better price offers, quicker internet connections, and a safer online experience from other businesses are some of the reasons that existing consumers leave their telco providers. A high churn rate adversely affects a company's profits and impedes growth. Our churn prediction aims to give the telco firm insight into how well it is keeping its current customers and into the underlying causes of contract termination (high churn rate).

Dataset Description:
The dataset was obtained from the Maven Analytics website, and the original data source is IBM Cognos Analytics. It is a fictional sample of churn data describing 7,043 customers of a fictional telecommunications corporation in California that provides phone and internet services. It includes details about customer demographics, location, services, and current status.

Hypothesis Test Analysis:

Chi Squared
A Chi-Square test was conducted on the variables 'Gender' (categorical) and 'Contract' (categorical) within the dataset. The purpose of this analysis is to explore whether there is a significant association between the gender of customers and the type of contract they have. The Chi-Square test is appropriate for this analysis because both variables are categorical, so tests that assume normally distributed numeric data do not apply.
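To make the mechanics of the test concrete, the following is a minimal Python sketch of a Chi-Square test of independence on a 2 x 3 table (two gender categories by three contract types). The counts are hypothetical placeholders, not values from the dataset, and the function names are illustrative; the analysis in this report was carried out in Excel. For a 2 x 3 table the test has 2 degrees of freedom, and in that special case the upper-tail p-value equals exp(-statistic / 2), which keeps the sketch free of external libraries.

```python
import math


def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for a contingency table.

    `observed` is a list of rows, each row a list of counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


def p_value_two_dof(statistic):
    """Upper-tail p-value of a chi-square statistic; valid only for 2 degrees of freedom."""
    return math.exp(-statistic / 2)


# Hypothetical counts (NOT taken from the dataset).
# Columns: Month-to-Month, One Year, Two Year.
observed = [
    [300, 120, 150],  # Female
    [310, 110, 160],  # Male
]

statistic, dof, expected = chi_square_independence(observed)
p_value = p_value_two_dof(statistic)
print(f"chi-square = {statistic:.3f}, dof = {dof}, p = {p_value:.3f}")
print("fail to reject H0" if p_value > 0.05 else "reject H0")
```

The decision rule in the last line mirrors the one used in this report: a p-value above 0.05 means the null hypothesis of no association is not rejected.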
Hypothesis:
Null Hypothesis (H0): There is no association between the gender of customers and the type of contract.
Alternative Hypothesis (H1): There is a significant association between the gender of customers and the type of contract.

Observed and expected values for the Month-to-Month, One Year, and Two-Year contract categories within the gender categories Female and Male are compared in Figure (). A significant result would mean that the observed Month-to-Month, One Year, or Two-Year counts for females and males deviate substantially from the counts expected if gender and contract type were independent.

Results: The Chi-Square test on 'Gender' and 'Contract' yielded a p-value of 0.47. Since the p-value (0.47) is greater than 0.05, we fail to reject the null hypothesis. The result suggests that there is not enough evidence to conclude that there is a significant association between gender and the type of contract.

T-Test
A one-sample t-test was performed on the 'Monthly Charge' variable, comparing it against a hypothesized mean. A two-sample t-test assuming unequal variances was conducted on 'Monthly Charge' and 'Total Charges.' The goal is to provide insights into the statistical significance of differences in means.

1) t-Test: One-Sample for Monthly Charge:
This test was chosen to assess whether the mean of the 'Monthly Charge' variable significantly differs from a hypothesized mean of 80.
Hypothesis:
Null Hypothesis (H0): There is no significant difference between the observed mean 'Monthly Charge' and the hypothesized mean of 80.
Alternative Hypothesis (H1): There is a significant difference between the observed mean 'Monthly Charge' and the hypothesized mean of 80.
Result: The t-statistic is 0.56, and the p-value is 0.58 (two-tailed).
Since the p-value is greater than the common significance level of 0.05, we fail to reject the null hypothesis.

2) t-Test: Two-Sample for Monthly Charge and Total Charges:
This test aimed to investigate whether there is a significant difference in means between 'Monthly Charge' and 'Total Charges.'
Hypothesis:
Null Hypothesis (H0): There is no significant difference in means between 'Monthly Charge' and 'Total Charges.'
Alternative Hypothesis (H1): There is a significant difference in means between 'Monthly Charge' and 'Total Charges.'
Result: The t-statistic for 'Monthly Charge' vs. 'Total Charges' is -33.14, and the p-value is close to zero. This is strong evidence against the null hypothesis, suggesting a significant difference in means between the two variables. The accompanying graph makes the disparity visible, showing how 'Monthly Charge' and 'Total Charges' vary within the dataset.

3) F-Test: Two-Sample for Variances:
This test was chosen to assess whether there is a significant difference in the variances of 'Monthly Charge' and 'Total Charges.'
Hypothesis:
Null Hypothesis (H0): There is no significant difference in variances between 'Monthly Charge' and 'Total Charges.'
Alternative Hypothesis (H1): There is a significant difference in variances between 'Monthly Charge' and 'Total Charges.'
Result: The extremely low p-value is strong evidence against the null hypothesis, indicating a significant difference in variances between 'Monthly Charge' and 'Total Charges.' The F-statistic, which is the ratio of the two sample variances, is close to zero, indicating that the variance of 'Monthly Charge' is much smaller than the variance of 'Total Charges.'

Conclusion: The one-sample t-test did not find a significant difference between the observed mean 'Monthly Charge' and the hypothesized mean. However, the two-sample t-test indicated a highly significant difference in means between 'Monthly Charge' and 'Total Charges.'
The F-Test: Two-Sample for Variances indicates a statistically significant difference in variances between 'Monthly Charge' and 'Total Charges.' This suggests that the variability in 'Monthly Charge' is considerably smaller than the variability in 'Total Charges.'

OR and RR
This section analyses the relative risk and odds ratio between the variables 'Internet Service' (categorized as 'Yes' and 'No') and 'Customer Status' (categorized as 'Churned' and 'Stayed'). The purpose of this analysis is to measure the association between having internet service and the likelihood of customer churn.

Relative Risk (RR): This measure compares the risk (probability) of an event, here customer churn, in one group with the risk in another group. It is particularly useful for understanding the proportional impact of internet service on the outcome.
Odds Ratio (OR): The odds ratio compares the odds of an event happening in one group with the odds of it happening in another group.

Interpretation:
Odds Ratio: An odds ratio of 0.18 means that the odds of churn for customers with internet service are 0.18 times the odds for customers without internet service. The confidence interval (0.15 to 0.22) does not include 1, so the difference is statistically significant. This implies a significant association between having internet service and lower odds of customer churn.
Relative Risk: Similarly, a relative risk of 0.25 means that the risk of churn for customers with internet service is 0.25 times the risk for customers without internet service. The confidence interval (0.21 to 0.30) also excludes 1, further supporting the conclusion that having internet service is associated with a lower risk of customer churn.

ANOVA
Analysis of Variance (ANOVA) is used to assess the variation in average monthly GB download and total revenue across different categories of internet types. Internet Type is a categorical variable that was numerically encoded by category.
With three categories of internet types, ANOVA allows us to compare means across multiple groups simultaneously. ANOVA is also suitable for comparing means of continuous variables, making it appropriate for the analysis of 'Avg Monthly GB Download' and 'Total Revenue.' ANOVA assumes that the data within each group are normally distributed and have equal variances. These assumptions should be assessed to ensure the validity of the results.

Interpretation:
F-statistic: The F-statistic (1196.26) is the ratio of the variance between the groups to the variance within the groups. Such a high value suggests that there is substantial variability in 'Avg Monthly GB Download' and 'Total Revenue' across the different categories of internet types (Fiber Optic, DSL, and Cable).
P-value: The p-value is close to zero, meaning that variability between groups this large would be very unlikely if there were no true difference between the group means. The small p-value is therefore strong evidence against the null hypothesis.
Critical F-value: The critical F-value at the chosen significance level of 0.05 is 2.99762043. The calculated F-statistic (1196.26) is far larger than the critical F-value, reinforcing the evidence against the null hypothesis.

A bar graph in Figure ( ) shows the mean differences in 'Avg Monthly GB Download' and 'Total Revenue' across the three internet types. Fiber Optic and DSL exhibit notably lower average monthly download volumes than Cable, indicating a potential difference in data consumption patterns among customers with these internet types. The markedly higher bar for Cable suggests that customers with Cable internet service tend to have a higher average monthly download volume. A similar trend is observed for 'Total Revenue': the bar corresponding to Cable is considerably higher than those for Fiber Optic and DSL.
This suggests that customers with Cable internet not only have higher data consumption but also contribute more to the overall revenue.

MANOVA
Multivariate Analysis of Variance (MANOVA) was used to assess the impact of 'Churn Category' on two dependent variables: 'Avg Monthly GB Download' and 'Number of Dependents.' The 'Churn Category' variable was encoded numerically to represent the different reasons for customer churn. The objective is to investigate whether there are significant differences in the means of the two dependent variables across the categories of churn.

MANOVA allows for the simultaneous analysis of multiple dependent variables, making it suitable for examining the joint effect of 'Churn Category' on both 'Avg Monthly GB Download' and 'Number of Dependents.' It considers the relationship between the entire set of dependent variables and the independent variable, which gives a more complete picture of potential group differences than separate tests on each variable.

The p-values associated with each trace indicate that there is no significant difference in means among the 'Churn Category' groups for 'Avg Monthly GB Download' and 'Number of Dependents.'

The group covariance matrices give a detailed look at the variability within each 'Churn Category' group by presenting the covariance between the dependent variables, 'Avg Monthly GB Download' and 'Number of Dependents.' Covariance measures how much two variables change together. The negative covariances indicate an inverse relationship between 'Avg Monthly GB Download' and 'Number of Dependents' within each 'Churn Category' group. A larger covariance magnitude indicates stronger joint variation between the two variables, although covariance depends on the units in which the variables are measured.
Understanding these covariances helps assess how 'Avg Monthly GB Download' and 'Number of Dependents' interact within each specific 'Churn Category.'

The correlation matrix reveals weak correlations between the variables, suggesting that 'Tenure in Months' has a negligible negative relationship with both 'Avg Monthly GB Download' and 'Number of Dependents.' Additionally, there is a slight positive correlation between 'Avg Monthly GB Download' and 'Number of Dependents.' These results show how the variables co-vary across the entire dataset.

The group means describe the central tendency of each 'Churn Category.' The differences in means across categories suggest potential variations in customer behavior. For instance, the 'Competitor' and 'Attitude' categories exhibit higher mean values for 'Avg Monthly GB Download' and 'Number of Dependents' than the other categories.

The T, H, and E matrices contain the sums of squares and cross-products among the variables in the MANOVA. The T matrix represents the total variation, the H matrix represents the hypothesized variation due to the effect of 'Churn Category,' and the E matrix represents the residual (error) variation. Together they show how much each component contributes to the overall variability.

Box's Test evaluates the null hypothesis that the covariance matrices are equal across groups. The small p-value (1.97956E-08) provides evidence to reject this null hypothesis, indicating significant differences in covariance matrices among the 'Churn Category' groups. Because Box's Test concerns covariance matrices rather than means, this result does not show that the group means of 'Avg Monthly GB Download' and 'Number of Dependents' differ. It indicates that the MANOVA assumption of equal covariance matrices is not met, so the MANOVA results should be interpreted with caution.
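The distinction between the covariance and correlation matrices discussed in this section can be illustrated with a short Python sketch. The two lists below are hypothetical values chosen only for illustration, not rows from the dataset, and the helper names are assumptions of this sketch. It shows that the sign of the covariance gives the direction of the relationship, while its magnitude changes when a variable is rescaled (here from GB to MB), whereas the correlation stays the same.

```python
import math


def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson_correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


# Hypothetical values (NOT taken from the dataset).
gb_download = [10, 25, 30, 45, 60]
dependents = [3, 2, 2, 1, 0]

cov = sample_covariance(gb_download, dependents)
corr = pearson_correlation(gb_download, dependents)

# Rescale the first variable from GB to MB: covariance scales, correlation does not.
mb_download = [g * 1024 for g in gb_download]
cov_scaled = sample_covariance(mb_download, dependents)
corr_scaled = pearson_correlation(mb_download, dependents)

print(f"covariance (GB): {cov:.2f}, covariance (MB): {cov_scaled:.2f}")
print(f"correlation (GB): {corr:.3f}, correlation (MB): {corr_scaled:.3f}")
```

This is why the strength of a relationship is better read from the correlation matrix, while the covariance matrices are mainly useful for checking direction and for comparing groups measured in the same units.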
Univariate descriptive statistics:

Variable Name: Tenure
Statistical metrics and description:
Mean: The mean tenure is 18.9 months, which represents the average tenure of the churned customers.
Standard Error: The standard error of 0.5 measures the variability of the sample mean. A smaller standard error suggests that the sample mean is a more reliable estimate of the population mean.
Median: The median tenure is 11.
Mode: The mode is 1, meaning that a tenure of 1 month is the most common value in the dataset.
Standard Deviation: The standard deviation is 19.8, which is larger than the mean and indicates that tenures vary widely around it.
Sample Variance: The sample variance is 390.5.
Kurtosis: A kurtosis value of 0.0 suggests that the distribution of tenure values is approximately mesokurtic, which means it has tails and a peak similar to a normal distribution.
Skewness: A skewness value of 1.1 indicates that the distribution of tenure values is positively skewed.
Range: The tenure range is 71.0.
Minimum and Maximum: The minimum tenure observed is 1.0, and the maximum tenure is 72.0.
Sum and Count: The sum of all tenure values is 29,979.0, and there are 1,585 observations in the dataset.
Largest and Smallest Values: The largest and smallest values are 72 and 1.

Variable Name: Total Revenue
Statistical metrics and description:
Mean: The average value is 2222.6.
Standard Error: The standard error is 63.9, which implies that there is some uncertainty in the sample mean as an estimate of the population mean.
Median: The median value is 1143.
Mode: The mode is 55.1.
Standard Deviation: The standard deviation is 2543.3, which suggests that the data points for this variable vary considerably around the mean.
Sample Variance: The sample variance is 6,468,388.9.
Kurtosis: A kurtosis value of 1.0 suggests that the distribution of this variable is mildly leptokurtic, meaning it has somewhat heavier tails than a normal distribution.
Skewness: A skewness value of 1.4 indicates that the distribution of this variable is positively skewed.
Range: The range is 11,148.5.
Minimum and Maximum: The minimum value observed is 46.9, and the maximum value is 11,195.4.
Sum and Count: The sum of all values is 3,522,882.9, and there are 1,585 observations in the dataset.
Largest and Smallest Values: The largest and smallest values are 11195.4 and 46.9.

Variable Name: Average monthly GB download
Statistical metrics and description:
Mean: The average value is 23.5.
Standard Error: The standard error is 0.4, which suggests that the calculated sample mean is a precise estimate of the population mean.
Median: The median value is 20.
Mode: The mode is 27.
Standard Deviation: The standard deviation is 17.7, which suggests that the data points for this variable vary substantially around the mean value of 23.5.
Sample Variance: The sample variance is 314.8.
Kurtosis: A kurtosis value of 1.7 suggests that the distribution of this variable is leptokurtic, meaning it has fatter tails and is more peaked than a normal distribution.
Skewness: A skewness value of 1.4 indicates that the distribution of this variable is positively skewed.
Range: The range is 83.
Minimum and Maximum: The minimum value observed is 2 and the maximum value is 85.
Sum and Count: The sum of all values is 37,182.0, and there are 1,585 observations in the dataset.
Largest and Smallest Values: The largest and smallest values are 85 and 2.

Frequency Distribution:
Variables and description:
Age: The data points for age are all the same or do not vary.
Tenure: The values in this column range from 1 to 70.
Total Revenue: The values in this column range from 101.34 to 9664.155.
Avg Monthly GB Download: The values in this column range from 5 to 82.
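The univariate statistics reported for Tenure, Total Revenue, and Average monthly GB download are linked by simple identities: the mean is the sum divided by the count, the standard error is the standard deviation divided by the square root of the count, and the range is the maximum minus the minimum. The Python sketch below recomputes these three quantities from the figures reported in this document and checks that they agree to within rounding. The dictionary layout, function names, and the 0.05 tolerance (half of the last reported decimal place) are assumptions of this sketch.

```python
import math

# Values as reported in the univariate descriptive statistics of this document.
reported = {
    "Tenure": {
        "sum": 29979.0, "count": 1585, "mean": 18.9, "sd": 19.8,
        "se": 0.5, "min": 1.0, "max": 72.0, "range": 71.0,
    },
    "Total Revenue": {
        "sum": 3522882.9, "count": 1585, "mean": 2222.6, "sd": 2543.3,
        "se": 63.9, "min": 46.9, "max": 11195.4, "range": 11148.5,
    },
    "Avg Monthly GB Download": {
        "sum": 37182.0, "count": 1585, "mean": 23.5, "sd": 17.7,
        "se": 0.4, "min": 2.0, "max": 85.0, "range": 83.0,
    },
}


def recompute(stats):
    """Derive mean, standard error, and range from the other reported figures."""
    n = stats["count"]
    return {
        "mean": stats["sum"] / n,
        "se": stats["sd"] / math.sqrt(n),
        "range": stats["max"] - stats["min"],
    }


def consistent(stats, tolerance=0.05):
    """True if every derived value matches the reported one within the tolerance."""
    derived = recompute(stats)
    return all(abs(derived[key] - stats[key]) <= tolerance for key in derived)


for name, stats in reported.items():
    derived = {key: round(value, 2) for key, value in recompute(stats).items()}
    print(name, derived, "consistent" if consistent(stats) else "check values")
```

A check of this kind is a quick way to catch transcription errors when statistics are copied from a spreadsheet into a report.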
Tenure in months
Figure 4
Analysis: Customers are categorized into short-term, medium-term, and long-term tenure groups based on their "Tenure in Months." The tenure groups are defined as follows: "0 to 20 months," "20 to 50 months," and "50 months and above." The percentages in Figure 4 appear to be cumulative shares of churned customers rather than churn rates within each group. Read that way, 64.86% of churned customers left within the first 20 months of their tenure, and the remaining 35.14% left later. By 50 months the cumulative share reaches 89.40%, leaving only 10.60% of churned customers with a tenure of more than 50 months. Hence, the analysis in Figure 4 suggests that most churn happens early in the customer relationship and that customers who stay longer (more than 50 months) are less likely to churn.

Age
Figure 5
Customers are categorized into four age groups: "0 to 20" (customers aged between 0 and 20 years), "20 to 40" (between 20 and 40 years), "40 to 60" (between 40 and 60 years), and "60 and more" (60 years and older). The percentages reported in Figure 5 are 2.52% for "0 to 20," 33.94% for "20 to 40," 66.31% for "40 to 60," and 100.00% for "60 and more." Because these values rise to 100.00% in the last group, they appear to be cumulative: 2.52% of churned customers are aged 20 or under, 33.94% are aged 40 or under, and 66.31% are aged 60 or under. Hence, the analysis in Figure 5 suggests that very few churned customers are under 20, while customers aged 60 and more account for roughly a third of churned customers.

Research questions:
1.
Question: How does the presence or absence of online security features affect the financial aspects of customer relationships, specifically monthly charges and total charges?
Variables: Online Security, Monthly Charges, Total Charges
Rationale: This question helps assess the financial effect of online security on customer charges. These insights guide strategic decisions for enhancing customer satisfaction and retention through effective promotion and integration of online security features.

2. Question: How do gender and contract type influence customer preferences, and which customers are more likely to continue using the services?
Variables: Gender, Contract
Rationale: Understanding how gender intersects with contract choices provides insights into customer preferences, guiding targeted service strategies for improved satisfaction and retention.

3. Question: How does the type of internet service influence whether customers continue to use the services or leave?
Variables: Internet Service and Customer Status
Rationale: Investigating the association between internet service types and customer status (churned or stayed) is essential for finding patterns that can inform strategic decisions. The findings aim to enhance service offerings, ensuring customer satisfaction and reducing churn rates.

Conclusion:
The Chi-squared analysis indicates that gender does not appear to be a significant factor in determining the type of contract customers choose.
The t-Test: One-Sample did not find a significant difference between the observed mean 'Monthly Charge' and the hypothesized mean of $80. However, the t-Test: Two-Sample indicated a highly significant difference in means between 'Monthly Charge' and 'Total Charges.'
The F-Test: Two-Sample for Variances indicates a statistically significant difference in variances between 'Monthly Charge' and 'Total Charges.'
This suggests that the variability in 'Monthly Charge' is considerably smaller than the variability in 'Total Charges.'
The analysis of relative risk and odds ratio provides evidence of a significant association between having internet service and a reduced likelihood of customer churn. Customers with internet service exhibit lower odds and risk of churning than those without internet service.
The ANOVA results indicate a significant difference in means among the different categories of internet types for 'Avg Monthly GB Download' and 'Total Revenue.'

Recommendation:
Based on the Chi-squared analysis, it would be prudent to explore other factors that could influence the choice of contract type among customers.
Further exploration of the factors contributing to the observed differences in 'Monthly Charge' and 'Total Charges' is recommended. This could involve investigating customer profiles or service usage patterns to better understand the sources of variation in charges, which in turn could inform service optimization or customer segmentation strategies.
The relative risk and odds ratio findings suggest that internet service may play a protective role in customer retention. Marketing efforts or service improvements related to internet services may be explored to enhance customer satisfaction and reduce churn.
Investigating the factors contributing to the observed differences in 'Avg Monthly GB Download' and 'Total Revenue' for different internet types could provide valuable insights for strategic decision-making.

Tracking Log
Project Name: Telco customer churn

Date: 1st November 2023 to 8th November 2023
Task 1: Hypothesis Test Analysis
Description: In this task, hypothesis tests were conducted to examine the statistical significance of relationships within our dataset.
Specifically, we focused on investigating the association between variables, such as gender and contract types, using appropriate statistical tests.
Data Source: The dataset was obtained from the Maven Analytics website, and the original data source is IBM Cognos Analytics.
Tools/Software Used: Excel was used to perform the hypothesis tests.
Steps Taken:
Identified and selected the variables for hypothesis testing, considering their relevance to our research objectives.
Formulated null and alternative hypotheses to test the significance of the identified relationships.
Chose the appropriate statistical test based on the nature of the variables and the research question.
Executed the chosen hypothesis test on the selected variables. This involved calculating relevant test statistics, such as chi-square values, p-values, and degrees of freedom.
Challenges/Issues Encountered: There were no significant challenges or issues encountered during this phase.

Date: 10th November 2023
Task 2: Update Research Questions
Description: In this task, I defined research questions to guide the analysis and exploration of the telco company's customer dataset.
Steps Taken:
Reviewed the dataset's content, variables, and findings from the hypothesis test analysis.
Brainstormed potential research questions that align with the project's objectives and the dataset's variables.
Evaluated the relevance and feasibility of each research question.
Refined the research questions to ensure they are clear, specific, and actionable.