Customer Churn Prediction in Telecom: A Research Study

Explanation of Question: Predictive Analysis of Customer Retention in the Telecom Industry

In the telecommunications industry, customer retention is a critical issue that affects companies' profitability and long-term success. Customer churn, the rate at which customers discontinue their service or switch to competitors, represents a substantial cost for telecom providers. Understanding and predicting why customers leave can help telecom companies develop strategies to keep them. This study analyzes the factors that contribute to customer churn and applies predictive models to determine how well churn can be predicted.

Research Questions

This study addresses four main research questions:

1. What are the main factors that predict customer churn?
   o This question aims to identify the key variables that influence whether a customer will leave the service. Candidate factors include customer demographics, service usage patterns, payment history, and contract types. By determining which factors have the strongest association with churn, telecom companies can act on them to improve retention.
2. What is the relationship between contract type and customer retention?
   o The type of contract (e.g., month-to-month, one-year, two-year) that a customer has with the telecom provider may affect their likelihood of staying with the service. For example, customers with longer contracts might be less likely to churn than those on a month-to-month basis. This question explores how different contract types influence customer behavior and retention.
3. How do service subscriptions affect customer churn?
   o Customers typically subscribe to various services, such as mobile plans, internet, or TV packages. The type and number of services a customer subscribes to could influence their loyalty.
This question examines whether customers with multiple services are less likely to churn than those with a single service, and whether certain services have a stronger association with churn.
4. Can we predict customer churn with sufficient accuracy?
   o This is the main goal of the study. It explores whether a predictive model can accurately forecast whether a customer will churn based on various features (demographics, service usage, etc.). Using statistical and machine learning models, the study tests how well churn can be predicted from the available data.

Importance of Customer Retention

Customer retention is central to a telecom provider's performance. When customers leave, telecom companies face:
• Revenue loss: Losing customers means losing a steady stream of revenue.
• Market share erosion: High churn rates can damage a company's reputation and reduce its market share.
• Increased acquisition costs: Replacing lost customers requires spending on marketing, promotions, and other acquisition efforts, which is generally more expensive than retaining existing customers.
• Reduced customer lifetime value: The longer a customer stays, the more profitable they become. By improving retention, telecom providers can maximize the lifetime value (LTV) of each customer.

Methods of Addressing the Issue

The study uses a multi-method approach to analyze the data and answer the research questions:

1. Statistical Analysis
• Chi-square tests: These tests examine relationships between categorical variables, such as customer demographics (e.g., gender, senior-citizen status) and churn. For example, we can test whether gender or contract type is significantly associated with churn.
• T-tests: These tests help analyze differences in numerical variables.
For instance, a t-test can determine whether average monthly charges differ significantly between customers who churn and those who do not.

2. Descriptive Analytics
• Pattern identification: This involves identifying trends or patterns in the data that may give insights into customer behavior. For example, are customers who use a particular payment method more likely to churn?
• Trend analysis: The study looks at churn trends across different segments (e.g., by senior-citizen status, contract type, or service subscription) to spot patterns that can inform retention strategies.

3. Predictive Modeling
• Logistic Regression: This statistical method predicts a binary outcome (whether a customer will churn or not). Logistic regression estimates the probability of a customer churning based on different variables (e.g., service subscription, charges, demographics).
• Model validation and performance assessment: Once the logistic regression model is built, it is validated and assessed to see how well it predicts churn. Techniques such as cross-validation, confusion matrices, and accuracy scoring can be used to evaluate model performance.

Dataset

The dataset used for this analysis is the Telco Customer Churn dataset from Kaggle, which contains customer-level data, including:
• Demographic details: Information about the customer, such as gender, senior-citizen status, and whether the customer has a partner or dependents.
• Service subscription patterns: Information on the services the customer has subscribed to (e.g., phone, internet, streaming TV).
• Contract details: The type of contract (month-to-month, one-year, two-year) and the customer's tenure in months.
• Billing and payment details: How the customer is billed and pays for the services (payment method, paperless billing, monthly charges, and total charges).
• Target variable: The target variable is Churn, a binary outcome (Yes/No) indicating whether the customer has left the service (churned) or continued using it.
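
As a rough illustration of how the chi-square test and t-test described in the methods apply to data shaped like this dataset, the sketch below runs both on a small synthetic sample. The sample is not the Kaggle data: the name toy, the sample size, and the churn probabilities per contract type are assumptions chosen only to produce a visible effect. The sketch also uses Welch's variant of the t-test (equal_var=False), which is a choice made here because the two groups can differ in size and spread.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the real data (assumption: for illustration only).
rng = np.random.default_rng(0)
n = 400
contract = rng.choice(["Month-to-month", "One year", "Two year"], size=n)

# Assumed churn probabilities per contract type (hypothetical values).
churn_prob = pd.Series(contract).map(
    {"Month-to-month": 0.45, "One year": 0.15, "Two year": 0.05}
).to_numpy()

toy = pd.DataFrame({
    "Contract": contract,
    "MonthlyCharges": rng.normal(65, 20, size=n).round(2),
    "Churn": (rng.random(n) < churn_prob).astype(int),
})

# Chi-square test of independence between Contract and Churn.
contingency = pd.crosstab(toy["Contract"], toy["Churn"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)

# Welch's t-test on MonthlyCharges: churned vs. retained customers.
churned = toy.loc[toy["Churn"] == 1, "MonthlyCharges"]
retained = toy.loc[toy["Churn"] == 0, "MonthlyCharges"]
t_stat, p_t = stats.ttest_ind(churned, retained, equal_var=False)

print(contingency)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_chi2:.4g}")
print(f"t={t_stat:.2f}, p={p_t:.4g}")
```

Because MonthlyCharges is generated independently of churn in this toy sample, only the contract test is expected to show a clear effect; on real data both results would need to be read from the output.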
Descriptive Analytics in the Study

In the descriptive analysis, selected relationships are visualized to better understand the data. Examples include:
• Contract Type vs. Churn Percentage: A plot of churn rate by contract type (e.g., month-to-month vs. one- or two-year contracts). This can show whether customers with shorter contracts are more likely to churn.
• Churn Rate vs. Service Subscription: A plot of churn rate across service subscriptions (e.g., only mobile plan vs. mobile + internet + TV). It might reveal that customers with multiple services are less likely to churn.

Conclusion

This study ultimately aims to build an effective predictive model to forecast customer churn and to understand the underlying factors that drive churn in the telecom industry. The insights gained can help telecom providers develop targeted retention strategies, improve customer satisfaction, and reduce the cost of customer acquisition, thereby enhancing profitability and customer loyalty.

Code Explanation:

This code is a basic setup for analyzing customer churn in the telecom industry and building a predictive model, using the Telco Customer Churn dataset. Each part of the code is explained step by step below.

1. Importing Libraries

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix

• pandas (pd): A library for data manipulation and analysis, particularly for structured data (like CSV files). It allows easy reading, cleaning, and transformation of data.
• matplotlib.pyplot (plt): A plotting library for creating static, animated, and interactive visualizations. It is often used for simple plots like line plots and bar charts.
• seaborn (sns): A data visualization library built on top of matplotlib. It provides higher-level functions for statistical graphics.
• scipy.stats: A sub-package of SciPy for statistical operations, including hypothesis tests and probability distributions.
• sklearn.model_selection.train_test_split: A function that splits the dataset into training and testing sets, so that the model is evaluated on data it was not trained on.
• sklearn.linear_model.LogisticRegression: A model for classification tasks. Logistic regression is used here to predict the likelihood of customer churn (binary outcome: churn or no churn).
• sklearn.metrics.classification_report, confusion_matrix: Functions that provide performance metrics for classification models. classification_report gives precision, recall, F1-score, etc., while confusion_matrix tabulates true vs. predicted classes.

2. Reading and Cleaning the Data

    df = pd.read_csv('./WA_Fn-UseC_-Telco-Customer-Churn.csv')

• This line reads the CSV file containing the dataset ('./WA_Fn-UseC_-Telco-Customer-Churn.csv') into a pandas DataFrame (df). A DataFrame is a 2D table where each column represents a feature (e.g., tenure, contract type) and each row represents a customer record.

    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

• df['TotalCharges']: Selects the TotalCharges column from the DataFrame, which contains the total charges billed to each customer.
• pd.to_numeric(..., errors='coerce'): Converts the TotalCharges column to numeric values. Any value that cannot be parsed as a number (e.g., a blank or otherwise erroneous entry) is converted to NaN (Not a Number). This ensures that the column has consistent numeric data for analysis.
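
To make the effect of errors='coerce' concrete, here is a minimal sketch on a few made-up values. The sample strings, including the blank entry, are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values as they might be read from a CSV: all strings,
# one of them blank and therefore not parseable as a number.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors='coerce' turns unparseable entries into NaN instead of raising.
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print("missing after coercion:", int(clean.isna().sum()))
```

Counting the NaN values right after the conversion is a quick way to see how many rows were affected before deciding how to handle them.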
    df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

• df['Churn']: Selects the Churn column, which indicates whether a customer has churned (left the service).
• map({'Yes': 1, 'No': 0}): Replaces the categorical values 'Yes' and 'No' with the numeric values 1 (churned) and 0 (did not churn). This is necessary because machine learning algorithms typically work with numeric data.

Summary of the Code Flow:
• The code begins by importing the libraries needed for data manipulation, visualization, statistical analysis, and machine learning.
• The dataset is loaded into a pandas DataFrame.
• The TotalCharges column is converted to numeric format, with any non-numeric values coerced to NaN.
• The Churn column is transformed from categorical values ('Yes' and 'No') into numeric values (1 for churn and 0 for no churn).

This is only the initial data preprocessing step. After this, you would typically explore and analyze the data (e.g., through visualization and correlation analysis) and then build a machine learning model to predict customer churn.

Next Steps (for full analysis):
1. Exploring the dataset: Use methods like df.head(), df.info(), and df.describe() to explore the dataset and check for missing values or outliers.
2. Feature selection/engineering: You might need to create new features or drop irrelevant ones, depending on the insights gathered from data exploration.
3. Data splitting: Using train_test_split, split the data into a training set (for building the model) and a test set (for evaluating the model's performance).
4. Model training: Train a logistic regression model on the training data.
5. Model evaluation: Evaluate the trained model on the test data using metrics like accuracy, precision, recall, and the confusion matrix.
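
The next steps listed above can be sketched end to end on synthetic data. Everything about the sample below is an assumption made for illustration: the name demo, the sample size, and the formula linking tenure and charges to churn are hypothetical and do not come from the Kaggle file. The stratify and max_iter arguments are also choices of this sketch rather than part of the code explained in this document.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned DataFrame (assumption: not the real data).
rng = np.random.default_rng(42)
n = 500
tenure = rng.integers(0, 73, size=n)
monthly = rng.uniform(20, 120, size=n).round(2)

# Assumed relationship: short tenure and high charges raise churn probability.
logit = 0.5 - 0.06 * tenure + 0.02 * (monthly - 65)
churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
demo = pd.DataFrame({"tenure": tenure, "MonthlyCharges": monthly, "Churn": churn})

# Step 1: explore the structure and check for missing values.
print(demo.head())
print(demo.describe())
missing = demo.isna().sum()

# Step 3: hold out 20% of the rows for testing, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    demo[["tenure", "MonthlyCharges"]], demo["Churn"],
    test_size=0.2, random_state=42, stratify=demo["Churn"],
)

# Step 4: fit the model; max_iter is raised to give the solver room to converge.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 5: evaluate on the held-out rows.
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(f"accuracy={acc:.3f}")
print(cm)
```

Step 2 (feature engineering) is skipped here because the synthetic sample has only two features; on the real data it would sit between exploration and splitting.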
This code defines and executes a function called analyze_churn_factors(), which analyzes the factors contributing to customer churn in the telecom dataset. It uses statistical tests to assess the relationship between different features (categorical and numerical) and churn. Let's break it down:

1. Defining the Function:

    def analyze_churn_factors():

• This line defines the function analyze_churn_factors(). When called, this function analyzes factors (both categorical and numerical) that may contribute to customer churn.

2. Defining Categorical and Numerical Features:

    categorical_features = ['Contract', 'InternetService', 'OnlineSecurity', 'TechSupport']
    numerical_features = ['tenure', 'MonthlyCharges']

• categorical_features: This list contains the names of categorical features in the dataset that are likely to have a relationship with churn. These features represent different aspects of a customer's service and are stored as text values (e.g., "Month-to-month" vs. "One year" for the Contract feature).
  o Contract: The type of contract a customer has (e.g., month-to-month, one year, two years).
  o InternetService: The type of internet service (e.g., fiber optic, DSL).
  o OnlineSecurity: Whether the customer has online security service (e.g., Yes/No).
  o TechSupport: Whether the customer has tech support service (e.g., Yes/No).
• numerical_features: This list contains the names of numerical features (continuous variables) that could also be related to churn. These features typically represent quantities or counts.
  o tenure: The number of months a customer has been with the service.
  o MonthlyCharges: The amount the customer is charged monthly for the service.

3. Chi-Square Test for Categorical Features:

    # Chi-square test for categorical variables
    print("\nCategorical Feature Analysis:")
    for feature in categorical_features:
        contingency = pd.crosstab(df[feature], df['Churn'])
        chi2, p_value = stats.chi2_contingency(contingency)[:2]
        print(f"\n{feature}:")
        print(f"Chi-square statistic: {chi2:.2f}")
        print(f"p-value: {p_value:.4f}")

• Chi-Square Test: A chi-square test of independence is used to determine whether there is a significant relationship between two categorical variables. It tests whether the distribution of values in one variable is independent of the values in the other. Here it checks whether churn is related to each categorical feature (e.g., contract type, internet service).
  o pd.crosstab(df[feature], df['Churn']): Creates a contingency table (cross-tabulation) counting how many customers in each category of the feature (e.g., 'Contract', 'InternetService') churned (1) or did not churn (0).
  o stats.chi2_contingency(contingency): Computes the chi-square statistic and p-value for the contingency table. The [:2] slice keeps only these first two return values.
  o chi2: The chi-square statistic, which measures how much the observed counts differ from the counts expected under the assumption of independence.
  o p_value: The p-value, which indicates whether the relationship between the variables is statistically significant. If the p-value is small (conventionally < 0.05), we reject the null hypothesis of independence and conclude that there is a significant relationship between the variables.

The results are printed, showing the chi-square statistic and p-value for each categorical feature.

4. Calculating and Displaying Churn Rates by Category:

        # Calculate and display churn rate by category
        churn_rate = df.groupby(feature)['Churn'].mean().sort_values(ascending=False)
        print("Churn rates:")
        print(churn_rate)

• This code calculates the churn rate for each category within the categorical feature. It uses the loop variable feature, so it runs inside the for loop shown in step 3.
  o df.groupby(feature)['Churn'].mean(): Groups the DataFrame by the unique values in the feature (e.g., contract type) and calculates the mean of the Churn column for each group. Since the Churn column is coded as 1 (churned) and 0 (not churned), the mean is the churn rate for each category.
  o sort_values(ascending=False): Sorts the churn rates in descending order, so the categories with the highest churn rates appear first.
  o The churn rates for each category are printed.

5. T-Test for Numerical Features:

    # T-test for numerical variables
    print("\nNumerical Feature Analysis:")
    for feature in numerical_features:
        t_stat, p_value = stats.ttest_ind(
            df[df['Churn'] == 1][feature],
            df[df['Churn'] == 0][feature]
        )
        print(f"\n{feature}:")
        print(f"T-statistic: {t_stat:.2f}")
        print(f"p-value: {p_value:.4f}")

• T-Test: A t-test compares the means of two groups (in this case, customers who churned vs. customers who did not churn). It tests whether the mean of a numerical feature differs significantly between the two groups.
  o stats.ttest_ind(): Performs an independent two-sample t-test. It compares the mean of the feature between those who churned (df[df['Churn'] == 1][feature]) and those who did not churn (df[df['Churn'] == 0][feature]).
  o t_stat: The t-statistic, which measures how far apart the sample means are relative to the variation within the samples. A larger absolute t-statistic indicates a greater difference between the two groups.
  o p_value: The p-value, which indicates whether the difference between the two groups is statistically significant. If the p-value is less than 0.05, we conclude that the means of the two groups differ significantly.

The results are printed, showing the t-statistic and p-value for each numerical feature.

6. Executing the Analysis:

    # Execute analysis
    print("=== Analysis of Churn Factors ===")
    analyze_churn_factors()

• This line calls the analyze_churn_factors() function to execute the analysis of churn factors.
• It prints a heading "=== Analysis of Churn Factors ===" before starting the analysis.

Summary of What This Code Does:
• Categorical Features Analysis (Chi-Square Test):
  o For each categorical feature (e.g., Contract, InternetService), the code calculates a chi-square statistic and a p-value to check whether the feature is significantly related to customer churn.
  o It also calculates and prints the churn rate for each category within the feature.
• Numerical Features Analysis (T-Test):
  o For each numerical feature (e.g., tenure, MonthlyCharges), the code compares the means between customers who churned and those who did not churn using a t-test.
  o It prints the t-statistic and p-value to assess whether the numerical feature differs significantly between the two groups.

This function helps identify whether certain features (both categorical and numerical) have a statistically significant relationship with customer churn, which is useful input for building a predictive model or designing retention strategies.

This code defines a function called analyze_contract_retention() that analyzes customer retention by contract type. The function calculates two key metrics for each contract type: average tenure (how long customers stay) and churn rate (the share of customers who leave). It also visualizes these metrics using bar plots. Let's break it down step by step.

1. Defining the Function:

    def analyze_contract_retention():

• This line defines a function named analyze_contract_retention(). Its purpose is to analyze how contract type influences customer retention by calculating the average tenure and churn rate for each contract type.

2. Calculate Average Tenure by Contract Type:

    avg_tenure = df.groupby('Contract')['tenure'].mean().sort_values(ascending=False)

• df.groupby('Contract'): Groups the data by the 'Contract' column, which contains the type of contract each customer has (e.g., month-to-month, one year, two years).
• ['tenure']: Selects the 'tenure' column, which represents the number of months each customer has been with the telecom provider.
• .mean(): Calculates the mean tenure (average duration) for each contract type. This shows how long, on average, customers on each contract type have stayed with the company.
• .sort_values(ascending=False): Sorts the contract types in descending order by average tenure, so the contract type with the longest average tenure appears first.

3. Calculate Churn Rate by Contract Type:

    churn_by_contract = df.groupby('Contract')['Churn'].mean().sort_values(ascending=False)

• df.groupby('Contract'): Again groups the data by contract type.
• ['Churn']: Selects the 'Churn' column, which indicates whether a customer has churned (1 for churned, 0 for not churned).
• .mean(): Calculates the mean of Churn for each contract type. Since the 'Churn' column is coded as 0 (not churned) and 1 (churned), the mean is the proportion of customers who churned within each contract type (a value between 0 and 1).
• .sort_values(ascending=False): Sorts the contract types by churn rate in descending order, so the contract types with the highest churn rate appear first.

4. Visualization (Bar Plots):

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

• plt.subplots(1, 2, figsize=(12, 5)): Creates a figure (fig) with two subplots (ax1 and ax2) arranged horizontally (1 row, 2 columns), with a figure size of 12 inches by 5 inches. The axes ax1 and ax2 are the individual subplots where the plots will be drawn.
Plot 1: Average Tenure by Contract Type

    avg_tenure.plot(kind='bar', ax=ax1)
    ax1.set_title('Average Tenure by Contract Type')
    ax1.set_ylabel('Average Tenure (months)')

• avg_tenure.plot(kind='bar', ax=ax1): Plots a bar chart of the average tenure for each contract type on the first subplot (ax1).
  o The kind='bar' argument specifies that a bar chart should be used.
  o The ax=ax1 argument places this plot on the first subplot (ax1).
• ax1.set_title('Average Tenure by Contract Type'): Sets the title of the first subplot.
• ax1.set_ylabel('Average Tenure (months)'): Sets the y-axis label for the first subplot to indicate the units of measurement.

Plot 2: Churn Rate by Contract Type

    (churn_by_contract * 100).plot(kind='bar', ax=ax2)
    ax2.set_title('Churn Rate by Contract Type')
    ax2.set_ylabel('Churn Rate (%)')

• (churn_by_contract * 100).plot(kind='bar', ax=ax2): Plots a bar chart of the churn rate for each contract type on the second subplot (ax2).
  o The churn rate is multiplied by 100 to express it as a percentage (it is initially a proportion between 0 and 1).
  o The kind='bar' argument specifies that the plot should be a bar chart.
  o The ax=ax2 argument places this plot on the second subplot.
• ax2.set_title('Churn Rate by Contract Type'): Sets the title of the second subplot.
• ax2.set_ylabel('Churn Rate (%)'): Sets the y-axis label for the second subplot to indicate that the values are percentages.

5. Finalize the Layout and Show the Plot:

    plt.tight_layout()
    plt.show()

• plt.tight_layout(): Adjusts the spacing between the subplots so that the labels and titles do not overlap.
• plt.show(): Displays the plots on the screen.

6. Return the Results:

    return avg_tenure, churn_by_contract

• The function returns two variables:
  o avg_tenure: The average tenure by contract type.
  o churn_by_contract: The churn rate by contract type.

7. Executing the Analysis:

    # Execute analysis
    print("\n=== Contract Type and Retention Analysis ===")
    avg_tenure, churn_by_contract = analyze_contract_retention()
    print("\nAverage Tenure by Contract Type:")
    print(avg_tenure)
    print("\nChurn Rate by Contract Type:")
    print(churn_by_contract)

• analyze_contract_retention() is called to execute the analysis.
• The results (average tenure and churn rate) are stored in the variables avg_tenure and churn_by_contract.
• The results are printed to the console:
  o Average Tenure by Contract Type: The average number of months customers stay, broken down by contract type.
  o Churn Rate by Contract Type: The proportion of customers who have churned, broken down by contract type.

Summary of What This Code Does:
• Analysis: The function analyzes how contract type affects customer retention by calculating:
  o The average tenure (how long customers stay with the telecom provider) for each contract type.
  o The churn rate (the share of customers who leave) for each contract type.
• Visualization: It visualizes both of these metrics in side-by-side bar charts:
  o One chart shows the average tenure by contract type.
  o The other shows the churn rate by contract type.
• Return Values: It returns the computed average tenure and churn rate so that they can be further examined or used in other analyses.

This analysis is useful for understanding how the type of contract offered by the telecom company relates to customer retention. For example, customers with longer contracts (e.g., one-year or two-year contracts) may have a lower churn rate than those on month-to-month plans, which would indicate that contract type plays a significant role in customer loyalty and retention.
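
The two groupby calculations above can also be combined into a single summary table. The sketch below is one possible variation, not the function described in this document: the name summarize_by_contract, the frame parameter, and the toy rows are all hypothetical. It takes the DataFrame as an argument instead of reading a global df, which makes it easy to check on a small hand-built sample.

```python
import pandas as pd


def summarize_by_contract(frame: pd.DataFrame) -> pd.DataFrame:
    """Return average tenure, churn rate (%), and customer count per contract type."""
    summary = frame.groupby("Contract").agg(
        avg_tenure=("tenure", "mean"),
        churn_rate_pct=("Churn", lambda s: 100 * s.mean()),
        customers=("Churn", "size"),
    )
    # Highest churn first, mirroring the descending sort used in the walkthrough.
    return summary.sort_values("churn_rate_pct", ascending=False)


# Hand-built toy rows (assumption: illustrative values, not real customers).
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "tenure": [1, 5, 12, 2, 24, 30, 60, 70],
    "Churn": [1, 1, 0, 1, 0, 1, 0, 0],
})

summary = summarize_by_contract(toy)
print(summary)
```

Including the customer count next to each rate helps when reading the result, since a churn rate computed from a handful of customers is much less reliable than one computed from thousands.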
This code defines and executes a function called analyze_service_impact(), which analyzes how different service subscriptions relate to customer churn in the telecom dataset. It calculates the churn rate for each service (such as phone service, internet service, and tech support), visualizes the results in a horizontal bar chart, and returns a dictionary containing the churn rates for each service.

1. Defining the Function:

    def analyze_service_impact():

• This line defines a function named analyze_service_impact(). Its purpose is to analyze how the different services offered by the telecom company (e.g., phone service, internet service, tech support) relate to customer churn.

2. List of Services:

    services = ['PhoneService', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

• services: This list contains the names of the service features that the telecom company offers. Each is a categorical column indicating whether a customer subscribes to that service. Most take the values 'Yes' or 'No' (the internet add-ons also have a 'No internet service' value), while InternetService records the connection type (DSL, Fiber optic, or No). These services may be related to whether a customer churns (leaves) or stays with the telecom company.

3. Calculating Churn Rate for Each Service:

    service_impact = {}
    for service in services:
        churn_rate = df.groupby(service)['Churn'].mean()
        service_impact[service] = churn_rate

• service_impact: An empty dictionary to store the churn rates for each service.
• for service in services: This loop iterates over each service in the services list.
  o df.groupby(service)['Churn'].mean(): For each service, this line groups the data by the service column (e.g., PhoneService, InternetService) and calculates the mean of Churn for each value of that column, i.e., for customers who have and have not subscribed to the service. The Churn column is binary (1 for churned, 0 for not churned), so the mean is the proportion of customers who churned within each group.
  o service_impact[service] = churn_rate: Stores the churn rates for the current service in the service_impact dictionary, where the key is the service name and the value is the Series of churn rates per group.

4. Creating a DataFrame for Plotting:

    # Create a summary DataFrame for plotting
    impact_df = pd.DataFrame({
        service: rates['Yes'] if 'Yes' in rates else rates.iloc[-1]
        for service, rates in service_impact.items()
    }, index=['Churn Rate']).T

• pd.DataFrame(): A new DataFrame (impact_df) is created to summarize one churn rate per service.
  o {service: rates['Yes'] if 'Yes' in rates else rates.iloc[-1] ...}: The dictionary comprehension creates key-value pairs where:
    § service is the service name.
    § rates['Yes']: If the service has a 'Yes' group ('Yes' in rates checks the Series index), the churn rate for customers who subscribed to the service is selected.
    § rates.iloc[-1]: If there is no 'Yes' group, the last value of the Series is taken instead. groupby sorts group labels, so for InternetService (DSL, Fiber optic, No) the last value corresponds to the 'No' group.
    The result is a dictionary with one churn rate per service. For most services this is the rate among subscribers; for InternetService the fallback picks the rate among customers without internet service, so that bar should be read with care.
  o index=['Churn Rate']: The new DataFrame has a single row, and its index is labeled 'Churn Rate'.
  o .T: Transposes the DataFrame so that each service appears as a row, with the churn rate as the single value for each service.

5. Visualization:

    # Visualization
    plt.figure(figsize=(12, 6))
    impact_df.sort_values('Churn Rate', ascending=True).plot(kind='barh')
    plt.title('Churn Rate by Service Subscription')
    plt.xlabel('Churn Rate')
    plt.tight_layout()
    plt.show()

• plt.figure(figsize=(12, 6)): Creates a figure with a specified size of 12 inches by 6 inches. Note that DataFrame.plot() opens its own figure unless an ax argument is passed, so this size may not apply to the chart; passing figsize=(12, 6) to .plot() directly is the more reliable option.
• impact_df.sort_values('Churn Rate', ascending=True).plot(kind='barh'):
  o impact_df.sort_values('Churn Rate', ascending=True): Sorts the impact_df DataFrame by churn rate in ascending order. Because a horizontal bar chart draws the first row at the bottom, the service with the lowest churn rate appears at the bottom of the chart and the one with the highest churn rate at the top.
  o .plot(kind='barh'): Creates a horizontal bar chart ('barh') with the service names on the y-axis and the churn rates on the x-axis.
• plt.title('Churn Rate by Service Subscription'): Sets the title of the plot.
• plt.xlabel('Churn Rate'): Labels the x-axis as "Churn Rate".
• plt.tight_layout(): Adjusts the layout so that the labels and titles fit without overlapping.
• plt.show(): Displays the plot on the screen.

6. Returning the Service Impact Data:

    return service_impact

• The function returns the service_impact dictionary, which contains the churn rates for each service.

7. Executing the Analysis:

    # Execute Analysis
    print("\n=== Service Impact Analysis ===")
    service_impact = analyze_service_impact()

• print("\n=== Service Impact Analysis ==="): Prints a heading to indicate the start of the service impact analysis.
• The function analyze_service_impact() is called, and the resulting churn rates for each service are stored in the variable service_impact.

Summary of What This Code Does:
1. Churn Rate Calculation: The function calculates the churn rate for each of the services in the list (e.g., phone service, internet service, tech support) by grouping the data by the service column and calculating the mean of Churn (the proportion of customers who churned) in each group.
2. Visualization: It visualizes the churn rates for all services using a horizontal bar chart, with the services sorted by churn rate. This chart makes it easy to see which services are associated with higher or lower churn rates.
3. Return Values: The function returns a dictionary (service_impact) where the keys are service names and the values are their respective churn rates per group.

Why This Analysis Is Useful:

This analysis helps identify which services are associated with higher customer churn. For example, if customers who subscribe to tech support or streaming services have a higher churn rate, the telecom company might focus on improving these services or offer targeted retention strategies to reduce churn. These are associations rather than proof that a service causes churn.

This code defines and executes a function build_churn_prediction_model(), which constructs a logistic regression model to predict customer churn based on three features (tenure, MonthlyCharges, and TotalCharges). The function also evaluates the model's performance, prints a classification report, and calculates feature importance to identify which features most influence the model's predictions. Let's break down the code step by step:

1. Defining the Function:

    def build_churn_prediction_model():

• This line defines a function named build_churn_prediction_model(), which is responsible for building a churn prediction model using logistic regression.

2. Preparing Features:

    features = ['tenure', 'MonthlyCharges', 'TotalCharges']
    X = df[features].fillna(0)
    y = df['Churn']

• features: This list contains the names of the columns that will be used as independent variables (input features) for the model:
  o tenure: The number of months the customer has been with the company.
  o MonthlyCharges: The monthly charges for the customer.
  o TotalCharges: The total charges billed to the customer over the course of their relationship with the telecom company.
• X = df[features].fillna(0):
  o df[features] selects the relevant columns from the dataset (tenure, MonthlyCharges, TotalCharges).
  o fillna(0) replaces any missing (NaN) values in these columns with 0, such as those produced when TotalCharges was coerced to numeric. This ensures that missing values do not cause errors when training the model.
• y = df['Churn']:
  o y is the target variable representing customer churn. The Churn column contains binary values: 1 for customers who have churned (left the company), and 0 for those who have not.

3. Splitting the Data:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
• train_test_split(): This function from the sklearn.model_selection module splits the data into training and test sets.
  o X: The features (independent variables).
  o y: The target variable (churn).
  o test_size=0.2: 20% of the data is reserved for testing, and 80% is used for training the model.
  o random_state=42: A random seed is set for reproducibility, ensuring the split is the same every time the code is run.
• X_train, X_test, y_train, y_test: The function returns four datasets:
  o X_train: Features for training.
  o X_test: Features for testing.
  o y_train: Target variable for training.
  o y_test: Target variable for testing.

4. Training the Logistic Regression Model:
    model = LogisticRegression()
    model.fit(X_train, y_train)
• model = LogisticRegression(): Creates an instance of a logistic regression model. Logistic regression is a commonly used machine learning algorithm for binary classification tasks, such as predicting whether a customer will churn or not.
• model.fit(X_train, y_train): Trains the logistic regression model on the training data (X_train for features and y_train for the target). The model learns the relationship between the features and the target variable.

5. Evaluating the Model:
    y_pred = model.predict(X_test)
• model.predict(X_test): The trained model is used to predict the churn outcome (y_pred) for the test set (X_test). This produces a predicted value of 0 or 1 for each customer in the test set.
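Steps 2 through 5 up to this point can be tried end to end on a small synthetic table. The sketch below is illustrative only: the DataFrame is randomly generated stand-in data (not the study's dataset), and the stratify=y and max_iter=1000 arguments are additions that the code described above does not use.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the study's DataFrame (assumption: same column names).
rng = np.random.default_rng(42)
n = 500
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n)
df = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly,
})
# Blank out a few TotalCharges values so that fillna(0) has something to do.
df.loc[df.index[:5], "TotalCharges"] = np.nan
# Assumed relationship: short tenure and high charges make churn more likely.
logit = -0.05 * tenure + 0.02 * monthly - 0.5
df["Churn"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Step 2: prepare features and target.
features = ["tenure", "MonthlyCharges", "TotalCharges"]
X = df[features].fillna(0)
y = df["Churn"]

# Step 3: 80/20 split. stratify=y (an addition) keeps the churn rate
# similar in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 4: fit the model. max_iter is raised (an addition) because
# unscaled features can slow the solver's convergence.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: predict 0/1 churn labels for the held-out customers.
y_pred = model.predict(X_test)
```

Stratifying the split is a common choice when the churned class is the minority, since a purely random split can leave the test set with too few churners to evaluate recall reliably.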
    print("\nModel Performance:")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
• classification_report(y_test, y_pred): This function from the sklearn.metrics module generates a detailed classification report, comparing the actual churn values (y_test) with the predicted churn values (y_pred). The classification report includes key metrics such as:
  o Precision: The proportion of positive predictions (churn = 1) that were correct.
  o Recall: The proportion of actual positive cases (actual churn = 1) that were correctly identified by the model.
  o F1-score: The harmonic mean of precision and recall, providing a balance between the two.
  o Accuracy: The proportion of correct predictions (both churned and non-churned) in the overall test set.
These metrics help assess how well the model performs on the test data.

6. Feature Importance:
    feature_importance = pd.DataFrame({
        'Feature': features,
        'Importance': abs(model.coef_[0])
    }).sort_values('Importance', ascending=False)
• model.coef_[0]: This accesses the coefficients of the logistic regression model. Each feature in the model (tenure, MonthlyCharges, TotalCharges) has an associated coefficient, which represents the strength and direction of the relationship between the feature and the target variable (churn).
  o A positive coefficient indicates that as the feature value increases, the likelihood of churn increases.
  o A negative coefficient indicates that as the feature value increases, the likelihood of churn decreases.
  o abs() takes the absolute value of these coefficients because the ranking uses only their magnitude, regardless of whether the feature is positively or negatively associated with churn.
• pd.DataFrame(): A new DataFrame is created with two columns:
  o Feature: The names of the features used in the model (tenure, MonthlyCharges, TotalCharges).
  o Importance: The absolute values of the model coefficients, used here as a measure of each feature's importance. Because the features are not standardized, these magnitudes also reflect each feature's units, so the ranking is only a rough guide.
• sort_values('Importance', ascending=False): Sorts the DataFrame by importance in descending order, so the most important features come first.

7. Returning the Feature Importance:
    return feature_importance
• The function returns the feature_importance DataFrame, which shows how important each feature is in predicting customer churn.

8. Executing the Predictive Model:
    # Execute predictive model
    print("\n=== Churn Prediction Model ===")
    feature_importance = build_churn_prediction_model()
    print("\nFeature Importance:")
    print(feature_importance)
• print("\n=== Churn Prediction Model ==="): Prints a heading to indicate the start of the churn prediction model output.
• The function build_churn_prediction_model() is executed, and the resulting feature importance is stored in the variable feature_importance.
• print("\nFeature Importance:"): Prints a heading for the feature importance table.
• print(feature_importance): Prints the sorted list of features and their importance in predicting churn.

Summary of What This Code Does:
• Builds a Predictive Model: The function builds a logistic regression model to predict customer churn based on tenure, MonthlyCharges, and TotalCharges.
• Evaluates Performance: It evaluates the model's performance using the classification report, which provides metrics like precision, recall, F1-score, and accuracy.
• Feature Importance: The function calculates the importance of each feature in predicting churn, based on the coefficients learned by the logistic regression model.
• Returns Feature Importance: It returns a DataFrame showing how important each feature is for the model, which helps identify the most influential factors driving churn.

Why This Analysis Is Useful:
This predictive model helps identify which customers are more likely to churn, allowing the telecom company to take proactive actions to retain those customers.
By examining the feature importance, the company can understand the factors (such as tenure, MonthlyCharges, and TotalCharges) that most influence churn, and tailor retention strategies accordingly.
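
One caveat on the feature-importance step: logistic regression coefficients depend on each feature's units, so ranking the raw coefficient of tenure (months) against that of TotalCharges (currency) can mislead. A minimal sketch of one common remedy, standardizing the features before fitting, follows. It is an alternative to the approach described above, not the study's method; the data are synthetic, and the helper name coefficient_importance is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def coefficient_importance(X, y):
    """Rank features by |coefficient| after standardizing them.

    Hypothetical helper for illustration; not part of the original code.
    """
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipeline.fit(X, y)
    # make_pipeline names each step after its lowercased class name.
    coefs = pipeline.named_steps["logisticregression"].coef_[0]
    table = pd.DataFrame({
        "Feature": list(X.columns),
        "Coefficient": coefs,          # sign shows the direction of the effect
        "Importance": np.abs(coefs),   # magnitude on a common (z-score) scale
    })
    return table.sort_values("Importance", ascending=False).reset_index(drop=True)


# Synthetic stand-in data (assumption: same column names as the study).
rng = np.random.default_rng(0)
n = 1000
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n)
X = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly,
})
logit = -0.05 * tenure + 0.02 * monthly - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

importance = coefficient_importance(X, y)
print(importance)
```

Keeping the signed Coefficient column alongside Importance preserves the direction of each effect, which the absolute value alone discards. Note that TotalCharges is closely related to tenure and MonthlyCharges, so the individual coefficients of these three features should be interpreted with care even after scaling.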
