Alberta Ingenuity & CMASTE Lesson 4: Customer Attrition Analysis (Teachers’ Resource) Purpose: Analysis of customer attrition (how long customers stay with a company and why they leave) is very important to subscription-based businesses such as cable TV, insurance, credit cards, magazine subscription, and more recently, cell phone programs and residential utilities. For a large company millions of database records of customers must be analyzed. This is where the Machine Learning strategy of data mining takes over. Data mining uses computer-based queries to find patterns and relevant information in a large data set. Problem: Computer programs use a set of algorithms to analyze the customer data for hazard probability (the chance that subscriber who has survived a certain amount of time is going to stop, cancel, or expire before the next unit of time) and also for survival probability, the chance that a random customer will still be with the company after a specific amount of time. We want to see if there is a mathematical pattern in this analysis that we can use to make inferences and projections about customer attrition. Hypothesis: The data mining technique will provide rapid feedback about customer behavior and a way to quantify customer loyalty. Design: The assumption is that time is discrete, in units of days, months, etc., whereas traditional statistical analysis treats time as continuous. The hazards in a typical subscription business are shown in Figure 1 as: 1) customers who don’t start 2) customers who start but never pay, which occurs at about 60 days and 3) customers who stop when the promotion ends, which occurs at about 90 days. (Figure 1) We can define hazard probability as P number who succumbed to the risk . population at risk during that time See Appendix for the data mining program used to calculate these probabilities. 116100676 Centre for Machine Learning 1/8 Alberta Ingenuity & CMASTE The gradual decline in hazards over time implies that the longer a customer stays with a company, the less likely they are to leave. Procedure: 1) From Figure 1, estimate these hazard probabilities as percentages: a) probability of a customer not starting the subscription ________ Answer: P ~ 4% b) probability of a customer being lost due to non-payment _________ Answer: P ~ 10% c) probability of a customer being lost at the end of the promotion _________ Answer: P ~ 6% This leads us to the concept of survivor probability or retention rate, which is a more holistic picture of the likelihood that a random customer will stay with a company to a certain point in time. Rather than using a probability formula here, we will use graph and data analysis to help quantify survivor probability. Figure 2 shows three examples of survival curves for credit card users for a particular company: the top curve is for customers who start as card holders, paying customers who are charged automatically every month, while the bottom curve is for customers who did not have a credit card previously and who are billed monthly and pay by cheque. The middle curve is the “average” of the other two curves. (Figure 2) We notice that the curves have the basic shape of an exponential decay curve and we can study them from a Math 30 P perspective with equation type y a(b) x or from a Math 31 point of view with equation type y c e x . 116100676 Centre for Machine Learning 2/8 Alberta Ingenuity & CMASTE 2) a) Using the tables of data read from the graph of the upper survivor (retention) curve, generate a regression equation of the form y a(b) x . ________________ t (months) Survivor Prob. (%) 0 1 2 3 4 5 6 7 8 9 10 11 12 97 92 88 77 68 62 57 54 50 48 45 43 42 Answer: y 95.34(0.9276) x b) Use your equation from part a) to calculate: i) the retention probability after 18 months ii) the number of months or days until the retention probability was 40% (graphically) P 95.34(0.9276)18 Answer: 24.65% Answer: Y1 95.34(0.9276) x Y2 40 Window x[0, 20, 2] y[0, 100, 10] Intersecti on occurs when x 11.557 months c) (* for Math 31 students) use an algebraic process to rewrite your regression equation in the form y c e x Answer: y 95.34(0.9276) x y 95.34(e z ) (0.9276) x e z ln( 0.9276) x ln( e z ) x ln( 0.9276) z ln( e) z x ln( 0.9276) z 0.075 x y 95.34(e 0.075x ) 116100676 Centre for Machine Learning 3/8 Alberta Ingenuity & CMASTE dy for your equation in part c) dx and specifically the value of the derivative at t = 5 months. Explain the meaning of that number with respect to the data. Answer: y 95.34e 0.075x 0.075 d) (* for Math 31 students) Find the derivative 7.1505e 0.075x y (5) 7.1505e 0.075(5) 4.91, which means that at 5 months we are losing about 4.91% of customers per month. e) Using the tables of data read from the graph of the lower survivor curve, generate a regression equation of the form y a(b) x . ___________________ t (months) Survivor Prob. (%) 0 1 2 3 4 5 6 7 8 9 10 11 12 97 74 59 30 21 19 17 15 14 13 12 11 10 Answer: y 65.97(0.8338) x f) Use your equation from part e) to calculate: i) the retention probability after 18 months ii) the number of months or days until the retention probability was 50% (graphically) 18 P 65.97(0.8338) Answer: Answer: P 2.5% Y1 65.97(0.8338) x Y2 50 Window x[0, 12, 1] y[0, 100, 10] Intersecti on occurs when x 1.525 months 116100676 Centre for Machine Learning 4/8 Alberta Ingenuity & CMASTE g) (* for Math 31 students) use an algebraic process to rewrite your regression equation in the form y c e x y 65.97(0.8338) x y 65.97(e) z 0.8338 x e z Answer: ln( 0.8338) x ln( e) z x ln( 0.8338) z ln( e) z x ln( 0.8338) z 0.18 x y 65.97(e 0.18 x ) dy for your equation in dx part g) and specifically the value of the derivative at t = 5 months. Explain the meaning of that number with respect to the data. h) (* for Math 31 students) Find the derivative Answer: y 65.97e .18 x .18 11.8746e .18 x y (5) 11.8746e .18(5) 4.83 , which means that at 5 months we are losing about 4.83% of customers per month. 2) Did you notice the steep drop off in the lower graph at about 60 days; what factor(s) do you think might account for that? Answer: This drop off would correspond to non-payment of credit card bills for those customers new to credit cards and having to pay by cheque. 3) Why do you think the upper graph does not have the same drop-off at that time? Answer: These customers had a credit card previously, so they are used to monthly payments. As well, their payment is done by direct debit, which is more convenient. 116100676 Centre for Machine Learning 5/8 Alberta Ingenuity & CMASTE We can quantify the difference between the two groups in another way, using a common industry measure called customer half-life or median customer lifetime. This is the time when exactly half or 50% of the original customers would still be subscribers. (Math 30 P students are familiar with this term from work with radioactive decay problems.) Figure 3 shows this for each of our two groups. (Figure 3) 4) From Figure 3 identify the customer half-life in days, for each group. Answer: The half-life for the “credit card group” appears to be about 245 days, and the half-life for the “non-credit card group” appears to be about 65 days. Evaluation: The data mining techniques provide very useful information that can be analyzed mathematically, leading to inferences about subscription customers and their retention rates that would aid in decision-making for the company. Synthesis: Data mining is very versatile and can be adapted for a wide variety of different industries. 116100676 Centre for Machine Learning 6/8 Alberta Ingenuity & CMASTE Appendix: Data Mining program to calculate hazard probabilities: Calculating Hazards in a Database Let's take a closer look at how survival data mining works with a database-in this case, running with Oracle. Assume that a database contains one row for each customer with the following information: Start_date Stop_date (NULL is not stopped) Other interesting variables such as stop reason, channel, and so on. How is this data used to calculate hazards? Oracle extensions make the full calculation possible. The first thing is to calculate the time with company (tenure) and the stop flag: SELECT ((case when stop_date is NULL then < today > else stop_date end) - start_date) as tenure, (case when stop_date is NULL then 0 else 1 end) as is_stopped FROM customers The next step is to aggregate these fields by time with the company or tenure. This gives the number of customers with exactly each amount of time and the number that stopped at a certain time (some customers with that tenure will still be active): SELECT ((case when stop_date is NULL then < today > else stop_date end) - start_date) as tenure, count(*) as pop_at_t, sum(case when stop_date is NULL then 0 else 1 end) as num_stopped FROM customers GROUP BY ((case when stop_date is NULL then < today > else stop_date end) - start_date) 116100676 Centre for Machine Learning 7/8 Alberta Ingenuity & CMASTE At this point, you could continue the calculation in a spreadsheet. However, the analytic functions make it possible to calculate the total population at risk, and thus the hazard. The total population is the sum of pop_at_t for all times greater than or equal to t. The hazard is num_stopped divided by this total. The following query does this calculation SELECT tenure, sum(pop_at_t) over (order by tenure desc range unbounded preceding), num_stopped / (sum(pop_at_t) over (order by tenure desc range unbounded preceding)) FROM < subquery > GROUP BY tenure ORDER BY tenure *I felt that it was worthwhile for teachers and/or students to see the type of program used in a machine learning setting. It is not expected that the program be used or even understood by teachers and/or students. Sources 1) http://www.intelligententerprise.com, Data Mining and Hazard Survival, Linoff, Gordon S., 2004 2) http://www.cs.ualberta.ca/~greiner/C-466/SLIDES/syllabus.html 116100676 Centre for Machine Learning 8/8