Basics of Statistical Analysis

Basics of Analysis
• The process of data analysis:
  Observation → (Encode) → Data → (Analysis) → Information

Example 1:
– Gift catalog marketer
– Mails 4 times a year to its customers
– Company has 1 million customers on its file

Example 1
• The cataloger would like to know whether new customers buy more than old customers.
• Classify new customers as anyone who bought within the last twelve months.
• The analyst takes a sample of 100,000 customers and notices the following.

Example 1
• 5,000 orders received in the last month
• 3,000 (60%) were from new customers
• 2,000 (40%) were from old customers
• So it looks like the new customers are doing better

Example 1
• Is there a catch here?
• Data at this gross level does not discriminate between customers within either group.
  – A customer who bought within the last 11 days is treated exactly the same as a customer who bought within the last 11 months.

Example 1
• Can we use some other variable to distinguish between old and new customers?
• Answer: actual dollars spent!
• What can we do with this variable?
  – Find its mean and variation.
• We might find that the average purchase amount for old customers is two or three times larger than the average among new customers.

Numerical Summaries of Data
• The two basic concepts are the center and the spread of the data
• Center of data
  – Mean, which is given by \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)
  – Median
  – Mode

Numerical Summaries of Data
• Forms of variation
  – Difference about the mean: \( x_i - \bar{x} \)
  – Absolute difference: \( |x_i - \bar{x}| \)
  – Total sum of squares: \( \sum_{i=1}^{n} (x_i - \bar{x})^2 \)
  – Variance: \( \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \)
  – Standard deviation: \( \sqrt{\text{Variance}} \)

Confidence Intervals
• In the catalog example, the analyst wants to know the average purchase amount of customers
• He draws two samples of 75 customers each and finds the means to be $68 and $122
• Since the difference is large, he draws another 38 samples of 75 each
• The mean of the means of the 40 samples turns out to be $94.85
• How confident should he be of this mean of means?

Confidence Intervals
• The analyst calculates the standard deviation of the sample means, called the Standard Error (SE)
• Basic premise for confidence intervals
  – 95 percent of the time, the true mean purchase amount lies within plus or minus 1.96 standard errors of the mean of the sample means.
• C.I. = Mean ± 1.96 × Standard Error

Confidence Intervals
• However, if the C.I. is calculated with only one sample, then
  \( \text{Standard error of sample mean} = \frac{\text{Standard deviation of sample}}{\sqrt{n}} \)
• Basic premise for confidence intervals with one sample
  – 95 percent of the time, the true mean lies within plus or minus 1.96 standard errors of the sample mean.

C.I. for Response Rates
• The standard error for response rates is
  \( \text{S.E.} = \sqrt{\frac{p(1 - p)}{n}} \)
  where p = sample response rate and n = sample size

Example 2:
• Test 1,000 names selected at random from a new list.
• To break even, the list must be expected to have a response rate of 4.5 percent.
• Confidence interval = Expected response ± 1.96 × SE = p ± 1.96 × SE
• In our case, C.I. = 3.22% to 5.78%.
• Thus, any response rate between 3.22% and 5.78% supports the hypothesis that the true response rate is 4.5%.

Example 2:
• The list is mailed and actually pulls in 3.5%.
• Thus, the true response rate may still be 4.5%, since 3.5% lies within the confidence interval.
• What if the actual rate pulled in were 5%?
• Regression towards the mean: the phenomenon of a test result being different from the true result
• Give more thought to lists whose cutoff rates lie within the confidence interval

Determining List Size
• Let us assume that we expect the true p = 0.035
• We want to be 95% certain that our test mailing will tell us whether the true response is between 3.3% and 3.7%
• We are saying that Precision = 1.96 × SE = 0.002 (or 0.2%)
• Hence,
  – SE = 0.002 / 1.96 = 0.001020
  – \( 0.001020 = \sqrt{\frac{p(1 - p)}{n}} \)
  – \( 0.001020 = \sqrt{0.033775 / n} \)
  – 0.00000104 = 0.033775 / n
  – n = 32,437
• In general,
  – \( n = \frac{p(1 - p) \times 1.96^2}{\text{Precision}^2} \)
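
The response-rate interval from Example 2 and the list-size formula can be checked with a few lines of Python. This is only a minimal sketch: the function names (`response_rate_se`, `response_rate_ci`, `required_list_size`) and the constant `Z_95` are illustrative assumptions, not part of the original slides, and the code simply restates the formulas S.E. = sqrt(p(1 - p)/n) and n = p(1 - p) × 1.96² / Precision².

```python
import math
from typing import Tuple

Z_95 = 1.96  # z-value the slides use for a 95% confidence level


def response_rate_se(p: float, n: int) -> float:
    """Standard error of a response rate: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1.0 - p) / n)


def response_rate_ci(p: float, n: int, z: float = Z_95) -> Tuple[float, float]:
    """Confidence interval p +/- z * SE for a response rate."""
    margin = z * response_rate_se(p, n)
    return p - margin, p + margin


def required_list_size(p: float, precision: float, z: float = Z_95) -> float:
    """Unrounded sample size n = p * (1 - p) * z**2 / precision**2."""
    return p * (1.0 - p) * z ** 2 / precision ** 2


if __name__ == "__main__":
    # Example 2: 1,000 test names, break-even response rate of 4.5%
    low, high = response_rate_ci(p=0.045, n=1000)
    print(f"Example 2 interval: {low:.2%} to {high:.2%}")

    # Determining list size: expected p = 3.5%, precision of +/- 0.2%
    n = required_list_size(p=0.035, precision=0.002)
    print(f"Required list size (unrounded): {n:,.1f}")
```

The unrounded list size comes out just above 32,437, which the slides report as n = 32,437; rounding up to the next whole name would be the more conservative choice when planning an actual test mailing.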