Count Data

1. Estimating & testing proportions

Of ten customers, 2 purchase a product. We estimate the probability p of purchase as $\hat{p} = 2/10 = 0.20$ for all customers. Could p really be 0.50 in the population?

Binomial:
  n independent trials.
  Each results in success (1) or failure (0).
  p = probability of success, constant on all trials.
  X = observed number of successes in n trials.

$\Pr\{X = r\} = \frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r}$, where $n! = n(n-1)(n-2)\cdots(1)$ and $0! = 1! = 1$.

With p known, Pr{X = r} is a probability function of r.
With r known and p to be estimated, Pr{X = r} is a function of p, known as the likelihood function L(p). Its logarithm is ln(L(p)).

Ex: (r = 2, n = 10): $L(p) = 45\, p^{2} (1-p)^{8}$, which has its maximum at p = 0.20.

2. Contingency Tables

Observed
                 Coupon   No Coupon
  Purchase         86        24
  No Purchase      14        76

Expected (under H0: no coupon effect)
                 Purchase   No Purchase
  Coupon            55          45
  No Coupon         55          45

Pearson Chi-square (k = 1 df):
$\chi^2_k = \sum_{\text{all cells}} \frac{(O - E)^2}{E} = 77.66$
Compare to the 5% critical value of Chi-square with 1 df, $1.96^2 = 3.84$ (significant).

Likelihood
                 Coupon           No Coupon
  Purchase       $p_1^{86}$       $p_2^{24}$
  No Purchase    $(1-p_1)^{14}$   $(1-p_2)^{76}$

The likelihood is $C\, p_1^{86} (1-p_1)^{14}\, p_2^{24} (1-p_2)^{76}$, where C is a constant that does not involve $p_1$ or $p_2$.
Max at $p_1 = 0.86$, $p_2 = 0.24$.
Max ln(L), dropping the constant ln(C), is 86 ln(.86) + 14 ln(.14) + 24 ln(.24) + 76 ln(.76) = -95.6043.
Under H0: $p_1 = p_2$: max at $p_1 = p_2 = 0.55$, $(1-p_1) = (1-p_2) = 0.45$.
Max ln(L) is 86 ln(.55) + 14 ln(.45) + 24 ln(.55) + 76 ln(.45) = -137.628.
Likelihood ratio test (difference between the two -2 ln(L) values) = 2(137.628 - 95.6043) = 84.0468.
Close to, but not the same as, the Pearson Chi-square (77.66).

Run Logistic_A.sas demo.

3. Logistic Regression

X = food storage temperature (degrees C)
Y = 1 if spoilage after 2 months, 0 otherwise

  X:  -10   0   1   4   6
  Y:    0   1   0   1   1

Regress Y on X? Problem: predicted probabilities can be > 1 or < 0.
Idea: convert p to a logit.
  Logit = ln(p/(1-p)) = ln(odds)
  Model: Logit = $\beta_0 + \beta_1 X$
  $p = \frac{\exp(\text{Logit})}{1 + \exp(\text{Logit})} = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$

So, use $\exp(\beta_0 + \beta_1 X)/(1 + \exp(\beta_0 + \beta_1 X))$ for p in the likelihood function (you know X), then find the betas that maximize this function. Equivalently, minimize -2 ln(likelihood).
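As a rough cross-check outside SAS, the sketch below minimizes -2 ln(likelihood) for the five spoilage observations using Newton-Raphson steps. It is an illustration under stated assumptions, not the algorithm behind the SAS demo: the function names `prob`, `neg2_loglik`, and `fit`, the starting values of zero, and the stopping rule are all choices made here for illustration.

```python
import math

# Spoilage data from the notes: temperature (degrees C) and spoilage indicator.
X = [-10, 0, 1, 4, 6]
Y = [0, 1, 0, 1, 1]


def prob(b0, b1, x):
    """Inverse logit: p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def neg2_loglik(b0, b1):
    """-2 ln(likelihood) for independent 0/1 responses."""
    total = 0.0
    for x, y in zip(X, Y):
        p = prob(b0, b1, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * total


def fit(max_iter=50, tol=1e-10):
    """Newton-Raphson on ln(L), starting from b0 = b1 = 0 (an assumed start)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Gradient (g) of ln(L) and the 2x2 information matrix (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(X, Y):
            p = prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system h * step = g by Cramer's rule.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


b0_hat, b1_hat = fit()
print("b0 = %.4f, b1 = %.4f" % (b0_hat, b1_hat))
print("-2 ln(L) at the estimates = %.4f" % neg2_loglik(b0_hat, b1_hat))
```

Evaluating `neg2_loglik` over a grid of (b0, b1) values would give the kind of surface that can be cut off at a chosen height to outline a confidence region for the betas.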
Any betas whose -2 ln(likelihood) exceeds that of the maximum likelihood betas by more than the Chi-square 95% point would be rejected in a 5% hypothesis test. Therefore, if we truncate our plot of -2 ln(likelihood) at that height, we cut off the rejected set of betas and are left with an approximate 95% confidence region for the pair of betas.

Maximum likelihood estimates: b0 = -0.0167, b1 = 0.5597

[Figure: plot with the data points at X = -10, 0, 1, 4, 6]

Run demo: Logistic_B.sas

  proc logistic data=logistic;
    model spoiled(event="1") = temperature / itprint ctable pprob=0.5;
  run;

Pairs: one 0 and one 1.
Concordant pair: the actual 1 has a higher predicted probability than the actual 0 (i.e. the 1 is to the right of the 0, since the slope is positive).
Discordant pair: the actual 0 has a higher predicted probability of being 1 than does the actual 1.
We have 2 0's and 3 1's, so 2 x 3 = 6 pairs. One of those 6 (circled in the plot) is discordant and there are no ties, so 5/6 are concordant.

Association of Predicted Probabilities and Observed Responses
  Percent Concordant   83.3     Somers' D   0.667
  Percent Discordant   16.7     Gamma       0.667
  Percent Tied          0.0     Tau-a       0.400
  Pairs                 6       c           0.833

Cutoff probability 0.5 (pprob=0.5): classify any point whose predicted probability exceeds 0.5 as 1, the others as 0. You will have some misclassifications.

Classification Table
            Correct            Incorrect          Percentages
  Prob               Non-               Non-               Sensi-   Speci-   False   False
  Level    Event    Event     Event    Event    Correct    tivity   ficity    POS     NEG
  0.500      2        1         1        1       60.0      66.7     50.0     33.3    50.0

The split point is between X = 0 and X = 1.
  2 correct events, at X = 4 and X = 6.
  1 correct non-event, at X = -10.
  1 incorrect event, at X = 1 (classified as an event but actually a non-event).
  1 incorrect non-event, at X = 0 (classified as a non-event but actually an event).

Sensitivity: probability of calling a 1 a 1. There are 3 1's and we got 2 of them, so 2/3.
Specificity: probability of calling a non-event a non-event, 1/2.
(Denominators = numbers of actuals.)
False positives: 1/3 of our classified events were non-events.
False negatives: 1/2 of our classified non-events were events.
(Denominators = numbers of decisions.)

Odds Ratio:
  Old Logit = $\beta_0 + \beta_1 X$
  New Logit = $\beta_0 + \beta_1 (X+1)$
  $\ln\!\left(\frac{p_{new}}{1-p_{new}}\right) - \ln\!\left(\frac{p_{old}}{1-p_{old}}\right) = \beta_1$
  $\ln(\text{new odds}) - \ln(\text{old odds}) = \beta_1$
  $\ln\!\left(\frac{\text{new odds}}{\text{old odds}}\right) = \beta_1$
  odds ratio = $\exp(\beta_1)$

$e^{\beta} = 1 + \beta + \beta^2/2! + \beta^3/3! + \cdots$ (Taylor)
$e^{\beta}$ is approximately $1 + \beta$ when $\beta$ is small.

Other Stats (source: SAS online help):
The following statistics are all rank-based correlation statistics for assessing the predictive ability of a model ($n_c$ = # concordant pairs, $n_d$ = # discordant pairs, N = number of points, t = number of pairs with different responses):
  c (area under the ROC curve): $(n_c + \tfrac{1}{2}(\#\,\text{ties}))/t$
  Somers' D: $(n_c - n_d)/t$
  Kendall's Tau-a: $(n_c - n_d)/(\tfrac{1}{2} N (N-1))$
  Goodman-Kruskal Gamma: $(n_c - n_d)/(n_c + n_d)$
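The rank statistics, the classification percentages, and the odds ratio can be recomputed by hand from the five data points. The sketch below does so in plain Python, taking b0 = -0.0167 and b1 = 0.5597 from the notes as given. It is a direct recomputation from the formulas above, not a reproduction of SAS's internal algorithm, so agreement on larger data sets should not be assumed. The names `predicted`, `rank_stats`, and `classification` are illustrative.

```python
import math

# Spoilage data and the estimates quoted in the notes.
X = [-10, 0, 1, 4, 6]
Y = [0, 1, 0, 1, 1]
B0, B1 = -0.0167, 0.5597


def predicted(x):
    """Predicted probability of spoilage at temperature x."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * x)))


P = [predicted(x) for x in X]


def rank_stats(p, y):
    """Count concordant/discordant/tied (event, non-event) pairs."""
    events = [pi for pi, yi in zip(p, y) if yi == 1]
    nonevents = [pi for pi, yi in zip(p, y) if yi == 0]
    nc = nd = ties = 0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                nc += 1
            elif pe < pn:
                nd += 1
            else:
                ties += 1
    t = len(events) * len(nonevents)  # pairs with different responses
    n = len(y)
    return {
        "pairs": t, "concordant": nc, "discordant": nd, "tied": ties,
        "c": (nc + 0.5 * ties) / t,
        "somers_d": (nc - nd) / t,
        "tau_a": (nc - nd) / (0.5 * n * (n - 1)),
        "gamma": (nc - nd) / (nc + nd),
    }


def classification(p, y, cutoff=0.5):
    """Classify as event when the predicted probability exceeds the cutoff."""
    ce = sum(1 for pi, yi in zip(p, y) if pi > cutoff and yi == 1)
    ie = sum(1 for pi, yi in zip(p, y) if pi > cutoff and yi == 0)
    cn = sum(1 for pi, yi in zip(p, y) if pi <= cutoff and yi == 0)
    inn = sum(1 for pi, yi in zip(p, y) if pi <= cutoff and yi == 1)
    return {
        "correct": (ce + cn) / len(y),
        "sensitivity": ce / (ce + inn),   # denominator: actual events
        "specificity": cn / (cn + ie),    # denominator: actual non-events
        "false_pos": ie / (ce + ie),      # denominator: classified events
        "false_neg": inn / (cn + inn),    # denominator: classified non-events
    }


odds_ratio = math.exp(B1)  # multiplicative change in odds per 1 degree C

print(rank_stats(P, Y))
print(classification(P, Y))
print("odds ratio = %.4f, 1 + b1 = %.4f" % (odds_ratio, 1.0 + B1))
```

Here exp(0.5597) is about 1.75, noticeably larger than 1 + 0.5597, so the approximation $e^{\beta} \approx 1 + \beta$ is rough for a slope of this size and works better for slopes nearer zero.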