Logistic_notes

Count Data
1. Estimating & testing proportions:
Ten customers, 2 purchase a product. We estimate the
probability p of purchase as p=0.20 for all customers.
Could p really be 0.50 in the population?
Binomial
n independent trials
Each results in success (1) or failure (0)
p = probability of success: Constant on all trials.
X = observed number of successes in n trials.
Pr{X=r} = n!/[r!(n-r)!] p^r (1-p)^(n-r)
n! = n(n-1)(n-2)…(1), 0! = 1! = 1.
p known → Pr{X=r} is a probability function.
p to be estimated and r known → Pr{X=r} is now a function
of p and is known as the likelihood function L(p).
Logarithm is ln(L(p)).
Ex: (r=2, n=10): L(p) = 45 p^2 (1-p)^8, maximum at p=0.20.
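The example above can be checked numerically. The following is a minimal Python sketch (not part of the SAS demos); the function name binom_pmf and the grid search are illustrative choices. It evaluates the likelihood for r=2, n=10 on a grid of p values and compares L(0.20) with L(0.50).

```python
from math import comb

def binom_pmf(r, n, p):
    """Pr{X = r} for n independent trials with success probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, r = 10, 2

# Evaluate the likelihood L(p) on a grid of candidate values 0.01 ... 0.99.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: binom_pmf(r, n, p))

print("coefficient n!/[r!(n-r)!] =", comb(n, r))
print("grid maximiser of L(p):", p_hat)
print("L(0.20) =", round(binom_pmf(r, n, 0.20), 4))
print("L(0.50) =", round(binom_pmf(r, n, 0.50), 4))
```

The likelihood at p=0.50 is much smaller than at p=0.20, which is the starting point for asking whether p could really be 0.50.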
2. Contingency Tables
Observed
              Coupon   No Coupon
Purchase        86        24
No Purchase     14        76
Expected (under H0: no coupon effect)
              Purchase   No Purchase
Coupon           55          45
No Coupon        55          45
Pearson Chi-square (k=1 df):
χ²(k) = Σ over all cells of (O - E)^2 / E = 77.65
Compare to the Chi-square 1 df 95% point, 1.96^2 = 3.84 (significant)
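As a rough check of the Pearson statistic, here is a small Python sketch (variable names are illustrative). It builds the expected counts from the row and column totals of the observed table and sums (O - E)^2 / E over the four cells. The unrounded value is about 77.66, which the notes quote as 77.65.

```python
# Observed counts: rows = Purchase / No Purchase, columns = Coupon / No Coupon.
observed = [[86, 24],
            [14, 76]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under H0: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

pearson = sum((o - e) ** 2 / e
              for o_row, e_row in zip(observed, expected)
              for o, e in zip(o_row, e_row))

critical = 1.96 ** 2  # 95% point of Chi-square with 1 df

print("expected:", expected)
print("Pearson Chi-square:", round(pearson, 4))
print("significant at 5%:", pearson > critical)
```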
Likelihood
              Coupon        No Coupon
Purchase      p1^86         p2^24
No Purchase   (1-p1)^14     (1-p2)^76
Likelihood is C p1^86 (1-p1)^14 p2^24 (1-p2)^76
Max at p1=0.86, p2=0.24
Max ln(L) is 86ln(.86)+…+76ln(.76)=-95.6043
under H0:p1=p2:
Max at p1= p2=0.55, (1-p1) = (1-p2) =0.45
Max ln(L) is 86ln(.55)+…+76ln(.45)=-137.628
Likelihood ratio test (difference between -2 ln(L) values)
= 2(137.628-95.6043) = 84.0468
Close, but not the same as Pearson Chi-square (77.65)
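The two maximized log likelihoods and their difference can be reproduced with a short Python sketch. The helper max_loglik is a hypothetical name; it drops the constant ln(C), as the notes do, since it cancels in the difference.

```python
from math import log

# Counts from the observed table: (purchases, non-purchases) per group.
coupon = (86, 14)
no_coupon = (24, 76)

def max_loglik(groups):
    """Maximised ln(L), ignoring ln(C), with one p fitted per group given."""
    total = 0.0
    for yes, no in groups:
        p = yes / (yes + no)          # maximum likelihood estimate of p
        total += yes * log(p) + no * log(1 - p)
    return total

ll_full = max_loglik([coupon, no_coupon])   # p1 and p2 estimated separately
pooled = (coupon[0] + no_coupon[0], coupon[1] + no_coupon[1])
ll_null = max_loglik([pooled])              # H0: p1 = p2

lrt = 2 * (ll_full - ll_null)

print("max ln(L), full model:", round(ll_full, 4))
print("max ln(L), under H0:  ", round(ll_null, 4))
print("likelihood ratio test:", round(lrt, 4))
```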
Run Logistic_A.sas demo.
3. Logistic Regression:
X = food storage temperature (degrees C)
Y = 1 if spoilage after 2 months, 0 otherwise
X: -10 0 1 4 6
Y: 0 1 0 1 1
Regress Y on X:
Problem: Predicted probabilities >1 or < 0.
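To see the problem concretely, here is a hedged Python sketch that fits an ordinary least squares line to the five data points. The prediction at X=10 is a hypothetical temperature added for illustration; it is not in the data.

```python
xs = [-10, 0, 1, 4, 6]   # storage temperature (degrees C)
ys = [0, 1, 0, 1, 1]     # 1 = spoilage after 2 months

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least squares slope and intercept for the line Y = a + bX.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

def predict(x):
    """Straight-line 'probability' at temperature x."""
    return intercept + slope * x

for x in (-10, 0, 6, 10):
    print(x, round(predict(x), 3))
```

The fitted line gives a negative value at X=-10 and a value above 1 at X=10, neither of which is a valid probability.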
Idea: Convert p to logit
Logit = ln(p/(1-p)) = ln(odds)
Model: Logit = β0 + β1 X
p = exp(Logit)/(1+exp(Logit))
  = exp(β0 + β1 X)/(1 + exp(β0 + β1 X))
So… use exp(β0 + β1 X)/(1 + exp(β0 + β1 X)) for p in the
likelihood function (you know X), then find the betas that
maximize this function. Equivalently, minimize
-2 ln(likelihood).
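One way to carry out this maximization is Newton-Raphson on the log likelihood. The sketch below is a plain Python illustration on the five data points, not the algorithm SAS necessarily uses; the starting values (0, 0), the fixed 25 iterations, and the function names are assumptions made for the example.

```python
from math import exp, log

xs = [-10, 0, 1, 4, 6]   # storage temperature (degrees C)
ys = [0, 1, 0, 1, 1]     # 1 = spoilage after 2 months

def prob(b0, b1, x):
    """Inverse logit: p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    return 1.0 / (1.0 + exp(-(b0 + b1 * x)))

def neg2_loglik(b0, b1):
    """-2 ln(likelihood) for the 0/1 responses."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = prob(b0, b1, x)
        total += log(p) if y == 1 else log(1.0 - p)
    return -2.0 * total

b0, b1 = 0.0, 0.0
for _ in range(25):
    ps = [prob(b0, b1, x) for x in xs]
    # Gradient of ln(L) with respect to (b0, b1).
    g0 = sum(y - p for y, p in zip(ys, ps))
    g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
    # Information matrix (negative Hessian) entries.
    ws = [p * (1.0 - p) for p in ps]
    h00 = sum(ws)
    h01 = sum(w * x for w, x in zip(ws, xs))
    h11 = sum(w * x * x for w, x in zip(ws, xs))
    det = h00 * h11 - h01 * h01
    # Newton step: add (information matrix)^-1 * gradient, 2x2 inverse written out.
    step0 = (h11 * g0 - h01 * g1) / det
    step1 = (h00 * g1 - h01 * g0) / det
    b0, b1 = b0 + step0, b1 + step1

print("b0 =", round(b0, 4), " b1 =", round(b1, 4))
print("-2 ln(L) at the maximum:", round(neg2_loglik(b0, b1), 4))
```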
Any betas whose -2 ln(likelihood) differs from that of the
maximum likelihood betas by an amount exceeding the
Chi-square 95% point (5.99 for 2 df, one df per beta) would
be rejected in a 5% hypothesis test. Therefore if we truncate
our plot of -2 ln(likelihood) at the right point we will cut off
the rejected set of betas and have an approximate 95%
confidence region for the pair of betas.
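The truncation rule can be sketched in Python as a simple membership test. This assumes the 2 df Chi-square point is the relevant one (two betas) and plugs in the fitted values b0 = -0.0167, b1 = 0.5597 from these notes; the function in_region and the two trial points are hypothetical.

```python
from math import exp, log

xs = [-10, 0, 1, 4, 6]
ys = [0, 1, 0, 1, 1]

def neg2_loglik(b0, b1):
    """-2 ln(likelihood) for the 0/1 responses."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + exp(-(b0 + b1 * x)))
        total += log(p) if y == 1 else log(1.0 - p)
    return -2.0 * total

# Fitted betas as reported in these notes.
b0_hat, b1_hat = -0.0167, 0.5597
best = neg2_loglik(b0_hat, b1_hat)

# 95% point of Chi-square with 2 df; for 2 df this equals -2 ln(0.05).
cutoff = -2.0 * log(0.05)

def in_region(b0, b1):
    """True if (b0, b1) is not rejected by the 5% likelihood ratio test."""
    return neg2_loglik(b0, b1) - best <= cutoff

print("minimum -2 ln(L):", round(best, 3), " cutoff:", round(cutoff, 3))
print("(0, 0) inside region:   ", in_region(0.0, 0.0))
print("(0, -0.5) inside region:", in_region(0.0, -0.5))
```

With only five observations the region is wide: even the pair (0, 0) is not rejected, while a clearly negative slope such as -0.5 is.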
Fitted betas: b0 = -0.0167, b1 = 0.5597
[Plot: X axis marked at -10, 0, 1, 4, 6]
Run demo: Logistic_B.sas
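proc logistic data=logistic;
  /* model Pr(spoiled = 1); print iteration history and classification table at cutoff 0.5 */
  model spoiled(event="1") = temperature / itprint ctable pprob=0.5;
run;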
Pairs: one 0 and one 1
Concordant: actual 1 has higher predicted probability
than actual 0 (i.e. 1 is to the right of 0 since slope is
positive)
Discordant pairs: Actual 0 has higher probability of being
1 than does actual 1.
We have 2 0’s, 3 1’s, so 2x3=6 pairs.
One of those 6 (circled) is discordant and there are no ties
so 5/6 are concordant.
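The pair counting above can be reproduced with a short Python sketch. It uses the fitted values b0 = -0.0167, b1 = 0.5597 from these notes to get predicted probabilities, then compares every (actual 1, actual 0) pair; variable names are illustrative.

```python
from math import exp

xs = [-10, 0, 1, 4, 6]
ys = [0, 1, 0, 1, 1]
b0, b1 = -0.0167, 0.5597   # fitted betas reported in these notes

probs = [1.0 / (1.0 + exp(-(b0 + b1 * x))) for x in xs]

events = [p for p, y in zip(probs, ys) if y == 1]       # actual 1's
non_events = [p for p, y in zip(probs, ys) if y == 0]   # actual 0's

concordant = discordant = tied = 0
for p1 in events:
    for p0 in non_events:
        if p1 > p0:
            concordant += 1    # actual 1 has the higher predicted probability
        elif p1 < p0:
            discordant += 1    # actual 0 has the higher predicted probability
        else:
            tied += 1

pairs = len(events) * len(non_events)
print(concordant, discordant, tied, pairs)
```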
Association of Predicted Probabilities and Observed Responses
Percent Concordant   83.3     Somers' D   0.667
Percent Discordant   16.7     Gamma       0.667
Percent Tied          0.0     Tau-a       0.400
Pairs                   6     c           0.833
Cutoff probability 0.5 (pprob=0.5): Classify any point with
predicted probability higher than 0.5 as 1, others as 0. You will
have some misclassifications.
Classification Table
          Correct           Incorrect                  Percentages
Prob              Non-              Non-             Sensi-  Speci-  False  False
Level    Event    Event    Event    Event   Correct  tivity  ficity  POS    NEG
0.500      2        1        1        1      60.0     66.7    50.0   33.3   50.0
Split point between X=0 and X=1.
2 correct events at X=4, X=6. 1 correct non-event at X =
-10. One incorrect event at X=1, one incorrect non-event
at X=0.
Sensitivity: probability of saying a 1 is a 1. There are 3 1’s and
we got 2 of them, so 2/3.
Specificity: probability of calling a non-event a non-event:
½. (denominators = numbers of actuals)
False positives: 1/3 of our classified events were non-events.
False negatives: ½ of our classified non-events were
events. (denominators = numbers of decisions)
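The classification counts and percentages can be recomputed with a plain Python sketch. It classifies each point directly from the fitted curve (b0 = -0.0167, b1 = 0.5597 from these notes) at the 0.5 cutoff, which may differ in detail from how SAS builds its table; variable names are illustrative.

```python
from math import exp

xs = [-10, 0, 1, 4, 6]
ys = [0, 1, 0, 1, 1]
b0, b1 = -0.0167, 0.5597   # fitted betas reported in these notes
cutoff = 0.5

probs = [1.0 / (1.0 + exp(-(b0 + b1 * x))) for x in xs]
calls = [1 if p > cutoff else 0 for p in probs]

correct_events = sum(1 for c, y in zip(calls, ys) if c == 1 and y == 1)
correct_nonevents = sum(1 for c, y in zip(calls, ys) if c == 0 and y == 0)
incorrect_events = sum(1 for c, y in zip(calls, ys) if c == 1 and y == 0)
incorrect_nonevents = sum(1 for c, y in zip(calls, ys) if c == 0 and y == 1)

# Denominators are numbers of actuals.
sensitivity = correct_events / (correct_events + incorrect_nonevents)
specificity = correct_nonevents / (correct_nonevents + incorrect_events)
# Denominators are numbers of decisions.
false_pos = incorrect_events / (correct_events + incorrect_events)
false_neg = incorrect_nonevents / (correct_nonevents + incorrect_nonevents)
pct_correct = (correct_events + correct_nonevents) / len(ys)

print(correct_events, correct_nonevents, incorrect_events, incorrect_nonevents)
print(round(100 * pct_correct, 1), round(100 * sensitivity, 1),
      round(100 * specificity, 1), round(100 * false_pos, 1),
      round(100 * false_neg, 1))
```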
Odds Ratio:
Old Logit = β0 + β1 X
New Logit = β0 + β1 (X+1)
ln(pnew/(1-pnew)) – ln(pold/(1-pold)) = β1
ln(new odds) - ln(old odds) = β1
ln((new odds)/(old odds)) = β1
odds ratio = exp(β1)
e^β1 = 1 + β1 + β1^2/2! + β1^3/3! + … (Taylor)
e^β1 is approximately 1 + β1 when β1 is small
Other Stats (source, SAS online help):
The following statistics are all rank based correlation
statistics for assessing the predictive ability of a model:
(nc= # concordant, nd= # discordant, N points,
t pairs with different responses)
C (area under the ROC curve)   (nc + ½ (# ties))/t
Somers’ D                      (nc-nd)/t
Kendall’s Tau-a                (nc-nd)/(½N(N-1))
Goodman-Kruskal Gamma          (nc-nd)/(nc+nd)
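Plugging the spoilage example into these formulas (nc=5, nd=1, no ties, N=5 points, t=6 pairs) gives the values in the association table. A minimal Python sketch, with illustrative variable names:

```python
nc, nd = 5, 1      # concordant and discordant pairs
ties = 0           # tied pairs
N = 5              # number of points
t = 6              # pairs with different responses

c = (nc + 0.5 * ties) / t
somers_d = (nc - nd) / t
tau_a = (nc - nd) / (0.5 * N * (N - 1))
gamma = (nc - nd) / (nc + nd)

print("c        =", round(c, 3))
print("Somers' D =", round(somers_d, 3))
print("Tau-a    =", round(tau_a, 3))
print("Gamma    =", round(gamma, 3))
```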