Topic 1: Binary Logit Models

 Often variables in social sciences are dichotomous:

Employed vs. unemployed;

Married vs. unmarried;

Guilty vs. innocent;

Voted vs. didn’t vote


 Social scientists frequently wish to estimate regression models with a dichotomous dependent variable

 Most researchers are aware that

There is something wrong with using OLS for a dichotomous dependent variable;

But they do not know what makes a dichotomous dependent variable problematic in standard linear regression; and

What other methods are superior


 The focus of this topic is on logit analysis (or logistic regression analysis) for a dichotomous dependent variable

 Logit models have many similarities to OLS regression models

 We examine why OLS regression runs into problems when the dependent variable is dichotomous

Example

 Dataset: penalty.txt

 Comprises 147 penalty cases in the state of New Jersey

 In all cases the defendant was convicted of first-degree murder with a recommendation by the prosecutor that a death sentence be imposed

 A penalty trial is conducted to determine whether the defendant should receive the death penalty or life imprisonment

 The dataset comprises the following variables:

DEATH    1 for a death sentence, 0 for a life sentence

BLACKD   1 if the defendant was black, 0 otherwise

WHITVIC  1 if the victim was white, 0 otherwise

SERIOUS  an average rating of the seriousness of the crime evaluated by a panel of judges, ranging from 1 (least serious) to 15 (most serious)

DATA PENALTY;

INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';

INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;

PROC REG;

MODEL DEATH=BLACKD WHITVIC SERIOUS;

RUN;

[OLS regression output not shown]

 Remarks on the OLS regression output:

The coefficient for SERIOUS is positive and very significant;

Neither of the two racial variables is significantly different from zero;

R² is very low;

 The F-test indicates overall significance of the model

Should we trust these results?

Assumptions of the linear regression model:

1. y_i = a + b x_i + ε_i
2. E(ε_i) = 0
3. var(ε_i) = σ²  (homoscedasticity)
4. cov(ε_i, ε_j) = 0 for i ≠ j  (absence of autocorrelation)
5. The x_i's are treated as fixed
6. ε_i ~ Normal

If assumptions 1-5 are satisfied, then the OLS estimators of a and b are B.L.U.

If all assumptions are satisfied, then the OLS estimators of a and b are M.V.U.

 Now, what if y is a dichotomy with possible values of 1 or 0?

 It is still possible to claim that assumptions 1, 2, 4 and 5 are true

 But if 1 and 2 are true, then 3 and 6 are necessarily false!!

 Consider assumption 6

 Note that

If y_i = 1, then ε_i = 1 − a − b x_i

If y_i = 0, then ε_i = −a − b x_i

So ε_i can take only two values. In fact, it follows a Binomial-type (two-point) distribution, and hence so do â and b̂ in small samples. Standard inference procedures are no longer valid as a consequence.

 But in large samples, the Binomial distribution tends towards the Normal distribution

 Consider assumption 3:

Note that

E(y_i) = 1 × Pr(y_i = 1) + 0 × Pr(y_i = 0) = p_i

where p_i = Pr(y_i = 1). But from assumptions 1 and 2,

E(y_i) = E(a + b x_i + ε_i) = a + b x_i + E(ε_i) = a + b x_i

Therefore

p_i = a + b x_i

the linear probability model (LPM).

 Accordingly, from our previous output, a 1-point increase in the SERIOUS scale is associated with a 0.038 increase in the probability of a death sentence; the probability of a death sentence for black defendants is 0.12 higher than for non-black defendants, ceteris paribus

 var(ε_i) = E[(ε_i − E(ε_i))²] = E(ε_i²)

          = (1 − a − b x_i)² p_i + (a + b x_i)² (1 − p_i)

          = (1 − a − b x_i)² (a + b x_i) + (a + b x_i)² (1 − a − b x_i)    [using p_i = a + b x_i]

          = (a + b x_i)(1 − a − b x_i)

          = p_i (1 − p_i)

So var(ε_i) depends on x_i and assumption 3 fails.

 The variance is at a maximum when p_i = 0.5
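The derivation above can be checked numerically. A minimal Python sketch (the helper name is ours, not from the notes) confirms that the LPM error variance p_i(1 − p_i) changes with p_i and peaks at 0.5:

```python
def lpm_error_variance(p):
    """var(e_i) = p_i * (1 - p_i), the LPM error variance derived above."""
    return p * (1 - p)

# The variance differs across observations (heteroscedasticity) and
# is largest when p_i = 0.5.
variances = {p / 10: lpm_error_variance(p / 10) for p in range(1, 10)}
max_p = max(variances, key=variances.get)
```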

So, what are the consequences?

 Violation of assumptions 3 and 6 does not lead to biased estimation by OLS (only assumptions 1 and 2 are required for OLS to yield unbiased estimators)

 If the sample size is large enough, the estimators are approximately normally distributed.

 Violation of the homoscedasticity assumption makes the OLS estimators no longer efficient. In addition, the estimated standard errors are biased.

 Also, the model p_i = a + b x_i is linear in x_i and therefore has no upper or lower bound.

But it is impossible for the true values (which are probabilities) to be greater than 1 or less than 0!

 Odds of an event: the ratio of the expected number of times that an event will occur to the expected number of times it will not occur. For example, an odds of 4 means we expect 4 times as many occurrences as nonoccurrences; an odds of 5/2 (or 5 to 2) means we expect 5 occurrences for every 2 nonoccurrences.

 Let p be the probability of an event and o the odds of the event. Then

o = p / (1 − p)    or    p = o / (1 + o)
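The two formulas translate directly into code; a small sketch (the function names are ours):

```python
def odds_from_prob(p):
    """o = p / (1 - p), defined for 0 <= p < 1."""
    return p / (1 - p)

def prob_from_odds(o):
    """p = o / (1 + o), defined for o >= 0."""
    return o / (1 + o)
```

For example, `odds_from_prob(0.8)` gives 4.0, matching the table below.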

Relationship between Odds and Probability

Probability:  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Odds:         0.11  0.25  0.43  0.67  1.00  1.50  2.33  4.00  9.00

• o < 1 => p < 0.5
• o > 1 => p > 0.5
• 0 < o < ∞

Death Sentence by Race of Defendant for 147 Penalty Trials

            Death   Life   Total
Blacks        28     45     73
Non-blacks    22     52     74
Total         50     97    147

Odds of death overall:         O_D    = 50/97 = 0.52
Odds of death for blacks:      O_D|B  = 28/45 = 0.62
Odds of death for non-blacks:  O_D|NB = 22/52 = 0.42

Ratio of the black odds to the non-black odds: 0.62 / 0.42 = 1.476

=> The odds of a death sentence for blacks are 47.6% higher than for non-blacks; equivalently, the odds of a death sentence for non-blacks are about 0.68 times the corresponding odds for blacks.
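The odds and odds ratio in this table are easy to reproduce; a quick Python sketch using the counts above (note that the unrounded ratio 28/45 ÷ 22/52 is 1.471 to three decimals; 1.476 comes from the rounded odds 0.62/0.42):

```python
# Counts from the 2x2 table of death sentence by race of defendant.
death_black, life_black = 28, 45
death_nonblack, life_nonblack = 22, 52

odds_black = death_black / life_black             # ~0.62
odds_nonblack = death_nonblack / life_nonblack    # ~0.42
odds_ratio = odds_black / odds_nonblack           # ~1.47
```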

 Logit model:

p_i = 1 / (1 + e^(−(a + b x_i)))

which is the cumulative logistic distribution function. Let Z_i = a + b x_i; then

p_i = 1 / (1 + e^(−Z_i)) = F(a + b x_i)

Notes:

As Z_i ranges from −∞ to +∞, p_i ranges between 0 and 1;

 p_i is non-linearly related to Z_i
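The cumulative logistic function F translates directly into a one-liner; a sketch using the standard library:

```python
import math

def logistic(z):
    """F(z) = 1 / (1 + exp(-z)), the cumulative logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

# However extreme Z_i is, p_i stays strictly between 0 and 1,
# unlike the linear probability model.
probs = [logistic(z) for z in (-4, -2, 0, 2, 4)]
```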

 Also,

p_i / (1 − p_i) = e^(Z_i)    (the odds of the event)

Let

L_i = ln[p_i / (1 − p_i)] = Z_i = a + b x_i

Although L_i is linear in x_i, the probabilities themselves are not. This is in contrast to the LPM.

[Graph of the logit model for a single explanatory variable: an S-shaped curve of p_i (from 0 to 1) against Z_i (from −4 to 4), produced using a = 0 and b = 1.]

 Now

∂p_i/∂x_i = ∂F(a + b x_i)/∂x_i = f(a + b x_i) · b

where f = F′ is the corresponding density function.

 As f(a + b x_i) is always positive, the sign of b indicates the direction of the relationship between p_i and x_i

 For the LOGIT model,

f(a + b x_i) = e^(−Z_i) / (1 + e^(−Z_i))² = F(a + b x_i)[1 − F(a + b x_i)] = p_i (1 − p_i)

∴ ∂p_i/∂x_i = b · p_i (1 − p_i)

In other words, a 1-unit change in x_i does not produce a constant effect on p_i
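The varying marginal effect b·p_i(1 − p_i) can be illustrated numerically (a sketch; the values of a and b are arbitrary):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(a, b, x):
    """dp/dx = b * p * (1 - p) for the logit model."""
    p = logistic(a + b * x)
    return b * p * (1 - p)

# With a = 0, b = 1 the effect is largest at x = 0 (where p = 0.5)
# and shrinks in the tails: a 1-unit change in x has no constant effect.
effects = [marginal_effect(0.0, 1.0, x) for x in (-3.0, 0.0, 3.0)]
```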

 Note that y_i only takes on the values 0 and 1, so the observed log-odds L_i is not defined. Therefore, OLS is not an appropriate estimation technique; Maximum Likelihood (ML) estimation is usually undertaken.

ML basic principle: choose as estimates those parameter values which would maximize the probability of what we have in fact observed.

Steps:

 Write down an expression for the probability of the data as a function of the unknown parameters [construction of likelihood function]

 Find the values of the unknown parameters that make the value of this expression as large as possible.


log L = Σ_i y_i log p_i + Σ_i (1 − y_i) log(1 − p_i)

      = Σ_i y_i log[p_i / (1 − p_i)] + Σ_i log(1 − p_i)

      = Σ_i y_i (a + b x_i) − Σ_i log(1 + e^(a + b x_i))

      = a Σ_i y_i + b Σ_i x_i y_i − Σ_i log(1 + e^(a + b x_i))

Taking the derivatives of log L and setting them to zero gives:

∂log L/∂â = Σ_i y_i − Σ_i 1 / (1 + e^(−(â + b̂ x_i))) = 0

∂log L/∂b̂ = Σ_i x_i y_i − Σ_i x_i / (1 + e^(−(â + b̂ x_i))) = 0

The first-order conditions are non-linear in â and b̂, so solutions are typically obtained by iterative methods.

Newton-Raphson algorithm

Let U(a, b) be the vector of first derivatives of log L with respect to a and b, and let H(a, b) be the corresponding matrix of second derivatives, i.e.

U(a, b) = [ Σ_i (y_i − ŷ_i),  Σ_i x_i (y_i − ŷ_i) ]′    (gradient or score vector)

where ŷ_i = 1 / (1 + e^(−(a + b x_i))) is the predicted probability.

H(a, b) = [ ∂²log L/∂a²     ∂²log L/∂a∂b ]
          [ ∂²log L/∂b∂a    ∂²log L/∂b²  ]

        = − [ Σ_i ŷ_i(1 − ŷ_i)        Σ_i x_i ŷ_i(1 − ŷ_i)  ]
            [ Σ_i x_i ŷ_i(1 − ŷ_i)    Σ_i x_i² ŷ_i(1 − ŷ_i) ]

(Hessian matrix)

 The Newton-Raphson algorithm derives new estimates based on

[ â_(j+1) ]   [ â_j ]
[ b̂_(j+1) ] = [ b̂_j ] − H⁻¹(â_j, b̂_j) U(â_j, b̂_j)

where H⁻¹(â_j, b̂_j) is the inverse of H(â_j, b̂_j).

 In practice, we need a set of starting values.

[PROC LOGISTIC starts with all coefficients equal to zero]

 The process is repeated until the maximum change in each parameter estimate from one step to the next is less than some criterion, i.e.

|â_(j+1) − â_j| < u    and    |b̂_(j+1) − b̂_j| < u
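The full iteration (score vector, Hessian, update, convergence check) can be sketched in a few lines for a single predictor. This is an illustrative implementation on made-up data, not the PROC LOGISTIC internals:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_raphson_logit(x, y, tol=1e-8, max_iter=50):
    """Fit p_i = F(a + b*x_i) by Newton-Raphson, starting from a = b = 0."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        p = [logistic(a + b * xi) for xi in x]
        # Score vector U(a, b): first derivatives of log L
        u_a = sum(yi - pi for yi, pi in zip(y, p))
        u_b = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Hessian H(a, b): second derivatives (negative definite)
        w = [pi * (1.0 - pi) for pi in p]
        h_aa = -sum(w)
        h_ab = -sum(xi * wi for xi, wi in zip(x, w))
        h_bb = -sum(xi * xi * wi for xi, wi in zip(x, w))
        det = h_aa * h_bb - h_ab * h_ab
        # Update: (a, b)' <- (a, b)' - H^{-1} U
        step_a = (h_bb * u_a - h_ab * u_b) / det
        step_b = (h_aa * u_b - h_ab * u_a) / det
        a, b = a - step_a, b - step_b
        if max(abs(step_a), abs(step_b)) < tol:  # convergence criterion
            break
    return a, b

# Made-up data: the outcome tends to 1 for larger x, with some overlap.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
a_hat, b_hat = newton_raphson_logit(x, y)
```

At the solution the first-order conditions hold: the fitted probabilities sum to Σy_i and Σx_i p̂_i equals Σx_i y_i.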

 Note that

cov(â, b̂) = −H⁻¹(â, b̂)

 This variance-covariance matrix can be obtained using the COVB option in the MODEL statement in SAS

SAS Program

DATA PENALTY;

INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';

INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;

PROC LOGISTIC DATA=PENALTY DESCENDING;

MODEL DEATH=BLACKD WHITVIC SERIOUS;

RUN;

[PROC LOGISTIC output not shown]

Interpretation of results

 Rather than a t-statistic, SAS reports a Wald chi-square value, which is the square of the usual t-statistic.

Reason: the t-statistic is only an asymptotic one and has an "asymptotic" N(0,1) distribution under the null. The square of an N(0,1) variable is a chi-square random variable with one d.f.

 Test of overall significance

H_0: b_1 = b_2 = ... = b_k = 0
H_1: otherwise

1. Likelihood-Ratio test:

   LR = −2{log L(b̂_R) − log L(b̂_UR)} ~ χ²_k

2. Score (Lagrange-Multiplier) test:

   LM = [U(b̂_R)]′ [−H(b̂_R)]⁻¹ [U(b̂_R)] ~ χ²_k

3. Wald test:

   W = b̂_UR′ [−H(b̂_UR)] b̂_UR ~ χ²_k

where R and UR denote the restricted and unrestricted models respectively.
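A sketch of the likelihood-ratio test in code, with placeholder log-likelihood values (not the lecture's output) and the standard 5% critical value for 3 degrees of freedom:

```python
# H0: b1 = b2 = b3 = 0 (k = 3 restrictions).
# The two log-likelihoods below are illustrative placeholders.
loglik_restricted = -92.35    # log L(b_R): intercept-only model
loglik_unrestricted = -83.10  # log L(b_UR): full model

lr_stat = -2.0 * (loglik_restricted - loglik_unrestricted)

CHI2_CRIT_3DF_5PCT = 7.815    # 5% critical value of chi-square, 3 d.f.
reject_h0 = lr_stat > CHI2_CRIT_3DF_5PCT
```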

Model Selection Criteria

1. Akaike's Information Criterion

   AIC = −2 ln L + 2(k+1)

2. Schwarz Criterion

   SC = −2 ln L + (k+1) ln(n)

3. Generalized R² = 1 − exp(−LR/n), analogous to the conventional R² used in linear regression
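The three criteria are one-liners; illustrative helpers (the names are ours), where `loglik` is the maximized log-likelihood, `k` the number of slope coefficients, `n` the sample size, and `lr_stat` the model chi-square:

```python
import math

def aic(loglik, k):
    """AIC = -2 ln L + 2(k + 1)."""
    return -2.0 * loglik + 2.0 * (k + 1)

def sc(loglik, k, n):
    """Schwarz criterion: SC = -2 ln L + (k + 1) ln(n)."""
    return -2.0 * loglik + (k + 1) * math.log(n)

def generalized_r2(lr_stat, n):
    """Generalized R^2 = 1 - exp(-LR/n); 0 when the model adds nothing."""
    return 1.0 - math.exp(-lr_stat / n)
```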

Optimization Technique

Fisher's scoring (iteratively reweighted least squares), which for the logit model is equivalent to the Newton-Raphson algorithm.

 Odds ratio = e^b

 The (predicted) odds ratio of 1.813 indicates that the odds of a death sentence for black defendants are 81% higher than the odds for other defendants

 The (predicted) odds of death are about 29% higher when the victim is white (but note that this coefficient is insignificant)

 A 1-unit increase in the SERIOUS scale is associated with a 21% increase in the predicted odds of a death sentence.

Association of predicted probabilities and observed responses

Example: For the 147 observations in the sample, there are C(147, 2) = 10731 ways to pair them up (without pairing an observation with itself). Of these, 5881 pairs have either both 1's or both 0's on the dependent variable. These we ignore, leaving 4850 pairs in which one case has a 1 and the other has a 0.

For each pair, we ask the question: "Does the case with a 1 have a higher predicted value (based on the model) than the case with a 0?"

If yes, we call that pair concordant; if no, we call that pair discordant; if the two cases have the same predicted value, we call it a tie.

Let C = number of concordant pairs;

D = number of discordant pairs;

T = number of ties

N = total number of pairs (before eliminating any)


Tau-a     = (C − D) / N

Gamma     = (C − D) / (C + D)

Somers' D = (C − D) / (C + D + T)

c         = 0.5 (1 + Somers' D)

All 4 measures vary between 0 and 1, with large values corresponding to stronger associations between the predicted and observed values
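The pairing logic and the four measures can be sketched as follows (a plain-Python illustration, not SAS's internal algorithm):

```python
from itertools import combinations

def association_measures(y, p):
    """Count concordant (C), discordant (D) and tied (T) pairs among the
    pairs with different observed responses, then form the four measures."""
    n = len(y)
    N = n * (n - 1) // 2                 # total pairs, before eliminating any
    C = D = T = 0
    for i, j in combinations(range(n), 2):
        if y[i] == y[j]:                 # both 1's or both 0's: ignored
            continue
        p1, p0 = (p[i], p[j]) if y[i] == 1 else (p[j], p[i])
        if p1 > p0:
            C += 1
        elif p1 < p0:
            D += 1
        else:
            T += 1
    tau_a = (C - D) / N
    gamma = (C - D) / (C + D)
    somers_d = (C - D) / (C + D + T)
    c = 0.5 * (1 + somers_d)
    return {"C": C, "D": D, "T": T, "tau_a": tau_a,
            "gamma": gamma, "somers_d": somers_d, "c": c}
```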

An Illustrative Example of the LOGIT Model

Table 12.4 of Ramanathan (1995) presents information on the acceptance or rejection to medical school for a sample of 60 applicants, along with a number of their characteristics. The variables are as follows:

ACCEPT = 1 if granted an acceptance, 0 otherwise;

GPA = cumulative undergraduate grade point average;

BIO = score in the biology portion of the Medical College Admission Test (MCAT);

CHEM = score in the chemistry portion of the MCAT;

PHY = score in the physics portion of the MCAT;

RED = score in the reading portion of the MCAT;

PRB = score in the problem portion of the MCAT;

QNT = score in the quantitative portion of the MCAT;

AGE = age of applicant;

GENDER = 1 if male, 0 if female.

1. Estimate a LOGIT model for the probability of acceptance into medical school.

2. Predict the probability of success of an individual with the following characteristics:

   GPA = 2.96, BIO = 7, CHEM = 7, PHY = 8, RED = 5, PRB = 7, QNT = 5, AGE = 25, GENDER = 0

3. Calculate Cragg and Uhler's pseudo R² for the above model. How well does the model appear to fit the data?

4. AGE and GENDER represent personal characteristics. Test the hypothesis that AGE and GENDER jointly have no impact on the probability of success.

DATA UNI;

INFILE 'D:\TEACHING\MS4225\MEDICAL.TXT';

INPUT ACCEPT GPA BIO CHEM PHY RED PRB QNT AGE GENDER;

PROC LOGISTIC DATA=UNI DESCENDING;

MODEL ACCEPT=GPA BIO CHEM PHY RED PRB QNT AGE GENDER;

RUN;

PROC LOGISTIC DATA=UNI DESCENDING;

MODEL ACCEPT=GPA BIO CHEM PHY RED PRB QNT;

RUN;

[PROC LOGISTIC output for the medical-school example not shown]

An alternative estimation routine for the LOGIT model: PROC GENMOD

DATA PENALTY;

INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';

INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;

PROC GENMOD DATA=PENALTY DESCENDING;

MODEL DEATH=BLACKD WHITVIC SERIOUS/D=B;

RUN;

[PROC GENMOD output not shown]

Advantages of PROC GENMOD

 Class variable

DATA PENALTY;

INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';

INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;

PROC GENMOD DATA=PENALTY DESCENDING;

CLASS CULP;

MODEL DEATH=BLACKD WHITVIC CULP/D=B;

RUN;


 The variable CULP takes the integer values 1 to 5 (5 denotes high culpability and 1 denotes low culpability)

 The CLASS statement treats this variable as a set of categories by creating 4 dummy variables, one for each of the values 1 through 4 (the default in GENMOD is to take the highest value as the omitted category)

[PROC GENMOD output not shown]

 Multiplicative terms in the MODEL statement (to capture interaction effects)

 For example, some people may argue that black defendants who kill white victims may be especially likely to receive a death sentence. We can test this hypothesis by including the term BLACKD*WHITVIC in the MODEL statement

DATA PENALTY;

INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';

INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;

PROC GENMOD DATA=PENALTY DESCENDING;

MODEL DEATH=BLACKD WHITVIC CULP BLACKD*WHITVIC/D=B;

RUN;

[PROC GENMOD output not shown]

Other features of PROC GENMOD

 Deviance = −2 ln L (for individual data)

This involves a comparison between the model of interest and the maximal (or saturated) model, which always fits the data at least as well. The question is whether the difference in fit is statistically significant.

With individual data, the saturated model has one parameter for every predicted probability and therefore gives a perfect fit and a likelihood value of 1.

 Unfortunately, with individual level data the deviance does not have a chi-square distribution because the number of parameters increases with sample size, thereby violating a condition of asymptotic theory.

 SCALE variable (can be ignored for binary regression models unless one is working with grouped data and wants to allow for overdispersion)

 Pearson Chi-square test (to be considered at a later stage)


Disadvantages of PROC GENMOD

 Does not provide the odds ratio estimates

 Does not report a global test for the overall significance of the model.


Hosmer-Lemeshow (HL) statistic

DATA PENALTY;

INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';

INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;

PROC LOGISTIC DATA=PENALTY DESCENDING;

MODEL DEATH=BLACKD WHITVIC CULP/LACKFIT;

RUN;

68

69

70

 The HL statistic is calculated in the following way:

Based on the estimated model, predicted probabilities are generated for all observations. These are sorted by size, then grouped into approximately 10 intervals.

Within each interval, the expected frequency is obtained by adding up the predicted probabilities.

Expected frequencies are compared with the observed frequencies by the conventional Pearson chi-square statistic. The d.o.f. is the number of intervals minus 2.
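The steps above can be sketched directly (an illustrative implementation with equal-sized groups; SAS's LACKFIT option performs this grouping internally):

```python
def hosmer_lemeshow(y, p, n_groups=10):
    """Sort by predicted probability, group, and form the Pearson
    chi-square comparing observed and expected frequencies."""
    pairs = sorted(zip(p, y))                   # sort by predicted probability
    size = len(pairs) // n_groups
    stat = 0.0
    for g in range(n_groups):
        start = g * size
        end = start + size if g < n_groups - 1 else len(pairs)
        group = pairs[start:end]
        n_g = len(group)
        observed = sum(yi for _, yi in group)   # observed 1's in the interval
        expected = sum(pi for pi, _ in group)   # sum of predicted probabilities
        p_bar = expected / n_g
        # Pearson contribution, covering both the 1's and the 0's cells
        stat += (observed - expected) ** 2 / (n_g * p_bar * (1 - p_bar))
    return stat   # compared with chi-square, d.f. = n_groups - 2
```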
