1
• Many discrete outcomes are responses to questions that have a natural ordering but no quantitative interpretation:
• Examples:
– Self reported health status
• (excellent, very good, good, fair, poor)
– Do you agree with the following statement
• Strongly agree, agree, disagree, strongly disagree
2
• Can use the same type of model as in the previous section to analyze these outcomes
• Another ‘latent variable’ model
• Key to the model: there is a monotonic ordering of the qualitative responses
3
• Excellent, very good, good, fair, poor
• Coded as 1, 2, 3, 4, 5 on National Health Interview Survey
• We will code as 5,4,3,2,1 (easier to think of this way)
• Asked on every major health survey
• Important predictor of health outcomes, e.g. mortality
• Key question: what predicts health status?
4
• Important to note – the numbers 1-5 mean nothing in terms of their value, just an ordering to show you the lowest to highest
• The example below is easily adapted to include categorical variables with any number of outcomes
5
• $y_i^*$ = latent index of reported health
• The latent index measures your own scale of health. As $y_i^*$ crosses successive values, your report moves from poor to fair, then good, then very good, then excellent health
6
• $y_i$ = (1,2,3,4,5) for (poor, fair, good, very good, excellent)
• Interval decision rule
• $y_i = 1$ if $y_i^* \le u_1$
• $y_i = 2$ if $u_1 < y_i^* \le u_2$
• $y_i = 3$ if $u_2 < y_i^* \le u_3$
• $y_i = 4$ if $u_3 < y_i^* \le u_4$
• $y_i = 5$ if $y_i^* > u_4$
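The interval decision rule can be sketched in a few lines of Python. This is only an illustration: the cut points are the `_cut1` to `_cut4` estimates from the oprobit output reported later in these notes, the labels follow the 1=poor, 5=excellent coding, and the function name `category` is hypothetical.

```python
from bisect import bisect_left

# Cut points copied from the oprobit output in these notes (_cut1.._cut4).
CUTS = [0.4858634, 1.269036, 2.247251, 3.094606]
LABELS = {1: "poor", 2: "fair", 3: "good", 4: "very good", 5: "excellent"}

def category(y_star, cuts=CUTS):
    """Return y in 1..5 for a latent index value y_star.

    bisect_left counts the thresholds strictly below y_star, so a value
    exactly equal to a threshold stays in the lower category (the "<=" side).
    """
    return bisect_left(cuts, y_star) + 1

# A few hypothetical latent values, one of them sitting exactly on u_2.
for y_star in (0.2, 1.269036, 2.0, 3.0, 3.5):
    y = category(y_star)
    print(f"y* = {y_star} -> y = {y} ({LABELS[y]})")
```

Counting thresholds with a binary search generalizes directly to any number of ordered categories: only the list of cut points changes.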
7
• As with logit and probit models, we will assume $y_i^*$ is a function of observed and unobserved variables
• $y_i^* = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k + \varepsilon_i$
• $y_i^* = x_i\beta + \varepsilon_i$
8
• The threshold values ($u_1, u_2, u_3, u_4$) are unknown. We do not know the value of the index necessary to push you from very good to excellent.
• In theory, the threshold values are different for everyone
• The computer will estimate not only the β’s but also the thresholds, which are averages across people
9
• As with probit and logit, the model will be determined by the assumed distribution of ε
• In practice, most people pick normal, generating an ‘ordered probit’ (I have no idea why)
• We will generate the math for the probit version
10
• Let's do the two end categories, $\Pr(y_i=1)$ and $\Pr(y_i=5)$, first
• $\Pr(y_i=1)$
• $= \Pr(y_i^* \le u_1)$
• $= \Pr(x_i\beta + \varepsilon_i \le u_1)$
• $= \Pr(\varepsilon_i \le u_1 - x_i\beta)$
• $= \Phi[u_1 - x_i\beta] = 1 - \Phi[x_i\beta - u_1]$
11
• $\Pr(y_i=5)$
• $= \Pr(y_i^* > u_4)$
• $= \Pr(x_i\beta + \varepsilon_i > u_4)$
• $= \Pr(\varepsilon_i > u_4 - x_i\beta)$
• $= 1 - \Phi[u_4 - x_i\beta] = \Phi[x_i\beta - u_4]$
12
• $\Pr(y_i=3) = \Pr(u_2 < y_i^* \le u_3)$
$= \Pr(y_i^* \le u_3) - \Pr(y_i^* \le u_2)$
$= \Pr(x_i\beta + \varepsilon_i \le u_3) - \Pr(x_i\beta + \varepsilon_i \le u_2)$
$= \Pr(\varepsilon_i \le u_3 - x_i\beta) - \Pr(\varepsilon_i \le u_2 - x_i\beta)$
$= \Phi[u_3 - x_i\beta] - \Phi[u_2 - x_i\beta]$
$= 1 - \Phi[x_i\beta - u_3] - 1 + \Phi[x_i\beta - u_2]$
$= \Phi[x_i\beta - u_2] - \Phi[x_i\beta - u_3]$
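One way to sanity-check the closed form for Pr(y_i=3) is a small simulation: draw ε from a standard normal, form y* = x_iβ + ε, and count how often it lands in (u_2, u_3]. The sketch below assumes a hypothetical index value `XB = 2.0` and borrows `_cut2` and `_cut3` from the oprobit output in these notes; the function names, seed, and number of draws are arbitrary choices.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf          # standard normal CDF, the Phi in the slides
U2, U3 = 1.269036, 2.247251     # _cut2 and _cut3 from the oprobit output
XB = 2.0                        # hypothetical index value x_i*beta

def prob_y3(xb, u2=U2, u3=U3):
    """Closed form from the derivation: Phi[xb - u2] - Phi[xb - u3]."""
    return PHI(xb - u2) - PHI(xb - u3)

def simulated_share_y3(xb, n=200_000, seed=12345, u2=U2, u3=U3):
    """Share of simulated y* = xb + eps, eps ~ N(0,1), landing in (u2, u3]."""
    rng = random.Random(seed)
    hits = sum(u2 < xb + rng.gauss(0.0, 1.0) <= u3 for _ in range(n))
    return hits / n

print("closed form:", round(prob_y3(XB), 4))
print("simulated:  ", round(simulated_share_y3(XB), 4))
```

The two numbers should agree up to simulation noise, which shrinks as the number of draws grows.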
13
• $\Pr(y_i=1) = 1 - \Phi[x_i\beta - u_1]$
• $\Pr(y_i=2) = \Phi[x_i\beta - u_1] - \Phi[x_i\beta - u_2]$
• $\Pr(y_i=3) = \Phi[x_i\beta - u_2] - \Phi[x_i\beta - u_3]$
• $\Pr(y_i=4) = \Phi[x_i\beta - u_3] - \Phi[x_i\beta - u_4]$
• $\Pr(y_i=5) = \Phi[x_i\beta - u_4]$
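The five formulas can be evaluated together. The sketch below plugs in the coefficients and cut points from the oprobit output and the sample means from the mfx output, both reported later in these notes; the helper name `outcome_probs` is hypothetical, and no constant is included because Stata's oprobit does not report one.

```python
from statistics import NormalDist

PHI = NormalDist().cdf

# Coefficients and cut points from the oprobit output; means from the mfx output.
BETA = {"male": 0.1281241, "age": -0.0202308, "educ": 0.0827086,
        "famincl": 0.2398957, "black": -0.221508, "othrace": -0.2425083,
        "smoke": -0.2086096, "smoke5": -0.1529619}
MEANS = {"male": 0.438062, "age": 39.8412, "educ": 13.2402,
         "famincl": 10.2131, "black": 0.124264, "othrace": 0.04124,
         "smoke": 0.289147, "smoke5": 0.081395}
CUTS = [0.4858634, 1.269036, 2.247251, 3.094606]

def outcome_probs(xb, cuts=CUTS):
    """Pr(y=1..5): differences of Phi[xb - u] across adjacent thresholds."""
    # Pad with 1 before the first threshold and 0 after the last one.
    upper = [1.0] + [PHI(xb - u) for u in cuts] + [0.0]
    return [upper[k] - upper[k + 1] for k in range(len(cuts) + 1)]

# Index evaluated at the sample means of the covariates.
xb_mean = sum(BETA[v] * MEANS[v] for v in BETA)
probs = outcome_probs(xb_mean)
for k, p in enumerate(probs, start=1):
    print(f"Pr(y={k}) = {p:.4f}")
```

The five probabilities sum to one by construction, and the value for y=5 should land close to the .34103717 that mfx reports for Pr(sr_health==5).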
14
• There are 5 possible choices for each person
• Only 1 is observed
• $L = \sum_i \ln[\Pr(y_i = k)]$, where $k$ is the category person $i$ actually reports
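A minimal sketch of this log likelihood, assuming the probit probabilities from the earlier slides. The toy data, the thresholds, and the function names are all made up for illustration; a real estimator would also search over β and the thresholds to maximize L.

```python
from math import log
from statistics import NormalDist

PHI = NormalDist().cdf

def outcome_probs(xb, cuts):
    """Pr(y=1..K) for one person with index xb."""
    upper = [1.0] + [PHI(xb - u) for u in cuts] + [0.0]
    return [upper[k] - upper[k + 1] for k in range(len(cuts) + 1)]

def log_likelihood(y, xb, cuts):
    """Sum over people of ln Pr(y_i = the category actually observed)."""
    return sum(log(outcome_probs(xb_i, cuts)[y_i - 1])
               for y_i, xb_i in zip(y, xb))

# Hypothetical toy data: three people, their index values, and their reports.
cuts = [0.5, 1.3, 2.2, 3.1]     # made-up thresholds, for illustration only
xb = [0.0, 2.0, 3.5]
y_matching = [1, 3, 5]          # reports that line up with the index
y_mismatch = [5, 1, 1]          # reports that contradict the index

print(log_likelihood(y_matching, xb, cuts))
print(log_likelihood(y_mismatch, xb, cuts))
```

Only the probability of the observed category enters each person's term, which is why reports that agree with the index give a larger (less negative) value.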
15
• Cancer control supplement to 1994 National Health Interview Survey
• Question: what observed characteristics predict self reported health (1-5 scale)
• 1=poor, 5=excellent
• Key covariates: income, education, age, current and former smoking status
• Programs
• sr_health_status.do, .dta, .log
16
• desc;
male        byte    %9.0g    =1 if male
age         byte    %9.0g    age in years
educ        byte    %9.0g    years of education
smoke       byte    %9.0g    current smoker
smoke5      byte    %9.0g    smoked in past 5 years
black       float   %9.0g    =1 if respondent is black
othrace     float   %9.0g    =1 if other race (white is ref)
sr_health   float   %9.0g    1-5 self reported health, 5=excel, 1=poor
famincl     float   %9.0g    log family income
17
• tab sr_health;
  1-5 self |
  reported |
   health, |
  5=excel, |
    1=poor |      Freq.     Percent        Cum.
-----------+-----------------------------------
         1 |        342        2.65        2.65
         2 |        991        7.68       10.33
         3 |      3,068       23.78       34.12
         4 |      3,855       29.88       64.00
         5 |      4,644       36.00      100.00
-----------+-----------------------------------
     Total |     12,900      100.00
18
• oprobit sr_health male age educ famincl black othrace smoke smoke5;
19
Ordered probit estimates                          Number of obs   =      12900
                                                  LR chi2(8)      =    2379.61
                                                  Prob > chi2     =     0.0000
Log likelihood = -16401.987                       Pseudo R2       =     0.0676

------------------------------------------------------------------------------
   sr_health |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .1281241   .0195747     6.55   0.000     .0897583    .1664899
         age |  -.0202308   .0008499   -23.80   0.000    -.0218966    -.018565
        educ |   .0827086   .0038547    21.46   0.000     .0751535    .0902637
     famincl |   .2398957   .0112206    21.38   0.000     .2179037    .2618878
       black |   -.221508    .029528    -7.50   0.000    -.2793818   -.1636341
     othrace |  -.2425083   .0480047    -5.05   0.000    -.3365958   -.1484208
       smoke |  -.2086096   .0219779    -9.49   0.000    -.2516855   -.1655337
      smoke5 |  -.1529619   .0357995    -4.27   0.000    -.2231277   -.0827961
-------------+----------------------------------------------------------------
       _cut1 |   .4858634    .113179          (Ancillary parameters)
       _cut2 |   1.269036     .11282
       _cut3 |   2.247251   .1138171
       _cut4 |   3.094606   .1145781
------------------------------------------------------------------------------
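Several entries in the output can be recomputed from the others, which is a useful way to read the table. The sketch below takes the coefficient and standard error for male, the log likelihood, and LR chi2 from the output, and assumes the usual definitions: a Wald z equal to coefficient over standard error, a normal-based 95% interval, and McFadden's pseudo R2. Variable names are arbitrary.

```python
from statistics import NormalDist

N01 = NormalDist()

# Values copied from the oprobit output.
coef, se = 0.1281241, 0.0195747            # row for male
loglik, lr_chi2 = -16401.987, 2379.61

z = coef / se                               # Wald z statistic
p_value = 2 * (1 - N01.cdf(abs(z)))         # two-sided p-value
crit = N01.inv_cdf(0.975)                   # about 1.96
ci = (coef - crit * se, coef + crit * se)   # 95% confidence interval

# LR chi2 = 2 * (loglik - loglik_null), so the null log likelihood follows,
# and McFadden's pseudo R2 is 1 - loglik / loglik_null.
loglik_null = loglik - lr_chi2 / 2
pseudo_r2 = 1 - loglik / loglik_null

print(f"z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI = [{ci[0]:.7f}, {ci[1]:.7f}]")
print(f"pseudo R2 = {pseudo_r2:.4f}")
```

The recomputed z, interval, and pseudo R2 should line up with the male row and the header of the output, up to rounding.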
20
• Marginal effects/changes in probabilities are now a function of 2 things
– Point of expansion (x’s)
– Frame of reference for outcome (y)
• STATA
– Picks mean values for x’s
– You pick the value of y
21
• Consider y=5
• $d\Pr(y_i=5)/dx_i = d\Phi[x_i\beta - u_4]/dx_i = \beta\phi[x_i\beta - u_4]$
• Consider y=3
• $d\Pr(y_i=3)/dx_i = \beta\phi[x_i\beta - u_2] - \beta\phi[x_i\beta - u_3]$
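Differentiating each of the five Pr(y_i=k) formulas gives a marginal effect for every outcome, not just y=5 and y=3. The sketch below applies this to age. The coefficient and cut points come from the oprobit output; `XB_MEAN` is an approximate index at the sample means, worked out by hand from the coefficient and mean columns of the output, so treat it as an assumption. The function name is hypothetical.

```python
from statistics import NormalDist

pdf = NormalDist().pdf          # standard normal density, the lowercase phi

CUTS = [0.4858634, 1.269036, 2.247251, 3.094606]   # _cut1.._cut4 from oprobit
BETA_AGE = -0.0202308                               # age coefficient from oprobit
# Index at the sample means: sum of coefficient * mean over the eight
# covariates, computed by hand from the reported tables (approximate).
XB_MEAN = 2.685

def marginal_effects(beta, xb, cuts=CUTS):
    """d Pr(y=k)/dx for k = 1..5: beta * (phi[xb - u_{k-1}] - phi[xb - u_k])."""
    # The end categories have only one threshold, so pad the densities with 0.
    dens = [0.0] + [pdf(xb - u) for u in cuts] + [0.0]
    return [beta * (dens[k] - dens[k + 1]) for k in range(len(cuts) + 1)]

effects = marginal_effects(BETA_AGE, XB_MEAN)
for k, e in enumerate(effects, start=1):
    print(f"dPr(y={k})/d age = {e:+.7f}")
```

The effects sum to zero across the five outcomes, since the probabilities always add to one, and the y=5 entry should come out close to the -.0074214 that mfx reports for age.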
22
• $x_i\beta = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k$
– $x_{2i}$ is yes or no (1 or 0)
• $\Delta\Pr(y_i=5) = \Phi[\beta_0 + x_{1i}\beta_1 + \beta_2 + x_{3i}\beta_3 + \dots + x_{ki}\beta_k - u_4] - \Phi[\beta_0 + x_{1i}\beta_1 + x_{3i}\beta_3 + \dots + x_{ki}\beta_k - u_4]$
• Change in the probability between $x_{2i}=1$ and $x_{2i}=0$
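For a dummy variable the change in Pr(y_i=5) is a difference of two normal CDFs, each with the threshold u_4 subtracted inside Φ as in Pr(y_i=5) = Φ[x_iβ - u_4]. The sketch below applies this to the male dummy. The coefficient and `_cut4` come from the oprobit output; `XB_OTHERS` is the contribution of the other seven covariates at their sample means, worked out by hand from the reported tables, so it is an approximation. The function name is hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf

U4 = 3.094606            # _cut4 from the oprobit output
BETA_MALE = 0.1281241    # coefficient on the male dummy from the oprobit output
# Contribution of all other covariates at their sample means, computed by
# hand from the coefficient and mean columns of the output (approximate).
XB_OTHERS = 2.6288

def discrete_change_y5(beta_dummy, xb_others, u4=U4):
    """Pr(y=5 | dummy=1) - Pr(y=5 | dummy=0), holding other covariates fixed."""
    return PHI(xb_others + beta_dummy - u4) - PHI(xb_others - u4)

delta = discrete_change_y5(BETA_MALE, XB_OTHERS)
print(f"change in Pr(y=5) for male 0 -> 1: {delta:.4f}")
```

The result should be close to the .0471251 that mfx reports for male, which it flags as a discrete change from 0 to 1.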
23
• mfx compute, predict(outcome(5));
24
• mfx compute, predict(outcome(5));
Marginal effects after oprobit
      y  = Pr(sr_health==5) (predict, outcome(5))
         =  .34103717
------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
    male*|   .0471251      .00722    6.53   0.000    .03298   .06127   .438062
     age |  -.0074214      .00031  -23.77   0.000  -.008033  -.00681   39.8412
    educ |   .0303405      .00142   21.42   0.000   .027565  .033116   13.2402
 famincl |   .0880025      .00412   21.37   0.000    .07993  .096075   10.2131
   black*|  -.0781411      .00996   -7.84   0.000  -.097665 -.058617   .124264
 othrace*|  -.0843227      .01567   -5.38   0.000  -.115043 -.053602    .04124
   smoke*|  -.0749785      .00773   -9.71   0.000   -.09012 -.059837   .289147
  smoke5*|  -.0545062      .01235   -4.41   0.000  -.078719 -.030294   .081395
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
25
• Males are 4.7 percentage points more likely to report excellent health
• Each year of age decreases the chance of reporting excellent health by 0.7 percentage points
• Current smokers are 7.5 percentage points less likely to report excellent health
26
• Wald tests/-2 log likelihood tests are done the exact same way as in PROBIT and LOGIT
27
• Use PRCHANGE to calculate marginal effects for a specific person
• prchange, x(age=40 black=0 othrace=0 smoke=0 smoke5=0 educ=16);
– When a variable is NOT specified (famincl), STATA takes the sample mean.
28
• PRCHANGE will produce results for all outcomes
male
        Avg|Chg|           1           2           3           4           5
0->1   .0203868   -.0020257  -.00886671  -.02677558  -.01329902   .05096698
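The 0->1 row for male can be rebuilt by hand from the oprobit estimates. The sketch below is an illustration rather than what PRCHANGE does internally: it sets the covariates to the values in the `x()` option, holds famincl at its sample mean from the mfx output, and differences the five predicted probabilities between male=1 and male=0. Function and variable names are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf

# Coefficients and cut points from the oprobit output.
BETA = {"male": 0.1281241, "age": -0.0202308, "educ": 0.0827086,
        "famincl": 0.2398957, "black": -0.221508, "othrace": -0.2425083,
        "smoke": -0.2086096, "smoke5": -0.1529619}
CUTS = [0.4858634, 1.269036, 2.247251, 3.094606]
# The person from the prchange call; famincl is not specified there, so it is
# held at its sample mean as reported in the mfx output.
PERSON = {"age": 40, "educ": 16, "famincl": 10.2131,
          "black": 0, "othrace": 0, "smoke": 0, "smoke5": 0}

def outcome_probs(xb, cuts=CUTS):
    """Pr(y=1..5) for one index value."""
    upper = [1.0] + [PHI(xb - u) for u in cuts] + [0.0]
    return [upper[k] - upper[k + 1] for k in range(len(cuts) + 1)]

def index(person, male):
    """x*beta for the person with the male dummy set explicitly."""
    return BETA["male"] * male + sum(BETA[v] * x for v, x in person.items())

p0 = outcome_probs(index(PERSON, male=0))
p1 = outcome_probs(index(PERSON, male=1))
changes = [b - a for a, b in zip(p0, p1)]
avg_abs = sum(abs(c) for c in changes) / len(changes)

print("Avg|Chg|:", round(avg_abs, 7))
print("changes 1..5:", [round(c, 7) for c in changes])
```

The changes should track the 0->1 row closely, and Avg|Chg| is simply the mean of their absolute values.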
29
age
            Avg|Chg|           1           2           3           4
Min->Max   .13358317    .0184785   .06797072   .17686112   .07064757
   -+1/2   .00321942   .00032518   .00141642   .00424452   .00206241
  -+sd/2   .03728014   .00382077   .01648743   .04910323    .0237889
MargEfct   .00321947   .00032515   .00141639   .00424462   .00206252
30