Approximate Inference for the Multinomial Logit Model
M.Rekkas∗
Abstract
Higher order asymptotic theory is used to derive p-values that achieve superior accuracy
compared to the p-values obtained from traditional tests for inference about parameters of the
multinomial logit model. Simulations are provided to assess the finite sample behavior of the
test statistics considered and to demonstrate the superiority of the higher order method. Stata
code that outputs these p-values is available to facilitate the implementation of these methods
for the end-user.
∗ Department of Economics, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, email: mrekkas@sfu.ca, phone: (778) 782-6793, fax: (778) 782-5944. I would like to thank Nancy Reid and two anonymous referees for helpful comments and suggestions. The support of the Natural Sciences and Engineering Research Council of Canada is gratefully appreciated.
1 Introduction
The multinomial logit specification is the most popular discrete choice model in applied statistical
disciplines such as economics. Recent developments in higher order likelihood asymptotic methods
are applied to obtain highly accurate tail probabilities for testing parameters of interest in these
models. This involves using an adjusted version of the standard log likelihood ratio statistic.
Simulations are provided to demonstrate the significant improvements in accuracy that can be achieved over conventional first-order methods, that is, methods that achieve distributional accuracy of order O(n−1/2), where n is the sample size. The resulting p-value expressions for assessing scalar parameters of interest are remarkably simple and can easily be programmed into conventional statistical packages. The results have particular appeal to applied statisticians dealing with discrete choice models where the number of observations may be limited. More generally, these higher-order methods can be applied regardless of the sample size in order to determine the extent to which first-order methods can be relied upon.
The two main contributions are as follows. First, higher order likelihood theory is used to obtain highly accurate p-values for testing parameters of the multinomial logit model. Second, Stata code is made available to the end-user for this model.1 While the past two decades have seen significant advances in likelihood asymptotic methods, empirical work employing these techniques has lagged far behind. This disconnect is undoubtedly due to the lack of user-friendly computer code. The Stata programs are provided as a means to bridge this gap.
2 Model
For a given parametric model and observed data y = (y1, y2, . . . , yn), denote the log-likelihood function as l(θ), where θ is the full parameter vector of the model, expressed as θ = (ψ, λT)T, with scalar interest parameter ψ and nuisance parameter vector λ. Denote the overall maximum likelihood estimator as θ̂ = (ψ̂, λ̂T)T = argmaxθ l(θ) and the constrained maximum likelihood estimator as θ̂ψ = (ψ, λ̂ψT)T = argmaxλ l(θ) for fixed ψ. Let jθθT(θ̂) = −∂2l(θ)/∂θ∂θT|θ̂ denote the observed information matrix and jλλT(θ̂ψ) = −∂2l(θ)/∂λ∂λT|θ̂ψ the observed nuisance information matrix. Inference about ψ is typically based on two departure measures,
1 Brazzale (1999) provides R code for approximate conditional inference in logistic and loglinear models but does not consider the multinomial logit model.
known as the Wald departure (q) and the signed log likelihood ratio departure (r):
q = (ψ̂ − ψ){|jθθT(θ̂)|/|jλλT(θ̂ψ)|}^(1/2)    (1)

r = sgn(ψ̂ − ψ)[2{l(θ̂) − l(θ̂ψ)}]^(1/2).    (2)
Note that the expression in (1) is not the usual Wald statistic for which the estimated standard
error of ψ̂ is used for standardization.2 Approximate p-values are given by Φ(q) and Φ(r), where
Φ(·) represents the standard normal cumulative distribution function. These methods are referred to as first-order methods as q and r are asymptotically distributed as standard normal with first-order accuracy (i.e., the relative error of the approximation is O(n−1/2)). In small and even moderate samples, these methods can be highly inaccurate.
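As a concrete illustration of (1) and (2), the sketch below computes the two first-order p-values Φ(q) and Φ(r). It assumes, purely for illustration, a one-parameter exponential model for the data (so the nuisance-information determinant drops out); the function names are my own.

```python
import math

def first_order_pvalues(y, theta0):
    """Wald-type (q) and signed-root (r) p-values for the rate theta of an
    exponential model -- a hypothetical one-parameter illustration of (1)-(2),
    where there is no nuisance parameter."""
    n, s = len(y), sum(y)
    theta_hat = n / s                               # MLE: 1 / sample mean
    loglik = lambda th: n * math.log(th) - th * s   # l(theta)
    j_hat = n / theta_hat ** 2                      # observed information at the MLE
    q = (theta_hat - theta0) * math.sqrt(j_hat)
    r = math.copysign(
        math.sqrt(2.0 * (loglik(theta_hat) - loglik(theta0))),
        theta_hat - theta0)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return Phi(q), Phi(r)
```

For y = (1.0, 2.0, 0.5, 1.5) and theta0 = 1 this gives Φ(q) ≈ 0.309 and Φ(r) ≈ 0.321: even in this toy setting the two first-order answers already disagree in the second decimal.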
Barndorff-Nielsen (1986) derived the modified signed log likelihood ratio statistic for higher
order inference
r∗ = r − (1/r) log(r/Q),    (3)
where r is the signed likelihood ratio departure in (2) and Q is a standardized maximum likelihood
departure term. The statistic r∗ is also asymptotically standard normal, but when the distribution of y is continuous it achieves third-order accuracy. Tail area approximations
can be obtained by using Φ(r∗). For exponential family models, several definitions for Q exist; see, for example, Barndorff-Nielsen (1991), Pierce and Peters (1992), Fraser and Reid (1995), and Jensen (1995). The derivation of Q given by Fraser and Reid (1995) will be used in this paper.
While the Fraser and Reid version only applies to continuous data, saddlepoint arguments can
be invoked to argue that the method is still valid for exponential family models in the discrete
setting. Because the maximum likelihood estimates take values on a lattice, however, technical issues surrounding the exact order of the error produce p-values with distributional accuracy of order O(n−1). For more general models, Davison et al. (2006) provide a framework for handling discrete data that also achieves second-order accuracy. Since the present context involves an exponential family model, the Fraser and Reid (1995) methodology is directly applicable.
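Once r and Q are in hand, the adjustment in (3) is a one-line computation. A minimal sketch (the guard for small |r|, where r∗ is numerically delicate, is my own assumption, not part of the paper):

```python
import math

def rstar_pvalue(r, Q):
    """Tail probability from Barndorff-Nielsen's r* = r - (1/r) log(r/Q),
    where r is the signed root in (2) and Q a standardized MLE departure."""
    if abs(r) < 1e-8:            # r* is numerically unstable near r = 0
        z = r
    else:
        z = r - (1.0 / r) * math.log(r / Q)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

For instance, inputs r = −0.4635 and Q = −0.5 give a p-value of about 0.265.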
Fraser and Reid (1995) used tangent exponential models to derive a highly accurate approximation to the p-value for testing a scalar interest parameter. The theory for obtaining Q involves
two main components. The first component requires a reduction of dimension by approximate
2 The standard Wald statistic will be considered in the examples and simulations as this is the statistic that is typically reported in conventional statistical packages.
ancillarity.3 This step reduces the dimension of the variable to the dimension of the full parameter. The second component requires a further reduction of dimension from the dimension of the
parameter to the dimension of the scalar interest parameter. These two components are achieved
through two key reparameterizations: from the parameter θ to a new parameter ϕ, and from ϕ to a new scalar parameter χ. The parameter ϕ is the local canonical parameter of an approximating exponential model, and χ is a scaled version of ϕ.
The canonical parameterization of θ is given by:
ϕT(θ) = {∂l(θ; y)/∂y}|y0 V,    (4)
where V = (v1 , ..., vp ) is an ancillary direction array that can be obtained as
V = ∂y/∂θT|θ̂ = {∂k(y, θ)/∂yT}^(−1) {∂k(y, θ)/∂θT}|θ̂,    (5)
where k = k(y, θ) = (k1 , ..., kn )T is a full dimensional pivotal quantity. Fraser and Reid (1995)
obtain this conditionality reduction without the computation of an explicit ancillary statistic.
The second reparameterization is to χ(θ), where χ(θ) is constructed to act as a scalar canonical
parameter in the new parameterization:
χ(θ) = [ψϕT(θ̂ψ)/|ψϕT(θ̂ψ)|] ϕ(θ),    (6)
where ψϕT(θ) = ∂ψ(θ)/∂ϕT = (∂ψ(θ)/∂θT)(∂ϕ(θ)/∂θT)^(−1). The matrices jϕϕT(θ̂) and j(λλT)(θ̂ψ) are the observed information matrix and observed nuisance information matrix in the ϕ parameterization, respectively, obtained as:

|jϕϕT(θ̂)| = |jθθT(θ̂)||ϕθT(θ̂)|^(−2)    (7)

|j(λλT)(θ̂ψ)| = |jλλT(θ̂ψ)||ϕλT(θ̂ψ)T ϕλT(θ̂ψ)|^(−1).    (8)
The standardized maximum likelihood departure is then given by
Q = sgn(ψ̂ − ψ)|χ(θ̂) − χ(θ̂ψ)|{|jϕϕT(θ̂)|/|j(λλT)(θ̂ψ)|}^(1/2).    (9)
Notice that for the canonical parameter ϕ(θ) = (ψ, λT)T, the expression in (9) simplifies to the Wald departure given in (1). In this case, more accurate inference about ψ is simply based on (3) with the conventional first-order quantities given in (1) and (2) as inputs.
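The simplification can be checked numerically: with ϕ(θ) = θ the Jacobian ϕθT is the identity, (7) and (8) leave the determinants unchanged, and χ(θ̂) − χ(θ̂ψ) reduces to ψ̂ − ψ. A small sketch under that assumption (the 2 × 2 setup and all numbers are illustrative, not from the paper):

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as [[a, b], [c, d]]."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def Q_canonical(psi_hat, psi, j_full, j_nuis):
    """Eq. (9) in the canonical case phi(theta) = (psi, lambda')':
    sgn(psi_hat - psi)|chi(theta_hat) - chi(theta_hat_psi)| is just
    psi_hat - psi, and the determinant ratio is |j_thetatheta'|/|j_lambdalambda'|,
    so Q collapses to the Wald departure q of (1)."""
    return (psi_hat - psi) * math.sqrt(det2(j_full) / j_nuis)
```

With full information j_full = [[4, 1], [1, 2]], scalar nuisance information j_nuis = 2, and departure ψ̂ − ψ = 0.3, this gives Q = 0.3·√(7/2) ≈ 0.561.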
3 Fraser and Reid show that an exact ancillary statistic is not required for this reduction.
Now, consider the multinomial logit model. Suppose there are J + 1 response categories, yi = (yi0, . . . , yiJ), with corresponding probabilities (πi0, . . . , πiJ), and K explanatory variables xi with associated parameter vectors βj, where each βj is K × 1. The probabilities are derived as:
πij = Λ(βjT xi) = exp(βjT xi)/{1 + Σ_{m=1}^J exp(βmT xi)},   j = 0, 1, . . . , J,
with the normalization β0 = 0. For data y = (y1 , . . . , yn ) the likelihood function is given by
L(β) = Π_{i=1}^n πi1^{yi1} πi2^{yi2} · · · πiJ^{yiJ} (1 − πi1 − · · · − πiJ)^{1 − yi1 − · · · − yiJ}

     = exp{β1T Σ_{i=1}^n yi1 xi + · · · + βJT Σ_{i=1}^n yiJ xi + Σ_{i=1}^n log[1/{1 + Σ_{m=1}^J exp(βmT xi)}]}.
The corresponding log likelihood is given by
l(β) = β1T Σ_{i=1}^n yi1 xi + · · · + βJT Σ_{i=1}^n yiJ xi + Σ_{i=1}^n log[1/{1 + Σ_{m=1}^J exp(βmT xi)}].    (10)
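The probabilities and the log likelihood in (10) are straightforward to code directly. A minimal sketch (function names are my own):

```python
import math

def mnl_probs(betas, x):
    """Multinomial logit probabilities (pi_i0, ..., pi_iJ) for one observation,
    with the base-category normalization beta_0 = 0.  `betas` holds the J
    coefficient vectors beta_1, ..., beta_J; `x` is the K-vector of covariates."""
    scores = [sum(b * v for b, v in zip(beta_j, x)) for beta_j in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    return [1.0 / denom] + [math.exp(s) / denom for s in scores]

def mnl_loglik(betas, X, Y):
    """Log likelihood (10), with Y[i] the observed category in {0, ..., J}."""
    return sum(math.log(mnl_probs(betas, x)[y]) for x, y in zip(X, Y))
```

With J = 1 and beta_1 = 0, every observation has probability 1/2 in each category, so the log likelihood is n·log(1/2); setting J = 1 also recovers the standard logit model as a special case.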
The exponential family form in (10) gives the canonical parameter ϕ(θ) = (β1T, . . . , βJT). Thus, if interest is in a scalar component of βj, the maximum likelihood departure Q is given by expression (1). To calculate this expression, the first and second derivatives of the log likelihood function and related quantities are required. The first and second derivatives for this model are easily calculated:
lβjk = Σ_{i=1}^n (yij − πij) xik,   j = 1, . . . , J and k = 1, . . . , K

lβjk βjl = −Σ_{i=1}^n πij (1 − πij) xik xil,   j = 1, . . . , J and k, l = 1, . . . , K

lβjk βj′l = −Σ_{i=1}^n πij πij′ xik xil,   j, j′ = 1, . . . , J and k, l = 1, . . . , K, for j ≠ j′.
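For the J = 1 (logit) special case, these derivatives can be verified against numerical differentiation of the log likelihood. A sketch under that assumption (names are hypothetical):

```python
import math

def pi1(beta, x):
    """pi_i1 for the J = 1 (logit) case, with beta and x K-vectors."""
    s = sum(b * v for b, v in zip(beta, x))
    return math.exp(s) / (1.0 + math.exp(s))

def loglik(beta, X, Y):
    """Bernoulli log likelihood over observations (x_i, y_i)."""
    return sum(y * math.log(pi1(beta, x)) + (1 - y) * math.log(1.0 - pi1(beta, x))
               for x, y in zip(X, Y))

def score(beta, X, Y):
    """Analytic first derivative: sum_i (y_i1 - pi_i1) x_ik for each k."""
    return [sum((y - pi1(beta, x)) * x[k] for x, y in zip(X, Y))
            for k in range(len(beta))]

def score_numeric(beta, X, Y, h=1e-6):
    """Central finite differences of the log likelihood, for comparison."""
    out = []
    for k in range(len(beta)):
        bp = list(beta); bp[k] += h
        bm = list(beta); bm[k] -= h
        out.append((loglik(bp, X, Y) - loglik(bm, X, Y)) / (2.0 * h))
    return out
```

The two agree to roughly the order of h², a quick sanity check before plugging the second derivatives into the information determinants used by (1) and (9).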
To examine the higher-order adjustment, two simple examples are considered.4 For the first
example, data from a real economic field experiment are used to estimate the parameters of the
model. In this example, there are five independent variables and a dependent variable that can take on one of two values, 0 or 1; i.e., the model is the standard logit model.5 The dataset for this example is provided in Table 1. The estimation results (with the constant suppressed) with “0” as the comparison group are provided in Table 2. Odds ratios can easily be calculated by exponentiating the coefficients. The conventional p-values associated with the maximum likelihood
4 All computations were done in Stata 8. Code for the two examples is accessible from www.sfu.ca/∼mrekkas. Code for the first example is also provided in R.
5 The special case where the dependent variable can take only one of two values has previously been considered. For more on the logit model see Brazzale (1999).
estimates are reported along with those produced from the signed log likelihood ratio departure
given in (2) and from the modified log likelihood ratio statistic given in (3). These resulting
p-values are denoted as MLE, LR, and RSTAR, respectively. The p-values associated with the
maximum likelihood estimates are provided for comparison, as these are the p-values output by most conventional statistical packages. It should be noted that using the r∗ formula in (3) along with Φ(r∗) produces a p-value interpreted as the probability to the left of the observed data point. However, for consistency with output reported by statistical packages, the p-values associated with r∗ in the tables are always reported to reflect tail probabilities. As can be discerned from Table 2, even
with 40 observations, the p-values produced from the three different methods are quite different
and, depending on the method chosen, would lead to different inferences about the parameters.
Next, the second example considers a dependent variable that can take on one of three values,
1, 2 or 3. The dataset is provided in Table 3.6 Results from this estimation (with the constants
suppressed) with group 1 as the comparison group are provided in Table 4. Relative risk ratios
can be obtained by exponentiating the coefficients. Once again the table reveals a wide range of p-values. For instance, the coefficient for variable X2 in the Y=3 equation would be deemed insignificant at the 5% level using the conventional MLE test, while it would be deemed significant at this level using the LR or RSTAR methods.
To investigate the properties of the higher-order method in small and large samples, two simulations are conducted. Accuracy is assessed by computing the observed p-values for each method (MLE, LR, RSTAR) and recording several criteria: coverage probability, coverage error, upper and lower error probabilities, and coverage bias. The coverage probability records the percentage of intervals that contain the true parameter value. The coverage error records the absolute difference between the nominal level and the coverage probability. The upper (lower) error probability records the percentage of samples in which the true parameter value falls above (below) the interval. And the coverage bias is the sum of the absolute differences between the upper and lower error probabilities and their nominal levels.
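These criteria are mechanical to compute from a set of simulated confidence limits. A sketch (the interval format is an assumption on my part):

```python
def coverage_criteria(intervals, truth, level):
    """Criteria recorded in the simulations.  `intervals` is a list of
    (lower, upper) confidence limits, `truth` the true parameter value,
    `level` the nominal level (e.g. 0.95, so each tail is nominally 0.025)."""
    n = len(intervals)
    lower_err = sum(truth < lo for lo, hi in intervals) / n  # truth below interval
    upper_err = sum(truth > hi for lo, hi in intervals) / n  # truth above interval
    coverage = 1.0 - lower_err - upper_err
    tail = (1.0 - level) / 2.0                               # nominal tail level
    return {"coverage": coverage,
            "error": abs(level - coverage),
            "lower": lower_err,
            "upper": upper_err,
            "bias": abs(lower_err - tail) + abs(upper_err - tail)}
```

An interval procedure can have exact overall coverage yet badly skewed tails; the bias criterion isolates that skewness, which the plain coverage error cannot detect.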
The first simulation generates 10,000 random samples each of size 50 from a dataset of brand
choice with two independent variables representing gender and age.7 The simulated dependent
variable can take one of three different values representing one of three different brands. The
data are provided in Table 5, where X0 represents the constant, X1 represents the gender of the
6 This dataset consists of a sample of size 30 from the car choice data available at www.stata-press.com/data/r8/.
7 The dataset consists of a sample of size 50 from the data available at www.ats.ucla.edu/stat/stata/dae/.
consumer (coded 1 if the consumer is female) and X2 represents the age of the consumer. The
dependent variable is simulated under the following conditions: the first brand was chosen as the
base category and the true values for the parameters were set as -11.7747 and -22.7214 for the
constants of brands 2 and 3, respectively, 0.5238 and 0.4659 for the parameter associated with
the gender variable for brands 2 and 3, respectively, and 0.3682 and 0.6859 for the parameter
associated with age for brands 2 and 3, respectively. The results from this simulation are recorded in Table 6 for nominal 90%, 95%, and 99% confidence intervals covering the true brand 2 gender parameter of 0.5238. The superiority of the higher-order method in terms
of coverage error and coverage bias is evident. Notice the skewed tail probabilities produced by
both first-order methods.
The second simulation generates 10,000 random samples using the full dataset of 735 observations under conditions similar to those set out in the first simulation. The dataset is not listed but is available at the website provided earlier. The results from this simulation are provided in Table 7, again for nominal 90%, 95%, and 99% confidence intervals covering the true brand 2 gender parameter of 0.5238. With this larger sample size the first-order methods perform predictably better; the asymmetry in the tails, while diminished, still persists.
3 Conclusion
In this paper higher order likelihood asymptotic theory was applied to test parameters of the multinomial logit model; improvements over first-order methods were shown using two simulations. Stata code has been made available to facilitate the implementation of these higher order adjustments.
References
[1] Barndorff-Nielsen, O., 1986, Inference on Full or Partial Parameters Based on the Standardized Signed Log Likelihood Ratio, Biometrika 73, 307-322.
[2] Barndorff-Nielsen, O., 1991, Modified Signed Log-Likelihood Ratio, Biometrika 78, 557-563.
[3] Brazzale, A., 1999, Approximate Conditional Inference in Logistic and Loglinear Models, Journal of Computational and Graphical Statistics 8(3), 653-661.
[4] Davison, A., Fraser, D., Reid, N., 2006, Improved Likelihood Inference for Discrete Data, Journal of the Royal Statistical Society Series B 68, 495-508.
[5] Fraser, D., Reid, N., 1995, Ancillaries and Third-Order Significance, Utilitas Mathematica 47, 33-53.
[6] Jensen, J., 1995, Saddlepoint Approximations, Oxford University Press, New York.
[7] Lugannani, R., Rice, S., 1980, Saddlepoint Approximation for the Distribution of the Sums of Independent Random Variables, Advances in Applied Probability 12, 475-490.
[8] Pierce, D., Peters, D., 1992, Practical Use of Higher Order Asymptotics for Multiparameter Exponential Families (with discussion), Journal of the Royal Statistical Society Series B 54, 701-738.
Table 1: Data from Field Experiment

Y  X0  X1  X2   X3     X4  X5      Y  X0  X1  X2   X3     X4  X5
0   1   0   0   10.9    3   1      1   1   1   0   12.25   3   1
1   1   0   0    8.2    2   2      0   1   1   0   11      2   0
0   1   0   0   14.6    2   1      1   1   1   0   10.2    3   2
1   1   0   0    9.33   2   1      0   1   0   1   11.5    2   0
0   1   0   0   11.5    3   6      0   1   0   1   11.67   2   6
0   1   0   0   13.33   1   4      0   1   0   1   12.75   4   2
1   1   0   0   12.33   5   3      0   1   0   1   12.2    3   0
1   1   0   0   13      4   1      1   1   0   1    9.33   4   2
0   1   0   0   11.5    3   0      0   1   0   1   10.2    2   1
1   1   0   0   10.5    2   3      1   1   0   1   11      2   0
1   1   0   0    9.8    4   2      1   1   0   1   10.8    2   4
0   1   0   0    9.33   4   1      0   1   1   1   13.33   1   1
0   1   1   0   13.5    3   1      0   1   1   1    9.33   4   0
0   1   1   0   12      2   1      0   1   1   1    8.67   1   0
1   1   1   0    8.2    1   0      0   1   1   1    9.67   3   4
1   1   1   0   11.8    2   1      0   1   1   1   10      3  14
1   1   1   0   12.5    1   0      1   1   1   1   10.25   6   0
1   1   1   0    9.67   2   2      0   1   1   1    9.33   2   5
1   1   1   0    9.5    3   0      0   1   1   1   10.5    2   0
0   1   1   0   10.75   3   0      0   1   1   1    9.33   1   1
Table 2: Estimation Results

                                  p-values
      Coefficient    SE       MLE      LR       RSTAR
X1     -0.3665     0.7689    0.3168   0.3160   0.3658
X2     -1.7582     0.8004    0.0140   0.0094   0.0153
X3     -0.4703     0.2636    0.0372   0.0274   0.0372
X4      0.3468     0.3190    0.1385   0.1361   0.1490
X5     -0.1097     0.1815    0.2728   0.2553   0.3047
Table 3: Data

Y  X0  X1   X2       Y  X0  X1   X2
3   1   1   46.7     1   1   1   21.6
1   1   1   26.1     2   1   0   44.4
1   1   1   32.7     3   1   1   44.7
2   1   0   49.2     2   1   1   49
1   1   1   24.3     3   1   1   43.8
1   1   0   39       3   1   1   46.6
1   1   1   33       2   1   0   45.6
1   1   1   20.3     3   1   0   40.7
2   1   1   38       3   1   1   46.7
1   1   0   60.4     2   1   0   49.2
1   1   1   69       2   1   1   38
2   1   1   27.7     3   1   1   44.7
2   1   1   41       3   1   1   43.8
2   1   1   65.6     3   1   1   46.6
1   1   1   24.8     3   1   1   46.7
Table 4: Estimation Results

                                     p-values
Y        Coefficient    SE       MLE      LR       RSTAR
2   X1    -0.2525     1.1361    0.8241   0.8238   0.8337
    X2     0.0858     0.0519    0.0980   0.0753   0.0950
3   X1     1.5998     1.4213    0.2603   0.2419   0.2920
    X2     0.1000     0.0511    0.0502   0.0273   0.0397
Table 5: Simulation Data

X0  X1  X2    X0  X1  X2    X0  X1  X2    X0  X1  X2    X0  X1  X2
 1   0  34     1   0  33     1   1  32     1   0  32     1   1  38
 1   1  36     1   1  32     1   0  31     1   1  32     1   0  32
 1   1  32     1   1  35     1   1  36     1   1  36     1   1  31
 1   0  32     1   1  30     1   0  34     1   0  31     1   1  32
 1   1  36     1   0  32     1   1  36     1   1  33     1   1  32
 1   0  32     1   1  33     1   1  32     1   0  31     1   1  32
 1   1  34     1   1  32     1   1  32     1   1  31     1   0  32
 1   1  28     1   1  36     1   1  29     1   0  36     1   1  32
 1   1  37     1   1  32     1   1  33     1   1  32     1   1  28
 1   1  32     1   1  31     1   1  32     1   1  34     1   0  35
Table 6: Simulation Results for n = 50

CI    Method   Coverage     Coverage   Lower        Upper        Coverage
               Probability  Error      Probability  Probability  Bias
90%   MLE      0.9085       0.0085     0.0411       0.0504       0.0093
      LR       0.8896       0.0104     0.0498       0.0606       0.0108
      RSTAR    0.9065       0.0065     0.0460       0.0475       0.0065
95%   MLE      0.9602       0.0102     0.0181       0.0217       0.0102
      LR       0.9423       0.0077     0.0261       0.0316       0.0077
      RSTAR    0.9533       0.0033     0.0233       0.0234       0.0033
99%   MLE      0.9966       0.0066     0.0012       0.0022       0.0066
      LR       0.9867       0.0033     0.0052       0.0081       0.0033
      RSTAR    0.9910       0.0010     0.0041       0.0049       0.0010
Table 7: Simulation Results for n = 735

CI    Method   Coverage     Coverage   Lower        Upper        Coverage
               Probability  Error      Probability  Probability  Bias
90%   MLE      0.8943       0.0057     0.0498       0.0559       0.0061
      LR       0.8938       0.0062     0.0498       0.0564       0.0066
      RSTAR    0.8945       0.0055     0.0506       0.0549       0.0055
95%   MLE      0.9460       0.0040     0.0250       0.0290       0.0040
      LR       0.9453       0.0047     0.0250       0.0297       0.0047
      RSTAR    0.9460       0.0040     0.0256       0.0284       0.0040
99%   MLE      0.9880       0.0020     0.0048       0.0072       0.0024
      LR       0.9878       0.0022     0.0048       0.0074       0.0026
      RSTAR    0.9879       0.0021     0.0049       0.0072       0.0023