Convergence rates and asymptotic standard errors
for MCMC algorithms for Bayesian probit regression
Vivekananda Roy and James P. Hobert
Department of Statistics
University of Florida
March 2007
Abstract
Consider a probit regression problem in which $Y_1, \dots, Y_n$ are independent Bernoulli random
variables such that $\Pr(Y_i = 1) = \Phi(x_i^T \beta)$, where $x_i$ is a p-dimensional vector of known covariates associated with $Y_i$, β is a p-dimensional vector of unknown regression coefficients and Φ(·)
denotes the standard normal distribution function. We study Markov chain Monte Carlo algorithms
for exploring the intractable posterior density that results when the probit regression likelihood is
combined with a flat prior on β. We prove that Albert and Chib’s (1993) data augmentation algorithm and Liu and Wu’s (1999) PX-DA algorithm both converge at a geometric rate, which ensures
the existence of central limit theorems (CLTs) for ergodic averages under a second moment condition. While these two algorithms are essentially equivalent in terms of computational complexity,
results in Hobert and Marchev (2006) imply that the PX-DA algorithm is theoretically more efficient
in the sense that the asymptotic variance in the CLT under the PX-DA algorithm is no larger than
that under Albert and Chib’s algorithm. We also construct minorization conditions that allow us
to exploit regenerative simulation techniques for the consistent estimation of asymptotic variances.
As an illustration, we apply our results to van Dyk and Meng’s (2001) lupus data. This example
demonstrates that huge gains in efficiency are possible by using the PX-DA algorithm instead of
Albert and Chib’s algorithm.
Key words and phrases. Asymptotic variance, Central limit theorem, Data augmentation algorithm, Geometric ergodicity,
Minorization condition, PX-DA algorithm, Regeneration, Reversible Markov chain
1 Introduction
Suppose that $Y_1, \dots, Y_n$ are independent Bernoulli random variables such that $\Pr(Y_i = 1) = \Phi(x_i^T \beta)$,
where $x_i$ is a p × 1 vector of known covariates associated with $Y_i$, β is a p × 1 vector of unknown
regression coefficients and Φ(·) denotes the standard normal distribution function. For $y_i \in \{0, 1\}$, we
have
$$
\Pr(Y_1 = y_1, \dots, Y_n = y_n \mid \beta) = \prod_{i=1}^{n} \big[\Phi(x_i^T \beta)\big]^{y_i} \big[1 - \Phi(x_i^T \beta)\big]^{1 - y_i} .
$$
A popular method of making inferences about β is through a Bayesian analysis with a flat prior on β.
Let y = (y1 , . . . , yn ) denote the observed data and define the marginal density as
$$
c(y) = \int_{\mathbb{R}^p} \prod_{i=1}^{n} \big[\Phi(x_i^T \beta)\big]^{y_i} \big[1 - \Phi(x_i^T \beta)\big]^{1 - y_i} \, d\beta .
$$
Chen and Shao (2000) provide necessary and sufficient conditions on y and $\{x_i\}_{i=1}^{n}$ for c(y) < ∞ and
these are stated explicitly in the Appendix. When these conditions hold, the posterior density of β is
well defined (i.e., proper) and is given by
$$
\pi(\beta \mid y) = \frac{1}{c(y)} \prod_{i=1}^{n} \big[\Phi(x_i^T \beta)\big]^{y_i} \big[1 - \Phi(x_i^T \beta)\big]^{1 - y_i} .
$$
This posterior density is intractable in the sense that expectations with respect to π(β | y), which are
required for Bayesian inference, cannot be computed in closed form. Moreover, classical Monte Carlo
methods based on independent and identically distributed (iid) samples are problematic when the dimension, p, is large. These difficulties have spurred the development of Markov chain Monte Carlo methods
for exploring π(β | y). The first of these was Albert and Chib’s (1993) data augmentation algorithm,
which is now described.
Let X denote the n × p design matrix whose ith row is $x_i^T$ and, for $z = (z_1, \dots, z_n)^T \in \mathbb{R}^n$, let
$\hat\beta = \hat\beta(z) = (X^T X)^{-1} X^T z$. Also, let $\mathrm{TN}(\mu, \kappa^2, w)$ denote a normal distribution with mean µ and
variance $\kappa^2$ that is truncated to be positive if w = 1 and negative if w = 0. Albert and Chib’s algorithm
(henceforth, the “AC algorithm”) simulates a Markov chain whose invariant density is π(β | y). A single
iteration uses the current state β to produce the new state β′ through the following two steps:
1. Draw $z_1, \dots, z_n$ independently with $z_i \sim \mathrm{TN}(x_i^T \beta, 1, y_i)$
2. Draw $\beta' \sim N_p\big(\hat\beta(z), (X^T X)^{-1}\big)$
Albert and Chib (1993) has been referenced over 350 times, which shows that the AC algorithm and its
variants have been widely applied and studied.
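The two steps of the AC algorithm can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`draw_truncated_normal`, `cholesky`, `ac_step`) are hypothetical, the truncated normal is drawn by the inverse-CDF method (which is numerically fragile when the mean is far from zero), and the linear algebra is hand-rolled for small p.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_truncated_normal(mu, w, rng):
    """Draw from N(mu, 1) truncated to (0, inf) if w == 1, else to (-inf, 0]."""
    p0 = STD_NORMAL.cdf(-mu)  # Pr(N(mu, 1) <= 0)
    u = rng.random()
    prob = p0 + u * (1.0 - p0) if w == 1 else u * p0
    prob = min(max(prob, 1e-15), 1.0 - 1e-15)  # keep inv_cdf inside (0, 1)
    return mu + STD_NORMAL.inv_cdf(prob)


def cholesky(a):
    """Lower-triangular L with L L^T = a (a small, symmetric, positive definite)."""
    p = len(a)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x


def solve_lower_transpose(L, b):
    """Solve L^T x = b by back substitution."""
    p = len(b)
    x = [0.0] * p
    for i in reversed(range(p)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, p))) / L[i][i]
    return x


def ac_step(beta, X, y, rng):
    """One AC iteration: returns (beta_new, z) given the current beta."""
    n, p = len(X), len(beta)
    # Step 1: z_i ~ TN(x_i^T beta, 1, y_i), independently.
    z = [draw_truncated_normal(sum(X[i][k] * beta[k] for k in range(p)), y[i], rng)
         for i in range(n)]
    # Step 2: beta' ~ N_p(beta_hat(z), (X^T X)^{-1}).
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xtz = [sum(X[i][a] * z[i] for i in range(n)) for a in range(p)]
    L = cholesky(xtx)
    beta_hat = solve_lower_transpose(L, solve_lower(L, xtz))
    # If e ~ N(0, I), then L^{-T} e has covariance (L L^T)^{-1} = (X^T X)^{-1}.
    noise = solve_lower_transpose(L, [rng.gauss(0.0, 1.0) for _ in range(p)])
    return [beta_hat[k] + noise[k] for k in range(p)], z
```

Calling `ac_step` repeatedly, feeding each returned β back in as the current state, simulates the chain; the latent vector z is returned as well because the regenerative method discussed later works with the joint (β, z) chain.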
The PX-DA algorithm of Liu and Wu (1999) is a modified version of the AC algorithm that also
simulates a Markov chain whose invariant density is π(β | y). A single iteration of the PX-DA algorithm
entails the following three steps:
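1. Draw $z_1, \dots, z_n$ independently with $z_i \sim \mathrm{TN}(x_i^T \beta, 1, y_i)$
2. Draw $g^2 \sim \mathrm{Gamma}\Big(\frac{n}{2}, \frac{1}{2} \sum_{i=1}^{n} \big[z_i - x_i^T (X^T X)^{-1} X^T z\big]^2\Big)$ and set $z' = (g z_1, \dots, g z_n)^T$
3. Draw $\beta' \sim N_p\big(\hat\beta(z'), (X^T X)^{-1}\big)$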
Note that the first and third steps of the PX-DA algorithm are the same as the two steps of the AC
algorithm so, no matter what the dimension of β, the difference between the AC and PX-DA algorithms
is just a single draw from the univariate gamma distribution. For typical values of n and p, the effort
required to make this extra univariate draw is insignificant relative to the total amount of computation
needed to perform one iteration of the AC algorithm. Thus, the two algorithms are basically equivalent
from a computational standpoint. However, Liu and Wu (1999) and van Dyk and Meng (2001) both
provide considerable empirical evidence that autocorrelations die down much faster under PX-DA than
under AC, which suggests that the PX-DA algorithm “mixes faster” than the AC algorithm. (Liu and
Wu (1999) also established a theoretical result along these lines - see the proof of our Corollary 1.)
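The extra step that distinguishes PX-DA from AC can be sketched as below. This is a hedged illustration: the function name `pxda_rescale` is hypothetical, the second gamma parameter is read as a rate (consistent with the density of g given in Section 3), and the caller is assumed to supply $\hat\beta(z)$ from the same least-squares fit used in the final step.

```python
import math
import random


def pxda_rescale(z, X, beta_hat, rng):
    """Step 2 of PX-DA: draw g^2 ~ Gamma(n/2, rate = RSS/2); return (g, g*z).

    RSS is the residual sum of squares sum_i (z_i - x_i^T beta_hat(z))^2,
    assumed strictly positive.
    """
    n = len(z)
    rss = sum((z[i] - sum(x * b for x, b in zip(X[i], beta_hat))) ** 2
              for i in range(n))
    # random.gammavariate takes (shape, scale), so scale = 1 / rate = 2 / RSS.
    g_squared = rng.gammavariate(n / 2.0, 2.0 / rss)
    g = math.sqrt(g_squared)
    return g, [g * zi for zi in z]
```

Because $\hat\beta(z)$ is linear in z, $\hat\beta(gz) = g\,\hat\beta(z)$, so the mean used in Step 3 is just the AC mean multiplied by g; this is one way to see why the extra cost per iteration is a single univariate draw.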
Suppose we require the posterior expectation of f(β) given y, i.e., we want to evaluate
$$
E[f(\beta) \mid y] := \int_{\mathbb{R}^p} f(\beta)\, \pi(\beta \mid y)\, d\beta ,
$$
assuming this integral exists and is finite. Let $\{\beta_j\}_{j=0}^{\infty}$ denote the Markov chain associated with either
the AC or PX-DA algorithm. Because $\{\beta_j\}_{j=0}^{\infty}$ satisfies the usual regularity conditions (stated explicitly
in Section 2), the ergodic theorem implies that, no matter what the distribution of the starting value, $\beta_0$,
$$
\bar f_m := \frac{1}{m} \sum_{j=0}^{m-1} f(\beta_j)
$$
is a strongly consistent estimator of $E[f(\beta) \mid y]$; that is, $\bar f_m \to E[f(\beta) \mid y]$ almost surely as m → ∞. In
practice, one simulates the chain for a finite number of iterations, say m, and reports $\bar f_m$ as the estimate
of $E[f(\beta) \mid y]$. Suppose there is an associated central limit theorem (CLT) given by
$$
\sqrt{m}\, \big( \bar f_m - E[f(\beta) \mid y] \big) \xrightarrow{d} N(0, \sigma^2) \quad \text{as } m \to \infty , \tag{1}
$$
and that we have a consistent estimator of $\sigma^2$, call it $\hat\sigma^2$. Then we can compute an asymptotic standard
error for $\bar f_m$, which is given by $\hat\sigma / \sqrt{m}$. Since the sample size, m, is under our control, the main benefit
of calculating the standard error is to determine whether the sample size we chose was large enough.
For example, if the asymptotic 95% confidence interval given by $\bar f_m \pm 2\hat\sigma / \sqrt{m}$ is deemed too wide,
then m can be increased appropriately and further simulation can be carried out. (Note that burn-in is
not the issue here since the ability to compute a standard error is just as important when β0 ∼ π(β | y)
as it is when the chain is not stationary.) Unfortunately, the usual regularity conditions are not enough
to guarantee that (1) holds. Moreover, even when there is a CLT, finding a simple, consistent estimator
of the asymptotic variance can be challenging due to the dependence among the random variables in the
Markov chain.
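The way an asymptotic standard error feeds back into the choice of m can be sketched with a few lines of arithmetic. The helper names below are hypothetical, and the calculation presumes that a CLT holds and that a consistent estimate $\hat\sigma$ is already available, which is exactly what the rest of the paper sets out to justify.

```python
import math


def asymptotic_interval(estimate, sigma_hat, m):
    """Approximate 95% interval: estimate +/- 2 * sigma_hat / sqrt(m)."""
    half_width = 2.0 * sigma_hat / math.sqrt(m)
    return estimate - half_width, estimate + half_width


def iterations_for_half_width(sigma_hat, target_half_width):
    """Smallest m for which 2 * sigma_hat / sqrt(m) <= target_half_width."""
    return math.ceil((2.0 * sigma_hat / target_half_width) ** 2)
```

Since the half-width shrinks like $1/\sqrt{m}$, halving the width of the interval requires roughly four times as many iterations.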
In this article, we prove that the Markov chains underlying the AC and PX-DA algorithms both
converge at a geometric rate (defined formally in Section 2) which implies that the CLT in (1) holds
for every $f \in L^2(\pi(\beta \mid y))$; that is, for every f such that $\int_{\mathbb{R}^p} f^2(\beta)\, \pi(\beta \mid y)\, d\beta < \infty$. It follows from
results in Hobert and Marchev (2006) that PX-DA is theoretically more efficient than AC in the sense
that the asymptotic variance in the CLT under the PX-DA algorithm is no larger than that under the AC
algorithm. We also construct minorization conditions that allow us to exploit regenerative simulation
techniques for the consistent estimation of asymptotic variances. We illustrate our results using van
Dyk and Meng’s (2001) lupus data. In this particular example, the PX-DA algorithm turns out to be far
more efficient than the AC algorithm. Hence, even though the AC and PX-DA algorithms are essentially
equivalent in terms of computational complexity, huge gains in efficiency are possible by using PX-DA.
2 Geometric convergence and CLTs for the AC algorithm
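We begin with a brief derivation of the AC algorithm. Let $\mathbb{R}_+ = (0, \infty)$, $\mathbb{R}_- = (-\infty, 0]$, and let
$\phi(v; \mu, \kappa^2)$ denote the $N(\mu, \kappa^2)$ density function evaluated at the point $v \in \mathbb{R}$. Consider the function
from $\mathbb{R}^p \times \mathbb{R}^n$ to $\mathbb{R}_+$ given by
$$
\pi(\beta, z \mid y) = \frac{1}{c(y)} \prod_{i=1}^{n} \Big\{ \big[ I_{\mathbb{R}_+}(z_i)\, I_{\{1\}}(y_i) + I_{\mathbb{R}_-}(z_i)\, I_{\{0\}}(y_i) \big]\, \phi(z_i; x_i^T \beta, 1) \Big\} ,
$$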
where, as usual, $I_A(\cdot)$ is the indicator function of the set A. Note that
$$
\int_{\mathbb{R}^n} \pi(\beta, z \mid y)\, dz = \pi(\beta \mid y) ,
$$
and hence, π(β, z | y) can be viewed as a joint density in (β, z) whose β marginal is π(β | y). This
joint density is usually motivated as follows. Let $Z_1, \dots, Z_n$ be independent random variables with
$Z_i \sim N(x_i^T \beta, 1)$. If we define $Y_i = I_{\mathbb{R}_+}(Z_i)$, then $Y_1, \dots, Y_n$ are independent Bernoulli random
variables with $\Pr(Y_i = 1) = \Phi(x_i^T \beta)$. The $Z_i$’s can therefore be thought of as latent variables (or
missing data) and π(β, z | y) represents the posterior density of (β, z) given y under a flat prior on β.
The AC algorithm is simply a data augmentation algorithm (or two-variable Gibbs sampler) based on the
joint density π(β, z | y). Indeed, straightforward calculations reveal that $\beta \mid z, y \sim N_p\big(\hat\beta, (X^T X)^{-1}\big)$
and that, conditional on (β, y), $Z_1, \dots, Z_n$ are independent with $Z_i \mid \beta, y \sim \mathrm{TN}(x_i^T \beta, 1, y_i)$.
If we denote the current state of the Markov chain as β and the next state as β′, then the Markov
transition density of the AC algorithm is given by
$$
k(\beta' \mid \beta) = \int_{\mathbb{R}^n} \pi(\beta' \mid z, y)\, \pi(z \mid \beta, y)\, dz .
$$
Note that
$$
k(\beta' \mid \beta)\, \pi(\beta \mid y) = k(\beta \mid \beta')\, \pi(\beta' \mid y)
$$
for all $\beta, \beta' \in \mathbb{R}^p$; that is, k(β′ | β) is reversible with
respect to π(β | y). It follows immediately that π(β | y) is the invariant density for the Markov chain, or,
in symbols,
$$
\int_{\mathbb{R}^p} k(\beta' \mid \beta)\, \pi(\beta \mid y)\, d\beta = \pi(\beta' \mid y) ,
$$
for all $\beta' \in \mathbb{R}^p$. Roy (2008) shows that the Markov chain driven by k(β′ | β) is ψ-irreducible, aperiodic
and Harris recurrent (see Meyn and Tweedie (1993) for definitions). These conditions, which are henceforth referred to collectively as “the usual regularity conditions,” imply that the ergodic theorem holds;
that is, if $f : \mathbb{R}^p \to \mathbb{R}$ is such that $\int_{\mathbb{R}^p} |f(\beta)|\, \pi(\beta \mid y)\, d\beta < \infty$, then $\bar f_m$ is a strongly consistent estimator
of $E[f(\beta) \mid y]$ no matter what the distribution of the starting value, $\beta_0$. We will say that there is a CLT
for f if there exists a $\sigma^2 \in (0, \infty)$ such that, for all starting distributions, as m → ∞,
$$
\sqrt{m}\, \big( \bar f_m - E[f(\beta) \mid y] \big) \xrightarrow{d} N(0, \sigma^2) .
$$
As explained in the introduction, CLTs are the basis for asymptotic standard errors, which can be
used to ascertain how large a sample is required to estimate $E[f(\beta) \mid y]$. Unfortunately, even if
$f \in L^2(\pi(\beta \mid y))$, the usual regularity conditions are not enough to guarantee a CLT for f. We now introduce some convergence rate concepts that will allow us to describe a sufficient condition for CLTs.
Let K(·, ·) denote the Markov transition function associated with the AC algorithm; that is, for
$\beta \in \mathbb{R}^p$ and a measurable set $A \subset \mathbb{R}^p$, $K(\beta, A) = \int_A k(\beta' \mid \beta)\, d\beta'$. The corresponding m-step Markov
transition function is defined inductively by
$$
K^m(\beta, A) = \int_{\mathbb{R}^p} K^{m-1}(\beta', A)\, K(\beta, d\beta') ,
$$
where $K^1 \equiv K$. It describes the probability of m-step transitions; i.e., for $m \in \mathbb{N} := \{1, 2, \dots\}$ and
$l \in \{0, 1, 2, \dots\}$, we have
$$
K^m(\beta, A) = \Pr(\beta_{m+l} \in A \mid \beta_l = \beta) .
$$
Let Π(· | y) be the probability measure associated with the posterior density π(β | y); that is, $\Pi(A \mid y) = \int_A \pi(\beta \mid y)\, d\beta$.
The usual regularity conditions imply that, for every $\beta \in \mathbb{R}^p$,
$$
\| K^m(\beta, \cdot) - \Pi(\cdot \mid y) \| \downarrow 0 \quad \text{as } m \to \infty ,
$$
where the left-hand side represents the total variation distance between $K^m(\beta, \cdot)$ and Π(· | y); i.e., the
supremum over measurable A of $K^m(\beta, A) - \Pi(A \mid y)$. However, the usual regularity conditions
tell us nothing about the rate at which this convergence takes place. The chain is called geometrically
ergodic if there exist a constant ρ ∈ [0, 1) and a function $M : \mathbb{R}^p \to [0, \infty)$ such that for any $\beta \in \mathbb{R}^p$
and any $m \in \mathbb{N}$,
$$
\| K^m(\beta, \cdot) - \Pi(\cdot \mid y) \| \le M(\beta)\, \rho^m .
$$
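To make the definition concrete, here is a toy illustration on a two-state Markov chain (an assumption made purely for illustration; it is not one of the probit chains studied here). For a chain on {0, 1} with Pr(0 → 1) = a and Pr(1 → 0) = b, the total variation distance after m steps equals $M(\text{start})\,\rho^m$ exactly, with $\rho = |1 - a - b|$.

```python
def tv_distance_two_state(a, b, start, m):
    """Total variation distance between the m-step law and the stationary law.

    The chain lives on {0, 1} with Pr(0 -> 1) = a and Pr(1 -> 0) = b.
    On two points, the sup over sets A reduces to |P(state = 0) - pi_0|.
    """
    pi0 = b / (a + b)                    # stationary probability of state 0
    p0 = 1.0 if start == 0 else 0.0      # point mass at the starting state
    for _ in range(m):
        p0 = p0 * (1.0 - a) + (1.0 - p0) * b   # one-step update of P(state = 0)
    return abs(p0 - pi0)


def geometric_bound_two_state(a, b, start, m):
    """M(start) * rho^m with rho = |1 - a - b| and M(start) = |1{start = 0} - pi_0|."""
    pi0 = b / (a + b)
    m_start = abs((1.0 if start == 0 else 0.0) - pi0)
    return m_start * abs(1.0 - a - b) ** m
```

With a = 0.3 and b = 0.2, for instance, ρ = 0.5 and the distance halves at every step. For chains on $\mathbb{R}^p$ such as those driven by the AC and PX-DA algorithms, no closed form of this kind is available, which is why geometric ergodicity has to be established by proof.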
Roberts and Rosenthal (1997) show that if a Markov chain (satisfying the usual regularity conditions)
is reversible and geometrically ergodic, then there is a CLT for every function that is square integrable
with respect to the invariant distribution. (For more on Markov chain CLTs, see Chan and Geyer (1994),
Mira and Geyer (1999) and Jones, Haran, Caffo and Neath (2006).) A proof of the following result is
given in the Appendix.
Theorem 1. The Markov chain on $\mathbb{R}^p$ with transition density k(β′ | β) (that is, the Markov chain underlying the AC algorithm) is geometrically ergodic.
Together with the results of Roberts and Rosenthal (1997), this theorem implies that the AC algorithm
has a CLT for every $f \in L^2(\pi(\beta \mid y))$. In order to use this theory to calculate standard errors, we
require a consistent estimator of the asymptotic variance, $\sigma^2$. This topic will be addressed in Section 4.
In the next section, we use results from Liu and Wu (1999) and Hobert and Marchev (2006) to conclude
that geometric ergodicity of the AC algorithm implies that of the PX-DA algorithm and that PX-DA is
at least as good as AC in terms of performance in the CLT.
3 Comparing the AC and PX-DA algorithms
The Markov transition density of the PX-DA algorithm can be written as
$$
k^*(\beta' \mid \beta) = \int_{\mathbb{R}^n} \int_{\mathbb{R}^n} \pi(\beta' \mid z', y)\, R(z, dz')\, \pi(z \mid \beta, y)\, dz ,
$$
where R(z, dz′) is the Markov transition function induced by Step 2 of the algorithm that takes
$z \to z' = (g z_1, \dots, g z_n)^T$. It is straightforward to show that the Markov chain driven by $k^*$ satisfies the usual
regularity conditions. Hobert and Marchev (2006) provide results that can be used to compare different
data augmentation algorithms (in terms of efficiency and convergence rate). In order to establish that
their results are applicable in our analysis of $k^*$, we now show that R(z, dz′) admits a certain “group
representation.” Let $\pi(z \mid y) = \int_{\mathbb{R}^p} \pi(\beta, z \mid y)\, d\beta$. A simple calculation reveals that
$$
\pi(z \mid y) = \frac{|X^T X|^{-1/2} \exp\big\{ -z^T (I - H) z / 2 \big\}}{c(y)\, (2\pi)^{(n-p)/2}} \prod_{i=1}^{n} \Big\{ I_{\mathbb{R}_+}(z_i)\, I_{\{1\}}(y_i) + I_{\mathbb{R}_-}(z_i)\, I_{\{0\}}(y_i) \Big\} ,
$$
where $H = X (X^T X)^{-1} X^T$. Let G be the multiplicative group $\mathbb{R}_+$ where group composition is defined
as multiplication; i.e., for $g_1, g_2 \in G$, $g_1 \circ g_2 = g_1 g_2$. The identity element is e = 1 and $g^{-1} = 1/g$.
The left-Haar measure on G is $\nu_l(dg) = dg/g$ where dg denotes Lebesgue measure on $\mathbb{R}_+$. Let G
act on the left of $\mathbb{R}^n$ through component-wise multiplication; that is, if g ∈ G and $z \in \mathbb{R}^n$, then
$gz = (g z_1, \dots, g z_n)^T$. With the left group action defined in this way, it is easy to see that Lebesgue
measure on $\mathbb{R}^n$ is relatively left invariant with multiplier $\chi(g) = g^n$; i.e.,
$$
g^n \int_{\mathbb{R}^n} h(gz)\, dz = \int_{\mathbb{R}^n} h(z)\, dz
$$
for all g ∈ G and all integrable functions $h : \mathbb{R}^n \to \mathbb{R}$. (See Chapters 1 & 2 of Eaton (1989) for
background on left group actions and multipliers.) Let Z denote the subset of $\mathbb{R}^n$ in which z lives;
i.e., Z is the Cartesian product of n half-lines ($\mathbb{R}_+$ and $\mathbb{R}_-$), where the ith component is $\mathbb{R}_+$ if $y_i = 1$
and $\mathbb{R}_-$ if $y_i = 0$. Fix z ∈ Z. It is easy to see that Step 2 of the PX-DA algorithm is equivalent to the
transition z → gz where, as in Section 4.3 of Hobert and Marchev (2006), g is drawn from a distribution
on G having density function
$$
\frac{\chi(g)\, \pi(gz \mid y)\, \nu_l(dg)}{\int_G \chi(g)\, \pi(gz \mid y)\, \nu_l(dg)}
= \frac{g^{n-1}\, \pi(gz \mid y)\, dg}{\int_0^\infty g^{n-1}\, \pi(gz \mid y)\, dg}
= \frac{\big[ z^T (I - H) z \big]^{n/2}}{2^{(n-2)/2}\, \Gamma(n/2)}\, g^{n-1} e^{-g^2 z^T (I - H) z / 2}\, dg . \tag{2}
$$
Furthermore, $\int_G \chi(g)\, \pi(gz \mid y)\, \nu_l(dg)$ is positive for all z ∈ Z and finite for almost all z ∈ Z. Consequently, we may now appeal to several of the results in Hobert and Marchev (2006). First, their
Proposition 3 shows that R(z, dz′) is reversible with respect to π(z | y) and it follows that $k^*(\beta' \mid \beta)$ is
reversible with respect to π(β | y). We now use the fact that the AC algorithm is geometrically ergodic
to establish that the PX-DA algorithm enjoys this property as well.
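As a check on the link between the density in (2) and the gamma draw in Step 2 of the PX-DA algorithm, the following change of variables recovers the distribution of $g^2$. This is a sketch in which $a = z^T (I - H) z$ and $u = g^2$ are shorthand introduced here, and $h(g \mid z)$ denotes the density in (2).

```latex
\begin{align*}
h(g \mid z) &= \frac{a^{n/2}}{2^{(n-2)/2}\, \Gamma(n/2)}\, g^{n-1} e^{-g^2 a / 2} , \qquad g > 0 , \\
f_U(u) &= h\big(\sqrt{u} \mid z\big)\, \frac{1}{2\sqrt{u}}
        = \frac{(a/2)^{n/2}}{\Gamma(n/2)}\, u^{n/2 - 1} e^{-(a/2)\, u} , \qquad u > 0 .
\end{align*}
```

The second line is the density of a gamma distribution with shape n/2 and rate a/2. Since I − H is symmetric and idempotent, $a = \sum_{i=1}^{n} \big[z_i - x_i^T (X^T X)^{-1} X^T z\big]^2$, which matches the second parameter in Step 2.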
Corollary 1. The Markov chain on $\mathbb{R}^p$ with transition density $k^*(\beta' \mid \beta)$ (that is, the Markov chain
underlying the PX-DA algorithm) is geometrically ergodic.
Proof. Define
$$
L_0^2(\pi(\beta \mid y)) = \Big\{ f \in L^2(\pi(\beta \mid y)) : \int_{\mathbb{R}^p} f(\beta)\, \pi(\beta \mid y)\, d\beta = 0 \Big\} .
$$
Let K and $K^*$ denote the Markov operators on $L_0^2(\pi(\beta \mid y))$ associated with the Markov chains underlying the AC and PX-DA algorithms, respectively (see, e.g., Liu, Wong and Kong, 1994; Mira and Geyer,
1999). Denote the norms of these operators by $\|K\|$ and $\|K^*\|$. In general, a reversible Markov chain
(that satisfies the usual regularity conditions) is geometrically ergodic if and only if the norm of the
associated Markov operator is less than 1 (Roberts and Rosenthal, 1997; Roberts and Tweedie, 2001).
By Theorem 1, the AC algorithm is geometrically ergodic and consequently $\|K\| < 1$. But Liu and Wu
(1999) show that $\|K^*\| \le \|K\|$ (see also Hobert and Marchev, 2006, Theorem 4) and hence $\|K^*\| < 1$,
which implies that the PX-DA algorithm is also geometrically ergodic.
We have now shown that the Markov chains underlying the AC and PX-DA algorithms are both
reversible and geometrically ergodic and hence both have CLTs for all $f \in L^2(\pi(\beta \mid y))$. We now use
another result from Hobert and Marchev (2006) to show that the PX-DA algorithm is at least as efficient
as the AC algorithm.
Corollary 2. Let $f \in L^2(\pi(\beta \mid y))$. If $\sigma_{f,k}^2$ and $\sigma_{f,k^*}^2$ denote the variances in the CLT for the AC and
PX-DA algorithms, respectively, then $\sigma_{f,k^*}^2 \le \sigma_{f,k}^2 < \infty$.
Proof. The result follows immediately from Hobert and Marchev’s (2006) Theorem 4.
In order to use our theoretical results in practice to compute valid asymptotic standard errors, we
require a consistent estimator of the asymptotic variance and this is the subject of the next section.
4 Consistent estimators of asymptotic variances via regeneration
4.1 Minorization and regeneration
We begin with the AC algorithm. However, instead of considering the Markov chain on $\mathbb{R}^p$ driven by
k(β′ | β), we consider the joint chain on $\mathbb{R}^p \times \mathbb{R}^n$ with Markov transition density given by
$$
\tilde k(\beta', z' \mid \beta, z) = \pi(z' \mid \beta', y)\, \pi(\beta' \mid z, y) .
$$
The Markov chain defined by $\tilde k$, which we denote by $\{\beta_j, z_j\}_{j=0}^{\infty}$, has invariant density π(β, z | y) and
satisfies the usual regularity conditions. (The reason for choosing this joint chain instead of the one that
updates in the opposite order is described later in this section.) The de-initializing arguments of Roberts
and Rosenthal (2001) can be used to show that this chain inherits geometric ergodicity from its marginal
chain $\{\beta_j\}_{j=0}^{\infty}$, whose Markov transition density is k(β′ | β). Note that $\{\beta_j, z_j\}_{j=0}^{\infty}$ is the chain that is
actually simulated when the AC algorithm is run (we just ignore the $z_j$s).
Suppose we can find a function $s : \mathbb{R}^p \times \mathbb{R}^n \to [0, 1]$, whose expectation with respect to π(β, z | y) is
strictly positive, and a probability density d(β′, z′) on $\mathbb{R}^p \times \mathbb{R}^n$ such that for all $(\beta', z'), (\beta, z) \in \mathbb{R}^p \times \mathbb{R}^n$,
we have
$$
\tilde k(\beta', z' \mid \beta, z) \ge s(\beta, z)\, d(\beta', z') . \tag{3}
$$
This is called a minorization condition (Jones and Hobert, 2001; Meyn and Tweedie, 1993; Roberts and
Rosenthal, 2004) and it can be used to introduce regenerations into the Markov chain driven by $\tilde k$. These
regenerations are the key to constructing a simple, consistent estimator of the variance in the CLT. After
explaining exactly how this is done, we will identify s and d for both AC and PX-DA.
Equation (3) allows us to rewrite $\tilde k$ as the following two-component mixture density
$$
\tilde k(\beta', z' \mid \beta, z) = s(\beta, z)\, d(\beta', z') + \big(1 - s(\beta, z)\big)\, r(\beta', z' \mid \beta, z) , \tag{4}
$$
where r is the so-called residual density defined as
$$
r(\beta', z' \mid \beta, z) = \frac{\tilde k(\beta', z' \mid \beta, z) - s(\beta, z)\, d(\beta', z')}{1 - s(\beta, z)}
$$
when s(β, z) < 1 (and defined arbitrarily when s(β, z) = 1). Instead of simulating one step of the
Markov chain $\{\beta_j, z_j\}_{j=0}^{\infty}$ in the usual way (that is, drawing from π(β | z, y) and then from π(z | β, y)),
we could simulate a single step of the chain using the mixture representation (4) as follows. Suppose
the current state is $(\beta_j, z_j) = (\beta, z)$. First, we draw $\delta_j \sim \mathrm{Bernoulli}\big(s(\beta, z)\big)$. Then if $\delta_j = 1$, we
draw $(\beta_{j+1}, z_{j+1})$ from d, and if $\delta_j = 0$, we draw $(\beta_{j+1}, z_{j+1})$ from the residual density. The (random)
times at which δj = 1 correspond to regenerations in the sense that the process probabilistically restarts
itself at the next iteration. More specifically, suppose we start by drawing (β0 , z0 ) ∼ d. Then every
time δj = 1, we have (βj+1 , zj+1 ) ∼ d so the process is, in effect, starting over again. Furthermore,
the “tours” taken by the chain in between these embedded regeneration times are iid, which means
that standard iid theory can be used to analyze the asymptotic behavior of ergodic averages, thereby
circumventing the difficulties associated with analyzing averages of dependent random variables. For
more details and simple examples, see Mykland, Tierney and Yu (1995) and Hobert, Jones, Presnell and
Rosenthal (2002).
In practice, we can even avoid having to draw from r (which can be problematic) simply by doing things in a slightly different order. Indeed, given the current state (βj , zj ) = (β, z), we draw
(βj+1 , zj+1 ) in the usual way (that is, drawing from π(β | z, y) and then from π(z | β, y)) after which we
“fill in” a value for δj by drawing from the conditional distribution of δj given (βj , zj ) and (βj+1 , zj+1 ),
which is just a Bernoulli distribution with success probability given by
$$
\eta = \eta(\beta_j, z_j, \beta_{j+1}, z_{j+1}) = \frac{s(\beta_j, z_j)\, d(\beta_{j+1}, z_{j+1})}{\tilde k(\beta_{j+1}, z_{j+1} \mid \beta_j, z_j)} . \tag{5}
$$
In the next subsection, we describe exactly how these supplemental Bernoulli draws are used to construct
a consistent estimator of $E[f(\beta) \mid y]$ as well as a consistent estimator of the corresponding asymptotic
variance.
4.2 A consistent estimator of the asymptotic variance
Suppose the Markov chain is to be run for R regenerations (or tours); that is, we start by drawing
$(\beta_0, z_0) \sim d$ and we stop the simulation the Rth time that a $\delta_j = 1$. Let $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_R$
be the (random) regeneration times; that is, $\tau_t = \min\{j > \tau_{t-1} : \delta_{j-1} = 1\}$ for $t \in \{1, 2, \dots, R\}$. The
total length of the simulation, $\tau_R$, is random. Let $N_1, N_2, \dots, N_R$ be the (random) lengths of the tours;
i.e., $N_t = \tau_t - \tau_{t-1}$ and define
$$
S_t = \sum_{j=\tau_{t-1}}^{\tau_t - 1} f(\beta_j) .
$$
Note that the $(N_t, S_t)$ pairs are iid. The strongly consistent estimator of $E[f(\beta) \mid y]$ is
$$
\bar f_{\tau_R} = \frac{\bar S}{\bar N} = \frac{1}{\tau_R} \sum_{j=0}^{\tau_R - 1} f(\beta_j) ,
$$
where $\bar S = R^{-1} \sum_{t=1}^{R} S_t$ and $\bar N = R^{-1} \sum_{t=1}^{R} N_t$. Because the Markov chain driven by $\tilde k$ is geometrically
ergodic, the results in Hobert et al. (2002) are applicable and imply that, as long as there exists an
α > 0 such that $E\big[ |f(\beta)|^{2+\alpha} \mid y \big] < \infty$, then
$$
\sqrt{R}\, \big( \bar f_{\tau_R} - E[f(\beta) \mid y] \big) \xrightarrow{d} N(0, \gamma^2) \tag{6}
$$
as R → ∞. (Note that the requirement of a finite 2 + α moment is a bit stronger than the second
moment condition discussed earlier.) The main benefit of using regeneration is the existence of a simple,
consistent estimator of $\gamma^2$, which takes the form
$$
\hat\gamma^2 = \frac{\sum_{t=1}^{R} \big( S_t - \bar f_{\tau_R} N_t \big)^2}{R\, \bar N^2} .
$$
See Hobert et al. (2002) for a simple proof that this estimator is consistent.
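The estimators of this subsection translate directly into code. The sketch below is illustrative only: the function name `regenerative_estimate` and its input convention are assumptions. It takes the values $f(\beta_0), \dots, f(\beta_{\tau_R - 1})$ together with the indicators $\delta_0, \dots, \delta_{\tau_R - 1}$, so that a tour ends at index j whenever $\delta_j = 1$ and the last indicator is 1.

```python
import math


def regenerative_estimate(f_values, deltas):
    """Return (estimate, gamma2_hat, std_err) from one regenerative run.

    f_values[j] = f(beta_j) and deltas[j] = delta_j for j = 0, ..., tau_R - 1.
    A tour ends at index j whenever deltas[j] == 1; the run must end on a
    regeneration, i.e. deltas[-1] == 1.
    """
    if len(f_values) != len(deltas) or not deltas or deltas[-1] != 1:
        raise ValueError("need equal-length inputs ending with a regeneration")
    tour_sums, tour_lengths = [], []
    running_sum, running_len = 0.0, 0
    for value, delta in zip(f_values, deltas):
        running_sum += value
        running_len += 1
        if delta == 1:                      # close the current tour
            tour_sums.append(running_sum)
            tour_lengths.append(running_len)
            running_sum, running_len = 0.0, 0
    R = len(tour_sums)
    n_bar = sum(tour_lengths) / R
    estimate = sum(tour_sums) / sum(tour_lengths)      # S-bar / N-bar
    gamma2_hat = sum((S - estimate * N) ** 2
                     for S, N in zip(tour_sums, tour_lengths)) / (R * n_bar ** 2)
    std_err = math.sqrt(gamma2_hat / R)                # gamma-hat / sqrt(R)
    return estimate, gamma2_hat, std_err
```

For example, `regenerative_estimate([1, 2, 3, 4], [0, 1, 1, 1])` splits the run into tours with sums (3, 3, 4) and lengths (2, 1, 1), giving the estimate 2.5. In practice R would be far larger than 3 before the CLT in (6) is a reasonable guide.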
Remark 1. The CLT in (6) is slightly different from the CLT discussed earlier, which takes the form
$$
\sqrt{m}\, \big( \bar f_m - E[f(\beta) \mid y] \big) \xrightarrow{d} N(0, \sigma^2) .
$$
Hobert et al. (2002) explain that the two CLTs are related by the equation $\gamma^2 = E[s(\beta, z) \mid y]\, \sigma^2$.
Remark 2. A further advantage of using regeneration to calculate standard errors is that the starting
distribution is prescribed to be d(β, z) so that burn-in is a non-issue.
4.3 Minorization conditions for the AC and PX-DA algorithms
We begin by deriving a minorization condition for the AC algorithm using the “distinguished point”
technique introduced in Mykland et al. (1995). First, note that $\tilde k(\beta', z' \mid \beta, z)$ does not depend on β
and as a consequence, neither will our function s. Fix a distinguished point $z_* \in \mathbb{R}^n$ and let D be
a p-dimensional rectangle defined by $D = D_1 \times \cdots \times D_p$ where $D_i = [c_i, d_i]$ and $c_i < d_i$ for all
i = 1, 2, . . . , p. Now note that
$$
\begin{aligned}
\tilde k(\beta', z' \mid \beta, z) &= \pi(z' \mid \beta', y)\, \pi(\beta' \mid z, y) \\
&= \frac{\pi(\beta' \mid z, y)}{\pi(\beta' \mid z_*, y)}\, \pi(z' \mid \beta', y)\, \pi(\beta' \mid z_*, y) \\
&\ge \bigg[ \inf_{\beta \in D} \frac{\pi(\beta \mid z, y)}{\pi(\beta \mid z_*, y)} \bigg]\, \pi(z' \mid \beta', y)\, \pi(\beta' \mid z_*, y)\, I_D(\beta') \\
&= s(z)\, d(\beta', z')
\end{aligned}
$$
where
$$
s(z) = \varepsilon \inf_{\beta \in D} \frac{\pi(\beta \mid z, y)}{\pi(\beta \mid z_*, y)}
\quad \text{and} \quad
d(\beta', z') = \frac{1}{\varepsilon}\, \pi(z' \mid \beta', y)\, \pi(\beta' \mid z_*, y)\, I_D(\beta') , \tag{7}
$$
and
$$
\varepsilon = \int_{\mathbb{R}^p} \int_{\mathbb{R}^n} \pi(z \mid \beta, y)\, \pi(\beta \mid z_*, y)\, I_D(\beta)\, dz\, d\beta = \int_D \pi(\beta \mid z_*, y)\, d\beta .
$$
Clearly, d(β′, z′) is a probability density on $\mathbb{R}^p \times \mathbb{R}^n$. All that is required to apply the regenerative
method described above is the ability to draw from the density d (to start the simulation) and the ability
to calculate η in (5). Making a draw from d(β′, z′) can be done sequentially by first drawing β′ from
the truncated density $\varepsilon^{-1} \pi(\beta' \mid z_*, y)\, I_D(\beta')$ (which does not require the value of ε) and then drawing z′
from π(z′ | β′, y).
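We now provide a closed form expression for s(z), which in turn will give a closed form expression
for the success probability η. First,
$$
\pi(\beta \mid z, y) = \frac{1}{(2\pi)^{p/2}\, |X^T X|^{-1/2}} \exp\Big\{ -\frac{1}{2} \big( \beta - \hat\beta(z) \big)^T X^T X \big( \beta - \hat\beta(z) \big) \Big\} ,
$$
where $\hat\beta(z) = (X^T X)^{-1} X^T z$. Thus,
$$
\begin{aligned}
s(z) &= \varepsilon \inf_{\beta \in D} \frac{\pi(\beta \mid z, y)}{\pi(\beta \mid z_*, y)} \\
&= \varepsilon \inf_{\beta \in D} \frac{\exp\big\{ -\frac{1}{2} \big[ \hat\beta(z)^T X^T X \hat\beta(z) - 2 \beta^T X^T X \hat\beta(z) \big] \big\}}{\exp\big\{ -\frac{1}{2} \big[ \hat\beta(z_*)^T X^T X \hat\beta(z_*) - 2 \beta^T X^T X \hat\beta(z_*) \big] \big\}} \\
&= \varepsilon\, \frac{\exp\big\{ -\frac{1}{2} z^T X (X^T X)^{-1} X^T z \big\}}{\exp\big\{ -\frac{1}{2} z_*^T X (X^T X)^{-1} X^T z_* \big\}}\, \inf_{\beta \in D} \exp\big\{ (z - z_*)^T X \beta \big\} \\
&= \varepsilon\, \frac{\exp\big\{ -\frac{1}{2} z^T X (X^T X)^{-1} X^T z \big\}}{\exp\big\{ -\frac{1}{2} z_*^T X (X^T X)^{-1} X^T z_* \big\}}\, \exp\bigg\{ \sum_{i=1}^{p} \big[ c_i t_i I_{\mathbb{R}_+}(t_i) + d_i t_i I_{\mathbb{R}_-}(t_i) \big] \bigg\}
\end{aligned}
$$
where $t^T = (z - z_*)^T X$. Therefore, the success probability η in (5) becomes
$$
\begin{aligned}
\eta &= \frac{s(\beta_j, z_j)\, d(\beta_{j+1}, z_{j+1})}{\tilde k(\beta_{j+1}, z_{j+1} \mid \beta_j, z_j)}
= \frac{s(z_j)\, d(\beta_{j+1}, z_{j+1})}{\tilde k(\beta_{j+1}, z_{j+1} \mid \beta_j, z_j)} \\
&= \bigg[ \inf_{\beta \in D} \frac{\pi(\beta \mid z_j, y)}{\pi(\beta \mid z_*, y)} \bigg] \frac{\pi(z_{j+1} \mid \beta_{j+1}, y)\, \pi(\beta_{j+1} \mid z_*, y)}{\pi(z_{j+1} \mid \beta_{j+1}, y)\, \pi(\beta_{j+1} \mid z_j, y)}\, I_D(\beta_{j+1}) \\
&= \bigg[ \inf_{\beta \in D} \frac{\pi(\beta \mid z_j, y)}{\pi(\beta \mid z_*, y)} \bigg] \frac{\pi(\beta_{j+1} \mid z_*, y)}{\pi(\beta_{j+1} \mid z_j, y)}\, I_D(\beta_{j+1}) \\
&= \frac{\exp\big\{ -\frac{1}{2} z_j^T X (X^T X)^{-1} X^T z_j \big\}}{\exp\big\{ -\frac{1}{2} z_*^T X (X^T X)^{-1} X^T z_* \big\}}\, \exp\bigg\{ \sum_{i=1}^{p} \big[ c_i t_i^{(j)} I_{\mathbb{R}_+}(t_i^{(j)}) + d_i t_i^{(j)} I_{\mathbb{R}_-}(t_i^{(j)}) \big] \bigg\} \\
&\qquad \times \frac{\exp\big\{ -\frac{1}{2} z_*^T X (X^T X)^{-1} X^T z_* \big\}}{\exp\big\{ -\frac{1}{2} z_j^T X (X^T X)^{-1} X^T z_j \big\}}\, \exp\big\{ -(z_j - z_*)^T X \beta_{j+1} \big\}\, I_D(\beta_{j+1}) \\
&= \exp\bigg\{ \sum_{i=1}^{p} \big[ c_i t_i^{(j)} I_{\mathbb{R}_+}(t_i^{(j)}) + d_i t_i^{(j)} I_{\mathbb{R}_-}(t_i^{(j)}) - t_i^{(j)} \beta_{j+1,i} \big] \bigg\}\, I_D(\beta_{j+1})
\end{aligned}
$$
where $t^{(j)T} = (z_j - z_*)^T X$ and $\beta_{j+1,i}$ is the ith element of the vector $\beta_{j+1}$. Note that η is free of $\beta_j$,
$z_{j+1}$ and ε.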
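The final expression for η is simple enough to compute in a few lines. The following sketch assumes the rectangle D is given by lists `c` and `d` of lower and upper endpoints; the function name `regeneration_probability` and the argument layout are hypothetical, chosen only for illustration.

```python
import math


def regeneration_probability(X, z_j, z_star, beta_next, c, d):
    """Success probability eta for the AC algorithm.

    X is the n x p design matrix (list of rows), z_j the current latent vector,
    z_star the distinguished point, beta_next the newly drawn beta_{j+1}, and
    D = [c[0], d[0]] x ... x [c[p-1], d[p-1]].
    """
    p = len(c)
    # The indicator I_D(beta_{j+1}): no regeneration is possible outside D.
    if any(not (c[i] <= beta_next[i] <= d[i]) for i in range(p)):
        return 0.0
    diff = [a - b for a, b in zip(z_j, z_star)]
    # t = X^T (z_j - z_star), a vector of length p.
    t = [sum(X[r][i] * diff[r] for r in range(len(X))) for i in range(p)]
    log_eta = 0.0
    for i in range(p):
        # The infimum over D picks c_i when t_i > 0 and d_i when t_i <= 0.
        endpoint = c[i] if t[i] > 0 else d[i]
        log_eta += t[i] * (endpoint - beta_next[i])
    return math.exp(log_eta)
```

Each term in the exponent is non-positive when $\beta_{j+1} \in D$, so the returned value always lies in [0, 1]. After each iteration one would compare it with a Uniform(0, 1) draw to fill in $\delta_j$. Replacing `z_j` by $g_j z_j$ gives the corresponding quantity for PX-DA.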
Notice that there is a chance for regeneration only when the β component enters the p-dimensional
rectangle D. This suggests making D large. However, increasing D too much will lead to very small
values of η. Hence, there is a trade-off between the size of D and the magnitude of the success probability, η.
Modifying a computer program that runs the AC algorithm so that it simulates the regenerative
process is quite simple. Since code for simulating from π(β′ | z, y) and π(z′ | β′, y) is already available,
it is straightforward to write code to simulate from the density d. All that remains is a small amount of
code to calculate η and compare it to a Uniform(0, 1) after each iteration of the AC algorithm.
Remark 3. Instead of using $\tilde k(\beta', z' \mid \beta, z)$ to construct the regenerative process, we could have used
$\tilde{\tilde k}(z', \beta' \mid z, \beta) = \pi(\beta' \mid z', y)\, \pi(z' \mid \beta, y)$. In fact, from a theoretical standpoint, it is actually more
natural to use $\tilde{\tilde k}$. However, if we had done so, the z component would have to enter an n-dimensional
rectangle before a regeneration is possible. In most applications, n is much larger than p, and when n is
large, the probability that all n components of z simultaneously enter their assigned interval is typically
so small that the algorithm is of no practical use. Moreover, as mentioned above, this problem cannot
be solved simply by making the intervals larger.
Regeneration can also be used in conjunction with the PX-DA algorithm. Indeed, we now show
that a simple modification of our minorization condition for the AC algorithm yields a minorization
condition for the PX-DA algorithm. The Markov transition density of the PX-DA algorithm can be
rewritten as
$$
k^*(\beta' \mid \beta) = \int_{\mathbb{R}^n} \int_{\mathbb{R}_+} \pi(\beta' \mid gz, y)\, h(g \mid z)\, \pi(z \mid \beta, y)\, dg\, dz ,
$$
where h(g | z) is the density in (2). As before, instead of working directly with $k^*$, we consider the joint
chain on $\mathbb{R}^p \times \mathbb{R}^n \times \mathbb{R}_+$ with Markov transition density given by
$$
\tilde k^*\big( \beta', (z', g') \mid \beta, (z, g) \big) = h(g' \mid z')\, \pi(z' \mid \beta', y)\, \pi(\beta' \mid gz, y) .
$$
Let $\{\beta_j, (z_j, g_j)\}_{j=0}^{\infty}$ denote the Markov chain corresponding to $\tilde k^*$. Since π(β | y) is the invariant density
for $k^*(\beta' \mid \beta)$, we have
$$
\begin{aligned}
\pi(\beta' \mid y) = \int_{\mathbb{R}^p} k^*(\beta' \mid \beta)\, \pi(\beta \mid y)\, d\beta
&= \int_{\mathbb{R}^n} \int_{\mathbb{R}_+} \int_{\mathbb{R}^p} \pi(\beta' \mid gz, y)\, h(g \mid z)\, \pi(z \mid \beta, y)\, \pi(\beta \mid y)\, d\beta\, dg\, dz \\
&= \int_{\mathbb{R}^n} \int_{\mathbb{R}_+} \pi(\beta' \mid gz, y)\, h(g \mid z)\, \pi(z \mid y)\, dg\, dz ,
\end{aligned}
$$
and from this it follows that π(β | z, y) h(g | z) π(z | y) is the invariant density for $\{\beta_j, (z_j, g_j)\}_{j=0}^{\infty}$.
It is straightforward to show that this chain satisfies the usual regularity conditions, and, as before, the
chain associated with $\tilde k^*$ inherits geometric ergodicity from its marginal chain $\{\beta_j\}_{j=0}^{\infty}$, whose Markov
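transition density is $k^*$. We can construct a minorization condition for $\tilde k^*$ as follows
$$
\begin{aligned}
\tilde k^*\big( \beta', (z', g') \mid \beta, (z, g) \big) &= h(g' \mid z')\, \pi(z' \mid \beta', y)\, \pi(\beta' \mid gz, y) \\
&\ge \bigg[ \inf_{\beta \in D} \frac{\pi(\beta \mid gz, y)}{\pi(\beta \mid z_*, y)} \bigg]\, h(g' \mid z')\, \pi(z' \mid \beta', y)\, \pi(\beta' \mid z_*, y)\, I_D(\beta') \\
&= s(gz)\, d^*\big( \beta', (z', g') \big) ,
\end{aligned}
$$
where the function s(·) is as defined before and
$$
d^*\big( \beta', (z', g') \big) = \frac{1}{\varepsilon}\, h(g' \mid z')\, \pi(z' \mid \beta', y)\, \pi(\beta' \mid z_*, y)\, I_D(\beta') ,
$$
where ε is also the same as before. Therefore, for the PX-DA algorithm, η is given by
$$
\begin{aligned}
\eta &= \frac{s(g_j z_j)\, d^*\big( \beta_{j+1}, (z_{j+1}, g_{j+1}) \big)}{\tilde k^*\big( \beta_{j+1}, (z_{j+1}, g_{j+1}) \mid \beta_j, (z_j, g_j) \big)} \\
&= \bigg[ \inf_{\beta \in D} \frac{\pi(\beta \mid g_j z_j, y)}{\pi(\beta \mid z_*, y)} \bigg] \frac{\pi(\beta_{j+1} \mid z_*, y)}{\pi(\beta_{j+1} \mid g_j z_j, y)}\, I_D(\beta_{j+1}) \\
&= \frac{\exp\big\{ -\frac{1}{2} g_j^2 z_j^T X (X^T X)^{-1} X^T z_j \big\}}{\exp\big\{ -\frac{1}{2} z_*^T X (X^T X)^{-1} X^T z_* \big\}}\, \exp\bigg\{ \sum_{i=1}^{p} \big[ c_i t_i^{*(j)} I_{\mathbb{R}_+}(t_i^{*(j)}) + d_i t_i^{*(j)} I_{\mathbb{R}_-}(t_i^{*(j)}) \big] \bigg\} \\
&\qquad \times \frac{\exp\big\{ -\frac{1}{2} z_*^T X (X^T X)^{-1} X^T z_* \big\}}{\exp\big\{ -\frac{1}{2} g_j^2 z_j^T X (X^T X)^{-1} X^T z_j \big\}}\, \exp\big\{ -(g_j z_j - z_*)^T X \beta_{j+1} \big\}\, I_D(\beta_{j+1}) \\
&= \exp\bigg\{ \sum_{i=1}^{p} \big[ c_i t_i^{*(j)} I_{\mathbb{R}_+}(t_i^{*(j)}) + d_i t_i^{*(j)} I_{\mathbb{R}_-}(t_i^{*(j)}) - t_i^{*(j)} \beta_{j+1,i} \big] \bigg\}\, I_D(\beta_{j+1})
\end{aligned}
$$
where $t^{*(j)T} = (g_j z_j - z_*)^T X$.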
Corollary 2 states that the asymptotic variance in the CLT for the PX-DA algorithm is no larger than
that for the AC algorithm; i.e., $\sigma_{f,k^*}^2 \le \sigma_{f,k}^2 < \infty$ for all $f \in L^2(\pi(\beta \mid y))$. However, we know from
Remark 1 that the regenerative method is based on a slightly different CLT whose asymptotic variance
has an extra factor involving the small function from the minorization condition, namely E(s(·) | y).
However, it is straightforward to show that the expectation of s(z) under π(β, z | y) (the invariant density
of $\tilde k(\beta', z' \mid \beta, z)$) is the same as the expectation of s(gz) under π(β | z, y) h(g | z) π(z | y) (the
invariant density of $\tilde k^*\big( \beta', (z', g') \mid \beta, (z, g) \big)$). Consequently, if $E\big[ |f(\beta)|^{2+\alpha} \mid y \big] < \infty$ for some α > 0
and if $\gamma_{f,k}^2$ and $\gamma_{f,k^*}^2$ denote the variances in the regenerative CLT for the AC and PX-DA algorithms,
respectively, then $\gamma_{f,k^*}^2 \le \gamma_{f,k}^2 < \infty$. Hence, PX-DA remains more efficient than AC in the regenerative
context.
4.4 An illustration using van Dyk and Meng’s lupus data
We end this section with an illustration of our results using van Dyk and Meng’s (2001) lupus data,
which consists of triples (yi , xi1 , xi2 ), i = 1, . . . , 55, where xi1 and xi2 are covariates indicating the
levels of certain antibodies in the ith individual and yi is an indicator for latent membranous lupus
nephritis (1 for presence and 0 for absence). van Dyk and Meng (2001) considered the model
$$
\Pr(Y_i = 1) = \Phi(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}) ,
$$
with a flat prior on β. We used a linear program (that is described in the Appendix) to verify that Chen
and Shao’s (2000) necessary and sufficient conditions for propriety are satisfied in this case.
In order to implement the regenerative method, we had to choose the distinguished point $z_*$ as well as the sets $[c_i, d_i]$. We ran the PX-DA algorithm for an initial 20,000 iterations starting from the maximum likelihood estimate of β given by β̂ = (−1.778, 4.374, 2.428). We took the distinguished point to be the average value of z over this initial run. For i ∈ {0, 1, 2}, let $\bar\beta_i$ and $s_i$ denote the sample mean and the usual sample standard deviation of the $\beta_i$s over this initial run. We set $D_i = \big[ \bar\beta_i - 0.09\, s_i ,\; \bar\beta_i + 0.09\, s_i \big]$. (The factor 0.09 was chosen by trial and error.)
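The construction of the intervals from a pilot run can be sketched as follows. This is only an illustration in plain Python: the helper name `regeneration_box` and the representation of the pilot draws as a list of tuples are assumptions, and the default factor 0.09 is simply the value reported above.

```python
import statistics

def regeneration_box(draws, factor=0.09):
    """Return one interval (lower, upper) per coordinate of beta.

    draws  : pilot-run draws of beta, one tuple per iteration
    factor : half-width of each interval in units of the sample standard deviation
    """
    box = []
    for coord in zip(*draws):              # loop over the coordinates of beta
        mean = statistics.mean(coord)      # sample mean over the pilot run
        sd = statistics.stdev(coord)       # usual (n - 1) sample standard deviation
        box.append((mean - factor * sd, mean + factor * sd))
    return box
```

A larger factor makes it more likely that a draw lands in the box but shrinks the acceptance probability inside it, which is presumably why the value had to be tuned by trial and error.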
We ran AC and PX-DA for R = 100 regenerations each. This took 1,223,576 iterations for AC and 1,256,677 iterations for PX-DA. We used the simulations to estimate the posterior expectations of the regression parameters and the results are shown in Table 1. (Results in Chen and Shao (2000) imply that there exists α > 0 such that $E\big(|\beta_j|^{2+\alpha} \mid y\big) < \infty$ for j ∈ {0, 1, 2}.) It is striking that the estimated asymptotic variances for the AC algorithm are all at least 63 times as large as the corresponding values for the PX-DA algorithm. These estimates suggest that, in this particular example, the AC algorithm requires about 65 times as many iterations as the PX-DA algorithm to achieve the same level of precision. (We actually repeated the entire experiment nine times and the estimates of $\gamma^2_{f,k} / \gamma^2_{f,k^*}$ ranged between 40 and 145.)
Table 1: Results Based on R = 100 Regenerations

             AC Algorithm                              PX-DA Algorithm
Parameter    estimate   s.e. $\hat\gamma_{f,k}/\sqrt{R}$    estimate   s.e. $\hat\gamma_{f,k^*}/\sqrt{R}$    $\hat\gamma^2_{f,k}/\hat\gamma^2_{f,k^*}$
β0           -3.060     0.097                          -3.018     0.012                            66.6
β1            7.005     0.190                           6.916     0.023                            66.9
β2            4.037     0.121                           3.982     0.015                            63.1
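The standard errors in Table 1 have the form γ̂/√R. The sketch below shows how such quantities are commonly computed from regenerative tours. The formulas are the usual ratio estimator and its variance estimator from the regenerative simulation literature (e.g., Hobert et al., 2002); they are not restated in this section, so treat them, along with the function name and input layout, as assumptions.

```python
import math

def regenerative_estimate(tour_lengths, tour_sums):
    """Ratio estimate of E[f(beta) | y] with a regenerative standard error.

    tour_lengths : N_1, ..., N_R, the number of iterations in each tour
    tour_sums    : S_1, ..., S_R, the sum of f(beta_j) over each tour
    Returns (estimate, gamma2_hat, standard_error).
    """
    R = len(tour_lengths)
    total_n = sum(tour_lengths)
    estimate = sum(tour_sums) / total_n                 # sum of S_t over sum of N_t
    mean_len = total_n / R                              # average tour length
    # Assumed estimator: gamma2_hat = (1/R) * sum_t (S_t - estimate * N_t)^2 / mean_len^2
    gamma2_hat = sum((s - estimate * n) ** 2
                     for n, s in zip(tour_lengths, tour_sums)) / (R * mean_len ** 2)
    return estimate, gamma2_hat, math.sqrt(gamma2_hat / R)
```

Under this convention, the last column of Table 1 would be the ratio of the two `gamma2_hat` values obtained from the AC and PX-DA runs for the same function f.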
5 Discussion
Let $\beta = \{\beta_j\}_{j=0}^{\infty}$ denote the Markov chain driven by either the AC algorithm or the PX-DA algorithm and consider the ergodic average, $\bar{f}_m = m^{-1} \sum_{j=0}^{m-1} f(\beta_j)$, which is a strongly consistent estimator of
the posterior expectation of f (β). We have proven that both algorithms converge at a geometric rate so,
as long as the function f is square integrable with respect to the posterior density, there is a CLT for
$\bar{f}_m$. Moreover, it follows from results in Hobert and Marchev (2006) that PX-DA is theoretically more
efficient than AC in the sense that the asymptotic variance in the CLT under the PX-DA algorithm is no
larger than that under the AC algorithm. We have also constructed minorization conditions that allow
one to exploit regenerative simulation techniques for the consistent estimation of asymptotic variances.
One could argue that a direct comparison of the asymptotic variances under AC and PX-DA is
not “fair” because PX-DA is more computationally demanding. However, the additional computation
required by the PX-DA algorithm is just a single (univariate) draw per iteration, which is basically
insignificant relative to the total amount of computation involved in one iteration of the AC algorithm.
Hence, we believe the comparison is justified. Furthermore, while our theoretical results say only that
PX-DA is at least as efficient as AC, our empirical study based on van Dyk and Meng’s (2001) lupus
data reveals that dramatic gains in efficiency are possible by using PX-DA instead of AC.
There are obvious analogues of the AC algorithm for other link functions such as the logit and
complementary log-log links. It would be interesting to see if our results for the probit link could be
extended to these other link functions. Our proof that the AC algorithm is geometrically ergodic uses
several results that are specific to the probit link, including the bounds on Mills’ ratio and the moments
of the truncated normal distribution. Extending our proof to a different link function would require
analogous results for the distribution function in question; e.g., the logistic distribution for the logit link.
It would also be interesting to see whether the extensions of the AC algorithm that have been developed
for the multinomial probit model and the multivariate probit model for correlated binary data (see, e.g.,
Albert and Chib, 1993; Chib and Greenberg, 1998) also converge at a geometric rate.
Appendices
A Chen and Shao’s conditions
Here we state Chen and Shao’s (2000) necessary and sufficient conditions for c(y) < ∞ as well as a
simple method for checking these conditions. Let X denote the n × p matrix whose ith row is $x_i^T$ and let W denote an n × p matrix whose ith row is $w_i^T$, where
\[
w_i = \begin{cases} x_i & \text{if } y_i = 0 \\ -x_i & \text{if } y_i = 1 . \end{cases}
\]
Proposition 1. (Chen and Shao, 2000) The function c(y) is finite if and only if
1. the design matrix X has full column rank, and
2. there exists a vector $a = (a_1, \ldots, a_n)^T$ with strictly positive components such that $W^T a = 0$.
Assuming that X has full column rank, the second condition of Proposition 1 can be checked with a simple linear program, implementable in the R programming language (R Development Core Team, 2006) using the “simplex” function from the “boot” library. Let 1 and J denote a column vector and a matrix of 1s, respectively. The linear program calls for maximizing $1^T a$ subject to
• $W^T a = 0$
• $(J - I)a \le 1$ (element-wise)
• $a_i \ge 0$ for i = 1, . . . , n
This is always feasible (e.g., take a to be a vector of zeros). If the maximizer, call it $a^*$, is such that $a_i^* > 0$ for all i = 1, . . . , n, then the second condition of Proposition 1 is satisfied and c(y) < ∞. Moreover, it is straightforward to show that if $a^*$ contains one or more zeros, then there does not exist an a with all positive elements such that $W^T a = 0$, so c(y) = ∞.
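The ingredients of this check can be sketched in plain Python. The code below does not solve the linear program (that still requires an LP solver such as the R function mentioned above); it only builds W from X and y and tests whether a candidate vector a satisfies the constraints and is strictly positive. The function names and the toy data are hypothetical, and exact rational arithmetic is used so that the equality constraint can be tested without rounding error.

```python
from fractions import Fraction

def build_W(X, y):
    """Row i of W is x_i when y_i == 0 and -x_i when y_i == 1."""
    return [[-v for v in row] if yi == 1 else list(row) for row, yi in zip(X, y)]

def is_lp_feasible(W, a):
    """Check W^T a = 0, (J - I) a <= 1 element-wise, and a_i >= 0."""
    n, p = len(W), len(W[0])
    balanced = all(sum(W[i][k] * a[i] for i in range(n)) == 0 for k in range(p))
    total = sum(a)
    bounded = all(total - ai <= 1 for ai in a)   # row i of (J - I) a sums the other components
    nonnegative = all(ai >= 0 for ai in a)
    return balanced and bounded and nonnegative

def certifies_condition_2(W, a):
    """A feasible a with all components strictly positive verifies condition 2."""
    return is_lp_feasible(W, a) and all(ai > 0 for ai in a)

# Toy data (hypothetical): an intercept plus one covariate, with overlapping responses.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [0, 1, 0, 1]
W = build_W(X, y)
a = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]
print(certifies_condition_2(W, a))
```

The full column rank of X (condition 1) has to be verified separately; this sketch addresses only condition 2.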
B Proof of Theorem 1
We begin with a definition. A function $V : \mathbb{R}^p \to [0, \infty)$ is said to be unbounded off compact sets if for every α > 0, the level set $\{\beta : V(\beta) \le \alpha\}$ is compact.
Proof of Theorem 1. It is straightforward to show that the Markov chain driven by $k(\beta' \mid \beta)$ is a Feller Markov chain and that the support of its maximal irreducibility measure has non-empty interior (Roy, 2008). Thus, according to Meyn and Tweedie’s (1993) Lemma 15.2.8, we can show that the chain is geometrically ergodic by finding a $V : \mathbb{R}^p \to [0, \infty)$ that is unbounded off compact sets and is such that
\[
KV \le \lambda V + L
\]
for some λ ∈ [0, 1) and some L < ∞, where $(KV)(\beta) = \int_{\mathbb{R}^p} V(\beta')\, K(\beta, d\beta')$. We will use $V(\beta) = (X\beta)^T (X\beta)$ and, as is standard, we will refer to V as the drift function. Recall that X is assumed to have full column rank, p, and hence $X^T X$ is positive definite. Thus, for each α > 0, the set
\[
\{\beta \in \mathbb{R}^p : V(\beta) \le \alpha\} = \{\beta \in \mathbb{R}^p : \beta^T (X^T X) \beta \le \alpha\}
\]
is compact and hence our drift function is unbounded off compact sets. Now, using Fubini’s theorem, we have
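\[
\begin{aligned}
(KV)(\beta) &= \int_{\mathbb{R}^p} V(\beta')\, k(\beta' \mid \beta)\, d\beta' \\
&= \int_{\mathbb{R}^n} \left[ \int_{\mathbb{R}^p} V(\beta')\, \pi(\beta' \mid z, y)\, d\beta' \right] \pi(z \mid \beta, y)\, dz \\
&= \int_{\mathbb{R}^n} E\big[ V(\beta') \mid z, y \big]\, \pi(z \mid \beta, y)\, dz \\
&= E\Big\{ E\big[ V(\beta') \mid z, y \big] \,\Big|\, \beta, y \Big\} ,
\end{aligned}
\]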
where, as the notation suggests, the expectations in the last two lines are with respect to the conditional densities $\pi(\beta' \mid z, y)$ and $\pi(z \mid \beta, y)$. Recall that $\pi(\beta' \mid z, y)$ is a p-dimensional normal density and $\pi(z \mid \beta, y)$ is a product of truncated normals. Evaluating the inner expectation, we have
\[
\begin{aligned}
E\big[ V(\beta') \mid z, y \big] &= E\big[ (\beta')^T X^T X \beta' \mid z, y \big] \\
&= \mathrm{tr}\big( (X^T X)(X^T X)^{-1} \big) + z^T X (X^T X)^{-1} (X^T X) (X^T X)^{-1} X^T z \\
&= p + z^T X (X^T X)^{-1} X^T z \\
&\le p + z^T z ,
\end{aligned}
\]
where tr(·) denotes the trace of a matrix and the inequality follows from the fact that
\[
z^T \big( I - X (X^T X)^{-1} X^T \big) z \ge 0
\]
for all $z \in \mathbb{R}^n$. We now have that
\[
E\Big\{ E\big[ V(\beta') \mid z, y \big] \,\Big|\, \beta, y \Big\} \le E\big[ p + z^T z \mid \beta, y \big] = p + \sum_{i=1}^{n} E\big[ z_i^2 \mid \beta, y \big] .
\]
Standard results for the truncated normal distribution (see, e.g., Johnson and Kotz, 1970) imply that if U ∼ TN(ξ, 1, 1) then
\[
E(U^2) = 1 + \xi^2 + \frac{\xi \phi(\xi)}{\Phi(\xi)} ,
\]
where φ(·) with only a single argument denotes the standard normal density function; that is, φ(v) is equivalent to φ(v; 0, 1). Similarly, if U ∼ TN(ξ, 1, 0) then
\[
E(U^2) = 1 + \xi^2 - \frac{\xi \phi(\xi)}{1 - \Phi(\xi)} .
\]
It follows that
\[
E\big[ z_i^2 \mid \beta, y \big] =
\begin{cases}
1 + (x_i^T \beta)^2 + \dfrac{(x_i^T \beta)\, \phi(x_i^T \beta)}{\Phi(x_i^T \beta)} & \text{if } y_i = 1 \\[2ex]
1 + (x_i^T \beta)^2 - \dfrac{(x_i^T \beta)\, \phi(x_i^T \beta)}{1 - \Phi(x_i^T \beta)} & \text{if } y_i = 0 .
\end{cases}
\]
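A more compact way of expressing this is as follows:
\[
E\big[ z_i^2 \mid \beta, y \big] = 1 + (w_i^T \beta)^2 - \frac{(w_i^T \beta)\, \phi(w_i^T \beta)}{1 - \Phi(w_i^T \beta)} ,
\]
where $w_i$ is defined in Section A of this Appendix. Hence, we have
\[
(KV)(\beta) = E\Big\{ E\big[ V(\beta') \mid z, y \big] \,\Big|\, \beta, y \Big\} \le p + n + \sum_{i=1}^{n} (w_i^T \beta)^2 - \sum_{i=1}^{n} \frac{(w_i^T \beta)\, \phi(w_i^T \beta)}{1 - \Phi(w_i^T \beta)} . \tag{8}
\]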
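The truncated normal second moments used in this step are easy to evaluate numerically. The following Python sketch, with hypothetical function names, implements the two-case formula in terms of $x_i^T \beta$ and the compact formula in terms of $w_i^T \beta$, so that the two can be compared.

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-u / math.sqrt(2.0))

def upper_tail(u):
    """1 - Phi(u), computed directly to avoid cancellation for large u."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def second_moment(eta, y):
    """E(z_i^2 | beta, y) with eta = x_i^T beta and y = y_i (two-case form)."""
    if y == 1:
        return 1.0 + eta * eta + eta * phi(eta) / Phi(eta)
    return 1.0 + eta * eta - eta * phi(eta) / upper_tail(eta)

def second_moment_compact(w_beta):
    """The same quantity written in terms of w_i^T beta (compact form)."""
    return 1.0 + w_beta * w_beta - w_beta * phi(w_beta) / upper_tail(w_beta)
```

Since $w_i^T \beta = -x_i^T \beta$ when $y_i = 1$ and $w_i^T \beta = x_i^T \beta$ when $y_i = 0$, `second_moment(eta, 1)` should agree with `second_moment_compact(-eta)` and `second_moment(eta, 0)` with `second_moment_compact(eta)`.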
The goal is to show that $(KV)(\beta) \le \lambda V(\beta) + L$ for all $\beta \in \mathbb{R}^p$. It follows from (8) that $(KV)(0) \le p + n$. We now concentrate on $\beta \in \mathbb{R}^p \setminus \{0\}$.
We begin by constructing a partition of the set $\mathbb{R}^p \setminus \{0\}$ using the n hyperplanes defined by $w_i^T \beta = 0$. For a positive integer m, define $N_m = \{1, 2, \ldots, m\}$. Let $A_1, A_2, \ldots, A_{2^n}$ denote all the subsets of $N_n$, and, for each $j \in N_{2^n}$, define a corresponding subset of p-dimensional Euclidean space as follows:
\[
S_j = \big\{ \beta \in \mathbb{R}^p \setminus \{0\} : w_i^T \beta \le 0 \text{ for all } i \in A_j \text{ and } w_i^T \beta > 0 \text{ for all } i \in \bar{A}_j \big\} ,
\]
where $\bar{A}_j$ denotes the complement of $A_j$; that is, $\bar{A}_j = N_n \setminus A_j$. Note that
• the $S_j$ are disjoint,
• $\bigcup_{j=1}^{2^n} S_j = \mathbb{R}^p \setminus \{0\}$, and
• some of the $S_j$ may be empty.
We now show that if $S_j$ is nonempty, then so are $A_j$ and $\bar{A}_j$. Suppose that $S_j \ne \emptyset$ and fix $\beta \in S_j$. Since the conditions of Proposition 1 are in force, there exist strictly positive constants $\{a_i\}_{i=1}^{n}$ such that
\[
a_1 w_1^T + a_2 w_2^T + \cdots + a_n w_n^T = 0 .
\]
Therefore,
\[
a_1 w_1^T \beta + a_2 w_2^T \beta + \cdots + a_n w_n^T \beta = 0 . \tag{9}
\]
The matrix X has full column rank p, and hence $0 < \beta^T X^T X \beta = \sum_{i=1}^{n} (x_i^T \beta)^2 = \sum_{i=1}^{n} (w_i^T \beta)^2$. Thus, there exists an $i \in N_n$ such that $w_i^T \beta \ne 0$ and, since all the $a_i$ are strictly positive, (9) implies that there must also exist an $i' \ne i$ such that $w_{i'}^T \beta$ and $w_i^T \beta$ have opposite signs. Thus, $A_j$ and $\bar{A}_j$ are both nonempty. Now define $C = \{ j \in N_{2^n} : S_j \ne \emptyset \}$. For each $j \in C$, define
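\[
R_j(\beta) = \frac{\sum_{i \in A_j} (w_i^T \beta)^2}{\sum_{i=1}^{n} (w_i^T \beta)^2} = \frac{\sum_{i \in A_j} (w_i^T \beta)^2}{\sum_{i \in A_j} (w_i^T \beta)^2 + \sum_{i \in \bar{A}_j} (w_i^T \beta)^2} ,
\]
and
\[
\lambda_j = \sup_{\beta \in S_j} R_j(\beta) \in [0, 1] .
\]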
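To make the objects in this construction concrete, the short Python sketch below identifies the index set $A_j$ of the cell containing a given nonzero β and evaluates $R_j(\beta)$ there. The function names and the toy matrix W are illustrative assumptions only.

```python
def dot(u, v):
    """Inner product of two vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))

def index_set(W, beta):
    """A_j for the cell S_j containing beta: the (1-based) indices i with w_i^T beta <= 0."""
    return {i + 1 for i, w in enumerate(W) if dot(w, beta) <= 0}

def R_j(W, beta):
    """R_j(beta) for the cell containing beta (beta nonzero, W of full column rank)."""
    values = [dot(w, beta) for w in W]
    numerator = sum(v * v for v in values if v <= 0)    # sum over i in A_j
    denominator = sum(v * v for v in values)            # sum over all i
    return numerator / denominator

# Toy W (hypothetical), with rows w_1^T, ..., w_4^T.
W = [[1, 0], [-1, -1], [1, 2], [-1, -3]]
print(index_set(W, [1, 0]), R_j(W, [1, 0]))
```

Because numerator and denominator are both quadratic in β, multiplying β by a positive constant leaves the value of `R_j` unchanged.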
In the following calculation, we will utilize a couple of facts concerning the so-called Mills’ ratio. First, when u ≥ 0, $u\phi(u)/(1 - \Phi(u)) \ge u^2$ (see, e.g., Feller, 1968, p.175). Also, it is clear that if we define
\[
M = \sup_{u \in (-\infty, 0]} \left| \frac{u \phi(u)}{1 - \Phi(u)} \right| ,
\]
then M ∈ (0, ∞).
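Both facts can be checked numerically. The sketch below is a plain-Python illustration with hypothetical function names; the grid search over [−10, 0] is an assumption justified by the rapid decay of |u|φ(u) as u → −∞, and it only approximates the supremum.

```python
import math

def mills_term(u):
    """u * phi(u) / (1 - Phi(u)), the quantity appearing in the two facts above."""
    density = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    upper_tail = 0.5 * math.erfc(u / math.sqrt(2.0))    # 1 - Phi(u)
    return u * density / upper_tail

def approximate_M(lower=-10.0, steps=100_000):
    """Grid approximation to the supremum of |mills_term(u)| over u <= 0."""
    return max(abs(mills_term(lower * k / steps)) for k in range(steps + 1))

M = approximate_M()
print(M)
```

On this grid the maximum is attained at a moderate negative value of u and is well below 1, consistent with the claim that M is finite and strictly positive.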
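Fix $j \in C$. It follows from (8) and the results concerning Mills’ ratio that for all $\beta \in S_j$, we have
\[
\begin{aligned}
(KV)(\beta) &\le p + n + \sum_{i=1}^{n} (w_i^T \beta)^2 - \sum_{i \in A_j} \frac{(w_i^T \beta)\, \phi(w_i^T \beta)}{1 - \Phi(w_i^T \beta)} - \sum_{i \in \bar{A}_j} \frac{(w_i^T \beta)\, \phi(w_i^T \beta)}{1 - \Phi(w_i^T \beta)} \\
&= p + n + \sum_{i=1}^{n} (w_i^T \beta)^2 + \sum_{i \in A_j} \left| \frac{(w_i^T \beta)\, \phi(w_i^T \beta)}{1 - \Phi(w_i^T \beta)} \right| - \sum_{i \in \bar{A}_j} \frac{(w_i^T \beta)\, \phi(w_i^T \beta)}{1 - \Phi(w_i^T \beta)} \\
&\le p + n + \sum_{i=1}^{n} (w_i^T \beta)^2 + nM - \sum_{i \in \bar{A}_j} (w_i^T \beta)^2 \\
&= p + n(M + 1) + \sum_{i \in A_j} (w_i^T \beta)^2 \\
&= p + n(M + 1) + R_j(\beta) \sum_{i=1}^{n} (w_i^T \beta)^2 \\
&\le \lambda_j V(\beta) + L ,
\end{aligned}
\]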
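where $L := p + n(M + 1)$. Since $\biguplus_{j \in C} S_j = \mathbb{R}^p \setminus \{0\}$, if we define
\[
\lambda = \max_{j \in C} \lambda_j ,
\]
then we have
\[
(KV)(\beta) \le \lambda V(\beta) + L
\]
for all $\beta \in \mathbb{R}^p$ (the case β = 0 is covered because $(KV)(0) \le p + n \le L$). Hence, it suffices to show that $\lambda_j < 1$ for all $j \in C$.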
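Again, fix $j \in C$ and note that for $l \in \mathbb{R}_+$, $R_j(l\beta) = R_j(\beta)$, which means that $R_j(\beta)$ depends on β only through β’s direction and not on its distance from the origin. Thus,
\[
\lambda_j = \sup_{\beta \in S_j} R_j(\beta) = \sup_{\beta \in S_j^*} R_j(\beta) \le \sup_{\beta \in S_j^{**}} R_j(\beta) ,
\]
where
\[
S_j^* = \big\{ \beta \in \mathbb{R}^p : \|\beta\| = 1 \text{ and } w_i^T \beta \le 0 \text{ for all } i \in A_j \text{ and } w_i^T \beta > 0 \text{ for all } i \in \bar{A}_j \big\} ,
\]
and
\[
S_j^{**} = \big\{ \beta \in \mathbb{R}^p : \|\beta\| = 1 \text{ and } w_i^T \beta \le 0 \text{ for all } i \in A_j \text{ and } w_i^T \beta \ge 0 \text{ for all } i \in \bar{A}_j \big\} .
\]
Now, since $S_j^{**}$ is a compact set in $\mathbb{R}^p$ and $R_j(\beta)$ is a continuous function on $S_j^{**}$, we know that
\[
\sup_{\beta \in S_j^{**}} R_j(\beta) = R_j(\tilde\beta) \text{ for some } \tilde\beta \in S_j^{**} .
\]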
Thus, we need only show that there does not exist a $\tilde\beta \in S_j^{**}$ such that $R_j(\tilde\beta) = 1$. Assume such a $\tilde\beta$ does indeed exist. Then
\[
\frac{\sum_{i \in A_j} (w_i^T \tilde\beta)^2}{\sum_{i \in A_j} (w_i^T \tilde\beta)^2 + \sum_{i \in \bar{A}_j} (w_i^T \tilde\beta)^2} = 1 .
\]
This implies that $\sum_{i \in \bar{A}_j} (w_i^T \tilde\beta)^2 = 0$. Again, there exist strictly positive constants $a_1, a_2, \ldots, a_n$ such that
\[
a_1 w_1^T \tilde\beta + a_2 w_2^T \tilde\beta + \cdots + a_n w_n^T \tilde\beta = 0 .
\]
But we already know that $w_i^T \tilde\beta = 0$ for all $i \in \bar{A}_j$, and hence it must be the case that
\[
\sum_{i \in A_j} a_i w_i^T \tilde\beta = 0 .
\]
However, $w_i^T \tilde\beta \le 0$ for all $i \in A_j$ as $\tilde\beta \in S_j^{**}$. This combined with the fact that the $a_i$ are all strictly positive shows that $w_i^T \tilde\beta = 0$ for all $i \in A_j$. Hence, we have identified a nonzero $\tilde\beta$ such that
\[
w_i^T \tilde\beta = 0 \text{ for all } i \in N_n .
\]
But this contradicts the fact that W has full column rank. Thus, the assumed $\tilde\beta$ cannot exist and we have established that
\[
\sup_{\beta \in S_j^{**}} R_j(\beta) < 1 ,
\]
which implies that $\lambda_j < 1$. Therefore, λ < 1 and the proof is complete.
Acknowledgments
The authors thank Trevor Park (for suggesting the linear program described in the Appendix), Jose
Blanchet and Andrew Thomas (for pointing out an error in an early version of the paper), and two
anonymous referees (for helpful comments and suggestions). Hobert’s research was supported by NSF Grant
DMS-05-03648.
References
Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88 669–679.
Chan, K. S. and Geyer, C. J. (1994). Discussion of “Markov chains for exploring posterior distributions”. The Annals of Statistics, 22 1747–1757.
Chen, M.-H. and Shao, Q.-M. (2000). Propriety of posterior distribution for dichotomous quantal response models. Proceedings of the American Mathematical Society, 129 293–302.
Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models. Biometrika, 85 347–361.
Eaton, M. L. (1989). Group Invariance Applications in Statistics. Institute of Mathematical Statistics and the American Statistical Association, Hayward, California and Alexandria, Virginia.
Feller, W. (1968). An Introduction to Probability Theory and its Applications, vol. I. 3rd ed. John Wiley & Sons, New York.
Hobert, J. P., Jones, G. L., Presnell, B. and Rosenthal, J. S. (2002). On the applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika, 89 731–743.
Hobert, J. P. and Marchev, D. (2006). A theoretical comparison of the data augmentation, marginal augmentation and PX-DA algorithms. Tech. rep., University of Florida. Available at http://web.stat.ufl.edu/~jhobert/.
Johnson, N. L. and Kotz, S. (1970). Continuous Univariate Distributions-1. John Wiley & Sons.
Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101 1537–1547.
Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statistical Science, 16 312–34.
Liu, J. S., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to comparisons of estimators and augmentation schemes. Biometrika, 81 27–40.
Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association, 94 1264–1274.
Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer Verlag, London.
Mira, A. and Geyer, C. J. (1999). Ordering Monte Carlo Markov chains. Tech. Rep. No. 632, School of Statistics, University of Minnesota.
Mykland, P., Tierney, L. and Yu, B. (1995). Regeneration in Markov chain samplers. Journal of the American Statistical Association, 90 233–41.
R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.
Roberts, G. O. and Rosenthal, J. S. (1997). Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2 13–25.
Roberts, G. O. and Rosenthal, J. S. (2001). Markov chains and de-initializing processes. Scandinavian Journal of Statistics, 28 489–504.
Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys, 1 20–71.
Roberts, G. O. and Tweedie, R. L. (2001). Geometric $L^2$ and $L^1$ convergence are equivalent for reversible Markov chains. Journal of Applied Probability, 38A 37–41.
Roy, V. (2008). Analysis of Markov chain Monte Carlo algorithms for Bayesian probit regression. Ph.D. thesis, Department of Statistics, University of Florida.
van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation (with discussion). Journal of Computational and Graphical Statistics, 10 1–50.