Expert Information and Nonparametric
Bayesian Inference of Rare Events
Hwan-sik Choi∗
November 2012
Abstract
Prior distributions are important in Bayesian inference of rare events because historical data
information is scarce, and experts are an important source of information for elicitation of a
prior distribution. We propose a method to incorporate expert information into nonparametric
Bayesian inference on rare events when expert knowledge is elicited as moment conditions on a
finite dimensional parameter θ only. We generalize the Dirichlet process mixture model to merge
expert information into the Dirichlet process (DP) prior to satisfy the expert’s moment conditions.
Among all the priors that comply with expert knowledge, we use the one that is closest to the
original DP prior in the Kullback-Leibler information criterion. The resulting prior distribution
is given by exponentially tilting the DP prior along θ. We provide a Metropolis-Hastings
algorithm to implement our approach to sample from posterior distributions with exponentially
tilted DP priors. Our method combines prior information from an econometrician and an expert
by finding the least-informative prior given expert information.
JEL: C11, C14, C15, G11, G31
Keywords: Dirichlet process mixture, defaults, Kullback-Leibler information criterion, maximum entropy, Metropolis-Hastings algorithm.
∗ hwansik@purdue.edu. Department of Consumer Science, Purdue University, 812 W State St. West Lafayette,
IN 47907, USA.
1 Introduction
We develop a nonparametric Bayesian framework that incorporates expert information for inference
of rare events. Inference of rare events such as defaults in a high grade portfolio, extreme losses, and
other catastrophic events is critical in measuring credit risk and systemic risk, which are important
in risk management for optimal hedging and economic capital calculations. But inference of rare
events is difficult because of lack of historical data information. Therefore, it is important to use
all sources of information including non-data information, and the Bayesian method provides a
natural and coherent mechanism of combining all available information. However, the use of noninformative or objective priors in a Bayesian model for rare events is often not satisfactory because
of the scarcity of information in data. Kiefer (2009, 2010) proposes to use expert information for
default estimation as an additional source of information, and argues that the Bayesian approach is
a natural theoretical framework for inference on rare events. He uses expert information to elicit the
Beta distribution prior or a mixture of the Beta distributions for a default probability. Although
Kiefer’s approach is illustrated with the binomial sampling distribution for the number of defaults
in a portfolio, it can easily be extended to general parametric models.
In this paper, we argue that nonparametric models are useful for rare events, and show that
expert information can be effectively incorporated in nonparametric models too. Consider the
following situation for motivation of using nonparametric models for rare events. Rare events are
often associated with the tails of sampling distributions. For this reason, rare events are also
called tail-risk events. For a parametric model, a finite number of parameters would determine
the entire distribution, and inference of tail probabilities can be done from inference of the finite
dimensional parameter. In this case, the problem of scarce tail-risk events is remedied by the use
of the parametric assumption, which is a strong form of non-data information. Consequently, a
parametric model improves the efficiency of using data information. However, when there is a
concern for misspecification, the impact of misspecification can be serious for inference of tail-risk
events. Since frequent events would dominate the data information, they would drive estimation
of the parameters as well. But frequent events may not be relevant to tail-risk events if the model
is misspecified. Moreover, it is difficult to check the adequacy of a parametric model specifically
for tail probabilities because of the scarcity of rare events. For example, the maximum likelihood
(ML) estimator for the mean parameter of normal distributions with a known variance would be the
sample mean, which is a sufficient statistic. All inference can be done with the sufficient statistic.
But if the normal distributions are misspecified, it would be better to do inference on the tail
probabilities by emphasizing the samples far away from the sample mean, since they are more
relevant. For this reason, using a nonparametric model can be appealing for rare events when we
do not have strong confidence in a parametric model.
Of course, an important disadvantage of nonparametric models is reduction in efficiency of
using data. Therefore, additional information becomes very desirable, and expert information
is particularly useful as complementary information in this setting. However, elicitation of expert
opinion becomes difficult when the dimension of a model parameter increases. Practically, it is easier
for an expert to talk about a few aspects of rare events rather than the entire shape of sampling
distributions. Our approach is to use expert information on a finite dimensional parameter, and
to incorporate the expert information in the nonparametric Bayesian model. In particular, we
consider the Dirichlet process mixture (DPM) model. The sampling distribution of the DPM
model is an infinite mixture of a family of distributions, and the mixing distribution works as an
infinite dimensional parameter of the model. The DPM model uses the Dirichlet process as a prior
distribution over the mixing distributions. The Dirichlet process (DP) is a random measure over
a space of distributions. Since the development of the DP by Freedman (1963), its theoretical
properties and usefulness in Bayesian analysis are shown by Ferguson (1973), Ferguson (1974),
Blackwell and MacQueen (1973), and Antoniak (1974). In particular, the DP is widely used as a prior
distribution of nonparametric Bayesian models (Lo (1984)).
The DPM model has become popular with the development of the Markov Chain Monte Carlo
(MCMC) methods such as Escobar (1994), Escobar and West (1995), MacEachern and Müller
(1998), Neal (2000), and Ishwaran and Zarepour (2000) for sampling posterior distributions with
DP priors. In economics, Hirano (2002) studies semiparametric Bayesian inference for a panel
model with a countable mixture of normally distributed errors with mixing weights drawn from
the DP. Griffin and Steel (2011) consider a serially correlated sequence of DPs with applications to
financial time series and GDP data, and Jensen and Maheu (2010) use the DP for the distribution
of errors that are multiplied by the volatility process in a stochastic volatility model. Taddy and
Kottas (2010) apply the DPM model for inference in quantile regressions.
In the DPM model, a mixing distribution drawn from the DP determines the sampling distribution F that is a member of the space F of all sampling distribution functions of a nonparametric
model. Therefore, the DP gives a distribution over the functional space F. Given a sampling
distribution function F , we consider a finite dimensional vector θ of functionals of F related to the
rare events that an expert has knowledge on. We elicit expert opinion about the distribution of θ.
Elicitation of expert information requires a careful design. Generally, it is easier for an expert to
think about probabilities or quantiles than moments. Kadane and Wolfson (1998) and Garthwaite,
Kadane, and O’Hagan (2005) discuss the issues in elicitation of expert opinions. Kadane, Chan,
and Wolfson (1996) study elicitation of a subjective prior for Bayesian unit root models. Chaloner
and Duncan (1983) develop a computer scheme to elicit expert information from a predictive
distribution.
In our approach, expert information is in the form of moment conditions on θ that the DP prior
may not satisfy. We combine the expert information and the DP prior by modifying the DP prior
to satisfy the moment conditions. Among all such modifications, we find the prior distribution
that is closest to the original DP prior in the Kullback-Leibler information criterion. The resulting prior distribution is given by exponentially tilting the DP prior along θ. We also provide a
Metropolis-Hastings algorithm to implement our approach to draw samples from the exponentially
tilted DP prior. Our method gives a simple way to combine the prior information from an econometrician and an expert by finding the least-informative prior given expert information, based on
an econometrician’s prior.
2 Expert information in nonparametric Bayesian inference

2.1 Dirichlet process mixture models
Consider the DPM model for n observations {yi}_{i=1}^n given by

    yi | ξi ∼ Kξi ,
    ξi | G ∼ G,
    G ∼ P,

where X ∼ F means the distribution of X is F, Kξi = K(·|ξi) is a distribution function with a parameter ξi ∈ Ξ ⊂ R^d, and ξi is a random draw from a mixing measure G defined on a probability space (Ξ, B), where B is a σ-field of subsets of Ξ. Let G be the space of all mixing measures on Ξ. In the DPM model, a measure G ∈ G is thought to be an infinite dimensional parameter, and the prior distribution of G is given by a Dirichlet process P = DP(αG0) with a concentration parameter α > 0 and a base measure G0 ∈ G. The mean of DP(αG0) is G0, so we can write

    EG = G0,    (2.1)

where

    EG = ∫ G P(dG).    (2.2)

The concentration parameter α is inversely related to the variance of DP(αG0), and we have G → G0 in distribution as α → ∞. Any finite dimensional distribution of DP(αG0) is a Dirichlet distribution, i.e. for an arbitrary finite measurable partition (B1, B2, . . . , Br) of Ξ, we have

    (G(B1), G(B2), . . . , G(Br)) ∼ Dirichlet(αG0(B1), αG0(B2), . . . , αG0(Br)),

and EG(B) = G0(B) for all B ∈ B. From Σ_{j=1}^r αG0(Bj) = αG0(Ξ) = α, α is also called a total mass parameter. See Ferguson (1973) and Ferguson (1974) for the detailed development of the DP prior and its properties, and Ghosh and Ramamoorthi (2003) and Hjort, Holmes, Müller, and Walker (2010) for an overview of nonparametric Bayesian methods.
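As a quick numerical illustration of the finite-dimensional Dirichlet property above, the following sketch draws (G(B1), G(B2), G(B3)) for a three-set partition of the real line and checks that their averages are close to (G0(B1), G0(B2), G0(B3)). The choice G0 = N(0, 1) and the cut points are illustrative assumptions, not taken from the paper.

```python
# Illustrative check of the finite-dimensional Dirichlet property of DP(alpha*G0).
# Assumed setup (not from the paper): G0 = N(0,1), partition B1 = (-inf,-1], B2 = (-1,1], B3 = (1,inf).
import numpy as np
from scipy.stats import norm

alpha = 10.0                                              # concentration (total mass) parameter
p0 = np.diff(norm.cdf([-np.inf, -1.0, 1.0, np.inf]))      # (G0(B1), G0(B2), G0(B3))

rng = np.random.default_rng(0)
draws = rng.dirichlet(alpha * p0, size=100_000)           # (G(B1), G(B2), G(B3)) ~ Dirichlet(alpha*G0(Bj))

print("G0(Bj):          ", np.round(p0, 4))
print("average of G(Bj):", np.round(draws.mean(axis=0), 4))   # close to G0(Bj), as in (2.1)
```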
For notational simplicity, we omit the subscript i. The distribution function F (y|G) of y given
G can be written in a mixture form as

    F(y|G) = ∫ K(y|ξ) G(dξ),    (2.3)

where K(y|ξ) is used as a mixing kernel. For convenience of exposition of our main idea, we also assume that the probability density function of K(y|ξ) exists for all ξ. Then we can write the sampling density of y given G as

    f(y|G) = ∫ k(y|ξ) G(dξ),    (2.4)
where the mixing kernel k(y|ξ) is the probability density function of K(y|ξ). The DPM model
has become popular in Bayesian nonparametric modeling because of its flexibility compared to a
parametric model or a mixture model with a fixed number of mixing components. To generate a
posterior distribution for the DPM model, various sampling schemes were developed. Blackwell and
MacQueen (1973) develop Pólya urn schemes for DP priors of Ferguson (1973). Escobar (1994)
and Escobar and West (1995) provide generalized Pólya urn Gibbs samplers for the DPM model.
Neal (2000), Ishwaran and Zarepour (2000) and Ishwaran and James (2001) also provide alternative
sampling methods for a general class of priors that include DP priors.
In our framework, experts have information on a k-dimensional parameter of interest

    θ = ϕ(F(y|G)),    (2.5)

where ϕ(·) is a vector of functionals of the sampling distribution F(y|G) of y. Note that the expected value of θ can be written as

    Eθ = ∫ ϕ(F(y|G)) P(dG).
The functional ϕ can be a linear functional such as moments of F (y|G), or a value of the distribution
function F (c|G) at a fixed point y = c. But it can also be a nonlinear functional such as quantiles,
inter-quantile ranges, hazard functions of F (y|G), or the quantities regarding the distribution of
the maximum or minimum of N samples. See Gelfand and Mukhopadhyay (1995) and Gelfand and
Kottas (2002) for more examples of ϕ and their properties. When ϕ is linear, from (2.3), we can
simplify θ to a mixture of ϕ(K(y|ξ)),

    θ = ∫ ϕ(K(y|ξ)) G(dξ),    (2.6)

and using (2.1), the expected value of θ becomes

    Eθ = ∫ ϕ(K(y|ξ)) G0(dξ).    (2.7)
An important class of linear functionals ϕ is given by

    ϕ(F(y|G)) = ∫ h(y) f(y|G) dy,

where h : R → R^k. Note that if h(y) = y^p, then θ becomes a p-th order moment of F(y|G), and if h(y) = I{y ≤ c}, where I(·) is an indicator function, then θ is the cumulative distribution function evaluated at c. From (2.4), we can easily prove (2.6) by

    θ = ∫ h(y) ∫ k(y|ξ) G(dξ) dy
      = ∫∫ h(y) k(y|ξ) dy G(dξ)
      = ∫ ϕ(K(y|ξ)) G(dξ).
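Since draws from the DP are discrete (Section 3.1), the identity (2.6) turns a linear functional of F(y|G) into a weighted sum over the atoms of G. A small sketch of this computation, assuming for illustration the normal kernel of Section 4 and h(y) = I{y ≤ c}, so that θ = F(c|G); the weights and atoms below are placeholders:

```python
# Sketch of (2.6) for a discrete mixing measure G = sum_j w_j * delta_{xi_j}:
# with h(y) = I{y <= c}, theta = F(c|G) = sum_j w_j * K(c|xi_j).
# The normal kernel K(y|xi) = Phi(y|mu, 1/tau) of Section 4 is assumed; numbers are illustrative.
import numpy as np
from scipy.stats import norm

c = 0.0
w   = np.array([0.5, 0.3, 0.2])          # mixing weights w_j of the discrete G
mu  = np.array([10.0, 15.0, 20.0])       # kernel means mu_j
tau = np.array([0.04, 0.04, 0.04])       # kernel precisions tau_j (variance 1/tau_j)

theta = np.sum(w * norm.cdf(c, loc=mu, scale=1.0 / np.sqrt(tau)))
print(theta)                             # the left-tail probability P{y <= c} under F(y|G)
```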
But for a general nonlinear functional ϕ, (2.6) or (2.7) would not hold. As suggested by Kadane
and Wolfson (1998), elicitation of expert information should include questions about quantiles or
probabilities rather than moments, and we consider asking experts about quantiles (nonlinear ϕ)
and values of distribution functions at fixed points (linear ϕ) in this paper to illustrate our approach
with an example.
In our framework, the expert’s information is in the form of l moment restrictions on θ such that

    Eg(θ) = 0,    (2.8)

where g(·) is a function g : R^k → R^l. From (2.5), we can write (2.8) as

    Eϕ̃(F(y|G)) = 0,

where ϕ̃ = g ∘ ϕ is an l-dimensional functional. If both g and ϕ are linear, ϕ̃ is a linear functional and Eg(θ) is simplified to

    Eg(θ) = ∫∫ ϕ̃(K(y|ξ)) G(dξ) P(dG) = ∫ ϕ̃(K(y|ξ)) G0(dξ).    (2.9)
This implies that Eg(θ) depends on G0 only, when ϕ̃ is linear and G is from DP(αG0 ).
Let

    D(Q ‖ P) = ∫ log(dQ/dP) dQ

be the Kullback-Leibler information criterion (KLIC, Kullback and Leibler (1951), White (1982)) from a measure Q to P. The goal of our paper is to merge the expert information (2.8) into the DP prior P by finding a new measure Q that is closest to P in the KLIC by solving

    min_{Q∈Q} D(Q ‖ P),    (2.10)

where

    Q = {Q : EQ g(θ) = 0},    (2.11)

and EQ g(θ) = ∫ g(θ) Q(dG) is the expectation of g(θ) with respect to Q. The solution Q∗ to (2.10) is given by the Gibbs canonical density π∗ given by

    π∗ = dQ∗/dP = exp(λ∗′ g(θ)) / EP exp(λ∗′ g(θ)),    (2.12)

where λ∗ is the solution of the unconstrained convex problem

    min_λ EP exp(λ′ g(θ)),    (2.13)

and the minimum KLIC with Q∗ is given by

    D(Q∗ ‖ P) = − log(EP exp(λ∗′ g(θ))).
See Csiszár (1975) and Cover and Thomas (2006) for an information theoretic treatment of this
problem, and Kitamura and Stutzer (1997) for a discussion of the optimization problem in the
context of semiparametric estimation. Intuitively, the solution Q∗ gives the least informative Q ∈ Q
that satisfies (2.11). Geometrically speaking, the KLIC from Q to P in (2.10) is −1-divergence of
Amari (1982) among the class of α-divergence (|α| ≤ 1) from P to Q. In Amari’s terminology, we
get the minimizer Q∗ ∈ Q of −1-divergence by “−1-projection” of P onto the space Q, and the
projection is given by exponential tilting of P (Kitamura and Stutzer (1997)). In this paper, we
develop a practical method to find Q∗ and implement nonparametric Bayesian inference with Q∗
as a prior distribution.
2.2 Semiparametric models and geometric interpretation
In this section, we consider an extension of our approach to a semiparametric model defined with
a set of moment conditions. For simplicity of exposition, let us assume y ∼ G without mixing,
but the discussion below can be easily applied to a mixture model. Suppose we have the moment
conditions

    EG m(y, θ) = 0,
[Figure 1: Geometry of exponential tilting of DP priors for semiparametric models. A random draw G from DP(αG0) is projected onto the space Ḡ that satisfies the moment conditions to get the projection G∗θ. The projection is the −1-projection of Amari (1982), which minimizes Amari’s −1-divergence. This projection process defines a degenerate distribution P∗. Then we incorporate expert information by applying the −1-projection again to get Q∗. The whole process is shown as the solid path P → G → G∗θ → P∗ → Q∗.]
where θ ∈ Θ is a parameter of interest, and EG is the expectation with respect to G. Consider the
space Gθ = {G ∈ G|EG m(y, θ) = 0} of distributions that satisfies the moment condition given θ,
and denote Ḡ = ∪θ∈Θ Gθ as the union of all Gθ . Then for each G ∈ Ḡ, we can define θ = {θ|G ∈ Gθ },
and θ = ∅ for G ∈ (G − Ḡ). We assume that θ is fully identified, i.e. Gθ ∩ Gθ′ = ∅ for θ ≠ θ′.
In our approach, the parameter vector θ is the object on which we elicit expert information. We
first begin with the DP prior DP(αG0 ). A random draw G ∈ G from DP(αG0 ) is projected onto
the sub-space Ḡ by solving

    min_θ min_{G′∈Gθ} D(G′ ‖ G).    (2.14)

Note that the solution G∗θ to the problem min_{G′∈Gθ} D(G′ ‖ G) is given by

    dG∗θ/dG = exp(γ∗′ m(y, θ)) / EG exp(γ∗′ m(y, θ)),

where γ∗ solves min_γ EG exp(γ′ m(y, θ)), and we also have D(G∗θ ‖ G) = − log(EG exp(γ∗′ m(y, θ))). Therefore, we can rewrite (2.14) as max_θ min_γ log(EG exp(γ′ m(y, θ))) or max_θ log(EG exp(γ∗′ m(y, θ))). See Kitamura and Stutzer (1997) and Newey and Smith (2004) for the details of dual convexity of
this optimization. Geometrically, the solution G∗θ is derived by Amari’s −1-projection of G onto Ḡ.
This projection defines a degenerate distribution of P on Ḡ, and we denote it as P∗. Once we get
the prior P ∗ , then we can incorporate expert information by applying Amari’s −1-projection of P ∗
onto Q again.
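For a fixed θ, the inner problem min_{G′∈Gθ} D(G′ ‖ G) has the same exponential-tilting structure as (2.12) and, for a discrete G, amounts to reweighting its atoms. The following sketch illustrates this projection step with an assumed moment function m(y, θ) = y − θ (a mean restriction); it is an illustration under these assumptions, not the paper's full procedure:

```python
# Sketch of the -1-projection of a discrete G onto G_theta = {G' : E_{G'} m(y, theta) = 0},
# i.e. exponential tilting of the atoms of G in the spirit of Kitamura and Stutzer (1997).
# Assumed moment function (illustrative): m(y, theta) = y - theta.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(0.3, 1.0, size=500)        # atoms of a discrete G with equal weights
theta = 0.0                               # the fixed value of theta

def m(y, theta):
    return y - theta

def objective(gamma):                     # EG exp(gamma * m(y, theta)) under the discrete G
    return np.mean(np.exp(gamma[0] * m(y, theta)))

gamma_star = minimize(objective, x0=np.zeros(1), method="BFGS").x
w = np.exp(gamma_star[0] * m(y, theta))
w /= w.sum()                              # tilted weights, proportional to dG*_theta/dG

print("tilted mean:", np.sum(w * y))                          # approximately theta = 0
print("D(G*_theta || G):", -np.log(objective(gamma_star)))    # minimized KLIC
```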
Figure 1 illustrates the geometry of our approach with semiparametric models. The solid path
P → G → G∗θ → P ∗ → Q∗ represents the process of incorporating moment conditions and expert
information starting from the DP prior P. The first three steps (P → G → G∗θ → P ∗ ) of this
path represents incorporation of moment conditions to get the degenerated distribution P ∗ through
−1-projection (G → G∗θ ). The last step (P ∗ → Q∗ ) is incorporating expert information with another
−1-projection. Essentially, this approach uses the information in moment conditions followed by
expert information. We may call this method ET-ETD-DP (exponential tilting of exponentially
tilted draws of Dirichlet process).
Alternatively, we can use the method of Kitamura and Otsu (2011) for the first three steps
(P → G → G∗θ → P ∗ ). They start with a marginal prior distribution p(θ) on θ. Given θ, they
derive the conditional distribution p(G|θ) on Gθ by −1-projection of G ∼ DP(αG0 ) onto Gθ . This
approach would be useful when an econometrician has a marginal distribution of θ at hand. Once
we apply their method to get P ∗ , expert information can be incorporated (P ∗ → Q∗ ) as usual.
If Ḡ = G, it is not necessary to project G onto Ḡ. We can use the simple method P → Q0 → G
shown as the dashed path in Figure 1. This method applies expert information first (P → Q0 )
on the DP prior, and then draws G from Q0. This method would not have to assume a marginal
distribution p(θ) of θ, and does not require the projection G → G∗θ of Kitamura and Otsu (2011).
We call this method ETDP (exponentially tilted Dirichlet process). Of course, we still can use
the method of Kitamura and Otsu (2011) by assuming p(θ) and use Q0 in place of P to get the
conditional distribution p(G|θ). But in this case, expert information is not fully incorporated in
the distribution of θ because of the marginal distribution p(θ) used in their method. Essentially,
this method uses expert information first to replace P with Q0 before implementing Kitamura and
Otsu (2011). We may call this method ETD-ETDP (exponentially tilted draws of exponentially
tilted Dirichlet process).
In the following sections, we will focus on nonparametric models only, because the discussion
above regarding semiparametric models is conceptually straightforward and does not give additional
insights in presenting the main idea of this paper. We plan to consider semiparametric models in
a follow-up paper in detail.
3 Exponential tilting of Dirichlet process prior

3.1 Estimation of the Gibbs measure
We find Q∗ by solving λ∗ from the sample version of (2.13) using simulated θ. To simulate θ, we
first generate G from DP(αG0 ) using the stick-breaking process of Sethuraman (1994). A random
measure G from the DP prior is discrete almost surely, and can be written in a countable sum,
    G = Σ_{j=1}^∞ wj δξj,    (3.1)
where wj are random weights independent of ξj , and δξ is the distribution concentrated at a random
point ξ. An important consequence of the almost sure discreteness of G is that the samples from G
show clustering. If α is small, the clustering becomes more severe, and there will be fewer distinct
clusters. The stick-breaking process defines the DP through (3.1) by
    ξj ∼ iid G0,    (3.2)

    w1 = V1, and wj = Vj Π_{k=1}^{j−1} (1 − Vk), for j = 2, 3, . . . ,    (3.3)

where Vk ∼ iid Beta(1, α). Intuitively, the stick-breaking scheme starts with a stick with length 1, and keeps cutting the fraction Vj of the remaining stick of length Π_{k=1}^{j−1} (1 − Vk). Then wj are the lengths of the pieces cut.
In fact, the discrete measure in (3.1) gives more general classes of random measures than DP
priors. By generalizing the probability distribution of Vk , the stick-breaking method can generate
other classes of priors such as the two parameter Beta process of Ishwaran and Zarepour (2000) and
the two-parameter Poisson-Dirichlet process (Pitman-Yor process) of Pitman and Yor (1997). The
priors from the generalized stick-breaking schemes applied to (3.1) are called the stick-breaking
priors, which include DP priors as a simple special case. For posterior sampling with the stick-breaking priors, Ishwaran and Zarepour (2000) and Ishwaran and James (2001) develop generalized
Pólya urn Gibbs samplers and blocked Gibbs samplers. We consider DP priors only in this paper,
but the main idea can be easily extended to other stick-breaking priors.
Once we have G, we can calculate θ from (2.5). Since G defined in (3.1) is an infinite sum, we
would have to approximate G by the truncated stick-breaking method. The method is given by a
finite sum

    GN̄ = Σ_{j=1}^{N̄} wj δξj,    (3.4)

where wj and ξj are from (3.2) and (3.3) except that VN̄ = 1. Let PN̄ be the distribution of (3.4) generated by the truncated stick-breaking method. Ishwaran and Zarepour (2002) show that, as N̄ → ∞,

    ∫∫ f(x) G(dx) PN̄(dG) → ∫∫ f(x) G(dx) P(dG),
for any bounded continuous real function f. Ishwaran and Zarepour (2000) show that the approximation error of using PN̄ can be substantial when α is large, and N̄ should increase as α increases. When the mixing kernel K(·|ξ) is a normal distribution, Theorem 1 of Ishwaran and James (2002) shows that the distance ∫ |fPN̄(y) − fP(y)| dy between the marginal densities fPN̄(y) of y from PN̄ and fP(y) from P is proportional to exp(−(N̄ − 1)/α) for large N̄. This implies that
N̄ should increase proportionally as α increases to get the same level of asymptotic approximation.
Considering their results, we choose N̄ = 20 × α (or N̄ = 50 if 20 × α < 50) to make the approximation
error sufficiently small for all examples in this paper.
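A draw of GN̄ via the truncated stick-breaking construction (3.2)-(3.4) is straightforward to simulate. A minimal sketch, with the base measure G0 left as a user-supplied sampler (the standard normal base below is only an illustrative stand-in):

```python
# Sketch of one truncated stick-breaking draw G_Nbar = sum_{j=1}^{Nbar} w_j * delta_{xi_j}
# approximating a draw from DP(alpha*G0), following (3.2)-(3.4) with V_Nbar = 1.
import numpy as np

def draw_G(alpha, N_bar, base_sampler, rng):
    V = rng.beta(1.0, alpha, size=N_bar)
    V[-1] = 1.0                                   # V_Nbar = 1, so the truncated weights sum to one
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))   # w_j = V_j * prod_{k<j} (1 - V_k)
    xi = base_sampler(N_bar, rng)                 # xi_j ~ iid G0
    return w, xi

rng = np.random.default_rng(0)
# N_bar = 20*alpha as suggested above; the base sampler here is an illustrative N(0,1) stand-in.
w, xi = draw_G(alpha=10.0, N_bar=200, base_sampler=lambda n, r: r.normal(0.0, 1.0, size=n), rng=rng)
print(w.sum())                                    # 1.0
```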
Let {θ^N̄_m}_{m=1}^M, where θ^N̄_m = Σ_{j=1}^{N̄} wmj ϕ(K(y|ξmj)), be M random samples of θ calculated from GN̄ in (3.4). Consider a discrete probability measure {π^N̄_m}_{m=1}^M on the points {θ^N̄_m} such that 0 ≤ π^N̄_m ≤ 1 and Σ_{m=1}^M π^N̄_m = 1. Although {π^N̄_m} and {θ^N̄_m} depend on N̄, we drop the superscript
N̄ for notational simplicity. Then we solve the maximum entropy problem

    max_{(π1,...,πM)} Σ_{m=1}^M πm log(1/πm)    (3.5)

such that

    Σ_{m=1}^M πm = 1  and  Σ_{m=1}^M πm g(θm) = 0.

The solution {π̂m} to (3.5) is given by

    π̂m = exp(λ̂′M g(θm)) / Σ_{m=1}^M exp(λ̂′M g(θm)),

where λ̂M is an (l × 1) Lagrange multiplier vector of the constraints, and the multiplier can be obtained from

    λ̂M = argmin_{λ∈Λ} M^{−1} Σ_{m=1}^M exp(λ′ g(θm)),    (3.6)

where Λ is a compact subset of R^l. The solution π̂m is the exponential tilting estimator of Kitamura and Stutzer (1997). Exponential tilting estimators are a special case of the generalized empirical likelihood estimators of Newey and Smith (2004). Let L^N̄_M(λ) = M^{−1} Σ_{m=1}^M exp(λ′ g(θm)) be the estimating function from (3.6), and L(λ) = EP exp(λ′ g(θ)) be the objective function from (2.13). Noting that λ̂M = argmin_{λ∈Λ} L^N̄_M(λ) and λ∗ = argmin_λ L(λ), we prove that λ̂M → λ∗ almost surely as M, N̄ → ∞.
Assumption 3.1. The solution λ∗ = argmin_λ EP exp(λ′ g(θ)) of (2.13) is unique in Λ.

Assumption 3.2. L^N̄_M(λ) → L(λ) almost surely and uniformly as M, N̄ → ∞.

Assumption 3.3. L(λ) is continuous on Λ.

Theorem 3.4. Let π̂M be the empirical distribution of {π̂m}, and π∗ be the change of measure in (2.12). If Assumptions 3.1-3.3 hold, as M, N̄ → ∞, we have λ̂M → λ∗ almost surely and

    π̂M ⇒ π∗.

Proof. We use the argument in Theorem 4.2.1 of Bierens (1994). We first show that L(λ̂M) → L(λ∗) almost surely. We have

    0 ≤ L(λ̂M) − L(λ∗) = L(λ̂M) − L^N̄_M(λ̂M) + L^N̄_M(λ̂M) − L(λ∗)
                       ≤ L(λ̂M) − L^N̄_M(λ̂M) + L^N̄_M(λ∗) − L(λ∗)
                       ≤ 2 sup_{λ∈Λ} |L^N̄_M(λ) − L(λ)| → 0 almost surely,    (3.7)

as M, N̄ → ∞, from L^N̄_M(λ̂M) ≤ L^N̄_M(λ∗) and the almost sure uniform convergence of L^N̄_M(λ) to L(λ). Since L(λ) is continuous and λ∗ is unique, we have λ̂M → λ∗ almost surely from (3.7).
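In practice, (3.6) is a smooth convex minimization in λ and can be solved with a standard optimizer. A minimal sketch of this step; the simulated θm below are an illustrative stand-in for draws obtained from (3.4), and the moment function mirrors the indicator-type conditions used in Section 4:

```python
# Sketch of the exponential tilting step (3.5)-(3.6): given M simulated prior draws theta_m and a
# moment function g, estimate lambda_hat_M and form the tilted weights pi_hat_m.
# The theta_m here are an illustrative stand-in, not draws from the paper's DP prior.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
theta_m = rng.beta(0.5, 30.0, size=100_000)      # stand-in prior draws of a small tail probability

def g(theta):
    # Indicator-type expert conditions as in Section 4 (targets are illustrative here):
    # P{theta <= 0.01} = 0.5 and P{theta <= 0.005} = 0.25.
    return np.column_stack([(theta <= 0.01) - 0.5, (theta <= 0.005) - 0.25])

g_mat = g(theta_m)                               # M x l matrix with rows g(theta_m)

def L_M(lam):                                    # M^{-1} sum_m exp(lambda' g(theta_m)), see (3.6)
    return np.mean(np.exp(g_mat @ lam))

lam_hat = minimize(L_M, x0=np.zeros(g_mat.shape[1]), method="BFGS").x
pi_hat = np.exp(g_mat @ lam_hat)
pi_hat /= pi_hat.sum()                           # exponential tilting weights pi_hat_m

# The tilted measure satisfies the expert moment conditions: both entries should be near zero.
print("lambda_hat:", lam_hat, " tilted moments:", pi_hat @ g_mat)
```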
3.2 Implementation with a Metropolis-Hastings algorithm
Let y = (y1, . . . , yn) be data and

    Ψ(dG|y) ∝ Π_{i=1}^n f(yi|G) P(dG)

be the density Ψ of the posterior distribution of G in the DPM model. With expert information, the posterior becomes

    Ψ∗(dG|y) ∝ Π_{i=1}^n f(yi|G) π∗(θ) P(dG) = Π_{i=1}^n f(yi|G) Q∗(dG).
We use an MCMC method to get samples from the posterior distribution Ψ∗ . We implement our
approach with the independence chain Metropolis-Hastings (M-H) algorithm (Chib and Greenberg
(1994), Chib and Greenberg (1995)). Considering the relationship
    Ψ∗(dG|y) / Ψ(dG|y) = π∗,

we use the posterior distribution Ψ with the original DP prior P as the proposal distribution of the independence chain M-H algorithm. Then the acceptance of the t-th sample G(t) in the MCMC chain depends on the quantity

    π∗(θ(t)) / π∗(θ(t−1)),    (3.8)

which is approximated by

    exp(λ̂′M g(θ(t))) / exp(λ̂′M g(θ(t−1))),

with the estimated λ̂M from the previous section. The t-th MCMC sample G(t) is accepted with probability

    min{ 1, exp(λ̂′M g(θ(t))) / exp(λ̂′M g(θ(t−1))) },

or G(t) = G(t−1) if rejected.
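The accept/reject decision therefore involves only the tilting factor. A minimal sketch of this step; λ̂M, g, and the two θ values below are placeholders rather than outputs of an actual chain:

```python
# Sketch of the independence-chain M-H accept/reject decision in (3.8): the proposal is a draw
# from the ordinary DPM posterior, and acceptance depends only on exp(lambda_hat' g(theta)).
import numpy as np

def accept_proposal(theta_prop, theta_prev, lam_hat, g, rng):
    """Return True if the proposed draw with functional value theta_prop is accepted."""
    log_ratio = lam_hat @ g(theta_prop) - lam_hat @ g(theta_prev)
    return np.log(rng.uniform()) < min(0.0, log_ratio)   # accept with probability min(1, ratio)

# Illustrative use with the indicator-type g of Section 4 and placeholder values.
g = lambda th: np.array([(th <= 0.01) - 0.5, (th <= 0.005) - 0.25])
rng = np.random.default_rng(0)
print(accept_proposal(theta_prop=0.004, theta_prev=0.02,
                      lam_hat=np.array([1.0, 1.3]), g=g, rng=rng))
```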
For sampling from the proposal distribution Ψ, we use the blocked Gibbs algorithm of Ishwaran
and James (2001). The implementation of our approach combines the blocked Gibbs algorithm
and the independence chain M-H algorithm above. The posterior simulation cycles through the following steps. Let ξ^N̄ = {ξj}_{j=1}^{N̄} and w^N̄ = {wj}_{j=1}^{N̄} from GN̄ in (3.4). Let ξ∗ = {ξi∗}_{i=1}^n be n i.i.d. samples of ξi∗ ∼ GN̄ which are used to draw yi|ξi∗ ∼ Kξi∗. Since GN̄ is discrete, the ξi∗ are drawn from the elements of ξ^N̄. Therefore, some ξi∗ may be the same and some elements of ξ^N̄ may not be drawn at all. For each j = 1, . . . , N̄, define the set Ij = {i : ξi∗ = ξj} of the indices of the ξi∗ that are equal to ξj. When ξj was not drawn at all, then Ij = ∅.

1. Updating ξ^N̄ given w^N̄, ξ∗, and y: For j = 1, 2, . . . , N̄, simulate ξj ∼ G0 for all j with Ij = ∅,
and draw ξj for Ij ≠ ∅ from the density

    p(ξj | ξ∗, w^N̄, y) ∝ G0(dξj) Π_{i∈Ij} k(yi|ξj),

where k is the density of the kernel function K given in (2.3) and (2.4).
2. Updating ξ∗ given ξ^N̄, w^N̄, and y: Draw (ξi∗ | ξ^N̄, w^N̄, y) independently from

    ξi∗ | (ξ^N̄, w^N̄, y) ∼ Σ_{j=1}^{N̄} wj δξj.
3. Updating w^N̄ given ξ∗, ξ^N̄, and y: Draw w1 = V1, and

    wj = Vj Π_{k=1}^{j−1} (1 − Vk), for j = 2, 3, . . . , N̄ − 1,

where VN̄ = 1, Vk ∼ Beta(1 + Card(Ik), α + Σ_{j=k+1}^{N̄} Card(Ij)) for k = 1, . . . , N̄ − 1, and Card(Ik) represents the cardinality of the set Ik.
4. Calculate the t-th MCMC sample g(θ(t)) from (2.5), replacing G with GN̄ defined from (ξ^N̄, w^N̄). The t-th sample (ξ∗, ξ^N̄, w^N̄) is accepted with probability

    min{ 1, exp(λ̂′M g(θ(t))) / exp(λ̂′M g(θ(t−1))) },

or replaced by the previous sample.
In Step 1, the updating of ξ N̄ is simple if the base measure G0 and the kernel function K are
conjugate. For this reason, we use normal-gamma conjugate distributions for our examples and
empirical applications. Steps 1-3 are the original blocked Gibbs sampler, and our approach simply
adds Step 4 for exponential tilting.
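With the normal kernel and the normal-gamma base measure used below, the conditional density in Step 1 is again normal-gamma and can be sampled exactly. A sketch of that conjugate draw; the standard normal-gamma posterior updating formulas are assumed here rather than taken from the paper, and the inputs are illustrative:

```python
# Sketch of the conjugate draw in Step 1 for a normal kernel with a normal-gamma base measure
# G0 = NG(mu0, n0, nu0, s0sq): mu|tau ~ N(mu0, 1/(n0*tau)), tau ~ Gamma(nu0/2, rate = nu0*s0sq/2).
# Standard conjugate updating formulas are assumed; for I_j empty, xi_j is drawn from G0 directly.
import numpy as np

def draw_xi_given_cluster(y_j, mu0, n0, nu0, s0sq, rng):
    """Draw xi_j = (mu_j, tau_j) given the (non-empty) observations y_j currently assigned to atom j."""
    m = len(y_j)
    ybar = np.mean(y_j)
    S = np.sum((y_j - ybar) ** 2)
    n_n = n0 + m
    mu_n = (n0 * mu0 + m * ybar) / n_n
    nu_n = nu0 + m
    nu_s_n = nu0 * s0sq + S + n0 * m * (ybar - mu0) ** 2 / n_n   # posterior nu_n * s_n^2
    tau = rng.gamma(shape=nu_n / 2.0, scale=2.0 / nu_s_n)        # rate = nu_s_n / 2
    mu = rng.normal(mu_n, 1.0 / np.sqrt(n_n * tau))
    return mu, tau

rng = np.random.default_rng(0)
print(draw_xi_given_cluster(np.array([9.5, 11.2, 8.7]), mu0=15.0, n0=1.0, nu0=6.0, s0sq=20.0, rng=rng))
```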
4 Dirichlet process mixture of normal distributions
To demonstrate our theory, we consider the DPM model with the family of normal distributions as
a kernel function. A normal mixture model is very flexible and widely used because a location-scale mixture of normal distributions can approximate any distribution on the real line (Lo (1984),
Escobar and West (1995)) or on Rd (Müller, Erkanli, and West (1996)). We define the kernel
function
K(y|ξ) = Φ(y|µ, 1/τ ),
where ξ = (µ, τ ) is defined on Ξ = R × R+ , and Φ(·|µ, 1/τ ) is a normal distribution function
with mean µ and variance 1/τ , where τ is a precision parameter. The prior DP(αG0 ) on G is
given by a concentration parameter α and a base measure G0 on (µ, τ ) ∈ R × R+ distributed as a
normal-gamma distribution G0 = NG(µ0 , n0 , ν0 , σ02 ) in which
µ|τ ∼ N(µ0 , (n0 τ )−1 ),
and
τ ∼ Γ(ν0 /2, ν0 σ02 /2),
where Γ(a, b) is a gamma distribution with a shape parameter a and a rate parameter b. Therefore,
DP(αG0 ) is defined with 5 parameters (α, µ0 , n0 , ν0 , σ02 ). The normal-gamma distribution is a
convenient choice for the DPM of normal distributions because of its conjugacy with the normal kernel function. Also, it is easy to generalize this example to consider hyper-parameters by randomizing the 5 parameters. A popular choice is to put a gamma hyper-prior on α, n0, σ0^−2, and a
normal hyper-prior on µ0 . Note that α represents the concentration of G around G0 , n0 is related
to the concentration of µ around µ0 of G0 , and ν0 is for the concentration of τ in G0 . Therefore,
(α, n0 , ν0 ) jointly determines the effective sample size of the prior distribution, and smaller values
of these parameters imply more diffuse priors.
In our numerical example, we assume a mixture of two normal distributions as the true distribution. We set n = 20 and generate i.i.d. data {yi}_{i=1}^{20} from the mixture of N(10, 5²) and N(20, 5²) with equal weights. Specifically, y1, . . . , y20 are calculated from

    yi = Wi Y1i + (1 − Wi) Y2i,    (4.1)

with Wi = 1 or 0 with an equal probability and Y1i ∼ N(10, 5²) and Y2i ∼ N(20, 5²).

[Figure 2: The left panel shows the prior distribution of θ from DP(αG0) (Prior (DP)), the posterior distribution of θ with DP(αG0) (DP), and the posterior distribution of θ with expert information (Expert). The right panel shows the cumulative distribution functions of θ.]

We set the DP prior DP(αG0) with α = 10, µ0 = 15, n0 = 1, ν0 = 6, σ0² = 20. Based on the discussion in the
previous section, we use N̄ = 20 × α = 200 for GN̄ to approximate G. The parameter of interest θ
is the left-tail probability θ = P{yi ≤ 0} given G, which would represent the probability of default
if yi is an equity value. For estimation of λ̂M and π̂M, we first set M = 1,000,000 and simulate {θ^N̄_m}_{m=1}^M by using the truncated stick-breaking process in (3.4). The dashed line “Prior (DP)” in the first panel in Figure 2 shows the prior density of θ from DP(αG0).
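Putting the pieces together for this example, the prior draws of θ can be simulated as in the following sketch, which combines the stick-breaking and normal-gamma samplers sketched earlier with the hyper-parameters just stated (M is reduced here purely for illustration):

```python
# Sketch of simulating prior draws {theta_m} of theta = P{y <= 0 | G} under DP(alpha*G0) with the
# normal kernel and the base G0 = NG(mu0=15, n0=1, nu0=6, s0sq=20), alpha = 10, Nbar = 200.
# M is kept small here; the paper uses M = 1,000,000.
import numpy as np
from scipy.stats import norm

alpha, N_bar, M = 10.0, 200, 10_000
mu0, n0, nu0, s0sq = 15.0, 1.0, 6.0, 20.0
rng = np.random.default_rng(0)

theta_m = np.empty(M)
for m in range(M):
    # truncated stick-breaking weights, (3.2)-(3.4)
    V = rng.beta(1.0, alpha, size=N_bar)
    V[-1] = 1.0
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    # atoms xi_j = (mu_j, tau_j) drawn i.i.d. from the normal-gamma base measure G0
    tau = rng.gamma(shape=nu0 / 2.0, scale=2.0 / (nu0 * s0sq), size=N_bar)
    mu = rng.normal(mu0, 1.0 / np.sqrt(n0 * tau))
    # theta = sum_j w_j * Phi(0 | mu_j, 1/tau_j): the left-tail probability of the mixture
    theta_m[m] = np.sum(w * norm.cdf(0.0, loc=mu, scale=1.0 / np.sqrt(tau)))

print(np.quantile(theta_m, [0.25, 0.50, 0.75]))   # a summary of the prior distribution of theta
```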
[Figure 3: Predictive density ∫ φ(yi|G) Ψ∗(dG|y) and cumulative distribution ∫ Φ(yi|G) Ψ∗(dG|y) of yi, where Ψ∗(dG|y) is the posterior distribution of G, with expert information (Expert) and without expert information (DP). The vertical lines on the left panel represent the values of the n = 20 samples, and the step function (Data) on the right panel is the CDF of the samples.]

We assume that the expert information is P{θ ≤ 0.01} = 50% and P{θ ≤ 0.005} = 25%. This means g(θ) = (I{θ ≤ 0.01} − 0.5, I{θ ≤ 0.005} − 0.25)′ for the moment condition Eg(θ) = 0. To implement the exponential tilting, we find the Lagrange multiplier used in the maximum entropy problem in (3.6) to get λ̂′M = (1.010, 1.285) from the simulated {θ^N̄_m}_{m=1}^M. The positive values of λ̂M imply applying higher weights on the events I{θ ≤ 0.01} and I{θ ≤ 0.005} in the moment
conditions. Therefore, the exponential tilting of DP(αG0 ) would tilt the prior distribution of θ
toward zero, which indicates that the expert information emphasizes that θ should be smaller than what
DP(αG0 ) would generate. Consequently, the prior that complies with the expert information favors
distributions with smaller left-tail probabilities than DP(αG0 ).
Given the simulated data from (4.1), and with the expert information, we draw samples from the
posterior distribution. For the MCMC procedure, we use 101,000 MCMC iterations, discard the first 1,000 iterations for burn-in, and use every 50th state for thinning after the burn-in period. On
the left panel in Figure 2, the line labeled “Expert” represents the posterior density with the expert
information and the line with “DP” is the posterior density from the ordinary DPM model without
the expert information. The right panel shows the distribution functions of the distributions on the
left panel. We can see that the posterior distribution of θ is tilted toward zero.
Figure 3 shows the predictive density ∫ φ(yi|G) Ψ∗(dG|y) of yi, and the right panel shows the predictive distribution function of yi, with expert information (Expert) and without expert information (DP). The vertical lines on the left panel represent the values of the n = 20 samples and the curve labeled Data is the density of the samples. The step function (Data) on the right panel is
the CDF of the samples. The predictive density with the expert information (Expert) has a slightly thinner left tail than the predictive density without the expert information (DP). This coincides
with our intuition that the positive λ̂M implies the prior with the expert information favors the
distributions with small left-tail probabilities.
An interesting quantity in an analysis of a rare event is the number of occurrences of the rare event given a sample size. For the probability of default θ, the probability that we observe k defaults out of n samples is given by the binomial distribution

    H_{k,n} = C(n, k) θ^k (1 − θ)^{n−k},    (4.2)

where C(n, k) is the binomial coefficient. We can study the posterior distribution of H_{k,n} given n and k. For the probability H_{1,20} of observing one default out of 20 samples, in Figure 4, we calculate the density (left panel) and the distribution functions (right panel) of H_{1,20} with and without the expert information. Since the expert information favors the distributions with small left-tail probabilities, we can see that the distribution of the probability H_{1,20} with the expert information (Expert) is tilted toward zero compared to the distribution without the expert information (DP).

[Figure 4: Posterior distribution of the probability H_{1,20} of observing one default event (yi < 0) out of 20 samples, with expert information (Expert) and without expert information (DP).]
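Given posterior draws of θ, the induced posterior distribution of H_{k,n} follows directly from (4.2). A short sketch (the θ draws below are placeholders for actual posterior output):

```python
# Sketch of mapping posterior draws of theta into draws of H_{k,n} from (4.2),
# here H_{1,20}, the probability of exactly one default out of 20. The theta draws are placeholders.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
theta_draws = rng.beta(1.0, 60.0, size=5_000)     # stand-in for posterior draws of theta

H_1_20 = binom.pmf(1, 20, theta_draws)            # H_{k,n} = C(n,k) * theta^k * (1-theta)^(n-k)
print(np.mean(H_1_20), np.quantile(H_1_20, [0.05, 0.95]))
```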
5 Application to tail probabilities of stock returns

Company                         Return (%)    Company                          Return (%)
Aetna Inc                            −12.1    Raytheon Co                            13.5
Archer-Daniels-Midland Co             −2.1    Republic Services Inc                 −14.2
Chubb Corp/The                        16.3    Staples Inc                           −17.4
Humana Inc                            −3.8    Stryker Corp                           −6.1
Kohl’s Corp                           −9.0    Symantec Corp                         −25.9
Marsh & McLennan Cos Inc               3.3    Thermo Fisher Scientific Inc          −19.4
Mylan Inc/PA                         −13.4    WellPoint Inc                         −19.0
Northrop Grumman Corp                 −8.0    Zimmer Holdings Inc                     1.8

Table 1: Annual returns ((p1 − p0)/p0) of 16 companies in a homogeneous portfolio from Jun 30, 2011 to Jun 29, 2012. Max = 16.3%, Min = −25.9%, Mean = −7.21%, Median = −8.53%, Standard deviation = 11.75%.
We consider inference of tail probabilities for annual stock returns of a homogeneous portfolio of
high grade companies. We build the homogeneous portfolio from the companies included in the S&P
500 index. Among the 500 companies, we select the companies with the following characteristics.
Market beta: 0.8 ∼ 1, Price to book value per share: 1 ∼ 4, and Market Capitalization: $10 ∼
30MM. Using data on 6/30/2011 from Bloomberg, we find that there are 16 companies that belong
to all of these categories. Their returns are calculated by (p1 − p0)/p0, where p0 is the stock price on 6/30/2011 and p1 is the stock price on 6/29/2012. The maximum return is 16.3%, the minimum
is −25.9%, the mean is −7.21%, the median is −8.53%, and the standard deviation of the returns
is 11.75%. All observations are assumed to be i.i.d. given a mixing measure G. We could assume
correlation in the data, but it does not add much insight for illustration of our approach. Also, the homogeneity of the portfolio would reduce concerns about misspecification such as a fixed effect of
an individual asset.
Our inference is focused on the rare event that the annual return falls below −30%. Let θ =
P{yi < −30%} be the probability of the rare event given G. We assume that a hypothetical expert has the following information: P{θ < 0.02} = 50% and P{θ < 0.01} = 25%. As before, we use
the DPM of normal distributions. The DP prior DP(αG0 ) used for this empirical application is
from α = 10, and the normal-gamma base measure G0 = NG(µ0 , n0 , ν0 , σ02 ) with µ0 = 0, n0 = 1,
ν0 = 6, σ0² = 100. We also set N̄ = 20 × α = 200 for GN̄, and M = 1,000,000 for estimation of λ̂M as in the simulation study of the previous section. We find λ̂′M = (0.535, 1.315) from (3.6).

[Figure 5: The left panel shows the posterior distribution of θ with DP(αG0) (DP), and the posterior distribution of θ with expert information (Expert). The right panel shows the posterior cumulative distribution functions of θ.]
For the simulation of the posterior distribution, we use 101,000 MCMC iterations, 1,000 iterations for burn-in, and every 50th state for thinning after the burn-in as before. As shown in Figure 5, the posterior density and distribution functions of θ with the expert information (Expert) show tilting toward zero compared to the distributions without the expert information (DP). This tells
us that the left-tail probabilities are smaller than those implied by the DP prior in the expert’s
opinion.
The predictive density and distribution function of yi shown in Figure 6 also show results that support the argument above. The predictive distribution with the expert information (Expert) has a slightly thinner left tail than the one without the expert information (DP). Finally, in Figure 7, we show the posterior distribution of the probability H_{1,16}, as defined in (4.2), of observing one asset with the annual return below −30% out of 16 assets. From the results, small values of H_{1,16} are predicted to be more likely with the expert information (Expert) than without the expert information (DP).
[Figure 6: Predictive density ∫ φ(yi|G) Ψ∗(dG|y) and cumulative distribution ∫ Φ(yi|G) Ψ∗(dG|y) of yi, where Ψ∗(dG|y) is the posterior distribution of G.]
[Figure 7: Posterior distributions of the probability of observing one asset with the return below −30% out of 16 assets (H_{1,16}), with expert information (Expert) and without expert information (DP).]
6 Conclusion
Our method deals with the problem of misspecification and the scarcity of data information for inference of rare events by using nonparametric models and incorporating expert information. Elicitation
of expert information can be easily done with carefully designed interviews, and the implementation
of our method can be done with the straightforward M-H method we propose. Therefore, when
historical data are scarce and it is hard to find a reasonable parametric model, our approach could
be an attractive choice.
New nonparametric Bayesian methods are developing rapidly. There is a growing interest in
considering more general priors than the Dirichlet process such as the general stick-breaking processes. Doss (1994) considers a mixture of Dirichlet processes, and Teh, Jordan, Beal, and Blei
(2006) provide an extension to hierarchical Dirichlet processes. We may generalize our method to
hierarchical Dirichlet processes. Also, there are new MCMC methods for the DPM model such as
the retrospective MCMC sampling of Papaspiliopoulos and Roberts (2008), and the slice sampling
of Walker and Mallick (1997) which is generalized in Kalli, Griffin, and Walker (2011) to general
stick-breaking processes. In a future study, we can consider these methods to improve the MCMC
method presented in this paper and compare the performances of the alternative MCMC schemes.
We also believe that our method can be applied to semiparametric models naturally. Semiparametric models defined by a set of moment conditions are popular in economics and finance. The
moment conditions provide information which reduces the dimension of model space. The resulting
method would merge expert information, moment information, and econometrician’s priors with
data information all together. Another interesting topic is to consider covariates in our approach.
For example, we can consider applying our method to Bayesian density regressions in which
DP priors can depend on covariates.
References
Amari, S. I. (1982): “Differential Geometry of Curved Exponential Families - Curvatures and
Information Loss,” Annals of Statistics, 10(2), 357–385.
Antoniak, C. (1974): “Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems,” Annals of Statistics, 2(6), 1152–1174.
Bierens, H. (1994): Topics in Advanced Econometrics: Estimation, Testing, and Specification of
Cross-Section and Time Series Models. Cambridge University Press.
Blackwell, D., and J. MacQueen (1973): “Ferguson distributions via Pólya urn schemes,”
Annals of Statistics, 1(2), 353–355.
Chaloner, K., and G. Duncan (1983): “Assessment of a beta prior distribution: PM elicitation,”
The Statistician, 32(1/2), 174–180.
Chib, S., and E. Greenberg (1994): “Bayes inference in regression models with ARMA (p, q)
errors,” Journal of Econometrics, 64(1-2), 183–206.
(1995): “Understanding the Metropolis-Hastings algorithm,” The American Statistician,
49(4), 327–335.
Cover, T., and J. Thomas (2006): Elements of Information Theory. John Wiley & Sons, 2nd
edn.
Csiszár, I. (1975): “I-divergence geometry of probability distributions and minimization problems,” Annals of Probability, 3, 146–158.
Doss, H. (1994): “Bayesian nonparametric estimation for incomplete data via successive substitution sampling,” Annals of Statistics, 22(4), 1763–1786.
Escobar, M. (1994): “Estimating normal means with a Dirichlet process prior,” Journal of the
American Statistical Association, 89(425), 268–277.
Escobar, M., and M. West (1995): “Bayesian density estimation and inference using mixtures,”
Journal of the American Statistical Association, 90(430), 577–588.
Ferguson, T. (1973): “A Bayesian analysis of some nonparametric problems,” Annals of Statistics,
1(2), 209–230.
(1974): “Prior distributions on spaces of probability measures,” Annals of Statistics, 2(4),
615–629.
Freedman, D. (1963): “On the asymptotic behavior of Bayes’ estimates in the discrete case,”
Annals of Mathematical Statistics, 34(4), 1386–1403.
Garthwaite, P., J. Kadane, and A. O’Hagan (2005): “Statistical methods for eliciting probability distributions,” Journal of the American Statistical Association, 100(470), 680–701.
Gelfand, A., and A. Kottas (2002): “A computational approach for full nonparametric Bayesian
inference under Dirichlet process mixture models,” Journal of Computational and Graphical
Statistics, 11(2), 289–305.
Gelfand, A., and S. Mukhopadhyay (1995): “On nonparametric Bayesian inference for the
distribution of a random sample,” Canadian Journal of Statistics, 23(4), 411–420.
Ghosh, J., and R. Ramamoorthi (2003): Bayesian nonparametrics. Springer Verlag.
Griffin, J., and M. Steel (2011): “Stick-breaking autoregressive processes,” Journal of Econometrics, 162(2), 383–396.
Hirano, K. (2002): “Semiparametric Bayesian inference in autoregressive panel data models,”
Econometrica, 70(2), 781–799.
Hjort, N. L., C. Holmes, P. Müller, and S. G. Walker (eds.) (2010): Bayesian Nonparametrics. Cambridge University Press.
Ishwaran, H., and L. James (2001): “Gibbs sampling methods for stick-breaking priors,” Journal
of the American Statistical Association, 96(453), 161–173.
(2002): “Approximate Dirichlet Process computing in finite normal mixtures,” Journal of
Computational and Graphical Statistics, 11(3), 508–532.
Ishwaran, H., and M. Zarepour (2000): “Markov chain Monte Carlo in approximate Dirichlet
and beta two-parameter process hierarchical models,” Biometrika, 87(2), 371–390.
(2002): “Exact and approximate sum representations for the Dirichlet process,” Canadian
Journal of Statistics, 30(2), 269–283.
Jensen, M., and J. Maheu (2010): “Bayesian semiparametric stochastic volatility modeling,”
Journal of Econometrics, 157(2), 306–316.
Kadane, J., N. Chan, and L. Wolfson (1996): “Priors for unit root models,” Journal of
Econometrics, 75(1), 99–111.
Kadane, J., and L. Wolfson (1998): “Experiences in elicitation,” The Statistician, 47(1), 3–19.
Kalli, M., J. Griffin, and S. Walker (2011): “Slice sampling mixture models,” Statistics and
computing, 21(1), 93–105.
Kiefer, N. (2009): “Default estimation for low-default portfolios,” Journal of Empirical Finance,
16(1), 164–173.
(2010): “Default estimation and expert information,” Journal of Business and Economic
Statistics, 28(2), 320–328.
Kitamura, Y., and T. Otsu (2011): “Bayesian analysis of moment condition models using nonparametric priors,” Working Paper, Yale University.
Kitamura, Y., and M. Stutzer (1997): “An information-theoretic alternative to generalized
method of moments estimation,” Econometrica, 65(4), 861–874.
Kullback, S., and R. Leibler (1951): “On information and sufficiency,” Annals of Mathematical
Statistics, 22(1), 79–86.
Lo, A. (1984): “On a class of Bayesian nonparametric estimates: I. Density estimates,” Annals of
Statistics, 12(1), 351–357.
MacEachern, S., and P. Müller (1998): “Estimating mixture of Dirichlet process models,”
Journal of Computational and Graphical Statistics, 7(2), 223–238.
Müller, P., A. Erkanli, and M. West (1996): “Bayesian curve fitting using multivariate normal
mixtures,” Biometrika, 83(1), 67–79.
Neal, R. (2000): “Markov chain sampling methods for Dirichlet process mixture models,” Journal
of Computational and Graphical Statistics, 9(2), 249–265.
Newey, W. K., and R. J. Smith (2004): “Higher order properties of GMM and generalized
empirical likelihood estimators,” Econometrica, 72(1), 219–255.
Papaspiliopoulos, O., and G. Roberts (2008): “Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models,” Biometrika, 95(1), 169–186.
Pitman, J., and M. Yor (1997): “The two-parameter Poisson-Dirichlet distribution derived from
a stable subordinator,” Annals of Probability, 25(2), 855–900.
Sethuraman, J. (1994): “A constructive definition of Dirichlet priors,” Statistica Sinica, 4, 639–
650.
Taddy, M., and A. Kottas (2010): “A Bayesian nonparametric approach to inference for quantile
regression,” Journal of Business and Economic Statistics, 28(3), 357–369.
Teh, Y., M. Jordan, M. Beal, and D. Blei (2006): “Hierarchical Dirichlet processes,” Journal
of the American Statistical Association, 101(476), 1566–1581.
Walker, S., and B. Mallick (1997): “Hierarchical generalized linear models and frailty models
with Bayesian nonparametric mixing,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4), 845–860.
White, H. (1982): “Maximum likelihood estimation of misspecified models,” Econometrica, 50(1),
1–26.