FILTERED LIKELIHOOD FOR POINT PROCESSES

By Kay Giesecke and Gustavo Schwenkler∗

Stanford University and Boston University
We develop and study likelihood estimators of the parameters
of a marked point process and of incompletely observed explanatory
factors that influence the arrival intensity and mark distribution. We
establish an approximation to the likelihood and analyze the convergence and large-sample properties of the associated estimators.
Numerical results illustrate our approach.
1. Introduction. Point processes are stochastic models for the timing of discrete events. They are widely used in finance and economics, as
well as various other areas such as insurance, engineering, seismology, epidemiology, and medicine. In many cases, the point process intensity, which
represents the conditional event arrival rate, is a function of explanatory
factors. In practice, however, the factor data available to the statistician are
often incomplete. This paper addresses the inference problem arising in this
situation. It develops and analyzes likelihood estimators of the parameters
of the point process and of the explanatory factors. The results provide a
rigorous statistical foundation for the applications of event timing models in
the presence of incomplete factor data. They facilitate the analysis of models
for which statistical estimation was previously intractable.
More specifically, we consider a marked point process whose intensity and
mark distribution are parametric functions of a vector V of explanatory factors. The factors are allowed to follow stochastic processes whose dynamics
are specified by a parameter. The data include the event times and marks
observed during a sample period. They also include certain observations of
V . Some elements of V are observed at any time, others are sampled only on
certain dates, and the remaining ones are frailties that are never observed.
The values between the sampling times of the partially observed factors, and
the values of the frailty factors are not available for inference.
∗ We thank Darrell Duffie, Peter Glynn, Tze Leung Lai, Yang Song, and Stefan Weber for insightful comments. This work was completed while Schwenkler was a PhD student in the Department of Management Science and Engineering at Stanford University. Schwenkler acknowledges support from a Mayfield Fellowship and a Lieberman Fellowship.
AMS 2000 subject classifications: 62F12, 62M20, 62N02
Keywords and phrases: event timing, point process, filtering, parametric maximum likelihood, asymptotic theory, likelihood approximation
The model and data structure we adopt is motivated by a range of applications of event timing models in which incomplete information about
V is a key feature. An example is the prediction of corporate bankruptcies,
see Azizpour, Giesecke and Schwenkler (2014), Chava and Jarrow (2004),
Duffie et al. (2009), Duffie, Saita and Wang (2007), Lando and Nielsen (2010)
and others. Here, the event times of the point process represent bankruptcy
dates. A mark variable might describe the financial loss at default. The vector V might include stochastically varying factors, such as stock returns,
which are observed at any time during the sample period. It might also
include macro-economic factors, such as industrial production, which are
published only monthly or quarterly. Moreover, there is strong evidence of
the presence of a dynamically varying, unobservable frailty factor influencing bankruptcy timing. Similar model and data structures are also common
in the study of market transactions, unemployment spells, mortgage prepayments and delinquencies, the timing of births, and the duration of strikes.1
Section 2.3 provides concrete examples.
We use a filtering approach to address the parameter inference problem
with incomplete factor data. We begin by developing a representation of the
likelihood of the data as a product of two terms, one addressing the event
data and the other the available factor data. The decomposition is based on
a change of probability measure that resolves the interaction between the
point process and the factors influencing the arrival intensity. The likelihood
representation generalizes to incomplete factor data the representations of
the complete data likelihood studied by Andersen et al. (1993), Ogata (1978),
Puri and Tuan (1986) and others. Our filtering approach eliminates the need
to interpolate between discrete observations of factor values. Such interpolation is standard practice in the literature, used to construct a data series with a value for every time in the sample period. However, the implications of this practice for the properties of the estimators are not well understood, rendering it questionable.
One of the terms governing the likelihood is a point process filter whose practical computation may be difficult. We propose an approximation to this filter that eliminates the need to evaluate the likelihood by simulation, which may be computationally burdensome for longer sample periods and may cause a loss of estimation efficiency. The approximation is based on a quadrature method; an efficient algorithm reduces its computation to a sequential multiplication of matrices.2 Our approach is an alternative to the development and numerical solution of a Kushner-Stratonovich equation for the filter (see, for example, Kurtz and Ocone (1988), Kliemann, Koch and Marchetti (1990), Kurtz (1998)).3

1 See Engle and Russell (1998), Lancaster (1979), Deng, Quigley and Van Order (2000), Newman and McCulloch (1984), and Harrison and Stewart (1989), respectively. Kiefer (1988) provides an excellent overview of event timing applications in economics.
We provide conditions guaranteeing the uniform convergence of the approximate likelihood to the exact likelihood. We also show that the parameter maximizing the approximate likelihood converges to the parameter at
which the exact likelihood attains a global optimum and that it inherits the
large-sample properties of the estimator maximizing the exact likelihood.
The approximation does not generate additional asymptotic variance, so
there is no loss of estimation efficiency when maximizing the approximate
instead of the exact likelihood. Our approach to developing the properties
of an estimator obtained from an approximate likelihood is an alternative
to the approach of Dupacova and Wets (1988), who directly addressed the
behavior of that estimator by viewing it as a solution of a stochastic optimization problem with partial information.
Numerical results illustrate the behavior of our estimator and provide a
comparison with a stochastic EM Algorithm (see Dempster, Laird and Rubin (1977), Wei and Tanner (1990), Celeux and Diebolt (1992), and others).
We analyze an extension of the classical self-exciting point process of Hawkes
(1971), which is widely used in applications. The intensity is a linear function
of three factors, the first following a jump-diffusion that is observed monthly,
the second following an unobservable geometric Brownian motion, and the
third given by a functional of the point process path. We verify the convergence of the approximate to the true likelihood for alternative configurations
of the approximation. We find a substantial reduction of computational effort over the alternative of a Monte Carlo approximation to the likelihood.
Comparing our estimates with those of the stochastic EM Algorithm, we find
that, for the majority of the parameters, our estimates are more accurate
and less susceptible to the initial values.4 Turning to large-sample properties
(see Nielsen (2000) for the stochastic EM Algorithm), we find that the EM
Algorithm generates asymptotic standard errors in excess of or similar to
those implied by our method, for the majority of the parameters.
We have implemented our estimators in an R package. The code can be downloaded at http://people.bu.edu/gas. It can be easily customized to treat alternative model specifications.

2 Brillinger and Preisler (1983) use a quadrature method to approximate the likelihood in a regression model with latent variables.
3 Zeng (2003) uses this approach to perform Bayesian least-squares estimation for a point process model that is a special case of our formulation.
4 The EM estimator is guaranteed to converge to a local optimum of the likelihood only; see Wu (1983) and Dembo and Zeitouni (1986). Our estimator is guaranteed to converge to a global optimum.
Prior research has considered inference for event timing models. Ogata
(1978) develops and analyzes likelihood estimators of stationary and ergodic
point processes with complete data. Andersen and Gill (1982) examine the
proportional hazards model of Cox (1972) with complete data, in which
the intensity is the product of an exponential function of factors and a deterministic, semi-parametric baseline hazard function. Nielsen et al. (1992),
Murphy (1994), Murphy (1995) and Parner (1998) study a frailty model
in which the intensity is the product of a semi-parametric baseline hazard
function, an observable factor, and an unobservable frailty represented by
an independent random variable. We focus on parametric intensity models
and treat more general model and data structures. For example, we allow
the frailty to follow a stochastic process that may be correlated with the
observable factors. We also allow for arbitrary link functions, going beyond
the proportional hazard formulation. Moreover, we allow for partial observation of explanatory factors. These features are common in the applied event
timing literature; see the examples in Section 2.3. Finally, our results are
related to the results of Le Cam and Yang (1988), who analyze likelihood
estimators with incomplete data. Unlike these authors, we allow for some
factors to be observed at all times. As a result, our data cannot be viewed as
the product of repeated independent experiments and our likelihood cannot
be decomposed into a product of conditional densities.
The rest of this paper is organized as follows. Section 2 formulates the
point process model with incomplete factor data and provides examples.
Section 3 develops a representation of the likelihood. Section 4 introduces
and analyzes an approximation to the likelihood, and examines the asymptotic properties of the associated estimator. Section 5 provides an efficient
algorithm to evaluate the approximate likelihood. Section 6 numerically illustrates the behavior of our estimators. There are several appendices. Appendices A and B provide additional results on the exact and approximate
estimators. Appendix C provides the proofs of our results.
2. Marked point process. Fix a complete probability space (Ω, F, P)
and a right-continuous and complete filtration (Ft )t≥0 . Consider a marked
point process (Tn , Un )n≥1 . The Tn are strictly positive and strictly increasing
stopping times that represent the arrival times of events. The Un are FTn-measurable mark variables that take values in a subset RU of Euclidean
space. They provide additional information about the events.
2.1. Model. Suppose the counting process N given by Nt = Σ_{n≥1} 1{T_n ≤ t}
has intensity λ (see II.D7 in Brémaud (1980)). The intensity represents the
conditional event arrival rate. We take
(1)  λt = Λ(Vt; α)
for a function Λ : RV → (0, ∞) specified by a parameter α ∈ ΘN , and an
explanatory factor process V = (X, Y, Z) ∈ RV ⊂ RdX × RdY × RdZ , where
dX , dY and dZ are natural numbers. The dynamics of V are specified by
a parameter γ ∈ ΘV . The conditional distribution of the mark variables is
described by a probability transition kernel Πt (du) from Ω × [0, ∞) into RU
(see VIII.D5 in Brémaud (1980)). We take

(2)  Πt(du) = π(u, Vt−; β) du
for a density function π : RU × RV → (0, ∞) specified by a parameter
β ∈ ΘU . Technical conditions need to be imposed on π, Λ, and the dynamics
of V in order for the point process to be non-explosive (see, for example,
Protter (2004) and Gjessing et al. (2010)).
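To make the specification concrete, a point process with intensity (1) can be simulated by thinning. The sketch below is ours, not the paper's: it assumes a hypothetical exponential link Λ(v; α) = exp(αv) and a one-dimensional Ornstein-Uhlenbeck factor, and uses a crude local intensity bound that is adequate only for small time steps.

```python
import math
import random

def simulate_thinning(alpha, horizon, v0=0.0, kappa=1.0, dt=0.01, seed=7):
    """Simulate event times of a point process with intensity
    lambda_t = exp(alpha * V_t), where V is an Euler-discretized
    Ornstein-Uhlenbeck factor (an illustrative choice, not the paper's)."""
    random.seed(seed)
    t, v = 0.0, v0
    events = []
    while t < horizon:
        # crude local bound on the intensity over the next short step
        lam_bar = 1.5 * math.exp(alpha * v)
        w = random.expovariate(lam_bar)  # candidate waiting time
        step = min(w, dt)
        # Euler step for dV = -kappa * V dt + dW
        v += -kappa * v * step + random.gauss(0.0, math.sqrt(step))
        t += step
        if w <= dt and t < horizon:
            # thinning: accept the candidate with probability lambda_t / lam_bar
            if random.random() <= min(1.0, math.exp(alpha * v) / lam_bar):
                events.append(t)
    return events
```

The simulated times come out strictly increasing in (0, horizon); shrinking `dt` and sharpening the local bound `lam_bar` trades speed for accuracy.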
2.2. Data. The goal is to estimate the parameter θ = (α, β, γ) governing
the intensity, the distribution of the marks, and the dynamics of the explanatory factor V . The data available for inference include the event times and
marks observed during a sample period [0, τ ], where τ > 0. The data also
include certain observations of V during the sample period. More precisely,
the data are given by Dτ = (Nτ , Uτ , Xτ , Yτ ), where
• Nτ = (Nt : 0 ≤ t ≤ τ) represents the observations of the Tn;
• Uτ = (Σ_{n≤Nt} Un : 0 ≤ t ≤ τ) represents the observations of the Un;
• Xτ = (Xt : 0 ≤ t ≤ τ) represents the observations of X;
• Yτ = (Ytj : t ≤ τ, t ∈ S j, j = 1, . . . , dY) represents the observations of Y,

where the S j are discrete subsets of [0, ∞) with S = ⋂_{j=1}^{dY} S j ≠ ∅.
The structure of the data addresses the fact that in practice, the data on
the explanatory factor V = (X, Y, Z) may be incomplete. While the data
include the entire history of X for the sample period, they include the values
of Y only on certain dates. The data do not include any values of Z; this
element of V is an unobservable frailty.
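This observation scheme translates directly into code: all of X is kept, each Y j is retained only on its sampling grid S j, and Z is dropped entirely. The container and masking function below are our own illustrative constructs; none of these names come from the paper or its companion R package.

```python
from dataclasses import dataclass

@dataclass
class ObservedData:
    """Container for the data D_tau = (N_tau, U_tau, X_tau, Y_tau)."""
    event_times: list  # observations of the T_n on [0, tau]
    marks: list        # observations of the U_n
    x_path: dict       # {t: X_t} for every grid time t (fully observed)
    y_samples: dict    # {j: {t: Y^j_t for t in S^j}} (discretely observed)
    # the frailty Z is never observed, so it has no field here

def mask_factors(full_path, sample_grids, tau):
    """Reduce a fully simulated factor path {t: (x, y_vec, z)} to what the
    statistician sees: all of X, Y^j only on S^j, and none of Z."""
    x_path = {t: v[0] for t, v in full_path.items() if t <= tau}
    y_samples = {
        j: {t: full_path[t][1][j] for t in grid if t <= tau and t in full_path}
        for j, grid in sample_grids.items()
    }
    return x_path, y_samples
```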
2.3. Examples. The point process model and data structure we adopt
have a wide scope. We illustrate with a number of examples from the finance
and economics literature.
Example 2.1 (Timing of Births). Newman and McCulloch (1984) explore the factors driving the timing of births for 4724 women between the
ages of 20 and 49 in Costa Rica. A unit of time is one month, and time
is measured starting at the 13-th birthday of a woman. The event time Tn
corresponds to the time of first birth for the n-th woman. The event intensity is Λ(v; α) = exp(α · v) for v = (v1 , . . . , v4 ) and α = (α1 , . . . , α4 ). The
explanatory factor Vt = (Vt1 , . . . , Vt4 ) satisfies Vt2 = t, Vt3 = max{t − 48, 0},
and Vt4 = max{t − 144, 0}. The factor V 1 is a 6-dimensional vector that
represents personal and regional characteristics such as schooling level and
mortality rates, and is observed monthly. Overall, dX = 3, dY = 6, dZ = 0,
X = (V 2 , V 3 , V 4 ), Y = V 1 , and S j = N0 for 1 ≤ j ≤ 6. Similar models are
analyzed by Heckman and Walker (1990), Hotz, Klerman and Willis (1997),
and Todd, Winters and Stecklov (2012).
Example 2.2 (Timing of Market Transactions). Engle and Russell (1998)
analyze IBM stock transactions data during regular trading hours for the
period between November 1, 1990, and January 31, 1991. The event time
Tn represents the time of the n-th transaction. The event intensity satisfies Λ(v; α) = Λ0(v0/v1; α) v1^{−1} for v = (v0, v1). The vector of explanatory factors is Vt = (V^0_t, V^1_t), where V^0_t = t − T_{N_t} and V^1_t = V^1_{N_t} for

V^1_n = γ0 + Σ_{j=0}^{m} γ_{j+1} (T_{n−j} − T_{n−j−1}) + Σ_{j=0}^{q} γ_{m+2+j} V^1_{n−j}
for γ = (γ0 , γ1 , . . . , γm+q+2 ) and m, q ∈ N. This model is called the autoregressive conditional duration model because the conditional distribution of
the time until the next event depends on the realization of past inter-event
times. The data include complete observations of V 0 and V 1 . Thus, dX = 2,
dY = dZ = 0, and V = X. Related models of inter-transaction times are
analyzed by Dufour and Engle (2000) and Engle (2000).
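The recursion for V^1 can be evaluated sequentially from realized event times. The sketch below implements a standard ACD(m, q)-style update in the spirit of Engle and Russell (1998), with our own variable names and with the autoregressive term applied to lagged values of V^1 only; it is an illustration, not the authors' code.

```python
def acd_factor(event_times, gamma0, dur_coefs, lag_coefs, psi0=1.0):
    """Sequentially compute the conditional-duration factor V^1_n from
    realized event times. dur_coefs weight the m most recent inter-event
    durations; lag_coefs weight the q most recent lagged values of V^1."""
    durations = [t1 - t0 for t0, t1 in zip(event_times, event_times[1:])]
    m, q = len(dur_coefs), len(lag_coefs)
    psi = [psi0] * q  # initialize the q lagged values
    out = []
    for n in range(len(durations)):
        recent = durations[max(0, n - m + 1): n + 1][::-1]  # newest first
        val = gamma0
        val += sum(c * d for c, d in zip(dur_coefs, recent))
        val += sum(c * p for c, p in zip(lag_coefs, psi[-q:][::-1]))
        psi.append(val)
        out.append(val)
    return out
```

For example, with unit inter-event durations, gamma0 = 0.1, a single duration weight 0.2, and a single lag weight 0.5 starting from psi0 = 1.0, the first two values are 0.8 and 0.7.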
Example 2.3 (Unemployment Timing). Lancaster (1979) studies the
duration of unemployment among 479 British individuals. The event time
Tn represents the arrival time of the n-th job offer. The n-th mark is Un =
(wn , an ) ∈ R+ × {0, 1}, where wn indicates the wage offered for the n-th
job and an indicates acceptance of an offer. The density π is parameter-independent and modeled by the empirical distribution in the sample. The intensity is Λ(v; α) = v2 exp(α · v1) for v = (v1, v2) and Vt = (V^1_0, V^2_0). The
three-dimensional observable explanatory factor component V 1 represents
the age of the individual, the unemployment rate in the UK, and social security benefits. The component V 2 is an unobservable frailty that has a gamma
distribution with unit mean and variance γ. Thus, dX = 3, dY = 0, dZ = 1
and V = (X, Z) with X = V 1 and Z = V 2 . Lancaster and Nickell (1980),
Meyer (1990) and Dolton and O’Neill (1996) examine related models.
Example 2.4 (Mortgage Delinquencies). Deng, Quigley and Van Order (2000) investigate home mortgage default and prepayment events in the
United States using quarterly data from Freddie Mac on single-family mortgage loans issued between 1976 and 1983. The event time Tn represents the
first time of either default or prepayment of the n-th loan. The event intensity is Λ(v; α) = v0 exp(v2 + v3 + αd · v6 ) + v1 exp(v4 + v5 + αp · v6 )
for v = (v0 , . . . , v6 ) and α = (αd , αp ). The vector of explanatory factors
is Vt = (Vt0 , . . . , Vt6 ). Here, Vt0 = V00 and Vt1 = V01 are time-invariant
one-dimensional frailty factors that take values in discrete sets and characterize the borrower type, Vt2 and Vt4 are one-dimensional baseline functions, Vt3 and Vt5 are two-dimensional measures of the moneyness of the
default and prepayment options held by the borrower, respectively, and Vt6 is
a two-dimensional vector of macro-factors such as employment and divorce
rates. The data include continuous observations of (V 2, V 4) and quarterly
observations of (V 3 , V 5 , V 6 ). Therefore, dX = 2, dY = 6, dZ = 2, and
V = (X, Y, Z) for X = (V 2 , V 4 ), Y = (V 3 , V 5 , V 6 ), Z = (V 0 , V 1 ), and
S j = N0 for 1 ≤ j ≤ 6. Related models are analyzed by Demyanyk and
Van Hemert (2011), Deng and Gabriel (2006), Foote, Gerardi and Willen
(2008), and Pennington-Cross and Ho (2010).
Example 2.5 (Timing of Corporate Bankruptcies). Azizpour, Giesecke
and Schwenkler (2014) examine the timing of bankruptcies in the United
States between 1970 and 2012. An event time Tn represents a bankruptcy
date. A mark Un ∈ R+ represents the debt outstanding at default. The density π is parameter-independent and modeled by the empirical distribution in
the sample. The intensity function is Λ(v; α) = exp(α1 · v1 ) + α2 v2 + α3 v3 for
v = (v1 , v2 , v3 ) and α = (α1 , α2 , α3 ). The factor vector is V = (V 1 , . . . , V 6 ),
where

V^1_t = Σ_{n ≤ Nt} e^{−γ1 (t − Tn)} log Un
generates self-exciting behavior. The element V 2 is an unobservable frailty
governed by the SDE

dV^2_t = γ3 (γ4 − V^2_t) dt + √(V^2_t) dWt,

where W is a Brownian motion and γ = (γ1, . . . , γ4) with γ1 > 0. The initial value V^2_0 is sampled from its stationary gamma distribution with shape parameter 2γ3γ4 and scale parameter 1/(2γ3). The element V 3 follows a vector
autoregressive model and represents several interest rates, stock index returns
and volatilities that are sampled daily. The element V 4 follows an autoregressive model and represents credit spreads that are sampled weekly. The element V 5 follows an autoregressive moving-average model, represents the industrial production growth rate, and is sampled monthly. The element V 6 follows a moving-average model, represents the growth rate of the gross domestic product, and is sampled quarterly. We have V = (X, Y, Z) with X = V 1, Y = (V 3, V 4, V 5, V 6), and Z = V 2; dX = 1, dY = 4, dZ = 1, S 1 = {k/365 : k ∈ N}, S 2 = {k/52 : k ∈ N}, S 3 = {k/12 : k ∈ N}, and S 4 = {k/4 : k ∈ N}.
Related models are analyzed by Duffie, Saita and Wang (2007), Duffie et al.
(2009), and Lando and Nielsen (2010).
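Because the kernel in V^1 is exponential, the sum defining it can be updated recursively rather than recomputed from scratch at each date: decay the running value between events and add log Un at each Tn. A minimal sketch in our own notation:

```python
import math

def self_exciting_factor(t, event_times, marks, gamma1):
    """Evaluate V^1_t = sum over {n : T_n <= t} of exp(-gamma1 (t - T_n)) log U_n
    in O(N_t) time: decay the running sum between events, add log U_n at T_n."""
    v, last = 0.0, 0.0
    for tn, un in zip(event_times, marks):
        if tn > t:
            break
        v *= math.exp(-gamma1 * (tn - last))  # decay since the previous event
        v += math.log(un)                     # jump contribution at T_n
        last = tn
    return v * math.exp(-gamma1 * (t - last))  # decay from the last event to t
```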
Our formulation facilitates the analysis of multiple influences on event
timing and various data structures. The arrival intensity (1) is a non-negative
function of a vector V of explanatory factors. The elements of V may include
time, allowing for time-inhomogeneity as in Examples 2.1 and 2.2. They may
be represented by time-invariant random variables (Examples 2.3 and 2.4),
or by stochastic processes (Examples 2.2 and 2.5). The elements of V may
be correlated (Examples 2.4 and 2.5) and need not be Markovian (Example
2.2). The point process may be an element of V or have an impact on the
dynamics of other elements of V , to generate self-exciting behavior as in
Examples 2.2 and 2.5. Some elements of V may be completely unobservable
frailties (Examples 2.3, 2.4, and 2.5). Others may be observable only on
certain dates (Examples 2.1, 2.4, and 2.5) or even at any time (Examples
2.2 and 2.5). Our formulation allows one to address all of these features in
a common model and data framework.
3. Likelihood inference. We analyze likelihood estimators of the parameter θ = (α, β, γ) ∈ Θ = ΘN × ΘU × ΘV . We write Pθ and Eθ for the
probability and the expectation operators when the underlying parameter
is θ. The true but unknown parameter θ0 belongs to Θ◦ , the interior of Θ.
The observed data is a realization of Dτ under Pθ0 .
The likelihood Lτ (θ) of the data Dτ at the parameter θ ∈ Θ is taken as the
Radon-Nikodym derivative of the Pθ -distribution of the data with respect
to the true Pθ0 -distribution; see Appendix A.1 for a precise definition. With
complete factor data (dY = dZ = 0), the likelihood is given by dPθ /dPθ0 . In
the general case, the likelihood is given by the projection of dPθ /dPθ0 onto
the observed data:
(3)  Lτ(θ) = Eθ0 [ dPθ/dPθ0 | Dτ ].
Unfortunately, the representation (3) is impractical because the true parameter θ0 is unknown. In order to obtain an alternative representation,
we introduce an auxiliary probability measure. Let π ∗ be a strictly positive
probability density on RU . If the variable
(4)  Mτ(θ) = Π_{n≤Nτ} [ π∗(Un) / π(Un, V_{Tn−}; β) ] × exp( −∫_0^τ log Λ(V_{s−}; α) dNs − ∫_0^τ (1 − Λ(Vs; α)) ds )
has expected value equal to one, then it induces an equivalent probability
measure P∗θ on Fτ . Under this measure, the event counting process N is a
standard Poisson process on the interval [0, τ ] and a mark Un has density π ∗ .
The intensity and the mark distribution relative to P∗θ depend neither on the factor V nor on the parameter θ. This property facilitates a decomposition
of the likelihood (3) into a product of two terms, one addressing the event
data (Nτ , Uτ ) and the other the factor data (Xτ , Yτ ). Each of these terms is
given by an expectation under the auxiliary measure.
Theorem 3.1. Suppose the following conditions hold:

(A1) The expectation Eθ[Mτ(θ)] = 1.
(A2) The conditional likelihood L∗F(θ; τ) of the factor data (Xτ, Yτ) given the event data (Nτ, Uτ) exists under P∗θ.

Then the likelihood satisfies

(5)  Lτ(θ) ∝ E∗θ[ 1/Mτ(θ) | Dτ ] L∗F(θ; τ).
Condition (A1) guarantees that the measure P∗θ is well-defined. Blanchet
and Ruf (2013) provide sufficient conditions for (A1). Condition (A2) requires the absolute continuity of the conditional P∗θ -law of the factor data
(Xτ , Yτ ) given the event data (Nτ , Uτ ) relative to the corresponding conditional law under P∗θ0 ; see Appendix A for details. Sufficient conditions for
this property depend on the model and data structure specified for V . For a
jump-diffusion model of V with discrete observation dates, for example, see
Komatsu and Takeuchi (2001) and Takeuchi (2002).
Theorem 3.1 generalizes to incomplete factor data the representations of
the complete data point process likelihood studied by Andersen et al. (1993),
Ogata (1978), Puri and Tuan (1986) and others. Indeed, if dY = dZ = 0, the
data Dτ include all values of the factor V = X required to evaluate Mτ (θ).
In this case the conditional expectation in (5) is trivial, so

(6)  Lτ(θ) ∝ L∗F(θ; τ) / Mτ(θ).
With incomplete factor data (dY > 0 or dZ > 0), the data Dτ are insufficient
to evaluate Mτ (θ). In this case, the conditional expectation in (5) represents
a posterior mean of 1/Mτ (θ) given the data.
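In the complete-data case, up to terms free of θ, the logarithm of (6) reduces to the familiar form Σ_n log Λ(V_{Tn−}; α) + Σ_n log π(Un, V_{Tn−}; β) − ∫_0^τ Λ(Vs; α) ds. The sketch below evaluates the intensity part of this expression for a factor path recorded on a finite grid; the exponential link and the left Riemann sum for the time integral are illustrative choices of ours, not the paper's.

```python
import math

def complete_data_loglik(event_times, grid, v_path, alpha):
    """theta-dependent intensity part of the complete-data log-likelihood:
    sum_n log Lambda(V_{T_n-}; alpha) - integral_0^tau Lambda(V_s; alpha) ds,
    with Lambda(v; alpha) = exp(alpha * v) as an illustrative link and the
    integral approximated by a left Riemann sum on `grid`."""
    def lam(v):
        return math.exp(alpha * v)

    # integral term: left Riemann sum over the observation grid
    integral = sum(lam(v_path[t0]) * (t1 - t0)
                   for t0, t1 in zip(grid, grid[1:]))

    def v_before(t):
        # factor value at the last grid point strictly before t
        prior = [s for s in grid if s < t]
        return v_path[prior[-1]] if prior else v_path[grid[0]]

    events = sum(math.log(lam(v_before(t))) for t in event_times)
    return events - integral
```

With a constant factor path at v = 0 the intensity is identically one, so the value reduces to minus the length of the sample period, which is a convenient sanity check.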
From Theorem 3.1, ignoring an additive term that is independent of θ, the log-likelihood takes the form

(7)  Lτ(θ) = log L∗E(θ; τ) + log L∗F(θ; τ),

where L∗E(θ; τ) = E∗θ[1/Mτ(θ) | Dτ]. A maximum likelihood estimator (MLE) θ̂τ ∈ Θ◦ of the parameter θ is a solution to

(8)  0 = ∇Lτ(θ).
Appendix A.2 studies the asymptotic properties of an MLE. Identifiability, smoothness, boundedness, and non-singularity conditions ensure that θ̂τ → θ0 in Pθ0-probability and √τ (θ̂τ − θ0) → N(0, Σ0) in Pθ0-distribution as τ → ∞. The asymptotic variance-covariance matrix Σ0 ∈ R^{p×p} is given by

(9)  Σ0 = I(θ0)^{−1} + 2ζ I(θ0)^{−1} ΣEF(θ0) I(θ0)^{−1},

where ζ = lim_{τ→∞} |S ∩ [0, τ]|/τ is the mean number of observations of Y per unit of time, I(θ0) is the Fisher information matrix for the estimation of the true parameter θ0, and ΣEF(θ0) is a matrix that addresses the filtering of the unobserved factors. With complete factor data, Σ0 = I(θ0)^{−1} (Ogata (1978)), so the filtering calls for a correction term.
4. Approximate likelihood estimators. Theorem 3.1 states that the
likelihood is governed by two terms. The term L∗F (θ; τ ) addresses the available factor data; its computation is standard for a range of models of V , see
Aït-Sahalia (2008), Giesecke and Schwenkler (2014), Lo (1988), and others.
The term L∗E (θ; τ ) = E∗θ [1/Mτ (θ) | Dτ ] addresses the event data. Depending
on the model specification, its practical computation may be difficult.
This section develops and analyzes an approximation to the point process
filter L∗E (θ; τ ). It also examines the convergence and asymptotic properties
of the estimators maximizing the approximate likelihood. We focus on the
case that at least one of the factors is an unobserved frailty (dZ > 0).
4.1. Filter approximation. We consider filters of the form

(10)  E(θ; τ, g) = E∗θ[ g(Vτ, τ)/Mτ(θ) | Dτ ]

for functions g on RV × [0, ∞). The term L∗E(θ; τ) can be computed by taking g(v, t) = 1. While we treat E(θ; t, g) at the horizon t = τ, the extension to dates t < τ is straightforward.
A standard approach to computing (10) if dY = 0 is to develop and numerically solve the associated Kushner-Stratonovich equation (see Kliemann,
has not yet been worked out for the case that dY > 0, i.e., when some factors
are observed at fixed dates only.
We propose to approximate E(θ; τ, g) using a quadrature-based method.
Fix a number mS ∈ N and choose points v1 , . . . , vmS ∈ RV , together with
compact sets A1 , . . . , AmS ⊆ RV that are pairwise disjoint such that Ak
is a neighborhood of vk with non-empty interior. Additionally, fix another
number mT ∈ N, and choose times T 0 , T 1 , . . . , T mT ∈ [0, τ ] with 0 = T 0 <
T 1 < · · · < T mT = τ. Our approximation of E(θ; τ, g) is given by

E I (θ; τ, g) = Σ_{i0=1}^{mS} · · · Σ_{imT=1}^{mS} g(v_{imT}, τ) FθI (i1, . . . , imT) PθI (i0, i1, . . . , imT),

where I = {mS, mT, v, A, T} is the implementation of the approximation with v = {v1, . . . , vmS}, A = {A1, . . . , AmS} and T = {T 1, . . . , T mT}.
Further, writing Λ(·; θ) and π(·, ·; θ) rather than Λ(·; α) and π(·, ·; β), let

FθI (i1, . . . , imT) = exp{ Σ_{j=1}^{mT} Σ_{Tn ∈ (T j−1, T j]} log[ Λ(v_{ij}; θ) π(Un, v_{ij}; θ) / π∗(Un) ] + Σ_{j=1}^{mT} (1 − Λ(v_{ij}; θ))(T j − T j−1) },

PθI (i0, . . . , imT) = P∗θ[ V0 ∈ A_{i0}, Vt ∈ A_{ij} (1 ≤ j ≤ mT, t ∈ (T j−1, T j]) | Dτ ].
The approximation E I (θ; τ, g) of E(θ; τ, g) is a weighted average of certain
paths of the explanatory factors V . The paths are those for which V0 ∈ {v1 ,
. . . , vmS } and Vt is constant in the interval (T j−1 , T j ] for 1 ≤ j ≤ mT ,
taking values in the set {v1 , . . . , vmS } ⊂ RV . The weights assigned to these
paths are the conditional P∗θ -probabilities, given the data Dτ , that V is in the
set Aij during the interval (T j−1 , T j ] for 1 ≤ j ≤ mT and ij ∈ {1, . . . , mS }.
Thus, E I (θ; τ, g) is based on a multidimensional rectangular quadrature
approximation to the Lebesgue integral in E(θ; τ, g), taken with respect to
the measure P∗θ conditional on the data Dτ . Section 5 provides an algorithm
for evaluating the approximate filter E I (θ; τ, g).
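When the factor V is approximated by a finite-state Markov chain under P∗θ, the nested sums defining E I (θ; τ, g) collapse into sequential matrix-vector products: a vector of state weights is propagated through one multiplication per interval (T j−1, T j], scaled by the per-interval factor contributions. The sketch below is our own illustration of this idea; the one-step transition matrix and the per-interval factors are model-specific placeholders that would be computed from the dynamics of V and the data.

```python
def approx_filter(init_dist, transition, interval_factors, g_vals):
    """Evaluate the quadrature filter approximation by sequential
    matrix-vector products. init_dist[i]: conditional probability that
    V_0 lies in A_i; transition[i][k]: one-step conditional transition
    probability from state i to state k; interval_factors[j][k]: the F
    contribution of state k on interval j; g_vals[k] = g(v_k, tau)."""
    w = list(init_dist)
    for factors in interval_factors:
        # one interval: move mass through the chain, then weight by F_j
        w = [factors[k] * sum(w[i] * transition[i][k] for i in range(len(w)))
             for k in range(len(factors))]
    return sum(wk * gk for wk, gk in zip(w, g_vals))
```

With mS states and mT intervals the cost is O(mT · mS^2), in contrast to the mS^(mT+1) terms of the nested sums.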
4.2. Convergence. We examine the convergence of the approximate filter
E I (θ; τ, g) and of the estimators based on E I (θ; τ, g). The idea is to choose
sequences of compact sets {An,j }1≤j≤mnS and of times {T n,j }1≤j≤mnT , such
that, in the limit as n → ∞, the entire range of V is covered by the compact
sets while both the sets An,j and the time intervals (T n,j−1 , T n,j ] become
smaller. Let ∂v denote the partial derivative with respect to v.
Theorem 4.1 (Convergence of approximate filter). Suppose the following conditions hold.
(B1) The partial derivative ∂v Λ(v; θ) of the intensity function Λ exists. Both
Λ(v; θ) and ∂v Λ(v; θ) are continuous in θ ∈ Θ.
(B2) The partial derivative ∂v π(u, v; θ) of the P-density function π of the
marks exists. The functions π(u, v; θ) and ∂v π(u, v; θ) are continuous
in θ ∈ Θ.
(B3) The space of admissible parameters Θ is a compact subset of a Euclidean space and has non-empty, convex interior.
(B4) E∗θ[1/Mτ(θ)^2] < ∞ for any θ ∈ Θ.
(B5) The function g is P-measurable and either continuous or bounded.
Suppose there exists a sequence of data-dependent implementations I n =
{mnS , mnT , vn , An , T n } with vn = (vn,j )1≤j≤mnS , An = (An,j )1≤j≤mnS , and
T n = (T n,i )1≤i≤mnT , such that the following conditions hold P-almost surely:
(B1) mnS, mnT < ∞ for all n ∈ N.
(B2) An is a sequence of compact and convex Rd-sets with non-empty, pairwise disjoint interiors, such that An = ⋃_{j=1}^{mnS} An,j → RV as n → ∞.
(B3) vn,j ∈ RV for each n ∈ N and 1 ≤ j ≤ mnS, and the interior of An,j is an open neighborhood of vn,j.
(B4) 0 = T n,0 < T n,1 < · · · < T n,mnT = τ.
(B5) In the limit as n → ∞,

sup_{θ∈Θ} P∗θ[ ∃t ∈ [0, τ] : Vt ∉ An | Dτ ] → 0.
(B6) In the limit as n → ∞, it holds P-almost surely that

max_{1≤j≤mnS} vol(An,j) × ( τ max_{θ∈Θ} max_{1≤j≤mnS} |∂v Λ(vj; θ)| + Nτ max_{θ∈Θ} max_{1≤k≤Nτ} max_{1≤j≤mnS} |∂v log[ Λ(vj; θ) π(Uk, vj; θ) / π∗(Uk) ]| ) → 0.
Then, for fixed τ > 0, E I n ( · ; τ, g) converges uniformly in the space of admissible parameters Θ to E( · ; τ, g) as n → ∞, P-almost surely.
Theorem 4.1 states that if one can construct implementations such that Conditions (B1)–(B6) are satisfied, then the approximation E I (θ; τ, g) converges uniformly in the parameter θ ∈ Θ. To achieve this, the sequence of compact sets An,j has to be chosen such that the volume of each set shrinks faster than |∂v log[Λ(v; θ)π(u, v; θ)/π∗(u)]| and |∂v Λ(v; θ)| grow, while their union covers the range RV of the explanatory factor V. In this way, the state space discretization controls an extreme case of the intensity function and the mark distribution. The time discretization grid {T n,1, . . . , T n,mnT} can be chosen so as to facilitate the calculation of the probabilities PθI n (i0, i1, . . . , imnT); we address this choice in Section 5. Assumptions (B1)–(B3) guarantee that Assumptions (B1)–(B6) are well-posed. A sufficient condition for the existence of an implementation I n as in Theorem 4.1 is that there exists a sequence of compact and convex sets An ∈ σ(RV) with non-empty interior, An ≠ RV, such that P-almost surely

(11)  sup_{θ∈Θ} P∗θ[ ∃t ∈ [0, τ] : Vt ∉ An | Dτ ] → 0.
Theorem 4.1 guarantees that our scheme approximates the likelihood
function to the same degree of accuracy for any θ ∈ Θ and converges uniformly. Next, we analyze the convergence of the estimators obtained from
the approximate likelihood.
Proposition 4.2 (Convergence of estimators). Suppose the conditions
of Theorem 4.1 apply and, in addition, supθ∈Θ L∗F (θ; τ ) < ∞, P-almost
surely. Assume that θ̂τ ∈ Θ maximizes the likelihood
Lτ (θ) = E(θ; τ, 1)L∗F (θ; τ )
and that θ̂τn for n ∈ N maximizes the approximate likelihood

Lnτ(θ) = E I n (θ; τ, 1) L∗F(θ; τ).

Then, for fixed τ > 0, P-a.s.,

lim_{n→∞} Lnτ(θ̂τn) = Lτ(θ̂τ),

and θ̂τn converges to a global optimum of the likelihood for any realization of the data Dτ.
Assuming smoothness of L_τ(·), since L_τ(θ̂_τ) < ∞ and the parameter set Θ is compact, the sequence of estimators θ̂^n_τ obtained by the uniformly convergent implementation of our approximation converges to some parameter θ̂^∞_τ ∈ Θ. Proposition 4.2 states that L_τ(θ̂^∞_τ) = L_τ(θ̂_τ), that is, both θ̂^∞_τ and θ̂_τ achieve the same likelihood, making θ̂^∞_τ a global maximizer of the likelihood function L_τ(·). When θ̂_τ is the unique MLE, then θ̂^∞_τ = θ̂_τ and hence θ̂^n_τ → θ̂_τ as n → ∞.
4.3. Asymptotic properties of estimators. The uniform convergence of the approximate likelihood function established above entails additional results. Following the notation in Proposition 4.2, we write

(12)    √τ (θ̂^n_τ − θ_0) = √τ (θ̂^n_τ − θ̂^∞_τ) + √τ (θ̂^∞_τ − θ_0).

The assumptions in Theorem 4.1 guarantee that, for any given degree of accuracy, one can construct an implementation of our approximation achieving that accuracy. If n goes to ∞ fast enough as τ → ∞, then √τ (θ̂^n_τ − θ̂^∞_τ) converges to zero P-almost surely. This observation leads to the following result.
Proposition 4.3 (Asymptotic properties of estimators). Suppose that any exact maximum likelihood estimator is consistent and asymptotically normal; i.e., θ̂_τ → θ_0 in P_{θ_0}-probability and √τ (θ̂_τ − θ_0) → N(0, Σ_0) in P_{θ_0}-distribution as τ → ∞, for Σ_0 as in (9). Further, suppose that L_τ(·) does not attain optima on the boundary of Θ. Then there exists a sequence of implementations I^n with n → ∞ as τ → ∞ such that

θ̂^n_τ → θ_0

as τ → ∞ in P_{θ_0}-probability, and

√τ (θ̂^n_τ − θ_0) → N(0, Σ_0)

as τ → ∞ in P_{θ_0}-distribution, where Σ_0 is the same as in (9).
Proposition 4.3 guarantees the existence of an implementation of our approximation that generates a consistent and asymptotically normal estimator. The asymptotic variance-covariance matrix of that estimator is the same
as that of the estimator θ̂τ obtained from the exact likelihood. Thus, the approximation does not generate additional asymptotic variance. As a result,
there is no loss of efficiency when maximizing the approximate likelihood
instead of the exact (but uncomputable) one. Appendix A.2 provides conditions ensuring consistency and asymptotic normality of exact maximum
likelihood estimators. Further, Appendix B indicates how an approximation
to the asymptotic variance-covariance matrix Σ0 can be constructed using
our approximate filter.
5. Filter computation. We provide a convenient algorithm for evaluating the filter approximation E^I(θ; τ, g). If the process V is Markovian, then the probabilities P^I_θ(i_0, . . . , i_{m_T}) can be written as

E*_θ[ 1_{A^{i_0}}(V_0) P*_θ( ∀j ∈ {1, . . . , m_T} ∀t ∈ (T^{j−1}, T^j] : V_t ∈ A^{i_j} | D_τ, V_0 ) | D_τ ].
We propose to approximate P^I_θ(i_0, . . . , i_{m_T}) by the following interpolation:

P̂^I_θ(i_0, . . . , i_{m_T}) = P*_θ[ V_0 ∈ A^{i_0} | D_τ ] ∏_{j=1}^{m_T} P*_θ[ V_{T^j} ∈ A^{i_j} | D_τ, V_{T^{j−1}} = v_{i_{j−1}} ].

If max_{1≤j≤m_T} |A^{i_j}| ≈ 0, then P^I_θ(i_0, . . . , i_{m_T}) ≈ P̂^I_θ(i_0, . . . , i_{m_T}). The following proposition formalizes this statement.
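For concreteness, when a component of V follows a geometric Brownian motion (as the frailty in Section 6 does) and the conditioning on D_τ can be dropped under P*, the one-step probabilities P*_θ[V_{T^j} ∈ A^k | V_{T^{j−1}} = v_l] are available in closed form from lognormal increments. The following plain-Python sketch is our own illustration of this special case (the paper's implementation is in R, and the function names here are hypothetical):

```python
from math import log, sqrt, erf

def norm_cdf(x):
    # standard normal CDF expressed via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gbm_transition_row(z, cells, gamma5, gamma6, dt):
    """P(Z_{t+dt} in (a, b] | Z_t = z) for the geometric Brownian motion
    Z_{t+dt} = z * exp(gamma5*dt + gamma6*dW), dW ~ N(0, dt).
    `cells` is a list of grid cells (a, b]; returns one row of
    transition probabilities, one entry per cell."""
    s = gamma6 * sqrt(dt)
    row = []
    for a, b in cells:
        hi = norm_cdf((log(b / z) - gamma5 * dt) / s)
        lo = norm_cdf((log(a / z) - gamma5 * dt) / s) if a > 0 else 0.0
        row.append(hi - lo)
    return row

# Example: weekly step (dt = 1/50) with the frailty parameters of Section 6
cells = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]
row = gbm_transition_row(1.0, cells, gamma5=-0.125, gamma6=0.5, dt=1 / 50)
```

When the grid cells cover the bulk of the transition distribution, the row sums to nearly one, mirroring the requirement that the discretization cover the range of the factor.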
Proposition 5.1 (Estimation of transition probabilities). Suppose the conditions of Theorem 4.1 apply and assume that V is Markovian. Let (I^n)_{n≥1} be a sequence of implementations guaranteeing the convergence of the approximate filter. If m^n_T → ∞ as n → ∞, then

| P̂^{I^n}_θ(i_0, . . . , i_{m^n_T}) − P^{I^n}_θ(i_0, . . . , i_{m^n_T}) | → 0

as n → ∞, P_θ-almost surely.
Proposition 5.1 implies that the quantities P̂^I_θ(i_0, . . . , i_{m_T}) are consistent estimators of the probabilities P^I_θ(i_0, . . . , i_{m_T}). An immediate consequence is that the general convergence results of Section 4.2 are not altered if we compute the filter approximation E^I(θ; τ, g) using P̂^I_θ(i_0, . . . , i_{m_T}) instead of P^I_θ(i_0, . . . , i_{m_T}). Let

F^{I,j}_θ(i) = exp( Σ_{T_n ∈ (T^{j−1}, T^j]} log[ Λ(v_i; θ) π(U_n, v_i; θ) / π*(U_n) ] + (1 − Λ(v_i; θ))(T^j − T^{j−1}) )

for 1 ≤ j ≤ m_T and 1 ≤ i ≤ m_S. Then F^I_θ(i_1, . . . , i_{m_T}) can be rewritten as ∏_{j=1}^{m_T} F^{I,j}_θ(i_j). Similarly, define the matrices p̂^{I,j}_θ for 1 ≤ j ≤ m_T with components

p̂^{I,j}_θ(k, l) = P*_θ[ V_{T^j} ∈ A^k | D_τ, V_{T^{j−1}} = v_l ],
as well as the vector

p̂^{I,0}_θ(k) = P*_θ[ V_0 ∈ A^k | D_τ ]

for 1 ≤ k, l ≤ m_S. We can write P̂^I_θ(i_0, . . . , i_{m_T}) = ∏_{j=1}^{m_T} p̂^{I,j}_θ(i_j, i_{j−1}) p̂^{I,0}_θ(i_0). It follows that the approximation

Ê^I(θ; τ, g) = Σ_{i_0=1}^{m_S} · · · Σ_{i_{m_T}=1}^{m_S} g(v_{i_{m_T}}, τ) ∏_{j=1}^{m_T} F^{I,j}_θ(i_j) p̂^{I,j}_θ(i_j, i_{j−1}) p̂^{I,0}_θ(i_0)
shares the same convergence properties as E^I(θ; τ, g) in Theorem 4.1. Conveniently, Ê^I(θ; τ, g) can be calculated by a series of matrix multiplications. We have implemented our estimator in R. The code can be downloaded at http://people.bu.edu/gas. We summarize the steps.
Algorithm 5.2. For a given θ ∈ Θ do:

(i) Initialization: Let ξ ∈ R^{m_S} with ξ(i) = p̂^{I,0}_θ(i) for 1 ≤ i ≤ m_S.
(ii) Iteration: For j = 1, . . . , m_T − 1 do:
  (a) Define Q̂_j ∈ R^{m_S × m_S} with elements F^{I,j}_θ(k) p̂^{I,j}_θ(k, l).
  (b) Update the vector ξ to ξ ← Q̂_j · ξ.
(iii) Termination:
  (a) Define Q̂_{m_T} ∈ R^{m_S × m_S} with elements g(v_k, τ) F^{I,m_T}_θ(k) p̂^{I,m_T}_θ(k, l).
  (b) Update the vector ξ to ξ ← Q̂_{m_T} · ξ.
  (c) Compute Ê^I(θ; τ, g) = Σ_{i=1}^{m_S} ξ(i).
Algorithm 5.2 involves m_T multiplications of m_S × m_S matrices, m_T m_S² scalar multiplications, and one summation of a length-m_S vector in order to compute Ê^I(θ; τ, g). Therefore, the computational effort is of order O(m_T m_S³). It grows cubically with the fineness of the state space discretization, and linearly with the fineness of the time discretization.
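Our implementation is in R (linked above); purely as an illustration, the forward recursion of Algorithm 5.2 can be sketched in plain Python as follows, where `F`, `p_hat`, `p0`, and `g` are hypothetical arrays standing in for F^{I,j}_θ, p̂^{I,j}_θ, p̂^{I,0}_θ, and g(v_k, τ):

```python
def filter_approximation(F, p_hat, p0, g):
    """One pass of the recursion in Algorithm 5.2.
    F[j][k]        ~ F_theta^{I,j+1}(k)          (likelihood weights)
    p_hat[j][k][l] ~ p_hat_theta^{I,j+1}(k, l)   (transition probabilities)
    p0[k]          ~ p_hat_theta^{I,0}(k)        (initial distribution)
    g[k]           ~ g(v_k, tau)                 (terminal function)
    Returns the scalar approximation E_hat^I(theta; tau, g)."""
    m_S = len(p0)
    m_T = len(F)
    xi = list(p0)  # step (i): initialization
    for j in range(m_T):
        weight = F[j]
        if j == m_T - 1:  # step (iii): terminal matrix also applies g
            weight = [g[k] * F[j][k] for k in range(m_S)]
        # steps (ii)/(iii): xi <- Q_j xi with Q_j(k, l) = weight[k] * p_hat[j][k][l]
        xi = [sum(weight[k] * p_hat[j][k][l] * xi[l] for l in range(m_S))
              for k in range(m_S)]
    return sum(xi)  # step (iii)(c)
```

With trivial weights (F ≡ 1, g ≡ 1) and column-stochastic transition matrices, the recursion simply propagates a probability vector and returns 1, which is a convenient sanity check of an implementation.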
6. Numerical results. This section illustrates our estimator and contrasts it with a standard alternative. We take the intensity (1) as

(13)    Λ(v; α) = α_1 v_1 + α_2 v_2 + α_3 e^{−α_4 v_5} v_3

for v = (v_1, v_2, v_3, v_4, v_5) ∈ R_V = R^5_+ and α = (α_1, α_2, α_3, α_4) ∈ Θ_N = R^4_+. The vector V = (V^1, . . . , V^5) of explanatory factors satisfies the stochastic differential equation

dV_t = μ(V_t; γ) dt + σ(V_t; γ) dW_t + R(V_{t−}; γ) dN_t

for a two-dimensional standard Brownian motion W = (W^1, W^2), v = (v_1, . . . , v_5) ∈ R_V, γ = (γ_1, . . . , γ_8) ∈ Θ_V = R × R^3_+ × R × R^3_+, and

(14)    μ(v; γ) = ( (γ_1 + γ_2²/2)(v_1 − γ_3 e^{−γ_4 v_5} v_4) − γ_3 γ_4 e^{−γ_4 v_5} v_4,  (γ_5 + γ_6²/2) v_2,  0,  0,  1 )^⊤,

(15)    σ(v; γ) = the 5 × 2 matrix with rows (γ_2 (v_1 − γ_3 e^{−γ_4 v_5} v_4), 0), (0, γ_6 v_2), (0, 0), (0, 0), (0, 0),

(16)    R(v; γ) = (γ_3, 0, e^{γ_7 v_5}, e^{γ_8 v_5}, 0)^⊤.
The marks are irrelevant; to fully specify the Radon-Nikodym derivative (4) we take R_U = (0, 1) and π(u, v; β) = π*(u) = 1. Theorem V.3.7 in Protter (2004) guarantees the existence of a unique, non-explosive solution (V, N) to the system specified above.^5 We take γ_7 = α_4 and γ_8 = γ_4, and see that V takes the form

(17)    V^1_t = V^1_0 exp(γ_1 t + γ_2 W^1_t) + γ_3 ∫_0^t e^{−γ_4 (t−u)} dN_u,

(18)    V^2_t = V^2_0 exp(γ_5 t + γ_6 W^2_t),

(19)    V^3_t = ∫_0^t e^{α_4 u} dN_u,    V^4_t = ∫_0^t e^{γ_4 u} dN_u,    V^5_t = t.

The factor V^1 follows a jump-diffusion with jumps driven by N; the impact of a jump decays exponentially with time at rate γ_4. The factor V^2 follows a geometric Brownian motion. The factors V^3 and V^4 are pure-jump processes driven by N, and V^5 is time.

^5 To see this, consider the SDE for (V, N), which is driven by the semi-martingales W and N − ∫_0^· λ_s ds. The coefficients of this SDE are functional Lipschitz in the sense of the definition in Section V.3 of Protter (2004). Thus, Theorem V.3.7 in Protter (2004) applies.
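The explicit solutions (17)–(19) make it straightforward to evaluate the factor paths once the event times of N are in hand. The following plain-Python sketch is our own illustration (simulating N itself, e.g. by a thinning scheme as in the discretization method cited below, is omitted; the function name and signature are hypothetical):

```python
import math, random

def factor_paths(event_times, theta, V0, T, steps, seed=1):
    """Evaluate the explicit solutions (17)-(19) on a uniform grid of
    [0, T], given the event times of N as an input list.
    theta = (gamma1, ..., gamma8); V0 = (V1_0, V2_0).
    Returns tuples (t, V1_t, V2_t, V3_t, V4_t); V5_t = t is the
    first element of each tuple."""
    g1, g2, g3, g4, g5, g6, g7, g8 = theta
    rng = random.Random(seed)
    dt = T / steps
    w1 = w2 = 0.0  # the two independent Brownian motions at time t
    out = []
    for i in range(steps + 1):
        t = i * dt
        # decaying jump impact: integral of e^{-g4 (t-u)} dN_u
        jumps1 = sum(math.exp(-g4 * (t - u)) for u in event_times if u <= t)
        v1 = V0[0] * math.exp(g1 * t + g2 * w1) + g3 * jumps1      # (17)
        v2 = V0[1] * math.exp(g5 * t + g6 * w2)                    # (18)
        v3 = sum(math.exp(g7 * u) for u in event_times if u <= t)  # (19)
        v4 = sum(math.exp(g4 * u) for u in event_times if u <= t)  # (19)
        out.append((t, v1, v2, v3, v4))
        # advance the Brownian motions to the next grid point
        w1 += rng.gauss(0.0, math.sqrt(dt))
        w2 += rng.gauss(0.0, math.sqrt(dt))
    return out
```

In the degenerate case with no events and zero volatilities, the sketch reduces to the deterministic exponentials in (17)–(18), which is an easy correctness check.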
The data structure is as follows. The factors V^3, V^4, and V^5 are observed at any time. The factor V^1 is observed only at times S = {k/12 : k ∈ N}, while V^2 is a completely unobserved frailty. We have d_X = 3, d_Y = d_Z = 1, X = (V^3, V^4, V^5), Y = V^1, and Z = V^2.
We take θ_0 = (α_{1,0}, α_{2,0}, α_{3,0}, α_{4,0}, γ_{1,0}, γ_{2,0}, γ_{3,0}, γ_{4,0}, γ_{5,0}, γ_{6,0}) with α_{1,0} = 3, α_{2,0} = 3, α_{3,0} = 2, α_{4,0} = 2.5, γ_{1,0} = −0.5, γ_{2,0} = 1, γ_{3,0} = 0.2, γ_{4,0} = 5, γ_{5,0} = −0.125, and γ_{6,0} = 0.5. For simplicity, we take V_0 = (2.8, 1, 0, 0, 0), so that the starting value V_0 of the explanatory factor V is non-random. These values are roughly consistent with the fitted parameter values of a related model of bankruptcy timing in Azizpour, Giesecke and Schwenkler (2014). (We have considered alternative values, but found no significant difference in the results.) We take Θ = [0, 20]^4 × [−20, 20] × [0, 20]^3 × [−20, 20] × [0, 20] ⊂ Θ_N × Θ_V.
Time is measured in years and the horizon is τ = 40. We generate realizations of the counting process N under the model (17)–(19) parametrized
by θ0 using the discretization method described and analyzed by Giesecke
and Teng (2012). We take 200,000 time steps.
The numerical results reported below are based on an implementation
in R, running on an 8-core Opteron, 31.5 GB RAM, 10 GB swap server at
Stanford University with an Ubuntu 11.04 operating system. The codes used
in this Section can be downloaded at http://people.bu.edu/gas.
6.1. Accuracy of filter approximation. We begin by evaluating the accuracy of the filter approximation. The implementation sequence I^n of Theorem 4.1 is fixed as follows. The factor X is observed at any time, hence there is no need to discretize this component of V. On the other hand, we can discretize the state spaces of Y and Z separately since these components are P*-independent. Let S_τ = {i/12 : 1 ≤ i ≤ 12τ} and m^n_S = n². For the factor process Y we set Δ^n_Y = (max_{s∈S_τ} Y_s + √n − (min_{s∈S_τ} Y_s)/√n)/(m^n_S − 1), y^{n,j} = (min_{s∈S_τ} Y_s)/√n + (j − 1)Δ^n_Y, and A^{n,j}_Y = [y^{n,j} − Δ^n_Y, y^{n,j}] for 1 ≤ j ≤ m^n_S. This grid for Y covers the entire range of observed values of this factor. For the frailty Z we choose Δ^n_Z = (Q_Z + √n − q_Z/√n)/(m^n_S − 1) for Q_Z and q_Z the theoretical 95% and 5% quantiles, respectively, of Z_τ. Set z^{n,j} = q_Z/√n + (j − 1)Δ^n_Z and A^{n,j}_Z = [z^{n,j} − Δ^n_Z, z^{n,j}] for 1 ≤ j ≤ m^n_S. Thus, the range of Z is covered by the state space discretization with at least 90% probability. This implementation satisfies Conditions (B2), (B3), (B5), and (B6) of Theorem 4.1.
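The grid construction for Y can be sketched in plain Python as follows (our own illustration; the function name is hypothetical, and the same recipe applies to Z with the quantiles q_Z and Q_Z in place of the observed minimum and maximum):

```python
import math

def state_grid(values, n):
    """Grid construction of Section 6.1: m_S = n^2 cells of equal width
    whose union covers [min(values)/sqrt(n), max(values) + sqrt(n)].
    Returns the cells A^{n,j} = [y_j - delta, y_j]. Assumes n >= 2."""
    m_S = n * n
    lo = min(values) / math.sqrt(n)
    hi = max(values) + math.sqrt(n)
    delta = (hi - lo) / (m_S - 1)
    cells = []
    for j in range(1, m_S + 1):
        y_j = lo + (j - 1) * delta
        cells.append((y_j - delta, y_j))
    return cells

# Example: four observed values of Y and n = 3, hence m_S = 9 cells
cells = state_grid([0.8, 1.1, 2.3, 1.7], n=3)
```

By construction, the right endpoint of the last cell equals max(values) + √n, so both the fineness and the range of the grid grow with n, exactly as the convergence in Theorem 4.1 requires.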
We consider several time discretizations {T^j}_{j≥1}, including monthly (12 time steps per year), weekly (50 time steps) and daily (250 time steps) grids. Each of these grids is supplemented with the times in S_τ at which Y is
[Figure 1: relative error (y-axis) vs. n, the fineness and range of the state space discretization (x-axis), for daily, weekly, and monthly time discretizations.]

Fig 1. Relative error of the approximate (partial) log-likelihood log Ê^{I^n}(θ_0; τ, 1), averaged over 10 different realizations of D_τ generated from the model θ_0, for each of several time discretizations.
observed. The final time discretizations are given by T = S_τ ∪ {T^j}_{1≤j≤m_T}. It follows that Conditions (B1) and (B4) of Theorem 4.1 are satisfied. Since the parameter space Θ is a compact subset of R^10, our model choice guarantees that Conditions (B1) through (B4) in Theorem 4.1 are satisfied as well. Note that the value of n specifies the fineness and the range of the state-space discretization, and has no effect on the time discretization. The convergence of Theorem 4.1 is with respect to the state-space grid becoming finer and more extensive as n → ∞.
For a given T, define the interpolated factor and frailty processes by Ȳ_t = Y_{T̄(t)} and Z̄_t = Z_{T̄(t)} for T̄(t) = max{u ∈ T : u ≤ t}. For V̄ = (Ȳ, Z̄, V^3, V^4, V^5), let M̄_τ(θ) be given by

∏_{n≤N_τ} [ π*(U_n) / π(U_n, V̄_{T_n−}; θ) ] exp( − ∫_0^τ log Λ(V̄_{s−}; θ) dN_s − ∫_0^τ (1 − Λ(V̄_s; θ)) ds ).

Thus, E*[1/M̄_τ(θ) | D_τ] is the filter obtained by interpolating Y and Z. Theorem 4.1 implies that, for this T, the approximate filter Ê^{I^n}(θ; τ, 1) in Section 5 converges to E*_θ[1/M̄_τ(θ) | D_τ] as n → ∞. We examine this convergence at the parameter θ_0. For a given T and realization of the data D_τ, we compute the (partial) log-likelihood approximation log Ê^{I^n}(θ_0; τ, 1) using Algorithm 5.2. We contrast this approximation with a Monte Carlo estimate
of log E∗θ0 [1/M̄τ (θ) | Dτ ], obtained from 100,000 sample paths of Ȳ and Z̄
drawn from their conditional P∗θ0 -distributions given Dτ . Figure 1 shows the
relative error of the approximation, averaged over 10 alternative realizations
of Dτ generated from the model θ0 , as a function of n. The 95% confidence
bands of the relative error, obtained by resampling, are extremely tight, and
are therefore not indicated.
For coarse state-space discretizations (small n), the approximation tends
to understate the filter. This effect is more pronounced for finer time discretization grids because of the propagation of errors through the time grid:
In finer grids, the approximate filter is updated more frequently in Step
2(a) of Algorithm 5.2. For fine state-space discretizations (large n), the approximation tends to slightly overstate the filter, especially for coarser time
discretization grids. We have studied the relative errors associated with the
monthly and weekly grids also for n > 40. The errors converge, albeit slowly.
As discussed after Proposition 5.1, if the time discretization also becomes finer as n → ∞, then the approximate filter Ê^{I^n}(θ; τ, 1) converges to the true filter E(θ; τ, 1). Figure 1 provides numerical support for this result. The
smallest average relative error for n = 40 occurs when the finest time grid is
used. Since the Monte Carlo filter converges to E(θ0 ; τ, 1) in Pθ0 -probability
as the time grid becomes finer, the fact that the relative error decreases with
the fineness of the time grid for large n numerically supports the convergence
established by Proposition 5.1.
The approximation provides a substantial reduction of computational effort relative to Monte Carlo. The Monte Carlo estimation of the filter requires roughly 10 minutes for the monthly grid, 1 hour for the weekly grid,
and almost 5 hours for the daily time grid. The corresponding computation
times of Algorithm 5.2, averaged across the values of n we have considered,
are 1.3 minutes, 9 minutes and 56 minutes.
6.2. Accuracy of parameter estimates. Next we examine the accuracy of our parameter estimates. For each of 10 alternative realizations of the data D_τ generated from the model θ_0, we maximize the approximate log-likelihood function log Ê^{I^n}(θ; τ, 1) + log L*_F(θ; τ), taking a weekly time discretization and using the Nelder-Mead algorithm in R.^6 The optimization is initialized
^6 Since Y has the explicit solution (17), the corresponding log-likelihood is given by

log L*_F(θ; τ) = − Σ_{i=1}^{|T|} [ log( (Y_{s_i} − γ_3 ∫_0^{s_i} e^{−γ_4 (s_i−u)} dN_u) / (Y_{s_{i−1}} − γ_3 ∫_0^{s_{i−1}} e^{−γ_4 (s_{i−1}−u)} dN_u) ) − γ_1 (s_i − s_{i−1}) ]² / ( 2 γ_2² (s_i − s_{i−1}) ) − Σ_{i=1}^{|T|} log( (Y_{s_i} − γ_3 ∫_0^{s_i} e^{−γ_4 (s_i−u)} dN_u) γ_2 √(s_i − s_{i−1}) ).
[Figure 2: absolute error vs. n for each parameter — intensity λ (factor sensitivity, frailty sensitivity, event sensitivity, event decay), factor Y (mean rate, volatility, event sensitivity, event decay), and frailty Z (mean rate, volatility); legend: Filtered MLE, EM Alg., Initial pars.]

Fig 2. Absolute error of a parameter estimate, averaged over the errors associated with 10 alternative realizations of D_τ generated from the model θ_0, as a function of n, the fineness and range of the state space discretization (solid line). Also shown are the parameter estimates obtained using the stochastic EM Algorithm (dashed line). The dotted line shows the average absolute error of the initial parameter values.
at the value θ_0(1 + ε), where ε is drawn from a standard normal distribution. If that value lies outside of Θ, we take its orthogonal projection onto the boundary of Θ. Below we examine an alternative, uniform sampling rule for the initial value.
Proposition 4.3 guarantees the consistency of the estimator. Hence we can
expect the estimates to be close to θ0 . Therefore, we measure the accuracy
of the estimates in terms of the absolute error with respect to θ0 . Figure 2
shows the absolute error of each estimate, averaged over the errors associated
with 10 alternative realizations of Dτ , as a function of n parametrizing the
fineness and range of the state space discretization. The accuracy of the
estimate tends to increase with n, although the convergence of the error
may be slow. All fitted values lie in the interior of the parameter space.
To provide some perspective, we also consider the parameter estimates
generated by the stochastic EM Algorithm of Celeux and Diebolt (1992).
In the E-step we sample 10,000 paths of Y and Z based on their conditional law given the data D_τ. To facilitate a comparison with our estimates,
we sample these paths on a weekly time grid. In the M-step, the complete
data partial likelihood functions of V and N are maximized separately. The
optimization using the Nelder-Mead algorithm is difficult; the convergence
is extremely slow and we often obtain boundary optima. Since the gradient and the Hessian matrix of the complete data likelihood are available
analytically, we implement the optimization using the BFGS quasi-Newton
method in R. The EM Algorithm is initialized at the same parameter as our
scheme. We stop it when the M-step increases the likelihood by less than 0.5;
a criterion consistent with, e.g., Chan and Ledolter (1995). The resulting absolute errors of the parameter estimates are shown in Figure 2 by the dashed
lines (note that the value of n is irrelevant). We have considered alternative
implementations, including performing a larger number of simulation trials
and using a tighter stopping criterion, but found no significant difference in
the results.
Our estimates of the factor sensitivity α1 , frailty sensitivity α2 and decay rate α4 of the intensity, and the volatility γ2 , event sensitivity γ3 , and
decay rate γ4 of the factor X, are more accurate than those of the EM Algorithm. Our estimate of the event sensitivity α3 of the intensity is roughly
as accurate as the EM estimate, especially for larger n. At first sight, the
EM Algorithm seems to deliver more accurate estimates of the mean rate
γ1 of the factor, and the mean rate γ5 and volatility γ6 of the frailty. A
closer look at how the EM Algorithm estimates these parameters leads to a
different conclusion. Since the EM Algorithm maximizes the complete data
partial likelihoods of V and N separately, the mean and the volatility of the
factor Y and the frailty Z are estimated by matching their first two sample
moments. Because of this, the estimation of these parameters is especially
susceptible to the initial parameters. This is indicated by the location of
the dotted line in Figure 2, which represents the average absolute error of
the initial parameter with respect to θ_0. The EM estimates of γ_1, γ_5 and γ_6 hardly deviate from their respective initial values, consistent with the fact that the EM estimator is guaranteed to converge only to a local optimum of the likelihood (see Wu (1983) and Dembo and Zeitouni (1986)). This may
be an issue in practice when θ0 is unknown. We have investigated this further
by analyzing the estimates obtained from initial parameter values sampled
from a uniform distribution on the parameter space Θ. Unlike our method,
the EM Algorithm was found to converge extremely slowly when the initial parameters were very different from θ0 , often resulting in inappropriate
estimates.
Another issue with the separate maximization of the complete data partial likelihood functions of N and V in the EM Algorithm is that the interaction
of the events and the factor is ignored. As a result, the EM Algorithm faces
difficulties in differentiating between diffusive movements and jumps of Y .
It tends to converge to boundary solutions that imply an event sensitivity
γ3 of Y equal or close to zero, a decay rate γ4 equal or close to 20, and
too high estimates of the volatility γ2 . Boundary optima are an issue for
asymptotic results, which apply only to optima in the interior of the parameter space. Moreover, in order to compensate for understating γ3 , the EM
Algorithm tends to overstate the role of the factor Y in the intensity. This
interpretation is consistent with the large errors of the EM estimates for the
factor sensitivity and event decay of the intensity, and the event sensitivity
and decay of the factor. In these cases, the errors are much larger than the
errors of the initial values.
In order to shed more light on this issue, we analyze how the two methods
address the filtering problem. We compare the fitted filtered factor Eθ [Yt | Dt ]
to the corresponding monthly observations, and the fitted filtered intensity
ht = Eθ [λt | Dt ] to the actual number of events per year. Here, Dt represents
the data observed during [0, t]. For a given realization of the data, Figure 3
shows the approximations to these filters, given by Ê^{I^n}(θ; t, g)/Ê^{I^n}(θ; t, 1)
with g(v, t) = (1, 0, 0, 0, 0) · v and g(v, t) = Λ(v; θ), respectively, evaluated
at the corresponding estimator θ = θ̂τ . For that realization, Figure 3 also
shows Monte Carlo approximations to these filters, obtained using 10,000
trials and evaluated at the EM estimator of θ. The confidence bands were
calculated using subsampling with group size 100.
As expected, in the case of our method the chosen value of n has a large
impact on the accuracy of the filter. In accordance with Theorem 4.1, the
larger n, the finer and more extensive the state space grid, and the closer
the filters track the observed values of the factor and the event counts. If
the state space discretization is too coarse, then the approximation to the
filtered factor fluctuates between few values, and the filtered intensity h
understates the actual number of events. As n increases, the filtered paths
track the observations better. The Monte Carlo approximation to the factor
filter, which is based on EM parameter estimates, tracks the factor well, with
occasionally wide confidence bands. Nevertheless, the corresponding Monte
Carlo approximation to the filtered intensity h severely understates the actual event counts.7 We have found similar issues for all other realizations of
the data Dτ that we have considered.
[Figure 3: panels (a) Factor Y and (b) Intensity, plotted over time in years; filtered paths from Algorithm 4.5 with n = 3, 5, 10, and 20, and a simulated filtered path with 95% confidence bands, against the actual path of the factor and the actual number of events.]

Fig 3. Fitted filtered factor and intensity vs. actual factor and event counts.

^7 The confidence bands of the simulated filtered intensity are very tight, and we do not display them for the sake of clarity.

We examine the implications of the estimation errors for model fit.^7 Proposition D.1 of Azizpour, Giesecke and Schwenkler (2014) implies that, after a change of time given by the P-compensator ∫_0^t h_s ds relative to the right-continuous and complete filtration generated by σ(D_t), the counting process
N is a standard P-Poisson process in the time-changed filtration. We test
whether the fitted filtered intensity h time-scales the realized N correctly.
Figure 4 provides a QQ plot of the time-changed inter-arrival times for each
of several fitting methods (95% confidence bands of the order statistics of
the exponential distribution are included). We see that, as n increases, the
model estimated using our scheme fits the event data better. Already with
n = 20 we obtain an acceptable fit.

[Figure 4: QQ plot comparing the filtered MLE with Algorithm 4.5 (n = 3, 10, 20) and the EM Algorithm against the diagonal line, with 95% confidence bands.]

Fig 4. Goodness-of-fit tests: QQ plots of the quantiles of the time-changed inter-arrival times vs. theoretical standard exponential quantiles.

The model estimated using the EM Algorithm, for which the filtered intensity h is approximated by Monte Carlo
simulation, fails the test. This is a consequence of the inaccuracy of the EM
estimates of some of the model parameters.
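The time-change test can be sketched as follows: integrate an approximation of the fitted filtered intensity h between consecutive event times; under a correctly specified model the resulting increments are i.i.d. standard exponential, which the QQ plot checks. A plain-Python illustration (our own sketch; the piecewise-constant representation of h is an assumption of the sketch):

```python
def time_changed_gaps(event_times, grid, h):
    """Integrated intensity between consecutive events.
    `grid` gives breakpoints and `h` the values of a piecewise-constant
    approximation of the fitted filtered intensity h_t; under a correct
    model the returned increments are i.i.d. standard exponential."""
    def H(t):  # cumulative intensity up to time t
        total = 0.0
        for a, b, v in zip(grid, grid[1:], h):
            if a >= t:
                break
            total += v * (min(b, t) - a)
        return total
    times = [0.0] + list(event_times)
    return [H(b) - H(a) for a, b in zip(times, times[1:])]

def qq_points(gaps):
    """Pairs (empirical quantile, standard exponential quantile)
    for a QQ plot of the time-changed inter-arrival times."""
    from math import log
    s = sorted(gaps)
    n = len(s)
    return [(s[i], -log(1.0 - (i + 0.5) / n)) for i in range(n)]
```

Plotting the pairs returned by `qq_points` against the diagonal, with confidence bands for the exponential order statistics, reproduces the structure of the test reported in Figure 4.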
6.3. Asymptotic properties of parameter estimates. Finally we examine the behavior of the estimators as the sample horizon τ increases. In the setting of the previous analyses, we calculate the approximate asymptotic variance-covariance matrix Σ̂_0 of our estimators according to (25)–(27) at the true parameter vector θ_0. In order to satisfy the conditions of Proposition 4.3, for the implementation we let n vary with τ as n = n(τ) = √τ, and set m^n_S = (3 + τ/10)². Otherwise, the state space grid is constructed as described above. We continue to use a weekly time discretization.
Figure 5 shows the asymptotic standard errors for each parameter (solid
lines), for a given realization of Dτ . As τ increases and more data is included
in the estimation, the errors decrease, albeit slowly in some cases. The error
behaves similarly for the other realizations of Dτ we have considered (the
results for these are available upon request).
For comparison, Figure 5 also indicates the asymptotic standard errors of
the EM estimates (dashed lines). These are calculated using the following
approximation of the theoretical maximum likelihood asymptotic variance-covariance matrix:

(ζ^{EM}/τ) [∇²Q(θ, θ_0)|_{θ=θ_0}]^{−1} ∇Q(θ, θ_0)|_{θ=θ_0} ∇Q(θ, θ_0)|^⊤_{θ=θ_0} [∇²Q(θ, θ_0)|_{θ=θ_0}]^{−1},

where Q(θ, θ_0) is the complete data likelihood evaluated at θ through simulations with respect to the conditional P*_{θ_0}-distribution given D_τ, and ζ^{EM} is an approximation of the fraction of missing data as in Nielsen (2000).

[Figure 5: asymptotic standard errors vs. τ for each parameter of the intensity λ, the factor Y, and the frailty Z; legend: Filtered MLE with Algorithm 4.5, EM algorithm.]

Fig 5. Asymptotic standard errors of our estimates and EM estimates plotted against τ, the end of the observation period. The y-axes are in log-scale.

We
find that the standard errors of the EM estimates of the factor sensitivity
α1 , the event sensitivity α3 , and the event decay α4 of the intensity, as well
as of the factor mean rate γ1 and volatility γ2 , exceed those of our estimates.
On the other hand, the standard errors of the EM estimates for the frailty
parameters γ5 and γ6 are smaller than those of our estimates. Both methods
achieve similar standard errors for the frailty sensitivity α_2 of the intensity
and the event sensitivity γ3 and decay γ4 of the factor Y .
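Given simulated gradients and Hessians of Q, the sandwich form above is a direct matrix computation. The following plain-Python sketch for p = 2 parameters is our own illustration (the function and its inputs are hypothetical, not taken from the paper's R code):

```python
def sandwich_2x2(grad, hess, zeta, tau):
    """(zeta/tau) * H^{-1} (g g^T) H^{-1} for p = 2 parameters,
    with H = hess (2x2 nested lists) and g = grad (length-2 list).
    This mirrors the EM variance approximation in the text."""
    (a, b), (c, d) = hess
    det = a * d - b * c  # assumes H is invertible
    Hinv = [[d / det, -b / det], [-c / det, a / det]]

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    outer = [[grad[i] * grad[j] for j in range(2)] for i in range(2)]
    M = matmul(matmul(Hinv, outer), Hinv)
    return [[zeta / tau * M[i][j] for j in range(2)] for i in range(2)]
```

With H equal to the identity, ζ^{EM} = τ = 1, and gradient (1, 2), the result is the outer product of the gradient, a convenient check of the algebra.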
In many cases, the loss in efficiency due to simulation in the stochastic EM
Algorithm exceeds the loss of efficiency due to the approximate filtering in
our method. The loss of efficiency due to filtering, indicated in Proposition
A.2, appears to be fairly small in the cases considered. This renders our
estimates highly efficient.
APPENDIX A: EXACT LIKELIHOOD
This appendix complements Section 3. It provides a rigorous definition
of the exact likelihood Lτ (θ) of the observed data Dτ and the conditional
likelihood L∗F (θ; τ ) of the factor data. It also derives conditions that ensure
consistency and asymptotic normality of an exact maximum likelihood estimator θ̂τ , and introduces notation that will be used throughout the proofs
given in Appendix C.
A.1. Likelihood. Let R_{D,τ} denote the range of the data D_τ and 𝒟_τ the Borel σ-algebra of R_{D,τ}. We have R_{D,τ} = D([0, τ]; N) × D([0, τ]; R_U) × D([0, τ]; R^{d_X}) × R_{Y,τ}, where D(A; B) is the Skorokhod space of càdlàg functions mapping the set A onto B, and R_{Y,τ} = {(y_{i,j}) : y_{i,j} ∈ R for i = 1, . . . , d_Y, j = 1, . . . , |S^i ∩ [0, τ]|} is the range of the data Y_τ of the partially observed factor elements. Write D_{θ,τ} for the P_θ-law of the data D_τ
on (R_{D,τ}, 𝒟_τ). The likelihood of the data is given by the Radon-Nikodym derivative

L_τ(θ) = dD_{θ,τ} / dD_{θ_0,τ},

evaluated at the observed data D_τ.
We now give a precise definition of the conditional likelihood L*_F(θ; τ) appearing in Theorem 3.1. Let D*_{θ,τ} denote the P*_θ-law of the data D_τ on (R_{D,τ}, 𝒟_τ). Write D*_{E,θ,τ} for the P*_θ-law of the event data (N_τ, U_τ) on (D([0, τ]; N) × D([0, τ]; R_U), σ(D([0, τ]; N) × D([0, τ]; R_U))), and D*_{F,θ,τ} for the conditional P*_θ-law of the factor data (X_τ, Y_τ) given the event data (N_τ, U_τ) on (D([0, τ]; R^{d_X}) × R_{Y,τ}, σ(D([0, τ]; R^{d_X}) × R_{Y,τ})). The conditional likelihood L*_F(θ; τ) of the factor data (X_τ, Y_τ) given the event data (N_τ, U_τ) under P* is the Radon-Nikodym derivative

(20)    L*_F(θ; τ) = dD*_{F,θ,τ} / dD*_{F,θ_0,τ},

evaluated at the observed data.
A.2. Asymptotic properties of exact likelihood estimators. The
representation of the likelihood obtained in Theorem 3.1 facilitates the analysis of the asymptotic properties of an exact MLE θ̂τ as τ → ∞.
Proposition A.1 (Consistency). Suppose Assumptions (A1)–(A2) hold. In addition, suppose that:

(A3) For any θ ∈ Θ, it holds P_θ-almost surely that L*_F(θ; τ) > 0.

(A4) The true parameter θ_0 is the unique maximizer of the asymptotic likelihood, i.e.,

arg sup_{θ∈Θ} lim_{τ→∞} (1/τ) L_τ(θ) = {θ_0}.

Then every θ̂_τ solving (8) is consistent: θ̂_τ → θ_0 as τ → ∞ in P_{θ_0}-probability.
Consistency follows from Lemma 4.1 in Section 1.4.3 of Ibragimov and
Has’minskii (1981) after establishing that the filtering in (5) guarantees the
smoothness of the likelihood function.
Condition (A4) is a standard identifiability condition. Condition (A3) is
mild and implies both the equivalence of the laws of the data Dτ under
Pθ and Pθ̃ for any θ, θ̃ ∈ Θ, as well as the stochastic equicontinuity of the
likelihood function. It is a restriction on the parameter space Θ, since it
requires that the likelihood of the observed factor data be strictly positive
at any parameter θ ∈ Θ. Thus, consistency is guaranteed as long as the
parameter space is chosen adequately.
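As a purely illustrative numerical companion to the consistency statement (not part of the formal argument), consider the simplest possible special case: a homogeneous Poisson process with constant intensity λ0 and completely observed data, for which the MLE of the rate over [0, τ ] is Nτ /τ and the estimation error shrinks as the horizon grows. A minimal simulation sketch, assuming only numpy; the model and all numerical values are toy choices, not the paper's marked point process model:

```python
import numpy as np

# Stylized consistency check: for a homogeneous Poisson process with
# constant intensity lambda0 and complete data, the MLE of the rate
# over the horizon [0, tau] is N_tau / tau.
rng = np.random.default_rng(0)
lambda0 = 1.5

def rate_mle(tau):
    # N_tau ~ Poisson(lambda0 * tau) is the number of events in [0, tau].
    return rng.poisson(lambda0 * tau) / tau

# Absolute estimation errors for increasing horizons.
errors = {tau: abs(rate_mle(tau) - lambda0) for tau in (10, 100, 10000)}
```

For long horizons the error is small with overwhelming probability, in line with θ̂τ → θ0.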
We now consider asymptotic normality. We make some assumptions to
simplify the exposition. Suppose that S i = S j and 0 ∈ S j for all 1 ≤ i, j ≤
dY . Hence, all partially observed factors are observed on the same dates,
including at time 0. Suppose Sτ = S ∩[0, τ ] takes the form {s0 , s1 , . . . , sm(τ ) }
for 0 = s0 < · · · < sm(τ ) = τ . Suppose the limit ζ = limτ →∞ m(τ )/τ exists
and is strictly positive. Define the random functions
(21) θ ∈ Θ ↦ L∗E,i (θ) = L∗E (θ; si )/L∗E (θ; si−1 ),
(22) θ ∈ Θ ↦ L∗F,i (θ) = L∗F (θ; si )/L∗F (θ; si−1 )
for i = 1, . . . , m(τ ), as well as
θ ∈ Θ ↦ L∗E,0 (θ) = 1,
θ ∈ Θ ↦ L∗F,0 (θ) = L∗F (θ; 0).
It follows that L∗E (θ; τ ) = ∏_{i=0}^{m(τ)} L∗E,i (θ) and L∗F (θ; τ ) = ∏_{i=0}^{m(τ)} L∗F,i (θ).
Write θ = (θ1 , . . . , θp ) for some p ∈ N, ∂v = ∂/∂θv , ∂v,w = ∂²/(∂θv ∂θw ), and ∂u,v,w = ∂³/(∂θu ∂θv ∂θw ) for any 1 ≤ u, v, w ≤ p, and let ∇, ∇² denote the gradient and the Hessian matrix, respectively, with respect to θ.
Proposition A.2 (Asymptotic normality). In addition to (A1)-(A4), suppose that the following conditions hold:
(A5) The functions L∗E,i (θ) and L∗F,i (θ) for i = 0, . . . , m(τ ) are three times continuously differentiable at θ, Pθ -almost surely, for any θ ∈ Θ◦ .
FILTERED LIKELIHOOD FOR POINT PROCESSES
29
(A6) For any θ ∈ Θ◦ , 1 ≤ u, v ≤ p, and 0 ≤ i ≤ m(τ ), it holds Pθ -a.s. that
0 = Eθ [ ∂u log L∗E,i (θ) | Dsi−1 ] = Eθ [ ∂u log L∗F,i (θ) | Dsi−1 ],
0 = Eθ [ ∂u,v L∗E,i (θ)/L∗E,i (θ) | Dsi−1 ] = Eθ [ ∂u,v L∗F,i (θ)/L∗F,i (θ) | Dsi−1 ].
(A7) There exist positive constants K1 , K2 , K3 , and K4 such that for all 1 ≤ u, v ≤ p and 0 ≤ i ≤ m(τ ):
Eθ [ (∂v log L∗E,i (θ))² ] ≤ K1 ,   Eθ [ (∂u,v L∗E,i (θ)/L∗E,i (θ))² ] ≤ K2 ,
Eθ [ (∂v log L∗F,i (θ))² ] ≤ K3 ,   Eθ [ (∂u,v L∗F,i (θ)/L∗F,i (θ))² ] ≤ K4 .
(A8) For any θ ∈ Θ◦ , the following limits of matrices exist in Pθ -probability and are deterministic:
ΣE (θ) = lim_{m→∞} (1/m) ∑_{i=0}^{m} ∇ log L∗E,i (θ)⊤ ∇ log L∗E,i (θ) ∈ R^{p×p} ,
ΣF (θ) = lim_{m→∞} (1/m) ∑_{i=0}^{m} ∇ log L∗F,i (θ)⊤ ∇ log L∗F,i (θ) ∈ R^{p×p} ,
ΣEF (θ) = lim_{m→∞} (1/m) ∑_{i=0}^{m} ∇ log L∗E,i (θ)⊤ ∇ log L∗F,i (θ) ∈ R^{p×p} .
(A9) For any θ ∈ Θ◦ , the matrix ΣE (θ) + ΣF (θ) is non-singular.
(A10) For any θ ∈ Θ◦ , there exists a neighborhood U (θ) of θ and finite constants GE = GE (θ), GF = GF (θ), such that for any θ̃ ∈ U (θ) and 1 ≤ u, v, w ≤ p:
lim_{m→∞} Pθ [ |(1/m) ∑_{i=0}^{m} ∂u,v,w log L∗E,i (θ̃)| < GE ] = 1,
lim_{m→∞} Pθ [ |(1/m) ∑_{i=0}^{m} ∂u,v,w log L∗F,i (θ̃)| < GF ] = 1.
Then every θ̂τ solving (8) is asymptotically normal: √τ (θ̂τ − θ0 ) → N (0, Σ0 ) as τ → ∞ in Pθ0 -distribution. Letting Σ(θ0 ) = ΣE (θ0 ) + ΣF (θ0 ), the asymptotic variance-covariance matrix is given by
Σ0 = (1/ζ) Σ(θ0 )⁻¹ + (2/ζ) Σ(θ0 )⁻¹ ΣEF (θ0 ) Σ(θ0 )⁻¹ .
Proposition A.2 is a martingale CLT for the score function. Because of the incompleteness of the factor data, the analysis is not completely standard: in our setting, the likelihood ratios are locally asymptotically quadratic but not locally asymptotically mixed normal (see Definitions 1 and 3 of Jeganathan (1995)). Thus, the conventional martingale CLTs of Feigin (1976), Jeganathan (1995), van der Vaart (2000), and others cannot be directly applied. Our asymptotic analysis uses an argument similar to the one behind Proposition 6.1 of Hall and Heyde (1980), after controlling for the incompleteness of the factor data.
Condition (A5) is standard. Condition (A7) is also standard; it imposes
regularity on the intensity (1) and the Pθ -law of the factor V . Conditions
(A8) and (A9) are necessary to obtain well-defined covariance matrices. Condition (A10) guarantees that a Taylor expansion of the score function is sufficiently accurate. Condition (A6) is an identifiability condition. It implies
that θ0 maximizes the asymptotic likelihood.
Proposition A.2 highlights the influence of the partially observed data on
the asymptotic variance-covariance matrix Σ0 : it is inversely proportional to
ζ, the average number of observations of Y per unit of time. We compare with
the results of Ogata (1978), who studied likelihood estimators for stationary
and ergodic point processes with complete factor data. Consider the Fisher
information matrix I(θ0 ) for the estimation of θ0 = (α0 , β0 , γ0 ), given in our
setting by
(23) I(θ0 ) = ζ (ΣE (θ0 ) + ΣF (θ0 )) .
Under the assumptions of stationarity and ergodicity as well as complete factor data, Ogata (1978) shows that Σ0 = I(θ0 )⁻¹. We relax these assumptions and allow for incomplete factor data. We find that
(24) Σ0 = I(θ0 )⁻¹ + 2ζ I(θ0 )⁻¹ ΣEF (θ0 ) I(θ0 )⁻¹ .
The matrix ΣEF (θ0 ) corrects for the filtering of the unobserved path of the factors. If V is observed perfectly, then one can show that ΣEF (θ0 ) = 0, recovering the result of Ogata (1978).
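The equivalence between (24) and the expression for Σ0 in Proposition A.2 follows algebraically from I(θ0 ) = ζ Σ(θ0 ). A quick numerical sanity check of this algebra, with arbitrary toy matrices (illustrative values only, not estimates from any model):

```python
import numpy as np

# Check that equation (24) agrees with the form of Sigma_0 in
# Proposition A.2 when I(theta_0) = zeta * (Sigma_E + Sigma_F).
# All matrices below are toy values.
zeta = 0.5
Sigma_E = np.array([[2.0, 0.3], [0.3, 1.5]])
Sigma_F = np.array([[1.0, 0.1], [0.1, 0.8]])
Sigma_EF = np.array([[0.2, 0.05], [0.05, 0.1]])

Sigma = Sigma_E + Sigma_F
I0 = zeta * Sigma                      # Fisher information, equation (23)
Si = np.linalg.inv(Sigma)
I0i = np.linalg.inv(I0)

# Proposition A.2 form and equation (24) form of Sigma_0.
Sigma0_prop = Si / zeta + (2.0 / zeta) * Si @ Sigma_EF @ Si
Sigma0_eq24 = I0i + 2.0 * zeta * I0i @ Sigma_EF @ I0i
```

The two expressions coincide because I(θ0 )⁻¹ = (1/ζ) Σ(θ0 )⁻¹.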
We end this section with alternative and more traditional representations
of the matrices ΣE (θ0 ), ΣF (θ0 ) and ΣEF (θ0 ).
Corollary A.3. Suppose the assumptions of Proposition A.2 hold. For
any θ ∈ Θ◦ , it holds Pθ -almost surely that
ζΣE (θ) = −lim_{τ→∞} (1/τ ) ∇² log L∗E (θ; τ ) = lim_{τ→∞} (1/τ ) ∇ log L∗E (θ; τ )⊤ ∇ log L∗E (θ; τ ),
ζΣF (θ) = −lim_{τ→∞} (1/τ ) ∇² log L∗F (θ; τ ) = lim_{τ→∞} (1/τ ) ∇ log L∗F (θ; τ )⊤ ∇ log L∗F (θ; τ ),
ζΣEF (θ) = lim_{τ→∞} (1/τ ) ∇ log L∗E (θ; τ )⊤ ∇ log L∗F (θ; τ ).
APPENDIX B: ASYMPTOTIC VARIANCE-COVARIANCE MATRIX
Proposition 4.3 states that the approximate MLE has the same asymptotic variance-covariance matrix Σ0 as the exact MLE analyzed in Appendix A. Thus, there is no loss of efficiency when maximizing the approximate likelihood rather than the exact likelihood. From Proposition A.2 we know that Σ0 = (1/ζ) Σ(θ0 )⁻¹ + (2/ζ) Σ(θ0 )⁻¹ ΣEF (θ0 ) Σ(θ0 )⁻¹ for Σ(θ0 ) = ΣE (θ0 ) + ΣF (θ0 ).
For finite τ , we propose to approximate Σ0 as follows. We approximate ζ by ζ̂ = m(τ )/τ . For given n, we approximate ΣE (θ0 ), ΣF (θ0 ), and ΣEF (θ0 ) by

(25) Σ̂E = (1/m(τ )) ∑_{i=1}^{m(τ)} ∇ log ( E^{In}(θ̂τn ; si , 1) / E^{In}(θ̂τn ; si−1 , 1) )⊤ ∇ log ( E^{In}(θ̂τn ; si , 1) / E^{In}(θ̂τn ; si−1 , 1) ),

(26) Σ̂F = (1/m(τ )) ∑_{i=0}^{m(τ)} ∇ log L∗F,i (θ̂τn )⊤ ∇ log L∗F,i (θ̂τn ),

(27) Σ̂EF = (1/m(τ )) ∑_{i=1}^{m(τ)} ∇ log ( E^{In}(θ̂τn ; si , 1) / E^{In}(θ̂τn ; si−1 , 1) )⊤ ∇ log L∗F,i (θ̂τn ),

respectively. The implementation In is chosen so as to achieve some given degree of accuracy in accordance with Theorem 4.1. Finally, we set Σ̂ = Σ̂E + Σ̂F and approximate Σ0 by Σ̂0 = (1/ζ̂ ) Σ̂⁻¹ + (2/ζ̂ ) Σ̂⁻¹ Σ̂EF Σ̂⁻¹ .
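The mechanics of this plug-in construction can be sketched in a few lines, assuming the per-interval gradient terms appearing in (25)-(27) have already been evaluated and stacked row-wise into arrays. The arrays below are synthetic placeholders, not output of an actual implementation In; the sketch only shows how Σ̂E , Σ̂F , Σ̂EF , and Σ̂0 are assembled:

```python
import numpy as np

# Sketch of the plug-in estimator Sigma_hat_0. Rows of g_E and g_F stand
# in for the per-interval gradients in (25)-(27); here they are synthetic
# placeholders rather than output of a real implementation I^n.
rng = np.random.default_rng(1)
m, p = 200, 3                     # m(tau) observation dates, p parameters
tau = 400.0
g_E = rng.normal(size=(m, p))     # placeholder for the gradient terms in (25)
g_F = rng.normal(size=(m, p))     # placeholder for grad log L*_{F,i}

Sigma_hat_E = g_E.T @ g_E / m     # sample analogue of (25)
Sigma_hat_F = g_F.T @ g_F / m     # sample analogue of (26)
Sigma_hat_EF = g_E.T @ g_F / m    # sample analogue of (27)

zeta_hat = m / tau                # estimate of zeta = lim m(tau)/tau
S_inv = np.linalg.inv(Sigma_hat_E + Sigma_hat_F)
Sigma_hat_0 = S_inv / zeta_hat + (2.0 / zeta_hat) * S_inv @ Sigma_hat_EF @ S_inv
```

By construction Σ̂E and Σ̂F are symmetric positive semidefinite, while Σ̂EF need not be symmetric in finite samples.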
APPENDIX C: PROOFS
Proof of Theorem 3.1. The measure dP∗θ = Mτ (θ)dPθ is well-defined
by Condition (A1) and Theorem VIII.T10 of Brémaud (1980). We have
Dθ,τ (A) = Eθ [1A ] = E∗θ [ 1A E∗θ [1/Mτ (θ) | Dτ ] ] = ∫_A E∗θ [1/Mτ (θ) | Dτ ] dD∗θ,τ
for A ∈ Dτ . Consequently, the law Dθ,τ is absolutely continuous with respect
to D∗θ,τ with Radon-Nikodym derivative
(28) dDθ,τ /dD∗θ,τ = E∗θ [1/Mτ (θ) | Dτ ] .
Bayes' formula implies that the P∗θ -law of the data is equal to the product of the P∗θ -law of the event data (Nτ , Uτ ) and the conditional P∗θ -law of the factor data (Xτ , Yτ ) given (Nτ , Uτ ):
(29) D∗θ,τ = D∗F,θ,τ × D∗E,θ,τ .
Now, the P∗θ -law of the event data (Nτ , Uτ ) is parameter-independent by construction, since the counting process N has unit intensity and the event marks have parameter-independent density π ∗ under P∗θ . Hence,
(30) dD∗E,θ,τ /dD∗E,θ0 ,τ ∝ 1.
In addition,
(31) dD∗F,θ,τ /dD∗F,θ0 ,τ = L∗F (θ; τ )
by Assumption (A2) and the definition of the conditional likelihood L∗F (θ; τ ) in (20). Because D∗θ,τ is a product law as in (29), Fubini's theorem together with (30)-(31) implies that
dD∗θ,τ /dD∗θ0 ,τ = d(D∗F,θ,τ × D∗E,θ,τ )/d(D∗F,θ0 ,τ × D∗E,θ0 ,τ ) = (dD∗F,θ,τ /dD∗F,θ0 ,τ )(dD∗E,θ,τ /dD∗E,θ0 ,τ ) ∝ L∗F (θ; τ ).
We conclude that the likelihood satisfies
(32) Lτ (θ) = dDθ,τ /dDθ0 ,τ = (dDθ,τ /dD∗θ,τ )(dD∗θ,τ /dD∗θ0 ,τ )(dD∗θ0 ,τ /dDθ0 ,τ ) ∝ E∗θ [1/Mτ (θ) | Dτ ] L∗F (θ; τ ).
This completes the proof.
Proof of Theorem 4.1. For simplicity, we only consider the case that
g ≡ 1. Since Θ is compact by Assumption (B3), the reasoning below remains
valid whenever the measurable function g is either continuous or bounded.
Define C1τ = maxθ∈Θ E∗θ [1/Mτ (θ) | Dτ ] and C2τ = maxθ∈Θ E∗θ [1/Mτ (θ)² | Dτ ], which are finite by Assumptions (B3) and (B4). Write
∆τ (θ) = |E^{In}(θ; τ, 1) − E(θ; τ, 1)|
for the absolute error generated by the approximation at the parameter θ.
By Jensen’s inequality, we can bound ∆τ (θ) from above by
1 ∗ 1{∃t∈[0,τ ] V ∈A
n } Dτ
Eθ t/
Mτ (θ)
n
(33)
+
mS
X
i1 =1
n
...
mS
X
imn =1
T
E∗θ
In
F (i1 , . . . , imn ) −
θ
T
1 Ii ,i ,...,i n
Mτ (θ) 0 1 mT
Dτ
FILTERED LIKELIHOOD FOR POINT PROCESSES
33
for Ii0 ,i1 ,...,imn = 1{V0 ∈An,i0 ,∀j=1,...,mn , ∀t∈(T n,j−1 ,T n,j ]: Vt ∈An,ij } . The second
T
T
summand in the right-hand side of (33) can be bounded above by
C1τ (exp(ψτ^{n,1}) − 1) P∗θ [∀t ∈ [0, τ ]: Vt ∈ A^n | Dτ ] ≤ C1τ (exp(ψτ^{n,1}) − 1),
where
ψτ^{n,1} = max_{1≤j≤m^n_S} vol(A^{n,j}) [ τ max_{θ∈Θ} max_{1≤j≤m^n_S} |∂v Λ(vj ; θ)| + Nτ max_{θ∈Θ} max_{1≤k≤Nτ} max_{1≤j≤m^n_S} |∂v log( Λ(vj ; θ)π(Uk , vj ; θ)/π ∗ (Uk ) )| ]
as in Assumption (B6). This follows from a Taylor expansion around the points v_{ij} and Jensen's inequality, together with Assumptions (B1)-(B4). For the first term on the right-hand side of (33), we use Hölder's inequality with Assumptions (B4) and (B5) to obtain the bound √C2τ √ψτ^{n,2} for
ψτ^{n,2} = sup_{θ∈Θ} P∗θ [∃t ∈ [0, τ ]: Vt ∉ A^n | Dτ ].
Thus, ∆τ (θ) = |E^{In}(θ; τ, 1) − E(θ; τ, 1)| is bounded above by
(34) (exp(ψτ^{n,1}) − 1) C1τ + √(C2τ ψτ^{n,2}).
Note that (34) does not depend on θ. As a result, the error bound in (34) holds uniformly over the parameter space Θ. Assumptions (B3), (B4), (B5), and (B6) then lead to the claim for n → ∞.
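The behavior of the uniform bound (34) under refinement can be illustrated numerically: as the discretization is refined, ψτ^{n,1} and ψτ^{n,2} shrink and the bound collapses to zero. A toy evaluation, with illustrative constants and a hypothetical refinement schedule ψτ^{n,1} = ψτ^{n,2} = 1/n:

```python
import math

# Evaluate the uniform error bound (34),
#   (exp(psi1) - 1) * C1 + sqrt(C2 * psi2),
# along a toy refinement schedule psi1 = psi2 = 1/n.
# C1 and C2 are illustrative stand-ins for C1_tau and C2_tau.
C1, C2 = 2.0, 3.0

def bound(psi1, psi2):
    return (math.exp(psi1) - 1.0) * C1 + math.sqrt(C2 * psi2)

bounds = [bound(1.0 / n, 1.0 / n) for n in (1, 10, 100, 1000)]
```

The bound decreases monotonically along the schedule and tends to zero, which is the content of the final step of the proof.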
Proof of Proposition 4.2. Take ε > 0. By Theorem 4.1 there exists some n0 ∈ N, depending only on τ , such that
|E^{In}(θ; τ, 1) − E(θ; τ, 1)| ≤ ε
for all n ≥ n0 and all θ ∈ Θ. Since maxθ∈Θ L∗F (θ; τ ) < ∞, P-almost surely, it follows that
|Lτ (θ) − Lnτ (θ)| ≤ ε maxθ∈Θ L∗F (θ; τ ) < ∞,
P-almost surely. Define ε̃ = ε maxθ∈Θ L∗F (θ; τ ). Then, since θ̂τn is an optimizer of Lnτ ,
(35) θ̂τn ∈ {θ ∈ Θ : Lτ (θ̂τ ) − ε̃ ≤ Lnτ (θ) ≤ Lτ (θ̂τ ) + ε̃},
Pθ -almost surely. The claim follows for ε → 0.
Proof of Proposition 4.3. For fixed τ > 0, we deduce from Proposition 4.2 that the limit of θ̂τn as n → ∞ exists. Let θ̂τ∞ be this limit. Then we know that Lτ (θ̂τ∞ ) = Lτ (θ̂τ ) for θ̂τ = arg maxθ∈Θ Lτ (θ). Since θ̂τ is a global maximizer and no optimum is attained on the boundary of the parameter space, θ̂τ∞ necessarily satisfies the first-order condition
(36) ∇Lτ (θ̂τ∞ ) = 0.
Thus, we can derive a linear-quadratic Taylor expansion of Lτ (θ̂τ∞ ) around θ0 as in Proposition A.2, and conclude that
(37) √τ (θ̂τ∞ − θ0 ) → N (0, Σ0 ) .
From Proposition 4.2 we know that for any τ > 0, there exists some implementation In for n = n(τ ) such that |θ̂τn − θ̂τ∞ | ≤ 1/τ . Thus, P-a.s.,
(38) √τ (θ̂τ^{n(τ)} − θ̂τ∞ ) → 0.
We remark that the sequence of implementations {In }τ≥0 is fixed in such a way that the right-hand side of equation (33) converges to zero as τ → ∞. The existence of such implementations is asserted by Assumptions (B1)-(B6) in Theorem 4.1.
Proof of Proposition 5.1. We proceed inductively. The claim follows immediately for m^n_T = 0. For m^n_T ≥ 1, |P̂^{In}_θ (i0 , . . . , i_{m^n_T}) − P^{In}_θ (i0 , . . . , i_{m^n_T})| is bounded above by the triangle inequality:
P^{In}_θ (i_{m^n_T}) |P̂^{In}_θ (i0 , . . . , i_{m^n_T −1}) − P^{In}_θ (i0 , . . . , i_{m^n_T −1})| + |P̂^{In}_θ (i_{m^n_T}) − P^{In}_θ (i_{m^n_T})| P̂^{In}_θ (i0 , . . . , i_{m^n_T −1}).
Note that P^{In}_θ (i_{m^n_T}) ≤ 1 and P̂^{In}_θ (i0 , . . . , i_{m^n_T −1}) ≤ 1. By the induction assumption we know that |P̂^{In}_θ (i0 , . . . , i_{m^n_T −1}) − P^{In}_θ (i0 , . . . , i_{m^n_T −1})| → 0 as n → ∞, Pθ -almost surely. Further, from Jensen's and Hölder's inequalities we derive for j = m^n_T
|P̂^{In}_θ (ij ) − P^{In}_θ (ij )| ≤ E∗θ [ 1{Vt ∉ A^{n,ij} (t ∈ (T^{n,j−1}, T^{n,j}))} 1{V_{T^{n,j}} ∈ A^{n,ij}} | Dτ ]
≤ max_{θ̃∈Θ} P∗θ̃ [ ∀t ∈ (T^{n,j−1}, T^{n,j}) : Vt ∉ A^{n,ij} | Dτ ].
Now (T^{n,j−1}, T^{n,j}) → ∅ as n → ∞ since m^n_T → ∞. Hence, the monotone convergence theorem implies that |P̂^{In}_θ (i_{m^n_T}) − P^{In}_θ (i_{m^n_T})| → 0 as n → ∞, Pθ -almost surely.
Proof of Proposition A.1. Lemma 4.1 in Section 1.4.3 of Ibragimov and Has'minskii (1981) states that it is sufficient to show that for any γ ≥ 0 and any θ ∈ Θ,
(39) Pθ [ sup_{|ξ|≥γ} ( (1/τ ) log Lτ (θ + ξ) − (1/τ ) log Lτ (θ) ) ≥ 0 ] → 0
as τ → ∞. Write
(1/τ ) log Lτ (θ + ξ) − (1/τ ) log Lτ (θ) = (1/τ ) log ( Lτ (θ + ξ)/Lτ (θ) ) .
We will show that for any given ε, δ > 0, there exists τ ∗ > 0 such that for all τ ≥ τ ∗ and θ, θ̃ ∈ Θ:
Pθ [ (1/τ ) log ( Lτ (θ̃)/Lτ (θ) ) > δ ] ≤ ε.
This implies convergence in probability uniformly over the parameter space Θ, and
(40) Pθ [ sup_{|ξ|≥γ} (1/τ ) log ( Lτ (θ + ξ)/Lτ (θ) ) > 0 ] → 0.
The inequality "≥" in Condition (39) then follows from Assumption (A4).
Fix ε, δ > 0 and take θ, θ̃ ∈ Θ. The likelihood function Lτ (θ̃) satisfies
(41) Lτ (θ̃) = dDθ̃,τ /dDθ0 ,τ
by definition. Consequently,
Eθ0 [ Lτ (θ̃) ] = 1.
Theorem 3.1 states that the likelihood function satisfies
Lτ (θ) = dDθ,τ /dDθ0 ,τ ∝ E∗θ [1/Mτ (θ) | Dτ ] L∗F (θ; τ ),
which is strictly positive under Assumption (A3). Thus, Dθ0 ,τ is also absolutely continuous with respect to Dθ,τ , with Radon-Nikodym density
dDθ0 ,τ /dDθ,τ = 1/Lτ (θ).
By construction, Lτ (θ̃)/Lτ (θ) is Dτ -measurable. We conclude from the exponential Chebyshev inequality and (41) that
Pθ [ (1/τ ) log ( Lτ (θ̃)/Lτ (θ) ) > δ ] ≤ e^{−τδ} Eθ [ Lτ (θ̃)/Lτ (θ) ] = e^{−τδ} Eθ0 [ Lτ (θ̃) ] = e^{−τδ} .
Choosing τ large enough leads to the claim.
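The exponential Chebyshev step in the last display can be checked on a toy likelihood ratio Y with mean one, as in (41): P[(1/τ ) log Y > δ] = P[Y > e^{τδ}] ≤ e^{−τδ} E[Y]. A small discrete example with purely illustrative values:

```python
import math

# Toy check of the exponential Chebyshev (Markov) bound used above:
# P[(1/tau) log Y > delta] = P[Y > exp(tau*delta)] <= exp(-tau*delta)*E[Y].
# Y plays the role of the likelihood ratio, with mean 1 as in (41).
values = [0.5, 1.0, 4.0]
probs = [0.6, 0.3, 0.1]
mean_Y = sum(v * p for v, p in zip(values, probs))   # equals 1.0

tau, delta = 1.0, 1.0
threshold = math.exp(tau * delta)
prob_exceed = sum(p for v, p in zip(values, probs) if v > threshold)
chebyshev_bound = math.exp(-tau * delta) * mean_Y
```

Here the exact exceedance probability sits well below the exponential bound, and the bound itself decays geometrically in τδ.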
Proof of Proposition A.2. We will use an argument similar to the one used to prove Proposition 6.1 in Hall and Heyde (1980), after generalizing to a multivariate parameter and our incomplete data structure. For simplicity, write m = m(τ ).
By Taylor expansion and Assumption (A5),
(42) 0 = (1/τ ) ∇ log Lτ (θ̂τ ) = (1/τ ) ∇ log Lτ (θ0 ) + (1/τ ) ∇² log Lτ (θ0 ) · (θ̂τ − θ0 ) + oP (‖θ̂τ − θ0 ‖),
since the MLE θ̂τ is consistent by Proposition A.1 and the third derivatives of the log-likelihood function are uniformly bounded in probability in a neighborhood of θ0 by Assumption (A10). Reformulate equation (42) as
√τ (θ̂τ − θ0 ) = − ( (1/τ ) ∇² log Lτ (θ0 ) + oP (1) )⁻¹ (1/√τ ) ∇ log Lτ (θ0 ).
Assumption (A5) and the definition of L∗E,i and L∗F,i in (21)-(22) imply
∇² log Lτ (θ) = ∇² ∑_{i=1}^{m} log L∗E,i (θ) + ∇² ∑_{i=1}^{m} log L∗F,i (θ)
= ∑_{i=1}^{m} ∇²L∗E,i (θ)/L∗E,i (θ) + ∑_{i=1}^{m} ∇²L∗F,i (θ)/L∗F,i (θ) − ∑_{i=1}^{m} ∇ log L∗E,i (θ)⊤ ∇ log L∗E,i (θ) − ∑_{i=1}^{m} ∇ log L∗F,i (θ)⊤ ∇ log L∗F,i (θ).
Chebyshev’s inequality together with Assumptions (A6) and (A7) imply
m
m
2 ∗
2 ∗
1 X ∇ LE,i (θ)
1 X ∇ LF,i (θ)
=
lim
=0
m→∞ m
m→∞ m
L∗E,i (θ)
L∗F,i (θ)
lim
i=1
i=1
in Pθ -probability, and Assumption (A8) leads to
(43)
1 2
∇ Lτ (θ) → −ζ (ΣE (θ) + ΣF (θ))
τ
in Pθ -probability as τ → ∞. Thus, it suffices to show that
(44)
1
√ ∇Lτ (θ0 ) → N (0, ζ (ΣE (θ0 ) + ΣF (θ0 ) + 2ΣEF (θ0 )))
τ
in Pθ0 -distribution. Asymptotic normality then follows given that ΣE (θ0 ) +
ΣF (θ0 ) + 2ΣEF (θ0 ) is positive semidefinite.
To see that (44) holds, note that Assumptions (A6)-(A7) together with the fact that ∇ log L∗E,i (θ) and ∇ log L∗F,i (θ) are Dsi -measurable imply that
∇ log Lsk (θ) = ∑_{i=1}^{k} ( ∇ log L∗E,i (θ) + ∇ log L∗F,i (θ) ),  1 ≤ k ≤ m,
is a Pθ -martingale with respect to (Dsk )_{1≤k≤m} . Chebyshev's inequality and Assumptions (A7)-(A8) lead to
(45) ΣE (θ) = lim_{m→∞} (1/m) ∑_{i=1}^{m} Eθ [ ∇ log L∗E,i (θ)⊤ ∇ log L∗E,i (θ) | Dsi−1 ] = lim_{m→∞} (1/m) ∑_{i=1}^{m} Eθ [ ∇ log L∗E,i (θ)⊤ ∇ log L∗E,i (θ) ],
(46) ΣF (θ) = lim_{m→∞} (1/m) ∑_{i=1}^{m} Eθ [ ∇ log L∗F,i (θ)⊤ ∇ log L∗F,i (θ) | Dsi−1 ] = lim_{m→∞} (1/m) ∑_{i=1}^{m} Eθ [ ∇ log L∗F,i (θ)⊤ ∇ log L∗F,i (θ) ],
(47) ΣEF (θ) = lim_{m→∞} (1/m) ∑_{i=1}^{m} Eθ [ ∇ log L∗E,i (θ)⊤ ∇ log L∗F,i (θ) | Dsi−1 ] = lim_{m→∞} (1/m) ∑_{i=1}^{m} Eθ [ ∇ log L∗E,i (θ)⊤ ∇ log L∗F,i (θ) ].
In the notation of Hall and Heyde (1980) (HH), Im (θ) = ∇ log Lsm (θ)⊤ ∇ log Lsm (θ) and Jm (θ) = ∇² log Lsm (θ). We have that
lim_{m→∞} (1/m) Im (θ) = ΣE (θ) + ΣF (θ) + 2ΣEF (θ),
lim_{m→∞} (1/m) Jm (θ) = − (ΣE (θ) + ΣF (θ)) ,
where the first equation follows from Assumption (A6) and Chebyshev's inequality, and the second equation follows from (43). As a result, Im (θ) → ∞ as m → ∞. Also, lim_{m→∞} (1/m) Im (θ) = lim_{m→∞} (1/m) Eθ [Im (θ)]. Finally, lim_{m→∞} (1/m) Jm (θ) = −lim_{m→∞} (1/m) Im (θ) + 2ΣEF (θ). Given that Im (θ) and Jm (θ) are continuous by Assumption (A5), we conclude that a slightly modified version of Assumption 1 of Proposition 6.1 of HH holds. This modified version of Assumption 1 accounts for the fact that our data are observed incompletely and there is correlation between the event data likelihood and the factor data likelihood. In addition, Assumption 2 of Proposition 6.1 of HH also holds because of our Assumptions (A5) and (A10), equation (43), and Theorem 2.1 of Billingsley (2009). Following the proof of Proposition 6.1 of HH yields the claim.
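The matrix identity relating the two HH limits, lim (1/m)Jm (θ) = −lim (1/m)Im (θ) + 2ΣEF (θ), can be sanity-checked numerically for any choice of the three limit matrices; the values below are arbitrary toy inputs, not estimates:

```python
import numpy as np

# Numerical check of lim J_m/m = -lim I_m/m + 2*Sigma_EF, using
#   lim I_m/m = Sigma_E + Sigma_F + 2*Sigma_EF,
#   lim J_m/m = -(Sigma_E + Sigma_F).
# All three matrices are arbitrary toy values.
Sigma_E = np.array([[1.2, 0.2], [0.2, 0.9]])
Sigma_F = np.array([[0.7, 0.1], [0.1, 0.5]])
Sigma_EF = np.array([[0.15, 0.0], [0.05, 0.1]])

lim_I = Sigma_E + Sigma_F + 2.0 * Sigma_EF   # limit of I_m(theta)/m
lim_J = -(Sigma_E + Sigma_F)                 # limit of J_m(theta)/m
```

Substituting the expression for lim (1/m)Im (θ) makes the identity hold term by term.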
Proof of Corollary A.3. The first equality for ΣE (θ) follows from (43). For the second equality note that, by construction and since s0 = 0, sm(τ) = τ , we have that
ΣE (θ) = lim_{τ→∞} { (1/m(τ )) ∑_{i=1}^{m(τ)} ∇ log L∗E (θ; si )⊤ ∇ log L∗E,i (θ) − (1/m(τ )) ∑_{i=1}^{m(τ)} ∇ log L∗E (θ; si−1 )⊤ ∇ log L∗E,i (θ) }
= lim_{τ→∞} (1/m(τ )) ∇ log L∗E (θ; sm(τ) )⊤ ∑_{i=1}^{m(τ)} ∇ log L∗E,i (θ)
= lim_{τ→∞} (τ /m(τ )) (1/τ ) ∇ log L∗E (θ; sm(τ) )⊤ ∇ log L∗E (θ; sm(τ) )
= (1/ζ) lim_{τ→∞} (1/τ ) ∇ log L∗E (θ; τ )⊤ ∇ log L∗E (θ; τ ).
The analogous expressions hold for ΣF (θ) and ΣEF (θ).
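The derivation above rests on the telescoping identity ∑_{i=1}^{m} (Gi − Gi−1 ) = Gm − G0 applied to the cumulative gradients Gi = ∇ log L∗E (θ; si ). A generic numerical check of this identity, with placeholder vectors standing in for the gradients:

```python
import numpy as np

# Generic telescoping identity behind the derivation:
# sum_{i=1}^{m} (G_i - G_{i-1}) = G_m - G_0 for any sequence of vectors G_i.
# The rows of G are placeholders for cumulative gradient vectors.
rng = np.random.default_rng(2)
G = rng.normal(size=(11, 4))              # G[0], ..., G[10]
telescoped = (G[1:] - G[:-1]).sum(axis=0) # sum of increments
```

The summed increments reduce to the difference of the endpoints, which is what collapses the double sum in the display.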
REFERENCES
Aı̈t-Sahalia, Y. (2008). Closed-form likelihood expansions for multivariate diffusions.
The Annals of Statistics 36 906-937.
Andersen, P. K. and Gill, R. D. (1982). Cox’s Regression Model for Counting Processes:
A Large Sample Study. The Annals of Statistics 10 1100-1120.
Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1993). Statistical Models
Based on Counting Processes. Springer series in statistics. Springer.
Azizpour, S., Giesecke, K. and Schwenkler, G. (2014). Exploring the Sources of
Default Clustering. Working Paper.
Billingsley, P. (2009). Convergence of Probability Measures. Wiley Series in Probability
and Statistics. Wiley.
Blanchet, J. and Ruf, J. (2013). A Weak Convergence Criterion Constructing Changes
of Measure. Working Paper.
Brémaud, P. (1980). Point Processes and Queues – Martingale Dynamics. Springer-Verlag, New York.
Brillinger, D. R. and Preisler, H. K. (1983). Maximum likelihood estimation in a
latent variable problem. In Studies in Econometrics, Time Series and Multivariate
FILTERED LIKELIHOOD FOR POINT PROCESSES
39
Statistics (S. Karlin, T. Amemiya and L. A. Goodman, eds.) 31-65. Academic Press,
New York.
Celeux, G. and Diebolt, J. (1992). A stochastic approximation type EM algorithm for
the mixture problem. Stochastics and Stochastic Reports 41 119-134.
Chan, K. S. and Ledolter, J. (1995). Monte Carlo EM Estimation for Time Series
Models Involving Counts. Journal of the American Statistical Association 90 242-252.
Chava, S. and Jarrow, R. (2004). Bankruptcy prediction with industry effects. Review
of Finance 8 537–569.
Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical
Society. Series B (Methodological) 34 187-220.
Dembo, A. and Zeitouni, O. (1986). Parameter estimation of partially observed continuous time stochastic processes via the EM Algorithm. Stochastic Processes and their
Applications 23 91 - 113.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum Likelihood from
Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series
B (Methodological) 39 1-38.
Demyanyk, Y. and Van Hemert, O. (2011). Understanding the Subprime Mortgage
Crisis. Review of Financial Studies 24 1848-1880.
Deng, Y. and Gabriel, S. (2006). Risk-Based Pricing and the Enhancement of Mortgage
Credit Availability among Underserved and Higher Credit-Risk Populations. Journal
of Money, Credit and Banking 38 1431-1460.
Deng, Y., Quigley, J. M. and Van Order, R. (2000). Mortgage Terminations, Heterogeneity and the Exercise of Mortgage Options. Econometrica 68 275-307.
Dolton, P. and O’Neill, D. (1996). Unemployment Duration and the Restart Effect:
Some Experimental Evidence. The Economic Journal 106 387-400.
Duffie, D., Saita, L. and Wang, K. (2007). Multi-period corporate default prediction
with stochastic covariates. Journal of Financial Economics 83 635–665.
Duffie, D., Eckner, A., Horel, G. and Saita, L. (2009). Frailty Correlated Default.
Journal of Finance 64 2089–2123.
Dufour, A. and Engle, R. F. (2000). Time and the Price Impact of a Trade. The Journal
of Finance 55 2467-2498.
Dupacova, J. and Wets, R. (1988). Asymptotic Behavior of Statistical Estimators and
of Optimal Solutions of Stochastic Optimization Problems. The Annals of Statistics 16
1517-1549.
Engle, R. F. (2000). The Econometrics of Ultra-High-Frequency Data. Econometrica 68
1-22.
Engle, R. F. and Russell, J. R. (1998). Autoregressive Conditional Duration: A New
Model for Irregularly Spaced Transaction Data. Econometrica 66 1127-1162.
Feigin, P. D. (1976). Maximum Likelihood Estimation for Continuous-Time Stochastic
Processes. Advances in Applied Probability 8 712-736.
Foote, C. L., Gerardi, K. and Willen, P. S. (2008). Negative equity and foreclosure:
Theory and evidence. Journal of Urban Economics 64 234 - 245.
Giesecke, K. and Schwenkler, G. (2014). Simulated Likelihood Estimators for
Discretely-Observed Jump-Diffusions. Working Paper.
Giesecke, K. and Teng, G. (2012). Numerical Solution of Jump-Diffusion SDEs. Working Paper.
Gjessing, H., Røysland, K., Pena, E. and Aalen, O. (2010). Recurrent events and
the exploding Cox model. Lifetime Data Analysis 16 525-546.
Hall, P. and Heyde, C. C. (1980). Martingale limit theory and its application. Probability
and mathematical statistics. Academic Press.
40
K. GIESECKE AND G. SCHWENKLER
Harrison, A. and Stewart, M. (1989). Cyclical Fluctuations in Strike Durations. The
American Economic Review 79 827-841.
Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika 58 83–90.
Heckman, J. J. and Walker, J. R. (1990). The Relationship Between Wages and Income
and the Timing and Spacing of Births: Evidence from Swedish Longitudinal Data.
Econometrica 58 1411-1441.
Hotz, V. J., Klerman, J. A. and Willis, R. J. (1997). The economics of fertility in
developed countries. In Handbook of Population and Family Economics, (M. R. Rosenzweig and O. Stark, eds.) 1, Part A 7 275 - 347. Elsevier.
Ibragimov, I. A. and Has’minskii, R. Z. (1981). Statistical estimation: asymptotic theory. Springer-Verlag, New York.
Jeganathan, P. (1995). Some Aspects of Asymptotic Theory with Applications to Time
Series Models. Econometric Theory 11 818-887.
Kiefer, N. (1988). Economic Duration Data and Hazard Functions. Journal of Economic
Perspectives 26 646-679.
Kliemann, W. H., Koch, G. and Marchetti, F. (1990). On the unnormalized solution
of the filtering problem with counting process observations. IEEE Transactions on
Information Theory 36 1415 -1425.
Komatsu, T. and Takeuchi, A. (2001). On the smoothness of PDF of solutions to SDE of
jump type. International Journal of Differential Equations and Applications 2 141-197.
Kurtz, T. (1998). Martingale problems for conditional distributions of Markov processes.
Electron. J. Probab. 3 29 pp.
Kurtz, T. G. and Ocone, D. L. (1988). Unique Characterization of Conditional Distributions in Nonlinear Filtering. The Annals of Probability 16 80-107.
Lancaster, T. (1979). Econometric Methods for the Duration of Unemployment. Econometrica 47 939-956.
Lancaster, T. and Nickell, S. (1980). The Analysis of Re-Employment Probabilities
for the Unemployed. Journal of the Royal Statistical Society. Series A (General) 143
141-165.
Lando, D. and Nielsen, M. S. (2010). Correlation in Corporate Defaults: Contagion or
Conditional Independence? Journal of Financial Intermediation 19 355-372.
Le Cam, L. and Yang, G. L. (1988). On the Preservation of Local Asymptotic Normality
under Information Loss. The Annals of Statistics 16 483-520.
Lo, A. W. (1988). Maximum Likelihood Estimation of Generalized Itô Processes with Discretely Sampled Data. Econometric Theory 4 231-247.
Meyer, B. D. (1990). Unemployment Insurance and Unemployment Spells. Econometrica
58 757-782.
Murphy, S. A. (1994). Consistency in a Proportional Hazards Model Incorporating a
Random Effect. The Annals of Statistics 22 712-731.
Murphy, S. A. (1995). Asymptotic Theory for the Frailty Model. The Annals of Statistics
23 182-198.
Newman, J. L. and McCulloch, C. E. (1984). A Hazard Rate Approach to the Timing
of Births. Econometrica 52 939-961.
Nielsen, S. F. (2000). The Stochastic EM Algorithm: Estimation and Asymptotic Results. Bernoulli 6 457-489.
Nielsen, G., Gill, R., Andersen, P. and Sorensen, T. (1992). A counting process
approach to maximum likelihood estimation in frailty models. Scandinavian Journal of
Statistics 19 25-43.
Ogata, Y. (1978). The asymptotic behavior of maximum likelihood estimators of stationary point processes. Annals of the Institute of Statistical Mathematics 30 243–261.
Parner, E. (1998). Asymptotic theory for the correlated gamma-frailty model. The Annals of Statistics 26 183-214.
Pennington-Cross, A. and Ho, G. (2010). The Termination of Subprime Hybrid and
Fixed-Rate Mortgages. Real Estate Economics 38 399-426.
Protter, P. (2004). Stochastic Integration and Differential Equations. Springer-Verlag,
New York.
Puri, M. L. and Tuan, P. D. (1986). Maximum likelihood estimation for stationary point
processes. Proceedings of the National Academy of Sciences USA 83 541-545.
Takeuchi, A. (2002). The Malliavin calculus for SDE with jumps and the partially hypoelliptic problem. Osaka J. Math. 39 523-559.
Todd, J. E., Winters, P. and Stecklov, G. (2012). Evaluating the impact of conditional cash transfer programs on fertility: the case of the Red de Protección Social in
Nicaragua. Journal of Population Economics 25 267-290.
van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge Series in Statistical and
Probabilistic Mathematics. Cambridge University Press.
Wei, G. and Tanner, M. (1990). A Monte Carlo Implementation of the EM Algorithm
and the Poor Man’s Data Augmentation Algorithms. Journal of the American Statistical
Association 85 699-704.
Wu, J. (1983). On the Convergence Properties of the EM Algorithm. The Annals of
Statistics 11 95-103.
Zeng, Y. (2003). A Partially Observed Model for Micromovement of Asset Prices with
Bayes Estimation via Filtering. Mathematical Finance 13 411-444.
Department of Management Science and Engineering
Stanford University
Stanford, CA 94305
E-mail: giesecke@stanford.edu

Boston University School of Management
Boston University
Boston, MA 02215
E-mail: gas@bu.edu