FILTERED LIKELIHOOD FOR POINT PROCESSES

By Kay Giesecke and Gustavo Schwenkler∗

Stanford University and Boston University

We develop and study likelihood estimators of the parameters of a marked point process and of incompletely observed explanatory factors that influence the arrival intensity and mark distribution. We establish an approximation to the likelihood and analyze the convergence and large-sample properties of the associated estimators. Numerical results illustrate our approach.

1. Introduction. Point processes are stochastic models for the timing of discrete events. They are widely used in finance and economics, as well as various other areas such as insurance, engineering, seismology, epidemiology, and medicine. In many cases, the point process intensity, which represents the conditional event arrival rate, is a function of explanatory factors. In practice, however, the factor data available to the statistician are often incomplete. This paper addresses the inference problem arising in this situation. It develops and analyzes likelihood estimators of the parameters of the point process and of the explanatory factors. The results provide a rigorous statistical foundation for the applications of event timing models in the presence of incomplete factor data. They facilitate the analysis of models for which statistical estimation was previously intractable. More specifically, we consider a marked point process whose intensity and mark distribution are parametric functions of a vector V of explanatory factors. The factors are allowed to follow stochastic processes whose dynamics are specified by a parameter. The data include the event times and marks observed during a sample period. They also include certain observations of V. Some elements of V are observed at all times, others are sampled only on certain dates, and the remaining ones are frailties that are never observed.
The values between the sampling times of the partially observed factors, and the values of the frailty factors, are not available for inference.

∗ We thank Darrell Duffie, Peter Glynn, Tze Leung Lai, Yang Song, and Stefan Weber for insightful comments. This work was completed while Schwenkler was a PhD student in the Department of Management Science and Engineering at Stanford University. Schwenkler acknowledges support from a Mayfield Fellowship and a Lieberman Fellowship.
AMS 2000 subject classifications: 62F12, 62M20, 62N02
Keywords and phrases: event timing, point process, filtering, parametric maximum likelihood, asymptotic theory, likelihood approximation

The model and data structure we adopt are motivated by a range of applications of event timing models in which incomplete information about V is a key feature. An example is the prediction of corporate bankruptcies; see Azizpour, Giesecke and Schwenkler (2014), Chava and Jarrow (2004), Duffie et al. (2009), Duffie, Saita and Wang (2007), Lando and Nielsen (2010) and others. Here, the event times of the point process represent bankruptcy dates. A mark variable might describe the financial loss at default. The vector V might include stochastically varying factors, such as stock returns, which are observed at any time during the sample period. It might also include macro-economic factors, such as industrial production, which are published only monthly or quarterly. Moreover, there is strong evidence of the presence of a dynamically varying, unobservable frailty factor influencing bankruptcy timing. Similar model and data structures are also common in the study of market transactions, unemployment spells, mortgage prepayments and delinquencies, the timing of births, and the duration of strikes.1 Section 2.3 provides concrete examples. We use a filtering approach to address the parameter inference problem with incomplete factor data.
We begin by developing a representation of the likelihood of the data as a product of two terms, one addressing the event data and the other the available factor data. The decomposition is based on a change of probability measure that resolves the interaction between the point process and the factors influencing the arrival intensity. The likelihood representation generalizes to incomplete factor data the representations of the complete data likelihood studied by Andersen et al. (1993), Ogata (1978), Puri and Tuan (1986) and others. Our filtering approach eliminates the need to interpolate between discrete observations of factor values. Interpolation between discrete observations is a standard practice in the literature to construct a data series with values for every time in the sample period. However, the implications of interpolation for the properties of the estimators are not well understood, rendering the practice questionable. One of the terms governing the likelihood is a point process filter whose practical computation may be difficult. We propose an approximation to this filter that eliminates the need to perform simulations to evaluate the likelihood estimator, which may be computationally burdensome for longer sample periods and cause a loss of estimation efficiency. The approximation is based on a quadrature method; an efficient algorithm reduces its computation to a sequential multiplication of matrices.2 Our approach is an alternative to the development and numerical solution of a Kushner-Stratonovich equation for the filter (see, for example, Kurtz and Ocone (1988), Kliemann, Koch and Marchetti (1990), Kurtz (1998)).3 We provide conditions guaranteeing the uniform convergence of the approximate likelihood to the exact likelihood. We also show that the parameter maximizing the approximate likelihood converges to the parameter at which the exact likelihood attains a global optimum and that it inherits the large-sample properties of the estimator maximizing the exact likelihood. The approximation does not generate additional asymptotic variance, so there is no loss of estimation efficiency when maximizing the approximate instead of the exact likelihood. Our approach to developing the properties of an estimator obtained from an approximate likelihood is an alternative to the approach of Dupacova and Wets (1988), who directly addressed the behavior of that estimator by viewing it as a solution of a stochastic optimization problem with partial information. Numerical results illustrate the behavior of our estimator and provide a comparison with a stochastic EM algorithm (see Dempster, Laird and Rubin (1977), Wei and Tanner (1990), Celeux and Diebolt (1992), and others). We analyze an extension of the classical self-exciting point process of Hawkes (1971), which is widely used in applications. The intensity is a linear function of three factors, the first following a jump-diffusion that is observed monthly, the second following an unobservable geometric Brownian motion, and the third given by a functional of the point process path. We verify the convergence of the approximate to the true likelihood for alternative configurations of the approximation. We find a substantial reduction of computational effort over the alternative of a Monte Carlo approximation to the likelihood.

1 See Engle and Russell (1998), Lancaster (1979), Deng, Quigley and Van Order (2000), Newman and McCulloch (1984), and Harrison and Stewart (1989), respectively. Kiefer (1988) provides an excellent overview of event timing applications in economics.
Comparing our estimates with those of the stochastic EM algorithm, we find that, for the majority of the parameters, our estimates are more accurate and less sensitive to the choice of initial values.4 Turning to large-sample properties (see Nielsen (2000) for the stochastic EM algorithm), we find that the EM algorithm generates asymptotic standard errors in excess of or similar to those implied by our method, for the majority of the parameters. We have implemented our estimators as an R package. The code can be downloaded at http://people.bu.edu/gas. It can be easily customized to treat alternative model specifications. Prior research has considered inference for event timing models. Ogata (1978) develops and analyzes likelihood estimators of stationary and ergodic point processes with complete data. Andersen and Gill (1982) examine the proportional hazards model of Cox (1972) with complete data, in which the intensity is the product of an exponential function of factors and a deterministic, semi-parametric baseline hazard function. Nielsen et al. (1992), Murphy (1994), Murphy (1995) and Parner (1998) study a frailty model in which the intensity is the product of a semi-parametric baseline hazard function, an observable factor, and an unobservable frailty represented by an independent random variable. We focus on parametric intensity models and treat more general model and data structures.

2 Brillinger and Preisler (1983) use a quadrature method to approximate the likelihood in a regression model with latent variables.
3 Zeng (2003) uses this approach to perform Bayesian least-squares estimation for a point process model that is a special case of our formulation.
4 The EM estimator is guaranteed to converge to a local optimum of the likelihood only; see Wu (1983) and Dembo and Zeitouni (1986). Our estimator is guaranteed to converge to a global optimum.
For example, we allow the frailty to follow a stochastic process that may be correlated with the observable factors. We also allow for arbitrary link functions, going beyond the proportional hazards formulation. Moreover, we allow for partial observation of explanatory factors. These features are common in the applied event timing literature; see the examples in Section 2.3. Finally, our results are related to the results of Le Cam and Yang (1988), who analyze likelihood estimators with incomplete data. Unlike these authors, we allow for some factors to be observed at all times. As a result, our data cannot be viewed as the product of repeated independent experiments and our likelihood cannot be decomposed into a product of conditional densities. The rest of this paper is organized as follows. Section 2 formulates the point process model with incomplete factor data and provides examples. Section 3 develops a representation of the likelihood. Section 4 introduces and analyzes an approximation to the likelihood, and examines the asymptotic properties of the associated estimator. Section 5 provides an efficient algorithm to evaluate the approximate likelihood. Section 6 numerically illustrates the behavior of our estimators. There are several appendices. Appendices A and B provide additional results on the exact and approximate estimators. Appendix C provides the proofs of our results.

2. Marked point process. Fix a complete probability space (Ω, F, P) and a right-continuous and complete filtration (Ft)t≥0. Consider a marked point process (Tn, Un)n≥1. The Tn are strictly positive and strictly increasing stopping times that represent the arrival times of events. The Un are FTn-measurable mark variables that take values in a subset RU of Euclidean space. They provide additional information about the events.

2.1. Model. Suppose the counting process N given by Nt = Σn≥1 1{Tn≤t} has intensity λ (see II.D7 in Brémaud (1980)).
The intensity represents the conditional event arrival rate. We take

(1) λt = Λ(Vt; α)

for a function Λ : RV → (0, ∞) specified by a parameter α ∈ ΘN, and an explanatory factor process V = (X, Y, Z) ∈ RV ⊂ R^dX × R^dY × R^dZ, where dX, dY and dZ are natural numbers. The dynamics of V are specified by a parameter γ ∈ ΘV. The conditional distribution of the mark variables is described by a probability transition kernel Πt(du) from Ω × [0, ∞) into RU (see VIII.D5 in Brémaud (1980)). We take

(2) Πt(du) = π(u, Vt−; β)du

for a density function π : RU × RV → (0, ∞) specified by a parameter β ∈ ΘU. Technical conditions need to be imposed on π, Λ, and the dynamics of V in order for the point process to be non-explosive (see, for example, Protter (2004) and Gjessing et al. (2010)).

2.2. Data. The goal is to estimate the parameter θ = (α, β, γ) governing the intensity, the distribution of the marks, and the dynamics of the explanatory factor V. The data available for inference include the event times and marks observed during a sample period [0, τ], where τ > 0. The data also include certain observations of V during the sample period. More precisely, the data are given by Dτ = (Nτ, Uτ, Xτ, Yτ), where

• Nτ = (Nt : 0 ≤ t ≤ τ) represents the observations of the Tn;
• Uτ = (Σn≤Nt Un : 0 ≤ t ≤ τ) represents the observations of the Un;
• Xτ = (Xt : 0 ≤ t ≤ τ) represents the observations of X;
• Yτ = (Ytj : t ≤ τ, t ∈ Sj, j = 1, . . . , dY) represents the observations of Y,

where the Sj are discrete subsets of [0, ∞) with S = S1 ∩ · · · ∩ SdY ≠ ∅. The structure of the data addresses the fact that in practice, the data on the explanatory factor V = (X, Y, Z) may be incomplete. While the data include the entire history of X for the sample period, they include the values of Y only on certain dates. The data do not include any values of Z; this element of V is an unobservable frailty.

2.3. Examples.
The point process model and data structure we adopt have a wide scope. We illustrate with a number of examples from the finance and economics literature.

Example 2.1 (Timing of Births). Newman and McCulloch (1984) explore the factors driving the timing of births for 4724 women between the ages of 20 and 49 in Costa Rica. A unit of time is one month, and time is measured starting at the 13-th birthday of a woman. The event time Tn corresponds to the time of first birth for the n-th woman. The event intensity is Λ(v; α) = exp(α · v) for v = (v1, . . . , v4) and α = (α1, . . . , α4). The explanatory factor Vt = (Vt1, . . . , Vt4) satisfies Vt2 = t, Vt3 = max{t − 48, 0}, and Vt4 = max{t − 144, 0}. The factor V1 is a 6-dimensional vector that represents personal and regional characteristics such as schooling level and mortality rates, and is observed monthly. Overall, dX = 3, dY = 6, dZ = 0, X = (V2, V3, V4), Y = V1, and Sj = N0 for 1 ≤ j ≤ 6. Similar models are analyzed by Heckman and Walker (1990), Hotz, Klerman and Willis (1997), and Todd, Winters and Stecklov (2012).

Example 2.2 (Timing of Market Transactions). Engle and Russell (1998) analyze IBM stock transactions data during regular trading hours for the period between November 1, 1990, and January 31, 1991. The event time Tn represents the time of the n-th transaction. The event intensity satisfies Λ(v; α) = Λ0(v0/v1; α)/v1 for v = (v0, v1). The vector of explanatory factors is Vt = (Vt0, Vt1), where Vt0 = t − TNt and Vt1 = V^1_{Nt} for

V^1_n = γ0 + Σ_{j=0}^{m} γ_{j+1}(T_{n−j} − T_{n−j−1}) + Σ_{j=0}^{q} γ_{m+2+j} V^1_{n−j}

for γ = (γ0, γ1, . . . , γm+q+2) and m, q ∈ N. This model is called the autoregressive conditional duration model because the conditional distribution of the time until the next event depends on the realization of past inter-event times. The data include complete observations of V0 and V1. Thus, dX = 2, dY = dZ = 0, and V = X.
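For illustration, the duration recursion in Example 2.2 can be evaluated numerically. The following sketch uses hypothetical parameter values, one duration lag, and one autoregressive lag taken at the previous value V^1_{n−1}; it is a toy illustration, not Engle and Russell's estimated specification:

```python
# Toy sketch of an ACD-type recursion (hypothetical parameters; one
# duration lag, one autoregressive lag using the previous value V^1_{n-1}).
def acd_recursion(event_times, gamma0, gamma_dur, gamma_lag, v1_init):
    """Return the sequence V^1_0, ..., V^1_{N-1} driven by the inter-event
    durations T_n - T_{n-1} and lagged values of V^1."""
    v1 = [v1_init]  # assumed initial value V^1_0
    for n in range(1, len(event_times)):
        duration = event_times[n] - event_times[n - 1]
        v1.append(gamma0 + gamma_dur * duration + gamma_lag * v1[-1])
    return v1

# four hypothetical transaction times
v1 = acd_recursion([0.0, 0.5, 1.5, 2.0],
                   gamma0=0.1, gamma_dur=0.3, gamma_lag=0.5, v1_init=1.0)
```

The intensity then follows via Λ(v; α) = Λ0(v0/v1; α)/v1, with v0 = t − TNt the backward recurrence time.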
Related models of inter-transaction times are analyzed by Dufour and Engle (2000) and Engle (2000).

Example 2.3 (Unemployment Timing). Lancaster (1979) studies the duration of unemployment among 479 British individuals. The event time Tn represents the arrival time of the n-th job offer. The n-th mark is Un = (wn, an) ∈ R+ × {0, 1}, where wn indicates the wage offered for the n-th job and an indicates acceptance of an offer. The density π is parameter-independent and modeled by the empirical distribution in the sample. The intensity is Λ(v; α) = v2 exp(α · v1) for v = (v1, v2) and Vt = (V01, V02). The three-dimensional observable explanatory factor component V1 represents the age of the individual, the unemployment rate in the UK, and social security benefits. The component V2 is an unobservable frailty that has a gamma distribution with unit mean and variance γ. Thus, dX = 3, dY = 0, dZ = 1 and V = (X, Z) with X = V1 and Z = V2. Lancaster and Nickell (1980), Meyer (1990) and Dolton and O’Neill (1996) examine related models.

Example 2.4 (Mortgage Delinquencies). Deng, Quigley and Van Order (2000) investigate home mortgage default and prepayment events in the United States using quarterly data from Freddie Mac on single-family mortgage loans issued between 1976 and 1983. The event time Tn represents the first time of either default or prepayment of the n-th loan. The event intensity is Λ(v; α) = v0 exp(v2 + v3 + αd · v6) + v1 exp(v4 + v5 + αp · v6) for v = (v0, . . . , v6) and α = (αd, αp). The vector of explanatory factors is Vt = (Vt0, . . . , Vt6).
Here, Vt0 = V00 and Vt1 = V01 are time-invariant one-dimensional frailty factors that take values in discrete sets and characterize the borrower type, Vt2 and Vt4 are one-dimensional baseline functions, Vt3 and Vt5 are two-dimensional measures of the moneyness of the default and prepayment options held by the borrower, respectively, and Vt6 is a two-dimensional vector of macro-factors such as employment and divorce rates. The data include continuous observations of (V2, V4) and quarterly observations of (V3, V5, V6). Therefore, dX = 2, dY = 6, dZ = 2, and V = (X, Y, Z) for X = (V2, V4), Y = (V3, V5, V6), Z = (V0, V1), and Sj = N0 for 1 ≤ j ≤ 6. Related models are analyzed by Demyanyk and Van Hemert (2011), Deng and Gabriel (2006), Foote, Gerardi and Willen (2008), and Pennington-Cross and Ho (2010).

Example 2.5 (Timing of Corporate Bankruptcies). Azizpour, Giesecke and Schwenkler (2014) examine the timing of bankruptcies in the United States between 1970 and 2012. An event time Tn represents a bankruptcy date. A mark Un ∈ R+ represents the debt outstanding at default. The density π is parameter-independent and modeled by the empirical distribution in the sample. The intensity function is Λ(v; α) = exp(α1 · v1) + α2 v2 + α3 v3 for v = (v1, v2, v3) and α = (α1, α2, α3). The factor vector is V = (V1, . . . , V6), where

V^1_t = Σ_{n≤Nt} e^{−γ1(t−Tn)} log Un

generates self-exciting behavior. The element V2 is an unobservable frailty governed by the SDE

dV^2_t = γ3(γ4 − V^2_t) dt + √(V^2_t) dWt,

where W is a Brownian motion and γ = (γ1, . . . , γ4) with γ1 > 0. The initial value V^2_0 is sampled from its stationary gamma distribution with shape parameter 2γ3γ4 and scale parameter 1/(2γ3). The element V3 follows a vector autoregressive model and represents several interest rates, stock index returns and volatilities that are sampled daily.
The element V4 follows an autoregressive model and represents credit spreads that are sampled weekly. The element V5 follows an autoregressive moving-average model, represents the industrial production growth rate, and is sampled monthly. The element V6 follows a moving-average model, represents the growth rate of the gross domestic product, and is sampled quarterly. We have V = (X, Y, Z) with X = V1, Y = (V3, V4, V5, V6) and Z = V2, dX = 1, dY = 4, dZ = 1, S1 = {k/365 : k ∈ N}, S2 = {k/52 : k ∈ N}, S3 = {k/12 : k ∈ N}, and S4 = {k/4 : k ∈ N}. Related models are analyzed by Duffie, Saita and Wang (2007), Duffie et al. (2009), and Lando and Nielsen (2010).

Our formulation facilitates the analysis of multiple influences on event timing and various data structures. The arrival intensity (1) is a non-negative function of a vector V of explanatory factors. The elements of V may include time, allowing for time-inhomogeneity as in Examples 2.1 and 2.2. They may be represented by time-invariant random variables (Examples 2.3 and 2.4), or by stochastic processes (Examples 2.2 and 2.5). The elements of V may be correlated (Examples 2.4 and 2.5) and need not be Markovian (Example 2.2). The point process may be an element of V or have an impact on the dynamics of other elements of V, generating self-exciting behavior as in Examples 2.2 and 2.5. Some elements of V may be completely unobservable frailties (Examples 2.3, 2.4, and 2.5). Others may be observable only on certain dates (Examples 2.1, 2.4, and 2.5) or at all times (Examples 2.2 and 2.5). Our formulation allows one to address all of these features in a common model and data framework.

3. Likelihood inference. We analyze likelihood estimators of the parameter θ = (α, β, γ) ∈ Θ = ΘN × ΘU × ΘV. We write Pθ and Eθ for the probability and the expectation operators when the underlying parameter is θ. The true but unknown parameter θ0 belongs to Θ◦, the interior of Θ. The observed data are a realization of Dτ under Pθ0.
The likelihood Lτ(θ) of the data Dτ at the parameter θ ∈ Θ is taken as the Radon-Nikodym derivative of the Pθ-distribution of the data with respect to the true Pθ0-distribution; see Appendix A.1 for a precise definition. With complete factor data (dY = dZ = 0), the likelihood is given by dPθ/dPθ0. In the general case, the likelihood is given by the projection of dPθ/dPθ0 onto the observed data:

(3) Lτ(θ) = Eθ0[ dPθ/dPθ0 | Dτ ].

Unfortunately, the representation (3) is impractical because the true parameter θ0 is unknown. In order to obtain an alternative representation, we introduce an auxiliary probability measure. Let π∗ be a strictly positive probability density on RU. If the variable

(4) Mτ(θ) = ∏_{n≤Nτ} [π∗(Un)/π(Un, VTn−; β)] × exp( −∫_0^τ log Λ(Vs−; α) dNs − ∫_0^τ (1 − Λ(Vs; α)) ds )

has expected value equal to one, then it induces an equivalent probability measure P∗θ on Fτ. Under this measure, the event counting process N is a standard Poisson process on the interval [0, τ] and a mark Un has density π∗. The intensity and the mark distribution relative to P∗θ depend neither on the factor V nor on the parameter θ. This property facilitates a decomposition of the likelihood (3) into a product of two terms, one addressing the event data (Nτ, Uτ) and the other the factor data (Xτ, Yτ). Each of these terms is given by an expectation under the auxiliary measure.

Theorem 3.1. Suppose the following conditions hold:

(A1) The expectation Eθ[Mτ(θ)] = 1.
(A2) The conditional likelihood L∗F(θ; τ) of the factor data (Xτ, Yτ) given the event data (Nτ, Uτ) exists under P∗θ.

Then the likelihood satisfies

(5) Lτ(θ) ∝ E∗θ[ 1/Mτ(θ) | Dτ ] L∗F(θ; τ).

Condition (A1) guarantees that the measure P∗θ is well-defined. Blanchet and Ruf (2013) provide sufficient conditions for (A1).
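Given a fully observed, piecewise-constant factor path, the variable Mτ(θ) in (4) can be evaluated directly; with incomplete factor data it is exactly this quantity that must be filtered. A minimal sketch, in which Lambda_, pi_, and pi_star are hypothetical stand-ins for Λ(·; α), π(·, ·; β), and π∗:

```python
import math

# Sketch: evaluate log M_tau(theta) of display (4) along a piecewise-
# constant factor path. Lambda_, pi_, pi_star are hypothetical stand-ins
# for the intensity function, mark density, and auxiliary mark density.
def log_M(event_times, marks, grid, v_path, Lambda_, pi_, pi_star):
    """grid: 0 = t_0 < ... < t_m = tau; v_path[j] is the value of V on
    the interval (grid[j], grid[j + 1]]."""
    out = 0.0
    for t_n, u_n in zip(event_times, marks):
        j = next(k for k in range(1, len(grid)) if t_n <= grid[k])
        v = v_path[j - 1]                            # V just before the event
        out += math.log(pi_star(u_n) / pi_(u_n, v))  # mark-density ratio
        out -= math.log(Lambda_(v))                  # -integral of log Lambda dN
    for j in range(1, len(grid)):                    # -integral of (1 - Lambda) ds
        out -= (1.0 - Lambda_(v_path[j - 1])) * (grid[j] - grid[j - 1])
    return out

# With Lambda = 1 and pi = pi*, M_tau(theta) = 1 and hence log M_tau = 0.
val = log_M([0.5, 1.5], [1.0, 1.0], [0.0, 1.0, 2.0], [0.0, 0.0],
            lambda v: 1.0, lambda u, v: 0.5, lambda u: 0.5)
```

Under P∗θ the event process is standard Poisson, which is why the constant-intensity, matched-mark-density case above reduces Mτ(θ) to one.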
Condition (A2) requires the absolute continuity of the conditional P∗θ-law of the factor data (Xτ, Yτ) given the event data (Nτ, Uτ) relative to the corresponding conditional law under P∗θ0; see Appendix A for details. Sufficient conditions for this property depend on the model and data structure specified for V. For a jump-diffusion model of V with discrete observation dates, for example, see Komatsu and Takeuchi (2001) and Takeuchi (2002). Theorem 3.1 generalizes to incomplete factor data the representations of the complete data point process likelihood studied by Andersen et al. (1993), Ogata (1978), Puri and Tuan (1986) and others. Indeed, if dY = dZ = 0, the data Dτ include all values of the factor V = X required to evaluate Mτ(θ). In this case the conditional expectation in (5) is trivial, so

(6) Lτ(θ) ∝ L∗F(θ; τ)/Mτ(θ).

With incomplete factor data (dY > 0 or dZ > 0), the data Dτ are insufficient to evaluate Mτ(θ). In this case, the conditional expectation in (5) represents a posterior mean of 1/Mτ(θ) given the data. From Theorem 3.1, ignoring an additive term that is independent of θ, the log-likelihood takes the form

(7) Lτ(θ) = log L∗E(θ; τ) + log L∗F(θ; τ),

where L∗E(θ; τ) = E∗θ[1/Mτ(θ) | Dτ]. A maximum likelihood estimator (MLE) θ̂τ ∈ Θ◦ of the parameter θ is a solution to

(8) 0 = ∇Lτ(θ).

Appendix A.2 studies the asymptotic properties of a MLE. Identifiability, smoothness, boundedness, and non-singularity conditions ensure that θ̂τ → θ0 in Pθ0-probability and √τ(θ̂τ − θ0) → N(0, Σ0) in Pθ0-distribution as τ → ∞. The asymptotic variance-covariance matrix Σ0 ∈ Rp×p is given by

(9) Σ0 = I(θ0)−1 + 2ζ I(θ0)−1 ΣEF(θ0) I(θ0)−1,

where ζ = limτ→∞ |S ∩ [0, τ]|/τ is the mean number of observations of Y per unit of time, I(θ0) is the Fisher information matrix for the estimation of the true parameter θ0, and ΣEF(θ0) is a matrix that addresses the filtering of the unobserved factors.
With complete factor data, Σ0 = I(θ0)−1 (Ogata (1978)); the filtering of the unobserved factors thus calls for a correction term.

4. Approximate likelihood estimators. Theorem 3.1 states that the likelihood is governed by two terms. The term L∗F(θ; τ) addresses the available factor data; its computation is standard for a range of models of V, see Aït-Sahalia (2008), Giesecke and Schwenkler (2014), Lo (1988), and others. The term L∗E(θ; τ) = E∗θ[1/Mτ(θ) | Dτ] addresses the event data. Depending on the model specification, its practical computation may be difficult. This section develops and analyzes an approximation to the point process filter L∗E(θ; τ). It also examines the convergence and asymptotic properties of the estimators maximizing the approximate likelihood. We focus on the case in which at least one of the factors is an unobserved frailty (dZ > 0).

4.1. Filter approximation. We consider filters of the form

(10) E(θ; τ, g) = E∗θ[ g(Vτ, τ)/Mτ(θ) | Dτ ]

for functions g on RV × [0, ∞). The term L∗E(θ; τ) can be computed by taking g(v, t) = 1. While we treat E(θ; t, g) at the horizon t = τ, the extension to dates t < τ is straightforward. A standard approach to computing (10) if dY = 0 is to develop and numerically solve the associated Kushner-Stratonovich equation (see Kliemann, Koch and Marchetti (1990)). To our knowledge, however, the KS equation has not yet been worked out for the case that dY > 0, i.e., when some factors are observed at fixed dates only. We propose to approximate E(θ; τ, g) using a quadrature-based method. Fix a number mS ∈ N and choose points v1, . . . , vmS ∈ RV, together with compact sets A1, . . . , AmS ⊆ RV that are pairwise disjoint such that Ak is a neighborhood of vk with non-empty interior. Additionally, fix another number mT ∈ N, and choose times T 0, T 1, . . . , T mT ∈ [0, τ] with 0 = T 0 < T 1 < · · · < T mT = τ.
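For concreteness, a one-dimensional implementation I = {mS, mT, v, A, T} can be built with uniform cells on a truncated state space and equally spaced times. This is a hypothetical construction for illustration, not the paper's prescribed choice:

```python
# Hypothetical one-dimensional implementation I = {m_S, m_T, v, A, T}:
# m_S uniform cells A_k = [a_k, a_{k+1}) covering [v_lo, v_hi), with the
# quadrature points v_k at the cell midpoints, and m_T + 1 equally
# spaced times T^0 = 0 < T^1 < ... < T^{m_T} = tau.
def build_implementation(m_S, m_T, v_lo, v_hi, tau):
    width = (v_hi - v_lo) / m_S
    edges = [v_lo + k * width for k in range(m_S + 1)]
    cells = [(edges[k], edges[k + 1]) for k in range(m_S)]  # A_1, ..., A_mS
    points = [0.5 * (a + b) for a, b in cells]              # v_1, ..., v_mS
    times = [tau * j / m_T for j in range(m_T + 1)]         # T^0, ..., T^mT
    return points, cells, times

points, cells, times = build_implementation(m_S=4, m_T=5,
                                            v_lo=0.0, v_hi=2.0, tau=1.0)
```

Refining such grids as n grows (more and smaller cells covering a larger range, together with a finer time grid) is what the conditions of Theorem 4.1 formalize.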
Our approximation of E(θ; τ, g) is given by

E I(θ; τ, g) = Σ_{i0=1}^{mS} · · · Σ_{imT=1}^{mS} g(vimT, τ) FθI(i1, . . . , imT) PθI(i0, i1, . . . , imT),

where I = {mS, mT, v, A, T} is the implementation of the approximation with v = {v1, . . . , vmS}, A = {A1, . . . , AmS} and T = {T 1, . . . , T mT}. Further, writing Λ(·; θ) and π(·, ·; θ) rather than Λ(·; α) and π(·, ·; β), let

FθI(i1, . . . , imT) = exp( Σ_{j=1}^{mT} Σ_{Tn∈(T j−1, T j]} log[Λ(vij; θ)π(Un, vij; θ)/π∗(Un)] + Σ_{j=1}^{mT} (1 − Λ(vij; θ))(T j − T j−1) ),

PθI(i0, . . . , imT) = P∗θ[ V0 ∈ Ai0, Vt ∈ Aij (1 ≤ j ≤ mT, t ∈ (T j−1, T j]) | Dτ ].

The approximation E I(θ; τ, g) of E(θ; τ, g) is a weighted average over certain paths of the explanatory factor V. The paths are those for which V0 ∈ {v1, . . . , vmS} and Vt is constant in the interval (T j−1, T j] for 1 ≤ j ≤ mT, taking values in the set {v1, . . . , vmS} ⊂ RV. The weights assigned to these paths are the conditional P∗θ-probabilities, given the data Dτ, that V is in the set Aij during the interval (T j−1, T j] for 1 ≤ j ≤ mT and ij ∈ {1, . . . , mS}. Thus, E I(θ; τ, g) is based on a multidimensional rectangular quadrature approximation to the Lebesgue integral in E(θ; τ, g), taken with respect to the measure P∗θ conditional on the data Dτ. Section 5 provides an algorithm for evaluating the approximate filter E I(θ; τ, g).

4.2. Convergence. We examine the convergence of the approximate filter E I(θ; τ, g) and of the estimators based on E I(θ; τ, g). The idea is to choose sequences of compact sets {An,j}1≤j≤mnS and of times {T n,j}1≤j≤mnT, such that, in the limit as n → ∞, the entire range of V is covered by the compact sets while both the sets An,j and the time intervals (T n,j−1, T n,j] become smaller. Let ∂v denote the partial derivative with respect to v.

Theorem 4.1 (Convergence of approximate filter). Suppose the following conditions hold.

(B1) The partial derivative ∂vΛ(v; θ) of the intensity function Λ exists. Both Λ(v; θ) and ∂vΛ(v; θ) are continuous in θ ∈ Θ.
(B2) The partial derivative ∂vπ(u, v; θ) of the P-density function π of the marks exists. The functions π(u, v; θ) and ∂vπ(u, v; θ) are continuous in θ ∈ Θ.
(B3) The space of admissible parameters Θ is a compact subset of a Euclidean space and has non-empty, convex interior.
(B4) E∗θ[1/Mτ(θ)²] < ∞ for any θ ∈ Θ.
(B5) The function g is P-measurable and either continuous or bounded.

Suppose there exists a sequence of data-dependent implementations I n = {mnS, mnT, vn, An, T n} with vn = (vn,j)1≤j≤mnS, An = (An,j)1≤j≤mnS, and T n = (T n,i)1≤i≤mnT, such that the following conditions hold P-almost surely:

(B1) mnS, mnT < ∞ for all n ∈ N.
(B2) An is a sequence of compact and convex Rd-sets with non-empty, pairwise disjoint interiors, such that An = ∪_{j=1}^{mnS} An,j → RV as n → ∞.
(B3) vn,j ∈ RV for each n ∈ N and 1 ≤ j ≤ mnS, and the interior of An,j is an open neighborhood of vn,j.
(B4) 0 = T n,0 < T n,1 < · · · < T n,mnT = τ.
(B5) In the limit as n → ∞, supθ∈Θ P∗θ[ ∃t ∈ [0, τ] : Vt ∉ An | Dτ ] → 0.
(B6) In the limit as n → ∞, it holds P-almost surely that

max_{1≤j≤mnS} vol(An,j) [ τ maxθ∈Θ max_{1≤j≤mnS} |∂vΛ(vj; θ)| + Nτ maxθ∈Θ max_{1≤k≤Nτ} max_{1≤j≤mnS} |∂v log(Λ(vj; θ)π(Uk, vj; θ)/π∗(Uk))| ] → 0.

Then, for fixed τ > 0, E I n(· ; τ, g) converges uniformly in the space of admissible parameters Θ to E(· ; τ, g) as n → ∞, P-almost surely.

Theorem 4.1 states that if one can construct implementations such that Conditions (B1)–(B6) are satisfied, then the approximation E I(θ; τ, g) converges uniformly in the parameter θ ∈ Θ.
To achieve this, the sequence of compact sets An,j has to be chosen such that the volume of each set shrinks faster than |∂v log(Λ(v; θ)π(u, v; θ)/π∗(u))| and |∂vΛ(v; θ)| grow, while their union covers the range RV of the explanatory factor V. The state space discretization thus has to control the extreme behavior of the intensity function and the mark distribution. The time discretization grid {T n,1, . . . , T n,mnT} can be chosen so as to facilitate the calculation of the probabilities PθI n(i0, i1, . . . , imnT); we address this choice in Section 5. Assumptions (B1)–(B3) guarantee that Conditions (B1)–(B6) are well-posed. A sufficient condition for the existence of an implementation I n as in Theorem 4.1 is that there exists a sequence of compact and convex sets An ∈ σ(RV) with non-empty interior, An ≠ RV, such that, P-almost surely,

(11) supθ∈Θ P∗θ[ ∃t ∈ [0, τ] : Vt ∉ An | Dτ ] → 0.

Theorem 4.1 guarantees that our scheme approximates the likelihood function to the same degree of accuracy for any θ ∈ Θ and converges uniformly. Next, we analyze the convergence of the estimators obtained from the approximate likelihood.

Proposition 4.2 (Convergence of estimators). Suppose the conditions of Theorem 4.1 apply and, in addition, supθ∈Θ L∗F(θ; τ) < ∞, P-almost surely. Assume that θ̂τ ∈ Θ maximizes the likelihood Lτ(θ) = E(θ; τ, 1)L∗F(θ; τ) and that θ̂τn for n ∈ N maximizes the approximate likelihood Lnτ(θ) = E I n(θ; τ, 1)L∗F(θ; τ). Then for fixed τ > 0, P-almost surely,

limn→∞ Lnτ(θ̂τn) = Lτ(θ̂τ)

and θ̂τn converges to a global optimum of the likelihood for any realization of the data Dτ.

Assuming smoothness of Lτ(·), since Lτ(θ̂τ) < ∞ and the parameter set Θ is compact, the sequence of estimators θ̂τn obtained by a uniformly convergent implementation of our approximation converges to some parameter θ̂τ∞ ∈ Θ.
Proposition 4.2 states that $L_\tau(\hat\theta^\infty_\tau)=L_\tau(\hat\theta_\tau)$; that is, both $\hat\theta^\infty_\tau$ and $\hat\theta_\tau$ achieve the same likelihood, making $\hat\theta^\infty_\tau$ a global maximizer of the likelihood function $L_\tau(\cdot)$. When $\hat\theta_\tau$ is the unique MLE, then $\hat\theta^\infty_\tau=\hat\theta_\tau$ and hence $\hat\theta^n_\tau\to\hat\theta_\tau$ as $n\to\infty$.

4.3. Asymptotic properties of estimators. The uniform convergence of the approximate likelihood function established above entails additional results. Following the notation in Proposition 4.2, we write
$$(12)\qquad \sqrt{\tau}\,(\hat\theta^n_\tau-\theta_0)=\sqrt{\tau}\,(\hat\theta^n_\tau-\hat\theta^\infty_\tau)+\sqrt{\tau}\,(\hat\theta^\infty_\tau-\theta_0).$$
The assumptions in Theorem 4.1 guarantee that, for any given degree of accuracy, one can construct an implementation of our approximation achieving that accuracy. If $n$ goes to $\infty$ fast enough as $\tau\to\infty$, then $\sqrt{\tau}\,(\hat\theta^n_\tau-\hat\theta^\infty_\tau)$ converges to zero P-almost surely. This observation leads to the following result.

Proposition 4.3 (Asymptotic properties of estimators). Suppose that any exact maximum likelihood estimator is consistent and asymptotically normal; i.e., $\hat\theta_\tau\to\theta_0$ in $P_{\theta_0}$-probability and $\sqrt{\tau}\,(\hat\theta_\tau-\theta_0)\to N(0,\Sigma_0)$ in $P_{\theta_0}$-distribution as $\tau\to\infty$ for $\Sigma_0$ as in (9). Further, suppose that $L_\tau(\cdot)$ does not attain optima on the boundary of $\Theta$. Then there exists a sequence of implementations $I^n$ with $n\to\infty$ as $\tau\to\infty$, such that $\hat\theta^n_\tau\to\theta_0$ as $\tau\to\infty$ in $P_{\theta_0}$-probability, and $\sqrt{\tau}\,(\hat\theta^n_\tau-\theta_0)\to N(0,\Sigma_0)$ as $\tau\to\infty$ in $P_{\theta_0}$-distribution, where $\Sigma_0$ is the same as in (9).

Proposition 4.3 guarantees the existence of an implementation of our approximation that generates a consistent and asymptotically normal estimator. The asymptotic variance-covariance matrix of that estimator is the same as that of the estimator $\hat\theta_\tau$ obtained from the exact likelihood. Thus, the approximation does not generate additional asymptotic variance. As a result, there is no loss of efficiency when maximizing the approximate likelihood instead of the exact (but uncomputable) one.
Appendix A.2 provides conditions ensuring consistency and asymptotic normality of exact maximum likelihood estimators. Further, Appendix B indicates how an approximation to the asymptotic variance-covariance matrix $\Sigma_0$ can be constructed using our approximate filter.

5. Filter computation. We provide a convenient algorithm for evaluating the filter approximation $E^I(\theta;\tau,g)$. If the process $V$ is Markovian, then the probabilities $P^I_\theta(i_0,\dots,i_{m_T})$ can be written as
$$\mathbb{E}^*_\theta\Big[\mathbf{1}_{A^{i_0}}(V_0)\,\mathbb{P}^*_\theta\big[\forall j\in\{1,\dots,m_T\}\ \forall t\in(T^{j-1},T^j]: V_t\in A^{i_j}\,\big|\,\mathcal{D}_\tau,V_0\big]\,\Big|\,\mathcal{D}_\tau\Big].$$
We propose to approximate $P^I_\theta(i_0,\dots,i_{m_T})$ by the following interpolation:
$$\hat P^I_\theta(i_0,\dots,i_{m_T})=\mathbb{P}^*_\theta\big[V_0\in A^{i_0}\,\big|\,\mathcal{D}_\tau\big]\prod_{j=1}^{m_T}\mathbb{P}^*_\theta\big[V_{T^j}\in A^{i_j}\,\big|\,\mathcal{D}_\tau,\,V_{T^{j-1}}=v^{i_{j-1}}\big].$$
If $\max_{1\le j\le m_T}|A^{i_j}|\approx 0$, then $P^I_\theta(i_0,\dots,i_{m_T})\approx \hat P^I_\theta(i_0,\dots,i_{m_T})$. The following proposition formalizes this statement.

Proposition 5.1 (Estimation of transition probabilities). Suppose the conditions of Theorem 4.1 apply and assume that $V$ is Markovian. Let $(I^n)_{n\ge 1}$ be a sequence of implementations guaranteeing the convergence of the approximate filter. If $m^n_T\to\infty$ as $n\to\infty$, then
$$\Big|\hat P^{I^n}_\theta(i_0,\dots,i_{m^n_T})-P^{I^n}_\theta(i_0,\dots,i_{m^n_T})\Big|\to 0$$
as $n\to\infty$, $P_\theta$-almost surely.

Proposition 5.1 implies that the quantities $\hat P^I_\theta(i_0,\dots,i_{m_T})$ are consistent estimators of the probabilities $P^I_\theta(i_0,\dots,i_{m_T})$. An immediate consequence is that the general convergence results of Section 4.2 are not altered if we compute the filter approximation $E^I(\theta;\tau,g)$ using $\hat P^I_\theta(i_0,\dots,i_{m_T})$ instead of $P^I_\theta(i_0,\dots,i_{m_T})$. Let
$$F^{I,j}_\theta(i)=\exp\left(\sum_{T_n\in(T^{j-1},T^j]}\log\frac{\Lambda(v^i;\theta)\,\pi(U_n,v^i;\theta)}{\pi^*(U_n)}+\big(1-\Lambda(v^i;\theta)\big)\big(T^j-T^{j-1}\big)\right)$$
for $1\le j\le m_T$ and $1\le i\le m_S$. Then $F^I_\theta(i_1,\dots,i_{m_T})$ can be rewritten as $\prod_{j=1}^{m_T}F^{I,j}_\theta(i_j)$. Similarly, define the matrices $\hat p^{I,j}_\theta$ for $1\le j\le m_T$ with components
$$\hat p^{I,j}_\theta(k,l)=\mathbb{P}^*_\theta\big[V_{T^j}\in A^k\,\big|\,\mathcal{D}_\tau,\,V_{T^{j-1}}=v^l\big],$$
as well as the vector $\hat p^{I,0}_\theta(k)=\mathbb{P}^*_\theta\big[V_0\in A^k\,\big|\,\mathcal{D}_\tau\big]$ for $1\le k,l\le m_S$. We can write
$$\hat P^I_\theta(i_0,\dots,i_{m_T})=\prod_{j=1}^{m_T}\hat p^{I,j}_\theta(i_j,i_{j-1})\,\hat p^{I,0}_\theta(i_0).$$
It follows that the approximation
$$\hat E^I(\theta;\tau,g)=\sum_{i_0=1}^{m_S}\cdots\sum_{i_{m_T}=1}^{m_S}g\big(v^{i_{m_T}},\tau\big)\prod_{j=1}^{m_T}F^{I,j}_\theta(i_j)\,\hat p^{I,j}_\theta(i_j,i_{j-1})\,\hat p^{I,0}_\theta(i_0)$$
shares the same convergence properties as $E^I(\theta;\tau,g)$ in Theorem 4.1. Conveniently, $\hat E^I(\theta;\tau,g)$ can be calculated by a series of matrix multiplications. We have implemented our estimator in R. The code can be downloaded at http://people.bu.edu/gas. We summarize the steps.

Algorithm 5.2. For a given $\theta\in\Theta$ do
(i) Initialization: Let $\xi\in\mathbb{R}^{m_S}$ with $\xi(i)=\hat p^{I,0}_\theta(i)$ for $1\le i\le m_S$.
(ii) Iteration: For $j=1,\dots,m_T-1$ do
  (a) Define $\hat Q_j\in\mathbb{R}^{m_S\times m_S}$ with elements $F^{I,j}_\theta(k)\,\hat p^{I,j}_\theta(k,l)$.
  (b) Update the vector $\xi$ to $\xi\leftarrow\hat Q_j\cdot\xi$.
(iii) Termination:
  (a) Define $\hat Q_{m_T}\in\mathbb{R}^{m_S\times m_S}$ with elements $g(v^k,\tau)\,F^{I,m_T}_\theta(k)\,\hat p^{I,m_T}_\theta(k,l)$.
  (b) Update the vector $\xi$ to $\xi\leftarrow\hat Q_{m_T}\cdot\xi$.
  (c) Compute $\hat E^I(\theta;\tau,g)=\sum_{i=1}^{m_S}\xi(i)$.

Algorithm 5.2 involves $m_T$ multiplications of an $m_S\times m_S$ matrix with a vector of length $m_S$, $m_T\,m_S^2$ scalar multiplications to form the matrices $\hat Q_j$, and one summation of length $m_S$ in order to compute $\hat E^I(\theta;\tau,g)$. Therefore, the computational effort is of order $O(m_T\,m_S^2)$. It grows quadratically with the fineness of the state space discretization, and linearly with the fineness of the time discretization.

6. Numerical results. This section illustrates our estimator and contrasts it with a standard alternative. We take the intensity (1) as
$$(13)\qquad \Lambda(v;\alpha)=\alpha_1 v_1+\alpha_2 v_2+\alpha_3 e^{-\alpha_4 v_5}v_3$$
for $v=(v_1,v_2,v_3,v_4,v_5)\in R_V=\mathbb{R}^5_+$ and $\alpha=(\alpha_1,\alpha_2,\alpha_3,\alpha_4)\in\Theta_N=\mathbb{R}^4_+$. The vector $V=(V^1,\dots,V^5)$ of explanatory factors satisfies the stochastic differential equation
$$dV_t=\mu(V_t;\gamma)\,dt+\sigma(V_t;\gamma)\,dW_t+R(V_{t-};\gamma)\,dN_t$$
for a two-dimensional standard Brownian motion $W=(W^1,W^2)$, $v=(v_1,\dots,v_5)\in R_V$, $\gamma=(\gamma_1,\dots$
$,\gamma_8)\in\Theta_V=\mathbb{R}\times\mathbb{R}^3_+\times\mathbb{R}\times\mathbb{R}^3_+$, and
$$(14)\qquad \mu(v;\gamma)=\Big(\big(\gamma_1+\tfrac{\gamma_2^2}{2}\big)\big(v_1-\gamma_3 e^{-\gamma_4 v_5}v_4\big)-\gamma_3\gamma_4 e^{-\gamma_4 v_5}v_4,\ \big(\gamma_5+\tfrac{\gamma_6^2}{2}\big)v_2,\ 0,\ 0,\ 1\Big)^{\top}$$
$$(15)\qquad \sigma(v;\gamma)=\begin{pmatrix}\gamma_2\big(v_1-\gamma_3 e^{-\gamma_4 v_5}v_4\big)&0\\0&\gamma_6 v_2\\0&0\\0&0\\0&0\end{pmatrix}$$
$$(16)\qquad R(v;\gamma)=\big(\gamma_3,\,0,\,e^{\gamma_7 v_5},\,e^{\gamma_8 v_5},\,0\big)^{\top}.$$
The marks are irrelevant; to fully specify the Radon-Nikodym derivative (4) we take $R_U=(0,1)$ and $\pi(u,v;\beta)=\pi^*(u)=1$. Theorem V.3.7 in Protter (2004) guarantees the existence of a unique, non-explosive solution $(V,N)$ to the system specified above.5 We take $\gamma_7=\alpha_4$ and $\gamma_8=\gamma_4$, and see that $V$ takes the form
$$(17)\qquad V^1_t=V^1_0\exp\big(\gamma_1 t+\gamma_2 W^1_t\big)+\gamma_3\int_0^t e^{-\gamma_4(t-u)}\,dN_u,$$
$$(18)\qquad V^2_t=V^2_0\exp\big(\gamma_5 t+\gamma_6 W^2_t\big),$$
$$(19)\qquad V^3_t=\int_0^t e^{\alpha_4 u}\,dN_u,\qquad V^4_t=\int_0^t e^{\gamma_4 u}\,dN_u,\qquad V^5_t=t.$$
The factor $V^1$ follows a jump-diffusion with jumps driven by $N$; the impact of a jump decays exponentially with time at rate $\gamma_4$. The factor $V^2$ follows a geometric Brownian motion. The factors $V^3$ and $V^4$ are pure-jump processes driven by $N$, and $V^5$ is time.

5 To see this, consider the SDE for $(V,N)$, which is driven by the semimartingales $W$ and $N-\int_0^\cdot\lambda_s\,ds$. The coefficients of this SDE are functional Lipschitz in the sense of the definition in Section V.3 of Protter (2004). Thus, Theorem V.3.7 in Protter (2004) applies.

The data structure is as follows. The factors $V^3$, $V^4$, and $V^5$ are observed at any time. The factor $V^1$ is observed only at times $S=\{\tfrac{k}{12}:k\in\mathbb{N}\}$, while $V^2$ is a completely unobserved frailty. We have $d_X=3$, $d_Y=d_Z=1$, $X=(V^3,V^4,V^5)$, $Y=V^1$, and $Z=V^2$. We take $\theta_0=(\alpha_{1,0},\alpha_{2,0},\alpha_{3,0},\alpha_{4,0},\gamma_{1,0},\gamma_{2,0},\gamma_{3,0},\gamma_{4,0},\gamma_{5,0},\gamma_{6,0})$ with $\alpha_{1,0}=3$, $\alpha_{2,0}=3$, $\alpha_{3,0}=2$, $\alpha_{4,0}=2.5$, $\gamma_{1,0}=-0.5$, $\gamma_{2,0}=1$, $\gamma_{3,0}=0.2$, $\gamma_{4,0}=5$, $\gamma_{5,0}=-0.125$, and $\gamma_{6,0}=0.5$. For simplicity, we take $V_0=(2.8,1,0,0,0)$, such that the starting value $V_0$ of the explanatory factor $V$ is non-random.
These values are roughly consistent with the fitted parameter values of a related model of bankruptcy timing in Azizpour, Giesecke and Schwenkler (2014). (We have considered alternative values, but found no significant difference in the results.) We take $\Theta=[0,20]^4\times[-20,20]\times[0,20]^3\times[-20,20]\times[0,20]\subset\Theta_N\times\Theta_V$. Time is measured in years and the horizon is $\tau=40$. We generate realizations of the counting process $N$ under the model (17)-(19) parametrized by $\theta_0$ using the discretization method described and analyzed by Giesecke and Teng (2012). We take 200,000 time steps. The numerical results reported below are based on an implementation in R, running on an 8-core Opteron server (31.5 GB RAM, 10 GB swap) at Stanford University with an Ubuntu 11.04 operating system. The codes used in this section can be downloaded at http://people.bu.edu/gas.

6.1. Accuracy of filter approximation. We begin by evaluating the accuracy of the filter approximation. The implementation sequence $I^n$ of Theorem 4.1 is fixed as follows. The factor $X$ is observed at any time, hence there is no need to discretize this component of $V$. On the other hand, we can discretize the state spaces of $Y$ and $Z$ separately since these components are $P^*$-independent. Let $S_\tau=\{\tfrac{i}{12}:1\le i\le 12\tau\}$ and $m^n_S=n^2$. For the factor process $Y$ we set
$$\Delta^n_Y=\Big(\max_{s\in S_\tau}Y_s+n-\frac{\min_{s\in S_\tau}Y_s}{\sqrt{n}}\Big)\Big/\big(m^n_S-1\big),$$
$y^{n,j}=(\min_{s\in S_\tau}Y_s)/\sqrt{n}+(j-1)\Delta^n_Y$, and $A^{n,j}_Y=[y^{n,j}-\Delta^n_Y,\,y^{n,j}]$ for $1\le j\le m^n_S$. This grid for $Y$ covers the entire range of observed values of this factor. For the frailty $Z$ we choose $\Delta^n_Z=(Q_Z+n-q_Z/\sqrt{n})/(m^n_S-1)$ for $Q_Z$ and $q_Z$ the theoretical 95% and 5% quantiles, respectively, of $Z_\tau$. Set $z^{n,j}=q_Z/\sqrt{n}+(j-1)\Delta^n_Z$ and $A^{n,j}_Z=[z^{n,j}-\Delta^n_Z,\,z^{n,j}]$ for $1\le j\le m^n_S$. Thus, the range of $Z$ is covered by the state space discretization with at least 90% probability. This implementation satisfies Conditions (B2), (B3), (B5), and (B6) of Theorem 4.1.
We consider several time discretizations $\{T^j\}_{j\ge 1}$, including monthly (12 time steps per year), weekly (50 time steps) and daily (250 time steps) grids. Each of these grids is supplemented with the times in $S_\tau$ at which $Y$ is observed. The final time discretizations are given by $\mathcal{T}=S_\tau\cup\{T^j\}_{1\le j\le m_T}$. It follows that Conditions (B1) and (B4) of Theorem 4.1 are satisfied. Since the parameter space $\Theta$ is a compact subset of $\mathbb{R}^{10}$, our model choice guarantees that Assumptions (B1) through (B4) in Theorem 4.1 are satisfied as well. Note that the value of $n$ specifies the fineness and the range of the state-space discretization, and has no effect on the time discretization. The convergence of Theorem 4.1 is with respect to the state-space grid becoming finer and more extensive as $n\to\infty$.

[Figure 1 omitted: relative error against $n$ for the daily, weekly, and monthly time discretizations.] Fig 1. Relative error of the approximate (partial) log-likelihood $\log\hat E^{I^n}(\theta_0;\tau,1)$, averaged over 10 different realizations of $\mathcal{D}_\tau$ generated from the model $\theta_0$, for each of several time discretizations.

For a given $\mathcal{T}$, define the interpolated factor and frailty processes by $\bar Y_t=Y_{\bar T(t)}$ and $\bar Z_t=Z_{\bar T(t)}$ for $\bar T(t)=\max\{u\in\mathcal{T}:u\le t\}$. For $\bar V=(\bar Y,\bar Z,V^3,V^4,V^5)$, let $\bar M_\tau(\theta)$ be given by
$$\prod_{n\le N_\tau}\frac{\pi^*(U_n)}{\pi(U_n,\bar V_{T_n-};\theta)}\ \exp\Big(-\int_0^\tau\log\Lambda(\bar V_{s-};\theta)\,dN_s-\int_0^\tau\big(1-\Lambda(\bar V_s;\theta)\big)\,ds\Big).$$
Thus, $\mathbb{E}^*[1/\bar M_\tau(\theta)\mid\mathcal{D}_\tau]$ is the filter obtained by interpolating $Y$ and $Z$. Theorem 4.1 implies that for this $\mathcal{T}$, the approximate filter $\hat E^{I^n}(\theta;\tau,1)$ in Section 5 converges to $\mathbb{E}^*_\theta[1/\bar M_\tau(\theta)\mid\mathcal{D}_\tau]$ as $n\to\infty$. We examine this convergence at the parameter $\theta_0$. For a given $\mathcal{T}$ and realization of the data $\mathcal{D}_\tau$, we compute the (partial) log-likelihood approximation $\log\hat E^{I^n}(\theta_0;\tau,1)$ using Algorithm 5.2.
We contrast this approximation with a Monte Carlo estimate of $\log\mathbb{E}^*_{\theta_0}[1/\bar M_\tau(\theta_0)\mid\mathcal{D}_\tau]$, obtained from 100,000 sample paths of $\bar Y$ and $\bar Z$ drawn from their conditional $P^*_{\theta_0}$-distributions given $\mathcal{D}_\tau$. Figure 1 shows the relative error of the approximation, averaged over 10 alternative realizations of $\mathcal{D}_\tau$ generated from the model $\theta_0$, as a function of $n$. The 95% confidence bands of the relative error, obtained by resampling, are extremely tight, and are therefore not indicated. For coarse state-space discretizations (small $n$), the approximation tends to understate the filter. This effect is more pronounced for finer time discretization grids because of the propagation of errors through the time grid: in finer grids, the approximate filter is updated more frequently in Step (ii)(a) of Algorithm 5.2. For fine state-space discretizations (large $n$), the approximation tends to slightly overstate the filter, especially for coarser time discretization grids. We have studied the relative errors associated with the monthly and weekly grids also for $n>40$. The errors converge, albeit slowly.

As discussed after Proposition 5.1, if the time discretization also becomes finer as $n\to\infty$, then the approximate filter $\hat E^{I^n}(\theta;\tau,1)$ converges to the true filter $E(\theta;\tau,1)$. Figure 1 provides numerical support for this result. The smallest average relative error for $n=40$ occurs when the finest time grid is used. Since the Monte Carlo filter converges to $E(\theta_0;\tau,1)$ in $P_{\theta_0}$-probability as the time grid becomes finer, the fact that the relative error decreases with the fineness of the time grid for large $n$ numerically supports the convergence established by Proposition 5.1.

The approximation provides a substantial reduction of computational effort relative to Monte Carlo. The Monte Carlo estimation of the filter requires roughly 10 minutes for the monthly grid, 1 hour for the weekly grid, and almost 5 hours for the daily time grid.
The corresponding computation times of Algorithm 5.2, averaged across the values of $n$ we have considered, are 1.3 minutes, 9 minutes, and 56 minutes.

6.2. Accuracy of parameter estimates. Next we examine the accuracy of our parameter estimates. For each of 10 alternative realizations of the data $\mathcal{D}_\tau$ generated from the model $\theta_0$, we maximize the approximate log-likelihood function $\log\hat E^{I^n}(\theta;\tau,1)+\log L^*_F(\theta;\tau)$, taking a weekly time discretization and using the Nelder-Mead algorithm in R.6 The optimization is initialized at the value $\theta_0(1+\varepsilon)$, where $\varepsilon$ is drawn from a standard normal distribution. If that value lies outside of $\Theta$, we take its orthogonal projection onto the boundary of $\Theta$. Below we examine an alternative, uniform sampling rule for the initial value.

6 Since $Y$ has the explicit solution (17), the corresponding log-likelihood is given by
$$\log L^*_F(\theta;\tau)=-\sum_{i=1}^{|\mathcal{T}|}\frac{\bigg(\log\dfrac{Y_{s_i}-\gamma_3\int_0^{s_i}e^{-\gamma_4(s_i-u)}\,dN_u}{Y_{s_{i-1}}-\gamma_3\int_0^{s_{i-1}}e^{-\gamma_4(s_{i-1}-u)}\,dN_u}-\gamma_1(s_i-s_{i-1})\bigg)^2}{2\gamma_2^2(s_i-s_{i-1})}-\sum_{i=1}^{|\mathcal{T}|}\log\bigg(\Big(Y_{s_i}-\gamma_3\int_0^{s_i}e^{-\gamma_4(s_i-u)}\,dN_u\Big)\gamma_2\sqrt{s_i-s_{i-1}}\bigg).$$

[Figure 2 omitted: panels of absolute errors against $n$ for the intensity parameters (factor, frailty, and event sensitivities, event decay), the factor $Y$ parameters (mean rate, volatility, event sensitivity, event decay), and the frailty $Z$ parameters (mean rate, volatility); legend: filtered MLE, EM algorithm, initial parameters.] Fig 2. Absolute error of a parameter estimate, averaged over the errors associated with 10 alternative realizations of $\mathcal{D}_\tau$ generated from the model $\theta_0$, as a function of $n$, the fineness and range of the state space discretization (solid line). Also shown are the parameter estimates obtained using the stochastic EM Algorithm (dashed line). The dotted line shows the average absolute error of the initial parameter values.

Proposition 4.3 guarantees the consistency of the estimator. Hence we can expect the estimates to be close to $\theta_0$. Therefore, we measure the accuracy of the estimates in terms of the absolute error with respect to $\theta_0$. Figure 2 shows the absolute error of each estimate, averaged over the errors associated with 10 alternative realizations of $\mathcal{D}_\tau$, as a function of $n$ parametrizing the fineness and range of the state space discretization. The accuracy of the estimate tends to increase with $n$, although the convergence of the error may be slow. All fitted values lie in the interior of the parameter space.

To provide some perspective, we also consider the parameter estimates generated by the stochastic EM Algorithm of Celeux and Diebolt (1992). In the E-step we sample 10,000 paths of $Y$ and $Z$ based on their conditional law given the data $\mathcal{D}_\tau$. To facilitate a comparison with our estimates, we sample these paths on a weekly time grid. In the M-step, the complete data partial likelihood functions of $V$ and $N$ are maximized separately. The optimization using the Nelder-Mead algorithm is difficult; the convergence is extremely slow and we often obtain boundary optima. Since the gradient and the Hessian matrix of the complete data likelihood are available analytically, we implement the optimization using the BFGS quasi-Newton method in R. The EM Algorithm is initialized at the same parameter as our scheme. We stop it when the M-step increases the likelihood by less than 0.5, a criterion consistent with, e.g., Chan and Ledolter (1995). The resulting absolute errors of the parameter estimates are shown in Figure 2 by the dashed lines (note that the value of $n$ is irrelevant).
We have considered alternative implementations, including performing a larger number of simulation trials and using a tighter stopping criterion, but found no significant difference in the results.

Our estimates of the factor sensitivity $\alpha_1$, frailty sensitivity $\alpha_2$, and decay rate $\alpha_4$ of the intensity, and of the volatility $\gamma_2$, event sensitivity $\gamma_3$, and decay rate $\gamma_4$ of the factor $Y$, are more accurate than those of the EM Algorithm. Our estimate of the event sensitivity $\alpha_3$ of the intensity is roughly as accurate as the EM estimate, especially for larger $n$.

At first sight, the EM Algorithm seems to deliver more accurate estimates of the mean rate $\gamma_1$ of the factor, and of the mean rate $\gamma_5$ and volatility $\gamma_6$ of the frailty. A closer look at how the EM Algorithm estimates these parameters leads to a different conclusion. Since the EM Algorithm maximizes the complete data partial likelihoods of $V$ and $N$ separately, the mean and the volatility of the factor $Y$ and the frailty $Z$ are estimated by matching their first two sample moments. Because of this, the estimation of these parameters is especially sensitive to the initial parameter values. This is indicated by the location of the dotted line in Figure 2, which represents the average absolute error of the initial parameter with respect to $\theta_0$. The EM estimates of $\gamma_1$, $\gamma_5$, and $\gamma_6$ hardly deviate from the respective initial values, consistent with the fact that the EM estimator is only guaranteed to converge to a local optimum of the likelihood (see Wu (1983) and Dembo and Zeitouni (1986)). This may be an issue in practice when $\theta_0$ is unknown. We have investigated this further by analyzing the estimates obtained from initial parameter values sampled from a uniform distribution on the parameter space $\Theta$. Unlike our method, the EM Algorithm was found to converge extremely slowly when the initial parameters were very different from $\theta_0$, often resulting in inappropriate estimates.
Another issue with the separate maximization of the complete data partial likelihood functions of $N$ and $V$ in the EM Algorithm is that the interaction of the events and the factor is ignored. As a result, the EM Algorithm faces difficulties in differentiating between diffusive movements and jumps of $Y$. It tends to converge to boundary solutions that imply an event sensitivity $\gamma_3$ of $Y$ equal or close to zero, a decay rate $\gamma_4$ equal or close to 20, and estimates of the volatility $\gamma_2$ that are too high. Boundary optima are an issue for asymptotic results, which apply only to optima in the interior of the parameter space. Moreover, in order to compensate for understating $\gamma_3$, the EM Algorithm tends to overstate the role of the factor $Y$ in the intensity. This interpretation is consistent with the large errors of the EM estimates for the factor sensitivity and event decay of the intensity, and the event sensitivity and decay of the factor. In these cases, the errors are much larger than the errors of the initial values.

In order to shed more light on this issue, we analyze how the two methods address the filtering problem. We compare the fitted filtered factor $E_\theta[Y_t\mid\mathcal{D}_t]$ to the corresponding monthly observations, and the fitted filtered intensity $h_t=E_\theta[\lambda_t\mid\mathcal{D}_t]$ to the actual number of events per year. Here, $\mathcal{D}_t$ represents the data observed during $[0,t]$. For a given realization of the data, Figure 3 shows the approximations to these filters, given by $\hat E^{I^n}(\theta;t,g)/\hat E^{I^n}(\theta;t,1)$ with $g(v,t)=(1,0,0,0,0)\cdot v$ and $g(v,t)=\Lambda(v;\theta)$, respectively, evaluated at the corresponding estimator $\theta=\hat\theta_\tau$. For that realization, Figure 3 also shows Monte Carlo approximations to these filters, obtained using 10,000 trials and evaluated at the EM estimator of $\theta$. The confidence bands were calculated using subsampling with group size 100.
As expected, in the case of our method the chosen value of $n$ has a large impact on the accuracy of the filter. In accordance with Theorem 4.1, the larger $n$, the finer and more extensive the state space grid, and the closer the filters track the observed values of the factor and the event counts. If the state space discretization is too coarse, then the approximation to the filtered factor fluctuates between few values, and the filtered intensity $h$ understates the actual number of events. As $n$ increases, the filtered paths track the observations better. The Monte Carlo approximation to the factor filter, which is based on EM parameter estimates, tracks the factor well, with occasionally wide confidence bands. Nevertheless, the corresponding Monte Carlo approximation to the filtered intensity $h$ severely understates the actual event counts.7 We have found similar issues for all other realizations of the data $\mathcal{D}_\tau$ that we have considered.

We examine the implications of the estimation errors for model fit. Proposition D.1 of Azizpour, Giesecke and Schwenkler (2014) implies that, after a change of time given by the P-compensator $\int_0^t h_s\,ds$ relative to the right-continuous and complete filtration generated by $\sigma(\mathcal{D}_t)$, the counting process $N$ is a standard P-Poisson process in the time-changed filtration. We test whether the fitted filtered intensity $h$ time-scales the realized $N$ correctly. Figure 4 provides a QQ plot of the time-changed inter-arrival times for each of several fitting methods (95% confidence bands of the order statistics of the exponential distribution are included). We see that, as $n$ increases, the model estimated using our scheme fits the event data better. Already with $n=20$ we obtain an acceptable fit. The model estimated using the EM Algorithm, for which the filtered intensity $h$ is approximated by Monte Carlo simulation, fails the test. This is a consequence of the inaccuracy of the EM estimates of some of the model parameters.

7 The confidence bands of the simulated filtered intensity are very tight, and we do not display them for the sake of clarity.

[Figure 3 omitted: four rows of panels (Algorithm 4.5 with $n=3/5$, $n=10$, $n=20$, and the simulated filtered path with 95% confidence bands) comparing (a) the fitted filtered factor $Y$ to the actual path of the factor and (b) the fitted filtered intensity to the actual number of events; time in years on the horizontal axes.] Fig 3. Fitted filtered factor and intensity vs. actual factor and event counts.

[Figure 4 omitted: QQ scatter for the filtered MLE with Algorithm 4.5 ($n=3,10,20$) and the EM Algorithm, together with the diagonal line and 95% confidence bands.] Fig 4. Goodness-of-fit tests: QQ plots of the quantiles of the time-changed inter-arrival times vs. theoretical standard exponential quantiles.

6.3. Asymptotic properties of parameter estimates. Finally we examine the behavior of the estimators as the sample horizon $\tau$ increases. In the setting of the previous analyses, we calculate the approximate asymptotic variance-covariance matrix $\hat\Sigma_0$ of our estimators according to (25)-(27) at the true parameter vector $\theta_0$. In order to satisfy the conditions of Proposition 4.3, for the implementation we let $n$ vary with $\tau$ as in $n=n(\tau)=\sqrt{\tau}$, and set $m^n_S=\big(3+\tfrac{\tau}{10}\big)^2$. Otherwise, the state space grid is constructed as described above. We continue to use a weekly time discretization.

Figure 5 shows the asymptotic standard errors for each parameter (solid lines), for a given realization of $\mathcal{D}_\tau$. As $\tau$ increases and more data is included in the estimation, the errors decrease, albeit slowly in some cases. The error behaves similarly for the other realizations of $\mathcal{D}_\tau$ we have considered (the results for these are available upon request).

For comparison, Figure 5 also indicates the asymptotic standard errors of the EM estimates (dashed lines). These are calculated using the following approximation of the theoretical maximum likelihood asymptotic variance-covariance matrix:
$$\frac{\zeta^{EM}}{\tau}\Big(\nabla^2 Q(\theta,\theta_0)\big|_{\theta=\theta_0}\Big)^{-1}\nabla Q(\theta,\theta_0)\big|_{\theta=\theta_0}\,\nabla Q(\theta,\theta_0)\big|_{\theta=\theta_0}^{\top}\Big(\nabla^2 Q(\theta,\theta_0)\big|_{\theta=\theta_0}\Big)^{-1},$$
where $Q(\theta,\theta_0)$ is the complete data likelihood evaluated at $\theta$ through simulations with respect to the conditional $P^*_{\theta_0}$-distribution given $\mathcal{D}_\tau$, and $\zeta^{EM}$ is an approximation of the fraction of missing data as in Nielsen (2000).

[Figure 5 omitted: panels of asymptotic standard errors against $\tau$ for the intensity parameters (factor, frailty, and event sensitivities, event decay), the factor $Y$ parameters (mean rate, volatility, event sensitivity, event decay), and the frailty $Z$ parameters (mean rate, volatility); legend: filtered MLE with Algorithm 4.5, EM algorithm.] Fig 5. Asymptotic standard errors of our estimates and EM estimates plotted against $\tau$, the end of the observation period. The y-axes are in log-scale.

We find that the standard errors of the EM estimates of the factor sensitivity $\alpha_1$, the event sensitivity $\alpha_3$, and the event decay $\alpha_4$ of the intensity, as well as of the factor mean rate $\gamma_1$ and volatility $\gamma_2$, exceed those of our estimates. On the other hand, the standard errors of the EM estimates for the frailty parameters $\gamma_5$ and $\gamma_6$ are smaller than those of our estimates. Both methods achieve similar standard errors for the frailty sensitivity $\alpha_2$ of the intensity and the event sensitivity $\gamma_3$ and decay $\gamma_4$ of the factor $Y$. In many cases, the loss in efficiency due to simulation in the stochastic EM Algorithm exceeds the loss of efficiency due to the approximate filtering in our method. The loss of efficiency due to filtering, indicated in Proposition A.2, appears to be fairly small in the cases considered. This renders our estimates highly efficient.

APPENDIX A: EXACT LIKELIHOOD

This appendix complements Section 3.
It provides a rigorous definition of the exact likelihood $L_\tau(\theta)$ of the observed data $\mathcal{D}_\tau$ and the conditional likelihood $L^*_F(\theta;\tau)$ of the factor data. It also derives conditions that ensure consistency and asymptotic normality of an exact maximum likelihood estimator $\hat\theta_\tau$, and introduces notation that will be used throughout the proofs given in Appendix C.

A.1. Likelihood. Let $R_{D,\tau}$ denote the range of the data $\mathcal{D}_\tau$ and $\mathfrak{D}_\tau$ the Borel $\sigma$-algebra of $R_{D,\tau}$. We have $R_{D,\tau}=D([0,\tau];\mathbb{N})\times D([0,\tau];R_U)\times D([0,\tau];\mathbb{R}^{d_X})\times R_{Y,\tau}$, where $D(A;B)$ is the Skorokhod space of càdlàg functions mapping the set $A$ onto $B$, and $R_{Y,\tau}=\{y_{i,j}\in\mathbb{R}:i=1,\dots,d_Y,\ j=1,\dots,|S^i\cap[0,\tau]|\}$ is the range of the data $Y_\tau$ of the partially observed factor element. Write $D_{\theta,\tau}$ for the $P_\theta$-law of the data $\mathcal{D}_\tau$ on $(R_{D,\tau},\mathfrak{D}_\tau)$. The likelihood of the data is given by the Radon-Nikodym derivative
$$L_\tau(\theta)=\frac{dD_{\theta,\tau}}{dD_{\theta_0,\tau}},$$
evaluated at the observed data $\mathcal{D}_\tau$.

We now give a precise definition of the conditional likelihood $L^*_F(\theta;\tau)$ appearing in Theorem 3.1. Let $D^*_{\theta,\tau}$ denote the $P^*$-law of the data $\mathcal{D}_\tau$ on $(R_{D,\tau},\mathfrak{D}_\tau)$. Write $D^*_{E,\theta,\tau}$ for the $P^*$-law of the event data $(N_\tau,U_\tau)$ on $\big(D([0,\tau];\mathbb{N})\times D([0,\tau];R_U),\,\sigma(D([0,\tau];\mathbb{N})\times D([0,\tau];R_U))\big)$, and $D^*_{F,\theta,\tau}$ for the conditional $P^*_\theta$-law of the factor data $(X_\tau,Y_\tau)$ given the event data $(N_\tau,U_\tau)$ on $\big(D([0,\tau];\mathbb{R}^{d_X})\times R_{Y,\tau},\,\sigma(D([0,\tau];\mathbb{R}^{d_X})\times R_{Y,\tau})\big)$. The conditional likelihood $L^*_F(\theta;\tau)$ of the factor data $(X_\tau,Y_\tau)$ given the event data $(N_\tau,U_\tau)$ under $P^*$ is the Radon-Nikodym derivative
$$(20)\qquad L^*_F(\theta;\tau)=\frac{dD^*_{F,\theta,\tau}}{dD^*_{F,\theta_0,\tau}},$$
evaluated at the observed data.

A.2. Asymptotic properties of exact likelihood estimators. The representation of the likelihood obtained in Theorem 3.1 facilitates the analysis of the asymptotic properties of an exact MLE $\hat\theta_\tau$ as $\tau\to\infty$.

Proposition A.1 (Consistency). Suppose Assumptions (A1)-(A2) hold. In addition, suppose that:

(A3) For any $\theta\in\Theta$, it holds $P_\theta$-almost surely that $L^*_F(\theta;\tau)>0$.
(A4) The true parameter $\theta_0$ is the unique maximizer of the asymptotic likelihood, i.e.,
$$\arg\sup_{\theta\in\Theta}\lim_{\tau\to\infty}\frac{1}{\tau}L_\tau(\theta)=\{\theta_0\}.$$

Then every $\hat\theta_\tau$ solving (8) is consistent: $\hat\theta_\tau\to\theta_0$ as $\tau\to\infty$ in $P_{\theta_0}$-probability.

Consistency follows from Lemma 4.1 in Section 1.4.3 of Ibragimov and Has'minskii (1981) after establishing that the filtering in (5) guarantees the smoothness of the likelihood function. Condition (A4) is a standard identifiability condition. Condition (A3) is mild and implies both the equivalence of the laws of the data $\mathcal{D}_\tau$ under $P_\theta$ and $P_{\tilde\theta}$ for any $\theta,\tilde\theta\in\Theta$, as well as the stochastic equicontinuity of the likelihood function. It is a restriction on the parameter space $\Theta$, since it requires that the likelihood of the observed factor data be strictly positive at any parameter $\theta\in\Theta$. Thus, consistency is guaranteed as long as the parameter space is chosen adequately.

We now consider asymptotic normality. We make some assumptions to simplify the exposition. Suppose that $S^i=S^j$ and $0\in S^j$ for all $1\le i,j\le d_Y$. Hence, all partially observed factors are observed on the same dates, including at time 0. Suppose $S_\tau=S\cap[0,\tau]$ takes the form $\{s_0,s_1,\dots,s_{m(\tau)}\}$ for $0=s_0<\cdots<s_{m(\tau)}=\tau$. Suppose the limit $\zeta=\lim_{\tau\to\infty}m(\tau)/\tau$ exists and is strictly positive. Define the random functions
$$(21)\qquad \theta\in\Theta\mapsto L^*_{E,i}(\theta)=L^*_E(\theta;s_i)/L^*_E(\theta;s_{i-1}),$$
$$(22)\qquad \theta\in\Theta\mapsto L^*_{F,i}(\theta)=L^*_F(\theta;s_i)/L^*_F(\theta;s_{i-1}),$$
for $i=1,\dots,m(\tau)$, as well as $\theta\in\Theta\mapsto L^*_{E,0}(\theta)=1$ and $\theta\in\Theta\mapsto L^*_{F,0}(\theta)=L^*_F(\theta;0)$. It follows that $L^*_E(\theta;\tau)=\prod_{i=0}^{m(\tau)}L^*_{E,i}(\theta)$ and $L^*_F(\theta;\tau)=\prod_{i=0}^{m(\tau)}L^*_{F,i}(\theta)$. Write $\theta=(\theta_1,\dots,\theta_p)$ for some $p\in\mathbb{N}$, $\partial_v=\frac{\partial}{\partial\theta_v}$, $\partial_{v,w}=\frac{\partial^2}{\partial\theta_v\partial\theta_w}$, and $\partial_{u,v,w}=\frac{\partial^3}{\partial\theta_u\partial\theta_v\partial\theta_w}$ for any $1\le u,v,w\le p$, and let $\nabla$, $\nabla^2$ denote the gradient and the Hessian matrix, respectively, with respect to $\theta$.

Proposition A.2 (Asymptotic normality).
suppose that the following conditions hold: In addition to (A1)-(A4), (A5) The functions L∗E,i (θ) and L∗F,i (θ) for i = 0, . . . , m(τ ) are three times continuously differentiable at θ, Pθ -almost surely, for any θ ∈ Θ◦ . FILTERED LIKELIHOOD FOR POINT PROCESSES 29 (A6) For any θ ∈ Θ◦ , 1 ≤ u, v ≤ p, and 0 ≤ i ≤ m(τ ), it holds Pθ -a.s. that 0 = Eθ ∂u log L∗E,i (θ) Dsi−1 = Eθ ∂u log L∗F,i (θ) Dsi−1 , " # " # 2 L∗ (θ) 2 L∗ (θ) ∂u,v ∂u,v E,i F,i 0 = Eθ = Eθ . Ds Ds L∗E,i (θ) i−1 L∗F,i (θ) i−1 (A7) There exist positive constants K1 , K2 , K3 , and K4 such that for all 1 ≤ u, v ≤ p and 0 ≤ i ≤ m(τ ) ! ∗ (θ) 2 h i ∂ L u,v 2 E,i ≤ K2 , Eθ ∂v log L∗E,i (θ) ≤ K1 , Eθ L∗E,i (θ) ! ∗ (θ) 2 h i ∂ L u,v 2 F,i ≤ K4 . Eθ ∂v log L∗F,i (θ) ≤ K3 , Eθ L∗F,i (θ) (A8) For any θ ∈ Θ◦ , the following limits of matrices exist in Pθ -probability and are deterministic: m 1 X ΣE (θ) = lim ∇ log L∗E,i (θ)> ∇ log L∗E,i (θ) ∈ Rp×p , m→∞ m i=0 m 1 X ∇ log L∗F,i (θ)> ∇ log L∗F,i (θ) ∈ Rp×p , ΣF (θ) = lim m→∞ m i=0 m 1 X ΣEF (θ) = lim ∇ log L∗E,i (θ)> ∇ log L∗F,i (θ) ∈ Rp×p . m→∞ m i=0 Θ◦ , (A9) For any θ ∈ the matrix ΣE (θ) + ΣF (θ) is non-singular. ◦ (A10) For any θ ∈ Θ , there exists a neighborhood U (θ) of θ and finite constants GE = GE (θ), GF = GF (θ), such that for any θ̃ ∈ U (θ) and 1 ≤ u, v, w ≤ p " # m 1 X ∗ E lim Pθ ∂u,v,w log LE,i (θ̃) < G = 1, m→∞ m i=0 # " m 1 X ∂u,v,w log L∗F,i (θ̃) < GF = 1. lim Pθ m→∞ m i=0 √ Then every θ̂τ solving (8) is asymptotically normal: τ (θ̂τ − θ0 ) → N (0, Σ0 ) as τ → ∞ in Pθ0 -distribution. Letting Σ(θ0 ) = ΣE (θ0 ) + ΣF (θ0 ), the asymptotic variance-covariance matrix is given by 2 1 Σ0 = Σ(θ0 )−1 + Σ(θ0 )−1 ΣEF (θ0 )Σ(θ0 )−1 . ζ ζ 30 K. GIESECKE AND G. SCHWENKLER Proposition A.2 is a martingale CLT for the score function. 
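The proposition's conditions are stated in terms of the per-interval likelihood ratios defined in (21)-(22). As a small numerical illustration of this telescoping structure (the likelihood values below are hypothetical placeholders, not output of the paper's model), a Python sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-interval likelihood ratios L*_{E,i}(theta), i = 0,...,m,
# with L*_{E,0} = 1 as in the text (illustrative values only).
m = 5
L_inc = np.concatenate(([1.0], rng.uniform(0.5, 2.0, size=m)))

# Telescoping: L*_E(theta; s_k) = prod_{i=0}^{k} L*_{E,i}(theta), so the
# cumulative products recover the likelihood at each sampling date s_k.
L_cum = np.cumprod(L_inc)

# Conversely, each increment is the ratio of consecutive cumulative
# likelihoods: L*_{E,i}(theta) = L*_E(theta; s_i) / L*_E(theta; s_{i-1}).
assert np.allclose(L_cum[1:] / L_cum[:-1], L_inc[1:])

# On the log scale the product becomes a sum of per-interval increments,
# which is the decomposition the martingale argument in Appendix C exploits.
assert np.allclose(np.log(L_cum), np.cumsum(np.log(L_inc)))
```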
Because of the incompleteness of the factor data, the analysis is not completely standard: in our setting, the likelihood ratios are locally asymptotically quadratic but not locally asymptotically mixed normal.^8 Thus, the conventional martingale CLTs of Feigin (1976), Jeganathan (1995), van der Vaart (2000) and others cannot be directly applied in our setting. Our asymptotic analysis uses an argument similar to the one behind Proposition 6.1 of Hall and Heyde (1980), after controlling for the incompleteness of the factor data. Condition (A5) is standard. Condition (A7) is also standard; it imposes regularity on the intensity (1) and the Pθ-law of the factor V. Conditions (A8) and (A9) are necessary to obtain well-defined covariance matrices. Condition (A10) guarantees that a Taylor expansion of the score function is sufficiently accurate. Condition (A6) is an identifiability condition. It implies that θ0 maximizes the asymptotic likelihood.

Proposition A.2 highlights the influence of the partially observed data on the asymptotic variance-covariance matrix Σ0: it is inversely proportional to ζ, the average number of observations of Y per unit of time. We compare with the results of Ogata (1978), who studied likelihood estimators for stationary and ergodic point processes with complete factor data. Consider the Fisher information matrix I(θ0) for the estimation of θ0 = (α0, β0, γ0), given in our setting by
$$(23)\qquad I(\theta_0) = \zeta\big(\Sigma_E(\theta_0) + \Sigma_F(\theta_0)\big).$$
Under the assumptions of stationarity and ergodicity as well as complete factor data, Ogata (1978) shows that Σ0 = I(θ0)^{-1}. We relax these assumptions and allow for incomplete factor data. We find that
$$(24)\qquad \Sigma_0 = I(\theta_0)^{-1} + 2\zeta\, I(\theta_0)^{-1}\Sigma_{EF}(\theta_0)I(\theta_0)^{-1}.$$
Indeed, substituting Σ(θ0)^{-1} = ζ I(θ0)^{-1} into the expression for Σ0 in Proposition A.2 yields (24). The matrix ΣEF(θ0) corrects for the filtering of the unobserved path of the factors. If V is observed perfectly, then one can show that ΣEF(θ0) = 0, producing the result of Ogata (1978).
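The equivalence between (24) and the representation of Σ0 in Proposition A.2 can also be checked numerically. The sketch below uses hypothetical matrices in place of ΣE(θ0), ΣF(θ0), and ΣEF(θ0) (illustrative values, not estimates from data):

```python
import numpy as np

rng = np.random.default_rng(1)
p, zeta = 3, 4.0  # illustrative parameter dimension and sampling frequency

def spd(rng, p):
    """Generate a symmetric positive-definite p x p matrix."""
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

# Hypothetical Sigma_E, Sigma_F (SPD) and a symmetric Sigma_EF.
Sigma_E, Sigma_F = spd(rng, p), spd(rng, p)
B = rng.standard_normal((p, p))
Sigma_EF = 0.5 * (B + B.T)

Sigma = Sigma_E + Sigma_F
Sigma_inv = np.linalg.inv(Sigma)

# Proposition A.2: Sigma_0 = (1/zeta) Sigma^{-1}
#                          + (2/zeta) Sigma^{-1} Sigma_EF Sigma^{-1}.
Sigma_0 = Sigma_inv / zeta + (2.0 / zeta) * Sigma_inv @ Sigma_EF @ Sigma_inv

# Equation (24): Sigma_0 = I^{-1} + 2 zeta I^{-1} Sigma_EF I^{-1},
# with the Fisher information I = zeta * Sigma as in (23).
I_inv = np.linalg.inv(zeta * Sigma)
Sigma_0_alt = I_inv + 2.0 * zeta * I_inv @ Sigma_EF @ I_inv

assert np.allclose(Sigma_0, Sigma_0_alt)

# Complete factor data: Sigma_EF = 0 and Sigma_0 reduces to I^{-1},
# recovering the result of Ogata (1978).
assert np.allclose(Sigma_inv / zeta, I_inv)
```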
We end this section with alternative and more traditional representations of the matrices ΣE(θ0), ΣF(θ0) and ΣEF(θ0).

Corollary A.3. Suppose the assumptions of Proposition A.2 hold. For any θ ∈ Θ◦, it holds Pθ-almost surely that
$$\zeta\Sigma_E(\theta) = -\lim_{\tau\to\infty}\frac{1}{\tau}\nabla^2\log\mathcal{L}^*_E(\theta;\tau) = \lim_{\tau\to\infty}\frac{1}{\tau}\nabla\log\mathcal{L}^*_E(\theta;\tau)^\top\nabla\log\mathcal{L}^*_E(\theta;\tau),$$
$$\zeta\Sigma_F(\theta) = -\lim_{\tau\to\infty}\frac{1}{\tau}\nabla^2\log\mathcal{L}^*_F(\theta;\tau) = \lim_{\tau\to\infty}\frac{1}{\tau}\nabla\log\mathcal{L}^*_F(\theta;\tau)^\top\nabla\log\mathcal{L}^*_F(\theta;\tau),$$
$$\zeta\Sigma_{EF}(\theta) = \lim_{\tau\to\infty}\frac{1}{\tau}\nabla\log\mathcal{L}^*_E(\theta;\tau)^\top\nabla\log\mathcal{L}^*_F(\theta;\tau).$$

^8 See Definitions 1 and 3 of Jeganathan (1995).

APPENDIX B: ASYMPTOTIC VARIANCE-COVARIANCE MATRIX

Proposition 4.3 states that the approximate MLE has the same asymptotic variance-covariance matrix Σ0 as the exact MLE analyzed in Appendix A. Thus, there is no loss of efficiency when maximizing the approximate likelihood rather than the exact likelihood. From Proposition A.2 we know that Σ0 = (1/ζ)Σ(θ0)^{-1} + (2/ζ)Σ(θ0)^{-1}ΣEF(θ0)Σ(θ0)^{-1} for Σ(θ0) = ΣE(θ0) + ΣF(θ0). For finite τ, we propose to approximate Σ0 as follows. We approximate ζ by ζ̂ = m(τ)/τ. For given n, we approximate ΣE(θ0), ΣF(θ0), and ΣEF(θ0) by
$$(25)\qquad \hat\Sigma_E = \frac{1}{m(\tau)}\sum_{i=1}^{m(\tau)}\left(\nabla\log\frac{E^{I^n}(\hat\theta^n_\tau;s_i,1)}{E^{I^n}(\hat\theta^n_\tau;s_{i-1},1)}\right)^\top\nabla\log\frac{E^{I^n}(\hat\theta^n_\tau;s_i,1)}{E^{I^n}(\hat\theta^n_\tau;s_{i-1},1)},$$
$$(26)\qquad \hat\Sigma_F = \frac{1}{m(\tau)}\sum_{i=0}^{m(\tau)}\nabla\log\mathcal{L}^*_{F,i}(\hat\theta^n_\tau)^\top\nabla\log\mathcal{L}^*_{F,i}(\hat\theta^n_\tau),$$
$$(27)\qquad \hat\Sigma_{EF} = \frac{1}{m(\tau)}\sum_{i=1}^{m(\tau)}\left(\nabla\log\frac{E^{I^n}(\hat\theta^n_\tau;s_i,1)}{E^{I^n}(\hat\theta^n_\tau;s_{i-1},1)}\right)^\top\nabla\log\mathcal{L}^*_{F,i}(\hat\theta^n_\tau),$$
respectively. The implementation I^n is chosen so as to achieve some given degree of accuracy in accordance with Theorem 4.1. Finally, we set Σ̂ = Σ̂E + Σ̂F and approximate Σ0 by
$$\hat\Sigma_0 = \frac{1}{\hat\zeta}\hat\Sigma^{-1} + \frac{2}{\hat\zeta}\hat\Sigma^{-1}\hat\Sigma_{EF}\hat\Sigma^{-1}.$$

APPENDIX C: PROOFS

Proof of Theorem 3.1. The measure dP∗θ = Mτ(θ)dPθ is well-defined by Condition (A1) and Theorem VIII.T10 of Brémaud (1980). We have
$$\mathbb{D}_{\theta,\tau}(A) = \mathbb{E}_\theta[1_A] = \mathbb{E}^*_\theta\big[1_A\,\mathbb{E}^*_\theta\big[1/M_\tau(\theta)\mid\mathcal{D}_\tau\big]\big] = \int_A \mathbb{E}^*_\theta\big[1/M_\tau(\theta)\mid\mathcal{D}_\tau\big]\,d\mathbb{D}^*_{\theta,\tau}$$
for A ∈ Dτ.
Consequently, the law Dθ,τ is absolutely continuous with respect to D∗θ,τ with Radon-Nikodym derivative
$$(28)\qquad \frac{d\mathbb{D}_{\theta,\tau}}{d\mathbb{D}^*_{\theta,\tau}} = \mathbb{E}^*_\theta\left[\frac{1}{M_\tau(\theta)}\,\Big|\,\mathcal{D}_\tau\right].$$
Bayes' formula implies that the P∗θ-law of the data is equal to the product of the P∗θ-law of the event data (Nτ, Uτ) and the conditional P∗θ-law of the factor data (Xτ, Yτ) given (Nτ, Uτ):
$$(29)\qquad \mathbb{D}^*_{\theta,\tau} = \mathbb{D}^*_{F,\theta,\tau} \times \mathbb{D}^*_{E,\theta,\tau}.$$
Now, the P∗θ-law of the event data (Nτ, Uτ) is parameter-independent by construction since the counting process N has unit intensity and the event marks have parameter-independent density π∗ under P∗θ. Hence,
$$(30)\qquad \frac{d\mathbb{D}^*_{E,\theta,\tau}}{d\mathbb{D}^*_{E,\theta_0,\tau}} \propto 1.$$
In addition,
$$(31)\qquad \frac{d\mathbb{D}^*_{F,\theta,\tau}}{d\mathbb{D}^*_{F,\theta_0,\tau}} = \mathcal{L}^*_F(\theta;\tau)$$
by Assumption (A2) and the definition of the conditional likelihood L∗F(θ; τ) in (20). Because D∗θ is a product law as in (29), Fubini's theorem together with (30)-(31) implies that
$$\frac{d\mathbb{D}^*_{\theta,\tau}}{d\mathbb{D}^*_{\theta_0,\tau}} = \frac{d\big(\mathbb{D}^*_{F,\theta,\tau}\times\mathbb{D}^*_{E,\theta,\tau}\big)}{d\big(\mathbb{D}^*_{F,\theta_0,\tau}\times\mathbb{D}^*_{E,\theta_0,\tau}\big)} = \frac{d\mathbb{D}^*_{F,\theta,\tau}}{d\mathbb{D}^*_{F,\theta_0,\tau}}\,\frac{d\mathbb{D}^*_{E,\theta,\tau}}{d\mathbb{D}^*_{E,\theta_0,\tau}} \propto \mathcal{L}^*_F(\theta;\tau).$$
We conclude that the likelihood satisfies
$$(32)\qquad \mathcal{L}_\tau(\theta) = \frac{d\mathbb{D}_{\theta,\tau}}{d\mathbb{D}_{\theta_0,\tau}} = \frac{d\mathbb{D}_{\theta,\tau}}{d\mathbb{D}^*_{\theta,\tau}}\,\frac{d\mathbb{D}^*_{\theta,\tau}}{d\mathbb{D}^*_{\theta_0,\tau}}\,\frac{d\mathbb{D}^*_{\theta_0,\tau}}{d\mathbb{D}_{\theta_0,\tau}} \propto \mathbb{E}^*_\theta\left[\frac{1}{M_\tau(\theta)}\,\Big|\,\mathcal{D}_\tau\right]\mathcal{L}^*_F(\theta;\tau).$$
This completes the proof.

Proof of Theorem 4.1. For simplicity, we only consider the case that g ≡ 1. Since Θ is compact by Assumption (B3), the reasoning below remains valid whenever the measurable function g is either continuous or bounded. Define C1τ = max_{θ∈Θ} E∗θ[1/Mτ(θ) | Dτ] and C2τ = max_{θ∈Θ} E∗θ[1/M²τ(θ) | Dτ], which are finite by Assumptions (B3) and (B4). Write
$$\Delta_\tau(\theta) = \big|E^{I^n}(\theta;\tau,1) - E(\theta;\tau,1)\big|$$
for the absolute error generated by the approximation at the parameter θ. By Jensen's inequality, we can bound Δτ(θ) from above by
$$(33)\qquad \mathbb{E}^*_\theta\left[\frac{1_{\{\exists t\in[0,\tau]:\,V_t\notin A^n\}}}{M_\tau(\theta)}\,\Big|\,\mathcal{D}_\tau\right] + \sum_{i_1=1}^{m^n_S}\cdots\sum_{i_{m^n_T}=1}^{m^n_S}\left|F^{I^n}_\theta(i_1,\ldots,i_{m^n_T}) - \mathbb{E}^*_\theta\left[\frac{1}{M_\tau(\theta)}\,I_{i_0,i_1,\ldots,i_{m^n_T}}\,\Big|\,\mathcal{D}_\tau\right]\right|$$
for $I_{i_0,i_1,\ldots,i_{m^n_T}} = 1_{\{V_0\in A^{n,i_0},\ \forall j=1,\ldots,m^n_T,\ \forall t\in(T^{n,j-1},T^{n,j}]:\ V_t\in A^{n,i_j}\}}$.
The second summand on the right-hand side of (33) can be bounded above by
$$C^\tau_1\big(\exp\big(\psi^{n,1}_\tau\big) - 1\big)\,\mathbb{P}^*_\theta\big[\forall t\in[0,\tau]: V_t\in A^n \mid \mathcal{D}_\tau\big] \le C^\tau_1\big(\exp\big(\psi^{n,1}_\tau\big) - 1\big),$$
where
$$\psi^{n,1}_\tau = \max_{1\le j\le m^n_S}\operatorname{vol}\big(A^{n,j}\big)\left(\tau\max_{\theta\in\Theta}\max_{1\le j\le m^n_S}\big|\partial_v\Lambda(v_j;\theta)\big| + N_\tau\max_{\theta\in\Theta}\max_{1\le k\le N_\tau}\max_{1\le j\le m^n_S}\left|\partial_v\log\frac{\Lambda(v_j;\theta)\,\pi(U_k,v_j;\theta)}{\pi^*(U_k)}\right|\right)$$
as in Assumption (B6). This follows from a Taylor expansion around the points $v_{i_j}$ and Jensen's inequality, together with Assumptions (B1), (B2), and (B4). We use Hölder's inequality with Assumptions (B4) and (B5) for the first term on the right-hand side of (33) to obtain the bound $\sqrt{C^\tau_2}\sqrt{\psi^{n,2}_\tau}$ for
$$\psi^{n,2}_\tau = \sup_{\theta\in\Theta}\mathbb{P}^*_\theta\big[\exists t\in[0,\tau]: V_t\notin A^n \mid \mathcal{D}_\tau\big].$$
Thus, $|E^{I^n}(\theta;\tau,1) - E(\theta;\tau,1)|$ is bounded above by
$$(34)\qquad \big(\exp\big(\psi^{n,1}_\tau\big) - 1\big)C^\tau_1 + \sqrt{C^\tau_2\,\psi^{n,2}_\tau}.$$
Note that (34) does not depend on θ. As a result, the error bound in (34) holds uniformly over the parameter space Θ. Assumptions (B3), (B4), (B5), and (B6) then lead to the claim for n → ∞.

Proof of Proposition 4.2. Take ε > 0. By Theorem 4.1 there exists some n0 ∈ N, depending only on τ, such that
$$\big|E^{I^n}(\theta;\tau,1) - E(\theta;\tau,1)\big| \le \varepsilon$$
for all n ≥ n0 and all θ ∈ Θ. Since max_{θ∈Θ} L∗F(θ; τ) < ∞, P-almost surely, it follows that
$$\big|\mathcal{L}_\tau(\theta) - \mathcal{L}^n_\tau(\theta)\big| \le \varepsilon\max_{\theta\in\Theta}\mathcal{L}^*_F(\theta;\tau) < \infty,$$
P-almost surely. Define ε̃ = ε max_{θ∈Θ} L∗F(θ; τ). Then, since θ̂^n_τ is an optimizer of L^n_τ,
$$(35)\qquad \hat\theta^n_\tau \in \big\{\theta\in\Theta : \mathcal{L}_\tau(\hat\theta_\tau) - \tilde\varepsilon \le \mathcal{L}^n_\tau(\theta) \le \mathcal{L}_\tau(\hat\theta_\tau) + \tilde\varepsilon\big\},$$
Pθ-almost surely. The claim follows for ε → 0.

Proof of Proposition 4.3. For fixed τ > 0, we deduce from Proposition 4.2 that the limit of θ̂^n_τ for n → ∞ exists. Let θ̂^∞_τ be this limit. Then we know that Lτ(θ̂^∞_τ) = Lτ(θ̂τ) for θ̂τ = arg max_{θ∈Θ} Lτ(θ). Since θ̂τ is a global maximizer and no optimum is attained on the boundary of the parameter space, θ̂^∞_τ necessarily satisfies the first-order condition
$$(36)\qquad \nabla\mathcal{L}_\tau(\hat\theta^\infty_\tau) = 0.$$
Thus, we can derive a linear-quadratic Taylor expansion of Lτ(θ̂^∞_τ) around θ0 as in Proposition A.2, and conclude that
$$(37)\qquad \sqrt{\tau}\big(\hat\theta^\infty_\tau - \theta_0\big) \to N(0,\Sigma_0).$$
From Proposition 4.2 we know that for any τ > 0, there exists some implementation I^n for n = n(τ) such that |θ̂^n_τ − θ̂^∞_τ| ≤ 1/τ. Thus, P-a.s.,
$$(38)\qquad \sqrt{\tau}\big(\hat\theta^{n(\tau)}_\tau - \hat\theta^\infty_\tau\big) \to 0.$$
We remark that the sequence of implementations {I^n}_{τ≥0} is fixed in such a way that the right-hand side of equation (33) converges to zero as τ → ∞. The existence of such implementations is asserted by Assumptions (B1)-(B6) in Theorem 4.1.

Proof of Proposition 5.1. We proceed inductively. The claim follows immediately for m^n_T = 0. For m^n_T ≥ 1, $\big|\hat P^{I^n}_\theta(i_0,\ldots,i_{m^n_T}) - P^{I^n}_\theta(i_0,\ldots,i_{m^n_T})\big|$ is bounded above by the triangle inequality:
$$P^{I^n}_\theta(i_{m^n_T})\,\big|\hat P^{I^n}_\theta(i_0,\ldots,i_{m^n_T-1}) - P^{I^n}_\theta(i_0,\ldots,i_{m^n_T-1})\big| + \big|\hat P^{I^n}_\theta(i_{m^n_T}) - P^{I^n}_\theta(i_{m^n_T})\big|\,\hat P^{I^n}_\theta(i_0,\ldots,i_{m^n_T-1}).$$
Note that $P^{I^n}_\theta(i_{m^n_T}) \le 1$ and $\hat P^{I^n}_\theta(i_0,\ldots,i_{m^n_T-1}) \le 1$. By the induction assumption we know that $|\hat P^{I^n}_\theta(i_0,\ldots,i_{m^n_T-1}) - P^{I^n}_\theta(i_0,\ldots,i_{m^n_T-1})| \to 0$ as n → ∞, Pθ-almost surely. Further, from Jensen's and Hölder's inequalities we derive for j = m^n_T that
$$\big|\hat P^{I^n}_\theta(i_j) - P^{I^n}_\theta(i_j)\big| \le \mathbb{E}^*_\theta\Big[1_{\{V_{T^{n,j}}\in A^{n,i_j}\}}\,1_{\{V_t\notin A^{n,i_j}\ (t\in(T^{n,j-1},T^{n,j}))\}}\,\Big|\,\mathcal{D}_\tau\Big] \le \max_{\tilde\theta\in\Theta}\mathbb{P}^*_{\tilde\theta}\big[\forall t\in(T^{n,j-1},T^{n,j}): V_t\notin A^{n,i_j} \mid \mathcal{D}_\tau\big].$$
Now (T^{n,j−1}, T^{n,j}) → ∅ as n → ∞ since m^n_T → ∞. Hence, the monotone convergence theorem implies that $|\hat P^{I^n}_\theta(i_{m^n_T}) - P^{I^n}_\theta(i_{m^n_T})| \to 0$ as n → ∞, Pθ-almost surely.

Proof of Proposition A.1. Lemma 4.1 in Section 1.4.3 of Ibragimov and Has'minskii (1981) states that it is sufficient to show that for any γ ≥ 0 and any θ ∈ Θ,
$$(39)\qquad \mathbb{P}_\theta\left[\sup_{|\xi|\ge\gamma}\left(\frac{1}{\tau}\log\mathcal{L}_\tau(\theta+\xi) - \frac{1}{\tau}\log\mathcal{L}_\tau(\theta)\right) \ge 0\right] \to 0$$
as τ → ∞. Write
$$\frac{1}{\tau}\log\mathcal{L}_\tau(\theta+\xi) - \frac{1}{\tau}\log\mathcal{L}_\tau(\theta) = \frac{1}{\tau}\log\frac{\mathcal{L}_\tau(\theta+\xi)}{\mathcal{L}_\tau(\theta)}.$$
We will show that for any given ε, δ > 0, there exists τ∗ > 0 such that for all τ ≥ τ∗ and θ, θ̃ ∈ Θ:
$$\mathbb{P}_\theta\left[\frac{1}{\tau}\log\frac{\mathcal{L}_\tau(\tilde\theta)}{\mathcal{L}_\tau(\theta)} > \delta\right] \le \varepsilon.$$
This implies convergence in probability uniformly over the parameter space Θ, and
$$(40)\qquad \mathbb{P}_\theta\left[\sup_{|\xi|\ge\gamma}\frac{1}{\tau}\big(\log\mathcal{L}_\tau(\theta+\xi) - \log\mathcal{L}_\tau(\theta)\big) > 0\right] \to 0.$$
The inequality "≥" in Condition (39) then follows from Assumption (A4). Fix ε, δ > 0 and take θ, θ̃ ∈ Θ. The likelihood function Lτ(θ̃) satisfies
$$(41)\qquad \mathcal{L}_\tau(\tilde\theta) = \frac{d\mathbb{D}_{\tilde\theta,\tau}}{d\mathbb{D}_{\theta_0,\tau}}$$
by definition. Consequently,
$$\mathbb{E}_{\theta_0}\big[\mathcal{L}_\tau(\tilde\theta)\big] = 1.$$
Theorem 3.1 states that the likelihood function satisfies
$$\mathcal{L}_\tau(\theta) = \frac{d\mathbb{D}_{\theta,\tau}}{d\mathbb{D}_{\theta_0,\tau}} \propto \mathbb{E}^*_\theta\left[\frac{1}{M_\tau(\theta)}\,\Big|\,\mathcal{D}_\tau\right]\mathcal{L}^*_F(\theta;\tau),$$
which is strictly positive under Assumption (A3). Thus, Dθ0,τ is also absolutely continuous with respect to Dθ,τ with Radon-Nikodym density
$$\frac{d\mathbb{D}_{\theta_0,\tau}}{d\mathbb{D}_{\theta,\tau}} = \frac{1}{\mathcal{L}_\tau(\theta)}.$$
By construction, Lτ(θ̃)/Lτ(θ) is Dτ-measurable. We conclude from the exponential Chebyshev inequality and (41) that
$$\mathbb{P}_\theta\left[\frac{1}{\tau}\log\frac{\mathcal{L}_\tau(\tilde\theta)}{\mathcal{L}_\tau(\theta)} > \delta\right] \le e^{-\tau\delta}\,\mathbb{E}_\theta\left[\frac{\mathcal{L}_\tau(\tilde\theta)}{\mathcal{L}_\tau(\theta)}\right] = e^{-\tau\delta}\,\mathbb{E}_{\theta_0}\big[\mathcal{L}_\tau(\tilde\theta)\big] = e^{-\tau\delta}.$$
Choosing τ large enough leads to the claim.

Proof of Proposition A.2. We will use an argument similar to the one used to prove Proposition 6.1 in Hall and Heyde (1980), after generalizing to a multivariate parameter and our incomplete data structure. For simplicity, write m = m(τ). By Taylor expansion and Assumption (A5),
$$(42)\qquad 0 = \frac{1}{\tau}\nabla\mathcal{L}_\tau(\hat\theta_\tau) = \frac{1}{\tau}\nabla\mathcal{L}_\tau(\theta_0) + \frac{1}{\tau}\nabla^2\mathcal{L}_\tau(\theta_0)\big(\hat\theta_\tau - \theta_0\big) + o_P\big(\|\hat\theta_\tau - \theta_0\|\big)$$
since the MLE θ̂τ is consistent by Proposition A.1 and the third derivatives of the log-likelihood function are uniformly bounded in probability in a neighborhood of θ0 by Assumption (A10). Reformulate equation (42) as
$$\sqrt{\tau}\big(\hat\theta_\tau - \theta_0\big) = -\left(\frac{1}{\tau}\nabla^2\mathcal{L}_\tau(\theta_0) + o_P(1)\right)^{-1}\frac{1}{\sqrt{\tau}}\nabla\mathcal{L}_\tau(\theta_0).$$
τ τ Assumption (A5) and the definition of L∗E,i and L∗F,i in (21)-(22) imply ∇2 Lτ (θ) = ∇2 = m X log L∗E,i (θ) + ∇2 i=1 ∇2 L∗E,i (θ) L∗E,i (θ) i=1 m X m X + m X log L∗F,i (θ) i=1 2 ∇ L∗F,i (θ) L∗F,i (θ) i=1 m X ∇ log L∗E,i (θ)> ∇ log L∗E,i (θ) − i=1 − m X ∇ log L∗F,i (θ)> ∇ log L∗F,i (θ). i=1 Chebyshev’s inequality together with Assumptions (A6) and (A7) imply m m 2 ∗ 2 ∗ 1 X ∇ LE,i (θ) 1 X ∇ LF,i (θ) = lim =0 m→∞ m m→∞ m L∗E,i (θ) L∗F,i (θ) lim i=1 i=1 in Pθ -probability, and Assumption (A8) leads to (43) 1 2 ∇ Lτ (θ) → −ζ (ΣE (θ) + ΣF (θ)) τ in Pθ -probability as τ → ∞. Thus, it suffices to show that (44) 1 √ ∇Lτ (θ0 ) → N (0, ζ (ΣE (θ0 ) + ΣF (θ0 ) + 2ΣEF (θ0 ))) τ FILTERED LIKELIHOOD FOR POINT PROCESSES 37 in Pθ0 -distribution. Asymptotic normality then follows given that ΣE (θ0 ) + ΣF (θ0 ) + 2ΣEF (θ0 ) is positive semidefinite. To see that (44) holds, note that Assumptions (A6)-(A7) together with the fact that ∇ log L∗E,i (θ) and ∇ log L∗F,i (θ) are Dsi -measurable imply that ∇Lsk (θ) = k X ∇ log L∗E,i (θ) + ∇ log L∗F,i (θ) , 1 ≤ k ≤ m, i=1 is a Pθ -martingale on (Dsk )1≤k≤m . Chebyshev’s inequality and Assumptions (A7)-(A8) lead to (45) m i 1 X h Eθ ∇ log L∗E,i (θ)> ∇ log L∗E,i (θ) Dsi−1 m→∞ m ΣE (θ) = lim i=1 (46) m i 1 X h = lim Eθ ∇ log L∗E,i (θ)> ∇ log L∗E,i (θ) , m→∞ m i=1 m i 1 X h Eθ ∇ log L∗F,i (θ)> ∇ log L∗F,i (θ) Dsi−1 ΣF (θ) = lim m→∞ m 1 m→∞ m = lim (47) 1 m→∞ m ΣEF (θ) = lim i=1 m X i=1 m X h i Eθ ∇ log L∗F,i (θ)> ∇ log L∗F,i (θ) , h i Eθ ∇ log L∗E,i (θ)> ∇ log L∗F,i (θ) Dsi−1 i=1 m i 1 X h = lim Eθ ∇ log L∗E,i (θ)> ∇ log L∗F,i (θ) . m→∞ m i=1 In the notation of Hall and Heyde (1980) (HH), Im (θ) = ∇Lsm (θ)> ∇Lsm (θ) and Jm (θ) = ∇2 Lsm (θ). We have that 1 Im (θ) = ΣE (θ) + ΣF (θ) + 2ΣEF (θ), m 1 lim Jm (θ) = − (ΣE (θ) + ΣF (θ)) , m→∞ m lim m→∞ where the first equation follows from Assumption (A6) and Chebyshev’s inequality, and the second equation follows from (43). As a result, Im (θ) → 1 1 ∞ as m → ∞. 
Also, limm→∞ m Im (θ) = limm→∞ m Eθ [Im (θ)]. Finally, 1 1 limm→∞ m Jm (θ) = − limm→∞ m Im (θ) + 2ΣEF (θ). Given that Im (θ) and Jm (θ) are continuous by Assumption (A5), we conclude that a slightly modified version of Assumption 1 of Proposition 6.1 of HH holds. This modified 38 K. GIESECKE AND G. SCHWENKLER version of Assumption 1 accounts for the fact that our data is observed incompletely and there is correlation between the event data likelihood and the factor data likelihood. In addition, Assumption 2 of Proposition 6.1 of HH also holds because of our Assumptions (A5) and (A10), equation (43), and Theorem 2.1 of Billingsley (2009). Following the proof of Proposition 6.1 of HH yields the claim. Proof of Corollary A.3. The first equality for ΣE (θ) follows from (43). For the second equality note that, by construction and since s0 = 0, sm(τ ) = τ , we have that m(τ ) 1 X ∇ log L∗E (θ; si )> ∇ log L∗E,i (θ) τ →∞ m(τ ) i=1 ) m(τ ) 1 X ∗ > ∗ − ∇ log LE (θ; si−1 ) ∇ log LE,i (θ) m(τ ) ( ΣE (θ) = lim i=1 m(τ ) X 1 ∇ log L∗E (θ; sm(τ ) )> ∇ log L∗E,i (θ) τ →∞ m(τ ) = lim i=1 τ 1 ∇ log L∗E (θ; sm(τ ) )> ∇ log L∗E (θ; sm(τ ) ) m(τ ) τ 1 1 = lim ∇ log L∗E (θ; τ )> ∇ log L∗E (θ; τ ). ζ τ →∞ τ = lim τ →∞ The analogous expressions hold for ΣE (θ) and ΣEF (θ). REFERENCES Aı̈t-Sahalia, Y. (2008). Closed-form likelihood expansions for multivariate diffusions. The Annals of Statistics 36 906-937. Andersen, P. K. and Gill, R. D. (1982). Cox’s Regression Model for Counting Processes: A Large Sample Study. The Annals of Statistics 10 1100-1120. Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer series in statistics. Springer. Azizpour, S., Giesecke, K. and Schwenkler, G. (2014). Exploring the Sources of Default Clustering. Working Paper. Billingsley, P. (2009). Convergence of Probability Measures. Wiley Series in Probability and Statistics. Wiley. Blanchet, J. and Ruf, J. (2013). 
A Weak Convergence Criterion When Constructing Changes of Measure. Working Paper.
Brémaud, P. (1980). Point Processes and Queues – Martingale Dynamics. Springer-Verlag, New York.
Brillinger, D. R. and Preisler, H. K. (1983). Maximum likelihood estimation in a latent variable problem. In Studies in Econometrics, Time Series and Multivariate Statistics (S. Karlin, T. Amemiya and L. A. Goodman, eds.) 31-65. Academic Press, New York.
Celeux, G. and Diebolt, J. (1992). A stochastic approximation type EM algorithm for the mixture problem. Stochastics and Stochastic Reports 41 119-134.
Chan, K. S. and Ledolter, J. (1995). Monte Carlo EM Estimation for Time Series Models Involving Counts. Journal of the American Statistical Association 90 242-252.
Chava, S. and Jarrow, R. (2004). Bankruptcy prediction with industry effects. Review of Finance 8 537-569.
Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological) 34 187-220.
Dembo, A. and Zeitouni, O. (1986). Parameter estimation of partially observed continuous time stochastic processes via the EM Algorithm. Stochastic Processes and their Applications 23 91-113.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39 1-38.
Demyanyk, Y. and Van Hemert, O. (2011). Understanding the Subprime Mortgage Crisis. Review of Financial Studies 24 1848-1880.
Deng, Y. and Gabriel, S. (2006). Risk-Based Pricing and the Enhancement of Mortgage Credit Availability among Underserved and Higher Credit-Risk Populations. Journal of Money, Credit and Banking 38 1431-1460.
Deng, Y., Quigley, J. M. and Van Order, R. (2000). Mortgage Terminations, Heterogeneity and the Exercise of Mortgage Options. Econometrica 68 275-307.
Dolton, P. and O'Neill, D. (1996).
Unemployment Duration and the Restart Effect: Some Experimental Evidence. The Economic Journal 106 387-400.
Duffie, D., Saita, L. and Wang, K. (2007). Multi-period corporate default prediction with stochastic covariates. Journal of Financial Economics 83 635-665.
Duffie, D., Eckner, A., Horel, G. and Saita, L. (2009). Frailty Correlated Default. Journal of Finance 64 2089-2123.
Dufour, A. and Engle, R. F. (2000). Time and the Price Impact of a Trade. The Journal of Finance 55 2467-2498.
Dupacova, J. and Wets, R. (1988). Asymptotic Behavior of Statistical Estimators and of Optimal Solutions of Stochastic Optimization Problems. The Annals of Statistics 16 1517-1549.
Engle, R. F. (2000). The Econometrics of Ultra-High-Frequency Data. Econometrica 68 1-22.
Engle, R. F. and Russell, J. R. (1998). Autoregressive Conditional Duration: A New Model for Irregularly Spaced Transaction Data. Econometrica 66 1127-1162.
Feigin, P. D. (1976). Maximum Likelihood Estimation for Continuous-Time Stochastic Processes. Advances in Applied Probability 8 712-736.
Foote, C. L., Gerardi, K. and Willen, P. S. (2008). Negative equity and foreclosure: Theory and evidence. Journal of Urban Economics 64 234-245.
Giesecke, K. and Schwenkler, G. (2014). Simulated Likelihood Estimators for Discretely-Observed Jump-Diffusions. Working Paper.
Giesecke, K. and Teng, G. (2012). Numerical Solution of Jump-Diffusion SDEs. Working Paper.
Gjessing, H., Røysland, K., Pena, E. and Aalen, O. (2010). Recurrent events and the exploding Cox model. Lifetime Data Analysis 16 525-546.
Hall, P. and Heyde, C. C. (1980). Martingale Limit Theory and Its Application. Probability and Mathematical Statistics. Academic Press.
Harrison, A. and Stewart, M. (1989). Cyclical Fluctuations in Strike Durations. The American Economic Review 79 827-841.
Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika 58 83-90.
Heckman, J. J. and Walker, J. R.
(1990). The Relationship Between Wages and Income and the Timing and Spacing of Births: Evidence from Swedish Longitudinal Data. Econometrica 58 1411-1441.
Hotz, V. J., Klerman, J. A. and Willis, R. J. (1997). The economics of fertility in developed countries. In Handbook of Population and Family Economics (M. R. Rosenzweig and O. Stark, eds.) 1, Part A, Chapter 7, 275-347. Elsevier.
Ibragimov, I. A. and Has'minskii, R. Z. (1981). Statistical Estimation: Asymptotic Theory. Springer-Verlag, New York.
Jeganathan, P. (1995). Some Aspects of Asymptotic Theory with Applications to Time Series Models. Econometric Theory 11 818-887.
Kiefer, N. (1988). Economic Duration Data and Hazard Functions. Journal of Economic Literature 26 646-679.
Kliemann, W. H., Koch, G. and Marchetti, F. (1990). On the unnormalized solution of the filtering problem with counting process observations. IEEE Transactions on Information Theory 36 1415-1425.
Komatsu, T. and Takeuchi, A. (2001). On the smoothness of PDF of solutions to SDE of jump type. International Journal of Differential Equations and Applications 2 141-197.
Kurtz, T. (1998). Martingale problems for conditional distributions of Markov processes. Electronic Journal of Probability 3 29 pp.
Kurtz, T. G. and Ocone, D. L. (1988). Unique Characterization of Conditional Distributions in Nonlinear Filtering. The Annals of Probability 16 80-107.
Lancaster, T. (1979). Econometric Methods for the Duration of Unemployment. Econometrica 47 939-956.
Lancaster, T. and Nickell, S. (1980). The Analysis of Re-Employment Probabilities for the Unemployed. Journal of the Royal Statistical Society. Series A (General) 143 141-165.
Lando, D. and Nielsen, M. S. (2010). Correlation in Corporate Defaults: Contagion or Conditional Independence? Journal of Financial Intermediation 19 355-372.
Le Cam, L. and Yang, G. L. (1988). On the Preservation of Local Asymptotic Normality under Information Loss. The Annals of Statistics 16 483-520.
Lo, A. W. (1988).
Maximum Likelihood Estimation of Generalized Itô Processes with Discretely Sampled Data. Econometric Theory 4 231-247.
Meyer, B. D. (1990). Unemployment Insurance and Unemployment Spells. Econometrica 58 757-782.
Murphy, S. A. (1994). Consistency in a Proportional Hazards Model Incorporating a Random Effect. The Annals of Statistics 22 712-731.
Murphy, S. A. (1995). Asymptotic Theory for the Frailty Model. The Annals of Statistics 23 182-198.
Newman, J. L. and McCulloch, C. E. (1984). A Hazard Rate Approach to the Timing of Births. Econometrica 52 939-961.
Nielsen, S. F. (2000). The Stochastic EM Algorithm: Estimation and Asymptotic Results. Bernoulli 6 457-489.
Nielsen, G., Gill, R., Andersen, P. and Sorensen, T. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics 19 25-43.
Ogata, Y. (1978). The asymptotic behavior of maximum likelihood estimators of stationary point processes. Annals of the Institute of Statistical Mathematics 30 243-261.
Parner, E. (1998). Asymptotic theory for the correlated gamma-frailty model. The Annals of Statistics 26 183-214.
Pennington-Cross, A. and Ho, G. (2010). The Termination of Subprime Hybrid and Fixed-Rate Mortgages. Real Estate Economics 38 399-426.
Protter, P. (2004). Stochastic Integration and Differential Equations. Springer-Verlag, New York.
Puri, M. L. and Tuan, P. D. (1986). Maximum likelihood estimation for stationary point processes. Proceedings of the National Academy of Sciences USA 83 541-545.
Takeuchi, A. (2002). The Malliavin calculus for SDE with jumps and the partially hypoelliptic problem. Osaka Journal of Mathematics 39 523-559.
Todd, J. E., Winters, P. and Stecklov, G. (2012). Evaluating the impact of conditional cash transfer programs on fertility: the case of the Red de Protección Social in Nicaragua. Journal of Population Economics 25 267-290.
van der Vaart, A. W. (2000). Asymptotic Statistics.
Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Wei, G. and Tanner, M. (1990). A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms. Journal of the American Statistical Association 85 699-704.
Wu, J. (1983). On the Convergence Properties of the EM Algorithm. The Annals of Statistics 11 95-103.
Zeng, Y. (2003). A Partially Observed Model for Micromovement of Asset Prices with Bayes Estimation via Filtering. Mathematical Finance 13 411-444.

Department of Management Science and Engineering
Stanford University
Stanford, CA 94305
E-mail: giesecke@stanford.edu

Boston University School of Management
Boston University
Boston, MA 02215
E-mail: gas@bu.edu