Objective Bayesian Survival Analysis Using Shape Mixtures of Log-Normal Distributions

Catalina A. Vallejos and Mark F.J. Steel
Department of Statistics, University of Warwick ∗

CRiSM Paper No. 13-01, www.warwick.ac.uk/go/crism

Abstract

Survival models such as the Weibull or log-normal lead to inference that is not robust to the presence of outliers. They also assume that all heterogeneity between individuals can be modelled through covariates. This article considers the use of infinite mixtures of lifetime distributions as a solution for these two issues. This can be interpreted as the introduction of a random effect in the survival distribution. We introduce the family of Shape Mixtures of Log-Normal distributions, which covers a wide range of shapes. Bayesian inference under non-subjective priors based on the Jeffreys rule is examined and conditions for posterior propriety are established. The existence of the posterior distribution on the basis of a sample of point observations is not always guaranteed and a solution through set observations is implemented. This also accounts for censored observations. In addition, a method for outlier detection based on the mixture structure is proposed. Finally, the analysis is illustrated using a real dataset.

Keywords: Jeffreys prior; Outlier detection; Posterior existence; Set observations

∗ Corresponding author: Mark Steel, Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK; email: M.F.Steel@stats.warwick.ac.uk. Catalina Vallejos acknowledges funding through a Warwick Postgraduate Research Scholarship of the University of Warwick and a grant from the Department of Statistics of the Pontificia Universidad Católica de Chile.

1 Introduction

Standard survival models can be criticized in two aspects. Firstly, they often assign low probabilities to “rare” or “tail” events. In real datasets, outcomes of this kind typically occur more often than predicted by standard models.
In fact, as argued in Barros et al. (2008), models such as the Weibull or log-normal do not produce inference that is robust to outlying observations. This implies the need to develop distributions with more flexible tails. The second, related, criticism concerns the typical assumption that the survival times correspond to realizations of random variables $T_1, \ldots, T_n$ which have the same distribution (possibly depending on a set of covariates). However, as argued by Marshall and Olkin (2007), this is not always realistic because there are often specific individual factors that result in heterogeneity of the survival times and that cannot be captured by covariates.

We consider the use of infinite mixtures of lifetime distributions as a solution for these two issues. This idea has been mentioned by previous authors (e.g. Marshall and Olkin, 2007), but is not yet much used in applied work. Mixture modelling can be interpreted as the introduction of a random effect on the survival distribution. In particular, this article explores the Shape Mixture of Log-Normal (SMLN) distributions, for which the shape parameter is treated as a random quantity. This new class covers a wide range of shapes, in particular cases with fatter tail behaviour than the log-normal. It includes the already studied log-Student t, log-Laplace, log-exponential power and log-logistic distributions, among others. This paper puts the earlier literature into a common framework (which is more general, as other distributions can also be defined) and provides a basis for the exploration of properties and the development of objective Bayesian inference methods, which should be attractive for practitioners.

Section 2 introduces the use of mixture families of distributions with particular focus on the SMLN family. Covariates are introduced via an Accelerated Failure Times model. Section 3 analyzes aspects of Bayesian inference for models in the SMLN family.
This addresses the existence of censored observations. Non-subjective priors based on the Jeffreys rule are proposed and conditions for the existence of the posterior distribution are provided. In addition, we highlight that the existence of the posterior distribution of the parameters on the basis of a sample of point observations is not always guaranteed, and a solution based on set observations is proposed. Section 3 also discusses some implementation details and proposes a method for outlier detection that exploits the mixture structure. Bayesian inference with the SMLN family is illustrated in Section 4 through the use of a real dataset. Finally, Section 5 concludes. All the proofs are contained in Appendix A without mention in the text.

2 Mixtures of life distributions

Let $T_1, \ldots, T_n$ be positive continuous random variables representing the survival times of $n$ independent individuals. Usually $T_1, \ldots, T_n$ are assumed to have the same distribution (possibly depending on a set of covariates). However, this is not appropriate in the face of unobserved heterogeneity between the survival times. This heterogeneity can be interpreted as the existence of specific factors for each individual that cannot be observed. The lack of homogeneity of the survival times can also be related to the presence of outlying observations. This paper considers the use of mixtures of life distributions in order to represent unobserved heterogeneity and add robustness to the presence of outliers.

The distribution of $T_i$ is defined as a mixture of life distributions if and only if its density function is given by

$$f(t_i \mid \psi, \theta) \equiv \int_{\mathcal{L}} f(t_i \mid \psi, \Lambda_i = \lambda_i)\, dP_{\Lambda_i}(\lambda_i \mid \theta), \tag{1}$$

where $f(\cdot \mid \psi, \Lambda_i = \lambda_i)$ represents the density function associated with a lifetime distribution which depends on the value of $\psi$ and $\lambda_i$. This can be interpreted as the underlying distribution of the survival time.
Moreover, $\lambda_i$ can be understood as a random effect associated with each individual, which in the context of survival analysis is often called a frailty term. The mixing distribution has cumulative distribution function $P_{\Lambda_i}(\cdot \mid \theta)$ with support $\mathcal{L}$ and depends on a parameter $\theta$. If $\mathcal{L}$ is a finite set of values, the distribution of $T_i$ is a finite mixture of life distributions. However, here we focus on the case in which $\Lambda_i$ is a continuous positive random variable (usually $\mathcal{L} = \mathbb{R}_+$), in which case $f(\cdot \mid \psi, \theta)$ can be interpreted as an infinite mixture of densities.

The intuition behind traditional parametric models carries over to these mixture models. In fact, conditional on the mixing parameters the underlying distribution applies. The latter may be underpinned by theoretical or practical reasons. For instance, the log-normal distribution arises as the limiting distribution when additive cumulative damage is the cause of the death or failure. Whereas we focus on mixtures generated from log-normal distributions, a large number of mixture families can be generated by (1). For example, Jewell (1982) explores mixtures of exponential and Weibull distributions. Barros et al. (2008) and Patriota (2012) consider an extended class of Birnbaum-Saunders distributions that can be represented as in (1). The latter family is motivated by models for crack extensions where the mixing distribution accounts for dependence between the cracks.

2.1 The family of Shape Mixtures of Log-Normals

Definition 1. A random variable $T_i$ has a distribution in the family of Shape Mixtures of Log-Normals (SMLN) if and only if its density can be represented as

$$f(t_i \mid \mu, \sigma^2, \theta) = \int_{\mathcal{L}} f_{LN}\!\left(t_i \,\Big|\, \mu, \frac{\sigma^2}{\lambda_i}\right) dP_{\Lambda_i}(\lambda_i \mid \theta), \quad t_i > 0,\ \mu \in \mathbb{R},\ \sigma^2 > 0,\ \theta \in \Theta, \tag{2}$$

where $f_{LN}(\cdot \mid \mu, \sigma^2/\lambda_i)$ corresponds to the density of a log-normal distribution with parameters $\mu$ and $\sigma^2/\lambda_i$, and $\lambda_i$ is an observed value of a random variable $\Lambda_i$ which has distribution function $P_{\Lambda_i}(\cdot \mid \theta)$ defined on $\mathcal{L} \subseteq \mathbb{R}_+$ (possibly discrete). Denote $T_i \sim SMLN_P(\mu, \sigma^2, \theta)$. Alternatively, (2) can be expressed as a hierarchical representation which corresponds to

$$T_i \mid \mu, \sigma^2, \Lambda_i = \lambda_i \sim LN\!\left(\mu, \frac{\sigma^2}{\lambda_i}\right), \qquad \Lambda_i \mid \theta \sim P_{\Lambda_i}(\cdot \mid \theta). \tag{3}$$

Table 1: Some distributions included in the SMLN family. $f_{PS}(\cdot \mid \delta)$ denotes the density of a positive stable random variable with parameter $\delta$.

| Distribution | Density $f(t_i)$ | Mixing density |
|---|---|---|
| Discrete mixture | $\sum_{j=1}^{q} p_j\, f_{LN}(t_i \mid \mu, \sigma^2/\lambda_j)$, with $\sum_{j=1}^{q} p_j = 1$, $0 \le p_j \le 1$ | Discrete with finite support |
| Log-Student t | $\dfrac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \dfrac{1}{\sqrt{\pi \sigma^2 \nu}} \dfrac{1}{t_i} \left[1 + \dfrac{(\log(t_i) - \mu)^2}{\sigma^2 \nu}\right]^{-(\nu/2 + 1/2)}$, $\nu > 0$ | Gamma($\nu/2$, $\nu/2$) |
| Log-Laplace | $\dfrac{1}{2\sigma} \dfrac{1}{t_i} \exp\left\{-\dfrac{\lvert \log(t_i) - \mu \rvert}{\sigma}\right\}$ | Inv-Gamma(1, 1/2) |
| Log-exponential power | $\dfrac{\alpha}{2\sigma\Gamma(1/\alpha)} \dfrac{1}{t_i} \exp\left\{-\left(\dfrac{\lvert \log(t_i) - \mu \rvert}{\sigma}\right)^{\alpha}\right\}$, $\alpha \in (1, 2)$ | $\dfrac{\Gamma(3/2)}{\Gamma(1 + 1/\alpha)}\, \lambda_i^{-1/2}\, f_{PS}(\lambda_i \mid \alpha/2)$ |
| Log-logistic | $\dfrac{1}{\sigma e^{\mu}} \dfrac{(t_i/e^{\mu})^{1/\sigma - 1}}{\left[1 + (t_i/e^{\mu})^{1/\sigma}\right]^{2}}$ | $\lambda_i^{-2} \sum_{k=0}^{\infty} (-1)^k (1+k)^2 \exp\left\{-\dfrac{(1+k)^2}{2\lambda_i}\right\}$ |

The SMLN family can be interpreted as an infinite mixture of log-normal distributions with random shape parameter or as the exponential transformation of a random variable distributed as a scale mixture of normals. This family includes a number of distributions that have been proposed in the context of survival analysis. Table 1 summarizes some of them. In particular, the log-Student t distribution was introduced by Hogg and Klugman (1983) and used in e.g. McDonald and Butler (1987) and Cassidy et al. (2009). However, we are not aware of any previous work that explored theoretical results and properties. In the special case of $\nu = 1$, Lindsey et al. (2000) applied it in the context of pharmacokinetic data. The log-Laplace appeared in Uppuluri (1981), and the log-exponential power was introduced in Vianelli (1983) and used in Martín and Pérez (2009) and Singh et al. (2012). Martín and Pérez (2009) use a mixture of uniforms representation, which also allows for thinner tails than the log-normal ($\alpha < 1$).
Here, however, we only consider distributions with fatter tails, in line with most of the empirical evidence in survival analysis. Finally, the log-logistic distribution was introduced by Shah and Dave (1963) and is used regularly in survival analysis, hydrology and economics. This list can be increased by varying the mixing distribution. For example, all the mixing distributions used for scale mixtures of normals listed in Fernández and Steel (2000) can be used in this context. For identifiability reasons, the mixing distribution must not have unknown scale parameters.

As illustrated in Figure 1, the SMLN family allows for a wide variety of shapes for the density. The hazard function can be interpreted as the instantaneous rate of failure given that the event did not happen before. For the SMLN family it is generally not possible to obtain an analytic formula for the hazard function, but an approximation can be obtained using numerical integration. Figure 2 illustrates that the shape of the hazard rates is also flexible. For example, while the hazard rate of the log-normal distribution has an increasing initial phase, the log-Laplace and log-logistic distributions produce a monotone decreasing hazard rate for some values of $\sigma^2$.

[Figure 1: Density function $f(t)$ against $t$ for some models in the SMLN family. (a) Log-Student t, (b) log-Laplace, (c) log-exponential power, (d) log-logistic. The solid line corresponds to a log-normal(0, 1) density.]

Table 2 groups some properties of the log-normal and various SMLN distributions.
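The numerical approximation of the hazard mentioned above can be sketched using the mixture structure itself: conditional on $\lambda_i$ both the density and the survival function are log-normal, so averaging each over draws of the mixing variable gives a Monte Carlo estimate of $h(t) = f(t)/S(t)$. This is only an illustrative sketch, not the integration scheme used by the authors; the function names, the seed and the number of draws are assumptions, and the Gamma($\nu/2$, $\nu/2$) mixing law is taken in its shape/rate form.

```python
import math
import random

def lognormal_pdf(t, mu, var):
    """Log-normal density with log-scale mean mu and log-scale variance var."""
    z = (math.log(t) - mu) / math.sqrt(var)
    return math.exp(-0.5 * z * z) / (t * math.sqrt(2.0 * math.pi * var))

def lognormal_sf(t, mu, var):
    """Log-normal survival function P(T > t)."""
    z = (math.log(t) - mu) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def smln_hazard(t, mu, sigma2, lambdas):
    """Monte Carlo hazard: mixture density divided by mixture survival."""
    dens = sum(lognormal_pdf(t, mu, sigma2 / lam) for lam in lambdas)
    surv = sum(lognormal_sf(t, mu, sigma2 / lam) for lam in lambdas)
    return dens / surv

def draw_gamma_mixing(nu, size, rng):
    """Draws from Gamma(shape=nu/2, rate=nu/2); gammavariate takes a scale."""
    return [rng.gammavariate(nu / 2.0, 2.0 / nu) for _ in range(size)]

rng = random.Random(1)  # hypothetical seed, for reproducibility only
lambdas = draw_gamma_mixing(nu=5.0, size=20000, rng=rng)

h_student = smln_hazard(1.0, 0.0, 1.0, lambdas)   # log-Student t, nu = 5
h_lognormal = smln_hazard(1.0, 0.0, 1.0, [1.0])   # degenerate mixing at 1
```

With a degenerate mixing distribution at 1 the function reduces to the exact log-normal hazard, which gives a simple check of the code; for other mixing laws the accuracy depends on the number of draws.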
It is clear that the tails of all the examples of SMLN with continuous mixing distributions are fatter than those of the log-normal distribution, and we also see that the behaviour of the density function close to zero can be quite different. Whereas all positive moments exist for the log-normal, this is not necessarily the case for the shape mixtures: in particular, we can show that no positive moments exist for the log-Student t for any finite value of $\nu$. The log-Laplace and log-logistic models only allow for moments up to $1/\sigma$. The log-exponential power distribution with $\alpha > 1$ does possess all moments (see Angeletti et al., 2011 and Singh et al., 2012).

[Figure 2: Hazard function $h(t)$ against $t$ for some models in the SMLN family. (a) Log-Student t, (b) log-Laplace, (c) log-exponential power, (d) log-logistic. The solid line corresponds to a log-normal(0,1) hazard rate.]

Table 2: Properties of some distributions in the SMLN family. $B(\cdot, \cdot)$ is the Beta function and $\mathrm{sign}(x) = 1$ if $x > 0$ and $\mathrm{sign}(x) = -1$ for $x < 0$.

| Distribution | Tails | $\lim_{t_i \to 0} f(t_i)$ | $E(t_i^r)$, $r > 0$ |
|---|---|---|---|
| Log-normal | $t_i^{-1} \exp\{-\frac{1}{2\sigma^2} [\log t_i]^2\}$ | 0 | $\exp(r\mu + \frac{1}{2} r^2 \sigma^2)$ |
| Log-Student t | $t_i^{-1} \lvert \log t_i \rvert^{-(\nu+1)}$ | $\infty$ | $\infty$ |
| Log-Laplace | $t_i^{-1} \exp\{-\frac{1}{\sigma} \lvert \log t_i \rvert\}$ | 0 for $\sigma < 1$; $\frac{1}{2}\exp(-\mu)$ for $\sigma = 1$; $\infty$ for $\sigma > 1$ | $\dfrac{\exp(r\mu)}{1 - r^2\sigma^2}$, $r < 1/\sigma$ |
| Log-exponential power | $t_i^{-1} \exp\{-\frac{1}{\sigma^{\alpha}} \lvert \log t_i \rvert^{\alpha}\}$ | 0 for $\alpha > 1$ | see Singh et al. (2012), $\alpha > 1$ |
| Log-logistic | $t_i^{-\mathrm{sign}\{\log(t_i)\}/\sigma - 1}$ | 0 for $\sigma < 1$; $\exp(-\mu)$ for $\sigma = 1$; $\infty$ for $\sigma > 1$ | $\exp(r\mu) B(1 - r\sigma, 1 + r\sigma)$, $r < 1/\sigma$ |

An important aspect of survival modelling is the inclusion of covariates. Throughout, we will condition on the covariates, which are assumed to be constant in time. Based on the closure under scale-power transformations of the SMLN family (if $T \sim SMLN_P(\mu, \sigma^2, \theta)$, then $T' = aT^b \sim SMLN_P(\log(a) + \mu b, \sigma^2 b^2, \theta)$ for $a > 0$, $b \neq 0$), an Accelerated Failure Times (AFT) model is introduced. The AFT-SMLN model expresses the dependence between the covariates and the survival time by replacing the parameter $\mu$ with $x_i'\beta$, so that

$$T_i \mid x_i, \beta, \sigma^2, \theta \overset{ind}{\sim} SMLN_P(x_i'\beta, \sigma^2, \theta), \quad i = 1, 2, \ldots, n, \tag{4}$$

where $x_i$ is a vector containing the value of $k$ covariates associated with survival time $i$ and $\beta \in \mathbb{R}^k$ is a vector of parameters. This can also be interpreted as a linear regression model for the logarithm of the survival times with error term distributed as a scale mixture of normals. As the median of $T_i$ in (4) is given by $e^{x_i'\beta}$, $e^{\beta_j}$ is interpreted as the (proportional) marginal change of the median survival time as a consequence of a unit change in covariate $j$.

3 Bayesian inference for the SMLN family

3.1 The prior

Bayesian inference will be conducted using objective priors that are generated by the Jeffreys rule. This is one of the most common choices in the absence of prior information and has interesting invariance and information-theoretic properties (Bernardo and Smith, 1994). The next theorem presents the Fisher information matrix for the AFT-SMLN model, which is the basis for the Jeffreys-style priors.

Theorem 1. Let $T_1, \ldots, T_n$ be independent random variables with $T_i$ distributed according to (4). Then the Fisher information matrix corresponds to

$$I(\beta, \sigma^2, \theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} k_1(\theta) \sum_{i=1}^{n} x_i x_i' & 0 & 0 \\ 0 & \dfrac{1}{\sigma^4} k_2(\theta) & \dfrac{1}{\sigma^2} k_3(\theta) \\ 0 & \dfrac{1}{\sigma^2} k_3(\theta) & k_4(\theta) \end{pmatrix}, \tag{5}$$

where $k_1(\theta)$, $k_2(\theta)$, $k_3(\theta)$ and $k_4(\theta)$ are functions depending only on $\theta$.
The expressions involved in $k_1(\theta)$, $k_2(\theta)$, $k_3(\theta)$ and $k_4(\theta)$ are complicated (see the proof) and thus $I(\beta, \sigma^2, \theta)$ is not easily obtained from this theorem for any arbitrary mixing distribution. Indeed, it is usually more efficient to compute $I(\beta, \sigma^2, \theta)$ directly from $f(\cdot \mid \beta, \sigma^2, \theta)$. However, this structure facilitates a general representation of the Jeffreys-style priors:

Corollary 1. Under the same assumptions as in Theorem 1, it follows that

(i) The Jeffreys prior for the AFT-SMLN model is given by

$$\pi^{J}(\beta, \sigma^2, \theta) \propto \frac{1}{(\sigma^2)^{1 + \frac{k}{2}}} \sqrt{[k_1(\theta)]^k\, [k_2(\theta) k_4(\theta) - k_3^2(\theta)]}, \tag{6}$$

(ii) The Type I Independence Jeffreys prior, which deals separately with the blocks for $\beta$ and $(\sigma^2, \theta)$, is given by

$$\pi^{II}(\beta, \sigma^2, \theta) \propto \frac{1}{\sigma^2} \sqrt{k_2(\theta) k_4(\theta) - k_3^2(\theta)}, \tag{7}$$

(iii) The Independence Jeffreys prior, which deals separately with $\beta$, $\sigma^2$ and $\theta$, is given by

$$\pi^{I}(\beta, \sigma^2, \theta) \propto \frac{1}{\sigma^2} \sqrt{k_4(\theta)}. \tag{8}$$

The three non-subjective priors presented here can be written as

$$\pi(\beta, \sigma^2, \theta) \propto \frac{1}{(\sigma^2)^p}\, \pi(\theta), \tag{9}$$

where $\pi(\theta)$ is the factor of the prior that depends on $\theta$. The value of $p$ is equal to $1 + (k/2)$ for the Jeffreys prior and $p$ is equal to 1 for the Type I Independence Jeffreys and Independence Jeffreys priors. In the cases in which the parameter $\theta$ does not appear (e.g. log-normal, log-Laplace and log-logistic distributions) this prior simplifies to $\pi(\beta, \sigma^2) \propto (\sigma^2)^{-p}$. Note that the result in Corollary 1 also specifies the prior for $\theta$. The implied priors for the special cases of the log-Student t and the log-exponential power (derived directly from the specific likelihood functions) are explicitly presented in the proof of Theorem 3.

3.2 The posterior distribution

The three priors presented in Corollary 1 do not correspond to proper probability distributions and therefore the propriety of the posterior distribution must be verified. At this stage we also discuss the existence of censored observations.
We assume here that censoring is noninformative. Censoring must be taken into account when conducting inference. However, as is shown in the following proposition, adding censored observations cannot destroy the propriety of the posterior distribution.

Proposition 1. Let $T_1, \ldots, T_n$ be the survival times of $n$ independent individuals. Let $n_c$ ($n_c \le n$) be the number of individuals for which the actual survival time was not observed but a censored observation was recorded. Then, the propriety of the posterior distribution of $(\beta, \sigma^2, \theta)$ on the basis of the $n - n_c$ observed survival times implies that the posterior distribution of $(\beta, \sigma^2, \theta)$ given the complete sample is proper.

Throughout the rest of this paper, the results regarding posterior propriety will be presented on the basis of the non-censored observations (using $n$ instead of $n - n_c$ for ease of notation). The following theorem and its corollary cover posterior propriety for the AFT-SMLN model under the priors in Corollary 1.

Theorem 2. Let $T_1, \ldots, T_n > 0$ be the survival times of $n$ independent individuals which have distribution given by (4). Assume that the observed survival times are $t_1, \ldots, t_n$. Define $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ and $X = (x_1, \ldots, x_n)'$. Suppose that $r(X) = k$ (full rank) and that the prior for $(\beta, \sigma^2, \theta)$ is given by (9). Then, the posterior distribution is proper if and only if $n > \max\{k, k - 2p + 2\}$ and

$$\int_{\Theta} \int_{\mathcal{L}^n} \prod_{i=1}^{n} \lambda_i^{\frac{1}{2}}\, [\det(X'DX)]^{-\frac{1}{2}}\, [S^2(D, y)]^{-\frac{n + 2p - k - 2}{2}}\, \pi(\theta) \prod_{i=1}^{n} dP(\lambda_i \mid \theta)\, d\theta < \infty, \tag{10}$$

where $S^2(D, y) = y'Dy - y'DX(X'DX)^{-1}X'Dy$ and $y = (\log(t_1), \ldots, \log(t_n))'$.

Corollary 2.
Under the same assumptions as in Theorem 2, it follows that

(i) If $p = 1$, a sufficient condition for the propriety of the posterior is $n > k$ and $\pi(\theta)$ being a proper density function for $\theta$,

(ii) If $p = 1 + k/2$, a sufficient condition for the propriety of the posterior is $n > k$ and

$$\int_{\Theta} E\!\left(\Lambda_i^{-\frac{k}{2}} \,\Big|\, \theta\right) \pi(\theta)\, d\theta < \infty, \tag{11}$$

(iii) For any value of $p$, a sufficient condition for the non-existence of a proper posterior distribution is

$$\int_{\Theta} dP(\lambda_i \mid \theta)\, \pi(\theta)\, d\theta = \infty, \tag{12}$$

for $\lambda_i \in B$, where $B$ is a set of positive Lebesgue measure with respect to $P(\cdot \mid \theta)$.

The previous results provide some general conditions for the SMLN family, which immediately lead to posterior propriety of the log-normal AFT model (where no $\theta$ appears and each $\lambda_i = 1$) when $n > k$ under any of the priors in Corollary 1. The following theorem summarizes the analysis for the AFT models based on the other distributions presented in Table 2.

Theorem 3. Under the same assumptions as in Theorem 2 and provided that $n > k$, it follows that

(i) For the log-Student t AFT model, the posterior is proper under the Jeffreys or Type I Independence Jeffreys prior. However, the posterior is improper for the Independence Jeffreys prior.

(ii) For the log-Laplace AFT model, log-exponential power AFT model and log-logistic AFT model, the propriety of the posterior can be verified with any of the three proposed priors.

Theorem 3 tells us that the log-Student t AFT model does not lead to valid Bayesian inference in combination with the Independence Jeffreys prior. The other models can be combined with all priors considered here; of course, the absence of $\theta$ in the log-Laplace and log-logistic models implies that in those cases the Type I Independence Jeffreys and Independence Jeffreys priors coincide.
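As a worked illustration of how the general conditions specialize, the log-normal case mentioned after Corollary 2 can be written out. This is a sketch rather than part of the original proofs, and it assumes that the observed $y$ does not lie exactly in the column space of $X$, so that the residual sum of squares is strictly positive.

```latex
% Log-normal AFT model: the mixing distribution is degenerate at
% \lambda_i = 1, so D = I_n and no \theta appears. The integral in (10)
% collapses to a single constant:
\[
  [\det(X'X)]^{-\frac{1}{2}} \, [S^2(I_n, y)]^{-\frac{n+2p-k-2}{2}} < \infty,
  \qquad
  S^2(I_n, y) = y'y - y'X(X'X)^{-1}X'y .
\]
% This is finite whenever S^2(I_n, y) > 0 (assumed here), so only the
% sample-size condition n > \max\{k, k - 2p + 2\} remains:
\[
  p = 1:\ \max\{k,\, k\} = k,
  \qquad
  p = 1 + \tfrac{k}{2}:\ \max\{k,\, 0\} = k .
\]
% Under all three priors of Corollary 1 the requirement is therefore n > k.
```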
3.3 The problem with using point observations

Most statistical models assume that the observations come from continuous distributions which assign probability zero to a random variable taking any particular (point) value. In spite of this, the standard statistical analysis is based on point observations. This situation can cause problems in the context of Bayesian inference because the assessment of the propriety of the posterior distribution is usually conducted without taking into account events that have zero probability. Hence, the propriety of the posterior on the basis of a sample of point observations cannot always be guaranteed. As was argued in Fernández and Steel (1998), this problem introduces the risk of having senseless inference. The following theorem illustrates the problem of the use of point observations in the context of the AFT log-Student t model.

Theorem 4. Adopt the same assumptions as in Theorem 2 and assume that $n > k$. If the mixing distribution is Gamma($\nu/2$, $\nu/2$) and $s$ ($k \le s < n$) is defined as the largest number of observations that can be represented as an exact linear combination of their covariates (i.e. $\log(t_i) = x_i'\beta$ for some fixed $\beta$), a necessary condition for the propriety of the posterior distribution of $(\beta, \sigma^2, \nu)$ is

$$\int_0^m \pi(\nu)\, d\nu = 0, \quad \text{where } m = \frac{n - k + (2p - 2)}{n - s} - 1. \tag{13}$$

This result indicates that it is possible to have samples of point observations for which no Bayesian inference can be conducted, unless $\pi(\nu)$ induces a positive lower bound for $\nu$. In particular, this condition is not satisfied for the Jeffreys-style priors that were presented in this report. Theorem 4 highlights the need for considering sets of zero Lebesgue measure when checking the propriety of the posterior distribution based on point observations. When no covariates are taken into account ($k = 1$), $s$ coincides with the largest number of observations that take the same value.
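To make the necessary condition in Theorem 4 concrete, the bound $m$ in (13) can be evaluated for a given sample configuration. The sketch below uses exact rational arithmetic; the function name and the example values of $n$, $k$, $p$ and $s$ are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction

def nu_lower_bound(n, k, p, s):
    """m in (13): the prior must put zero mass on nu in (0, m)."""
    if not (k <= s < n):
        raise ValueError("Theorem 4 requires k <= s < n")
    return (Fraction(n - k) + 2 * Fraction(p) - 2) / (n - s) - 1

# No covariates (k = 1), n = 10 observations, two of them tied (s = 2).
m_indep = nu_lower_bound(n=10, k=1, p=1, s=2)                  # p = 1
m_jeffreys = nu_lower_bound(n=10, k=1, p=Fraction(3, 2), s=2)  # p = 1 + k/2

# Without ties (s = k) and with p = 1 the bound is zero,
# so the condition places no restriction on the prior.
m_no_ties = nu_lower_bound(n=10, k=1, p=1, s=1)
```

In this hypothetical configuration a single pair of tied observations already yields $m > 0$, so a prior with positive mass arbitrarily close to $\nu = 0$ fails the necessary condition.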
In the context of scale mixtures of normals, Fernández and Steel (1998, 1999) proposed the use of set observations as a solution to this problem. We now extend this to the SMLN family. The use of set observations is based on the fact that, in practice, it is impossible to record observations from continuous random variables with total precision and every observation can only be considered as a label of a set of positive Lebesgue measure. For instance, if the recorded value for the survival time is $t_0$, it really means that the actual survival time is between $t_0 - \epsilon_l$ and $t_0 + \epsilon_r$, where $\epsilon_l$ and $\epsilon_r$ represent the level of imprecision with which the data was recorded. This is equivalent to considering the observation as interval censored. As explained in Section 3.4, the implementation is done through data augmentation, which does not involve a large increase of the computational cost. This procedure also naturally deals with left or right censored observations by taking respectively $\epsilon_l$ or $\epsilon_r$ equal to infinity. The following theorem indicates that the use of set observations provides a proper posterior distribution in situations where point observations might not.

Theorem 5. Let $T_1, \ldots, T_n > 0$ be the survival times of $n$ independent individuals which have distribution given by (4). Assume set observations $(t_1 - \epsilon_l, t_1 + \epsilon_r), \ldots, (t_n - \epsilon_l, t_n + \epsilon_r)$ ($0 < \epsilon_l, \epsilon_r < \infty$) are recorded. Define $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ and $X = (x_1 \cdots x_n)'$. Suppose that $r(X) = k$ (full rank) and that the prior for $(\beta, \sigma^2, \theta)$ is given by (9). Define $E = (t_1 - \epsilon_l, t_1 + \epsilon_r) \times (t_2 - \epsilon_l, t_2 + \epsilon_r) \times \cdots \times (t_n - \epsilon_l, t_n + \epsilon_r)$. The posterior distribution of $(\beta, \sigma^2, \theta)$ given the sample is proper if and only if the conditions in Theorem 2 are valid for any sample of point observations in $E$, except for a set of zero Lebesgue measure.
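A set observation can be handled by drawing a latent point value inside the recorded interval. Under the hierarchical representation (3), $\log(T_i)$ is normal with mean $x_i'\beta$ and standard deviation $\sigma/\sqrt{\lambda_i}$ given the mixing variable, so the draw comes from a truncated normal. The sketch below uses inverse-CDF sampling from the standard library; the function name, the argument names `eps_l` and `eps_r` (the left and right imprecision bounds) and all numerical values are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def draw_log_time(t, eps_l, eps_r, mean, sd, rng):
    """Draw log(T) from N(mean, sd^2) truncated to the recorded interval.

    The interval is (log(t - eps_l), log(t + eps_r)); an infinite eps_l or
    eps_r gives a left- or right-censored observation respectively.
    """
    lower = math.log(t - eps_l) if t - eps_l > 0 else -math.inf
    upper = math.log(t + eps_r) if math.isfinite(eps_r) else math.inf
    dist = NormalDist(mean, sd)
    p_lo = dist.cdf(lower) if lower > -math.inf else 0.0
    p_hi = dist.cdf(upper) if upper < math.inf else 1.0
    u = rng.uniform(p_lo, p_hi)
    # inv_cdf needs 0 < u < 1; this clamp loses accuracy in the far tails.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return dist.inv_cdf(u)

rng = random.Random(0)  # hypothetical seed
# Recorded value t = 5 with imprecision 0.5 on either side.
draws = [draw_log_time(5.0, 0.5, 0.5, mean=1.5, sd=0.8, rng=rng)
         for _ in range(1000)]
# Right-censored at t = 5: the true time is only known to exceed 5.
censored = [draw_log_time(5.0, 0.0, math.inf, mean=1.5, sd=0.8, rng=rng)
            for _ in range(1000)]
```

Inverse-CDF sampling is chosen here for brevity; when the interval lies far in a tail of the normal, a dedicated truncated-normal sampler would be more reliable.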
3.4 Implementation

Bayesian inference will be implemented using the hierarchical representation (3) of the SMLN family. This can be seen as an application of the data augmentation idea of Tanner and Wong (1987). Throughout, we use the prior presented in (9). A Gibbs sampling algorithm is used for the implementation, where both censored and set observations are also accommodated through data augmentation (as in Fernández and Steel, 1999, 2000). This introduces an additional step in the sampler in which, given the current value of the parameters and mixing variables, point values of the survival times in line with the censored observations are simulated. In this case, this can be easily done by sampling $\log(t_i)$ from a truncated normal distribution. Given these simulated values, the other steps can be treated as if there were neither censoring nor set observations.

Regardless of the mixing distribution and provided that $n > 2 - 2p$, the full conditionals for $\beta$ and $\sigma^2$ are normal and inverted gamma distributions. However, the full conditionals for $\lambda$ and $\theta$ are generally not of a known form for arbitrary mixing distributions. For the $\Lambda_i$'s we have $\pi(\lambda_1, \ldots, \lambda_n \mid \beta, \sigma^2, \theta, t) = \prod_{i=1}^{n} \pi(\lambda_i \mid \beta, \sigma^2, \theta, t)$, where $t = (t_1, \ldots, t_n)'$ are the simulated survival times and

$$\pi(\lambda_i \mid \beta, \sigma^2, \theta, t) \propto \lambda_i^{\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma^2} \lambda_i (\log(t_i) - x_i'\beta)^2\right\} dP(\lambda_i \mid \theta), \quad i = 1, \ldots, n. \tag{14}$$

The full conditional of $\Lambda_i$ for the log-Student t and log-Laplace models has a known form. In the first case it is a Gamma$\left((\nu + 1)/2,\ \frac{1}{2}\left[\{(\log(t_i) - x_i'\beta)^2/\sigma^2\} + \nu\right]\right)$. In the second, it corresponds to an Inverse Gaussian$(\sigma/\lvert \log(t_i) - x_i'\beta \rvert, 1)$. When these full conditionals have no known form, a Metropolis-Hastings (MH) step with a Gaussian random walk proposal can be implemented. However, if the mixing distribution has no closed form (as e.g. for the log-exponential power and log-logistic distributions), the acceptance probability in the MH step is not easily computable. In these cases, we need to carefully consider alternative solutions. For instance, the mixing density of the log-logistic distribution corresponds to an infinite series. In this case we implement the rejection sampling algorithm proposed in Holmes and Held (2006, p.163), using the fact that $(2\sqrt{\Lambda_i})^{-1}$ has the asymptotic Kolmogorov-Smirnov distribution.

The mixing distribution of the log-exponential power model has no known analytical form either. One possible approach to this is to use the hierarchical representation proposed by Ibragimov and Chernin (1959) for the positive stable distributions, which requires the use of $n$ extra augmenting variables. However, as argued in Tsionas (1999), this is not appropriate when the value of $\alpha$ is unknown. Instead, we implement Bayesian inference through the mixture of uniforms representation used in Martín and Pérez (2009). This replaces the use of $\Lambda_i$ by $U_i$ ($i = 1, \ldots, n$), with $U_i \overset{iid}{\sim}$ Gamma$(1 + 1/\alpha, 1)$ and $\log(t_i) \mid U_i = u_i, \beta, \sigma^2, \alpha \sim U(x_i'\beta - \sigma u_i^{1/\alpha},\ x_i'\beta + \sigma u_i^{1/\alpha})$. We restrict the range of $\alpha$ to $(1, 2)$ in order to obtain fatter tails than the log-normal. The case $\alpha = 1$ is excluded but it is covered by the log-Laplace model. We decompose the posterior distribution $\pi(\beta, \sigma^2, \alpha, u_1, \ldots, u_n \mid t)$ as $\prod_{i=1}^{n} \pi(u_i \mid t, \beta, \sigma^2, \alpha) \times \pi(\beta, \sigma^2, \alpha \mid t)$ and the full conditionals for the $U_i$'s are

$$\pi(u_i \mid t, \beta, \sigma^2, \alpha) \propto e^{-u_i}, \quad u_i > \left(\frac{\lvert \log(t_i) - x_i'\beta \rvert}{\sigma}\right)^{\alpha}, \quad i = 1, \ldots, n. \tag{15}$$

Additionally, for censored and set observations $\log(t_i)$ is sampled from a truncated uniform distribution.

In terms of setting up a sampler, it might be easier to simply start directly from the log-exponential power or log-logistic distribution, rather than its interpretation as a scale mixture.
However, we would lose the inference on the mixing variables $\Lambda_i$ or $U_i$ and this is particularly important in identifying outlying observations (see e.g. Lange et al., 1989 and Fernández and Steel, 1999), as further discussed in Section 3.6. Also, the use of these mixing representations facilitates dealing with censored and set observations.

The samplers presented previously update $n$ mixing parameters at each step of the chain. This results in long running times and low efficiency of the algorithms (especially in situations in which sampling from the mixing variables is cumbersome). In order to avoid this problem, the $\Lambda_i$'s will be sampled only every $Q$ iterations of the chain. The value of $Q$ is chosen in order to optimize the efficiency of the sampler, which is measured as the ratio between the Effective Sample Size (ESS) of the chain and the CPU time required. This simplification is not applicable to the log-exponential power model because the truncation of the full conditionals of the $U_i$'s requires the update of these mixing parameters at each iteration of the algorithm.

3.5 Model comparison

In order to compare different models, we will use Bayes factors. These will be computed using the methodology proposed by Chib (1995) and Chib and Jeliazkov (2001). For comparison, the Deviance Information Criterion (DIC) developed by Spiegelhalter et al. (2002) will also be provided. Additionally, models will be compared by the quality of their predictions. We use the log-predictive score (LPS) introduced by Good (1952). For this, we randomly divide the original sample into an inference sample $(t^*_1, \ldots, t^*_{n-q})$ and a predictive sample $(t^*_{n-q+1}, \ldots, t^*_n)$. Lower values of the LPS indicate better predictive accuracy. The LPS is defined as

$$LPS = -\frac{1}{q} \sum_{r=n-q+1}^{n} \log\left(f(t^*_r \mid t^*_1, \ldots, t^*_{n-q})\right). \tag{16}$$

3.6 Outlier detection

The existence of outlying observations will be assessed using the posterior distribution of the mixing variables. Extreme values (with respect to a reference value, $\lambda_{ref}$ or $u_{ref}$) of the mixing variables are associated with outliers (see also West, 1984). Let us first consider mixing through $\Lambda_i$. Formally, evidence of outlying observations will be assessed by contrasting the hypothesis $H_0: \Lambda_i = \lambda_{ref}$ versus $H_1: \Lambda_i \neq \lambda_{ref}$ (with all other $\Lambda_j$, $j \neq i$, free). Evidence in favour of each of these hypotheses will be measured using Bayes factors, which can be computed as the generalized Savage-Dickey density ratio proposed in Verdinelli and Wasserman (1995). The evidence in favour of $H_0$ versus $H_1$ is given by

$$BF_{01} = \pi(\lambda_i \mid t)\, E\left(\frac{1}{dP(\lambda_i \mid \theta)}\right) \Bigg|_{\lambda_i = \lambda_{ref}}, \tag{17}$$

where the expectation is with respect to $\pi(\theta \mid t, \Lambda_i = \lambda_{ref})$. Note that when the parameter $\theta$ does not appear in the model, this expression can be simplified to the original version of the Savage-Dickey density ratio

$$BF_{01} = \frac{\pi(\lambda_i \mid t)}{dP(\lambda_i)} \Bigg|_{\lambda_i = \lambda_{ref}} = E\left(\frac{\pi(t_i \mid \beta, \sigma^2, \lambda_i)}{\pi(t_i \mid \beta, \sigma^2)}\right) \Bigg|_{\lambda_i = \lambda_{ref}}, \tag{18}$$

where the expectation is with respect to $\pi(\beta, \sigma^2 \mid t)$.

The main challenge of this approach is the choice of $\lambda_{ref}$. Intuitively, if there is no unobserved heterogeneity, the posterior distribution of the mixing parameters should not be much affected by the data. Therefore, the mixing distribution (which can be interpreted as a prior distribution for $\Lambda_i$) will then be close to the posterior distribution of $\Lambda_i$. Following this intuition, we propose to use $E(\Lambda_i \mid \theta)$ (if it exists) as $\lambda_{ref}$. Using this rule, $\lambda_{ref} = 1$ for the log-Student t model and $\lambda_{ref} = 0.4$ for the log-logistic model. When the value of $\theta$ is unknown, we estimate it using the MCMC sample obtained during the implementation. In particular, we consider the median of the posterior distribution of $\theta$. The performance of these choices has been validated using simulated datasets.
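For the log-Student t model, the density ratio behind (17) can be sketched in closed form when $(\beta, \sigma^2, \nu)$ are held fixed: the full conditional of $\Lambda_i$ is Gamma$((\nu+1)/2, \frac{1}{2}[\mathrm{resid}^2 + \nu])$ and the mixing density is Gamma($\nu/2$, $\nu/2$), both in shape/rate form. Holding the parameters fixed is a simplifying assumption made for this illustration; (17) itself averages over the posterior. The function names and the argument `resid`, standing for $(\log(t_i) - x_i'\beta)/\sigma$, are hypothetical.

```python
import math

def log_gamma_pdf(x, shape, rate):
    """Log density of a Gamma(shape, rate) distribution at x > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)

def two_log_bf01_student(resid, nu, lam_ref=1.0):
    """2*log BF_01 for H0: Lambda_i = lam_ref, conditional on fixed parameters.

    resid is (log(t_i) - x_i'beta) / sigma, NOT the standardized z.
    """
    post = log_gamma_pdf(lam_ref, (nu + 1.0) / 2.0, (resid ** 2 + nu) / 2.0)
    prior = log_gamma_pdf(lam_ref, nu / 2.0, nu / 2.0)
    return 2.0 * (post - prior)

# Evidence for H0 (no outlier) weakens as the residual grows.
values = [two_log_bf01_student(r, nu=10.0) for r in (0.0, 2.0, 4.0, 6.0)]
```

Negative values of $2\log BF_{01}$ correspond to evidence against $H_0$, that is, in favour of observation $i$ being an outlier.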
Examples for which $E(\Lambda_i \mid \theta)$ is not finite require a more detailed analysis. For example, the expectation of the mixing distribution that generates the log-Laplace distribution does not exist. Using simulated datasets, we note a large heterogeneity between the posterior distributions of the $\Lambda_i$'s, and the existence of a unique reference value is not clear (even in the absence of outlying observations). However, the average of the posterior medians of the mixing distribution is close to one for all simulated datasets we have tried. In this calculation, we discard the lowest 25% of $\lambda_i$ values in order to remove the influence of any possible outliers. Hence, we propose $\lambda_{ref} = 1$ for the log-Laplace model.

Figure 3 shows the performance of the reference values by plotting the Bayes factor in (17) for the log-Student t, log-Laplace and log-logistic models in terms of the standardized log survival time $z$ (given $\beta$, $\sigma^2$ and $\theta$). The standardized observations are defined as $\log(t)$ minus its mean, divided by its standard deviation (i.e. $\sigma \sqrt{E_\Lambda(1/\Lambda \mid \theta)}$). For the log-Student t, log-Laplace and log-logistic models, they are

$$ z = \frac{\log(t) - x'\beta}{\sigma} \sqrt{\frac{\nu - 2}{\nu}} \ (\text{for } \nu > 2), \qquad z = \frac{1}{\sqrt{2}}\, \frac{\log(t) - x'\beta}{\sigma} \qquad \text{and} \qquad z = \frac{\sqrt{3}}{\pi}\, \frac{\log(t) - x'\beta}{\sigma}, $$

respectively. As expected, large values of $|z|$ lead to evidence in favour of an outlier. The log-Student t model with a very large number of degrees of freedom requires an exceptionally large $|z|$ to distinguish it from the log-normal case.

The log-exponential power model is a very special case. The implementation of Bayesian inference for this model did not consider the SMLN representation. A mixture of uniform distributions representation was used instead. Under this representation, the mixing parameters are denoted by $U_i$. The hypotheses associated with outlier detection can be re-written in terms of $U_i$ as $H_0: U_i = u_{ref}$ versus $H_1: U_i \neq u_{ref}$.
The expectation of $U_i$ given $\alpha$ is $1 + 1/\alpha$ and, according to the intuition presented previously, we might use this value as $u_{ref}$. With this rule, $u_{ref}$ is a function of $\alpha$ which lies in $(1.5, 2)$. In practice, this choice overestimated the number of outliers. Even for datasets generated from the log-normal model, several outliers were detected. In fact, we estimate $\pi(u_{ref} \mid t)$ by averaging $\pi(u_{ref} \mid t, \beta, \sigma^2, \alpha)$ in (15) using an MCMC sample from the posterior distribution of $(\beta, \sigma^2, \alpha)$. If the value of $(\beta, \sigma^2, \alpha)$ is such that $u_{ref} \leq \left(\frac{|\log(t_i) - x_i'\beta|}{\sigma}\right)^{\alpha}$, then $\pi(u_{ref} \mid t, \beta, \sigma^2, \alpha)$ is equal to zero and the Bayes factor in favour of observation $i$ being an outlier is computed as infinity. Simulated datasets indicated that the means and medians of $\left(\frac{|\log(t_i) - x_i'\beta|}{\sigma}\right)^{\alpha}$ $(i = 1, \dots, n)$ are around 0.6, regardless of the model from which the data was generated. We use this in order to adjust the reference value to $u_{ref} = 1 + 1/\alpha + 0.6$. This choice performed much better with simulated datasets. The resulting Bayes factors as a function of $z = \frac{\log(t) - x'\beta}{\sigma} \sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}}$ (see Figure 3) are not much affected by the value of $\alpha$. Also, when simulating a log-normal dataset, no outliers were detected.

Figure 3: Bayes factor for outlier detection as a function of $|z|$ for: (a) log-Student t, (b) log-Laplace, (c) log-exponential power and (d) log-logistic models. The log Bayes factor has been re-scaled by 2 in order to apply the interpretation rule proposed in Kass and Raftery (1995). The dotted horizontal line is the threshold for which observations will be detected as outliers. [Plot not reproduced: horizontal axis $|z|$, vertical axis $2\log(BF)$; curves are shown for $\nu = 10, 30, 200$, for $\lambda_{ref} = 0.8, 1, 1.2$ and for $\alpha = 1.1, 1.5, 1.9$.]
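The dependence of the outlier Bayes factor on $|z|$ can be sketched numerically for the log-Student t model. The code below is an illustration under stated assumptions rather than the implementation used for Figure 3: it conditions on fixed $(\beta, \sigma^2, \nu)$, it assumes $\Lambda_i \mid \nu \sim \text{Gamma}(\nu/2, \nu/2)$, and it uses the fact that, by Bayes' theorem, the conditional Savage-Dickey ratio equals the ratio of the normal density with precision $\lambda_{ref}$ to the Student t density at the standardized residual. The function names are hypothetical, and the verbal categories are those of Kass and Raftery (1995) for $2\log(BF)$.

```python
import math


def log_normal_pdf(r, precision):
    """Log density of N(0, 1/precision) evaluated at r."""
    return 0.5 * math.log(precision / (2.0 * math.pi)) - 0.5 * precision * r * r


def log_student_pdf(r, nu):
    """Log density of a standard Student t with nu degrees of freedom at r."""
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi)
            - 0.5 * (nu + 1.0) * math.log1p(r * r / nu))


def two_log_bf_outlier(z, nu, lambda_ref=1.0):
    """2*log(BF_10) in favour of an outlier, conditional on (beta, sigma^2, nu)."""
    if nu <= 2.0:
        raise ValueError("the standardization of z requires nu > 2")
    r = z * math.sqrt(nu / (nu - 2.0))   # undo z = r * sqrt((nu - 2) / nu)
    log_bf01 = log_normal_pdf(r, lambda_ref) - log_student_pdf(r, nu)
    return -2.0 * log_bf01


def kass_raftery_label(two_log_bf):
    """Verbal scale of Kass and Raftery (1995) for 2*log(BF)."""
    if two_log_bf < 0.0:
        return "supports H0"
    if two_log_bf < 2.0:
        return "not worth more than a bare mention"
    if two_log_bf < 6.0:
        return "positive"
    if two_log_bf < 10.0:
        return "strong"
    return "very strong"


if __name__ == "__main__":
    for z in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0):
        value = two_log_bf_outlier(z, nu=10.0)
        print(f"|z| = {z:.0f}  2*log(BF) = {value:7.3f}  {kass_raftery_label(value)}")
```

For a very large number of degrees of freedom the two densities nearly coincide, so the returned value stays close to zero over a wide range of $|z|$, in line with the remark that the log-Student t model is then hard to distinguish from the log-normal.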
4 Application: Veterans administration lung cancer trial

We use a dataset (presented in Kalbfleisch and Prentice, 2002) relating to a trial in which a therapy (standard or test chemotherapy) was randomly applied to 137 patients who were diagnosed with inoperable lung cancer. The survival times of the patients were measured in days since treatment and, as previous studies suggested some heterogeneity between the patients, some covariates were also recorded. The covariates include the treatment that is applied to the patient (0: standard, 1: test); the histological type of the tumor (squamous, small cell, adeno, large cell); a continuous index representing the status of the patient at the moment of the treatment (the higher the index, the better the patient's condition); the time between the diagnosis and the treatment (in months); age (in years); and a binary indicator of prior therapy (0: no, 1: yes). The data contain 9 right-censored observations.

This dataset has been previously analyzed from a frequentist point of view using traditional models such as the Weibull, log-normal and log-logistic regressions (see Lee and Wang, 2003). These models agree that the status of the patient at the moment of treatment and the histological type of the tumor are the relevant explanatory variables for the survival time. Nevertheless, as was argued in Barros et al. (2008), the existence of influential observations affects the inference produced under some models. In fact, they illustrated that the inference produced by a log-Birnbaum-Saunders model is greatly modified when dropping observations 77, 85 and 100. They proposed a log-Birnbaum-Saunders Student t distribution as a more robust alternative for this dataset because it allows for fatter tails and accommodates heterogeneity in the data.
This distribution can also be represented through a mixture family as in (1), so our methodology could also be extended to this distribution.

Table 3 presents a summary of the posterior distribution of $(\beta, \sigma^2)$ when a log-normal AFT model is fitted. This is based on 10,000 draws, recorded from a total of 400,000 iterations with a burn-in period of 200,000 and a thinning of 20. The use of different starting points and the usual convergence criteria strongly suggest convergence of the chains. The Jeffreys and the Independence Jeffreys priors produced similar results, so the choice between these two priors is not too critical. In addition, Bayesian inference was conducted on the basis of set observations, using $l = r = 0.5$ for non-censored observations. Censored observations are necessarily set observations, e.g. $l = 0$ and $r = \infty$ for right-censored observations. We use this treatment of the censored observations for both point and set observations. In this application, this did not produce very different results from those obtained on the basis of point observations. The use of point observations does not produce problems for the log-normal model. However, we know that set observations avoid potential problems with the inference for other models (see Subsection 3.3), so we will focus on the analysis with set observations in the rest of the paper. Results suggest that the main covariate effects are due to the tumour type and patient status, and possibly age (which has a somewhat surprising, largely positive effect). They are roughly in line with the maximum likelihood results for the log-Birnbaum-Saunders model in Barros et al. (2008), although the effect of the test treatment is less clearly negative.

We then use the model (4) with the mixing distributions presented in Table 1. Bayesian inference is conducted under the Jeffreys-type priors introduced in Corollary 1.
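The treatment of set and censored observations described above can be sketched for the log-normal model. This is an illustration only: it reads the text as saying that an observation recorded as $t_i$ is replaced by the interval $(t_i - l, t_i + r)$, whose likelihood contribution is the probability of that interval. The function names and the numerical values are hypothetical.

```python
import math


def lognormal_cdf(t, mu, sigma):
    """CDF of a log-normal distribution with log-scale mean mu and sd sigma."""
    if t <= 0.0:
        return 0.0
    if math.isinf(t):
        return 1.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))


def set_observation_likelihood(t, mu, sigma, left=0.5, right=0.5):
    """P(t - left < T < t + right) under a log-normal(mu, sigma^2) model.

    Assumed reading of the text: left = right = 0.5 for uncensored times
    measured in days; left = 0 and right = infinity for right censoring.
    """
    lower = max(t - left, 0.0)
    upper = t + right
    return lognormal_cdf(upper, mu, sigma) - lognormal_cdf(lower, mu, sigma)


if __name__ == "__main__":
    mu, sigma = 4.0, 1.0
    # Uncensored survival time of 50 days, recorded to the nearest day.
    print(set_observation_likelihood(50.0, mu, sigma))
    # Right-censored at 100 days: the contribution is the survival function.
    print(set_observation_likelihood(100.0, mu, sigma, left=0.0, right=math.inf))
```

For an interval of width one day the contribution is numerically close to the density at the recorded time, which is consistent with the remark that point and set observations gave similar results for the log-normal model.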
We adopted the same total number of iterations, burn-in and thinning period as in the log-normal case. In order to improve the efficiency of the algorithm we choose $Q = 10$ for the log-Student t and log-Laplace models and $Q = 20$ for the log-logistic model. The Bayes factors of these models with respect to the log-normal are computed. Additionally, the LPS and DIC performance of each model is reported. Figure 4 and Table 4 summarize this information. For the LPS criterion, the original sample was randomly partitioned into a prediction sample and an inference sample of sizes 34 and 103, respectively (i.e. 25% and 75% of the original sample). This was repeated 50 times.

Table 3: Summary of the posterior distribution of the parameters of the log-normal AFT model. $\beta_0$: Intercept, $\beta_1$: Treat (test), $\beta_2$: Type (squamous), $\beta_3$: Type (small cell), $\beta_4$: Type (adeno), $\beta_5$: Status, $\beta_6$: Time from diagnosis, $\beta_7$: Age, $\beta_8$: Prior therapy (yes).

Jeffreys prior

| Parameter | Point obs.: Median | Point obs.: HPD 95% | Set obs.: Median | Set obs.: HPD 95% |
|---|---|---|---|---|
| $\beta_0$ | 1.81 | [0.52, 3.11] | 1.79 | [0.38, 3.06] |
| $\beta_1$ | -0.17 | [-0.53, 0.21] | -0.17 | [-0.53, 0.21] |
| $\beta_2$ | -0.11 | [-0.63, 0.45] | -0.12 | [-0.67, 0.43] |
| $\beta_3$ | -0.73 | [-1.25, -0.19] | -0.73 | [-1.24, -0.19] |
| $\beta_4$ | -0.77 | [-1.37, -0.17] | -0.78 | [-1.37, -0.19] |
| $\beta_5$ | 0.04 | [0.03, 0.05] | 0.04 | [0.03, 0.05] |
| $\beta_6$ | 0.00 | [-0.02, 0.02] | 0.00 | [-0.02, 0.02] |
| $\beta_7$ | 0.01 | [-0.01, 0.03] | 0.01 | [-0.01, 0.03] |
| $\beta_8$ | -0.10 | [-0.54, 0.35] | -0.11 | [-0.55, 0.33] |
| $\sigma^2$ | 1.12 | [0.86, 1.42] | 1.13 | [0.88, 1.44] |

Independence Jeffreys prior

| Parameter | Point obs.: Median | Point obs.: HPD 95% | Set obs.: Median | Set obs.: HPD 95% |
|---|---|---|---|---|
| $\beta_0$ | 1.82 | [0.44, 3.17] | 1.80 | [0.38, 3.09] |
| $\beta_1$ | -0.17 | [-0.56, 0.21] | -0.17 | [-0.57, 0.22] |
| $\beta_2$ | -0.11 | [-0.68, 0.47] | -0.11 | [-0.66, 0.47] |
| $\beta_3$ | -0.72 | [-1.28, -0.18] | -0.72 | [-1.27, -0.16] |
| $\beta_4$ | -0.76 | [-1.39, -0.16] | -0.77 | [-1.39, -0.17] |
| $\beta_5$ | 0.04 | [0.03, 0.05] | 0.04 | [0.03, 0.05] |
| $\beta_6$ | 0.00 | [-0.02, 0.02] | 0.00 | [-0.02, 0.02] |
| $\beta_7$ | 0.01 | [-0.01, 0.03] | 0.01 | [-0.01, 0.03] |
| $\beta_8$ | -0.11 | [-0.57, 0.36] | -0.10 | [-0.58, 0.35] |
| $\sigma^2$ | 1.21 | [0.93, 1.54] | 1.21 | [0.92, 1.54] |

Clearly, Bayes factors, LPS and DIC provide evidence in favour of these mixture models. For the log-Student t model, this evidence is also supported by the fact that inference on $\nu$ favours relatively small values. Similarly, the log-exponential power model suggests values of $\alpha$ far from 2. Figures 5 and 6 contrast the prior and the posterior distributions of $\nu$ and $\alpha$ under the log-Student t and log-exponential power models. They reflect how the prior and posterior distributions differ and that the latter is strongly driven by the data itself. For example, in spite of the fact that the Jeffreys prior for $\nu$ has a thicker right tail, the posterior distribution of $\nu$ looks somewhat similar under both priors. The contrast between the Jeffreys prior for $\alpha$ and its posterior is somewhat surprising, in view of the results for the other priors. This is explained by the fact that $\sigma^2$ and $\alpha$ are highly (positively) correlated a posteriori. For $k = 9$, the Jeffreys prior assigns high probabilities to low values of $\sigma^2$ (much higher in comparison with the other two priors) and therefore the Jeffreys prior is implicitly driving the posterior of $\alpha$ towards 1.
Using a modification of the Jeffreys prior where $p = 1$, we find indeed that the posterior of $\alpha$ is shifted much to the right, with a mode around 1.5.

Combining these results, the log-logistic model appears to be the best candidate for fitting this dataset. This is in line with the results in Lee and Wang (2003) in which, using a frequentist approach, the log-logistic model is preferred to the log-normal and other standard models. The posterior distribution of $\beta$ is somewhat different from that for the log-normal model (see Table 5). In particular, the effect of the test treatment is less pronounced and the effect of age is also less prominent and more in line with our expectations. The results on $\beta$ are relatively close to the classical ones reported in Barros et al. (2008) using the log-Birnbaum-Saunders Student t model and to the ones in Lee and Wang (2003) using the log-logistic model. The estimates of $\sigma^2$ cannot be compared across models because $\sigma^2$ has a different interpretation for each model.

Table 4: Comparison of the fitted models by DIC and the percentage of repetitions (out of 50) for which the model performs better than the log-normal (LN) model in terms of the LPS score.

| Model | Jeffreys: DIC | Jeffreys: Better than LN | Type I Ind. Jeffreys: DIC | Type I Ind. Jeffreys: Better than LN | Ind. Jeffreys: DIC | Ind. Jeffreys: Better than LN |
|---|---|---|---|---|---|---|
| Log-normal | 1449.10 | - | 1449.51 | - | 1449.51 | - |
| Log-Student t | 1446.08 | 60% | 1445.66 | 74% | - | - |
| Log-Laplace | 1444.25 | 66% | 1444.28 | 66% | 1444.28 | 66% |
| Log-exp. power | 1444.03 | 74% | 1444.33 | 84% | 1444.33 | 84% |
| Log-logistic | 1444.53 | 80% | 1444.24 | 88% | 1444.24 | 88% |

Figure 4: Bayes factor of each model with respect to the log-normal and average of the LPS score obtained for the 50 repetitions. (a) Jeffreys prior, (b) Type I Ind. Jeffreys prior, (c) Ind. Jeffreys prior. [Plot not reproduced: horizontal axis log-Bayes factors, vertical axis log-predictive score; points labelled LN, LST, LLAP, LEP and LLOG.]

Figure 5: Histogram of the posterior for $\nu$ with the log-Student t model. The solid line corresponds to $\pi(\nu)$ defined by (a) Jeffreys prior and (b) Type I Independence Jeffreys prior. [Plot not reproduced: horizontal axis $\nu$, vertical axis relative frequency / density.]

Figure 6: Histogram of the posterior for $\alpha$ with the log-exponential power model. The solid line corresponds to $\pi(\alpha)$ defined by (a) Jeffreys prior, (b) Type I Ind. Jeffreys prior and (c) Ind. Jeffreys prior. [Plot not reproduced: horizontal axis $\alpha$, vertical axis relative frequency / density.]

Table 5: Summary of the posterior distribution of the parameters of the model under the Jeffreys prior for the log-Student t, log-Laplace, log-logistic and log-exponential power models. Results are presented on the basis of set observations. $\beta_0$: Intercept, $\beta_1$: Treat (test), $\beta_2$: Type (squamous), $\beta_3$: Type (small cell), $\beta_4$: Type (adeno), $\beta_5$: Status, $\beta_6$: Time from diagnosis, $\beta_7$: Age, $\beta_8$: Prior therapy (yes).
| Parameter | Log-Student t: Median | Log-Student t: HPD 95% | Log-Laplace: Median | Log-Laplace: HPD 95% | Log-exp. power: Median | Log-exp. power: HPD 95% | Log-logistic: Median | Log-logistic: HPD 95% |
|---|---|---|---|---|---|---|---|---|
| $\beta_0$ | 2.12 | [0.79, 3.36] | 2.08 | [0.82, 3.29] | 2.00 | [0.70, 3.27] | 2.06 | [0.74, 3.31] |
| $\beta_1$ | -0.07 | [-0.43, 0.26] | -0.06 | [-0.42, 0.25] | -0.08 | [-0.41, 0.28] | -0.10 | [-0.46, 0.24] |
| $\beta_2$ | 0.03 | [-0.50, 0.57] | -0.04 | [-0.55, 0.51] | -0.04 | [-0.55, 0.52] | -0.02 | [-0.55, 0.51] |
| $\beta_3$ | -0.73 | [-1.23, -0.28] | -0.73 | [-1.20, -0.26] | -0.72 | [-1.19, -0.22] | -0.73 | [-1.23, -0.26] |
| $\beta_4$ | -0.74 | [-1.26, -0.26] | -0.67 | [-1.14, -0.21] | -0.69 | [-1.20, -0.20] | -0.76 | [-1.31, -0.26] |
| $\beta_5$ | 0.04 | [0.03, 0.04] | 0.04 | [0.03, 0.04] | 0.04 | [0.03, 0.04] | 0.04 | [0.03, 0.05] |
| $\beta_6$ | 0.00 | [-0.02, 0.02] | 0.01 | [-0.02, 0.02] | 0.00 | [-0.02, 0.02] | 0.00 | [-0.02, 0.02] |
| $\beta_7$ | 0.01 | [-0.01, 0.03] | 0.01 | [-0.01, 0.03] | 0.01 | [-0.01, 0.03] | 0.01 | [-0.01, 0.03] |
| $\beta_8$ | -0.09 | [-0.49, 0.33] | -0.10 | [-0.48, 0.31] | -0.10 | [-0.50, 0.30] | -0.10 | [-0.52, 0.30] |
| $\sigma^2$ | 0.66 | [0.37, 1.01] | 0.60 | [0.41, 0.81] | 0.87 | [0.45, 1.64] | 0.33 | [0.24, 0.43] |
| $\theta$ | 4.28 | [1.45, 12.61] | - | - | 1.15 | [1.00, 1.54] | - | - |

The posterior distributions of the mixing parameters (not reported) vary substantially between the patients, suggesting heterogeneity in the data. Figure 7 formalizes this by presenting the Bayes factor in favour of being an outlier for each of the 137 observations. There is clear evidence for the existence of outlying observations under the three considered priors for all models. Although all priors present similar results, this evidence is slightly stronger for the Jeffreys prior. Moreover, the choice of the mixture model does not greatly affect the conclusions. The analysis suggests that observations 77 and 85 are very clear outliers, while 44 and 100 are also obvious outliers. Patients 77 and 85 had an uncensored survival time of 1 day (lower than the 2nd percentile of the survival times), were under the standard treatment and had a squamous type of tumor. Observations 15, 17, 21, 75 and 95 are also detected as outlying observations for some of these models.
With the exception of patient 21, they all correspond to uncensored observations. While observations 15, 95 and 100 have a small survival time (8, 2 and 11 days respectively), the survival times associated with patients 17, 21, 44 and 75 are larger (384, 123, 392 and 991 days respectively). In particular, the survival time of patient 75 is above the 99th percentile of the survival times of patients with the same type of tumor (squamous). Similarly, the survival times of patients 17 and 44 are above the 98th percentile for patients that had the same type of tumor (small cell). None of the patients detected as possible outliers had tumors of type adeno or large cell. Additionally, patient 95 has a considerably larger number of months from diagnosis than other patients with the same type of tumor (small cell). The previous results are roughly in line with the results in Barros et al. (2008), where observations 77, 85 and 100 were also identified as possible influential observations when fitting a log-Birnbaum-Saunders model. However, using a log-Birnbaum-Saunders-t model they also label observations 12, 77, 95 and 106 as influential observations, whereas our models would point to observations 44 and possibly 15 and 95 as additional outliers. This illustrates that different models might detect different outliers.

5 Concluding remarks

We recommend the use of mixtures of life distributions defined in Section 2 as a convenient framework for survival analysis, particularly when standard models such as the Weibull or log-normal are not able to capture some features of the data. This approach intuitively leads to flexible distributions on the basis of a known distribution by mixing with a random quantity. This can also be interpreted as random effects or frailty terms. Specifically, the use of these mixture families can be justified in the presence of unobserved heterogeneity or outlying observations.
Figure 7: Under Jeffreys prior. $2 \times \log(BF)$ in favour of the hypothesis $H_1: \lambda_i \neq \lambda_{ref}$ ($u_i \neq u_{ref}$) versus $H_0: \lambda_i = \lambda_{ref}$ ($u_i = u_{ref}$). For (a) log-Student t, (b) log-Laplace, (c) log-exp. power and (d) log-logistic model. Horizontal lines reflect the interpretation rule of Kass and Raftery (1995). [Plot not reproduced: horizontal axis observation index (0 to 140), vertical axis $2\log(BF)$; the most extreme observations are labelled in each panel.]

In particular, the SMLN family is presented as an example of mixtures of life distributions. This family, while based on the log-normal model, naturally allows us to fit data with a variety of tail behavior. The mixing is applied to the shape parameter of the log-normal distribution and the resultant distribution is quite flexible and can be adjusted by choosing the mixing distribution. This makes this family applicable in a wide range of situations. Covariates are included through an accelerated failure time (AFT) specification.

We consider objective Bayesian inference under the AFT-SMLN model. We propose three different Jeffreys-type priors. These prior distributions are not proper and therefore the propriety of the posterior distribution cannot be guaranteed. Theorem 2 provides necessary and sufficient conditions for the propriety of the posterior. As these conditions are not always easy to check in practice, Corollary 2 and Theorem 3 provide some extra guidance and results for specific mixing distributions. Additionally, the problem associated with the use of point observations was explored. The use of set observations was considered as a solution to this problem. This solution can easily be implemented using Gibbs sampling.
We recommend their use in situations in which it is not easy to verify the existence of problems associated with point observations.

Additionally, we propose a methodology for outlier detection that is based on the mixing structure. It is intuitively clear that outlying observations are associated with extreme values for their corresponding mixing variable, and the evidence of outlying observations is formalized by means of Bayes factors. The choice of a reference value is critical and is not always trivial. We provide a general rule for situations in which the mixing distribution has a finite expectation. We also provide guidance in those examples where this rule cannot be applied.

The mixing framework proposed here can be used with any proper mixing distribution, whether the induced survival time distribution has a closed-form density function or not. The proposed MCMC inference scheme does not rely on a closed-form expression for the survival density with the mixing variables integrated out, so our Bayesian inference can be used much more widely than in the examples illustrated here. The challenge then may be that the Jeffreys-type prior for the parameter(s) of the mixing distribution needs to be derived from Theorem 1 (rather than from the integrated survival density) and this may not be trivial in general.

A Proofs

Theorem 1. Taking the negative expectation of the second derivatives of the log likelihood, the expressions $k_1(\theta)$, $k_2(\theta)$, $k_3(\theta)$ and $k_4(\theta)$ are given by

$$ k_1(\theta) = n E_{T_i}\left[ \left(\frac{\log(t_i) - x_i'\beta}{\sigma}\right)^2 \left( \frac{E_{\Lambda_i}\left[\Lambda_i f_{LN}\left(t_i \mid x_i'\beta, \frac{\sigma^2}{\Lambda_i}\right)\right]}{f(t_i)} \right)^2 \right], \tag{19} $$

$$ k_2(\theta) = \frac{n}{4} \left\{ E_{T_i}\left[ \left(\frac{\log(t_i) - x_i'\beta}{\sigma}\right)^4 \left( \frac{E_{\Lambda_i}\left[\Lambda_i f_{LN}\left(t_i \mid x_i'\beta, \frac{\sigma^2}{\Lambda_i}\right)\right]}{f(t_i)} \right)^2 \right] - 1 \right\}, \tag{20} $$

$$ k_3(\theta) = \frac{n}{2} E_{T_i}\left[ \left(\frac{\log(t_i) - x_i'\beta}{\sigma}\right)^2 \frac{E_{\Lambda_i}\left[\Lambda_i f_{LN}\left(t_i \mid x_i'\beta, \frac{\sigma^2}{\Lambda_i}\right)\right] \int_0^\infty f_{LN}\left(t_i \mid x_i'\beta, \frac{\sigma^2}{\lambda_i}\right) \frac{d}{d\theta} dP_{\Lambda_i}(\lambda_i \mid \theta)}{f^2(t_i)} \right] - \frac{1}{2} \sum_{i=1}^{n} \int_0^\infty \frac{d}{d\theta} dP_{\Lambda_i}(\lambda_i \mid \theta), \tag{21} $$

$$ k_4(\theta) = n E_{T_i}\left[ \left( \frac{\int_0^\infty f_{LN}\left(t_i \mid x_i'\beta, \frac{\sigma^2}{\lambda_i}\right) \frac{d}{d\theta} dP_{\Lambda_i}(\lambda_i \mid \theta)}{f(t_i)} \right)^2 \right] - \sum_{i=1}^{n} \int_0^\infty \frac{d^2}{d\theta^2} dP_{\Lambda_i}(\lambda_i \mid \theta). \tag{22} $$

Corollary 1. The proof follows directly from Theorem 1 using the structure of the determinant of the Fisher information matrix and its sub-matrices.

Proposition 1. The contribution of the censored observations to the likelihood function is given respectively by $1 - F_T(t_i)$, $F_T(t_i)$ or $F_T(t_{i2}) - F_T(t_{i1})$ in the case of right, left or interval censoring. These three quantities have an upper bound of 1 and, using the independence of the survival times, the likelihood of the data is bounded by the likelihood of the $n - n_c$ observed survival times. Therefore, the marginal likelihood of the complete data is bounded by the marginal likelihood of these $n - n_c$ observations and the result follows.

Theorem 2. (Based on Fernández and Steel, 2000). Let $T = (T_1, \dots, T_n)'$. It follows that

$$ f_T(t \mid \beta, \sigma^2, \theta) = \prod_{i=1}^{n} t_i^{-1}\, (2\pi\sigma^2)^{-\frac{n}{2}} \int_{\mathbb{R}_+^n} \prod_{i=1}^{n} \lambda_i^{\frac{1}{2}}\, e^{-\frac{1}{2\sigma^2}\left[(\beta - a)'A(\beta - a) + S^2(D, y)\right]} \prod_{i=1}^{n} dP(\lambda_i \mid \theta), \tag{23} $$

where $y = (\log(t_1), \dots, \log(t_n))'$, $a = (X'DX)^{-1}X'Dy$, $A = X'DX$ and $S^2(D, y) = y'Dy - y'DX(X'DX)^{-1}X'Dy$. The posterior distribution of $(\beta, \sigma^2, \theta)$ is proper if and only if the marginal density of $T$, $f_T(t)$, is finite. In the following, we use Fubini's theorem in order to exchange the order of the integrals. Integrating with respect to $\beta$, we have

$$ f_T(t) \propto \prod_{i=1}^{n} t_i^{-1} \int_{\mathbb{R}_+} \int_{\Theta} \int_{\mathbb{R}_+^n} \frac{\prod_{i=1}^{n} \lambda_i^{\frac{1}{2}}}{\sqrt{\det(X'DX)}}\, (\sigma^2)^{-\frac{n+2p-k}{2}}\, e^{-\frac{S^2(D, y)}{2\sigma^2}}\, \pi(\theta) \prod_{i=1}^{n} dP(\lambda_i \mid \theta)\, d\theta\, d\sigma^2. \tag{24} $$

After integrating with respect to $\sigma^2$, it follows that

$$ f_T(t) \propto \prod_{i=1}^{n} t_i^{-1} \int_{\Theta} \int_{\mathbb{R}_+^n} \pi(\theta) \prod_{i=1}^{n} \lambda_i^{\frac{1}{2}}\, \left(\det(X'DX)\right)^{-\frac{1}{2}} \left[S^2(D, y)\right]^{-\frac{n+2p-k-2}{2}} \prod_{i=1}^{n} dP(\lambda_i \mid \theta)\, d\theta, \tag{25} $$

provided that $n + 2p - k - 2 > 0$. As $X$ has full rank, $n > k$ is also required. Finally, provided that $t_i \neq 0$ for all $i \in \{1, \dots, n\}$, the theorem holds.

Lemma 1.
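Under the same assumptions as in Theorem 2, the posterior distribution of $(\beta, \sigma^2, \theta)$ given the data is proper if and only if $n > \max\{k, k - 2p + 2\}$ and

$$ \int_{\Theta} \int_{0 < \lambda_1 < \cdots < \lambda_n < \infty} \prod_{i \notin \{m_1, \dots, m_k, m_{k+1}\}} \lambda_i^{\frac{1}{2}}\; \lambda_{m_{k+1}}^{-\frac{n+2p-k-3}{2}}\, \pi(\theta) \prod_{i=1}^{n} dP(\lambda_i \mid \theta)\, d\theta < \infty, \tag{26} $$

where

$$ \prod_{i=1}^{k} \lambda_{m_i} \equiv \max\left\{ \prod_{i=1}^{k} \lambda_{l_i} : \det\left( (x_{l_1} \cdots x_{l_k}) \right) \neq 0 \right\}, \tag{27} $$

$$ \prod_{i=1}^{k+1} \lambda_{m_i} \equiv \max\left\{ \prod_{i=1}^{k+1} \lambda_{l_i} : \det\begin{pmatrix} x_{l_1} & \cdots & x_{l_{k+1}} \\ \log(t_{l_1}) & \cdots & \log(t_{l_{k+1}}) \end{pmatrix} \neq 0 \right\}. \tag{28} $$

Proof: It can be shown that $\det(X'DX)$ has upper and lower bounds that are proportional to $\prod_{i=1}^{k} \lambda_{m_i}$ and $S^2(D, y)$ has upper and lower bounds that are proportional to $\lambda_{m_{k+1}}$ (see Fernández and Steel, 1999). Combined with (25), this lemma is immediate.

Corollary 2. Excluding a set of zero Lebesgue measure, it follows that $\lambda_{m_{k+1}}$ in Lemma 1 is equal to $\lambda_b = \max\{\lambda_i : i \neq m_1, \dots, m_k\}$.

(i) For $p = 1$. By Lemma 1, the integral in (26) is bounded by $\int_{\Theta} \pi(\theta)\, d\theta$. Hence, a sufficient condition for the propriety of the posterior distribution is that $\pi(\theta)$ is proper.

(ii) For $p = 1 + k/2$. By Lemma 1, (26) is bounded by $\int_{\Theta} E(\lambda_b^{-\frac{k}{2}} \mid \theta)\, \pi(\theta)\, d\theta$. However, $E(\lambda_b^{-\frac{k}{2}} \mid \theta) \leq E(\lambda_{(1)}^{-\frac{k}{2}} \mid \theta)$ where $\lambda_{(1)} = \min\{\lambda_1, \dots, \lambda_n\}$. Using the density of the first order statistic it follows that $E(\lambda_{(1)}^{-\frac{k}{2}} \mid \theta) \leq n E(\lambda_i^{-\frac{k}{2}} \mid \theta)$ for all $i = 1, \dots, n$ and hence, as the $\lambda_i$'s are iid, a sufficient condition for the propriety of the posterior distribution is

$$ \int_{\Theta} E(\lambda_i^{-\frac{k}{2}} \mid \theta)\, \pi(\theta)\, d\theta < \infty. \tag{29} $$

(iii) The proof is immediate when using Fubini's theorem in (25).

Theorem 3.

(i) It can be shown that the Fisher information matrix corresponds to

$$ \begin{pmatrix} \frac{1}{\sigma^2} \frac{\nu+1}{\nu+3} \sum_{i=1}^{n} x_i x_i' & 0 & 0 \\ 0 & \frac{n}{2\sigma^4} \frac{\nu}{\nu+3} & -\frac{n}{\sigma^2} \frac{1}{(\nu+1)(\nu+3)} \\ 0 & -\frac{n}{\sigma^2} \frac{1}{(\nu+1)(\nu+3)} & \frac{n}{4}\left[ \Psi'\left(\frac{\nu}{2}\right) - \Psi'\left(\frac{\nu+1}{2}\right) - \frac{2(\nu+5)}{\nu(\nu+1)(\nu+3)} \right] \end{pmatrix}. \tag{30} $$

Therefore, the components depending on $\nu$ of the Jeffreys, Type I Independence Jeffreys and Independence Jeffreys prior are, respectively

$$ \pi^{J}(\nu) = \sqrt{\frac{\nu(\nu+1)^k}{(\nu+3)^{k+1}}} \sqrt{ \Psi'\left(\frac{\nu}{2}\right) - \Psi'\left(\frac{\nu+1}{2}\right) - \frac{2(\nu+3)}{\nu(\nu+1)^2} }, \tag{31} $$

$$ \pi^{II}(\nu) = \sqrt{\frac{\nu}{\nu+3}} \sqrt{ \Psi'\left(\frac{\nu}{2}\right) - \Psi'\left(\frac{\nu+1}{2}\right) - \frac{2(\nu+3)}{\nu(\nu+1)^2} }, \tag{32} $$

$$ \pi^{I}(\nu) = \sqrt{ \Psi'\left(\frac{\nu}{2}\right) - \Psi'\left(\frac{\nu+1}{2}\right) - \frac{2(\nu+5)}{\nu(\nu+1)(\nu+3)} }. \tag{33} $$

It can be shown that $\pi^{J}(\nu)$ and $\pi^{II}(\nu)$ are proper priors for $\nu$ (using Fonseca et al., 2008). Therefore, Corollary 2 part (i) implies the propriety of the posterior distribution for these two priors. Corollary 2 part (iii) indicates that the Independence Jeffreys prior does not produce a proper posterior. In fact, using Stirling's approximation for the gamma function, it follows that

$$ \prod_{i=1}^{n} p(\lambda_i \mid \nu) \approx \left(\frac{\nu/2}{\nu/2 - 1}\right)^{n\nu/2} \left(\frac{1}{\nu/2 - 1}\right)^{n/2} \left(\prod_{i=1}^{n} \lambda_i\right)^{\nu/2 - 1} e^{-\frac{\nu}{2}\left(\sum_{i=1}^{n} \lambda_i - n\right)}\, e^{-n}, \tag{34} $$

as $\nu$ goes to infinity. Moreover, by Corollary 1 in Fonseca et al. (2008), $\pi^{I}(\nu)$ is of order $\nu^{-2}$ when $\nu \to \infty$. Therefore, $\int_0^\infty \pi(\nu) \prod_{i=1}^{n} p(\lambda_i \mid \nu)\, d\nu = \infty$ for any value of $\lambda$ in $A$, where $A = \{\lambda = (\lambda_1, \dots, \lambda_n)' : \sum_{i=1}^{n} \lambda_i < n\}$.

(ii) For the Jeffreys prior, $E(\lambda_1^{-\frac{k}{2}})$ is finite and therefore the posterior is proper. As the parameter $\theta$ is not required, the Independence Jeffreys and Type I Independence Jeffreys coincide and Corollary 2 part (i) indicates that the posterior is proper under these priors.

(iii) The Fisher information matrix was derived by Martín and Pérez (2009) and is given by

$$ \begin{pmatrix} \frac{\alpha(\alpha-1)\Gamma\left(1 - \frac{1}{\alpha}\right)}{\sigma^2\, \Gamma\left(\frac{1}{\alpha}\right)} \sum_{i=1}^{n} x_i x_i' & 0 & 0 \\ 0 & \frac{n\alpha}{\sigma^2} & -\frac{n\left(1 + \Psi\left(1 + \frac{1}{\alpha}\right)\right)}{\sigma\alpha} \\ 0 & -\frac{n\left(1 + \Psi\left(1 + \frac{1}{\alpha}\right)\right)}{\sigma\alpha} & \frac{n}{\alpha^3}\left[ \left(1 + \frac{1}{\alpha}\right)\Psi'\left(1 + \frac{1}{\alpha}\right) + \left(1 + \Psi\left(1 + \frac{1}{\alpha}\right)\right)^2 - 1 \right] \end{pmatrix}. $$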
Therefore, the components depending on $\alpha$ of the Jeffreys, Type I Independence Jeffreys and Independence Jeffreys prior are, respectively

$$ \pi^{J}(\alpha) = \left(\frac{\alpha(\alpha-1)\Gamma(1 - 1/\alpha)}{\Gamma(1/\alpha)}\right)^{k/2} \frac{1}{\alpha} \sqrt{ \left(1 + \frac{1}{\alpha}\right)\Psi'\left(1 + \frac{1}{\alpha}\right) - 1 }, \tag{35} $$

$$ \pi^{II}(\alpha) = \frac{1}{\alpha} \sqrt{ \left(1 + \frac{1}{\alpha}\right)\Psi'\left(1 + \frac{1}{\alpha}\right) - 1 }, \tag{36} $$

$$ \pi^{I}(\alpha) = \sqrt{ \frac{1}{\alpha^3}\left[ \left(1 + \frac{1}{\alpha}\right)\Psi'\left(1 + \frac{1}{\alpha}\right) + \left(1 + \Psi\left(1 + \frac{1}{\alpha}\right)\right)^2 - 1 \right] }. \tag{37} $$

As the previous components are bounded continuous functions of $\alpha$ in $(1, 2)$, they are proper priors for $\alpha$. Corollary 2 part (i) implies the propriety of the posterior distribution under the Type I Independence Jeffreys and Independence Jeffreys prior. The propriety of the posterior under the Jeffreys prior can be verified using Corollary 2 part (ii) because $E(\lambda_1^{-\frac{k}{2}} \mid \alpha)$ is a continuous bounded function for $\alpha \in (1, 2)$. In fact,

$$ E(\lambda_1^{-\frac{k}{2}} \mid \alpha) = \frac{\Gamma(3/2)}{\Gamma(1 + 1/\alpha)}\, \frac{E(W^{\frac{k+1}{2}} \mid \alpha)}{E(Z^{\frac{k+1}{2}} \mid \alpha)} = \frac{\Gamma(3/2)}{\Gamma(1 + 1/\alpha)}\, \frac{\Gamma((k+1)/\alpha + 1)}{\Gamma((k+3)/2)}, \tag{38} $$

where $W \sim \text{Weibull}(\alpha/2, 1)$ and $Z \sim \text{Exponential}(1)$. The latter uses the lemma in Meintanis (1998) which states that a Weibull($a$, 1) random variable can be represented as the ratio of an Exponential(1) and an independent positive stable($a$) random variable.

(iv) As the parameter $\theta$ is not required, we can conclude the same as in the log-Laplace model for the Independence Jeffreys and Type I Independence Jeffreys prior. For the Jeffreys prior, Corollary 2 part (ii) is used. It can be shown that $\Omega_i = \sqrt{1/(4\Lambda_i)}$ has an Asymptotic Kolmogorov distribution with density function $g(\omega_i) = 8\omega_i \sum_{k=1}^{\infty} (-1)^{k+1} k^2 e^{-2k^2\omega_i^2}$, for $\omega_i > 0$. Therefore, $E(\lambda_1^{-k/2}) = 2^k E(\omega_i^k)$, which is finite because

$$ \sum_{k=1}^{\infty} (-1)^{k+1} k^2 \int_0^\infty \omega_1^{k+1} e^{-2k^2\omega_1^2}\, d\omega_1 = \sum_{k=1}^{\infty} (-1)^{k+1} k^2\, \frac{1}{k+1} \left(\frac{1}{\sqrt{2}\,k}\right)^{k+2} \tag{39} $$

is finite (it has the form $\sum_{k=1}^{\infty} (-1)^{k+1} a_k$ with $a_k \to 0$ as $k \to \infty$).

Theorem 4. This proof requires Lemma 1.
As was argued in Fernández and Steel (1999), if $s$ is the largest number of observations that can be written as an exact linear combination of their covariates, it follows that $\lambda_{m_{k+1}}$ in Lemma 1 corresponds to $\lambda_{(n-s)}$, which represents the $(n-s)$-th order statistic of $\lambda_1, \dots, \lambda_n$. The rest of the proof is obtained by iteratively integrating (26). We use the following inequality (see Fernández and Steel, 1999)

$$ \frac{\lambda_{i+1}^{v}}{v}\, e^{-r\lambda_{i+1}} \leq \int_0^{\lambda_{i+1}} \lambda_i^{v-1} e^{-r\lambda_i}\, d\lambda_i \leq \frac{\lambda_{i+1}^{v}}{v}, \qquad r, v > 0. \tag{40} $$

Note that the integral in (40) is infinity for $v < 0$. After integrating with respect to the $n - s - 1$ smallest $\lambda$'s, (26) has an upper bound and lower bound given respectively by

$$ \int_0^\infty \int_{0 < \lambda_{(n-s)} < \cdots < \lambda_{(n)} < \infty} \left[\frac{\left(\frac{\nu}{2}\right)^{\frac{\nu}{2}}}{\Gamma\left(\frac{\nu}{2}\right)}\right]^{n-s} \frac{\left(\frac{\nu+1}{2}\right)^{-(n-s-1)}}{(n-s-1)!}\, \lambda_{(n-s)}^{a-1}\, e^{-\lambda_{(n-s)}\frac{\nu}{2}} \prod_{i=n-s+1}^{n} dP(\lambda_{(i)} \mid \nu)\, \pi(\nu)\, d\nu \tag{41} $$

$$ \int_0^\infty \int_{0 < \lambda_{(n-s)} < \cdots < \lambda_{(n)} < \infty} \left[\frac{\left(\frac{\nu}{2}\right)^{\frac{\nu}{2}}}{\Gamma\left(\frac{\nu}{2}\right)}\right]^{n-s} \frac{\left(\frac{\nu+1}{2}\right)^{-(n-s-1)}}{(n-s-1)!}\, \lambda_{(n-s)}^{a-1}\, e^{-\frac{(n-s)\nu}{2}\lambda_{(n-s)}} \prod_{i=n-s+1}^{n} dP(\lambda_{(i)} \mid \nu)\, \pi(\nu)\, d\nu, \tag{42} $$

where $a = -\frac{n+2p-k-3}{2} + \frac{\nu}{2} + \frac{(n-s-1)(\nu+1)}{2}$. Note that when integrating with respect to $\lambda_{(n-s)}$, it requires $a > 0$ in order to have a finite integral in (41) and (42). Hence, the propriety of the posterior distribution requires $\nu > \frac{n-k+(2p-2)}{n-s} - 1$.

Theorem 5. The proof follows from Fubini's theorem and the proof of Theorem 2. Define $I(\cdot)$ as the integral in (25) as a function of the observables. As $E$ is bounded, $\int_E I(s)\, ds$ is bounded as long as $I(\cdot)$ is finite except on a set of zero Lebesgue measure.

References

Angeletti, F., E. Bertin, and P. Abry (2011). Critical moment definition and estimation, for finite size observation of log-exponential-power law random variables. arXiv:1103.5033.
Barros, M., G. Paula, and V. Leiva (2008). A new class of survival regression models with heavy-tailed errors: robustness and diagnostics. Lifetime Data Analysis 14, 316–332.
Bernardo, J. and A. Smith (1994). Bayesian Theory. Chichester: Wiley.
Cassidy, D., M. Hamp, and R. Ouyed (2009).
Pricing European options with a log Student's t-distribution: a Gosset formula. Physica A: Statistical Mechanics and its Applications 389, 5736–5748.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90, 1313–1321.
Chib, S. and I. Jeliazkov (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association 96, 270–281.
Fernández, C. and M. Steel (1998). On the dangers of modelling through continuous distribution: A Bayesian perspective. Bayesian Statistics 6, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith (eds.), Oxford University Press, 213–238.
Fernández, C. and M. Steel (1999). Multivariate Student-t regression models: Pitfalls and inference. Biometrika 86, 153–167.
Fernández, C. and M. Steel (2000). Bayesian regression analysis with scale mixtures of normals. Econometric Theory 16, 80–101.
Fonseca, T., M. Ferreira, and H. Migon (2008). Objective Bayesian analysis for the Student-t regression model. Biometrika 95, 325–333.
Good, I. (1952). Rational decisions. Journal of the Royal Statistical Society. Series B (Methodological) 14, 107–114.
Hogg, R. V. and S. A. Klugman (1983). On the estimation of long tailed skewed distributions with actuarial applications. Journal of Econometrics 23, 91–102.
Holmes, C. and L. Held (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis 1, 145–168.
Ibragimov, I. and K. Chernin (1959). On the unimodality of stable laws. Theory of Probability and its Applications 4, 417–419.
Jewell, N. (1982). Mixtures of exponential distributions. The Annals of Statistics 10, 479–484.
Kalbfleisch, J. and R. Prentice (2002). The Statistical Analysis of Failure Time Data (2nd ed.). Wiley.
Kass, R. and A. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.
Lange, K., R. Little, and J. Taylor (1989).
Robust statistical modelling using the t distribution. Journal of the American Statistical Association 84, 881–896.

Lee, E. and J. Wang (2003). Statistical methods for survival data analysis (3rd ed.). Wiley.

Lindsey, J., W. Byrom, J. Wang, P. Jarvis, and B. Jones (2000). Generalized nonlinear models for pharmacokinetic data. Biometrics 56, 81–88.

Marshall, A. W. and I. Olkin (2007). Life Distributions. Springer.

Martín, J. and C. Pérez (2009). Bayesian analysis of a generalized lognormal distribution. Computational Statistics and Data Analysis 53, 1377–1387.

McDonald, J. and R. Butler (1987). Some generalized mixture distributions with an application to unemployment duration. The Review of Economics and Statistics 69, 232–240.

Meintanis, S. (1998). Moment-type estimation for positive stable laws with applications. IAENG International Journal of Applied Mathematics 38, 26–29.

Patriota, A. (2012). On scale-mixture Birnbaum-Saunders distributions. Journal of Statistical Planning and Inference 142, 2221–2226.

Shah, B. and P. Dave (1963). A note on log-logistic distribution. Journal of the M.S. University of Baroda (Science Number) 12, 15–20.

Singh, B., K. Sharma, S. Rathi, and G. Singh (2012). A generalized log-normal distribution and its goodness of fit to censored data. Computational Statistics 27, 51–67.

Spiegelhalter, D., N. Best, B. Carlin, and A. van der Linde (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, B 64, 583–640.

Tanner, M. and W. Wong (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82, 528–540.

Tsionas, E. G. (1999). Monte Carlo inference in econometric models with symmetric stable disturbances. Journal of Econometrics 88, 365–401.

Uppuluri, V. (1981). Some properties of log-Laplace distribution.
Statistical Distributions in Scientific Work 4, Taillie, C., Patil, G. P., Baldessari, B. A. (eds.), Reidel, 105–110.

Verdinelli, I. and L. Wasserman (1995). Computing Bayes factors by using a generalization of the Savage-Dickey density ratio. Journal of the American Statistical Association 90, 614–618.

Vianelli, S. (1983). The family of normal and lognormal distributions of order r. Metron 41, 3–10.

West, M. (1984). Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society, Series B (Methodological) 46, 431–439.