A General Sample Selection model with Skew-normal distribution

Emmanuel O. Ogundimu and Jane L. Hutton
Department of Statistics, University of Warwick, UK.
{E.O.Ogundimu, J.L.Hutton}@warwick.ac.uk
14/03/12

Abstract

Scores arising from questionnaires often follow asymmetric distributions on a fixed range. This can be due to scores clustering at one end of the scale or to selective reporting. Sometimes the scores are further subjected to sample selection, which makes them only partially observable. Methods based on complete cases for skew data are therefore inadequate for the analysis of such data, and a general sample selection model is required. Heckman (1976) proposed a full maximum likelihood estimation method under the normality assumption for sample selection problems, and parametric and non-parametric extensions have been proposed. We develop a general sample selection model with an underlying skew-normal distribution. A link is established between the continuous component of our model log-likelihood function and an extended version of a generalized skew-normal distribution. This link is used to derive the expected value of the model, which extends Heckman's two-stage model. Finite sample performance of the maximum likelihood estimator of the model is studied via Monte Carlo simulation. The model parameters are estimated more precisely under the new model than under the Heckman selection models, even in the presence of moderate to extreme skewness. The model is applied to data from a study of neck injuries where the responses are substantially skew. Computational and identification issues are discussed.

CRiSM Paper No. 12-05, www.warwick.ac.uk/go/crism

1 Introduction

For certain diseases, the patient's perception of his or her well-being may be the most important outcome of interest. When such patient-reported outcomes are measured, they are broadly termed quality of life (QoL) outcomes. Scores arising from instruments designed to assess QoL (e.g.
screening questionnaires) often follow asymmetric distributions because of the skewness inherent in Likert-scale type instruments. Indeed, skewness-related studies are not uncommon in the psychology literature owing to the use of such instruments. In addition, the realized samples from the underlying discrete process are further subjected to selective reporting and missing data, so that the scores reflect a selected population. Consequently, there is a need for a general model for sample selection with inherent skewness.

If a sample selection approach is taken to item nonresponse in questionnaires, the data are assumed to be missing not at random (MNAR). This assumption is more realistic than the missing at random (MAR) assumption. For instance, patients may refuse to answer sensitive questions (e.g. underlying health issues, drug addiction) on a questionnaire for reasons related to the underlying true values for those questions. In multivariate settings with arbitrary patterns of nonresponse, the MAR assumption is convenient computationally, but it is often implausible (Robins and Gill, 1997). In this setting, MAR means that a patient's probabilities of responding to items may depend only on his or her own set of observed items, which is an unreasonable assumption. When we suspect that nonresponse may depend on missing values, a proper analysis models jointly the population of complete data and the nonresponse process. Selection models are therefore a viable tool.

A selection model was introduced by Heckman (1976). He proposed full maximum likelihood estimation under the assumption of normality. His method was criticized for its sensitivity to the normality assumption, prompting him to develop the two-step estimator (Heckman, 1979). Sample selection models, also referred to as models with incidental (hidden) truncation, arise in practice as a result of the partial observability of the outcome of interest in a study.
The data are missing not at random (MNAR) because the observed data do not represent a random sample from the population, even after controlling for covariates. Although the model has its origin in economics, it has been applied extensively in other fields such as finance, sociology and political science, but sparingly in medical research. A prominent application to treatment allocation for patients, together with links to the skew-normal distribution, was discussed by Copas and Li (1997).

The two most common deviations from normality are heavier tails and skewness. In dealing with heavier tails in sample selection, Marchenko and Genton (2011) derived a model using links between hidden truncation and sample selection, but with an underlying bivariate-t error distribution. They noted that a more flexible parametric model, able to accommodate both deviations from normality simultaneously, needed to be considered. A skew-normal distribution could be a good candidate. A continuous random variable $Z$ is said to have a standard skew-normal distribution with parameter $\lambda \in \mathbb{R}$ if its density is

\[ f(z; \lambda) = 2\phi(z)\Phi(\lambda z), \quad z \in \mathbb{R}, \tag{1.1} \]

where $\phi$ and $\Phi$ denote the standard normal density and the corresponding distribution function respectively. The component $\lambda$ is called the shape parameter because it regulates the shape of the density function. Although the tail behaviour of the skew-normal distribution is similar to that of the normal, its lower tail becomes heavier as the truncation intensity increases. An added advantage of using the skew-normal distribution is that the model contains the normal one as a special case, so a comparison of the two models can be used to study the degree of deviation from normality.

The article is organized as follows. In section 2, we describe the Copas and Li (1997) model in relation to the general hidden truncation model formulation of Arnold and Beaver (2002).
Motivation for using the skew-normal as the underlying process when Likert-scale type questionnaires are used in medical research is discussed in section 3. In section 4, a new model is derived using the general formulation of skew distributions arising from selection and is linked with the hidden truncation formulation of the model. The finite sample performance of this model is studied. The model is applied to real-life data in section 5 and conclusions are given in section 6.

2 Copas and Li (1997) Selection Model

In this section, we formulate the Copas and Li (1997) model from the unified framework of skew distributions arising from selection (Arellano-Valle et al., 2006). A link with the hidden truncation formulation given by Arnold and Beaver (2002) is established. A general representation of various classes of skew-normal distributions using the closed skew-normal distribution is presented.

2.1 Copas and Li (1997) model: Skew distributions arising from selection approach

Let $Y^{\star}$ be the outcome variable of interest, assumed linearly related to covariates $x_i$ through the standard multiple regression

\[ Y_i^{\star} = \beta' x_i + \sigma\varepsilon_{1i}, \quad \varepsilon_{1i} \sim N(0, 1), \quad i = 1, \dots, N. \]

Suppose the main model is supplemented by a selection (missingness) equation

\[ S_i^{\star} = \gamma' x_i + \varepsilon_{2i}, \quad \varepsilon_{2i} \sim N(0, 1), \quad i = 1, \dots, N, \]

where $\beta$ and $\gamma$ are unknown parameters and $x_i$ are fixed observed characteristics not subject to missingness. Suppose further that

\[ \begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \end{pmatrix} \sim N_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). \]

Note that the variance of $S_i^{\star}$ is set to 1 because only its sign is observed, so the variance is not identifiable in the model. It is assumed that $Y_i^{\star}$ and $S_i^{\star}$ are correlated with parameter $\rho$ in the underlying process. The parameter $\rho \in [-1, 1]$ determines the severity of the selection process. Due to the selection, when $S_i^{\star} > 0$ (the 0 threshold is arbitrary since no symmetry is assumed), we observe $Y_i$, with $n$ observations out of $N$ from $Y_i^{\star}$, i.e. $s_i = I(S_i^{\star} > 0)$ and $Y_i = Y_i^{\star} s_i$.
Thus the observed data follow the conditional density

\[ f(y \mid x, S^{\star} > 0) = \frac{f(y, x, S^{\star} > 0)}{P(S^{\star} > 0)} = \frac{P(S^{\star} > 0 \mid y, x)\, f(y, x)}{P(S^{\star} > 0)}. \tag{2.2} \]

Equation (2.2) is the basis of the unification of selection problems as skew distributions given by Arellano-Valle et al. (2006). It is straightforward to see that

\[ f(y \mid x, S = 1) = \frac{\frac{1}{\sigma}\,\phi\!\left(\frac{y - \beta' x}{\sigma}\right) \Phi\!\left(\frac{\gamma' x + \rho\,\frac{y - \beta' x}{\sigma}}{\sqrt{1 - \rho^2}}\right)}{\Phi(\gamma' x)}, \tag{2.3} \]

where $\phi$ and $\Phi$ are as defined in (1.1).

Model (2.3) can be related to the Rubin (1976) definitions of missing data mechanisms. Data are missing completely at random (MCAR) when the probability of missing data on the response variable $Y$ is unrelated both to other measured covariates and to $Y$ itself. If the non-intercept terms in $\gamma$, as well as $\rho$, are 0 in (2.3), the data are MCAR. A complete case analysis without any adjustment using covariates then gives valid inference. Data are missing at random (MAR) when the probability of missing data for $Y$ is related to some other measured covariates in the analysis model but not to the values of $Y$. Thus, if $\rho = 0$ in (2.3) the data are MAR, and valid inference about the conditional distribution of $Y$ given $x$ can be made when adjustment for missing data is done using covariates on complete cases. The third missing data mechanism is the missing not at random (MNAR) mechanism. Data are MNAR when the probability of missing data on $Y$ is related to the values of $Y$ itself, even after adjusting for covariates. If $\rho \neq 0$ in (2.3) then the missing data are MNAR. In this case, the missing data process is said to be informative or non-ignorable. It is non-ignorable in that the missing data process needs to be accounted for in order to arrive at valid inference.

The complete density of a sample selection model comprises a continuous component (the conditional density given by (2.3)) and a discrete component given by $P(S)$.
The distribution of the discrete component determines the nature of the model to be fitted to the selection process. In Copas and Li (1997), the model $P(S = s) = \{\Phi(\gamma' x)\}^{s}\{1 - \Phi(\gamma' x)\}^{1-s}$ (i.e. a probit model) is used. The log-likelihood function is therefore

\[
\begin{aligned}
l(\beta, \sigma, \gamma, \rho) &= \sum_{i=1}^{N} s_i \ln f(y_i \mid x_i, S_i = 1) + \sum_{i=1}^{N} s_i \ln \Phi(\gamma' x_i) + \sum_{i=1}^{N} (1 - s_i) \ln \Phi(-\gamma' x_i) \\
&= \sum_{i=1}^{N} s_i \left\{ -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln \sigma^2 - \frac{(y_i - \beta' x_i)^2}{2\sigma^2} + \ln \Phi\!\left(\frac{\gamma' x_i + \rho\,\frac{y_i - \beta' x_i}{\sigma}}{\sqrt{1 - \rho^2}}\right) \right\} + \sum_{i=1}^{N} (1 - s_i) \ln \Phi(-\gamma' x_i).
\end{aligned}
\tag{2.4}
\]

2.2 Copas and Li (1997) model: Hidden truncation formulation

The continuous component of the sample selection density given by (2.3) has a link with the extended skew-normal distribution. Let $\mu = \beta' x$, $\lambda_0 = \gamma' x / \sqrt{1 - \rho^2} \in \mathbb{R}$ and $\lambda_1 = \rho / \sqrt{1 - \rho^2} \in \mathbb{R}$ in (2.3); the pdf can then be written in the usual extended skew-normal form:

\[ f(y; \mu, \sigma^2, \lambda_0, \lambda_1) = \frac{\phi\!\left(\frac{y - \mu}{\sigma}\right) \Phi\!\left(\lambda_0 + \lambda_1 \frac{y - \mu}{\sigma}\right)}{\sigma\, \Phi\!\left(\frac{\lambda_0}{\sqrt{1 + \lambda_1^2}}\right)}, \tag{2.5} \]

where $\lambda_0$ and $\lambda_1$ are shift and scale parameters respectively (see Capitanio et al. (2003)). The Azzalini skew-normal distribution is recovered when $\lambda_0 = 0$.

Equation (2.5), and hence equation (2.3), can be readily derived using the hidden truncation formulation of Arnold and Beaver (2002). The idea is as follows. Suppose $Y$ and $S$ are two independent random variables, not necessarily normal (the distributions of $Y$ and $S$ can be different). Assume $Y$ has density (distribution) function $\psi_1$ ($\Psi_1$) and $S$ has density (distribution) function $\psi_2$ ($\Psi_2$). The conditional density of $Y \mid \lambda_0 + \lambda_1 Y > S$ is

\[ f(y; \lambda_0, \lambda_1) = \frac{\psi_1(y)\, \Psi_2(\lambda_0 + \lambda_1 y)}{P(\lambda_0 + \lambda_1 Y > S)}. \tag{2.6} \]

In the case of the Copas and Li (1997) model, $Y$ and $S$ are normal. Thus, we write

\[ f(y; \lambda_0, \lambda_1) = \frac{\phi(y)\, \Phi(\lambda_0 + \lambda_1 y)}{\Phi\!\left(\frac{\lambda_0}{\sqrt{1 + \lambda_1^2}}\right)}. \]

By performing a location-scale transformation, this equation becomes equation (2.5).
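The reparametrization linking (2.3) and (2.5) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the parameter values are hypothetical and chosen purely for illustration. It evaluates both densities at a few points and integrates (2.5) with a composite Simpson rule to check that it is a proper density.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal: _N.pdf is phi, _N.cdf is Phi


def selection_density(y, bx, gx, sigma, rho):
    """Equation (2.3): density of Y given x and S = 1 (bx = beta'x, gx = gamma'x)."""
    z = (y - bx) / sigma
    return _N.pdf(z) / sigma * _N.cdf((gx + rho * z) / sqrt(1 - rho**2)) / _N.cdf(gx)


def esn_density(y, mu, sigma, lam0, lam1):
    """Equation (2.5): extended skew-normal density."""
    z = (y - mu) / sigma
    return _N.pdf(z) * _N.cdf(lam0 + lam1 * z) / (sigma * _N.cdf(lam0 / sqrt(1 + lam1**2)))


def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


# Hypothetical values, for illustration only.
bx, gx, sigma, rho = 1.0, 0.3, 2.0, 0.6
lam0 = gx / sqrt(1 - rho**2)   # shift parameter in (2.5)
lam1 = rho / sqrt(1 - rho**2)  # scale parameter in (2.5)

# Largest pointwise gap between (2.3) and (2.5) on a small grid.
max_gap = max(
    abs(selection_density(y, bx, gx, sigma, rho) - esn_density(y, bx, sigma, lam0, lam1))
    for y in [-5.0, -1.0, 0.0, 1.0, 2.5, 6.0]
)
# Total mass of (2.5) over a range wide enough to cover the tails.
area = simpson(lambda y: esn_density(y, bx, sigma, lam0, lam1), bx - 12 * sigma, bx + 12 * sigma)
print(max_gap, area)
```

The two densities should agree up to floating-point rounding, and the integral should be numerically indistinguishable from 1.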
The moment generating function of the pdf in (2.5) is

\[ M_Y(t) = \exp\!\left(t\mu + \frac{\sigma^2 t^2}{2}\right) \frac{\Phi\!\left(\frac{\lambda_0 + \sigma\lambda_1 t}{\sqrt{1 + \lambda_1^2}}\right)}{\Phi\!\left(\frac{\lambda_0}{\sqrt{1 + \lambda_1^2}}\right)}, \]

and the first moment is

\[ E(Y) = \mu + \sigma \frac{\lambda_1}{\sqrt{1 + \lambda_1^2}}\, \Lambda\!\left(\frac{\lambda_0}{\sqrt{1 + \lambda_1^2}}\right), \tag{2.7} \]

where $\Lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$ is the inverse Mills ratio. If we substitute $\mu = \beta' x$, $\lambda_0 = \gamma' x/\sqrt{1 - \rho^2} \in \mathbb{R}$ and $\lambda_1 = \rho/\sqrt{1 - \rho^2} \in \mathbb{R}$ in (2.7) we have

\[ E(Y \mid x, S^{\star} > 0) = \beta' x + \sigma\rho\,\Lambda(\gamma' x), \tag{2.8} \]

which is the expected value of the conditional distribution whose density is given by (2.3).

Equation (2.8) is the basis of Heckman's two-step procedure (Heckman, 1979). A standard probit model is fitted to the selection indicator $S$, giving an estimate $\hat{\gamma}$ of $\gamma$. This estimate is used to form $\Lambda(\hat{\gamma}' x)$ for each of the cases with $S = 1$. This quantity is then taken as an additional covariate in equation (2.8), which is fitted by least squares. The coefficient of the additional covariate gives an estimate of $\sigma\rho$.

The method given by equation (2.8) is more robust to departures from the normality assumption than the likelihood method given in (2.4). However, when the outcome and selection equations contain the same covariates, the method has been shown to perform poorly (Puhani, 2000). This is due to collinearity among the covariates, which the inverse Mills ratio $\Lambda(\cdot)$ cannot remove because $\Lambda(\cdot)$ is approximately linear over a wide range of its support. To circumvent this problem, the so-called exclusion restriction (i.e. at least one extra variable which predicts selection is included in the selection equation and excluded from the outcome equation) is used in practice.

A general sample selection model which has the Copas and Li (1997) model as a special case will be formulated in section 4. The first moment of this model will be shown to extend Heckman's two-step method, and it is expected to be less sensitive to collinearity among covariates.
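As an illustration of (2.8), the sketch below compares the closed-form conditional mean with a direct numerical integration of $y\,f(y \mid x, S = 1)$ under the density (2.3). It uses only the Python standard library; the helper names and the input values are hypothetical and serve only as an example.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def inverse_mills(u):
    """Lambda(u) = phi(u) / Phi(u)."""
    return _N.pdf(u) / _N.cdf(u)


def heckman_mean(bx, gx, sigma, rho):
    """Closed form (2.8): beta'x + sigma * rho * Lambda(gamma'x)."""
    return bx + sigma * rho * inverse_mills(gx)


def numeric_mean(bx, gx, sigma, rho, n=4000):
    """Integrate y * f(y | x, S = 1), with f from (2.3), by Simpson's rule."""
    def integrand(y):
        z = (y - bx) / sigma
        dens = _N.pdf(z) / sigma * _N.cdf((gx + rho * z) / sqrt(1 - rho**2)) / _N.cdf(gx)
        return y * dens

    a, b = bx - 12 * sigma, bx + 12 * sigma
    h = (b - a) / n
    total = integrand(a) + integrand(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(a + i * h)
    return total * h / 3


# Hypothetical inputs, for illustration only.
for gx in (-1.5, 0.0, 0.8):
    closed = heckman_mean(1.0, gx, 2.0, 0.6)
    numeric = numeric_mean(1.0, gx, 2.0, 0.6)
    print(f"gamma'x = {gx:+.1f}: closed form {closed:.6f}, numeric {numeric:.6f}")
```

In the two-step procedure itself, $\gamma' x$ would be replaced by the fitted probit linear predictor; the probit and least-squares fits are not shown here.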
The correction term equivalent to the inverse Mills ratio in this model is nonlinear over a wide range of its support because it comes from the distribution function of a skew-normal. We consider next a special tool that will simplify the formulation of this model and of many more general models in this category.

2.3 The closed skew-normal distribution

The CSN family is constructed in the multivariate framework because it is a generalization of the multivariate skew-normal distribution such that some important properties of the normal distribution are preserved. It is closed under:

• Marginalization
• Conditional distribution
• Linear transformations
• Sums of independent random variables from the CSN family
• Joint distribution of independent random variables in the CSN family.

We begin with a definition of the CSN distribution.

Definition 1: Consider $p \geq 1$, $q \geq 1$, $\mu \in \mathbb{R}^p$, $\nu \in \mathbb{R}^q$, $D$ an arbitrary $q \times p$ matrix, and $\Sigma$ and $\Delta$ positive definite matrices of dimensions $p \times p$ and $q \times q$, respectively. Then the probability density function (pdf) of the CSN distribution is given by

\[ f_{p,q}(y) = C\,\phi_p(y; \mu, \Sigma)\,\Phi_q(D(y - \mu); \nu, \Delta), \quad y \in \mathbb{R}^p, \tag{2.9} \]

with

\[ C^{-1} = \Phi_q(0; \nu, \Delta + D\Sigma D'), \tag{2.10} \]

where $\phi_p(\cdot; \eta, \Psi)$ and $\Phi_p(\cdot; \eta, \Psi)$ are the pdf and cdf of a $p$-dimensional normal distribution with mean $\eta \in \mathbb{R}^p$ and $p \times p$ covariance matrix $\Psi$. We write $Y \sim CSN_{p,q}(\mu, \Sigma, D, \nu, \Delta)$ if $y \in \mathbb{R}^p$ follows a CSN distribution with parameters $q, \mu, D, \Sigma, \nu, \Delta$. The special case $\nu = 0$ in (2.9) gives

\[ f_{p,q}(y) = 2^q\,\phi_p(y; \mu, \Sigma)\,\Phi_q(D(y - \mu); 0, \Delta). \tag{2.11} \]

It is straightforward to see that the pdf in (2.9) includes the normal distribution as a special case. The following properties of CSN distributions are required.

(1) Distribution function: Let $Y \sim CSN_{p,q}(\mu, \Sigma, D, \nu, \Delta)$. The distribution function of $Y$ is

\[ F_{p,q}(y) = C\,\Phi_{p+q}\left( \begin{pmatrix} y \\ 0 \end{pmatrix}; \begin{pmatrix} \mu \\ \nu \end{pmatrix}, \begin{pmatrix} \Sigma & \Sigma D' \\ D\Sigma & \Delta + D\Sigma D' \end{pmatrix} \right), \tag{2.12} \]

where $C$ is as defined in (2.10).
(2) Scalar multiplication: Let $Y \sim CSN_{p,q}(\mu, \Sigma, D, \nu, \Delta)$; then for any $c \in \mathbb{R}$

\[ cY \sim CSN_{p,q}(c\mu, c^2\Sigma, c^{-1}D, \nu, \Delta). \tag{2.13} \]

(3) Marginal density: Let $Y \sim CSN_{p,q}(\mu, \Sigma, D, \nu, \Delta)$ and partition $Y' = (Y_1', Y_2')$, where $Y_1$ is $k$-dimensional and $Y_2$ is $(p - k)$-dimensional. Then

\[ Y_1 \sim CSN_{k,q}(\mu_1, \Sigma_{11}, D^{\star}, \nu, \Delta^{\star}), \tag{2.14} \]

where $D^{\star} = D_1 + D_2\Sigma_{21}\Sigma_{11}^{-1}$, $\Delta^{\star} = \Delta + D_2\Sigma_{22.1}D_2'$, $\Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$, and $\mu_1, \Sigma_{11}, \Sigma_{22}, \Sigma_{12}, \Sigma_{21}$ come from the corresponding partitions of $\mu$ and $\Sigma$, and $D_1, D_2$ from the partition $D = (D_1, D_2)$, with $D_1$ of dimension $q \times k$ and $D_2$ of dimension $q \times (p - k)$.

(4) Conditional density: Let $Y \sim CSN_{p,q}(\mu, \Sigma, D, \nu, \Delta)$; then for two subvectors $Y_1$ and $Y_2$, where $Y' = (Y_1', Y_2')$, $Y_1$ is $k$-dimensional, $1 \leq k \leq p$, and $\mu, \Sigma, D$ are partitioned as above, the conditional distribution of $Y_2$ given $Y_1 = y_{10}$ is

\[ CSN_{p-k,q}\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(y_{10} - \mu_1),\ \Sigma_{22.1},\ D_2,\ \nu - D^{\star}(y_{10} - \mu_1),\ \Delta\right). \tag{2.15} \]

The properties of the CSN given above are sufficient for our model formulation in section 4. Further details of the CSN can be found in Gonzalez-Farias et al. (2004). We consider next some examples to motivate the use of a bivariate skew-normal underlying process, rather than the normal distribution, when dealing with outcomes from Likert-scale type questionnaires.

3 Skew-normal and Normal approximation to Discrete distributions

In surveys such as market and opinion polls, test and questionnaire data are often used, with scores measured on N units and P items. A common attribute of data realized from these surveys is that they are skewed. Several articles on the use of the sample Cronbach's alpha ($\hat{\alpha}$) to measure test reliability point to the need to take the skewness of the data into account (see for example Maydeu-Olivares et al. (2007) and Yuan et al. (2003)). In health-related Quality of Life (QoL) studies, questionnaires are often used and realized samples from them are skewed.
This can be due to the fact that the underlying populations from which the samples are drawn (e.g. Likert-scale type questionnaires) are discrete. Aside from the fact that the scale is skewed due to discreteness, there are situations in which some hidden truncation is already present in the underlying population from which the samples are drawn. Thus, the assumption of underlying normality is not realistic. To correct for possible misspecification of the distributional assumption in the parametric framework, semi-parametric methods are often used for sample selection models. However, in most clinical trial settings, the intercept of the regression model can be of interest for prediction purposes. This may render the use of such methods impractical.

We have shown in section 2.2 that the hidden truncation process almost always leads to skewness. In this section, we will show that the skew-normal distribution gives a better approximation to discrete distributions than the normal one. The use of the skew-normal distribution to approximate the binomial distribution is presented in section 3.1. Section 3.2 gives the approximation to the negative binomial distribution.

3.1 Skew-Normal Approximation to Binomial distribution

Chang et al. (2008) presented an improved approximation to the binomial distribution by the skew-normal distribution. Their aim was to obtain an approximation better than the normal one, especially when the binomial distribution is asymmetric with probability $p \neq 0.5$. For a given binomial $B(n, p)$ distribution (where $n$ is the number of trials and $p$ is the probability of success in each trial), the parameters of the approximating skew-normal $SN(\mu, \sigma^2, \lambda)$ distribution are determined by the method of moments. For example, suppose $X \sim B(n, p)$, and $Y \sim SN(\mu, \sigma^2, \lambda)$; the following moment matching can be used:

\[
\begin{aligned}
E(X) &= np = E(Y) = \mu + \sigma\sqrt{2/\pi}\,\frac{\lambda}{\sqrt{1 + \lambda^2}}, \\
E(X - E(X))^2 &= np(1 - p) = E(Y - E(Y))^2 = \sigma^2\left\{1 - \frac{2}{\pi}\,\frac{\lambda^2}{1 + \lambda^2}\right\}, \\
E(X - E(X))^3 &= np(p - 1)(2p - 1) = E(Y - E(Y))^3 = \sigma^3\sqrt{2/\pi}\left(\frac{\lambda}{\sqrt{1 + \lambda^2}}\right)^3\left(\frac{4}{\pi} - 1\right).
\end{aligned}
\tag{3.16}
\]

The simultaneous solution of equation (3.16) gives the desired values of $\lambda$, $\sigma$ and $\mu$. Further details on the solution, and the mild restrictions on $n$ and $p$ needed to make the skew-normal approximation work, can be found in Chang et al. (2008). To illustrate the use of the skew-normal approximation to the binomial distribution, suppose $X_1 \sim B(20, 0.25)$ and $X_2 \sim B(100, 0.25)$; then the matching skew-normal distributions are $Y_1 \sim SN(3.365, (2.534)^2, 1.374)$ and $Y_2 \sim SN(22.205, (5.154)^2, 0.927)$ respectively.

Figure 1: B(20,0.25) pmf with matching normal and skew-normal pdf.

Figure 2: B(100,0.25) pmf with matching normal and skew-normal pdf.

Figure 1 shows the plot of the matching normal and skew-normal pdfs for $B(20, 0.25)$. The plot clearly shows that the skew-normal gives a far better approximation than the usual normal approximation. With the larger sample size (100) presented in Figure 2, both the normal and skew-normal pdfs give a good approximation. However, the peak of the binomial pmf is better approximated by the skew-normal distribution.

3.2 Skew-Normal Approximation to Negative Binomial distribution

As with the binomial distribution, the normal approximation is usually used to approximate the negative binomial distribution. However, Lin et al.
(2010) showed by the method of moment matching (as was done in the binomial case) that the skew-normal distribution gives a better approximation than the normal distribution for the negative binomial distribution. Suppose $X \sim NB(r, p)$, with $q = 1 - p$, and $Y \sim SN(\mu, \sigma^2, \lambda)$. The matching distribution was found by equating the first three moments, and they obtained

\[
\begin{aligned}
\lambda &= \left\{ \left[ \sqrt{rq}\,\sqrt{2/\pi}\,\big((4/\pi) - 1\big)/(1 + q) \right]^{2/3} + (2/\pi) - 1 \right\}^{-1/2}, \\
\sigma &= \left(\sqrt{rq}/p\right) \Big/ \left\{ 1 - (2/\pi)\lambda^2/(1 + \lambda^2) \right\}^{1/2}, \\
\mu &= (rq/p) - \sigma\sqrt{2/\pi}\,\lambda/\sqrt{1 + \lambda^2}.
\end{aligned}
\tag{3.17}
\]

For example, suppose $X_1 \sim NB(20, 0.75)$ and $X_2 \sim NB(100, 0.75)$; then the matching skew-normal distributions are $Y_1 \sim SN(3.411, (4.415)^2, 2.422)$ and $Y_2 \sim SN(27.766, (8.686)^2, 1.349)$. The normal approximation under the small sample size (20) is very poor compared to the skew-normal approximation (see Figure 3). Although the normal approximation is good when the sample size is large (100), the peak of the negative binomial pmf is better approximated by the skew-normal, as it was for the binomial distribution.

In general, the plots clearly show that the skew-normal approximation, when applicable, is superior to the usual normal approximation for both small and large samples, for the binomial and negative binomial distributions alike. It should therefore be more efficient to approximate a discrete process by the skew-normal distribution rather than the normal. This further motivates the use of an underlying skew-normal distribution in the model we develop next.

Figure 3: NB(20,0.75) pmf with matching normal and skew-normal pdf.

Figure 4: NB(100,0.75) pmf with matching normal and skew-normal pdf.
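The moment matching in (3.16) and (3.17) can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions, not the procedure of Chang et al. (2008) or Lin et al. (2010): it solves the three moment equations in closed form from the mean, variance and skewness coefficient of the discrete distribution, and the function names are hypothetical. The negative binomial is taken in the parametrization with mean $rq/p$, as in (3.17).

```python
from math import copysign, pi, sqrt


def match_skew_normal(mean, var, skew):
    """Return (mu, sigma, lam) of the SN(mu, sigma^2, lam) whose mean, variance
    and skewness coefficient equal the supplied values."""
    c = (4 - pi) / 2
    max_skew = c * (2 / (pi - 2)) ** 1.5  # about 0.995, the skew-normal limit
    if abs(skew) >= max_skew:
        raise ValueError("skewness outside the range attainable by a skew-normal")
    r = (abs(skew) / c) ** (2 / 3)
    m2 = r / (1 + r)                      # equals (2/pi) * delta^2
    delta = copysign(sqrt(m2 * pi / 2), skew)
    lam = delta / sqrt(1 - delta**2)      # delta = lam / sqrt(1 + lam^2)
    sigma = sqrt(var / (1 - m2))
    mu = mean - sigma * delta * sqrt(2 / pi)
    return mu, sigma, lam


def match_binomial(n, p):
    """Skew-normal matched to B(n, p), as in (3.16)."""
    var = n * p * (1 - p)
    return match_skew_normal(n * p, var, (1 - 2 * p) / sqrt(var))


def match_negative_binomial(r, p):
    """Skew-normal matched to NB(r, p) with mean r*q/p, as in (3.17)."""
    q = 1 - p
    return match_skew_normal(r * q / p, r * q / p**2, (1 + q) / sqrt(r * q))


for label, params in [
    ("B(20, 0.25)", match_binomial(20, 0.25)),
    ("B(100, 0.25)", match_binomial(100, 0.25)),
    ("NB(20, 0.75)", match_negative_binomial(20, 0.75)),
    ("NB(100, 0.75)", match_negative_binomial(100, 0.75)),
]:
    mu, sigma, lam = params
    print(f"{label}: mu = {mu:.3f}, sigma = {sigma:.3f}, lambda = {lam:.3f}")
```

The `ValueError` branch reflects the fact that a skew-normal cannot match a skewness coefficient of about 0.995 or more in absolute value, which is one way to see why restrictions on the parameters of the discrete distribution are needed.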
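4 Skew-normal selection model (SNSM)

Suppose we relax the assumption of bivariate normality given in section 2.1 such that the underlying error distribution is bivariate skew-normal, i.e.

\[ \begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \end{pmatrix} \sim SN_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} \right), \]

where $\lambda_1$ and $\lambda_2$ are the skewness parameters for $Y_i^{\star}$ and $S_i^{\star}$ respectively. Then $f(y \mid x, S = 1)$ is still defined as in equation (2.2). Now, the pdf of the bivariate process can be written as

\[ f(y, s) = 2\,\phi_2\big((y, s); (\beta' x, \gamma' x), DRD\big)\, \Phi\big((\lambda_1, \lambda_2)\, D^{-1} ((y, s) - (\beta' x, \gamma' x))'\big), \]

where $\Sigma = DRD$, $D$ denotes the diagonal matrix which has the square roots of the diagonal entries of $\Sigma$ on its diagonal, and $R$ is the correlation matrix. This has a closed skew-normal (CSN) representation given as

\[ (y, s) \sim CSN_{2,1}\left( \mu = (\beta' x, \gamma' x),\ \Sigma = \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix},\ D = (\lambda_1/\sigma, \lambda_2),\ \nu = 0,\ \Delta = 1 \right). \tag{4.18} \]

To determine the expression $P(S^{\star} > 0 \mid y, x)$ in equation (2.2), we determine the distribution of $S \mid Y$ using the conditional distribution property given in equation (2.15). This gives

\[ S \mid Y \sim CSN_{1,1}\left( \gamma' x + \rho\,\frac{y - \beta' x}{\sigma},\ 1 - \rho^2,\ \lambda_2,\ -(\lambda_1 + \lambda_2)\frac{y - \beta' x}{\sigma},\ 1 \right). \]

Thus, $P(S^{\star} > 0 \mid Y) = 1 - P(S^{\star} < 0 \mid Y)$, which, using the scalar multiplication property given in equation (2.13), becomes

\[ CSN_{1,1}\left( \gamma' x + \rho\,\frac{y - \beta' x}{\sigma},\ 1 - \rho^2,\ -\lambda_2,\ -(\lambda_1 + \lambda_2)\frac{y - \beta' x}{\sigma},\ 1 \right). \]

This belongs to the ESN family with shift and scale parameters $-(\lambda_1 + \lambda_2)\frac{y - \beta' x}{\sigma}$ and $-\lambda_2/\sqrt{1 - \rho^2}$ respectively. We denote this cdf as

\[ \Phi_{ESN}\left( \gamma' x + \rho\,\frac{y - \beta' x}{\sigma};\ 0,\ 1 - \rho^2,\ \frac{-\lambda_2}{\sqrt{1 - \rho^2}},\ -(\lambda_1 + \lambda_2)\frac{y - \beta' x}{\sigma} \right). \tag{4.19} \]

To determine the expression $P(S^{\star} > 0)$ in equation (2.2) we need to extract its marginal distribution from the bivariate process. Using the marginalization property of the CSN (see equation (2.14)) one can easily write $P(S^{\star} < 0)$ as

\[ P(S^{\star} < 0) = CSN_{1,1}\left( \gamma' x,\ 1,\ \lambda_2 + \lambda_1\rho,\ 0,\ 1 + \lambda_1^2 - \lambda_1^2\rho^2 \right), \]

which turns out to be a skew-normal distribution. Since $P(S^{\star} > 0) = 1 - P(S^{\star} < 0)$ we write

\[ P(S^{\star} > 0) = \Phi_{SN}\left( \gamma' x;\ 0,\ 1,\ \frac{-(\lambda_2 + \lambda_1\rho)}{\sqrt{1 + \lambda_1^2 - \lambda_1^2\rho^2}} \right), \tag{4.20} \]

where $\Phi_{SN}$ denotes the cdf of a skew-normal random variable. By noting that $Y \sim SN(\beta' x, \sigma^2, \lambda_1)$ before the selection process and substituting (4.19) and (4.20) into the general sample selection equation (2.2) we have

\[ f(y \mid x, S = 1) = \frac{\frac{2}{\sigma}\,\phi\!\left(\frac{y - \beta' x}{\sigma}\right) \Phi\!\left(\frac{\lambda_1 (y - \beta' x)}{\sigma}\right) \Phi_{ESN}\!\left( \gamma' x + \rho\,\frac{y - \beta' x}{\sigma};\ 0,\ 1 - \rho^2,\ \frac{-\lambda_2}{\sqrt{1 - \rho^2}},\ -(\lambda_1 + \lambda_2)\frac{y - \beta' x}{\sigma} \right)}{\Phi_{SN}\!\left( \gamma' x;\ 0,\ 1,\ \frac{-(\lambda_2 + \lambda_1\rho)}{\sqrt{1 + \lambda_1^2 - \lambda_1^2\rho^2}} \right)}. \tag{4.21} \]

If $\lambda_1$ and $\lambda_2$ are set equal to zero in (4.21), the Copas and Li (1997) model given by (2.3) is recovered. From now on, we shall restrict attention to a special case of the model given in (4.21). Suppose only $\lambda_2$ is set equal to zero; we then get a simpler model:

\[ f(y \mid x, S = 1) = \frac{\frac{2}{\sigma}\,\phi\!\left(\frac{y - \beta' x}{\sigma}\right) \Phi\!\left(\frac{\lambda_1 (y - \beta' x)}{\sigma}\right) \Phi\!\left(\frac{\gamma' x + \rho\,\frac{y - \beta' x}{\sigma}}{\sqrt{1 - \rho^2}}\right)}{\Phi_{SN}\!\left( \gamma' x;\ 0,\ 1,\ \frac{-\lambda_1\rho}{\sqrt{1 + \lambda_1^2 - \lambda_1^2\rho^2}} \right)}. \tag{4.22} \]

This situation is possible in practice where the underlying mechanism governing selection is not skewed before entering the joint process.

Equation (4.22) describes another class of skew-normal distributions. To see this, suppose a substitution similar to that of section 2.2 is used, i.e. if we put $\mu = \beta' x$, $\lambda_0 = \gamma' x/\sqrt{1 - \rho^2} \in \mathbb{R}$ and $\lambda = \rho/\sqrt{1 - \rho^2} \in \mathbb{R}$ in (4.22), we obtain

\[ f(y; \mu, \sigma^2, \lambda_0, \lambda_1, \lambda) = \frac{\frac{2}{\sigma}\,\phi\!\left(\frac{y - \mu}{\sigma}\right) \Phi\!\left(\frac{\lambda_1 (y - \mu)}{\sigma}\right) \Phi\!\left(\lambda_0 + \lambda\,\frac{y - \mu}{\sigma}\right)}{\Phi_{SN}\!\left( \frac{\lambda_0}{\sqrt{1 + \lambda^2}};\ 0,\ 1,\ \frac{-\lambda_1\lambda}{\sqrt{1 + \lambda_1^2 + \lambda^2}} \right)}. \tag{4.23} \]

If $\lambda_0 = 0$ in (4.23) we have

\[ f(y; \mu, \sigma^2, \lambda_1, \lambda) = \frac{\frac{2}{\sigma}\,\phi\!\left(\frac{y - \mu}{\sigma}\right) \Phi\!\left(\frac{\lambda_1 (y - \mu)}{\sigma}\right) \Phi\!\left(\lambda\,\frac{y - \mu}{\sigma}\right)}{\Phi_{SN}\!\left( 0;\ 0,\ 1,\ \frac{-\lambda_1\lambda}{\sqrt{1 + \lambda_1^2 + \lambda^2}} \right)}. \tag{4.24} \]

Comparing equation (4.24) with equation 12 given in Jamalizadeh et al. (2008), and noting that

\[ \frac{\pi}{\cos^{-1}\!\left( \frac{-\lambda_1\lambda}{\sqrt{1 + \lambda_1^2}\sqrt{1 + \lambda^2}} \right)} = \frac{1}{\Phi_{SN}\!\left( 0;\ 0,\ 1,\ \frac{-\lambda_1\lambda}{\sqrt{1 + \lambda_1^2 + \lambda^2}} \right)}, \tag{4.25} \]

shows that the two models are the same. The L.H.S. in (4.25) was evaluated using the orthant probability expression.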
In this case, the orthant probability is of the form $P(Y_1 > 0, Y_2 > 0)$ with $(Y_1, Y_2) \sim N_2(0, \Sigma)$, where $\Sigma$ is a $2 \times 2$ matrix with diagonal elements 1 and off-diagonal element $\rho_{12}$ (see a more general expression in Kotz et al. (2000)). However, the R.H.S., although it requires evaluation of a two-dimensional integral, is a more general expression when the centered orthant probability rule is not applicable, and it is readily available in public statistical software (e.g. the 'psn' function in Azzalini's skew-normal package).

Indeed, when $\mu = 0$ and $\sigma = 1$ in equation (4.23), the distribution can be referred to as an extended two-parameter generalized skew-normal distribution, denoted $GSN(\lambda_0, \lambda_1, \lambda)$, since it extends the two-parameter generalized skew-normal distribution discussed in Jamalizadeh et al. (2008) with an extra parameter $\lambda_0$.

In general, equation (4.22) could be derived from the general hidden truncation formulation given by equation (2.6) with an appropriate re-parametrization. Whichever route is taken, the use of the CSN distribution is essential. If the skew distribution arising from selection approach (Arellano-Valle et al., 2006) is followed, then the required properties of CSN distributions are the conditional distribution, marginal distribution and scalar multiplication properties. The hidden truncation formulation of Arnold and Beaver (2002), on the other hand, makes use of the scalar multiplication and additive properties of the CSN distribution. Thus, the link between skew distributions arising from selection and their counterpart through the hidden truncation formulation is not limited to the elliptical distributions. It is readily extended to skew-elliptical distributions, even though the correlation coefficient may not be adequate in capturing association in the underlying process because of the non-elliptical contours in this case.
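The identity (4.25) can be checked numerically. The sketch below is an illustration only: instead of a library routine such as `psn`, it evaluates $\Phi_{SN}$ by Simpson integration of the density (1.1) with the Python standard library, and the function names and the $(\lambda_1, \lambda)$ pairs are hypothetical.

```python
from math import acos, pi, sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def sn_cdf(x, alpha, lower=-12.0, n=4000):
    """Phi_SN(x; 0, 1, alpha): Simpson integration of 2*phi(z)*Phi(alpha*z) on [lower, x]."""
    if x <= lower:
        return 0.0

    def f(z):
        return 2.0 * _N.pdf(z) * _N.cdf(alpha * z)

    h = (x - lower) / n
    total = f(lower) + f(x)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lower + i * h)
    return total * h / 3


def lhs(lam1, lam):
    """Left-hand side of (4.25), from the orthant probability."""
    return pi / acos(-lam1 * lam / (sqrt(1 + lam1**2) * sqrt(1 + lam**2)))


def rhs(lam1, lam):
    """Right-hand side of (4.25), from the skew-normal cdf at zero."""
    return 1.0 / sn_cdf(0.0, -lam1 * lam / sqrt(1 + lam1**2 + lam**2))


# Hypothetical parameter pairs, for illustration only.
pairs = [(0.0, 1.0), (1.0, 0.5), (2.0, -1.5), (5.0, 3.0)]
gaps = [abs(lhs(a, b) - rhs(a, b)) for a, b in pairs]
for (a, b), gap in zip(pairs, gaps):
    print(f"lambda1 = {a}, lambda = {b}: |LHS - RHS| = {gap:.2e}")
```

The integration starts at $-12$ because the standard normal mass below that point is negligible at double precision; that cut-off is a choice made for this sketch.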
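4.1 Moment estimator of Skew-normal selection model

Now that we have established that the link between skew-normal distributions arising from selection and the hidden truncation formulation of the model also applies in the skew-symmetric family, the conditional expectation of the model in (4.22) can be readily derived. Let $Z_{\lambda_0,\lambda_1,\lambda} \sim GSN(\lambda_0, \lambda_1, \lambda)$ have density

\[ k(\lambda_0, \lambda_1, \lambda)\,\phi(z)\,\Phi(\lambda_1 z)\,\Phi(\lambda_0 + \lambda z), \quad z \in \mathbb{R}, \tag{4.26} \]

where

\[ k(\lambda_0, \lambda_1, \lambda) = \frac{2}{\Phi_{SN}\!\left( \frac{\lambda_0}{\sqrt{1 + \lambda^2}};\ 0,\ 1,\ \frac{-\lambda_1\lambda}{\sqrt{1 + \lambda_1^2 + \lambda^2}} \right)}. \]

Theorem 1: If $M(t; \lambda_0, \lambda_1, \lambda)$ is the moment generating function of $Z_{\lambda_0,\lambda_1,\lambda} \sim GSN(\lambda_0, \lambda_1, \lambda)$, then

\[ M(t; \lambda_0, \lambda_1, \lambda) = k(\lambda_0, \lambda_1, \lambda)\, e^{t^2/2}\, \Phi_2\!\left( \frac{\lambda_1 t}{\sqrt{1 + \lambda_1^2}},\ \frac{\lambda_0 + \lambda t}{\sqrt{1 + \lambda^2}};\ \frac{\lambda_1\lambda}{\sqrt{1 + \lambda_1^2}\sqrt{1 + \lambda^2}} \right), \tag{4.27} \]

where $k(\lambda_0, \lambda_1, \lambda)$ is as given in (4.26) and $\Phi_2(\cdot, \cdot; \rho)$ denotes the cdf of $N_2(0, 0, 1, 1, \rho)$.

The derivation of equation (4.27) can be found in the Appendix. The moments of $Z_{\lambda_0,\lambda_1,\lambda}$ can be derived from (4.27). In particular, the first moment, after some algebra, is

\[
E(Z_{\lambda_0,\lambda_1,\lambda}) = \frac{2}{\Phi_{SN}\!\left( \frac{\lambda_0}{\sqrt{1 + \lambda^2}};\ 0,\ 1,\ \frac{-\lambda_1\lambda}{\sqrt{1 + \lambda_1^2 + \lambda^2}} \right)} \left\{ \frac{1}{\sqrt{2\pi}}\,\frac{\lambda_1}{\sqrt{1 + \lambda_1^2}}\, \Phi\!\left( \frac{\lambda_0\sqrt{1 + \lambda_1^2}}{\sqrt{1 + \lambda_1^2 + \lambda^2}} \right) + \frac{\lambda}{\sqrt{1 + \lambda^2}}\, \phi\!\left( \frac{\lambda_0}{\sqrt{1 + \lambda^2}} \right) \Phi\!\left( \frac{-\lambda_0\lambda_1\lambda}{\sqrt{1 + \lambda^2}\sqrt{1 + \lambda_1^2 + \lambda^2}} \right) \right\}.
\tag{4.28}
\]

If $Y = \mu + \sigma Z_{\lambda_0,\lambda_1,\lambda}$, then $E(Y) = \mu + \sigma E(Z_{\lambda_0,\lambda_1,\lambda})$. Using the link between equations (2.7) and (2.8) in section 2.2 and noting the regression parametrization, the conditional mean of the model given by equation (4.22) can be written as

\[
E(Y \mid x, S^{\star} > 0) = \beta' x + \sigma\, \frac{2}{\Phi_{SN}\!\left( \gamma' x;\ 0,\ 1,\ \frac{-\lambda\rho}{\sqrt{1 + \lambda^2 - \lambda^2\rho^2}} \right)} \left\{ \frac{1}{\sqrt{2\pi}}\,\frac{\lambda}{\sqrt{1 + \lambda^2}}\, \Phi\!\left( \frac{\gamma' x\sqrt{1 + \lambda^2}}{\sqrt{1 + \lambda^2 - \lambda^2\rho^2}} \right) + \rho\,\phi(\gamma' x)\, \Phi\!\left( \frac{-\gamma' x\,\lambda\rho}{\sqrt{1 + \lambda^2 - \lambda^2\rho^2}} \right) \right\},
\tag{4.29}
\]

where from now on we take $\lambda_1 = \lambda$. When $\lambda = 0$ in equation (4.29), we have the Heckman two-step model given in equation (2.8).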
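The correction term in (4.29) can be evaluated and checked as follows. This is a minimal sketch with the Python standard library, assuming $\beta' x = 0$ and $\sigma = 1$ so that only the correction factor remains; the function names and test values are hypothetical. The closed form is compared with a direct numerical integration of $z$ times the density (4.22), and with the inverse Mills ratio correction when $\lambda = 0$.

```python
from math import pi, sqrt
from statistics import NormalDist

_N = NormalDist()
phi, Phi = _N.pdf, _N.cdf  # standard normal density and cdf


def simpson(f, a, b, n=4000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def sn_cdf(x, alpha):
    """Phi_SN(x; 0, 1, alpha) by quadrature of the density (1.1); assumes x > -12."""
    return simpson(lambda z: 2.0 * phi(z) * Phi(alpha * z), -12.0, x)


def snsm_correction(gx, rho, lam):
    """Second term of (4.29) divided by sigma, as a function of gx = gamma'x."""
    d = sqrt(1 + lam**2 - lam**2 * rho**2)
    k = 2.0 / sn_cdf(gx, -lam * rho / d)
    first = lam / sqrt(2 * pi * (1 + lam**2)) * Phi(gx * sqrt(1 + lam**2) / d)
    second = rho * phi(gx) * Phi(-gx * lam * rho / d)
    return k * (first + second)


def numeric_correction(gx, rho, lam):
    """Mean of the density (4.22) with beta'x = 0 and sigma = 1, by quadrature."""
    d = sqrt(1 + lam**2 - lam**2 * rho**2)
    norm = sn_cdf(gx, -lam * rho / d)

    def dens(z):
        return 2.0 * phi(z) * Phi(lam * z) * Phi((gx + rho * z) / sqrt(1 - rho**2)) / norm

    return simpson(lambda z: z * dens(z), -12.0, 12.0)


# Hypothetical values, for illustration only.
for lam in (0.0, 1.0, 2.0, 5.0):
    closed = snsm_correction(1.0, 0.5, lam)
    numeric = numeric_correction(1.0, 0.5, lam)
    print(f"lambda = {lam}: closed form {closed:.6f}, numeric {numeric:.6f}")
```

Evaluating `snsm_correction` over a grid of `gx` values for several `lam` gives curves of the kind plotted against $\gamma' x$ in the paper's correction-factor figures.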
To visualize the impact of using the selection-normal model when the correct model is the one given by equation (4.29), we plot the second component of the expectation as a function of $\gamma'x$, the mean of the selection variable. We take $\rho = 0.5$ and $0.9$ for $\lambda = 0, 1, 2$ and $5$. It should be noted that $\lambda = 0$ corresponds to the inverse Mills ratio correction for (2.8). The standard deviation, $\sigma$, simply scales the correction factor and $\rho$ is the correlation between the outcome and the selection process.

Figure 5: Plot of correction factor for different values of skewness parameter with λ = 0 corresponding to the normal case. (Panel ρ = 0.5; vertical axis E(Y|x, S > 0) − β'x, horizontal axis γ'x; curves for λ = 0, 1, 2, 5.)

Figure 6: Plot of correction factor for different values of skewness parameter with λ = 0 corresponding to the normal case. (Panel ρ = 0.9; vertical axis E(Y|x, S > 0) − β'x, horizontal axis γ'x; curves for λ = 0, 1, 2, 5.)

It can be seen from Figure 5 ($\rho = 0.5$) that for positive values of the selection linear predictor $\gamma'x$, the conditional expectation will be underestimated under the usual selection-normal model. This underestimation increases as the skewness increases. For negative values of $\gamma'x$, however, the underestimation of the conditional expectation by the selection-normal model relative to the selection skew-normal model decreases, and the difference dies out as $\gamma'x$ becomes more negative and missingness increases. The same is true for $\rho = 0.9$, as the figures are similar (see Figure 6).

Sometimes, the marginal effect of the covariates ($x_i$) on the outcome $Y_i$ in the observed sample may be of interest. For the Heckman two-step model, the effect consists of two components: the direct effect of the covariates on the mean of $Y_i$, which is captured by $\beta$, and the indirect effect of the covariates in the selection equation. For the Heckman two-step model (equation (2.8)), the marginal effect is given by

\[
\frac{\partial E(Y \mid x, S^\star > 0)}{\partial x_i} = \beta_i - \rho\sigma\gamma_i\left[\gamma'x\,\frac{\phi(\gamma'x)}{\Phi(\gamma'x)} + \left\{\frac{\phi(\gamma'x)}{\Phi(\gamma'x)}\right\}^2\right]. \tag{4.30}
\]

Using a similar argument, the marginal effect corresponding to equation (4.29) can be written as

\[
\begin{aligned}
\frac{\partial E(Y \mid x, S^\star > 0)}{\partial x_i} = \beta_i - \sigma\gamma_i \Bigg[\,
& \frac{2\rho\,(\gamma'x)\,\phi(\gamma'x)\,\Phi(-\gamma'x\lambda\rho/c)}{\Phi_{SN}^\star}
+ \rho\left\{\frac{2\,\phi(\gamma'x)\,\Phi(-\gamma'x\lambda\rho/c)}{\Phi_{SN}^\star}\right\}^2 \\
& - \frac{2}{\sqrt{2\pi}}\,\frac{\lambda}{c}\,\frac{\phi\!\left(\gamma'x\sqrt{1+\lambda^2}/c\right)}{\Phi_{SN}^\star}
+ \frac{2\lambda\rho^2}{c}\,\frac{\phi(\gamma'x)\,\phi(-\gamma'x\lambda\rho/c)}{\Phi_{SN}^\star} \\
& + \frac{1}{\sqrt{2\pi}}\,\frac{\lambda}{\sqrt{1+\lambda^2}}\,\Phi\!\left(\frac{\gamma'x\sqrt{1+\lambda^2}}{c}\right)\frac{4\,\phi(\gamma'x)\,\Phi(-\gamma'x\lambda\rho/c)}{(\Phi_{SN}^\star)^2}
\Bigg],
\end{aligned} \tag{4.31}
\]

where, for brevity, $c = \sqrt{1+\lambda^2-\lambda^2\rho^2}$ and $\Phi_{SN}^\star = \Phi_{SN}(\gamma'x;\, 0,\, 1,\, -\lambda\rho/c)$. Equation (4.31) reduces to equation (4.30) when $\lambda = 0$.

From Figures 7 and 8, the conditional marginal effect of covariates $x_i$ on the outcome $Y$ will be underestimated by the selection-normal model for values of $\gamma'x$ between (roughly) $-4$ and $4$. When $|\gamma'x|$ exceeds 4, this effect dies out since the correction factor becomes zero for all the values of $\lambda$ (including $\lambda = 0$).

Figure 7: Plot of marginal effect for different values of skewness parameter with λ = 0 corresponding to the normal case. (Panel ρ = 0.5; vertical axis ∂E(Y|x, S > 0)/∂x − β', horizontal axis γ'x; curves for λ = 0, 1, 2, 5.)

Figure 8: Plot of marginal effect for different values of skewness parameter with λ = 0 corresponding to the normal case. (Panel ρ = 0.9; vertical axis ∂E(Y|x, S > 0)/∂x − β', horizontal axis γ'x; curves for λ = 0, 1, 2, 5.)

4.2 Maximum likelihood estimation

The complete density of the selection skew-normal model, like that of the selection-normal model, comprises a continuous component given by (4.22) and a discrete component for $P(S)$. As stated earlier, the distribution of the selection process determines the nature of the model to be fitted for the binary variable, which in this case is given by

\[
P(S = s) = \{\Phi_{SN}(\gamma'x;\, 0,\, 1,\, \lambda^\star)\}^{s}\,\{1 - \Phi_{SN}(\gamma'x;\, 0,\, 1,\, \lambda^\star)\}^{1-s},
\]

where $\lambda^\star = \dfrac{-\lambda\rho}{\sqrt{1+\lambda^2-\lambda^2\rho^2}}$.
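The selection probability above can be checked numerically. The following Python sketch is illustrative only; the names `lambda_star`, `selection_prob` and `selection_loglik` are hypothetical, and the skew-normal cdf is again obtained from Owen's T function by a Simpson rule, which is an assumption about implementation rather than something taken from the paper.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cdf."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def owens_t(h: float, a: float, n: int = 400) -> float:
    """Owen's T(h, a) by composite Simpson's rule on [0, a]; n must be even."""
    if a == 0.0:
        return 0.0
    step = a / n
    total = 0.0
    for j in range(n + 1):
        t = j * step
        weight = 1.0 if j in (0, n) else (4.0 if j % 2 else 2.0)
        total += weight * math.exp(-0.5 * h * h * (1.0 + t * t)) / (1.0 + t * t)
    return total * step / (3.0 * 2.0 * math.pi)


def sn_cdf(x: float, shape: float) -> float:
    """Standard skew-normal cdf Phi_SN(x; 0, 1, shape)."""
    return norm_cdf(x) - 2.0 * owens_t(x, shape)


def lambda_star(rho: float, lam: float) -> float:
    """Shape parameter of the skew link implied by rho and lambda."""
    return -lam * rho / math.sqrt(1.0 + lam * lam - lam * lam * rho * rho)


def selection_prob(u: float, rho: float, lam: float) -> float:
    """P(S = 1 | x) = Phi_SN(gamma'x; 0, 1, lambda_star), with u = gamma'x."""
    return sn_cdf(u, lambda_star(rho, lam))


def selection_loglik(s: int, u: float, rho: float, lam: float) -> float:
    """Log-likelihood contribution of one selection indicator s in {0, 1}."""
    shape = lambda_star(rho, lam)
    # 1 - Phi_SN(u; shape) equals Phi_SN(-u; -shape), so no subtraction from 1 is needed.
    prob = sn_cdf(u, shape) if s == 1 else sn_cdf(-u, -shape)
    return math.log(prob)
```

For λ = 0 the shape parameter is zero and `selection_prob` returns the ordinary probit probability Φ(γ'x).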
The selection component is therefore a probit model with a skew link. The log-likelihood function is

\[
\begin{aligned}
l(\beta,\sigma,\gamma,\rho,\lambda)
&= \sum_{i=1}^{n} s_i \ln f(y_i \mid x_i, S_i = 1) + \sum_{i=1}^{n} s_i \ln \Phi_{SN}(\gamma'x_i;\, 0,\, 1,\, \lambda^\star) + \sum_{i=1}^{n} (1-s_i) \ln \Phi_{SN}(-\gamma'x_i;\, 0,\, 1,\, -\lambda^\star) \\
&= \sum_{i=1}^{n} s_i \left\{ \ln 2 - \tfrac{1}{2}\ln 2\pi - \tfrac{1}{2}\ln\sigma^2 - \frac{(y_i-\beta'x_i)^2}{2\sigma^2} + \ln\Phi\!\left(\lambda\,\frac{y_i-\beta'x_i}{\sigma}\right) + \ln\Phi\!\left(\frac{\gamma'x_i + \rho\,\dfrac{y_i-\beta'x_i}{\sigma}}{\sqrt{1-\rho^2}}\right) \right\} \\
&\quad + \sum_{i=1}^{n} (1-s_i) \ln \Phi_{SN}(-\gamma'x_i;\, 0,\, 1,\, -\lambda^\star).
\end{aligned} \tag{4.32}
\]

4.3 Monte Carlo Simulation

In this section we study the finite-sample properties of our selection skew-normal model (SNSM). We compare its performance with the selection-normal model (SNM; Copas and Li, 1997) and Heckman's two-step method (TS), in a design similar to that of Marchenko and Genton (2011). The outcome equation is $Y_i^\star = 0.5 + 1.5x_i + \varepsilon_{1i}$, where $x_i \overset{iid}{\sim} N(0,1)$ and $i = 1, \ldots, N = 1000$. Two types of selection equations are considered: $S_i^\star = 1 + x_i + 1.5w_i + \varepsilon_{2i}$, with exclusion restriction and $w_i \overset{iid}{\sim} N(0,1)$, and $S_i^\star = 1 + x_i + \varepsilon_{2i}$, without the exclusion restriction. Hence, $\beta' = (0.5, 1.5)$, and $\gamma' = (1, 1, 1.5)$ and $(1, 1)$ for selection with and without the exclusion restriction respectively. The covariates $x_i$ and $w_i$ are independent of each other and of the error terms $\varepsilon_{1i}$ and $\varepsilon_{2i}$. The error terms are generated from a bivariate skew-normal distribution with $\lambda = 0, 0.5, 1, 2$ and $5$. The covariance matrix is

\[
\Sigma = \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix},
\]

where $\sigma = 1$ and the correlation $\rho = 0.5$. This simulation scenario implies that only $Y_i^\star$ is skew, which is in line with our model. It should be noted that the case $\lambda = 0$ corresponds to an underlying bivariate normal process. We only observe values $Y_i^\star$ when $S_i^\star > 0$. The degree of censoring is about 30% with the exclusion restriction, and about 20% in its absence. Simulation results are based on 1000 replications.

The results of the simulation in the presence of exclusion restriction are presented in Table 1.
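The simulation design and the log-likelihood (4.32) can be sketched together in Python. This is a minimal sketch under stated assumptions, not the authors' code: the paper does not say how the bivariate skew-normal errors were drawn, so the sketch assumes the stochastic representation Z = δ|U0| + √(1 − δ²)U1 with δ = λ/√(1 + λ²) for the standardized outcome error and ε2 = ρZ + √(1 − ρ²)U2 for the selection error, which is consistent with the continuous and discrete components of (4.32). The function names `simulate` and `loglik` are hypothetical.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cdf (erfc keeps the lower tail strictly positive)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def owens_t(h, a, n=200):
    """Owen's T(h, a) by composite Simpson's rule on [0, a]; n must be even."""
    if a == 0.0:
        return 0.0
    step = a / n
    total = 0.0
    for j in range(n + 1):
        t = j * step
        weight = 1.0 if j in (0, n) else (4.0 if j % 2 else 2.0)
        total += weight * math.exp(-0.5 * h * h * (1.0 + t * t)) / (1.0 + t * t)
    return total * step / (3.0 * 2.0 * math.pi)


def sn_cdf(x, shape):
    """Standard skew-normal cdf Phi_SN(x; 0, 1, shape)."""
    return norm_cdf(x) - 2.0 * owens_t(x, shape)


def simulate(n, lam, rho=0.5, sigma=1.0, exclusion=True, seed=0):
    """Draw rows (x, w, y, s); y is None when the outcome is not observed."""
    rng = random.Random(seed)
    delta = lam / math.sqrt(1.0 + lam * lam)
    rows = []
    for _ in range(n):
        x, w = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Assumed construction of a skew-normal draw with shape lam.
        z = delta * abs(rng.gauss(0.0, 1.0)) + math.sqrt(1.0 - delta * delta) * rng.gauss(0.0, 1.0)
        eps2 = rho * z + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        s_star = 1.0 + x + (1.5 * w if exclusion else 0.0) + eps2
        y_star = 0.5 + 1.5 * x + sigma * z
        rows.append((x, w, y_star, 1) if s_star > 0 else (x, w, None, 0))
    return rows


def loglik(rows, beta, gamma, sigma, rho, lam):
    """Log-likelihood (4.32); gamma = (g0, g1, g2), with g2 = 0 when w is excluded."""
    c = math.sqrt(1.0 + lam * lam - lam * lam * rho * rho)
    lam_star = -lam * rho / c
    total = 0.0
    for x, w, y, s in rows:
        u = gamma[0] + gamma[1] * x + gamma[2] * w
        if s == 1:
            z = (y - beta[0] - beta[1] * x) / sigma
            total += (math.log(2.0) - 0.5 * math.log(2.0 * math.pi) - math.log(sigma)
                      - 0.5 * z * z + math.log(norm_cdf(lam * z))
                      + math.log(norm_cdf((u + rho * z) / math.sqrt(1.0 - rho * rho))))
        else:
            # The floor guards against log(0) from rounding in the far tail.
            total += math.log(max(sn_cdf(-u, -lam_star), 1e-300))
    return total


if __name__ == "__main__":
    data = simulate(1000, lam=1.0, seed=12)
    print(loglik(data, (0.5, 1.5), (1.0, 1.0, 1.5), 1.0, 0.5, 1.0))
```

In practice `loglik` would be passed (with a sign change) to a numerical optimizer; under this construction Σ acts as a scale matrix, since the variance of a skew-normal variate is smaller than its squared scale when λ ≠ 0.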
Even under normality (i.e. λ = 0), the performance of SNSM is comparable to that of SNM and TS. For instance, SNM and TS showed slightly less bias than SNSM in estimating the intercept of the outcome equation. However, this advantage is counterbalanced in the intercept of the selection equation, where SNSM has less bias than SNM and TS. In terms of MSE, SNM and TS are more efficient. Other parameters are comparable across the three models. In effect, SNM and TS show no clear overall advantage even when the underlying process is normal. As the degree of skewness increases, SNSM estimates the intercepts of the selection and outcome equations more precisely, whereas SNM and TS get worse. When λ = 5 (which is almost a folded normal), SNM and TS break down. SNSM still performs well, but at the cost of non-convergence for some of the samples (in this case, 828 out of 1000 samples converged).

The results of the simulation in the absence of exclusion restriction are presented in Table 2. When the underlying process is normal, the intercept of the outcome equation under SNSM has a lower bias than under SNM but a higher bias than under TS. For the regression parameters of interest, the three models are comparable. As under the exclusion restriction, the SNSM model appears useful even when the underlying process is normal. When λ increases, the performance of SNSM improves both in bias and in MSE. There were severe identifiability problems with the SNSM model when λ = 5, as about 300 samples out of 1000 did not converge. However, whenever a sample converged, it converged to its MLE.

In addition, the SNSM estimates of σ and ρ are better than those of the SNM and TS models when λ ≥ 1, both in the presence and in the absence of exclusion restriction.
Since the variance indicates the variability of the probability distribution of the outcomes $Y_i$, it follows that correct prediction intervals for new observations will be obtained under the SNSM model. Further, in applied settings (such as the MINT Trial data described next), interest may lie in patients who do not return their questionnaire. This requires a correct model for the selection process. As can be seen from Tables 1 and 2, the SNSM gave consistently smaller bias and MSE than the SNM and TS models for the selection equation when λ ≥ 1. It should also be noted that the bias in the parameter estimates of the selection equation under the SNSM model is smaller even under normality, with or without the exclusion restriction.

We also considered the effect of varying the underlying correlation in the presence of exclusion restriction for λ = 1 and 2. The results (see Tables 5 and 6 in the Appendix) are similar to the ones for ρ = 0.5.

5 Application to MINT Trials

We examine data from a multi-center randomized controlled trial of treatments for Whiplash Associated Disorder (WAD), referred to as the Managing Injuries of the Neck Trial (MINT), in which two treatment regimes were compared: physiotherapy versus reinforcement of advice in patients with continuing symptoms three weeks after their initial visit to the Emergency Department (ED) (Lamb et al., 2007). As with many longitudinal patient-reported outcome or quality of life studies, the data were collected using questionnaires at regular intervals over a follow-up period, at 4, 8 and 12 months after the patient's ED attendance. The main goal of the study is to determine whether there is any meaningful difference between the two treatments. The primary outcome of interest is return to normal function after the whiplash injury, and is measured using the Neck Disability Index (NDI).
The NDI is a self-completed questionnaire which assesses pain-related activity restrictions in 10 areas, including personal care, lifting, sleeping, driving, concentration, reading and work, and results in a score between 0 and 50. It was developed in 1989 by Howard Vernon as a modification of the Oswestry Low Back Pain Disability Index. The NDI has been shown to be reliable and valid (Vernon and Mior, 1991), hence its use as a standard instrument for measuring self-rated disability due to neck pain by clinicians and researchers.

Table 1: Simulation results in the presence of exclusion restriction.

                        Bias                          MSE
λ      Par.     SNSM      SNM       TS        SNSM     SNM      TS
0.0    β0      0.0016   -0.0001    0.0002    0.0108   0.0024   0.0027
       β1     -0.0003   -0.0003   -0.0005    0.0019   0.0019   0.0019
       γ0      0.0061    0.0067    0.0073    0.0074   0.0050   0.0051
       γ1      0.0040    0.0052    0.0059    0.0060   0.0059   0.0060
       γ2      0.0080    0.0098    0.0106    0.0094   0.0093   0.0094
       σ       0.0028   -0.0009   -0.0007    0.0017   0.0009   0.0009
       ρ      -0.0007   -0.0006   -0.0021    0.0084   0.0084   0.0113
       λ      -0.0027    -         -         0.0175   -        -
0.5    β0      0.2071    0.3564    0.3564    0.1379   0.1289   0.1291
       β1      0.0001    0.0002    0.0002    0.0016   0.0016   0.0016
       γ0      0.1786    0.2091    0.2101    0.0517   0.0507   0.0514
       γ1      0.0203    0.0259    0.0269    0.0074   0.0075   0.0078
       γ2      0.0314    0.0398    0.0409    0.0126   0.0125   0.0130
       σ      -0.0444   -0.0654   -0.0652    0.0065   0.0050   0.0050
       ρ      -0.0173   -0.0243   -0.0248    0.0104   0.0102   0.0129
       λ      -0.0030    -         -         0.1267   -        -
1.0    β0      0.0445    0.5620    0.5624    0.0361   0.3173   0.3178
       β1      0.0004    0.0010    0.0007    0.0012   0.0012   0.0012
       γ0      0.0401    0.3516    0.3529    0.0282   0.1319   0.1330
       γ1      0.0108    0.0533    0.0547    0.0073   0.0098   0.0102
       γ2      0.0201    0.0835    0.0860    0.0138   0.0192   0.0199
       σ      -0.0110   -0.1697   -0.1696    0.0067   0.0293   0.0293
       ρ      -0.0072   -0.0636   -0.0658    0.0133   0.0155   0.0181
       λ      -0.0501    -         -         0.1471   -        -
2.0    β0      0.0013    0.7088    0.7098    0.0036   0.5034   0.5049
       β1      0.0007    0.0020    0.0014    0.0008   0.0009   0.0009
       γ0      0.0149    0.4706    0.4728    0.0302   0.2310   0.2333
       γ1      0.0086    0.0850    0.0877    0.0088   0.0151   0.0157
       γ2      0.0140    0.1275    0.1324    0.0171   0.0304   0.0317
       σ      -0.0006   -0.2879   -0.2881    0.0022   0.0833   0.0834
       ρ      -0.0065   -0.1087   -0.1145    0.0170   0.0250   0.0285
       λ       0.0311    -         -         0.0993   -        -
5.0    β0      0.0063    0.7753    0.7771    0.0069   0.6020   0.6047
       β1      0.0013    0.0029    0.0019    0.0004   0.0007   0.0007
       γ0      0.0184    0.5306    0.5337    0.0481   0.2918   0.2952
       γ1      0.0066    0.0997    0.1038    0.0114   0.0180   0.0188
       γ2      0.0134    0.1489    0.1566    0.0219   0.0368   0.0386
       σ      -0.0020   -0.3605   -0.3608    0.0025   0.1303   0.1305
       ρ      -0.0052   -0.1339   -0.1453    0.0216   0.0356   0.0383
       λ       0.1398    -         -         0.8776   -        -

Table 2: Simulation results in the absence of exclusion restriction.

                        Bias                          MSE
λ      Par.     SNSM      SNM       TS        SNSM     SNM      TS
0.0    β0      0.0143    0.0154    0.0049    0.0607   0.0084   0.0124
       β1     -0.0123   -0.0121    0.0036    0.0062   0.0062   0.0089
       γ0     -0.0002    0.0066    0.0066    0.0167   0.0038   0.0038
       γ1      0.0029    0.0100    0.0101    0.0055   0.0052   0.0052
       σ       0.0228   -0.0018    0.0059    0.0069   0.0012   0.0023
       ρ      -0.0359   -0.0427   -0.0237    0.0474   0.0452   0.0651
       λ       0.0018    -         -         0.1139   -        -
0.5    β0      0.2912    0.3675    0.3593    0.1334   0.1411   0.1372
       β1     -0.0108   -0.0088   -0.0020    0.0050   0.0048   0.0063
       γ0      0.1646    0.2036    0.2038    0.0463   0.0461   0.0462
       γ1      0.0157    0.0217    0.0220    0.0060   0.0058   0.0058
       σ      -0.0406   -0.0642   -0.0586    0.0059   0.0049   0.0050
       ρ      -0.0654   -0.0648   -0.0440    0.0604   0.0544   0.0683
       λ      -0.3782    -         -         0.2527   -        -
1.0    β0      0.0640    0.5580    0.5604    0.0381   0.3151   0.3187
       β1     -0.0076    0.0048    0.0025    0.0037   0.0036   0.0042
       γ0      0.0759    0.5261    0.5340    0.0527   0.2841   0.2926
       γ1      0.0091    0.0434    0.0490    0.0073   0.0084   0.0088
       σ      -0.0138   -0.1637   -0.1628    0.0067   0.0276   0.0276
       ρ      -0.0669   -0.0604   -0.0733    0.0768   0.0548   0.0721
       λ      -0.0761    -         -         0.1512   -        -
2.0    β0      0.0036    0.6812    0.7085    0.0045   0.4681   0.5051
       β1     -0.0017    0.0304    0.0037    0.0023   0.0049   0.0029
       γ0      0.0333    0.4451    0.4677    0.0884   0.2052   0.2251
       γ1     -0.0047    0.0507    0.0865    0.0121   0.0114   0.0141
       σ       0.0033   -0.2708   -0.2827    0.0019   0.0741   0.0805
       ρ      -0.0556    0.0100   -0.1245    0.0879   0.0869   0.0864
       λ       0.0165    -         -         0.0916   -        -
5.0    β0     -0.0036    0.7107    0.7744    0.0052   0.5103   0.6021
       β1      0.0057    0.0744    0.0057    0.0020   0.0115   0.0023
       γ0     -0.0232    0.3987    0.5280    0.1175   0.1738   0.2856
       γ1     -0.0234   -0.0769    0.1039    0.0211   0.0315   0.0178
       σ       0.0048   -0.3219   -0.3559    0.0020   0.1049   0.1272
       ρ      -0.0065   -0.1803   -0.1503    0.0065   0.1783   0.0979
       λ       0.1460    -         -         0.7745   -        -

Figure 9: Marginal distributions and Correlations at Baseline, Month 4, 8 and 12 for the NDI scores. (Scatterplot matrix with panels Baseline, Month.4, Month.8 and Month.12; the panels report pairwise correlations of 0.56, 0.72, 0.69, 0.57, 0.79 and 0.61.)

The fact that the responses were derived from a 10-item questionnaire posed several challenges. These include, but are not limited to, the discreteness (Likert-scale type) of the scores, item and unit nonresponse, and dropout over time. These might be responsible for the skewness present in the observed data (see Figure 9).

There are 599 patients with a total of 1934 measurements, and 372 (62%) patients have complete observations (i.e. scores at all measurement occasions). Further, approximately 50% of the patients are in each of the two treatment groups, resulting from balanced randomisation. The mean age is approximately 41 years, with a range of 18 to 78 years. Vernon (2009) recommended that a patient's replies with no more than 2 missed items should be considered complete, with mean imputation used for adjustment. We follow this recommendation, and any patient with 3 or more missing items is considered as unit missing. In effect, we have only unit nonresponse left in the dataset.
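The scoring rule just described can be sketched as follows. This is an illustration only: it assumes that each of the 10 items is scored 0 to 5 (implied by the 0 to 50 range) and that mean imputation replaces a missed item by the mean of the items the same patient did answer, and the function name `ndi_score` is hypothetical.

```python
from typing import Optional, Sequence


def ndi_score(items: Sequence[Optional[int]], max_missing: int = 2) -> Optional[float]:
    """Total NDI score from 10 items; None marks a missed item.

    Returns None (unit missing) when more than max_missing items are missed.
    """
    if len(items) != 10:
        raise ValueError("the NDI has 10 items")
    answered = [value for value in items if value is not None]
    n_missing = len(items) - len(answered)
    if n_missing > max_missing:
        return None  # treated as unit nonresponse
    # Assumed form of mean imputation: each missed item gets the patient's own item mean.
    item_mean = sum(answered) / len(answered)
    return sum(answered) + item_mean * n_missing


if __name__ == "__main__":
    print(ndi_score([3, 2, None, 4, 1, 0, 2, None, 5, 3]))
    print(ndi_score([3, None, None, None, 1, 0, 2, 2, 5, 3]))
```

A patient with three or more missed items yields `None`, which corresponds to a missing outcome (S = 0) in the selection model.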
In what follows, we identify predictors of dropout at each measurement occasion before restricting attention to one measurement occasion to illustrate our new model.

5.1 Use of Logistic Regression to identify Predictors of Dropout

In any model involving missing data, it is important to include covariates that are predictors of dropout. For the NDI scores, we use logistic regression to identify predictors of dropout. Binary response variables were constructed with value 1 if the patient drops out by month 4, 8 or 12 and 0 otherwise. The first step was to consider whether the baseline ($y_0$) measurements could influence dropout. We then considered whether any pre-randomization variables give any further improvement. The two treatments under consideration were also included, with reinforcement of advice used as the reference category.

The results of these models are presented in Table 3. We focus on the missingness model at month 8, which shows that the age and sex of the patients are good predictors of missingness. The odds of a female dropping out at this month (other variables held constant) are about twice those of a male (exp(0.700) ≈ 2.0). This implies that females drop out more than males. Similarly, the odds of a patient dropping out increase by 4.1% with each additional year of age (exp(0.040) ≈ 1.041). Thus, the older the patients are, the greater their tendency to drop out.

Table 3: Logistic regression for dropout at 4, 8 and 12 months using Vernon scores

               Missing at 4 months          Missing at 8 months          Missing at 12 months
               Estimate  S.E.   p-value     Estimate  S.E.   p-value     Estimate  S.E.   p-value
int             0.617    0.574  0.282       -0.119    0.653  0.856        1.468    0.736  0.046
trt(physio)     0.364    0.248  0.142       -0.415    0.278  0.135       -0.919    0.312  0.003
sex(female)     0.091    0.254  0.720        0.700    0.272  0.010        0.567    0.306  0.064
age             0.029    0.010  0.004        0.040    0.012  0.000        0.020    0.012  0.107
y0             -0.004    0.015  0.799       -0.003    0.021  0.901       -0.053    0.024  0.026
y4                                           0.025    0.021  0.233        0.018    0.028  0.496
y8                                                                        0.050    0.029  0.082

5.2 Application of selection skew-normal model to the NDI scores

To illustrate our model, we restrict attention to the measurement at 8 months. Table 3 shows that sex and age are possible predictors of nonresponse for this month. These variables are used in the selection equation.

The results of fitting the SNSM, SNM and TS models to the NDI scores at 8 months are presented in Table 4. The intercept estimates differ substantially, as expected from the simulation results. In addition, parameter estimates are similar for covariates in the outcome equation for the three models. However, as observed in the simulation study, the coefficients in the probit selection equation for SNM and TS are consistently larger. The size of this difference depends on the intensity of the skewness present in the data. A similar effect is also noticeable for the NDI scores. In particular, the skewness parameter (λ = 1.509) is statistically significant in the SNSM model. This implies that neglecting the influence of λ in the model, although it leads to the same qualitative conclusions for the covariate effects in the outcome equation, will impair the predictive power of the model. The SNSM is more general, with the advantage of having good predictive power whether or not there is skewness in the data, and it has SNM as a special case.

Table 4: Fit of selection skew-normal model (SNSM), Selection-normal model (SNM), and Heckman two-step model to the NDI scores at 8 months.

                  SNSM                         SNM                          Two Step
                  Estimate  S.E.   p-value     Estimate  S.E.   p-value     Estimate  S.E.   p-value
Selection Equation
int                0.165    0.160  0.303        0.802    0.108  0.000        0.818    0.111  0.000
age                0.021    0.005  0.000        0.024    0.006  0.000        0.025    0.006  0.000
sex(female)        0.330    0.123  0.007        0.384    0.142  0.007        0.383    0.147  0.009
Outcome Equation
int               -4.072    0.906  0.000        0.491    0.732  0.502       -2.867    5.002  0.567
age                0.078    0.025  0.001        0.088    0.024  0.000        0.150    0.113  0.187
sex(female)        0.418    0.629  0.507        0.513    0.636  0.420        1.566    2.030  0.441
prev               0.676    0.035  0.000        0.684    0.035  0.000        0.708    0.033  0.000
trt(physio)        0.754    0.539  0.162        0.879    0.538  0.102        0.985    0.545  0.072
σ                  7.764    0.544  0.000        6.188    0.290  0.000        -        -      -
ρ                  0.787    0.131  0.000        0.802    0.072  0.000        -        -      -
λ                  1.509    0.432  0.001        -        -      -            -        -      -

6 Conclusion

We introduced a sample selection model with an underlying bivariate skew-normal distribution, which we called the skew-normal selection model (SNSM). This model is more flexible than the conventional sample selection model since it has an extra parameter that regulates skewness and has the conventional sample selection model as a special case. Its moment estimator was derived using the link between skew models arising from selection and the hidden truncation formulation of skew models. The moment estimator was shown to extend the Heckman two-step model. Maximum likelihood estimation for the parameters of the model was considered.

A Monte Carlo study was used to compare the model with conventional sample selection models under moderate correlation (ρ = 0.5) and degrees of skewness varying between 0 and 5. We also fixed λ at 1 and 2, and considered the effect of varying the correlation ρ under the exclusion restriction. The simulation showed that the selection skew-normal model outperforms the conventional sample selection models for all the nonzero skewness parameters considered. The conventional sample selection model has a negligible advantage when λ = 0, with smaller bias in the intercept of the outcome equation. We also noted that the conventional sample selection model breaks down as λ increases to 5 (which is almost a folded normal distribution), while the SNSM works well if it converges. The model is very promising even in the absence of the exclusion restriction. In addition, the model gives good estimates of the intercept in both the selection and outcome equations, and hence will give better predictions even when the underlying process is bivariate normal. We believe that this model should also perform well in modeling heavier tailed data, which is another prominent departure from normality.

It should be noted that the model presented here is very simple to use. In fact, the model can be readily implemented using the sampleSelection package in R (see Toomet and Henningsen (2008)). What is needed is an additional parameter λ to capture skewness, recoding of the log-likelihood function to reflect equation (4.32), and either the use of numerical gradients or the addition of analytic gradients based on the new log-likelihood function. Starting values can be obtained using the two-step method in the sampleSelection package. However, we recommend obtaining a starting value for λ by fitting the Azzalini skew-normal model to the complete cases with the intended covariates for the outcome equation. Further, the optimization routine used was BFGS, but other numerical maximization algorithms can be used as well.

On the issue of model identification, the model is well identified in the sense that for any $\Theta_1 \neq \Theta_2$, $f(y, \Theta_1) \neq f(y, \Theta_2)$, where $\Theta_1$ and $\Theta_2$ are model parameters. This is usually the case with sample selection models, since additional information comes into the model through the selection process. However, in the absence of the exclusion restriction and with λ approaching infinity, the model is weakly identified. It is noteworthy that the model will fail when λ and ρ equal zero simultaneously.
This is not related to the identification of the model parameters but to the usual singularity of the Fisher information and observed information matrices suffered by Azzalini's skew-normal distribution at λ = 0. To make the SNSM model generally useful in practice, a reparametrization of the model to circumvent this problem may be required, and our future work will look into this.

We also note that model (4.21) is more general than the one presented here. However, it is computationally complicated. Apart from this, the parameter ρ is no longer adequate to capture the underlying association. The model therefore needs to be re-parameterized using correlation curves. Our future development of the ideas presented here will look into this and into the Heckman-like two-step estimator given by equation (4.29). An approach which requires only knowledge of the marginal distributions of the error terms in both the selection and the outcome equations was presented in Lee (1983). Our subsequent work will examine how this model compares to the ones we have presented here. Methods for testing the hypothesis of no selection bias (ρ = 0) when the underlying error distribution is bivariate skew-normal will also be examined in the future development of the skew-normal selection model.

To apply this model in practice, we recommend that it is fitted in conjunction with the conventional sample selection model. This can be used to assess the degree of departure from normality. The model could be of benefit in clinical trials, and it has prospects in fields where observational studies are conducted (econometrics, psychology, politics) and respondents need to complete questionnaires.

References

Arellano-Valle, R. B., M. D. Branco, and M. G. Genton (2006). A unified view of skewed distributions arising from selections. The Canadian Journal of Statistics 34, 581–601.

Arnold, B. C. and R. J. Beaver (2002). Skewed multivariate models related to hidden truncation and/or selective reporting. Test 11, 7–54.

Capitanio, A., A. Azzalini, and E. Stanghellini (2003). Graphical models for skew-normal variates. Scandinavian Journal of Statistics 30, 129–144.

Chang, C., J. Lin, N. Pal, and M. Chiang (2008). A Note on Improved Approximation of the Binomial Distribution by the Skew-Normal Distribution. American Statistician 62(2), 167–170.

Copas, J. B. and H. Li (1997). Inference for non-random samples. J. R. Statist. Soc. B 59, 55–95.

Gonzalez-Farias, G., J. A. Dominguez-Molina, and A. K. Gupta (2004). The closed skew-normal. In M. G. Genton (Ed.), Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality, pp. 25–42. Boca Raton, Florida: Chapman & Hall/CRC.

Heckman, J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5, 475–492.

Heckman, J. (1979). Sample selection bias as a specification error. Econometrica 47, 153–161.

Jamalizadeh, A., J. Behboodian, and N. Balakrishnan (2008). A two-parameter generalized skew-normal distribution. Statistics and Probability Letters 78, 1722–1728.

Kotz, S., N. Balakrishnan, and N. L. Johnson (2000). Continuous Multivariate Distributions, Vol. 1 (2nd ed.). New York: John Wiley & Sons Ltd.

Lamb, S. E., S. Gates, M. R. Underwood, M. W. Cooke, D. Ashby, A. Szczepura, M. A. Williams, E. M. Williamson, E. J. Withers, S. M. Isa, and A. Gumber (2007). Managing Injuries of the Neck Trial (MINT): design of a randomised controlled trial of treatments for whiplash associated disorders. BMC Musculoskeletal Disorders 8, 7.

Lee, L. (1983). Generalized econometric models with selectivity. Econometrica 51, 507–512.

Lin, J., C. Chang, and M. R. Jou (2010). A Note on Skew-Normal Distribution Approximation to the Negative Binomial Distribution. WSEAS Transactions on Mathematics 9(1), 32–41.

Marchenko, Y. V. and M. G. Genton (2011). A Heckman selection-t model. Institute for Applied Maths and Computer Sci., Texas A & M University. Paper No. 171-2011.

Maydeu-Olivares, A., D. L. Coffman, and W. M. Hartmann (2007). Asymptotically distribution-free (ADF) interval estimation of Coefficient alpha. Psychological Methods 12, 157–176.

Puhani, P. A. (2000). The Heckman correction for sample selection and its critique. Journal of Economic Surveys 14, 53–68.

Robins, J. M. and R. D. Gill (1997). Non-response models for the analysis of nonmonotone ignorable missing data. Statistics in Medicine 16, 39–56.

Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581–592.

Toomet, O. and A. Henningsen (2008). Sample selection models in R: Package sampleSelection. Journal of Statistical Software 27(7).

Vernon, H. (2009). The Neck Disability Index: An instrument for measuring self-rated disability due to neck pain or whiplash-associated disorder. Last accessed February 20, 2010 at http://www.cmcc.ca/Portals/0/PDFs/Research_05_2009_NDI_Manual.pdf.

Vernon, H. and S. Mior (1991). The Neck Disability Index: a study of reliability and validity. J. Manipulative Physiol Ther. 14(7), 409–415.

Yuan, K.-H., C. A. Guarnaccia, and B. Hayslip (2003). A study of the distribution of sample coefficient alpha with the Hopkins symptom checklist: Bootstrap versus asymptotics. Educational and Psychological Measurement 63, 5–23.
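7 Appendix

Derivation of Moment Generating function

The moment generating function given by equation (4.27) is derived as follows:

\[
\begin{aligned}
E(e^{tZ}) &= k(\lambda_0,\lambda_1,\lambda) \int_{-\infty}^{\infty} e^{tz}\,\phi(z)\,\Phi(\lambda_1 z)\,\Phi(\lambda_0+\lambda z)\,dz \\
&= k(\lambda_0,\lambda_1,\lambda)\, e^{t^2/2} \int_{-\infty}^{\infty} \phi(z-t)\,\Phi(\lambda_1 z)\,\Phi(\lambda_0+\lambda z)\,dz \\
&\qquad \text{(put } x = z - t\text{)} \\
&= k(\lambda_0,\lambda_1,\lambda)\, e^{t^2/2} \int_{-\infty}^{\infty} \phi(x)\,\Phi(\lambda_1 x+\lambda_1 t)\,\Phi(\lambda_0+\lambda x+\lambda t)\,dx \\
&= k(\lambda_0,\lambda_1,\lambda)\, e^{t^2/2}\, E\{\Phi(\lambda_1 X+\lambda_1 t)\,\Phi(\lambda_0+\lambda X+\lambda t)\} \\
&= k(\lambda_0,\lambda_1,\lambda)\, e^{t^2/2}\, P(Y_1-\lambda_1 X < \lambda_1 t,\; Y_2-\lambda X < \lambda_0+\lambda t) \\
&= k(\lambda_0,\lambda_1,\lambda)\, e^{t^2/2}\, \Phi_2\!\left(\frac{\lambda_1 t}{\sqrt{1+\lambda_1^2}},\, \frac{\lambda_0+\lambda t}{\sqrt{1+\lambda^2}};\, \frac{\lambda_1\lambda}{\sqrt{1+\lambda_1^2}\sqrt{1+\lambda^2}}\right),
\end{aligned}
\]

where $X, Y_1, Y_2$ are iid $N(0,1)$, and

\[
P(Y_1-\lambda_1 X < \lambda_1 t,\; Y_2-\lambda X < \lambda_0+\lambda t) = \Phi_2\!\left(\frac{\lambda_1 t}{\sqrt{1+\lambda_1^2}},\, \frac{\lambda_0+\lambda t}{\sqrt{1+\lambda^2}};\, \frac{\lambda_1\lambda}{\sqrt{1+\lambda_1^2}\sqrt{1+\lambda^2}}\right).
\]

Derivation of Gradients

The gradient of the selection skew-normal model log-likelihood given by (4.32) can be derived as follows:

\[
\begin{aligned}
\frac{\partial l}{\partial \gamma} &= \sum_{i=1}^{n} s_i\,\frac{1}{\sqrt{1-\rho^2}}\,K_1 x_i + \sum_{i=1}^{n} (1-s_i)(-2)K_2 x_i, \\
\frac{\partial l}{\partial \beta} &= \sum_{i=1}^{n} s_i \left\{ \frac{1}{\sigma^2}(y_i-\beta'x_i)\,x_i - \frac{\lambda}{\sigma}K_3 x_i - \frac{\rho}{\sigma\sqrt{1-\rho^2}}K_1 x_i \right\}, \\
\frac{\partial l}{\partial \sigma} &= \sum_{i=1}^{n} s_i \left\{ -\frac{1}{\sigma} + \frac{1}{\sigma^3}(y_i-\beta'x_i)^2 - \frac{\lambda}{\sigma^2}K_3\,(y_i-\beta'x_i) - \frac{\rho}{\sigma^2\sqrt{1-\rho^2}}K_1\,(y_i-\beta'x_i) \right\}, \\
\frac{\partial l}{\partial \rho} &= \sum_{i=1}^{n} s_i\,\frac{1}{(1-\rho^2)^{3/2}}\,K_1\left(\rho\gamma'x_i + \frac{y_i-\beta'x_i}{\sigma}\right) + \sum_{i=1}^{n} (1-s_i)\,2K_4\,\frac{\lambda+\lambda^3}{(1+\lambda^2-\lambda^2\rho^2)^{3/2}}, \\
\frac{\partial l}{\partial \lambda} &= \sum_{i=1}^{n} s_i\,K_3\,\frac{y_i-\beta'x_i}{\sigma} + \sum_{i=1}^{n} (1-s_i)\,2K_4\,\frac{\rho}{(1+\lambda^2-\lambda^2\rho^2)^{3/2}},
\end{aligned}
\]

where

\[
K_1 = \frac{\phi\!\left(\dfrac{\gamma'x_i+\rho\,(y_i-\beta'x_i)/\sigma}{\sqrt{1-\rho^2}}\right)}{\Phi\!\left(\dfrac{\gamma'x_i+\rho\,(y_i-\beta'x_i)/\sigma}{\sqrt{1-\rho^2}}\right)}, \qquad
K_2 = \frac{\phi(-\gamma'x_i)\,\Phi\!\left(\dfrac{-\gamma'x_i\,\lambda\rho}{\sqrt{1+\lambda^2-\lambda^2\rho^2}}\right)}{\Phi_{SN}\!\left(-\gamma'x_i;\,0,\,1,\,\dfrac{\lambda\rho}{\sqrt{1+\lambda^2-\lambda^2\rho^2}}\right)},
\]

\[
K_3 = \frac{\phi\!\left(\lambda\,(y_i-\beta'x_i)/\sigma\right)}{\Phi\!\left(\lambda\,(y_i-\beta'x_i)/\sigma\right)}, \qquad
K_4 = -\frac{1}{\sqrt{2\pi}}\;\frac{1+\lambda^2-\lambda^2\rho^2}{1+\lambda^2}\;\frac{\phi\!\left(\dfrac{\gamma'x_i\sqrt{1+\lambda^2}}{\sqrt{1+\lambda^2-\lambda^2\rho^2}}\right)}{\Phi_{SN}\!\left(-\gamma'x_i;\,0,\,1,\,\dfrac{\lambda\rho}{\sqrt{1+\lambda^2-\lambda^2\rho^2}}\right)}.
\]

Note that the derivative of $\Phi_{SN}\!\left(-\gamma'x_i;\,0,\,1,\,\lambda\rho/\sqrt{1+\lambda^2-\lambda^2\rho^2}\right)$ with respect to γ follows the usual differentiation of a cdf to get the pdf. However, the derivatives with respect to ρ and λ in this expression are not a straightforward application of this principle.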
The approach we followed is to re-write the cdf above as a standard bivariate normal integral. We then used the general results given in Chapter 46 of Kotz et al. (2000). That is, if $\Phi_2(\cdot,\cdot\,;\rho)$ and $\phi_2(\cdot,\cdot\,;\rho)$ are the standard bivariate normal cdf and pdf respectively, then

\[
\frac{d\,\Phi_2(\cdot,\cdot\,;\rho)}{d\rho} = \phi_2(\cdot,\cdot\,;\rho).
\]

Table 5: Simulation results for λ = 1 and varying ρ in the presence of exclusion restriction.

                        Bias                          MSE
ρ      Par.     SNSM      SNM       TS        SNSM     SNM      TS
0.0    β0      0.0990    0.5636    0.5637    0.0977   0.3196   0.3197
       β1      0.0007    0.0004    0.0003    0.0014   0.0014   0.0014
       γ0     -0.0035    0.0072    0.0073    0.0110   0.0054   0.0054
       γ1      0.0049    0.0075    0.0076    0.0063   0.0063   0.0063
       γ2      0.0109    0.0148    0.0149    0.0101   0.0102   0.0102
       σ      -0.0385   -0.1746   -0.1746    0.0121   0.0310   0.0310
       ρ       0.0042    0.0022    0.0015    0.0194   0.0145   0.0143
       λ      -0.2164    -         -         0.3274   -        -
0.3    β0      0.0461    0.5627    0.5630    0.0384   0.3183   0.3187
       β1      0.0007    0.0009    0.0007    0.0013   0.0013   0.0013
       γ0      0.0209    0.1960    0.1965    0.0181   0.0448   0.0451
       γ1      0.0058    0.0226    0.0233    0.0068   0.0072   0.0073
       γ2      0.0114    0.0366    0.0376    0.0111   0.0119   0.0121
       σ      -0.0113   -0.1714   -0.1714    0.0071   0.0299   0.0299
       ρ      -0.0026   -0.0432   -0.0449    0.0173   0.0153   0.0158
       λ      -0.0541    -         -         0.1529   -        -
0.7    β0      0.0484    0.5614    0.5619    0.0375   0.3164   0.3171
       β1      0.0001    0.0009    0.0006    0.0011   0.0011   0.0012
       γ0      0.0637    0.5395    0.5437    0.0478   0.3011   0.3065
       γ1      0.0185    0.1036    0.1078    0.0093   0.0187   0.0202
       γ2      0.0309    0.1583    0.1645    0.0173   0.0383   0.0412
       σ      -0.0123   -0.1684   -0.1683    0.0067   0.0289   0.0289
       ρ      -0.0093   -0.0656   -0.0683    0.0070   0.0113   0.0165
       λ      -0.0564    -         -         0.1518   -        -

Table 6: Simulation results for λ = 2 and varying ρ in the presence of exclusion restriction.

                        Bias                          MSE
ρ      Par.     SNSM      SNM       TS        SNSM     SNM      TS
0.0    β0      0.0009    0.7127    0.7130    0.0041   0.5093   0.5097
       β1      0.0010    0.0004    0.0004    0.0009   0.0030   0.0030
       γ0     -0.0004    0.0072    0.0073    0.0193   0.0054   0.0054
       γ1      0.0006    0.0074    0.0076    0.0062   0.0063   0.0063
       γ2      0.0045    0.0148    0.0149    0.0100   0.0102   0.0102
       σ      -0.0004   -0.2996   -0.2996    0.0027   0.0902   0.0902
       ρ       0.0013    0.0030    0.0013    0.0265   0.0138   0.0135
       λ       0.0339    -         -         0.1110   -        -
0.3    β0      0.0006    0.7107    0.7114    0.0038   0.5063   0.5073
       β1      0.0010    0.0013    0.0030    0.0008   0.0009   0.0009
       γ0      0.0040    0.2538    0.2543    0.0256   0.0714   0.0717
       γ1      0.0022    0.0325    0.0333    0.0073   0.0080   0.0081
       γ2      0.0057    0.0509    0.0524    0.0125   0.0138   0.0140
       σ      -0.0001   -0.2918   -0.2919    0.0023   0.0856   0.0856
       ρ      -0.0014   -0.0739   -0.0776    0.0224   0.0200   0.0203
       λ       0.0331    -         -         0.1034   -        -
0.7    β0      0.0024    0.7072    0.7079    0.0034   0.5010   0.5021
       β1      0.0005    0.0026    0.0021    0.0007   0.0008   0.0008
       γ0      0.0251    0.7605    0.7683    0.0311   0.5911   0.6038
       γ1      0.0141    0.1754    0.1832    0.0102   0.0400   0.0432
       γ2      0.0226    0.2629    0.2757    0.0198   0.0849   0.0922
       σ      -0.0012   -0.2851   -0.2850    0.0021   0.0817   0.00816
       ρ      -0.0086   -0.1162   -0.1194    0.0088   0.0236   0.0295
       λ       0.0284    -         -         0.0991   -        -