Likelihood Based Inference for Skew-Normal Independent Linear Mixed Models

Victor Lachos Davila1; Pulak Ghosh2; Reinaldo Arellano-Valle3

1 Universidade Estadual de Campinas, hlachos@ime.unicamp.br
2 Georgia State University, pulakghosh@gmail.com
3 Pontificia Universidad Católica de Chile, Reivalle@mat.puc.cl

INTRODUCTION

Longitudinal data analysis, in which repeated measurements are taken on subjects at various time points, plays an important role in applied statistics, especially in biomedical research involving clinical trials. The linear mixed model (LMM; Laird and Ware, 1982) has become the most frequently used analytic tool for longitudinal data with continuous repeated measures. A linear mixed model consists of fixed effects and random effects; the random effects account for the between-subject variation. In the linear mixed model framework it is routinely assumed that the random effects and the within-subject measurement errors are normally distributed. While this assumption makes the model easy to apply in widely used software such as SAS, its accuracy is difficult to check, and the routine use of normality has recently been questioned by many authors (Butler and Louis, 1992; Verbeke and Lesaffre, 1997; Zhang and Davidian, 2001; Ghidey et al., 2004; Lin and Lee, 2007). The normality assumption can be too restrictive because it lacks robustness against departures from the normal distribution, particularly when the data show multimodality and skewness, and thus it may not provide an accurate estimate of the between-subject variation. For example, Zhang and Davidian (2001) showed that the estimated subject-specific intercepts from the Framingham heart study data were not normally distributed, so using a normal distribution in this setting may obscure important features of the between-subject variation.

Despite these drawbacks, the widespread use of the normal linear mixed model (N-LMM) is motivated in part by mathematical convenience and by the fact that, under general regularity conditions, estimates of the fixed effects are robust to non-normality of the random effects (Verbeke and Lesaffre, 1997). However, a misspecified random-effects distribution may bias the estimated standard errors of the parameters and affect the efficiency of the fixed-effects estimates (Ghidey et al., 2004). Furthermore, inference on individual effects can be misleading when the random-effects distribution deviates from normality. Thus, it is of practical interest to develop statistical models with considerable flexibility in the distributional assumptions for the random effects as well as for the measurement error.

There has been considerable work in this direction. Verbeke and Lesaffre (1996) introduced a heterogeneous linear mixed model in which the random-effects distribution is relaxed to a finite mixture of normals. Pinheiro et al. (2001) proposed a multivariate t linear mixed model and showed that it performs well in the presence of outliers. Zhang and Davidian (2001) proposed an LMM in which the random effects follow the so-called seminonparametric (SNP) distribution. Rosa et al. (2003) adopted a Bayesian framework to carry out posterior analysis in LMMs with the thick-tailed class of normal/independent (NI) distributions (Lange and Sinsheimer, 1993). Ghidey et al. (2004) developed an LMM with a smooth random-effects density. Ma et al. (2004) considered a generalized flexible skew-elliptical distribution for the random-effects density and proposed somewhat complicated algorithms for maximum likelihood (ML) estimation and for Bayesian inference via Markov chain Monte Carlo (MCMC). Recently, Arellano-Valle et al. (2005), Lin and Lee (2007) and Lachos et al. (2007a) proposed a skew-normal linear mixed model (SN-LMM) based on the multivariate skew-normal (SN) distribution introduced by Azzalini and Dalla-Valle (1996). They also developed EM-type algorithms for maximum likelihood estimation (MLE).
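As a rough illustration of the skew-normal random-effects idea behind the SN-LMM, the following Python sketch simulates a random-intercept model in which the intercepts are drawn from a univariate skew-normal distribution through its standard stochastic representation. The function names, the linear time trend, and all parameter values are assumptions made for this example only; they are not taken from the cited papers, and the sketch does not implement any of their estimation algorithms.

```python
import numpy as np


def simulate_sn_lmm(n_subjects=2000, n_obs=5, beta=(1.0, 0.5),
                    scale_b=1.0, shape=3.0, sigma_e=0.5, seed=0):
    """Simulate y_ij = beta0 + beta1 * t_j + b_i + e_ij with skew-normal b_i.

    All names and default values are hypothetical choices for illustration.
    """
    rng = np.random.default_rng(seed)
    delta = shape / np.sqrt(1.0 + shape ** 2)
    # Stochastic representation of a standard skew-normal variate:
    #   Z = delta * |T0| + sqrt(1 - delta^2) * T1,  T0, T1 independent N(0, 1).
    t0 = np.abs(rng.standard_normal(n_subjects))
    t1 = rng.standard_normal(n_subjects)
    # Note: b is not recentred, so its mean is scale_b * delta * sqrt(2 / pi).
    b = scale_b * (delta * t0 + np.sqrt(1.0 - delta ** 2) * t1)
    # Same equally spaced time points for every subject.
    time = np.tile(np.arange(n_obs, dtype=float), (n_subjects, 1))
    # Normal within-subject errors.
    eps = sigma_e * rng.standard_normal((n_subjects, n_obs))
    y = beta[0] + beta[1] * time + b[:, None] + eps
    return y, b


def sample_skewness(x):
    """Moment-based sample skewness (third standardized central moment)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5


y, b = simulate_sn_lmm()
print("response matrix shape:", y.shape)
print("sample skewness of random intercepts:", sample_skewness(b))
```

Setting `shape=0.0` makes `delta` zero, so the intercepts reduce to ordinary normal draws; in this sketch the normal model is therefore a special case of the skew-normal one.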
A common feature of these classes of LMMs is that the normal linear mixed model (N-LMM) is a special member of each class.

In this paper we propose a parametric robust modeling of the LMM based on skew-normal/independent distributions (Lachos et al., 2007b). In particular, we assume a skew-normal/independent (SNI) distribution for the random effects and an NI distribution for the within-subject errors. Together, these assumptions imply that the observed responses follow an SNI distribution, which defines what we call a skew-normal/independent linear mixed model (SNI-LMM). The SNI class of distributions is attractive because it simultaneously accounts for skewness and for thick tails in the data. In particular, the SNI distributions provide a group of skewed, thick-tailed distributions that are useful for robust inference and contain as proper elements the skew-normal (SN), the skew-t (ST), the skew-slash (SSL), the skew-power exponential (SPE) and the skew-contaminated normal (SCN) distributions. The marginal density of the observed quantities is obtained analytically by integrating out the random effects, leading to an observed (marginal) likelihood function that can be maximized directly using existing statistical software such as Ox, R or Matlab. The hierarchical representation of the proposed model makes it possible to implement an EM-type algorithm, which for special cases and common situations yields closed-form expressions for the E and M steps. Furthermore, we note that the information matrix has a part common to all members of the family, which facilitates direct inference in the SNI-LMM. We further analyze the longitudinal Framingham cholesterol data, for which the random-effects distribution has been found to be non-normal and positively skewed by Zhang and Davidian (2001), Ghidey et al. (2004), and Lin and Lee (2007).

The rest of the article is organized as follows.
After a brief introduction to SNI distributions in Section 2, the SNI-LMM is introduced in Section 3. In Section 4 we discuss ML estimation and inference, including the estimation of the random effects and the prediction of future values. The observed information matrix is derived analytically in Section 5. In Section 6, simulation studies are conducted to examine the performance of the estimation of subject-specific effects and the prediction of future values. The advantage of the proposed methodology is illustrated using the Framingham cholesterol data in Section 7, and some concluding remarks are presented in Section 8.

MATERIALS AND METHODS

- Maximum likelihood estimation
- EM algorithm
- Monte Carlo simulation

RESULTS AND CONCLUSIONS

In this paper we deal with an SNI-LMM, which has the skew-normal LMM (Arellano-Valle et al., 2005) as a special case. A closed-form expression is obtained for the likelihood function of the observed data, which can be maximized using existing statistical software. An EM-type algorithm is developed by exploiting statistical properties of the SNI class considered. The observed information matrix is derived analytically, which allows direct implementation of inference for this class of models. For the Framingham cholesterol data, we show that the skew-t and the skew-contaminated normal distributions give a better fit.

REFERENCES

Arellano-Valle, R. B., Bolfarine, H. and Lachos, V. H. (2005). Skew-normal linear mixed models. Journal of Data Science, 3, 415-438.
Azzalini, A. and Capitanio, A. (2003). Distributions generated by perturbation of symmetry with emphasis on the multivariate skew-t distribution. Journal of the Royal Statistical Society, Series B, 65, 367-389.
Azzalini, A. and Dalla-Valle, A. (1996). The multivariate skew-normal distribution. Biometrika, 83, 715-726.
Branco, M. and Dey, D. (2001). A general class of multivariate skew-elliptical distribution. Journal of Multivariate Analysis, 79, 93-113.
Butler, S. M. and Louis, T. A. (1992). Random effects models with non-parametric priors. Statistics in Medicine, 11, 1981-2000.
Lachos, V. H., Bolfarine, H., Arellano-Valle, R. B. and Montenegro, L. C. (2007a). Likelihood based inference for multivariate skew-normal regression models. Communications in Statistics: Theory and Methods, 36, 1769-1786.
Lachos, V. H., Labra, F. V. and Ghosh, P. (2007b). Multivariate skew-normal/independent distributions: properties and inference. Submitted to Scandinavian Journal of Statistics.
Lange, K. and Sinsheimer, J. S. (1993). Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics, 2, 175-198.
Zhang, D. and Davidian, M. (2001). Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics, 57, 795-802.