Lecture 2: Estimation and Inference for the Marginal Model
Ziad Taib, Biostatistics, AZ
April 2011

Outline of lecture 2
1. A reminder
2. Estimation for the marginal model: ML and REML estimation
3. Inference for the mean structure
4. Inference for the variance components
5. Fitting linear mixed models in SAS

1. A reminder: the 2-stage model formulation

Stage 1
The response Y_ij for the i-th subject is measured at time t_ij, i = 1, ..., N, j = 1, ..., n_i, and collected in the response vector Y_i = (Y_i1, Y_i2, ..., Y_in_i)'. Possibly after some convenient transformation, the first-stage model is

    Y_i = Z_i \beta_i + \varepsilon_i,   \varepsilon_i \sim N(0, \Sigma_i),   often \Sigma_i = \sigma^2 I_{n_i},

where Z_i is an (n_i x q) matrix of known covariates and \beta_i is a q-dimensional vector of subject-specific regression parameters. Note that this model describes the observed variability within subjects.

Stage 2
Between-subject variability can now be studied by relating the subject-specific parameters \beta_i to known covariates:

    \beta_i = K_i \beta + b_i,

where K_i is a (q x p) matrix of known covariates and \beta is a p-dimensional vector of unknown regression parameters. Finally,

    b_i \sim N(0, D).

The general linear mixed-effects model
The two stages of the 2-stage approach can now be combined into one model:

    Y_i = Z_i K_i \beta + Z_i b_i + \varepsilon_i = X_i \beta + Z_i b_i + \varepsilon_i,   with X_i = Z_i K_i.

The term X_i \beta describes the average evolution; Z_i b_i describes the subject-specific deviations from it.

The hierarchical versus the marginal model
The general mixed model given above can equivalently be written as

    Y_i | b_i \sim N(X_i \beta + Z_i b_i, \Sigma_i),   b_i \sim N(0, D).

Because it specifies the distribution of Y_i conditional on the random effects together with the distribution of the random effects, it is also called a hierarchical model. Marginally, Y_i is distributed with density

    f_{Y_i}(y) = \int f_{Y_i | b_i}(y | b_i) f(b_i) db_i,

and hence

    E[Y_i] = E[ E[Y_i | b_i] ] = X_i \beta.

Example: the prostate data

A model for the prostate cancer data, stage 1:

    Y_ij = ln(PSA_ij + 1) = \beta_{1i} + \beta_{2i} t_ij + \beta_{3i} t_ij^2 + \varepsilon_ij,   j = 1, ..., n_i.

A model for the prostate cancer data, stage 2 (age could not be matched across the diagnostic groups, so it is kept as a covariate):

    \beta_{1i} = \beta_1 Age_i + \beta_2 C_i + \beta_3 B_i + \beta_4 L_i + \beta_5 M_i + b_{1i}
    \beta_{2i} = \beta_6 Age_i + \beta_7 C_i + \beta_8 B_i + \beta_9 L_i + \beta_{10} M_i + b_{2i}
    \beta_{3i} = \beta_{11} Age_i + \beta_{12} C_i + \beta_{13} B_i + \beta_{14} L_i + \beta_{15} M_i + b_{3i}

Here C_i, B_i, L_i and M_i are indicators of the classes control, BPH, local cancer and metastatic cancer, and Age_i is the subject's age at diagnosis. The parameters in the first row (\beta_2, ..., \beta_5) are the average intercepts for the different classes.

The prostate data: combined model
Substituting stage 2 into stage 1 gives

    Y_ij = (\beta_1 Age_i + \beta_2 C_i + \beta_3 B_i + \beta_4 L_i + \beta_5 M_i)
         + (\beta_6 Age_i + \beta_7 C_i + \beta_8 B_i + \beta_9 L_i + \beta_{10} M_i) t_ij
         + (\beta_{11} Age_i + \beta_{12} C_i + \beta_{13} B_i + \beta_{14} L_i + \beta_{15} M_i) t_ij^2
         + b_{1i} + b_{2i} t_ij + b_{3i} t_ij^2 + \varepsilon_ij.

2. Estimation in the marginal model: ML and REML estimation

ML and REML estimates

Estimation based on the marginal model
Marginally, Y_i \sim N(X_i \beta, V_i) with V_i = Z_i D Z_i' + \Sigma_i. Estimation of \beta and of the variance components \alpha (the parameters in D and \Sigma_i) is based on this marginal distribution.

ML estimation
For known \alpha, maximising the likelihood with respect to \beta gives the generalised least squares estimator

    \hat\beta(\alpha) = ( \sum_i X_i' V_i^{-1} X_i )^{-1} \sum_i X_i' V_i^{-1} Y_i.

Replacing \beta by \hat\beta(\alpha) in the likelihood function and maximising the resulting profile likelihood with respect to \alpha yields the ML estimates. The maximisation can be carried out with the EM algorithm or with Newton-Raphson.

REML estimation
Restricted (or residual, or reduced) maximum likelihood (REML) is a form of maximum likelihood estimation that does not base the estimates on a maximum likelihood fit of all the information; instead, it uses a likelihood function calculated from a transformed set of data, so that nuisance parameters have no effect. In the case of variance component estimation, the likelihood function is calculated from the probability distribution of a set of contrasts. In particular, REML is used as a method for fitting linear mixed models. In contrast to maximum likelihood estimation, REML can produce unbiased estimates of variance and covariance parameters.
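As a small illustration of the ML/REML difference (not part of the original slides), the following R sketch uses lme() from the nlme package, already used later in this lecture, together with its built-in Orthodont growth data. It fits the same random-intercept model by ML and by REML and compares the estimated variance components.

library(nlme)   # provides lme() and the Orthodont growth data used here only for illustration

# The same random-intercept model fitted by ML and by REML
fit.ml   <- lme(distance ~ age, random = ~ 1 | Subject,
                data = Orthodont, method = "ML")
fit.reml <- lme(distance ~ age, random = ~ 1 | Subject,
                data = Orthodont, method = "REML")

# Variance components: the ML estimates are biased downwards because they
# ignore the degrees of freedom spent on estimating the fixed effects
VarCorr(fit.ml)
VarCorr(fit.reml)

# The fixed-effect estimates themselves are essentially unaffected
fixef(fit.ml)
fixef(fit.reml)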
Analysis of contrast variables
Contrast variables in repeated measures data are linear combinations of the responses over time for an individual. In longitudinal studies it is of interest to consider the set of differences between responses at consecutive time points, that is, the changes from time 1 to time 2, from time 2 to time 3, and so forth. A set of contrast variables can be used to analyse trends over time and to make comparisons between times. The original repeated measures for each individual are transformed into a new set of variables, each given by a contrast.

REML estimation: a simple example
Given an i.i.d. sample Y_i, i = 1, ..., N, from a N(\mu, \sigma^2) distribution, the variance could be estimated by

    \hat\sigma^2 = (1/N) \sum_{i=1}^N (Y_i - \mu)^2,   with E[\hat\sigma^2] = \sigma^2.

But since \mu is unknown, we use

    \hat\sigma_1^2 = (1/N) \sum_{i=1}^N (Y_i - \bar Y)^2,   with E[\hat\sigma_1^2] = ((N-1)/N) \sigma^2,

which is biased downwards. Based on this we can define

    s^2 = (N/(N-1)) \hat\sigma_1^2 = (1/(N-1)) \sum_{i=1}^N (Y_i - \bar Y)^2,   with E[s^2] = \sigma^2.

REML
REML generalises this idea: the variance components are estimated from error contrasts whose distribution does not depend on the mean parameters, which corrects for the degrees of freedom used in estimating the fixed effects.

3. Inference for the mean structure

Approximate Wald tests
In the Wald test, named after Abraham Wald, the maximum likelihood estimate of the parameter of interest \beta is compared with the proposed value \beta_0, under the assumption that the difference between the two is approximately normally distributed. Typically the square of the difference is compared to a chi-squared distribution. In the univariate case the Wald statistic is

    (\hat\beta - \beta_0)^2 / Var(\hat\beta),

which is compared against a chi-squared distribution with one degree of freedom. Alternatively, the difference can be compared to a normal distribution; in that case the test statistic is

    (\hat\beta - \beta_0) / s.e.(\hat\beta).

Approximate Wald tests (cont'd)
More generally, hypotheses of the form

    H_0: L\beta = 0   versus   H_A: L\beta \neq 0

can be tested with a Wald statistic based on \hat\beta and its estimated covariance matrix. Note: the variance components \alpha are estimated, which introduces extra variability (and downward bias in the standard errors) that the Wald test ignores; this is resolved by using approximate t- or F-tests instead.

Approximate t- and F-tests

Robust inference

Likelihood ratio tests
For nested models, minus twice the logarithm of the likelihood ratio is compared to a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters. When models with different mean structures are compared, the likelihood ratio test should be based on ML rather than REML fits.
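As an illustration (not from the original slides), the following R sketch shows both kinds of tests for the mean structure with nlme, again using the Orthodont data purely as an example: the tTable from summary() gives the approximate Wald-type t-tests for the individual fixed effects, and anova() on two nested ML fits gives the likelihood ratio test.

library(nlme)

# Two nested mean structures, both fitted by maximum likelihood so that
# their likelihoods are comparable
fit0 <- lme(distance ~ age,       random = ~ 1 | Subject,
            data = Orthodont, method = "ML")
fit1 <- lme(distance ~ age * Sex, random = ~ 1 | Subject,
            data = Orthodont, method = "ML")

# Approximate Wald-type t-tests for the individual fixed effects
summary(fit1)$tTable

# Likelihood ratio test of H0: the smaller mean structure is adequate
anova(fit0, fit1)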
4. Inference for the variance components

5. Fitting linear mixed models in SAS

Statistical software

Software (cont'd)
SAS, SPSS, BMDP/5V, ML3, HLM, S-PLUS and R can all handle correlated data, but some are more restricted than others. Most packages offer a choice between ML and REML, and optimisation is usually based on Newton-Raphson, the EM algorithm or Fisher scoring. The user has to specify a model for the mean response that is linear in the fixed effects and to specify a covariance structure. The user can select a full parameterisation of the covariance structure (unstructured) or choose among given covariance structures. The covariance structure is also influenced by the inclusion of random effects and their covariance structure.

Software (cont'd)
The output often includes:
- the history of the optimisation iterations
- estimates of the fixed effects and of the covariance parameters, with standard errors
- estimates of user-specified contrasts
Graphics facilities are often limited, but plots can be produced in other software.

SAS PROC MIXED and repeated measures
PROC MIXED offers greater flexibility for the modelling of repeated measures data than PROC GLM. Firstly, the procedure provides a mechanism for modelling the covariance structure associated with the repeated measures. Secondly, it can handle some forms of missing data without discarding an entire subject's worth of data. Thirdly, it has some capability to handle the situation where subjects are measured at different times and time intervals. In PROC GLM, repeated measures are handled in a multivariate framework, and the procedure requires a multivariate view of the data. PROC MIXED, on the other hand, requires a univariate or stacked-data view: there is only a single response variable, and the repeated-measures information, including all of the information about the subjects, is contained in other variables. PROC GLM also assumes that the covariance matrix satisfies a sphericity assumption, such as compound symmetry.

SAS PROC MIXED
PROC MIXED was designed to handle mixed models. It offers a large choice of covariance structures (unstructured, random effects, autoregressive, Diggle, etc.). PROC MIXED can be used not only to estimate the fixed parameters but also the covariance parameters; by default, it estimates the covariance parameters by restricted maximum likelihood (REML). PROC MIXED provides empirical Bayes estimates of the random effects. Separate analyses for separate groups can be run using the BY statement. Approximate F-tests for class variables are obtained using Wald-type tests. All components of the output can be saved as SAS data sets for further manipulation with other internal (SAS) or external procedures.

PROC MIXED: syntax

PROC MIXED < options > ;
  BY variables ;
  CLASS variables ;
  ID variables ;
  MODEL dependent = < fixed-effects > < / options > ;
  RANDOM random-effects < / options > ;
  REPEATED < repeated-effect > < / options > ;
  PARMS (value-list) ... < / options > ;
  PRIOR < distribution > < / options > ;
  CONTRAST 'label' < fixed-effect values ... > < | random-effect values ... > , ... < / options > ;
  ESTIMATE 'label' < fixed-effect values ... > < | random-effect values ... > < / options > ;
  LSMEANS fixed-effects < / options > ;
  MAKE 'table' OUT=SAS-data-set ;
  WEIGHT variable ;

In PROC MIXED, the mixed model is specified by means of a number of statements such as CLASS, MODEL, RANDOM and REPEATED. The CLASS statement identifies the classification variables (for example gender, person, age group). The MODEL statement specifies the fixed-effects part of the model, X_i \beta; it defines the design matrix X_i, and the intercept is included by default. The RANDOM statement specifies the random effects and the form of their covariance matrix D (a useful option is SOLUTION, which prints the random-effects solutions). The REPEATED statement models the within-subject variation and specifies the structure of \Sigma_i = Cov(\varepsilon_i), where the overall error covariance matrix is block-diagonal with one block \Sigma_i per subject. If the REPEATED statement is not included, it is assumed that \Sigma_i = \sigma^2 I. The LSMEANS statement calculates least-squares mean estimates of specified fixed effects.

Modelling the covariance structure using the RANDOM and REPEATED statements in PROC MIXED
Measures on different individuals are independent, so covariance needs attention only for measures on the same individual. The covariance structure refers to the variances at the individual time points and to the correlation between measures at different times on the same individual. There are basically two aspects of this correlation. First, two measures on the same individual are correlated simply because they share common contributions from that individual; this is variation between individuals. Second, measures on the same individual that are close in time are often more highly correlated than measures far apart in time; this is covariation within individuals. Usually, when using PROC MIXED, the variation between individuals is specified by the RANDOM statement, and the covariation within individuals is specified by the REPEATED statement.
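The same division of labour exists in R's lme(): the random argument plays the role of the RANDOM statement (between-subject variation), while the correlation and weights arguments play the role of the REPEATED statement (within-subject covariance). The following is a hedged sketch, not taken from the slides, using the Orthodont data shipped with nlme as a stand-in for a stacked repeated-measures data set.

library(nlme)

# One row per subject and measurement occasion (stacked / univariate view)
fit <- lme(distance ~ age * Sex,                        # fixed effects (MODEL statement)
           random      = ~ 1 | Subject,                 # between-subject variation (RANDOM)
           correlation = corAR1(form = ~ 1 | Subject),  # serial correlation within subject,
           data        = Orthodont)                     # playing the role of REPEATED
summary(fit)

# A 'weights' argument, e.g. varIdent(form = ~ 1 | factor(age)), could in
# addition allow a different residual variance at each time point, much as a
# REPEATED statement with a suitable TYPE= structure would in PROC MIXED.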
PROC MIXED fits many different covariance structures. Note that a particular structure may be fitted using more than one TYPE= designation, and with combinations of the RANDOM and REPEATED statements.

Data structure for PROC MIXED
Consider an example where arm strength is measured on 8 patients at 3 different times, and where the patients have been randomised to one of 2 treatment groups. In the multivariate view associated with, for example, PROC GLM, each patient contributes one row with one column per time point. For an analysis of this data set with PROC MIXED, the univariate or stacked-data view is required, with one row per patient and time point; it can be obtained from the multivariate view with PROC TRANSPOSE.

The rat data

proc mixed data=rat method=reml;
  class id group;
  model y = t group*t / solution;
  random intercept t / type=un subject=id;
run;

Results

Using the option NOBOUND
Non-convergence, or a non-positive-definite estimated covariance matrix, can be an indication of negative variance components. Usually PROC MIXED does not allow that to happen, but adding the NOBOUND option to the PROC MIXED statement produces a new set of estimates in which d_22 is negative. Consider the fitted variance function implied by the random-intercept and random-slope model above:

    Var(Y_ij) = \hat d_{11} + 2 \hat d_{12} t_ij + \hat d_{22} t_ij^2 + \hat\sigma^2.

Hence the negative variance component suggests a negative curvature in the variance function.

The prostate data
Age could not be matched across the diagnostic groups, so it is kept as a covariate in the model.

SAS code

ML and REML estimates

Fitted average profiles

Mixed models using R vs. SAS: do they give the same answers?
When you perform a mixed model analysis using R (typically because some types of simulations are difficult to carry out in SAS), the question arises whether we are using the "right" model. The syntax of lme() is not as transparent as its SAS counterpart and its documentation is not as good, so it is interesting to compare the two approaches. R and SAS can give different answers; one reason is that they apply different restrictions to achieve uniqueness of the estimates. It is, however, possible to force R to give the same answers as PROC MIXED in SAS.

R commands for linear mixed models
The functions for linear mixed models are in the library nlme.

library(nlme)
data <- read.table(file.choose(), header = TRUE)
data$Time  <- factor(data$Time)
data$Group <- factor(data$Group)
data$Subj  <- factor(data$Subj)
model <- lme(y ~ Time + Group + Time*Group, random = ~ 1 | Subj, data = data)
summary(model)
anova(model)
# This model is very close to the one produced by SAS using compound symmetry
# as far as the F values are concerned, and the log-likelihood is the same,
# but the AIC and BIC are quite different. The standard deviations reported
# for the random effects match the SAS variance components once squared.
# The coefficients differ because R uses the first level of a factor as the
# baseline, whereas SAS uses the last.

Any questions?