Imputation by conditional distribution using Gaussian copula

Ene Käärik

Institute of Mathematical Statistics, University of Tartu, Estonia
Ene.Kaarik@ut.ee

Summary. The paper introduces the use of the conditional distribution for imputing dropouts in repeated measurements. When there is no suitable multivariate family for the given univariate marginals, a common approach is through copulas. Using the Gaussian copula, three imputation algorithms corresponding to the correlation structure of the data are derived.

Key words: imputation, Gaussian copula, correlation structure

Introduction

We focus on a longitudinal or repeated measurements study with missing data. Let X = (X_1, ..., X_m) be an outcome variable with repeated measurements at time points t_1, ..., t_m. We consider discrete time points and write them simply as 1, ..., m. Suppose that n units are sampled repeatedly over time. The aim is to measure each unit m times, but some units are measured at only k ≤ m time points.

Definition 1. Dropout or attrition is missingness in data which occurs when a subject leaves the study prematurely and does not return.

Dropouts are classified in the following way ([ED03], [Lit95]): completely random dropout (CRD), when the dropout and measurement processes are independent; random dropout (RD), when the dropout process depends on observed measurements; and informative dropout (ID), when the dropout process additionally depends on unobserved measurements.

We are interested in the missing outcome variable, i.e. the measurements that could potentially have been obtained. In many theoretical and practical tasks it is necessary to know explicit values of the missing measurements.

Definition 2. Imputation (filling in, substitution) is a strategy for completing missing values in data.

There is a long list of single and multiple imputation methods depending on the dropout process, such as conditional and unconditional means, hot deck, linear prediction, etc. We are going to impute a missing value on the basis of conditional distributions. The conditional distribution allows us to describe analytically all characteristics of a distribution (mean, standard deviation, quantiles, etc.). The problem is that the joint distribution may be unknown, but using the copula approach it is possible to find approximations to the joint and conditional distributions.

1 Basic concepts of copula theory

A copula C is a function that links univariate marginals to their joint multivariate distribution. Sklar (1959) defined the copula and provided some of its general properties. He showed an easy way to construct multivariate distributions from known marginals that need not belong to the same distribution family: the marginals are combined with a copula function to obtain a suitable joint distribution (see, for example, [Nel99]). Applications of copula theory appeared in econometrics, financial and actuarial mathematics, and the field has been developing rapidly in recent years. Copulas have been applied to a wide range of problems in biostatistics as well ([R76], [VL02]).

Consider a random vector X = (X_1, ..., X_k) ∈ R^k with marginal distribution functions F_1, ..., F_k and continuous joint distribution function F, so that X_i ~ F_i and X ~ F. We focus on the case where each F_i is continuous and differentiable.
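To make Sklar's construction concrete, the following minimal Python sketch (an illustration, not from the paper; the exponential and normal marginals and the correlation value 0.6 are assumptions of mine) builds a bivariate joint distribution function from two different marginals via a Gaussian copula, defined formally in Section 1.1 below.

```python
import numpy as np
from scipy.stats import expon, multivariate_normal, norm

def gaussian_copula_cdf(u1, u2, rho):
    """Bivariate Gaussian copula: plug normal quantiles of the uniform
    arguments into a standardized bivariate normal CDF."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    point = [norm.ppf(u1), norm.ppf(u2)]
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(point)

# Sklar: F(x1, x2) = C(F1(x1), F2(x2)) for continuous marginals.
# Illustrative marginals: X1 ~ Exp(1), X2 ~ N(0, 1); dependence rho = 0.6.
x1, x2, rho = 1.2, 0.3, 0.6
joint_cdf = gaussian_copula_cdf(expon.cdf(x1), norm.cdf(x2), rho)
print(joint_cdf)  # value of the constructed joint CDF at (x1, x2)
```

The point of the construction is visible in the code: the marginals enter only through F_1(x_1) and F_2(x_2), while the copula carries all of the dependence.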
If the copula C and the marginals F_1, ..., F_k are differentiable, then the joint density f(x_1, ..., x_k) corresponding to the joint distribution F(x_1, ..., x_k) can be written, by the canonical representation, as a product of the marginal densities and the copula density,

$$f(x_1, \ldots, x_k) = f_1(x_1) \times \ldots \times f_k(x_k) \times c(F_1, \ldots, F_k), \qquad (1)$$

where f_i(x_i) is the density corresponding to F_i and the copula density c is defined as the derivative of the copula. Copulas which are not absolutely continuous do not have joint densities. Equation (1) is known as the density version of Sklar's theorem.

The next essential consequence is the conditional distribution. Taking into account the joint density (1) and the general definition of a conditional density, we get the conditional density as follows:

$$f(x_k \mid x_1, \ldots, x_{k-1}) = \frac{f(x_1, \ldots, x_k)}{f(x_1, \ldots, x_{k-1})} = f_k(x_k)\, \frac{c(F_1, \ldots, F_k)}{c(F_1, \ldots, F_{k-1})}, \qquad (2)$$

where c(F_1, ..., F_k) and c(F_1, ..., F_{k-1}) are the corresponding copula densities.

The copula model has several advantages compared with an approach based on the joint distribution directly:
1. In many cases it may be hard to specify a joint distribution within any existing family. Using the copula approach we can first estimate the marginal distributions and then estimate the copula.
2. In a copula model we obtain the dependence function explicitly, which enables a more specific description of the dependence.
3. The family of copulas is sufficiently large to allow modeling of a wide range of distributions.

1.1 Gaussian copula and its properties

One of the most important examples of a copula is the normal or Gaussian copula ([LLZ99], [FV98], [R76], [Rei99], [Son00]). In general, the marginals of a k-variate Gaussian copula are assumed to be continuous random variables and may in principle be different. The only condition for creating the Gaussian copula is that the dependence matrix must be positive definite. The advantage of the normal dependence structure arises from its simplicity, its analytical manageability and the easy estimability of its only parameter, the correlation matrix.

Let R be a symmetric, positive definite matrix with diag(R) = (1, 1, ..., 1)^T, and let $\Phi_v$ and $\phi_v$ be the standardized v-variate normal distribution and density functions, respectively; we mostly use the v = 1 and v = k variate functions. The multivariate Gaussian copula C_k is given by the equality

$$C_k(u_1, \ldots, u_k; R) = \Phi_k(\Phi_1^{-1}(u_1), \ldots, \Phi_1^{-1}(u_k)), \qquad (3)$$

where u_i ∈ (0, 1), i = 1, ..., k.

Suppose now that there is a subject i such that the measurements X_1, ..., X_{k-1} are observed up to the time point k − 1 and the subject drops out at the time point k, so that the measurement X_k is missing. The vector H = (X_1, ..., X_{k-1}) is called the history of measurements.

To get the normal joint density function we have to find the normal copula density c_k as the derivative of the normal copula C_k. For constructing a multivariate joint density we use the marginals $\phi_1(x_1), \ldots, \phi_1(x_k)$ and define Y = (Y_1, ..., Y_k), where

$$Y_i = \Phi_1^{-1}[F_i(X_i)], \quad i = 1, \ldots, k. \qquad (4)$$

Thus, according to (1), we obtain the joint density as the product

$$f_k(x_1, \ldots, x_k \mid R^*) = \phi_1(x_1) \times \ldots \times \phi_1(x_k)\, \frac{\exp\{-Q_k^T (R^{-1} - I) Q_k / 2\}}{|R|^{1/2}}, \qquad (5)$$

where $Q_k = (\Phi_1^{-1}[F_1(x_1)], \ldots, \Phi_1^{-1}[F_k(x_k)])^T$, I is the k × k identity matrix and R* is the matrix of dependence measures. Eventually, according to (2), the conditional density given the history can be written as

$$f(x_k \mid x_1, \ldots, x_{k-1}; R^*) = f_k(x_k)\, \frac{c_k(F_1(x_1), \ldots, F_k(x_k); R_k)}{c_{k-1}(F_1(x_1), \ldots, F_{k-1}(x_{k-1}); R_{k-1})}, \qquad (6)$$

where c_k and c_{k-1} are normal copula densities and R_k, R_{k-1} are the corresponding correlation matrices of the measurements.
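The transform (4) and the density (5) are straightforward to compute. The sketch below (a Python illustration under assumed inputs, not the author's code) evaluates (5) for arbitrary continuous marginals and checks that, with standard normal marginals, it coincides with the ordinary multivariate normal density, as it must.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def copula_joint_density(x, marginal_cdfs, marginal_pdfs, R):
    """Joint density (5): product of the marginal densities times the
    Gaussian copula density with dependence matrix R."""
    x = np.asarray(x, dtype=float)
    # Normal scores (4): q_i = Phi^{-1}[F_i(x_i)]
    q = norm.ppf([F(xi) for F, xi in zip(marginal_cdfs, x)])
    dens = np.prod([f(xi) for f, xi in zip(marginal_pdfs, x)])
    Rinv_minus_I = np.linalg.inv(R) - np.eye(len(x))
    copula_dens = np.exp(-0.5 * q @ Rinv_minus_I @ q) / np.sqrt(np.linalg.det(R))
    return dens * copula_dens

# Sanity check: with standard normal marginals, (5) must reduce to the
# multivariate normal density with correlation matrix R.
R = np.array([[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]])
x = [0.3, -1.0, 0.8]
lhs = copula_joint_density(x, [norm.cdf] * 3, [norm.pdf] * 3, R)
rhs = multivariate_normal(mean=np.zeros(3), cov=R).pdf(x)
assert np.isclose(lhs, rhs)
```

The check works because for standard normal marginals the scores q equal x, so (5) collapses to $(2\pi)^{-k/2}|R|^{-1/2}\exp\{-x^T R^{-1} x/2\}$.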
2 Imputation algorithms

2.1 Derivation of the general formula for imputation

To handle dropouts in a repeated measurements study we start with a partition of the correlation matrix. Taking the history into account, the whole correlation matrix R can be partitioned as

$$R = \begin{pmatrix} R_{k-1} & r \\ r^T & 1 \end{pmatrix}, \qquad (7)$$

where R_{k-1} is the correlation matrix of the history H and r = (r_{1k}, ..., r_{(k-1)k})^T is the vector of correlations between the history and the time point k.

Consider the Gaussian copula and the partition of the correlation matrix. Due to the definition of the conditional density (6) and the partition (7), we get the conditional density as follows:

$$f(x_k \mid H; R^*) = f_k(x_k)\, \frac{\phi_k(Q_k; R)}{\phi_1(\Phi_1^{-1}[F_k(x_k)])\, \phi_{k-1}(Q_{k-1}; R_{k-1})}, \qquad (8)$$

where Q_k and Q_{k-1} are defined as before. Substituting the normal densities into expression (8) and using the notation (4) results in

$$f(y_k \mid H; R^*) = \phi_1(y_k) \exp\left\{-\frac{1}{2}\left[\frac{(y_k - r^T R_{k-1}^{-1} (y_1, \ldots, y_{k-1})^T)^2}{1 - r^T R_{k-1}^{-1} r} - y_k^2\right]\right\} \times (1 - r^T R_{k-1}^{-1} r)^{-1/2}. \qquad (9)$$

To impute the dropouts we find the maximum likelihood estimate for y_k (the value that would most likely have been observed). Considering the equality (9), we take the likelihood function L(y_k) = f(y_k | H; R*) and the log-likelihood l(y_k) = ln L(y_k). It is easy to see that maximizing the log-likelihood with respect to y_k gives

$$\frac{\partial l}{\partial y_k} = \frac{-y_k + r^T R_{k-1}^{-1} (y_1, \ldots, y_{k-1})^T}{1 - r^T R_{k-1}^{-1} r} = 0. \qquad (10)$$

From (10) we get the following general form of the imputation algorithm:

$$\hat{y}_k = r^T R_{k-1}^{-1} Y^*_{k-1}, \qquad (11)$$

where $Y^*_{k-1} = (y_1, \ldots, y_{k-1})^T$ is the vector of observations for the subject which drops out at the time point k. Formula (11) is the starting point for the subsequent algorithms, where we consider particular correlation structures.
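On the normal-score scale, the general rule (11) amounts to a few lines of linear algebra. Here is a minimal Python sketch (the matrix, the value ρ = 0.6 and the history values are illustrative assumptions):

```python
import numpy as np

def impute_general(R, y_hist):
    """General imputation rule (11): y_k = r^T R_{k-1}^{-1} Y*_{k-1},
    with R the k x k correlation matrix partitioned as in (7) and
    y_hist the k - 1 observed values on the normal-score scale."""
    k = R.shape[0]
    R_hist = R[:k - 1, :k - 1]   # correlation matrix of the history
    r = R[:k - 1, k - 1]         # correlations between history and time k
    return r @ np.linalg.solve(R_hist, np.asarray(y_hist, dtype=float))

# Illustrative AR(1) correlation matrix with rho = 0.6 and k = 4:
rho, k = 0.6, 4
R = rho ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
print(impute_general(R, [0.2, -0.5, 1.1]))
# For AR(1) this reduces to rho * y_{k-1} = 0.66, the standardized form of (13).
```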
2.2 Correlation structure

Recall the correlation matrix of the history, R_{k-1} = (r_ij), where r_ij = cor(X_i, X_j), i, j = 1, ..., k − 1. In the case of normality r_ij is the correlation between two measurements, and usually we mean Spearman's ρ or Kendall's τ, because those are invariant under monotone transformations of the data (Pearson's product-moment correlation is not). (In general the notation R* is used for the dependence matrix to accentuate that we do not always mean the usual Pearson correlations.) The correlation structure describes how the measurements within a subject correlate, and many correlation structures are possible (see, for example, [VM01], p. 99). The following correlation structures of R_{k-1} are most common in practice: (1) the compound symmetry (CS) structure, where the correlations between all time points are equal, r_ij = ρ, i, j = 1, ..., k, i ≠ j; and (2) the first order autoregressive (AR) structure, where the dependence between observations decreases as the measurements get further apart in time, $r_{ij} = \rho^{|j-i|}$, i, j = 1, ..., k, i ≠ j.

Interesting alternatives are the Toeplitz and banded Toeplitz structures, which are slightly more general than the first order autoregressive model. A Toeplitz structure has different correlations for different lags between time points, but does not force them to be powers of a basic correlation; a Toeplitz matrix is constant along all diagonals parallel to the main diagonal. The banded Toeplitz structure can be adopted when we assume a Markovian structure: the current measurement affects the measurements of only the next few, say k*, time points, so the corners of the correlation matrix become zero. The simplest situation occurs when k* = 1, which means that only two sequential measurements are dependent, r_{i,i+1} = ρ, i = 1, ..., k − 2. This structure is known as the 1-banded Toeplitz structure with one parameter ρ.

2.3 Special cases

We introduce three imputation formulas using the above-mentioned correlation structures with one parameter ρ. Starting from the general formula (11) and the corresponding correlation structures, we get the following results (a code sketch of these rules is given after the remarks below).

Compound symmetry structure

Let Y_1, ..., Y_k be repeated measurements with standard normal marginals. The dropout process starts at the time point k, so that the history H = (Y_1, ..., Y_{k-1}) is completely observed and has a CS correlation structure with correlation coefficient ρ. Then the following formula for imputing the dropout at the time point k is valid [Kaa05]:

$$\hat{y}_k^{CS} = \frac{\rho}{1 + (k-2)\rho} \sum_{i=1}^{k-1} y_i, \qquad (12)$$

where y_1, ..., y_{k-1} are the observed values for the subject.

Autoregressive structure

Let Y_1, ..., Y_k be repeated measurements with standard normal marginals. The dropout process starts at the time point k, so that the history H = (Y_1, ..., Y_{k-1}) is completely observed and has an AR correlation structure with correlation coefficient ρ. Then the following formula for imputing the dropout at the time point k is valid [JNTRK04]:

$$\hat{y}_k^{AR} = \rho \frac{S_k}{S_{k-1}} (y_{k-1} - \bar{Y}_{k-1}) + \bar{Y}_k, \qquad (13)$$

where y_{k-1} is the last observed value for the subject, $\bar{Y}_{k-1}$ and $\bar{Y}_k$ are the mean values at the time points k − 1 and k, respectively, and S_{k-1} and S_k are the corresponding standard deviations.

1-banded Toeplitz structure

Let Y_1, ..., Y_k be repeated measurements with standard normal marginals. The dropout process starts at the time point k, so that the history H = (Y_1, ..., Y_{k-1}) is completely observed and has a 1-banded Toeplitz correlation structure with correlation coefficient ρ. Then the following formula for imputing the dropout at the time point k is valid:

$$\hat{y}_k^{1BT} = \frac{1}{|R_{k-1}|} \sum_{j=1}^{k-1} (-1)^{k-j+1} |R_{j-1}|\, \rho^{k-j} y_j, \qquad (14)$$

where y_1, ..., y_{k-1} are the observed values for the subject and |R_j|, j = 1, ..., k − 1, is the determinant of the correlation matrix of the history, with |R_0| = 1 and |R_1| = 1.

Remarks

The formulas (12) and (13) are proved in [Kaa05] and [JNTRK04]. The 1-banded Toeplitz case (14) can be proved iteratively. As an example of the use of formula (14), consider a repeated measurements study where dropout starts at k = 3, 4, 5. Using (11) and taking into account the Toeplitz correlation structure, we get the following imputation formulas:

$$k = 3: \quad \hat{y}_3 = \frac{1}{1-\rho^2}(-\rho^2 y_1 + \rho y_2) = \frac{1}{|R_2|}(-\rho^2 y_1 + \rho y_2);$$

$$k = 4: \quad \hat{y}_4 = \frac{1}{1-2\rho^2}(\rho^3 y_1 - \rho^2 y_2 + \rho(1-\rho^2) y_3) = \frac{1}{|R_3|}(\rho^3 y_1 - \rho^2 y_2 + \rho |R_2| y_3);$$

$$k = 5: \quad \hat{y}_5 = \frac{1}{\rho^4 + 1 - 3\rho^2}(-\rho^4 y_1 + \rho^3 y_2 - \rho^2(1-\rho^2) y_3 + \rho(1-2\rho^2) y_4) = \frac{1}{|R_4|}(-\rho^4 y_1 + \rho^3 y_2 - \rho^2 |R_2| y_3 + \rho |R_3| y_4).$$

These formulas are in agreement with the general form (14).
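Here is the promised sketch (Python; the numerical values are illustrative). It implements the CS rule (12) and the 1-banded Toeplitz rule (14), using the determinant recurrence $|R_j| = |R_{j-1}| - \rho^2 |R_{j-2}|$ for the tridiagonal matrix, and cross-checks (14) against the general formula (11). The AR rule (13) is omitted because it additionally needs the cross-sectional means and standard deviations at the time points k − 1 and k.

```python
import numpy as np

def impute_cs(y_hist, rho):
    """Compound symmetry rule (12)."""
    k = len(y_hist) + 1
    return rho / (1.0 + (k - 2) * rho) * np.sum(y_hist)

def impute_1bt(y_hist, rho):
    """1-banded Toeplitz rule (14), using |R_0| = |R_1| = 1 and the
    tridiagonal determinant recurrence |R_j| = |R_{j-1}| - rho^2 |R_{j-2}|."""
    k = len(y_hist) + 1
    det = [1.0, 1.0]
    for _ in range(2, k):
        det.append(det[-1] - rho ** 2 * det[-2])
    s = sum((-1) ** (k - j + 1) * det[j - 1] * rho ** (k - j) * y
            for j, y in enumerate(y_hist, start=1))
    return s / det[k - 1]

# Cross-check (14) against the general rule (11) for k = 5, rho = 0.4:
rho, y = 0.4, np.array([0.2, -0.5, 1.1, 0.7])
k = len(y) + 1
R = np.eye(k) + rho * (np.eye(k, k=1) + np.eye(k, k=-1))  # 1-banded Toeplitz
general = R[:k - 1, k - 1] @ np.linalg.solve(R[:k - 1, :k - 1], y)
assert np.isclose(impute_1bt(y, rho), general)
print(impute_cs(y, rho), impute_1bt(y, rho))
```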
3 Illustration

To illustrate the Gaussian copula approach with the 1-banded Toeplitz correlation structure, we generated data from a 6-variate normal distribution with sample size n = 10, using the correlation coefficient r = 0.5 (then ρ = 0.48) and assuming that the data represent repeated measurements. Due to the small sample size every value is important, hence we have to impute the missing values.

The dropouts occur at the last time point (in the variable X_6) and we examine three different missingness mechanisms: CRD, RD and ID. The imputation method based on formula (14) was compared with the well-known LOCF method (Last Observation Carried Forward). (When the main interest is the outcome at the endpoint of the study, for example in clinical trials, LOCF is the most frequently used approach for dealing with missing values.)

To analyze the results, the average absolute biases were calculated as the average differences between observed and imputed values: B_1 is the bias of formula (14) and B_2 the bias of the LOCF method. The standard deviations of the absolute biases (S_1, S_2) were calculated as well. Results are presented in units of the standard deviation of the given marginals. To estimate the effectiveness of the new imputation rule we compared the mean values B_1 and B_2 and the standard deviations S_1 and S_2. All results were obtained over 500 runs and are given in Table 1.

Table 1. Results of the simulation study

Missingness   B1      B2      S1      S2
CRD           0.021   0.082   0.736   0.895
RD            0.047   0.079   0.740   0.957
ID            1.631   1.275   1.056   1.546

From Table 1 it follows that in all cases imputation via formula (14) gives a more stable solution than the LOCF method (S_1 < S_2). In the CRD and RD cases the imputation formula (14) also gives a smaller bias B_1 than the bias B_2 of the LOCF method. In the ID case neither imputation method performed well, as expected.
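For readers who want to reproduce the flavour of Table 1, here is a compact Python sketch of the CRD case only. Generating from a 1-banded Toeplitz correlation matrix with band value r = 0.5, drawing one dropout per run, and plugging in the fixed ρ = 0.48 are my reading of the setup above; the paper's exact simulation design may differ, so treat this as an assumption-laden illustration rather than a replication.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r, runs = 10, 6, 0.5, 500
# 1-banded Toeplitz correlation matrix: 1 on the diagonal, r on the band
R_true = np.eye(m) + r * (np.eye(m, k=1) + np.eye(m, k=-1))

def impute_1bt(y_hist, rho):
    """Imputation rule (14), as in the sketch of Section 2.3."""
    k = len(y_hist) + 1
    det = [1.0, 1.0]                              # |R_0|, |R_1|
    for _ in range(2, k):
        det.append(det[-1] - rho ** 2 * det[-2])  # determinant recurrence
    return sum((-1) ** (k - j + 1) * det[j - 1] * rho ** (k - j) * y
               for j, y in enumerate(y_hist, start=1)) / det[k - 1]

e1, e2 = [], []
for _ in range(runs):
    X = rng.multivariate_normal(np.zeros(m), R_true, size=n)
    i = rng.integers(n)             # one completely random dropout at time m
    e1.append(X[i, -1] - impute_1bt(X[i, :-1], 0.48))   # copula rule (14)
    e2.append(X[i, -1] - X[i, -2])                      # LOCF
print(abs(np.mean(e1)), np.std(e1))   # analogues of B1 and S1
print(abs(np.mean(e2)), np.std(e2))   # analogues of B2 and S2
```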
Summary

Copulas provide a natural approach to handling dependence between repeated measurements. The multivariate normal distribution and linear correlation are the basis for modeling dependence, so the Gaussian copula is a good starting point for finding joint and conditional distributions of repeated measurements. The derived algorithms require estimation of the correlation structure, but not normality of the marginals, and may therefore find wide use in practice.

Acknowledgements

This work is supported by Estonian Science Foundation grants No 5521 and No 5203.

References

[CR99] Clemen, R.T., Reilly, T.: Correlations and copulas for decision and risk analysis. Management Science, 45(2), 208–224 (1999)
[DK94] Diggle, P.J., Kenward, M.G.: Informative dropout in longitudinal data analysis. Applied Statistics, 43(1), 49–93 (1994)
[FV98] Frees, E.W., Valdez, E.A.: Understanding relationships using copulas. North American Actuarial Journal, 2(1), 1–25 (1998)
[Kaa05] Käärik, E.: Handling dropouts by copulas. WSEAS Transactions on Biology and Biomedicine, 1(2), 93–97 (2005)
[Kaa06] Käärik, E.: Imputation algorithm using copulas. Submitted to Advances in Methodology and Statistics; contributed paper at Applied Statistics 2005, Slovenia (2006)
[LV02] Lambert, P., Vandenhende, F.: A copula-based model for multivariate non-normal longitudinal data: analysis of a dose titration safety study on a new antidepressant. Statistics in Medicine, 21, 3197–3217 (2002)
[Lit95] Little, R.J.A.: Modeling the dropout mechanism in repeated-measures studies. Journal of the American Statistical Association, 90(431), 1112–1121 (1995)
[Nel99] Nelsen, R.B.: An Introduction to Copulas. Lecture Notes in Statistics, 139. Springer, Berlin Heidelberg New York (1999)
[Rei99] Reilly, T.: Modelling correlated data using the multivariate normal copula. In: Proceedings of the Workshop on Correlated Data, Trieste, Italy (1999)
[Son00] Song, P.X.K.: Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27, 305–320 (2000)
[VL02] Vandenhende, F., Lambert, P.: On the joint analysis of longitudinal responses and early discontinuation in randomized trials. Journal of Biopharmaceutical Statistics, 12(4), 425–440 (2002)
[VM01] Verbeke, G., Molenberghs, G.: Linear Mixed Models for Longitudinal Data. Springer Series in Statistics. Springer, Berlin Heidelberg New York (2001)