Imputation by conditional distribution using Gaussian copula

Ene Käärik
Institute of Mathematical Statistics, University of Tartu, Estonia
Ene.Kaarik@ut.ee
Summary. The paper introduces the use of conditional distributions for imputing dropouts in repeated measurements. When there is no suitable multivariate family for the given univariate marginals, a common approach is through copulas. Using the Gaussian copula, three imputation algorithms corresponding to different correlation structures of the data are derived.
Key words: imputation, Gaussian copula, correlation structure
Introduction
We focus on longitudinal or repeated measurements studies with missing data.
Let X = (X_1, . . . , X_m) be an outcome variable with repeated measurements at time points t_1, . . . , t_m. We consider discrete time points and label them 1, . . . , m. Suppose that n units are sampled repeatedly over time. The aim is to measure each unit m times, but some units are measured at only k ≤ m time points.
Definition 1. Dropout or attrition is missingness in data which occurs when a subject leaves the study prematurely and does not return.
Dropouts are classified in the following way ([ED03], [Lit95]): completely random dropout (CRD), when the dropout and measurement processes are independent; random dropout (RD), when the dropout process depends on observed measurements; and informative dropout (ID), when the dropout process additionally depends on unobserved measurements.
We are interested in the missing outcome variable, i.e. the measurements that could potentially have been obtained. In many theoretical and practical tasks it is necessary to know explicit values of the missing measurements.
Definition 2. Imputation (filling in, substitution) is a strategy for completing
missing values in data.
There exists a long list of single and multiple imputation methods depending on the dropout process, such as conditional and unconditional means, hot deck, linear prediction, etc.
We are going to impute a missing value on the basis of conditional distributions. The conditional distribution gives us a good opportunity to describe analytically all characteristics of the distribution (mean, standard deviation, quantiles, etc.).
The problem is that the joint distribution may be unknown, but using the copula
approach it is possible to find approximations to the joint and conditional distributions.
1 Basic concepts of copula theory
A copula C is a function that links univariate marginals to their joint multivariate distribution. Sklar (1959) defined the copula and established some of its general properties. He showed an easy way to construct multivariate distributions from known marginals that need not belong to the same distribution family: the marginals are combined with a copula function to obtain a suitable joint distribution (see, for example, [Nel99]).
Applications of copula theory have appeared in econometrics and in financial and actuarial mathematics, and the field has been developing rapidly in recent years. Copulas have also been applied to a wide range of problems in biostatistics ([R76], [VL02]).
Consider a random vector X = (X_1, . . . , X_k) ∈ R^k with marginal distribution functions F_1, . . . , F_k and continuous joint distribution function F, so that X_i ∼ F_i and X ∼ F.
We focus on the case where each F_i is continuous and differentiable. If the copula C and F_1, . . . , F_k are differentiable, then the joint density f(x_1, . . . , x_k) corresponding to the joint distribution F(x_1, . . . , x_k) can be written in the canonical representation as a product of the marginal densities and the copula density

f(x_1, . . . , x_k) = f_1(x_1) × · · · × f_k(x_k) × c(F_1, . . . , F_k),    (1)

where f_i(x_i) is the density corresponding to F_i and the copula density c is defined as the mixed partial derivative of the copula C. Copulas which are not absolutely continuous do not have copula densities, and then this representation is not available.
Equation (1) is known as the density version of Sklar's theorem.
The next essential consequence is the conditional distribution. Taking into account the joint density (1) and the general definition of a conditional density, we get

f(x_k | x_1, . . . , x_{k-1}) = f(x_1, . . . , x_k) / f(x_1, . . . , x_{k-1}) = f_k(x_k) · c(F_1, . . . , F_k) / c(F_1, . . . , F_{k-1}),    (2)

where c(F_1, . . . , F_k) and c(F_1, . . . , F_{k-1}) are the corresponding copula densities.
The copula model has several advantages compared with the approach based on
the joint distribution directly:
1. In many cases it may be hard to specify a joint distribution within any existing
families. Using the copula approach we can first estimate the marginal distributions,
and then estimate the copula.
2. In a copula model approach we obtain a dependence function explicitly, which
enables us to provide a more specific description of dependence.
3. The family of copulas is sufficiently large and allows a wide range of distributions to be used in modeling.
1.1 Gaussian copula and its properties
One of the most important examples of copulas is the normal or Gaussian copula ([LLZ99], [FV98], [R76], [Rei99], [Son00]).
In general, the marginals of a k-variate Gaussian copula are assumed to be continuous, and they may in principle all be different.
The only condition for creating the Gaussian copula is that the dependence matrix must be positive definite. The advantage of using the normal dependence structure arises from its simplicity, its analytical manageability and the easy estimability of its only parameter, the correlation matrix.
Let R be a symmetric, positive definite matrix with diag(R) = (1, 1, . . . , 1)^T, and let Φ_v and φ_v be the standardized v-variate normal distribution and density functions, respectively. In what follows we mostly use the v = 1 and v = k variate functions.
The multivariate Gaussian copula C_k is given by the following equality:

C_k(u_1, . . . , u_k; R) = Φ_k(Φ_1^{-1}(u_1), . . . , Φ_1^{-1}(u_k)),    (3)

where u_i ∈ (0, 1), i = 1, . . . , k.
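As a computational aside (ours, not part of the paper), equality (3) can be evaluated directly with standard numerical libraries. The following Python sketch assumes SciPy's multivariate normal CDF; the function name is our own:

```python
# Hedged sketch: evaluate the Gaussian copula (3) numerically.
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_cdf(u, R):
    """C_k(u_1, ..., u_k; R) = Phi_k(Phi_1^{-1}(u_1), ..., Phi_1^{-1}(u_k))."""
    z = norm.ppf(np.asarray(u))          # componentwise Phi_1^{-1}(u_i)
    mvn = multivariate_normal(mean=np.zeros(len(z)), cov=R)
    return mvn.cdf(z)                    # k-variate standard normal CDF

# Example: bivariate Gaussian copula with correlation 0.5
R = np.array([[1.0, 0.5], [0.5, 1.0]])
print(gaussian_copula_cdf([0.3, 0.7], R))
```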
Suppose now that there is a subject i such that up to time point k - 1 the measurements X_1, . . . , X_{k-1} are observed, and at time point k the subject drops out, so the measurement X_k is missing. The vector H = (X_1, . . . , X_{k-1}) is called the history of measurements.
To get the normal joint density function we have to find the Gaussian copula density c_k as the derivative of the Gaussian copula C_k. For constructing a multivariate joint density we now use the marginals φ_1(x_1), . . . , φ_1(x_k) and define Y = (Y_1, . . . , Y_k), where

Y_i = Φ_1^{-1}[F_i(X_i)],  i = 1, . . . , k.    (4)
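As an illustration (ours, not the paper's), the transform (4) can be estimated from data by plugging in empirical marginal distribution functions, one common choice among several:

```python
# Hedged sketch: column-wise normal scores Y_i = Phi_1^{-1}[F_i(X_i)],
# with each F_i replaced by a rescaled empirical CDF.
import numpy as np
from scipy.stats import norm, rankdata

def to_normal_scores(X):
    """X: (n, k) array, n subjects by k time points; returns normal scores."""
    n = X.shape[0]
    U = rankdata(X, axis=0) / (n + 1)   # empirical F_i(X_i), strictly in (0, 1)
    return norm.ppf(U)
```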
According to (1), we then obtain the joint density as the product

f_k(x_1, . . . , x_k | R^*) = φ_1(x_1) × · · · × φ_1(x_k) · exp{-Q_k^T (R^{-1} - I) Q_k / 2} / |R|^{1/2},    (5)

where Q_k = (Φ_1^{-1}[F_1(x_1)], . . . , Φ_1^{-1}[F_k(x_k)])^T, I is the k × k identity matrix and R^* is the matrix of dependence measures.
Eventually, according to (2), the conditional density given the history can be written as

f(x_k | x_1, . . . , x_{k-1}; R^*) = f_k(x_k) · c_k(F_1(x_1), . . . , F_k(x_k); R_k) / c_{k-1}(F_1(x_1), . . . , F_{k-1}(x_{k-1}); R_{k-1}),    (6)

where c_k, c_{k-1} are Gaussian copula densities and R_k, R_{k-1} the corresponding correlation matrices of the measurements.
2 Imputation algorithms
2.1 Derivation of general formula for imputation
To handle dropouts in a repeated measurements study we start with a partition of the correlation matrix. Taking the history into account, the whole correlation matrix R can be partitioned as
R = ( R_{k-1}   r )
    ( r^T       1 ),    (7)

where R_{k-1} is the correlation matrix of the history H, and r = (r_{1k}, . . . , r_{(k-1)k})^T is the vector of correlations between the history and the time point k.
Consider the Gaussian copula with this partition of the correlation matrix. Then, by the definition of the conditional density (6) and the partition (7), we get the conditional density as follows:

f(x_k | H; R^*) = f_k(x_k) · φ_k(Q_k; R) / ( φ_1(Φ_1^{-1}[F_k(x_k)]) · φ_{k-1}(Q_{k-1}; R_{k-1}) ),    (8)

where Q_k and Q_{k-1} are defined as before.
Substituting the normal densities into expression (8) and using the notation (4) results in

f(y_k | H; R^*) = φ_1(y_k) · exp{ -(1/2) [ (y_k - r^T R_{k-1}^{-1} (y_1, . . . , y_{k-1})^T)^2 / (1 - r^T R_{k-1}^{-1} r) - y_k^2 ] } × (1 - r^T R_{k-1}^{-1} r)^{-1/2}.    (9)
To impute the dropouts we look for the maximum likelihood estimate of y_k, i.e. the value that would most likely have been observed. In view of equality (9) we take the likelihood function L(y_k) = f(y_k | H; R^*) and the log-likelihood l(y_k) = ln L(y_k). Maximizing the log-likelihood with respect to y_k, we have

∂l/∂y_k = [ -y_k + r^T R_{k-1}^{-1} (y_1, . . . , y_{k-1})^T ] / (1 - r^T R_{k-1}^{-1} r) = 0.    (10)
From (10) we get the following general form of the imputation algorithm:

ŷ_k = r^T R_{k-1}^{-1} Y^*_{k-1},    (11)

where Y^*_{k-1} = (y_1, . . . , y_{k-1})^T is the vector of observations for the subject which drops out at time point k.
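In code, (11) is a single linear solve. The NumPy sketch below is our illustration; solving the linear system rather than forming the inverse of R_{k-1} is a standard numerical choice:

```python
# Hedged sketch of the general rule (11): y_hat_k = r^T R_{k-1}^{-1} Y*_{k-1}.
import numpy as np

def impute_general(y_history, R_km1, r):
    """y_history: (y_1, ..., y_{k-1}); R_km1: history correlation matrix R_{k-1};
    r: correlations between the history and time point k."""
    return r @ np.linalg.solve(R_km1, np.asarray(y_history))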
Formula (11) is the starting point for the subsequent algorithms, in which we consider particular correlation structures.
2.2 Correlation structure
Recall the correlation matrix of the history R_{k-1} = (r_{ij}), where r_{ij} = cor(X_i, X_j), i, j = 1, . . . , k - 1. In the case of normality r_{ij} is the correlation between two measurements (in general the notation R^* is used for the dependence matrix to emphasize that we do not always mean the usual Pearson correlations). Usually we mean Spearman's ρ or Kendall's τ, because these are invariant under monotonic transformations of the data (Pearson's product-moment correlation is not).
The correlation structure describes how measurements within a subject correlate, and there are many possible correlation structures (see, for example, [VM01], p. 99).
The following correlation structures of R_{k-1} are most common in practice:
(1) compound symmetry (CS) structure, where the correlations between all time points are equal, r_{ij} = ρ, i, j = 1, . . . , k, i ≠ j, and
(2) first order autoregressive (AR) structure, where the dependence between observations decreases as the measurements get further apart in time, r_{ij} = ρ^{|j-i|}, i, j = 1, . . . , k, i ≠ j.
An interesting further approach is given by the Toeplitz and banded Toeplitz structures, which are slightly more general than the first order autoregressive model. These structures allow different correlations for different pairs of time points, but do not force them to be powers of a basic correlation. A Toeplitz matrix is constant along all diagonals parallel to the main diagonal.
The banded Toeplitz structure can be adopted when we assume a Markovian structure: the current measurement affects the measurements of only the next few, say k^*, time points, so that the entries of the correlation matrix far from the diagonal become zero. The simplest situation arises when k^* = 1, which means that only two sequential measurements are dependent, r_{ij} = ρ, j = i + 1, i = 1, . . . , k - 2. This structure is known as the 1-banded Toeplitz structure with one parameter ρ.
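For concreteness, the three one-parameter structures can be constructed as follows; this is an illustrative Python sketch and the helper names are our own:

```python
# Hedged sketches of the CS, AR and 1-banded Toeplitz correlation matrices.
import numpy as np

def cs_matrix(k, rho):
    """Compound symmetry: r_ij = rho for all i != j."""
    return (1 - rho) * np.eye(k) + rho * np.ones((k, k))

def ar_matrix(k, rho):
    """First order autoregressive: r_ij = rho^{|j-i|}."""
    idx = np.arange(k)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def toeplitz1_matrix(k, rho):
    """1-banded Toeplitz: r_ij = rho when |j-i| = 1, zero further off-diagonal."""
    R = np.eye(k)
    i = np.arange(k - 1)
    R[i, i + 1] = R[i + 1, i] = rho
    return R
```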
2.3 Special cases
We introduce three imputation formulas using the above-mentioned correlation structures with one parameter ρ. Starting from the general formula (11) with the corresponding correlation structures, we get the following results.
Compound symmetry structure
Let Y1 , . . . , Yk be repeated measurements with standard normal marginals. The
dropout process starts at time point k, so that the history H = (Y1 , . . . , Yk−1 )
has complete observations and CS correlation structure with correlation coefficient
ρ. Then the following formula for imputation of the dropout at the time point k is
valid [Kaa05]
ŷ_k^{CS} = [ρ / (1 + (k - 2)ρ)] · Σ_{i=1}^{k-1} y_i,    (12)

where y_1, . . . , y_{k-1} are the observed values for the subject.
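In code, (12) is a weighted sum of the history; a minimal sketch, assuming the values are already on the standard normal scale:

```python
# Hedged sketch of the CS rule (12).
import numpy as np

def impute_cs(y, rho):
    """y: observed history (y_1, ..., y_{k-1}) on the normal scale."""
    k = len(y) + 1
    return rho / (1 + (k - 2) * rho) * np.sum(y)
```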
Autoregressive structure
Let Y1 , . . . , Yk be repeated measurements with standard normal marginals. The
dropout process starts at time point k, so that the history H = (Y1 , . . . , Yk−1 )
has complete observations and AR correlation structure with correlation coefficient
ρ. Then the following formula for imputation of the dropout at the time point k is
valid [JNTRK04]
ŷ_k^{AR} = ρ (S_k / S_{k-1}) (y_{k-1} - Ȳ_{k-1}) + Ȳ_k,    (13)

where y_{k-1} is the last observed value for the subject, Ȳ_{k-1} and Ȳ_k are the mean values at time points k - 1 and k, respectively, and S_{k-1} and S_k are the corresponding standard deviations.
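A sketch of (13); we read Ȳ and S as sample means and standard deviations at the two time points, computed from the available cases (our reading of the notation):

```python
# Hedged sketch of the AR rule (13).
def impute_ar(y_last, rho, mean_km1, mean_k, sd_km1, sd_k):
    """y_last: the subject's last observed value y_{k-1}; the remaining
    arguments are the sample moments at time points k-1 and k."""
    return rho * (sd_k / sd_km1) * (y_last - mean_km1) + mean_k
```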
1-banded Toeplitz structure
Let Y1 , . . . , Yk be repeated measurements with standard normal marginals. The
dropout process starts at time point k, so that the history H = (Y1 , . . . , Yk−1 )
has complete observations and 1-banded Toeplitz correlation structure with correlation coefficient ρ. Then the following formula for imputation of the dropout at the
time point k is valid
ŷ_k^{1BT} = (1 / |R_{k-1}|) · Σ_{j=1}^{k-1} (-1)^{k-j+1} |R_{j-1}| ρ^{k-j} y_j,    (14)

where y_1, . . . , y_{k-1} are the observed values for the subject and |R_j|, j = 1, . . . , k - 1, is the determinant of the j-dimensional correlation matrix of the history, with |R_0| = 1 and |R_1| = 1.
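The determinants |R_j| of a 1-banded Toeplitz matrix satisfy the standard tridiagonal recurrence |R_j| = |R_{j-1}| - ρ^2 |R_{j-2}| with |R_0| = |R_1| = 1, which gives a simple implementation (our sketch):

```python
# Hedged sketch of the 1-banded Toeplitz rule (14).
import numpy as np

def impute_toeplitz1(y, rho):
    """y: observed history (y_1, ..., y_{k-1}) on the normal scale."""
    k = len(y) + 1
    d = np.ones(k)                       # d[j] holds |R_j|
    for j in range(2, k):
        d[j] = d[j - 1] - rho**2 * d[j - 2]
    total = sum((-1) ** (k - j + 1) * d[j - 1] * rho ** (k - j) * y[j - 1]
                for j in range(1, k))
    return total / d[k - 1]
```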
Remarks
Formulas (12) and (13) are proved in [Kaa05] and [JNTRK04]. The case of the 1-banded Toeplitz structure (14) can be proved iteratively. As an example of the application of formula (14), consider a repeated measurements study where dropout starts at k = 3, 4, 5. Then, using (11) and taking into account the Toeplitz correlation structure, we get the following imputation formulas:
k = 3: ŷ_3 = (1/(1 - ρ^2)) (-ρ^2 y_1 + ρ y_2) = (1/|R_2|) (-ρ^2 y_1 + ρ y_2);
k = 4: ŷ_4 = (1/(1 - 2ρ^2)) (ρ^3 y_1 - ρ^2 y_2 + ρ(1 - ρ^2) y_3) = (1/|R_3|) (ρ^3 y_1 - ρ^2 y_2 + ρ|R_2| y_3);
k = 5: ŷ_5 = (1/(1 - 3ρ^2 + ρ^4)) (-ρ^4 y_1 + ρ^3 y_2 - ρ^2(1 - ρ^2) y_3 + ρ(1 - 2ρ^2) y_4) = (1/|R_4|) (-ρ^4 y_1 + ρ^3 y_2 - ρ^2|R_2| y_3 + ρ|R_3| y_4).
These formulas are in agreement with the general form (14).
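These special cases can also be checked numerically against the general rule (11); the snippet below is our own consistency check and reuses the helpers toeplitz1_matrix and impute_toeplitz1 sketched above:

```python
# Hedged consistency check: (14) should agree with (11) under the
# 1-banded Toeplitz structure, where r = (0, ..., 0, rho)^T.
import numpy as np

rho = 0.4
rng = np.random.default_rng(0)
for k in (3, 4, 5):
    y = rng.normal(size=k - 1)
    R_km1 = toeplitz1_matrix(k - 1, rho)     # helper sketched in Section 2.2
    r = np.zeros(k - 1)
    r[-1] = rho                              # only y_{k-1} correlates with y_k
    assert np.isclose(r @ np.linalg.solve(R_km1, y), impute_toeplitz1(y, rho))
```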
3 Illustration
To illustrate the Gaussian copula approach with the 1-banded Toeplitz correlation structure, we generated data from a 6-variate normal distribution with sample size n = 10, using the correlation coefficient r = 0.5 (then ρ = 0.48) and assuming that the data represent repeated measurements. Due to the small sample size every value is important, hence we have to impute the missing values.
The dropouts occur at the last time point (in the variable X_6) and we examine three different missingness mechanisms: CRD, RD and ID.
The imputation method based on formula (14) was compared with the well-known LOCF method (Last Observation Carried Forward); when the main interest is the outcome at the endpoint of the study (for example in clinical trials), LOCF is the most frequently used approach for dealing with missing values.
To analyze the results, the average absolute biases were calculated as the average difference between observed and imputed values: B_1 is the bias following (14) and B_2 the bias using the LOCF method. The standard deviations of the absolute biases (S_1, S_2) were calculated as well. Results are presented in units of the standard deviation of the given marginals. To assess the effectiveness of the new imputation rule we compared the mean values B_1 and B_2 and the standard deviations S_1 and S_2.
All results were obtained over 500 runs and are given in Table 1.
Table 1. Results of simulation study

Missingness   B_1     B_2     S_1     S_2
CRD           0.021   0.082   0.736   0.895
RD            0.047   0.079   0.74    0.957
ID            1.631   1.275   1.056   1.546
From Table 1 it follows that in all cases imputation via formula (14) gives a more stable solution than the LOCF method (S_1 < S_2). In the CRD and RD cases the imputation formula (14) also gives a smaller bias (B_1) than the LOCF method (B_2). Under the ID model, as expected, neither imputation method performed well.
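For reference, a hedged sketch of the CRD arm of such a simulation is given below; details such as generating directly from the 1-banded Toeplitz structure with ρ = 0.48 and hiding one value per sample are our assumptions, not the paper's exact protocol. It reuses the helpers toeplitz1_matrix and impute_toeplitz1 sketched earlier:

```python
# Hedged simulation sketch: compare rule (14) with LOCF under CRD.
import numpy as np

rng = np.random.default_rng(1)
rho, n, m, runs = 0.48, 10, 6, 500
R = toeplitz1_matrix(m, rho)                 # helper sketched in Section 2.2
b1, b2 = [], []
for _ in range(runs):
    X = rng.multivariate_normal(np.zeros(m), R, size=n)
    i = rng.integers(n)                      # one completely random dropout
    truth = X[i, -1]
    b1.append(abs(impute_toeplitz1(X[i, :-1], rho) - truth))
    b2.append(abs(X[i, -2] - truth))         # LOCF carries y_5 forward
print(np.mean(b1), np.mean(b2))              # average absolute biases
```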
Summary
Copulas provide a natural approach to handling dependence between repeated measurements. The multivariate normal distribution and linear correlation are the basis for modeling the dependency, so the Gaussian copula is a good starting point for finding joint and conditional distributions of repeated measurements. The derived algorithms require estimation of the correlation structure but not normality of the marginals, and therefore they may find wide application in practice.
Acknowledgements
This work is supported by Estonian Science Foundation grants No 5521 and No
5203.
References

[CR99] Clemen, R.T., Reilly, T.: Correlations and copulas for decision and risk analysis. Management Science, 45 (2), 208–224 (1999)
[DK94] Diggle, P.J., Kenward, M.G.: Informative dropout in longitudinal data analysis. Applied Statistics, 43 (1), 49–93 (1994)
[FV98] Frees, E.W., Valdez, E.A.: Understanding relationships using copulas. North American Actuarial Journal, 2 (1), 1–25 (1998)
[Kaa05] Käärik, E.: Handling dropouts by copulas. In: Mastorakis, N. (ed) WSEAS Transactions on Biology and Biomedicine, 1 (2), 93–97 (2005)
[Kaa06] Käärik, E.: Imputation algorithm using copulas. Submitted to: Advances in Methodology and Statistics. Contributed paper in: Applied Statistics 2005, 12 pages, Slovenia (2006)
[LV02] Lambert, P., Vandenhende, F.: A copula-based model for multivariate non-normal longitudinal data: analysis of a dose titration safety study on a new antidepressant. Statistics in Medicine, 21, 3197–3217 (2002)
[Lit95] Little, R.J.A.: Modeling the dropout mechanism in repeated-measures studies. JASA, 90 (431), 1112–1121 (1995)
[Nel99] Nelsen, R.B.: An Introduction to Copulas. Lecture Notes in Statistics, 139, Springer Verlag, Berlin Heidelberg New York (1999)
[Rei99] Reilly, T.: Modelling correlated data using the multivariate normal copula. In: Proceedings of the Workshop on Correlated Data, Trieste, Italy (1999)
[Son00] Song, P.X.K.: Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27, 305–320 (2000)
[VL02] Vandenhende, F., Lambert, P.: On the joint analysis of longitudinal responses and early discontinuation in randomized trials. Journal of Biopharmaceutical Statistics, 12 (4), 425–440 (2002)
[VM01] Verbeke, G., Molenberghs, G.: Linear Mixed Models for Longitudinal Data. Springer Series in Statistics, Springer Verlag, Berlin Heidelberg New York (2001)