Multiple Imputation of Missing Blood Pressure
Covariates in Survival Analysis
Kyuson Lim
Department of Mathematics & Statistics,
McMaster University, E-mail: limk15@mcmaster.ca
December 8, 2021
STATS 756
Chapter 1
Acknowledgement
The purpose of this report is solely the interpretation and implementation of ‘Multiple imputation of missing blood pressure covariates in survival analysis’, written by Stef van Buuren and co-authors in 1999.
The original dataset used for the analysis, Leiden 85+, is documented in the R package ‘mice’ but is currently not available for use. Further specification of the dataset is found in the textbook ‘Flexible Imputation of Missing Data’, section 9.1.2. The R code and output are given in Chapter 3 (p. 97-101) and Chapter 9 (p. 259-283) of the textbook, which contain all results stated and interpreted here for the Leiden 85+ data.
This report also restates the specification of the dataset, together with the graphical assessments and the code used with the ‘mice’ package. The examples and code are taken from the textbook ‘Flexible Imputation of Missing Data’, written by the same author, and are used to visualize the multiple imputation method and to guide inference. The interpretation of the original paper and of the multiple imputation method rephrases the definitions used in the textbook and the paper. In the textbook, the first section of chapter 2 introduces multiple imputation, chapter 3 covers univariate imputation and chapter 4 covers multivariate imputation. Together with chapter 6 on imputation in practice, these chapters are the basis for the imputation methods and model-based algorithms explained throughout the report.
I am grateful for the textbooks and guidelines used in writing this report for the course STATS 756, on the analysis and methods of multiple imputation. I would also like to thank Professor Dr. Balakishnan for supporting me in learning the ideas of multiple imputation and in writing the report.
Chapter 2
Introduction
2.1 Background of the research
The main interest of the paper is how the available measurements influence the relation between mortality and blood pressure (BP) among the 1236 citizens of Leiden who were 85 years or older in 1986 and were examined between 1987 and 1989. The concern is whether a paradoxical inverse relation exists between BP and mortality in persons over 85 years of age. Normally, people with a lower BP live longer, but the oldest old with lower BP live a shorter time.
The data contain approximately 12.5% incomplete (missing) cases, which produce deflated mortality estimates for the lower BP groups and so distort the inference about the influence of BP on survival. The suspicion is that individuals with lower BP and higher mortality risk had fewer BP measurements. The variables considered in the study include BP, age (85-89, 90-94, 95+), type of residence, activities of daily living (independent, dependent), history of hypertension, use of diuretics, and blood sample.
2.1.1 Guidelines: missing data
For problems of missing data, the following questions should be answered when using multiple imputation.
1. Amount of missing data and reasons for missingness.
2. Consequences: important differences between individuals with complete and incomplete data, such as groups that differ in mean or spread on the key variables.
3. What information to use for choosing between non-response mechanisms. This includes the methods used and the assumptions made (e.g., missing at random).
4. Software and number of imputed datasets. This is accompanied by a sensitivity analysis to assess whether the missing at random assumption is plausible.
5. Imputation model: which variables and design features were included in the imputation model.
6. How to choose the set of predictors: derived variables and diagnostic plots.
7. How to specify different models for non-response, and how the repeated estimates are combined (pooling).
8. Complete-case analysis: whether multiple imputation and complete-case analysis lead to similar conclusions.
The first goal of the study was to determine whether the relation between BP and mortality in the very old is due to frailty. A second goal was to know whether high BP was still a risk factor for mortality after the effects of poor health had been taken into account.
The study compared two Cox regression models:
• The relation between mortality and BP adjusted for age, sex and type of residence.
• The relation between mortality and BP adjusted for age, sex, type of residence and health.
Health was measured by 28 different variables, including mental state, handicaps, being dependent in activities of daily living, history of cancer and others. Because health is included as a set of covariates in model 2, we expect model 2 to better explain the relation between mortality and BP.
2.2 Study of data and problems
In the data there is an observational problem: the group without a BP measurement has a much higher mortality rate. In summary, there are 4 key problems with the missing data:
• BP was not measured for 121 individuals, and a group without hypertension and with high mortality is missing (out of 1236 people, 218 died before the visit and 59 did not participate; 956 individuals were visited, of whom 835 had BP measured).
• BP was measured more often if it was suspected to be too high (hypertension).
• BP was measured less frequently for very old people and for subjects who were too ill to be measured.
• The rate of missing BP rose (5-40%) early in the data collection period and then dropped to a constant level (10-15%).
More specifically, the proportions of missing data are summarized in Table 1.
                      History of previous hypertension
Survived > 3 years    No                 Yes               Total
Yes                   8.7% (34/390)      8.1% (10/124)     8.6% (44/514)
No                    19.2% (69/360)     9.8% (8/82)       17.4% (77/442)
Total                 13.7% (103/750)    8.7% (18/206)     12.7% (121/956)
Table 1. Proportion of no BP measured
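The percentages in Table 1 can be recomputed from the counts in parentheses. The following is a minimal base R sketch; the object names (missing_bp, visited, pct_missing) are chosen here for illustration and do not come from the paper.

```r
# Counts copied from Table 1.
# Rows: survived > 3 years (Yes/No); columns: previous hypertension (No/Yes).
missing_bp <- matrix(c(34, 10,
                       69,  8),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(survived = c("Yes", "No"),
                                     hypertension = c("No", "Yes")))
visited <- matrix(c(390, 124,
                    360,  82),
                  nrow = 2, byrow = TRUE,
                  dimnames = dimnames(missing_bp))

# Add row and column totals ("Sum"), then convert to percentages
pct_missing <- 100 * addmargins(missing_bp) / addmargins(visited)
print(round(pct_missing, 1))
```

The "Sum" margins correspond to the Total row and column of the table, so the bottom-right cell is 121/956.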
To diagnose the missing data problem, Figure 2.1 shows two distinct Kaplan-Meier curves: the survival probability since intake for the group with observed BP measures and for the group with missing BP measures. These curves have been obtained as baseline hazards after fitting a proportional hazards model adjusted for age, sex and type of residence, and stratified by the missingness indicator.
The plot clearly shows that individuals without BP measures have higher mortality rates. Also, a relatively large group of individuals without hypertension and with high mortality risk is missing. The goal of the sensitivity analysis is to explore the result of the analysis under alternative scenarios for the missing data.
Figure 2.1: Kaplan-Meier curves of the Leiden 85+ Cohort, stratified according to
missingness
2.2.1 Factors that affect the measurement of blood pressure
Variables related to non-response include age, type of residence, activities of daily living, and use of diuretics (year of interview and blood sample are not categorical and are excluded).
Not all variables have different distributions in the response group (n = 835) compared with the non-response group (n = 121). Table 2 indicates that BP was measured less frequently for very old people and for those with health problems. The graph gives an overview of the factors and compares their significance.
Figure 2.2: For 835 individuals, the chi-square test of independence
Again, BP was measured less frequently for very old (95+) people and for those who
have a health problem (hypertension).
2.3 Response mechanism for BP
Imputations $Y_{mis}^{*}$ are drawn independently from the posterior predictive distribution, where $\theta \in \Theta$ represents the parameter of the statistical model and $Y = (Y_{mis}, Y_{obs})$.
(Posterior predictive distribution) $p(Y_{mis} \mid Y_{obs}) = \int_{\Theta} p(Y_{mis} \mid Y_{obs}, \theta)\, p(\theta \mid Y_{obs})\, d\theta$
Multiple imputation is distinctive in that it provides a mechanism for dealing with the inherent uncertainty of the imputations in both high-confidence and low-confidence situations.
The MICE (Multivariate Imputation by Chained Equations) algorithm is an MCMC method built from univariate imputation models.
• It starts with a random draw from the observed data to fill in the incomplete data.
• One iteration consists of one cycle through all $Y_j$.
• It then samples from the conditional distributions in order to obtain samples from the joint distribution.
• It generates multiple imputations in parallel, $m$ times.
Before setting up the assumptions on the missing response mechanism, the model problems and outcome variables are restated. As deleting the cases with missing data causes overestimation of the true survival of the cohort, there are 3 problems in the model:
• Bias: if the causes of missing data depend jointly on survival and the unknown BP, then the relative mortality risks of the different BP levels are biased.
• Verification: whether the conditional distribution of mortality given age and sex is the same with and without BP measured cannot be demonstrated.
• Confounding factors: analysis using only complete cases underestimates the mortality of the lower and normal BP groups.
The outcome variables are systolic BP and diastolic BP, with response indicator $R_{ij}$ defined as follows.
$Y_1$ = systolic BP                  $p(Y_3, Y_4 \mid Y_1, Z)$    $p(R = 1 \mid Y_{obs}, Y_{mis}, Z)$
$Y_2$ = diastolic BP                 $p(Y_3, Y_4 \mid Y_2, Z)$    $p(R = 1 \mid Y_{obs}, Z)$ (MAR)
$Y_3$ = survival/censoring time      $Y = (Y_{obs}, Y_{mis})$     $Y_{obs}$, $Y_{mis}$, $Z$ define different
$Y_4$ = censoring indicator                                        types of response mechanism
$R_{ij} = 1$ if $Y_{ij}$ is observed.
The first column shows the outcome variables, and the second and third columns show the survival model and the response mechanism, defined through the indicator variable under the different assumptions made for the model, where $Z$ denotes the predictor variables.
For the missing data mechanism, 3 assumptions are stated below, each with its definition and the reason for its use in the analysis. While MCAR is unrealistic as the generating mechanism, MAR and NMAR (MNAR) are used in the pooling phase for comparing values.
1. MCAR (missing completely at random): the probability of being missing is the same for all cases, so the causes of the missing data are unrelated to the data. We may consequently ignore many of the complexities that arise because data are missing, apart from the obvious loss of information (i.e., some of the data are missing simply because of bad luck). MAR is a much broader class than MCAR.
$p(R = 0 \mid Y_{obs}, Y_{mis}, Z) = p(R = 0 \mid Z)$
• The assumption is unrealistic. The survival curves for BP measured and no BP measured in the sensitivity analysis (Figure 2.1) show a systematic difference in mortality. Hence, the assumption cannot be used in the imputation process.
2. MAR (missing at random): the probability of being missing is the same only within groups defined by the observed data. An example of MAR is when we take a sample from a population, where the probability to be included depends on some known property.
$p(R = 0 \mid Y_{obs}, Y_{mis}, Z) = p(R = 0 \mid Y_{obs}, Z)$
• MAR on $Y_{obs}$: the probability of BP measurement depends on survival. Hence, it can be incorporated in the correction for non-response.
• MAR on $Z$: the probability of non-response is related to the covariates ($\chi^2$ independence test). This also relates to the correction for non-response.
3. MNAR (missing not at random) [2]: the probability of being missing also depends on unobserved information, including $Y_{mis}$ itself. MNAR means that the probability of being missing varies for reasons that are unknown to us.
$p(R = 0 \mid Y_{obs}, Y_{mis}, Z)$
• Because the probability of non-response may be related to the (unobserved) BP, the distribution of $Y_{mis}$ is investigated by a sensitivity analysis: different imputed values are derived with a $\delta$-adjustment and compared in the pooling phase.
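The three mechanisms can be illustrated with a small simulation. This is a minimal base R sketch on simulated data, not the Leiden 85+ data: the covariate z, the simulated sbp, and all coefficients are hypothetical values chosen only to make the mechanisms visible.

```r
set.seed(756)
n   <- 5000
z   <- rnorm(n)                           # fully observed covariate (hypothetical)
sbp <- 150 + 10 * z + rnorm(n, sd = 15)   # hypothetical systolic BP

# Probability that BP is missing under each mechanism
p_mcar <- rep(0.13, n)                    # MCAR: the same for all cases
p_mar  <- plogis(-2 - 1.0 * z)            # MAR: depends on the observed z only
p_mnar <- plogis(-2 - 0.05 * (sbp - 150)) # MNAR: depends on the unobserved BP itself

# Mean of BP among the cases where BP is observed (R = 1)
observed_mean <- function(p_miss) {
  r <- rbinom(n, 1, 1 - p_miss)
  mean(sbp[r == 1])
}

result <- c(true = mean(sbp),
            mcar = observed_mean(p_mcar),
            mar  = observed_mean(p_mar),
            mnar = observed_mean(p_mnar))
print(round(result, 1))
```

Under MCAR the complete-case mean stays close to the full-data mean, while under MAR and MNAR it shifts upward because low values are more likely to be missing. Under MAR the shift can be corrected by conditioning on z; under MNAR it cannot be corrected from the observed data alone, which is why the sensitivity analysis is needed.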
2.3.1 Influx and Outflux
Influx and outflux are summaries of the missing data pattern intended to aid in the construction of imputation models. The influx of a variable quantifies how well its missing data connect to the observed data on other variables. Variables with higher influx depend more strongly on the imputation model.
Figure 2.3: Global influx-outflux pattern of the Leiden 85+ Cohort data
[2] In the literature one can also find the term NMAR (not missing at random) for the same concept.
The outflux of a variable quantifies how well its observed data connect to the missing data on other variables. A variable with higher outflux is better connected to the missing data, and thus potentially more useful for imputing other variables.
For the BP data, the variables are plotted with influx on the x-axis and outflux on the y-axis (Figure 2.3). Variables that are located in the lower regions (especially near the lower-left corner) and that are uninteresting for later analysis are better removed from the data prior to imputation.
First of all, all points are relatively close to the diagonal, which indicates that influx and outflux are balanced (Figure 2.3). The group in the upper-left corner has almost complete information, so the number of missing data problems for this group is relatively small. The intermediate (second) group has an outflux between 0.5 and 0.8, which is small. The third group, which contains important variables, has an outflux of 0.5 and lower, so its predictive power is limited. This group also has a high influx, and is thus highly dependent on the imputation model. Hence, the variables (hypert1, aovar) whose missing values might cause problems in the imputations are located in the lower-right corner.
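As an illustration of the two coefficients, the sketch below computes influx and outflux in base R for a toy dataset. The data frame toy is hypothetical, and the formulas are the pairwise-count definitions from ‘Flexible Imputation of Missing Data’: influx is the number of variable pairs with the variable missing and the other observed, divided by the total number of observed cells; outflux is the number of pairs with the variable observed and the other missing, divided by the total number of missing cells. In the ‘mice’ package the function flux() reports these quantities.

```r
# Hypothetical toy data: NA marks a missing cell
toy <- data.frame(
  age = c(85, 90, 95, 88, 92, 86),
  sbp = c(150, NA, 140, NA, 160, 155),
  alb = c(NA, NA, 40, NA, 38, NA)
)

r <- 1 - is.na(toy)                # response indicator matrix: 1 = observed, 0 = missing

# mr[j, k]: number of rows where variable j is missing and variable k is observed
mr <- t(1 - r) %*% r
# rm_pairs[j, k]: number of rows where variable j is observed and variable k is missing
rm_pairs <- t(r) %*% (1 - r)

influx  <- rowSums(mr) / sum(r)            # denominator: all observed cells
outflux <- rowSums(rm_pairs) / sum(1 - r)  # denominator: all missing cells

print(round(cbind(influx, outflux), 3))
```

In this toy example the complete variable age has influx 0 and outflux 1 (upper-left corner), whereas alb, which is mostly missing, has a high influx and an outflux of 0 (lower-right corner).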
Chapter 3
Methodology
The paper uses a model-based multiple imputation method with a multivariate approach. Although univariate and multivariate imputation can give drastically different output, the univariate approach is introduced first. Multiple imputation proceeds in 4 main steps.
1. Specify the posterior predictive density $p(Y_{mis} \mid X, R)$ ($X$ is the set of predictors), given the non-response mechanism $p(R \mid Y, Z)$ and $p(Y, Z)$.
2. Draw imputations from $p(Y_{mis} \mid X, R)$ to produce $m$ completed datasets.
3. Fit the Cox regression model on each of the $m$ completed datasets.
4. Pool the $m$ analysis results and variance estimates.
The first step involves two important concepts: variable selection and inspection of the correlation values.
Conceptually, the idea of imputation is as follows. Let $Q$ be the quantity of scientific interest that we could calculate if we observed the population. The goal is to obtain an unbiased estimate $\hat{Q}$, which satisfies $E(\hat{Q} \mid Y) = Q$, and which is valid if $E(U \mid Y) \geq V(\hat{Q} \mid Y)$, where $U$ is the estimated covariance matrix of $\hat{Q}$.
(Posterior distribution) $p(Q \mid Y_{obs}) = \int p(Q \mid Y_{obs}, Y_{mis})\, p(Y_{mis} \mid Y_{obs})\, dY_{mis}$
3.1 Selection of variables
First, the variables are defined. The posterior predictive density is $p(Y_{mis} \mid X, R)$ ($X$ is the set of predictors), given the non-response mechanism $p(R \mid Y, Z)$ and $p(Y, Z)$. As the multiple imputation is model-based, $p(Y_{mis} \mid X)$ is defined by linear regression, where the missing BP is predicted from the variables in $X$. A suitable subset of the data contains no more than 15-25 variables, $X = [Y_{obs}, Z, U, V]$.
1. $Y_{obs}$, $Z$: include all variables that appear in the complete-data model, especially if it contains strong predictive relations.
2. $U$: variables that differ between the response and non-response groups, inspected by correlation.
3. $V$: variables that explain a considerable amount of variance, to reduce uncertainty.
4. $U$ and $V$: remove those with many missing values (%) within the incomplete cases.
Variable                    r(Systolic BP)   r(Diastolic BP)   r(R1)    Usable cases (%)
(r(R1): correlation with the response indicator; usable cases: % of observed data)
Y : Incomplete and outcome variables
Systolic BP                  1.0              0.59
Diastolic BP                 0.59             1.0
Survival date                0.18             0.14              0.12     100
Censoring flag               0.13             0.11              0.08     100
Z : Covariates (Cox regression: relation between mortality and BP adjusted for age & sex)
Sex                         -0.1             -0.1              -0.04     100
Age                         -0.11            -0.11             -0.14     100
U : Variables related to non-response
Type of residence           -0.21            -0.15             -0.08     100
Activity of daily living    -0.24            -0.11             -0.14     98
Previous hypertension        0.16             0.14              0.06     90
Uses diuretics              -0.04            -0.03              0.06     85
Year of interview            0.18             0.09              0.18     100
Year of blood sample         0.17             0.11              0.16     89
V : Prediction variables
Serum albumin                0.24             0.18              0.02     67
Cognition                    0.24             0.18              0.07     78
Current hypertension         0.23             0.17              0.01     83
Previous hypertension        0.22             0.19              0.04     83
Survival year                0.21             0.15              0.14     100
ln (survival date)           0.20             0.15              0.09     100
Score GHQ                   -0.19            -0.18             -0.01     83
Serum cholesterol            0.17             0.17              0.12     65
Fraction erythrocytes        0.17             0.20              0.08     70
Treated by specialist       -0.16            -0.11              0.02     100
Hemoglobin                   0.15             0.18              0.08     70
Hematocrit                   0.11             0.18              0.10     70
Table 3. Correlation of variables for imputation
First, the variables that appear in the complete-data model are included: blood pressure, survival, sex, and age. Then the variables related to non-response are included: type of residence, activity of daily living, previous hypertension, use of diuretics, year of interview, and year of blood sample. Most importantly, further variables are selected by an absolute correlation > 0.15 with SBP/DBP, based on Table 3. Finally, variables with usable cases < 50% are removed.
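The two selection rules can be written down directly. The sketch below applies them in base R to a few rows copied from Table 3; the last row ("Hypothetical lab value") is invented here only to show a variable that fails the usable-cases rule, and the names tab3, min_cor and min_usable are illustrative. The ‘mice’ function quickpred() implements a comparable selection through its mincor and minpuc arguments.

```r
# A few rows of Table 3; the last row is hypothetical and not part of the study
tab3 <- data.frame(
  variable = c("Serum albumin", "Cognition", "Score GHQ", "Serum cholesterol",
               "Treated by specialist", "Hematocrit", "Uses diuretics",
               "Hypothetical lab value"),
  r_sbp    = c(0.24, 0.24, -0.19, 0.17, -0.16, 0.11, -0.04, 0.30),
  r_dbp    = c(0.18, 0.18, -0.18, 0.17, -0.11, 0.18, -0.03, 0.25),
  usable   = c(67, 78, 83, 65, 100, 70, 85, 40)
)

min_cor    <- 0.15   # threshold on the absolute correlation with SBP or DBP
min_usable <- 50     # threshold on the percentage of usable cases

# Rule 1: absolute correlation with SBP or DBP above the threshold
max_abs_r <- pmax(abs(tab3$r_sbp), abs(tab3$r_dbp))
# Rule 2: enough usable cases
keep <- max_abs_r > min_cor & tab3$usable >= min_usable

selected <- as.character(tab3$variable[keep])
print(selected)
```

Note that "Uses diuretics" fails the correlation rule here but is still part of the imputation model in the paper, because it enters through the set $U$ of variables related to non-response.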
Although a total of 23 predictors are selected by the correlation values, the composition of the survival model is also considered, and log(time) is selected in a second step. The correlations between the survival model components are shown below.
Figure 3.1: Correlations between the cumulative death hazard $H_0(T)$, survival time $T$, $\log(T)$, SBP and DBP
From Figure 3.1, the high correlation may be caused by the fact that nearly everyone in this cohort has died, so the percentage of censoring is low. We can observe that the correlation between $\log(T)$ and blood pressure is higher than for $H_0(T)$ or $T$, so it makes sense to add $\log(T)$ as an additional predictor.
3.2 Multiple imputation: algorithm
Based on the Bayesian approach, a value $\theta^{*}$ is drawn for the parameter and then used to draw $Y_{mis}^{*}$ under the specified model.
$p(Y_{mis} \mid X, R) = \int p(Y_{mis} \mid X, R, \theta)\, p(\theta \mid X, R)\, d\theta, \quad \theta = (\beta, \log \sigma)$
1. Draw a value $\theta^{*}$ from $p(\theta \mid X, R)$, which gives $p(Y_{mis} \mid X, R, \theta = \theta^{*})$.
2. Draw a value $Y_{mis}^{*}$ from its conditional posterior distribution given $\theta^{*}$.
3. Multiple imputation: repeat $m$ times to obtain $m$ draws from the posterior distribution of $Y_{mis}$.
Among various methods, regression imputation incorporates knowledge of other variables with the idea of producing smarter imputations. The first step involves building a model from the observed data. Predictions for the incomplete cases are then calculated under the fitted model, and serve as replacements for the missing data. The regression model-based imputation in the univariate approach is summarized as follows.
1. Obtain $\hat{\beta}$ and $\hat{Y}_{obs}$ from linear regression.
• Take $W = (X_{obs}' X_{obs})^{-1}$, so that $\hat{\beta} = W X_{obs}' Y_{obs}$ and $\hat{Y}_{obs} = X_{obs} \hat{\beta}$.
2. Take a random draw from the posterior distribution of $\beta$.
• Calculate $\hat{\beta}^{*} = \hat{\beta} + \sigma^{*} W^{1/2} D$.
– Draw an $r$-dimensional normal random vector $D \sim N(0, I_r)$, where $r = 23$ is the number of predictors.
– Use $\sigma^{*2} = (Y_{obs} - \hat{Y}_{obs})' (Y_{obs} - \hat{Y}_{obs}) / g$, where the random variable $g$ is drawn from a $\chi^2_{n_{obs} - r}$ distribution.
– $W^{1/2}$ is the triangular square root of $W$ obtained by Cholesky decomposition.
• Similarity between cases is measured by the distance between the predicted means of BP.
– Take the predicted values $\hat{Y}_{mis} = X_{mis} \hat{\beta}^{*}$.
– For each missing value $i = 1, \ldots, n_{mis}$, find the respondent whose $\hat{Y}_{obs}$ is closest to $\hat{Y}_{mis,i}$, and take that respondent's $Y_{obs}$ as the imputed value.
3. Repeat $m = 3$ to $5$ times to create $Y_{mis}^{(1)}, Y_{mis}^{(2)}, \ldots, Y_{mis}^{(m)}$.
• This incorporates the uncertainty due to deviations from the model and also reflects the variation due to finite sampling.
The highlighted part consists of generating samples from a multivariate normal distribution, where the number of variables equals the rank of the identity matrix. In the univariate approach, the goal is to keep the difference between the imputed values and the model-based predicted values close to 0, where values are imputed conditionally on the previously imputed values.
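The steps above can be sketched in base R for a single incomplete variable. This is an illustrative sketch on simulated data, not the code of the paper or of the ‘mice’ package: the data (x, y), the function name impute_pmm_once, and the use of a single predictor instead of 23 are assumptions made here to keep the example short.

```r
set.seed(1)

# Hypothetical data: y has 40 missing values, x is complete
n <- 200
x <- rnorm(n)
y <- 140 + 8 * x + rnorm(n, sd = 10)
y[sample(n, 40)] <- NA

impute_pmm_once <- function(y, x) {
  obs   <- !is.na(y)
  X_obs <- cbind(1, x[obs])      # design matrix with intercept, observed cases
  X_mis <- cbind(1, x[!obs])     # design matrix for the missing cases
  y_obs <- y[obs]
  r     <- ncol(X_obs)           # number of regression parameters

  # Step 1: least squares estimate and fitted values for the observed cases
  W        <- solve(crossprod(X_obs))
  beta_hat <- W %*% crossprod(X_obs, y_obs)
  yhat_obs <- X_obs %*% beta_hat

  # Step 2: draw sigma* and beta* (chol() returns the upper triangular factor,
  # so its transpose is the lower triangular square root of W)
  g          <- rchisq(1, df = length(y_obs) - r)
  sigma_star <- sqrt(sum((y_obs - yhat_obs)^2) / g)
  D          <- rnorm(r)
  beta_star  <- beta_hat + sigma_star * t(chol(W)) %*% D

  # Matching: for each missing case take the observed value of the respondent
  # whose fitted value is closest to the predicted value
  yhat_mis <- X_mis %*% beta_star
  donor <- vapply(yhat_mis, function(v) which.min(abs(yhat_obs - v)), integer(1))

  y_imp <- y
  y_imp[!obs] <- y_obs[donor]
  y_imp
}

# Step 3: repeat m times; each column is one completed version of y
m <- 5
completed <- replicate(m, impute_pmm_once(y, x))
print(dim(completed))
```

Because the imputed values are copied from respondents, every imputation is a value that was actually observed, which keeps the imputations within the plausible range of BP.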
For graph visualization and the imputation approach in R, two examples are shown from the textbook ‘Flexible Imputation of Missing Data’. Suppose that we predict Ozone by linear regression from Solar.R.
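library(mice)
# Regress Ozone on Solar.R using the complete cases
fit <- lm(Ozone ~ Solar.R, data = airquality)
# Predicted values for the incomplete cases (ic() returns the incomplete cases)
pred <- predict(fit, newdata = ic(airquality))
# Regression imputation with mice: method "norm.predict", one imputed dataset
data <- airquality[, c("Ozone", "Solar.R")]
imp <- mice(data, method = "norm.predict", seed = 1,
            m = 1, print = FALSE)
# Plot observed and imputed values
xyplot(imp, Ozone ~ Solar.R)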
Figure 3.2: Blue indicates the observed data, red indicates the imputed values.
The imputed values correspond to the most likely values under the model. However,
the ensemble of imputed values vary less than the observed values. It may be that each
of the individual points is the best under the model, but it is very unlikely that the real
(but unobserved) values of Ozone would have had this distribution. Imputing predicted
values also has an effect on the correlation. The red points have a correlation of 1 since
they are located on a line. If the red and blue dots are combined, then the correlation
increases from 0.35 to 0.39. Note that this upward bias grows with the percent missing
ozone levels (here 24%). Some problems of univariate imputation are summarized after the second example.
The second example shows the R code and the imputed values used to fit a linear regression model on the imputed data.
In step 0, the missing data are identified.
> head(nhanes)
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113
6   3   NA  NA 184
Then, in step 1, the data are imputed m = 10 times, and the linear model with its predictor is fitted on each imputed dataset by the R function with(). The imputed values for each of the 10 imputations are shown below.
> imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
> fit <- with(imp, lm(bmi ~ age))
> head(imp$imp)
$age
 [1] 1  2  3  4  5  6  7  8  9  10
<0 rows> (or 0-length row.names)

$bmi
     1    2    3    4    5    6    7    8    9   10
1 27.2 21.7 25.5 22.5 28.7 30.1 27.4 22.5 22.5 27.2
3 22.0 30.1 20.4 33.2 27.2 35.3 29.6 22.0 27.2 28.7
4 21.7 20.4 27.2 25.5 21.7 25.5 22.7 22.5 24.9 22.5

$hyp
  1 2 3 4 5 6 7 8 9 10
1 1 1 1 1 1 1 1 1 1  1
4 1 2 2 2 1 1 2 1 2  1

$chl
    1   2   3   4   5   6   7   8   9  10
1 187 238 186 238 187 187 187 131 238 187
4 206 204 204 184 206 187 218 186 204 284
The m = 10 fitted models are then combined by the pooling function pool() to give the pooled model estimates.
> est <- pool(fit)
> est
Class: mipo    m = 10
         term  m  estimate      ubar         b        t dfcom df
1 (Intercept) 10 29.621111 3.4810048 1.4312926 5.055427
2         age 10 -1.802222 0.9257992 0.2759968 1.229396
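The columns of the pooled output are related by Rubin's rule for the total variance, $t = \bar{u} + (1 + 1/m)\, b$, where ubar is the within-imputation variance and b the between-imputation variance. The following base R check recomputes the column t from the printed values; the object names are chosen here for illustration.

```r
m    <- 10
ubar <- c(intercept = 3.4810048, age = 0.9257992)  # within-imputation variance
b    <- c(intercept = 1.4312926, age = 0.2759968)  # between-imputation variance

# Rubin's rule: total variance of the pooled estimate
t_total <- ubar + (1 + 1 / m) * b
print(round(t_total, 6))

# Pooled standard errors are the square roots of the total variances
se <- sqrt(t_total)
print(round(se, 3))
```

The recomputed values agree with the column t in the output above (5.055427 and 1.229396).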
A problem of univariate imputation is that circular dependence can occur: $Y_j^{mis}$ depends on $Y_h^{mis}$, which in turn depends on $Y_j^{mis}$ ($j \neq h$), because $Y_j$ and $Y_h$ are correlated. With large $p$ and small $n$, collinearity or empty cells can occur and cause problems in the imputation. Non-linear relations are not considered, and combinations of variables are problematic. However, the multivariate missing data algorithm in mice is different from the model-based multiple imputation algorithm.
3.3 Multivariate imputation
In the paper, the multivariate problem is split into a series of univariate problems, and an iterative algorithm draws samples from a sequence of univariate linear regressions. Whereas a simple multivariate imputation method is based on a monotone draw-input mechanism, the mice algorithm starts with a random draw from the observed data and imputes the incomplete data in a variable-by-variable fashion. Hence, one iteration consists of one cycle through all $Y_j$.
Each incomplete entry is initialized by filling in a random draw from $Y_{obs}$.
• Regression switching: executed $m$ times in parallel, where each $Y_i$ is imputed conditional on all other data and $Z$, $U$, $V$.
• Gibbs sampler: under the condition that the draws converge to the multivariate posterior density $p(Y_{mis} \mid Y_{obs}, X, R)$, it iterates for about 20 steps (partially incompatible MCMC).
• Monte Carlo simulation: draws from the multivariate distribution of $Y_{mis}^{*}$ are obtained by repeatedly drawing from the conditional densities.
• Let $Y_{mis} = \{Y_{mis}(1), \ldots, Y_{mis}(k)\}$, $k \leq p$, be a partition of a $p$-dimensional random variable, where $Y_{mis}(j)$ is a missing entry and $Y_{obs} \cup \boldsymbol{X}$ is the multi-dimensional conditioning variable in $p(Y_{mis} \mid Y_{obs}; \boldsymbol{X})$.
• The unknown parameters of the imputation models are $\phi = (\phi_1, \ldots, \phi_p)$, with prior density $\pi(\phi) = \pi_1(\phi_1) \cdots \pi_p(\phi_p)$.
• The unknown parameters $\phi = (\phi_1, \ldots, \phi_p)$ of the imputation models should be distinct for likelihood inference.
Starting from the fill-in $Y_{mis}^{(0)}$, the algorithm generates an iterative sequence of imputations $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(t)}$, where the imputation $Y_{mis}^{(t)}(j)$ is conditional on the observed data and on the most recently imputed data $Y_{mis}(i)$, $j \neq i$.
1. Specify an imputation model $p(Y_j^{mis} \mid Y_j^{obs}, Y_{-j}, R)$ for each variable $Y_j$.
2. For each $j$, fill in $Y_j^{(0)}(mis)$ by random draws from $Y_j(obs)$.
3. Repeat for $t = 1, \ldots, M$.
4. Repeat for $j = 1, \ldots, p$.
5. Define $Y_{-j}^{(t)} = (Y_1^{(t)}, \ldots, Y_{j-1}^{(t)}, Y_{j+1}^{(t-1)}, \ldots, Y_p^{(t-1)})$ as the current complete data except $Y_j$.
6. Draw $\phi_j^{(t)}$ (posterior step).
7. Draw imputations $Y_{mis}^{(t)}(j)$ (imputation step).
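The seven steps can be sketched in base R for two incomplete variables. This is a simplified illustration on simulated data, not the ‘mice’ implementation: the data (sbp, dbp, z), the function names draw_imputations and chained_once, and the choice of Bayesian linear regression for both variables are assumptions made for the example.

```r
set.seed(2021)

# Hypothetical data: two correlated incomplete variables and one complete covariate
n   <- 300
z   <- rnorm(n)
sbp <- 150 + 5 * z + rnorm(n, sd = 12)
dbp <- 30 + 0.35 * sbp + rnorm(n, sd = 6)
dat <- data.frame(sbp = sbp, dbp = dbp, z = z)
dat$sbp[sample(n, 45)] <- NA
dat$dbp[sample(n, 45)] <- NA
miss <- is.na(dat)               # fixed record of which cells were missing

# Steps 6 and 7 for one variable: draw the regression parameters (posterior step),
# then draw the imputations (imputation step)
draw_imputations <- function(y, X, mis) {
  X     <- cbind(1, as.matrix(X))
  Xo    <- X[!mis, , drop = FALSE]     # rows where y is observed
  yo    <- y[!mis]
  W     <- solve(crossprod(Xo))
  bhat  <- W %*% crossprod(Xo, yo)
  res   <- yo - Xo %*% bhat
  sigma <- sqrt(sum(res^2) / rchisq(1, df = length(yo) - ncol(Xo)))
  beta  <- bhat + sigma * t(chol(W)) %*% rnorm(ncol(Xo))
  as.vector(X[mis, , drop = FALSE] %*% beta) + rnorm(sum(mis), sd = sigma)
}

chained_once <- function(dat, miss, maxit = 20) {
  vars <- c("sbp", "dbp")
  # Step 2: fill in by random draws from the observed values
  for (v in vars) {
    dat[miss[, v], v] <- sample(dat[!miss[, v], v], sum(miss[, v]), replace = TRUE)
  }
  for (iter in seq_len(maxit)) {       # step 3: iterations
    for (v in vars) {                  # step 4: cycle through the variables
      # Step 5: all other variables at their most recent completed values
      others <- dat[, setdiff(names(dat), v), drop = FALSE]
      dat[miss[, v], v] <- draw_imputations(dat[[v]], others, miss[, v])
    }
  }
  dat
}

# m independent runs give m completed datasets
m <- 5
completed <- lapply(seq_len(m), function(i) chained_once(dat, miss))
print(sapply(completed, function(d) round(mean(d$sbp), 1)))
```

Each run starts from its own random fill-in, so the m completed datasets differ in the imputed cells and agree in the observed cells.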
3.3.1 Generating algorithm: Gibbs sampling
The Gibbs sampler is used under the condition that the draws converge to the multivariate posterior density $p(Y_{mis} \mid Y_{obs}, X, R)$, and iterates for about 20 steps. The Monte Carlo simulation draws from the multivariate distribution of $Y_{mis}^{*}$ by repeatedly drawing from the conditional densities.
Let $Y_{mis} = \{Y_{mis}(1), \ldots, Y_{mis}(k)\}$, $k \leq p$, be a partition of a $p$-dimensional random variable, where $Y_{mis}(j)$ is a missing entry and $Y_{obs} \cup \boldsymbol{X}$ is the multi-dimensional conditioning variable in $p(Y_{mis} \mid Y_{obs}; \boldsymbol{X})$.
Starting from $Y_{mis}^{(0)}$, the sampler generates an iterative sequence of imputations $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(t)}$, where the imputation $Y_{mis}^{(t)}(j)$ is conditional on the observed data and on the most recently imputed data $Y_{mis}(i)$, $j \neq i$:
$Y_{mis}^{(t)}(1) \sim p(Y_{mis}(1) \mid Y_{obs}(1), Y_2^{(t-1)}, \ldots, Y_k^{(t-1)}; \boldsymbol{X})$
$Y_{mis}^{(t)}(2) \sim p(Y_{mis}(2) \mid Y_1^{(t)}, Y_{obs}(2), Y_3^{(t-1)}, \ldots, Y_k^{(t-1)}; \boldsymbol{X})$
$Y_{mis}^{(t)}(j) \sim p(Y_{mis}(j) \mid Y_1^{(t)}, \ldots, Y_{j-1}^{(t)}, Y_{obs}(j), Y_{j+1}^{(t-1)}, \ldots, Y_k^{(t-1)}; \boldsymbol{X})$
$Y_{mis}^{(t)}(k) \sim p(Y_{mis}(k) \mid Y_1^{(t)}, Y_2^{(t)}, \ldots, Y_{k-1}^{(t)}, Y_{obs}(k); \boldsymbol{X})$
Note that $Y_{j+1}^{(t-1)}, \ldots, Y_k^{(t-1)}$ represent the completed data for $y_{j+1}, \ldots, y_k$ from the previous iteration, while $Y_1^{(t)}, \ldots, Y_{j-1}^{(t)}$ are the completed data of the current iteration. The conditioning set therefore consists of two blocks, together with the observed values $Y_{obs}(j)$ of $y_j$.
When the parameter $\phi_j$ of the regression model of $y_j$ on $y_1, \ldots, y_{j-1}, y_{j+1}, \ldots, y_k$ and $x_1, \ldots, x_q$ is known for the complete data, the posterior predictive distribution of $Y_{mis}^{(t)}(j)$ is specified. To reflect the uncertainty about $\phi_j$ given the complete data, $\phi_j$ is drawn from its posterior distribution on the most recently completed data in order to generate $Y_{mis}^{(t)}(j)$. The full Gibbs sampling algorithm then generates $Y_{mis}^{(t)}$ as follows:
Draw $\phi_1^{(t)} \sim p(\phi_1 \mid [Y_1^{(t-1)}, \ldots, Y_k^{(t-1)}, \boldsymbol{X}]_{obs(1)})$
Impute $Y_{mis}^{(t)}(1) \sim p(Y_{mis}(1) \mid Y_2^{(t-1)}, \ldots, Y_k^{(t-1)}, \boldsymbol{X}; \phi_1^{(t)})$
Draw $\phi_2^{(t)} \sim p(\phi_2 \mid [Y_1^{(t)}, Y_2^{(t-1)}, \ldots, Y_k^{(t-1)}, \boldsymbol{X}]_{obs(2)})$
Impute $Y_{mis}^{(t)}(2) \sim p(Y_{mis}(2) \mid Y_1^{(t)}, Y_3^{(t-1)}, \ldots, Y_k^{(t-1)}, \boldsymbol{X}; \phi_2^{(t)})$
$\vdots$
Draw $\phi_j^{(t)} \sim p(\phi_j \mid [Y_1^{(t)}, Y_2^{(t)}, \ldots, Y_{j-1}^{(t)}, Y_j^{(t-1)}, \ldots, Y_k^{(t-1)}, \boldsymbol{X}]_{obs(j)})$
Impute $Y_{mis}^{(t)}(j) \sim p(Y_{mis}(j) \mid Y_1^{(t)}, \ldots, Y_{j-1}^{(t)}, Y_{j+1}^{(t-1)}, \ldots, Y_k^{(t-1)}, \boldsymbol{X}; \phi_j^{(t)})$
$\vdots$
Draw $\phi_k^{(t)} \sim p(\phi_k \mid [Y_1^{(t)}, Y_2^{(t)}, \ldots, Y_{k-1}^{(t)}, Y_k^{(t-1)}, \boldsymbol{X}]_{obs(k)})$
Impute $Y_{mis}^{(t)}(k) \sim p(Y_{mis}(k) \mid Y_1^{(t)}, Y_2^{(t)}, \ldots, Y_{k-1}^{(t)}, \boldsymbol{X}; \phi_k^{(t)})$
Here $[\,\cdot\,]_{obs(j)}$ denotes the rows of the completed data that correspond to observed values of $y_j$.
For a monotone missing data pattern there is no need for iteration, and convergence is immediate. The mice package in R also handles the multivariate method differently from the univariate case. A simple example is shown below.
First, the new nhanes2 data in mice contains 3 out of 27 missing values that destroy the monotone pattern: one for hyp (in row 6) and two for bmi (in rows 3 and 6).
> library(mice)
> data(nhanes2)
> nhanes2
    age  bmi  hyp chl
1 20-39   NA <NA>  NA
2 40-59 22.7   no 187
3 20-39   NA   no 187
4 60-99   NA <NA>  NA
5 20-39 20.4   no 113
6 60-99   NA <NA> 184
> length(nhanes2[is.na(nhanes2)])
[1] 27
The draw phase is specified with the Gibbs sampling method where the maximum
iteration is defined to be 1. For iterative steps, only particular missing data are computed
to configure for tendencies and consistency of the data. Hence, particular 3 values are
imputed from a simple random sample.
> where <- make.where(nhanes2, "none")
> where[6, "hyp"] <- TRUE
CHAPTER 3. METHODOLOGY
23
STATS 756
Kyuson Lim
> where[c(3, 6), "bmi"] <- TRUE
> imp1 <- mice(nhanes2, where = where,
+ method = "sample",seed = 21991, maxit = 1,
+ print = FALSE)
> data <- mice::complete(imp1)
> head(data)
    age  bmi  hyp chl
1 20-39   NA <NA>  NA
2 40-59 22.7   no 187
3 20-39 26.3   no 187
4 60-99   NA <NA>  NA
5 20-39 20.4   no 113
6 60-99 22.7   no 184
Note that the imputed value for the missing hyp in row 6 could also depend on bmi and chl, but in this procedure both predictors are ignored. The remaining missing data, which now follow a monotone pattern, are then imputed with the monotone draw-impute mechanism stated before.
> imp2 <- mice(data, maxit = 1,
+ visitSequence = "monotone",
+ print = FALSE)
> data2 <- mice::complete(imp2)
> head(data2)
    age  bmi hyp chl
1 20-39 35.3  no 206
2 40-59 22.7  no 187
3 20-39 26.3  no 187
4 60-99 24.9  no 186
5 20-39 20.4  no 113
6 60-99 22.7  no 184
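The single pass of draw and impute steps for monotone data can be sketched in base R. This is a minimal illustration under stated assumptions, not the implementation used by mice: all variables are numeric, the imputation model is a Bayesian linear regression, and the columns are already ordered from least to most missing. The function name impute_monotone and the toy data are hypothetical.

```r
# Sketch: one pass of monotone imputation with Bayesian linear regression.
# Assumes the columns of 'dat' are ordered so that the missing-data pattern
# is monotone (rows missing in one column are also missing in later columns).
set.seed(1)

impute_monotone <- function(dat, complete_cols) {
  done <- complete_cols                       # predictors that are complete so far
  for (v in setdiff(names(dat), complete_cols)) {
    obs <- !is.na(dat[[v]])
    X   <- cbind(1, as.matrix(dat[, done, drop = FALSE]))
    Xo  <- X[obs, , drop = FALSE]
    yo  <- dat[[v]][obs]

    # Draw step: sample (beta, sigma) from the posterior given the observed rows
    XtX_inv  <- solve(crossprod(Xo))
    beta_hat <- XtX_inv %*% crossprod(Xo, yo)
    df       <- length(yo) - ncol(Xo)
    sigma    <- sqrt(sum((yo - Xo %*% beta_hat)^2) / rchisq(1, df))
    beta     <- beta_hat + sigma * t(chol(XtX_inv)) %*% rnorm(ncol(Xo))

    # Impute step: draw the missing values given the sampled parameters
    n_mis <- sum(!obs)
    dat[[v]][!obs] <- X[!obs, , drop = FALSE] %*% beta + rnorm(n_mis, sd = sigma)

    done <- c(done, v)                        # the completed column becomes a predictor
  }
  dat
}

# Hypothetical toy data with a monotone pattern: x complete, y1 missing in
# rows 41-50, y2 missing in rows 31-50.
n  <- 50
x  <- rnorm(n)
y1 <- 1 + 2 * x + rnorm(n)
y2 <- -1 + x + 0.5 * y1 + rnorm(n)
y1[41:50] <- NA
y2[31:50] <- NA
toy <- data.frame(x = x, y1 = y1, y2 = y2)

completed <- impute_monotone(toy, complete_cols = "x")
colSums(is.na(completed))
```

Each variable is visited exactly once, which is why no further iteration is needed when the pattern is monotone.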
3.4 Pooling
The purpose is to investigate the robustness of the results against violations of the MAR assumption, that is, to determine whether the relation between BP and mortality is affected by non-response.
1. Suppose the BP distribution is known, and apply Bayes' rule to calculate the distributions $p(BP \mid R = 1)$ and $p(BP \mid R = 0)$.
2. Both are normal, but their means differ by $\delta = 151.6 - 138.6 = 13$.
3. Generate imputations by subtracting the amount $\delta$ from random draws of $p(BP \mid R = 1)$.
The adjustment is incorporated into the model as $Y_1 = X\beta + (1 - R_1)\delta + \epsilon$, where $R_1$ is the response indicator for systolic BP. Through the $\delta$-adjustment, the regression model postulates a mean difference, $\delta$, between responders and non-responders. The adjustment is applied to systolic BP only, because SBP and DBP are correlated. The value $\delta = 0$ corresponds to the MAR assumption, and the values $-5, -10, -15, -20$ correspond to the NMAR assumption.
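The effect of the $\delta$-adjustment on a single variable can be sketched with simulated data. The example below is an illustration under stated assumptions, not the analysis of the Leiden 85+ data: the variables age and sbp, the response rate and the regression coefficients are all hypothetical, and the imputations ignore parameter uncertainty for brevity.

```r
# Sketch: delta-adjusted regression imputation on simulated data.
set.seed(42)
n   <- 500
age <- rnorm(n, mean = 90, sd = 3)
sbp <- 150 - 1.5 * (age - 90) + rnorm(n, sd = 20)
r   <- rbinom(n, 1, 0.85) == 1        # TRUE = responder, FALSE = non-responder
sbp[!r] <- NA                         # non-responders have missing SBP

# Imputation model fitted on the responders: Y = X beta + eps
fit   <- lm(sbp ~ age, subset = r)
sigma <- summary(fit)$sigma

# MAR draws for the non-responders (delta = 0)
base <- predict(fit, newdata = data.frame(age = age[!r])) +
  rnorm(sum(!r), sd = sigma)

# Add (1 - R) * delta to the draws and record the realized mean difference
deltas   <- c(0, -5, -10, -15, -20)
realized <- sapply(deltas, function(d) mean(base + d) - mean(sbp[r]))
data.frame(delta = deltas, realized_difference = round(realized, 1))
```

In this univariate setting each step of 5 mmHg in $\delta$ shifts the realized difference by exactly 5 mmHg. In a chained-equations run with several incomplete variables the realized shift can differ from $\delta$.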
The pooling phase combines the $m$ analysis results and their variance estimates. The combined point estimate is
$$\hat{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i,$$
where $\hat{Q}_i$ is a $k$-dimensional column vector obtained from the $i$th imputed dataset ($i = 1, \ldots, m$). The total covariance combines three sources of variation:
$$T = U + \left(1 + \frac{1}{m}\right) B.$$
• Complete data variance: $U = \frac{1}{m}\sum_{i=1}^{m} U_i$, where $U_i$ is the covariance matrix of $\hat{Q}_i$ obtained from the $i$th imputed dataset (conventional variability).
• Standard unbiased estimate of the between-imputation variance: $B = \frac{1}{m-1}\sum_{i=1}^{m} (\hat{Q}_i - \hat{Q})(\hat{Q}_i - \hat{Q})'$ (extra variance from missing values in the sample).
• Simulation variance: $B/m$, caused by estimating $\hat{Q}$ from a finite number $m$ of imputations.
Note that the posterior variance of $Q$ decomposes as $V(Q \mid Y_{obs}) = E[V(Q \mid Y_{obs}, Y_{mis}) \mid Y_{obs}] + V[E(Q \mid Y_{obs}, Y_{mis}) \mid Y_{obs}]$, that is, the average within-imputation variance plus the between-imputation variance. Using the total covariance, a 95% confidence interval for the relative risk in the proportional hazards model is given by $\exp(\hat{Q} \pm 1.96\sqrt{T})$.
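The pooling formulas can be checked with a small base R function for a scalar estimate. This is a sketch only: the function name pool_scalar is hypothetical, and the five log hazard ratios and variances are made-up inputs, not results from the Leiden 85+ analysis.

```r
# Sketch: pooling m scalar estimates and their variances.
pool_scalar <- function(q, u) {
  m     <- length(q)
  q_hat <- mean(q)                    # combined point estimate
  u_bar <- mean(u)                    # complete data (within) variance U
  b     <- var(q)                     # between-imputation variance B, divisor m - 1
  total <- u_bar + (1 + 1 / m) * b    # total variance T = U + (1 + 1/m) B
  list(estimate = q_hat, within = u_bar, between = b, total = total,
       # 95% confidence interval for the relative risk: exp(Q +/- 1.96 sqrt(T))
       ci_rr = exp(q_hat + c(-1.96, 1.96) * sqrt(total)))
}

# Hypothetical log hazard ratios and variances from m = 5 imputed datasets
q <- c(0.52, 0.61, 0.48, 0.57, 0.55)
u <- c(0.040, 0.042, 0.039, 0.041, 0.040)

res <- pool_scalar(q, u)
res$estimate
res$total
res$ci_rr
```

The total variance always exceeds the average within-imputation variance whenever the $m$ estimates differ, which widens the interval to reflect the missing data.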
The following code generates the imputations from which the realized difference in means between the observed and imputed SBP (mmHg) data is computed under various $\delta$-adjustments. Note that the number of multiple imputations is $m = 5$.
> delta <- c(0, -5, -10, -15, -20)
> post <- imp.qp$post
> imp.all.undamped <- vector("list", length(delta))
> for (i in 1:length(delta)) {
+ d <- delta[i]
+ cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
+ post["rrsyst"] <- cmd
+ imp <- mice(data2, pred = pred, post = post, maxit = 10,
+ seed = i * 22)
+ imp.all.undamped[[i]] <- imp
+ }
Also, the mean of the observed SBP is 152.9 mmHg. The difference between the mean imputed SBP under each $\delta$-adjustment and the observed mean SBP is summarized in the table below.
δ for SBP    Avg. Difference (mmHg)
    0              -8.2
   -5             -12.3
  -10             -20.7
  -15             -26.1
  -20             -31.5
Table 4. Realized difference in means
The strength of the effect depends on the correlation between SBP and the other variables. Under the MAR assumption, the imputations are on average 8.2 mmHg lower than the observed blood pressure. For example, under $\delta = -10$ mmHg the difference relative to the MAR case is $-20.7 + 8.2 = -12.5$ mmHg, which is larger in magnitude than $\delta$. While $\delta = -5$ has a small effect on the mean, $\delta = -20$ has an extreme effect compared with the mean SBP at $\delta = 0$.
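The comparison with the MAR case can be reproduced for every row of Table 4 with a few lines of R. The values are copied from the table, and the column name net_vs_mar is only an illustrative label.

```r
# Realized differences from Table 4, expressed relative to the MAR case (delta = 0)
delta    <- c(0, -5, -10, -15, -20)
avg_diff <- c(-8.2, -12.3, -20.7, -26.1, -31.5)

net_vs_mar <- avg_diff - avg_diff[delta == 0]
data.frame(delta, avg_diff, net_vs_mar)
```

The net shift is $-4.1$ mmHg for $\delta = -5$ and $-12.5$ mmHg for $\delta = -10$, so the realized effect is not simply equal to $\delta$.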
Chapter 4
Simulation study and summary
In summary, the standard multiple imputation scheme consists of three phases:
1. Imputation of the missing data $m$ times.
2. Analysis of the $m$ imputed datasets.
3. Pooling of the parameters across the $m$ analyses.
4.1 Simulation study
Figure 4.1: Scatterplot of systolic and diastolic blood pressure from the first imputation.
The left-hand plot was obtained by simply running ‘mice’ on the data without any data screening. The right-hand plot is the result after cleaning the data and setting up the predictor matrix with the ‘quickpred()’ function (quick selection of predictors) in mice. The settings of the automatic selection in ‘quickpred()’, such as the correlation threshold, were chosen so that the average number of predictors is around 25.
4.1.1 Mean BP
After the pooling process, the mean values corresponding to the different $\delta$-adjustments are summarized as follows.
                 N     δ    SBP Mean   SBP SD   DBP Mean   DBP SD
Observed BP    835          152.9      25.7     82.8       13.1
Imputed BP     121     0    151.1      26.2     81.5       14
               121    -5    142.3      24.6     78.4       13.7
               121   -10    135.9      24.7     78.2       12.8
               121   -15    128.6      25       75.3       12.9
               121   -20    122.3      25.2     74         12.1
Table 5. Imputed BP values are pooled over m = 5 multiple imputations
Under the MAR assumption, which corresponds to $\delta = 0$, the mean observed SBP is 152.9 mmHg and the mean imputed SBP is 151.1 mmHg, a difference of 1.8 mmHg. Likewise, the mean observed DBP is 82.8 mmHg and the mean imputed DBP is 81.5 mmHg, a difference of 1.3 mmHg.
From the table, the mean imputed SBP decreases as $\delta$ takes the values $-5, -10, -15, -20$, giving 142.3, 135.9, 128.6 and 122.3 mmHg. Nevertheless, only small differences in mortality exist, even among non-response models with different values of $\delta$, because the risk estimates are insensitive to the missing data. Finally, the relative mortality risks are estimated with a Cox proportional hazards model adjusted for age and sex.
4.1.2 Relative mortality risk estimates: SBP and DBP
The relative mortality risks are estimated with a Cox proportional hazards model that includes age and sex as covariates. After the pooling phase, the relative mortality risk estimates and their 95% confidence intervals for both SBP and DBP are summarized as follows.
At $\delta = 0$, the SBP group < 125 mmHg has a risk ratio of 1.76, meaning that the mortality risk (after correction for sex and age) in this group is 1.76 times the risk of the reference group, 125-140 mmHg.
Figure 4.2: Relative mortality risk estimates with 95% confidence intervals: SBP and DBP
The imputed BP values are lowered by $\delta$, but the risk estimates do not change much, and the hazard ratio estimates for different $\delta$ are close. The difference in mortality between responders and non-responders is simply too small to have a serious impact on the estimates. Thus, we conclude that the missing data hardly influence the risk estimates.
4.1.3 Pattern-mixture Model
Finally, the imputed data and the observed data can be compared, together with the combined pattern-mixture model, in one plot. The pattern-mixture model decomposes the joint distribution as $P(Y, R) = P(Y \mid R)P(R)$, so that by the law of total probability the marginal distribution is $P(Y) = P(Y \mid R = 1)P(R = 1) + P(Y \mid R = 0)P(R = 0)$. This emphasizes that the combined distribution is a mixture of the distributions of $Y$ in the responders and non-responders. For example, the density at the point $Y = 100$ is computed as $P(Y = 100) = 0.015 \times 0.878 + 0.058 \times 0.122 = 0.02$, which is shown in the left-hand plot.
By Bayes' rule, the response probability given systolic BP is $P(R = 1 \mid Y = y) = P(Y = y \mid R = 1)P(R = 1)/P(Y = y)$, where the marginal distribution of $Y$ is $P(Y = y) = P(Y = y \mid R = 1)P(R = 1) + P(Y = y \mid R = 0)P(R = 0)$. For this particular point, the probability of being observed is $P(R = 1 \mid Y) = 0.015 \times 0.878/0.02 = 0.65$, while the probability of being missing is $P(R = 0 \mid Y) = 0.058 \times 0.122/0.02 = 0.35$.
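The mixture density and the two conditional probabilities can be verified numerically. The four input values are taken from the text; the variable names are only illustrative.

```r
# Pattern-mixture arithmetic at a single point of the SBP distribution
p_y_given_r1 <- 0.015   # density among responders, P(Y = y | R = 1)
p_y_given_r0 <- 0.058   # density among non-responders, P(Y = y | R = 0)
p_r1 <- 0.878           # proportion of responders, P(R = 1)
p_r0 <- 0.122           # proportion of non-responders, P(R = 0)

# Marginal density by the law of total probability
p_y <- p_y_given_r1 * p_r1 + p_y_given_r0 * p_r0

# Bayes' rule for the response mechanism
p_r1_given_y <- p_y_given_r1 * p_r1 / p_y
p_r0_given_y <- p_y_given_r0 * p_r0 / p_y

round(c(p_y = p_y, p_r1_given_y = p_r1_given_y, p_r0_given_y = p_r0_given_y), 3)
```

With the unrounded marginal density of about 0.0202 in the denominator, the two probabilities round to 0.65 and 0.35 and sum to one.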
Figure 4.3: Graphic representation of the response mechanism for SBP
The right-hand plot provides the distributions $P(Y \mid R)$ in the observed (blue) and missing (red) data in the pattern-mixture model. The hypothetically complete distribution is the black curve.
The distribution of blood pressure in the group with missing blood pressures differs slightly, both in form and location. However, according to the Kolmogorov–Smirnov (KS) test and the empirical cdf, the observed and imputed values do not differ drastically. Hence, missingness has only a slight effect on the combined distribution.
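A comparison of this kind can be sketched with the two-sample KS test in base R. The data below are simulated stand-ins, drawn from normal distributions with the means and standard deviations reported for SBP at $\delta = 0$. They are an assumption made for illustration and are not the Leiden 85+ values.

```r
# Sketch: two-sample Kolmogorov-Smirnov test and empirical cdfs
set.seed(7)
observed <- rnorm(835, mean = 152.9, sd = 25.7)   # simulated "observed" SBP
imputed  <- rnorm(121, mean = 151.1, sd = 26.2)   # simulated "imputed" SBP

ks <- ks.test(observed, imputed)
ks$statistic
ks$p.value

# The KS statistic is the largest vertical gap between the two empirical cdfs
grid    <- sort(c(observed, imputed))
max_gap <- max(abs(ecdf(observed)(grid) - ecdf(imputed)(grid)))
max_gap

# Visual check of the two empirical cdfs
plot(ecdf(observed), main = "Empirical cdf of SBP", xlab = "SBP (mmHg)")
lines(ecdf(imputed), col = "red")
```

A small KS statistic with a large p-value indicates that the two empirical distributions are hard to distinguish.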
Chapter 5
Appendix: R codes
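# Data exploration
library(foreign)
library(mice)
# 'dataproject' is the path to the project data directory (defined elsewhere)
file.sas <- file.path(dataproject, "original/master85.xport")
## xport.info <- lookup.xport(file.sas)
original.sas <- read.xport(file.sas)
names(original.sas) <- tolower(names(original.sas))
dim(original.sas)
# uninteresting or problematic variables
# 'data' is the Leiden 85+ data frame prepared from original.sas (preparation not shown).
# Assumption: 'ini' comes from a dry run of mice and 'fx' from flux();
# neither is defined in the original listing.
ini <- mice(data, maxit = 0)
v1 <- names(ini$nmis[ini$nmis == 0])
outlist1 <- v1[c(1, 3:5, 7:10, 16:47, 51:60, 62, 64:65, 69:72)]
length(outlist1)
# Outflux and Influx
fx <- flux(data)
outlist2 <- row.names(fx)[fx$outflux < 0.5]
length(outlist2)
outlist4 <- as.character(ini$loggedEvents[, "out"])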
# Quick predictor
outlist <- unique(c(outlist1, outlist2, outlist4))
length(outlist)
data2 <- data[, !names(data) %in% outlist]
inlist <- c("sex", "lftanam", "rrsyst", "rrdiast")
pred <- quickpred(data2, minpuc = 0.5, include = inlist)
## Generating the imputations
imp.qp <- mice(data2, pred = pred, seed = 29725)
# Scatterplot of missing vs. observed data for SBP and DBP (Figure 4.1)
# Assumption: 'imp' is the blind imputation obtained by running mice on the
# unscreened data; it is not defined in the original listing.
library(lattice)
vnames <- c("rrsyst", "rrdiast")
cd1 <- mice::complete(imp)[, vnames]
cd2 <- mice::complete(imp.qp)[, vnames]
typ <- factor(rep(c("blind imputation", "quickpred"),
                  each = nrow(cd1)))
mis <- ici(data2[, vnames])
mis <- is.na(imp$data$rrsyst) | is.na(imp$data$rrdiast)
cd <- data.frame(typ = typ, mis = mis, rbind(cd1, cd2))
xyplot(jitter(rrdiast, 10) ~ jitter(rrsyst, 10) | typ,
data = cd, groups = mis,
col = c(mdc(1), mdc(2)),
xlab = "Systolic BP (mmHg)",
type = c("g","p"), ylab = "Diastolic BP (mmHg)",
pch = c(1, 19),
strip = strip.custom(bg = "grey95"),
scales = list(alternating = 1, tck = c(1, 0)))
# delta-adjustment
delta <- c(0, -5, -10, -15, -20)
post <- imp.qp$post
imp.all.undamped <- vector("list", length(delta))
for (i in 1:length(delta)) {
  d <- delta[i]
  # post-processing command: shift the imputed rrsyst values by d
  cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
  post["rrsyst"] <- cmd
  imp <- mice(data2, pred = pred, post = post, maxit = 10,
              seed = i * 22)
  imp.all.undamped[[i]] <- imp
}
# Hazard ratio estimates
library(survival)
cda <- expression(
  sbpgp <- cut(rrsyst, breaks = c(50, 124, 144, 164, 184, 200,
                                  500)),
  agegp <- cut(lftanam, breaks = c(85, 90, 95, 110)),
  dead <- 1 - dwa,
  coxph(Surv(survda, dead)
        ~ C(sbpgp, contr.treatment(6, base = 3))
        + strata(sexe, agegp)))
# this listing defines only imp.all.undamped, so the delta = 0 run is taken from it
imp <- imp.all.undamped[[1]]
fit <- with(imp, cda)
# chi-square of independence plot
# significance plot
library(ggplot2)
library(forcats)
rwo = c('Age', 'Type of residence', 'Activities of daily living', 'History of hypertension', 'Uses of d
dat <- data.frame(
  Covariate = rep(x = c(' '), times = 5),
  Question = rwo,
  Significance = c(1,1,1,0,1)
)
# Add group column
dat$groups <- cut(dat$Significance,
                  breaks = c(-0.1, 0.01, 1.1))
textcol <- "grey40"
ggplot(data = dat, aes(x = fct_inorder(Question), y = Covariate, fill = groups)) +
  geom_tile(colour = "white", size=1.5) +
  scale_fill_manual(breaks = levels(dat$groups),
                    values = c("grey", "red"),guide = guide_legend(reverse = TRUE),
                    labels = c('Insignificant, p-value > 0.05', 'Significant, p-value < 0.05'))+
  scale_y_discrete(expand=c(0,0))+
  scale_x_discrete(expand=c(0,0),breaks=rwo)+
  theme_grey(base_size=10)+
  theme(legend.position="right",legend.direction="vertical",
        legend.title=element_text(colour=textcol),
        legend.text=element_text(colour=textcol,size=10,face="bold"),
        axis.text.x=element_text(size=20, colour=textcol, angle = 90, vjust = 0.2, hjust=0.2),
        axis.text.y=element_text(size=23, vjust=0.2, colour=textcol),
        axis.ticks.x=element_blank(),
        plot.title=element_text(colour=textcol, hjust=0, size=14, face="bold"))+
  labs(fill = "Significance")+xlab(NULL)+ylab(NULL)+coord_flip()
ggsave('p.png', width=7, height=6)
# correlation plot
library(corrplot)
library("pheatmap")
library(ComplexHeatmap)
M=data.frame(matrix(nrow=24, ncol=3))
rownames(M)<-c('Systolic BP', 'Diastolic BP', 'Survival date', 'Censoring flag',
'Sex','Age',
'Type of residence', 'Activity of daily living', 'Previous hypertension', 'Uses diuretics', 'Year of intervi
'Serum albumin', 'Cognition', 'Current hypertension', 'Current/Previous hypertension', 'Survival year', 'ln (s
'Serum cholesterol', 'Fraction erythrocytes', 'Treated by specialist', 'Hemoglobin', 'Hematocrit')
M[1,]<-c(1.0,0.59,0)
M[2,]<-c(0.59,1.0,0)
M[3,]<-c(0.18, 0.14, 0.12)
M[4,]<-c( 0.13, 0.11, 0.08)
M[5,]=c(-0.1, -0.1, -0.04)
M[6,]=c(-0.11, -0.11, -0.14)
M[7,]=c(-0.21, -0.15, -0.08)
M[8,]=c(-0.24, -0.11, -0.14)
M[9,]=c(0.16, 0.14, 0.06)
M[10,]=c(-0.04, -0.03, 0.06)
M[11,]=c(0.18, 0.09, 0.18)
M[12,]=c(0.17, 0.11, 0.16)
M[13,]=c(0.24, 0.18, 0.02)
M[14,]=c(0.24, 0.18, 0.07)
M[15,]=c(0.23, 0.17, 0.01)
M[16,]=c(0.22, 0.19, 0.04)
M[17,]=c(0.21, 0.15, 0.14)
M[18,]=c(0.20, 0.15, 0.09)
M[19,]=c(-0.19, -0.18, -0.01)
M[20,]=c(0.17, 0.17, 0.12)
M[21,]=c(0.17, 0.20, 0.08)
M[22,]=c(-0.16, -0.11, 0.02)
M[23,]=c(0.15, 0.18, 0.08)
M[24,]=c(0.11, 0.18, 0.10)
colnames(M)=c('r(SBP)', 'r(DBP)', 'r(R1)')
M=as.matrix(M)
# Heatmap 2
ht2 = Heatmap(M, name = "ht2",
              col = circlize::colorRamp2(c(-0.25, 0, 1), c("skyblue", "white", "red")),
              column_names_gp = gpar(fontsize = 9))
ht2
corrplot(M, order = 'hclust', addrect = 2)
# testRes is defined before its first use; as in the original listing it is
# computed from mtcars, so its p-value matrix does not match the dimensions of M
testRes = cor.mtest(mtcars, conf.level = 0.95)
corrplot(M, p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
         addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE, addrect = 2)
# keep the plotted positions so the coefficients can be added as text
p1 <- corrplot(M, p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
               order = 'AOE', diag = FALSE, addrect = 3)$corrPos
text(p1$x, p1$y, round(p1$corr, 2))
# hazard variable
M=data.frame(matrix(nrow=5, ncol=5))
rownames(M)<-c('H0(T)', 'T', 'log(T)', 'SBP', 'DBP')
colnames(M)<-c('H0(T)', 'T', 'log(T)', 'SBP', 'DBP')
M[1,]=c(1.000, 0.997, 0.830, 0.169, 0.137)
M[2,]=c(0.997, 1.000, 0.862, 0.176, 0.141)
M[3,]=c(0.830, 0.862, 1.000, 0.205, 0.151)
M[4,]=c(0.169, 0.176, 0.205, 1.000, 0.592)
M[5,]=c(0.137, 0.141, 0.151, 0.592, 1.000)
M=as.matrix(M)
corrplot(M, method = 'color', type = 'lower', insig='blank',
         addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE)
corrplot(M, method = 'circle', type = 'lower', insig='blank',
         addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE)
# Assumption: 'col' is a colour palette function; the original listing does not define it
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),
         diag=FALSE, # hide correlation coefficient on the principal diagonal
         type="upper", order="hclust",
         title='Correlations between hazard H0(T), survival time T, log(T), SBP, DBP',
         addCoef.col = "black", # Add coefficient of correlation
         # Combine with significance (needs a p-value matrix p.mat, not defined in this listing)
         # p.mat = p.mat, sig.level = 0.05, insig = "blank",
         mar=c(0,0,1,0)
)
Bibliography
[1] McGilchrist, C. A., & Aisbett, C. W. (1991). Regression with frailty in survival analysis. Biometrics, 461-466.
https://www.jstor.org/stable/2532138?casa_token=cxuDrkxyJzUAAAAA%3AEnp4ejKDMHcBHgMbROgKulGAA-lUE0Iw16oVqCSqDXPbWGutHjuBeIJ7URMAZSIioGrZdBNLmqvx4fYUX_3D0LUaBnEGd-dVIBW88Bkm6vPgEhEca24&seq=1#metadata_info_tab_contents
[2] Van Buuren, S., Boshuizen, H. C., & Knook, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, 18(6), 681-694.
https://stefvanbuuren.name/fimd/
[3] Van Buuren, S., Oudshoorn, C. G. M., & de Jong, M. R. (2007). The MICE package. URL https://www.rdocumentation.org/packages/mice/versions/2.25.
http://ftp.uni-bayreuth.de/math/statlib/R/CRAN/doc/packages/mice.pdf