Multiple Imputation of Missing Blood Pressure Covariates in Survival Analysis

Kyuson Lim
Department of Mathematics & Statistics, McMaster University
E-mail: limk15@mcmaster.ca
December 8, 2021
STATS 756

Chapter 1 Acknowledgement

The sole purpose of this report is to interpret and implement the paper 'Multiple imputation of missing blood pressure covariates in survival analysis', written by Van Buuren in 1999. The original dataset used for the analysis, Leiden 85+, is associated with the R package 'mice' but is currently not available for use. A fuller specification of the dataset is found in the textbook 'Flexible Imputation of Missing Data', chapter 9.1.2. The R code and output are given in Chapter 3 (pp. 97-101) and Chapter 9 (pp. 259-283) of that textbook, which contain all the results stated and interpreted here for the 'Leiden 85+' data.

This report also restates the specification of the dataset, including the graphical assessments and the code used with the 'mice' package. The examples and code are taken from the textbook 'Flexible Imputation of Missing Data', written by the same author, Van Buuren, and are used to visualize the multiple imputation method and to guide inference. The interpretation of the original paper and of the multiple imputation method rephrases the definitions used in the textbook and the paper. In the textbook, the first section of chapter 2 introduces multiple imputation, chapter 3 covers univariate imputation and chapter 4 covers multivariate imputation. Together with chapter 6 on imputation in practice, these chapters underlie the imputation methods and model-based algorithms explained throughout the report.

I am grateful for the textbooks and guidelines that made it possible to write this report for the course STATS 756, covering both the analysis with multiple imputation and its methods.
I would also like to thank Professor Dr. Balakishnan for supporting me in learning the ideas of multiple imputation and in writing this report.

Chapter 2 Introduction

2.1 Background of the research

The main interest of the paper is to determine how missing measurements influence the relation between mortality and blood pressure (BP) in people over 85 years of age. The cohort consists of 1236 citizens of Leiden who were 85 or older in 1986 and were examined between 1987 and 1989. The concern is whether a paradoxical inverse relation exists between BP and mortality in persons over 85 years of age: normally, people with a lower BP live longer, but among the oldest old those with lower BP live a shorter time. The data contain approximately 12.5% incomplete (missing) cases, which deflate the mortality estimates for the lower BP groups and therefore distort inference about the influence of BP on survival. Hence, it is suspected that individuals with lower BP and higher mortality risk had fewer BP measurements.

The variables considered in the study include BP, age (85-89, 90-94, 95+), type of residence, activities of daily living (independent, dependent), history of hypertension, use of diuretics, and blood sample.

2.1.1 Guidelines: missing data

For problems of missing data, the following questions should be answered when multiple imputation is used.

1. Amount of missing data and reasons for missingness.
2. Consequences: important differences between individuals with complete and incomplete data, for example whether the groups differ in mean or spread on the key variables.
3. What information is used to choose between non-response mechanisms. This includes the method and the assumptions that were made (e.g., missing at random).
4. Software and number of imputed datasets. This is accompanied by a sensitivity analysis to assess whether the missing at random assumption is plausible.
5. Imputation model: which variables and design features were included in the imputation model.
6. How the set of predictors was chosen: derived variables and diagnostic plots.
7. How different models for non-response were specified, and pooling: how the repeated estimates were combined.
8. Complete-case analysis: whether multiple imputation and complete-case analysis lead to similar conclusions.

The first goal of the study was to determine whether the relation between BP and mortality in the very old is due to frailty. The second goal was to determine whether high BP is still a risk factor for mortality after the effects of poor health have been taken into account. The study compared two Cox regression models:

• Model 1: the relation between mortality and BP adjusted for age, sex and type of residence.
• Model 2: the relation between mortality and BP adjusted for age, sex, type of residence and health.

Health was measured by 28 different variables, including mental state, handicaps, being dependent in activities of daily living, history of cancer and others. Because health is included as a set of covariates in model 2, we expect model 2 to explain the relation between mortality and BP better.

2.2 Study of data and problems

The data have an observational problem: the group without BP measurements has a much higher mortality rate. In summary, there are 4 key problems with the missing data:

• BP was not measured for 121 individuals, so a group without hypertension and with high mortality is missing (out of 1236 people, 218 died before the visit and 59 did not participate; 956 individuals were examined).
• BP was measured more often if it was suspected to be too high (hypertension).
• BP was measured less frequently for very old people and for subjects who were too ill to be measured.
• The rate of missing BP varied over the data collection period: it increased (5-40%) in the early part and then dropped to a constant level (10-15%).

More specifically, the proportions of missing data are summarized in Table 1.
                      History of previous hypertension
Survived > 3 years    No                Yes              Total
Yes                   8.7% (34/390)     8.1% (10/124)    8.6% (44/514)
No                    19.2% (69/360)    9.8% (8/82)      17.4% (77/442)
Total                 13.7% (103/750)   8.7% (18/206)    12.7% (121/956)

Table 1. Proportion of no BP measured

As a sensitivity analysis to diagnose the missing data problem, Figure 2.1 shows distinct Kaplan-Meier curves for the group with observed BP measures and the group with missing BP measures, giving the survival probability since intake for each group. These curves have been obtained as baseline hazards after fitting a proportional hazards model adjusted for age, sex and type of residence, and stratified by the missingness indicator. Clearly, individuals without BP measures have higher mortality rates. Also, a relatively large group of individuals without hypertension and with high mortality risk is missing. The goal of the sensitivity analysis is to explore the result of the analysis under alternative scenarios for the missing data.

Figure 2.1: Kaplan-Meier curves of the Leiden 85+ Cohort, stratified according to missingness

2.2.1 Factors that affect the measurement of blood pressure

Variables related to non-response include age, type of residence, activities of daily living, and use of diuretics (year of interview and blood sample are not categorical and are therefore excluded). Not all variables have different distributions in the response group ($n = 835$) compared to the non-response group ($n = 121$). Table 2 indicates that BP was measured less frequently for very old people and for those with health problems. The graph gives an overview of the factors and allows their significance to be compared.

Figure 2.2: For 835 individuals, the chi-square test of independence

Again, BP was measured less frequently for very old (95+) people and for those who have a health problem (hypertension).
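The percentages in Table 1 can be recomputed from the counts given in parentheses. The following is a minimal base R sketch; the object names (`missing_bp`, `group_size`, `counts`) are illustrative assumptions, and the chi-square test at the end only illustrates a test of independence between previous hypertension and BP missingness. It is not the analysis behind Figure 2.2.

```r
# Counts taken from Table 1: individuals without a BP measurement (numerators)
# and group sizes (denominators), by survival status and previous hypertension.
labels <- list(survived_3y = c("Yes", "No"),
               prev_hypertension = c("No", "Yes"))
missing_bp <- matrix(c(34, 10,
                       69,  8), nrow = 2, byrow = TRUE, dimnames = labels)
group_size <- matrix(c(390, 124,
                       360,  82), nrow = 2, byrow = TRUE, dimnames = labels)

# Cell-wise and marginal percentages of missing BP.
pct_cell <- round(100 * missing_bp / group_size, 1)
pct_row  <- round(100 * rowSums(missing_bp) / rowSums(group_size), 1)
pct_col  <- round(100 * colSums(missing_bp) / colSums(group_size), 1)
pct_all  <- round(100 * sum(missing_bp) / sum(group_size), 1)

print(pct_cell)
print(pct_row)
print(pct_col)
print(pct_all)

# Illustrative 2 x 2 table: BP missing vs. measured, by previous hypertension.
counts <- rbind(missing  = colSums(missing_bp),
                measured = colSums(group_size - missing_bp))
test <- chisq.test(counts)
print(test$p.value)
```

The marginal totals reproduce the 121 individuals without BP among the 956 in Table 1, which is a quick consistency check on the flattened table.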
2.3 Response mechanism for BP

Imputations $Y_{mis}^*$ are drawn independently from the predictive distribution, where $\theta \in \Theta$ represents the parameter of the statistical model and $Y = (Y_{obs}, Y_{mis})$.

(Posterior predictive distribution)
\[ P(Y_{mis} \mid Y_{obs}) = \int_{\Theta} P(Y_{mis} \mid Y_{obs}, \theta)\, P(\theta \mid Y_{obs})\, d\theta \]

Multiple imputation is unique in that it provides a mechanism for dealing with the inherent uncertainty of the imputations in both high- and low-confidence situations. The MICE (Multivariate Imputation by Chained Equations) algorithm is an MCMC method that is univariate optimal.

• It starts with a random draw from the observed data, and imputes the incomplete data.
• One iteration consists of one cycle through all $Y_j$.
• It then samples from the conditional distributions in order to obtain samples from the joint distribution.
• It generates multiple imputations in parallel, $m$ times.

Before the assumptions about the missing response mechanism are set up, the model problems and outcome variables are recapped. As eliminating the missing data causes the true survival of the cohort to be overestimated, there are 3 problems in the model:

• Bias: if the cause of the missing data depends jointly on survival and the unknown BP, then the relative mortality risks of the different BP levels are biased.
• Verification: the conditional distribution of mortality given age and sex, for those with BP measured and those without, could not be demonstrated.
• Confounding factors: an analysis using only complete cases underestimates the mortality of the lower and normal BP groups.

The outcome variables are then classified as systolic BP and diastolic BP, with indicator variable $R_{ij}$, as follows.

$Y_1$ = Systolic BP                  $P(Y_3, Y_4 \mid Y_1, X)$    $P(R = 1 \mid Y_{obs}, Y_{mis}, X)$
$Y_2$ = Diastolic BP                 $P(Y_3, Y_4 \mid Y_2, X)$    $P(R = 1 \mid Y_{obs}, X)$ (MAR)
$Y_3$ = Survival/censoring time
$Y_4$ = Censoring indicator
$R_{ij} = 1$ if $Y_{ij}$ is observed

Here $Y = (Y_{obs}, Y_{mis})$, and $Y_{obs}$, $Y_{mis}$ and $X$ define the different types of response mechanism. The first column lists the outcome variables. The second and third columns show the models and the response mechanisms, based on the indicator variable, under the different assumptions made for the model, where $X$ denotes the predictor variables.

For the missing data mechanism, there are 3 assumptions, each stated with its definition and the reason for its use in the analysis. While MCAR is unrealistic as a generating mechanism, MAR and MNAR (NMAR) are used in the pooling phase for comparing values. In the definitions below, $\psi$ denotes the parameters of the missing data model.

1. MCAR (missing completely at random): the probability of being missing is the same for all cases, which effectively implies that the causes of the missing data are unrelated to the data. We may consequently ignore many of the complexities that arise because data are missing, apart from the obvious loss of information (i.e., some of the data will be missing simply because of bad luck).
\[ P(R = 0 \mid Y_{obs}, Y_{mis}, \psi) = P(R = 0 \mid \psi) \]
• The assumption is unrealistic. The survival curves for BP measured and BP not measured in the sensitivity analysis (figure 2.1) show a systematic difference in mortality. Hence, the assumption could not be incorporated in the imputation process.

2. MAR (missing at random): the probability of being missing is the same only within groups defined by the observed data. MAR is a much broader class than MCAR. An example of MAR is when we take a sample from a population, where the probability of being included depends on some known property.
\[ P(R = 0 \mid Y_{obs}, Y_{mis}, \psi) = P(R = 0 \mid Y_{obs}, \psi) \]
• MAR on $Y_{obs}$: the probability of BP measurement depends on survival. Hence, it can be incorporated in the correction for non-response.
• MAR on $X$: the probability of non-response is related to the covariates ($\chi^2$ independence test). This also relates to the correction for non-response.
3. MNAR (missing not at random): the probability of being missing varies for reasons that are unknown to us. It also depends on unobserved information, including $Y_{mis}$ itself, so the expression below does not simplify. (In the literature one can also find the term NMAR, not missing at random, for the same concept.)
\[ P(R = 0 \mid Y_{obs}, Y_{mis}, \psi) \]
• Because the probability of non-response may be related to the unobserved BP, the distribution of $Y_{mis}$ has to be investigated by a sensitivity analysis, in which different values are derived in the pooling phase with a $\delta$-adjustment.

2.3.1 Influx and Outflux

Influx and outflux are summaries of the missing data pattern intended to aid in the construction of imputation models. The influx of a variable quantifies how well its missing data connect to the observed data on other variables. Variables with higher influx depend strongly on the imputation model.

Figure 2.3: Global influx-outflux pattern of the Leiden 85+ Cohort data

The outflux of a variable quantifies how well its observed data connect to the missing data on other variables. Variables with higher outflux are better connected to the missing data, and thus potentially more useful for imputing other variables. For the BP data, the variables are plotted with influx on the x-axis and outflux on the y-axis (figure 2.3). Variables that are located in the lower regions (especially near the lower-left corner) and that are uninteresting for later analysis are better removed from the data prior to imputation.

First of all, all points are relatively close to the diagonal, which indicates that influx and outflux are balanced (figure 2.3). The group in the upper-left corner has almost complete information, so the number of missing data problems for this group is relatively small. The intermediate group (second group) has an outflux between 0.5 and 0.8, which is small.
The third group, which contains important variables, has an outflux of 0.5 or lower, so its predictive power is limited. This group also has a high influx, and is thus highly dependent on the imputation model. Hence, the variables with many missing values that might cause issues in the imputations (hypert1, aovar) are located in the lower-right corner.

Chapter 3 Methodology

The paper uses a model-based multiple imputation method with a multivariate approach. Although univariate and multivariate imputation give drastically different output, the univariate approach is introduced first. The method of multiple imputation proceeds in 4 main steps.

1. Specify the posterior predictive density $P(Y_{mis} \mid X, R)$ ($X$ is the set of predictors), given the non-response mechanism $P(R \mid Y, X)$ and $P(Y, X)$.
2. Draw imputations from $P(Y_{mis} \mid X, R)$ to produce $m$ complete datasets.
3. Perform $m$ complete-data Cox regression analyses, one on each completed dataset.
4. Pool the $m$ analysis results and variance estimates.

The first step can be summarized by two important concepts: variable selection and investigation of the correlation values. Conceptually, the idea of imputation is illustrated as follows. Let $Q$ be the quantity of scientific interest that we could calculate if we observed the whole population. The goal is to obtain an estimate $\hat{Q}$ that is unbiased, $E(\hat{Q} \mid Y) = Q$, and valid, $E(U \mid Y) \ge V(\hat{Q} \mid Y)$, where $U$ is the estimated covariance matrix of $\hat{Q}$.

(Posterior distribution)
\[ P(Q \mid Y_{obs}) = \int P(Q \mid Y_{obs}, Y_{mis})\, P(Y_{mis} \mid Y_{obs})\, dY_{mis} \]

3.1 Selection of variables

First, we define some variables. Recall the posterior predictive density $P(Y_{mis} \mid X, R)$ ($X$ is the set of predictors), given the non-response mechanism $P(R \mid Y, X)$ and $P(Y, X)$. As the multiple imputation is model based, $P(Y_{mis} \mid X)$ is defined by a linear regression in which the incomplete BP is the target variable $Y$ of the imputation.
A suitable subset of the data contains no more than 15-25 variables, $X = [Y_{obs}, Z, U, V]$.

1. $Y_{obs}$, $Z$: include all variables that appear in the complete-data model, especially if that model contains strong predictive relations.
2. $U$: variables that differ between the response and non-response groups, inspected by correlation.
3. $V$: variables that explain a considerable amount of variance, to reduce the uncertainty of the imputations.
4. $U$ and $V$: remove those with many missing values (%) within the incomplete cases.

Variable                      r (Systolic BP)   r (Diastolic BP)   r ($R_1$), response indicator   Usable cases (% of observed data)

Y: Incomplete and outcome variables
Systolic BP                    1.0               0.59
Diastolic BP                   0.59              1.0
Survival date                  0.18              0.14               0.12                            100
Censoring flag                 0.13              0.11               0.08                            100

Z: Covariates (Cox regression: relation between mortality and BP adjusted for age & sex)
Sex                           -0.1              -0.1               -0.04                            100
Age                           -0.11             -0.11              -0.14                            100

U: Variables related to non-response
Type of residence             -0.21             -0.15              -0.08                            100
Activity of daily living      -0.24             -0.11              -0.14                            98
Previous hypertension          0.16              0.14               0.06                            90
Uses diuretics                -0.04             -0.03               0.06                            85
Year of interview              0.18              0.09               0.18                            100
Year of blood sample           0.17              0.11               0.16                            89

V: Prediction variables
Serum albumin                  0.24              0.18               0.02                            67
Cognition                      0.24              0.18               0.07                            78
Current hypertension           0.23              0.17               0.01                            83
Previous hypertension          0.22              0.19               0.04                            83
Survival year                  0.21              0.15               0.14                            100
ln(survival date)              0.20              0.15               0.09                            100
Score GHQ                     -0.19             -0.18              -0.01                            83
Serum cholesterol              0.17              0.17               0.12                            65
Fraction erythrocytes          0.17              0.20               0.08                            70
Treated by specialist         -0.16             -0.11               0.02                            100
Hemoglobin                     0.15              0.18               0.08                            70
Hematocrit                     0.11              0.18               0.10                            70

Table 3. Correlation of variables for imputation

First, the variables that appear in the complete-data model are included: blood pressure, survival, sex, and age. Then, the variables related to non-response are included: type of residence, activity of daily living, previous hypertension, use of diuretics, year of interview, and year of blood sample.
Most importantly, prediction variables are selected when their absolute correlation with SBP or DBP exceeds 0.15, based on Table 3. Finally, variables with fewer than 50% usable cases are removed. Although a total of 23 predictors are selected by the correlation values, the composition of the survival model is also considered, in a two-step approach, which leads to selecting log(time). Hence, the correlations between the survival model components are shown below.

Figure 3.1: Correlations between the cumulative death hazard $H_0(T)$, survival time $T$, $\log(T)$, SBP and DBP

In figure 3.1, the high correlation may be caused by the fact that nearly everyone in this cohort has died, so the percentage of censoring is low. We can observe that the correlation between $\log(T)$ and blood pressure is higher than for $H_0(T)$ or $T$, so it makes sense to add $\log(T)$ as an additional predictor.

3.2 Multiple imputation: algorithm

Following the Bayesian approach, a value $\theta^*$ of the parameter is drawn and then used to draw $Y_{mis}^*$ under the specified model.

\[ P(Y_{mis} \mid X, R) = \int P(Y_{mis} \mid X, R, \theta)\, P(\theta \mid X, R)\, d\theta, \qquad \theta = (\beta, \log \sigma) \]

1. Draw a value $\theta^*$ from $P(\theta \mid X, R)$, which gives $P(Y_{mis} \mid X, R, \theta = \theta^*)$.
2. Draw a value $Y_{mis}^*$ from its conditional posterior distribution given $\theta^*$.
3. Multiple imputation: repeat $m$ times to obtain draws from the posterior distribution of $Y_{mis}$.

Among the various methods, regression imputation incorporates knowledge of other variables with the idea of producing smarter imputations. The first step involves building a model from the observed data. Predictions for the incomplete cases are then calculated under the fitted model, and serve as replacements for the missing data. The regression model based imputation in the univariate approach is summarized as follows.

1. Obtain $\hat{\beta}$ and $\hat{Y}_{obs}$ from linear regression.
• Take $V = (X_{obs}' X_{obs})^{-1}$, so that $\hat{\beta} = V X_{obs}' Y_{obs}$ and $\hat{Y}_{obs} = X_{obs} \hat{\beta}$.
2. Make a random draw from the posterior distribution of $\beta$.
• Calculate $\hat{\beta}^* = \hat{\beta} + \sigma^* V^{1/2} D$.
– Draw a $p$-dimensional Normal random vector $D \sim N(0, I_p)$, where $p = 23$ is the number of predictors.
– Use $\sigma^{*2} = (Y_{obs} - \hat{Y}_{obs})'(Y_{obs} - \hat{Y}_{obs}) / g$, where the random variable $g$ is drawn from the $\chi^2_{n_{obs} - p}$ distribution.
– $V^{1/2}$ is $\mathrm{diag}(V)^{1/2}$, obtained by Cholesky decomposition.
• Similarity between cases is measured by the distance between the predicted means of BP.
– Take predicted values $\hat{Y}_{mis} = X_{mis} \hat{\beta}^*$.
– For each missing value $i = 1, \dots, n_{mis}$, find the respondent whose $\hat{Y}_{obs}$ is closest to $\hat{Y}_{mis,i}$, and take that respondent's $Y_{obs}$ as the imputed value.
3. Repeat $m$ = 3 to 5 times to create $Y_{mis}^{(1)}, Y_{mis}^{(2)}, \dots, Y_{mis}^{(m)}$.
• This incorporates uncertainty due to deviations, and also reflects variation due to finite sampling.

The key part is generating samples from the multivariate normal distribution, where the number of variables equals the rank of the identity matrix. In the univariate approach, the goal is to bring the difference between the imputed values and the model-based predicted values close to 0, where values are imputed conditionally on the previously imputed values.

For visualization and for the imputation approach in R, two examples are shown from the textbook 'Flexible Imputation of Missing Data'. Suppose that we predict Ozone by linear regression from Solar.R.

library(mice)
# Regress Ozone on Solar.R using the observed cases
fit <- lm(Ozone ~ Solar.R, data = airquality)
# Predictions for the incomplete cases
pred <- predict(fit, newdata = ic(airquality))
data <- airquality[, c("Ozone", "Solar.R")]
# Regression imputation: replace missing values by predicted values
imp <- mice(data, method = "norm.predict", seed = 1, m = 1, print = FALSE)
xyplot(imp, Ozone ~ Solar.R)

Figure 3.2: Blue indicates the observed data, red indicates the imputed values.

The imputed values correspond to the most likely values under the model. However, the ensemble of imputed values varies less than the observed values.
It may be that each of the individual points is the best under the model, but it is very unlikely that the real (but unobserved) values of Ozone would have had this distribution. Imputing predicted values also has an effect on the correlation. The red points have a correlation of 1 since they are located on a line. If the red and blue dots are combined, then the correlation increases from 0.35 to 0.39. Note that this upward bias grows with the percentage of missing ozone levels (here 24%). Some problems of univariate imputation are summarized after the second example.

The second example shows the R code and the imputed values used to arrive at the linear regression model fitted to the imputed data. In step 0, the missing data are identified.

> head(nhanes)
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113
6   3   NA  NA 184

Then, in step 1, the linear model and its predictor are specified for the imputed data with the R function with(). Here $m = 10$ imputed datasets are generated.

> imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
> fit <- with(imp, lm(bmi ~ age))
> head(imp$imp)
$age
 [1] 1  2  3  4  5  6  7  8  9  10
<0 rows> (or 0-length row.names)

$bmi
     1    2    3    4    5    6    7    8    9   10
1 27.2 21.7 25.5 22.5 28.7 30.1 27.4 22.5 22.5 27.2
3 22.0 30.1 20.4 33.2 27.2 35.3 29.6 22.0 27.2 28.7
4 21.7 20.4 27.2 25.5 21.7 25.5 22.7 22.5 24.9 22.5

$hyp
  1 2 3 4 5 6 7 8 9 10
1 1 1 1 1 1 1 1 1 1  1
4 1 2 2 2 1 1 2 1 2  1

$chl
    1   2   3   4   5   6   7   8   9  10
1 187 238 186 238 187 187 187 131 238 187
4 206 204 204 184 206 187 218 186 204 284

Finally, the $m = 10$ fitted models are combined by the pooling function to obtain the pooled model.
> est <- pool(fit)
> est
Class: mipo    m = 10
         term  m  estimate      ubar         b        t dfcom df
1 (Intercept) 10 29.621111 3.4810048 1.4312926 5.055427
2         age 10 -1.802222 0.9257992 0.2759968 1.229396

Univariate imputation has several problems. A circular dependence can occur, where $Y_j^{mis}$ depends on $Y_h^{mis}$, which in turn depends on $Y_j^{mis}$, $j \ne h$, because $Y_j$ and $Y_h$ are correlated. With large $p$ and small $n$, collinearity or empty cells can occur and cause problems in the imputation. Non-linear relations are not considered, and combinations of imputed values can be problematic. However, the multivariate missing data algorithm in mice is different from the model-based multiple imputation algorithm.

3.3 Multivariate imputation

In the paper, the multivariate problem is split into a series of univariate problems, and an iterative algorithm is applied to draw samples from a sequence of univariate linear regressions. Although a simple multivariate imputation method is based on a monotone draw-input mechanism, the mice algorithm starts with a random draw from the observed data, and imputes the incomplete data in a variable-by-variable fashion. Hence, one iteration consists of one cycle through all $Y_j$. Each incomplete entry is initialized by filling in a random draw from $Y_{obs}$.

• Regression switching: executed $m$ times in parallel, where $Y_j$ is imputed conditional on all other data and $Z$, $U$, $V$.
• Gibbs sampler: under the condition that the draws converge to the multivariate posterior density $P(Y_{mis} \mid Y_{obs}, X, R)$, it iterates for about 20 steps (partially incompatible MCMC).
• Monte-Carlo simulation is used to draw from the multivariate distribution of $Y_{mis}^*$, by repeatedly drawing from the conditional densities.
• Let $Y_{mis} = \{Y_{mis}(1), \dots, Y_{mis}(k)\}$, $k \le p$, be a partition of a $p$-dimensional random variable, where $Y_{mis}(j)$ is a missing entry and $Y_{obs} \cup X$ is the multi-dimensional variable in $P(Y_{mis} \mid Y_{obs}; X)$.
• The unknown parameters $\theta_j$ of the imputation models are collected as $\theta = (\theta_1, \dots, \theta_k)$, with prior density $P(\theta) = P_1(\theta_1) \cdots P_k(\theta_k)$.
• For likelihood inference, the unknown parameters $\theta = (\theta_1, \dots, \theta_k)$ of the imputation models should be distinct.

Starting from the fill-in $Y_{mis}^{(0)}$, the algorithm generates an iterative sequence of imputations $Y_{mis}^{(1)}, \dots, Y_{mis}^{(t)}$, where the imputation of $Y_{mis}^{(t)}(j)$ is conditional on the observed data and on the most recently imputed data of $Y_{mis}(i)$, $i \ne j$.

1. Specify an imputation model $P(Y_j^{mis} \mid Y_j^{obs}, Y_{-j}, R)$ for each variable $Y_j$.
2. For each $j$, fill in starting imputations $Y_j^{(0)}$ for the missing entries by random draws from the observed values $Y_j^{obs}$.
3. Repeat for $t = 1, \dots, T$.
4. Repeat for $j = 1, \dots, p$.
5. Define $Y_{-j}^{(t)} = (Y_1^{(t)}, \dots, Y_{j-1}^{(t)}, Y_{j+1}^{(t-1)}, \dots, Y_p^{(t-1)})$ as the currently complete data except $Y_j$.
6. Draw $\theta_j^{(t)}$ (posterior step).
7. Draw imputations $Y_{mis}^{(t)}(j)$ (imputation step).

3.3.1 Generating algorithm: Gibbs sampling

The Gibbs sampler is used under the condition that the draws converge to the multivariate posterior density $P(Y_{mis} \mid Y_{obs}, X, R)$, and it iterates for about 20 steps. Monte-Carlo simulation is used to draw from the multivariate distribution of $Y_{mis}^*$, by repeatedly drawing from the conditional densities. Let $Y_{mis} = \{Y_{mis}(1), \dots, Y_{mis}(k)\}$, $k \le p$, be a partition of a $p$-dimensional random variable, where $Y_{mis}(j)$ is a missing entry and $Y_{obs} \cup X$ is the multi-dimensional variable in $P(Y_{mis} \mid Y_{obs}; X)$.
Starting from $Y_{mis}^{(0)}$, the sampler generates an iterative sequence of imputations $Y_{mis}^{(1)}, \dots, Y_{mis}^{(t)}$, where the imputation of $Y_{mis}^{(t)}(j)$ is conditional on the observed data and on the most recently imputed data of $Y_{mis}(i)$, $i \ne j$:

\[ Y_{mis}^{(t)}(1) \sim P(Y_{mis}(1) \mid Y_{obs}(1), Y_2^{(t-1)}, \dots, Y_k^{(t-1)}; X) \]
\[ Y_{mis}^{(t)}(2) \sim P(Y_{mis}(2) \mid Y_1^{(t)}, Y_{obs}(2), Y_3^{(t-1)}, \dots, Y_k^{(t-1)}; X) \]
\[ Y_{mis}^{(t)}(j) \sim P(Y_{mis}(j) \mid Y_1^{(t)}, \dots, Y_{j-1}^{(t)}, Y_{obs}(j), Y_{j+1}^{(t-1)}, \dots, Y_k^{(t-1)}; X) \]
\[ Y_{mis}^{(t)}(k) \sim P(Y_{mis}(k) \mid Y_1^{(t)}, Y_2^{(t)}, \dots, Y_{k-1}^{(t)}, Y_{obs}(k); X) \]

Note that $Y_{j+1}^{(t-1)}, \dots, Y_k^{(t-1)}$ are the completed data for $y_{j+1}, \dots, y_k$ from the previous iteration, whereas $Y_1^{(t)}, \dots, Y_{j-1}^{(t)}$ have already been completed in the current iteration. The data that $Y_{mis}^{(t)}(j)$ is conditioned on are therefore represented by two blocks, one from the current and one from the previous iteration.

When the regression model of $y_j$ on $y_1, \dots, y_{j-1}, y_{j+1}, \dots, y_k$ and $x_1, \dots, x_q$ for the complete data has a known parameter $\theta_j$, the posterior predictive distribution of $Y_{mis}^{(t)}(j)$ is specified. To reflect the uncertainty about $\theta_j$ given the complete data, $\theta_j$ is drawn from its posterior distribution given the most recently completed data and is then used to generate $Y_{mis}^{(t)}(j)$. The full Gibbs sampling algorithm generates $Y_{mis}^{(t)}$ as follows:

\[ \text{Draw } \theta_1^{(t)} \sim P(\theta_1 \mid [Y_1^{(t-1)}, \dots, Y_k^{(t-1)}, X]_{obs(1)}) \]
\[ \text{Impute } Y_{mis}^{(t)}(1) \sim P(Y_{mis}(1) \mid Y_2^{(t-1)}, \dots, Y_k^{(t-1)}, X; \theta_1^{(t)}) \]
\[ \text{Draw } \theta_2^{(t)} \sim P(\theta_2 \mid [Y_1^{(t)}, Y_2^{(t-1)}, \dots, Y_k^{(t-1)}, X]_{obs(2)}) \]
\[ \text{Impute } Y_{mis}^{(t)}(2) \sim P(Y_{mis}(2) \mid Y_1^{(t)}, Y_3^{(t-1)}, \dots, Y_k^{(t-1)}, X; \theta_2^{(t)}) \]
\[ \vdots \]
\[ \text{Draw } \theta_j^{(t)} \sim P(\theta_j \mid [Y_1^{(t)}, Y_2^{(t)}, \dots, Y_{j-1}^{(t)}, Y_j^{(t-1)}, \dots, Y_k^{(t-1)}, X]_{obs(j)}) \]
\[ \text{Impute } Y_{mis}^{(t)}(j) \sim P(Y_{mis}(j) \mid Y_1^{(t)}, \dots, Y_{j-1}^{(t)}, Y_{j+1}^{(t-1)}, \dots, Y_k^{(t-1)}, X; \theta_j^{(t)}) \]
\[ \vdots \]
\[ \text{Draw } \theta_k^{(t)} \sim P(\theta_k \mid [Y_1^{(t)}, Y_2^{(t)}, \dots, Y_{k-1}^{(t)}, Y_k^{(t-1)}, X]_{obs(k)}) \]
\[ \text{Impute } Y_{mis}^{(t)}(k) \sim P(Y_{mis}(k) \mid Y_1^{(t)}, Y_2^{(t)}, \dots, Y_{k-1}^{(t)}, X; \theta_k^{(t)}) \]

Here $[\,\cdot\,]_{obs(j)}$ denotes the rows of the completed data that correspond to observed values of $y_j$.

When the missing data pattern is monotone, there is no need for iteration, because convergence is immediate. The mice package in R also handles the multivariate method differently from the univariate case. A simple example is shown below. First, the nhanes2 data in mice contain 3 out of 27 missing values that destroy the monotone pattern: one for hyp (in row 6) and two for bmi (in rows 3 and 6).

> library(mice)
> data(nhanes2)
> nhanes2   # first six rows shown
    age  bmi  hyp chl
1 20-39   NA <NA>  NA
2 40-59 22.7   no 187
3 20-39   NA   no 187
4 60-99   NA <NA>  NA
5 20-39 20.4   no 113
6 60-99   NA <NA> 184
> length(nhanes2[is.na(nhanes2)])
[1] 27

The draw phase is specified with the Gibbs sampling method, where the maximum number of iterations is set to 1. In this step only the particular missing values that destroy the monotone pattern are imputed, so these 3 values are imputed from a simple random sample.

> where <- make.where(nhanes2, "none")
> where[6, "hyp"] <- TRUE
> where[c(3, 6), "bmi"] <- TRUE
> imp1 <- mice(nhanes2, where = where,
+              method = "sample", seed = 21991, maxit = 1,
+              print = FALSE)
> data <- mice::complete(imp1)
> data
    age  bmi  hyp chl
1 20-39   NA <NA>  NA
2 40-59 22.7   no 187
3 20-39 26.3   no 187
4 60-99   NA <NA>  NA
5 20-39 20.4   no 113
6 60-99 22.7   no 184

Note that the imputed value for the missing hyp data in row 6 could also depend on bmi and chl, but in this procedure both predictors are ignored.
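The simple random sample step can be imitated in base R without mice. The following is a minimal sketch under stated assumptions: `impute_sample()` is a hypothetical helper written for this illustration (it is not a mice function), and it fills every missing entry of a vector, whereas the `where` matrix above restricts the imputation to three selected cells.

```r
# Hypothetical helper (not part of mice): replace each missing entry of x by a
# value drawn at random, with replacement, from the observed entries of x.
impute_sample <- function(x) {
  miss <- is.na(x)
  obs <- x[!miss]
  # Nothing to do if there are no missing values or no donors.
  if (!any(miss) || length(obs) == 0L) return(x)
  # Sample positions rather than values, so a single donor is handled safely.
  x[miss] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
  x
}

set.seed(21991)
bmi <- c(NA, 22.7, NA, NA, 20.4, NA)  # first six bmi values of nhanes2
bmi_completed <- impute_sample(bmi)
print(bmi_completed)
```

Because the donors are drawn without looking at any other variable, the sketch shares the limitation noted above: predictors such as bmi and chl play no role in the imputed value.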
The remaining missing data are then imputed with the monotone draw-input mechanism described before.

> imp2 <- mice(data, maxit = 1,
+              visitSequence = "monotone",
+              print = FALSE)
> data2 <- mice::complete(imp2)
> data2
    age  bmi hyp chl
1 20-39 35.3  no 206
2 40-59 22.7  no 187
3 20-39 26.3  no 187
4 60-99 24.9  no 186
5 20-39 20.4  no 113
6 60-99 22.7  no 184

3.4 Pooling

The purpose is to investigate the robustness of the MAR assumption against violation, that is, to determine whether the relation between BP and mortality is affected by non-response.

1. Suppose the BP distribution is known, and apply Bayes' rule to calculate the distributions $P(BP \mid R = 1)$ and $P(BP \mid R = 0)$.
2. Both are normal, but their means differ by $\delta = 151.6 - 138.6 = 13$.
3. Generate imputations by subtracting the amount $\delta$ from random draws of $P(BP \mid R = 1)$.

This is incorporated in the model $Y_1 = X\beta + (1 - R_1)\delta + \epsilon$, where $R_1$ is the response indicator for systolic BP. With the $\delta$-adjustment, the regression model postulates a mean difference, $\delta$, between responders and non-responders. The adjustment is applied to systolic BP only, since SBP and DBP are correlated. The values of $\delta$ are chosen as 0, which corresponds to the MAR assumption, and -5, -10, -15, -20 for the MNAR (NMAR) assumption.

The pooling phase combines the $m$ analysis results and variance estimates. The combined point estimate is $\hat{Q} = \frac{1}{m} \sum_{l=1}^{m} \hat{Q}_l$, where $\hat{Q}_l$ is a $k$-dimensional column vector obtained from the $l$th imputed dataset ($l \in [1, m]$). The total covariance, $T = \bar{U} + (1 + \frac{1}{m}) B$, has 3 sources of variation:

• Complete data variance: $\bar{U} = \frac{1}{m} \sum_{l=1}^{m} U_l$, where $U_l$ is the covariance matrix of $\hat{Q}_l$ obtained for the $l$th imputed dataset (conventional variability).
• Standard unbiased estimate of variance: π΅ = Íπ π=1 ˆ 0 (πˆ π −π) ˆ ( πˆ π −π) (π−1) (extra variance from missing values in the sample) • Simulation variance: π΅/π caused by π estimated for finite π (variance being systematic). Note that the within sample variance is given as π (π|ππππ ) = πΈ [π (π|ππππ , ππππ )|ππππ ] + π [πΈ (π|ππππ , ππππ )|ππππ ]. Using the total covariance, a relative risk of 95% confidence interval in the proportional hazards model is better estimated by the given in the range √ of exp( πˆ ± 1.96 π). A realized difference in means of the observed and imputed SBP (mmHg) data under various πΏ-adjustments. Note that the number of multiple imputations is π = 5. > delta <- c(0, -5, -10, -15, -20) > post <- imp.qp$post > imp.all.undamped <- vector("list", length(delta)) > for (i in 1:length(delta)) { + d <- delta[i] + cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d) + post["rrsyst"] <- cmd + imp <- mice(data2, pred = pred, post = post, maxit = 10, + seed = i * 22) + imp.all.undamped[[i]] <- imp } Also, a mean of the observed SBP is152.9 mmHg. The difference between the mean SBP with πΏ-adjustment compared to the observed mean SBP is summarized in the table below. CHAPTER 3. METHODOLOGY 25 STATS 756 Kyuson Lim πΏ for SBP Avg. Difference 0 -8.2 -5 -12.3 -10 -20.7 -15 -26.1 -20 -31.5 Table 4. Realized difference in means The strength of the effect depends on the correlation between SBP and the variable. Under MAR assumption, the imputations are on average 8.2mmHg lower than the observed blood pressure. For example, πΏ = −10mmHg means the magnitude of difference in MAR case, −20.7 + 8.2 = −12.5mmHg, which is larger in size than πΏ. While πΏ = −5 has a small effect on the mean, the πΏ = −20 has too extreme effect for us to take the mean SBP value where πΏ = 0. 26 CHAPTER 3. METHODOLOGY Chapter 4 Simulation study and summary As a summary, the standard multiple imputation scheme of stepwise model selection consists of three phases: 1. 
Imputation of the missing data m times. 2. Analysis of the π imputed datasets. 3. Pooling of the parameters across π analyses. 4.1 Simulation study Figure 4.1: Scatterplot of systolic and diastolic blood pressure from the first imputation. The left-hand-side plot was obtained after just running ‘mice’ on the data without any data screening. The right-hand-side plot is the result after cleaning the data and setting up the predictor matrix with ‘quickpred()’ (quick selection of predictors) function in 27 STATS 756 Kyuson Lim mice. Finally, determined values in column size and correlation threshold of automatic process of ‘quickpred()’ such that the average number of predictors is around 25. 4.1.1 Mean BP After the pooling process, the mean value corresponding to difference πΏ-adjustments are summarized as follows. π Observed BP Imputed BP SBP πΏ 835 DBP Mean SD Mean SD 152.9 25.7 82.8 13.1 121 0 151.1 26.2 81.5 14 121 -5 142.3 24.6 78.4 13.7 121 -10 135.9 24.7 78.2 12.8 121 -15 128.6 25 75.3 12.9 121 -20 122.3 25.2 74 12.1 Table 5. Imputed BP are pooled over π = 5 multiple imputation Under MAR assumption which correspond to the value πΏ at 0, a π₯¯ observed SBP = 152.9 and π₯¯ ππ΅π = 151.1 for difference of 1.8 (mmHg) as well as π₯¯ observed DBP = 82.8 and π₯¯ π·π΅π = 81.5 for difference of 1.3 (mmHg). From the table, there is a decreasing trend for πΏ = −5, −10, −15, −20 in {142.3, 135.9, 128.6, 122.3}. Only small difference in mortality exists, even among non-response models with different πΏ’s as risk estimates are insensitive to missing data. At last, a relative mortality risks for Cox proportional hazard model is estimated with the age and sex. 4.1.2 Relative mortality risk estimates: SBP and DBP A relative mortality risks for Cox proportional hazard model is estimated with the covariates, including age and sex. After the pooling phase, an optimal values of variation 28 CHAPTER 4. 
SIMULATION STUDY AND SUMMARY Kyuson Lim STATS 756 that correspond of the 95% confidence interval Relative mortality risk estimates for both SBP and DBP is summarized as follows. At πΏ = 0, SBP groups < 125mmHg has risk ratio of 1.76, meaning that the mortality risk (after correction for sex and age) in the group is 1.76 times the risk of the reference group 125 − 140 mmHg. Figure 4.2: 95% confidence interval Relative mortality risk estimates: SBP and DBP An imputed BP are lowered by πΏ but the risk estimated does not change much. Also, a hazard ratio estimates for different πΏ are close. A mortality between responders and non-responders are simply too small for serious impact on estimates. Thus, we are able to conclude missing data hardly influence the risk estimates. 4.1.3 Pattern-mixture Model Finally, a comparison between imputed data and observed data could be shown as well as the combined pattern-mixture model in one plot. Hence, the pattern-mixture model decomposes π(π , π ) = π(π |π )π(π ) = π(π |π = 1)π(π = 1) + π(π |π = 0)π(π = 0) for the observational probability by the Baye’s rule, which emphasize that the combined distribution is a mixed distributions of π in the responders and non-responders. For example, the density at a point is computed to be π(π = 100) = 0.015 × 0.878 + 0.058 × 0.122 = 0.02 which is shown as a graph in the left-side. By Bayes rule, the density of systolic BP is calculated based on the decomposition of π(π = 1|π = π¦) = π(π = π¦|π = 1)π(π = 1)/π(π = π¦), where the marginal distribution of π is π(π = π¦) = π(π = π¦|π = 1)π(π = 1) + π(π = π¦|π = 0)π(π = 0). Also, the observable probability is calculated to be π(π = 1|π ) = 0.015 × 0.878/0.02 = 0.65 for a CHAPTER 4. SIMULATION STUDY AND SUMMARY 29 STATS 756 Kyuson Lim particular point, while non-observable probability is π(π = 0|π ) = 0.058×0.122/0.02 = 0.35. 
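The arithmetic of this example can be checked with a few lines of base R. This is only a sketch that uses the four numbers quoted in the text (the two conditional densities at 100 mmHg and the two response proportions); the variable names are hypothetical and nothing here is taken from the mice package.

```r
# Numbers quoted in the text for Y = 100 mmHg (variable names are hypothetical)
f_resp    <- 0.015   # density of Y at 100 among responders,     P(Y = 100 | R = 1)
f_nonresp <- 0.058   # density of Y at 100 among non-responders, P(Y = 100 | R = 0)
p_r1      <- 0.878   # proportion of responders,     P(R = 1)
p_r0      <- 0.122   # proportion of non-responders, P(R = 0)

# Mixture (marginal) density of Y at 100 mmHg
f_y <- f_resp * p_r1 + f_nonresp * p_r0

# Bayes' rule: probability of being observed or missing given Y = 100
p_resp    <- f_resp    * p_r1 / f_y
p_nonresp <- f_nonresp * p_r0 / f_y

# Rounds to 0.02, 0.65 and 0.35
round(c(f_y = f_y, p_resp = p_resp, p_nonresp = p_nonresp), 2)
```

Dividing by the unrounded marginal density (about 0.0202) rather than by 0.02 is what makes the two conditional probabilities sum exactly to one.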
Figure 4.3: Graphic representation of the response mechanism for SBP

The right-hand plot shows the distributions $P(Y \mid R)$ of the observed (blue) and missing (red) data in the pattern-mixture model. The hypothetically complete distribution is the black curve. The distribution of blood pressure in the group with missing blood pressures differs slightly, both in form and location. However, according to the Kolmogorov–Smirnov (KS) test and the empirical cdf, the observed and imputed values do not differ drastically. Hence, missingness has only a slight effect on the combined distribution.

Chapter 5 Appendix: R codes

# Packages used below
library(mice)      # quickpred(), mice(), complete(), ici(), mdc()
library(lattice)   # xyplot(), strip.custom()
library(survival)  # coxph(), Surv(), strata()
# NOTE: dataproject, data, ini, fx and imp are used below but are not
# defined in this listing; they must already exist in the workspace.

# Data exploration
library(foreign)
file.sas <- file.path(dataproject, "original/master85.xport")
## xport.info <- lookup.xport(file.sas)
original.sas <- read.xport(file.sas)
names(original.sas) <- tolower(names(original.sas))
dim(original.sas)

# uninteresting or problematic variables
v1 <- names(ini$nmis[ini$nmis == 0])
outlist1 <- v1[c(1, 3:5, 7:10, 16:47, 51:60, 62, 64:65, 69:72)]
length(outlist1)

# Outflux and Influx
outlist2 <- row.names(fx)[fx$outflux < 0.5]
length(outlist2)
outlist4 <- as.character(ini$loggedEvents[, "out"])

# Quick predictor
outlist <- unique(c(outlist1, outlist2, outlist4))
length(outlist)
data2 <- data[, !names(data) %in% outlist]
inlist <- c("sex", "lftanam", "rrsyst", "rrdiast")
pred <- quickpred(data2, minpuc = 0.5, include = inlist)

## Generating the imputations
imp.qp <- mice(data2, pred = pred, seed = 29725)

# plot comparison for missing data vs. observed data in KM curve
vnames <- c("rrsyst", "rrdiast")
cd1 <- mice::complete(imp)[, vnames]
cd2 <- mice::complete(imp.qp)[, vnames]
typ <- factor(rep(c("blind imputation", "quickpred"), each = nrow(cd1)))
mis <- ici(data2[, vnames])
mis <- is.na(imp$data$rrsyst) | is.na(imp$data$rrdiast)
cd <- data.frame(typ = typ, mis = mis, rbind(cd1, cd2))
xyplot(jitter(rrdiast, 10) ~ jitter(rrsyst, 10) | typ,
       data = cd, groups = mis,
       col = c(mdc(1), mdc(2)),
       xlab = "Systolic BP (mmHg)",
       type = c("g", "p"), ylab = "Diastolic BP (mmHg)",
       pch = c(1, 19),
       strip = strip.custom(bg = "grey95"),
       scales = list(alternating = 1, tck = c(1, 0)))

# delta-adjustment
delta <- c(0, -5, -10, -15, -20)
post <- imp.qp$post
imp.all.undamped <- vector("list", length(delta))
for (i in 1:length(delta)) {
  d <- delta[i]
  cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
  post["rrsyst"] <- cmd
  imp <- mice(data2, pred = pred, post = post, maxit = 10, seed = i * 22)
  imp.all.undamped[[i]] <- imp
}

# Hazard ratio estimates
cda <- expression(
  sbpgp <- cut(rrsyst, breaks = c(50, 124, 144, 164, 184, 200, 500)),
  agegp <- cut(lftanam, breaks = c(85, 90, 95, 110)),
  dead <- 1 - dwa,
  coxph(Surv(survda, dead) ~ C(sbpgp, contr.treatment(6, base = 3)) + strata(sexe, agegp)))
# the listing originally read imp.all.damped, which is never created here
imp <- imp.all.undamped[[1]]
fit <- with(imp, cda)

# chi-square of independence plot
# significance plot
library(ggplot2)
library(forcats)
# the next line is truncated in the source
rwo = c('Age', 'Type of residence', 'Activities of daily living', 'History of hypertension', 'Uses of d
dat <- data.frame(
  Covariate = rep(x = c(' '), times = 5),
  Question = rwo,
  Significance = c(1,1,1,0,1)
)
dat$groups <- cut(dat$Significance,            # Add group column
                  breaks = c(-0.1, 0.01, 1.1))
textcol <- "grey40"
ggplot(data = dat, aes(x = fct_inorder(Question), y = Covariate, fill = groups)) +
  geom_tile(colour = "white", size=1.5) +
  scale_fill_manual(breaks = levels(dat$groups),
                    values = c("grey", "red"), guide = guide_legend(reverse = TRUE),
                    labels = c('Insignificant, p-value > 0.05', 'Significant, p-value < 0.05'))+
  scale_y_discrete(expand=c(0,0))+
  scale_x_discrete(expand=c(0,0),breaks=rwo)+
  theme_grey(base_size=10)+
  theme(legend.position="right",legend.direction="vertical",
        legend.title=element_text(colour=textcol),
        legend.text=element_text(colour=textcol,size=10,face="bold"),
        axis.text.x=element_text(size=20, colour=textcol, angle = 90, vjust = 0.2, hjust=0.2),
        axis.text.y=element_text(size=23, vjust=0.2, colour=textcol),
        axis.ticks.x=element_blank(),
        plot.title=element_text(colour=textcol, hjust=0, size=14, face="bold"))+
  labs(fill = "Significance")+xlab(NULL)+ylab(NULL)+coord_flip()
ggsave('p.png', width=7, height=6)

# correlation plot
library(corrplot)
library("pheatmap")
library(ComplexHeatmap)
M=data.frame(matrix(nrow=24, ncol=3))
# two of the next three lines are truncated in the source
rownames(M)<-c('Systolic BP', 'Diastolic BP', 'Survival date', 'Censoring flag', 'Sex','Age',
  'Type of residence', 'Activity of daily living', 'Previous hypertension', 'Uses diuretics', 'Year of intervi
  'Serum albumin', 'Cognition', 'Current hypertension', 'Current/Previous hypertension', 'Survival year', 'ln (s
  'Serum cholesterol', 'Fraction erythrocytes', 'Treated by specialist', 'Hemoglobin', 'Hematocrit')
M[1,]<-c(1.0,0.59,0)
M[2,]<-c(0.59,1.0,0)
M[3,]<-c(0.18, 0.14, 0.12)
M[4,]<-c( 0.13, 0.11, 0.08)
M[5,]=c(-0.1, -0.1, -0.04)
M[6,]=c(-0.11, -0.11, -0.14)
M[7,]=c(-0.21, -0.15, -0.08)
M[8,]=c(-0.24, -0.11, -0.14)
M[9,]=c(0.16, 0.14, 0.06)
M[10,]=c(-0.04, -0.03, 0.06)
M[11,]=c(0.18, 0.09, 0.18)
M[12,]=c(0.17, 0.11, 0.16)
M[13,]=c(0.24, 0.18, 0.02)
M[14,]=c(0.24, 0.18, 0.07)
M[15,]=c(0.23, 0.17, 0.01)
M[16,]=c(0.22, 0.19, 0.04)
M[17,]=c(0.21, 0.15, 0.14)
M[18,]=c(0.20, 0.15, 0.09)
M[19,]=c(-0.19, -0.18, -0.01)
M[20,]=c(0.17, 0.17, 0.12)
M[21,]=c(0.17, 0.20, 0.08)
M[22,]=c(-0.16, -0.11, 0.02)
M[23,]=c(0.15, 0.18, 0.08)
M[24,]=c(0.11, 0.18, 0.10)
colnames(M)=c('r(SBP)', 'r(DBP)', 'r(R1)')
M=as.matrix(M)

# Heatmap 2
ht2 = Heatmap(M, name = "ht2",
              col = circlize::colorRamp2(c(-0.25, 0, 1), c("skyblue", "white", "red")),
              column_names_gp = gpar(fontsize = 9))
ht2
corrplot(M, order = 'hclust', addrect = 2)
# testRes is defined before its first use (it came after it in the original listing).
# NOTE: it is computed from mtcars, so its p-value matrix does not match the dimensions of M.
testRes = cor.mtest(mtcars, conf.level = 0.95)
corrplot(M, p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
         addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE, addrect = 2)
corrplot(M, p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
         order = 'AOE', diag = FALSE, addrect = 3)
# NOTE: p1 is not defined in this listing
text(p1$x, p1$y, round(p1$corr, 2))

# hazard variable
M=data.frame(matrix(nrow=5, ncol=5))
rownames(M)<-c('H0(T)', 'T', 'log(T)', 'SBP', 'DBP')
colnames(M)<-c('H0(T)', 'T', 'log(T)', 'SBP', 'DBP')
M[1,]=c(1.000, 0.997, 0.830, 0.169, 0.137)
M[2,]=c(0.997, 1.000, 0.862, 0.176, 0.141)
M[3,]=c(0.830, 0.862, 1.000, 0.205, 0.151)
M[4,]=c(0.169, 0.176, 0.205, 1.000, 0.592)
M[5,]=c(0.137, 0.141, 0.151, 0.592, 1.000)
M=as.matrix(M)
corrplot(M, method = 'color', type = 'lower', insig='blank',
         addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE)
corrplot(M, method = 'circle', type = 'lower', insig='blank',
         addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE)
# NOTE: the palette function col() and the matrix p.mat are not defined in this listing
corrplot(M, method="color", col=col(200),
         diag=FALSE,
         type="upper", order="hclust",
         title='Correlations between hazard H0(T), survival time T, log(T), SBP, DBP',
         addCoef.col = "black", # Add coefficient of correlation
         # Combine with significance
         p.mat = p.mat, sig.level = 0.05, insig = "blank",
         # hide correlation coefficient on the principal diagonal
         mar=c(0,0,1,0)
)