Predicting Mortality for Patients in Critical Care: a Univariate Flagging Approach

by Mallory Sheth

B.A., Stanford University (2008)

Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of Master of Science in Operations Research at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2015

© 2015 Mallory Sheth. All rights reserved. The author hereby grants to MIT and Draper Laboratory permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Signature of Author: Sloan School of Management, May 15, 2015

Certified by: Natasha Markuzon, The Charles Stark Draper Laboratory, Inc., Technical Supervisor

Certified by: Roy E. Welsch, Eastman Kodak Leaders for Global Operations Professor of Management, Professor of Statistics and Engineering Systems, Thesis Supervisor

Accepted by: Dimitris Bertsimas, Boeing Leaders for Global Operations Professor, Co-director, Operations Research Center

Predicting Mortality for Patients in Critical Care: a Univariate Flagging Approach

by Mallory Sheth

Submitted to the Sloan School of Management on May 15, 2015, in partial fulfillment of the requirements for the degree of Master of Science in Operations Research

Abstract

Predicting outcomes for critically ill patients is a topic of considerable interest. The most widely used models rely on data from early in a patient's stay to predict risk of death. Research has shown that using daily information, including trends in key variables, can improve predictions of patient prognosis, but doing so is challenging: the number of variables that must be considered is large, and increasingly complex modeling techniques are required.

The objective of this thesis is to build a mortality prediction system that improves upon current approaches. We aim to do this in two ways:

1. By incorporating a wider range of variables, including time-dependent features
2. By exploring different predictive modeling techniques beyond standard regression

We identify three promising approaches: a random forest model, a best subset regression containing just five variables, and a novel approach called the Univariate Flagging Algorithm (UFA). In this thesis, we show that all three methods significantly outperform a widely used mortality prediction approach, the Sequential Organ Failure Assessment (SOFA) score. However, we assert that UFA in particular is well suited to predicting mortality in critical care. It can detect optimal cut-points in data, easily scales to a large number of variables, is easy to interpret, is capable of predicting rare events, and is robust to noise and missing data. As such, we believe it is a valuable step toward individual patient survival estimates.

Technical Supervisor: Natasha Markuzon, The Charles Stark Draper Laboratory, Inc.

Thesis Supervisor: Roy E. Welsch, Eastman Kodak Leaders for Global Operations Professor of Management, Professor of Statistics and Engineering Systems

Acknowledgements

First of all, I would like to thank my advisers, Natasha Markuzon and Roy Welsch. Though they are both very busy, they always made me feel like a priority. They took time to review my findings, answer questions, and give thoughtful feedback, for which I am extremely grateful. This thesis would not have been possible without them.
I would also like to thank several members of MIT's Laboratory for Computational Physiology. Mornin Feng and Li-wei Lehman provided valuable help in accessing and understanding the MIMIC II database, while Roger Mark, Leo Celi, and Abdullah Chahin provided much-appreciated clinical expertise. In particular, I thank them for their contributions to the work on sepsis and auto-immune disease presented in the appendix of this thesis.

I want to thank the Charles Stark Draper Laboratory, and in particular the amazing staff in the Education Office, for the Draper Lab Fellow appointment that made my research possible.

I must also thank all of the people who provided me with support at MIT; in particular, the directors of the Operations Research Center, Dimitris Bertsimas and Patrick Jaillet, and the amazing administrative staff, Laura Rose and Andrew Carvalho. Thank you so much for welcoming me into such a great program and helping me along the way. I am also deeply grateful to the other ORC students; in particular, Mapi, Dan, Kevin, Jack, Zeb, and Will for their camaraderie during first-year classes when we were all just trying to get our bearings at MIT.

I am lucky to have an amazing and supportive family, and I thank my parents and brothers for keeping me sane, particularly while I was juggling classes, research, and wedding planning in my first year. I thank the wine night crew for acting as a constant sounding board and showing me how quickly lifelong friendships can form. Finally, I want to thank my husband, Sureel Sheth, for his unconditional love and support even in the most stressful times. I cannot wait to set off on our next adventure together.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Existing mortality prediction models
  1.3 Research objectives
  1.4 Thesis organization
2 Data
  2.1 MIMIC II database
  2.2 Variables for analysis
    2.2.1 Temporal features
  2.3 Populations for analysis
    2.3.1 Sepsis
3 Methodology
  3.1 Commonly used classification techniques
    3.1.1 Preprocessing approach
  3.2 Univariate Flagging Algorithm (UFA)
    3.2.1 Identifying optimal cut points
    3.2.2 Formal statement of algorithm
    3.2.3 Procedure to find optimal threshold
    3.2.4 UFA-based classifiers
    3.2.5 Example of UFA system: Iris dataset
  3.3 Evaluation of classifier performance
4 Results
  4.1 Challenges of clinical data
    4.1.1 Many clinical variables are asymmetric and long-tailed
    4.1.2 List of variables associated with mortality varies throughout ICU stay
    4.1.3 Clinical variables are often strongly correlated
    4.1.4 Most patients have missing data
    4.1.5 Summary of key insights
  4.2 Predictive performance of commonly used classification techniques
    4.2.1 Random forest performs well for predicting mortality
    4.2.2 Logistic regression performs competitively after preprocessing
    4.2.3 Best regression model contains just five variables
    4.2.4 Best classifiers outperform SOFA score by two days
    4.2.5 Little observed effect of temporal variables
    4.2.6 Summary of key insights
  4.3 Predictive performance of UFA system
    4.3.1 Automated thresholds align with subject matter expertise
    4.3.2 UFA-based classifiers significantly outperform SOFA score
    4.3.3 Summary of key insights
  4.4 Practical advantages of UFA system
    4.4.1 N-UFA classifier is robust to noisy and missing data
    4.4.2 UFA system generalizes well to other critical care populations
    4.4.3 N-UFA classifier maximizes sensitivity for low targets
    4.4.4 Summary of key insights
5 Discussion
6 Future research
  6.1 UFA
    6.1.1 Multiple testing problem
    6.1.2 Threshold uncertainty
    6.1.3 Multivariate approach
  6.2 Temporal features in critical care
    6.2.1 Identifying subsets of patients where trend is important
    6.2.2 Characterizing uncertainty through filtering
    6.2.3 Learning patient specific state parameters
7 Conclusion
References
Appendix A
Appendix B
List of Figures

Figure 1.1: Schematic for overall research approach
Figure 2.1: Schematic of MIMIC II data collection and database construction
Figure 3.1: Body temperature for adult sepsis patients
Figure 3.2: Number of high mortality and low mortality flags for adult sepsis patients
Figure 3.3: Petal lengths for different species of Iris
Figure 3.4: Automated thresholds for Iris versicolor and Iris virginica
Figure 3.5: Example of Iris classification using N-UFA
Figure 3.6: Example of confusion matrix
Figure 4.1: Box plots for SOFA and urine output stratified by in-hospital mortality
Figure 4.2: Number of variables with a mean p-value less than 0.05, by day
Figure 4.3: Pairwise correlation coefficients for SOFA variables
Figure 4.4: Test-set performance metrics for classifiers, full data
Figure 4.5: Random forest variable importance
Figure 4.6: Accuracy and AUROC before and after data preprocessing
Figure 4.7: Test-set performance metrics for classifiers, processed data
Figure 4.8: Test-set performance metrics for classifiers, balanced training data
Figure 4.9: AUROC for best classifiers and SOFA across time
Figure 4.10: Comparison of classifier performance with and without temporal features
Figure 4.11: Example UFA thresholds for adult sepsis patients
Figure 4.12: Number of high mortality and low mortality flags for adult sepsis patients
Figure 4.13: Accuracy and AUROC, full MIMIC population
Figure 4.14: AUROC for UFA-based classifiers and SOFA across time, full MIMIC population
Figure 4.15: Sensitivity and specificity, full MIMIC population
Figure 5.1: Visualization of number of flags classifier, sepsis subpopulation
Figure 6.1: Bootstrapped thresholds for low body temperature in adult sepsis patients
Figure 6.2: Examples of different cost penalties for two-dimensional flagging
Figure 6.3: Optimal cost penalties for two-dimensional flagging (1.43:1)
Figure 6.4: Trends in SOFA score by day and mortality status
Figure 6.5: Parallel coordinates plot summarizing SOFA across time
Figure 6.6: Parallel coordinates plot summarizing SOFA across time (SOFA of 5-12)
Figure 6.7: Example SIR results for a representative patient
Figure 6.8: Example of Hamilton regime switching model, CVP
Figure 6.9: Example of Hamilton regime switching model, TVTP vs. CVP

List of Tables

Table 1.1: Comparison of APACHE IV, SAPS 3, and SOFA
Table 2.1: List of available explanatory variables used from MIMIC II database
Table 2.2: List of engineered temporal features
Table 2.3: Most common primary diagnoses in the MIMIC II database
Table 3.1: List of variables for specification of UFA algorithm
Table 3.2: List of thresholds for Iris dataset, selected based on maximum absolute z-statistic
Table 4.1: Relationship between clinical variables and mortality, average p-value by day
Table 4.2: Summary of missing data by variable
Table 4.3: Summary of missing data by patient
Table 4.4: Final variable list after preprocessing
Table 4.5: Top three best subset regression models
Table 4.6: Test-set performance of best classifiers compared to SOFA score
Table 4.7: Top 20 most significant UFA thresholds
Table 4.8: Test-set performance for UFA-based classifiers (accuracy and AUROC)
Table 4.9: Test-set performance for UFA-based classifiers (sensitivity and specificity)
Table 4.10: Comparison of different classifiers with varying amounts of missing data
Table 4.11: Comparison of different classifiers with varying amounts of imprecise data
Table 4.12: Comparison of data-driven thresholds across different subpopulations
Table 4.13: AUROC for AMI and lung disease subpopulations
Table 4.14: Sensitivity for AMI and lung disease subpopulations
Table 4.15: Comparison of results for balanced and unbalanced data, AMI subpopulation
Table 6.1: Patient clusters based on static and trend SOFA data
Table 6.2: Test-set performance of classifiers based on regime switching model
Table B.1: All MIMIC II variables used for analysis, p-values by day
Table B.2: All significant UFA thresholds, adult sepsis patients
Table B.3: Comparison of classifiers with varying amounts of missing data, confidence intervals
Table B.4: Comparison of classifiers with varying amounts of imprecise data, confidence intervals

Chapter 1

1 Introduction

In this thesis, we investigate different approaches to predicting mortality for critical care patients. Using over 200 variables to characterize a patient's stay, we compare a variety of different predictive modeling techniques. We consider commonly used linear and non-linear classification methods such as regression, support vector machine (SVM), and random forest, as well as a novel approach called the Univariate Flagging Algorithm (UFA). Through our analysis, we identify key predictors of mortality for critical care patients and show that our classifiers can outperform current mortality prediction models by as much as two days. We focus in particular on UFA, as it easily scales to a large number of variables, is easy to interpret, and is robust to noise and missing data. We believe UFA is particularly suited to critical care applications and could be a valuable step toward individual patient survival estimates.

1.1 Motivation

Predicting outcomes for critically ill patients is a topic of considerable interest. Risk-adjusted mortality is the most commonly used measure of quality in critical care [1, 2], and good predictive models are needed to benchmark different physicians, facilities, or patient populations. Patient severity scores are frequently used in medical research as a potential confounder or to balance treatment and control groups [2]. On an individual patient level, prognostic models can be used to determine courses of treatment or to communicate with the patient's family about likely outcomes [3, 4].
Though several widely used models already exist, they have limitations. Many are intentionally simple and rely on only a small number of variables measured at one point in time, to minimize the burden of data collection [4]. However, as electronic medical records become increasingly widespread, the need for manual data extraction and manual score calculation will presumably be eliminated, permitting more complex approaches. Many of the existing models also use a regression framework, which ultimately limits the number of variables that can be considered and typically assumes a relationship between the explanatory variables and the outcome that is linear in the coefficients [5]. In this thesis, we move beyond regression and consider other modeling techniques that have the potential to make more accurate and timely predictions.

1.2 Existing mortality prediction models

In this section, we discuss three commonly used mortality prediction models: the Acute Physiology and Chronic Health Evaluation (APACHE) [6, 7], the Simplified Acute Physiology Score (SAPS) [8, 9], and the Sequential Organ Failure Assessment (SOFA) score [10]. We outline their similarities and differences, and identify possible areas for improvement.

APACHE and SAPS are both designed to use information available at ICU admission to predict patient outcomes. Both were developed in the 1980s and quickly adopted in intensive care [4]. Since their inception, they have been updated to use larger training datasets, to incorporate more sophisticated statistical techniques, and to recalibrate. The most recent versions are APACHE IV and SAPS 3, both developed in the last decade [7, 9].

SOFA, on the other hand, was designed to evaluate a patient throughout the ICU stay. It assigns a score of 0 (normal) to 4 (very abnormal) for six different organ systems on each day in the ICU [10]. Unlike APACHE and SAPS, SOFA was originally intended to characterize patient morbidity rather than to predict patient mortality; since its development, however, it has often been used for the latter purpose [11].

Table 1.1 compares the three systems along a variety of dimensions, including the data collection window, required variables, and methodological approach. APACHE IV contains the largest number of features, including the most physiological variables and 116 different acute diagnoses [1]. SAPS 3 and SOFA, on the other hand, both contain relatively few variables. The creators of SOFA specifically cited its simplicity and ease of calculation as advantages of the approach [10]. Proponents of SAPS 3 similarly argue that a small number of required variables minimizes complexity and encourages routine use [4]. In all three systems, the decision of which variables to include relied, at least in part, on clinical expertise and domain knowledge [6, 9, 10]. In the case of SOFA, the entire system was designed through clinical consensus. This highlights one possible area for improvement in mortality modeling: greater automation and data-driven variable selection.

Another possible area for improvement is exploring new modeling techniques for mortality prediction. In Table 1.1, we see that both APACHE IV and SAPS 3 rely on regression-based models. In recent years, however, there has been discussion about whether nonlinear machine learning approaches such as random forest may provide better performance [1].
Table 1.1: Comparison of APACHE IV, SAPS 3, and SOFA

APACHE IV
- Data collection window: First 24 hours
- Required variables: Physiologic data (n=17), admission diagnoses (n=116), comorbidities (n=6), age, hospital location and LOS prior to ICU admission, emergency surgery, thrombolytic therapy, mechanical ventilation
- Use of temporal variables: No
- Method for variable selection and weighting: Regression, clinical expertise, and previous knowledge
- Method for mortality prediction: Regression

SAPS 3
- Data collection window: Admission ± 1 hour
- Required variables: Physiologic data (n=10), acute diagnosis and anatomical site of surgeries (n=15), comorbidities (n=6), age, hospital location and LOS prior to ICU admission, vasopressor use, type of admission, infection
- Use of temporal variables: No
- Method for variable selection and weighting: Regression, clinical expertise, and definitions from other scoring systems
- Method for mortality prediction: Regression

SOFA
- Data collection window: Daily
- Required variables: Physiologic data (n=6), mechanical ventilation, use of vasopressors
- Use of temporal variables: No
- Method for variable selection and weighting: Clinical consensus
- Method for mortality prediction: n/a

Finally, none of these systems utilize temporal variables, though several studies have shown that use of daily information can improve predictions of patient prognosis [12]. Using the SOFA score as an example, the maximum daily SOFA score and delta SOFA (defined as maximum score minus admission score) correlate well with outcomes for patients in the ICU for two or more days. Ferreira et al. showed that mean SOFA and maximum SOFA for the ICU stay are better indicators of prognosis than the initial score; they also established that an increase in SOFA score over the first 48 hours is a strong predictor of mortality [13]. In 2014, Andrew Zimmerman, one of the researchers who developed APACHE, said "it is inconceivable that simple models could be effectively used for predicting individual patients' outcomes" and asserted that use of daily ICU information will be necessary to develop models powerful enough for individual prognosis [1].

1.3 Research objectives

The objective of this thesis is to build a mortality prediction system that can outperform current approaches. We aim to improve current methodologies in two ways:

1. By incorporating a wider range of variables, including time-dependent features
2. By exploring different predictive modeling techniques, including non-linear approaches such as random forest and our newly designed Univariate Flagging Algorithm (UFA)

Figure 1.1 provides a schematic of the overall approach. During the data extraction and feature engineering phase, we process a large number of variables, including time-dependent features, to include in our analysis (red box). Next, we explore the data to understand its unique characteristics and challenges (blue box). In this research, we use three different approaches to address the observed challenges.

Figure 1.1: Schematic for overall research approach

First, we experiment with a number of commonly used machine learning classification techniques and compare their performance on the task of mortality prediction. In Section 4.2.1, we identify random forest as the most promising approach. Second, we attempt to clean the data through additional preprocessing and variable selection. In Sections 4.2.2 and 4.2.3, we show that regression methods can work well if care is taken to properly customize the model.

The third and final approach is to design a classification system that accounts for the unique challenges of clinical data -- the Univariate Flagging Algorithm (UFA).
In Sections 4.3 and 4.4, we demonstrate that this system has the following characteristics:

- Strong predictive performance
- Scalable to a large number of variables, including more variables than observations
- Robust to missing and noisy data
- Easily customizable to different patient populations, care centers, or targets
- Able to predict rare events
- Interpretable

In this thesis, we compare UFA to random forest and best subset regression in terms of overall predictive performance. We also evaluate how all three methods perform relative to the well-established SOFA score, a daily measure of patient morbidity.

1.4 Thesis organization

Chapter 2 describes the data and defines the populations used for analysis. Chapter 3 outlines the methodology. It provides an overview of the four commonly used classification techniques employed in this analysis: logistic regression, random forest, decision tree, and SVM. It also describes our novel classifier, the UFA system, and provides an example using the well-known Iris dataset. Chapter 4 summarizes the results. It compares the predictive performance of a variety of mortality prediction models, including the UFA system. It also discusses several practical advantages of UFA, such as its ability to easily generalize to new populations. Chapter 5 is a high-level discussion of UFA, explaining how it could be used in practice. Finally, Chapter 6 outlines several possibilities for future research, and Chapter 7 concludes the thesis.

Chapter 2

2 Data

This chapter describes the data used in this analysis. Section 2.1 provides an overview of the database, Section 2.2 discusses variable extraction and the creation of temporal features, and Section 2.3 defines our analysis populations.

2.1 MIMIC II database

The publicly available Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC II) database, version 2.6, contains de-identified clinical data for nearly 30,000 adult ICU stays at Beth Israel Deaconess Medical Center (BIDMC) in Boston, MA from 2001 to 2008 [14]. It was jointly developed by the Massachusetts Institute of Technology (MIT), Philips Healthcare, and BIDMC, and any researcher who adheres to the data use requirements is permitted to use the database. Its creation and use for research was approved by the institutional review boards of both BIDMC and MIT (IRB protocol 2001-P-001699/3).

Figure 2.1, which comes from the MIMIC II user guide, illustrates the data acquisition process. As is evident in the figure, data comes from a variety of different sources, including bedside monitors, the clinical record, administrative data, and external sources such as the Social Security Death Index. The figure also highlights the variety of patient data in MIMIC II.

Figure 2.1: Schematic of MIMIC II data collection and database construction

2.2 Variables for analysis

For this study, we processed over 200 explanatory variables from the MIMIC II database covering the first four days of the patient's stay. Certain variables were time-invariant, such as demographics or severity scores measured at admission. Much of the data, however, was measured daily or hourly, including vital signs, measures of fluid balance, diagnostic laboratory test results, and use of services such as imaging, ventilation, and vasopressors. Table 2.1 contains a list of the variables considered.
While most of these variables were pulled directly from the MIMIC II database, some, such as shock index (calculated as heart rate divided by systolic blood pressure), were constructed to measure known interaction effects.

Table 2.1: List of available explanatory variables used from MIMIC II database

Time-invariant (N=9): Age; Gender; Race/ethnicity; Weight; SAPS at admission; SOFA at admission; Number of comorbidities; Number of previous ICU stays; ICU care unit

Day-level (N=38): SOFA; SAPS; Urine output; Fluid balance; Use of ventilation; Use of vasopressors; Use of imaging (x-ray, MRI, CT, echo); Do not resuscitate status; Comfort measures only status; Number of tests and average daily value for creatinine, sodium, BUN, chloride, bicarbonate, glucose, magnesium, calcium, phosphate, hematocrit, hemoglobin, white blood count, and platelets

Hourly-level (N=5): Heart rate; Respiratory rate; Mean arterial blood pressure; Shock index; Temperature

The primary outcome of interest throughout the study is in-hospital mortality, as recorded in the MIMIC II clinical record. We also kept track of the International Classification of Diseases, 9th revision, Clinical Modification (ICD-9-CM) codes for each hospitalization. As detailed in Section 2.3, these codes were used to identify disease-specific populations for analysis.

2.2.1 Temporal features

For each of the day-level and hourly-level variables in the analysis, we engineered a series of temporal variables to measure patient dynamics. Table 2.2 lists the different features that were created.

Table 2.2: List of engineered temporal features

Day-level indicator (N=8); example: use of ventilation
- Ever used to date
- Number of days to date

Day-level continuous (N=30); example: daily SOFA score
- Minimum to date
- Average to date
- Maximum to date
- Range in values to date
- One-day trend
- Two-day trend
- Three-day trend

Hourly-level continuous (N=5); example: heart rate
- Daily minimum
- Daily maximum
- Daily average
- Daily standard deviation
- Minimum to date
- Average to date
- Maximum to date
- Range in values to date
- One-day trend
- Two-day trend
- Three-day trend

For each of the eight indicator variables in the analysis, such as use of ventilation or vasopressors, we counted the number of days on which the patient received the service and also documented whether they ever received it. For the continuous variables, the first step was to roll all of our data up to the day level. For the five hourly-level variables, we summarized the minimum, maximum, average, and standard deviation of the hourly values for each day. Next, for both the day-level and hourly-level variables, we calculated the minimum, maximum, average, and range of values over the first four days of the patient's stay. We also looked at the one- to three-day trends. The one-day trend was calculated by dividing the current day's average by the previous day's average, the two-day trend by dividing the current day's average by the average from two days ago, and the three-day trend analogously (see the code sketch below).

2.3 Populations for analysis

For this analysis, we considered non-elective, adult ICU stays of four or more days in MIMIC II. The number of stays meeting these criteria totaled 5,378, with an in-hospital mortality rate of 21.2%.

With outcome modeling, there is an inherent trade-off between building a general prediction model that is widely applicable and a specialized model that takes into account particular features of a disease, patient population, or care facility.
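To make the temporal features of Section 2.2.1 concrete, the sketch below computes the "to date" summaries and one- to three-day trends for a single day-level variable. It is an illustration only, not the extraction code used for this thesis; the tiny input table and its column names (patient_id, day, sofa) are hypothetical stand-ins for the MIMIC II extract.

```python
import pandas as pd

# Hypothetical per-patient, per-day table: one row per ICU day (first four days).
daily = pd.DataFrame({
    "patient_id": [1, 1, 1, 1],
    "day":        [1, 2, 3, 4],
    "sofa":       [9, 11, 8, 7],
})

daily = daily.sort_values(["patient_id", "day"])
g = daily.groupby("patient_id")["sofa"]

# "To date" summaries: cumulative minimum, maximum, average, and range.
daily["sofa_min_to_date"] = g.cummin()
daily["sofa_max_to_date"] = g.cummax()
daily["sofa_avg_to_date"] = g.cumsum() / (g.cumcount() + 1)
daily["sofa_range_to_date"] = daily["sofa_max_to_date"] - daily["sofa_min_to_date"]

# k-day trends: the current day's value divided by the value k days earlier.
for k in (1, 2, 3):
    daily[f"sofa_trend_{k}day"] = daily["sofa"] / g.shift(k)

print(daily)
```

The same pattern applies to the hourly-level variables after first aggregating each day's hourly readings into a daily minimum, maximum, average, and standard deviation.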
Various studies have shown that the most widely used models do not generalize well to new populations, and independent research suggests regular updates and customization for best performance [4, 15]. To measure the extent to which the classifiers presented in this thesis are customizable, we test them on a variety of disease-specific subgroups, in addition to the full MIMIC population. Table 2.3 displays the ten most common primary diagnoses in our data. The stays column counts the number of ICU stays of four or more days associated with that diagnosis, while mortality refers to the in-hospital mortality rate for that cohort.

Table 2.3: Most common primary diagnoses in the MIMIC II database

ICD9 | Description | Stays | Mortality
All | Full MIMIC population | 5,378 | 21.2%
038 | Sepsis | 512 | 30.9%
410 | Acute myocardial infarction | 496 | 14.7%
518 | Lung disease | 402 | 27.4%
414 | Chronic ischemic heart disease | 240 | 3.3%
428 | Heart disease | 200 | 20.5%
996 | Complications from specified procedures | 164 | 24.4%
431 | Intracerebral hemorrhage | 147 | 31.3%
430 | Subarachnoid hemorrhage | 125 | 19.2%
852 | Subarachnoid, subdural, and extradural hemorrhage | 109 | 23.9%
441 | Aortic aneurysm and dissection | 108 | 15.7%

In this thesis, in addition to the full MIMIC population, we focus on the three most common diagnoses in the dataset -- sepsis, acute myocardial infarction (AMI), and lung disease -- as they provide us with the largest possible sample sizes. Across the three subpopulations, we also have a range of different mortality rates, from 14.7% to 30.9%.

We decided to use sepsis patients, the largest of the three disease-based cohorts, to build and test our mortality prediction models. In addition to being the most common diagnosis in the dataset, sepsis is the 10th leading cause of death in the United States and has an estimated annual economic burden of $16.7 billion, making it an important cohort for analysis [16]. Additional discussion of the sepsis cohort is available in Section 2.3.1.

We use the other two disease-based subpopulations, AMI and lung disease, along with the full MIMIC population to test the degree to which our models generalize to new patient cohorts. We hypothesize that one advantage of a fully automated system such as UFA will be that it is easy to customize for both general and specialized applications.

For the purposes of this thesis, we only consider disease-based subpopulations based on the primary diagnosis. As a side analysis, however, we considered all patients with a primary diagnosis of sepsis and analyzed secondary diagnoses at the time of admission. In some cases, we observed significant discrepancies in mortality. One of the most striking results was the 30-day mortality rate for septic patients with rheumatoid arthritis (RA), an auto-immune disorder. At 29.0%, it was 21 percentage points lower than the 30-day mortality rate for all severe sepsis patients (defined as sepsis with persistent hypotension), and the observed difference was statistically significant even after controlling for patient demographics and disease severity. The observed protective effect also extended to other auto-immune patients. While these findings are outside the scope of this thesis, they are presented in their entirety in Appendix A.

2.3.1 Sepsis

Sepsis is a systemic inflammatory response to an infection that can lead to organ failure and death.
Severe sepsis accounts for around 10% of all intensive care unit (ICU) admissions in the United States, and mortality rates are commonly reported between 28% and 50% [17, 18].

The consensus clinical definition of sepsis was established in 1992 by the American College of Chest Physicians (ACCP) and the Society of Critical Care Medicine (SCCM). ACCP/SCCM identified the systemic inflammatory response syndrome (SIRS) criteria and defined sepsis as the presence of at least two SIRS criteria caused by known or suspected infection. The SIRS criteria are:

- Core body temperature above 38°C or below 36°C
- Heart rate above 90 beats per minute
- Respiration rate more than 20 breaths per minute
- White blood count above 12,000/µl or below 4,000/µl, or more than 10% immature forms

They defined severe sepsis as sepsis with the addition of acute organ dysfunction [19].

While there is a consistent clinical definition of sepsis, there is no single method for identifying sepsis patients in administrative health data. For the purpose of this thesis, we use the ICD-9-CM code 038, which indicates septicemia. It has been shown that sepsis can be confirmed in 88.9% (95 percent confidence interval 81.6-96.2) of patients with a 038 code in their discharge records [20].

We chose the 038 diagnosis over another definition of sepsis that is sometimes used with administrative health data, the Angus criteria [21]. Angus requires the presence of two ICD-9-CM diagnosis codes, one for infection and another for organ dysfunction. However, it has been suggested that sepsis defined by the Angus criteria overestimates the incidence of severe sepsis by a factor of two to four [20]. Our data confirm that the Angus criteria are a broader definition of sepsis than the 038 diagnosis: we identified 9,066 hospitalizations in the MIMIC II database that met the Angus criteria, nearly three times the number for 038. For this analysis, we wanted to focus on a subset of patients who have a high likelihood of true sepsis; therefore, we ultimately decided that the 038 definition was most appropriate for our application.

Chapter 3

3 Methodology

This chapter summarizes our methodology for building and evaluating mortality prediction models. Section 3.1 describes a number of commonly used classification techniques that can be used for outcome prediction and summarizes the advantages and disadvantages of different methods. Section 3.2 provides a formal specification of UFA, a novel methodology that we designed to deal with some of the specific challenges posed by clinical data. Finally, Section 3.3 discusses our methods for evaluating these different classifiers and comparing their predictive performance.

3.1 Commonly used classification techniques

At its root, mortality prediction is a classification task. The objective is to use a number of explanatory variables, such as physiologic data, to classify patients into two groups: those who live and those who die. A number of methods exist for binary classification. In this section, we introduce four approaches used throughout the thesis: logistic regression, support vector machine (SVM), decision tree, and random forest.

Perhaps the most common method currently used in mortality prediction is logistic regression. Logistic regression is a linear method that models the posterior probabilities of the two classes as a logistic function of the explanatory variables [22].
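In symbols, for a binary outcome y and feature vector x, the model takes the standard form

\[
\Pr(y = 1 \mid x) \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}},
\]

so that the log-odds, \(\log\big[\Pr(y = 1 \mid x)/\Pr(y = 0 \mid x)\big] = \beta_0 + \beta^{\top} x\), are linear in the coefficients.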
Logistic regression is usually fit by maximum likelihood, so it is prone to overfitting if the number of variables is of the same order as the number of observations. Logistic regression also requires little to no multicollinearity between the explanatory variables. For these reasons, it is often advantageous to find a subset of the variables that is sufficient to explain the observed outcome, as opposed to using all of the variables that are available [5]. One way to do this is through best subset regression, which searches for the best combination of k variables according to some criterion. In this thesis, we use the Bayesian information criterion (BIC).

We also consider several non-regression-based methods in this thesis. SVM is a linear classifier that can implement non-linear class boundaries by transforming the feature space. It generally has strong predictive power [5] and is resistant to overfitting, since there is little flexibility in the decision boundary [22]. The drawbacks are that it can be difficult to scale computationally, it tends to be difficult to interpret, and it does not deal well with outliers and irrelevant information, which can be prevalent in clinical data [5].

A decision tree is a non-linear method that classifies data using a series of univariate partitions [22]. Trees are frequently depicted using a tree-like graph, which makes them fairly easy to interpret. As opposed to SVM, they also tend to be easy to compute and are generally insensitive to outliers [5]. While decision trees tend to have low bias, they often have high variance, which can lead to unreliable predictive performance. Random forest addresses this issue by creating a large number of decision trees (in this thesis, we use 100) and essentially averaging the results [5]. This generally leads to better predictive performance, though some of the interpretability of a single tree is lost.

3.1.1 Preprocessing approach

In Section 2.2, we describe our process for extracting over 200 variables from the MIMIC II database to characterize the first four days of each patient's ICU stay. Through exploration of the data, we learned that many of the explanatory variables are asymmetric and long-tailed, that there are a number of possible outliers, and that many variables are highly correlated. In addition to the large number of features, all of these aspects of the data make it challenging to analyze. One way to deal with these challenges is to preprocess the data to remove variables that are possibly irrelevant or highly correlated with other features. This section describes our methodology for variable selection on the MIMIC II data, in the hopes of improving the predictive performance of our mortality prediction models.

The process has two main steps. First, we limited our variables to those that are individually associated with in-hospital mortality. Second, we removed variables that were highly correlated with one another.

For step one, we ran a two-sided t-test [23] for each numeric variable to test the following hypothesis:

\[
H_0: \mu_{\text{died}} = \mu_{\text{survived}} \qquad H_1: \mu_{\text{died}} \neq \mu_{\text{survived}}
\]

where \(\mu\) denotes the mean of the variable among patients who died and among patients who survived, respectively. For the categorical variables, we tested the same hypothesis using a chi-squared approach [23]. For each variable, we ran five distinct tests, each utilizing 4/5 of the training data. Then, we were able to generate an interval of possible p-values for each feature. These results allowed us to formally explore the relationship between individual variables and in-hospital mortality, as well as quantify the variability in this relationship.
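A minimal sketch of this screening step is shown below. It assumes a pandas feature matrix X and a binary death indicator y (both hypothetical names), uses five 4/5 subsamples in the spirit of the five splits described above, and substitutes Welch's two-sample t-test for the t-test of [23]:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import KFold

def univariate_pvalues(X: pd.DataFrame, y: np.ndarray, n_splits: int = 5, seed: int = 0):
    """For each numeric column of X, test whether its mean differs between
    patients who died (y == 1) and survived (y == 0). Each of the n_splits
    tests uses 4/5 of the rows, yielding an interval of p-values per feature."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for col in X.columns:
        x = X[col].to_numpy(dtype=float)
        pvals = []
        for keep, _ in folds.split(x):          # 'keep' holds 4/5 of the rows
            xs, ys = x[keep], y[keep]
            ok = ~np.isnan(xs)                  # ignore missing values
            died = xs[ok & (ys == 1)]
            lived = xs[ok & (ys == 0)]
            pvals.append(stats.ttest_ind(died, lived, equal_var=False).pvalue)
        rows.append({"variable": col, "p_min": min(pvals),
                     "p_mean": float(np.mean(pvals)), "p_max": max(pvals)})
    return pd.DataFrame(rows)
```

Features can then be ranked or filtered on p_mean, which is how the day-four cutoff described below is applied.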
Moreover, by comparing the mean p-value on days one through four of the ICU stay, we were able to determine how the individual predictive power of particular features changed across time.

We also used the results of this analysis to do variable selection. For many of the standard classifiers that we considered in our analysis, using a very large number of variables is problematic. For example, with regression, the number of variables is limited by the degrees of freedom. Therefore, for certain analyses in this thesis, we subset our variables to those with an average p-value below 0.05 on day 4 of the ICU stay. It is likely that this selection criterion includes some variables that do not truly have a significant relationship with mortality, due to multiple testing, which is known to inflate the type I error rate [5]. We were comfortable with the possibility of false positives, however, as this is only the first stage of our variable selection process. In later stages, we use additional methods such as best subset regression to further limit our list of features. Then, we verify those models on previously unseen data, as shown in Section 4.2.3, to ensure that the observed relationship between the explanatory variables and mortality is generalizable.

Step two of our preprocessing approach was to further restrict our list of variables by considering the pairwise correlation coefficients [23]. Unsurprisingly, many of the variables in our analysis are highly correlated. For example, average arterial blood pressure is correlated with both the maximum and minimum values, with pairwise correlation coefficients between 0.7 and 0.8. We considered all pairs of variables with a correlation coefficient greater than 0.6 or less than -0.6, and removed variables that were not providing new information.

These two preprocessing steps allowed us to narrow our total list of variables from more than 200 to just 75. The results of the individual t-tests and the final list of variables are presented in Section 4.1.

3.2 Univariate Flagging Algorithm (UFA)

In many data classification problems, there is no linear relationship between an explanatory variable and the dependent variable. Instead, there may be ranges of the input variable for which the observed outcome is significantly more or less likely. In clinical decision making, for example, doctors identify ranges of laboratory test values that may identify patients at higher risk of developing or having a disease [24, 25]. This also arises in other fields; in earth science, for example, rainfall thresholds can be used to develop early warning systems for landslides or flooding [26, 27].

In this section, we describe a new method for identifying such thresholds in an automated fashion, called the Univariate Flagging Algorithm (UFA) [28, 29]. We also describe two UFA-based classifiers that combine the UFA thresholds to predict mortality for previously unseen patients. These methods were specifically designed to address many of the challenges of clinical data without preprocessing and to provide an alternative to the commonly used techniques from Section 3.1.

3.2.1 Identifying optimal cut points

Many classifiers, such as decision trees or support vector machines (SVM) [5], are designed to find "optimal" cutpoints, typically defined as cutpoints that minimize some measure of node impurity. Such measures include misclassification rate, Gini index, or entropy/information gain [5, 22].
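For a binary node in which a fraction p of the observations belong to class 1, these measures take the standard forms

\[
\text{misclassification} = \min(p,\, 1-p), \qquad
\text{Gini} = 2p(1-p), \qquad
\text{entropy} = -p\log p - (1-p)\log(1-p),
\]

each of which is maximized at p = 1/2 (a maximally mixed node) and equals zero for a pure node.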
Supervised clustering works similarly, minimizing impurity while adding a penalty for the total number of clusters [30]. Alternatively, Williams et al. (2006) put forth a minimum p-value approach for finding optimal cutpoints for binary classification. Their algorithm uses a chi-squared test to find the cutpoint that maximizes the difference in outcomes on one side of the cut and the other [25].

These approaches are similar in that they consider the entire input space, both false positives and false negatives, to select the optimal cutpoint. In certain applications, however, one may only care about a subspace. Under certain conditions, it might be important to identify separation thresholds that are associated with a high prevalence of the target, even if the overall solution is not optimized. Examples include medical conditions where values outside clinically defined thresholds are associated with high mortality, while more normal values do not provide much information.

For example, in sepsis patients, low body temperature is associated with illness severity and death [31]. Figure 3.1 displays average body temperature for the MIMIC II sepsis population, which has an overall death rate of 30.9%. Patients who died are denoted in red, while patients who survived are denoted in blue. International guidelines for sepsis management define low body temperature as below 36°C [32]. Below this threshold, Figure 3.1 shows that sepsis patients die at a rate of 57.1%, nearly twice the overall death rate. Above this threshold, one can say very little.

Figure 3.1: Body temperature for adult sepsis patients

We are interested in identifying such thresholds in an automated fashion. In the decision tree or SVM framework, this can be accomplished by associating different costs with false positives and false negatives, which shifts the "optimal" cutoff point accordingly [5, 22]. In practice, however, it is often difficult to quantify the costs associated with different types of errors, particularly in the medical domain. Friedman and Fisher's (1999) Patient Rule Induction Method (PRIM) finds rectangular subregions of the feature space that are associated with a high (or low) likelihood of the outcome. The subregions are then slowly made smaller, each time increasing (or decreasing) the rate of the outcome [33]. With this method and others like it, there is an inherent trade-off between the number of data points within the subregion (the support) and the proportion of the data points that are associated with the outcome (the purity), where smaller supports generally have higher purity. With PRIM, the user is responsible for defining the "optimal" subregion by specifying the preferred trade-off for the application. While this may work well in some situations, identifying the appropriate trade-off can be challenging, suggesting the need for an algorithm that requires less user input.

We put forth a new threshold detection algorithm called UFA. UFA optimizes over subregions of the input space, but performs the trade-off between support and purity automatically. In this thesis, we show that UFA can identify the existence of thresholds for individual variables and that these thresholds align with visual inspection and with thresholds established by subject matter experts. We also demonstrate that the thresholds can be used to classify previously unseen test cases with performance equal to or better than many standard classifiers, such as random forest and logistic regression.
3.2.2 Formal statement of algorithm

UFA is designed to identify an optimal cutpoint for a single explanatory variable, such that observations outside that threshold are associated with a significantly higher or lower likelihood of the target. UFA identifies up to two such thresholds, one below the median and one above the median. The algorithm is intended for a binary target y (e.g., y in {0, 1}) and a continuous explanatory variable x. At its most basic level, UFA finds the value t* that maximizes the difference between the outcome rate for observations falling outside t* and a baseline rate, while maintaining a good level of support. Table 3.1 defines the variables that we will use in the formal specification of the UFA algorithm. For the purpose of formulation, we consider candidate thresholds t below the median value of x.

Table 3.1: List of variables for specification of UFA algorithm

t: a candidate threshold for the explanatory variable x
n_t: the number of observations falling outside t (below t, for below-median thresholds)
p_t: the outcome rate among observations falling outside t
n_IQR: the number of observations in the interquartile range of x
p_IQR: the outcome rate among observations in the interquartile range of x
z_t: the test statistic for candidate threshold t
t*: the selected threshold

For each t, we conduct the following hypothesis test to check for a significant difference between the outcome rate below the threshold and the outcome rate in the interquartile range:

\[
H_0: p_t - p_{\text{IQR}} = 0 \qquad H_1: p_t - p_{\text{IQR}} \neq 0 \tag{1}
\]

We use a binomial proportion test [23] with test statistic

\[
z_t \;=\; \frac{p_t - p_{\text{IQR}}}{\sqrt{\bar{p}\,(1 - \bar{p})\left(\frac{1}{n_t} + \frac{1}{n_{\text{IQR}}}\right)}} \tag{2}
\]

where \(\bar{p}\) is the weighted average of the outcome rates, calculated as

\[
\bar{p} \;=\; \frac{n_t\, p_t + n_{\text{IQR}}\, p_{\text{IQR}}}{n_t + n_{\text{IQR}}} \tag{3}
\]

We define t* as the candidate threshold whose z-statistic has the maximum absolute value:

\[
t^* = \arg\max_t \, \lvert z_t \rvert \tag{4}
\]

The statistic z_t provides an inherent trade-off between maximizing the support and maximizing (or, equivalently, minimizing) the outcome rate. The proposed measure does not provide an optimal separation in terms of minimizing the overall misclassification rate; instead, it is optimized toward finding areas enriched with target-outcome cases. The same applies to finding areas with a specifically low rate of the target.

3.2.3 Procedure to find optimal threshold

For each variable x:

1. Generate a list of potential thresholds between the median value of x and the minimum value of x, excluding those with low support, by dividing the range into segments. For the purpose of this thesis, we excluded the five lowest values of x, assuming that thresholds with a support of less than five are of no interest. Currently, we consider 50 segments of equal length.

2. Calculate z_t for each candidate as specified in equation (2). Define t* according to equation (4).
Throughout the thesis, this approach will be denoted as the Number of Flags algorithm (N-UFA). Figure 3.2: Number of high mortality and low mortality flags for adult sepsis patients Figure 3.2 shows an example of the N-UFA classifier’s performance in predicting adult sepsis patients’ mortality. For each patient, we counted the number of flags that are associated with a high likelihood of 30 mortality and the number of flags that are associated with a low likelihood of mortality; in this thesis, each flag receives equal weight of one, though future research could investigate the impact of assigning flags different weights. In Figure 3.2, the solid line represents the linear decision boundary that minimizes the misclassification rate along these two dimensions. The second UFA-based classifier presented in this thesis is a standard random forest model [5] that uses the flags for each significant threshold as dummy inputs. We will denote this classifier RF-UFA. Throughout the results section, we compare the predictive performance of the two UFA-based classifiers to classifiers that rely on the original, continuous data. One could also consider methods that combine the continuous data and the UFA flags, though they are not addressed in this thesis. 3.2.5 Example of UFA system: Iris dataset In this section, we demonstrate the UFA system by applying it to the well-known Iris dataset [35]. A classic in the machine learning discipline, it contains 50 observations for three different species of Iris. One of the species, Iris setosa, is linearly separable while the other two species, Iris virginica and Iris versicolor, are not. We ran UFA over this relatively straightforward dataset to ensure that it performed comparably to other standard approaches. Identifying optimal thresholds We begin by using UFA to identify optimal thresholds for each variable in the Iris dataset. In some cases, the optimal cutpoint is clear, such as the trivial case when the data is linearly separable. Figure 3.3 illustrates this trivial case. From the figure, it is clear that Iris setosa, denoted in red, is linearly separable from the other two classes, which are denoted in blue. As expected, UFA is able to successfully identify a threshold for a single variable, ‘petal length’, which separates the two classes from one another. Next, we focus on identifying thresholds that separate the two remaining classes, Iris versicolor and Iris virginica. The Iris dataset contains four variables and, therefore, UFA searches for up to eight possible thresholds in the data. Table 3.2 contains the optimal thresholds for each variable and shows that six of the eight are significant. In Table 3.2, the automatic trade-off between purity and support inherent to UFA is apparent. While a sepal width less than 2.4 identifies a subset of cases where 90% belong to the class versicolor, the 31 support (N=10) is not large enough to consider this variable in subsequent analysis. The other variables, however, each have two significant thresholds, depicted in Figure 3.4. We see that three of the thresholds are associated with a high likelihood of Iris versicolor, while three of the thresholds are associated with a high likelihood of Iris virginica. 
Figure 3.3: Petal lengths for different species of Iris

Table 3.2: List of thresholds for Iris dataset, selected based on maximum absolute z-statistic

Variable | Threshold | N | % Versicolor | Z-stat | |Z-stat| | Significant
Sepal.Length | Less than 5.7 | 24 | 87.5% | 3.4 | 3.4 | 1
Sepal.Width | Less than 2.4 | 10 | 90.0% | 2.3 | 2.3 | 0
Petal.Length | Less than 4.7 | 45 | 97.8% | 5.2 | 5.2 | 1
Petal.Width | Less than 1.5 | 48 | 93.8% | 4.4 | 4.4 | 1
Sepal.Length | More than 7.0 | 12 | 0.0% | -3.0 | 3.0 | 1
Sepal.Width | More than 3.2 | 10 | 20.0% | -1.7 | 1.7 | 0
Petal.Length | More than 5.0 | 42 | 2.4% | -5.1 | 5.1 | 1
Petal.Width | More than 1.7 | 46 | 2.2% | -5.9 | 5.9 | 1

Classification

We convert the six significant thresholds into indicator variables ("flags") which take the value one if the data point falls outside the threshold and zero otherwise. Then, plugging these flags into a random forest model (RF-UFA), we find that we can correctly classify 48 of 50 cases for both Iris versicolor and Iris virginica.

Figure 3.4: Automated thresholds for Iris versicolor and Iris virginica. Iris versicolor is denoted in green and Iris virginica is denoted in blue.

We can achieve the same level of accuracy using N-UFA. First, we aggregate the number of Iris virginica flags and the number of Iris versicolor flags for each instance in the dataset, as seen in Figure 3.5. The first number in each cell represents the number of true virginica, while the second number represents the number of true versicolor. The optimal linear decision boundary classifies all of the green cells as Iris versicolor and all of the blue cells as Iris virginica; values in red represent errors. We see that N-UFA correctly classifies 96 of the 100 cases. This performance is in line with the apparent error rate of other classification algorithms that have been used on the Iris dataset [36, 37].

Figure 3.5: Example of Iris classification using N-UFA (each cell shows the number of true virginica, then the number of true versicolor)

Virginica flags \ Versicolor flags | 0 | 1 | 2 | 3
0 | 0, 1 | 1, 5 | 1, 21 | 0, 21
1 | 5, 2 | 4, 0 | -- | --
2 | 28, 0 | -- | -- | --
3 | 11, 0 | -- | -- | --

We can also use RF-UFA and N-UFA to predict the class of previously unseen cases. Using five-fold cross validation, which is explained in Section 3.3, both UFA-based classifiers achieve 100% accuracy separating Iris setosa from the other classes. Overall, N-UFA averages 94.7% accuracy across the five folds, while RF-UFA achieves an average accuracy of 96.0%. Once again, this performance is consistent with other commonly used classifiers.

3.3 Evaluation of classifier performance

In this thesis, we would like to compare N-UFA and RF-UFA to other commonly used models and evaluate their relative abilities to predict in-hospital mortality for previously unseen patients. In this section, we describe our methodology for this comparison.

We utilize ten-fold cross validation [5] to train and test our models. Cross-validation works by dividing the data into ten equal-sized parts. Nine of the parts are used to train the model and one part is used to validate it. The advantages of cross validation are twofold. First, it does not require a hold-out validation and test set, which is advantageous when data is limited, as is the case for some of our diagnosis-based subpopulations. Second, by rotating which part of the data is used for validation, one can generate ten different estimates of model performance. In this thesis, we report the average performance for each classifier across the ten folds.
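A minimal sketch of this evaluation loop, using scikit-learn's cross-validation utilities and hypothetical stand-in data (the real analysis uses the MIMIC II features of Chapter 2):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-ins: X_flags is a 0/1 flag matrix, y the mortality labels.
rng = np.random.default_rng(1)
X_flags = rng.integers(0, 2, size=(500, 20))
y = rng.integers(0, 2, size=500)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)  # 100 trees, as in Section 3.1
scores = cross_val_score(model, X_flags, y, cv=cv, scoring="roc_auc")

# Mean across the ten folds, with a ~95% interval of two standard errors.
mean, se = scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))
print(f"AUROC: {mean:.3f} +/- {2 * se:.3f}")
```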
Tables and figures that include a 95% confidence interval or error bars were created by calculating the standard deviation across the ten folds and adding/subtracting two standard errors from the mean value. Next, we define the metrics that are used in this thesis to characterize model performance. Each candidate classifier predicts whether a patient will live (0) or die (1). For each patient, we also know the actual outcome, so we can create a confusion matrix, such as the one depicted in Figure 3.6.

Figure 3.6: Example of confusion matrix

                       Predicted class
Actual class     Died (1)              Lived (0)
Died (1)         True positive (TP)    False negative (FN)
Lived (0)        False positive (FP)   True negative (TN)

We can calculate a number of different performance metrics from the confusion matrix. The three used in this thesis are accuracy, sensitivity, and specificity [5]. They are calculated as follows, utilizing the notation from Figure 3.6:

Accuracy: (TP + TN) / (TP + TN + FN + FP)
Sensitivity: TP / (TP + FN)
Specificity: TN / (TN + FP)

Accuracy is the probability of making a correct prediction. Sensitivity is the probability of predicting death, given that the true state is death. Specificity is the probability of predicting survival, given that the true state is survival. We would like all of these metrics to be as close to one as possible, though in practice there is a trade-off between high sensitivity and high specificity. The last performance metric that we consider in our analysis is the area under the receiver operating characteristic curve (AUROC), which plots the sensitivity against one minus the specificity for different discrimination thresholds [5]. Once again, values closer to one are better than lower values. Chapter 4 4 Results In this chapter, we present the results of our analysis. As described in Section 1.3, the goal of this thesis is to design a mortality prediction system that can outperform current models. However, predicting adverse outcomes for critically ill patients is difficult for a number of reasons. Section 4.1 explores the MIMIC II data and highlights a number of challenges with ICU data. We learn that a good mortality prediction model must be able to handle large amounts of data that may include strong correlations, and should be robust to noise and outlier values, as well as missing data. Section 4.2 compares the predictive performance of a number of different commonly used classifiers. In this section, we identify two models that significantly outperform SOFA score, one of the current mortality prediction systems described in Section 1.2. They are random forest and a five-variable regression model selected through best subset regression. Section 4.3 introduces a new classifier, UFA, which was designed to address many of the challenges of clinical prediction in an automated fashion. We compare its predictive performance to SOFA, random forest, and our best subset regression, and find that UFA predicts mortality for previously unseen patients as well as or better than all of these other methods. Section 4.4 analyzes the degree to which UFA and our other candidate models possess other characteristics that are desirable in mortality prediction, such as the ability to make accurate predictions with missing or noisy data, to generalize to other diagnosis-based subpopulations, and to predict rare events. 4.1 Challenges of clinical data Section 2.2 describes the process that we used to compile over 200 variables from the MIMIC II database, which summarize the first four days of a patient's ICU stay.
This section presents a variety of preliminary analyses that we conducted to gain a better understanding of the data and the relationship between individual variables and in-hospital mortality. These analyses uncover a number of challenges inherent to clinical data. 4.1.1 Many clinical variables are asymmetric and long-tailed As a first step, we plotted each of our variables, stratified by patient mortality, to look for influential data points and understand each variable's relationship with the target. Figure 4.1 shows the data for two sample variables: SOFA score and daily urine output. In each plot, a death flag of "Y" denotes in-hospital mortality, while a death flag of "N" denotes survival. We observe that lower SOFA scores and higher urine output are generally associated with survival, which is consistent with clinical expectations. Figure 4.1: Box plots for SOFA and urine output stratified by in-hospital mortality However, the box plots also reveal that many of the variables in the analysis are not symmetric and have long tails. Referring back to daily urine output in Figure 4.1 and focusing on patients that lived ("N"), we observe one individual with daily urine output of 9,600 milliliters, as compared to the median value of 1,654.5. According to the National Institutes of Health (NIH) [35], the normal range for daily urine in a healthy individual is 800 to 2,000 milliliters, so 9,600 milliliters is extremely abnormal. With clinical data, however, it is not always clear if outlier values represent data error or simply very sick individuals. For this reason, we did not exclude influential points from the dataset in this analysis. As such, classifiers that are robust to outliers and do not assume that data is normally distributed may be preferable over classifiers that do not have these features. 4.1.2 List of variables associated with mortality varies throughout ICU stay Next, we formally specified the relationship between each variable in our analysis and in-hospital mortality. We ran a chi-squared or t-test, depending on variable type, to evaluate whether there was a significant difference between patients who lived and patients who died. Using five-fold cross validation, for each variable we calculated five p-values (one per fold) using data through day one, day two, day three, and day four of the ICU stay. Table 4.1 displays the average p-value by day for a subset of the variables in our analysis, selected to highlight key results. Table B.1 in Appendix B contains the p-values for the full variable list.

Table 4.1: Relationship between clinical variables and mortality, average p-value by day
Red values (here, values below 0.05) are significant in the original; "--" indicates days on which a temporal variable is not yet defined

Type          Variable                        Day 1   Day 2   Day 3   Day 4
Day-level     Use of vasopressors             0.213   0.665   0.728   0.285
Day-level     Daily SOFA                      0.041   0.009   0.000   0.000
Day-level     Daily urine                     0.141   0.001   0.000   0.000
Day-level     Average daily bicarbonate       0.155   0.036   0.000   0.000
Hourly-level  Minimum heart rate              0.549   0.041   0.031   0.005
Hourly-level  Maximum heart rate              0.336   0.347   0.269   0.357
Hourly-level  Average heart rate              0.524   0.074   0.048   0.051
Temporal      Total days with vasopressors    --      0.220   0.053   0.016
Temporal      Ever used vasopressors          --      0.489   0.469   0.399
Temporal      Minimum bicarbonate to date     --      0.047   0.015   0.004
Temporal      Average bicarbonate to date     --      0.059   0.013   0.001
Temporal      1-day trend, bicarbonate        --      0.668   0.097   0.007
Temporal      2-day trend, bicarbonate        --      --      0.272   0.001
Temporal      3-day trend, bicarbonate        --      --      --      0.022

In total, there are 75 variables that have an average p-value below 0.05 on day four. In general, the same features that are predictive when they are static are also predictive when they are dynamic; a minimal sketch of this per-variable screening appears below.
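This screening step can be implemented along the following lines, assuming pandas/SciPy and a data frame with a binary mortality column; the function and column names are our own, a Welch (unequal-variance) t-test is used as one reasonable choice, and the per-fold averaging described above is omitted for brevity.

```python
import numpy as np
import pandas as pd
from scipy import stats

def screen_variables(df, target, continuous, binary):
    """Univariate screening: Welch t-test for continuous variables,
    chi-squared test for binary ones. Returns {variable: p-value}."""
    died = df[df[target] == 1]
    lived = df[df[target] == 0]
    pvals = {}
    for var in continuous:
        pvals[var] = stats.ttest_ind(died[var].dropna(),
                                     lived[var].dropna(),
                                     equal_var=False).pvalue
    for var in binary:
        table = pd.crosstab(df[var], df[target]).to_numpy()
        pvals[var] = stats.chi2_contingency(table)[1]  # p-value
    return pvals

# Variables with average p-value below 0.05 would be retained.
```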
Examples of such features include SOFA, SAPS, urine output, temperature, shock index, and laboratory tests like BUN, phosphate, and bicarbonate. Table 4.1 displays the results for bicarbonate. The p-value on day four for the daily reading is 0.000; in addition, we see that the minimum to date, the average to date, and the one to three day trends are all significant. There are also some cases where the temporal variables are significant, even though the static variable is not. For example, as seen in Table 4.1, the use of vasopressors is never significant. By day three, however, the p-value for the total number of days with vasopressors drops to 0.053, and by day four it is 0.016. In this case, it appears the duration of use is important. We see that the way in which data is summarized across time is important. For example, with heart rate, the minimum hourly value is significant on days two through four, while the average value hovers around 0.05 and the maximum value is never significant. Finally, it is nearly always the case that variables become more predictive later in the ICU stay. Using average daily bicarbonate, for example, Table 4.1 shows that the p-value on day one is 0.155; however, it drops to 0.036 on day two and drops again to 0.000 on days three and four. Other variables like daily SOFA, daily urine, and minimum heart rate show the same trend. Figure 4.2 displays the aggregate number of variables with a mean p-value less than 0.05, which increases from just 11 on day one to 75 on day four. Figure 4.2: Number of variables with a mean p-value less than 0.05, by day (y-axis: number of variables, 0 to 80; x-axis: Day 1 through Day 4) We can also conclude from Figure 4.2, however, that less than 50% of the variables are significantly associated with in-hospital mortality, even on day four. As such, classifiers that can handle irrelevant features may be preferable over classifiers that cannot. We might also prefer a system that is easily customizable to new datasets, since we observe that changes such as considering data on day three instead of day four change the relationship of variables with our target outcome. 4.1.3 Clinical variables are often strongly correlated For the 75 variables that had a mean p-value of less than 0.05 on day four, we generated pairwise correlation coefficients. Unsurprisingly, many of the temporal variables are highly correlated, particularly when they are calculated from the same base data. For example, as seen in Figure 4.3, the mean SOFA score to date is correlated with both the minimum and maximum to date, with correlation coefficients of 0.95 and 0.94, respectively. The mean SOFA score to date and the daily SOFA score are also correlated, with a coefficient of 0.91.

Figure 4.3: Pairwise correlation coefficients for SOFA variables
Values exceeding 0.9 are highlighted in red and values between 0.8 and 0.9 are highlighted in orange in the original

                    Initial SOFA   Daily SOFA   Mean to date   Min to date
Daily SOFA          0.66
Mean SOFA to date   0.87           0.91
Min SOFA to date    0.80           0.90         0.95
Max SOFA to date    0.90           0.83         0.94           0.83

Even when variables are not calculated from the same data, we observe correlation. For example, the average SOFA score to date is correlated with the total number of days on vasopressors. Average phosphorus is correlated with average blood urea nitrogen (BUN). In both cases, the results make sense. SOFA score and vasopressor use are likely both proxies for sepsis severity. Phosphorus and BUN both measure kidney function.
However, based on these results, a classifier that does not require variables to be uncorrelated may be preferable to a classifier that does. 4.1.4 Most patients have missing data Finally, we investigate the number of missing values in our dataset. Missing data can be especially problematic in critical care, where the exact set of test results and other data recorded will depend on the patient's condition, the physician's care plan, and a variety of other factors. Table 4.2 summarizes the available data elements from MIMIC II with the most and least missing data for the 517 patients in our sepsis subpopulation. The variable that is missing most often is SAPS score, at 17%. The amount of missing data for each variable drops off quickly, however, and we observe that most variables are missing for only 2-5% of patients. Temporal features can either exacerbate or help alleviate the missing data problem. Of the temporal variables, the one-day trend in fluid balance is the variable missing most often, at 26%. This makes sense, because calculating this variable requires fluid balance for both day three and day four of the stay. The other one, two, and three day trend variables suffer from the same issue. Conversely, temporal variables that summarize minimum or maximum values across a patient's stay are rarely missing, since they only require one data point in a 96-hour period.

Table 4.2: Summary of missing data by variable

Missing most often             Missing least often
Variable          % Missing    Variable                % Missing
Daily SAPS        17%          In-hospital mortality   0%
Initial SAPS      7%           Age                     0%
Daily urine       5%           Initial SOFA            1%
Average calcium   5%           Daily SOFA              1%
Weight            5%           Average hematocrit      2%

The vast number of variables in the analysis also exacerbates the missing data problem. Though each individual variable is missing with relatively low frequency, Table 4.3 shows that only 34.6% of patients have complete data for the full list of variables in the analysis. For this thesis, we substituted missing values with the empirical average for classifiers that cannot accommodate missing data. However, it is clear that a classifier that can accommodate missing data would be preferable.

Table 4.3: Summary of missing data by patient

Amount missing          % of Patients
No missing data         34.6%
Missing 1+ variables    65.8%
Missing 5+ variables    27.5%
Missing 10+ variables   14.7%
Missing 50+ variables   4.8%

4.1.5 Summary of key insights Through the analysis in this section, we identified some of the challenges of clinical data. The results suggest that a good mortality prediction model should be able to handle large amounts of data that may include strong correlations, and should be robust to noise, missing data, and outlier values. We would also like the model to be easily customizable to new datasets, since we observed that changes such as considering data on day three instead of day four change the relationship of variables with our target outcome. 4.2 Predictive performance of commonly used classification techniques In this section, we compare the performance of four commonly used classification techniques (logistic regression, SVM, random forest, and decision tree) for the task of mortality prediction in clinical care. As a first step, we ran all four classifiers using our full list of variables, containing more than 200 static and temporal features. Section 4.2.1 summarizes these results. Then, we attempted to improve our model performance by preprocessing the data to address some of the challenges outlined in Section 4.1.
The results of that analysis are presented in Sections 4.2.2 and 4.2.3. Through our analysis, we find that random forest and a best subset regression model containing just five variables are the two most promising approaches. In Section 4.2.4, we show that they both outperform SOFA score, a widely used mortality prediction tool, by as much as two days. In Section 4.2.5, we provide some insights on the role of temporal features in predicting patient mortality. 4.2.1 Random forest performs well for predicting mortality Using the full analysis dataset, including over 200 different clinical features, we compared the predictive performance of logistic regression, SVM, random forest, and decision tree. Figure 4.4 displays the accuracy, AUROC, sensitivity, and specificity for each approach. We observe that random forest, depicted in purple, has the best overall performance. It has AUROC and specificity comparable to SVM. However, it also has much higher sensitivity and the highest overall accuracy, at 79.0%. Figure 4.4: Test-set performance metrics for classifiers, full data To better understand which variables are most important in the random forest model, we generated a variable importance plot. Figure 4.5 displays the 25 most important variables in the model, where importance is measured by the change in out-of-bag error when that variable is removed from the analysis. We see that daily urine is by far the most important variable in our model, according to this definition. Other important features include daily SOFA score, shock index, and laboratory tests like bicarbonate, hemoglobin, and platelet count. In Section 4.2.3, we will compare these variables to the best variables for a logistic regression analysis, and in Section 4.3.1, we will compare them to the most significant thresholds for the UFA system. Figure 4.5: Random forest variable importance Returning to Figure 4.4, we see that logistic regression and decision tree are the least successful classifiers, with relatively low accuracy, AUROC, and specificity. Given the results of the data exploration presented in Section 4.1, this is unsurprising. We know that logistic regression in particular tends to suffer from overfitting when the number of variables approaches the number of observations. We also know that it performs best when there is little to no multicollinearity among the explanatory variables, which is not the case in our dataset. 4.2.2 Logistic regression performs competitively after preprocessing To address some of the data challenges outlined in Section 4.1, we decided to process our dataset to remove irrelevant information and correlated variables. We hypothesized that this would improve the predictive performance of logistic regression, and possibly other methodologies as well. We used the analysis from Section 4.1.2 to limit our total variables to the 75 that are individually significant on day four. Next, we removed variables that are strongly correlated with another variable in the final list, where a strong correlation is defined as a pairwise correlation coefficient above 0.6 or below -0.6. While it may have been preferable to limit the variables based on whether combinations of variables were significant or correlated, the large number of features makes that practically challenging. Table 4.4 contains the final list of variables after preprocessing. It includes 28 total features; a minimal sketch of the correlation-pruning step follows.
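The sketch below shows one way to implement the pruning step just described, assuming pandas and a candidate list already ordered by preference (the ordering used to favor temporal or static variants is discussed in Section 4.2.5); it is a greedy illustration, not the exact procedure used in the thesis.

```python
import pandas as pd

def drop_correlated(df, candidates, cutoff=0.6):
    """Greedily keep a variable only if its absolute pairwise
    correlation with every already-kept variable is at most `cutoff`."""
    corr = df[candidates].corr().abs()
    kept = []
    for var in candidates:
        if all(corr.loc[var, k] <= cutoff for k in kept):
            kept.append(var)
    return kept

# Example: reduce the 75 individually significant variables to a
# weakly correlated subset (28 features in our case).
# final_vars = drop_correlated(day4_data, significant_vars)
```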
Table 4.4: Final variable list after preprocessing (N=28)

Age; creatinine two-day trend; platelet one-day trend; DNR; average BUN to date; platelet three-day trend; use of echo; chloride two-day trend; number of platelet tests; number of days with vasopressors; maximum bicarbonate to date; average temperature; average SOFA to date; bicarbonate one-day trend; maximum temperature to date; SOFA two-day trend; bicarbonate three-day trend; average MAP to date; SOFA three-day trend; average phosphorus to date; average shock index to date; maximum urine to date; hemoglobin three-day trend; shock index one-day trend; minimum fluid balance to date; minimum platelets to date; shock index three-day trend; daily fluid balance range

Figure 4.6 displays the accuracy and AUROC for each of our four classifiers before preprocessing (in blue) and after preprocessing (in red). As expected, logistic regression shows the largest increases. Accuracy increases by more than ten percentage points, from 68.7% to 78.8%, while AUROC increases from 0.70 to 0.84. For the other classifiers, preprocessing does not diminish performance, but it also does not lead to significant improvements. After preprocessing, we see that logistic regression has very similar performance to random forest, our best classifier from the previous section. Figure 4.6: Accuracy and AUROC before and after data preprocessing (left panel: accuracy; right panel: AUROC) Logistic regression also has very high sensitivity relative to the other methods. Figure 4.7 compares the four methods on the same set of performance metrics as Figure 4.4, after preprocessing. We see that logistic regression correctly identifies 60% of patients whose true outcome is death, compared to 50% for random forest and just 37% for SVM. Figure 4.7: Test-set performance metrics for classifiers, processed data One way to increase the sensitivity for all of our methods is to train the classifiers to predict the positive case (e.g. "death") more often. This can be achieved by balancing the training dataset. Figure 4.8 compares the test-set performance of our four data mining algorithms using the preprocessed data and a balanced training dataset. Figure 4.8: Test-set performance metrics for classifiers, balanced training data With a balanced training set, all of the classifiers have higher sensitivity, as expected, and SVM becomes more competitive with logistic regression and random forest. For the remainder of our discussion, however, we will focus primarily on the latter two methods. Random forest has the advantage of not requiring data preprocessing, as shown in Section 4.2.1. Logistic regression has the advantage of being the traditional approach used in this field, making it more familiar to practitioners. 4.2.3 Best regression model contains just five variables The original premise of this research was that the incorporation of a larger number of variables, including temporal features, and the use of non-linear classifiers would improve our ability to predict mortality. The results in the previous sections, however, suggest that a logistic regression model may actually be a viable method, if customized appropriately, and that limiting the number of variables can increase performance. Given this result, we decided to use an exhaustive best subset regression to determine whether we could create a more parsimonious regression model without sacrificing predictive performance; a minimal sketch of such a search appears below.
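The following sketch illustrates an exhaustive subset search of this kind, assuming a recent scikit-learn (penalty=None for an unpenalized fit, available from version 1.2) and BIC as the selection criterion, as used in the next section. Searching all subsets of 28 variables up to size ten is expensive, so the sketch caps the subset size; it is an illustration rather than the exact thesis code.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def best_subset(X, y, names, max_size=5):
    """Exhaustively score variable subsets by BIC (lower is better).
    X: feature matrix; y: binary 0/1 array; names: column labels."""
    n = len(y)
    best_bic, best_vars = np.inf, None
    for k in range(1, max_size + 1):
        for cols in combinations(range(X.shape[1]), k):
            model = LogisticRegression(penalty=None, max_iter=1000)
            model.fit(X[:, cols], y)
            p = np.clip(model.predict_proba(X[:, cols])[:, 1], 1e-12, 1 - 1e-12)
            loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
            bic = (k + 1) * np.log(n) - 2 * loglik  # +1 for the intercept
            if bic < best_bic:
                best_bic, best_vars = bic, [names[c] for c in cols]
    return best_bic, best_vars
```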
To this end, we were interested in subsets of variables that could achieve AUROC on unseen data of 0.81 to 0.86, the range achieved by random forest and logistic regression in Section 4.2.1 and Section 4.2.2, respectively. Beginning with the 28 variables in our preprocessed dataset, we looked for subsets of up to 10 variables and selected the best combinations based on BIC. The three best combinations are listed in Table 4.5, along with their accuracy and AUROC metrics. The variables age, SOFA, average temperature, and average shock index appear in all three combinations. There are some notable omissions as well, such as daily urine, which Figure 4.5 identified as the most important variable in our random forest model. One explanation could be that the relationship between daily urine values and mortality is not linear, which is supported by our data visualization in Section 4.1.1.

Table 4.5: Top three best subset regression models
(variables shown in red in the original are not present in the first model)

Model 1: age, daily SOFA, average temperature, average SI, average bicarbonate. Accuracy 74.8% (72.6%, 77.0%); AUROC 0.831 (0.799, 0.862)
Model 2: age, daily SOFA, average temperature, average SI, average bicarbonate, chloride two-day trend. Accuracy 75.0% (72.6%, 77.4%); AUROC 0.829 (0.799, 0.860)
Model 3: age, daily SOFA, average temperature, average SI, chloride two-day trend, minimum fluid balance to date. Accuracy 75.2% (73.5%, 76.9%); AUROC 0.816 (0.790, 0.843)

In Table 4.5, we observe that the average AUROC decreases from 0.831 for the best combination of variables to 0.816 for the third best combination, though all three are in the range of the optimal classifiers from Sections 4.2.1 and 4.2.2. The first combination of variables is the most parsimonious, containing just five features. If we continue looking down the list of best subset models until we find one with fewer than five variables, we see that it contains age, daily SOFA, average temperature, and average SI; however, the AUROC for this model is just 0.809. Given this, we select the first subset of variables in Table 4.5 as our "best subset" regression: age, daily SOFA, average temperature, average SI, and average bicarbonate. 4.2.4 Best classifiers outperform SOFA score by two days Table 4.6 compares the predictive performance of the random forest from Section 4.2.1 and the best subset regression from Section 4.2.3 to SOFA score. As explained in Section 2.1, SOFA is a daily score designed to evaluate a patient throughout their ICU stay [10]. Though it was originally intended to characterize patient morbidity, it has often been used since its development to predict patient mortality [11]. In this thesis, SOFA score represents the baseline performance of current mortality prediction models. Table 4.6 shows that both best subset regression and random forest can predict mortality better than SOFA score on day four of the ICU stay. AUROC, in particular, is significantly higher, at 0.831 and 0.823, as compared to 0.748 for SOFA score.

Table 4.6: Test-set performance of best classifiers compared to SOFA score

Classifier               Accuracy               AUROC
Best subset regression   74.8% (72.6%, 77.0%)   0.831 (0.799, 0.862)
Random forest            79.0% (76.9%, 81.1%)   0.823 (0.796, 0.851)
SOFA score               72.7% (69.6%, 75.8%)   0.748 (0.709, 0.788)

It would also be desirable, however, to make a more timely prediction than SOFA score. That is, it would be useful to be able to predict mortality earlier in the ICU stay. Figure 4.9 compares AUROC for all three methods across the second, third, and fourth day of the ICU stay.
There are three key takeaways from Figure 4.9. First, for all three methods, the ability to predict in-hospital mortality increases the longer the patient is in the ICU. For SOFA score, for example, AUROC increases from 0.67 on day two to 0.75 on day four, an increase of 0.08. This result aligns with our findings in Section 4.1.2, which showed that individual variables' relationships with mortality also tended to become stronger throughout the ICU stay. Figure 4.9: AUROC for best classifiers and SOFA across time (blue rectangle signifies the confidence interval for SOFA AUROC on day four) The second takeaway is that the best subset regression and random forest outperform SOFA score on each day. Table 4.6 showed a significant difference in AUROC for day four, and Figure 4.9 confirms that this relationship holds earlier in the ICU stay as well. Finally, the blue rectangle in Figure 4.9 represents the confidence interval for SOFA AUROC on day four. We see that the best subset regression and random forest have AUROC that falls within this interval as early as day two. Therefore, not only do these models allow us to better predict in-hospital mortality at a given point in time, they also allow us to make more timely predictions. For time-sensitive applications, such as determining courses of treatment or communicating with a patient's family about likely outcomes, this two-day advantage may be extremely meaningful. Figure 4.14 in Section 4.4 shows that the same result holds for the full MIMIC population and the UFA system. 4.2.5 Little observed effect of temporal variables In the previous section, we saw that the best subset regression performs better on day four than on day two, with AUROC improving from 0.719 to 0.748. From this, we learn that the variables in the regression, such as temperature or bicarbonate level, are more predictive of in-hospital mortality when measured later in the stay. What is notable, however, is that none of the variables in the best subset regression are temporal variables, as none of them measure the change in a patient's status across time. One of the hypotheses of this analysis was that the use of temporal variables would improve our ability to predict mortality for previously unseen patients. In this section, we check this assumption by comparing the performance of logistic regression and random forest using three different sets of variables. The first set is the preprocessed dataset from Section 4.2.2, which contains the 28 variables listed in Table 4.4. When selecting these variables, we often encountered features with high pairwise correlation where one variable was static and the other temporal, such as daily urine and maximum urine to date. When generating the final list, we favored the temporal features. Therefore, we will denote this list of variables as "favor temporal". However, it would also have been possible to favor the static variables. We will denote this list of 28 variables as "favor static". Finally, we can create a variable list that contains no temporal information at all (N=15), which we will denote "no temporal". Figure 4.10 compares accuracy and AUROC for these three lists of variables. For regression, we see virtually no difference across the three variable lists. For random forest, the model with no temporal variables performs slightly worse than the other two models, with accuracy of 77.9% versus 79.8%. However, the differences are small and not significant.
We conclude that, contrary to our expectations, knowing about changes in a patient's status across time does not significantly improve our ability to predict in-hospital mortality above and beyond knowing their current status. It is possible that this result is driven by the type of temporal variables that we chose to include in our analysis. We discuss this issue further, and suggest other types of temporal variables that might be considered, in Section 5.2. Figure 4.10: Comparison of classifier performance with and without temporal features (left panel: accuracy; right panel: AUROC) 4.2.6 Summary of key insights Through the analysis in this section, we identified two classifiers that are capable of predicting patient mortality better than current methods, outperforming SOFA score by as much as two days. The first is random forest, which outperformed all of the other commonly used methods considered. It has the advantage of good performance even on data that has not undergone preprocessing. The second promising classifier was a best subset regression model with just five variables: age, daily SOFA score, average temperature, average shock index, and average bicarbonate level. It has the advantage of being extremely simple. Further, it was not immediately obvious that a linear model with no temporal features could achieve such high performance; however, we conclude that it is possible if care is taken to properly customize the model through extensive data preprocessing. 4.3 Predictive performance of UFA system In this section, we summarize the performance of the fully automated UFA system formalized in Section 3.2. This system was designed to address the many challenges associated with clinical data and outcome prediction. In particular, UFA is adept at identifying cut points in clinical data, outside of which mortality is significantly more or less likely. Section 4.3.1 discusses the degree to which these automatically identified thresholds align with clinical insight. Section 4.3.2 demonstrates the UFA system's ability to predict mortality for previously unseen patients, and shows that it meets or exceeds the performance of best subset regression, random forest, and SOFA score. 4.3.1 Automated thresholds align with subject matter expertise As described in Section 3.2, UFA works by searching for ranges of the explanatory variables for which the target is significantly more or less likely. Then, it combines this information for all of the variables in the analysis to make a final prediction. This type of approach is very natural in the realm of clinical care, where laboratory results or vital signs outside of clinically defined thresholds are often associated with high mortality, while more normal values do not provide much information. We ran UFA for all 218 variables in the MIMIC II dataset, and automatically identified 95 significant thresholds associated with high mortality and 43 significant thresholds associated with low mortality for sepsis patients on day four of the ICU stay. Table 4.7 displays the 20 thresholds with the highest absolute z-statistic or, in other words, the most significant thresholds identified by UFA. The first thing that we observe in Table 4.7 is that all of the most significant thresholds are associated with high mortality. This is clear because the percentage of patients who died outside each threshold exceeds the overall sepsis death rate of 30.9%.
The rest of the significant thresholds identified by the algorithm, including the 43 low mortality thresholds, are available in Appendix B.

Table 4.7: Top 20 most significant UFA thresholds

Variable        Threshold           N     % Died   |Z|
SOFA            More than 12.0      74    68.9%    6.95
AVG.SAPS        More than 19.4      84    61.9%    6.60
SAPS            More than 18.1      73    61.6%    6.26
MAX.URINE       Less than 999.5     104   60.6%    6.20
AVG.URINE       Less than 516.5     81    65.4%    6.13
AVG.SOFA        More than 12.7      84    63.1%    6.10
MIN.SOFA        More than 10.1      93    60.2%    6.01
URINE           Less than 944.3     152   57.9%    5.98
MAX.SOFA        More than 15.2      66    65.2%    5.91
MAX.SAPS        More than 24.2      53    66.0%    5.84
BICARB          Less than 17.7      75    65.3%    5.79
PLATELET        Less than 85.3      96    57.3%    5.74
MAX.UNITS_OUT   Less than 1,017.8   56    66.1%    5.61
AVG.PLATELET    Less than 109.0     106   54.7%    5.53
MIN.PLATELET    Less than 81.7      109   55.0%    5.51
MINTEMP         Less than 35.5      76    59.2%    5.43
AVG.UNITS_OUT   Less than 806.2     77    59.7%    5.24
SOFA_FIRST      More than 18.1      12    100.0%   5.24
AVG.BICARB      Less than 17.6      100   57.0%    5.17
MEAN.PHOS       More than 4.5       93    52.7%    5.14

We see that many of the variables at the top of Table 4.7 align closely with the most important variables in our random forest model, as depicted in Figure 4.5. Specifically, urine and daily SOFA score are both highly ranked, with bicarbonate levels and platelet count close behind. To evaluate the UFA thresholds, we compared them to known clinically defined thresholds when available. We find that the automated thresholds align well with subject matter expertise, as demonstrated in Figure 4.11. In the figure, red data points indicate patients who died, while blue data points indicate patients who lived. Clinical bounds are denoted with solid lines and UFA thresholds are denoted with dotted lines. Figure 4.11: Example UFA thresholds for adult sepsis patients (panels: temperature; maximum platelet count; urine; sodium; mean arterial blood pressure (MAP); minimum phosphorus) The six variables in Figure 4.11 were selected to represent a variety of vital signs (e.g. temperature, MAP), laboratory tests (e.g. platelets, sodium, and phosphorus), and other important clinical features (e.g. urine output). In all cases, we see that the UFA-defined threshold is well within one standard deviation of the clinically specified threshold. For instance, returning to the example from Section 3.2.1, the clinical definition of low body temperature is 36°C. In sepsis, low body temperature is one of the diagnostic criteria for severe sepsis and septic shock, and is known to be associated with patient severity and death [31]. Applying UFA to the MIMIC II data for body temperature, we identify a high-mortality threshold at 35.97°C, denoted by a dotted line in Figure 4.11. Below the threshold of 35.97°C, sepsis patients die at a rate of 57.9%, nearly twice the overall death rate. We see that the UFA-identified threshold aligns closely with the known physiological limit. Moreover, UFA did not identify a significant threshold for high body temperature (fever) in sepsis. Once again, this is consistent with published guidelines for sepsis presentation [31], which state that fever is an "insensitive indicator" of severity of illness. 4.3.2 UFA-based classifiers significantly outperform SOFA score Overall, UFA identified 95 thresholds associated with high mortality and 43 thresholds associated with low mortality for sepsis patients on day four of the ICU stay. Figure 4.12 plots patients according to their number of high and low flags, where red denotes patients who died and blue denotes patients who lived.
We see that a linear decision boundary effectively separates the two classes. Figure 4.12: Number of high mortality and low mortality flags for adult sepsis patients Table 4.8 compares the performance of our two UFA-based classifiers (highlighted in orange in the original) to the best commonly used classifiers identified in Section 4.2, random forest and best subset regression. All four models are benchmarked against SOFA score, a widely used daily score designed to evaluate a patient throughout their ICU stay [10]. The first UFA-based classifier, N-UFA, classifies patients according to the linear decision boundary in Figure 4.12. The other UFA-based classifier, RF-UFA, is a standard random forest model [5] which uses the flags for each significant threshold as dummy inputs.

Table 4.8: Test-set performance for UFA-based classifiers (accuracy and AUROC)

Classifier               Type          Accuracy             AUROC
SOFA                     Baseline      72.7% (69.6, 75.8)   0.748 (0.709, 0.788)
N-UFA                    UFA-based     77.5% (75.1, 79.9)   0.819 (0.797, 0.841)
RF-UFA                   UFA-based     78.1% (75.8, 80.3)   0.800 (0.779, 0.821)
Best subset regression   Section 4.2   74.8% (72.6, 77.0)   0.831 (0.799, 0.862)
Random forest            Section 4.2   79.0% (76.9, 81.1)   0.823 (0.790, 0.851)

We see that, on average, N-UFA correctly predicts in-hospital mortality for 77.5% of patients, while RF-UFA achieves 78.1% accuracy. As seen in Table 4.8, this performance is better than SOFA score and comparable to the best subset regression and random forest models. The results for AUROC follow a similar pattern. Table 4.9 shows the same comparison for sensitivity and specificity. We see that the UFA-based classifiers achieve sensitivity comparable to best subset regression, at 50.9%, 49.7%, and 51.7%, respectively. All three methods perform significantly better than SOFA score on this metric. However, both UFA-based methods also maintain specificity above 0.9, while best subset regression's specificity is 0.863.

Table 4.9: Test-set performance for UFA-based classifiers (sensitivity and specificity)

Classifier               Type          Sensitivity          Specificity
SOFA                     Baseline      35.4% (29.0, 41.7)   0.911 (0.869, 0.953)
N-UFA                    UFA-based     50.9% (46.3, 55.4)   0.908 (0.871, 0.945)
RF-UFA                   UFA-based     49.7% (45.0, 54.4)   0.923 (0.892, 0.955)
Best subset regression   Section 4.2   51.7% (47.0, 56.4)   0.863 (0.820, 0.907)
Random forest            Section 4.2   47.4% (41.3, 53.6)   0.949 (0.924, 0.973)

4.3.3 Summary of key insights Through the analysis in this section, we saw that the UFA system can predict patient mortality significantly better than SOFA score. We found that the mortality thresholds identified by UFA align with clinical norms, and that UFA's performance is consistent with the best performing classifiers from Section 4.2, best subset regression and random forest. This section also outlined some advantages of UFA. First, UFA requires no preprocessing, as opposed to methods like best subset regression. Second, the UFA system provides information about the variables fed into the algorithm by giving the user a set of association rules, e.g.: when temperature is below 35.97°C, it is associated with significantly higher mortality. In this sense, it is extremely interpretable. Finally, we see that the UFA-based classifiers have relatively high sensitivity while maintaining high specificity. In the next section, we continue our discussion of the practical advantages of the UFA system, particularly in the field of critical care and particularly as it compares to best subset regression and random forest.
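Before turning to those practical advantages, the following minimal sketch summarizes how the N-UFA predictions evaluated above are produced. The data structures and the boundary parameterization are our own illustrative choices; the boundary coefficients would be fit on training data to minimize the misclassification rate, as described earlier.

```python
import numpy as np

def count_flags(X, thresholds):
    """Aggregate UFA flags per patient.

    `thresholds` is a list of (column, direction, value, kind) tuples,
    with direction in {"less", "more"} and kind in {"high", "low"}.
    A missing value simply never raises a flag."""
    n_high = np.zeros(len(X))
    n_low = np.zeros(len(X))
    for col, direction, value, kind in thresholds:
        x = X[:, col]
        raised = (x < value) if direction == "less" else (x > value)
        raised = np.where(np.isnan(x), False, raised)
        if kind == "high":
            n_high += raised
        else:
            n_low += raised
    return n_high, n_low

def n_ufa_predict(n_high, n_low, a, b):
    """Predict death (1) when a patient lies above the fitted line
    n_high = a * n_low + b in the two flag-count dimensions."""
    return (n_high > a * n_low + b).astype(int)
```

Because the prediction depends only on the two aggregate counts, a single missing or incorrectly assigned flag shifts a patient by at most one unit in either dimension, which anticipates the robustness results in Section 4.4.1.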
4.4 Practical advantages of UFA system In Sections 4.2 and 4.3, we identified three approaches to mortality prediction in the ICU that significantly outperform current methods: best subset regression, random forest, and the UFA system. We also established that these three approaches have similar predictive performance for patients with a primary diagnosis of sepsis. In addition to strong predictive performance, however, there are several other characteristics that may be desirable in an outcome prediction system. In this section, we address three: robustness to missing and noisy data; easy customization to different patient populations, care centers, or targets; and the ability to predict rare events. 4.4.1 N-UFA classifier is robust to noisy and missing data In Section 4.1, we outlined several difficulties of clinical data that should be considered when designing a mortality prediction system. In particular, we found that many variables in MIMIC II are long tailed and may include outliers. We also saw that the majority of patients in the database have incomplete data. For the commonly used classifiers in Section 4.2, missing data were replaced with the empirical average so that observations with incomplete information did not have to be dropped. While other imputation techniques are possible, experimenting with different approaches was outside the scope of this thesis. With UFA, there is no need to have complete data for each observation, since each variable is considered individually. If an instance is missing data for a particular variable, it can simply be excluded from the calculation of that variable's threshold, but remain included in calculations for which data is present. The question remains, however, whether the UFA-based classifiers will have high predictive power if certain flags are missing or assigned incorrectly due to noisy data. We hypothesize that N-UFA in particular should be robust to noise and missing data, since it aggregates over all of the high and low risk flags and does not depend on individual variables. Table 4.10 confirms this hypothesis for mortality prediction of septic patients. It compares the performance of N-UFA, random forest, and best subset regression for the original MIMIC II data (denoted 0% additional missing data) and a version of the MIMIC II data where 50% of observations were replaced randomly with missing values. All of these methods are compared to a standard logistic regression, with no preprocessing, to benchmark the results.

Table 4.10: Comparison of different classifiers with varying amounts of missing data

                                       Accuracy                 AUROC
Classifier               Type          0%      50%     Δ        0%      50%     Δ
N-UFA                    UFA-based     77.5%   76.2%   1.3%     0.819   0.790   0.029
Random forest            Section 4.2   79.0%   71.9%   7.1%     0.823   0.771   0.052
Best subset regression   Section 4.2   74.8%   70.4%   4.4%     0.831   0.706   0.125
Logistic regression      Comparison    68.7%   58.3%   10.4%    0.698   0.598   0.100

Table 4.10 shows that with 50% missing data, N-UFA has the highest accuracy and AUROC of all the methods compared. We also see that the difference in accuracy between 0% missing data and 50% missing data for N-UFA is only 1.3 percentage points, compared to 4.4 percentage points for best subset regression, 7.1 percentage points for random forest, and 10.4 percentage points for logistic regression. Similarly, AUROC decreases by 0.029, as opposed to 0.125, 0.052, and 0.100, respectively. Table 4.11 provides similar results for data noise.
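The degraded datasets for both experiments can be generated along the following lines; this is a minimal sketch assuming a NumPy feature matrix, with the function name and random seeding as our own illustrative choices.

```python
import numpy as np

def degrade(X, frac, mode="missing", seed=0):
    """Randomly corrupt a fraction `frac` of the entries of X: either
    set them to NaN ("missing") or perturb them with zero-mean noise
    whose variance matches each variable's empirical variance ("noise")."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    hit = rng.random(X.shape) < frac
    if mode == "missing":
        X[hit] = np.nan
    else:
        noise = rng.normal(0.0, np.nanstd(X, axis=0), size=X.shape)
        X[hit] += noise[hit]
    return X

# Example: 50% of entries perturbed, as in Table 4.11.
# X_noisy = degrade(X, 0.50, mode="noise")
```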
Table 4.11 presents accuracy and AUROC for N-UFA, random forest, best subset regression, and logistic regression when 50% of the MIMIC II data is randomly perturbed by a value ε, distributed normally with mean zero and the empirical variance of the variable in question. Once again, we see that N-UFA holds up well. With 50% imprecise data, its accuracy and AUROC are in line with random forest and best subset regression, and significantly higher than logistic regression. On average, accuracy decreases by just 1.7 percentage points and AUROC decreases by 0.023 for N-UFA as the percentage of imprecise data increases to 50%.

Table 4.11: Comparison of different classifiers with varying amounts of imprecise data

                                       Accuracy                 AUROC
Classifier               Type          0%      50%     Δ        0%      50%     Δ
N-UFA                    UFA-based     77.5%   75.8%   1.7%     0.819   0.796   0.023
Random forest            Section 4.2   79.0%   76.3%   2.7%     0.823   0.802   0.021
Best subset regression   Section 4.2   74.8%   75.8%   -1.0%    0.831   0.790   0.041
Logistic regression      Comparison    68.7%   68.8%   -0.1%    0.698   0.681   0.017

Expanded versions of Table 4.10 and Table 4.11, including confidence intervals and results for 5-25% missing data, are available in Appendix B. 4.4.2 UFA system generalizes well to other critical care populations With outcome modeling, there is an inherent trade-off between building a general prediction model that is widely applicable and a specialized model that takes into account particular features of a disease, patient population, or care facility. Various studies have shown that the most widely used models do not generalize well to new populations, and independent research suggests regular updates and customization for best performance [4, 15]. Up to this point, all of the results in this thesis were generated for patients with a primary diagnosis of sepsis, the most prevalent primary diagnosis in the MIMIC II data. In this section, we evaluate each model's ability to generalize to the full MIMIC population. We also test the models' performance on two alternate diagnosis-based subpopulations, patients with AMI and patients with lung disease. The results show that the UFA system can generalize to new cohorts of patients, and they confirm many of the findings from Section 4.3, including the clinical validity of the identified thresholds and accuracy and AUROC in line with other commonly used classification methods. Thresholds We ran UFA for each of our four populations individually, so that the high and low mortality thresholds were specific to each cohort. Approximately 55% of possible thresholds were significant in only one or two cohorts, suggesting a good deal of customization across the different groups. However, when a significant threshold was found in multiple subpopulations, it tended to fall in the same place and consistently predict either high or low mortality. As an example, Table 4.12 compares the significant thresholds for average sodium levels and average body temperature across all of the study cohorts. The normal range for sodium levels is 135-145 mEq/L [34]. In Table 4.12, we see that UFA identified both high and low sodium thresholds for the full MIMIC population, and these thresholds align very closely with the known bounds. For the three disease-based cohorts, the values are also consistent with clinical norms, but the association with mortality is only significant in one direction. This presumably highlights the specific risks inherent to each disease. We see a very similar result for body temperature.
While UFA identified significant high and low thresholds in the full MIMIC population, low body temperature appears to be a better predictor of mortality in sepsis and lung disease, while fever is problematic for AMI patients.

Table 4.12: Comparison of data-driven thresholds across different subpopulations

Average Sodium Level (normal range 135-145 mEq/L)
All MIMIC: value less than 133.3 associated with high mortality; value above 145.3 associated with high mortality
Sepsis: value less than 134.9 associated with high mortality
AMI: value more than 143.1 associated with high mortality
Lung Disease: value less than 135.2 associated with high mortality

Average Body Temperature (normal range 36.0-38.0 °C)
All MIMIC: value less than 35.97 associated with high mortality; value above 38.14 associated with high mortality
Sepsis: value less than 35.97 associated with high mortality
AMI: value more than 38.14 associated with high mortality
Lung Disease: value less than 35.99 associated with high mortality

Next, we used the thresholds established by UFA to predict in-hospital mortality for the full MIMIC population, patients with AMI, and patients with lung disease. Full MIMIC population Figure 4.13 displays accuracy and AUROC results for the full MIMIC population. We see that all five classifiers have accuracy of approximately 80%. However, AUROC is significantly higher for the UFA-based classifiers and random forest as compared to SOFA and best subset regression. Figure 4.13: Accuracy and AUROC, full MIMIC population In fact, we find that N-UFA and RF-UFA can predict mortality after just two days in the ICU with the same AUROC as SOFA score on day four for the full MIMIC population. Figure 4.14 shows that the mean AUROCs for N-UFA and RF-UFA on day two are 0.720 and 0.734, respectively, while the AUROC for SOFA score on day four is 0.707. Figure 4.14: AUROC for UFA-based classifiers and SOFA across time, full MIMIC population (blue rectangle signifies the confidence interval for SOFA AUROC on day four) We also find that N-UFA has significantly higher sensitivity than any of the other four classifiers in the full MIMIC population. Focusing on patients who actually died, sensitivity measures the percentage that were predicted correctly. Conversely, specificity focuses on patients who survived and measures the percentage that were predicted correctly. As seen in Figure 4.15, all of the classifiers have lower sensitivity than specificity, meaning that they have more trouble predicting deaths than predicting survival. However, N-UFA performs the best on sensitivity; with an average value of 28.6%, it has nearly twice the sensitivity of SOFA. Figure 4.15: Sensitivity and specificity, full MIMIC population Disease-based subpopulations As compared to the full MIMIC population and patients with sepsis, the AUROC results for the AMI and lung disease subpopulations are less stark. In Table 4.13, we see that the AUROC for SOFA score is slightly lower than the other methods for AMI, but not significantly so. For the lung disease population, all of the methods considered have roughly similar AUROC, as evidenced by a large amount of overlap in the confidence intervals.
Table 4.13: AUROC for AMI and lung disease subpopulations

Classifier               Type          AMI                    Lung
SOFA                     Baseline      0.725 (0.654, 0.796)   0.721 (0.643, 0.799)
N-UFA                    UFA-based     0.777 (0.718, 0.836)   0.712 (0.641, 0.783)
RF-UFA                   UFA-based     0.746 (0.681, 0.811)   0.708 (0.640, 0.776)
Best subset regression   Section 4.2   0.755 (0.695, 0.814)   0.703 (0.641, 0.766)
Random forest            Section 4.2   0.747 (0.683, 0.810)   0.731 (0.659, 0.803)

However, as with the full MIMIC population, the UFA-based classifiers outperform the other methods in terms of sensitivity. 4.4.3 N-UFA classifier maximizes sensitivity for low targets Table 4.14 displays the sensitivity for the AMI and lung disease patient cohorts for all of the mortality prediction approaches discussed in this thesis. We see that N-UFA significantly outperforms the other methods, particularly for the AMI subpopulation, the patient cohort with the lowest target (death) rate.

Table 4.14: Sensitivity for AMI and lung disease subpopulations

Classifier               Type          AMI                 Lung
SOFA                     Baseline      1.7 (0.0, 5.0)      15.4 (10.8, 19.9)
N-UFA                    UFA-based     32.7 (22.0, 43.4)   32.7 (26.2, 39.2)
RF-UFA                   UFA-based     12.0 (1.7, 22.4)    31.4 (23.7, 39.1)
Best subset regression   Section 4.2   4.8 (0.0, 11.7)     15.3 (11.8, 18.8)
Random forest            Section 4.2   2.8 (0.0, 6.6)      18.1 (13.0, 23.3)

The in-hospital mortality rate for the AMI subpopulation is 14.7%, so a classifier can achieve more than 85% accuracy by simply predicting that everyone will survive. This approach would result in 0% sensitivity. As such, it is unsurprising that sensitivity is generally low for this patient cohort, as low as 1.7% for the SOFA-based classifier. N-UFA, however, has 32.7% sensitivity, which is nearly 7 times the sensitivity of the next best classifier that is not UFA-based and almost 20 times the sensitivity of SOFA score. In general, one way to increase the sensitivity of a classifier is to balance the training data, so that 50% of the training cases are patients who died. This teaches the classifier to sometimes predict death, as always predicting survival will only result in 50% accuracy. We tried balancing the training data for the AMI subpopulation, and the results are displayed in Table 4.15. The percentage of deaths that we could predict in a previously unseen test set increased to more than 58% for all classifiers. However, balancing the training data simultaneously led to a decrease in specificity, the percentage of patients who survived that were predicted correctly. We found a similar result when we tried balancing the training data for the full MIMIC population and the other two subpopulations. To achieve a different trade-off between sensitivity and specificity, we could assign costs to different types of errors (e.g. false positives and false negatives) such that the relative importance aligns with our application [5]. While balancing the training dataset is often a good approach to address class imbalance, it can sometimes be undesirable. In particular, for very rare events, balancing the dataset through undersampling may exclude a large number of potentially useful majority-class examples, while balancing the dataset through oversampling may require replicating a small number of minority-class examples and can lead to overfitting [38]. Table 4.15 suggests that N-UFA may provide a possible alternative, achieving relatively high sensitivity in imbalanced data as compared to other commonly used classification techniques.
Table 4.15: Comparison of results for balanced and unbalanced data, AMI subpopulation

                                       Original                      Balanced
Classifier               Type          Sensitivity   Specificity     Sensitivity   Specificity
SOFA                     Baseline      1.7%          99.8%           63.1%         67.7%
N-UFA                    UFA-based     32.7%         93.9%           63.5%         70.4%
RF-UFA                   UFA-based     12.0%         98.7%           58.6%         69.6%
Best subset regression   Section 4.2   4.8%          99.1%           75.2%         70.2%
Random forest            Section 4.2   2.8%          99.3%           64.6%         72.6%

4.4.4 Summary of key insights Through the analysis in this section, we saw that the UFA system has many practical advantages for the application of critical care. First, we found that N-UFA holds up well even with large amounts of missing or imprecise data. For missing data specifically, the decline in performance is smaller than for both random forest and best subset regression. We also saw that the system generalizes well to the full MIMIC population. It has significantly higher AUROC as compared to best subset regression, and both N-UFA and RF-UFA outperform SOFA score by as much as two days. Finally, we found that N-UFA has higher sensitivity than all other approaches, including random forest, for all four populations analyzed in this thesis. The result is particularly stark for AMI, the population where the death rate is lowest. Chapter 5 5 Discussion The objective of this thesis was to build a mortality prediction model that could outperform current approaches. Throughout Sections 4.2 and 4.3, we identify three promising approaches: random forest, a best subset regression containing just five variables, and UFA. For patients admitted to the ICU with a primary diagnosis of sepsis, we demonstrate that all three are capable of significantly outperforming SOFA score [10], a daily score widely used to evaluate a patient throughout their ICU stay. While all three have strong predictive performance for sepsis patients, we assert that the UFA system is particularly well-suited for the task of predicting mortality in critical care. UFA works by searching for ranges of the explanatory variables for which the target is significantly more or less likely. Then, it combines this information for all of the variables in the analysis to make a final prediction. This type of approach is very natural in the realm of clinical care, where laboratory results or vital signs outside of clinically defined thresholds are often associated with high mortality, while more normal values do not provide much information. UFA also has a variety of other practical advantages. First, UFA considers each variable individually. As a result, it can quickly and easily be applied to datasets with a very large number of features, including cases when the number of features is much larger than the number of observations. One can also easily introduce new variables that are interactions of existing features, if thought to be important for the application. In contrast, a standard logistic regression model will struggle as the number of variables approaches the number of observations, and thus will require preprocessing to remove irrelevant features. A second advantage of UFA is that it is fully automated, which allows for easy customization. In this thesis, this is demonstrated by applying UFA to the full MIMIC population, an AMI subpopulation, and a lung disease subpopulation, in addition to our cohort of sepsis patients. We found that the UFA system had consistently good predictive performance. For the two diagnosis-based subpopulations, we found that the UFA system had comparable performance to all of the other methods in terms of accuracy and AUROC.
For the full MIMIC population, we found that N-UFA and RF-UFA outperform SOFA score by as much as two days. They also have significantly higher AUROC than best subset regression. This result is not particularly surprising, as the best subset regression was tailored for the sepsis population through extensive data preprocessing. The final model contained just five variables (age, average temperature, average shock index, daily SOFA, and average bicarbonate) and has the benefit of being very simple. In the sepsis population for which it was constructed, it has AUROC and sensitivity comparable to the number of flags classifier. However, we saw that it does not always generalize to new patient cohorts. UFA, on the other hand, can quickly generate a new set of significant thresholds and flags that are particular to the population of interest, without user intervention. Though we only explore disease-based customization in this thesis, one could easily tailor the algorithm for a new point in time (e.g. patient status as of day two), for new outcomes (e.g. 28-day mortality), or for new types of patient cohorts, such as individuals treated at a particular facility or in a particular region. Finally, UFA is flexible and can be used for many purposes. The thresholds generated by the algorithm are easy to interpret, since they take the form of association rules, i.e.: when [VARIABLE] is above/below [VALUE], it is associated with significantly higher/lower mortality. In some applications, UFA might be used to simply generate a list of thresholds for a particular population, which can then be compared to current clinical guidelines or used as a jumping off point for future research. As shown in Table 4.12, one can also compare thresholds for different patient cohorts to identify variables that are predictive of mortality in one population but not another. This functionality is not present in any of the other classifiers. Moreover, the thresholds can be used for prediction and can be combined into a number of different classifiers. This thesis considered two UFA-based classifiers, one based on random forest (RF-UFA) and one based on the number of high and low mortality flags (N-UFA). One could easily use the thresholds within another predictive modeling framework if necessary for the application. For the application of mortality prediction, N-UFA in particular appears to be a good choice. Our results show that it has the advantage of maximizing sensitivity in unbalanced datasets, particularly for patient cohorts with low death rates. It also has several other practical advantages. First, it has been shown to be robust to missing and noisy data, which can be prevalent in critical care. Since N-UFA aggregates information, even if one flag is missing or assigned incorrectly due to noisy data, the aggregation of all the other low and high mortality flags may still lead to a correct prediction. In this thesis, we show that N-UFA performs better than or equal to random forest, the next most competitive method, even with up to 50% unreliable data. Second, predictions made by N-UFA are easy to summarize, visualize, and interpret. Figure 5.1 plots all of the sepsis patients in our dataset based on their number of high mortality and low mortality flags. Patients who died are indicated in red, while patients who survived are indicated in blue.
Figure 5.1 plots all of the sepsis patients in our dataset according to their numbers of high-mortality and low-mortality flags. Patients who died are shown in red, while patients who survived are shown in blue. The solid diagonal line is the linear boundary that best separates the two classes.

Figure 5.1: Visualization of the number-of-flags classifier, sepsis subpopulation

Since N-UFA is two-dimensional, Figure 5.1 is easy to interpret, and one can quickly see that the decision boundary does quite a good job of separating patients who lived from patients who died. When a physician is presented with a new sepsis patient, the UFA system can automatically apply the established sepsis thresholds, aggregate the number of thresholds violated by the new patient, and place him on the chart. The yellow circle in Figure 5.1 represents a fictitious new patient with five high-mortality flags and fifteen low-mortality flags. Based on this simple picture, one can quickly see that the patient is predicted to live. Further, by considering the patient's distance from the decision boundary and the other patients in his vicinity, the physician can quickly internalize the level of uncertainty associated with the prediction. In this case, one may be fairly comfortable predicting survival for the yellow patient but may ascribe a high level of uncertainty (or refuse to make a prediction) for a patient on or near the decision boundary. As such, we believe that UFA is a major step toward a mortality prediction model capable of individual prognosis.

Chapter 6

Future research

This chapter discusses possibilities for future work. Section 6.1 focuses specifically on the UFA system, while Section 6.2 discusses the use of temporal features in mortality prediction more broadly.

6.1 UFA

We believe that UFA is a promising first step toward a mortality prediction system capable of individual prognosis. In this section, we discuss three possible areas for future work.

6.1.1 Multiple testing problem

One possible drawback of UFA is that it conducts m x k statistical tests in the training phase in order to identify the optimal thresholds, where m is the number of variables and k is the number of potential thresholds per variable. As is well documented, multiple hypothesis testing can inflate the type I error rate and lead to significant results even when none exist [5, 25]. This drawback is also present in related methods, such as the minimum p-value approach to finding optimal cut points. In this thesis, we address the multiple testing problem by validating the UFA-generated thresholds on previously unseen data and demonstrating that we can classify new patients with high accuracy and AUROC. There are other possible approaches, however. For example, one can adjust the p-values for multiple testing using an approach such as the well-known Bonferroni method [5]. Future research should consider how alternative corrections for multiple testing affect the performance of the UFA system.

6.1.2 Threshold uncertainty

As explained in Section 3.2, the UFA system selects the optimal threshold t* for each explanatory variable in the training dataset, where t* is defined as the candidate threshold whose test statistic is maximal in absolute value. One might also be interested in quantifying the uncertainty associated with that threshold. To do this, we employed bootstrapping, a data-driven technique in which one resamples the training data many times to generate additional training datasets [5]. We then ran UFA on each resampled dataset to create a histogram of possible thresholds for each variable.
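A sketch of this resampling step, reusing the ufa_threshold function from the Chapter 5 sketch; the number of resamples and the sampling scheme are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_thresholds(x, died, n_boot=1000):
    """Re-run the univariate threshold search on resampled training sets to
    build a distribution of cut points for one variable (ufa_threshold is
    the sketch from Chapter 5)."""
    thresholds = []
    n = len(x)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample patients with replacement
        t, _ = ufa_threshold(x[idx], died[idx])
        if t is not None:
            thresholds.append(t)
    return np.array(thresholds)

# The spread of the bootstrap distribution quantifies threshold uncertainty;
# its mean or mode can replace t* if overfitting is a concern.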
Using the sepsis population, Figure 6.1 shows 1,000 bootstrapped thresholds for body temperature. The red dotted vertical line at 35.97°C marks the optimal cut point found using the full training data in Section 4.3.1.

Figure 6.1: Bootstrapped thresholds for low body temperature in adult sepsis patients

This information is useful in two ways. First, we can calculate the variance of the potential thresholds, which helps quantify uncertainty. Second, we can compare t* from the full training data to the bootstrapped distribution and determine whether it is consistent with the other trials. If we are concerned that t* may be overfit to the training data and not generalizable, we can instead use a feature of the bootstrapped distribution, such as its mean or mode, as the candidate threshold.

For the sepsis application, we ran 100 bootstraps per variable and used the mode of each distribution as the candidate threshold. On average, we found 5% fewer significant thresholds and, of the significant thresholds found, 15.6% differed by more than 5% from t*. When applied to unseen data, however, N-UFA achieved exactly the same AUROC using the bootstrapped thresholds as it did using the original thresholds. Since bootstrapping adds significant runtime to the UFA system and did not improve predictive performance for our application, we present the non-bootstrapped results in this thesis. However, we recommend further work in this area. For example, visual inspection of the bootstrapped distributions for the MIMIC II data reveals that some variables have a bimodal distribution, perhaps suggesting multiple cut points. It is also possible that overfitting is more of an issue in other datasets.

6.1.3 Multivariate approach

In its current form, UFA is univariate, which provides certain practical advantages: it can quickly and easily be applied to datasets with a very large number of features, and if the user believes a particular interaction is important for their application, it can be introduced as an additional feature. However, the algorithm could conceivably establish thresholds in multiple dimensions as well. The following example shows one possibility using a support vector machine (SVM).

In one dimension, UFA cycles through potential thresholds for a variable and compares the death rate beyond each threshold to a baseline death rate, calculated using patients in the interquartile range of that variable. In two dimensions, we use a similar approach; however, instead of cycling through candidate thresholds, we iterate through different cost penalties. Figure 6.2 plots patients according to two variables, systolic blood pressure and heart rate. Patients who died are shown in red, while patients who lived are shown in black. By adjusting the cost weight associated with misclassifying survivors, we control the purity of patients within the high-mortality area. Figure 6.2 shows the difference between a 1:1 cost penalty and a 2:1 cost penalty.

Figure 6.2: Examples of different cost penalties for two-dimensional flagging (panels: 1:1 cost penalty and 2:1 cost penalty)

As in the univariate case, we are interested in balancing the mortality rate within the high-mortality area (in pink) against the number of cases in that area. We use a procedure similar to the one described in Section 3.2.3.

Step 1: Using SVM, determine how much misclassification must be penalized in order to get a 100% death rate (or, equivalently, a 0% death rate) within the flagged area.

Step 2: Iterate from a 1:1 cost balance to the value from Step 1. For each value, run SVM and calculate the death rate within the pink area.
Compare this death rate to a baseline death rate, calculated from patients inside the interquartile ranges of both variables.

Step 3: Generate a Z-statistic for the two-sided proportion test, using the same method as in the one-variable case outlined in Section 3.2.2.

Step 4: Choose the cost balance with the maximum Z-statistic.

In this example, the optimal cost ratio was found to be 1.43:1, which is depicted in Figure 6.3. As desired, the pink area contains primarily patients who died, while simultaneously having fairly large support.

Figure 6.3: Optimal cost penalties for two-dimensional flagging (1.43:1)
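A sketch of this sweep using a linear SVM, under the assumption that the cost ratio is expressed through per-class misclassification weights; the grid of ratios and the minimum region size are illustrative choices, and scikit-learn's class_weight argument stands in for whatever SVM implementation was actually used.

import numpy as np
from sklearn.svm import SVC

def two_dim_flag(X, died, ratios=np.linspace(1.0, 10.0, 40)):
    """Sketch of two-dimensional flagging: sweep the cost penalty on
    misclassified survivors and keep the ratio whose high-mortality
    region has the largest Z-statistic versus the baseline rate."""
    # Baseline: patients in the interquartile range of BOTH variables.
    lo, hi = np.percentile(X, 25, axis=0), np.percentile(X, 75, axis=0)
    in_iqr = np.all((X >= lo) & (X <= hi), axis=1)
    p0, n0 = died[in_iqr].mean(), in_iqr.sum()

    best = (None, 0.0)
    for r in ratios:
        # Penalize misclassifying a survivor r times as heavily as a death,
        # which shrinks and purifies the predicted high-mortality region.
        clf = SVC(kernel="linear", class_weight={0: r, 1: 1.0}).fit(X, died)
        region = clf.predict(X) == 1       # predicted high-mortality area
        n1 = region.sum()
        if n1 < 10:
            continue
        p1 = died[region].mean()
        p = (p0 * n0 + p1 * n1) / (n0 + n1)
        se = np.sqrt(p * (1 - p) * (1 / n0 + 1 / n1))
        z = (p1 - p0) / se if se > 0 else 0.0
        if abs(z) > abs(best[1]):
            best = (r, z)
    return best  # e.g., a ratio near 1.43:1 for the example in Figure 6.3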
The disadvantage of this approach, however, is that we lose the ability to state the results of the algorithm as simple association rules (i.e., when [VARIABLE] is above/below [VALUE], it is associated with significantly higher/lower mortality). It is also not clear what the appropriate "baseline" death rate should be. In this example, we used patients who were in the interquartile range for both systolic blood pressure and heart rate; as the number of variables increases, however, the number of patients in that intersection becomes very small. This is another possible area for future research.

6.2 Temporal features in critical care

It seems reasonable that information about a patient's improving or worsening health status should improve our ability to accurately predict mortality. Using SOFA as an example, past research shows that the maximum daily SOFA score and delta SOFA (defined as maximum score minus admission score) correlate well with outcomes for patients in the ICU for two or more days [14]. Our data seem to tell a similar story: Figure 6.4 shows the average SOFA score by day for patients who ultimately died (in purple) and lived (in green), and suggests that the trends differ. In our analysis, however, we found that simply adding variables like maximum SOFA or the trend in SOFA did not improve our ability to predict mortality for previously unseen patients. For sepsis patients, a best subset regression model containing just five variables, none of them trend variables, had the same predictive performance as a much more complicated model including all of the temporal information.

Figure 6.4: Trends in SOFA score by day and mortality status

In this section, we discuss three ongoing analyses that consider new ways to incorporate temporal information into mortality prediction models. Rather than looking at summary statistics across time, such as the minimum or standard deviation, we investigate whether more sophisticated methods increase model performance.

6.2.1 Identifying subsets of patients where trend is important

This section describes a side analysis conducted to deconstruct the trend in average SOFA score across time depicted in Figure 6.4. The analysis shows that trend data do not improve our ability to predict sepsis mortality for fairly healthy patients or for especially critical patients; they are, however, useful for predicting mortality for patients who fall into neither group.

Figure 6.5 displays a parallel coordinates plot that we used to determine whether particular trends in SOFA score correspond to survival and in-hospital mortality. The only obvious pattern we observed was more red lines (patients who died) near the top of the plot and more black lines (patients who lived) near the bottom.

Figure 6.5: Parallel coordinates plot summarizing SOFA across time

SOFA (Day 4)    # Lived    # Died    % Died
13+             4          41        91%
5-12            86         80        48%
4 or less       40         9         18%

Next, we drilled down to patients with a day-four SOFA score of 5-12. Using clustering algorithms and manual analysis, we determined that the number of decreases in SOFA during an individual's stay allowed us to stratify patients further. As seen in Figure 6.6, patients whose SOFA score declines three times (i.e., goes down every day) have a death rate of 22%, while patients whose SOFA never declines have a death rate of 61%. We also tried to stratify patients based on the magnitude of changes (from day to day and across multiple days), as well as on particular sequences of increases and decreases, but did not discover anything of note.

Figure 6.6: Parallel coordinates plot summarizing SOFA across time (SOFA of 5-12)

Number of Decreases    # Lived    # Died    % Died
0                      7          11        61%
1                      28         35        56%
2                      37         30        45%
3                      14         4         22%

Using this information, we developed five clusters of patients based on SOFA score, outlined in Table 6.1 (a minimal assignment rule is sketched at the end of this section). If we subset to Cluster 5, there is no longer a deviation in the trend of average SOFA score from day 1 to day 4. Further, the majority of the patients outside of Cluster 5 were categorized based on their day-4 SOFA score (as opposed to trend data), which may explain why trend data seemed less important in our primary analysis.

Table 6.1: Patient clusters based on static and trend SOFA data

Cluster    Description         # Lived    # Died    % Died
1          Day 4 SOFA >= 13    4          41        91%
2          Day 4 SOFA <= 4     40         9         18%
3          Zero Down           7          11        61%
4          Three Down          14         4         22%
5          Other               65         65        50%

We went through a similar process to create patient clusters for bicarbonate levels, temperature, and shock index. By replacing SOFA and bicarbonate with the new clustered variables (which include some trend information) in our best subset regression, we were able to improve AUROC by 0.02 while maintaining accuracy. Using the clustered variables for temperature and shock index did not improve performance over our original model. These results suggest that trend data may be useful for mortality prediction, but possibly only for a subset of patients. Determining how to combine temporal and static features to identify these subgroups automatically is an area for future research.
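For concreteness, a minimal rule that reproduces the cluster assignments of Table 6.1 from a patient's day 1-4 SOFA scores (our own rendering of the stratification, not code from the thesis):

import numpy as np

def assign_cluster(sofa_by_day):
    """Assign a patient to one of the five clusters in Table 6.1 from their
    day-1..day-4 SOFA scores."""
    s = np.asarray(sofa_by_day, dtype=float)
    n_decreases = int(np.sum(np.diff(s) < 0))   # day-to-day declines
    if s[3] >= 13:
        return 1                                # Day 4 SOFA >= 13
    if s[3] <= 4:
        return 2                                # Day 4 SOFA <= 4
    if n_decreases == 0:
        return 3                                # SOFA never declines
    if n_decreases == 3:
        return 4                                # SOFA declines every day
    return 5                                    # other

print(assign_cluster([11, 9, 8, 7]))  # -> 4 (three decreases; death rate 22%)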
6.2.2 Characterizing uncertainty through filtering

Another way to use temporal data is to use a patient's progression to help quantify uncertainty about their true state on the current day. In this section, we use sequential importance resampling (SIR) [39, 40] to quantify the uncertainty in an individual patient's SOFA score. By combining a dynamical model for SOFA score progression with the observed SOFA scores for an individual patient, we can determine the posterior distribution of the SOFA score on each day of a patient's ICU stay.

We assume that a patient's true severity x_t evolves in time according to the following model:

    x_{t+1} = f_t(x_t) + w_t,    w_t ~ p_w,    t = 0, 1, 2, 3, 4.

Then, in order to actually implement particle filtering, we need to determine f_t, the distribution of the noise term w_t, and the prior p(x_0) for this particular application. There is precedent in the literature for using a linear model to describe disease progression [41]:

    x_{t+1} = a_t x_t + w_t.

Further, in the clinical domain, models are typically parameterized using observational data from past patients [41, 42]. Therefore, we used the full population of sepsis patients in MIMIC II to determine the coefficients a_t and the distribution of w_t. Since the progression of SOFA score is not consistent across time periods, particularly for patients who died (as seen in Figure 6.4), we decided to model each time step separately. The final system models for patients who died take the form

    x_{t+1} = a_t x_t + w_t,

with step-specific coefficients (0.95 for the first transition), and the system models for patients who lived take the same form with estimated coefficients 0.92, 0.91, 0.92, and 0.94 for the successive time steps. We assume the error terms w_t are normally distributed with mean zero and variance determined empirically.

Finally, at each time step we must select which model to use (the model for patients who lived or the model for patients who died). Writing the combined dynamics as

    x_{t+1} = d (a_t^die x_t + w_t^die) + (1 - d)(a_t^live x_t + w_t^live),

we tried a variety of approaches for choosing the weight d, outlined below.

1. Truth: d = 1 if the patient actually died and d = 0 otherwise. This method is the least useful in practice, since this information is not known; however, it minimizes the uncertainty in the model and therefore provides a useful baseline.

2. Data: d = 1 if the patient's SOFA score on day t predicts death under the optimal classifier and d = 0 otherwise.

3. Distribution: Divide the distribution p(x_t | y_{1:t}) into intervals, and let d be the probability mass within each interval multiplied by the population death rate for that range of SOFA scores, summed over all intervals. In this method 0 <= d <= 1, and the dynamics equation is a convex combination of the system models for patients who lived and patients who died. As a variation, we can round d to 1 if it is larger than 0.5 and to 0 otherwise, and select a single model.

Finally, we set the prior p(x_0) for each individual patient to the observed distribution of SOFA score at admission for all sepsis patients. We also assume that the observed SOFA score y_t is a measure of the patient's true severity plus some noise, modeling the relationship between y_t and the true state x_t as

    y_t = x_t + v_t,    v_t ~ N(0, 1).

With the model now fully specified, we generated p(x_t | y_{1:t}) for t = 1, 2, 3, 4, 5 for each patient in MIMIC II. Figure 6.7 displays six weighted histograms showing the prior p(x_0) and the posteriors p(x_t | y_{1:t}) for a representative patient. In each histogram there is a red vertical line at 9.5, since we found that the optimal classifier relying only on the daily SOFA score predicts death when the score is at least 9.5 and survival otherwise. For t > 0, there is also a purple vertical line indicating the observed SOFA score for this particular patient on each day.

Figure 6.7: Example SIR results for a representative patient (panels: day 0 prior through day 5; the true outcome of this patient was survival)

Across time, we see that the posterior distribution of the SOFA score narrows, as the sequence of observed values reduces uncertainty about the patient's true state. On day five, we observe a SOFA score of 10 for this patient. If we were to use this data point and our association rule predicting death for SOFA scores above 9.5, we would incorrectly predict that this patient died. If instead we analyze p(x_5 | y_{1:5}), we can use various features of the distribution to make a determination. One obvious choice is the expected value. Another option is the probability of death, denoted p_d, which we calculate by dividing p(x_t | y_{1:t}) into intervals and then summing, over all intervals, the probability mass multiplied by the population death rate for SOFA scores in that range. On day 5, we observe that E[x_5] < 9.5 and p_d < 0.500, both of which suggest a correct prediction of survival.

With this approach, the posterior distribution for each patient on each day is determined both by the general dynamics and by the patient's own observed SOFA values. The final distribution is therefore customized for each individual.
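A minimal SIR sketch for this model. For simplicity it propagates every particle through a single (survivor) dynamics model, whereas the thesis mixes the live and die models as described above; the prior, noise scale, and observation series are illustrative placeholders.

import numpy as np

rng = np.random.default_rng(1)

def sir_filter(y_obs, a_live, sigma_w=1.0, n_particles=5000):
    """Sequential importance resampling for the linear SOFA model
    x_{t+1} = a_t x_t + w_t, with y_t = x_t + v_t and v_t ~ N(0, 1)."""
    # Prior: in the thesis this is the empirical distribution of admission
    # SOFA scores; a vague normal stands in for it here.
    x = rng.normal(8.0, 4.0, n_particles)
    posteriors = []
    for t, y in enumerate(y_obs):
        if t > 0:                               # propagate through dynamics
            x = a_live[t - 1] * x + rng.normal(0, sigma_w, n_particles)
        w = np.exp(-0.5 * (y - x) ** 2)         # N(0, 1) observation likelihood
        w /= w.sum()
        idx = rng.choice(n_particles, n_particles, p=w)   # resample
        x = x[idx]
        posteriors.append(x.copy())             # particles ~ p(x_t | y_1:t)
    return posteriors

post = sir_filter(y_obs=[9, 10, 9, 8, 10], a_live=[0.92, 0.91, 0.92, 0.94])
print(np.mean(post[-1]))   # posterior mean E[x_t | y_1:t] on the final day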
Preliminary results such as those in Figure 6.7 suggest that features of the posterior distribution can correctly predict in-hospital mortality for individual patients in cases where the daily SOFA score fails. Unfortunately, at the population level, the two features explored above, the expected value and p_d, did not produce better results than predicting with the static SOFA score alone. However, future research could explore using other features of p(x_t | y_{1:t}) for prediction, such as the standard deviation, or try alternative models for the dynamics p(x_{t+1} | x_t) to improve results.

6.2.3 Learning patient-specific state parameters

This section explores a final way to use time-varying information to characterize patients. In the previous section, we explored methodology for characterizing our certainty about a patient's severity at any point in time, where severity was a continuous measure. In some situations, however, it may be more appropriate to model severity as two states, critical or not critical. In this section, we use the example of low blood pressure, or hypotension.

When sepsis is accompanied by a fall in blood pressure, it is called severe sepsis. When that hypotension is not responsive to fluids or other medications, the patient is said to be in septic shock, which is associated with very high mortality rates [43]. However, there is no consensus on the exact cutoff that defines hypotension. Different physicians consulted on this thesis disagreed about whether a threshold of 60, 65, or 70 mmHg was appropriate; UFA identified a threshold of 67.4 mmHg. Moreover, there is some thought that hypotension is actually a relative measure. The normal range for mean arterial blood pressure (MAP) in adults is 70-105 mmHg [44]; if a patient is normally at 70 mmHg, he would not be considered hypotensive at 65 mmHg, while a patient who is normally at 105 mmHg might be considered hypotensive at 70 mmHg.

In this section, rather than using a fixed threshold to characterize hypotension, we parameterize a patient's MAP time series using a Hamilton regime-switching model [45]. This model assumes that a patient has two states, a high state (normal) and a low state (hypotensive). For each patient, we can use a filtering approach to learn the optimal model parameters theta and the estimated state vector, i.e., whether the patient is hypotensive at each time step. The inferred states (s_1, s_2, ..., s_T), along with the probabilities of switching between the high state and the low state, are potentially informative measures of hypotension persistence.

Borrowing notation from Hamilton [45], we describe the MAP data y_t with a first-order autoregression,

    y_t = mu_{s_t} + phi * y_{t-1} + eps_t,    eps_t ~ N(0, sigma^2),

where s_t is a random variable that is 1 when the patient is in the high state (normal MAP) and 2 when the patient is in the low state (hypotension). We used a first-order, two-state Markov process to describe the transitions between s_t = 1 and s_t = 2.

CTP specification: One possibility is to use constant transition probabilities (CTP), where p_ij denotes the probability of being in state j at time t, conditional on being in state i at time t-1. The list of parameters needed to fully characterize the behavior of y_t is then theta = (phi, sigma, mu_1, mu_2, p_11, p_22). Figure 6.8 compares a plot of an example patient's blood pressure data (y_1, y_2, ...) with a plot of P(s_t = 2 | y_{1:t}), the probability of being in state 2 (hypotension) at time t. For this patient, the optimal theta turned out to be phi = 0.11, sigma = 4.23, mu_1 = 60.47, mu_2 = 50.67, p_11 = 0.968, p_22 = 0.966. We observe that, for this patient, both the high state and the hypotensive state are quite "sticky"; that is, p_11 and p_22 are both close to 1.

Figure 6.8: Example of Hamilton regime-switching model, CTP specification (MAP data and P(s_t = 2 | y_{1:t}))
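The filtered probabilities plotted in Figure 6.8 come from a short forward recursion. The sketch below assumes the reconstructed observation equation above and the CTP transition matrix; the MAP series is invented for illustration.

import numpy as np
from scipy.stats import norm

def hamilton_filter(y, mu, phi, sigma, P):
    """Filtered state probabilities P(s_t | y_1:t) for the two-state model
    y_t = mu[s_t] + phi * y_{t-1} + eps_t, eps_t ~ N(0, sigma^2), under the
    CTP specification with transition matrix P."""
    prob = np.array([0.5, 0.5])                  # flat prior over states
    filtered = []
    for t in range(1, len(y)):
        prior = P.T @ prob                       # predict the next state
        lik = norm.pdf(y[t], loc=mu + phi * y[t - 1], scale=sigma)
        post = prior * lik
        prob = post / post.sum()                 # Bayes update
        filtered.append(prob[1])                 # P(hypotensive | y_1:t)
    return np.array(filtered)

# Parameters from the example patient above:
P = np.array([[0.968, 0.032], [0.034, 0.966]])
y = np.array([68.0, 66.0, 55.0, 54.0, 56.0, 65.0])   # illustrative MAP series
print(hamilton_filter(y, mu=np.array([60.47, 50.67]), phi=0.11, sigma=4.23, P=P))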
If we assign this patient to the hypotensive state whenever P(s_t = 2 | y_{1:t}) >= 0.5, then Figure 6.8 implies that this patient had two hypotensive episodes during his first 72 hours in the ICU (hours 7-27 and hours 29-42), with an average episode length of 17 hours. In comparison, if we were to use a naive threshold of 65 mmHg to identify hypotensive episodes, we would say that this patient had eight episodes with an average episode length of 4.5 hours. The estimated state vector customizes our hypotension definition for this patient and provides a cleaner interpretation of the patient's blood pressure data, one which arguably conforms more closely to visual inspection.

TVTP specification: We can also allow the transition probabilities to depend both on the state at time t-1 and on other factors, such as the patient's drug regimen. Specifically, the transition probabilities evolve as logistic functions of a conditioning vector x_{t-1} [46]:

    p_{11,t} = exp(x_{t-1}' beta_1) / (1 + exp(x_{t-1}' beta_1))
    p_{22,t} = exp(x_{t-1}' beta_2) / (1 + exp(x_{t-1}' beta_2))

While beta_1 and beta_2 are specified to be time-invariant, certain elements of the conditioning vector change across time, resulting in time-varying transition probabilities (TVTP). For our analysis, x_{t-1} has dimension (3 x 1) and contains the patient's heart rate at time t-1, an indicator variable for whether the patient received vasopressors (drugs intended to raise blood pressure) at time t-1, and a constant term. We denote the parameters that govern the transition probabilities by beta = (beta_1', beta_2')', which has dimension (6 x 1). The list of parameters needed to fully characterize the behavior of y_t in the TVTP specification is then theta = (phi, sigma, mu_1, mu_2, beta).

Take, for example, a patient whose probability of staying in the hypotensive state (p_22) is 0.972 under the constant CTP specification. If we incorporate information about his drug regimen into the model, we find that p_{22,t} = 0.464 when the patient is being treated with vasopressors and 0.840 when he is not. These results conform to our expectation that a patient is more likely to leave the hypotensive state while on vasopressors. Figure 6.9 provides the MAP data for this example patient, alongside his estimated probability of being in the hypotensive state at each time step under both the CTP and the TVTP specifications.

Figure 6.9: Example of Hamilton regime-switching model, TVTP vs. CTP

Overall, the two approaches produce similar results in most time periods. However, we observe two key differences. First, the state estimates under the TVTP specification are more certain; that is, P(s_t = 2 | y_{1:t}) deviates less from zero and one. Second, the two methods differ for hours 42-49: the CTP specification shows a high probability that the patient is in the hypotensive state, while the TVTP specification is able to use the additional information about heart rate and vasopressor use to update its estimate.

Returning to the issue of predicting mortality, we hypothesize that the results of the Hamilton regime-switching model can improve predictions in two ways. First, one can use the patient-specific estimated state vector, as opposed to a single set threshold, to characterize variables like hypotension. Second, one can consider the parameters of each patient's switching model. For example, the probabilities of staying in the high state (p11) and in the hypotensive state (p22) might be useful measures of hypotension persistence for each patient. Similarly, the difference in p22 when a patient is on vasopressors versus off them could provide a measure of his response to treatment.
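The episode features used in Table 6.2 below can be read directly off a binary state vector, whether it comes from a fixed 65 mmHg threshold or from the estimated regime sequence. A minimal sketch, assuming hour-level states:

import numpy as np

def episode_features(hypo):
    """Summarize a binary hypotension state vector (1 = hypotensive in that
    hour): total hours, number of episodes, mean and max episode length."""
    hypo = np.asarray(hypo, dtype=int)
    # An episode starts wherever the state flips from 0 to 1.
    starts = np.flatnonzero(np.diff(np.concatenate(([0], hypo))) == 1)
    ends = np.flatnonzero(np.diff(np.concatenate((hypo, [0]))) == -1)
    lengths = ends - starts + 1
    return {
        "total_hours": int(hypo.sum()),
        "n_episodes": len(lengths),
        "mean_length": float(lengths.mean()) if len(lengths) else 0.0,
        "max_length": int(lengths.max()) if len(lengths) else 0,
    }

print(episode_features([0, 1, 1, 1, 0, 0, 1, 1, 0]))
# {'total_hours': 5, 'n_episodes': 2, 'mean_length': 2.5, 'max_length': 3}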
Table 6.2 compares the performance of three logistic regression models that attempt to predict sepsis mortality using only information about hypotension. The first two models contain four independent variables: the total number of hours in the hypotensive state, the total number of hypotensive episodes, the average episode length, and the length of the longest episode for each patient. The first model uses a population-wide threshold of 65 mmHg to define hypotension, while the second uses the patient-specific state vector. We see that the selected features of the state vector are not predictive of in-hospital mortality on previously unseen samples: accuracy is just 24%, compared to 56% for the model that uses a 65 mmHg threshold. However, our third model, which uses the optimal model parameters (phi, sigma, mu_1, mu_2, p_11, p_22) as the independent variables, has the highest accuracy and AUROC, at 76% and 0.705 respectively. This provides some confidence that the Hamilton regime-switching model is capturing important dynamics in patients' blood pressure data, and suggests that further analysis should focus on identifying additional features of the state vector that are more suitable for prediction.

Table 6.2: Test-set performance of classifiers based on the regime-switching model

Independent Variables                          Accuracy    AUROC
Episode Characteristics (65 mmHg Threshold)    56%         0.635
Episode Characteristics (State Vector)         24%         0.147
Model Parameters                               76%         0.705

Chapter 7

Conclusion

The objective of this thesis was to build a mortality prediction model that can outperform current approaches. We aimed to improve current methodologies in two key ways:

1. By incorporating a wider range of variables, including time-dependent features
2. By exploring different predictive modeling techniques beyond standard regression

We identified three different outcome prediction approaches that can significantly outperform current methods. The first model was a best subset regression containing just five static variables. It was not immediately obvious that a linear model with no temporal features could achieve such high performance; however, we conclude that it is possible if care is taken to properly customize the model through extensive data preprocessing. The other two models, random forest and the UFA system, have the advantage of being more flexible, and they do not require the user to perform variable selection. As such, they are easy to customize to new populations and have consistently strong predictive performance.

In addition to being easily customizable, the UFA system in general (and the N-UFA classifier in particular) has several other practical advantages that make it well-suited for use in critical care:

1. It provides the user with simple association rules characterizing the relationship between individual variables and mortality
2. It is robust to noise and missing data
3. It maximizes sensitivity in unbalanced datasets, particularly for rare events
4. It displays results in two dimensions, making them easy to interpret and visualize

As such, we believe that UFA is a major step toward a mortality prediction model capable of individual prognosis.

References
1. Zimmerman JE, Kramer AA. A history of outcome prediction in the ICU. Current Opinion in Critical Care 2014; 20(5): 550-556.
2. Power GS, Harrison DA. Why try to predict ICU outcomes? Current Opinion in Critical Care 2014; 20(5): 544-549.
3. Connors AF, et al. A controlled trial to improve care for seriously ill hospitalized patients: the study to understand prognoses and preferences for outcomes and risks of treatments (SUPPORT). JAMA 1995; 274(20): 1591-1598.
4. Salluh JIF, Soares M. ICU severity of illness scores: APACHE, SAPS and MPM. Current Opinion in Critical Care 2014; 20(5): 557-565.
5. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer Science+Business Media; 2009.
6. Knaus WA, et al. The APACHE III prognostic system: risk prediction of hospital mortality for critically ill hospitalized adults. Chest 1991; 100(6): 1619-1636.
7. Zimmerman JE, Kramer AA, McNair DS, Malila FM. Acute Physiology and Chronic Health Evaluation (APACHE) IV: hospital mortality assessment for today's critically ill patients. Crit Care Med 2006; 34(5): 1297-1310.
8. Le Gall JR, Lemeshow S, Saulnier F. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. JAMA 1993; 270(24): 2957-2963.
9. Moreno RP, Metnitz PGH, Almeida E, et al. SAPS 3 – from evaluation of the patient to evaluation of the intensive care unit. Intensive Care Med 2005; 31: 1336-1355.
10. Vincent JL, et al. The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure. Intensive Care Med 1996; 22: 707-710.
11. Minne L, Abu-Hanna A, de Jonge E. Evaluation of SOFA-based models for predicting mortality in the ICU: a systematic review. Crit Care 2008; 12: R161.
12. Moreno R, et al. The use of maximum SOFA score to quantify organ dysfunction/failure in intensive care. Intensive Care Med 1999; 25(7): 686-696.
13. Ferreira FL, et al. Serial evaluation of the SOFA score to predict outcome in critically ill patients. JAMA 2001; 286(14): 1754-1758.
14. Saeed M, et al. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public-access intensive care unit database. Crit Care Med 2011; 39(5): 952-960.
15. Soares M, et al. Performance of six severity-of-illness scores in cancer patients requiring admission to the intensive care unit: a prospective observational study. Crit Care 2004; 8(4): R194-R203.
16. Melamed A, Sorvillo F. The burden of sepsis-associated mortality in the United States from 1999 to 2005: an analysis of multiple-cause-of-death data. Crit Care 2009; 13(1).
17. Angus DC, van der Poll T. Severe sepsis and septic shock. New England Journal of Medicine 2013; 369: 840-851.
18. Sepsis Fact Sheet. National Institute of General Medical Sciences, Nov. 2012. Available: http://www.nigms.nih.gov/Education/Pages/factsheet_sepsis.aspx
19. Martin G. Sepsis, severe sepsis and septic shock: changes in incidence, pathogens, and outcomes. Expert Rev Anti Infect Ther 2012; 10(6): 701-706.
20. Martin G, Mannino DM, Eaton S, Moss M. The epidemiology of sepsis in the United States from 1979 through 2000. New England Journal of Medicine 2003; 348: 1546-1554.
21. Angus DC, Linde-Zwirble WT, Lidicker J, Clermont G, Carcillo J, Pinsky MR. Epidemiology of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care. Critical Care Medicine 2001; 29: 1303-1310.
22. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco: Morgan Kaufmann Publishers; 2005.
23. Rice JA. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury: Brooks/Cole; 2007.
24. Mazumdar M, Glassman JR. Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer treatments. Stat Med 2000; 19(1): 113-132.
25. Williams B, Mandrekar JN, Mandrekar SJ, Cha SS, Furth AF. Finding optimal cutpoints for continuous covariates with binary and time-to-event outcomes. Technical Report, Mayo Clinic, Department of Health Sciences Research, 2006. Available: http://www.mayo.edu/research/documents/biostat-79pdf/doc-10027230
26. Baum RL, Godt JW. Early warning of rainfall-induced shallow landslides and debris flows in the USA. Landslides 2010; 7: 259-272.
27. Martina MLV, Todini E, Libralon A. A Bayesian decision approach to rainfall thresholds based flood warning. Hydrol Earth Syst Sci 2006; 10: 413-426.
28. Sheth M, Welsch R, Markuzon N. A fully-automated algorithm for identifying optimal thresholds in data. Working paper.
29. Sheth M, Celi L, Mark R, Welsch R, Markuzon N. Predicting mortality in critical care using time-varying parameters and the univariate flagging algorithm. Working paper.
30. Eick CF, Zeidat N, Zhao Z. Supervised clustering – algorithms and benefits. Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, 2004.
31. Kalil A. Septic shock clinical presentation. Medscape, 20 Oct 2014. Available: http://emedicine.medscape.com/article/168402-clinical. Accessed 16 Mar 2015.
32. Dellinger RP, et al. Surviving sepsis campaign: international guidelines for management of severe sepsis and septic shock: 2012. Crit Care Med 2013; 41(2): 580-637.
33. Friedman JH, Fisher NI. Bump hunting in high-dimensional data. Statistics and Computing 1999; 9: 123-143.
34. MedlinePlus from National Institutes of Health. 16 March 2015. Available: http://www.nlm.nih.gov/medlineplus/medlineplus.html
35. Lichman M. Iris Data Set. UCI Machine Learning Repository, 2013. Available: http://archive.ics.uci.edu/ml/datasets/Iris
36. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics 1936; 7(2): 179-188.
37. Kapouleas I, Weiss SM. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. Readings in Machine Learning 1990; 177-183.
38. Weiss GM. Mining with rarity: a unifying framework. ACM SIGKDD Explorations 2004; 6(1): 7-19.
39. Gordon NJ, Salmond DJ, Smith AFM. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F 1993; 140(2). Accessed May 16, 2014. Available: http://www.ece.iastate.edu/~namrata/EE520/Gordonnovelapproach.pdf
40. Doucet A, Godsill S, Andrieu C. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 2000; 10: 197-208. Accessed May 14, 2014. Available: ftp://ftp.idsa.prd.fr/local/aspi/legland/ref/doucet00b.pdf
41. Helm JE, Lavieri MS, Van Oyen MP, Stein J, Musch D. Dynamic forecasting and control algorithms of glaucoma progression for clinician decision support. Operations Research (2nd round of review), 2012.
42. Rangel-Frausto MS, Pittet D, Hwang T, Woolson RF, Wenzel RP. The dynamics of disease progression in sepsis: Markov modeling describing the natural history and the likely impact of effective antisepsis agents. Clinical Infectious Diseases 1998; 27(1): 185-190.
43. Nachimuthu SK, Haug PJ. Early detection of sepsis in the emergency department using Dynamic Bayesian Networks. AMIA Annual Symposium Proceedings 2012; 653-662.
44. Normal Hemodynamic Parameters. LiDCO Group. Accessed 13 Dec 2014. Available: http://www.lidco.com/clinical/hemodynamic.php
45. Hamilton JD. Regime-switching models. Palgrave Dictionary of Economics, 18 May 2005.
46. Diebold FX, Lee J, Weinbach GC. Regime switching with time-varying transition probabilities. In: Nonstationary Time Series Analysis and Cointegration (Advanced Texts in Econometrics). Oxford: Oxford University Press; p. 283-302.
47. Sheth MB, Mark R, Chahin A, Markuzon N. Protective effects of rheumatoid arthritis in septic ICU patients. 2014 IEEE International Conference on Big Data, 27-30 Oct 2014.
48. Elixhauser A, Steiner C, Harris DR, Coffey R. Comorbidity measures for use with administrative data. Med Care 1998; 36: 8-27.
49. Bale C, Kakrani AK, Dabadghao VS, Sharma ZD. Sequential organ failure assessment score as prognostic marker in critically ill patients in a tertiary care intensive care unit. International Journal of Medicine and Public Health 2013; 3(3): 155-158.
50. Rheumatoid Arthritis. Arthritis Foundation, 2014. Available: http://www.arthritis.org/conditions-treatments/disease-center/rheumatoid-arthritis/

Appendix A

Protective effects of rheumatoid arthritis in septic ICU patients

For the purposes of this thesis, we only considered disease-based subpopulations defined by primary diagnosis. As a side analysis, however, we considered all patients with a primary diagnosis of sepsis and analyzed their secondary diagnoses at the time of admission. In some cases, we observed significant discrepancies in mortality. One of the most striking results concerned the 30-day mortality rate for septic patients with rheumatoid arthritis (RA), an auto-immune disorder. At 29.0%, it was 21 percentage points lower than the 30-day mortality rate for all severe sepsis patients (defined as sepsis with persistent hypotension), and the observed difference was statistically significant even after controlling for patient demographics and disease severity.

This result is particularly interesting because RA is an auto-immune disorder. Auto-immune disorders are a group of diseases that arise from an abnormal immune response of the body against substances and tissues normally present in the body. The immunological responses are usually caused by antibody production and T cell activation with intolerance to the host's own cells. Autoimmunity has been speculated to carry a worse outcome in sepsis compared to normal hosts. However, recent research hints at a different behavior and, perhaps, a better outcome for patients with auto-immunity in cases of severe sepsis. Given the considerable interest in this topic, we conducted a side investigation into the relationship between rheumatoid arthritis and sepsis mortality [47]. The results in this section are intended to stand alone, and we define the sepsis population and the target outcome slightly differently than in the rest of the thesis.

Methods

The primary study population for this analysis consists of adult ICU patients with severe sepsis. As described in Section 2.3.1, we define sepsis using the ICD-9 code 038, which indicates septicemia. For this analysis, however, we further restricted our population to require markers of systemic shock.
In particular, we require the presence of hypotension, defined as three consecutive mean arterial blood pressure (MAP) readings below 65 mmHg within a 30-minute period. Throughout this analysis, we denote this sepsis definition "038-H". All patients were required to have 24 hours of ICU data beyond their first hypotensive episode, and we selected the last ICU stay meeting these criteria for each patient. We identified 1,302 patients in the database meeting the 038-H criterion.

This study includes a series of sensitivity analyses to determine the impact of these inclusion criteria on the results and to identify the group of patients that benefits most from an RA diagnosis. These sensitivity analyses include variations to the definitions of severe sepsis and hypotension, as well as removal of the 24-hour data requirement.

Throughout this study, RA was defined using the set of ICD-9-CM diagnosis codes that correspond to the RA co-morbidity flag in the Elixhauser index. The Elixhauser methodology was specifically developed for use with administrative health data, making it uniquely suited for our purposes [48]. We also use ICD-9-CM diagnosis codes to identify subpopulations with other auto-immune disorders, such as multiple sclerosis and Crohn's disease, to determine whether these conditions show a protective effect in severe sepsis similar to that of RA.

For each patient in the study, we used the MIMIC II database to collect demographic information, such as gender and age, along with SOFA score. SOFA includes information about the condition of a patient's respiratory, renal, and cardiovascular systems, among others, and has been found to be a strong predictor of prognosis for septic ICU patients [49]. We also used MIMIC II to gather information on the use of corticosteroids, which can be used in the treatment of both RA and severe sepsis. The admission histories from discharge summaries were manually checked for home use of prednisone prior to ICU admission.

The primary outcome of interest in this study is 30-day mortality for ICU patients with severe sepsis. Thirty-day mortality was measured beginning with the patient's hospital discharge date and relies on death information from the Social Security Death Index (SSDI).

We compared mortality rates for sepsis patients with and without RA using a chi-squared test. We also compared baseline characteristics such as gender, age, and SOFA score using a chi-squared test or Student's t-test (as appropriate) to determine whether the two groups of patients differed significantly along these dimensions. To test whether the presence of RA was significantly predictive of sepsis mortality, we utilized a logistic regression model [5]. Our first model controlled for age, gender, and SOFA score, in addition to RA. Subsequent models introduced covariates to control for patients' drug regimens, including chronic use of prednisone and use of corticosteroids in the ICU.
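A minimal sketch of this adjusted analysis using statsmodels; the data frame below is a synthetic stand-in with placeholder column names, not the MIMIC II extract.

import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(0)
df = pd.DataFrame({
    "died_30d": np.random.binomial(1, 0.5, 500),   # synthetic outcome
    "age":      np.random.normal(67, 15, 500),
    "male":     np.random.binomial(1, 0.55, 500),
    "sofa":     np.random.normal(10, 4, 500),
    "ra":       np.random.binomial(1, 0.03, 500),
})  # the actual study uses 1,302 real ICU stays

X = sm.add_constant(df[["age", "male", "sofa", "ra"]])
model = sm.Logit(df["died_30d"], X).fit(disp=0)
print(model.summary2())       # coefficients (B), SE, p-values as in Table 2
print(np.exp(model.params))   # Exp(B), the adjusted odds ratios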
Results

Septic patients with RA have significantly lower 30-day mortality rates

As described in the methods section, the primary patient cohort for this study consists of individuals with a documented diagnosis of septicemia, hypotension as indicated by three MAP readings below 65 mmHg within 30 minutes, and 24 hours of data beyond the first hypotensive episode. This cohort contains 1,302 individuals; of these, 31 patients have RA. Our data show significantly lower mortality rates for the septic patients with RA, suggesting a protective effect. The observed 30-day mortality rate for hypotensive, septic patients with RA is 29.0%, as compared to 50.6% for patients without RA. This difference is significant, with a p-value of 0.016 (Table 1).

Table 1: 30-day mortality and demographics for hypotensive, septic patients with and without RA

                        RA            No RA         p-value
N                       31            1,271         --
30-day mortality        29.0%         50.6%         0.016
Gender (% Male)         41.9%         55.5%         0.130
Age (Mean ± SE)         65.1 ± 2.7    67.7 ± 0.5    0.358
SOFA (Mean ± SE)        10.1 ± 0.9    10.1 ± 0.1    0.978

We considered various co-factors that could explain the observed effect. One possibility is that the difference in mortality rates is due to systematic differences between the septic populations with and without RA. For example, women are more likely to suffer from RA [50], a fact that is clearly represented in our data: 41.9% of septic RA patients are male, 13.6 percentage points lower than for septic patients without RA. We also observe that septic RA patients are slightly younger than their counterparts without RA, with a mean age of 65.1 compared to 67.7, though this difference is not statistically significant (Table 1).

To control for underlying differences between the two groups, we utilized a logistic regression model. The model included variables for patient gender, age, and SOFA score at ICU admission, as well as an indicator for whether the patient had RA. After adjusting for patient demographics and health status, we find that the presence of RA is still significantly predictive of survival at 30 days, with a p-value of 0.024 (Table 2). This result is consistent with the conclusion that RA is protective in severe sepsis.

Table 2: Logistic regression analysis of sepsis mortality

Outcome: 30-Day Mortality
Independent Variable    B        SE      p-value    Exp(B)
Age                     0.02     0.00    0.000      1.02
Gender                  0.12     0.12    0.310      1.13
SOFA                    0.14     0.01    0.000      1.15
RA Flag                 -0.95    0.42    0.024      0.39
Intercept               -2.89    0.32    0.000      0.06

Identifying related populations where RA is beneficial

The results in the previous section establish that 30-day mortality is significantly lower for a particular group of severe sepsis patients with RA. To better understand this phenomenon, we conducted a series of related analyses to determine the particular conditions under which the result holds and to identify the groups of patients who benefit most from a diagnosis of RA. First, we investigated whether the observed protective effect extends to septic patients with other auto-immune disorders, such as Crohn's disease and multiple sclerosis. Next, we analyzed the role of sepsis severity by rerunning our analysis for less critically ill populations, such as sepsis patients identified using the Angus criteria or patients with documented septicemia but no hypotension. Lastly, we analyzed the role of ICU length of stay by varying the amount of time that patients were required to remain in the ICU. The following results confirm that septic patients with auto-immune disorders have better 30-day survival rates than the overall sepsis population. Further, the results suggest that RA is particularly beneficial for patients with a certain level of sepsis severity, and that length of stay is an important factor in this analysis.

Septic patients with auto-immune disorders have significantly lower 30-day mortality rates

Using the severe sepsis definition from the previous section, we find that RA is not the only auto-immune disease with an observed protective effect.
Septic ICU patients with a variety of different auto-immune disorders have significantly lower 30-day mortality than severe sepsis patients overall. For example, while the full septic population has a 30-day mortality rate of 50.1%, septic patients with multiple sclerosis have a mortality rate of 25.0%, septic patients with ulcerative colitis have a mortality rate of 26.7%, and septic patients with systemic lupus erythematosus (SLE) have a mortality rate of 12.5% (Table 3). Moreover, if we define a group of sepsis patients with any of the auto-immune conditions listed in Table 3, we find that their 30-day mortality is 21 percentage points lower than that of the full sepsis population. This result is significant at the 0.01 level even after controlling for patient demographics and disease severity (Table 4). These results suggest that the observed protective effect applies to a variety of auto-immune conditions, supporting the theory that it is broadly related to immune modulation in sepsis rather than specifically to the physiology of RA.

Table 3: 30-day mortality for septic ICU patients with auto-immune disorders

Condition                       N        Death Rate
All Severe Sepsis               1,302    50%
All Auto-Immune                 65       29%
Rheumatoid arthritis            15       27%
Ulcerative colitis              15       33%
Crohn's disease                 13       46%
Multiple sclerosis              12       25%
Systemic lupus erythematosus    8        13%
Myasthenia gravis               4        0%
Ankylosing spondylitis          1        0%
Psoriatic arthritis             0        --

Table 4: Logistic regression analysis of sepsis mortality, all auto-immune

Outcome: 30-Day Mortality
Independent Variable    B        SE      p-value    Exp(B)
Age                     0.02     0.00    0.000      1.02
Gender                  0.12     0.12    0.303      1.13
SOFA                    0.14     0.01    0.000      1.15
Auto-Immune Flag        -0.81    0.30    0.006      0.44
Intercept               -2.84    0.33    0.000      0.06

Level of sepsis severity and length of stay help determine when RA is protective

For this analysis, we defined severe sepsis as a documented diagnosis of septicemia (038 code) along with hypotension. In this section, we vary the patient selection criteria in order to identify the conditions under which the RA diagnosis is most protective.

In Section 2.3.1, we discuss two sepsis definitions that are less stringent than the one selected for this analysis. The first, the Angus criteria, is the most general, while the second, which requires a documented diagnosis of septicemia but not hypotension, falls in the middle. For both of these definitions, septic patients with RA have lower 30-day mortality than patients without RA; however, for the Angus criteria the difference is only three percentage points and, in both cases, the difference is not statistically significant (Table 5). We conclude that a certain level of patient severity is necessary for RA to have a strong protective effect.

At the same time, the data suggest that once that level is reached, our results are very stable to variations in methodology. For example, if we make our sepsis definition more stringent by adding the Angus criteria on top of the 038-H definition, there is a 21 percentage point difference in 30-day mortality between patients with and without RA, which is statistically significant. We also see a significant protective effect if we define hypotension using a shorter or longer measurement window, or if we define hypotension based on the use of vasopressors rather than low blood pressure (Table 5).
In all of these cases, septic patients with RA have a 30-day mortality rate at least 16 percentage points lower than that of septic patients without RA, and the result is significant even after controlling for patient demographics and disease severity.

Table 5: P-values for chi-squared and logistic regression analyses of sepsis mortality, sensitivity analysis

Outcome: 30-Day Mortality
Sepsis and Hypotension Definition    RA     No RA    p-value (χ2)    p-value (logistic)
Angus                                30%    33%      0.39            0.77
038, No Hypotension Required         32%    40%      0.12            0.21
038-H + Angus                        31%    52%      0.03            0.03
038-H, 3 MAP <= 65 (20 Min)          28%    54%      0.01            0.01
038-H, 3 MAP <= 65 (60 Min)          31%    49%      0.02            0.04
038-H, Vasopressors                  32%    48%      0.03            0.04

We also find that a patient's length of stay in the ICU is an important factor in determining whether RA is beneficial during severe sepsis. All of the analyses discussed above require that patients have at least 24 hours of data beyond their first hypotensive episode or dose of vasopressors (depending on the analysis) or, when hypotension is not required, at least 24 hours in the ICU. Removing this requirement dampens the protective effect of RA. For the original study inclusion criteria, the difference in 30-day mortality between patients with and without RA decreases from 21.6 percentage points to 14.0 percentage points once the 24-hour data requirement is removed. While this is still a statistically significant difference at the 0.05 level after controlling for patient demographics, variations on the original criteria, including a longer measurement window for hypotension or defining hypotension based on vasopressors, are no longer statistically significant without the 24-hour requirement (Table 6).

Table 6: P-values for chi-squared and logistic regression analyses of sepsis mortality, no 24-hour requirement

Outcome: 30-Day Mortality
Sepsis and Hypotension Definition    RA     No RA    p-value (χ2)    p-value (logistic)
038-H                                41%    55%      0.08            0.05
038-H, 3 MAP <= 65 (20 Min)          44%    59%      0.08            0.05
038-H, 3 MAP <= 65 (60 Min)          40%    52%      0.08            0.11
038-H, Vasopressors                  39%    51%      0.10            0.08

Consistent with earlier results, these findings suggest that the level of sepsis severity is important in determining whether RA is protective. Patients who do not remain in the ICU for 24 hours are either healthy enough to be discharged during that period or so critical that they do not survive. Previous results established that RA is beneficial in more critically ill sepsis patients; these results could suggest that there is also an important upper limit on patient severity.

Finally, there is evidence that if we consider septic patients who stayed in the ICU even longer than 24 hours, RA has a strong protective effect even in the absence of hypotension. As previously discussed, if we identify sepsis patients using a documented diagnosis of septicemia but without requiring hypotension, there is an eight percentage point difference in mortality rates between patients with and without RA; the difference, however, is not statistically significant (Table 5). If we modify the patient selection criteria to include only patients who remained in the ICU for at least 72 hours, the difference in mortality increases to 20 percentage points and becomes statistically significant. These results suggest that septic patients with ICU stays over three days are another group for whom RA is particularly beneficial, and further research should explore the implications of this finding.
Use of corticosteroids does not explain the protective effect

One possible explanation for the protective effect observed in patients with RA and other auto-immune disorders is the medication taken for these conditions, particularly corticosteroids. We observe that patients with RA are eight times more likely to use prednisone at home (i.e., chronically) than patients without RA, and are also more likely to receive steroids once admitted to the hospital (Table 7). Given that corticosteroids are anti-inflammatory drugs, it is possible that the medication itself, rather than RA, is protective.

Table 7: Corticosteroid usage rates for septic patients with and without RA

                             RA       No RA    p-value
N                            31       1,271    --
Treatment (Home)
  Home Prednisone            35.5%    4.3%     0.000
Treatment (Hospital)
  Any Steroids               64.5%    37.7%    0.002
  Prednisone                 54.8%    14.7%    0.000
  Hydrocortisone             51.6%    30.8%    0.012

The data, however, do not support this conclusion. Stratifying 30-day mortality by the presence of RA and chronic prednisone use, we see that the difference in mortality between septic RA patients and non-RA patients is more than 20 percentage points regardless of whether the patient uses prednisone chronically (Table 8). Further, adding home prednisone use to our logistic regression model does not affect the significance of RA in predicting 30-day mortality; the presence of RA remains a significant independent variable, with a p-value of 0.03. Conversely, use of home prednisone is not a significant predictor of 30-day mortality after controlling for RA and other patient demographics (Table 9).

Table 8: 30-day mortality by chronic prednisone use and RA

                       RA                   No RA
Treatment              #     Death Rate    #        Death Rate
Home Prednisone        11    27.3%         54       48.1%
No Home Prednisone     20    30.0%         1,212    50.6%

Table 9: Logistic regression analysis of sepsis mortality, including home prednisone use

Outcome: 30-Day Mortality
Independent Variable    B        SE      p-value    Exp(B)
Age                     0.02     0.00    0.00       1.02
Gender                  0.12     0.12    0.30       1.13
SOFA                    0.14     0.01    0.00       1.15
RA Flag                 -0.94    0.43    0.03       0.39
Home Prednisone         -0.01    0.28    0.97       0.99
Intercept               -2.88    0.32    0.00       0.06

Discussion

This study examines survival outcomes after treatment of severe sepsis in patients with RA as compared to the rest of the population. It clearly demonstrates a statistically significant 30-day mortality benefit in a selected subgroup of hypotensive patients with RA, as well as in patients with other auto-immune diseases. The result is robust to various definitions of hypotension, including septic shock requiring vasopressors (Table 5). Although the sample size was relatively small, the study demonstrated a strong, statistically significant effect that was confirmed within other auto-immune diseases (Table 3). Looking at a variety of auto-immune disorders supports the theory of a shared advantage, due either to the actual pathophysiology of those diseases or to the common treatment modalities used.

Patients' corticosteroid drug regimens, and in particular chronic prednisone use, did not show any difference in 30-day mortality (Table 8). However, the study has not explored whether the difference seen in mortality rates is due to the effect of other medications. Increasing numbers of patients with RA are treated with immune modulators, also known as disease-modifying anti-rheumatic drugs (DMARDs), which potentially contribute to the survival benefit seen in our analysis. Patients with RA and auto-immune disease who are treated with DMARDs are under immune modulation prior to developing sepsis.
Theoretically, in light of the role of immune modulation in sepsis, being on DMARDs could lead to favorable effects and lessened organ damage. If immune modulation in RA does influence 30-day mortality, as demonstrated in this study, then patients taking DMARDs should demonstrate a clear survival benefit. Further analysis will examine the different chronic treatments used in auto-immune diseases and whether they show any improved outcome in sepsis. Examining the rate at which organ failure occurs in both groups might also help us understand why patients with RA and auto-immune diseases have lower mortality, and whether their suppressed immunity actually helped them survive by protecting their organs from the collateral damage caused by the stark immune response created in severe sepsis.

Conclusion

Our study found that septic patients with RA and other auto-immune diseases have significantly lower 30-day mortality rates than patients without these conditions, even after controlling for demographics and disease severity. Moreover, we identified groups of sepsis patients in the ICU for whom the RA co-morbidity is particularly beneficial, and showed that both sepsis severity and ICU length of stay are important in identifying those subpopulations. Patients' corticosteroid drug regimens were one possible explanation for the protective effect observed in this study; however, we were able to show that corticosteroids alone, including chronic prednisone use, do not account for the lower 30-day mortality rates among septic RA patients. We put forth an alternate theory: immune suppression or modulation prior to developing sepsis can actually protect the host, in contrast to the current belief, which has been hard to prove in various studies, that it increases the risk from severe infections.

The results of this study contribute to the current understanding of the relationship between sepsis and the immune system. Our findings also illustrate the power of big data in clinical research. Without access to a large observational database and the tools to analyze it, we would have been unable to identify a sufficient number of septic RA patients to conduct our analysis. Even with data from over 30,000 ICU stays, the final number of RA patients in this study was 31. Future research should try to replicate this result on other large observational datasets in order to increase the sample size and provide further validation.
Appendix B

Table B.1: All MIMIC II variables used for analysis, p-values by day
(Trend variables are computed to date and are therefore unavailable for some early days, marked "--".)

Variable                  Day 1   Day 2   Day 3   Day 4

Stay
ICUSTAY_ADMIT_AGE         0.015   0.015   0.015   0.015
WEIGHT_FIRST              0.666   0.666   0.666   0.666
SAPSI_FIRST               0.040   0.040   0.040   0.040
SOFA_FIRST                0.041   0.041   0.041   0.041
NUM_ELIX_GROUPS           0.367   0.367   0.367   0.367
GENDER                    0.297   0.297   0.297   0.297
SUBJECT_ICUSTAY_SEQ       0.207   0.207   0.207   0.207
ICUSTAY_FIRST_SERVICE     0.549   0.549   0.549   0.549
ETHNICITY                 0.313   0.313   0.313   0.313

Day
DNR                       0.447   0.243   0.090   0.013
VENTILATION               0.581   0.414   0.498   0.573
PRESSORS                  0.213   0.665   0.728   0.285
XRAY.DAY                  0.873   0.370   0.867   0.445
MRI.DAY                   0.849   0.601   0.862   0.888
CT_SCAN.DAY               0.309   0.715   0.703   0.468
ECHO.DAY                  0.611   0.176   0.010   0.000
SAPS.DAY                  0.034   0.022   0.004   0.001
SOFA.DAY                  0.041   0.009   0.000   0.000
URINE.DAY                 0.141   0.001   0.000   0.000
UNITS_IN.DAY              0.142   0.619   0.524   0.458
UNITS_OUT.DAY             0.063   0.075   0.001   0.000
FLUID_BALANCE.DAY         0.354   0.518   0.007   0.001
NUM.CREATINE              0.405   0.323   0.406   0.148
MEAN.CREATINE             0.199   0.178   0.088   0.052
NUM.SODIUM                0.401   0.432   0.407   0.153
MEAN.SODIUM               0.465   0.389   0.694   0.122
NUM.BUN                   0.416   0.290   0.411   0.158
MEAN.BUN                  0.001   0.000   0.000   0.000
NUM.CHLORIDE              0.455   0.350   0.422   0.150
MEAN.CHLORIDE             0.231   0.182   0.780   0.856
NUM.BICARB                0.461   0.408   0.327   0.134
MEAN.BICARB               0.155   0.036   0.000   0.000
NUM.GLUCOSE               0.443   0.596   0.587   0.083
MEAN.GLUCOSE              0.816   0.176   0.117   0.627
NUM.MAGNES                0.444   0.464   0.278   0.148
MEAN.MAGNES               0.496   0.342   0.316   0.224
NUM.CALCIUM               0.631   0.452   0.119   0.149
MEAN.CALCIUM              0.491   0.560   0.684   0.570
NUM.PHOS                  0.544   0.330   0.232   0.120
MEAN.PHOS                 0.010   0.001   0.005   0.008
NUM.HGT                   0.214   0.027   0.026   0.032
MEAN.HGT                  0.419   0.267   0.509   0.309
NUM.HGB                   0.349   0.303   0.012   0.043
MEAN.HGB                  0.171   0.491   0.705   0.388
NUM.WBC                   0.361   0.317   0.019   0.045
MEAN.WBC                  0.725   0.371   0.482   0.171
NUM.PLATLET               0.249   0.140   0.012   0.039
MEAN.PLATLET              0.369   0.277   0.026   0.001

Hourly
MINHR                     0.549   0.041   0.031   0.005
MAXHR                     0.336   0.347   0.269   0.357
AVGHR                     0.524   0.074   0.048   0.051
SDHR                      0.289   0.503   0.620   0.061
MINRR                     0.058   0.646   0.688   0.465
MAXRR                     0.552   0.664   0.146   0.627
AVGRR                     0.096   0.563   0.531   0.754
SDRR                      0.235   0.413   0.171   0.205
MINTEMP                   0.023   0.354   0.407   0.007
MAXTEMP                   0.003   0.014   0.017   0.006
AVGTEMP                   0.001   0.018   0.013   0.002
SDTEMP                    0.494   0.382   0.290   0.699
MINMAP                    0.054   0.043   0.004   0.001
MAXMAP                    0.239   0.245   0.239   0.025
AVGMAP                    0.009   0.001   0.001   0.000
SDMAP                     0.568   0.475   0.738   0.434
MINSI                     0.449   0.030   0.028   0.000
MAXSI                     0.395   0.269   0.109   0.000
AVGSI                     0.393   0.188   0.002   0.000
SDSI                      0.520   0.291   0.159   0.058

Trend (To Date)
VENTILATION.TOTAL.DAYS    --      0.585   0.580   0.188
VENT.EVER                 --      0.433   0.531   0.757
PSRS.TOTAL.DAYS           --      0.220   0.053   0.016
PSRS.EVER                 --      0.489   0.469   0.399
XRAY.TOTAL.DAYS           --      0.136   0.075   0.284
XRAY.EVER                 --      0.108   0.081   0.294
MRI.TOTAL.DAYS            --      0.384   0.301   0.503
MRI.EVER                  --      0.454   0.363   0.590
CT_SCAN.TOTAL.DAYS        --      0.678   0.653   0.737
CT_SCAN.EVER              --      0.589   0.781   0.749
ECHO.TOTAL.DAYS           --      0.425   0.439   0.422
ECHO.EVER                 --      0.599   0.658   0.688
SAPS.DAY_MIN              --      0.021   0.013   0.005
SAPS.DAY_MEAN             --      0.013   0.006   0.002
SAPS.DAY_MAX              --      0.023   0.012   0.003
SAPS.DAY_RANGE            --      0.414   0.535   0.442
SAPS.DAY_TREND1           --      0.566   0.486   0.342
SAPS.DAY_TREND2           --      --      0.633   0.292
SAPS.DAY_TREND3           --      --      --      0.178
SOFA.DAY_MIN              --      0.016   0.001   0.000
SOFA.DAY_MEAN             --      0.013   0.002   0.000
SOFA.DAY_MAX              --      0.018   0.007   0.001
SOFA.DAY_RANGE            --      0.687   0.506   0.708
SOFA.DAY_TREND1           --      0.280   0.202   0.160
SOFA.DAY_TREND2           --      --      0.068   0.004
SOFA.DAY_TREND3           --      --      --      0.013
URINE.DAY_MIN             --      0.085   0.020   0.001
URINE.DAY_MEAN            --      0.008   0.000   0.000
URINE.DAY_MAX             --      0.002   0.000   0.000
URINE.DAY_RANGE           --      0.004   0.001   0.000
URINE.DAY_TREND1          --      0.422   0.793   0.175
URINE.DAY_TREND2          --      --      0.284   0.521
URINE.DAY_TREND3          --      --      --      0.325
UNITS_IN.DAY_MIN          --      0.604   0.480   0.531
UNITS_IN.DAY_MEAN         --      0.188   0.407   0.488
UNITS_IN.DAY_MAX          --      0.103   0.144   0.157
UNITS_IN.DAY_RANGE        --      0.082   0.040   0.068
UNITS_IN.DAY_TREND1       --      0.343   0.467   0.581
UNITS_IN.DAY_TREND2       --      --      0.338   0.570
UNITS_IN.DAY_TREND3       --      --      --      0.295
UNITS_OUT.DAY_MIN         --      0.069   0.016   0.002
UNITS_OUT.DAY_MEAN        --      0.033   0.002   0.000
UNITS_OUT.DAY_MAX         --      0.045   0.004   0.001
UNITS_OUT.DAY_RANGE       --      0.217   0.028   0.013
UNITS_OUT.DAY_TREND1      --      0.567   0.110   0.232
UNITS_OUT.DAY_TREND2      --      --      0.462   0.197
UNITS_OUT.DAY_TREND3      --      --      --      0.471
FLUID_BAL.DAY_MIN         --      0.306   0.007   0.001
FLUID_BAL.DAY_MEAN        --      0.627   0.317   0.178
FLUID_BAL.DAY_MAX         --      0.322   0.463   0.510
FLUID_BAL.DAY_RANGE       --      0.059   0.025   0.024
FLUID_BAL.DAY_TREND1      --      0.493   0.261   0.407
FLUID_BAL.DAY_TREND2      --      --      0.381   0.190
FLUID_BAL.DAY_TREND3      --      --      --      0.264
CREATINE_MIN              --      0.128   0.094   0.071
CREATINE_MEAN             --      0.183   0.149   0.109
CREATINE_MAX              --      0.247   0.243   0.173
CREATINE_RANGE            --      0.589   0.589   0.680
CREATINE_TREND1           --      0.487   0.204   0.057
CREATINE_TREND2           --      --      0.320   0.012
CREATINE_TREND3           --      --      --      0.118
SODIUM_MIN                --      0.388   0.416   0.461
SODIUM_MEAN               --      0.406   0.428   0.311
SODIUM_MAX                --      0.451   0.432   0.202
SODIUM_RANGE              --      0.639   0.731   0.389
SODIUM_TREND1             --      0.587   0.472   0.155
SODIUM_TREND2             --      --      0.552   0.548
SODIUM_TREND3             --      --      --      0.650
BUN_MIN                   --      0.000   0.000   0.000
BUN_MEAN                  --      0.000   0.000   0.000
BUN_MAX                   --      0.000   0.000   0.000
BUN_RANGE                 --      0.481   0.306   0.152
BUN_TREND1                --      0.273   0.190   0.363
BUN_TREND2                --      --      0.159   0.128
BUN_TREND3                --      --      --      0.086
CHLORIDE_MIN              --      0.174   0.282   0.475
CHLORIDE_MEAN             --      0.195   0.319   0.444
CHLORIDE_MAX              --      0.251   0.350   0.398
CHLORIDE_RANGE            --      0.589   0.559   0.627
CHLORIDE_TREND1           --      0.787   0.067   0.735
CHLORIDE_TREND2           --      --      0.284   0.049
CHLORIDE_TREND3           --      --      --      0.167
BICARB_MIN                --      0.047   0.015   0.004
BICARB_MEAN               --      0.059   0.013   0.001
BICARB_MAX                --      0.099   0.019   0.001
BICARB_RANGE              --      0.378   0.609   0.457
BICARB_TREND1             --      0.668   0.097   0.007
BICARB_TREND2             --      --      0.272   0.001
BICARB_TREND3             --      --      --      0.022
GLUCOSE_MIN               --      0.408   0.185   0.248
GLUCOSE_MEAN              --      0.329   0.133   0.162
GLUCOSE_MAX               --      0.382   0.230   0.200
GLUCOSE_RANGE             --      0.538   0.430   0.391
GLUCOSE_TREND1            --      0.294   0.632   0.490
GLUCOSE_TREND2            --      --      0.363   0.443
GLUCOSE_TREND3            --      --      --      0.632
MAGNES_MIN                --      0.443   0.318   0.233
MAGNES_MEAN               --      0.396   0.347   0.258
MAGNES_MAX                --      0.430   0.797   0.741
MAGNES_RANGE              --      0.606   0.348   0.331
MAGNES_TREND1             --      0.678   0.367   0.600
MAGNES_TREND2             --      --      0.655   0.389
MAGNES_TREND3             --      --      --      0.708
CALCIUM_MIN               --      0.553   0.559   0.536
CALCIUM_MEAN              --      0.573   0.646   0.650
CALCIUM_MAX               --      0.627   0.709   0.614
CALCIUM_RANGE             --      0.682   0.473   0.646
CALCIUM_TREND1            --      0.439   0.502   0.534
CALCIUM_TREND2            --      --      0.654   0.390
CALCIUM_TREND3            --      --      --      0.359
PHOS_MIN                  --      0.000   0.000   0.000
PHOS_MEAN                 --      0.001   0.001   0.001
PHOS_MAX                  --      0.008   0.011   0.007
PHOS_RANGE                --      0.332   0.578   0.708
PHOS_TREND1               --      0.392   0.219   0.294
PHOS_TREND2               --      --      0.677   0.238
PHOS_TREND3               --      --      --      0.573
HGT_MIN                   --      0.613   0.628   0.709
HGT_MEAN                  --      0.576   0.609   0.606
HGT_MAX                   --      0.575   0.652   0.577
HGT_RANGE                 --      0.431   0.634   0.698
HGT_TREND1                --      0.007   0.391   0.647
HGT_TREND2                --      --      0.118   0.632
HGT_TREND3                --      --      --      0.077
HGB_MIN                   --      0.682   0.642   0.794
HGB_MEAN                  --      0.565   0.695   0.770
HGB_MAX                   --      0.498   0.670   0.698
HGB_RANGE                 --      0.604   0.749   0.826
HGB_TREND1                --      0.012   0.620   0.691
HGB_TREND2                --      --      0.072   0.722
HGB_TREND3                --      --      --      0.045
WBC_MIN                   --      0.436   0.412   0.202
WBC_MEAN                  --      0.558   0.518   0.360
WBC_MAX                   --      0.592   0.568   0.518
WBC_RANGE                 --      0.537   0.592   0.552
WBC_TREND1                --      0.326   0.597   0.078
WBC_TREND2                --      --      0.348   0.381
WBC_TREND3                --      --      --      0.269
PLATLET_MIN               --      0.243   0.071   0.015
PLATLET_MEAN              --      0.293   0.152   0.061
PLATLET_MAX               --      0.352   0.254   0.143
PLATLET_RANGE             --      0.799   0.606   0.635
PLATLET_TREND1            --      0.650   0.045   0.030
PLATLET_TREND2            --      --      0.139   0.005
PLATLET_TREND3            --      --      --      0.007
HR_MIN                    --      0.226   0.090   0.024
HR_MEAN                   --      0.480   0.204   0.113
HR_MAX                    --      0.540   0.616   0.676
HR_RANGE                  --      0.098   0.090   0.093
HR_TREND1                 --      0.002   0.596   0.734
HR_TREND2                 --      --      0.013   0.503
HR_TREND3                 --      --      --      0.038
RR_MIN                    --      0.178   0.420   0.811
RR_MEAN                   --      0.294   0.297   0.454
RR_MAX                    --      0.689   0.566   0.737
RR_RANGE                  --      0.376   0.694   0.812
RR_TREND1                 --      0.316   0.519   0.403
RR_TREND2                 --      --      0.584   0.592
RR_TREND3                 --      --      --      0.293
TEMP_MIN                  --      0.181   0.432   0.325
TEMP_MEAN                 --      0.001   0.001   0.000
TEMP_MAX                  --      0.007   0.005   0.003
TEMP_RANGE                --      0.593   0.596   0.627
TEMP_TREND1               --      0.139   0.804   0.300
TEMP_TREND2               --      --      0.191   0.419
TEMP_TREND3               --      --      --      0.570
MAP_MIN                   --      0.124   0.096   0.052
MAP_MEAN                  --      0.024   0.004   0.000
MAP_MAX                   --      0.312   0.350   0.378
MAP_RANGE                 --      0.641   0.671   0.602
MAP_TREND1                --      0.515   0.441   0.200
MAP_TREND2                --      --      0.430   0.094
MAP_TREND3                --      --      --      0.102
SI_MIN                    --      0.097   0.059   0.004
SI_MEAN                   --      0.155   0.018   0.002
SI_MAX                    --      0.280   0.414   0.423
SI_RANGE                  --      0.366   0.414   0.409
SI_TREND1                 --      0.454   0.336   0.043
SI_TREND2                 --      --      0.032   0.073
SI_TREND3                 --      --      --      0.004
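The following sketch illustrates how a day-by-day univariate screen of this kind might be generated. This appendix does not restate which statistical test produced the p-values in table B.1, so a Welch two-sample t-test stands in for numeric variables (binary indicators such as DNR would need a different test, e.g., chi-square); the DataFrame `day_k` (one row per ICU stay, numeric features plus a 0/1 `died_30day` column) is a hypothetical name.

```python
# Illustrative univariate screen in the style of table B.1.
# Assumes a hypothetical DataFrame `day_k` of numeric features for one day
# plus a 0/1 outcome column; the Welch t-test is a stand-in choice.
import pandas as pd
from scipy import stats

def univariate_pvalues(day_k: pd.DataFrame, outcome: str = "died_30day") -> pd.Series:
    died = day_k[outcome] == 1
    pvals = {}
    for col in day_k.columns.drop(outcome):
        a = day_k.loc[died, col].dropna()    # non-survivors
        b = day_k.loc[~died, col].dropna()   # survivors
        pvals[col] = stats.ttest_ind(a, b, equal_var=False).pvalue
    return pd.Series(pvals).sort_values()
```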
Table B.2: All significant UFA thresholds, adult sepsis patients
Ranked by absolute z-statistic

Variable                    Threshold              N     % Died
SOFA                        More than 12.0         74    68.9%
AVG.SAPS                    More than 19.4         84    61.9%
SAPS                        More than 18.1         73    61.6%
MAX.URINE                   Less than 999.5        104   60.6%
AVG.URINE                   Less than 516.5        81    65.4%
AVG.SOFA                    More than 12.7         84    63.1%
MIN.SOFA                    More than 10.1         93    60.2%
URINE                       Less than 944.3        152   57.9%
MAX.SOFA                    More than 15.2         66    65.2%
MAX.SAPS                    More than 24.2         53    66.0%
BICARB                      Less than 17.7         75    65.3%
PLATELET                    Less than 85.3         96    57.3%
MAX.UNITS_OUT               Less than 1017.8       56    66.1%
AVG.PLATELET                Less than 109.0        106   54.7%
MIN.PLATELET                Less than 81.7         109   55.0%
MINTEMP                     Less than 35.5         76    59.2%
AVG.UNITS_OUT               Less than 806.2        77    59.7%
SOFA_FIRST                  More than 18.1         12    100.0%
AVG.BICARB                  Less than 17.6         100   57.0%
MEAN.PHOS                   More than 4.5          93    52.7%
MIN.SAPS                    More than 15.2         117   51.3%
MIN.PHOS                    More than 4.3          58    62.1%
MAX.PLATELET                Less than 116.4        79    55.7%
UNITS_OUT                   Less than 927.7        111   55.0%
MAXSI                       More than 1.6          14    92.9%
AVG.SODIUM                  Less than 134.5        59    57.6%
MAX.AVGSI.TD                More than 1.6          71    57.7%
MIN.FLUID_BALANCE.DAY       More than 1,634.9      75    57.3%
AVG.AVGMAP.TD               Less than 66.5         62    59.7%
SAPSI_FIRST                 More than 28.1         15    86.7%
MAX.MEAN.BICARB             Less than 19.8         100   56.0%
AVG.MEAN.PHOS               More than 4.7          96    54.2%
RNG.URINE.DAY               Less than 579.9        106   54.7%
AVG.AVGTEMP.TD              Less than 36.1         50    60.0%
MAX.MEAN.SODIUM             Less than 135.0        33    66.7%
AVGSI                       More than 1.1          20    80.0%
MIN.MEAN.WBC                More than 25.8         16    81.3%
AVG.AVGSI.TD                More than 0.9          124   50.8%
MINSI                       More than 0.7          113   52.2%
AVGMAP                      Less than 67.4         86    55.8%
AVGTEMP                     Less than 36.0         57    57.9%
MIN.URINE.DAY               Less than 123.0        91    54.9%
AVGHR                       More than 110.3        42    64.3%
MIN.MEAN.SODIUM             Less than 126.5        7     100.0%
MEAN.BICARB                 More than 26.1         113   8.8%
AVG.MEAN.CHLORIDE           Less than 98.7         29    62.1%
MAX.AVGTEMP.TD              Less than 36.8         10    90.0%
MIN.MEAN.BICARB             Less than 13.6         57    57.9%
MIN.MEAN.BUN                More than 64.9         48    60.4%
TREND3.AVGTEMP              Less than 1.0          67    55.2%
MAX.AVGMAP.TD               Less than 83.9         24    70.8%
AVG.URINE.DAY               More than 1,690.3      178   11.2%
RNG.UNITS_OUT.DAY           Less than 1,029.5      142   46.5%
MEAN.CREATINE               Less than 0.9          190   14.7%
DNR                         More than 0.0          93    48.4%
AVG.MEAN.WBC                More than 31.7         18    72.2%
MEAN.BUN                    More than 57.4         121   50.4%
MEAN.BUN                    Less than 22.4         159   12.6%
MEAN.SODIUM                 Less than 134.9        59    55.9%
TREND1.AVGSI                Less than 0.8          140   16.4%
MAX.MEAN.BUN                More than 92.4         48    60.4%
MAX.MEAN.CHLORIDE           Less than 103.8        68    48.5%
MAX.MEAN.PHOS               More than 5.4          104   48.1%
AVG.MEAN.BUN                Less than 21.4         130   13.8%
MEAN.WBC                    More than 36.4         15    73.3%
MIN.AVGSI.TD                More than 0.6          121   47.9%
MAXMAP                      More than 118.6        87    11.5%
URINE.DAY                   More than 1,890.3      190   13.2%
MINMAP                      Less than 52.1         73    53.4%
MAXTEMP                     Less than 36.6         80    50.0%
AVG.MEAN.BUN                More than 64.9         84    53.6%
FLUID_BALANCE.DAY           Less than -527.2       121   14.0%
AVGRR                       Less than 13.5         33    57.6%
MINHR                       More than 101.6        26    65.4%
MIN.AVGTEMP.TD              Less than 35.3         159   44.0%
MIN.MEAN.BUN                Less than 16.6         134   13.4%
MAX.MEAN.BUN                Less than 27.4         139   15.1%
RNG.MEAN.HGB                Less than 0.4          18    66.7%
TREND1.AVGMAP               More than 1.2          95    16.8%
RNG.URINE.DAY               More than 1,439.0      228   14.9%
SDSI                        More than 0.2          14    71.4%
RNG.AVGHR.TD                Less than 28.8         45    51.1%
MAX.URINE.DAY               More than 2,727.1      176   12.5%
SOFA.DAY                    Less than 4.1          129   11.6%
AVG.FLUID_BALANCE.DAY       Less than 715.5        87    12.6%
AVG.MEAN.CREATINE           Less than 1.0          166   16.9%
RNG.MEAN.HGT                Less than 1.1          20    65.0%
MAXMAP                      Less than 75.9         39    59.0%
MIN.URINE.DAY               More than 1,195.1      78    10.3%
MAX.MEAN.BICARB             More than 25.7         186   15.6%
AVG.MEAN.WBC                Less than 3.6          13    69.2%
FLUID_BALANCE.DAY           More than 3,107.2      75    52.0%
MAX.MEAN.CREATINE           Less than 1.0          123   16.3%
MIN.MEAN.CHLORIDE           Less than 99.6         103   43.7%
TREND1.AVGHR                Less than 0.9          145   20.7%
AVG.MEAN.CALCIUM            Less than 6.8          12    75.0%
TREND2.AVGRR                More than 1.5          35    5.7%
MAX.FLUID_BALANCE.DAY       Less than 587.8        19    0.0%
MINRR                       Less than 11.2         146   41.1%
MIN.UNITS_OUT.DAY           More than 1,388.0      92    15.2%
TREND2.AVGSI                Less than 0.8          119   16.8%
MIN.MEAN.CREATINE           Less than 0.9          210   18.6%
MIN.UNITS_OUT.DAY           Less than 36.0         34    58.8%
AVG.AVGHR.TD                More than 94.0         176   42.0%
MIN.UNITS_IN.DAY            More than 2,871.7      118   43.2%
MAXHR                       More than 125.5        74    45.9%
MIN.AVGHR.TD                More than 76.6         127   44.9%
AVGSI                       Less than 0.6          113   14.2%
MINMAP                      More than 76.1         61    11.5%
SDMAP                       More than 14.1         59    11.9%
AVG.MEAN.BICARB             More than 23.5         161   15.5%
MEAN.WBC                    Less than 2.7          8     75.0%
TREND2.AVGSI                More than 2.2          4     100.0%
MIN.MEAN.CALCIUM            Less than 6.4          25    56.0%
TREND3.AVGSI                Less than 0.8          109   19.3%
MAX.MEAN.WBC                More than 37.3         32    56.3%
MAX.MEAN.CALCIUM            Less than 7.0          10    70.0%
AVGMAP                      More than 93.7         78    12.8%
AVG.FLUID_BALANCE.DAY       More than 3,535.4      141   44.7%
MIN.FLUID_BALANCE.DAY       Less than -473.0       193   16.1%
RNG.MEAN.CALCIUM            More than 1.5          42    47.6%
AVG.MEAN.HGB                More than 12.0         37    10.8%
SDHR                        Less than 2.8          22    59.1%
RNG.AVGSI.TD                More than 1.8          10    70.0%
MAX.MEAN.WBC                Less than 3.1          6     83.3%
XRAY.TODATE                 More than 3.0          22    54.5%
MIN.MEAN.WBC                Less than 2.9          23    52.2%
RNG.MEAN.GLUCOSE            More than 113.7        79    43.0%
TREND3.AVGMAP               Less than 0.7          6     83.3%
MAX.FLUID_BALANCE.DAY       More than 19,123.8     13    0.0%
MIN.MEAN.BICARB             More than 22.2         116   16.4%
AVG.MEAN.CHLORIDE           More than 113.5        81    39.5%
TREND2.AVGMAP               Less than 0.6          3     100.0%
RNG.MEAN.SODIUM             Less than 2.0          53    45.3%
RNG.MEAN.CREATINE           Less than 0.2          160   19.4%
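Table B.2 above ranks UFA cut-points by absolute z-statistic. The sketch below is a schematic version of this kind of cut-point search, not the thesis' exact UFA implementation; the quantile grid and minimum group size are illustrative choices.

```python
# Schematic single-variable cut-point search in the spirit of table B.2:
# scan candidate thresholds in both directions and score each flagged group
# with a two-proportion z-statistic against the unflagged group.
import numpy as np
import pandas as pd

def best_cutpoint(values: pd.Series, died: pd.Series):
    """Return (z, threshold, direction) with the largest |z| for one variable."""
    best = (0.0, None, None)
    for direction in ("More than", "Less than"):
        for t in np.quantile(values.dropna(), np.linspace(0.05, 0.95, 19)):
            # NaNs compare False and fall into the unflagged group (a simplification)
            flag = values > t if direction == "More than" else values < t
            n1, n0 = int(flag.sum()), int((~flag).sum())
            if min(n1, n0) < 10:          # illustrative minimum group size
                continue
            p1, p0 = died[flag].mean(), died[~flag].mean()
            p = died.mean()               # pooled death rate
            z = (p1 - p0) / np.sqrt(p * (1 - p) * (1.0 / n1 + 1.0 / n0))
            if abs(z) > abs(best[0]):
                best = (z, t, direction)
    return best
```

Applied to a variable such as SOFA, a search of this kind would return a threshold and direction analogous to the "More than 12.0" row at the top of the table.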
Table B.3: Comparison of classifiers with varying amounts of missing data, confidence intervals
For each row of the table, an increasing percentage of each variable in the MIMIC II dataset was randomly replaced with missing values.

             N-UFA (UFA-based)                            Random Forest (Section 4.2)
% Missing    Accuracy              AUC                    Accuracy              AUC
0%           77.5% (75.1, 79.9)    0.819 (0.797, 0.841)   79.0% (76.9, 81.1)    0.823 (0.796, 0.851)
5%           77.5% (74.9, 80.1)    0.820 (0.793, 0.847)   78.3% (76.7, 79.8)    0.812 (0.783, 0.841)
10%          78.1% (75.3, 80.8)    0.817 (0.793, 0.842)   77.1% (73.6, 80.7)    0.812 (0.785, 0.840)
25%          77.9% (76.0, 79.7)    0.816 (0.792, 0.839)   76.9% (74.7, 79.1)    0.819 (0.791, 0.847)
50%          76.2% (73.9, 78.4)    0.790 (0.764, 0.815)   71.9% (69.3, 74.6)    0.771 (0.744, 0.799)

             Best subset regression (Section 4.2)         Logistic regression (comparison)
% Missing    Accuracy              AUC                    Accuracy              AUC
0%           74.8% (72.6, 77.0)    0.831 (0.799, 0.862)   69.7% (65.7, 71.6)    0.698 (0.642, 0.753)
5%           74.2% (72.1, 76.3)    0.818 (0.787, 0.849)   68.5% (67.2, 69.8)    0.659 (0.644, 0.673)
10%          74.4% (72.1, 76.7)    0.812 (0.786, 0.839)   66.0% (63.1, 68.8)    0.636 (0.600, 0.672)
25%          74.0% (70.6, 77.5)    0.781 (0.763, 0.799)   67.5% (64.1, 70.9)    0.631 (0.576, 0.686)
50%          70.4% (67.9, 72.8)    0.706 (0.673, 0.740)   58.3% (53.3, 63.2)    0.598 (0.566, 0.629)

Table B.4: Comparison of classifiers with varying amounts of imprecise data, confidence intervals
For each row, an increasing percentage of each variable in MIMIC II is randomly perturbed by a value drawn from a normal distribution with mean zero and the empirical variance of the variable in question. The 0% rows correspond to the original, unperturbed data.

             N-UFA (UFA-based)                            Random Forest (Section 4.2)
% Varied     Accuracy              AUC                    Accuracy              AUC
0%           77.5% (75.1, 79.9)    0.819 (0.797, 0.841)   79.0% (76.9, 81.1)    0.823 (0.796, 0.851)
5%           77.3% (74.5, 80.2)    0.808 (0.785, 0.830)   76.2% (74.0, 78.3)    0.805 (0.777, 0.833)
10%          76.5% (73.7, 79.3)    0.811 (0.788, 0.834)   77.5% (74.2, 80.8)    0.816 (0.782, 0.851)
25%          77.9% (75.1, 80.6)    0.811 (0.786, 0.836)   77.1% (74.9, 79.3)    0.795 (0.768, 0.822)
50%          75.8% (73.0, 78.5)    0.796 (0.766, 0.825)   76.3% (73.7, 79.0)    0.802 (0.775, 0.829)

             Best subset regression (Section 4.2)         Logistic regression (comparison)
% Varied     Accuracy              AUC                    Accuracy              AUC
0%           74.8% (72.6, 77.0)    0.831 (0.799, 0.862)   68.7% (65.7, 71.6)    0.698 (0.642, 0.753)
5%           73.3% (71.5, 75.0)    0.818 (0.789, 0.847)   65.4% (61.7, 69.1)    0.638 (0.602, 0.674)
10%          74.6% (72.1, 77.1)    0.821 (0.786, 0.856)   70.0% (68.1, 71.9)    0.694 (0.671, 0.717)
25%          75.4% (73.3, 77.4)    0.788 (0.759, 0.818)   63.1% (59.1, 67.1)    0.611 (0.563, 0.658)
50%          75.8% (73.7, 77.9)    0.790 (0.755, 0.824)   68.8% (64.6, 73.1)    0.681 (0.635, 0.727)
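For completeness, a minimal sketch of the two corruption procedures behind tables B.3 and B.4 is given below, assuming a numeric feature matrix `X` as a pandas DataFrame; the function names are illustrative, not from the thesis code.

```python
# Illustrative corruption procedures for the robustness experiments.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed for repeatability

def inject_missing(X: pd.DataFrame, frac: float) -> pd.DataFrame:
    """Randomly replace `frac` of each column's entries with NaN (table B.3 setup)."""
    Xm = X.copy()
    for col in Xm.columns:
        mask = rng.random(len(Xm)) < frac
        Xm.loc[mask, col] = np.nan
    return Xm

def inject_noise(X: pd.DataFrame, frac: float) -> pd.DataFrame:
    """Perturb `frac` of each column with mean-zero Gaussian noise whose
    variance equals the column's empirical variance (table B.4 setup)."""
    Xn = X.copy()
    for col in Xn.columns:
        mask = rng.random(len(Xn)) < frac
        noise = rng.normal(0.0, X[col].std(), mask.sum())
        Xn.loc[mask, col] = Xn.loc[mask, col] + noise
    return Xn
```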