Predicting Mortality for Patients in Critical Care:
a Univariate Flagging Approach
by
Mallory Sheth
B.A., Stanford University (2008)
Submitted to the Sloan School of Management
in partial fulfillment of the requirements for the degree of
Master of Science in Operations Research
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2015
© 2015 Mallory Sheth. All rights reserved.
The author hereby grants to MIT and Draper Laboratory permission to reproduce and distribute publicly
paper and electronic copies of this thesis document in whole or in part.
Signature of Author …………………………………………………………………………………………
Sloan School of Management
May 15, 2015
Certified by…………………………………………………………………………………………………..
Natasha Markuzon
The Charles Stark Draper Laboratory, Inc.
Technical Supervisor
Certified by…………………………………………………………………………………………………..
Roy E. Welsch
Eastman Kodak Leaders for Global Operations Professor of Management
Professor of Statistics and Engineering Systems
Thesis Supervisor
Accepted by………………………………………………………………………………………………….
Dimitris Bertsimas
Boeing Leaders for Global Operations Professor
Co-director, Operations Research Center
Predicting Mortality for Patients in Critical Care:
a Univariate Flagging Approach
by
Mallory Sheth
Submitted to the Sloan School of Management on May 15, 2015, in partial fulfillment of the requirements
for the degree of Master of Science in Operations Research
Abstract
Predicting outcomes for critically ill patients is a topic of considerable interest. The most widely
used models utilize data from early in a patient’s stay to predict risk of death. While research has
shown that use of daily information, including trends in key variables, can improve predictions
of patient prognosis, this problem is challenging as the number of variables that must be
considered is large and increasingly complex modeling techniques are required.
The objective of this thesis is to build a mortality prediction system that improves upon current
approaches. We aim to do this in two ways:
1. By incorporating a wider range of variables, including time-dependent features
2. By exploring different predictive modeling techniques beyond standard regression
We identify three promising approaches: a random forest model, a best subset regression
containing just five variables, and a novel approach called the Univariate Flagging Algorithm
(UFA). In this thesis, we show that all three methods significantly outperform a widely-used
mortality prediction approach, the Sequential Organ Failure Assessment (SOFA) score.
However, we assert that UFA in particular is well-suited for predicting mortality in critical care.
It can detect optimal cut-points in data, easily scales to a large number of variables, is easy to
interpret, is capable of predicting rare events, and is robust to noise and missing data. As such,
we believe it is a valuable step toward individual patient survival estimates.
Technical Supervisor: Natasha Markuzon
The Charles Stark Draper Laboratory, Inc.
Thesis Supervisor: Roy E. Welsch
Eastman Kodak Leaders for Global Operations Professor of Management
Professor of Statistics and Engineering Systems
Acknowledgements
First of all, I would like to thank my advisers, Natasha Markuzon and Roy Welsch. Though they are both
very busy, they always made me feel like a priority. They took time to review my findings, answer
questions, and give thoughtful feedback for which I am extremely grateful. This thesis would not have
been possible without them.
I would also like to thank several members of MIT’s Laboratory for Computational Physiology. Mornin
Feng and Li-wei Lehman provided valuable help in accessing and understanding the MIMIC II database,
while Roger Mark, Leo Celi, and Abdullah Chahin provided much appreciated clinical expertise. In
particular, I thank them for their contributions to the work on sepsis and auto-immune disease which is
presented in the appendix of this thesis.
I want to thank the Charles Stark Draper Laboratory, and in particular the amazing staff in the Education
Office, for the Draper Lab Fellow appointment that made my research possible.
I must also thank all of the people that provided me with support at MIT; in particular, the directors of the
Operations Research Center, Dimitris Bertsimas and Patrick Jaillet, and the amazing administrative staff,
Laura Rose and Andrew Carvalho. Thank you so much for welcoming me into such a great program and
helping me along the way. I also am deeply grateful to the other ORC students; in particular, Mapi, Dan,
Kevin, Jack, Zeb, and Will for their camaraderie during first-year classes when we were all just trying to
get our bearings at MIT.
I am lucky to have an amazing and supportive family, and I thank my parents and brothers for keeping me
sane, particularly while I was juggling classes, research, and wedding planning in my first year. I thank
the wine night crew for acting as a constant sounding board and showing me how quickly lifelong
friendships can form.
Finally, I want to thank my husband, Sureel Sheth, for his unconditional love and support even in the
most stressful times. I cannot wait to set off on our next adventure together.
Table of Contents
1 Introduction
   1.1 Motivation
   1.2 Existing mortality prediction models
   1.3 Research objectives
   1.4 Thesis organization
2 Data
   2.1 MIMIC II database
   2.2 Variables for analysis
      2.2.1 Temporal features
   2.3 Populations for analysis
      2.3.1 Sepsis
3 Methodology
   3.1 Commonly used classification techniques
      3.1.1 Preprocessing approach
   3.2 Univariate Flagging Algorithm (UFA)
      3.2.1 Identifying optimal cut points
      3.2.2 Formal statement of algorithm
      3.2.3 Procedure to find optimal threshold
      3.2.4 UFA-based classifiers
      3.2.5 Example of UFA system: Iris dataset
   3.3 Evaluation of classifier performance
4 Results
   4.1 Challenges of clinical data
      4.1.1 Many clinical variables are asymmetric and long-tailed
      4.1.2 List of variables associated with mortality varies throughout ICU stay
      4.1.3 Clinical variables are often strongly correlated
      4.1.4 Most patients have missing data
      4.1.5 Summary of key insights
   4.2 Predictive performance of commonly used classification techniques
      4.2.1 Random forest performs well for predicting mortality
      4.2.2 Logistic regression performs competitively after preprocessing
      4.2.3 Best regression model contains just five variables
      4.2.4 Best classifiers outperform SOFA score by two days
      4.2.5 Little observed effect of temporal variables
      4.2.6 Summary of key insights
   4.3 Predictive performance of UFA system
      4.3.1 Automated thresholds align with subject matter expertise
      4.3.2 UFA-based classifiers significantly outperform SOFA score
      4.3.3 Summary of key insights
   4.4 Practical advantages of UFA system
      4.4.1 N-UFA classifier is robust to noisy and missing data
      4.4.2 UFA system generalizes well to other critical care populations
      4.4.3 N-UFA classifier maximizes sensitivity for low targets
      4.4.4 Summary of key insights
5 Discussion
6 Future research
   6.1 UFA
      6.1.1 Multiple testing problem
      6.1.2 Threshold uncertainty
      6.1.3 Multivariate approach
   6.2 Temporal features in critical care
      6.2.1 Identifying subsets of patients where trend is important
      6.2.2 Characterizing uncertainty through filtering
      6.2.3 Learning patient specific state parameters
7 Conclusion
References
Appendix A
Appendix B
List of Figures
Figure 1.1: Schematic for overall research approach
Figure 2.1: Schematic of MIMIC II data collection and database construction
Figure 3.1: Body temperature for adult sepsis patients
Figure 3.2: Number of high mortality and low mortality flags for adult sepsis patients
Figure 3.3: Petal lengths for different species of Iris
Figure 3.4: Automated thresholds for Iris versicolor and Iris virginica
Figure 3.5: Example of Iris classification using N-UFA
Figure 3.6: Example of confusion matrix
Figure 4.1: Box plots for SOFA and urine output stratified by in-hospital mortality
Figure 4.2: Number of variables with a mean p-value less than 0.05, by day
Figure 4.3: Pairwise correlation coefficients for SOFA variables
Figure 4.4: Test-set performance metrics for classifiers, full data
Figure 4.5: Random forest variable importance
Figure 4.6: Accuracy and AUROC before and after data preprocessing
Figure 4.7: Test-set performance metrics for classifiers, processed data
Figure 4.8: Test-set performance metrics for classifiers, balanced training data
Figure 4.9: AUROC for best classifiers and SOFA across time
Figure 4.10: Comparison of classifier performance with and without temporal features
Figure 4.11: Example UFA thresholds for adult sepsis patients
Figure 4.12: Number of high mortality and low mortality flags for adult sepsis patients
Figure 4.13: Accuracy and AUROC, full MIMIC population
Figure 4.14: AUROC for UFA-based classifiers and SOFA across time, full MIMIC population
Figure 4.15: Sensitivity and specificity, full MIMIC population
Figure 5.1: Visualization of number of flags classifier, sepsis subpopulation
Figure 6.1: Bootstrapped thresholds for low body temperature in adult sepsis patients
Figure 6.2: Examples of different cost penalties for two-dimensional flagging
Figure 6.3: Optimal cost penalties for two-dimensional flagging (1.43:1)
Figure 6.4: Trends in SOFA score by day and mortality status
Figure 6.5: Parallel coordinates plot summarizing SOFA across time
Figure 6.6: Parallel coordinates plot summarizing SOFA across time (SOFA of 5-12)
Figure 6.7: Example SIR results for a representative patient
Figure 6.8: Example of Hamilton regime switching model, CVP
Figure 6.9: Example of Hamilton regime switching model, TVTP vs. CVP
List of Tables
Table 1.1: Comparison of APACHE IV, SAPS 3, and SOFA
Table 2.1: List of available explanatory variables used from MIMIC II database
Table 2.2: List of engineered temporal features
Table 2.3: Most common primary diagnoses in the MIMIC II database
Table 3.1: List of variables for specification of UFA algorithm
Table 3.2: List of thresholds for Iris dataset, selected based on maximum absolute z-statistic
Table 4.1: Relationship between clinical variables and mortality, average p-value by day
Table 4.2: Summary of missing data by variable
Table 4.3: Summary of missing data by patient
Table 4.4: Final variable list after preprocessing
Table 4.5: Top three best subset regression models
Table 4.6: Test-set performance of best classifiers compared to SOFA score
Table 4.7: Top 20 most significant UFA thresholds
Table 4.8: Test-set performance for UFA-based classifiers (accuracy and AUROC)
Table 4.9: Test-set performance for UFA-based classifiers (sensitivity and specificity)
Table 4.10: Comparison of different classifiers with varying amounts of missing data
Table 4.11: Comparison of different classifiers with varying amounts of imprecise data
Table 4.12: Comparison of data-driven thresholds across different subpopulations
Table 4.13: AUROC for AMI and lung disease subpopulations
Table 4.14: Sensitivity for AMI and lung disease subpopulations
Table 4.15: Comparison of results for balanced and unbalanced data, AMI subpopulation
Table 6.1: Patient clusters based on static and trend SOFA data
Table 6.2: Test-set performance of classifiers based on regime switching model
Table B.1: All MIMIC II variables used for analysis, p-values by day
Table B.2: All significant UFA thresholds, adult sepsis patients
Table B.3: Comparison of classifiers with varying amounts of missing data, confidence intervals
Table B.4: Comparison of classifiers with varying amounts of imprecise data, confidence intervals
Chapter 1
Introduction
In this thesis, we investigate different approaches to predicting mortality for critical care patients. Using
over 200 variables to characterize a patient’s stay, we compare a variety of different predictive modeling
techniques. We consider commonly used linear and non-linear classification methods such as regression,
support vector machine (SVM), and random forest, as well as a novel approach called the Univariate
Flagging Algorithm (UFA).
Through our analysis, we identify key predictors of mortality for critical care patients and show that our
classifiers can outperform current mortality prediction models by as much as two days. We focus in
particular on UFA, as it easily scales to a large number of variables, is easy to interpret, and is robust to
noise and missing data. We believe UFA is particularly suited to critical care applications and could be a
valuable step toward individual patient survival estimates.
1.1 Motivation
Predicting outcomes for critically ill patients is a topic of considerable interest. Risk-adjusted mortality is
the most commonly used measure of quality in critical care [1, 2], and good predictive models are needed
to benchmark different physicians, facilities, or patient populations. Patient severity scores are frequently
used in medical research as a potential confounder or to balance treatment and control groups [2]. On an
individual patient level, prognostic models can be used to determine courses of treatment or to
communicate with the patient’s family about likely outcomes [3, 4].
Though several widely used models already exist, they have limitations. Many are
intentionally simple and rely on only a small number of variables measured at one point in time to
minimize the burden of data collection [4]. However, as electronic medical records become increasingly
widespread, the need for manual data extraction and manual score calculation will presumably be
eliminated, permitting more complex approaches. Many of the existing models also use a regression
framework, which ultimately limits the number of variables that can be considered and typically assumes
a relationship between the explanatory variables and the outcome that is linear in the coefficients [5]. In
this thesis, we move beyond regression and consider other modeling techniques that have the potential to
make more accurate and timely predictions.
1.2 Existing mortality prediction models
In this section, we discuss three commonly used mortality prediction models: the Acute Physiology and
Chronic Health Evaluation (APACHE) [6,7], the Simplified Acute Physiology Score (SAPS) [8,9], and
the Sequential Organ Failure Assessment (SOFA) score [10]. We outline their similarities and
differences, and identify possible areas for improvement.
APACHE and SAPS are both designed to use information available at ICU admission to predict patient
outcomes. They were both developed in the 1980s and quickly adopted in intensive care [4]. They have
undergone updates since their inception to use larger training datasets, incorporate more sophisticated
statistical techniques, and recalibrate. The most recent versions are APACHE IV and SAPS 3, which
were both developed in the last decade [7, 9].
SOFA, on the other hand, was designed to evaluate a patient throughout the ICU stay. It assigns a score of
0 (normal) to 4 (very abnormal) for six different organ systems on each day in the ICU [10]. Unlike
APACHE and SAPS, SOFA was originally intended to characterize patient morbidity as opposed to
predict patient mortality; however, since its development, it has often been used for the latter purpose
[11].
Table 1.1 compares the three systems along a variety of dimensions, including the data collection
window, required variables, and methodological approach. APACHE IV contains the largest number of
features, including the most physiological variables and 116 different acute diagnoses [1]. SAPS 3 and
SOFA on the other hand both contain relatively few variables. The creators of SOFA specifically cited the
fact that it is simple and easy to calculate as advantages of the approach [10]. Proponents of SAPS 3
similarly argue that a small number of required variables minimizes complexity and encourages routine
use [4].
In all three systems, the decision of which variables to include relied, at least in part, on clinical expertise
and domain knowledge [6, 9, 10]. In the case of SOFA, the entire system was designed through clinical
consensus. This highlights one possible area for improvement in mortality modeling, through greater
automation and data-driven variable selection methods.
Another possible area for improvement is exploring new modeling techniques for mortality prediction. In
Table 1.1, we see that both APACHE IV and SAPS 3 rely on regression based models. In recent years,
however, there has been discussion about whether nonlinear machine learning approaches such as random
forest may provide better performance [1].
Table 1.1: Comparison of APACHE IV, SAPS 3, and SOFA

APACHE IV
   Data collection window: First 24 hours
   Required variables: Physiologic data (n=17), admission diagnoses (n=116), comorbidities (n=6), age, hospital location and LOS prior to ICU admission, emergency surgery, thrombolytic therapy, mechanical ventilation
   Use of temporal variables: No
   Method for variable selection and weighting: Regression, clinical expertise, and previous knowledge
   Method for mortality prediction: Regression

SAPS 3
   Data collection window: Admission ± 1 hour
   Required variables: Physiologic data (n=10), acute diagnosis and anatomical site of surgeries (n=15), comorbidities (n=6), age, hospital location and LOS prior to ICU admission, vasopressor use, type of admission, infection
   Use of temporal variables: No
   Method for variable selection and weighting: Regression, clinical expertise, and definitions from other scoring systems
   Method for mortality prediction: Regression

SOFA
   Data collection window: Daily
   Required variables: Physiologic data (n=6), mechanical ventilation, use of vasopressors
   Use of temporal variables: No
   Method for variable selection and weighting: Clinical consensus
   Method for mortality prediction: n/a
Finally, none of these systems utilize temporal variables, though several studies have shown that use of
daily information can improve predictions of patient prognosis [12]. Using SOFA score as an example,
the maximum daily SOFA score and delta SOFA (defined as maximum score minus admission score)
have good correlation with outcomes for patients in the ICU for two or more days. Ferreira et al. showed
that mean SOFA and maximum SOFA for the ICU stay are better indicators of prognosis than the initial
score; they also established that an increase in SOFA score over the first 48 hours is a strong predictor of
mortality [13]. In 2014, Andrew Zimmerman, one of the researchers who developed APACHE, said “it is
inconceivable that simple models could be effectively used for predicting individual patients’ outcomes”
and asserted that use of daily ICU information will be necessary to develop models powerful enough for
individual prognosis [1].
1.3 Research objectives
The objective of this thesis is to build a mortality prediction system that can outperform current
approaches. We aim to improve current methodologies in two ways:
1. By incorporating a wider range of variables, including time-dependent features
2. By exploring different predictive modeling techniques, including non-linear approaches such as
random forest or our newly designed Univariate Flagging Algorithm (UFA)
Figure 1.1 provides a schematic of the overall approach. During the data extraction and feature
engineering phase, we process a large number of variables, including time-dependent features, to include
in our analysis (red box). Next, we explore the data to understand its unique characteristics and challenges
(blue box). In this research, we use three different approaches to address the observed challenges.
Figure 1.1: Schematic for overall research approach
First, we experiment with a number of commonly used machine learning classification techniques and
compare their performance for the task of mortality prediction. In Section 4.2.1, we identify random
forest as the most promising approach. Second, we attempt to clean the data through additional
preprocessing and variable selection. In Sections 4.2.2 and 4.2.3, we show that regression methods can
work well, if care is taken to properly customize the model.
The third and final approach is to design a classification system to account for the unique challenges of
clinical data -- the Univariate Flagging Algorithm (UFA). In Sections 4.3 and 4.4, we demonstrate that
this system has the following characteristics:
• Strong predictive performance
• Scalable to a large number of variables, including more variables than observations
• Robust to missing and noisy data
• Easily customizable to different patient populations, care centers, or targets
• Ability to predict rare events
• Interpretable
In this thesis, we compare UFA to random forest and best subset regression in terms of overall predictive
performance. We also evaluate how all three methods perform relative to the well-established SOFA
score, a daily measure of patient morbidity.
1.4 Thesis organization
Chapter 2 describes the data and defines the populations used for analysis. Chapter 3 outlines the
methodology. It provides an overview of the four commonly used classification techniques employed in
this analysis: logistic regression, random forest, decision tree, and SVM. It also describes our novel
classifier, the UFA system, and provides an example using the well-known Iris dataset. Chapter 4
summarizes the results. It compares the predictive performance of a variety of mortality prediction
models, including the UFA system. It also discusses several practical advantages of UFA, such as its
ability to easily generalize to new populations. Chapter 5 is a high-level discussion of UFA, explaining
how it could be used in practice. Finally, Chapter 6 outlines several possibilities for future research and
Chapter 7 concludes the thesis.
Chapter 2
Data
This chapter describes the data used in this analysis. Section 2.1 provides an overview of the database,
Section 2.2 discusses variable extraction and the creation of temporal features, and Section 2.3 defines our
analysis populations.
2.1 MIMIC II database
The publicly available Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC II) database,
version 2.6, contains de-identified clinical data for nearly 30,000 adult ICU stays at Beth Israel Deaconess
Medical Center (BIDMC) in Boston, MA from 2001 to 2008 [14]. It was jointly developed by the
Massachusetts Institute of Technology (MIT), Philips Healthcare, and BIDMC, and any researcher who
adheres to the data use requirements is permitted to use the database. Its creation and use for research was
approved by the institutional review boards of both BIDMC and MIT (IRB protocol 2001-P-001699/3).
Figure 2.1, which comes from the MIMIC II user guide, illustrates the data acquisition process. As is
evident in the figure, data comes from a variety of different sources including bedside monitors, the
clinical record, administrative data, and external sources such as the Social Security Death Index. The
figure also highlights the variety of patient data in MIMIC II.
Figure 2.1: Schematic of MIMIC II data collection and database construction
2.2 Variables for analysis
For this study, we processed over 200 explanatory variables from the MIMIC II database covering the
first four days of the patient’s stay. Certain variables were time invariant, such as demographics or
severity scores measured at admission. Much of the data, however, was measured daily or hourly,
including vital signs, measures of fluid balance, diagnostic laboratory test results, and use of services such
as imaging, ventilation, and vasopressors. Table 2.1 contains a list of the variables considered. While
most of these variables were pulled directly from the MIMIC II database, some such as shock index
(calculated as heart rate divided by systolic blood pressure) were constructed to measure known
interaction effects.
Table 2.1: List of available explanatory variables used from MIMIC II database
Time-Invariant (N=9): Age; Gender; Race/ethnicity; Weight; SAPS at admission; SOFA at admission; Number of comorbidities; Number of previous ICU stays; ICU care unit

Day-Level (N=38): SOFA; SAPS; Urine output; Fluid balance; Use of ventilation; Use of vasopressors; Use of imaging: x-ray, MRI, CT, echo; Do not resuscitate status; Comfort measures only status; Number of tests and average daily value for: creatinine, sodium, BUN, chloride, bicarbonate, glucose, magnesium, calcium, phosphate, hematocrit, hemoglobin, white blood count, platelets

Hourly-Level (N=5): Heart rate; Respiratory rate; Mean arterial blood pressure; Shock index; Temperature
The primary outcome of interest throughout the study is in-hospital mortality, as recorded in the MIMIC
II clinical record. We also kept track of the International Classification of Diseases, 9th revision, Clinical
Modification (ICD-9-CM) codes for each hospitalization. As detailed in Section 2.3, these codes were
used to identify disease-specific populations for analysis.
2.2.1 Temporal features
For each of the day-level and hourly-level variables in the analysis, we engineered a series of temporal
variables to measure patient dynamics. Table 2.2 lists the different features that were created.
Table 2.2: List of engineered temporal features
Day-level indicator (N=8), e.g., use of ventilation:
   Ever used to date; Number of days to date

Day-level continuous (N=30), e.g., daily SOFA score:
   Minimum to date; Average to date; Maximum to date; Range in values to date; One-day trend; Two-day trend; Three-day trend

Hourly-level continuous (N=5), e.g., heart rate:
   Daily minimum; Daily maximum; Daily average; Daily standard deviation; Minimum to date; Average to date; Maximum to date; Range in values to date; One-day trend; Two-day trend; Three-day trend
For each of the eight indicator variables in the analysis, such as use of ventilation or vasopressors, we
counted the number of days in which the patient had the service and also documented whether they ever
had the service.
For the continuous variables in the analysis, the first step was to roll all of our data up to the day-level.
For the five hourly-level variables, we summarized the minimum, maximum, average, and standard
deviation of the hourly values for each day. Next, for both the day-level and hourly-level variables, we
calculated the minimum, maximum, average, and range of values over the first four days of the patient’s
stay. We also looked at the one- to three-day trends. The one-day trend was calculated by dividing the
current day's average by the previous day's average, the two-day trend by dividing the current day's
average by the average from two days earlier, and the three-day trend analogously.
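For concreteness, a minimal pandas sketch of this feature construction follows; the long-format layout and the column names patient_id, day, and value are illustrative assumptions, not the thesis's actual schema.

```python
import pandas as pd

def temporal_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Cumulative and trend features from daily averages (a sketch).
    `daily` has columns patient_id, day, value; names are illustrative."""
    daily = daily.sort_values(["patient_id", "day"])
    g = daily.groupby("patient_id")["value"]
    out = daily.copy()
    out["min_to_date"] = g.cummin()
    out["max_to_date"] = g.cummax()
    out["avg_to_date"] = g.expanding().mean().reset_index(level=0, drop=True)
    out["range_to_date"] = out["max_to_date"] - out["min_to_date"]
    # k-day trend: current day's average divided by the average k days earlier
    for k in (1, 2, 3):
        out[f"trend_{k}d"] = out["value"] / g.shift(k)
    return out
```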
2.3 Populations for analysis
For this analysis, we considered non-elective, adult ICU stays of four or more days in MIMIC II. The
number of stays meeting these criteria totaled 5,378 with an in-hospital mortality rate of 21.2%.
With outcome modeling, there is an inherent trade-off between building a general prediction model that is
widely applicable and a specialized model that takes into account particular features of a disease, patient
population, or care facility. Various studies have shown that the most widely used models do not
generalize well to new populations, and independent research suggests regular updates and customization
for best performance [4, 15].
To measure the extent to which the classifiers presented in this thesis are customizable, we test them on a
variety of disease-specific subgroups, in addition to the full MIMIC population. Table 2.3 displays the ten
most common primary diagnoses in our data. The stays column counts the number of ICU stays of four or
more days associated with that diagnosis, while mortality refers to the in-hospital mortality rate for that
cohort.
Table 2.3: Most common primary diagnoses in the MIMIC II database
ICD9   Description                                         Stays   Mortality
All    Full MIMIC population                               5,378   21.2%
038    Sepsis                                                512   30.9%
410    Acute myocardial infarction                           496   14.7%
518    Lung disease                                          402   27.4%
414    Chronic ischemic heart disease                        240    3.3%
428    Heart disease                                         200   20.5%
996    Complications from specified procedures               164   24.4%
431    Intracerebral hemorrhage                              147   31.3%
430    Subarachnoid hemorrhage                               125   19.2%
852    Subarachnoid, subdural, and extradural hemorrhage     109   23.9%
441    Aortic aneurysm and dissection                        108   15.7%
In this thesis, in addition to the full MIMIC population, we focus on the three most common diagnoses in
the dataset -- sepsis, acute myocardial infarction (AMI), and lung disease -- as
they provide us with the largest possible sample size. Across the three subpopulations, we also have a
range of different mortality rates from 14.7% to 30.9%.
We decided to use sepsis patients, the largest of the three disease-based cohorts, to build and test our
mortality prediction models. In addition to being the most common diagnosis in the dataset, sepsis is the
10th leading cause of death in the United States and has an estimated annual economic burden of $16.7
billion, making it an important cohort for analysis [16]. Additional discussion of the sepsis cohort is
available in Section 2.3.1.
We use the other two disease-based subpopulations, AMI and lung disease, along with the full MIMIC
population to test the degree to which our models generalize to new patient cohorts. We hypothesize that
one advantage of a fully automated system such as UFA will be that it is easy to customize for both
general and specialized applications.
For the purposes of this thesis, we only consider disease-based subpopulations based on the primary
diagnosis. As a side analysis, however, we considered all patients with a primary diagnosis of sepsis and
analyzed secondary diagnosis at the time of admission. In some cases, we observed significant
discrepancies in mortality. One of the most striking results was the 30-day mortality rate for septic
patients with rheumatoid arthritis (RA), an auto-immune disorder. At 29.0%, it was 21 percentage points
lower than the 30-day mortality rate for all severe sepsis patients (defined as sepsis with persistent
hypotension), and the observed difference was statistically significant even after controlling for patient
demographics and disease severity. The observed protective effect also extended to other auto-immune
patients. While these findings are outside the scope of this thesis, they are presented in their entirety in
Appendix A.
2.3.1 Sepsis
Sepsis is a systemic inflammatory response to an infection that can lead to organ failure and death. Severe
sepsis accounts for around 10% of all intensive care unit (ICU) admissions in the United States, and
mortality rates are commonly reported between 28% and 50% [17, 18].
The consensus clinical definition of sepsis was established in 1992 by the American College of Chest
Physicians (ACCP) and the Society of Critical Care Medicine (SCCM). ACCP/SCCM identified the
systemic inflammatory response syndrome (SIRS) criteria and defined sepsis as the presence of at least two
SIRS criteria caused by known or suspected infection. The SIRS criteria are:
• Core body temperature above 38°C or below 36°C
• Heart rate above 90 beats per minute
• Respiration rate more than 20 breaths per minute
• White blood count above 12,000/µl, less than 4,000/µl, or more than 10% immature forms
They defined severe sepsis as sepsis with the addition of acute organ dysfunction [19].
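The SIRS count translates directly into code; the sketch below assumes units of °C for temperature and thousands per microliter for white blood count, and the argument names are ours.

```python
def sirs_count(temp_c, heart_rate, resp_rate, wbc_k_per_ul, pct_immature):
    """Number of SIRS criteria met (ACCP/SCCM, 1992)."""
    criteria = [
        temp_c > 38.0 or temp_c < 36.0,   # core body temperature
        heart_rate > 90,                  # beats per minute
        resp_rate > 20,                   # breaths per minute
        wbc_k_per_ul > 12.0 or wbc_k_per_ul < 4.0 or pct_immature > 10.0,
    ]
    return sum(criteria)

def meets_sepsis_definition(n_sirs, suspected_infection):
    """Sepsis: at least two SIRS criteria caused by known or suspected
    infection; the infection flag must come from clinical review."""
    return n_sirs >= 2 and suspected_infection
```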
While there is a consistent clinical definition of sepsis, there is no single method for identifying sepsis
patients in administrative health data. For the purpose of this thesis, we use the ICD-9-CM code 038, which
indicates septicemia. It has been shown that sepsis can be confirmed in 88.9% (95% confidence
interval: 81.6-96.2%) of patients with a 038 code in their discharge records [20].
We decided to use the 038 diagnosis over another definition of sepsis that is sometimes used with
administrative health data, called the Angus criteria [21]. Angus requires the presence of two ICD-9-CM
diagnosis codes, one for infection and another for organ dysfunction. However, it has been suggested that
sepsis defined by the Angus criteria overestimates the incidence of severe sepsis by a factor of two to four
[20].
Our data confirm that the Angus criteria are a broader definition of sepsis than the 038 diagnosis. We
identified 9,066 hospitalizations in the MIMIC II database that met the Angus criteria, nearly three times
the number for 038. For this analysis, we wanted to focus on a subset of patients who have a high
likelihood of true sepsis; therefore, we ultimately decided that the 038 definition was most appropriate for
our application.
Chapter 3
Methodology
This chapter summarizes our methodology for building and evaluating mortality prediction models.
Section 3.1 describes a number of commonly used classification techniques that can be used for outcome
prediction and summarizes the advantages and disadvantages of different methods. Section 3.2 provides a
formal specification for UFA, a novel methodology that we designed to deal with some of the specific
challenges posed by clinical data. Finally, Section 3.3 discusses our methods for evaluating these different
classifiers and comparing their predictive performance.
3.1 Commonly used classification techniques
At its root, mortality prediction is a classification task. The objective is to use a number of explanatory
variables, such as physiologic data, to classify patients into two groups: those who live and those who die.
A number of methods exist to do binary classification. In this section, we will introduce four approaches
used throughout the thesis: logistic regression, support vector machine (SVM), decision tree, and random
forest.
Perhaps the most common method currently used in mortality prediction is logistic regression. Logistic
regression is a linear method that models the posterior probabilities of the two classes as a logistic
function of the explanatory variables [22].
Logistic regression is usually fit by maximum likelihood, so it is prone to overfitting if the number of
variables is of the same order as the number of observations. Logistic regression also requires little to no
multicollinearity between the explanatory variables. For these reasons, it is often advantageous to find a
subset of the variables that are sufficient to explain the observed outcome, as opposed to using all of the
variables that are available [5]. One way to do this is through best subset regression, which searches for
the best combination of k variables according to some criterion. In this thesis, we use the Bayesian
information criterion (BIC).
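As a sketch of this search, the brute-force enumeration below scores every k-variable logistic regression by BIC; with 75 candidate variables and k = 5 this is combinatorially expensive, so in practice a branch-and-bound or heuristic search would stand in -- the thesis does not specify its implementation. The function name and the use of statsmodels are our assumptions.

```python
from itertools import combinations

import numpy as np
import statsmodels.api as sm

def best_subset_logit(X, y, k):
    """Return (BIC, columns) of the k-variable logistic regression
    with the lowest BIC; brute-force enumeration for illustration."""
    best_bic, best_cols = np.inf, None
    for cols in combinations(X.columns, k):
        fit = sm.Logit(y, sm.add_constant(X[list(cols)])).fit(disp=0)
        if fit.bic < best_bic:
            best_bic, best_cols = fit.bic, cols
    return best_bic, best_cols
```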
We also consider several non-regression based methods in this thesis. SVM is a linear classifier that can
implement non-linear class boundaries by transforming the feature space. It generally has strong
predictive power [5] and is resistant to overfitting, since there is little flexibility in the decision boundary
[22]. The drawbacks are that it can be difficult to scale computationally, it tends to be difficult to
interpret, and it does not deal well with outliers and irrelevant information, which can be prevalent in
clinical data [5].
A decision tree is a non-linear method that classifies data using a series of univariate partitions [22]. Trees
are frequently depicted using a tree-like graph which makes them fairly easy to interpret. As opposed to
SVM, they also tend to be easy to compute and are generally insensitive to outliers [5].
While decision trees tend to have low bias, they often have high variance, which can lead to unreliable
predictive performance. Random forest addresses this issue by creating a large number of decision trees
(in this thesis, we use 100) and essentially averaging the results [5]. This generally leads to better
predictive performance, though some of the interpretability of a single tree is lost.
3.1.1 Preprocessing approach
In Section 2.2, we describe our process for extracting over 200 variables from the MIMIC II database to
characterize the first four days of each patient’s ICU stay. Through exploration of the data, we learned
that many of the explanatory variables are asymmetric and long tailed, that there are a number of possible
outliers, and that many variables are highly correlated. In addition to the large number of features, all of
these aspects of the data make it challenging to analyze. One way to deal with these challenges is to
preprocess the data to remove variables that are possibly irrelevant or highly correlated with other
features.
This section describes our methodology to do variable selection for the MIMIC II data, in the hopes of
improving the predictive performance of our mortality prediction models. The process has two main
steps. First, we limited our variables to those that are individually associated with in-hospital mortality.
Second, we removed variables that were highly correlated with one another.
For step one, we ran a two-sided t-test [23] for each numeric variable to test the following hypothesis,
where the two groups are patients who died in the hospital and patients who survived:

H_0: \mu_{died} = \mu_{survived}
H_1: \mu_{died} \neq \mu_{survived}
For the categorical variables, we tested the same hypothesis using a chi-squared approach [23]. For each
variable, we ran five distinct tests, each utilizing 4/5 of the training data. Then, we were able to generate
an interval of possible p-values for each feature.
These results allowed us to formally explore the relationship between individual variables and in-hospital
mortality, as well as quantify the variability in this relationship. Moreover, by comparing the mean
p-value on days one through four of the ICU stay, we were able to determine how the individual predictive
We also used the results of this analysis to do variable selection. For many of the standard classifiers that
we considered in our analysis, using a very large number of variables is problematic. For example, with
regression, the number of variables is limited by the degrees of freedom. Therefore, for certain analyses in
this thesis, we subset our variables to those with an average p-value below 0.05 on day 4 of the ICU stay. It is
likely that this selection criterion includes some variables that do not truly have a significant relationship
with mortality, due to multiple testing which is known to inflate the type I error rate [5]. We were
comfortable with the possibility of false positives, however, as this is only the first stage of our variable
selection process. In later stages, we will use additional methods such as best subset regression to further
limit our list of features. Then, we verify those models on previously unseen data, as shown in Section
4.2.3, to ensure that the observed relationship between the explanatory variables and mortality is
generalizable.
Step two of our preprocessing approach was to further restrict our list of variables by considering the
pairwise correlation coefficients [23]. Unsurprisingly, many of the variables in our analysis are highly
correlated. For example, average arterial blood pressure is correlated with both the maximum and
minimum values, with pairwise correlation coefficients between 0.7 and 0.8. We considered all pairs of
variables with a correlation coefficient greater than 0.6 or less than -0.6, and removed variables that were
not providing new information.
These two preprocessing steps allowed us to narrow our total list of variables from more than 200 to just
75. The results of the individual t-tests and the final list of variables are presented in Section 4.1.
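A sketch of the two-step screen follows. We implement the five 4/5-sized subsets as the training sides of a five-fold split, use Welch's t-test (the thesis does not state the variance assumption), and resolve correlated pairs greedily, whereas the thesis judged which member of each pair carried new information.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import KFold

def univariate_screen(X, y, alpha=0.05, seed=0):
    """Step one: keep numeric columns whose mean t-test p-value over
    five 4/5-sized training subsets is below alpha."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    keep = []
    for col in X.columns:
        pvals = []
        for sub, _ in kf.split(X):  # sub is each fold's 4/5 training side
            xs, ys = X[col].to_numpy()[sub], y.to_numpy()[sub]
            _, p = stats.ttest_ind(xs[ys == 1], xs[ys == 0], equal_var=False)
            pvals.append(p)
        if np.mean(pvals) < alpha:
            keep.append(col)
    return keep

def correlation_filter(X, r=0.6):
    """Step two: drop one column from each pair with |corr| > r
    (done greedily here; the thesis chose which member to keep)."""
    corr = X.corr().abs()
    drop, cols = set(), list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in drop and b not in drop and corr.loc[a, b] > r:
                drop.add(b)
    return [c for c in X.columns if c not in drop]
```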
3.2 Univariate Flagging Algorithm (UFA)
In many data classification problems, there is no linear relationship between the explanatory and
dependent variables. Instead, there may be ranges of the input variable for which the observed outcome is
significantly more or less likely. In clinical decision making, for example, doctors identify ranges of
laboratory test values that may indicate a patient's higher risk of developing or having a disease [24, 25].
This also arises in other fields; in earth science, for example, rainfall thresholds can be used to
develop early warning systems for landslides or flooding [26, 27].
In this section, we describe a new method for identifying such thresholds in an automated fashion called
the Univariate Flagging Algorithm (UFA) [28, 29]. We also describe two UFA-based classifiers that
combine the UFA thresholds to predict mortality for previously unseen patients. These methods were
specifically designed to address many of the challenges of clinical data without preprocessing and to
provide an alternative to the commonly used techniques from Section 3.1.
3.2.1 Identifying optimal cut points
Many classifiers, such as decision trees or support vector machines (SVM) [5], are designed to find
“optimal” cutpoints, typically defined as cutpoints that minimize some measure of node impurity. Such
measures include misclassification rate, Gini index, or entropy/information gain [5, 22]. Supervised
clustering works similarly, minimizing impurity while adding a penalty for the total number of clusters
[30]. Alternatively, Williams et al. (2006) put forth a minimum p-value approach for finding optimal
cutpoints for binary classification. Their algorithm uses a chi-squared test to find the cutpoint that
maximizes the difference in outcomes on one side of the cut and the other [25].
These approaches are similar in that they consider the entire input space, both false positives and false
negatives, to select the optimal cutpoint. In certain applications, however, one may only care about a
subspace. Under certain conditions, it might be important to identify separation thresholds that are
associated with a high prevalence of the target, while the overall solution is not optimized. Examples
include medical conditions where values outside clinically defined thresholds are associated with high
mortality, while more normal values do not provide much information.
For example, in sepsis patients, low body temperature is associated with illness severity and death [31].
Figure 3.1 displays average body temperature for the MIMIC II sepsis population, with an overall death rate
of 30.9%. Patients who died are denoted in red, while patients who survived are denoted in blue.
International guidelines for sepsis management define low body temperature as 36°C [32]. Below this
threshold, Figure 3.1 shows that sepsis patients die at a rate of 57.1%, nearly twice the overall death rate.
Above this threshold, one can say very little.
Figure 3.1: Body temperature for adult sepsis patients
We are interested in identifying such thresholds in an automated fashion. In the decision tree or SVM
framework, this can be accomplished by associating different costs with false positives and false
negatives, which will shift the “optimal” cutoff point accordingly [5, 22]. In practice, however, it is often
difficult to quantify the costs associated with different types of errors, particularly in the medical domain.
Friedman & Fisher’s (1999) Patient Rule Induction Method (PRIM) procedure finds rectangular
subregions of the feature space that are associated with a high (or low) likelihood of the outcome. The
subregions are then slowly made smaller, each time increasing (or decreasing) the rate of the outcome
[33]. With this method and others like it, there is an inherent trade-off between the number of data points
within the subregion (the support) and the proportion of the data points that are associated with the
outcome (purity), where smaller supports generally have higher purity. With PRIM, the user is
responsible for defining the “optimal” subregion, by specifying the preferred trade-off for the application.
While this may work well in some situations, identifying the appropriate trade-off can be challenging,
suggesting the need for an algorithm that requires less user input.
We put forth a new threshold detection algorithm called UFA. UFA optimizes over subregions of the
input space, but performs the trade-off between support and purity automatically. In this thesis, we will
show that UFA can identify the existence of thresholds for individual variables and that they align with
visual inspection and thresholds established by subject matter experts. We also demonstrate that these
thresholds can be used to classify previously unseen test cases with performance equal to or better than
many standard classifiers, such as random forest and logistic regression.
3.2.2 Formal statement of algorithm
UFA is designed to identify an optimal cutpoint for a single explanatory variable, such that observations
outside that threshold are associated with a significantly higher or lower likelihood of the target. UFA
identifies up to two such thresholds, one below the median and one above the median. The algorithm is
intended for a binary target y (e.g., y ∈ {0, 1}) and a continuous explanatory variable x. At its most basic
level, UFA finds the value x^* that maximizes the difference between the outcome rate for observations that
fall outside x^* and a baseline rate, while maintaining a good level of support.
Table 3.1 defines the variables that we will use in the formal specification of the UFA algorithm. For the
purpose of formulation, we consider candidate thresholds below the median value of x.
Table 3.1: List of variables for specification of UFA algorithm

x          continuous explanatory variable
y          binary target (0/1)
x_i        candidate threshold for x
p_i        outcome rate among observations outside the candidate threshold x_i
n_i        number of observations outside the candidate threshold x_i
p_{IQR}    outcome rate among observations in the interquartile range of x
n_{IQR}    number of observations in the interquartile range of x
\bar{p}    weighted average of p_i and p_{IQR}
Z_i        test statistic for candidate threshold x_i
x^*        selected optimal threshold
For each x_i, we conduct the following hypothesis test to check for a significant difference in the outcome
rate below the threshold and the outcome rate in the interquartile range:

H_0: p_i - p_{IQR} = 0
H_1: p_i - p_{IQR} \neq 0        (1)

We are using a binomial proportion test [23] with test statistic Z_i:

Z_i = \frac{p_i - p_{IQR}}{\sqrt{\bar{p}\,(1 - \bar{p})\left(\frac{1}{n_i} + \frac{1}{n_{IQR}}\right)}}        (2)

where \bar{p} is the weighted average of the outcome rates, calculated:

\bar{p} = \frac{p_i n_i + p_{IQR} n_{IQR}}{n_i + n_{IQR}}        (3)

We define x^* as the candidate threshold x_i with the maximum Z_i in absolute value:

x^* = \arg\max_{x_i} |Z_i|        (4)

x^* provides an inherent trade-off between maximizing the support and maximizing (or equivalently,
minimizing) the outcome rate. The proposed measure does not provide an optimal separation in terms of
minimizing the overall misclassification rate, but is instead optimized toward finding areas enriched with
the target outcome. The same applies to finding areas with a specifically low rate of the target.
3.2.3 Procedure to find optimal threshold
For each variable x:

1. Generate a list of potential thresholds x_i between the median value of x and the minimum
   value of x, excluding those with low support, by dividing the range into segments.
   • For the purpose of this thesis, we excluded the five lowest values of x, assuming that
     thresholds with a support of less than five are of no interest.
   • Currently, we consider 50 segments of equal length.
2. Calculate Z_i as specified in equation (2). Define x^* according to equation (4).
3. Check x^* for statistical significance by comparing its Z-value to a chosen critical value. Keep
   the threshold if it is significant and discard it otherwise.
   • For the purpose of this thesis, we used a critical value of 2.576 to establish significance,
     which is associated with a p-value of 0.01. We address issues related to multiple testing
     by validating the thresholds on previously unseen data, as seen in Section 4.3.
Through this procedure, UFA finds the optimal threshold below the median for each variable x. The
procedure can then be repeated using candidate thresholds between the median value of x and the
maximum value of x, for a total of up to two significant thresholds for each variable.
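The sketch below implements this procedure for a single variable and one side of the median, following equations (2)-(4); the uniform candidate grid and the handling of the support floor are our reading of the text, not the thesis's actual code.

```python
import numpy as np

def ufa_threshold(x, y, side="below", n_segments=50, min_support=5, z_crit=2.576):
    """Scan candidate thresholds on one side of the median of x and
    return the cut point with the largest |Z| (eq. 2-4), if significant.
    A sketch of the procedure above, not the thesis's actual code."""
    x, y = np.asarray(x, float), np.asarray(y, int)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = (x >= q1) & (x <= q3)
    p_iqr, n_iqr = y[iqr].mean(), iqr.sum()
    xs = np.sort(x)
    # exclude candidates whose support would fall below min_support
    lo = xs[min_support] if side == "below" else med
    hi = med if side == "below" else xs[-(min_support + 1)]
    best = None  # (|Z|, threshold, Z)
    for t in np.linspace(lo, hi, n_segments):
        mask = x < t if side == "below" else x > t
        n_i = int(mask.sum())
        if n_i < min_support:
            continue
        p_i = y[mask].mean()
        p_bar = (p_i * n_i + p_iqr * n_iqr) / (n_i + n_iqr)       # eq. (3)
        se = np.sqrt(p_bar * (1 - p_bar) * (1 / n_i + 1 / n_iqr))
        if se == 0:
            continue
        z = (p_i - p_iqr) / se                                    # eq. (2)
        if best is None or abs(z) > best[0]:
            best = (abs(z), t, z)                                 # eq. (4)
    if best is not None and best[0] > z_crit:                     # step 3
        return {"threshold": best[1], "z": best[2], "side": side}
    return None
```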
3.2.4 UFA-based classifiers
UFA is designed to work with a single variable. We considered several approaches to combine single-variable information into a multi-dimensional classifier, and present two possibilities in this thesis. Both
create an indicator variable or “flag” for each significant threshold, which takes the value one if the data
point exceeds the threshold and zero otherwise.
The first classifier aggregates the number of “high risk” and “low risk” flags for each observation. Then, a
linear decision boundary is drawn to separate one class from the other along these two dimensions.
Throughout the thesis, this approach will be denoted as the Number of Flags algorithm (N-UFA).
Figure 3.2: Number of high mortality and low mortality flags for adult sepsis patients
Figure 3.2 shows an example of the N-UFA classifier’s performance in predicting adult sepsis patients’
mortality. For each patient, we counted the number of flags that are associated with a high likelihood of
mortality and the number of flags that are associated with a low likelihood of mortality; in this thesis,
each flag receives equal weight of one, though future research could investigate the impact of assigning
flags different weights. In Figure 3.2, the solid line represents the linear decision boundary that minimizes
the misclassification rate along these two dimensions.
The second UFA-based classifier presented in this thesis is a standard random forest model [5] that uses
the flags for each significant threshold as dummy inputs. We will denote this classifier RF-UFA.
Throughout the results section, we compare the predictive performance of the two UFA-based classifiers
to classifiers that rely on the original, continuous data. One could also consider methods that combine the
continuous data and the UFA flags, though they are not addressed in this thesis.
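A minimal sketch of the flagging step follows, assuming threshold records like those returned by the ufa_threshold sketch in Section 3.2.3, each extended with the name of its source column; for RF-UFA, the resulting 0/1 matrix is simply passed to a standard random forest.

```python
import numpy as np

def make_flags(X, thresholds):
    """0/1 flag matrix from significant UFA thresholds; `thresholds`
    holds dicts with keys column, threshold, side, z (illustrative)."""
    flags, risk = [], []
    for t in thresholds:
        col = X[t["column"]].to_numpy()
        hit = col < t["threshold"] if t["side"] == "below" else col > t["threshold"]
        flags.append(hit.astype(int))
        # positive Z: outcome rate outside the cut exceeds the baseline,
        # so the flag marks a high-risk region; negative Z marks low risk
        risk.append("high" if t["z"] > 0 else "low")
    return np.column_stack(flags), np.array(risk)

def n_ufa_features(flag_matrix, risk):
    """N-UFA's two features: (# high-risk flags, # low-risk flags);
    a linear decision boundary is then fit in this plane."""
    return (flag_matrix[:, risk == "high"].sum(axis=1),
            flag_matrix[:, risk == "low"].sum(axis=1))
```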
3.2.5 Example of UFA system: Iris dataset
In this section, we demonstrate the UFA system by applying it to the well-known Iris dataset [35]. A
classic in the machine learning discipline, it contains 50 observations for three different species of Iris.
One of the species, Iris setosa, is linearly separable while the other two species, Iris virginica and Iris
versicolor, are not. We ran UFA over this relatively straightforward dataset to ensure that it performed
comparably to other standard approaches.
Identifying optimal thresholds
We begin by using UFA to identify optimal thresholds for each variable in the Iris dataset. In some cases,
the optimal cutpoint is clear, such as the trivial case when the data is linearly separable. Figure 3.3
illustrates this trivial case. From the figure, it is clear that Iris setosa, denoted in red, is linearly separable
from the other two classes, which are denoted in blue. As expected, UFA is able to successfully identify a
threshold for a single variable, ‘petal length’, which separates the two classes from one another.
Next, we focus on identifying thresholds that separate the two remaining classes, Iris versicolor and Iris
virginica.
The Iris dataset contains four variables and, therefore, UFA searches for up to eight possible thresholds in
the data. Table 3.2 contains the optimal thresholds for each variable and shows that six of the eight are
significant. In Table 3.2, the automatic trade-off between purity and support inherent to UFA is apparent.
While a sepal width less than 2.4 identifies a subset of cases where 90% belong to the class versicolor, the
31
support (N=10) is not large enough to consider this variable in subsequent analysis. The other variables,
however, each have two significant thresholds, depicted in Figure 3.4. We see that three of the thresholds
are associated with a high likelihood of Iris versicolor, while three of the thresholds are associated with a
high likelihood of Iris virginica.
Figure 3.3: Petal lengths for different species of Iris
Table 3.2: List of thresholds for Iris dataset, selected based on maximum absolute z-statistic

Variable       Threshold    Value   N    % Versi.   ZStat   ZStat.Abs   Sig
Sepal.Length   Less Than    5.7     24   87.5%       3.4    3.4         1
Sepal.Width    Less Than    2.4     10   90.0%       2.3    2.3         0
Petal.Length   Less Than    4.7     45   97.8%       5.2    5.2         1
Petal.Width    Less Than    1.5     48   93.8%       4.4    4.4         1
Sepal.Length   More Than    7.0     12    0.0%      -3.0    3.0         1
Sepal.Width    More Than    3.2     10   20.0%      -1.7    1.7         0
Petal.Length   More Than    5.0     42    2.4%      -5.1    5.1         1
Petal.Width    More Than    1.7     46    2.2%      -5.9    5.9         1
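The thresholds in Table 3.2 can be reproduced in spirit with a simple scan over candidate cut-points for each variable and direction. The sketch below scores each cut-point with an approximate z-statistic comparing the class rate beyond the cut-point with the overall rate; the exact statistic and significance rules used by UFA are those formalized in Section 3.2, so this scan illustrates the search pattern rather than reproducing the table's values.

```python
import numpy as np

def best_threshold(x, y, direction="less"):
    """Scan candidate cut-points for one variable and return the one with the
    largest absolute z-statistic (a sketch of the UFA search, not the thesis
    implementation)."""
    p0 = y.mean()                        # overall rate of the target class
    best = (None, 0.0)
    for cut in np.unique(x):
        mask = x < cut if direction == "less" else x > cut
        n = mask.sum()
        if n < 2:                        # require a minimum of support
            continue
        p = y[mask].mean()               # target rate beyond the cut-point
        z = (p - p0) / np.sqrt(p0 * (1 - p0) / n)
        if abs(z) > abs(best[1]):
            best = (cut, z)
    return best

# Toy example: petal length separating versicolor (y=1) from virginica (y=0)
x = np.array([4.0, 4.2, 4.5, 4.6, 4.7, 5.0, 5.2, 5.5, 5.8, 6.0])
y = np.array([1,   1,   1,   1,   1,   0,   0,   0,   0,   0])
print(best_threshold(x, y, "less"))      # a cut near 5.0 with a positive z
```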
Classification
We convert the six significant thresholds into indicator variables (“flags”) which take the value one if the
data point falls outside the threshold and zero otherwise. Then, plugging these flags into a random forest
model (RF-UFA), we find that we can correctly classify 48 of 50 cases for both Iris versicolor and Iris
virginica.
Figure 3.4: Automated thresholds for Iris versicolor and Iris virginica
Iris versicolor is denoted in green and Iris virginica is denoted in blue.
We can achieve the same level of accuracy using N-UFA. First, we aggregate the number of Iris virginica
flags and the number of Iris versicolor flags for each instance in the dataset, as seen in Figure 3.5. The
first number in each cell represents the number of true virginica, while the second number represents the
number of true versicolor. The optimal linear decision boundary classifies all of the green cells as Iris
versicolor and all of the blue cells as Iris virginica; values in red represent errors. We see that N-UFA
correctly classifies 96 of the 100 cases. This performance is in line with the apparent error rate of other
classification algorithms that have been used on the Iris dataset [36, 37].
Figure 3.5: Example of Iris classification using N-UFA
Each cell shows counts as (true virginica, true versicolor)

                      Versicolor Flags
Vir. Flags     0         1         2         3
0              0, 1      1, 5      1, 21     0, 21
1              5, 2      4, 0      --        --
2              28, 0     --        --        --
3              11, 0     --        --        --
We can also use RF-UFA and N-UFA to predict the class of previously unseen cases. Using five-fold
cross validation, which is explained in Section 3.3, both UFA-based classifiers have 100% accuracy
separating Iris setosa from the other classes. Overall, N-UFA averages 94.7% accuracy across the five
folds, while RF-UFA achieves an average accuracy of 96.0%. Once again, this performance is consistent
with other commonly used classifiers.
3.3 Evaluation of classifier performance
In this thesis, we would like to compare N-UFA and RF-UFA to other commonly used models and
evaluate their relative abilities to predict in-hospital mortality for previously unseen patients. In this
section, we describe our methodology for doing this comparison.
We utilize ten-fold cross validation [5] to train and test our models. Cross-validation works by dividing
the data into ten equal-sized parts. Then, nine of the parts are used to train the model and one part is used
to validate. The advantages of cross validation are two-fold. First, it does not require a hold-out validation and test set, which is advantageous when data is limited, as is the case for some of our diagnosis-based
subpopulations. Also, by rotating which part of the data is used for validation, one can generate ten
different estimates of model performance.
In this thesis, we will report the average performance for each classifier across the ten folds. Tables and figures that include a 95% confidence interval or error bars were created by calculating the standard deviation across the ten folds and adding/subtracting two standard errors from the mean value.
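A minimal sketch of this procedure, assuming scikit-learn and a stand-in dataset; the reported interval is the mean plus or minus two standard errors across the ten folds, as described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data
clf = RandomForestClassifier(random_state=0)

# Ten-fold cross validation: ten out-of-sample estimates of accuracy.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")

# Report the mean +/- two standard errors across the ten folds.
mean = scores.mean()
se = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"accuracy {mean:.3f} ({mean - 2 * se:.3f}, {mean + 2 * se:.3f})")
```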
Next, we define the metrics that are used in this thesis to characterize model performance. Each candidate
classifier predicts whether a patient will live (0) or die (1). For each patient, we also know their actual
outcome, so we can create a confusion matrix, such as the one depicted in Figure 3.6.
Figure 3.6: Example of confusion matrix

                              Predicted class
                              Died (1)                Lived (0)
Actual class   Died (1)       True positive (TP)      False negative (FN)
               Lived (0)      False positive (FP)     True negative (TN)
We can calculate a number of different performance metrics from the confusion matrix. The three used in
this thesis are accuracy, sensitivity, and specificity [5]. They are calculated as follows, utilizing the
notation from Figure 3.6:
Accuracy: (TP+TN) / (TP+TN+FN+FP)
Sensitivity: TP / (TP + FN)
Specificity: TN / (TN + FP)
Accuracy is the probability of making a correct prediction. Sensitivity is the probability of predicting
death, given the true state is death. Specificity is the probability of predicting survival, given the true state
is survival. We would like all of these metrics to be as close to one as possible, though in practice there is
a trade-off between high sensitivity and high specificity.
The last performance metric that we consider in our analysis is the area under the receiver operating characteristic curve (AUROC). The ROC curve plots sensitivity against 1 - specificity (the false positive rate) at different discrimination thresholds [5]. Once again, values closer to one are better than lower values.
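All four metrics are simple to compute from predicted labels and, for AUROC, predicted probabilities. A minimal sketch with scikit-learn and made-up labels, treating death (1) as the positive class:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0]                   # actual outcome (1 = died)
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]                   # predicted class
y_prob = [0.9, 0.2, 0.6, 0.8, 0.1, 0.4, 0.3, 0.2]   # predicted P(death)

# scikit-learn's confusion matrix has actual classes in rows and predicted
# classes in columns, so ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fn + fp)
sensitivity = tp / (tp + fn)                         # P(predict death | died)
specificity = tn / (tn + fp)                         # P(predict survival | lived)
auroc = roc_auc_score(y_true, y_prob)                # threshold-free discrimination
print(accuracy, sensitivity, specificity, auroc)
```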
Chapter 4
Results
In this chapter, we present the results of our analysis. As described in Section 1.3, the goal of this thesis is
to design a mortality prediction system that can outperform current models. However, predicting adverse
outcomes for critically ill patients is difficult for a number of reasons. Section 4.1 explores the MIMIC II
data and highlights a number of challenges with ICU data. We learn that a good mortality prediction
model must be able to handle large amounts of data that may include strong correlations, and should be
robust to noise and outlier values, as well as missing data.
Section 4.2 compares the predictive performance of a number of different commonly used classifiers. In
this section, we identify two models that significantly outperform SOFA score, one of the current
mortality prediction systems described in Section 1.2. They are random forest and a five-variable
regression model selected through best subset regression.
Section 4.3 introduces a new classifier, UFA, which was designed to address many of the challenges of
clinical prediction in an automated fashion. We compare its predictive performance to SOFA, random
forest, and our best subset regression, and find that UFA predicts mortality for previously unseen patients as
well or better than all of these other methods.
Section 4.4 analyzes the degree to which UFA and our other candidate models possess other
characteristics that are desirable in mortality prediction, such as the ability to make accurate predictions
with missing or noisy data, to generalize to other diagnosis-based subpopulations, and to predict rare
events.
4.1 Challenges of clinical data
Section 2.2 describes the process that we used to compile over 200 variables from the MIMIC II database,
which summarize the first four days of a patient’s ICU stay. This section presents a variety of preliminary
analyses that we conducted to gain a better understanding of the data and the relationship between
individual variables and in-hospital mortality. These analyses uncover a number of challenges inherent to
clinical data.
4.1.1 Many clinical variables are asymmetric and long-tailed
As a first step, we plotted each of our variables, stratified by patient mortality, to look for influential data
points and understand each variable’s relationship with the target. Figure 4.1 shows the data for two
sample variables: SOFA score and daily urine output. In each plot, a death flag of “Y” denotes in-hospital
mortality, while a death flag of “N” denotes survival. We observe that lower SOFA scores and higher
urine output are generally associated with survival, which is consistent with clinical expectations.
Figure 4.1: Box plots for SOFA and urine output stratified by in-hospital mortality
However, the box plots also reveal that many of the variables in the analysis are not symmetric and have
long tails. Referring back to daily urine output in Figure 4.1 and focusing on patients that lived (“N”), we
observe one individual with daily urine output of 9,600 milliliters as compared to the median value of
1,654.5. According to the National Institutes of Health (NIH) [35], the normal range for daily urine in a
healthy individual is 800 to 2,000 milliliters, so 9,600 milliliters is extremely abnormal. With clinical
data, however, it is not always clear if outlier values represent data error or simply very sick individuals.
For this reason, we did not exclude influential points from the dataset in this analysis. As such, classifiers that are robust to outliers and do not assume normally distributed data may be preferable to classifiers without these properties.
4.1.2 List of variables associated with mortality varies throughout ICU stay
Next, we formally specified the relationship between each variable in our analysis and in-hospital
mortality. We ran a chi-squared or t-test, depending on variable type, to evaluate whether there was a
significant difference between patients who lived and patients who died. Using five-fold cross validation,
for each variable, we calculated five p-values using data through day one, day two, day three, and day
four of the ICU stay. Table 4.1 displays the average p-value by day for a subset of the variables in our
analysis, selected to highlight key results. Table B.1 in Appendix B contains the p-values for the full
variable list.
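A sketch of the per-variable screen, with scipy standing in for our statistical software; we assume Welch's two-sample t-test for continuous variables and a chi-squared test on the contingency table for categorical ones, since the thesis does not fix those details.

```python
import numpy as np
from scipy import stats

def mortality_pvalue(x, died, categorical=False):
    """P-value relating one variable to in-hospital mortality (a sketch)."""
    if categorical:
        # Chi-squared test on the variable-by-outcome contingency table.
        table = np.array([[np.sum((x == v) & (died == d)) for d in (0, 1)]
                          for v in np.unique(x)])
        return stats.chi2_contingency(table)[1]
    # Welch's t-test: patients who died vs. patients who lived.
    return stats.ttest_ind(x[died == 1], x[died == 0], equal_var=False)[1]

died = np.array([0, 0, 0, 1, 1, 1, 0, 1])
urine = np.array([1800, 1500, 2100, 600, 400, 900, 1700, 500])
print(mortality_pvalue(urine, died))  # small p: urine differs by outcome
```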
Table 4.1: Relationship between clinical variables and mortality, average p-value by day
Red values are less than 0.05

Type           Variable                        Day 1   Day 2   Day 3   Day 4
Day-level      Use of vasopressors             0.213   0.665   0.728   0.285
               Daily SOFA                      0.041   0.009   0.000   0.000
               Daily urine                     0.141   0.001   0.000   0.000
               Average daily bicarbonate       0.155   0.036   0.000   0.000
Hourly-level   Minimum heart rate              0.549   0.041   0.031   0.005
               Maximum heart rate              0.336   0.347   0.269   0.357
               Average heart rate              0.524   0.074   0.048   0.051
Temporal       Total days with vasopressors    --      0.220   0.053   0.016
               Ever used vasopressors          --      0.489   0.469   0.399
               Minimum bicarbonate to date     --      0.047   0.015   0.004
               Average bicarbonate to date     --      0.059   0.013   0.001
               1-day trend, bicarbonate        --      0.668   0.097   0.007
               2-day trend, bicarbonate        --      --      0.272   0.001
               3-day trend, bicarbonate        --      --      --      0.022
In total, there are 75 variables that have an average p-value below 0.05 on day four. In general, the same
features that are predictive when they are static are also predictive when they are dynamic. Examples
include SOFA, SAPS, urine output, temperature, shock index, and laboratory tests like BUN, phosphate,
and bicarbonate. Table 4.1 displays the results for bicarbonate. The p-value on day four for the daily
reading is 0.000; in addition, we see that the minimum to date, average to date, and the one to three day
trends are all significant.
There are also some cases where the temporal variables are significant, even though the static variable is
not. For example, as seen in Table 4.1, the use of vasopressors is never significant. By day three,
however, the p-value for the total number of days with vasopressors drops to 0.053 and by day four, it is
0.016. In this case, it appears the duration of use is important.
We see that the way in which data is summarized across time is important. For example, with heart rate,
the minimum hourly value is significant on days two through four, while the average value hovers around
0.05 and the maximum value is never significant.
Finally, it is nearly always the case that variables become more predictive as the ICU stay progresses. Using
average daily bicarbonate, for example, Table 4.1 shows that the p-value on day one is 0.155; however, it
drops to 0.036 on day two and drops again to 0.000 on days three and four. Other variables like daily
SOFA, daily urine, and minimum heart rate show the same trend. Figure 4.2 displays the aggregate
number of variables with a mean p-value less than 0.05, which increases from just 11 on day one to 75 on
day four.
Figure 4.2: Number of variables with a mean p-value less than 0.05, by day
We can also conclude from Figure 4.2, however, that fewer than half of the variables are significantly associated with in-hospital mortality even on day four. As such, classifiers that can handle irrelevant features may be preferable to those that cannot. We might also prefer a system that is easily customizable to new datasets, since we observe that changes such as considering data on day three instead of day four change the relationship of variables with our target outcome.
4.1.3 Clinical variables are often strongly correlated
For the 75 variables that had a mean p-value of less than 0.05 on day four, we generated pairwise
correlation coefficients. Unsurprisingly, many of the temporal variables are highly correlated, particularly
when they are calculated from the same base data. For example, as seen in Figure 4.3, the mean SOFA
score to date is correlated with both the minimum and maximum to date, with correlation coefficients of
0.95 and 0.94, respectively. The mean SOFA score to date and the daily SOFA score are also correlated
with a coefficient of 0.91.
Figure 4.3: Pairwise correlation coefficients for SOFA variables
Values exceeding 0.9 are highlighted in red and values between 0.8 and 0.9 are highlighted in orange
                     Initial SOFA   Daily SOFA   Mean to date   Min to date
Daily SOFA           0.66
Mean SOFA to date    0.87           0.91
Min SOFA to date     0.80           0.90         0.95
Max SOFA to date     0.90           0.83         0.94           0.83
Even when variables are not calculated from the same data, we observe correlation. For example, the
average SOFA score to date is correlated with the total number of days on vasopressors. Average
phosphorus is correlated with average blood urea nitrogen (BUN). In both cases, the results make sense.
SOFA score and vasopressor use are likely both proxies for sepsis severity. Phosphorus and BUN both
measure kidney function. However, based on these results, a classifier that does not require variables to
be uncorrelated may be preferable to a classifier that does.
4.1.4 Most patients have missing data
Finally, we investigate the number of missing values in our dataset. Missing data can be especially
problematic in critical care, where the exact set of test results and other data recorded will depend on the
patient’s condition, the physician’s care plan, and a variety of other factors.
Table 4.2 summarizes the available data elements from MIMIC II with the most and least missing data for
the 517 patients in our sepsis subpopulation. The variable that is missing most often is SAPS score at
17%. The amount of missing data for each variable drops off quickly, however, and we observe that most variables are missing for only 2-5% of patients.
Temporal features can either exacerbate or help alleviate the missing data problem. Of the temporal
variables, the one-day trend in fluid balance is the variable missing most often, at 26%. This makes sense,
because to calculate this variable one must have fluid balance for both day three and day four of the stay.
The other one, two, and three day trend variables suffer from the same issue. Conversely, temporal
variables that summarize minimum or maximum values across a patient's stay are rarely missing, since they only require one data point in a 96-hour period.
Table 4.2: Summary of missing data by variable

Missing most often             Missing least often
Variable          % Missing    Variable                % Missing
Daily SAPS        17%          In-hospital mortality   0%
Initial SAPS      7%           Age                     0%
Daily urine       5%           Initial SOFA            1%
Average calcium   5%           Daily SOFA              1%
Weight            5%           Average hematocrit      2%
The vast number of variables in the analysis also exacerbates the missing data problem. Though each
individual variable is missing with relatively low frequency, Table 4.3 shows that only 34.6% of patients
have complete data for the full list of variables in the analysis. For this thesis, we substituted missing
values with the empirical average for classifiers that cannot accommodate missing data. However, it is
clear that a classifier that can accommodate missing data would be preferable.
Table 4.3: Summary of missing data by patient

Amount missing           % of Patients
No missing data          34.6%
Missing 1+ variables     65.8%
Missing 5+ variables     27.5%
Missing 10+ variables    14.7%
Missing 50+ variables    4.8%
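The empirical-average substitution mentioned above is a one-liner with scikit-learn; the toy matrix below is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[36.5, np.nan, 1800.0],
              [35.2, 12.0, np.nan],
              [np.nan, 9.0, 600.0]])
# Replace each missing entry with that variable's empirical average.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)
```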
4.1.5 Summary of key insights
Through the analysis in this section, we identified some of the challenges of clinical data. The results
suggest that a good mortality prediction model should be able to handle large amounts of data that may
include strong correlations, and should be robust to noise, missing data, and outlier values. We would also
like the model to be easily customizable to new datasets, since we observed that changes such as considering data on day three instead of day four change the relationship of variables with our target outcome.
4.2 Predictive performance of commonly used classification techniques
In this section, we compare the performance of four commonly used classification techniques (logistic regression, SVM, random forest, and decision tree) for the task of mortality prediction in clinical care.
As a first step, we ran all four classifiers using our full list of variables, containing more than 200 static
and temporal features. Section 4.2.1 summarizes these results. Then, we attempted to improve our model
performance by preprocessing the data to address some of the challenges outlined in Section 4.1. The
results of that analysis are presented in Sections 4.2.2 and 4.2.3.
Through our analysis, we find that random forest and a best subset regression model containing just five
variables are the two most promising approaches. In Section 4.2.4, we show that they both outperform
SOFA score, a widely used mortality prediction tool, by as much as two days. In Section 4.2.5, we
provide some insights on the role of temporal features in predicting patient mortality.
4.2.1 Random forest performs well for predicting mortality
Using the full analysis dataset, including over 200 different clinical features, we compared the predictive
performance of logistic regression, SVM, random forest, and decision tree. Figure 4.4 displays the
accuracy, AUROC, sensitivity, and specificity for each approach. We observe that random forest,
depicted in purple, has the best overall performance. It has AUROC and specificity comparable to SVM.
However, it also has much higher sensitivity and the highest overall accuracy, at 79.0%.
Figure 4.4: Test-set performance metrics for classifiers, full data
To better understand which variables are most important in the random forest model, we generated a
variable importance plot. Figure 4.5 displays the 25 most important variables in the model, where
importance is measured by the change in out-of-bag error when that variable is removed from the
analysis. We see that daily urine is by far the most important variable in our model, according to this
definition. Other important features include daily SOFA score, shock index, and laboratory tests like
bicarbonate, hemoglobin, and platelet count. In Section 4.2.3, we will compare these variables to the best
variables for a logistic regression analysis, and in Section 4.3.1, we will compare them to the most
significant thresholds for the UFA system.
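A plot like Figure 4.5 is easy to regenerate from a fitted forest. The sketch below uses scikit-learn's permutation importance as a stand-in for the out-of-bag measure described above, with synthetic data standing in for MIMIC II.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Drop in accuracy when one variable is randomly shuffled, averaged over
# repeats: an analogue of the out-of-bag measure behind Figure 4.5.
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
for i in np.argsort(imp.importances_mean)[::-1][:5]:
    print(f"feature {i}: {imp.importances_mean[i]:.3f}")
```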
Figure 4.5: Random forest variable importance
Returning to Figure 4.4, we see that logistic regression and decision tree are the least successful classifiers
with relatively low accuracy, AUROC, and specificity. Given the results of the data exploration presented
in Section 4.1, this is unsurprising. We know that logistic regression in particular tends to suffer from overfitting when the number of variables approaches the number of observations. We also know that it performs best when there is little to no multicollinearity among the explanatory variables, which is not the case in our dataset.
4.2.2 Logistic regression performs competitively after preprocessing
To address some of the data challenges outlined in Section 4.1, we decided to process our dataset to
remove irrelevant information and correlated variables. We hypothesized that this would improve the
predictive performance of logistic regression and possibly other methodologies as well.
We used the analysis from Section 4.1.2 to limit our total variables to the 75 that are individually
significant on day 4. Next, we removed variables that are strongly correlated with another variable in the final list, where a strong correlation is defined as a correlation coefficient above 0.6 or below -0.6.
While it may have been preferable to limit the variables based on whether combinations of variables were
significant or correlated, the large number of features makes that practically challenging. Table 4.4
contains the final list of variables after preprocessing. It includes 28 total features.
Table 4.4: Final variable list after preprocessing (N=28)
Age
Creatinine two-day trend
Platelet one-day trend
DNR
Average BUN to date
Platelet three-day trend
Use of echo
Chloride two-day trend
Number of platelet tests
Number of days with vasopressors
Maximum bicarbonate to date
Average temperature
Average SOFA to date
Bicarbonate one-day trend
Maximum temperature to date
SOFA two-day trend
Bicarbonate three-day trend
Average MAP to date
SOFA three-day trend
Average phosphorus to date
Average shock index to date
Maximum urine to date
Hemoglobin three-day trend
Shock index one-day trend
Minimum fluid balance to date
Minimum platelets to date
Shock index three-day trend
Daily fluid balance range
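A sketch of the correlation filter described above, assuming a greedy pass that drops the later member of any pair whose absolute correlation exceeds 0.6; the thesis does not specify which member of a pair was kept, and the column names below are illustrative.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, cutoff: float = 0.6) -> pd.DataFrame:
    """Greedy filter: drop the later column of any pair with |r| > cutoff."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > cutoff).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"daily_urine": a,
                   "max_urine": a + 0.1 * rng.normal(size=200),
                   "age": rng.normal(size=200)})
print(drop_correlated(df).columns.tolist())  # keeps one urine variable
```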
Figure 4.6 displays the accuracy and AUROC for each of our four classifiers before preprocessing (in
blue) and after preprocessing (in red). As expected, logistic regression has the largest increases. Accuracy
increases by more than ten percentage points, from 68.7% to 78.8%, while AUROC increases from 0.70
to 0.84. For the other classifiers, preprocessing does not diminish performance but it also does not lead to
significant improvements. After preprocessing, we see that logistic regression has very similar
performance to random forest, our best classifier from the previous section.
Figure 4.6: Accuracy and AUROC before and after data preprocessing
Logistic regression also has very high sensitivity, relative to other methods. Figure 4.7 compares the four
methods for the same set of performance metrics in Figure 4.4, after preprocessing. We see that logistic
regression correctly identifies 60% of patients whose true outcome is death, compared to 50% for random
forest and just 37% for SVM.
Figure 4.7: Test-set performance metrics for classifiers, processed data
One way to increase the sensitivity for all of our methods is to train the classifiers to predict the positive
case (e.g. “death”) more often. This can be achieved by balancing the training dataset. Figure 4.8
compares the test-set performance of our four data mining algorithms using the preprocessed data and a
balanced training set.
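One simple way to balance the training data, sketched below, is to undersample the surviving (majority) class until deaths and survivals are equally represented; the thesis does not specify its balancing scheme, so this is an assumption.

```python
import numpy as np

def balance_by_undersampling(X, y, seed=0):
    """Equalize deaths (1) and survivals (0) by dropping extra survivors.
    Assumes survivors are the majority class, as in our cohorts."""
    rng = np.random.default_rng(seed)
    died = np.flatnonzero(y == 1)
    lived = np.flatnonzero(y == 0)
    keep = rng.choice(lived, size=len(died), replace=False)
    idx = np.concatenate([died, keep])
    return X[idx], y[idx]
```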
Figure 4.8: Test-set performance metrics for classifiers, balanced training data
With a balanced training set, all of the classifiers have higher sensitivity as expected, and SVM becomes
more competitive with logistic regression and random forest. For the remainder of our discussion,
however, we will focus primarily on the latter two methods. Random forest has the advantage of not
requiring data preprocessing, as shown in Section 4.2.1. Logistic regression has the advantage of being
the traditional approach used in this field, making it more familiar to practitioners.
4.2.3 Best regression model contains just five variables
The original premise of this research was that the incorporation of a larger number of variables, including
temporal features, and the use of non-linear classifiers would improve our ability to predict mortality. The
results in the previous sections, however, suggest that a logistic regression model may actually be a viable
method, if customized appropriately, and that limiting the number of variables can increase performance.
Given this result, we decided to use an exhaustive best subset regression to determine whether we could
create a more parsimonious regression model without sacrificing predictive performance. To this end, we
were interested in subsets of variables that could achieve AUROC on unseen data of 0.81 to 0.86, the
range achieved by random forest and logistic regression in Section 4.2.1 and Section 4.2.2, respectively.
Beginning with the 28 variables in our preprocessed dataset, we looked for subsets of up to 10 variables
and selected the best combinations based on BIC. The three best combinations are listed in Table 4.5,
along with their accuracy and AUROC metrics. The variables age, SOFA, average temperature, and
average shock index appear in all three combinations. There are some notable omissions as well, such as
daily urine which Figure 4.5 identified as the most important variable in our random forest model. One
explanation could be that the relationship between daily urine values and mortality is not linear, which is
supported by our data visualization in Section 4.1.1.
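A sketch of the exhaustive search, assuming each candidate logistic model is scored by BIC via statsmodels. Note that with 28 variables and subsets of up to 10 the full search requires millions of fits, so the toy example below uses a much smaller problem; the data and variable names are illustrative.

```python
from itertools import combinations

import numpy as np
import statsmodels.api as sm

def best_subset_bic(X, y, names, max_size):
    """Exhaustively score logistic models by BIC (exponential cost)."""
    best_bic, best_vars = np.inf, None
    for k in range(1, max_size + 1):
        for subset in combinations(range(X.shape[1]), k):
            fit = sm.Logit(y, sm.add_constant(X[:, subset])).fit(disp=0)
            if fit.bic < best_bic:
                best_bic, best_vars = fit.bic, [names[i] for i in subset]
    return best_bic, best_vars

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=200) > 0).astype(int)
print(best_subset_bic(X, y, ["age", "sofa", "temp", "si", "bicarb"], max_size=3))
```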
Table 4.5: Top three best subset regression models
Variables in red are not present in the first model

1. Age, daily SOFA, average temperature, average SI, average bicarbonate
   Accuracy 74.8% (72.6%, 77.0%); AUROC 0.831 (0.799, 0.862)
2. Age, daily SOFA, average temperature, average SI, average bicarbonate, chloride two-day trend
   Accuracy 75.0% (72.6%, 77.4%); AUROC 0.829 (0.799, 0.860)
3. Age, daily SOFA, average temperature, average SI, chloride two-day trend, minimum fluid balance to date
   Accuracy 75.2% (73.5%, 76.9%); AUROC 0.816 (0.790, 0.843)
In Table 4.5, we observe that the average AUROC decreases from 0.831 for the best combination of
variables to 0.816 for the third best combination, though all three are in the range of the optimal
classifiers from Sections 4.2.1 and 4.2.2. The first combination of variables is the most parsimonious,
containing just five features.
If we continue looking down the list of best subset models until we find one with fewer than five
variables, we see that it contains age, daily SOFA, average temperature, and average SI; however, the
AUROC for this model is just 0.809. Given this, we select the first subset of variables in Table 4.5 as our
“best subset” regression: age, daily SOFA, average temperature, average SI, and average bicarbonate.
4.2.4 Best classifiers outperform SOFA score by two days
Table 4.6 compares the predictive performance of the random forest from Section 4.2.1 and the best
subset regression from Section 4.2.3 to SOFA score. As explained in Section 2.1, SOFA is a daily score,
designed to evaluate a patient throughout their ICU stay [10]. Though it was originally intended to
characterize patient morbidity, since its development, it has often been used to predict patient mortality
[11]. In this thesis, SOFA score represents baseline performance of current mortality prediction models.
Table 4.6 shows that both best subset regression and random forest can predict mortality better than
SOFA score on day four of the ICU stay. AUROC, in particular, is significantly higher at 0.831 and
0.823, as compared to 0.748 for SOFA score.
Table 4.6: Test-set performance of best classifiers compared to SOFA score

Classifier               Accuracy                AUROC
Best subset regression   74.8% (72.6%, 77.0%)    0.831 (0.799, 0.862)
Random forest            79.0% (76.9%, 81.1%)    0.823 (0.796, 0.851)
SOFA score               72.7% (69.6%, 75.8%)    0.748 (0.709, 0.788)
It would also be desirable, however, to make a more timely prediction than SOFA score. That is, it would
be useful to be able to predict mortality earlier in the ICU stay. Figure 4.9 compares AUROC for all three
methods across the second, third, and fourth day of the ICU stay.
There are three key takeaways from Figure 4.9. First, for all three methods, the ability to predict in-hospital mortality increases the longer the patient is in the ICU. For SOFA score, for example, AUROC increases from 0.67 on day two to 0.75 on day four, an increase of 0.08. This result aligns with our findings in Section 4.1.2, which showed that individual variables' relationships with mortality also tended to become stronger throughout the ICU stay.
Figure 4.9: AUROC for best classifiers and SOFA across time
Blue rectangle signifies confidence interval for SOFA AUROC on day four
The second takeaway is that the best subset regression and random forest outperform SOFA score on each
day. Table 4.6 showed a significant difference in AUROC for day four and Figure 4.9 confirms that this
relationship holds earlier in the ICU stay as well.
Finally, the blue rectangle in Figure 4.9 represents the confidence interval for SOFA AUROC on day
four. We see that the best subset regression and random forest have AUROC that falls within this interval
as early as day two. Therefore, not only do these models allow us to better predict in-hospital mortality at
a given point in time, they also allow us to make more timely predictions. For time sensitive applications,
such as determining courses of treatment or communicating with a patient’s family about likely outcomes,
this two day advantage may be extremely meaningful.
Figure 4.14 in Section 4.4 shows that the same result holds for the full MIMIC population and the UFA
system.
4.2.5 Little observed effect of temporal variables
In the previous section, we saw that the best subset regression performs better on day four than on day two, with AUROCs of 0.748 and 0.719, respectively. From this, we learn that the variables in the regression, such as temperature or bicarbonate level, are more predictive of in-hospital mortality when measured later in the stay. What is notable, however, is that none of the variables in the best subset regression are temporal variables, as none of them measure the change in a patient's status across time.
One of the hypotheses of this analysis was that the use of temporal variables would improve our ability to
predict mortality for previously unseen patients. In this section, we check this assumption by comparing the
performance of logistic regression and random forest using three different sets of variables. The first set is
the preprocessed dataset from Section 4.2.2 which contains the 28 variables listed in Table 4.4. When
selecting these variables, we often encountered features with high pairwise correlation where one variable
was static and the other temporal, such as daily urine and maximum urine to date. When generating the
final list, we favored the temporal features. Therefore, we will denote this list of variables as “favor
temporal”.
However, it would also have been possible to favor the static variables. We will denote this list of 28
variables as “favor static”. Finally, we can create a variable list that contains no temporal information at
all (N=15), which we will denote “no temporal”.
Figure 4.10 compares accuracy and AUROC for these three lists of variables. For regression, we see virtually no difference across the three variable lists. For random forest, the model with no temporal
variables performs slightly worse than the other two models, with accuracy of 77.9% versus 79.8%.
However, the differences are small and not significant.
We conclude that, contrary to our expectations, knowing about changes in a patient’s status across time
does not significantly improve our ability to predict in-hospital mortality above and beyond knowing their
current status. It is possible that this result is driven by the type of temporal variables that we chose to
include in our analysis. We discuss this issue further, and suggest other types of temporal variables that
might be considered, in Section 5.2.
Figure 4.10: Comparison of classifier performance with and without temporal features
4.2.6 Summary of key insights
Through the analysis in this section, we identified two classifiers that are capable of predicting patient
mortality better than current methods, outperforming SOFA score by as much as two days. The first is
random forest, which outperformed all of the other commonly used methods considered. It has the
advantage of having good performance, even on data that has not undergone preprocessing. The second
promising classifier was a best subset regression model with just five variables: age, daily SOFA score,
average temperature, average shock index, and average bicarbonate level. It has the advantage of being
extremely simple. Further, it was not immediately obvious that a linear model with no temporal features
could achieve such high performance; however, we conclude that it is possible if care is taken to properly
customize the model through extensive data preprocessing.
4.3 Predictive performance of UFA system
In this section, we summarize the performance of the fully-automated UFA system formalized in Section
3.2. This system was designed to address the many challenges associated with clinical data and outcome
prediction. In particular, UFA is adept at identifying cut points in clinical data, outside of which mortality
is significantly more or less likely. Section 4.3.1 discusses the degree to which these automatically-identified thresholds align with clinical insight. Section 4.3.2 demonstrates the UFA system’s ability to
predict mortality for previously unseen patients, and shows that it meets or exceeds the performance of
best subset regression, random forest, and SOFA score.
4.3.1 Automated thresholds align with subject matter expertise
As described in Section 3.2, UFA works by searching for ranges of the explanatory variables for which
the target is significantly more or less likely. Then, it combines this information for all of the variables in
the analysis to make a final prediction. This type of approach is very natural in the realm of clinical care,
where laboratory results or vital signs outside of clinically defined thresholds are often associated with
high mortality, while more normal values do not provide much information.
We ran UFA for all 218 variables in the MIMIC II dataset, and automatically identified 95 significant
thresholds associated with high mortality and 43 significant thresholds associated with low mortality for
sepsis patients on day four of the ICU stay. Table 4.7 displays the 20 thresholds with the highest absolute
z-statistic or, in other words, the most significant thresholds identified by UFA.
The first thing that we observe in Table 4.7 is that all of the most significant thresholds are associated
with high mortality. This is clear because the percent of patients who died outside each threshold exceeds
the overall sepsis death rate of 30.9%. The rest of the significant thresholds identified by the algorithm,
including the 43 low mortality thresholds, are available in Appendix B.
Table 4.7: Top 20 most significant UFA thresholds

Variable        Threshold   Value     N     % Died   |Z|
SOFA            More than   12.0      74    68.9%    6.95
AVG.SAPS        More than   19.4      84    61.9%    6.60
SAPS            More than   18.1      73    61.6%    6.26
MAX.URINE       Less than   999.5     104   60.6%    6.20
AVG.URINE       Less than   516.5     81    65.4%    6.13
AVG.SOFA        More than   12.7      84    63.1%    6.10
MIN.SOFA        More than   10.1      93    60.2%    6.01
URINE           Less than   944.3     152   57.9%    5.98
MAX.SOFA        More than   15.2      66    65.2%    5.91
MAX.SAPS        More than   24.2      53    66.0%    5.84
BICARB          Less than   17.7      75    65.3%    5.79
PLATELET        Less than   85.3      96    57.3%    5.74
MAX.UNITS_OUT   Less than   1,017.8   56    66.1%    5.61
AVG.PLATELET    Less than   109.0     106   54.7%    5.53
MIN.PLATELET    Less than   81.7      109   55.0%    5.51
MINTEMP         Less than   35.5      76    59.2%    5.43
AVG.UNITS_OUT   Less than   806.2     77    59.7%    5.24
SOFA_FIRST      More than   18.1      12    100.0%   5.24
AVG.BICARB      Less than   17.6      100   57.0%    5.17
MEAN.PHOS       More than   4.5       93    52.7%    5.14
We see that many of the variables at the top of Table 4.7 align closely with the most important variables
in our random forest model, as depicted in Figure 4.5. Specifically, urine and daily SOFA score are both
highly ranked with bicarbonate levels and platelet count close behind.
To evaluate the UFA thresholds, we compared them to known clinically defined thresholds when
available. We find that the automated thresholds align well with subject matter expertise, as demonstrated
in Figure 4.11. In the figure, red data points indicate patients who died while blue data points indicate
patients who lived. Clinical bounds are denoted with solid lines and UFA thresholds are denoted with
dotted lines.
Figure 4.11: Example UFA thresholds for adult sepsis patients
Panels: temperature, maximum platelet count, urine, sodium, mean arterial blood pressure (MAP), and minimum phosphorus
The six variables in Figure 4.11 were selected to represent a variety of vital signs (e.g. temperature,
MAP), laboratory tests (e.g. platelets, sodium, and phosphorus), and other important clinical features (e.g.
urine output). In all cases, we see that the UFA defined threshold is well within one standard deviation of
the clinically specified threshold.
For instance, returning to the example from Section 3.2.1, the clinical definition of low body temperature
is 36° C. In sepsis, low body temperature is one of the diagnostic criteria for severe sepsis and septic
shock, and is known to be associated with patient severity and death [31]. Applying UFA to the MIMIC II
data for body temperature, we identify a high-mortality threshold at 35.97°C, denoted by a dotted line in
Figure 4.11. Below the threshold of 35.97°C, sepsis patients die at a rate of 57.9%, nearly twice the
overall death rate. We see that the UFA-identified threshold aligns closely with the known physiological
limit.
Moreover, UFA did not identify a significant threshold for high body temperature (fever) in sepsis. Once
again, this is consistent with published guidelines for sepsis presentation [31], which state that fever is an
“insensitive indicator” of severity of illness.
4.3.2 UFA-based classifiers significantly outperform SOFA score
Overall, UFA identified 95 thresholds associated with high mortality and 43 thresholds associated with
low mortality for sepsis patients on day four of the ICU stay. Figure 4.12 plots patients according to their
number of high and low flags, where red denotes patients who died and blue denotes patients that lived.
We see that a linear decision boundary effectively separates the two classes.
Figure 4.12: Number of high mortality and low mortality flags for adult sepsis patients
Table 4.8 compares the performance of our two UFA-based classifiers (highlighted in orange) to the best
commonly used classifiers identified in Section 4.2, random forest and best subset regression. All four
models are benchmarked against SOFA score, a widely used daily score designed to evaluate a patient
throughout their ICU stay [10]. The first UFA-based classifier, N-UFA, classifies patients according to
the linear decision boundary in Figure 4.12. The other UFA-based classifier, RF-UFA, is a standard
random forest model [5] which uses the flags for each significant threshold as dummy inputs.
Table 4.8: Test-set performance for UFA-based classifiers (accuracy and AUROC)

Classifier               Type          Accuracy             AUROC
SOFA                     Baseline      72.7% (69.6, 75.8)   0.748 (0.709, 0.788)
N-UFA                    UFA-based     77.5% (75.1, 79.9)   0.819 (0.797, 0.841)
RF-UFA                   UFA-based     78.1% (75.8, 80.3)   0.800 (0.779, 0.821)
Best subset regression   Section 4.2   74.8% (72.6, 77.0)   0.831 (0.799, 0.862)
Random forest            Section 4.2   79.0% (76.9, 81.1)   0.823 (0.790, 0.851)
We see that, on average, N-UFA correctly predicts in-hospital mortality for 77.5% of patients, while RF-UFA achieves 78.1% accuracy. As seen in Table 4.8, this performance is better than SOFA score and
comparable to the best subset regression and random forest models. The results for AUROC follow a
similar pattern.
Table 4.9 shows the same comparison for sensitivity and specificity. We see that the UFA-based classifiers (highlighted in orange) achieve sensitivity comparable to best subset regression, at 50.9%, 49.7%, and 51.7%, respectively. All three methods perform significantly better than SOFA score on this metric. However, both UFA-based methods also maintain specificity above 0.9, while best subset regression's specificity is 0.863.
Table 4.9: Test-set performance for UFA-based classifiers (sensitivity and specificity)

Classifier               Type          Sensitivity          Specificity
SOFA                     Baseline      35.4% (29.0, 41.7)   0.911 (0.869, 0.953)
N-UFA                    UFA-based     50.9% (46.3, 55.4)   0.908 (0.871, 0.945)
RF-UFA                   UFA-based     49.7% (45.0, 54.4)   0.923 (0.892, 0.955)
Best subset regression   Section 4.2   51.7% (47.0, 56.4)   0.863 (0.820, 0.907)
Random forest            Section 4.2   47.4% (41.3, 53.6)   0.949 (0.924, 0.973)
4.3.3 Summary of key insights
Through the analysis in this section, we saw that the UFA system can predict patient mortality
significantly better than SOFA score. We found that the mortality thresholds identified by UFA align with
clinical norms, and that UFA’s performance is consistent with the best performing classifiers from
Section 4.2, best subset regression and random forest.
This section also outlined some advantages of UFA. First, UFA requires no preprocessing, as opposed to
methods like best subset regression. Second, the UFA system provides information about the variables
fed into the algorithm by giving the user a set of association rules; e.g. when temperature is below
35.97°C, it is associated with significantly higher mortality. In this sense, it is extremely interpretable.
Finally, we see that the UFA-based classifiers have relatively high sensitivity while maintaining high
specificity. In the next section, we continue our discussion of the practical advantages of the UFA system,
particularly in the field of critical care and particularly as it compares to best subset regression and
random forest.
4.4 Practical advantages of UFA system
In Sections 4.2 and 4.3, we identified three approaches to mortality prediction in the ICU that
significantly outperform current methods: best subset regression, random forest, and the UFA system. We
also established that these three approaches have similar predictive performance for patients with a
primary diagnosis of sepsis.
In addition to strong predictive performance, however, there are several other characteristics that may be
desirable in an outcome prediction system. In this section, we address three:
- Robust to missing and noisy data
- Easily customizable to different patient populations, care centers, or targets
- Ability to predict rare events
4.4.1 N-UFA classifier is robust to noisy and missing data
In Section 4.1, we outlined several difficulties of clinical data that should be considered when designing a
mortality prediction system. In particular, we found that many variables in MIMIC II are long tailed and
may include outliers. We also saw that the majority of patients in the database have incomplete data.
For the commonly used classifiers in Section 4.2, missing data were replaced with the empirical average
so that observations with incomplete information did not have to be dropped. While other imputation
techniques are possible, experimenting with different approaches was outside the scope of this thesis.
With UFA, there is no need to have complete data for each observation, since each variable is considered individually. If an instance is missing data for a particular variable, it can simply be excluded from the calculation of that variable's threshold while remaining included in the calculations for every variable where its data is present.
The question remains, however, whether the UFA-based classifiers will have high predictive power if
certain flags are missing or assigned incorrectly due to noisy data. We hypothesize that N-UFA in
particular should be robust to noise and missing data, since it aggregates over all of the high and low risk
flags and does not depend on individual variables.
Table 4.10 confirms this hypothesis for mortality prediction of septic patients. It compares the
performance of N-UFA, random forest, and best subset regression for the original MIMIC II data
(denoted 0% additional missing data) and a version of the MIMIC II data where 50% of observations
were replaced randomly with missing values. All of these methods are compared to a standard logistic
regression, with no preprocessing, to benchmark the results.
Table 4.10: Comparison of different classifiers with varying amounts of missing data

                                       Accuracy                  AUROC
Classifier               Type          0%      50%     ∆         0%      50%     ∆
N-UFA                    UFA-based     77.5%   76.2%   1.3%      0.819   0.790   0.029
Random forest            Section 4.2   79.0%   71.9%   7.1%      0.823   0.771   0.052
Best subset regression   Section 4.2   74.8%   70.4%   4.4%      0.831   0.706   0.125
Logistic regression      Comparison    68.7%   58.3%   10.4%     0.698   0.598   0.100
Table 4.10 shows that with 50% missing data, N-UFA has the highest accuracy and AUROC of all three
methods. We also see that the difference in accuracy between 0% missing data and 50% missing data for
N-UFA is only 1.3 percentage points, compared to 4.4 percentage points for best subset regression, 7.1
percentage points for random forest, and 10.4 percentage points for logistic regression. Similarly,
AUROC decreases by 0.029 as opposed to 0.125, 0.052, and 0.100 respectively.
Table 4.11 provides similar results for data noise. It presents accuracy and AUROC for N-UFA, random
forest, best subset regression, and logistic regression when 50% of the MIMIC II data is randomly
perturbed by a random value ε, distributed normally with mean zero and the empirical variance of the variable in
question. Once again, we see that the N-UFA holds up well. With 50% imprecise data, the accuracy and
AUROC are in line with random forest and best subset regression, and significantly higher than logistic
regression. On average, accuracy decreases by just 1.7 percentage points and AUROC decreases by 0.023
for N-UFA as the percentage of imprecise data increases to 50%.
Table 4.11: Comparison of different classifiers with varying amounts of imprecise data

                                       Accuracy                  AUROC
Classifier               Type          0%      50%     ∆         0%      50%     ∆
N-UFA                    UFA-based     77.5%   75.8%   1.7%      0.819   0.796   0.023
Random forest            Section 4.2   79.0%   76.3%   2.7%      0.823   0.802   0.021
Best subset regression   Section 4.2   74.8%   75.8%   -1.0%     0.831   0.790   0.041
Logistic regression      Comparison    68.7%   68.8%   -0.1%     0.698   0.681   0.017
Expanded versions of Table 4.10 and Table 4.11 including confidence intervals and results for 5-25%
missing data are available in Appendix B.
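For concreteness, the two perturbation schemes behind Tables 4.10 and 4.11 can be sketched as below, assuming entries are masked or perturbed completely at random; the exact sampling used in the thesis may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_missing(X, frac=0.5):
    """Set a random fraction of the entries to missing."""
    X = X.astype(float).copy()
    X[rng.random(X.shape) < frac] = np.nan
    return X

def inject_noise(X, frac=0.5):
    """Perturb a random fraction of the entries by N(0, empirical variance)."""
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < frac
    noise = rng.normal(0.0, X.std(axis=0), size=X.shape)
    return np.where(mask, X + noise, X)
```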
4.4.2 UFA system generalizes well to other critical care populations
With outcome modeling, there is an inherent trade-off between building a general prediction model that is
widely applicable or a specialized model that takes into account particular features of a disease, patient
population, or care facility. Various studies have shown that the most widely used models do not
generalize well to new populations, and independent research suggests regular updates and customization
for best performance [4, 15].
Up to this point, all of the results in this thesis were generated for patients with a primary diagnosis of
sepsis, the most prevalent primary diagnosis in the MIMIC II data. In this section, we evaluate each
model’s ability to generalize to the full MIMIC population. We also test the models’ performance on two
alternate diagnosis-based subpopulations, patients with AMI and patients with lung disease.
The results show that the UFA system can generalize to new cohorts of patients, and they confirm many of the findings from Section 4.3, including the clinical validity of the identified thresholds and accuracy and AUROC in line with other commonly used classification methods.
Thresholds
We ran UFA for each of our four populations individually so that the high and low mortality thresholds
were specific to each cohort. Approximately 55% of possible thresholds were significant in only one or
two cohorts, suggesting a good deal of customization across the different groups. However, when a
significant threshold was found in multiple subpopulations, it tended to fall in the same place and
consistently predict either high or low mortality.
As an example, Table 4.12 compares the significant thresholds for average sodium levels and average
body temperature across all of the study cohorts. The normal range for sodium levels is 135-145 mEq/L
[34]. In Table 4.12, we see that UFA identified both high and low sodium thresholds for the full MIMIC
population, and these thresholds align very closely with the known bounds.
For the three disease-based cohorts, the values are also consistent with clinical norms, but the association
with mortality is only significant in one direction. This presumably highlights the specific risks inherent
to each disease. We see a very similar result for body temperature. While UFA identified significant high
and low thresholds in the full MIMIC population, low body temperature appears to be a better predictor of
mortality in sepsis and lung disease, while fever is problematic for AMI patients.
Table 4.12: Comparison of data-driven thresholds across different subpopulations

Average Sodium Level (normal range 135-145 mEq/L)
All MIMIC      Value less than 133.3 associated with high mortality
               Value above 145.3 associated with high mortality
Sepsis         Value less than 134.9 associated with high mortality
AMI            Value more than 143.1 associated with high mortality
Lung Disease   Value less than 135.2 associated with high mortality

Average Body Temperature (normal range 36.0-38.0 C)
All MIMIC      Value less than 35.97 associated with high mortality
               Value above 38.14 associated with high mortality
Sepsis         Value less than 35.97 associated with high mortality
AMI            Value more than 38.14 associated with high mortality
Lung Disease   Value less than 35.99 associated with high mortality
Next, we used the thresholds established by UFA to predict in-hospital mortality for the full MIMIC
population, patients with AMI, and patients with lung disease.
Full MIMIC population
Figure 4.13 displays accuracy and AUROC results for the full MIMIC population. We see that all five
classifiers have accuracy of approximately 80%. However, AUROC is significantly higher for the UFA-based classifiers and random forest as compared to SOFA and best subset regression.
Figure 4.13: Accuracy and AUROC, full MIMIC population
In fact, we find that N-UFA and RF-UFA can predict mortality after just two days in the ICU with the
same AUROC as SOFA score on day four for the full MIMIC population. Figure 4.14 shows that the
mean AUROC for N-UFA and RF-UFA on day two are 0.720 and 0.734 respectively, while the AUROC
for SOFA score on day four is 0.707.
Figure 4.14: AUROC for UFA-based classifiers and SOFA across time, full MIMIC population
Blue rectangle signifies confidence interval for SOFA AUROC on day four
We also find that N-UFA has significantly higher sensitivity than any of the other four classifiers in the
full MIMIC population. Focusing on patients who actually died, sensitivity measures the percentage that
were predicted correctly. Conversely, specificity focuses on patients who survived and measures the
percentage that were predicted correctly. As seen in Figure 4.15, all of the classifiers have lower
sensitivity than specificity meaning that they have more trouble predicting deaths than predicting survival.
However, N-UFA performs the best on sensitivity; with an average value of 28.6%, it has nearly twice the
sensitivity of SOFA.
Figure 4.15: Sensitivity and specificity, full MIMIC population
Disease-based subpopulations
As compared to the full MIMIC population and patients with sepsis, the AUROC results for the AMI and
lung disease-based subpopulations are less stark. In Table 4.13, we see that the AUROC for SOFA score
is slightly lower than that of the other methods for AMI, but not significantly so. For the lung disease population, all of the methods considered have roughly similar AUROCs, as evidenced by the large overlap in their confidence intervals.
Table 4.13: AUROC for AMI and lung disease subpopulations

Classifier               Type          AMI                    Lung
SOFA                     Baseline      0.725 (0.654, 0.796)   0.721 (0.643, 0.799)
N-UFA                    UFA-based     0.777 (0.718, 0.836)   0.712 (0.641, 0.783)
RF-UFA                   UFA-based     0.746 (0.681, 0.811)   0.708 (0.640, 0.776)
Best subset regression   Section 4.2   0.755 (0.695, 0.814)   0.703 (0.641, 0.766)
Random forest            Section 4.2   0.747 (0.683, 0.810)   0.731 (0.659, 0.803)
However, as with the full MIMIC population, the UFA-based classifiers outperform other methods in
terms of sensitivity.
4.4.3 N-UFA classifier maximizes sensitivity for low targets
Table 4.14 displays the sensitivity for the AMI and lung disease patient cohorts for all of the mortality
prediction approaches discussed in this thesis. We see that N-UFA significantly outperforms the other
methods, particularly for the AMI subpopulation, the patient cohort with the lowest target event rate.
Table 4.14: Sensitivity for AMI and lung disease subpopulations

Classifier               Type          AMI                 Lung
SOFA                     Baseline      1.7 (0.0, 5.0)      15.4 (10.8, 19.9)
N-UFA                    UFA-based     32.7 (22.0, 43.4)   32.7 (26.2, 39.2)
RF-UFA                   UFA-based     12.0 (1.7, 22.4)    31.4 (23.7, 39.1)
Best subset regression   Section 4.2   4.8 (0.0, 11.7)     15.3 (11.8, 18.8)
Random forest            Section 4.2   2.8 (0.0, 6.6)      18.1 (13.0, 23.3)
The in-hospital mortality rate for the AMI subpopulation is 14.7%, so a classifier can achieve more than
85% accuracy by simply predicting that everyone will survive. This approach would result in 0%
sensitivity. As such, it is unsurprising that sensitivity is generally low for this patient cohort, as low as
1.7% for the SOFA-based classifier. N-UFA, however, has 32.7% sensitivity which is nearly 7 times the
sensitivity of the next best classifier that is not UFA-based and almost 20 times the sensitivity of SOFA
score.
In general, one way to increase the sensitivity of a classifier is to balance the training data, so that 50% of
the training cases are patients who died. This teaches the classifier to sometimes predict death, as always
predicting survival will only result in 50% accuracy.
We tried balancing the training data for the AMI subpopulation, and the results are displayed in Table
4.15. The percentage of deaths that we could predict in a previously unseen test set increased to more than
58% for all classifiers. However, balancing the training data simultaneously led to a decrease in the
specificity, the percentage of patients who survived that were predicted correctly. We found a similar
result when we tried balancing the training data for the full MIMIC population and the other two
subpopulations. To achieve a different trade-off between sensitivity and specificity, we could assign costs
to different types of errors (e.g. false positives and false negatives) such that the relative importance
aligns with our application [5].
While balancing the training dataset is often a good approach to address class imbalance, it can
sometimes be undesirable. In particular, for very rare events, balancing the dataset through undersampling
may exclude a large number of potentially useful majority-class examples, while balancing the dataset
through oversampling may require replicating a small number of minority-class examples and can lead to
overfitting [38]. Table 4.15 suggests that N-UFA may provide a possible alternative, achieving relatively
high sensitivity in imbalanced data as compared to other commonly used classification techniques.
Table 4.15: Comparison of results for balanced and unbalanced data, AMI subpopulation

                                       Original                      Balanced
Classifier               Type          Sensitivity   Specificity     Sensitivity   Specificity
SOFA                     Baseline      1.7%          99.8%           63.1%         67.7%
N-UFA                    UFA-based     32.7%         93.9%           63.5%         70.4%
RF-UFA                   UFA-based     12.0%         98.7%           58.6%         69.6%
Best subset regression   Section 4.2   4.8%          99.1%           75.2%         70.2%
Random forest            Section 4.2   2.8%          99.3%           64.6%         72.6%

4.4.4 Summary of key insights
Through the analysis in this section, we saw that the UFA system has many practical advantages for the
application of critical care. First, we found that N-UFA holds up well even with large amounts of missing
or imprecise data. For missing data specifically, the decline in performance is smaller than both random
forest and best subset regression. We also saw that the system generalizes well to the full MIMIC
population. It has significantly higher AUROC as compared to best subset regression, and both N-UFA
and RF-UFA outperform SOFA score by as much as two days. Finally, we found that N-UFA has higher
sensitivity than all other approaches, including random forest, for all four populations analyzed in this
thesis. The result is particularly stark for AMI, the population where the death rate is lowest.
Chapter 5
Discussion
The objective of this thesis was to build a mortality prediction model that could outperform current
approaches. Throughout Sections 4.2 and 4.3, we identify three promising approaches: random forest, a
best subset regression containing just five variables, and UFA. For patients admitted to the ICU with a
primary diagnosis of sepsis, we demonstrate that all three are capable of significantly outperforming
SOFA score [10], a daily score widely used to evaluate a patient throughout their ICU stay.
While all three have strong predictive performance for sepsis patients, we assert that the UFA system is
particularly well-suited for the task of predicting mortality in critical care. UFA works by searching for
ranges of the explanatory variables for which the target is significantly more or less likely. Then, it
combines this information for all of the variables in the analysis to make a final prediction. This type of
approach is very natural in the realm of clinical care, where laboratory results or vital signs outside of
clinically defined thresholds are often associated with high mortality, while more normal values do not
provide much information.
UFA also has a variety of other practical advantages. First, UFA considers each variable individually. As
a result, it can quickly and easily be applied to datasets with a very large number of features, including
cases when the number of features is much larger than the number of observations. One can also easily
introduce new variables that are interactions of existing features, if thought to be important for the
application. In contrast, a standard logistic regression model will struggle as the number of variables
approaches the number of observations, and thus will require preprocessing to remove irrelevant features.
A second advantage of UFA is that it is fully automated, which allows for easy customization. In this
thesis, this is demonstrated by applying UFA to the full MIMIC population, an AMI subpopulation, and a
lung disease subpopulation, in addition to our cohort of sepsis patients. We found that the UFA system
had consistently good predictive performance. For the two diagnosis-based subpopulations, we found that
the UFA system had comparable performance to all of the other methods in terms of accuracy and
AUROC. For the full MIMIC population, we found that N-UFA and RF-UFA outperform SOFA score by
as much as two days. They also have significantly higher AUROC than best subset regression.
65
This result is not particularly surprising, as the best subset regression was tailored for the sepsis
population through extensive data preprocessing. The final model contained just five variables – age,
average temperature, average shock index, daily SOFA, and average bicarbonate – and has the benefit of
being very simple. In the sepsis population for which it was constructed, it has AUROC and sensitivity
equal to those of the number-of-flags classifier. However, we saw that it does not always generalize to new patient cohorts.
UFA, on the other hand, can quickly generate a new set of significant thresholds and flags that are
particular to the population of interest without user intervention. Though we only explore disease-based
customization in this thesis, one could easily tailor the algorithm for a new point in time (e.g. patient
status as of day two), for new outcomes (e.g. 28 day mortality), or new types of patient cohorts, such as
individuals treated at a particular facility or in a particular region.
Finally, UFA is flexible and can be used for many purposes. The thresholds generated by the algorithm
are easy to interpret, since they take the form of association rules, i.e.:
When [VARIABLE] is above/below [VALUE], it is associated with significantly higher/lower mortality.
In some applications, UFA might be used to simply generate a list of thresholds for a particular population, which can then be compared to current clinical guidelines or used as a jumping-off point for future research. As shown in Table 4.10, one can also compare thresholds for different patient cohorts to
identify variables that are predictive of mortality in one population but not another. This functionality is
not present in any of the other classifiers.
Moreover, the thresholds can be used for prediction and can be combined into a number of different
classifiers. This thesis considered two UFA-based classifiers, one based on random forest (RF-UFA) and
one based on the number of high and low mortality flags (N-UFA). One could easily use the thresholds
within another predictive modeling framework if necessary for the application.
For the application of mortality prediction, N-UFA in particular appears to be a good choice. Our results
show that it has the advantage of maximizing sensitivity in unbalanced datasets, particularly for patient
cohorts with low death rates. It also has several other practical advantages.
First, it has been shown to be robust to missing and noisy data, which can be prevalent in critical care.
Since N-UFA aggregates information, even if one flag is missing or assigned incorrectly due to noisy
data, the aggregation of all the other low and high mortality flags may still lead to a correct prediction. In
this thesis, we show that N-UFA performs better than or equal to random forest, the next most
competitive method, even with up to 50% unreliable data.
Second, predictions made by N-UFA are easy to summarize, visualize, and interpret. Figure 5.1 plots all
of the sepsis patients in our dataset based on their number of high mortality and low mortality flags.
Patients who died are indicated in red, while patients who survived are indicated in blue. The solid,
diagonal line is the linear boundary that best separates the two classes.
Figure 5.1: Visualization of number of flags classifier, sepsis subpopulation
Since N-UFA is two-dimensional, Figure 5.1 is easy to interpret and one can quickly see that the decision
boundary does quite a good job separating patients who lived from patients who died. When a physician
is presented with a new sepsis patient, the UFA system can automatically apply the established sepsis
thresholds, aggregate the number of thresholds violated by the new patient, and place him on the chart.
The yellow circle in Figure 5.1 represents a fictitious new patient with five high-mortality flags and
fifteen low-mortality flags. Based on this simple picture, one can quickly see that the patient is predicted
to live. Further, by considering the patient’s distance from the decision boundary and the other patients in
his vicinity, the physician can quickly internalize the level of uncertainty associated with the prediction.
In this case, one may be fairly comfortable predicting survival for the yellow patient but may ascribe a
high level of uncertainty (or refuse to make a prediction) for a patient on or near the decision boundary.
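A minimal sketch of this kind of plot, assuming the flag counts and outcomes have already been computed; the boundary coefficients and the new patient's flag counts below are illustrative placeholders, not the fitted boundary of Figure 5.1.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_flag_space(n_high, n_low, died, boundary=(1.0, -1.0, 0.0),
                    new_patient=(5, 15)):
    """Scatter patients by (# high-mortality flags, # low-mortality flags),
    draw a linear boundary a*high + b*low + c = 0, and mark a new patient."""
    a, b, c = boundary
    died = np.asarray(died, dtype=bool)
    fig, ax = plt.subplots()
    ax.scatter(n_high[~died], n_low[~died], c="blue", alpha=0.5, label="Survived")
    ax.scatter(n_high[died], n_low[died], c="red", alpha=0.5, label="Died")
    xs = np.linspace(0, n_high.max(), 50)
    ax.plot(xs, -(a * xs + c) / b, "k-", label="Decision boundary")
    ax.scatter(*new_patient, c="yellow", edgecolors="black", s=120, zorder=3,
               label="New patient")
    ax.set_xlabel("# high-mortality flags")
    ax.set_ylabel("# low-mortality flags")
    ax.legend()
    return ax
```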
As such, we believe that UFA is a major step toward a mortality prediction model capable of individual
prognosis.
Chapter 6
Future research
This section discusses possibilities for future work. Section 6.1 focuses specifically on the UFA system,
while Section 6.2 discusses the use of temporal features in mortality prediction more broadly.
6.1 UFA
We believe that UFA is a promising first step toward a mortality prediction system capable of individual
prognosis. In this section, we discuss three possible areas for future work.
6.1.1 Multiple testing problem
One possible drawback to UFA is that it conducts $m \times k$ statistical tests in the training phase in order to identify the optimal thresholds, where $m$ is the number of variables and $k$ is the number of potential thresholds per variable. As is well documented, multiple hypothesis testing can inflate the type I error rate and lead to significant results even when none exist [5, 25]. This drawback is also present in related methods, such as the minimum p-value approach to finding optimal cut points.
In this thesis, we address the multiple testing problem by validating the UFA-generated thresholds on
previously unseen data and demonstrating that we can classify new patients with high accuracy and
AUROC. There are other possible approaches, however. For example, one can adjust the p-values for
multiple testing, using an approach such as the well-known Bonferroni method [5]. Future research
should consider the impact of using an alternative method to address the multiple testing problem on the
performance of the UFA system.
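For reference, a Bonferroni correction simply divides the significance level by the total number of tests; a minimal sketch with illustrative numbers:

```python
def bonferroni_alpha(alpha, n_variables, n_thresholds_per_variable):
    """Adjusted per-test significance level when UFA runs one test per
    (variable, candidate threshold) pair, i.e. m * k tests in total."""
    return alpha / (n_variables * n_thresholds_per_variable)

# e.g. alpha = 0.05 with 100 variables and 50 candidate thresholds each:
# each individual test is now judged against 0.05 / 5000 = 1e-5
print(bonferroni_alpha(0.05, 100, 50))
```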
6.1.2 Threshold uncertainty
As explained in Section 3.2, the UFA system selects the optimal threshold $t^*$ for each explanatory variable $x$ in the training dataset, where $t^*$ is defined as the candidate threshold whose test statistic $Z$ is maximum in absolute value. In addition, one might be interested in quantifying the uncertainty associated with that threshold.
To do this, we employed bootstrapping, a data-driven technique in which one resamples the training data many times to generate additional new training datasets [5]. Then, we ran UFA on each one to create a histogram of possible thresholds for each variable. Using the sepsis population, Figure 6.1 shows 1,000 bootstrapped thresholds for body temperature. The red dotted vertical line at 35.97°C represents the optimal cut point $t^*$ that was found using the full training data in Section 4.3.1.
Figure 6.1: Bootstrapped thresholds for low body temperature in adult sepsis patients
This information is useful in two ways. First, we can calculate the variance in the potential thresholds, which can help quantify uncertainty. Second, we can compare $t^*$ for the full training data to the bootstrapped distribution and determine whether it is consistent with the other trials. If we are concerned that $t^*$ may be overfit to the training data and not generalizable, we can consider using a feature of the bootstrapped distribution, such as the mean or mode, as our candidate threshold instead.
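A minimal sketch of this bootstrap, where find_threshold stands in for the UFA threshold search of Section 3.2 (returning the cut point with maximum |Z|, or None when no significant threshold exists); the function name is an assumption of this example.

```python
import numpy as np

def bootstrap_thresholds(x, y, find_threshold, n_boot=1000, seed=0):
    """Resample (x, y) with replacement and rerun the threshold search on
    each replicate, returning the resulting distribution of cut points."""
    rng = np.random.default_rng(seed)
    n = len(x)
    thresholds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # n cases, drawn with replacement
        t = find_threshold(x[idx], y[idx])
        if t is not None:                  # keep only significant thresholds
            thresholds.append(t)
    return np.asarray(thresholds)

# np.var(ts) quantifies threshold uncertainty; the histogram mode can serve
# as a more conservative candidate threshold than the full-sample cut point.
```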
For the sepsis application, we ran 100 bootstraps per variable and used the mode of each distribution as the candidate threshold. On average, we found 5% fewer significant thresholds and, of the significant thresholds found, 15.6% varied by more than 5% from $t^*$. When applied to unseen data, however, N-UFA achieved exactly the same AUROC using the bootstrapped thresholds as it did using the original thresholds.
Since bootstrapping adds significant runtime to the UFA system and did not improve the predictive
performance for our application, we present the non-bootstrapped results in this thesis. However, we
recommend further work in this area. For example, visual inspection of the bootstrapped distributions for
the MIMIC II data reveals that some of the variables have a bimodal distribution, perhaps suggesting
multiple cut points. It is also possible that overfitting may be more of an issue in other datasets.
6.1.3 Multivariate approach
In its current form, UFA is univariate which provides certain practical advantages. It can quickly and
easily be applied to datasets with a very large number of features. Moreover, if the user feels that a
particular interaction is important for their application, it can be introduced as an additional feature.
However, the algorithm could conceivably also establish thresholds in multiple dimensions. The
following example shows one possibility using SVM.
In one dimension, UFA cycles through potential thresholds for a variable $x$ and compares the death rate outside of each threshold to a baseline death rate, calculated using patients in the interquartile range of $x$. In two dimensions, we use a similar approach; however, instead of cycling through values of $x$, we iterate through different cost penalties.
Figure 6.2 plots patients according to two variables, systolic blood pressure and heart rate. The patients
who died are indicated in red, while the patients who lived are indicated in black. By adjusting the cost
weights associated with misclassifying survival, we control the purity of patients within the high mortality
area. Figure 6.2 shows the difference between using a 1:1 cost penalty and a 2:1 cost penalty.
Figure 6.2: Examples of different cost penalties for two-dimensional flagging (left: 1:1 cost penalty; right: 2:1 cost penalty)
As with the univariate case, we are interested in balancing the mortality rate within the high-mortality
area (in pink) with the number of cases in that area. We use a similar procedure to the one described in
Section 3.2.3.
Step 1: Using SVM, determine how much misclassification must be penalized in order to get a 100%
death rate (or equivalently, 0% death rate).
Step 2: Iterate from a 1:1 cost balance to the value from Step 1. For each value, run SVM and calculate
the death rate within the pink area. Compare this death rate to a baseline death rate, calculated based on
the interquartile ranges of the two variables.
Step 3: Generate a Z-statistic for the two-sided proportion test using the same method as the one-variable
case, outlined in Section 3.2.2.
Step 4: Choose the cost balance with the maximum Z-statistic.
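A sketch of Steps 2-4 under these definitions, using scikit-learn's SVC with per-class cost weights; the Z-statistic below is a simple one-sample approximation standing in for the exact proportion test of Section 3.2.2, and Step 1 (finding the most extreme cost ratio) is assumed to have been done already.

```python
import numpy as np
from sklearn.svm import SVC

def prop_z(deaths_in, n_in, p_baseline):
    """Simple Z-statistic comparing the death rate inside the flagged
    region to the baseline rate (a stand-in for Section 3.2.2's test)."""
    p_hat = deaths_in / n_in
    se = np.sqrt(p_baseline * (1.0 - p_baseline) / n_in)
    return (p_hat - p_baseline) / se

def best_cost_ratio(X, y, p_baseline, ratios):
    """Steps 2-4: for each candidate ratio w (misclassified deaths cost w
    times more than misclassified survivors), fit a linear SVM, measure the
    death rate inside the predicted high-mortality region, and keep the
    ratio whose |Z| is largest."""
    best_w, best_z = None, -np.inf
    for w in ratios:
        clf = SVC(kernel="linear", class_weight={0: 1.0, 1: w}).fit(X, y)
        flagged = clf.predict(X) == 1      # the "pink" high-mortality region
        if flagged.sum() == 0:
            continue
        z = abs(prop_z(y[flagged].sum(), flagged.sum(), p_baseline))
        if z > best_z:
            best_w, best_z = w, z
    return best_w
```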
In this example, the optimal cost ratio was found to be 1.43:1, which is depicted in Figure 6.3. We see
that, as desired, the pink area contains primarily patients who died, while simultaneously having a fairly
large support.
Figure 6.3: Optimal cost penalties for two-dimensional flagging (1.43:1)
The disadvantage of this approach, however, is that we have lost the ability to state the results of the
algorithm in terms of simple association rules; e.g. when [VARIABLE] is above/below [VALUE], it is
associated with significantly higher/lower mortality. Also, it is not clear what the appropriate “baseline”
death rate should be. In this example, we used patients who were in the interquartile range for systolic
blood pressure AND heart rate. As the number of variables increases, however, the number of patients in
that intersection will become very small. This is another possible area for future research.
6.2 Temporal features in critical care
It seems reasonable that information about a patient’s improving or worsening health status should impact
our ability to accurately predict mortality. Using SOFA as an example, past research shows that
maximum daily SOFA score and delta SOFA (defined as maximum score minus admission score) have
good correlation with outcomes for patients in the ICU for two or more days [14]. Our data seems to tell a
similar story. Figure 6.4 shows the average SOFA score by day for patients who ultimately died (in
purple) and lived (in green), and suggests that the trends are different.
In our analysis, however, we found that simply adding variables like maximum SOFA or the trend in SOFA did not improve our ability to predict mortality for previously unseen patients. For sepsis patients, we found that a best subset regression model containing just five variables, none of them trend variables, had the same predictive performance as a much more complicated model that included all of the temporal information.
Figure 6.4: Trends in SOFA score by day and mortality status
In this section, we discuss three ongoing analyses that consider new ways to incorporate temporal
information into mortality prediction models. As opposed to looking at summary statistics across time
such as minimum or standard deviation, we investigate whether more sophisticated methods result in
increased model performance.
6.2.1 Identifying subsets of patients where trend is important
This section describes a side analysis that was conducted to deconstruct the trend in average SOFA score
across time, depicted in Figure 6.4. The analysis shows that trend data does not improve our ability to
predict sepsis mortality for fairly healthy patients or especially critical patients. However, it is useful in
predicting mortality for patients who do not fall into either of those groups.
Figure 6.5 displays a parallel coordinates plot which we used to determine whether there are particular
trends in SOFA score that correspond to survival and in-hospital mortality. The only obvious trend that
we observed was more red lines (which denote patients that died) near the top of the plot and more black
lines (which denote patients that lived) near the bottom.
Figure 6.5: Parallel coordinates plot summarizing SOFA across time

SOFA (Day 4)   # Lived   # Died   % Died
13+            4         41       91%
5-12           86        80       48%
4 or less      40        9        18%
Next, we drilled down to patients with a SOFA score on day four of 5-12. Using clustering algorithms
and manual analysis, we determined that the number of decreases in SOFA during an individual’s stay
allowed us to further stratify patients. As seen in Figure 6.6, patients whose SOFA score declines three
times (i.e. goes down every day) have a death rate of 22%, while patients whose SOFA never declines
have a death rate of 61%. We also tried to stratify patients based on the magnitude of changes (from day
to day and across multiple days), as well as particular sequences of increases and decreases but did not
discover anything of note.
Using this information, we developed five clusters of patients based on SOFA score, outlined in Table
6.1. If we subset to Cluster 5, there is no longer a deviation in the trend of average SOFA score from day
1 to day 4. Further, the majority of the patients outside of Cluster 5 were categorized based on their day 4
SOFA score (as opposed to trend data), which may explain why trend data seemed to be less important in
our primary analysis.
Figure 6.6: Parallel coordinates plot summarizing SOFA across time (SOFA of 5-12)

Number of Decreases   # Lived   # Died   % Died
0                     7         11       61%
1                     28        35       56%
2                     37        30       45%
3                     14        4        22%
Table 6.1: Patient clusters based on static and trend SOFA data

Cluster   Description        # Lived   # Died   % Died
1         Day 4 SOFA >= 13   4         41       91%
2         Day 4 SOFA <= 4    40        9        18%
3         Zero Down          7         11       61%
4         Three Down         14        4        22%
5         Other              65        65       50%
We went through a similar process to create patient clusters for bicarbonate levels, temperature, and
shock index. By replacing SOFA and bicarbonate with the new clustered variables (which include some
trend information) in our best subset regression, we were able to improve AUROC by 0.02 while
maintaining accuracy. Using the clustered variables for temperature and shock index did not improve
performance over our original model.
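A minimal sketch of the resulting cluster assignment; the precedence of the rules (the day-4 score first, then the trend) is our reading of Table 6.1.

```python
def sofa_cluster(day4_sofa, n_decreases):
    """Assign a sepsis patient to one of the five clusters of Table 6.1,
    given the day-4 SOFA score and the number of day-over-day decreases
    in SOFA (0 to 3) during the stay."""
    if day4_sofa >= 13:
        return 1   # Day 4 SOFA >= 13 (91% died)
    if day4_sofa <= 4:
        return 2   # Day 4 SOFA <= 4 (18% died)
    if n_decreases == 0:
        return 3   # "Zero Down": SOFA never declined (61% died)
    if n_decreases == 3:
        return 4   # "Three Down": SOFA declined every day (22% died)
    return 5       # Other (50% died)
```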
These results suggest that trend data may be useful for mortality prediction; however, it is possible that it is only useful for a subset of patients. Determining how to combine temporal and static features to automatically identify these subgroups is an area for future research.
6.2.2 Characterizing uncertainty through filtering
Another way to use temporal data is to use a patient’s progression to help quantify uncertainty about their
true state on the current day. In this section, we use sequential importance resampling (SIR) [39, 40] to
quantify the uncertainty in an individual patient’s SOFA score. By combining a dynamical model for
SOFA score progression and the observed SOFA scores for an individual patient, we can determine the
posterior distribution for the SOFA score on each day of a patient’s ICU stay.
We will assume that a patient's SOFA score $x_t$ evolves in time according to the following model:

$$x_{t+1} = f(x_t) + w_t, \qquad w_t \sim p(w), \qquad t = 0, 1, 2, 3, 4$$

Then, in order to actually implement particle filtering, we need to determine $f(\cdot)$, the distribution of the noise term $p(w)$, and the prior $p(x_0)$ for this particular application. There is precedent in the literature to use a linear model to describe disease progression [41]:

$$x_{t+1} = a x_t + w_t$$

Further, in the clinical domain, models are typically parameterized using observational data from past patients [41, 42]. Therefore, we used the full population of sepsis patients in MIMIC II to determine $a$ and the distribution of $w_t$.
Since, particularly for patients who died, the progression of SOFA score is not consistent across time periods, as seen in Figure 6.4, we decided to model each time step separately, giving each step its own coefficient $a_t$. The first-step system model for patients who die is

$$x_1 = 0.95\, x_0 + w_0$$

with analogous models $x_{t+1} = a_t x_t + w_t$ for the later steps. The system models for patients who live are:

$$x_1 = 0.92\, x_0 + w_0$$
$$x_2 = 0.91\, x_1 + w_1$$
$$x_3 = 0.92\, x_2 + w_2$$
$$x_4 = 0.94\, x_3 + w_3$$
We assume the error terms are distributed normally with mean zero and variance determined empirically.
Finally, for each time step, we had to select which model to use (i.e., the model for patients who live or the model for patients who die). We tried a variety of different approaches, which are outlined below.

1. Truth: $x_{t+1} = \delta (a_t^{died} x_t + w_t) + (1 - \delta)(a_t^{lived} x_t + w_t)$, where $\delta = 1$ if the patient actually died and $\delta = 0$ otherwise. This method is the least useful in practice, since this information will not be known. However, it minimizes the uncertainty in the model and therefore provides a useful baseline.
2. Data: $\delta_t = 1$ if the patient's SOFA score on day $t$ predicts death under the optimal classifier and $\delta_t = 0$ otherwise.
3. Distribution: Divide the distribution $p(x_t \mid y_{1:t})$ into intervals. Let $\delta_t$ be the probability mass within each interval multiplied by the population death rate for that range of SOFA scores, summed over all intervals. In this method, $0 \le \delta_t \le 1$ and the dynamics equation is a convex combination of the system models for patients who lived and patients who died. As a variation, we can round $\delta_t$ to 1 if it is larger than 0.5 and 0 otherwise, and select a single model.
Finally, we set our prior $p(x_0)$ for each individual patient to be the observed distribution of SOFA score at admission for all sepsis patients. We also assume that the SOFA score is a measure of the patient's true severity plus some noise. Therefore, we model the relationship between the observed score $y_t$ and the true state $x_t$ as follows:

$$y_t = x_t + v_t, \qquad v_t \sim N(0, 1)$$
With the model now fully specified, we generated $p(x_t \mid y_{1:t})$ for $t = 1, 2, 3, 4, 5$ for each patient in MIMIC II. Figure 6.7 displays six weighted histograms showing $p(x_0)$ and $p(x_t \mid y_{1:t})$ for a representative patient. In each histogram, there is a red vertical line at 9.5, as we found that the optimal classifier relying only on the daily SOFA score is: predict death if $y_t \ge 9.5$, and predict survival if $y_t < 9.5$. For $t > 0$, there is also a purple vertical line which indicates the observed SOFA score $y_t$ for this particular patient on each day.
Figure 6.7: Example SIR results for a representative patient, with panels for Day 0 (prior) through Day 5. The true outcome of this patient was survival.
Across time, we see that the posterior distribution of the SOFA score narrows, as the sequence of
observed values reduces uncertainty about the patient’s true state. On day five, we observe a SOFA score
of 10 for this patient. If we were to use this data point and our association rule predicting death for SOFA
scores above 9.5, we would incorrectly predict that this patient died.
If instead we analyze $p(x_5 \mid y_{1:5})$, however, we can use various features of the distribution to make a determination. One obvious choice would be the expected value. Another option would be the probability of death, denoted $P_d$, which we calculate by dividing $p(x_5 \mid y_{1:5})$ into intervals and then summing, over all intervals, the probability mass multiplied by the population death rate for SOFA scores in that range. On day 5, we observe that $E[x_5] < 9.5$ and $P_d < 0.500$, both of which suggest a correct prediction of survival.
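A minimal sequential importance resampling sketch of this pipeline; the normal prior, and the per-step coefficients, mixing weights, and noise scales passed in as arguments, are assumptions of this example rather than the fitted values from our data.

```python
import numpy as np

def sir_posterior(y_obs, a_live, a_die, delta, sigma_w, n_particles=5000, seed=0):
    """SIR for x_{t+1} = a_t x_t + w_t, y_t = x_t + v_t with v_t ~ N(0, 1).
    y_obs: observed SOFA scores y_1..y_T; a_live, a_die: per-step dynamics
    coefficients; delta: per-step mixing weights in [0, 1] (method 3);
    sigma_w: per-step noise standard deviations."""
    rng = np.random.default_rng(seed)
    # Prior p(x0): a normal stand-in for the empirical admission distribution
    x = rng.normal(9.0, 4.0, size=n_particles)
    posteriors = []
    for t, y in enumerate(y_obs):
        # Propagate through the convex combination of the two system models
        a = delta[t] * a_die[t] + (1 - delta[t]) * a_live[t]
        x = a * x + rng.normal(0.0, sigma_w[t], size=n_particles)
        # Importance weights from the observation likelihood N(y; x, 1)
        w = np.exp(-0.5 * (y - x) ** 2)
        w /= w.sum()
        # Resample to get an equally weighted sample from p(x_t | y_1:t)
        x = rng.choice(x, size=n_particles, p=w)
        posteriors.append(x.copy())
    return posteriors

# Day-5 features as in the text: x5 = posteriors[-1]; predict survival when
# x5.mean() < 9.5, or when the interval-weighted death probability P_d
# (mass per SOFA interval times that interval's death rate) is below 0.5.
```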
With this approach, the posterior distribution for each patient on each day is determined both by the general dynamics and by the patient's own observed SOFA values. Therefore, the final distribution is customized for each individual. Preliminary results such as those in Figure 6.7 suggest that features of the distribution can be used to correctly predict in-hospital mortality for individual patients in cases where the daily SOFA score fails. Unfortunately, at the population level, the two features explored above, the expected value and $P_d$, did not produce better results than predicting using the static SOFA score alone. However, future research could explore using other features of $p(x_t \mid y_{1:t})$ for prediction, such as the standard deviation, or try alternative models for the dynamics $p(x_{t+1} \mid x_t)$ to improve results.

6.2.3 Learning patient-specific state parameters
This section explores a final way to use time-varying information to characterize patients. In the previous
section, we explored methodology that allowed us to characterize our certainty about a patient’s severity
at any point in time, where severity was a continuous measure. In some situations, however, it may be
more appropriate to model severity as two states, critical or not critical. In this section, we use the
example of low blood pressure or hypotension.
When sepsis is accompanied by a fall in blood pressure, it is called severe sepsis. When that hypotension is not responsive to fluids or other medications, the patient is said to be in septic shock, which is associated with very high mortality rates [43]. However, there is no consensus about the exact cutoff that defines hypotension. Different physicians consulted on this thesis disagreed about whether a threshold of 60, 65, or 70 mmHg was appropriate; UFA identified a threshold of 67.4 mmHg. Moreover, there is some thought that hypotension is actually a relative measure. The normal range for mean arterial blood pressure (MAP) in adults is 70-105 mmHg [44]: if a patient is normally at 70 mmHg, he would not be considered hypotensive at 65 mmHg, while a patient who is normally at 105 mmHg might be considered hypotensive at 70 mmHg.
In this section, rather than use a fixed threshold to characterize hypotension, we decided to parameterize a patient's MAP time series using a Hamilton regime switching model [45]. This model assumes that a patient has two states, a high state (normal) and a low state (hypotensive). For each patient, we can use a filtering approach to learn the optimal model parameters $\theta$ and the estimated state vector $(s_1, s_2, \ldots, s_T)$, i.e. whether the patient is hypotensive or not at each time step. The inferred states, along with the probabilities of switching between the high state and the low state, are potentially informative measures of hypotension persistence.
Borrowing notation from Hamilton [45], we describe the data with a first-order autoregression,

$$y_t = c_{s_t} + \phi\, y_{t-1} + \varepsilon_t$$

where $\varepsilon_t \sim N(0, \sigma^2)$ and $s_t$ is a random variable that is 1 when the patient is in the high state (normal MAP) and 2 when the patient is in the low state (hypotension). We used a first-order, two-state Markov process to describe the transitions between $s_t = 1$ and $s_t = 2$.
CTP Specification: One possibility is to use constant transition probabilities (CTP), where $p_{ij}$ denotes the probability of being in state $j$ at time $t$, conditional on being in state $i$ at time $t-1$. The list of parameters needed to fully characterize the behavior of $y_t$ is then $\theta = (\phi, \sigma, c_1, c_2, p_{11}, p_{22})$. Figure 6.8 compares a plot of an example patient's blood pressure data $(y_1, y_2, \ldots, y_T)$ with a plot of the probability of being in state 2 (hypotension) at time $t$. For this patient, the optimal theta turned out to be $\phi = 0.11$, $\sigma = 4.23$, $c_1 = 60.47$, $c_2 = 50.67$, $p_{11} = 0.968$, $p_{22} = 0.966$. We observe that, for this patient, both the high state and the hypotensive state are quite “sticky”; that is, $p_{11}$ and $p_{22}$ are both close to 1.
Figure 6.8: Example of Hamilton regime switching model, CTP specification
If we assign this patient to the hypotensive state when $P(s_t = 2) \ge 0.5$, then Figure 6.8 implies that this patient had two hypotensive episodes during his first 72 hours in the ICU (hours 7-27 and hours 29-42), with an average episode length of 17 hours. In comparison, if we were to use a naïve threshold of 65 mmHg to identify hypotensive episodes, we would say that this patient had eight episodes with an average episode length of 4.5 hours. The estimated state vector customizes our hypotension definition for this patient and provides a cleaner interpretation of the patient's blood pressure data, which arguably conforms more closely to visual inspection.
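Given a fitted $\theta$, the filtered probabilities $P(s_t = 2 \mid y_{1:t})$ follow from the standard Hamilton filter recursion; a minimal sketch (estimating $\theta$ itself is not shown), using the example parameter values from above:

```python
import numpy as np

def hamilton_filter(y, c1, c2, phi, sigma, p11, p22):
    """Filtered P(s_t = 2 | y_1:t) for the two-state switching AR(1)
    y_t = c_{s_t} + phi * y_{t-1} + eps_t, eps_t ~ N(0, sigma^2)."""
    P = np.array([[p11, 1 - p11],
                  [1 - p22, p22]])   # P[i, j] = P(s_t = j+1 | s_{t-1} = i+1)
    prob = np.array([0.5, 0.5])      # flat initial state probabilities
    out = []
    for t in range(1, len(y)):
        prior = prob @ P                          # predict the state
        mu = np.array([c1, c2]) + phi * y[t - 1]  # regime-specific means
        lik = np.exp(-0.5 * ((y[t] - mu) / sigma) ** 2) / sigma
        prob = prior * lik
        prob /= prob.sum()                        # update with observation
        out.append(prob[1])                       # P(hypotensive state)
    return np.array(out)

# Episode statistics as in the text: threshold the filtered probability at
# 0.5 and count the resulting runs of hypotensive time steps, e.g.
# p2 = hamilton_filter(map_series, 60.47, 50.67, 0.11, 4.23, 0.968, 0.966)
# hypotensive = p2 >= 0.5
```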
TVTP Specification: We can also allow the transition probabilities to depend both on the state at time $t-1$ and on other factors, such as the patient's drug regimen. Specifically, the transition probabilities evolve as logistic functions [46]:

$$p_{11,t} = \frac{\exp(x_{t-1}' \beta_1)}{1 + \exp(x_{t-1}' \beta_1)}, \qquad p_{22,t} = \frac{\exp(x_{t-1}' \beta_2)}{1 + \exp(x_{t-1}' \beta_2)}$$

While $\beta_1$ and $\beta_2$ are specified to be time-invariant, certain elements of the conditioning vector $x_{t-1}$ change across time, resulting in time-varying transition probabilities (TVTP). For our analysis, $x_{t-1}$ has dimension (3 x 1) and contains the patient's heart rate at time $t-1$, an indicator variable for whether the patient received vasopressors (a drug intended to raise blood pressure) at time $t-1$, and a constant term. We will denote the parameters that govern the transition probabilities as $\beta = (\beta_1', \beta_2')'$, where $\beta$ has dimension (6 x 1). The list of parameters needed to fully characterize the behavior of $y_t$ in the TVTP specification is then $\theta = (\phi, \sigma, c_1, c_2, \beta)$.
We take, for example, a patient whose probability of staying in the hypotensive state ($p_{22}$) is 0.972 under the constant CTP specification. If we incorporate information about his drug regimen into our model, we find that $p_{22,t} = 0.464$ when the patient is being treated with vasopressors and 0.840 when the patient is not being treated. These results conform to our expectation that a patient is more likely to leave the hypotensive state while on vasopressors.
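A small numerical sketch of these logistic transitions; the $\beta$ coefficients below are purely illustrative, not estimates from our data, and are chosen only to reproduce the direction of the vasopressor effect.

```python
import numpy as np

def tvtp_stay_prob(beta, heart_rate, on_vasopressors):
    """p_jj,t = exp(x'beta) / (1 + exp(x'beta)) with conditioning vector
    x_{t-1} = (heart rate, vasopressor indicator, constant)."""
    x = np.array([heart_rate, float(on_vasopressors), 1.0])
    z = x @ beta
    return np.exp(z) / (1.0 + np.exp(z))

beta2 = np.array([0.01, -1.8, 0.5])         # hypothetical values for p22
print(tvtp_stay_prob(beta2, 95.0, True))    # on vasopressors: ~0.41
print(tvtp_stay_prob(beta2, 95.0, False))   # off vasopressors: ~0.81
```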
Figure 6.9 provides the MAP data for this example patient, alongside his estimated probability of being in
the hypotensive state at each time step for both the CTP and the TVTP model specifications.
Figure 6.9: Example of Hamilton regime switching model, TVTP vs. CTP specification
Overall, the two approaches produce similar results in most time periods. However, we observe two key differences. First, the state estimates under the TVTP specification are more certain; that is, $P(s_t = 2)$ deviates less from zero and one. Second, the two methods differ for hours 42-49. The CTP specification shows a high probability that the patient is in the hypotensive state, while the TVTP specification is able to use additional information about heart rate and vasopressor use to update its estimate.
Returning to the issue of predicting mortality, we hypothesize that one can use the results of the Hamilton regime switching model in two different ways to improve predictions. First, one can use the patient-specific estimated state vector to characterize variables like hypotension, as opposed to a single set threshold. Second, one can consider the parameters of each patient's switching model. For example, the probability of staying in the high state ($p_{11}$) and staying in the hypotensive state ($p_{22}$) for each patient might be useful measures of hypotension persistence. Similarly, the difference in $p_{22}$ when a patient is on vasopressors or not on vasopressors could provide a measure of their response to treatment.
Table 6.2 compares the performance of three logistic regression models which attempt to predict sepsis mortality using solely information about hypotension. The first two models contain four independent variables: the total number of hours in the hypotensive state, the total number of hypotensive episodes, the average episode length, and the length of the longest episode for each patient. The first model uses a population-wide threshold of 65 mmHg to define hypotension, while the second model uses the patient-specific state vector. We see that the features of the state vector that were selected are not predictive of in-hospital mortality on previously unseen samples. Accuracy is just 24%, compared to 56% for a model that uses a 65 mmHg threshold.
However, our third model, which uses the optimal model parameters $\theta = (\phi, \sigma, c_1, c_2, p_{11}, p_{22})$ as the independent variables, has the highest accuracy and AUROC, at 76% and 0.705 respectively. This provides some confidence that the Hamilton regime-switching model is capturing important dynamics in patients' blood pressure data, and suggests that further analysis should focus on identifying additional features of the state vector that are more suitable for prediction.
Table 6.2: Test-set performance of classifiers based on regime switching model

Independent Variables                          Accuracy   AUROC
Episode Characteristics (65 mmHg Threshold)    56%        0.635
Episode Characteristics (State Vector)         24%        0.147
Model Parameters                               76%        0.705
Chapter 7
Conclusion
The objective of this thesis was to build a mortality prediction model that can outperform current
approaches. We aimed to improve current methodologies in two key ways:
1. By incorporating a wider range of variables, including time-dependent features
2. By exploring different predictive modeling techniques beyond standard regression
We identified three different outcome prediction approaches that can significantly outperform current
methods. The first model was a best subset regression containing just five static variables. It was not
immediately obvious that a linear model with no temporal features could achieve such high performance;
however, we conclude that it is possible if care is taken to properly customize the model through
extensive data preprocessing. The other two models, random forest and the UFA system, have the
advantage of being more flexible and they do not require the user to do variable selection. As such, they
are easy to customize to new populations and have consistently strong predictive performance.
In addition to being easily customizable, we show that the UFA system in general (and the N-UFA
classifier in particular) has several other practical advantages that make it well-suited for use in critical
care:
1. Provides user with simple association rules characterizing relationship between individual
variables and mortality
2. Robust to noise and missing data
3. Maximizes sensitivity in unbalanced datasets, particularly for rare events
4. Displays results in two dimensions, making it easy to interpret and visualize
As such, we believe that UFA is a major step toward a mortality prediction model capable of individual
prognosis.
References
1. Zimmerman JE, Kramer AA. A history of outcome prediction in the ICU. Current Opinion in Critical Care, 2014; 20(5): 550-556.
2. Power GS, Harrison DA. Why try to predict ICU outcomes? Current Opinion in Critical Care, 2014; 20(5): 544-549.
3. Connors AF, et al. A controlled trial to improve care for seriously ill hospitalized patients: the study to understand prognoses and preferences for outcomes and risks of treatments (SUPPORT). JAMA 1995; 274(20): 1591-1598.
4. Salluh JIF, Soares M. ICU severity of illness scores: APACHE, SAPS and MPM. Current
Opinion in Critical Care, 2014; 20(5): 557-565.
5. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference,
and prediction. 2nd ed. New York: Springer Science+Business Media; 2009.
6. Knaus WA, et al. The APACHE III prognostic system: risk prediction of hospital mortality for
critically ill hospitalized adults. Chest 1991; 100(6): 1619-1636.
7. Zimmerman JE, Kramer AA, McNair DS, Malila FM. Acute Physiology and Chronic Health
Evaluations (APACHE) IV: hospital mortality assessment for today’s critically ill patients. Crit
Care Med 2006; 34(5): 1297-1310.
8. Le Gall JR, Lemeshow S, Saulnier F. A new simplified acute physiology score (SAPS II) based
on a European/North American multicenter study. JAMA 1993; 270(24): 2957-63.
9. Moreno RP, Metnitz PGH, Almeida E, et al. SAPS 3 – from evaluation of the patient to
evaluation of the intensive care unit. Intensive Care Med 2005; 31: 1336-1355.
10. Vincent JL, et al. The SOFA (sepsis-related organ failure assessment) score to describe organ
dysfunction/failure. Intensive Care Med 1996; 22:707-710.
11. Minne L, Abu-Hanna A, de Jonge E. Evaluation of SOFA-based models for predicting mortality
in the ICU: A systematic review. Crit Care 2008; 12:R161
12. Moreno R, et al. The use of maximum SOFA score to quantify organ dysfunction/failure in
intensive care. Intensive Care Med. 1999; 25(7): 686-696.
13. Ferreira FL, et al. Serial evaluation of the SOFA score to predict outcome in critically ill patients.
JAMA 2001; 286 (14): 1754-1758.
14. Saeed M, et al. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): A public-access intensive care unit database. Crit Care Med. 2011; 39(5): 952-960.
15. Soares M, et al. Performance of six severity-of-illness scores in cancer patients requiring
admission to the intensive care unit: a prospective observational study. Crit Care 2004; 8(4):
R194-R203.
16. Melamed A, Sorvillo F. The burden of sepsis-associated mortality in the United States from 1999
to 2005: an analysis of multiple-cause-of-death data. Crit Care 2009; 13(1).
17. Angus, Derek C. MD, MPH and Tom van der Poll, MD, PhD. “Severe sepsis and septic shock,”
New England Journal of Medicine, vol. 369, pp.840-851, 2013.
18. “Sepsis Fact Sheet,” National Institute of General Medical Sciences. Nov. 2012. [Online].
Available: http://www.nigms.nih.gov/Education/Pages/factsheet_sepsis.aspx
19. Martin G. Sepsis, severe sepsis and septic shock: changes in incidence, pathogens, and outcomes.
Expert Rev Anti Infect Ther. 2012; 10(6): 701-706.
20. Martin G, Mannino DM, Eaton S, Moss M. “The epidemiology of sepsis in the United States
from 1979 through 2000,” New England Journal of Medicine, vol. 348, pp.1546-1554, 2003.
21. Angus DC, Linde-Zwirble WT, Lidicker J, Clermont G, Carcillo J, Pinsky MR. “Epidemiology of
severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care”.
Critical Care Medicine, vol. 29, pp.1303-1310, 2001.
22. Witten IH, Frank E. Data mining: practical machine learning tools and techniques. 2nd ed. San
Francisco: Morgan Kaufmann Publishers; 2005.
23. Rice, JA. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury: Brooks/Cole; 2007.
24. Mazumdar M, Glassman JR. Categorizing a prognostic variable: review of methods, code for
easy implementation and applications to decision-making about cancer treatments. Stat Med.
2000 Jan 15; 19(1): 113-32.
25. Williams B, Mandrekar JN, Mandrekar SJ, Cha SS, Furth AF. Finding optimal cutpoints for
continuous covariates with binary and time-to-event outcomes. Technical Report, Mayo Clinic,
Department of Health Sciences Research, 2006. Available:
http://www.mayo.edu/research/documents/biostat-79pdf/doc-10027230
26. Baum RL, Godt JW. Early warning of rainfall-induced shallow landslides and debris flows in the
USA. Landslides. 2010; 7:259-272.
27. Martina MLV, Todini E, Libralon A. A Bayesian decision approach to rainfall thresholds based
flood warning. Hydrol. Earth Syst. Sci. 2006; 10:413-426.
28. Sheth M, Welsch R, Markuzon N. A fully-automated algorithm for identifying optimal thresholds
in data. Working paper.
29. Sheth M, Celi L, Mark R, Welsch R, Markuzon N. Predicting mortality in critical care using
time-varying parameters and the univariate flagging algorithm. Working paper.
30. Eick, CF, Zeidat N, Zhao Z. Supervised clustering – algorithms and benefits. Proceedings of the
16th IEEE International Conference on Tools with Artificial Intelligence, 2004.
31. Kalil, A. Septic shock clinical presentation. Medscape. 20 Oct 2014. Available:
http://emedicine.medscape.com/article/168402-clinical. Accessed 16 Mar 2015.
32. Dellinger RP, et al. Surviving sepsis campaign: international guidelines for management of severe
sepsis and septic shock: 2012. Crit Care Med. 2013; 41(2): 580-637.
33. Friedman JH, Fisher NI. Bump hunting in high-dimensional data. Statistics and Computing.
1999; 9:123-143.
34. MedlinePlus from National Institutes of Health. 16 March 2015. Available:
http://www.nlm.nih.gov/medlineplus/medlineplus.html
35. Lichman M. Iris Data Set; 2013. Database: UCI Machine Learning Repository [Internet].
Accessed: http://archive.ics.uci.edu/ml/datasets/Iris
36. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936.
7(2): 179-188.
37. Kapouleas I, Weiss SM. An empirical comparison of pattern recognition, neural nets, and
machine learning classification methods. Readings in Machine Learning. 1990; 177-183.
38. Weiss, GM. Mining with rarity: a unifying framework. ACM SIGKDD Explorations. 2004; 6(1):
7-19.
39. Gordon, N. J., D.J. Salmond, A.F.M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian
state estimation. IEEE Proceedings, Vol 140 No 2, April 1993. Accessed May 16, 2014.
Available: http://www.ece.iastate.edu/~namrata/EE520/Gordonnovelapproach.pdf
40. Doucet A, Godsill S, Andrieu C. On sequential Monte Carlo sampling methods for Bayesian
filtering. Statistics and Computing. 2000; 10: 197-208. Accessed May 14, 2014. Available:
ftp://ftp.idsa.prd.fr/local/aspi/legland/ref/doucet00b.pdf
41. Helm JE, Lavieri MS, Van Oyen MP, Stein J, Musch D. “Dynamic Forecasting and Control Algorithms of Glaucoma Progression for Clinician Decision Support”, Operations Research (2nd round of review), 2012.
42. Rangel-Frausto, MS, Pittet D, Hwang T, Woolson RF, Wenzel RP. The dynamics of disease
progression in sepsis: Markov modeling describing the natural history and the likely impact of
effective antisepsis agents. Clinical Infectious Diseases. 1998 Jul; 27(1): 185-90.
43. Nachimuthu SK, Haug PJ. Early Detection of Sepsis in the Emergency Department using
Dynamic Bayesian Networks. AMIA Annual Symposium Proceedings. 2012; 653-662.
44. “Normal Hemodynamic Parameters”, LiDCO Group. Accessed 13 Dec 2014. Available:
http://www.lidco.com/clinical/hemodynamic.php
45. Hamilton JD. “Regime-Switching Models”, Palgrave Dictionary of Economics, 18 May 2005.
46. Diebold FX, Lee J, Weinbach GC. Regime Switching with Time Varying Transition
Probabilities. Nonstationary Time Series Analysis and Cointegration (Advanced Texts in
Econometrics). Oxford: Oxford University Press, p 283-302.
47. Sheth, MB, Mark R, Chahin A, Markuzon N. Protective effects of rheumatoid arthritis in septic
ICU patients. 2014 IEEE International Conference on Big Data, 27-30 Oct. 2014.
48. Elixhauser A, Steiner C, Harris DR, and Coffey R. Comorbidity measures for use with
administrative data. Med Care. 1998; 36:8-27.
49. Bale C, Kakrani AK, Dabadghao VS, Sharma ZD. Sequential organ failure assessment score as
prognostic marker in critically ill patients in a tertiary care intensive care unit. International
Journal of Medicine and Public Health. 2013; 3(3):155-158.
50. “Rheumatoid Arthritis,” Arthritis Foundation. 2014. [Online]. Available:
http://www.arthritis.org/conditions-treatments/disease-center/rheumatoid-arthritis/
Appendix A
Protective effects of rheumatoid arthritis in septic ICU patients
For the purposes of this thesis, we only considered disease-based subpopulations based on the primary
diagnosis. As a side analysis, however, we considered all patients with a primary diagnosis of sepsis and
analyzed secondary diagnosis at the time of admission. In some cases, we observed significant
discrepancies in mortality. One of the most striking results was the 30-day mortality rate for septic
patients with rheumatoid arthritis (RA), an auto-immune disorder. At 29.0%, it was 21 percentage points
lower than the 30-day mortality rate for all severe sepsis patients (defined as sepsis with persistent
hypotension), and the observed difference was statistically significant even after controlling for patient
demographics and disease severity.
This result is particularly interesting because RA is an auto-immune disorder. Auto-immune disorders are
a group of diseases that arise from an abnormal immune response of the body against substances and
tissues normally present in the body. The immunological responses are usually caused by antibody
production and T cell activation with intolerance to the host's own cells. The effect of autoimmunity has been speculated to carry a worse outcome in sepsis compared to normal hosts. However, recent research hints at a different behavior and perhaps a better outcome for patients with auto-immunity in cases of severe sepsis.
Given the considerable interest in this topic, we decided to do a side investigation into the relationship
between rheumatoid arthritis and sepsis mortality [47]. The results in this section are intended to stand
alone, and we define the sepsis population and the target outcome slightly differently than the other
results in the thesis.
Methods
The primary study population for this analysis consists of adult ICU patients with severe sepsis. As described in Section 2.3.1, we define sepsis using the ICD-9 code 038, which indicates septicemia. For this analysis, however, we further restricted our population to require markers of systemic shock. In particular, we require the presence of hypotension, defined as three consecutive mean arterial blood pressure (MAP) readings below 65 mmHg in a 30-minute period. Throughout this analysis, we will denote this sepsis definition as “038-H”.
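A minimal sketch of this episode criterion, assuming time-sorted MAP readings; the function name and inputs are illustrative.

```python
import numpy as np

def first_hypotensive_episode(times_min, map_mmhg):
    """Return the time of the first hypotensive episode under the 038-H
    criterion: three consecutive MAP readings below 65 mmHg within a
    30-minute window. times_min: reading times in minutes; map_mmhg: MAP
    values, both sorted by time."""
    low = np.asarray(map_mmhg) < 65
    t = np.asarray(times_min)
    for i in range(len(t) - 2):
        if low[i] and low[i + 1] and low[i + 2] and t[i + 2] - t[i] <= 30:
            return t[i]
    return None  # no qualifying episode
```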
All patients were required to have 24 hours of ICU data beyond their first hypotensive episode, and we
selected the last ICU stay meeting these criteria for each patient. We identified 1,302 patients in the
database meeting the 038-H criterion. This study includes a series of sensitivity analyses to determine the
impact of these inclusion criteria on the results and to identify the group of patients that benefits most from
an RA diagnosis. These sensitivity analyses include variations to the definition of severe sepsis and
hypotension, as well as the removal of the 24 hour data requirement.
Throughout this study, RA was defined using the set of ICD-9-CM diagnosis codes that correspond to the
RA co-morbidity flag in the Elixhauser index. The Elixhauser methodology was specifically developed for
use with administrative health data, making it uniquely suited for our purposes [48]. We also use ICD-9-CM diagnosis codes to identify subpopulations with other auto-immune disorders, such as multiple sclerosis and Crohn's disease, to determine whether these conditions show a protective effect in severe sepsis similar to that of RA.
For each patient in the study, we used the MIMIC II database to collect demographic information like
gender and age, along with SOFA score. SOFA includes information about the condition of a patient’s
respiratory, renal, and cardiovascular systems, among others, and has been found to be a strong predictor
of prognosis for septic ICU patients [49]. We also used MIMIC II to gather information on use of
corticosteroids, which can be used in the treatment of both RA and severe sepsis. The admission histories
from discharge summaries were manually checked for home use of prednisone prior to admission in the
ICU.
The primary outcome of interest in this study is 30 day mortality for ICU patients with severe sepsis.
Thirty-day mortality was measured beginning with the patient’s hospital discharge date and relies on death
information from the Social Security Death Index (SSDI).
We compared mortality rates for sepsis patients with and without RA using a chi-squared test. We also compared baseline characteristics such as gender, age, and SOFA score using a chi-squared test or Student's t-test (as appropriate) to determine whether the two groups of patients differed significantly across these dimensions.
To test whether the presence of RA was significantly predictive of sepsis mortality, we utilized a logistic
regression model [5]. Our first model controlled for age, gender, and SOFA score, in addition to RA.
Subsequent models introduced covariates to control for patients’ drug regimens, including chronic use of
prednisone and use of corticosteroids in the ICU.
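A minimal sketch of these two analyses using scipy and statsmodels; the variable names are illustrative and data loading is omitted.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2_contingency

def ra_mortality_analysis(died, ra, age, male, sofa):
    """Unadjusted chi-squared comparison of 30-day mortality by RA status,
    followed by the adjusted logistic regression (age, gender, SOFA, RA)."""
    died, ra = np.asarray(died), np.asarray(ra)
    # 2x2 contingency table: RA status vs. death
    table = np.array([[np.sum((ra == 1) & (died == 1)), np.sum((ra == 1) & (died == 0))],
                      [np.sum((ra == 0) & (died == 1)), np.sum((ra == 0) & (died == 0))]])
    _, p_chi2, _, _ = chi2_contingency(table)
    # Logistic regression with an intercept term
    X = sm.add_constant(np.column_stack([age, male, sofa, ra]))
    fit = sm.Logit(died, X).fit(disp=0)
    return p_chi2, fit.params, fit.pvalues
```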
Results
Septic patients with RA have significantly lower 30-day mortality rates
As described in the methods section, the primary patient cohort for this study consists of individuals with a
documented diagnosis of septicemia, hypotension as indicated by three MAP readings below 65 within 30
minutes, and 24 hours of data beyond the first hypotensive episode. This patient cohort contains 1,302
individuals and of these, 31 patients have RA.
Our data show significantly lowered mortality rates for the septic patients with RA, suggesting a protective
effect. The observed 30-day mortality rate for hypotensive, septic patients with RA is 29.0% as compared
to 50.6% for patients without RA. This difference is significant with a p-value of 0.016 (table 1).
Table 1: 30-day mortality and demographics for hypotensive, septic patients with and without RA

                           RA           No RA        p-value
N                          31           1,271        --
Mortality
  30-day                   29.0%        50.6%        0.016
Patient Characteristics
  Gender (% Male)          41.9%        55.5%        0.130
  Age (Mean ± SE)          65.1 ± 2.7   67.7 ± 0.5   0.358
  SOFA (Mean ± SE)         10.1 ± 0.9   10.1 ± 0.1   0.978
We considered various co-factors that could explain the observed effect. One possibility is that this
difference in the mortality rate is due to systematic differences in the septic populations with and without
RA. For example, women are more likely to suffer from RA [50], a fact that is clearly represented in our data: 41.9% of septic RA patients are male, which is 13.6 percentage points lower than among septic patients
without RA. We also observe that septic RA patients are slightly younger than their counterparts without
RA, with a mean age of 65.1 compared to 67.7, though this difference is not statistically significant (table
1).
To control for underlying differences between the two groups, we utilized a logistic regression model. The
model included variables for patient gender, age, and SOFA score at ICU admission. We also included an
indicator for whether the patient had RA. After adjusting for patient demographics and health status, we
find that the presence of RA is still significantly predictive of survival at 30 days with a p-value of 0.024
(table 2). This result is consistent with the conclusion that RA is protective in severe sepsis.
Table 2: Logistic regression analysis of sepsis mortality
Outcome: 30-Day Mortality

Independent Variable   B       SE     p-value   Exp(B)
Age                    0.02    0.00   0.000     1.02
Gender                 0.12    0.12   0.310     1.13
SOFA                   0.14    0.01   0.000     1.15
RA Flag                -0.95   0.42   0.024     0.39
Intercept              -2.89   0.32   0.000     0.06
Identifying related populations where RA is beneficial
The results in the previous section establish that 30-day mortality is significantly lower for a particular
group of severe sepsis patients with RA. To better understand this phenomenon, we conducted a series of
related analyses to determine the particular conditions under which the result holds and to identify the
groups of patients who most benefit from a diagnosis of RA.
First, we investigated whether our observed protective effect extends to septic patients with other auto-immune disorders such as Crohn's disease and multiple sclerosis. Next, we analyzed the role of sepsis severity in our results by rerunning our analysis for less critically ill populations, such as sepsis patients identified using the Angus criteria or patients with documented septicemia but no hypotension. Lastly, we analyzed the role of ICU length of stay by varying the amount of time that patients were required to remain in the ICU. The following results confirm that septic patients with auto-immune disorders have better 30-day
survival rates than the overall sepsis population. Further, the results suggest that RA is particularly
beneficial for patients with a certain level of sepsis severity, and that length of stay is an important factor
in this analysis.
Septic Patients with auto-immune disorders have significantly lower 30-day mortality rates
Using the severe sepsis definition from the previous section, we find that RA is not the only auto-immune
disease with an observed protective effect. Septic ICU patients with a variety of different auto-immune
disorders have significantly lowered 30-day mortality compared to severe sepsis patients overall. For
example, while the full septic population has a 30-day mortality rate of 50.1%, septic patients with
multiple sclerosis have a mortality rate of 25.0%, septic patients with ulcerative colitis have a mortality
rate of 26.7%, and septic patients with systemic lupus erythematosus (SLE) have a mortality rate of
12.5% (table 3).
Moreover, if we define a group of sepsis patients with any of the auto-immune conditions listed in Table
3, we find that 30-day mortality is 21 percentage points lower than the full sepsis population. This result
is significant at a 0.01 level even after controlling for patient demographics and disease severity (table 4).
These results suggest that our observed protective effect applies to a variety of auto-immune conditions,
supporting the theory that it is broadly related to immune modulation in sepsis as opposed to specifically
related to the physiology of RA.
Table 3: 30-day mortality for septic ICU patients with auto-immune disorders

Condition                       N       Death Rate
All Severe Sepsis               1,302   50%
All Auto-Immune                 65      29%
rheumatoid arthritis            15      27%
ulcerative colitis              15      33%
Crohn's disease                 13      46%
multiple sclerosis              12      25%
systemic lupus erythematosus    8       13%
myasthenia gravis               4       0%
ankylosing spondylitis          1       0%
psoriatic arthritis             0       --
Table 4: Logistic regression analysis of sepsis mortality, all auto-immune
Outcome: 30-Day Mortality

Independent Variable   B       SE     p-value   Exp(B)
Age                    0.02    0.00   0.000     1.02
Gender                 0.12    0.12   0.303     1.13
SOFA                   0.14    0.01   0.000     1.15
Auto-Immune Flag       -0.81   0.30   0.006     0.44
Intercept              -2.84   0.33   0.000     0.06
Level of sepsis severity and length of stay help determine when RA is protective
For this analysis, we defined severe sepsis as a documented diagnosis of septicemia (038 code) along with hypotension. In this section, we varied the patient selection criteria in order to identify the combination of patient conditions under which the RA diagnosis is most protective.
In Section 2.3.1, we discuss two sepsis definitions that are less stringent than the one that was selected for
this analysis. The first, the Angus criteria, is the most general, while the second, which requires a
documented diagnosis of septicemia but does not require hypotension, falls in the middle. For both of these
definitions, septic patients with RA have lower 30-day mortality than patients without RA; however, for
the Angus criteria, the difference is only three percentage points and, in both cases, the difference is not
statistically significant (table 5).
We conclude that a certain level of patient severity is necessary for RA to have a strong protective effect.
At the same time, the data suggest that once that level is achieved, our results are very stable to variations
in methodology. For example, if we make our sepsis definition more stringent by adding the Angus
criteria on top of the 038-H definition, there is a 21 percentage point difference in 30-day mortality for
patients with RA and without RA, which is statistically significant. We also see a significant protective
effect if we define hypotension using a shorter or longer measurement window or if we define hypotension
based on the use of vasopressors as opposed to low blood pressure (table 5). In all of these cases, septic
patients with RA have a 30-day mortality rate that is at least 16 percentage points lower than septic patients
without RA and the result is significant even after controlling for patient demographics and disease
severity.
Table 5: P-values for chi-squared and logistic regression analysis of sepsis mortality, sensitivity analysis
Outcome: 30-Day Mortality

Sepsis and Hypotension Definition    RA    No RA   p-value (χ2)   p-value (logistic)
Angus                                30%   33%     0.39           0.77
038, No Hypotension Required         32%   40%     0.12           0.21
038-H + Angus                        31%   52%     0.03           0.03
038-H, 3 MAP <= 65 (20 Min)          28%   54%     0.01           0.01
038-H, 3 MAP <= 65 (60 Min)          31%   49%     0.02           0.04
038-H, Vasopressors                  32%   48%     0.03           0.04
We also find that a patient’s length of stay in the ICU is an important factor in determining whether or not
RA is beneficial during severe sepsis. All of the analyses in the previous discussion require that patients
have at least 24 hours of data beyond their first hypotensive episode or dose of vasopressors (depending on
the analysis) or, in the case that hypotension is not required, at least 24 hours in the ICU.
Removing this requirement dampens the protective effect of RA. For the original study inclusion criteria,
the difference in 30-day mortality for patients with and without RA decreases from 21.6 percentage points
to 14.0 percentage points once the 24 hour data requirement is removed. While this is still a statistically
significant difference at the 0.05 level after controlling for patient demographics, variations on the original
criteria, including a longer measurement window for hypotension or defining hypotension based on
vasopressors, are no longer statistically significant without the 24 hour requirement (table 6).
Table 6: P-values for chi-squared and logistic regression analysis of sepsis mortality, no 24 hour requirement
Outcome: 30-Day Mortality

Sepsis and Hypotension Definition    RA    No RA   p-value (χ2)   p-value (logistic)
038-H                                41%   55%     0.08           0.05
038-H, 3 MAP <= 65 (20 Min)          44%   59%     0.08           0.05
038-H, 3 MAP <= 65 (60 Min)          40%   52%     0.08           0.11
038-H, Vasopressors                  39%   51%     0.10           0.08
Consistent with earlier results, these findings suggest that the level of sepsis severity is important in
determining whether RA is protective. Patients who do not remain in the ICU for 24 hours are either
healthy enough to be discharged during that period or so critical that they do not survive. Previous results
have already established that RA is beneficial in more critically ill sepsis patients; these results could
suggest that there is also an important upper limit on patient severity.
Finally, there is evidence that if we consider septic patients who stayed in the ICU for even longer than 24
hours, RA has a strong protective effect even in the absence of hypotension. As previously discussed, if we
identify sepsis patients using a documented diagnosis of septicemia but without requiring hypotension,
there is an eight percentage point difference in mortality rates for patients with and without RA; the
difference, however, is not statistically significant (table 5). If we modify the patient selection criteria to
include only patients who remained in the ICU for at least 72 hours, the difference in mortality increases to
20 percentage points and becomes statistically significant. These results suggest that septic patients with
97
ICU stays over 3 days are another group for whom RA is particularly beneficial, and further research
should explore the implications of this finding.
Use of corticosteroids does not explain protective effect
One possible explanation for the protective effect observed in RA and other autoimmune disorder patients is the medication taken for these conditions, particularly corticosteroids. We observe that patients with RA are eight times more likely to use prednisone at home (i.e., chronically) than patients without RA and are also more likely to receive steroids once admitted to the hospital (table 7). Given that corticosteroids are anti-inflammatory drugs, it is possible that the medication itself, rather than RA, is protective.
Table 7: Corticosteroid usage rates for septic patients with and without RA

Usage | RA | No RA | p-value
N | 31 | 1,271 | --
Treatment (Home): Home Prednisone | 35.5% | 4.3% | 0.000
Treatment (Hospital): Any Steroids | 64.5% | 37.7% | 0.002
Treatment (Hospital): Prednisone | 54.8% | 14.7% | 0.000
Treatment (Hospital): Hydrocortisone | 51.6% | 30.8% | 0.012
The data, however, does not support this conclusion. Stratifying 30-day mortality by the presence of RA
and chronic prednisone use, we see that the difference in mortality between septic RA patients and non-RA
patients is more than 20 percentage points regardless of whether the patient uses prednisone chronically
(table 8). Further, adding home prednisone use to our logistic regression model does not impact the
significance of RA in predicting 30-day mortality; presence of RA remains a significant independent
variable with a p-value of 0.03. Conversely, use of home prednisone is not a significant predictor of 30-day
mortality after controlling for RA and other patient demographics (table 9).
Table 8: 30-day mortality by chronic prednisone use and RA

Treatment | RA (#) | RA (Death Rate) | No RA (#) | No RA (Death Rate)
Home Prednisone | 11 | 27.3% | 54 | 48.1%
No Home Prednisone | 20 | 30.0% | 1,212 | 50.6%
Table 9: Logistic regression analysis of sepsis mortality, including home prednisone use

Outcome: 30-Day Mortality

Independent Variable | B | SE | p-value | Exp(B)
Age | 0.02 | 0.00 | 0.00 | 1.02
Gender | 0.12 | 0.12 | 0.30 | 1.13
SOFA | 0.14 | 0.01 | 0.00 | 1.15
RA Flag | -0.94 | 0.43 | 0.03 | 0.39
Home Prednisone | -0.01 | 0.28 | 0.97 | 0.99
Intercept | -2.88 | 0.32 | 0.00 | 0.06
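The columns of table 9 correspond one-to-one to the output of a standard logistic fit: B is the coefficient, SE its standard error, and Exp(B) the odds ratio exp(B). A hedged sketch of how such a table would be assembled, using synthetic data and illustrative variable names rather than the study's:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; names are hypothetical.
rng = np.random.default_rng(1)
n = 1302
df = pd.DataFrame({
    "age": rng.normal(65, 15, n),
    "gender": rng.binomial(1, 0.5, n),
    "sofa": rng.poisson(7, n),
    "ra": rng.binomial(1, 0.025, n),
    "home_prednisone": rng.binomial(1, 0.05, n),
})
eta = -2.9 + 0.02 * df["age"] + 0.14 * df["sofa"] - 0.9 * df["ra"]
df["died_30d"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

fit = smf.logit("died_30d ~ age + gender + sofa + ra + home_prednisone",
                data=df).fit(disp=0)

# The four columns of table 9: coefficient (B), standard error (SE),
# p-value, and odds ratio Exp(B) = exp(coefficient).
summary = pd.DataFrame({
    "B": fit.params, "SE": fit.bse,
    "p-value": fit.pvalues, "Exp(B)": np.exp(fit.params),
})
print(summary.round(2))
```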
Discussion
This study examines the survival outcome after treatment of severe sepsis in patients with RA as compared to the rest of the population. It clearly demonstrates a statistically significant 30-day mortality benefit in a selected subgroup of hypotensive patients with RA, as well as in patients with other autoimmune diseases. The result is robust to various definitions of hypotension, including septic shock requiring vasopressors (table 5).
Although the sample size was relatively small, the study demonstrated a strong, statistically significant effect that was confirmed within other autoimmune diseases (table 3). Looking at a variety of autoimmune disorders supports the theory of a shared advantage, due either to the actual pathophysiology of those diseases or to the common treatment modalities used.
Patients’ corticosteroid drug regimens, and in particular chronic prednisone use, did not show any difference in 30-day mortality (table 8). However, the study has not explored whether the difference seen in mortality rates is due to the effect of other medications. Increasing numbers of patients with RA are treated with immune modulators, also known as disease-modifying anti-rheumatic drugs (DMARDs), which could be contributing to the survival benefit seen in our analysis. Patients with RA and autoimmune disease who are treated with DMARDs are under immune modulation prior to developing sepsis. Theoretically, given the role immune modulation plays in sepsis, being on DMARDs could lead to favorable effects and lessened organ damage. If immune modulation in RA does influence 30-day mortality as demonstrated in this study, then patients taking DMARDs should demonstrate a clear survival benefit.
Further analysis will look at the different chronic treatments used in autoimmune diseases and whether they show any improved outcome in sepsis. Looking at the rate at which organ failure occurs in the two groups might also help us understand why patients with RA and autoimmune diseases have lower mortality, and whether their suppressed immunity actually helped them survive by protecting their organs from the collateral damage caused by the intense immune response of severe sepsis.
Conclusion
Our study found that septic patients with RA and other autoimmune diseases have significantly lower 30-day mortality rates than patients without these conditions, even after controlling for demographics and disease severity. Moreover, we identified groups of sepsis patients in the ICU for whom the RA comorbidity is particularly beneficial and showed that both sepsis severity and ICU length of stay are important in identifying those subpopulations.
Patients’ corticosteroid drug regimens were one possible explanation for the protective effect observed in this study; however, we were able to show that corticosteroids alone, including chronic prednisone use, do not account for the lower 30-day mortality rates among septic RA patients. We put forth an alternative theory: immune suppression or modulation prior to developing sepsis may actually protect the host, in contrast to the current belief, difficult to prove in various studies, that it increases the risk of severe infections.
The results of this study contribute to the current understanding of the relationship between sepsis and the immune system. Our findings also illustrate the power of big data in clinical research: without access to a large observational database and the tools to analyze it, we would have been unable to identify a sufficient number of septic RA patients to conduct our analysis. Even with data for over 30,000 ICU stays, the final number of RA patients in this study was 31. Future research should try to replicate this result on other large observational datasets in order to increase the sample size and provide further validation.
Appendix B
Table B.1: All MIMIC II variables used for analysis, p-values by day
A hyphen (-) marks day columns for which a to-date feature is not yet computed.

Variable | Day 1 | Day 2 | Day 3 | Day 4

Stay
ICUSTAY_ADMIT_AGE | 0.015 | 0.015 | 0.015 | 0.015
WEIGHT_FIRST | 0.666 | 0.666 | 0.666 | 0.666
SAPSI_FIRST | 0.040 | 0.040 | 0.040 | 0.040
SOFA_FIRST | 0.041 | 0.041 | 0.041 | 0.041
NUM_ELIX_GROUPS | 0.367 | 0.367 | 0.367 | 0.367
GENDER | 0.297 | 0.297 | 0.297 | 0.297
SUBJECT_ICUSTAY_SEQ | 0.207 | 0.207 | 0.207 | 0.207
ICUSTAY_FIRST_SERVICE | 0.549 | 0.549 | 0.549 | 0.549
ETHNICITY | 0.313 | 0.313 | 0.313 | 0.313

Day
DNR | 0.447 | 0.243 | 0.090 | 0.013
VENTILATION | 0.581 | 0.414 | 0.498 | 0.573
PRESSORS | 0.213 | 0.665 | 0.728 | 0.285
XRAY.DAY | 0.873 | 0.370 | 0.867 | 0.445
MRI.DAY | 0.849 | 0.601 | 0.862 | 0.888
CT_SCAN.DAY | 0.309 | 0.715 | 0.703 | 0.468
ECHO.DAY | 0.611 | 0.176 | 0.010 | 0.000
SAPS.DAY | 0.034 | 0.022 | 0.004 | 0.001
SOFA.DAY | 0.041 | 0.009 | 0.000 | 0.000
URINE.DAY | 0.141 | 0.001 | 0.000 | 0.000
UNITS_IN.DAY | 0.142 | 0.619 | 0.524 | 0.458
UNITS_OUT.DAY | 0.063 | 0.075 | 0.001 | 0.000
FLUID_BALANCE.DAY | 0.354 | 0.518 | 0.007 | 0.001
NUM.CREATINE | 0.405 | 0.323 | 0.406 | 0.148
MEAN.CREATINE | 0.199 | 0.178 | 0.088 | 0.052
NUM.SODIUM | 0.401 | 0.432 | 0.407 | 0.153
MEAN.SODIUM | 0.465 | 0.389 | 0.694 | 0.122
NUM.BUN | 0.416 | 0.290 | 0.411 | 0.158
MEAN.BUN | 0.001 | 0.000 | 0.000 | 0.000
NUM.CHLORIDE | 0.455 | 0.350 | 0.422 | 0.150
MEAN.CHLORIDE | 0.231 | 0.182 | 0.780 | 0.856
NUM.BICARB | 0.461 | 0.408 | 0.327 | 0.134
MEAN.BICARB | 0.155 | 0.036 | 0.000 | 0.000
NUM.GLUCOSE | 0.443 | 0.596 | 0.587 | 0.083
MEAN.GLUCOSE | 0.816 | 0.176 | 0.117 | 0.627
NUM.MAGNES | 0.444 | 0.464 | 0.278 | 0.148
MEAN.MAGNES | 0.496 | 0.342 | 0.316 | 0.224
NUM.CALCIUM | 0.631 | 0.452 | 0.119 | 0.149
MEAN.CALCIUM | 0.491 | 0.560 | 0.684 | 0.570
NUM.PHOS | 0.544 | 0.330 | 0.232 | 0.120
MEAN.PHOS | 0.010 | 0.001 | 0.005 | 0.008
NUM.HGT | 0.214 | 0.027 | 0.026 | 0.032
MEAN.HGT | 0.419 | 0.267 | 0.509 | 0.309
NUM.HGB | 0.349 | 0.303 | 0.012 | 0.043
MEAN.HGB | 0.171 | 0.491 | 0.705 | 0.388
NUM.WBC | 0.361 | 0.317 | 0.019 | 0.045
MEAN.WBC | 0.725 | 0.371 | 0.482 | 0.171
NUM.PLATLET | 0.249 | 0.140 | 0.012 | 0.039
MEAN.PLATLET | 0.369 | 0.277 | 0.026 | 0.001

Hourly
MINHR | 0.549 | 0.041 | 0.031 | 0.005
MAXHR | 0.336 | 0.347 | 0.269 | 0.357
AVGHR | 0.524 | 0.074 | 0.048 | 0.051
SDHR | 0.289 | 0.503 | 0.620 | 0.061
MINRR | 0.058 | 0.646 | 0.688 | 0.465
MAXRR | 0.552 | 0.664 | 0.146 | 0.627
AVGRR | 0.096 | 0.563 | 0.531 | 0.754
SDRR | 0.235 | 0.413 | 0.171 | 0.205
MINTEMP | 0.023 | 0.354 | 0.407 | 0.007
MAXTEMP | 0.003 | 0.014 | 0.017 | 0.006
AVGTEMP | 0.001 | 0.018 | 0.013 | 0.002
SDTEMP | 0.494 | 0.382 | 0.290 | 0.699
MINMAP | 0.054 | 0.043 | 0.004 | 0.001
MAXMAP | 0.239 | 0.245 | 0.239 | 0.025
AVGMAP | 0.009 | 0.001 | 0.001 | 0.000
SDMAP | 0.568 | 0.475 | 0.738 | 0.434
MINSI | 0.449 | 0.030 | 0.028 | 0.000
MAXSI | 0.395 | 0.269 | 0.109 | 0.000
AVGSI | 0.393 | 0.188 | 0.002 | 0.000
SDSI | 0.520 | 0.291 | 0.159 | 0.058

Trend (To Date)
VENTILATION.TOTAL.DAYS | - | 0.585 | 0.580 | 0.188
VENT.EVER | - | 0.433 | 0.531 | 0.757
PSRS.TOTAL.DAYS | - | 0.220 | 0.053 | 0.016
PSRS.EVER | - | 0.489 | 0.469 | 0.399
XRAY.TOTAL.DAYS | - | 0.136 | 0.075 | 0.284
XRAY.EVER | - | 0.108 | 0.081 | 0.294
MRI.TOTAL.DAYS | - | 0.384 | 0.301 | 0.503
MRI.EVER | - | 0.454 | 0.363 | 0.590
CT_SCAN.TOTAL.DAYS | - | 0.678 | 0.653 | 0.737
CT_SCAN.EVER | - | 0.589 | 0.781 | 0.749
ECHO.TOTAL.DAYS | - | 0.425 | 0.439 | 0.422
ECHO.EVER | - | 0.599 | 0.658 | 0.688
SAPS.DAY_MIN | - | 0.021 | 0.013 | 0.005
SAPS.DAY_MEAN | - | 0.013 | 0.006 | 0.002
SAPS.DAY_MAX | - | 0.023 | 0.012 | 0.003
SAPS.DAY_RANGE | - | 0.414 | 0.535 | 0.442
SAPS.DAY_TREND1 | - | 0.566 | 0.486 | 0.342
SAPS.DAY_TREND2 | - | - | 0.633 | 0.292
SAPS.DAY_TREND3 | - | - | - | 0.178
SOFA.DAY_MIN | - | 0.016 | 0.001 | 0.000
SOFA.DAY_MEAN | - | 0.013 | 0.002 | 0.000
SOFA.DAY_MAX | - | 0.018 | 0.007 | 0.001
SOFA.DAY_RANGE | - | 0.687 | 0.506 | 0.708
SOFA.DAY_TREND1 | - | 0.280 | 0.202 | 0.160
SOFA.DAY_TREND2 | - | - | 0.068 | 0.004
SOFA.DAY_TREND3 | - | - | - | 0.013
URINE.DAY_MIN | - | 0.085 | 0.020 | 0.001
URINE.DAY_MEAN | - | 0.008 | 0.000 | 0.000
URINE.DAY_MAX | - | 0.002 | 0.000 | 0.000
URINE.DAY_RANGE | - | 0.004 | 0.001 | 0.000
URINE.DAY_TREND1 | - | 0.422 | 0.793 | 0.175
URINE.DAY_TREND2 | - | - | 0.284 | 0.521
URINE.DAY_TREND3 | - | - | - | 0.325
UNITS_IN.DAY_MIN | - | 0.604 | 0.480 | 0.531
UNITS_IN.DAY_MEAN | - | 0.188 | 0.407 | 0.488
UNITS_IN.DAY_MAX | - | 0.103 | 0.144 | 0.157
UNITS_IN.DAY_RANGE | - | 0.082 | 0.040 | 0.068
UNITS_IN.DAY_TREND1 | - | 0.343 | 0.467 | 0.581
UNITS_IN.DAY_TREND2 | - | - | 0.338 | 0.570
UNITS_IN.DAY_TREND3 | - | - | - | 0.295
UNITS_OUT.DAY_MIN | - | 0.069 | 0.016 | 0.002
UNITS_OUT.DAY_MEAN | - | 0.033 | 0.002 | 0.000
UNITS_OUT.DAY_MAX | - | 0.045 | 0.004 | 0.001
UNITS_OUT.DAY_RANGE | - | 0.217 | 0.028 | 0.013
UNITS_OUT.DAY_TREND1 | - | 0.567 | 0.110 | 0.232
UNITS_OUT.DAY_TREND2 | - | - | 0.462 | 0.197
UNIT_OUT.DAY_TREND3 | - | - | - | 0.471
FLUID_BAL.DAY_MIN | - | 0.306 | 0.007 | 0.001
FLUID_BAL.DAY_MEAN | - | 0.627 | 0.317 | 0.178
FLUID_BAL.DAY_MAX | - | 0.322 | 0.463 | 0.510
FLUID_BAL.DAY_RANGE | - | 0.059 | 0.025 | 0.024
FLUID_BAL.DAY_TREND1 | - | 0.493 | 0.261 | 0.407
FLUID_BAL.DAY_TREND2 | - | - | 0.381 | 0.190
FLUID_BAL.DAY_TREND3 | - | - | - | 0.264
CREATINE_MIN | - | 0.128 | 0.094 | 0.071
CREATINE_MEAN | - | 0.183 | 0.149 | 0.109
CREATINE_MAX | - | 0.247 | 0.243 | 0.173
CREATINE_RANGE | - | 0.589 | 0.589 | 0.680
CREATINE_TREND1 | - | 0.487 | 0.204 | 0.057
CREATINE_TREND2 | - | - | 0.320 | 0.012
CREATINE_TREND3 | - | - | - | 0.118
SODIUM_MIN | - | 0.388 | 0.416 | 0.461
SODIUM_MEAN | - | 0.406 | 0.428 | 0.311
SODIUM_MAX | - | 0.451 | 0.432 | 0.202
SODIUM_RANGE | - | 0.639 | 0.731 | 0.389
SODIUM_TREND1 | - | 0.587 | 0.472 | 0.155
SODIUM_TREND2 | - | - | 0.552 | 0.548
SODIUM_TREND3 | - | - | - | 0.650
BUN_MIN | - | 0.000 | 0.000 | 0.000
BUN_MEAN | - | 0.000 | 0.000 | 0.000
BUN_MAX | - | 0.000 | 0.000 | 0.000
BUN_RANGE | - | 0.481 | 0.306 | 0.152
BUN_TREND1 | - | 0.273 | 0.190 | 0.363
BUN_TREND2 | - | - | 0.159 | 0.128
BUN_TREND3 | - | - | - | 0.086
CHLORIDE_MIN | - | 0.174 | 0.282 | 0.475
CHLORIDE_MEAN | - | 0.195 | 0.319 | 0.444
CHLORIDE_MAX | - | 0.251 | 0.350 | 0.398
CHLORIDE_RANGE | - | 0.589 | 0.559 | 0.627
CHLORIDE_TREND1 | - | 0.787 | 0.067 | 0.735
CHLORIDE_TREND2 | - | - | 0.284 | 0.049
CHLORIDE_TREND3 | - | - | - | 0.167
BICARB_MIN | - | 0.047 | 0.015 | 0.004
BICARB_MEAN | - | 0.059 | 0.013 | 0.001
BICARB_MAX | - | 0.099 | 0.019 | 0.001
BICARB_RANGE | - | 0.378 | 0.609 | 0.457
BICARB_TREND1 | - | 0.668 | 0.097 | 0.007
BICARB_TREND2 | - | - | 0.272 | 0.001
BICARB_TREND3 | - | - | - | 0.022
GLUCOSE_MIN | - | 0.408 | 0.185 | 0.248
GLUCOSE_MEAN | - | 0.329 | 0.133 | 0.162
GLUCOSE_MAX | - | 0.382 | 0.230 | 0.200
GLUCOSE_RANGE | - | 0.538 | 0.430 | 0.391
GLUCOSE_TREND1 | - | 0.294 | 0.632 | 0.490
GLUCOSE_TREND2 | - | - | 0.363 | 0.443
GLUCOSE_TREND3 | - | - | - | 0.632
MAGNES_MIN | - | 0.443 | 0.318 | 0.233
MAGNES_MEAN | - | 0.396 | 0.347 | 0.258
MAGNES_MAX | - | 0.430 | 0.797 | 0.741
MAGNES_RANGE | - | 0.606 | 0.348 | 0.331
MAGNES_TREND1 | - | 0.678 | 0.367 | 0.600
MAGNES_TREND2 | - | - | 0.655 | 0.389
MAGNES_TREND3 | - | - | - | 0.708
CALCIUM_MIN | - | 0.553 | 0.559 | 0.536
CALCIUM_MEAN | - | 0.573 | 0.646 | 0.650
CALCIUM_MAX | - | 0.627 | 0.709 | 0.614
CALCIUM_RANGE | - | 0.682 | 0.473 | 0.646
CALCIUM_TREND1 | - | 0.439 | 0.502 | 0.534
CALCIUM_TREND2 | - | - | 0.654 | 0.390
CALCIUM_TREND3 | - | - | - | 0.359
PHOS_MIN | - | 0.000 | 0.000 | 0.000
PHOS_MEAN | - | 0.001 | 0.001 | 0.001
PHOS_MAX | - | 0.008 | 0.011 | 0.007
PHOS_RANGE | - | 0.332 | 0.578 | 0.708
PHOS_TREND1 | - | 0.392 | 0.219 | 0.294
PHOS_TREND2 | - | - | 0.677 | 0.238
PHOS_TREND3 | - | - | - | 0.573
HGT_MIN | - | 0.613 | 0.628 | 0.709
HGT_MEAN | - | 0.576 | 0.609 | 0.606
HGT_MAX | - | 0.575 | 0.652 | 0.577
HGT_RANGE | - | 0.431 | 0.634 | 0.698
HGT_TREND1 | - | 0.007 | 0.391 | 0.647
HGT_TREND2 | - | - | 0.118 | 0.632
HGT_TREND3 | - | - | - | 0.077
HGB_MIN | - | 0.682 | 0.642 | 0.794
HGB_MEAN | - | 0.565 | 0.695 | 0.770
HGB_MAX | - | 0.498 | 0.670 | 0.698
HGB_RANGE | - | 0.604 | 0.749 | 0.826
HGB_TREND1 | - | 0.012 | 0.620 | 0.691
HGB_TREND2 | - | - | 0.072 | 0.722
HGB_TREND3 | - | - | - | 0.045
WBC_MIN | - | 0.436 | 0.412 | 0.202
WBC_MEAN | - | 0.558 | 0.518 | 0.360
WBC_MAX | - | 0.592 | 0.568 | 0.518
WBC_RANGE | - | 0.537 | 0.592 | 0.552
WBC_TREND1 | - | 0.326 | 0.597 | 0.078
WBC_TREND2 | - | - | 0.348 | 0.381
WBC_TREND3 | - | - | - | 0.269
PLATLET_MIN | - | 0.243 | 0.071 | 0.015
PLATLET_MEAN | - | 0.293 | 0.152 | 0.061
PLATLET_MAX | - | 0.352 | 0.254 | 0.143
PLATLET_RANGE | - | 0.799 | 0.606 | 0.635
PLATLET_TREND1 | - | 0.650 | 0.045 | 0.030
PLATLET_TREND2 | - | - | 0.139 | 0.005
PLATLET_TREND3 | - | - | - | 0.007
HR_MIN | - | 0.226 | 0.090 | 0.024
HR_MEAN | - | 0.480 | 0.204 | 0.113
HR_MAX | - | 0.540 | 0.616 | 0.676
HR_RANGE | - | 0.098 | 0.090 | 0.093
HR_TREND1 | - | 0.002 | 0.596 | 0.734
HR_TREND2 | - | - | 0.013 | 0.503
HR_TREND3 | - | - | - | 0.038
RR_MIN | - | 0.178 | 0.420 | 0.811
RR_MEAN | - | 0.294 | 0.297 | 0.454
RR_MAX | - | 0.689 | 0.566 | 0.737
RR_RANGE | - | 0.376 | 0.694 | 0.812
RR_TREND1 | - | 0.316 | 0.519 | 0.403
RR_TREND2 | - | - | 0.584 | 0.592
RR_TREND3 | - | - | - | 0.293
TEMP_MIN | - | 0.181 | 0.432 | 0.325
TEMP_MEAN | - | 0.001 | 0.001 | 0.000
TEMP_MAX | - | 0.007 | 0.005 | 0.003
TEMP_RANGE | - | 0.593 | 0.596 | 0.627
TEMP_TREND1 | - | 0.139 | 0.804 | 0.300
TEMP_TREND2 | - | - | 0.191 | 0.419
TEMP_TREND3 | - | - | - | 0.570
MAP_MIN | - | 0.124 | 0.096 | 0.052
MAP_MEAN | - | 0.024 | 0.004 | 0.000
MAP_MAX | - | 0.312 | 0.350 | 0.378
MAP_RANGE | - | 0.641 | 0.671 | 0.602
MAP_TREND1 | - | 0.515 | 0.441 | 0.200
MAP_TREND2 | - | - | 0.430 | 0.094
MAP_TREND3 | - | - | - | 0.102
SI_MIN | - | 0.097 | 0.059 | 0.004
SI_MEAN | - | 0.155 | 0.018 | 0.002
SI_MAX | - | 0.280 | 0.414 | 0.423
SI_RANGE | - | 0.366 | 0.414 | 0.409
SI_TREND1 | - | 0.454 | 0.336 | 0.043
SI_TREND2 | - | - | 0.032 | 0.073
SI_TREND3 | - | - | - | 0.004
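The layout of table B.1 implies one univariate test per variable per day. As an illustration only, a screening loop of that shape, assuming a Welch t-test of survivors versus non-survivors for continuous features (the exact tests used in the thesis are defined in the main text):

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

# Hypothetical per-day feature matrices keyed by day, plus an outcome vector;
# names mimic table B.1 but the data is synthetic.
days = {d: pd.DataFrame(rng.normal(size=(300, 3)),
                        columns=["SOFA.DAY", "URINE.DAY", "MEAN.BUN"])
        for d in (1, 2, 3, 4)}
died = rng.binomial(1, 0.3, 300).astype(bool)

rows = {}
for day, feats in days.items():
    for col in feats.columns:
        # Welch t-test: survivors vs. non-survivors on this day's values.
        _, p = ttest_ind(feats.loc[died, col], feats.loc[~died, col],
                         equal_var=False, nan_policy="omit")
        rows.setdefault(col, {})[f"Day {day}"] = round(p, 3)

print(pd.DataFrame(rows).T)   # one row per variable, one column per day
```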
Table B.2: All significant UFA thresholds, adult sepsis patients
Ranked by absolute z-statistic

Variable | Threshold | N | % Died
SOFA | More than 12.0 | 74 | 68.9%
AVG.SAPS | More than 19.4 | 84 | 61.9%
SAPS | More than 18.1 | 73 | 61.6%
MAX.URINE | Less than 999.5 | 104 | 60.6%
AVG.URINE | Less than 516.5 | 81 | 65.4%
AVG.SOFA | More than 12.7 | 84 | 63.1%
MIN.SOFA | More than 10.1 | 93 | 60.2%
URINE | Less than 944.3 | 152 | 57.9%
MAX.SOFA | More than 15.2 | 66 | 65.2%
MAX.SAPS | More than 24.2 | 53 | 66.0%
BICARB | Less than 17.7 | 75 | 65.3%
PLATELET | Less than 85.3 | 96 | 57.3%
MAX.UNITS_OUT | Less than 1017.8 | 56 | 66.1%
AVG.PLATELET | Less than 109.0 | 106 | 54.7%
MIN.PLATELET | Less than 81.7 | 109 | 55.0%
MINTEMP | Less than 35.5 | 76 | 59.2%
AVG.UNITS_OUT | Less than 806.2 | 77 | 59.7%
SOFA_FIRST | More than 18.1 | 12 | 100.0%
AVG.BICARB | Less than 17.6 | 100 | 57.0%
MEAN.PHOS | More than 4.5 | 93 | 52.7%
MIN.SAPS | More than 15.2 | 117 | 51.3%
MIN.PHOS | More than 4.3 | 58 | 62.1%
MAX.PLATELET | Less than 116.4 | 79 | 55.7%
UNITS_OUT | Less than 927.7 | 111 | 55.0%
MAXSI | More than 1.6 | 14 | 92.9%
AVG.SODIUM | Less than 134.5 | 59 | 57.6%
MAX.AVGSI.TD | More than 1.6 | 71 | 57.7%
MIN.FLUID_BALANCE.DAY | More than 1,634.9 | 75 | 57.3%
AVG.AVGMAP.TD | Less than 66.5 | 62 | 59.7%
SAPSI_FIRST | More than 28.1 | 15 | 86.7%
MAX.MEAN.BICARB | Less than 19.8 | 100 | 56.0%
AVG.MEAN.PHOS | More than 4.7 | 96 | 54.2%
RNG.URINE.DAY | Less than 579.9 | 106 | 54.7%
AVG.AVGTEMP.TD | Less than 36.1 | 50 | 60.0%
MAX.MEAN.SODIUM | Less than 135.0 | 33 | 66.7%
AVGSI | More than 1.1 | 20 | 80.0%
MIN.MEAN.WBC | More than 25.8 | 16 | 81.3%
AVG.AVGSI.TD | More than 0.9 | 124 | 50.8%
MINSI | More than 0.7 | 113 | 52.2%
AVGMAP | Less than 67.4 | 86 | 55.8%
AVGTEMP | Less than 36.0 | 57 | 57.9%
MIN.URINE.DAY | Less than 123.0 | 91 | 54.9%
AVGHR | More than 110.3 | 42 | 64.3%
MIN.MEAN.SODIUM | Less than 126.5 | 7 | 100.0%
MEAN.BICARB | More than 26.1 | 113 | 8.8%
AVG.MEAN.CHLORIDE | Less than 98.7 | 29 | 62.1%
MAX.AVGTEMP.TD | Less than 36.8 | 10 | 90.0%
MIN.MEAN.BICARB | Less than 13.6 | 57 | 57.9%
MIN.MEAN.BUN | More than 64.9 | 48 | 60.4%
TREND3.AVGTEMP | Less than 1.0 | 67 | 55.2%
MAX.AVGMAP.TD | Less than 83.9 | 24 | 70.8%
AVG.URINE.DAY | More than 1,690.3 | 178 | 11.2%
RNG.UNITS_OUT.DAY | Less than 1,029.5 | 142 | 46.5%
MEAN.CREATINE | Less than 0.9 | 190 | 14.7%
DNR | More than 0.0 | 93 | 48.4%
AVG.MEAN.WBC | More than 31.7 | 18 | 72.2%
MEAN.BUN | More than 57.4 | 121 | 50.4%
MEAN.BUN | Less than 22.4 | 159 | 12.6%
MEAN.SODIUM | Less than 134.9 | 59 | 55.9%
TREND1.AVGSI | Less than 0.8 | 140 | 16.4%
MAX.MEAN.BUN | More than 92.4 | 48 | 60.4%
MAX.MEAN.CHLORIDE | Less than 103.8 | 68 | 48.5%
MAX.MEAN.PHOS | More than 5.4 | 104 | 48.1%
AVG.MEAN.BUN | Less than 21.4 | 130 | 13.8%
MEAN.WBC | More than 36.4 | 15 | 73.3%
MIN.AVGSI.TD | More than 0.6 | 121 | 47.9%
MAXMAP | More than 118.6 | 87 | 11.5%
URINE.DAY | More than 1,890.3 | 190 | 13.2%
MINMAP | Less than 52.1 | 73 | 53.4%
MAXTEMP | Less than 36.6 | 80 | 50.0%
AVG.MEAN.BUN | More than 64.9 | 84 | 53.6%
FLUID_BALANCE.DAY | Less than -527.2 | 121 | 14.0%
AVGRR | Less than 13.5 | 33 | 57.6%
MINHR | More than 101.6 | 26 | 65.4%
MIN.AVGTEMP.TD | Less than 35.3 | 159 | 44.0%
MIN.MEAN.BUN | Less than 16.6 | 134 | 13.4%
MAX.MEAN.BUN | Less than 27.4 | 139 | 15.1%
RNG.MEAN.HGB | Less than 0.4 | 18 | 66.7%
TREND1.AVGMAP | More than 1.2 | 95 | 16.8%
RNG.URINE.DAY | More than 1,439.0 | 228 | 14.9%
SDSI | More than 0.2 | 14 | 71.4%
RNG.AVGHR.TD | Less than 28.8 | 45 | 51.1%
MAX.URINE.DAY | More than 2,727.1 | 176 | 12.5%
SOFA.DAY | Less than 4.1 | 129 | 11.6%
AVG.FLUID_BALANCE.DAY | Less than 715.5 | 87 | 12.6%
AVG.MEAN.CREATINE | Less than 1.0 | 166 | 16.9%
RNG.MEAN.HGT | Less than 1.1 | 20 | 65.0%
MAXMAP | Less than 75.9 | 39 | 59.0%
MIN.URINE.DAY | More than 1,195.1 | 78 | 10.3%
MAX.MEAN.BICARB | More than 25.7 | 186 | 15.6%
AVG.MEAN.WBC | Less than 3.6 | 13 | 69.2%
FLUID_BALANCE.DAY | More than 3,107.2 | 75 | 52.0%
MAX.MEAN.CREATINE | Less than 1.0 | 123 | 16.3%
MIN.MEAN.CHLORIDE | Less than 99.6 | 103 | 43.7%
TREND1.AVGHR | Less than 0.9 | 145 | 20.7%
AVG.MEAN.CALCIUM | Less than 6.8 | 12 | 75.0%
TREND2.AVGRR | More than 1.5 | 35 | 5.7%
MAX.FLUID_BALANCE.DAY | Less than 587.8 | 19 | 0.0%
MINRR | Less than 11.2 | 146 | 41.1%
MIN.UNITS_OUT.DAY | More than 1,388.0 | 92 | 15.2%
TREND2.AVGSI | Less than 0.8 | 119 | 16.8%
MIN.MEAN.CREATINE | Less than 0.9 | 210 | 18.6%
MIN.UNITS_OUT.DAY | Less than 36.0 | 34 | 58.8%
AVG.AVGHR.TD | More than 94.0 | 176 | 42.0%
MIN.UNITS_IN.DAY | More than 2,871.7 | 118 | 43.2%
MAXHR | More than 125.5 | 74 | 45.9%
MIN.AVGHR.TD | More than 76.6 | 127 | 44.9%
AVGSI | Less than 0.6 | 113 | 14.2%
MINMAP | More than 76.1 | 61 | 11.5%
SDMAP | More than 14.1 | 59 | 11.9%
AVG.MEAN.BICARB | More than 23.5 | 161 | 15.5%
MEAN.WBC | Less than 2.7 | 8 | 75.0%
TREND2.AVGSI | More than 2.2 | 4 | 100.0%
MIN.MEAN.CALCIUM | Less than 6.4 | 25 | 56.0%
TREND3.AVGSI | Less than 0.8 | 109 | 19.3%
MAX.MEAN.WBC | More than 37.3 | 32 | 56.3%
MAX.MEAN.CALCIUM | Less than 7.0 | 10 | 70.0%
AVGMAP | More than 93.7 | 78 | 12.8%
AVG.FLUID_BALANCE.DAY | More than 3,535.4 | 141 | 44.7%
MIN.FLUID_BALANCE.DAY | Less than -473.0 | 193 | 16.1%
RNG.MEAN.CALCIUM | More than 1.5 | 42 | 47.6%
AVG.MEAN.HGB | More than 12.0 | 37 | 10.8%
SDHR | Less than 2.8 | 22 | 59.1%
RNG.AVGSI.TD | More than 1.8 | 10 | 70.0%
MAX.MEAN.WBC | Less than 3.1 | 6 | 83.3%
XRAY.TODATE | More than 3.0 | 22 | 54.5%
MIN.MEAN.WBC | Less than 2.9 | 23 | 52.2%
RNG.MEAN.GLUCOSE | More than 113.7 | 79 | 43.0%
TREND3.AVGMAP | Less than 0.7 | 6 | 83.3%
MAX.FLUID_BALANCE.DAY | More than 19,123.8 | 13 | 0.0%
MIN.MEAN.BICARB | More than 22.2 | 116 | 16.4%
AVG.MEAN.CHLORIDE | More than 113.5 | 81 | 39.5%
TREND2.AVGMAP | Less than 0.6 | 3 | 100.0%
RNG.MEAN.SODIUM | Less than 2.0 | 53 | 45.3%
RNG.MEAN.CREATINE | Less than 0.2 | 160 | 19.4%
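The ranking column behind table B.2, the absolute z-statistic of each flagged group, can be approximated with a pooled two-proportion z-statistic comparing mortality beyond the cut-point against the remaining patients. This sketch is one plausible reading, not the thesis's exact UFA implementation:

```python
import numpy as np

def flag_z(values, died, threshold, direction="more"):
    """Two-proportion z-statistic for mortality in the flagged group
    (values beyond the cut-point) versus everyone else."""
    flagged = values > threshold if direction == "more" else values < threshold
    n1, n0 = flagged.sum(), (~flagged).sum()
    p1, p0 = died[flagged].mean(), died[~flagged].mean()
    pooled = died.mean()
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    return (p1 - p0) / se

# Toy check on synthetic data: a variable whose high values carry
# excess mortality should yield a large positive z.
rng = np.random.default_rng(3)
x = rng.normal(10, 3, 500)
died = rng.random(500) < np.where(x > 12, 0.65, 0.25)
print(flag_z(x, died, threshold=12.0, direction="more"))
```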
Table B.3: Comparison of classifiers with varying amounts of missing data, confidence intervals
For each row of the table, an increasing percentage of each variable in the MIMIC II dataset was randomly replaced with missing values. N-UFA is the UFA-based classifier; Random Forest and best subset regression are the models from Section 4.2; logistic regression is the comparison model.

% Missing | N-UFA Accuracy | N-UFA AUC | Random Forest Accuracy | Random Forest AUC
0% | 77.5% (75.1, 79.9) | 0.819 (0.797, 0.841) | 79.0% (76.9, 81.1) | 0.823 (0.796, 0.851)
5% | 77.5% (74.9, 80.1) | 0.820 (0.793, 0.847) | 78.3% (76.7, 79.8) | 0.812 (0.783, 0.841)
10% | 78.1% (75.3, 80.8) | 0.817 (0.793, 0.842) | 77.1% (73.6, 80.7) | 0.812 (0.785, 0.840)
25% | 77.9% (76.0, 79.7) | 0.816 (0.792, 0.839) | 76.9% (74.7, 79.1) | 0.819 (0.791, 0.847)
50% | 76.2% (73.9, 78.4) | 0.790 (0.764, 0.815) | 71.9% (69.3, 74.6) | 0.771 (0.744, 0.799)

% Missing | Best Subset Accuracy | Best Subset AUC | Logistic Accuracy | Logistic AUC
0% | 74.8% (72.6, 77.0) | 0.831 (0.799, 0.862) | 69.7% (65.7, 71.6) | 0.698 (0.642, 0.753)
5% | 74.2% (72.1, 76.3) | 0.818 (0.787, 0.849) | 68.5% (67.2, 69.8) | 0.659 (0.644, 0.673)
10% | 74.4% (72.1, 76.7) | 0.812 (0.786, 0.839) | 66.0% (63.1, 68.8) | 0.636 (0.600, 0.672)
25% | 74.0% (70.6, 77.5) | 0.781 (0.763, 0.799) | 67.5% (64.1, 70.9) | 0.631 (0.576, 0.686)
50% | 70.4% (67.9, 72.8) | 0.706 (0.673, 0.740) | 58.3% (53.3, 63.2) | 0.598 (0.566, 0.629)
Table B.4: Comparison of classifiers with varying amounts of imprecise data, confidence intervals
For each row, an increasing percentage of each variable in MIMIC II is randomly perturbed by a value drawn from a normal distribution with mean zero and the empirical variance of the variable in question. The 0% row corresponds to the original data.

% Varied | N-UFA Accuracy | N-UFA AUC | Random Forest Accuracy | Random Forest AUC
0% | 77.5% (75.1, 79.9) | 0.819 (0.797, 0.841) | 79.0% (76.9, 81.1) | 0.823 (0.796, 0.851)
5% | 77.3% (74.5, 80.2) | 0.808 (0.785, 0.830) | 76.2% (74.0, 78.3) | 0.805 (0.777, 0.833)
10% | 76.5% (73.7, 79.3) | 0.811 (0.788, 0.834) | 77.5% (74.2, 80.8) | 0.816 (0.782, 0.851)
25% | 77.9% (75.1, 80.6) | 0.811 (0.786, 0.836) | 77.1% (74.9, 79.3) | 0.795 (0.768, 0.822)
50% | 75.8% (73.0, 78.5) | 0.796 (0.766, 0.825) | 76.3% (73.7, 79.0) | 0.802 (0.775, 0.829)

% Varied | Best Subset Accuracy | Best Subset AUC | Logistic Accuracy | Logistic AUC
0% | 74.8% (72.6, 77.0) | 0.831 (0.799, 0.862) | 68.7% (65.7, 71.6) | 0.698 (0.642, 0.753)
5% | 73.3% (71.5, 75.0) | 0.818 (0.789, 0.847) | 65.4% (61.7, 69.1) | 0.638 (0.602, 0.674)
10% | 74.6% (72.1, 77.1) | 0.821 (0.786, 0.856) | 70.0% (68.1, 71.9) | 0.694 (0.671, 0.717)
25% | 75.4% (73.3, 77.4) | 0.788 (0.759, 0.818) | 63.1% (59.1, 67.1) | 0.611 (0.563, 0.658)
50% | 75.8% (73.7, 77.9) | 0.790 (0.755, 0.824) | 68.8% (64.6, 73.1) | 0.681 (0.635, 0.727)
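The imprecise-data experiment of table B.4 admits a similarly small sketch: perturb a random fraction of each column with zero-mean Gaussian noise scaled to the column's empirical standard deviation. This is illustrative only; the thesis's exact procedure is the one in the caption.

```python
import numpy as np
import pandas as pd

def perturb(df: pd.DataFrame, frac: float, seed: int = 0) -> pd.DataFrame:
    """Add N(0, var(column)) noise to a random `frac` of each column's
    entries, as in the imprecise-data experiment of table B.4."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = rng.random(len(out)) < frac
        noise = rng.normal(0.0, out[col].std(ddof=0), mask.sum())
        out.loc[mask, col] = out.loc[mask, col] + noise
    return out

# Tiny demonstration on illustrative columns.
demo = pd.DataFrame({"map": [65.0, 80.0, 72.0, 90.0], "hr": [88.0, 110.0, 95.0, 70.0]})
print(perturb(demo, frac=0.5, seed=2))
```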