Imputation-enhanced Prediction
of Septic Shock In ICU
Joyce C. Ho, Cheng H. Lee and Joydeep Ghosh
University of Texas at Austin
HI-KDD 2012: ACM SIGKDD Workshop on Health
Presenter : Kiyana Zolfaghar
 Motivation
Challenges of Clinical Data
Predictive model for
Sepsis Risk
 Septic Shock
Impact of imputation methods on prediction
Sepsis and Septic shock
Sepsis: a severe, systemic inflammatory response with
a presumed or identified source of infection.
Severe sepsis: sepsis with one or more of organ dysfunction,
hypoperfusion or hypotension.
Septic shock: a complication characterized
by low blood pressure despite
treatment with >600 mL of fluid
input in the last hour.
 Septic Shock as a Severe illness
10th most common cause of death in western societies
 25% of ICU bed utilization in western countries
 Mortality rates range from 12.8% for sepsis to 45.7% for septic shock
Motivation for Prediction of Septic Shock in ICU Patients
 Early intervention and therapy can improve the outcome of patients
 Treatment transition: from treatment by critical care
in later phases to proactive treatment in
early phases
Prediction of Sepsis and Septic shock
 Data mining approach
for identifying patients at risk for developing sepsis
 Predictive models
Regression method
Decision trees
Support vector machines
Bayesian classification …
 Issues Regarding Classification and Prediction
Data Preparation
 Feature selection
 Data cleaning
 remove or reduce noise
 treatment of missing values
Challenges of Clinical Data
 Typically noisy and inconsistently gathered
 Manual recording of patient data at irregular intervals
 Accurate measurement of physiological variables requires the use of
invasive techniques
Result: large amounts of missing data
in clinical studies
 Naïve Solution
 Simply ignoring subjects or features with missing data
Dramatic decrease in sample sizes or
feature spaces
Bias in the results
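The cost of the naïve solution can be seen on a toy example. The sketch below uses made-up records (the feature names and values are hypothetical, not taken from MIMIC-II) and counts how many subjects or features survive when anything incomplete is dropped.

```python
# Toy records with hypothetical values; None marks a missing measurement.
records = [
    {"hr": 88, "temp": 37.1, "lactate": 1.9},
    {"hr": 102, "temp": None, "lactate": 2.4},
    {"hr": 120, "temp": 38.3, "lactate": None},
    {"hr": 95, "temp": 36.8, "lactate": None},
    {"hr": 110, "temp": 39.0, "lactate": 4.1},
]

def drop_incomplete_rows(rows):
    """Keep only subjects for whom every feature was observed."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_incomplete_features(rows):
    """Keep only features that were observed for every subject."""
    complete = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in complete} for r in rows]

complete_rows = drop_incomplete_rows(records)
complete_cols = drop_incomplete_features(records)
print(len(records), "subjects ->", len(complete_rows), "after dropping incomplete subjects")
print(len(records[0]), "features ->", len(complete_cols[0]), "after dropping incomplete features")
```

Either choice discards observed values along with the missing ones, which is the loss of sample size or feature space noted above.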
The Paper Contribution
Investigates the role and impact of imputation methods
while building predictive models for
Sepsis risk
Septic shock
Methodology of Research
Data Selection
Building predictive models for sepsis and Septic shock
Leveraging different imputation methods on data
Dataset Description
MIMIC-II Database
(Multiparameter Intelligent Monitoring in Intensive Care)
Publicly and freely available
Includes a very large population of ICU patients
Contains high temporal resolution data including
 lab results
 electronic documentation
monitor trends and waveforms.
Funded by:
National Institute of Biomedical Imaging
and Bioengineering
Clinical Records in MIMIC-II
 Overview of the data categories
 General
• Patient demographics
• Hospital admissions & discharge Info.
• Room tracking, death dates
• ICD-9 codes
 Physiological measures
 Hourly vital sign metrics
 Medication records
 Lab test results
 Fluid Balance
 Input and output records
 Notes and Reports
 Discharge summary, nursing progress notes
 Radiology and echo reports.
Data Selection and Target Classes
Dataset Size : 12,179 patients
Exclude patients younger than 18 at time of admission
Include patients with at least ten observations of BP, TEMP, HR…
Target class
 Sepsis Risk Prediction
Patients identified by ICD-9 codes ("995.91" or "995.92")
• ~10.8% of dataset size (1,310 patients)
Septic shock Prediction
• Patients with hypotension and total fluid intake >600 mL
• ~44.7% of sepsis patients (586 patients)
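The class proportions follow directly from the patient counts on this slide. A quick check, using only those counts (the variable names are illustrative):

```python
# Patient counts as stated on the slide.
total_patients = 12179
sepsis_patients = 1310
septic_shock_patients = 586

sepsis_rate = sepsis_patients / total_patients
shock_rate = septic_shock_patients / sepsis_patients

print(f"sepsis: {sepsis_rate:.1%} of the dataset")
print(f"septic shock: {shock_rate:.1%} of sepsis patients")
```

Both target classes are minorities within their populations, which is worth keeping in mind when reading accuracy-style metrics.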
Predictive Model for Sepsis Risk
 Features
Patient's Clinical History
• Demographic data (gender and ages)
• Medical history
• Basic health data (weight ..)
Measurements of Physiological Variables
Logistic regression as the prediction model, in four variants:
• clinical history features only
• clinical history features after stepwise regression
• all available features
• all available features after stepwise regression
Stepwise logistic Regression model
• Logistic Regression
• Type of regression analysis used for predicting the outcome of a
categorical target variable
• Stepwise Regression
• The choice of predictive variables is carried out by an automatic
procedure:
starting with no variables in the model
testing the addition of each variable using a chosen model
comparison criterion
adding the variable (if any) that improves the model the most
repeating this process until no addition improves the model.
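The forward procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes NumPy is available, uses AIC as the model comparison criterion (the slides do not say which criterion was chosen), and runs on synthetic data in which only feature 0 carries signal. The helper names fit_logistic, aic and forward_stepwise are made up for this example.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Fit logistic regression by Newton's method; return the log-likelihood."""
    Xb = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        grad = Xb.T @ (y - p) - ridge * w
        hess = (Xb * (p * (1 - p))[:, None]).T @ Xb + ridge * np.eye(len(w))
        w += np.linalg.solve(hess, grad)
    p = np.clip(1.0 / (1.0 + np.exp(-Xb @ w)), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def aic(X, y, cols):
    """AIC = 2 * (number of parameters) - 2 * log-likelihood (assumed criterion)."""
    idx = np.array(cols, dtype=int)
    return 2 * (len(cols) + 1) - 2 * fit_logistic(X[:, idx], y)

def forward_stepwise(X, y):
    """Start with no variables; repeatedly add the one that lowers AIC the most."""
    selected, remaining = [], list(range(X.shape[1]))
    best = aic(X, y, selected)
    while remaining:
        scores = {j: aic(X, y, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no candidate improves the criterion
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

# Synthetic data: the outcome depends only on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 1 / (1 + np.exp(-2.0 * X[:, 0]))).astype(float)
selected = forward_stepwise(X, y)
print("selected feature indices, in order of entry:", selected)
```

Backward stepwise selection runs the same loop in reverse: start with all variables and drop the one whose removal improves the criterion the most.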
Septic Shock Prediction Model
physiologic and laboratory values
Importance of time in septic shock
• Feature matrices creation at reference times of 30, 60, 90, and 120
minutes prior to the onset of septic shock.
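One way to picture a feature matrix at a reference time is as a snapshot of the latest value of each variable known at that time. The sketch below assumes that construction (the slides do not spell out the exact rule) and uses hypothetical observations given as (minute, variable, value) tuples.

```python
def features_at(observations, onset_min, lead_min):
    """Latest value of each variable recorded at or before (onset - lead)."""
    cutoff = onset_min - lead_min
    latest = {}
    for t, name, value in sorted(observations):  # chronological order
        if t <= cutoff:
            latest[name] = value  # later observations overwrite earlier ones
    return latest

# Hypothetical observations: (minute since admission, variable, value).
obs = [
    (0, "hr", 90), (45, "hr", 104), (100, "hr", 118),
    (20, "sbp", 110), (95, "sbp", 92), (150, "sbp", 84),
]
onset = 180  # hypothetical onset of septic shock, in minutes

snapshots = {lead: features_at(obs, onset, lead) for lead in (30, 60, 90, 120)}
for lead, row in snapshots.items():
    print(f"{lead:>3} min before onset:", row)
```

A variable with no observation before the cutoff is simply absent from the snapshot, which is how irregular charting turns into missing feature values.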
 Prediction Models
Logistic Regression
all available features,
features set after forward stepwise regression
features set after backward stepwise regression
Support Vector Machine
Classification tree
Decision Tree Learning
• Create a model that predicts the value of a target variable based on
several input variables
Learning a decision tree
 Recursive partitioning
Split the samples at each node based on a selected attribute
 Stopping condition
All samples at a given node belong
to the same class
 Decision tree
Classification Trees
Regression Trees
[Figure: example decision tree with branch labels <= 9.5, <= 2.5 and > 2.5]
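Recursive partitioning for a classification tree can be sketched as follows. This is a minimal illustration that assumes Gini impurity as the attribute-selection measure and threshold splits on numeric features. The slides do not state the splitting criterion, and the data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for j in range(len(rows[0])):
        for thr in sorted({r[j] for r in rows})[:-1]:  # skip max so both sides are non-empty
            left = [y for r, y in zip(rows, labels) if r[j] <= thr]
            right = [y for r, y in zip(rows, labels) if r[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (j, thr), score
    return best

def build_tree(rows, labels):
    """Partition recursively; stop when a node is pure or no split helps."""
    split = None if len(set(labels)) == 1 else best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    j, thr = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= thr]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > thr]
    return (j, thr,
            build_tree(*map(list, zip(*left))),
            build_tree(*map(list, zip(*right))))

def predict(tree, row):
    """Follow the splits from the root down to a leaf label."""
    while isinstance(tree, tuple):
        j, thr, left, right = tree
        tree = left if row[j] <= thr else right
    return tree

# Hypothetical data: two numeric features, binary outcome.
rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (1.5, 3.0)), predict(tree, (3.5, 3.0)))
```

A regression tree follows the same recursion but predicts a numeric value at each leaf and scores splits with a variance-based measure instead of Gini impurity.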
Missing Value Imputation
 Missing data in MIMIC II
Excluding records with missing values leads to
a 47.2% reduction in dataset size
Imputation Methods
1) Mean Feature Values (Mean for Subgroup)
Derived from the patients' gender and age group
• accounted for fundamental physiological differences between
genders and among age groups
 Challenges
 Mean substitution is especially problematic when there are
many missing values
 distorts the distribution and variance
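Subgroup mean imputation can be sketched as below. The record layout, the "gender" and "age_group" keys and the heart-rate values are hypothetical; the slides only state that means are derived from gender and age group.

```python
from collections import defaultdict

def impute_group_mean(rows, feature, group_keys=("gender", "age_group")):
    """Fill missing values of `feature` with the mean of the patient's subgroup."""
    sums = defaultdict(lambda: [0.0, 0])  # subgroup -> [running total, count]
    for r in rows:
        if r[feature] is not None:
            acc = sums[tuple(r[k] for k in group_keys)]
            acc[0] += r[feature]
            acc[1] += 1
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input records are left untouched
        if r[feature] is None:
            total, count = sums[tuple(r[k] for k in group_keys)]
            r[feature] = total / count if count else None  # no data for this subgroup
        filled.append(r)
    return filled

patients = [
    {"gender": "F", "age_group": "18-40", "hr": 80.0},
    {"gender": "F", "age_group": "18-40", "hr": 90.0},
    {"gender": "F", "age_group": "18-40", "hr": None},
    {"gender": "M", "age_group": "65+", "hr": 70.0},
    {"gender": "M", "age_group": "65+", "hr": None},
]
filled = impute_group_mean(patients, "hr")
print([p["hr"] for p in filled])
```

Every imputed value sits exactly at its subgroup mean, so the more values are missing, the more the within-group spread shrinks. That is the distortion of distribution and variance listed under the challenges.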
Imputation Methods
2) Matrix Factorization-based Approaches
(Very popular in Bioinformatics fields)
• Use a linear combination of the top k eigenvectors to predict the missing values
Probabilistic Principal Component Analysis (PPCA)
• Combines an Expectation-Maximization (EM) approach to Principal
Component Analysis (PCA) with a probabilistic model
• Uses a likelihood function to penalize data far from the training set
Bayesian PCA
• EM approach + Bayesian model to calculate the likelihood for constructed
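The shared idea behind these methods, reconstructing missing entries from a low-rank model of the data, can be illustrated with plain iterative PCA imputation. The sketch below is a simplified stand-in and not PPCA or BPCA: it has no probabilistic noise model and no priors. It assumes NumPy is available, and the function name pca_impute, the rank k and the toy matrix are all hypothetical.

```python
import numpy as np

def pca_impute(X, k=1, n_iter=100):
    """Iteratively replace missing entries with a rank-k PCA reconstruction."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        approx = mu + (U[:, :k] * s[:k]) @ Vt[:k]  # rank-k reconstruction
        filled[missing] = approx[missing]          # observed cells are never changed
    return filled

# Toy matrix with two missing cells (np.nan); values are made up.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, np.nan],
    [3.0, np.nan, 9.0],
    [4.0, 8.0, 12.0],
    [5.0, 10.0, 15.0],
])
completed = pca_impute(X, k=1)
print(np.round(completed, 2))
```

Unlike subgroup means, each imputed value here depends on the patient's other measurements through the leading components, which is what lets matrix factorization approaches exploit correlations between variables.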
Sepsis Risk Prediction Results
No baseline model to compare the results against
Evaluation metric
• AUC (area under the ROC curve)
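AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that pairwise definition, with made-up labels and scores:

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. Because it depends only on the ordering of scores, AUC is not affected by the class imbalance in the sepsis cohort the way raw accuracy would be.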
Septic Shock Prediction Results
• The septic shock EWS as baseline
• Prediction model : logistic regression
• predict the onset of septic shock one hour in advance
• Use invasively-gathered data from MIMIC waveform data
Imputation-enhanced Prediction Of
Septic Shock
• Impact of various imputation methods at different
reference times
• Compared against the baseline logistic regression model
AUC Curves for predicting septic shock
60 minutes before onset
Septic shock prediction 60 minutes
before onset for three types of models:
Effect of imputation on logistic regression
coefficients for predicting septic Shock
Consistency across different
imputation methods
Inconsistency between values
obtained with and without imputation
The non-imputed model suffers
from over-fitting
 Imputing missing data can improve model performance,
especially when dealing with larger, noisier, and more
incomplete datasets
Matrix factorization imputation methods like BPCA lead to
models with better predictive accuracy than simpler
approaches like group means.