Imputation-enhanced Prediction of Septic Shock in ICU Patients
Joyce C. Ho, Cheng H. Lee and Joydeep Ghosh
University of Texas at Austin
HI-KDD 2012: ACM SIGKDD Workshop on Health Informatics
Presenter: Kiyana Zolfaghar

Outline
• Motivation
• Challenges of clinical data
• Predictive models for
  • Sepsis risk
  • Septic shock
• Impact of imputation methods on prediction
• Results

Sepsis and Septic Shock
• Sepsis: a severe, systemic inflammatory response with a presumed or identified source of infection
• Severe sepsis: sepsis with one or more organ dysfunctions, hypoperfusion, or hypotension
• Septic shock: a complication characterized by low blood pressure despite treatment with >600 mL of fluid inputs in the last hour

Motivation: Septic Shock as a Severe Illness
• 10th most common cause of death in Western societies
• Accounts for 25% of ICU bed utilization in Western countries
• Mortality rates range from 12.8% for sepsis to 45.7% for septic shock

Motivation for Predicting Septic Shock in ICU Patients
• Early intervention and therapy can improve patient outcomes
• Treatment transition: from treatment by critical care physicians in later phases to proactive treatment in early phases

Prediction of Sepsis and Septic Shock
• Data mining approach for identifying patients at risk of developing sepsis
• Predictive models
  • Regression methods
  • Decision trees
  • Support vector machines
  • Bayesian classification
  • …

Issues Regarding Classification and Prediction
• Data preparation
  • Feature selection
  • Data cleaning: remove or reduce noise; treat missing values

Challenges of Clinical Data
• Typically noisy and inconsistently gathered
• Patient data are recorded manually at irregular intervals
• Accurate measurement of physiological variables requires invasive techniques
• Large amounts of missing data in clinical studies

Naïve Solution
• Simply ignore subjects or features with missing data
• Consequences
  • Dramatic decrease in sample size or feature space
  • Bias in the results

The Paper's Contribution
• Investigates the role and impact of imputation methods when building predictive models for
  • Sepsis risk
  • Septic shock

Research Methodology
• Data selection
• Building predictive models for sepsis and septic shock
• Applying different imputation methods to the data
• Results

Dataset Description
• MIMIC-II database (Multiparameter Intelligent Monitoring in Intensive Care)
• Publicly and freely available
• Includes a very large population of ICU patients
• Contains high-temporal-resolution data, including lab results, electronic documentation, and monitor trends and waveforms
• Funded by the National Institute of Biomedical Imaging and Bioengineering

Clinical Records in MIMIC-II: Overview of the Data Categories
• General
  • Patient demographics
  • Hospital admission and discharge information
  • Room tracking, death dates
  • ICD-9 codes
• Physiological measures
  • Hourly vital sign metrics
• Medication records
• Lab test results
• Fluid balance
  • Input and output records
• Notes and reports
  • Discharge summaries, nursing progress notes
  • Radiology and echo reports

Data Selection and Target Classes
• Dataset size: 12,179 patients
  • Exclude patients younger than 18 at time of admission
  • Include patients with at least ten observations of BP, TEMP, HR…
• Target classes
  • Sepsis risk prediction
    • Patients identified by ICD-9 codes ("995.91" or "995.92")
    • ~10.8% of the dataset (1,310 patients)
  • Septic shock prediction
    • Patients with hypotension and total fluid intake >600 mL
    • ~44.7% of sepsis patients (586 patients)

Predictive Model for Sepsis Risk
• Features
  • Patient's clinical history
    • Demographic data (gender and age)
    • Medical history
    • Basic health data (weight ..)
  • Measurements of physiological variables
• Logistic regression as the prediction model, fitted on four feature sets
  • Clinical history features only
  • Clinical history features after stepwise regression
  • All available features
  • All available features after stepwise regression

Stepwise Logistic Regression Model
• Logistic regression
  • A type of regression analysis used to predict the outcome of a categorical target variable
• Stepwise regression
  • The choice of predictive variables is carried out by an automatic procedure:
  1. Start with no variables in the model
  2. Test the addition of each variable using a chosen model comparison criterion
  3. Add the variable (if any) that improves the model the most
  4. Repeat this process until no variable improves the model

Septic Shock Prediction Model
• Features
  • Physiologic and laboratory values
• Importance of time in septic shock
  • Feature matrices are created at reference times of 30, 60, 90, and 120 minutes prior to the onset of septic shock

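The reference-time feature matrices could be built along the following lines. This is only a minimal sketch, not the authors' pipeline: the record layout (`Observations`), the helper names `feature_row` and `feature_matrices`, and the rule of carrying the most recent observation forward are all assumptions made for illustration, and the numbers are made up.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: variable name -> list of (minutes since admission, value).
Observations = Dict[str, List[Tuple[float, float]]]

# Lead times (minutes before septic shock onset) named on the slide.
REFERENCE_LEADS_MIN = (30, 60, 90, 120)


def feature_row(obs: Observations, onset_min: float, lead_min: float,
                variables: List[str]) -> List[Optional[float]]:
    """Return the most recent value of each variable at or before (onset - lead).

    Carrying the last observation forward is an assumption of this sketch.
    None marks a variable with no measurement yet, i.e. a missing value.
    """
    cutoff = onset_min - lead_min
    row: List[Optional[float]] = []
    for name in variables:
        earlier = [(t, v) for t, v in obs.get(name, []) if t <= cutoff]
        row.append(max(earlier, key=lambda tv: tv[0])[1] if earlier else None)
    return row


def feature_matrices(patients: List[Tuple[Observations, float]],
                     variables: List[str]) -> Dict[int, List[List[Optional[float]]]]:
    """Build one matrix (one row per patient) for every reference lead time."""
    return {
        lead: [feature_row(obs, onset, lead, variables) for obs, onset in patients]
        for lead in REFERENCE_LEADS_MIN
    }


# Toy data with made-up numbers: (observations, onset time in minutes).
patients = [
    ({"HR": [(0, 80.0), (100, 95.0), (170, 120.0)],
      "SBP": [(50, 110.0), (160, 85.0)]}, 200.0),
    ({"HR": [(10, 70.0)], "SBP": []}, 90.0),
]
matrices = feature_matrices(patients, ["HR", "SBP"])
print(matrices[30])   # rows for the 30-minute reference time
print(matrices[120])  # rows for the 120-minute reference time
```

Even in this toy example the earlier reference times leave more cells empty (`None`), which is where the imputation step comes in.
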
Prediction Models
• Logistic regression
  • All available features
  • Feature set after forward stepwise regression
  • Feature set after backward stepwise regression
• Support vector machine
• Classification tree

Decision Tree Learning
• Goal
  • Create a model that predicts the value of a target variable based on several input variables
• Learning a decision tree
  • Recursive partitioning based on a selected attribute
  • Partitioning stops when all samples at a given node belong to the same class
• Decision tree types
  • Classification trees
  • Regression trees
[Figure: example decision tree that splits on sex (male/female), age (<= 9.5 vs > 9.5), and sibsp (<= 2.5 vs > 2.5), with "survived" and "died" leaves; leaf percentages shown: 36%, 2%, 2%, 61%]

Missing Value Imputation
• Missing data in MIMIC-II
  • Excluding records with missing values leads to a 47.2% reduction in dataset size

Imputation Methods
1) Mean feature values (mean for subgroup)
• Derived from the patient's gender and age group
  • Accounts for fundamental physiological differences between genders and among age groups
• Challenges
  • Mean substitution is especially problematic when there are many missing values
  • It distorts the distribution and variance

Imputation Methods
2) Matrix factorization-based approaches (very popular in bioinformatics)
• SVDImpute
  • Uses a linear combination of the k most significant eigenvectors to predict the missing value
• Probabilistic Principal Component Analysis (PPCA)
  • Combines an Expectation-Maximization (EM) approach to Principal Component Analysis (PCA) with a probabilistic model
  • Uses a likelihood function to penalize data far from the training set
• Bayesian PCA (BPCA)
  • EM approach plus a Bayesian model to calculate the likelihood of the constructed data

Sepsis Risk Prediction Results
• No baseline model to compare the results against
• Evaluation metric
  • AUC (area under the curve)

Septic Shock Prediction Results
• The septic shock EWS as the baseline
  • Prediction model: logistic regression
  • Predicts the onset of septic shock one hour in advance
  • Uses invasively gathered data from the MIMIC waveform data

Imputation-enhanced Prediction of Septic Shock
• Impact of various imputation methods at different reference times
• Comparison with the baseline logistic regression model

AUC Curves for Predicting Septic Shock 60 Minutes Before Onset

Septic Shock Prediction 60 Minutes Before Onset for Three Types of Models

Effect of Imputation on Logistic Regression Coefficients for Predicting Septic Shock
• Consistency across different imputation methods
• Inconsistency between values obtained with and without imputation
  • Non-imputed models suffer from over-fitting

Conclusion
• Imputing missing data can improve model performance, especially when dealing with larger, noisier, and more incomplete datasets
• Matrix factorization imputation methods like BPCA lead to models with better predictive accuracy than simpler approaches like group means
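
For reference, the simpler baseline named in the conclusion, group-mean imputation by gender and age group, might look like the sketch below. It is illustrative only: the age bands in `age_group`, the row layout, the helper name `impute_group_means`, and the fallback to the overall mean when a subgroup has no observed value are assumptions of this sketch, not details taken from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple, Union

Row = Dict[str, Union[str, float, None]]


def age_group(age: float) -> str:
    """Hypothetical age bands; the slides do not state the cut-offs used."""
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"


def impute_group_means(rows: List[Row], feature: str) -> List[Row]:
    """Fill missing `feature` values with the mean of the same gender/age-group.

    Falls back to the overall observed mean when a subgroup has no observed
    value (an assumption of this sketch). The input rows are not modified.
    """
    observed: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for r in rows:
        if r[feature] is not None:
            observed[(r["gender"], age_group(r["age"]))].append(r[feature])
    overall = mean(v for vals in observed.values() for v in vals)

    filled: List[Row] = []
    for r in rows:
        new = dict(r)  # copy so the original record keeps its missing marker
        if new[feature] is None:
            vals: Optional[List[float]] = observed.get(
                (r["gender"], age_group(r["age"])))
            new[feature] = mean(vals) if vals else overall
        filled.append(new)
    return filled


# Toy data with made-up heart-rate values; None marks a missing measurement.
rows = [
    {"gender": "F", "age": 30, "HR": 70.0},
    {"gender": "F", "age": 35, "HR": 90.0},
    {"gender": "F", "age": 25, "HR": None},   # filled from the F / 18-39 subgroup
    {"gender": "M", "age": 70, "HR": 100.0},
    {"gender": "M", "age": 50, "HR": None},   # empty subgroup, uses overall mean
]
filled = impute_group_means(rows, "HR")
print([r["HR"] for r in filled])
```

Every imputed patient in a subgroup receives the same value, which is why the slides warn that mean substitution distorts the distribution and variance when many values are missing.
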