Ennett CM, Frize M. Selective sampling to overcome skewed a priori probabilities with neural networks. Proc AMIA Symp 2000;(20 Suppl):225-9. Selective Sampling to Overcome Skewed a priori Probabilities with Neural Networks Colleen M. Ennett1, MASc, Monique Frize1,2, PhD, PEng Systems & Computer Engineering, Carleton University, Ottawa, ON, Canada 2 School of Information Technology & Engineering, University of Ottawa, Ottawa, ON, Canada 1 Highly skewed a priori probabilities present challenges for researchers developing medical decision aids due to a lack of information on the rare outcome of interest. This paper attempts to overcome this obstacle by artificially increasing the mortality rate of the training sets. A weight pruning technique called weight-elimination is also applied to this coronary artery bypass grafting (CABG) database to assess its impact on the artificial neural network’s (ANN) performance. The results showed that increasing the mortality rate improved the sensitivity rates at the cost of the other performance measures, and the weight-elimination cost function improved the sensitivity rate without seriously affecting the other performance measures. Keywords Artificial neural networks; Coronary artery bypass grafting; Cardiac surgery; Weightelimination; Outcomes estimation. INTRODUCTION A challenge associated with medical databases is the ability to predict rare outcomes. Certainly, rare outcomes are of great interest since their occurrence is generally undesirable such as the presence of a disease or post-surgical mortality. Given that their occurrence is rare, there is limited data upon which to draw conclusions about their causes. Post-surgical mortality of coronary artery bypass graft (CABG) patients is an example of a rare outcome. Generally, CABG mortality rates reported in journal articles are less than 5 percent. Although this is a desirable rate from a medical point of view, the low mortality rate makes accurate mortality predictions quite challenging. Previous risk models have been developed for predicting CABG surgery mortality using additive [1,2] and statistical models [3-8], but in recent years artificial neural networks (ANNs) have entered into this area of medical research [9,10]. The question posed in this paper is: What is the effect of repeatedly presenting rare outcomes to an ANN using random selective sampling? In other words, the objective is to study the effect of changing the mortality rate distribution of the training datasets by entering the cases of the under-represented class multiple times into an artificial dataset. Artificially changing the a priori statistics (i.e., increasing the mortality rate) of the datasets is an attempt to combat the problems that arise with the low representation of the nonsurvivors (those who die) of CABG surgery when predicting in-hospital mortality using ANNs [11]. The highly skewed distribution poses serious difficulties for the intended ANN to be investigated, as shown by preliminary results done by the Medical IDEAS (Intelligent DEcision Aid Systems) Research Group [MIRG] researchers using an adult intensive care database [12,13]. Ennett and Frize [13] discovered a region of inconsistent network performance (i.e., sometimes the network would learn patterns and other times it would classify everything as belonging to the group with the highest a priori probability) when the underrepresented class made up approximately 15 percent of the cases. At representations of less than 15 percent, the ANN failed to learn anything. Instead, it classified all patients as belonging to the larger outcome class (i.e., those patients requiring less than 8 hours of artificial ventilation). In theory, this technique of changing the a priori statistics of the training set may provide a solution to the challenges faced when dealing with a drastically underrepresented outcome in a two-class problem. The application of this ANN algorithm with weightelimination to a medical database has given successful results in a previous study [12,14], but it also put into question whether the weight-elimination cost function would prove useful when applied to another database. The problem involved in the research by Trigg [12] and Frize et al. [14] was to identify intensive care unit (ICU) patients who require 8 or more hours of mechanical ventilation. This involved an approximately 70-30 percent distribution of the outcomes (i.e., 30 percent of the patients required 8 or more hours of ventilation). This is a much larger percentage of the underrepresented category than the problem presented by the CABG database. The second question was: Will the weight- Ennett CM, Frize M. Selective sampling to overcome skewed a priori probabilities with neural networks. Proc AMIA Symp 2000;(20 Suppl):225-9. elimination cost function (developed by Weigend et al. [15,16]) improve the performance of a feedforward backpropagation ANN used to classify CABG mortality? METHODOLOGY The experiments carried out for this paper used the cardiac patient database from the San Francisco Heart Institute (SFHI) in Daly City, CA. The original database contained 7050 patients who underwent all types of open-heart surgery between January 1, 1985 and June 30, 1994. The only surgical cases included in this analysis were patients who underwent CABG surgery, CABG plus valve surgery, or CABG plus repair surgery, leaving 6325 cases. There were only 9 cases collected in 1985 so those cases were eliminated from consideration because they were not representative of the surgical practice in 1985. Also, some technical advances were introduced at the hospital during 1990-91 that would have influenced the outcomes in 1992 and afterwards, so the data from 1992-94 were also removed. This left a database of 3526 patient cases dating from 1987 to 1991. Using 32 commonly chosen CABG mortality risk factors as inputs (listed in Table 1), there were 3427 complete patient cases. Table 1: Input variables Variable date of surgery patient’s age female gender mitral valve disease operation aortic valve disease operation emergent/urgent priority for surgery failed PTCA prior to surgery reoperation renal disease ventricular aneurysm left main disease ejection fraction: normal moderate severe mitral valve regurgitation aortic valve stenosis tricuspid valve disease hypertension pre-operative intraaortic balloon pump previous myocardial infarction (MI) Prevalence N/A N/A 0.247 0.021 0.031 0.229 0.047 0.1 0.062 0.003 0.195 0.660 0.250 0.090 0.041 0.025 0.001 0.621 0.042 0.532 evolving MI history of congestive heart failure unstable angina cerebrovascular disease peripheral vascular disease triple vessel disease obesity small stature chronic obstructive pulmonary disease diabetes cardiogenic shock hypercholesterolemia previous cerebrovascular accident anemia 0.013 0.102 0.629 0.119 0.121 0.74 0.107 0.018 0.132 0.228 0.007 0.531 0.039 0.165 The database was randomly separated into training and test sets such that two-thirds of the database was used for training (2254) and the remaining third became the test set (1173). The mortality rate of the training set was 3.7 percent, and that of the test set was 3.8 percent. A feed-forward neural network with one hidden layer (i.e., a three-layer ANN) was trained using the backprogation training algorithm with the hyperbolic tangent transfer function for use in the experiments for this paper. The weight-elimination cost function developed by Weigend et al. [15,16] was added to the error function. The weight-elimination function (Equation 1) is a penalty term that attempts to reduce the weights of the least important variables to zero (or near zero) to ultimately remove their influence from the network. It is a weight-pruning technique. wij 2 E (W ) E 0 (W ) ij w0 2 wij 2 1 w0 2 (1) E(W) is the combined cost function that includes the initial cost function, E0(W) (typically, the sum-ofsquared-errors (SSE)), and the weight-elimination term (the second term in Equation 1). Here, W represents the weight vector, is the weight-decay constant, and wij indicates the individual weight of the ANN. The objective was to artificially increase the number of nonsurvivors (and hence the mortality rate) in the training sets. To achieve this goal, the datasets were separated according to their outcome: death or Ennett CM, Frize M. Selective sampling to overcome skewed a priori probabilities with neural networks. Proc AMIA Symp 2000;(20 Suppl):225-9. survival. This approach was necessary due to the small number of nonsurvivors, and to ensure that the desired distribution could be achieved. By combining the idea of separating the patient records according to the outcome used by Katz et al. [17] with the random sampling technique so that a specific mortality rate could be obtained, the artificial datasets for the SFHI coronary artery surgery database were created. Therefore, once the original database was divided into training and test sets, these sets were further subdivided into those who survived the surgery and those who did not. The artificial sets were formulated from these “categorized” datasets. when classifying CABG patients. The no weightelimination networks are set to the same network parameters (i.e., number of hidden nodes, momentum, learning rate, and weighting factor) as the weight-elimination networks but with the weightelimination function deactivated. The objective of the weight-elimination cost function is to reduce the size of the connection weights to eliminate less useful input variables. Theoretically, this course of action of reducing the number of weights in the network, and hence the network’s complexity, is expected to improve the network’s classification performance [12,14,15,18]. To artificially increase the percentage of nonsurvivors in the datasets, a simple program that performed random sampling with replacement was used. Since the sample size can affect the performance of a model, the total number of cases in the training and test sets was kept constant. Given the number of cases, and the desired percentage of nonsurvivors for the particular dataset, it was possible to determine how many patients who did not survive the surgery would be included in that particular set. The artificial training sets had mortality rates of 10, 20, 30, 40 and 50 percent. This offered a range of training set mortality distributions for the experiments to be performed. Table 2 describes each dataset with the number of nonsurvivor and survivor patient cases. The nonsurvivor and survivor cases were chosen separately, but in each situation, the patient records were chosen randomly with replacement. The networks were compared on the basis of their sensitivity, specificity and correct classification rate (CCR) when using zero as the cut-off value (because the hyperbolic tangent function ranges from -1 to 1), and the area under the Receiver Operating Characteristic (ROC) curve. As well, the performance was contrasted against the CCR of a constant predictor (CP). A CP is a simple statistical benchmark that classifies all test set cases as belonging to the outcome class with the highest a priori probability of the training set. Table 2: Distributions of the training datasets # cases # cases Dataset nonsurvivor survivors Total 3.7 % true distribution 83 RESULTS The results of the weight-elimination ANN experiments where the training set a priori distributions were altered are presented in Table 3 according to their sensitivity, specificity, CCR and area under the ROC curve (C-index). Table 4 presents the performance of the weight-elimination and no weight-elimination ANNs at the point of best performance of the weight-elimination networks. The CCR of the CP in these experiments was 96.2 percent because the higher a priori class of the training set was the survivors, so the constant predictor would be correct 96.2 percent of the time if it classified every test case as a survivor. 2171 2254 10% mortality rate 225 2029 2254 20% mortality rate 451 1803 2254 30% mortality rate 676 1578 2254 40% mortality rate 902 1352 2254 Table 3: Performance measures of the test set for the best-performing double-layered weight-elimination networks 50% mortality rate 1127 1127 2254 Expt MR=3.7 MR=10 MR=20 MR=30 MR=40 MR=50 Sens (%) 10.85 18.18 43.55 59.09 70.45 72.73 Spec (%) 98.26 96.38 93.83 82.37 74.67 68.47 CCR (%) 94.98 93.01 91.94 81.5 74.51 66.47 Cindex 0.9387 0.907 0.8978 0.8054 0.7345 0.6737 The performance of the weight-elimination ANNs that were trained using the true mortality rate (MR=3.7 percent)) and trained on the set with an artificial mortality rate of 20 percent (MR=20 percent) were compared with their no weightelimination counterparts (using only the SSE cost function) to identify the effect of the weightelimination penalty term on the ANN’s performance * MR = mortality rate Ennett CM, Frize M. Selective sampling to overcome skewed a priori probabilities with neural networks. Proc AMIA Symp 2000;(20 Suppl):225-9. Table 4: Comparison of ANNs with and without weight-elimination (WE) using the best-performing networks Expt Sens (%) Spec (%) CCR (%) C-index MR=3.7 WE 10.85 98.26 94.98 0.9387 MR=3.7 no WE 8.94 99.31 95.92 0.9453 MR=20 WE 43.55 93.83 91.94 0.8978 MR=20 no WE 24.05 93.22 90.63 0.8844 * MR = mortality rate DISCUSSION Given the results presented in Table 3, it is clear that training with a higher-than-normal prevalence of the under-represented outcome (here, in-hospital death) improves the sensitivity of the ANN using the test data at the cost of poorer results for the other performance measures. Typical C-index values for CABG mortality risk models range from 0.70 to 0.80. The some of the results found in these experiments exceeded these C-index values, however, their clinical relevance may be questionable given that few of the nonsurvivors are correctly classified (as noted by the sensitivity rate). The ANNs trained on higher prevalences are of more clinical value because more nonsurvivors are correctly classified. The higher risk patients are always more difficult to classify using risk stratification models, therefore, preference should be given to a model that is better able to classify these individuals [19]. The results of Table 4 indicate that the use of the weight-elimination cost function improved the ANN’s classification performance of the nonsurvivors (the sensitivity) without a dramatic effect (if any) on the other performance parameters. One of the limitations of this study is the decreased specificity that occurred when the mortality rate was increased, and sometimes when weight-elimination was used. Certainly, the best model would have high sensitivity and high specificity, whereas the models presented here permitted a lower specificity at the cost of a higher sensitivity. The weight-elimination cost function improved the sensitivity of the ANNs while only having a moderate effect on the other performance measures. FUTURE WORK Researchers from MIRG continue to work on applications of the weight-elimination technique to different medical databases including the neonatal ICU [20,21]. Future work will also focus on developing a useful model that has a better balance between sensitivity and specificity. Acknowledgments This work was completed with the support of grants from the Medical Research Council (MRC) of Canada and the Natural Sciences and Engineering Research Council (NSERC) of Canada. The authors would also like to acknowledge and thank Dr. Richard E. Shaw for providing access to the San Francisco Heart Institute patient database. References 1. 2. 3. 4. 5. CONCLUSION 6. Increasing the mortality rate of the training set when the a priori probabilities are skewed will increase the sensitivity of the weight-elimination ANN at the cost of the other performance measures. A model with a higher sensitivity may be of more clinical value since the nonsurvivors are more difficult to identify. Such a model would highlight the patients who may be at a higher risk of death following CABG surgery. 7. Parsonnet V, Dean D, Bernstein AD. A method of uniform stratification of risk for evaluating the results of surgery in acquired adult heart disease. Circulation 1989;79(Suppl 1):I3-I12. Higgins TI, Estanganous FG, Loop FD, Beck GJ, Blum JM, Paranandi L. Stratification of morbidity and mortality outcome by preoperative risk factors in coronary artery bypass patients. A clinical severity score. JAMA 1992;267:23442348. Edwards FH, Graeber GM. The theorem of Bayes as a clinical research tool. Surgery, Gynecology & Obstetrics 1987;165:127-129. Edwards FH, Albus RA, Zaitchuk R, et al. Use of a Bayesian statistical model for risk assessment in coronary artery surgery. Ann Thorac Surg 1988;45:437-440. Edwards FH, Clark RE, Schwartz M. Coronary artery bypass grafting: The Society of Thoracic Surgeons national database experience. Ann Thorac Surg 1994;57:12-19. Pliam MB, Shaw RE, Zapolanski A. Comparative analysis of coronary surgery risk stratification models, J Invas Cardiol 1997;9:203-222. Shroyer ALW, Grover FL, Edwards FH. 1995 coronary artery bypass risk model: the Society of Thoracic Surgeons adult cardiac national database. Ann Thorac Surg 1998;65:879-884. Ennett CM, Frize M. Selective sampling to overcome skewed a priori probabilities with neural networks. Proc AMIA Symp 2000;(20 Suppl):225-9. 8. 9. 10. 11. 12. 13. 14. 15. Shroyer ALW, Plomondon ME, Grover FL, Edwards FH. The 1996 coronary artery bypass risk model: The Society of Thoracic Surgeons adult cardiac national database. Ann Thorac Surg 1999;67:1205-1208. Lippmann RP, Shahian DM. Coronary artery bypass risk prediction using neural networks. Ann Thorac Surg 1997;63:1635-1643. Orr RK. Use of a probabilistic neural network to estimate the risk of mortality after cardiac surgery. Med Decis Making 1997;17(2):178-185. Ennett CM. Coronary surgery mortality prediction using artificial neural networks. Master’s thesis, School of Information Technology and Engineering, University of Ottawa: Ottawa, ON, 1999. Trigg HCE. An investigation of methods to enhance the performance of artificial neural networks used to estimate ICU outcomes. Master’s thesis, Department of Electrical Engineering, University of New Brunswick: Fredericton, NB, 1997. Ennett CM, Frize M. An investigation into the strengths and limitations of artificial neural networks: an application to an adult ICU patient database. Proc of AMIA’98 Fall Symposium, Orlando, FL, Nov 7-11, 1998:998. Frize M, Trigg HCE, Solven FG, Stevenson M, Nickerson BG. Decision-support systems designed for critical care. Proceedings of the 1997 AMIA Fall Symposium. 1997:855. Weigend AS, Rumelhart DE, Huberman BA. Back-propagation, weight-elimination and time series prediction. In DS Tourestzky, JL Elman, TJ Sejnowski, GE Hinton, eds. Proc of the 1990 16. 17. 18. 19. 20. 21. Connectionist Models Summer School. Morgan Kaufmann: San Mateo. 1990:105-116. Weigend AS, Rumelhart DE, Huberman BA. Generalization by weight-elimination with application to forecasting. In RP Lippmann, J Moody, DS Touretzky, eds. Advances in Neural Information Processing Systems (NIPS’90), Volume 3. Morgan Kaufmann: San Mateo. 1991;3:875-882. Katz S, Katz AS, Lowe N, Quijano RC. Neural net_bootstrap hybrid methods for prediction of complications in patients implanted with artificial heart valves. J Heart Valve Dis 1994 Jan;3(1):49_52. Frize M, Wang L, Ennett CM, Nickerson B, Solven FG, Stevenson M. New advances and validation of knowledge management tools for critical care using classifier techniques. Proceedings of the AMIA '98 Annual Symposium, Orlando, FL: Nov 7_11, 1998:553_558. Clark RE. Calculating risk and outcome: The Society of Thoracic Surgeons database. Ann Thorac Surg 1996;62:S2-S5. Tong Y, Frize M, Walker R. Estimating ventilation using artificial neural networks in intensive care units. First Joint BMES/EMBS Conference, Atlanta, GA: Oct 12_16, 1999. Walker R, Frize M, Tong Y. Data analysis using artificial neural networks and computer_aided decision_making in the neonatal intensive care unit. Paediatric Academy Society 1999 Annual Meeting, San Francisco, CA: May 2, Paediatric Research 1999;45:231A.