Chapter 2
Outcome prediction in intensive care: results of a prospective, multicentre, Portuguese study
Rui Moreno1, Pedro Morais1 on behalf of the Portuguese Severity Scoring Systems Study Groups of the Portuguese Intensive Care Society and of the Portuguese Society of Internal Medicine
1 Intensive Care Unit, Hospital de Santo António dos Capuchos, Lisboa, Portugal
Intensive Care Medicine 1997;23:177-186

INTRODUCTION
Developed in 1981 at George Washington University Medical Center, the Acute Physiology and Chronic Health Evaluation (APACHE) scoring system [1] was demonstrated to provide accurate and reliable measures of severity of illness in critically ill patients [2-4]. This model incorporated 34 variables, chosen and weighted by a panel of seven experts. The worst value of each variable during the first 32 h in the intensive care unit (ICU) was collected and weighted (from 0 to 4 points depending on the degree of abnormality), and the sum of the weights was the Acute Physiology Score. Its use was very complex, and in 1984 Le Gall et al. [5] published a simplified version of this model, known as the Simplified Acute Physiology Score (SAPS), widely used since then, especially in Europe. In 1985, Knaus et al. [6] published a simplified version of the APACHE system, the APACHE II. This uses the worst value recorded in the first 24 hours in the ICU for 12 physiologic variables (weighted from 0 to 4 points), age, surgical status (emergency surgery or elective surgery/non-surgical) and previous health status, and requires the selection of a primary reason for ICU admission for a logistic regression model that transforms scores into probabilities of mortality. This system soon became the most widely used scoring system worldwide and has been used in administration, planning, quality assurance, comparison of ICUs [7-17] and even to assess comparability of groups in clinical trials [18-20].
The third version, APACHE III [21], was prospectively evaluated in 17440 patients admitted to 40 hospitals in the United States in 1988-1989. This is now a proprietary system; the equations are not in the public domain and must be purchased from APACHE Medical Systems, Washington, DC. This has limited its use, especially outside the United States, although a study evaluating its performance in a cohort of Brazilian ICUs has been published recently [22,23]. This system comprises the APACHE III score, based on the worst values recorded in the first 24 h in the ICU for 18 physiologic variables, the Glasgow Coma Score (GCS) and seven co-morbid chronic health conditions, and the APACHE III predictive equation, which uses the APACHE III score and reference data on major disease categories and site of treatment immediately prior to ICU admission to provide an estimation of the risk of hospital mortality for individual ICU patients. In 1993 Le Gall et al. published a new system, SAPS II, based on a European/North American multicentre study [24]. In this study, statistical modelling techniques were used to select and weight the variables, and the evaluation of the risk of dying in hospital was based on a logistic regression model. SAPS II was developed and validated in a cohort of 12997 patients in ten countries in Europe, Canada and the United States. Evaluation of SAPS II or APACHE II in the target population is necessary before either is broadly used, since variations in case mix, local policies, quality of care and quality of data collection have been shown to affect the performance of the equations used to predict mortality [15,24], and at least one example in the literature shows that these models do not fit well in a population in Spain [25].
Although a recent study [26] shows good discrimination and calibration in an international database for these models, the population analysed is not an independent one but is the validation sub-group of the original population in which SAPS II was developed. Since the authors randomised the original total database into two groups - the development and validation samples - we should expect that all the variables (and case mix factors not measured) are randomly distributed in the two subgroups; for this reason, both are expected to represent equal samples from the same underlying distribution and cannot be considered to be independent samples. The extent to which they represent the general population of ICUs in Europe has not been established, but it is probable that the performance of SAPS II will be better on this than on a different sample. The aim of this study is to evaluate and compare the performance of SAPS II and APACHE II in a Portuguese population - a completely independent population from those used to develop the models - using formal statistical comparison, according to recent recommendations [27].

MATERIAL AND METHODS
Before the beginning of the study all mixed medical-surgical ICUs in Portugal (excluding the islands of Madeira and Azores) were invited to participate in the study by mail or personal communication by the Portuguese Severity Scores Study Group; 28 ICUs were invited and 19 (68 %) collaborated. In each of the ICUs a local co-ordinator was appointed (Appendix). Data collection took place from 15 December 1994 to 14 March 1995. During the study period, all consecutive admitted patients, 18 years or older, in the participating ICUs were enrolled; burn patients, acute coronary care and cardiac surgery patients, patients with missing data and patients with a length of stay in the ICU of less than 24 h were excluded from the final analysis.
If patients had been admitted more than once to the ICU during the study period, only the first admission was analysed. All patients who were still in hospital on 14 May 1995 (two months after the end of data collection) were dropped from the study. Each patient was described using a simple set of variables selected from the literature that included all the variables from the original SAPS II and APACHE II systems [6,24]. All data were collected as raw data, using the most abnormal values during the first 24 h after admission to the ICU. Basic demographic characteristics, including sex, age, type of patient (medical, acute coronary, scheduled surgical and unscheduled surgical), and principal diagnostic category of admission (using a list of 50 mutually exclusive diagnoses [6]) were also recorded. Operative definitions for previous organic insufficiency were defined according to Knaus et al. [6]. In sedated patients, the GCS was given a normal value and a zero weight assigned. The presence or absence of organ dysfunction during the first 24 h in the ICU was assessed by the organ system failure (OSF) score, as described by Knaus et al. [28]. The utilisation of manpower during the same period was evaluated using the Therapeutic Intervention Scoring System (TISS) [29] and the Simplified TISS [30]. ICUs had the choice to enter data on standardised forms or using a computer program made by the authors, available in IBM format, containing out-of-range and logical error-checking. In both cases data were checked for accuracy and completeness and instances of missing data were referred to local co-ordinators. Everyone involved had access to a manual with the protocols and definitions, according to the original definitions. During the study period support was provided to all participating ICUs by the co-ordinating centre. 
At the end of the study, quality control was carried out by the site co-ordinator completing a second set of forms for a 5 % random sample of patients in that ICU. During 1993, a pilot study was conducted in 6 units to assess methods and definitions of data collection and analysis, and all the techniques used were adapted to ensure maximum reliability. Patients were followed up to hospital discharge, and their survival status was then registered. To assess inter-observer reliability, original forms and quality control forms were compared, and discrepancies evaluated using the kappa statistic [31,32] and intraclass correlation coefficients [33] to determine if there was a good rate of agreement. Chi-square statistics were used to test for the statistical significance of categorical variables, and the t-test or one-way analysis of variance was used to assess continuous variables. All statistical tests were two-sided, and a significance level of 0.05 was used except when stated otherwise. To compare the length of stay in the ICU between survivors and non-survivors, we used the Mann-Whitney U test, since the distribution was highly skewed. Since there is no single standard measure for describing the overall goodness of fit of multiple logistic regression models, we decided to employ four different methods to assess discrimination and calibration. For discrimination - that is, the ability of the model to discriminate between patients who live and patients who die - we used 2 x 2 classification tables with decision criteria of 10, 50 and 90 %, and the area under the receiver operating characteristic (ROC) curve, computed by a modification of the Wilcoxon statistic, as proposed by Hanley and McNeil [34]. The comparison of the areas under ROC curves was done using the Z statistic with correction for the correlation introduced by studying the same sample [35].
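As an illustration of the discrimination measure described above, the following is a minimal Python sketch of the area under the ROC curve computed as a Wilcoxon (Mann-Whitney) statistic, together with a standard error from the approximation published by Hanley and McNeil. The function names and the toy risk values are hypothetical, and the text does not say whether the authors used this approximation or the non-parametric variance estimate, so the sketch should not be expected to reproduce the reported standard errors exactly.

```python
from math import sqrt


def auc_mann_whitney(risks_dead, risks_alive):
    """Area under the ROC curve as the probability that a randomly chosen
    non-survivor has a higher predicted risk than a randomly chosen
    survivor (ties count as one half)."""
    wins = 0.0
    for d in risks_dead:
        for a in risks_alive:
            if d > a:
                wins += 1.0
            elif d == a:
                wins += 0.5
    return wins / (len(risks_dead) * len(risks_alive))


def hanley_mcneil_se(auc, n_dead, n_alive):
    """Standard error of the AUC using the Hanley-McNeil approximation
    Q1 = A / (2 - A), Q2 = 2A^2 / (1 + A)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_dead - 1) * (q1 - auc ** 2)
                + (n_alive - 1) * (q2 - auc ** 2)) / (n_dead * n_alive)
    return sqrt(variance)


# Hypothetical predicted risks of death, for illustration only.
dead = [0.9, 0.7, 0.6, 0.4]
alive = [0.5, 0.3, 0.2, 0.2, 0.1]

auc = auc_mann_whitney(dead, alive)
se = hanley_mcneil_se(auc, len(dead), len(alive))
print(f"AUC = {auc:.3f}, SE = {se:.3f}")
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a sample of this size; a rank-based formulation would give the same value more efficiently.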
Calibration - that is, the degree of correspondence between predicted and observed mortality - was assessed by calibration curves and by the two chi-square statistics proposed by Hosmer and Lemeshow [36]: the H test, collapsing the table based on fixed values of the estimated probabilities and the C test, collapsing the table based on percentiles of the estimated probabilities. To compare the predictions between both models we also used McNemar’s chi-square test and established the correlation between the probabilities of dying in the hospital, as calculated by SAPS II and APACHE II. To evaluate the uniformity of fit - that is, the capability of the models to adjust between subgroups - we stratified patients using several strategies: by predicted risk of dying in the hospital, ICU, type of patient, age group, length of stay in the ICU and diagnostic category of admission and analysed mortality ratios. The small number of patients included in some of the groups precluded a more formal evaluation (including measures of discrimination and calibration) of the uniformity of fit of the models. The computation of confidence intervals for the ratio of observed number of deaths to expected number of deaths based on the model was done using a parametric approach, as described by Rapoport et al. [37]:

$$\frac{[\text{observed number of deaths}] \pm z_{1-\alpha/2}\,\sigma}{\text{expected number of deaths}}$$

where

$$\sigma = \sqrt{\sum_{i=1}^{n} \pi_i \,(1 - \pi_i)}$$

n is the number of patients in the ICU, $\pi_i$ is the SAPS II/APACHE II probability for the ith patient and $z_{1-\alpha/2}$ is the $(1-\alpha/2) \times 100$th percentile of the standard normal distribution. To use this technique we must assume that the underlying distribution is normal. So, we decided not to compute confidence intervals for the mortality ratio when the n was small, as in the case of mortality ratios by ICU.
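The confidence interval for the mortality ratio can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' program: the function name and the toy subgroup are hypothetical, and the standard deviation is taken as the square root of the summed Bernoulli variances of the individual predicted probabilities, which is the usual normal approximation for a sum of independent binary outcomes.

```python
from math import sqrt


def mortality_ratio_ci(observed_deaths, probabilities, z=1.96):
    """Observed/expected mortality ratio with a normal-approximation
    confidence interval: (observed +/- z * sigma) / expected."""
    expected = sum(probabilities)
    # Standard deviation of the number of deaths if each patient dies
    # independently with his or her predicted probability.
    sigma = sqrt(sum(p * (1.0 - p) for p in probabilities))
    ratio = observed_deaths / expected
    lower = (observed_deaths - z * sigma) / expected
    upper = (observed_deaths + z * sigma) / expected
    return ratio, lower, upper


# Hypothetical subgroup: 10 patients with a predicted risk of 0.2,
# 10 patients with a predicted risk of 0.5, and 8 observed deaths.
probs = [0.2] * 10 + [0.5] * 10
ratio, lower, upper = mortality_ratio_ci(8, probs)
print(f"mortality ratio = {ratio:.2f} ({lower:.2f} - {upper:.2f})")
```

With so few patients the interval is very wide, which illustrates why the intervals were not computed when n was small.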
The evaluation of sensitivity, specificity, positive and negative predictive values and overall correct classification, as well as the respective confidence intervals, was done according to Gardner and Altman [38]. All the data analysis and statistics were performed using the Statistical Package for Social Sciences, version 6.0.1.

RESULTS
During the study period, the 19 ICUs collected data on 1094 patients. After the exclusion of all the patients with missing data, an exclusion diagnosis or a length of stay in the ICU of less than 24 h, 982 patients remained (89.7 %). The mean number of patients analysed per ICU was 51.7. As Table 1 shows, most patients were male (67.7 %), with a mean (SD) age of 55.4 ± 19.1 years. There was a clear predominance of medical patients (68.2 %). Non-operative respiratory disease was the principal diagnostic category of admission, accounting for 32.9 % of the sample. A large number of patients (24.5 %) presented with previous organic insufficiency.

Table 1. Basic characteristics of the 982 patients analysed.
                                                     N               %
Number of patients                                   982             100
Male sex                                             665             67.7
Age, years (a)                                       55.4 ± 19.1
Type of patient
  Medical                                            670             68.2
  Scheduled surgery                                  120             12.2
  Unscheduled surgery                                192             19.6
Previous Organic Insufficiency                       241             24.5
Diagnostic category of admission
  Non-operative
    Respiratory                                      323             32.9
    Cardiovascular                                   139             14.2
    Trauma                                           55              5.6
    Neurological                                     37              3.8
    Other                                            16              1.6
    Non specified                                    79              8.0
  Post-operative                                     333             33.9
OSF
  Absent                                             273             27.8
  Respiratory                                        577             58.8
  Cardiovascular                                     262             26.7
  Renal                                              179             18.2
  Haematological                                     54              5.5
  Neurological                                       195             19.9
LOS (days) (b)                                       6 (3 - 12)
TISS (a)                                             32.5 ± 11.4
Simplified TISS (a)                                  30.4 ± 9.9
Interventions during the first 24 hours in the ICU
  Mechanical ventilation                             690             70.3
  Vasoactive drugs                                   470             47.9
  Total parenteral nutrition                         147             15.0
  Swan-Ganz catheter                                 52              5.3
  Arterial catheter                                  295             30.0
  Central venous catheter                            749             76.3
SAPS II (a)                                          41.4 ± 20.7
SAPS II predicted risk of death (a)                  32.6 ± 29.9
APACHE II (a)                                        19.6 ± 9.9
APACHE II predicted risk of death (a)                33.5 ± 27.4
ICU mortality                                        241             24.5
Hospital mortality                                   314             32.0

LOS, length of stay; TISS, Therapeutic Intervention Scoring System; SAPS II, new Simplified Acute Physiology Score; APACHE, Acute Physiology and Chronic Health Evaluation Score. (a): mean ± standard deviation; (b): median (interquartile range).

During the first 24 hours in the ICU, only 27.8 % of the patients had no organ failure; respiratory (58.8 %), cardiovascular (26.7 %), and renal (18.2 %) failure were the most frequent. All organ failures showed a significant relation to outcome (p < 0.001). Most of the patients received mechanical ventilation during the first 24 h in the ICU (70.3 %); vasoactive drugs (47.9 %) and total parenteral nutrition (15.0 %) were frequently used. Central venous catheterisation was used in 76.3 % of the patients, arterial catheterisation in 30.0 % and pulmonary artery catheter in 5.3 %.
The use of these techniques resulted in a high TISS score (32.5 ± 11.4 for the original TISS and 30.4 ± 9.9 for the Simplified TISS), which was higher in non-survivors than in survivors (38.2 ± 10.9 and 35.7 ± 9.3 vs 29.8 ± 10.6 and 27.8 ± 9.1, p < 0.001 for both). The overall mortality in the ICU was 24.5 % and the corresponding mortality in the hospital 32.0 %. Median length of stay in the ICU was 6 days (interquartile range 3-12 days), and this was higher in non-survivors (survivors: median 5 days, interquartile range 3-9 days; non-survivors: 7 days, 3-16 days; p = 0.003). The intraclass correlation coefficient was 0.88 for diagnostic category of admission, 0.90 for urinary output and for blood pressure, and > 0.95 for all other continuous variables. For categorical variables, kappa values were 0.43 (78.5 % agreement) for acute renal failure (according to APACHE II definition and/or OSF definition), 0.44 (85.7 % agreement) for previous organic insufficiency, 0.81 (92.8 % agreement) for cardiovascular failure, and 0.75 (92.8 % agreement) for neurological failure; all other categorical variables had a kappa > 0.90. Mean severity scores were high (SAPS II 41.4 ± 20.7, APACHE II 19.6 ± 9.9) and showed a significant relation to mortality (p < 0.001 for both). There were very large differences between ICUs in all the scores analysed, with mean ICU values ranging from 11.83 to 24.67 for APACHE II and 30.25 to 46.97 for SAPS II. To estimate the discriminative power of the models, we used the area under the ROC curve. The values were 0.817 (standard error 0.015) for SAPS II, 0.782 (0.016) for APACHE II and 0.787 (0.015) for APACHE II predicted risk of death. The area for SAPS II, although very good, is lower than the area of the original SAPS II model (0.823). When SAPS II and APACHE II curves were compared, we found a statistically significant difference (one-sided test, p < 0.001) between both methods. The ROC curves of the two models are plotted in Figure 1.

Figure 1.
Receiver operating characteristic (ROC) curves for the new Simplified Acute Physiology Score (SAPS II) (C) and the Acute Physiology and Chronic Health Evaluation (APACHE II) Score (*). The relationship between true positives (sensitivity) and false positives (1 minus specificity) is shown for both models.

Table 2 shows the classification tables for SAPS II and APACHE II. With a decision criterion of 10 %, sensitivity - that is, the proportion of deaths predicted by the model - was better for APACHE II (95.86 %) than for SAPS II (92.04 %); however, the false-positive rate was high for both (APACHE II 66.32 %, SAPS II 58.68 %); the overall correct classification was better for SAPS II (57.54 %) than for APACHE II (53.56 %). With a decision criterion of 50 %, SAPS II showed better sensitivity (57.01 % vs 53.18 %), with a similar false-positive rate (12.13 % vs 12.72 %) and a slightly better overall correct classification rate (78.00 % vs 76.37 %). At a decision criterion of 90 %, sensitivity was still better for SAPS II (20.06 % vs 11.78 %), the false-positive rate similar (0.90 % vs 0.75 %) and overall correct classification slightly better (73.83 % vs 71.28 %). We decided to perform a crosstabulation of the predictions by both models (SAPS II and APACHE II) at a fixed decision criterion (50 %) (Table 3). In the global population, the two methods predicted the same outcome in 858 patients (87.4 %). For survivors, the two methods predicted the same outcome in 606 patients (90.7 %). For the 62 patients (9.3 %) where the predictions did not agree, SAPS II predicted 33 correctly (53.2 %) while APACHE II predicted only 29 correctly (46.8 %). The difference was not significant (McNemar’s χ² = 0.15, 1 df, not significant). For non-survivors the two methods predicted the same outcome in 252 patients (80.3 %). For the 62 patients (19.7 %) where the predictions did not agree, SAPS II predicted 37 correctly (59.7 %) while APACHE II predicted only 25 correctly (40.3 %).
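The McNemar comparison applied to the discordant predictions can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes the continuity-corrected form of the statistic, which the text does not state explicitly but which is consistent with the values quoted for survivors and non-survivors.

```python
def mcnemar_chi2(b, c, continuity=True):
    """McNemar chi-square (1 df) from the two discordant counts b and c,
    here the number of patients classified correctly by one model only."""
    if b + c == 0:
        raise ValueError("no discordant pairs")
    diff = abs(b - c)
    if continuity:
        # Continuity correction: subtract 1 from the absolute difference.
        diff = max(diff - 1, 0)
    return diff ** 2 / (b + c)


# Discordant predictions at the 50 % decision criterion (from the text):
# correct by SAPS II only versus correct by APACHE II only.
chi2_survivors = mcnemar_chi2(33, 29)
chi2_non_survivors = mcnemar_chi2(37, 25)

# Critical value of the chi-square distribution with 1 df at the 0.05 level.
CRITICAL_1DF_005 = 3.84
print(round(chi2_survivors, 2), chi2_survivors > CRITICAL_1DF_005)
print(round(chi2_non_survivors, 2), chi2_non_survivors > CRITICAL_1DF_005)
```

Both statistics fall below the critical value of 3.84, in line with the non-significant differences reported for the two groups.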
The difference was again not significant (McNemar’s χ² = 1.95, 1 df, not significant).

Table 2. Classification tables for the new Simplified Acute Physiology Score (SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II) Score. In parentheses are the 95 % confidence intervals, computed according to Gardner and Altman [38].

                                     SAPS II                       APACHE II
                                     Predicted     Predicted       Predicted     Predicted
                                     to live       to die          to live       to die
                                     (No.)         (No.)           (No.)         (No.)
Decision criterion 10 %
  Observed survivors                 276           392             225           443
  Observed non-survivors             25            289             13            301
  Sensitivity                        92.04 (89.04 - 95.03)         95.86 (93.66 - 98.06)
  Specificity                        41.32 (37.58 - 45.05)         33.68 (30.10 - 37.27)
  Positive predictive value          42.44 (38.73 - 46.15)         40.46 (36.93 - 43.98)
  Negative predictive value          91.69 (88.58 - 94.81)         94.54 (91.65 - 97.42)
  Overall correct classification     57.54 (54.44 - 60.63)         53.56 (50.44 - 56.68)
Decision criterion 50 %
  Observed survivors                 587           81              583           85
  Observed non-survivors             135           179             147           167
  Sensitivity                        57.01 (51.53 - 62.48)         53.18 (47.67 - 58.70)
  Specificity                        87.87 (85.40 - 90.35)         87.28 (84.75 - 89.80)
  Positive predictive value          68.85 (63.22 - 74.48)         66.27 (60.43 - 72.11)
  Negative predictive value          81.30 (78.46 - 84.15)         79.86 (76.95 - 82.77)
  Overall correct classification     78.00 (75.41 - 80.59)         76.37 (73.72 - 79.03)
Decision criterion 90 %
  Observed survivors                 662           6               663           5
  Observed non-survivors             251           63              277           37
  Sensitivity                        20.06 (15.63 - 24.49)         11.78 (8.22 - 15.35)
  Specificity                        99.10 (98.39 - 99.82)         99.25 (98.60 - 99.91)
  Positive predictive value          91.30 (84.66 - 97.95)         88.10 (78.30 - 97.89)
  Negative predictive value          72.51 (69.61 - 75.40)         70.73 (67.62 - 73.45)
  Overall correct classification     73.83 (71.08 - 76.58)         71.28 (68.45 - 74.11)

No., number of patients

Table 3. Comparison of classification tables for the new Simplified Acute Physiology Score (SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II) Score at a decision criterion of 50 %.
Results are presented as number of patients and percentage in parentheses.

                                     APACHE II
                                     Predicted to live     Predicted to die
Global population
  SAPS II predicted to live          664 (67.6)            58 (5.9)
  SAPS II predicted to die           66 (6.7)              194 (19.8)
Survivors
  SAPS II predicted to live          554 (82.9)            33 (4.9)
  SAPS II predicted to die           29 (4.3)              52 (7.8)
Non-survivors
  SAPS II predicted to live          110 (35.0)            25 (8.0)
  SAPS II predicted to die           37 (11.8)             142 (45.2)

The calibration curves for SAPS II and APACHE II (Figure 2) demonstrated that as the predicted risk of hospital mortality (either by SAPS II or by APACHE II) increased, the proportion of patients who died also increased. However, at predicted risks > 70 %, the observed mortality within each risk group lay significantly below the diagonal line - that is, both models overestimate mortality in sicker patients.

Figure 2. Calibration curves. The solid line represents perfect correspondence between actual and predicted risk of death and the dotted line the observed versus predicted risk of death. Top: data for the new Simplified Acute Physiology Score (SAPS II); bottom: data for the Acute Physiology and Chronic Health Evaluation (APACHE II) Score. In both, bars indicate the distribution of patients in the groups analysed.

Hosmer-Lemeshow goodness-of-fit test H revealed a poor performance for both models (Table 4), though slightly better for SAPS II. Hosmer-Lemeshow goodness-of-fit test C gave a similar result (Table 5). This implies a significant lack of fit of both models.

Table 4. Hosmer-Lemeshow goodness-of-fit test H for the new Simplified Acute Physiology Score (SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II) Score.
Probability      Number of     Number of deaths           Number of survivors
(dying)          patients      Observed     Expected      Observed     Expected
SAPS II (a)
0.00             301           25           13.59         276          287.41
0.10             170           29           24.84         141          145.15
0.20             103           26           25.44         77           77.55
0.30             90            28           31.49         62           58.50
0.40             58            27           25.82         31           32.18
0.50             58            28           31.93         30           26.06
0.60             34            24           22.31         10           11.68
0.70             55            32           41.05         23           13.94
0.80             44            32           37.67         12           6.33
0.90 - 1.00      69            63           65.93         6            3.07
Total            982           314          320.10        668          661.89
APACHE II (b)
0.00             238           13           12.50         225          225.49
0.10             180           42           26.79         138          153.21
0.20             119           24           28.92         95           90.07
0.30             115           38           40.29         77           74.71
0.40             78            30           35.13         48           42.86
0.50             64            37           35.10         27           28.89
0.60             45            25           29.10         20           15.89
0.70             53            35           39.89         18           13.10
0.80             48            33           41.33         15           6.66
0.90 - 1.00      42            37           39.72         5            2.27
Total            982           314          328.80        668          653.19

(a): chi-square: 29.745, df: 10, p = 0.001
(b): chi-square: 32.704, df: 10, p = 0.0003

Table 5. Hosmer-Lemeshow goodness-of-fit test C for the new Simplified Acute Physiology Score (SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II) score.

Probability      Number of     Number of deaths           Number of survivors
(dying)          patients      Observed     Expected      Observed     Expected
SAPS II (a)
0.00             106           5            1.68          101          104.31
0.03             105           9            4.62          96           100.37
0.06             90            11           7.28          79           82.71
0.10             112           21           14.28         91           97.71
0.15             83            16           15.87         67           67.12
0.21             94            23           25.02         71           68.97
0.31             108           39           41.18         69           66.82
0.44             88            43           46.89         45           41.10
0.62             100           62           73.69         38           26.30
0.84 - 1.00      96            85           89.56         11           6.43
Total            982           314          320.10        668          661.89
APACHE II (b)
0.00             99            6            2.70          93           96.29
0.04             98            3            6.10          95           91.89
0.08             98            16           10.38         82           87.61
0.13             98            30           15.28         68           82.71
0.18             100           18           21.63         82           78.36
0.25             97            27           29.31         70           67.68
0.34             98            30           38.41         68           59.58
0.45             98            50           50.15         48           47.84
0.58             98            59           67.42         39           30.57
0.78 - 1.00      98            75           87.38         23           10.61
Total            982           314          328.80        668          653.19

(a): chi-square: 28.292, df: 10, p = 0.002
(b): chi-square: 49.664, df: 10, p < 0.0001

Predicted risk of dying in the hospital, as calculated by the two models, shows a highly significant correlation (multiple R: 0.827).
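As a check on how the Hosmer-Lemeshow statistics in Tables 4 and 5 are formed, the sketch below recomputes the H statistic for SAPS II from the tabulated group counts. It is an illustration only: the function names are hypothetical, the expected deaths are the rounded values printed in Table 4 (so the result agrees with the published statistic only up to rounding), and the p-value uses the closed-form upper tail of the chi-square distribution that holds for an even number of degrees of freedom.

```python
from math import exp, factorial


def hosmer_lemeshow(n_patients, observed_deaths, expected_deaths):
    """Sum over risk groups of (O - E)^2 / E for deaths and survivors."""
    statistic = 0.0
    for n, o, e in zip(n_patients, observed_deaths, expected_deaths):
        deaths_term = (o - e) ** 2 / e
        # Observed survivors are n - o, expected survivors are n - e.
        survivors_term = ((n - o) - (n - e)) ** 2 / (n - e)
        statistic += deaths_term + survivors_term
    return statistic


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; closed form valid for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of df")
    half = x / 2.0
    return exp(-half) * sum(half ** j / factorial(j) for j in range(df // 2))


# SAPS II rows of Table 4 (fixed cut-points of predicted risk).
n_patients = [301, 170, 103, 90, 58, 58, 34, 55, 44, 69]
observed = [25, 29, 26, 28, 27, 28, 24, 32, 32, 63]
expected = [13.59, 24.84, 25.44, 31.49, 25.82, 31.93, 22.31, 41.05, 37.67, 65.93]

h_statistic = hosmer_lemeshow(n_patients, observed, expected)
# 10 df, as reported in the table footnotes.
p_value = chi2_upper_tail_even_df(h_statistic, 10)
print(f"H = {h_statistic:.2f}, p = {p_value:.4f}")
```

Most of the statistic comes from the lowest risk group and from the groups with a predicted risk above 0.70, where the expected number of survivors is small.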
It is clear that, although highly correlated, the results from the two models are widely dispersed (Figure 3), with a significant number of outliers, as already described in other studies [39].

Figure 3. Plot of the Acute Physiology and Chronic Health Evaluation (APACHE II) Score versus the new Simplified Acute Physiology Score (SAPS II) probabilities of death. APACHE II and SAPS II are highly correlated (multiple R: 0.827) but a large number of outliers are visible. The predictions seem more closely related at the extremes of risk.

The analysis of uniformity of fit, that is, the capability of the models to adjust between subgroups, demonstrated variations in their performance across subgroups. When we stratified patients by risk of death, as is done when plotting a calibration curve, we could see that for more severely ill patients (as measured by the scores), both systems overestimate mortality (Figure 2). When patients were stratified by ICU, the differences were quite large, with mortality ratios ranging from 0.69 to 1.72 for SAPS II and from 0.42 to 1.63 for APACHE II. When patients were stratified by type, the number of observed deaths was similar to the number of predicted deaths for SAPS II (medical patients: predicted deaths 238.7, observed 235; scheduled surgery: predicted 18.4, observed 19; unscheduled surgery: predicted 62.9, observed 60), with somewhat larger differences for APACHE II (medical patients: predicted deaths 245.1, observed 235; scheduled surgery: predicted 17.7, observed 19; unscheduled surgery: predicted 66.1, observed 60). For age group, there were some variations in SAPS II and APACHE II, with mortality ratios varying from 0.85 to 1.25 for SAPS II and 0.87 to 1.19 for APACHE II (Table 6). However, in almost all cases, the confidence intervals for the mortality ratio included 1. The same is true when we stratify patients by diagnostic category (Table 7).

Table 6.
Mortality ratio by age group for the new Simplified Acute Physiology Score (SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II) Score. In parentheses are the 95 % confidence intervals for the mortality ratios, computed according to Gardner and Altman [38].

Age (years)      No.      Observed deaths     Predicted deaths     Mortality ratio
SAPS II
≤ 24             93       16                  18.8                 0.85 (0.53 - 1.17)
25 - 34          114      31                  24.9                 1.25 (0.99 - 1.51)
35 - 44          89       20                  20.9                 0.96 (0.68 - 1.24)
45 - 54          131      39                  36.7                 1.06 (0.84 - 1.28)
55 - 64          165      54                  56.9                 0.95 (0.79 - 1.11)
65 - 74          236      89                  92.5                 0.96 (0.83 - 1.09)
75 - 84          127      52                  56.0                 0.93 (0.77 - 1.08)
≥ 85             27       13                  13.4                 0.97 (0.67 - 1.27)
APACHE II
≤ 24             93       16                  18.4                 0.87 (0.53 - 1.21)
25 - 34          114      31                  26.0                 1.19 (0.91 - 1.47)
35 - 44          89       20                  22.0                 0.91 (0.62 - 1.20)
45 - 54          131      39                  41.0                 0.95 (0.74 - 1.16)
55 - 64          165      54                  57.3                 0.94 (0.77 - 1.11)
65 - 74          236      89                  95.5                 0.93 (0.80 - 1.06)
75 - 84          127      52                  57.7                 0.90 (0.74 - 1.06)
≥ 85             27       13                  11.0                 1.19 (0.82 - 1.55)

No., number of patients

Table 7. Mortality ratio by diagnostic category for the new Simplified Acute Physiology Score (SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II) Score. In parentheses are the 95 % confidence intervals for the mortality ratios, computed according to Gardner and Altman [38].

Diagnostic category      No.      Observed deaths     Predicted deaths     Mortality ratio
SAPS II
Non-operative
  Respiratory            326      101                 105.7                0.96 (0.83 - 1.08)
  Cardiovascular         142      78                  74.9                 1.04 (0.93 - 1.16)
  Trauma                 68       17                  20.7                 0.82 (0.53 - 1.11)
  Neurological           38       15                  14.6                 1.03 (0.70 - 1.35)
  Others                 17       5                   4.1                  1.23 (0.54 - 1.92)
  Non-specified          81       24                  20.5                 1.17 (0.89 - 1.45)
Post-operative           310      74                  79.7                 0.93 (0.78 - 1.08)
APACHE II
Non-operative
  Respiratory            326      101                 119.8                0.84 (0.72 - 0.97)
  Cardiovascular         142      78                  73.3                 1.06 (0.94 - 1.18)
  Trauma                 68       17                  12.4                 1.38 (0.95 - 1.80)
  Neurological           38       15                  19.5                 0.77 (0.51 - 1.03)
  Others                 17       5                   3.3                  1.50 (0.73 - 2.28)
  Non-specified          81       24                  20.6                 1.17 (0.84 - 1.49)
Post-operative           310      74                  79.9                 0.93 (0.77 - 1.08)

No., number of patients

DISCUSSION
Over the past 30 years, the development and dissemination of strategies for monitoring and treating severely ill patients has led to the creation of special places in hospitals, where all the necessary resources (human and technological) have been concentrated - the ICUs. These units soon became an essential part of acute patient care for all types of patients. The resultant increase in costs led to the need to rationalise policies of admission and medical care in these units, and concepts which were unknown in the medical world, such as efficiency and effectiveness, began to be accepted in the hospitals. Central to this was the necessity to quantify performance by defining the goals and outcomes of care. Therefore, over the past 15 years we have witnessed the development and application of several instruments to evaluate performance and, ultimately, to compare ICUs. In the 1980s [7] several investigators proposed the use of the ratio between observed and predicted deaths - the standardised mortality ratio - as one of those tools. This was done under the assumption that, although ICUs admit a very heterogeneous group of patients with large differences in physiologic reserve (e.g. age, previous health status) and acute health status (e.g. type of patient, admission diagnosis, presence and level of physiologic impairment), existing severity scores can account for most of these characteristics.
If the errors resulting from the collection of data and their application are small and randomly distributed, the final difference between predicted and observed mortality can be attributed to local differences in quality of care. This rationale has been challenged several times recently [14,15,40-43]. We can argue against at least two of these assumptions: firstly, intra- and inter-observer variability in data collection can have a significant impact on the performance of the severity scores, as argued by others [44,45], and, clearly, more research is needed on this issue; we want to stress that in this study the reliability of data collection was very high but the lack of generalised consensus in operative definitions makes comparison with other studies difficult. Secondly, the application of a model to a different population can only be done with confidence once the system has been tested and validated on that population, since variations in case mix not accounted for by the models can have a significant impact on their performance. This study demonstrated that both models failed to predict mortality accurately - that is, overall calibration was poor. This problem is more obvious for APACHE II than for SAPS II. When we compare the two models, we can see that discrimination was better for SAPS II than for APACHE II (0.817 vs 0.787, p < 0.001); the same is true for the percentage of correct classifications, both in survivors and in non-survivors, but the differences did not reach statistical significance. With respect to overall calibration, as measured by the Hosmer-Lemeshow H and C tests, both models presented statistically significant differences between predicted and observed mortality. The same can be observed in the calibration curves (Figure 2), with both models overestimating mortality in the most severely ill patients.
This is probably important, since this sample is a very different population from the one represented in the European/North American database [24] or in the database of Knaus et al. [6]: a higher percentage of non-surgical patients, with a longer length of stay in the ICU, higher severity scores, and higher mortality. This lack of overall goodness of fit can also be related to differences in population composition if the models fit the data in a non-uniform manner, as already observed with APACHE II in the United Kingdom [14,15,46] and with SAPS II in Spain [25] and Italy [43]. In this study, the small sample size was a relevant limiting factor in stratified analysis and precluded the formal evaluation of discrimination and calibration in clinically relevant subgroups. The hypothesis that the non-uniformity of fit can explain, at least in part, the poor performance of the models when applied to independent populations should be tested in larger samples. An alternative (and perhaps complementary) explanation for the poor performance of the models is the presence of other factors (clinical and non-clinical), not measured by the present severity scores, that can have a large influence on the performance of the ICUs. It should be noted that those factors are not randomly distributed between patients but clustered into ICUs; their effects on the performance of current models should be one of the main priorities of research in this field, and we may anticipate that next-generation severity scores will take into account not only the patient’s variability (that is, baseline characteristics, severity of disease) but also the variability among ICUs (the positive correlation between patients inside the same ICU introduced by local clinical and non-clinical factors that can influence outcome).
For now, these results preclude the use of both models to analyse quality of care or to compare performance between ICUs in this population, at least without prior customisation, as previously suggested in other studies [8,9,11-13,37,39,43,47].

ACKNOWLEDGEMENTS
The authors thank the participating staff of all the ICUs for their full collaboration in data collection. We also acknowledge the President of the Portuguese Intensive Care Society, Dr. Jorge Pimentel, and the President of the Portuguese Society of Internal Medicine, Prof. Dr. Levy Guerra, for their invaluable support during the planning and execution of the study.

REFERENCES
1. Knaus WA, Zimmerman JE, Wagner DP, Draper EA, Lawrence DE. APACHE - Acute Physiology and Chronic Health Evaluation: a physiologically based classification system. Crit Care Med 1981;9:591-7.
2. Knaus WA, Draper EA, Wagner DP, et al. Evaluating outcome from intensive care: A preliminary multihospital comparison. Crit Care Med 1982;10:491-6.
3. Knaus WA, Le Gall JR, Wagner DP, et al. A Comparison of Intensive Care in the U.S.A. and France. Lancet 1982;642-6.
4. Wagner DP, Draper EA, Abizanda Campos R, et al. Initial International use of APACHE: an acute severity of disease measure. Med Decis Making 1984;4:297.
5. Le Gall JR, Loirat P, Alperovitch A, et al. A Simplified Acute Physiologic Score for ICU patients. Crit Care Med 1984;12:975-7.
6. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med 1985;13:818-29.
7. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. An evaluation of outcome from intensive care in major medical centers. Ann Intern Med 1986;104:410-8.
8. Zimmerman JE, Knaus WA, Judson JA, et al. Patient selection for intensive care: a comparison of New Zealand and United States Hospitals. Crit Care Med 1988;16:318-25.
9. Marsh HM, Krishan I, Naessens JM, et al. Assessment of prediction of mortality by using the APACHE II scoring system in intensive care units. Mayo Clin Proc 1990;65:1549-57.
10. Castella X, Gilabert J, Torner F, Torres C. Mortality prediction models in intensive care: Acute Physiology and Chronic Health Evaluation II and Mortality Prediction Model compared. Crit Care Med 1991;19:191-7.
11. Sirio CA, Tajimi K, Tase C, et al. An initial comparison of intensive care in Japan and United States. Crit Care Med 1992;20:1207-15.
12. Zimmerman JE, Shortell SM, Rousseau DM, et al. Improving intensive care: observations based on organizational case studies in nine intensive care units: a prospective, multicenter study. Crit Care Med 1993;21:1443-51.
13. Knaus WA, Wagner DP, Zimmerman JE, Draper EA. Variations in mortality and length of stay in Intensive Care Units. Ann Intern Med 1993;118:753-61.
14. Rowan KM, Kerr JH, Major E, McPherson K, Short A, Vessey MP. Intensive Care Society's APACHE II study in Britain and Ireland - I: Variations in case mix of adult admissions to general intensive care units and impact on outcome. Br Med J 1993;307:972-7.
15. Rowan KM, Kerr JH, Major E, McPherson K, Short A, Vessey MP. Intensive Care Society's APACHE II study in Britain and Ireland - II: Outcome comparisons of intensive care units after adjustment for case mix by the American APACHE II method. Br Med J 1993;307:977-81.
16. Wong DT, Crofts SL, Gomez M, McGuire GP, Byrick RJ. Evaluation of predictive ability of APACHE II system and hospital outcome in Canadian intensive care unit patients. Crit Care Med 1995;23:1177-83.
17. Rowan KM, Kerr JH, Major E, McPherson K, Short A, Vessey MP. Intensive Care Society's Acute Physiology and Chronic Health Evaluation (APACHE II) study in Britain and Ireland: A prospective, multicenter, cohort study comparing two methods for predicting outcome for adult intensive care patients. Crit Care Med 1994;22:1392-401.
18. Greenman RL, Schein RMH, Martin MA, et al. A Controlled Clinical Trial of E5 Murine Monoclonal IgM Antibody to Endotoxin in the Treatment of Gram-Negative Sepsis. JAMA 1991;266:1097-102.
19. Ziegler EJ, Fisher CJ, Sprung CL, et al. Treatment of gram-negative bacteremia and septic shock with HA-1A human monoclonal antibody against endotoxin. A randomized, double-blind, placebo-controlled trial. N Engl J Med 1991;324:429-36.
20. Knaus WA, Harrell FE, Fisher CJ, et al. The clinical evaluation of new drugs for sepsis. A prospective study design based on survival analysis. JAMA 1993;270:1233-41.
21. Knaus WA, Wagner DP, Draper EA, et al. The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest 1991;100:1619-36.
22. Bastos PG, Sun X, Wagner DP, Knaus WA, Zimmerman JE, The Brazil APACHE III Study Group. Application of the APACHE III prognostic system in Brazilian intensive care units: a prospective multicenter study. Intensive Care Med 1996;22:564-70.
23. Bastos PG, Knaus WA, Zimmerman JE, Magalhães Jr A, Wagner DP, The Brazil APACHE III Study Group. The importance of technology for achieving superior outcomes from intensive care. Intensive Care Med 1996;22:664-9.
24. Le Gall JR, Lemeshow S, Saulnier F. A new Simplified Acute Physiology Score (SAPS II) based on a European / North American multicenter study. JAMA 1993;270:2957-63.
25. Abizanda Campos R, Rodriguez MT, Ferrandiz A, et al. Evaluation of SAPS II mortality prediction capability. Comparison with SAPS I and APACHE II. Intensive Care Med 1994;20:51.
26. Castella X, Artigas A, Bion J, Kari A, The European / North American Severity Study Group. A comparison of severity of illness scoring systems for intensive care unit patients: results of a multicenter, multinational study. Crit Care Med 1995;23:1327-35.
27. Hadorn DC, Keeler EB, Rogers WH, Brook RH. Assessing the performance of mortality prediction models. Santa Monica, CA, RAND/UCLA/Harvard Center for Health Care Financing Policy Research, 1993.
28. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. Prognosis in acute organ-system failure. Ann Surg 1985;202:685-93.
29. Keene AR, Cullen DJ. Therapeutic intervention scoring system: update 1983. Crit Care Med 1983;11:13.
30. Reis Miranda D, de Rijk A, Schaufeli W. Simplified Therapeutic Intervention Scoring System: The TISS 28 items - Results from a multicenter study. Crit Care Med 1996;24:64-73.
31. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measur 1960;20:37-46.
32. Kramer MS, Feinstein AR. Clinical biostatistics. LIV. The biostatistics of concordance. Clin Pharmacol Ther 1981;29:111-23.
33. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420-8.
34. Hanley J, McNeil B. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36.
35. Hanley J, McNeil B. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983;148:839-43.
36. Hosmer DW, Lemeshow S. Applied logistic regression. John Wiley & Sons, Inc., New York, 1989.
37. Rapoport J, Teres D, Lemeshow S, Gehlbach S. A method for assessing the clinical performance and cost-effectiveness of intensive care units: a multicenter inception cohort study. Crit Care Med 1994;22:1385-91.
38. Gardner MJ, Altman DG. Statistics with confidence. British Medical Journal, London, 1989.
39. Lemeshow S, Klar J, Teres D. Outcome prediction for individual intensive care patients: useful, misused, or abused? Intensive Care Med 1995;21:770-6.
40. Park RE, Brook RH, Kosecoff J, et al. Explaining variations in hospital death rates: randomness, severity of illness, quality of care. JAMA 1990;264:484-90.
41. Best WR, Comper DC. The ratio of observed-to-expected mortality as a quality of care indicator in nonsurgical VA patients. Med Care 1994;32:390-400.
42. Fisher M, Herkes RG. Intensive care: speciality without frontiers. In: Parker M, Shapiro MJ, Porembka DT, eds. Critical Care State of the Art. California: Society of Critical Care Medicine 1995, 9-27.
43. Apolone G, D'Amico R, Bertolini G, Iapichino G, Cattaneo A, De Salvo G, Melotti R. The performance of SAPS II in a cohort of patients admitted in 99 Italian ICUs: results from the GiViTI. Intensive Care Med 1996;22:1368-78.
44. Lemmonier E, Loirat P, Kleinknecht D, Brivet F, Landais P, and the French Study Group on ARF. Translation ambiguity and inter-observer variability of severity scoring systems. Intensive Care Med 1992;20:581.
45. Abizanda R, Balerdi B, Lopez J, et al. Fallos de prediccion de resultados mediante APACHE II. Analisis de los errores de prediccion de mortalidad en pacientes criticos. Med Clin Barc 1994;102:527-31.
46. Goldhill DR, Withington PS. The effects of casemix adjustment on mortality as predicted by APACHE II. Intensive Care Med 1996;22:415-9.
47. Moreno RP, Estrada H, Pereira E, Massa L. Movimento assistencial da Unidade de Cuidados Intensivos Polivalente do Hospital de Santo Antonio dos Capuchos. Acta Med Port 1994;7:13-20.

APPENDIX
List of co-authors (by hospital):
Centro Hospitalar de Coimbra: Dra. Paula Coutinho
H. Universidade de Coimbra: Dr. João Paulo Sousa
H. de Egas Moniz: Dra. Isabel Gaspar, Dr. Andrade Gomes
H. de Pulido Valente: Dr. Luis Tello
H. de St. António dos Capuchos: Dra. Ermelinda Pereira
H. de S. Francisco Xavier - UCIC: Dra. Ana Ferreira
H. de S. Francisco Xavier - UCIM: Dr. João Cunha, Dra. Margarida Resende
H. de St. Maria - UCIR: Dra. Gabriela Brum, Dr. João Valença
H. de St. Marta: Dra. Manuela Coelho, Dra. Alexandrina Quintino
H. Distrital de Évora: Dr. Luis Filipe Froes
H. de Vila Real: Dr. Celestino
H. Distrital do Barreiro: Dra. Fátima Campante
H. do Desterro: Dra. Maria José Serra
H. do SAMS: Dr. Sousa e Costa
H. Dr. José Maria Grande: Dr. Carlos Baeta
H. Garcia de Orta: Dr. Pedro Moreira
H. Geral de St. António - UCIP: Dr. Rui Seca
H. Senhora da Oliveira: Dr. Estevão Lafuente
H. de São Bernardo: Dra. Rosa Ribeiro
The study was co-ordinated by the Severity Scores Groups (Coordinator: Dr. R. Moreno) of the Portuguese Intensive Care Society and the Portuguese Society of Internal Medicine. Data analysis and statistics were done by Dr. R. Moreno. Computer programming was co-ordinated by Dr. P. Morais.