Chapter 2
Outcome prediction in intensive care: results of a prospective,
multicentre, Portuguese study
Rui Moreno1, Pedro Morais1
on behalf of the Portuguese Severity Scoring Systems Study Groups of the Portuguese
Intensive Care Society and of the Portuguese Society of Internal Medicine
1 Intensive Care Unit, Hospital de Santo António dos Capuchos, Lisboa, Portugal
Intensive Care Medicine 1997;23:177-186
INTRODUCTION
Developed in 1981 at George Washington University Medical Center, the Acute Physiology
and Chronic Health Evaluation (APACHE) scoring system [1] was demonstrated to provide
accurate and reliable measures of severity of illness in critically ill patients [2-4]. This model
incorporated 34 variables, chosen and weighted by a panel of seven experts. The worst value
for all the variables was collected during the first 32 h in the intensive care unit (ICU), its
weights summed (from 0 to 4 points depending on the degree of abnormality), and the final
result was the Acute Physiology Score. Its use was very complex, and in 1984 Le Gall et al.
[5] published a simplified version of this model, known as the Simplified Acute Physiology
Score (SAPS), widely used since then, especially in Europe.
In 1985, Knaus et al. [6] published a simplified version of the APACHE system, the
APACHE II. This uses the worst value recorded in the first 24 hours in the ICU for 12
physiologic variables (weighted from 0 to 4 points), age, surgical status (emergency surgery
or elective surgery/non-surgical) and previous health status and requires the selection of a
primary reason for ICU admission for a logistic regression model that transforms scores into
probabilities of mortality. This system soon became the most widely used scoring system worldwide
and has been used in administration, planning, quality assurance, comparison of ICUs [7-17]
and even to assess comparability of groups in clinical trials [18-20].
The third version, APACHE III [21], was prospectively evaluated in 17440 patients admitted
to 40 hospitals in the United States in 1988-1989. This is now a proprietary system; the
equations are not in the public domain and must be purchased from APACHE Medical
Systems, Washington, DC. This has limited its use, especially outside the United States,
although a study evaluating its performance in a cohort of Brazilian ICUs has been published
recently [22,23]. This system comprises the APACHE III score, based on the worst values
recorded in the first 24 h in the ICU for 18 physiologic variables, the Glasgow Coma Score
(GCS) and seven co-morbid chronic health conditions, and the APACHE III predictive
equation, which uses the APACHE III score and reference data on major disease categories
and site of treatment immediately prior to ICU admission to provide an estimation of the risk
of hospital mortality for individual ICU patients.
In 1993 Le Gall et al. published a new system, SAPS II, based on a European/North
American multicentre study [24]. In this study, statistical modelling techniques were used to
select and weigh the variables, and the evaluation of the risk of dying in hospital was based
on a logistic regression model. SAPS II was developed and validated in a cohort of 12997
patients in ten countries in Europe, Canada and the United States.
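To make the step from score to probability concrete, the following is a minimal sketch of the logistic transformation used by SAPS II. The coefficients are not given in this text; they are quoted here as an assumption from the published SAPS II equation [24] and should be checked against that source before any use. The function name is hypothetical.

```python
import math


def saps2_hospital_mortality(score: float) -> float:
    """Predicted probability of hospital death for a given SAPS II score.

    Assumption: coefficients taken from the published SAPS II equation [24];
    they do not appear in this chapter and should be verified there.
    """
    logit = -7.7631 + 0.0737 * score + 0.9971 * math.log(score + 1)
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Illustrative scores only; these are not patients from the study.
    for s in (20, 41, 60, 80):
        print(s, round(saps2_hospital_mortality(s), 3))
```

Because the transformation is non-linear, the probability computed at a mean score is not the same as the mean of the individual probabilities, which is why the models are evaluated on per-patient predictions.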
Evaluation of SAPS II or APACHE II in the target population is necessary before its broad
utilisation, since variations in case mix, local policies, quality of care and quality of data
collection have been shown to affect the performance of the equations used to predict
mortality [15,24], and at least one example in the literature shows that these models do not
fit well in a population in Spain [25].
Although a recent study [26] shows good discrimination and calibration in an international
database for these models, the population analysed is not an independent one but is the
validation sub-group of the original population in which SAPS II was developed. Since the
authors randomised the original total database into two groups - the development and
validation samples - we should expect that all the variables (and case mix factors not
measured) are randomly distributed in the two subgroups; for this reason, both are expected
to represent equal samples from the same underlying distribution and cannot be considered
to be independent samples. The extent to which they represent the general population of
ICUs in Europe has not been established, but it is probable that the performance of SAPS II
will be better on this than on a different sample.
The aim of this study is to evaluate and compare the performance of SAPS II and APACHE
II in a Portuguese population - a completely independent population from those used to
develop the models - using formal statistical comparison, according to recent
recommendations [27].
MATERIAL AND METHODS
Before the beginning of the study all mixed medical-surgical ICUs in Portugal (excluding the
islands of Madeira and Azores) were invited to participate in the study by mail or personal
communication by the Portuguese Severity Scores Study Group; 28 ICUs were invited, 19
(68 %) collaborated. In each of the ICUs a local co-ordinator was appointed (Appendix).
Data collection took place from 15 December 1994 to 14 March 1995. During the study
period, all consecutive admitted patients, 18 years or older, in the participating ICUs were
enrolled; burn patients, acute coronary care and cardiac surgery patients, patients with
missing data and patients with a length of stay in the ICU of less than 24 h were excluded
from the final analysis. If patients had been admitted more than once to the ICU during the
study period, only the first admission was analysed. All patients who were still in hospital on
14 May 1995 (two months after the end of data collection) were dropped from the study.
Each patient was described using a simple set of variables selected from the literature that
included all the variables from the original SAPS II and APACHE II systems [6,24]. All data
were collected as raw data, using the most abnormal values during the first 24 h after
admission to the ICU. Basic demographic characteristics, including sex, age, type of patient
(medical, acute coronary, scheduled surgical and unscheduled surgical), and principal
diagnostic category of admission (using a list of 50 mutually exclusive diagnoses [6]) were
also recorded. Operative definitions for previous organic insufficiency were defined according
to Knaus et al. [6]. In sedated patients, the GCS was given a normal value and a zero weight
assigned.
The presence or absence of organ dysfunction during the first 24 h in the ICU was assessed
by the organ system failure (OSF) score, as described by Knaus et al. [28]. The utilisation of
manpower during the same period was evaluated using the Therapeutic Intervention Scoring
System (TISS) [29] and the Simplified TISS [30].
ICUs had the choice to enter data on standardised forms or using a computer program made
by the authors, available in IBM format, containing out-of-range and logical error-checking.
In both cases data were checked for accuracy and completeness and instances of missing data
were referred to local co-ordinators. Everyone involved had access to a manual with the
protocols and definitions, according to the original definitions. During the study period
support was provided to all participating ICUs by the co-ordinating centre.
At the end of the study, quality control was carried out by the site co-ordinator completing
a second set of forms for a 5 % random sample of patients in that ICU. During 1993, a pilot
study was conducted in 6 units to assess methods and definitions of data collection and
analysis, and all the techniques used were adapted to ensure maximum reliability.
Patients were followed up to hospital discharge, and their survival status was then registered.
To assess inter-observer reliability, original forms and quality control forms were compared,
and discrepancies evaluated using the kappa statistic [31,32] and intraclass correlation
coefficients [33] to determine if there was a good rate of agreement.
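As an illustration of why the kappa statistic is reported alongside raw agreement, the sketch below computes Cohen's kappa from a square agreement table. The counts are hypothetical and are not taken from the study's quality control forms; the function name is likewise an assumption.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table.

    Rows are the categories recorded on the original form, columns those
    recorded on the quality control form.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance alone, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical 2 x 2 table: 70 % raw agreement between the two forms.
hypothetical_table = [[20, 5], [10, 15]]
kappa = cohen_kappa(hypothetical_table)

if __name__ == "__main__":
    print(round(kappa, 2))
```

In this made-up example a raw agreement of 70 % corresponds to a kappa of only 0.4, since half of the agreement would be expected by chance.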
Chi-square statistics were used to test for the statistical significance of categorical variables
and t-test or one-way analysis of variance were used to assess continuous variables. All
statistical tests were two-sided, and a significance level of 0.05 was used except when stated
otherwise. To compare the length of stay in the ICU between survivors and non-survivors,
we used the Mann-Whitney U test, since the distribution was highly skewed.
Since there is no single standard measure for describing the overall goodness of fit of multiple
logistic regression models, we decided to employ four different methods to assess
discrimination and calibration. For discrimination - that is, the ability of the model to
discriminate between patients who live and patients who die - we used 2 x 2 classification
tables with decision criteria of 10, 50 and 90 %, and the area under the receiver operating
characteristic (ROC) curve, computed by a modification of the Wilcoxon statistics, as
proposed by Hanley and McNeil [34]. The comparison of the areas under ROC curves was
done using the Z statistic with correction for the correlation introduced by studying the same
sample [35].
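The discrimination measures described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the area is computed as the Wilcoxon (Mann-Whitney) proportion of concordant pairs, the standard error uses the closed-form approximation of Hanley and McNeil [34], and the Z statistic takes the correlation r between the two areas as an input (in [35] it is read from a table; here it is simply an assumed parameter). Function names and the small data set are hypothetical.

```python
import math


def auc_wilcoxon(risk_died, risk_survived):
    """Area under the ROC curve as the proportion of concordant pairs."""
    wins = 0.0
    for d in risk_died:
        for s in risk_survived:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5  # ties count as half
    return wins / (len(risk_died) * len(risk_survived))


def hanley_mcneil_se(auc, n_died, n_survived):
    """Standard error of the area (Hanley and McNeil approximation)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_died - 1) * (q1 - auc ** 2)
        + (n_survived - 1) * (q2 - auc ** 2)
    ) / (n_died * n_survived)
    return math.sqrt(variance)


def z_correlated_areas(auc1, se1, auc2, se2, r):
    """Z statistic for two areas estimated on the same patients."""
    return (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)


if __name__ == "__main__":
    # Hypothetical predicted risks, not study data.
    died = [0.9, 0.8, 0.4]
    survived = [0.7, 0.3, 0.4, 0.1]
    print(auc_wilcoxon(died, survived))
    # Sample sizes as in this study (314 deaths, 668 survivors).
    print(round(hanley_mcneil_se(0.817, 314, 668), 4))
```

The correction term in the denominator of the Z statistic matters: a positive correlation between the two areas shrinks the standard error of their difference, so ignoring it would make the comparison unduly conservative.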
Calibration - that is, the degree of correspondence between predicted and observed mortality
- was assessed by calibration curves and by the two chi-square statistics proposed by Hosmer
and Lemeshow [36]: the H test, collapsing the table based on fixed values of the estimated
probabilities and the C test, collapsing the table based on percentiles of the estimated
probabilities.
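As a worked check of the goodness-of-fit statistic, the sketch below recomputes the H test for SAPS II from the ten risk strata reported in Table 4 (number of patients, observed deaths and expected deaths per stratum). Expected survivors are derived as patients minus expected deaths, so the result differs from the published value only by rounding. The p value uses the closed-form chi-square tail for an even number of degrees of freedom, with the 10 degrees of freedom reported in the table; the function names are hypothetical.

```python
import math


def hosmer_lemeshow(n_patients, observed_deaths, expected_deaths):
    """Chi-square statistic summed over deaths and survivors in each stratum."""
    chi2 = 0.0
    for n, obs, exp in zip(n_patients, observed_deaths, expected_deaths):
        chi2 += (obs - exp) ** 2 / exp                # deaths
        chi2 += ((n - obs) - (n - exp)) ** 2 / (n - exp)  # survivors
    return chi2


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df."""
    if df % 2 != 0:
        raise ValueError("closed form only valid for even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# SAPS II strata from Table 4 (fixed cut-points of predicted risk).
patients = [301, 170, 103, 90, 58, 58, 34, 55, 44, 69]
observed = [25, 29, 26, 28, 27, 28, 24, 32, 32, 63]
expected = [13.59, 24.84, 25.44, 31.49, 25.82, 31.93, 22.31, 41.05, 37.67, 65.93]

h_statistic = hosmer_lemeshow(patients, observed, expected)
p_value = chi2_upper_tail_even_df(h_statistic, 10)

if __name__ == "__main__":
    print(round(h_statistic, 2), round(p_value, 4))
```

Most of the statistic comes from the lowest stratum, where deaths are under-predicted, and from the three highest strata, where the small expected number of survivors inflates each term.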
To compare the predictions between both models we also used McNemar’s chi-square test
and established the correlation between the probabilities of dying in the hospital, as calculated
by SAPS II and APACHE II.
To evaluate the uniformity of fit - that is, the capability of the models to adjust between
subgroups - we stratified patients using several strategies: by predicted risk of dying in the
hospital, ICU, type of patient, age group, length of stay in the ICU and diagnostic category
of admission and analysed mortality ratios. The small number of patients included in some of
the groups precluded a more formal evaluation (including measures of discrimination and
calibration) of the uniformity of fit of the models.
The computation of confidence intervals for the ratio of observed number of deaths to
expected number of deaths based on the model was done using a parametric approach, as
described by Rapoport et al. [37]:

$$\frac{[\text{observed number of deaths}] \pm z_{1-\alpha/2}\,\sigma}{\text{expected number of deaths}}, \qquad \text{where} \quad \sigma = \sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)},$$

n is the number of patients in the ICU, $\pi_i$ is the SAPS II/APACHE II probability for the ith
patient and $z_{1-\alpha/2}$ is the $(1-\alpha/2) \times 100$th percentile of the standard normal distribution. To use
this technique we must assume that the underlying distribution is normal. So, we decided not
to compute confidence intervals for the mortality ratio when n was small, as in the case
of mortality ratios by ICU.
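A minimal sketch of this interval is given below, with sigma taken as the square root of the summed binomial variances of the individual predictions. The per-patient probabilities are hypothetical, since individual predictions are not reported here, and the function name is an assumption.

```python
import math


def mortality_ratio_ci(observed_deaths, probabilities, z=1.96):
    """Observed/expected mortality ratio with a normal-approximation interval.

    probabilities: model-predicted risk of hospital death for each patient.
    z: standard normal percentile (1.96 for a 95 % interval).
    """
    expected = sum(probabilities)
    sigma = math.sqrt(sum(p * (1.0 - p) for p in probabilities))
    ratio = observed_deaths / expected
    lower = (observed_deaths - z * sigma) / expected
    upper = (observed_deaths + z * sigma) / expected
    return ratio, lower, upper


if __name__ == "__main__":
    # Four hypothetical patients, each with a predicted risk of 0.5.
    print(mortality_ratio_ci(2, [0.5, 0.5, 0.5, 0.5]))
```

With only four patients the interval spans almost the whole range from 0 to 2, which illustrates why intervals were not computed for strata with few patients.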
The evaluation of sensitivity, specificity, positive and negative predictive values and overall
correct classification, as well as the respective confidence intervals, was done according to
Gardner and Altman [38].
All the data analysis and statistics were performed using the Statistical Package for Social
Sciences, version 6.0.1.
RESULTS
During the study period, the 19 ICUs collected data on 1094 patients. After the exclusion of
all the patients with missing data, an exclusion diagnosis or a length of stay in the ICU of less
than 24 h, 982 patients remained (89.7 %). The mean number of patients analysed per ICU
was 51.7. As Table 1 shows, most patients were male (67.7 %), with a mean (SD) age of
55.4 ± 19.1 years. There was a clear predominance of medical patients (68.2 %). Non-operative respiratory disease was the principal diagnostic category of admission, accounting
for 32.9 % of the sample. A large number of patients (24.5 %) presented with previous
organic insufficiency.
Table 1. Basic characteristics of the 982 patients analysed.

Characteristic                                        N              %
Number of patients                                    982            100
Male sex                                              665            67.7
Age, years (a)                                        55.4 ± 19.1
Type of patient
  Medical                                             670            68.2
  Scheduled surgery                                   120            12.2
  Unscheduled surgery                                 192            19.6
Previous organic insufficiency                        241            24.5
Diagnostic category of admission
  Non-operative
    Respiratory                                       323            32.9
    Cardiovascular                                    139            14.2
    Trauma                                            55             5.6
    Neurological                                      37             3.8
    Other                                             16             1.6
    Non-specified                                     79             8.0
  Post-operative                                      333            33.9
OSF
  Absent                                              273            27.8
  Respiratory                                         577            58.8
  Cardiovascular                                      262            26.7
  Renal                                               179            18.2
  Haematological                                      54             5.5
  Neurological                                        195            19.9
LOS, days (b)                                         6 (3 - 12)
TISS (a)                                              32.5 ± 11.4
Simplified TISS (a)                                   30.4 ± 9.9
Interventions during the first 24 hours in the ICU
  Mechanical ventilation                              690            70.3
  Vasoactive drugs                                    470            47.9
  Total parenteral nutrition                          147            15.0
  Swan-Ganz catheter                                  52             5.3
  Arterial catheter                                   295            30.0
  Central venous catheter                             749            76.3
SAPS II (a)                                           41.4 ± 20.7
SAPS II predicted risk of death (a)                   32.6 ± 29.9
APACHE II (a)                                         19.6 ± 9.9
APACHE II predicted risk of death (a)                 33.5 ± 27.4
ICU mortality                                         241            24.5
Hospital mortality                                    314            32.0

LOS, length of stay; TISS, Therapeutic Intervention Scoring System; SAPS II, new Simplified Acute Physiology
Score; APACHE, Acute Physiology and Chronic Health Evaluation Score.
(a) mean ± standard deviation; (b) median (interquartile range).
During the first 24 hours in the ICU, only 27.8 % of the patients had no organ failure;
respiratory (58.8 %), cardiovascular (26.7 %), and renal (18.2 %) failure were the most
frequent. All organ failures showed a significant relation to outcome (p < 0.001).
Most of the patients received mechanical ventilation during the first 24 h in the ICU (70.3 %);
vasoactive drugs (47.9 %) and total parenteral nutrition (15.0 %) were frequently used.
Central venous catheterisation was used in 76.3 % of the patients, arterial catheterisation in
30.0 % and pulmonary artery catheter in 5.3 %. The use of these techniques resulted in a high
TISS score (32.5 ± 11.4 for the original TISS and 30.4 ± 9.9 for the Simplified TISS), which
was higher in non-survivors than in survivors (38.2 ± 10.9 and 35.7 ± 9.3 vs 29.8 ± 10.6 and
27.8 ± 9.1, p < 0.001 for both).
The overall mortality in the ICU was 24.5 % and the corresponding mortality in the hospital
32.0 %. Median length of stay in the ICU was 6 days (interquartile range 3-12 days), and this
was higher in non-survivors (survivors: median 5 days, interquartile range 3-9 days; non-survivors: 7 days, 3-16 days; p = 0.003).
The intraclass correlation coefficient was 0.88 for diagnostic category of admission, 0.90 for
urinary output and for blood pressure, and > 0.95 for all other continuous variables. For
categorical variables, kappa values were 0.43 (78.5 % agreement) for acute renal failure
(according to APACHE II definition and/or OSF definition), 0.44 (85.7 % agreement) for
previous organic insufficiency, 0.81 (92.8 % agreement) for cardiovascular failure, and 0.75
(92.8 % agreement) for neurological failure; all other categorical variables had a kappa > 0.90.
Mean severity scores were high (SAPS II 41.4 ± 20.7, APACHE II 19.6 ± 9.9) and showed
a significant relation to mortality (p < 0.001 for both). There were very large differences
between ICUs in all the scores analysed, with mean ICU values ranging from 11.83 to 24.67
for APACHE II and 30.25 to 46.97 for SAPS II.
To estimate the discriminative power of the models, we used the area under the ROC curve.
The values were 0.817 (standard error 0.015) for SAPS II, 0.782 (0.016) for APACHE II
and 0.787 (0.015) for APACHE II predicted risk of death. The area for SAPS II, although
very good, is lower than the area of the original SAPS II model (0.823). When SAPS II and
APACHE II curves were compared, we found a statistically significant difference (one-sided test,
p < 0.001) between the two methods. Figure 1 plots the ROC curves of the two models.
Figure 1. Receiver operating characteristic (ROC) curves for the new Simplified Acute
Physiology Score (SAPS II) (C) and the Acute Physiology and Chronic Health Evaluation
(APACHE II) Score (*). The relationship between true positives (sensitivity) and false
positives (1 minus specificity), is shown for both models.
Table 2 shows the classification tables for SAPS II and APACHE II. With a decision criterion
of 10 %, sensitivity - that is, the proportion of deaths predicted by the model - was better for
APACHE II (95.86 %) than for SAPS II (92.04 %); however, the false-positive rate was high
for both (APACHE II 66.32 %, SAPS II 58.68 %); the overall correct classification was
better for SAPS II (57.54 %) than for APACHE II (53.56 %). With a decision criterion of 50
%, SAPS II showed better sensitivity (57.01 % vs 53.18 %), with a similar false-positive rate
(12.13 % vs 12.72 %) and a slightly better overall correct classification rate (78.00 % vs
76.37 %). At a decision criterion of 90 %, sensitivity was still better for SAPS II (20.06 % vs
11.78 %), the false-positive rate similar (0.90 % vs 0.75 %) and overall correct classification
slightly better (73.83 % vs 71.28 %).
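The figures quoted above can be recomputed directly from the counts in Table 2. The sketch below does so for SAPS II at the 50 % decision criterion, using a simple normal-approximation interval for each proportion; that this is the interval the authors used is an assumption, although it reproduces the published limits to two decimal places. Function names are hypothetical.

```python
import math


def proportion_with_ci(successes, total, z=1.96):
    """Proportion and normal-approximation 95 % interval, as percentages."""
    p = successes / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return 100.0 * p, 100.0 * (p - half_width), 100.0 * (p + half_width)


def classification_summary(tn, fp, fn, tp):
    """Summary measures of a 2 x 2 table (positive = predicted to die)."""
    return {
        "sensitivity": proportion_with_ci(tp, tp + fn),
        "specificity": proportion_with_ci(tn, tn + fp),
        "positive predictive value": proportion_with_ci(tp, tp + fp),
        "negative predictive value": proportion_with_ci(tn, tn + fn),
        "overall correct classification": proportion_with_ci(tp + tn, tp + tn + fp + fn),
    }


# SAPS II, decision criterion 50 % (Table 2):
# survivors 587 predicted to live, 81 to die; non-survivors 135 and 179.
saps2_at_50 = classification_summary(tn=587, fp=81, fn=135, tp=179)

if __name__ == "__main__":
    for name, (value, low, high) in saps2_at_50.items():
        print(f"{name}: {value:.2f} ({low:.2f} - {high:.2f})")
```

The false-positive rate quoted in the text is simply 100 minus the specificity.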
We decided to perform a crosstabulation of the predictions by both models (SAPS II and
APACHE II) at a fixed decision criterion (50 %) (Table 3). In the global population, the two
methods predicted the same outcome in 858 patients (87.4 %). For survivors, the two
methods predicted the same outcome in 606 patients (90.7 %). For the 62 patients (9.3 %)
where the predictions did not agree, SAPS II predicted 33 correctly (53.2 %) while APACHE
II predicted only 29 correctly (46.8 %). The difference was not significant (McNemar’s χ²
0.15, 1 df). For non-survivors the two methods predicted the same outcome
in 252 patients (80.3 %). For the 62 patients (19.7 %) where the predictions did not agree,
SAPS II predicted 37 correctly (59.7 %) while APACHE II predicted only 25 correctly (40.3
%). The difference was again not significant (McNemar’s χ² 1.95, 1 df).
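McNemar's test uses only the discordant pairs, that is, the patients for whom exactly one model was correct. The sketch below applies the continuity-corrected form of the statistic to the counts quoted above; the use of the continuity correction is an inference from the fact that it reproduces the reported values, not something stated in the text. The function name is hypothetical.

```python
def mcnemar_chi2(only_first_correct, only_second_correct, continuity=True):
    """McNemar chi-square (1 df) from the two discordant counts."""
    difference = abs(only_first_correct - only_second_correct)
    if continuity:
        difference = max(difference - 1, 0)
    return difference ** 2 / (only_first_correct + only_second_correct)


# Discordant predictions at the 50 % criterion (SAPS II correct, APACHE II correct).
chi2_survivors = mcnemar_chi2(33, 29)
chi2_non_survivors = mcnemar_chi2(37, 25)

if __name__ == "__main__":
    print(round(chi2_survivors, 2), round(chi2_non_survivors, 2))
```

Both values fall well below 3.84, the critical value of the chi-square distribution with 1 degree of freedom at the 0.05 level.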
Table 2. Classification tables for the new Simplified Acute Physiology Score (SAPS II) and
the Acute Physiology and Chronic Health Evaluation (APACHE II) Score. In parentheses are
the 95 % confidence intervals, computed according to Gardner and Altman [38].

                                  SAPS II predicted               APACHE II predicted
                                  to live (No.)   to die (No.)    to live (No.)   to die (No.)
Decision criterion 10 %
  Observed survivors              276             392             225             443
  Observed non-survivors          25              289             13              301
  Sensitivity                     92.04 (89.04 - 95.03)           95.86 (93.66 - 98.06)
  Specificity                     41.32 (37.58 - 45.05)           33.68 (30.10 - 37.27)
  Positive predictive value       42.44 (38.73 - 46.15)           40.46 (36.93 - 43.98)
  Negative predictive value       91.69 (88.58 - 94.81)           94.54 (91.65 - 97.42)
  Overall correct classification  57.54 (54.44 - 60.63)           53.56 (50.44 - 56.68)
Decision criterion 50 %
  Observed survivors              587             81              583             85
  Observed non-survivors          135             179             147             167
  Sensitivity                     57.01 (51.53 - 62.48)           53.18 (47.67 - 58.70)
  Specificity                     87.87 (85.40 - 90.35)           87.28 (84.75 - 89.80)
  Positive predictive value       68.85 (63.22 - 74.48)           66.27 (60.43 - 72.11)
  Negative predictive value       81.30 (78.46 - 84.15)           79.86 (76.95 - 82.77)
  Overall correct classification  78.00 (75.41 - 80.59)           76.37 (73.72 - 79.03)
Decision criterion 90 %
  Observed survivors              662             6               663             5
  Observed non-survivors          251             63              277             37
  Sensitivity                     20.06 (15.63 - 24.49)           11.78 (8.22 - 15.35)
  Specificity                     99.10 (98.39 - 99.82)           99.25 (98.60 - 99.91)
  Positive predictive value       91.30 (84.66 - 97.95)           88.10 (78.30 - 97.89)
  Negative predictive value       72.51 (69.61 - 75.40)           70.73 (67.62 - 73.45)
  Overall correct classification  73.83 (71.08 - 76.58)           71.28 (68.45 - 74.11)

No., number of patients
Table 3. Comparison of classification tables for the new Simplified Acute Physiology Score
(SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II) Score at
a decision criterion of 50 %. Results are presented as number of patients and percentage in
parentheses.

                              APACHE II
SAPS II                       Predicted to live    Predicted to die
Global population
  Predicted to live           664 (67.6)           58 (5.9)
  Predicted to die            66 (6.7)             194 (19.8)
Survivors
  Predicted to live           554 (82.9)           33 (4.9)
  Predicted to die            29 (4.3)             52 (7.8)
Non-survivors
  Predicted to live           110 (35.0)           25 (8.0)
  Predicted to die            37 (11.8)            142 (45.2)
The calibration curves for SAPS II and APACHE II (Figure 2) demonstrated that as the
predicted risk of hospital mortality (either by SAPS II or by APACHE II) increased, the
proportion of patients who died also increased. However, at predicted risks > 70 %, the
observed mortality within each risk group lay significantly below the diagonal line - that is,
both models overestimate mortality in sicker patients.
Figure 2. Calibration curves. The solid line represents perfect correspondence between
actual and predicted risk of death and the dotted line the observed versus predicted risk of
death. Top: data for the new Simplified Acute Physiology Score (SAPS II); bottom: data for
the Acute Physiology and Chronic Health Evaluation (APACHE II) Score. In both, bars
indicate the distribution of patients in the groups analysed.
Hosmer-Lemeshow goodness-of-fit test H revealed a poor performance for both models
(Table 4), though slightly better for SAPS II. Hosmer-Lemeshow goodness-of-fit test C gave
a similar result (Table 5). This implies a significant lack of fit of both models.
Table 4. Hosmer-Lemeshow goodness-of-fit test H for the new Simplified Acute Physiology
Score (SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II)
Score.

Probability     Number of     Number of deaths          Number of survivors
(dying)         patients      Observed    Expected      Observed    Expected
SAPS II (a)
0.00            301           25          13.59         276         287.41
0.10            170           29          24.84         141         145.15
0.20            103           26          25.44         77          77.55
0.30            90            28          31.49         62          58.50
0.40            58            27          25.82         31          32.18
0.50            58            28          31.93         30          26.06
0.60            34            24          22.31         10          11.68
0.70            55            32          41.05         23          13.94
0.80            44            32          37.67         12          6.33
0.90 - 1.00     69            63          65.93         6           3.07
Total           982           314         320.10        668         661.89
APACHE II (b)
0.00            238           13          12.50         225         225.49
0.10            180           42          26.79         138         153.21
0.20            119           24          28.92         95          90.07
0.30            115           38          40.29         77          74.71
0.40            78            30          35.13         48          42.86
0.50            64            37          35.10         27          28.89
0.60            45            25          29.10         20          15.89
0.70            53            35          39.89         18          13.10
0.80            48            33          41.33         15          6.66
0.90 - 1.00     42            37          39.72         5           2.27
Total           982           314         328.80        668         653.19

(a) chi-square: 29.745, df: 10, p = 0.001
(b) chi-square: 32.704, df: 10, p = 0.0003
Table 5. Hosmer-Lemeshow goodness-of-fit test C for the new Simplified Acute Physiology
Score (SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II)
Score.

Probability     Number of     Number of deaths          Number of survivors
(dying)         patients      Observed    Expected      Observed    Expected
SAPS II (a)
0.00            106           5           1.68          101         104.31
0.03            105           9           4.62          96          100.37
0.06            90            11          7.28          79          82.71
0.10            112           21          14.28         91          97.71
0.15            83            16          15.87         67          67.12
0.21            94            23          25.02         71          68.97
0.31            108           39          41.18         69          66.82
0.44            88            43          46.89         45          41.10
0.62            100           62          73.69         38          26.30
0.84 - 1.00     96            85          89.56         11          6.43
Total           982           314         320.10        668         661.89
APACHE II (b)
0.00            99            6           2.70          93          96.29
0.04            98            3           6.10          95          91.89
0.08            98            16          10.38         82          87.61
0.13            98            30          15.28         68          82.71
0.18            100           18          21.63         82          78.36
0.25            97            27          29.31         70          67.68
0.34            98            30          38.41         68          59.58
0.45            98            50          50.15         48          47.84
0.58            98            59          67.42         39          30.57
0.78 - 1.00     98            75          87.38         23          10.61
Total           982           314         328.80        668         653.19

(a) chi-square: 28.292, df: 10, p = 0.002
(b) chi-square: 49.664, df: 10, p < 0.0001
Predicted risk of dying in the hospital, as calculated by the two models, shows a highly
significant correlation (multiple R: 0.827). It is clear that, although highly correlated, the
results from the two models are widely dispersed (Figure 3), with a significant number of
outliers, as already described in other studies [39].
Figure 3. Plot of the Acute Physiology and Chronic Health Evaluation (APACHE II) Score
versus the new Simplified Acute Physiology Score (SAPS II) probabilities of death.
APACHE II and SAPS II are highly correlated (multiple R: 0.827) but a large number of
outliers are visible. The predictions seem more related in the extremes of risk.
The analysis of uniformity of fit, that is, the capability of the models to adjust between subgroups,
showed variations in their performance across subgroups. When we stratified patients
by risk of death, as in a calibration curve, we found that for more severely ill
patients (as measured by the scores), both systems overestimated mortality (Figure 2).
When we stratified patients by ICU, the differences were quite large, with mortality ratios ranging
from 0.69 to 1.72 for SAPS II and from 0.42 to 1.63 for APACHE II.
The number of observed deaths was similar to the number of predicted deaths when we stratified
patients by type of patient (medical patients: predicted deaths 238.7, observed 235; scheduled surgery:
predicted 18.4, observed 19; unscheduled surgery: predicted 62.9, observed 60) for SAPS
II but with some differences for APACHE II (medical patients: predicted deaths 245.1,
observed 235; scheduled surgery: predicted 17.7, observed 19; unscheduled surgery:
predicted 66.1, observed 60).
For age group, there were some variations in SAPS II and APACHE II, with mortality ratios
varying from 0.85 to 1.25 for SAPS II and 0.87 to 1.19 for APACHE II (Table 6). However,
in almost all cases, the confidence intervals for the mortality ratio included unity. The same
was true when we stratified patients by diagnostic category (Table 7).
Table 6. Mortality ratio by age group for the new Simplified Acute Physiology Score (SAPS
II) and the Acute Physiology and Chronic Health Evaluation (APACHE II) Score. In
parentheses are the 95 % confidence intervals for the mortality ratios, computed according
to Gardner and Altman [38].

Age (years)     No.      Observed deaths    Predicted deaths    Mortality ratio
SAPS II
≤ 24            93       16                 18.8                0.85 (0.53 - 1.17)
25 - 34         114      31                 24.9                1.25 (0.99 - 1.51)
35 - 44         89       20                 20.9                0.96 (0.68 - 1.24)
45 - 54         131      39                 36.7                1.06 (0.84 - 1.28)
55 - 64         165      54                 56.9                0.95 (0.79 - 1.11)
65 - 74         236      89                 92.5                0.96 (0.83 - 1.09)
75 - 84         127      52                 56.0                0.93 (0.77 - 1.08)
≥ 85            27       13                 13.4                0.97 (0.67 - 1.27)
APACHE II
≤ 24            93       16                 18.4                0.87 (0.53 - 1.21)
25 - 34         114      31                 26.0                1.19 (0.91 - 1.47)
35 - 44         89       20                 22.0                0.91 (0.62 - 1.20)
45 - 54         131      39                 41.0                0.95 (0.74 - 1.16)
55 - 64         165      54                 57.3                0.94 (0.77 - 1.11)
65 - 74         236      89                 95.5                0.93 (0.80 - 1.06)
75 - 84         127      52                 57.7                0.90 (0.74 - 1.06)
≥ 85            27       13                 11.0                1.19 (0.82 - 1.55)

No., number of patients
Table 7. Mortality ratio by diagnostic category for the new Simplified Acute Physiology
Score (SAPS II) and the Acute Physiology and Chronic Health Evaluation (APACHE II)
Score. In parentheses are the 95 % confidence intervals for the mortality ratios, computed
according to Gardner and Altman [38].

Diagnostic category     No.      Observed deaths    Predicted deaths    Mortality ratio
SAPS II
Non-operative
  Respiratory           326      101                105.7               0.96 (0.83 - 1.08)
  Cardiovascular        142      78                 74.9                1.04 (0.93 - 1.16)
  Trauma                68       17                 20.7                0.82 (0.53 - 1.11)
  Neurological          38       15                 14.6                1.03 (0.70 - 1.35)
  Others                17       5                  4.1                 1.23 (0.54 - 1.92)
  Non-specified         81       24                 20.5                1.17 (0.89 - 1.45)
Post-operative          310      74                 79.7                0.93 (0.78 - 1.08)
APACHE II
Non-operative
  Respiratory           326      101                119.8               0.84 (0.72 - 0.97)
  Cardiovascular        142      78                 73.3                1.06 (0.94 - 1.18)
  Trauma                68       17                 12.4                1.38 (0.95 - 1.80)
  Neurological          38       15                 19.5                0.77 (0.51 - 1.03)
  Others                17       5                  3.3                 1.50 (0.73 - 2.28)
  Non-specified         81       24                 20.6                1.17 (0.84 - 1.49)
Post-operative          310      74                 79.9                0.93 (0.77 - 1.08)

No., number of patients
DISCUSSION
Over the past 30 years, the development and dissemination of strategies for monitoring and
treating severely ill patients has led to the creation of special places in hospitals, where all
the necessary resources (human and technological) have been concentrated - the ICUs. These
units soon became an essential part of acute patient care for all types of patients. The
resultant increase in costs led to the need to rationalise policies of admission and medical
care in these units, and concepts which were unknown in the medical world, such as efficiency
and effectiveness, began to be accepted in hospitals. Central to this was the necessity to
quantify performance by defining the goals and outcomes of care. Therefore, over the past
15 years we have witnessed the development and application of several instruments to
evaluate performance and, ultimately, to compare ICUs.
In the 1980s [7] several investigators proposed the use of the ratio between observed and
predicted deaths - the standardised mortality ratio - as one of those tools. This was done
under the assumption that, although ICUs admit a very heterogeneous group of patients with
large differences in physiologic reserve (e.g. age, previous health status) and acute health
status (e.g. type of patient, admission diagnosis, presence and level of physiologic
impairment), existing severity scores can account for most of these characteristics. If the
errors resulting from the collection of data and their application are small and randomly
distributed, the final difference between predicted and observed mortality can be attributed
to local differences in quality of care. This rationale has been challenged several times recently
[14,15,40-43]. We can argue against at least two of these assumptions: firstly, intra- and
inter-observer variability in data collection can have a significant impact on the
performance of the severity scores, as argued by others [44,45], and, clearly, more research
is needed on this issue; we want to stress that in this study the reliability of data collection
was very high but the lack of generalised consensus in operative definitions makes
comparison with other studies difficult. Secondly, the application of a model to a different
population can only be done with confidence once the system has been tested and validated
on that population, since variations in case mix not accounted for by the models can have a
significant impact on their performance.
This study demonstrated that both models failed to predict mortality accurately - that is,
overall calibration was poor. This problem is more obvious for APACHE II than for SAPS
II. When we compare the two models, we can see that discrimination was better for SAPS
II than for APACHE II (0.817 vs 0.787, p < 0.001); the same is true for the percentage of
correct classifications, both in survivors and in non-survivors, but the differences did not
reach statistical significance. With respect to overall calibration, as measured by the Hosmer-Lemeshow H and C tests, both models presented statistically significant differences between
predicted and observed mortality. The same can be observed in the calibration curves (Figure
2), with both models overestimating mortality in the most severely ill patients. This is
probably important, since this sample is a very different population from the one represented
in the European/North American database [24] or in Knaus et al.’s database [6]: a higher
percentage of non-surgical patients, with a longer length of stay in the ICU, higher severity
scores, and higher mortality.
This lack of overall goodness of fit can also be related to differences in population
composition if the models fit the data in a non uniform manner, as already observed with
APACHE II in the United Kingdom [14,15,46] and with SAPS II in Spain [25] and Italy [43].
In this study, the small sample size was a relevant limiting factor in stratified analysis and
precluded the formal evaluation of discrimination and calibration in clinically relevant
subgroups. The hypothesis that the non uniformity of fit can explain, at least in part, the poor
performance of the models when applied to independent populations should be tested in
larger samples.
An alternative (and perhaps complementary) explanation for the poor performance of the
models is the presence of other factors (clinical and non-clinical), not measured by the present
severity scores, that can have a major influence on the performance of ICUs. It should be
noted that those factors are not randomly distributed between patients but clustered within
ICUs; their effects on the performance of current models should be one of the main priorities
of research in this field, and we may anticipate that next-generation severity scores will take
into account not only variability among patients (that is, baseline characteristics, severity of
disease) but also variability among ICUs (the positive correlation between patients inside
the same ICU introduced by local clinical and non-clinical factors that can influence
outcome). For now, the results preclude the use of these models to analyse quality of care or
to compare performance between ICUs in this population, at least without prior customisation, as
suggested in other studies [8,9,11-13,37,39,43,47].
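The text does not specify how customisation would be carried out. One approach, assumed here purely for illustration, is to keep the original score and refit only the relation between its logit and the observed outcome in the local population: probability = expit(a + b * logit(original probability)). The sketch below estimates a and b by Newton-Raphson; the function names and toy data are hypothetical, and predicted probabilities must lie strictly between 0 and 1.

```python
import math
from typing import List, Sequence, Tuple


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def expit(x: float) -> float:
    """Inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_customisation(probs: Sequence[float], died: Sequence[int],
                      max_iter: int = 50) -> Tuple[float, float]:
    """Fit died ~ expit(a + b * logit(p)) by Newton-Raphson.

    Starts from a = 0, b = 1, i.e. the uncustomised model.
    """
    x = [logit(p) for p in probs]
    a, b = 0.0, 1.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, died):
            mu = expit(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("information matrix is singular")
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < 1e-10:
            break
    return a, b


def customised(probs: Sequence[float], a: float, b: float) -> List[float]:
    """Apply the fitted coefficients to the original probabilities."""
    return [expit(a + b * logit(p)) for p in probs]


# Toy data (hypothetical, not from the study).
toy_probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_died = [0, 1, 0, 1, 0, 1, 1, 1]
a_hat, b_hat = fit_customisation(toy_probs, toy_died)
print(a_hat, b_hat, sum(customised(toy_probs, a_hat, b_hat)))
```

At the fitted values the sum of customised probabilities equals the observed number of deaths, so overall calibration in the fitting sample is restored by construction; the ranking of patients, and hence discrimination, is unchanged whenever b is positive. Such a refit should be validated on independent data before being used to compare ICUs.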
ACKNOWLEDGEMENTS
The authors thank the participating staff of all the ICUs for their full collaboration on data
collection. We also acknowledge the President of the Portuguese Intensive Care
Society, Dr. Jorge Pimentel, and the President of the Portuguese Society of Internal Medicine,
Prof. Dr. Levy Guerra, for their invaluable support during the planning and execution of the
study.
REFERENCES
Knaus WA, Zimmerman JE, Wagner DP, Draper EA, Lawrence DE. APACHE - Acute Physiology and
Chronic Health Evaluation: a physiologically based classification system. Crit Care Med
1981;9:591-7.
Knaus WA, Draper EA, Wagner DP, et al. Evaluating outcome from intensive care: A preliminary
multihospital comparison. Crit Care Med 1982;10:491-6.
Knaus WA, Le Gall JR, Wagner DP, et al. A Comparison of Intensive Care in the U.S.A. and France.
Lancet 1982;642-6.
Wagner DP, Draper EA, Abizanda Campos R, et al. Initial International use of APACHE: an acute
severity of disease measure. Med Decis Making 1984;4:297.
Le Gall JR, Loirat P, Alperovitch A, et al. A Simplified Acute Physiologic Score for ICU patients. Crit
Care Med 1984;12:975-7.
Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification
system. Crit Care Med 1985;13:818-29.
Knaus WA, Draper EA, Wagner DP, Zimmerman JE. An evaluation of outcome from intensive care in
major medical centers. Ann Intern Med 1986;104:410-8.
Zimmerman JE, Knaus WA, Judson JA, et al. Patient selection for intensive care: a comparison of New
Zealand and United States Hospitals. Crit Care Med 1988;16:318-25.
Marsh HM, Krishan I, Naessens JM, et al. Assessment of prediction of mortality by using the APACHE
II scoring system in intensive care units. Mayo Clin Proc 1990;65:1549-57.
Castella X, Gilabert J, Torner F, Torres C. Mortality prediction models in intensive care: Acute
Physiology and Chronic Health Evaluation II and Mortality Prediction Model compared. Crit Care
Med 1991;19:191-7.
Sirio CA, Tajimi K, Tase C, et al. An initial comparison of intensive care in Japan and United States.
Crit Care Med 1992;20:1207-15.
Zimmerman JE, Shortell SM, Rousseau DM, et al. Improving intensive care: observations based on
organizational case studies in nine intensive care units: a prospective, multicenter study. Crit Care
Med 1993;21:1443-51.
Knaus WA, Wagner DP, Zimmerman JE, Draper EA. Variations in mortality and length of stay in
Intensive Care Units. Ann Intern Med 1993;118:753-61.
Rowan KM, Kerr JH, Major E, McPherson K, Short A, Vessey MP. Intensive Care Society's APACHE
II study in Britain and Ireland - I: Variations in case mix of adult admissions to general intensive
care units and impact on outcome. Br Med J 1993;307:972-7.
Rowan KM, Kerr JH, Major E, McPherson K, Short A, Vessey MP. Intensive Care Society's APACHE
II study in Britain and Ireland - II: Outcome comparisons of intensive care units after adjustment
for case mix by the American APACHE II method. Br Med J 1993;307:977-81.
Wong DT, Crofts SL, Gomez M, McGuire GP, Byrick RJ. Evaluation of predictive ability of APACHE
II system and hospital outcome in Canadian intensive care unit patients. Crit Care Med
1995;23:1177-83.
Rowan KM, Kerr JH, Major E, McPherson K, Short A, Vessey MP. Intensive Care Society's Acute
Physiology and Chronic Health Evaluation (APACHE II) study in Britain and Ireland: A
prospective, multicenter, cohort study comparing two methods for predicting outcome for adult
intensive care patients. Crit Care Med 1994;22:1392-401.
Greenman RL, Schein RMH, Martin MA, et al. A Controlled Clinical Trial of E5 Murine Monoclonal
IgM Antibody to Endotoxin in the Treatment of Gram-Negative Sepsis. JAMA 1991;266:1097-102.
Ziegler EJ, Fisher CJ, Sprung CL, et al. Treatment of gram-negative bacteremia and septic shock with
HA-1A human monoclonal antibody against endotoxin. A randomized, double-blind,
placebo-controlled trial. N Engl J Med 1991;324:429-36.
Knaus WA, Harrell FE, Fisher CJ, et al. The clinical evaluation of new drugs for sepsis. A prospective
study design based on survival analysis. JAMA 1993;270:1233-41.
Knaus WA, Wagner DP, Draper EA, et al. The APACHE III prognostic system. Risk prediction of
hospital mortality for critically ill hospitalized adults. Chest 1991;100:1619-36.
Bastos PG, Sun X, Wagner DP, Knaus WA, Zimmerman JE, The Brazil APACHE III Study Group.
Application of the APACHE III prognostic system in Brazilian intensive care units: a prospective
multicenter study. Intensive Care Med 1996;22:564-70.
Bastos PG, Knaus WA, Zimmerman JE, Magalhães Jr A, Wagner DP, The Brazil APACHE III Study
Group. The importance of technology for achieving superior outcomes from intensive care.
Intensive Care Med 1996;22:664-9.
Le Gall JR, Lemeshow S, Saulnier F. A new Simplified Acute Physiology Score (SAPS II) based on a
European / North American multicenter study. JAMA 1993;270:2957-63.
Abizanda Campos R, Rodriguez MT, Ferrandiz A, et al. Evaluation of SAPS II mortality prediction
capability. Comparison with SAPS I and APACHE II. Intensive Care Med 1994;20:51.
Castella X, Artigas A, Bion J, Kari A, The European / North American Severity Study Group. A
comparison of severity of illness scoring systems for intensive care unit patients: results of a
multicenter, multinational study. Crit Care Med 1995;23:1327-35.
Hadorn DC, Keeler EB, Rogers WH, Brook RH. Assessing the performance of mortality prediction
models. Santa Monica, CA, RAND/UCLA/Harvard Center for Health Care Financing Policy
Research, 1993.
Knaus WA, Draper EA, Wagner DP, Zimmerman JE. Prognosis in acute organ-system failure. Ann Surg
1985;202:685-93.
Keene AR, Cullen DJ. Therapeutic intervention scoring system: update 1983. Crit Care Med 1983;11:1-3.
Reis Miranda D, de Rijk A, Schaufeli W. Simplified Therapeutic Intervention Scoring System: The TISS
28 items - Results from a multicenter study. Crit Care Med 1996;24:64-73.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measur 1960;20:37-46.
Kramer MS, Feinstein AR. Clinical biostatistics. LIV. The biostatistics of concordance. Clin Pharmacol
Ther 1981;29:111-23.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull
1979;86:420-8.
Hanley J, McNeil B. The meaning and use of the area under a receiver operating characteristic (ROC)
curve. Radiology 1982;143:29-36.
Hanley J, McNeil B. A method of comparing the areas under receiver operating characteristic curves
derived from the same cases. Radiology 1983;148:839-43.
Hosmer DW, Lemeshow S. Applied logistic regression. John Wiley & Sons, Inc., New York, 1989.
Rapoport J, Teres D, Lemeshow S, Gehlbach S. A method for assessing the clinical performance and
cost-effectiveness of intensive care units: a multicenter inception cohort study. Crit Care Med
1994;22:1385-91.
Gardner MJ, Altman DG. Statistics with confidence. British Medical Journal, London, 1989.
Lemeshow S, Klar J, Teres D. Outcome prediction for individual intensive care patients: useful, misused,
or abused? Intensive Care Med 1995;21:770-6.
Park RE, Brook RH, Kosecoff J, et al. Explaining variations in hospital death rates: randomness,
severity of illness, quality of care. JAMA 1990;264:484-90.
Best WR, Cowper DC. The ratio of observed-to-expected mortality as a quality of care indicator in
non-surgical VA patients. Med Care 1994;32:390-400.
Fisher M, Herkes RG. Intensive care: speciality without frontiers. In: Parker M, Shapiro MJ, Porembka
DT, eds. Critical Care State of the Art. California: Society of Critical Care Medicine 1995, 9-27.
Apolone G, D'Amico R, Bertolini G, Iapichino G, Cattaneo A, De Salvo G, Melotti R. The
performance of SAPS II in a cohort of patients admitted in 99 Italian ICUs: results from the
GiViTI. Intensive Care Med 1996;22:1368-78.
Lemmonier E, Loirat P, Kleinknecht D, Brivet F, Landais P, and the French Study Group on ARF.
Translation ambiguity and inter-observer variability of severity scoring systems. Intensive Care
Med 1992;20:581.
Abizanda R, Balerdi B, Lopez J, et al. Fallos de prediccion de resultados mediante APACHE II.
Analisis de los errores de prediction de mortalidad en pacientes criticos. Med Clin Barc
1994;102:527-31.
Goldhill DR, Withington PS. The effects of casemix adjustment on mortality as predicted by
APACHE II. Intensive Care Med 1996;22:415-9.
Moreno RP, Estrada H, Pereira E, Massa L. Movimento assistencial da Unidade de Cuidados
Intensivos Polivalente do Hospital de Santo Antonio dos Capuchos. Acta Med Port 1994;7:13-20.
APPENDIX
List of co-authors (grouped by hospital): Centro Hospitalar de Coimbra: Dra. Paula
Coutinho; H. Universidade de Coimbra: Dr. João Paulo Sousa; H. de Egas Moniz: Dra.
Isabel Gaspar, Dr. Andrade Gomes; H. de Pulido Valente: Dr. Luis Tello; H. de St. António
dos Capuchos: Dra. Ermelinda Pereira; H. de S. Francisco Xavier - UCIC: Dra. Ana
Ferreira; H. de S. Francisco Xavier - UCIM: Dr. João Cunha, Dra. Margarida Resende; H.
de St. Maria - UCIR: Dra. Gabriela Brum, Dr. João Valença; H. de St. Marta: Dra.
Manuela Coelho, Dra. Alexandrina Quintino; H. Distrital de Évora: Dr. Luis Filipe Froes;
H. de Vila Real: Dr. Celestino; H. Distrital do Barreiro: Dra. Fátima Campante; H. do
Desterro: Dra. Maria José Serra; H. do SAMS: Dr. Sousa e Costa; H. Dr. José Maria
Grande: Dr. Carlos Baeta; H. Garcia de Orta: Dr. Pedro Moreira; H. Geral de St. António
- UCIP: Dr. Rui Seca; H. Senhora da Oliveira: Dr. Estevão Lafuente; H. de São Bernardo:
Dra. Rosa Ribeiro. The study was co-ordinated by the Severity Scores Groups
(Co-ordinator: Dr. R. Moreno) of the Portuguese Intensive Care Society and the Portuguese
Society of Internal Medicine. Data analysis and statistics were performed by Dr. R. Moreno.
Computer programming was co-ordinated by Dr. P. Morais.