Breast Cancer Res Treat
DOI 10.1007/s10549-007-9736-z
PRECLINICAL STUDY
Automated quantitative analysis of estrogen receptor expression
in breast carcinoma does not differ from expert pathologist
scoring: a tissue microarray study of 3,484 cases
Dmitry A. Turbin · Samuel Leung · Maggie C. U. Cheang · Hagen A. Kennecke ·
Kelli D. Montgomery · Steven McKinney · Diana O. Treaba · Niki Boyd ·
Lynn C. Goldstein · Sunil Badve · Allen M. Gown · Matt van de Rijn ·
Torsten O. Nielsen · C. Blake Gilks · David G. Huntsman
Received: 9 August 2007 / Accepted: 13 August 2007
© Springer Science+Business Media, LLC 2007
Electronic supplementary material The online version of this
article (doi:10.1007/s10549-007-9736-z) contains supplementary
material, which is available to authorized users.
D. A. Turbin · S. Leung · M. C. U. Cheang · H. A. Kennecke · T. O. Nielsen · C. B. Gilks · D. G. Huntsman
Genetic Pathology Evaluation Centre, Vancouver Coastal Health
Research Institute, Vancouver, BC, Canada
D. A. Turbin · S. Leung · M. C. U. Cheang · H. A. Kennecke · T. O. Nielsen · C. B. Gilks · D. G. Huntsman (corresponding author)
British Columbia Cancer Agency, University of British
Columbia, Rm 3427 - 600 West 10th Avenue, Vancouver, BC,
Canada V5Z 4E5
e-mail: dhuntsma@bccancer.bc.ca
D. A. Turbin · S. McKinney · N. Boyd · D. G. Huntsman
Centre for Translational and Applied Genomics, Vancouver, BC,
Canada
K. D. Montgomery
Stanford University Medical Center, Stanford, CA, USA
S. Badve
Indiana University Hospital, Indianapolis, IN, USA
Abstract
Background Estrogen receptor (ER) expression is routinely assessed by immunohistochemistry (IHC) in breast
carcinoma. Our study compares visual scoring of ER in
invasive breast cancer by histopathologists to quantitation
of staining using a fully automated system.
Materials and methods A tissue microarray was constructed from 4,049 cases (3,484 included in analysis) of
invasive breast carcinoma linked to treatment and outcome
information. Slides were scored independently by two
pathologists and scores were dichotomised, with ER positivity recognized at a cut-off of >1% positive nuclei. The slides
were scanned and analyzed with an Ariol automated system.
Results Using data dichotomised as ER positive or negative, both visual and automated scores were highly consistent:
there was excellent concordance between two pathologists
(kappa = 0.918 (95%CI: 0.903–0.932)) and between two
Ariol machines (kappa = 0.913 (95%CI: 0.897–0.928)). The
prognostic significance of ER positivity was similar whether
determined by pathologist or automated scoring for both the
entire patient cohort and subsets of patients treated with
tamoxifen alone or receiving no systemic adjuvant therapy.
The optimal cut point for the automated scores using breast
cancer disease-specific survival as an endpoint was >0.4%
positive nuclei. The concordance between dextran-coated
charcoal ER biochemical assay data and automated scores
(kappa = 0.728 (95%CI: 0.69–0.75); 0.74 (95%CI: 0.71–0.77)) was similar to the concordance between biochemical
assay and pathologist scores (kappa = 0.72 (95%CI: 0.70–0.75); 0.70 (95%CI: 0.67–0.72)).
Conclusion Fully automated quantitation of ER immunostaining yields results that do not differ from human
scoring against both biochemical assay and patient outcome gold standards.
Keywords Automated scoring · Breast cancer · Estrogen receptor · Immunohistochemistry · Pathology · Tissue microarray
D. O. Treaba · L. C. Goldstein · A. M. Gown
PhenoPath Laboratories, Seattle, WA, USA
M. van de Rijn
Stanford University Medical Center, Stanford, CA, USA
Introduction
Estrogen receptor (ER) expression in breast carcinoma is
routinely assessed to predict response to hormonal
treatment. Patients with tumours expressing ER are seven
to eight times more likely to benefit from endocrine therapy
than patients with ER-negative tumours [1]. ER expression
is also a prognostic factor for breast cancer as it is associated with favourable outcome in untreated patients [2].
Advances in computers and imaging techniques in the
past decade have led to increased interest in the application
of computerised image analysis in different fields, including pathology [3–5]. The potential benefits associated with
this technology include improved quantification and
reproducibility [5, 6]. Electronic device-assisted measurement of staining intensity is more precise than assessment
by the human eye, and is not influenced by factors such as
ambient light or pathologist fatigue [7–13].
Use of image analysis for scoring of ER immunohistochemistry (IHC) has been considered since the inception of
ER IHC testing in the 1980s [14–16]. However, the process
of image analysis at that time was more time consuming
and labour-intensive than the visual scoring from the glass
slide. Image analysis software and hardware have significantly improved and are now being considered for use in
clinical histopathology laboratories [17–19].
We have compared results of visual and automated
scoring of ER immunostaining on a TMA constructed from
4,049 cases of invasive breast carcinoma, with data from
3,484 cases used for analysis. We then evaluated the interobserver variability for both visual scoring and the
automated image analysis system.
Material and methods
TMA construction and immunostaining
Ethical approval for the study was obtained from the Clinical
Research Ethics Board of the British Columbia Cancer
Agency. A series of 4,049 cases of invasive breast carcinoma
diagnosed in 1986–1992 and referred to the British
Columbia Cancer Agency for treatment were assembled into
17 tissue microarray blocks. Tissue microarrays (TMA)
were built using tissue cores from formalin-fixed paraffin-embedded tumours. This material had been frozen prior to
neutral buffered formalin fixation. Hematoxylin and eosin
(H&E) stained slides were reviewed, and areas containing
tumour tissue were marked on both the slides and corresponding paraffin blocks for tissue microarray construction.
A single 0.6 mm core was taken from every donor block.
Microarray blocks were constructed using a manual arrayer
(Beecher Instruments, Inc., Silver Spring, MD) as previously described [20, 21]. Sections (4 µm thick) were cut
from the array blocks. Sections were then stained using
DakoCytomation EnVision and System-HRP [2]. Sections
were deparaffinised with xylene and dehydrated through
three alcohol changes. Endogenous peroxidase activity was
quenched by incubating 5 min in 0.03% hydrogen peroxide/
sodium azide. Slides were then incubated with a primary
anti-ER antibody, clone SP1 (1:250, LabVision, Fremont,
CA) followed by incubation with peroxidase-labelled
polymer in a Tris–HCl buffer for 30 min. Staining was
completed by 10 min incubation with DAB and chromogen.
Slides were then counterstained with hematoxylin and
mounted. Data from 3,484 of the 4,049 cases were used for
the statistical analysis. Five hundred and sixty-five cores
were not used for a variety of reasons including core drop-off
during the processing, insufficient or absent tumour tissue
within the cores, or artefactual distortion of the tissue
making interpretation impossible [22].
Biomarkers were scored visually from the glass slides
by two independent pathologists from different laboratories
(DOT and SB). Cut-off points for the visual scoring were:
0: up to 1% positive tumour nuclei, 1+: from 1% to 25%
positive nuclei, 2+: 25–75% positive nuclei, and 3+: more
than 75% positive nuclei. Scores were entered into a
standardised Excel worksheet (Microsoft Excel, Microsoft,
Redmond WA) with a sector map matching each TMA
section. Cases were not included in the statistical analysis if
there were no interpretable data, i.e. if there was no tumour
tissue in the cores or the cores were cut through. Original
scoring grids were converted to tables using Deconvoluter
1.10 [23] and combined in a single text file with TMACombiner 1.00 [24]. The resulting text files were imported
into SPSS 11.0 for Windows.
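The visual cut-offs described above can be expressed as a small lookup. The following Python sketch is illustrative only: the function names are hypothetical, and the treatment of values falling exactly on a boundary (1%, 25%, 75%) is an assumption, since the text does not say which side of each boundary is inclusive.

```python
def visual_category(percent_positive: float) -> int:
    """Map percent of positive tumour nuclei to the 0 / 1+ / 2+ / 3+ scale.

    Assumption: each upper boundary is inclusive (exactly 1% scores 0,
    exactly 25% scores 1+, exactly 75% scores 2+).
    """
    if not 0.0 <= percent_positive <= 100.0:
        raise ValueError("percent_positive must lie in [0, 100]")
    if percent_positive <= 1.0:
        return 0
    if percent_positive <= 25.0:
        return 1
    if percent_positive <= 75.0:
        return 2
    return 3


def is_er_positive(percent_positive: float) -> bool:
    """Dichotomise at the >1% cut-off: any category above 0 is positive."""
    return visual_category(percent_positive) > 0
```

Under these assumptions, a core with 0.5% positive nuclei is scored 0 (ER negative) and a core with 80% positive nuclei is scored 3+.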
The same slides were digitised with a commercial image
analysis system, Ariol (Applied Imaging Inc., San Jose,
CA). The Ariol scanner is based on an Olympus microscope
with motorized stage and autofocus capabilities, equipped
with a black and white video camera. Slides were scanned at
20× objective magnification with three filters: red, green
and blue; Ariol software converts these three-channel
images into colour reconstructions. The included software
was used for image analysis. The program was trained by a
pathologist (DAT) to analyse objects of specific colour, size
and shape for this particular staining, in order to increase
specificity of the analysis. This is done by the pathologist
identifying and tracing out invasive breast cancer areas on a
representative area of a core containing positively stained
and negative nuclei. This step ensures that stromal matrix
and most stromal cells are excluded from image analysis,
allowing the program to calculate percent of positive
tumour cells more precisely. After the program training on
one of the representative TMA cores, the rest of the analysis
(17 TMA slides) was performed without human supervision, i.e. the program analysed images one by one based on
the settings established during the training. All tissue cores
were analysed in toto; no specific pathologist selection of
tumour tissue within the cores was made following the
training step. For statistical analysis, we selected only cores
with at least 50 cells analysed; all cores with fewer than 50
cells were considered unscorable.
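The 50-cell rule can be sketched as follows. This is a minimal Python illustration, not the Ariol software's own logic; the function name and the idea of passing separate positive and negative nuclear counts are assumptions made for the example.

```python
from typing import Optional

MIN_CELLS = 50  # cores with fewer analysed cells are treated as unscorable


def percent_positive(positive_nuclei: int, negative_nuclei: int) -> Optional[float]:
    """Return percent of positive nuclei, or None for an unscorable core."""
    if positive_nuclei < 0 or negative_nuclei < 0:
        raise ValueError("nuclear counts must be non-negative")
    total = positive_nuclei + negative_nuclei
    if total < MIN_CELLS:
        return None  # excluded from statistical analysis
    return 100.0 * positive_nuclei / total
```

Returning None rather than 0 keeps unscorable cores distinct from genuinely negative ones in downstream analysis.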
To get an estimate of the demands posed on the operator
of the Ariol system, the same slides were scanned and
processed on an identical Ariol system by an operator with
less than 1 week of experience (KDM). The descriptors of the
colour and shape of the positive and negative tumour cells
were transferred from one system to another, therefore
variations in the image analysis results depended only on
the scanner settings, i.e. positioning and white balance, but
not on the image analysis settings.
The hematoxylin and eosin and IHC images of all cores
used in this study are publicly available at the companion
site http://www.gpecimage.ubc.ca/tma/web/viewer.php [2,
25]. The site was constructed using GPEC database and a
Java applet provided by Bacus Laboratories, Inc. All the
slides were scanned with a BLISS scanner (Bacus Laboratories, Inc., Lombard, IL), and posted on the site.
WebSlide Browser for Windows (Bacus Laboratories, Inc.,
Lombard, IL) can be used for viewing preview images of
the arrays and images of individual cores.
We performed Kaplan–Meier analysis on the data provided by both visual and automated scoring systems. We
also used dextran-coated charcoal biochemical assay
(DCC) values from the original clinical database as an
external standard measure for ER quantitation [26, 27]. The
data were categorised as negative for 0–1 fmol/mg, weak
for 2–9 fmol/mg, moderate for 10–159 fmol/mg, and
strong for ≥160 fmol/mg. A cut-off point of 10 fmol/mg
was selected for dichotomisation of the DCC data as ER
negative or positive [27, 28].
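The DCC categories and the 10 fmol/mg dichotomisation can be sketched in the same way. The names below are hypothetical, and two points are assumptions: non-integer values between the listed integer ranges are assigned to the lower category, and a value of exactly 10 fmol/mg counts as positive (consistent with the moderate category starting at 10).

```python
def dcc_category(fmol_per_mg: float) -> str:
    """Categorise a DCC assay value (fmol/mg) as negative/weak/moderate/strong."""
    if fmol_per_mg < 0:
        raise ValueError("DCC value must be non-negative")
    if fmol_per_mg < 2:
        return "negative"   # 0-1 fmol/mg
    if fmol_per_mg < 10:
        return "weak"       # 2-9 fmol/mg
    if fmol_per_mg < 160:
        return "moderate"   # 10-159 fmol/mg
    return "strong"         # >=160 fmol/mg


def dcc_positive(fmol_per_mg: float) -> bool:
    """Dichotomise at the 10 fmol/mg cut-off (assumed inclusive)."""
    return fmol_per_mg >= 10
```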
Statistical analysis
Statistical analysis was performed in SPSS 11.0 for Windows
(SPSS Inc., Chicago, IL). Univariate analysis of survival was
done by calculating Kaplan–Meier survival curves. Subsets
of patients were compared using log-rank tests. All tests were
two-sided and used a 5% alpha level to determine significance. We used the open-source R 2.3.1 package to calculate
differences between kappa statistics from DCC to IHC
comparisons; a permutation test with 10,000 permutations
was implemented. The same test was performed to compare
chi-squares obtained in Kaplan–Meier analyses, to determine whether different scoring systems (human versus
machine) give significantly different results. R 2.3.1 was also
used for plotting variations of the hazard ratio and log-rank P
values depending on the cut-off point of percent of positive
nuclei reported with the automated system.
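The text does not spell out how the permutations were generated. One common scheme for paired data, shown here purely as an assumption, swaps the two methods' calls within each case at random and recomputes the difference between their kappa values against the reference (DCC). The sketch is in Python rather than R, uses only the standard library, and all function names are hypothetical.

```python
import random
from typing import Sequence, Tuple


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) ratings of the same cases."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("ratings must be non-empty and of equal length")
    po = sum(1 for x, y in zip(a, b) if x == y) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                    # positive rates
    pe = pa * pb + (1 - pa) * (1 - pb)                 # chance agreement
    if pe == 1.0:
        return 1.0  # degenerate case: both ratings constant and identical
    return (po - pe) / (1 - pe)


def kappa_difference_permutation_test(
    reference: Sequence[int],
    method_x: Sequence[int],
    method_y: Sequence[int],
    n_permutations: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Two-sided permutation P value for kappa(ref, x) - kappa(ref, y)."""
    rng = random.Random(seed)
    observed = cohen_kappa(reference, method_x) - cohen_kappa(reference, method_y)
    extreme = 0
    for _ in range(n_permutations):
        px, py = [], []
        for x, y in zip(method_x, method_y):
            if rng.random() < 0.5:   # swap the two methods' calls for this case
                x, y = y, x
            px.append(x)
            py.append(y)
        diff = cohen_kappa(reference, px) - cohen_kappa(reference, py)
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction so the estimated P value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)
```

With 10,000 permutations the smallest P value this estimator can return is about 0.0001, which bounds the resolution of any such test.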
Continuous variable data obtained with the automated
image analysis systems were categorised using the optimal
survival cut-off points for our cohort, as determined with
X-tile v3.4.3 software [4]. Prognostic value of the categorical data obtained with this method was compared to
that of the data categorised according to the preselected 1%
cut-off point system.
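The idea of searching for a survival-based cut-off can be illustrated with a simplified scan: for each candidate threshold, dichotomise the continuous score and compute a two-group log-rank chi-square, keeping the threshold with the largest statistic. This is only a sketch under stated assumptions. X-tile's own procedure, including its training/validation split and any correction for multiple cut-point testing, is not reproduced here, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def logrank_chi2(times: Sequence[float], events: Sequence[int],
                 groups: Sequence[bool]) -> float:
    """Two-group log-rank chi-square (1 d.f.); events: 1 = death, 0 = censored."""
    data = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e})
    o_minus_e = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for ti, _, _ in data if ti >= t)           # at risk, both groups
        n1 = sum(1 for ti, _, g in data if ti >= t and g)    # at risk, group True
        d = sum(1 for ti, e, _ in data if ti == t and e)     # events at time t
        d1 = sum(1 for ti, e, g in data if ti == t and e and g)
        o_minus_e += d1 - d * n1 / n
        if n > 1:  # hypergeometric variance of d1
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / variance if variance > 0 else 0.0


def best_cutpoint(scores: Sequence[float], times: Sequence[float],
                  events: Sequence[int],
                  candidates: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (cut-off, chi-square) maximising the log-rank statistic."""
    best: Optional[Tuple[float, float]] = None
    for cut in candidates:
        groups: List[bool] = [s > cut for s in scores]   # positive if above cut
        if all(groups) or not any(groups):
            continue  # one group is empty; the test is undefined
        chi2 = logrank_chi2(times, events, groups)
        if best is None or chi2 > best[1]:
            best = (cut, chi2)
    return best
```

Because the maximum is selected over many candidates, the resulting chi-square is optimistically biased; this is why confirmation on a held-out validation set, as described in the text, matters.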
Results
Median follow-up time for the 3,484 cases analysed was
12.47 years; median age at diagnosis was 60 years. A total of
1,039 (29.8%) women were premenopausal, 2,360 (67.7%)
postmenopausal, two were pregnant (0.1%), and menstrual
status was unknown in 83 (2.4%) women. Lymph nodes
were negative in 1,935 (55.5%) cases, positive in 1,540
(44.2%), and in 9 (0.3%) cases nodal status was unknown.
Ductal carcinoma was seen in 3,169 (91.0%) cases, lobular
carcinoma in 243 (7.0%), and other types in 72 (2.1%)
cases. All patients underwent surgical treatment; in addition, 1,116 (32.0%) patients received tamoxifen only, 1,443
(41.4%) patients received no systemic therapy, 656
(18.8%) were treated with chemotherapy, and 259 (7.4%)
received a combination of chemotherapy and tamoxifen.
Other treatment methods (ovarian ablation or hormonal
therapy other than tamoxifen, with or without chemotherapy) were applied in 10 (0.2%) cases.
Kaplan–Meier survival analysis of cases stratified based
on ER status, as determined by visual or machine scoring
of the immunostained slides, or by DCC is shown in Fig. 1.
Results of the log-rank tests for the individual training and
validation sets are shown as a supplementary figure.
Results of the log-rank tests with P values for the whole set
of patients, stratified as negative, weak, moderate, or
strongly ER positive, are as follows: visual scoring
χ² = 50.95, P = 9.47 × 10⁻¹³; Ariol χ² = 37.71, P = 8.22 × 10⁻¹⁰;
DCC χ² = 66.25, P = 3.97 × 10⁻¹⁶ (Fig. 1a, d, g).
After dichotomisation of the scores as either
ER positive or negative (Fig. 2), the results of the log-rank test
are: visual scoring χ² = 49.69, P = 1.81 × 10⁻¹²; Ariol
χ² = 54.17, P = 1.84 × 10⁻¹³; DCC χ² = 53.76,
P = 2.27 × 10⁻¹³. The differences in prognostic significance of these different analyses of ER status are not
statistically significant (Table 1), i.e. pathologist and
automated ER scoring gave similar results after
dichotomisation of scores
into ER positive or negative (Fig. 2a, d, g).
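The P values quoted above appear consistent with referring each chi-square statistic to a chi-square distribution with one degree of freedom. This is an inference from the reported numbers, not something the text states. Under that assumption the upper-tail probability has a closed form in terms of the complementary error function, sketched here in Python with a hypothetical function name.

```python
import math


def chi2_sf_1df(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For X ~ chi-square(1), P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if chi2 < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(chi2 / 2.0))
```

For example, chi2_sf_1df(49.69) is approximately 1.8 × 10⁻¹² and chi2_sf_1df(54.17) is approximately 1.84 × 10⁻¹³, close to the values reported for dichotomised visual and Ariol scoring.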
In the group of patients that received adjuvant
treatment with only tamoxifen (Fig. 1b, e, h), results of the
log-rank test with P values for cases stratified as negative,
weak, moderate, or strongly ER positive are as follows:
visual scoring χ² = 26.07, P = 3.29 × 10⁻⁷; Ariol
χ² = 12.65, P = 0.0004; DCC χ² = 29.04, P = 7.08 × 10⁻⁸.
After dichotomisation of the scores as either ER negative
or positive (Fig. 2b, e, h), the results of the log-rank test are:
visual scoring χ² = 20.90, P = 4.84 × 10⁻⁶; Ariol
χ² = 30.69, P = 3.02 × 10⁻⁸; DCC χ² = 16.48,
P = 4.91 × 10⁻⁵. The differences in prognostic significance
of these different analyses of ER status (Fig. 2b, e, h) are
not statistically significant (Table 2).
In the group of patients that received no systemic
therapy (Fig. 1c, f, i), results of the log-rank test with P
values for cases stratified as negative, weak, moderate, or
strongly ER positive, are as follows: visual scoring
χ² = 20.59, P = 5.70 × 10⁻⁶; Ariol χ² = 15.17,
P = 9.83 × 10⁻⁵; DCC χ² = 38.56, P = 5.30 × 10⁻¹⁰. After
dichotomisation of the scores (Fig. 2c, f, i), the results of the
log-rank test are: visual scoring χ² = 27.47, P = 1.60 × 10⁻⁷;
Ariol χ² = 31.09, P = 2.46 × 10⁻⁸; DCC χ² = 42.69,
P = 6.43 × 10⁻¹¹. The differences in prognostic significance of these different analyses of ER status (Fig. 2c, f, i)
are not statistically significant (Table 3). The concordant
and discrepant results for visual, machine, and DCC for the
entire cohort, tamoxifen treated subset of patients, and
patients receiving no systemic therapy are displayed as
Kaplan–Meier survival curves (Fig. 3).
Fig. 1 Kaplan–Meier analysis performed on the data categorised at 1,
25 and 75% cut-off points. (a–c) visual scoring #1, (d–f) automated
system #1, (g–i) DCC. (a, d, g) entire cohort; (b, e, h) subset of the
patients treated with tamoxifen only with no chemotherapy; (c, f, i)
subset of the patients that received no systemic therapy
[Figure 1: nine panels (a–i); y-axis: cumulative survival; x-axis: follow-up (years), 0–20.]
Fig. 2 Kaplan–Meier analysis performed on the data dichotomised at
1% cut-off point as negative versus positive. (a–c) visual scoring #1,
(d–f) automated system #1, (g–i) DCC, (j–l) automated system #1
with the cut-off point defined with the X-tile program. (a, d, g) entire
cohort; (b, e, h) subset of the patients treated with tamoxifen only
with no chemotherapy; (c, f, i) subset of the patients that received no
systemic therapy
[Figure 2: twelve panels (a–l); y-axis: cumulative survival; x-axis: follow-up (years), 0–20; curves labelled positive and negative.]
Table 1 Permutation test on the whole cohort to compare the differences between chi-squares obtained in Kaplan–Meier analysis,
between DCC and each of the IHC scoring systems
            DCC      Visual #1   Visual #2   Ariol #1   Ariol #2
DCC         1        0.8433      0.6929      0.6346     0.7010
Visual #1   0.8433   1           0.2311      0.6139     0.3151
Visual #2   0.6929   0.2311      1           0.1458     0.9783
Ariol #1    0.6346   0.6139      0.1458      1          0.1327
Ariol #2    0.7010   0.3151      0.9783      0.1327     1
Table 5 Permutation test on the whole cohort to compare the differences between the kappa statistics obtained from the cross-tabulations between DCC and each of the IHC scoring systems
            Visual #1   Visual #2   Ariol #1   Ariol #2
Visual #1   1           0.0005      0.2547     0.2982
Visual #2   0.0005      1           0.0327     0.0005
Ariol #1    0.2547      0.0327      1          0.0188
Ariol #2    0.2982      0.0005      0.0188     1
P values (with Bonferroni–Holmes corrections) are shown
P values for the pairwise comparisons are shown
Table 2 Permutation test on the “tamoxifen only” group to compare
the differences between chi-squares obtained in Kaplan–Meier analysis, between DCC and each of the IHC scoring systems
            DCC      Visual #1   Visual #2   Ariol #1   Ariol #2
DCC         1        0.3673      0.4966      0.1831     0.3878
Visual #1   0.3673   1           0.5922      0.3377     0.9296
Visual #2   0.4966   0.5922      1           0.1766     0.7314
Ariol #1    0.1831   0.3377      0.1766      1          0.3071
Ariol #2    0.3878   0.9296      0.7314      0.3071     1
P values are shown
Table 3 Permutation test on the “no systemic therapy” group to
compare the differences between chi-squares obtained in Kaplan–
Meier analysis, between DCC and each of the IHC scoring systems
            DCC      Visual #1   Visual #2   Ariol #1   Ariol #2
DCC         1        0.5731      0.3303      0.7082     0.4689
Visual #1   0.5731   1           0.3854      0.7725     0.8067
Visual #2   0.3303   0.3854      1           0.3207     0.6594
Ariol #1    0.7082   0.7725      0.3207      1          0.5724
Ariol #2    0.4689   0.8067      0.6594      0.5724     1
P values are shown
Interobserver variability was estimated comparing
visual scores of two pathologists, and the automated scores
generated by different Ariol hardware systems using the
same settings for colour, nuclear size and shape. When
comparing dichotomised scores on the whole series, the
interobserver agreement was excellent: for pathologist
versus pathologist scores, kappa = 0.918 (95%CI: 0.903–
0.932), and for two machine scores, kappa = 0.913
(95%CI: 0.897–0.928) (Table 4). There was also good
correlation between visual and automated scores, and for
both methods when compared to DCC (Table 5), with
neither visual nor machine scoring showing superiority
with respect to reproducibility or correlation with DCC.
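For readers unfamiliar with the kappa statistic used above, the following Python sketch computes Cohen's kappa from a 2 × 2 agreement table together with a simple large-sample confidence interval. The text does not say how the reported intervals were obtained, so the standard-error formula here is an assumption (a common approximation), the function name is hypothetical, and the counts in the usage note are invented toy values, not study data.

```python
import math
from typing import Tuple


def kappa_with_ci(both_pos: int, a_pos_b_neg: int, a_neg_b_pos: int,
                  both_neg: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Cohen's kappa and an approximate confidence interval from a 2x2 table."""
    n = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg
    if n == 0:
        raise ValueError("table is empty")
    po = (both_pos + both_neg) / n              # observed agreement
    pa = (both_pos + a_pos_b_neg) / n           # rater A positive rate
    pb = (both_pos + a_neg_b_pos) / n           # rater B positive rate
    pe = pa * pb + (1 - pa) * (1 - pb)          # agreement expected by chance
    if pe == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (po - pe) / (1 - pe)
    # Simple large-sample approximation to the standard error of kappa.
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return kappa, kappa - z * se, kappa + z * se
```

With toy counts of 40 concordant positives, 50 concordant negatives and 5 discordant cases in each direction, kappa_with_ci(40, 5, 5, 50) gives a kappa of roughly 0.80 with an interval of roughly 0.68 to 0.92.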
The cut-off point for positive and negative cases as
determined with the X-tile program [4] was 0.4% of positive nuclei on both sets of automated scores, using breast
cancer specific survival as a discriminator. The entire cohort
was separated into training and validation set by the program, and an optimal cut-off point was defined as 0.4% of
positive nuclei on the training set. This cut-off point was
confirmed on the validation set, thus we concluded it was
applicable for the entire cohort. On the entire cohort of the
patients, results of the log-rank test for cases dichotomised
as negative for 0–0.4% of positive nuclei and positive for
>0.4% of positive nuclei were χ² = 60.20, P = 8.55 × 10⁻¹⁵
(Fig. 2j). The results of the log-rank test for the patients
treated with tamoxifen only were as follows: χ² = 31.74,
P = 1.76 × 10⁻⁸ (Fig. 2k), and for the subset of patients
who received no treatment χ² = 31.59, P = 1.91 × 10⁻⁸
(Fig. 2l). On the validation set, results of the log-rank test
were χ² = 27.21, P = 1.83 × 10⁻⁷; the results of the
log-rank test for patients from the validation set treated with
tamoxifen only were χ² = 15.09, P = 1.03 × 10⁻⁴, and for
patients who received no treatment χ² = 18.20,
P = 1.99 × 10⁻⁵ (supplementary Fig. d–f). On the training set,
results of the log-rank test were χ² = 25.97, P = 3.47 × 10⁻⁷;
the results of the log-rank test for patients from the training
set treated with tamoxifen only were χ² = 14.10,
P = 1.74 × 10⁻⁴, and for patients who received no
treatment χ² = 14.76, P = 1.22 × 10⁻⁴ (supplementary
Fig. a–c).
Table 4 Kappa statistics on the whole cohort for comparison of reproducibility of different methods of ER assessment
            DCC (CI)                 Visual #1 (CI)           Visual #2 (CI)           Ariol #1 (CI)            Ariol #2 (CI)
DCC         1                        0.7286 (0.7019–0.7539)   0.6965 (0.6697–0.7237)   0.7188 (0.6918–0.7449)   0.7406 (0.7139–0.7665)
Visual #1   0.7286 (0.7019–0.7539)   1                        0.9184 (0.9035–0.9326)   0.8955 (0.8785–0.9116)   0.9021 (0.8854–0.9180)
Visual #2   0.6965 (0.6697–0.7237)   0.9184 (0.9035–0.9326)   1                        0.8852 (0.8680–0.9018)   0.8789 (0.8609–0.8955)
Ariol #1    0.7188 (0.6918–0.7449)   0.8955 (0.8785–0.9116)   0.8852 (0.8680–0.9018)   1                        0.9128 (0.8974–0.9277)
Ariol #2    0.7406 (0.7139–0.7665)   0.9021 (0.8854–0.9180)   0.8789 (0.8609–0.8955)   0.9128 (0.8974–0.9277)   1
The hazard ratios and log P values were plotted against
the percent of positive nuclei as identified by machine
(Fig. 4), using R 2.3.1 package. Hazard ratios and significance of survival difference (log P values) increase
dramatically when cut-off point is set to less than 1% of
positive nuclei, which confirms the results obtained with
the X-tile program.
Fig. 3 Kaplan–Meier analysis performed on the cases grouped based
on agreement/disagreement between different scoring systems. (a–c)
comparison between DCC and visual scoring #1, (d–f) comparison
between DCC and automated system #1, (g–i) comparison between
visual scoring #1 and automated scoring #1. (a, d, g) entire cohort; (b,
e, h) subset of the patients treated with tamoxifen only with no
chemotherapy; (c, f, i) subset of the patients that received no systemic
therapy
[Figure 3: nine panels (a–i) titled DCC vs. Visual, DCC vs. Machine and Visual vs. Machine; y-axis: cumulative survival; x-axis: follow-up (years), 0–20.]
Fig. 4 Variation of hazard ratio and log-rank P value among different cut-off points of percent of positive nuclei reported by the automated
system
[Figure 4: six panels (a–f) titled "Machine with X-tile defined cut-off" for all patients, tamoxifen only and no systemic therapy; y-axes: hazard ratio and log P value; x-axis: % positive nuclei, 0–100.]
Intensity of staining was also analysed by the machine,
and H-score was calculated as (% of positive nuclei) ×
(staining intensity). H-score ranged from 0 in completely
negative cases to 7267.74. As found using the X-tile program, the best cut-off point for H-score was 21.1. When the
H-score was dichotomised as negative for 0–21.1 and
positive for >21.1, the results of the log-rank test in the
entire cohort were as follows: χ² = 60.81, P = 6.27 × 10⁻¹⁵.
In the tamoxifen-only treated subset of patients, the results were
χ² = 28.09, P = 1.16 × 10⁻⁷, and in the subset with no systemic therapy χ² = 10.56, P = 0.001. The log-rank test
results are shown separately for the training and validation
sets at http://www.gpec.ubc.ca/index.php?content=papers/AUTO_ER.php.
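The H-score calculation and its dichotomisation described above reduce to a product and a threshold. The Python sketch below is illustrative: the function names are hypothetical, and the staining intensity is assumed to be in whatever units the image analysis system reports (the stated maximum of 7267.74 implies a scale well above the conventional 0–3 grading).

```python
H_SCORE_CUTOFF = 21.1  # cut-off reported in the text for the H-score


def h_score(percent_positive: float, staining_intensity: float) -> float:
    """H-score as defined in the text: (% positive nuclei) x (staining intensity)."""
    if not 0.0 <= percent_positive <= 100.0:
        raise ValueError("percent_positive must lie in [0, 100]")
    if staining_intensity < 0:
        raise ValueError("staining_intensity must be non-negative")
    return percent_positive * staining_intensity


def h_score_positive(score: float, cutoff: float = H_SCORE_CUTOFF) -> bool:
    """Positive if the H-score is strictly above the cut-off (0-21.1 is negative)."""
    return score > cutoff
```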
Discussion
The first attempts to use automated image analysis for
quantitation of IHC stained histological sections were
undertaken approximately 20 years ago, before IHC
replaced DCC as the most convenient means for ER
assessment [5, 6, 29]. However, until recently image quality
obtained with the video and digital cameras and the image
analysis software did not permit reliable fully automated
assessment of histological sections without continuous
supervision by a pathologist [29]. Recent advances in image
analysis techniques have allowed automatic selection of the
objects within the histological images based on their morphometric parameters, e.g. size, shape, range of staining
intensities, and fully automated quantitation of histological
sections is now technically feasible [5, 30, 31].
In theory, computer-assisted image analysis should
provide more accurate results for IHC quantitation than
semiquantitative scoring by a pathologist. A specially
calibrated device can measure the intensity of staining
much more precisely than a human eye [7, 13]. Using
optical density corrected by the intensity of the background instead of direct measurement of the intensity of
the IHC staining gives even more accurate assessment of
the staining and allows comparisons between slides
stained in different batches or in different laboratories.
Counting of hundreds and thousands of positive and
negative cells within a single image is a tedious procedure, and human counting error grows with
the number of objects [8], i.e. the more
cells are counted by eye, the less precise the results. The use
of immunofluorescent staining with its capabilities for
subcellular localisation of proteins and co-localisation of
specific protein expression with tissue-specific markers
has potential for significant improvement in assessment of
biomarker expression [32–34]. In practice, however, the
results of automated image analysis are influenced by a
number of factors other than technical issues. Image
analysis systems cannot differentiate between benign and
malignant lesions with a precision comparable to the
expertise level of an experienced pathologist [5, 6] and
most image analysis applications require pathologist input
to identify the area to be analyzed.
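The background-corrected optical density mentioned above is conventionally defined as the base-10 logarithm of the ratio of background to measured transmitted intensity. The Python sketch below assumes that conventional definition; the text does not state how any particular image analysis system computes it, and the function name is hypothetical.

```python
import math


def optical_density(pixel_intensity: float, background_intensity: float) -> float:
    """Optical density relative to background: log10(background / pixel).

    A pixel as bright as the background gives 0; darker (more heavily
    stained) pixels give larger values.
    """
    if pixel_intensity <= 0 or background_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log10(background_intensity / pixel_intensity)
```

Because the value is a ratio to each slide's own background, it is less sensitive to differences in illumination between scans than a raw intensity reading, which is the property the text relies on for comparing slides across batches or laboratories.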
TMAs provide almost ideal material for automated
unsupervised image analysis because of the careful selection of the areas containing lesions that a researcher is
specifically interested in, the concurrent application of
identical staining conditions to all cores on a single slide,
and the small size of the tissue cores, each of which can be
represented by a single image. Inaccuracies introduced
during TMA construction, e.g. cores taken from areas
that do not contain tumour tissue, affect a relatively
small percentage of all TMA cores, which can easily be excluded
from the analysis.
We performed our study on a TMA-based cohort of
3,484 cases of invasive breast carcinoma. We studied ER
expression in tumours, with visual semiquantitative
assessment of the staining, and quantitation with a commercial fully automated image analysis system. The results
of the IHC analysis by either pathologist or machine
demonstrated similar accuracy in assessment of prognostic
significance of ER expression in Kaplan–Meier analysis.
Comparison of log-rank tests with 10,000 permutations
detected no significant differences among any of these
methods (P > 0.05) when the scores were dichotomised as
negative and positive. The concordance between scores
was excellent. The best agreement was achieved between
two visual scorers (kappa = 0.918 (95%CI: 0.903–0.932))
and between two machines (kappa = 0.913 (95%CI:
0.897–0.928)). The worst concordance in this study was
observed between the DCC method and one of the human
scorers with kappa = 0.697 (95%CI: 0.669–0.722), which
is still considered to be good agreement.
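For readers unfamiliar with the kappa statistic quoted above, the following is a minimal sketch of Cohen's kappa for two raters. The function name and the toy scores are hypothetical; the confidence intervals reported in the text are not reproduced here.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of cases on which the two raters give the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy dichotomised ER scores (1 = positive, 0 = negative) for ten cores.
scorer_1 = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
scorer_2 = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
print(round(cohen_kappa(scorer_1, scorer_2), 3))  # 0.8
```

In this toy example the raters agree on 9 of 10 cores (0.9) against a chance agreement of 0.5, giving kappa = 0.8.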
The optimal cut-off point found in our study for machine
scoring (0.4% of positive tumour nuclei) is highly consistent with the findings of Harvey et al. [35], who suggested
an H-score of 2 as an optimal cut-off point (weak positivity
in less than 1% of positive nuclei). These two studies
support the idea that virtually any ER positivity is associated with good prognosis in tamoxifen-treated breast
cancer. Utilisation of an H-score did not provide any
additional information in our study and in fact was inferior
to simple percent of positive nuclei, according to the results
of the log-rank test in Kaplan–Meier analysis of the treatment subsets.
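The dichotomisation by percent of positive nuclei can be sketched as below. The 0.4% cut-off is taken from the text; the function names, the nuclei counts and the choice to treat a value exactly at the cut-off as positive are assumptions made for illustration only.

```python
def percent_positive(positive_nuclei: int, total_nuclei: int) -> float:
    """Percentage of tumour nuclei classified as ER-positive."""
    if total_nuclei <= 0:
        raise ValueError("total_nuclei must be positive")
    if not 0 <= positive_nuclei <= total_nuclei:
        raise ValueError("positive_nuclei must lie between 0 and total_nuclei")
    return 100.0 * positive_nuclei / total_nuclei


def er_status(percent: float, cutoff: float = 0.4) -> str:
    """Dichotomise a core; values at or above the cut-off count as positive
    (the handling of the boundary is an assumption, not stated in the text)."""
    return "positive" if percent >= cutoff else "negative"


# Hypothetical cores with 1,000 detected nuclei each.
print(er_status(percent_positive(3, 1000)))  # negative (0.3%)
print(er_status(percent_positive(5, 1000)))  # positive (0.5%)
```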
There are a number of limitations in our study. First, the
machine scoring was performed in fully automated mode
on TMAs, with no specific selection of tumour tissue
within the tissue cores, and therefore may not translate
directly to conventional sections. For example, some
benign ducts and lobules were likely scored along with the
tumours, which could have artificially decreased prognostic
significance of automated scoring. If image analysis is
applied to the whole sections instead of the TMAs, an
operator can specify an area of tumour tissue that should be
scanned and analysed; this would increase specificity of the
analysis (although in TMAs the initial core is derived from
a pathologist-selected area). Secondly, the two expert pathologists scored all cases within a week and, although the
scoring was done independently, they discussed the scoring
system beforehand, which may explain the relatively
high kappa statistic between the two visual scorers (0.918
(95%CI: 0.903–0.932)) compared with the usual agreement
reported for the scoring of nuclear staining (0.6–0.8) [36–39].
In conclusion, the present study shows that a fully
automated image analysis with a system trained by a
human, but without continuous human supervision, can
provide results that do not differ from the scoring of
estrogen receptor immunohistochemistry by an expert
pathologist or from dextran–charcoal radioimmunoassay.
Further development of image analysis techniques will
inevitably improve the accuracy of object detection and
classification in histological sections, making such automated image analysis systems suitable for use in clinical
histopathology laboratories.
Acknowledgements The study was supported in part by an unrestricted educational grant from Sanofi-Aventis, Canada and a
Translational Acceleration grant from the Canadian Breast Cancer
Research Alliance. TON and DGH are scholars of the Michael Smith
Foundation for Health Research.
References
1. Duffy MJ (2006) Estrogen receptors: role in breast cancer. Crit
Rev Clin Lab Sci 43:325–347
2. Cheang MC, Treaba DO, Speers CH et al (2006) Immunohistochemical detection using the new rabbit monoclonal antibody
SP1 of estrogen receptor in breast cancer is superior to mouse
monoclonal antibody 1D5 in predicting survival. J Clin Oncol
24:5637–5644
3. Bolger N, Heffron C, Regan I et al (2006) Implementation and
evaluation of a new automated interactive image analysis system.
Acta Cytol 50:483–491
4. Camp RL, Dolled-Filhart M, Rimm DL (2004) X-tile: a new bioinformatics tool for biomarker assessment and outcome-based
cut-point optimization. Clin Cancer Res 10:7252–7259
5. Taylor CR, Levenson RM (2006) Quantification of immunohistochemistry-issues concerning methods, utility and semiquantitative assessment II. Histopathology 49:411–424
6. Walker RA (2006) Quantification of immunohistochemistry-issues concerning methods, utility and semiquantitative assessment I. Histopathology 49:406–410
7. BVD/FOGRA (1992) Manual for standardization of the offset
printing process. Wiesbaden
8. Insight into images: principles and practices for segmentation,
registration, and image analysis (2004). A.K. Peters Ltd.,
Wellesley, MA
9. Byrne A, Hilbert DR (2003) Color realism and color science.
Behav Brain Sci 26:3–21; discussion 22–63
10. Gegenfurtner KR (2003) Cortical mechanisms of colour vision.
Nat Rev Neurosci 4:563–572
11. McClelland D, Fuller LU (2005) Photoshop CS2 bible. Wiley
Publishing Inc., Hoboken, NJ
12. Rinner O, Gegenfurtner KR (2000) Time course of chromatic
adaptation for color appearance and discrimination. Vision Res
40:1813–1826
13. Wen C-H, Lee J-J (2000) Design and production of color calibration targets for digital input devices. In: Input/output and
imaging technologies II:4080. Taipei, Taiwan, pp 148–158
14. Greene GL, Nolan C, Engler JP et al (1980) Monoclonal antibodies to human estrogen receptor. Proc Natl Acad Sci USA
77:5115–5119
15. King WJ, Greene GL (1984) Monoclonal antibodies localize oestrogen receptor in the nuclei of target cells. Nature 307:745–747
16. Underwood JC (1983) Oestrogen receptors in human breast
cancer: review of histopathological correlations and critique of
histochemical methods. Diagn Histopathol 6:1–22
17. Aziz DC, Barathur RB (1994) Quantitation and morphometric
analysis of tumors by image analysis. J Cell Biochem Suppl
19:120–125
18. Esteban JM, Ahn C, Battifora H et al (1994) Quantitative immunohistochemical assay for hormonal receptors: technical aspects
and biological significance. J Cell Biochem Suppl 19:138–145
19. Schultz DS, Katz RL, Patel S et al (1992) Comparison of visual
and CAS-200 quantitation of immunocytochemical staining in
breast carcinoma samples. Anal Quant Cytol Histol 14:35–40
20. Kononen J, Bubendorf L, Kallioniemi A et al (1998) Tissue
microarrays for high-throughput molecular profiling of tumor
specimens. Nat Med 4:844–847
21. Makretsov N, Gilks CB, Coldman AJ et al (2003) Tissue microarray analysis of neuroendocrine differentiation and its prognostic
significance in breast cancer. Hum Pathol 34:1001–1008
22. Turbin DA, Cheang MC, Bajdik CD et al (2006) MDM2 protein
expression is a negative prognostic marker in breast carcinoma.
Mod Pathol 19:69–74
23. Liu CL, Prapong W, Natkunam Y et al (2002) Software tools for
high-throughput analysis and archiving of immunohistochemistry
staining data obtained with tissue microarrays. Am J Pathol
161:1557–1565
24. Liu CL, Montgomery KD, Natkunam Y et al (2005) TMACombiner, a simple software tool to permit analysis of replicate
cores on tissue microarrays. Mod Pathol 18:1641–1648
25. Ng TL, Gown AM, Barry TS et al (2005) Nuclear beta-catenin in
mesenchymal tumors. Mod Pathol 18:68–74
26. de las Mulas JM, van Niel M, Millan Y et al (2000) Immunohistochemical analysis of estrogen receptors in feline mammary
gland benign and malignant lesions: comparison with biochemical assay. Domest Anim Endocrinol 18:111–125
27. Magne N, Toillon RA, Castadot P et al (2006) Different clinical
impact of estradiol receptor determination according to the analytical method: A study on 1940 breast cancer patients over a
period of 16 consecutive years. Breast Cancer Res Treat 95:179–184
28. Costa SD, Lange S, Klinga K et al (2002) Factors influencing the
prognostic role of oestrogen and progesterone receptor levels in
breast cancer – results of the analysis of 670 patients with
11 years of follow-up. Eur J Cancer 38:1329–1334
29. Franklin WA, Bibbo M, Doria MI et al (1987) Quantitation of
estrogen receptor content and Ki-67 staining in breast carcinoma
by the microTICAS image analysis system. Anal Quant Cytol
Histol 9:279–286
30. Gil J, Wu HS (2003) Applications of image analysis to anatomic
pathology: realities and promises. Cancer Invest 21:950–959
31. Rojo MG, Garcia GB, Mateos CP et al (2006) Critical comparison of 31 commercially available digital slide systems in
pathology. Int J Surg Pathol 14:285–305
32. Cregger M, Berger AJ, Rimm DL (2006) Immunohistochemistry
and quantitative analysis of protein expression. Arch Pathol Lab
Med 130:1026–1030
33. McCabe A, Dolled-Filhart M, Camp RL et al (2005) Automated
quantitative analysis (AQUA) of in situ protein expression,
antibody concentration, and prognosis. J Natl Cancer Inst
97:1808–1815
34. Rimm DL (2006) What brown cannot do for you. Nat Biotechnol
24:914–916
35. Harvey JM, Clark GM, Osborne CK et al (1999) Estrogen
receptor status by immunohistochemistry is superior to the
ligand-binding assay for predicting response to adjuvant endocrine therapy in breast cancer. J Clin Oncol 17:1474–1481
36. Diaz LK, Sahin A, Sneige N (2004) Interobserver agreement for
estrogen receptor immunohistochemical analysis in breast cancer:
a comparison of manual and computer-assisted scoring methods.
Ann Diagn Pathol 8:23–27
37. Hasegawa T, Yamamoto S, Matsuno Y (2002) Quantitative
immunohistochemical evaluation of MIB-1 labeling index in
adult soft-tissue sarcomas by computer-assisted image analysis.
Pathol Int 52:433–437
38. Kirkegaard T, Edwards J, Tovey S et al (2006) Observer variation in immunohistochemical analysis of protein expression, time
for a change? Histopathology 48:787–794
39. Lorinc E, Jakobsson B, Landberg G et al (2005) Ki67 and p53
immunohistochemistry reduces interobserver variation in
assessment of Barrett’s oesophagus. Histopathology 46:642–648