Breast Cancer Res Treat
DOI 10.1007/s10549-007-9736-z

PRECLINICAL STUDY

Automated quantitative analysis of estrogen receptor expression in breast carcinoma does not differ from expert pathologist scoring: a tissue microarray study of 3,484 cases

Dmitry A. Turbin · Samuel Leung · Maggie C. U. Cheang · Hagen A. Kennecke · Kelli D. Montgomery · Steven McKinney · Diana O. Treaba · Niki Boyd · Lynn C. Goldstein · Sunil Badve · Allen M. Gown · Matt van de Rijn · Torsten O. Nielsen · C. Blake Gilks · David G. Huntsman

Received: 9 August 2007 / Accepted: 13 August 2007
Springer Science+Business Media, LLC 2007

Electronic supplementary material: The online version of this article (doi:10.1007/s10549-007-9736-z) contains supplementary material, which is available to authorized users.

D. A. Turbin, S. Leung, M. C. U. Cheang, H. A. Kennecke, T. O. Nielsen, C. B. Gilks, D. G. Huntsman: Genetic Pathology Evaluation Centre, Vancouver Coastal Health Research Institute, Vancouver, BC, Canada

D. A. Turbin, S. Leung, M. C. U. Cheang, H. A. Kennecke, T. O. Nielsen, C. B. Gilks, D. G. Huntsman (corresponding author): British Columbia Cancer Agency, University of British Columbia, Rm 3427 - 600 West 10th Avenue, Vancouver, BC, Canada V5Z 4E5; e-mail: dhuntsma@bccancer.bc.ca

D. A. Turbin, S. McKinney, N. Boyd, D. G. Huntsman: Centre for Translational and Applied Genomics, Vancouver, BC, Canada

K. D. Montgomery, M. van de Rijn: Stanford University Medical Center, Stanford, CA, USA

S. Badve: Indiana University Hospital, Indianapolis, IN, USA

D. O. Treaba, L. C. Goldstein, A. M. Gown: PhenoPath Laboratories, Seattle, WA, USA

Abstract

Background Estrogen receptor (ER) expression is routinely assessed by immunohistochemistry (IHC) in breast carcinoma. Our study compares visual scoring of ER in invasive breast cancer by histopathologists to quantitation of staining using a fully automated system.

Materials and methods A tissue microarray was constructed from 4,049 cases (3,484 included in analysis) of invasive breast carcinoma linked to treatment and outcome information. Slides were scored independently by two pathologists and scores were dichotomised, with ER positivity recognized at a cut-off of >1% positive nuclei. The slides were scanned and analyzed with an Ariol automated system.

Results Using data dichotomised as ER positive or negative, both visual and automated scores were highly consistent: there was excellent concordance between two pathologists (kappa = 0.918 (95% CI: 0.903–0.932)) and between two Ariol machines (kappa = 0.913 (95% CI: 0.897–0.928)). The prognostic significance of ER positivity was similar whether determined by pathologist or automated scoring for both the entire patient cohort and subsets of patients treated with tamoxifen alone or receiving no systemic adjuvant therapy. The optimal cut point for the automated scores using breast cancer disease-specific survival as an endpoint was >0.4% positive nuclei. The concordance between dextran-coated charcoal ER biochemical assay data and automated scores (kappa = 0.728 (95% CI: 0.69–0.75); 0.74 (95% CI: 0.71–0.77)) was similar to the concordance between biochemical assay and pathologist scores (kappa = 0.72 (95% CI: 0.70–0.75); 0.70 (95% CI: 0.67–0.72)).

Conclusion Fully automated quantitation of ER immunostaining yields results that do not differ from human scoring against both biochemical assay and patient outcome gold standards.

Keywords Automated scoring · Breast cancer · Estrogen receptor · Immunohistochemistry · Pathology · Tissue microarray

Introduction

Estrogen receptor (ER) expression in breast carcinoma is routinely assessed to predict response to hormonal treatment. Patients with tumours expressing ER are seven to eight times more likely to benefit from endocrine therapy than patients with ER-negative tumours [1].
ER expression is also a prognostic factor for breast cancer, as it is associated with favourable outcome in untreated patients [2].

Advances in computers and imaging techniques in the past decade have led to increased interest in the application of computerised image analysis in different fields, including pathology [3–5]. The potential benefits associated with this technology include improved quantification and reproducibility [5, 6]. Electronic device-assisted measurement of staining intensity is more precise than assessment with a human eye, and is not influenced by factors such as the ambient light or pathologist fatigue [7–13].

Use of image analysis for scoring of ER immunohistochemistry (IHC) has been considered since the inception of ER IHC testing in the 1980s [14–16]. However, the process of image analysis at that time was more time consuming and labour-intensive than the visual scoring from the glass slide. Image analysis software and hardware have significantly improved and are now being considered for use in clinical histopathology laboratories [17–19].

We have compared results of visual and automated scoring of ER immunostaining on a TMA constructed from 4,049 cases of invasive breast carcinoma, with data from 3,484 cases used for analysis. We then evaluated the interobserver variability for both visual scoring and the automated image analysis system.

Material and methods

TMA construction and immunostaining

Ethical approval for the study was obtained from the Clinical Research Ethics Board of the British Columbia Cancer Agency. A series of 4,049 cases of invasive breast carcinoma diagnosed in 1986–1992 and referred to the British Columbia Cancer Agency for treatment were assembled into 17 tissue microarray blocks. Tissue microarrays (TMA) were built using tissue cores from formalin-fixed paraffin-embedded tumours. This material had been frozen prior to neutral buffered formalin fixation.
Hematoxylin and eosin (H&E) stained slides were reviewed, and areas containing tumour tissue were marked on both the slides and corresponding paraffin blocks for tissue microarray construction. A single 0.6 mm core was taken from every donor block. Microarray blocks were constructed using a manual arrayer (Beecher Instruments, Inc., Silver Springs, MD) as previously described [20, 21]. Sections (4 µm thick) were cut from the array blocks.

Sections were then stained using the DakoCytomation EnVision+ System-HRP [2]. Sections were deparaffinised with xylene and dehydrated through three alcohol changes. Endogenous peroxidase activity was quenched by incubating for 5 min in 0.03% hydrogen peroxide/sodium azide. Slides were then incubated with a primary anti-ER antibody, clone SP1 (1:250, LabVision, Fremont, CA), followed by incubation with peroxidase-labelled polymer in a Tris–HCl buffer for 30 min. Staining was completed by a 10 min incubation with DAB+ chromogen. Slides were then counterstained with hematoxylin and mounted.

Data from 3,484 of the 4,049 cases were used for the statistical analysis. Five hundred and sixty-five cores were not used for a variety of reasons, including core drop-off during processing, insufficient or absent tumour tissue within the cores, or artefactual distortion of the tissue making interpretation impossible [22].

Biomarkers were scored visually from the glass slides by two independent pathologists from different laboratories (DOT and SB). Cut-off points for the visual scoring were: 0: up to 1% positive tumour nuclei, 1+: from 1% to 25% positive nuclei, 2+: 25–75% positive nuclei, and 3+: more than 75% positive nuclei. Scores were entered into a standardised Excel worksheet (Microsoft Excel, Microsoft, Redmond, WA) with a sector map matching each TMA section. Cases were not included in the statistical analysis if there were no interpretable data, i.e. if there was no tumour tissue in the cores or the cores were cut through.
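The visual scoring categories and the dichotomisation at 1% can be written down as a small mapping. The Python sketch below is only an illustration: the function names are hypothetical, and the handling of values falling exactly on a boundary (1%, 25%, 75%) is an assumption, since the category definitions above overlap at their end points.

```python
def visual_category(percent_positive: float) -> str:
    """Map percent of positive tumour nuclei to the four visual score categories."""
    if not 0.0 <= percent_positive <= 100.0:
        raise ValueError("percent_positive must be between 0 and 100")
    if percent_positive <= 1.0:    # assumption: exactly 1% is scored 0
        return "0"
    if percent_positive <= 25.0:   # assumption: exactly 25% is scored 1+
        return "1+"
    if percent_positive <= 75.0:   # assumption: exactly 75% is scored 2+
        return "2+"
    return "3+"


def is_er_positive(percent_positive: float, cutoff: float = 1.0) -> bool:
    """Dichotomise: ER positive when more than `cutoff` percent of nuclei stain."""
    return percent_positive > cutoff


if __name__ == "__main__":
    for pct in (0.5, 10.0, 50.0, 90.0):
        print(pct, visual_category(pct), is_er_positive(pct))
```

Keeping the cut-off as a parameter makes it straightforward to repeat the dichotomisation at a different threshold without changing the category mapping.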
Original scoring grids were converted to tables using Deconvoluter 1.10 [23] and combined in a single text file with TMACombiner 1.00 [24]. The resulting text files were imported into SPSS 11.0 for Windows.

The same slides were digitised with a commercial image analysis system, Ariol (Applied Imaging Inc., San Jose, CA). The Ariol scanner is based on an Olympus microscope with motorized stage and autofocus capabilities, equipped with a black and white video camera. Slides were scanned at 20× objective magnification with three filters (red, green and blue); Ariol software converts these three-channel images into colour reconstructions. The included software was used for image analysis.

The program was trained by a pathologist (DAT) to analyse objects of specific colour, size and shape for this particular staining, in order to increase the specificity of the analysis. This is done by the pathologist identifying and tracing out invasive breast cancer areas on a representative area of a core containing positively stained and negative nuclei. This step ensures that stromal matrix and most stromal cells are excluded from image analysis, allowing the program to calculate the percent of positive tumour cells more precisely. After the program was trained on one of the representative TMA cores, the rest of the analysis (17 TMA slides) was performed without human supervision, i.e. the program analysed images one by one based on the settings established during the training. All tissue cores were analysed in toto; no specific pathologist selection of tumour tissue within the cores was made following the training step. For statistical analysis, we selected only cores with at least 50 cells analysed; all cores with fewer than 50 cells were considered unscorable.

To get an estimate of the demands posed on the operator of the Ariol system, the same slides were scanned and processed on an identical Ariol system by an operator with less than 1 week of experience (KDM).
The descriptors of the colour and shape of the positive and negative tumour cells were transferred from one system to the other; therefore, variations in the image analysis results depended only on the scanner settings, i.e. positioning and white balance, and not on the image analysis settings.

The hematoxylin and eosin and IHC images of all cores used in this study are publicly available at the companion site http://www.gpecimage.ubc.ca/tma/web/viewer.php [2, 25]. The site was constructed using the GPEC database and a Java applet provided by Bacus Laboratories, Inc. All the slides were scanned with a BLISS scanner (Bacus Laboratories, Inc., Lombard, IL), and posted on the site. WebSlide Browser for Windows (Bacus Laboratories, Inc., Lombard, IL) can be used for viewing preview images of the arrays and images of individual cores.

We performed Kaplan–Meier analysis on the data provided by both visual and automated scoring systems. We also used dextran-coated charcoal biochemical assay (DCC) values from the original clinical database as an external standard measure for ER quantitation [26, 27]. The data were categorised as negative for 0–1 fmol/mg, weak for 2–9 fmol/mg, moderate for 10–159 fmol/mg, and strong for ≥160 fmol/mg. A cut-off point of 10 fmol/mg was selected for dichotomisation of the DCC data as ER negative or positive [27, 28].

Statistical analysis

Statistical analysis was performed in SPSS 11.0 for Windows (SPSS Inc., Chicago, IL). Univariate analysis of survival was done by calculating Kaplan–Meier survival curves. Subsets of patients were compared using log-rank tests. All tests were two-sided and used a 5% alpha level to determine significance. We used the open-source R 2.3.1 package to calculate differences between kappa statistics from DCC to IHC comparisons; a permutation test with 10,000 permutations was implemented.
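To make the kappa comparison concrete, the following Python sketch computes Cohen's kappa for dichotomised calls and a permutation P value for the difference between two kappas against a common reference. It is not the R code used in the study: the permutation scheme (randomly swapping the two methods' calls within each case) is an assumption, as the text does not describe how the permutations were generated, and all function names are hypothetical.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) calls."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)  # chance agreement
    if expected == 1.0:  # both lists constant and identical; kappa is undefined
        return 1.0
    return (observed - expected) / (1 - expected)


def kappa_difference_permutation(ref, method_a, method_b, n_perm=10_000, seed=0):
    """Permutation P value for kappa(ref, A) - kappa(ref, B).

    Assumed null hypothesis: within each case the two methods' calls are
    exchangeable, so they are swapped with probability 0.5 per permutation.
    """
    rng = random.Random(seed)
    observed = cohen_kappa(ref, method_a) - cohen_kappa(ref, method_b)
    extreme = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for x, y in zip(method_a, method_b):
            if rng.random() < 0.5:
                x, y = y, x
            perm_a.append(x)
            perm_b.append(y)
        diff = cohen_kappa(ref, perm_a) - cohen_kappa(ref, perm_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated P value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical dichotomised calls (1 = ER positive) for eight cases.
    dcc = [1, 1, 0, 0, 1, 0, 1, 1]
    visual = [1, 1, 0, 0, 1, 1, 1, 1]
    ariol = [1, 0, 0, 0, 1, 0, 1, 1]
    print(kappa_difference_permutation(dcc, visual, ariol, n_perm=1000))
```

A paired swap is used because both methods score the same cases; an unpaired shuffle would ignore that dependence.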
The same test was performed to compare chi-squares obtained in Kaplan–Meier analyses, to determine whether different scoring systems (human versus machine) give significantly different results. R 2.3.1 was also used for plotting variations of the hazard ratio and log-rank P values depending on the cut-off point of percent of positive nuclei reported with the automated system.

Continuous variable data obtained with the automated image analysis systems were categorised using the optimal survival cut-off points for our cohort, as determined with X-tile v3.4.3 software [4]. The prognostic value of the categorical data obtained with this method was compared to that of the data categorised according to the preselected 1% cut-off point system.

Results

Median follow-up time for the 3,484 cases analysed was 12.47 years; median age at diagnosis was 60 years. Of these women, 1,039 (29.8%) were premenopausal, 2,360 (67.7%) were postmenopausal, two were pregnant (0.1%), and menstrual status was unknown in 83 (2.4%). Lymph nodes were negative in 1,935 (55.5%) cases, positive in 1,540 (44.2%), and in 9 (0.3%) cases nodal status was unknown. Ductal carcinoma was seen in 3,169 (91.0%) cases, lobular carcinoma in 243 (7.0%), and other types in 72 (2.1%) cases. All patients underwent surgical treatment; in addition, 1,116 (32.0%) patients received tamoxifen only, 1,443 (41.4%) patients received no systemic therapy, 656 (18.8%) were treated with chemotherapy, and 259 (7.4%) received a combination of chemotherapy and tamoxifen. Other treatment methods (ovarian ablation or hormonal therapy other than tamoxifen, with or without chemotherapy) were applied in 10 (0.2%) cases.

Kaplan–Meier survival analysis of cases stratified based on ER status, as determined by visual or machine scoring of the immunostained slides, or by DCC, is shown in Fig. 1. Results of the log-rank tests for the individual training and validation sets are shown as a supplementary figure.
Results of the log-rank tests with P values for the whole set of patients, stratified as negative, weak, moderate, or strongly ER positive, are as follows: visual scoring χ² = 50.95, P = 9.47 × 10⁻¹³; Ariol χ² = 37.71, P = 8.22 × 10⁻¹⁰; DCC χ² = 66.25, P = 3.97 × 10⁻¹⁶ (Fig. 1a, d, g). After dichotomisation of the scores as either ER positive or negative (Fig. 2), the results of the log-rank test are: visual scoring χ² = 49.69, P = 1.81 × 10⁻¹²; Ariol χ² = 54.17, P = 1.84 × 10⁻¹³; DCC χ² = 53.76, P = 2.27 × 10⁻¹³. The differences in prognostic significance of these different analyses of ER status are not statistically significant (Table 1), i.e. pathologist and automated ER scoring gave similar results after dichotomisation of scores into ER positive or negative (Fig. 2a, d, g).

In the group of patients that received adjuvant treatment with tamoxifen only (Fig. 1b, e, h), results of the log-rank test with P values for cases stratified as negative, weak, moderate, or strongly ER positive are as follows: visual scoring χ² = 26.07, P = 3.29 × 10⁻⁷; Ariol χ² = 12.65, P = 0.0004; DCC χ² = 29.04, P = 7.08 × 10⁻⁸.
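As a check on how the chi-square statistics above relate to their P values, the short Python sketch below converts a chi-square value to an upper-tail P value. It assumes one degree of freedom; this is an inference from the reported numbers rather than something stated in the text, and the function name is hypothetical.

```python
import math


def chi2_sf_1df(chi2: float) -> float:
    """Upper-tail P value of a chi-square statistic with one degree of freedom.

    For 1 df, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    """
    if chi2 < 0:
        raise ValueError("chi2 must be non-negative")
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Whole-cohort chi-square values for the four-category stratification.
    for label, chi2 in [("visual", 50.95), ("Ariol", 37.71), ("DCC", 66.25)]:
        print(f"{label}: chi2 = {chi2:.2f}, P = {chi2_sf_1df(chi2):.2e}")
```

Using `math.erfc` avoids an external statistics dependency and stays accurate for the very small tail probabilities reported here.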
After dichotomisation of the scores as either ER negative or positive (Fig. 2b, e, h), the results of the log-rank test are: visual scoring χ² = 20.90, P = 4.84 × 10⁻⁶; Ariol χ² = 30.69, P = 3.02 × 10⁻⁸; DCC χ² = 16.48, P = 4.91 × 10⁻⁵. The differences in prognostic significance of these different analyses of ER status (Fig. 2b, e, h) are not statistically significant (Table 2).

In the group of patients that received no systemic therapy (Fig. 1c, f, i), results of the log-rank test with P values for cases stratified as negative, weak, moderate, or strongly ER positive are as follows: visual scoring χ² = 20.59, P = 5.70 × 10⁻⁶; Ariol χ² = 15.17, P = 9.83 × 10⁻⁵; DCC χ² = 38.56, P = 5.30 × 10⁻¹⁰. After dichotomisation of the scores (Fig. 2c, f, i), the results of the log-rank test are: visual scoring χ² = 27.47, P = 1.60 × 10⁻⁷; Ariol χ² = 31.09, P = 2.46 × 10⁻⁸; DCC χ² = 42.69, P = 6.43 × 10⁻¹¹. The differences in prognostic significance of these different analyses of ER status (Fig. 2c, f, i) are not statistically significant (Table 3).

The concordant and discrepant results for visual, machine, and DCC for the entire cohort, tamoxifen treated subset of patients, and patients receiving no systemic therapy are displayed as Kaplan–Meier survival curves (Fig. 3).

[Figure 1: nine Kaplan–Meier panels (a–i); y-axis: cumulative survival; x-axis: follow-up (years). Panel titles: a Visual, all patients; b Visual, tamoxifen only; c Visual, no systemic therapy; d Machine, all patients; e Machine, tamoxifen only; f Machine, no systemic therapy; g DCC, all patients; h DCC, tamoxifen only; i DCC, no systemic therapy.]

Fig. 1 Kaplan–Meier analysis performed on the data categorised at 1, 25 and 75% cut-off points. (a–c) visual scoring #1, (d–f) automated system #1, (g–i) DCC. (a, d, g) entire cohort; (b, e, h) subset of the patients treated with tamoxifen only with no chemotherapy; (c, f, i) subset of the patients that received no systemic therapy
[Figure 2: twelve Kaplan–Meier panels (a–l), positive versus negative; y-axis: cumulative survival; x-axis: follow-up (years). Panel titles: a–c Visual; d–f Machine; g–i DCC; j–l Machine with X-tile-defined cut-off; each shown for all patients, tamoxifen only, and no systemic therapy.]

Fig. 2 Kaplan–Meier analysis performed on the data dichotomised at 1% cut-off point as negative versus positive. (a–c) visual scoring #1, (d–f) automated system #1, (g–i) DCC, (j–l) automated system #1 with the cut-off point defined with the X-tile program. (a, d, g) entire cohort; (b, e, h) subset of the patients treated with tamoxifen only with no chemotherapy; (c, f, i) subset of the patients that received no systemic therapy

Table 1 Permutation test on the whole cohort to compare the differences between chi-squares obtained in Kaplan–Meier analysis, between DCC and each of the IHC scoring systems

            DCC      Visual #1  Visual #2  Ariol #1  Ariol #2
DCC         1        0.8433     0.6929     0.6346    0.7010
Visual #1   0.8433   1          0.2311     0.6139    0.3151
Visual #2   0.6929   0.2311     1          0.1458    0.9783
Ariol #1    0.6346   0.6139     0.1458     1         0.1327
Ariol #2    0.7010   0.3151     0.9783     0.1327    1

P values (with Bonferroni–Holm corrections) are shown

Table 2 Permutation test on the "tamoxifen only" group to compare the differences between chi-squares obtained in Kaplan–Meier analysis, between DCC and each of the IHC scoring systems

            DCC      Visual #1  Visual #2  Ariol #1  Ariol #2
DCC         1        0.3673     0.4966     0.1831    0.3878
Visual #1   0.3673   1          0.5922     0.3377    0.9296
Visual #2   0.4966   0.5922     1          0.1766    0.7314
Ariol #1    0.1831   0.3377     0.1766     1         0.3071
Ariol #2    0.3878   0.9296     0.7314     0.3071    1

P values are shown

Table 3 Permutation test on the "no systemic therapy" group to compare the differences between chi-squares obtained in Kaplan–Meier analysis, between DCC and each of the IHC scoring systems

            DCC      Visual #1  Visual #2  Ariol #1  Ariol #2
DCC         1        0.5731     0.3303     0.7082    0.4689
Visual #1   0.5731   1          0.3854     0.7725    0.8067
Visual #2   0.3303   0.3854     1          0.3207    0.6594
Ariol #1    0.7082   0.7725     0.3207     1         0.5724
Ariol #2    0.4689   0.8067     0.6594     0.5724    1

P values are shown

Table 5 Permutation test on the whole cohort to compare the differences between the kappa statistics obtained from the crosstabulations between DCC and each of the IHC scoring systems

            Visual #1  Visual #2  Ariol #1  Ariol #2
Visual #1   1          0.0005     0.2547    0.2982
Visual #2   0.0005     1          0.0327    0.0005
Ariol #1    0.2547     0.0327     1         0.0188
Ariol #2    0.2982     0.0005     0.0188    1

P values for the pairwise comparisons are shown

Interobserver variability was estimated by comparing the visual scores of two pathologists, and the automated scores generated by different Ariol hardware systems using the same settings for colour, nuclear size and shape. When comparing dichotomised scores on the whole series, the interobserver agreement was excellent: for pathologist versus pathologist scores, kappa = 0.918 (95% CI: 0.903–0.932), and for two machine scores, kappa = 0.913 (95% CI: 0.897–0.928) (Table 4). There was also good correlation between visual and automated scores, and for both methods when compared to DCC (Table 5), with neither visual nor machine scoring showing superiority with respect to reproducibility or correlation with DCC.

The cut-off point for positive and negative cases as determined with the X-tile program [4] was 0.4% of positive nuclei on both sets of automated scores, using breast cancer specific survival as a discriminator. The entire cohort was separated into training and validation sets by the program, and an optimal cut-off point was defined as 0.4% of positive nuclei on the training set. This cut-off point was confirmed on the validation set; thus we concluded it was applicable for the entire cohort. On the entire cohort of patients, results of the log-rank test for cases dichotomised as negative for 0–0.4% of positive nuclei and positive for >0.4% of positive nuclei were χ² = 60.20, P = 8.55 × 10⁻¹⁵ (Fig. 2j). The results of the log-rank test for the patients treated with tamoxifen only were as follows: χ² = 31.74, P = 1.76 × 10⁻⁸ (Fig. 2k), and for the subset of patients who received no treatment χ² = 31.59, P = 1.91 × 10⁻⁸ (Fig. 2l).
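The idea of an outcome-based cut-off search can be illustrated with a simplified Python sketch: compute a two-group log-rank chi-square at each candidate cut-off and keep the cut-off with the largest statistic. This is not the X-tile algorithm, which additionally uses training and validation sets as described above; the function names and the toy data are hypothetical.

```python
def logrank_chi2(times, events, in_group):
    """Two-group log-rank chi-square (1 df).

    times: follow-up times; events: 1 if the event occurred, 0 if censored;
    in_group: True for cases in the group of interest (e.g. ER positive).
    """
    data = list(zip(times, events, in_group))
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({ti for ti, e, _ in data if e}):
        n = sum(1 for ti, _, _ in data if ti >= t)          # at risk, all
        n1 = sum(1 for ti, _, g in data if ti >= t and g)   # at risk, group
        d = sum(1 for ti, e, _ in data if ti == t and e)    # events, all
        d1 = sum(1 for ti, e, g in data if ti == t and e and g)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def best_cutoff(percent_positive, times, events, candidates):
    """Return (cutoff, chi2) for the candidate with the largest log-rank chi-square."""
    scored = [
        (logrank_chi2(times, events, [p > c for p in percent_positive]), c)
        for c in candidates
    ]
    chi2, cutoff = max(scored)
    return cutoff, chi2


if __name__ == "__main__":
    # Toy data, purely hypothetical: four cases, all with an observed event.
    pct = [0.0, 0.2, 5.0, 50.0]
    follow_up = [1, 2, 3, 4]
    died = [1, 1, 1, 1]
    print(best_cutoff(pct, follow_up, died, candidates=[0.1, 0.4, 10.0]))
```

Choosing the cut-off that maximises the statistic inflates its apparent significance on the same data, which is why confirming the cut-off on a separate validation set, as reported here, matters.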
On the validation set, results of the log-rank test were χ² = 27.21, P = 1.83 × 10⁻⁷; the results of the log-rank test for patients from the validation set treated with tamoxifen only were χ² = 15.09, P = 1.03 × 10⁻⁴, and for patients who received no treatment χ² = 18.20, P = 1.99 × 10⁻⁵ (supplementary Fig. d–f). On the training set, results of the log-rank test were χ² = 25.97, P = 3.47 × 10⁻⁷; the results of the log-rank test for patients from the training set treated with tamoxifen only were χ² = 14.10, P = 1.74 × 10⁻⁴, and for patients who received no treatment χ² = 14.76, P = 1.22 × 10⁻⁴ (supplementary Fig. a–c).

Table 4 Kappa statistics on the whole cohort for comparison of reproducibility of different methods of ER assessment

            DCC (CI)                  Visual #1 (CI)            Visual #2 (CI)            Ariol #1 (CI)             Ariol #2 (CI)
DCC         1                         0.7286 (0.7019–0.7539)    0.6965 (0.6697–0.7237)    0.7188 (0.6918–0.7449)    0.7406 (0.7139–0.7665)
Visual #1   0.7286 (0.7019–0.7539)    1                         0.9184 (0.9035–0.9326)    0.8955 (0.8785–0.9116)    0.9021 (0.8854–0.9180)
Visual #2   0.6965 (0.6697–0.7237)    0.9184 (0.9035–0.9326)    1                         0.8852 (0.8680–0.9018)    0.8789 (0.8609–0.8955)
Ariol #1    0.7188 (0.6918–0.7449)    0.8955 (0.8785–0.9116)    0.8852 (0.8680–0.9018)    1                         0.9128 (0.8974–0.9277)
Ariol #2    0.7406 (0.7139–0.7665)    0.9021 (0.8854–0.9180)    0.8789 (0.8609–0.8955)    0.9128 (0.8974–0.9277)    1

The hazard ratios and log P values were plotted against the percent of positive nuclei as identified by machine (Fig. 4), using the R 2.3.1 package. Hazard ratios and significance of survival difference (log P values) increase dramatically when the cut-off point is set to less than 1% of positive nuclei, which confirms the results obtained with the X-tile program.

[Figure 3: nine Kaplan–Meier panels (a–i); y-axis: cumulative survival; x-axis: follow-up (years). Panel titles: a–c DCC vs. Visual; d–f DCC vs. Machine; g–i Visual vs. Machine. Curves are grouped as both positive, both negative, and the two discordant combinations.]

Fig. 3 Kaplan–Meier analysis performed on the cases grouped based on agreement/disagreement between different scoring systems. (a–c) comparison between DCC and visual scoring #1, (d–f) comparison between DCC and automated system #1, (g–i) comparison between visual scoring #1 and automated scoring #1. (a, d, g) entire cohort; (b, e, h) subset of the patients treated with tamoxifen only with no chemotherapy; (c, f, i) subset of the patients that received no systemic therapy

[Figure 4: six panels (a–f); x-axis: % positive nuclei (0–100). Panels a–c: hazard ratio; panels d–f: log P value. Panel titles: Machine with X-tile defined cut-off, for all patients, tamoxifen only, and no systemic therapy.]

Fig. 4 Variation of hazard ratio and log-rank P value among different cut-off points of percent of positive nuclei reported by the automated system

Intensity of staining was also analysed by the machine, and the H-score was calculated as (% of positive nuclei) × (staining intensity). The H-score ranged from 0 in completely negative cases to 7267.74. As found using the X-tile program, the best cut-off point for the H-score was 21.1. When the H-score was dichotomised as negative for 0–21.1 and positive for >21.1, the results of the log-rank test in the entire cohort were as follows: χ² = 60.81, P = 6.27 × 10⁻¹⁵. In the tamoxifen only treated subset of patients, the results were χ² = 28.09, P = 1.16 × 10⁻⁷, and in the subset with no systemic therapy χ² = 10.56, P = 0.001. The log-rank test results are shown separately for the training and validation sets at http://www.gpec.ubc.ca/index.php?content=papers/AUTO_ER.php.
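The H-score calculation and its dichotomisation can be sketched in a few lines of Python. The units of the machine's staining intensity are not given in the text, so the intensity values below are hypothetical, as are the function names; only the product formula and the 21.1 cut-off come from the results above.

```python
def h_score(percent_positive: float, intensity: float) -> float:
    """H-score as (% of positive nuclei) x (staining intensity)."""
    if percent_positive < 0 or intensity < 0:
        raise ValueError("inputs must be non-negative")
    return percent_positive * intensity


def h_score_positive(score: float, cutoff: float = 21.1) -> bool:
    """Dichotomise: negative for 0 to cutoff, positive above the cutoff."""
    return score > cutoff


if __name__ == "__main__":
    # Hypothetical (percent positive, intensity) pairs in arbitrary units.
    for pct, inten in [(0.0, 50.0), (10.0, 2.0), (10.0, 3.0)]:
        score = h_score(pct, inten)
        print(pct, inten, score, h_score_positive(score))
```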
Discussion

The first attempts to use automated image analysis for quantitation of IHC stained histological sections were undertaken approximately 20 years ago, before IHC replaced DCC as the most convenient means for ER assessment [5, 6, 29]. However, until recently image quality obtained with the video and digital cameras and the image analysis software did not permit reliable fully automated assessment of histological sections without continuous supervision by a pathologist [29]. Recent advances in image analysis techniques have allowed automatic selection of the objects within the histological images based on their morphometric parameters, e.g. size, shape, range of staining intensities, and fully automated quantitation of histological sections is now technically feasible [5, 30, 31].

In theory, computer-assisted image analysis should provide more accurate results for IHC quantitation than semiquantitative scoring by a pathologist. A specially calibrated device can measure the intensity of staining much more precisely than a human eye [7, 13]. Using optical density corrected by the intensity of the background instead of direct measurement of the intensity of the IHC staining gives even more accurate assessment of the staining and allows comparisons between slides stained in different batches or in different laboratories. Counting of hundreds and thousands of positive and negative cells within a single image is a tedious procedure, and the degree of human mistake in counting is a function of the number of the objects [8], i.e. the more cells being analysed, the less precise the results. The use of immunofluorescent staining with its capabilities for subcellular localisation of proteins and co-localisation of specific protein expression with tissue-specific markers has potential for significant improvement in assessment of biomarker expression [32–34].
In practice, however, the results of automated image analysis are influenced by a number of factors other than technical issues. Image analysis systems cannot differentiate between benign and malignant lesions with a precision comparable to the expertise level of an experienced pathologist [5, 6] and most image analysis applications require pathologist input to identify the area to be analyzed. TMAs provide almost ideal material for automated unsupervised image analysis because of the careful selection of the areas containing lesions that a researcher is specifically interested in, the concurrent application of identical staining conditions to all cores on a single slide, and the small size of the tissue cores, each of which can be represented by a single image. Inaccuracies encountered during the TMA building, i.e. taking cores from the areas that do not contain tumour tissue, constitute a relatively small percent of all TMA cores and can be easily excluded from the analysis.

We performed our study on a TMA based cohort of 3,484 cases of invasive breast carcinoma. We studied ER expression in tumours, with visual semiquantitative assessment of the staining, and quantitation with a commercial fully automated image analysis system. The results of the IHC analysis by either pathologist or machine demonstrated similar accuracy in assessment of prognostic significance of ER expression in Kaplan–Meier analysis. Comparison of log-rank tests with 10,000 permutations detected no significant differences among any of these methods (P > 0.05) when the scores were dichotomised as negative and positive.

The concordance between scores was excellent. The best agreement was achieved between two visual scorers (kappa = 0.918 (95% CI: 0.903–0.932)) and between two machines (kappa = 0.913 (95% CI: 0.897–0.928)).
The worst concordance in this study was observed between the DCC method and one of the human scorers, with kappa = 0.697 (95%CI: 0.669–0.722), which is still considered to be good agreement.
The optimal cut-off point found in our study for machine scoring (0.4% of positive tumour nuclei) is highly consistent with the findings of Harvey et al. [35], who suggested an H-score of 2 as an optimal cut-off point (weak positivity in less than 1% of positive nuclei). These two studies support the idea that virtually any ER positivity is associated with good prognosis in tamoxifen-treated breast cancer. Utilisation of an H-score did not provide any additional information in our study and in fact was inferior to the simple percentage of positive nuclei, according to the results of the log-rank test in Kaplan–Meier analysis of the treatment subsets.
There are a number of limitations in our study. First, the machine scoring was performed in fully automated mode on TMAs, with no specific selection of tumour tissue within the tissue cores, and therefore may not translate directly to conventional sections. For example, some benign ducts and lobules were likely scored along with the tumours, which could have artificially decreased the prognostic significance of automated scoring. If image analysis is applied to whole sections instead of TMAs, an operator can specify an area of tumour tissue that should be scanned and analysed; this would increase the specificity of the analysis (although in TMAs the initial core is derived from a pathologist-selected area). Second, two expert pathologists scored all cases within a week and, although the scoring was done independently, they discussed the scoring system prior to scoring, which may explain the relatively high kappa statistic between the two visual scores (0.918 (95%CI: 0.903–0.932)) as compared to the usual agreement reported for the scoring of nuclear staining (0.6–0.8) [36–39].
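The idea of an outcome-based optimal cut-off discussed above can be sketched as a scan over candidate thresholds, keeping the one that maximises the two-group log-rank statistic. This is an illustrative assumption about how such a search can be done, not a description of the procedure used in this study; the function names and the eight toy cases are hypothetical. Choosing the cut-off that maximises a test statistic also inflates apparent significance, so a real analysis would need a correction or a validation set.

```python
def logrank_chi2(times, events, positive):
    """Two-group log-rank chi-square (1 degree of freedom).

    times:    follow-up time for each case
    events:   1 if the event (e.g. disease-specific death) occurred, 0 if censored
    positive: True if the case falls in the marker-positive group
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for x in times if x >= t)  # all cases at risk at t
        n1 = sum(1 for x, g in zip(times, positive) if x >= t and g)
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        d1 = sum(
            1 for x, e, g in zip(times, events, positive) if x == t and e and g
        )
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def best_cutoff(scores, times, events, candidates):
    """Return (cutoff, chi2) for the candidate with the largest log-rank chi2."""
    best = None
    for cut in candidates:
        positive = [s > cut for s in scores]
        if all(positive) or not any(positive):
            continue  # a cut-off that leaves one group empty is not informative
        chi2 = logrank_chi2(times, events, positive)
        if best is None or chi2 > best[1]:
            best = (cut, chi2)
    return best


# Hypothetical toy cohort: percent positive nuclei, follow-up time, event flag.
scores = [0.0, 0.1, 0.2, 0.5, 2.0, 10.0, 50.0, 90.0]
times = [1, 2, 3, 8, 9, 10, 11, 12]
events = [1, 1, 1, 0, 0, 0, 0, 0]
cutoff, chi2 = best_cutoff(scores, times, events, candidates=[0.4, 1.0, 10.0])
```

In this toy cohort all events occur in cases scoring below 0.4%, so the scan selects 0.4 over 1.0 and 10.0; the numbers are constructed for illustration and carry no clinical meaning.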
In conclusion, the present study shows that fully automated image analysis with a system trained by a human, but without continuous human supervision, can provide results that do not differ from the scoring of estrogen receptor immunohistochemistry by an expert pathologist or from dextran–charcoal radioimmunoassay. Further development of image analysis techniques will inevitably improve the accuracy of object detection and classification in histological sections, making such automated image analysis systems suitable for use in clinical histopathology laboratories.
Acknowledgements The study was supported in part by an unrestricted educational grant from Sanofi-Aventis, Canada and a Translational Acceleration grant from the Canadian Breast Cancer Research Alliance. TON and DGH are scholars of the Michael Smith Foundation for Health Research.
References
1. Duffy MJ (2006) Estrogen receptors: role in breast cancer. Crit Rev Clin Lab Sci 43:325–347
2. Cheang MC, Treaba DO, Speers CH et al (2006) Immunohistochemical detection using the new rabbit monoclonal antibody SP1 of estrogen receptor in breast cancer is superior to mouse monoclonal antibody 1D5 in predicting survival. J Clin Oncol 24:5637–5644
3. Bolger N, Heffron C, Regan I et al (2006) Implementation and evaluation of a new automated interactive image analysis system. Acta Cytol 50:483–491
4. Camp RL, Dolled-Filhart M, Rimm DL (2004) X-tile: a new bioinformatics tool for biomarker assessment and outcome-based cut-point optimization. Clin Cancer Res 10:7252–7259
5. Taylor CR, Levenson RM (2006) Quantification of immunohistochemistry-issues concerning methods, utility and semiquantitative assessment II. Histopathology 49:411–424
6. Walker RA (2006) Quantification of immunohistochemistry-issues concerning methods, utility and semiquantitative assessment I. Histopathology 49:406–410
7. BVD/FOGRA (1992) Manual for standardization of the offset printing process. Wiesbaden
8. Insight into images: principles and practices for segmentation, registration, and image analysis (2004). A.K. Peters Ltd., Wellesley, MA
9. Byrne A, Hilbert DR (2003) Color realism and color science. Behav Brain Sci 26:3–21; discussion 22–63
10. Gegenfurtner KR (2003) Cortical mechanisms of colour vision. Nat Rev Neurosci 4:563–572
11. McLelland D, Fuller LU (2005) Photoshop CS2 bible. Wiley Publishing Inc., Hoboken, NJ
12. Rinner O, Gegenfurtner KR (2000) Time course of chromatic adaptation for color appearance and discrimination. Vision Res 40:1813–1826
13. Wen C-H, Lee J-J (2000) Design and production of color calibration targets for digital input devices. In: Input/output and imaging technologies II:4080. Taipei, Taiwan, pp 148–158
14. Greene GL, Nolan C, Engler JP et al (1980) Monoclonal antibodies to human estrogen receptor. Proc Natl Acad Sci USA 77:5115–5119
15. King WJ, Greene GL (1984) Monoclonal antibodies localize oestrogen receptor in the nuclei of target cells. Nature 307:745–747
16. Underwood JC (1983) Oestrogen receptors in human breast cancer: review of histopathological correlations and critique of histochemical methods. Diagn Histopathol 6:1–22
17. Aziz DC, Barathur RB (1994) Quantitation and morphometric analysis of tumors by image analysis. J Cell Biochem Suppl 19:120–125
18. Esteban JM, Ahn C, Battifora H et al (1994) Quantitative immunohistochemical assay for hormonal receptors: technical aspects and biological significance. J Cell Biochem Suppl 19:138–145
19. Schultz DS, Katz RL, Patel S et al (1992) Comparison of visual and CAS-200 quantitation of immunocytochemical staining in breast carcinoma samples. Anal Quant Cytol Histol 14:35–40
20. Kononen J, Bubendorf L, Kallioniemi A et al (1998) Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med 4:844–847
21. Makretsov N, Gilks CB, Coldman AJ et al (2003) Tissue microarray analysis of neuroendocrine differentiation and its prognostic significance in breast cancer. Hum Pathol 34:1001–1008
22. Turbin DA, Cheang MC, Bajdik CD et al (2006) MDM2 protein expression is a negative prognostic marker in breast carcinoma. Mod Pathol 19:69–74
23. Liu CL, Prapong W, Natkunam Y et al (2002) Software tools for high-throughput analysis and archiving of immunohistochemistry staining data obtained with tissue microarrays. Am J Pathol 161:1557–1565
24. Liu CL, Montgomery KD, Natkunam Y et al (2005) TMACombiner, a simple software tool to permit analysis of replicate cores on tissue microarrays. Mod Pathol 18:1641–1648
25. Ng TL, Gown AM, Barry TS et al (2005) Nuclear beta-catenin in mesenchymal tumors. Mod Pathol 18:68–74
26. de las Mulas JM, van Niel M, Millan Y et al (2000) Immunohistochemical analysis of estrogen receptors in feline mammary gland benign and malignant lesions: comparison with biochemical assay. Domest Anim Endocrinol 18:111–125
27. Magne N, Toillon RA, Castadot P et al (2006) Different clinical impact of estradiol receptor determination according to the analytical method: A study on 1940 breast cancer patients over a period of 16 consecutive years. Breast Cancer Res Treat 95:179–184
28. Costa SD, Lange S, Klinga K et al (2002) Factors influencing the prognostic role of oestrogen and progesterone receptor levels in breast cancer – results of the analysis of 670 patients with 11 years of follow-up. Eur J Cancer 38:1329–1334
29. Franklin WA, Bibbo M, Doria MI et al (1987) Quantitation of estrogen receptor content and Ki-67 staining in breast carcinoma by the microTICAS image analysis system. Anal Quant Cytol Histol 9:279–286
30. Gil J, Wu HS (2003) Applications of image analysis to anatomic pathology: realities and promises. Cancer Invest 21:950–959
31. Rojo MG, Garcia GB, Mateos CP et al (2006) Critical comparison of 31 commercially available digital slide systems in pathology. Int J Surg Pathol 14:285–305
32. Cregger M, Berger AJ, Rimm DL (2006) Immunohistochemistry and quantitative analysis of protein expression. Arch Pathol Lab Med 130:1026–1030
33. McCabe A, Dolled-Filhart M, Camp RL et al (2005) Automated quantitative analysis (AQUA) of in situ protein expression, antibody concentration, and prognosis. J Natl Cancer Inst 97:1808–1815
34. Rimm DL (2006) What brown cannot do for you. Nat Biotechnol 24:914–916
35. Harvey JM, Clark GM, Osborne CK et al (1999) Estrogen receptor status by immunohistochemistry is superior to the ligand-binding assay for predicting response to adjuvant endocrine therapy in breast cancer. J Clin Oncol 17:1474–1481
36. Diaz LK, Sahin A, Sneige N (2004) Interobserver agreement for estrogen receptor immunohistochemical analysis in breast cancer: a comparison of manual and computer-assisted scoring methods. Ann Diagn Pathol 8:23–27
37. Hasegawa T, Yamamoto S, Matsuno Y (2002) Quantitative immunohistochemical evaluation of MIB-1 labeling index in adult soft-tissue sarcomas by computer-assisted image analysis. Pathol Int 52:433–437
38. Kirkegaard T, Edwards J, Tovey S et al (2006) Observer variation in immunohistochemical analysis of protein expression, time for a change? Histopathology 48:787–794
39. Lorinc E, Jakobsson B, Landberg G et al (2005) Ki67 and p53 immunohistochemistry reduces interobserver variation in assessment of Barrett’s oesophagus. Histopathology 46:642–648