PATHOLOGICA 2005;97:1-6 ARTICOLO ORIGINALE Quality control for histological grading in breast cancer: an italian experience Controllo di qualità sulla valutazione del grading istologico nel carcinoma mammario: un’esperienza italiana ITALIAN NETWORK FOR QUALITY ASSURANCE OF TUMOUR BIOMARKERS (INQAT) GROUP* Key words Quality control • Histological grading • Breast cancer • Intraobserver reproducibility • Interobserver reproducibility Summary Parole chiave Controllo di qualità • Grading istologico • Carcinoma mammario • Riproducibilità entro-osservatore • Riproducibilità tra osservatori Riassunto Although the diagnostic criteria for histological grading in breast cancer have been improved by the introduction of the Nottingham/Tenovus classification, good results in terms of reproducibility are hard to obtain. The objective of this study was to validate a methodological protocol for external quality control. We assessed the intra- and interlaboratory reproducibility for grading score and its components. Our findings revealed a less than optimal level of agreement for both intra-observer and interobserver reproducibility. Each grading class contributed in a different way to the reproducibility and class G2 provided the poorest contribution to the observed agreement. Nuclear pleomorphism appeared to be the least reproducible, followed by mitotic count and tubule formation. In conclusion, our results suggest the need for quality control programs that should be conceived as a dynamic feedback process to better understand the reasons for some unsatisfactory performances and to implement the necessary corrective actions. Nonostante i criteri diagnostici per la valutazione del grading istologico nel carcinoma mammario siano stati meglio definiti dal metodo Nottingham/Tenovus, risultati soddisfacenti in termini di riproducibilità sono ancora difficili da ottenere. Scopo di questo studio è stato la validazione di un protocollo metodologico da utilizzarsi in programmi di valutazione esterna di qualità. Abbiamo quantificato la riproducibilità nella lettura del grading e delle sue componenti evidenziando un livello di accordo entro e tra laboratori non del tutto soddisfacente. Per quanto riguarda le categorie di grading, la classe G2 ha fornito il contributo minore all’accordo osservato. Fra le variabili componenti, la meno riproducibile è risultata il polimorfismo nucleare seguita da conta mitotica e formazione di tubuli. In conclusione, i nostri risultati suggeriscono la necessità di sviluppare programmi di controllo qualità come processi dinamici che possano aiutare a comprendere le ragioni di alcune prestazioni insoddisfacenti ed implementare le necessarie azioni correttive. Introduction clinicians as prognostic information on which to base therapeutic decisions 11-15. It is therefore important that a good level of intra- and interobserver reproducibility be sought in the determination of histological grade. Quality control programs can be considered a valid tool to reach this goal 16. We initiated an external quality control program in Italy, focused on the performance of a group of pathologists, to validate the development of an operative and methodological procedure to be possibly adopted by institutions and organizations engaged in external quality control programs. A quality control program requires a good level of organization that can be achieved by the assignment of responsibilities for the planning and management of ea- One of the fundamental concepts of histopathology is that the morphological appearance of tumors may be correlated with the degree of malignancy. In particular for breast cancer, morphological assessment of the degree of differentiation, evaluated with the histological grading method, has been shown in several studies to provide useful prognostic information 1-8. Even though the introduction of the Nottingham/Tenovus classification has improved and better defined the criteria to assess the histological grading of breast carcinoma [9], good results in terms of reproducibility and reliability are hard to obtain 10. Nevertheless, histological grading of breast carcinoma is commonly used by Corrispondenza Supported by PS “Oncologia” #CU03.00386 CNR-MIUR-2003 and PF Ministry of Health, Italian Government. * See Appendix for INQAT Group coworkers and affiliations Dott. Paolo Verderio, Unità Operativa di Statistica Medica e Biometria, Istituto Nazionale per lo Studio e la Cura dei Tumori, via Venezian 1, 20133 Milano, Italy - Tel. +39 02 23903201 - Fax +39 02 700503314 - E-mail: paolo.verderio@istitutotumori.mi.it INQAT GROUP 2 ch of its steps. To accomplish this, our program was conceived as an interactive relationship between two main types of expertise, the first represented by an experienced pathologist responsible for all the bioclinical aspects related to histological grading in breast cancer and the second by a group of biostatisticians responsible for study design and statistical analysis. Once the general purposes of the project had been established, we decided to evaluate the reproducibility of the analytical determination of histological grade. In addition, we investigated the influence of the three component variables (nuclear pleomorphism, tubule formation and mitotic count) on the assessment of histological grade. The present paper refers to the Quality Control Program performed in Italy in 2003. Study features We documented and formalized the purposes and methods of the quality control program in a manual that was approved by all participants before the start of the study. This document included the rationale of the study, a standard procedure for evaluation of the slides, and detailed instructions for each step of the study. Careful and functional management of the practical aspects is as important as appropriate study planning. For this reason we designed the quality control program as a dynamic feedback process built on the continuous interaction between participants and investigators. The data produced by each participant were returned to the Biostatistics Unit where they were entered into the database and periodically checked for completeness and conformity to the measurement scales. After the database had been closed, the results of the statistical analyses were reported to the participants. Such pure reporting of the data, together with a lab-specific form including the results provided by each pathologist, allowed the participants to compare their own activity with that of the others and possibly improve their performance. Thirteen pathologists, listed in the appendix, participated in the project; they pertained to the three major types of institutions where histological grading is assessed (research institutions, university departments and hospitals). Each participant provided 20 slides from their own practice, which were pooled; from this pool we selected 20 master slides and 10 reserve slides to be used in case of damage. Selection of the slides was based on stratified randomization so that the number of slides of each grading class, G1, G2 and G3, reflected the proportion of each class found in routine practice. Since, as known, the diagnostic reproducibility is heavily affected by the quality standards of slides, the selected specimens were subjected to the attentive approval of three “assessors” among the participating patho- logists, who were unanimously put forward for this task by the participants. Only those slides judged as well manufactured were added to the pool to be rounded among the participating pathologists. The slides were scored according to the Nottingham/Tenovus procedure, which is based on the morphological features of the tumor and is therefore not free from a certain degree of subjectivity. We bypassed this problem by defining a reference value for each slide based on the consensus of the three “assessors”. As a consequence, the performance assessment in this study refers to the remaining 10 pathologists. We organized two rounds of distribution of the 20 selected slides so that participants could examine the same slides on two independent occasions with an average interval of five months between the two. A new code was attributed to the slide at the second round to make the determination blind. Statistical methods The following aspects of reproducibility were investigated: (i) intra-observer reproducibility; (ii) reproducibility between each observer and the reference standard; (iii) interobserver reproducibility; (iv) contribution of each category to the observed reproducibility. As the classification criteria adopted in slide assessment for grading and its component variables involve a categorical-ordinal scale, the reproducibility related to (i), (ii) and (iii) was evaluated by computation of the weighted kappa statistic (Kw) 17. This makes it possible to adjust the observed agreement for chance by making allowance for the relative seriousness of disagreement (i.e. the distance between the categories). Kw values lie between zero (absence of agreement) and 1 (absolute agreement). As previously 18, the observed values of Kw were considered satisfactory if equal to or greater than 0.80. As regards (iv), kappa category-specific statistics (Kcs) and their weighted averages (Cohen’s kappa statistic, Kc) were estimated by jointly considering all participaTab. I. Classification criteria of kappa category-specific statistics (Kcs). Kappa value (range) Judgment < 0.00 0.00-0.20 0.21-0.40 0.41-0.60 0.61-0.80 0.81-1.00 Disagreement Slight agreement Fair agreement Moderate agreement Substantial agreement Almost perfect agreement QUALITY CONTROL FOR HISTOLOGICAL GRADING IN BREAST CANCER 3 Fig. 1. Intra-observer reproducibility for Grading. Fig. 2. Reproducibility between each observer and the reference standard for Grading. ting observers for each determination 17 19. Each Kcs value was interpreted in a qualitative manner on the basis of the Landis and Koch classification criteria 20 (Tab. I). Results GRADING (i) Intra-observer reproducibility A median Kw value of 0.60 (range: 0.38-0.96) was found for the intra-observer reproducibility. Figure 1 illustrates the concordance pattern observed. In this graph all the Kw values are plotted as blank diamonds; of the two horizontal lines the upper corresponds to the satisfactory Kw threshold (Kw ≥ 0.80) and the lower to the median Kw value. Only three observers reached a satisfactory Kw and only one was near satisfactory (Kw = 0.73). Concordance analysis showed an unsatisfactory level of reproducibility with a median Kw of 0.43 (range: 0.22-0.64) for the first determination and a median Kw of 0.46 (range: 0.24-0.72) for the second determination. Between the first and second determinations increased Kw values were observed for seven laboratories whereas for two laboratories the values decreased. (ii) Comparison with the reference values The results of the comparison with the reference values are represented in Figure 2 for the first determination (panel A) and the second determination (panel B). In this figure all the Kw values are plotted as black diamonds; also here the two horizontal lines refer to the satisfactory threshold and the median value. (iii) Interobserver reproducibility Table II reports the Kw values related to all possible pairwise comparisons (first determination upper half, second determination lower half). A rather unsatisfactory level of interobserver reproducibility was found for most observers with a median Kw of 0.44 (range: Tab. II. Kw values of all pairwise comparisons: first determination (upper part) and second determination (lower part) – GRADING. 2nd determination Observer 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 1.00 0.44 0.41 0.11 0.50 0.48 0.32 0.31 0.32 0.70 0.21 1.00 0.63 0.54 0.55 0.41 0.70 0.57 0.53 0.34 0.38 0.22 1.00 0.39 0.75 0.48 0.77 0.75 0.48 0.34 0.38 0.35 0.40 1.00 0.24 0.34 0.63 0.46 0.60 0.44 0.26 0.38 0.59 0.36 1.00 0.38 0.36 0.41 0.27 0.42 0.26 0.77 0.58 0.67 0.30 1.00 0.48 0.48 0.34 0.48 0.80 0.60 0.62 0.79 0.50 0.53 1.00 0.73 0.64 0.33 8 0.45 0.25 0.79 0.70 0.54 0.41 0.22 1.00 0.42 0.32 9 0.36 0.42 0.43 0.46 0.39 0.44 0.44 0.37 1.00 0.46 10 0.11 0.67 0.52 0.59 0.33 0.55 0.63 0.69 0.44 1.00 INQAT GROUP 4 Tab. III. Contribution of each class to the unweighted agreement – GRADING. Class st nd 1 determination 2 determination 0.49 0.19 0.18 0.29 0.43 0.14 0.26 0.27 G1 G2 G3 All 0.11-0.80) for the first determination and a median Kw of 0.46 (range: 0.11-0.77) for the second. (iv) Contribution of each category to the overall unweighted agreement As can be seen in Table III, for both determinations the most important contribution to the overall unweighted agreement (corresponding to a Kc of 0.29 and 0.27, respectively) is due to class G1, which shows moderate agreement with a Kcs of 0.49 (first determination) and 0.43 (second determination). By contrast, the G2 class appears to provide the poorest contribution to the observed overall agreement. COMPONENT VARIABLES As reported in Table IV, when the role played by the three components of grading (tubule formation, nuclear pleomorphism and mitotic count) was investigated, a kind of hierarchy in reproducibility emerged with tubule formation in top position, mitotic count in the middle and nuclear pleomorphism at the bottom. This hierarchy was found for all three investigated aspects: intra-observer agreement, comparison with the reference values (Tab. IV) and interobserver agreement (data not shown). Finally, as reported in Table V, for each component the Score 1 class showed the most important contribution to the overall reproducibility, followed by Score 3. As expected, the most problematic class was Score 2. Discussion Morphological assessment of the degree of differentiation evaluated with the histological grading method has been shown to provide useful prognostic information in breast cancer 1-8, even when the diagnosis was made by different pathologists 11 21 22 and based on different criteria without well-defined guidelines 12. Following the introduction of standardized grading criteria, partially based on semiquantitative evaluation, there has been a substantial improvement in the interobserver reproducibility related to the determination of this histological parameter 9. Histological grading is used in association with the conventional clinicopathological parameters (TNM classification) to define prognostic indexes that are widely accepted and utilized to identify different prognostic groups of breast cancer patients 11-15. However, the lack of good reproducibility among pathologists in the grading of breast cancer excludes any interchangeability of determinations made by different laboratories, consequently limiting the reliability and clinical usefulness of a parameter thought to be important for the management of breast cancer. Quality control programs can be a valid tool to monitor the performance of pathologists and improve the reproducibility between laboratories. Our program was developed as a simple and useful proposal for institutions engaged in quality assessment of tumor biomarkers. A good quality control program for any biological marker should cover two main aspects: 1) it should monitor the actual practice of determinations throughout the country; 2) it should provide suggestions to help understand the reasons for the possible unsatisfactory performance of some laboratories and assistance in planning corrections. Whereas it is easy to find papers dealing with the first aspect, the second has been scarcely addressed. Our program was designed as an attempt to fill this gap. Tab. IV. Kw distribution for intra-observer reproducibility and comparison of each observer versus reference values – GRADING COMPONENTS. Intra-observer vs reference values 2nd det. 1st det. Tubule formation 0.64 0.78 1.00 0.47 0.74 0.90 0.48 0.73 0.90 Minimum Median Maximum Nuclear pleomorphism 0.38 0.58 0.96 0.27 0.45 0.69 0.19 0.49 0.63 Minimum Median Maximum 0.31 0.72 0.97 0.11 0.56 0.83 0.42 0.57 0.86 Minimum Median Maximum Mitotic count det. = determination QUALITY CONTROL FOR HISTOLOGICAL GRADING IN BREAST CANCER 5 Tab. V. Contribution of each class to the unweighted agreement – GRADING COMPONENTS. Tubule formation Score 1 Score 2 Score 3 All Nuclear pleomorphism Mitotic count 1st det. 2nd det. 1st det. 2nd det. 1st det. 0.69 0.42 0.52 0.51 0.65 0.42 0.55 0.51 0.28 0.24 0.35 0.29 0.36 0.17 0.25 0.23 0.37 0.03 0.27 0.24 2nd det. 0.52 0.09 0.22 0.31 det. = determination As far as we know, our external quality control program is the first to be undertaken on a nationwide scale in Italy. Our findings, which appear to be in good agreement with those of some groups 19 but not others 6 23 24, show a less than optimal level of agreement for both intra-observer (Figs. 1, 2) and interobserver reproducibility (Tab. II). However, it should be noted that the studies whose results are not in agreement with ours differ from ours in study planning, study design and number of pathologists involved. Although it has not yet been defined which level of reproducibility ought to be considered clinically “acceptable”, we feel that our interobserver results do not appear to be satisfactory and further efforts are needed to better understand the reason for the unsatisfactory performance of some laboratories and to implement the necessary corrective actions. In the light of our results, the steps we propose to be taken are the following: 1. very precise standardized protocols should be adopted; 2. workshops should be organized in which the problems of the individual laboratories are compared and discussed; 3. training sessions should be held to discuss and hopefully solve these problems and bring the performance into uniformity so that the results of different laboratories will become interchangeable; 4. quality control programs should be consistently implemented. References 9 1 2 3 4 5 6 7 8 Patel C, Sidhu KP, Shah MJ, Patel SM. Role of mitotic counts in the grading and prognosis of the breast cancer. Indian J Pathol Microbiol 2002;45:247-54. Lynch J, Pattekar R, Barnes DM, Hanby AM, Camplejohn RS, Ryder K, et al. Mitotic counts provide additional prognostic information in mammary carcinoma. J Pathol 2002;196:275-279 Yang Q, Mori I, Sakurai T, Yoshimura G, Suzuma T, Nakamura Y, et al. Correlation between nuclear grade and biological prognostic variables in invasive breast cancer. Breast Cancer 2001;8:105-10. Dalton LW, Pinder SE, Elston CE, Ellis IO, Page DL, Dupont WD, et al. Histologic grading of breast cancer: linkage of patient outcome with level of pathologist agreement. Mod Pathol 2000;13:730-5. Genestie C, Zafrani B, Asselain B, Fourquet A, Rozan S, Validire P, et al. Comparison of the prognostic value of Scarff-BloomRichardson and Nottingham histological grades in a series of 825 cases of breast cancer: major importance of the mitotic count as a component of both grading systems. Anticancer Res 1998;18:571-6. Robbins P, Pinder S, de Klerk N, Dawkins H, Harvey J, Sterrett G, et al. Histological grading of breast carcinomas: a study of interobserver agreement. Hum Pathol 1995;26:873-9. Volpi A, Bacci F, Paradiso A, Saragoni L, Scarpi E, Ricci M, et al. Prognostic relevance of histological grade and its components in node-negative breast cancer patients. Mod Pathol 2004;17:1038-44. Medri L, Volpi A, Nanni O, Vecci AM, Mangia A, Schittulli F, et al. Prognostic relevance of mitotic activity in patients with nodenegative breast cancer. Mod Pathol 2003;16:1067-75. 10 11 12 13 14 15 16 17 18 Elston CW, Ellis IO. Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology 1991;19:403-10. Harvey JM, de Klerk NH, Sterrett GF. Histological grading in breast cancer: interobserver agreement, and relation to other prognostic factors including ploidy. Pathology 1992;24:63-8. Brown JM, Benson EA, Jones M. Confirmation of a long-term prognostic index in breast cancer. Breast 1993;2:144-7. Henson DE, Ries L, Freedman LS, Carriaga M. Relationship among outcome, stage of disease and histologic grade for 22,616 cases of breast cancer. The basis for a prognostic index. Cancer 1991;68:2142-9. Galea MH, Blamey RW, Elston CE, Ellis IO. The Nottingham prognostic index in primary breast cancer. Breast Cancer Res Treat 1992;22:207-19. Goldhirsch A, Wood WC, Gelber RD, Coates AS, Thurlimann B, Senn HJ. Meeting highlights: updated international expert consensus on the primary therapy of early breast cancer. J Clin Oncol 2003;21:3357-65. Schnitt SJ. Traditional and newer pathologic factors. J Natl Cancer Inst Monogr 2001;30:22-6. Paradiso A, Volpe S, Iacobacci A, Marubini E, Verderio P, et al. Quality control for biomarker determination in oncology: the experience of the Italian Network for Quality Assessment of Tumor Biomarkers (INQAT). Int J Biol Markers 2002;17:201-14. Fleiss JL. Statistical methods for rates and proportions. 2nd Ed. New York: Wiley and Sons 1981. Italian Network for Quality Assurance of Tumor Biomarkers (INQAT) Group. Interobserver reproducibility of immunohistochemical HER-2/neu evaluation in human breast cancer: the real- INQAT GROUP 6 19 20 21 world experience. Int J Biol Markers 2004;19:147-54. Holman CD. Analysis of interobserver variation on a programmable calculator. Am J Epidemiol 1984;120:154-61. Landis R, Koch G. The measurement of observer agreement for categorical data. Biometrics 1977;33:117-27. Contesso G, Mouriesse H, Friedman S, Genin J, Sarrazin D, Rouesse J. The importance of histologic grade in long-term prognosis of breast cancer: a study of 1010 patients, uniformly treated at the Institut Gustave-Roussy. J Clin Oncol 1987;5:1378-86. 22 23 24 Davis BW, Gelber RD, Goldhirsch A, Hartmann WH, Locher GW, Reed R, et al. Prognostic significance of tumor grade in clinical trials of adjuvant therapy for breast cancer with axillary lymph node metastasis. Cancer 1986;58:2662-70. Cutler SJ, Black MM, Friedell GH, Vidone RA, Goldenberg IS. Prognostic factors in cancer of female breast. II. Reproducibility of histopathologic classification. Cancer 1966;19:75-82. Theissig F, Kunze D, Haroske G, Meyer W. Histological grading of breast cancer: Interobserver reproducibility and prognostic significance. Path Res Pract 1990;186:732-6. Appendix * INQAT Group Co-workers and affiliations Writing Committee (in alphabetical order) M.E. Cortese (Biostatistics Unit) – Istituto Nazionale per lo Studio e la Cura dei Tumori, Milano E. Marubini (Biostatistics Unit) – Università di Milano, Milano A. Paradiso (Project Coordinating Unit) – Istituto Oncologico, Bari J. Rosai (Reference Unit) – Istituto Nazionale per lo Studio e la Cura dei Tumori, Milano L. Saragoni (Reference Unit) – Ospedale “Morgagni-Pierantoni”, Forlì P. Verderio (Biostatistics Unit) – Istituto Nazionale per lo Studio e la Cura dei Tumori, Milano Pathologists participating in the Grading Quality Control Program (in alphabetical order) M. Aldi – Ospedale per gli Infermi, Faenza L. Andreini – Imola AUSL, Imola S. Bianchi – Università di Firenze, Firenze P. Cossu Rocca – Università di Sassari, Sassari G. De Rosa – Ospedale Mauriziano, Candiolo C. Giardina – Università di Bari, Bari G. Lanzanova – Ospedale Civile, Ravenna F. Marandino – Istituto Regina Elena per lo Studio e la Cura dei Tumori, Roma F. Marzullo – Istituto Oncologico, Bari F. Nuzzo – Presidio Ospedaliero, Cesena N. Ravarino – Ospedale Mauriziano, Turin M. Ricci – Ospedale degli Infermi, Rimini A. Zito – Istituto Oncologico, Bari