Analytical vs. Diagnostic Performance: An Overview, with an Emphasis on Method Comparison and the Assessment of Bias

Bente Flatland, DVM, MS, DACVIM, DACVP
Associate Professor of Clinical Pathology
Department of Biomedical and Diagnostic Sciences
University of Tennessee, Knoxville, TN

This presentation is adapted from: Flatland B, Friedrichs KR, Klenner S. Differentiating between analytical and diagnostic performance evaluation with a focus on the method comparison study and identification of bias. Vet Clin Pathol 2014; in press. I wish to acknowledge my manuscript co-authors for their contributions to the material in these proceedings.

Abbreviations Used (listed alphabetically)

CLIA - Clinical and Laboratory Improvement Amendments
CV - coefficient of variation
EQA - external quality assessment
POCT - point-of-care testing
PT - proficiency testing
QC - quality control
STARD - Standards for Reporting of Diagnostic Accuracy
TE - total error

Introduction

Analytical and diagnostic performance of laboratory tests are separate concepts, although they are clearly related in that a test having poor analytical performance is likely to have poor diagnostic performance.1 Analytical performance refers to how well an instrument or method can measure the analyte of interest - in other words, are results both reliable (reproducible) and valid (accurate)? Diagnostic performance refers to how well a given test can discriminate between diseased and non-diseased individuals. Evaluation of diagnostic performance should follow evaluation of analytical performance.2

Evaluation of Analytical Error, with an Emphasis on Bias Assessment

Analytical performance evaluation assesses the magnitude of analytical error, including evaluation of imprecision (random error) and bias (inaccuracy). Bias refers to the difference between a value measured by the test (index) instrument and the "true" value of the analyte as measured by a comparative method.3 Bias is relative, and the magnitude of bias observed depends upon the comparative method or material chosen to represent the "true" analyte value.1 The true value of an analyte is a theoretical concept4; in clinical laboratory medicine, the "true" value is ideally represented by results from a thoroughly researched, highly stable laboratory method known as a definitive method.5 Definitive methods are complex, technically demanding, and may be expensive. They are typically used during instrument and method development and to assign values to certified reference materials and standards at the manufacturing level. In contrast, a reference method can be employed by qualified laboratories for routine clinical use and has thoroughly documented accuracy and precision and low susceptibility to interferents; practicality is not necessarily a prime concern for reference methods.5 A field method is intended to be practical and has adequate precision, accuracy, and specificity for its intended use.5 For routine clinical use, reference laboratories employ reference methods or field methods; point-of-care testing (POCT) sites use field methods.
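As a simple illustration of the bias concept defined above, the following sketch (Python, using hypothetical glucose values) estimates constant and percent bias from replicate measurements of a single sample analyzed by both an index and a comparative method:

    # Hypothetical example: estimate bias of an index method against a
    # comparative method from replicate measurements of one sample.
    index_results = [103, 105, 104, 106, 102]        # index method, mg/dL
    comparative_results = [99, 101, 100, 100, 100]   # comparative method, mg/dL

    mean_index = sum(index_results) / len(index_results)                     # 104.0
    mean_comparative = sum(comparative_results) / len(comparative_results)   # 100.0

    bias = mean_index - mean_comparative          # constant bias: +4.0 mg/dL
    bias_percent = 100 * bias / mean_comparative  # percent bias: +4.0%
    print(f"Bias: {bias:+.1f} mg/dL ({bias_percent:+.1f}%)")

In practice, bias should be characterized across the range of analyte concentrations of clinical interest, not at a single concentration.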
If definitive methods are not available or practical, what options do laboratories have for defining "true" analyte concentration? Comparative materials or methods that may be used include (a) the known values of certified reference materials, (b) the known values of assayed quality control materials, (c) peer group means from EQA/PT events, and (d) mean values from another reference or field method (method comparison study).1 Each of these approaches to bias assessment has pros and cons, and the choice of comparative material or method should be made keeping the goals of the instrument performance study (i.e., the intended uses of the bias data) in mind. Choice of comparative method may also be influenced by logistical considerations such as method or material availability, financial costs, and the analyte concentrations of interest.1

The major purpose of any method comparison study is to define the bias (inaccuracy) of an index instrument or method. Even if bias is minimal, agreement between methods may be less than optimal if either method (or both) is imprecise; a full assessment of method agreement evaluates both bias and imprecision.1

In theory, bias assessment should always be species-specific; in practice, whether this is feasible may be dictated by the complexities of the analyte and test systems in question, as well as by laboratory resources. Species-specific bias assessment presents logistical challenges, including the availability of suitable, stable, species-specific samples and the financial and time costs of investigating bias for multiple species.1 It may be less crucial for simple, low-molecular-weight analytes (e.g., electrolytes or glucose) than for more complex tests (e.g., immunoassays or hematology testing).6 Further, the necessity of species-specific bias assessment may depend on the intended use of the bias data (e.g., QC validation or instrument harmonization). Research documenting the impact of species-specific bias assessment (or lack thereof) on veterinary laboratory outcomes is needed.1

When two field methods (or a field and a reference method) are compared, some bias is expected. Comparison of two field methods is relevant if investigators wish to (a) show that a new method is a satisfactory substitute for an older method, (b) investigate whether a new method being installed at the testing site requires method-specific reference intervals and aid reference interval transfer, (c) determine the comparability of two methods (e.g., for performing serial patient evaluations), or (d) facilitate local harmonization of multiple instruments at the testing site via calibration to a consensus mean or other value.7 Method comparison studies should state the purpose of the study (i.e., the intended use of the bias data), present the type and magnitude of any bias identified, and discuss the clinical implications of the observed bias.1

The type of statistical analysis performed in method comparison studies is dictated by the type of variable(s) being assessed and differs for continuous numerical data and qualitative or semiquantitative data.8 Further, method comparison statistics assume that data are independent (i.e., no repeated measures from the same individuals). Statistical methods (including how regression and difference plots are constructed) and the equation used to calculate bias should be given.
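As one illustration of the difference-plot statistics mentioned above, the sketch below (Python, hypothetical paired results) computes the mean difference (an estimate of constant bias) and 95% limits of agreement in the style of a Bland-Altman analysis. It assumes one independent measurement pair per subject and is not a substitute for the full regression and difference-plot analysis a published study should report:

    import statistics

    # Hypothetical paired results: one measurement per patient sample on each method.
    method_x = [4.1, 5.0, 6.2, 7.8, 9.1, 10.4, 12.0, 13.5]
    method_y = [4.0, 5.2, 6.0, 8.1, 9.0, 10.9, 12.4, 13.9]

    differences = [x - y for x, y in zip(method_x, method_y)]
    mean_difference = statistics.mean(differences)   # estimate of constant bias
    sd_difference = statistics.stdev(differences)    # spread of the differences

    # 95% limits of agreement, as plotted on a Bland-Altman difference plot
    lower_loa = mean_difference - 1.96 * sd_difference
    upper_loa = mean_difference + 1.96 * sd_difference
    print(f"Mean difference (bias): {mean_difference:+.3f}")
    print(f"95% limits of agreement: {lower_loa:+.3f} to {upper_loa:+.3f}")

Note that narrow limits of agreement require both small bias and small imprecision, consistent with the point above that full assessment of method agreement evaluates both.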
When bias data are included in calculations of observed total error (TEobs), the absolute (unsigned) value of the bias is used.9

Analytical Quality Requirements in Analytical Performance Assessment

A full assessment of analytical performance also involves comparison to an analytical quality requirement. A quality requirement (a.k.a. analytical quality specification) is a predetermined benchmark against which the analytical performance of an instrument or method is judged. Analytical quality requirements are derived from various resources, including expert consensus, biological variation data, and regulatory requirements, and are classified into a hierarchy denoting how rigorous (i.e., how evidence-based and medically relevant) they are.10 Commonly used analytical quality requirements in veterinary medicine are allowable total error (TEa; consensus-based, level 3), maximum total error (TEmax; biological variation-based, level 2a), maximum imprecision (CVmax; biological variation-based, level 2a), maximum bias (Biasmax; biological variation-based, level 2a), and performance requirements established by CLIA (regulatory-based, level 4).1,9 Biological variation data must obviously be available for the species in question in order for biological variation-based quality requirements to be calculated.

Depending on the chosen quality requirement, assessment of analytical performance can focus on imprecision (e.g., observed CV compared to CVmax), bias (e.g., observed bias compared to Biasmax), or both (e.g., TEobs compared to TEa, TEmax, or a CLIA requirement). Any observed value less than the relevant quality requirement is considered acceptable.1,11 Observed values greater than the quality requirement indicate that analytical performance falls short of what is desired and should prompt troubleshooting to identify causes of the suboptimal performance. If steps can be taken that should improve analytical performance, the performance evaluation should be repeated and the results reassessed in light of the quality requirement.11
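To make this comparison concrete, the sketch below (Python, hypothetical values) computes TEobs using the commonly used linear model TEobs = |bias| + 2 x CV and compares it to a chosen TEa; the specific quality requirement and model should follow the laboratory's chosen guideline:

    # Hypothetical example: compare observed total error (TEobs) to an
    # allowable total error (TEa) quality requirement, all in percent units.
    observed_bias = 3.0   # from a method comparison study, %
    observed_cv = 2.5     # from a replication (imprecision) study, %
    tea = 10.0            # allowable total error chosen for the analyte, %

    te_obs = abs(observed_bias) + 2 * observed_cv   # common linear model
    if te_obs < tea:
        print(f"TEobs = {te_obs:.1f}% < TEa = {tea:.1f}%: performance acceptable")
    else:
        print(f"TEobs = {te_obs:.1f}% >= TEa = {tea:.1f}%: troubleshoot the method")

With the hypothetical values shown, TEobs = 8.0%, which is less than the 10.0% TEa, so analytical performance would be judged acceptable against that requirement.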
Diagnostic Performance Assessment

Diagnostic performance of any laboratory test should be assessed relative to disease diagnosis and is only needed and relevant for analyte changes of medical importance.1 Diagnostic performance is most easily assessed for individual laboratory tests for diseases that have a clear gold standard test that can be used for comparison. A gold standard (a.k.a. reference standard) is the best available test or method (or constellation of test results and clinical findings) for which a positive (abnormal) result definitively confirms the presence of disease (also referred to as the "target condition").12 The gold standard used to confirm the presence of disease and definitively classify study subjects as positive or negative should be independent of the test whose diagnostic performance is under evaluation (the index test).2,12 Diagnostic performance may also be of interest for individual tests that, while relatively nonspecific when interpreted individually, have value as part of a panel of tests, or have value when interpreted in light of a stringent clinical decision threshold (as opposed to a wider population-based reference interval).1

Studies of diagnostic performance should follow STARD criteria.12,13 Points to keep in mind when designing such studies are that (a) laboratories implementing well-established methods having known diagnostic performance may not need to make their own assessment of diagnostic performance [however, some assessment of analytical performance is still required], (b) diagnostic performance of a test depends upon the decision limits used to determine whether a test result is positive or negative (see the sketch below), (c) the definition of "disease" (the target condition) should be clinically relevant to patient quality of life [and potentially to logistical and/or financial considerations], and (d) inclusion and exclusion criteria for study subjects must be carefully considered and clearly stated.1
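As a minimal illustration of point (b) above, the sketch below (Python, hypothetical data) computes sensitivity and specificity of an index test against an independent gold standard at a single decision limit:

    # Hypothetical example: sensitivity and specificity of an index test
    # against an independent gold standard, at a single decision limit.
    # Each tuple: (index test result, disease present per gold standard).
    subjects = [(8.2, True), (7.9, True), (6.5, True), (7.4, True),
                (5.1, False), (4.8, False), (6.8, False), (4.2, False)]

    decision_limit = 6.0  # index test is "positive" if result > limit

    tp = sum(1 for r, d in subjects if r > decision_limit and d)       # true positives
    fn = sum(1 for r, d in subjects if r <= decision_limit and d)      # false negatives
    tn = sum(1 for r, d in subjects if r <= decision_limit and not d)  # true negatives
    fp = sum(1 for r, d in subjects if r > decision_limit and not d)   # false positives

    sensitivity = tp / (tp + fn)  # proportion of diseased subjects testing positive
    specificity = tn / (tn + fp)  # proportion of non-diseased subjects testing negative
    print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")

With the hypothetical values shown, raising the decision limit from 6.0 to 7.0 would trade sensitivity (1.00 to 0.75) for specificity (0.75 to 1.00), which is why the decision limits used must always be reported.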
Summary

Before a new method is introduced for routine use, assessment of analytical (and sometimes diagnostic) performance should be carried out. The main purpose of method comparison in assessing analytical performance is to define bias (inaccuracy) relative to a comparative method or material chosen to represent "true" results. Authors of method comparison studies should clearly define their purpose and quality requirements in advance of data collection, realizing that the choice of comparative method or material depends, in part, upon the intended use of the resulting bias data.1 Method comparison study data analysis and interpretation should document not only the type and magnitude of bias but also discuss its clinical implications.1 Only after analytical performance has been assessed and appropriate reference intervals and/or clinical decision thresholds have been established can evaluation of diagnostic performance proceed. Diagnostic performance assessment is easiest when a clear gold standard test for disease diagnosis is available. Authors of diagnostic performance studies should follow STARD criteria and clearly define disease stage and the study population prior to data collection.1

References

1. Flatland B, Friedrichs KR, Klenner S. Differentiating between analytical and diagnostic performance evaluation with a focus on the method comparison study and identification of bias. Vet Clin Pathol 2014; in press.
2. Jensen AL, Kjelgaard-Hansen M. Diagnostic test validation. In: Weiss DJ, Wardrop KJ, eds. Schalm's Veterinary Hematology. 6th ed. Ames, IA: Wiley-Blackwell; 2010:1027-1033.
3. Stockham SL, Scott MA. Introductory concepts. In: Stockham SL, Scott MA. Fundamentals of Veterinary Clinical Pathology. 2nd ed. Ames, IA: Blackwell Publishing; 2008:3-51.
4. Armbruster D. Accuracy controls: assessing trueness (bias). Clin Lab Med 2013;33:125-137.
5. Tietz NW. A model for a comprehensive measurement system in clinical chemistry. Clin Chem 1979;25:833-839.
6. Armbruster D, Miller RR. The Joint Committee for Traceability in Laboratory Medicine (JCTLM): a global approach to promote the standardization of clinical laboratory test results. Clin Biochem Rev 2007;28:105-113.
7. Miller WG, Myers GL, Gantzer ML, et al. Roadmap for harmonization of clinical laboratory measurement procedures. Clin Chem 2011;57:1108-1117.
8. Jensen AL, Kjelgaard-Hansen M. Method comparison in the clinical laboratory. Vet Clin Pathol 2006;35:276-286.
9. Harr KE, Flatland B, Nabity M, Freeman KP. ASVCP guidelines: allowable total error guidelines for biochemistry. Vet Clin Pathol 2013;42:424-436.
10. Kenny D, Fraser CG, Hyltoft Petersen P, Kallner A. Consensus agreement. Scand J Clin Lab Invest 1999;59:585.
11. Lester S, Harr KE, Rishniw M, Pion PD. Current quality assurance concepts and considerations for quality control of in-clinic biochemistry testing. J Am Vet Med Assoc 2013;242:182-192.
12. Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem 2003;49:7-18.
13. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Vet Clin Pathol 2007;36:8-12.