Method validation With Confidence

• Performance specifications
• Experimental protocols
• Statistical interpretation
• EXCEL® Files

Dietmar Stöckl
STT Consulting
Dietmar Stöckl, PhD
Abraham Hansstraat 11
B-9667 Horebeke, Belgium
e-mail: dietmar@stt-consulting.com
Tel + FAX: +32/5549 8671
Copyright: STT Consulting 2007

Content

Introduction
Materials
Validation protocols
• Imprecision
• Limit of detection (LoD)
• Working range
• Linearity model 1
• Linearity model 2, accuracy protocol (= accuracy of calibration curve)
• Recovery model 1 (paired sample protocol: spike and control)
• Recovery model 2 (accuracy protocol: sample with target value)
• Interference
• Method comparison
Annex
• Summary of protocols, statistics & graphics
• System stability, Ruggedness and multifactor protocols
• Glossary of terms

Introduction

WHAT is validation?
Validation is the confirmation, through the provision of objective evidence, that requirements for a specific intended use or application have been fulfilled (ISO 9000).
We see, from this definition, that we have to
• specify the intended use of a method,
• define performance requirements,
• provide data from validation experiments (objective evidence), and
• interpret the validation data (confirmation that requirements have been fulfilled).

WHICH type of performance requirements (specifications) exist?
Performance requirements can be statistical, analytical, or application-driven/regulatory. Statistical and analytical specifications are most useful for method evaluation. Application-driven/regulatory specifications are used for validation. Some examples are given in the table below.

Performance requirements (specifications)
Statistical           t-test: P ≥ 0.05; F-test: P ≥ 0.05
Analytical            Bias ≤ calibration tolerance; CV ≤ stable CV
Application-driven#   Bias ≤ 3%; CV ≤ 3%
#Cholesterol (National Cholesterol Education Program)

WHICH performance characteristics exist?
We have seen that we have to specify performance requirements for a validation. These requirements refer to the following performance characteristics of an analytical method:
• Imprecision
• Limit of detection
• Working range
• Linearity
• Recovery
• Interference/Specificity
• Total error (method comparison)
• [Robustness/Ruggedness]: will not be addressed in this book.

WHICH experiments do we have to perform?
The experiments we have to perform depend on the performance characteristic we want to validate. For the estimation of method imprecision, for example, we need to perform repeated measurements with a stable sample. However, the various application fields of analytical methods do not agree on the design of such experiments. In this book, we will mainly refer to the experimental protocols from the Clinical and Laboratory Standards Institute (CLSI). The table below gives an overview of typical experiments to be performed during a method validation study.

Imprecision
  Samples: IQC-samples; no target
  Measurements: n = 20 (repetition over several days)
LoD/LoQ
  Samples: Blank; Low sample
  Measurements: n = 20 (repetition over several days)
Linearity
  Samples: 5 related samples/-calibrators (mix); no target
  Measurements: n = 4 (repetition within day)
Working range
  See: Imprecision/Linearity
Interference
  Samples: Interferent spike & control (no target)
  Measurements: n = 4 (repetition within day)
Recovery (Accuracy/Trueness)
  Samples: Known analyte spike & control, or certified reference materials (CRM)
  Measurements: n = 4 - 5 (repetition over several days)
Total error (method comparison)
  Samples: 40 samples (target by reference method)
  Measurements: n = 1 or 2 (measurement in one or several days)
IQC: Internal Quality Control; LoD: limit of detection; LoQ: limit of quantitation

These experiments will be described in detail in the following chapters of the book.

HOW do we make decisions?
When we have created data, we have to decide whether they fulfill the requirements that have been selected for the application of the method "for a specific intended use". Currently, it is common practice to make decisions without considering confidence intervals or statistical significance testing. Modern interpretation of analytical data, however, requires the use of confidence intervals/statistical significance testing. These two approaches are compared in the table below for the case of a recovery experiment.

Decision-making approaches
"Old"
  Experimental recovery: 90%
  Limit: 85 – 115%
  Decision: pass
"Modern"
  Experimental recovery: 90%
  Confidence interval: ±11% (with n = 4 and CV = 7%)
  Limit: 85 – 115%
  Decision: fail (90 – 11 = 79%, which is below the lower limit of 85%)
  Action: increase n or reduce CV

In the "old" approach, we compare one "naked" number with the specification. This approach ignores the number of measurements that have been performed and the imprecision of the method. If we repeated the validation, we could easily obtain a recovery estimate of 80%, for example. Therefore, decision-making should be statistics-based. This is done by applying a formal statistical test or by interpreting the confidence interval of an experimental estimate.

Statistics-based decision – Importance of the "test-value" (= requirement, specification)
When we make statistics-based decisions, the selection of the test value will depend on the type of requirement we apply (statistical, analytical, validation).
Statistical
- Statistical test versus Null-hypothesis (F-test, t-test, 95% confidence-intervals, …): Bias = 0; Slope = 1; Intercept = 0; etc.
Analytical
- Statistical test versus estimate of stable performance (F-test, t-test, 95% confidence-intervals, etc.): Bias ≤ calibration tolerance; etc.
Validation case (application-driven; "specific intended use")
- Statistical test versus validation limit (F-test, t-test, 95% confidence-intervals, etc.): CVexp ≤ CVmax; Biasexp ≤ Biasmax; etc.
Nevertheless, in all three situations, we apply the same type of statistical tests.

Interpretation of 95%-confidence limits

Confidence limits and quality specifications
The figure below shows a graphical interpretation of 95%-confidence limits versus a predefined quality specification: "10".
Note
When comparing an estimate with a specification, the confidence limits are usually constructed 1-sided.

[Figure: estimates A – D with their 95%-confidence limits plotted against the specification "10", read either as (1) a limit or (2) a typical performance]

1. Interpretation of the cases A – D when the specification is a limit
A: "In", the specification is satisfied with 95% probability.
B: Not "In" with 95% probability. More data may help.
C: Not "In" with 95% probability, but also not "Out" with 95% probability.
D: "Out"

2. Interpretation when the number characterizes a stable process
If the "number" is the typical performance of a stable process, situation C can still be accepted.
C: Look at lower limit: Not "Out" with 95% probability.
This situation is applied in the EP 5 protocol to investigate whether the user CV is different from the typical manufacturer CV.

SUMMARY
For a successful validation, we need performance specifications, experimental protocols, and statistical interpretation of the data. The whole exercise, however, should be carefully planned, including the samples needed, the foreseen internal quality control, and the documentation of the results. A validation plan should consider (at least) the following elements.

Validation plan
• Define the application, purpose and scope of the method
• Define performance characteristics and acceptance criteria
• Develop a validation protocol or operating procedure for the validation
• Qualify materials, e.g. standards, reagents, and samples
• Perform validation experiments
• Document validation experiments and results in the validation report
• Interpret the validation data and make statistics-based decisions

In the book, the following validation example will be used.

Measurand
Amount-of-substance concentration of glucose in serum
S-glucose: mmol/L (adult reference interval: 3.9 – 5.8 mmol/L).

Specific intended use
For in vitro diagnostic purposes.

Performance specifications
Performance characteristic          Specification
Imprecision                         Within-run: 1.5%#; Total: 3%#
LoD                                 0.1 mmol/L
Working range                       0.1-42 mmol/L
Linearity                           0.1-42 mmol/L; Limit: 5%
Recovery                            Limit: 5%
Interference                        Limit: 10%
Total error – Method comparison     Limit, Bias: 3%; Total error: 10%
#Note: typical values for stable process; not meant as limit!

Data simulation
Most data are simulated with an assumed method CV of 1-2% (within-run) and 3% (total).

Materials

Instrument XYZ
Standard, Lot#
Reagent, Lot#

Imprecision (CLSI EP5) and IQC during experiments
Low IQC material                          : 3.9 mmol/L
High-normal IQC material                  : 5.9 mmol/L
High IQC material                         : 8.5 mmol/L

LoD, dilutions, "adaptation of control" (CLSI EP17)
Isotonic saline solution (= Blank)        : 0 mmol/L

Linearity, experiment 1 (CLSI EP6)
Low sample 1                              : 3.0 mmol/L
High sample 1                             : 7.0 mmol/L

Linearity, experiment 2 ("manufacturer protocol": accuracy)
Spiked "Blank"                            : 45.0 mmol/L

Recovery and Interference (CLSI EP7)
Low sample 2                              : 3.5 mmol/L
Normal sample                             : 4.8 mmol/L
High sample 2                             : 6.5 mmol/L
Glucose solution in isotonic NaCl         : 30.0 mmol/L
Bilirubin solution in isotonic NaCl       : 600 mg/dL
Low sample 2 spiked with bilirubin        : 60 mg/dL

Recovery (Accuracy)
Standard 1                                : 4.5 mmol/L
Standard 2                                : 5.0 mmol/L
Standard 3                                : 5.5 mmol/L

Method comparison (CLSI EP9)
40 native samples                         : various

Imprecision

Graphics
• Dot plot
• Histogram
Statistics
• Descriptive Statistics: Dispersion
• Gaussian ("Normal") distribution
• Outliers
• Sampling statistics & Confidence intervals of SDs
• Significance tests for SD & variance (Chi2, F-test)
• ANOVA model II

The CLSI protocol (EP-5)
• 2 different samples (e.g., low and high)
• 1 or 2 runs per day
• Duplicates
• 20 days
IQC with 1 or 2 samples!

Specific calculations for a single run
Within-run standard deviation ($s_{wr}$):
$s_{wr} = \sqrt{\sum D_{dupl}^2 / (2 \cdot 20)}$
$D_{dupl}$ = difference of within-run duplicates
Standard deviation of the daily means ($s_{means}$ = "B" in EP-5):
$s_{means} = \sqrt{\sum D_{means}^2 / (20 - 1)}$
$D_{means}$ = difference [daily mean - overall mean of 20 days]
Between-day standard deviation ($s_{dd}$):
$s_{dd} = \sqrt{s_{means}^2 - s_{wr}^2/2}$
CAVE: set $s_{dd} = 0$ when $s_{means}^2 < s_{wr}^2/2$ (square root of a negative number!)
Total standard deviation ($s_T$):
$s_T = \sqrt{s_{means}^2 + s_{wr}^2/2}$
CAVE: set $s_T = s_{wr}$ when $s_{means}^2 < s_{wr}^2/2$

Calculation of degrees of freedom (EP5)
– $s_{wr}^2$: number of duplicates measured: 20
– $s_T^2$: complex; precalculated in EXCEL-template

Comparing a SD-estimate with a claim
– Test overlap of 1-sided confidence limit (CL) of SDs with claim, or
– 1-sample F-test ("Chi2-test"), 1-sided (EXCEL-template)
Statistics for imprecision can also be treated with Model II ANOVA!

Importance of imprecision
• Limit of detection
• Working range
• Number of analytical replicates
• Troubleshooting

Imprecision – EXCEL file

Day   Replicate 1   Replicate 2
 1    5.95          5.82
 2    5.64          5.81
 3    5.92          5.98
 4    5.85          5.85
 5    5.98          5.92
 6    5.77          5.53
 7    5.91          5.92
 8    5.94          5.91
 9    6.16          6.14
10    5.83          5.79
11    5.79          5.80
12    6.04          6.06
13    6.18          6.21
14    6.03          6.17
15    6.02          6.03
16    6.14          6.16
17    5.95          5.90
18    6.07          6.17
19    5.78          5.84
20    6.31          6.40

Graphics
The distribution of the mean values does not indicate an outlier. The distribution of the differences indicates that day 6 may be an outlier (difference of 0.24 mmol/L between the duplicates). According to the CLSI protocol it is not (4 SD outlier criterion). According to the Grubbs-test, it is.
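The single-run EP-5 calculations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EXCEL template itself: the function name `ep5_single_run` and the dictionary keys are hypothetical, and the data are the 20 duplicate results from the table above.

```python
import math

# Duplicate results (mmol/L) for days 1-20, copied from the table above.
DUPLICATES = [
    (5.95, 5.82), (5.64, 5.81), (5.92, 5.98), (5.85, 5.85), (5.98, 5.92),
    (5.77, 5.53), (5.91, 5.92), (5.94, 5.91), (6.16, 6.14), (5.83, 5.79),
    (5.79, 5.80), (6.04, 6.06), (6.18, 6.21), (6.03, 6.17), (6.02, 6.03),
    (6.14, 6.16), (5.95, 5.90), (6.07, 6.17), (5.78, 5.84), (6.31, 6.40),
]


def ep5_single_run(pairs):
    """Within-run, between-day and total SD for one run of duplicates per day."""
    n_days = len(pairs)
    # Within-run variance from the differences of the duplicates.
    var_wr = sum((a - b) ** 2 for a, b in pairs) / (2 * n_days)
    # Variance of the daily means around the overall mean.
    daily_means = [(a + b) / 2 for a, b in pairs]
    overall_mean = sum(daily_means) / n_days
    var_means = sum((m - overall_mean) ** 2 for m in daily_means) / (n_days - 1)
    # Between-day variance, clamped at zero (the CAVE in the text).
    var_dd = max(var_means - var_wr / 2, 0.0)
    # Equals var_means + var_wr / 2, or var_wr when var_dd was clamped.
    var_total = var_wr + var_dd
    return {
        "mean": overall_mean,
        "s_wr": math.sqrt(var_wr),
        "s_means": math.sqrt(var_means),
        "s_dd": math.sqrt(var_dd),
        "s_total": math.sqrt(var_total),
    }


result = ep5_single_run(DUPLICATES)
for name, value in result.items():
    print(f"{name}: {value:.3f} mmol/L")
```

Writing the total variance as within-run plus clamped between-day variance covers both CAVE rules in one expression, which is a design choice of this sketch rather than something the protocol prescribes.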
Calculations
The Worksheet uses the CLSI EP5 calculations and EXCEL ANOVA (Tools>Data Analysis). If ANOVA is used, Swr, Sdd, and ST must be calculated from the ANOVA output with EXCEL formulae (see example in the Worksheet).
Note: Due to the nature of the calculation of Sdd (SQRT of a difference), Sdd is set to zero when MS-Between groups is <= MS-Within groups.
We calculate:
Swr = 0.063 mmol/L; CVwr = 1.1%
Sdd = 0.170 mmol/L
ST = 0.181 mmol/L; CVT = 3.1%

Imprecision – EXCEL file

Interpretation
The calculated values for imprecision are:
CVwr (exp) = 1.1%
CVT (exp) = 3.1%
The specifications are:
CVwr (stable) = 1.5%
CVT (stable) = 3.0%
We compare them by use of the Chi2-statistics. We test whether the lower, 1-sided 95% confidence limits of the experimental estimates are equal to or smaller than the preset specifications. Both values pass this statistical test, even though the experimental total CVT (3.1%) is higher than the specification (= 3%). The reason is that the lower confidence limit (= 2.51%) is <3%.

Calculations
$\chi^2_{exp} = SD_{exp}^2 \cdot df / SD_{claim}^2$ (df = degrees of freedom, here = 20)
Lower CL of SD $= SD \cdot \sqrt{df / \chi^2_{0.05,df}}$

Conclusion
The validation data demonstrate that the method passes the pre-set specifications for within and total imprecision.
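The comparison of an SD estimate with a claim can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the Chi2 critical value is taken from the Wilson-Hilferty approximation (an assumption made here to stay within the Python standard library; a statistics table or the EXCEL template gives the exact value). The sketch uses df = 20 for both estimates, whereas the template uses a precalculated df for the total SD, so the lower limit for the total CV comes out slightly below the 2.51% quoted above.

```python
import math
from statistics import NormalDist


def chi2_quantile_wh(p, df):
    """Approximate Chi2 quantile (Wilson-Hilferty); not an exact table value."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def lower_cl_sd(sd, df, confidence=0.95):
    """Lower 1-sided confidence limit of an SD (or CV) estimate."""
    # The upper 5% point of Chi2 is the 0.95 quantile.
    return sd * math.sqrt(df / chi2_quantile_wh(confidence, df))


def passes_claim(sd_exp, sd_claim, df, confidence=0.95):
    """True when the lower confidence limit does not exceed the claim."""
    return lower_cl_sd(sd_exp, df, confidence) <= sd_claim


# CVs in %, with df = 20 assumed for both estimates.
for label, cv_exp, cv_claim in [("within-run", 1.1, 1.5), ("total", 3.1, 3.0)]:
    lcl = lower_cl_sd(cv_exp, 20)
    verdict = "pass" if passes_claim(cv_exp, cv_claim, 20) else "fail"
    print(f"{label}: CV = {cv_exp}%, lower CL = {lcl:.2f}%, claim = {cv_claim}% -> {verdict}")
```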
DETAILED STATISTICAL BACKGROUND
Statistics
• Descriptive Statistics: Dispersion
• Gaussian ("Normal") distribution
• Outliers
• Sampling statistics & Confidence intervals of SDs
• Significance tests for SD & variance (Chi2, F-test)
• ANOVA model II

Limit of detection (LoD)

Concepts
LoD can be calculated from the
• standard deviation of a blank
• signal-to-noise ratio of a chromatogram of a low sample
• calibration line by means of regression

Graphics
• Dot plot
• Scatter plot

Statistics
From blank
• Outlier
• Mean
• Confidence interval of centiles
• SDtotal (experiments on different days)
• Consideration of α-errors and β-errors: Power concept

LoD considering α-errors and β-errors
Model 1: $LoD = Mean + 1.65\,s_0$ ($s_0$ = SD at zero analyte)
• 5% false positives when the analyte is not present (α-error)
• 50% false negatives (β-error) when the analyte "is present at 1.65 $s_0$"
Model 2: $LoD = Mean + 2 \cdot 1.65\,s = Mean + 3.3\,s$
• Mean and s are from the zero-standard
• 3.3 s is often simplified to 3 s
Result: 5% false positives (α-error) and 5% false negatives (β-error)

Model applied in this book and in the EXCEL file
Simplified Model 2: $LoD = Mean + 3\,s$

Limit of detection (LoD) – Other concepts

Chromatographic (S/N = 3)
• Outlier
• Mean
• SDtotal (experiments on different days)

Chromatographic LoD (S/N = 3) compared with LoD from "blank" (mean noise + 3.3 SD)
[Figure: two chromatograms (response versus time); left panel: LoD = mean noise + 3.3 SD; right panel: LoD at S/N = 3, with signal = 6 SD and noise = 2 SD]

From calibration
Calculation of LoD from calibration data with regression
$Y_b$ = "Signal of blank" via regression = intercept a
$S_b$ = "Standard deviation of blank" = $S_{y/x}$
b = slope
Transform "Signal LoD" to concentration:
"Signal" LoD $= a + 3\,S_{y/x}$
Calculate $C_{LoD}$ via the regression equation $y = a + b\,x$:
$C_{LoD} = (a + 3\,S_{y/x} - a)/b = 3\,S_{y/x}/b$
When the calibration curve passes through zero, the mean-term is omitted (e.g., in case of an automatic blank).

Limit of detection (LoD)

Samples
Usually, the LoD is derived from test variation at zero analyte. This requires suitable "blank" samples. For exogenous compounds, such as drugs, these are easy to obtain. For endogenous compounds, suitable blank samples are more difficult to obtain. Note that "stripped" samples or blank solutions often give an overoptimistic LoD because of their "clean" matrix. Ideally, the LoD of a method should be assessed with several native samples containing concentrations near the detection limit, as determined by a reference method. Alternatively, the LoD is derived from measurements of calibrators.

Protocols
Blank ("Common"): Applied in this book and the EXCEL file
20 measurements of the zero-standard/blank
- 20 days, for example combined with EP5
Chromatographic
20 measurements of a sample that gives a Signal/Noise ratio of 3
- 20 days, for example combined with EP5
Calibration
From calibration curves on several different days (for example 5).
CLSI Protocol EP 17
Determination of Limits of Quantitation.

Limit of detection (LoD) – EXCEL file

Blank results
Day   mmol/L
 1     0.01
 2    -0.01
 3     0.02
 4     0.04
 5     0.02
 6    -0.03
 7    -0.01
 8     0.00
 9    -0.01
10     0.01
11     0.02
12    -0.03
13     0.03
14    -0.03
15     0.02
16     0.01
17     0.01
18    -0.04
19     0.01
20     0.00
[Figure: dot plot of the 20 blank results, axis from -0.05 to 0.05 mmol/L]

Graphic
The graphic gives no indication of an outlier.

Calculations (3 s model)
Mean: 0.0020 mmol/L
SD: 0.0219 mmol/L
Confidence interval of the 3SD-centile (1-sided, 95%): 0.02 mmol/L
Calculation: $t_{(0.1,19)} \cdot \sqrt{SD^2/20 + 3^2\,SD^2/(2 \cdot 20)}$
LoD: 0.068 mmol/L; #UCL: 0.088 mmol/L
LoD (blanked): 0.066 mmol/L; #UCL: 0.086 mmol/L
#UCL: upper confidence limit

Interpretation
We compare the UCL of the LoD (0.088 or 0.086 mmol/L) with the specification of 0.1 mmol/L.

Conclusion
The validation data demonstrate that the method passes the pre-set specification for the LoD.
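The 3 s model and its upper confidence limit can be reproduced with the short Python sketch below. It is an illustration under stated assumptions: the function name `lod_3s` is hypothetical, the 20 blank results are those listed in the EXCEL-file table of this chapter, and the critical value t(0.1, 19) = 1.729 is taken from a standard t-table rather than computed.

```python
import math
from statistics import mean, stdev

# Blank results (mmol/L) for days 1-20, as listed in the LoD table.
BLANK = [0.01, -0.01, 0.02, 0.04, 0.02, -0.03, -0.01, 0.00, -0.01, 0.01,
         0.02, -0.03, 0.03, -0.03, 0.02, 0.01, 0.01, -0.04, 0.01, 0.00]

# 1-sided 95% t-value for df = 19, i.e. t(0.1, 19) in the notation of the text.
T_CRIT_DF19 = 1.729


def lod_3s(values, t_crit, k=3.0, blanked=False):
    """Return (LoD, upper confidence limit) for the 'mean + k*SD' model."""
    n = len(values)
    sd = stdev(values)
    # With an automatic blank the mean term is omitted.
    centre = 0.0 if blanked else mean(values)
    lod = centre + k * sd
    # Confidence interval of the k-SD centile, as given under Calculations.
    half_width = t_crit * math.sqrt(sd ** 2 / n + k ** 2 * sd ** 2 / (2 * n))
    return lod, lod + half_width


lod, ucl = lod_3s(BLANK, T_CRIT_DF19)
lod_b, ucl_b = lod_3s(BLANK, T_CRIT_DF19, blanked=True)
print(f"LoD = {lod:.3f} mmol/L, UCL = {ucl:.3f} mmol/L")
print(f"LoD (blanked) = {lod_b:.3f} mmol/L, UCL = {ucl_b:.3f} mmol/L")
print("passes 0.1 mmol/L specification:", ucl <= 0.1)
```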
Working range

Working range – 2 Models
• Fixed value of the precision profile (Figure), or
• Linear part of the calibration function
[Figure: precision profile, CV (%) versus analyte (arbitrary units), with the limit of detection and the working range marked]
In this book and in the EXCEL file, the working range is defined by the linearity of the calibration curve.

Protocol
The protocol is presented in the chapter linearity/manufacturer protocol. In fact, this is a protocol that assesses accuracy with a number of related (mixed) samples.

Statistics & Graphics
The statistics and graphics are presented in the chapters linearity and accuracy/recovery.

Linearity

Graphics
• Scatter plot
• Residual plot (preferred)
• For "accuracy model": Difference plot (preferred)

Statistics
Model 1
• Based on linear regression and ANOVA: F-test for variance around line/within sample sets (lack-of-fit: old EP 6 model)
• Comparison of linear model with 2nd or 3rd order models (new EP 6 model)
Interpretation: Use CBstat
Statistics>Method evaluation>Linearity
Model 2 ("Common", Accuracy)
Often used by manufacturers for defining the Working Range ("Accuracy-based" = true x-values: e.g., weighed-in)
Investigate the deviation from the line of equality with
• confidence limits, or
• t-test
Interpretation
• Use EXCEL® template
Note
In some fields, the correlation coefficient is used to assess linearity.

Linearity model 1

CLSI EP-6 protocol
5 interrelated samples

Mixing protocol (parts in brackets)
1   low
2   low (3) + high (1)
3   low (2) + high (2)
4   low (1) + high (3)
5   high

Alternative mixing
1   low
2   low medium: mix medium and low (1:1)
3   medium: mix low and high (1:1)
4   high medium: mix medium and high (1:1)
5   high

Measurement design
Measure all samples 4 times (random), within-run or "closely related runs": SDwr.
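The expected concentrations of the five mixed samples follow from a volume-weighted mean of the low and high sample. The sketch below illustrates this for the 3:1, 2:2, 1:3 mixing protocol; the function names are hypothetical, and the example concentrations (3.0 and 7.0 mmol/L) are the "Low sample 1" and "High sample 1" values from the Materials chapter.

```python
def mix_concentration(c_low, v_low, c_high, v_high):
    """Concentration of a mixture of two samples (volume-weighted mean)."""
    return (c_low * v_low + c_high * v_high) / (v_low + v_high)


def ep6_levels(c_low, c_high):
    """Expected concentrations of the 5 interrelated samples (parts low, parts high)."""
    parts = [(4, 0), (3, 1), (2, 2), (1, 3), (0, 4)]
    return [mix_concentration(c_low, lo, c_high, hi) for lo, hi in parts]


levels = ep6_levels(3.0, 7.0)
for number, level in enumerate(levels, start=1):
    print(f"Sample {number}: {level:.1f} mmol/L")
```

Because all five levels derive from the same two pools, their relative concentrations are known even when the absolute concentrations of the pools are not, which is why the protocol needs no target values.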
Linearity model 1 – EXCEL file (worksheet Linearity)

Samples
Low sample: 3 mmol/L
High sample: 7 mmol/L

EP 6 mix protocol
Concentrations (C) of samples 2 - 4 (V = volume):
C = (C1*V1 + C5*V5)/(V1 + V5)

Sample#                   1   2   3   4   5
Concentration (mmol/L)    3   4   5   6   7

Results (first column: expected concentration, mmol/L)
Sample   y1     y2     y3     y4
3.0      2.99   2.94   3.01   3.06
4.0      3.93   4.02   4.01   4.03
5.0      4.97   5.02   4.95   4.92
6.0      5.74   5.90   5.97   5.93
7.0      6.78   6.69   6.82   6.65

Graphic
The graphic may indicate outliers at the levels 4 and 6 mmol/L. The Grubbs test, however, does not confirm the presence of an outlier. The residuals plot indicates non-linearity.

Linearity model 1 – EXCEL file (worksheet Linearity)

Calculations
The data are investigated for linearity with specialized software (here: CBstat). The models used are the "lack-of-fit" method and the evaluation by a second order polynomial fit (new CLSI EP 6 model).

"Lack-of-fit" F-test for linearity:
F = 2.5125
P: 0.0980
No significant deviation from linearity.

Second order polynomial fit
t-test of last coefficient against zero:
SE of last coef.: 0.0085
t value: -2.8816
P: 0.0104

x-level   %-difference
3         -1.6
4          0.6
5          1.0
6          0.4
7         -0.7
Significant deviation from linearity, but none of the levels deviates by more than 5% (chosen limit).

Interpretation
The statistical results show that the second order polynomial fit method is more sensitive than the lack-of-fit method. The former shows that the data-set is nonlinear. However, the 5% limit is not exceeded.

Conclusion
The validation data demonstrate that the method passes the pre-set specification for linearity.

Linearity model 2 – EXCEL file (worksheet Lin-Manuf)

Accuracy protocol ("Working Range protocol")
This model is called "Working Range protocol" because it is often applied by manufacturers to establish the working range.
Samples
11 (for example) interrelated samples, prepared by mixing a blank sample and a blank sample spiked with a known amount of analyte.
 1: Blank (blank)
 2: 9 blank + 1 high (spiked) sample
 3: 8 blank + 2 high (spiked) sample
 4: 7 blank + 3 high (spiked) sample
 5: 6 blank + 4 high (spiked) sample
 6: 5 blank + 5 high (spiked) sample
 7: 4 blank + 6 high (spiked) sample
 8: 3 blank + 7 high (spiked) sample
 9: 2 blank + 8 high (spiked) sample
10: 1 blank + 9 high (spiked) sample
11: High (spiked, known concentration) sample

Measurement design
Measure all samples 4 times (random), within-run: SDwr.

Results (first column: expected concentration, mmol/L)
Sample   y1      y2      y3      y4
 0.0      0.03    0.00    0.00   -0.03
 4.5      4.47    4.47    4.59    4.59
 9.0      9.06    8.97    8.85    8.91
13.5     13.77   14.22   13.41   13.71
18.0     18.45   17.94   18.09   17.85
22.5     22.62   22.59   22.35   22.47
27.0     26.70   27.24   27.30   26.76
31.5     30.75   32.25   31.59   31.59
36.0     35.67   34.47   35.07   34.02
40.5     39.42   38.13   38.34   38.31
45.0     42.42   41.79   41.10   42.09

Linearity model 2 – EXCEL file (worksheet Lin-Manuf)

Graphic
The graphic shows an (expected) increase of the scatter of the data around their mean values (constant measurement CV). Otherwise, there seems to be no irregularity.

Calculations
The 1-sided 95% confidence interval of the mean is calculated as follows:
$CI = \pm t_{(0.1,3)} \cdot SD/\sqrt{4}$

Interpretation
The interpretation of the data is done by use of the difference plot. The plot indicates that the CLs overlap with the 5% specification at concentrations >31.5 mmol/L. More replicates could demonstrate that the concentration of 36 mmol/L is within the specified linearity limit of 5%.

Conclusions
The validation data do not support a working range up to 45 mmol/L.
The range should be reduced to 31.5 mmol/L.

Recovery

Graphics
• Ratio plot (%)
• Difference plot (%)

Statistics
• Descriptive statistics: Location (mean, median & mode)
• t-distribution
• Central limit theorem
• Confidence intervals
• t-tests
• ANOVA-model I
• Power and sample size

Recovery experiments

Protocols
Model 1 ("Paired-sample"; see also CLSI EP 7)
Samples
"Paired-sample" experiment: 2 portions of native samples; spike one with a known analyte amount (= Test) and the other with the same volume of saline solution (= Control).
3 – 5 samples at relevant concentrations
• Test: Add x mL analyte standard (preferably in blank-solution) to y mL sample; the volume added should be less than 5-10% (requires concentrated analyte standard)
  - Added concentration: e.g., ½ to 1 times the concentration of a "normal" sample
• Control: Add the same volume of blank-solution to the same volume of sample
Measurement design
Measure Control & Test alternating (n = 2 – 4)
- Note: may need repetition with other lots of calibrators/reagents
Calculations
Concentration added = Concentration of standard • x/(x + y)
Concentration recovered = Test - Control
Recovery (%) = 100 • (Recovered conc./Added conc.) ± 95% CL

Model 2 (Accuracy: "trueness" based; "Common" protocol)
Samples
Experimental design: "Recovery of target values"
• Reference materials with target values
  - Certified reference materials
  - IQC materials
  - Standards
Measurement design
• Measure samples 5 times on different days
  - Note: may need repetition with other lots of calibrators/reagents
Calculations
Recovery (%) = 100 • (Measured value/Target value) ± 95% CL

Recovery – Model 1 (paired sample), EXCEL file

Samples/Materials
Low sample                           : 3.5 mmol/L
Normal sample                        : 4.8 mmol/L
High sample 2                        : 6.5 mmol/L
Glucose solution in isotonic NaCl    : 30 mmol/L (add ≤10% volume)
Isotonic NaCl-solution
Test: Add 0.1 mL (= x) analyte standard to 0.9 mL (= y) sample.
Control: Add the same volume of NaCl-solution to the same volume of sample.

Calculations (see EXCEL worksheet)
Tests: C = (Csample • Vsample + Cstandard • Vstandard)/(Vsample + Vstandard)
Controls: C = (Csample • Vsample + Csaline • Vsaline)/(Vsample + Vsaline)
Added concentration = Concentration of standard • x mL standard/(x mL standard + y mL sample)
Recovered concentration = Test – Control
Recovery (%) = 100 • (Recovered conc./Added conc.) ± CL

Results (first column: expected concentration, mmol/L)
Control   y1     y2     y3     y4
3.15      3.11   3.14   3.13   3.16
4.32      4.35   4.39   4.26   4.22
5.85      5.82   5.79   5.90   5.77

Test      y1     y2     y3     y4
6.15      6.14   6.20   6.25   6.12
7.32      7.27   7.30   7.18   7.42
8.85      8.82   8.72   8.88   8.98

Recovery – Model 1 (paired sample), EXCEL file

Graphics
The graphic shows the distribution of the results around their mean values and the individual recoveries. It shows no irregularities.

Calculations
The 1-sided 95% confidence interval of the mean difference between Test and Control is calculated with the z-value as follows:
$CI = \pm z \cdot SD_{pr}/\sqrt{4}$, with z = 1.65 (1-sided 95%).
The interpretation of the results is done with the confidence limits calculated with the z-value and the predicted SD (SDpr) from the EP 5 imprecision data (CLSI EP 7 approach). Note that the imprecision of the %-recoveries depends on the Test and Control level and on the magnitude of the spike (see EXCEL-file).
CAVE: if one uses t, the propagated SD from the actual data has to be calculated (the SDs of Test and Control differ because of the different levels!). The degrees of freedom must be calculated with the Satterthwaite formula (different concentrations!). The respective test is a t-test.
CAVE: the SD of the %-recovery will be high when little is spiked!

Interpretation
The interpretation of the data is done by use of the % ratio plot. The plot shows that none of the CLs overlap with the 5% specification.

Conclusions
The validation demonstrates that the method passes the preset 5% limit for recovery.
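The paired-sample recovery calculation can be sketched as follows, using the Test and Control results from the table above. The function names are hypothetical. The confidence limit uses the z-approach of the text, with one loudly stated assumption: the predicted SD is derived from a constant within-run CV of 1.1% (the value found in the Imprecision chapter), applied at the Test and Control levels and combined as the SD of a difference. The EXCEL file may predict SDpr differently.

```python
import math
from statistics import mean

# Measured results (mmol/L), 4 replicates each, from the Results table.
CONTROL = {"low": [3.11, 3.14, 3.13, 3.16],
           "normal": [4.35, 4.39, 4.26, 4.22],
           "high": [5.82, 5.79, 5.90, 5.77]}
TEST = {"low": [6.14, 6.20, 6.25, 6.12],
        "normal": [7.27, 7.30, 7.18, 7.42],
        "high": [8.82, 8.72, 8.88, 8.98]}

ASSUMED_CV_WR = 1.1   # % within-run CV; assumption taken from the imprecision data
Z_ONE_SIDED_95 = 1.65


def added_concentration(c_standard, v_standard, v_sample):
    """Concentration added by the spike after dilution into the sample."""
    return c_standard * v_standard / (v_standard + v_sample)


def recovery_with_cl(test, control, added, cv_wr=ASSUMED_CV_WR, z=Z_ONE_SIDED_95):
    """Return (recovery %, confidence half-width %) for one sample."""
    n = len(test)
    test_mean, control_mean = mean(test), mean(control)
    recovery = 100.0 * (test_mean - control_mean) / added
    # Predicted SD of (Test - Control): the two levels have different SDs.
    sd_pr = math.hypot(cv_wr / 100.0 * test_mean, cv_wr / 100.0 * control_mean)
    half_width = 100.0 * z * sd_pr / math.sqrt(n) / added
    return recovery, half_width


added = added_concentration(30.0, 0.1, 0.9)
results = {name: recovery_with_cl(TEST[name], CONTROL[name], added) for name in TEST}
for name, (rec, cl) in results.items():
    inside = 95.0 <= rec - cl and rec + cl <= 105.0
    print(f"{name}: recovery = {rec:.1f}% ± {cl:.1f}% -> {'pass' if inside else 'check'}")
```

Dividing the half-width by the added concentration makes the second CAVE visible: the smaller the spike, the wider the confidence interval of the %-recovery.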
Recovery – Model 2 (accuracy/trueness), EXCEL file

Samples
Low IQC material            : 3.9 mmol/L
High-normal IQC material    : 5.9 mmol/L
High IQC material           : 8.5 mmol/L
Standard 1                  : 4.5 mmol/L
Standard 2                  : 5.0 mmol/L
Standard 3                  : 5.5 mmol/L

Measurement
Measure samples 5 times on different days.
Note: may need repetition with other lots of calibrators/reagents.

Calculations (see EXCEL worksheet)
Recovery (%) = 100 • (Measured value/Target value) ± CL

Results (first column: target value, mmol/L)
Sample   y1     y2     y3     y4     y5
3.9      3.93   3.90   3.88   3.92   3.91
5.9      5.83   5.70   5.79   5.63   5.84
8.5      7.92   8.64   8.31   8.79   8.66
4.5      4.63   4.48   4.40   4.50   4.60
5.0      4.92   4.97   5.29   4.95   5.14
5.5      5.59   5.60   5.68   5.93   5.28

Graphics
The graphic shows the distribution of the results around their mean values. It shows no irregularities.

Recovery – Model 2 (accuracy/trueness), EXCEL file

Calculations
The 1-sided 95% confidence interval of the mean is calculated as follows:
$CI = \pm t_{(0.1,4)} \cdot SD/\sqrt{5}$

Interpretation
The interpretation of the data is done by use of the % ratio plot. The plot shows that only the CL of Standard 3 overlaps with the 5% specification. This standard should be repeated.

Conclusions
The validation demonstrates that the method passes the preset 5% limit for recovery (provided that the repetition of Standard 3 is within the specification).

Interference

Interference testing (CLSI EP7)

Graphics
• See "Recovery: Paired sample"
Statistics
• See "Recovery: Paired sample"

Protocols (CLSI EP 7, 2 approaches)
Approach 1: "Paired difference method"
Applies a similar experimental design and similar calculations as the paired-sample recovery experiment (3 – 5 samples). Instead of an analyte standard, an interferent standard has to be prepared.
• Test: Add x mL interferent-solution (preferably in blank-solution) to y mL sample; the volume added should be less than 5-10%
• Control: Add the same volume of blank-solution to the same volume of sample
Measure: Control & Test alternating (n = 2 – 4)
Interference (%) = 100 • (Test - Control)/Control ± 95% CL

Approach 2: "Dose-response method" (used in EXCEL file)
3 – 5 samples, for each
• Low pool (low or no interferent added; if none, add blank!)
• High pool (interferent at maximum concentration)
  - Note: always add the same volumes of blank/interferent solutions
• Create 5 levels by "alternative mix-protocol linearity"!
Measure: All levels "up", then "down", or random (n = 2 – 4)
Interference (%) = 100 • (Test - Control)/Control ± CL
Note
CLSI EP7 applies regression analysis for this protocol!

Interference – EXCEL file

Samples/Materials
Low sample 2                    : 3.5 mmol/L
Interferent solution in NaCl    : 600 mg/dL
Isotonic saline solution
- Make "Low pool" (add 0.1 mL saline to 0.9 mL sample)
- Make "High pool" (add 0.1 mL interferent solution to 0.9 mL sample)
  (Note: always add the same volumes of saline/interferent solutions)
- Create 5 levels by "alternative mixing protocol"

Measurement
Measure, within-run: All levels "up", then "down", or random (n = 4)
Interference (%) = 100 • (Test - Control)/Control ± CL

Results (BILI: bilirubin, mg/dL; results in mmol/L)
BILI   y1     y2     y3     y4
 0     3.17   3.12   3.13   3.15
15     3.24   3.18   3.03   3.22
30     3.15   3.12   3.20   3.13
45     3.33   3.36   3.34   3.40
60     3.55   3.68   3.40   3.53

Graphics
The graphic shows the distribution of the results around their mean values. It shows no irregularities.

Interference – EXCEL file

Calculations
The 1-sided 95% confidence interval of the mean difference between Test and Control (0 BILI) is calculated with the z-value as follows:
$CI = \pm z \cdot SD_{pr}/\sqrt{4}$, with z = 1.65 (1-sided 95%).
The interpretation of the results is done with the confidence limits calculated with the z-value and the within-run imprecision as calculated from the EP 5 protocol (CLSI EP 7 approach). Note that the imprecision of the interference results (SDpr) is SQRT(2) times the measurement imprecision because the interference results are the difference between 2 measurements (Test and Control).

Interpretation
The interpretation of the data is done by use of the % difference plot. The plot shows that only the CL of the sample with 60 mg/dL bilirubin overlaps with the 10% specification.

Conclusions
The validation data show that the test is valid up to a bilirubin concentration of 45 mg/dL.

Method comparison

Graphics
• Scatter plot
• Difference plot
• Residual plot
• Krouwer plot
• Bland and Altman plot

Statistics
• Correlation
• Regression
• Bland and Altman approach
• General (F-test, t-test, confidence-intervals)

General remarks
Method comparison supposes:
• Appropriate performance of test- and comparison method
  - Internal Quality Control (verify actual imprecision against expected imprecision by use of the F-test; verify calibration with targeted control samples by t-test or confidence intervals)
• Appropriate presentation of the paired observations (xi, yi)
• Appropriate interpretation
Interpretation of method comparison makes integrated use of:
• Graphical and statistical techniques
• Analytical quality specifications

Method comparison – Sample size
Usually, general recommendations are given for sample size (EP 9: n ≥ 40, e.g.). However, to assure given type I and II errors, i.e. sufficient power in a method comparison study, a minimum sample size is needed depending on:
• Slope or intercept deviation to be detected
• Measurement range
• Constant or proportional analytical error assumption
• Magnitude of SD or CV for the methods
Tables are available: See Linnet K. Clin Chem 1999; 45: 882-894.
Method comparison protocols

The CLSI EP-9 protocol
Experimental design:
• At least 40 samples
• Spread analysis over 5 days, randomize concentrations
• Measure duplicates in 1 run, 1st series "upwards", second series "downwards"
Apply adequate internal quality control!
Data presentation and calculations:
• Outlier tests: Diffdupl > 4 × Mean Diffdupl (if yes, perform the same with % data)
• Scatter plots, singlicates and mean of duplicates
• Bias plots, singlicates and mean of duplicates
• Inspect for linearity, dispersion, and range (r ≥ 0.975)
• Apply linear regression (ordinary or Deming)
Interpretation:
• Dependent on the criteria of the laboratory
• Dependent on whether a reference method was used or a "comparative" method
Note: Make a distinction between pure statistical, analytical, and clinical interpretation!

The Valtech protocol
Experiments
• At least 50 samples (better: 80 - 100).
• Carry out the analyses in singlicates, spread over 10 measurement series, and take the samples in random order.
• Adequate internal quality control!
Vassault A, Grafmeyer D, Naudin Cl, Dumont G, Bailly M, Henny J, Gerhardt MF, Georges P. Société Française de Biologie Clinique. Protocole de validation de techniques. Ann Biol Clin 1986;44:686-719 (English version: 720-45).
See also: Vassault A, Grafmeyer D, de Graeve J, Cohen R, Beaudonnet A, Bienvenu J. Société Française de Biologie Clinique. Analyses de biologie médicale: spécifications et normes d’acceptabilité à l’usage de la validation de techniques. Ann Biol Clin 1999;57:685-95.

Method comparison protocols

The "UG" protocol
"If possible, use a true reference method for comparison"
Experiments
• Start from a reliable calibration basis and verify it with IQC samples from the manufacturer = Stable basis.
• Adapt the number and the sort of samples to the problem (e.g. 50).
• Duplicates in 1 series, random sampling (note: for the reference method, adapt the number of measurements to the problem).
• “Intensive” IQC
Dewitte K, Stöckl D, Van de Velde M, Thienpont LM. Evaluation of intrinsic and routine quality of serum total magnesium measurement. Clin Chim Acta 2000;292:55-68.

The stable basis
Was the method performed adequately? Inspect the internal quality control (IQC) data.
• Evaluation of precision and traceability to manufacturer

The stable basis – Statistics
• F-test
• t-test
• Confidence intervals

Method comparison – EXCEL file

Results
Ref.   Yours   Ref.   Yours   Ref.   Yours   Ref.   Yours
3.79   3.80    4.89   4.64    5.65   5.55    6.66   6.87
3.84   3.88    4.91   4.62    5.73   5.58    6.71   6.80
3.86   3.65    4.91   4.90    5.79   6.08    6.78   6.90
3.88   3.86    4.95   4.88    5.83   5.65    6.87   7.11
3.92   3.93    5.01   4.86    5.84   6.05    6.94   7.17
3.99   4.09    5.02   4.89    5.86   5.76    7.10   7.07
4.08   4.16    5.03   5.17    5.92   5.76    7.12   7.00
4.11   4.11    5.16   4.90    5.93   5.57    7.13   7.02
4.13   4.05    5.17   5.12    5.94   6.10    7.14   6.90
4.13   4.07    5.17   5.01    5.97   5.80    7.15   7.23
4.23   4.38    5.18   5.26    5.97   5.88    7.15   7.38
4.27   4.21    5.25   5.28    6.06   6.11    7.36   7.19
4.38   4.28    5.39   5.37    6.11   6.08    7.43   7.11
4.39   4.28    5.44   5.49    6.12   5.90    7.47   7.11
4.42   4.31    5.49   5.43    6.30   6.03    7.51   7.15
4.58   4.63    5.53   5.34    6.49   6.48    7.56   7.39
4.70   4.65    5.55   4.99    6.50   6.77    7.90   7.81
4.70   4.48    5.58   5.45    6.59   6.58    8.02   7.83
4.85   5.01    5.58   5.53    6.61   6.22    8.07   7.82
4.85   4.62    5.65   5.27    6.66   6.28    8.19   7.72

Calculations – Bland & Altman approach
The calculations comprise the mean difference and the 1.96 × CV of the individual differences, together with their respective CLs (n = 80, so df = 79).
$CI(\text{mean}) = \pm\, t_{(0.1,\,79)} \cdot SD_{diff} / \sqrt{80}$
$CI(1.96s\ \text{centile}) = \pm\, t_{(0.1,\,79)} \cdot \sqrt{SD_{diff}^2/80 + 1.96^2 \cdot SD_{diff}^2/(2 \cdot 80)} = 1.71 \cdot CI(\text{mean})$
See also Worksheet Meth-Comp3 for calculations.

Graphics and interpretation
The graphic (% differences) reveals no outliers.
The CLs of the mean and the 1.96 centile of the differences ("limits of agreement") do not overlap with the respective specifications of 3% (SE or Bias limit) and 10% (TE limit).

Conclusions
The validation data show that the test passes the preset limits for systematic (3%) and total error (10%).

Calculations – Regression
See Worksheet Meth-Comp4 for the detailed calculation of the ordinary linear regression and correlation estimates.

Calculations
$CI(\text{line}) = \pm\, t \cdot S_{y/x} \cdot \sqrt{1/n + (X_c - X_{mean})^2 / \sum (X_i - X_{mean})^2}$ (df of t = n – 2)
Xc: concentration for which the bias shall be investigated.
$CI(\text{points}) = \pm\, t \cdot S_{y/x} \cdot \sqrt{1 + 1/n + (X_c - X_{mean})^2 / \sum (X_i - X_{mean})^2}$ (df of t = n – 2)
Xc: concentration for which the total error shall be investigated.

Graphics
The results are presented in a scatter plot and a residuals plot.

Interpretation
The confidence limits of bias and total error at the minimum and maximum values of x (respectively y) are compared with the specifications. They are smaller than the specifications at both concentrations (see Worksheet Meth-Comp4).

Conclusions
The validation data show that the test passes the preset limits for systematic (3%) and total error (10%).
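The Bland & Altman and regression calculations described above can be sketched in a few lines of Python. This is not a reproduction of the EXCEL worksheets; it is a minimal illustration under stated assumptions: % differences are taken relative to the reference value, the critical t-value is passed in by the caller (as one would look it up in a table or with TINV), and the function names are hypothetical. Only the first five Ref./Yours pairs of the results table are used, so the numbers it prints are not the results of the full study.

```python
import math


def bland_altman_percent(ref, test, t_crit):
    """Mean % difference, 1.96 s limits, and half-widths of their CIs."""
    n = len(ref)
    # Assumption: % difference is expressed relative to the reference value.
    diffs = [100.0 * (y - x) / x for x, y in zip(ref, test)]
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return {
        "mean": mean,
        "sd": sd,
        "lower": mean - 1.96 * sd,
        "upper": mean + 1.96 * sd,
        # CI (mean) = t * SD / SQRT(n)
        "ci_mean": t_crit * sd / math.sqrt(n),
        # CI (1.96s centile) = t * SQRT[SD^2/n + 1.96^2 * SD^2 / (2n)]
        "ci_limit": t_crit * math.sqrt(sd**2 / n + 1.96**2 * sd**2 / (2 * n)),
    }


def regression_ci(x, y, xc, t_crit):
    """Ordinary least squares fit plus CI (line) and CI (points) at xc."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(ss_res / (n - 2))  # residual SD, df = n - 2
    leverage = 1.0 / n + (xc - x_mean) ** 2 / sxx
    return {
        "slope": slope,
        "intercept": intercept,
        "s_yx": s_yx,
        "ci_line": t_crit * s_yx * math.sqrt(leverage),
        "ci_points": t_crit * s_yx * math.sqrt(1.0 + leverage),
    }


# First five Ref./Yours pairs from the results table, for illustration only.
ref = [3.79, 3.84, 3.86, 3.88, 3.92]
yours = [3.80, 3.88, 3.65, 3.86, 3.93]
print(bland_altman_percent(ref, yours, t_crit=2.132))    # t for df = 4
print(regression_ci(ref, yours, xc=3.85, t_crit=2.353))  # t for df = 3
```

In a full evaluation, the mean ± CI (mean) would be compared with the 3% bias limit and the limits of agreement ± CI (1.96s centile) with the 10% total error limit. CI (line) and CI (points) would be evaluated at the lowest and highest concentrations.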
Annex

Content
• Summary of protocols, statistics & graphics
• System stability, Ruggedness and Multifactor protocols
• Glossary of terms

Protocols & statistics

Experimental protocols
• Imprecision: EP 5
• Limit of detection: EP 17 or "Common"
• Working range: see linearity or define by imprecision
• Linearity: EP 6
• Linearity by recovery: "Common" (Accuracy)
• Recovery, reference material: "Common" (Accuracy)
• Recovery, added analyte: see interference/specificity
• Interference/Specificity: EP 7
• Total error: EP 9, UG* (Method comparison)
*EP = CLSI Evaluation Protocols; UG = University Ghent

Others
• EP 10 Preliminary evaluation
• EP 12 Qualitative tests
• EP 14 Matrix effects
• EP 15 User demonstration of precision & accuracy
• EP 21 Total error

Statistics (see Statistics Book)
Analytical problem: Associated statistics
• General: Basic statistics; Outlier tests (e.g., Grubbs)
• Imprecision: F-test; CHI2-test (#), ANOVA
• Limit of detection: Probability & Power
• Working range: see linearity or define by imprecision
• Linearity: Regression, ANOVA
• Recovery: t-test (#)
• Interference/Specificity: t-test (#)
• Total error (method comparison): Regression & correlation; Bland & Altman plot
• Trouble-shooting: Power (sample size calculations)
#Alternative: confidence intervals

Graphics
• Univariate data: Dot plot, Histogram (frequency versus value bin), Box plot
• Bivariate data: Scatter plot, Difference plot, Ratio plot (%) (Recovery), Residuals plot

Overview of experiments, statistics, and graphics

Imprecision (CLSI Doc.: EP 5, EP 15)
• Samples: IQC samples; no target
• Measurements#: n = 20
• Relevant SD$: Within & total
• Graphics: Dot plot
• Statistical test vs specification: ANOVA & 1-sample F-test or CL of SD

LoD/LoQ (CLSI Doc.: EP 17)
• Samples: Blank; Low sample
• Measurements#: n = 20
• Relevant SD$: Total
• Graphics: Dot plot
• Statistical test vs specification: 1-sample F-test or CL of SD

Linearity (CLSI Doc.: EP 6)
• Samples: 5 related samples/calibrators (mix); no target
• Measurements#: n = 4
• Relevant SD$: Within
• Graphics: Scatter/residual plot
• Statistical test vs specification: Lack-of-fit or polynomial regression

Working range
• See: Imprecision/Linearity

Interference (CLSI Doc.: EP 7)
• Samples: Interferent spike & control (no target)
• Measurements#: n = 4
• Relevant SD$: Within
• Graphics: Difference/ratio plot
• Statistical test vs specification: CL of mean difference (or t-test)

Trueness (Accuracy) (CLSI Doc.: EP 7, EP 15)
• Samples: Known analyte spike & control or CRM
• Measurements#: n = 5
• Relevant SD$: Total
• Graphics: Difference/ratio plot
• Statistical test vs specification: CL of mean difference or CL of mean (or t-tests)

Total error (CLSI Doc.: EP 9, EP 21; UG)
• Samples: 40 samples (RMP target)
• Measurements#: n = 1 or 2
• Relevant SD$: Total or within (UG protocol)
• Graphics: Scatter/bias plot
• Statistical test vs specification: Correlation, Regression/Bland & Altman

#Numbers do not always correspond to the respective CLSI document.
$Abbreviations: SD: standard deviation; IQC: Internal Quality Control; CLSI: Clinical and Laboratory Standards Institute; CRM: Certified Reference Material; RMP: Reference Measurement Procedure; UG: University Ghent; CL: Confidence limit.

Sensitivity of statistical parameters to different types of errors
(From: Westgard JO, Hunt MR. Clin Chem 1973;19:49-57)

                            Type of error
Statistic       Parameter   Random   Constant   Proportional
Least-squares   Slope       No       No         Yes
Least-squares   Intercept   No       Yes        No
Least-squares   Sy/x        Yes      No         No
Paired t-test   Bias        No       Yes        Yes
Paired t-test   SDdiff      Yes      No         Yes
Correlation     r           Yes      No         No

System stability
Trueness is also related to system [in]stability:
• Drift
• Shift
System instability is tackled by internal quality control.

Carryover
Carryover is related to the quality of the instrument and the test procedure (e.g., washing). See CLSI protocol EP-10.
Ruggedness
Ruggedness = ability to reproduce the method in different laboratories or in different circumstances
• Related to the method principle and the test conditions
• Related to the instrument a method is performed with
Assessment of ruggedness
• Between-laboratory performance data obtained through EQA
• Ease of operation within the laboratory
  – Efforts needed for internal quality control
  – Productivity of a method (down time, calibration and service intervals, etc.)

Multifactor protocols
Classically, single effects are investigated in one experimental design (e.g., imprecision, linearity, carryover).
Multi-factor evaluation designs investigate several effects with one experimental design
  – Advantage: less time consuming!
Example: EP 10
Allows evaluation of
• Imprecision
• Linearity
• Bias
• Carryover
• Drift
Applies multiple linear regression analysis
• Needs special software for interpretation

The EP-10 protocol
The design
• 3 interrelated samples: low, mid, high
• Prescribed measurement sequence: Mid, mid, high, low, mid, mid, low, low, high, high, mid.
• 5 days, always 1 run

Glossary

Metrology [1]
field of knowledge concerned with measurement

Measurand [1]
quantity intended to be measured

Quantity [1]
property of a phenomenon, body, or substance, to which a number can be assigned with respect to a reference

Measurement [1]
process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity

Notes:
• Quantities are length, mass, amount-of-substance, time, temperature, etc.
• The value of a quantity is expressed by both a number and a unit
• The full specification of the quantities measured in the medical laboratory comprises three elements:
  – System (e.g., blood plasma)
  – Component (also called analyte) (e.g., glucose)
  – Kind-of-quantity (e.g., amount-of-substance concentration)
The full report of a glucose measurement would read: “the amount-of-substance concentration of glucose in blood plasma was 5.2 mmol/L”

Measurement unit [1]
scalar quantity, defined and adopted by convention, with which any other quantity of the same kind can be compared to express the ratio of the two quantities as a number

Value of a quantity [1]
number and reference together expressing magnitude of a quantity
EXAMPLE: Length of a given rod: 5.34 m

Measurement standard [1]
realization of the definition of a given quantity, with stated quantity value and measurement uncertainty, used as a reference
EXAMPLE: 1 kg mass standard.

Error [1]
difference of measured quantity value and reference quantity value

Systematic error [1]
component of measurement error that in replicate measurements remains constant or varies in a predictable manner

Bias [1]
systematic measurement error or its estimate, with respect to a reference quantity value

Random error [1]
component of measurement error that in replicate measurements varies in an unpredictable manner

Trueness [1]
closeness of agreement between the average of an infinite number of replicate measured quantity values and a reference quantity value

Accuracy [1]
closeness of agreement between a measured quantity value and a true quantity value of the measurand

Precision [1]
closeness of agreement between indications obtained by replicate measurements on the same or similar objects under specified conditions

Repeatability condition [1]
condition of measurement in a set of conditions that includes the same measurement procedure, same operators, same measuring system, same operating conditions
and same location, and replicate measurements on the same or similar objects over a short period of time

Reproducibility condition [1]
condition of measurement in a set of conditions that includes different locations, operators, measuring systems, and replicate measurements on the same or similar objects

Uncertainty [1]
parameter characterizing the dispersion of the quantity values being attributed to a measurand, based on the information used

[Metrological] Traceability [1]
property of a measurement result whereby the result can be related to a stated reference through a documented unbroken chain of calibrations, each contributing to the measurement uncertainty

Commutability [of a reference material] [1]
property of a reference material, demonstrated by the closeness of agreement between the relation among the measurement results for a stated quantity in this material, obtained according to two given measurement procedures, and the relation obtained among the measurement results for other specified materials

Matrix effect [2]
influence of a property of the sample, other than the measurand, on the measurement of the measurand according to a specified measurement procedure and thereby on its measured value

Influence quantity [1]
quantity that, in a direct measurement, does not affect the quantity that is actually measured, but affects the relation between the indication and the measurement result

Note: Specificity & Interference are not yet unequivocally defined by ISO.
Selectivity [1]
capability of a measuring system, using a specified measurement procedure, to provide measurement results, for one or more measurands, that do not depend on each other nor on any other quantity in the system undergoing measurement (= specificity in chemistry)

Interference [in analysis]
A systematic error in the measure of a signal caused by the presence of concomitants in a sample (http://goldbook.iupac.org).

Specific [in analysis]
A term which expresses qualitatively the extent to which other substances interfere with the determination of a substance according to a given procedure. Specific is considered to be the ultimate of selective, meaning that no interferences are supposed to occur (http://goldbook.iupac.org).

Calibration [1]
operation that, under specified conditions, in a first step establishes a relation between the quantity values with measurement uncertainties provided by measurement standards and corresponding indications with associated measurement uncertainties and, in a second step, uses this information to establish a relation for obtaining a measurement result from an indication

Sensitivity [1]
quotient of the change in the indication and the corresponding change in the value of the quantity being measured

Linear range
Concentration range over which the intensity of the signal obtained is directly proportional to the concentration of the species producing the signal (http://goldbook.iupac.org).

Linearity (generic)
Ability of an analytical procedure to produce test results which are proportional to the concentration (amount) of an analyte, either directly or by means of a well-defined mathematical transformation.
Working interval [1]
set of values of the quantities of the same kind that can be measured by a given measuring instrument or measuring system with specified instrumental uncertainty, under defined conditions

Limit of detection (in analysis)
The limit of detection, expressed as the concentration, cL, or the quantity, qL, is derived from the smallest measure, xL, that can be detected with reasonable certainty for a given analytical procedure. The value of xL is given by the equation
$x_L = \bar{x}_{bi} + k \cdot s_{bi}$
where $\bar{x}_{bi}$ is the mean of the blank measures, $s_{bi}$ is the standard deviation of the blank measures, and k is a numerical factor chosen according to the confidence level desired (http://goldbook.iupac.org).

Limit of detection [1]
measured quantity value, obtained by a given measurement procedure, for which the probability of falsely claiming the absence of a component in a material is β, given a probability α of falsely claiming its presence

Ruggedness (generic)
Ability to reproduce the method in different laboratories or in different circumstances.

Ruggedness (USP)
Degree of reproducibility of the results obtained under a variety of conditions, expressed as %RSD. These conditions include different laboratories, analysts, instruments, reagents, days, etc.

Robustness (ICH Q2A 1995)
The robustness of an analytical procedure is a measure of its capacity to remain unaffected by small, but deliberate variations in method parameters and provides an indication of its reliability during normal usage.

[1] BIPM, IEC, IFCC, ISO, IUPAC, IUPAP, OIML. Vocabulaire International des Termes Fondamentaux et Généraux de Métrologie. 3rd ed. Geneva: ISO, 2007.
[2] EN/ISO 17511:2003. In vitro diagnostic medical devices – Measurement of quantities in biological samples – Metrological traceability of values assigned to calibrators and control materials.
[3] See also: www.clsi.org>Harmonized Terminology Database