Quantitative Imaging Biomarkers:
A Review of Statistical Methods for Technical Performance Assessment
by
Technical Performance Working Group*
*Author list in alphabetical order:
Corresponding Author:
David L. Raunig, Ph.D.
ICON Medical Imaging
2800 Kelly Rd.
Warrington, PA 18976
Abstract
Keywords: quantitative imaging, imaging biomarkers, reliability, linearity, bias,
precision, repeatability, reproducibility, agreement
1.
BACKGROUND
Medical imaging, originally developed as a clinical tool for the physician, has advanced to
the point that images can now be used to reliably measure structural and functional
features. Improved resolution and new modalities have made imaging useful for
quantitative measurement of an anatomical region of interest; such a measurement is
known as a quantitative imaging biomarker (QIB). QIBs and their comparative changes
make imaging useful for analyzing changes over time and for estimating the effects of
therapeutic intervention in the treatment of disease. It is imperative, then, that a QIB
represent the true feature measurement (e.g. volume), that the measurement can be
reliably repeated, and that the measurement system can be exported to different
measurement conditions. Reliability is therefore represented by both the ability to
represent the true value without bias and the ability to do so with minimal variability.
The explosive use of biomarkers within the last decade was not always preceded by
evidence that the QIB could reliably measure the imaging feature of interest.
Additionally, the statistical methods used to assess QIB performance in patients are not
standardized and may even be inappropriate. Examples include slopes estimated for
nonlinear relationships, standard deviation reported in place of the coefficient of
variation, and statistically significant but poor correlations presented as evidence of
agreement. This inconsistent use of statistical metrics can be confusing, and results may
even appear contradictory between two different literature references. Similarly, studies
that measure QIB reliability are often not designed to acquire the data necessary to
adequately describe QIB performance. The range of QIB values, the sample sizes
(including the number and types of image acquisitions), and the conditions of QIB use
are all necessary considerations, but they are only some of the things that must be taken
into account.
<< Examples: all TechPerform group members to comment (at least Lisa McShane, Rich
Wahl, Jim Voyvodic); provide citations and a short summary >>
2.
MOTIVATION
<< Describe here why we are doing this (consistency), which may repeat what was said
before >>
 RSNA QIBA reference
 Goal of this paper: provide a framework for QIB assessment of biomarker reliability
o Define reliability
o Review study design considerations when evaluating a biomarker for use in a
clinical trial (endpoint, patient enrollment, etc.) → Estimability
(all statisticians should provide input)
o Mention algorithms that determine the measurement
 Claims: what can the biomarkers be used for? (Paul)
Once the QIB is measured through the application of a qualified and reliable algorithm,
the QIB itself must be evaluated for its ability to perform reliably enough to support
intelligent and informed decisions. The ability to assess this reliability is critical to
ensuring consistent quality of the QIB when it is used to measure a disease feature such
as size, biological activity, a pharmacodynamic parameter, or physiological function.
We will refer to these qualities as the technical performance of the QIB and will
specifically address the ability of the QIB to provide reliable measurements of change
between two different acquisitions. The problem can therefore be stated as: are the
measurements reliable enough to determine whether the QIB has changed due to some
factor such as treatment or time? The three metrology areas that most directly address
this overarching question of technical performance are:
• Linearity: the strength of the linear relationship of the biomarker to a known or
related standard reference; more simply stated, the ability of the QIB to
unambiguously measure what it is supposed to measure;
• Repeatability: the ability of the QIB to repeatedly and reliably measure the same
feature; that is, whether the variability of the imaging system, which may include
the patient, is small enough for the measurement to be used in decision making;
and
• Reproducibility: the ability of the QIB to be employed under the different
conditions that may be experienced in its use; that is, the ability of the QIB to
reliably obtain measurements under different study conditions, as might be
expected in a multiple-site clinical trial.
<< State that these concepts come from metrology and refer to the metrology
definitions >>
<< Refer to the algorithm performance for how these QIBs were obtained>>
3. OBJECTIVES
The objectives of the technology performance metrology group are to arrive at a
reasonable consensus among clinical, technology and statistical imaging experts to
establish the following:
• Performance metrics needed to measure and report the technical performance of a
QIB;
• Methodologies to arrive at those metrics; and
• Study designs and considerations to arrive at a meaningful and interpretable
assessment of the technical performance of a QIB.
The qualification of a QIB for use in a clinical trial will include a claim that specifies the
details of that use, including the population, the imaging modality, and possible
limitations <<ADD CLAIM ITEMS>>. Included within the evidence that the QIB is
qualified for the stated use is the specific evidence obtained for each of the technical
performance areas: bias/linearity, repeatability, and reproducibility.
4.
QIB TYPES AND PERFORMANCE PARAMETERS
The QIBs fall into XX general types. The method of measurement may be as simple as
electronic or physical calipers (e.g. length) or may be a complex measurement of a
functional parameter, requiring multiple images and a mathematical algorithm to derive.
These types of measurement are specified in Table 4.1.
Table 4.1 Types of Measurements

Measurement Type                         Imaging Needs                            Measurand
Structural Extent                        Single image                             V, L, A, D
Morphological and Texture Features       Single or multiple images                CIR, IR, MS, AF, SGLDM, FD, FT, EM
Functional Response                      Multiple images / time series            f(t), Ktrans, ROI(t)
Region of Interest Physical Properties   Multiple repeat images under different   ADC, BMD
                                         acquisition parameters
<< Contrast to Algorithm Table 1 >>
<< What more needs to be said? >>
5.
TECHNICAL PERFORMANCE ANALYSIS SETUP
Steps to set up for a technical performance analysis
a. Define QIB and relationship to a truth measurement.
b. Define the study question. Can be defined as the study hypothesis but most
often should be in the form of what is the primary interest of the study and the
context of use of the results
i. Will be directly translatable to the claim and profile
ii. Define the statistical hypotheses if applicable
iii. Define strata that are identified within the claim that will be either
used or tested in the study
c. Define the experimental unit
i. May not be the patient/subject depending on study question
ii. State Inference and reason. Should be consistent with the profile claim
specifications
1. ROI (eg lesion)
2. Patient
d. Define parameters to be measured
e. Design the study as applicable (Provide quick examples that don’t
necessarily need citations)
i. Sample Size and justification which will probably include hypotheses
criteria
ii. Data requirements
1. data range,
2. Number of repeats
3. Etc.
iii. Random v Fixed Effects
iv. Strata/blocks
f. Estimate parameters
i. Test against a null hypothesis if applicable
ii. All parameters need estimates of confidence
6.
BIAS AND LINEARITY
The ultimate goal of any QIB is to provide an unbiased estimate of the actual, true
physical or derived imaging measurement over the entire range of expected values
defined in the claim. For example, when measuring the volume of a solid tumor, the
measured volume should, allowing for random error, represent the actual volume of the
lesion, and should do so for all lesion shapes and sizes within the spectrum of lesions
expected under the claim.
A systematic bias is a difference from the true value that is caused by the imaging
measurement system and is not a function of the actual value. Conversely, a
non-systematic bias is a difference from the true value that depends on the actual value
in ways that cannot always be determined.
A linear response, as defined here, is a constant proportional relationship between the
actual true value (the measurand) and the QIB over the entire range of the measurement
(the measurement dynamic range).
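As a minimal numerical sketch of these two concepts (hypothetical phantom data; the function name is illustrative, not part of any standard), bias and linearity against a known measurand can be summarized with ordinary least-squares regression of the measured QIB on the true value:

```python
import numpy as np

def bias_linearity(true_vals, measured):
    """Summarize bias and linearity of a QIB against a known reference.

    Returns mean bias, OLS slope/intercept, and R^2 of measured vs. true.
    Illustrative only: a real assessment should also examine residuals
    for non-systematic, value-dependent bias.
    """
    t = np.asarray(true_vals, dtype=float)
    m = np.asarray(measured, dtype=float)
    bias = float(np.mean(m - t))              # mean systematic bias
    slope, intercept = np.polyfit(t, m, 1)    # linear response
    r2 = float(np.corrcoef(t, m)[0, 1] ** 2)  # strength of linear relationship
    return bias, float(slope), float(intercept), r2

# Hypothetical phantom data: true volumes (mL) vs. measured volumes
true_v = [1.0, 2.0, 4.0, 8.0, 16.0]
meas_v = [1.2, 2.1, 4.3, 8.2, 16.5]
bias, slope, intercept, r2 = bias_linearity(true_v, meas_v)
```

A slope near 1 with a small intercept and high R^2 indicates a linear, nearly proportional response; the mean bias quantifies the systematic offset.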


Definitions
o Bias
 Motivation: measure systematic bias; identify nonsystematic bias in the
new biomarker
 Definitions for measuring bias
 Consistent with Terminology and Algorithm
 Statistical and descriptive definitions
 Ground truth available
 Ground truth not available
o Linearity
 Motivation: measure the biomarker relationship to truth over the entire
range of claimed measurements
 Definitions for measuring linearity
 Range, dynamic range (?), lower and upper limits,
monotonicity, curvature
 Correlation, curvature
Statistical Methodologies to Assess Technical Linearity Performance
o Standard reference
 Methods
 Plots: X-Y Scatter,
 ???
o Imperfect reference
 Methods
 Plots: BA, X-Y scatter, …
 ???
o No reference
 Methods
 Plots: Bar, histogram, …
 ???
7.
REPEATABILITY
Repeated scans over different time intervals provide complementary information,
encompassing different sources of variability. For example, sequential repeats within the
same scanning session capture effects including scanner adjustments and finite SNR on
measurement variability. If the subject is taken out and repositioned, additional
variability due to slight differences in subject positioning will also be captured. Longer
intervals between repeat scans (e.g., days, weeks or months) will be subject to the
aforementioned effects as well as possible scanner performance drift/change and
physiological variation (or disease progression if long enough) over that time period.
The basic repeatability of the measurement per se, in the presumed absence of
physiological change (i.e., short time repeat) is critical to know but might not be
sufficient if the study protocol has treatment superimposed on longer term repeat scans.
In this case, the contribution of variance due both to the imaging measure per se and the
physiological variability will impact study design and interpretation.
Read-reread, or analysis-reanalysis, is especially critical in the case of any subjective
human input (cf. radiological reads) or stochastic elements to computational image
analysis algorithms. Analysis algorithms that are completely deterministic should by
definition not be subject to analysis-reanalysis variability: the same input data should lead
to the same numeric output.
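For paired test-retest (scan-rescan) data with two replicates per subject and no intervening change, the usual repeatability summaries follow directly from the within-subject variance. A minimal sketch with hypothetical data (illustrative function name; it assumes roughly constant variance across the measurement range):

```python
import numpy as np

def repeatability_summary(scan1, scan2):
    """Within-subject SD, within-subject CV, and repeatability
    coefficient (RC = 1.96 * sqrt(2) * wSD) from paired test-retest data."""
    a = np.asarray(scan1, dtype=float)
    b = np.asarray(scan2, dtype=float)
    d = a - b
    wvar = float(np.mean(d ** 2) / 2.0)        # within-subject variance (2 replicates)
    wsd = wvar ** 0.5
    wcv = wsd / float(np.mean((a + b) / 2.0))  # within-subject coefficient of variation
    rc = 1.96 * (2.0 ** 0.5) * wsd             # 95% repeatability coefficient
    return wsd, wcv, rc

# Hypothetical scan-rescan measurements for four subjects
scan1 = [5.0, 8.0, 12.0, 20.0]
scan2 = [5.4, 7.6, 12.6, 19.2]
wsd, wcv, rc = repeatability_summary(scan1, scan2)
```

The RC is interpretable as the change magnitude expected to be exceeded by only about 5% of repeat measurements in the absence of true change; if the CV rather than the SD is constant across the range, a log transformation is typically applied first.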
o Definitions
 VIM Definition (Nick, Mary, Lisa, Marina)
 Practical definition(s) that address concerns by AJS
 Within-Patient v within-(patient+reviewer+…)
o Statistical Methodologies to Assess Technical Repeatability Performance
 Scan-Rescan
 Test-Retest in all of its varieties.
 Read-Reread (maybe)
 Plots: BA with LOA;
 Point this toward the desired claim
o Steps to set up for a technical performance analysis
 Define QIB and repeatability conditions and source of variance to be
measured.
 Within-(patient+reader+instrument)
 Within-(patient+reader)+between-patient
 Etc.
 Define the study question much as before. Should directly relate to the
desired inference (eg. instrument, patient, patient+instrument, etc).
What is being repeated?
 Will be directly translatable to the claim and profile
 Define the statistical hypotheses if applicable
 Strata are generally not included here but may be as part of the
design to evaluate different repeatability
 Define experimental unit (related to Study question)
 May not be the patient/subject depending on study question
(eg. individual lesion)
 State Inference and reason. Should be consistent with the
profile claim specifications
o ROI (eg lesion)
o Patient
 Define parameters to be measured
 Variance and/or variance matrix
 CV
 ICC, CCC


 RC
 LOA

8.
REPRODUCIBILITY
For the QIB to be reliable for use under different conditions, its performance under those
conditions must be assessed. The conditions of a clinical trial may involve different
patient populations, scanners, instrument technicians, image raters/reviewers, scanning
conditions, and many other factors that may be specific to a modality.
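One common summary of reproducibility across such conditions is the intraclass correlation coefficient (ICC). A minimal one-way random-effects sketch (hypothetical data and illustrative function name; real analyses often use two-way models when readers or scanners are of individual interest):

```python
import numpy as np

def icc_oneway(data):
    """One-way random-effects ICC(1,1): rows = subjects,
    columns = repeated measurements under different conditions
    (e.g., scanners or readers, treated as interchangeable)."""
    x = np.asarray(data, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    msb = k * np.sum((row_means - grand) ** 2) / (n - 1)         # between-subject MS
    msw = np.sum((x - row_means[:, None]) ** 2) / (n * (k - 1))  # within-subject MS
    return float((msb - msw) / (msb + (k - 1) * msw))

# Hypothetical: 4 subjects, each measured on 3 different scanners
data = [[10.0, 10.4,  9.8],
        [14.1, 13.8, 14.5],
        [ 8.2,  8.5,  8.0],
        [12.3, 12.0, 12.6]]
icc = icc_oneway(data)
```

An ICC near 1 indicates that between-subject differences dominate the condition-to-condition variability, i.e., the QIB distinguishes subjects reliably across the reproducibility conditions.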
<< Convert to text:
Do reliability concepts and analysis methods differ for different uses of QIBs?
 Screening & diagnosis
 Risk stratification
 Early response to therapy
 Surrogate endpoint of clinical response to therapy
Do analysis methods differ for different aspects of RR? (this is the part we'd expected to
do, not fleshed out yet)
 Different readers, same image analysis approach
 Different readers, different image analysis approach
 Different scanner manufacturer
 Different reconstruction method on the same scanner manufacturer
 Reproducibility of patient preparation during the same disease state
 Phantom studies may have a gold standard; others will not.
 Analysis methods for big versus small perturbations?
 Analysis methods for comparing versions of the QIB versus estimating LOA
>>
o Definitions
 VIM / ASTM / …
 Practical definition
 Considerations that determine what the reproducible factors are
 Should tie back to profile specifications
o Steps to set up for a technical performance analysis
 Define QIB
 Define the study question. Can be defined as the study hypothesis but
most often should be in the form of what is the primary interest of the
study and the context of use of the results
 Will be directly translatable to the claim and profile
 Define the statistical hypotheses if applicable



 Strata are generally the question of interest, unlike repeatability
(e.g. scanner, country, populations, diseases, etc.)
 How are strata compared? Means, variances, or both?
 How do we separate differences in patient samples from
differences in reproducibility factors? This section needs some
thought.
 Define the experimental unit
 Site, reader, scanner, country, etc.
 Brief explanation
 Define parameters to be measured
 Means
 Variance
 ICC
 RC
 LOA
 Significance covariate
 Examples
9.
STUDY DESIGN
For each of the above performance metrics to represent the true performance of the QIB
under the profile, the experiment to collect the images and measure the data must be
carefully and systematically designed, with the study question in mind and specific
endpoints defined. Too often, published reports of QIB performance fall victim to
haphazard designs that limit the conclusions to far short of the desired goal, or even
produce misleading results with no real utility for designing the next trial.
Study designs that hold one factor constant, or limit it to a small set of conditions, while
varying the others characterize performance under the conditions dictated by that
constant factor, but they cannot generalize to a broader set of conditions. Conversely,
study designs that spread the available study subjects or factors over too wide a range
may not provide enough information to draw any conclusions.
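One way to make this trade-off concrete for a repeatability study: the precision of the estimated within-subject SD depends on the within-subject degrees of freedom, df = n(m-1) for n subjects with m replicates each, so designs with the same total scan budget can differ in precision. A rough sketch using the standard large-sample approximation SE(sigma_hat)/sigma ≈ 1/sqrt(2·df) (illustrative function name):

```python
def wsd_relative_se(n_subjects, n_replicates):
    """Approximate relative standard error of an estimated within-subject SD:
    SE(sigma_hat)/sigma ~ 1/sqrt(2*df), where df = n*(m-1) is the
    within-subject degrees of freedom (large-sample approximation)."""
    df = n_subjects * (n_replicates - 1)
    return 1.0 / (2.0 * df) ** 0.5

# Two candidate designs with the same 40-scan budget:
se_a = wsd_relative_se(20, 2)  # 20 subjects x 2 scans -> df = 20
se_b = wsd_relative_se(10, 4)  # 10 subjects x 4 scans -> df = 30
```

Concentrating replicates in fewer subjects tightens the within-subject SD estimate, but it narrows the population over which repeatability is characterized, echoing the design tension described above.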
<< 2-3 examples from the literature >>
9.1. LINEARITY
9.2. REPEATABILITY
9.3. REPRODUCIBILITY
10.
CLAIMS AND PROFILES
11.
DISCUSSION AND FUTURE DIRECTIONS