QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 QI-Bench: Informatics Services for Characterizing Performance of Quantitative Medical Imaging System Use Case August 16, 2011 Rev 1.0 Required Approvals: Author of this Revision: Andrew J. Buckler System Architect: Andrew J. Buckler Print Name Signature Date Document Revisions: Revision BBMSC Revised By Reason for Update 0.1 Andrew J. Buckler Stand-alone doc from EUC 1.0 Andrew J. Buckler Updated to open Phase 4 Date June 15, 2011 August 16, 2011 1 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 Table of Contents 1. INTRODUCTION ...................................................................................................................................................3 1.1. PURPOSE & SCOPE ...............................................................................................................................................3 1.2. INVESTIGATORS, COLLABORATORS, AND ACKNOWLEDGEMENTS........................................................................3 1.3. TERMS USED IN THIS DOCUMENT ........................................................................................................................4 2. DESIGN OVERVIEW ............................................................................................................................................5 3. USE CASES .............................................................................................................................................................7 3.1. CREATE AND MANAGE SEMANTIC INFRASTRUCTURE AND LINKED DATA ARCHIVES..........................................8 3.1.1. Define, Extend, and Disseminate Ontologies, Vocabularies, and Templates ..............................................9 3.1.2. Install and Configure Linked Data Archive Systems ................................................................................. 10 3.1.3. Create and Manage User Accounts, Roles, and Permissions .................................................................... 11 3.1.4. Query and Retrieve Data from Linked Data Archive................................................................................. 12 3.2. CREATE AND MANAGE PHYSICAL AND DIGITAL REFERENCE OBJECTS ............................................................. 12 3.2.1. Develop Physical and/or Digital Phantom(s) ............................................................................................ 13 3.2.2. Import Data from Experimental Cohort to form Reference Data Set ........................................................ 14 3.2.3. Create Ground Truth Annotations and/or Manual Seed Points in Reference Data Set ............................. 16 3.3. CORE ACTIVITIES FOR BIOMARKER DEVELOPMENT .......................................................................................... 17 3.3.1. Set up an Experimental Run....................................................................................................................... 17 3.3.2. Execute an Experimental Run .................................................................................................................... 19 3.3.3. Analyze an Experimental Run .................................................................................................................... 21 3.4. COLLABORATIVE ACTIVITIES TO STANDARDIZE AND/OR OPTIMIZE THE BIOMARKER ....................................... 22 3.4.1. 
Validate Biomarker in Single Center or Otherwise Limited Conditions.................................................... 23 3.4.2. Team Optimizes Biomarker Using One or More Tests .............................................................................. 25 3.4.3. Support “Open Science” Publication model ............................................................................................. 27 3.5. CONSORTIUM ESTABLISHES CLINICAL UTILITY / EFFICACY OF PUTATIVE BIOMARKER .................................... 28 3.5.1. Measure Correlation of Imaging Biomarkers with Clinical Endpoints ..................................................... 29 3.5.2. Comparative Evaluation vs. Gold Standards or Otherwise Accepted Biomarkers .................................... 30 3.5.3. Formal Registration of Data for Qualification .......................................................................................... 32 3.6. COMMERCIAL SPONSOR PREPARES DEVICE / TEST FOR MARKET ...................................................................... 34 3.6.1. Organizations Issue “Challenge Problems” to Spur Innovation ............................................................... 34 3.6.2. Compliance / Proficiency Testing of Candidate Implementations ............................................................. 36 3.6.3. Formal Registration of Data for Approval or Clearance .......................................................................... 37 4. REFERENCES ...................................................................................................................................................... 39 BBMSC 2 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 1. Introduction 1.1. Purpose & Scope Quantitative results from imaging methods have the potential to be used as biomarkers in both routine clinical care and in clinical trials, in accordance with the widely accepted NIH Consensus Conference definition of a biomarker.1 In particular, when used as biomarkers in therapeutic trials, imaging methods have the potential to speed the development of new products to improve patient care. 2,3 Imaging biomarkers are developed for use in the clinical care of patients and in the conduct of clinical trials of therapy. In clinical practice, imaging biomarkers are intended to (a) detect and characterize disease, before, during or after a course of therapy, and (b) predict the course of disease, with or without therapy. In clinical research, imaging biomarkers are intended to be used in defining endpoints of clinical trials. A precondition for the adoption of the biomarker for use in either setting is the demonstration of the ability to standardize the biomarker across imaging devices and clinical centers and the assessment of the biomarker’s safety and efficacy. Although qualitative biomarkers can be useful, the medical community currently emphasizes the need for objective, ideally quantitative, biomarkers. “Biomarker” refers to the measurement derived from an imaging method, and “device” or “test” refers to the hardware/software used to generate the image and extract the measurement. Regulatory approval for clinical use4 and regulatory qualification for research use depend on demonstrating proof of performance relative to the intended application of the biomarker: In a defined patient population, For a specific biological phenomenon associated with a known disease state, With evidence in large patient populations, and Externally validated. 
This document describes public resources for methods and services that may be used for the assessment of imaging biomarkers that are needed to advance the field. It sets out the workflows that are derived from the problem space and the goal for these informatics services as described in the Basic Story Board.

1.2. Investigators, Collaborators, and Acknowledgements

Buckler Biomedical Associates LLC
Kitware, Inc.

In collaboration with:
Information Technology Laboratory (ITL) of the National Institute of Standards and Technology (NIST)
Quantitative Imaging Biomarker Alliance (QIBA)
Imaging Workspace of caBIG

It is also important to acknowledge the many specific individuals who have contributed to the development of these ideas. A subset of the most significant includes Dan Sullivan, Constantine Gatsonis, Dave Raunig, Georgia Tourassi, Howard Higley, Joe Chen, Rich Wahl, Richard Frank, David Mozley, Larry Schwartz, Jim Mulshine, Nick Petrick, Ying Tang, Mia Levy, Bob Schwanke, and many others <if you do not see your name, please do not hesitate to raise the issue as it is our express intent to have this viewed as an inclusive team effort and certainly not only the work of the direct investigators.>

1.3. Terms Used in This Document

The following are terms commonly used that may be of assistance to the reader.

BAM: Business Architecture Model
BRIDG: Biomedical Research Integrated Domain Group
BSB: Basic Story Board
caB2B: Cancer Bench-to-Bedside
caBIG: Cancer Biomedical Informatics Grid
CAD: Computer-Aided Diagnosis
caDSR: Cancer Data Standards Registry and Repository
CBER: Center for Biologics Evaluation and Research
CD: Compact Disc
CDDS: Clinical Decision Support Systems
CDER: Center for Drug Evaluation and Research
CDISC: Clinical Data Interchange Standards Consortium
CIOMS: Council for International Organizations of Medical Sciences
CIRB: Central institutional review board
Clinical management: The care of individual patients, whether they be enrolled in clinical trial(s) or not
Clinical trial: A regulatory directed activity to prove a testable hypothesis for a determined purpose
CT: Computed Tomography
DAM: Domain Analysis Model
DICOM: Digital Imaging and Communication in Medicine
DNA: Deoxyribonucleic Acid
DSMB: Data Safety Monitoring Board
ECCF: Enterprise Conformance and Compliance Framework
eCRF: Electronic Case Report Form
EKG: Electrocardiogram
EMR: Electronic Medical Records
EUC: Enterprise Use Case
EVS: Enterprise Vocabulary Services
FDA: Food and Drug Administration
FDG: Fluorodeoxyglucose
HL7: Health Level Seven
IBC: Institutional Biosafety Committee
IBE: Institute of Biological Engineering
IOTF: Interagency Oncology Task Force
IQ: Image Query
IRB: Institutional Review Board
IVD: in-vitro diagnosis
MHRA: Medicines and Healthcare Products Regulatory Agency
MRI: Magnetic Resonance Imaging
NCI: National Cancer Institute
NIBIB: National Institute of Biomedical Imaging and Engineering
NIH: National Institutes of Health
NLM: National Library of Medicine
Nuisance variable: A random variable that decreases the statistical power while adding no information of itself
Observation: The act of recognizing and noting a fact or occurrence
PACS: Picture Archiving and Communication System
PET: Positron Emission Tomography
Pharma: Pharmaceutical companies
Phenotype: The observable physical or biochemical characteristics of an organism, as determined by both genetic makeup and environmental influences.5
PI: Principal Investigator
PRO: Patient Reported Outcomes
QA: Quality Assurance
QC: Quality Control
RMA: Robust Multi-array Average
RNA: Ribonucleic Acid
SDTM: Study Data Tabulation Model
SEP: Surrogate End Point
SNOMED CT: Systematized Nomenclature of Medicine – Clinical Terms
SOA: Service-Oriented Architecture
SUC: System Use Case
Surrogate endpoint: In clinical trials, a measure of effect of a certain treatment that correlates with a real clinical endpoint but does not necessarily have a guaranteed relationship.6
Tx: Treatment
UMLS: Unified Medical Language System
US: Ultrasound
VCDE: Vocabularies & Common Data Elements
VEGF: vascular endothelial growth factor
WHO: World Health Organization
XIP: eXtensible Imaging Platform
XML: Extensible Markup Language

2. Design Overview

The workflow for statistical validation implemented by QI-Bench is shown in Figure 1. [Figure 1 diagram: UPICT Protocols, QIBA Profiles, literature papers, and other sources are entered with a Ruby on Rails web service into the Quantitative Imaging Specification Language (QIBO); the Reference Data Set Manager holds Reference Data Sets, Annotations, and Analysis Results; the Batch Analysis Service runs BatchMake batch analysis scripts against sources of clinical study results; the output is a Clinical Body of Evidence formatted as SDTM and/or other standardized registrations; red edges represent biostatistical generalizability.]

Figure 1: QI-Bench Semantic Workflow

This functionality is decomposed into the following free-standing, but linked, applications (Figure 2).

Specify: Specify the context for use and assay methods, using consensus terms in doing so (QIBO, RadLex/SNOMED/NCIt); built using Ruby on Rails.
Formulate: Assemble applicable reference data sets, including both imaging and non-imaging clinical data (caB2B, NBIA, PODS data elements, DICOM query tools).
Execute: Compose and iterate batch analyses on reference data, accumulating quantitative read-outs for analysis (MIDAS, BatchMake, Condor Grid); built using CakePHP.
Analyze: Characterize the method relative to intended use, applying the existing tools and/or extending them (MVT portion of AVT, reusable library of R scripts).
Package: Compile evidence for regulatory filings, using standards in transfer to regulatory agencies (SDTM standard of CDISC into repositories like FDA's Janus).

Figure 2: QI-Bench Applications
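To give a sense of how these applications hand off to one another, the following is a minimal sketch of the Specify, Formulate, Execute, Analyze, and Package flow expressed as plain functions. The function names and data structures are hypothetical placeholders for this sketch, not QI-Bench interfaces; it only illustrates the direction of data flow between the stages described above.

# Hypothetical sketch of the Specify -> Formulate -> Execute -> Analyze -> Package flow.
# None of these names are real QI-Bench APIs; they only illustrate the decomposition.

def specify(disease, measurand, terms):
    """Capture the context for use and assay method using consensus terms."""
    return {"disease": disease, "measurand": measurand, "terms": terms}

def formulate(spec, linked_archive):
    """Assemble a reference data set matching the specification."""
    return [case for case in linked_archive if case["disease"] == spec["disease"]]

def execute(reference_data_set, algorithm):
    """Run the candidate implementation over each case; accumulate read-outs."""
    return [algorithm(case) for case in reference_data_set]

def analyze(read_outs, ground_truth):
    """Characterize performance relative to intended use (here, mean signed error)."""
    errors = [r - t for r, t in zip(read_outs, ground_truth)]
    return {"n": len(errors), "bias": sum(errors) / len(errors)}

def package(analysis, path):
    """Write the evidence in a standardized, reviewable form for a regulatory filing."""
    with open(path, "w") as f:
        f.write(repr(analysis))

# Example use with toy data:
archive = [{"disease": "NSCLC", "volume_mm3": 980.0, "truth_mm3": 1000.0}]
spec = specify("NSCLC", "tumor volume change", terms={"QIBO": "volume"})
cases = formulate(spec, archive)
results = execute(cases, lambda c: c["volume_mm3"])
package(analyze(results, [c["truth_mm3"] for c in cases]), "evidence.txt")

In practice each stage is a free-standing application with its own services; the sketch only shows how a specification feeds data-set formulation, batch execution, analysis, and packaging.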
QI-Bench may be deployed at various scopes and configurations:

Figure 3: QI-Bench Web Deployment

3. Use Cases

The following sections describe the principal workflows which have been identified. The sequence in which they are presented is set up to draw attention to the fact that each category of workflows builds on the others. As such, it forms a rough chronology of what users do with a given biomarker over time, and may also be useful in guiding the design so that it may be implemented and staged efficiently (Figure 4).

Figure 4: Relationship between workflow categories, illustrating the progressive nature of the activities they describe and suggesting possible means for efficient implementation and staging.
3.1. Create and Manage Semantic Infrastructure and Linked Data Archives

Scientific research in the medical imaging field involves interdisciplinary teams, in general performing separate but related tasks from acquiring datasets to analyzing the processing results. Collaborative activity requires that these be defined and implemented with sophisticated infrastructure that ensures interoperability and security. The number and size of the datasets continue to increase every year due to advancements in the field. In order to streamline the management of images coming from clinical scanners, hospitals rely on picture archiving and communication systems (PACS). In general, however, research teams can rarely access PACS located in hospitals due to security restrictions and confidentiality agreements. Furthermore, PACS have become increasingly complex and often do not fit in the scientific research pipeline.

The workflows associated with this enterprise use case utilize a "Linked Image Archive" for long-term storage of images and clinical data and a "Reference Data Set Manager" to allow creation and use of working sets of data used for specific purposes according to specified experimental runs or analyses. As such, the Reference Data Set is a selected subset of what is available in the Linked Data Archive (Figure 5).

Figure 5: Use Case Model for Create and Manage Reference Data Set (architectural view)

As in the case with the categories as a whole, individual workflows are generally understood as building on each other (Figure 6).

Figure 6: Workflows are presented to highlight how they build on each other.

The following sections describe the workflows to establish and manage the Linked Data Archive and Reference Data Sets.

3.1.1. Define, Extend, and Disseminate Ontologies, Vocabularies, and Templates

Means to structure and define lexicons, vocabularies, and interoperable architectures in support of the creation and integration of Linked Data Archives, Reference Data Sets, Image Annotation Tool templates, etc. are needed to establish common grammars and semantics and provide for external interfaces. For example, users and clinicians need to record different image metadata (quantitative and semantic), necessitating a different "template" to be filled out while evaluating the image. Means for extending Annotation Tool templates so that new experiments can be supported without programming on the application (workstation) side are required. Once a template is created, any workstation can simply select it to enable creation of a Reference Data Set per the template. This will be crucial when people start a new cancer study. A platform-independent image metadata template editor is needed to enable the community to define custom templates for research and clinical reporting. There will ultimately be a tie-in to eCRF.

PRE-CONDITIONS
Relationship to external knowledge sources (e.g., ontologies) is understood and links are made.
Information model for our services has been established.

FLOW OF EVENTS
1. Scientist identifies novel measurement, observation, processing step, or statistical analysis method.
2. Scientist defines new data structures.
3. Semantically defined interoperable data are available on the grid.
4. Grid users are able to discover and integrate those data without coding.
5. A new user can now open the same study in a viewer and, when annotating, see the new annotations and markups associated with the algorithm.

POST-CONDITIONS
Controlled vocabulary and semantic infrastructure is defined.
Semantics are maintained.
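As an illustration of what a platform-independent annotation template might look like, the sketch below defines a small template whose fields are bound to controlled-vocabulary codes and validates an annotation against it. The template structure, field names, and codes are hypothetical placeholders for this sketch, not an existing QI-Bench or AIM schema.

# Hypothetical annotation template bound to controlled-vocabulary codes.
# Field names and codes are illustrative only, not an actual QI-Bench or AIM schema.

LESION_VOLUME_TEMPLATE = {
    "template_id": "example.lesion.volume.v1",        # assumed identifier scheme
    "fields": [
        {"name": "anatomic_site", "type": "coded",
         "allowed_codes": ["RID1301", "RID1326"]},     # placeholder RadLex-style codes
        {"name": "lesion_volume_mm3", "type": "number", "minimum": 0.0},
        {"name": "margin", "type": "coded",
         "allowed_codes": ["smooth", "spiculated"]},   # placeholder value set
    ],
}

def validate(annotation, template):
    """Return a list of problems found when checking an annotation against a template."""
    problems = []
    for field in template["fields"]:
        value = annotation.get(field["name"])
        if value is None:
            problems.append("missing field: " + field["name"])
        elif field["type"] == "coded" and value not in field["allowed_codes"]:
            problems.append("value not in controlled vocabulary: " + field["name"])
        elif field["type"] == "number" and value < field.get("minimum", float("-inf")):
            problems.append("value out of range: " + field["name"])
    return problems

# A workstation could fetch such a template and validate reader input before upload:
print(validate({"anatomic_site": "RID1301", "lesion_volume_mm3": 523.6,
                "margin": "smooth"}, LESION_VOLUME_TEMPLATE))   # -> []

Because the template is plain data rather than workstation code, a new reading task can be supported by publishing a new template rather than by reprogramming the application side, which is the intent of this workflow.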
3.1.2. Install and Configure Linked Data Archive Systems

Imaging biomarker developers have a critical need to work with as large and diverse a body of data as possible, as early as possible in the development cycle and then subsequently throughout it. This spans a wide range of potentially useful imaging datasets, including synthetic and real clinical scans of phantoms and clinical imaging datasets of patients with and without the disease/condition being measured. It is also important to have sufficient metadata (i.e., additional clinical information) to develop the algorithms and obtain early indications of full algorithm or algorithm component performance. All data and metadata should be stored in an electronic format that is easy to manipulate.

The community will subdivide obtained data into a subset that is used for development (completely viewable) and another which is accessible through defined services (rather than completely viewable) used to assess algorithm performance. Identification of two subsets of data that are similar in characteristics allows the needs of the diverse stakeholders to be met. Further sub-division may be done by users (for example, within the development team, they may choose to create subdivisions of the available training data into "true" training vs. a sub-group which is set aside for use in internal testing). However, the community as a whole will sequester a portion of the available data as the means for a publicly recognized testing capability. This latter use case is further developed as support for a publicly trusted honest broker to sequester test sets and apply them within a performance evaluation regime that may be used for a variety of purposes, as described later in the set of use cases.

PRE-CONDITIONS
Data exists from some sources that are representative of the clinical contexts for use and that represent to some level (even if not perfectly) the target patient population.
Data is de-identified to remove personal health information (PHI).
To the extent available, data is linked (or capable of being linked) across imaging and non-imaging data elements.
Policy is set for what data goes into the development set vs. that which is sequestered.

WORK FLOW
1. Create a network of federated databases.
.1 We need to collect as much information on the provenance (at least origin, ideally transformations at all stages of processing) of the data as possible. Users of the data should be able to put together an independent set of data with the same characteristics and end up with the same results when analyzed quantitatively.
2. Implement a secure protocol that seamlessly queries and retrieves interconnected instances. This "mesh" will work as a communication network in which each node acts as a messaging relay (see the sketch at the end of this section).
.1 If a client requires a dataset not present on the local server, it will automatically broadcast a query/retrieve message to the connected nodes, which will either relay the message or send the actual dataset if it is stored there.
3. For the development set:
.1 Provide access to public data with no security, i.e., with a guest login.
.2 Provide a mechanism to set up purpose-built collections.
.3 Provide a mechanism to create access lists for verifying whether a user should be allowed visibility to a collection.
.4 Allow the user to add data to a data basket (see query types in "Query and Retrieve Data from Linked Data Archive" below).
4. Store sequestered data and make sure that only trusted entities can access the data stored in the database.
.1 This sequestered data will be exclusively accessed by trusted sites that might be located remotely. In order to ensure that the site is trusted we will encrypt and sign the data using public-key cryptography. This mechanism is highly secure and ensures not only the integrity of the data transferred but also that only the intended recipient can open the data. Furthermore, the transfer channel between the two sites will be encrypted using secure socket layer (SSL) encryption for further security.
.2 Provide a set of services (but not the direct data) to users.
.3 Provide means and policy to refresh sequestered test sets.

POST-CONDITIONS
Appropriate access controls are in place and understood.
Formal data sharing agreements are not required or are in place for exploratory cross-center analysis.
Data is in the archive and may be queried and read out.
Informatics services are defined and available for users to access the data sets, directly or indirectly as given in the sequestering policy.
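The following is a minimal sketch of the query/retrieve relay behavior described in work flow step 2, under the assumption that each archive node keeps a local index and a list of peers. It is illustrative only and omits the authentication, encryption, and de-identification controls this section requires.

# Illustrative federated query relay: each node answers from its local index when it
# can, otherwise forwards the request to its peers. A real deployment would add the
# access-list checks, public-key signing, and SSL transport described above.

class ArchiveNode:
    def __init__(self, name, local_index):
        self.name = name
        self.local_index = local_index   # dataset_id -> dataset (stand-in for an archive)
        self.peers = []                  # other ArchiveNode instances in the mesh

    def query_retrieve(self, dataset_id, visited=None):
        """Return (dataset, source node name) or (None, None) if no node holds it."""
        visited = visited or set()
        if self.name in visited:
            return None, None            # avoid cycles in the mesh
        visited.add(self.name)
        if dataset_id in self.local_index:
            return self.local_index[dataset_id], self.name
        for peer in self.peers:          # relay the message to connected nodes
            dataset, source = peer.query_retrieve(dataset_id, visited)
            if dataset is not None:
                return dataset, source
        return None, None

# Example: two linked archives; a query issued at site A is satisfied by site B.
site_a = ArchiveNode("A", {"case-001": {"modality": "CT"}})
site_b = ArchiveNode("B", {"case-042": {"modality": "PET"}})
site_a.peers = [site_b]
print(site_a.query_retrieve("case-042"))   # ({'modality': 'PET'}, 'B')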
3.1.3. Create and Manage User Accounts, Roles, and Permissions

Figure 7: Use Case Model for Create and Manage Reference Data Set (activity relationships)

PRE-CONDITIONS
Architectural elements are implemented and capable of being maintained.

WORK FLOW
IT staff use the development and customization interface to prepare the tool support for the study according to the tool adaptation specifications. This includes, at a minimum:
customizing the Image Annotation Tool representation of annotations,
customizing the de-identification tools,
customizing the annotation tool,
customizing the statistical analysis tool,
configuring the internal databases,
installing the tools and data at all the sites where the team members work, and
adapting the clinical trial management system interface.

POST-CONDITIONS
Operation may proceed.

3.1.4. Query and Retrieve Data from Linked Data Archive

Once digital media are archived in the system, they are made accessible to other research units and several methods of transfer are available: either via the internet or via file sharing. The stored digital content can be searched, visualized and downloaded directly from the system.

PRE-CONDITIONS
Analysis goal has been defined.
Type of query and any necessary parameters are determined.

FLOW OF EVENTS
1. If not already done, establish appropriate data elements and concepts to represent the query grammar by following workflow "Extend and Disseminate Ontologies, Templates, and Vocabularies."
2. If not already done, establish a base of data from which the query will be satisfied by following workflow "Install and Configure Linked Data Archive Systems."
3. User is looking for examples of annotated images of a certain study.
4. User seeks annotated examples of radiology images for a specific tissue or organ.
5. Rare event detection (e.g., rare lesions) – the computer triages images and identifies images that require secondary review.
6. User is looking for examples of annotated images of a certain phenotype.
7.
Content based image retrieval (content-based image retrieval implies support by computer methods for complex queries performing potentially dynamic image classification according to image annotations.) POST-CONDITIONS Required data have been defined and collected. 3.2. Create and Manage Physical and Digital Reference Objects The first and critical building block in the successful implementation of quantitative imaging biomarkers is to establish the quality of the physical measurements involved in the process. The technical quality of imaging biomarkers is assessed with respect to the accuracy and precision of the related physical measurement(s). The next stage is to establish clinical utility (e.g., by sensitivity and specificity) in a BBMSC 12 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 defined clinical context of use. Consequently, NIST-traceable materials and objects are required to meet the measurement needs, guidelines and benchmarks. Appropriate reference objects (phantoms) for the technical proficiency studies with respect to accuracy and precision, and well-curated and characterized clinical Reference Data Sets with respect to sensitivity and specificity must be explicitly identified. Individual workflows are generally understood as building on each other (Fig. 11). Figure 8: Workflows are presented to highlight how they build on each other. The following sections describe the workflows to Create and Manage Physical and Digital Reference Objects. 3.2.1. Develop Physical and/or Digital Phantom(s) Imaging phantoms, or simply "phantoms", are specially designed objects that are scanned or imaged in the field of medical imaging to evaluate, analyze, and tune the performance of various imaging devices. These objects are more readily available and provide more consistent results than the use of a living subject or cadaver, and likewise avoid subjecting a living subject to direct risk. Phantoms were originally employed for use in 2D x-ray based imaging techniques such as radiography or fluoroscopy, though more recently phantoms with desired imaging characteristics have been developed for 3D techniques such as MRI, CT, Ultrasound, PET, and other imaging methods or modalities. A phantom used to evaluate an imaging device should respond in a similar manner to how human tissues and organs would act in that specific imaging modality. For instance, phantoms made for 2D radiography may hold various quantities of x-ray contrast agents with similar x-ray absorbing properties to normal tissue to tune the contrast of the imaging device or modulate the patients exposure to radiation. In such a case, the radiography phantom would not necessarily need to have similar textures and mechanical properties since these are not relevant in x-ray imaging modalities. However, in the case of ultrasonography, a phantom with similar rheological and ultrasound scattering properties to real tissue would be essential, but x-ray absorbing properties would not be needed.7 PRE-CONDITIONS Characteristics of the targeted biology are known. Interaction between biology and the intended imaging process are sufficiently understood to discern what is needed to mimic expected behavior. WORK FLOW 1. Acquire or develop reference object(s) and other support material(s) for controlled experimentation and ongoing quality control (QC). 2. Describe phantom and other controlled condition support material for “stand-alone” assessment and required initial and ongoing quality control specifics. 3. 
Address issues including manufacturing process, shelf life, physical standards traceability, and shipping. 4. Implement comprehensive QC program, including the process for initial acceptance of an imaging device as well as required ongoing QC procedures, data analysis, and reporting requirements. POST-CONDITIONS BBMSC 13 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 Phantoms exist and are capable of being deployed at centers and/or shipped as part of proficiency testing protocols. 3.2.2. Import Data from Experimental Cohort to form Reference Data Set This Use Case includes the activities necessary for defining and assembling analysis datasets that draw from the totality of data stored in the Linked Data Archive. Typically, the datasets are generated for exploratory queries used in the presentation of comparative data. The Reference Data Set Manager integrates with grid computing environments to facilitate its integration into the different research groups through a universal interface to several distributed computing environments. It is understood that this interface is provided through the open-source tool aimed for batch processing called the Batch Analysis Service. In this workflow, we utilize Reference Data Set Manager, a web-based digital repository tuned for medical and scientific datasets that provides a flexible data management facility, a search engine, and an online image viewer. The Reference Data Set Manager allows users to store, manage and share scientific datasets, from a web browser or through a generic programming interface, thereby facilitating the dissemination of valuable imaging datasets to research collaborators. By way of example, a core set of data needed to develop a cancer therapy assessment algorithm is: 1. A set of images with known ground truth (e.g. anthropomorphic phantom). .1 This dataset would ideally consist of real or simulated scans of a Reference Data Set of physical objects with known volumetric characteristics. .2 A range of object sizes, densities, shapes, and lung attachment scenarios should be represented. .3 A range of image acquisition characteristics should be obtained including variation in slice thickness, tube current, and reconstruction kernel. .4 Metadata must contain the location and volumetric characteristics of all objects and any additional information on their surrounding or adjacent environment. 2. A set of clinical images where outcome has been determined. .1 This dataset would ideally consist of longitudinal scans of a large and diverse Reference Data Set of patients using many different image acquisition devices and image acquisition parameters. .2 The location and volumetric assessment of all lesions within each longitudinal acquisition must be established by an independent method, such as the assessment of multiple expert readers. This should include the localization and volumetric estimation of new lesions. .3 Metadata should at a minimum contain the location and independent volumetric assessment of all lesions, including the location of new lesions. Additional information on the variance of the independent volumetric assessment should also be available. .4 Additional metadata, such as the clinical characteristics of the patients (e.g. age, gender), classification of cancer (e.g. small cell) and lesion types (e.g. solid, non-solid), lesion attachment scenarios (e.g. 
lung pleura, major vessels), and cancer therapy approach, magnitude and duration, would also be useful to algorithm developers as they determine the strengths and weaknesses of different algorithmic methods. PRE-CONDITIONS Experimental question is defined. Determine from what linked data archive the Reference Data Set will be assembled. WORK FLOW The activity consists of sub-activities, with data flows between them. BBMSC 14 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 1. Determine inclusion / exclusion criteria that determine whether a particular patient, series, or tumor will be included in the study: General criteria may include: Case closed? N usable follow-up series? Patient outcome known? summary assessment of progression available? Image criteria may include: Acquisition parameters, Image quality assessment. Tumor criteria may include: type of tumor, grade of tumor, genomic information of the tumor, proteomic information of the tumor. Demographic information may include: Age – at the time of image acquisition, Sex. Clinical criteria might include: Excision findings? Survival, Pathology criteria, Histology criteria, Past Radiology criteria, CEA criteria (a blood biomarker), other blood biomarkers, other biomarker criteria, cancer-specific staging, trial-specific assessment of change, Stage of Cancer, Tumor-Node-Metastases, Pain, Difficulty breathing, Peak flow meter readings, Quality of life measurements. Only a subset of these will be used as inclusion criteria on any specific study. Some depend on what kind of malignancy is being studied. Note that the clinical criteria might be used for selecting patients and for documenting patient population characteristics even if the study does not correlate image measurements with other clinical indicators. 2. Apply pre-screening criteria to prepare a list of candidate patients for inclusion in the study, possibly using workflow “Query and Retrieve Data from Linked Data Archive.” 3. Select cases for the Reference Data Set according to the inclusion criteria. 4. Once the selection of the input datasets and parameters is performed for each stage of the pipeline, validate the workflow to make sure it satisfies the global requirements and store it for archiving and further editing. 5. Avoid downloading large Reference Data Set of datasets anytime an experiment is run. The first time the experiment is run, query Image Archive for the selected datasets and will cache them locally for further retrieval. 6. If a dataset is not stored locally (for example, if the cache is full) automatically fetch the data and transfer it to the client. 7. Store the metadata information associated with a given dataset as well as a full data provenance record. 8. Retrieve clinical data from CTMS. .1 Review clinical data for quality and criteria. 9. Audit the selection process to assure the quality of the selections and, for example, to check for bias in the selections. Approaches to bias reduction: BBMSC .1 Define the hypotheses that we would seek to explore using the dataset and thereafter to design the experiments that would allow these explorations. .2 Establish rules for the fraction of cases that are submitted, e.g., the first 70% enrolled in chronological order .3 Utilize a masked index of all the subjects in all arms (active intervention arm(s) and control arm) of multiple trials from multiple companies that meet the selection criteria (imaging modality, anatomic area of concern, tumor type, etc.) could be assembled. 
Using a random number scheme a predetermined number of subjects from the entire donated set of subjects could be selected (perhaps with mechanisms to ensure that no single trial or vendor 15 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 dominated the Reference Data Set). The data would be de-identified as to corporate source, trial, and of course subject ID. .4 Since the universe from which the subjects would be chosen is complete (hopefully including both successful and unsuccessful trials) and the selection is randomized, we should have eliminated many of the potential sources of bias. Since the data would be pooled and deidentified but not attributable, the anxiety related to releasing proprietary information should be lessened too. 10. Capture user comments needed to explain any non-obvious decisions or actions taken, such as when it is unclear whether a case meets the selection criteria. POST-CONDITIONS One or more Reference Data Sets is assembled to comprise a Reference Data Set with a defined purpose and workflow. The Reference Data Set(s) include all linked data necessary for the objectives (e.g., non-imaging data also). 3.2.3. Create Ground Truth Annotations and/or Manual Seed Points in Reference Data Set There are essentially two types of experimental studies supported. One is a correlative analysis where an imaging measure is taken and a coefficient of correlation is computed with another parameter, such as clinical outcome. A second type is where a measure of accuracy is computed, which necessitates a representation of the “ground truth,” whether this be defined in a manner traceable to physical standards or alternatively by a consensus of “experts” where physical traceability is either not possible or feasible. This workflow is utilized in the latter case. PRE-CONDITIONS An Reference Data Set has been assembled as one or more Reference Data Sets. Definition of what constitutes “ground truth” for the data set is established and has been checked as to its suitability for the experimental objective it will support. WORK FLOW 1. If not already done, establish appropriate data elements and concepts to represent the annotations by following workflow “Extend and Disseminate Ontologies, Templates, and Vocabularies.” 2. If not already done, create a Reference Data Set by following workflow “Import Data to Reference Data Set to Form Reference Data Set.” 3. The investigators define annotation instructions that specify in detail what the radiologist/reader should do with each type of reading task. 4. Trans-code imported data into Image Annotation Tool, manually cleaning up any data that cannot easily be trans-coded automatically. 5. Create nominal ground truth annotations. (This differs from ordinary reading tasks by removing any tool restrictions and by allowing the reader a lot more time to do the job right. It may entail presenting several expert markups for comparison, selection, or averaging.) BBMSC .1 The investigators assign reading tasks to radiologist/readers, placing a seed annotation in each task, producing worklists. .2 The radiologist/readers prepare seed annotations for each of the qualifying biological features (e.g., tumors) in each of the cases, attaching the instructions to each seed annotation and assuring that the seed annotations are consistent with the instructions. 
.3 The radiologist/readers annotate according to the reference method (e.g., RECIST), to allow comparative studies should that be within the objectives of the experiment on this Reference Data Set.
.4 Inspect and edit annotations, typically as XML, to associate them with other study data.
6. Record audit trail information needed to assure the validity of the study.

POST-CONDITIONS
The Reference Data Set has been annotated with properly defined and implemented "ground truth" and/or manual "seed points" as defined for the experimental purpose of the set.

3.3. Core Activities for Biomarker Development

In general, biomarker development is the activity of finding and utilizing signatures for clinically relevant hallmarks with known/attractive bias and variance: e.g., signatures indicating apoptosis, reduction, metabolism, proliferation, angiogenesis, or other processes evident in ex-vivo tissue imaging that may cascade to the point where they affect organ function and structure. The aim is to validate phenotypes that may be measured with a known/attractive confidence interval. Such image-derived metrics may involve the extraction of lesions from normal anatomical background and the subsequent analysis of this extracted region over time, in order to yield a quantitative measure of some anatomic, physiologic or pharmacokinetic characteristic. Computational methods that inform these analyses are being developed by users in the field of quantitative imaging, computer-aided detection (CADe) and computer-aided diagnosis (CADx).8,9 They may also be obtained using quantitative outputs, such as those derived from molecular imaging.

Individual workflows are generally understood as building on each other (Figure 9).

Figure 9: Workflows are presented to highlight how they build on each other.

The following sections describe workflows for Core Activities for Biomarker Development.

3.3.1. Set up an Experimental Run

It is perhaps helpful to describe two fundamental types of experiments: those that call for new acquisitions (and post-processing), vs. those which utilize already acquired data (and so only require the post-processing). However, in the most flexible generality, these are both understood simply as how one defines the starting point and which steps are included in the run. As such, they may both be supported using the same toolset. In the former case, there is need to support means by which the physical phantom or other imaging object is present local to a device, where the tool provides "directions" for how and when the scans are done with capture of the results; in the latter case, the tool must interface with the candidate implementation and may run local to the user or remote from them. In either case, the fundamental abstraction of an imaging pipeline is useful, where the notion is that any given experiment describes the full pipeline but focuses on granularity at the stage of the pipeline of interest (Figure 10).

[Figure 10 diagram: the pipeline runs from patient preparation, administration of an imaging agent (if any), and image acquisition per timepoint, through reconstruction and post-processing, to calculation of lesion volume at each time point (vt); volumes are subtracted to yield the volume change per target lesion (Δvt), or alternatively images are processed directly to analyze change, and the result is interpreted as the change in tumor burden by volume (ΔT.B.).]
Figure 10: Example technical description of biomarker, elaborating the pipeline. Knowledge of the read-outs subject to statistical analysis as well as the pipeline steps are given. The Batch Analysis Service is an open-source, cross platform tool for batch processing large amounts of data. The Batch Analysis Service can process datasets either locally or on distributed systems. The Batch Analysis Service uses a scripting language with a specific semantic to define loops and conditions. The Batch Analysis Service provides a scripting interface to command line applications. The arguments of the command line executable can be wrapped automatically if the executable is able to produce a specific XML description of its command line arguments. Manual wrapping can also be performed using the “Application Harness” interface provided by The Batch Analysis Service. The Batch Analysis Service allows users to upload scripts with the associated parameters description directly into the Reference Data Set Manager. Thus, when a user decides to process a Reference Data Set using a given Batch Analysis Service script, the Reference Data Set Manager automatically parse the parameter description and generate an HTML page. The HTML page provides an easy way to enter tuning parameters for the given algorithm and once the user has specified the parameters, the Batch Analysis Service configuration file is generated and the script is run on the selected Reference Data Set(s). This flexibility allows sharing Processing Pipelines easily between organizations, since the Batch Analysis Service script is describing the pipeline. PRE-CONDITIONS In cases where the experimental objective requires a physical or digital phantom, such is available. WORK FLOW 1. Define the biomarker in terms of pipeline steps as well as read-outs that will be subject of analysis. 2. If not already done, establish appropriate data elements and concepts to represent the pipeline and the annotations on which it will operate by following workflow “Extend and Disseminate Ontologies, Templates, and Vocabularies.” 3. Depending on the experimental objective, two basic possibilities exist for creating the image markups: .1 The user will implement the Application Harness API to interface a candidate biomarker implementation to the Batch Analysis Service. .2 Alternatively, the reference implementation for the given biomarker will be utilized. 4. Depending on the experimental objective, means for ensuring availability of physical and/or digital BBMSC 18 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 phantoms are undertaken. 5. Based on the pipeline definition, describe the experimental run to include, for example, set up for image readers so as to guide the user through the task, and save the results. .1 It could, for example, create an observation for each tumor, each in a different color. Or, one tumor per session. Depends on experiment design. This activity also tailors the order in which the steps of the reading task may be performed. .2 Design an experiment-specific tumor identification scheme and install it in the tumor identification function in the task form preparation tool. .3 Define the types of observations, characteristics, and modifiers to be collected in each reading session and program them into the constrained-vocabulary menus. (May be only done in ground truth annotations in early phases.) .4 Determine whether it will be totally algorithmic or will have manual steps. 
There are a number of reasons for manual steps, including, for example, that an acquisition is needed, that the reader chooses his/her own seed stroke, that manual improvements are permitted after the algorithm has made its attempt, or combinations of these. Allow for manual, semi-automatic, and automatic annotation.
.5 Specify the sequence of steps associated with acquisitions, for example of a phantom shipped to the user, or of new image data that will be added to the Reference Data Set(s).
.6 Specify the sequence of steps that the reader is expected to perform on each type of reading task for data in the Reference Data Set(s).
6. Customize the analysis tool.
.1 Analyze data with known statistical properties to determine whether custom statistical tools are producing valid results.
.2 Customize built-in statistical methods.
.3 Add the measures, summary statistics, outlier analyses, plot types, and other statistical methods needed for the specific study design.
.4 Load statistical methods into the analysis tool.
.5 Configure presentation of longitudinal data.
.6 Customize outlier analysis.
.7 Configure the report generator to tailor the formats of exported data views. The report generator exports data views to files according to a run-time-configured list of the data views that should be included in the report.
7. Install databases and tools as needed at each site, loading the databases with initial data. This includes installing image databases at all actor sites, installing clinical report databases at all clinical sites, and installing annotation databases at Reader, Statistician, and PI sites.
8. Represent the Processing Pipeline so that it may be easily shared between organizations.

POST-CONDITIONS
One or more Processing Pipeline scripts are defined and may be executed on one or more Reference Data Sets.

3.3.2. Execute an Experimental Run

Once the Processing Pipeline script is written, the Batch Analysis Service can run the script locally or across distributed processing jobs expressed as a directed acyclic graph (DAG). When distributing jobs, the Batch Analysis Service first creates the appropriate DAG based on the current input script. By default, the Batch Analysis Service performs loop unrolling and considers each scope within the loop as a different job which may ultimately be distributed on different nodes. This allows independent jobs to be distributed automatically (as long as each iteration is considered independent). The Batch Analysis Service also provides a way for the user to specify if a loop should be executed sequentially instead of in parallel. Before launching the script (or jobs) on the grid, the user can visualize the DAG to make sure it is correct. Then, when running the job, the user can monitor the progress of each job distributed with the Batch Analysis Service. The Biomarker Evaluation GUI provides online monitoring of the grid processing in real time. Results of the processing at each node of the grid are instantly transferred back to the Biomarker Evaluation GUI and can be used to generate dashboards for batch processing, permitting a quick check of the validity of the processing by comparing the results with known baselines. Each row of the dashboard corresponds to a particular processing stage and is reported in red if the result does not meet the validation criterion.
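To make the loop-unrolling step concrete, the sketch below turns a small declarative description of a per-case processing loop into a set of independent jobs plus a DAG of dependencies that a scheduler could dispatch. It is a simplified illustration under assumed data structures, not the actual BatchMake/Condor mechanism used by the Batch Analysis Service.

# Simplified illustration of loop unrolling into a job DAG; the script and job
# structures are assumptions for this sketch, not the Batch Analysis Service's formats.

def unroll(script, case_ids):
    """Expand a per-case loop into one job per case, plus a final aggregation job.

    Returns (jobs, dag) where dag maps job name -> list of prerequisite job names.
    """
    jobs, dag = {}, {}
    for case_id in case_ids:
        name = f"{script['step']}_{case_id}"
        jobs[name] = {"command": script["command"], "case": case_id}
        dag[name] = []                      # each iteration is treated as independent
    dag["aggregate_readouts"] = list(dag)   # aggregation waits for every per-case job
    jobs["aggregate_readouts"] = {"command": script["aggregate"], "case": None}
    return jobs, dag

def runnable(dag, done):
    """Jobs whose prerequisites are all complete (what a scheduler would dispatch next)."""
    return [j for j, deps in dag.items() if j not in done and all(d in done for d in deps)]

# Example: three cases unroll into three independent segmentation jobs + one aggregation.
script = {"step": "segment", "command": "segment_lesion", "aggregate": "collect_volumes"}
jobs, dag = unroll(script, ["case-001", "case-002", "case-003"])
print(runnable(dag, done=set()))   # the three per-case jobs can run in parallel

Marking a loop as sequential would simply amount to chaining each per-case job to its predecessor in the DAG rather than leaving its prerequisite list empty.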
PRE-CONDITIONS A Processing Pipeline script is defined and/or conceived that may be executed on one or more Reference Data Sets. One or more Reference Data Set(s) have been assembled and/or conceived for batch analysis. In cases where the experimental objective requires “ground truth”, “gold standard”, and/or “manual seed points”, then such annotations have been created through workflow “Create Ground Truth Annotations and/or Manual Seed Points in Reference Data Set”. WORK FLOW 1. If not already done, set up a processing pipeline script by following workflow “Set up an Experimental Run.” 2. If not already done, create an Reference Data Set by following workflow “Import Data to Reference Data Set to Form Reference Data Set” (optionally through workflow “Create Ground Truth Annotations and/or Manual Seed Points in Reference Data Set”). 3. Create electronic task forms for each of the manual tasks (e.g., acquisition or reading) to be analyzed in the study. 4. Fully-automated methods can just run at native compute speed. For semi-automated methods: .1 Assign each task form to one or more human participants, organizing the assignments into worklists that could specify, for example, when each task is performed and how many tasks are performed in one session. 5. Support two levels of Application Harness: .1 Full patient work-up: candidate algorithm or method encompasses not only individual tumor mark-up but also works across all tumors – just call the algorithm with the whole patient data .2 Individual tumor calculations: candidate algorithm expects to be invoked on an individual tumor basis – requires implementation of a reference implementation for navigating across measurable tumors and call the algorithm on each one 6. Upload Processing Pipelines and run them on selected Reference Data Sets. BBMSC .1 Translate to a grid job description and sent to the distributed computing environment along with the input datasets. .2 Integrate input data and parameters, processing tools and validation results. .3 For semi-automated methods which require a “user in-the-loop”, the manual process will be performed beforehand and manual parameters will be passed to the system via XML files (e.g., seed points) without any disruption of the batch processing workflow. .4 Automatically collect the parameters, data and results during each stage of the Processing Pipeline and stores the results in the database for further analysis. .5 Develop an interface to the statistical packages and the evaluation system. When generating a distributed computing jobs list, ensure that the post-processing information (values and 20 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 datasets) are collected and a post-processing package is created and sent via REST protocol. .6 Record all the input parameters, machine specification, input datasets and final results and stores them in database. At the end of the experiment, be able to access the processed data and visualize the results via web interface. .7 Integrate with grid computing technology to enable efficient distributed processing. .8 Process datasets either locally or on distributed systems. .9 Use a scripting language with a specific semantic to define loops and conditions. .10 Provide a scripting interface to any command line applications. .11 Perform a loop unrolling and consider each scope within the loop as a different job, which will be ultimately distributed on different nodes. 
.12 From a single script, translate a complete workflow to a set of job requirements for different grid engines. .13 Run command line executables associated with a description of the expected command line parameters. This pre-processing step is completely transparent to the user. 7. Provide online monitoring of the grid processing in real time. 8. Generate result dashboards. A dashboard allows one to quickly validate a processing task by comparing the results with known baselines. A dashboard is a table showing the results of an experiment. 9. Record audit trail information needed to assure the validity of the study. POST-CONDITIONS Results are available for analysis. 3.3.3. Analyze an Experimental Run Support hypothesis testing of the form given in decision trees and claims for context of use PRE-CONDITIONS Results are available for analysis. WORK FLOW 1. If not already done, create analyzable results by following workflow “Execute an Experimental Run.” 2. The following comparisons of markups can be made: .1 Analyze statistical variability .2 Measurements of agreement .3 User-defined calculations .4 Correlate tumor change with clinical indicators .5 Calculate regression analysis .6 Calculate factor analysis .7 Calculate ANOVA .8 Calculate outliers 3. Estimate the confidence intervals on tumor measurements due to the selected independent variables, as measured by validated volume difference measures. BBMSC 21 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 4. Review statistics results to uncover promising hypotheses about the data. Typical methods include: Box plot, Histogram, Multi-vari chart, Run chart, Pareto chart, Scatter plot, Stem-and-leaf plot, Odds ratio, Chi-square, Median polish, or Venn diagrams. 5. Provide review capability to user for all calculated items and plots 6. Drill-down interesting cases, e.g., outlying sub-distributions, compare readers on hard tumors, etc. POST-CONDITIONS Analyses are performed and reports are generated. 3.4. Collaborative Activities to Standardize and/or Optimize the Biomarker The first and critical building block in the successful implementation of quantitative imaging biomarkers is to establish the quality of the physical measurements involved in the process. The technical quality of imaging biomarkers is assessed with respect to the accuracy and reproducibility of the related physical measurement(s). Consequently, a well thought-out testing protocol must be developed so that, when carefully executed, it can ensure that the technical quality of the physical measurements involved in deriving the candidate biomarker is adequate. The overarching goal is to develop a generalizable approach for technical proficiency testing which can be adapted to meet the specific needs for a diverse range of imaging biomarkers (e.g., anatomic, functional, as well as combinations). Guidelines of “good practice” to address the following issues are needed: (i) composition of the development and test data sets, (ii) data sampling schemes, (iii) final evaluation metrics such as accuracy as well as ROC and FROC metrics for algorithms that extend to detection and localization. With development/testing protocols in place, the user would be able to report the estimated accuracy and reproducibility of their algorithms on phantom data by specifying the protocol they have used. 
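For example, given repeated measurements of a phantom object whose true volume is known, accuracy and reproducibility might be summarized as a bias and a repeatability coefficient, as in the minimal sketch below. The 2.77 factor is the conventional 1.96·√2 multiplier on the within-condition standard deviation, and the measurement values are invented for illustration.

# Minimal sketch: summarize accuracy (bias) and reproducibility (repeatability
# coefficient) from repeated measurements of a phantom with known true volume.
# Measurement values below are invented for illustration.
from statistics import mean, stdev

true_volume_mm3 = 1000.0
measured_mm3 = [1012.4, 988.7, 1005.1, 994.3, 1009.8]   # repeated scans, same conditions

bias_mm3 = mean(measured_mm3) - true_volume_mm3          # accuracy relative to truth
percent_bias = 100.0 * bias_mm3 / true_volume_mm3
repeatability_coefficient = 2.77 * stdev(measured_mm3)   # approx. 1.96 * sqrt(2) * SD

print(f"bias: {bias_mm3:+.1f} mm^3 ({percent_bias:+.1f}%)")
print(f"repeatability coefficient: {repeatability_coefficient:.1f} mm^3")

Reporting both figures alongside the development/testing protocol that produced them allows candidate implementations to be compared on a like-for-like basis.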
Furthermore, they would be able to demonstrate which algorithmic implementations produce the most robust and unbiased results (i.e., less dependent on the development/testing protocol). The framework we propose must be receptive to future modifications by adding new development/testing protocols based on up-to-date discoveries. Inter-reader variation indicates difference in training and/or proficiency of readers. Intra-reader differences indicate differences from difficulty of cases. To show the clinical performance of an imaging test, the sponsor generally needs to provide performance data on a properly-sized validated set that represents a true patient population on which the test will be used. For most novel devices or imaging agents, this is the pivotal clinical study that will establish whether performance is adequate. In this section, we describe workflows that start with developed biomarker and seek to refine it by organized group activities of various kinds. These activities are facilitated by deployment of the Biomarker Evaluation Framework within and across centers as a means of supporting the interaction between investigators and to support a disciplined process of accumulating a body of evidence that will ultimately be capable of being used for regulatory filings. By way of example, a typical scenario to demonstrate how the Reference Data Set Manager involves three investigators working together on to refine a biomarker and tests to measure it: Alice who is responsible for acquiring images for a clinical study. Martin, who is managing an image processing laboratory responsible for analyzing the images acquired by Alice, and Steve, a statistician located at a different institution. First, Alice receives volumetric images from her clinical collaborators; she logs into the Reference Data Set Manager and creates the proper Reference Data Sets of datasets. She uses the web interface to upload the datasets into the system. The metadata are automatically extracted from the datasets (DICOM or other well known scientific file formats). She then adds more information about each dataset, such as demographic and clinical information, and changes the Reference Data Set’s policies to make it available to Martin. Martin is instantly notified that new datasets are available in the system and are ready to be processed. Martin logs in and starts visualizing the datasets online. He visualizes the dataset as slices and also uses more complex rendering technique to assess the quality of the acquisition. As he browses each dataset, Martin selects a subset of datasets of interest and put them in the electronic cart. At the end of the session, he downloads the datasets in his cart in bulk and gives them to his software engineers to train the different algorithms. As soon as the algorithms are validated on the training datasets, Martin uploads the algorithms, selects the remaining testing datasets and applies the BBMSC 22 of 40 QI-Bench: Informatics Services for Quantitative Imaging SUC Rev 1.0 Processing Pipeline to the full Reference Data Set using the Batch Analysis Service. The pipeline is automatically distributed to all the available machines in the laboratory, decreasing the computation time by several orders of magnitude. The datasets and reports generated by the Processing Pipeline are automatically uploaded back into the system. During this time, Martin can monitor the overall progress of the processing via his web browser. 
When the processing is done, Martin gives Steve access so that he can validate the results statistically. Even though he may be located anywhere in the world, Steve can access and visualize the results, make comments, and upload his statistical analysis into the system.
Individual workflows are generally understood as building on each other (Fig. 14).
Figure 11: Workflows are presented to highlight how they build on each other.
The following sections describe workflows for Collaborative Activities to Standardize and/or Optimize the Biomarker.
3.4.1. Validate Biomarker in Single Center or Otherwise Limited Conditions
Clinicians, educators, behavioral scientists, health scientists, and many other consumers of tests and measurements typically demand that any test have demonstrated validity and reliability. In casual scientific conversations in imaging contexts, the words reliability and validity are often used to describe a variety of properties (and sometimes the same one). The metrology view of proof of performance dictates that a measurement result is complete only when it includes a quantitative statement of its uncertainty.10,11 Generating this statement typically involves the identification and quantification of many sources of uncertainty, including those due to reproducibility and repeatability (which themselves may be due to multiple sources). Measures of uncertainty are required to assess whether a result is adequate for its intended purpose and how it compares with alternative methods. A high level of uncertainty can limit utilization, as uncertainty reduces statistical power, especially in the multi-center trials needed for major studies. Uncertainty compromises longitudinal measurements, especially when patients move between centers or when scanner changes occur.
The following figures illustrate what occurs at this stage of a biomarker’s development, as it is represented by one or more candidate test implementations. Figure 15 illustrates one approach for the evaluation of a set of segmentation/classification algorithms, in this case for processing slide images for ex-vivo pathology but with a validation methodology equally applicable to in-vivo radiology.12
Figure 12: Algorithm validation comprises the cross-comparison of the proposed method with other methods and the assessment of its performance.
Figure 16 provides a use case diagram view of the steps being undertaken.
Figure 13: Use Case Model to Validate Biomarker under Limited Conditions
The user is working, possibly at a single center or possibly with a consortium. S/he wants to test out a new image processing algorithm on a set of radiology studies. He has tested out the algorithm on one set of images at his local institute and is confident his code can be run remotely in a batch processing mode. One of the consortium collaborators has a set of radiology images (studies) that can be used to test the algorithm. The results of the algorithm include image masks (Markup) and quantitative results (Annotation) that can be shared and viewed by everyone. Note that this is an aggregate workflow that specializes workflows elaborated earlier in this document.
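For illustration only, the following minimal Python sketch shows one way the investigator's algorithm could be wrapped for batch processing so that each image yields a Markup (e.g., a mask file) and an Annotation (a set of quantitative results). The manifest format, function names, and output layout are assumptions made for this sketch; they are not the defined Application Harness interface.

    # Hypothetical batch wrapper; run_algorithm() stands in for the investigator's code.
    import csv
    import json
    from pathlib import Path

    def run_algorithm(image_path):
        """Placeholder for the segmentation/measurement code.
        Expected to return (mask_path, measurements_dict)."""
        raise NotImplementedError

    def run_batch(manifest_csv, output_dir):
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        with open(manifest_csv, newline="") as f:
            for row in csv.DictReader(f):          # assumed columns: study_id, image_path
                mask_path, measurements = run_algorithm(row["image_path"])
                record = {
                    "study_id": row["study_id"],
                    "markup": str(mask_path),       # image mask (Markup)
                    "annotation": measurements,     # quantitative results (Annotation)
                }
                with open(out / (row["study_id"] + ".json"), "w") as g:
                    json.dump(record, g, indent=2)

    if __name__ == "__main__":
        run_batch("reference_data_manifest.csv", "results")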
PRE-CONDITIONS
At least some development of the biomarker as measured by a candidate implementation has been pursued, whether through workflows such as described in “Biomarker Development” or otherwise.
FLOW OF EVENTS
1. If not already done, follow workflow “Import Data to Reference Data Set to Form Reference Data Set.”
2. As needed, follow workflow “Create Ground Truth Annotations and/or Manual Seed Points in Reference Data Set.”
3. As needed, follow workflow “Develop Physical and/or Digital Phantom(s).”
4. If not already done, follow workflow “Set up an Experimental Run.”
.1 User puts his code and a description of files (URL of Image Archive, Filenames or StudyIDs, etc.) into the DataModel defined by the [Image Analysis/Biomarker Evaluation Framework, Image Annotation Tool], and the algorithm code is saved in the Algorithm Code Storage. The name of the algorithm is saved in the associated data model – dynamic extension applied.
5. Follow workflow “Execute an Experimental Run.”
.1 User then sends this information out via Grid Services, and the Image Archive and associated CPU processors start to run the batch processing. [Image Analysis/Biomarker Evaluation Framework, Image Archive Data Service, Image Information Data Service].
.2 As the algorithm is processed on each of the images, the results are saved in new Annotation and Markup that will be associated with that image.
.3 After processing is complete, user receives status that the remote job is completed and available for viewing.
.4 User has the ability to see the results. Alternatively, they can develop methods to extract all annotation results through the Biomarker Evaluation Framework without having to go through the viewer.
6. Follow workflow “Analyze an Experimental Run.”
.1 Qualitative pairwise algorithm comparison.
.2 The goal is to provide a validation assessment of technical and clinical performance via high-throughput computing tasks.
7. Iterate for confidence and refinement.
POST-CONDITIONS
Quantitative analysis algorithm is ready for further evaluation within a consortium or by a sponsor as part of one or more regulatory pathways.
3.4.2. Team Optimizes Biomarker Using One or More Tests
When one or more imaging tests reach the point where their technical performance has been demonstrated in well-controlled settings, whether by an individual sponsor or by a collaboration, organized activities undertaken by teams are pursued to optimize the biomarker and characterize/improve the class of tests available to measure it. These activities generally include:
Implementation and refinement of protocols that include recommended operating points for acquisition, analysis, interpretation, and QC, according to the documented intended use,
Development and merging of training and test datasets from various sources to establish or augment linked data archive(s), and
Assessment and characterization of variability, minimum detectable change, and other aspects of performance in the intended environment, including subject variability associated with the physiological and pathophysiological processes present in the target population – that is, moving beyond the more highly controlled conditions under which the biomarker and its tests may have been initially discovered and developed.
Prospective trials and/or retrospective analyses are stored as Reference Data Sets based on the intended-use claims, with care to exclude biases.
The studies are undertaken in part to provide data to support proposed cut-points (i.e., decision thresholds), if imaging results are not reported as a continuous variable, and performance characteristics (including sensitivity, specificity, and accuracy) are reported to complete this step.
PRE-CONDITIONS
The claimed intended use of the imaging test must be clearly stated before initiating technical validation studies so that appropriate data are generated to support that use.
A summary of Good Manufacturing Practices (GMP) issues as they relate to producing the imaging test.
Devices and software used to perform the imaging test must meet quality system requirements.13
FLOW OF EVENTS
Probably a useful way of thinking of this is that it specializes / extends workflow “Validate Biomarker in Single Center or Otherwise Limited Conditions”:
1. Write a Profile, in part through following workflow “Extend and Disseminate Ontologies, Templates, and Vocabularies”:
.1 Describe the patient population characteristics, which determine how general the study is, and therefore the importance of the result. The study design should include how to capture the data to back up claims about the patient population characteristics.
.2 High-level descriptions of the processing hardware and software, highlighting potential weaknesses, should be provided.
.3 Detailed descriptions of the procedures to be used for image acquisition, analysis, and interpretation of the quantitative imaging biomarker as a clinical metric should be included.
.4 Procedures to be used when results are not interpretable or are discrepant from other known test results must be described; this is especially important for imaging tests used for eligibility or assignment to treatment arms.
.5 Develop a standardized schema for validation and compliance reports. Ensure consensus on the XML schema is attained with all QIBA participants.
2. As needed, follow workflow “Develop Physical and/or Digital Phantom(s).”
3. Use the Profile to automatically follow the workflow “Import Data to Reference Data Set to Form Reference Data Set,” using the Profile for automatic data selection.
.1 Automatically select the optimal datasets based on their associated metadata. If only a limited number of datasets is identified, the user will be asked to provide more datasets. Similarly, if too many datasets are returned, the user will be asked to choose the specific datasets to be included in the experiment.
4. As needed, follow workflow “Create Ground Truth Annotations and/or Manual Seed Points in Reference Data Set” to annotate the Reference Data Set(s).
5. Use the Profile to automatically follow the workflow “Set up an Experimental Run” to describe the Processing Pipeline for a set of validation script(s).
6. If not already done, wrap candidate implementation(s) to be compatible with the Application Harness.
7. Follow workflow “Execute an Experimental Run.”
8. Follow workflow “Analyze an Experimental Run” according to the prescriptions for validation and compliance as outlined in the Profile.
9. Iterate for confidence and refinement.
10. After an experiment runs on selected datasets, results should be compared to the expected target values defined in the “Qualification Data” file (a comparison of this kind is sketched following this workflow).
POST-CONDITIONS
One or more validated imaging tests measuring a biomarker with known performance characteristics regarding accuracy and reproducibility within a specified context for use.
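As a purely illustrative sketch of the comparison called for in step 10 above, the Python fragment below checks measured results against target values read from a hypothetical “Qualification Data” file; the field names and tolerance semantics are assumptions, not a defined QI-Bench format.

    # Illustrative only: compare experiment results to assumed target values.
    import json

    def check_against_targets(results_path, qualification_path):
        with open(results_path) as f:
            results = json.load(f)       # e.g., {"volume_bias_pct": 2.1, ...}
        with open(qualification_path) as f:
            targets = json.load(f)       # e.g., {"volume_bias_pct": {"max_abs": 5.0}}
        report = {}
        for metric, spec in targets.items():
            value = results.get(metric)
            passed = value is not None and abs(value) <= spec["max_abs"]
            report[metric] = {"value": value, "limit": spec["max_abs"], "pass": passed}
        return report

    if __name__ == "__main__":
        print(json.dumps(check_against_targets("experiment_results.json",
                                               "qualification_data.json"), indent=2))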
3.4.3. Support “Open Science” Publication Model
The user shares and disseminates the artifacts, results, processed data, and, in some cases, the primary data. For hypothesis-testing scientific articles or reports, the user documents the methods, materials, and results. The user explains if and how the hypothesis is rejected or accepted in light of the experiments conducted (the “new findings”) and in the context of other similar reported findings. For discovery-science scientific articles or reports, the user documents the methods and materials used in the experiments as well as the results, and summarizes the characteristics and intended use of the data or derived materials created (the “new findings”). In either case, the final documentation should be sufficient for another person similarly skilled in the field to replicate the experiment and new findings.
Preliminary results may be released prior to formal presentation of results as, generally, a peer-reviewed journal article. Forums for sharing results may include staff seminars and seminars or posters at a scientific conference in addition to published journals. Primary or processed data may be submitted to accessible databases.
The “open science” initiative allows scientists to expand upon traditional research results by providing software for interactively viewing underlying source data. The interactive Image Viewer system enhances standard scientific publishing by adding interactive visualization. Using the interactive Image Viewer, authors have the ability to create 3-dimensional visualizations of their datasets, add 3-dimensional annotations and measurements, and make the datasets available to reviewers and readers. The system is composed of two main components: the archiving system and the visualization software. A customized version of the Reference Data Set Manager provides the data storage, delivers low-resolution datasets for pre-visualization, and in the background serves the full-resolution dataset. The Reference Data Set Manager must support MeSH, the U.S. National Library of Medicine (NLM) controlled vocabulary used for indexing scientific publications. The second component, the interactive Image Viewer visualization software, interacts directly with the Reference Data Set Manager in order to retrieve stored datasets. Readers of an interactive Image Viewer-enabled manuscript can automatically launch the interactive Image Viewer software by clicking on a web link directly in the PDF. Within ten seconds, a low-resolution dataset is loaded from the Reference Data Set Manager and can be interactively manipulated in 3D via the interactive Image Viewer software.
PRE-CONDITIONS
Primary and/or processed data sets, new equipment or software, and/or other artifacts have been generated. Data have been interpreted. Hypothesis accepted or rejected (applicable to hypothesis testing). Experimental materials and resultant materials are available.
FLOW OF EVENTS
1. If not already done, create reportable results by following workflow “Analyze an Experimental Run.”
2. Summarize the results.
3. Describe and/or interpret the results in the broader scientific context of the research (“narrate the findings”), using tables and figures as required to illustrate key points.
4. Communicate the validation, evaluation, and interpretation in any of several ways, including, but not limited to:
.1 Publish a manuscript
.2 Present at a scientific meeting
.3 Submit raw or processed data to appropriate resources/organizations
.4 Establish mechanism(s) to share new equipment or software
5. Specific examples:
.1 Make image analysis algorithms available as publicly accessible caGrid analytical services.
.2 Upload raw image and other data sets into a repository where on-line journals may access them, giving readers a user interface that can run reference algorithms and/or give access to data for local consideration.
.3 Perform the workflow in a highly consistent manner so that results can be reproduced, standardized, and eventually analyzed with common tools.
POST-CONDITIONS
Publication of scientific articles or abstracts. Work has been shared through at least one means. Data and/or artifacts have been made available. New equipment or software is disseminated.
3.5. Consortium Establishes Clinical Utility / Efficacy of Putative Biomarker
Biomarkers are useful only when accompanied by objective evidence regarding the biomarkers’ relationships to health status. Imaging biomarkers are usually used in concert with other types of biomarkers and with clinical endpoints (such as patient reported outcomes (PRO) or survival). Imaging and other biomarkers are often essential to the qualification of each other. The following figure expands on Figure 17 and specializes workflow “Team Optimizes Biomarker Using One or More Tests” as previously elaborated to build statistical power regarding the clinical utility and/or efficacy of a biomarker.
Figure 14: Use Case Model to Establish Clinical Utility / Efficacy of a Putative Biomarker
Individual workflows are generally understood as building on each other (Fig. 18).
Figure 15: Workflows are presented to highlight how they build on each other.
The following workflows are elaborated.
3.5.1. Measure Correlation of Imaging Biomarkers with Clinical Endpoints
The class of tests serves as a basis for defining the measurement technology for a biomarker, which may then be assessed as to its clinical utility. This assessment may be done in the context of an effort to qualify the biomarker for use in regulatory decision making in clinical trials, or it may be a comparable activity associated with individual patient management without explicitly following a qualification pathway. In either case, the hallmark of this step is the assessment of clinical utility on the basis of at least some capability to measure it.
Biomarker reproducibility in the clinical context is assessed using scans from patients that were imaged with the particular modality repeatedly and over an appropriately short period of time, without intervening therapy. The statistical approaches include standard analyses using intraclass correlation and Bland-Altman plots for the assessment of agreement between measurements.14,15 However, more detailed determinations are also of interest for individual biomarkers. For example, it may be useful to determine the magnitude of observed change in a biomarker that would support a conclusion of change in the true measurement for an individual patient. It may also be of interest to determine if two modalities measuring the same quantity can be used interchangeably.16
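The sketch below illustrates test-retest agreement statistics of the kind cited above (Bland-Altman limits of agreement and a repeatability coefficient estimated from the paired differences). It assumes paired measurements from two scans of the same patients without intervening therapy; the data and variable names are illustrative only, not a prescribed QI-Bench analysis.

    # Minimal agreement analysis on paired repeat measurements (illustrative).
    import numpy as np

    def bland_altman(scan1, scan2):
        a = np.asarray(scan1, dtype=float)
        b = np.asarray(scan2, dtype=float)
        d = a - b
        bias = d.mean()                    # systematic difference between scans
        sd = d.std(ddof=1)                 # SD of the paired differences
        loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # 95% limits of agreement
        rc = 1.96 * sd                     # repeatability coefficient from paired differences
        return {"bias": bias, "limits_of_agreement": loa, "repeatability_coefficient": rc}

    if __name__ == "__main__":
        baseline = [10.2, 8.7, 15.1, 12.3, 9.8]   # e.g., nodule volumes (cm3), scan 1
        repeat   = [10.6, 8.4, 14.7, 12.9, 9.5]   # same nodules, scan 2
        print(bland_altman(baseline, repeat))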
The diagnostic accuracy of biomarkers (that is, the accuracy in detecting and characterizing the disease) is assessed using methods suitable to the nature of the detection task, such as ROC, FROC, and LROC. In settings where the truth can be effectively considered as binary and the task is one of detection without reference to localization, the broad array of ROC methods will be appropriate.17,18 Since the majority of VIA imaging biomarkers in the volumetric analysis area produce measurements on a continuous scale, methods for estimating and comparing ROC curves from continuous data are needed. In settings where a binary truth is still possible but localization is important, methods from free-response ROC analysis are appropriate.19,20,21
PRE-CONDITIONS
A mechanistic understanding or “rationale” of the role of the feature(s) assessed by the imaging test in healthy and disease states is documented.
A statement of value to stakeholders (patients, manufacturers, biopharma, etc.), expressed in the context of the alternatives, is understood (e.g., with explicit reference to methods that are presently used in lieu of the proposed biomarker).
Consensus exists on whether the test is quantitative, semi-quantitative, or qualitative (descriptive); what platform will be used; what is to be measured; controls; scoring procedures, including the values that will be used (e.g., pos vs. neg; 1+, 2+, 3+); interpretation; etc.
FLOW OF EVENTS
1. If not already done, follow workflow “Install and Configure Linked Data Archive Systems.”
2. If not already done, follow workflow “Import Data to Reference Data Set to Form Reference Data Set” for each of potentially several Reference Data Sets as needed to build statistical power.
3. If not already done, follow workflow “Team Optimizes Biomarker Using One or More Tests” to perform clinical performance groundwork to characterize sensitivity and specificity for readers using the imaging test when interpreted as a biomarker under specified conditions.
4. Conduct pilot study(ies) of the analysis to establish capability of the class of tests that represent the biomarker using the training set (e.g., following Sargent et al., with utility determinations restricted to demonstrating that in single studies the endpoint captures much of the treatment benefit at the individual patient level).22
.1 Demonstrating a high correlation at the patient level between the early endpoint and the ultimate clinical endpoint within a trial, randomized or not, is not sufficient to validate an endpoint. Such a correlation may be a result of prognostic factors that influence both endpoints, rather than a result of similar treatment effect on the two endpoints. Despite this caveat, a reasonably high patient-level correlation (for example, >50%) would suggest the possible utility of the early endpoint and the value of subsequently assessing, by means of a larger analysis, the predictive ability of the early endpoint for the ultimate phase 3 endpoint for treatment effect at the trial level.
.2 For predictive markers, the Freedman approach involves estimating the treatment effect on the true endpoint, defined as s, and then assessing the proportion of treatment effect explained by the early endpoint.
However, as noted by Freedman, this approach has statistical power limitations that will generally preclude conclusively demonstrating that a substantial proportion of the treatment benefit at the individual patient level is explained by the early endpoint. In addition, it has been recognized that the proportion explained is not in fact a true proportion, as it may exceed 100%, and that whilst it may be estimated within a single trial, data from multiple trials are required to provide a robust estimate of the predictive endpoint. Additionally, it can have interpretability problems, also pointed out by Freedman. Buyse and Molenberghs also proposed an adjusted correlation method that overcomes some of these issues.
.3 For prognostic markers, the techniques for doing so are most easily described in the context of a continuous surrogate (e.g., change in nodule volume) and a continuous outcome. Linear mixed models23 with random slopes (or, more generally, random functions) and intercepts through time are built for both the surrogate marker and the endpoint. That is, the joint distribution of the surrogate marker and the endpoint is modeled using the same techniques as used for each variable individually. The degree to which the random slopes for the surrogate and the endpoint are correlated gives a direct measure of how well changes in the surrogate correlate with changes in the endpoint. The ability of the surrogate to extinguish the influence of potent risk factors, in a multivariate model, further strengthens its use as a surrogate marker. (A simplified sketch of this idea follows this workflow.)
5. Conduct the pivotal meta-analysis on the test set (extending the results to the trial level and establishing the achievable generalizability based on available data). Follow statistical study designs consistent with the claims and the type of biomarker along the lines described in the Basic Story Board for predictive vs. prognostic biomarkers.
POST-CONDITIONS
Sufficient data for registration according to workflow “Formal Registration of Data for Approval or Clearance” is available.
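The following Python sketch is a simplified, two-stage stand-in for the mixed-model approach to prognostic markers described above: it fits separate random-slope models for the surrogate and the endpoint, then correlates the per-patient slopes. A true joint model would estimate the slope correlation directly; the column names and the use of statsmodels here are assumptions made only for illustration.

    # Simplified two-stage illustration of correlating random slopes (not a joint model).
    import statsmodels.formula.api as smf
    from scipy.stats import pearsonr

    def slope_correlation(df):
        # df columns assumed: patient, time, surrogate (e.g., change in nodule volume),
        # endpoint (a continuous clinical outcome observed over time)
        m_surr = smf.mixedlm("surrogate ~ time", df, groups=df["patient"],
                             re_formula="~time").fit()
        m_endp = smf.mixedlm("endpoint ~ time", df, groups=df["patient"],
                             re_formula="~time").fit()
        slopes_surr = [re["time"] for re in m_surr.random_effects.values()]
        slopes_endp = [re["time"] for re in m_endp.random_effects.values()]
        return pearsonr(slopes_surr, slopes_endp)   # (correlation, p-value)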
3.5.2. Comparative Evaluation vs. Gold Standards or Otherwise Accepted Biomarkers
Whereas the correlation of a putative biomarker with clinical endpoints seems instrumental for the biomarker to be qualified for defined uses, in and of itself it does not result in the acceptance by the community that it ought to be used vs. what the community is already using. In addition to the efficacy of a biomarker unto itself as described in workflow “Measure Correlation of Imaging Biomarkers with Clinical Endpoints,” comparative analyses would be pursued that identify the relative advantages (or disadvantages as the case may be) of using this biomarker vs. another biomarker. Two specific examples that are currently relevant include spirometry vs. lung densitometry, and use of diameter measurements on single axial slices as presently inculcated in RECIST. Ultimately, use of any putative imaging biomarker is understood in relation to how the assessment is done without the benefit of that biomarker, and industry uptake of the biomarker requires an evaluation of relative performance against identified figures of merit.
Following the RECIST example, when RECIST was defined (reference) it was the decision of the assessment group which measurement should be chosen as the basis for the final clinical classification of disease state development (progression, stability, response to treatment). Volumes were already under investigation at that time. Because easy tumor measurements were impractical then (a highly manual process of tumor lesion markup, and thick image slices that made a volumetric assessment imprecise), diameter was the way to go forward. RECIST went through several rounds of refinement with regard to clinical validity of classification and usefulness in different phases of clinical drug development. Today RECIST 1.1 is the accepted standard for clinical therapy response assessment in oncology trials. Meanwhile there have been further developments in imaging techniques (higher spatial resolution, thinner slices) and in algorithm development that make a volume measurement feasible. There are studies (reference to Merck) available that support the higher sensitivity of volume measurements versus diameter measurements with regard to change detection. Therefore it is in the interest of the relevant stakeholders to work towards recommendations for a dataset that enables the FDA to accept the use of volume instead of diameter for RECIST. A core set of data is needed to prove the validity of volume assessment compared to diameter measurements, based on high-quality, mixed (less than 5 mm) slice-thickness images of cancer cases acquired in several clinical trials by different pharmaceutical companies and academic consortia:
This dataset would ideally consist of longitudinal scans of different clinical trials sponsored by pharmaceutical companies and comparable trials of publicly funded research (e.g., LIDC).
A range of disease states, therapeutic interventions, and RECIST-based clinical outcomes should be covered.
All diameter measurements must be available.
A real-life range of image acquisition devices and image acquisition parameters.
Metadata must contain a basic set of additional clinical data that support the clinical case identification and validation.
The location and volumetric assessment of all lesions within each longitudinal acquisition must be established by an FDA-accepted method, like a multi-reader approach.
PRE-CONDITIONS
A presumably more effective biomarker must be proposed for comparison under documented conditions with an existing, explicitly identified, and clinically accepted biomarker.
FLOW OF EVENTS
There are two approaches. Either:
Follow workflow “Measure Correlation of Imaging Biomarkers with Clinical Endpoints” for each of the two biomarkers, the “new” and the “accepted,” and assess the degree to which each (independently) correlates with the desired clinical endpoint. The comparison is framed “indirectly” in terms of how well each correlates. The one that correlates better is said to be superior.
Alternatively, the two biomarkers may be compared directly by following workflow “Measure Correlation of Imaging Biomarkers with Clinical Endpoints” only for the new biomarker and replacing the target of the correlation with the result of following workflow “Create Ground Truth Annotations and/or Manual Seed Points in Reference Data Set” in the Reference Data Sets according to the previously accepted biomarker. The comparison in this case is more direct, with the implication being that the biomarker which calls an event first is considered better.
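As a hedged illustration of the “indirect” comparison just described, the sketch below correlates each biomarker independently with the clinical endpoint and reports which correlates more strongly; the variable names and the choice of Spearman correlation are assumptions made for the example only.

    # Indirect comparison: which biomarker correlates better with the clinical endpoint?
    from scipy.stats import spearmanr

    def indirect_comparison(new_biomarker, accepted_biomarker, clinical_endpoint):
        r_new, p_new = spearmanr(new_biomarker, clinical_endpoint)
        r_old, p_old = spearmanr(accepted_biomarker, clinical_endpoint)
        return {
            "new": {"rho": r_new, "p": p_new},
            "accepted": {"rho": r_old, "p": p_old},
            "new_correlates_better": abs(r_new) > abs(r_old),
        }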
The caveat is that the accepted biomarker may not actually be correct; in fact, it may be that the reason the new biomarker is proposed is to overcome some deficiency in the prior biomarker, so a direct comparison may be inconclusive because the “truth” of the event called is not established, nor is it clear what happens in those cases where one biomarker calls an event but the other does not, or which one is correct in such a case (Fig. 19).
Figure 16: Example Result of Comparing New Biomarker to Previous Accepted Biomarker
POST-CONDITIONS
The relative performance characteristics of the two biomarkers are known, subject to caveats in the approach used.
3.5.3. Formal Registration of Data for Qualification
Sub-communities cooperate to pursue regulatory qualification of a “class” of implementations said to be the biomarker itself. These sub-communities have a critical need to work with as large and diverse a Reference Data Set of imaging data as possible, across multiple viable implementations, to substantiate and characterize the performance of the imaging biomarker independently of a specific implementation, and to use this in the context of the biomarker qualification pathway. This spans a wide range of potentially useful imaging datasets, including synthetic and real clinical scans of phantoms and clinical imaging datasets of patients with and without the disease/condition being measured.
Biomarkers are Qualified by therapeutic review groups in CDER. The process is still evolving, but it generally requires evidence from clinical trials that the biomarker information correlates closely with a clinically meaningful parameter. We have adopted concepts and language from the current FDA process for the qualification of biomarkers24,25,26 to make clear the specifics regarding necessary steps for a sponsoring collaborative to use it for qualification of putative biomarkers (Fig. 20).
Figure 17: The qualification pathway is a collaborative undertaking between sponsors and regulatory agencies. In the figure, activities undertaken by the sponsor are indicated in the left hand column. Activities undertaken by national regulatory agencies (e.g., FDA or European Medicines Agency (EMEA)) are indicated on the right, and documents used to facilitate the communication are indicated in the center. It should be noted that the sponsor in this schematic could be a collaborative enterprise rather than a single commercial entity, to reflect the multistakeholder nature of the activity.
PRE-CONDITIONS
Sufficient data for registration is available.
FLOW OF EVENTS
1. Development of a “Briefing Document” for the regulatory agency that describes all known evidence accumulated that pertains to the imaging biomarker’s qualification and that lays out a plan to complete steps to conclude the qualification process.
.1 Claims (executive summary in the guidance)
.2 Organization and administrative details for the organization sponsoring the qualification (section 2.1 in the guidance)
.3 Clinical context for use and decision trees (section 2.2 in the guidance)
.4 Summary of literature and what is done to date (section 2.3 in the guidance)
.5 Recommendation for completing the qualification (full data package) (section 2.4 in the guidance)
.6 Technical description of how the biomarker is measured (section 2.5 in the guidance)
2. Pursuit of a face-to-face meeting with the regulatory agency Biomarker Qualification Review Team to elicit agency feedback on the plan to complete the “Full Data Package.”
3. Complete the Full Data Package:
.1 Conduct (a) pilot(s) of the meta-analysis to establish capability of the class of tests that represent the biomarker by following workflow “Measure Correlation of Imaging Biomarkers with Clinical Endpoints.”
.2 Conduct the pivotal meta-analysis on the test set (extending the results to the trial level and establishing the achievable generalizability based on available data) by following workflow “Measure Correlation of Imaging Biomarkers with Clinical Endpoints.”
4. Design, test, and implement specifications and services to utilize SDTM datasets, populate the Janus database, and create SDTM materialized views.
5. Utilize the agency’s SDTM Validation Service as it is made publicly available to clinical trial sponsors to enable validation of their data sets using the same criteria to be applied by Janus. This service will not load data into the Janus repository, as it is intended to be used as a preview or check of draft data sets before actual submission to FDA. Once data sets pass the SDTM Validation, sponsors will be able to submit them to Janus via the FDA gateway.
6. Draft guidance on incorporation of the imaging biomarker into clinical trials.
7. Meeting with the Biomarker Qualification Review Team to elicit regulatory agency feedback on the clinical efficacy study results for the purpose of obtaining agency acceptance that the biomarker becomes known as qualified.
8. Utilize the agency’s Load Service to automatically load data sets that are submitted to Janus and “pass” the SDTM validation checks via the integrated SDTM Validation service into the Oracle database.
9. Promote use of the qualified imaging biomarker through education.
POST-CONDITIONS
Biomarker is qualified for use in a known and specified clinical context.
3.6. Commercial Sponsor Prepares Device / Test for Market
Individual workflows are generally understood as building on each other (Fig. 21).
Figure 18: Workflows are presented to highlight how they build on each other.
The following workflows are elaborated.
3.6.1. Organizations Issue “Challenge Problems” to Spur Innovation
One approach to encouraging innovation that has proven productive in many fields is for an organization to announce and administer a public “challenge” whereby a problem statement is given and solutions are solicited from interested parties that “compete” for how well they address the problem statement. The development of image processing algorithms has benefitted from this approach, with many organized activities from a number of groups. Some of these groups are organized by industry (e.g., Medical Image Computing and Computer Assisted Intervention or MICCAI27), academia (e.g., at Cornell University28), or government agencies (e.g., NIST29). This workflow is intended to support such challenges.
With regard to benchmarking, NIST’s Information Technology Laboratory has experience with performance evaluation of software and algorithms for text retrieval, biometrics, face recognition, speech, and motion image quality.
While the evaluation of biomedical change analysis algorithms presents very real challenges in developing suitable assessment methods, the use of benchmarking promises to provide significant insight into algorithms and to contribute to their improvement.
It is important to note that one of the reasons for doing this is to meet the need that a biomarker is defined in part by the “class” of tests available for it. That is, it is not defined by a single test or candidate implementation but rather by an aggregated understanding of potentially several such tests. As such, it is necessary through this or other means to organize activities to determine how the class performs, irrespective of any candidate that purports to be a member of the class. As such, this workflow is related to the “Compliance / Proficiency Testing of Candidate Implementations” workflow, and it may be that an organization such as NIST can both host challenges as well as serve in the trusted broker role, using common infrastructure for these separate but related functions.
Note that this is an aggregate workflow that specializes workflows elaborated earlier in this document.
PRE-CONDITIONS
A well-defined problem space has been described where there are multiple parties that have an interest in offering or developing solutions.
WORK FLOW
1. Organization hosting the challenge prepares a protocol and set of instructions, including data interchange methods, for use by participants.
.1 If not already done, follow workflow “Install and Configure Linked Data Archive Systems.”
.2 Optionally, follow workflow “Extend and Disseminate Ontologies, Templates, and Vocabularies.”
.3 Follow workflow “Import Data to Reference Data Set to Form Reference Data Set” to create a Reference Data Set for use by participants.
.4 Follow workflow “Create Ground Truth Annotations and/or Manual Seed Points in Reference Data Set” to annotate the Reference Data Set.
.5 Follow workflow “Develop Physical and/or Digital Phantom(s).”
.6 Follow workflow “Set up an Experimental Run” to create a Processing Pipeline script for use by participants.
2. Potential participant downloads and reads the protocol and instructions.
3. Participant expresses interest and is cleared for participation.
.1 Participant sets up the informatics tooling associated with the Biomarker Evaluation Framework.
.2 The participant wraps their algorithm to make it compatible with the Application Harness.
.3 Organization hosting the challenge follows workflow “Create and Manage User Accounts, Roles, and Permissions” to register the participant.
4. Participant accesses the following as they have been established for the challenge:
.1 The Processing Pipeline script which has been set up for the challenge.
.2 The Reference Data Set which has been assembled for the challenge.
5. Optionally, participant may follow any of the workflows in “Biomarker Development” to prepare for the challenge.
6. Participant follows workflow “Execute an Experimental Run” to meet the challenge, sharing the results according to the formal submission instructions provided by the organizer.
7. The organizer follows workflow “Analyze an Experimental Run” on submitted entries.
.1 Analyze results according to the challenge objectives;
.2 Provide Participants with individual analysis of their results; and
.3 Publish the results of the evaluation, without publicly identifying individual scores by Participant.
8. If it serves the desired goal, the following workflows could be structured in a challenge format as specializations of this otherwise flexible concept:
.1 Organize the challenge so as to follow the workflow “Measure Correlation of Imaging Biomarkers with Clinical Endpoints”
.2 Organize the challenge so as to follow the workflow “Comparative Evaluation vs. Gold Standards or Otherwise Accepted Biomarkers”
POST-CONDITIONS
Participants have a calibrated understanding of how their solution compares with other solutions. The public (or members of the organization that sponsored the challenge) understands how the “class” of solutions to the problem clusters in terms of performance.
3.6.2. Compliance / Proficiency Testing of Candidate Implementations
A substantial goal of this effort is to provide infrastructure whereby multiple stakeholders collaborate to test hypotheses about the technical feasibility and the medical value of imaging biomarkers. In these examples, the outcome will be an efficient means to collectively gather and analyze a body of evidence, acceptable to regulatory bodies, that can then be utilized by individual entities to make clinical trial management more cost effective than if they had to pursue it individually. Once such imaging biomarkers have been validated and accepted by the community, including the FDA, validation of the individual candidate tests could proceed as a “proficiency test” conducted by a trusted broker rather than requiring each sponsor to individually collect and prove from the ground up.
The approach is to build a body of clinical evidence on the clinical utility of a given biomarker, based on some number of tests to measure it that may be collectively described as a class of acceptable tests for the biomarker. This class is characterized by the Profile as defined in workflow “Team Optimizes Biomarker Using One or More Tests” and supported by the results of either or both of the “Measure Correlation of Imaging Biomarkers with Clinical Endpoints” and/or “Comparative Evaluation vs. Gold Standards or Otherwise Accepted Biomarkers” workflows, regardless of whether formal qualification is sought. The Profile is used to establish targeted levels of performance, with means for formal data selection that allows a batch process to be run on data sequestered by a trusted broker, as requested by commercial entities that wish to obtain a certificate of compliance (formal) or simply an assessment of proficiency as measured with respect to the Profile.
PRE-CONDITIONS
A Profile exists and is associated with a body of clinical evidence sufficient to establish the performance of a class of compliant measurements for the biomarker.
FLOW OF EVENTS
In this case there are two primary actors: the development sponsor and the honest broker.
1. Individual sponsor:
.1 The sponsor needs to identify what clinical indications for use it wishes to have its implementation tested against.
.2 Algorithms included in the imaging test for data and results interpretation must be prespecified before the study data is analyzed. Alteration of the algorithm to better fit the data is generally not acceptable and may invalidate a study.
.3 It needs means to interface its implementation to the broker’s system in a black-box manner such that the broker does not have visibility to proprietary implementation details.
.4 It needs to receive back performance data and supporting documentation capable of being incorporated into regulatory filings at its discretion.
2. Honest broker:
.1 The honest broker needs means to archive data sets that may be selectively accessed according to specific clinical indications and that may be mapped to image quality standards that have been described as so-called “acceptable”, “target”, and “ideal”.
.2 It needs to accept black-box systems for interface and running in batch on selected data sets.
.3 It needs means to set seed points or support other reader interaction in semi-automated scenarios.
.4 It needs to produce documentation regarding results, inclusive of a charge-back mechanism to recover operational costs.
3. The training set will continue to be available and refreshed with new cases for direct access by interested investigators for testing of new imaging software algorithms or clinical hypotheses. Future investigators will have access to the training set for additional studies.
4. Define services whereby the test set is indirectly accessible via the trusted broker.
POST-CONDITIONS
Indication of whether the candidate implementation complies with the Profile (which in turn specifies the targeted performance with respect to clinical context for use).
3.6.3. Formal Registration of Data for Approval or Clearance
Imaging devices are Approved (or Cleared) by CDRH. Recently CDRH has been requiring more evidence of patient benefit before it will approve a new device. An issue particularly relevant to algorithm approval is that a developer needs large numbers of clinical images for algorithm development, and then a statistically valid number of cases to establish performance. Ideally the testing cases should be different from the development cases, and regulatory agencies would like the testing to be carried out by a neutral party to ensure that the results are trusted.
This workflow refers to the activity of implementation teams to seek public compliance certifications or other evaluations that are contributory to their 510(k) and/or PMA applications, so as to leverage data publicly sequestered by an honest broker. There are two objectives in doing so: a) so that individual sponsors don’t need to bear the full cost and time to collect such data themselves; and b) to provide trusted objectivity that their proposed implementation is indeed a compliant member of the class of valid implementations of a biomarker, with performance that meets or exceeds targeted levels of performance recognized by national regulatory organizations. With this body of data in place and structured according to regulatory agency preferences as to format (e.g., in SDTM), it may be referenced as a “master file” contributory to approval or clearance as long as the product has passed through workflow “Compliance / Proficiency Testing of Candidate Implementations” and bears a certificate of compliance.
PRE-CONDITIONS
A Profile exists and is associated with a body of clinical evidence sufficient to establish the performance of a class of compliant measurements for the biomarker. A commercial sponsor has obtained a certificate of compliance for their candidate implementation.
FLOW OF EVENTS
1. Design, test, and implement specifications and services to utilize SDTM datasets, populate the Janus database, and create SDTM materialized views.
2. Utilize the agency’s SDTM Validation Service as it is made publicly available to clinical trial sponsors to enable validation of their data sets using the same criteria to be applied by Janus. This service will not load data into the Janus repository, as it is intended to be used as a preview or check of draft data sets before actual submission to FDA. Once data sets pass the SDTM Validation, sponsors will be able to submit them to Janus via the FDA gateway.
3. Utilize the agency’s Load Service to automatically load data sets that are submitted to Janus and “pass” the SDTM validation checks via the integrated SDTM Validation service into the Oracle database.
POST-CONDITIONS
Data is registered.
4. References
1 Clinical Pharmacology & Therapeutics (2001) 69, 89–95; doi: 10.1067/mcp.2001.113989.
2 Janet Woodcock and Raymond Woosley. The FDA Critical Path Initiative and Its Influence on New Drug Development. Annu. Rev. Med. 2008; 59:1–12.
3 http://www.fda.gov/ScienceResearch/SpecialTopics/CriticalPathInitiative/CriticalPathOpportunitiesReports/ucm077262.htm, accessed 5 January 2010.
4 http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?CFRPart=820&showFR=1, accessed 28 February 2010.
5 http://www.answers.com/topic/phenotype, accessed 17 February 2010.
6 http://www.answers.com/topic/surrogate-endpoint, accessed 17 February 2010.
7 Iturralde, Mario P. (1990). CRC Dictionary and Handbook of Nuclear Medicine and Clinical Imaging. Boca Raton, Fla.: CRC Press, pp. 564, ISBN 0849332338.
8 Giger M. QIBA newsletter, February 2010.
9 Giger M. Update on the potential of computer-aided diagnosis for breast disease. Future Oncol. (2010) 6(1), 1–4.
10 International Organization for Standardization, “Guide to the Expression of Uncertainty in Measurement” (International Organization for Standardization, Geneva), 1993.
11 Joint Committee for Guides in Metrology, “International Vocabulary of Metrology – Basic and General Concepts and Associated Terms” (Bureau International des Poids et Mesures, Paris), 2008.
12 Pan T, et al. “Imaging Data Analysis and Management for Microscopy Images of Diffuse Gliomas,” as presented at the November 2010 TBPT Face-to-Face meeting, Houston, TX.
13 Food and Drug Administration, 21 CFR 820, Quality system regulation. http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?CFRPart=820, accessed 5 January 2010.
14 Fleiss J. The Design and Analysis of Clinical Experiments. Wiley, New York, 1986.
15 Bland M and Altman D. Measuring agreement in method comparison studies. Stat Methods Med Res 1999; 8: 135.
16 Barnhart H and Barboriak D. Applications of the Repeatability of Quantitative Imaging Biomarkers: A Review of Statistical Analysis of Repeat Data Sets. Translational Oncology (2009) 2, 231–235.
17 Pepe MS (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York, NY: Oxford University Press.
18 Zhou Z, Obuchowski N, and McClish D (2002). Statistical Methods in Diagnostic Medicine. New York: Wiley.
19 Chakraborty DP and Berbaum KS. Observer studies involving detection and localization: Modeling, analysis and validation. Medical Physics 31(8), 2313–2330, 2004.
20 Edwards DC, Kupinski MA, Metz CE, and Nishikawa RM. Maximum likelihood fitting of FROC curves under an initial-detection-and-candidate-analysis model. Medical Physics 29(12), 2861–2870 (2002).
21 Bandos A, Rockette H, Song T, Gur D.
Area under the Free-Response ROC Curve (FROC) and a Related Summary Index. Biometrics 2009; 65, 247–256.
22 Sargent DJ et al. Validation of novel imaging methodologies for use as cancer clinical trial endpoints. European Journal of Cancer 45 (2009) 290–299.
23 McCulloch CE, Searle SR. Generalized, Linear and Mixed Models. New York: Wiley; 2000.
24 Goodsaid F and Frueh F. Process map proposal for the validation of genomic biomarkers. Pharmacogenomics (2006) 7(5), 773–782.
25 Goodsaid FM and Frueh FW. Questions and answers about the Pilot Process for Biomarker Qualification at the FDA. Drug Discovery Today, Vol 4, No. 1, 2007.
26 Goodsaid FM et al. Strategic paths for biomarker qualification. Toxicology 245 (2008) 219–223.
27 http://www.grand-challenge.org/index.php/Main_Page, accessed 23 December 2010.
28 http://www.preventcancer.org/uploadedFiles/Education/Conferences,_Workshops,_and_Educational_Programs/Day-2-Friday-1100-AM-Tony-Reeves.pdf, accessed 23 December 2010.
29 http://www.nist.gov/itl/iad/dmg/biochangechallenge.cfm, accessed 23 December 2010.