Monthly Program Update
February 9, 2012
WITH FUNDING SUPPORT PROVIDED BY NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Andrew J. Buckler, MS
Principal Investigator

Agenda
• 10,000 foot view of where we’ve been and where we’re going
• Implementation-independent computational model for quantitative imaging development
• For each QI-Bench app, a demonstration of what exists and a gap analysis relative to model
  – (including first demo of Formulate, and review of major Execute update)

10,000 Foot View
Period | Activity | Test Bed | Developers, Users
2009 -> Winter 2011 | User needs and requirements analysis | |
Spring/Summer 2011 | Initial Execute (including RDSM and BAS) and desktop Analyze (initial prototyping) | | 3, 4
Autumn 2011 | Initial Specify (including QIBO and BiomarkerDB) | Individual time point demonstrator | 5, 10
Winter 2012 | Initial Formulate, and major update to Execute | 3A Pilot | 5, 30
Spring/Summer 2012 | Computational model to drive architecture, and support for semi-automated workflows | 3A Pivotal and longitudinal change demonstrator |
Autumn 2012 -> 2013 | Exercise end-to-end chain with reproducible workflows | |
Entries whose row is not recoverable from the slide text: 1, 2 (various)

Where we left off last month…
[Figure: workflow diagram connecting Specify (QIBO, RDF Triple Store), Formulate (Reference Data Sets), Execute, and Analyze. Example triples shown for CT Volumetry: used_for Therapeutic Efficacy; obtained_by CT; measure_of Tumor growth. Analyze model shown: Y = β0..n + β1(QIB) + β2T + eij]

DNF Model: Data
// Data resources:
RawDataType = ImagingDataType | NonImagingDataType | ClinicalVariableType
CollectedValue = Value + Uncertainty
DataService = { RawData | CollectedValue } // implication being that contents may change over time
ReferenceDataSet = { RawData | CollectedValue } // with fixed refresh policy and documented (controlled) provenance
// Derived from analysis of one or more ReferenceDataSets:
TechnicalPerformance = Uncertainty | CoefficientOfVariation | CoefficientOfReliability | …
ClinicalPerformance = ReceiverOperatingCharacteristic | PPV/NPV | RegressionCoefficient | …
SummaryStatistic = TechnicalPerformance |
ClinicalPerformance

DNF Model: Knowledge
// Managed as Knowledge store:
Relation = subject property object (property object)
BiomarkerDB = { Relation }
// Examples:
  OntologyConcept has Instance
| Biomarker isUsedFor BiologicalUse // “use”
| Biomarker isMeasuredBy AssayMethod // “method”
| AssayMethod usesTemplate AimTemplate // “template”
| AimTemplate includes CollectedValuePrompt // “prompt”
| ClinicalContext appliesTo IndicatedBiology // “biology”
| (AssayMethod targets BiologicalTarget) withStrength TechnicalPerformance
| (Biomarker pertainsTo ClinicalContext) withStrength ClinicalPerformance
| generalizations beyond this

Requirements drive function
// Business Requirements
FNIH, QIBA, and C-Path participants don’t have a way to provide precise specification for context for use and applicable assay methods (to allow semantic labeling):
    BiomarkerDB = Specify (biomarker domain expertise, ontology for labeling);
Researchers and consortia don’t have an ability to exploit existing data resources with high precision and recall:
    ReferenceDataSet+ = Formulate (BiomarkerDB, {DataService} );
Technology developers and contract research organizations don’t have a way to do large-scale quantitative runs:
    ReferenceDataSet.CollectedValue+ = Execute (ReferenceDataSet.RawData);
The community lacks a way to apply definitive statistical analyses of annotation and image markup over specified context for use:
    BiomarkerDB.SummaryStatistic+ = Analyze ( { ReferenceDataSet.CollectedValue } );
Industry lacks standardized ways to report and submit data electronically:
    efiling transactions+ = Package (BiomarkerDB, {ReferenceDataSet} );

Layered requirement: Reproducible Workflows with Documented Provenance
Layered on all five functions above (Specify, Formulate, Execute, Analyze, Package).

Implementation-independent computational model:
// Altogether
efiling transactions = Package (Analyze (Execute (Formulate (Specify (biomarker domain expertise), DataService))));
Computational and informatics design implications:
• It is possible to model the chain as a process to achieve prescribed levels of statistical significance for validity and utility
• It is possible to apply logical and statistical inference to address generalizability of the results across clinical contexts
• Of course, the bottleneck remains availability of data; the purpose here is to define informatics services that make best use of that data as to:
  – How to optimize information content from any given experimental study, and
  – How to incorporate individual study results into a formally defined description of the biomarker acceptable to regulatory agencies.

BiomarkerDB = Specify (domain expertise to describe biomarkers);
Role | Use Case | Tier | Supported Now | Gap vs. Model
Domain expert | Add/edit knowledge | Web App (thin) | Can create triples in local store for local community | Need to create knowledge spanning domains and communities (not just local)
Informaticist | Curate knowledge | Desktop (thick) | Triples can be curated singly; ontology curated in Protégé | Need to edit clusters of triples (not just one at a time); need system support for input to ontology-level curation
IT systems expert | Link ontologies | Server-side | All ontologies on BioPortal are supported | Can’t link across ontologies directly

Project directions for Specify
• Extend the QIBO to link to existing established ontologies
  1. leverage BFO upper ontology to align different ontologies
  2. convert portions of BRIDG and LSDAM to ontology models in OWL
• Automated conversion done in two steps:
  1. convert current Sparx Enterprise Architect XMI to EMF UML format
  2. export resulting EMF UML into an RDF/OWL representation using TopBraid Composer
• Provide GUI to traverse the QIBO concepts according to their relationships and create statements represented as RDF triples and stored in an RDF store.
  • example: “Image is from Patient AND Patient has age >60 AND Patient has Disease where Disease has value Lung Cancer AND Patient has Smoking Status where Smoking Status has value True.” This translates to “Find me all images that are from a patient who is older than 60, diagnosed with lung cancer, and a smoker”.
  • Each set of RDF triples will be stored as a “profile” in Bio2RDF

ReferenceDataSet+ = Formulate (BiomarkerDB, {DataService} );
Role | Use Case | Tier | Supported Now | Gap vs. Model
Domain expert | Find data with high precision and recall | Web App (thin) | Perform saved searches and organize data | Granular, role-based security with single sign-on to both private and public data resources
Informaticist | Form and expose queries to find data | Desktop (thick) | Use UML models | Define queries in terms of RDF triples driven by ontologies (not UML)
IT systems expert | Configure knowledge resources and data services | Server-side | Configure resources that use caGrid | Prepare and support a more flexible method (e.g., SPARQL) so as not to be limited by caGrid

Project directions for Formulate
• Enables users to select the profiles (sets of RDF triples) created in Specify, execute a query, and retrieve the results in various forms.
• Assemble/transform the set of RDF triples to SPARQL queries:
  1. form an uninterrupted chain linking the instance of the input class from the ontology to the desired output class
  2. formulate/invoke necessary SPARQL queries against the web services.
Activities:
• Load models used in Specify to caB2B
• A transformer that will take the source (Specify)
• Load it to caB2B
• Expose the query through caB2B Web Client
• Create a form-based query parameterization utility
Longer term:
• Thick client will assist modification of the query or facilitate reparameterization
Activities:
• End point representation (caGrid Service/SPARQL)
• Target language is not DCQL but SPARQL
• Come up with some SPARQL End Point wrappers for some projects of interest (pilots)
  • Universal wrapper for caGrid services (?)
  • For sure one for DICOM and TCIA/Midas
Activities:
• Results in RDF
• Transformed results in CSV and some form digestible by Midas
  • Directly stored (direct interfacing)
  • Manual load (indirect interfacing)

ReferenceDataSet.CollectedValue+ = Execute (ReferenceDataSet.RawData);
Role | Use Case | Tier | Supported Now | Gap vs. Model
Domain expert | Set up studies to gather imaging results | Web App (thin) | Warehouse data flexibly; initiate automated and (soon) semi-automated runs | More complete automated support for data curation and script editing (current automated support for curation is limited)
Informaticist | Establish metadata standards and define scripted runs | Desktop (thick) | Single time point and (soon) longitudinal change studies on single data types | Build composite markers with multiple parameters per modality and spanning multiple modalities
IT systems expert | Configure data marts of mixed data types | Server-side | Support multiple and mixed types of bit streams | Explicit support for IHE Profiles

Project directions for Execute
• Script to write “Image Formation” content into Biomarker Database for provenance of Reference Data Sets: Application for pulling in data from the “Image Formation” schema to populate the biomarker database. This data will originate from the DICOM imagery imported into QI-Bench.
• Laboratory Protocol for the NBIA Connector and Batch Analysis Service: Laboratory protocol to describe the use of the NBIA Connector and the Image Formation script to import data into QI-Bench, and the use of the Batch Analysis Service for server-side processing.
• Support change analysis biomarkers in serial studies (up to two time points in the current period, extensible to additional time points in subsequent development iterations): Support experiments including at minimum two time points. An example of this is the change in volume or SUV, rather than (only) estimation of the value at one time point.
• Document and solidify the API harness for execution modules of the Batch Analysis Service: This task will include the documentation and complete specification of the BatchMake Application API.
• Support scripted reader studies: Support reader studies through worklist items specified via AIM templates as well as Query/Retrieve via DICOM standards for interaction with reader stations. ClearCanvas will serve as the target reader station for the first implementation.
• Automated support for export of data to Analyze: Including generation of annotation and image markup output from reference algorithms (i.e., LSTK for volumetric CT and Slicer3D for SUV) based on AIM templates instead of the current hardcoded implementation. An AIM Template is an .xsd file.
• Automated support for import of data from Formulate: This task will include refactoring and stabilization of the NBIA Connector in order to incorporate its functionality into Formulate.

BiomarkerDB.SummaryStatistic+ = Analyze ({ReferenceDataSet.CollectedValue});
Role | Use Case | Tier | Supported Now | Gap vs. Model
Domain expert | Run and modify statistical analyses | Web App (thin) | n/a | Create the web app
Informaticist | Configure set of analysis scripts in toolbox | Desktop (thick) | Run relevant calculations at technical performance level | Persist statistical calculation results as N-ary relations in knowledge store, across larger data sets and at clinical level too
IT systems expert | Configure data input and output services | Server-side | Connects to (some) caGrid data services | Connect directly to RDSM and knowledge store (and, through Formulate, a broader set of data services)

Analyze Current Capabilities

Project directions for Analyze
[Figure: architecture diagram. Labels: Desktop GUI; Current MVT (Measurement Variability Tool); XIP LIB; R LIB; Cached objects: AIM/DICOM, etc.; Data Access; AIM, DICOM; Experiment data; Metadata; NonGrid Data Sources; AD Client; Modified XIP Host; Images, Patient info; Annotations, Collections; Experiments; Web Client; AD Server; Data services]
Scope:
• GWT (or Tapestry) UI; create web client version
• Change from DB2 to RESTful service layer; add calculation results to persistent database
• Implemented according to Open Source Development Initiative (OSDI) recommendations, in such a way as to enable the enhancement roadmap, and integrated with projects driving advanced semantics and FDA support

efiling transactions+ = Package (BiomarkerDB, {ReferenceDataSet});
Role | Use Case | Tier | Supported Now | Gap vs. Model
Domain expert | Define and “pull” a data package | Web App (thin) | n/a | Define and “pull” a data package
Informaticist | Define data mappings | Desktop (thick) | n/a | Define data mappings
IT systems expert | Connect to electronic regulatory systems | Server-side | n/a | Connect to electronic regulatory systems
Package is a lower-priority application; the current effort is to participate in vocabulary standards efforts, staging implementation later.

Outlook
Timeline unchanged from the 10,000 Foot View: Winter 2012 delivers initial Formulate and the major update to Execute (3A Pilot); Spring/Summer 2012 brings the computational model to drive architecture and support for semi-automated workflows (3A Pivotal and longitudinal change demonstrator); Autumn 2012 -> 2013 exercises the end-to-end chain with reproducible workflows.

Value proposition of QI-Bench
• Efficiently collect and exploit evidence establishing standards for optimized quantitative imaging:
  – Users want confidence in the read-outs
  – Pharma wants to use them as endpoints
  – Device/SW companies want to market products that produce them without huge costs
  – Public wants to trust the decisions that they contribute to
• By providing a verification framework to develop precompetitive specifications and support test harnesses to curate and utilize reference data
• Doing so as an accessible and open resource facilitates collaboration among diverse stakeholders

Summary: QI-Bench Contributions
• We make it practical to increase the magnitude of data for increased statistical significance.
• We provide practical means to grapple with massive data sets.
• We address the problem of efficient use of resources to assess limits of generalizability.
• We make formal specification accessible to diverse groups of experts who are not skilled or interested in knowledge engineering.
• We map both medical and technical domain expertise into representations well suited to emerging capabilities of the semantic web.
• We enable a mechanism to assess compliance with standards or requirements within specific contexts for use.
• We take a “toolbox” approach to statistical analysis.
• We provide the capability in a manner that is accessible to varying levels of collaborative models, from individual companies or institutions to larger consortia or public-private partnerships to fully open public access.

QI-Bench Structure / Acknowledgements
• Prime: BBMSC (Andrew Buckler, Gary Wernsing, Mike Sperling, Matt Ouellette)
• Co-Investigators
  – Kitware (Rick Avila, Patrick Reynolds, Julien Jomier, Mike Grauer)
  – Stanford (David Paik, Tiffany Ting Liu)
• Financial support as well as technical content: NIST (Mary Brady, Alden Dima, Guillaume Radde)
• Collaborators / Colleagues / Idea Contributors
  – FDA (Nick Petrick, Marios Gavrielides)
  – UCLA (Grace Kim)
  – UMD (Eliot Siegel, Joe Chen, Ganesh Saiprasad)
  – VUmc (Otto Hoekstra)
  – Northwestern (Pat Mongkolwat)
  – Georgetown (Baris Suzek)
• Industry
  – Pharma: Novartis (Stefan Baumann), Merck (Richard Baumgartner)
  – Device/Software: Definiens (Maria Athelogou), Claron Technologies (Ingmar Bitter)
• Coordinating Programs
  – RSNA QIBA (e.g., Dan Sullivan, Binsheng Zhao)
  – Under consideration: CTMM TraIT (Andre Dekker, Jeroen Belien)
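
Backup: the computational model as a code sketch
The chain Package (Analyze (Execute (Formulate (Specify (...))))) can be read as ordinary function composition. The Python sketch below is illustrative only and is not QI-Bench code. Every class, function, and field name is hypothetical and simply mirrors the slide notation; the data services are stand-in dictionaries, the measurement values are made up, and the Formulate selection rule (keep items whose modality is the object of an obtained_by relation) is an assumption chosen to keep the example small. Analyze reports a coefficient of variation, one of the TechnicalPerformance statistics named in the data model.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable

# Hypothetical names throughout; they mirror the slide notation, not a real API.


@dataclass(frozen=True)
class Relation:
    """Relation = subject property object"""
    subject: str
    prop: str
    obj: str


@dataclass(frozen=True)
class CollectedValue:
    """CollectedValue = Value + Uncertainty"""
    value: float
    uncertainty: float


@dataclass
class ReferenceDataSet:
    """ReferenceDataSet = { RawData | CollectedValue }"""
    raw_data: list[str]                 # identifiers of raw items, e.g. image series
    collected: list[CollectedValue]


def specify(expertise: list[tuple[str, str, str]]) -> set[Relation]:
    """BiomarkerDB = Specify (biomarker domain expertise)"""
    return {Relation(s, p, o) for s, p, o in expertise}


def formulate(biomarker_db: set[Relation],
              data_services: list[dict[str, str]]) -> ReferenceDataSet:
    """ReferenceDataSet = Formulate (BiomarkerDB, {DataService})

    Assumed toy rule: keep items whose modality appears as the object
    of an obtained_by relation in the knowledge store.
    """
    modalities = {r.obj for r in biomarker_db if r.prop == "obtained_by"}
    raw = [item_id
           for service in data_services
           for item_id, modality in service.items()
           if modality in modalities]
    return ReferenceDataSet(raw_data=sorted(raw), collected=[])


def execute(rds: ReferenceDataSet,
            measure: Callable[[str], CollectedValue]) -> ReferenceDataSet:
    """ReferenceDataSet.CollectedValue = Execute (ReferenceDataSet.RawData)"""
    return ReferenceDataSet(rds.raw_data, [measure(x) for x in rds.raw_data])


def analyze(rds: ReferenceDataSet) -> dict[str, float]:
    """SummaryStatistic = Analyze ({CollectedValue}); needs at least two values."""
    values = [cv.value for cv in rds.collected]
    return {"CoefficientOfVariation": stdev(values) / mean(values)}


def package(biomarker_db: set[Relation], rds: ReferenceDataSet,
            stats: dict[str, float]) -> dict:
    """efiling transaction = Package (BiomarkerDB, {ReferenceDataSet})"""
    return {"relations": len(biomarker_db),
            "items": len(rds.raw_data),
            "summary": stats}


# Usage: triples taken from the CT Volumetry example; everything else is made up.
db = specify([("CT Volumetry", "used_for", "Therapeutic Efficacy"),
              ("CT Volumetry", "obtained_by", "CT"),
              ("CT Volumetry", "measure_of", "Tumor growth")])
services = [{"series-1": "CT", "series-2": "PET"}, {"series-3": "CT"}]
fake_volumes = {"series-1": 10.0, "series-3": 12.0}

rds = execute(formulate(db, services),
              lambda series_id: CollectedValue(fake_volumes[series_id], 0.5))
filing = package(db, rds, analyze(rds))
```

Keeping each stage a pure function of the previous stage's output is one way to approach the layered requirement of reproducible workflows: the same expertise, data services, and measurement function yield the same filing.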