Panel: The Broader Role of Artificial Intelligence in Large-Scale Scientific Research

Joel Saltz MD, PhD
Professor, Biomedical Informatics, Computer Science
Davis Chair in Cancer Research, The Ohio State University
Visiting Professor, ISI, USC

Translational Research
Biology, biotechnology, bioinformatics
Credit: NIH SPORE Guidelines
Disease mechanism, disease classification, diagnosis, treatment

Imaging, Medical Analysis and Grid Environments (IMAGE), September 16 - 18 2003
• Identify, query, retrieve, and carry out on-demand data product generation directed at collections of data from multiple sites/groups on a given topic; reproduce each group’s data analysis and carry out new analyses on all datasets.
• Should be able to carry out entirely new analyses or to incrementally modify other scientists’ data analyses.
• Should not have to worry about physical location of data or processing.

caBIG – 50+ Grid Enabled Cancer Centers
Federated: Center 1, Center 2, Center 3

Biomedical Research Grids: Types of Information
• Radiological Studies
• Pathology
• Molecular (Proteomics, gene expression)
• Genetic, Epigenetic (SNPs, haplotype analysis)
• Laboratory, pharmacy, outcome data

Example: Some Data types generated by OSU Comprehensive Cancer Center
Shared Resource: Example data types
Molecular Cytogenetics: Datasets from Karyotype analysis, data from SKY/FISH experiments
Analytical Cytometry: FACSCaliber output, raw data from DiVaOption system
Genotyping and Sequencing: DNA sequence information for the sample, primer sequence, PCR sequencing information
Microarray: Output from Affymetrix gene expression and Custom microarray analyses
Mouse Phenotyping: Digitized radiographic, gross, and histologic images, hematology characterization
Real Time-PCR: Description of sample plate, raw and processed output files
Tissue Procurement: Anonymized pathology report, age, gender, race, tissue procurement id, patient id (if consent form is available), consent form, virtual slide of the tissue, if available
Leukemia Tissue Bank: Sample processing date, date sample taken from the patient, accession id, patient id, patient name and last name, specimen type, number of tubes, diagnosis info, protocol #
Proteomics: 1D and 2D gel images, sample information (PI name, analysis method, instrument name), diagnosis, protein expressions for spots
Clinical Trials Office: Protocol descriptions, investigator, title, status, approval processing ids, clinical data for patients on protocols, lab reports, adverse events, and trial outcomes.
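The shared-resource table above, together with the IMAGE goal that a researcher should not have to worry about the physical location of data, suggests a catalogue that can be searched by data type. The following is a minimal Python sketch under stated assumptions: the class names, the `site` labels, and the in-memory list are all hypothetical and are not taken from the slides; a real grid would query remote services rather than a local list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedResource:
    """One row of the table above plus a hypothetical hosting site."""
    name: str
    site: str          # hypothetical label, not from the slides
    data_types: tuple  # free-text data type descriptions


class FederatedCatalog:
    """Toy in-memory catalogue standing in for a grid discovery service."""

    def __init__(self):
        self._resources = []

    def register(self, resource):
        self._resources.append(resource)

    def find(self, keyword):
        """Return sorted names of resources whose data types mention keyword.

        The caller never supplies or sees a site, which is the point:
        location stays an internal detail of the catalogue.
        """
        needle = keyword.lower()
        return sorted(
            r.name
            for r in self._resources
            if any(needle in dt.lower() for dt in r.data_types)
        )


catalog = FederatedCatalog()
catalog.register(SharedResource(
    "Mouse Phenotyping", "site-a",
    ("Digitized radiographic, gross, and histologic images",
     "hematology characterization")))
catalog.register(SharedResource(
    "Proteomics", "site-b",
    ("1D and 2D gel images", "protein expressions for spots")))
catalog.register(SharedResource(
    "Microarray", "site-a",
    ("Output from Affymetrix gene expression and custom microarray analyses",)))
catalog.register(SharedResource(
    "Tissue Procurement", "site-c",
    ("Anonymized pathology report", "virtual slide of the tissue")))

print(catalog.find("images"))
```

A keyword match on free text is the simplest possible stand-in; the later slides on ontologies and metadata argue for matching on controlled vocabulary terms instead.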
caBIG Problem Statement
• Production of data outstripping our ability to analyze it
• The research community may not be aware of other work and datasets
• Researchers may not tag data with the definition of the data they produce
• Semantic information is not often encoded nor included with data sets
• “Data Islands” or “Silos” of information are produced based on the problems outlined above
• A small group of knowledgeable people transmit data amongst themselves
• “Modern” exploratory research requires the integration of disparate databases of biological information to explain results
• To elucidate the mechanism behind disease we must aggregate data from many databases
Peter Corbitz – NCI

Vision
Provide the “Grid For Cancer Research” so that we may:
• Raise awareness of disparate datasets in the biological research community
• Allow research groups to exchange datasets with ease
• Allow research groups to understand the semantics of the datasets that they publish without always having to get on the phone
• Allow for quicker publication of the analysis of integrated data
Peter Corbitz - NCI

Blind Man/Elephant problem: high throughput techniques and molecular imaging are powerful, but each contributes only one piece of the puzzle

Bleeding Edge Pilot Project
Developer Sites Adopter Sites Additional Group Members 6 5 4 Integrative Cancer Research 13 6 5 Tissue Banks and Pathology Tools 2 5 6 Vocabularies and Common Data Elements 7 Clinical Trial Management Domain Workspaces Cross Cutting Workspaces Strategic Level Working Groups Architecture SLWG Members 9 Data Sharing and Intellectual Capital 14 caBIG Strategic Planning 16 Training 10

Example: Ohio State BISTI Center for Grid Enabled Image Analysis: Novartis Molecular Imaging Studies
Prescribed protocol (standardised)
Image processing
• Classification
• Registration
• Pre-post changes
acquisition, analysis, storage
[Figure labels: Site A through Site Z; patients PATa1 to PATan and PATz1 to PATzn, each with study 1 (baseline) through study n (post)] (Knopp)

Example: Genotype Phenotype Correlation
• Genetic, phenotypic data related via phylogenetic tree
• 3473 SNPs among 11 strains of inbred mice
• Tree represents ancestry relationship between strains
• C3HHEJ and DBA2J are mutations associated with high heart rate variability
• Once candidate genotypes are identified, gather additional information about candidates from wrapped data sources
• Integration of Gene-Drug relationships in cancer treatment
(Janies, Knoblock, Khan, Saltz)

Example: Classification of Neuroblastoma
• Decision-tree models current classification system
• Close collaboration with leading Pathologist (Shimada – USC) who developed classifications
• Automate analysis, correlate with molecular, outcome data
• Classification determines treatment
• Children’s Oncology Group. Scope: North America, Australia, New Zealand

Example: Use Case from Abramson Cancer Center
Use Case: A researcher would like to study the error rate in pathological diagnoses of solid tumor samples and compare numerous molecular diagnostic approaches to determine if the molecular diagnostic approach can enhance the accuracy of pathological diagnoses.
Query: I want all solid tumors, specifically for lung cancer, that have a diagnosis based on tumor pathology. Each diagnosis must have an image of the tumor that allows for independent verification of diagnoses. Each record retrieved must also have either proteomics marker data or microarray data (Affy or two-color) included so that different molecular techniques can be correlated to the tumor pathology. In addition, I want all protein annotations for markers and genes associated with the proteomics and microarray data so I can perform meta-analyses.

Issues: Biomedical Grid Architecture
• What metadata needs to be described?
• How to enforce standardization and completeness of metadata description?
• Is it practical for everyone to use the same ontologies?
• If not, how to handle local variations in a controlled manner?
• Are there middleware solutions that can help? (yes!)
• Data grid techniques for query, management of very large grid-based datasets
• Role for immutable grid-based datatypes?

Issues: Biomedical Grid Architecture
• Distributed Ontology Management
– Many sites, many types of complex data
– Sites need to have freedom to create local ontology variants in a controlled manner
– Systematic methods for controlled management, query of ontology variants
• Heuristic datatyping
• Data quality control
– Well defined structure (e.g. XML schema) + “sanity checks” to check accuracy and completeness of metadata, e.g. are data values consistent with what would be expected in an Affymetrix gene expression dataset?

Issues: In Silico Research - Not Your Father’s Datamining
Example: Predict clinical outcome
• Goal: optimize function F that predicts outcome by combining clinical, molecular, image data
• Molecular, image data in turn need to be interpreted and analyzed
• Need to find image analysis functions Gi and molecular data analysis functions Hi that make it possible to best predict outcome
• Functions Gi and Hi make use of domain specific knowledge (e.g. phylogenetic trees, histologic classifications, pathways)

Issues: Incorporating ad-hoc data sources: Information Integration
• Not all data sources will post precise metadata definitions, ontologies etc
• Biomedical researchers should be able to use “nonconforming” data
• Goal is to develop a system that automates the integration of data sources in a way that is easy and natural
• Wrap datasources, define metadata, ontologies that allow data integration
• Middleware to cache ad-hoc data and to make ad-hoc information a first class grid citizen

Issues: Security Requirements
• Patients can give consent for some data for some studies or classes of studies
• Researchers need to be able to control access to data
• IRBs can approve release of identified data to some individuals
• IRBs can specify how deidentification is to be carried out and when deidentified data can be released
• Cooperative study may have different IRB-dictated constraints at different sites
• Individuals associated with a given study may have different roles with different data access permissions
• Access requests and successful accesses must be logged

Issues: Security Requirements
• Need ontology based description of roles, ownership, permissions
• Validate correctness of description
• Conditions should lead to expected results (i.e. if one specifies rules, does one end up with counterintuitive results?)

Thanks!

Mobius
• Middleware system that provides support for management of metadata definitions (defined as XML schemas) and efficient storage and retrieval of data instances in a distributed environment.
• Mechanism for data driven applications to cache, share, and asynchronously communicate data in a distributed environment
• Grid based distributed, searchable, and shareable persistent storage
• Infrastructure for grid coordination language

Global Model Exchange
• Store and link data models defined inside namespaces in grid.
• Enables other services to publish, retrieve, discover, remove, and version metadata definitions
• Services composed in a DNS-like architecture representing parent-child namespace hierarchy
• When a schema is registered in GME, it is stored under the name and namespace specified by the application; schema is assigned a version number

Finding candidate genes
• Complex, multi-factorial diseases
– CAD, diabetes, schizophrenia
– Long candidate lists
• Variations between individuals in their response to treatments
– Non-responders to statin treatment
GOAL: Link phenotypes to genotypes

Genotype <> Phenotype
APPROACH: establish QTLs by correlating changes in genotype with changes in phenotype
• Genotype: SNPs
– Use in mapping: Haplotype blocks
– Functional: come in many flavors (cis-regulatory, nonsynonymous, mRNA stability)
• Phenotypes: Mice
– Inbred strains provide well characterized genotypes
– Publicly available quantitative phenotype data (http://www.jax.org/phenome/)

BISP Demonstration: Mouse Phenotype Trait Query
[Figure labels: Mouse Phenotype/SNP Analysis processing; Distributed Data Storage (Mako, Mako, Mako); Distributed query; Virtual Mako Service; Mobius; internet; Trait query; trait; CERF Service Delivery System; SNP IDs; HUGO IDs, BLAST Alignments; CERF Server; MouseTraitQueryResult (resource instance); Action: ViewGoMiner; GOMiner; Action: View; CERF Client; HTML Browser]

BRTT Demonstrations – June 21, 2000
Mouse SNP-Trait Association Demonstration
Execute query
• A CERF Action/Service binding allows the query to be executed; user prompted for the name of a Trait (HDL6).

Image Data Processing
[Figure labels: 40,000 pixels; Digitized Microscopy; 40,000 pixels; DCE-MRI Studies]
Visualization of Terabyte Scale, Multiresolution Data

Image Analysis Middleware Framework
• The Distributed Metadata and Data Management service to keep track of workflows, image datasets, and analysis results.
– Manage metadata associated with images, analysis results, annotations in a distributed environment.
– Federate databases of image, clinical, and molecular data in a distributed environment
• The Image Data Storage Service to manage the storage resources on the server and encapsulate efficient storage methods for image datasets.
– Create and maintain large-scale, on-line databases of images (from gigabytes to multiple terabytes in size) on disk-based storage clusters.
• The Distributed Execution Service to support on-demand analysis of image data
– Execute simple and complex image analysis operations and workflows on distributed collections of images.
– Integrate data retrieval and processing on commodity clusters and multiprocessor machines

BRTT Demonstrations – June 21, 2000
Rat Placenta Microscopy Image Demonstration
Search for Images of Interest

Prototype based on caCORE
cancer Common Ontologic Representation Environment (caCORE)
• caCORE is the technology stack that facilitates data integration across multiple scientific disciplines
• Enterprise Vocabulary Services (EVS)
• Cancer Data Standards Repository (caDSR)
• Cancer Bioinformatics Infrastructure Objects (caBIO)

caGRID Core architecture
[Figure labels: caGRID Extension (Integration of Discovery and Query Services); Client; OGSA-DAI + Globus; caGRID extension (Concept Discovery); caGRID extension (Federated Query); OGSA-DAI; caGRID extension (metadata); caGRID extension (query); Grid; Strongly typed XML transport, Metadata Management (Mobius); Data Source; caBIO server; Globus]

Petabyte sized rotating storage archives are no longer hypothetical

Ohio Supercomputing Center Mass Storage Testbed
[Figure labels: LinTel boxes (PvFS / Active Disk Archive) (20); MetaData Servers (2); 890 MB/s throughput; (40 - 2 per xSeries); 10 GB/s; (40 - 2 per T600); 384 MB/s throughput; Cisco Directors 9509; (4) 772 MB/s throughput; FAStT600 Turbo (20); Scratch / Archive Storage Pool (310/420 TB); SAN Volume Controller (4 servers); FAStT900 (4); Core Storage Pool (35/50 TB) with SAN.FS; Backup Storage: 3584 Tape, 1 L32, 2 D32]
Actual: 640 cartridges @ 200 GB for a total of 128 TB; 4 drives; max drive data rate is 35 MB/s
• 50 TB of performance storage
– Home directories, project storage space, and long-term frequently accessed files.
• 420 TB of performance/capacity storage
– Active Disk Cache: compute jobs that require directly connected storage, parallel file systems, and scratch space.
– Large temporary holding area
• 128 TB tape library
– Backups and long-term "offline" storage
IBM’s Storage Tank technology combined with TFN connections will allow large data sets to be moved throughout the state with increased redundancy and seamless delivery.
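
The tape library figures above (640 cartridges @ 200 GB, 4 drives, 35 MB/s maximum drive data rate) can be checked with a short calculation. The Python sketch below rests on assumptions that the slides do not state: decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), 35 MB/s applying to each drive, and all 4 drives streaming concurrently at that maximum rate with no mount, seek, or robot time. The function names are illustrative only.

```python
MB = 10**6   # decimal megabyte, an assumption
GB = 10**9   # decimal gigabyte, an assumption
TB = 10**12  # decimal terabyte, an assumption


def library_capacity_bytes(cartridges, cartridge_gb):
    """Total capacity of the tape library in bytes."""
    return cartridges * cartridge_gb * GB


def full_read_seconds(capacity_bytes, drives, drive_mb_per_s):
    """Lower bound on the time to stream the whole library once.

    Assumes every drive runs in parallel at its maximum rate and
    ignores cartridge mount and positioning time.
    """
    aggregate_bytes_per_s = drives * drive_mb_per_s * MB
    return capacity_bytes / aggregate_bytes_per_s


capacity = library_capacity_bytes(cartridges=640, cartridge_gb=200)
seconds = full_read_seconds(capacity, drives=4, drive_mb_per_s=35)
days = seconds / 86400

print(f"capacity: {capacity / TB:.0f} TB")
print(f"aggregate drive rate: {4 * 35} MB/s")
print(f"full read, best case: {days:.1f} days")
```

Under these assumptions the capacity matches the stated 128 TB, and a single full pass over the library takes on the order of ten days, which fits the slide's description of the tape tier as backup and long-term "offline" storage rather than working storage.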