Panel: The Broader Role of Artificial Intelligence in Large-Scale Scientific Research

Joel Saltz, MD, PhD
Professor of Biomedical Informatics and Computer Science
Davis Chair in Cancer Research
The Ohio State University
Visiting Professor, ISI, USC
Translational Research
[Diagram: biology, biotechnology, and bioinformatics feeding into disease mechanism, disease classification, diagnosis, and treatment. Credit: NIH SPORE Guidelines.]
Imaging, Medical Analysis and Grid Environments (IMAGE), September 16-18, 2003
Identify, query, retrieve, and carry out on-demand data product generation directed at collections of data from multiple sites and groups on a given topic; reproduce each group's data analysis and carry out new analyses on all datasets. Users should be able to carry out entirely new analyses or incrementally modify other scientists' data analyses, and should not have to worry about the physical location of data or processing….
caBIG: 50+ Grid-Enabled Cancer Centers
[Diagram: a federated grid linking Center 1, Center 2, Center 3, and so on.]
Biomedical Research Grids: Types of Information
• Radiological studies
• Pathology
• Molecular (proteomics, gene expression)
• Genetic, epigenetic (SNPs, haplotype analysis)
• Laboratory, pharmacy, outcome data
Example: Some data types generated by the OSU Comprehensive Cancer Center

Shared Resource: example data types
• Molecular Cytogenetics: datasets from karyotype analysis, data from SKY/FISH experiments
• Analytical Cytometry: FACSCalibur output, raw data from the DiVa Option system
• Genotyping and Sequencing: DNA sequence information for the sample, primer sequence, PCR sequencing information
• Microarray: output from Affymetrix gene expression and custom microarray analyses
• Mouse Phenotyping: digitized radiographic, gross, and histologic images; hematology characterization
• Real-Time PCR: description of sample plate, raw and processed output files
• Tissue Procurement: anonymized pathology report, age, gender, race, tissue procurement ID, patient ID (if a consent form is available), consent form, and a virtual slide of the tissue if available
• Leukemia Tissue Bank: sample processing date, date the sample was taken from the patient, accession ID, patient ID, patient first and last name, specimen type, number of tubes, diagnosis info, protocol #
• Proteomics: 1D and 2D gel images, sample information (PI name, analysis method, instrument name), diagnosis, protein expression for spots
• Clinical Trials Office: protocol descriptions, investigator, title, status, approval processing IDs, clinical data for patients on protocols, lab reports, adverse events, and trial outcomes
caBIG Problem Statement
• Production of data is outstripping our ability to analyze it
• The research community may not be aware of other work and datasets
• Researchers may not tag data with the definition of the data they produce
• Semantic information is often not encoded or included with datasets
• "Data islands" or "silos" of information are produced as a result of the problems above
• A small group of knowledgeable people transmit data among themselves
• "Modern" exploratory research requires the integration of disparate databases of biological information to explain results
• To elucidate the mechanisms behind disease, we must aggregate data from many databases
(Peter Covitz, NCI)
Vision
• Provide the "Grid for Cancer Research" so that we may:
  • Raise awareness of disparate datasets in the biological research community
  • Allow research groups to exchange datasets with ease
  • Allow research groups to understand the semantics of the datasets that they publish without always having to get on the phone
  • Allow for quicker publication of the analysis of integrated data
(Peter Covitz, NCI)

• Blind man/elephant problem: high-throughput techniques and molecular imaging are powerful, but each contributes only a piece of the puzzle
Bleeding Edge Pilot Project
[Chart: caBIG organization. Domain Workspaces (Clinical Trial Management; Integrative Cancer Research; Tissue Banks and Pathology Tools) and Cross-Cutting Workspaces (Vocabularies and Common Data Elements; Architecture), each with counts of developer sites, adopter sites, and additional group members. Strategic Level Working Groups with member counts: Architecture SLWG (9), Data Sharing and Intellectual Capital (14), caBIG Strategic Planning (16), Training (10).]
Example: Ohio State BISTI Center for Grid-Enabled Image Analysis: Novartis Molecular Imaging Studies
• Prescribed protocol (standardized)
• Image processing: classification, registration, pre/post changes
• Acquisition, analysis, storage
[Diagram: Sites A through Z; each site holds patients (PATa1 … PATan, …, PATz1 … PATzn), each with study 1 (baseline) through study n (post).]
(Knopp)
Example: Genotype-Phenotype Correlation
• Genetic and phenotypic data are related via a phylogenetic tree
• 3473 SNPs among 11 strains of inbred mice
• The tree represents the ancestry relationship between strains
• C3H/HeJ and DBA/2J carry mutations associated with high heart rate variability (a sketch of this kind of correlation follows the list)
• Once candidate genotypes are identified, gather additional information about candidates from wrapped data sources
• Integration of gene-drug relationships in cancer treatment
(Janies, Knoblock, Khan, Saltz)
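To make the idea concrete, here is a minimal Python sketch of strain-level genotype-phenotype scoring. The strain names are real inbred lines, but the alleles, trait values, and SNP IDs are invented for illustration; this is not the project's actual analysis.

```python
# Minimal sketch: score SNPs by how well their alleles separate inbred
# strains with high vs. low heart rate variability (HRV). Strain names
# are real inbred lines, but alleles and trait values are invented.
strains = ["C3H/HeJ", "DBA/2J", "C57BL/6J", "A/J", "BALB/cJ"]
hrv = {"C3H/HeJ": 9.1, "DBA/2J": 8.7, "C57BL/6J": 4.2,
       "A/J": 3.9, "BALB/cJ": 4.5}  # hypothetical trait values

# Hypothetical SNP genotypes (one allele per strain).
snps = {
    "snp_001": {"C3H/HeJ": "A", "DBA/2J": "A", "C57BL/6J": "G",
                "A/J": "G", "BALB/cJ": "G"},
    "snp_002": {"C3H/HeJ": "T", "DBA/2J": "C", "C57BL/6J": "T",
                "A/J": "C", "BALB/cJ": "T"},
}

def allele_effect(genotypes, trait):
    """Difference in mean trait value between the two allele groups."""
    alleles = set(genotypes.values())
    if len(alleles) != 2:
        return 0.0
    a, b = sorted(alleles)
    grp_a = [trait[s] for s, g in genotypes.items() if g == a]
    grp_b = [trait[s] for s, g in genotypes.items() if g == b]
    return abs(sum(grp_a) / len(grp_a) - sum(grp_b) / len(grp_b))

# Rank candidate SNPs by effect size; top hits become candidates for
# annotation lookups in wrapped external data sources.
for snp_id, geno in sorted(snps.items(),
                           key=lambda kv: -allele_effect(kv[1], hrv)):
    print(snp_id, round(allele_effect(geno, hrv), 2))
```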
Example: Classification of Neuroblastoma
• A decision tree models the current classification system (see the sketch after this list)
• Close collaboration with the leading pathologist (Shimada, USC) who developed the classifications
• Automate analysis; correlate with molecular and outcome data
• Classification determines treatment
• Children's Oncology Group
• Scope: North America, Australia, New Zealand
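As a hedged illustration of decision-tree modeling of a classification system, the sketch below trains a tiny tree with scikit-learn. The feature names (mitosis-karyorrhexis index, percent differentiating cells, age), values, and labels are invented stand-ins, not the Shimada criteria.

```python
# Minimal sketch: a decision tree over hypothetical histologic features
# standing in for an automated neuroblastoma classification step.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: mitosis-karyorrhexis index, % differentiating cells, age (yrs).
# Values and labels are invented for illustration only.
X = [[120, 2, 0.8], [250, 1, 3.0], [40, 12, 1.5], [30, 20, 0.5]]
y = ["unfavorable", "unfavorable", "favorable", "favorable"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(clf, feature_names=["mki", "pct_diff", "age"]))
print(clf.predict([[60, 10, 1.0]]))  # classify a new case
```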
Example: Use Case from Abramson Cancer Center
• Use case: a researcher would like to study the error rate in pathological diagnoses of solid tumor samples and compare numerous molecular diagnostic approaches to determine if the molecular diagnostic approach can enhance the accuracy of pathological diagnoses.
• Query: "I want all solid tumors, specifically for lung cancer, that have a diagnosis based on tumor pathology. Each diagnosis must have an image of the tumor that allows for independent verification of diagnoses. Each record retrieved must also have either proteomics marker data or microarray data (Affy or two-color) included so that different molecular techniques can be correlated to the tumor pathology. In addition, I want all protein annotations for markers and genes associated with the proteomics and microarray data so I can perform meta-analyses." (A hedged filter sketch follows.)
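If this query were answered locally, its predicate might look like the following Python sketch. All field names and records are invented; a real caBIG query would be issued through grid data services rather than in-memory filtering.

```python
# Hedged sketch: the Abramson use case expressed as a filter over a
# hypothetical federated record set (field names are invented).
def matches(rec):
    return (rec.get("tumor_type") == "solid"
            and rec.get("site") == "lung"
            and rec.get("diagnosis_basis") == "pathology"
            and "tumor_image" in rec
            and ("proteomics_markers" in rec or "microarray" in rec))

records = [
    {"tumor_type": "solid", "site": "lung", "diagnosis_basis": "pathology",
     "tumor_image": "slide_0042.tif",
     "microarray": {"platform": "Affymetrix", "values": "..."}},
    {"tumor_type": "solid", "site": "colon", "diagnosis_basis": "pathology"},
]

hits = [r for r in records if matches(r)]
# For each hit, a second round of queries would pull protein and gene
# annotations for the associated markers to support meta-analysis.
print(len(hits), "matching records")
```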
Issues: Biomedical Grid Architecture
• What metadata needs to be described?
• How do we enforce standardization and completeness of metadata descriptions?
• Is it practical for everyone to use the same ontologies?
  • If not, how do we handle local variations in a controlled manner?
  • Are there middleware solutions that can help? (Yes!)
• Data grid techniques for query and management of very large grid-based datasets
• Is there a role for immutable grid-based datatypes?
Issues: Biomedical Grid Architecture
• Distributed ontology management
  • Many sites, many types of complex data
  • Sites need the freedom to create local ontology variants in a controlled manner
  • Systematic methods for controlled management and query of ontology variants
• Heuristic datatyping
  • Data quality control
  • Well-defined structure (e.g., an XML schema) plus "sanity checks" to verify the accuracy and completeness of metadata (a sketch follows below)
  • E.g., are data values consistent with what would be expected in an Affymetrix gene expression dataset?
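Such a sanity check might look like the sketch below: after schema validation, confirm that values claimed to be Affymetrix-style expression data have a plausible shape and range. The field names and thresholds are illustrative assumptions.

```python
# Minimal sketch of heuristic datatyping: beyond schema validation,
# check that values are plausible for the declared datatype.
# Field names and plausibility thresholds are illustrative assumptions.
def sanity_check_affy(dataset):
    errors = []
    probes = dataset.get("probe_values")
    if not isinstance(probes, dict) or not probes:
        errors.append("missing or empty probe_values")
        return errors
    for probe_id, value in probes.items():
        if not isinstance(value, (int, float)):
            errors.append(f"{probe_id}: non-numeric value {value!r}")
        elif value < 0:
            errors.append(f"{probe_id}: negative intensity {value}")
    if len(probes) < 1000:  # real arrays have tens of thousands of probes
        errors.append(f"only {len(probes)} probes; expected far more")
    return errors

suspect = {"probe_values": {"100_at": 523.1, "101_at": -4.0}}
for err in sanity_check_affy(suspect):
    print("sanity check failed:", err)
```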
Issues: In Silico Research: Not Your Father's Data Mining
Example: predict clinical outcome
• Goal: optimize a function F that predicts outcome by combining clinical, molecular, and image data
• Molecular and image data in turn need to be interpreted and analyzed
• Need to find image analysis functions Gi and molecular data analysis functions Hi that make it possible to best predict outcome
• Functions Gi and Hi make use of domain-specific knowledge (e.g., phylogenetic trees, histologic classifications, pathways); a toy sketch of the nested structure follows
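A toy rendering of the nested structure, outcome = F(clinical, G(image), H(molecular)). All weights, features, and functions here are invented; the point is only that G and H must themselves be searched over, not just F.

```python
# Toy sketch of the nested structure: outcome = F(clinical, G(image),
# H(molecular)). All functions and data are stand-ins.
def G(image):          # image analysis: e.g. a tumor-fraction feature
    return image["tumor_area"] / image["tissue_area"]

def H(molecular):      # molecular analysis: e.g. a pathway-level score
    return sum(molecular["pathway_expr"]) / len(molecular["pathway_expr"])

def F(clinical, g, h): # outcome predictor combining derived features
    return 0.5 * clinical["stage"] + 0.3 * g + 0.2 * h  # assumed weights

patient = {
    "clinical": {"stage": 3},
    "image": {"tumor_area": 12.0, "tissue_area": 40.0},
    "molecular": {"pathway_expr": [2.1, 0.7, 1.4]},
}
risk = F(patient["clinical"], G(patient["image"]), H(patient["molecular"]))
print(round(risk, 3))
```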
Issues: Incorporating Ad-Hoc Data Sources: Information Integration
• Not all data sources will post precise metadata definitions, ontologies, etc.
• Biomedical researchers should be able to use "nonconforming" data
• The goal is a system that automates the integration of data sources in a way that is easy and natural
• Wrap data sources; define metadata and ontologies that allow data integration (a wrapper sketch follows)
• Middleware to cache ad-hoc data and to make ad-hoc information a first-class grid citizen
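A wrapper in this spirit might normalize a nonconforming source into a common record shape and attach the metadata the grid expects. Everything here (class, fields, schema name) is a hypothetical illustration.

```python
# Minimal wrapper sketch: adapt a nonconforming source to a common
# record shape plus grid metadata. Names and fields are illustrative.
class CsvTraitSource:
    """Wraps a local CSV-like source so it can answer grid queries."""
    metadata = {"schema": "mouse_trait_v1", "ontology": "assumed-local"}

    def __init__(self, rows):
        self.rows = rows  # e.g. parsed from an ad-hoc CSV file

    def query(self, trait):
        for row in self.rows:
            if row["trait"] == trait:
                yield {"strain": row["strain"], "trait": trait,
                       "value": float(row["value"]),
                       "_metadata": self.metadata}

src = CsvTraitSource([{"strain": "C3H/HeJ", "trait": "HDL6", "value": "87"}])
print(list(src.query("HDL6")))
```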
Issues: Security Requirements
• Patients can give consent for some data for some studies or classes of studies
• Researchers need to be able to control access to data
• IRBs can approve release of identified data to some individuals
• IRBs can specify how deidentification is to be carried out and when deidentified data can be released
• A cooperative study may have different IRB-dictated constraints at different sites
• Individuals associated with a given study may have different roles with different data access permissions
• Access requests and successful accesses must be logged (see the sketch after this list)
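These requirements suggest an access check keyed on role, study, and consent, with every request logged. A minimal sketch, with the policy layout assumed rather than drawn from any caBIG specification:

```python
# Minimal sketch of role/consent-based access control with audit
# logging. The policy layout is an assumption, not a caBIG mechanism.
import logging

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("access-audit")

policy = {  # (study, role) -> data classes that role may see
    ("study42", "investigator"): {"deidentified", "identified"},
    ("study42", "statistician"): {"deidentified"},
}
consent = {("patient7", "study42"): True}  # patient consent per study

def request(user, role, study, patient, data_class):
    allowed = (consent.get((patient, study), False)
               and data_class in policy.get((study, role), set()))
    audit.info("user=%s role=%s study=%s patient=%s class=%s allowed=%s",
               user, role, study, patient, data_class, allowed)
    return allowed

print(request("alice", "statistician", "study42", "patient7", "identified"))
```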
Issues: Security Requirements
• Need an ontology-based description of roles, ownership, and permissions
• Validate the correctness of the description
• Conditions should lead to expected results (i.e., if one specifies rules, does one end up with counterintuitive results?); a validation sketch follows
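Validation can start as simply as scanning the rule set for consequences a policy author would likely find counterintuitive. A sketch, with an assumed rule format:

```python
# Sketch: check a role/permission description for counterintuitive
# consequences, e.g. a role granted identified data but not the
# deidentified form. The rule format is an illustrative assumption.
rules = {
    "investigator": {"identified", "deidentified"},
    "data_entry": {"identified"},  # suspicious: identified-only access
}

def validate(rules):
    problems = []
    for role, grants in rules.items():
        if "identified" in grants and "deidentified" not in grants:
            problems.append(f"{role}: sees identified but not deidentified data")
    return problems

print(validate(rules))
```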
Thanks!
Mobius
• A middleware system that provides support for management of metadata definitions (defined as XML schemas) and efficient storage and retrieval of data instances in a distributed environment
• A mechanism for data-driven applications to cache, share, and asynchronously communicate data in a distributed environment
• Grid-based distributed, searchable, and shareable persistent storage
• Infrastructure for a grid coordination language
Global Model Exchange
• Stores and links data models defined inside namespaces in the grid
• Enables other services to publish, retrieve, discover, remove, and version metadata definitions
• Services are composed in a DNS-like architecture representing the parent-child namespace hierarchy
• When a schema is registered in the GME, it is stored under the name and namespace specified by the application, and the schema is assigned a version number (a registry sketch follows)
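The registration and versioning semantics described above can be mimicked by a tiny in-memory registry. This is a sketch of the described behavior, not the Mobius/GME API.

```python
# Sketch of GME-style schema registration: schemas live under a
# namespace/name pair and receive monotonically increasing versions.
class ModelRegistry:
    def __init__(self):
        self.models = {}  # (namespace, name) -> list of schema versions

    def publish(self, namespace, name, xml_schema):
        versions = self.models.setdefault((namespace, name), [])
        versions.append(xml_schema)
        return len(versions)  # assigned version number

    def retrieve(self, namespace, name, version=None):
        versions = self.models[(namespace, name)]
        return versions[-1 if version is None else version - 1]

gme = ModelRegistry()
v = gme.publish("edu.osu.bmi", "MouseTrait", "<xs:schema>...</xs:schema>")
print(v, gme.retrieve("edu.osu.bmi", "MouseTrait"))
```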
Finding Candidate Genes
• Complex, multi-factorial diseases: CAD, diabetes, schizophrenia
• Long candidate lists
• Variation between individuals in their response to treatments: non-responders to statin treatment
• GOAL: link phenotypes to genotypes
Genotype <> Phenotype
• APPROACH: establish QTLs by correlating changes in genotype with changes in phenotype (a sketch follows this list)
• Genotype: SNPs
  • Use in mapping: haplotype blocks
  • Functional SNPs come in many flavors: cis-regulatory, nonsynonymous, mRNA stability
• Phenotypes: mice
  • Inbred strains provide well-characterized genotypes
  • Publicly available quantitative phenotype data (http://www.jax.org/phenome/)
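In this framing, a crude QTL scan compares phenotype distributions between allele groups at each SNP across the inbred strains. A sketch with invented genotypes and trait values; real data would come from the Mouse Phenome Database.

```python
# Sketch of a crude QTL scan: for each SNP, compare the phenotype
# distribution between allele groups across inbred strains. Genotypes
# and phenotype values are invented for illustration.
from scipy.stats import ttest_ind

phenotype = [4.1, 3.8, 9.0, 8.6, 4.4, 9.2]   # one value per strain
genotype = ["G", "G", "A", "A", "G", "A"]     # alleles at one SNP

group_a = [p for p, g in zip(phenotype, genotype) if g == "A"]
group_g = [p for p, g in zip(phenotype, genotype) if g == "G"]
stat, pval = ttest_ind(group_a, group_g)
print(f"t={stat:.2f}, p={pval:.4f}")  # small p suggests a candidate QTL
```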
BISP Demonstration: Mouse Phenotype Trait Query
[Diagram: mouse phenotype/SNP analysis pipeline. A trait query from the CERF Service Delivery System travels over the internet to a Virtual Mako Service, which issues a distributed query across Mako nodes (distributed data storage, Mobius). SNP IDs, HUGO IDs, and BLAST alignments flow back to the CERF server as a MouseTraitQueryResult resource instance; the CERF client can render it in an HTML browser (Action: View) or hand it to GoMiner (Action: ViewGoMiner).]
BRTT Demonstrations – June 21, 2000
Mouse SNP-Trait Association Demonstration
• Execute query: a CERF Action/Service binding allows the query to be executed; the user is prompted for the name of a trait (HDL6).
Image Data Processing
[Figure: digitized microscopy images of 40,000 × 40,000 pixels; DCE-MRI studies; visualization of terabyte-scale, multiresolution data.]
Image Analysis Middleware Framework
• The Distributed Metadata and Data Management Service keeps track of workflows, image datasets, and analysis results
  • Manages metadata associated with images, analysis results, and annotations in a distributed environment
  • Federates databases of image, clinical, and molecular data in a distributed environment
• The Image Data Storage Service manages the storage resources on the server and encapsulates efficient storage methods for image datasets
  • Creates and maintains large-scale, online databases of images (from gigabytes to multiple terabytes in size) on disk-based storage clusters
• The Distributed Execution Service supports on-demand analysis of image data (a caricature follows below)
  • Executes simple and complex image analysis operations and workflows on distributed collections of images
  • Integrates data retrieval and processing on commodity clusters and multiprocessor machines
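The distributed-execution idea, pushing a per-chunk operation out to where the image data lives and combining partial results rather than moving whole multi-gigabyte images, can be caricatured in a few lines. The chunking scheme and the operation are assumptions.

```python
# Caricature of the Distributed Execution Service idea: run an analysis
# over distributed image chunks and combine partial results, instead of
# moving whole images. Chunking and the operation are illustrative.
from concurrent.futures import ProcessPoolExecutor

def analyze_chunk(chunk):
    """Per-chunk work, e.g. count pixels above a stain threshold."""
    return sum(1 for px in chunk if px > 200)

def analyze_image(chunks):
    with ProcessPoolExecutor() as pool:  # stand-in for cluster nodes
        return sum(pool.map(analyze_chunk, chunks))

if __name__ == "__main__":
    chunks = [[10, 250, 240], [5, 255], [230, 90]]  # toy pixel chunks
    print(analyze_image(chunks))
```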
BRTT Demonstrations – June 21, 2000
Rat Placenta Microscopy Image Demonstration
Search for Images of Interest
Prototype Based on caCORE
• cancer Common Ontologic Representation Environment (caCORE)
• caCORE is the technology stack that facilitates data integration across multiple scientific disciplines
• Components: Enterprise Vocabulary Services (EVS), Cancer Data Standards Repository (caDSR), Cancer Bioinformatics Infrastructure Objects (caBIO)
caGRID Core Architecture
[Diagram: caGRID extensions integrating discovery and query services. A client built on OGSA-DAI and Globus uses caGRID extensions for concept discovery and federated query; on the grid side, caGRID extensions for metadata and query sit above OGSA-DAI, with strongly typed XML transport and metadata management provided by Mobius; data sources are exposed through a caBIO server running on Globus.]
Petabyte-sized rotating storage archives are no longer hypothetical

Ohio Supercomputing Center Mass Storage Testbed
[Diagram: 20 LinTel boxes (PVFS/Active Disk Archive; 40 disks, 2 per xSeries) delivering 890 MB/s through 2 metadata servers and 10 GB/s aggregate throughput; Cisco 9509 directors; 20 FAStT600 Turbo arrays (40 disks, 2 per T600; 384 MB/s throughput) backing a 310/420 TB scratch/archive storage pool over 772 MB/s links; a SAN Volume Controller (4 servers) and 4 FAStT900 arrays backing a 35/50 TB core storage pool with SAN.FS; backup storage on a 3584 tape library (1 L32 and 2 D32 frames; 640 cartridges at 200 GB each for 128 TB total; 4 drives with a maximum drive data rate of 35 MB/s).]
• 50 TB of performance storage: home directories, project storage space, and long-term frequently accessed files
• 420 TB of performance/capacity storage: Active Disk Cache for compute jobs that require directly connected storage; parallel file systems and scratch space; a large temporary holding area
• 128 TB tape library: backups and long-term "offline" storage

IBM's Storage Tank technology combined with TFN connections will allow large data sets to be moved seamlessly throughout the state with increased redundancy and seamless delivery.