The Biomedical Informatics
Research Network:
A National Information Infrastructure to
Enable and Advance Biomedical Research
Jeffrey S. Grethe, Ph.D.
Scientific Coordinator
BIRN Coordinating Center - University of California San Diego
October, 2006
Biomedical Informatics Research Network
A shared biomedical IT infrastructure to hasten the derivation of new understanding and treatment of disease through the use of distributed knowledge.
• Collaboration between groups with different expertise and resources (technical, scientific, social and political)
• Technical infrastructure to support collaboration (designed to be extensible to other biomedical communities)
• Open access and dissemination of data and tools (i.e. open source)
• Bringing transparent Grid computing to biomedical research
Origins of the IT Infrastructure used to build the BIRN:
Initiatives like the NSF's National Partnership for Advanced Computational Infrastructure (NPACI):
• ~50 partner sites
• Shared compute resources
• High-speed networks
• Computational science efforts in:
  • Neuroscience
  • Molecular Science
  • Earth Systems Science
  • Engineering
• Resources (TeraFlops, high-performance networks, data caches)
• Metacomputing (Grid tools, middleware)
• Interaction environments (visualization, science portals)
• Data-intensive computing (databases, data integration)
[Figure: the vBNS network linked brain mapping data at Wash U. and UCLA with supercomputing at UCSD.]
The NSF PACI program started in 1995; the current program is "Cyberinfrastructure".
BIRN Must Accommodate Growth
BIRN Sites
At the beginning (circa end of 2001):
10+ distinct installations, ~100 individual machines
(From the "Expanding the BIRN" meeting @ NCRR, December 6 & 7, 2001)
The BIRN Collaboratory Today
Enabling collaborative research at 28 institutions comprising 37 research groups.
• Function BIRN: studying regional brain dysfunctions related to the progression and subtypes of schizophrenia.
• Morphometry BIRN: studying brain structures related to unipolar depression, mild Alzheimer's disease and mild cognitive impairment.
• Mouse BIRN: linking imaging, behavior, and molecular informatics in animal models of multiple sclerosis, schizophrenia, ADHD, Tourette's disorder, Parkinson's disease, and brain cancer.
• Non-Human Primate BIRN: studying non-human primate pre-clinical models of disease.
• National Alliance for Medical Image Computing (NA-MIC): develops computational tools for the analysis and visualization of medical image data.
• BIRN Coordinating Center: a multi-institutional, interdisciplinary team that develops and supports the overall information technology (IT) infrastructure linking the test beds.
It will no longer matter where data, instruments
and computational resources are located!
Software Problem in a Nutshell
• Enable analysis of distributed biomedical data in a national-scale production facility:
  Data & Network:
  • Data sets are large, and data sets are many
  • Enable new queries that integrate multiple sources
  • Specialized application codes (from the test beds) need to work on BIRN-accessible data
  CPU:
  • Some analysis pipelines require significant computation
  Security:
  • Privacy and patient anonymity are required
  • Institutional ownership of data
• Easily replicate the entire software stack (including centralized services) for other groups
Major System Components
• Collaborating Groups of Biomedical Researchers
• Data Integration Mechanisms
• Distributed Data (Collections)
• Distributed Data (file system)
• Computation/Analysis Facilities
• Identity/Login Management
• Authorization and Role Definition
• Overall Operations
• Command/Batch Access
• Application Portal
• Domain Application Tools
• Integrated SW Distribution
• Complete Workflows
BIRN has the advantage of having deployed such an "End-to-End" infrastructure:
Built around research projects with geographically distributed data.
• Consists of all the components required to effectively share and collaboratively explore data:
  • The BIRN Rack (BIRN site infrastructure)
  • The BIRN Portal
  • The BIRN Data Grid
  • The BIRN Data Integration Infrastructure
  • The BIRN Computational Grid
• The system integration, development, deployment and management of this infrastructure is the main focus of activities within the BIRN Coordinating Center.
Function BIRN Overview
• Calibration methods for multi-site fMRI:
  • Study regional brain dysfunction and correlated morphological differences
  • Progression and treatment of schizophrenia
• Human phantom trials:
  • Common consortium protocol
  • 5 subjects scanned at all 11 sites
  • An additional 15 controls and 15 schizophrenic subjects per site per year
• Statistical techniques:
  • Identify cross-site differences
  • Develop corrections to allow data pooling (a toy illustration follows below)
• Develop interoperable post-processing
• Sites: UC Irvine, UCLA, UC San Diego, MGH, BWH, Stanford, U Minnesota, U Iowa, U New Mexico, Duke/U North Carolina, MIT
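The corrections themselves are developed by the test bed's statisticians; purely as a toy illustration of removing an additive site effect before pooling (all data and variable names below are invented), consider:

import numpy as np

# Toy illustration: 3 sites, 15 subjects each, with unknown additive
# scanner biases contaminating a measured quantity.
rng = np.random.default_rng(0)
sites = np.repeat(np.arange(3), 15)
site_offsets = np.array([0.0, 0.4, -0.3])              # hidden scanner biases
values = rng.normal(0.5, 0.2, size=45) + site_offsets[sites]

pooled_mean = values.mean()
site_means = np.array([values[sites == s].mean() for s in range(3)])
corrected = values - site_means[sites] + pooled_mean   # center each site on the pooled mean

print(np.round(site_means, 2))        # reveals the cross-site differences
print(np.round(corrected.mean(), 2))  # pooled mean is preserved after correction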
FBIRN Federated Data
[Diagram: federated Human Imaging Database (HID) instances at UMN, Stanford, UCLA, UCI, UIowa, UCSD, UNM, MGH, BWH, Yale and Duke, linked through the data integration environment; the legend marks PostgreSQL test sites and Phase 1 / Phase 2 data holdings.]
Currently, each FBIRN site is collecting 15 schizophrenic subjects and 15 controls:
• In a common imaging paradigm
• Using the same combination of calibration and cognitive tasks
• Including the challenges of multi-site clinical populations
BIRN Data Grid
• Uniform interface for connecting to heterogeneous distributed data resources
• Allows any "grid-enabled" tool to interact with data no matter where, or on what system, it is located (a scripting sketch follows the usage chart below)
• Allows for the seamless creation and management of distributed data sets
• Distributed data appear as a single managed collection to both users and tools
[Chart: BIRN Data Grid usage, Oct-02 through Jun-06: total size of stored data (now 16+ terabytes) and total number of files (now 16 million), more than doubling each year.]
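The data grid is built on the SDSC Storage Resource Broker (SRB), whose Scommand utilities present the distributed holdings as one logical collection. Below is a minimal sketch of scripting those utilities from Python; it assumes an already-configured SRB client environment, and the collection and file paths are hypothetical.

import subprocess

def scmd(*args: str) -> str:
    """Run an SRB Scommand and return its stdout (raises on failure)."""
    return subprocess.run(list(args), capture_output=True, text=True, check=True).stdout

scmd("Sinit")                                  # authenticate to the grid
print(scmd("Sls", "/home/collections.birn"))   # list a logical collection (path hypothetical)
scmd("Sget", "/home/collections.birn/subj001.img", "/tmp/subj001.img")  # fetch, wherever it lives
scmd("Sexit")                                  # end the session

The point of the uniform interface is that the tool never needs to know which physical site or storage system actually holds subj001.img.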
fBIRN Multi-Site Data Example
• Reference anatomical scan
• fMRI scans from 10 different sites (same subject, registered, same slice)
Calibration: Phase I traveling calibration subject dataset available
Morphometry BIRN
• Anatomical correlates of psychiatric illnesses:
  • Unipolar depression, Alzheimer's disease (AD) and mild cognitive impairment (MCI)
• Site- and platform-independent acquisition and analysis for pooling data:
  • Multi-site clinical studies
  • Increase statistical power for rare populations or subtle effects
• Advanced image analysis and visualization
• Sites: MGH, BWH, Duke, UCLA, UC San Diego, Johns Hopkins, UC Irvine, Wash U, MIT
[Images: a normal elderly control and an Alzheimer's individual.]
MRI Distortions due to Gradient Non-Linearities
• Siemens whole-body Symphony/Sonata: max displacement 2.5/3.2 mm
• GE whole-body CRM NVi/CVi: max displacement 4.2/8.6 mm
• Siemens head-only Allegra/AC-44: max displacement 5.7/20.2 mm
Multi-site Structural MRI Data Acquisition & Calibration
• Develop acquisition and calibration protocols that improve reproducibility, within and across sites
• Common acquisition protocol, distortion correction (a generic resampling sketch follows below), and evaluation by scanning human phantoms multiple times at all sites
[Images: uncorrected vs. corrected image intensity variability for the same subject scanned at 4 sites.]
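Mechanically, applying a distortion correction amounts to resampling the image through a displacement field; in practice that field is derived from the scanner's gradient-coil model. The sketch below is only a generic illustration with a synthetic field, using numpy/scipy, and does not reflect the consortium's actual correction code.

import numpy as np
from scipy import ndimage

# Generic illustration: resample a volume through a displacement field.
img = np.random.rand(64, 64, 64)                      # stand-in volume
zz, yy, xx = np.meshgrid(*(np.arange(64),) * 3, indexing="ij")
disp = 0.5 * np.sin(2 * np.pi * zz / 64)              # toy displacement (voxels) along z
coords = np.array([zz + disp, yy, xx])                # where each output voxel samples from
corrected = ndimage.map_coordinates(img, coords, order=1)  # trilinear resampling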
Reproducibility Effects: Alignment of Surfaces
[Figure: cortical surface estimates for the same subject, co-registered, from Siemens and GE scanners, with and without distortion correction.]
• Distortion correction does improve cortical surface co-registration
Morphometry BIRN: Semi-Automated Shape Analysis Overview
Large Deformation Diffeomorphic Metric Mapping (LDDMM) using the TeraGrid.
Workflow:
1. Data donor site (WashU, N=45): de-identification and upload
2. BIRN Data Grid
3. MGH: segmentation
4. JHU: shape analysis of segmented structures (large-scale distributed computing)
5. BWH: visualization
Preliminary study: 46 hippocampus data sets; 30,000 CPU hours; 4 TB of morphometric results data.
Scientific goal: classify patient status from the morphometric data.
SASHA: Shape Analysis Pipeline Results
• 6 semantic dementia subjects
• 21 control subjects
• 18 Alzheimer's subjects
Shape-derived metrics can be used to detect class-specific information.
SASHA: Large Scale Distributed Computing
[Figure: a larger follow-up study run on the TeraGrid (GPFS storage), spanning BIRN sites at UCSD, UCI and JHU, and producing 1.25 TB of resultant data per day.]
Processing upwards of 1,250 comparisons per day (8,986 CPU-hours, or 374 days, of computing); a toy sketch of farming out such comparisons follows below.
Beg et al., "Pattern classification of hippocampal shape analysis in a study of AD" (to be submitted, 2006)
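The production runs use TeraGrid batch scheduling; purely to illustrate the shape of the problem (all pairwise comparisons fanned out to workers), here is a toy sketch with Python's standard library, where compare() is an invented stand-in for one LDDMM comparison.

from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def compare(pair):
    """Stand-in for one LDDMM shape comparison between two segmented hippocampi."""
    a, b = pair
    return (a, b, abs(hash((a, b))) % 100)   # placeholder "metric distance"

subjects = [f"subj{i:03d}" for i in range(46)]   # the 46 hippocampus data sets
pairs = list(combinations(subjects, 2))          # all pairwise comparisons

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:          # a real run farms these to grid nodes
        results = list(pool.map(compare, pairs))
    print(len(results), "comparisons completed")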
De-identification and Upload Pipeline
• Robust automated methods for bulk MRI de-identification and upload to the database (diverse inputs, sharable outputs, common package)
• De-facing: automated de-facing without brain removal
• Pipeline: image formats, BIRN ID generation, defacing, QA, upload (a sketch of these stages follows below)
[Images: raw data vs. de-faced data.]
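A minimal sketch of the pipeline stages named above, assuming FreeSurfer's mri_deface tool (with its standard templates) for the de-facing step and the SRB Sput command for upload; the ID scheme, filenames and collection path are hypothetical.

import hashlib
import subprocess

def birn_id(site: str, local_key: str) -> str:
    """Derive a stable, non-identifying subject ID from a site-local key (hypothetical scheme)."""
    return "BIRN_" + hashlib.sha256(f"{site}:{local_key}".encode()).hexdigest()[:12]

def deface(in_vol: str, out_vol: str) -> None:
    """De-face without brain removal, assuming FreeSurfer's mri_deface and templates."""
    subprocess.run(["mri_deface", in_vol,
                    "talairach_mixed_with_skull.gca", "face.gca", out_vol], check=True)

def upload(local_path: str, collection: str) -> None:
    """Push the de-identified volume into the BIRN Data Grid via SRB."""
    subprocess.run(["Sput", local_path, collection], check=True)

sid = birn_id("WashU", "patient-4711")           # hypothetical inputs
deface("raw.mgz", f"{sid}.mgz")
upload(f"{sid}.mgz", "/home/collections.birn")   # hypothetical collection path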
Mouse BIRN
• Studying animal models of disease across dimensional scales to test hypotheses about human neurological disorders:
  • Experimental allergic encephalomyelitis (EAE) mouse models characteristic of multiple sclerosis (MS)
  • Dopamine transporter (DAT) knockout mouse for studies of schizophrenia, attention-deficit hyperactivity disorder (ADHD), Tourette's disorder, and substance abuse
  • An alpha-synuclein mouse to model the symptoms/pathology of Parkinson's disease
  • Cancer animal models consortium with an astrocytoma mouse model: NCI-supported, with Terry Van Dyke @ Duke
• Sites: Cal Tech, Duke, UCLA, UCSD, Univ. Tenn.
Mouse BIRN Data Integration Framework
1. Create multimodal databases
2. Create conceptual links to a shared ontology
3. Situate the data in a common spatial framework
4. Use the mediator to navigate and query across data sources (a schematic sketch follows below)
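As a schematic of steps 2 and 4: the mediator expands a shared-ontology concept into each source's local vocabulary, fans the query out through per-source wrappers, and merges the results. Every name below is invented for illustration; this is not the BIRN mediator's actual API.

# Schematic sketch of ontology-mediated querying (all names invented).
ONTOLOGY = {  # step 2: conceptual links from a shared concept to local terms
    "hippocampus": {"ucsd_db": "Hippocampus", "duke_db": "HIPPOCAMPAL_FORMATION"},
}

SOURCES = {  # per-source wrappers presenting a uniform query interface
    "ucsd_db": lambda term: [f"ucsd:{term}:vol1"],
    "duke_db": lambda term: [f"duke:{term}:scan9"],
}

def mediated_query(concept: str) -> list[str]:
    """Step 4: translate the concept per source and merge the results."""
    hits = []
    for source, local_term in ONTOLOGY[concept].items():
        hits.extend(SOURCES[source](local_term))
    return hits

print(mediated_query("hippocampus"))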
Bonfire: Browse, Query and Utilize BIRN Knowledge Sources
Bonfire Ontology Browser and Extension Tool:
• Aggregates knowledge sources built on UMLS
• Issues graph-based queries on concepts
• Collaborative extension: users may propose a new concept, receive a unique ID and "attach" it to an existing concept (sketched below)
• Developed and maintained by the BIRN CC
• Licenses negotiated and handled by the BIRN CC
Xufei Qian, Amarnath Gupta, Jeff Grethe
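To illustrate the extension mechanic (propose a concept, mint a unique ID, attach it under an existing concept), here is a sketch using the rdflib library; the namespace, concept names and ID scheme are invented for the example and are not Bonfire's actual interface.

import uuid
from rdflib import Graph, Namespace, Literal, RDF, RDFS, OWL

BIRN = Namespace("http://example.org/birnlex#")         # hypothetical namespace

g = Graph()
new_concept = BIRN[f"birnlex_{uuid.uuid4().hex[:8]}"]   # mint a unique ID
g.add((new_concept, RDF.type, OWL.Class))
g.add((new_concept, RDFS.label, Literal("proposed concept")))
g.add((new_concept, RDFS.subClassOf, BIRN.ExistingConcept))  # "attach" to an existing concept
print(g.serialize(format="turtle"))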
BIRNLex
• Built using the OWL plugin of Protégé (OWL is the W3C web ontology language standard)
• Reviewed terms contributed by the test beds during the first ontology workshop
• Built using the existing class hierarchy of FuGO
• Curation of BIRNLex is currently underway (July 28th, next session)
http://132.239.16.64:8080/BIRNLex/
National Center for Biomedical Ontology
National Center for Biomedical Ontology (NCBO), an NCBC; Mark Musen, P.I.
• Daniel Rubin from NCBO participates in OTF calls
• Carol Bean arranged for the OTF to attend a workshop in March 2006, where Suzanna Lewis, Barry Smith, Michael Ashburner, Mark Musen and Daniel Rubin:
  • Educated us on efforts underway at NCBO, and vice versa
  • Provided their view on ontology "best practices" and examples of good ontologies
  • Evaluated BIRN's current efforts
“I just wanted to let you know how excited I am that the BIRN is now working so closely with NCBO. Your team
clearly has an appreciation for the importance of ontology for work in data integration and automated reasoning, and
I think that we will be able to do some important work together if your collaborating grant application is funded. I
[have] been watching the contributions that folks such as Bill Bug have been making on the mailing list for the W3C
healthcare and life sciences SIG and for the FUGO consortium, and it is clear that your team has the motivation and
the talent to make valuable contributions in the ontology arena." -- Mark Musen
Spatial Registration of Data
• Processing stream for spatial registration of brain volumes using the LONI Pipeline
• Volume and slice data brought into register in order to correlate cellular and subcellular changes with non-invasive imaging
The BIRN Smart Atlas
An example of a Data Grid-based, GIS-like tool for spatial integration of multiscale distributed brain data.
Ilya Zaslavsky, Joshua Tran, Haiyun He, Amarnath Gupta
[Diagram: a mediator with per-source wrappers over data at UCSD, Cal Tech, Duke and UCLA, backed by the SRB.]
BIRN-CC Enables Test Bed Science
• A stable, robust, shared network and distributed database environment
• Extensible tools and IT infrastructure that can be reused
• Established cyberinfrastructure for the data grid and large-scale data integration efforts
• High-performance connectivity between distributed resources (computation and data storage)
• Seamless access to distributed high-performance computing resources
Changing the use pattern for research data from the individual laboratory/project to shared use.
BIRN-CC Enables Test Bed Science
• Providing technical expertise
• Troubleshooting and support
• Working closely with test beds to develop standards and best practices
• Providing a reliable infrastructure for large-scale collaboration
• Driving the development of grid middleware
• Developing tools in support of test bed research
• Providing access to computational resources
• ...and more
Major System Components
• Collaborating Groups of Biomedical Researchers
• Data Integration Mechanisms
• Distributed Data (Collections)
• Distributed Data (file system)
• Computation/Analysis Facilities
• Identity/Login Management
• Authorization and Role Definition
• Overall Operations
• Command/Batch Access
• Application Portal
• Domain Application Tools
• Integrated SW Distribution
• Complete Workflows
BIRN Core Software Infrastructure
[Diagram: a layered stack: your specific tools & user apps and shared tools across science domains on top; friendly, work-facilitating portals; grid services & middleware plus development tools & libraries (authentication, authorization, auditing, workflows, visualization, analysis); and distributed computing, instruments and data resources at the base. The same stack serves Biomedical Informatics "BIRN", other institutions (HHMI, Osaka U.), other NIH projects (caBIG, NIEHS), Marine Metagenomics (Moore Foundation), Geosciences "GEON" (NSF), and Ocean Observing "Looking" (NSF).]
• The BIRN CC builds on evolving standards for portals and middleware
• Adds new capabilities required by projects
• Provides system integration of domain-specific tools, building a distributed infrastructure
• Utilizes commodity hardware and stable networks for baseline connectivity
An Exercise for the Reader …
• There exists a large body of useful middleware
• Its assembly, hardening and extension into a useful system is left as an exercise to the reader
• The BIRN-CC is the "reader"
System Deployment
• Utilizing the Rocks grid management software
• BIRN-specific extensions to Rocks, also under CVS, mean automated, repeatable deployment of any version of the BIRN system
• We've created BIRN "rolls" that integrate:
  • BIRN domain tools (e.g. 3DSlicer, LONI Pipeline, FreeSurfer)
  • Database (Oracle) and SRB configuration
• Rocks, with the BIRN extensions, includes automated deployment mechanisms for:
  • Middleware (security, computational, data)
  • Data mediation/integration
  • Application codes
  • The portal and other workflows
Nagios Alert System for Monitoring Racks
• Network throughput and performance
• Graphical and numerical reporting of site/grid performance (a minimal check sketch follows below)
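Nagios checks are small executables that report status through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of a custom rack-reachability check; the hostname is hypothetical and the ping flags are the Linux (iputils) ones.

#!/usr/bin/env python
# Minimal Nagios-style check: the exit code conveys status.
import subprocess
import sys

host = sys.argv[1] if len(sys.argv) > 1 else "birn-rack.example.edu"  # hypothetical
ping = subprocess.run(["ping", "-c", "1", "-W", "2", host], capture_output=True)
if ping.returncode == 0:
    print(f"OK - {host} is reachable")
    sys.exit(0)
print(f"CRITICAL - {host} is unreachable")
sys.exit(2)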
We Began with Standard Hardware
• Prescribed hardware jumpstarted BIRN for functionality
• The software footprint is managed from the BIRN Coordinating Center
• Integration of domain tools, middleware, OS, updates, and more
• BIRN expansion/upgrades of existing sites use a more generic (and less expensive) hardware footprint
Removing Barriers:
Decreasing Cost of Entry & Increasing Scalability
• $120K (2001): prescribed hardware jumpstarted BIRN for functionality
• < $20K (today): support for multiple vendors
• < $5K (~2011): a software solution for researchers to "BIRN-enable" local hardware
Download: http://www.nbirn.net