The Biomedical Informatics Research Network: A National Information Infrastructure to Enable and Advance Biomedical Research
Jeffrey S. Grethe, Ph.D., Scientific Coordinator
BIRN Coordinating Center - University of California San Diego
October 2006

Biomedical Informatics Research Network
A shared biomedical IT infrastructure to hasten the derivation of new understanding and treatment of disease through the use of distributed knowledge:
• Collaboration between groups with different expertise and resources (technical, scientific, social and political)
• Technical infrastructure to support collaboration (designed to be extensible to other biomedical communities)
• Open access and dissemination of data and tools (i.e., Open Source)
• Bringing transparent Grid computing to biomedical research

Origins of the IT Infrastructure Used to Build the BIRN
Initiatives like the NSF National Partnership for Advanced Computational Infrastructure (NPACI):
• ~50 partner sites sharing compute resources over high-speed networks (for example, the vBNS network linked brain mapping data at Wash U. and UCLA with supercomputing at UCSD)
• Computational science efforts in neuroscience, molecular science, earth systems science and engineering
• Resources (teraflops, high-performance networks, data caches)
• Metacomputing (Grid tools - middleware)
• Interaction environments (visualization - science portals)
• Data-intensive computing (databases - data integration)
The NSF PACI program started in 1995; the current program is "Cyberinfrastructure."

BIRN Must Accommodate Growth
BIRN sites at the beginning (circa end of 2001): 10+ distinct installations, ~100 individual machines (from the "Expanding the BIRN" meeting at NCRR, December 6 and 7, 2001).

The BIRN Collaboratory Today
• National Alliance for Medical Image Computing (NA-MIC): a multi-institutional, interdisciplinary team who develop computational tools for the analysis and visualization of medical image data
• Function BIRN: studying regional brain dysfunction related to the progression and subtypes of schizophrenia
• Morphometry BIRN: studying brain structures related to unipolar depression, mild Alzheimer's disease and mild cognitive impairment
• Mouse BIRN: linking imaging, behavior and molecular informatics in animal models of multiple sclerosis, depression, ADHD, schizophrenia, Parkinson's disease, Tourette's disorder and brain cancer
• Non-Human Primate BIRN: non-human primate pre-clinical models of disease
• BIRN Coordinating Center: develops and supports the overall information technology (IT) infrastructure linking the testbeds
Enabling collaborative research at 28 research institutions comprising 37 research groups.
It will no longer matter where data, instruments and computational resources are located!
Software Problem in a Nutshell
Enable analysis of distributed biomedical data in a national-scale production facility.
Data & network:
• Data sets are large, and data sets are many
• Enable new queries that integrate multiple sources
CPU:
• Specialized application codes (from the testbeds) need to work on BIRN-accessible data
• Some analysis pipelines require significant computation
Security:
• Privacy and patient anonymity are required
• Institutional ownership of data
And: easily replicate the entire software stack (including centralized services) for other groups.

Major System Components
• Collaborating groups of biomedical researchers
• Data integration mechanisms
• Distributed data (collections)
• Distributed data (file system)
• Computation/analysis facilities
• Identity/login management
• Authorization and role definition
• Overall operations
• Command/batch access
• Application portal
• Domain application tools
• Integrated software distribution
• Complete workflows

BIRN Has the Advantage of Having Deployed Such an "End-to-End" Infrastructure
Built around research projects with geographically distributed data, it consists of all the components required to effectively share and collaboratively explore data:
• The BIRN rack (BIRN site infrastructure)
• The BIRN Portal
• The BIRN Data Grid
• The BIRN data integration infrastructure
• The BIRN computational grid
The system integration, development, deployment and management of this infrastructure is the main focus of activities within the BIRN Coordinating Center.

Function BIRN Overview
Calibration methods for multi-site fMRI:
• Study regional brain dysfunction and correlated morphological differences
• Progression and treatment of schizophrenia
Human phantom trials:
• Common consortium protocol
• 5 subjects scanned at all 11 sites
• An additional 15 controls and 15 schizophrenic subjects per site per year
Statistical techniques:
• Identify cross-site differences (a toy sketch of one such correction appears at the end of this section)
• Develop corrections to allow data pooling
• Develop interoperable post-processing
Sites: UC Irvine, UCLA, UC San Diego, MGH, BWH, Stanford, U Minnesota, U Iowa, U New Mexico, Duke/U North Carolina, MIT

FBIRN Federated Data
[Diagram: each site (UMN, Stanford, UCLA, UCI, U Iowa, UCSD, UNM, MGH, BWH, Yale, Duke) runs a Human Imaging Database (HID) data integration environment; some are PostgreSQL test sites; p1/p2 mark Phase 1 / Phase 2 data.]
Currently, each FBIRN site is collecting 15 schizophrenic subjects and 15 controls:
• In a common imaging paradigm
• Using the same combination of calibration and cognitive tasks
• Including the challenges of multi-site clinical populations

BIRN Data Grid
• A uniform interface for connecting to heterogeneous distributed data resources
• Allows any "grid-enabled" tool to interact with data no matter where, or on what, it is located
• Allows the seamless creation and management of distributed data sets
• Distributed data appear as a single managed collection to both users and tools
(A minimal sketch of this style of access appears at the end of this section.)

BIRN Data Grid Usage
[Chart: amount of stored data (GB) and number of files (thousands), October 2002 through June 2006.]
More than 16 terabytes and 16 million files stored, more than doubling each year.

fBIRN Multi-Site Data Example
A reference anatomical scan plus fMRI scans from 10 different sites: same subject, registered, same slice. The Phase I traveling calibration subject dataset is available.
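The deck shows the traveling-subject design but not the correction math. Purely as an illustration of the "identify cross-site differences, then correct to allow pooling" idea from the Function BIRN statistical-techniques list above, here is a minimal Python sketch. The additive site-offset model, array names and numbers are illustrative assumptions, not FBIRN's actual calibration procedure:

```python
import numpy as np

# Toy traveling-subject data: rows = subjects scanned at every site,
# columns = sites (mirroring the "5 subjects scanned at all 11 sites" design).
# Values stand in for some scalar measurement (e.g., a regional fMRI statistic).
rng = np.random.default_rng(0)
true_subject_effect = rng.normal(0.0, 1.0, size=(5, 1))
true_site_offset = rng.normal(0.0, 0.5, size=(1, 11))
measurements = true_subject_effect + true_site_offset + rng.normal(0, 0.1, (5, 11))

# Identify cross-site differences: because every subject visits every site,
# averaging over subjects isolates each site's offset (up to a global constant).
site_offset = measurements.mean(axis=0) - measurements.mean()

# Develop a correction to allow pooling: subtract the estimated offset.
# In practice the offsets would be estimated on calibration subjects and
# then applied to new per-site clinical data.
pooled = measurements - site_offset

print("estimated site offsets:", np.round(site_offset, 2))
# Per-site means are equalized after correction (exactly so in this toy setup).
print("cross-site spread before/after:",
      measurements.mean(axis=0).std(), pooled.mean(axis=0).std())
```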
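The BIRN Data Grid described above is built on the SDSC Storage Resource Broker (SRB), which also appears later in this deck (the SRB configuration in the BIRN rolls, and the SRB storage behind the Smart Atlas). One way a script might talk to such a grid is through the SRB's Scommand client tools, which present distributed physical storage as one logical collection. A hedged sketch, assuming the Scommands (Sinit, Sput, Sls, Sexit) are installed and a ~/.srb environment file points at an SRB server; the collection path is hypothetical and BIRN's portal tooling wraps this differently:

```python
import subprocess

def srb(*args):
    """Run one SRB Scommand and return its output (raises on failure)."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout

# Authenticate against the SRB metadata catalog (reads ~/.srb configuration).
srb("Sinit")

# Upload a local scan into a logical collection; the grid decides which
# physical resource actually stores the bytes. Path is hypothetical.
srb("Sput", "subject001_anat.nii", "/home/fbirn.collections/phase1/subject001")

# List the collection: files spread across many sites appear as one listing.
print(srb("Sls", "/home/fbirn.collections/phase1/subject001"))

srb("Sexit")
```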
Morphometry BIRN
Anatomical correlates of psychiatric illnesses: unipolar depression, Alzheimer's disease (AD) and mild cognitive impairment (MCI).
Site- and platform-independent acquisition and analysis for pooling data:
• Multi-site clinical studies
• Increased statistical power for rare populations or subtle effects
Advanced image analysis and visualization. [Images: normal elderly control vs. Alzheimer's individual.]
Sites: MGH, BWH, Duke, UCLA, UC San Diego, Johns Hopkins, UC Irvine, Wash U, MIT

MRI Distortions Due to Gradient Non-Linearities
Maximum displacement by scanner:
• Siemens Whole-Body Symphony/Sonata: 2.5/3.2 mm
• GE Whole-Body CRM NVi/CVi: 4.2/8.6 mm
• Siemens Head-Only Allegra/AC-44: 5.7/20.2 mm

Multi-Site Structural MRI Data Acquisition & Calibration
• Develop acquisition and calibration protocols that improve reproducibility, within and across sites
• Common acquisition protocol, distortion correction, and evaluation by scanning human phantoms multiple times at all sites
[Images: uncorrected vs. corrected; image-intensity variability for the same subject scanned at 4 sites.]

Reproducibility Effects: Alignment of Surfaces
Cortical surface estimates for the same subject, co-registered across Siemens and GE scanners, with and without distortion correction: distortion correction does improve cortical surface co-registration.

Morphometry BIRN: Semi-Automated Shape Analysis (SASHA) Overview
Large Deformation Diffeomorphic Metric Mapping (LDDMM) using the TeraGrid:
1. Data donor site (Wash U, N=45): de-identification and upload
2. BIRN Data Grid: distributed storage of the uploaded data
3. MGH: segmentation
4. JHU: shape analysis of segmented structures (large-scale distributed computing)
5. BWH: visualization
Preliminary study: 46 hippocampus data sets; 30,000 CPU hours; 4 TB of resultant morphometric data.
Scientific goal: classify patient status from the morphometric data.

SASHA: Shape Analysis Pipeline Results
6 semantic dementia subjects, 21 control subjects, 18 Alzheimer's subjects: shape-derived metrics can be used to detect class-specific information.

SASHA: Large-Scale Distributed Computing
[Diagram: TeraGrid compute resources with GPFS feeding BIRN sites at UCSD, UCI and JHU.]
1.25 TB of resultant data per day; processing upwards of 1,250 comparisons per day (8,986 CPU-hours, or 374 days of computing). A larger follow-up study is underway.
Beg et al., "Pattern classification of hippocampal shape analysis in a study of AD" (to be submitted, 2006).

De-Identification and Upload Pipeline
• Robust automated methods for bulk MRI de-identification and upload to a database (diverse inputs, sharable outputs, common package)
• De-facing: automated de-facing without brain removal
• Pipeline: image formats, BIRN ID generation, de-facing, QA, upload (a toy sketch of this flow appears below, after the Mouse BIRN data integration framework)
[Images: raw data vs. de-faced data.]

Mouse BIRN
Studying animal models of disease across dimensional scales to test hypotheses about human neurological disorders:
• Experimental allergic encephalomyelitis (EAE) mouse models characteristic of multiple sclerosis (MS)
• Dopamine transporter (DAT) knockout mouse for studies of schizophrenia, attention-deficit hyperactivity disorder (ADHD), Tourette's disorder and substance abuse
• An alpha-synuclein mouse to model the symptoms/pathology of Parkinson's disease
• A cancer animal-models consortium with an astrocytoma mouse model (NCI-supported, with Terry Van Dyke at Duke)
Sites: Cal Tech, Duke, UCLA, UCSD, Univ. Tenn.

Mouse BIRN Data Integration Framework
1. Create multimodal databases
2. Create conceptual links to a shared ontology
3. Situate the data in a common spatial framework
4. Use a mediator to navigate and query across data sources (a minimal sketch of this fan-out pattern follows below)
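The deck does not spell out how step 4 works internally. As a minimal illustration of the mediator pattern it names (one logical query, fanned out through per-source wrappers and merged), here is a hedged Python sketch; the wrapper functions, concept IDs and record fields are invented for illustration and are not BIRN's actual mediator API:

```python
from typing import Callable

# Each "wrapper" hides one site's local schema behind a common interface:
# it accepts a shared-ontology concept ID and returns records in a common
# shape. Concept IDs and site data below are invented placeholders.
Wrapper = Callable[[str], list[dict]]

def ucsd_wrapper(concept_id: str) -> list[dict]:
    local = {"BIRN:0001": [{"site": "UCSD", "modality": "MRI", "subject": "m42"}]}
    return local.get(concept_id, [])

def duke_wrapper(concept_id: str) -> list[dict]:
    local = {"BIRN:0001": [{"site": "Duke", "modality": "histology", "subject": "m17"}]}
    return local.get(concept_id, [])

def mediator(concept_id: str, wrappers: list[Wrapper]) -> list[dict]:
    """Fan one logical query out to every source and merge the answers."""
    results: list[dict] = []
    for query_source in wrappers:
        results.extend(query_source(concept_id))
    return results

# One query, answered jointly by independently curated databases.
print(mediator("BIRN:0001", [ucsd_wrapper, duke_wrapper]))
```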
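Returning to the Morphometry BIRN de-identification and upload pipeline described above: its stages (image formats, BIRN ID generation, de-facing, QA, upload) map naturally onto a small driver script. A toy Python sketch of that flow only; every helper here is a hypothetical stand-in (the copy in deface() replaces real voxel rewriting), not BIRN's actual package:

```python
import hashlib
import shutil
from pathlib import Path

def generate_birn_id(site: str, local_subject_id: str) -> str:
    # Hypothetical stand-in: derive a stable, non-identifying subject ID.
    digest = hashlib.sha256(f"{site}:{local_subject_id}".encode()).hexdigest()
    return f"BIRN_{digest[:12]}"

def deface(scan: Path, out_dir: Path) -> Path:
    # Placeholder for automated de-facing without brain removal
    # (the real step rewrites face voxels; here we only copy the file).
    out = out_dir / f"defaced_{scan.name}"
    shutil.copy(scan, out)
    return out

def quality_check(scan: Path) -> bool:
    # Placeholder QA gate: a real check would inspect the image contents.
    return scan.stat().st_size > 0

def upload(scan: Path, birn_id: str) -> None:
    # Placeholder for the data-grid upload (e.g., via SRB, as sketched earlier).
    print(f"uploading {scan} as {birn_id}")

def pipeline(raw_scan: Path, site: str, local_id: str, work_dir: Path) -> None:
    """Format handling, BIRN ID generation, de-facing, QA, upload."""
    birn_id = generate_birn_id(site, local_id)
    defaced = deface(raw_scan, work_dir)
    if quality_check(defaced):
        upload(defaced, birn_id)
    else:
        print(f"{raw_scan} failed QA; not uploaded")

# Example (hypothetical paths):
# pipeline(Path("raw/subj01.img"), site="MGH", local_id="subj01",
#          work_dir=Path("defaced"))
```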
Bonfire: Browse, Query and Utilize BIRN Knowledge Sources
Bonfire ontology browser and extension tool (developed and maintained by the BIRN CC: Xufei Qian, Amarnath Gupta, Jeff Grethe; licenses are negotiated and handled by the BIRN CC):
• Aggregates knowledge sources built on UMLS
• Issues graph-based queries on concepts (a toy sketch of such a query appears at the end of this part, after the NCBO material)
• Collaborative extension: users may propose a new concept, receive a unique ID and "attach" it to an existing concept

BIRNLex
• Built using the OWL plugin of Protégé (OWL is the W3C Web Ontology Language standard)
• Draws on terms contributed by the testbeds during the first ontology workshop
• Built on the existing class hierarchy of FuGO (the Functional Genomics Investigation Ontology)
• Curation of BIRNLex is currently underway (July 28th, next session)
http://132.239.16.64:8080/BIRNLex/

National Center for Biomedical Ontology
The National Center for Biomedical Ontology (NCBO) is an NCBC (National Center for Biomedical Computing); Mark Musen, P.I.
• Daniel Rubin from NCBO participates in Ontology Task Force (OTF) calls
• Carol Bean arranged for the OTF to attend a workshop in March 2006 with Suzanna Lewis, Barry Smith, Michael Ashburner, Mark Musen and Daniel Rubin, who:
• Educated us on efforts underway at NCBO, and vice versa
• Provided their view on ontology "best practices" and examples of good ontologies
• Evaluated BIRN's current efforts
"I just wanted to let you know how excited I am that the BIRN is now working so closely with NCBO. Your team clearly has an appreciation for the importance of ontology for work in data integration and automated reasoning, and I think that we will be able to do some important work together if your collaborating grant application is funded. I [have] been watching the contributions that folks such as Bill Bug have been making on the mailing list for the W3C healthcare and life sciences SIG and for the FUGO consortium, and it is clear that your team has the motivation and the talent to make valuable contributions in the ontology arena." --Mark Musen
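Bonfire's "graph-based queries on concepts," mentioned above, treat the aggregated UMLS-derived vocabularies as a graph of concepts connected by named relationships. As a toy illustration only, with concept IDs, relationship names and traversal all invented (this is not Bonfire's API or BIRNLex content), here is a minimal Python sketch of one such query, finding everything reachable from a concept via one relationship type:

```python
from collections import deque

# Toy concept graph in the spirit of a UMLS-derived knowledge source:
# nodes are concept IDs, edges are named relationships. All IDs invented.
EDGES = {
    "C:brain": [("has_part", "C:hippocampus"), ("has_part", "C:cortex")],
    "C:hippocampus": [("has_part", "C:CA1"), ("studied_in", "C:alzheimers")],
    "C:cortex": [],
    "C:CA1": [],
    "C:alzheimers": [],
}

def reachable(start: str, relation: str) -> list[str]:
    """Breadth-first traversal following only edges with the given label."""
    seen, queue, found = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for label, target in EDGES.get(node, []):
            if label == relation and target not in seen:
                seen.add(target)
                found.append(target)
                queue.append(target)
    return found

# "What are all the parts of the brain (transitively)?"
print(reachable("C:brain", "has_part"))  # ['C:hippocampus', 'C:cortex', 'C:CA1']
```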
Spatial Registration of Data
A processing stream for the spatial registration of brain volumes using the LONI Pipeline. Volume and slice data are brought into register in order to correlate cellular and subcellular changes with non-invasive imaging.

The BIRN Smart Atlas
An example of a Data Grid-based, GIS-like tool for the spatial integration of multiscale, distributed brain data (Ilya Zaslavsky, Joshua Tran, Haiyun He, Amarnath Gupta).
[Diagram: a mediator issues queries through per-source wrappers to data at UCSD, Cal Tech, Duke and UCLA, with storage in the SRB.]

BIRN-CC Enables Test Bed Science
• A stable, robust, shared network and distributed database environment
• Extensible tools and IT infrastructure that can be reused
• An established cyberinfrastructure for the data grid and the large-scale data integration effort
• High-performance connectivity between distributed resources (computation and data storage)
• Seamless access to distributed high-performance computing resources
• Changing the use pattern for research data from the individual laboratory/project to shared use

BIRN-CC Enables Test Bed Science (continued)
• Providing technical expertise
• Troubleshooting and support
• Working closely with the testbeds to develop standards and best practices
• Providing a reliable infrastructure for large-scale collaboration
• Driving the development of grid middleware
• Developing tools in support of testbed research
• Providing access to computational resources, and more

BIRN Core Software Infrastructure
[Diagram: a layered stack, from distributed computing, instruments and data resources at the bottom, through development tools and libraries, grid services and middleware (authentication, authorization, auditing, workflows, visualization, analysis), and shared tools, up to work-facilitating portals and your specific tools and user applications. The same stack serves multiple science domains: Biomedical Informatics ("BIRN"), Geosciences ("GEON", NSF), Ocean Observing ("LOOKING", NSF), Marine Metagenomics (Moore Foundation), other NIH projects (caBIG, NIEHS) and other institutions (HHMI / Osaka U.).]
• The BIRN CC builds on evolving standards for portals and middleware
• Adds new capabilities required by the projects
• Provides system integration of domain-specific tools, building a distributed infrastructure
• Utilizes commodity hardware and stable networks for baseline connectivity

An Exercise for the Reader...
There exists a large body of useful middleware. Its assembly, hardening and extension into a useful system is left as an exercise for the reader. The BIRN-CC is that "reader."

System Deployment
Deployment utilizes the Rocks grid management software. BIRN-specific extensions to Rocks, also kept under CVS, mean automated, repeatable deployment of any version of the BIRN system.
We have created BIRN "rolls" that integrate:
• BIRN domain tools (e.g., 3D Slicer, LONI Pipeline, FreeSurfer)
• Database (Oracle) and SRB configuration
Rocks, with the BIRN extensions, includes automated deployment mechanisms for:
• Middleware (security, computational, data)
• Data mediation/integration
• Application codes
• The portal and other workflows

Nagios Alert System for Monitoring Racks
Network Throughput & Performance
Graphical and numerical reporting of site/grid performance.

We Began with Standard Hardware
• Prescribed hardware jumpstarted BIRN for functionality
• The software footprint is managed from the BIRN Coordinating Center: integration of domain tools, middleware, OS, updates and more
• Expansion and upgrades of existing BIRN sites use a more generic (and less expensive) hardware footprint

Removing Barriers: Decreasing Cost of Entry & Increasing Scalability
$120K (2001) to under $20K (today) to under $5K (~2011):
• Prescribed hardware jumpstarted BIRN for functionality
• Support for multiple vendors
• A software solution for researchers to BIRN-enable local hardware

http://www.nbirn.net