5/31/2016

Data Management, Transfer, Metadata, Ontologies and Provenance

Joel Saltz
Chair and Professor, Biomedical Informatics Department
Professor, Computer and Information Science, Pathology
The Ohio State University

Topics
• Databases and persistence
• Data types
• Metadata and mediation
• Data handling, formatting, remote access
• Data curation and provenance
• Secure transfer, encryption
• Complex queries
• Clinical and research governance

Translational Research: Types of Information
[Figure: gene diagram with exon markers (1, 5, 10, 16) across 22.0 Kb; labeled variants G84A, I405V, and Intron12 Taq1B]

Processing Remotely-Sensed Data
NOAA Tiros-N w/ AVHRR sensor

AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite’s track.
• At regular intervals along a scan line, measurements are gathered to form an instantaneous field of view (IFOV).
• Scan lines are aggregated into Level 1 data sets.

A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit.
• ~110 minutes.
• ~40 megabytes.
• ~15,000 scan lines.
One scan line is 409 IFOVs.

Applications
[Figure panels: Pathology; Satellite Data Processing; Porous Media Simulation; Volumetric Reconstruction; Dynamic Contrast MR]

750 TB disk storage at a mid-range site
Deep storage hierarchy planned at OSC for 2004

Virtual Microscope versus query images

Virtual Microscope
• Just for starters: interactive software emulation of a high-power light microscope for processing image datasets
• Visualize and explore microscopy images
• Image analysis for cancer diagnoses and grading
• Virtual Placenta (cancer research)
• Categorize images for associative retrieval
• Commercial scanners (on the order of $150K) can now generate about 10 TB/year, with 100 TB/year expected in the next year or two

Virtual Slide Grid Node
[Figure: architecture diagram. Labels: clients (data caching, client-side image processing, graphical user interface) connected over a WAN; Children’s Hospital/Research Institute; Children’s Hospital Information Systems; information systems interface; switch; servers for data transfer, query execution, meta-data management, and authentication/authorization; data encryption and anonymization; digitized slide scanners; RAID data storage; advanced image analysis]

Make Filesystems Across Grid, Databases on Grid Look Like a Big Database: GridDB-Lite
Support efficient selection of the data of interest from distributed scientific datasets, and transfer of data from storage clusters to compute clusters.
• Data Subsetting Model
• Virtual Tables
• Select Queries
• Distributed Arrays

    SELECT <DataElements>
    FROM Dataset-1, Dataset-2, …, Dataset-n
    WHERE <Expression> AND <Filter(<DataElement>)>
    GROUP-BY-PROCESSOR ComputeAttribute(<DataElement>)

Worldwide Translational Research
• Customized access control
• Distributed data, image warehouses
• Grid based clinical, molecular dataset and image query, analysis
• Grid based translational research workflow
[Figure label: OSU Information Warehouse]

Dynamic Data Driven Translational Research
www.cise.nsf.gov/DDDAS (Dynamic Data Driven Applications Systems)

Analysis
• Image analysis: cancer grading and staging, treatment efficacy, pharmacokinetics, drug effectiveness, toxicity
• Drives accrual, protocol changes, choice of laboratory, imaging, genomic testing

Data
• Diagnosis, Treatment, Laboratory, Radiology, Pathology Imaging, Proteomic, Gene Expression, Gene Sequence

Workflow
• Rule based protocols, plan tests and treatments, plan patient consenting, specimen collection and analysis

Connecting labels in the diagram
• Data streamed to analysis
• Request for data updates
• Generates requests for data
• Data driven algorithms: patient accrual, clinical, laboratory, genomic testing

Software Support for Data Driven Applications
• DataCutter: Component Framework for Combined Task/Data Parallelism
  • Filtering/Program Coupling Service: distributed C++ component framework
• GridDB Lite: Large Data Query Layered on DataCutter
  • Indexing: multilevel hierarchical indexes based on the R-tree indexing method
  • Data Cluster/Decluster/Range Query
• Active Proxy G: Active Semantic Data Cache
  • Employ user semantics to cache and retrieve data
  • Store and reuse results of computations

Metadata Management
• Grid based infrastructure based on XML Schema
• Version control for schemas
• Users can employ portions of multiple versioned schemas
• Infrastructure for validating XML documents using distributed schemas
• BOF at upcoming GGF in Chicago
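
The GridDB-Lite query form shown on the earlier slide (SELECT, FROM, WHERE with a filter, GROUP-BY-PROCESSOR) can be illustrated with a small in-memory sketch. This is not the GridDB-Lite API: the function `select_and_partition`, the record layout, and the example predicates are all hypothetical, and the modulo assignment to processors is an assumption made only to show how selected elements might be distributed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Record = Dict[str, float]


def select_and_partition(
    datasets: Iterable[List[Record]],
    fields: List[str],
    expression: Callable[[Record], bool],
    element_filter: Callable[[Record], bool],
    compute_attribute: Callable[[Record], int],
    num_processors: int,
) -> Dict[int, List[Record]]:
    """Hypothetical stand-in for SELECT ... WHERE ... GROUP-BY-PROCESSOR."""
    partitions: Dict[int, List[Record]] = defaultdict(list)
    for dataset in datasets:  # FROM Dataset-1, ..., Dataset-n
        for record in dataset:
            # WHERE <Expression> AND <Filter(<DataElement>)>
            if expression(record) and element_filter(record):
                # SELECT <DataElements>: keep only the requested fields
                projected = {name: record[name] for name in fields}
                # GROUP-BY-PROCESSOR: assumed modulo mapping to a processor
                processor = compute_attribute(record) % num_processors
                partitions[processor].append(projected)
    return dict(partitions)


# Two tiny made-up datasets of grid cells with an attached value.
dataset_1 = [{"x": 0, "y": 0, "value": 0.2}, {"x": 1, "y": 0, "value": 0.9}]
dataset_2 = [{"x": 2, "y": 1, "value": 0.7}, {"x": 3, "y": 1, "value": 0.1}]

result = select_and_partition(
    datasets=[dataset_1, dataset_2],
    fields=["x", "value"],
    expression=lambda r: r["x"] <= 2,           # range predicate on a coordinate
    element_filter=lambda r: r["value"] > 0.5,  # user-defined filter on an element
    compute_attribute=lambda r: int(r["x"]),    # decides the owning processor
    num_processors=2,
)
print(result)
```

In this toy run, only the records with `x` equal to 1 and 2 pass both predicates, and they land on processors 1 and 0 respectively. A real system would evaluate the selection close to the storage nodes and ship only the projected elements to the compute cluster.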
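
The "Active Semantic Data Cache" bullets (employ user semantics to cache and retrieve data; store and reuse results of computations) can be sketched in the same spirit. The class below is a toy, not Active Proxy G: `SemanticRangeCache` and `expensive_scan` are hypothetical names, queries are assumed to be one-dimensional half-open integer ranges, and the computation is assumed to return exactly one value per position so that a cached result can be sliced.

```python
from typing import Callable, Dict, List, Optional, Tuple

Range = Tuple[int, int]  # half-open interval [start, stop)


class SemanticRangeCache:
    """Toy cache that reuses a stored result when it covers a new query."""

    def __init__(self, compute: Callable[[Range], List[int]]) -> None:
        self._compute = compute
        self._store: Dict[Range, List[int]] = {}
        self.misses = 0  # number of times the computation actually ran

    def _covering(self, query: Range) -> Optional[Range]:
        # Find any cached range that fully contains the queried range.
        for cached in self._store:
            if cached[0] <= query[0] and query[1] <= cached[1]:
                return cached
        return None

    def get(self, query: Range) -> List[int]:
        cached = self._covering(query)
        if cached is not None:
            # Answer from the cached superset by slicing, without recomputing.
            offset = query[0] - cached[0]
            return self._store[cached][offset:offset + (query[1] - query[0])]
        self.misses += 1
        result = self._compute(query)
        self._store[query] = result
        return result


def expensive_scan(r: Range) -> List[int]:
    # Placeholder for a costly computation: one value per position.
    return [i * i for i in range(r[0], r[1])]


cache = SemanticRangeCache(expensive_scan)
full = cache.get((0, 10))  # computed and stored
part = cache.get((3, 6))   # served from the stored (0, 10) result
print(part, cache.misses)
```

The point of the sketch is that the cache key carries meaning (a range) rather than being an opaque string, so a later query that is a subset of an earlier one can be answered from the stored result.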