5/31/2016

Data Management, Transfer, Metadata, Ontologies and Provenance
Joel Saltz
Chair and Professor, Biomedical Informatics Department
Professor, Computer and Information Science, Pathology
The Ohio State University

The Group
• Joel Saltz, Ohio State
• Brian Collins, IBM Mursley
• Sean Bechhofer, University of Manchester
• Peter Lyster, NIH/NIBIB
• Chris Catton, Oxford Univ Dept Zoology
• David Shotton, Oxford Univ Dept Zoology
• Amarnath Gupta, UCSD
• Frederica Darema, NSF
• Chris Taylor, Univ Manchester
• Shimji Shimojo, Osaka University
• Toyokaza Akiyama, Osaka University
• Jeffrey Gretha, UCSD
• Mark Ellisman, UCSD
• Michael Marron, NIH
• Philip Papadopoulos, UCSD

Topics
• Databases and persistence
• Data types
• Metadata and mediation
• Data handling, formatting, remote access
• Data curation and provenance
• Secure transfer, encryption
• Complex queries
• Clinical and research governance

Goals – what will the world look like?
• Identify, query, retrieve, and carry out on-demand data product generation directed at collections of data from multiple sites/groups on a given topic; reproduce each group's data analysis and carry out new analyses on all datasets.
• Should be able to carry out entirely new analyses or incrementally modify other scientists' data analyses.
• Should not have to worry about the physical location of data or processing.
• Should have excellent tools available to examine data.
• This should include a mechanism to authenticate potential users, control access to data, and log the identity of those accessing data.
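The authenticate/authorize/log requirement above can be sketched in a few lines. This is a minimal illustration, not a proposed design: the names (`fetch`, `AUTHORIZED`, `ACCESS_LOG`) and the in-memory tables are all hypothetical stand-ins for real authentication and durable audit logging.

```python
# Hypothetical sketch: authenticate a user, check access rights for a
# dataset, and log every access attempt (allowed or denied).
import datetime

ACCESS_LOG = []  # in practice this would be durable, per-site audit storage

AUTHORIZED = {  # made-up user -> readable-dataset table
    "alice": {"mouse-gene-expr", "placenta-images"},
    "bob": {"mouse-gene-expr"},
}

def fetch(user, dataset):
    """Return dataset contents if `user` may read `dataset`; log the attempt."""
    allowed = dataset in AUTHORIZED.get(user, set())
    ACCESS_LOG.append(
        (datetime.datetime.utcnow().isoformat(), user, dataset, allowed)
    )
    if not allowed:
        raise PermissionError(f"{user} may not read {dataset}")
    return f"<contents of {dataset}>"  # stand-in for the real retrieval

print(fetch("alice", "placenta-images"))
```

The point of the sketch is only that every retrieval, successful or not, leaves an identity-bearing log entry behind.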
Data Validation and Provenance
• Capture assumptions prevalent at the time data was captured
• Instigate capture of more information than is common; we are not tracking all the details that are important (implicit or tacit knowledge)
• Standard reporting forms
• Paradigms of best practice: "this is what this group means by an aggressive grunt"; "this is how a 2-D gel should look"
• Include results on standard specimens, phantoms, imaging machine calibration info
• Instruments that acquire data could transmit their state; encourage vendors to share state (e.g., hematology, clinical chemistry)
• Characterizing and publishing error estimates
• Do we want to be able to recapture the state of a database at a given time?
• Problem: standards for rapidly developing systems (e.g., mass spectrometry)
• DICOM

Metadata and Query
• Deal with ever-changing data models, changing classification schemes, ontologies (e.g., is it a plasma membrane protein or a shuttling protein?)
• Need precise definitions of data transformations/filters to ensure reproducibility
• Want tools as easy to use as Google: the ability to select data without presupposing relationships
• Separate "concrete or well defined" entities from "abstract" concepts
  • Should these be dealt with differently?
  • Not clear there is consensus on what is well defined, although there are clear limiting examples

Metadata and Query (cont.)
• How to represent metadata? How to couple grid based databases?
• Easy? Describe metadata in OWL and map between ontologies
  • Mapping may be simple or complex; mapping may involve temporal conditions
• Annotation of secondary data analyses
  • Describe a proposed mechanism with pointers to datasets cited as evidence. A graph is used to represent hypotheses. Annotate what is known and what is not known (in the graph)
• Old data
  • Is old data useful? If so, how do we ensure you can still get at the raw data? 16mm cell cycle films: open format.
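The provenance items listed above (acquisition-time assumptions, instrument state, calibration results, error estimates) could travel with the data as a structured record. The field names below are illustrative only, not a standard; a minimal sketch:

```python
# A sketch of a provenance record bundling the kinds of information the
# bullets above call for. All field names are hypothetical.
from dataclasses import dataclass, field, asdict

@dataclass
class ProvenanceRecord:
    dataset_id: str
    acquired_at: str                                     # ISO-8601 timestamp
    assumptions: list = field(default_factory=list)      # tacit knowledge, made explicit
    instrument_state: dict = field(default_factory=dict) # vendor-reported state
    calibration: dict = field(default_factory=dict)      # phantom / standard-specimen results
    error_estimates: dict = field(default_factory=dict)  # published error bounds

rec = ProvenanceRecord(
    dataset_id="2d-gel-0042",
    acquired_at="2003-11-05T14:30:00Z",
    assumptions=["pH gradient assumed linear"],
    instrument_state={"voltage_kV": 8.0},
    error_estimates={"spot_volume": 0.05},  # e.g. 5% relative error
)
print(asdict(rec)["dataset_id"])
```

Because the record is a plain data structure, it can be serialized alongside the dataset and versioned with it, which is one way to "recapture the state" of the acquisition later.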
  • Large-scale film archiving efforts are under way at various institutions.

Formal Description of Biological and Pathobiological Mechanisms
• Graphs with different flavors of links
• Can use formal methods to deduce inferences
• Want mechanisms for generating views
• Distinguish between description of experimental method and description of biology
• How to represent domain knowledge in a consistent way; should look at connections with the semantic web community
• Use a polygon to annotate an image with a controlled vocabulary term
• What are the right ways to get to data and to formulate queries (e.g., by anatomic structure, hypothesis supported, molecule expressed), what classes of query do we want to support, what tools should be used to formulate the query, and what methods support iterative query formulation?

How to Deliver on the Promise of Queries and Computations on Grid Based Databases
• Can't aggregate data in one place
• Will need to go to databases and select locally
• Need to distribute computation/application to data
• Relatively easy to move programs to data if the data source is modeled well and there are known file formats
• Different data systems access data in different ways, using different metadata:
  • Filesystems with different file formats
  • Database systems that speak SQL
  • Data returned by objects

Content Based Image Retrieval/Classification
• Image based queries
• Look for patterns in 3, 4, ..., N dimensions
• Community effort to develop a testbed?
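An image based query of the kind described above can be prototyped very simply: represent each image as a feature vector and rank the collection by similarity to the query. The sketch below uses an intensity histogram and cosine similarity on toy pixel lists; real content-based retrieval would use richer, domain-specific features, and every name here is illustrative.

```python
# Sketch of content-based retrieval: histogram features + cosine ranking.
import math

def histogram(pixels, bins=8, max_val=256):
    """Crude feature vector: count pixels falling into each intensity bin."""
    h = [0] * bins
    for p in pixels:
        h[p * bins // max_val] += 1
    return h

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, collection):
    """Return image names ordered by similarity to the query image."""
    q = histogram(query)
    scored = [(cosine(q, histogram(img)), name) for name, img in collection.items()]
    return [name for _, name in sorted(scored, reverse=True)]

dark = [10, 20, 15, 30] * 25        # toy "images": flat pixel lists
bright = [220, 240, 230, 250] * 25
print(rank(dark, {"slide-A": dark, "slide-B": bright}))  # slide-A ranks first
```

The interesting, domain-specific part ("distinguishing what is important") lives entirely in the feature function; the ranking machinery is generic.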
• Define a standard set of datasets that would be published on the grid
• Tasks:
  • Feature detection
  • Distinguishing what is important (domain specific)
  • Systems software development testbed

Scheduling
• Support for on-demand data product generation, exploration of datasets, and visualization
• Grid based computation requires grid based allocation of resources
  • Can arise from coordinated priority scheduling or from an overall surplus of resources
• Complication arises from the need to coordinate coscheduling of closely coupled multiprocessor jobs
  • Iterative sweep over a mesh with 1000 processors: with the usual MPI programming paradigm, if one processor swaps out its piece of the mesh, the 999 other processors have to stop work on the problem and wait for messages
  • Academic work has been carried out on parallel multitasking, but this has not been scaled up to large clusters/parallel machines
• Quality of service
• Most scheduling mechanisms require anticipating the scale of resources needed for scientific queries

Classification of Problems
1. Lots of little grid based datasets; query; small result dataset
2. Lots of little grid based datasets; aggregate data; lots of data, data analysis on a large dataset
3. Varying number of huge datasets. Can push a filter to the data source and extract a data subset. Data subsets are aggregated and processed; we then get subcase 1 or 2
4. Significant computation may be involved; may need to send results to a multiprocessor program on a different platform
5. Another dimension: is the query interactive?
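Case 3 above, pushing the filter to the data source, can be sketched with a toy dataset. The record counts stand in for network traffic; `source_records` and both query functions are hypothetical names, not part of any real system.

```python
# Sketch of filter push-down: ship only the matching subset, not the
# whole dataset. Record counts stand in for bytes moved over the network.

def source_records(n):
    """Simulated remote dataset: n records, each with a 'value' field."""
    return [{"id": i, "value": i % 100} for i in range(n)]

def query_pull_everything(data, predicate):
    moved = len(data)                          # every record crosses the network
    return [r for r in data if predicate(r)], moved

def query_push_filter(data, predicate):
    subset = [r for r in data if predicate(r)]  # filter runs at the source
    return subset, len(subset)                  # only the subset is moved

data = source_records(10_000)
wanted = lambda r: r["value"] < 5
hits_pull, moved_pull = query_pull_everything(data, wanted)
hits_push, moved_push = query_push_filter(data, wanted)
print(moved_pull, moved_push)  # 10000 vs 500 records moved
```

Both strategies return identical results; only the volume moved differs, which is why petabyte-scale sources (case 3) force the push-down approach.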
A further dimension: time bound, or no serious time constraint?

Another View
• Dimension 1: size of source datasets
• Dimension 2: amount of data to be aggregated
• Dimension 3: computational intensiveness (hides much complexity, as computation may consist of one or several phases and may involve various data subsets; computation may be closely coupled or embarrassingly parallel)
• Dimension 4: need for interactivity

Optimizations
• Query planning and resource estimation, query optimization
  • Need to automate and hide this process from the user
• Not practical to move petabytes (see case 3 on the previous slide); need to do filtering and subsetting close to the data source (e.g., move a DataCutter filter to the source)
• The data aggregation process may involve very large datasets; may need significant compute, memory, and even disk resources dedicated to aggregation
• The grid is a very deep memory hierarchy; there is a role for software caching of data and intermediate results
• Possible implications for query language primitives: include time related directives

[Figure: gene structure diagram (Exon 1, G84A, I405V, Taq1B, Intron 12, 22.0 kb) illustrating Translational Research: Types of Information]

Processing Remotely-Sensed Data
NOAA TIROS-N with AVHRR sensor; AVHRR Level 1 data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite's track.
• At regular intervals along a scan line, measurements are gathered to form an instantaneous field of view (IFOV). One scan line is 409 IFOVs.
• Scan lines are aggregated into Level 1 datasets. A single file of Global Area Coverage (GAC) data represents:
  • ~one full earth orbit
  • ~110 minutes
  • ~40 megabytes
  • ~15,000 scan lines

Data Processing Applications
• Pathology
• Satellite data processing
• Porous media simulation
• Volumetric reconstruction
• Dynamic contrast MR

[Figure: 750 TB disk storage at a mid-range site; deep storage hierarchy planned at OSC for 2004]

[Figure: Virtual Microscope versus query images]

Virtual Microscope
• Just for starters: interactive software emulation of a high power light microscope for processing image datasets
• Visualize and explore microscopy images
• Image analysis for cancer diagnosis and grading
• Virtual Placenta (cancer research)
• Categorize images for associative retrieval
• Commercial scanners, O($150K), can now generate about 10 TB/year, and 100 TB/year in the next year or two

[Figure: Virtual Slide Grid Node architecture. Clients with data caching, client-side image processing, and a graphical user interface connect over a WAN to a Children's Hospital/Research Institute grid node providing query execution, metadata management, authentication/authorization, data transfer, encryption, anonymization, an information systems interface, RAID data storage, digitized slide scanners, and advanced image analysis servers]

Make Filesystems Across the Grid and Databases on the Grid Look Like a Big Database: GridDB-Lite
Support efficient selection of the data of interest from distributed scientific datasets and transfer of data from storage clusters to compute clusters
• Data subsetting model:
  • Virtual tables
  • Select queries
  • Distributed arrays

  SELECT <DataElements>
  FROM Dataset-1, Dataset-2, ..., Dataset-n
  WHERE <Expression> AND <Filter(<DataElement>)>
  GROUP-BY-PROCESSOR ComputeAttribute(<DataElement>)

Worldwide Translational Research
• Customized access control
• Distributed data and image warehouses
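One plausible reading of the GROUP-BY-PROCESSOR clause in the GridDB-Lite template above is that selected elements are routed to compute nodes according to the value of a user-supplied ComputeAttribute. The hash partitioner below is my own illustration of that reading, not the actual GridDB-Lite implementation; all names are hypothetical.

```python
# Illustrative partitioner: route each selected element to a processor
# bucket determined by its ComputeAttribute value, so all elements with
# the same attribute value land on the same processor.

def group_by_processor(elements, compute_attribute, n_processors):
    """Distribute elements across n_processors buckets by attribute value."""
    buckets = [[] for _ in range(n_processors)]
    for e in elements:
        buckets[hash(compute_attribute(e)) % n_processors].append(e)
    return buckets

elements = [{"chunk": c, "pixel_sum": c * 10} for c in range(8)]
buckets = group_by_processor(elements, lambda e: e["chunk"], n_processors=4)
print([len(b) for b in buckets])
```

The guarantee that matters for the distributed-array model is only that equal attribute values map to the same bucket; the actual load balance depends on the attribute distribution.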
• Grid based clinical, molecular dataset, and image query and analysis
• Grid based translational research workflow
• OSU Information Warehouse

Data Driven Translational Research (www.cise.nsf.gov/DDDAS, Dynamic Data Driven Applications Systems)
[Figure: feedback loop between Data, Analysis, and Workflow]
• Data: diagnosis, treatment, laboratory, radiology, pathology imaging, proteomic, gene expression, gene sequence; streamed to analysis
• Analysis: image analysis for cancer grading and staging, treatment efficacy, pharmacokinetics, drug effectiveness, toxicity; drives accrual, protocol changes, and choice of laboratory, imaging, and genomic testing; generates requests for data updates
• Workflow: rule based protocols; plan tests and treatments; plan patient consenting, specimen collection, and analysis; data driven algorithms for patient accrual and clinical, laboratory, and genomic testing

Software Support for Data Driven Applications
• DataCutter: component framework for combined task/data parallelism
  • Filtering/program coupling service: distributed C++ component framework
• GridDB-Lite: large data query layered on DataCutter
  • Indexing: multilevel hierarchical indexes based on the R-tree indexing method
  • Data cluster/decluster/range query
• Active Proxy-G: active semantic data cache
  • Employ user semantics to cache and retrieve data
  • Store and reuse results of computations

Metadata Management
• Grid based infrastructure based on XML Schema
• Version control for schemas
• A user can employ portions of multiple versioned schemas
• Infrastructure for validating XML documents using distributed schemas
• BOF at the upcoming GGF in Chicago
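The versioned-schema idea above can be sketched as version dispatch: each document declares the schema version it was written against, and a registry records what each version requires. Real deployments would use a full XML Schema validator; the registry, element names, and `schema_version` attribute below are all hypothetical, and only the dispatch logic is shown.

```python
# Sketch: validate an XML document against a (hypothetical) registry of
# versioned schemas, each defining its required child elements.
import xml.etree.ElementTree as ET

SCHEMA_REGISTRY = {  # version -> required child elements (illustrative)
    "1.0": {"specimen", "stain"},
    "1.1": {"specimen", "stain", "scanner_state"},
}

def validate(doc_text):
    """Return (ok, message) for a document that declares its schema version."""
    root = ET.fromstring(doc_text)
    version = root.get("schema_version")
    required = SCHEMA_REGISTRY.get(version)
    if required is None:
        return False, f"unknown schema version {version!r}"
    present = {child.tag for child in root}
    missing = required - present
    return (not missing), (f"missing elements: {sorted(missing)}" if missing else "ok")

ok, msg = validate(
    '<slide schema_version="1.1">'
    "<specimen>liver</specimen><stain>H&amp;E</stain>"
    "<scanner_state>calibrated</scanner_state></slide>"
)
print(ok, msg)  # True ok
```

Because the registry keeps old versions alongside new ones, documents written against earlier schemas remain checkable, which is the point of versioning the schemas rather than replacing them.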