5/31/2016
Data Management, Transfer,
Metadata, Ontologies and Provenance
Joel Saltz
Chair and Professor
Biomedical Informatics Department
Professor Computer and Information
Science, Pathology
The Ohio State University
The Group
• Joel Saltz – Ohio State
• Brian Collins – IBM Mursley
• Sean Bechhofer – University of Manchester
• Peter Lyster – NIH/NIBIB
• Chris Catton – Oxford Univ Dept Zoology
• David Shotton – Oxford Univ Dept Zoology
• Amarnath Gupta – UCSD
• Frederica Darema – NSF
• Chris Taylor – Univ Manchester
• Shinji Shimojo – Osaka University
• Toyokazu Akiyama – Osaka University
• Jeffrey Grethe – UCSD
• Mark Ellisman – UCSD
• Michael Marron – NIH
• Philip Papadopoulos – UCSD
Topics
• Databases and persistence
• Data types
• Metadata and mediation
• Data handling, formatting, remote access
• Data curation and provenance
• Secure transfer, encryption
• Complex queries
• Clinical and research governance
Goals – what will the world look like?
• Identify, query, retrieve, and carry out on-demand data product generation directed at collections of data from multiple sites/groups on a given topic; reproduce each group's data analysis and carry out new analyses on all datasets. Users should be able to carry out entirely new analyses or to incrementally modify other scientists' data analyses, without having to worry about the physical location of data or processing. Excellent tools should be available to examine data, including a mechanism to authenticate potential users, control access to data, and log the identity of those accessing data.
Data Validation and Provenance
• Capture assumptions prevalent at the time data was captured
• Instigate capturing of more information than is common; we are not tracking all the details that are important (implicit or tacit knowledge)
• Standard reporting forms
• Paradigms of best practice – "this is what this group means by an aggressive grunt"; this is how a 2-D gel should look
• Include results on standard specimens, phantoms, and imaging machine calibration info
• Instruments that acquire data could transmit their state – encourage vendors to share state (e.g. hematology, clinical chemistry)
• Characterizing and publishing error estimates
• Do we want to be able to recapture the state of a database at a given time?
• Problem – standards for rapidly developing systems (e.g. mass spec)
• DICOM
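The capture points above can be bundled into a single record at acquisition time. This is a minimal sketch; the record fields, the `capture` helper, and all example values (dataset IDs, instrument names) are hypothetical, not from any real provenance system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Provenance metadata captured alongside a raw measurement (hypothetical schema)."""
    dataset_id: str
    acquired_at: str                                      # ISO 8601 timestamp
    instrument: str                                       # e.g. a hematology analyzer
    instrument_state: dict = field(default_factory=dict)  # vendor-reported state
    calibration: dict = field(default_factory=dict)       # standard-specimen / phantom results
    assumptions: list = field(default_factory=list)       # tacit knowledge made explicit
    error_estimates: dict = field(default_factory=dict)   # published error characterization

def capture(dataset_id, instrument, state, calibration, assumptions, errors):
    """Bundle instrument state and prevailing assumptions with the data at acquisition time."""
    return ProvenanceRecord(
        dataset_id=dataset_id,
        acquired_at=datetime.now(timezone.utc).isoformat(),
        instrument=instrument,
        instrument_state=state,
        calibration=calibration,
        assumptions=assumptions,
        error_estimates=errors,
    )

record = capture(
    "gel-2004-017", "2-D gel imager",
    state={"exposure_s": 30, "firmware": "3.1.2"},
    calibration={"standard_specimen": "BSA ladder", "drift_pct": 0.4},
    assumptions=["stain batch assumed uniform"],
    errors={"spot_volume_cv_pct": 8.0},
)
```

Because the record keeps assumptions and calibration next to the data, a later analyst can judge whether the implicit knowledge of 2004 still holds.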
Metadata and query
• Deal with ever-changing data models, changing classification schemes, and ontologies (e.g. is it a plasma membrane protein or a shuttling protein?)
• Need precise definitions of data
transformations/filters to ensure reproducibility
• Want tools that are as easy to use as Google – ability
to select data without presupposing relationships
• Separate “concrete or well defined” entities from
“abstract” concepts
• Should these be dealt with differently?
• Not clear that there is consensus on what is defined although
there are clear limiting examples
Metadata and Query (cont)
• How to represent metadata? How to couple grid based
databases?
• Easy? Describe metadata as OWL and map between ontologies
• Mapping may be simple or complex. Mapping may involve temporal
conditions
• Annotation of secondary data analyses
• Describe proposed mechanism with pointers to datasets cited as
evidence. Graph is used to represent hypotheses. Annotate what is
known and what is not known (in graph)
• Old data
• Is old data useful? If so, how do we ensure you can still get at the raw data? 16 mm cell cycle films are an open format; large-scale film-archiving activities are under way at various sites.
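The point that a mapping between ontologies may be simple, complex, or conditional on time can be sketched concretely. The terms, dates, and the `translate` helper below are all hypothetical illustrations, not any real ontology alignment:

```python
from datetime import date

# Hypothetical mapping between an old and a new classification scheme.
# A mapping may carry a temporal condition: suppose annotations made
# before 2003 used the coarse term, while later ones use a finer term.
MAPPINGS = [
    # (source_term, target_term, condition on annotation date)
    ("membrane protein", "plasma membrane protein", lambda d: d < date(2003, 1, 1)),
    ("membrane protein", "shuttling protein",       lambda d: d >= date(2003, 1, 1)),
]

def translate(term, annotated_on):
    """Map a term from the old scheme into the new one, honoring temporal conditions."""
    for src, dst, cond in MAPPINGS:
        if src == term and cond(annotated_on):
            return dst
    return term  # no mapping applies: pass the term through unchanged

print(translate("membrane protein", date(2002, 6, 1)))  # plasma membrane protein
print(translate("membrane protein", date(2004, 6, 1)))  # shuttling protein
```

An OWL-based mediator would express the same idea declaratively; the table form just makes the "mapping may involve temporal conditions" caveat visible.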
Formal description of biological and
pathobiological mechanisms
• Graphs with different flavors of links
• Can use formal methods to deduce inferences
• Want mechanisms for generating views
• Distinguish between description of experimental method and biology
• How to represent domain knowledge in a consistent way – should look at connections with the semantic web community
• Use a polygon to annotate an image with a controlled vocabulary term
• What are the right ways to get to data and to formulate queries (e.g. by anatomic structure, hypothesis supported, molecule expressed), what classes of query one wants to support, what tools should be used to formulate the query; methods for iterative query formulation
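The "graphs with different flavors of links" idea can be sketched as a tiny edge-typed graph with one deduced inference and one generated view. The edge types and node names are hypothetical examples:

```python
# Hypothesis graph with typed ("flavored") links: evidence links and
# anatomical containment links coexist in one structure.
edges = [
    ("dataset-42", "supports", "hypothesis-A"),
    ("dataset-17", "refutes",  "hypothesis-A"),
    ("nucleolus",  "part_of",  "nucleus"),
    ("nucleus",    "part_of",  "cell"),
]

def view(edge_type):
    """Generate a view: only the links of one flavor."""
    return [(s, t) for s, kind, t in edges if kind == edge_type]

def part_of_closure(node):
    """Deduce inferences: part_of is transitive, so walk it to a fixed point."""
    containers, frontier = set(), {node}
    while frontier:
        nxt = {t for s, t in view("part_of") if s in frontier} - containers
        containers |= nxt
        frontier = nxt
    return containers

print(view("supports"))              # [('dataset-42', 'hypothesis-A')]
print(part_of_closure("nucleolus"))  # {'nucleus', 'cell'}
```

Keeping experimental-method links and biological links as distinct flavors is exactly what lets a query by anatomic structure ignore the evidence links, and vice versa.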
How to deliver on promise of doing queries and
computations on grid based databases
• Can't aggregate data in one place
• Will need to go to databases and select locally
• Need to distribute computation/application to data
• Relatively easy to move programs to data if the data source is modeled well and there are known file formats
• Different data systems access data in different ways using different metadata
• Filesystems with different file formats
• Database systems that speak SQL
• Data returned by objects
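One common way to hide those differences is to give every source the same selection interface, so a filter can be shipped to the data regardless of how it is stored. The class and function names below are hypothetical, a minimal sketch of the adapter pattern rather than any real grid middleware:

```python
# Each source exposes the same select(predicate) interface, so the same
# filter runs "at the data" whether it lives in files or a database.
class FileSystemSource:
    def __init__(self, records):          # stand-in for files in a known format
        self.records = records
    def select(self, predicate):
        return [r for r in self.records if predicate(r)]

class SqlSource:
    def __init__(self, rows):             # stand-in for a database speaking SQL
        self.rows = rows
    def select(self, predicate):
        # A real adapter would compile the predicate to a WHERE clause;
        # here we just filter locally, at the source.
        return [r for r in self.rows if predicate(r)]

def distributed_select(sources, predicate):
    """Push the filter to each source and aggregate only the matches."""
    out = []
    for src in sources:
        out.extend(src.select(predicate))
    return out

sources = [FileSystemSource([{"id": 1, "temp": 40}, {"id": 2, "temp": 20}]),
           SqlSource([{"id": 3, "temp": 45}])]
hot = distributed_select(sources, lambda r: r["temp"] > 30)
print([r["id"] for r in hot])  # [1, 3]
```

Only matching records cross the network; the bulk of the data never leaves its source.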
Content based image retrieval/classification
• Image based queries
• Look for patterns in 3, 4, N dimensions
• Community effort to develop a testbed?
• Define a standard set of datasets that would be published on the grid
• Tasks:
• Feature detection
• Distinguishing what is important (domain specific)
• Systems software development testbed
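The retrieval side reduces to extracting feature vectors and ranking by distance in feature space. This toy sketch uses a deliberately trivial feature (mean and variance of pixel intensity); real feature detection is the domain-specific hard part noted above, and all names here are hypothetical:

```python
import math

def features(image):
    """Toy feature vector: mean intensity and variance of a 2-D image (list of rows)."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return (mean, var)

def retrieve(query_image, corpus):
    """Rank corpus images by Euclidean distance in feature space, nearest first."""
    q = features(query_image)
    return sorted(corpus, key=lambda item: math.dist(q, features(item[1])))

corpus = [
    ("bright-uniform", [[200, 200], [200, 200]]),
    ("dark-uniform",   [[10, 10], [10, 10]]),
    ("high-contrast",  [[0, 255], [255, 0]]),
]
query = [[190, 210], [205, 195]]
print(retrieve(query, corpus)[0][0])  # bright-uniform
```

A community testbed would fix the corpus, the feature definitions, and the distance measure so that groups can compare retrieval quality on the same published datasets.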
Scheduling
• Support for on-demand data product generation,
exploration of datasets and visualization
• Grid based computation requires grid based allocation of
resources
• Can arise from coordinated priority scheduling or from an overall surplus of resources
• A complication arises from the need to coordinate co-scheduling of closely coupled multiprocessor jobs
• Iterative sweep over a mesh with 1000 processors – with the usual MPI programming paradigm, if one processor swaps out its piece of the mesh, the 999 other processors have to stop work on the problem and wait for messages
• Academic work has been carried out on parallel multitasking, but this has not been scaled up to large clusters/parallel machines
• Quality of service
• Most scheduling mechanisms require anticipating the scale of resources needed for scientific queries
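The mesh-sweep example can be quantified with a one-line model: because each iteration ends in a message exchange, wall-clock time per iteration is the maximum over processors, not the average. The numbers below are illustrative, not measurements:

```python
# Toy model of a tightly coupled iterative sweep: every processor must hear
# from its neighbors before the next iteration, so one laggard stalls all.
def sweep_time(per_proc_times):
    """Wall-clock time of one closely coupled iteration across all processors."""
    return max(per_proc_times)

n = 1000
compute = 1.0        # seconds of useful work per processor per iteration (assumed)
swap_penalty = 30.0  # extra seconds for one processor whose mesh piece was swapped out

balanced = sweep_time([compute] * n)
one_swapped = sweep_time([compute] * (n - 1) + [compute + swap_penalty])

print(balanced)     # 1.0  -> all 1000 processors finish together
print(one_swapped)  # 31.0 -> 999 processors idle while one pages its mesh back in
```

This is why uncoordinated scheduling of such jobs is so costly, and why co-scheduling or a resource surplus is needed.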
Classification of problems
1. Lots of little grid based datasets – query – small
result dataset
2. Lots of little grid based datasets – aggregate data –
lots of data, data analysis on large dataset
3. Varying number of huge datasets. Can push filter to
data source and extract data subset. Data subsets
aggregated and processed. We then either get
subcase 1 or 2
4. Significant computation may be involved. May need to
send results to multiprocessor program on different
platform
5. Another dimension: interactive query? Time-bound or no serious time constraint?
Another view
• Dimension 1 – size of source datasets
• Dimension 2 – amount of data to be aggregated
• Dimension 3 – computational intensiveness (hides much complexity, as computation may consist of one or several phases and may involve various data subsets; computation may be closely coupled or embarrassingly parallel)
• Dimension 4 – need for interactivity
Optimizations
• Query planning, resource estimation, and query optimization
• Need to automate and hide this process from the user
• Not practical to move petabytes (see case 3 on the previous slide) – need to do filtering and subsetting close to the data source (e.g. move a DataCutter filter to the source)
• The data aggregation process may involve very large datasets; may need significant compute, memory, even disk resources dedicated to aggregation
• The grid is a very deep memory hierarchy – role for software caching of data and intermediate results
• Possible implications for query language primitives – include time-related directives
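The "don't move petabytes" point is a cost comparison a planner can make automatically. This back-of-the-envelope sketch is hypothetical in every parameter (link speed, selectivity, filter cost); it only illustrates the shape of the decision:

```python
# Back-of-the-envelope planner: ship the filter to the data when filtering
# near the source sends far fewer bytes than shipping the raw dataset.
def plan(dataset_bytes, selectivity, link_bytes_per_s, filter_cost_s):
    """Return the cheaper strategy and its estimated time. All inputs are estimates."""
    move_data = dataset_bytes / link_bytes_per_s
    move_filter = filter_cost_s + (dataset_bytes * selectivity) / link_bytes_per_s
    if move_filter < move_data:
        return ("move filter to source", move_filter)
    return ("move data to compute", move_data)

# A petabyte-scale source (case 3) over an assumed 1 GB/s link, keeping 0.1%:
strategy, seconds = plan(dataset_bytes=1e15, selectivity=0.001,
                         link_bytes_per_s=1e9, filter_cost_s=3600.0)
print(strategy)  # move filter to source
print(seconds)   # 4600.0 seconds, versus 1e6 seconds (~11.6 days) to move the raw data
```

Resource estimation feeds exactly these inputs; hiding the decision from the user is then a matter of running `plan` inside the optimizer.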
Translational Research: Types of Information
[Figure: gene map spanning 22.0 kb, showing exons 1–16, intron 12, and the polymorphisms G84A, I405V, and Taq1B]
Processing Remotely-Sensed Data
NOAA TIROS-N with AVHRR sensor
AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite's track.
• At regular intervals along a scan line, measurements are gathered to form an instantaneous field of view (IFOV). One scan line is 409 IFOVs.
• Scan lines are aggregated into Level 1 data sets.
A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit
• ~110 minutes
• ~40 megabytes
• ~15,000 scan lines
Applications
• Pathology
• Satellite data processing
• Porous media simulation
• Volumetric reconstruction
• Dynamic contrast MR
750 TB Disk storage at a mid-range site -- Deep Storage Hierarchy
Planned at OSC for 2004
Virtual Microscope
[Figure: query region versus retrieved images]
Virtual Microscope
• Just for starters – interactive software emulation of a high-power light microscope for processing image datasets
• Visualize and explore microscopy images
• Image analysis for cancer diagnosis and grading
• Virtual Placenta (cancer research)
• Categorize images for associative retrieval
• Commercial scanners, O($150K), can now generate about 10 TB/year, and 100 TB/year in the next year or two
Virtual Slide Grid Node
[Architecture diagram: clients (graphical user interface, client-side image processing, data caching) connect over a WAN to the Children's Hospital/Research Institute node. The node comprises query execution, metadata management, and authentication/authorization servers; data transfer, data encryption, and anonymization services; an interface to the hospital information systems; digitized slide scanners feeding RAID data storage; and an advanced image analysis server.]
Make Filesystems and Databases Across the Grid Look Like One Big Database – GridDB-Lite
Support efficient selection of the data of interest from
distributed scientific datasets and transfer of data from
storage clusters to compute clusters
• Data Subsetting Model
• Virtual Tables
• Select Queries
• Distributed Arrays
SELECT <DataElements>
FROM Dataset-1, Dataset-2,…, Dataset-n
WHERE <Expression> AND <Filter(<DataElement>)>
GROUP-BY-PROCESSOR ComputeAttribute(<DataElement>)
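The intent of the query form above can be sketched in a few lines: evaluate the selection over the virtual tables, then let GROUP-BY-PROCESSOR partition the result among compute nodes by the compute attribute. This is a toy interpretation of the semantics, not GridDB-Lite's implementation, and all names are hypothetical:

```python
# Sketch of the query semantics: SELECT + WHERE + filter over virtual tables,
# then GROUP-BY-PROCESSOR hash-partitions rows among n processors.
def griddb_select(datasets, where, filt, compute_attr, n_procs):
    """Evaluate the select, then partition rows by processor for the distributed arrays."""
    selected = [row for ds in datasets for row in ds
                if where(row) and filt(row)]
    parts = {p: [] for p in range(n_procs)}
    for row in selected:
        parts[hash(compute_attr(row)) % n_procs].append(row)  # placement by compute attribute
    return parts

dataset1 = [{"x": i, "val": i * i} for i in range(10)]
dataset2 = [{"x": i, "val": -i} for i in range(10)]
parts = griddb_select(
    [dataset1, dataset2],
    where=lambda r: r["x"] >= 5,         # <Expression>
    filt=lambda r: r["val"] % 2 == 0,    # <Filter(<DataElement>)>
    compute_attr=lambda r: r["x"],       # ComputeAttribute(<DataElement>)
    n_procs=4,
)
total = sum(len(rows) for rows in parts.values())
print(total)  # 4 selected rows, spread across 4 processors
```

In the real system the selection runs at the storage clusters and only the partitioned subsets travel to the compute cluster.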
Worldwide Translational Research
• Customized access control
• Distributed data and image warehouses
• Grid based clinical and molecular dataset and image query and analysis
• Grid based translational research workflow
OSU Information Warehouse
Dynamic Data Driven Translational Research
www.cise.nsf.gov/DDDAS (Dynamic Data Driven Applications Systems)
[Cycle diagram:
• Data – diagnosis, treatment, laboratory, radiology, pathology imaging, proteomic, gene expression, gene sequence; data is streamed to analysis, which sends back requests for data updates
• Analysis – image analysis for cancer grading and staging, treatment efficacy, pharmacokinetics, drug effectiveness, toxicity; drives accrual, protocol changes, and choice of laboratory, imaging, and genomic testing; generates requests for data
• Data driven algorithms – patient accrual; clinical, laboratory, and genomic testing
• Workflow – rule based protocols; plan tests and treatments; plan patient consenting, specimen collection, and analysis]
Software Support for Data Driven Applications
• DataCutter: component framework for combined task/data parallelism
• Filtering/program coupling service: distributed C++ component framework
• GridDB-Lite: large data query layered on DataCutter
• Indexing: multilevel hierarchical indexes based on the R-tree indexing method
• Data cluster/decluster/range query
• Active Proxy G: active semantic data cache
• Employ user semantics to cache and retrieve data
• Store and reuse results of computations
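The semantic-cache idea, keying on what a request means rather than on byte ranges, can be sketched as follows. The class and its key structure are hypothetical, not the Active Proxy G interface:

```python
# Sketch of a semantic data cache: keyed by the semantics of the request
# (dataset, region, operation), so the result of an expensive computation
# is reused by any later query that means the same thing.
class SemanticCache:
    def __init__(self):
        self.store = {}
        self.hits = 0

    def get_or_compute(self, dataset, region, operation, compute):
        key = (dataset, tuple(region), operation)   # semantic descriptor of the request
        if key in self.store:
            self.hits += 1
            return self.store[key]
        result = compute()                          # expensive filter/aggregation runs once
        self.store[key] = result                    # stored for reuse
        return result

cache = SemanticCache()
expensive = lambda: sum(range(1_000_000))           # stand-in for an image analysis step
a = cache.get_or_compute("slide-7", (0, 0, 512, 512), "mean", expensive)
b = cache.get_or_compute("slide-7", (0, 0, 512, 512), "mean", expensive)
print(a == b, cache.hits)  # True 1
```

A fuller cache would also answer queries whose region is contained in a cached one; the descriptor-based key is what makes that subsumption check possible at all.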
Metadata Management
• Grid based infrastructure based on XML Schema
• Version control for schemas
• Users can employ portions of multiple versioned schemas
• Infrastructure for validating XML documents using distributed schemas
• BOF at upcoming GGF in Chicago
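Composing portions of multiple versioned schemas can be sketched with a registry keyed by (name, version). This is an illustrative toy, with hypothetical schema names and a deliberately simplified notion of validation, not the XML Schema infrastructure itself:

```python
# Sketch of schema version control: publish (name, version) -> field types,
# compose a document description from portions of several versioned schemas,
# then validate documents against the composed description.
registry = {}

def publish(name, version, fields):
    """Register one version of a schema (fields maps element name -> type)."""
    registry[(name, version)] = dict(fields)

def compose(*refs):
    """Build a combined schema from chosen portions of multiple versioned schemas."""
    combined = {}
    for name, version, wanted in refs:
        schema = registry[(name, version)]
        combined.update({f: schema[f] for f in wanted})
    return combined

def validate(doc, schema):
    """Check that a document supplies every required field with the declared type."""
    return all(f in doc and isinstance(doc[f], t) for f, t in schema.items())

publish("specimen", 1, {"id": str, "organ": str})
publish("specimen", 2, {"id": str, "organ": str, "stain": str})
publish("imaging", 1, {"magnification": int})

schema = compose(("specimen", 2, ["id", "stain"]), ("imaging", 1, ["magnification"]))
print(validate({"id": "S-9", "stain": "H&E", "magnification": 40}, schema))  # True
```

In the distributed setting the registry entries would be fetched from remote schema servers, but the composition and validation logic stays the same.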