5/31/2016 1

Data Management, Transfer,
Metadata, Ontologies and Provenance
Joel Saltz
Chair and Professor
Biomedical Informatics Department
Professor Computer and Information
Science, Pathology
The Ohio State University
Topics
• Databases and persistence
• Data types
• Metadata and mediation
• Data handling, formatting, remote access
• Data curation and provenance
• Secure transfer, encryption
• Complex queries
• Clinical and research governance
Translational Research: Types of Information
[Figure: gene map, 22.0 Kb, with exon labels 1, 5, 10, 16 and variant labels G84A, I405V, Intron12, Taq1B]
Processing Remotely-Sensed Data
NOAA Tiros-N
w/ AVHRR sensor
AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the
Advanced Very High Resolution Radiometer (AVHRR)
sensor scans perpendicular to the satellite’s track.
• At regular intervals along a scan line measurements
are gathered to form an instantaneous field of view
(IFOV).
• Scan lines are aggregated into Level 1 data sets.
A single file of Global Area
Coverage (GAC) data
represents:
• ~one full earth orbit.
• ~110 minutes.
• ~40 megabytes.
• ~15,000 scan lines.
One scan line is 409 IFOVs
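The approximate figures above imply some per-scan-line and per-IFOV sizes. The following is a minimal sketch of that arithmetic; it assumes "megabytes" means decimal megabytes (10^6 bytes) and treats all of the slide's numbers as rough estimates, so the results are order-of-magnitude values only. The variable names are illustrative.

```python
# Back-of-the-envelope sizes for one GAC file, using the slide's approximate figures.
FILE_BYTES = 40 * 10**6   # ~40 megabytes (assumption: decimal megabytes)
SCAN_LINES = 15_000       # ~15,000 scan lines per file
IFOVS_PER_LINE = 409      # one scan line is 409 IFOVs
ORBIT_MINUTES = 110       # ~110 minutes per file

bytes_per_scan_line = FILE_BYTES / SCAN_LINES
bytes_per_ifov = bytes_per_scan_line / IFOVS_PER_LINE
ifovs_per_file = SCAN_LINES * IFOVS_PER_LINE
scan_lines_per_second = SCAN_LINES / (ORBIT_MINUTES * 60)

print(f"bytes per scan line:   {bytes_per_scan_line:.0f}")
print(f"bytes per IFOV:        {bytes_per_ifov:.2f}")
print(f"IFOVs per file:        {ifovs_per_file}")
print(f"scan lines per second: {scan_lines_per_second:.2f}")
```

A few bytes per IFOV is consistent with each IFOV carrying several sensor channels, although the slide itself does not state the record layout.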
Applications
• Pathology
• Satellite Data Processing
• Porous Media Simulation
• Volumetric Reconstruction
• Dynamic Contrast MR
750 TB Disk storage at a mid-range site -- Deep Storage Hierarchy
Planned at OSC for 2004
Virtual Microscope
[Figure labels: query, images, versus]
Virtual Microscope
• Just for starters -- interactive software emulation of a high-power light microscope for processing image datasets
• Visualize and explore microscopy images
• Image analysis for cancer diagnoses and grading
• Virtual Placenta (Cancer research)
• Categorize images for associative retrieval
• Commercial scanners (on the order of $150K) can now generate about 10 TB/year, and 100 TB/year in the next year or two
Virtual Slide Grid Node
[Figure: architecture diagram. Component labels: Clients (Graphical User Interface, Client Side Image Processing, Data Caching); WAN; Switch; Query Execution Server; Data Transfer Server; Meta-data Management Server; Authentication/Authorization; Data Encryption/Anonymization; Data Storage Server; Advanced Image Analysis Server; RAID arrays; Digitized Slide Scanners; Information Systems Interface; Children’s Hospital/Research Institute; Children’s Hospital Information Systems]
Make Filesystems Across Grid, Databases on
Grid look like a big database -- GridDB-Lite
Support efficient selection of the data of interest from
distributed scientific datasets and transfer of data from
storage clusters to compute clusters
• Data Subsetting Model
• Virtual Tables
• Select Queries
• Distributed Arrays
SELECT <DataElements>
FROM Dataset-1, Dataset-2,…, Dataset-n
WHERE <Expression> AND <Filter(<DataElement>)>
GROUP-BY-PROCESSOR ComputeAttribute(<DataElement>)
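The query template above can be illustrated with a toy, in-memory sketch. This is not GridDB-Lite's API: the function `select`, its parameters, and the sample datasets are all hypothetical, "virtual tables" are modelled as plain lists of dictionaries, and GROUP-BY-PROCESSOR is read here as "partition the selected elements among processors according to a computed attribute", which is one interpretation of the slide.

```python
from collections import defaultdict

def select(datasets, elements, where, user_filter, compute_attribute, n_processors):
    """Toy SELECT over in-memory 'virtual tables' (lists of dict rows).

    Rows passing both the WHERE expression and the user-defined filter are
    projected onto `elements` and assigned to a processor chosen from the
    computed attribute (hypothetical reading of GROUP-BY-PROCESSOR).
    """
    partitions = defaultdict(list)
    for table in datasets:
        for row in table:
            if where(row) and user_filter(row):
                target = compute_attribute(row) % n_processors
                partitions[target].append({name: row[name] for name in elements})
    return dict(partitions)

# Two small made-up datasets sharing the same attributes.
dataset_1 = [{"x": x, "y": 0, "value": 10 * x} for x in range(4)]
dataset_2 = [{"x": x, "y": 1, "value": 10 * x + 1} for x in range(4)]

result = select(
    datasets=[dataset_1, dataset_2],
    elements=["x", "value"],
    where=lambda r: 1 <= r["x"] <= 3,           # range expression
    user_filter=lambda r: r["value"] % 2 == 0,  # user-defined filter
    compute_attribute=lambda r: r["x"],         # decides the target processor
    n_processors=2,
)
print(result)
```

In a real deployment the partitions would be shipped from storage clusters to compute clusters rather than collected in one dictionary; the sketch only shows how selection, filtering, and partitioning compose.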
Worldwide Translational Research
• Customized access control
• Distributed data, image warehouses
• Grid based clinical, molecular dataset and image query, analysis
• Grid based translational research workflow
[Figure label: OSU Information Warehouse]
Dynamic Data Driven Translational Research
www.cise.nsf.gov/DDDAS (Dynamic Data Driven Applications Systems)
Analysis
• Image analysis: Cancer grading and staging, treatment efficacy, pharmacokinetics, drug effectiveness, toxicity
• Drives accrual, protocol changes, choice of laboratory, imaging, genomic testing
Data
• Diagnosis, Treatment, Laboratory, Radiology, Pathology Imaging, Proteomic, Gene Expression, Gene Sequence
Workflow
• Rule based protocols, plan tests and treatments, plan patient consenting, specimen collection and analysis
• Data driven algorithms - patient accrual, clinical, laboratory, genomic testing
[Figure arrow labels: Data streamed to Analysis; Request for data updates; Generates requests for data]
Software Support for Data Driven
Applications
• DataCutter: Component Framework for Combined Task/Data Parallelism
• Filtering/Program coupling Service: Distributed C++ component framework
• GridDB Lite: Large Data Query Layered on DataCutter
• Indexing: Multilevel hierarchical indexes based on R-tree indexing method
• Data Cluster/Decluster/Range Query
• Active Proxy G: Active Semantic Data Cache
• Employ user semantics to cache and retrieve data
• Store and reuse results of computations
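The idea of a semantic cache that stores and reuses query results can be sketched as follows. This is only an illustration under simplifying assumptions and does not reflect Active Proxy G's actual design: the class `SemanticRangeCache`, the function `slow_source`, and the restriction to one-dimensional range queries are all hypothetical. The "semantics" used here is containment: a new range that lies inside a cached range is answered from the cache instead of the backend.

```python
class SemanticRangeCache:
    """Caches 1-D range query results and answers contained sub-ranges locally."""

    def __init__(self, fetch):
        self._fetch = fetch      # backend function: (lo, hi) -> list of (key, value)
        self._entries = []       # list of ((lo, hi), rows) for past queries
        self.backend_calls = 0   # counts how often the backend was contacted

    def query(self, lo, hi):
        # Reuse a cached result if it fully covers the requested range.
        for (c_lo, c_hi), rows in self._entries:
            if c_lo <= lo and hi <= c_hi:
                return [(k, v) for k, v in rows if lo <= k <= hi]
        # Otherwise go to the backend and remember the answer.
        self.backend_calls += 1
        rows = self._fetch(lo, hi)
        self._entries.append(((lo, hi), rows))
        return rows

def slow_source(lo, hi):
    """Stand-in for an expensive remote query or computation."""
    return [(k, k * k) for k in range(lo, hi + 1)]

cache = SemanticRangeCache(slow_source)
first = cache.query(0, 10)   # goes to the backend
second = cache.query(3, 5)   # served from the cached [0, 10] result
print(second, cache.backend_calls)
```

A fuller design would also combine partially overlapping cached ranges and evict old entries; both are omitted to keep the sketch short.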
Metadata Management
• Grid based infrastructure based on XML Schema
• Version control for schemas
• User can employ portions of multiple versioned schemas
• Infrastructure for validating XML documents using distributed schemas
• BOF at upcoming GGF in Chicago
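To make the idea of combining portions of multiple versioned schemas concrete, here is a deliberately simplified sketch. It is not XML Schema validation and not the infrastructure described on the slide: the registry `SCHEMAS`, the `schemaVersion` attribute, and the element names are all hypothetical, and "validation" is reduced to checking that each part of a document contains the child elements its declared schema version requires.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: (schema name, version) -> required child element names.
SCHEMAS = {
    ("specimen", "1.0"): {"id", "tissue"},
    ("specimen", "2.0"): {"id", "tissue", "collected"},
    ("image", "1.1"): {"uri", "magnification"},
}

def validate(xml_text):
    """Return a list of problems; an empty list means every part passed."""
    problems = []
    root = ET.fromstring(xml_text.strip())
    for part in root:
        key = (part.tag, part.get("schemaVersion"))
        required = SCHEMAS.get(key)
        if required is None:
            problems.append(f"unknown schema {key}")
            continue
        missing = required - {child.tag for child in part}
        if missing:
            problems.append(f"{part.tag} {key[1]} missing {sorted(missing)}")
    return problems

# One document mixing parts defined by different schemas and versions.
good_doc = """
<record>
  <specimen schemaVersion="1.0"><id>S1</id><tissue>placenta</tissue></specimen>
  <image schemaVersion="1.1"><uri>slide-001</uri><magnification>40</magnification></image>
</record>
"""

# Same content, but declaring a newer specimen version that requires <collected>.
bad_doc = good_doc.replace('specimen schemaVersion="1.0"', 'specimen schemaVersion="2.0"')

print(validate(good_doc))
print(validate(bad_doc))
```

In a distributed setting the registry lookup would resolve schemas from remote locations rather than a local dictionary; that part is left out here.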