STORM
Umit V Catalyurek
Multiscale Computing Lab
Biomedical Informatics Department
The Ohio State University
Roadmap
• Motivating Applications
– Oil Reservoir Management and Optimization
– Characteristics, Goals, and Challenges
• Middleware Systems
• STORM
– System Design
– Automatic Data Virtualization
– Results
2nd DIALOGUE Workshop
Applications associated with Large Datasets
• Satellite Data Processing
• Digital Pathology
• Managing Oilfields, Contaminant Transport
• DCE-MRI Analysis
• Derivation of macroscopic materials properties from MD simulations
Processing Remotely-Sensed Data (NOAA TIROS-N w/ AVHRR sensor)
AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite’s track.
• At regular intervals along a scan line, measurements are gathered to form an instantaneous field of view (IFOV).
• Scan lines are aggregated into Level 1 data sets.
A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit.
• ~110 minutes.
• ~40 megabytes.
• ~15,000 scan lines.
One scan line is 409 IFOVs.
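The approximate GAC file figures quoted on this slide can be combined into a quick back-of-the-envelope check. The sketch below uses only those rounded numbers; treating a megabyte as 10^6 bytes is an assumption, so the results are rough estimates, not sensor specifications.

```python
# Approximate figures quoted for one GAC Level 1 file.
SCAN_LINES = 15_000        # ~15,000 scan lines per file
IFOVS_PER_LINE = 409       # one scan line is 409 IFOVs
FILE_BYTES = 40 * 10**6    # ~40 megabytes (decimal megabytes assumed)
ORBIT_MINUTES = 110        # ~110 minutes per file

ifovs_per_file = SCAN_LINES * IFOVS_PER_LINE
bytes_per_ifov = FILE_BYTES / ifovs_per_file
scan_lines_per_second = SCAN_LINES / (ORBIT_MINUTES * 60)

print(f"IFOVs per file:        {ifovs_per_file:,}")
print(f"bytes per IFOV:        {bytes_per_ifov:.1f}")
print(f"scan lines per second: {scan_lines_per_second:.2f}")
```

Under these assumptions one file holds roughly 6.1 million IFOVs at about 6.5 bytes each, gathered at a little over two scan lines per second.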
Analysis of Confocal Microscopy Images
• Solving aggregate queries involving Sum or Count operations on spatial data
• Application domains:
– OLAP (On-Line Analytical Processing)
– Geographic data
– Image datasets
• Sample query:
SELECT Add(Value(x,y))
FROM Image
WHERE (x,y)
in POLYGON <(10,20),(300,400)>
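A minimal sketch of what the sample query computes, assuming the two points in POLYGON <(10,20),(300,400)> are opposite corners of an axis-aligned rectangle with inclusive bounds. The function name `region_sum` and the nested-list image are hypothetical illustrations, not part of any system described here.

```python
def region_sum(image, x_min, y_min, x_max, y_max):
    """Sum image[y][x] over x_min <= x <= x_max and y_min <= y <= y_max.

    The rectangle is clipped to the image bounds.
    """
    total = 0
    for y in range(max(y_min, 0), min(y_max, len(image) - 1) + 1):
        row = image[y]
        for x in range(max(x_min, 0), min(x_max, len(row) - 1) + 1):
            total += row[x]
    return total


# Toy 500 x 500 image with every pixel set to 1, so Sum equals Count.
image = [[1] * 500 for _ in range(500)]
result = region_sum(image, 10, 20, 300, 400)
print(result)
```

With all pixel values equal to 1, the same scan answers the Count form of the query; a real image would be read from files, not built in memory.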
Applications: Oil Reservoir Management
Oil Reservoir Simulations and Seismic Data Analysis
Implementing effective oil and gas production
• Simulate multiple realizations of multiple geostatistical models and production strategies
• Evaluate geologic uncertainty and production strategies simultaneously
• Enable on-demand exploration and comparison of multiple scenarios
Analysis
• Production rates, bypass oil, net present value
• Summary data from datasets
• Spatio-temporal queries
• Generate requests for new simulations, new seismic studies
– Integration of a robust, Grid-based computational and data handling infrastructure
– Distributed databases of reservoir and geophysical data
– Storage and computing resources at multiple institutions
Workflow
• Store and index data (seismic, well pressures, reservoir simulation results)
• Detect and track changes in data during production
• Invert data for reservoir properties
• Detect and track reservoir changes
• Obtain initial, boundary conditions, input parameters for simulations
• Run new reservoir simulations
• Assimilate data & reservoir properties into the evolving reservoir model
• Use simulation and optimization to guide future production
Data Querying and Processing
[Figure: Reservoir Simulations (Geostatistics: Model 1, Model 2, …, Model n, with m realizations; Production Strategies: Well Pattern 1, Well Pattern 2, …, Well Pattern p) and Seismic Data (labels: Sp (or CDP) # & position, Array #, Receiver group # & position, Component #)]
Characteristics, Commonalities…
• Spatio-temporal datasets (generally low dimensional) –
datasets describe physical scenarios
– Multi-dimensional, Multi-resolution, Multi-scale
• Very large file-based datasets
– Tens of gigabytes to 100+ TB data
– Data is stored in a distributed collection of files
– Lots of datasets, lots of files
• Data products often involve results from ensembles of spatio-temporal datasets
• Some applications require interactive exploration of datasets
• Common operations: subsetting, filtering, interpolations,
projections, comparisons, frequency counts
• Modeling and management of data analysis workflows
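Three of the common operations listed above (subsetting, filtering, frequency counts) can be illustrated on a toy spatio-temporal dataset. This is only a sketch: the `(x, y, t, value)` tuples and the value formula are made up for illustration, and real datasets of this kind live in distributed files, not in memory.

```python
from collections import Counter

# Hypothetical toy dataset: one (x, y, t, value) sample per grid point and time step.
samples = [
    (x, y, t, (x * y + t) % 5)
    for x in range(4)
    for y in range(4)
    for t in range(2)
]

# Subsetting: keep samples inside a spatial window.
subset = [s for s in samples if 1 <= s[0] <= 2 and 1 <= s[1] <= 2]

# Filtering: keep samples whose value passes a threshold.
filtered = [s for s in subset if s[3] >= 2]

# Frequency counts: histogram of values within the window.
histogram = Counter(s[3] for s in subset)

print(len(subset), len(filtered), dict(histogram))
```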
Data Services
• Distributed data processing support
• Grid-based data virtualization, data
management, query, on-demand data product
generation
• Distributed metadata and data management
• Track metadata associated with data and data
analysis workflows
Middleware Support
• Data Virtualization: STORM
– Large data querying capabilities, layered on DataCutter
– Distributed data virtualization
– Indexing, Data Cluster/Decluster, Parallel Data Transfer
• Data Analysis/Processing Workflows: DataCutter
– Component Framework for Combined Task/Data Parallelism
– Filtering/Program coupling Service: Distributed C++ component framework
– On-demand data product generation
• Distributed Metadata and Data Management: Mobius
– Create, manage, version data definitions
– Management of metadata and data instances
– Data integration
• Multiple Query Workloads: Active Proxy-G
– Active Semantic Data Cache
– Employ user semantics to cache and retrieve data
– Store and reuse results of computations
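The idea of storing and reusing results of computations can be sketched with a small region-keyed cache. This is an illustration of the general technique only, not the Active Proxy-G interface: `RegionCache`, `fetch_from_source`, and the rectangle-containment rule are all hypothetical.

```python
class RegionCache:
    """Cache query results keyed by the rectangular region they cover."""

    def __init__(self, fetch):
        self._fetch = fetch      # function(region) -> list of (x, y, value)
        self._entries = []       # list of (region, points)
        self.misses = 0

    def query(self, region):
        x0, y0, x1, y1 = region
        for (cx0, cy0, cx1, cy1), points in self._entries:
            if cx0 <= x0 and cy0 <= y0 and x1 <= cx1 and y1 <= cy1:
                # A cached region covers the request: filter it instead of refetching.
                return [p for p in points
                        if x0 <= p[0] <= x1 and y0 <= p[1] <= y1]
        self.misses += 1
        points = self._fetch(region)
        self._entries.append((region, points))
        return points


def fetch_from_source(region):
    """Stand-in for an expensive read from the underlying dataset."""
    x0, y0, x1, y1 = region
    return [(x, y, x + y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]


cache = RegionCache(fetch_from_source)
big = cache.query((0, 0, 9, 9))      # miss: fetched from the source
small = cache.query((2, 2, 4, 4))    # hit: answered from the cached superset
print(len(big), len(small), cache.misses)
```

The semantic part is the containment test: the second query is answered because the cache knows what region the first result covers, not because the two requests are textually identical.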
Data Virtualization
• Application developers generally prefer storing data in files
• Support high-level queries on multi-dimensional distributed datasets
• Many possible data abstractions, query interfaces
– Grid virtualized object relational database or XML database
– Grid virtualized objects with user defined methods invoked to access and process data
[Figure: Scientific Datasets, Data Service, Data Virtualization, Virtual Tables]
Our Approach
• Front-end
– Support a basic SQL Select query with a virtual relational table
view or a virtual XML database view
• A lightweight layer on top of datasets
– STORM runtime middleware carries out query
planning and query execution
• Compiler front end customizes runtime support
– Automatic customization and configuration of runtime query
support middleware
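A virtual relational table view over file-based data can be sketched as a lazy row iterator. This is a minimal illustration, not STORM code: the fixed-size binary record layout (`RID` as a 16-bit integer, `TIME` as a 32-bit integer, `POIL` as a 32-bit float), the file names, and the `virtual_table` and `select` helpers are assumptions chosen for the example.

```python
import os
import struct
import tempfile

# Hypothetical record layout: RID (int16), TIME (int32), POIL (float32), no padding.
RECORD = struct.Struct("<hif")
COLUMNS = ("RID", "TIME", "POIL")


def virtual_table(paths):
    """Yield one row dict per fixed-size record, reading the files lazily."""
    for path in paths:
        with open(path, "rb") as f:
            while True:
                raw = f.read(RECORD.size)
                if len(raw) < RECORD.size:
                    break
                yield dict(zip(COLUMNS, RECORD.unpack(raw)))


def select(table, columns, where):
    """Basic SELECT: project the requested columns of rows that satisfy `where`."""
    return [tuple(row[c] for c in columns) for row in table if where(row)]


# Build two small data files so the example is self-contained.
tmpdir = tempfile.mkdtemp()
paths = []
for rid in range(2):
    path = os.path.join(tmpdir, f"part{rid}.bin")
    with open(path, "wb") as f:
        for t in range(3):
            f.write(RECORD.pack(rid, t, 0.25 * t))
    paths.append(path)

result = select(virtual_table(paths), ("RID", "TIME"), lambda r: r["POIL"] >= 0.5)
print(result)
```

The data stays in its original files; only the thin iterator knows the layout, which is the piece that a layout description could in principle generate automatically.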
STORM
Support efficient selection of the data of interest from
distributed scientific datasets and transfer of data
from storage clusters to compute clusters
• Data Subsetting Model
– Virtual Tables
– Select Queries
– Distributed Arrays
SELECT <DataElements>
FROM Dataset-1, Dataset-2,…, Dataset-n
WHERE <Expression> AND <Filter(<DataElement>)>
GROUP-BY-PROCESSOR ComputeAttribute(<DataElement>)
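The query template above can be read as select, filter, then partition. The sketch below is one possible interpretation, assuming that GROUP-BY-PROCESSOR uses the value of `ComputeAttribute` (modulo the processor count) to choose the destination processor for each selected element. The function `run_select` and the toy rows are hypothetical; the attribute names are borrowed from the dataset description shown elsewhere in this deck.

```python
from collections import defaultdict


def run_select(rows, columns, where, num_processors, compute_attribute):
    """Filter rows, project columns, and bucket the results by destination processor."""
    partitions = defaultdict(list)
    for row in rows:
        if where(row):
            dest = compute_attribute(row) % num_processors
            partitions[dest].append(tuple(row[c] for c in columns))
    return dict(partitions)


# Toy rows standing in for elements of a distributed dataset.
rows = [
    {"RID": r, "TIME": t, "X": float(x), "POIL": 0.1 * x}
    for r in range(2)
    for t in range(3)
    for x in range(4)
]

parts = run_select(
    rows,
    columns=("RID", "TIME", "X", "POIL"),
    where=lambda row: row["TIME"] >= 1 and row["POIL"] > 0.15,
    num_processors=2,
    compute_attribute=lambda row: int(row["X"]),
)
for proc, tuples in sorted(parts.items()):
    print(proc, len(tuples))
```

In a real deployment the buckets would be shipped from storage nodes to compute nodes; here they are just lists in a dictionary.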
STORM Services
• Query
• Meta-data
• Indexing
• Data Source
• Filtering
• Partition Generation
• Data Mover
STORM Query Planning
STORM Query Execution
STORM Results: Selection in Seismic Data
[Figure: seismic data organization, with labels Survey #, Line #, Sp (or CDP) # & source position, Array #, Receiver group # & receiver group position, Component #, Traces]
STORM Results
OSC Mass Storage System
• 50 TB of performance storage
– home directories, project storage space, and long-term frequently accessed files.
• 420 TB of performance/capacity storage
– Active Disk Cache: compute jobs that require directly connected storage
– parallel file systems, and scratch space.
– Large temporary holding area
• 128 TB tape library
– Backups and long-term "offline" storage
[Figure: OSC storage architecture. Recoverable labels: LinTel boxes (PvFS/Active Disk Archive) (20); MetaData Servers (2); Cisco Directors 9509; SAN Volume Controller (4 servers); FAStT600 Turbo (20), Scratch / Archive Storage Pool (310/420 TB); FAStT900 (4), Core Storage Pool (35/50 TB) with SAN.FS; Backup Storage: 3584 Tape, 1 L32 2 D32, actual 640 cartridges @ 200 GB for a total of 128 TB, 4 drives, max drive data rate is 35 MB/s. Link labels: (40 - 2 per xSeries), (40 - 2 per T600), 10 GB/s, 890 MB/s throughput, 772 MB/s throughput, 384 MB/s throughput]
STORM Results
Seismic Datasets: 10-25 GB per file. About 30 TB of data.
[Figure: STORM I/O Performance. Bandwidth (MB/s), 0 to 4500, vs. # XIO nodes (1, 2, 4, 8, 16); series: 2 Threads, 4 Threads, Max]
Compiler Support
Design Overview
• Dataset Schema Description Component
• Dataset Storage Description Component
• Dataset Layout Description Component
Panels on this slide: Description file, Metadata, Data list file

[IPARS]
RID = INT2
TIME = INT4
X = FLOAT
Y = FLOAT
Z = FLOAT
POIL = FLOAT
PWAT = FLOAT
……

{ Group "ROOT" {
    DATASET "bh" {
      DATATYPE { IPARS }
      DATASPACE { RANK 3 }
      DATAINDEX { RID, TIME }
      PARTS { 9503, 9503, 9537, 9554,
              9503, 9707, 9520, 9520
      }
      DATA { DATASET SPACIAL,
             DATASET POIL,
             DATASET PWAT,
             ……
      }
    }
    Group "SUBGROUP" {
      DATASET "SPACIAL" {
        DATATYPE { }
        DATASPACE {
          SKIP 4 LINES
          LOOP PARTS {
            X SPACE Y SPACE Z
            SKIP 1 LINE
          }
        }
        DATA { PART in (0,1,2,3,4,5,6,7)
               .0.PART.5.init
        }
      }
      DATASET "POIL" {
        DATATYPE { }
        DATASPACE {
          LOOP TIME {
            SKIP 1 double
            LOOP PARTS
            { POIL }
          }
        }
        DATA { PART in (0,1,2,3,4,5,6,7)
               .0.PART.5.0
        }
      ……
}

[bh]
DatasetDescription = IPARS
io = file
Dim = 17x65x65
Npart = 8
…
Osumed1 = osumed01.epn.osc.edu,
          osumed02.epn.osc.edu,
          …
0 = bh-10-1 osumed1 /scratch1/bh-10-1
1 = bh-10-2 osumed1 /scratch1/bh-10-2
……
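The `[bh]` block on this slide has an INI-like shape, so a runtime could read it with a standard parser. The sketch below uses Python's `configparser` on only the entries that are fully visible on the slide. Interpreting the numbered entries as "file name, host alias, path" is an assumption, and the real tool's parser is not shown in this deck.

```python
import configparser

# Only the entries that are fully visible on the slide; elided ones are left out.
DATA_LIST = """
[bh]
DatasetDescription = IPARS
io = file
Dim = 17x65x65
Npart = 8
0 = bh-10-1 osumed1 /scratch1/bh-10-1
1 = bh-10-2 osumed1 /scratch1/bh-10-2
"""

parser = configparser.ConfigParser()
parser.optionxform = str          # keep key case exactly as written
parser.read_string(DATA_LIST)
section = parser["bh"]

dims = tuple(int(d) for d in section["Dim"].split("x"))
num_parts = section.getint("Npart")

# Numbered keys: assumed to be "<file name> <host alias> <path>".
files = {}
for key, value in section.items():
    if key.isdigit():
        name, host_alias, path = value.split()
        files[int(key)] = {"name": name, "host": host_alias, "path": path}

print(dims, num_parts, files[0]["path"])
```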
Test the ability of our code generation tool
Oil Reservoir Management
[Figure: two bar charts of Execution Time (seconds), one for the Large Query (Query 1) and one for the Small Queries (Query 2 to Query 5); series: L0 - Hand-written, L0 - Compiler Generated, Layout I to Layout VI]
• For Layout 0, the performance difference between hand-written and compiler-generated code is within 4%~10%.
• Correctly and efficiently handle a variety of different layouts for the same data
Distributed Execution: DataCutter
• Pipe-and-filter metaphor of data processing
• Combined Data/Task Parallelism
– Data is streamed from producer to consumer filters
– Transparent copies of filters
• Framework for task- and data-parallel manipulation of large scientific data
• Provide grid-based distributed computation and application-specific storage access
• XML description of data and task flow
[Figure: DataCutter filters placed on hosts (host1 to host5) across Cluster 1, Cluster 2, and Cluster 3; filter labels E0 … EK, EK+1 … EN, R0, R1, R2, Ra0, Ra1, Ra2, M]
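The pipe-and-filter metaphor, with data streamed from producer to consumer filters, can be illustrated with Python generators. This is a single-process sketch of the general pattern only: the slides describe DataCutter as a distributed C++ component framework, and the filter names here are hypothetical. Transparent copies and placement of filters on hosts are not modeled.

```python
def read_chunks(values, chunk_size):
    """Producer filter: stream the input as fixed-size chunks."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def select_range(chunks, low, high):
    """Intermediate filter: keep only values within [low, high]."""
    for chunk in chunks:
        kept = [v for v in chunk if low <= v <= high]
        if kept:
            yield kept


def accumulate(chunks):
    """Consumer filter: reduce the stream to a (count, sum) pair."""
    count = 0
    total = 0
    for chunk in chunks:
        count += len(chunk)
        total += sum(chunk)
    return count, total


data = list(range(100))
# Filters are chained; each chunk flows through the pipeline as it is produced.
count, total = accumulate(select_range(read_chunks(data, 16), 10, 19))
print(count, total)
```

Because each stage pulls chunks on demand, no stage needs the whole dataset in memory, which is the property that makes the pattern attractive for large scientific data.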
STORM + Seismic Image Reconstruction
[Figure: throughput (MB/sec), 0 to 4500, vs. Number of XIO Nodes (1, 2, 4, 8, 16); series: raw I/O, STORM-only, STORM+IR on XIO, STORM+IR on XIO + PIV]
STORM Data Resource
[Figure: components labeled GDS, JDBC Driver, Data Resource, Storm Daemon, STORM instance, Extractor, Filter, Data Mover]
Experiment Setup
• mob: 8 nodes, Dual 1.4 GHz AMD Opteron, 8 GB memory, 1.5 TB local disk
• Xio: 16 nodes, 2 Xeon 2.4 GHz, 4 GB memory, 7.3 TB FAStT600 disk array
• All nodes running Linux
• Gigabit switch

Dataset        Attributes  Record Size  Records (millions)  Dataset (GB)  Cluster, Num nodes
Oil Reservoir  21          84 bytes     3,840               315           Mob, 03
Seismic        16          4240 bytes   247                 1,056         Xio, 16
TXm            6           24 bytes     X                   24 * X / 1M   Mob, 01
Comparison with MySQL - 1
• Varying table size.
• Per-tuple cost is lower
[Figure: Execution Time (secs), 0 to 120, vs. Table Size (million rows), 0 to 200; series: MySQL-cold, MySQL-hot, STORM-cold, STORM-hot]
Comparison with MySQL - 2
• Varying query size
• Also compare them as data resources
[Figure: Execution Time (secs), 0 to 40, vs. Query Size (num of records), 0 to 1,000,000; series: MySQL, STORM, MySQL-DAI, STORM-DAI]
Future Work: Scenario 1
Data Management, Access, Integration
• Grid-level data services via OGSA-DAI
• Management of data definitions and metadata, XML virtualization via Mobius
• Object-relational virtualization and subsetting of file-based datasets via STORM
• On-demand data product generation via DataCutter
• STORM, Mobius, DataCutter support data operations on heterogeneous collections of storage and compute clusters
[Figure: Grid Protocols over OGSA-DAI services: Schema Management (Mobius), XML Virtualization and Metadata Management (Mobius), SQL Virtualization of Files (STORM), Data Product Generation (DataCutter)]
Data Management, Access, and Integration
[Figure: Simulation Data, Seismic Data, and Seismic/Simulation Data sites connected through Grid Service Protocols; each site exposes a Grid-data Service (OGSA-DAI) with SQL Virtualization of Files (STORM), Data Product Generation (DataCutter), and XML Virtualization / Metadata Management (Mobius); Schema Management (Mobius)]
Scenario 2: Refactor STORM
• Refactor to handle
– XML databases
– Relational databases
– Object databases
• We should be able to reuse the following services
– Query planning
– Data partitioning
– Data transfer
For more info
Multiscale Computing Lab
http://www.multiscalecomputing.org or
http://msc.osu.edu
STORM project web site
http://storm.bmi.ohio-state.edu
STORM has been part of the NSF Middleware Initiative since Release 5
http://www.nsf-middleware.org