DECOUPLED EXECUTION PARADIGM FOR DATA-INTENSIVE HIGH-END COMPUTING

Yong Chen
Data-Intensive Scalable Computing Laboratory
Department of Computer Science
Texas Tech University
11/15/12
About Me
•  Assistant Professor; director and faculty member of the Data-Intensive Scalable Computing Laboratory (DISCL)
•  My research focuses on data-intensive computing, parallel and distributed computing, high-performance computing, cloud computing, computer architectures, and systems software support for high-performance scientific computing and high-end enterprise computing
Wordle of My Current Publication Titles
Credit: http://www.wordle.net/
Outline
•  Introduction and Background
•  Decoupled Execution Paradigm
•  Theoretical Modeling and Analysis
•  Data Dependence and Resource Contention
•  Evaluations
•  Conclusion
High-End Computing/High-Performance Computing
•  A form of parallel computing, with a focus on performance
•  The fundamental limits of serial computers are being approached
•  A strategic tool for scientific discovery and innovation
•  Solves “grand challenge” problems
•  Helps understand the phenomena behind data
•  Computer simulation and analysis complement theory and experiments
A Typical HEC System: ANL Intrepid (IBM Blue Gene/P architecture)
•  Chip: 4 processors, 850 MHz, 8 MB EDRAM, 13.6 GF/s; supports 4-way SMP
•  Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR
•  Node Card (32 chips, 4x4x2): 32 compute cards, 0-2 I/O cards, 435 GF/s, 64 GB
•  Rack: 32 node cards, 1024 chips, 4096 procs, 14 TF/s, 2 TB
•  Petaflops System: 72 racks, cabled 8x8x16, 1 PF/s, 144 TB
•  Maximum System: 256 racks, 3.5 PF/s, 512 TB
•  Front End Node / Service Node: System p servers, Linux SLES10
•  HPC software: compilers, GPFS, ESSL, LoadLeveler
Source: ANL ALCF (note: data not latest)
Scientific Applications Trend
•  Applications tend to be data intensive
•  Scientific simulations, data mining, large-scale data processing, etc.
•  A GTC run on 29K cores on the Jaguar machine at OLCF generated over 54 terabytes of data in a 24-hour period (a sustained rate of over 600 MB/s)
Data requirements for selected INCITE applications at ALCF (Source: R. Ross et al., Argonne National Laboratory)

PI                      Project                                                On-Line Data  Off-Line Data
Lamb, Don               FLASH: Buoyancy-Driven Turbulent Nuclear Burning       75TB          300TB
Fischer, Paul           Reactor Core Hydrodynamics                             2TB           5TB
Dean, David             Computational Nuclear Structure                        4TB           40TB
Baker, David            Computational Protein Structure                        1TB           2TB
Worley, Patrick H.      Performance Evaluation and Analysis                    1TB           1TB
Wolverton, Christopher  Kinetics and Thermodynamics of Metal and Complex       5TB           100TB
                        Hydride Nanoparticles
Washington, Warren      Climate Science                                        10TB          345TB
Tsigelny, Igor          Parkinson's Disease                                    2.5TB         50TB
Tang, William           Plasma Microturbulence                                 2TB           10TB
Sugar, Robert           Lattice QCD                                            1TB           44TB
Siegel, Andrew          Thermal Striping in Sodium Cooled Reactors             4TB           8TB
Roux, Benoit            Gating Mechanisms of Membrane Proteins                 10TB          10TB
Atmospheric Science
•  A huge number of sensors are deployed across the world
•  They record data every 3 hours
Sensor deployment across the world (from NOAA, the National Oceanic and Atmospheric Administration)
Execution Paradigm of High-End Computing:
State of the Art
•  Current HEC execution models and their associated runtime systems, however, are computing-centric
•  Systems architecture
•  Programming model (e.g., the Message Passing Interface, MPI)
•  Runtime (e.g., the MPI library)
[Figure: abstracted HEC system — compute nodes connected through an interconnect to storage, with data flowing across the interconnect]
Execution Paradigm of High-End Computing:
State of the Art
•  Not ready to support efficient I/O for data-intensive HEC
•  MPI focuses on exchanging in-memory data
•  HEC performance is commonly measured as the peak performance of small computation kernels that fit well into memory and cache
•  The data-driven IT industry has developed a new paradigm, MapReduce, for its needs
•  There is a great need for the HEC community to rethink its execution models for the coming data-intensive HEC era
Decoupled Execution Paradigm
•  We propose a new Decoupled Execution Paradigm (DEP) for data-intensive high-end computing
•  Introduces the notion of separating compute nodes and data (processing) nodes
•  Decouples execution into computation-intensive and data-intensive operations
•  Data nodes handle data-intensive operations collectively
•  Compute nodes handle computation-intensive operations collectively
•  Applications execute in a decoupled but fundamentally more efficient manner for data-intensive HEC, with collective support
•  A rethinking of the execution paradigm in which I/O-intensive operations are as important as computation
•  Provides balanced computation and data-access capabilities
•  Preliminary results have shown promise and potential
Y. Chen, C. Chen, X.-H. Sun, W. D. Gropp, and R. Thakur. A Decoupled Execution Paradigm for Data-Intensive High-End Computing. In Proc. of the IEEE International Conference on Cluster Computing 2012 (Cluster’12), 2012.
Motivating Example
•  Data is commonly represented by a multi-dimensional array-based data model
•  Read the required data from storage servers to compute nodes
•  Perform computations on the desired data with specified conditions, and then write the results back
•  Execution has clearly separated data retrieval/processing phases and computing/simulation phases
•  Data access and movement often dominate execution time for data-intensive HEC applications (a conventional-paradigm sketch follows below)
Processing 3-dimensional temperature data in the Community Earth System Model (CESM)
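To make the conventional pattern concrete, here is a minimal C sketch of the read-compute-write cycle described above; the file names, grid dimensions, and the simple moving-average stencil are assumptions for illustration, not taken from the actual CESM kernel.

/* Minimal sketch of the conventional read-compute-write paradigm.
 * File names and grid dimensions are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define NX 64
#define NY 64
#define NZ 64   /* assumed 3-D temperature grid dimensions */

int main(void)
{
    size_t n = (size_t)NX * NY * NZ;
    double *temp = malloc(n * sizeof *temp);
    double *avg  = malloc(n * sizeof *avg);
    if (!temp || !avg) return 1;

    /* 1. Read the full array from storage to the compute node. */
    FILE *in = fopen("temperature.bin", "rb");
    if (!in || fread(temp, sizeof *temp, n, in) != n) return 1;
    fclose(in);

    /* 2. Compute: a simple moving average along x (illustrative). */
    for (size_t i = 0; i < n; i++) {
        double left  = (i % NX) ? temp[i - 1] : temp[i];
        double right = ((i + 1) % NX) ? temp[i + 1] : temp[i];
        avg[i] = (left + temp[i] + right) / 3.0;
    }

    /* 3. Write the result back to storage. */
    FILE *out = fopen("temperature_avg.bin", "wb");
    if (!out || fwrite(avg, sizeof *avg, n, out) != n) return 1;
    fclose(out);

    free(temp);
    free(avg);
    return 0;
}

Under DEP, the read and write steps would move to data nodes, so only the reduced data would cross the interconnect.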
Decoupled Execution Paradigm Design
[Figure: DEP design stack — applications run on the Decoupled Execution Programming Model (DEPM, an MPI extension), which sits on the Decoupled Execution Runtime System (DERS: a message passing library on compute nodes and a data processing library on data nodes), which in turn runs on the decoupled execution systems architecture: compute nodes, compute-side data nodes, and storage-side data nodes (data nodes with local SSD storage), connected by an interconnect]
DEP System Architecture
•  Nodes are decoupled into compute nodes, compute-side data nodes, and storage-side data nodes
•  Compute-side data nodes reduce the size of computation-generated data before sending it to storage nodes
•  Storage-side data nodes reduce the size of data retrieved from storage before it is sent to compute nodes
•  Data nodes can provide simple data forwarding, but
•  The idea is to conduct decoupled data-intensive operations and optimizations to reduce data size and movement (a toy sketch follows below)
•  Compute nodes handle computation-intensive operations collectively
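As a toy illustration of the reduction idea (an assumed example, not code from the DEP prototype), the sketch below shows the kind of size reduction a compute-side data node could apply before forwarding data to storage, here a simple block-mean downsampling:

/* Toy illustration of compute-side data reduction before storage:
 * downsample an array by averaging each block of B elements, so the
 * storage nodes receive 1/B of the original volume. The block size
 * and array contents are assumptions for this sketch. */
#include <stdio.h>

#define N 16
#define B 4   /* reduction factor (assumed) */

int main(void)
{
    double raw[N], reduced[N / B];

    for (int i = 0; i < N; i++)          /* stand-in for simulation output */
        raw[i] = (double)i;

    for (int j = 0; j < N / B; j++) {    /* block-mean reduction */
        double sum = 0.0;
        for (int k = 0; k < B; k++)
            sum += raw[j * B + k];
        reduced[j] = sum / B;
    }

    /* Only 'reduced' (N/B values) would be sent on to storage nodes. */
    for (int j = 0; j < N / B; j++)
        printf("reduced[%d] = %f\n", j, reduced[j]);
    return 0;
}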
DEP Programming Model
•  Determines which operations are passed to data nodes
•  Designed as an MPI extension, allowing users to specify operations to be conducted on data nodes
•  Results are sent back to compute nodes for further processing
•  Similar to the netCDF Operators, but allows data-intensive operations to be decoupled and processed on data nodes
•  For instance, an ncwa operator computes the weighted average of specified data and returns the result for further computations
•  Extended and more powerful (see the sketch below)
•  Allows arbitrary operations, not only predefined operators, to be decoupled
•  Allows optimizations across operations
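As a rough sketch of what a DEPM-style extension might look like, the C fragment below defines a user-level weighted-average operation intended to run on data nodes; DEP_Op, DEP_Op_create, and DEP_Op_execute are hypothetical names invented for this illustration, not the actual DEPM interface.

/* Hypothetical sketch of a DEPM-style MPI extension. DEP_Op_create and
 * DEP_Op_execute are invented names for illustration only. */
#include <mpi.h>
#include <stdio.h>

/* User-defined data-intensive operation, intended to run on data nodes:
 * a weighted average over a chunk of the input array. */
static double weighted_avg(const double *data, const double *weights, int n)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; i++) {
        num += data[i] * weights[i];
        den += weights[i];
    }
    return den != 0.0 ? num / den : 0.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* In a DEPM program, the operation would be registered and shipped
     * to the storage-side data nodes rather than run here, e.g.
     * (hypothetical interface):
     *
     *   DEP_Op op;
     *   DEP_Op_create(weighted_avg, &op);
     *   DEP_Op_execute(op, "temperature.nc", region, &result);
     *
     * so only the reduced result crosses the network to compute nodes. */
    double data[4]    = {1.0, 2.0, 3.0, 4.0};
    double weights[4] = {0.1, 0.2, 0.3, 0.4};
    printf("local weighted average: %f\n", weighted_avg(data, weights, 4));

    MPI_Finalize();
    return 0;
}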
DEP Runtime System
•  Relies on two libraries: a message passing library and a data processing library
•  The message passing library focuses on the memory abstraction and provides support for computation-intensive operations
•  Leverages the existing MPI library for this purpose
•  The data processing library focuses on the I/O abstraction and provides support for data-intensive operations
•  The two libraries are tightly coupled, and the message passing library manages the interaction between them
•  Can also optimize user-defined data-intensive operations and other I/O optimization operations on data nodes
Comparison of Execution Paradigms
[Figure: in the conventional execution paradigm, retrieval moves all data across the network (the bottleneck) before the compute, reduce, and store steps; in the decoupled execution paradigm, retrieval and reduction run on data nodes first, yielding reduced latency, improved access, and reduced data movement and network transmission before the compute and store steps]
Data Dependence and Dynamic Data Distribution
•  Data dependence is caused by two factors:
•  Data access patterns of operations
•  Data distribution in the file system
•  A Dynamic Data Distribution scheme is proposed (see the sketch after the figure below)
[Figure: flow-routing example — an N x M terrain map (latitude x longitude) is divided into strips 1..L; single flow direction (SFD) and multiple flow direction (MFD) operations impose different access patterns over the strips, which interact with how strips (e.g., strips o, p, q) are distributed across storage servers a, b, and c, each running an analysis kernel over its local disk]
C. Chen and Y. Chen. Dynamic Active Storage for High Performance I/O. In Proc. of the 41st International Conference on Parallel Processing (ICPP’12), 2012.
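The following sketch is a simplified illustration of the data-distribution issue, not the published dynamic distribution algorithm: it contrasts a static round-robin strip layout with a dependence-aware layout that co-locates strips an operation accesses together (the group size of 2 is an assumption).

/* Illustrative sketch (not the paper's algorithm): distributing N data
 * strips over S servers. Round-robin striping spreads neighboring
 * strips across servers, so an MFD-style operation reading a strip plus
 * its neighbors touches several servers; a dependence-aware layout
 * places each group of dependent strips on one server. */
#include <stdio.h>

#define N 12     /* strips */
#define S 3      /* servers */
#define GROUP 2  /* strips accessed together by the operation (assumed) */

int main(void)
{
    for (int strip = 0; strip < N; strip++) {
        int rr  = strip % S;             /* static round-robin layout */
        int dep = (strip / GROUP) % S;   /* group dependent strips    */
        printf("strip %2d -> round-robin server %d, dependence-aware server %d\n",
               strip, rr, dep);
    }
    return 0;
}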
Resource Contention and Solution
•  An HEC system may run dozens of applications simultaneously
•  Resource contention degrades overall system performance
•  Dynamic operation scheduling is proposed (a simplified scheduling sketch follows below)
[Figure: applications APP1..APPm, each with processes p1..pn (m < n), issue a mix of active I/O (AI) and normal I/O (NI) requests, each requesting 256 MB of data, into the I/O queues of the data nodes]
[Chart: execution time (s) of AS versus TS with 3, 5, and 7 I/Os per storage node]
C. Chen, Y. Chen, and P. C. Roth. DOSAS: Mitigating the Resource Contention in Active Storage Systems. In Proc. of the IEEE International Conference on Cluster Computing 2012 (Cluster’12), 2012.
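A simplified sketch of the scheduling idea follows; the admission policy (a fixed cap on concurrent active I/O per data node) is an assumption for illustration, not the DOSAS algorithm itself.

/* Illustrative sketch (assumed policy, not the DOSAS algorithm):
 * a data node drains a mixed queue of active I/O (AI) and normal I/O
 * (NI) requests. Under contention it caps the number of in-flight AI
 * requests so NI latency does not collapse. */
#include <stdio.h>

typedef enum { NI, AI } ReqKind;   /* normal vs. active I/O */

int main(void)
{
    ReqKind queue[] = { AI, AI, NI, AI, NI, NI, AI, NI };
    int n = (int)(sizeof queue / sizeof queue[0]);
    int ai_budget = 2;  /* assumed cap on concurrent active I/O */

    for (int i = 0; i < n; i++) {
        if (queue[i] == AI) {
            if (ai_budget > 0) {
                ai_budget--;
                printf("request %d: AI admitted\n", i);
            } else {
                /* Deferred AI requests are retried once slots free up. */
                printf("request %d: AI deferred (budget exhausted)\n", i);
            }
        } else {
            printf("request %d: NI admitted\n", i);
        }
    }
    return 0;
}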
Preliminary Results and Experimental Platform
•  Experimental platform
•  A 640-node Linux cluster
•  Each node is equipped with Intel Xeon 2.8 GHz CPUs (12 cores per node) and 24 GB of memory
•  Two application kernels
•  A kernel calculation of the CESM that computes the moving average of a selected area of the specified data
•  Flow routing and flow accumulation calculations in a geographic information system
Flow directions in a grid of terrains: numbers represent the gradient of each terrain; arrows represent the direction of water flow
Results of the CESM Kernel Code
'%!"
'#!"
'!!"
'#!"
'!!"
)*+,-+.*+/0"
&!"
123"
%!"
$!"
Execution Time (s)
Executtion Time (s)
'$!"
&!"
)*+,-+.*+/0"
%!"
123"
$!"
#!"
#!"
!"
'#"
#$"
$&"
(%"
!"
4GB5"
'#"
#$"
$&"
(%"
4GB5"
'#!"
)!"
'!!"
(!"
&!"
)*+,-+.*+/0"
%!"
123"
$!"
#!"
!"#$%&'()*+,#)-./)
!"#$%&'()*+,#)-./)
Execution Time of CESM Kernel Code with Different Data Sets on 48 Nodes
'!"
&!"
,-./0.1-.23"
%!"
456"
$!"
#!"
!"
'#"
#$"
$&"
(%"
(GB)
!"
#$"
$&"
&*"
+("
Execution Time of CESM Kernel Code with Different Data Sets on 96 Nodes
7GB8"
Results of the CESM Kernel Code
%#!!"
Bandwidth (MB/s)
%!!!"
$#!!"
*+,-.,/+,01"
234"
$!!!"
Effective Bandwidth of CESM Kernel Code
with Different Data Sets on 96 Nodes, with
48 storage-side data nodes
#!!"
'!!!"
$%"
%&"
&'"
()"
(GB)
Effective Bandwidth of CESM Kernel Code
with Different Data Sets on 96 Nodes, with
24 storage-side data nodes
&#!!"
Bandwidth (MB/s)
!"
&!!!"
%#!!"
+,-./-0,-12"
%!!!"
345"
$#!!"
$!!!"
#!!"
!"
$%"
%'"
'("
)*"
(GB)
*!"
+!"
)!"
*!"
(!"
)!"
'!"
+,-./-0,-12"
&!"
345"
%!"
!"#$%&$'()*MB/s+)
!"#$%&$'()*+!,-.)
Results of the GIS Kernel Code
(!"
'!"
$!"
#!"
#!"
#*"
%("
&*"
)$"
456"
%!"
$!"
!"
,-./0.1-.23"
&!"
!"
(GB)
#*"
%("
&*"
)$"
(GB)
Effective Bandwidth of Flow Routing and Accumulation
Code with Different Data Sets on 24 Nodes
+!"
'%!"
*!"
'$!"
(!"
'!"
,-./0.1-.23"
&!"
456"
%!"
$!"
#!"
Bandwidth (MB/s)
bandwidth (MB/s)
)!"
'#!"
'!!"
*+,-.,/+,01"
&!"
234"
%!"
$!"
#!"
!"
#*"
%("
&*"
)$"
(GB)
!"
'&"
Effective Bandwidth of Flow Routing and Accumulation
Code with Different Data Sets on 48 Nodes
(%"
$&"
)#"
(GB)
Related Work and Comparison
•  Extensive studies have focused on improving the performance of data-intensive HEC at various levels
•  We compare against three levels of work
•  Architecture, programming model, and runtime system levels
•  Architecture improvements for data-intensive HEC
•  Nonvolatile storage-class memory devices are promising but cannot reduce data movement across the network
•  Active storage and active disks offload computations but are designed around either idle computing power or an embedded processor
•  DEP provides a more powerful platform for the same purpose
•  I/O forwarding and data shipping offload I/O requests too; data nodes in the DEP design can carry out all of these functions and more
Related Work and Comparison
•  Programming model improvements for data-intensive HEC
•  Current models are designed for computation-intensive applications
•  These include MPI, Global Arrays, Unified Parallel C, Chapel, X10, Coarray Fortran, and HPF
•  I/O is handled through a subset of interfaces such as MPI-IO
•  MapReduce was an instant hit, but it is typically layered on top of distributed file systems and not designed for high-performance semantics
•  Runtime system improvements for data-intensive HEC
•  Advanced I/O libraries: HDF, PnetCDF, ADIOS
•  Collective I/O, data sieving, server-directed I/O, disk-directed I/O
•  Caching, buffering, staging, and prefetching optimization strategies
•  Data nodes in DEP work for both reads and writes and can provide buffering or staging, but focus more importantly on data reduction
Conclusion and Future Work
•  I/O has been widely recognized as a bottleneck in high-end computing for data-intensive scientific discovery
•  The bottleneck and the massive amount of data movement can largely limit the productivity of data-intensive sciences
•  Contributions of this research
•  Studies a decoupled execution paradigm (DEP) for data-intensive high-end computing
•  Separates data-processing nodes from compute nodes, decomposes application operations, and maps them onto the decoupled nodes
•  Data and compute nodes collectively provide a balanced design
•  Verified with an initial prototype; the results are promising
•  An initial step in trying a new execution paradigm
•  Further work on each component is ongoing
Thank You
For more information please visit:
http://data.cs.ttu.edu/dep