
Workshop on Petascale Data Analytics: Challenges and Opportunities, SC11
SALSA HPC Group
http://salsahpc.indiana.edu
School of Informatics and Computing
Indiana University
Twister: Bingjing Zhang, Richard Teng
Twister4Azure: Thilina Gunarathne
High-Performance Visualization Algorithms for Data-Intensive Analysis: Seung-Hee Bae and Jong Youl Choi
DryadLINQ CTP Evaluation: Hui Li, Yang Ruan, and Yuduo Zhou
Cloud Storage, FutureGrid: Xiaoming Gao, Stephen Wu
Million Sequence Challenge: Saliya Ekanayake, Adam Hughes, Yang Ruan
Cyberinfrastructure for Remote Sensing of Ice Sheets: Jerome Mitchell
Intel’s Application Stack
Applications (Support Scientific Simulations, Data Mining and Data Analysis): Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topographic Mapping
Services and Workflow: Security, Provenance, Portal
Programming Model: High Level Language
Runtime: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
Storage: Distributed File Systems, Object Store, Data Parallel File System
Infrastructure: Windows Server HPC (bare-system), Linux HPC (bare-system), Amazon Cloud (virtualization), Azure Cloud (virtualization), Grid Appliance (virtualization)
Hardware: CPU Nodes, GPU Nodes
What are the challenges?
(Large-scale data analysis for data-intensive applications)
Providing both cost effectiveness and powerful parallel programming paradigms that are capable of handling the incredible increases in dataset sizes. These challenges must be met for both computation and storage: if computation and storage are separated, it is not possible to bring computing to the data.

Research issues
 portability between HPC and Cloud systems
 scaling performance
 fault tolerance

Data locality
 its impact on performance;
 the factors that affect data locality;
 the maximum degree of data locality that can be achieved.

Factors beyond data locality to improve performance
To achieve the best data locality is not always the optimal scheduling decision. For instance, if the node where the input data of a task is stored is overloaded, running the task on it will result in performance degradation.

Task granularity and load balance
In MapReduce, task granularity is fixed. This mechanism has two drawbacks:
1) limited degree of concurrency
2) load imbalance resulting from the variation of task execution time.
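The scheduling trade-off above can be made concrete with a small sketch: prefer the node that holds a task's input data, but fall back to the least-loaded node when the data-local node is overloaded. This is an illustrative sketch only; the class names (LocalityAwareScheduler, Node) and the loadThreshold parameter are assumptions, not part of any runtime described here.

import java.util.*;

// Illustrative sketch of locality-aware task scheduling: prefer the node
// holding a task's input data, but fall back to a lightly loaded node when
// the data-local node is overloaded. All names here are hypothetical.
public class LocalityAwareScheduler {
    static class Node {
        final String id;
        int runningTasks;                                      // current load
        final Set<String> localBlocks = new HashSet<>();       // data blocks stored on this node
        Node(String id) { this.id = id; }
    }

    private final List<Node> nodes;
    private final int loadThreshold;                           // above this, a node counts as overloaded

    LocalityAwareScheduler(List<Node> nodes, int loadThreshold) {
        this.nodes = nodes;
        this.loadThreshold = loadThreshold;
    }

    /** Pick a node for a task whose input is the given data block. */
    Node schedule(String inputBlock) {
        Node local = nodes.stream()
                .filter(n -> n.localBlocks.contains(inputBlock))
                .min(Comparator.comparingInt(n -> n.runningTasks))
                .orElse(null);
        // Best data locality is not always the best decision: if the
        // data-local node is overloaded, run the task elsewhere.
        if (local != null && local.runningTasks < loadThreshold) {
            local.runningTasks++;
            return local;
        }
        Node leastLoaded = nodes.stream()
                .min(Comparator.comparingInt(n -> n.runningTasks))
                .orElseThrow();
        leastLoaded.runningTasks++;
        return leastLoaded;
    }

    public static void main(String[] args) {
        Node a = new Node("node-a"); a.localBlocks.add("block-1");
        Node b = new Node("node-b");
        LocalityAwareScheduler scheduler = new LocalityAwareScheduler(List.of(a, b), 4);
        System.out.println("block-1 scheduled on " + scheduler.schedule("block-1").id);
    }
}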
Programming Models and Tools
MapReduce in Heterogeneous Environment
Motivation
Data Deluge: experienced in many domains
MapReduce: data centered, QoS
Classic Parallel Runtimes (MPI): efficient and proven techniques
Goal: expand the applicability of MapReduce to more classes of applications
[Diagram: programming model spectrum, from Map-Only (Input, map, Output) through MapReduce (Input, map, reduce) and Iterative MapReduce (iterations over Input, map, reduce) to More Extensions (Pij-style interactions).]
Distinction on static and variable data
Configurable long running (cacheable) map/reduce tasks
Pub/sub messaging based communication/data transfers
Broker Network for facilitating communication

Main program's process space (Master Node):
configureMaps(..)
configureReduce(..)
while(condition){
    runMapReduce(..)   // Map(), Reduce(), Combine() operation; iterations
    updateCondition()
} //end while
close()

Worker Nodes run the cacheable map/reduce tasks against their local disks, and may send <Key,Value> pairs directly. Communications/data transfers go via the pub-sub broker network & direct TCP. The main program may contain many MapReduce invocations or iterative MapReduce invocations.
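As an illustration of this driver pattern (configure once, iterate runMapReduce until updateCondition is met, then close), here is a minimal, self-contained sketch in the shape of a toy 1-D K-means. It runs entirely in-process and only mirrors the structure of the loop above; it is not the Twister API.

import java.util.Arrays;

// A minimal sketch of the iterative pattern above
// (configure -> loop { runMapReduce; updateCondition } -> close),
// applied to a toy 1-D K-means. Everything runs in-process.
public class IterativeDriverSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 8.0, 8.3, 7.9};   // static data: "configured" once
        double[] centroids = {0.0, 10.0};                    // variable data: re-broadcast each iteration

        for (int iter = 0; iter < 100; iter++) {             // while(condition)
            double[] updated = runMapReduce(points, centroids);
            double change = 0;
            for (int k = 0; k < centroids.length; k++) change += Math.abs(updated[k] - centroids[k]);
            centroids = updated;
            if (change < 1e-6) break;                        // updateCondition()
        }
        System.out.println("centroids = " + Arrays.toString(centroids));
    }

    // "map" assigns each point to the nearest centroid; "reduce/combine"
    // averages the assigned points to produce new centroids.
    static double[] runMapReduce(double[] points, double[] centroids) {
        double[] sum = new double[centroids.length];
        int[] count = new int[centroids.length];
        for (double p : points) {                            // map phase
            int nearest = 0;
            for (int k = 1; k < centroids.length; k++)
                if (Math.abs(p - centroids[k]) < Math.abs(p - centroids[nearest])) nearest = k;
            sum[nearest] += p;
            count[nearest]++;
        }
        double[] updated = new double[centroids.length];
        for (int k = 0; k < centroids.length; k++)           // reduce/combine phase
            updated[k] = count[k] > 0 ? sum[k] / count[k] : centroids[k];
        return updated;
    }
}

The point of the pattern is that the static data (the points) is configured and cached once, while only the variable data (the centroids) moves between iterations.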
Components of Twister
Master Node: the Twister Driver and the Main Program, connected to a pub/sub Broker Network; one broker serves several Twister daemons.
Worker Nodes: each runs a Twister Daemon with a worker pool of cacheable map/reduce tasks and a local disk.
Scripts perform: data distribution, data collection, and partition file creation.
Twister4Azure
Azure Queues are used for scheduling, Tables to store meta-data and monitoring data, and Blobs for input/output/intermediate data storage.
Job flow: New Job, Job Start, Map/Combine tasks, Reduce, Merge, then "Add Iteration?"; if yes, hybrid scheduling of the new iteration, if no, Job Finish.
Worker Roles run the Map Workers (Map 1 ... Map n) and Reduce Workers (Red 1 ... Red n), backed by an In-Memory Data Cache and a Map Task Meta Data Cache.
Scheduling combines the global Scheduling Queue with a Job Bulletin Board + In-Memory Cache + Execution History; left-over tasks that did not get scheduled through the bulletin board are picked up from the queue. (A sketch of this hybrid scheduling follows the feature list below.)
• Distributed, highly scalable and highly available cloud
services as the building blocks.
• Utilize eventually-consistent, high-latency cloud services
effectively to deliver performance comparable to
traditional MapReduce runtimes.
• Decentralized architecture with global queue based
dynamic task scheduling
• Minimal management and maintenance overhead
• Supports dynamic scaling up and down of the compute resources.
• MapReduce fault tolerance
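A rough sketch of the hybrid scheduling idea: workers first look for new-iteration tasks on the bulletin board whose data they already cache, and otherwise fall back to the global scheduling queue for left-over tasks. The queue, board, and cache below are plain in-memory stand-ins, not the Azure Queue/Table/Blob services Twister4Azure actually uses, and all names are illustrative.

import java.util.*;
import java.util.concurrent.*;

// In-memory stand-ins for the cloud services: a global scheduling queue for
// first-time task assignment, and a "bulletin board" listing tasks of the new
// iteration so that workers with cached data can pick them up directly.
// This only illustrates the hybrid scheduling flow, not the real Twister4Azure code.
public class HybridSchedulingSketch {
    static final BlockingQueue<String> schedulingQueue = new LinkedBlockingQueue<>();
    static final Map<String, Boolean> bulletinBoard = new ConcurrentHashMap<>(); // taskId -> taken?
    static final Set<String> localCache = ConcurrentHashMap.newKeySet();         // data cached on this worker

    static void workerLoop() throws InterruptedException {
        while (true) {
            // 1. Prefer tasks whose input data this worker already cached (bulletin board).
            for (String cachedTask : bulletinBoard.keySet()) {
                if (localCache.contains(cachedTask) && bulletinBoard.replace(cachedTask, false, true)) {
                    execute(cachedTask);
                }
            }
            // 2. Otherwise fall back to the global queue (left-over tasks that
            //    did not get scheduled through the bulletin board).
            String taskId = schedulingQueue.poll(1, TimeUnit.SECONDS);
            if (taskId != null) {
                execute(taskId);
                localCache.add(taskId);   // cache input for later iterations
            }
        }
    }

    static void execute(String taskId) {
        System.out.println("running map task " + taskId);
    }

    public static void main(String[] args) throws Exception {
        bulletinBoard.put("task-0", false);        // new-iteration task, data cached locally
        localCache.add("task-0");
        schedulingQueue.add("task-1");             // left-over task via the global queue
        Thread worker = new Thread(() -> { try { workerLoop(); } catch (InterruptedException ignored) {} });
        worker.setDaemon(true);
        worker.start();
        Thread.sleep(3000);                        // let the worker drain the demo tasks
    }
}

In the actual runtime these roles map onto the cloud services listed above, which is what makes the scheduling decentralized and tolerant of worker failures.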
[Performance charts: parallel efficiency (50%-100%) versus Num. of Cores * Num. of Files for Cap3 Sequence Assembly, BLAST Sequence Search, and Smith-Waterman Sequence Alignment, comparing Twister4Azure, Amazon EMR, and Apache Hadoop; also Task Execution Time Histogram, Number of Executing Map Task Histograms, Strong Scaling with 128M Data Points, Weak Scaling, Azure Instance Type Study, and Data Size Scaling.]
This demo provides real-time visualization of the multidimensional scaling (MDS) calculation. We use Twister to do the parallel calculation inside the cluster, and PlotViz to show the intermediate results on the user's client computer.
The process of computation and monitoring is
automated by the program.
MDS projection of 100,000 protein sequences showing a few experimentally
identified clusters in preliminary work with Seattle Children’s Research Institute
Twister-MDS demo architecture:
I. The Client Node sends a message to start the job on the Master Node (Twister Driver running Twister-MDS).
II. Intermediate results are sent back through an ActiveMQ Broker to the MDS Monitor and PlotViz on the client.
Inside the cluster, the Twister Driver on the Master Node drives the Twister Daemons on the Worker Nodes through the Pub/Sub Broker Network; each daemon's worker pool runs the calculateBC and calculateStress map/reduce tasks.
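A sketch of how the driver side might publish intermediate results to the broker for the client-side monitor. The JMS calls are standard ActiveMQ usage, but the broker URL, topic name, and plain-text message format are assumptions for illustration; the actual Twister-MDS monitor protocol is not reproduced here.

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

// Sketch of publishing intermediate MDS results through an ActiveMQ broker so
// that a client-side monitor (e.g., PlotViz) can render them as the iterations
// proceed. Topic name and message format are assumptions for illustration.
public class MdsResultPublisher {
    private final Session session;
    private final MessageProducer producer;

    public MdsResultPublisher(String brokerUrl, String topicName) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory(brokerUrl);
        Connection connection = factory.createConnection();
        connection.start();
        session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        producer = session.createProducer(session.createTopic(topicName));
    }

    /** Publish the current 3-D coordinates after an iteration. */
    public void publish(int iteration, double[][] coordinates) throws JMSException {
        StringBuilder body = new StringBuilder("iteration=" + iteration + "\n");
        for (double[] p : coordinates)
            body.append(p[0]).append(' ').append(p[1]).append(' ').append(p[2]).append('\n');
        producer.send(session.createTextMessage(body.toString()));
    }
}

The MDS Monitor on the client node would subscribe to the same topic and hand each decoded coordinate set to PlotViz.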
MDS Output Monitoring Interface
Map-Collective Model
[Diagram: iterations of map and reduce between an Initial Collective Step and a Final Collective Step, each carried out over a Network of Brokers and coordinated with the User Program.]
[Diagram: broker topologies, showing Twister Daemon Nodes, ActiveMQ Broker Nodes, and the Twister Driver Node linked by broker-driver, broker-daemon, and broker-broker connections; one configuration uses 7 brokers and 32 computing nodes in total, another uses 5 brokers and 4 computing nodes in total.]
Twister-MDS Execution Time
[Chart: total execution time (seconds) for 100 iterations on 40 nodes under different input data sizes (38400, 51200, 76800, and 102400 data points), comparing the original execution time (1 broker only) with the current execution time (7 brokers, the best broker number).]
[Chart: broadcasting time (seconds) for 400M, 600M, and 800M of data, comparing Method B and Method C. In Method C, centroids are split into 160 blocks and sent through 40 brokers in 4 rounds.]
Twister New Architecture
[Diagram: the Twister Driver on the Master Node and Twister Daemons on the Worker Nodes, each with a Broker. Configure Mapper and Add to MemCache feed a map broadcasting chain to the cacheable tasks (map/Map, merge); reduce/Reduce results return through a reduce collection chain.]
Chain/Ring Broadcasting
Driver sender:
• send broadcasting data
• get acknowledgement
• send next broadcasting data
• ...
Daemon sender:
• receive data from the last daemon (or driver)
• cache data to daemon
• send data to next daemon (waits for ACK)
• send acknowledgement to the last daemon
(A simplified sketch of this flow follows the protocol diagram below.)
Chain Broadcasting Protocol
[Sequence diagram: the Driver and Daemons 0, 1, 2. Each data block travels send -> receive -> handle data down the chain, and acknowledgements flow back up; the Driver sends the next block after it gets the ack. The last daemon knows it is the end of the daemon chain, and the final ack marks the end of the cache block.]
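Here is a minimal in-process simulation of that chain flow: the driver pushes blocks to daemon 0, each daemon caches a block and forwards it, and a daemon only acknowledges upstream once its downstream neighbour has acknowledged. It is deliberately simplified (no sockets, and blocks are not pipelined through the chain as in the real implementation); the class names are illustrative.

import java.util.*;

// Minimal in-process simulation of chain broadcasting: the driver pushes data
// blocks to daemon 0; each daemon caches a block, forwards it down the chain,
// and only acknowledges upstream once its downstream neighbour has acknowledged.
public class ChainBroadcastSketch {
    static class Daemon {
        final int id;
        Daemon next;                                   // null at the end of the daemon chain
        final List<byte[]> cache = new ArrayList<>();  // cached broadcast blocks

        Daemon(int id) { this.id = id; }

        /** Handle one block and return only after the rest of the chain has it (the "ack"). */
        void receive(byte[] block) {
            cache.add(block);                          // cache data to daemon
            if (next != null) next.receive(block);     // send to next daemon, wait for its ack
            System.out.println("daemon " + id + " acked block of " + block.length + " bytes");
        }
    }

    public static void main(String[] args) {
        // Build a chain of 4 daemons: 0 -> 1 -> 2 -> 3.
        Daemon head = new Daemon(0);
        Daemon cur = head;
        for (int i = 1; i < 4; i++) { cur.next = new Daemon(i); cur = cur.next; }

        // Driver sender: send a block, wait for the chain's ack, then send the next one.
        byte[][] blocks = {new byte[1024], new byte[2048], new byte[512]};
        for (byte[] block : blocks) {
            head.receive(block);                       // returns once every daemon cached the block
            System.out.println("driver got ack; sending next block");
        }
    }
}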
Broadcasting Time Comparison
[Chart: broadcasting time (seconds) over 50 executions on 80 nodes with 600 MB of data in 160 pieces, comparing Chain Broadcasting against All-to-All Broadcasting with 40 brokers.]
Applications & Different Interconnection Patterns
Map Only (Input, map, Output):
CAP3 Analysis; Document conversion (PDF -> HTML); Brute force searches in cryptography; Parametric sweeps
Examples: CAP3 Gene Assembly; PolarGrid Matlab data analysis
Classic MapReduce (Input, map, reduce):
High Energy Physics (HEP) Histograms; SWG gene alignment; Distributed search; Distributed sorting; Information retrieval
Examples: Information Retrieval; HEP Data Analysis; Calculation of Pairwise Distances for ALU Sequences
Iterative Reductions, Twister (Input, map, reduce, with iterations):
Expectation maximization algorithms; Clustering; Linear Algebra
Examples: Kmeans; Deterministic Annealing Clustering; Multidimensional Scaling (MDS)
Loosely Synchronous (Pij, MPI-style):
Many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions
Examples: Solving Differential Equations; particle dynamics with short-range forces
The domain of MapReduce and iterative extensions covers the first three patterns; MPI covers the last.
Scheduling vs. Computation of Dryad in a
Heterogeneous Environment
Runtime Issues
Development of a library of Collectives to use at the Reduce phase
Broadcast and Gather needed by current applications (a speculative interface sketch follows this list)
Discover other important ones
Implement efficiently on each platform – especially Azure
Better software message routing with broker networks using
asynchronous I/O with communication fault tolerance
Support nearby location of data and computing using data
parallel file systems
Clearer application fault tolerance model based on implicit
synchronization points at iteration end points
Later: Investigate GPU support
Later: run time for data parallel languages like Sawzall, Pig
Latin, LINQ
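To make the Collectives item above concrete, here is a speculative sketch of what a small reduce-phase collectives interface might look like; the names and signatures are assumptions, not an existing API.

import java.util.List;

// Speculative interface sketch for a reduce-phase collectives library
// (Broadcast and Gather, as named above). Names and signatures are illustrative.
public interface Collectives<T> {
    /** Deliver one value from the root task to every worker. */
    void broadcast(T value, int rootTaskId);

    /** Collect one value from every worker at the root task; non-root callers receive null. */
    List<T> gather(T localValue, int rootTaskId);
}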
Convergence is Happening
Data Intensive Paradigms: data-intensive applications involve three basic activities: capture, curation, and analysis (visualization).
Clouds: cloud infrastructure and runtime.
Multicore: parallel threading and processes.
FutureGrid: a Grid Testbed
• IU Cray operational; IU IBM (iDataPlex) completed stability test May 6
• UCSD IBM operational; UF IBM stability test completes ~ May 12
• Network, NID and PU HTC system operational
• UC IBM stability test completes ~ May 27; TACC Dell awaiting delivery of components
NID: Network Impairment Device
[Diagram: FutureGrid network, with private FG network and public segments.]
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrate the concept of Science on Clouds on FutureGrid
Dynamic Cluster Architecture: SW-G using Hadoop on Linux bare-system, SW-G using Hadoop on Linux on Xen, and SW-G using DryadLINQ on Windows Server 2008 bare-system, all provisioned by the XCAT Infrastructure on 32 iDataplex bare-metal nodes.
Monitoring & Control Infrastructure: a Monitoring Interface connected through the Pub/Sub Broker Network to the virtual/physical clusters, with a Summarizer and a Switcher running on the XCAT Infrastructure and the iDataplex bare-metal nodes.
• Switchable clusters on the same hardware (~5 minutes between different OS such as Linux+Xen to Windows+HPCS)
• Support for virtual clusters
• SW-G: Smith Waterman Gotoh Dissimilarity Computation as a pleasingly parallel problem suitable for MapReduce style applications
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrate the concept of Science on Clouds using a FutureGrid cluster
• Top: 3 clusters are switching applications on a fixed environment. Takes approximately 30 seconds.
• Bottom: Cluster is switching between environments: Linux; Linux + Xen; Windows + HPCS. Takes approximately 7 minutes.
• SALSAHPC Demo at SC09. This demonstrates the concept of Science on Clouds using a FutureGrid iDataPlex.
SALSA HPC Group
Indiana University
http://salsahpc.indiana.edu