SALSA HPC Group
http://salsahpc.indiana.edu
School of Informatics and Computing
Indiana University
Twister
Bingjing Zhang
Funded by Microsoft Foundation Grant, Indiana
University's Faculty Research Support Program
and NSF OCI-1032677 Grant
Twister4Azure
Thilina Gunarathne
Funded by Microsoft Azure Grant
High-Performance
Visualization Algorithms
For Data-Intensive Analysis
Seung-Hee Bae and Jong Youl Choi
Funded by NIH Grant 1RC2HG005806-01
DryadLINQ CTP Evaluation
Hui Li, Yang Ruan, and Yuduo Zhou
Funded by Microsoft Foundation Grant
Million Sequence Challenge
Saliya Ekanayake, Adam Hughes, and Yang Ruan
Funded by NIH Grant 1RC2HG005806-01
Cyberinfrastructure for
Remote Sensing of Ice Sheets
Jerome Mitchell
Funded by NSF Grant OCI-0636361
Applications
Kernels, Genomics, Proteomics, Information Retrieval, Polar Science,
Scientific Simulation Data Analysis and Management,
Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topographic Mapping

Architecture stack (top to bottom):
• Security, Provenance, Portal
• Services and Workflow
• Programming Model: High Level Language
• Runtime: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
• Storage: Distributed File Systems, Object Store, Data Parallel File System
• Infrastructure: Windows Server HPC and Linux HPC bare-systems; Amazon Cloud, Azure Cloud, and Grid Appliance virtualization
• Hardware: CPU Nodes, GPU Nodes
Domain of MapReduce and Iterative Extensions:
(a) Map Only (input → map → output): CAP3 analysis, Smith-Waterman distances, parametric sweeps, PolarGrid MATLAB data analysis
(b) Classic MapReduce (input → map → reduce → output): High Energy Physics (HEP) histograms, distributed search, distributed sorting, information retrieval
(c) Iterative MapReduce (iterations over map and reduce): expectation maximization clustering (e.g. K-means), linear algebra, multidimensional scaling, PageRank
(d) Loosely Synchronous (MPI): many MPI scientific applications such as solving differential equations and particle dynamics
Parallel Visualization Algorithms (PlotViz)

                      GTM                                    MDS (SMACOF)
Purpose               • Non-linear dimension reduction
                      • Find an optimal configuration in a lower dimension
                      • Iterative optimization method
Input                 Vector-based data                      Non-vector (pairwise similarity matrix)
Objective Function    Maximize log-likelihood                Minimize STRESS or SSTRESS
Complexity            O(KN) (K << N)                         O(N²)
Optimization Method   EM                                     Iterative Majorization (EM-like)
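For reference, the MDS objectives named in the table have the standard forms below, written in the usual notation: δ_ij are the input dissimilarities, d_ij(X) the distances in the low-dimensional configuration X, and w_ij optional weights (the weighting actually used in the SALSA codes is not specified here); GTM instead maximizes the log-likelihood of its constrained Gaussian-mixture model.

    \mathrm{STRESS}(X)  = \sum_{i<j} w_{ij}\,\bigl(d_{ij}(X) - \delta_{ij}\bigr)^{2}
    \mathrm{SSTRESS}(X) = \sum_{i<j} w_{ij}\,\bigl(d_{ij}^{2}(X) - \delta_{ij}^{2}\bigr)^{2}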
Twister programming model:
• Distinction between static and variable data
• Configurable long-running (cacheable) map/reduce tasks
• Pub/sub messaging based communication/data transfers
• Broker network for facilitating communication
• Communications/data transfers go via the pub/sub broker network and direct TCP; map tasks may send <Key,Value> pairs directly to reduce tasks
• The main program may contain many MapReduce invocations or iterative MapReduce invocations

The main program runs in the driver's process space; the worker nodes hold the cacheable map/reduce tasks and their local disks:

    configureMaps(..)
    configureReduce(..)
    while(condition){
        runMapReduce(..)    // Map() -> Reduce() -> Combine() operation each iteration
        updateCondition()
    } //end while
    close()
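To make the loop concrete, here is a minimal Java sketch of an iterative K-means driver with the same shape; the IterativeJob interface and its methods are hypothetical stand-ins that mirror the pseudocode above, not the actual Twister API.

    // Hypothetical sketch of an iterative MapReduce driver in the Twister style.
    // IterativeJob and its methods are illustrative stand-ins, not the real Twister API.
    import java.util.List;

    interface IterativeJob {
        void configureMaps(List<String> partitionFiles); // cache the static input data on workers
        void configureReduce(int numReduceTasks);
        double[][] runMapReduce(double[][] centroids);   // one Map -> Reduce -> Combine round
        void close();
    }

    public class KMeansDriver {
        public static double[][] run(IterativeJob job, List<String> partitions,
                                     double[][] centroids, double tolerance) {
            job.configureMaps(partitions);     // static data: the cached data points
            job.configureReduce(1);
            boolean converged = false;
            while (!converged) {               // the while(condition) loop of the model
                double[][] updated = job.runMapReduce(centroids);     // centroids = variable data
                converged = maxShift(centroids, updated) < tolerance; // updateCondition()
                centroids = updated;
            }
            job.close();
            return centroids;
        }

        private static double maxShift(double[][] a, double[][] b) {
            double max = 0;
            for (int i = 0; i < a.length; i++)
                for (int d = 0; d < a[i].length; d++)
                    max = Math.max(max, Math.abs(a[i][d] - b[i][d]));
            return max;
        }
    }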
[Twister architecture: the master node runs the main program and the Twister driver, which communicates with the worker nodes through a pub/sub broker network (one broker serves several Twister daemons). Each worker node runs a Twister daemon hosting a worker pool of cacheable map/reduce tasks backed by local disk. Scripts perform data distribution, data collection, and partition file creation.]
[Twister-MDS demo: (I) the client node sends a message through the ActiveMQ broker to start the job on the master node, where the Twister driver runs Twister-MDS; (II) intermediate results are sent back to the MDS monitor and rendered with PlotViz on the client node.]
Three methods for broadcasting data from the Twister driver to the daemon nodes:
• Method A: hierarchical sending
• Method B: improved hierarchical sending
• Method C: all-to-all sending

[Figure: Method A topology, 8 brokers and 32 daemon nodes in total; legend: Twister daemon node, ActiveMQ broker node, Twister driver node, and the broker-daemon, broker-broker, and broker-driver connections.]
Method A (hierarchical sending):
• Time used for the first-level sending: (N/b + b − 1)α
• Time used for the second-level sending: (N/b)α (sending in parallel)
• N is the number of Twister daemon nodes, b is the number of brokers, and α is the transmission time for each sending
Taking the derivative with respect to b gives b = √(2N); that is, when b = √(2N) the total broadcasting time is minimal: t = (2√(2N) − 1)α.
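A worked minimization consistent with the result above, assuming the total broadcasting time is simply the sum of the two levels:

    t(b) = \Bigl(\tfrac{N}{b} + b - 1\Bigr)\alpha + \tfrac{N}{b}\,\alpha
         = \Bigl(\tfrac{2N}{b} + b - 1\Bigr)\alpha,
    \qquad
    \frac{dt}{db} = \Bigl(1 - \tfrac{2N}{b^{2}}\Bigr)\alpha = 0
    \;\Longrightarrow\; b = \sqrt{2N},
    \qquad
    t_{\min} = \bigl(2\sqrt{2N} - 1\bigr)\alpha.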
[Figure: Method B topology, 7 brokers and 32 daemon (computing) nodes in total; legend: Twister daemon node, ActiveMQ broker node, Twister driver node, and the broker-daemon, broker-broker, and broker-driver connections.]
Method B (improved hierarchical sending):
t = (b − 1)α + (N / (b − 1))α
t reaches its minimum when b = √N + 1, giving t = 2√N·α.
• N is the number of Twister daemon nodes, b is the number of brokers, and α is the transmission time for each sending
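The corresponding minimization for Method B:

    \frac{dt}{db} = \Bigl(1 - \tfrac{N}{(b-1)^{2}}\Bigr)\alpha = 0
    \;\Longrightarrow\; b = \sqrt{N} + 1,
    \qquad
    t_{\min} = \Bigl(\sqrt{N} + \tfrac{N}{\sqrt{N}}\Bigr)\alpha = 2\sqrt{N}\,\alpha.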
Twister-MDS execution time vs. number of brokers
(100 iterations, 200 broadcastings, 40 nodes, 51200 data points; execution time in seconds)

Brokers    1      3      5      7      9      11     13     15     17     19     21     23     25     27     29     31     33     35     37     39
Method A   358.6  330.6  314.5  309.5  310    307.9  313.8  316.6  318.1  319.2  321.9  324.3  329.1  335.1  338.3  343.3  345.3  350.3  350.7  359.5
Method B   360.9  318.9  307.3  303.3  304.7  306.3  311.4  311.4  316.8  321.1  321.6  325.9  329.6  334.4  337.6  338.3  346.7  353    353.2  359.9
[Figure: Method C (all-to-all sending), illustrated with 5 brokers and 4 computing nodes in total. The centroid set on the Twister driver node is split into blocks (Centroid 1 ... Centroid N); the blocks are published through different ActiveMQ brokers so that every Twister daemon node, and hence every Twister map task, receives the complete set of centroid blocks, and the Twister reduce tasks return the updated centroid blocks the same way.]
(In Method C, the centroids are split into 160 blocks and sent through 40 brokers in 4 rounds.)
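A minimal sketch of that split-and-scatter bookkeeping (the round-robin assignment below is illustrative, not the actual Twister-MDS or Twister K-means code):

    // Illustrative block-to-broker assignment for all-to-all style broadcasting:
    // split the data into numBlocks blocks and send them through numBrokers brokers
    // in ceil(numBlocks / numBrokers) rounds.
    public class BlockScatter {
        /** Returns assignment[block] = {brokerIndex, roundIndex}. */
        public static int[][] assign(int numBlocks, int numBrokers) {
            int[][] assignment = new int[numBlocks][2];
            for (int block = 0; block < numBlocks; block++) {
                assignment[block][0] = block % numBrokers;  // broker that carries this block
                assignment[block][1] = block / numBrokers;  // round in which it is sent
            }
            return assignment;
        }

        public static void main(String[] args) {
            // 160 blocks over 40 brokers -> 4 rounds, matching the configuration above.
            int[][] a = assign(160, 40);
            System.out.println("rounds = " + (a[a.length - 1][1] + 1));  // prints: rounds = 4
        }
    }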
[Chart: broadcasting time (seconds) for Method B and Method C at data sizes of 400M, 600M, and 800M; the plotted values are 13.07, 18.79, 24.50, 46.19, 70.56, and 93.14 seconds.]
• Distributed, highly scalable, and highly available cloud services as the building blocks
• Utilizes eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes
• Decentralized architecture with global-queue-based dynamic task scheduling
• Minimal management and maintenance overhead
• Supports dynamically scaling the compute resources up and down
• MapReduce fault tolerance
Azure Queues are used for scheduling, Tables store metadata and monitoring data, and Blobs hold input, output, and intermediate data.
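A minimal sketch of the decentralized, queue-driven worker loop described above; the TaskQueue interface is a hypothetical stand-in for a cloud queue service (such as Azure Queues), not the Twister4Azure code or a real SDK API.

    // Hypothetical worker loop for decentralized, queue-based dynamic task scheduling.
    // TaskQueue stands in for a cloud queue service; it is not a real SDK API.
    interface TaskQueue {
        String dequeue();          // fetch a task descriptor, or null if no work is available
        void delete(String task);  // remove a task once it has completed successfully
    }

    public class QueueWorker implements Runnable {
        private final TaskQueue queue;

        public QueueWorker(TaskQueue queue) { this.queue = queue; }

        @Override
        public void run() {
            while (true) {
                String task = queue.dequeue();   // any idle worker may pick up any task
                if (task == null) {              // empty queue: back off briefly and retry
                    try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
                    continue;
                }
                try {
                    execute(task);
                    queue.delete(task);          // delete only after success; with a
                                                 // visibility-timeout queue, an undeleted
                                                 // message reappears for another worker,
                                                 // giving MapReduce-style fault tolerance
                } catch (Exception e) {
                    // leave the task on the queue so it can be retried
                }
            }
        }

        private void execute(String task) {
            // run the map or reduce computation described by the task descriptor
        }
    }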
[Figure: Twister4Azure iterative MapReduce flow. When a new job starts, Map/Combine tasks are placed on the global scheduling queue; worker roles host the Map workers (Map 1 ... Map n) and Reduce workers (Red 1 ... Red n), which share an in-memory data cache and a map-task metadata cache. After Reduce and Merge, the "add iteration?" decision either starts a new iteration, scheduled hybridly through the job bulletin board, in-memory cache, and execution history (leftover tasks that did not get scheduled through the bulletin board fall back to the queue), or finishes the job.]
[Performance charts: task execution time histogram; number of executing map tasks histogram; strong scaling with 128M data points; weak scaling; Azure instance type study; data size scaling.]

[Charts: parallel efficiency (50%-100%) of Cap3 sequence assembly, BLAST sequence search, and Smith-Waterman sequence alignment on Twister4Azure, Amazon EMR, and Apache Hadoop, plotted against the number of cores × number of files.]
[Figure: the ISGA genomic analysis web server builds on Ergatis and the TIGR Workflow system (XML interfaces), dispatching work to SGE clusters, Condor clusters, clouds, and other DCEs.]
Chris Hemmerich, Adam Hughes, Yang Ruan, Aaron Buechlein, Judy Qiu, and Geoffrey Fox. "Map-Reduce Expansion of the ISGA Genomic Analysis Web Server." The 2nd IEEE International Conference on Cloud Computing Technology and Science, 2010.
[Pipeline: gene sequences → pairwise alignment & distance calculation → distance matrix → pairwise clustering (cluster indices) and multidimensional scaling (coordinates) → visualization as a 3D plot; the distance calculation, clustering, and MDS stages are each O(N×N).]
[Pipeline with interpolation: from the gene sequences (N = 1 million), select a reference sequence set (M = 100K), leaving an N−M sequence set (900K). The reference set goes through pairwise alignment & distance calculation to an O(N²) distance matrix and multidimensional scaling (MDS), producing reference coordinates (x, y, z). Interpolative MDS with pairwise distance calculation then maps the N−M out-of-sample sequences to (x, y, z) coordinates, and both sets are visualized as a 3D plot.]
Input data size: 680K; sample data size: 100K; out-of-sample data size: 580K.
Test environment: PolarGrid with 100 nodes, 800 workers.
[Plots: the 100K sample data and the full 680K data after interpolation.]
GTM / GTM-Interpolation
• Finding K clusters for N data points
• The relationship is a bipartite graph (bi-graph), represented by a K-by-N matrix (K << N)
• Decomposition onto a P-by-Q compute grid reduces the memory requirement by 1/PQ (see the decomposition sketch below)
• Built on Parallel HDF5, ScaLAPACK, MPI / MPI-IO, and a parallel file system, running on Cray / Linux / Windows clusters
[Figure: K latent points and N data points connected as a bi-graph; the K-by-N matrix is block-decomposed across the P-by-Q compute grid.]
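A minimal sketch of the 1/PQ idea under simple assumptions: each of the P×Q processes owns one block of the K-by-N matrix (the grid-to-block arithmetic below is illustrative, not the SALSA implementation).

    // Illustrative block decomposition of a K-by-N matrix over a P-by-Q process grid.
    // Each process stores only its (K/P) x (N/Q) block, i.e. roughly 1/(P*Q) of the matrix.
    public class BlockDecomposition {
        /** Returns {rowStart, rowEnd, colStart, colEnd} (end exclusive) for process `rank`. */
        public static int[] localBlock(int K, int N, int P, int Q, int rank) {
            int p = rank / Q;              // row position in the process grid
            int q = rank % Q;              // column position in the process grid
            int rows = (K + P - 1) / P;    // ceiling division so all rows are covered
            int cols = (N + Q - 1) / Q;
            return new int[] {
                Math.min(p * rows, K), Math.min((p + 1) * rows, K),
                Math.min(q * cols, N), Math.min((q + 1) * cols, N)
            };
        }

        public static void main(String[] args) {
            // e.g. K = 8192 latent points, N = 100000 data points on a 4 x 8 grid (32 processes)
            int[] b = localBlock(8192, 100000, 4, 8, 0);
            System.out.printf("rows [%d,%d), cols [%d,%d)%n", b[0], b[1], b[2], b[3]);
        }
    }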
Parallel MDS
• O(N²) memory and computation required (100K data points → 480 GB memory)
• Balanced decomposition of the N×N matrices over a P-by-Q grid reduces the memory and computing requirement by 1/PQ
• Communicates via MPI primitives

MDS Interpolation
• Finds an approximate mapping position with respect to the k nearest neighbors' prior mapping (sketched below)
• Per point it requires O(M) memory and O(k) computation
• Pleasingly parallel
• Maps 2M points in 1,450 sec, vs. 100K points in 27,000 sec for the full MDS; roughly 7,500 times faster than an estimate of running the full MDS on the 2M points
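A minimal sketch of the interpolation step under simplified assumptions: the new point is placed relative to its k nearest neighbors' existing 3D coordinates by plain gradient descent on its per-point STRESS (the SALSA code uses iterative majorization and runs in parallel, which is not shown here).

    // Illustrative MDS out-of-sample interpolation: position one new point so that its
    // distances to the k anchor points approximate the original dissimilarities.
    import java.util.Arrays;

    public class MdsInterpolation {
        /**
         * @param neighborCoords k x 3 coordinates of the k nearest in-sample neighbors
         * @param targetDist     k original dissimilarities from the new point to those neighbors
         * @return 3D coordinates for the new point
         */
        static double[] interpolate(double[][] neighborCoords, double[] targetDist) {
            int k = neighborCoords.length;
            double[] x = new double[3];
            for (double[] c : neighborCoords)                 // start from the neighbors' centroid
                for (int d = 0; d < 3; d++) x[d] += c[d] / k;
            double step = 0.05;
            for (int iter = 0; iter < 200; iter++) {          // gradient descent on
                double[] grad = new double[3];                // sum_i (||x - y_i|| - delta_i)^2
                for (int i = 0; i < k; i++) {
                    double dist = 0;
                    for (int d = 0; d < 3; d++) {
                        double diff = x[d] - neighborCoords[i][d];
                        dist += diff * diff;
                    }
                    dist = Math.sqrt(dist) + 1e-12;
                    double coeff = 2 * (dist - targetDist[i]) / dist;
                    for (int d = 0; d < 3; d++)
                        grad[d] += coeff * (x[d] - neighborCoords[i][d]);
                }
                for (int d = 0; d < 3; d++) x[d] -= step * grad[d];
            }
            return x;
        }

        public static void main(String[] args) {
            double[][] anchors = { {0, 0, 0}, {1, 0, 0}, {0, 1, 0} };
            double[] dists = { 0.7, 0.7, 0.7 };
            System.out.println(Arrays.toString(interpolate(anchors, dists)));
        }
    }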
[Figure: of the total N data points, the n in-sample points are used for training (MPI / Twister), producing the trained map; the remaining N−n out-of-sample points are interpolated onto that map with MapReduce across P processes.]

Full data processing by GTM or MDS is computing- and memory-intensive, so a two-step procedure is used:
• Training: train with M samples out of the N data points
• Interpolation: the remaining N−M out-of-sample points are approximated without training
PubChem data with CTD, visualized by MDS (left) and GTM (right): about 930,000 chemical compounds are visualized as points in 3D space, annotated by the related genes in the Comparative Toxicogenomics Database (CTD).

Chemical compounds reported in the literature, visualized by MDS (left) and GTM (right): 234,000 chemical compounds that may be related to a set of 5 genes of interest (ABCB1, CHRNB2, DRD2, ESR1, and F2), based on a dataset collected from major journal articles and also stored in the Chem2Bio2RDF system.
ALU (35,339) and metagenomics (30,000) sequence datasets.
100K training and 2M interpolation of PubChem, visualized by interpolation MDS (left) and GTM (right).
Top: 3D visualization of crossover flight paths.
Bottom left and right: the Web Map Service (WMS) protocol enables users to access the original data set from MATLAB and GIS software in order to display a single frame for a particular flight path.
Investigate the applicability and performance of the DryadLINQ CTP for developing scientific applications.
Goals:
• Evaluate key features and interfaces
• Probe parallel programming models
Three applications:
• SW-G bioinformatics application
• Matrix multiplication
• PageRank
Parallel algorithms for matrix multiplication:
• Row partition (a multi-threaded sketch follows below)
• Row-column partition
• Two-dimensional block decomposition (Fox-Hey algorithm)
Multi-core technologies: PLINQ, TPL, and thread pool, combined in a hybrid parallel model that ports the multi-core code into Dryad tasks to improve performance, together with a timing model for matrix multiplication.

[Charts: parallel efficiency vs. input data size (2400 to 19200) for RowPartition, RowColumnPartition, and Fox-Hey; speedup (up to ~200) of the Sequential, TPL, Thread, and PLINQ variants for each algorithm.]
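To make the row-partition scheme concrete, here is a minimal multi-threaded sketch in Java; the study itself used DryadLINQ/.NET with PLINQ, TPL, and thread pools, so this is only an illustration of the partitioning idea.

    // Row-partition parallel matrix multiplication: each task computes a contiguous
    // band of rows of C = A * B. Illustrative only; not the DryadLINQ implementation.
    import java.util.concurrent.*;

    public class RowPartitionMM {
        public static double[][] multiply(double[][] A, double[][] B, int tasks)
                throws InterruptedException {
            int n = A.length, m = B[0].length, k = B.length;
            double[][] C = new double[n][m];
            ExecutorService pool = Executors.newFixedThreadPool(tasks);
            int band = (n + tasks - 1) / tasks;               // rows per task (row partition)
            for (int t = 0; t < tasks; t++) {
                final int start = t * band, end = Math.min(n, start + band);
                pool.execute(() -> {
                    for (int i = start; i < end; i++)         // this task's row band
                        for (int p = 0; p < k; p++) {
                            double a = A[i][p];
                            for (int j = 0; j < m; j++) C[i][j] += a * B[p][j];
                        }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            return C;
        }
    }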
The workload of SW-G, a pleasingly parallel application, is heterogeneous due to differences among the input gene sequences, so workload balancing becomes an issue.
Two approaches alleviate it (see the sketch below):
• Randomly distributing the input data
• Partitioning the job into finer-granularity tasks
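A minimal sketch of both ideas, assuming the input is simply a list of sequence identifiers; the partition count and random seed are illustrative.

    // Shuffle the inputs so long and short sequences mix evenly, then cut the job into
    // more, smaller partitions than there are workers so stragglers even out. Illustrative only.
    import java.util.*;

    public class SwgPartitioner {
        public static List<List<String>> partition(List<String> sequences,
                                                   int numPartitions, long seed) {
            List<String> shuffled = new ArrayList<>(sequences);
            Collections.shuffle(shuffled, new Random(seed));   // randomized distribution
            List<List<String>> parts = new ArrayList<>();
            int size = (shuffled.size() + numPartitions - 1) / numPartitions;
            for (int start = 0; start < shuffled.size(); start += size)
                parts.add(new ArrayList<>(
                        shuffled.subList(start, Math.min(shuffled.size(), start + size))));
            return parts;   // e.g. 248 fine-grained partitions for far fewer workers
        }
    }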
[Charts: left, execution time (seconds) vs. the standard deviation of the input sequence lengths (0 to 250) for skewed vs. randomized distributions; right, execution time vs. number of partitions (31, 62, 124, 186, 248) for std. dev. = 50, 100, and 250.]
SALSA HPC Group
Indiana University
http://salsahpc.indiana.edu