SALSA HPC Group
School of Informatics and Computing, Indiana University
http://salsahpc.indiana.edu

Projects and Contributors
• Twister (Bingjing Zhang). Funded by a Microsoft Foundation Grant, Indiana University's Faculty Research Support Program, and NSF Grant OCI-1032677.
• Twister4Azure (Thilina Gunarathne). Funded by a Microsoft Azure Grant.
• High-Performance Visualization Algorithms for Data-Intensive Analysis (Seung-Hee Bae and Jong Youl Choi). Funded by NIH Grant 1RC2HG005806-01.
• DryadLINQ CTP Evaluation (Hui Li, Yang Ruan, and Yuduo Zhou). Funded by a Microsoft Foundation Grant.
• Million Sequence Challenge (Saliya Ekanayake, Adam Hughes, and Yang Ruan). Funded by NIH Grant 1RC2HG005806-01.
• Cyberinfrastructure for Remote Sensing of Ice Sheets (Jerome Mitchell). Funded by NSF Grant OCI-0636361.

Software and Hardware Stack
• Applications: kernels for genomics, proteomics, information retrieval, and polar science; scientific simulation data analysis and management; dissimilarity computation, clustering, multidimensional scaling, and generative topographic mapping; security, provenance, and portal; services and workflow.
• Programming model: high-level language.
• Runtime: cross-platform iterative MapReduce (collectives, fault tolerance, scheduling).
• Storage: distributed file systems, object store, data-parallel file system.
• Infrastructure: Windows Server and Linux HPC bare-systems, virtualization, Amazon cloud, Azure cloud, and Grid Appliance.
• Hardware: CPU nodes and GPU nodes.

Application Classes
(a) Map Only: CAP3 analysis, Smith-Waterman distances, parametric sweeps, PolarGrid MATLAB data analysis.
(b) Classic MapReduce: High Energy Physics (HEP) histograms, distributed search, distributed sorting, information retrieval.
(c) Iterative MapReduce: expectation-maximization clustering (e.g., K-means), linear algebra, multidimensional scaling, PageRank.
(d) Loosely Synchronous: many MPI scientific applications, such as solving differential equations and particle dynamics.
Classes (a)-(c) form the domain of MapReduce and its iterative extensions; class (d) belongs to MPI.

GTM vs. MDS (SMACOF)
Purpose of both: non-linear dimension reduction that finds an optimal configuration in a lower dimension through an iterative optimization method; results are displayed with the parallel visualization tool PlotViz.

                      GTM                         MDS (SMACOF)
Input                 Vector-based data           Non-vector (pairwise similarity matrix)
Objective function    Maximize log-likelihood     Minimize STRESS or SSTRESS
Complexity            O(KN) (K << N)              O(N²)
Optimization method   EM                          Iterative Majorization (EM-like)

Twister: Iterative MapReduce
• Distinction between static and variable data.
• Configurable, long-running (cacheable) map/reduce tasks.
• Pub/sub-messaging-based communication and data transfers; a broker network facilitates communication, and map tasks may also send <Key,Value> pairs directly to reduce tasks over TCP.
• A Combine() operation returns the reduce outputs to the main program at the end of each iteration.
• The main program may contain many MapReduce invocations or iterative MapReduce invocations.

Main program structure (the driver runs in the main program's process space; cacheable map/reduce tasks run in worker pools on the worker nodes and use local disks):
configureMaps(..)
configureReduce(..)
while(condition){
    runMapReduce(..)   // iterations of Map(), Reduce(), and Combine()
    updateCondition()
} //end while
close()

Architecture: the Twister Driver (main program) on the master node communicates with the Twister Daemons on the worker nodes through the pub/sub broker network; one broker serves several Twister daemons. Each worker node runs a worker pool of cacheable map/reduce tasks backed by local disk. Scripts perform data distribution, data collection, and partition file creation.
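The driver loop above can be exercised with a small, self-contained Java sketch. The LocalDriver below is a sequential stand-in whose method names mirror the poster's pseudocode (configureMaps, configureReduce, runMapReduce, close); it is not the actual Twister API, only an illustration of how the static data is configured once and the variable data (the centroids) is passed through each iteration.

// Minimal, self-contained sketch of the iterative-driver pattern shown above.
// LocalDriver is a sequential stand-in for the Twister runtime, not the real API.
public class IterativeDriverSketch {

    /** Plays the role of the Twister driver: caches static data, runs one iteration per call. */
    static class LocalDriver {
        private double[][] points;          // static data, "cached" across iterations

        void configureMaps(double[][] staticData) { this.points = staticData; }
        void configureReduce() { /* nothing to set up in this local stand-in */ }

        /** One "MapReduce" iteration: assign points to the nearest centroid, then average. */
        double[][] runMapReduce(double[][] centroids) {
            int k = centroids.length, dim = centroids[0].length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {                 // "map": nearest-centroid assignment
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < dim; j++) d += (p[j] - centroids[c][j]) * (p[j] - centroids[c][j]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                counts[best]++;
                for (int j = 0; j < dim; j++) sums[best][j] += p[j];
            }
            for (int c = 0; c < k; c++)                 // "reduce"/"combine": new centroids
                for (int j = 0; j < dim; j++)
                    sums[c][j] = counts[c] == 0 ? centroids[c][j] : sums[c][j] / counts[c];
            return sums;
        }

        void close() { points = null; }
    }

    static double change(double[][] a, double[][] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++) s += Math.abs(a[i][j] - b[i][j]);
        return s;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0.2, 0.1}, {5, 5}, {5.1, 4.9}, {9, 1}, {8.8, 1.2} };
        double[][] centroids = { {0, 0}, {1, 1}, {2, 2} };   // variable data, updated each iteration

        LocalDriver driver = new LocalDriver();
        driver.configureMaps(data);                          // static data configured once, then cached
        driver.configureReduce();

        double delta = Double.MAX_VALUE;
        while (delta > 1e-6) {                               // updateCondition()
            double[][] updated = driver.runMapReduce(centroids);
            delta = change(updated, centroids);
            centroids = updated;
        }
        driver.close();
        for (double[] c : centroids) System.out.printf("centroid: (%.2f, %.2f)%n", c[0], c[1]);
    }
}

In Twister itself, runMapReduce would broadcast the centroids to the cached map tasks via the broker network and the Combine() operation would return the updated centroids to the driver.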
Twister-MDS Demo (Twister Driver, ActiveMQ broker, MDS Monitor, PlotViz)
I. The client node sends a message to start the job.
II. The Twister driver on the master node sends intermediate results through the ActiveMQ broker to the MDS monitor, which displays them in PlotViz.

Broadcasting Methods
Three strategies for broadcasting data from the Twister driver node to the Twister daemon nodes through the ActiveMQ broker network were compared: Method A (hierarchical sending), Method B (improved hierarchical sending), and Method C (all-to-all sending). Throughout, N is the number of Twister daemon nodes, b is the number of brokers, and α is the transmission time for each sending. A numerical check of the two hierarchical models appears after this section.

Method A: hierarchical sending (example topology: 8 brokers and 32 daemon nodes, with broker-driver, broker-broker, and broker-daemon connections)
• Time used for the first-level sending: (b + N/b − 1)α.
• Time used for the second-level sending (performed in parallel): (N/b)α.
• Total broadcasting time: t = (b + 2N/b − 1)α. Setting the derivative with respect to b to zero gives b = √(2N); that is, the total broadcasting time is minimized when b = √(2N), where t = (2√(2N) − 1)α.

Method B: improved hierarchical sending (example topology: 7 brokers and 32 daemon nodes)
• t = (b − 1)α + (N/(b − 1))α, which reaches its minimum when b = √N + 1, where t = 2√N α.

[Chart: Twister-MDS execution time (seconds) vs. number of brokers (1-39) for Methods A and B, with 100 iterations, 200 broadcasts, 40 nodes, and 51,200 data points. Method A reaches its minimum of 307.9 s at 11 brokers and Method B its minimum of 303.3 s at 7 brokers; both climb back toward roughly 360 s as the broker count approaches 40.]

Method C: all-to-all sending (Twister-Kmeans; example topology: 5 brokers and 4 computing nodes)
Instead of routing the entire centroid set through a broker hierarchy, the Twister driver splits the centroids into blocks and sends each block through a different broker in parallel; every Twister map/reduce task then assembles the complete centroid set from all brokers. In the Kmeans experiment, the centroids were split into 160 blocks and sent through 40 brokers in 4 rounds.

[Chart: broadcasting time (seconds) at data sizes of 400M, 600M, and 800M. Method B takes 46.19, 70.56, and 93.14 s respectively, while Method C takes 13.07, 18.79, and 24.50 s.]
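As a sanity check on the two formulas above, the short Java program below (my own illustration, not code from the poster) sweeps the broker count for N = 32 daemon nodes and confirms that the discrete minima sit at the closed-form optima b ≈ √(2N) for Method A and b ≈ √N + 1 for Method B.

// Numerical check of the broadcast-time models reconstructed above.
// t_A(b) = (b + 2N/b - 1) * alpha        (Method A, hierarchical sending)
// t_B(b) = (b - 1 + N/(b - 1)) * alpha   (Method B, improved hierarchical sending)
// Illustration of the derivation only, not code from the Twister runtime.
public class BrokerCountModel {

    static double timeMethodA(int n, int b, double alpha) {
        return (b + 2.0 * n / b - 1) * alpha;
    }

    static double timeMethodB(int n, int b, double alpha) {
        return (b - 1 + (double) n / (b - 1)) * alpha;
    }

    public static void main(String[] args) {
        int n = 32;            // number of Twister daemon nodes
        double alpha = 1.0;    // transmission time per sending (arbitrary unit)

        int bestA = 2, bestB = 2;
        for (int b = 2; b <= n; b++) {
            if (timeMethodA(n, b, alpha) < timeMethodA(n, bestA, alpha)) bestA = b;
            if (timeMethodB(n, b, alpha) < timeMethodB(n, bestB, alpha)) bestB = b;
        }
        System.out.printf("Method A: best b = %d (sqrt(2N) = %.2f), t = %.2f (2*sqrt(2N)-1 = %.2f)%n",
                bestA, Math.sqrt(2.0 * n), timeMethodA(n, bestA, alpha), 2 * Math.sqrt(2.0 * n) - 1);
        System.out.printf("Method B: best b = %d (sqrt(N)+1 = %.2f), t = %.2f (2*sqrt(N) = %.2f)%n",
                bestB, Math.sqrt(n) + 1, timeMethodB(n, bestB, alpha), 2 * Math.sqrt(n));
    }
}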
Twister4Azure
• Distributed, highly scalable, and highly available cloud services as the building blocks.
• Utilizes eventually consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes.
• Decentralized architecture with global-queue-based dynamic task scheduling.
• Minimal management and maintenance overhead.
• Supports dynamically scaling the compute resources up and down.
• MapReduce fault tolerance.
Azure Queues are used for scheduling, Azure Tables store metadata and monitoring data, and Azure Blobs hold input, output, and intermediate data.

[Diagram: Twister4Azure job flow. A new job starts; map/combine tasks are picked up from the scheduling queue by the map workers in the worker role, followed by the reduce and merge steps. If another iteration is needed, the new iteration is hybrid-scheduled through a job bulletin board combined with the in-memory data cache, the map-task metadata cache, and the execution history; leftover tasks that did not get scheduled through the bulletin board fall back to the queue. Otherwise the job finishes.]

[Performance charts: task execution time histogram, number of executing map tasks histogram, strong scaling with 128M data points, weak scaling, Azure instance type study, and data size scaling; application benchmarks for BLAST sequence search, Smith-Waterman sequence alignment, and Cap3 sequence assembly, the latter showing parallel efficiency (50-100%, vs. number of cores × number of files) for Twister4Azure, Amazon EMR, and Apache Hadoop.]

ISGA Genomic Analysis Web Server
ISGA passes XML to Ergatis, which passes XML to the TIGR Workflow engine; jobs run on SGE clusters, Condor, clouds, other DCEs, and ordinary clusters.
Chris Hemmerich, Adam Hughes, Yang Ruan, Aaron Buechlein, Judy Qiu, and Geoffrey Fox. "Map-Reduce Expansion of the ISGA Genomic Analysis Web Server." The 2nd IEEE International Conference on Cloud Computing Technology and Science, 2010.

Million Sequence Challenge Pipeline
Gene sequences go through pairwise alignment and distance calculation to produce an O(N×N) distance matrix. The distance matrix feeds both pairwise clustering, O(N×N), which yields cluster indices, and multidimensional scaling, O(N×N), which yields coordinates; both results are visualized as a 3D plot.

Interpolation pipeline: from N = 1 million gene sequences, a reference sequence set of M = 100K is selected, leaving an N−M set of 900K sequences. The reference set goes through pairwise alignment and distance calculation (an O(M²) distance matrix) and multidimensional scaling (MDS) to produce reference (x, y, z) coordinates. The N−M remaining sequences are then placed by interpolative MDS with pairwise distance calculation against the reference coordinates, and all (x, y, z) coordinates are visualized as a 3D plot.
Experiment: input data size 680K (sample size 100K, out-of-sample size 580K) on PolarGrid with 100 nodes and 800 workers; plots show the 100K sample map and the full 680K map.

GTM / GTM-Interpolation
• Finds K clusters (latent points) for N data points; the relationship is a bipartite graph (bi-graph) represented by a K-by-N matrix (K << N).
• Decomposition over a P-by-Q compute grid reduces the memory requirement by a factor of 1/PQ.
• Built on Parallel HDF5, ScaLAPACK, and MPI/MPI-IO over a parallel file system, running on Cray, Linux, and Windows clusters.

Parallel MDS (MPI, Twister)
• O(N²) memory and computation required; 100K data points need about 480 GB of memory.
• Balanced decomposition of the N×N matrices over a P-by-Q grid reduces the memory and computing requirement by a factor of 1/PQ (see the decomposition sketch below).
• Processes communicate via MPI primitives.

MDS Interpolation
• Finds an approximate mapping position for each new point with respect to the prior mapping of its k nearest neighbors.
• Per point it requires O(M) memory and O(k) computation.
• Pleasingly parallel.
• Maps 2M points in 1,450 s, versus 27,000 s for 100K points with full MDS, roughly 7,500 times faster than an estimate for running full MDS on the whole set.
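The balanced P-by-Q decomposition used by both parallel GTM and parallel MDS above can be illustrated with a short Java sketch (hypothetical helper names, not the project's code): each process in a P×Q grid owns a nearly equal block of rows and columns of the N×N matrix, so per-process memory shrinks by roughly a factor of PQ.

// Illustration of the balanced P-by-Q block decomposition of an N x N matrix.
// Each process (pr, pc) in a P x Q grid owns a contiguous block of rows and columns;
// block sizes differ by at most one row/column, so memory per process is ~N*N/(P*Q).
// Hypothetical helper, not the actual parallel MDS/GTM implementation.
public class BlockDecomposition {

    /** Returns {start, length} of the i-th of `parts` nearly equal chunks of `total`. */
    static int[] chunk(int total, int parts, int i) {
        int base = total / parts, extra = total % parts;
        int start = i * base + Math.min(i, extra);
        int length = base + (i < extra ? 1 : 0);
        return new int[] { start, length };
    }

    public static void main(String[] args) {
        int n = 100_000;          // 100K x 100K distance matrix
        int p = 32, q = 32;       // P x Q compute grid

        long full = (long) n * n * 8;   // one matrix of doubles; SMACOF keeps several such
                                        // N x N working matrices, consistent with the
                                        // ~480 GB figure quoted above for 100K points
        System.out.printf("One full N x N matrix: %.1f GB%n", full / 1e9);

        for (int pr = 0; pr < 2; pr++) {                    // print a few example blocks
            for (int pc = 0; pc < 2; pc++) {
                int[] rows = chunk(n, p, pr);
                int[] cols = chunk(n, q, pc);
                long bytes = (long) rows[1] * cols[1] * 8;
                System.out.printf("process (%d,%d): rows %d..%d, cols %d..%d, %.1f MB%n",
                        pr, pc, rows[0], rows[0] + rows[1] - 1,
                        cols[0], cols[0] + cols[1] - 1, bytes / 1e6);
            }
        }
    }
}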
Two-Step Processing: Training and Interpolation
Full data processing by GTM or MDS is computing- and memory-intensive, so a two-step procedure is used. The total N data points are split into n in-sample points and N−n out-of-sample points.
• Training: train on the M sampled (in-sample) points out of the N data points to produce the trained data (MPI, Twister).
• Interpolation: the remaining N−M out-of-sample points are approximated without training and added to the interpolated map (MapReduce).

PubChem and Sequence Visualizations
• PubChem data with CTD annotations, visualized by MDS (left) and GTM (right): about 930,000 chemical compounds are each shown as a point in 3D space, annotated with the related genes in the Comparative Toxicogenomics Database (CTD).
• Chemical compounds reported in the literature, visualized by MDS (left) and GTM (right): 234,000 chemical compounds that may be related to a set of 5 genes of interest (ABCB1, CHRNB2, DRD2, ESR1, and F2), based on a dataset collected from major journal articles and also stored in the Chem2Bio2RDF system.
• Interpolation results, visualized by MDS (left) and GTM (right): 100K training points and 2M interpolated points of PubChem data; further plots cover 35,339 ALU sequences and 30,000 metagenomics sequences.

Cyberinfrastructure for Remote Sensing of Ice Sheets
Top: 3D visualization of crossover flight paths. Bottom left and right: the Web Map Service (WMS) protocol lets users access the original data set from MATLAB and GIS software in order to display a single frame for a particular flight path.

DryadLINQ CTP Evaluation
Investigates the applicability and performance of the DryadLINQ CTP for developing scientific applications.
Goals: evaluate key features and interfaces; probe parallel programming models.
Three applications: the SW-G bioinformatics application, matrix multiplication, and PageRank.

Matrix multiplication
• Parallel algorithms: row partition, row-column partition, and two-dimensional block decomposition in the Fox-Hey algorithm.
• Multi-core technologies: PLINQ, TPL, and the thread pool; a hybrid parallel model ports the multi-core code into Dryad tasks to improve performance; a timing model for matrix multiplication was developed.
[Charts: parallel efficiency (0-0.5) vs. input data size (2,400-19,200) for the RowPartition, RowColumnPartition, and Fox-Hey algorithms; speedup (0-200) of the Sequential, TPL, Thread, and PLINQ versions under each of the three partitioning schemes.]

SW-G workload balancing
The workload of SW-G, a pleasingly parallel application, is heterogeneous because of differences in the input gene sequences, so workload balancing becomes an issue. Two approaches alleviate it: randomizing the distribution of the input data, and partitioning the job into finer-granularity tasks (see the sketch below).
[Charts: execution time (seconds) vs. standard deviation of sequence length (0-250) for skewed and randomized input distributions; execution time vs. number of partitions (31-248) for standard deviations of 50, 100, and 250.]
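Why randomization helps can be seen with a small Java sketch (synthetic sequence lengths, my own illustration rather than the SW-G code): when sequences are laid out in sorted (skewed) order, the contiguous partitions that receive the longest sequences dominate the runtime, whereas shuffling the sequences first equalizes the per-partition alignment cost.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustration of why randomized input distribution balances the SW-G workload.
// Alignment cost is roughly proportional to the product of sequence lengths,
// approximated here by length^2 per sequence. Synthetic data, not the SW-G code.
public class SwgPartitioningSketch {

    static long[] partitionCosts(List<Integer> lengths, int partitions) {
        long[] cost = new long[partitions];
        int size = (lengths.size() + partitions - 1) / partitions;  // contiguous blocks
        for (int i = 0; i < lengths.size(); i++) {
            long len = lengths.get(i);
            cost[i / size] += len * len;
        }
        return cost;
    }

    static void report(String label, long[] cost) {
        long max = Long.MIN_VALUE, sum = 0;
        for (long c : cost) { max = Math.max(max, c); sum += c; }
        double avg = sum / (double) cost.length;
        System.out.printf("%s: max/avg partition cost = %.2f%n", label, max / avg);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Integer> lengths = new ArrayList<>();
        for (int i = 0; i < 10_000; i++)             // heterogeneous gene-sequence lengths
            lengths.add(200 + (int) Math.abs(rnd.nextGaussian() * 250));

        int partitions = 124;

        Collections.sort(lengths);                    // skewed layout: long sequences clustered
        report("Skewed    ", partitionCosts(lengths, partitions));

        Collections.shuffle(lengths, rnd);            // randomized layout
        report("Randomized", partitionCosts(lengths, partitions));
    }
}

Using more, finer-grained partitions has a similar balancing effect, since smaller blocks give the scheduler more opportunities to even out the load across workers.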