A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches

Martin Burtscher (Department of Computer Science, Texas State University-San Marcos)
Hassan Rabeti (Department of Mathematics, Texas State University-San Marcos)

Problem: HPC is Hard to Exploit
- HPC application writers are domain experts
  - They are typically not computer scientists and have little or no formal education in parallel programming
  - Parallel programming is difficult and error prone
- Modern HPC systems are complex
  - They consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
  - Best performance requires parallelization at multiple levels (inter-node, intra-node, and accelerator)

Target Area: Iterative Local Searches
- Important application domain
  - Widely used in engineering & real-time environments
- Examples
  - All sorts of random-restart greedy algorithms
  - Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.
- ILS properties
  - Iteratively produce better solutions
  - Can exploit large amounts of parallelism
  - Often have an exponential search space

Our Solution: ILCS Framework
- Iterative Local Champion Search (ILCS) framework
  - Supports non-random-restart heuristics: genetic algorithms, tabu search, particle swarm optimization, etc.
  - Simplifies the implementation of ILS on parallel systems
- Design goal
  - Ease of use and scalability
- Framework benefits
  - Handles threading, communication, locking, resource allocation, heterogeneity, load balance, termination decision, and result recording (checkpointing)

User Interface
- User writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions:

    size_t CPU_Init(int argc, char *argv[]);
    void CPU_Exec(long seed, void const *champion, void *result);
    void CPU_Output(void const *champion);

- See the paper for the GPU interface and sample code
- Framework runs the Exec (map) functions in parallel

Internal Operation: Threading
[Diagram: per-node thread structure]
- The ILCS master thread forks one worker thread per core and one handler thread per GPU
- Worker threads repeatedly evaluate seeds via the user CPU code and record the local optimum
- GPU handler threads launch the user GPU code, sleep, and record its result; the GPU worker threads evaluate seeds and record the local optimum
- The master/communication thread sporadically determines the global optimum via MPI, then sleeps

Internal Operation: Seed Distribution
[Diagram: e.g., 4 nodes with 4 CPU cores (a, b, c, d) and 2 GPUs (1, 2) each]
- Each node gets a chunk of the 64-bit seed range (0 through 2^64-1)
- CPUs process their chunk bottom up
- CPU threads take one seed per thread at a time; GPUs process their chunk top down, each taking a strided range of seeds at a time
- Benefits
  - Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
  - Users can generate other distributions from the seeds: any injective mapping results in no redundant evaluations

Related Work
- MapReduce/Hadoop/MARS and PADO
  - Their generality and features that are unnecessary for ILS incur overhead and increase the learning curve
  - Some do not support accelerators; some require Java
- The ILCS framework is optimized for ILS applications
  - Reduction is provided, multiple keys are not required, no secondary storage is needed to buffer data, non-random-restart heuristics are directly supported, early termination is allowed, GPUs and MICs are supported, and it targets everything from single-node workstations to HPC clusters

Evaluation Methodology
- Three HPC systems (at TACC and NICS):

  system    | compute nodes | CPUs   | CPU cores | CPU clock | GPUs | GPU cores | GPU clock
  Keeneland | 264           | 528    | 4,224     | 2.6 GHz   | 792  | 405,504   | 1.3 GHz
  Ranger    | 3,936         | 15,744 | 62,976    | 2.3 GHz   | n/a  | n/a       | n/a
  Stampede  | 6,400         | 12,800 | 102,400   | 2.7 GHz   | 128* | n/a       | n/a

- Largest tested configuration:

  system    | compute nodes | total CPUs | total GPUs | total CPU cores | total GPU cores
  Keeneland | 128           | 256        | 384        | 2,048           | 196,608
  Ranger    | 2,048         | 8,192      | 0          | 32,768          | 0
  Stampede  | 1,024         | 2,048      | 0          | 16,384          | 0

Sample ILS Codes
- Traveling Salesman Problem (TSP)
  - Find the shortest tour; 4 inputs from TSPLIB; 2-opt hill climbing
- Finite State Machine (FSM)
  - Find the best FSM configuration to predict hit/miss events; 4 sizes (n = 3, 4, 5, 6); Monte Carlo method

FSM Transitions/Second Evaluated
[Chart: transitions evaluated per second (trillions) for 3- to 6-bit FSMs on Keeneland, Ranger, and Stampede; peak rate 21,532,197,798,304 s^-1; annotations: GPU shared-memory limit; Ranger uses twice as many cores as Stampede]

TSP Tour-Changes/Second Evaluated
[Chart: moves evaluated per second (trillions) for kroE100, ts225, rat575, and d1291 on the three systems; peak rate 12,239,050,704,370 s^-1 based on serial CPU code; the GPU re-computes distances with O(n) memory whereas the CPU pre-computes them with O(n^2) memory; each core evaluates a tour change every 3.6 cycles]

TSP Moves/Second/Node Evaluated
[Chart: moves evaluated per second per node (billions) for the four inputs; GPUs provide >90% of the performance on Keeneland]

ILCS Scaling on Ranger (FSM)
[Chart: transitions evaluated per second (billions) vs. compute nodes, log scale; >99% parallel efficiency on 2,048 nodes; the other two systems are similar]

ILCS Scaling on Ranger (TSP)
[Chart: moves evaluated per second (billions) vs. compute nodes, log scale; >95% parallel efficiency on 2,048 nodes; longer runs scale even better]

Intra-Node Scaling on Stampede (TSP)
[Chart: moves evaluated per second (billions) vs. 1 to 16 worker threads; >98.9% parallel efficiency on 16 threads; the framework overhead is very small]

Tour Quality Evolution (Keeneland)
[Chart: deviation from the optimal tour length over steps 1 to 29 for the four inputs; quality depends on chance: ILS provides a good solution quickly, then progressively improves it]

Tour Quality after 6 Steps (Stampede)
[Chart: deviation from the optimal tour length on 1 to 1,024 compute nodes; larger node counts typically yield better results faster]

Summary and Conclusions
- ILCS framework
  - Automatic parallelization of iterative local searches
  - Provides MPI, OpenMP, and multi-GPU support
  - Checkpoints the currently best solution every few seconds
  - Scales very well (decentralized)
- Evaluation
  - 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
  - AMD + Intel CPUs, NVIDIA GPUs, and Intel MICs
- The ILCS source code is freely available: http://cs.txstate.edu/~burtscher/research/ILCS/
- Work supported by NSF, NVIDIA, and Intel; resources provided by TACC and NICS