A Parallel GPU Version of the Traveling Salesman Problem
Molly A. O’Neil, Dan Tamir, and Martin Burtscher*
Department of Computer Science

The Traveling Salesman Problem
  Common combinatorial optimization problem
    Wire routing, logistics, robot arm movement, etc.
  Given n cities, find shortest Hamiltonian tour
    Must visit all cities exactly once and end in first city
  Usually expressed as a graph problem
    We use complete, undirected, planar, Euclidean graph
    Vertices represent cities
    Edge weights reflect distances

A Parallel GPU Version of the Traveling Salesman Problem July 2011

TSP Algorithm
  Optimal solution is NP-hard
    Heuristic algorithms used to approximate solution
  We use an iterative hill climbing search algorithm
    Generate k random initial tours (k climbers)
    Iteratively refine them until local minimum reached
  In each iteration, apply best opt-2 move
    Find best pair of edges (a,b) and (c,d) such that replacing them with (a,d) and (b,c) minimizes tour length

GPU Requirements
  Lots of data parallelism
    Need 10,000s of ‘independent’ threads
  Sufficient memory access regularity
    Sets of 32 threads should have ‘nice’ access patterns
  Sufficient code regularity
    Sets of 32 threads should follow the same control flow
  Plenty of data reuse
    At least O(n^2) operations on O(n) data

TSP_GPU Implementation
  Assuming 100-city problems & 100,000 climbers
  Climbers are independent, can be run in parallel
    Plenty of data parallelism
    Potential load imbalance
      Different number of steps required to reach local minimum
  Every step determines best of 4851 opt-2 moves
    Same control flow (but different data)
    Coalesced memory access patterns
    O(n^2) operations on O(n) data

Code Optimizations
  Key code section: finding best opt-2 move
    Doubly nested loop
    Only computes difference in tour length,
    not absolute length
    Highly optimized to minimize memory accesses
    “Caches” rest of data in registers
    Requires only 6 clock cycles per move on a Xeon CPU core
  Local minimum compared to best solution so far
    Best solution updated if needed, otherwise tour is discarded
  Other small optimizations (see paper)

GPU Optimizations
  Random tours generated in parallel on GPU
    Minimizes data transfer to GPU (CPU only generates distance matrix and prints result)
  2D distance matrix resident in shared memory
    Ensures hits in software-controlled fast data cache
  Tours copied to local memory in chunks of 1024
    Enables accessing them with coalesced loads & stores

Evaluation Method
  Systems
    NVIDIA Tesla C2050 GPU (1.15 GHz, 14 SMs w/ 32 PEs)
    Nautilus supercomputer (2.0 GHz 8-core X7550 Xeons)
  Datasets
    Five 100-city inputs from TSPLIB
  Implementations
    CUDA (GPU), Pthreads (CPU), serial C (CPU)
    Use almost identical code for finding best opt-2 move

Runtime Comparison
  [Figure: Runtime Comparison (kroE100 Input). Bar chart, logarithmic y-axis: Runtimes (in ms), with min, median, and max; x-axis: seq CPU, Number of threads (pthreads CPU) from 1 to 256, CUDA GPU. Data labels (ms): 154684 (sequential), 156413 (median), 78350, 39175, 19591, 9802, 4908, 4368, 2724, 2539, 2497.]
  GPU is 7.8x faster than CPU with 8 cores
  One GPU chip is as fast as 16 or 32 CPU chips

Speedup over Sequential Code
  [Figure: Speedup over Serial (kroE100 Input). Bar chart with min, median, and max; x-axis: Number of threads (pthreads) from 1 to 256, CUDA GPU. Data labels: 1.0, 2.0, 3.9, 7.9, 15.8, 31.5, 35.4, 56.8, 60.9, 61.9.]
  Pthreads code scales well to 32 threads (4 CPUs)
  CPU performance fluctuates (NUMA), GPU stable

Solution Quality
  TSPLIB Database          CUDA GPU Solution Quality
  Name      Optimal Cost   Min. Tour Cost   Min. Tour #   Runtime (s)
  kroA100   21,282         21,282            33,188       2.540
  kroB100   22,141         22,141             5,969       2.499
  kroC100   20,749         20,749            23,092       2.543
  kroD100   21,294         21,294            32,142       2.497
  kroE100   22,068         22,084            16,941       2.499
                           22,068           117,583       4.952

  Optimal tour found in 4 of 5 cases with 100,000 climbers
  200,000 climbers find best solution in fifth case
  Runtime independent of input and linear in climbers

Summary
  TSP_GPU source code is freely available at
    http://www.cs.txstate.edu/~burtscher/research/TSP_GPU/
  TSP_GPU algorithm
    Highly optimized implementation for GPUs
    Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
    Produces high-quality results
    May be better suited for GPU than ACO and GA algorithms
  Acknowledgments
    NSF TeraGrid (NICS), NVIDIA Corp., and Intel Corp.
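
Appendix sketch: one climber in serial C

The hill climbing step described under "TSP Algorithm" and "Code Optimizations" can be illustrated with a minimal serial C sketch. This is not the TSP_GPU source; every function and variable name below is hypothetical. It assumes a precomputed integer distance matrix in row-major order and mirrors the slide's edge notation: edges (a,b) and (c,d) are replaced by (a,d) and (b,c), and only the difference in tour length is computed. The loop bounds are an assumption, chosen because they yield (n-1)(n-2)/2 candidate moves, which gives 4851 for n = 100.

```c
#include <assert.h>

/* Distance between cities a and b in a row-major n x n matrix. */
static int dist_at(const int *dist, int n, int a, int b) {
    return dist[a * n + b];
}

/* Length of the closed tour, including the edge back to the first city. */
int tour_length(const int *dist, int n, const int *tour) {
    int len = 0;
    for (int k = 0; k < n; k++)
        len += dist_at(dist, n, tour[k], tour[(k + 1) % n]);
    return len;
}

/* Number of (i, j) position pairs that best_move scans per step. */
int moves_per_step(int n) {
    return (n - 1) * (n - 2) / 2;
}

/* Scan all position pairs i, j with j >= i + 2 and return the most
   negative change in tour length (0 if no move shortens the tour).
   Only the four affected edges are evaluated, never the whole tour. */
int best_move(const int *dist, int n, const int *tour,
              int *best_i, int *best_j) {
    int best = 0;
    *best_i = *best_j = -1;
    for (int i = 0; i <= n - 3; i++) {
        int a = tour[i], b = tour[i + 1];
        int d_ab = dist_at(dist, n, a, b);  /* reused in the inner loop */
        for (int j = i + 2; j < n; j++) {
            int d = tour[j], c = tour[(j + 1) % n];
            /* Remove (a,b) and (c,d); add (a,d) and (b,c). */
            int delta = dist_at(dist, n, a, d) + dist_at(dist, n, b, c)
                      - d_ab - dist_at(dist, n, c, d);
            if (delta < best) {
                best = delta;
                *best_i = i;
                *best_j = j;
            }
        }
    }
    return best;
}

/* Apply the move by reversing the tour segment between positions i+1 and j. */
void apply_move(int *tour, int i, int j) {
    for (int lo = i + 1, hi = j; lo < hi; lo++, hi--) {
        int tmp = tour[lo];
        tour[lo] = tour[hi];
        tour[hi] = tmp;
    }
}

/* Refine one tour until no move shortens it; returns the number of steps. */
int climb(const int *dist, int n, int *tour) {
    assert(dist && tour && n >= 3);
    int steps = 0, i, j;
    while (best_move(dist, n, tour, &i, &j) < 0) {
        apply_move(tour, i, j);
        steps++;
    }
    return steps;
}
```

In a parallel setting, each climber would call something like `climb` on its own random tour, and the shortest resulting tour would be kept. How TSP_GPU maps this onto threads, shared memory, and registers is described in the slides above and is not reproduced here.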