Better Speedups for Parallel Max-Flow
George C. Caragea, Uzi Vishkin
Dept. of Computer Science, University of Maryland, College Park, USA
June 4th, 2011

Experience with an Easy-to-Program Parallel Architecture
XMT (eXplicit Multi-Threading) Platform
◦ Design goal: easy-to-program many-core architecture
◦ PRAM-based design, PRAM-On-Chip programming
◦ Ease of programming demonstrated by order-of-magnitude ease of teaching/learning
◦ 64-processor hardware, compiler, 20+ papers, 9 graduate degrees, 6 US patents
◦ Only one previous single-application paper (Dascal et al., 1999)
Parallel Max-Flow results
◦ [IPDPS 2010] 2.5x speedup vs. serial using CUDA
◦ [Caragea and Vishkin, SPAA 2011] up to 108.3x speedup vs. serial using XMT (3-page paper)

How to publish application papers on an easy-to-program platform?
The reward game is skewed: it is easier to publish on "hard-to-program" platforms
◦ Remember STI Cell?
Application papers for easy-to-program architectures are considered "boring"
◦ Even when they show good results
Recipe for academic publication:
◦ Take a simple application (e.g., Breadth-First Search in a graph)
◦ Implement it on the latest (difficult-to-program) parallel architecture
◦ Discuss challenges and work-arounds

Parallel Programming Today
Current parallel programming: high-friction navigation - by implementation [walk/crawl]
◦ Initial program (1 week), then trial & error tuning begins (½ year; architecture dependent)
PRAM-On-Chip programming: low-friction navigation - mental design and analysis [fly]
◦ No need to crawl: identify the most efficient algorithm, then advance to an efficient implementation

PRAM-On-Chip Programming
High-school student comparing parallel programming approaches:
◦ "I was motivated to solve all the XMT programming assignments we got, since I had to cope with solving the algorithmic problems themselves, which I enjoy doing. In contrast, I did not see the point of programming other parallel systems available to us at school, since too much of the programming was effort getting around the way the systems were engineered, and this was not fun."

Maximum Flow in Networks
Extensively studied problem
◦ Numerous algorithms and implementations (general graphs)
◦ Application domains: network analysis, airline scheduling, image processing, DNA sequence alignment
Parallel Max-Flow algorithms and implementations
◦ Paper has an overview
◦ SMPs and GPUs
◦ Difficult to obtain good speedups vs. serial, e.g. 2.5x for a hybrid CPU-GPU solution

XMT Max-Flow Parallel Solution
First stage: identify/design a parallel algorithm
◦ [Shiloach, Vishkin 1982] designed an O(n^2 log n) time, O(nm) space PRAM algorithm
◦ [Goldberg, Tarjan 1988] introduced distance labels into S-V: the Push-Relabel algorithm with O(m) space complexity
◦ [Anderson, Setubal 1992] observed poor practical performance for G-T and augmented it with an S-V-inspired Global Relabeling heuristic
◦ Solution: hybrid SV-GT PRAM algorithm
Second stage: write the PRAM-On-Chip implementation (a sketch follows below)
◦ Relax PRAM lock-step synchrony by grouping several PRAM steps in an XMT spawn block; insert synchronization points (barriers) where needed for correctness
◦ Maintain an active node set instead of polling all graph nodes for work
◦ Use hardware-supported atomic operations to simplify reductions
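
To make the second stage concrete, below is a minimal serial C sketch of one "pulse" of the hybrid push-relabel scheme: every active node pushes its excess along admissible residual arcs and is relabeled if excess remains. The names and data layout (Graph, push_relabel_pulse, arc arrays) are illustrative assumptions, not the paper's actual code. In the XMT implementation, the outer loop over active nodes would become a spawn block (one virtual thread per active node), the updates to residual capacities and excesses would rely on atomic operations or barriers, and the append to next_active would use the hardware-supported prefix-sum primitive instead of a serial counter.

/* Illustrative serial sketch of one push-relabel "pulse" over the active
 * node set; not the paper's code. */
#include <limits.h>

typedef struct {
    int n, m;          /* number of nodes, arcs                         */
    int *first, *next; /* adjacency: first[v] -> arc list via next[e]   */
    int *head;         /* head[e]  = target node of arc e               */
    int *cap;          /* cap[e]   = residual capacity of arc e         */
    int *rev;          /* rev[e]   = index of the reverse arc of e      */
} Graph;

/* One pulse: each active node pushes excess along admissible arcs and is
 * relabeled if excess remains. Returns the size of the next active set. */
int push_relabel_pulse(Graph *g, long long *excess, int *label,
                       const int *active, int num_active,
                       int *next_active, int sink)
{
    int next_count = 0;                      /* prefix-sum target on XMT  */

    for (int i = 0; i < num_active; i++) {   /* spawn block on XMT        */
        int v = active[i];
        int min_label = INT_MAX;

        for (int e = g->first[v]; e != -1 && excess[v] > 0; e = g->next[e]) {
            int w = g->head[e];
            if (g->cap[e] <= 0) continue;                 /* not residual */
            if (label[v] == label[w] + 1) {               /* admissible   */
                long long delta = excess[v] < g->cap[e] ? excess[v] : g->cap[e];
                g->cap[e]         -= delta;   /* concurrent pushes need   */
                g->cap[g->rev[e]] += delta;   /* atomics/barriers on XMT  */
                excess[v]         -= delta;
                excess[w]         += delta;
                /* A full implementation would also add newly activated
                 * nodes w to next_active (duplicate check omitted).      */
            } else if (label[w] + 1 < min_label) {
                min_label = label[w] + 1;
            }
        }
        if (excess[v] > 0 && min_label < INT_MAX)
            label[v] = min_label;             /* relabel                  */
        if (excess[v] > 0 && v != sink)       /* source handling omitted  */
            next_active[next_count++] = v;    /* ps() prefix-sum on XMT   */
    }
    return next_count;
}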
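
The S-V-inspired global relabeling heuristic can be illustrated the same way: a reverse breadth-first search from the sink over residual arcs recomputes exact distance labels, typically after every O(n) push/relabel operations. Again an illustrative serial sketch under the same assumed data layout; on XMT each BFS level can itself be processed inside a spawn block.

/* Global relabeling: reverse BFS from the sink over residual arcs.
 * queue must have room for n node indices. Illustrative sketch only. */
void global_relabel(Graph *g, int *label, int sink, int *queue)
{
    for (int v = 0; v < g->n; v++) label[v] = g->n;   /* "unreached"      */
    int head = 0, tail = 0;
    label[sink] = 0;
    queue[tail++] = sink;

    while (head < tail) {
        int w = queue[head++];
        /* arc e = (w,v): its reverse (v,w) is residual iff cap[rev[e]] > 0,
         * i.e. v can still push toward w                                  */
        for (int e = g->first[w]; e != -1; e = g->next[e]) {
            int v = g->head[e];
            if (g->cap[g->rev[e]] > 0 && label[v] == g->n) {
                label[v] = label[w] + 1;
                queue[tail++] = v;
            }
        }
    }
}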
Input Graph Families
Performance is highly dependent on the structure of the graph
Graph structures proposed in the DIMACS challenge [DIMACS90]
◦ Used by virtually every Max-Flow publication
Datasets:
  Dataset    Description                    Nodes    Edges
  ADG        Acyclic Dense Graph            1200     719400
  RLG        Washington Random Level Graph  131074   391168
  RMF-WIDE   GenRMF Wide Graph              8192     23040
  RMF-LONG   GenRMF Long Graph              8192     22464
  RANDOM     Random Graph                   65536    96759

Speed-Up Results
Compared to the "best serial implementation", running on a recent x86 processor [Goldberg 2006]
Clock cycle count speedups: Speedup = ClockCycles(SerialMaxflow_x86) / ClockCycles(ParallelMaxflow_XMT)
Two XMT configurations:
◦ XMT.64: 64-core FPGA prototype
◦ XMT.1024: 1024-core, cycle-accurate simulator XMTSim
[Bar chart: clock-cycle speedups vs. serial for XMT.64 and XMT.1024 on ADG, RLG-WIDE, RMF-WIDE, RMF-LONG and RANDOM]
Speedups: 1.56x to 108.3x for XMT.1024

Conclusion
XMT aims at being an easy-to-program, general-purpose architecture
◦ Performance improvements on hard-to-parallelize applications like Max-Flow
◦ Ease of programming: shown by an order-of-magnitude improvement in ease of teaching/learning; difficult speedups were achieved at a much earlier developmental stage (10th graders in high school versus graduate students). UCSB/UMD experiment, middle school, magnet HS, inner-city HS, freshman course, UIUC/UMD experiment: J. Sys. & SW 2008, SIGCSE 2010, EduPar 2011.
Current stage of the XMT project: develop more complex applications beyond benchmarks
◦ Max-Flow is a step in that direction
◦ More are needed
Without an easy-to-program many-core architecture, rejection of parallelism by mainstream programmers is all but certain
◦ Affirmative action: drive more researchers to work and seek publications on easy-to-program architectures
◦ This work should not be dismissed as 'too easy'
Thank you!