Programmability and Portability Problems? Time for Hardware Upgrades
Uzi Vishkin, 2009

• ~2003: Wall-Street-traded companies gave up the safety of the only paradigm that ever worked for them in favor of parallel computing.
• Yet to see: an easy-to-program, fast, general-purpose many-core computer judged on single-task completion time.

Develop application SW in 2009 for 2010s many-cores, or wait?
Portability/investment questions:
• Will 2009 code still be supported in the 2010s?
• Development hours in 2009 vs. the 2010s?
• Maintenance in the 2010s?
• Performance in the 2010s?

Good news
• Vendors are opening up to ~40 years of parallel computing.
• Also SW to match vendors' HW (2009 acquisitions). Also: new starts.

However
• They picked the wrong part: parallel architectures are a disaster area for programmability.
• In any case, their programming is too constrained. Contrast with general-purpose serial computing, which "set the serial programmer free".
• The current direction drags general-purpose computing toward an unsuccessful paradigm.

My main point
• We need to reproduce the serial success for many-core computing.

The business food chain
• SW developers serve customers, NOT machines. If HW developers will not get used to the idea of serving SW developers, guess what will happen to the customers of their HW.

Technical points
Will overview/note:
– What does it mean to "set free" parallel algorithmic thinking?
– Architecture functions/abilities that achieve that.
– HW features supporting them.
Vendors must provide such functions. A simple way: just add these features.

Example of an HW feature: Prefix-Sum
• 1500 cars enter a gas station with 1000 pumps.
• Direct, in unit time, a car to EVERY pump.
• Direct, in unit time, a car to EVERY pump becoming available.
Proposed HW solution: a prefix-sum functional unit (an HW enhancement of Fetch&Add). SPAA'97 + US Patent. A code sketch of the idea follows.
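To make the primitive concrete, here is a minimal sketch of the gas-station example in XMTC-like code. It assumes the spawn/ps syntax of the XMTC tutorial; the names NCARS, NPUMPS, pump_of and next_rank are illustrative, not from the talk.

    /* Hypothetical sketch, assuming XMTC's spawn/ps syntax: each arriving
     * car obtains a unique rank from the prefix-sum (HW Fetch&Add) unit;
     * the first NPUMPS ranks get a pump now, the rest wait. */
    #define NCARS  1500
    #define NPUMPS 1000
    int pump_of[NCARS];             /* pump assigned to each car, or -1     */
    psBaseReg next_rank;            /* base register served by the ps unit  */

    int main(void) {
        next_rank = 0;
        spawn(0, NCARS - 1) {       /* one virtual thread per arriving car  */
            int one = 1;
            ps(one, next_rank);     /* 'one' returns this car's unique rank */
            pump_of[$] = (one < NPUMPS) ? one : -1;
        }                           /* implicit join                        */
        return 0;
    }

Because the ps requests are resolved by the hardware unit in essentially unit time, every pump receives a car at once, which is the point of the slide.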
Objective for the programmer's model
• Emerging, though not settled: the analysis should be work-depth. Why not design for your analysis (as in serial)?
• Ask: what could I do in parallel at each step, assuming unlimited hardware?
• Serial paradigm: one operation per time unit, so Time = Work.
• Natural (parallel) paradigm: any number of operations per time unit; Work = total #ops and Time << Work.
• [SV82] conjectured that the rest (turning a work-depth description into a full PRAM algorithm) is just a matter of skill.
• Lots of evidence that this "work-depth methodology" works. Used as the framework in PRAM algorithms textbooks: JaJa-92, Keller-Kessler-Traeff-01.
• The only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base.

Hardware prototypes of PRAM-On-Chip
• 64-core, 75 MHz FPGA prototype [SPAA'07, Computing Frontiers'08]. Original explicit multi-threaded (XMT) architecture [SPAA'98] (Cray started to use "XMT" 7+ years later).
• Interconnection network for 128 cores: 9mm x 5mm, IBM 90nm process, 400 MHz prototype [HotInterconnects'07].
• Same design as the 64-core FPGA: 10mm x 10mm, IBM 90nm process, 150 MHz prototype.
• The design scales to 1000+ cores on-chip.

XMT big idea in a nutshell
Design for work-depth:
1) One operation now; any number of operations in the next time unit.
2) No need to program for locality beyond the use of local thread variables, post work-depth.
3) Enough interconnection-network bandwidth.

XMT: A PRAM-On-Chip Vision
• IF you could program a current many-core: great speedups. XMT: fix the IF.
• XMT was designed from the ground up to address that for on-chip parallelism, unlike approaches that start from matching current HW.
• Today's position: replicate the functions.
• Tested HW & SW prototypes; software release of the full XMT environment.
• SPAA'09: ~10X relative to an Intel Core 2 Duo.
• For more info: Google "XMT".

Programmer's Model: Workflow
• Arbitrary CRCW work-depth algorithm. Reason about correctness & complexity in the synchronous model.
• SPMD reduced synchrony:
– Main construct: the spawn-join block. Can start any number of processes at once; threads advance at their own speed, not in lockstep.
– Prefix-sum (ps). Independence of order semantics (IOS).
– Establish correctness & complexity by relating back to the work-depth analysis.
– Circumvents "the problem with threads", e.g., [Lee].
– Execution alternates serial sections and parallel sections (spawn ... join, spawn ... join).
• Tune (compiler or expert programmer): (i) length of the sequence of round trips to memory, (ii) QRQW, (iii) work-depth. [VCL07]
• Contrast with trial & error elsewhere: a similar start, then: while (insufficient inter-thread bandwidth) do { rethink the algorithm to take better advantage of cache }.

Ease of Programming
• Benchmark: can any CS major program your many-core? You cannot really avoid this question.
• Teachability demonstrated so far:
– A freshman class with 11 non-CS students. Some programming assignments: merge-sort, integer-sort & sample-sort.
• Other teachers:
– A magnet high-school teacher. Downloaded the simulator, assignments, and class notes from the XMT page; self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, and analyze, with the ability to anticipate performance (as in serial), and not just for embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up the keynote at CS4HS'09@CMU + an interview with the teacher.
– High-school & middle-school students (some 10-year-olds) from underrepresented groups, taught by an HS math teacher.

Conclusion
• XMT provides a viable answer to the biggest challenges for the field:
– Ease of programming.
– Scalability (up & down). Facilitates code portability.
• Preliminary evaluation shows good results for the XMT architecture versus a state-of-the-art Intel Core 2 platform.
• An ICPP'08 paper compares with GPUs.
• Easy to build: one student completed the hardware design plus an FPGA-based XMT computer in slightly more than two years, implying short time to market and low implementation cost.
• Replicate the functions, perhaps by replicating the solutions.

Software release
Allows you to use your own computer to program in an XMT environment and experiment with it, including:
a) A cycle-accurate simulator of the XMT machine.
b) A compiler from XMTC to that machine.
Also provided: extensive material for teaching or self-studying parallelism, including
(i) Tutorial + manual for XMTC (150 pages).
(ii) Class notes on parallel algorithms (100 pages).
(iii) Video recording of the 9/15/07 HS tutorial (300 minutes).
(iv) Video recordings of the graduate Parallel Algorithms lectures (30+ hours).
www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html, or just Google "XMT".

Q&A
Question: Why do PRAM-type parallel algorithms matter when we can get by with existing serial algorithms and parallel programming methods such as OpenMP on top of them?
Answer: With the latter you need a strong-willed Computer Science PhD to come up with an efficient parallel program at the end. With the former (the study of parallel algorithmic thinking and PRAM algorithms), high-school kids can write efficient parallel programs (more efficient if the problem is fine-grained & irregular!).
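To illustrate the kind of program meant here, below is a minimal XMTC-style sketch of array compaction, a fine-grained, irregular task. It follows the spawn/ps syntax described in the XMTC tutorial as I understand it; the size N and array names A, B are assumptions for illustration.

    /* Sketch (assumed XMTC syntax): copy the nonzero elements of A into B.
     * Work-depth view: O(n) work, constant depth plus the cost of ps.
     * Thread order does not affect correctness (IOS). */
    #define N 1024
    int A[N], B[N];
    psBaseReg count;                /* shared base register for prefix-sum   */

    int main(void) {
        /* ... initialize A ... */
        count = 0;
        spawn(0, N - 1) {           /* N virtual threads, one per element    */
            int inc = 1;
            if (A[$] != 0) {        /* $ is the virtual thread's ID          */
                ps(inc, count);     /* inc receives a unique slot index in B */
                B[inc] = A[$];
            }
        }                           /* implicit join                         */
        /* count now holds the number of elements copied into B */
        return 0;
    }

These few lines map directly onto the work-depth analysis a student would write first, which is the sense in which the model "sets free" parallel algorithmic thinking.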