Reinvention of Computing for Many-Core Parallelism Requires Addressing Programmer's Productivity
Uzi Vishkin

Common wisdom [cf. tribal lore collected by DARPA HPCS, 2005]: Programming for parallelism is easy; it is the programming for performance that makes it hard.

A less fatalistic position: Programming for parallelism is easy, but the difficulty of programming for performance depends on the system.

Productivity in Parallel Computing
The large parallel machines story:
• Funding of productivity: $650M DARPA HPCS (High Productivity Computing Systems), ~2002.
• Met the #Gflops goals: up by 1000X since the mid-90s; Exascale talk & plans. Met the power goals. Also: groomed eloquent spokespeople.
• Progress on productivity: no agreed benchmarks, no spokesperson. Elusive! In fact, not much has changed since "as intimidating and time consuming as programming in assembly language" (NSF Blue Ribbon Committee, 2003), or even the "parallel software crisis" (CACM, 1991).
• Common-sense engineering: an untreated bottleneck yields diminishing returns on other improvements, so the bottleneck becomes ever more critical.
• Next 10 years: new specific programs on flops and power. What about productivity?!
• Reality: an economic island. Cleared by marketing: DOE applications.

Enter: mainstream many-cores. Every CS major should be able to program many-cores.

Coherence Issue
"When you come to a fork in the road, take it!" - Yogi Berra

Camp 1: Many of the US's best minds opt for occupations that do not involve programming.
• NSF tries to lure them to CS in high school by: (1) presenting the steady march and broad reach of computing across the sciences, industries, culture and society, correcting the current narrow focus on programming in the introductory course [New Programs Aim to Lure Young Into Digital Jobs, NYTimes, 12/09]; (2) productivity; (3) computational thinking.

Camp 2: Power/performance. Reinvent mainstream computing for parallelism.
• Vendors try to build many-cores that require decomposition-first programming. Railroading to a productivity "disaster area". Hacking. Insufficient support from parallel algorithms design & analysis. Short on outreach/productivity/abstraction.

Unintended outcome of "taking the fork" (productivity vs. power/performance)
• For the camp cheerleaders, core CS (algorithm design & analysis style) is radical. Peer review favors both sides over the center. Centrists as extremists is an oxymoron!
• Building wrong expectations among prospective CS majors; disappointment will lead to "Get me out of this major".
• The pool of CS majors to be engaged in decomposition-first programming is too limited (after subtracting those lured to breadth-over-programming and the core).

Consequences of "taking the fork": surrealism
• Eventual casualties: number of students, credibility & productivity.
• Research on, and comparison of, several holistic parallel platforms could: (i) prevent much of the damage, (ii) build up the real diversity needed for natural selection, and (iii) advise the NSF on programs that otherwise could cancel one another.

Lessons from the Invention of Computing
"It should be noted that in comparing codes four viewpoints must be kept in mind, all of them of comparable importance:
• Simplicity and reliability of the engineering solutions required by the code;
• Simplicity, compactness and completeness of the code;
• Ease and speed of the human procedure of translating mathematically conceived methods into the code ["COMPUTATIONAL THINKING"], and also of finding and correcting errors in coding or of applying to it changes that have been decided upon at a later stage;
• Efficiency of the code in operating the machine near its full intrinsic speed."
- H. Goldstine, J. von Neumann. Planning and Coding Problems for an Electronic Computing Instrument, 1947

Take home:
- Comparing codes is a pivotal and broad issue.
- Concern for productivity (development time) is as old as computing.
- The human process: intellectual/algorithm/planning plus skill/coding.
- Contrast with: the tendency to assess a HW upgrade from application code (even if the machine is not yet built; A. Ghuloum, Intel, CACM 9/09), an unreasonable expectation of application-code developers.

How was the "human procedure" addressed?
Answer: basically, by abstraction and induction.
1. General-purpose computing is about a platform for your future (whatever) program, as opposed to a specific application, so a general method for the human procedure was key.
2. GvN47 based coding on mathematical induction (known from math proofs and as an axiom of the natural numbers); a small illustration appears at the end of this part.
3. It worked for establishing serial computing. This method led to simplicity, compactness and completeness of the resulting code.

References:
- Knuth67, The Art of Computer Programming. Vol. 1: Fundamental Algorithms. Chapter 1: Basic Concepts. 1.1 Algorithms. 1.2 Mathematical Preliminaries. 1.2.1 Mathematical Induction.
- Properties of algorithms: 1. Finiteness. 2. Definiteness. 3. Input. 4. Output. 5. Effectiveness.
- Gold standards: Definiteness: induction. Effectiveness: the "uniform cost criterion" [AHU74] abstraction.

"Killer app" for general-purpose many-cores: let the app-dreamers do their magic
• Oxymoron? General-purpose means no one application in particular. Not really: if possible, a killer application would be helpful; however, it is wrong as a condition for progress.
• General-purpose computing is an infrastructure for the IT sector and the economy.
• The general-purpose computing infrastructure has been realized by the software spiral (the cyclic process of hardware improvements leading to software improvements that lead back to hardware improvements, and so on; Andy Grove, Intel).
• Instituting a parallel software spiral is a killer application for many-cores: as in the past, app-dreamers will invent uses. Not surprisingly, the killer application is also an infrastructure.
• Government has a role in building infrastructure, so instituting a parallel software spiral merits government funding. However, there is insufficient empowerment for creating and developing alternative platforms to the point of establishing their merit.
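To make the induction point above concrete, here is a minimal, illustrative C sketch (my own example, not from GvN47 or Knuth): a serial loop whose correctness is argued by induction over iterations via a loop invariant, which is how inductive reasoning underlies serial coding.

    #include <assert.h>
    #include <stdio.h>

    /* Illustrative only: induction as the basis for reasoning about serial code.
     * Invariant before iteration i: sum == 0 + 1 + ... + (i-1).
     * The base case i == 0 holds trivially; the loop body is the inductive step. */
    static long sum_first_n(long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            sum += i;   /* inductive step: extends the invariant from i to i+1 */
        }
        return sum;     /* invariant at i == n gives sum == n*(n-1)/2 */
    }

    int main(void)
    {
        long n = 10;
        assert(sum_first_n(n) == n * (n - 1) / 2);
        printf("sum of 0..%ld-1 = %ld\n", n, sum_first_n(n));
        return 0;
    }

The program provides the next instruction to execute inductively, and the same induction yields the correctness argument; this is the sense in which abstraction plus induction carried serial computing.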
Serial Abstraction & A Parallel Counterpart
• The rudimentary abstraction that made serial computing simple: any single instruction available for execution in a serial program executes immediately.
[Figure: serial execution, based on the serial abstraction: one operation per time step, so Time = Work. Parallel execution, based on the parallel abstraction ("what could I do in parallel at each step assuming unlimited hardware"): Work = total number of operations, Time << Work.]
The abstraction abstracts away the different execution times of different operations (e.g., the memory hierarchy). It is used by programmers to conceptualize serial computing and is supported by hardware and compilers. The program provides the instruction to be executed next (inductively).
• The rudimentary abstraction for making parallel computing simple: indefinitely many instructions, which are available for concurrent execution, execute immediately, dubbed Immediate Concurrent Execution (ICE). Step-by-step (inductive) explication of the instructions available next for concurrent execution; the number of processors is not even mentioned. Falls back on the serial abstraction if there is 1 instruction per step.
• CACM'10: Using simple abstraction to guide the reinvention of computing for parallelism. [Overall: the old Work-Depth description. Only a "minimalist abstraction": ICE builds only on induction, itself a rudimentary concept.]
• [SV82] conjectured that the rest (the full PRAM algorithm) is just a matter of skill.
• There is lots of evidence that work-depth works; it is used as the framework in PRAM algorithms texts: JaJa-92, KKT-01.
• ICE is in line with the PRAM, the only really successful parallel algorithmic theory: a latent, though not widespread, knowledge base.
• Widely agreed: work & depth are necessary. The jury is out on what else. Our position: as little as possible.

Workflow from parallel algorithms to programming versus trial-and-error
• Option 1: domain decomposition or task decomposition, then program; while inter-thread bandwidth is insufficient, rethink the algorithm to take better advantage of the cache; then compiler, then hardware.
• Option 2: PAT, parallel algorithmic thinking (ICE/WD/PRAM), then program, then compiler, then hardware.
• Is Option 1 good enough for the parallel programmer's model? Options 1B and 2 start with a PRAM algorithm, but option 1A does not. Options 1A and 2 represent a workflow, but option 1B does not.
• Option 2 in more detail: PAT; prove correctness; program (still correct); tune (still correct); hardware. Not possible in the 1990s; possible now: XMT@UMD. Why settle for less?
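To make the work-depth (ICE) accounting above concrete, here is a minimal, illustrative C sketch (not from the XMT toolchain): a serial emulation of summing n numbers by a balanced binary tree, counting Work (total operations) and Depth (ICE steps). Under ICE, all additions within a round would execute concurrently, so Time tracks Depth rather than Work.

    #include <stdio.h>

    /* Illustrative only: Work-Depth (ICE) accounting for summing N numbers by a
     * balanced binary tree. All additions within one round are independent, so
     * under ICE they could execute concurrently: Work ~ N-1 additions,
     * Depth ~ ceil(log2 N) rounds, hence Time << Work.
     * This is a serial emulation; a genuinely concurrent run would double-buffer
     * to respect the synchronous read-then-write step of the PRAM model. */
    int main(void)
    {
        enum { N = 16 };
        long a[N];
        for (int i = 0; i < N; i++) a[i] = i + 1;

        long work = 0, depth = 0;
        for (int len = N; len > 1; len = (len + 1) / 2) {
            /* One ICE step: every pair could be added "immediately" and concurrently. */
            for (int i = 0; i < len / 2; i++) {
                a[i] = a[2 * i] + a[2 * i + 1];
                work++;
            }
            if (len % 2) a[len / 2] = a[len - 1];   /* carry an odd element forward */
            depth++;
        }
        printf("sum = %ld, work = %ld additions, depth = %ld rounds\n",
               a[0], work, depth);
        return 0;
    }

For N = 16 the sketch reports work = 15 additions and depth = 4 rounds, i.e., Work = N-1 and Depth = ceil(log2 N), versus serial Time = Work; the number of processors never appears in the reasoning.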
Mark Twain on the PRAM
"We should be careful to get out of an experience only the wisdom that is in it— and stop there; lest we be like the cat that sits down on a hot stove-lid. She will never sit down on a hot stove-lid again— and that is well; but also she will never sit down on a cold one anymore." - Mark Twain
PRAM algorithms did not become standard CS knowledge in 1988-90 because of the "hot stove-lid": no implementable 1990s computer architecture allowed programmers to look at a computer as a PRAM. The XMT project @UMD changed that.
PS: NVidia was happy to report success with 2 PRAM algorithms at IPDPS09. Great to see that from a major vendor. [These 2 algorithms are decomposition-based, unlike most PRAM algorithms. Freshmen programmed the same 2 algorithms on our XMT machine.]

The Parallel Programmer's Productivity Landscape
Postulation: a continental divide. Rain falling on the decomposition-first programming side drains to the ocean; rain falling on the work-depth programming side drains to the Great Lakes.
How different can the productivity of many-core architectures be? Answer: very! Metaphor: drops of rain landing a short distance apart can have very different outcomes. Think of programmer's productivity as the cost of producing usable water.
The decomposition-first programming side requires domain decomposition or task decomposition, which have not worked in spite of big investment. (It looks greener, since it was invested in; but what if it drains to the ocean while the arid side drains to sweet water?) The work-depth initial abstraction is decomposition-free. (Arid, under-invested.) It requires a leap of faith for investment.

Validation of Ease of Programming To Date
1. Comparison with MPI by the DARPA HPCS software-engineering leaders [HochsteinBasiliVGilbert].
2. Teachability demonstrated so far [TorbertVTzurEllison, SIGCSE'10, to appear]:
- To a freshman class with 11 non-CS students. Some programming assignments: median finding, merge-sort, integer-sort & sample-sort.
Other teachers:
- A magnet high-school teacher: downloaded the simulator, assignments, and class notes from the XMT page; self-taught. Recommends teaching XMT first: easiest to set up (simulator), program, and analyze (ability to anticipate performance, as in serial), and not just for embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up the keynote at CS4HS'09@CMU and the interview with the teacher.
- High-school & middle-school students (some 10-year-olds) from underrepresented groups, taught by a high-school math teacher.
Teachability is a necessary (but not sufficient) condition for ease of programming, which is itself a necessary (but not sufficient) condition for productivity. Hence, teachability is as good a benchmark as any out there for productivity.

Conclusion
1. We want future mainstream programmers to embrace general-purpose parallelism (every CS major; for common SW architectures).
2. Yet, in the past there was insufficient evidence on productivity, and a history of repeated surprise: parallel machines repel programmers.

Research Drivers
Empower select holistic (HW+SW) parallel platforms for merit-based comparison. Imagine a new world with the given platform. Consider all aspects: e.g., is it sufficient for reinstating the SW spiral? Is the barrier to entry for creative applications low enough? How will the CS curriculum look? Who will be attracted to study CS? Then gather evidence: methodically compare the productivity (development time, run time) of platforms.
Ownership-stake role for an Indian partner (Prof. PJ Narayan, IIIT, Hyderabad): India is the largest producer of SW; a new platform requires sufficient Indian interest; lead the benchmarking/comparison for productivity, etc.

For this session
Coming from algorithms, computer vision and computational biology, compare select platforms for performance, productivity (development time and run time), and overall for reinstating the SW spiral. Benchmark algorithms and applications based on their inherent parallelism for future machine platforms, as opposed to using existing code written for yesterday's (serial or parallel) machines. Issue: how to benchmark for productivity?

Not just a theory. XMT: prototyped HW & SW
• 64-core, 75 MHz FPGA prototype [SPAA'07, Computing Frontiers'08]. Original explicit multi-threaded (XMT) architecture [SPAA98].
• Interconnection network for 128 cores: 9mm x 5mm, IBM 90nm process, 400 MHz prototype [HotInterconnects'07].
• Same design as the 64-core FPGA: 10mm x 10mm, IBM 90nm process, 150 MHz prototype.
• The design scales to 1000+ cores on-chip.
There has never been a successful general-purpose parallel computer (easy to program, good speedups, up- and down-scalable). IF you could program it: great speedups. Motivation: fix the IF.

Programmer's Model: Engineering Workflow
• Arbitrary CRCW work-depth algorithm. Reason about correctness & complexity in the synchronous model.
• SPMD reduced synchrony:
- Threads advance at their own speed, not in lockstep.
- Main construct: the spawn-join block. Note: can start any number of processes at once; can express locality ("decomposition-second").
- Prefix-sum (ps). Independence of order semantics (IOS).
- Establish correctness & complexity by relating to the WD analyses.
- Circumvents "the problem with threads", e.g., [Lee].
[Diagram: execution alternates serial segments with spawn-join parallel segments.]
• Tune (compiler or expert programmer): (i) length of the sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL08]
• Contrast with trial & error: a similar start, then while (insufficient inter-thread bandwidth) do { rethink the algorithm to take better advantage of the cache }.
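As a concrete illustration of the spawn-join plus prefix-sum (ps) pattern with IOS, here is a minimal sketch in standard C11, not XMTC syntax: the stand-ins are assumptions (an atomic fetch-and-add plays the role of ps, and a plain loop plays the role of one virtual thread per index of a spawn-join block). The task is array compaction: every "thread" that finds a nonzero element obtains a distinct output slot from a shared base.

    #include <stdatomic.h>
    #include <stdio.h>

    /* Illustrative emulation only (standard C11, not XMTC). Each index i below
     * plays the role of one virtual thread of a spawn-join block; the atomic
     * fetch-and-add plays the role of the prefix-sum (ps) primitive, handing
     * each thread a distinct slot in the output array. */
    int main(void)
    {
        enum { N = 10 };
        int src[N] = { 0, 7, 0, 3, 0, 0, 5, 0, 9, 0 };
        int dst[N];
        atomic_int base = 0;                 /* next free slot in dst */

        for (int i = 0; i < N; i++) {        /* stands in for: spawn one thread per i */
            if (src[i] != 0) {
                /* analogous to ps: returns the old value of base, then increments it */
                int slot = atomic_fetch_add(&base, 1);
                dst[slot] = src[i];
            }
        }                                    /* stands in for: join */

        int count = atomic_load(&base);
        printf("compacted %d nonzeros:", count);
        for (int i = 0; i < count; i++) printf(" %d", dst[i]);
        printf("\n");
        return 0;
    }

Because each thread only needs some distinct slot, any execution order of the threads yields a correct compaction; only the order of the elements in dst may vary. That order-independence is what IOS captures, and it is what lets correctness be argued against the WD analysis rather than against particular interleavings.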
Performance
• Simulation of 1024 processors: 100X speedup on a standard benchmark suite for VHDL gate-level simulation [GV06].
• [SPAA'09]: ~10X for the 64-processor XMT relative to an Intel Core 2 Duo, using the same silicon area as one commodity processor (core).
• Promise of 100X with 1024 processors, also for irregular, fine-grained parallelism, with up- and down-scalability.

Some Credits
• Grad students: George Caragea, James Edwards, David Ellison, Fuat Keceli, Beliz Saybasili, Alex Tzannes. Recent grads: Aydin Balkan, Mike Horak, Xingzhi Wen.
• Industry design experts (pro bono).
• Rajeev Barua, compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
• Gang Qu, VLSI and power. Co-advisor.
• Steve Nowick, Columbia U., asynchronous computing. Co-advisor. 2008 NSF team grant.
• Ron Tzur, U. Colorado, K-12 education. Co-advisor. 2008 NSF seed funding.
• K-12: Montgomery Blair Magnet HS, MD; Thomas Jefferson HS, VA; Baltimore (inner city) Ingenuity Project Middle School; 2009 Summer Camp, Montgomery County Public Schools.
• Marc Olano, UMBC, computer graphics. Co-advisor.
• Tali Moreshet, Swarthmore College, power. Co-advisor.
• Bernie Brooks, NIH. Co-advisor.
• Marty Peckerar, microelectronics.
• Igor Smolyaninov, electro-optics.
• Funding: NSF, NSA (2008 deployed XMT computer), NIH.
• 6 issued patents. More patent applications.
• Informal industry partner: Intel.
• Reinvention of Computing for Parallelism: selected as a Maryland Research Center of Excellence (MRCE) by USM. Not yet funded. 17 members, including UMBC, UMBI, UMSOM. Mostly applications.