Saman Amarasinghe

Let's stick with current sequential languages
- Parallel programming is hard!
- Billions of lines of code are written in sequential languages
- Let the compiler do all the work
- Maintain the current strong machine abstraction

SUIF Parallelizing Compiler
- Monica Lam and the Stanford SUIF team, 1993-1997
- Automatically extract parallelism from sequential programs
- Heroic analysis: interprocedural analysis, array and scalar data-flow analysis, reduction and recurrence recognition, C to FORTRAN
- Achieved the best SPEC results of the day
- [Chart: SPEC floating-point benchmarks (swm256, tomcatv, su2cor, ear, hydro2d, nasa7, alvinn, mdljsp2, wave5, ora, mdljdp2, fpppp, doduc, spice2g6), showing MFLOPS (0 to 1,200) and the number of processors (1 to 8) used for each. Headline results: vector processor Cray C90, 540; uniprocessor Digital 21164, 508; SUIF on an 8-processor Digital 8400, 1,016.]
- But... the techniques were not robust enough for general use

Composition is key to building large systems, and in the sequential world it is implemented naturally via time-multiplexing. The framework for parallelizing sequential programs therefore keeps the sequential parts at the outermost level and relies on global barriers. With a parallel fraction p running on N cores:
- Speedup = 1 / ((1 - p) + p / N)
- Utilization = Speedup / N = 1 / (p + N * (1 - p))
- (A short worked calculation of these formulas appears at the end of this section.)
- [Chart: utilization versus number of cores, from 1 core around 2004 to the 4,096 cores expected around 2022, for parallel fractions of 90%, 99%, and 99.9%. Utilization collapses as the core count grows: even a 99%-parallel program uses less than 10% of a 1,024-core machine.]

Currently...
- Theory, algorithms, languages, and tools are all centered on the sequential paradigm, with a well-enforced machine abstraction
- The move to multicore is a fundamental shift, akin to the shift from analog to digital design
- We need a new abstraction in which parallelism is the primary form of expression: parallelism is simple, parallelism is natural, communication is intuitive, and programs are parallel compositions of sequential segments, with possible space-multiplexed execution

Parallel programming is still in the dark ages
- An elite community of practitioners
- Active open research, but little stable consensus
- Assumption: we simply don't know how to teach parallel programming!

Aim for a "Mead and Conway"-type revolution
- Develop simple, cookbook approaches; if we can't teach them, they are too complex!
- Make them accessible: carefully thought-out courseware, tools, texts, and courses
- Focus on the educational community: exporting, proselytizing, workshops, conferences, journals, ...

Three ways forward:
1. Move to a truly parallel world (long term). The natural world is extremely parallel; learn to emulate it. Can we make sequential programs a special case of parallel programs?
2. Rejoice when parallelism is natural (medium term). Switch to parallel languages wherever using them is easier than using sequential languages.
3. Help migrate legacy applications (short term). There is an existing large body of code that cannot be ignored, and it is written in sequential languages, so we need to work with them.
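As promised above, here is the worked calculation behind the utilization charts. This is a minimal plain-C sketch written for this writeup (it is not from the original slides); it simply evaluates the Amdahl's-law speedup and utilization formulas for the parallel fractions shown in the charts (90%, 99%, 99.9%) across core counts from 1 to 4,096.

    #include <stdio.h>

    /* Amdahl's law, as used in the utilization charts above:
     *   speedup(p, N)     = 1 / ((1 - p) + p / N)
     *   utilization(p, N) = speedup(p, N) / N = 1 / (p + N * (1 - p))
     * where p is the parallel fraction and N is the number of cores. */
    static double speedup(double p, double n)     { return 1.0 / ((1.0 - p) + p / n); }
    static double utilization(double p, double n) { return speedup(p, n) / n; }

    int main(void) {
        const double fractions[] = { 0.90, 0.99, 0.999 };
        for (int i = 0; i < 3; i++) {
            double p = fractions[i];
            printf("parallel fraction %.1f%%\n", 100.0 * p);
            for (int n = 1; n <= 4096; n *= 2)
                printf("  %4d cores: speedup %7.1f, utilization %5.1f%%\n",
                       n, speedup(p, n), 100.0 * utilization(p, n));
        }
        return 0;
    }

At 1,024 cores this prints a utilization of roughly 9% for the 99%-parallel case and under 1% for the 90%-parallel case, which is the collapse the charts illustrate.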
Rejoice when parallelism is natural
- Some domains are inherently parallel
- Coding them in a sequential language is harder than using the right parallel abstraction, and all information about the inherent parallelism is lost
- There are win-win situations: increase programmer productivity while extracting parallel performance
- Case in point: the streaming domain and the StreamIt experience

MPEG-2 decoder in StreamIt
[Figure: structured block-level diagram of an MPEG-2 decoder. The MPEG bit stream enters the VLD, which produces quantization coefficients <QC>, picture types <PT1, PT2>, macroblocks, and motion vectors. A splitter separates the frequency-encoded macroblocks (ZigZag, IQuantization, IDCT, Saturation) from the differentially coded motion vectors (Motion Vector Decode, Repeat); a joiner recombines the spatially encoded macroblocks and motion vectors. A second splitter routes the Y, Cb, and Cr channels through Motion Compensation against reference pictures <PT1> and Channel Upsample stages; after the joiner, the recovered picture passes through Picture Reorder <PT2> and Color Space Conversion.]
- A structured block-level diagram describes the computation and the flow of data
- It is conceptually easy to understand and gives a clean abstraction of functionality
- Mapping it to C (sequentialization) destroys this simple view
- In StreamIt, the decoder keeps the same structure as the diagram:

    add VLD(QC, PT1, PT2);
    add splitjoin {
      split roundrobin(NB, V);
      add pipeline {
        add ZigZag(B);
        add IQuantization(B) to QC;
        add IDCT(B);
        add Saturation(B);
      }
      add pipeline {
        add MotionVectorDecode();
        add Repeat(V, N);
      }
      join roundrobin(B, V);
    }
    add splitjoin {
      split roundrobin(4(B+V), B+V, B+V);
      add MotionCompensation(4(B+V)) to PT1;
      for (int i = 0; i < 2; i++) {
        add pipeline {
          add MotionCompensation(B+V) to PT1;
          add ChannelUpsample(B);
        }
      }
      join roundrobin(1, 1, 1);
    }
    add PictureReorder(3WH) to PT2;
    add ColorSpaceConversion(3WH);

The stream graph exposes three forms of parallelism:
- Task parallelism: thread (fork/join) parallelism that is explicit in the algorithm, between filters with no producer/consumer relationship
- Data parallelism: a data-parallel loop (forall) across the iterations of a stateless filter; filters with state cannot be parallelized this way (a small C sketch of this distinction appears at the end of this section)
- Pipeline parallelism: between producers and consumers, of the kind usually exploited in hardware; stateful filters can be parallelized this way

Benchmarks
[Figure: throughput normalized to StreamIt on a single core, measured on the 16-core MIT Raw processor (http://cag.csail.mit.edu/raw). Benchmarks include BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean; the vertical axis runs from 0 to 19x.]
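The stateless-versus-stateful distinction above is what decides whether a filter can be data-parallelized. The following sketch is not StreamIt and is not from the original talk; it is a minimal plain-C illustration of the same idea, with hypothetical saturate and running_sum stages (the OpenMP pragma is used only for convenience; compile with -fopenmp, otherwise the pragma is simply ignored).

    #include <stdio.h>

    #define N 1000000

    /* Stateless "filter": each output depends only on the corresponding
     * input, so the iterations are independent and the loop can run as a
     * data-parallel forall. */
    void saturate(const int *in, int *out, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            int v = in[i];
            out[i] = v > 255 ? 255 : (v < 0 ? 0 : v);
        }
    }

    /* Stateful "filter": the accumulator carries a value across
     * iterations, so this loop cannot simply be turned into a forall;
     * a stream compiler would fall back on pipeline parallelism here. */
    int running_sum(const int *in, int *out, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            acc += in[i];       /* loop-carried state */
            out[i] = acc;
        }
        return acc;
    }

    int main(void) {
        static int in[N], sat[N], sums[N];
        for (int i = 0; i < N; i++)
            in[i] = i % 512 - 128;
        saturate(in, sat, N);
        int total = running_sum(in, sums, N);
        printf("sat[10] = %d, total = %d\n", sat[10], total);
        return 0;
    }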
Help migrate legacy applications
While staying in the existing sequential language:
- Don't modify a code segment if its performance impact is insignificant and it is isolated from the rest, or if the automatic parallelizer already handles it perfectly
- Modify and annotate a segment if the automatic parallelizer needs a little help
Otherwise, rewrite the segment in a parallel language: program reincarnation, a new body with the same old soul.

Assisted Application Reincarnation tool
- Dynamic analysis: managed program execution, program invariant inference, and an application knowledge database
- Static analysis: automatic parallelization and information for program understanding
- Domain knowledge extraction: learn about the domain, flag domain-specific issues, and generate domain-specific hints
- Correctness of the reincarnated program: test generation and divergence analysis
- Refactoring identification: bring programs into the modern age
[Block diagram of the Assisted Application Reincarnation tool, connecting: the legacy program (source file .c, original compiler, original binary .exe); an instrumenter and binary interpreter; managed program execution producing .log traces; a program invariant inference engine; static analysis; the application knowledge base (program representation and invariants); an assisted-parallelization GUI tool with automatic parallelization, test generation, divergence analysis, refactoring identification, and known-idiom identification and domain-hint generation; a domain knowledge database; and a compiler and instrumenter that emit the reincarnated .c.]

To conclude:
- The multicore menace will impact all of us in a big way
- Parallelism needs to keep up with Moore's curve
- We will definitely need new parallel languages, in which parallelism is the primary form of composition
- There is low-hanging fruit where parallelism is the natural form of expression
- However, we cannot ignore past investments

© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.