Design Patterns for Parallel Programming. Research work by: Kurt Keutzer, EECS, UC Berkeley; Tim Mattson, Intel Corporation; Michael Wrinn, Intel Corporation. Presented to faculty at the Tsinghua Multicore Workshop. 1 Outline A Programming Pattern Language – patterns: why and what? Structural patterns Computational patterns Composing patterns: examples Concurrency patterns and PLPP Examples: pattern language case studies Examples: pattern language in Smoke Demo Motivation 2 Outline Motivation – what is the problem we are trying to solve? How do programmers think? Psychology meets computer science. Pattern Language for Parallel Programming Toy problems showing some key PLPP patterns A PLPP example (molecular dynamics) Expanding the effort: patterns for engineering parallel software A survey of some key patterns Case studies 3 Microprocessor trends Many-core processors are the “new normal”: ATI RV770 (160 cores), NVIDIA Tesla C1060 (240 cores), Intel Terascale research chip (80 cores), IBM Cell (1 CPU + 6 cores). 3rd party names are the property of their owners. The State of the field A harsh assessment … We have turned to multi-core chips not because of the success of our parallel software but because of our failure to continually increase CPU frequency. Result: a fundamental and dangerous (for the computer industry) mismatch. Parallel hardware is ubiquitous; parallel software is rare. The multi-core challenge: How do we make parallel software as ubiquitous as parallel hardware? This could be the greatest challenge ever faced by the computer industry. 5 We need to create a new generation of parallel programmers Parallel programming is difficult, error-prone, and accessible only to a small cadre of experts … Clearly we haven’t been doing a very good job at educating programmers. Consider the following: All known programmers are human beings. Hence, human psychology, not hardware design, should guide our work on this problem.
We need to focus on the needs of the programmer, not the needs of the computer 6 Outline Motivation – what is the problem we are trying to solve? How do programmers think? Psychology meets computer science. Pattern Language for Parallel Programming Toy problems showing some key PLPP patterns A PLPP example (molecular dynamics) Expanding the effort: patterns for engineering parallel software A survey of some key patterns Case studies 7 Cognitive psychology and human reasoning Human beings are model builders: we build hierarchical complexes of mental models and understand sensory input in terms of these models. When input conflicts with the models, we tend to believe the models. Consider the following slide … Why did most of you see motion? Your brain’s visual system contains a model that says this combination of geometric shapes and colors implies motion. Your brain believes the model and not what your eyes see. To understand people and how we work, you need to understand the models we work with. Programming and models Programming is a process of successive refinement of a problem over a hierarchy of models. [Brooks83] Each model represents the problem at a different level of abstraction. The top levels express the problem in the original problem domain. The lower levels represent the problem in the computer’s domain. The models are informal, but detailed enough to support simulation. Model-based reasoning in programming Models: Specification; Problem Domain (problem specific: polygons, rays, molecules, etc.); Programming (OpenMP’s fork/join, Actors); Computation, AKA cost model (threads – shared memory; processes – shared nothing); Machine (registers, ALUs, caches, interconnects, etc.). Programming process: getting started. The programmer starts by constructing a high level model from the specification.
Successive refinement starts by using some combination of the following techniques: The problem’s state is defined in terms of objects belonging to abstract data types with meaning in the original problem domain. Key features of the solution, sometimes referred to as “beacons” [Wiedenbeck89], are identified and emphasized. The programming process Programmers use an informal, internal notation based on the problem, mathematics, programmer experience, etc. Within a class of programming languages, the program generated is only weakly dependent on the language. [Robertson90] [Petre88] Programmers think about code in chunks or “plans”. [Rist86] Low level plans code a specific operation: e.g. summing an array. High level or global plans relate to the overall program structure. Programming Process: Strategy + Opportunistic Refinement Common strategies are: Backwards goal chaining – start at the result and work backwards to generate sub-goals; continue recursively until plans emerge. Forward chaining – directly leap to plans when a problem is familiar. [Rist86] Opportunistic Refinement: [Petre90] Progress is made at multiple levels of abstraction. Effort is focused on the most productive level. Programming Process: the role of testing Programmers test the emerging solution throughout the programming process. Testing consists of two parts: Hypothesis generation: the programmer forms an idea of how a portion of the solution should behave. Simulation: the programmer runs a mental simulation of the solution within the problem models at the appropriate level of abstraction. [Guindom90] From psychology to software Hypothesis: A design pattern language provides the roadmap to apply results from the psychology of programming to software engineering: Design patterns capture the essence of plans. The structure of patterns in a pattern language should mirror the types of models programmers use.
Connections between patterns must fit well with goal-chaining and opportunistic refinement. References [Brooks83] R. Brooks, "Towards a theory of the comprehension of computer programs", International Journal of Man-Machine Studies, vol. 18, pp. 543-554, 1983. [Guindom90] R. Guindon, "Knowledge exploited by experts during software system design", Int. J. Man-Machine Studies, vol. 33, pp. 279-304, 1990. [Hoc90] J.-M. Hoc, T.R.G. Green, R. Samurcay and D.J. Gilmore (eds.), Psychology of Programming, Academic Press Ltd., 1990. [Petre88] M. Petre and R.L. Winder, "Issues governing the suitability of programming languages for programming tasks", in People and Computers IV: Proceedings of HCI-88, Cambridge University Press, 1988. [Petre90] M. Petre, "Expert Programmers and Programming Languages", in [Hoc90], p. 103, 1990. [Rist86] R.S. Rist, "Plans in programming: definition, demonstration and development", in E. Soloway and S. Iyengar (eds.), Empirical Studies of Programmers, Norwood, NJ: Ablex, 1986. [Rist90] R.S. Rist, "Variability in program design: the interaction of process with knowledge", International Journal of Man-Machine Studies, vol. 33, pp. 305-322, 1990. [Robertson90] S. P. Robertson and C. Yu, "Common cognitive representations of program code across tasks and languages", Int. J. Man-Machine Studies, vol. 33, pp. 343-360, 1990. [Wiedenbeck89] S. Wiedenbeck and J. Scholtz, "Beacons: a knowledge structure in program comprehension", in G. Salvendy and M.J. Smith (eds.), Designing and Using Human-Computer Interfaces and Knowledge-Based Systems, Amsterdam: Elsevier, 1989. 18 People, Patterns, and Frameworks The Application Developer uses application design patterns (e.g. feature extraction) to design the application; the Application-Framework Developer uses programming design patterns (e.g. MapReduce) to design the application framework. The Application Developer uses application frameworks (e.g.
CBIR) to develop the application; the Application-Framework Developer uses programming design patterns (e.g. MapReduce) to develop the application framework. Eventually: [Diagram: 1. Domain Experts + application patterns & frameworks → end-user application programs; 2. Domain-literate programming gurus (1% of the population) + parallel patterns & programming frameworks → application frameworks; 3. Parallel programming gurus (1–10% of programmers) → parallel programming frameworks.] The hope is for Domain Experts to create parallel code with little or no understanding of parallel programming. Leave hardcore “bare metal” efficiency layer programming to the parallel programming experts. Today: [Diagram: the same three-layer stack — Domain Experts, domain-literate programming gurus (1% of the population), and parallel programming gurus (1–10% of programmers).] • For the foreseeable future, domain experts, application framework builders, and parallel programming gurus will all need to learn the entire stack. • That’s why you all need to be here today! Definitions - 1 Design Patterns: “Each design pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice.” Page x, A Pattern Language, Christopher Alexander Structural patterns: design patterns that provide solutions to problems associated with the development of program structure Computational patterns: design patterns that provide solutions to recurrent computational problems Definitions - 2 Library: The software implementation of a computational pattern (e.g. BLAS) or a particular sub-problem (e.g. matrix multiply) Framework: An extensible software environment (e.g. Ruby on Rails) organized around a structural pattern (e.g.
model-view-controller) that allows for programmer customization only in harmony with the structural pattern Domain specific language: A programming language (e.g. Matlab) that provides language constructs that specifically support a particular application domain. The language may also supply library support for common computations in that domain (e.g. BLAS). If the language is restricted to maintain fidelity to a structure and provides library support for common computations, then it encompasses a framework (e.g. NPClick). Getting our software act together: First step … Define a conceptual roadmap to guide our work 13 dwarves Alexander’s Pattern Language Christopher Alexander’s approach to (civil) architecture: "Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice.” Page x, A Pattern Language, Christopher Alexander Alexander’s 253 (civil) architectural patterns range from the creation of cities (2. distribution of towns) to particular building problems (232. roof cap) A pattern language is an organized way of tackling an architectural problem using patterns Main limitation: It’s about civil, not software, architecture!!! 25 Alexander’s Pattern Language (95-103) Lay out the overall arrangement of a group of buildings: the height and number of these buildings, the entrances to the site, main parking areas, and lines of movement through the complex. 95. Building Complex 96. Number of Stories 97. Shielded Parking 98. Circulation Realms 99. Main Building 100. Pedestrian Street 101. Building Thoroughfare 102. Family of Entrances 103. Small Parking Lots 26 Family of Entrances (102) May be part of Circulation Realms (98).
Conflict: When a person arrives in a complex of offices or services or workshops, or in a group of related houses, there is a good chance he will experience confusion unless the whole collection is laid out before him, so that he can see the entrance of the place where he is going. Resolution: Lay out the entrances to form a family. This means: 1) They form a group, are visible together, and each is visible from all the others. 2) They are all broadly similar, for instance all porches, or all gates in a wall, or all marked by a similar kind of doorway. May contain Main Entrance (110), Entrance Transition (112), Entrance Room (130), Reception Welcomes You (149). 27 Family of Entrances (photo: http://www.intbau.org/Images/Steele/Badran5a.jpg) 28 Computational Patterns The Dwarfs from “The Berkeley View” (Asanovic et al.) Dwarfs form our key computational patterns 29 Patterns for Parallel Programming • PLPP is the first attempt to develop a complete pattern language for parallel software development. • PLPP is a great model for a pattern language for parallel software. • PLPP mined scientific applications that utilize a monolithic application style. • PLPP doesn’t help us much with horizontal composition. • Much more useful to us than: Design Patterns: Elements of Reusable Object-Oriented Software, Gamma, Helm, Johnson & Vlissides, Addison-Wesley, 1995.
30 Structural programming patterns In order to create more complex software it is necessary to compose programming patterns. For this purpose, it has been useful to identify a set of patterns known as “architectural styles”. Examples: pipe and filter; event based/event driven; layered; agent and repository/blackboard; process control; model-view-controller 31 Putting it all together… 13 dwarves Elements of a Pattern Description • Name • Problem: classes of problems this pattern addresses • Context: context in which this problem occurs • Forces: trade-offs that crop up in this situation • Solution: the solution the pattern embodies • Invariants: properties that need to always be true for this pattern to work • Examples • Known uses • Related Patterns 33 Programming Pattern Language 1.0 Keutzer & Mattson Applications Productivity Layer Choose your high level structure – what is the structure of my application? Guided expansion: Pipe-and-filter, Agent and Repository, Process Control, Event based (implicit invocation), Model-view-controller, Iterator, MapReduce, Layered systems, Arbitrary Static Task Graph. Identify the key computational patterns – what are my key computations? Guided instantiation: Graph Algorithms, Dynamic Programming, Dense Linear Algebra, Sparse Linear Algebra, Unstructured Grids, Structured Grids, Graphical models, Finite state machines, Backtrack Branch and Bound, N-Body methods, Circuits, Spectral Methods. Choose your high level architecture – Guided decomposition: Task Decomposition ↔ Data Decomposition; Group Tasks; Order groups; data sharing; data access. Efficiency Layer Refine the structure – what concurrent approach do I use? Guided re-organization: Event Based, Data Parallelism, Pipeline, Task Parallelism, Divide and Conquer, Geometric Decomposition, Discrete Event, Graph algorithms. Utilize Supporting Structures – how do I implement my concurrency?
Guided mapping: Fork/Join, CSP, Master/worker, Loop Parallelism, BSP; Shared Queue, Shared Hash Table, Shared Data, Distributed Array; Digital Circuits. Implementation methods – what are the building blocks of parallel programming? Guided implementation: Thread creation/destruction, Process creation/destruction, Message passing, Collective communication, Speculation, Transactional memory, Barriers, Mutex, Semaphores. 34 Architecting Parallel Software Decompose Tasks Decompose Data • Group tasks • Identify data sharing • Order tasks • Identify data access Identify the Software Structure Identify the Key Computations 35 Identify the SW Structure Structural Patterns • Pipe-and-Filter • Agent-and-Repository • Event-based coordination • Iterator • MapReduce • Process Control • Layered Systems These define the structure of our software but they do not describe what is computed 36 Analogy: Layout of Factory Plant 37 Identify Key Computations Computational Patterns Computational patterns describe the key computations but not how they are implemented 38 Analogy: Machinery of the Factory 39 Architecting Parallel Software Decompose Tasks/Data Order tasks Identify Data Sharing and Access Identify the Software Structure • Pipe-and-Filter • Agent-and-Repository • Event-based • Bulk Synchronous • MapReduce • Layered Systems • Arbitrary Task Graphs Identify the Key Computations • Graph Algorithms • Dynamic Programming • Dense/Sparse Linear Algebra • (Un)Structured Grids • Graphical Models • Finite State Machines • Backtrack Branch-and-Bound • N-Body Methods • Circuits • Spectral Methods 40 Analogy: Architected Factory Raises appropriate issues like scheduling, latency, throughput, workflow, resource management, capacity, etc. 41 Outline A Programming Pattern Language – patterns: why and what? Structural patterns Computational patterns Composing patterns: examples Concurrency patterns and PLPP Examples: pattern language case studies Examples: pattern language in Smoke Demo Motivation 42 Inventory of Structural Patterns 1.
pipe and filter 2. iterator 3. MapReduce 4. blackboard/agent and repository 5. process control 6. model-view controller 7. layered 8. event-based coordination Elements of a structural pattern Components are where the computation happens A configuration is a graph of components (vertices) and connectors (edges) A structural pattern may be described as a family of graphs. Connectors are where the communication happens 44 Pattern 1: Pipe and Filter • Filters embody computation; they only see inputs and produce outputs • Pipes embody communication [Figure: filters 1–7 connected by pipes; the graph may have feedback] Examples? 45 Examples of pipe and filter Almost every large software program has a pipe and filter structure at the highest level Compiler Image Retrieval System Logic optimizer 46 Pattern 2: Iterator Pattern Initialization condition Variety of functions performed asynchronously iterate Synchronize results of iteration No Exit condition met? Yes Examples? 47 Example of Iterator Pattern: Training a Classifier: SVM Training Iterator Structural Pattern Update surface iterate Identify Outlier All points within acceptable error? No Yes 48 Pattern 3: MapReduce To us, it means A map stage, where data is mapped onto independent computations A reduce stage, where the results of the map stage are summarized (i.e. reduced) Map Map Reduce Reduce Examples? 49 Examples of MapReduce General structure: Map a computation across distributed data sets Reduce the results to find the best/(worst), maxima/(minima) Support-vector machines (ML) • Map to evaluate distance from the frontier • Reduce to find the greatest outlier from the frontier Speech recognition • Map HMM computation to evaluate word match • Reduce to find the most likely word sequences 50 Pattern 4: Agent and Repository Agent 2 Agent 1 Repository/ Blackboard (i.e. database) Agent 3 Examples?
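The MapReduce structure just described — independent map computations followed by a summarizing reduce — can be sketched in Python. This is a minimal illustration, not code from the slides; the outlier-scoring function and the frontier value are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def outlier_score(point):
    # Hypothetical map-stage computation: distance from a
    # decision frontier at 5 (stand-in for an SVM-style score).
    return abs(point - 5)

def find_greatest_outlier(points):
    # Map stage: score every point independently (parallelizable,
    # since each computation touches only its own input).
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(outlier_score, points))
    # Reduce stage: summarize the mapped results into one answer.
    return reduce(max, scores)

print(find_greatest_outlier([1, 4, 9]))  # prints 4
```

The map stage exposes all of the concurrency; the reduce stage is the only point where results must be combined.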
Agent 4 Agent and repository: Blackboard structural pattern Agents cooperate on a shared medium to produce a result Key elements: Blackboard: repository of the resulting creation that is shared by all agents (circuit database) Agents: intelligent agents that will act on the blackboard (optimizations) Manager: orchestrates agents’ access to the blackboard and creation of the aggregate results (scheduler) 51 Example: Compiler Optimization Common-subexpression elimination Constant folding Loop fusion Software pipelining Internal program representation Strength reduction Dead-code elimination Optimization of a software program Intermediate representation of the program is stored in the repository Individual agents have heuristics to optimize the program Manager orchestrates the access of the optimization agents to the program in the repository Resulting program is left in the repository 52 Example: Logic Optimization timing opt agent 1 timing opt agent 2 timing opt agent 3 … timing opt agent N Circuit Database Optimization of integrated circuits Integrated circuit is stored in the repository Individual agents have heuristics to optimize the circuitry of an integrated circuit Manager orchestrates the access of the optimization agents to the circuit repository Resulting optimized circuit is left in the repository 53 Pattern 5: Process Control manipulated variables control parameters controller input variables process controlled variables Source: Adapted from Shaw & Garlan 1996, pp. 27-31. Process control: Process: underlying phenomena to be controlled/computed Actuator: task(s) affecting the process Sensor: task(s) which analyze the state of the process Controller: task which determines which actuators should be actuated Examples?
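The agent-and-repository structure above — a shared blackboard, agents that each apply one optimization, and a manager that orchestrates access — can be sketched in Python. The toy "program" string and the two optimizations are purely illustrative:

```python
import threading

class Repository:
    """Blackboard: the shared intermediate representation of the program."""
    def __init__(self, program):
        self.program = program
        self.lock = threading.Lock()  # serializes agents' access

def constant_folding(repo):
    # Agent 1: fold the constant expression 1+2 into 3.
    with repo.lock:
        repo.program = repo.program.replace("1+2", "3")

def dead_code_elimination(repo):
    # Agent 2: remove the useless 'nop' statement.
    with repo.lock:
        repo.program = repo.program.replace("nop; ", "")

# Manager: launches the agents and collects the result from the repository.
repo = Repository("x=1+2; nop; y=x;")
agents = [threading.Thread(target=a, args=(repo,))
          for a in (constant_folding, dead_code_elimination)]
for t in agents: t.start()
for t in agents: t.join()
print(repo.program)  # prints "x=3; y=x;"
```

The two agents commute here, so the result is the same in either interleaving; in a real optimizer the manager's scheduling policy matters.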
54 Examples of Process Control user timing constraints Timing constraints controller Process control structural pattern Circuit 55 Pattern 6: Model-View-Controller • Model: embodies the data and “intelligence” (aka business logic) of the system • View: renders the current state of the model for the user • Controller: captures all user input and translates it into actions on the model Examples? 56 Example of Model-View-Controller [Figure: the user updates form values (a = 50%, b = 30%, c = 20%) and presses the ‘View 1’ button; the controller selects the view; the state change flows to the model; the view determines the model state and renders it.] Extended from: Design Patterns: Elements of Reusable Object-Oriented Software 57 Pattern 7: Layered Systems Individual layers are big but the interface between two adjacent layers is narrow Non-adjacent layers cannot communicate directly. Examples? 58 Example: ISO Network Protocol 59 Pattern 8: Event-based Systems Agents interact via events/signals in a medium The event manager manages events Interaction among agents is dynamic – no fixed connection Examples? 60 Example: The Internet • Internet is the medium • Computers are agents • Signals are IP packets • Control plane of the router is the event manager 128.0.0.56 61 Remember the Analogy: Layout of Factory Plant We have only talked about structure. We haven’t described computation. 62 Architecting Parallel Software Decompose Tasks Decompose Data • Group tasks • Identify data sharing • Order tasks • Identify data access Identify the Software Structure Identify the Key Computations 63 Outline A Programming Pattern Language – patterns: why and what?
Structural patterns Computational patterns Composing patterns: examples Concurrency patterns and PLPP Examples: design patterns in Smoke Demo Motivation 64 Computational Patterns The dwarfs from “The Berkeley View” form our key computational patterns 65 Problem and Context Problem: In many situations the proper behavior of a system can naturally be described by a language of finite, or perhaps infinite, strings. The problem is to define a piece of software that distinguishes between valid input strings (associated with proper behavior) and invalid input strings (improper behavior). The system may have a set of pre-defined responses to proper input and another set of responses to improper input. Context: As inputs arrive, a system must respond depending on the input. Alternatively, depending on inputs, changes in a process may be actuated. The proper response of the system may be to idle after receiving a sequence of inputs, or the input stream may be presumed to be infinite. 66 Solution: Finite State Machines Transducer FSM: Huffman decoding. Symbol → codeword: A → 0, B → 10, C → 1100, D → 1101, E → 1110, F → 1111. [Figure: state diagram over states a–e with transitions labeled input/output, e.g. 0/A on a, 0/B on b, 0/C and 1/D on d, 0/E and 1/F on e.] Input alphabet: 0, 1. Output alphabet: A–F. States: a, b, c, d, e. Transitions: ((a, 0), (a, A)), ((a, 1), (b, –)), … Initial state: a. After: Multimedia Image and Video Processing by Ling Guan, Jan Larsen 67 Example: Traffic Light State Machine [Figure: two-state controller driven by car sensors A and B, with transitions such as “IF A=1 AND B=0”, “IF A=0 AND B=1”, “Always”, and “Otherwise”.] Note: the clock beats every 4 sec, so the light is yellow for 4 sec. 68 Problem and Context Problem: In many problems the output is a simple logical function, or bit-wise permutation, of the input. Context: A vector of Boolean values is applied as input and another set, defined by combinational operators or “wiring”, is given as output. 69 Solution: Circuits Describe the desired function as a circuit 70 Example: Data Encryption Standard (DES) [Figure: DES circuit applying round keys K1 … Kp to a chosen plaintext, with mask, SP, and test logic.]
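The Huffman-decoding transducer FSM above maps a bit string to output symbols; a table-driven sketch in Python follows. The transition table is reconstructed from the codeword table on the slide (A=0, B=10, C=1100, D=1101, E=1110, F=1111), not copied from any original code:

```python
# Transducer FSM: (state, input bit) -> (next state, output symbol or None).
TRANSITIONS = {
    ('a', '0'): ('a', 'A'), ('a', '1'): ('b', None),
    ('b', '0'): ('a', 'B'), ('b', '1'): ('c', None),
    ('c', '0'): ('d', None), ('c', '1'): ('e', None),
    ('d', '0'): ('a', 'C'), ('d', '1'): ('a', 'D'),
    ('e', '0'): ('a', 'E'), ('e', '1'): ('a', 'F'),
}

def decode(bits):
    state, out = 'a', []
    for bit in bits:                      # consume one input symbol per step
        state, symbol = TRANSITIONS[(state, bit)]
        if symbol is not None:            # the transducer emits on some transitions
            out.append(symbol)
    return ''.join(out)

print(decode('0101100'))  # codewords 0, 10, 1100 -> prints "ABC"
```

Because the Huffman code is prefix-free, every codeword ends on a transition back to the initial state a, so the machine needs no separate "accept" bookkeeping.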
71 Outline A Programming Pattern Language – patterns: why and what? Structural patterns Computational patterns Composing patterns: examples Concurrency patterns and PLPP Examples: pattern language case studies Examples: pattern language in Smoke Demo Motivation 72 Architecture of Logic Optimization [Figure: decompose data; group and order tasks; the graph algorithm pattern applied at each stage.] 73 Architecting Speech Recognition [Figure: Large Vocabulary Continuous Speech Recognition — Voice Input → Signal Processing → Inference Engine (Beam Search Iterations with Active State Computation Steps over a Recognition Network / Graphical Model) → Most Likely Word Sequence; patterns used: Pipe-and-filter, Iterator, Dynamic Programming, MapReduce.] Poster: Chong, Yi. Work also to appear at Emerging Applications for Manycore Architecture 74 CBIR Application Framework [Figure: New Images → Feature Extraction → Learn Classifier → Exercise Classifier → Results; User Feedback → Choose Examples closes the loop.] Catanzaro, Sundaram, Keutzer, “Fast SVM Training and Classification on Graphics Processors”, ICML 2008 75 Feature Extraction Image histograms are common to many feature extraction procedures, and are an important feature in their own right. • Agent and Repository: each agent computes a local transform of the image, plus a local histogram. • Results are combined in the repository, which contains the global histogram. The data dependent access patterns found when constructing histograms make them a natural fit for the agent and repository pattern 76 Learn Classifier: SVM Training [Figure: iterate { Update Optimality Conditions (MapReduce); Select Working Set, Solve QP (MapReduce) } — Bulk Synchronous structural pattern.] 77 Exercise Classifier: SVM Classification [Figure: Test Data and SVs → Compute dot products (Dense Linear Algebra) → Compute Kernel values, sum & scale (MapReduce) → Output.] 78 Outline A Programming Pattern Language – patterns: why and what?
Structural patterns Computational patterns Composing patterns: examples Concurrency patterns and PLPP Examples: pattern language case studies Examples: pattern language in Smoke Demo Motivation 79 Recap: what we’re trying to accomplish Ultimately, we want domain-expert programmers to routinely create parallel programs. Domain-expert programmers: Driven by the need to ship robust solutions on schedule. New features are more important than performance. Little or no background in computer science or concurrency. Finished with formal education … “on the job” learning in burst mode. Note: we are not trying to tell concurrency experts how to write parallel programs. Expert parallel programmers … they already have what they need. HPC programmers … they want performance at all costs and will work as close to the hardware as needed to hit performance goals. 80 How do we support domain experts? [Diagram: 1. Domain Experts + application frameworks → end-user application programs; 2. Domain-literate programming gurus (1% of the population) + parallel programming frameworks → application frameworks; 3. Parallel programming gurus (1–10% of programmers) → parallel programming frameworks.] The hope is for Domain Experts to create parallel code with little or no understanding of parallel programming. Leave hardcore “bare metal” efficiency layer programming to the parallel programming experts. 81 How will we make this happen? Software architecture systematically described in terms of a design pattern language: Everyone … even the “gurus” … needs direction. The “parallel programming gurus” need to know the variety of parallel programming patterns in use. The “domain literate programming gurus” need to know which application patterns to support. … and we need to document these so everyone can work with them at the appropriate level of detail. By expressing parallel programming in terms of design pattern languages: We provide this direction to the gurus We document the frameworks and how they work.
We lay out a roadmap to solving the parallel programming problem. 82 Programming Pattern Language 1.0 Keutzer & Mattson Applications Productivity Layer Choose your high level structure – what is the structure of my application? Guided expansion: Pipe-and-filter, Agent and Repository, Process Control, Event based (implicit invocation), Model-view-controller, Iterator, MapReduce, Layered systems, Arbitrary Static Task Graph. Identify the key computational patterns – what are my key computations? Guided instantiation: Graph Algorithms, Dynamic Programming, Dense Linear Algebra, Sparse Linear Algebra, Unstructured Grids, Structured Grids, Graphical models, Finite state machines, Backtrack Branch and Bound, N-Body methods, Circuits, Spectral Methods. Choose your high level architecture – Guided decomposition: Task Decomposition ↔ Data Decomposition; Group Tasks; Order groups; data sharing; data access. Efficiency Layer Refine the structure – what concurrent approach do I use? Guided re-organization: Event Based, Data Parallelism, Pipeline, Task Parallelism, Divide and Conquer, Geometric Decomposition, Discrete Event, Graph algorithms. Utilize Supporting Structures – how do I implement my concurrency? Guided mapping: Fork/Join, CSP, Master/worker, Loop Parallelism, BSP; Shared Queue, Shared Hash Table, Shared Data, Distributed Array; Digital Circuits. Implementation methods – what are the building blocks of parallel programming? Guided implementation: Thread creation/destruction, Process creation/destruction, Message passing, Collective communication, Speculation, Transactional memory, Barriers, Mutex, Semaphores. 83 Programming Pattern Language 1.0 Keutzer & Mattson Top levels — structural and computational patterns. Note: in many cases, these pertain to good software practices, both serial and parallel. The lower levels are all about parallel algorithms and how to turn them into code. Many books talk about how to use a particular programming language … we instead focus on how to use these languages and “think parallel”.
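At the implementation-methods level listed above (thread creation/destruction, barriers, and so on), the SPMD idea that runs through the supporting structures — every unit of execution runs the same code and uses its id to pick its share of the work — can be sketched in Python. The function names and the cyclic work distribution are illustrative, not from the slides:

```python
import threading

def spmd_sum_of_squares(N, num_ues=4):
    """Each unit of execution (UE) runs the same body, striding
    through the iteration space by its own id -- SPMD style."""
    partial = [0] * num_ues               # one slot per UE: no shared writes

    def ue_body(uid):
        for i in range(uid, N, num_ues):  # cyclic distribution by UE id
            partial[uid] += i * i         # the per-iteration "func(i)"

    ues = [threading.Thread(target=ue_body, args=(u,)) for u in range(num_ues)]
    for t in ues: t.start()
    for t in ues: t.join()                # join acts as a barrier here
    return sum(partial)                   # final reduction of per-UE results

print(spmd_sum_of_squares(10))  # prints 285 (0**2 + 1**2 + ... + 9**2)
```

Giving each UE its own accumulator slot avoids the mutex/semaphore machinery entirely; the only synchronization left is the implicit barrier at the join.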
Applications Productivity Layer Choose your high level structure – what is the structure of my application? Guided expansion: Pipe-and-filter, Agent and Repository, Process Control, Event based (implicit invocation), Model-view-controller, Iterator, MapReduce, Layered systems, Arbitrary Static Task Graph. Identify the key computational patterns – what are my key computations? Guided instantiation: Graph Algorithms, Dynamic Programming, Dense Linear Algebra, Sparse Linear Algebra, Unstructured Grids, Structured Grids, Graphical models, Finite state machines, Backtrack Branch and Bound, N-Body methods, Circuits, Spectral Methods. Choose your high level architecture – Guided decomposition: Task Decomposition ↔ Data Decomposition; Group Tasks; Order groups; data sharing; data access. Efficiency Layer Refine the structure – what concurrent approach do I use? Guided re-organization: Event Based, Data Parallelism, Pipeline, Task Parallelism, Divide and Conquer, Geometric Decomposition, Discrete Event, Graph algorithms. Utilize Supporting Structures – how do I implement my concurrency? Guided mapping: Fork/Join, CSP, Master/worker, Loop Parallelism, BSP; Shared Queue, Shared Hash Table, Shared Data, Distributed Array. Implementation methods – what are the building blocks of parallel programming? Guided implementation: Thread creation/destruction, Process creation/destruction, Message passing, Collective communication, Speculation, Transactional memory, Barriers, Mutex, Semaphores. 84 PLPP’s structure: Four design spaces in parallel software development Decomposition Original Problem Tasks, shared and local data Implementation
& building blocks Units of execution + new shared data for extracted dependencies

Program SPMD_Emb_Par ()
{
   TYPE *tmp, *func();
   global_array Data(TYPE);
   global_array Res(TYPE);
   int Num = get_num_procs();
   int id = get_proc_id();
   if (id==0) setup_problem(N, Data);
   for (int I= id; I<N; I=I+Num){
      tmp = func(I, Data);
      Res.accumulate(tmp);
   }
}

Corresponding source code (slide 85) Decomposition (Finding Concurrency) Start with a specification that solves the original problem -- finish with the problem decomposed into tasks, shared data, and a partial ordering. Start DependencyAnalysis (GroupTasks, OrderGroups, DataSharing) DecompositionAnalysis (TaskDecomposition, DataDecomposition) Design Evaluation decomposition Structural Computational Concurrency strategy Implementation strategy Par prog building blocks 86 Concurrency Strategy (Algorithm Structure) Start Organize By Flow of Data Regular? Pipeline Irregular? Event Based Coordination Organize By Tasks Linear? Task Parallelism Recursive? Divide and Conquer Organize By Data Linear? Geometric Decomposition Recursive?
Recursive Data.

Implementation Strategy (Supporting Structures)
High-level constructs impacting large-scale organization of the source code.
• Program structures: SPMD, Master/Worker, Loop Parallelism, Fork/Join
• Data structures: Shared Data, Shared Queue, Distributed Array

Parallel Programming Building Blocks
Low-level constructs implementing specific mechanisms used in parallel computing; examples in Java, OpenMP, and MPI. These are not properly design patterns, but they are included to make the pattern language self-contained.
• UE* management: thread control, process control
• Synchronization: memory sync/fences, barriers, mutual exclusion
• Communications: message passing, collective communication, other communication
(* UE = unit of execution)

Case Studies
• Simple examples
• Linear algebra: managing recursive parallelism
• Molecular dynamics: multi-level parallelism
• Branch and bound: abstracting complexity in data structures
• Spectral methods
• Numerical integration: a complete example from design to code
• Game engines: a complex example showing that there are multiple ways to parallelize a problem

The heart of a game is the "Game Engine"
Modules such as Front End, network, Input, Animation, Audio, Render, and Media (disk, optical media, assets) surround a central Sim, exchanging data/state with it. Sim holds the game state in its internally managed data structures.
Source: conversations with developers at Electronic Arts.

The heart of a game is the "Game Engine" (continued)
Sim runs the central time loop: it is the integrator, advancing the game from one time step to the next by calling update methods for the other modules (Physics, Particle System, Collision Detection, Animation, Audio, AI, Render, Input, Front End, network, Media) inside that loop. Sim holds the game state in its internally managed data structures.

Finding concurrency: functional parallelism
Combine modules into groups and assign one group to a thread. Execution is asynchronous; groups interact through events. The result is coarse-grained parallelism dominated by the flow of data between groups: the event-based coordination pattern.

The FindingConcurrency Design Space
Start with a specification that solves the original problem; finish with the problem decomposed into tasks, shared data, and a partial ordering (Decomposition Analysis: Task Decomposition, Data Decomposition; Dependency Analysis: Group Tasks, Order Groups, Data Sharing; Design Evaluation).
But many cores need many concurrent tasks … functional parallelism just doesn't expose enough concurrency in this case. We need to go back and rethink the design.

More sophisticated parallelization strategy
Decompose the computation into a pool of tasks: finer granularity exposes more concurrency. Work stealing is critical to support good load balancing.

Parallel Execution Strategy: the task parallelism pattern
Dependencies create an acyclic graph of tasks. Tasks are assigned priorities, with some labeled as optional.
A pool of threads executes tasks based on:
• when dependencies are met
• task priority
The challenge is scheduling based on bus usage, load balancing, and managing concurrency for correctness.

The AlgorithmStructure Design Space
Solution: a composition of two patterns:
• High-level architecture … functional decomposition (Organize By Flow of Data, irregular: Event-Based Coordination)
• Collections of tasks generated by the modules (Organize By Tasks, linear: Task Parallelism)

The Supporting Structures Design Space
• Top-level functional decomposition with MPI.
• Manage task queues in a global pool (to support work stealing): the Shared Queue structure.
• A standard queue won't work; we will need to build more general data structures to support prioritized queues and interrupt capabilities.
(Program structures: SPMD, Master/Worker, Loop Parallelism, Fork/Join. Data structures: Shared Data, Shared Queue, Distributed Array.)

Architecture/Framework implications
• Top-level structural pattern: event-based, implicit invocation
• Inside each module: fine-grained task decomposition
• Master-worker pattern extended with work stealing and task priorities
• Soft real-time constraints … interrupt, update state, flush queues
This would fit nicely into a framework … in fact it needs a framework, since the task queues would be very complicated to get right.
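The scheduling idea above, a pool of threads pulling ready tasks from a prioritized queue, can be sketched as follows. This is a minimal illustration, not the game engine's actual scheduler: the class, task names, and priorities are all invented, and a real pool would block on a condition variable instead of spinning.

```python
import heapq
import threading

class PriorityTaskPool:
    """Minimal sketch: worker threads run tasks in priority order
    (lower number = higher priority) once their dependencies are done."""
    def __init__(self, num_workers=4):
        self.heap, self.done = [], set()
        self.lock = threading.Lock()
        self.num_workers = num_workers

    def submit(self, name, priority, fn, deps=()):
        # (priority, name) must be unique so heap tuples stay comparable
        with self.lock:
            heapq.heappush(self.heap, (priority, name, fn, tuple(deps)))

    def _worker(self):
        while True:
            with self.lock:
                if not self.heap:
                    return
                # highest-priority task whose dependencies have all completed
                ready = [t for t in self.heap if all(d in self.done for d in t[3])]
                if not ready:
                    continue  # spin; a real pool would wait on a condition
                task = min(ready)
                self.heap.remove(task)
                heapq.heapify(self.heap)
            task[2]()                      # run the task outside the lock
            with self.lock:
                self.done.add(task[1])     # unblock dependent tasks

    def run(self):
        workers = [threading.Thread(target=self._worker)
                   for _ in range(self.num_workers)]
        for w in workers: w.start()
        for w in workers: w.join()

order = []
pool = PriorityTaskPool(num_workers=2)
pool.submit("physics", 1, lambda: order.append("physics"))
pool.submit("render", 2, lambda: order.append("render"), deps=["physics"])
pool.run()
# "render" only runs after "physics" has completed
```

The design choice mirrors the slide: readiness (dependencies met) gates eligibility, and priority breaks ties among eligible tasks.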
Outline
• The parallel computing landscape
• A Programming Pattern Language: overview; structural patterns; computational patterns; composing patterns: examples; concurrency patterns and PLPP; examples: simple case studies
• Applications in CAD: the CAD algorithm landscape; parallel patterns in CAD: overview; detailed case studies: architecting parallel CAD software

How does architecting SW help?
A good software architecture should provide:
• Composability: it lets you build something complicated by composing simple pieces.
• Modularity: elements of the software are firewalled from each other.
• Locality: it is easy to identify what's happening where.
• It may provide a basis for building a more general framework (e.g., model-view-controller and Ruby on Rails).
Patterns help by:
• Clearly identifying the problem
• Offering solutions that capture and embody a lot of shared experience with solving the problem
• Offering a common vocabulary for this problem/solution process
• Capturing common pitfalls of the solutions in the pattern (e.g., the repository will always become the bottleneck in agent-and-repository)

Overall Summary
• Manycore microprocessors are on the horizon, and computationally intensive applications, like computer-aided design, will want to exploit them.
• Incremental methods will not scale to exploit >= 32 processors: use of threads (OpenMP, Win32, others), incremental parallelization of serial codes, and use of threading tools and library support.
• We need to re-architect software, and the key to re-architecting software for manycore is patterns and frameworks.
• We identified two key computational patterns for CAD: graph algorithms and backtracking/branch-and-bound.
• There is some software support, in the form of the Cilk framework, for backtracking/branch-and-bound.
• The key to parallelizing graph algorithms is netlist/graph partitioning. We presented a number of approaches for effective graph partitioning, and we aim to use this to produce a framework for graph algorithms.

Design Patterns in education: lessons from history?
Early OO days (1994):
• Perception: "Object oriented? Isn't that just an academic thing?"
• Usage: specialists only. Mainstream avoids with indifference, anxiety, or worse.
• Performance: not so good.
Now:
• Perception: OO = programming. "Isn't this how it was always done?"
• Usage: cosmetically widespread; some key concepts actually deployed.
• Performance: so-so, masked by CPU advances until now.

Copyright © 2006, Intel Corporation. All rights reserved. Prices and availability subject to change without notice. *Other brands and names are the property of their respective owners.

Design Patterns in education: lessons from history?
Early PP days (now):
• Perception: "Parallel programming? Isn't that just an HPC thing?"
• Usage: specialists only. Mainstream avoids with indifference, anxiety, or worse.
• Performance: very good, for the specialists.
Future (2010?):
• Perception: PP = programming. "Isn't this how it was always done?"
• Usage: cosmetically widespread; some key concepts actually deployed.
• Performance: broadly sufficient. Application domains greatly expanded.

Smoke Pattern Decomposition
Michael Anderson, Mark Murphy, Kurt Keutzer
Copyright © 2009, Intel Corporation. All rights reserved. *Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.

Agenda
Characterize the architecture and computation of Smoke in terms of Our Pattern Language:
• What is an effective multi-threaded game architecture?
• Can the architecture and computation in Smoke be described in Our Pattern Language?
• Which components of Smoke can be easily sped up on manycore devices?
Games as a potential ParLab app:
• Where can we as grad students have the most impact?
• What else can manycore enable?
Our Pattern Language
A hierarchy of patterns that describe reusable components of software.

Structural and Computational Patterns
First, let's talk about the structural and computational patterns we found in Smoke.

Top-Level: Architecture
Smoke is composed of several subsystems; some (Physics, Graphics, …) are large independent engines. Data flowing between subsystems includes: Input (input data), AI (position, velocity, goal), Physics (position, velocity, collision), Graphics (position, vertices, texture), Audio (position, SFX).
Source: Brad Werth, GDC 2009
Top-Level v1: Task Graph Pattern
There exist fixed dependencies between subsystems (Input, AI, Physics, Effects, Graphics), so the engine can be modeled as an arbitrary task graph.
Example: moving the zombie: Keyboard -> AI -> Physics -> Graphics

Top-Level v1: Iterator Pattern
The engine iterates over consecutive frames; data from the previous frame is used in the next.

Top-Level v2: Puppeteer
Subsystems have private data but need access to other subsystems' data, and we don't want to write N*(N-1) interfaces. A common use of the puppeteer pattern is to manage communication between multiple simulators, each with different data structures.

Task Graph (v1) vs. Puppeteer (v2)
Subsystem modularity is important (subsystems should be swappable), and we want to reduce the complexity of communication for a scalable multi-threaded design. This motivates the use of the puppeteer pattern instead of an arbitrary task graph for subsystem communication: a Change Control Manager mediates between Input, Physics, Graphics, Effects, and AI through interfaces.
Source: Brad Werth, GDC 2009
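The puppeteer idea described above, one mediator owning all cross-subsystem communication instead of N*(N-1) pairwise interfaces, can be sketched like this. The class, topic names, and subsystem callbacks are all invented for illustration; Smoke's Change Control Manager is far richer.

```python
class Puppeteer:
    """Mediator sketch: subsystems publish state changes to the puppeteer,
    which forwards them to interested subsystems. No subsystem talks to
    another directly, so subsystems stay swappable."""
    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def register(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, data):
        for cb in self._subscribers.get(topic, []):
            cb(data)

# Hypothetical subsystems wired through the puppeteer:
puppeteer = Puppeteer()
seen = []
puppeteer.register("position", lambda d: seen.append(("graphics", d)))
puppeteer.register("position", lambda d: seen.append(("audio", d)))
puppeteer.publish("position", {"zombie": (3, 4)})
# both graphics and audio receive the update without knowing about the sender
```

Adding or swapping a subsystem means one `register` call rather than editing every other subsystem, which is the scalability argument the slide makes.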
Subsystem Patterns
Now that we've described the top-level structural patterns, let's look at the subsystems behind the Change Control Manager's interfaces: Input, Physics, Graphics, Effects, and AI.

Physics Subsystem
Pipe-and-filter structure (from the Havok User Guide).

Physics Subsystem: Constraint Solver
Minimize penalties and forces in a system subject to constraints: the Dense Linear Algebra computational pattern. (Image: robbocode.com)

Physics Subsystem: Collision Detection
Interpolate paths and test whether overlap occurs during the current timestep: the N-body computational pattern.
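The collision-detection idea above, test every pair of moving objects for overlap somewhere inside the timestep, has the classic all-pairs N-body structure. A toy sketch follows; the circle shapes, linear-motion interpolation, and sample count are all illustrative assumptions, not Havok's algorithm.

```python
def colliding_pairs(objects, dt, steps=8):
    """N-body pattern: check all pairs; approximate each object's path
    by sampling linear motion at a few points inside the timestep."""
    hits = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            a, b = objects[i], objects[j]
            for k in range(steps + 1):
                t = dt * k / steps
                ax, ay = a["x"] + a["vx"] * t, a["y"] + a["vy"] * t
                bx, by = b["x"] + b["vx"] * t, b["y"] + b["vy"] * t
                # overlap when center distance <= sum of radii
                if (ax - bx) ** 2 + (ay - by) ** 2 <= (a["r"] + b["r"]) ** 2:
                    hits.append((i, j))
                    break
    return hits

objs = [
    {"x": 0.0, "y": 0.0, "vx": 1.0, "vy": 0.0, "r": 0.1},   # moving right
    {"x": 1.0, "y": 0.0, "vx": -1.0, "vy": 0.0, "r": 0.1},  # moving left
    {"x": 5.0, "y": 5.0, "vx": 0.0, "vy": 0.0, "r": 0.1},   # far away
]
print(colliding_pairs(objs, dt=1.0))  # [(0, 1)]
```

The O(N^2) pair loop is what makes this an N-body computation, and it is why a speedup here directly buys more active objects per frame.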
AI Subsystem
The Agent and Repository structural pattern:
• Multiple agents access and modify shared data.
• Examples: version control systems, logic optimization, parallel SAT solvers.
• Structure: Agent 1, Agent 2, …, Agent N sharing a repository (i.e., a database).

AI Subsystem (continued)
Zombies and chickens update their locations in a point-of-interest (POI) repository. Having one writer (the zombie) and many readers (the chickens) reduces the burden on the repository controller. Agents (the animals) are independent state machines.
Source: Brad Werth, GDC 2009

Effects Subsystem
Procedural fire: two particle simulators, visible fire and invisible "heat" particles. This is the N-body computational pattern.
Source: Hugh Smith, Intel

Parallel Implementation Patterns
We've described the structural and computational patterns. What about the lower layers?

Parallel Algorithm Strategy Patterns (1)
At the top level, subsystems can run concurrently. Communication within a single frame is eliminated by double-buffering: Graphics, Physics, AI, and the rest all read Frame 1 data and write Frame 2 data. Smoke exploits this task parallelism (the subject of Lab 2).
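The double-buffering scheme just described can be sketched in a toy form: each subsystem reads only the previous frame's buffer and writes only the next frame's buffer, so subsystems within a frame never contend. The subsystem functions and state fields here are invented, not Smoke's code.

```python
import threading

def run_frame(read_buf, write_buf, subsystems):
    """All subsystems run concurrently; each reads the shared read_buf
    (frame N) and writes its contribution to write_buf (frame N+1),
    so no intra-frame communication is needed."""
    threads = [
        threading.Thread(target=lambda s=s: write_buf.update(s(read_buf)))
        for s in subsystems
    ]
    for t in threads: t.start()
    for t in threads: t.join()

# Hypothetical subsystems: each returns its piece of the next frame's state.
physics = lambda prev: {"pos": prev["pos"] + prev["vel"]}
ai      = lambda prev: {"goal": prev["pos"] + 10}

frame1 = {"pos": 5, "vel": 2}
frame2 = {}
run_frame(frame1, frame2, [physics, ai])
print(sorted(frame2.items()))  # [('goal', 15), ('pos', 7)]
```

Because no subsystem reads what another wrote this frame, the pattern trades one frame of latency for lock-free task parallelism, exactly the trade the slide describes.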
Parallel Algorithm Strategy Patterns (2)
Procedural fire is highly data parallel; Smoke spreads the computation over 8 threads (ideal hardware utilization).
Sources: Hugh Smith, Intel; Brad Werth, GDC 2009

Parallel Algorithm Strategy Patterns (3)
The AI subsystem exploits task parallelism among independent agents (the chickens). Thread profiles before and after show the AI work spread across threads alongside rendering.
Source: Brad Werth, GDC 2009

Implementation Strategy and Concurrent Execution Patterns
Task queue -> thread pool, using Intel Threading Building Blocks.
Source: Brad Werth, GDC 2009

What was done in the demo?
• Top level: task parallelism among subsystems
• Procedural fire: data-parallel particle simulator
• AI subsystem: task parallelism among agents
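The data-parallel particle simulation mentioned above can be illustrated by splitting the particle array into chunks, one per thread; each thread updates a disjoint chunk, so no locking is needed. The update rule and field names are a toy stand-in for the real fire simulator.

```python
import threading

def update_particles(particles, num_threads=8):
    """Data parallelism sketch: each thread updates a disjoint chunk of
    the particle array in place, so no synchronization is required."""
    def work(chunk):
        for p in chunk:
            p["y"] += p["vy"]  # toy rule: a rising heat particle

    n = len(particles)
    step = (n + num_threads - 1) // num_threads  # ceil(n / num_threads)
    threads = [threading.Thread(target=work, args=(particles[i:i + step],))
               for i in range(0, n, step)]
    for t in threads: t.start()
    for t in threads: t.join()

particles = [{"y": 0.0, "vy": 1.5} for _ in range(1000)]
update_particles(particles)
# every particle has been advanced exactly once
```

Because each particle is touched by exactly one thread, the decomposition is embarrassingly parallel, which is why the slide can claim near-ideal utilization of 8 hardware threads.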
What else can be done? (1)
Add data-parallel effects: rain, ragdoll corpses, more things breaking apart.
• Similar to procedural fire
• Can easily be added without changing gameplay
• Objects may or may not be registered with the scene graph
(Particle simulator; image: nvidia.com)

What else can be done? (2)
Speed up and enhance the physics subsystem.
• Independent "simulation islands" can be executed in parallel
• Parallel constraint solver and collision detection
• Speedup allows for more detailed and realistic simulation: more active independent objects in the simulation, and new algorithms brought in from scientific computing
• Must ensure the worst-case computation can fit in a single frame
• Nvidia PhysX is working on this

What else can be done? (3)
Enable more complex AI.
• Smoke's AI is very simple. What is AI like in larger games? If AI is just state machines, we can add more.
Add new interfaces.
• Computer vision pose detection as input
• Overlaps well with our current research
• Microsoft Xbox: Project Natal (Xbox.com/projectnatal)
Summary and conclusions
Characterizing the architecture and computation of Smoke in terms of Our Pattern Language:
• What is an effective multi-threaded game architecture? The puppeteer pattern allows for modularity and ease of development and evolution.
• Can the architecture and computation in Smoke be described in Our Pattern Language? Yes; the description is natural.
• Which components of Smoke can be easily sped up on manycore devices? Data-parallel effects, physics, AI, and new interfaces all show potential benefit.