Increasing Machine Throughput: Superscalar Processing and Multiprocessor Systems
CS 352 : Computer Organization and Design
University of Wisconsin-Eau Claire
Dan Ernst

Multiprocessing
• There are 3 generic ways to do multiple things "in parallel":
  – Instruction-level Parallelism (ILP)
    • Superscalar – doing multiple instructions (from a single program) simultaneously
  – Data-level Parallelism (DLP)
    • Do a single operation over a larger chunk of data
      – Vector processing
      – "SIMD extensions" like MMX
  – Thread-level Parallelism (TLP)
    • Multiple threads of execution
      – Can be separate programs
      – …or a single program broken into separate threads
    • Usually exploited on multiple processors, but not required

Superscalar Processors
• With pipelining, can you ever reduce the CPI to < 1?
  – Only if you issue > 1 instruction per cycle
• Superscalar means you add extra hardware to the pipeline to allow multiple instructions (usually 2-8) to occupy the same stage simultaneously
  – This is a form of parallelism, although a superscalar machine is never referred to as a parallel machine

How to Fill Pipeline Slots
• We've got lots of room to execute – now how do we fill the slots?
• This process is called scheduling
  – A schedule is created, telling instructions when they can execute
• 2 (very different) ways to do this:
  – Static scheduling
    • The compiler (or coder) arranges instructions into an order that can be executed correctly
  – Dynamic scheduling
    • Hardware in the processor reorders instructions at runtime to maximize the number executing in parallel

Static Scheduling Case Study
• Intel's IA-64 architecture (AKA Itanium)
  – First appeared in 2001 after enormous amounts of hype
• Independent instructions are collected by the compiler into a bundle
  – The instructions in each bundle are then executed in parallel
• Similar to previous static scheduling schemes
  – However, the compiler has more control than in standard VLIW machines
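To make the idea of compiler-scheduled independent instructions concrete, here is a minimal sketch in C. The function name, loop body, and unroll factor are invented for the example and are not IA-64 specifics: the point is only that the four statements in the unrolled body have no dependences on each other, so a static scheduler can place them in the same bundle (or issue group) and a wide machine can execute them in the same cycle.

```c
/* Illustrative only: independent operations a compiler can schedule in parallel.
 * The unroll factor of 4 is an assumption for the example, not an IA-64 detail. */
void scale4(float *dst, const float *src, float k, int n)
{
    /* assume n is a multiple of 4 for simplicity */
    for (int i = 0; i < n; i += 4) {
        /* These four multiplies touch different elements and do not use
         * each other's results, so they can be issued together. */
        dst[i + 0] = src[i + 0] * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
}
```

A dynamically scheduled superscalar core would discover the same independence at runtime; the difference is only who builds the schedule.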
Dynamic Pipeline Scheduling
• Allow the hardware to make scheduling decisions
  – In-order issue of instructions
  – Out-of-order execution of instructions
• In case of empty resources, the hardware will look ahead in the instruction stream to see if there are any instructions that are OK to execute
• As they are fetched, instructions are placed in reservation stations
  – where they wait until their inputs are ready

Dynamic Scheduling
[Figure: a dynamically scheduled pipeline – an instruction fetch unit feeds an in-order issue and decode unit, which dispatches to reservation stations in front of the functional units (integer, integer, …, floating point, load/store); execution is out of order, and a commit unit retires results in order. © 1998 Morgan Kaufmann Publishers, Inc.]
• 4 reservation stations for 4 separate pipelines
  – Each pipeline may have a different depth

Committing Results
• When do results get committed / written?
  – In-order commit
  – Out-of-order commit – very dangerous!
• Advantages:
  – Hide load stalls
  – Hide memory latency
  – Approach 100% processor utilization
• Disadvantages:
  – LOTS of power-hungry hardware
  – Branch prediction is crucial
  – Complex control

Dynamic Scheduling Case Study
• Intel's Pentium 4
  – First appeared in 2000
• Possible for 126 instructions to be "in-flight" at one time!
• Processors have gone "backwards" on this since 2003

Thread-level Parallelism (TLP)
• If you have multiple threads…
  – by having multiple programs running, or
  – by writing a multithreaded application
• …you can get higher performance by running those threads:
  – on multiple processors, or
  – on a machine that has multithreading support – SMT (AKA "Hyperthreading")
• Conceptually these are very similar
  – The hardware is very different
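As a small, hedged illustration of the second case above (a single program broken into separate threads), here is a minimal POSIX-threads sketch in C; the names (worker, half_sum, the array size) are invented for the example.

```c
/* Minimal TLP sketch: one program split into two worker threads,
 * each summing half of an array. Names are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double data[N];
static double half_sum[2];

static void *worker(void *arg)
{
    int id = *(int *)arg;                  /* 0 or 1 */
    double s = 0.0;
    for (int i = id * (N / 2); i < (id + 1) * (N / 2); i++)
        s += data[i];
    half_sum[id] = s;                      /* each thread writes its own slot */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = {0, 1};

    for (int i = 0; i < N; i++) data[i] = 1.0;

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("sum = %f\n", half_sum[0] + half_sum[1]);
    return 0;
}
```

On a multiprocessor (or an SMT core) the two workers can run at the same time; on a uniprocessor the OS simply time-slices them.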
Parallel Processing
• The term parallel processing is usually reserved for the situation in which a single task is executed on multiple processors
  – This discounts the idea of simply running separate tasks on separate processors – a common way to get high throughput, but not really parallel processing
• Key questions in design:
  1. How do the parallel processors share data and communicate?
     – shared memory vs. distributed memory
  2. How are the processors connected?
     – single bus vs. network
• The number of processors is determined by a combination of #1 and #2

The Jigsaw Puzzle Analogy

Serial Computing
Suppose you want to do a jigsaw puzzle that has, say, a thousand pieces. We can imagine that it'll take you a certain amount of time. Let's say that you can put the puzzle together in an hour.

Shared Memory Parallelism
If Alice sits across the table from you, then she can work on her half of the puzzle and you can work on yours. Once in a while, you'll both reach into the pile of pieces at the same time (you'll contend for the same resource), which will cause a little bit of slowdown. And from time to time you'll have to work together (communicate) at the interface between her half and yours. The speedup will be nearly 2-to-1: together, it might take 35 minutes instead of the ideal 30.

The More the Merrier?
Now let's put Bob and Charlie on the other two sides of the table. Each of you can work on a part of the puzzle, but there'll be a lot more contention for the shared resource (the pile of puzzle pieces) and a lot more communication at the interfaces. So you'll get noticeably less than a 4-to-1 speedup, but you'll still have an improvement – maybe something like 3-to-1: the four of you can get it done in 20 minutes instead of an hour.

Diminishing Returns
If we now put Dave and Ed and Frank and George on the corners of the table, there's going to be a whole lot of contention for the shared resource, and a lot of communication at the many interfaces. So the speedup you'll get will be much less than we'd like; you'll be lucky to get 5-to-1. So we can see that adding more and more workers onto a shared resource eventually gives diminishing returns.

Distributed Parallelism
Now let's try something a little different. Let's set up two tables, and let's put you at one of them and Alice at the other. Let's put half of the puzzle pieces on your table and the other half of the pieces on Alice's. Now you can both work completely independently, without any contention for a shared resource. BUT, the cost per communication is MUCH higher (you have to scootch your tables together), and you need the ability to split up (decompose) the puzzle pieces reasonably evenly, which may be tricky to do for some puzzles.

More Distributed Processors
It's a lot easier to add more processors in distributed parallelism. But you always have to be aware of the need to decompose the problem and to communicate among the processors. Also, as you add more processors, it may be harder to load balance the amount of work that each processor gets.

Load Balancing
Load balancing means ensuring that everyone completes their workload at roughly the same time. For example, if the jigsaw puzzle is half grass and half sky, then you can do the grass and Alice can do the sky, and then you only have to communicate at the horizon – and the amount of work that each of you does on your own is roughly equal. So you'll get pretty good speedup.

Load Balancing
Load balancing can be easy, if the problem splits up into chunks of roughly equal size, with one chunk per processor. Or load balancing can be very hard.
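As a small, hedged illustration of the "easy" case above (chunks of roughly equal size, one chunk per worker), here is a block-decomposition sketch in C; the names (block_range, nitems, nworkers) and the example sizes are invented for the sketch.

```c
/* Illustrative static load balancing: split nitems as evenly as possible
 * among nworkers. Names are made up for this sketch. */
#include <stdio.h>

static void block_range(int nitems, int nworkers, int rank,
                        int *start, int *count)
{
    int base  = nitems / nworkers;   /* everyone gets at least this many */
    int extra = nitems % nworkers;   /* first 'extra' workers get one more */
    *count = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}

int main(void)
{
    int start, count;
    for (int rank = 0; rank < 4; rank++) {        /* e.g. 4 workers, 10 items */
        block_range(10, 4, rank, &start, &count);
        printf("worker %d: items %d..%d\n", rank, start, start + count - 1);
    }
    return 0;
}
```

This only works well when every item costs about the same; if the work per item varies (the "very hard" case), a dynamic scheme such as a shared work queue is usually needed.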
How is Data Shared?
• Distributed memory systems
  – Each processor (or group of processors) has its own memory space
  – All data sharing between "nodes" is done over a network of some kind
    • e.g. Ethernet
  – Information sharing is usually explicit
• Shared memory systems
  – All processors share one memory address space and can access it
  – Information sharing is often implicit

Shared Memory Systems
• Processors all operate independently, but operate out of the same memory
• Some data structures can be read by any of the processors
• To properly maintain ordering in our programs, synchronization primitives (locks/semaphores) are needed! (see the lock sketch after the coherence slides below)
[Figure: three processors, each with its own cache, connected by a single bus to shared memory and I/O]

Multicore Processors
• A multicore processor is simply a shared memory machine where the processors reside on the same piece of silicon
• "Moore's Law" continuing means we have more transistors to put on a chip
  – Doing more superscalar or superpipelining will not help us anymore!
  – Use these transistors to make every system a multiprocessor

Example Cache Coherence Problem
[Figure: processors P1, P2, and P3, each with a cache ($), share a bus to memory and I/O devices. Memory initially holds u:5. (1) P1 reads u and caches 5; (2) P3 reads u and caches 5; (3) P3 writes u = 7; then (4) P1 and (5) P2 read u ("u = ?") and may see the stale value 5.]

Cache Coherence
• According to Webster's dictionary…
  – Cache: a secure place of storage
  – Coherent: logically consistent
• Cache coherence: keep storage logically consistent
• Coherence requires enforcement of 2 properties:
  1. Write propagation – all writes eventually become visible to other processors
  2. Write serialization – all processors see writes to the same block in the same order

Cache Coherence Solutions
• Two most common variations:
  – "Snoopy" schemes
    • rely on broadcast to observe all coherence traffic
    • well suited for buses and small-scale systems
    • example: Intel x86
  – Directory schemes
    • use centralized or "regional" information to avoid broadcast
    • scale fairly well to large numbers of processors
    • example: SGI Origin/Altix

Why Cache Coherent Shared Memory?
• Pluses:
  – For applications – looks like a multitasking uniprocessor
  – For the OS – only evolutionary extensions required
  – Easy to do communication without the OS
  – Software can worry about correctness first and then performance
• Minuses:
  – Proper synchronization is complex
  – Communication is implicit, so it may be harder to optimize
  – More work for hardware designers
• Result:
  – Symmetric Multiprocessors (SMPs) are the most successful parallel machines ever
  – And the first with multi-billion-dollar markets!
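To make the synchronization point concrete ("proper synchronization is complex", and locks/semaphores are needed), here is a minimal, hedged sketch in C using POSIX threads; the shared counter and function names are invented for the example.

```c
/* Shared-memory sketch: without the mutex, the two threads' read-modify-write
 * of 'counter' can interleave and updates can be lost. Names are illustrative. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);     /* synchronization primitive */
        counter++;                     /* load, add, store as one critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expect 2000000)\n", counter);
    return 0;
}
```

On a cache coherent machine the hardware keeps the cached copies of counter consistent across processors; the lock is still needed to keep each read-modify-write atomic.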
Intel Core 2 Duo
• Multiple levels of cache coherence
  – On-chip, the L1 caches must stay coherent, using the shared L2 as the backing store
  – Off-chip, the processor must support standard SMP processor configurations

Distributed Memory Systems
• Hardware in which each processor (or group of processors) has its own private memory
• Communication is achieved through some kind of network
  – This can be as simple as Ethernet…
  – …or far more customized, if communication performance is important
• Examples:
  – Cray XE6
  – A rack of Dell PowerEdge 1950s
  – Folding@home

The Most Common Distributed Performance System: Clustering
• A parallel computer built out of commodity hardware components
  – PCs or server racks
  – Commodity network (like Ethernet)
  – Often running a free-software OS like Linux, with a low-level software library to facilitate multiprocessing
• Software is used to send messages between machines
  – The standard is to use MPI (the Message Passing Interface); a minimal sketch appears after the cluster slides below

What is a Cluster?
• "… [W]hat a ship is … It's not just a keel and hull and a deck and sails. That's what a ship needs. But what a ship is ... is freedom."
  – Captain Jack Sparrow, "Pirates of the Caribbean"

What a Cluster is ….
• A cluster needs a collection of small computers, called nodes, hooked together by an interconnection network
• It also needs software that allows the nodes to communicate over the interconnect
• But what a cluster is … is all of these components working together as if they're one big computer (sometimes called a "supercomputer")

What a Cluster is ….
• Nodes
  – PCs / workstations
  – Server rack nodes
• Interconnection network
  – Ethernet ("GigE")
  – Myrinet ("10GigE")
  – InfiniBand (low latency)
  – The Internet (a system built this way is typically called a "Grid")
• Software
  – OS
    • Generally Linux – Red Hat / CentOS / SuSE
    • Windows HPC Server
  – Libraries (MPICH2, PBLAS, MKL, NAG)
  – Tools (Torque/Maui, Ganglia, GridEngine)
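Below is a minimal, hedged sketch of the message-passing style mentioned above, written in C against the standard MPI calls (MPI_Init, MPI_Send, MPI_Recv); the payload value and tag are invented for the example.

```c
/* Minimal MPI sketch: rank 0 sends one integer to rank 1.
 * Build with an MPI compiler wrapper (e.g. mpicc) and launch with mpirun/mpiexec. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 42;        /* payload value is arbitrary */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            /* explicit communication: nothing is shared between nodes */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Each rank typically runs on a different node, so all data sharing is explicit – in contrast to the implicit sharing of the shared memory systems above.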
doesn’t really make sense MIMD – Multiple Instruction Multiple Data (TLP) – the most common model in use CS 352 : Computer Organization and Design University of Wisconsin-Eau Claire Dan Ernst MIMD • Multiple instructions are applied to multiple data • The multiple instructions can come from the same program, or from different programs – • Generally “parallel processing” implies the first Most modern multiprocessors are of this form – – IBM Blue Gene, Cray T3D/T3E/XT3/4/5/6, SGI Origin/Altix Beowulf and other “Homebrew” or third-party clusters CS 352 : Computer Organization and Design University of Wisconsin-Eau Claire Dan Ernst “True” SIMD • A single instruction is applied to multiple data elements in parallel – same operation on all elements at the same time • Most well known examples are: – – Thinking Machines CM-1 and CM-2 MasPar MP-1 and MP-2 • • All are out of existence now SIMD requires massive data parallelism • Usually have LOTS of very very simple processors (e.g. 8-bit CPUs) CS 352 : Computer Organization and Design University of Wisconsin-Eau Claire Dan Ernst Vector Processors • Closely related to SIMD – – – • Cray J90, Cray T90, Cray SV1, NEC SX-6 Looked to be “merging” with MIMD systems • Cray X1E, as an example Now appears to be dropped in favor of GPUs Use a single instruction to operate on an entire vector of data – – – Difference from “True” SIMD is that data in a vector processor is not operated on in true parallel, but rather in a pipeline Uses “vector registers” to feed a pipeline for the vector operation Generally have memory systems optimized for “streaming” of large amounts of consecutive or strided data • (Because of this, didn’t typically have caches until late 90s) CS 352 : Computer Organization and Design University of Wisconsin-Eau Claire Dan Ernst GPUs Lots of “cores” on a single chip. Programming model is unusual • Probably closer to SIMD than Vector machines… – Actually executes things in parallel • …but its not really SI – Threads each have a mind of their own • I’ve seen them referred to as “SPMD” and “SPMT” • More clearly, Flynn’s Taxonomy is not very good at describing specific systems – It’s better at describing a style of computation – Sadly, everyone likes to categorize things CS 352 : Computer Organization and Design University of Wisconsin-Eau Claire Dan Ernst