Parallelism: A Serious Goal or a Silly Mantra
(some half-thought-out ideas)

Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
  – Break the problem
  – Use half the energy
  – 1000 mickey mouse cores
  – Hardware is sequential
  – Server throughput (how many pins?)
  – What about GPUs and Data Bases?
• Current bugs to exploiting parallelism (or are they?)
  – Dark silicon
  – Amdahl's Law
  – The Cloud
• The answer
  – The fundamental concept vis-à-vis parallelism
  – What it means re: the transformation hierarchy

It starts with the raw material (Moore's Law)
• The first microprocessor (Intel 4004), 1971
  – 2,300 transistors
  – 108 KHz
• The Pentium chip, 1992
  – 3.1 million transistors
  – 66 MHz
• Today
  – More than one billion transistors
  – Frequencies in excess of 5 GHz
• Tomorrow?

And what we have done with this raw material
[Figure: number of transistors per microprocessor over time, with an ever-larger fraction of the chip devoted to cache]

Too many people do not realize: Parallelism did not start with Multi-core
• Pipelining
• Out-of-order execution
• Multiple operations in a single microinstruction
• VLIW (horizontal microcode exposed to the software)
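The point that parallelism predates multi-core can be made concrete with a toy model. The sketch below is my own illustration, not any real pipeline: it contrasts issuing one instruction per cycle with a dataflow-style machine, as in out-of-order execution, where an instruction fires as soon as its source registers are ready.

```python
# Toy illustration of instruction-level parallelism. A "program" is a list of
# (destination register, [source registers]) pairs; every operation is assumed
# to take one cycle. An in-order machine needs one cycle per instruction,
# while a dataflow-style machine needs only the length of the critical path.

def dataflow_cycles(program):
    """Cycles needed when each instruction fires as soon as its inputs are ready."""
    ready_at = {}  # register -> cycle at which its value becomes available
    finish = 0
    for dst, srcs in program:
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        ready_at[dst] = start + 1          # unit latency for every operation
        finish = max(finish, ready_at[dst])
    return finish

# Two independent two-instruction chains: four instructions, but only a
# two-cycle critical path once the chains overlap.
program = [
    ("r1", ["r0"]),  # chain A, step 1
    ("r2", ["r1"]),  # chain A, step 2 (depends on r1)
    ("r3", ["r0"]),  # chain B, step 1 (independent of chain A)
    ("r4", ["r3"]),  # chain B, step 2
]
print(len(program), dataflow_cycles(program))  # prints: 4 2
```

Four instructions issued one at a time take four cycles, but only two cycles when the independent chains overlap: exactly the parallelism that pipelining and out-of-order execution have been harvesting all along.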
One thousand mickey mouse cores
• Why not a million? Why not ten million?
• Let's start with 16
  – What if we could replace 4 with one more powerful core?
• …and we learned:
  – One more powerful core is not enough
  – Sometimes we need several
  – MorphCore was born
  – BUT not all MorphCore (fixed function vs. flexibility)

The Asymmetric Chip Multiprocessor (ACMP)
[Figure: three ways to spend the same chip area: the "Tile-Large" approach (four large cores); the "Niagara" approach (sixteen Niagara-like small cores); the ACMP approach (one large core plus twelve Niagara-like small cores)]

Large Core vs. Small Core

Large Core
• Out-of-order
• Wide fetch (e.g., 4-wide)
• Deeper pipeline
• Aggressive branch predictor (e.g., hybrid)
• Many functional units
• Trace cache
• Memory dependence speculation

Small Core
• In-order
• Narrow fetch (e.g., 2-wide)
• Shallow pipeline
• Simple branch predictor (e.g., gshare)
• Few functional units

Throughput vs. Serial Performance
[Figure: speedup vs. one large core (y-axis, 0 to 9) as a function of degree of parallelism (x-axis, 0 to 1) for the Niagara, Tile-Large, and ACMP approaches]

Server throughput
• The Good News: not a software problem
  – Each core runs its own problem
• The Bad News: how many pins?
  – Memory bandwidth
• More Bad News: how much energy?
  – Each core runs its own problem

What about GPUs and Data Bases?
• In theory, absolutely!
• GPUs (SMT + SIMD + Predication)
  – Provided there are no conditional branches (divergence)
  – Provided memory accesses line up nicely (coalescing)
• Data Bases
  – Provided there are no critical sections

Dark Silicon
• Too many transistors: we cannot power them all
  – All those cores powered down
  – All that parallelism wasted
• Not really: the Refrigerator! (aka accelerators)
  – Fork (in parallel)
  – Although not all at the same time!

Amdahl's Law
• The serial bottleneck always limits performance
• Heterogeneous cores AND control over them can minimize the effect

The Cloud
• It is behind the curtain: how do we manage it?
• Answer: the on-chip run-time system
• Answer: pragmas beyond the Cloud
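The throughput-vs-serial-performance comparison and the Amdahl's Law point above can be captured in a small model. The sketch below is my own back-of-the-envelope version, assuming a budget of sixteen small-core areas, a large core that costs four of them and runs serial code twice as fast as a small core, and perfect scaling of the parallel part; none of these numbers come from the slides.

```python
# Amdahl-style model of the "Throughput vs. Serial Performance" comparison.
# All numbers are illustrative assumptions: 16 small-core areas of budget,
# a large core costing 4 of them with 2x the serial performance of a small
# core, and a parallel fraction that scales perfectly across cores.

def speedup(f, serial_perf, parallel_perf):
    """Speedup relative to one large core (performance = 1.0).
    f is the parallelizable fraction of the work."""
    time = (1 - f) / serial_perf + f / parallel_perf
    return 1 / time

configs = {
    "Niagara (16 small)":        (0.5, 16 * 0.5),        # serial part runs on a small core
    "Tile-Large (4 large)":      (1.0, 4 * 1.0),
    "ACMP (1 large + 12 small)": (1.0, 1.0 + 12 * 0.5),  # large core helps both parts
}

for f in (0.0, 0.5, 0.9, 1.0):
    print(f, {name: round(speedup(f, s, p), 2)
              for name, (s, p) in configs.items()})
```

Under these assumptions Niagara wins only when nearly everything is parallel, Tile-Large wins when little is, and ACMP tracks the better of the two across the middle: the serial bottleneck runs on the large core while the small cores supply throughput.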
The fundamental concept: Synchronization

The transformation hierarchy:
Problem
Algorithm
Program
ISA (Instruction Set Architecture)
Microarchitecture
Circuits
Electrons

At every layer we synchronize
• Algorithm: task dependencies
• ISA: sequential control flow (implicit)
• Microarchitecture: ready bits
• Circuits: the clock cycle (implicit)

Who understands this?
• Should this be part of students' parallelism education?
• Where should it come in the curriculum?
• Can students even understand these different layers?

Parallel to Sequential to Parallel
• Guri says: think sequential, execute parallel
  – i.e., don't throw away 60 years of computing experience
  – The original HPS model of out-of-order execution
  – Synchronization is obvious: restricted data flow
• At the higher level, parallel at larger granularity
  – Pragmas in Java? Who would have thought!
  – Dave Kuck's CEDAR project, vintage 1985
  – Synchronization is necessary: coarse-grain data flow

Can we do more?
• The run-time system: part of the chip design
  – The chip knows the chip's resources
  – On-chip monitoring can supply information
  – The run-time system can direct the use of those resources
• The Cloud: the other extreme, and today's be-all
  – How do we harness its capability?
  – What is needed from the hierarchy to make it work?

My message
• Parallelism is a serious goal IF we want to solve the most challenging problems (cure cancer, predict tsunamis)
• Telling people to think parallel is nice, but often silly
• Examining the transformation hierarchy and seeing where we can leverage it seems to me a sounder approach
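To make "think sequential, execute parallel" concrete at the coarse grain: the same fire-when-inputs-are-ready discipline that ready bits provide at the microarchitecture level can be applied to whole tasks, which is roughly what the proposed on-chip run-time system would automate. The sketch below is a hypothetical illustration using ordinary threads; the task names and the tiny run-time are invented for the example.

```python
# Coarse-grain data flow: the program is written as an ordinary sequence of
# steps with declared inputs, and a tiny run-time extracts the parallelism.
from concurrent.futures import ThreadPoolExecutor

def run_taskgraph(tasks):
    """tasks: a list of (name, fn, [names of dependencies]) in program order.
    Every task is submitted immediately; each one blocks only on its declared
    inputs, so independent tasks overlap."""
    futures = {}
    with ThreadPoolExecutor() as pool:
        for name, fn, deps in tasks:
            dep_futs = [futures[d] for d in deps]
            # The only synchronization: wait on declared inputs, nothing else.
            futures[name] = pool.submit(
                lambda fn=fn, dfs=dep_futs: fn(*[d.result() for d in dfs]))
    return {name: fut.result() for name, fut in futures.items()}

results = run_taskgraph([
    ("a", lambda: 2, []),                    # independent leaf
    ("b", lambda: 3, []),                    # independent leaf: overlaps with a
    ("c", lambda x, y: x * y, ["a", "b"]),   # joins the two results
])
print(results["c"])  # prints: 6
```

The programmer still thinks sequentially (the task list reads top to bottom); the run-time discovers that "a" and "b" can overlap, exactly the restricted-data-flow idea lifted to larger granularity.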