IBM Research
Multicore Programming Challenges
Michael Perrone, IBM Master Inventor
Mgr., Multicore Computing Dept.
mpp@us.ibm.com
© 2009 IBM Research

Multicore Performance Challenge
[Chart: performance vs. number of cores; the gap between ideal scaling and typical scaling.]

Take Home Messages
"Who needs 100 cores to run MS Word?" – Dave Patterson, Berkeley
• Performance is critical, and it's not free!
• Data movement is critical to performance!
• Which curve are you on? [Chart: performance vs. number of cores.]

Outline
• What's happening?
• Why is it happening?
• What are the implications?
• What can we do about it?

What's happening?
• Industry shift to multicore
  – Intel, IBM, AMD, Sun, nVidia, Cray, etc.
  – Single core → homogeneous multicore → heterogeneous multicore
• Increasing:
  – Number of cores
  – Heterogeneity (e.g., Cell processor; system level)
• Decreasing:
  – Core complexity (e.g., Cell processor, GPUs), decreasing since the Pentium 4 single core
  – Bytes per FLOP

Heterogeneity: Amdahl's Law for Multicore
• Unicore vs. homogeneous vs. heterogeneous cores, running serial and parallel code
• Heterogeneous designs win even for square-root performance growth (Hill & Marty, 2008)
• Loophole: have cores work in concert on serial code…

Good & Bad News
• GOOD NEWS: Multicore programming is parallel programming
• BAD NEWS: Multicore programming is parallel programming

Many Levels of Parallelism
• Node
• Socket
• Chip
• Core
• Thread
• Register/SIMD
• Multiple instruction pipelines
• Need to be aware of all of them!

Additional System Types
[Diagram: beyond the homogeneous multicore CPU, systems add accelerators on chip, on the I/O bus, or on the system bus (with their own memory), and network-attached nodes over Ethernet/InfiniBand; each node has a multicore CPU, system bus, NIC/bridge, and main memory.]

Multicore Programming Challenge
[Chart: performance (lower to higher) vs. programmability (easier to harder). "Nirvana" is high performance with easy programming; the "Danger Zone" of "lazy" programming gives up performance for ease; better tools and better programming models are the interesting research that moves us toward Nirvana.]

Outline
• What's happening?
• Why is it happening?
  – HW challenges
  – BW challenges
• What are the implications?
• What can we do about it?

Power Density – The Fundamental Problem
[Chart: power density (W/cm²) vs. process generation, 1.5 µm down to 0.07 µm, for i386, i486, Pentium, Pentium Pro, Pentium II and Pentium III; the trend passes "hot plate" levels and extrapolates toward nuclear-reactor power densities. Source: Fred Pollack, Intel, "New Microprocessor Challenges in the Coming Generations of CMOS Technologies," Micro32.]

What's causing the problem?
[Chart: power density (W/cm²) vs. gate length for a 65 nm (10S) gate stack with Tox = 11 Å.]
• The gate dielectric is approaching a fundamental limit (a few atomic layers)

Microprocessor Clock Speed Trends
• Managing power dissipation is limiting clock speed increases
[Chart: clock frequency (MHz, log scale, roughly 10² to 10⁴) vs. year, 1990 to 2010; growth flattens in recent years.]

Intuition: Power vs. Performance Trade-Off
[Chart: relative power vs. relative performance for a single core; small gains in single-thread performance (1.3x to 1.6x) cost disproportionately more power (up to ~5x), while backing off to 0.7x to 0.8x performance saves a large fraction of the power.]

Outline
• What's happening?
• Why is it happening?
  – HW challenges
  – BW challenges
• What are the implications?
• What can we do about it?
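Before turning to the bandwidth story, it helps to write down the model behind the "Amdahl's Law for Multicore" slide above. The formulas below follow Hill & Marty (2008), which the slide cites: a chip has a budget of n base-core-equivalent (BCE) resources, r BCEs can be combined into one richer core whose sequential performance is perf(r) (the slide's "square root performance growth" corresponds to perf(r) ≈ √r), and f is the fraction of the work that parallelizes. A symmetric (homogeneous) design uses n/r identical r-BCE cores; an asymmetric (heterogeneous) design uses one r-BCE core plus n - r base cores:

```latex
\[
\mathrm{Speedup}_{\mathrm{sym}}(f,n,r)
  = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f\,r}{\mathrm{perf}(r)\,n}},
\qquad
\mathrm{Speedup}_{\mathrm{asym}}(f,n,r)
  = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r)+n-r}},
\qquad
\mathrm{perf}(r)\approx\sqrt{r}
\]
```

Because the big core attacks the serial term while the small cores still contribute to the parallel term, the asymmetric speedup dominates the symmetric one over most of the (f, n, r) range, which is the slide's argument for heterogeneity; the "loophole" bullet refers to going further and letting the simple cores cooperate on the serial section as well.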
The Hungry Beast
• Data ("food") → data pipe → processor ("beast")
• Pipe too small = starved beast
• Pipe big enough = well-fed beast
• Pipe too big = wasted resources

The Hungry Beast
• If FLOPS grow faster than pipe capacity…
• …the beast gets hungrier!

Move the food closer
• Put a cache between the data and the processor
• Load more food while the beast eats

What happens if the beast is still hungry?
• If the data set doesn't fit in cache:
  – Cache misses
  – Memory latency exposed
  – Performance degraded
• Several important application classes don't fit:
  – Graph searching algorithms
  – Network security
  – Natural language processing
  – Bioinformatics
  – Many HPC workloads

Make the food bowl larger
• Cache size is steadily increasing
• Implications:
  – Chip real estate reserved for cache
  – Less space on chip for computes
  – More power required for fewer FLOPS

Make the food bowl larger
• But…
  – Important application working sets are growing faster
  – Multicore is even more demanding on cache than unicore

The beast is hungry!
• The data pipe is not growing fast enough!

The beast had babies
• Multicore makes the data problem worse!
  – Efficient data movement is critical
  – Latency hiding is critical

GOAL: The proper care and feeding of hungry beasts

Outline
• What's happening?
• Why is it happening?
• What are the implications?
• What can we do about it?

Example: The Cell/B.E. Processor

Feeding the Cell Processor
• 8 SPEs, each with:
  – SPU/SXU (execution unit)
  – LS (local store)
  – MFC (memory flow controller)
  – 16 B/cycle port onto the EIB
• PPE (64-bit Power Architecture with VMX; PPU, L1, L2):
  – OS functions
  – Disk I/O
  – Network I/O
• EIB: up to 96 B/cycle
• MIC to dual XDR memory (16 B/cycle, 2x); BIC to FlexIO (32 B/cycle and 16 B/cycle)
[Diagram: Cell/B.E. block diagram showing the SPEs, PPE, EIB, MIC and BIC.]

Cell Approach: Feed the beast more efficiently
• Explicitly "orchestrate" the data flow
  – Enables detailed programmer control of data flow
  – Avoids restrictive HW cache management, which is unlikely to determine the optimal data flow and can be very inefficient
  – Get/put data when & where you want it
  – Hides latency: simultaneous reads, writes & computes
  – Allows more efficient use of the existing bandwidth
• BOTTOM LINE: It's all about the data!

Lessons Learned: Cell Processor
• Core simplicity impacted algorithmic design
  – Increased predictability
  – Avoid recursion & branches
  – Simpler code is better code (e.g., bubble vs. comb sort)
• Heterogeneity
  – The serial core must balance the parallel cores well
• Programmability suffered
  – Forced to address data flow directly
  – But this led to better algorithms & performance portability
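To make "explicitly orchestrate the data flow" concrete, below is a minimal sketch (not code from the talk) of the double-buffering pattern used on the SPEs: while the SPU computes on one local-store buffer, the MFC streams the next chunk into the other, so reads and computes overlap as the slide describes. It assumes the Cell SDK's spu_mfcio.h intrinsics; CHUNK, process_stream and the compute_on kernel are illustrative names, not part of the SDK.

```c
#include <stdint.h>
#include <spu_mfcio.h>          /* Cell SDK SPU-side MFC (DMA) intrinsics */

#define CHUNK 4096              /* bytes per DMA transfer; illustrative size */

static char buf[2][CHUNK] __attribute__((aligned(128)));  /* two local-store buffers */

extern void compute_on(char *data, int nbytes);            /* user-supplied kernel (assumed) */

/* Stream nchunks * CHUNK bytes from effective address ea_in through the SPU,
 * overlapping each DMA with computation on the previously fetched buffer. */
void process_stream(uint64_t ea_in, unsigned int nchunks)
{
    unsigned int cur = 0;

    mfc_get(buf[cur], ea_in, CHUNK, cur, 0, 0);            /* prime the pipeline */

    for (unsigned int i = 0; i < nchunks; i++) {
        unsigned int nxt = cur ^ 1;

        if (i + 1 < nchunks)                               /* start fetching chunk i+1 ... */
            mfc_get(buf[nxt], ea_in + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);                      /* ... then wait only for chunk i */
        mfc_read_tag_status_all();

        compute_on(buf[cur], CHUNK);                       /* compute while chunk i+1 is in flight */

        cur = nxt;                                         /* swap buffers */
    }
}
```

The same get/compute/put structure extends to triple buffering when results also have to be DMA'd back out with mfc_put, which is how the "simultaneous reads, writes & computes" bullet is realized in practice.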
What are the implications?
(Some are general; some are Cell-specific.)
• Computational complexity
• Parallel programming
• Communication
• Synchronization
• Collecting metadata
• Merging operations
• Grouping operations
• Memory layout
• Memory conflicts
• Debugging

Computational complexity is inadequate
• It focuses on computes: O(N), O(N²), O(ln N), etc.
• It ignores BW analysis
  – Memory flows are now the bottlenecks
  – Memory hierarchies are critical to performance
  – Need to incorporate memory into the picture
• Need "data complexity"
  – Necessarily HW dependent
  – Calculate data movement (track where the data come from) and divide by BW to get the time spent on data

Don't apply computational complexity blindly
• O(N) isn't always better than O(N²)
[Chart: run time vs. N for an O(N) and an O(N²) algorithm; below the crossover point ("you are here") the O(N²) algorithm can be faster.]
• More cores can lead to smaller N per core…

Where is your data? Localize your data!
[Chart: run time vs. locality as the working set spills from L1 to L2 to L3 cache, then to disk and tape.]
• Put your data where you want it, when you want it!

Example: Compression
• Compress to reduce data flow
• Compression increases the slope of the O(N) compute…
• …but reduces run time
[Diagram: run-time breakdown of read / compute / write, with and without compression; the compute grows but the reads and writes shrink.]

Implication: Communication Overhead
• BW can swamp compute
• Minimize communication
[Diagram: two cores exchanging data.]

Implication: Communication Overhead
• Modify the partitioning to reduce communication
[Diagram: two partitionings of the same domain, one with total boundary 9L and one with 4L; the shape of the partition determines how much data crosses between cores.]
• Trade off against synchronization

Implications: Synchronization Overhead
[Diagram: thread timelines; time spent waiting at synchronization points is overhead.]

Implications: Synchronization – Load Balancing
• Uniform vs. adaptive partitioning
• Modify data partitioning to balance workloads

Implications: Synchronization – Nondeterminism
• Suppose all threads are given nominally equal work…

Implications: Synchronization – Nondeterminism
[Chart: probability vs. run time; each thread's run time varies around the deterministic value, and a barrier must wait for the maximum of N threads, which sits in the tail of the distribution.]

Implications: Metadata – Parallel Sort Example
• Unsorted data → metadata → sorted data
• Collect a histogram in the first pass
• Use the histogram to parallelize the second pass

Implications: Merge Operations – FFT Example
• Naive 2D FFT:
  – 1D FFT (x axis)
  – Transpose
  – 1D FFT (y axis)
  – Transpose
• Improved: merge the steps
  – Tiled FFT + transpose (x axis)
  – Tiled FFT + transpose (y axis)
• Avoid unnecessary data movement
[Diagram: input image, buffer, transposed tile, transposed buffer, transposed image.]

Implications: Restructure to Avoid Data Movement
• Instead of interleaving "compute A, transform A to B, compute B, transform B to A, …", group the A computations together and the B computations together so the A↔B transforms happen once instead of repeatedly

Implications: Streaming Data & Finite Automata
• DFA processing of streaming data
• Replicate & overlap the DFA across the stream
• Enables loop unrolling & software pipelining

Implications: Streaming Data – NID Example
• Sample word list: "the", "that", "math"
• Find (lots of) substrings in a (long) string
• Build a graph of words & represent it as a DFA

Implications: Streaming Data – NID Example
• Random access to a large state transition table (STT)
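The metadata idea from the parallel-sort slide above can be sketched in ordinary C. This is an illustrative reconstruction rather than the talk's Cell implementation: a first pass over the keys builds a histogram (the metadata), a prefix sum turns the histogram into disjoint output regions, and the resulting buckets are sorted independently, here with an OpenMP loop (compile with -fopenmp). For brevity only the per-bucket sorting is parallelized; a full version would also privatize the histogram and parallelize the scatter.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256                       /* bucket by the top 8 bits of a 32-bit key */
#define BUCKET(k) ((k) >> 24)

static int cmp_u32(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *)a, y = *(const unsigned int *)b;
    return (x > y) - (x < y);
}

/* Two-pass sort: pass 1 collects the histogram (the "metadata"); pass 2 scatters
 * keys into disjoint buckets, which can then be sorted completely independently. */
void parallel_sort(unsigned int *data, size_t n)
{
    size_t count[NBUCKETS] = {0}, offset[NBUCKETS], next[NBUCKETS];
    unsigned int *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;

    for (size_t i = 0; i < n; i++)             /* pass 1: histogram */
        count[BUCKET(data[i])]++;

    offset[0] = 0;                              /* prefix sum -> bucket start offsets */
    for (int b = 1; b < NBUCKETS; b++)
        offset[b] = offset[b - 1] + count[b - 1];
    memcpy(next, offset, sizeof offset);

    for (size_t i = 0; i < n; i++)             /* pass 2a: scatter into buckets */
        tmp[next[BUCKET(data[i])]++] = data[i];

    #pragma omp parallel for schedule(dynamic)  /* pass 2b: buckets sort in parallel */
    for (int b = 0; b < NBUCKETS; b++)
        qsort(tmp + offset[b], count[b], sizeof *tmp, cmp_u32);

    memcpy(data, tmp, n * sizeof *data);
    free(tmp);
}
```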
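For the NID example just above, the hot loop is essentially a table-driven DFA walk over the byte stream. Here is a minimal sketch; the state count and table layout are illustrative, not the talk's STT format.

```c
#include <stdint.h>
#include <stddef.h>

#define NSTATES 256                 /* illustrative; real NID tables are far larger */

typedef struct {
    uint16_t next[NSTATES][256];    /* next[state][byte] -> next state */
    uint8_t  match[NSTATES];        /* nonzero if the state accepts a word */
} dfa_t;

/* Returns the number of accepting states visited while scanning buf[0..len). */
size_t dfa_scan(const dfa_t *dfa, const uint8_t *buf, size_t len)
{
    size_t hits = 0;
    uint16_t s = 0;                 /* start state */
    for (size_t i = 0; i < len; i++) {
        s = dfa->next[s][buf[i]];   /* one data-dependent table load per input byte */
        hits += dfa->match[s];
    }
    return hits;
}
```

Each input byte costs one data-dependent load into the table, so once the STT no longer fits in cache or local store the loop is dominated by memory latency rather than compute, which is exactly what the following slides attack by replicating the DFA and software-pipelining the loop.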
Implications: Streaming Data – Hiding Latency
[Diagram: overlap DMA transfers with computation on data already in local store so that memory latency is hidden behind useful work.]
• Enables loop unrolling & software pipelining

Roofline Model (S. Williams)
[Chart: processing rate vs. data locality; latency bound at low locality, compute bound at high locality, with software pipelining moving codes from the latency-bound region toward the compute-bound roof.]

Implications: Group Like Operations – Tokenization Example
• Intuitive approach:
  – Get data (serial)
  – State transition (serial)
  – Action (branchy & nondeterministic)
  – Repeat

Implications: Group Like Operations – Tokenization Example
• Better approach:
  – Get data (serial)
  – State transition (serial)
  – Add the action to a list (serial)
  – Repeat
  – Then process the action lists
• Grouping like actions enables loop unrolling, SIMD, and load balancing

Implications: Convert BW Bound to Compute Bound – Neural Net Example
• Neural net function F(X): RBF, MLP, KNN, etc.
  – X: D input dimensions
  – D×N matrix of parameters
  – N basis functions: dot product + nonlinearity
  – F: output
• If the parameter matrix is too big for cache, the computation is BW bound

Implications: Convert BW Bound to Compute Bound – Neural Net Example
• Split the function over multiple SPEs, then merge
  – Avoids unnecessary memory traffic
  – Reduces compute time per SPE
  – Minimal merge overhead

Implications: Pay Attention to the Memory Hierarchy
• Register file → L1 → L2 → main memory
  – BW: high → low
  – Latency: low → high
  – Size: small → larger

Implications: Pay Attention to the Memory Hierarchy
[Diagram: cores with private L1 caches, grouped under shared L2 caches and a shared L3.]
• Data eviction rate
• Optimal tiling
• A shared memory space can impact load balancing

Implications: Memory Hierarchy & Tiling
[Diagram: a blocked matrix product computed tile by tile.]
• Optimal tiling depends on cache size

Implications: Data Re-Use – FFT Revisited
• Long strides trash the cache
• Use full cachelines where possible
[Diagram: accessing single elements at a long stride vs. walking full cachelines at stride 1 through the same data envelope.]

Implications: Handle Race Conditions (Debugging)
• Thread 1 writes data, thread 2 reads data, thread 1 writes again; whether the result is good or bad depends on timing
• Heisenberg uncertainty principle:
  – Instrumenting the code changes its behavior
  – Problem with maintaining exact timing

Implications: More Cores – More Memory Conflicts
• Threads striding across memory banks can all land on the same bank and create a hot spot
• Avoid bank conflicts:
  – Plan the data layout
  – Avoid strides that are multiples of the number of banks
  – Randomize start points
  – Make critical data sizes and the number of threads relatively prime

Implications: Reduce Data Movement
• Convolution with a Green's function: output(x, y) = Σ_{i,j} D(x+i, y+j) · G(x, y, i, j)
• A new G at each (x, y)
• Radial symmetry of G reduces BW requirements

Implications: Reduce Data Movement
[Diagram: the data plane partitioned into column strips, one per SPE (SPE 0 through SPE 7).]

Implications: Reduce Data Movement
• For each X:
  – Load the next column of data
  – Load the next column of indices
  – For each Y:
    • Load the Green's functions
    • SIMDize the Green's functions
    • Compute the convolution at (X, Y)
  – Cycle the buffers
[Diagram: a sliding window of 2R+1 data columns of height H, with the data buffer, Green's-function buffer, and index buffer cycling as X advances.]
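To make the tiling slide concrete, here is a minimal cache-blocked matrix multiply in C. It is an illustrative sketch, not code from the talk; TILE is a placeholder that would be tuned, as the slide says, to the cache (or local-store) size of the target.

```c
#include <stddef.h>

#define TILE 64   /* tile edge; tune so three TILE x TILE blocks fit in cache (assumption) */

/* C += A * B for N x N row-major matrices, processed tile by tile so each block
 * of A, B, and C is reused from cache many times before it is evicted. */
void matmul_tiled(const float *A, const float *B, float *C, size_t N)
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE) {
                size_t i_end = ii + TILE < N ? ii + TILE : N;
                size_t k_end = kk + TILE < N ? kk + TILE : N;
                size_t j_end = jj + TILE < N ? jj + TILE : N;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * N + k];          /* held in a register across the j loop */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

On Cell the same structure shows up as DMA-ing TILE x TILE blocks into local store; on cache-based cores the blocking alone keeps each block resident across its many reuses.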
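The bank-conflict slide's advice to keep critical sizes and thread counts relatively prime can be illustrated with a padded allocation. This is a sketch under an explicit assumption, namely a power-of-two number of banks interleaved at element granularity; the real padding rule depends on the target's bank count and interleave granularity, and the helper names are illustrative.

```c
#include <stdlib.h>
#include <stddef.h>

/* Assumption: a power-of-two number of banks, interleaved per element, so any
 * odd element stride is relatively prime to the bank count. */
static size_t padded_stride(size_t row_elems)
{
    return row_elems | 1;   /* odd stride: consecutive rows start in different banks */
}

/* Allocate an nrows x row_elems float matrix with a padded row stride so that
 * threads each streaming through their own row do not all hammer one bank. */
static float *alloc_banked(size_t nrows, size_t row_elems, size_t *stride_out)
{
    size_t stride = padded_stride(row_elems);
    *stride_out = stride;                      /* callers index row r at base + r * stride */
    return malloc(nrows * stride * sizeof(float));
}
```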
Outline
• What's happening?
• Why is it happening?
• What are the implications?
• What can we do about it?

What can we do about it?
• We want:
  – High performance
  – Low power
  – Easy programmability
  Choose any two!
• We need:
  – A "magic" compiler
  – Multicore-enabled libraries
  – Multicore-enabled tools
  – New algorithms

What can we do about it?
• Compiler "magic"
  – OpenMP, autovectorization… BUT it doesn't encourage parallel thinking
• Programming models
  – CUDA, OpenCL, Pthreads, UPC, PGAS, etc.
• Tools
  – Cell SDK, RapidMind (Intel), PeakStream (Google), Cilk (Intel), Gedae, VSIPL++, Charm++, Atlas, FFTW, PHiPAC
  – Performance analyzers: HPCToolkit, FDPR-Pro, Code Analyzer, Diablo, TAU, Paraver, VTune, Sun Studio Performance Analyzer, PDT, Trace Analyzer, Thor, etc.
• If you want performance…
  – There is no substitute for better algorithms & hand-tuning!

What can we do about it? Example: OpenCL
• An open "standard"
• Based on C, so not difficult to learn
• Allows a natural transition from (proprietary) CUDA programs
• Interoperates with MPI
• Provides application portability
  – Hides the specifics of the underlying accelerator architecture
  – Avoids HW lock-in: "future-proofs" applications
• Weaknesses
  – No double precision (DP), no recursion, and an accelerator-only model
• Portability does not equal performance portability!

What can we do about it? Hide Complexity in Libraries
• Manually
  – Slow, expensive; a new library for each architecture
• Autotuners
  – Search the program space for optimal performance
  – Examples: Atlas (BLAS), FFTW (FFT), Spiral (DSP), OSKI (sparse BLAS), PHiPAC (BLAS)
• Local optimality problem:
  – F() and G() may each be optimal, but will F(G()) be?

What can we do about it? It's all about the data!
• The data problem is growing
• Intelligent software prefetching
  – Use DMA engines
  – Don't rely on HW prefetching
• Efficient data management
  – Multibuffering: hide the latency!
  – BW utilization: make every byte count!
  – SIMDization: make every vector count!
  – Problem/data partitioning: make every core work!
  – Software multithreading: keep every core busy!

Conclusions
• Programmability will continue to suffer
  – No pain, no gain
• Incorporate data flow into algorithmic development
  – Computational complexity vs. "data flow" complexity
• Restructure algorithms to minimize:
  – Synchronization, communication, nondeterminism, load imbalance, non-locality
• Data management is the key to better performance
  – Merge/group data operations to minimize memory traffic
  – Restructure data traffic: tile, align, SIMDize, compress
  – Minimize memory bottlenecks
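To back up the OpenCL slide, here is a minimal host-plus-kernel vector add in C. It is a sketch for illustration: error checking and resource releases are omitted, and the kernel and buffer sizes are not from the talk. Note how much of the host code is explicit data orchestration (creating buffers, enqueuing transfers), the same "it's all about the data" discipline the talk advocates.

```c
/* cl_vadd.c : minimal OpenCL 1.x vector add (sketch; error checks omitted) */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c)\n"
    "{ size_t i = get_global_id(0); c[i] = a[i] + b[i]; }\n";

int main(void)
{
    enum { N = 1 << 20 };
    float *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b), *c = malloc(N * sizeof *c);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    /* Data movement is explicit: the host creates device buffers and copies data. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, N * sizeof *a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, N * sizeof *b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof *c, NULL, NULL);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, N * sizeof *c, c, 0, NULL, NULL);

    printf("c[12345] = %f\n", c[12345]);   /* expect 3 * 12345 */
    return 0;
}
```

Built against an OpenCL 1.x SDK (e.g., cc cl_vadd.c -lOpenCL), this prints the expected 3 * 12345. Whether it then runs fast on a given accelerator is a separate question, which is the slide's point that portability does not equal performance portability.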
Backup Slides

Abstract

The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance, increasing the clock frequency, has come to an end: increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. This desire for newer, faster and larger is driving continued demand for higher computer performance.

The industry's response to these challenges has been to embrace "multicore" technology by designing processors that have multiple processing cores on each silicon chip. Increasing the number of cores per chip has enabled processor peak performance to double with each doubling of the number of cores. With performance doubling occurring at approximately constant clock frequency, so that energy costs can be controlled, multicore technology is poised to deliver the performance users need for their next-generation applications while at the same time reducing total cost of ownership per FLOP.

The multicore solution to the clock frequency problem comes at a cost: performance scaling on multicore is generally sub-linear and frequently decreases beyond some number of cores. For a variety of technical reasons, off-chip bandwidth is not increasing as fast as the number of cores per chip, which is making memory and communication bottlenecks the main barriers to improved performance. What these bottlenecks mean to multicore users is that precise and flexible control of data flows will be crucial to achieving high performance. Simple mappings of their existing algorithms to multicore will not result in the naïve performance scaling one might expect from increasing the number of cores per chip. Algorithmic changes, in many cases major ones, will have to be made to get value out of multicore. Multicore users will have to re-think, and in many cases re-write, their applications if they want to achieve high performance. Multicore forces each programmer to become a parallel programmer; to think of their chips as clusters; and to deal with the issues of communication, synchronization, data transfer and non-determinism as integral elements of their algorithms. And for those already familiar with parallel programming, multicore processors add a new level of parallelism and additional layers of complexity.

This talk will highlight some of the challenges that need to be overcome in order to get better performance scaling on multicore, and will suggest some solutions.

Cell Comparison
• ~4x the FLOPS at ~1/2 the power
• Both 65 nm technology (shown to scale)
[Image: die photos compared at the same scale.]

To-Scale Comparison of L2 Caches
[Image: die photos from AMD, Intel, IBM, and the Cell/B.E., compared at the same scale.]

Intel Multi-Core Forum (2006): The Issue
[Chart: Linux throughput (SDET benchmark) vs. number of processors, 0 to 24, with a 9.8x scaling annotation; throughput falls well short of linear scaling.]

The "Yale Patt Ladder"
• Problem → Algorithm → Program → ISA (Instruction Set Architecture) → Microarchitecture → Circuits → Electrons
• To improve performance, we need people who can cross between levels