Exploiting Heterogeneous Architectures
Alex Beutel, John Dickerson, Vagelis Papalexakis
15-740/18-740 Computer Architecture, Fall 2012
In-class discussion, Tuesday 10/16/2012

Heterogeneous Hardware Systems
• Multiple-CPU, single-GPU systems
• Asymmetric Multicore Processors (AMPs)
  – Combination of general-purpose big and small cores
  – Trade-off between performance and power consumption
  – Usually "on-chip" AMPs
• Single-ISA architectures
  – Similar to AMPs, but all cores share the same instruction set
  – "Small" cores can support in-order execution
  – "Big" cores can support out-of-order execution

Overview

  Paper    | Optimization                | Hardware          | Location
  ---------+-----------------------------+-------------------+---------------------
  BIS      | Speed – remove bottlenecks  | ACMP (single ISA) | SW/HW
  PIE      | Speed – ILP and MLP         | ACMP (single ISA) | Compiler (SW)
  YinYang  | Speed – async and power     | ACMP (single ISA) | VM/SW
  SMS      | Speed – memory              | GPU & CPU         | Mem controller (HW)

Outline
1. BIS: José A. Joao et al., "Bottleneck identification and scheduling in multithreaded applications," in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '12).
2. YinYang: Ting Cao et al., "The yin and yang of power and performance for asymmetric hardware and managed software," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12).
3. PIE: Kenzo Van Craeynest et al., "Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE)," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12).
4. SMS: Rachata Ausavarungnirun et al., "Staged memory scheduling: achieving high performance and scalability in heterogeneous systems," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12).

Bottleneck Identification and Scheduling in Multithreaded Applications
• Focuses on the problem of removing bottlenecks
  – A big problem in many systems: they can't scale well to many threads
• Bottlenecks include Amdahl's serial portion, critical sections, barriers, and pipeline stages
• In ACMPs, previous research shows that "big" cores can be used to handle (serializing) bottlenecks
  – But with limited fine-grained adaptivity and generality
• The authors propose BIS
  – Key insight: the costliest bottlenecks are those that make other threads wait the longest
  – Involves cooperation of software and hardware to detect bottlenecks
  – Accelerates bottlenecks using one or more "big" cores of the ACMP

BIS: Contributions (from the paper)
1. A cooperative hardware-software mechanism to identify the most critical bottlenecks of different types (e.g., critical sections, barriers, and pipeline stages), allowing the programmer, the software, or the hardware to remove or accelerate those bottlenecks. To the authors' knowledge, the first such proposal.
2. An automatic acceleration mechanism that decides which bottlenecks to accelerate and where to accelerate them, using the fast cores on an ACMP as hardware-controlled fine-grained accelerators for serializing bottlenecks. Hardware support minimizes the overhead of execution migration and allows quick adaptation to changing critical bottlenecks.
3. The first paper to explore the trade-offs of bottleneck acceleration using ACMPs with multiple large cores, showing that multiple large cores improve performance when multiple bottlenecks are similarly critical and need to be accelerated simultaneously.

[Figure 1 from the paper: examples of bottleneck execution — (a) critical section, (b) barrier, (c) pipeline stages.]

• Evaluation: BIS improves performance by 37% over a 32-core symmetric CMP and by 32% over an ACMP (which accelerates serial sections) with the same area budget. BIS outperforms recently proposed dynamic mechanisms (ACS for non-pipelined workloads, FDP for pipelined workloads) by 15% on a 1-large-core, 28-small-core ACMP, and its improvement increases with the number of cores.

Bottleneck Identification
• Software is used to identify bottlenecks
• Instructions such as BottleneckCall, BottleneckReturn, and BottleneckWait give feedback to the BIS system
• The BIS system keeps track of bottlenecks and their thread waiting cycles (TWC), with optimizations

Scheduling in Multithreaded Applications (with ACMPs)
• Take the N bottlenecks with the highest TWC and accelerate them (see the sketch after this list)
• Many possible ways to accelerate; the authors focus on assigning bottlenecks to bigger cores in an ACMP
• The worst bottlenecks are sent from small cores to a big core and kept in a Scheduling Buffer
• Lots of edge cases are dealt with, such as avoiding false serialization
• Also extended to the multiple-large-core context
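Below is a minimal Python sketch of the bookkeeping idea: accrue thread waiting cycles per bottleneck and hand the worst offenders to a big core. The class and method names are ours, not the paper's; BIS actually implements this in hardware (a Bottleneck Table near the large core) driven by the BottleneckCall/BottleneckWait/BottleneckReturn instructions.

```python
# Illustrative sketch of BIS-style TWC accounting, not the paper's hardware design.
from collections import defaultdict

class BottleneckTable:
    def __init__(self):
        self.twc = defaultdict(int)       # bottleneck id -> thread waiting cycles
        self.waiters = defaultdict(set)   # bottleneck id -> threads currently waiting

    def bottleneck_wait(self, bid, tid):
        """A thread starts waiting on bottleneck `bid` (cf. BottleneckWait)."""
        self.waiters[bid].add(tid)

    def tick(self, cycles=1):
        """Each waiting thread charges `cycles` to the bottleneck it waits on."""
        for bid, threads in self.waiters.items():
            self.twc[bid] += cycles * len(threads)

    def bottleneck_done(self, bid, tid):
        """A thread stops waiting (the bottleneck was released for it)."""
        self.waiters[bid].discard(tid)

    def top_bottlenecks(self, n):
        """The N bottlenecks with the highest TWC: candidates for a big core."""
        return sorted(self.twc, key=self.twc.get, reverse=True)[:n]

# Usage: threads T2-T4 wait on critical section C1 while another thread holds it.
table = BottleneckTable()
for tid in ("T2", "T3", "T4"):
    table.bottleneck_wait("C1", tid)
table.tick(cycles=10)                 # 3 waiters * 10 cycles = 30 TWC for C1
print(table.top_bottlenecks(n=1))     # ['C1'] -> schedule C1 on a big core
```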
The Yin/Yang Metaphor
• Hardware: heterogeneous multi-cores balance power and performance
  – Everyone cares about performance-per-energy (PPE) instead of absolute performance
• Software: the move toward managed programming languages with virtual machines, like Java (JVM), C# (.NET), and JavaScript
• Yang of heterogeneous hardware: exposed hardware adds complexity
• Yin of managed languages: the VM handles all that exposed complexity for the programmer
• Yang of VM languages: overhead
• Yin of heterogeneous hardware: small cores can alleviate that overhead problem

Yin and Yang of Power and Performance: Overview
• Virtual machines consume a lot of extra computation time and energy (~40%)
• Java VM-related numbers (~37% total):
  – 10% garbage collection
  – 12% JIT compilation
  – 15% executing not-yet-compiled code in the interpreter
• The paper exploits the GC, JIT, and interpreter tasks by placing them on the right types of cores, using a combination of parallelism, asynchrony, non-criticality, and hardware sensitivity

Yin and Yang of Power and Performance: Overview (contd.)
• Garbage collection: asynchronous, can use many cores, does not benefit from a high clock rate. Use a low-power core with high memory bandwidth
• JIT: asynchronous, somewhat parallel, and non-critical. Use a small core, because it is powerful enough
• Interpreter: on the critical path and not asynchronous; uses the application's parallelism. Again, generally use low-power cores

YinYang Experimental Evaluation
• Power: they measure the power overhead of VM services, and yes, the VM eats power, so it is a good candidate for heterogeneous systems
• Performance-per-energy (PPE): lots of results are reported this way instead of as absolute performance
• Moving the JIT and GC to lower-clocked cores increases PPE (by 9-13%)
• GC is very memory-bound: great on low-power cores
• JIT is less memory-bound, but embarrassingly parallel, so still great on low-power cores
• The interpreter's PPE improvement is less stark, but still there

PIE: Performance Impact Estimation
• A heterogeneous multi-core architecture is one that features big, powerful, power-hungry core(s) and small, weak, energy-efficient core(s)
• How do we map workloads onto the appropriate cores to maximize performance-per-energy?
• PIE is a static or dynamic scheduler that takes both memory-level parallelism (MLP) and instruction-level parallelism (ILP) into account to predict how well a job will do on different types of cores
  – Static: schedule jobs once for the duration of the job
  – Dynamic: push parts of jobs to appropriate cores

PIE (contd.)
• Motivation: intuition is wrong
  – Intuition: "Compute-heavy jobs should go on the heavyweight 'big' cores, while memory-heavy jobs can do well enough on 'small' cores."
  – The authors run experiments and find that big cores do well on MLP-intensive jobs, while small cores do well on ILP-intensive jobs

More PIE
• One way to schedule is to randomly sample job-core mappings, learn, and choose the best
  – High overhead!
• PIE instead estimates a job's performance on core type B while the job is running on core type A (a hedged sketch follows below)
• The estimate aggregates:
  – A base instruction-level parallelism (ILP) score
  – A memory-level parallelism (MLP) component that is a function of the processor's architecture and the cache misses observed for the specific job
• The scheduler takes the estimates of the job's performance on the different core types and decides where to put the job

PIE works well
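Here is a hedged Python sketch of the estimation idea. The decomposition CPI = CPI_base + MPI × latency / MLP follows the spirit of the paper's model, but the function, its parameters, and the ILP/MLP ratio inputs are our simplification, not the paper's exact formulas.

```python
# Simplified PIE-style cross-core CPI estimation (illustrative, not the paper's model).

def estimate_cpi_on_other_core(cpi_base_A, mpi, mem_latency, mlp_A,
                               ilp_ratio_B_over_A, mlp_ratio_B_over_A):
    """Estimate CPI on core type B from counters measured on core type A.

    cpi_base_A         -- non-memory CPI observed on core A (ILP component)
    mpi                -- last-level cache misses per instruction (job property)
    mem_latency        -- memory access latency in cycles (machine property)
    mlp_A              -- average overlapped outstanding misses on core A
    ilp_ratio_B_over_A -- how much ILP core B extracts relative to A (estimated)
    mlp_ratio_B_over_A -- how much MLP core B exposes relative to A (estimated)
    """
    cpi_base_B = cpi_base_A / ilp_ratio_B_over_A     # more ILP -> lower base CPI
    cpi_mem_B = (mpi * mem_latency) / (mlp_A * mlp_ratio_B_over_A)
    return cpi_base_B + cpi_mem_B

# Usage: a memory-heavy job measured on a small core; a big core that extracts
# 1.5x the ILP and overlaps 2x the misses looks much better on this job.
cpi_small = estimate_cpi_on_other_core(1.0, 0.01, 200, 1.0, 1.0, 1.0)  # = 3.0
cpi_big   = estimate_cpi_on_other_core(1.0, 0.01, 200, 1.0, 1.5, 2.0)  # ~ 1.67
print(cpi_small, cpi_big)  # schedule the job on whichever core minimizes CPI
```

Note how this matches the paper's counterintuitive finding: the memory component shrinks on the big core because it exposes more MLP, which is why MLP-intensive jobs favor big cores.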
Multiple CPU, Single GPU Systems
• Memory becomes the critical resource
  – GPU accesses are vastly different from CPU ones
  – GPUs generate significantly more requests
  – The GPU spawns many different threads
  – Increased contention between the GPU and the CPUs
• Need to design a memory controller that
  – Schedules the memory accesses
  – Ensures fairness
  – Is scalable and easy to implement
• Current approaches are not robust to the presence of both a GPU and CPUs

Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
• Proposes a multi-stage approach to application-aware memory scheduling
  – Handles interference between bandwidth-demanding and non-demanding applications (e.g., the GPU and the CPUs, respectively)
  – Simplified hardware implementation, due to decoupling of the memory controller across multiple stages
• Improves CPU performance without degrading GPU performance
  – The authors test many settings (e.g., CPU-only, GPU-only, CPU & GPU)
  – And compare against existing approaches
• Sophisticated approaches that prioritize memory accesses need overly complex logic
  – E.g., CAM memories
• SMS uses a three-stage approach (a small sketch follows below):
  1. Batch Formation: per-source aggregation of memory requests into batches
  2. Batch Scheduler: prioritizes batches coming from latency-critical applications (e.g., CPU ones)
  3. DRAM Command Scheduler: FIFO queues per DRAM bank; each batch from stage 2 is placed on these FIFOs

SMS works well
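Below is a toy Python sketch of the three-stage flow. The class, the request format, and the probabilistic CPU-priority knob are our illustrative assumptions, not the paper's hardware design (the paper also breaks batches on row changes and tunes the CPU/GPU probability; both are omitted here).

```python
# Toy three-stage memory scheduler in the spirit of SMS (illustrative only).
from collections import deque, defaultdict
import random

class StagedMemoryScheduler:
    def __init__(self, num_banks, cpu_priority=0.9):
        self.batches = defaultdict(list)                       # stage 1: per-source batches
        self.bank_fifos = [deque() for _ in range(num_banks)]  # stage 3: FIFO per bank
        self.cpu_priority = cpu_priority                       # prob. of picking a CPU batch

    def enqueue(self, source, row, bank):
        """Stage 1, Batch Formation: collect each source's requests into a batch."""
        self.batches[source].append((row, bank))

    def schedule_batch(self):
        """Stage 2, Batch Scheduler: favor latency-critical (CPU) sources."""
        cpus = [s for s in self.batches if s.startswith("cpu") and self.batches[s]]
        gpus = [s for s in self.batches if s.startswith("gpu") and self.batches[s]]
        pool = cpus if (cpus and (random.random() < self.cpu_priority or not gpus)) else gpus
        if not pool:
            return
        source = pool[0]
        for row, bank in self.batches.pop(source):  # the whole batch moves at once
            self.bank_fifos[bank].append((source, row))

    def issue(self):
        """Stage 3, DRAM Command Scheduler: simple in-order FIFO per bank."""
        for bank, fifo in enumerate(self.bank_fifos):
            if fifo:
                source, row = fifo.popleft()
                print(f"bank {bank}: serve {source}, row {row}")

# Usage: a bandwidth-hungry GPU and a latency-sensitive CPU share the controller.
sms = StagedMemoryScheduler(num_banks=2)
for i in range(4):
    sms.enqueue("gpu0", row=i, bank=i % 2)
sms.enqueue("cpu0", row=7, bank=0)
sms.schedule_batch()   # the CPU batch is (probabilistically) scheduled first
sms.issue()
```

The design point the stages buy you: because stage 3 is a plain FIFO per bank, the expensive associative (CAM) comparisons of monolithic schedulers disappear; all the application-awareness lives in the cheap batch-level decision of stage 2.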
Discussion
• Leaving the ISA assumption aside, how can we combine ideas from the first three papers?
  – It seems we could incorporate ILP and MLP into the queuing decisions in BIS
  – The VM could also become more dynamic
• The BIS, PIE, and YinYang papers assume heterogeneous multicore systems with the same instruction set architecture (ISA). How would these papers change if we assumed different ISAs?
  – We're thinking not fundamentally. It would make the prediction part of PIE more complicated (you'd need some context-aware performance scaling between ISAs), but wouldn't break it.
  – Similarly, you'd need some scaling for the VM work, but it wouldn't break anything there, either.
  – Maybe the class has an opinion on this?
• Do we need to keep ISAs homogeneous across a heterogeneous multi-core? What do we gain or lose from this?