Energy-Efficient Memory Hierarchy for Multi-core Architectures

Yen-Kuang Chen, Ph.D., IEEE Fellow
Principal Engineer, Intel Corporation
Associate Director, Intel-NTU CCC Center

With help from a long list of collaborators:
Guangyu Sun, Jishen Zhao, Cong Xu, Yuan Xie (PSU); Christopher Hughes, Changkyu Kim (Intel)
Carbon Emissions Hard to Tackle; J.T. Wang: Data Centers Are Not a Good Business, Don't Welcome Them
(China Times, 2012-02-11; reported by Kang Wen-jou, Taipei)
- Google is about to build a data center in the Changbin Industrial Zone. J.T. Wang (王振堂), chairman of the Taipei Computer Association and chairman of Acer, said yesterday (the 10th) that in developing a cloud industry, what matters most for Taiwan is software, services, and applications, definitely not data centers.
- Because data centers consume enormous power and emit large amounts of carbon, he said, "This is not a good business!" The government should stop welcoming foreign companies to build data centers in Taiwan.

Dark Silicon
- "Dark Silicon and the End of Multicore Scaling," H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, in International Symposium on Computer Architecture (ISCA), 2011.
- Regardless of chip organization and topology, multicore scaling is power limited:
  - At 22 nm, 21% of a fixed-size chip must be powered off
  - At 8 nm, more than 50%

Energy-Efficient Memory Is the Key in Multi-Core
- We should spend 90%+ of our energy budget in memory
- "Memory" = the memory/cache hierarchy
- Cache coherence (or even non-coherent designs)
- Data placement/replacement management
- Compiler or hardware assist

Outline
- Motivation
- Moguls Memory Model
- Energy-Efficient Memory Hierarchy Design with Moguls
- Future Prediction: memory hierarchy, processor designs
- Conclusion

"Bandwidth Wall"
- Performance depends on two resources: compute does the work, and bandwidth feeds the compute
- Processors (through ILP, DLP, and TLP) keep getting faster; memory becomes relatively slower
[Figure: relative performance since 1980, log scale. Microprocessor MIPS grows ~50% annually; DRAM bandwidth grows ~27% annually; DRAM latency improves only ~7% annually.]
- Source: D. A. Patterson, "Latency Lags Bandwidth," Communications of the ACM, vol. 47, no. 10, pp. 71-75, Oct. 2004.
- The current memory hierarchy may not be good enough

How to Alleviate Bandwidth Problems
- Software techniques: cache blocking, data compression, memory management, data-structure rearrangement, etc. But these are not always applicable.
- Hardware techniques: "new" memory technologies provide opportunities
  - 3D stacking [Madan, HPCA 2009; Sun, HPCA 2009; Sun, ISLPED 2009]
  - eDRAM [Thoziyoor, ISCA 2008; Wu, ISCA 2009]
  - MRAM [Sun, HPCA 2009; Wu, ISCA 2009]
  - PCM [Lee, ISCA 2009; Qureshi, ISCA 2009; Wu, ISCA 2009]

Trade-off Between Bandwidth and Power
- High bandwidth means high power
- GPUs use GDDR for higher bandwidth, and GDDR burns more power: 1 GB of GDDR at 128 GB/s can burn roughly 4x the power of 4 GB of DDR at 16 GB/s
- Recall: some find GPUs perform better mainly because of higher bandwidth (details in Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ISCA, June 2010)
- Why is multi-core everywhere? Isn't it because of power?
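As a back-of-the-envelope check of the GDDR-vs-DDR comparison above, the sketch below multiplies interface bandwidth by an assumed energy per bit. The pJ/bit figures are illustrative assumptions chosen to show the shape of the trade-off, not measured vendor data.

```python
# Back-of-the-envelope check of the GDDR-vs-DDR power comparison.
# The pJ/bit figures below are illustrative assumptions, not measured values.

def dram_io_power_watts(bandwidth_gb_per_s: float, energy_pj_per_bit: float) -> float:
    """Memory interface power = bytes/s * 8 bits/byte * energy per bit."""
    bits_per_second = bandwidth_gb_per_s * 1e9 * 8
    return bits_per_second * energy_pj_per_bit * 1e-12

ddr_power  = dram_io_power_watts(16.0,  60.0)   # 4 GB DDR  @ 16 GB/s,  ~60 pJ/bit (assumed)
gddr_power = dram_io_power_watts(128.0, 30.0)   # 1 GB GDDR @ 128 GB/s, ~30 pJ/bit (assumed)

print(f"DDR : {ddr_power:5.1f} W")              # ~7.7 W
print(f"GDDR: {gddr_power:5.1f} W")             # ~30.7 W
print(f"ratio: {gddr_power / ddr_power:.1f}x")  # ~4x, matching the slide's rough claim
```

Note that even if GDDR spends less energy per bit than DDR (as assumed here), eight times the bandwidth still yields roughly four times the total power, which is the point of the slide.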
Who burns more power? Will it be memory?
Challenge: how to provide an energy-efficient memory hierarchy.

Our Research Statement
- What should the memory hierarchy look like?
  - Does a memory hierarchy provide enough bandwidth?
  - How many levels should the memory hierarchy have?
  - What are the capacity and bandwidth of each level?
  - Which memory technologies should be chosen for the different levels?
- Can we explore the design space quickly?
  - Simulation-based design evaluation is slow, and not feasible for a large design space
  - Exploration with an analytical model (Moguls) is fast and accurate (with proper approximations)
- Sneak preview: a new level of cache every 5-7 years

Outline recap: Moguls Memory Model

Cache: Smaller, Faster Buffer
- A smaller, faster buffer that stores the data most likely to be used, based on temporal and spatial locality
- E.g., 32 KB of cache vs. 2 GB of memory; 10 ns access time vs. DRAM's 100 ns
- Filters memory accesses: e.g., if the cache satisfies 90% of requests, only 10% go to the next level
- Significantly improves performance: e.g., average access time = 90% x 10 ns + 10% x (10 ns + 100 ns) = 20 ns

Memory Bandwidth Requirement
[Figure: cores at the top issue an aggregate bandwidth demand BR_C(T), decided by throughput T. An N-level hierarchy M_1 .. M_n sits between the cores and main memory; each level M_i provides bandwidth BP_i and passes a residual demand BR_i(T) to the level below. Main memory provides bandwidth BP_M.]
- Question: how do we abstract this relationship?

Capacity-Bandwidth (CB) Coordinate
- A point (C, B): C is cache capacity, B is demanded/provided bandwidth; both axes are log scale
- The origin is (C_O, B_M), where C_O is the minimum capacity and B_M is the bandwidth provided by main memory
- Both provided and demanded bandwidth can be described in the same coordinate system

Provided CB Curve
[Figure: a staircase in the CB plane starting from (C_O, B_M): the first-level cache at (C_1, BP_1), the second-level cache at (C_2, BP_2), and so on.]

Demand CB Curve
- The demand CB curve is continuous and decided by throughput T
[Figure: demand curves BR_C(T_1) and BR_C(T_2) sloping down through points such as (C_x, B_x); a higher throughput shifts the curve up.]

Combining the Demand and Provided CB Curves
- If the provided staircase stays at or above the demand curve at every capacity, the demand CB curve (e.g., under T_2) is satisfied by the provided CB curve.
- If the demand curve (e.g., under T_1) rises above the provided staircase, it is NOT satisfied. Question: how do we modify the provided CB curve?
- Three options:
  - Increase capacity: move (C_1, BP_1) right to (C'_1, BP_1)
  - Increase bandwidth: move (C_2, BP_2) up to (C_2, BP'_2)
  - Add an extra level: insert a new point (C_3, BP_3) into the staircase

Why the Name "Moguls"?
- Wikipedia: moguls are a series of bumps on a trail formed when skiers push the snow into piles as they ski.
[Figure: a mogul ski slope; the provided CB staircase hugging the demand curve resembles the bumps.]

Recall the Research Statement
- Does a memory hierarchy provide enough bandwidth?
- How many levels should the memory hierarchy have?
- What are the capacity and bandwidth of each level?
- Which memory technologies should be chosen for the different levels?

Outline recap: Energy-Efficient Memory Hierarchy Design with Moguls

Approximations Used to Apply Moguls
- Approximation-1: the demand CB curve is represented as a straight line with slope -1/2 in log-log space; equivalently, B(C) = B_S * sqrt(C_O / C), anchored at (C_O, B_S).
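As one concrete reading of the CB abstraction, the sketch below encodes Approximation-1's demand curve and checks whether a provided staircase of cache levels satisfies it. The function names, level sizes, and bandwidth values are illustrative assumptions, not numbers from the Moguls paper.

```python
import math

# Minimal sketch of the Moguls capacity-bandwidth (CB) abstraction.
# All names and numbers are illustrative, not taken from the paper's experiments.

def demand_bw(capacity, c_o, b_s):
    """Approximation-1: the demand CB curve is a line of slope -1/2 in
    log-log space, i.e. B(C) = B_S * sqrt(C_O / C), anchored at (C_O, B_S)."""
    return b_s * math.sqrt(c_o / capacity)

def satisfied(levels, c_o, b_s, b_m):
    """A provided CB 'staircase' (one (capacity, bandwidth) point per cache
    level, ordered small to large) satisfies the demand curve if each level
    supplies at least the bandwidth demanded at the capacity of the level
    above it, and main memory (b_m) covers the tail."""
    prev_cap = c_o
    for cap, bw in levels:
        if bw < demand_bw(prev_cap, c_o, b_s):
            return False
        prev_cap = cap
    return b_m >= demand_bw(prev_cap, c_o, b_s)

# Example: C_O = 32 KB, cores demand B_S = 512 GB/s at C_O, B_M = 16 GB/s.
levels = [(256e3, 512.0), (8e6, 200.0), (64e6, 64.0)]  # (bytes, GB/s)
print(satisfied(levels, c_o=32e3, b_s=512.0, b_m=16.0))  # True
```

Dropping the last level's bandwidth to, say, 32 GB/s makes the check fail, which corresponds to the "demand CB curve is NOT satisfied" case above.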
[Figure: the demand curve BR_C(T) drawn as a straight line y = -x/2 on log-log axes (bandwidth vs. cache capacity), from (C_O, B_S) down to capacity C_O(B_S/B_M)^2, where it meets the main-memory bandwidth B_M.]

Approximations Used to Apply Moguls (cont.)
- Approximation-2: the access power of a cache is approximately r * sqrt(Capacity) * Bandwidth, for a technology-dependent constant r. An iso-power line is therefore also a straight line of slope -1/2 in log-log space.
[Figure: an iso-power line (points of equal power) with slope -1/2 overlaid on the demand curve BR_C(T).]

After Applying the Two Approximations
- The iso-power lines are parallel to the demand CB curve.
[Figure: iso-power line parallel to BR_C(T), from (C_O, B_M) to C_O(B_S/B_M)^2.]

Starting Point: Two-Level Cache Design
[Figure: a two-level provided staircase with corners (C_1, BP_1), (C_1, BP_2), and (C_2, BP_2) plotted against the demand curve BR_C(T).]

Energy-Efficient Two-Level Cache Design
- Slide each level along the iso-power lines until the staircase corners just touch the demand curve; because the iso-power lines are parallel to the demand curve, this placement satisfies the demand at minimum power.
[Figure: the optimized two-level staircase with corners on BR_C(T); B_S marks the bandwidth demanded at C_O.]

Extension to N-Level Cache Design
[Figure: an n-level staircase (C_1, BP_1), (C_2, BP_1), ..., (C_n-1, BP_n-1), (C_n, BP_n-1), (C_n, BP_n) whose corners lie on the demand curve BR_C(T).]

Design Under a Power Constraint
- If the power budget cannot support throughput T_1, throughput degrades from T_1 to T_2, and the n-level staircase is re-fit to the lower demand curve BR_C(T_2).
[Figure: the same n-level staircase fit under the demand curves for T_1 and T_2.]

For More Details
- Mixing different memory technologies
- Simulation results validate the model: it accurately predicts the number of levels (>90%) and the size/bandwidth of every level (>80%)
- G. Sun, C. J. Hughes, C. Kim, J. Zhao, C. Xu, Y. Xie, Y.-K. Chen, "Moguls: A Model to Explore the Memory Hierarchy for Throughput Computing," in International Symposium on Computer Architecture (ISCA), June 2011.

Recall the Research Statement
- Does a memory hierarchy provide enough bandwidth?
- How many levels should the memory hierarchy have?
- What are the capacity and bandwidth of each level?
- Which memory technologies should be chosen for the different levels?

Different Memory Technologies
[Figure: iso-power lines for SRAM, eDRAM, MRAM, PCRAM/PCM, RRAM, and cross-point memristor caches over capacities from 256 KB to 256 MB, assuming a write:read ratio of 1:9.]
- Comparing SRAM and eDRAM: SRAM is more energy-efficient at small capacities (roughly up to 1 MB), while eDRAM is more energy-efficient at larger capacities.

Outline recap: Future Prediction (memory hierarchy)

Our Research Statement (revisited)
- What should the memory hierarchy look like? How do we use the latest technology components? What is the proper number of levels? What should the capacity and bandwidth of each level be?
- Can we explore the design space quickly? Simulation-based evaluation is slow and not feasible for a large design space; exploration with the analytical Moguls model is fast and accurate (with proper approximations).

Historical Trend
- The bandwidth gap keeps growing: processor speed increases ~50% per year, memory bandwidth only ~27% per year
- Intel processors introduced an on-die L1 cache in 1990, an L2 cache in 1998, and an L3 cache in 2005

Optimal Performance per Watt
- A new level of cache every 5-7 years
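To give the flavor of the optimization behind this prediction, the toy sketch below combines the two approximations (demand B(C) = B_S * sqrt(C_O/C), cache power ~ r * sqrt(C) * B), spaces level capacities geometrically between C_O and the point where demand meets main memory, and sums the per-level power. It is a sketch under stated assumptions, not the paper's actual solver, and every parameter value is made up for illustration.

```python
import math

def hierarchy_power(n_levels, c_o, b_s, b_m, r=1.0):
    """Relative total power of an n-level hierarchy whose level capacities
    are geometrically spaced between C_O and the capacity where the demand
    curve falls to the main-memory bandwidth B_M."""
    c_max = c_o * (b_s / b_m) ** 2            # demand curve reaches B_M here
    ratio = (c_max / c_o) ** (1.0 / n_levels)  # capacity growth per level
    total, cap_above = 0.0, c_o
    for _ in range(n_levels):
        bw = b_s * math.sqrt(c_o / cap_above)  # demand at the level above
        cap = cap_above * ratio                # this level's capacity
        total += r * math.sqrt(cap) * bw       # Approximation-2 power term
        cap_above = cap
    return total

# Toy parameters: C_O = 32 KB, B_S = 512 GB/s demanded at C_O, B_M = 16 GB/s.
for n in (1, 2, 3, 4, 5):
    print(f"{n} levels: relative power {hierarchy_power(n, 32e3, 512.0, 16.0):.3g}")
```

With these toy numbers the total power drops steeply from one to three levels, flattens around three to four, and then rises again; the model picks the hierarchy depth at the minimum of this curve, and rising core demand (B_S) over time pushes that minimum toward deeper hierarchies.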
Takeaway Messages
- It is time to add another level of memory to the hierarchy to alleviate the bandwidth bottleneck; we will need a new memory level roughly every 5-7 years.
- Mathematically, with some assumptions about miss-rate curves and power consumption, we can solve for the optimal number of levels and the capacity and bandwidth of each level.
- Our study shows that an L4 helps significantly.

Outline recap: Future Prediction (processor designs)

Programming GPUs Becomes Popular
[Figure: publication counts from Google Scholar searches for "GPGPU" and "GPGPU GPU CPU performance speedup".]
- Many results show GPUs with significant performance gains
- However, GPUs are NOT orders of magnitude faster than CPUs; architecture-specific optimizations are important

GPUs Are Against Future Trends
- GPU memories are not energy-efficient: the number of levels and the size of each level must be adjusted to the throughput of the processors
- CUDA-like GPU programming models will not scale in the future: getting good performance on a GPU often requires explicit memory management, which becomes a nightmare when the number of levels and the size of each level keep changing
- Trust me: reduce your bets on GPUs

Outline recap: Conclusion

Conclusions
- The winning multi-core architecture will have the most energy-efficient memory hierarchy
  - GPUs (and GDDR) are good, but not that good
  - Expect more levels in the hierarchy
  - On-die adaptive/reconfigurable caches can save energy
  - 3D stacking, eDRAM, and other emerging technologies will help
- G. Sun, C. J. Hughes, C. Kim, J. Zhao, C. Xu, Y. Xie, Y.-K. Chen, "Moguls: A Model to Explore the Memory Hierarchy for Throughput Computing," in International Symposium on Computer Architecture (ISCA), June 2011.
- C. Hughes, C. Kim, Y.-K. Chen, "Performance and Energy Implications of Caches for Throughput Computing," IEEE Micro, vol. 30, no. 6, pp. 25-35, Nov.-Dec. 2010.
- Software can help too (one of our current projects)
- Discussion: what about cloud computing?