
Energy-Efficient Memory Hierarchy
for Multi-core Architectures
Yen-Kuang Chen, Ph.D., IEEE Fellow
Principal Engineer, Intel Corporation
Associate Director, Intel-NTU CCC Center
With help from a long list of collaborators: Guangyu Sun, Jishen Zhao,
Cong Xu, Yuan Xie (PSU), Christopher Hughes, Changkyu Kim (Intel)
Notice and Disclaimers
Notice: This document contains information on products in the design phase of development. The
information here is subject to change without notice. Do not finalize a design with this information. Contact
your local Intel sales office or your distributor to obtain the latest specification before placing your product
order.
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life-saving, or life-sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.
• All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.
• Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
• The Intel products discussed herein may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.
• Intel®, Itanium®, Xeon™, Pentium®, Intel SpeedStep®, Intel NetBurst®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Copyright © 2011, Intel Corporation. All rights reserved.
• *Other names and brands may be claimed as the property of others.

Optimization Notice – Please read
Intel® Compiler includes compiler options that optimize for instruction sets that are available in both
Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize
equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are
reserved for Intel microprocessors. For a detailed description of these compiler options, including
the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides >
Compiler Options." Many library routines that are part of Intel® Compiler are more highly optimized
for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel®
Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the
options you select, your code and other factors, you likely will get extra performance on Intel
microprocessors.
While the paragraph above describes the basic optimization approach for Intel® Compiler, with
respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not
optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to
Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2),
Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3
(Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability,
functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel
microprocessors.
Intel recommends that you evaluate other compilers to determine which best meet your
requirements.
High carbon emissions are hard to address. J.T. Wang (王振堂): Data centers are not a good business; don't welcome them

2012-02-11 01:12 China Times (中國時報) [Kang Wen-jou (康文柔) / Taipei]

Google is about to build a data center in the Changhua Coastal Industrial Park. J.T. Wang (王振堂), chairman of the Taipei Computer Association and chairman of Acer, said yesterday (the 10th) that in developing a cloud-computing industry, what matters most for Taiwan is software, services, and applications, and definitely not building data centers.

Because data centers consume large amounts of power and emit large amounts of carbon, he said, "This is not a good business!" The government should stop welcoming foreign companies to build data centers in Taiwan.
Dark Silicon

"Dark Silicon and the End of Multicore Scaling," H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, in International Symposium on Computer Architecture (ISCA), 2011.

• Regardless of chip organization and topology, multicore scaling is power limited:
  • At 22 nm, 21% of a fixed-size chip must be powered off
  • At 8 nm, more than 50%
Energy Efficiency

• Memory is the key in multi-core
  • We should spend 90%+ of the energy in memory
  • "Memory" = the memory/cache hierarchy
• Cache coherence (or even non-coherent designs)
• Data placement/replacement management
  • With compiler or hardware assist
Outline

• Motivation
• Moguls Memory Model
• Energy-efficient Memory Hierarchy Design with Moguls
• Future Prediction
  • Memory hierarchy
  • Processor designs
• Conclusion
“Bandwidth Wall”

• Performance depends on two resources:
  • Compute does the work
  • Bandwidth feeds the compute
• Processors (through ILP, DLP, and TLP) are getting faster
• Memory becomes relatively slower
[Figure: relative performance since 1980, log scale. Microprocessor MIPS grows ~50% annually, DRAM bandwidth ~27% annually, and DRAM latency only ~7% annually. Source: D. A. Patterson, "Latency Lags Bandwidth," Communications of the ACM, vol. 47, no. 10, pp. 71-75, Oct. 2004.]
The current memory hierarchy may not be good enough.
How to Alleviate Bandwidth Problems

• Software techniques
  • Cache blocking, data compression, memory management, data structure re-arrangement, etc.
  • But not always applicable
• Hardware techniques
  • "New" memory technologies provide opportunities:
  • 3D stacking [Madan, HPCA 2009; Sun, HPCA 2009; Sun, ISLPED 2009]
  • eDRAM [Thoziyoor, ISCA 2008; Wu, ISCA 2009]
  • MRAM [Sun, HPCA 2009; Wu, ISCA 2009]
  • PCM [Lee, ISCA 2009; Qureshi, ISCA 2009; Wu, ISCA 2009]
Trade-off Between Bandwidth & Power

• High bandwidth means high power
  • GPUs use GDDR → higher bandwidth
  • GDDR burns more power
  • 1 GB of GDDR at 128 GB/s can burn roughly 4x the power of 4 GB of DDR at 16 GB/s (a rough sketch of this arithmetic follows below)
• Recall:
  • Some find GPUs perform better because of higher bandwidth (details in Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ISCA, June 2010)
  • Why is multi-core everywhere? Isn't it because of power?
  • Who burns more power? Will it be memory?
• Challenge: how to provide an energy-efficient memory hierarchy
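A back-of-the-envelope sketch of that trade-off in Python (the absolute DDR wattage is an assumed placeholder, not a figure from the talk; only the ~4x ratio comes from the slide):

    # Rough arithmetic behind the GDDR vs. DDR claim above.
    ddr_bw, gddr_bw = 16.0, 128.0        # GB/s
    ddr_power = 3.0                      # W, assumed purely for illustration
    gddr_power = 4 * ddr_power           # slide: GDDR burns ~4x the power

    print(gddr_power, "W (GDDR) vs.", ddr_power, "W (DDR)")
    print(ddr_power / ddr_bw, "W per GB/s (DDR) vs.",
          gddr_power / gddr_bw, "W per GB/s (GDDR)")
    # GDDR is cheaper per GB/s, but its absolute power is 4x higher,
    # which is what matters under a fixed chip/system power budget.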
Our Research Statement

• What should the memory hierarchy look like?
  • Does a memory hierarchy provide enough bandwidth?
  • How many levels in the memory hierarchy?
  • What are the capacity and bandwidth of each level?
  • Which memory technologies are chosen for different levels?
• Can we explore the design space quickly?
  • Simulation-based design evaluation is slow
    • Not feasible for a large design space
  • Exploration using an analytical model (Moguls)
    • Fast and accurate (with proper approximations)

Sneak preview: a new level of cache every 5-7 years.
Outline

• Motivation
• Moguls Memory Model
• Energy-efficient Memory Hierarchy Design with Moguls
• Future Prediction
  • Memory hierarchy
  • Processor designs
• Conclusion
Cache: Smaller, Faster Buffer

• A smaller, faster buffer between the CPU and memory
  • E.g., 10 ns access time (vs. DRAM's 100 ns)
  • E.g., 32 KB vs. 2 GB
• Stores the data most likely to be used, based on temporal and spatial locality
• Filters the memory accesses
  • E.g., the cache satisfies 90% of requests
  • Only 10% of requests go to the next level
• Significantly improves performance
  • E.g., 90%×10 ns + 10%×(10 ns + 100 ns) = 20 ns average access time

[Diagram: CPU ↔ Cache ↔ Memory]
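As a small illustration (a sketch using the example numbers above, not code from the talk), the same average-access-time arithmetic in Python:

    # Average memory access time (AMAT) for a one-level cache over DRAM.
    # Numbers are the slide's example: 90% hit rate, 10 ns cache, 100 ns DRAM.
    def amat(hit_rate, cache_ns, dram_ns):
        # Hits cost one cache access; misses pay the cache lookup plus DRAM.
        return hit_rate * cache_ns + (1.0 - hit_rate) * (cache_ns + dram_ns)

    print(amat(0.90, 10.0, 100.0))   # -> 20.0 ns, vs. 100 ns with no cache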
Bandwidth Requirement

[Diagram: several cores at the top generate a bandwidth demand BR_C(T), which depends on the target throughput T. The demand passes through an N-level hierarchy of caches M1, M2, ..., Mn; each level Mi provides bandwidth BP_i and filters the demand, so the next level sees only BR_i(T). Main memory at the bottom provides bandwidth BP_M and sees BR_n(T).]

Question: how do we abstract this relationship?
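A hedged sketch of the filtering idea in Python (the miss rates here are invented for illustration; each level passes only its misses to the next):

    core_demand_gbs = 128.0            # BR_C(T): bandwidth demand from the cores
    miss_rates = [0.10, 0.30, 0.50]    # assumed miss rates for M1, M2, M3

    demand = core_demand_gbs
    for level, miss_rate in enumerate(miss_rates, start=1):
        print(f"demand seen by M{level}: {demand:.1f} GB/s")
        demand *= miss_rate            # only misses go to the next level
    print(f"demand reaching main memory: {demand:.2f} GB/s")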
Capacity-Bandwidth (CB) Coordinate

[Figure: a log-log plot with cache capacity on the x-axis and bandwidth on the y-axis. A design point is written (C, B), where C is a cache capacity and B is a demanded/provided bandwidth. The origin is (C_O, B_M), where C_O is the minimum capacity and B_M is the bandwidth provided by main memory.]

Both provided and demanded bandwidths can be described in the same coordinate system.
Provided CB Curve

[Figure: on the CB plot, the provided curve is a descending staircase of points: the first-level cache at (C1, BP1), the second-level cache at (C2, BP2), down to the origin (C_O, B_M).]
Demand CB Curve

[Figure: the demand CB curve is continuous and determined by the target throughput T; a point on it is (Cx, Bx). A higher throughput T1 gives a higher demand curve BR_C(T1) than a lower throughput T2 with curve BR_C(T2).]
Combining Demand & Provided CB Curves

[Figure: the provided staircase through (C1, BP1) and (C2, BP2) lies above the demand curve BR_C(T2) everywhere, so the demand CB curve is satisfied by the provided CB curve.]
Combining Demand & Provided CB Curves (cont.)

[Figure: under the higher throughput T1, the demand curve BR_C(T1) rises above the provided staircase through (C1, BP1) and (C2, BP2), so the demand CB curve is NOT satisfied by the provided CB curve.]

Question: how should we modify the provided CB curve?
Increase Capacity

[Figure: grow a level's capacity, moving (C1, BP1) right to (C'1, BP1), so the provided staircase again covers the demand curve BR_C(T1).]
Increase Bandwidth

[Figure: raise a level's bandwidth, moving (C2, BP2) up to (C2, BP'2), so the provided staircase again covers the demand curve BR_C(T1).]
Add an Extra Level

[Figure: insert a new level (C3, BP3) into the staircase between (C2, BP2) and the origin (C_O, B_M), so the provided curve covers the demand curve BR_C(T1).]
Why named Moguls?

Wikipedia: "Moguls are a series of bumps on a trail formed when skiers push the snow into piles as they ski." (The provided CB staircase looks like a mogul run on the capacity-bandwidth plot.)
Recall the Research Statements

• Does a memory hierarchy provide enough bandwidth?
• How many levels in the memory hierarchy?
• What are the capacity and bandwidth of each level?
• Which memory technologies are chosen for different levels?
Outline

• Motivation
• Moguls Memory Model
• Energy-efficient Memory Hierarchy Design with Moguls
• Future Prediction
  • Memory hierarchy
  • Processor designs
• Conclusion
Approximations Used to Apply Moguls

• Approximation 1: the demand CB curve is represented as a straight line with slope -1/2 in log-log space (y = -x/2).

[Figure: the demand curve BR_C(T) drawn as a straight line of slope -1/2 down to the origin (C_O, B_M); it reaches the main-memory bandwidth at capacity C_O(B_S/B_M)^2.]
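Written out (reconstructed here from the slide's axis annotations, with B_S denoting the cores' bandwidth demand at the minimum capacity C_O):

\[ B_{\mathrm{demand}}(C) \;\approx\; B_S \left(\frac{C}{C_O}\right)^{-1/2} \]

which is a straight line of slope -1/2 in log-log space. Setting B_demand(C) = B_M gives C = C_O (B_S / B_M)^2, the x-axis endpoint that recurs in these figures.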
Approximations Used to Apply Moguls (cont.)

• Approximation 2: the access power of a cache is approximately P ≈ r × √(Capacity) × Bandwidth, for a technology-dependent constant r (energy per access grows roughly as the square root of capacity).

[Figure: an iso-power line (a line of constant power, also slope -1/2 in log-log space) overlaid on the demand curve BR_C(T) in the CB coordinate.]
After Applying the Two Approximations

• The iso-power line is parallel to the demand CB curve (both have slope -1/2 in log-log space).

[Figure: the iso-power line and the demand curve BR_C(T) drawn as parallel lines from (C_O, B_M) out to C_O(B_S/B_M)^2.]
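A quick numerical check of that parallelism (a sketch with made-up constants, assuming the P ≈ r·√C·B form reconstructed above):

    import math

    # Along an iso-power line, B = P / (r * sqrt(C)), so
    # log B = const - 0.5 * log C: slope -1/2, parallel to the demand curve.
    r, P = 1e-6, 2.0                          # illustrative constants only
    def iso_power_bw(capacity_bytes):
        return P / (r * math.sqrt(capacity_bytes))

    c1, c2 = 256 * 1024, 16 * 1024 * 1024     # 256 KB and 16 MB
    slope = (math.log(iso_power_bw(c2)) - math.log(iso_power_bw(c1))) \
            / (math.log(c2) - math.log(c1))
    print(slope)                              # -> -0.5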
Starting Point: Two-Level Cache Design

[Figure: a two-level provided staircase, (C1, BP1) and (C2, BP2), covering the straight-line demand curve BR_C(T) between (C_O, B_M) and C_O(B_S/B_M)^2; the staircase corner sits at (C1, BP2).]
Energy-Efficient Two-Level Cache Design

[Figure: the most energy-efficient two-level design places the staircase corner (C1, BP2) on the lowest iso-power line that still covers the demand curve BR_C(T); B_S marks the cores' demand at the minimum capacity C_O.]
Extension to an N-Level Cache Design

[Figure: an N-level staircase (C1, BP1), (C2, BP2), ..., (Cn, BPn) hugging the demand curve BR_C(T) from (C_O, B_M) out to C_O(B_S/B_M)^2, with each staircase corner (C_i, BP_{i+1}) on the same iso-power line.]
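A hedged sketch of one way to instantiate that construction (an illustration under the two approximations above, not code from the paper, with made-up constants): space the level capacities geometrically between C_O and C_O(B_S/B_M)^2 and read each level's bandwidth off the demand curve.

    # Illustrative N-level hierarchy under the two Moguls approximations.
    C_O = 32 * 1024        # minimum capacity (32 KB), assumed
    B_S = 1024.0           # cores' bandwidth demand at C_O (GB/s), assumed
    B_M = 16.0             # main-memory bandwidth (GB/s), assumed
    N   = 3                # number of cache levels

    C_max = C_O * (B_S / B_M) ** 2            # where demand meets main memory
    ratio = C_max / C_O
    for i in range(1, N + 1):
        C_i    = C_O * ratio ** (i / N)       # geometric capacity spacing
        C_prev = C_O * ratio ** ((i - 1) / N)
        BP_i   = B_S * (C_prev / C_O) ** -0.5 # demand at the previous capacity
        print(f"L{i}: {C_i / 1024:,.0f} KB at {BP_i:,.1f} GB/s")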
Design under a Power Constraint

[Figure: when the hierarchy must sit on a lower iso-power line, the provided staircase (C1, BP1), ..., (Cn, BPn) can no longer cover the demand curve BR_C(T1); the design instead satisfies the lower demand curve BR_C(T2), i.e., throughput is degraded from T1 to T2.]
For More Details

• Mixing different memory technologies
• Simulation results to validate the model
  • Accurately predicts the number of levels (>90%)
  • Accurately predicts the size/BW of every level (>80%)
• "Moguls: A Model to Explore the Memory Hierarchy for Throughput Computing," G. Sun, C. J. Hughes, C. Kim, J. Zhao, C. Xu, Y. Xie, Y.-K. Chen, in International Symposium on Computer Architecture (ISCA), June 2011.
Recall the Research Statements

• Does a memory hierarchy provide enough bandwidth?
• How many levels in the memory hierarchy?
• What are the capacity and bandwidth of each level?
• Which memory technologies are chosen for different levels?
Different Memory Technologies

[Figure: iso-power lines for SRAM, eDRAM, MRAM, PCM (PCRAM), and RRAM (memristor) on the CB coordinate, for a write:read ratio of 1:9, over capacities from 256 KB to 256 MB. The lines cross: different technologies are the most energy-efficient at different capacity/bandwidth points.]
Different Memory Technologies (cont.)

[Figure: the iso-power lines for SRAM and eDRAM cross: SRAM is more energy-efficient at smaller capacities, while eDRAM is more energy-efficient at larger capacities.]
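A minimal sketch of how such a comparison could be computed (the per-access energies below are invented placeholders, not data from the talk; the 1:9 write:read mix matches the figures, and capacity-dependent effects such as leakage, which create the crossovers, are omitted):

    # Compare memory technologies by access power at a given access rate.
    WRITE_FRACTION = 0.1                 # write : read = 1 : 9, as in the figure

    TECH_ENERGY_NJ = {                   # (read, write) nJ per access, assumed
        "SRAM":  (0.5, 0.5),
        "eDRAM": (0.8, 0.8),
        "MRAM":  (0.4, 2.5),
        "PCM":   (0.6, 6.0),
    }

    def access_power_watts(tech, accesses_per_second):
        e_read, e_write = TECH_ENERGY_NJ[tech]
        e_avg = (1 - WRITE_FRACTION) * e_read + WRITE_FRACTION * e_write
        return e_avg * 1e-9 * accesses_per_second

    for tech in TECH_ENERGY_NJ:
        print(tech, round(access_power_watts(tech, 1e9), 3), "W at 1G accesses/s")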
Outline

• Motivation
• Moguls Memory Model
• Energy-efficient Memory Hierarchy Design with Moguls
• Future Prediction
  • Memory hierarchy
  • Processor designs
• Conclusion
Our Research Statement

• What should the memory hierarchy look like?
  • How should we use the latest technology components?
  • What is the proper number of levels?
  • What should the capacity/bandwidth of each level be?
• Can we explore the design space quickly?
  • Simulation-based design evaluation is slow
    • Not feasible for a large design space
  • Exploration using an analytical model (Moguls)
    • Fast and accurate (with proper approximations)
Historical Trend

• Growing bandwidth gap
  • Processor speed increases ~50% per year
  • Memory bandwidth increases ~27% per year
• Intel processors introduced on-die:
  • L1 cache in 1990
  • L2 cache in 1998
  • L3 cache in 2005

Optimal Performance per Watt

A new level of cache every 5-7 years.
Takeaway Messages

• It is time to add another level of memory to the hierarchy to alleviate the bandwidth bottleneck
  • We need a new memory level roughly every 5-7 years
• Mathematically, with some assumptions about miss-rate curves and power consumption, we can solve for:
  • The optimal number of levels, and the capacity & BW of each level
  • Our study shows that an L4 helps significantly
Outline

• Motivation
• Moguls Memory Model
• Energy-efficient Memory Hierarchy Design with Moguls
• Future Prediction
  • Memory hierarchy
  • Processor designs
• Conclusion
Programming GPUs Has Become Popular

• Many show that GPUs have significant performance gains
• However, GPUs are NOT orders of magnitude faster than CPUs
• Architecture-specific optimizations are important

[Figure: publication counts over time. Source: Google Scholar, searching "GPGPU" and "GPGPU GPU CPU performance speedup".]
GPUs Are Against Future Trends

• GPU memory is not energy-efficient
  • The number of levels and the size of each level must be adjusted according to the throughput of the processors
• CUDA-like GPU programming models will not scale in the future
  • To get good performance on a GPU, explicit memory management is often required
  • That becomes a nightmare if the number of levels and the size of each level keep changing
• Trust me: reduce your bets on GPUs
Outline

• Motivation
• Moguls Memory Model
• Energy-efficient Memory Hierarchy Design with Moguls
• Future Prediction
  • Memory hierarchy
  • Processor designs
• Conclusion
Conclusions

• The winning multi-core architecture will have the most energy-efficient memory hierarchy
  • GPUs (and GDDR) are good, but not that good
• Expect more levels
  • 3D stacking, eDRAM, and other emerging technologies
  • "Moguls: A Model to Explore the Memory Hierarchy for Throughput Computing," G. Sun, C. J. Hughes, C. Kim, J. Zhao, C. Xu, Y. Xie, Y.-K. Chen, in International Symposium on Computer Architecture, June 2011.
• On-die adaptive/reconfigurable caches can save energy
  • "Performance and Energy Implications of Caches for Throughput Computing," C. Hughes, C. Kim, Y.-K. Chen, IEEE Micro, vol. 30, no. 6, pp. 25-35, Nov.-Dec. 2010.
• Software can help too
  • One line of our current work
• Discussion: what about cloud computing?
Energy-Efficient Memory Hierarchy
for Multi-core Architectures
Yen-Kuang Chen, Ph.D., IEEE Fellow
Principal Engineer, Intel Corporation
Associate Director, Intel-NTU CCC Center
With help from a long list of collaborators: Guangyu Sun, Jishen Zhao,
Cong Xu, Yuan Xie (PSU), Christopher Hughes, Changkyu Kim (Intel)