Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement
Kshitij Sudan, Niladrish Chatterjee, David Nellans,
Manu Awasthi, Rajeev Balasubramonian, Al Davis
School of Computing, University of Utah
ASPLOS-2010
DRAM Memory Constraints
• Modern machines spend nearly 25% - 40% of total system power on memory.
• Some commercial servers already have larger power budgets for memory than for the CPU.
• Main memory access is one of the largest performance bottlenecks.
We address both performance and power concerns for DRAM memory accesses.
DRAM Access Mechanism
• The CPU makes a memory request and the Memory Controller converts it to appropriate DRAM commands.
• Accesses within a device begin with selecting a bank, then a row.
• Many bits are read from the DRAM cells to service a single CPU request!
• A few column bits are then selected from the row buffer.
• These bits are then output from the device.
[Figure labels: CPU; Memory Controller; memory bus or channel; DIMM; rank; DRAM chip or device; bank; array; row; row buffer; 1/8th of the row buffer; one word of data output.]
DRAM Access Inefficiencies - I
• Overfetch due to large row buffers.
  − 8 KB read into row buffer for a 64-byte cache line.
  − Row-buffer utilization for a single request < 1%.
• Why are row buffers so large?
  − Large arrays minimize cost-per-bit.
  − Striping a cache line across multiple chips (arrays) improves data transfer bandwidth.
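The "< 1%" figure follows directly from the two sizes on this slide. A minimal sketch of the arithmetic (the function and constant names are illustrative, not from the talk):

```python
# Row-buffer utilization for a single cache-line request,
# using the sizes quoted on the slide.
ROW_BUFFER_BYTES = 8 * 1024   # 8 KB read into the row buffer
CACHE_LINE_BYTES = 64         # 64-byte cache line actually requested


def row_buffer_utilization(bytes_used: int, row_bytes: int = ROW_BUFFER_BYTES) -> float:
    """Fraction of the row buffer that is consumed before the row is closed."""
    return bytes_used / row_bytes


single_request = row_buffer_utilization(CACHE_LINE_BYTES)
print(f"{single_request:.2%}")  # 0.78%
```

Every additional row-buffer hit to a different cache line in the same row adds another 64 bytes to the numerator, which is why co-locating hot data raises utilization.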
DRAM Access Inefficiencies - II
• Open page policy
  − Row buffers kept open with the hope that subsequent requests will be row buffer hits.
• FR-FCFS request scheduling (First-Ready FCFS)
  − Memory controller schedules requests to open row-buffers first.

                   Access Latency   Access Energy
Row-buffer Hit     ~ 75 cycles      ~ 18 nJ
Row-buffer Miss    ~ 225 cycles     ~ 38 nJ

• Diminishing locality in multi-cores.
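To see why a falling hit rate matters, the table's approximate figures can be combined into an expected per-access cost. This is a rough sketch only: it assumes each access is independently a hit or a miss and ignores queuing and bank-level parallelism; the function name is illustrative.

```python
# Approximate per-access costs taken from the table above.
HIT_CYCLES, MISS_CYCLES = 75, 225
HIT_NJ, MISS_NJ = 18, 38


def expected_cost(hit_rate: float) -> tuple:
    """Return (average cycles, average nJ) per access for a given row-buffer hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    cycles = hit_rate * HIT_CYCLES + miss_rate * MISS_CYCLES
    energy = hit_rate * HIT_NJ + miss_rate * MISS_NJ
    return cycles, energy


for rate in (0.8, 0.5, 0.2):
    cycles, energy = expected_cost(rate)
    print(f"hit rate {rate:.0%}: ~{cycles:.0f} cycles, ~{energy:.0f} nJ per access")
```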
DRAM Row-buffer Hit-rates
With increasing core counts, DRAM row-buffer hit rates drop.
Key Observation
Cache Block Access Pattern Within OS Pages
For heavily accessed pages in a given time interval,
accesses are usually to a few cache blocks.
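One way to check this observation on a trace is to histogram accesses per cache block within each page and ask how much of the traffic the top few blocks cover. The sketch below assumes 4 KB pages and 64-byte blocks (the sizes used elsewhere in this talk); the trace and helper names are hypothetical.

```python
from collections import Counter, defaultdict

PAGE_BYTES = 4096   # 4 KB OS page
BLOCK_BYTES = 64    # 64-byte cache block


def block_histograms(addresses):
    """Map page number -> Counter of accesses per cache-block index within that page."""
    pages = defaultdict(Counter)
    for addr in addresses:
        page = addr // PAGE_BYTES
        block = (addr % PAGE_BYTES) // BLOCK_BYTES
        pages[page][block] += 1
    return pages


def top_k_coverage(hist, k):
    """Fraction of a page's accesses that fall in its k most-accessed blocks."""
    total = sum(hist.values())
    top = sum(count for _, count in hist.most_common(k))
    return top / total


# Hypothetical trace for one page: two blocks receive almost all accesses.
trace = ([0x1000 + 64 * 3] * 90 + [0x1000 + 64 * 40] * 8
         + [0x1000 + 64 * b for b in (1, 2)])
hists = block_histograms(trace)
print(top_k_coverage(hists[1], 2))  # 0.98
```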
Outline
✓ DRAM Basics.
✓ Motivation.
• Basic Idea.
• Software Only Implementation (ROPS).
• Hardware Implementation (HAM).
• Results.
Basic Idea
Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row.
[Figure labels: 4 KB OS pages; 1 KB micro-pages; DRAM memory; reserved DRAM region; hottest micro-pages; coldest micro-pages.]
Basic Idea
• Identifying “hot” micro-pages.
  − Memory controller counters and OS daemon.
• Reserved rows in DRAM for hot micro-pages.
  − Simplifies book-keeping overheads. 4 MB capacity loss from a 4 GB system (< 0.1%).
• EPOCH based schemes.
  − Expose EPOCH length to the OS for flexibility.
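The bullets above can be sketched as a per-epoch counting loop. This is only an illustration under stated assumptions: the `EpochCounters` class, its methods, and the "pick the most-accessed micro-pages that fit" policy are hypothetical stand-ins for the memory-controller counters and OS daemon, not the authors' implementation.

```python
from collections import Counter

MICRO_PAGE_BYTES = 1024                                # 1 KB micro-pages
RESERVED_BYTES = 4 * 1024 * 1024                       # 4 MB reserved region
TOTAL_BYTES = 4 * 1024 ** 3                            # 4 GB system
RESERVED_SLOTS = RESERVED_BYTES // MICRO_PAGE_BYTES    # micro-pages that fit


class EpochCounters:
    """Hypothetical per-micro-page access counters, reset at every epoch boundary."""

    def __init__(self):
        self.counts = Counter()

    def record(self, phys_addr):
        """Count one access to the micro-page containing phys_addr."""
        self.counts[phys_addr // MICRO_PAGE_BYTES] += 1

    def end_epoch(self, slots=RESERVED_SLOTS):
        """Return the hottest micro-page numbers (hottest first) and reset the counters."""
        hottest = [mp for mp, _ in self.counts.most_common(slots)]
        self.counts.clear()
        return hottest


# Capacity set aside for the reserved rows, as a fraction of main memory.
print(f"{RESERVED_BYTES / TOTAL_BYTES:.3%}")  # 0.098%
```

Exposing the epoch length to the OS would correspond to choosing how often `end_epoch` is called.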
Software Only Implementation (ROPS)
[Figure: Baseline vs. Reduced OS Page Size (ROPS). Labels: CPU memory request; virtual address X; Translation Lookaside Buffer (TLB); physical address Y; physical address Z; hot micro-pages; cold micro-pages; 4 GB main memory; 4 MB reserved DRAM region.]
• Shrink the OS page size to 1 KB.
• Every Epoch:
  1. Migrate hot micro-pages.
     − TLB shoot-down and page table update.
  2. Promote cold micro-pages to a superpage.
     − Page table/TLB updated.
Software Only Implementation (ROPS)
• Reduced OS Page Size (ROPS).
  − Throughout the system, reduce the page size to 1 KB.
  − Migrate hot micro-pages via DRAM-copy.
  − Hot micro-pages live in the same row-buffer in the reserved DRAM region.
• Mitigate reduction in TLB reach by promoting cold micro-pages to 4 KB superpages.
• Superpage creation facilitated by “reservation-based” page allocation.
  − Allocate four 1 KB micro-pages to contiguous DRAM frames.
  − Allows contiguous virtual addresses to be placed in contiguous physical addresses → makes superpage creation easy.
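A minimal sketch of the reservation idea, assuming frames are numbered in 1 KB units and that a superpage needs four populated, non-migrated micro-pages in an aligned run. The `ReservationAllocator` class and its methods are hypothetical and are not taken from any OS.

```python
MICRO_PER_SUPER = 4   # four 1 KB micro-pages per 4 KB superpage


class ReservationAllocator:
    """Hypothetical allocator: reserves an aligned run of four 1 KB frames
    the first time any micro-page of a 4 KB virtual region is touched."""

    def __init__(self):
        self.next_group = 0       # index of the next free aligned frame group
        self.reservations = {}    # virtual 4 KB region -> first frame of its group
        self.page_table = {}      # virtual micro-page -> physical 1 KB frame

    def map(self, vpage):
        """Map a virtual micro-page into its slot of the region's reserved run."""
        region, offset = divmod(vpage, MICRO_PER_SUPER)
        if region not in self.reservations:
            self.reservations[region] = self.next_group * MICRO_PER_SUPER
            self.next_group += 1
        frame = self.reservations[region] + offset
        self.page_table[vpage] = frame
        return frame

    def can_promote(self, region, hot=frozenset()):
        """A region can become a 4 KB superpage once all four micro-pages are
        mapped and none of them is hot (hot ones move to the reserved rows)."""
        base = region * MICRO_PER_SUPER
        return all(v in self.page_table and v not in hot
                   for v in range(base, base + MICRO_PER_SUPER))
```

Because each region's frames are contiguous and aligned by construction, promotion needs no copying in this sketch, only a page-table and TLB update.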
Hardware Implementation (HAM)
[Figure: Baseline vs. Hardware Assisted Migration (HAM). Labels: CPU memory request; physical address X; Page A; 4 GB main memory; Mapping Table (old address X → new address Y); new addr. Y; 4 MB reserved DRAM region.]
Hardware Implementation (HAM)

• Hardware Assisted Migration (HAM).
• New level of address indirection
  − Place data wherever you want in the DRAM.
• Maintain a Mapping Table (MT)
  − Preserve old physical addresses of migrated micro-pages.
• DRAM-copy of hot micro-pages to the reserved rows.
• Populate/update MT every EPOCH.
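The extra level of indirection can be sketched as a lookup keyed by the old micro-page number. The `MappingTable` class below is a hypothetical software model, assuming 1 KB micro-pages and a contiguous reserved region; it does not model the DRAM-copy itself or the hardware organization of the table.

```python
MICRO_PAGE_BYTES = 1024   # 1 KB micro-pages


class MappingTable:
    """Hypothetical MT: old micro-page number -> slot in the reserved rows."""

    def __init__(self, reserved_base):
        self.reserved_base = reserved_base   # start address of the reserved region
        self.entries = {}

    def update(self, hot_micro_pages):
        """Repopulate the table at an epoch boundary (data copy not modeled)."""
        self.entries = {mp: slot for slot, mp in enumerate(hot_micro_pages)}

    def translate(self, phys_addr):
        """Redirect accesses to migrated micro-pages; pass others through unchanged."""
        mp, offset = divmod(phys_addr, MICRO_PAGE_BYTES)
        slot = self.entries.get(mp)
        if slot is None:
            return phys_addr
        return self.reserved_base + slot * MICRO_PAGE_BYTES + offset


mt = MappingTable(reserved_base=0x10000)
mt.update([7, 3])                        # micro-pages 7 and 3 were hottest this epoch
print(hex(mt.translate(7 * 1024 + 5)))   # 0x10005
```

The OS keeps using the old physical address X; only the memory controller sees the new address Y, which is why no TLB shoot-down is needed in this scheme.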
Results

Schemes Evaluated
• Baseline
• Oracle/Profiled:
  − Best-effort estimate of expected benefit in the next epoch based on a prior profile run.
• Epoch Based ROPS and HAM
  − Evaluated epoch lengths of 5M, 10M, 50M, and 100M cycles.
  − Trends are similar, best perf. with 5M and 10M.

Simulation Parameters
• Simics simulation platform.
• DRAMSim based DRAM timing.
• DRAM timing and energy figures from Micron datasheets.

CPU                            4-core Out-of-Order CMP, 2 GHz freq.
L1 Inst. and Data Cache        Private, 32 KB/2-way, 1-cycle access
L2 Unified Cache               Shared, 128 KB/8-way, 10-cycle access
Total DRAM Capacity            4 GB
DIMM Configuration             8 DIMMs, 1 rank/DIMM, 64 bit channel, 8 devices/DIMM
Active Row-Buffers per DIMM    4
DIMM-Level Row-Buffer Size     8 KB
Results
Accesses to Micro-Pages in Reserved Rows in an Epoch
[Figure: % of total accesses to micro-pages in reserved rows, plotted with the total number of 4 KB pages touched in an epoch. Legend: % accesses to micro-pages; 4 KB pages touched.]
Results
5M cycle EPOCH, ROPS, HAM and ORACLE
[Figure: Percent change in performance.]
• Applications with room for improvement show an average performance improvement of 9%.
• Hardware assisted migration offers better returns due to lower TLB management overheads.
• Apart from 9% perf. gains, our schemes also save energy at the same time!
Results
ROPS, HAM and ORACLE
Energy consumption of the DRAM sub-system
[Figure: % reduction in DRAM energy.]
Conclusions
• On average, for applications with room for improvement and with our best performing scheme:
  − Average performance ↑ 9% (max. 18%).
  − Average memory energy consumption ↓ 18% (max. 62%).
  − Average row-buffer utilization ↑ 38%.
• Hardware assisted migration offers better returns due to lower overheads from TLB shoot-downs and TLB misses.
• Future work
  − Can co-locate hot micro-pages that are accessed around the same time.
That's all for today …
Questions?
http://www.cs.utah.edu/arch-research