Page-based DRAM commands

Page-based Commands for DRAM Systems
Aamer Jaleel
Brinda Ganesh
Lei Zong
Outline
• Memory System Overview
• Related work
• Experiment setup
• Page level access measurements
• Solution
• Expected Speedup
Processor-Memory Gap
µProc 60% / year. Doubles every 1.5 years
DRAM 9% / year. Doubles every 10 years
Processor-Memory Performance Gap: Grows 50% / year
http://www.e-insite.net/ednmag
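The growth figures above can be checked with plain compound-growth arithmetic. The sketch below is only a back-of-the-envelope check, not part of the original slides; the helper names `doubling_time` and `relative_gap_growth` are hypothetical. It gives values of the same order as the rounded figures quoted on the slide.

```python
import math

def doubling_time(annual_growth):
    """Years needed to double at a compound annual growth rate (0.60 = 60%)."""
    return math.log(2) / math.log(1 + annual_growth)

def relative_gap_growth(fast, slow):
    """Annual growth of the ratio between two compounding quantities."""
    return (1 + fast) / (1 + slow) - 1

cpu_doubling = doubling_time(0.60)      # processor: 60% / year
dram_doubling = doubling_time(0.09)     # DRAM: 9% / year
gap = relative_gap_growth(0.60, 0.09)   # growth of the processor/DRAM ratio

print(f"Processor doubles every {cpu_doubling:.2f} years")
print(f"DRAM doubles every {dram_doubling:.2f} years")
print(f"Gap grows {gap * 100:.1f}% per year")
```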
Memory Access Time
[Diagram: CPU, Core, L1, L2, MC, DRAM]
Access Time (cycles)
L1      3
L2      8
DRAM    181
Data for 1.8 GHz Opteron
www.aceshardware.com/
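To relate the cycle counts above to wall-clock time, they can be divided by the clock frequency. This is a small illustrative conversion using only the numbers on the slide; the function name `cycles_to_ns` is hypothetical.

```python
CLOCK_GHZ = 1.8  # from the slide: 1.8 GHz Opteron
ACCESS_CYCLES = {"L1": 3, "L2": 8, "DRAM": 181}  # from the slide

def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """One cycle lasts 1 / clock_ghz nanoseconds."""
    return cycles / clock_ghz

latency_ns = {level: cycles_to_ns(c) for level, c in ACCESS_CYCLES.items()}
for level, ns in latency_ns.items():
    print(f"{level}: {ns:.1f} ns")
```

Under this conversion a DRAM access costs roughly 100 ns, which is consistent with the per-row time used later in the speedup estimate.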
Large Size Memory Accesses
• Applications
– Initialization
– Data Movement
– Stream operations
• Operating System
– Task Creation
– System Calls
– Page Allocation, Management
• Functions that would use them
– Memset, Clear User
– Memcpy, Copy from User, Copy To User
Experiment Setup
• Workstation based
– 2.4 GHz P4 (Wonko)
– 750 MHz PIII (Majikthise)
– 900 MHz PIII (Jaleel)
• Bochs x86 emulator
• Operating System
– Linux Kernel v 2.4.19
• Applications
– SPEC2000 Integer benchmarks using glibc-2.2.5
Memset : Count
[Figure: Memset count per benchmark (mcf, parser, vpr, bzip2, vortex, gcc, gzip, perlbmk, twolf, crafty); log-scale axis, 1.00E+00 to 1.00E+08]
Memset : Access Size
[Figure: Average Length and Maximum Length of Memset accesses per benchmark (mcf, parser, vpr, bzip2, vortex, gcc, gzip, perlbmk, twolf, crafty); log-scale axis, 1.0E+00 to 1.0E+09]
Memset : % Overhead
[Figure: % Memset Time per benchmark (mcf, parser, vpr, bzip2, vortex, gcc, gzip, perlbmk, twolf, crafty); axis: % Overhead, 0 to 25]
Memcpy: Count
[Figure: Memcpy count per benchmark (mcf, parser, vpr, bzip2, vortex, gcc, gzip, perlbmk, twolf, crafty); log-scale axis, 1.00E+00 to 1.00E+08]
Memcpy : Access Size
[Figure: Average Length and Maximum Length of Memcpy accesses per benchmark (vortex, gcc, gzip, perlbmk, twolf, crafty, vpr, bzip2, mcf, parser); log-scale axis, 1.0E+00 to 1.0E+09]
Memcpy: % Overhead
[Figure: Memcpy overhead per benchmark (mcf, parser, vpr, bzip2, vortex, gcc, gzip, perlbmk, twolf, crafty); axis: % Overhead, 0 to 35]
OS : Memset / Clear User
Real-Time Plot
• Behavior over Time
• Frequency of operation
• Access Size
• Operation Duration
• Averages
OS : Memcpy / Copy User
Real-Time Plot
• Behavior over Time
• Frequency of operation
• Access Size
• Operation Duration
• Averages
Page based Commands
• Set Page
– A ← constant
• Copy Page
– A ← B
• Page level Arithmetic operations
– A ← B + C
– A ← B - C
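The command set listed above can be illustrated with a small software model. This is only a sketch of the intended semantics, with memory modelled as a `bytearray`; the function names and the byte-wise, wrap-around arithmetic are assumptions and do not come from the slides.

```python
PAGE_SIZE = 4096  # 4 kB pages, as in the slides

def set_page(mem, dst, value):
    """A <- constant: fill the page starting at dst with one byte value."""
    mem[dst:dst + PAGE_SIZE] = bytes([value]) * PAGE_SIZE

def copy_page(mem, dst, src):
    """A <- B: copy the page at src onto the page at dst."""
    mem[dst:dst + PAGE_SIZE] = mem[src:src + PAGE_SIZE]

def add_pages(mem, dst, src1, src2):
    """A <- B + C, byte-wise, wrapping modulo 256 (an assumed element width)."""
    for i in range(PAGE_SIZE):
        mem[dst + i] = (mem[src1 + i] + mem[src2 + i]) & 0xFF

# Usage: four pages of simulated memory.
mem = bytearray(4 * PAGE_SIZE)
set_page(mem, 1 * PAGE_SIZE, 3)                   # page 1 <- 3
set_page(mem, 2 * PAGE_SIZE, 4)                   # page 2 <- 4
add_pages(mem, 0, 1 * PAGE_SIZE, 2 * PAGE_SIZE)   # page 0 <- page 1 + page 2
copy_page(mem, 3 * PAGE_SIZE, 0)                  # page 3 <- page 0
```

In real hardware each of these would be a single command executed at the DRAM, rather than a loop on the processor.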
Page based Commands
SETPAGE ZERO, 0x04000
[Diagram: a 4 kB page in DRAM]
Page based Commands
SETPAGE ZERO, 0x04000
[Diagram: a 4 kB page in DRAM and 128 bytes in the cache]
Page based Commands Issue
How do we ensure Memory and Cache Consistency?
How much data is actually in the cache?
Function                % Hit Rate (Boot + Halt)    % Hit Rate (SPEC workload)
Memset                  7.23%                       0.23%
Memcpy (Source)         7.88%                       10.53%
Memcpy (Destination)    < 0.01%                     < 0.01%
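One possible way to keep memory and cache consistent is to drop any cached lines of the target page before the page command runs in DRAM. The sketch below models only that idea; it is not the design from the slides. The `ToySystem` class, the 128-byte line size (taken from the diagram label) and the read-only cache are all assumptions made for illustration.

```python
PAGE_SIZE = 4096   # 4 kB page, from the slides
LINE_SIZE = 128    # assumed cache line size (the "128 bytes" diagram label)

class ToySystem:
    """Hypothetical model: DRAM as a bytearray, cache as {line_addr: bytes}."""

    def __init__(self, dram_bytes):
        self.dram = bytearray(dram_bytes)
        self.cache = {}

    def load(self, addr):
        """Read one byte through the cache, filling the line on a miss."""
        line = addr - addr % LINE_SIZE
        if line not in self.cache:
            self.cache[line] = bytes(self.dram[line:line + LINE_SIZE])
        return self.cache[line][addr - line]

    def set_page(self, page_addr, value):
        """Drop cached lines of the page, then fill the page in DRAM."""
        invalidated = 0
        for line in range(page_addr, page_addr + PAGE_SIZE, LINE_SIZE):
            if self.cache.pop(line, None) is not None:
                invalidated += 1
        self.dram[page_addr:page_addr + PAGE_SIZE] = bytes([value]) * PAGE_SIZE
        return invalidated

system = ToySystem(8 * PAGE_SIZE)
system.dram[0x4000] = 0xAB
stale = system.load(0x4000)            # brings one line into the cache
dropped = system.set_page(0x4000, 0)   # SETPAGE ZERO, 0x04000
fresh = system.load(0x4000)            # re-fetched from DRAM after the command
```

With hit rates as low as those in the table, few lines would need to be invalidated per command, so a check of this kind would usually find little or nothing to drop.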
Page based Commands Issue
SETPAGE ZERO, 0x04000
DRAM level Page Fragmentation
[Diagram: DRAM with two 4 kB regions marked]
Maximum number of rows a page can occupy is 2
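The two-row bound can be illustrated with a short calculation. The sketch assumes a linear address-to-row mapping and a DRAM row of at least one page (4 kB here, a hypothetical value); real address mappings interleave banks and channels and may differ. `rows_spanned` is a hypothetical helper.

```python
PAGE_SIZE = 4096  # 4 kB page, from the slides
ROW_SIZE = 4096   # assumed DRAM row size (hypothetical)

def rows_spanned(start, length=PAGE_SIZE, row_size=ROW_SIZE):
    """Return the DRAM row indices touched by [start, start + length)."""
    first = start // row_size
    last = (start + length - 1) // row_size
    return list(range(first, last + 1))

aligned = rows_spanned(0x4000)      # page starts on a row boundary
straddling = rows_spanned(0x4800)   # page starts in the middle of a row
```

A page that starts on a row boundary touches one row; any other starting offset makes it straddle exactly two, as long as the row is at least as large as the page.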
Solution
• Hardware at Cache Level
• Ability to map s/w pages to h/w pages
Expected Speedup I
Memset(Address, Length, SetValue)
Current Implementation
    EndAddr ← Address + Length
    While (Address < EndAddr)
        Mem[Address] ← SetValue
        Address ← Address + 1
Proposed Implementation
    While (Length >= PageSize)
        SetPage(SetValue, Address)
        Length ← Length – PageSize
        Address ← Address + PageSize
    Call Memset(Address, Length, SetValue)
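The two implementations can be compared in a runnable sketch. Memory is a `bytearray`, and `set_page` is a software stand-in for the proposed command. The leading alignment step is an added assumption: the slide pseudocode does not say how a start address that is not page-aligned would be handled.

```python
PAGE_SIZE = 4096  # 4 kB pages, as in the slides

def set_page(mem, value, address):
    """Stand-in for the proposed SetPage command (one DRAM-side operation)."""
    mem[address:address + PAGE_SIZE] = bytes([value]) * PAGE_SIZE

def memset_current(mem, address, length, value):
    """Current implementation: one store per byte."""
    end = address + length
    while address < end:
        mem[address] = value
        address += 1

def memset_proposed(mem, address, length, value):
    """Proposed implementation: whole pages via SetPage, remainder per byte."""
    # Assumption: fill up to the next page boundary first, byte by byte.
    head = min(length, -address % PAGE_SIZE)
    memset_current(mem, address, head, value)
    address, length = address + head, length - head
    pages = 0
    while length >= PAGE_SIZE:
        set_page(mem, value, address)
        address += PAGE_SIZE
        length -= PAGE_SIZE
        pages += 1
    memset_current(mem, address, length, value)  # leftover tail
    return pages

a = bytearray(4 * PAGE_SIZE)
b = bytearray(4 * PAGE_SIZE)
memset_current(a, 100, 10000, 0x5A)
pages_used = memset_proposed(b, 100, 10000, 0x5A)
```

Both calls leave memory in the same state; the proposed version replaces 4096 individual stores with one page command for each full page it covers.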
Expected Speedup II
• Current Memset Time for a page : 4 µs
• Expected Memset Time for a page
= # Rows in a page * Time to read a Row + Cache Coherence Logic + Misc
= 2 * 100 ns + X
= 200 ns + X
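The estimate above can be turned into a speedup figure for different values of the unknown overhead X. The sketch reads the current per-page time as 4 µs, which is an assumption about the slide's intended unit, and the sample values of X are hypothetical.

```python
ROWS_PER_PAGE = 2        # from the slides: a page occupies at most 2 rows
ROW_ACCESS_NS = 100      # from the slides: time to read a row
CURRENT_PAGE_NS = 4000   # assumption: current Memset time read as 4 microseconds

def expected_page_ns(overhead_ns):
    """Rows * row time + X, where X is coherence logic plus misc overhead."""
    return ROWS_PER_PAGE * ROW_ACCESS_NS + overhead_ns

def speedup(overhead_ns):
    """Ratio of current per-page time to the expected per-page time."""
    return CURRENT_PAGE_NS / expected_page_ns(overhead_ns)

for x in (0, 200, 800):  # hypothetical values of X, in ns
    print(f"X = {x:4d} ns -> speedup {speedup(x):.1f}x")
```

Under these assumptions the speedup is bounded by 20x when X is zero and falls as the coherence and miscellaneous overhead grows.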
Related Work
• IRAM – On-chip DRAM
– Advantage: bigger storage, eliminates much of the
off-chip memory access, energy efficient
– Disadvantage: not much performance increase,
doesn’t work with conventional microprocessors
• Active Pages – bring computation to DRAM
– divide memory into fixed-size pages and add
reconfigurable logic to DRAM
• Heap paper shows that some memory accesses
can be eliminated entirely
Conclusion
• Page-based commands are necessary.