18-447: Computer Architecture
Lecture 19: Main Memory
Prof. Onur Mutlu
Carnegie Mellon University
Spring 2012, 4/2/2012
Reminder: Homeworks

- Homework 5
  - Due today
  - Topics: Out-of-order execution, dataflow, vector processing, memory, caches

2
Homework 4 Grades

(Histogram of grades omitted; x-axis: grade, y-axis: number of students.)

Average: 83.14
Median: 83
Max: 105
Min: 51
Max Possible Points: 105
Total number of students: 47

3
Reminder: Lab Assignments

- Lab Assignment 5
  - Implementing caches and branch prediction in a high-level timing simulator of a pipelined processor
  - Due April 6
  - Extra credit: Cache exploration and high performance with optimized caches

4
Lab 4 Grades

(Histogram of grades omitted; x-axis: grade bins from 230-240 up to 760-770, y-axis: number of students.)

Average: 665.3
Median: 695
Max: 770
Min: 230
Max Possible Points (w/o EC): 700
Total number of students: 46

5
Lab 4: Correct Designs and Extra Credit

Rank | Student           | Crit. Path (ns) | Cycles | Execution Time (ns) | Relative Execution Time
1    | Eric Brunstad     | 10.425          | 34568  | 360371.4            | 1.00
2    | Arthur Chang      | 10.686          | 34804  | 371915.5            | 1.03
3    | Alex Crichton     | 10.85           | 34636  | 375800.6            | 1.04
4    | Jason Lin         | 11.312          | 34672  | 392209.7            | 1.09
5    | Anish Phophaliya  | 10.593          | 37560  | 397873.0            | 1.10
6    | James Wahawisan   | 9.16            | 44976  | 411980.2            | 1.14
7    | Prerak Patel      | 11.315          | 37886  | 428680.1            | 1.19
8    | Greg Nazario      | 12.23           | 35696  | 436562.1            | 1.21
9    | Kee Young Lee     | 10.019          | 44976  | 450614.5            | 1.25
10   | Jonathan Loh      | 13.731          | 33668  | 462295.3            | 1.28
11   | Vikram Rajkumar   | 13.823          | 34932  | 482865.0            | 1.34
12   | Justin Wagner     | 15.065          | 33728  | 508112.3            | 1.41
13   | Daniel Jacobs     | 13.593          | 37782  | 513570.7            | 1.43
14   | Mike Mu           | 14.055          | 36832  | 517673.8            | 1.44
15   | Qiannan Zhang     | 13.484          | 38764  | 522693.8            | 1.45
16   | Andrew Tan        | 16.754          | 34660  | 580693.6            | 1.61
17   | Dennis Liang      | 16.722          | 37176  | 621657.1            | 1.73
18   | Dev Gurjar        | 12.864          | 57332  | 737518.8            | 2.05
19   | Winnie Woo        | 23.281          | 33976  | 790995.3            | 2.19

6
Lab 4 Extra Credit

Rank | Student       | Crit. Path (ns) | Cycles | Execution Time (ns) | Relative Execution Time
1    | Eric Brunstad | 10.425          | 34568  | 360371.4            | 1.00
2    | Arthur Chang  | 10.686          | 34804  | 371915.5            | 1.03
3    | Alex Crichton | 10.85           | 34636  | 375800.6            | 1.04

7
Reminder: Midterm II

- Next week
  - April 11
- Everything covered in the course can be on the exam
- You can bring in two cheat sheets (8.5x11’’)

8
Review of Last Lecture

- Wrap up basic caches
  - Handling writes
  - Sectored caches
  - Instruction vs. data
  - Multi-level caching issues
  - Cache performance
  - Multiple outstanding misses
  - Multiple accesses per cycle
- Start main memory
  - DRAM basics
  - Interleaving
  - Bank, rank concepts

9
Review: Interleaving

- Interleaving (banking)
  - Problem: a single monolithic memory array takes long to access and does not enable multiple accesses in parallel
  - Goal: Reduce the latency of memory array access and enable multiple accesses in parallel
  - Idea: Divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles)
    - Each bank is smaller than the entire memory storage
    - Accesses to different banks can be overlapped
  - Issue: How do you map data to different banks? (i.e., how do you interleave data across banks?)

10
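To make the interleaving idea concrete, here is a minimal sketch I added (not from the lecture) of low-order bank interleaving: the address bits just above the byte offset select the bank, so consecutive memory words fall into consecutive banks and their accesses can be overlapped. The 8-byte word size and 8-bank count are illustrative assumptions.

#include <stdio.h>
#include <stdint.h>

#define NUM_BANKS      8   /* assumed bank count            */
#define BYTES_PER_WORD 8   /* assumed memory word size (8B) */

/* Low-order interleaving: the bits just above the byte offset pick the bank. */
static unsigned bank_of(uint64_t addr)
{
    return (addr / BYTES_PER_WORD) % NUM_BANKS;
}

int main(void)
{
    /* Consecutive words map to consecutive banks, so a streaming access
     * pattern keeps all banks busy and the accesses can be overlapped. */
    for (uint64_t addr = 0; addr < 16 * BYTES_PER_WORD; addr += BYTES_PER_WORD)
        printf("word at 0x%03llx -> bank %u\n",
               (unsigned long long)addr, bank_of(addr));
    return 0;
}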
The DRAM Subsystem

DRAM Subsystem Organization

- Channel
- DIMM
- Rank
- Chip
- Bank
- Row/Column

12
The DRAM Bank Structure
13
The DRAM Bank Structure
14
Page Mode DRAM

- A DRAM bank is a 2D array of cells: rows x columns
- A “DRAM row” is also called a “DRAM page”
- “Sense amplifiers” also called “row buffer”
- Each address is a <row,column> pair
- Access to a “closed row”
  - Activate command opens row (placed into row buffer)
  - Read/write command reads/writes column in the row buffer
  - Precharge command closes the row and prepares the bank for next access
- Access to an “open row”
  - No need for activate command

15
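As a rough illustration of the command sequences described above, the sketch below (my own, not from the slides) tracks the row-buffer state of one bank and emits ACTIVATE / READ / PRECHARGE as needed for each <row,column> access. The access sequence matches the example on the next slide.

#include <stdio.h>

/* Per-bank row-buffer state: which row (if any) is currently open. */
static int open_row = -1;          /* -1 means the bank is precharged/closed */

/* Issue the DRAM commands needed to read <row, col> from one bank. */
static void access_dram(int row, int col)
{
    if (open_row == row) {
        /* Row hit: the row is already in the row buffer. */
        printf("READ      row %d col %d (row hit)\n", row, col);
    } else {
        if (open_row != -1) {
            /* Row conflict: close the currently open row first. */
            printf("PRECHARGE (close row %d)\n", open_row);
        }
        printf("ACTIVATE  row %d\n", row);   /* open the row into the row buffer */
        printf("READ      row %d col %d\n", row, col);
        open_row = row;                      /* open-row policy: keep it open */
    }
}

int main(void)
{
    access_dram(0, 0);   /* closed row   -> ACTIVATE + READ             */
    access_dram(0, 1);   /* open row     -> READ only (row hit)         */
    access_dram(0, 85);  /* open row     -> READ only (row hit)         */
    access_dram(1, 0);   /* row conflict -> PRECHARGE + ACTIVATE + READ */
    return 0;
}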
DRAM Bank Operation

(Figure: a bank is a 2D array of rows and columns. The row decoder uses the row address to load one row into the row buffer; the column mux uses the column address to select data out of the row buffer. The example steps through accesses (Row 0, Column 0), (Row 0, Column 1), (Row 0, Column 85), which are row buffer hits once Row 0 is open, and then (Row 1, Column 0), a row buffer conflict that requires closing Row 0 and opening Row 1.)

16
The DRAM Chip

- Consists of multiple banks (2-16 in Synchronous DRAM)
- Banks share command/address/data buses
- The chip itself has a narrow interface (4-16 bits per read)

17
128M x 8-bit DRAM Chip
18
DRAM Rank and Module

- Rank: Multiple chips operated together to form a wide interface
- All chips comprising a rank are controlled at the same time
  - Respond to a single command
  - Share address and command buses, but provide different data
- A DRAM module consists of one or more ranks
  - E.g., DIMM (dual inline memory module)
  - This is what you plug into your motherboard
- If we have chips with 8-bit interface, to read 8 bytes in a single access, use 8 chips in a DIMM

19
A 64-bit Wide DIMM (One Rank)

(Figure: eight DRAM chips share the command bus; together they drive the 64-bit data bus, each chip supplying 8 bits.)

20
A 64-bit Wide DIMM (One Rank)

- Advantages:
  - Acts like a high-capacity DRAM chip with a wide interface
  - Flexibility: memory controller does not need to deal with individual chips
- Disadvantages:
  - Granularity: Accesses cannot be smaller than the interface width

21
Multiple DIMMs

- Advantages:
  - Enables even higher capacity
- Disadvantages:
  - Interconnect complexity and energy consumption can be high

22
DRAM Channels

- 2 Independent Channels: 2 Memory Controllers (Above)
- 2 Dependent/Lockstep Channels: 1 Memory Controller with wide interface (Not shown above)

23
Generalized Memory Structure
24
Generalized Memory Structure
25
The DRAM Subsystem: The Top Down View

DRAM Subsystem Organization

- Channel
- DIMM
- Rank
- Chip
- Bank
- Row/Column

27
The DRAM subsystem

(Figure: the processor connects over memory channels to DIMMs (dual in-line memory modules); each channel hosts one or more DIMMs.)

Breaking down a DIMM

(Figure: side, front, and back views of a DIMM (dual in-line memory module).)

Breaking down a DIMM

(Figure: the front of the DIMM holds Rank 0, a collection of 8 chips; the back holds Rank 1.)

Rank

(Figure: Rank 0 (front) and Rank 1 (back) share the Addr/Cmd bus and the 64-bit Data <0:63> bus; chip select CS <0:1> picks the rank. Within Rank 0, Chip 0 drives data bits <0:7>, Chip 1 drives <8:15>, ..., Chip 7 drives <56:63>.)

Breaking down a Rank

(Figure: the 64-bit data bus <0:63> is split into eight 8-bit slices, one per chip; each chip, e.g. Chip 0, sees only its own <0:7> slice.)

Breaking down a Chip

(Figure: a chip contains multiple banks (Bank 0, ...), all sharing the chip's 8-bit data interface <0:7>.)

Breaking down a Bank

(Figure: a bank is an array of 16K rows (row 0 to row 16k-1), each row 2kB wide; a column is 1B. An activated row is held in the row buffer, from which 1B columns are read out over the chip's <0:7> interface.)
DRAM Subsystem Organization

- Channel
- DIMM
- Rank
- Chip
- Bank
- Row/Column

35
Example: Transferring a cache block

(Figure sequence: the physical memory space from 0x00 to 0xFFFF…F is mapped onto Channel 0, DIMM 0, Rank 0. The 64B cache block between addresses 0x00 and 0x40 maps to Row 0 and is spread across Chips 0-7: Chip 0 supplies data bits <0:7>, Chip 1 supplies <8:15>, ..., Chip 7 supplies <56:63>. Reading Col 0 of Row 0 returns the first 8B, one byte from each chip; reading Col 1 returns the next 8B; and so on.)

A 64B cache block takes 8 I/O cycles to transfer.
During the process, 8 columns are read sequentially.
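The sketch below is an illustration I added, using the organization in the figures (one rank of 8 chips with 8-bit interfaces, 64B cache blocks): it shows how each byte of the block maps to a (column read, chip) pair, so the block drains in 8 column reads of 8B each.

#include <stdio.h>

#define CHIPS_PER_RANK  8   /* each chip drives 8 bits of the 64-bit bus */
#define BLOCK_SIZE     64   /* cache block size in bytes                 */

int main(void)
{
    /* One column read returns 8B: one byte from each of the 8 chips.
     * A 64B block therefore needs 8 consecutive column reads.          */
    for (int byte = 0; byte < BLOCK_SIZE; byte++) {
        int column = byte / CHIPS_PER_RANK;  /* which beat / column read */
        int chip   = byte % CHIPS_PER_RANK;  /* which chip supplies it   */
        printf("block byte %2d: column read %d, chip %d (data bits <%d:%d>)\n",
               byte, column, chip, chip * 8, chip * 8 + 7);
    }
    printf("=> %d column reads, %d I/O cycles for the %dB block\n",
           BLOCK_SIZE / CHIPS_PER_RANK, BLOCK_SIZE / CHIPS_PER_RANK, BLOCK_SIZE);
    return 0;
}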
Latency Components: Basic DRAM Operation

- CPU → controller transfer time
- Controller latency
  - Queuing & scheduling delay at the controller
  - Access converted to basic commands
- Controller → DRAM transfer time
- DRAM bank latency
  - Simple CAS if row is “open” OR
  - RAS + CAS if array precharged OR
  - PRE + RAS + CAS (worst case)
- DRAM → CPU transfer time (through controller)

43
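To make the three bank-latency cases concrete, here is a back-of-the-envelope sketch I added; the tRP/tRCD/tCL values and the fixed controller and transfer delays are made-up placeholders, not numbers from the lecture.

#include <stdio.h>

/* Placeholder timing parameters (in ns) -- illustrative only. */
#define T_CPU_TO_CTRL   2.0   /* CPU -> controller transfer       */
#define T_CTRL          5.0   /* queuing/scheduling + translation */
#define T_CTRL_TO_DRAM  2.0   /* controller -> DRAM transfer      */
#define T_RP           15.0   /* precharge (PRE)                  */
#define T_RCD          15.0   /* activate-to-read (RAS)           */
#define T_CL           15.0   /* read latency (CAS)               */
#define T_DRAM_TO_CPU   4.0   /* data back through controller     */

/* Bank latency depends on the row-buffer state when the access arrives. */
enum bank_state { ROW_OPEN_HIT, BANK_PRECHARGED, ROW_OPEN_CONFLICT };

static double access_latency(enum bank_state s)
{
    double bank = T_CL;                                         /* CAS only (row hit) */
    if (s == BANK_PRECHARGED)   bank = T_RCD + T_CL;            /* RAS + CAS          */
    if (s == ROW_OPEN_CONFLICT) bank = T_RP + T_RCD + T_CL;     /* PRE + RAS + CAS    */
    return T_CPU_TO_CTRL + T_CTRL + T_CTRL_TO_DRAM + bank + T_DRAM_TO_CPU;
}

int main(void)
{
    printf("row hit      : %.1f ns\n", access_latency(ROW_OPEN_HIT));
    printf("precharged   : %.1f ns\n", access_latency(BANK_PRECHARGED));
    printf("row conflict : %.1f ns\n", access_latency(ROW_OPEN_CONFLICT));
    return 0;
}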
Multiple Banks (Interleaving) and Channels

- Multiple banks
  - Enable concurrent DRAM accesses
  - Bits in address determine which bank an address resides in
- Multiple independent channels serve the same purpose
  - But they are even better because they have separate data buses
  - Increased bus bandwidth
- Enabling more concurrency requires reducing
  - Bank conflicts
  - Channel conflicts
- How to select/randomize bank/channel indices in address?
  - Lower order bits have more entropy
  - Randomizing hash functions (XOR of different address bits)

44
How Multiple Banks/Channels Help
45
Multiple Channels

- Advantages
  - Increased bandwidth
  - Multiple concurrent accesses (if independent channels)
- Disadvantages
  - Higher cost than a single channel
    - More board wires
    - More pins (if on-chip memory controller)

46
Address Mapping (Single Channel)

- Single-channel system with 8-byte memory bus
  - 2GB memory, 8 banks, 16K rows & 2K columns per bank
- Row interleaving
  - Consecutive rows of memory in consecutive banks

    Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

- Cache block interleaving
  - Consecutive cache block addresses in consecutive banks
  - 64 byte cache blocks

    Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)

  - Accesses to consecutive cache blocks can be serviced in parallel
  - How about random accesses? Strided accesses?

47
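Here is a small decoding sketch I added for the configuration above (2GB, 8-byte bus, 8 banks, 16K rows, 2K columns, 64B blocks); the field widths and positions follow the two layouts on this slide.

#include <stdio.h>
#include <stdint.h>

/* Field widths from the slide: 14-bit row, 3-bit bank, 11-bit column,
 * 3-bit byte-in-bus (8-byte bus). 14 + 3 + 11 + 3 = 31 bits = 2GB.     */

/* Row interleaving: Row | Bank | Column | Byte-in-bus */
static void decode_row_interleaved(uint32_t a)
{
    printf("row-ilv   0x%08x: row %5u bank %u col %4u byte %u\n", a,
           (a >> 17) & 0x3FFF, (a >> 14) & 0x7, (a >> 3) & 0x7FF, a & 0x7);
}

/* Cache block interleaving: Row | HighCol(8) | Bank | LowCol(3) | Byte-in-bus */
static void decode_block_interleaved(uint32_t a)
{
    uint32_t col = (((a >> 9) & 0xFF) << 3) | ((a >> 3) & 0x7);
    printf("block-ilv 0x%08x: row %5u bank %u col %4u byte %u\n", a,
           (a >> 17) & 0x3FFF, (a >> 6) & 0x7, col, a & 0x7);
}

int main(void)
{
    /* Consecutive 64B cache blocks: same bank under row interleaving,
     * consecutive banks under cache block interleaving.               */
    for (uint32_t a = 0; a < 4 * 64; a += 64) {
        decode_row_interleaved(a);
        decode_block_interleaved(a);
    }
    return 0;
}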
Bank Mapping Randomization

- DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely

(Figure: 3 higher-order address bits are XORed with the 3 address bits adjacent to | Column (11 bits) | Byte in bus (3 bits) | to form the 3-bit bank index.)

48
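A minimal sketch of the XOR idea, added by me; which higher-order address bits get XORed with the bank bits is a design choice. Here I assume the cache-block-interleaved layout from the previous slide, with the bank bits at positions [8:6] and the hashed-in bits taken from [16:14] purely for illustration.

#include <stdio.h>
#include <stdint.h>

/* Bank index = (3 bank bits) XOR (3 higher-order address bits).        */
static unsigned bank_index(uint32_t addr)
{
    unsigned bank_bits = (addr >> 6)  & 0x7;   /* bank field in the address    */
    unsigned hash_bits = (addr >> 14) & 0x7;   /* illustrative higher-order bits */
    return bank_bits ^ hash_bits;
}

int main(void)
{
    /* A power-of-two stride that would always hit the same bank with the
     * plain mapping gets spread across banks by the XOR hash.            */
    for (int i = 0; i < 8; i++) {
        uint32_t addr = (uint32_t)i * (1u << 14);   /* 16KB stride */
        printf("0x%08x -> plain bank %u, hashed bank %u\n",
               addr, (addr >> 6) & 0x7, bank_index(addr));
    }
    return 0;
}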
Address Mapping (Multiple Channels)

- With row interleaving, the channel bit (C) can be placed at different positions in the address:

    C | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
    Row (14 bits) | C | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
    Row (14 bits) | Bank (3 bits) | C | Column (11 bits) | Byte in bus (3 bits)
    Row (14 bits) | Bank (3 bits) | Column (11 bits) | C | Byte in bus (3 bits)

- Where are consecutive cache blocks? With cache block interleaving, the channel bit can likewise go in several positions:

    Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
    Row (14 bits) | C | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
    Row (14 bits) | High Column (8 bits) | C | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
    Row (14 bits) | High Column (8 bits) | Bank (3 bits) | C | Low Col. (3 bits) | Byte in bus (3 bits)
    Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | C | Byte in bus (3 bits)

49
Interaction with Virtual → Physical Mapping

- Operating System influences where an address maps to in DRAM

    VA: Virtual Page number (52 bits) | Page offset (12 bits)
    PA: Physical Frame number (19 bits) | Page offset (12 bits)
    PA: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

- Operating system can control which bank a virtual page is mapped to. It can randomize Page → <Bank,Channel> mappings
- Application cannot know/determine which bank it is accessing

50
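To see which DRAM mapping bits the OS actually controls, here is a quick sketch I added using the row-interleaved layout from slide 47 and an assumed 4KB page size: the 12-bit page offset fixes the byte-in-bus bits and the low 9 column bits, while the physical frame number chosen by the OS determines the remaining column bits, the bank bits, and the row bits.

#include <stdio.h>

#define PAGE_OFFSET_BITS 12   /* 4KB pages (assumed) */

/* Row-interleaved layout from slide 47: field name, low bit, width. */
struct field { const char *name; int lo; int width; };

int main(void)
{
    struct field f[] = {
        { "Byte in bus", 0, 3 }, { "Column", 3, 11 },
        { "Bank", 14, 3 },       { "Row", 17, 14 },
    };
    for (int i = 0; i < 4; i++) {
        int hi = f[i].lo + f[i].width - 1;
        /* Bits below the page offset are fixed by the virtual address;
         * bits at or above it are chosen by the OS via frame allocation. */
        const char *who = (hi < PAGE_OFFSET_BITS) ? "page offset (application)"
                        : (f[i].lo >= PAGE_OFFSET_BITS) ? "frame number (OS)"
                        : "split between offset and frame number";
        printf("%-12s bits [%2d:%2d] -> %s\n", f[i].name, hi, f[i].lo, who);
    }
    return 0;
}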
DRAM Refresh (I)

- DRAM capacitor charge leaks over time
- The memory controller needs to read each row periodically to restore the charge
  - Activate + precharge each row every N ms
  - Typical N = 64 ms
- Implications on performance?
  -- DRAM bank unavailable while refreshed
  -- Long pause times: If we refresh all rows in burst, every 64ms the DRAM will be unavailable until refresh ends
- Burst refresh: All rows refreshed immediately after one another
- Distributed refresh: Each row refreshed at a different time, at regular intervals

51
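A quick arithmetic sketch I added to compare burst and distributed refresh; the row count and the per-row refresh time are illustrative assumptions, only the 64 ms interval comes from the slide.

#include <stdio.h>

#define NUM_ROWS       8192     /* assumed number of rows to refresh       */
#define REFRESH_MS       64.0   /* every row refreshed once per 64 ms      */
#define T_REFRESH_US      0.1   /* assumed activate+precharge time per row */

int main(void)
{
    double window_us = REFRESH_MS * 1000.0;
    double busy_us   = NUM_ROWS * T_REFRESH_US;

    /* Burst refresh: one long pause per 64 ms window.                  */
    printf("burst refresh pause      : %.1f us every %.0f ms\n",
           busy_us, REFRESH_MS);

    /* Distributed refresh: one short refresh every 64ms / #rows.       */
    printf("distributed refresh      : one row every %.2f us\n",
           window_us / NUM_ROWS);

    /* Either way, the same fraction of time is lost to refresh.        */
    printf("bank unavailable fraction: %.2f%%\n", 100.0 * busy_us / window_us);
    return 0;
}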
DRAM Refresh (II)

- Distributed refresh eliminates long pause times
- How else can we reduce the effect of refresh on performance?
  - Can we reduce the number of refreshes?

52
Effect of DRAM Refresh
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
53
Retention Time of DRAM Cells

- Observation: DRAM cells have different data retention times
- Corollary: Not all rows need to be refreshed at the same frequency

54
Reducing DRAM Refresh Operations

- Idea: If we can identify the retention time of different rows, we can refresh each row at the frequency it really needs to be refreshed
- Implementation: Refresh controller bins the rows according to their minimum retention times and refreshes rows in each bin at the frequency specified for the bin
  - e.g., a bin for 64-128ms, another for 128-256ms, ...
  - Observation: Only very few rows need to be refreshed very frequently (every 256ms) → Have only a few bins → low HW overhead while reducing refresh frequency for most rows by 4X

- Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

55
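To illustrate the binning idea, here is a small sketch I wrote (not RAIDR's actual implementation): rows whose measured retention falls in the 64-128ms bin are refreshed every 64ms, rows in the 128-256ms bin every 128ms, and all remaining rows at a relaxed 256ms rate. The retention values in the example are made up.

#include <stdio.h>

/* Refresh a row only as often as its retention bin requires.
 * Bin boundaries follow the example on the slide.               */
static int refresh_period_ms(int retention_ms)
{
    if (retention_ms < 128) return 64;    /* 64-128ms bin    */
    if (retention_ms < 256) return 128;   /* 128-256ms bin   */
    return 256;                           /* everything else */
}

int main(void)
{
    /* Illustrative per-row retention times (ms); most rows retain data
     * for seconds, only a couple are weak.                              */
    int retention[] = { 70, 3000, 150, 5000, 4000, 9000, 10000, 2500 };
    int n = sizeof(retention) / sizeof(retention[0]);
    int baseline = 0, binned = 0;

    for (int i = 0; i < n; i++) {
        baseline += 1000 / 64;                               /* refresh all rows every 64 ms */
        binned   += 1000 / refresh_period_ms(retention[i]);  /* refresh per bin              */
        printf("row %d: retention %5d ms -> refresh every %d ms\n",
               i, retention[i], refresh_period_ms(retention[i]));
    }
    printf("refreshes per second: baseline %d, binned %d\n", baseline, binned);
    return 0;
}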
RAIDR Mechanism
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
56
DRAM Controller

- Purpose and functions
  - Ensure correct operation of DRAM (refresh and timing)
  - Service DRAM requests while obeying timing constraints of DRAM chips
    - Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays
    - Translate requests to DRAM command sequences
  - Buffer and schedule requests to improve performance
    - Reordering and row-buffer management
  - Manage power consumption and thermals in DRAM
    - Turn on/off DRAM chips, manage power modes

57
DRAM Controller Issues

- Where to place?
  - In chipset
    + More flexibility to plug different DRAM types into the system
    + Less power density in the CPU chip
  - On CPU chip
    + Reduced latency for main memory access
    + Higher bandwidth between cores and controller
      - More information can be communicated (e.g. request’s importance in the processing core)

58
DRAM Controller (II)
59
A Modern DRAM Controller
60
DRAM Scheduling Policies (I)

- FCFS (first come first served)
  - Oldest request first
- FR-FCFS (first ready, first come first served)
  1. Row-hit first
  2. Oldest first
  - Goal: Maximize row buffer hit rate → maximize DRAM throughput
- Actually, scheduling is done at the command level
  - Column commands (read/write) prioritized over row commands (activate/precharge)
  - Within each group, older commands prioritized over younger ones

61
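A minimal sketch of FR-FCFS prioritization I added: among the queued requests to a bank, a row hit beats a row miss, and ties are broken by age. The request fields and the open-row check are simplified assumptions, not a real controller's data structures.

#include <stdio.h>
#include <stdbool.h>

struct request { int id; int row; long arrival; };  /* simplified request */

/* FR-FCFS: 1) row-hit first, 2) oldest first. */
static bool higher_priority(const struct request *a, const struct request *b,
                            int open_row)
{
    bool hit_a = (a->row == open_row), hit_b = (b->row == open_row);
    if (hit_a != hit_b) return hit_a;            /* row-hit beats row-miss  */
    return a->arrival < b->arrival;              /* otherwise, oldest first */
}

int main(void)
{
    struct request q[] = { {0, 5, 100}, {1, 7, 110}, {2, 7, 120}, {3, 5, 130} };
    int n = 4, open_row = 7;

    /* Pick the next request to schedule for this bank. */
    int best = 0;
    for (int i = 1; i < n; i++)
        if (higher_priority(&q[i], &q[best], open_row)) best = i;

    printf("open row %d -> schedule request %d (row %d, arrival %ld)\n",
           open_row, q[best].id, q[best].row, q[best].arrival);
    return 0;
}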
DRAM Scheduling Policies (II)

- A scheduling policy is essentially a prioritization order
- Prioritization can be based on
  - Request age
  - Row buffer hit/miss status
  - Request type (prefetch, read, write)
  - Requestor type (load miss or store miss)
  - Request criticality
    - Oldest miss in the core?
    - How many instructions in core are dependent on it?

62
Row Buffer Management Policies

- Open row
  - Keep the row open after an access
  + Next access might need the same row → row hit
  -- Next access might need a different row → row conflict, wasted energy
- Closed row
  - Close the row after an access (if no other requests already in the request buffer need the same row)
  + Next access might need a different row → avoid a row conflict
  -- Next access might need the same row → extra activate latency
- Adaptive policies
  - Predict whether or not the next access to the bank will be to the same row

63
Open vs. Closed Row Policies

Policy     | First access | Next access                                       | Commands needed for next access
Open row   | Row 0        | Row 0 (row hit)                                   | Read
Open row   | Row 0        | Row 1 (row conflict)                              | Precharge + Activate Row 1 + Read
Closed row | Row 0        | Row 0 – access in request buffer (row hit)        | Read
Closed row | Row 0        | Row 0 – access not in request buffer (row closed) | Activate Row 0 + Read + Precharge
Closed row | Row 0        | Row 1 (row closed)                                | Activate Row 1 + Read + Precharge

64
Why are DRAM Controllers Difficult to Design?

- Need to obey DRAM timing constraints for correctness
  - There are many (50+) timing constraints in DRAM
  - tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued
  - tRC: Minimum number of cycles between the issuing of two consecutive activate commands to the same bank
  - ...
- Need to keep track of many resources to prevent conflicts
  - Channels, banks, ranks, data bus, address bus, row buffers
- Need to handle DRAM refresh
- Need to optimize for performance (in the presence of constraints)
  - Reordering is not simple
  - Predicting the future?

65
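As a taste of why tracking timing constraints is tedious, here is a toy sketch I added that enforces just the two constraints named above, tWTR and tRC, before allowing a command to issue. The cycle values are placeholders; a real controller tracks dozens of such constraints per channel, rank, and bank.

#include <stdio.h>
#include <stdbool.h>

#define T_WTR  6   /* placeholder: write-to-read delay (cycles)          */
#define T_RC  40   /* placeholder: activate-to-activate, same bank       */

static long last_write    = -1000;   /* cycle of last WRITE on the channel  */
static long last_activate = -1000;   /* cycle of last ACTIVATE to this bank */

/* Can a READ issue at cycle `now`? Must respect tWTR after a write.      */
static bool can_issue_read(long now)     { return now - last_write    >= T_WTR; }
/* Can an ACTIVATE issue at cycle `now`? Must respect tRC to the bank.    */
static bool can_issue_activate(long now) { return now - last_activate >= T_RC; }

int main(void)
{
    last_write    = 100;   /* a write was issued at cycle 100        */
    last_activate =  90;   /* the bank was activated at cycle 90     */

    /* Walk forward in time until both constraints are satisfied.    */
    for (long cycle = 101; cycle <= 135; cycle++) {
        if (can_issue_read(cycle) && can_issue_activate(cycle)) {
            printf("cycle %ld: ACTIVATE + READ may issue (tWTR and tRC satisfied)\n",
                   cycle);
            break;
        }
    }
    return 0;
}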
Why are DRAM Controllers Difficult to Design?

- From Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” HPS Technical Report, April 2010.

66
DRAM Power Management

- DRAM chips have power modes
- Idea: When not accessing a chip, power it down
- Power states
  - Active (highest power)
  - All banks idle
  - Power-down
  - Self-refresh (lowest power)
- State transitions incur latency during which the chip cannot be accessed

67