Extra processing power in memory controller

Memory Arithmetic Unit Interface
Jason M. Meier
Justin S. Teller
Tom J. Keeley
Current Paradigm
[Figure: timeline with CPU, memory controller, and memory rows for Task 1 and Task 2, with a "Done: Task 1" marker; block diagram of CPU, Memory Controller, and DRAM System]
Active Pages Implementation
• Used Configurable DRAM - RADRAM
• Reconfigurable logic implements various memory functions
• "Active Page" consists of a page of data and a set of associated functions
• Works on individual DRAM chips
• Processor-centric and Memory-centric partitioning
* Active Pages - Oskin, Chong, Sherwood - ISCA '98
MAUI Implementation
[Figure: timeline with CPU, memory controller/MAUI, and memory rows for Task 1 and Task 2, with a "Done: Task 1" marker; block diagram of CPU, MAU, MAUI, Memory Controller, and DRAM System]
MAUI Instruction Set
MAUI_LD <m_rd>,offset(<cpu_rs>)
1) CPU sends an MAU_LOAD register command to the MC (along with the reg # and address to read) across the front-side bus.
2) MC interprets command and places a Read command in the transaction queue.
3) DRAM performs read.
4) Result is stored in appropriate register in the MAUI register file.
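The four steps above can be sketched as a toy model. This is only an illustration under stated assumptions: the class name `ToyMemoryController`, its fields, and the list standing in for DRAM are hypothetical and not part of the MAUI design.

```python
from collections import deque


class ToyMemoryController:
    """Hypothetical model of a memory controller with a MAUI register file."""

    def __init__(self, dram, num_regs=4):
        self.dram = dram                  # plain list standing in for DRAM
        self.queue = deque()              # transaction queue
        self.maui_regs = [None] * num_regs

    def maui_ld(self, reg, address):
        # Steps 1-2: the command (register number and address) arrives
        # from the CPU and a Read is placed in the transaction queue.
        self.queue.append(("R", address, reg))

    def service_queue(self):
        # Steps 3-4: DRAM performs each queued read and the result is
        # stored in the destination MAUI register.
        while self.queue:
            op, address, reg = self.queue.popleft()
            if op == "R":
                self.maui_regs[reg] = self.dram[address]


mc = ToyMemoryController(dram=[10, 20, 30, 40])
mc.maui_ld(reg=1, address=2)
mc.service_queue()
print(mc.maui_regs[1])  # 30
```

The load is split into "enqueue" and "service" so that the register is only filled once the DRAM read has actually been performed, as in steps 3 and 4.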
[Figure: LOAD REG timing diagram with CPU, MC/MAUI, and DRAM rows, annotated with steps 1-4; block diagram of MAU, MAUI, Memory Controller, and DRAM System with the same step numbers]
MAUI Instruction Set II
MAUI_LDI <rd>,<cpu_rs>
1) CPU sends an MAU_LOADI register command to the MC (along with the reg # and integer to save) across the front-side bus.
2) MC interprets command and places integer in the appropriate register in the MAUI register file.
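As a minimal sketch of the two steps above, the load-immediate path can be modeled without any DRAM at all. The class `ToyMauiRegisterFile` and its `dram_transactions` counter are hypothetical names introduced here for illustration; the values loaded are arbitrary.

```python
class ToyMauiRegisterFile:
    """Hypothetical MAUI register file; no DRAM model is needed here."""

    def __init__(self, num_regs=5):
        self.regs = [0] * num_regs
        self.dram_transactions = 0  # stays 0: MAUI_LDI never touches DRAM

    def maui_ldi(self, reg, value):
        # Step 2: the memory controller writes the integer straight into
        # the selected register; nothing enters the transaction queue.
        self.regs[reg] = value


rf = ToyMauiRegisterFile()
for reg, value in [(1, 0), (2, 5), (3, 10), (4, 2)]:
    rf.maui_ldi(reg, value)
```

Unlike MAUI_LD, the value travels with the command itself, so the operation completes inside the memory controller.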
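[Figure: LOADI REG timing diagram with CPU, MC/MAUI, and DRAM rows, annotated with steps 1-2; block diagram of MAU, MAUI, Memory Controller, and DRAM System with the same step numbers]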
MAUI Instruction Set III
MAUI_ADD <rd>,<rs1>,<rs2>,<rsz>
[Figure: MAU_ADD timing diagram with CPU, MC/MAUI, and DRAM rows, annotated with steps 1-4; DRAM reads (R) and writes (W) are interleaved]
1) CPU invalidates addresses in the cache that fall within the range of the destination array. Addresses within the range of the source arrays are written back if dirty.
2) CPU sends an MAUI_ADD command to the MC (along with the reg #'s) across the front-side bus.
3) MC interprets command, MAUI adds the appropriate registers and places a Write command and next two Read commands in the transaction queue.
4) Step 3 repeats for the length of the array.
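The steps above can be approximated in software as follows. This is a sequential sketch, not a model of the pipelined hardware: the function names, the dictionary used as a cache, and the list used as DRAM are all assumptions made for illustration.

```python
def prepare_cache(cache, dram, dst, src1, src2, size):
    """Step 1 (hypothetical model): invalidate destination addresses and
    write back dirty source addresses before the MAUI touches DRAM."""
    for offset in range(size):
        cache.pop(dst + offset, None)            # invalidate destination
        for base in (src1, src2):
            line = cache.get(base + offset)
            if line is not None and line["dirty"]:
                dram[base + offset] = line["value"]  # write back
                line["dirty"] = False


def maui_add(dram, dst, src1, src2, size):
    """Steps 3-4: add two arrays element by element, returning the list
    of DRAM transactions issued as (kind, address) pairs."""
    log = []
    for offset in range(size):
        a = dram[src1 + offset]
        log.append(("R", src1 + offset))
        b = dram[src2 + offset]
        log.append(("R", src2 + offset))
        dram[dst + offset] = a + b
        log.append(("W", dst + offset))
    return log


dram = [1, 2, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0]
cache = {1: {"value": 7, "dirty": True}, 10: {"value": 99, "dirty": False}}
prepare_cache(cache, dram, dst=10, src1=0, src2=5, size=2)
transactions = maui_add(dram, dst=10, src1=0, src2=5, size=2)
```

In this run the dirty cached value 7 at address 1 is written back before the add, so the second result is 7 + 4 rather than 2 + 4. The stale cached copy of address 10 is dropped so the CPU will later fetch the MAUI's result from DRAM.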
[Figure: block diagram of CPU, MAU, MAUI, Memory Controller, and DRAM System, annotated with steps 1-4]
Issues: Read & Write Locks
Issues: Address Mapping
Memory that is contiguous in virtual space may not be contiguous in physical space.
• MAUI assumes consecutive addressing (size register)
• MAUI operations which cross page boundaries must be split into separate operations for each page
• Programmer will not know the mapping scheme
• Result: All MAUI operations will need to be privileged instructions, accessed by programs through a system call.
[Figure: virtual space mapped through the TLB to physical space]
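The page-splitting requirement can be illustrated with a short sketch. The function name `split_by_page` and the 4096-byte page size are assumptions for the example; the slides do not specify a page size or how the system call would perform the split.

```python
def split_by_page(start, length, page_size=4096):
    """Split the virtual range [start, start + length) into chunks that
    each stay within one page. Returns a list of (start, length) pairs."""
    chunks = []
    while length > 0:
        room = page_size - (start % page_size)  # bytes left in this page
        n = min(room, length)
        chunks.append((start, n))
        start += n
        length -= n
    return chunks


print(split_by_page(4090, 10))  # [(4090, 6), (4096, 4)]
```

Each chunk could then be translated to a physical address on its own and issued as a separate MAUI operation, which is one reason the work fits naturally inside a privileged system call.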
Issues: Compiler Issues
• The compiler will be responsible for deciding when MAUI instructions should be used.
• This decision will be based on the size of the array, whether it is likely to be in the cache, and whether it is likely to be used by an instruction that isn't implemented in the MAUI.
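One way such a decision might look is sketched below. The function name, its parameters, and especially the rule "offload only when the array is larger than the cache" are hypothetical; the slides name the inputs to the decision but not the policy itself.

```python
def should_use_maui(array_bytes, cache_bytes, likely_cached,
                    used_by_unsupported_op):
    """Hypothetical compiler heuristic for emitting MAUI instructions."""
    if likely_cached:
        # Data already in the cache is cheaper to process on the CPU.
        return False
    if used_by_unsupported_op:
        # The CPU will need the data anyway for an operation the MAUI
        # does not implement.
        return False
    # Assumed threshold: offload arrays too large to fit in the cache.
    return array_bytes > cache_bytes


print(should_use_maui(240_000, 32_768, False, False))  # True
```

A real compiler would likely need profile data or static analysis to estimate `likely_cached`; here it is simply passed in as a flag.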
Issues: Task Interrupts
[Figure: timeline with CPU, memory controller/MAUI, and memory rows showing Task 1 and Task 2 interleaving; block diagram of CPU, MAU, MAUI, Memory Controller, and DRAM System]
Example: maui_add I
BIU
maui_ld r1, 0
Memory
maui_ld r1, 0
Size(r4)
RL1_beg
RL2_beg
WL_beg
R1_Data
R2_Data
R3_Data
MAU_Status = open
Offset
RL1_end
RL2_end
WL_end
R1_Addr = 0
R2_Addr
R3_Addr
Transaction
Queue
R1_status
R2_status
R3_status
Memory
Controller
Example: maui_add II
BIU
maui_ld r2, 5
Memory
maui_ld r2, 5
Size(r4)
RL1_beg
RL2_beg
WL_beg
R1_Data
R2_Data
R3_Data
MAU_Status = open
Offset
RL1_end
RL2_end
WL_end
R1_Addr = 0
R2_Addr = 5
R3_Addr
Transaction
Queue
R1_status
R2_status
R3_status
Memory
Controller
Example: maui_add III
BIU
maui_ld r3, 10
Memory
maui_ld r3, 10
Size(r4)
RL1_beg
RL2_beg
WL_beg
R1_Data
R2_Data
R3_Data
MAU_Status = open
Offset
RL1_end
RL2_end
WL_end
R1_Addr = 0
R2_Addr = 5
R3_Addr = 10
Transaction
Queue
R1_status
R2_status
R3_status
Memory
Controller
Example: maui_add IV
BIU
maui_ld r4, 2
Memory
maui_ld r4, 2
Size(r4) = 2
RL1_beg
RL2_beg
WL_beg
R1_Data
R2_Data
R3_Data
MAU_Status = open
Offset
RL1_end
RL2_end
WL_end
R1_Addr = 0
R2_Addr = 5
R3_Addr = 10
Transaction
Queue
R1_status
R2_status
R3_status
Memory
Controller
Example: maui_add V
BIU
maui_add r3, r1, r2
Memory
maui_add r3, r1, r2
Size(r4) = 2
RL1_beg = 0
RL2_beg = 5
WL_beg = 10
R1_Data
R2_Data
R3_Data
MAU_Status = occupied
Offset = 0
RL1_end = 1
RL2_end = 6
WL_end = 11
R1_Addr = 0
R2_Addr = 5
R3_Addr = 10
Transaction
Queue
R, 0
R1_status = w
R2_status = w
R3_status = u
R, 5
Memory
Controller
Example: maui_add VI
BIU
Read 10
Memory
maui_add r3, r1, r2*
Size(r4) = 2
RL1_beg = 1
RL2_beg = 5
WL_beg = 10
R1_Data = D1[0]
R2_Data
R3_Data
MAU_Status = occupied
Offset = 0
RL1_end = 1
RL2_end = 6
WL_end = 11
R1_Addr = 0
R2_Addr = 5
R3_Addr = 10
Transaction
Queue
R1_status = f
R2_status = w
R3_status = u
D1[0]
Memory
Controller
Example: maui_add VII
BIU
Read 10
Memory
maui_add r3, r1, r2*
Size(r4) = 2
RL1_beg = 1
RL2_beg = 6
WL_beg = 10
R1_Data = D1[0]
R2_Data = D2[0]
R3_Data
MAU_Status = occupied
Offset = 0
RL1_end = 1
RL2_end = 6
WL_end = 11
R1_Addr = 0
R2_Addr = 5
R3_Addr = 10
Transaction
Queue
R1_status = f
R2_status = f
R3_status = u
D2[0]
Memory
Controller
Example: maui_add VIII
BIU
Read 10
Memory
maui_add r3, r1, r2*
Size(r4) = 2
RL1_beg = 1
RL2_beg = 6
WL_beg = 11
R1_Data = D1[0]
R2_Data = D2[0]
R3_Data = D1[0] + D2[0]
MAU_Status = occupied
Offset = 1
RL1_end = 1
RL2_end = 6
WL_end = 11
R1_Addr = 0
R2_Addr = 5
R3_Addr = 10
Transaction
Queue
R, 1
R1_status = w
R2_status = w
R3_status = f
R, 6
W,10, D1[0]+D2[0]
Memory
Controller
Example: maui_add IX
BIU
Write 6, D
Memory
maui_add r3, r1, r2*
Size(r4) = 2
RL1_beg = NULL
RL2_beg = 6
WL_beg = 11
R1_Data = D1[1]
R2_Data
R3_Data
MAU_Status = occupied
Offset = 1
RL1_end = NULL
RL2_end = 6
WL_end = 11
R1_Addr = 0
R2_Addr = 5
R3_Addr = 10
Transaction
Queue
R1_status = f
R2_status = w
R3_status = u
D1[1]
Memory
Controller
Example: maui_add X
BIU
Write 6, D
Memory
maui_add r3, r1, r2*
Size(r4) = 2
RL1_beg = NULL
RL2_beg = NULL
WL_beg = 11
R1_Data = D1[1]
R2_Data = D2[1]
R3_Data
MAU_Status = occupied
Offset = 1
RL1_end = NULL
RL2_end = NULL
WL_end = 11
R1_Addr = 0
R2_Addr = 5
R3_Addr = 10
Transaction
Queue
R1_status = f
R2_status = f
R3_status = u
D2[1]
Memory
Controller
Example: maui_add XI
BIU
Memory
Next Instruction
Size(r4) = 2
RL1_beg = NULL
RL2_beg = NULL
WL_beg = NULL
R1_Data = D1[1]
R2_Data = D2[1]
R3_Data = D1[1] + D2[1]
MAU_Status = free?
Offset = 2
RL1_end = NULL
RL2_end = NULL
WL_end = NULL
R1_Addr = 0
R2_Addr = 5
R3_Addr = 10
Transaction
Queue
R1_status = u
R2_status = u
R3_status = f
W,11, D1[1]+D2[1]
Memory
Controller
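The lock fields in the walkthrough above (RL1, RL2, WL with their beg/end values) can be modeled as shrinking address ranges. The sketch below is one reading of the slides, not a confirmed design: it assumes a CPU write stalls on any read-locked source address and any CPU access stalls on a write-locked destination address. The names `LockRange` and `cpu_access_blocked` are hypothetical.

```python
class LockRange:
    """Inclusive address range [beg, end]; None means no lock is held."""

    def __init__(self, beg, end):
        self.beg, self.end = beg, end

    def covers(self, addr):
        return self.beg is not None and self.beg <= addr <= self.end

    def advance(self):
        # Release the lowest locked address once the MAUI is done with it.
        if self.beg is None:
            return
        if self.beg == self.end:
            self.beg = self.end = None
        else:
            self.beg += 1


def cpu_access_blocked(kind, addr, read_locks, write_lock):
    """Assumed policy: destination addresses block reads and writes,
    source addresses block only writes."""
    if write_lock.covers(addr):
        return True
    return kind == "W" and any(lock.covers(addr) for lock in read_locks)


rl1, rl2, wl = LockRange(0, 1), LockRange(5, 6), LockRange(10, 11)
blocked_read_10 = cpu_access_blocked("R", 10, [rl1, rl2], wl)  # True
```

With the walkthrough's values, the CPU's "Read 10" waits until the write lock advances past address 10, and "Write 6" waits until the second read lock is fully released.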
Advantages & Disadvantages
Advantages
•Better performance for DRAM latency bound computations
•Lower latency to DRAM compared to CPU
•Reduced traffic on front-side bus
•Concurrent execution
Disadvantages
•MAUI operates at a lower clock frequency
•Increased compiler complexity
•Increased fabrication costs (More Logic = More $$)
•Recently used data may not be cached
Alternative Implementation
MAUI Occupies its Own Read & Write Bus
• Good: Eliminates contention with the CPU for DRAM system resources
• Good: Creates circular data flow, resulting in increased performance
• Bad: Needs a specialized triple-ported DRAM system, leading to increased production costs
[Figure: block diagram of CPU, MAU, MAUI, Memory Controller, and DRAM System, with a dedicated MAUI Read & Write Bus]
Test Setup
• Simulated on SimpleScalar version 4.0
• One set of test benches ran dual-array operations on both the MAUI and the CPU at four different array sizes. This trial was repeated for both shared and independent memory access buses.
• Found up to a 43% speedup!
Results
[Figure: total CPU cycles (log scale, 10,000 to 10,000,000) for 60, 600, 6000, and 60000 Int Arrays, comparing No MAUI, MAUI (Shared Bus), and MAUI (Separate Bus)]
Future Enhancements I
MAU Multi-tasking
• Larger register file
• Small cache
• More MAUs for parallelism
[Figure: timeline with CPU, memory controller/MAUI, and memory rows showing Tasks 1-3; block diagram of MAUs, MAUI, Memory Controller, and DRAM System]
Future Enhancements II
Better Pipelining
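• Larger register file to hold intermediate results
[Figure: MAU_ADD timing diagram with CPU, MC/MAUI, and DRAM rows, showing a run of reads (R) followed by a batch of writes (W); block diagram of MAU, MAUI, Memory Controller, and DRAM System]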