Memory Arithmetic Unit Interface
Jason M. Meier, Justin S. Teller, Tom J. Keeley

Current Paradigm
[Figure: timeline of Task 1 and Task 2 across CPU, memory controller, and memory; block diagram of CPU, memory controller, and DRAM system]

Active Pages Implementation
• Used Configurable DRAM - RADRAM
• Reconfigurable logic implements various memory functions
• “Active Page” consists of a page of data and a set of associated functions
• Works on individual DRAM chips
• Processor-centric and Memory-centric partitioning
* Active Pages - Oskin, Chong, Sherwood, ISCA ‘98

MAUI Implementation
[Figure: timeline of Task 1 and Task 2 across CPU, memory controller/MAUI, and memory; block diagram of CPU, MAU, MAUI, memory controller, and DRAM system]

MAUI Instruction Set
MAUI_LD <m_rd>,offset(<cpu_rs>)
1) CPU sends an MAU_LOAD register command to the MC (along with the reg # and address to read) across the front-side bus.
2) MC interprets the command and places a Read command in the transaction queue.
3) DRAM performs the read.
4) Result is stored in the appropriate register in the MAUI register file.
[Figure: LOAD REG timeline across CPU, MC/MAUI, and DRAM, with a block diagram of CPU, MAU, MAUI, memory controller, and DRAM system, annotated with steps 1-4]

MAUI Instruction Set II
MAUI_LDI <rd>,<cpu_rs>
1) CPU sends an MAU_LOADI register command to the MC (along with the reg # and integer to save) across the front-side bus.
2) MC interprets the command and places the integer in the appropriate register in the MAUI register file.
[Figure: LOADI REG timeline across CPU, MC/MAUI, and DRAM, with a block diagram of CPU, MAU, MAUI, memory controller, and DRAM system, annotated with steps 1-2]

MAUI Instruction Set III
MAUI_ADD <rd>,<rs1>,<rs2>,<rsz>
[Figure: MAU_ADD timeline across CPU, MC/MAUI, and DRAM (reads R, writes W), annotated with steps 1-4]
1) CPU invalidates addresses in the cache that fall within the range of the destination array. Addresses within the range of the source arrays are written back if dirty.
2) CPU sends an MAUI_ADD command to the MC (along with the reg #’s) across the front-side bus.
3) MC interprets the command; the MAUI adds the appropriate registers and places a Write command and the next two Read commands in the transaction queue.
4) Step 3 repeats for the length of the array.
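The MAUI_ADD steps above can be sketched as a small Python model. This is only an illustration under stated assumptions: the registers named by MAUI_ADD are taken to hold the base addresses of the two source arrays and the destination array, with <rsz> holding the array length, and DRAM is modeled as a dictionary. The class name MauiModel and its methods are hypothetical, and the cache invalidation and write-back of step 1 are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class MauiModel:
    """Toy model of the MAUI register file and transaction queue (hypothetical)."""
    dram: dict                                  # address -> integer value
    regs: dict = field(default_factory=dict)    # MAUI register file
    queue: list = field(default_factory=list)   # log of (op, address) transactions

    def maui_ldi(self, rd, value):
        # MAUI_LDI: place an integer sent by the CPU directly in a MAUI register.
        self.regs[rd] = value

    def maui_add(self, rd, rs1, rs2, rsz):
        # Assumption: rd, rs1, rs2 hold array base addresses and rsz the length.
        dst, src1, src2, size = (self.regs[r] for r in (rd, rs1, rs2, rsz))
        for offset in range(size):
            # Two reads per element, then one write of the sum.
            self.queue.append(("R", src1 + offset))
            self.queue.append(("R", src2 + offset))
            total = self.dram.get(src1 + offset, 0) + self.dram.get(src2 + offset, 0)
            self.queue.append(("W", dst + offset))
            self.dram[dst + offset] = total


model = MauiModel(dram={0: 3, 1: 4, 5: 10, 6: 20})
model.maui_ldi("r1", 0)    # first source array starts at address 0
model.maui_ldi("r2", 5)    # second source array starts at address 5
model.maui_ldi("r3", 10)   # destination array starts at address 10
model.maui_ldi("r4", 2)    # array length
model.maui_add("r3", "r1", "r2", "r4")
print(model.queue)
# [('R', 0), ('R', 5), ('W', 10), ('R', 1), ('R', 6), ('W', 11)]
```

The model issues the reads and the write for one element back to back; the real controller overlaps the write of one element with the reads of the next, as step 3 describes, which yields the same overall order.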
[Figure: block diagram of CPU, MAU, MAUI, memory controller, and DRAM system, annotated with steps 1-4]

Issues: Read & Write Locks

Issues: Address Mapping
[Figure: virtual space mapped through the TLB to physical space]
• Memory that is contiguous in virtual space may not be contiguous in physical space
• MAUI assumes consecutive addressing (size register)
• MAUI operations which cross page boundaries must be split into separate operations for each page
• Programmer will not know the mapping scheme
• Result: All MAUI operations will need to be privileged instructions, accessed by programs through a system call.

Issues: Compiler Issues
• The compiler will be responsible for deciding when MAUI instructions should be used.
• This decision will be based on the size of the array, whether it is likely to be in the cache, and whether it is likely to be used by an instruction that isn’t implemented in the MAUI.

Issues: Task Interrupts
[Figure: timeline of Task 1 and Task 2 interleaved across CPU, memory controller/MAUI, and memory; block diagram of CPU, MAU, MAUI, memory controller, and DRAM system]

Example: maui_add I
BIU: maui_ld r1, 0
Memory: maui_ld r1, 0
MAU_Status = open, Size(r4), Offset
RL1_beg, RL1_end, RL2_beg, RL2_end, WL_beg, WL_end
R1_Data, R2_Data, R3_Data
R1_Addr = 0, R2_Addr, R3_Addr
R1_status, R2_status, R3_status
Transaction Queue: (empty)

Example: maui_add II
BIU: maui_ld r2, 5
Memory: maui_ld r2, 5
MAU_Status = open, Size(r4), Offset
RL1_beg, RL1_end, RL2_beg, RL2_end, WL_beg, WL_end
R1_Data, R2_Data, R3_Data
R1_Addr = 0, R2_Addr = 5, R3_Addr
R1_status, R2_status, R3_status
Transaction Queue: (empty)

Example: maui_add III
BIU: maui_ld r3, 10
Memory: maui_ld r3, 10
MAU_Status = open, Size(r4), Offset
RL1_beg, RL1_end, RL2_beg, RL2_end, WL_beg, WL_end
R1_Data, R2_Data, R3_Data
R1_Addr = 0, R2_Addr = 5, R3_Addr = 10
R1_status, R2_status, R3_status
Transaction Queue: (empty)

Example: maui_add IV
BIU: maui_ld r4, 2
Memory: maui_ld r4, 2
MAU_Status = open, Size(r4) = 2, Offset
RL1_beg, RL1_end, RL2_beg, RL2_end, WL_beg, WL_end
R1_Data, R2_Data, R3_Data
R1_Addr = 0, R2_Addr = 5, R3_Addr = 10
R1_status, R2_status, R3_status
Transaction Queue: (empty)
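The lock fields in this example (RL1, RL2, and WL, each with a beg and an end) can be read as inclusive address ranges that shrink as the MAUI works through the arrays. The Python sketch below is one possible reading, not the authors' design: it assumes each range starts as [base, base + size - 1], that a range becomes NULL once its last address is released, and that the controller holds back CPU accesses that would touch the destination range or write into a source range. The names LockRange, make_locks, and cpu_access_allowed are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LockRange:
    """Inclusive address range; beg/end of None stands for NULL."""
    beg: Optional[int]
    end: Optional[int]

    def contains(self, address: int) -> bool:
        return self.beg is not None and self.beg <= address <= self.end

    def advance(self) -> None:
        # Release the lowest locked address once the MAUI is done with it.
        if self.beg is None:
            return
        if self.beg >= self.end:
            self.beg = self.end = None
        else:
            self.beg += 1


def make_locks(r1_addr, r2_addr, r3_addr, size):
    # Two read locks over the source arrays, one write lock over the destination.
    return tuple(LockRange(base, base + size - 1)
                 for base in (r1_addr, r2_addr, r3_addr))


def cpu_access_allowed(op, address, rl1, rl2, wl):
    # Assumed policy: the destination range is off limits to the CPU,
    # and the source ranges may be read but not written.
    if wl.contains(address):
        return False
    if op == "W" and (rl1.contains(address) or rl2.contains(address)):
        return False
    return True


rl1, rl2, wl = make_locks(0, 5, 10, 2)
print(cpu_access_allowed("R", 10, rl1, rl2, wl))  # False
wl.advance()                                      # write to address 10 has been queued
print(cpu_access_allowed("R", 10, rl1, rl2, wl))  # True
```

With the example's values (R1_Addr = 0, R2_Addr = 5, R3_Addr = 10, Size(r4) = 2) this gives the ranges 0 to 1, 5 to 6, and 10 to 11.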

Example: maui_add V
BIU: maui_add r3, r1, r2
Memory: maui_add r3, r1, r2
MAU_Status = occupied, Size(r4) = 2, Offset = 0
RL1_beg = 0, RL1_end = 1, RL2_beg = 5, RL2_end = 6, WL_beg = 10, WL_end = 11
R1_Data, R2_Data, R3_Data
R1_Addr = 0, R2_Addr = 5, R3_Addr = 10
R1_status = w, R2_status = w, R3_status = u
Transaction Queue: R, 0; R, 5

Example: maui_add VI
BIU: Read 10
Memory: maui_add r3, r1, r2*
MAU_Status = occupied, Size(r4) = 2, Offset = 0
RL1_beg = 1, RL1_end = 1, RL2_beg = 5, RL2_end = 6, WL_beg = 10, WL_end = 11
R1_Data = D1[0], R2_Data, R3_Data
R1_Addr = 0, R2_Addr = 5, R3_Addr = 10
R1_status = f, R2_status = w, R3_status = u
Transaction Queue: D1[0]

Example: maui_add VII
BIU: Read 10
Memory: maui_add r3, r1, r2*
MAU_Status = occupied, Size(r4) = 2, Offset = 0
RL1_beg = 1, RL1_end = 1, RL2_beg = 6, RL2_end = 6, WL_beg = 10, WL_end = 11
R1_Data = D1[0], R2_Data = D2[0], R3_Data
R1_Addr = 0, R2_Addr = 5, R3_Addr = 10
R1_status = f, R2_status = f, R3_status = u
Transaction Queue: D2[0]

Example: maui_add VIII
BIU: Read 10
Memory: maui_add r3, r1, r2*
MAU_Status = occupied, Size(r4) = 2, Offset = 1
RL1_beg = 1, RL1_end = 1, RL2_beg = 6, RL2_end = 6, WL_beg = 11, WL_end = 11
R1_Data = D1[0], R2_Data = D2[0], R3_Data = D1[0] + D2[0]
R1_Addr = 0, R2_Addr = 5, R3_Addr = 10
R1_status = w, R2_status = w, R3_status = f
Transaction Queue: R, 1; R, 6; W,10, D1[0]+D2[0]

Example: maui_add IX
BIU: Write 6, D
Memory: maui_add r3, r1, r2*
MAU_Status = occupied, Size(r4) = 2, Offset = 1
RL1_beg = NULL, RL1_end = NULL, RL2_beg = 6, RL2_end = 6, WL_beg = 11, WL_end = 11
R1_Data = D1[1], R2_Data, R3_Data
R1_Addr = 0, R2_Addr = 5, R3_Addr = 10
R1_status = f, R2_status = w, R3_status = u
Transaction Queue: D1[1]

Example: maui_add X
BIU: Write 6, D
Memory: maui_add r3, r1, r2*
MAU_Status = occupied, Size(r4) = 2, Offset = 1
RL1_beg = NULL, RL1_end = NULL, RL2_beg = NULL, RL2_end = NULL, WL_beg = 11, WL_end = 11
R1_Data = D1[1], R2_Data = D2[1], R3_Data
R1_Addr = 0, R2_Addr = 5,
R3_Addr = 10
R1_status = f, R2_status = f, R3_status = u
Transaction Queue: D2[1]

Example: maui_add XI
BIU: (empty)
Memory: Next Instruction
MAU_Status = free?, Size(r4) = 2, Offset = 2
RL1_beg = NULL, RL1_end = NULL, RL2_beg = NULL, RL2_end = NULL, WL_beg = NULL, WL_end = NULL
R1_Data = D1[1], R2_Data = D2[1], R3_Data = D1[1] + D2[1]
R1_Addr = 0, R2_Addr = 5, R3_Addr = 10
R1_status = u, R2_status = u, R3_status = f
Transaction Queue: W,10, D1[1]+D2[1]

Advantages & Disadvantages
Advantages
• Better performance for DRAM latency bound computations
• Lower latency to DRAM compared to CPU
• Reduced traffic on front-side bus
• Concurrent execution
Disadvantages
• MAUI operates at a lower clock frequency
• Increased compiler complexity
• Increased fabrication costs (More Logic = More $$)
• Recently used data may not be cached

Alternative Implementation
MAUI Occupies its Own Read & Write Bus
• GOOD: Eliminates contention with the CPU for DRAM system resources.
• GOOD: Creates a circular data flow, resulting in increased performance.
• BAD: Needs a specialized triple-ported DRAM system, leading to increased production costs.
[Figure: block diagram of CPU, MAU, MAUI, MAUI read & write bus, memory controller, and DRAM system]

Test Setup
• Simulated on SimpleScalar version 4.0
• One set of test benches with dual array operations, run on both the MAUI and the CPU with four different array sizes. This trial was repeated for both shared and independent memory access buses.
• Found up to a 43% speedup!
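The slides do not say how the speedup figure was computed. One common convention, assumed here, takes the ratio of baseline cycles to MAUI cycles and reports the amount above 1 as a percentage. The function name speedup_percent is hypothetical, and the cycle counts in the usage line are made-up illustrative values, not measurements from this work.

```python
def speedup_percent(baseline_cycles: int, maui_cycles: int) -> float:
    """Percent speedup under the assumed definition (baseline / MAUI - 1) * 100."""
    if baseline_cycles <= 0 or maui_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return (baseline_cycles / maui_cycles - 1.0) * 100.0


# Illustrative numbers only: 2000 cycles without the MAUI, 1600 with it.
print(speedup_percent(2000, 1600))  # 25.0
```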

Results
[Figure: bar chart of total CPU cycles (axis from 10,000 to 10,000,000) for No MAUI, MAUI (Shared Bus), and MAUI (Separate Bus), with 60, 600, 6000, and 60000 Int Arrays]

Future Enhancements I
[Figure: timeline of Task 1, Task 2, and Task 3 across CPU, memory controller/MAUI, and memory; block diagram of MAUs, MAUI, memory controller, and DRAM system]
• MAU Multi-tasking
• Larger Register File
• Small Cache
• More MAUs for Parallelism

Future Enhancements II
[Figure: MAU_ADD timeline across CPU, MC/MAUI, and DRAM showing a run of reads (R) followed by a run of writes (W); block diagram of MAU, MAUI, memory controller, and DRAM system]
• Better Pipelining
• Larger Register File to Hold Intermediate Results
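The pipelining idea above, with reads grouped ahead of writes and intermediate sums held in a larger register file, can be sketched as a transaction schedule. This is a guess at the intent from the timeline, not a description of an implemented design. The function name pipelined_schedule and the batch parameter (how many intermediate results the register file can hold) are hypothetical.

```python
def pipelined_schedule(src1, src2, dst, size, batch):
    """Return (op, address) transactions with reads grouped ahead of writes.

    Assumption: the register file can hold `batch` intermediate sums, so the
    controller issues all reads for a batch before writing the results back.
    """
    schedule = []
    for start in range(0, size, batch):
        stop = min(start + batch, size)
        for offset in range(start, stop):
            schedule.append(("R", src1 + offset))
            schedule.append(("R", src2 + offset))
        for offset in range(start, stop):
            schedule.append(("W", dst + offset))
    return schedule


ops = "".join(op for op, _ in pipelined_schedule(0, 5, 10, size=4, batch=4))
print(ops)  # RRRRRRRRWWWW
```

With batch=1 the schedule falls back to the read, read, write pattern of the current MAUI_ADD.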