Page-based Commands for DRAM Systems
Aamer Jaleel, Brinda Ganesh, Lei Zong

Outline
• Memory System Overview
• Related work
• Experiment setup
• Page level access measurements
• Solution
• Expected Speedup

Processor-Memory Gap
• µProc: 60% / year (doubles every 1.5 years)
• DRAM: 9% / year (doubles every 10 years)
• Processor-Memory Performance Gap: grows 50% / year
Source: http://www.e-insite.net/ednmag

Memory Access Time
[Figure: memory hierarchy: CPU Core, L1, L2, MC, DRAM]

Level   Access Time (cycles)
L1      3
L2      8
DRAM    181

Data for 1.8GHz Opteron (www.aceshardware.com/)

Large Size Memory Accesses
• Applications
  – Initialization
  – Data Movement
  – Stream operations
• Operating System
  – Task Creation
  – System Calls
  – Page Allocation, Management
• Functions that would use them
  – Memset, Clear User
  – Memcpy, Copy from User, Copy To User

Experiment Setup
• Workstation based
  – 2.4 GHz P4 (Wonko)
  – 750 MHz PIII (Majikthise)
  – 900 MHz PIII (Jaleel)
• Bochs x86 emulator
• Operating System
  – Linux Kernel v 2.4.19
• Applications
  – SPEC2000 Integer benchmarks using glibc-2.2.5

Memset: Count
[Figure: Memset count per SPEC2000 integer benchmark (vortex, gcc, gzip, perlbmk, twolf, crafty, vpr, bzip2, mcf, parser); log-scale axis from 1.00E+00 to 1.00E+08]

Memset: Access Size
[Figure: Average Length and Maximum Length of Memset per benchmark; log-scale axis from 1.0E+00 to 1.0E+09]

Memset: % Overhead
[Figure: % Memset Time (% Overhead) per benchmark; axis from 0 to 25]

Memcpy: Count
[Figure: Memcpy count per SPEC2000 integer benchmark; log-scale axis from 1.00E+00 to 1.00E+08]

Memcpy: Access Size
[Figure: Average Length and Maximum Length of Memcpy per benchmark; log-scale axis from 1.0E+00 to 1.0E+09]

Memcpy: % Overhead
[Figure: % Overhead of Memcpy per benchmark; axis from 0 to 35]

OS: Memset / Clear User Real-Time Plot
• Behavior over Time
• Frequency of operation
• Access Size
• Operation Duration
• Averages

OS: Memcpy / Copy User Real-Time Plot
• Behavior over Time
• Frequency of operation
• Access Size
• Operation Duration
• Averages

Page based Commands
• Set Page
  – A ← constant
• Copy Page
  – A ← B
• Page level Arithmetic operations
  – A ← B + C
  – A ← B - C

Page based Commands
• Example: SETPAGE ZERO, 0x04000
[Figure: 4 kB page in DRAM]

Page based Commands: Issue
• Example: SETPAGE ZERO, 0x04000
[Figure: 4 kB page in DRAM; 128 bytes in Cache]
• How do we ensure Memory and Cache Consistency?

How much data is actually in the cache?

Function               % Hit Rate (Boot + Halt)   % Hit Rate (SPEC workload)
Memset                 7.23                       0.23
Memcpy (Source)        7.88                       10.53
Memcpy (Destination)   < 0.01                     < 0.01

Page based Commands: Issue
• Example: SETPAGE ZERO, 0x04000
• DRAM level Page Fragmentation
[Figure: 4 kB pages in DRAM]
• Maximum number of rows a page can occupy is 2

Solution
• Hardware at Cache Level
• Ability to map s/w pages to h/w pages

Expected Speedup I
Memset(Address, Length, SetValue)

Current Implementation
  EndAddr ← Address + Length
  While (Address < EndAddr)
    Mem[Address] ← SetValue
    Address ← Address + 1

Proposed Implementation
  While (Length >= PageSize)
    SetPage(SetValue, Address)
    Length ← Length - PageSize
    Address ← Address + PageSize
  Call Memset(Address, Length, SetValue)   (current implementation, for the remaining bytes)

Expected Speedup II
• Current Memset Time for a page: 4 µs
• Expected Memset Time for a page
  = # Rows in a page * Time to read a Row + Cache Coherence Logic + Misc
  = 2 * 100 ns + X
  = 200 ns + X

Related Work
• IRAM
  – On-chip DRAM
  – Advantage: bigger storage, eliminates much of the off-chip memory access, energy efficient
  – Disadvantage: not much performance increase, doesn't work with conventional microprocessors
• Active Pages
  – Bring computation to DRAM
  – Break the memory into fixed-size pages and add reconfigurable logic to DRAM
• Heap paper shows some memory accesses that can be eliminated entirely

Conclusion
• Page-based commands are necessary.
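
The proposed Memset from the "Expected Speedup I" slide can be sketched in software as follows. This is only an illustrative model, not the authors' hardware design: `set_page` is a hypothetical stand-in for the SETPAGE command, memory is modelled as a Python `bytearray`, and the head-alignment step (filling bytes up to the next page boundary first) is an assumption that the slides do not spell out.

```python
PAGE_SIZE = 4096  # 4 kB page, as on the slides


def set_page(mem: bytearray, value: int, address: int) -> None:
    """Hypothetical stand-in for SETPAGE: fill one whole page with `value`."""
    # Assumption: the command only accepts page-aligned addresses.
    assert address % PAGE_SIZE == 0
    mem[address:address + PAGE_SIZE] = bytes([value]) * PAGE_SIZE


def byte_memset(mem: bytearray, address: int, length: int, value: int) -> None:
    """Current implementation from the slide: one store per byte."""
    end_addr = address + length
    while address < end_addr:
        mem[address] = value
        address += 1


def page_memset(mem: bytearray, address: int, length: int, value: int) -> int:
    """Proposed implementation: whole pages via set_page, edges byte by byte.

    Returns the number of page commands issued.
    """
    # Assumed extra step: fill the unaligned head up to the next page boundary.
    head = min(length, (-address) % PAGE_SIZE)
    byte_memset(mem, address, head, value)
    address += head
    length -= head

    pages = 0
    while length >= PAGE_SIZE:
        set_page(mem, value, address)
        length -= PAGE_SIZE
        address += PAGE_SIZE  # advance by one page, not by the remaining length
        pages += 1

    # Remaining tail (less than one page) uses the current implementation.
    byte_memset(mem, address, length, value)
    return pages
```

For example, `page_memset(bytearray(4 * PAGE_SIZE), 100, 2 * PAGE_SIZE + 50, 0xAB)` fills the unaligned head byte by byte, issues one page command for the single fully covered page, and finishes the tail byte by byte.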
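
The arithmetic on the "Expected Speedup II" slide can be worked through numerically. The sketch below assumes the current per-page Memset time is 4 µs and treats X (cache coherence logic plus miscellaneous overhead) as a free parameter, since the slides give no value for it; the function names and the sample values of X are illustrative only.

```python
CURRENT_PAGE_TIME_NS = 4_000.0  # assumed: 4 µs per page for the current Memset
ROWS_PER_PAGE = 2               # maximum number of DRAM rows a page can occupy
ROW_ACCESS_NS = 100.0           # time per row, from the slide


def expected_page_time_ns(overhead_ns: float) -> float:
    """Expected Memset time for a page: rows * row time + X."""
    return ROWS_PER_PAGE * ROW_ACCESS_NS + overhead_ns


def speedup(overhead_ns: float) -> float:
    """Ratio of current to expected per-page Memset time for a given X."""
    return CURRENT_PAGE_TIME_NS / expected_page_time_ns(overhead_ns)


if __name__ == "__main__":
    # X values here are hypothetical; the slides leave X unspecified.
    for x in (0.0, 200.0, 800.0):
        print(f"X = {x:6.1f} ns -> speedup = {speedup(x):.1f}x")
```

Under these assumptions the upper bound is 20x when X is zero, and the gain shrinks as the coherence overhead grows, so the benefit depends on keeping X small relative to the 200 ns row time.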