Exploiting the Produce-Consume Relationship in DMA to Improve I/O Performance
Dan Tang, Yungang Bao, Yunji Chen, Weiwu Hu, Mingyu Chen
Institute of Computing Technology (ICT), Chinese Academy of Sciences
Workshop on the Influence of I/O on Microprocessor Architecture (IOM-2009), February 15, 2009

A Brief Intro of ICT, CAS
• ICT has developed the Loongson CPU.
• ICT has built the fastest HPC in China: Dawning 5000, which delivers 233.5 TFlops and is ranked 10th in the Top500.

Overview
• Background
• Nature of the DMA Mechanism
• DMA Cache Scheme
• Research Methodology
• Evaluations
• Conclusions and Ongoing Work

Importance of I/O Operations
• I/O is ubiquitous:
  – Loading binary files: Disk → Memory
  – Browsing the web, media streaming: Network → Memory, …
• I/O is important: many commercial applications are I/O intensive, e.g., databases and Internet applications.

State-of-the-Art I/O Technologies
• I/O buses (up to ~20 GB/s): PCI Express 2.0, HyperTransport 3.0, QuickPath Interconnect
• I/O devices: RAID (400 MB/s), 10GE (1.25 GB/s)

A Typical Computer Architecture
(figure: system diagram with CPU, memory, and NIC)

Direct Memory Access (DMA)
• DMA is an essential feature of I/O operation in all modern computers.
• DMA allows I/O subsystems to access system memory for reading and/or writing independently of the CPU.
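The descriptor-based hand-off behind this definition can be sketched in C. The structure fields and function names below are invented for illustration (real NICs and disk controllers define their own descriptor formats); the point is only that the CPU posts a descriptor and the DMA engine then moves the data without further CPU involvement:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor layout: the CPU fills in a buffer address and a
 * length, then hands ownership to the device by setting a flag; the DMA
 * engine (simulated in software here) copies the data and clears the flag
 * when the transfer completes. */
struct dma_desc {
    uint64_t buf_addr;        /* address of the kernel buffer */
    uint32_t length;          /* number of bytes to transfer */
    uint32_t owned_by_device; /* 1 while the DMA engine owns the descriptor */
};

static uint8_t device_data[64] = "payload from the I/O device";
static uint8_t kernel_buffer[64];

/* CPU side: post a receive descriptor. */
static void post_descriptor(struct dma_desc *d, uint8_t *buf, uint32_t len)
{
    d->buf_addr = (uint64_t)(uintptr_t)buf;
    d->length = len;
    d->owned_by_device = 1; /* doorbell: device may now run the transfer */
}

/* Device side (simulated): perform the transfer with no CPU involvement. */
static void dma_engine_run(struct dma_desc *d)
{
    memcpy((void *)(uintptr_t)d->buf_addr, device_data, d->length);
    d->owned_by_device = 0; /* completion: ownership returns to the CPU */
}
```

The ownership flag stands in for the doorbell/interrupt pair that real hardware uses to synchronize the two sides.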
• Many I/O devices use DMA, including disk drive controllers, graphics cards, network cards, sound cards, and GPUs.

Nature of the DMA Mechanism

DMA in Computer Architecture
(figure: system diagram with the NIC's DMA engine attached to the system bus)

An Example of Disk Read: DMA Receiving Operation
(figure: five-step flow ①–⑤ among the CPU, the descriptor in the driver buffer, the DMA engine, the kernel buffer in memory, and the user buffer)
• Cache access latency: ~20 cycles
• Memory access latency: ~200 cycles

Potential Improvement of DMA
(figure: the same flow, with I/O data placed into the cache rather than memory)
• This is a typical shared-cache scheme.

Problems of the Shared-Cache Scheme
• Cache pollution
• Cache thrashing
• Performance degrades when DMA requests are large (>100 KB), e.g., for the "Oracle + TPC-H" workload.

Rethink the DMA Mechanism

The Nature of DMA
• There is a producer-consumer relationship between the CPU and the DMA engine.
• Memory plays the role of a transient place for I/O data transferred between the processor and the I/O device.
• Corollaries:
  – Once I/O data is produced, it will be consumed.
  – I/O data within a DMA buffer is used only once in most cases (i.e., almost no reuse).
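The "consumed once" corollary can be made concrete with a toy trace analysis. The trace format and function below are illustrative, not from the slides: streaming I/O blocks show a reuse count of 1, unlike typical CPU working-set data:

```c
/* Count how many CPU reads any block receives between consecutive DMA
 * writes to that block.  For streaming I/O data the answer is typically 1:
 * produced once by the DMA engine, consumed once by the CPU. */
enum op { DMA_WRITE, CPU_READ };
struct access { enum op op; int block; };

/* Returns the maximum reuse count seen in the trace (block ids < 16). */
static int max_reuse(const struct access *trace, int n)
{
    int reads[16] = {0};
    int max = 0;
    for (int i = 0; i < n; i++) {
        int b = trace[i].block;
        if (trace[i].op == DMA_WRITE)
            reads[b] = 0;            /* block re-produced: reset its count */
        else if (++reads[b] > max)
            max = reads[b];          /* block consumed again */
    }
    return max;
}
```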
• The characteristics of I/O data differ from those of CPU data.
• It may not be appropriate to store I/O data and CPU data together.

DMA Cache Scheme

DMA Cache Proposal
• A dedicated cache that stores I/O data.
• Capable of exchanging data with the processor's last-level cache (LLC).
• Reduces the overhead of I/O data movement.

DMA Cache Design Issues
• DMA cache state diagram: similar to the CPU cache state diagram in a uniprocessor system (we are investigating multiprocessor platforms).
• Cache coherence
• Data path
• Replacement policy
• Write policy
• Prefetching

Data Path
• Additional data paths and data-access ports for the LLC are not required, because data migration between the DMA cache and the LLC can share the existing data paths and ports of the snooping mechanism.

Data Path: CPU Read
(figure: a CPU read that misses in the LLC but hits in the DMA cache is served over the system bus via the snoop controllers)

Data Path: DMA Read
(figure: a DMA read that misses in the DMA cache but hits in the LLC is served from the LLC over the system bus)

Replacement Policy
• An LRU-like replacement policy selects the victim with the priority: 1. invalid blocks, 2. clean blocks, 3.
dirty blocks.

Write Policy
• Adopts a write-allocate policy.
• Both write-back and write-through policies are available.

Prefetching
• Adopts straightforward sequential prefetching.
• Prefetching is triggered by a cache miss.
• Fetches 4 blocks at a time.

Research Methodology

Memory Trace Collection
• Hyper Memory Trace Tool (HMTT):
  – Capable of collecting all memory requests.
  – Provides APIs for injecting tags into the memory trace to identify high-level system operations.
• FPGA emulation:
  – L2 cache and DDR2 memory controller from Godson-2F
  – DDR2 DIMM model from Micron Technology
  – Xtreme system from Cadence
  – (figure: memory trace replayed through the DMA cache, L2 cache, memory controller, and DDR2 DRAM)

Evaluations

Experimental Setup
• Machine: AMD Opteron, 2 GB memory, 1 GE NIC, IDE disk
• Configurations: snoop cache (2 MB), shared cache (2 MB), and DMA caches of 32 KB, 64 KB, 128 KB, and 256 KB, each with and without prefetching
• Benchmarks: file copy, TPC-H, SPECweb2005

Characterization of DMA
• The portion of memory references issued by DMA varies across applications.
• The sizes of DMA requests vary across applications.

Normalized Speedup
• The baseline is the snoop-cache scheme.
• The DMA cache schemes exhibit better performance than the others.

DMA Write & CPU Read Hit Rate
• Both the shared cache and the DMA cache exhibit high hit rates.
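The replacement priority (invalid, then clean, then dirty) and the fetch-4 sequential prefetch described in the design-issues slides can be sketched as follows. The data structures are invented for illustration, and the recency ordering within each state class that makes the policy "LRU-like" is omitted here:

```c
/* Sketch of the two DMA-cache policies named on the slides. */
enum state { INVALID, CLEAN, DIRTY };

/* Victim selection: prefer an invalid way, then a clean way, and only as a
 * last resort a dirty way (which would need a write-back before reuse). */
static int pick_victim(const enum state *ways, int nways)
{
    for (int pass = 0; pass < 3; pass++) {
        enum state want = (pass == 0) ? INVALID : (pass == 1) ? CLEAN : DIRTY;
        for (int w = 0; w < nways; w++)
            if (ways[w] == want)
                return w;
    }
    return 0; /* unreachable: every way holds one of the three states */
}

/* Sequential prefetch: a miss on block b brings in b and the next blocks,
 * 4 in total, matching the "fetch 4 blocks at a time" policy. */
static int prefetch_blocks(int miss_block, int *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = miss_block + i;
    return 4;
}
```

The slides attribute the high prefetch accuracy to the very regular, sequential access pattern of I/O data, which is exactly what this next-N-blocks scheme exploits.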
Then where do the cycles go for the shared-cache scheme?

Breakdown of Normalized Total Cycles
(figure: per-scheme breakdown of normalized total cycles)

Percentage of DMA Writes Causing Dirty-Block Replacement
• These DMA writes cause the cache pollution and thrashing problems.
• The 256 KB DMA cache largely eliminates these phenomena.

Percentage of Valid Prefetched Blocks
• DMA caches exhibit impressively high prefetching accuracy.
• This is because I/O data has a very regular access pattern.

Conclusions and Ongoing Work
• The nature of DMA:
  – There is a producer-consumer relationship between the CPU and the DMA engine.
  – Memory plays the role of a transient place for I/O data transferred between the processor and the I/O device.
• We propose a DMA cache scheme and discuss its design issues.
• Experimental results show that a DMA cache can significantly improve I/O performance.
• Ongoing work:
  – The impact of multiprocessors and multiple DMA channels on the DMA cache.
  – In theory, a shared cache with an intelligent replacement policy can achieve the effect of the DMA cache scheme; Godson-3 integrates a dedicated cache-management policy for I/O data.

THANKS! Q&A?