IOM_DMA_Cache

Exploiting the Produce-Consume
Relationship in DMA to Improve I/O
Performance
Institute of Computing Technology,
Chinese Academy of Sciences
2009.2.15
Workshop on The Influence of I/O on
Microprocessor Architecture (IOM-2009)
INSTITUTE OF COMPUTING TECHNOLOGY
Dan Tang, Yungang Bao,
Yunji Chen, Weiwu Hu, Mingyu Chen
A Brief Intro to ICT, CAS
ICT has developed the Loongson CPU
ICT has built the fastest HPC system in China, Dawning 5000, which delivers 233.5 TFlops and ranks 10th in the Top500.
Overview
Background
Nature of DMA Mechanism
DMA Cache Scheme
Research Methodology
Evaluations
Conclusions and Ongoing Work
Importance of I/O operations

I/O is ubiquitous
Loading binary files: Disk → Memory
Web browsing, media streaming: Network → Memory…
I/O is important
Many commercial applications are I/O intensive: databases, Internet applications, etc.
State-of-the-Art I/O Technologies

I/O Buses: 20GB/s
PCI-Express 2.0
HyperTransport 3.0
QuickPath Interconnect
I/O Devices
RAID: 400MB/s
10GE: 1.25GB/s
A Typical Computer Architecture
[Figure: a typical computer architecture, including the NIC]
Direct Memory Access (DMA)
DMA is an essential feature of I/O operation in all modern computers
DMA allows I/O subsystems to access system memory for reading and/or writing independently of the CPU.
Many I/O devices use DMA, including disk drive controllers, graphics cards, network cards, sound cards and GPUs
DMA in Computer Architecture
[Figure: DMA in a typical computer architecture, including the NIC]
An Example of Disk Read: DMA Receiving Operation
[Figure: steps ① to ⑤ of a DMA receive, involving the CPU, the DMA engine, and memory holding the descriptor, driver buffer, kernel buffer, and user buffer]
• Cache Access Latency: ~20 Cycles
• Memory Access Latency: ~200 Cycles
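The two latencies above suggest why it matters where I/O data is served from. The following is a minimal sketch, assuming a simple model in which every access costs either the cache latency or the memory latency. The function name and the sample hit rates are illustrative and do not come from the slides.

```python
CACHE_LATENCY = 20    # cycles, approximate figure from the slide
MEMORY_LATENCY = 200  # cycles, approximate figure from the slide


def average_access_cycles(hit_rate: float) -> float:
    """Expected cycles per access when a fraction hit_rate is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * CACHE_LATENCY + (1.0 - hit_rate) * MEMORY_LATENCY


# Illustrative hit rates only; these are not measurements.
for rate in (0.0, 0.5, 1.0):
    print(rate, average_access_cycles(rate))
```

Under this model, every access moved from memory to a cache saves roughly 180 cycles, which is the motivation for keeping I/O data on chip.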
Potential Improvement of DMA
[Figure: the same five-step DMA receive (① to ⑤) among the CPU, the DMA engine, and memory, with the descriptor, driver buffer, kernel buffer, and user buffer]
• This is a typical Shared-Cache Scheme
Problems of Shared-Cache Scheme
Cache pollution
Cache thrashing
Performance degrades when DMA requests are large (>100KB) for the “Oracle + TPC-H” application
Rethink DMA Mechanism
The Nature of DMA
There is a producer-consumer relationship between the CPU and the DMA engine
Memory serves as a transient place for I/O data transferred between the processor and the I/O device
Corollaries
Once I/O data is produced, it will be consumed
I/O data within a DMA buffer is used only once in most cases (i.e. almost no reuse).
→ The characteristics of I/O data differ from those of CPU data
→ It may not be appropriate to store I/O data and CPU data together
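The producer-consumer relationship can be illustrated with a small queue model. This is a sketch only: the function names and block labels are hypothetical, and a real DMA engine and CPU run concurrently rather than one after the other.

```python
from collections import deque


def dma_receive(device_blocks, buffer):
    """Producer: the DMA engine writes each incoming block into the buffer."""
    for block in device_blocks:
        buffer.append(block)


def cpu_consume(buffer):
    """Consumer: the CPU reads each block once and never touches it again."""
    consumed = []
    while buffer:
        consumed.append(buffer.popleft())
    return consumed


dma_buffer = deque()
dma_receive(["blk0", "blk1", "blk2"], dma_buffer)
data = cpu_consume(dma_buffer)
print(data, len(dma_buffer))
```

Each block passes through the buffer exactly once, which is the "almost no reuse" property that distinguishes I/O data from ordinary CPU data.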
DMA Cache Proposal
A dedicated cache
Storing I/O data
Capable of exchanging data with the processor’s last level cache (LLC)
→ Reduces the overhead of I/O data movement
DMA Cache Design Issues
Cache Coherence
Data Path
Replacement Policy
Write Policy
Prefetching
Cache Coherence: the DMA cache state diagram is similar to that of a CPU cache in a uniprocessor system
We are researching the multiprocessor platform…
[Figures: DMA cache state diagram and CPU cache state diagram]
DMA Cache Design Issues
Data Path: additional data paths and data access ports for the LLC are not required, because data migration between the DMA cache and the LLC can share the existing data paths and ports of the snooping mechanism
Data Path: CPU Read
[Figure: cmd and data paths for a CPU read that misses in the LLC and hits in the DMA cache ("Hit in DMA cache?"). Components shown: the cache controllers and snoop controllers of the last level cache and the DMA cache, the system bus, the DMA controller, the memory controller, the I/O device, and memory]
Data Path: DMA Read
[Figure: cmd and data paths for a DMA read that misses in the DMA cache and hits in the LLC ("Hit in LLC?"). Components shown: the cache controllers and snoop controllers of the last level cache and the DMA cache, the system bus, the DMA controller, the memory controller, the I/O device, and memory]
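The two data paths can be summarized as a lookup order. The sketch below models each cache as a Python dict and assumes the requester checks its own cache first, then the other cache via snooping, then memory. Whether a block is also migrated into the requester's cache is not specified on the slides, so the sketch only returns the data and where it was found. All names are illustrative.

```python
def cpu_read(addr, llc, dma_cache, memory):
    """Return (data, source) for a CPU read."""
    if addr in llc:
        return llc[addr], "LLC"
    if addr in dma_cache:  # miss in LLC and hit in DMA cache
        return dma_cache[addr], "DMA cache"
    return memory[addr], "memory"


def dma_read(addr, llc, dma_cache, memory):
    """Return (data, source) for a DMA read issued by an I/O device."""
    if addr in dma_cache:
        return dma_cache[addr], "DMA cache"
    if addr in llc:  # miss in DMA cache and hit in LLC
        return llc[addr], "LLC"
    return memory[addr], "memory"


llc = {1: "a"}
dma_cache = {2: "b"}
memory = {1: "old-a", 2: "old-b", 3: "c"}
print(cpu_read(2, llc, dma_cache, memory))
print(dma_read(1, llc, dma_cache, memory))
```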
DMA Cache Design Issues
Replacement Policy: an LRU-like replacement policy
1. Invalid Block
2. Clean Block
3. Dirty Block
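One possible reading of the numbered list is a victim-selection priority: replace an invalid block if one exists, otherwise a clean block, otherwise a dirty block, using least-recently-used order within each class. The sketch below implements that interpretation. The `Block` fields and the `choose_victim` name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    valid: bool
    dirty: bool
    last_used: int  # larger value means more recently used


def choose_victim(cache_set):
    """Pick the block to replace: invalid, then clean, then dirty; LRU within a class."""
    def priority(block):
        if not block.valid:
            state_rank = 0
        elif not block.dirty:
            state_rank = 1
        else:
            state_rank = 2
        return (state_rank, block.last_used)

    return min(cache_set, key=priority)


blocks = [
    Block(tag=0, valid=True, dirty=True, last_used=1),
    Block(tag=1, valid=True, dirty=False, last_used=5),
    Block(tag=2, valid=True, dirty=False, last_used=3),
]
print(choose_victim(blocks).tag)
```

Preferring clean victims avoids a write-back to memory on replacement, even when a dirty block is older.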
DMA Cache Design Issues
Write Policy → Adopt write-allocate policy
→ Both write-back and write-through policies are available
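The difference between the two write policies can be shown with a toy model. This is a sketch under simplifying assumptions: the cache has no capacity limit, memory is a dict, and the `TinyCache` class with its methods is hypothetical rather than taken from the design.

```python
class TinyCache:
    def __init__(self, memory, write_back=True):
        self.memory = memory          # dict: address -> value (backing store)
        self.write_back = write_back
        self.lines = {}               # address -> [value, dirty flag]

    def write(self, addr, value):
        # Write-allocate: a write always places the block in the cache,
        # even if the address was not cached before.
        if self.write_back:
            self.lines[addr] = [value, True]   # memory is updated on eviction
        else:
            self.lines[addr] = [value, False]
            self.memory[addr] = value          # write-through: update memory now

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = value          # write back dirty data


wb_memory, wt_memory = {}, {}
wb = TinyCache(wb_memory, write_back=True)
wt = TinyCache(wt_memory, write_back=False)
wb.write(0x10, "payload")
wt.write(0x10, "payload")
print(0x10 in wb_memory, 0x10 in wt_memory)
```

With write-back, data consumed from the cache before eviction generates no memory write until the block is replaced; with write-through, memory is always up to date.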
DMA Cache Design Issues
Prefetching: adopt straightforward sequential prefetching
→ Prefetching triggered by a cache miss
→ Fetch 4 blocks at a time
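Miss-triggered sequential prefetching can be sketched as follows. The sketch assumes that the 4 fetched blocks are the missing block plus its 3 successors (the slide does not say whether the missing block is counted), and it models the cache as an unbounded set. The names are illustrative.

```python
PREFETCH_DEGREE = 4  # "fetch 4 blocks" per the slide


def access(block_addr, cache, stats):
    """Access one block; on a miss, fetch it plus the next sequential blocks."""
    if block_addr in cache:
        stats["hits"] += 1
        return True
    stats["misses"] += 1
    # Assumption: the missing block and its 3 successors are fetched together.
    for offset in range(PREFETCH_DEGREE):
        cache.add(block_addr + offset)
    return False


cache, stats = set(), {"hits": 0, "misses": 0}
for addr in range(16):  # a regular, sequential I/O stream
    access(addr, cache, stats)
print(stats)
```

On a purely sequential stream only one access in four misses, which is why regular I/O access patterns suit such a simple prefetcher.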
Memory Trace Collection
Hyper Memory Trace Tool (HMTT)
Capable of collecting all memory requests
Provides APIs for injecting tags into the memory trace to identify high-level system operations
FPGA Emulation
L2 Cache from Godson-2F
DDR2 Memory Controller from Godson-2F
DDR2 DIMM model from Micron Technology
Xtreme system from Cadence
[Figure: emulation platform with the memory trace, DMA cache, L2 cache, MemCtrl, and DDR2 DRAM]
Experimental Setup
Machine: AMD Opteron, 2GB memory, 1 GE NIC, IDE disk
Configurations:
→ Snoop Cache (2MB)
→ Shared Cache (2MB)
→ DMA Cache: 256KB, 128KB, 64KB, or 32KB, each with and without prefetch
Benchmarks: File Copy, TPC-H, SPECWeb2005
Characterization of DMA
The proportion of DMA memory references varies depending on the application
The sizes of DMA requests vary depending on the application
Normalized Speedup
The baseline is the snoop cache scheme
The DMA cache schemes exhibit better performance than the others
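A normalized speedup is commonly computed as the baseline's cycle count divided by the cycle count of the scheme under test. The sketch below assumes that definition, since the slides do not state the formula. The cycle counts are placeholders and not results from the talk.

```python
def normalized_speedup(baseline_cycles: float, scheme_cycles: float) -> float:
    """Speedup of a scheme relative to the baseline (snoop cache in this talk)."""
    if scheme_cycles <= 0:
        raise ValueError("scheme_cycles must be positive")
    return baseline_cycles / scheme_cycles


# Placeholder cycle counts for illustration only, not measured data.
example = normalized_speedup(baseline_cycles=1000.0, scheme_cycles=800.0)
print(example)
```

A value above 1 means the scheme finishes in fewer cycles than the baseline, and the baseline itself always scores exactly 1.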
DMA Write & CPU Read Hit Rate
Both the shared cache and the DMA cache exhibit high hit rates
Then where do the cycles go in the shared cache scheme?
Breakdown of Normalized Total Cycles
% of DMA Writes Causing Dirty Block Replacement
Those DMA writes cause the cache pollution and thrashing problems
The 256KB DMA cache largely eliminates these phenomena
% of Valid Prefetched Blocks
DMA caches exhibit impressively high prefetching accuracy
This is because I/O data has a very regular access pattern.
Conclusions and Ongoing Work
The Nature of DMA
There is a producer-consumer relationship between the CPU and the DMA engine
Memory serves as a transient place for I/O data transferred between the processor and the I/O device
We propose a DMA cache scheme and discuss its design issues.
Experimental results show that the DMA cache can significantly improve I/O performance.
Ongoing Work
The impact of multiprocessors and multiple DMA channels on the DMA cache
In theory, a shared cache with an intelligent replacement policy can achieve the effect of the DMA cache scheme.
Godson-3 has integrated a dedicated cache management policy for I/O data.
THANKS!
Q&A?