Virtually Pipelined Network Memory
Banit Agrawal
Tim Sherwood
UC Santa Barbara
Memory Design is Hard
• Increasing functionality
• Increasing size of data structures
 IPv4 routing table size: 100k → 200k → 360k entries
 Packet classification rules: 2,000 → 5,000 → 10,000
• Increasing line rates: 10 Gbps → 40 Gbps → 160 Gbps
• Throughput must hold in the worst case
 Need to service traffic at the advertised rate
What Do Programmers Think?
• Network programmers see memory (DRAM) as:
 Low cost
 Low power
 High capacity
 High bandwidth, but only for some access patterns
• So what is the problem?
DRAM Bank Conflicts
[Figure: four DRAM banks, each with its own row decoder, sense amplifiers, and column decoder, share a single address bus and data bus; banks 1 and 3 are shown busy servicing earlier accesses]
• Variable latency
• Variable throughput
• Worst case: every access conflicts
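To make the conflict penalty concrete, here is a minimal simulation sketch; the bank count, busy time, and access streams are illustrative assumptions rather than parameters from this work. It contrasts a round-robin address stream with an adversarial one that hammers a single bank:

```python
# Toy model of DRAM bank conflicts: one access may issue per cycle,
# but an access must wait until its target bank is free again.
NUM_BANKS = 4          # illustrative; real parts have more banks
BANK_BUSY_CYCLES = 15  # cycles a bank stays busy per access (assumed)

def cycles_to_service(addresses):
    free_at = [0] * NUM_BANKS   # cycle at which each bank becomes free
    cycle = 0
    for addr in addresses:
        bank = addr % NUM_BANKS
        cycle = max(cycle, free_at[bank])  # stall while the bank is busy
        free_at[bank] = cycle + BANK_BUSY_CYCLES
        cycle += 1                         # issue the access
    return cycle

interleaved = list(range(1000))                    # round-robin across banks
same_bank = [i * NUM_BANKS for i in range(1000)]   # every access hits bank 0

print("interleaved pattern:", cycles_to_service(interleaved), "cycles")
print("adversarial pattern:", cycles_to_service(same_bank), "cycles")
```

With these numbers the adversarial stream takes roughly NUM_BANKS times longer, which is exactly the gap between advertised and worst-case throughput.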
Prior Work
• Reducing bank conflicts for common access patterns
 Prefetching and memory-aware layout [Lin-HPCA'01, Mathew-HPCA'00]
 Reordering of requests [Hong-HPCA'99, Rixner-ISCA'00]
 Vector processing domain [Espasa-Micro'97]
 Good for desktop computing, but no guarantees for the worst case
• Reducing bank conflicts for special access patterns
 Packet buffering: data is written once and read once
 Low bank conflicts: optimizations including row locality and scheduling [Hasan-ISCA'03, Nikologiannis-ICC'01]
 No bank conflicts: reordering and clever memory management algorithms [Garcia-Micro'03, Iyer-StanTechReport'02]
 Not applicable to arbitrary access patterns
Where Do Network Systems Stand?
[Spectrum: from parts requiring full determinism (0% deadline failures), through network systems (no exploitable deadline failures), to best-effort co-operative parts optimized for the common case]
Virtually Pipelined Memory
• Normalize the overall latency
 Using randomization and buffering
 Deterministic latency for all accesses
[Figure: a request enters the memory controller at time t; the DRAM sits behind the controller; the reply always emerges at time t + D]
• Trillions of accesses without any bank-conflict stalls
 Even under adversarial access patterns
Outline
• Memory for networking systems
• Memory controller
• Design analysis
 Hardware design
 How do we compare?
• Conclusion
Memory Controller
[Figure: four per-bank controllers sit in front of DRAM banks 0-3; a hashing unit (HU) maps each key to a randomized (bank, row) pair, e.g. 5 → (2, A), 6 → (0, F), 7 → (2, B), 8 → (3, A); a bus scheduler collects results so that data for an address presented at time t is driven onto the data bus at t + D]
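A small sketch of the bank-randomization step the hashing unit (HU) performs; the specific hash (blake2b) and the table sizes are assumptions for illustration, not the hash family used in the actual design:

```python
# Map a lookup key to a pseudo-random (bank, row), as the HU does above.
import hashlib

NUM_BANKS = 4
ROWS_PER_BANK = 4096

def bank_and_row(key):
    digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8).digest()
    h = int.from_bytes(digest, "little")
    return h % NUM_BANKS, (h // NUM_BANKS) % ROWS_PER_BANK

for key in (5, 6, 7, 8):   # cf. the figure's mappings such as 5 -> (2, A)
    print(key, "->", bank_and_row(key))
```

Randomizing placement means no access pattern an adversary constructs can deterministically target one bank, which is what makes the probabilistic analysis later in the talk possible.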
Non-conflicting Accesses
Bank latency (L) = 15 cycles; normalized delay (D) = 30 cycles
[Timeline: requests A, B, and C arrive at cycles 0, 10, and 20; each bank access completes after L cycles, but its data is buffered and declared ready exactly D cycles after its request]
Redundant Accesses
Bank latency (L) = 15 cycles; normalized delay (D) = 30 cycles
[Timeline: a second request to row A arrives while the first is still in flight; the controller merges it with the pending access instead of re-reading the bank, and each request's data is still ready exactly D cycles after it arrived]
Conflicting Accesses
Bank latency (L) = 15 cycles; normalized delay (D) = 30 cycles
[Timeline: requests A through E all target the same bank; each access must wait L cycles behind the one before it, so pending work accumulates until it exceeds what D can hide and the controller stalls]
Implementing Virtually Pipelined Banks
[Figure: each bank controller combines four structures. A bank access queue holds row addresses scheduled for the DRAM bank. A delay storage buffer holds each pending access's row id, data words, and a reference count (incremented on redundant requests, decremented on release). A circular delay buffer indexed by arrival time (access[t−d] … access[t]) releases each result on the interface exactly D cycles after its request. A FIFO write buffer holds writes until they can be scheduled. Control logic connects the interface address/data ports to the scheduled-access address/data sent to memory.]
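Putting the pieces together, here is a heavily simplified sketch of one virtually pipelined bank: single bank, unit-time issue, no write path. The timing constants are the example values from the slides, and the structure is a loose analogue of the delay storage buffer and circular delay buffer, not the actual hardware:

```python
from collections import deque

D = 30  # normalized delay (cycles): every answer appears at t + D
L = 15  # bank latency (cycles): the bank is busy this long per access

class VirtualBank:
    """Toy model of one virtually pipelined bank controller."""

    def __init__(self):
        self.cycle = 0
        self.bank_free_at = 0   # when the DRAM bank can accept a new access
        self.ready_at = {}      # row -> cycle its data comes out of the bank
        self.release = deque()  # (t + D, row) in request order

    def request(self, row):
        # Merge redundant accesses: a row already in flight is not re-read
        # (the delay storage buffer just bumps its reference count).
        if row not in self.ready_at:
            start = max(self.cycle, self.bank_free_at)
            self.bank_free_at = start + L
            self.ready_at[row] = start + L
        # Merged or not, the data is released exactly D cycles from now.
        self.release.append((self.cycle + D, row))

    def tick(self):
        self.cycle += 1
        while self.release and self.release[0][0] <= self.cycle:
            t, row = self.release.popleft()
            if self.ready_at[row] > t:
                print(f"cycle {t}: STALL, row {row} not ready until "
                      f"cycle {self.ready_at[row]}")
            else:
                print(f"cycle {t}: row {row} data out, exactly D after request")

bank = VirtualBank()
for row in ["A", "B", "A", "C"]:  # the second "A" is a redundant access
    bank.request(row)
    bank.tick()
for _ in range(2 * D):
    bank.tick()
```

Running this shows A, B, and the merged second A all emerging exactly D cycles after their requests, while C (a third distinct access to the same bank within one D window) cannot be hidden and stalls. With B banks and randomized placement such pile-ups become vanishingly rare, which is what the MTS analysis on the next slides quantifies.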
Delay Storage Buffer Stall
• Mean time to stall (MTS)
 B = number of banks, so 1/B is the probability that a given request targets a particular bank
• A stall happens when more than K accesses to a bank fall within an interval of D cycles
• An illustration:
 Normalized latency (D) = 30 cycles
 Entries in the delay storage buffer (K) = 3
Delay Storage Buffer Stall
[Timeline: requests A through F each increment (+1) the delay storage buffer occupancy on arrival and decrement (−1) it D cycles later when their data is released; occupancy exceeding K forces a stall]

MTS = log(1/2) / log(1 − C(D−1, K−1) · (1/B)^(K−1))

i.e., the number of requests after which the probability of having avoided a stall drops to 50%.
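The closed form can be evaluated directly, as in the sketch below; the B, D, K values are illustrative, not the paper's design point:

```python
# Mean time to stall from the closed form above.
from math import comb, log

def mts_requests(B, D, K):
    """Requests until the probability of having seen no stall drops to 50%."""
    p_window = comb(D - 1, K - 1) * (1.0 / B) ** (K - 1)
    return log(0.5) / log(1.0 - p_window)

for K in (4, 8, 12):
    print(f"B=32, D=30, K={K}: MTS ~ {mts_requests(32, 30, K):,.0f} requests")
```

The point of the formula is the exponent: each extra delay storage buffer entry multiplies the mean time to stall by roughly another factor of B.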
Markovian Analysis
• Bank access queue stall
 State-based analysis
 B = number of banks; 1/B is the probability that a cycle's access targets this bank
• If more than D cycles of work are pending, a stall occurs.
• An example:
 Bank access latency (L) = 3
 Normalized delay (D) = 6
[State diagram: states 0 (idle) through 6 (stall) count the cycles of pending work; each cycle the state drops by one with probability 1 − 1/B, or rises by L − 1 with probability 1/B when a new access arrives]
• MTS = number of cycles until the probability of being in the stall state reaches 0.5
Markovian Analysis

I = [1 0 0 0 0 0 0]   (start in the idle state)

M = [ 1−1/B    0     1/B     0       0      0      0   ]
    [ 1−1/B    0      0     1/B      0      0      0   ]
    [   0    1−1/B    0      0      1/B     0      0   ]
    [   0      0    1−1/B    0       0     1/B     0   ]
    [   0      0      0    1−1/B     0      0     1/B  ]
    [   0      0      0      0     1−1/B    0     1/B  ]
    [   0      0      0      0       0      0      1   ]

(Each non-stall row: with probability 1 − 1/B one cycle of work drains; with probability 1/B an access adds L = 3 cycles for a net gain of two. The stall state is absorbing.)

P = I · M^n
Find the smallest n such that P(stall) = 50%; that n is the MTS.
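The same computation, sketched with the generic transition rule rather than the 7 × 7 matrix written out; numpy and the B value are assumptions for illustration:

```python
# Iterate P = I * M^n until the stall-state probability reaches 50%.
import numpy as np

L, D, B = 3, 6, 32   # example latency/delay from the slide; B is assumed
n_states = D + 1     # pending work 0 .. D-1, plus state D = stall
p = 1.0 / B          # probability an access targets this bank each cycle

M = np.zeros((n_states, n_states))
for w in range(D):                        # transitions from non-stall states
    M[w, max(w - 1, 0)] += 1 - p          # no arrival: one cycle drains
    M[w, min(w - 1 + L, D)] += p          # arrival: add L cycles, drain one
M[D, D] = 1.0                             # stall state is absorbing

P = np.zeros(n_states)
P[0] = 1.0                                # start in the idle state
n = 0
while P[D] < 0.5:
    P = P @ M                             # P = I * M^n after n iterations
    n += 1
print(f"MTS = {n} cycles until P(stall) reaches 50%")
```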
Hardware Design and Overhead
• Hardware design
 Verilog implementation
 Verified with ModelSim and a C++ simulation model
 Synthesized with Synopsys Design Compiler
• Hardware overhead tool
 Based on Cacti parameters
 Validated against the synthesized design
• Optimal design parameters from this tool
 45.7-second MTS with an area overhead of 34.1 mm² at 77% efficiency
 10-hour MTS with an area overhead of 34 mm² at 71.4% efficiency
How Does VPNM Perform?
• Packet buffering
 Only the head and tail pointers need to be stored
 Supports an arbitrarily large number of logical queues

Scheme         Line rate (Gbps)   Area (mm²)   Total delay (ns)   Supported interfaces
RADS [17]      40                 10           53                 130
CFDS [12]      160                60           10000              850
Our approach   160                41.9         960                4096

 Versus CFDS at the same line rate: 35% less area, 10x less latency, 5x more queues
• Packet reassembly
Conclusion
• VPNM provides
 Deterministic latency
  Randomization and normalization: a request to the memory controller at time t always completes at t + D
 Higher throughput
  A worst case that is impossible to exploit
  Handles any access pattern
 Ease of programmability and mapping
  Packet buffering
  Packet reassembly
Thanks for your attention.
Questions?
http://www.cs.ucsb.edu/~arch/