Yangdong Steve Deng 邓仰东 dengyd@tsinghua.edu.cn
Tsinghua University
Outline
Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion
High Performance Embedded Computing
Future IT infrastructure demands even higher computing power
Core Internet router throughput: up to 90Tbps
4G wireless base station: 1Gbit/s data rate per customer and up to 200 subscribers in service area
CMU driverless car: 270 GFLOPS (giga floating-point operations per second)
…
Fast Increasing IC Costs
Fabrication Cost
Moore’s Second Law: The cost of doubling circuit density increases in line with Moore's First Law.
Design Cost
Now $20-50M per product
Will reach $75-120M at the 32nm node
The 4-year development of the Cell processor by Sony, IBM, and Toshiba cost over $400M.
Implications of the Prohibitive Cost
ASICs would be unaffordable for many applications!
Scott MacGregor, CEO of Broadcom:
• “Broadcom is not intending a move to 45nm in the next year or so as it will be too expensive.”
David Turek, VP of IBM:
• “IBM will be pulling out of Cell development, with PowerXCell 8i to be the company’s last entrance in the technology.”
Multicore Machines Are Really Powerful!
| Manufacturer | Processor Type | Model | # Cores | GFLOPS FP64 | GFLOPS FP32 |
|---|---|---|---|---|---|
| AMD | GPGPU | FireStream 9270 | 160/800 | 240 | 1200 |
| AMD | GPU | Radeon HD 5870 | 320/1600 | 544 | 2720 |
| AMD | GPU | Radeon HD 5970 | 640/3200 | 928 | 4640 |
| AMD | CPU | Magny-Cours | 12 | 362.11 | 362.11 |
| Fujitsu | CPU | SPARC64 VII | 4 | 128 | 128 |
| Intel | CPU | Core 2 Extreme QX9775 | 4 | 51.2 | 51.2 |
| nVidia | GPU | Fermi 480 | 512 | 780 | 1560 |
| nVidia | GPGPU | Tesla C1060 | 240 | 77.76 | 933.12 |
| nVidia | GPGPU | Tesla C2050 | 448 | 515.2 | 1288 |
| Tilera | CPU | TilePro | 64 | 166 | 166 |

GPU: Graphics Processing Unit; GPGPU: General Purpose GPU
[Images: AMD 12-core CPU, Tilera Tile Gx100 CPU, NVidia Fermi GPU]
Implications
An increasing number of applications will be implemented on multi-core devices
Huawei: multi-core base stations
Intel: cluster based Internet routers
IBM: signal processing and radar applications on Cell processor
…
Also meets the strong demand for customizability and extensibility
Outline
Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion
Software Routing with GPU
Background and motivation
GPU based routing processing
Routing table lookup
Packet classification
Deep packet inspection
GPU microarchitecture enhancement
CPU and GPU integration
QoS-aware scheduling
Ever-Increasing Internet Traffic
Fast Changing Network Protocols/Services
New services are rapidly appearing
Data-center, Ethernet forwarding, virtual LAN, …
Personal customization is often essential for QoS
However, today’s Internet heavily depends on two protocols
Ethernet and IPv4, both developed in the 1970s!
Internet Router
[Diagram: packet streams flowing into and out of a router]
Backbone network device
Packet forwarding and path finding
Connect multiple subnets
Key requirements
• High throughput: 40G-90Tbps
• High flexibility
[Photo: Cisco GSR 12416, 19" x 6ft x 2ft; capacity: 160Gb/s; power: 4.2kW]
Current Router Solutions
Hardware routers
Fast
Long design time
Expensive
Hard to maintain
Network processor based router
Network processor: data parallel packet processor
No good programming models
Software routers
Extremely flexible
Low cost
But slow
Outline
Background and motivation
GPU based routing processing
Routing table lookup
Packet classification
Deep packet inspection
GPU microarchitecture enhancement
CPU and GPU integration
QoS-aware scheduling
Critical Path of Routing Processing
[Diagram: the packet processing pipeline. Header processing performs packet classification against a rule set and IP address lookup against the routing table, then updates the header; deep packet inspection examines the payload; packets are queued in buffer memory before the switch fabric]
GPU Based Software Router
Data level parallelism = packet level parallelism
[System diagram: four CPU cores and two NICs (each on a PCIe 4-lane link) plus a graphics card (GPU and GPU memory on a PCIe 16-lane link), connected through the north bridge (memory controller), the front side bus (FSB), and the memory bus to main memory]
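As a toy illustration of one-packet-per-thread processing, here is a minimal PyCUDA sketch; the 20-byte header layout and the TTL offset (byte 8) are assumptions made for this example, not the actual router code.

```python
# Minimal sketch of one-thread-per-packet processing with PyCUDA.
import numpy as np
import pycuda.autoinit                      # creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule(r"""
__global__ void decrement_ttl(unsigned char *headers, int n_packets)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread = one packet
    if (i < n_packets) {
        unsigned char *ttl = headers + i * 20 + 8;   // assumed TTL byte of packet i
        if (*ttl > 0) (*ttl)--;
    }
}
""")
decrement_ttl = mod.get_function("decrement_ttl")

n = 4096
headers = np.full((n, 20), 64, dtype=np.uint8)       # fake packet headers
decrement_ttl(drv.InOut(headers), np.int32(n),
              block=(256, 1, 1), grid=((n + 255) // 256, 1))
print(headers[0, 8])                                  # 63 after one pass
```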
Routing Table Lookup
Routing table contains network topology information
Find the output port according to destination IP address
Potentially large routing table (~1M entries)
• Can be updated dynamically
An example routing table:

| Destination Address Prefix | Next-Hop | Output Port |
|---|---|---|
| 24.30.32/20 | 192.41.177.148 | 2 |
| 24.30.32.160/28 | 192.41.177.3 | 6 |
| 208.12.32/20 | 192.41.177.196 | 1 |
| 208.12.32.111/32 | 192.41.177.195 | 5 |
Routing Table Lookup
Longest prefix match
Memory bound
Usually based on a trie data structure
• Trie: a prefix tree with strings as keys
• A node’s position directly reflects its key
• Pointer operations
• Widely divergent branches!
[Figure: the example prefixes (24.30.32/20, 24.30.32.160/28, 208.12.32/20, 208.12.32.111/32) organized as a search trie; each prefix ends at a numbered trie node that stores its next hop and output port]
GPU Based Routing Table Lookup
Organize the search trie into an array
Pointers converted to offsets relative to the array head
6X speedup even with frequent routing table updates
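A CPU-side sketch of the array-based trie idea; the node layout ([left, right, port]) and helper names are assumptions for illustration, not the actual GPU data structure.

```python
# Each trie node is a fixed-size record in a flat array; child "pointers" are
# integer offsets, which is what makes the structure easy to copy to GPU memory.
def build_trie(prefixes):
    nodes = [[-1, -1, -1]]                  # nodes[i] = [left, right, port]; -1 = absent
    for prefix, length, port in prefixes:
        cur = 0
        for bit_pos in range(31, 31 - length, -1):
            bit = (prefix >> bit_pos) & 1
            if nodes[cur][bit] == -1:
                nodes.append([-1, -1, -1])
                nodes[cur][bit] = len(nodes) - 1
            cur = nodes[cur][bit]
        nodes[cur][2] = port
    return nodes

def lookup(nodes, addr):                    # longest prefix match by walking offsets
    cur, best = 0, -1
    for bit_pos in range(31, -1, -1):
        if nodes[cur][2] != -1:
            best = nodes[cur][2]            # remember the longest match seen so far
        nxt = nodes[cur][(addr >> bit_pos) & 1]
        if nxt == -1:
            return best
        cur = nxt
    if nodes[cur][2] != -1:
        best = nodes[cur][2]
    return best

def ip(s):
    a, b, c, d = (int(x) for x in s.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

# Prefixes from the example routing table: (prefix, length, output port)
table = [(ip('24.30.32.0'), 20, 2), (ip('24.30.32.160'), 28, 6),
         (ip('208.12.32.0'), 20, 1), (ip('208.12.32.111'), 32, 5)]
nodes = build_trie(table)
print(lookup(nodes, ip('24.30.32.170')))    # 6: the /28 wins over the /20
```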
Packet Classification
Match header fields with predefined rules
Size of rule sets can be huge (e.g., over 5,000 rules)
| Rule | Example |
|---|---|
| Priority | Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority |
| Packet filtering | Deny all traffic from ISP3 destined to 166.111.66.77 |
| Traffic rate limit | Ensure ISP2 does not inject more than 10Mbps email traffic on interface 2 |
| Accounting & billing | Treat video traffic to 166.111.X.X as highest priority and perform accounting |
Packet Classification
Hardware solution
Usually with Ternary CAM (TCAM)
• Expensive and power hungry
Software solutions
Linear search
Hash based
Tuple space search
• Convert the rules into a set of exact matches
GPU Based Packet Classification
A linear search approach
Scale to rule sets with 20,000 rules
Meta-programming
Compile rules into CUDA code with PyCUDA
Example: "Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority" compiles to:
if ((DA >= 166.111.66.70) && (DA <= 166.111.66.77)) priority = 0;
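A sketch of this meta-programming step in Python; the kernel skeleton and field names are illustrative assumptions, not the actual generator.

```python
# Each classification rule becomes a CUDA if-statement; the generated source
# could then be compiled at run time (e.g., with PyCUDA's SourceModule).
def ip(s):
    a, b, c, d = (int(x) for x in s.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

def compile_rules(rules):
    body = "\n".join(
        "    if (dst >= %du && dst <= %du && %d < priority) priority = %d;"
        % (ip(lo), ip(hi), prio, prio)
        for lo, hi, prio in rules)
    return ("__global__ void classify(const unsigned int *dst_addr,\n"
            "                         int *out_priority, int n)\n"
            "{\n"
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
            "    if (i >= n) return;\n"
            "    unsigned int dst = dst_addr[i];\n"
            "    int priority = 255;\n"
            "%s\n"
            "    out_priority[i] = priority;\n"
            "}\n" % body)

rules = [("166.111.66.70", "166.111.66.77", 0)]   # the rule from the slide
print(compile_rules(rules))
```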
GPU Based Packet Classification
~60X speedup
[Chart, log scale: classification throughput of CPU_DBS vs. GPU_DBS across the rule sets]
Deep Packet Inspection (DPI)
Core component for network intrusion detection
Against viruses, spam, software vulnerabilities, …
Example rule: alert tcp $EXTERNAL_NET 27374 -> $HOME_NET any (msg:"BACKDOOR subseven 22"; flags: A+; content: "|0d0a5b52504c5d3030320d0a|";)
[Diagram: the Snort pipeline. Sniffed packet stream → packet decoder → preprocessor (plug-ins) → detection engine (plug-ins) with fixed string matching → output stage (plug-ins) producing alerts/logs]
GPU Based Deep Packet Inspection (DPI)
Fixed string match
Each rule is just a string that is disallowed
Bloom-filter based search
One warp for a packet and one thread for a string
Throughput: 19.2Gbps (30X speed-up over SNORT)
[Figure: Bloom filter construction. The bit vector starts at all zeros; pre-processing hashes each rule string (r1, r2, ...) with three hash functions and sets the corresponding bits; packet content windows (s1, s2, ...) are then hashed with the same functions and checked against the Bloom vector]
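A CPU-side sketch of the per-window Bloom-filter check; the hash functions, vector size, and patterns here are illustrative assumptions, not the ones used in the actual design.

```python
# A Bloom hit only *suggests* a match (false positives are possible) and would
# trigger exact matching; a miss safely rules the window out.
import hashlib

M = 1 << 16                     # bits in the Bloom vector
K = 3                           # number of hash functions

def hashes(data: bytes):
    for seed in range(K):
        h = hashlib.md5(bytes([seed]) + data).digest()
        yield int.from_bytes(h[:4], "little") % M

def add(bloom, pattern: bytes):
    for h in hashes(pattern):
        bloom[h] = 1

def maybe_contains(bloom, window: bytes):
    return all(bloom[h] for h in hashes(window))

patterns = [b"/etc/passwd", b"cmd.exe"]        # fixed strings from the rule set
bloom = bytearray(M)
for p in patterns:
    add(bloom, p)

payload = b"GET /etc/passwd HTTP/1.0"
for plen in {len(p) for p in patterns}:        # slide a window per pattern length
    for i in range(len(payload) - plen + 1):
        if maybe_contains(bloom, payload[i:i + plen]):
            print("possible match at offset", i, payload[i:i + plen])
```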
GPU Based Deep Packet Inspection (DPI)
Regular expression matching
Each rule is a regular expression
• e.g., a|b* = {ε, a, b, bb, bbb, ...}
Aho-Corasick Algorithm
• Converts patterns into a finite state machine
• Matching is done by state traversal
Memory bound
• Virtually no computation
Compress the state table
• Merging don’t-cared entries
Throughput: 9.3Gbps
15X speed-up over SNORT
Example: P={he, she, his, hers}
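A minimal CPU-side Aho-Corasick sketch for this example pattern set; it is a reference for the state-traversal idea, not the GPU implementation or its compressed state table.

```python
# Build the pattern trie, add failure links by BFS, then match by walking states.
from collections import deque

def build(patterns):
    goto, out, fail = [{}], [set()], [0]
    for p in patterns:                      # trie of patterns
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); out.append(set()); fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())             # BFS to fill in failure links
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, out, fail

def search(text, goto, out, fail):
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            yield i - len(p) + 1, p         # (start offset, matched pattern)

automaton = build(["he", "she", "his", "hers"])
print(sorted(search("ushers", *automaton)))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```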
Outline
Background and motivation
GPU based routing processing
Routing table lookup
Packet classification
Deep packet inspection
GPU microarchitecture enhancement
CPU and GPU integration
QoS-aware scheduling
Limitation of GPU-Based Packet Processing
CPU-GPU communication overhead
No QoS guarantee
[System diagram: the same multi-core CPU, NIC, and discrete GPU organization as before, with a packet queue maintained in main memory]
Microarchitectural Enhancements
CPU-GPU integration with a shared memory
Maintain current CUDA interface
Implemented on GPGPU-Sim *
[Block diagram of the proposed NPGPU: NIC, task FIFO, CPU, GPU, delayed commit queue, and a CPU/GPU shared memory]
*A. Bakhoda, et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS, 2009.
Microarchitectural Enhancements
Uniformly one thread for one packet
No thread block necessary
Directly schedule and issue warps
GPU fetches packet IDs from task queue when
Either a sufficient number of packets are already collected
Or a given interval passes after the last fetch (see the sketch below)
[Diagram: a CPU-maintained task queue feeding six GPU cores and a delayed commit queue]
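A host-side sketch of this batching policy; the batch size and timeout values are illustrative assumptions.

```python
# Launch a GPU batch when either enough packet IDs have accumulated
# or a timeout has expired since the last fetch.
import time
from collections import deque

BATCH_SIZE   = 1024        # "sufficient number of packets"
MAX_WAIT_SEC = 0.0005      # "given interval after last fetch"

task_queue = deque()       # packet IDs written by the CPU/NIC side
last_fetch = time.monotonic()

def launch_on_gpu(packet_ids):
    # Placeholder for the real kernel launch (one thread per packet).
    print("launching batch of", len(packet_ids), "packets")

def dispatch_once():
    global last_fetch
    now = time.monotonic()
    if len(task_queue) >= BATCH_SIZE or (task_queue and now - last_fetch >= MAX_WAIT_SEC):
        batch = [task_queue.popleft() for _ in range(min(BATCH_SIZE, len(task_queue)))]
        launch_on_gpu(batch)
        last_fetch = now

# Simulate arrivals: a burst, then a remainder that only the timeout flushes.
task_queue.extend(range(1500))
dispatch_once()            # fires on the size threshold (1024 packets)
time.sleep(0.001)
dispatch_once()            # fires on the timeout with the remaining 476 packets
```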
Results: Throughput
[Chart: throughput of the CPU/GPU baseline vs. the new architecture, compared with the line-card rate, for deep packet inspection, packet classification, routing table lookup, and decrease-TTL]
Results: Packet Latency
[Chart: packet latency of the CPU/GPU baseline vs. the new architecture for deep packet inspection, packet classification, routing table lookup, and decrease-TTL]
Outline
Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion
High Performance Radar DSP Processor
Motivation
Feasibility of GPU for DSP processing
Designing a massively parallel DSP processor
Research Objectives
High performance DSP processor
For high-performance applications
• Radar, sonar, cellular baseband, …
Performance requirements
Throughput ≥ 800GFLOPs
Power Efficiency ≥ 100GFLOPS/W
Memory bandwidth ≥ 400Gbit/s
Scale to multi-chip solutions
Current DSP Platforms
| Processor | Frequency | # Cores | Throughput | Memory Bandwidth | Power | Power Efficiency (GFLOPS/W) |
|---|---|---|---|---|---|---|
| TI TMS320C6472-700 | 500MHz | 6 | 33.6 GMac/s | NA | 3.8W | 17.7 |
| FreeScale MSC8156 | 1GHz | 6 | 48 GMac/s | 1GB/s | 10W | 9.6 |
| ADI TigerSHARC ADSP-TS201S | 600MHz | 1 | 4.8 GMac/s | 38.4GB/s (on-chip) | 2.18W | 4.4 |
| PicoChip PC205 | 260MHz | 1 GPP + 248 DSPs | 31 GMac/s | NA | <5W | 12.4 |
| Intel Core i7 980XE | 3.3GHz | 6 | 107.5 GFLOPS | 31.8GB/s | 130W | 0.8 |
| Tilera Tile64 | 866MHz | 64 | 221 GFLOPS | 6.25GB/s | 22W | 10.0 |
| NVidia Fermi GPU | 1GHz | 512 | - | 230GB/s* | 200W | 7.7 |

* GDDR5: peak bandwidth 28.2GB/s
High Performance Radar DSP Processor
Motivation
Feasibility of GPU for DSP processing
Designing a massively parallel DSP processor
HPEC Challenge - Radar Benchmarks
| Benchmark | Description |
|---|---|
| TDFIR | Time-domain finite impulse response filtering |
| FDFIR | Frequency-domain finite impulse response filtering |
| CT | Corner turn or matrix transpose to place radar data into a contiguous row for efficient FFT |
| QR | QR factorization: prevalent in target recognition algorithms |
| SVD | Singular value decomposition: produces a basis for the matrix as well as the rank for reducing interference |
| CFAR | Constant false-alarm rate detection: find targets in an environment with varying background noise |
| GA | Graph optimization via genetic algorithm: removing uncorrelated data relations |
| PM | Pattern matching: identify stored tracks that match a target |
| DB | Database operations to store and query target tracks |
GPU Implementation
| Benchmark | Description |
|---|---|
| TDFIR | Loops of multiplication and accumulation (MAC) |
| FDFIR | FFT followed by MAC loops |
| CT | GPU based matrix transpose, extremely efficient |
| QR | Pipeline of CPU + GPU, Fast Givens algorithm |
| SVD | Based on QR factorization and fast matrix multiplication |
| CFAR | Accumulation of neighboring vector elements |
| GA | Parallel random number generator and inter-thread communication |
| PM | Vector level parallelism |
| DB | Binary tree operations, hard for GPU implementation |
Performance Results
[Table: CPU vs. GPU throughput and the resulting GPU speedup for every benchmark kernel (TDFIR, FDFIR, CT, PM, CFAR, GA, QR, SVD, DB) over its HPEC data sets; measured speedups range from about 1.1X to 114X]
* The throughputs of CT and DB are measured in Mbytes/s and Transactions/s, respectively; all other kernels are measured in GFLOPS.
Performance Comparison
GPU: NVIDIA Fermi, CPU: Intel Core 2 Duo (3.33GHz), DSP: ADI TigerSHARC 101
[Chart, log scale: throughput of the CPU, GPU, and DSP on TDFIR, FDFIR, CFAR, QR, and SVD across data sets 1-4]
Instruction Profiling
Thread Profiling
Warp occupancy: number of active threads in an issued warp
32 threads per warp
Off-Chip Memory Profiling
DRAM efficiency: the percentage of the total memory service time spent actually transferring data across the DRAM pins.
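Restated as a formula (the same definition, just written out):

\[
\text{DRAM efficiency} = \frac{t_{\text{data transfer}}}{t_{\text{memory service}}} \times 100\%
\]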
Limitation
The GPU suffers from low power efficiency (MFLOPS/W)
High Performance Radar DSP Processor
Motivation
Feasibility of GPU for DSP processing
Designing a massively parallel DSP processor
Key Idea - Hardware Architecture
Borrow the GPU microarchitecture
Using a DSP core as the basic execution unit
Multiprocessors organized in programmable pipelines
Neighboring multiprocessors can be merged as wider datapaths
Key Idea – Parallel Code Generation
Meta-programming based parallel code generation
Foundation technologies
GPU meta-programming frameworks
• Copperhead (UC Berkeley) and PyCUDA (NY University)
DSP code generation framework
• Spiral (Carnegie Mellon University)
Flow: DSP scripting → DSP code generation → source optimization → compile at runtime → run on DSP
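A toy illustration of this flow, assuming the "script" is just a tap count that is expanded into specialized kernel source for runtime compilation; the names and structure are hypothetical, not the actual framework.

```python
# Expand a tiny DSP "script" (a tap count) into a fully unrolled FIR kernel,
# which a runtime compiler (e.g., PyCUDA's SourceModule) could then build.
def generate_fir_kernel(num_taps):
    macs = "\n".join(
        "        acc += x[i + %d] * h[%d];" % (k, k) for k in range(num_taps))
    return ("__global__ void fir%d(const float *x, const float *h, float *y, int n)\n"
            "{\n"
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
            "    if (i < n) {\n"
            "        float acc = 0.0f;\n"
            "%s\n"
            "        y[i] = acc;\n"
            "    }\n"
            "}\n" % (num_taps, macs))

print(generate_fir_kernel(4))   # specialized 4-tap FIR generated at run time
```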
Key Idea – Internal Representation as KPN
Kahn Process Network (KPN)
A generic model for concurrent computation
Solid theoretic foundation
• Process algebra
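A minimal illustration of the KPN model using plain Python threads and FIFOs; it shows the concurrency model only, not the processor's actual runtime.

```python
# Processes communicate only through FIFO channels and block on reads,
# which is the essence of a Kahn Process Network.
from queue import Queue
from threading import Thread

def producer(out_ch, n):
    for i in range(n):
        out_ch.put(i)            # emit tokens
    out_ch.put(None)             # end-of-stream marker

def scale(in_ch, out_ch, k):
    while True:
        tok = in_ch.get()        # blocking read: the only synchronization
        out_ch.put(None if tok is None else tok * k)
        if tok is None:
            return

def consumer(in_ch):
    while True:
        tok = in_ch.get()
        if tok is None:
            return
        print("got", tok)

a, b = Queue(), Queue()
threads = [Thread(target=producer, args=(a, 5)),
           Thread(target=scale,    args=(a, b, 10)),
           Thread(target=consumer, args=(b,))]
for t in threads: t.start()
for t in threads: t.join()
```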
Scheduling and Optimization on KPN
Automatic task and thread scheduling and mapping
Extract data parallelism through process splitting
Latency and throughput aware scheduling
Performance estimation based on analytical models
[Analytical model relating the per-process execution times T_1, T_2, ..., T_i to the total time T_total]
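As one common analytical form for a linear pipeline of processes with per-stage times T_i (an assumption for illustration, not necessarily the exact model used here):

\[
T_{\text{total}} \approx \sum_{i} T_i, \qquad \text{Throughput} \approx \frac{1}{\max_i T_i}
\]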
Key Idea - Low Power Techniques
GPU-like processors are power hungry!
Potential low power techniques
Aggressive memory coalescing
Enable task-pipeline to avoid synchronization via global memory
Operation chaining to avoid extra memory accesses
???
[Figure: a DRAM line read from a DRAM chip in which only part of the line is actually used]
Outline
Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion
Conclusion
A new market of high performance embedded computing is emerging
Multi-core engines will be the workhorses
Need both HW and SW research
Case study 1: GPU based Internet routing
Case study 2: Massively parallel DSP processor
Significant performance improvements
More work ahead
• Low power, scheduling, parallel programming model, legacy code, …