Yangdong Steve Deng 邓仰东 dengyd@tsinghua.edu.cn
Tsinghua University
Outline
Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion
High Performance Embedded Computing
Future IT infrastructure demands even higher computing power
Core Internet router throughput: up to 90Tbps
4G wireless base station: 1Gbit/s data rate per customer and up to 200 subscribers in service area
CMU driverless car: 270 GFLOPS (giga floating-point operations per second)
…
Fast Increasing IC Costs
Fabrication Cost
Moore’s Second Law: The cost of doubling circuit density increases in line with Moore's First Law.
Design Cost
Now $20-50M per product
Will reach $75-120M at the 32nm node
The 4-year development of the Cell processor by Sony, IBM, and Toshiba cost over $400M.
Implications of the Prohibitive Cost
ASICs would be unaffordable for many applications!
Scott MacGregor, CEO of Broadcom:
• “Broadcom is not intending a move to 45nm in the next year or so as it will be too expensive.”
David Turek, VP of IBM:
• “IBM will be pulling out of Cell development, with PowerXCell 8i to be the company’s last entrance in the technology.”
Multicore Machines Are Really Powerful!
| Manufacturer | Processor Type | Model | # Cores | GFLOPS FP64 | GFLOPS FP32 |
|---|---|---|---|---|---|
| AMD | GPGPU | FireStream 9270 | 160/800 | 240 | 1200 |
| AMD | GPU | Radeon HD 5870 | 320/1600 | 544 | 2720 |
| AMD | GPU | Radeon HD 5970 | 640/3200 | 928 | 4640 |
| AMD | CPU | Magny-Cours | 12 | 362.11 | 362.11 |
| Fujitsu | CPU | SPARC64 VII | 4 | 128 | 128 |
| Intel | CPU | Core 2 Extreme QX9775 | 4 | 51.2 | 51.2 |
| nVidia | GPU | Fermi 480 | 512 | 780 | 1560 |
| nVidia | GPGPU | Tesla C1060 | 240 | 77.76 | 933.12 |
| nVidia | GPGPU | Tesla C2050 | 448 | 515.2 | 1288 |
| Tilera | CPU | TilePro | 64 | 166 | 166 |

GPU: Graphics Processing Unit; GPGPU: General Purpose GPU
[Images: AMD 12-core CPU, Tilera Tile Gx100 CPU, NVidia Fermi GPU]
Implications
An increasing number of applications will be implemented on multi-core devices
Huawei: multi-core base stations
Intel: cluster based Internet routers
IBM: signal processing and radar applications on Cell processor
…
Also meets the strong demand for customizability and extensibility
Outline
Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion
Software Routing with GPU
Background and motivation
GPU based routing processing
Routing table lookup
Packet classification
Deep packet inspection
GPU microarchitecture enhancement
CPU and GPU integration
QoS-aware scheduling
Ever-Increasing Internet Traffic
Fast Changing Network Protocols/Services
New services are rapidly appearing
Data-center, Ethernet forwarding, virtual LAN, …
Personal customization is often essential for QoS
However, today’s Internet heavily depends on two protocols
Ethernet and IPv4, both developed in the 1970s!
Internet Router
[Diagram: packet streams flowing into and out of a router]
Backbone network device
Packet forwarding and path finding
Connect multiple subnets
Key requirements
• High throughput: 40G-90Tbps
• High flexibility
[Photo: Cisco GSR 12416, 19" x 6ft x 2ft; capacity: 160Gb/s; power: 4.2kW]
Current Router Solutions
Hardware routers
Fast
Long design time
Expensive
Hard to maintain
Network processor based router
Network processor: data parallel packet processor
No good programming models
Software routers
Extremely flexible
Low cost
But slow
Outline
Background and motivation
GPU based routing processing
Routing table lookup
Packet classification
Deep packet inspection
GPU microarchitecture enhancement
CPU and GPU integration
QoS-aware scheduling
Critical Path of Routing Processing
[Diagram: the packet processing pipeline. Header processing performs packet classification against a rule set and IP address lookup against the routing table, then updates the header; deep packet inspection examines the payload; packets are queued in buffer memory before the switch fabric]
GPU Based Software Router
Data level parallelism = packet level parallelism
[System diagram: four CPU cores and two NICs (each on a PCIe 4-lane link) plus a graphics card (GPU and GPU memory on a PCIe 16-lane link), connected through the north bridge (memory controller), the front side bus (FSB), and the memory bus to main memory]
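As a toy illustration of one-packet-per-thread processing, here is a minimal PyCUDA sketch; the 20-byte header layout and the TTL offset (byte 8) are assumptions made for this example, not the actual router code.

```python
# Minimal sketch of one-thread-per-packet processing with PyCUDA.
import numpy as np
import pycuda.autoinit                      # creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule(r"""
__global__ void decrement_ttl(unsigned char *headers, int n_packets)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread = one packet
    if (i < n_packets) {
        unsigned char *ttl = headers + i * 20 + 8;   // assumed TTL byte of packet i
        if (*ttl > 0) (*ttl)--;
    }
}
""")
decrement_ttl = mod.get_function("decrement_ttl")

n = 4096
headers = np.full((n, 20), 64, dtype=np.uint8)       # fake packet headers
decrement_ttl(drv.InOut(headers), np.int32(n),
              block=(256, 1, 1), grid=((n + 255) // 256, 1))
print(headers[0, 8])                                  # 63 after one pass
```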
Routing Table Lookup
Routing table contains network topology information
Find the output port according to destination IP address
Potentially large routing table (~1M entries)
• Can be updated dynamically
An example routing table:

| Destination Address Prefix | Next-Hop | Output Port |
|---|---|---|
| 24.30.32/20 | 192.41.177.148 | 2 |
| 24.30.32.160/28 | 192.41.177.3 | 6 |
| 208.12.32/20 | 192.41.177.196 | 1 |
| 208.12.32.111/32 | 192.41.177.195 | 5 |
Routing Table Lookup
Longest prefix match
Memory bound
Usually based on a trie data structure
• Trie: a prefix tree with strings as keys
• A node’s position directly reflects its key
• Pointer operations
• Widely divergent branches!
[Figure: the example prefixes (24.30.32/20, 24.30.32.160/28, 208.12.32/20, 208.12.32.111/32) organized as a search trie; each prefix ends at a numbered trie node that stores its next hop and output port]
GPU Based Routing Table Lookup
Organize the search trie into an array
Pointers converted to offsets relative to the array head
6X speedup even with frequent routing table updates
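A CPU-side sketch of the array-based trie idea; the node layout ([left, right, port]) and helper names are assumptions for illustration, not the actual GPU data structure.

```python
# Each trie node is a fixed-size record in a flat array; child "pointers" are
# integer offsets, which is what makes the structure easy to copy to GPU memory.
def build_trie(prefixes):
    nodes = [[-1, -1, -1]]                  # nodes[i] = [left, right, port]; -1 = absent
    for prefix, length, port in prefixes:
        cur = 0
        for bit_pos in range(31, 31 - length, -1):
            bit = (prefix >> bit_pos) & 1
            if nodes[cur][bit] == -1:
                nodes.append([-1, -1, -1])
                nodes[cur][bit] = len(nodes) - 1
            cur = nodes[cur][bit]
        nodes[cur][2] = port
    return nodes

def lookup(nodes, addr):                    # longest prefix match by walking offsets
    cur, best = 0, -1
    for bit_pos in range(31, -1, -1):
        if nodes[cur][2] != -1:
            best = nodes[cur][2]            # remember the longest match seen so far
        nxt = nodes[cur][(addr >> bit_pos) & 1]
        if nxt == -1:
            return best
        cur = nxt
    if nodes[cur][2] != -1:
        best = nodes[cur][2]
    return best

def ip(s):
    a, b, c, d = (int(x) for x in s.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

# Prefixes from the example routing table: (prefix, length, output port)
table = [(ip('24.30.32.0'), 20, 2), (ip('24.30.32.160'), 28, 6),
         (ip('208.12.32.0'), 20, 1), (ip('208.12.32.111'), 32, 5)]
nodes = build_trie(table)
print(lookup(nodes, ip('24.30.32.170')))    # 6: the /28 wins over the /20
```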
Packet Classification
Match header fields with predefined rules
Size of rule sets can be huge (e.g., over 5,000 rules)
| Rule | Example |
|---|---|
| Priority | Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority |
| Packet filtering | Deny all traffic from ISP3 destined to 166.111.66.77 |
| Traffic rate limit | Ensure ISP2 does not inject more than 10Mbps email traffic on interface 2 |
| Accounting & billing | Treat video traffic to 166.111.X.X as highest priority and perform accounting |
Packet Classification
Hardware solution
Usually with Ternary CAM (TCAM)
• Expensive and power hungry
Software solutions
Linear search
Hash based
Tuple space search
• Convert the rules into a set of exact matches
GPU Based Packet Classification
A linear search approach
Scale to rule sets with 20,000 rules
Meta-programming
Compile rules into CUDA code with PyCUDA
Example: "Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority" compiles to:
if ((DA >= 166.111.66.70) && (DA <= 166.111.66.77)) priority = 0;
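A sketch of this meta-programming step in Python; the kernel skeleton and field names are illustrative assumptions, not the actual generator.

```python
# Each classification rule becomes a CUDA if-statement; the generated source
# could then be compiled at run time (e.g., with PyCUDA's SourceModule).
def ip(s):
    a, b, c, d = (int(x) for x in s.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

def compile_rules(rules):
    body = "\n".join(
        "    if (dst >= %du && dst <= %du && %d < priority) priority = %d;"
        % (ip(lo), ip(hi), prio, prio)
        for lo, hi, prio in rules)
    return ("__global__ void classify(const unsigned int *dst_addr,\n"
            "                         int *out_priority, int n)\n"
            "{\n"
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
            "    if (i >= n) return;\n"
            "    unsigned int dst = dst_addr[i];\n"
            "    int priority = 255;\n"
            "%s\n"
            "    out_priority[i] = priority;\n"
            "}\n" % body)

rules = [("166.111.66.70", "166.111.66.77", 0)]   # the rule from the slide
print(compile_rules(rules))
```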
GPU Based Packet Classification
~60X speedup
[Chart, log scale: classification throughput of CPU_DBS vs. GPU_DBS across the rule sets]
Deep Packet Inspection (DPI)
Core component for network intrusion detection
Against viruses, spam, software vulnerabilities, …
Example rule: alert tcp $EXTERNAL_NET 27374 -> $HOME_NET any (msg:"BACKDOOR subseven 22"; flags: A+; content: "|0d0a5b52504c5d3030320d0a|";)
[Diagram: the Snort pipeline. Sniffed packet stream → packet decoder → preprocessor (plug-ins) → detection engine (plug-ins) with fixed string matching → output stage (plug-ins) producing alerts/logs]
GPU Based Deep Packet Inspection (DPI)
Fixed string match
Each rule is just a string that is disallowed
Bloom-filter based search
One warp for a packet and one thread for a string
Throughput: 19.2Gbps (30X speed-up over SNORT)
[Figure: Bloom filter construction. The bit vector starts at all zeros; pre-processing hashes each rule string (r1, r2, ...) with three hash functions and sets the corresponding bits; packet content windows (s1, s2, ...) are then hashed with the same functions and checked against the Bloom vector]
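A CPU-side sketch of the per-window Bloom-filter check; the hash functions, vector size, and patterns here are illustrative assumptions, not the ones used in the actual design.

```python
# A Bloom hit only *suggests* a match (false positives are possible) and would
# trigger exact matching; a miss safely rules the window out.
import hashlib

M = 1 << 16                     # bits in the Bloom vector
K = 3                           # number of hash functions

def hashes(data: bytes):
    for seed in range(K):
        h = hashlib.md5(bytes([seed]) + data).digest()
        yield int.from_bytes(h[:4], "little") % M

def add(bloom, pattern: bytes):
    for h in hashes(pattern):
        bloom[h] = 1

def maybe_contains(bloom, window: bytes):
    return all(bloom[h] for h in hashes(window))

patterns = [b"/etc/passwd", b"cmd.exe"]        # fixed strings from the rule set
bloom = bytearray(M)
for p in patterns:
    add(bloom, p)

payload = b"GET /etc/passwd HTTP/1.0"
for plen in {len(p) for p in patterns}:        # slide a window per pattern length
    for i in range(len(payload) - plen + 1):
        if maybe_contains(bloom, payload[i:i + plen]):
            print("possible match at offset", i, payload[i:i + plen])
```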
GPU Based Deep Packet Inspection (DPI)
Regular expression matching
Each rule is a regular expression
• e.g., a|b* = {ε, a, b, bb, bbb, ...}
Aho-Corasick Algorithm
• Converts patterns into a finite state machine
• Matching is done by state traversal
Memory bound
• Virtually no computation
Compress the state table
• Merging don’t-cared entries
Throughput: 9.3Gbps
15X speed-up over SNORT
Example: P={he, she, his, hers}
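A minimal CPU-side Aho-Corasick sketch for this example pattern set; it is a reference for the state-traversal idea, not the GPU implementation or its compressed state table.

```python
# Build the pattern trie, add failure links by BFS, then match by walking states.
from collections import deque

def build(patterns):
    goto, out, fail = [{}], [set()], [0]
    for p in patterns:                      # trie of patterns
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); out.append(set()); fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())             # BFS to fill in failure links
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, out, fail

def search(text, goto, out, fail):
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            yield i - len(p) + 1, p         # (start offset, matched pattern)

automaton = build(["he", "she", "his", "hers"])
print(sorted(search("ushers", *automaton)))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```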
Outline
Background and motivation
GPU based routing processing
Routing table lookup
Packet classification
Deep packet inspection
GPU microarchitecture enhancement
CPU and GPU integration
QoS-aware scheduling
Limitation of GPU-Based Packet Processing
CPU-GPU communication overhead
No QoS guarantee
[System diagram: the same multi-core CPU, NIC, and discrete GPU organization as before, with a packet queue maintained in main memory]
Microarchitectural Enhancements
CPU-GPU integration with a shared memory
Maintain current CUDA interface
Implemented on GPGPU-Sim *
[Block diagram of the proposed NPGPU: NIC, task FIFO, CPU, GPU, delayed commit queue, and a CPU/GPU shared memory]
*A. Bakhoda, et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS, 2009.
Microarchitectural Enhancements
Uniformly one thread for one packet
No thread block necessary
Directly schedule and issue warps
GPU fetches packet IDs from task queue when
Either a sufficient number of packets are already collected
Or a given interval passes after the last fetch (see the sketch below)
[Diagram: a CPU-maintained task queue feeding six GPU cores and a delayed commit queue]
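A host-side sketch of this batching policy; the batch size and timeout values are illustrative assumptions.

```python
# Launch a GPU batch when either enough packet IDs have accumulated
# or a timeout has expired since the last fetch.
import time
from collections import deque

BATCH_SIZE   = 1024        # "sufficient number of packets"
MAX_WAIT_SEC = 0.0005      # "given interval after last fetch"

task_queue = deque()       # packet IDs written by the CPU/NIC side
last_fetch = time.monotonic()

def launch_on_gpu(packet_ids):
    # Placeholder for the real kernel launch (one thread per packet).
    print("launching batch of", len(packet_ids), "packets")

def dispatch_once():
    global last_fetch
    now = time.monotonic()
    if len(task_queue) >= BATCH_SIZE or (task_queue and now - last_fetch >= MAX_WAIT_SEC):
        batch = [task_queue.popleft() for _ in range(min(BATCH_SIZE, len(task_queue)))]
        launch_on_gpu(batch)
        last_fetch = now

# Simulate arrivals: a burst, then a remainder that only the timeout flushes.
task_queue.extend(range(1500))
dispatch_once()            # fires on the size threshold (1024 packets)
time.sleep(0.001)
dispatch_once()            # fires on the timeout with the remaining 476 packets
```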
Results: Throughput
[Chart: throughput of the CPU/GPU baseline vs. the new architecture, compared with the line-card rate, for deep packet inspection, packet classification, routing table lookup, and decrease-TTL]
Results: Packet Latency
[Chart: packet latency of the CPU/GPU baseline vs. the new architecture for deep packet inspection, packet classification, routing table lookup, and decrease-TTL]
Outline
Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion
High Performance Radar DSP Processor
Motivation
Feasibility of GPU for DSP processing
Designing a massively parallel DSP processor
Research Objectives
High performance DSP processor
For high-performance applications
• Radar, sonar, cellular baseband, …
Performance requirements
Throughput ≥ 800GFLOPs
Power Efficiency ≥ 100GFLOPS/W
Memory bandwidth ≥ 400Gbit/s
Scale to multi-chip solutions
Current DSP Platforms
| Processor | Frequency | # Cores | Throughput | Memory Bandwidth | Power | Power Efficiency (GFLOPS/W) |
|---|---|---|---|---|---|---|
| TI TMS320C6472-700 | 500MHz | 6 | 33.6 GMac/s | NA | 3.8W | 17.7 |
| FreeScale MSC8156 | 1GHz | 6 | 48 GMac/s | 1GB/s | 10W | 9.6 |
| ADI TigerSHARC ADSP-TS201S | 600MHz | 1 | 4.8 GMac/s | 38.4GB/s (on-chip) | 2.18W | 4.4 |
| PicoChip PC205 | 260MHz | 1 GPP + 248 DSPs | 31 GMac/s | NA | <5W | 12.4 |
| Intel Core i7 980XE | 3.3GHz | 6 | 107.5 GFLOPS | 31.8GB/s | 130W | 0.8 |
| Tilera Tile64 | 866MHz | 64 | 221 GFLOPS | 6.25GB/s | 22W | 10.0 |
| NVidia Fermi GPU | 1GHz | 512 | - | 230GB/s* | 200W | 7.7 |

* GDDR5: peak bandwidth 28.2GB/s
High Performance Radar DSP Processor
Motivation
Feasibility of GPU for DSP processing
Designing a massively parallel DSP processor
HPEC Challenge - Radar Benchmarks
| Benchmark | Description |
|---|---|
| TDFIR | Time-domain finite impulse response filtering |
| FDFIR | Frequency-domain finite impulse response filtering |
| CT | Corner turn or matrix transpose to place radar data into a contiguous row for efficient FFT |
| QR | QR factorization: prevalent in target recognition algorithms |
| SVD | Singular value decomposition: produces a basis for the matrix as well as the rank for reducing interference |
| CFAR | Constant false-alarm rate detection: find targets in an environment with varying background noise |
| GA | Graph optimization via genetic algorithm: removing uncorrelated data relations |
| PM | Pattern matching: identify stored tracks that match a target |
| DB | Database operations to store and query target tracks |
GPU Implementation
| Benchmark | Description |
|---|---|
| TDFIR | Loops of multiplication and accumulation (MAC) |
| FDFIR | FFT followed by MAC loops |
| CT | GPU based matrix transpose, extremely efficient |
| QR | Pipeline of CPU + GPU, Fast Givens algorithm |
| SVD | Based on QR factorization and fast matrix multiplication |
| CFAR | Accumulation of neighboring vector elements |
| GA | Parallel random number generator and inter-thread communication |
| PM | Vector level parallelism |
| DB | Binary tree operations, hard for GPU implementation |
Performance Results
[Table: CPU vs. GPU throughput and the resulting GPU speedup for every benchmark kernel (TDFIR, FDFIR, CT, PM, CFAR, GA, QR, SVD, DB) over its HPEC data sets; measured speedups range from about 1.1X to 114X]
* The throughputs of CT and DB are measured in Mbytes/s and Transactions/s, respectively; all other kernels are measured in GFLOPS.
Performance Comparison
GPU: NVIDIA Fermi, CPU: Intel Core 2 Duo (3.33GHz), DSP: ADI TigerSHARC 101
[Chart, log scale: throughput of the CPU, GPU, and DSP on TDFIR, FDFIR, CFAR, QR, and SVD across data sets 1-4]
Instruction Profiling
Thread Profiling
Warp occupancy: number of active threads in an issued warp
32 threads per warp
Off-Chip Memory Profiling
DRAM efficiency: the percentage of the total memory service time spent actually transferring data across the DRAM pins.
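Restated as a formula (the same definition, just written out):

\[
\text{DRAM efficiency} = \frac{t_{\text{data transfer}}}{t_{\text{memory service}}} \times 100\%
\]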
Limitation
The GPU suffers from low power efficiency (MFLOPS/W)
High Performance Radar DSP Processor
Motivation
Feasibility of GPU for DSP processing
Designing a massively parallel DSP processor
Key Idea - Hardware Architecture
Borrow the GPU microarchitecture
Using a DSP core as the basic execution unit
Multiprocessors organized in programmable pipelines
Neighboring multiprocessors can be merged as wider datapaths
Key Idea – Parallel Code Generation
Meta-programming based parallel code generation
Foundation technologies
GPU meta-programming frameworks
• Copperhead (UC Berkeley) and PyCUDA (NY University)
DSP code generation framework
• Spiral (Carnegie Mellon University)
Flow: DSP scripting → DSP code generation → source optimization → compile at runtime → run on DSP
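A toy illustration of this flow, assuming the "script" is just a tap count that is expanded into specialized kernel source for runtime compilation; the names and structure are hypothetical, not the actual framework.

```python
# Expand a tiny DSP "script" (a tap count) into a fully unrolled FIR kernel,
# which a runtime compiler (e.g., PyCUDA's SourceModule) could then build.
def generate_fir_kernel(num_taps):
    macs = "\n".join(
        "        acc += x[i + %d] * h[%d];" % (k, k) for k in range(num_taps))
    return ("__global__ void fir%d(const float *x, const float *h, float *y, int n)\n"
            "{\n"
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
            "    if (i < n) {\n"
            "        float acc = 0.0f;\n"
            "%s\n"
            "        y[i] = acc;\n"
            "    }\n"
            "}\n" % (num_taps, macs))

print(generate_fir_kernel(4))   # specialized 4-tap FIR generated at run time
```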
Key Idea – Internal Representation as KPN
Kahn Process Network (KPN)
A generic model for concurrent computation
Solid theoretic foundation
• Process algebra
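A minimal illustration of the KPN model using plain Python threads and FIFOs; it shows the concurrency model only, not the processor's actual runtime.

```python
# Processes communicate only through FIFO channels and block on reads,
# which is the essence of a Kahn Process Network.
from queue import Queue
from threading import Thread

def producer(out_ch, n):
    for i in range(n):
        out_ch.put(i)            # emit tokens
    out_ch.put(None)             # end-of-stream marker

def scale(in_ch, out_ch, k):
    while True:
        tok = in_ch.get()        # blocking read: the only synchronization
        out_ch.put(None if tok is None else tok * k)
        if tok is None:
            return

def consumer(in_ch):
    while True:
        tok = in_ch.get()
        if tok is None:
            return
        print("got", tok)

a, b = Queue(), Queue()
threads = [Thread(target=producer, args=(a, 5)),
           Thread(target=scale,    args=(a, b, 10)),
           Thread(target=consumer, args=(b,))]
for t in threads: t.start()
for t in threads: t.join()
```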
Scheduling and Optimization on KPN
Automatic task and thread scheduling and mapping
Extract data parallelism through process splitting
Latency and throughput aware scheduling
Performance estimation based on analytical models
[Analytical model relating the per-process execution times T_1, T_2, ..., T_i to the total time T_total]
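As one common analytical form for a linear pipeline of processes with per-stage times T_i (an assumption for illustration, not necessarily the exact model used here):

\[
T_{\text{total}} \approx \sum_{i} T_i, \qquad \text{Throughput} \approx \frac{1}{\max_i T_i}
\]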
Key Idea - Low Power Techniques
GPU-like processors are power hungry!
Potential low power techniques
Aggressive memory coalescing
Enable task-pipeline to avoid synchronization via global memory
Operation chaining to avoid extra memory accesses
???
[Figure: a DRAM line read from a DRAM chip in which only part of the line is actually used]
Outline
Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion
Conclusion
A new market of high performance embedded computing is emerging
Multi-core engines will be the workhorses
Need both HW and SW research
Case study 1: GPU based Internet routing
Case study 2: Massively parallel DSP processor
Significant performance improvements
More work ahead
• Low power, scheduling, parallel programming model, legacy code, …