ppt

advertisement
High-Performance
Hardware Monitors to
Protect Network Processors
from Data Plane Attacks
Kekai Hu, Harikrishnan Chandrikakutty,
Deepak Unnikrishnan, Tilman Wolf,
and Russell Tessier
Department of Electrical and Computer Engineering
University of Massachusetts Amherst
Department of Electrical and Computer Engineering
Outline
 The problem: software attacks on network
routers
• Routers now include programmable processors




Our solution: include a monitor for the processor
Software to generate the monitor information
Implementation on DE4 NetFPGA board
Experimental results
Russell Tessier
2
Computer Networks
 Networks provide
connectivity between endsystems
 Success of the Internet:
hourglass architecture
 Success is also a problem:
diverse apps, diverse
systems
 Changing requirements for
the network layer
Layered protocol
stack
Example protocols
HTTP
DNS
Application layer
TLS/SSL
Transport layer
BGP
...
UDP
Network layer
...
SIP
TCP
IP
Ethernet
...
Link layer
DSL
FDDI
1000BASE-T
Physical layer
...
RS-232
802.11a/b/g/n SONET/SDH
Russell Tessier
3
Network Architecture
 Extensions to current
Internet
 Customization of data
plane
 Requires router systems
with ability to adapt
Access router:
- Access concentration
(cable, DSL, wireless)
- Network address translation
- Policy-based QoS
- Monitoring and billing
- Firewall
`
End system:
- IP security
- TCP termination
Russell Tessier
Core router:
- Multiprotocol label switching
- QoS aware routing
- Monitoring
Edge router:
- Packet classification
- QoS (DiffServ)
- monitoring and billing
Server:
- Content-based
switching
- Firewall
- SSL termination
- IP security
4
Programmable Router
 Network processor
• General-purpose processing
capability in data path
• Packet processing in software
Router
Port
packets
 High-performance
processing hardware

Example: 40-core
network processors
 Key challenge:
• Security due to
software processing
Russell Tessier
Switching
Fabric
Port
Network Processor
Network Interface
• Scalability through high
levels of parallelism
Port
Processing Processing
Engine
Engine
Port
Port
Interconnect
Processing Processing
Engine
Engine
I/O
Protocol
processing
on packet
processor
5
Exploit of Vulnerable Network Processor
 Vulnerability in network
processor can be exploited
 “In-network” denial of
service attack (new type of
attack)
 Key questions
• Can such vulnerabilities occur
in packet processing code?
(Yes, we show one example.)
• Can vulnerabilities be
exploited?
(Yes, for von-Neumann and for
Harvard architectures)
Russell Tessier
Packet
forwarding
software
Malicious
packet
Vulnerable
packet
processor
In-network
denial of
service
attack
Unprotected
software-based
router
6
Header Insertion Exploit
 Stack smashing during header insertion
• Control flow can be changed to attack code
Low address
.
.
.
New pkt buffer
Current
frame ptr
 Attack code: infinite transmission loop
Local var
Previous frame ptr
Return address
Current
frame
(generate
CM header)
• Devastating denial-of-service attack
 Prototype implementation
•
New
stack
pointer
Custom network processor on NetFPGA
 Harvard arch.: possible, but less exciting
Russell Tessier
.
.
.
Original packet hdr
Stack
growth
Packet payload
(attack code)
High address
7
Attack model
Original packet
• Congestion management
protocol
Link
Hdr
IP
Hdr
UDP
Hdr
payload
CM
application
– Inserts a custom header
New packet
Link
Hdr
IP
Hdr
CM
Hdr
len1
UDP
Hdr
payload
len2
• Integer overflow vulnerability
– len1 = 12
– len2 = 65334
– sum = 10
8
Defense Mechanism: Hardware Monitor
 Change of software easy
• Just need matching
monitoring graph
Russell Tessier
DFA monitoring
graph
processing code
instruction memory
network
processor
core
mon. graph
mon. memory
hash of
processing
instruction
reset/
recovery
comparison
logic
hardware monitor
• Obtained from offline
analysis of binary
• Deviations trigger reset
NFA monitoring
graph
NFA-to-DFA
transformation
network processor
 Monitoring graph represents correct behavior
processing code
binary
offline analysis
• Core reports hash of
each executed instruction
runtime operation
 Hardware monitor
co-located with
each processor core
data memory
packet buffer
network interface
9
Offline Analysis of Processing Binary
 Executed instruction reported by core as 4-bit hash
• Hash combines address, opcode, registers
• Hash allows for compact representation of information
 Monitoring graph
• Each instruction represented as a state
• Edges correspond to execution of instruction
• Control-flow operations lead to multiple possible next states
[…]
49c:
4a0:
4a4:
4a8:
4ac:
4b0:
4b4:
4b8:
[…]
Russell Tessier
49c
97c20010
00000000
2c420033
1440000a
00000000
3c026666
34430191
97c20010
lhu
nop
sltiu
bnez
nop
lui
ori
lhu
v0,16(s8)
v0,v0,51
v0,4d4
v0,0x6666
v1,v0,0x191
v0,16(s8)
0
4a0
7
4a4
11
4a8
10
6
4ac
10
4b0
7
4b4
4d4
3
4b8
10
NFA-to-DFA conversion
 Problem: non-deterministic finite automaton (NFA)
• State may have two next states with same edge value due to hash
1
2
b
c
3
4
d
5
e
6
f
a
c
• Implementation would need to keep track of multiple states
 Solution: NFA-to-DFA conversion (powerset construction)
• Well-known algorithm [Hopcroft and Ullman, 1976]
f
1
2
b
{3,5}
c
4
d
5
e
6
f
a
• Deterministic finite automaton (DFA) requires only one state
Russell Tessier
11
Implementation of DFA Monitor
 Each state may have up to 16 next states
• Most states only have one or two next states
 Requirements
• Compact representation
• Fast processing of each hash value (single memory access)
 Idea: grouping of next states in contiguous memory
• Trick: determine offset into memory based on order of hash value
9
9
a
7
f
a
7
f
2
2
grouping
b
d
7
11
c 3
0
14
e 2
g
b
group 1
group 2
h
7
11
c 3
d
0
g
e
5
2
h
group 3
Russell Tessier
12
Implementation of DFA Monitor
 DFA monitor system
• Memory keeps all valid hash values in one bit vector
• Next state based on offset of group and position of hash value
reset/
recovery
1
hash
comparison
16
...
16
4
...
mult
group 1
0x0000
group 2
0x0002
group 3
0x0006
32
...
16
number of
offset in
next states state group
valid hash values on
outgoing edges
2
0
0000 0000 1000 0100
...
c
2
1
0000 1000 0000 1000
group base
address
b
1
1
0000 0000 1000 0000
f
1
0
0000 0010 0000 0000
e
3
0
0000 0000 0010 0101
d
...
...
...
f
1
0
0000 0010 0000 0000
h
...
...
...
g
...
...
...
group 16
4-bit hash
function
32
-1
a
4
processor
instruction
4
...
one-hot
encoding
4
add
k
(position of
matching hash
among valid
hash values)
group 1
group 2
group 3
state machine memory
Russell Tessier
13
Altera DE4 NetFPGA Infrastructure
Stratix IV
SOPC SYSTEM
1 GigE
1 GigE
PCIe
SSRAM
TSE MAC
DDR
RAM
TSE MAC
1 GigE
TSE MAC
1 GigE
TSE MAC
HOST COMPUTER
• Altera DE4 board
• 5x resources compared to
NetFPGA 1G
• PCI Express, 8GB DDR2
DE4 BOARD
CUSTOM INTERFACE
 Open source port of
NetFPGA 1G [1]
 Networking research
platform
REFERENCE ROUTER
OR
PACKET GENERATOR
 Configurable designs
• Reference router
• Packet generator
Russell Tessier
JTAG
MASTER
BRIDGE
JTAG
14
Altera DE4 NetFPGA reference router
 Complete IPV4 router
• Forward packets on all four 1Gbps
interfaces
MAC
RxQ
CPU
RxQ
MAC
RxQ
CPU
RxQ
MAC
RxQ
CPU
RxQ
MAC
RxQ
CPU
RxQ
CPU
TxQ
MAC
TxQ
CPU
TxQ
 Design components
•
•
•
•
Input Arbiter
Input queues
Input arbiter
Output port lookup
Output queues
 Prototype system replaces output
port lookup module
Russell Tessier
Output port lookup
Output Queues
MAC
TxQ
CPU
TxQ
MAC
TxQ
CPU
TxQ
MAC
TxQ
15
Single-core system architecture
 Single-core network processor system integrated along with Altera
DE4 NetFPGA reference router pipeline
Russell Tessier
16
Harvard memory architecture
• Separate physical memory space for instruction and
data
– General memory error techniques for code-injection attacks
will not be successful
– Attacks need to be generated with existing library functions
Instuction
Memory
Instructions
Data Memory
Address
Instruction
Address
CPU
Data
Data
Stack
17
Experimental Setup
 Altera DE4 reference router with prototype system
 Altera DE4 packet generator
• Generate packets
• Capture forwarded packets
 Evaluation metrics
• Attack detection
• Throughput
• Resource utilization
Russell Tessier
18
Offline analysis
 MIPS-GCC compiler generates binary
and branch information
 Powerset construction to perform NFAto-DFA transformation.
 Memory initialization file is loaded into
memory
Benchmark
Source code
MIPS-GCC
compiler
Instruction
Info
Branch
Info
NFA to DFA conversion
module
Instruction
Info
Branch
Info
Memory generator module
Memory initialization file
Russell Tessier
19
Evaluation
 Monitoring speed
• Single memory
access
• Lookup into fixedsize register file
 Memory size of
monitor
• More states due to NFA-to-DFA conversion
• More states due to multiple entries in memory for certain states
• In practice, overhead is below 10%
 Very fast and compact hardware monitor
Russell Tessier
20
Benchmarks
• NpBench [1]
– Modern network applications
– Three specific functional groups
• Traffic management and quality of service group
• Security and media processing group
• Packet processing group
[1] B.K. Lee and L.K. John, "NpBench: a benchmark suite for control plane and data
plane applications for network processors," in Proc of 21st International Conference
on Computer Design, vol., no., pp. 226- 233, 13-15 Oct. 2003.
21
Prototype Implementation on FPGA
 Small overhead compared to processor core (4096 states):
 Correct operation: attack packet detected and dropped
Russell Tessier
22
Attack with Defense in Place
 Attack packet dropped, router continues to operate
Russell Tessier
23
Throughput
 Throughput performance of the network processor with security
monitor
CM protocol and IPV4 application
Throughput in Mbps
•
50
40
30
Normal Packet
Packet ratio 100:1
Packet ratio 25:1
Packet ratio 10:1
20
10
0
64
256
Packet size in bytes
Russell Tessier
24
24
Multicore Monitor
 Dynamic workloads pose problem for
hardware monitor
• Processing may differ between packets
• Monitors need to match processing
core
core
core
monitor
monitor
monitor
 Mapping between processors and monitors
core
• 1-to-1 mapping requires frequent reload of monitor
• Any-to-any mapping costly to implement
monitor
• Clusters with n-to-m mapping provide balance
 Interconnect is configured dynamically depending on workload
• Mapping between core and
monitor
Russell Tessier
monitor
core
...
...
core
monitor
core
...
... monitor
core
monitor
core
core
...
... monitor
core
...
monitor
... monitor
monitor
monitor
... monitor
25
System Architecture of Clustered System
 Multiple cores can access multiple monitors
• Dynamic configuration of crossbar
 Secure loading of monitors through external interface
External
Memory
Network
Interface
Inter-core
Interconnect
Control
Processor
n Processors
Control
Signals
Proc
Proc
...
Proc
Proc
Proc
...
Proc
...
Proc
Crossbar
Crossbar
Proc
...
Proc
Crossbar
m Monitors
Mon
External
Interface
Russell Tessier
Centralized
Monitor
Memory
...
Mon
Mon
Mon
...
Mon
...
Mon
Mon
...
Mon
...
AES
Mon
26
Cluster Design
 Simple implementation of clustered monitor
• Dynamic configuration through programming of demultiplexers
Reset/Recover
NP Core
Monitor
Select
1
NP Core
NP Core
NP Core
Hash
Hash
Hash
32
3
Hash
R/R from
Monitor_1 to 6
4
Hash_2
Hash_1
From
Hash_1 to 4
Hash_3
Hash_4
Reset/
Recover
4
Proc
Select
1
2
Hash
4
Monitor_1
Russell Tessier
Monitor_2
Monitor_3
Monitor_4
Monitor_5
Monitor_6
27
Dual-Ported Monitor Implementation
 Memory of monitor can be shared between two monitors
• Effective use of dual-ported memory
• Two monitoring graphs can be used in parallel
Reset/Recover
1
Monitor 2
Valid Hashes
Processor
Instruction
Next State
14
Hash
Comparison
1
Monitoring Graph 1
4
16
Next
State
Select
Graph
Select
One-hot
Encoding
Next
State
Select
32
1
Processor
Instruction
Monitor 1
Russell Tessier
16
4
K
4-bit Hash
Function
4
K
Monitoring Graph 2
4
4-bit Hash
Function
K-1
One-hot
Encoding
Graph
Select
32
Hash
Comparison
1
Next State
Valid Hashes 14
Reset/Recover
28
Prototype Implementation on FPGA
 Resources cost for a multi-core system (4 cores, 6 monitors):
Russell Tessier
29
Conclusions
 Current and future Internet needs to meet new demands
 Programmable routers provide packet processing platform
• Systems problem: security vulnerabilities
• Attacks can be launched within data plane (i.e., not control access)
• Monitor-based hardware defense mechanism is effective
 Hardware monitor design and prototype
• Uses compact DFA (less than 10% more states than NFA)
• Verification with single memory access per instruction
• Defense shown for Harvard architecture attack
 Technique extended to multicore network processors
 Promising defense against attacks on network infrastructure
Russell Tessier
30
Download