A REAL-TIME PACKET
SCAN ARCHITECTURE
Tim Sherwood
UC Santa Barbara
Big Questions
Can my system be optimized further?
– If so, then how and when?
– How much benefit can I expect?
– Have I seen this behavior before?
Is my system working “correctly”?
– Soft errors, backdoors, hardware bugs
Am I under attack?
– If so, then by whom?
– Am I witness to an attack?
Online Monitors
To Protect and Serve
• Our machines are constantly under attack
• Cannot rely on end users, we need networks
which actively defend themselves.
IDS/IPS are promising ways of providing protection
Market for such systems: $918.9 million by the end of 2007.
Snort: a widely accepted open source IDS
This requires the protection system to be able to
operate at 10 to 40 Gb/s. (We aim at current and next
generation networks.)
The Problem
Our computing infrastructure is fast
– Processors → ~10^9 instructions/second
– Network Routers → ~10^9 bytes/second
Beyond our ability to monitor naively
– Full traces are near impossible to gather
– Sampling may miss important data
– Intrusive monitoring will change data
New Architectures are Required
Why a new Computer Architecture
[Figure: latency, common case]
• Throughput is critical
– 40 Gigabit link = packet out every 8 ns
– Each packet needs multiple memory references
• Design for worst case stream
– Network vendors ship by wire rate
– Denial of service and reliability
– Caches are no help
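The 8 ns figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes back-to-back 40-byte minimum-size packets; the packet size is an assumption, not something the slide states, and `packet_interval_ns` is a name made up for illustration.

```python
def packet_interval_ns(link_gbps: float, packet_bytes: int) -> float:
    """Time between back-to-back packets on a saturated link, in nanoseconds."""
    bits = packet_bytes * 8
    # 1 Gb/s is exactly 1 bit per nanosecond, so bits / Gb/s gives nanoseconds.
    return bits / link_gbps


# Assumption: 40-byte minimum-size packets (not stated on the slide).
print(packet_interval_ns(40, 40))  # 8.0
print(packet_interval_ns(10, 40))  # 32.0
```

Under that assumption every memory reference a packet needs has to fit inside the same 8 ns window, which is why worst-case throughput rather than average latency drives the design.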
Packet Scan Architecture
• High Performance Packet Scan Architecture
– Underlying primitives to support high-throughput monitors
– Algorithm-architecture co-design
• Example primitive: String Matching
– 0.4MB and 10Gbps for Snort rule set (>10,000 characters)
• Bit-Split String Matching Algorithm
– Reduces out edges from 256 to 2
– Formal language: correctness and efficiency
• Memory Tile Based Design
– Memory throughput is the key
– Data is distributed over tiles with bounded contention
• Performance/area beats the best techniques we
examined by a factor of 10 or more.
Packet Scan Architecture
• String Matching
• Bit-Split String Matching Algorithm
• A Memory Tile Based Architecture
• Building a Real System
• Is it really correct?
• Future Work
String Matching: examine packet content
Scanning for Intrusions
CodeRed worm: web flow established, uricontent with “/root.exe”
[Figure: Traffic In → IDS (software scan) → Traffic Out]
Most IDS define a set of rules.
A string defines a suspicious transmission.
We are not building a full IDS, rather building the
primitives from which full systems can be built
Multiple String Matching
• The multiple string matching algorithm:
– Input: A set of strings/patterns S, and a buffer b
– Output: Every occurrence of an element of S in b
– Extra constraint: b is really a stream
A string can be anywhere in the payload of a packet.
Example input: ABDFCAB
Example strings: AB, CA
• How to implement:
Option 1) search for each string independently
Option 2) combine strings together and search all at once
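Option 1 can be sketched in a few lines. This is an illustrative baseline only: `find_all_naive` is a made-up name, and the input `ABDFCAB` with strings `AB` and `CA` is read off the slide's example as best the extraction allows.

```python
def find_all_naive(patterns, buffer):
    """Option 1: scan the whole buffer once per pattern.

    Returns sorted (end_index, pattern) pairs; end_index is exclusive.
    """
    matches = []
    for p in patterns:
        start = buffer.find(p)
        while start != -1:
            matches.append((start + len(p), p))
            start = buffer.find(p, start + 1)  # allow overlapping matches
    return sorted(matches)


print(find_all_naive(["AB", "CA"], "ABDFCAB"))
# [(2, 'AB'), (6, 'CA'), (7, 'AB')]
```

The work grows with the number of patterns, and the function needs the whole buffer in hand, so it does not fit the stream constraint. Option 2 avoids both problems by reading each input character once.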
Why hardware
• Snort: >1,000 rules, growing at 1 rule/day or more
• Active research into automated rule building
• Strings are not limited to just [a-z]+
• We need a high speed string matching technique
with stringent worst case performance.
• Many algorithms are targeted for average case
performance. Aho-Corasick can scan once and
output all matches. But it is too big to be on-chip.
The Aho-Corasick Algorithm
• Given a finite set P of patterns, build a
deterministic finite automaton G accepting
the set of all patterns in P.
The Aho-Corasick Algorithm
• An Aho-Corasick string matching automaton for a
given finite set P of patterns is a (deterministic)
finite automaton G accepting the set of all words
containing a word of P as a suffix. G consists of
the following components:
• Finite set Q of states
• Finite alphabet A
• Transition function g: Q × A → Q + {fail}
• Failure function h: Q → Q + {fail}
• Initial state q0 in Q
• A set F of final states
On String Matching and Languages
• This should not be any big surprise
– P is a finite language (FL)
– FL ⊂ RL (regular languages)
– An RL can be recognized by a regular expression (RE)
– An RE can be simulated with an NFA
– An NFA can be simulated with a DFA
• This last step is the problem
– Aho and Corasick show that for an FL there is no
exponential blow-up in state
An AC Automaton Example
• Example: P = {he, she, his, hers}
• Construction: linear time.
• Search for all patterns in P: linear time.
[Figure: AC automaton for P with states 0 to 9. Legend: initial state, transition function, state, accepting state. Edges pointing back to State 0 are not shown.]
Matching on the example
[Figure: the AC automaton for P (states 0 to 9) stepping through the input stream.]
Input stream: h x h e r s
Only scan the input stream once.
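A minimal sketch of the automaton, using the textbook goto/failure/output formulation, shows the single pass over the input. The names `build_ac` and `search` are illustrative and not taken from the slides. Patterns are inserted in the order he, she, his, hers, which happens to number the states 0 to 9 the same way as the example automaton.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto = [{}]    # goto[state][char] -> next state (trie edges only)
    out = [set()]  # out[state] -> patterns that end at this state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:                     # breadth-first, so shallower states are done
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]   # inherit matches that end in a suffix
            queue.append(t)
    return goto, fail, out


def search(patterns, stream):
    """Yield (end_index, pattern) for every match, reading the stream once."""
    goto, fail, out = build_ac(patterns)
    s = 0
    for i, ch in enumerate(stream):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in sorted(out[s]):
            yield i + 1, p


print(list(search(["he", "she", "his", "hers"], "hxhers")))
# [(4, 'he'), (6, 'hers')]
```

The search loop never rereads a character: a mismatch follows failure links instead of rewinding the input, which is what makes the approach usable on a stream.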
Linear Time: So what’s the problem
• How to implement it on chip?
[Figure: state table with 16,384 rows; each row holds 256 next state pointers (indexed 0 to 255), each <14> bits wide.]
• Problem: Size too big to be on-chip
– ~10,000 nodes
– 256 out edges per node
– Requires 16,384*256*14 bits = ~10MB
• Solution: partition into small state machines
– Fewer strings per machine
– Fewer out edges per machine
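The sizing argument can be reproduced directly. This sketch assumes a dense table with one pointer per (state, input symbol) pair and nothing else stored per state; `table_bytes` is a made-up helper name. Under that assumption the pointer table alone comes to about 7.3 MB, the same order of magnitude as the slide's ~10MB, which may be rounded or may count extra per-state data.

```python
def table_bytes(num_states: int, fanout: int) -> int:
    """Bytes needed for a dense next-state table with `fanout` pointers per state."""
    pointer_bits = (num_states - 1).bit_length()  # 14 bits address 16,384 states
    return num_states * fanout * pointer_bits // 8


dense = table_bytes(16_384, 256)   # one pointer per possible input byte
binary = table_bytes(16_384, 2)    # one pointer per possible input bit
print(dense)            # 7340032
print(dense // binary)  # 128
```

Cutting the fan-out from 256 to 2 shrinks each table by a factor of 128, which is the saving the partitioning step is after. The comparison is per table and ignores how many small machines are needed in total.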
Bit-Split String Matching Algorithm: many tiny FSM working together
An example
P0 = { he, she, his, hers }
check for agreement
An example of Bit-Split
P0 = { he, she, his, hers }
[Figure: the AC automaton P0 (states 0 to 9) beside B03, the bit-split machine for bit 3 of each character; for example h = 0110 1000 and s = 0111 0011. Each B03 state is a set of P0 states: b0 {0}, b1 {0,1}, b2 {0,3}, b3 {0,1,2,6}, b4 {0,1,4}, b5 {0,3,7,8}, b6 {0,1,2,5,6}, b7 {0,3,9}. Edges pointing back to State 0 are not shown.]
Compact State Set
P0 = { he, she, his, hers }
[Figure: P0 beside B03 with compact state sets that keep only the accepting states of P0: b0 {}, b1 {}, b2 {}, b3 {2}, b4 {}, b5 {7}, b6 {2,5}, b7 {9}. Edges pointing back to State 0 are not shown.]
An example of Bit-Split
P0 = { he, she, his, hers }
[Figure: P0 beside two of its bit-split machines, B03 and B04, with compact state sets. B03: b0 {}, b1 {}, b2 {}, b3 {2}, b4 {}, b5 {7}, b6 {2,5}, b7 {9}. B04: b0 {}, b1 {}, b2 {}, b3 {}, b4 {2}, b5 {}, b6 {2,5}, b7 {}, b8 {2,7}, b9 {9}. Edges pointing back to State 0 are not shown.]
Nice Properties
• The number of states in Bij is rigorously
bounded by the number of states in Pi
• No exponential blow up in state
• Linear construction time
• Possible to traverse multiple edges at a time
to multiply throughput
Matching on the example
[Figure: P0, B03 and B04 stepping through the input hxhe. B03 reads the bit stream 0100 and B04 reads the bit stream 1110.]
How do you “combine” the results from the different state machines?
Only if all the state machines agree, is there actually a match.
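A small sketch of the two halves of that question: what each machine sees, and how the answers are combined. Bits are numbered from the most significant end (bit 0 first), which is the convention that matches the example's B03 and B04 streams. The function names and the partial match vector values in the last line are illustrative assumptions.

```python
def bit_stream(data: bytes, bit: int) -> str:
    """The single bit (0 = most significant) of each byte that machine `bit` sees."""
    return "".join(str((b >> (7 - bit)) & 1) for b in data)


def combine(partial_match_vectors):
    """A string is matched only if every machine reports it: bitwise AND."""
    full = partial_match_vectors[0]
    for pmv in partial_match_vectors[1:]:
        full &= pmv
    return full


print(bit_stream(b"hxhe", 3))  # 0100
print(bit_stream(b"hxhe", 4))  # 1110

# Illustrative vectors: one bit per string, one vector per machine.
print(format(combine([0b1000, 0b1111, 0b1100, 0b1000]), "04b"))  # 1000
```

Each machine on its own over-reports, because many characters share the same value in any one bit position. The AND keeps only the strings that every bit position is consistent with.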
A Memory Tile Based Architecture: SRAM tiles implement FSM
Our Main Idea: Bit-Split
• Partition rules (P) into smaller sets (P0 to Pn)
• Build AC state-machine for each subset
• For each DFA Pi, rip the state machine apart into
8 tiny state machines (Bi0 through Bi7)
• Each of which searches for 1 bit in the 8-bit
encoding of an input character
– Only if all the different B machines agree can
there actually be a match
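The steps above can be sketched end to end in software. This is a minimal illustration, not the hardware design: it builds one Aho-Corasick DFA, derives the eight binary machines by subset construction, and ANDs their partial match vectors. All function names are made up for the sketch, and bit 0 is taken to be the most significant bit.

```python
from collections import deque


def build_dfa(patterns):
    """Aho-Corasick as a full DFA: delta[state][byte] -> state."""
    trie, accept = [{}], [0]  # accept[s]: bitmask of patterns ending at s
    for k, p in enumerate(patterns):
        s = 0
        for b in p:
            if b not in trie[s]:
                trie.append({})
                accept.append(0)
                trie[s][b] = len(trie) - 1
            s = trie[s][b]
        accept[s] |= 1 << k
    delta = [[0] * 256 for _ in trie]
    fail = [0] * len(trie)
    queue = deque()
    for b, t in trie[0].items():
        delta[0][b] = t
        queue.append(t)
    while queue:  # breadth-first: the failure state's row is always complete
        s = queue.popleft()
        accept[s] |= accept[fail[s]]
        for b in range(256):
            if b in trie[s]:
                t = trie[s][b]
                fail[t] = delta[fail[s]][b]
                delta[s][b] = t
                queue.append(t)
            else:
                delta[s][b] = delta[fail[s]][b]
    return delta, accept


def split(delta, accept, bit):
    """Subset-construct the binary machine that sees only `bit` of each byte."""
    start = frozenset([0])
    ids, table, pmv = {start: 0}, [], []
    work = deque([start])
    while work:
        cur = work.popleft()
        row = []
        for v in (0, 1):
            nxt = frozenset(delta[s][b] for s in cur for b in range(256)
                            if ((b >> (7 - bit)) & 1) == v)
            if nxt not in ids:
                ids[nxt] = len(ids)
                work.append(nxt)
            row.append(ids[nxt])
        table.append(row)
        vector = 0
        for s in cur:  # partial match vector: any pattern a member state accepts
            vector |= accept[s]
        pmv.append(vector)
    return table, pmv


def scan(patterns, data):
    """Run the 8 binary machines in lockstep and AND their match vectors."""
    delta, accept = build_dfa(patterns)
    machines = [split(delta, accept, bit) for bit in range(8)]
    states, hits = [0] * 8, []
    for i, byte in enumerate(data):
        full = -1  # all ones, the identity for AND
        for bit, (table, pmv) in enumerate(machines):
            states[bit] = table[states[bit]][(byte >> (7 - bit)) & 1]
            full &= pmv[states[bit]]
        hits += [(i + 1, patterns[k]) for k in range(len(patterns))
                 if (full >> k) & 1]
    return hits


print(scan([b"he", b"she", b"his", b"hers"], b"hxhers"))
# [(4, b'he'), (6, b'hers')]
```

In this sketch each binary machine keeps two next-state entries per state instead of 256, and the eight machines share nothing but the final AND, so they can be stepped independently.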
How to Implement
• The AC state machine is equivalent to the 8
tiny state machines.
• The 8 tiny state machines can run
independently, which means in parallel
• Intersection done with bit-wise AND.
• 8 is intuitive but not optimal
• How to build a system to implement this
algorithm?
– Our algorithm makes it feasible to be on-chip
A Hardware Implementation
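[Figure: String Match Engine. A byte from the payload goes to Rule Module 0 through Rule Module N; their outputs form the complete set of matches for all rules.]
[Figure: Rule Module. A control block and Tiles 0 to 3; each tile takes a 2-bit input, one of the slices [0:1], [2:3], [4:5], [6:7] (2 bits from each byte), and produces a 16-bit partial match vector; the partial match vectors combine into the 16-bit full match vector.]
[Figure: State Machine Tile. A memory of 256 rows (0 to 255) addressed through a decoder by the 8-bit current state; each row holds 4 next state pointers of <8> bits and a <16>-bit partial match vector; the 2-bit input drives a 4:1 mux that selects the next state, followed by an output latch. Config data loads the memory.]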
• A rule module is equivalent to an AC state machine
• Rule modules, tiles are structurally equivalent
• All full match vectors are concatenated to indicate which
strings are matched
• One tile stores one tiny bit-split state machine
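A short sketch of what one tile consumes and stores. The slicing uses most-significant-bits-first order, and the tile size uses the widths from the tile figure (256 states, 4 next state pointers of 8 bits, a 16-bit partial match vector). Treat both as assumptions of this sketch rather than a statement of the exact layout; the names are illustrative.

```python
def two_bit_slices(byte: int):
    """Split a byte into four 2-bit fields, most significant first."""
    return [(byte >> shift) & 0b11 for shift in (6, 4, 2, 0)]


for ch in b"hxhe":
    print(chr(ch), [format(v, "02b") for v in two_bit_slices(ch)])
# h ['01', '10', '10', '00']
# x ['01', '11', '10', '00']
# h ['01', '10', '10', '00']
# e ['01', '10', '01', '01']

# Tile memory, assuming the widths shown on the tile figure.
STATES, POINTERS, POINTER_BITS, PMV_BITS = 256, 4, 8, 16
tile_bits = STATES * (POINTERS * POINTER_BITS + PMV_BITS)
print(tile_bits // 8)  # 1536 bytes per tile
```

Reading 2 bits per tile means four tiles cover a byte, and each tile follows one of four edges per step instead of one of two, at the cost of four pointers per state.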
An efficient Implementation
[Figure: four tiles processing the input h, x, h, e in cycles 0 to 3. Each tile is a table indexed by current state with four next state columns (00, 01, 10, 11) and a partial match vector (PMV). The 2-bit slices of the four input bytes, from most to least significant, are 01 01 01 01, 10 11 10 10, 10 10 10 01 and 00 00 00 01. The tiles' partial match vectors are combined into the full match vector; at cycles 0+P to 3+P it is 0000, 0000, 0000 for h, x, h and 1000 after e.]
Performance of Hardware
Key Metric: Throughput*Character/Area
Building a Real System: integration and interfaces (FPGA)
Prototype Design
[Figure: prototype block diagram]
– Ethernet Interface: 100Mbps (promiscuous)
– DMA
– Microprocessor (control/update), device drivers/application layer
– Avalon Bus (50MHz, 12Gbps)
– Connect to bus: clk, reset, cs, address, write, data
– Reg Interface: byte_in, data_enable, data_low, data_high, address, we, rst
– SM Core: byte_in, data_enable, data, address, we, rst, clk, vector out
– String Match Engine (~1Gbps)
Interface With Avalon Bus
sme_send_byte(Base_add, 0, byte_from_packet)
This function is for sending actual data to the string match engines.
sme_write_tile(Base_add, 1, 0, 0x0001, 0x00000000)
This function is for initializing the memory in the string match engine. Argument labels on the slide: module number, tile number, address, upper data, lower data.
[Figure: signal mapping. Connect to bus (clk, reset, cs, address, write, data) → Reg Interface (byte_in, data_enable, data_low, data_high, address, we, rst) → SM Core (byte_in, data_enable, data, address, we, rst, clk, vector out)]
Is it really correct? Proofs (yes)
A Formalization
Splits DFA as an NFA
Correctness stems from RL subset
The above property is sufficient, is it necessary?
Exploiting fixed wildcards is possible, what
about more general patterns?
Future Work: extensions and applications
Primitives for Security
• Packet Address List Lookup
• Packet Address Range Query
• Packet Classification
• String Finding
• Regular Expression Finding
• Stateful Flow Monitors
• Packet Ordering
Related Work
• Software based
– Good for ~100Mb/s, common case
• FPGA-based
– Many schemes map rules down to a specialized circuit
  – Near optimal utilization of hardware resources
– Implementing state machines on block-RAMs [Cho and Mangione-Smith]
– Concurrent to our work: mapping state machines to on-chip SRAM [Aldwairi et al.]
– Bloom filters [Dharmapurikar et al.]
  – Excellent filter in the common case
• TCAM-based
– Require all patterns to be shorter than or equal to the TCAM width
– Cutting long patterns: 2Gbps with 295KB TCAM [Yu et al.]
Conclusions
• New Tile-based Architecture
– 0.4MB and 10Gbps for Snort rule set (>10,000 characters)
– Possible to be used for other applications, e.g. IP lookups, packet classification.
• New Bit-split Algorithm:
– General purpose enough for many other applications, e.g. spam detection, peephole optimization, IP lookups, packet classification, etc.
– Feasible to be implemented on other tile-based architectures.
Thanks
• Lin Tan
• Brett Brotherton
• Prof. Ryan Kastner
• Prof. Ömer Egecioglu
• Shreyas Prasad, Shashi Mysore, Bita Mazloom, Ted Huffmire, Banit Argawal