A REAL-TIME PACKET SCAN ARCHITECTURE
Tim Sherwood
UC Santa Barbara

Big Questions
Can my system be optimized further?
  If so, then how and when?
  How much benefit can I expect?
  Have I seen this behavior before?
Is my system working “correctly”?
  Soft errors, backdoors, hardware bugs
Am I under attack?
  If so, then by whom?
  Am I witness to an attack?
Online Monitors

To Protect and Serve
• Our machines are constantly under attack.
• We cannot rely on end users; we need networks that actively defend themselves.
IDS/IPS are promising ways of providing protection.
Market for such systems: $918.9 million by the end of 2007.
Snort: a widely accepted open source IDS.
This requires the protection system to be able to operate at 10 to 40 Gb/s. (We aim at current and next generation networks.)

The Problem
Our computing infrastructure is fast
  Processors → ~10^9 instructions/second
  Network Routers → ~10^9 bytes/second
Beyond our ability to monitor naively
  Full traces are near impossible to gather
  Sampling may miss important data
  Intrusive monitoring will change data
New Architectures are Required

Why a new Computer Architecture
Latency
Common Case
• Throughput is critical
  – 40 Gigabit link = Packet out every 8 ns
  – Each packet needs multiple memory references
• Design for worst case stream
  – Network vendors chip by wire rate
  – Denial of service and reliability
  – Caches are no help

Packet Scan Architecture
• High Performance Packet Scan Architecture
  Underlying primitives to support high-throughput monitors
  Algorithm – Architecture co-design
• Example primitive: String Matching
  0.4MB and 10Gbps for Snort rule set (>10,000 characters)
• Bit-Split String Matching Algorithm
  Reduces out edges from 256 to 2.
  Formal language – correctness and efficiency
• Memory Tile Based Design
  Memory throughput is the key
  Data is distributed over tiles with bounded contention
• Performance/area beats the best techniques we examined by a factor of 10 or more.
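
The "packet out every 8 ns" figure on the "Why a new Computer Architecture" slide can be reproduced with a short calculation. This is only a sketch: the slides do not state the packet size, so the 40-byte minimum packet below is an assumption, and `packet_interval_ns` is a hypothetical helper name.

```python
def packet_interval_ns(link_gbps: float, packet_bytes: int = 40) -> float:
    """Time between back-to-back packets on a saturated link, in nanoseconds.

    packet_bytes=40 is an assumed minimum packet size (not given on the slides).
    """
    # bits divided by Gbit/s gives nanoseconds directly
    return packet_bytes * 8 / link_gbps


for gbps in (10, 40):
    interval = packet_interval_ns(gbps)
    print(f"{gbps} Gb/s: one 40-byte packet every {interval:.0f} ns")
```

Under that assumption a 40 Gb/s link delivers a packet every 8 ns and a 10 Gb/s link every 32 ns, which is the time budget the scan hardware has for all memory references on one packet.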
Packet Scan Architecture
• String Matching
• Bit-Split String Matching Algorithm
• A Memory Tile Based Architecture
• Building a Real System
• Is it really correct?
• Future Work
examine packet content

Scanning for Intrusions
CodeRed worm: web flow established, uricontent with “/root.exe”
[Figure: Traffic In → Software Scan IDS → Traffic Out]
Most IDS define a set of rules. A string defines a suspicious transmission.
We are not building a full IDS, rather building the primitives from which full systems can be built.

Multiple String Matching
• The multiple string matching algorithm:
  Input: A set of strings/patterns S, and a buffer b
  Output: Every occurrence of an element of S in b
  A string can be anywhere in the payload of a packet.
  Input: A B D FC A B
  Strings: A B CA A B
  Extra constraint: b is really a stream
• How to implement:
  Option 1) search for each string independently
  Option 2) combine strings together and search all at once

Why hardware
• Snort: >1,000 rules, growing at 1 rule/day or more
• Active research into automated rule building
• Strings are not limited to be just [a-z]+
• We need a high speed string matching technique with stringent worst case performance.
• Many algorithms are targeted for average case performance.
Aho-Corasick can scan once and output all matches. But it is too big to be on-chip.

The Aho-Corasick Algorithm
• Given a finite set P of patterns, build a deterministic finite automaton G accepting the set of all patterns in P.

The Aho-Corasick Algorithm
• An Aho/Corasick String Matching Automaton for a given finite set P of patterns is a (deterministic) finite automaton G accepting the set of all words containing a word of P as a suffix.
G consists of the following components:
• finite set Q of states
• finite alphabet A
• transition function g: Q × A → Q + {fail}
• failure function h: Q → Q + {fail}
• initial state q0 in Q
• a set F of final states

On String Matching and Languages
• This should not be any big surprise
  P is a FL (finite language)
  FL ⊂ RL (regular languages)
  RL can be recognized by a RE
  RE can be simulated with an NFA
  An NFA can be simulated with a DFA
• This last step is the problem
  Aho and Corasick show that for FL there is no exponential blow up in state

An AC Automaton Example
• Example: P = {he, she, his, hers}
• The construction: linear time.
• The search of all patterns in P: linear time.
[Figure: Aho-Corasick automaton for P with states 0 to 9. Legend: initial state, transition function, state, accepting state. Edges pointing back to State 0 are not shown.]

Matching on the example
[Figure: the same automaton traversed on the input stream.]
Input stream: h x h e r s
Only scan the input stream once.

Linear Time: So what’s the problem
• How to implement it on chip?
[Figure: state table with 16,384 entries (0, 1, 2, 3, …), each holding 256 next-state pointers (0 to 255) of 14 bits (<14>).]
• Problem: Size too big to be on-chip
  ~10,000 nodes
  256 out edges per node
  Requires 16,384*256*14 = ~10MB
• Solution: partition into small state machines
  Fewer strings per machine
  Fewer out edges per machine

Packet Scan Architecture
• String Matching
• Bit-Split String Matching Algorithm
• A Memory Tile Based Architecture
• Building a Real System
• Is it really correct?
• Future Work
many tiny FSM working together

An example
P0 = { he, she, his, hers }
check for agreement

An example of Bit-Split
P0 = { he, she, his, hers }
[Figure: the AC automaton P0 next to the binary state machine B03, whose edges are labeled 0 and 1. Each B03 state is a set of P0 states: b0 {0}, b1 {0,1}, b2 {0,3}, b3 {0,1,2,6}, b4 {0,1,4}, b5 {0,3,7,8}, b6 {0,1,2,5,6}, b7 {0,3,9}. Edges pointing back to State 0 are not shown.]

Compact State Set
P0 = { he, she, his, hers }
[Figure: P0 and B03 with only the accepting states of P0 kept in each set: b0 {}, b1 {}, b2 {}, b3 {2}, b4 {}, b5 {7}, b6 {2,5}, b7 {9}. Edges pointing back to State 0 are not shown.]

An example of Bit-Split
P0 = { he, she, his, hers }
[Figure: P0 with two of its bit-split machines.
  B03: b0 {}, b1 {}, b2 {}, b3 {2}, b4 {}, b5 {7}, b6 {2,5}, b7 {9}
  B04: b0 {}, b1 {}, b2 {}, b3 {}, b4 {2}, b5 {}, b6 {2,5}, b7 {}, b8 {2,7}, b9 {9}
Edges pointing back to State 0 are not shown.]

Nice Properties
• The number of states in Bij is rigorously bounded by the number of states in Pi
• No exponential blow up in state
• Linear construction time
• Possible to traverse multiple edges at a time to multiply throughput

Matching on the example
Input: hxhe (B03 reads the bit stream 0100, B04 reads 1110)
[Figure: P0, B03 and B04 traversed on this input.]
How do you “combine” the results from the different state machines?
Only if all the state machines agree is there actually a match.

Packet Scan Architecture
• String Matching
• Bit-Split String Matching Algorithm
• A Memory Tile Based Architecture
• Building a Real System
• Is it really correct?
• Future Work
SRAM tiles implement FSM

Our Main Idea: Bit-Split
• Partition rules (P) into smaller sets (P0 to Pn)
• Build an AC state machine for each subset
• For each DFA Pi, rip the state machine apart into 8 tiny state machines (Bi0 through Bi7)
• Each of these searches for 1 bit in the 8-bit encoding of an input character
Only if all the different B machines agree can there actually be a match.

How to Implement
• The AC state machine is equivalent to the 8 tiny state machines.
• The 8 tiny state machines can run independently, which means in parallel
• Intersection done with bit-wise AND.
• 8 is intuitive but not optimal
• How to build a system to implement this algorithm?
Our algorithm makes it feasible to be on-chip

A Hardware Implementation
[Figure: String Match Engine block diagram. Each byte from the payload is split into four 2-bit inputs ([0:1], [2:3], [4:5], [6:7]), one per tile (Tile 0 to Tile 3) of a rule module (Rule Module 0 to Rule Module N). A state machine tile holds entries 0 to 255, each with 4 next-state pointers (<8>) selected by the 2-bit input through a 4:1 mux, and a partial match vector (<16>); other labels are current state (<8>), decoder, control block, config data, and output latch. The tiles' partial match vectors feed the full match vector of the rule module, and all rule modules together give the complete set of matches for all rules.]
• A rule module is equivalent to an AC state machine
• Rule modules, tiles are structurally equivalent
• All full match vectors are concatenated to indicate which strings are matched
• One tile stores one tiny bit-split state machine

An efficient Implementation
[Figure: cycle-by-cycle trace of the input h, x, h, e (cycles 0 to 3) through Tile 0 to Tile 3. Each tile is a table with one row per state, holding the next-state pointers for the 2-bit inputs 00, 01, 10, 11 and a partial match vector (PMV). The per-tile PMVs are combined in cycles 0+P to 3+P, giving 0000, 0000, 0000, 1000 for h, x, h, e.]

Performance of Hardware
Key Metric: Throughput*Character/Area

Packet Scan Architecture
• String Matching
• Bit-Split String Matching Algorithm
• A Memory Tile Based Architecture
• Building a Real System
• Is it really correct?
• Future Work
Integration and interfaces (FPGA)

Prototype Design
[Figure: prototype block diagram. Labels: Ethernet Interface, 100Mbps (promiscuous); DMA; Avalon Bus (50MHz, 12Gbps); Microprocessor (control/update); Device Drivers / Application Layer; String Match Engine (~1Gbps). Bus-side signals ("Connect to bus"): clk, reset, cs, address, write, data. Reg Interface signals: byte_in, data_enable, data_low, data_high, address, we, rst. SM Core signals: byte_in, data_enable, data, address, we, rst, clk, vector out.]

Interface With Avalon Bus
sme_send_byte(Base_add, 0, byte_from_packet)
  This function is for sending actual data to the string match engine.
sme_write_tile(Base_add, 1, 0, 0x0001, 0x00000000)
  This function is for initializing the string match engines.
Argument labels on the slide: module number, tile number, memory address, upper data, lower data.
[Figure: the same bus, Reg Interface and SM Core signals as on the Prototype Design slide.]

Packet Scan Architecture
• String Matching
• Bit-Split String Matching Algorithm
• A Memory Tile Based Architecture
• Building a Real System
• Is it really correct?
• Future Work
Proofs (yes)

A Formalization
Splits DFA as an NFA
Correctness stems from RL subset
The above property is sufficient; is it necessary?
Exploiting fixed wildcards is possible; what about more general patterns?

Packet Scan Architecture
• String Matching
• Bit-Split String Matching Algorithm
• A Memory Tile Based Architecture
• Building a Real System
• Is it really correct?
• Future Work
Extensions and Applications

Primitives for Security
• Packet Address List Lookup
• Packet Address Range Query
• Packet Classification
• String Finding
• Regular Expression Finding
• Stateful Flow Monitors
• Packet Ordering

Related Work
• Software based
  Good for ~100Mb/s, common case
• FPGA-based
  Many schemes map rules down to a specialized circuit
  Near optimal utilization of hardware resources
  Implementing state machines on block-RAMs [Cho and Mangione-Smith]
  Concurrent to our work: mapping state machines to on-chip SRAM [Aldwairi et al.]
  Bloom filters [Dharmapurikar et al.]
  Excellent filter in the common case
• TCAM-based
  Requires all patterns to be no longer than the TCAM width
  Cutting long patterns: 2Gbps with 295KB TCAM [Yu et al.]

Conclusions
• New Tile-based Architecture
  0.4MB and 10Gbps for Snort rule set (>10,000 characters)
  Can be used for other applications, e.g. IP lookups, packet classification.
• New Bit-split Algorithm
  General purpose enough for many other applications, e.g. spam detection, peephole optimization, IP lookups, packet classification, etc.
  Can be implemented on other tile-based architectures.

Thanks
• Lin Tan
• Brett Brotherton
• Prof. Ryan Kastner
• Prof. Ömer Egecioglu
• Shreyas Prasad, Shashi Mysore, Bita Mazloom, Ted Huffmire, Banit Argawal
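
The bit-split idea from the "Our Main Idea: Bit-Split" and "How to Implement" slides can be sketched in software. This is a minimal illustration, not the hardware design: the function names `build_ac_dfa`, `bit_split` and `scan` are hypothetical, it uses eight 1-bit machines rather than the four 2-bit tiles of the hardware, and it assumes bits are numbered from the most significant bit (bit 0 = MSB), a choice made because it reproduces the B03 state sets listed on the slides. Partial match vectors are stored as Python integers with one bit per pattern.

```python
from collections import deque


def build_ac_dfa(patterns):
    """Build a full Aho-Corasick DFA over bytes.

    Returns (delta, out): delta[q][c] is the next state for byte c, and
    out[q] is a bitmask with bit k set when patterns[k] ends in state q.
    """
    goto, out = [{}], [0]
    for k, pattern in enumerate(patterns):      # 1) build the trie
        q = 0
        for c in pattern:
            if c not in goto[q]:
                goto.append({})
                out.append(0)
                goto[q][c] = len(goto) - 1
            q = goto[q][c]
        out[q] |= 1 << k

    delta = [[0] * 256 for _ in goto]
    fail = [0] * len(goto)
    queue = deque()
    for c, q in goto[0].items():                # depth-1 states fail to state 0
        delta[0][c] = q
        queue.append(q)
    while queue:                                # 2) breadth-first failure resolution
        q = queue.popleft()
        out[q] |= out[fail[q]]                  # inherit matches of the failure state
        for c in range(256):
            if c in goto[q]:
                r = goto[q][c]
                fail[r] = delta[fail[q]][c]
                delta[q][c] = r
                queue.append(r)
            else:
                delta[q][c] = delta[fail[q]][c]
    return delta, out


def bit_split(delta, out, bit):
    """Subset construction that looks at a single bit of each input byte.

    Returns (sets, trans, pmv): sets[i] is the set of AC states behind
    bit-split state i, trans[i][b] is the next state on input bit b, and
    pmv[i] is the partial match vector (OR of out[] over sets[i]).
    """
    shift = 7 - bit                             # assumption: bit 0 is the MSB
    start = frozenset([0])
    index, sets, trans = {start: 0}, [start], []
    i = 0
    while i < len(sets):
        row = []
        for b in (0, 1):
            nxt = frozenset(delta[q][c] for q in sets[i]
                            for c in range(256) if ((c >> shift) & 1) == b)
            if nxt not in index:
                index[nxt] = len(sets)
                sets.append(nxt)
            row.append(index[nxt])
        trans.append(row)
        i += 1
    pmv = [0] * len(sets)
    for j, members in enumerate(sets):
        for q in members:
            pmv[j] |= out[q]
    return sets, trans, pmv


def scan(machines, data):
    """Run the 8 bit-split machines in lockstep and AND their PMVs."""
    states = [0] * 8
    hits = []
    for pos, byte in enumerate(data):
        full = -1                               # all ones before the AND
        for bit, (_, trans, pmv) in enumerate(machines):
            states[bit] = trans[states[bit]][(byte >> (7 - bit)) & 1]
            full &= pmv[states[bit]]
        if full:
            hits.append((pos, full))            # full match vector at this byte
    return hits


patterns = [b"he", b"she", b"his", b"hers"]
delta, out = build_ac_dfa(patterns)
machines = [bit_split(delta, out, bit) for bit in range(8)]
for pos, mask in scan(machines, b"hxhers"):
    names = [p.decode() for k, p in enumerate(patterns) if (mask >> k) & 1]
    print(f"match ending at byte {pos}: {names}")
```

On the input `hxhers` the sketch reports `he` ending at byte 3 and `hers` ending at byte 5, the same matches the single Aho-Corasick automaton would give. Each bit-split state has only two out edges instead of 256, which is what makes the per-tile tables small enough for on-chip SRAM.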