Data plane algorithms in routers: from prefix lookup to deep packet inspection
Cristian Estan, University of Wisconsin-Madison
DIMACS Tutorial on Algorithms for Next Generation Networks, August 6-8, 2007

What is the data plane?
- The part of the router that handles the traffic
- Throughput, defined as the number of packets or bytes handled per second, is very important
- Data plane algorithms are applied to every packet, and successive packets are typically treated independently
  - Example: deciding on which link to send a packet
- "Line speed": keeping up with the rate at which traffic can be transmitted over the wire or fiber
  - Example: a 10 Gbps router has 32 ns to handle a 40-byte packet
- Memory usage is limited by technology and cost: we can afford at most tens of megabits of fast on-chip memory

A generic data plane problem
- The router has many directives, each composed of a guard and an associated action (all guards distinct)
- There is a simple procedure for testing how well a guard matches a packet
- For each packet, find the guard that matches "best" and take the associated action
- Example, routing table lookup:
  - Each guard is an IP prefix (between 0 and 32 bits)
  - Matching procedure: is the guard a prefix of the 32-bit destination IP address?
  - "Best" is defined as the longest matching prefix

The rules of the game
- Matching against all guards in sequence is too slow
- We build a data structure that captures the semantics of all guards and use it for matching
- Primary metrics:
  - How fast the matching algorithm is
  - How much memory the data structure needs
- The time to build the data structure also has some importance
- We can cheat (but we won't today) by:
  - Using binary or ternary content-addressable memories
  - Using other forms of hardware support

Measuring "algorithm complexity"
- Execution cost is measured in the number of memory accesses needed to read the data structure
  - The actual data manipulation operations are typically very simple
  - On some platforms we can read wide words
- Worst-case performance is most important
  - The worst case is defined with respect to the input, not the guards
  - Caching has proven ineffective in many settings
  - Using algorithms with good amortized complexity but a bad worst case requires large buffers

Overview
- Longest matching prefix
  - Trie-based algorithms
    - Uni-bit and multi-bit tries (fixed stride and variable stride)
    - Leaf pushing
    - Bitmap compression of multi-bit trie nodes
    - Tree bitmap representation for multi-bit trie nodes
  - Binary search on ranges
  - Binary search on prefix lengths
- Classification on multiple fields
- Signature matching

Longest matching prefix
- Used in routing table lookup (a.k.a. forwarding) for finding the link on which to send a packet
- Guard: a bit string of 0 to w bits called an IP prefix
- Action: a single-byte interface identifier
- Input: a w-bit string representing the destination IP address of the packet (w is 32 for IPv4, 128 for IPv6)
- Output: the interface associated with the longest guard matching the input
- Size of problem: hundreds of thousands of prefixes
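As a baseline for the data structures that follow, here is a minimal sketch (not from the tutorial; struct layout and interface numbers are made up) of longest prefix match by linear scan over all guards. It implements the problem statement directly, but every lookup touches every prefix, which is exactly what the trie-based structures below avoid.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical baseline: linear scan over all prefixes. */
struct prefix_rule {
    uint32_t prefix;   /* prefix bits, left-aligned in a 32-bit word */
    uint8_t  len;      /* prefix length, 0..32 */
    uint8_t  iface;    /* action: outgoing interface identifier */
};

static int lpm_linear(const struct prefix_rule *rules, int n, uint32_t addr)
{
    int best = -1, best_len = -1;
    for (int i = 0; i < n; i++) {
        uint32_t mask = rules[i].len ? ~0u << (32 - rules[i].len) : 0;
        if ((addr & mask) == rules[i].prefix && rules[i].len > best_len) {
            best_len = rules[i].len;
            best = rules[i].iface;
        }
    }
    return best;   /* -1 if no prefix matches */
}

int main(void)
{
    /* Prefixes P1 = 0*, P2 = 1*, P7 = 110* from the example routing table,
     * with made-up interface numbers. */
    struct prefix_rule rules[] = {
        { 0x00000000u, 1, 1 },  /* 0*   -> P1 */
        { 0x80000000u, 1, 2 },  /* 1*   -> P2 */
        { 0xC0000000u, 3, 7 },  /* 110* -> P7 */
    };
    uint32_t addr = 0xC2000000u;          /* address starting with 110000... */
    printf("interface %d\n", lpm_linear(rules, 3, addr));   /* prints 7 */
    return 0;
}
```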
Controlled prefix expansion with stride 3
- Example routing table:
    P1 0*        P4 1000*      P7 110*
    P2 1*        P5 100000*    P8 11001*
    P3 100*      P6 101*       P9 111*
- [Figure: the uni-bit trie for this table, and the table after controlled prefix expansion to lengths that are multiples of 3; e.g. P1 = 0* expands to 000*, 001*, 010*, 011*, and P4 = 1000* expands to 100000* through 100011*, with the 100000* entry won by the more specific P5.]
- In the multi-bit trie with fixed stride 3, each node is an array of 2^3 = 8 entries indexed by the next 3 bits of the address. For the input 11000010, the lookup follows root entry 110 (remembering P7) and then child entry 000 (empty), so the longest matching prefix is P7 (P2 also matches, but is shorter).
- A multi-bit trie with variable stride lets each node choose its own stride, trading depth against node size. Given a maximum trie height h and a routing table of size n, a dynamic programming algorithm computes the optimal variable-stride trie in O(n · w^2 · h) time.
- With leaf pushing, a prefix stored at an entry that also has a child is pushed down into the entries of that child, so each entry holds either a prefix or a pointer, never both. Leaf pushing reduces memory usage but increases update time.
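The sketch below (hypothetical names; stride 4 instead of the stride 3 used in the figures) builds a fixed-stride multi-bit trie by controlled prefix expansion. It stores both a next hop and a child pointer per entry rather than leaf pushing, so the structure stays easy to update; it is an illustration under these assumptions, not the tutorial's exact layout.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define STRIDE 4                        /* bits consumed per trie level */
#define FANOUT (1u << STRIDE)

/* One node of a fixed-stride multi-bit trie. Each entry keeps both the best
 * prefix expanded into it and a child pointer (i.e. no leaf pushing). */
struct node {
    int          nexthop[FANOUT];       /* -1 = no prefix stored here */
    int          plen[FANOUT];          /* its length, so longer prefixes win */
    struct node *child[FANOUT];
};

static struct node *new_node(void)
{
    struct node *n = calloc(1, sizeof *n);
    for (unsigned i = 0; i < FANOUT; i++) { n->nexthop[i] = -1; n->plen[i] = -1; }
    return n;
}

/* Controlled prefix expansion: a prefix is expanded to the next multiple of
 * STRIDE bits and stored in a block of consecutive entries of the node
 * reached by walking its complete strides. */
static void trie_insert(struct node *root, uint32_t prefix, int len, int nexthop)
{
    struct node *n = root;
    int depth = 0;
    while (len - depth > STRIDE) {
        unsigned idx = (prefix >> (32 - depth - STRIDE)) & (FANOUT - 1);
        if (!n->child[idx]) n->child[idx] = new_node();
        n = n->child[idx];
        depth += STRIDE;
    }
    int rem = len - depth;                               /* 0..STRIDE bits left */
    unsigned base = rem ? (prefix >> (32 - depth - STRIDE)) & (FANOUT - 1) : 0;
    base &= ~((1u << (STRIDE - rem)) - 1);               /* clear padded bits */
    for (unsigned i = 0; i < (1u << (STRIDE - rem)); i++)
        if (len > n->plen[base + i]) {                   /* longest prefix wins */
            n->plen[base + i] = len;
            n->nexthop[base + i] = nexthop;
        }
}

/* A lookup reads at most 32/STRIDE nodes, remembering the last prefix seen. */
static int trie_lookup(const struct node *n, uint32_t addr)
{
    int best = -1;
    for (int depth = 0; n && depth < 32; depth += STRIDE) {
        unsigned idx = (addr >> (32 - depth - STRIDE)) & (FANOUT - 1);
        if (n->nexthop[idx] >= 0) best = n->nexthop[idx];
        n = n->child[idx];
    }
    return best;
}

int main(void)
{
    struct node *root = new_node();
    /* A few prefixes of the example routing table (next hop = rule number). */
    trie_insert(root, 0x00000000u, 1, 1);   /* P1: 0*     */
    trie_insert(root, 0x80000000u, 1, 2);   /* P2: 1*     */
    trie_insert(root, 0xC0000000u, 3, 7);   /* P7: 110*   */
    trie_insert(root, 0xC8000000u, 5, 8);   /* P8: 11001* */
    printf("%d %d\n", trie_lookup(root, 0xC2000000u),    /* 11000010... -> 7 */
                      trie_lookup(root, 0xCC000000u));   /* 11001100... -> 8 */
    return 0;   /* nodes are deliberately not freed in this sketch */
}
```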
Lulea bitmap compression
- Repeated entries of a multi-bit trie node are stored only once in a compressed array. An auxiliary bitmap is needed to find the right entry in the compressed node: it stores a 0 for positions that do not differ from the previous one, and a 1 where a new value starts.
- [Figure: the leaf-pushed root node of the stride-3 example; runs of identical entries collapse, the auxiliary bitmap shown is 10001111, and the compressed array stores one entry per run (five entries instead of eight).]
- When the compression bitmaps are large, it is expensive to count bits during lookup. The bitmap is divided into chunks, and a precomputed auxiliary array stores the number of bits set before each chunk; the lookup algorithm then only needs to count the bits set within one chunk.
- [Figure: a 32-bit bitmap split into four 8-bit chunks with precomputed counts 0, 4, 8, 13 of the bits set before each chunk; the number of 1s before a position in the last chunk is 13 plus a count within that chunk, e.g. 13 + 0 = 13.]
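A minimal sketch of such a compressed node, with made-up sizes (a stride-8 node and 32-bit bitmap chunks): the bitmap marks where a new run of identical entries starts, a small precomputed array holds the number of set bits before each chunk, and a lookup translates an index into the uncompressed node into an index into the compressed array by counting bits in a single chunk.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical Lulea-style compressed node for a stride-8 trie node.
 * bitmap[i/32] has bit i set iff uncompressed entry i starts a new run of
 * identical values; the compressed array stores one value per run; base[c]
 * is the precomputed number of set bits in all chunks before chunk c. */
#define ENTRIES 256
#define CHUNK   32

struct lulea_node {
    uint32_t bitmap[ENTRIES / CHUNK];
    uint16_t base[ENTRIES / CHUNK];
    int      compressed[ENTRIES];
};

static int popcount32(uint32_t x)
{
    int c = 0;
    while (x) { x &= x - 1; c++; }
    return c;
}

static void build(struct lulea_node *n, const int *full)   /* full[ENTRIES] */
{
    int runs = 0;
    for (int c = 0; c < ENTRIES / CHUNK; c++) n->bitmap[c] = 0;
    for (int i = 0; i < ENTRIES; i++)
        if (i == 0 || full[i] != full[i - 1]) {            /* a new run starts */
            n->bitmap[i / CHUNK] |= 1u << (i % CHUNK);
            n->compressed[runs++] = full[i];
        }
    int ones = 0;                                          /* prefix counts */
    for (int c = 0; c < ENTRIES / CHUNK; c++) {
        n->base[c] = (uint16_t)ones;
        ones += popcount32(n->bitmap[c]);
    }
}

/* Return entry i of the conceptual uncompressed node: count the set bits up
 * to and including position i, using base[] plus a count within i's chunk. */
static int node_lookup(const struct lulea_node *n, int i)
{
    int chunk = i / CHUNK, bit = i % CHUNK;
    uint32_t mask = (bit == 31) ? ~0u : ((1u << (bit + 1)) - 1);
    int run = n->base[chunk] + popcount32(n->bitmap[chunk] & mask) - 1;
    return n->compressed[run];
}

int main(void)
{
    static int full[ENTRIES];
    static struct lulea_node node;
    for (int i = 0; i < ENTRIES; i++)                      /* three runs: 1,7,9 */
        full[i] = (i < 100) ? 1 : (i < 200) ? 7 : 9;
    build(&node, full);
    printf("%d %d %d\n", node_lookup(&node, 5),
           node_lookup(&node, 150), node_lookup(&node, 250));
    return 0;                                              /* prints: 1 7 9 */
}
```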
Representing a node as a tree bitmap
- Pointers to children and prefixes are stored in separate structures. Prefixes of all lengths within the node are stored explicitly, so leaf pushing is not needed and updates are fast.
- The bitmaps have 1s corresponding to entries that are not empty: one bitmap marks which of the node's prefixes are present, the other marks which child subtrees exist.
- [Figure: the root node of the stride-3 example as a tree bitmap; the prefix bitmap has 1s for 0*, 1*, 100*, 101*, 110* and 111*, and the child bitmap has 1s for the subtrees under 100 and 110.]

Binary search on ranges
- Divide the w-bit address space into maximal contiguous ranges covered by the same prefix
- Build an array or a balanced (binary) search tree with the boundaries of the ranges
- At lookup time, perform an O(log n) search
- Not better than multi-bit tries with compression, but it is not covered by patents
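A sketch of this approach with made-up array bounds: every prefix covers a contiguous address range, the range boundaries are sorted, the longest covering prefix is precomputed for each resulting interval (here with a naive quadratic pass, for brevity), and a lookup is a binary search for the interval containing the address.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: longest prefix match by binary search on ranges. */
struct pfx { uint32_t prefix; int len; int nexthop; };

struct range_table {
    uint32_t start[512];   /* interval start points, sorted (toy bound) */
    int      answer[512];  /* next hop for the interval beginning at start[i] */
    int      count;
};

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

static void range_build(struct range_table *t, const struct pfx *p, int n)
{
    uint32_t pts[512];
    int m = 0;
    pts[m++] = 0;                                   /* the space starts at 0 */
    for (int i = 0; i < n; i++) {
        uint32_t mask = p[i].len ? ~0u << (32 - p[i].len) : 0;
        uint32_t lo = p[i].prefix & mask, hi = lo | ~mask;
        pts[m++] = lo;
        if (hi != 0xFFFFFFFFu) pts[m++] = hi + 1;   /* interval after the range */
    }
    qsort(pts, m, sizeof pts[0], cmp_u32);
    t->count = 0;
    for (int i = 0; i < m; i++) {
        if (i && pts[i] == pts[i - 1]) continue;    /* drop duplicate points */
        int best = -1, best_len = -1;               /* longest prefix covering pts[i] */
        for (int j = 0; j < n; j++) {
            uint32_t mask = p[j].len ? ~0u << (32 - p[j].len) : 0;
            if ((pts[i] & mask) == (p[j].prefix & mask) && p[j].len > best_len) {
                best_len = p[j].len;
                best = p[j].nexthop;
            }
        }
        t->start[t->count] = pts[i];
        t->answer[t->count++] = best;
    }
}

static int range_lookup(const struct range_table *t, uint32_t addr)
{
    int lo = 0, hi = t->count - 1;                  /* largest start[] <= addr */
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (t->start[mid] <= addr) lo = mid; else hi = mid - 1;
    }
    return t->answer[lo];
}

int main(void)
{
    struct pfx p[] = {
        { 0x00000000u, 1, 1 },   /* P1: 0*   */
        { 0x80000000u, 1, 2 },   /* P2: 1*   */
        { 0xC0000000u, 3, 7 },   /* P7: 110* */
    };
    static struct range_table t;
    range_build(&t, p, 3);
    printf("%d %d %d\n", range_lookup(&t, 0x12345678u),   /* 1 (P1) */
                         range_lookup(&t, 0xC2000000u),   /* 7 (P7) */
                         range_lookup(&t, 0xF0000000u));  /* 2 (P2) */
    return 0;
}
```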
Binary search on prefix lengths
- Core idea: for each prefix length present in the routing table, keep a hash table with the prefixes of that length
- The longest matching prefix can be found by looking up, in each hash table, the prefix of the address of the corresponding length
- Binary search on the prefix lengths is faster
- Simple but wrong algorithm: if you find a prefix at length x, store it as the best match and look for longer matching prefixes; otherwise look for shorter prefixes
  - Problem: what if there is both a shorter and a longer matching prefix, but no prefix at length x?
  - Solution: insert a marker at length x when there are longer prefixes, and store with the marker the longest matching shorter prefix. Markers lead to a moderate increase in memory usage.
- Promising algorithm for IPv6 (w = 128)

Papers on longest matching prefix
- G. Varghese, "Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices", chapter 11, Morgan Kaufmann, 2005
- V. Srinivasan, G. Varghese, "Faster IP lookups using controlled prefix expansion", ACM Trans. on Computer Systems, Feb. 1999
- M. Degermark, A. Brodnik, S. Carlsson, S. Pink, "Small forwarding tables for fast routing lookups", ACM SIGCOMM, 1997
- W. Eatherton, Z. Dittia, G. Varghese, "Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates", http://wwwcse.ucsd.edu/~varghese/PAPERS/willpaper.pdf
- B. Lampson, V. Srinivasan, G. Varghese, "IP lookups using multiway and multicolumn search", IEEE Infocom, 1998
- M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable high-speed IP lookups", ACM Trans. on Computer Systems, Nov. 2001

Overview
- Longest matching prefix
- Classification on multiple fields
  - Solution for the two-dimensional case: grid of tries
  - Bit vector linear search
  - Cross-producting
  - Decision tree approaches
- Signature matching

Packet classification problem
- Required for security and for recognizing packets with quality-of-service requirements
- Guard: prefixes or ranges for k header fields
  - Typically source and destination prefix, source and destination port range, and an exact value or * for the protocol
  - All fields must match for a rule to apply
- Action: drop, forward, map to a certain traffic class
- Input: a tuple with the values of the k header fields
- Output: the action associated with the first rule that matches the packet (rules are strictly ordered)
- Size of problem: thousands of classification rules

Example of classification rule set
- Setting: a router that filters traffic between the Internet and an internal network Net, with a mail gateway M, an internal time server TI, an external time server TO, and a secondary name server S.

    Rule  Destination IP  Source IP  Dest Port  Src Port  Protocol
    R1    M               *          25         *         *
    R2    M               *          53         *         UDP
    R3    M               S          53         *         *
    R4    M               *          23         *         *
    R5    TI              TO         123        123       UDP
    R6    *               Net        *          *         *
    R7    Net             *          *          *         TCP/ACK
    R8    *               *          *          *         *
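Before the fast algorithms, a minimal sketch of the problem semantics: rules are tried in order and the first match wins. The struct layout, the numeric addresses standing in for M, TI and TO, and the rule subset shown are all made up for illustration, and the TCP/ACK condition of R7 is ignored.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical encoding of a classification rule: prefixes for the two IP
 * fields, ranges for the two port fields, exact value or wildcard for the
 * protocol. The first matching rule wins, so rules[] must stay ordered. */
struct rule {
    uint32_t dst_prefix, src_prefix;
    int      dst_len,    src_len;        /* prefix lengths, 0 = wildcard */
    uint16_t dport_lo, dport_hi;
    uint16_t sport_lo, sport_hi;
    int      proto;                      /* -1 = wildcard */
    const char *action;
};

struct packet {
    uint32_t dst, src;
    uint16_t dport, sport;
    int      proto;
};

static int prefix_match(uint32_t addr, uint32_t prefix, int len)
{
    uint32_t mask = len ? ~0u << (32 - len) : 0;
    return (addr & mask) == (prefix & mask);
}

/* Reference semantics: try the rules in order, return the first match. */
static const char *classify(const struct rule *r, int n, const struct packet *p)
{
    for (int i = 0; i < n; i++) {
        if (prefix_match(p->dst, r[i].dst_prefix, r[i].dst_len) &&
            prefix_match(p->src, r[i].src_prefix, r[i].src_len) &&
            p->dport >= r[i].dport_lo && p->dport <= r[i].dport_hi &&
            p->sport >= r[i].sport_lo && p->sport <= r[i].sport_hi &&
            (r[i].proto < 0 || r[i].proto == p->proto))
            return r[i].action;
    }
    return "no match";
}

int main(void)
{
    /* Made-up addresses standing in for M, TI and TO of the example. */
    uint32_t M = 0x0A000019u, TI = 0x0A00007Bu, TO = 0xC6336401u;
    struct rule rules[] = {
        { M, 0, 32, 0,    25, 25,    0, 65535, -1, "R1: mail to gateway"    },
        { M, 0, 32, 0,    53, 53,    0, 65535, 17, "R2: DNS/UDP to gateway" },
        { TI, TO, 32, 32, 123, 123, 123, 123,  17, "R5: NTP between servers"},
        { 0, 0, 0, 0,      0, 65535,  0, 65535, -1, "R8: default"           },
    };
    struct packet p = { M, 0x12345678u, 25, 40000, 6 /* TCP */ };
    printf("%s\n", classify(rules, 4, &p));   /* prints the R1 action */
    return 0;
}
```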
A geometric view of packet classification
- Each rule corresponds to a box in the k-dimensional space of header values; [Figure: rules R1, R2 and R3 drawn as rectangles over the plane spanned by the source and destination address spaces.]
- In theory, the number of regions defined can be much larger than the number of rules
- Any algorithm that guarantees O(n) space for all rule sets of size n needs O((log n)^(k-1)) time for classification

The two-dimensional case: source and destination IP addresses
- For each destination prefix in the rule set, link to the corresponding node of the destination IP trie a trie with the source prefixes of the rules using that destination prefix
- The matching algorithm must use backtracking to visit all relevant source tries
- Grid of tries: by precomputing "switch pointers" in the source tries and propagating some information about more general rules, matching can proceed without backtracking
- Memory used is proportional to the number of rules; matching time is O(w), with the constant depending on the stride
- An extended grid of tries handles 5 fields and has good run time and memory usage in practice

Bit vector linear search
- Bit vector approaches do a linear search through the rule set, but over bitmaps with one bit per rule
- For each field, we precompute a structure (e.g. a trie) to find the most specific prefix or range distinguished by the rule set
- With each such prefix or range we associate a bitmap of size n encoding which of the rules may match a packet in that prefix or range
- The classification algorithm first computes, for each field of the packet, the most specific prefix/range it belongs to; AND-ing together the k bitmaps of size n then yields the matching rules
- [Figure: per-field rule bitmaps for the example rule set; for one packet the AND of the five bitmaps is 00000101 & 11010011 & 00000111 & 11110111 & 10110111 = 00000001, so only the default rule R8 matches.]
- Works well for hardware solutions that allow wide memory reads
- Scales poorly to large rule sets
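A minimal sketch of the scheme for at most 64 rules, reusing the toy rule encoding from the earlier sketch. In a real implementation the per-field bitmaps come from precomputed lookup structures (one trie or range structure per field); here they are recomputed with naive per-field scans purely to keep the sketch short, so only the bitmap AND and the first-set-bit step are representative.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of bit-vector classification for up to 64 rules.
 * Bit i of a bitmap means "rule i may match". */
struct rule {
    uint32_t dst_prefix, src_prefix;
    int      dst_len,    src_len;        /* 0 = wildcard */
    uint16_t dport_lo, dport_hi, sport_lo, sport_hi;
    int      proto;                      /* -1 = wildcard */
};

static int prefix_match(uint32_t a, uint32_t p, int len)
{
    uint32_t mask = len ? ~0u << (32 - len) : 0;
    return (a & mask) == (p & mask);
}

/* Per-field bitmaps: bit i set if rule i is compatible with this field alone.
 * These scans stand in for the precomputed per-field lookup structures. */
static uint64_t dst_bits(const struct rule *r, int n, uint32_t v)
{
    uint64_t b = 0;
    for (int i = 0; i < n; i++)
        if (prefix_match(v, r[i].dst_prefix, r[i].dst_len)) b |= 1ull << i;
    return b;
}
static uint64_t src_bits(const struct rule *r, int n, uint32_t v)
{
    uint64_t b = 0;
    for (int i = 0; i < n; i++)
        if (prefix_match(v, r[i].src_prefix, r[i].src_len)) b |= 1ull << i;
    return b;
}
static uint64_t dport_bits(const struct rule *r, int n, uint16_t v)
{
    uint64_t b = 0;
    for (int i = 0; i < n; i++)
        if (v >= r[i].dport_lo && v <= r[i].dport_hi) b |= 1ull << i;
    return b;
}
static uint64_t sport_bits(const struct rule *r, int n, uint16_t v)
{
    uint64_t b = 0;
    for (int i = 0; i < n; i++)
        if (v >= r[i].sport_lo && v <= r[i].sport_hi) b |= 1ull << i;
    return b;
}
static uint64_t proto_bits(const struct rule *r, int n, int v)
{
    uint64_t b = 0;
    for (int i = 0; i < n; i++)
        if (r[i].proto < 0 || r[i].proto == v) b |= 1ull << i;
    return b;
}

/* Classification = AND of the five per-field bitmaps; the lowest set bit is
 * the first (highest priority) matching rule. */
static int classify(const struct rule *r, int n, uint32_t dst, uint32_t src,
                    uint16_t dport, uint16_t sport, int proto)
{
    uint64_t m = dst_bits(r, n, dst) & src_bits(r, n, src) &
                 dport_bits(r, n, dport) & sport_bits(r, n, sport) &
                 proto_bits(r, n, proto);
    for (int i = 0; i < n; i++)
        if (m & (1ull << i)) return i;
    return -1;
}

int main(void)
{
    uint32_t M = 0x0A000019u;            /* made-up address for the gateway M */
    struct rule rules[] = {
        { M, 0, 32, 0, 25, 25,    0, 65535, -1 },   /* R1 */
        { M, 0, 32, 0, 53, 53,    0, 65535, 17 },   /* R2 */
        { 0, 0,  0, 0,  0, 65535, 0, 65535, -1 },   /* R8: default */
    };
    printf("rule index %d\n",
           classify(rules, 3, M, 0x12345678u, 53, 1234, 17));
    return 0;                            /* prints "rule index 1" (R2) */
}
```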
Cross-producting
- Cross-producting performs longest prefix matching separately on each field and combines the results in a single step, by looking up the matching rule in a precomputed table that explicitly lists the first matching rule for each element of the cross-product.
- The size of this table is the product of the numbers of prefixes/ranges recognized for the individual fields; because of its memory requirements, simple cross-producting is not feasible.
- [Figure: for the example rule set there are 4 destination prefixes, 4 source prefixes, 5 destination port ranges, 2 source port ranges and 3 protocol values, so the cross-product table has 4 * 4 * 5 * 2 * 3 = 480 entries, from entry 0 = (M, S, 25, 123, UDP) -> R1 to entry 479 = (*, *, *, *, *) -> R8.]
- Equivalenced cross-producting (a.k.a. recursive flow classification, RFC) combines the results of the per-field longest matching prefix operations two at a time. The pairs of values are grouped into equivalence classes, and in general there are far fewer equivalence classes than pairs of values, which gives significant memory savings compared to simple cross-producting.
- [Figure: combining the destination IP and source IP results for the example rule set gives 4 * 4 = 16 pairs of values but only 8 distinct rule bitmaps, i.e. 8 equivalence classes.]
- RFC provides fast packet classification, but compared to other algorithms its memory requirements are relatively large (though feasible in some settings).

Decision tree approaches
- At each node of the tree, test a bit of a field or perform a range test
- Large fan-out leads to shallow trees and fast classification
- Leaves contain a few rules that are traversed linearly; interior nodes may also contain rules that match
- Tests may look at bits from multiple fields
- A rule may appear in multiple nodes of the decision tree, which can increase memory usage
- The tree is built using heuristics that pick the fields to test so that the remaining rules are divided relatively evenly among the descendants
- Fast and compact on the rule sets used today

Papers on packet classification
- G. Varghese, "Network Algorithmics ...", chapter 12
- V. Srinivasan, G. Varghese, S. Suri, M. Waldvogel, "Fast and Scalable Layer Four Switching", ACM SIGCOMM, Sep. 1998
- F. Baboescu, S. Singh, G. Varghese, "Packet classification for core routers: Is there an alternative to CAMs?", IEEE Infocom, 2003
- P. Gupta, N. McKeown, "Packet classification on multiple fields", ACM SIGCOMM, 1999
- T. Woo, "A modular approach to packet classification: Algorithms and results", IEEE Infocom, 2000
- S. Singh, F. Baboescu, G. Varghese, "Packet classification using multidimensional cutting", ACM SIGCOMM, 2003

Overview
- Longest matching prefix
- Classification on multiple fields
- Signature matching
  - String matching
  - Regular expression matching with DFAs and D2FAs

Signature matching
- Used in intrusion prevention/detection, application classification, load balancing
- Guard: a byte string or a regular expression
- Action: drop the packet, log an alert, set a priority, direct to a specific server
- Input: a byte string from the payload of the packet(s); hence the name "deep packet inspection"
- Output: the positions at which the various signatures match, or the identifier of the "highest priority" signature that matches
- Size of problem: hundreds of signatures per protocol

String matching
- The most widely used early form of deep packet inspection; by now the more expressive regular expressions have superseded plain strings
- Still used as a pre-filter for more expensive matching operations by the popular open-source IDS/IPS Snort
- Matching multiple strings is a well-studied problem: A. Aho, M. Corasick, "Efficient string matching: An aid to bibliographic search", Communications of the ACM, June 1975
- Many hardware-based solutions published in the last decade
- Matching time is independent of the number of strings; memory requirements are proportional to the sum of their sizes

Regular expression matching
- Deterministic and non-deterministic finite automata (DFAs and NFAs) can match regular expressions
  - NFAs are more compact, but require backtracking or keeping track of sets of states during matching
  - Both representations are used in hardware and software solutions, but only DFA-based solutions can guarantee throughput in software
- DFAs have a state space explosion problem
  - From the DFAs recognizing the individual signatures we can build a DFA that recognizes the entire signature set in a single pass
  - The size of the combined DFA is much larger than the sum of the sizes of the DFAs recognizing the individual signatures
  - Multiple combined DFAs are used to match a signature set
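A minimal sketch of table-driven DFA matching over a payload, with made-up contents: the automaton below is built for the single signature "EVIL" (occurring anywhere in the payload), whereas a real IDS would use a combined DFA built from many signatures. The per-byte work is one transition-table read, which is why DFA-based matching can guarantee throughput.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NSTATES  5                       /* states 0..4 = bytes of "EVIL" matched */
#define ALPHABET 256

static int trans[NSTATES][ALPHABET];     /* transition table */
static int accepting[NSTATES];           /* 1 if reaching this state is a match */

/* Build the DFA for "payload contains EVIL" (KMP/Aho-Corasick style fallback). */
static void build_table(void)
{
    const char *pat = "EVIL";
    int m = (int)strlen(pat);            /* state s = number of pattern bytes matched */
    for (int s = 0; s <= m; s++) {
        accepting[s] = (s == m);
        for (int c = 0; c < ALPHABET; c++) {
            if (s == m) { trans[s][c] = m; continue; }   /* stay once matched */
            /* largest k such that pat[0..k-1] is a suffix of pat[0..s-1] + c */
            int k = s + 1;
            while (k > 0) {
                int ok = (pat[k - 1] == (char)c);
                for (int j = 0; ok && j < k - 1; j++)
                    if (pat[j] != pat[s - (k - 1) + j]) ok = 0;
                if (ok) break;
                k--;
            }
            trans[s][c] = k;
        }
    }
}

/* The per-byte work at line speed: one table read per payload byte. */
static int scan(const uint8_t *payload, int len)
{
    int state = 0;
    for (int i = 0; i < len; i++) {
        state = trans[state][payload[i]];
        if (accepting[state])
            return i;                    /* offset at which the signature matched */
    }
    return -1;
}

int main(void)
{
    build_table();
    const char *msg = "GET /index.html?x=EVILSTUFF HTTP/1.0";
    int pos = scan((const uint8_t *)msg, (int)strlen(msg));
    printf("match ending at offset %d\n", pos);   /* prints 21 */
    return 0;
}
```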
S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, "Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection", ACM SIGCOMM, September 2006

Delayed Input DFA (D2FA)
- In a table-driven DFA, a "current state" variable is updated once per input character; if it meets an acceptance condition (e.g. the state identifier is larger than a given threshold), the automaton raises an alert.
- D2FAs build on the observation that for many pairs of states the transition tables are very similar, so it is enough to store the differences: a state keeps only the transitions that differ from those of another state, plus a default transition to that state.
- [Figure: the transition tables of a small DFA driven by the input ...410052...; for example, states 1 and 2 differ in only a few of their eight transitions, so state 1 can store just the differing entries and a default transition to state 2.]
- The lookup algorithm may need to follow multiple default transitions until it reaches a state that explicitly stores the transition it needs. Since this is a throughput concern, the algorithm for constructing D2FAs allows the user to set a limit on the maximum default path length.

    Set of          D2FAs, no bound on default path length    D2FAs, d.p.l. ≤ 4
    regular expr.   Avg. d.p.l.   Max d.p.l.   Memory         Memory
    Cisco590        18.32         57           0.80%          1.56%
    Cisco103        16.65         54           0.98%          1.54%
    Cisco7          19.61         61           2.58%          3.31%
    Linux56          7.68         30           1.64%          1.87%
    Linux10          5.14         20           8.59%          9.08%
    Snort11          5.86          9           1.57%          1.66%
    Bro648           6.45         17           0.45%          0.51%

  The memory columns report the ratio between the number of transitions used by the D2FA and by the corresponding DFA.

Conclusions
- Networking devices implement more and more complex data plane processing to better control traffic
- The algorithms and data structures used have a big performance impact
- Often the set of rules to be matched against has a specific structure
- Algorithms that exploit this structure may give good performance even when it is impossible to find an algorithm that performs well on all possible rule sets

That's all folks!