Ultra-High Throughput Low-Power Packet Classification
Author: Alan Kennedy and Xiaojun Wang
Publisher: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Presenter: Ching Hsuan Shih
Date: 2013/10/24

Outline
I. Introduction
II. Decision Tree-Based Packet Classification
III. Algorithmic Modifications
IV. Packet Classification Engine
V. Performance Results
VI. Conclusion

I. Introduction
• The current amount of energy used by networking devices worldwide could exceed the yearly output of 21 typical nuclear reactor units [2].
• Power consumption should be a key concern when designing any new networking equipment.
• Network processors are under pressure from a growing number of tasks such as packet fragmentation, reassembly, and classification. Two ways to relieve this pressure:
  • Adding extra processing capacity and ramping up clock speeds: gaining performance this way is difficult due to physical limitations in the silicon and tight power budgets.
  • Using hardware accelerators: these can reduce power consumption while increasing processing capacity.

I. Introduction (Cont.)
• In software:
  • Analysis [12] showed that even RFC [5], the best-performing algorithm in terms of throughput, can only classify around 400,000 packets per second.
• In hardware:
  • Approaches using power-hungry TCAM can classify packets at core network line speeds, which can exceed 40 Gb/s.
• The classifier presented here:
  • A modified version of HyperCuts that can classify packets in parallel at speeds of up to 138.56 Gb/s.

II. Decision Tree-Based Packet Classification
HyperCuts packet classification algorithm
• Cuts multiple dimensions at a time.
• Creates a decision tree by taking a geometric view of a ruleset, with each rule considered to be a hypercube in hyperspace.
• Works by breaking a ruleset into groups, with each group containing a small number of rules suitable for a linear search.

II. Decision Tree-Based Packet Classification (Cont.)
A. Building a Decision Tree (HyperCuts)
1. Decide a value for spfac and binth:
  • spfac controls how many cuts can be made to each root or internal node.
  • binth limits the number of rules at leaf nodes.
  • In the following example, spfac is 3 and binth is 2.

II. Decision Tree-Based Packet Classification (Cont.)
2. Decide which dimensions to cut:
  • Calculate the number of distinct range specifications for each field.
    • In the example, the source and destination IPs both have 6, the source and destination ports both have 4, and the protocol number has 2, giving a mean of 4.4.
  • Fields whose number of distinct range specifications is greater than or equal to the mean are considered for cutting.
    • The source and destination IPs are therefore considered for cutting.

II. Decision Tree-Based Packet Classification (Cont.)
3. Decide how many cuts should be made:
  I. Max cuts to node i ≤ spfac × sqrt(number of rules at node i), where i is the internal or root node being cut.
    • The maximum number of cuts that can be made to the root node in this example is 7.9.
    • The number of cuts is limited to a power of 2 for ease of implementation, which means that a maximum of 4 cuts can be performed.
  II. Try all combinations of cuts between the chosen dimensions.
    • The combinations of cuts to the source and destination IPs are [0, 2], [0, 4], [2, 0], [2, 2], and [4, 0].
    • The combination that results in the smallest maximum number of rules stored in a child node is [2, 2], i.e. cutting both the source and destination IPs in two.
Note.
• Small spfac → fewer cuts to nodes, creating a deep and narrow decision tree.
  • Requires less memory but has a longer search time.
• Large spfac → more cuts, resulting in a wide but shallow decision tree.
  • Requires more memory but has a shorter search time.

II. Decision Tree-Based Packet Classification (Cont.)
EX. A packet with a header [0001, 0111, 50, 80, UDP]
1. Root node → 2 cuts performed to both the source and destination IPs → only the MSB of each IP needs to be examined [0001, 0111] → these bits are concatenated to form the index 00.
2. Internal node → 4 cuts performed to the destination IP → the next 2 bits of the destination IP need to be examined [0111] → giving the index 11 → linear search in a leaf node → R6 is returned as the matching rule.

II. Decision Tree-Based Packet Classification (Cont.)
B. Heuristics Used to Reduce Memory Usage
1. Node merging
  • Pointers to leaf nodes that contain the same list of rules are modified so that they point to just one of these leaf nodes.
2. Rule overlap
  • A rule is removed from a leaf node if a rule with a higher priority completely covers it within the leaf node's subregion, because it can never be matched.
3. Pushing common rule subset upward (not used in the modified HyperCuts of this paper)
  • Store at an internal or root node the rules that would otherwise need to be stored in all of that node's subregions.
4. Region compaction

II. Decision Tree-Based Packet Classification (Cont.)
• Region compaction:
  • Results in fewer cuts, thus reducing memory consumption.

III. Algorithmic Modifications
1. Cutting Scheme
2. Compacting of a Region Through Pre-Cutting
3. Rule Storage
4. Cut Selection
5. Memory Organization

III. Algorithmic Modifications (Cont.)
1. Cutting Scheme:
• Requires three values to be specified before building the decision tree:
  1. Number of cuts to be made to the root node → # of cuts = 2^n, n is 1~18
  2. Maximum number of cuts that can be made to an internal node → # of cuts = 2^m, m is 1~4
  3. Maximum number of rules that a leaf node can store

III. Algorithmic Modifications (Cont.)
1. Cutting Scheme:
• Perform the majority of cuts to the root node → resulting in a shallow decision tree.
• Make only a few cuts to an internal node → prevents the decision tree from using too much memory → the information needed to traverse an internal node can fit in a single memory word.
• Use the same method as HyperCuts to select the fields to cut and to decide how many cuts to make.
  • One difference is that all combinations of cuts between the chosen fields that equal the 2^n limit are tried on the root node.

III. Algorithmic Modifications (Cont.)
2. Compacting of a Region Through Pre-Cutting:
• Why not use region compaction?
  → It requires floating-point division when a packet traverses the decision tree. It also requires the minimum and maximum values of the area covered by all fields in order to calculate the index of the child node.

III. Algorithmic Modifications (Cont.)
Region Compaction: A packet with a destination IP of 0111
  d = ((7 − 5) + 1) / 2 = 1.5
  index = floor((7 − 5) / 1.5) = 1

III. Algorithmic Modifications (Cont.)
2. Compacting of a Region Through Pre-Cutting: For a packet with a destination IP of 0111, the child node index can be calculated simply by using its third MSB.

III. Algorithmic Modifications (Cont.)
3. Rule Storage:
• Store the actual rule in the leaf node rather than a pointer to the rule.
  • A small increase in memory consumption for some rulesets and a reduction for others, as pointers to rules do not need to be stored.
  • A large increase in throughput, as data are presented to the classifier one clock cycle earlier.
• Encoding scheme
  • An IP address usually requires 32 bits to store its address and 6 bits to store its mask.
  • Reduces the number of bits required to store the source and destination IPs from 76 bits down to 70 bits.
  • Only a slight increase in the logic needed to decode the information.

III. Algorithmic Modifications (Cont.)
• Encoding scheme:
  • Store the 32-bit IP address and 6-bit mask as a 35-bit number.
  • LSB of the 35 bits →
    0, if more than 28 bits of the IP address need to be matched exactly.
       → 32 bits for the IP address, 2 bits indicating the number of don't-care bits.
    1, otherwise.
       → 28 bits for the IP address, 6 bits indicating the number of bits that need to be matched.

III. Algorithmic Modifications (Cont.)
4. Cut Selection (the information needed to calculate the index of a child node):
• The cutting information for each field consists of two pre-computed values:
  • Cuts (also the length of the bit-mask for a given field) → the number of cuts in the field.
    EX. An 8-bit protocol number is limited to 256 cuts and can only have 0, 2, 4, 8, 16, 32, 64, 128, or 256 cuts performed to it. A 4-bit number is therefore used for Cuts to represent the nine possible cut values.
  • BPos → the number of lower bits in a packet field that need to be removed by shifting the field right to calculate a child node index.
    EX. The protocol number requires three bits to store its BPos value, as it needs to be shifted right by 0~7 places.

III. Algorithmic Modifications (Cont.)
4. Cut Selection (the information needed to calculate the index of a child node):
• The child node index is generated in two stages:
  • Generate the subindex for each field.
  • Concatenate these subindices to form the final 18-bit index.

III. Algorithmic Modifications (Cont.)
5. Memory Organization:
• Uses 324-bit-wide memory words.
• The root node requires 18 bits to store each of its child node pointers.
• Each internal node fits fully in one memory word.
• Each rule in a leaf node requires 162 bits.

III. Algorithmic Modifications (Cont.)
5. Memory Organization:
• A memory map showing how to save a decision tree with 32 cuts made to the root node, 2 internal nodes, and 4 leaf nodes containing 1~6 rules.

IV. Packet Classification Engine
Architecture of the Classifier:
• Two modules:
  • A tree traverser that is used to traverse a decision tree.
  • A leaf node searcher that employs two comparator blocks that work in parallel.
• Information on the decision tree's root node is stored in registers in the tree traverser.
• The tree traverser can begin a new packet while the previous packet is being compared with rules in a leaf node.

IV. Packet Classification Engine (Cont.)
Architecture of the Classifier:
• 8 packet classification engines work in parallel on both the Stratix III and the Cyclone III.
• Rulesets that contain many wildcard rules can be broken up into groups.
  • Rules with a wildcard source IP can be kept in one group, and rules with a wildcard destination IP in another.
• Helps to ensure that the bandwidth of an FPGA's internal memory is better utilized.

IV. Packet Classification Engine (Cont.)
Sorter logic block:
The sorter logic block registers the Match, NoMatch, and RuleID signals for a classified packet to a chain of registers and multiplexers in series. The register selected depends on the packet ID number. The Match, NoMatch, and RuleID signals are registered to the output register if they are next in the sequence of results to be output, and stored if not. All stored results are shifted toward the output register each time a result appears that is due to be output. This means that the classification results leave the classifier in the same order in which the packets entered it.

IV. Packet Classification Engine (Cont.)
Supporting IPv6 Packet Classification:
• Widen the memory words from 324 bits to 348 bits, with a memory word storing one rule instead of two.
• The tree traverser uses more logic resources, such as larger multiplexers.
• The leaf node searcher needs a larger comparator block.
• Root and internal nodes require an extra 4 bits to store their cutting information.

V. Performance Results
Hardware Implementation Parameters:
• Stratix III: processes packets at line rates of up to 138.56 Gb/s, as minimum-sized 40-byte packets can arrive back-to-back.
• Cyclone III: reaches line speeds of up to 70 Gb/s.

V. Performance Results (Cont.)
Memory Usage and Worst Case Number of Memory Accesses:

V. Performance Results (Cont.)
Evaluation Against Prior Art:
• In software alone, RFC, HiCuts, HyperCuts, TSS, and EGT-PC can only classify packets at speeds of 400,973, 57,042, 32,242, 10,700, and 7,491 p/s, respectively.
SF: Switch Factor

V. Performance Results (Cont.)
Throughput Versus Power Consumption:
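
As a rough cross-check of the figures quoted under Hardware Implementation Parameters and Evaluation Against Prior Art, line rate and packet rate can be converted into each other for minimum-sized 40-byte packets. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the line rate counts only the 40 packet bytes, with no framing or inter-packet overhead (the slides do not state this).

```python
# Illustrative conversion between line rate and packet rate.
# Assumption (not stated in the slides): the line rate counts only the
# packet bytes, with no framing or inter-packet overhead.

MIN_PACKET_BYTES = 40  # minimum-sized packet quoted in the slides


def packets_per_second(line_rate_gbps, packet_bytes=MIN_PACKET_BYTES):
    """Packet rate that corresponds to a line rate given in Gb/s."""
    return line_rate_gbps * 1e9 / (packet_bytes * 8)


def line_rate_gbps(pps, packet_bytes=MIN_PACKET_BYTES):
    """Line rate in Gb/s that corresponds to a packet rate in packets/s."""
    return pps * packet_bytes * 8 / 1e9


if __name__ == "__main__":
    # Hardware line rates quoted in the slides.
    for name, gbps in [("Stratix III", 138.56), ("Cyclone III", 70.0)]:
        print(f"{name}: {packets_per_second(gbps) / 1e6:.2f} Mpps")
    # Best software result quoted in the slides (RFC, 400,973 p/s).
    print(f"RFC in software: {line_rate_gbps(400_973):.3f} Gb/s")
```

Under these assumptions, 138.56 Gb/s corresponds to about 433 million packets per second and 70 Gb/s to about 219 million, while RFC's 400,973 p/s corresponds to roughly 0.13 Gb/s.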