Routing in Routers
Hans Jørgen Furre Nygårdshaug (hansjny), Christian Fredrik Fossum Resell (chrifres), Hans Petter Taugbøl Kragset (hpkragse)

Outline
● Router architecture
  ○ Network Processor
● Lookup algorithms
  ○ Address-cache based
  ○ Binary tries
    ■ Path compressed
    ■ Multibit
    ■ Level compressed
  ○ Hardware based
● Packet classification
● Switch fabric routing

Large scale router architecture
● A router consists of several logical components:
  ○ Packet processing
  ○ Address lookup
  ○ Traffic management
  ○ Physical layer

The Network Processor
● NPs are specialised processors
● High rate of change in the network world
  ○ Flexible processors wanted
    ■ Performance vs. adaptability tradeoff
● Line rate is faster than the processing rate (as of 2002)
  ○ To process all packets, parallelisation is required
  ○ Traffic can be split between processors based on flow (destination/source)
    ■ Traffic management etc. is only applicable per flow
  ○ Parallelisation introduces IPC issues
● NPs are classified broadly as either configurable or programmable

Configurable NPs
● Co-processors with specialised abilities
  ○ Lookup, classification, forwarding
  ○ Connected by configurable connections
● Pipeline packets through different steps in parallel
  ○ A manager controls and schedules transitions
● Optimised with a narrow instruction set
  ○ Designed for performance
● Not adaptive

Programmable NPs
● Multiple RISC or RISC-cluster task units
● A controller provides the instruction set to the RISCs
● The RISCs handle packets as instructed
  ○ May distribute tasks to co-processors
● Flexible and adaptive; instruction sets can change
  ○ This means the RISCs can’t be optimised
● More time consuming
  ○ May not meet line speed

Lookup algorithms
● Why is prefix matching worse than exact lookup?
● Cache based
● Trie-based lookup schemes

Why is prefix matching so hard?
● Incoming packet: destination address 10.138.5.8
● The router does not know the length of the prefix
● Hence: it must search both the space of all prefix lengths and the space of all prefixes of a given length
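To make the problem concrete, here is a minimal Python sketch (not from the slides; the routing-table contents are invented) of the naive approach: keep prefixes in a hash table and probe every possible prefix length, longest first. The trie- and hardware-based schemes below exist to avoid exactly this per-length probing.

```python
# Minimal sketch: longest-prefix match done naively with a plain dict keyed by
# (prefix value, prefix length). For a 32-bit address this may probe all 32
# prefix lengths per packet.

def ip_to_int(addr):
    a, b, c, d = (int(x) for x in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# Illustrative routing table: (prefix value, prefix length) -> next hop
ROUTES = {
    (ip_to_int("10.0.0.0") >> 24, 8): "A",
    (ip_to_int("10.138.0.0") >> 16, 16): "B",
    (ip_to_int("10.138.5.0") >> 8, 24): "C",
}

def lookup(addr):
    ip = ip_to_int(addr)
    # Exact match per prefix length, longest first: up to 32 probes per packet.
    for length in range(32, 0, -1):
        key = (ip >> (32 - length), length)
        if key in ROUTES:
            return ROUTES[key]
    return None

print(lookup("10.138.5.8"))   # -> "C" (the longest matching prefix wins)
```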
Cache based
● Cache recently used addresses
● Depends on locality of the packet stream
● With more multiplexing into high-speed links, locality decreases
● Not sufficient anymore

Trie-based schemes
Binary trie: [figure: a bit-by-bit trie storing the prefixes P1–P5]
● Node structure: next-hop pointer, left pointer, right pointer
● Lookup example, address 10111010:
  ○ Walk the trie one bit at a time (0 = left, 1 = right), remembering the most recent next-hop pointer seen
  ○ Current next-hop pointer: none → P2 → P4
  ○ When there is no matching branch left, return the last next-hop pointer: P4
● Lookup time complexity: O(W)
● Storage requirement: O(NW)
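A minimal sketch of the binary-trie lookup just described, assuming the node layout from the slide (next-hop pointer, left pointer, right pointer); the bit strings used for P1–P5 are our reading of the figure, not given explicitly in the slides.

```python
# Minimal binary-trie sketch: walk the address bit by bit and remember the last
# next hop seen, which is the longest matching prefix when the walk stops.

class Node:
    def __init__(self):
        self.next_hop = None   # set if a prefix ends at this node
        self.left = None       # bit 0
        self.right = None      # bit 1

def insert(root, prefix_bits, next_hop):
    """prefix_bits is a string such as '1011'."""
    node = root
    for bit in prefix_bits:
        if bit == "0":
            if node.left is None:
                node.left = Node()
            node = node.left
        else:
            if node.right is None:
                node.right = Node()
            node = node.right
    node.next_hop = next_hop

def lookup(root, addr_bits):
    node, best = root, root.next_hop
    for bit in addr_bits:
        node = node.left if bit == "0" else node.right
        if node is None:
            break                      # no deeper match possible
        if node.next_hop is not None:
            best = node.next_hop       # remember the last next hop seen
    return best                        # longest matching prefix

# Prefixes loosely modelled on the slide's P1..P5 example (assumed bit strings).
root = Node()
for bits, hop in [("0", "P1"), ("10", "P2"), ("11", "P3"),
                  ("1011", "P4"), ("10110", "P5")]:
    insert(root, bits, hop)

print(lookup(root, "10111010"))  # -> "P4", as in the walkthrough
```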
Path-compressed tries
● Removes single-descendant internal nodes (nodes without a next-hop pointer)
● All leaf nodes contain a prefix and a next-hop pointer
● Node structure: bit string, next-hop pointer, bit position, left pointer, right pointer
  ○ [figure: the compressed trie with nodes such as ",,1", "1,,2", "10, P2, 4" and "1011, P4, 5"]
● Lookup example, address 1001:
  ○ Follow only the bit positions stored in the nodes down the trie
  ○ At the end, compare the stored bit string with the address: the prefix does not match, so return the last matching next hop on the path, P2

Multibit tries
● Each node contains 2^k pointers
● If a prefix length is not a multiple of k, expand the prefix

Example: 2-bit tries
Ptr | Prefix | Expanded
P1  | 0      | 00
P1  | 0      | 01
P2  | 10     | 10
P3  | 11     | 11
P4  | 1011   | 1011
P5  | 10110  | 101100
P5  | 10110  | 101101

[figure: the resulting 2-bit trie — root slots 00/01 → P1, 10 → P2, 11 → P3; second level 11 → P4; third level 00/01 → P5]
● Lookup example, address 10011: one step instead of five
● Lookup time complexity: O(W/k)
● Storage requirement: O(2^(k-1) N W)

Level-compressed tries
● Combines path compression and multibit nodes
● Replaces the i topmost complete levels of a subtrie with a single node with 2^i descendants
● Performed recursively on each subtrie
● Updates? Rebuild the structure.
ref [6]: Fast IP Routing with LC-Tries, Stefan Nilsson and Gunnar Karlsson, August 1, 1998
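Before moving on to hardware schemes, a minimal sketch of a fixed-stride (k = 2) multibit-trie lookup, using the expanded prefixes from the 2-bit table above. The node layout and helper names are our assumptions, not taken from the referenced papers.

```python
# Minimal fixed-stride multibit-trie sketch with k = 2. Prefixes whose length is
# not a multiple of k are pre-expanded, as in the slide's table (e.g. P1 = 0
# becomes 00 and 01, P5 = 10110 becomes 101100 and 101101).

K = 2

class MultibitNode:
    def __init__(self):
        self.next_hop = [None] * (1 << K)   # one slot per k-bit value
        self.child = [None] * (1 << K)

def insert(root, expanded_bits, hop):
    """expanded_bits has a length that is a multiple of K."""
    node = root
    chunks = [expanded_bits[i:i + K] for i in range(0, len(expanded_bits), K)]
    for chunk in chunks[:-1]:
        idx = int(chunk, 2)
        if node.child[idx] is None:
            node.child[idx] = MultibitNode()
        node = node.child[idx]
    node.next_hop[int(chunks[-1], 2)] = hop

def lookup(root, addr_bits):
    node, best = root, None
    for i in range(0, len(addr_bits), K):
        idx = int(addr_bits[i:i + K], 2)
        if node.next_hop[idx] is not None:
            best = node.next_hop[idx]       # remember the longest match so far
        node = node.child[idx]
        if node is None:
            break
    return best

# Expanded prefixes from the slide's 2-bit example.
root = MultibitNode()
for bits, hop in [("00", "P1"), ("01", "P1"), ("10", "P2"), ("11", "P3"),
                  ("1011", "P4"), ("101100", "P5"), ("101101", "P5")]:
    insert(root, bits, hop)

print(lookup(root, "100110"))   # -> "P2" (each stride consumes k bits at once)
```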
Hardware based schemes

DIR-24-8-BASIC
● Fantastic name
● Pipeline in hardware
● Approx. 100 million packets per second with “current” (2002) SDRAM technology
● Two tables used for lookup
  ○ TBL24: prefixes up to and including 24 bits long
  ○ TBLlong: prefixes longer than 24 bits
● The 24-8 split is reasonable
  ○ Table size is not a problem
  ○ Over 99 percent of prefixes are 24 bits long or less
● Simple and efficient
● Requires large amounts of memory
● Bad scaling

TBL24 entry format
● If the longest route with this 24-bit prefix is at most 24 bits long: flag bit 0 (1 bit), then the next hop (15 bits)
● If the longest route with this 24-bit prefix is longer than 24 bits: flag bit 1 (1 bit), then an index into the second table (15 bits)

[figure: the first 24 bits (bits 31–8) of the destination address index TBL24 (2^24 entries); when the flag is set, the entry's index together with the last 8 bits selects the TBLlong entry holding the next hop]

SRAM-based Pipeline Scheme
● Needs only SRAMs
  ○ Avoids the memory refresh overhead of DRAM
● Segment and offset pairs
● Segmentation table: 64k entries
  ○ Each entry holds a next hop, or a pointer to a second-level table (next-hop array, NHA)
  ○ Offset length k: 0 < k ≤ 16
● Each NHA has a k determined by the longest prefix it stores, kept in the last 4 bits of the entry
● Difficult to scale to 128-bit IPv6 addresses

Ternary CAM
● Specialised memory
● (Binary) CAMs can only perform exact matching
● Stores (value, mask) pairs
● A priority encoder is used to select a match
● Forwarding table
  ○ Kept in decreasing order of prefix length
● Incremental update problem
● High price, large power consumption

TCAM
Line number | Address (binary) | Output port
1           | 101XX            | A
2           | 0110X            | B
3           | 011XX            | C
4           | 10011            | D
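A small software sketch of what the TCAM plus priority encoder compute, using the entries from the table above. A real TCAM compares all (value, mask) pairs in parallel in hardware; this loop only mimics the result, with the entries pre-sorted by decreasing prefix length so the first match is the longest prefix.

```python
# TCAM logic sketch: each entry is a (value, mask, port) triple; X bits are
# masked out. Entries are ordered by decreasing prefix length, so returning the
# first match imitates the priority encoder.

TCAM_TABLE = [
    (0b10011, 0b11111, "D"),   # line 4: 10011 (exact, length 5)
    (0b01100, 0b11110, "B"),   # line 2: 0110X (length 4)
    (0b10100, 0b11100, "A"),   # line 1: 101XX (length 3)
    (0b01100, 0b11100, "C"),   # line 3: 011XX (length 3)
]

def tcam_lookup(addr):
    for value, mask, port in TCAM_TABLE:
        if addr & mask == value:   # don't-care bits are masked away
            return port            # first match wins (priority encoder)
    return None

print(tcam_lookup(0b10110))   # -> "A" (matches 101XX)
print(tcam_lookup(0b01101))   # -> "B" (0110X beats the shorter 011XX)
```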
Packet classification
● ISPs want to provide different QoS
  ○ Must be implemented in routers
● Different customers pay for higher QoS
  ○ Bitrate, ping, transmission limit, ...
  ○ Routers must distinguish customers
● Routers implement rules
  ○ For a flow, all applicable rules form a classifier
  ○ For every packet, the router must use the classifier to see if the packet is allowed through
    ■ Needless to say, this isn’t trivial
● Classification algorithms
  ○ Speed, scalability, adaptability and flexibility are important
  ○ Not discussed further here!

Arbitration Algorithms for Single-Stage Switches

iSLIP
● Algorithm for switch fabric routing
● Input buffered
● Adaptive
  ○ Self balancing
  ○ Fair
● Non-starving
ref [2]: The iSLIP Scheduling Algorithm for Input-Queued Switches, Nick McKeown, 1999

[figures: a 4×4 switch example — inputs, outputs, Virtual Output Queues (VOQs) at each input, and one round-robin arbiter per input and per output]
● Step 1 – Request: each input requests every output it has a queued cell for
● Step 2 – Grant: each output grants the requesting input nearest its round-robin pointer
● Step 3 – Accept: each input accepts the granting output nearest its round-robin pointer

Changes
● In iSLIP, an output arbiter only advances its pointer when its grant is accepted
● In standard RRM, all output arbiters would advance because they issued a grant — this is iSLIP’s advantage

[figures: Request/Grant/Accept repeated over the following cell times]
● Now the arbiters are perfectly desynchronised!
● 100% throughput
● This goes on...

Traffic pattern
● Note that the previous example used uniform traffic
● In non-uniform traffic, iSLIP still provides fair queueing
  ○ Adapts fast and maintains round-robin fairness
● During bursty traffic, the delay is proportional to the burst length and N (the number of switch ports)
● Improvements and modifications exist:
  ○ Iterations
  ○ Priority
  ○ Weight
ref [2]: The iSLIP Scheduling Algorithm for Input-Queued Switches, Nick McKeown, 1999

Performance
● Performs better than naive algorithms
[figures: another Request/Grant/Accept sequence]
● 1 and 2 take turns, round robin
● It is important to move the arbiter past the last accept, in case a new packet arrives
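A minimal sketch of one request–grant–accept iteration in the iSLIP style, following the steps above and ref [2]. It is deliberately simplified: one iteration per cell time, queued cells are never removed, and all names and data structures are our own, not McKeown's.

```python
# One request-grant-accept iteration, iSLIP-style pointer update: only the
# arbiters of matched input/output pairs advance (unlike RRM, where every
# granting output would advance). requests[i] is the set of outputs that
# input i has a queued cell for (its VOQs).

def islip_iteration(requests, grant_ptr, accept_ptr, n):
    # Step 1 - Request: every input requests all outputs it has cells for.
    # Step 2 - Grant: each output picks the requesting input nearest its pointer.
    grants = {}                                   # input -> granting outputs
    for o in range(n):
        requesters = [i for i in range(n) if o in requests[i]]
        if requesters:
            chosen = min(requesters, key=lambda i: (i - grant_ptr[o]) % n)
            grants.setdefault(chosen, []).append(o)
    # Step 3 - Accept: each input picks the granting output nearest its pointer.
    matches = []
    for i, offers in grants.items():
        o = min(offers, key=lambda x: (x - accept_ptr[i]) % n)
        matches.append((i, o))
        # Only the matched arbiters advance, to one position past the match.
        accept_ptr[i] = (o + 1) % n
        grant_ptr[o] = (i + 1) % n
    return matches

# Toy 4x4 example with uniform traffic: every input wants every output.
# The number of matches per cell time grows as the pointers desynchronise.
n = 4
requests = [set(range(n)) for _ in range(n)]
gp, ap = [0] * n, [0] * n
for t in range(4):
    print(t, sorted(islip_iteration(requests, gp, ap, n)))
```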
iDRRM
● Starts at the inputs, not the outputs like iSLIP
● Very similar otherwise
[figures: Step 1 – Request, Step 2 – Grant, Step 3 – Accept on the same 4×4 example]
● Comparable throughput to iSLIP
● Less time to arbitrate and easier implementation
  ○ Less information exchange between input and output arbiters
    ■ Saves the time for the initial requests
ref [1]: Next Generation Routers, H. Jonathan Chao, 2002

Comparison (worst case)
iSLIP
● Request: N² messages
● Grant: N
● Accept: N
iDRRM
● Request: N
● Grant: N
● Accept: N

EDRRM
● Exhaustive DRRM
● The arbiters DON’T change when a request is accepted
  ○ This can lead to starvation; a timer is needed to override it
● Higher throughput in non-uniform traffic
● In uniform and random traffic the advantage is lost

Conclusion
● Routers are hard to make good in all respects
● The size of routing tables grows fast
● Network traffic grows fast
  ○ Traffic patterns are also changing (video streaming etc.)

Questions?

References
1. Next Generation Routers, H. Jonathan Chao, 2002.
2. The iSLIP Scheduling Algorithm for Input-Queued Switches, Nick McKeown, 1999: https://www.cse.iitb.ac.in/~mythili/teaching/cs641_autumn2015/references/16-iSlip-crossbar-sched.pdf
3. http://tiny-tera.stanford.edu/~nickm/papers/Infocom98_lookup.pdf
4. https://www.pagiamtzis.com/cam/camintro/
5. N. Huang, S. Zhao, J. Pan, and C. Su, “A fast IP routing lookup scheme for gigabit switching routers,” Proc. IEEE INFOCOM ’99, Mar. 1999.
6. Fast IP Routing with LC-Tries, Stefan Nilsson and Gunnar Karlsson, August 1, 1998: http://www.drdobbs.com/cpp/fast-ip-routing-with-lc-tries/184410638