High Performance Switches and Routers: Theory and Practice
Sigcomm 99, August 30, 1999, Harvard University
High Performance Switching and Routing, Telecom Center Workshop: Sept 4, 1997.
Nick McKeown, Balaji Prabhakar
Departments of Electrical Engineering and Computer Science
nickm@stanford.edu, balaji@isl.stanford.edu
Copyright 1999. All Rights Reserved

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers

Basic Architectural Components
[Figure: Admission Control, Policing, Congestion Control, Routing, Switching, Reservation and Output Scheduling, split between Control and Datapath (per-packet processing)]

Basic Architectural Components
Datapath: per-packet processing
1. Forwarding Table
2. Interconnect
3. Output Scheduling
[Figure: Forwarding Decision and Forwarding Table blocks]

Where high performance packet switches are used
- Carrier Class Core Router
- ATM Switch
- Frame Relay Switch
[Figure: The Internet Core; Edge Router; Enterprise WAN access & Enterprise Campus Switch]

Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers

ATM Switch
• Lookup cell VCI/VPI in VC table.
• Replace old VCI/VPI with new.
• Forward cell to outgoing interface.
• Transmit cell onto link.

Ethernet Switch
• Lookup frame DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, broadcast to all ports.
• Learn SA of incoming frame.
• Forward frame to outgoing interface.
• Transmit frame onto link.

IP Router
• Lookup packet DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, drop packet.
• Decrement TTL, update header Cksum.
• Forward packet to outgoing interface.
• Transmit packet onto link.

Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers

First-Generation IP Routers
[Figure: Shared Backplane; CPU; Buffer Memory; Line Interfaces, each with DMA and MAC]

Second-Generation IP Routers
[Figure: CPU; Buffer Memory; Line Cards, each with DMA, Local Buffer Memory and MAC]

Third-Generation Switches/Routers
[Figure: Switched Backplane; CPU Card; Line Cards, each with Local Buffer Memory and MAC]

Fourth-Generation Switches/Routers
Clustering and Multistage
[Figure: ports numbered 1 to 32]

Packet Switches: References
• J. Giacopelli, M. Littlewood, W.D. Sincoskie, “Sunshine: A high performance self-routing broadband packet switch architecture”, ISS ‘90.
• J. S. Turner, “Design of a Broadcast packet switching network”, IEEE Trans Comm, June 1988, pp. 734-743.
• C. Partridge et al., “A Fifty Gigabit per second IP Router”, IEEE Trans Networking, 1998.
• N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, “The Tiny Tera: A Packet Switch Core”, IEEE Micro Magazine, Jan-Feb 1997.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Basic Architectural Components
Datapath: per-packet processing
1. Forwarding Table
2. Interconnect
3. Output Scheduling
[Figure: Forwarding Decision and Forwarding Table blocks]

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

ATM and MPLS Switches: Direct Lookup
[Figure: the VCI addresses a Memory that returns (Port, VCI)]

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

Bridges and Ethernet Switches: Associative Lookups
[Figure: Associative Memory or CAM; a 48-bit Network Address is the Search Data; outputs are Associated Data, Hit?, and a log2 N-bit Address]
Advantages:
• Simple
Disadvantages:
• Slow
• High Power
• Small
• Expensive

Bridges and Ethernet Switches: Hashing
[Figure: the 48-bit Search Data passes through a Hashing Function to give a 16-bit Address into Memory; outputs are Associated Data, Hit?, and a log2 N-bit Address]

Lookups Using Hashing: An example
[Figure: the 48-bit Search Data is hashed with CRC-16 to a 16-bit index into Memory; each location heads a linked list of entries #1, #2, ...; outputs are Associated Data, Hit?, and a log2 N-bit Address]

Lookups Using Hashing: Performance of simple example
ER = \frac{1}{2}\left(1 + \frac{M/N}{1 - \left(1 - \frac{1}{N}\right)^{M}}\right)
Where:
ER = Expected number of memory references
M = Number of memory addresses in table
N = Number of linked lists
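The hashing example can be sketched in a few lines. This is a minimal illustration rather than the tutorial's implementation: `HashedTable`, `bucket_index` and `expected_refs` are hypothetical names, `binascii.crc_hqx` (a 16-bit CCITT CRC from the Python standard library) stands in for the slide's CRC-16, and `expected_refs` assumes the performance formula reads ER = ½(1 + (M/N) / (1 − (1 − 1/N)^M)).

```python
import binascii


def bucket_index(mac: bytes) -> int:
    """Map a 48-bit address to a 16-bit bucket index.

    binascii.crc_hqx is a 16-bit CRC (CCITT polynomial), used here as a
    stand-in for the slide's CRC-16 hashing function.
    """
    return binascii.crc_hqx(mac, 0)


class HashedTable:
    """Forwarding table: 2^16 buckets, each a chained list of entries."""

    def __init__(self):
        self.buckets = {}  # sparse stand-in for the 2^16-entry memory

    def learn(self, mac: bytes, port: int) -> None:
        chain = self.buckets.setdefault(bucket_index(mac), [])
        for k, (known, _) in enumerate(chain):
            if known == mac:
                chain[k] = (mac, port)  # refresh an existing entry
                return
        chain.append((mac, port))

    def lookup(self, mac: bytes):
        """Return (port or None, number of chain entries examined)."""
        refs = 0
        for known, port in self.buckets.get(bucket_index(mac), []):
            refs += 1
            if known == mac:
                return port, refs
        return None, refs


def expected_refs(m: int, n: int) -> float:
    """ER for m stored addresses and n lists, under the assumed formula."""
    p_nonempty = 1.0 - (1.0 - 1.0 / n) ** m
    return 0.5 * (1.0 + (m / n) / p_nonempty)


table = HashedTable()
table.learn(bytes.fromhex("00a0c9112233"), 4)
```

The number of chain entries examined is what makes the lookup time non-deterministic: it depends on how many addresses happen to hash to the same bucket.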
Lookups Using Hashing
Advantages:
• Simple
• Expected lookup time can be small
Disadvantages:
• Non-deterministic lookup time
• Inefficient use of memory

Trees and Tries
[Figure: Binary Search Tree over N entries with depth log2 N, branching on < and >; Binary Search Trie branching on bits 0 and 1, with example keys 010 and 111]

Trees and Tries: Multiway tries
[Figure: 16-ary Search Trie; each node holds (4-bit value, ptr) entries from 0000 to 1111; example keys 000011110000 and 111111111111]

Trees and Tries: Multiway tries
[Equations: closed-form expressions for En and Ew in terms of D, L and N]
Where:
L = Number of layers/references
N = Number of entries in table
D = Degree of tree
En = Expected number of nodes
Ew = Expected amount of wasted memory

Degree of Tree   # Mem References   # Nodes (x10^6)   Total Memory (Mbytes)   Fraction Wasted (%)
2                48                 1.09              4.3                     49
4                24                 0.53              4.3                     73
8                16                 0.35              5.6                     86
16               12                 0.25              8.3                     93
64               8                  0.17              21                      98
256              6                  0.12              64                      99.5
Table produced from 2^15 randomly generated 48-bit addresses

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

Caching Addresses
[Figure: Slow Path through the CPU and Buffer Memory; Fast Path between Line Cards, each with DMA, Local Buffer Memory and MAC]

Caching Addresses
LAN: Average flow < 40 packets
WAN: Huge number of flows
[Figure: Cache Hit Rate, 0% to 100%, with Cache = 10% of Full Table]

IP Routers: Class-based addresses
[Figure: IP Address Space divided into Class A, Class B and Class C; example address 212.17.9.4]
Routing Table (Exact Match): 212.17.9.0, Port 4

IP Routers: CIDR
[Figure: address line from 0 to 2^32 - 1. Class-based: A, B, C, D. Classless: prefixes 65/8, 128.9/16 (2^16 addresses starting at 128.9.0.0) and 142.12/19; example address 128.9.16.14]

IP Routers: CIDR
[Figure: address line from 0 to 2^32 - 1 with nested prefixes 128.9/16, 128.9.16/20, 128.9.176/20, 128.9.19/24 and 128.9.25/24; example address 128.9.16.14]
Most specific route = “longest matching prefix”

IP Routers: Metrics for Lookups
Example address: 128.9.16.14

Prefix          Port
65/8            3
128.9/16        5
128.9.16/20     2
128.9.19/24     7
128.9.25/24     10
128.9.176/20    1
142.12/19       3

• Lookup time
• Storage space
• Update time
• Preprocessing time

IP Router Lookup
[Figure: the Dstn Addr from the header of the Incoming Packet goes to the Forwarding Engine; its Next Hop Computation consults the Forwarding Table (Destination, Next Hop) and returns the Next Hop]
IPv4 unicast destination address based lookup

Need more than IPv4 unicast lookups
• Multicast
  • PIM-SM:
    – Longest Prefix Matching on the source and group address
    – Try (S,G) followed by (*,G) followed by (*,*,RP)
    – Check Incoming Interface
  • DVMRP:
    – Incoming Interface Check followed by (S,G) lookup
• IPv6
  • 128-bit destination address field
  • Exact address architecture not yet known

Lookup Performance Required

Line     Line Rate   Pktsize=40B   Pktsize=240B
T1       1.5Mbps     4.68 Kpps     0.78 Kpps
OC3      155Mbps     480 Kpps      80 Kpps
OC12     622Mbps     1.94 Mpps     323 Kpps
OC48     2.5Gbps     7.81 Mpps     1.3 Mpps
OC192    10 Gbps     31.25 Mpps    5.21 Mpps
Gigabit Ethernet (84B packets): 1.49 Mpps

Size of the Routing Table
Source: http://www.telstra.net/ops/bgptable.html

Ternary CAMs
[Figure: Associative Memory holding the entries below, followed by a Priority Encoder]

Value       Mask               Next Hop
10.0.0.0    255.0.0.0          R1
10.1.0.0    255.255.0.0        R2
10.1.1.0    255.255.255.0      R3
10.1.3.0    255.255.255.0      R4
10.1.3.1    255.255.255.255    R4

Binary Tries
[Figure: binary trie over the example prefixes, branching on bits 0 and 1]
Example Prefixes: a) 00001, b) 00010, c) 00011, d) 001, e) 0101, f) 011, g) 100, h) 1010, i) 1100, j) 11110000

Patricia Tree
[Figure: Patricia tree over the example prefixes, with one Skip=5 node]
Example Prefixes: a) 00001, b) 00010, c) 00011, d) 001, e) 0101, f) 011, g) 100, h) 1010, i) 1100, j) 11110000

Patricia Tree
Disadvantages:
• Many memory accesses
• May need backtracking
• Pointers take up a lot of space
Advantages:
• General Solution
• Extensible to wider fields
Avoid backtracking by storing the intermediate-best matched prefix. (Dynamic Prefix Tries)
40K entries: 2MB data structure with 0.3-0.5 Mpps [O(W)]

Binary search on trie levels
[Figure: trie with Level 0, Level 8 and Level 29 marked, and a prefix P]

Binary search on trie levels
Store a hash table for each prefix length to aid search at a particular trie level.

Length   Hash
8        10
12
16       10.1, 10.2
24       10.1.1, 10.1.2, 10.2.3

Example Prefixes: 10.0.0.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24
Example Addrs: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8

Binary search on trie levels
Disadvantages:
• Multiple hashed memory accesses.
• Updates are complex.
Advantages:
• Scalable to IPv6.
33K entries: 1.4MB data structure with 1.2-2.2 Mpps [O(log W)]

Compacting Forwarding Tables
[Figure: example bit vector]

Compacting Forwarding Tables

Codeword array
Index   Bit mask    Codeword
0       10001010    R1, 0
1       11100010    R2, 3
2       10000010    R3, 7
3       10110100    R4, 9
4       11000000    R5, 0

Base index array
Index   Base
0       0
1       13

Compacting Forwarding Tables
Disadvantages:
• Scalability to larger tables?
• Updates are complex.
Advantages:
• Extremely small data structure - can fit in cache.
33K entries: 160KB data structure with average 2Mpps [O(W/k)]

Multi-bit Tries
[Figure: 16-ary Search Trie; each node holds (4-bit value, ptr) entries from 0000 to 1111; example keys 000011110000 and 111111111111]
Compressed Tries
Only 3 memory accesses
[Figure: trie levels L8, L16 and L24]

Routing Lookups in Hardware
[Figure: Number of prefixes versus Prefix length]
Most prefixes are 24 bits or shorter

Routing Lookups in Hardware
Prefixes up to 24 bits: 2^24 = 16M entries
[Figure: the first 24 bits of 142.19.6.14 (142.19.6) index the 2^24-entry table, which returns the Next Hop]

Routing Lookups in Hardware
[Figure: for 128.3.72.44, the first 24 bits (128.3.72) index the table for prefixes up to 24 bits, which returns either a Next Hop or a Pointer; the pointer (base) plus the last 8 bits (offset 44) index a second table of Next Hops for prefixes above 24 bits]

Routing Lookups in Hardware
[Figure: generalization with a first table of 2^N entries for prefixes up to N bits, second-level tables of 2^M entries, and a further level returning the Next Hop for prefixes longer than N+M bits]

Routing Lookups in Hardware
Disadvantages:
• Large memory required (9-33MB)
• Depends on prefix-length distribution.
Advantages:
• 20Mpps with 50ns DRAM
• Easy to implement in hardware
Various compression schemes can be employed to decrease the storage requirements, e.g. carefully chosen variable-length strides, bitmap compression, etc.

IP Router Lookups: References
• A. Brodnik, S. Carlsson, M. Degermark, S. Pink, “Small Forwarding Tables for Fast Routing Lookups”, Sigcomm 1997, pp 3-14.
• B. Lampson, V. Srinivasan, G. Varghese, “IP lookups using multiway and multicolumn search”, Infocom 1998, pp 1248-56, vol. 3.
• M. Waldvogel, G. Varghese, J. Turner, B. Plattner, “Scalable high speed IP routing lookups”, Sigcomm 1997, pp 25-36.
• P. Gupta, S. Lin, N. McKeown, “Routing lookups in hardware at memory access speeds”, Infocom 1998, pp 1241-1248, vol. 3.
• S. Nilsson, G. Karlsson, “Fast address lookup for Internet routers”, IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
• V. Srinivasan, G. Varghese, “Fast IP lookups using controlled prefix expansion”, Sigmetrics, June 1998.
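As a concrete reference point for the lookup schemes in this part of the tutorial, longest-prefix matching on a plain binary trie can be sketched as follows. This is a minimal illustration, not any of the cited schemes: `BinaryTrie` and `PREFIX_TABLE` are hypothetical names, and the table is the prefix/port example from the "Metrics for Lookups" slide with each prefix written out in full dotted form. The lookup remembers the best match seen so far, so it never backtracks.

```python
import ipaddress


class TrieNode:
    __slots__ = ("children", "port")

    def __init__(self):
        self.children = [None, None]  # child for bit 0 and for bit 1
        self.port = None              # set if a prefix ends at this node


def leading_bits(value: int, count: int):
    """Yield the first `count` bits of a 32-bit value, MSB first."""
    for k in range(count):
        yield (value >> (31 - k)) & 1


class BinaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix: str, port: int) -> None:
        net = ipaddress.ip_network(prefix, strict=False)
        node = self.root
        for bit in leading_bits(int(net.network_address), net.prefixlen):
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.port = port

    def lookup(self, address: str):
        """Return the port of the longest matching prefix, or None."""
        node = self.root
        best = node.port
        for bit in leading_bits(int(ipaddress.ip_address(address)), 32):
            node = node.children[bit]
            if node is None:
                break
            if node.port is not None:
                best = node.port  # remember the most specific match so far
        return best


PREFIX_TABLE = [
    ("65.0.0.0/8", 3), ("128.9.0.0/16", 5), ("128.9.16.0/20", 2),
    ("128.9.19.0/24", 7), ("128.9.25.0/24", 10), ("128.9.176.0/20", 1),
    ("142.12.0.0/19", 3),
]

trie = BinaryTrie()
for prefix, port in PREFIX_TABLE:
    trie.insert(prefix, port)
```

With this table, `trie.lookup("128.9.16.14")` walks past the /16 entry and settles on the /20 entry, which is the "most specific route" behaviour the CIDR slides describe. Each lookup costs up to one memory reference per address bit, which is the O(W) cost the faster schemes try to reduce.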
Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

Providing Value-Added Services: Some examples
• Differentiated services
  – Regard traffic from Autonomous System #33 as ‘platinum-grade’
• Access Control Lists
  – Deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp
• Committed Access Rate
  – Rate limit WWW traffic from sub-interface #739 to 10Mbps
• Policy-based Routing
  – Route all voice traffic through the ATM network

Packet Classification
[Figure: the header of the Incoming Packet goes to the Forwarding Engine, where Packet Classification consults the Classifier (Policy Database) of Predicate/Action pairs and returns the Action]

Multi-field Packet Classification

          Field 1              Field 2             …    Field k    Action
Rule 1    152.163.190.69/21    152.163.80.11/32    …    UDP        A1
Rule 2    152.168.3.0/24       152.163.0.0/16      …    TCP        A2
…         …                    …                   …    …          …
Rule N    152.168.0.0/16       152.0.0.0/8         …    ANY        An

Given a classifier with N rules, find the action associated with the highest priority rule matching an incoming packet.

Geometric Interpretation in 2D
[Figure: rules R1 to R7 drawn as regions in the (Field #1, Field #2) plane, e.g. (144.24/16, 64/24) and (128.16.46.23, *); packets P1 and P2 drawn as points]

Proposed Schemes
Sequential Evaluation
  Pros: Small storage, scales well with number of fields.
  Cons: Slow classification rates.
Ternary CAMs
  Pros: Single cycle classification.
  Cons: Cost, density, power consumption.
Grid of Tries (Srinivasan et al [Sigcomm 98])
  Pros: Small storage requirements and fast lookup rates for two fields. Suitable for big classifiers.
  Cons: Not easily extendible to more than two fields.

Proposed Schemes (Contd.)
Crossproducting (Srinivasan et al [Sigcomm 98])
  Pros: Fast accesses.
  Suitable for multiple fields.
  Cons: Large memory requirements. Suitable without caching for classifiers with fewer than 50 rules.
Bit-level Parallelism (Lakshman and Stiliadis [Sigcomm 98])
  Pros: Suitable for multiple fields.
  Cons: Large memory bandwidth required. Comparatively slow lookup rate. Hardware only.

Proposed Schemes (Contd.)
Hierarchical Intelligent Cuttings (Gupta and McKeown [HotI 99])
  Pros: Suitable for multiple fields. Small memory requirements. Good update time.
  Cons: Large preprocessing time.
Tuple Space Search (Srinivasan et al [Sigcomm 99])
  Pros: Suitable for multiple fields. The basic scheme has good update times and memory requirements.
  Cons: Classification rate can be low. Requires perfect hashing for determinism.
Recursive Flow Classification (Gupta and McKeown [Sigcomm 99])
  Pros: Fast accesses. Suitable for multiple fields. Reasonable memory requirements for real-life classifiers.
  Cons: Large preprocessing time and memory requirements for large classifiers.

Grid of Tries
[Figure: a Dimension 1 trie whose nodes point into Dimension 2 tries holding rules R1 to R7]

Grid of Tries
Disadvantages:
• Static solution
• Not easy to extend to higher dimensions
Advantages:
• Good solution for two dimensions
20K entries: 2MB data structure with 9 memory accesses [at most 2W]

Classification using Bit Parallelism
[Figure: bit vectors over rules R1 to R4]

Classification using Bit Parallelism
Disadvantages:
• Large memory bandwidth
• Hardware optimized
Advantages:
• Good solution for multiple dimensions for small classifiers
512 rules: 1Mpps with single FPGA and 5 128KB SRAM chips.

Classification Using Multiple Fields: Recursive Flow Classification
[Figure: the packet header fields F1, F2, F3, F4, ..., Fn (2^S = 2^128 possible headers) are reduced through successive Memory lookups, 2^128 to 2^64 to 2^24 to 2^12, to one of 2^T = 2^12 Actions]
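The simplest entry in the "Proposed Schemes" comparison, sequential evaluation, can be sketched directly from the multi-field classification example. This is an illustrative sketch only: the rule table keeps three of the k fields, and treating Field 1 as a source prefix, Field 2 as a destination prefix and Field k as the protocol is an assumption made here, as are the names `RULES`, `compile_rules` and `classify`. Rules are stored in priority order, so the first match is the highest-priority match.

```python
import ipaddress

ANY = "ANY"

# (Field 1, Field 2, protocol, action), highest priority first.
# The field meanings (source, destination, protocol) are assumed.
RULES = [
    ("152.163.190.69/21", "152.163.80.11/32", "UDP", "A1"),
    ("152.168.3.0/24", "152.163.0.0/16", "TCP", "A2"),
    ("152.168.0.0/16", "152.0.0.0/8", ANY, "An"),
]


def compile_rules(rules):
    """Parse prefixes once; strict=False tolerates set host bits."""
    return [
        (
            ipaddress.ip_network(f1, strict=False),
            ipaddress.ip_network(f2, strict=False),
            proto,
            action,
        )
        for f1, f2, proto, action in rules
    ]


def classify(compiled, src: str, dst: str, proto: str):
    """Return the action of the first (highest priority) matching rule."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    for net1, net2, rule_proto, action in compiled:
        if s in net1 and d in net2 and rule_proto in (ANY, proto):
            return action
    return None  # no rule matches


CLASSIFIER = compile_rules(RULES)
```

Storage is one entry per rule, but every packet may touch all N rules, which is the "slow classification rates" drawback listed for this scheme.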
Packet Classification: References
• T.V. Lakshman, D. Stiliadis, “High speed policy based packet forwarding using efficient multi-dimensional range matching”, Sigcomm 1998, pp 191-202.
• V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel, “Fast and scalable layer 4 switching”, Sigcomm 1998, pp 203-214.
• V. Srinivasan, G. Varghese, S. Suri, “Fast packet classification using tuple space search”, to be presented at Sigcomm 1999.
• P. Gupta, N. McKeown, “Packet classification using hierarchical intelligent cuttings”, Hot Interconnects VII, 1999.
• P. Gupta, N. McKeown, “Packet classification on multiple fields”, Sigcomm 1999.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Combining input and output queues
  – Other non-blocking fabrics
  – Multicast traffic

Basic Architectural Components
Datapath: per-packet processing
1. Forwarding Table
2. Interconnect
3. Output Scheduling
[Figure: Forwarding Decision and Forwarding Table blocks]

Interconnects: Two basic techniques
Input Queueing: Usually a non-blocking switch fabric (e.g. crossbar)
Output Queueing: Usually a fast bus

Interconnects: Output Queueing
Individual Output Queues: Memory b/w = (N+1).R
Centralized Shared Memory: Memory b/w = 2N.R
[Figure: ports 1, 2, ..., N for each arrangement]

Output Queueing: The “ideal”
[Figure: cells for outputs 1 and 2 queued at their outputs]

Output Queueing
How fast can we make centralized shared memory?
[Figure: 5ns SRAM Shared Memory serving ports 1, 2, ..., N over a 200 byte bus]
• 5ns per memory operation
• Two memory operations per packet
• Therefore, up to 160Gb/s
• In practice, closer to 80Gb/s

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Interconnects: Input Queueing with Crossbar
Memory b/w = 2R
[Figure: crossbar with Data In and Data Out, and a Scheduler that sets the configuration]

Input Queueing: Head of Line Blocking
[Figure: Delay versus Load, with 58.6% and 100% marked on the Load axis]

Head of Line Blocking
[Figures]

Input Queueing: Virtual output queues
[Figure]

Input Queues: Virtual Output Queues
[Figure: Delay versus Load, with 100% marked on the Load axis]

Input Queueing
Memory b/w = 2R
Scheduler: Can be quite complex!

Input Queueing: Scheduling
[Figure: Inputs 1 to m with arrivals A_1(t) to A_m(t) feed queues Q(1,1) to Q(m,n); a Matching M connects them to Outputs 1 to n with departures D_1(t) to D_n(t)]

Input Queueing: Scheduling
[Figure: Request Graph between inputs 1-4 and outputs 1-4 with edge weights 7, 2, 4, 2, 5, 2, and a Bipartite Matching (Weight = 18)]
Question: Maximum weight or maximum size?

Input Queueing: Scheduling
• Maximum Size
  – Maximizes instantaneous throughput
  – Does it maximize long-term throughput?
• Maximum Weight
  – Can clear most backlogged queues
  – But does it sacrifice long-term throughput?

Input Queueing: Scheduling
[Figure: 2x2 example]

Input Queueing: Longest Queue First or Oldest Cell First
Weight = Queue Length (Longest Queue First) or Waiting Time (Oldest Cell First)
[Figure: weighted request graph between inputs 1-4 and outputs 1-4]
[Figure: matching between inputs 1-4 and outputs 1-4, labeled 100%]

Input Queueing
Why is serving long/old queues better than serving maximum number of queues?
• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become longer than others.
• A good algorithm keeps the queue lengths matched, and services a large number of queues.
[Figure: Avg Occupancy versus VOQ # for uniform traffic and for non-uniform traffic]

Input Queueing: Practical Algorithms
• Maximal Size Algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal Weight Algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)

Wave Front Arbiter
[Figure: Requests and the resulting Match for a 4x4 switch]

Wave Front Arbiter
[Figure: Requests and the resulting Match]

Wave Front Arbiter: Implementation
[Figure: 4x4 array of Combinational Logic Blocks, one per (input, output) pair from 1,1 to 4,4]

Wave Front Arbiter: Wrapped WFA (WWFA)
N steps instead of 2N-1
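The wave front idea can be sketched in software as follows. This is an illustrative model under stated assumptions, not the hardware design: `wavefront_arbiter` is a hypothetical name, ports are numbered from 0, and the waves are taken to be the diagonals of the request matrix (cells with i + j = k for the plain arbiter, (i + j) mod N = k for the wrapped one). Cells on one diagonal never share a row or a column, so in hardware they can all be decided in the same step.

```python
def wavefront_arbiter(requests, wrapped=False):
    """Compute a maximal match from an N x N request matrix.

    requests[i][j] is truthy if input i has a cell for output j.
    Returns (sorted list of (input, output) grants, number of steps).
    """
    n = len(requests)
    row_free = [True] * n  # input i not yet matched
    col_free = [True] * n  # output j not yet matched
    match = []
    steps = n if wrapped else 2 * n - 1
    for k in range(steps):
        for i in range(n):
            # Cell of row i that lies on wave k.
            j = (k - i) % n if wrapped else k - i
            if not 0 <= j < n:
                continue
            if requests[i][j] and row_free[i] and col_free[j]:
                row_free[i] = False
                col_free[j] = False
                match.append((i, j))
    return sorted(match), steps


REQUESTS = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]

plain_match, plain_steps = wavefront_arbiter(REQUESTS)
wrapped_match, wrapped_steps = wavefront_arbiter(REQUESTS, wrapped=True)
```

For this request matrix both variants grant three connections although a match of size four exists, which illustrates that a maximal match is not necessarily a maximum one. The fixed starting diagonal also favours some ports over others; a real design would need some way to rotate priority, which this sketch leaves out.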
All Rights Reserved Accept/Match 1 2 3 1 2 3 4 4 102 Parallel Iterative Matching Analytical Results Number of iterations to converge: N2 E U i ------4i E C log N Copyright 1999. All Rights Reserved C = # of iterations required to resolve connections N = # of ports U i = # of unresolved connections after iteration i 103 Parallel Iterative Matching Copyright 1999. All Rights Reserved 104 Parallel Iterative Matching Copyright 1999. All Rights Reserved 105 Parallel Iterative Matching Copyright 1999. All Rights Reserved 106 Input Queueing Practical Algorithms • Maximal Size Algorithms – Wave Front Arbiter (WFA) – Parallel Iterative Matching (PIM) – iSLIP • Maximal Weight Algorithms – Fair Access Round Robin (FARR) – Longest Port First (LPF) Copyright 1999. All Rights Reserved 107 iSLIP Round-Robin Selection Round-Robin Selection #1 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 4 4 4 4 4 4 Requests Grant Accept/Match 1 2 #2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 4 4 4 4 4 4 Copyright 1999. All Rights Reserved 108 iSLIP Properties • • • • • Random under low load TDM under high load Lowest priority to MRU 1 iteration: fair to outputs Converges in at most N iterations. On average <= log2N • Implementation: N priority encoders • Up to 100% throughput for uniform traffic Copyright 1999. All Rights Reserved 109 iSLIP Copyright 1999. All Rights Reserved 110 iSLIP Copyright 1999. All Rights Reserved 111 iSLIP Programmable Priority Encoder N N Implementation 1 1 Grant Accept 2 2 Grant Accept log2N log2N State Decision N N Grant Copyright 1999. All Rights Reserved N Accept log2N 112 Input Queueing References References • M. Karol et al. “Input vs Output Queueing on a Space-Division Packet Switch”, IEEE Trans Comm., Dec 1987, pp. 1347-1356. • Y. Tamir, “Symmetric Crossbar arbiters for VLSI communication switches”, IEEE Trans Parallel and Dist Sys., Jan 1993, pp.13-27. • T. Anderson et al. “High-Speed Switch Scheduling for Local Area Networks”, ACM Trans Comp Sys., Nov 1993, pp. 319-352. • N. 
McKeown, “The iSLIP scheduling algorithm for Input-Queued Switches”, IEEE Trans Networking, April 1999, pp. 188-201.
• C. Lund et al., “Fair prioritized scheduling in an input-buffered switch”, Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.
• A. Mekkittikul et al., “A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches”, IEEE Infocom 98, April 1998.

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Other Non-Blocking Fabrics: Clos Network
[Figure]

Other Non-Blocking Fabrics: Clos Network
Expansion factor required = 2 - 1/N (but still blocking for multicast)

Other Non-Blocking Fabrics: Self-Routing Networks
[Figure: 8x8 self-routing network with inputs and outputs labeled 000 to 111]

Other Non-Blocking Fabrics: Self-Routing Networks
The Non-blocking Batcher Banyan Network
[Figure: a Batcher Sorter followed by a Self-Routing Network, delivering sorted cells to outputs 000 to 111]
• Fabric can be used as scheduler.
• Batcher-Banyan network is blocking for multicast.

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Speedup
• Context
  – input-queued switches
  – output-queued switches
  – the speedup problem
• Early approaches
• Algorithms
• Implementation considerations
Speedup: Context
[Figure: A generic switch with Memory on the input side and on the output side]
The placement of memory gives
- Output-queued switches
- Input-queued switches
- Combined input- and output-queued switches

Output-queued switches
Best delay and throughput performance
- Possible to erect “bandwidth firewalls” between sessions
Main problem
- Requires high fabric speedup (S = N)
Unsuitable for high-speed switching

Input-queued switches
Big advantage
- Speedup of one is sufficient
Main problem
- Can’t guarantee delay due to input contention
Overcoming input contention: use higher speedup

A Comparison: Memory speeds for 32x32 switch

             Output-queued                      Input-queued
Line Rate    Memory BW    Access Time Per Cell  Memory BW    Access Time
100 Mb/s     3.3 Gb/s     128 ns                200 Mb/s     2.12 µs
1 Gb/s       33 Gb/s      12.8 ns               2 Gb/s       212 ns
2.5 Gb/s     82.5 Gb/s    5.12 ns               5 Gb/s       84.8 ns
10 Gb/s      330 Gb/s     1.28 ns               20 Gb/s      21.2 ns

The Speedup Problem
Find a compromise: 1 < Speedup << N
- to get the performance of an OQ switch
- close to the cost of an IQ switch
Essential for high speed QoS switching

Some Early Approaches
Probabilistic Analyses
- assume traffic models (Bernoulli, Markov-modulated, non-uniform loading, “friendly correlated”)
- obtain mean throughput and delays, bounds on tails
- analyze different fabrics (crossbar, multistage, etc)
Numerical Methods
- use actual and simulated traffic traces
- run different algorithms
- set the “speedup dial” at various values

The findings
Very tantalizing ...
- under different settings (traffic, loading, algorithm, etc)
- and even for varying switch sizes
A speedup of between 2 and 5 was sufficient!

Using Speedup
[Figure]
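The 32x32 memory-speed comparison can be reproduced with a short calculation. This is a sketch under stated assumptions: output queueing is taken to need a memory bandwidth of (N+1)R and input queueing 2R, and the per-cell access time is taken to be the time to move one 53-byte ATM cell (424 bits) at that bandwidth. The cell size is inferred from the tabulated numbers rather than stated on the slide, and `memory_requirements` is a hypothetical helper name.

```python
CELL_BITS = 53 * 8  # assumption: one 53-byte ATM cell per memory access


def memory_requirements(n_ports: int, line_rate_bps: float) -> dict:
    """Memory bandwidth and per-cell access time for OQ and IQ switches."""
    oq_bw = (n_ports + 1) * line_rate_bps  # N writes plus 1 read per cell time
    iq_bw = 2 * line_rate_bps              # 1 write plus 1 read per cell time
    return {
        "oq_bw_bps": oq_bw,
        "oq_access_ns": CELL_BITS / oq_bw * 1e9,
        "iq_bw_bps": iq_bw,
        "iq_access_ns": CELL_BITS / iq_bw * 1e9,
    }


if __name__ == "__main__":
    for rate in (100e6, 1e9, 2.5e9, 10e9):
        r = memory_requirements(32, rate)
        print(
            f"{rate / 1e9:5.1f} Gb/s  "
            f"OQ {r['oq_bw_bps'] / 1e9:6.1f} Gb/s {r['oq_access_ns']:8.2f} ns  "
            f"IQ {r['iq_bw_bps'] / 1e9:5.1f} Gb/s {r['iq_access_ns']:8.2f} ns"
        )
```

Under these assumptions the calculation tracks the tabulated values closely, and it makes the scaling explicit: the output-queued access time shrinks with both the line rate and the port count, while the input-queued access time depends on the line rate alone.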
Intuition (Bernoulli IID inputs)
Speedup = 1: Fabric throughput = .58
Speedup = 2: Fabric throughput = 1.16, Input efficiency = 1/1.16, Ave I/p queue = 6.25

Intuition (continued, Bernoulli IID inputs)
Speedup = 3: Fabric throughput = 1.74, Input efficiency = 1/1.74, Ave I/p queue = 1.35
Speedup = 4: Fabric throughput = 2.32, Input efficiency = 1/2.32, Ave I/p queue = 0.75

Issues
Need hard guarantees
- exact, not average
Robustness
- realistic, even adversarial, traffic, not friendly Bernoulli IID

The Ideal Solution
[Figure: Inputs and Outputs connected with Speedup = N, compared with Speedup << N]
Question: Can we find
- a simple and good algorithm
- that exactly mimics output-queueing
- regardless of switch sizes and traffic patterns?

What is exact mimicking?
Apply same inputs to an OQ and a CIOQ switch
- packet by packet
Obtain same outputs
- packet by packet

Algorithm - MUCF
Key concept: urgency value
- urgency = (departure time) - (present time)

MUCF
The algorithm
- Outputs try to get their most urgent packets
- Inputs grant to the output whose packet is most urgent, ties broken by port number
- Loser outputs try for their next most urgent packet
- Algorithm terminates when no more matchings are possible

Stable Marriage Problem
Men = Outputs: Bill, John, Pedro
Women = Inputs: Hillary, Monica, Maria

An example
Observation: Only two reasons a packet doesn’t get to its output
- Input contention, Output contention
- This is why speedup of 2 works!!

What does this get us?
Speedup of 4 is sufficient for exact emulation of FIFO OQ switches, with MUCF
What about non-FIFO OQ switches? E.g. WFQ, Strict priority
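One scheduling phase of MUCF, as listed in the algorithm bullets, can be sketched as below. This is an interpretation of those bullets and not the authors' code: `mucf_match` is a hypothetical name, `urgency[i][j]` is assumed to hold the urgency of the most urgent cell queued at input i for output j (or `None` if there is none), a smaller value means more urgent, and the rule that an output breaks ties between equally urgent cells by lowest input number is an added assumption, since the slide specifies tie-breaking only at the inputs.

```python
def mucf_match(urgency):
    """Return {input: output} for one MUCF matching phase.

    urgency[i][j]: urgency of the most urgent cell at input i destined
    to output j, or None. Smaller values are more urgent.
    """
    n_in = len(urgency)
    n_out = len(urgency[0])
    matched = {}         # input -> output
    taken_outputs = set()
    while True:
        # Each unmatched output requests its most urgent cell among
        # the inputs that are still unmatched.
        requests = {}    # input -> list of (urgency, output)
        for j in range(n_out):
            if j in taken_outputs:
                continue
            candidates = [
                (urgency[i][j], i)
                for i in range(n_in)
                if i not in matched and urgency[i][j] is not None
            ]
            if candidates:
                u, i = min(candidates)  # assumed tie-break: lowest input
                requests.setdefault(i, []).append((u, j))
        if not requests:
            break        # no more matchings are possible
        # Each requested input grants to the most urgent request,
        # ties broken by (lowest) output port number.
        for i, reqs in requests.items():
            _, j = min(reqs)
            matched[i] = j
            taken_outputs.add(j)
        # Losing outputs retry in the next round with remaining inputs.
    return matched
```

Every round with at least one request matches at least one input, so the loop ends after at most N rounds, consistent with the worst-case iteration count quoted for switching in the implementation discussion.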
Other results
To exactly emulate an NxN OQ switch
- Speedup of 2 - 1/N is necessary and sufficient (Hence a speedup of 2 is sufficient for all N)
- Input traffic patterns can be absolutely arbitrary
- Emulated OQ switch may use a “monotone” scheduling policy
- E.g.: FIFO, LIFO, strict priority, WFQ, etc

What gives?
Complexity of the algorithms
- Extra hardware for processing
- Extra run time (time complexity)
What is the benefit?
- Reduced memory bandwidth requirements
Tradeoff: Memory for processing
- Moore’s Law supports this tradeoff

Implementation - a closer look
Main sources of difficulty
- Estimating urgency, etc - info is distributed (and communicating this info among I/ps and O/ps)
- Matching process - too many iterations?
Estimating urgency depends on what is being emulated
- Like taking a ticket to hold a place in a queue
- FIFO, Strict priorities - no problem
- WFQ, etc - problems

Implementation (contd)
Matching process
- A variant of the stable marriage problem
- Worst-case number of iterations for SMP = N^2
- Worst-case number of iterations in switching = N
- With high probability and on average, approximately log(N)

Other Work
Relax stringent requirement of exact emulation
- Least Occupied O/p First Algorithm (LOOFA)
  Keeps outputs always busy if there are packets
  By time-stamping packets, it also exactly mimics
- Disallow arbitrary inputs
  E.g. leaky bucket constrained
  Obtain worst-case delay bounds

References for speedup
- Y. Oie et al, “Effect of speedup in nonblocking packet switch”, ICC 89.
- A.L. Gupta, N.D. Georganas, “Analysis of a packet switch with input and output buffers and speed constraints”, Infocom 91.
- S-T.
Chuang et al, “Matching output queueing with a combined input and output queued switch”, IEEE JSAC, vol 17, no 6, 1999.
- B. Prabhakar, N. McKeown, “On the speedup required for combined input and output queued switching”, Automatica, vol 35, 1999.
- P. Krishna et al, “On the speedup required for work-conserving crossbar switches”, IEEE JSAC, vol 17, no 6, 1999.
- A. Charny, “Providing QoS guarantees in input buffered crossbar switches with speedup”, PhD Thesis, MIT, 1998.

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Multicast Switching
• The problem
• Switching with crossbar fabrics
• Switching with other fabrics

Multicasting
[Figure: nodes numbered 1 to 6]

Crossbar fabrics: Method 1
Copy network + unicast switching
Copy networks: increased hardware, increased input contention

Method 2
Use copying properties of crossbar fabric
No fanout-splitting: easy, but low throughput
Fanout-splitting: higher throughput, but not as simple. Leaves “residue”.

The effect of fanout-splitting
[Figure: performance of an 8x8 switch with and without fanout-splitting under uniform IID traffic]

Placement of residue
Key question: How should outputs grant requests? (and hence decide placement of residue)

Residue and throughput
Result: Concentrating residue brings more new work forward, and hence leads to higher throughput.
But there are fairness problems to deal with.
These and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.
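The difference between the two crossbar policies above (with and without fanout-splitting) can be illustrated for a single cell time. This is a hedged sketch, not the scheduler from the tutorial: the name `schedule`, the dictionary layout, and the grant rule (lower-numbered inputs are served first) are assumptions made for illustration.

```python
def schedule(hol, fanout_splitting):
    """Serve head-of-line multicast cells for one cell time.

    hol: dict mapping input port -> set of output ports that the
         head-of-line cell at that input still has to reach.
    fanout_splitting: if True, a cell may be delivered to only part of its
         fanout; if False, it is sent only when all its outputs are free.
    Returns (copies_sent, residue), where residue maps each input to the
    outputs its cell has not reached yet.
    """
    busy = set()      # outputs already taken in this cell time
    residue = {}
    copies = 0
    for inp in sorted(hol):          # assumed priority: lowest input first
        fanout = hol[inp]
        free = fanout - busy
        if fanout_splitting:
            served = free                                   # partial service allowed
        else:
            served = fanout if free == fanout else set()    # all or nothing
        busy |= served
        copies += len(served)
        residue[inp] = fanout - served
    return copies, residue


if __name__ == "__main__":
    hol = {0: {0, 1, 2}, 1: {2, 3}, 2: {1, 4}}
    print(schedule(hol, fanout_splitting=True))
    print(schedule(hol, fanout_splitting=False))
```

With the demo input, fanout-splitting delivers five copies in the cell time and leaves a small residue at inputs 1 and 2, while the all-or-nothing policy delivers only the three copies of the first cell.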
Multicasting and Tetris
[Figure: input ports 1-5, output ports 1-5, residue]

Multicasting and Tetris
[Figure: input ports 1-5, output ports 1-5, residue concentrated]

Replication by recycling
Main idea: Make two copies at a time using a binary tree with input at root and all possible destination outputs at the leaves.
[Figure: binary replication tree with labels a-e, x, y]

Replication by recycling (cont’d)
[Figure labels: Receive, Reseq, Transmit, Output Table, Network, Recycle]
Scalable to large fanouts.
Needs resequencing at outputs and introduces variable delays.

References for Multicasting
• J. Hayes et al. “Performance analysis of a multicast switch”, IEEE Trans. on Communications, vol 39, April 1991.
• B. Prabhakar et al. “Tetris models for multicast switches”, Proc. of the 30th Annual Conference on Information Sciences and Systems, 1996.
• B. Prabhakar et al. “Multicast scheduling for input-queued switches”, IEEE JSAC, 1997.
• J. Turner, “An optimal nonblocking multicast virtual circuit switch”, INFOCOM, 1994.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Output Scheduling
• What is output scheduling?
• How is it done?
• Practical Considerations

Output Scheduling
Allocating output bandwidth
Controlling packet delay
[Figure: scheduler]

Output Scheduling
[Figure: FIFO versus Fair Queueing]
Motivation
• FIFO is natural but gives poor QoS
  – bursty flows increase delays for others
  – hence cannot guarantee delays
• Need round robin scheduling of packets
  – Fair Queueing
  – Weighted Fair Queueing, Generalized Processor Sharing

Fair queueing: Main issues
• Level of granularity
  – packet-by-packet? (favors long packets)
  – bit-by-bit? (ideal, but very complicated)
• Packet Generalized Processor Sharing (PGPS)
  – serves packet-by-packet
  – and imitates bit-by-bit schedule within a tolerance

How does WFQ work?
[Figure: three flows with weights WR = 1, WG = 5, WP = 2]

Delay guarantees
• Theorem: If flows are leaky bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.

Practical considerations
• For every packet, the scheduler needs to
  – classify it into the right flow queue and maintain a linked-list for each flow
  – schedule it for departure
• Complexities of both are O(log [# of flows])
  – first is hard to overcome
  – second can be overcome by DRR

Deficit Round Robin
[Figure: per-flow packet queues; quantum size = 500]
Good approximation of FQ
Much simpler to implement

But...
• WFQ is still very hard to implement
  – classification is a problem
  – needs to maintain too much state information
  – doesn’t scale well

Strict Priorities and Diff Serv
• Classify flows into priority classes
  – maintain only per-class queues
  – perform FIFO within each class
  – avoid “curse of dimensionality”
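The Deficit Round Robin slide above can be made concrete with a small sketch. The quantum of 500 matches the slide; the packet sizes in the demo, the function name `drr`, and the fixed number of rounds are hypothetical choices made for illustration.

```python
from collections import deque


def drr(queues, quantum, rounds):
    """Deficit Round Robin over per-flow queues (illustrative sketch).

    queues:  list of deques holding packet sizes in bytes, one per flow.
    quantum: bytes of credit added to a backlogged flow on each visit.
    rounds:  number of round-robin passes to run.
    Returns the service order as a list of (flow_index, packet_size).
    """
    deficit = [0] * len(queues)
    order = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                continue                  # idle flows earn no credit
            deficit[i] += quantum
            # Send head-of-line packets while the credit covers them.
            while q and q[0] <= deficit[i]:
                size = q.popleft()
                deficit[i] -= size
                order.append((i, size))
            if not q:
                deficit[i] = 0            # an emptied flow forfeits leftover credit
    return order


if __name__ == "__main__":
    flows = [deque([200, 750]), deque([500, 500]), deque([100, 400, 500])]
    for flow, size in drr(flows, quantum=500, rounds=2):
        print(f"flow {flow}: sent {size} bytes")
```

Each visit costs constant work per packet sent, which is why DRR avoids the O(log [# of flows]) scheduling step mentioned under practical considerations. A flow whose head packet exceeds its credit simply carries the deficit over to the next round.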
Diff Serv
• A framework for providing differentiated QoS
  – set Type of Service (ToS) bits in packet headers
  – this classifies packets into classes
  – routers maintain per-class queues
  – condition traffic at network edges to conform to class requirements
May still need queue management inside the network

References for O/p Scheduling
- A. Demers et al, “Analysis and simulation of a fair queueing algorithm”, ACM SIGCOMM 1989.
- A. Parekh, R. Gallager, “A generalized processor sharing approach to flow control in integrated services networks: the single node case”, IEEE/ACM Trans. on Networking, June 1993.
- A. Parekh, R. Gallager, “A generalized processor sharing approach to flow control in integrated services networks: the multiple node case”, IEEE/ACM Trans. on Networking, April 1994.
- M. Shreedhar, G. Varghese, “Efficient Fair Queueing using Deficit Round Robin”, ACM SIGCOMM, 1995.
- K. Nichols, S. Blake (eds), “Differentiated Services: Operational Model and Definitions”, Internet Draft, 1998.

Active Queue Management
• Problems with traditional queue management
  – tail drop
• Active Queue Management
  – goals
  – an example
  – effectiveness

Tail Drop Queue Management
[Figure: lock-out; max queue length]

Tail Drop Queue Management
• Drop packets only when queue is full
  – long steady-state delay
  – global synchronization
  – bias against bursty traffic

Global Synchronization
[Figure: max queue length]

Bias Against Bursty Traffic
[Figure: max queue length]

Alternative Queue Management Schemes
• Drop from front on full queue
• Drop at random on full queue
Both solve the lock-out problem
Both have the full-queues problem
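The three full-queue policies named on the slides above (tail drop, drop from front, drop at random) differ only in which packet is discarded when a packet arrives at a full queue. A minimal sketch, assuming a hypothetical `enqueue` helper and policy names chosen here for illustration:

```python
import random
from collections import deque


def enqueue(queue, packet, max_len, policy, rng=random):
    """Add a packet to a bounded queue; return the dropped packet or None.

    policy: "tail"   - drop the arriving packet (tail drop)
            "front"  - drop the packet at the head of the queue
            "random" - drop a uniformly chosen queued packet
    """
    if len(queue) < max_len:
        queue.append(packet)
        return None
    if policy == "tail":
        return packet                       # arriving packet is discarded
    if policy == "front":
        dropped = queue.popleft()
    elif policy == "random":
        idx = rng.randrange(len(queue))
        dropped = queue[idx]
        del queue[idx]
    else:
        raise ValueError(f"unknown policy: {policy}")
    queue.append(packet)                    # arriving packet takes the freed slot
    return dropped


if __name__ == "__main__":
    q = deque([1, 2, 3])
    print(enqueue(q, 4, max_len=3, policy="front"), list(q))
```

All three variants still wait until the queue is full before dropping anything, which is the full-queues problem the slide points out.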
Active Queue Management Goals
• Solve lock-out and full-queue problems
  – no lock-out behavior
  – no global synchronization
  – no bias against bursty flow
• Provide better QoS at a router
  – low steady-state delay
  – lower packet dropping

Active Queue Management
• Problems with traditional queue management
  – tail drop
• Active Queue Management
  – goals
  – an example
  – effectiveness

Random Early Detection (RED)
[Figure: queue of packets P1, P2, ..., Pk with thresholds minth and maxth and average queue length qavg]
if qavg < minth: admit every packet
else if qavg <= maxth: drop an incoming packet with probability p = (qavg - minth)/(maxth - minth)
else if qavg > maxth: drop every incoming packet

Effectiveness of RED: Lock-Out
• Packets are randomly dropped
• Each flow has the same probability of being discarded

Effectiveness of RED: Full-Queue
• Drop packets probabilistically in anticipation of congestion (not when queue is full)
• Use qavg to decide packet dropping probability: allow instantaneous bursts
• Randomness avoids global synchronization

What QoS does RED Provide?
• Lower buffer delay: good interactive service
  – qavg is controlled to be small
• Given responsive flows: packet dropping is reduced
  – early congestion indication allows traffic to throttle back before congestion
• Given responsive flows: fair bandwidth allocation

Unresponsive or aggressive flows
• Don’t properly back off during congestion
• Take away bandwidth from TCP compatible flows
• Monopolize buffer space

Control Unresponsive Flows
• Some active queue management schemes
  – RED with penalty box
  – Flow RED (FRED)
  – Stabilized RED (SRED)
identify and penalize unresponsive flows with a bit of extra work

Active Queue Management References
• B. Braden et al.
“Recommendations on queue management and congestion avoidance in the Internet”, RFC 2309, 1998.
• S. Floyd, V. Jacobson, “Random early detection gateways for congestion avoidance”, IEEE/ACM Trans. on Networking, 1(4), Aug. 1993.
• D. Lin, R. Morris, “Dynamics of random early detection”, ACM SIGCOMM, 1997.
• T. Ott et al. “SRED: Stabilized RED”, INFOCOM 1999.
• S. Floyd, K. Fall, “Router mechanisms to support end-to-end congestion control”, LBL technical report, 1997.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Basic Architectural Components
[Figure labels: Admission Control, Policing, Congestion Control, Routing, Switching, Reservation, Output Scheduling; Control; Datapath: per-packet processing]

Basic Architectural Components
Datapath: per-packet processing
1. Forwarding Table
2. Interconnect
3. Output Scheduling
[Figure: repeated Forwarding Table / Forwarding Decision blocks]
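As a closing illustration of the RED drop rule from the Random Early Detection slide earlier in this section, here is a minimal sketch. The three-way threshold test follows the slide. The exponentially weighted moving average used for qavg follows the Floyd and Jacobson paper listed in the references; the function names and the default weight of 0.002 are assumptions made here.

```python
import random


def red_drop_probability(qavg, minth, maxth):
    """Drop probability as a function of the average queue length."""
    if qavg < minth:
        return 0.0                                   # admit every packet
    if qavg <= maxth:
        return (qavg - minth) / (maxth - minth)      # linear ramp between thresholds
    return 1.0                                       # drop every incoming packet


def update_qavg(qavg, qlen, weight=0.002):
    """Exponentially weighted moving average of the instantaneous queue length.

    The weight value is an assumed default, not taken from the slides.
    """
    return (1 - weight) * qavg + weight * qlen


def red_admit(qavg, minth, maxth, rng=random):
    """Return True if the arriving packet is admitted, False if dropped."""
    return rng.random() >= red_drop_probability(qavg, minth, maxth)


if __name__ == "__main__":
    qavg = 0.0
    for qlen in [5, 15, 25, 35, 45]:
        qavg = update_qavg(qavg, qlen, weight=0.5)   # large weight so the demo moves quickly
        print(qlen, round(qavg, 2), red_drop_probability(qavg, minth=10, maxth=30))
```

Because the decision uses the averaged queue length rather than the instantaneous one, a short burst barely moves qavg and is admitted, which is the property the effectiveness slides rely on.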