Substrate Control: Overview
Fred Kuhns
fredk@arl.wustl.edu
Applied Research Laboratory
Washington University in St. Louis
Fred Kuhns - 3/24/2016

Defining Terms and Models

The SPP Node

[Figure: SPP node block diagram. Labels: NPE code option; NPE; SCD; SRAM; TCAM; FPx; mi-mux; GPE (app, planetlab OS, vmx, RMP, NMP); LC; SCD (ARP, nat); Ingress: map flow to internal destination; Egress: IP route table and ARP; Internet; local delivery/exceptions use an internal UDP tunnel.]

• Slice instantiation:
  – Allocate a virtual machine (VM) instance on a GPE
  – May request a code option instance, NPE resources and bandwidth
• Share a common set of (global) IP addresses
  – UDP/TCP port space shared across GPE/NPEs
• Line card TCAM filters direct traffic
  – Unregistered traffic originating outside the node is sent to the CP.
  – Unregistered traffic originating within the node uses NAT (on the line card).
  – An application may register server ports. This causes a filter to be inserted in the line card directing traffic to a specific GPE.
  – An application must register ports (or tunnels) associated with fast path instances.
• It is assumed that fast path instances will use tunnels (overlays) to send traffic between routing nodes.
  – Currently we only support UDP tunnels but will extend to include GRE and possibly others.

Meta-Interfaces and Tunnels

• Slice fast paths (code option instance, allocated resources) are assumed to sit at one end of a tunnel.
• The encapsulated packet is processed by the fast path.
• Currently only UDP tunnels are supported. A UDP tunnel is defined by the 4-tuple:
    UDP tunnel: {peer ipaddr, peer port, local ipaddr, local port}
• Meta-interface (MI): represents a tunnel endpoint as viewed by a slice's fast path router.
  A meta-interface is defined by the local endpoint's address:
    Meta-Interface: {local ipaddr, local UDP port}
• The packet is always encapsulated within a tunnel by the substrate.
• The code option instance processes the encapsulated frame.
• In the SPP context, the slice registers the MI and the substrate manages the encapsulation headers:
  – Guards against forging the source address.
  – A filter is installed in the corresponding line card's TCAM to send matching packets to the correct NPE.
  – The NPE's decap module verifies the encapsulation header and provides isolation between slices (based on the local IP and port number values in the tunnel header).
  – Fabric VLANs are used to provide link level isolation between slice instances. The VLAN label is also used by the substrate to associate packets with slice fast paths.

[Figure: fast path (FPx) with meta-interfaces 0 through 6. MI: local tunnel endpoint (UDP), {external ipaddr, udp_port}.]

  MI   IP Address    UDP Port
  0    192.168.1.2   6060
  1    192.168.1.3   6060
  2    192.168.1.2   6061
  3    192.168.1.2   6062
  4    192.168.1.3   6061
  5    192.168.1.3   6062
  6    192.168.1.3   6063

Lookup Table, TCAM, Use

Lookup filters: Key, Action and Result

• A lookup key is created from the packet's header fields and the receiving meta-interface.
  – The code option extracts fields from the encapsulated packet.
  – The substrate adds the receiving meta-interface identifier.
• If no entry is found then the packet's no_route exception attribute is set; otherwise a result is returned containing an action field and forwarding information (output meta-interface and next hop address).
  – A code option may define additional exception attributes.
• The complete filter specification: {lookup_key, result_vector}
• lookup_key : {RxMI, *copt_key}
  – RxMI : Meta-interface ID on which the packet was received.
  – copt_key : Lookup key defined by the code option.
    The IPv4 key: {daddr(32), saddr(32), sport(16), dport(16), tcp_flgs(8), proto(8)}
• result_vector : {sindx, action[, qid, TxMI, nexthop]}
  – sindx : stats index
  – action : Packet disposition, one of {drop, fwd, ld}
    • drop : drop packet
    • fwd : forward packet using next hop value (fwdkey)
    • ld : local delivery, code option instance has local address information??
  – qid : packet queue
  – TxMI : Meta-interface used for sending the packet; corresponds to a previously registered local tunnel endpoint. Used to fill in the local address of the outgoing packet's tunnel header.
  – nexthop : Tunnel endpoint for the next hop. For UDP tunnels, this is the IP address and UDP port number of the next hop device.

Slice view of the Lookup Key

  user specified lookup key (4 32-bit words):
    xsid' (12) | xmi (N) | slice defined fields (128-N)

• When a packet is received the substrate creates a lookup key using the target slice's xsid and the receiving meta-interface. The remaining bits are defined by the code option.
  – xsid' : represents the internal slice ID and may differ from the value of xsid. For implementation efficiency, this is the VLAN identifier assigned to the slice.
  – xmi : Internal representation of the meta-interface (MI), an encoding of the receiving tunnel endpoint.
    • For UDP tunnels this field includes a 4-bit interface id and the 16-bit local UDP port number. The 4-bit id is used as an index into a table of local IP addresses.
• The IPv4 code option defined fields are shown below, where pr is the IP protocol field and tcp is the TCP header flags.
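The key fields described above can be sketched in a few lines of Python. This is only an illustration, not the substrate's implementation: the function names, the byte order of the packed key, the position of the 4-bit interface id within xmi, and the contents of LOCAL_ADDRS are all assumptions made for the example.

```python
import socket
import struct

# Hypothetical table of local IP addresses, indexed by the 4-bit interface id.
# The addresses are illustrative values taken from the meta-interface table.
LOCAL_ADDRS = ["192.168.1.2", "192.168.1.3"]


def pack_ipv4_copt_key(daddr, saddr, sport, dport, tcp_flgs, proto):
    """Pack the 112-bit IPv4 key {daddr, saddr, sport, dport, tcp_flgs, proto}.

    Field order follows the slide; big-endian packing is an assumption.
    """
    return struct.pack(
        ">4s4sHHBB",
        socket.inet_aton(daddr),
        socket.inet_aton(saddr),
        sport,
        dport,
        tcp_flgs,
        proto,
    )


def encode_xmi(if_id, udp_port):
    """Build a 20-bit xmi from a 4-bit interface id and a 16-bit UDP port.

    Placing the interface id in the high bits is an assumption.
    """
    if not 0 <= if_id < 16:
        raise ValueError("interface id must fit in 4 bits")
    if not 0 <= udp_port < 65536:
        raise ValueError("UDP port must fit in 16 bits")
    return (if_id << 16) | udp_port


def decode_xmi(xmi):
    """Recover (local ipaddr, local UDP port) using the address table."""
    if_id = (xmi >> 16) & 0xF
    return LOCAL_ADDRS[if_id], xmi & 0xFFFF


key = pack_ipv4_copt_key("10.5.2.7", "10.1.1.1", 4000, 5000, 0, 17)
xmi = encode_xmi(1, 6061)
```

Here `decode_xmi(xmi)` maps the encoded field back to the meta-interface {192.168.1.3, 6061}, which mirrors how the 4-bit id indexes the table of local IP addresses.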
IPv4 TCAM Filter Formats (on NPE)

Key:
  Substrate defined (represents the input meta-interface):
    T (1) | if (4) | vlan (11) | RX port (16)
    T = 0: normal lookup; T = 1: substrate only lookup
  Defined by the IPv4 code option, 112 bits:
    daddr (32) | saddr (32) | sport (16) | dport (16) | tcp/proto (16)
  tcp/proto field:
    TCP:  0100 | flags (field widths 2, 2, 12)
    !TCP: 00 (2) | RSV (6) | proto (8)

Result, 64 bits:
    rsv (3) | D (1) | L (1) | rsv (11) | sindx (16)
    TX IP daddr (32) | TX dport (16) | TX sport (16)
    rsv (12) | QM (2) | Sch (3) | qid (15)
  – D: drop packet; L: local delivery
  – sindx: global stats index (SCD maps the slice's sindx to the global value)
  – The TX IP address and sport represent the output meta-interface. The dport is provided by the slice. (RMP maps miid to TX tunnel parameters and uses the dport provided by the slice.)
  – QM, Sch, qid: 20-bit internal qid (SCD maps the slice's miid to QM and Sch; SCD also maps the slice's qid to the global qid value)

Slice parameters:
  Key: input miid, IPv4 fltr {daddr, saddr, sport, dport, tcp/proto}
  Result: flags {Drop, GPE}, sindx, output miid, QID

Lookup

• The parse block makes the copt_key.
• The substrate adds the xsid and xmi fields.
• The substrate uses the TxMI and nexthop fields to construct the encapsulation header.

[Figure: lookup pipeline: packet, decap, parse block, Lookup A. Annotations: {xsid, RxMI}. Key: xsid' | xmi | slice defined fields (xsid:RxMI:copt_key). Result: sindx:action:qid:TxMI:nexthop, with TxMI:nexthop passed on.]

Version 2 and Multicast

• In version 2 there will be 2 stages to the lookup.
• Add fanout (count) to lookup B: if fanout > 1 then address of fanout, else result vector. Chain fanout blocks.
• TxMI includes an interface vector: a 4-bit field that is used to look up the interface IP address and MAC address.

[Figure: version 2 lookup, first part. Labels: fanout table (qid:TxMI:nexthop); VLAN table in header format and VLAN table in Decap/Parse; pipeline begins with packet, decap.]
[Figure, continued: parse block; annotations: {xsid, RxMI}; key: xsid' | xmi | slice defined fields; LookupA: lookup_key gives action:sindx:rindx; LookupB: rindx (result_index) gives sindx:qid:TxMI:nexthop; sindex passed from side A, overloaded with fanout address.]

Lookup Example

• When a code option is requested the slice is allocated the requested number of TCAM entries; fid ∈ {0, ..., Nf-1}
  – All TCAM operations accept a TCAM entry ID (fid).
  – Entries are listed in priority order with fid=0 the highest priority and entry Nf-1 the lowest.
• It is up to the slice control path to order the lookup entries.
  – For example if we have the simple routing database:
      10.10.2.1/32   Local delivery (GPE)
      10.5.2.0/24    NH A
      10.5.1.0/24    NH B
      10.5.0.0/16    NH C
• Then the control software could use the following:
    write_fltr(fid, rxmi, {prefix,width}, action, {qid,TxMI,nexthop})
    write_fltr(0, *, {10.10.2.1, 0xFFFFFFFF}, LD)
    write_fltr(1, *, {10.5.2.0, 0xFFFFFF00}, fwd, {1, 1, NHA})
    write_fltr(2, *, {10.5.1.0, 0xFFFFFF00}, fwd, {2, 2, NHB})
    write_fltr(3, *, {10.5.0.0, 0xFFFF0000}, fwd, {3, 3, NHC})

Slice BW Allocations
  Interface   BW        ipAddr
  0*          BE        192.168.1.2
  1           100Mbps   10.50.10.2
  2           10Mbps    10.1.1.1

Slice Meta-Interfaces
  MI   IP Address    UDP Port
  0    192.168.1.2   6060
  1    10.50.10.2    6061
  2    10.50.10.2    6062
  3    10.1.1.1      6060

Slice Queue Bindings
  QID   Interface   BW       max Bytes
  0     0*          Local*
  1     1           40%      1024
  2     1           60%      1024
  3     2           100%     1024

Desired Route Table (LPM)
  prefix         TxMI   nexthop
  10.10.2.1/32   0*     Local
  10.5.2.0/24    1      NH A
  10.5.1.0/24    2      NH B
  10.5.0.0/16    3      NH C

Example IPv4 LPM

• In general, for longest prefix match a good strategy is to divide the allocated filters into 32 sets.
• For example assume 1024 TCAM entries have been allocated and we are using LPM.
  – Divide the filters into 32 sets of 32 filters each and associate a prefix length with each:
      Prefix Width   Filter ID Range
      32             0 - 31
      31             32 - 63
      w              (32-w)*32 + (0...31)
      1              992 - 1023
  – Then add each prefix to the set that matches its prefix width.
  – Entries within a set are non-overlapping so their order doesn't matter.
  – This is the scheme used by software written by IDT, the manufacturer of the TCAM we currently use.

Keeping track of TCAM entries

• The substrate will have to manage the mapping of VM TCAM filter IDs to the actual filter IDs.
• VM control software will use a normalized filter index list (it starts at 0 and has the requested number of filter entries).
• The SCD (xscale daemon) must map the per-VM index into the actual TCAM index.
• NPU A and B share a common TCAM and index range so this must be managed across the two xscales.
• Source for managing TCAM entries:
  – See the C++ implementation of the RangeMap class in $WUSRC/range
  – The class will also be used for managing the QID name space.

Control Software: Resource Management

System Resource Manager

[Figure: control software components. Labels: node components not in hub (switch, GPEs, Development Hosts); Resource DB; SNM; SRM; CP; GPE; NMP; NPE; SCD; LC; MUX; TCAM; RMP; SRAM; FP k; FP kx; FP vmx; control; SP; root context; planetlab OS; vnet. Annotations: support fast path configuration via the PLC; exception and local delivery traffic includes a shim header with RxMI.]

Partitioning of (substrate) Responsibilities

• Virtual Machine (slice control SW): application logic, code option specific control and data operations.
  – traditional PlanetLab slice operations
  – manage code option specific lookup tables, stats, memory and configuration blocks
  – implements interface with fast path for exception and local delivery traffic
• vnet
  – flow isolation: filtering traffic through the Linux kernel
  – add support for VLAN-based filtering and port reservation
• Resource Manager Proxy (aka Local Resource Manager)
  – all VM commands are issued to the RMP
    • the RMP is able to validate the command sender (authenticate)
    • enforce access restrictions (authorize)
    • decouples VMs from substrate control entities. That is, maps exported abstractions and interfaces to specific hardware and software interfaces.
  – verifies (or inserts) substrate message header slice IDs to prevent deliberate or accidental masquerading; part of ensuring isolation and security.
  – in tandem with SRM implements device independent logic
• System Resource Manager
  – device independent logic
  – responsible for implementing and enforcing
    • system resource abstractions
    • resource isolation and allocation policies
    • facilitating SNM: implementing PlanetLab compatible behavior and abstractions
• Substrate Control Daemon
  – intermediary between VM and code option instances (vouches for VM)
  – enforces policies on resource allocations and isolation in the control plane
  – implements device dependent logic

Responsibilities

[Figure: table ownership diagram, first part. Labels in order: System tables; Interfaces ifn:{type,ipaddr,linkBW,availBW}; xsid; Per Slice Tables; endpoints id:{type,ipaddr,port,proto,board,bw}; NPE (allocated): sram {start,size}, #flts, BW, controlIP, board ID, #Qs, BW, #Stats; SRM (the "Decider"); request allocation; SCD (NPE); Tables in data path; base SRAM; Lookup Table (fid, xsid:range, xsid:offset, "real" indx); Stats Table (sid, xsid:offset, "real" indx); VLAN Table (vlan, copt:sram_addr); Queue Params (xsid:size, xsid:range, "real" indx); HF Control Block? code option control blocks?]
[Figure, continued. Labels in order: ranges are not required to be contiguous; per interface scheduler and rate limits; Per Slice data; Slice Maps xsid: {qidMap,FidMap,statsMap}; Slice SRAM Assignments xsid: {sram_start,sram_size}; Interface BW; RMP; make allocation; qid xsid:range; GPE BWmaps??; endpoint (port) maps: servMap, resvMap; meta-ifaces mi:endpoint; plab sliceID; vlanid:xsid; xsidMap; gpe board id; vlan; VLAN maps range:{start,end}; endpoint (port) maps: resvMap, availMap, usedMaps; NPE Table id:{addr,BW/Port,copts,fltrs,sram,Qs}]

RMP Responsibilities
• Translate slice MI to local endpoint. Either call SRM or cache mappings.
• Add xsid to the subMsg header.
• Pass through identifiers mapped by SCD: qid, fid and stats.
• Pass through relative queue weights; SCD maps them to global weights.

SCD Responsibilities
• Translate slice specific indices to global indices: qid, fid and stats.
• Knows the location of all tables.
• Interprets commands to add, remove and modify entries in data path tables.
• Knows the per slice interface BW allocation and maps relative queue weight to global weight.
• Each interface scheduler is assigned a max rate (by SRM).

Queuing and allocating Interface Bandwidth

Simple Queuing Example

Slice Interface and Queue Allocations: {Port, BW, QList}, QList = {{qid, weight, threshold}, ...}

[Figure: NPE queues served by wrr. FP slice1 has queues q10, q11, ..., q1n' (qid in 0...n-1) and bandwidth BW11; FP slice2 queues begin q20, q21, ... (qid in 0...m-1).]
[Figure, continued: FP slice2 queues end at q2m' with bandwidth BW21. On the line card (LC) a wrr scheduler serves FP1 and FP2 on an interface with bandwidth BW1, where BW11 + BW21 = BW1. Other labels: GPE, GPE, ipAddr, linkBW.]

Physical Port (Interface) Attributes: {ifn, type, ipaddr, linkBW, availBW}
  ifn : Interface number
  type : {Internet, Peering}
Operations:
  get_interfaces()
  get_ifattrs(ifn)
  get_ifpeer(ifn)
  alloc_ifbw(ifn, xsid, bw)

Substrate Message Format

Substrate Message

[Figure: msg header made of the 16-bit fields mlen, mid, cid and cmd (bit positions 15..0), followed by body: 0-N (B).]

  mlen: Total message length, including the header.
  mid: Message ID, used to support synchronous message processing.
  cid: Context identifier. Specifies the context within which the message is processed. A value of 0 indicates substrate context.
  cmd: Command to execute or a return code.
  The 4 header fields are each 16 bits.
  body: 0 or more bytes of command data.

• Assume a simple command response (two-way) messaging framework, but one-way schemes will also be supported.
• Supports asynchronous communications using a message ID.
• The command field is overloaded for the return code.
• Every server is expected to implement a simple Version command (cmd == 0) which returns the server's ID and version number as two 32-bit fields.
  – The primary use is for monitoring the health of servers and debugging.
  – All other command values are unique only to a particular server.
• Uses UDP as the transport protocol.
• All commands are expected to be idempotent.

Overview

• In the interface specifications I provide a C-like description of the operations and results.
• The descriptions are only intended to describe the actual message format, data fields and returned results. They are not meant to specify an application level library.
• The arguments are to be encoded into the message body in the order they are given, using network byte order (big endian) and without padding.
• All commands result in one of the following:
  1.
No return response: one-way call semantics.
  2. An error occurs processing the message, or the command encounters an unexpected condition or error. In this case the return message will have the error return code in the cmd field.
  3. The command completes and does not indicate an error to the message framework. The message result code then indicates success and the message body contains any result data.

Example Message

• A slice with xsid of 0x10 requests the allocation of a global UDP port (UDP is IP protocol 17 decimal) for the local IP address 128.252.130.34 (hex 0x80FC8222).
  – Assume the alloc_port command ID is 4.
      port = alloc_port(0x80FC8222, 0, 17)
  – This allocates a global UDP (protocol 17) port for the local IP address 128.252.130.34 (hex 0x80FC8222) and lets the system assign the next available port number.
• The resource manager allocates port 5050 (0x13BA); the return code of 0 indicates success.

  Command Message:   1  F  10  4 | 80 FC 82 22 | 00 00 | 11
  Reply Message:     1  F  10  0 | 80 FC 82 22 | 13 BA | 11

NAT

• Problem:
  – UDP, TCP: 2 or more GPEs attempt to use the same global IP, port and proto
  – ICMP: ???

[Slide: queue bandwidth formulas. Only the symbols BWi,j, wi,j, Wj, BWj, BWj,min and MTU survive extraction.]
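The substrate message format and the alloc_port example from the Substrate Message slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the substrate's own code: the header field order (mlen, mid, cid, cmd) follows the order of the field descriptions, the message ID of 1 is chosen arbitrarily, the argument widths (32-bit address, 16-bit port, 8-bit protocol) are inferred from the example bytes, and the helper names encode_msg, decode_msg and alloc_port_body are hypothetical.

```python
import struct

# Four 16-bit header fields in network byte order.
# The order mlen, mid, cid, cmd is an assumption taken from the field list.
HEADER = struct.Struct(">HHHH")

CMD_VERSION = 0      # every server implements this command
CMD_ALLOC_PORT = 4   # command ID assumed by the example slide


def encode_msg(mid, cid, cmd, body=b""):
    """Prefix the body with a header; mlen counts the header too."""
    mlen = HEADER.size + len(body)
    return HEADER.pack(mlen, mid, cid, cmd) + body


def decode_msg(data):
    """Split a message into (mid, cid, cmd, body), checking mlen."""
    mlen, mid, cid, cmd = HEADER.unpack_from(data)
    if mlen != len(data):
        raise ValueError("mlen does not match the received length")
    return mid, cid, cmd, data[HEADER.size:]


def alloc_port_body(ipaddr, port, proto):
    """Encode alloc_port arguments in order, big endian, without padding."""
    return struct.pack(">IHB", ipaddr, port, proto)


# Request: slice 0x10 asks for any UDP (protocol 17) port on 128.252.130.34.
request = encode_msg(mid=1, cid=0x10, cmd=CMD_ALLOC_PORT,
                     body=alloc_port_body(0x80FC8222, 0, 17))

# Reply: cmd carries the return code 0 and the body carries port 5050.
reply = encode_msg(mid=1, cid=0x10, cmd=0,
                   body=alloc_port_body(0x80FC8222, 5050, 17))
```

Because the body is packed without padding, the request is 15 bytes long (8 header bytes plus 7 argument bytes), which agrees with the value F shown in the example's header.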