Supercharged PlanetLab Platform: Control Overview
Fred Kuhns (fredk@arl.wustl.edu)
Applied Research Laboratory, Washington University in St. Louis
Fred Kuhns - 3/24/2016

Prototype Organization

[Figure: prototype hardware. The Line Card NP blade (with RTM) runs separate ingress and egress pipelines (ExtRx/IntRx → Key Extract → Lookup → Hdr Format → Queue Manager → ExtTx/IntTx microengines, backed by TCAM, SRAM and DRAM); the NPE blade runs a single Rx → Key Extract → Lookup → Hdr Format → Queue Manager → Tx pipeline plus a Rate Monitor; GPEs attach via the switch interface.]

• One NP blade (with RTM) implements the Line Card
  – separate ingress/egress pipelines
• A second NP hosts multiple slice fast paths
  – multiple static code options for diverse slices
  – configurable filters and queues
• GPEs run the standard PlanetLab OS with vServers

Connecting an SPP

[Figure: an SPP node on the path between East Coast and West Coast PlanetLab/SPP hosts, attached through local/regional Ethernet switches and routers over point-to-point links; ARP covers end stations and intermediate routers; the node comprises Line Card(s), CP, switch, GPEs and NPEs.]

• For now, assume there is just a single connection to the public Internet.

System Block Diagram

[Figure: SPP node internals. Line Card: NPU-A/NPU-B with XScales running the Substrate Control Daemon (SCD), TCAM holding NAT and tunnel filters (in/out), flow stats (netflow), ARP table and FIB. NPE: NPU-A/NPU-B with SCD and TCAM. GPEs: user slivers, RMP, NMP, vnet and pl_netflow. Boot files (bootcd, cacert.pem, boot_server, plnode.txt, sppnode.txt) come from PLC; a Power Control Unit has its own IP address; the Hub provides the Fabric Ethernet Switch (10 Gbps, data path) over SPI interfaces. Open questions on the slide: how to reboot? move pl_netflow to the CP?]
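As a rough illustration only, the NPE fast-path stages named in the Prototype Organization figure (Key Extract, Lookup, Hdr Format, Queue Manager) can be sketched in software; every function, field and data structure below is invented for the sketch, not taken from the SPP code:

```python
# Hypothetical software sketch of the NPE fast-path pipeline stages.
# Stage names come from the slide; all data structures are illustrative.

def key_extract(pkt):
    # Build a demux key from the substrate header fields.
    return (pkt["vlan"], pkt["dst_ip"], pkt["udp_port"])

def lookup(key, tcam):
    # Dict stands in for the TCAM; real hardware does ternary matching.
    return tcam.get(key)  # -> {"next_hop": ..., "qid": ...} or None

def hdr_format(pkt, result):
    # Rewrite the header for the chosen next-hop tunnel.
    pkt["next_hop"] = result["next_hop"]
    return pkt

def queue_manager(pkt, result, queues):
    # Enqueue on the per-slice queue selected by the lookup result.
    queues.setdefault(result["qid"], []).append(pkt)

def fast_path(pkt, tcam, queues):
    result = lookup(key_extract(pkt), tcam)
    if result is None:
        return False  # no filter match -> exception/local delivery path
    queue_manager(hdr_format(pkt, result), result, queues)
    return True
```

On the real blades each stage runs on one or two microengines (MEs) in parallel; the sketch only shows the logical per-packet flow.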
[Figure, continued: Base Ethernet Switch (1 Gbps, control); Control Processor (CP) running the System Node Manager (SNM), System Resource Manager (SRM), BCC, Resource DB, Slivers DB, route DB, user info and boot files, plus tftp, dhcpd, routed*, sshd* and httpd*; standalone GPEs; shelf manager reached over I2C (IPMI); nodeconf.xml. Notes on the slide: the CP manages the Line Card tables; all flow monitoring is done at the Line Card.]

Software Components
• Utilities: parts of the BCC that generate config and distribution files
  – node configuration and management: generate config files, dhcp, tftp, ramdisk
  – Boot CD and distribution file management (images, RPM and tar files) for GPEs and the CP
• Control Processor (CP):
  – Boot and Configuration Control (BCC)
  – System Resource Manager (SRM)
  – System Node Manager (SNM)
  – user authentication and ssh forwarding daemon
  – http daemon providing a node-specific interface to netflow data (PlanetFlow)
  – routing protocol daemon (BGP/OSPF/RIP) for maintaining the FIB in the Line Card
• General Purpose Element (GPE):
  – Local Boot Manager (LBM): modified BootManager running on the GPEs
  – Resource Manager Proxy (RMP)
  – Node Manager Proxy (NMP): the required changes to the existing Node Manager software
• Network Processor Element (NPE):
  – Substrate Control Daemon (SCD, formerly known as wuserv)
  – kernel module to read/write memory locations (wumod)
  – command interpreter for configuring NPU memory (wucmd)
  – modified Radisys and Intel source; ramdisk; Linux kernel
• Line Card:
  – ARP: protocol and error notifications. Lookup table entries have either the next-hop IP or an Ethernet address.
    • Sliver packets which cannot be mapped to an Ethernet address must receive error notifications.
  – netflow-like stats collection and reporting to the CP, for display on the web and downloading by PLC
  – FIB in lookup table, maintained by the SRM
  – NAT lookup entries for unregistered traffic originating from the GPEs or CP

Boot and Configuration Control

Boot and Configuration Control
• Read the config file and allocate IP subnets and addresses for the substrate
• Initialize the Hub (delegate to SRM)
  – base and fabric switches
  – initialize any switches not within the chassis
• Create the dhcp configuration file and start the daemon
  – assigns control IP subnets and addresses
  – assigns the internal substrate IP subnet on the fabric Ethernet
• Initialize the Line Card to forward all traffic to the CP
  – use the control interface, base or front panel (base is only connected to NPU-A)
  – all ingress traffic sent to the CP
  – What about egress traffic when we are multi-homed, either through different physical ports or one port with more than one next hop?
    • We could assume only one physical port and one next hop.
    • This is a general issue; the general solution is to run routing protocols on the CP and keep the Line Card's TCAM up to date.
• Start the remaining system-level services (i.e. daemons)
  – wuarl daemons
  – system daemons: sshd*, httpd, routed*
• The System Node Manager maintains user login information for ssh forwarding

Boot and Configuration Control
• Assist GPEs in booting:
  – download from PLC the SPP-specific versions of the BootManager and NodeManager tar/rpm distributions
  – download/maintain the PlanetLab bootstrap distribution
• Updated BootCD
  – The boot CD contains an SPP config file, spp_config, with the CP address.
  – No modifications to the initial boot scripts; they contact the BCC over the fabric interface (using the substrate IP subnet) and download the next stage.
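The "create the dhcp configuration file" step above might look like the following minimal sketch; the subnet values, function names and file layout are invented for illustration and are not the BCC's actual format:

```python
# Hypothetical sketch of the BCC step that generates a dhcpd configuration
# covering the control (base) and internal substrate (fabric) subnets.

def dhcpd_subnet(network, netmask, first, last):
    # Emit one dhcpd.conf subnet declaration with a dynamic address range.
    return (f"subnet {network} netmask {netmask} {{\n"
            f"  range {first} {last};\n"
            f"}}\n")

def make_dhcpd_conf(subnets):
    # subnets: list of (network, netmask, first_addr, last_addr) tuples,
    # e.g. one for the base control subnet and one per fabric subnet.
    return "\n".join(dhcpd_subnet(*s) for s in subnets)
```

The BCC would write the result to dhcpd's configuration file and start the daemon, so that GPEs booting from the BootCD pick up their control and substrate addresses automatically.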
• GPEs obtain distribution files from the BCC on the CP:
  – SPP changes are confined to the BootManager and NodeManager sources (that is the plan).
  – The PLC database is updated to place all SPP nodes in the "SPP" node group; we use this to trigger additional "special" processing.
  – Modified BootManager scripts configure the control (base) interfaces and two fabric interfaces (2 per Hub).
  – Creates/updates the spp_config file on the GPE node.
  – Installs the bootstrap source, then overwrites the NodeManager with our modified version.

Node Manager

System Node Manager
• Logically the top half of the PlanetLab Node Manager
• PLC API method GetSlivers():
  – periodically call PLC for the current list of slices assigned to this node
  – assign system slivers to each GPE, then split application slivers across the available GPEs
  – keep persistent tables to handle daemon crashes or local device reboots
• Local GetSlivers() (xmlrpc interface) to GPEs
  – Node Manager Proxies (per GPE) receive the list of allocated slivers along with other node-specific data: {timestamp, list of configuration files, node id, node groups, network addresses, assigned slivers}
• Resource management across GPEs
  – Manage Pool and VM RSpec assignment for each GPE:
    • opportunity to extend RSpecs to account for distributed resources
  – Perform 'top-half' processing of the per-GPE NMP API (exported to slivers on this node only). Calls on one GPE may impact resource assignments or sliver status on a different GPE: {Ticket(), GetXIDs(), GetSSHKeys(), Create(), Destroy(), Start(), Stop(), GetEffectiveRSpec(), GetRSpec(), GetLoans(), validate_loans(), SetLoans()}
• Currently the Node Manager uses CA certs and SSH keys when communicating with PLC; we will need to do the same. But we can relax security between the SNM and the NMPs.
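The sliver-placement policy above (system slivers on every GPE, application slivers split across the available GPEs) can be sketched as follows; the function name, round-robin split and data shapes are mine, not from the SNM design:

```python
# Hypothetical sketch of the System Node Manager placement policy:
# every GPE gets all system slivers; application slivers are split
# round-robin across the available GPEs.

def assign_slivers(system_slivers, app_slivers, gpes):
    # Returns {gpe_name: [sliver_names]}.
    plan = {gpe: list(system_slivers) for gpe in gpes}
    for i, sliver in enumerate(app_slivers):
        plan[gpes[i % len(gpes)]].append(sliver)
    return plan
```

Because the SNM keeps persistent tables, a real implementation would also have to preserve existing assignments across daemon restarts rather than recomputing the split from scratch.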
• Tightly coupled with the System Resource Manager
  – Maintain a globally unique (to the node) sliver ID, which corresponds to what we call the meta-router ID, and make it available to the SRM when enabling fast-path processing (VLANs, UDP port numbers, etc.).
  – Must request/maintain the list of available GPEs and the resource availability on each; used for allocating slivers to GPEs and handling RSpecs.
  – The SRM may delegate GPE management to the SNM.

SNM: Questions
• Robustness -- not contemplated for this version
  – If a GPE goes down, do we migrate its slivers to the remaining GPEs?
  – If a GPE is added, do we migrate some slivers to the new GPE to load balance?
  – Intermediate solution:
    • If a GPE goes down, mark the corresponding slices as "unmapped" and reassign them to the remaining GPEs.
    • No migration of slivers when GPEs are added; just assign new slivers to the new GPE.
• Do we need to intercept any of the API calls made against the PLC?
• What about the boot manager API calls and the uploading of boot log files (Alpina boot logs)?
• Implementation of the remote reboot command and console logging.

Node Manager Proxy
• The "bottom half" of the existing Node Manager
• Modify GetSlivers() to call the System Node Manager.
  – Use the base interface and different security (currently they wrap xmlrpc calls with a curl command which includes the PLC's certified public key).
• Forward GPE-oriented sliver resource operations to the SNM: see the API list in the SNM description.

System Resource Manager

System Resource Manager

[Figure: node components not in the Hub (switch, GPEs, development hosts); the Line Card with SCD, MUX and TCAM; a GPE with NMP and RMP.]
[Figure, continued: primary Hub (logical slot 1, channel 1) and second Hub (logical slot 2, channel 2), each with a Fabric SW, a Base SW, SFP/XFP ports and snmpd; the CP runs the SRM and its Resource DB in the root context of the PlanetLab OS; the NPE carries the SCD, SRAM, TCAM and fast-path instances FP1..FPk.]

System Resource Manager
• Maintains a table describing system hardware components and their attributes
  – NPEs: code options, memory blocks, counters, TCAM entries
  – GPEs and their HW attributes
• Sliver attributes corresponding to internal representations and control mechanisms:
  – unique sliver ID (aka meta-router ID)
  – global port space across assigned IP addresses
  – fast-path VLAN assignment and corresponding IP subnets
• Hub management:
  – manage fabric Ethernet switches (including any used external to the chassis or in a multi-chassis scenario)
  – manage the base switch
• Manage Line Card table entries??

System Resource Management
• Allocate global port space
  – input: Slice ID, [Global IP address=0, proto=UDP, Port=0]
  – actions: allocate port
  – output: {IP Address, Port, Proto} or 0 [can't allocate]
• Allocate sliver ID
  – input: slice name
  – actions:
    • allocate a unique sliver ID and assign it to the slice
    • allocate a VLAN ID (1-to-1 map of sliver ID to VLAN)
  – output: {Sliver ID, VLAN ID}
• Allocate NPE code option (internal)
  – input: Sliver ID, code option ID
  – action: assign an NPE 'slot' to the slice
    • allocate a code option instance from an eligible NPE; {NPE, instance ID}
    • allocate a memory block for the instance (the instance ID is just an index into an array of preallocated memory blocks)
  – output: NPE Instance = {NPE ID, Slot Number}
• Allocate stats index

System Resource Manager
• Add tunnel (aka meta-interface) to an NPE instance:
  – input: Sliver ID, NPE Instance, {IP Address, UDP Port}
  – actions:
    • add the mapping to the NPE demux table [VLAN:IP Addr:UDP Port <-> Instance ID]
    • update the instance's attribute block {tunnel fields, exception/local delivery, QID, physical port, Ethernet addr for NPE/LC}
    • update the next-hop table (result index maps to next-hop tunnel)
    • set default QM weights, number of queues, thresholds
    • update Line Card ingress and egress lookup tables: tunnel, NPE Ethernet address, physical port, QIDs, etc.??
    • update LC ingress and egress queue attributes for the tunnel??
• Create NPE sliver instance:
  – input: Slice ID; {IP address, UDP Port}; {Interface ID, Physical Port}; {SRAM block; # filter table entries; # of queues; # of packet buffers; code option; amount of SRAM required; total reserved bandwidth}
  – actions:
    • allocate the NPE code option
    • add the tunnel to the NPE instance
    • enable the sliver's VLAN on the associated fabric interface ports
    • delegate to the RMP: configure the GPE vnet module (via the RMP) to accept the sliver's VLAN traffic; open UDP ports for data and control in the root context and pass them back to the client
  – output: (NPE code option) instance number

Resource Manager Proxy
• Act as an intermediary between client virtual machines and the node control infrastructure.
  – All exported interfaces are implemented by the RMP:
    • managing the life cycle of an NPE code instance
    • accessing instance data and memory locations
    • read/write to a code option instance's memory block
    • get/set queue attributes {threshold, weight}
    • get/add/remove/update lookup table entries (i.e. TCAM filters)
    • get/clear pre/post-queue counters for a given stats index — one-time or periodic get
    • get packet/byte counters for a tunnel at the Line Card
    • allocate/release a local port

Example Scenarios

Default Traffic Configurations

[Figure: default configuration. Traffic arriving at the Line Card mux is forwarded to the CP over the 10 GbE fabric (data) switch; control messages are sent over an isolated 1 GbE base Ethernet switch, for isolation and security; the Line Card performs a NAT-like function for traffic from vServers; additional GPEs attach to both fabric and base.]

Logging Into a Slice

[Figure: an ssh connection is directed to the CP for user authentication (ssh forwarder, using the SNM's user login info); once authenticated, the session is forwarded to the appropriate GPE and vServer.]

Update Local Slice Definitions

[Figure: the SNM retrieves/updates slice descriptions from PLC, updates the local database, and allocates slice instances (slivers) to GPE nodes via the SRM's Resource DB and sliver table.]

Creating Local Slice Instance

[Figure: PLC-driven creation of a new slice: the CP retrieves/updates the slice descriptions from PLC, then the new slice is created on the assigned GPE.]
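The "Create NPE sliver instance" sequence from the System Resource Manager slides — allocate a sliver ID and VLAN, pick an NPE code-option slot, add the tunnel mapping to the demux table, enable the VLAN — can be sketched as below. The class, ID bases and field names are invented for the sketch; the real SRM also programs the Line Card and delegates vnet setup to the RMP.

```python
# Hypothetical sketch of the SRM sliver-creation sequence. All names,
# numbering bases and data shapes are illustrative only.

import itertools

class SRM:
    def __init__(self, npe_slots=4):
        self._ids = itertools.count(1)            # unique sliver (meta-router) IDs
        self.free_slots = list(range(npe_slots))  # preallocated code-option slots
        self.demux = {}                           # (vlan, ip, udp_port) -> slot
        self.enabled_vlans = set()

    def allocate_sliver(self, slice_name):
        # 1-to-1 map of sliver ID to VLAN (base 100 is an invented choice).
        sliver_id = next(self._ids)
        return sliver_id, 100 + sliver_id

    def create_instance(self, slice_name, ip, udp_port):
        sliver_id, vlan = self.allocate_sliver(slice_name)
        slot = self.free_slots.pop(0)             # allocate NPE code-option slot
        self.demux[(vlan, ip, udp_port)] = slot   # add tunnel to demux table
        self.enabled_vlans.add(vlan)              # isolate the slice's traffic
        return {"sliver_id": sliver_id, "vlan": vlan, "instance": slot}
```

A failed allocation (no free slot, no free port) would undo the earlier steps; that error handling is omitted here.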
Allocating NPE (Creating a Meta-Router)

[Figure: allocating a slice fast path (FPk) on the NPE. The client vServer opens a local socket and requests a fast path {code option, SRAM, interfaces/ports, etc.}; the request is forwarded to the System Resource Manager, which allocates the shared NPE resources {SRAM block; # filter table entries; # of queues; # of packet buffers; code option; amount of SRAM required; total reserved bandwidth}, allocates and enables VLANk on the requested interface(s) to isolate internal slice traffic, configures the Line Card, associates a global UDP port with the new meta-interface MI1, and returns status and the assigned global port number to the client. Exception and local-delivery traffic is returned to the client vServer.]

Managing the Data Path
• Allocate or delete an NPE slice instance
• Add, remove or alter filters
  – each slice is allocated a portion of the NPE's TCAM
• Read or write per-slice memory blocks in SRAM
  – each slice is allocated a block of SRAM
• Read counters
  – one-time or periodic
• Set queue rate or threshold
• Get queue lengths

[Figure: the RMP in each GPE reaches the NPE's fast paths (FP1..FPk, SRAM, TCAM) through the SCD, over the 10 GbE fabric (data) and 1 GbE base (control) networks; the CP holds the SNM, SRM, Resource DB and sliver table.]

Misc Functions

Other LC Functions
• Line Card table maintenance
• NAT functions
  – traffic originating from within the SPP; may we also want to selectively map global proto/port numbers to specific GPEs?
• ARP and FIB on the Line Card
  – a multi-homed SPP node must be able to send packets to the correct next-hop router/end system
  – random traffic from/to the GPEs must be handled correctly
  – tunnels represent point-to-point connections, so it may be all right to explicitly indicate which of possibly several interfaces and next (Ethernet) hop devices the tunnel should be bound to
  – alternatively, if we are running the routing protocols, we could provide the user with the output port via a utility program. But there are problems with running routing protocols: we could forward all route updates to the CP, but standard implementations assume the interfaces are physically connected to the end system. We could play tricks as VINI does, or we assume that there is only one interface connected to one Ethernet device.
  – the route daemon runs on the CP and keeps the FIB up to date
  – ARP runs on the XScale and maps FIB next-hop entries to their corresponding Ethernet destination addresses
• netflow
  – flow-based statistics collection
  – the SRM collects the stats periodically and posts them via the web

Other Functions
• vnet
  – isolation based on VLAN IDs
  – support port reservations
• ssh forwarding
  – maintain user login information on the CP
  – modify the ssh daemon (or have a wrapper) to forward user logins to the correct GPE
• Rebooting the node (SPP), even when the Line Card fails??
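The per-slice operations listed under "Managing the Data Path" (filters in a TCAM share, queue attributes, counters) can be summarized as an RMP-style interface; this is a sketch under invented names and data shapes, not the actual RMP API:

```python
# Hypothetical sketch of the per-slice data-path state the RMP manages
# on behalf of a client vServer. All names and shapes are illustrative.

class FastPathInstance:
    def __init__(self, n_queues=8):
        self.filters = {}    # lookup key -> result (the slice's TCAM share)
        self.queues = {q: {"threshold": 0, "weight": 1} for q in range(n_queues)}
        self.counters = {}   # stats index -> packet count

    def add_filter(self, key, result):
        self.filters[key] = result

    def set_queue(self, qid, threshold=None, weight=None):
        # Get/set queue attributes {threshold, weight}.
        if threshold is not None:
            self.queues[qid]["threshold"] = threshold
        if weight is not None:
            self.queues[qid]["weight"] = weight

    def read_counter(self, stats_index, clear=False):
        # One-time read; a periodic read would repeat this on a timer.
        value = self.counters.get(stats_index, 0)
        if clear:
            self.counters[stats_index] = 0
        return value
```

In the real system these calls would be forwarded by the RMP to the SCD on the NPE, which performs the actual SRAM/TCAM reads and writes.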