Supercharged PlanetLab Platform, Control Overview
Fred Kuhns
fredk@arl.wustl.edu
Applied Research Laboratory
Washington University in St. Louis
Prototype Organization
[Block diagram: a switch connects the GPEs, the Line Card NP blade (with RTM and external interface), and the NPE blade. The Line Card runs separate ingress and egress microengine (ME) pipelines between the external interface and the switch interface (ExtRx/IntRx, Key Extract, Lookup with TCAM, Hdr Format, Rate Monitor, Queue Manager, IntTx/ExtTx; 1-2 MEs per stage, backed by SRAM and DRAM). The NPE runs a single pipeline: Rx (1 ME), Key Extract (1 ME), Lookup (1 ME) with TCAM, Hdr Format (1 ME), Queue Manager (2 ME), Tx (1 ME), with SRAM and DRAM.]
• One NP blade (with RTM) implements Line Card
– separate ingress/egress pipelines
• Second NP hosts multiple slice fast-paths
– multiple static code options for diverse slices
– configurable filters and queues
• GPEs run the standard PlanetLab OS with vServers
Connecting an SPP
[Diagram: an SPP node (CP, GPEs, NPEs, internal switch, and LC(s)) attached by point-to-point links to routers and an Ethernet switch in a local/regional network, which in turn reaches East Coast and West Coast PlanetLab/SPP sites and hosts. ARP is used toward endstations and intermediate routers. For now, assume there is just a single connection to the public Internet.]
System Block Diagram
[System block diagram. Control Processor (CP): System Node Manager (SNM), System Resource Manager (SRM), Boot and Configuration Control (BCC), Resource DB, Slivers DB, route DB, user info, boot files (bootcd, cacert.pem, boot_server, plnode.txt, sppnode.txt, nodeconf.xml), and daemons (tftp, dhcpd, routed*, sshd*, httpd*). GPEs: Node Manager Proxy (NMP), Resource Manager Proxy (RMP), vnet, pl_netflow, and user slivers. NPE and Line Card NP blades: NPU-A/NPU-B with XScale control processors running the Substrate Control Daemon (SCD), TCAM, NAT and tunnel filters (in/out), ARP table, FIB, and netflow-style flow stats; each with an RTM providing 10 x 1GbE external interfaces. Interconnect: Fabric Ethernet switch (10Gbps, data path), Base Ethernet switch (1Gbps, control), SPI interfaces, shelf manager and power control unit (with its own IP address) reached over I2C (IPMI); PLC, the boot server, and standalone GPEs reached externally. Open notes on the figure: move pl_netflow to the CP? All flow monitoring is done at the Line Card; the SRM manages the LC tables; how is a remote reboot performed?]
Software Components
• Utilities: parts of the BCC to generate config and distribution files
– Node configuration and management: generate config files, dhcp, tftp, ramdisk
– Boot CD and distribution file management (images, RPM and tar files) for the GPEs and CP
• Control Processor:
– Boot and Configuration Control (BCC)
– System Resource Manager (SRM)
– System Node Manager (SNM)
– user authentication and ssh forwarding daemon
– http daemon providing a node-specific interface to netflow data (planetflow)
– Routing protocol daemon (BGP/OSPF/RIP) for maintaining the FIB in the Line Card
• General Purpose Element (GPE)
– Local Boot Manager (LBM): modified BootManager running on the GPEs
– Resource Manager Proxy (RMP)
– Node Manager Proxy (NMP), that is, the required changes to the existing Node Manager software
• Network Processor Element (NPE)
– Substrate Control Daemon (SCD, formerly known as wuserv)
– kernel module to read/write memory locations (wumod)
– command interpreter for configuring NPU memory (wucmd)
– modified Radisys and Intel source; ramdisk; Linux kernel
• Line Card
– ARP: protocol and error notifications. Lookup table entries have either the next-hop IP or an Ethernet address
• Sliver packets which cannot be mapped to an Ethernet address must receive error notifications.
– netflow-like stat collection and reporting to the CP for display on the web and downloading by PLC
– FIB in the lookup table maintained by the SRM
– NAT lookup entries for unregistered traffic originating from the GPE or CP
Boot and Configuration Control
Boot and Configuration Control
• Read config file and allocate IP subnets and addresses for substrate
• Initialize Hub (delegate to SRM)
– base and fabric switches
– Initialize any switches not within the chassis
• Create the dhcp configuration file and start the daemon (a minimal sketch follows this list)
– assigns control IP subnets and addresses
– assigns the internal substrate IP subnet on the fabric Ethernet
• Initialize Line Card to forward all traffic to CP
– Use the control interface, base or front panel (Base only connected to NPUA).
– All ingress traffic sent to CP
– What about egress traffic when we are multi-homed, either through different physical ports or one port with more than one next hop?
• We could assume only one physical port and one next hop.
• This is a general issue; the general solution is to run routing protocols on the CP and keep the line card’s TCAM up to date.
• Start remaining system level services (i.e. daemons)
– wuarl daemons
– system daemons: sshd*, httpd, routed*
• System Node Manager maintains user login information for ssh forwarding
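
The dhcp step above can be as simple as rendering the allocated control subnet into a dhcpd.conf and (re)starting the daemon. A minimal sketch in Python; the config fields, blade list, and file paths are assumptions, since the spp_config/nodeconf.xml layouts are not specified here:

    # Hypothetical sketch of the BCC dhcp step; config fields and paths are assumptions.
    import ipaddress
    import subprocess

    def write_dhcpd_conf(control_subnet, blades, path="/etc/dhcpd.conf"):
        """blades: list of (hostname, mac) pairs taken from the node config file."""
        net = ipaddress.ip_network(control_subnet)    # base (1GbE control) subnet
        hosts = iter(net.hosts())
        next(hosts)                                   # CP keeps the first address for itself
        out = ["subnet %s netmask %s { }" % (net.network_address, net.netmask)]
        for name, mac in blades:
            addr = next(hosts)                        # one fixed address per GPE/NPE/LC blade
            out.append("host %s { hardware ethernet %s; fixed-address %s; }"
                       % (name, mac, addr))
        with open(path, "w") as f:
            f.write("\n".join(out) + "\n")
        subprocess.call(["/sbin/service", "dhcpd", "restart"])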
Boot and Configuration Control
• Assist GPEs in booting (a minimal sketch of the first boot stage follows this list):
– Download the SPP-specific versions of the BootManager and NodeManager tar/rpm distributions from PLC.
– Download/maintain the PlanetLab bootstrap distribution.
• Updated BootCD
– The boot CD contains the SPP config file, spp_config, with the CP address.
– No modifications to the initial boot scripts; they contact the BCC over the fabric interface (using the substrate IP subnet) and download the next stage.
• GPEs obtain distribution files from the BCC on the CP:
– SPP changes are confined to the BootManager and NodeManager sources (that is the plan).
– The PLC database is updated to place all SPP nodes in the “SPP” node group; we use this to trigger additional “special” processing.
– Modified BootManager scripts configure the control interfaces (Base) and 2 Fabric interfaces (2 per Hub).
– Creates/updates the spp_config file on the GPE node.
– Installs the bootstrap source, then overwrites the NodeManager with our modified version.
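
A minimal sketch of that first boot stage on a GPE, assuming a hypothetical spp_config key name (CP_ADDR) and a hypothetical URL path on the BCC; the real file layout and download protocol are not specified on this slide:

    # Hypothetical sketch: the boot script reads spp_config and pulls the next
    # stage from the BCC on the CP instead of from PLC. Key names and the URL
    # path are assumptions.
    import urllib.request

    def read_spp_config(path="/usr/boot/spp_config"):
        conf = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, val = line.split("=", 1)
                    conf[key.strip()] = val.strip().strip('"')
        return conf

    def fetch_next_stage(dest="/tmp/bootmanager.sh"):
        conf = read_spp_config()
        url = "http://%s/boot/bootmanager.sh" % conf["CP_ADDR"]   # served by the BCC
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            out.write(resp.read())
        return dest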
Node Manager
System Node Manager
• Logically the top half of the PlanetLab Node Manager
• PLC API method GetSlivers() (see the sketch after this list):
– periodically call PLC for the current list of slices assigned to this node
– assign system slivers to each GPE, then split application slivers across the available GPEs
– keep persistent tables to handle daemon crashes or local device reboots
• Local GetSlivers() (XML-RPC interface) to the GPEs
– returns each Node Manager Proxy’s (per-GPE) list of allocated slivers along with other node-specific data:
{timestamp, list of configuration files, node id, node groups, network addresses, assigned slivers}
• Resource management across GPEs
– Manage Pool and VM RSpec assignment for each GPE:
• opportunity to extend RSpecs to account for distributed resources
– Perform ‘top-half’ processing of the per-GPE NMP API (exported only to slivers on this node). Calls on one GPE may impact resource assignments or sliver status on a different GPE:
{Ticket(), GetXIDs(), GetSSHKeys(), Create(), Destroy(), Start(), Stop(), GetEffectiveRSpec(), GetRSpec(), GetLoans(), validate_loans(), SetLoans()}
• Currently the Node Manager uses CA certs and SSH keys when communicating with PLC; we will need to do the same, but we can relax security between the SNM and the NMPs.
• Tightly coupled with the System Resource Manager
– Maintain a node-wide unique Sliver ID (which corresponds to what we call the meta-router ID) and make it available to the SRM when enabling fast-path processing (VLANs, UDP port numbers, etc.).
– Must request/maintain the list of available GPEs and the resource availability on each; used for allocating slivers to GPEs and handling RSpecs.
– The SRM may delegate GPE management to the SNM.
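
A minimal sketch of the GetSlivers() top half, assuming hypothetical helpers (a gpes list, a persistent slivers_db) and a simple round-robin split; the real SNM interfaces and placement policy are still open:

    # Hypothetical sketch of the SNM top half: poll PLC, split slivers across
    # GPEs, and persist the assignment. All helper names are assumptions.
    import time
    import xmlrpc.client

    POLL_INTERVAL = 15 * 60                      # seconds between PLC polls (assumption)

    def partition_slivers(slivers):
        # Assumption: system slices are identified by the "pl_" name prefix.
        system = [s for s in slivers if s["name"].startswith("pl_")]
        application = [s for s in slivers if not s["name"].startswith("pl_")]
        return system, application

    def snm_poll_loop(plc_url, gpes, slivers_db):
        plc = xmlrpc.client.ServerProxy(plc_url)
        while True:
            data = plc.GetSlivers()              # auth arguments omitted for brevity
            system, application = partition_slivers(data["slivers"])
            for gpe in gpes:
                gpe.assigned = list(system)      # system slivers go to every GPE
            for i, sliver in enumerate(application):
                gpes[i % len(gpes)].assigned.append(sliver)       # round-robin split
            slivers_db.save({g.name: g.assigned for g in gpes})   # survive crashes/reboots
            time.sleep(POLL_INTERVAL)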
SNM: Questions
• Robustness -- not contemplated for this version
– If a GPE goes down do we migrate slivers to remaining GPEs?
– If a GPE is added do we migrate some slivers to new GPE to load
balance?
– Intermediate solution:
• If GPE goes down then mark the corresponding slices as “unmapped”
and reassign to remaining GPEs
• No migration of slivers when GPEs are added, just assign new slivers to
the new GPE
• Do we need to intercept any of the API calls made against the
PLC?
• What about the boot manager api calls and the uploading of
boot log files (alpina boot logs)?
• implementation of the remote reboot command and console
logging.
Node Manager Proxy
• “Bottom half” of the existing Node Manager
• Modify GetSlivers() to call the System Node Manager (see the sketch after this list).
– use the base interface and different security (currently the xmlrpc calls are wrapped in a curl command which includes the PLC’s certified public key).
• Forward GPE-oriented sliver resource operations to the SNM: see the API list in the SNM description.
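
A minimal sketch of the proxied call, assuming a hypothetical SNM XML-RPC endpoint reached over the base interface; the relaxed security noted above is reflected by plain XML-RPC rather than the curl/SSL wrapper:

    # Hypothetical sketch: the NMP's GetSlivers() asks the SNM on the CP rather
    # than PLC directly. The endpoint URL and method name are assumptions.
    import socket
    import xmlrpc.client

    SNM_URL = "http://cp.base:8001/"             # CP reached over the base (control) network

    def get_slivers_from_snm():
        snm = xmlrpc.client.ServerProxy(SNM_URL)
        node_id = socket.gethostname()           # identify this GPE to the SNM
        return snm.GetSlivers(node_id)           # per-GPE slice of the node-wide answer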
System Resource Manager
System Resource Manager
[Diagram: the SRM (with its Resource DB) runs in the root context on the CP and manages node components not in the hub (switch, GPEs, development hosts) as well as the primary hub (logical slot 1, channel 1) and an alternate hub (logical slot 2, channel 2), each with fabric and base switches, snmpd, and XFP/SFP ports. It also reaches the Line Card (SCD, MUX, TCAM), the NPE (SCD, SRAM, TCAM, fast paths FP1..FPk), and the GPEs (PlanetLab OS with NMP and RMP in the root context).]
System Resource Manager
• Maintains tables describing the system hardware components and their attributes (a minimal data-model sketch follows this list)
– NPEs: code options, memory blocks, counters, TCAM entries
– GPEs and their HW attributes
• Sliver attributes corresponding to internal representations and control mechanisms:
– unique Sliver ID (aka meta-router ID)
– global port space across the assigned IP addresses
– fast-path VLAN assignment and the corresponding IP subnets
• Hub management:
– Manage the fabric Ethernet switches (including any used external to the chassis or in a multi-chassis scenario)
– Manage the base SW
• Manage line card table entries??
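
A minimal sketch of the Resource DB records the SRM might keep, using Python dataclasses; the field names are assumptions drawn from the attribute lists above:

    # Hypothetical sketch of Resource DB records; field names are assumptions.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class NPERecord:
        npe_id: int
        code_options: List[int]            # code options this NPE can host
        free_mem_blocks: List[int]         # preallocated SRAM block indices
        free_counters: int
        free_tcam_entries: int

    @dataclass
    class GPERecord:
        gpe_id: int
        cpu_share_avail: float             # fraction of CPU still unreserved
        mem_avail_mb: int
        slivers: List[str] = field(default_factory=list)

    @dataclass
    class SliverRecord:
        slice_name: str
        sliver_id: int                     # aka meta-router ID
        vlan_id: int                       # 1-to-1 with the sliver ID
        ports: Dict[str, List[int]] = field(default_factory=dict)  # IP addr -> allocated ports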
System Resource Management
• Allocate Global port space (these operations are sketched after this list)
– input: Slice ID, [Global IP address=0, proto=UDP, Port=0]
– actions: allocate port
– output: {IP Address, Port, Proto} or 0 [can’t allocate]
• Allocate Sliver ID
– input: Slice name
– actions:
• Allocate unique Sliver ID and assign to slice
• allocate VLAN ID (1-to-1 map of sliver ID to VLAN)
– output: {Sliver ID, VLAN ID}
• Allocate NPE code option (internal)
– input: Sliver ID, code option id
– action: Assign NPE ‘slot’ to slice
• Allocate code option instance from an eligible NPE; {NPE, instance ID}
• Allocate memory block for instance (the instance ID is just an index into an array of
preallocated memory blocks).
– output: NPE Instance = {NPE ID, Slot Number}
• Allocate Stats Index
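
A minimal sketch of these allocation calls; the Resource DB helpers (next_free_port, next_sliver_id, find_npe_with, etc.) are assumptions, but the return shapes follow the input/output lists above:

    # Hypothetical sketch of the SRM allocators; helper/table names are assumptions.
    def allocate_global_port(db, slice_id, ip_addr=0, proto="UDP", port=0):
        ip = ip_addr or db.default_global_ip()
        p = port or db.next_free_port(ip, proto)
        if p is None:
            return 0                              # can't allocate
        db.record_port(slice_id, ip, proto, p)
        return {"ip": ip, "port": p, "proto": proto}

    def allocate_sliver_id(db, slice_name):
        sliver_id = db.next_sliver_id()
        vlan_id = sliver_id                       # 1-to-1 map of sliver ID to VLAN
        db.record_sliver(slice_name, sliver_id, vlan_id)
        return {"sliver_id": sliver_id, "vlan_id": vlan_id}

    def allocate_code_option(db, sliver_id, code_option):
        npe = db.find_npe_with(code_option)       # eligible NPE with a free slot
        slot = npe.free_mem_blocks.pop(0)         # instance ID indexes a preallocated block
        db.bind_instance(sliver_id, npe.npe_id, slot)
        return {"npe_id": npe.npe_id, "slot": slot}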
System Resource Manager
• Add Tunnel (aka Meta-Interface) to an NPE Instance:
– input: Sliver ID, NPE Instance, {IP Address, UDP Port}
– actions:
• Add the mapping to the NPE demux table [VLAN:IP Addr:UDP Port <-> Instance ID]
• Update the instance’s attribute block {tunnel fields, exception/local delivery, QID, physical port, Ethernet addr for NPE/LC}
• Update the next-hop table (result index maps to the next-hop tunnel)
• Set default QM weights, number of queues, thresholds.
• Update Line Card ingress and egress lookup tables: tunnel, NPE Ethernet address, physical port, QIDs, etc.??
• Update LC ingress and egress queue attributes for the tunnel??
• Create NPE Sliver instance (sketched after this list):
– Input: Slice ID; {IP address, UDP Port}; {Interface ID, Physical Port}; {SRAM block; # filter table entries; # of queues; # of packet buffers; code option; amount of SRAM required; total reserved bandwidth}
– Actions:
• Allocate NPE code option
• Add tunnel to the NPE Instance
• Enable the Sliver’s VLAN on the associated fabric interface ports
• Delegate to the RMP: configure the GPE vnet module (via the RMP) to accept the Sliver’s VLAN traffic.
• Open UDP ports for data and control in the root context and pass them back to the client.
– output: (NPE code option) Instance number
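
A minimal sketch composing the Create-NPE-sliver-instance steps above; add_tunnel and the allocate_* helpers refer to the operations sketched earlier in this section, and the db/rmp objects and argument shapes are assumptions:

    # Hypothetical sketch composing the steps listed above; names are assumptions.
    import socket

    def open_udp_port(ip):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind((ip, 0))                        # kernel picks a free port in the root context
        return s.getsockname()[1]

    def create_npe_sliver_instance(db, rmp, slice_id, tunnel, iface, fastpath_spec):
        sliver = allocate_sliver_id(db, slice_id)
        inst = allocate_code_option(db, sliver["sliver_id"], fastpath_spec["code_option"])
        add_tunnel(db, sliver["sliver_id"], inst, tunnel)          # demux table, attribute block, LC tables
        db.enable_vlan(sliver["vlan_id"], iface["physical_port"])  # isolate the sliver's fabric traffic
        rmp.configure_vnet(sliver["vlan_id"])                      # GPE vnet accepts this VLAN
        data_port = open_udp_port(tunnel["ip"])                    # data and control sockets opened in
        ctrl_port = open_udp_port(tunnel["ip"])                    # the root context, returned to client
        return {"instance": inst, "data_port": data_port, "ctrl_port": ctrl_port}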
Resource Manager Proxy
• Act as an intermediary between the client virtual machines and the node control infrastructure.
– all exported interfaces are implemented by the RMP (a stub sketch follows this list)
• managing the life cycle of an NPE code instance
• accessing instance data and memory locations
• read/write to a code option instance’s memory block
• get/set queue attributes {threshold, weight}
• get/add/remove/update lookup table entries (i.e. TCAM filters)
• get/clear pre/post-queue counters for a given stats index
– one-time or periodic get
• get packet/byte counters for a tunnel at the Line Card
• allocate/release a local port
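
A minimal sketch of an RMP stub exposing a few of these calls to local slivers over XML-RPC; the method names and the SRM/SCD forwarding targets are assumptions:

    # Hypothetical RMP stub: exposes data-path calls to local slivers and forwards
    # them to the SRM or the NPE's SCD. Method names are assumptions.
    from xmlrpc.server import SimpleXMLRPCServer

    class RMP:
        def __init__(self, srm, scd):
            self.srm, self.scd = srm, scd
        def write_mem(self, sliver_id, offset, value):
            return self.scd.write_sram(sliver_id, offset, value)       # per-sliver SRAM block
        def set_queue(self, sliver_id, qid, threshold, weight):
            return self.scd.set_queue_attrs(sliver_id, qid, threshold, weight)
        def add_filter(self, sliver_id, key, result):
            return self.srm.add_tcam_filter(sliver_id, key, result)    # sliver's TCAM share
        def alloc_port(self, sliver_id, proto="UDP"):
            return self.srm.allocate_global_port(sliver_id, proto=proto)

    def serve(srm, scd, port=8002):
        server = SimpleXMLRPCServer(("0.0.0.0", port), allow_none=True)
        server.register_instance(RMP(srm, scd))
        server.serve_forever()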
Example Scenarios
Default Traffic Configurations
[Diagram: the SPP node with the Line Card mux, NPE, GPEs (NMP, RMP, and MP in the root context of the PlanetLab OS, plus additional GPEs), the CP (SNM, SRM, Resource DB, sliver table, user login info), PLC, and the two internal switches: 10GbE fabric (data) and 1GbE base (control). Captions: by default, traffic is forwarded to the CP over the 10Gbps Ethernet switch (aka fabric); control messages are sent over an isolated base Ethernet switch; for isolation and security the Line Card performs a NAT-like function for traffic from vservers.]
Logging Into a Slice
[Diagram: same SPP node layout (LC mux, NPE, GPEs, CP with SNM/SRM/Resource DB/sliver table, PLC, 10GbE fabric and 1GbE base switches) plus an external host. Captions: the ssh connection is directed to the CP (ssh forwarder, user login info) for user authentication; once authenticated, the session is forwarded to the appropriate GPE and vserver.]
Update Local Slice Definitions
[Diagram: same SPP node layout. Captions: the CP retrieves/updates slice descriptions from PLC, then updates the local database and allocates slice instances (slivers) to GPE nodes; the Resource DB and sliver table track the slices assigned to each GPE.]
Creating Local Slice Instance
[Diagram: same SPP node layout. Captions: a new slice is created at PLC; the CP retrieves/updates the slice descriptions; the slice lists propagate to the CP’s Resource DB/sliver table and to the GPE.]
Allocating NPE (Creating Meta-Router)
[Diagram: numbered steps overlaid on the node layout (GPE with RMP/MP in the root context, CP with SNM/SRM/Resource DB/sliver table, NPE with fast path FPk, control interface, lookup, TCAM and SRAM, LC mux with meta-interface MI1, 10GbE fabric and 1GbE base switches, PLC). Step captions: the client vserver asks to allocate an NPE sliver {code option, SRAM, Interfaces/Ports, etc.}; the request is forwarded to the System Resource Manager; the SRM allocates shared NPE resources for the slice fast path {SRAM block; # filter table entries; # of queues; # of packet buffers; code option; amount of SRAM required; total reserved bandwidth} and associates them with a new global UDP port; it allocates and enables VLANk on the requested interface(s) to isolate internal slice traffic and configures the Line Card; a local socket is opened for exception and local-delivery traffic and returned to the client vserver; the call returns status and the assigned global port number. FP - fast path.]
Managing the Data Path
• Allocate or delete an NPE slice instance (a client-side usage example follows the figure)
• Add, remove or alter filters
– each slice is allocated a portion of the NPE’s TCAM
• Read or write to per-slice memory blocks in SRAM
– each slice is allocated a block of SRAM
• Read counters
– one-time or periodic
• Set queue rate or threshold
• Get queue lengths
[Diagram: the sliver’s data-path (DP) processes on the GPE invoke the RMP, which talks to the SCD on the NPE (fast path FPk, SRAM, TCAM) over the base switch, with the SRM/SNM and Resource DB/sliver table on the CP tracking resources. FP - fast path.]
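
A hypothetical client-side example of driving these operations from a sliver’s vserver through the RMP stub sketched earlier; the sliver ID, filter key/result layout, and port number are assumptions:

    # Hypothetical client-side usage from within a sliver's vserver.
    import xmlrpc.client

    rmp = xmlrpc.client.ServerProxy("http://localhost:8002/", allow_none=True)

    SLIVER_ID = 42                                       # assigned when the fast path was created
    rmp.add_filter(SLIVER_ID, {"daddr": "10.1.0.0/16"}, {"mi": 1, "qid": 3})   # TCAM filter
    rmp.set_queue(SLIVER_ID, 3, 1024, 2)                 # qid, threshold, weight
    rmp.write_mem(SLIVER_ID, 0x40, 1)                    # flag in the slice's SRAM block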
Misc Functions
Other LC Functions
• Line Card table maintenance
– a multi-homed SPP node must be able to send packets to the correct next-hop router/endsystem
– random traffic from/to the GPE must be handled correctly
– tunnels represent point-to-point connections, so it may be alright to explicitly indicate which of possibly several interfaces and next (Ethernet) hop devices the tunnel should be bound to
– alternatively, if we are running the routing protocols, we could provide the user with the output port as a utility program
– but there are problems with running routing protocols: we could forward all route updates to the CP, but standard implementations assume the interfaces are physically connected to the endsystem
– we could play tricks as VINI does
– or we assume that there is only one interface connected to one Ethernet device
• NAT functions
– traffic originating from within the SPP
– may also want to selectively map global proto/port numbers to specific GPEs?
• ARP and FIB on the Line Card (a minimal lookup sketch follows this list)
– the route daemon runs on the CP and keeps the FIB up to date
– ARP runs on the XScale and maps FIB next-hop entries to their corresponding Ethernet destination addresses
• netflow
– flow-based statistics collection
– the SRM collects periodically and posts via the web
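
A minimal sketch of the next-hop resolution implied above: FIB longest-prefix match, then ARP-table mapping to an Ethernet address, with an error notification when a sliver packet cannot be mapped (as required on the Software Components slide); table layouts and IPv4-only next hops are assumptions:

    # Hypothetical sketch of LC egress resolution; table layouts are assumptions.
    import ipaddress

    def resolve_next_hop(fib, arp_table, dst_ip):
        """fib: list of (prefix, next_hop); next_hop is an IP or an Ethernet address.
        arp_table: {next_hop_ip: eth_addr}. Returns an Ethernet address or None
        (None means: send an error notification back to the sliver)."""
        dst = ipaddress.ip_address(dst_ip)
        best = None
        for prefix, nh in fib:                       # longest-prefix match
            net = ipaddress.ip_network(prefix)
            if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, nh)
        if best is None:
            return None                              # no route
        nh = best[1]
        if ":" in str(nh):                           # entry already holds an Ethernet address
            return nh
        return arp_table.get(str(nh))                # None -> unmappable next hop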
Other Functions
• vnet
– isolation based on VLAN IDs
– support port reservations
• ssh forwarding
– maintain user login information on CP
– modify ssh daemon (or have wrapper) to forward user
logins to correct GPE
• rebooting the node (SPP), even when the line card fails??