The Experiment Lifecycle and its Major Programs

Experiment Lifecycle: the User Perspective

Creating an Experiment
• Done with 'batchexp' for both batch and interactive experiments
  – "batch" is a historical name
• Can bring the experiment to one of three states:
  – swapped: pre-run only
  – posted: queued experiment, ready to run
  – active: experiment swapped in

Swapping an Experiment
• Done with 'swapexp'
• Can effect several transitions:
  – swapped to active (swap in experiment)
  – active to swapped (swap out experiment)
  – active to active (modify experiment)
  – posted to swapped (dequeue batch experiment)

Pre-run (tbprerun)
• Parse the NS file (parse-ns and parse.tcl)
  – Put virtual state in the database (xmlconvert)
• Do visualization layout (prerender)
• Compute static routes (staticroutes)

swapped to active (tbswap in)
• Mapping: find nodes for the experimenter
  – assign_wrapper
  – assign
• Allocate nodes (nalloc)
  – Set up serial console access (console_setup)
• Set up NFS exports (exports_setup)
• Set up DNS names (named_setup)
• Reboot nodes and wait for them (os_setup)
  – Load disks if necessary (os_load)

swapped to active (contd.)
• Start the event system (eventsys_control)
• Create VLANs (snmpit)
• Set up mailing lists (genelists)
• Failure at any step results in a swapout

active to swapped (tbswap out)
• Stop the event system (eventsys_control)
• Tear down VLANs (snmpit)
• Free nodes (nfree)
  – Scheduled reservations (sched_reserve)
  – Place nodes in the reloadpending experiment
  – Revoke console access (console_setup)
• Reset DNS (named_setup)
• Reset NFS exports (exports_setup)
• Reset mailing lists (genelists)

active to active (tbswap modify)
• Purpose: experiment modification
  – Get new virtual state (re-parse the NS file)
  – Bring the physical mapping into sync with the new state
• Leaves alone nodes whose physical mapping matches the new virtual state

Important Daemons
• batch_daemon
  – Picks up posted experiments
  – Attempts a swapin
  – One experiment at a time for each user
  – Swaps out finished batch experiments
• reload_daemon
  – Picks up nodes from the reloadpending experiment
  – Frees them when done reloading

Next, in More Depth
• Parsing
• Resource allocation
  – Setup for the action: assign_wrapper
  – The real brains: assign
• Serial console management
• Link shaping
• IP routing support
• Traffic generation
• Inter-node synchronization
• Event system

Parsing Experiment Configurations

Experiment Configuration Language
• General-purpose OTcl scripting language based on NS
• Exports an API nearly identical to NS's, albeit a subset
• Testbed-specific actions via the tb-* procedures
  – We provide a compatibility script to include when running under an NS simulation
• Define your own procedures / classes / methods

Making Sense out of Others' Code
• The parser is also written in OTcl
• It mirrors a subset of the NS classes
• The implemented methods of those classes capture the user-specified experiment attributes
• Convert the experiment attributes to an intermediate XML format
  – The generic format makes it easy to add support for other configuration languages
• Store the configuration in the virt_* tables, such as virt_nodes and virt_lans
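To make the input side concrete, a minimal NS file of the kind the parser consumes might look like the sketch below. The node and link names are arbitrary, and tb_compat.tcl is the compatibility script mentioned above (names as used in the Emulab tutorial):

    source tb_compat.tcl
    set ns [new Simulator]

    # Two nodes joined by a shaped link
    set n0 [$ns node]
    set n1 [$ns node]
    set link0 [$ns duplex-link $n0 $n1 100Mb 10ms DropTail]

    # Testbed-specific attributes via tb-* procedures
    tb-set-node-os $n0 FBSD-STD
    tb-set-hardware $n1 pc850

    $ns rtproto Static
    $ns run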
Implementation Quirks
• Capture top-level resource names for later use
  – E.g., use 'n0' to name the physical node when the user asks for: set n0 [$ns node]
• Rename resources to work around restrictions, such as those of DNS
  – E.g., node 'n(0)' becomes 'n-0'
• The parser is run on ops for security reasons
  – Mixing trusted and untrusted OTcl code on the main server (boss) is dangerous
• Read tbsetup/ns2ir/README in the source tree for details

Assign Wrapper (PG Version)

Assign Wrapper
• Perl frontend to assign
• Converts the virtual DB representation to the more neutral "top" file format (assign's input)
• Converts the results from plain-text format into the physical DB representation
• assign_wrapper is extremely testbed-aware
• Moves information from virtual tables to physical tables

Virtual Representation
• An experiment is really a set of tables in the database
• Includes "virt_nodes" and "virt_lans", which describe the nodes and the network topology
• Other tables include routes, program agents, traffic generators, virtual types, etc.

Virtual Representation (contd.)
• Example:
    set n1 [$ns node]
    set n2 [$ns node]
    set link0 [$ns duplex-link $n1 $n2 100MB 10ms]
    tb-set-hardware $n2 pc600
• Is stored in database tables:
    virt_nodes ('n1', '10.1.1.1', 'pc850', 'FBSD-STD', ...)
    virt_nodes ('n2', '10.1.1.2', 'pc600', 'RHL-STD', ...)
    virt_lans ('link0', 'n1', '100MB', '5ms', ...)
    virt_lans ('link0', 'n2', '100MB', '5ms', ...)
• Note that the 10ms link delay is split between the two member rows; a packet crossing the link incurs both 5ms delays

What's a top File?
• Stands for "topology" file, but that's too many syllables.
• The input file to assign, specifying nodes, links, and desires.
• The DB representation above converts to:
    node n1 pc850
    node n2 pc600
    link link0/n1:0,n2:0 n1 n2 100000 0 0
• assign combines this with the currently free physical resources to come up with a solution.

Assign Results
• assign maps n1 and n2 to pc1 and pc41 based on types and bandwidth:
    Nodes
    n1 pc1
    n2 pc41
    End Nodes
    Edges
    link0/n1:0,n2:0 intraswitch pc1/eth3 pc41/eth1
    End Edges
• The above is a simplified version of the actual results. Gory details are available elsewhere.

Assign Wrapper Continues
• Allocate the physical resources (nodes) specified by assign
• Allocate virtual resources (vnodes) on physical nodes (local and remote)
• If some nodes were already allocated (someone else got them before you), try again
• Keep trying until the maximum number of tries is exceeded; assign might fail to find a solution on the first N tries

Assign Wrapper Keeps Going…
• Insert the set of "vlans" into the database
  – pc1/eth3 connected to pc41/eth1
• Update the "interfaces" table with the IP addresses assigned by the parser
• Update the "nodes" table with user-specified values from virt_nodes
  – OSids, RPMs, tarballs, etc.
• Update the "linkdelays" table with end-node traffic shaping configuration (from virt_lans)

And Going and Going
• Update the "delays" table with delay-node traffic shaping configuration
• Update the "tunnels" table with tunnel configuration (wide-area nodes)
• Update the "agents" table with the locations where events should be sent to control traffic shaping
• Call exit(0) and rest!

Resource Allocation: assign

assign's Job
• Maps virtual resources to local nodes and VLANs
• A general combinatorial-optimization approach to an NP-hard problem
• Uses simulated annealing
• Minimizes inter-switch links, the number of switches used, and other cost terms
• Takes seconds for most experiments
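To make the approach concrete, here is a toy simulated-annealing loop in the spirit of assign. This is only a sketch: the physical plant, move generator, and cost function are invented for illustration, they are far simpler than assign's real ones, and assign itself is not written in Tcl.

    # Toy plant: pc1,pc2 on switch A; pc3,pc4 on switch B (hypothetical).
    array set sw {pc1 A pc2 A pc3 B pc4 B}
    # Virtual topology: links v0-v1 and v1-v2 among three virtual nodes.
    set vlinks {{0 1} {1 2}}

    # A candidate solution is a permutation of the physical nodes;
    # slot i holds the physical node assigned to virtual node i.
    proc cost {perm} {
        global sw vlinks
        set c 0
        foreach l $vlinks {
            lassign $l a b
            # charge one unit for every virtual link that crosses switches
            if {$sw([lindex $perm $a]) ne $sw([lindex $perm $b])} { incr c }
        }
        return $c
    }

    proc neighbor {perm} {
        # generate a nearby solution by swapping two random slots
        set n [llength $perm]
        set i [expr {int(rand()*$n)}]
        set j [expr {int(rand()*$n)}]
        set t [lindex $perm $i]
        lset perm $i [lindex $perm $j]
        lset perm $j $t
        return $perm
    }

    set cur {pc1 pc3 pc2 pc4}
    for {set temp 10.0} {$temp > 0.01} {set temp [expr {$temp * 0.95}]} {
        set cand [neighbor $cur]
        set delta [expr {[cost $cand] - [cost $cur]}]
        # always accept improvements; accept regressions with a
        # probability that falls as the temperature drops
        if {$delta <= 0 || rand() < exp(-$delta/$temp)} {
            set cur $cand
        }
    }
    puts "v0..v2 -> [lrange $cur 0 2] (cost [cost $cur])"

In assign proper, the cost function also reflects the constraints and preferences listed on the next slide, and the move generator is considerably more sophisticated.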
What's Hard About It?
• Satisfy constraints
  – Requested types
  – Can't go over inter-switch bandwidth
  – Domain-specific constraints
    • LAN placement for virtual nodes
    • Subnodes
• Maximize the opportunity for future mappings
  – Minimize inter-switch bandwidth
  – Avoid scarce nodes

What It Can Do
• Handle multiple types of nodes on multiple switches
• Allow users to ask for classes of nodes
• Prefer or discourage the use of certain nodes
• Map multiple virtual nodes to one physical node
• Handle nodes that are 'hosted' in some other node
• Partial solutions

What It Doesn't Do
• Map based on observed end-to-end network characteristics
  – Applicable to wide-area and wireless
  – But we have another program, wanassign, that can
• Satisfy requests for specific link types
  – But we could approximate this with subnodes
• Full node resource description

Issues
• Complicated
  – Several authors
  – Subject of a paper evaluating many configurations
  – The nature of the randomized algorithm makes debugging hard
  – Evolved over time to keep up with features
• Scaling
  – Particularly with virtual and simulated nodes
  – Not just scale (1000s); it's the type of node
  – Pre-passes may help
• The good: it's coped with a lot of new demands!

Remote Console Access

Executive Summary
• Allow user access to consoles via serial line
• A console proxy enables remote access
• Authentication and encryption
• All console output is logged
• Requires OS support for serial consoles
• Utah Emulab: all nodes have serial lines
  – Not required, but handy

Serial Consoles
• Can redirect the console in three places
  – BIOS: on most "server" motherboards
  – Boot loader: easy on BSD and Linux
  – OS: easy on BSD and Linux
• Boot loaders and OSes must be configured
  – Generally via the boot loader configuration

The Serial Line Proxy (capture)
• Original purpose was to log console output
  – Read/write the serial line, log the data, present a tty interface
  – Use "tip" to access the pty
• Enhanced to "remote" the console
  – Presents a socket interface
  – Can be accessed from anywhere on the network
• One capture process per serial line

Authentication (capserver)
• Only users in an experiment can access its consoles
• Uses a one-time key
  – capture, running on the serial line host, generates a new key for every "session"
• capture sends the key to capserver on the boss node
  – capserver records the key in the DB and returns ownership info
  – capture uses that info to protect the ACL and log files

Clients (console, tiptunnel)
• console is the replacement for tip
  – Run on ops; obtains access info via the ACL file created by capture
  – File permissions restrict user access
• tiptunnel is the remote version
  – Binaries for Linux, BSD, and Windows
  – Run as a helper app from the browser
  – Access info is passed via a secure web connection
  – All communication via SSL

Emulab Link Shaping

Executive Summary
• Emulab allows setting and modifying bandwidth, latency, and loss rate on a per-link basis
• Interface through the NS script or a command line tool
• Implemented either by dedicated "delay" nodes or on the end nodes
• Delay nodes work with any end node OS
• End node shaping requires FreeBSD or Linux

Delay Nodes
• Run FreeBSD + dummynet + bridging
• FreeBSD kernel:
  – Runs at 10000 Hz to improve timer accuracy
  – Uses polling device drivers to reduce overhead
• Nodes are dedicated to an experiment
• One node can shape multiple links
• Transparent to the end nodes
• Not transparent to the switch fabric

VLANs and Delay Nodes
(diagram)

End Node Shaping ("link delays")
• Handles link shaping at both ends of the link
• Requires OS support on the end nodes
  – FreeBSD: dummynet
  – Linux: "tc" with modifications
• Conserves Emulab resources at the potential expense of emulation fidelity
• Works in environments where delay nodes are not practical or possible
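For a flavor of what either form of shaping amounts to on FreeBSD, dummynet is driven by ipfw rules along these lines (a sketch only: the pipe number, interface name, and parameters are hypothetical, and the rules Emulab actually installs are more involved):

    # push traffic arriving on one interface through a dummynet pipe
    ipfw add pipe 10 ip from any to any in recv fxp0
    # shape that pipe: 50 Mb/s bandwidth, 10 ms delay, 1% packet loss
    ipfw pipe 10 config bw 50Mbit/s delay 10 plr 0.01

The delay_agent described on the next slide effects run-time changes by rewriting pipe configurations like the second line.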
Dynamic Control
• Link settings can be modified at "run time"
  – "at" commands in the NS file
  – The tevc command
• A control agent (delay_agent) runs on every node implementing shaping
• It listens for events and interacts with the kernel to effect changes
• OS specific

IP Routing Support in Emulab

Executive Summary
• Emulab offers three options for IP routing in a topology: none, manual, or automatic
• Specified via the NS file
• Routes are set up automatically at boot time
• There is no agent for dynamic modification of routes

User-Specified Routing
• "None"
  – No experimental network routes will be set up
  – Used for LANs and routing experiments
• "Manual"
  – Explicit specification of routes in the NS file
  – Routes become part of the DB state of the experiment
  – Passed to a node at boot as part of its self-configuration
  – Implies IP forwarding is enabled

Emulab-Provided Routing
• "Static"
  – Emulab calculates routes at experiment creation (routecalc, staticroutes)
  – Shortest-path calculation between all pairs
  – Optimized to coalesce host routes into network routes
• "Session"
  – Dynamic routing: runs gated/OSPF on all nodes
  – The auto-generated config file uses only active experimental interfaces

Routing Gotchas
• A node's default route uses the control net
  – Missing manual routes result in lost traffic
• The control net is visible to the routing daemons
  – Makes their job easy (one hop to anyone)
• NxN "Static" route computation and storage do not scale as N increases, e.g., with multiplexed virtual nodes

Traffic Generation in Emulab

Executive Summary
• Emulab allows experiments to run and control background traffic generators
• Interface through the NS script or a command line tool
• Constant Bit Rate traffic only right now
• UDP or TCP only right now

Implementation Details
• Based on TG (http://www.postel.org/tg/)
  – UDP or TCP, one-way, various distributions of interarrival time and packet length
• Modified to be an event agent
  – Start and stop; change packet rate and size
• Interface:
  – NS: standard syntax for traffic sources/sinks
  – The tevc command line tool

Inter-node Synchronization in Emulab

Executive Summary
• Provides a simple inter-node barrier synchronization mechanism for experiments
• Example: wait for all nodes to finish running a test before starting the next one
• Not a centralized service (per-experiment infrastructure), so it scales well
• Easy to use: can be scripted

History
• Originally implemented a single-barrier, single-use "ready" mechanism:
  – Allowed users to know when all nodes were "up"
  – Used the centralized TMCC to report/query status
  – Network/server unfriendly: constant polling
• Users wanted a more general mechanism
  – Multiple barriers, reusable barriers
• They tended to roll their own
  – Often network unfriendly as well

Enter the Sync Server
• In the NS file, declare a node as the server:
    set node1 [$ns node]
    tb-set-sync-server $node1
• When the node boots, it starts the sync server automatically
• Nodes requiring synchronization use the emulab-sync application
• Use can be scripted via the program agent (see the sketch below)

Example Client Use
• One node acts as the barrier master, initializing the barrier and waiting for a number of clients:
    /usr/testbed/bin/emulab-sync -i 4
• All other client nodes contact the barrier:
    /usr/testbed/bin/emulab-sync
• emulab-sync blocks until the barrier count is reached
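Scripted use might look like the following per-node driver (a sketch: the phase scripts, client count, and master-selection test are hypothetical):

    #!/bin/sh
    # Run one test phase everywhere, rendezvous, then run the next phase.
    ./run-phase1.sh                        # hypothetical test phase

    # One designated node initializes the barrier for three clients;
    # everyone else just waits at it.
    if [ "$(hostname -s)" = "node1" ]; then
        /usr/testbed/bin/emulab-sync -i 3
    else
        /usr/testbed/bin/emulab-sync
    fi

    ./run-phase2.sh                        # starts only after all nodes arrive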
Implementation
• A simple TCP-based server and client program
  – A UDP version is in the works
• Client:
  – Gets server info from a config file written at boot
  – Connects to the server and writes a small record
  – Blocks until a reply is read
• Server:
  – Accepts connections and reads records from clients
  – Writes a reply when all clients have connected

Issues
• Why not use the event system for synchronization?
  – The event system is a centralized service
  – As we move toward decentralization, we may reconsider
• Authentication: none
  – Local: uses the shared control net, so this is a problem; it won't be once we have control net VLANs
  – Wide-area: wide open; add an HMAC à la the event system, or just use the event system

The Emulab Event System

Emulab Control Plane
• Many of Emulab's features are dynamically controllable:
  – Traffic generators: can be started and stopped, and their parameters altered
  – Link shaping: links can be brought up and down, and their characteristics modified
• Control is via the NS file, the web interface, or a command line tool.

Example: A Link
• NS: create a shaped link:
    set link0 [$ns duplex-link $n1 $n2 50Mb 10ms DropTail]
• NS: control the link:
    $ns at 100 "$link0 modify DELAY=20 BANDWIDTH=25"
    $ns at 200 "$link0 down"
• Command line: control the link:
    tevc -e tutorial/linktest +10 link0 down

What's Really Happening?
• A link "agent" runs on each (delay) node to control all of the links for that node.
• The agent listens for "events" from the server telling it what to do.
• A per-experiment scheduler doles out the events at the proper time, sending them to the agents.
• Other agents include the traffic generators, program objects, and the link tester.

Come On, What's Really Happening?!
• Use Elvin (http://elvin.dstc.edu.au/)
  – An off-the-shelf publish-subscribe system
• Agents "listen" for events by "subscribing" to those they care about.
• The per-experiment scheduler "publishes" events as they come due.
• Events flow from the scheduler through the Elvin daemon to the nodes, and ultimately to the agents that wanted them.

Static/Dynamic Event Flow
(diagram)

Issues: Time
• What happens to "event time" when an experiment is swapped?
  – Run in real time: events could be lost
  – Suspend time: dilation of experiment time
  – Restart time: replay the static event stream
• Timing for dynamic events
  – tevc … +10 link0 down; tevc … +10 link1 up
  – What is the latency between events?
• What latency do we need to guarantee?

Issues: Security
• The Elvin mechanism is too heavyweight
  – Requires encryption to protect authentication keys
  – We have no reason to encrypt our events
• We don't want to tie ourselves to Elvin
  – In principle
  – Elvin has gone closed source
• Emulab past: no authentication, no wide-area
• Emulab current: an end-to-end HMAC
  – Key transferred via TMCC
  – Wide-area nodes supported, cannot inject events

Issues: Scaling
• An open Elvin TCP connection for every agent
  – Use a per-node proxy
  – But agents still send events directly to boss
  – And there are still a lot of nodes
• Use UDP?
  – What about lost events?
• Deliver static events to nodes early?
  – Doesn't help dynamic ("now") events
• Multicast, someday (not the current usage model)
• You'd think we could just find a better pub/sub system, but we haven't.
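Tying the pieces together, a command-line session driving both a traffic agent and a link agent might look like the sketch below. The experiment name and agent names are illustrative (cbr0 assumes a CBR agent declared in the NS file), and the exact event arguments should be checked against the tevc documentation:

    tevc -e tutorial/linktest now cbr0 start        # begin background traffic
    tevc -e tutorial/linktest +30 link0 modify DELAY=50
    tevc -e tutorial/linktest +60 link0 down        # take the link down a minute in
    tevc -e tutorial/linktest +90 link0 up
    tevc -e tutorial/linktest +120 cbr0 stop

Each command injects an event that flows through the per-experiment scheduler to the appropriate agent, exactly as in the static $ns at case.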