The Experiment Lifecycle and its Major Programs

Experiment Lifecycle:
the User Perspective
Creating an Experiment
• Done with “batchexp” for both batch and
interactive experiments
– “batch” is a historical name
• Can bring the experiment to three states
– swapped – pre-run only
– posted – queued experiment ready to run
– active – experiment swapped in
Swapping An Experiment
• Done with “swapexp”
• Can effect several transitions
– swapped to active (swap in experiment)
– active to swapped (swap out experiment)
– active to active (modify experiment)
– posted to swapped (dequeue batch experiment)
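The states and transitions listed above can be read as a small state machine. The sketch below is only an illustration of that reading; the table and the `swap` helper are hypothetical names, not part of batchexp or swapexp.

```python
# Hypothetical model of the experiment states and the swapexp transitions
# listed above. Names here are illustrative, not Emulab code.
TRANSITIONS = {
    ("swapped", "active"): "swap in experiment",
    ("active", "swapped"): "swap out experiment",
    ("active", "active"): "modify experiment",
    ("posted", "swapped"): "dequeue batch experiment",
}


def swap(state: str, target: str) -> str:
    """Return the new state, or raise if the transition is not in the table."""
    if (state, target) not in TRANSITIONS:
        raise ValueError(f"unsupported transition: {state} -> {target}")
    return target
```

For example, `swap("swapped", "active")` models a swap-in, while `swap("posted", "active")` is rejected because only the batch daemon moves a posted experiment toward a swap-in.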
Pre-run (tbprerun)
• Parse NS file (parse-ns and parse.tcl)
– Put virtual state in database (xmlconvert)
• Do visualization layout (prerender)
• Compute static routes (staticroutes)
swapped to active (tbswap in)
• Mapping: Find nodes for experimenter
– assign_wrapper
– assign
• Allocate nodes (nalloc)
– Set serial console access (console_setup)
• Set up NFS exports (exports_setup)
• Set up DNS names (named_setup)
• Reboot nodes and wait for them (os_setup)
– Load disks if necessary (os_load)
swapped to active (contd.)
• Start event system (eventsys_control)
• Create VLANs (snmpit)
• Set up mailing lists (genelists)
• Failure at any step results in swapout
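The rule that a failure at any swap-in step results in a swapout can be sketched as a simple ordered pipeline. This is a minimal illustration under the assumption that each step is a callable that raises on failure; `swap_in` and its arguments are hypothetical and do not reflect how tbswap is actually structured.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


def swap_in(steps: List[Step], swap_out: Callable[[], None]) -> bool:
    """Run steps in order; on the first failure, swap out and report failure."""
    for name, action in steps:
        try:
            action()
        except Exception as err:
            print(f"{name} failed ({err}); swapping out")
            swap_out()
            return False
    return True
```

Steps after the failing one are never started, and the swapout is responsible for undoing whatever the earlier steps set up.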
active to swapped (tbswap out)
• Stop the event system (eventsys_control)
• Tear down VLANs (snmpit)
• Free nodes (nfree)
– Scheduled reservations (sched_reserve)
– Place in reloadpending experiment
– Revoke console access (console_setup)
• Reset DNS (named_setup)
• Reset NFS exports (exports_setup)
• Reset mailing lists (genelists)
active to active (tbswap modify)
• Purpose: experiment modification
– Get new virtual state (re-parse NS file)
– Bring physical mapping into sync with new state
• Leaves alone nodes whose physical mapping
matches the new virtual state
Important Daemons
• batch_daemon
– Picks up posted experiments
– Attempts a swapin
– One experiment at a time for each user
– Swaps out finished batch experiments
• reload_daemon
– Picks up nodes from reloadpending experiment
– Frees them when done reloading
Next, in More Depth
• Parsing
• Resource allocation
– Setup for the action: assign_wrapper
– The real brains: assign
• Serial console management
• Link shaping
• IP routing support
• Traffic generation
• Inter-node synchronization
• Event system
Parsing Experiment Configurations
Experiment Configuration Language
• General purpose OTcl scripting language
based on NS
• Exports an API nearly identical to that of NS
albeit a subset
• Testbed-specific actions via the tb-* commands
– We provide a compatibility script to include when
running under a NS simulation
• Define your own procedures / classes /
Making sense out of others’ code
• The parser is also written in OTcl
• It mirrors a subset of NS classes
• Implemented methods for the above classes
capture the user specified experiment attributes
• Convert experiment attributes to an intermediate
XML format
– Generic format makes it easy to add support for other
configuration languages
• Store the configuration in the virt_* tables such as
virt_nodes, virt_lans etc.
Implementation Quirks
• Capture top level resource names for later use
– E.g.: Use 'n0' to name the physical node when the user
asks for set n0 [$ns node]
• Rename resources to work around restrictions
such as those in DNS
– E.g.: Node 'n(0)' to 'n-0'
• Parser runs on ops for security reasons
– Mixing trusted/untrusted OTcl code on main server (boss)
is dangerous
• Read tbsetup/ns2ir/README in the source tree for details
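The renaming quirk above ('n(0)' becomes 'n-0') can be illustrated with a small helper. The exact rule the parser applies is not given here, so the rule below (replace any run of characters outside letters, digits, and '-' with a single '-', then trim) is an assumption chosen only because it reproduces the example; `dns_safe` is a hypothetical name.

```python
import re


def dns_safe(name: str) -> str:
    """Make a resource name usable as a DNS label (assumed rule, see lead-in).

    Runs of characters other than letters, digits, or '-' become one '-',
    and leading or trailing '-' characters are dropped.
    """
    return re.sub(r"[^A-Za-z0-9-]+", "-", name).strip("-")
```

With this rule, `dns_safe("n(0)")` gives `n-0`, and a name such as `n0` passes through unchanged.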
Assign Wrapper (PG Version)
Assign Wrapper
• Perl frontend to assign
• Converts virtual DB representation to more
neutral “top” file format (input)
• Converts results from plain text format into
physical DB representation
• assign_wrapper is extremely testbed aware
• Moves information from virtual tables to
physical tables
Virtual Representation
• An experiment is really a set of tables in the database
• Includes “virt_nodes” and “virt_lans” which
describe the nodes and the network topology
• Other tables include routes, program agents,
traffic generators, virtual types, etc.
Virtual Representation Cont.
• Example:
set n1 [$ns node]
set n2 [$ns node]
set link0 [$ns duplex-link $n1 $n2 100Mb 10ms]
tb-set-hardware $n2 pc600
• Is stored in database tables:
virt_node ('n1', '', 'pc850', 'FBSD-STD', ...)
virt_node ('n2', '', 'pc600', 'RHL-STD', ...)
virt_lan ('link0', 'n1', '100Mb', '5ms', ...)
virt_lan ('link0', 'n2', '100Mb', '5ms', ...)
What’s a top file?
• Stands for "topology" file, but that's too many
• Input file to assign specifying nodes, links,
• Conversion of DB format to:
node n1 pc850
node n2 pc600
link link0/n1:0,n2:0 n1 n2 100000 0 0
• Combine with current (free) physical
resources to come up with a solution.
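The conversion from the virtual representation to top-file lines can be sketched as follows. The line layout is copied from the example above; the meaning of the two trailing `0 0` fields is not explained in this section, so they are reproduced verbatim rather than interpreted. `to_top` and its argument shapes are hypothetical.

```python
def to_top(nodes, links):
    """Render top-file lines in the layout shown in the example above.

    nodes: iterable of (name, type) pairs
    links: iterable of (link_name, node_a, node_b, bandwidth_kbps) tuples
    """
    lines = [f"node {name} {ntype}" for name, ntype in nodes]
    for name, a, b, kbps in links:
        # The trailing "0 0" is copied from the slide example, not derived.
        lines.append(f"link {name}/{a}:0,{b}:0 {a} {b} {kbps} 0 0")
    return "\n".join(lines)


print(to_top([("n1", "pc850"), ("n2", "pc600")],
             [("link0", "n1", "n2", 100000)]))
```

The bandwidth is written in kilobits per second, which is why the 100Mb link appears as 100000.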
Assign Results
• Assign maps n1 and n2 to pc1 and pc41
based on types and bandwidth.
n1 pc1
n2 pc41
End Nodes
link0/n1:0,n2:0 intraswitch pc1/eth3 pc41/eth1
End Edges
• The above is a “simplified” version of actual
results. Gory details available elsewhere.
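Reading the simplified results back is mostly a matter of splitting the text at the "End Nodes" and "End Edges" markers. The parser below handles only the simplified layout shown above, not the real assign output, and `parse_assign` is a hypothetical name.

```python
SAMPLE = """\
n1 pc1
n2 pc41
End Nodes
link0/n1:0,n2:0 intraswitch pc1/eth3 pc41/eth1
End Edges
"""


def parse_assign(text):
    """Parse the simplified result layout into (node map, edge list)."""
    nodes, edges, in_nodes = {}, [], True
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.strip() == "End Nodes":
            in_nodes = False
        elif line.strip() == "End Edges":
            break
        elif in_nodes:
            nodes[fields[0]] = fields[1]          # virtual node -> physical node
        else:
            # link name, link kind, then the physical interfaces used
            edges.append((fields[0], fields[1], fields[2:]))
    return nodes, edges
```

Calling `parse_assign(SAMPLE)` yields the virtual-to-physical node map and one edge record that assign_wrapper-style code could then write into the physical tables.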
Assign Wrapper Continues
• Allocate physical resources (nodes) as
specified by assign
• Allocate virtual resources (vnodes) on
physical nodes (local and remote)
• If some nodes already allocated (someone
else got them before you), try again
• Keep trying until maximum try exceeded;
assign might fail to find a solution on the first N tries
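The map, allocate, and retry behavior described above can be sketched as a bounded loop. This is an illustration only: `run_assign` and `allocate` are hypothetical callables standing in for assign and nalloc, and the default retry limit is an arbitrary choice.

```python
def map_with_retries(run_assign, allocate, max_tries=5):
    """Retry mapping until the chosen nodes can actually be allocated.

    run_assign() returns a mapping, or None if no solution was found.
    allocate(mapping) returns False if some node was taken in the meantime.
    """
    for attempt in range(1, max_tries + 1):
        solution = run_assign()
        if solution is None:
            continue  # the randomized search found nothing; try again
        if allocate(solution):
            return solution, attempt
        # Someone else got a node first; remap against the new free set.
    raise RuntimeError(f"no usable mapping after {max_tries} tries")
```

Both failure modes (no solution, or a lost race for a node) consume one try, so the loop always terminates.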
Assign Wrapper Keeps Going …
• Insert set of “vlans” into database
– pc1/eth3 connected to pc41/eth1
• Update “interfaces” table with IP addresses
assigned by the parser
• Update “nodes” table with user specified
values from virt_nodes.
– Osids, rpms, tarballs, etc.
• Update “linkdelays” table with end node
traffic shaping configuration (from virt_lans)
And Going and Going
• Update “delays” table with delay node traffic
shaping configuration
• Update “tunnels” table with tunnel
configuration (widearea nodes)
• Update “agents” table with location of where
events should be sent to control traffic
• Call exit(0) and rest!
Resource Allocation:
assign’s job
• Maps virtual resources to local nodes and VLANs
• General combinatorial optimization approach to
NP-hard problem
• Uses simulated annealing
• Minimizes inter-switch links, number of switches,
and other constraints.
• Takes seconds for most experiments
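To make the simulated annealing idea concrete, here is a toy version that maps each virtual node to its own physical node and minimizes the number of virtual links that cross switches. It is a sketch of the general technique, not assign's algorithm: the cost function, move set, cooling schedule, and all names are assumptions, and real constraints such as node types and bandwidth are ignored.

```python
import math
import random


def cost(mapping, vlinks, switch_of):
    """Count virtual links whose endpoints land on different switches."""
    return sum(1 for a, b in vlinks
               if switch_of[mapping[a]] != switch_of[mapping[b]])


def anneal(vnodes, vlinks, switch_of, steps=2000, temp=2.0, cooling=0.995, seed=0):
    """Search for a one-to-one mapping with few inter-switch links.

    Requires at least as many physical nodes as virtual nodes.
    """
    rng = random.Random(seed)
    pnodes = list(switch_of)
    mapping = dict(zip(vnodes, rng.sample(pnodes, len(vnodes))))
    best = dict(mapping)
    for _ in range(steps):
        v, p = rng.choice(vnodes), rng.choice(pnodes)
        candidate = dict(mapping)
        for other, used in mapping.items():
            if used == p:
                candidate[other] = mapping[v]  # p is taken: swap the two
        candidate[v] = p
        delta = cost(candidate, vlinks, switch_of) - cost(mapping, vlinks, switch_of)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            mapping = candidate
            if cost(mapping, vlinks, switch_of) < cost(best, vlinks, switch_of):
                best = dict(mapping)
        temp *= cooling
    return best
```

Accepting some worse moves early on is what lets the search escape local minima, and it is also why two runs with different seeds can return different mappings.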
What’s Hard About It?
• Satisfy constraints
– Requested types
– Can’t go over inter-switch bandwidth
– Domain-specific constraints
• LAN placement for virtual nodes
• Subnodes
• Maximize opportunity for future mappings
– Minimize inter-switch bandwidth
– Avoid scarce nodes
What It Can Do
• Handle multiple types of nodes on multiple switches
• Allow users to ask for classes of nodes
• Prefer/discourage use of certain nodes
• Map multiple virtual nodes to one physical node
• Handle nodes that are 'hosted' in some other node
• Partial solutions
What It Doesn't Do
• Map based on observed end-to-end network
– Applicable to wide-area and wireless
– But, we have another program, wanassign, that does
• Satisfy requests for specific link types
– But, we could approximate with subnodes
• Full node resource description
• Complicated
– Several authors
– Subject of paper evaluating many configurations
– Nature of randomized algorithm makes debugging hard
– Evolved over time to keep up with features
• Scaling
– Particularly with virtual and simulated nodes
• Not just scale (1000’s), it’s the type of node
– Pre-passes may help
• The good: it’s coped with a lot of new demands!
Remote Console Access
Executive Summary
• Allow user access to consoles via serial line
• Console proxy enables remote access
• Authentication and encryption
• All console output logged
• Requires OS support for serial consoles
• Utah Emulab: all nodes have serial lines
– Not required, but handy
Serial Consoles
• Can redirect console in three places
– BIOS: on most “server” motherboards
– Boot loader: easy on BSD and Linux
– OS: easy on BSD and Linux
• Boot loaders and OSes must be configured
– Generally via boot loader configuration
The serial line proxy
• Original purpose was to log console output
– Read/write serial line, log data, present tty IF
– Use “tip” to access pty
• Enhanced to “remote” the console
– Present a socket interface
– Can be accessed from anywhere on the
• One capture process per serial line
• Only users in an experiment can access
• Use a one-time key
– capture running on serial line host generates
new key for every “session”
• Sends key to capserver on the boss node
– capserver records key in DB, returns ownership
– capture uses info to protect ACL and log files
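The one-time key exchange above can be sketched in a few lines. This is a hypothetical model only: the `CapServer` class keeps keys in a dictionary instead of the database, the key format (64 hex characters) is an assumption, and the ownership information that capserver returns is left out.

```python
import secrets


class CapServer:
    """Stand-in for capserver: remembers the latest key per serial line."""

    def __init__(self):
        self.keys = {}

    def record(self, line, key):
        self.keys[line] = key

    def check(self, line, key):
        # Constant-time comparison against the current session key.
        return secrets.compare_digest(self.keys.get(line, ""), key)


def new_session_key(server, line):
    """What capture does per session in this model: make a key, register it."""
    key = secrets.token_hex(32)
    server.record(line, key)
    return key
```

Because each session overwrites the stored key, a key captured from an earlier session stops working as soon as a new one begins.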
(console, tiptunnel)
• console is the replacement for tip
– Run on ops, obtains access info via ACL file
created by capture
– File permissions restrict user access
• tiptunnel is the remote version
– Binaries for Linux, BSD, Windows
– Run as a helper app from browser
– Access info passed via secure web connection
– All communication via SSL
Emulab Link Shaping
Executive Summary
• Emulab allows setting and modification of
bandwidth, latency, and loss rate on a per-link basis
• Interface through NS script or command line
• Implemented either by dedicated “delay”
nodes or on end nodes
• Delay nodes work with any end node OS
• End node shaping for FreeBSD or Linux
Delay nodes
• Run FreeBSD + dummynet + bridging
• FreeBSD kernel:
– Runs at 10000Hz to improve accuracy
– Uses polling device drivers to reduce overhead
• Nodes are dedicated to an experiment
• One node can shape multiple links
• Transparent to end nodes
• Not transparent to switch fabric
VLANs and Delay Nodes - Diagram
End node shaping
(“link delays”)
• Handle link shaping at both ends of the link
• Requires OS support on the end nodes
– FreeBSD: dummynet
– Linux: “tc” with modifications
• Conserves Emulab resources at potential
expense of emulation fidelity
• Works in environments where delay nodes
are not practical or possible
Dynamic control
• Link settings can be modified at “run time”
– at commands in the NS file
– tevc command
• Run a control agent (delay_agent) on all
nodes implementing shaping
• Listens for events, interacts with kernel to
effect changes
• OS specific
IP routing support in Emulab
Executive Summary
• Emulab offers three options for IP routing in
a topology: none, manual, or automatic
• Specified via the NS file
• Routes are set up automatically at boot time
• There is no agent for dynamic modification
of routes
User-specified routing
• “None”
– No experimental network routes will be set up
– Used for LANs and routing experiments
• “Manual”
– Explicit specification of routes in the NS file
– Routes become part of the DB state of the experiment
– Passed to a node at boot, part of self-config
– Implies IP forwarding enabled
Emulab-provided routing
• “Static”
– Emulab calculates routes at experiment creation
(routecalc, staticroutes)
– Shortest path calculation between all pairs
– Optimized to coalesce into network routes
• “Session”
– Dynamic routing: runs gated/OSPF on all nodes
– Auto-generated config file uses only active
experimental interfaces
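For the "Static" option, the all-pairs shortest path step can be illustrated with a breadth-first search from every node. This sketch assumes hop count as the metric and omits the coalescing of host routes into network routes; it is not the routecalc or staticroutes implementation, and `next_hops` is a hypothetical name.

```python
from collections import deque


def next_hops(adj):
    """For each source, map every reachable destination to its first hop.

    adj maps a node name to the list of its directly connected neighbors.
    """
    table = {}
    for src in adj:
        first = {}
        seen = {src}
        queue = deque()
        for nbr in adj[src]:
            if nbr not in seen:
                seen.add(nbr)
                first[nbr] = nbr            # directly connected
                queue.append(nbr)
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    first[nbr] = first[node]  # inherit the first hop
                    queue.append(nbr)
        table[src] = first
    return table
```

The result has one entry per (source, destination) pair, which is the N x N growth that makes this approach costly for large topologies.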
Routing Gotchas
• Node default route uses the control net
– Missing manual routes result in lost traffic
• Control net is visible to routing daemons
– Makes their job easy (one hop to anyone)
• NxN "Static" route computation and storage
do not scale as N increases, such as in
multiplexed virtual nodes
Traffic Generation in Emulab
Executive Summary
• Emulab allows experiments to run and
control background traffic generators
• Interface through NS script or command line
• Constant Bit Rate traffic only right now
• UDP or TCP only right now
Implementation details
• Based on TG (
– UDP or TCP, one-way, various distributions of
interarrival and length
• Modified to be an event agent
– Start and stop, change packet rate and size
• Interface:
– NS: standard syntax for traffic sources/sinks
– tevc command line tool
Inter-node synchronization
in Emulab
Executive Summary
• Provides a simple inter-node barrier
synchronization mechanism for experiments
• Example: wait for all nodes to finish running
a test before starting the next one
• Not a centralized service (per-experiment
infrastructure), scales well
• Easy to use: can be scripted
• Originally implemented a single-barrier,
single-use “ready” mechanism:
– Allowed users to know when all nodes were “up”
– Used centralized TMCC to report/query status
– Network/server unfriendly: constant polling
• Users wanted a more general mechanism
– Multiple barriers, reusable barriers
• Tended to roll their own
– Often network unfriendly as well
Enter the Sync Server
• In NS file, declare a node as the server:
– set node1 [$ns node]
– tb-set-sync-server $node1
• When node boots, it starts up the sync server
• Nodes requiring synchronization use
emulab-sync application
• Use can be scripted using program agent
Example client use
• One node acts as barrier master, initializing
barrier and waiting for a number of clients:
– /usr/testbed/bin/emulab-sync -i 4
• All other client nodes contact the barrier:
– /usr/testbed/bin/emulab-sync
• emulab-sync blocks until the barrier count
is reached
• Simple TCP-based server and client program
– UDP version in the works
• Client:
– Gets server info from a config file written at boot
– Connect to server and write a small record
– Block until a reply is read
• Server:
– Accept connections, read records from clients
– Write a reply when all clients have connected
• Why not use the event system for synchronization?
– Event system is a centralized service
– As we move to decentralization, may reconsider
• Authentication: none
– Local: uses shared control net so this is a
problem, won't be with control net VLANs
– Wide-area: wide-open, add HMAC à la events or
just use event system
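The TCP barrier described above (each client writes a small record and blocks until the server replies, and the server replies once everyone has connected) can be sketched with plain sockets. The wire format here is invented for illustration: the master sends `init N` and other clients send `wait`. The real emulab-sync record layout is not described in this section, and the function names are hypothetical.

```python
import socket
import threading


def serve_barrier(listener):
    """Serve one barrier: hold connections until the master plus N clients arrive."""
    waiting, expected = [], None
    while expected is None or len(waiting) < expected:
        conn, _ = listener.accept()
        with conn.makefile("r") as reader:
            record = reader.readline().strip()
        if record.startswith("init "):
            expected = int(record.split()[1]) + 1  # the master plus N clients
        waiting.append(conn)
    for conn in waiting:
        conn.sendall(b"go\n")  # release everyone at once
        conn.close()


def barrier_client(addr, init=None):
    """Connect, write one record, and block until the server replies."""
    with socket.create_connection(addr) as sock:
        record = f"init {init}\n" if init is not None else "wait\n"
        sock.sendall(record.encode())
        with sock.makefile("r") as reader:
            return reader.readline().strip()
```

In this model `barrier_client(addr, init=4)` plays the role of `emulab-sync -i 4` and `barrier_client(addr)` the role of plain `emulab-sync`. Note that nothing here authenticates clients, matching the "Authentication: none" point above.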
The Emulab Event System
Emulab Control Plane
• Many of Emulab’s features are dynamically controllable
– Traffic generators: can be started, stopped, and
parameters altered
– Link shaping: links can be brought up and down,
characteristics can be modified
• Control is via the NS file, the web interface,
or a command line tool.
Example: A Link
• NS: create a shaped link:
– set link0 [$ns duplex-link $n1 $n2 50Mb 10ms DropTail]
• NS: control the link:
– $ns at 100 "$link0 modify DELAY=20 BANDWIDTH=25"
– $ns at 200 "$link0 down"
• Command line: control the link
– tevc -e tutorial/linktest +10 link0 down
What's really happening?
• A link “agent” runs on each (delay) node to
control all of the links for that node.
• The agent listens for “events” from the server
telling it what to do.
• A per-experiment scheduler doles out the
events at the proper time, sending them to
the agents.
• Other agents include the traffic generators,
program objects, link tester.
Come on, what's really happening?
• Use Elvin (
– off-the-shelf publish-subscribe system
• Agents "listen" for events by "subscribing" to
those they care about.
• The per-experiment scheduler "publishes"
events as they come due.
• Events flow from the scheduler through the
Elvin daemon to the nodes, and ultimately to
the agents that wanted them.
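The publish-subscribe flow above can be modeled in-process to show who talks to whom. This sketch does not use Elvin or its API: `Broker` stands in for the Elvin daemon, `Scheduler` for the per-experiment scheduler, and agents are plain callbacks. All names are hypothetical.

```python
import heapq
import itertools
from collections import defaultdict


class Broker:
    """Stand-in for the pub/sub daemon: routes events to subscribers."""

    def __init__(self):
        self.subs = defaultdict(list)

    def subscribe(self, obj, callback):
        self.subs[obj].append(callback)

    def publish(self, obj, event):
        for callback in self.subs[obj]:
            callback(event)


class Scheduler:
    """Stand-in for the per-experiment scheduler: publishes events when due."""

    def __init__(self, broker):
        self.broker = broker
        self.queue = []
        self.order = itertools.count()  # tie-breaker for equal times

    def at(self, when, obj, event):
        heapq.heappush(self.queue, (when, next(self.order), obj, event))

    def run_until(self, now):
        while self.queue and self.queue[0][0] <= now:
            _, _, obj, event = heapq.heappop(self.queue)
            self.broker.publish(obj, event)
```

A link agent would call `broker.subscribe("link0", handler)`, and the scheduler would queue `at(100, "link0", "modify DELAY=20 BANDWIDTH=25")` and `at(200, "link0", "down")` to mirror the NS example earlier.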
Static/Dynamic event flow
Issues: Time
• What happens to “event time” when an
experiment is swapped?
– Run in real time: events could be lost
– Suspend time: dilation of experiment time
– Restart time: replay static event stream
• Timing for dynamic events
– tevc … +10 link0 down; tevc … +10 link1 up
– What is the latency between events?
• What latency do we need to guarantee?
Issues: Security
• Elvin mechanism is too heavyweight
– Requires encryption to protect authentication keys
– We have no reason to encrypt our events
• Don't want to tie ourselves to Elvin
– In principle
– Elvin has gone closed source
• Emulab past: no authentication, no wide-area
• Emulab current: use end-to-end HMAC
– Key transferred via TMCC
– Wide-area nodes supported, cannot inject events
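End-to-end HMAC means the sender attaches a keyed digest to each event and the receiver recomputes it, so events are authenticated without being encrypted. The sketch below shows the idea with Python's standard library; the choice of SHA-256 and the event encoding are assumptions, since this section does not say which hash or message layout Emulab uses.

```python
import hashlib
import hmac


def sign_event(key: bytes, event: bytes) -> bytes:
    """Compute the authentication tag for an event (SHA-256 is an assumption)."""
    return hmac.new(key, event, hashlib.sha256).digest()


def verify_event(key: bytes, event: bytes, tag: bytes) -> bool:
    """Accept the event only if the tag matches, using a constant-time compare."""
    return hmac.compare_digest(sign_event(key, event), tag)
```

A node without the experiment's key cannot produce a valid tag, which is what prevents it from injecting events, while the event body itself stays readable.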
Issues: Scaling
• Open Elvin TCP connection for every agent
– Use per-node proxy
– But agents still send events directly to boss
– And there are still a lot of nodes
• Use UDP?
– What about lost events?
• Deliver static events to nodes early?
– Doesn't help dynamic (“now”) events
• Multicast, someday (not the current usage model)
• You’d think we could just find a better pub/sub
system, but haven’t.