Slide 1: UH SWARM: Dense perfSONAR Deployment With Small, Inexpensive Devices
Alan Whinery, U. Hawaii ITS
September 29, 2015

Slide 2: The Whole Small perfNode Thing
- At a $2,000 to $5,000 price point, a typical perfSONAR node gets deployed at REN leaves and branches.
- At a $50 price point, you can buy 40 to 100 times as many nodes for the same amount.
- Some efforts focus on pS-node equivalence and Intel compatibility (~$200 price point); deployment tends to be relatively sparse.
- Others focus on the value of smaller ($50) nodes.

Slide 3: The Whole Small perfNode Thing
- perfSONAR developer fork for small devices: https://github.com/perfsonar/project/wiki/perfSONAR-EndpointNode-Project ($150 - $250 price range, focus on Intel architecture)
- RNP Brasil: MonIPE, http://monipe.rnp.br/ (the same type of ARM nodes as our $50 price range)
- UH SWARM: our thing (BeagleBones, Raspberry Pis, etc.; ARM, ~$50)

Slide 4: The Swarm
- Wrote a paragraph into our CC-NIE campus networking proposal about making use of the recent availability of ~$50 computers to "sense" the network, using elements of perfSONAR.
- That funded a project to deploy 100 nodes on one campus over 2 years, exploiting the ~$50 price point to deploy many nodes on campus as a dense mesh.

Slide 5: Goals/Challenges
- Finding nodes to buy in the face of market exhaustion
- Getting node-deployment work-flow down to nil
- Getting recoveries of off-line nodes to a minimum
- Tracking assets and reliability, generating metrics
- Evaluating capabilities of the whole set-up
- Developing a test program for many nodes
- Slicing/dicing data to see what it has to tell us
- Developing visualizations and distillations to put tools in the hands of network maintainers, merging into the pS Toolkit

Slide 6: Devices We Have
- Raspberry Pi: famous, $50, medium performance, file system on SD card, 100 Mb Ethernet, USB 2.0
- BeagleBone Black: $50, more performance, file system on internal flash and/or SD card, 100 Mb Ethernet, USB 2.0
- Honorable mention, CuBox i4: $147, more performance, file system on SD, GigE, WiFi, USB 2.0
- MiraBox: $149, most performance, file system on SD, dual GigE, WiFi, USB 3.0

Slide 7: Reliability
- Raspberry Pi (July 2014): UH ITS owns 47; 1 has failed; 22 SD-card hard failures; 10 file-system failures
- BeagleBone Black Rev. A/C (December 2013/April 2015): UH ITS owns 60; 1 has corrupted firmware; of the nodes in production, one had to be power-cycled, once
- CuBox: one deployed, 6 months of service, zero problems (using the SD card from the OEM)
- MiraBox: promising, dual GigE (~$150), wimpy kernel

Slide 8: SD Cards
- Dane-Elec 8 GB Class 4: 10 cards, 2 failures in light duty
- SanDisk Ultra 8 GB Class 10: 10 cards, 0 hard failures, 3 file systems corrupted in 42k card-hours
- Kingston 8 GB Class 10: 10 cards, 0 hard failures, 7 file systems corrupted in 42k card-hours
- Kingston 4 GB Class 4: 20 hard failures in less than 20k card-hours (100% failed across 6 weeks, < 1,000-hour MTBF)
- SanDisk Ultra 8 GB Class 10: most recent batch of replacements
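Not from the slides: a minimal sketch of the arithmetic behind the Kingston 4 GB MTBF figure above, assuming the cards ran continuously. Twenty cards that all failed within roughly six weeks give about 20,000 device-hours and 20 failures, which is how an MTBF under 1,000 hours falls out. The helper function is illustrative only.

```python
def mtbf_hours(device_count, failures, weeks_in_service):
    """Crude MTBF estimate: total powered-on device-hours divided by observed failures."""
    hours_per_week = 7 * 24
    device_hours = device_count * weeks_in_service * hours_per_week
    return device_hours / failures

# Kingston 4 GB Class 4 figures from the SD Cards slide:
# 20 cards in continuous duty, all 20 failed within about 6 weeks.
print(mtbf_hours(device_count=20, failures=20, weeks_in_service=6))
# ~1008 hours as an upper bound; because every card failed at or before the
# six-week mark, the actual MTBF is below 1,000 hours, as the slide states.
```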
Slide 9: Year 1
- Tried 10 BeagleBones, liked them, and a few Raspberry Pis.
- The market vacuum around the release of BBB Rev. C made BBBs impossible to obtain, so we bought 43 Raspberries.
- Although we are going with the BeagleBone Black for the completion, we could make Raspberries work if necessary.
- Bought 2 Dell rack servers as test facilitators and data archives.

Slide 10: 2nd Year Completion
- 50 BeagleBone Black Rev. C (4 GB internal flash):
  - BBB internal flash is more reliable than SD
  - Internal flash plus SD card enables separating system and data partitions
  - Better 100 Mb Ethernet performance
- 5 Raspberry Pi 2 Model B
- As the number deployed approaches 100, we will be placing nodes in new/special roles.
- Correlating topology from Netdot and MRTG graphs for context

Slide 11: Management
- Puppet / The Foreman: https://puppetlabs.com/ , http://theforeman.org/
- Easy to push changes and updates out to the swarm.
- Also easy to push errors out to the swarm and end up needing 50 SSH sessions.
- Work-flow: try to minimize per-node actions and attended setup.
  - RPi: ua-netinstall with tweaks for Puppetization
  - BBB: custom SD card that auto-images the internal flash
- Make individual nodes as interchangeable as possible (if you have a choice, use one type of device).

Slide 12: Characteristics of Dense Sensor Deployment Within an Enterprise
- A "sensor" is less complicated than a perfSONAR Toolkit node; a central perfSONAR buoy/MA orchestrates.
- Having many observations makes the loss of a single one less important.
- You can correlate topology and test results to "triangulate" on the source of a problem.
- It takes planning to avoid affecting user traffic; the strategy is to "be" user traffic.
- The pS Toolkit as built isn't really made for 100 nodes.

Slide 13: Test Programs: powstream (owamp)
- powstream puts 10 packets per second on the wire, 24 hours a day (there has been discussion about increasing the rate).
- To some extent, apparently stochastic/probabilistic loss resembles stochastic/probabilistic loss at much higher rates; in other words, the probabilistic loss that powstream encounters is probably the minimum of what a throughput test will encounter.

Slide 14: Sidebar: Regular Global perfSONAR powstream
- Log-scaled loss mapped to a color gradient: an early idea about how to unify many graphs in front of your cerebral cortex.
- Black = 0% loss; green -> yellow -> red gradient = low -> medium -> higher loss; gray = no data.
- Log scaling avoids hiding low loss.
- Time runs left to right; each 10-pixel row is one path.
- In our campus network, everything was always black (no appreciable loss).

Slide 15: Test Programs: powstream (owamp)
- powstream runs from the pS Toolkit node to/from each sensor node.
- Really, really, really boring at first glance: all loss appears to be about zero.
- There are always one or two paths losing a packet per day (1 in 864,000).
- Standard deviation in latency is somewhat interesting; it may reflect queuing, and flares in latency standard deviation may precede loss events.
- Longitudinal analysis reveals damaging loss rates that would otherwise be invisible.
- Higher packet rates might expose low loss probabilities in a shorter time.

[Figure slides: powstream results, 30 nodes, in/out.]
[Slide reproduced from Mathis, Semke, Mahdavi, and Ott, "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", ACM SIGCOMM Computer Communication Review, Vol. 27, No. 3, July 1997; used with permission.]

Slide 19: Speed Limits You Can't See
For a 45-millisecond RTT, the typical minimum to get onto the continental US from Hawaii (both limit columns assume 45 ms RTT):

  Loss rate   powstream packets lost/day (10 pps)   TCP AIMD coastal limit @1460 MSS (Mbit/s)   TCP AIMD coastal limit @8960 MSS (Mbit/s)
  1.82E-05    15.75                                 42.56                                       261.18
  2.25E-06    1.94                                  121.11                                      743.23
  1.87E-06    1.62                                  132.76                                      814.72
  9.38E-07    0.81                                  187.58                                      1151.16
  6.05E-07    0.52                                  233.55                                      1433.28
  5.93E-07    0.51                                  236.03                                      1448.52
  3.35E-07    0.29                                  314.03                                      1927.21
  2.51E-07    0.22                                  362.49                                      2224.57
  1.74E-07    0.15                                  435.64                                      2673.49
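The limit columns above follow from the Mathis et al. macroscopic model, rate < (MSS/RTT) * C / sqrt(p). The slide does not state the constant used; C = 0.7, the conservative end of the 0.7 to 1.3 range discussed in the paper, reproduces the printed numbers, so the sketch below assumes that value.

```python
import math

def mathis_limit_mbps(mss_bytes, rtt_s, loss_prob, c=0.7):
    """Macroscopic TCP limit (Mathis et al., 1997): rate < (MSS/RTT) * C / sqrt(p), in Mbit/s."""
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_prob) / 1e6

def powstream_losses_per_day(loss_prob, packets_per_second=10):
    """Expected powstream packets lost per day at the given send rate."""
    return loss_prob * packets_per_second * 86400

# First row of the Slide 19 table: observed loss probability 1.82e-5, 45 ms RTT.
p = 1.82e-5
print(powstream_losses_per_day(p))          # ~15.7 packets/day
print(mathis_limit_mbps(1460, 0.045, p))    # ~42.6 Mbit/s at a 1460-byte MSS
print(mathis_limit_mbps(8960, 0.045, p))    # ~261 Mbit/s at an 8960-byte MSS
```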
Slide 20: Test Programs: 50-Node Full-Mesh TCP Throughput (<= 100 Mbps)
- RPi and BBB throughput tests resemble real-life user flows, unlike a high-performance iperf tester, which "punches the network in the face".
- I run a 50x50 full-mesh iperf matrix (2,450 tests) in about 7 hours, using 5-second tests.
- Full-mesh traceroute is collected concurrently.
- By scoring every hop encountered on the average performance of the paths it appears in, a "per-hop confidence" can be derived (see the scoring sketch after Slide 26).
- Using multi-rate UDP versus TCP is worth investigating.

Slide 22: The Matrix
[Figure: full-mesh iperf3 throughput matrix; one axis labeled "Sources".]
- Cut-out view of iperf3 tests to/from a chosen node: this row/column represents all tests to/from that chosen node.
- Leaves one wondering what the correlation is between the pink squares showing retransmissions.

Slide 23: Correlating Full-Mesh Throughput and Traceroute Results for Fault Isolation
[Figure: graph of per-hop "confidence", with colored links where retransmissions were observed (names/addresses obfuscated).]
- The graph shows hops involved in inbound throughput testing between a chosen node and all partners.
- Each oval represents an IP interface as reported in traceroute output.
- Graph rendered from test data with GraphViz (graphviz.org).

Slide 25: Data Archiving
- perfSONAR MA
- Exposing some ways in which MA handling of long-term, diverse data could be optimized
- Correlating such things as early/late "bathtub curve" failures per equipment life cycle (see the Wikipedia article on the bathtub curve)
- Trending probabilistic loss by months/years
- Etc.

Slide 26: Ongoing perfSONAR Toolkit Integration
- Not so much new development as making some pieces fit together.
- Correlation of other sources to zero in on a fault: NetDot, flows/MRTG.
- Ancillary programs: log collection (honeypot-ish info), name-resolution tests, v6/v4 precedence.
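The slides describe scoring every hop on the average performance of the paths it appears in, but do not spell out the computation. The sketch below is one plausible reading, assuming the achieved iperf3 throughput is averaged over every tested path whose traceroute contains the hop, so hops seen only on slow paths sink to the bottom. The data layout and addresses are made up for illustration.

```python
from collections import defaultdict

def per_hop_confidence(results):
    """results: iterable of (hops, throughput_mbps) pairs, where `hops` is the list of
    IP interfaces traceroute reported for a source/destination pair and
    `throughput_mbps` is the iperf3 result for the same pair.
    Returns each hop's average throughput over the paths it appears in."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hops, mbps in results:
        for hop in hops:
            totals[hop] += mbps
            counts[hop] += 1
    return {hop: totals[hop] / counts[hop] for hop in totals}

# Toy example with made-up addresses: hops seen only on the low-throughput
# paths (10.0.3.1, 10.0.4.1) end up with the lowest confidence scores.
results = [
    (["10.0.0.1", "10.0.1.1"], 94.0),
    (["10.0.0.1", "10.0.2.1"], 93.5),
    (["10.0.0.1", "10.0.3.1"], 11.2),
    (["10.0.0.1", "10.0.3.1", "10.0.4.1"], 9.8),
]
for hop, score in sorted(per_hop_confidence(results).items(), key=lambda kv: kv[1]):
    print(f"{hop}: {score:.1f} Mbit/s")
```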