Opnix Smart Routing Technology Overview
“There is more than one way to skin a cat…”

Aaron D. Britt
Opnix, Inc.
Orbit1000 Technology Discussion
NANOG

Overview
• Orbit1000 CPE Overview
• Probing Method in More Detail
• Orbit1000 CORE Overview
• Things to Come…
• Let’s Review - Q & A

Orbit1000 CPE High Level Architecture
[Diagram: Subscriber AS 100 (IP block advertised: 24.10.0.0/16) with LAN routers A, B, and C in OSPF Area 0; the Orbit 1000 (Orbit AS 64701) connected by IBGP, with an encrypted link to the Opnix CORE; EBGP from B (20.20.20.2) to Carrier B, AS 200 (20.20.20.1) and from C (30.30.30.2) to Carrier C, AS 300 (30.30.30.1). Other addresses shown: 24.10.1.1, 24.10.4.1, 10.10.10.2.]

Functions of the Orbit1000 CPE
• Probe stuff
• Receive BGP Feed and Set Routes
• Communicate with the CORE
  – Send Raw Probe Data
  – Receive Optimized Routes
[Diagram: the Orbit1000 CPE sets BGP routes on the Customer Router(s), sends QA Probes and Discovery Probes toward the Internet, and has an encrypted link to the CORE.]

How we become one with the Packet
• UDP Probes
  – Proactive philosophy using patented ActiveScan
  – We tried ICMP: routers drop ICMP despite what the RFC says
  – We tried TCP: it set off IDS systems all over the place
  – We tried the Force, but none of us had enough midi-chlorians
  – We now use a UDP probe that, though proprietary in nature, is very similar to a typical traceroute
  – During testing, we found that routing policy set using UDP probe data is within 2% of the routing policy set using TCP probe data, and it doesn’t set off IDS systems!

Probing Mechanism
• Where do we probe?
  – Prefix List based on prefixes important to each Customer
    • Top 500 Trafficked Sites / News Groups etc…
    • Route Feed from Customer Routers
    • Traffic Flow Data (Netflow, Span Port <sniff sniff>)
    • Logs (Web, DNS etc…)
    • Capable of probing 110,000+ routes, but it doesn’t make sense to (most of the time)
  – discovery.ignore and discovery.include lists.
  – ‘Prefix + 1’ methodology, unless a more specific IP address is specified in the configuration.
• We probe multiple prefixes over multiple upstreams in parallel; the amount is configurable
  – How much bandwidth do you want to spend on probes?

Metrics Gathered
• OpScore (algorithm based on the probe data, weighted and calculated according to customer-defined settings)
  – Latency
  – Unreliability
    • Link Unreliability
    • Probe Closure
    • Packet Loss
    • Routing Loops
  – Bad Hops
  – Layer 3 Hops
  – Carrier Preference
• Lowest score wins

Prefix 216.183.192.0/19              Over Carrier "B"             Over Carrier "C"
Metric                               Actual  Weight   Result      Actual  Weight   Result
Carrier Preference (Range 100 - 1)   25      25%      6.25        75      25%      18.75
Layer 3 Hops (range 2 to 30)         15      10.00%   1.5         20      10.00%   2
Bad Hops (range 1 to 5)              1       10.00%   0.1         0       10.00%   0
Unreliability (Range 1 - 100)        50      25.00%   12.5        25      25.00%   6.25
Latency (5 to 300 ms)                125     30.00%   37.5        50      30.00%   15
OpScore                                               57.85                        42.00

QA Process (Testing the Active Link)
• UDP based (just like our Discovery Probes)
• We QA everything!
• We send the QA probe to a TTL chosen from where our discovery data says the endpoint is.
• We check the latency and unreliability against the probe data we used to set the route.
• How many QA routes do we send, and how fast?
  – The QA Limit is configurable, like the Carrier Limit in the Client Config, which means you control how many routes we can QA in parallel.
• QA happens much faster than Discovery.
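The OpScore table on the Metrics Gathered slide can be reproduced as a plain weighted sum. The sketch below is only that: it assumes each Result is Actual times Weight and that the OpScore is the sum of the Results, which matches the slide's arithmetic but may not be everything the real algorithm does. The names `WEIGHTS_PCT`, `PROBE_DATA`, `op_score`, and `best_carrier` are hypothetical, not Opnix identifiers.

```python
# Weights (in percent) and "Actual" values copied from the slide's table
# for prefix 216.183.192.0/19. All names here are illustrative only.
WEIGHTS_PCT = {
    "carrier_preference": 25,
    "layer3_hops": 10,
    "bad_hops": 10,
    "unreliability": 25,
    "latency_ms": 30,
}

PROBE_DATA = {
    "Carrier B": {
        "carrier_preference": 25,
        "layer3_hops": 15,
        "bad_hops": 1,
        "unreliability": 50,
        "latency_ms": 125,
    },
    "Carrier C": {
        "carrier_preference": 75,
        "layer3_hops": 20,
        "bad_hops": 0,
        "unreliability": 25,
        "latency_ms": 50,
    },
}


def op_score(actuals, weights_pct=WEIGHTS_PCT):
    """Sum of actual * weight over all metrics (assumed formula)."""
    return sum(actuals[name] * pct for name, pct in weights_pct.items()) / 100


def best_carrier(probe_data):
    """Return the carrier with the lowest score ("lowest score wins")."""
    return min(probe_data, key=lambda carrier: op_score(probe_data[carrier]))


scores = {carrier: op_score(actuals) for carrier, actuals in PROBE_DATA.items()}
for carrier, score in scores.items():
    print(f"{carrier}: {score:.2f}")
print("Preferred upstream:", best_carrier(PROBE_DATA))
```

With the slide's numbers this gives 57.85 for Carrier B and 42.00 for Carrier C, so Carrier C is preferred. Keeping the weights as whole percentages and dividing once at the end avoids accumulating floating-point error across the terms.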

Orbit1000 CORE
• 5 Pieces
  – Balancer (Communicates w/CPE)
  – Optimizer (Crunches Numbers)
  – View (Keeps Latest and Greatest Views per CPE)
  – SQL dB (Stores Stuff)
  – Customer Portal (Looks stuff up)
[Diagram labels: Customer Portal, Portal, CORE, SQL dB, CPE, Balancer, Optimizer, View]

Data Access
• Portal
  – Access to Data, raw and graphical (Current and Historical)
  – All metrics and weights represented
  – Access to each CPE Client Config
  – RouteVision (Visualize over Multiple Paths)
  – Aggregate Summarizations
• SQL dB
  – Raw Data
    • Transactional Data (Real Time)
    • Warehoused Data (Portal)
    • Archival Data

Fault Tolerance Stuff…
• If it goes up in smoke, the Customer router reverts to standard BGP.
• Discovery Probes halt if the CPE loses the CORE connection; if keep-alives fail within a period of time, the product removes routes and “sleeps” until communication with the CORE is reestablished.
• Fault-tolerant reasoning behind storing CPE config on central dB
• Heartbeat / failover process between CPEs
• SNMP traps, early warning system (RAM, Hard Disk, CPU etc.)
• Always working on additional MIB support

Things to Come…
• Probes to support Jumbo Frames (Adjustable Frame Size)
• Dedicated Jitter Metrics
• Black-hole and Routing Loop Discovery/reports via Website
• TCP Slow Start Algorithm emulation
• TCP and/or UDP probes (Pick your poison)
• TCP Sniffing for Active Links (Monitor Actual Data, Replace QA)
• Multicast Support
• IPv6 Support
• Additional MIB support
• NEBS Compliant (just kidding)

Contact Information
If you have any questions or would like to comment on and/or critique this method of ‘Cat Skinning’ (I would love for some hecklers to drop me a line; without peer review no progress is possible), here is my contact info…
http://www.opnix.com
aaron@opnix.com

Case Studies available today…
• Tier 1 ISP
• Fortune 5 Enterprise
• Fortune 100 Financial Institution
• Internet2/Abilene Deployment

Layer 3 Hops vs Latency (30 day Summary)
[Chart: Series1, latency by TTL; vertical axis 0 to 0.3]
ttl   latency
3     0.020716
4     0.024832
5     0.033791
6     0.045662
7     0.055674
8     0.079405
9     0.109979
10    0.131937
11    0.141727
12    0.142373
13    0.143105
14    0.151558
15    0.177103
16    0.196629
17    0.216883
18    0.231439
19    0.244841
20    0.263682
21    0.268043

Prefixes are how many hops away?
[Chart: Series1, number of prefixes by TTL; vertical axis 0 to 16000]
ttl   # prefixes
3     2047
4     473
5     660
6     1621
7     2726
8     3601
9     4340
10    5527
11    7831
12    8761
13    9111
14    13756
15    9506
16    7743
17    7174
18    4679
19    4321
20    2881
21    1339

Other Questions to ask…
• Is there a direct correlation between Hops and Latency? Hop count seems anecdotal, yet the numbers are quite convincing…
• How accurately do UDP measurements compare with TCP measurements when talking about Latency, Packet Loss and Throughput?
• How much of a part does asymmetrical routing play in the world of suboptimal routing?
• Netflow stats suggest that, on average, routers only forward packets to 10% or so of the global RIB, yet our routing tables are tenfold larger or more. It seems we can do something here; I just don’t know what yet…
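The first question above can be given a rough number using the 30-day summary from the Layer 3 Hops vs Latency slide. The sketch below computes a Pearson correlation and a least-squares slope over those 19 points. It assumes the TTL and latency values pair up in the order they are listed; the latency unit is not stated on the slide, and the function names are hypothetical.

```python
import math

# TTL values 3..21 and the matching latency values from the 30-day summary.
# Pairing them in listed order is an assumption; the unit is not given.
TTL = list(range(3, 22))
LATENCY = [
    0.020716, 0.024832, 0.033791, 0.045662, 0.055674, 0.079405, 0.109979,
    0.131937, 0.141727, 0.142373, 0.143105, 0.151558, 0.177103, 0.196629,
    0.216883, 0.231439, 0.244841, 0.263682, 0.268043,
]


def _centered_sums(xs, ys):
    """Return (Sxy, Sxx, Syy), the sums of centered products and squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    sxy, sxx, _ = _centered_sums(xs, ys)
    return sxy / sxx


r = pearson_r(TTL, LATENCY)
slope = least_squares_slope(TTL, LATENCY)
print(f"Pearson r = {r:.3f}")
print(f"Slope = {slope:.4f} latency units per additional hop")
```

On these points the correlation comes out well above 0.9. Because each point is already a per-TTL summary, that figure describes the trend of the summaries only; it says nothing about how much individual prefixes at the same hop count differ from one another.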