Virtual Network Diagnosis as a Service
Wenfei Wu (UW-Madison), Guohui Wang (Facebook), Aditya Akella (UW-Madison), Anees Shaikh (IBM System Networking)

Virtual Networks in the Cloud
• The cloud provider maintains the data center infrastructure
• Tenants request and configure virtual networks for their applications
• Each virtual network is mapped onto the physical infrastructure
• Network services are provided as virtual middleboxes
• Multiple tenants share the infrastructure
[Figure: two tenants' virtual networks (subnets of VMs, gateways, routers, a middlebox) mapped onto shared servers and virtual switches carrying application traffic]

Virtual Network Problems
• Problems can arise at multiple layers
o Virtual network connectivity issues
o Degraded tenant application performance
• Isolation and abstraction prevent tenants from diagnosing their own virtual networks

Existing Solutions
• Infrastructure-layer tools may expose the physical network to tenants
o sFlow, NetFlow
o OFRewind, NDB, etc.
• Tracing tools in the virtual network are difficult to deploy in some virtual components (e.g., middleboxes)
o tcpdump, X-Trace, SNAP

VND Proposal
• The cloud provider should offer a virtual network diagnosis service (VND) to tenants
1. Diagnostic request
2. Trace collection
3. Trace parsing
4. Data query
[Figure: the tenant sends a diagnosis request (path n1…nk, flow pattern: IP, port, …) to the Control Server, which maps it onto the tenant's allocated resources and the physical topology, producing a diagnosis policy (n1: collector1 / Table Server 1, n2: collector2 / Table Server 2, …)]

VND Challenges
• Preserve isolation and abstractions
• Low overhead
• Scalability

Contents
• Motivation
• VND Design & Implementation
• Evaluation
• Conclusions

VND Architecture
[Figure: the tenant submits collection and parse configurations to the Control Server; the Policy Manager combines them with topology information from the Cloud Controller into a collection policy; Trace Collectors capture raw flow traces, Trace Parsers load them into Trace Tables (ts, id, …) on Table Servers, and the Analysis Manager drives Query Executors over those tables]

Data Collection (1)
• The tenant submits a Data Collection Configuration
o On a virtual link: Link: l1; Capture Point: node1, node2, …; Flow Pattern: field = value, …
o On a virtual appliance: Node: n1; Capture Point: input, [output]
• Example: for a front end and load balancer serving Servers 1-3, capture at the load balancer's input with pattern srcIP=10.0.0.6 dstIP=10.0.0.8 tcp dstPort=80 dualDirection, and at its output

Data Collection (2)
• The Policy Manager generates a Data Collection Policy
o Trace ID: tr_id1; Physical Cap Point: vs1; Collector: c1 at h1; Pattern: field = value, …; Path: vs1, …, h1, c1
• Example: the load balancer's matching traffic (tr_id1, srcIP=10.0.0.6, …) is mirrored at vSwitch vs1 in the hypervisor and forwarded past the NIC to collector c1
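To make this concrete, here is a minimal sketch of a sanity check a tenant could run once the tr_id1 trace has been parsed into a table (parsing is described next); the table name tab_lb1 and its column names are illustrative assumptions:

  -- Sanity-check the collected trace: every captured packet should match
  -- the requested pattern (10.0.0.6 <-> 10.0.0.8, TCP port 80, both directions)
  select count(*) as unexpected_packets
  from tab_lb1
  where not ((src_ip = '10.0.0.6' and dst_ip = '10.0.0.8' and dst_port = 80)
          or (src_ip = '10.0.0.8' and dst_ip = '10.0.0.6' and src_port = 80))

A non-zero count would suggest the collection policy was installed incorrectly.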
Data Parse
• The tenant submits a Data Parse Configuration: Trace ID, Table ID, Filter, Fields
o Grammar:
  exp = exp and exp | exp or exp | not exp | (exp) | prim
  prim = field in value_set
  field_list = field (as name) (, field (as name))*
• Example:
o Trace ID: all; Table ID: tab_id1
o Filter: ip.proto = tcp or ip.proto = udp
o Fields: timestamp as ts, ip.src as src_ip, ip.dst as dst_ip, ip.proto as proto, tcp.src as src_port, tcp.dst as dst_port, udp.src as src_port, udp.dst as dst_port
• The parser produces a trace table with columns (ts, src_ip, dst_ip, proto, src_port, dst_port, …)

Data Analysis
• Trace tables form a distributed database
• Tenants and diagnostic applications operate on it through an SQL interface exposed by the Analysis Manager
[Figure: diagnostic schema with a Collection View (Trace_ID, Vnet_App, Cap_Loc, Pattern) and a Parse View (Trace_ID, …) over trace tables spread across Query Executors]

Data Analysis Examples
• Trace table T1: (ts, src_ip, dst_ip, seq, ack, payload_length, …)
• RTT:
1. create view T1_f as select * from T1 where src_ip = IP1 and dst_ip = IP2
2. create view T1_b as select * from T1 where dst_ip = IP1 and src_ip = IP2
3. create view RTT as select f.ts as t1, b.ts as t2 from T1_f as f, T1_b as b where f.seq + f.payload_length = b.ack
4. select avg(t2 - t1) from RTT
• Throughput, loss, delay, and other statistics can be computed the same way (see the sketch below)
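For instance, here is a minimal sketch of throughput and a rough loss signal over the same data, reusing the T1_f view from the RTT example; it assumes ts is a numeric timestamp in seconds and a SQL dialect with floor() (both illustrative assumptions):

  -- Average per-second throughput of the forward flow, in bits per second
  create view Tput as select floor(ts) as sec, sum(payload_length) * 8 as bits
    from T1_f group by floor(ts)
  select avg(bits) as avg_bps from Tput

  -- Rough loss indicator: TCP sequence numbers seen more than once
  -- (retransmissions) in the forward direction
  select seq, count(*) as copies from T1_f
  where payload_length > 0
  group by seq having count(*) > 1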
Optimizations
• Local Table Server placement
o Place the collector on the same host as the capture point
o Avoid trace collection traffic traversing the network
o Move data only when queries need it
• Avoid interference with existing rules by using the multi-table feature of OVS

Contents
• Motivation
• VND Design & Implementation
• Evaluation
• Conclusions

Trace Collection Overhead (1)
• We transfer data between VMs on 2 physical servers
o 10 Gbps NICs
o 8 pairs of VMs, each pair transferring 1 Gbps of traffic
• Each minute, we start capturing one more VM pair's traffic
• The duplicated traffic grows as we capture more flows
• The original application traffic is not impacted

Trace Collection Overhead (2)
• VMs perform memory copies and data transfers simultaneously
• We measure memory and network throughput
• Each 1 Gbps of duplicated network traffic costs about 59 MB/s of memory bandwidth

Data Query Overhead
• We use RTT monitoring as an example
o 200 Mbps of traffic
o Average RTT calculated periodically
• Network overhead is negligible
• Execution time scales linearly with the data size
• VND can process 2-3 Gbps of traffic in real time

Scalability (1)
• The Control Server is a simple web server and can be scaled up easily
• Data collection is distributed and kept local, so it does not become a bottleneck

Scalability (2)
• Data query simulation
o A data center with 10,000 servers, each with a 10 Gbps NIC
o Virtual network sizes in [2, 20]
o Query executors can process 3 Gbps of traffic in real time
o Total link utilization in [0.1, 0.9]
• Result: about 30% of total link capacity can be queried in real time

Conclusions
• The cloud provider should offer a virtual network diagnosis service to tenants
• We design VND
o Architecture, interfaces, and operations
o Addresses several important challenges (isolation, overhead, scalability)
• Our experiments and simulations demonstrate the feasibility of VND

Q&A
Please contact the WISDOM group: http://wisdom.cs.wisc.edu

Functional Validation
• Middlebox bottleneck detection
• VND uses throughput and RTT measurements to locate the anomaly (a query sketch follows these slides)
[Figure: test topology with RE (redundancy elimination) and IDS middleboxes]

Functional Validation (2)
• Middlebox scaling
[Figure: the RE and IDS middlebox topology during middlebox scaling]
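As a concrete illustration of the bottleneck check above, the sketch below compares per-second throughput at a middlebox's input and output; the table names MB_in and MB_out (parsed traces captured at the middlebox's input and output, with the schema from the parse example) and the 1.5x threshold are all illustrative assumptions:

  -- Per-second byte counts at the middlebox input and output
  create view In_rate as select floor(ts) as sec, sum(payload_length) as bytes
    from MB_in group by floor(ts)
  create view Out_rate as select floor(ts) as sec, sum(payload_length) as bytes
    from MB_out group by floor(ts)

  -- Seconds in which input far exceeds output: the middlebox is likely
  -- queueing or dropping packets, i.e., it is a bottleneck
  select i.sec, i.bytes as in_bytes, o.bytes as out_bytes
  from In_rate as i, Out_rate as o
  where i.sec = o.sec and i.bytes > 1.5 * o.bytes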