Internet Measurement
Jennifer Rexford

Outline
• Measurement overview
  – Why measure? Why model measurements?
  – What to measure? Where to measure?
• Internet challenges
• Measurement tools
  – Active: ping, traceroute, and pathchar
  – Passive: logs, SNMP, packet, and flow monitoring
• Operational applications of measurement
• Discussion

Why Measure?
• The Internet is a man-made system, so why do we need to measure it?
  – Because we still don’t really understand it
  – Because sometimes things go wrong
• Measurement for network operations
  – Detecting and diagnosing problems
  – What-if analysis of future changes
• Measurement for scientific discovery
  – Characterizing a complex system as an organism
  – Creating accurate models that represent reality
  – Identifying new features and phenomena

Why Build Models of Measurements?
• Compact summary of measurements
  – Efficient way to represent a large data set
  – E.g., exponential distribution with mean 100 sec
• Expose important properties of measurements
  – Reveal the underlying cause or engineering question
  – E.g., mean RTT to help explain TCP throughput
• Generate random but realistic data as input
  – Generate new data that agree in key properties
  – E.g., topology models to feed into simulators
“All models are wrong, but some models are useful.” – George Box

What Can Be Measured?
• Traffic
  – Load statistics
  – Packet or flow traces
• Performance of paths
  – Application performance, e.g., Web download time
  – Transport performance, e.g., TCP bulk throughput
  – Network performance, e.g., packet delay and loss
• Network structure
  – Topology, and paths on the topology
  – Dynamics of the routing protocol

Where to Measure?
• Short answer
  – Anywhere you can!
• End hosts
  – Application logs, e.g., Web server logs
  – Sending active probes to measure performance
• Individual links/routers
  – Load statistics, packet traces, flow traces
  – Configuration state
  – Routing-protocol messages or table dumps
  – Alarms

Internet Challenges Make Measurement an Art
• Stateless routers
  – Routers do not routinely store packet/flow state
  – Measurement is an afterthought, adds overhead
• IP narrow waist
  – IP measurements cannot see below network layer
  – E.g., link-layer retransmission, tunnels, etc.
• Violations of end-to-end argument
  – E.g., firewalls, address translators, and proxies
  – Not directly visible, and may block measurements
• Decentralized control
  – Autonomous Systems may block measurements
  – No global notion of time

Active Measurement: Ping
• Adding traffic for purposes of measurement
  – Trade-offs between accuracy and overhead
  – Need careful methods to avoid introducing bias
• Ping
  – Host sends an ICMP ECHO packet to a target
  – … and captures the ICMP ECHO REPLY
  – Useful for checking connectivity and measuring RTT
  – Only requires control of one of the two end-points
• Problems with ping
  – Round-trip rather than one-way delays
  – Some hosts might not respond

Active Measurement: Traceroute
• Time-To-Live field in IP packet header
  – Source sends a packet with a TTL of n
  – Each router along the path decrements the TTL
  – ICMP “time exceeded” sent back to the source when TTL reaches 0
• Traceroute tool exploits this TTL behavior
[Figure: the source sends probes with TTL=1 and TTL=2 toward the destination; the router where each probe expires returns “Time exceeded”]
Send packets with TTL=1, 2, 3, … and record source of “time exceeded” message

Active Measurement: Challenges of Traceroute
• Measuring multiple paths
  – Successive probes may traverse different paths
• Non-participating network elements
  – Some routers and firewalls don’t reply
• Inaccurate delay information
  – Includes processing delays on the router CPU
• Round-trip vs.
one-way measurements
  – Paths may have asymmetric properties
• Interfaces, not routers
  – Returns IP address of interfaces, not routers

Active Measurement: Applications of Traceroute
• Network troubleshooting
  – Identify forwarding loops and black holes
  – Identify long and convoluted paths
  – See how far the probe packets get
• Network topology inference
  – Launch traceroute probes from many places
  – … toward many destinations
  – Join together to fill in parts of the topology
  – … though traceroute undersamples the edges

Active Measurement: Pathchar for Links
$rtt(i+1) - rtt(i) = d + L/c + \epsilon$
• i: initial TTL value
• c: link capacity
• L: packet size
• Three delay components
  – d: propagation delay
  – L/c: transmission delay
  – $\epsilon$: queueing delay + noise
• How to infer d, c?
[Figure: min. RTT(L) plotted against packet size L; the fitted line has slope 1/c and intercept d]

Passive Measurement: Logs at Hosts
• Web server logs
  – Host, time, URL, response code, content length, …
  – E.g.,
    122.345.131.2 - - [15/Oct/1998:00:00:25 0400] "GET /images/wwwtlogo.gif HTTP/1.0" 304 - "http://www.aflcio.org/home.htm" "Mozilla/2.0 (compatible; MSIE 3.02; Update a; AK; AOL 4.0; Windows 95)" "-"
• DNS logs
  – Request, response, time
• Useful for workload characterization, troubleshooting, etc.

Passive Measurement: SNMP
• Simple Network Management Protocol
  – Coarse-grained counters on the router
  – E.g., byte and packet counts
• Polling
  – Management system can poll the counters
  – E.g., once every five minutes
• Limitations
  – Extremely coarse-grained statistics
  – Delivered over UDP, so responses can be lost
• Advantages: ubiquitous

Passive Measurement: Packet Monitoring
• Tapping a link
[Figure: four ways to tap a link: shared media (Ethernet, wireless); multicast switch; splitting a point-to-point link between two routers; a router line card that does packet sampling]

Packet Monitoring: Selecting the Traffic
• Filter to focus on a subset of the packets
  – IP addresses/prefixes (e.g., to/from specific Web sites, client machines, DNS servers, mail servers)
  – Protocol (e.g., TCP, UDP, or ICMP)
  – Port numbers (e.g., HTTP, DNS, BGP, Napster)
• Collect first n bytes of packet (snap length)
  – Medium access control header (if present)
  – IP header (typically 20 bytes)
  – IP+UDP header (typically 28 bytes)
  – IP+TCP header (typically 40 bytes)
  – Application-layer message (entire packet)

Tcpdump Output (three-way TCP handshake and HTTP request message)
[Annotations on the first packet: timestamp; client address and port #; Web server (port 80); SYN flag; sequence number; TCP options]
  23:40:21.008043 eth0 > 135.207.38.125.1043 > lovelace.acm.org.www: S 617756405:617756405(0) win 32120 <mss 1460,sackOK,timestamp 46339 0,nop,wscale 0> (DF)
  23:40:21.036758 eth0 < lovelace.acm.org.www > 135.207.38.125.1043: S 2598794605:2598794605(0) ack 617756406 win 16384 <mss 512>
  23:40:21.036789 eth0 > 135.207.38.125.1043 > lovelace.acm.org.www: . 1:1(0) ack 1 win 32120 (DF)
  23:40:21.037372 eth0 > 135.207.38.125.1043 > lovelace.acm.org.www: P 1:513(512) ack 1 win 32256 (DF)
  23:40:21.085106 eth0 < lovelace.acm.org.www > 135.207.38.125.1043: . 1:1(0) ack 513 win 16384
  23:40:21.085140 eth0 > 135.207.38.125.1043 > lovelace.acm.org.www: P 513:676(163) ack 1 win 32256 (DF)
  23:40:21.124835 eth0 < lovelace.acm.org.www > 135.207.38.125.1043: P 1:179(178) ack 676 win 16384

Analysis of Packet Traces
• IP header
  – Traffic volume by IP addresses or protocol
  – Burstiness of the stream of packets
  – Packet properties (e.g., sizes, out-of-order, etc.)
• TCP header
  – Traffic breakdown by application (e.g., Web)
  – TCP congestion and flow control
  – Number of bytes and packets per session
• Application header
  – URLs, HTTP headers (e.g., cacheable response?)
  – DNS queries and responses, user key strokes, …

Aggregating Packets into IP Flows
[Figure: packets on a timeline grouped into flow 1, flow 2, flow 3, and flow 4]
• Set of packets that “belong together”
  – Source/destination IP addresses and port numbers
  – Same protocol, ToS bits, …
  – Same input/output interfaces at a router (if known)
• Packets that are “close” together in time
  – Maximum spacing between packets (e.g., 15 sec, 30 sec)
  – Example: flows 2 and 4 are different flows due to time

Packet vs. Flow Measurement
• Basic statistics (available from both techniques)
  – Traffic mix by IP addresses, port numbers, and protocol
  – Average packet size
• Traffic over time
  – Both: traffic volumes on a medium-to-large time scale
  – Packet: burstiness of the traffic on a small time scale
• Statistics per TCP connection
  – Both: number of packets & bytes transferred over the link
  – Packet: frequency of lost or out-of-order packets, and the number of application-level bytes delivered
• Per-packet info (available only from packet traces)
  – TCP seq/ack #s, receiver window, per-packet flags, …
  – Probability distribution of packet sizes
  – Application-level header and body (full packet contents)

Measurement Challenges for Operators
• Network-wide view
  – Crucial for evaluating control actions
  – Multiple kinds of data from multiple locations
• Large scale
  – Large number of high-speed links and routers
  – Large volume of measurement data
• Poor state of the art
  – Working within existing protocols and products
  – Technology not designed with measurement in mind
• The “do no harm” principle
  – Don’t degrade router performance
  – Don’t require disabling key router features
  – Don’t overload the network with measurement data

Network Operations Tasks
• Reporting of network-wide statistics
  – Generating basic information about usage and
reliability
• Performance/reliability troubleshooting
  – Detecting and diagnosing anomalous events
• Security
  – Detecting, diagnosing, and blocking security problems
• Traffic engineering
  – Adjusting network configuration to the prevailing traffic
• Capacity planning
  – Deciding where and when to install new equipment

Basic Reporting
• Producing basic statistics about the network
  – For business purposes, network planning, ad hoc studies
• Examples
  – Proportion of transit vs. customer-customer traffic
  – Total volume of traffic sent to/from each private peer
  – Mixture of traffic by application (Web, Napster, etc.)
  – Mixture of traffic to/from individual customers
  – Usage, loss, and reliability trends for each link
• Requirements
  – Network-wide view of basic traffic and reliability statistics
  – Ability to “slice and dice” measurements in different ways (e.g., by application, by customer, by peer, by link type)

Troubleshooting
• Detecting and diagnosing problems
  – Recognizing and explaining anomalous events
• Examples
  – Why a backbone link is suddenly overloaded
  – Why the route to a destination prefix is flapping
  – Why DNS queries are failing with high probability
  – Why a route processor has high CPU utilization
  – Why a customer cannot reach certain Web sites
• Requirements
  – Network-wide view of many protocols and systems
  – Diverse measurements at different protocol levels
  – Thresholds for isolating significant phenomena

Security
• Detecting and diagnosing problems
  – Recognizing suspicious traffic or disruptions
• Examples
  – Denial-of-service attack on a customer or service
  – Spread of a worm or virus through the network
  – Route hijack of an address block by an adversary
• Requirements
  – Detailed measurements from multiple places
  – Including deep-packet inspection, in some cases
  – Online analysis of the data
  – Installing filters to block the offending traffic

Traffic Engineering
• Adjusting resource allocation policies
  – Path selection, buffer management, and link
scheduling
• Examples
  – OSPF weights to divert traffic from congested links
  – BGP policies to balance load on peering links
  – Link-scheduling weights to reduce delay for “gold” traffic
• Requirements
  – Network-wide view of the traffic carried in the backbone
  – Timely view of the network topology and configuration
  – Accurate models to predict impact of control operations (e.g., the impact of RED parameters on TCP throughput)

Capacity Planning
• Deciding whether to buy/install new equipment
  – What? Where? When?
• Examples
  – Where to put the next backbone router
  – When to upgrade a link to higher capacity
  – Whether to add/remove a particular peer
  – Whether the network can accommodate a new customer
  – Whether to install a caching proxy for cable modems
• Requirements
  – Projections of future traffic patterns from measurements
  – Cost estimates for buying/deploying the new equipment
  – Model of the potential impact of the change (e.g., latency reduction and bandwidth savings from a caching proxy)

Examples of Public Data Sets
• Network-wide data
  – Abilene and GEANT backbones
  – Netflow, IGP, and BGP traces
• CAIDA DatCat
  – Data catalogue maintained by CAIDA
  – http://imdc.datcat.org/
• Interdomain routing
  – RouteViews and RIPE-NCC
  – BGP routing tables and update messages
• Traceroute and looking glass servers
  – http://www.traceroute.org/
  – http://www.nanog.org/lookingglass.html

Discussion
• How important is accuracy of the data?
• How can we validate measurement studies? (If we know the answer already, why are we measuring?)
• How to do controlled experiments with measurement techniques?
• Can we move measurement to a science rather than an art?
• Can we identify incentives for making measurement possible and data available?
• Distributed analysis of measurement data?
• An architecture for router or line-card support for traffic and performance measurement?
• Trade-offs between security and privacy?
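
Worked example: inferring d and c (Pathchar for Links)
The pathchar slide asks how to infer the propagation delay d and the link capacity c from per-hop RTT differences. A minimal sketch, under stated assumptions: keep the minimum RTT difference observed at each packet size (so that queueing delay and noise mostly drop out), then fit a straight line whose slope is 1/c and whose intercept is d. The name `fit_link`, the units (bits and seconds), and the use of ordinary least squares are illustrative choices, not the actual pathchar implementation.

```python
from collections import defaultdict


def fit_link(samples):
    """Estimate (d, c) for one link.

    samples: iterable of (packet_size_bits, rtt_difference_seconds), where
    the RTT difference is rtt(i+1) - rtt(i) for one probe size.
    Returns (d, c): propagation delay in seconds, capacity in bits/second.
    """
    # Step 1: the minimum per size approximates d + L/c (no queueing).
    min_by_size = defaultdict(lambda: float("inf"))
    for size, rtt_diff in samples:
        min_by_size[size] = min(min_by_size[size], rtt_diff)

    sizes = sorted(min_by_size)
    if len(sizes) < 2:
        raise ValueError("need at least two distinct packet sizes")
    delays = [min_by_size[s] for s in sizes]

    # Step 2: ordinary least squares for min_rtt(L) = d + L * (1/c).
    n = len(sizes)
    mean_l = sum(sizes) / n
    mean_t = sum(delays) / n
    sxx = sum((l - mean_l) ** 2 for l in sizes)
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(sizes, delays))
    slope = sxy / sxx  # seconds per bit, i.e. 1/c
    if slope <= 0:
        raise ValueError("non-positive slope; samples too noisy to fit")

    d = mean_t - slope * mean_l  # intercept at L = 0
    c = 1.0 / slope
    return d, c


if __name__ == "__main__":
    # Synthetic link (assumed values): d = 5 ms, c = 10 Mbit/s.
    true_d, true_c = 0.005, 10e6
    probes = []
    for size in (4000, 8000, 12000):
        for queueing in (0.0, 0.002, 0.001):  # made-up queueing noise
            probes.append((size, true_d + size / true_c + queueing))
    print(fit_link(probes))
```

Taking the minimum before fitting matters: queueing delay only ever adds to the RTT, so averaging the samples would bias both estimates upward.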
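
Worked example: aggregating packets into IP flows
The flow slide defines a flow as packets that "belong together" (same addresses, ports, and protocol) and are "close" together in time. A minimal sketch of that rule, assuming a time-ordered packet list and a five-tuple key; the names `Flow` and `aggregate_flows`, the tuple layout, and the 30-second default are hypothetical and do not reproduce any particular router's flow export logic.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    key: tuple          # (src_ip, dst_ip, src_port, dst_port, protocol)
    first_seen: float   # timestamp of the first packet, in seconds
    last_seen: float    # timestamp of the most recent packet
    packets: int = 0
    byte_count: int = 0


def aggregate_flows(packets, timeout=30.0):
    """Group time-ordered packets into flows.

    packets: iterable of
        (timestamp, src_ip, dst_ip, src_port, dst_port, protocol, length)
    A packet joins the existing flow for its five-tuple unless more than
    `timeout` seconds have passed since that flow's previous packet, in
    which case the old flow is closed and a new one starts.
    """
    active = {}      # five-tuple -> currently open Flow
    finished = []    # flows closed by the timeout
    for ts, src, dst, sport, dport, proto, length in packets:
        key = (src, dst, sport, dport, proto)
        flow = active.get(key)
        if flow is not None and ts - flow.last_seen > timeout:
            finished.append(flow)
            flow = None
        if flow is None:
            flow = Flow(key, ts, ts)
            active[key] = flow
        flow.last_seen = ts
        flow.packets += 1
        flow.byte_count += length
    finished.extend(active.values())
    return sorted(finished, key=lambda f: f.first_seen)


if __name__ == "__main__":
    trace = [  # made-up packets for illustration
        (0.0, "10.0.0.1", "10.0.0.2", 1043, 80, "tcp", 60),
        (1.0, "10.0.0.1", "10.0.0.2", 1043, 80, "tcp", 512),
        (2.0, "10.0.0.3", "10.0.0.2", 2000, 53, "udp", 80),
        (50.0, "10.0.0.1", "10.0.0.2", 1043, 80, "tcp", 100),
    ]
    for flow in aggregate_flows(trace, timeout=30.0):
        print(flow)
```

With a 30-second timeout, the packet at 50.0 s starts a new flow even though its five-tuple matches the first two packets, which mirrors the slide's note that two flows can differ only by time.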