The Performance Bottleneck: Application, Computer, or Network
Richard Carlson <rcarlson@internet2.edu>
eVLBI Workshop – Performance Tuning Tutorial
September 17, 2006

Outline
• Why there is a problem
• What can be done to find/fix problems
• Tools you can use

Basic Premise
• Application performance should meet your expectations!
• If it doesn’t, you should complain!
• But you have to complain effectively.

Questions
• How many times have you said:
  • What’s wrong with the network?
  • Why is the network so slow?
• Do you have any way to find out?
  • Tools to check local host
  • Tools to check local network
  • Tools to check end-to-end path

Unfortunate Reality
• Every problem, regardless of cause, exhibits the same symptom
  • The application performance doesn’t meet the user’s expectations!

Possible Bottlenecks
• Network infrastructure
• Host computer/appliance
• Application design

Simple Network Picture
[Figure: Bob’s Host and Carol’s Host connected through network infrastructure: Switch 1 to Switch 4 and routers R1 to R9]

Network Infrastructure Bottlenecks
• Links too small
  • Using FastEthernet instead of Gigabit Ethernet
• Links congested
  • Too many hosts crossing this link
• Scenic routing
  • End-to-end path is longer than it needs to be
• Broken equipment
  • Bad NIC, broken wire/cable, cross-talk
• Administrative restrictions
  • Firewalls, filters, shapers, restrictors

Host Computer Bottlenecks
• CPU utilization
  • What else is the processor doing?
• Memory limitations
  • Main memory and network buffers
• I/O bus speed
  • Getting data into and out of the NIC
• Disk access speed

Application Behavior Bottlenecks
• Chatty protocol
  • Lots of short messages between peers
• High reliability protocol
  • Send packet and wait for reply before continuing
• No run-time tuning options
  • Use only default settings
• Blaster protocol
  • Ignore congestion control feedback

Problems, Problems, Problems
• Problems can exist at multiple levels
  • Network infrastructure
  • Host computer
  • Application design
• Multiple problems can exist at the same time
• All problems must be found and fixed before things get better

Transport Protocols 101
• Transmission Control Protocol (TCP)
  • Provides applications with a reliable in-order delivery service
  • The most widely used Internet transport protocol
    • Web, file transfers, email, P2P, remote login
• User Datagram Protocol (UDP)
  • Provides applications with an unreliable delivery service
    • RTP, DVTS, DNS

Outline
• Why there is a problem
• What can be done to find/fix problems
• Tools you can use

Remote Image Processing
• Carol is analyzing astronomical images. Bob needs to send a data file containing digital images (50 MB per file) to Carol every half hour. Bob and Carol are 2,000 miles apart. How long should each transfer take?
  • 5 minutes?
  • 1 minute?
  • 5 seconds?

What should we expect?
• Assumptions:
  • 100 Mbps Fast Ethernet is the slowest link
  • 50 msec round trip time
• Bob & Carol calculate:
  • 50 MB * 8 b/B = 400 Mb
  • 400 Mb / 100 Mb/sec = 4 seconds

Initial Test Results
• 18 minutes!!! This is unacceptable!
• First look for a network infrastructure problem
  • Use NDT tester to examine both hosts

Initial NDT testing shows Duplex Mismatch at one end

NDT Found Duplex Mismatch
• Investigation finds that the switch port is configured for 100 Mbps Full-Duplex operation.
• Network administrator corrects configuration and asks for re-test

Duplex Mismatch Corrected

SCP results after Duplex Mismatch Corrected

Intermediate Results
• Time dropped from 18 minutes to 40 seconds.
• Is this acceptable???
  • Remember your calculations said it should take 4 seconds.
  • 400 Mb / 40 sec = 10 Mbps
  • Why are we limited to 10 Mbps?
  • Are you satisfied with 1/10th of the possible performance?

Default TCP window size

Calculating the Window Size
• Remember Bob found the round-trip time was 50 msec
• Calculate window size limit
  • 85.3 KB * 1024 B/KB * 8 b/B = 698777 b
  • 698777 b / 0.050 s = 13.98 Mbps
• Stated another way
  • 698777 b / 100 Mb/s = 6.99 msec
  • 43 msec of idle time every RTT

Calculating the Window Size
• Calculate new window size
  • (100 Mb/s * 0.050 s) / 8 b/B = 625,000 B = 610.3 KB
  • Use 8 MB for testing purposes

Resetting Window Buffer

Intermediate Results
• Use application-specific options to manually reset buffer size
  • Fixes problem for this application
  • Doesn’t fix problem for other applications
• Need better ‘default behavior’ for all applications

With TCP window size tuned

Steps so far
• Found and fixed Duplex Mismatch
  • Network infrastructure problem
• Found and fixed TCP window size values
  • Host configuration problem
• Are we done yet?

SCP results with auto-tuning enabled

Intermediate Results
• SCP still runs slower than expected
  • Hint: SSH uses internal buffers
  • Design choices by application developers limit performance
  • Patch available from PSC

SCP Results with tuned SCP

Final Results
• Fixed infrastructure problem
• Fixed host configuration problem
• Fixed application configuration problem
• Achieved target time of 4 seconds to transfer a 50 MB file over 2,000 miles

Follow-up questions
• What would have happened if I tried the patched SCP version before fixing the TCP buffer problem?
  • Would not have been able to see improvement.
  • Discard patch because “it didn’t work”?

Why is it hard to Find/Fix Problems?
• Network infrastructure is complex
• Network infrastructure is shared
• Network infrastructure consists of multiple components

Shared Infrastructure
• Other applications accessing the network
  • Remote disk access
  • Automatic email checking
  • Heartbeat facilities
• Other computers are attached to the closet switch
  • Uplink to facility infrastructure
• Other users on and off site
  • Uplink from facility to gigapop/backbone

Other Network Components
• DHCP (Dynamic Host Configuration Protocol)
  • At least 2 packets exchanged to configure your host
• DNS (Domain Name System)
  • At least 2 packets exchanged to translate FQDN into IP address
  • Multiple addresses require a sequential search
• Network security devices
  • Intrusion detection, VPN, firewall

Why is it hard to Find/Fix Problems?
• Computers have multiple components
  • Each Operating System (OS) has a unique set of tools to tune the network stack
  • Network Interface Cards also have tuning options
• Application appliances come with few knobs and limited options

Computer Components
• Main CPU (clock speed)
• Front & back side bus
• Main memory
• I/O bus (ATA, SCSI, SATA)
• Disk (access speed and size)

Computer Issues
• Lots of internal components with multitasking OS
• Lots of tunable TCP/IP parameters that need to be ‘right’ for each possible connection

Why is it hard to Find/Fix Problems?
• Applications depend on default system settings
• Problems scale with distance
• More access to remote resources
  • 80/20 rule: since the early 1990s, 80% of your traffic leaves your local network

Default System Settings
• For Linux 2.6.13 there are:
  • 11 tunable IP parameters
  • 45 tunable TCP parameters
  • 148 Web100 variables (TCP MIB)
• Currently no OS ships with default settings that work well over trans-continental distances
• Some applications allow run-time setting of some options
  • 30 settable/viewable IP parameters
  • 24 settable/viewable TCP parameters
• There are no standard ways to set run-time option ‘flags’

Application Issues
• Setting tunable parameters to the ‘right’ value
• Getting the protocol ‘right’

Outline
• Why there is a problem
• What can be done to find/fix problems
• Tools you can use

Tools, Tools, Tools
• Ping
• Traceroute
• Iperf
• Tcpdump
• Tcptrace
• BWCTL
• NDT
• OWAMP
• AMP
• Advisor
• Thrulay
• Web100
• MonaLisa
• pathchar
• NPAD
• Pathdiag
• Surveyor
• Ethereal
• CoralReef
• MRTG
• Skitter
• Cflowd
• Cricket
• Net100

Active Measurement Tools
• Tools that inject packets into the network to measure some value
  • Available bandwidth
  • Delay/jitter
  • Loss
• May require bi-directional traffic or synchronized hosts
• May require running a test program on both hosts

Passive Measurement Tools
• Tools that monitor existing traffic on the network and extract some information
  • Bandwidth used
  • Jitter
  • Loss rate
• May generate some privacy and/or security concerns

How do you set realistic Expectations?
• Assume network bandwidth exists or find out what the limits are
  • Local LAN connection
  • Site access link
• Monitor the link utilization occasionally
  • Weathermap
  • MRTG graphs
• Look at your host config/utilization
  • What is the CPU utilization?

Distance Matters
• It’s harder to go fast over a long distance
  • TCP congestion control requires numerous round trips to prevent flooding the network
  • TCP buffer limits can stop the sender from injecting new data into the network
  • Applications can exhibit poor behavior when used over long distances

Ethernet, FastEthernet, Gigabit Ethernet, 10 GE
• 10/100/1000 auto-sensing NICs are common today
• Most facilities have installed 10/100 switched infrastructure
• Access network links are currently the limiting factor in most networks
• Backbone networks are 10 Gigabit/sec

Wireless LANs
• 802.11b: 11 Mbps (expect 5)
• 802.11a: 54 Mbps (expect 15)
• 802.11g: 54 Mbps (expect 25)
• Expect large variations in speed due to radio signal propagation

Focus on 2 tools
• Existing NDT tool
  • Allows users to test a network path for a limited number of common problems
• Emerging PerfSonar tool
  • Allows users to retrieve network path data from major national and international research and education networks (RENs)

Network Diagnostic Tool (NDT)
• Measure performance to the user’s desktop
• Identify real problems for real users
  • Network infrastructure is the problem
  • Host tuning issues are the problem
• Make tool simple to use and understand
• Make tool useful for users and network administrators
• Web-based Java applet allows testing from any browser

Installing your own server
• All Internet2 tools are FREE
• Visit http://e2epi.internet2.edu/ for details
• Workshops are available to help your administrator get them up and running (http://e2epi.internet2.edu/net-perf-wkshp/)
• Encourage your peers to start testing
• Encourage your vendors to include the client programs

NPToolkit Bootable CD
• Knoppix-based Live-CD
• Contains listed tools
• Download from Internet2
• Ask for a pre-built CD-ROM
• http://e2epi.internet2.edu/network-performance-toolkit/network-performance-toolkit.iso

PerfSonar – Next Steps in Performance Monitoring
• New initiative involving multiple partners
  • ESnet (DOE labs)
  • GEANT (European Research and Education network)
  • Internet2 (Abilene and connectors)
• Sample tool (Joe Metzger, ESnet)
  • https://performance.es.net/cgi-bin/perfsonar-trace.cgi

Traceroute Visualizer

Abilene Weather Map
• http://loadrunner.uits.iu.edu/weathermaps/abilene/

Windows XP Performance

Google it!
• Enter “tuning tcp” into the Google search engine.
• Top 2 hits are:
  • http://www.psc.edu/networking/perf_tune.html
  • http://www-didc.lbl.gov/TCP-tuning/TCP-tuning.html

PSC Tuning Page

LBNL Tuning Page

Conclusions
• Applications can fully utilize the network
• All problems have a single symptom
• All problems must be found and fixed before things get better
• Some people stop investigating before finding all problems
• Tools exist, and more are being developed, to make it easier to find problems
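
Worked Example: Checking the Window-Size Arithmetic
The figures on the “What should we expect?” and “Calculating the Window Size” slides can be reproduced with a few lines of Python. This is a minimal sketch and not part of the original talk: the function and constant names are illustrative, 1 KB is assumed to be 1024 bytes and 1 Mb to be 10^6 bits (the convention that matches the slide figures), and protocol overhead is ignored.

```python
"""Back-of-the-envelope numbers for the Bob and Carol transfer example."""

# Values taken from the slides.
FILE_MB = 50              # file size in megabytes (10^6 bytes)
LINK_MBPS = 100           # slowest link on the path: Fast Ethernet
RTT_S = 0.050             # round-trip time in seconds
DEFAULT_WINDOW_KB = 85.3  # default TCP window (assumed 1 KB = 1024 bytes)


def ideal_transfer_seconds(file_mb, link_mbps):
    """Time to move the file if the slowest link is kept full."""
    return file_mb * 8 / link_mbps


def window_limited_mbps(window_kb, rtt_s):
    """Throughput ceiling when at most one window is sent per round trip."""
    window_bits = window_kb * 1024 * 8
    return window_bits / rtt_s / 1e6


def required_window_kb(link_mbps, rtt_s):
    """Bandwidth-delay product: window needed to keep the link full."""
    return link_mbps * 1e6 * rtt_s / 8 / 1024


def idle_ms_per_rtt(window_kb, link_mbps, rtt_s):
    """Milliseconds per round trip during which the sender must wait."""
    send_time_s = window_kb * 1024 * 8 / (link_mbps * 1e6)
    return max(0.0, rtt_s - send_time_s) * 1000


if __name__ == "__main__":
    print(f"Ideal transfer time:    {ideal_transfer_seconds(FILE_MB, LINK_MBPS):.1f} s")
    print(f"Window-limited rate:    {window_limited_mbps(DEFAULT_WINDOW_KB, RTT_S):.2f} Mbps")
    print(f"Window needed for link: {required_window_kb(LINK_MBPS, RTT_S):.1f} KB")
    print(f"Idle time per RTT:      {idle_ms_per_rtt(DEFAULT_WINDOW_KB, LINK_MBPS, RTT_S):.1f} ms")
```

Changing RTT_S or LINK_MBPS shows how quickly the required window grows with distance and link speed, which is why a default that works on a LAN can fall short on a 2,000 mile path.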