CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups
KyoungSoo Park, Vivek Pai, Larry Peterson, Zhe Wang
Princeton University
7/2/2016 OSDI '04

Domain Name System (DNS)
- Human-friendly names to IP addresses
- Operational for over 20 years
- Essential part of the Web
- Two components
  - Server-side: name owners
  - Client-side: contacting name owners

Two Kinds of DNS Problems
- Server-side problems [Danzig92], [Jung01]
  - Nameserver bugs
  - Misconfigurations
  - Hardening/replacing server infrastructure
- Client-side problems
  - Between local nameservers (LDNS) and clients
  - Larger memories = higher LDNS hit rate
  - LDNS cache hit rate: 80-90%
  - Result: LDNS problems magnified

Contributions
- Measure LDNS problems, causes
- Client-side DNS helper, CoDNS
  - Communicates with other CoDNS peers
  - Incrementally deployable
  - Works with all DNS lookups (CDN, etc.)
- Benefits
  - Latency reduction: 27-82%
  - Availability: generally adds extra ‘9’

Local DNS Lookup Problems
- Local DNS lookup failures
  - 5+ seconds delay for cached records
  - Frequent & widely distributed
- Unpredictable service
  - Directly affects user-perceived latency
  - Random delays in web access
  - Kills HTTP proxies, web services, and busy mail servers

Demonstrating Local Problems
- Local name lookup every 6 seconds
  - “yyy.domain” on xxx.domain at 200 sites
  - e.g. “planetlab-2.cs.princeton.edu” for planetlab-1.cs.princeton.edu
  - Lookup should be handled locally
  - LDNS is site-shared, NOT PlanetLab’s
- Failure criteria
  - 5+ seconds of latency
  - zero answer
- Rolling average of the past 100 queries

Expected DNS Behavior
[Figure panels: University of Utah; Rice University]

DNS Failure on Various Nodes
[Figure panels: Cornell; Texas A&M; University of Oregon]

Possible Causes
- Packet loss
- LDNS overloading
- Cron jobs
- Maintenance problems

Packet Loss
- UDP inherently unreliable
- Single loss triggers query retransmission
- Less than ~0.1% in LAN environment
- Heavily
  dependent on local traffic
- Cable modem/DSL may be worse
- Losses last for ~1 min
- Our sites have ~4 LAN hops, Cable ~8

Nameserver Overloading
[Figure panels: University of Michigan; University of Torino, Italy; Technical University Berlin, Germany; time axis from 8 am to 6 pm]

Nameserver Overloading
- Many responses take 1 to 5 sec
- No timeout but simply late
- Pr(Overloading | DNS Failure) = 90% for some nodes
- Bursts cause socket buffer overflow
- Experiment in the paper

Cron Jobs / Heavy Processes
[Figure panels: University of Tennessee 1; University of Tennessee 2; Moscow State University]
- Not a client problem!

Why Do We See This?
- Large memory → large cache
- Large cache → high hit rate
- High hit rate → CPU load drops
- Low CPU load → add more services
- More services → memory pressure
- Memory pressure → failures, delays

Maintenance Problems
- /etc/resolv.conf
- Blocking services
- Outside the firewall
- Complete outage
- Configured to dead nameservers
- Berkeley Millennium nodes, 3/17/2004
- Blackout / natural disaster
- Duke hit by hurricane Isabel, Fall/2003

Solution: CoDNS
[Diagram labels: My Machine, Client Programs, CoDNS, My LAN, LDNS, Wide Area Network (WAN), LAN, CoDNS, LDNS]

CoDNS: Cooperative DNS
- Cooperative name lookup scheme
  - If local server OK, use local server
  - When failing, ask peers to do lookup
- Insurance model
  - Share risk, share benefits
- Aggregate name lookup service
- Aggregate cache effect
- Incrementally deployable, no server change

Design Issues
- Proximity / liveness
  - Select nearby peers
  - Monitors nameserver’s health as well
- Request locality
  - Pick same peer for same names
  - Highest Random Weight (HRW)
- Remote request timeout
  - Dynamically adjusted to local server’s health
  - Exponentially backed off for each remote query

How Many Peers Needed?
[Figure annotations (average response time): 900ms, 600ms, 200ms, 0ms]
- One extra peer halves avg response time!
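
The request-locality and remote-timeout items on the Design Issues slide can be sketched roughly as follows. This is a minimal illustration, not the CoDNS implementation: the SHA-1 hash, the doubling factor, the function names (`hrw_score`, `pick_peers`, `remote_timeouts`), and the peer names are all assumptions made for the example.

```python
import hashlib


def hrw_score(peer: str, name: str) -> int:
    """Hash the (peer, name) pair into a comparable integer weight (SHA-1 is an assumption)."""
    digest = hashlib.sha1(f"{peer}|{name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_peers(name: str, live_peers: list[str], count: int = 1) -> list[str]:
    """Highest Random Weight: return the `count` live peers with the largest weight for this name."""
    ranked = sorted(live_peers, key=lambda peer: hrw_score(peer, name), reverse=True)
    return ranked[:count]


def remote_timeouts(initial_ms: int, attempts: int) -> list[int]:
    """Delay before each successive remote query; the doubling factor is an assumption."""
    return [initial_ms * 2 ** i for i in range(attempts)]


# Hypothetical peer names, used only for illustration.
peers = ["peer-a.example", "peer-b.example", "peer-c.example", "peer-d.example"]
first_choice = pick_peers("www.example.com", peers)[0]

# Removing some other peer does not change which peer this name maps to.
other = next(p for p in peers if p != first_choice)
survivors = [p for p in peers if p != other]
```

Because every node ranks the same peers identically for a given name, repeated lookups of that name tend to land on the same peer's cache, and losing an unrelated peer leaves the mapping intact.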
[Figure: average response time (ms) vs. number of extra peers (0, 1, 2, 4, 8, 16, 32, 64)]

Effect of Timeout
- 200ms: slope changes
- 500ms: virtually flat
[Figure: average number of lookups vs. initial timeout for remote query (100 to 1000 ms); curves for 1, 2, 4, and 8 peers]

Deployment Status
- CoDNS deployed on all PlanetLab nodes
- CoDeeN uses CoDNS as primary DNS
  - Running 24/7 since August 2003
  - After CoDeeN’s own DNS cache
- Remote query configuration
  - One extra peer, 200ms starting timeout
  - On total LDNS failure, send immediately
  - Monitor 10 nodes as neighbors

Evaluation
- Live traffic for one week for CoDeeN (20k - 30k)
[Figure: average response time (ms), LDNS vs. CoDNS; nodes sorted by LDNS response time]

Finer-grained View
- Live traffic for one day
- Effectively flattens the spikes
- Cache miss + WAN problem
[Figure panels: LDNS; CoDNS]

Availability
- Adds one ‘9’, from 99% to 99.9%
[Figure: availability (%), CoDNS vs. LDNS; nodes sorted by LDNS availability]

What About CDNs?
- CDN uses DNS to pick “best” replica
- CoDNS used only when LDNS failing
  - Pro: faster lookup time
  - Con: maybe worse/farther replica
- In reality, peer’s answer is better 30% of the time

CDN Pro/Con Measurements
[Figure: latency difference (ms, log scale), DNS gain vs. download penalty; nodes sorted by DNS time difference]

Overhead
- Heartbeat packet: 1/sec, Memory: 600KB
- Remote queries: median 25% more lookups
[Figure: percentage of remote queries (Sent, Answered, Win); nodes sorted by number of remote queries]

CoDNS Alternatives
- In the paper:
  - Private nameservers
  - Secondary nameservers
  - TCP queries

Conclusion
- Local failures relatively frequent
  - Failure time dominates latency
- CoDNS provides low-cost “insurance” service
  - Masks local failures
  - Reduces avg response time 27-82%
  - Improves availability by additional ‘9’
- Incrementally deployable, no server change

More Information
- CoDNS homepage: http://codeen.cs.princeton.edu/codns/
- Email: princeton_codeen@slices.planet-lab.org

TCP Queries
- DNS supports TCP
- Simple TCP
- 2 packets vs. 9 packets (3 + 2 + 4 = 9)
- Persistent TCP
- Failure rate is better
- Not used except for AXFR or when answer is big
- ACK overhead
- Resource waste for idle connections
- Vulnerable to overloading/server down

S-TCP, P-TCP, UDP, CoDNS
- Replay test (10792 names) on 107 nodes
[Figure annotation: CoDNS First]

CoDNS vs. Persistent TCP
[Figure: average response time (ms), persistent TCP vs. CoDNS; nodes sorted by persistent TCP's response time]

Lookup Distribution
- Live traffic on a node for one week (20333 queries)
- 2043135 ms / 5809265 ms = 35.1%
- 100 ms vs.
  286 ms per query
- Great improvement on W-CDF
[Figure annotations: 5.5%, 0.06%, 76%, 17.8%]

Analysis on Wins
- 80% at first query, 95% at second query
[Figure: percentage of Win-by-1, Win-by-2, and Win-by-3; nodes sorted by Win-by-1]
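
The per-query figures on the Lookup Distribution slide follow from the two totals and the query count, and can be checked with a few lines of arithmetic. The slide does not label the totals; the sketch assumes the smaller one is CoDNS and the larger one is the local nameserver, and the variable names are illustrative.

```python
# Numbers taken from the Lookup Distribution slide.
queries = 20333
smaller_total_ms = 2043135  # assumed to be the CoDNS total (not labeled on the slide)
larger_total_ms = 5809265   # assumed to be the LDNS total (not labeled on the slide)

# Ratio of total lookup time, and the average time per query for each total.
ratio = smaller_total_ms / larger_total_ms
smaller_per_query_ms = smaller_total_ms / queries
larger_per_query_ms = larger_total_ms / queries

print(f"ratio of totals: {ratio:.1%}")
print(f"per query: {smaller_per_query_ms:.0f} ms vs. {larger_per_query_ms:.0f} ms")
```

The ratio comes out at roughly 35.2%, which the slide gives as 35.1%, and the per-query averages round to 100 ms and 286 ms.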