CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups
KyoungSoo Park, Vivek Pai, Larry Peterson, Zhe Wang
Princeton University
7/2/2016
OSDI '04

Domain Name System (DNS)
Human-friendly names → IP addresses
Operational for over 20 years
Essential part of the Web
Two components
  Server-side: name owners
  Client-side: contacting name owners

Two Kinds of DNS Problems
Server-side problems [Danzig92], [Jung01]
  Nameserver bugs
  Misconfigurations
  Hardening/replacing server infrastructure
Client-side problems
  Between local nameservers (LDNS) and clients
  Larger memories = higher LDNS hit rate
  LDNS cache hit rate: 80-90%
  Result: LDNS problems magnified

Contributions
Measure LDNS problems, causes
Client-side DNS helper, CoDNS
  Communicates with other CoDNS peers
  Incrementally deployable
  Works with all DNS lookups (CDN, etc.)
Benefits
  Latency reduction: 27-82%
  Availability: generally adds an extra ‘9’

Local DNS Lookup Problems
Local DNS lookup failures
  5+ seconds delay for cached records
  Frequent & widely-distributed
Unpredictable service
  Directly affects user-perceived latency
  Random delays in web access
  Kills HTTP proxies, web services, and busy mail servers

Demonstrating Local Problems
Local name lookup every 6 seconds
  “yyy.domain” on xxx.domain at 200 sites
    “planetlab-2.cs.princeton.edu” for planetlab-1.cs.princeton.edu
  Lookup should be handled locally
  LDNS is site-shared, NOT PlanetLab’s
Failure criteria
  5+ seconds of latency
  Zero answer
  Rolling average of the past 100 queries

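The failure criteria above can be sketched as a small rolling-window tracker. This is only an illustration under stated assumptions: the class name `FailureTracker`, its method names, and the idea of passing in a measured latency and answer count are hypothetical, not taken from the measurement code used in the talk. Only the 5-second threshold, the zero-answer rule, and the 100-query window come from the slide.

```python
from collections import deque

FAILURE_LATENCY_S = 5.0   # "5+ seconds of latency" counts as a failure
WINDOW = 100              # rolling average over the past 100 queries


class FailureTracker:
    """Rolling failure rate over the most recent WINDOW lookups (hypothetical helper)."""

    def __init__(self, window=WINDOW):
        # True means the lookup failed; old entries fall off automatically.
        self.results = deque(maxlen=window)

    def record(self, latency_s, answers):
        """Record one lookup; latency_s is None if the resolver never replied."""
        failed = (
            latency_s is None
            or latency_s >= FAILURE_LATENCY_S
            or answers == 0
        )
        self.results.append(failed)
        return failed

    def failure_rate(self):
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)


tracker = FailureTracker()
for _ in range(95):
    tracker.record(0.002, 1)      # fast, answered locally
for _ in range(5):
    tracker.record(5.3, 1)        # answered, but only after a 5+ second delay
print(tracker.failure_rate())     # 0.05
```

In a real probe, one `record` call would follow each lookup issued every 6 seconds, so the window covers roughly the last 10 minutes.
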
Expected DNS Behavior
[Figures: University of Utah; Rice University]

DNS Failure on Various Nodes
[Figures: Cornell; Texas A&M; University of Oregon]

Possible Causes
Packet loss
LDNS overloading
Cron jobs
Maintenance problems

Packet Loss
UDP inherently unreliable
  Single loss triggers query retransmission
Less than ~0.1% in LAN environment
  Heavily dependent on local traffic
  Losses last for ~1 min
Cable modem/DSL may be worse
  Our sites have ~4 LAN hops, Cable ~8

Nameserver Overloading
[Figures: University of Michigan; University of Torino, Italy; Technical University Berlin, Germany; time axis marked at 8 am and 6 pm]

Nameserver Overloading
Many responses take 1-5 seconds
  No timeout but simply late
  Pr(Overloading | DNS Failure) = 90% for some nodes
Bursts cause socket buffer overflow
  Experiment in the paper

Cron jobs/heavy processes
[Figures: University of Tennessee 1; University of Tennessee 2; Moscow State University]
Not a client problem!

Why Do We See This?
Large memory → large cache
Large cache → high hit rate
High hit rate → CPU load drops
Low CPU load → add more services
More services → memory pressure
Memory pressure → failures, delays

Maintenance Problems
/etc/resolv.conf
  Configured to dead nameservers
Blocking services
  Outside the firewall
Complete outage
  Berkeley Millennium nodes, 3/17/2004
Blackout / natural disaster
  Duke hit by Hurricane Isabel, Fall 2003

Solution: CoDNS
[Diagram: on My Machine, client programs talk to a local CoDNS process; CoDNS uses the LDNS on My LAN and reaches a peer CoDNS, with its own LDNS on another LAN, across the Wide Area Network (WAN)]

CoDNS: Cooperative DNS
Cooperative name lookup scheme
  If local server OK, use local server
  When failing, ask peers to do lookup
Insurance model
  Share risk, share benefits
  Aggregate name lookup service
  Aggregate cache effect
Incrementally deployable, no server change

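The cooperative lookup scheme can be sketched with simulated resolvers. This is a minimal sketch, not the CoDNS implementation: `codns_lookup`, `local_lookup`, and `peer_lookup` are hypothetical names, the resolvers are stand-in coroutines that sleep instead of sending DNS packets, and the addresses are documentation placeholders. The idea shown is only that the local nameserver gets a head start and a peer is asked when it does not answer in time.

```python
import asyncio


async def codns_lookup(name, local_lookup, peer_lookup, initial_delay=0.2):
    """Return (source, answer) from whichever resolver answers first.

    local_lookup and peer_lookup are hypothetical async callables that
    take a name and return an address string.
    """
    local = asyncio.create_task(local_lookup(name))
    # Give the local nameserver a head start before involving a peer.
    done, _ = await asyncio.wait({local}, timeout=initial_delay)
    if done and local.exception() is None:
        return "local", local.result()

    # Local server is late (or failed outright): ask a peer as well.
    remote = asyncio.create_task(peer_lookup(name))
    pending = {remote} if done else {local, remote}
    while pending:
        finished, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in finished:
            if task.exception() is None:
                for other in pending:
                    other.cancel()
                source = "local" if task is local else "peer"
                return source, task.result()
    raise LookupError(f"no resolver answered for {name}")


async def demo():
    async def fast_local(name):
        await asyncio.sleep(0.01)     # healthy LDNS
        return "192.0.2.1"

    async def slow_local(name):
        await asyncio.sleep(1.0)      # simulated overloaded LDNS
        return "192.0.2.1"

    async def peer(name):
        await asyncio.sleep(0.05)     # simulated wide-area round trip
        return "192.0.2.1"

    healthy = await codns_lookup("example.org", fast_local, peer)
    failing = await codns_lookup("example.org", slow_local, peer)
    return healthy, failing


healthy, failing = asyncio.run(demo())
print(healthy)   # ('local', '192.0.2.1')
print(failing)   # ('peer', '192.0.2.1')
```

When the local server is healthy, no remote traffic is generated at all, which is what makes the scheme behave like insurance rather than a replacement resolver.
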
Design Issues
Proximity / liveness
  Select nearby peers
  Monitors nameserver’s health as well
Request locality
  Pick same peer for same names
  Highest Random Weight (HRW)
Remote request timeout
  Dynamically adjusted to local server’s health
  Exponentially backed off for each remote query

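Request locality via Highest Random Weight can be illustrated as follows. This is a sketch under assumptions: the SHA-1 based weight function, the peer names, and the helper names `hrw_weight` and `pick_peers` are chosen for illustration and are not taken from CoDNS. What it shows is the general HRW property: each name is sent to the live peer with the highest hash weight, so the same name keeps landing on the same peer and benefits from that peer's cache.

```python
import hashlib


def hrw_weight(peer, name):
    """Deterministic pseudo-random weight for a (peer, name) pair."""
    digest = hashlib.sha1(f"{peer}|{name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_peers(name, live_peers, count=1):
    """Return the `count` live peers with the highest weight for this name."""
    ranked = sorted(live_peers, key=lambda p: hrw_weight(p, name), reverse=True)
    return ranked[:count]


# Hypothetical peer names, for illustration only.
peers = ["peer-a.example", "peer-b.example", "peer-c.example", "peer-d.example"]
first = pick_peers("www.example.org", peers)[0]

# The same name and peer set always map to the same peer, in any input order.
assert pick_peers("www.example.org", list(reversed(peers)))[0] == first

# Dropping some other peer does not change the choice for this name.
others = [p for p in peers if p != first]
survivors = [first] + others[1:]
assert pick_peers("www.example.org", survivors)[0] == first
```

If the chosen peer itself goes down, only the names it was serving move, and they move to that name's runner-up peer.
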
How many peers needed?
[Figure: Average Response Time vs. Number of extra peers (0, 1, 2, 4, 8, 16, 32, 64); series: 900ms, 600ms, 200ms, 0ms]
One extra peer halves avg response time!

Effect of Timeout
[Figure: Average Number of Lookups vs. Initial Timeout for Remote Query (ms), 100 to 1000; series: 8 peers, 4 peers, 2 peers, 1 peer]
200ms - slope changes
500ms - virtually flat

Deployment Status
CoDNS deployed on all PlanetLab nodes
  Running 24/7 since August 2003
CoDeeN uses CoDNS as primary DNS
  After CoDeeN’s own DNS cache
Remote query configuration
  One extra peer, 200ms starting timeout
  On total LDNS failure, send immediately
  Monitor 10 nodes as neighbors

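The remote query timing can be sketched as a schedule of send times. This is an illustration under assumptions, not the deployed logic: the function name `remote_query_schedule` and its parameters are hypothetical, and the doubling factor is one plausible reading of "exponentially backed off". The 200 ms starting timeout and the rule of sending immediately on total LDNS failure come from the slides.

```python
def remote_query_schedule(initial_timeout_ms=200, max_peers=1, ldns_dead=False):
    """Times (ms after the local query) at which remote queries would be sent.

    Assumption: the wait doubles before each further remote query, and a
    completely failed local nameserver means no initial wait at all.
    """
    send_times = []
    elapsed = 0 if ldns_dead else initial_timeout_ms
    wait = initial_timeout_ms
    for _ in range(max_peers):
        send_times.append(elapsed)
        wait *= 2              # back off before contacting the next peer
        elapsed += wait
    return send_times


print(remote_query_schedule())                              # [200]
print(remote_query_schedule(max_peers=3))                   # [200, 600, 1400]
print(remote_query_schedule(max_peers=3, ldns_dead=True))   # [0, 400, 1200]
```

With the deployed setting of one extra peer, only the first entry matters: a single remote query 200 ms after the local one, or immediately if the local nameserver is known to be down.
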
Evaluation
Live traffic for one week for CoDeeN (20k - 30k)
[Figure: Average Response Time (ms) vs. Nodes Sorted by LDNS Response Time; series: LDNS, CoDNS]

Finer-grained View
Live traffic for one day
Effectively flattens the spikes
[Figure: series LDNS and CoDNS; annotation: Cache miss + WAN problem]

Availability
Adds one ‘9’, from 99% to 99.9%
[Figure: Availability (%) vs. Nodes Sorted By LDNS Availability; series: CoDNS, LDNS]

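What "adds one ‘9’" means numerically can be worked out in a few lines. This sketch assumes availability is read as the fraction of lookups that succeed; the helper names `nines` and `failures_per_10k` are hypothetical.

```python
import math


def nines(availability):
    """Number of nines in an availability figure, e.g. 0.999 -> 3.0."""
    return -math.log10(1.0 - availability)


def failures_per_10k(availability):
    """Expected failed lookups out of 10,000, rounded to a whole number."""
    return round((1.0 - availability) * 10_000)


for a in (0.99, 0.999):
    print(f"{a:.1%}: {nines(a):.1f} nines, "
          f"{failures_per_10k(a)} failed lookups per 10,000")
# 99.0%: 2.0 nines, 100 failed lookups per 10,000
# 99.9%: 3.0 nines, 10 failed lookups per 10,000
```

Each added nine therefore cuts the number of failed lookups by a factor of ten.
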
What About CDNs?
CDN uses DNS to pick “best” replica
CoDNS used only when LDNS failing
  Pro: faster lookup time
  Con: maybe worse/farther replica
In reality, peer’s answer is better 30% of the time

CDN Pro/Con Measurements
[Figure: Latency Difference (ms), log scale from 1 to 100000, vs. Nodes Sorted by DNS Time Difference; series: DNS Gain, Download Penalty]

Overhead
Heartbeat packet: 1/sec, Memory: 600KB
Remote queries: median 25% more lookups
[Figure: Percentage vs. Nodes Sorted by Number of Remote Queries; series: Sent, Answered, Win]

CoDNS Alternatives
In the paper:
  Private Nameservers
  Secondary Nameservers
  TCP Queries

Conclusion
Local failures relatively frequent
Failure time dominates latency
CoDNS provides low-cost “insurance” service
  Masks local failures
  Reduces avg response time 27-82%
  Improves availability by additional ‘9’
  Incrementally deployable, no server change

More Information
CoDNS homepage: http://codeen.cs.princeton.edu/codns/
Email: princeton_codeen@slices.planet-lab.org

TCP Queries
DNS supports TCP
  Failure rate is better
  Not used except for AXFR or when answer is big
Simple TCP
  UDP: 2 packets vs. simple TCP: 9 packets (3 connection setup + 2 query/response + 4 teardown = 9)
  ACK overhead
Persistent TCP
  Resource waste for idle connections
  Vulnerable to overloading/server down

S-TCP, P-TCP, UDP, CoDNS
Replay test (10792 names) on 107 nodes
CoDNS First

CoDNS vs. Persistent TCP
[Figure: Average Response Time (ms) vs. Nodes Sorted by Persistent TCP's Response Time; series: Persistent TCP, CoDNS]

Lookup Distribution
Live traffic on a node for one week (20333 queries)
  2043135 ms / 5809265 ms = 35.1%
  100 ms vs. 286 ms per query
Great improvement on W-CDF
  5.5% → 0.06%
  76% → 17.8%

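The per-query figures follow directly from the two totals and the query count. The sketch below just redoes that arithmetic. Reading the smaller total as CoDNS and the larger as LDNS is an assumption based on the direction of the improvement, and the variable names are hypothetical.

```python
queries = 20333
total_small_ms = 2_043_135   # smaller total from the slide (read here as CoDNS)
total_large_ms = 5_809_265   # larger total from the slide (read here as LDNS)

ratio = total_small_ms / total_large_ms
per_query_small = total_small_ms / queries
per_query_large = total_large_ms / queries

print(f"ratio of totals: {ratio:.4f}")
print(f"per query: {per_query_small:.0f} ms vs. {per_query_large:.0f} ms")
# ratio of totals: 0.3517
# per query: 100 ms vs. 286 ms
```

The ratio works out to about 0.3517, so the slide's 35.1% appears to be truncated rather than rounded.
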
Analysis on Wins
80% at first query, 95% at second query
[Figure: Percentage (70 to 105) vs. Nodes Sorted by Win-by-1; series: Win-by-3, Win-by-2, Win-by-1]