DNS Performance and the Effectiveness
of Caching
Jaeyeon Jung, Emil Sit, Hari Balakrishnan, Robert Morris
MIT Laboratory for Computer Science
November 2001
Motivation
✔ Identify the factors that affect client-perceived
DNS performance
– Response latency
– Errors and failure modes of DNS
✔ Evaluate the effectiveness of DNS caching
– How important is it for scalability?
– Unanticipated uses of DNS (e.g. Web server
selection)
– 18% of flows were due to DNS in MCI wide-area
backbone traces
Slide 1
DNS Lookup Sequence
App
2
Root
server
3
1
Local Network
5
Local
Recursive
DNS server
Local cache
Internet
4
.edu
server
Delegation
Stub Resolver
MIT
server
1. Host asks local server for address of www.mit.edu
2. Local DNS server doesn’t know, asks root.
Root refers to a .edu server
3. The .edu server refers to an MIT server
4. The MIT server responds with an address
5. The local server caches response
✔ Query / response [ referral answer]
Slide 2
Terminology
✔ Mapping in the DNS name space
–
–
record : name’s IP address
record : name of DNS server
✔ Caching in DNS
– Time To Live : expiration time set by the originator
of a name
– Negative caching
Slide 3
Questions
✔ What is the ratio of TCP connections to DNS A record
lookups?
✔ What is the number of DNS queries per lookup?
✔ DNS errors
– What percentage of lookups do never get an
answer?
– Performance of retransmission protocol
✔ DNS failures
✔ What is the effect of varying TTLs and degrees of
caching sharing on cache hit rate?
Slide 4
Key Findings
✔ TCP / DNS lookup ratio suggests that the hit rate of
DNS caches inside MIT is between 70% and 80%
✔ 23% of all client lookups in the most recent MIT trace
fail to elicit any answer
✔ 13% of lookups result in an answer that indicates a
failure. Most of these failures indicate
✔ % of TCP connections made to names with low TTL
values increased from 12% to 25% in 2000
✔ Setting all A-record TTL’s to a value as small as 10
minutes is not likely to degrade the scalability of DNS
Slide 5
The Data
✔ Collection Methodology
– Wide-area DNS query/response
– Outgoing TCP connections:
– Anonymized internal addresses
✔ Analysis Methodology
– Sliding window of 60 seconds
R1 a.root−servers.net
NS .edu
t1
Q1 a.root−servers.net
A www.mit.edu
t2 t3
Q2 .edu
t4
Q2 .edu
R2 .edu
NS mit.edu
t5 t6
R3 mit.edu
A www.mit.edu
t7
Q3 mit.edu
latency: t7 −t1
retransmission: 1
referral: 2
t
Slide 6
Traced Network Topology
Subnet #1
Collection
machine
Subnet #2
External network
Router
MIT LCS and AI
Subnet #24
✔
Subnet #3
– 24 internal subnetworks sharing the border router
– Data collected in January and December 2000
✔ Paper has KAIST analysis
Slide 7
Basic Statistics
00/01/03-10
2,530,430
23.5%
64.3%
11.1%
1.0%
6,039,582
4,521,348
4.62
Date
Total lookups
Unanswered
Answered with success
Answered with failure
Zero answer
Total query packets
TCP connections
#TCP : #valid answers
00/12/04-11
4,160,954
22.7%
63.6%
13.1%
0.5%
10,617,796
5,347,003
3.53
Slide 8
Unanswered Lookups
✔ Significant fraction of all DNS packets seen in the
wide-area Internet:
of total
5.5%
13.1%
4.9%
Zero referrals
Non-zero referrals
Loops
, 63%
– 59%
query packets
9.3%
10.3%
3.1%
- Unanswered lookups classified by type -
Slide 9
Retransmissions
100
80
CDF
60
40
mit-jan00 (Answered)
mit-dec00 (Answered)
mit-jan00 (Zero referral)
mit-dec00 (Zero referral)
20
0
0
2
4
6
8
10
12
14
Number of retransmissions
✔ 99.9% of answered lookups have
2 retransmissions
Slide 10
Suboptimal Retransmission Strategy
✔ Overly persistent retransmissions
– Average # of retransmissions:
,5
3.5
– # of retransmissions of worst 5%:
, 12
6
✔ Inappropriate setting for the number of retries or
excessive timeout value
– No retransmissions within 60 seconds:
, 19%
12%
Slide 11
Failures
7.68%
3.24%
11.14%
1.96%
✔ Inverse lookups for IP addresses with no inverse
mapping:
✔ Invalid queries :
✔ Non-existent top-level domains:
,
➔ Large number of distinct names makes negative caching
ineffective
Slide 12
DNS Caching
✔ How useful is it to share DNS caches among
many client machines?
➔ Locality of references among clients
✔ What is the likely impact of choice of TTL on
caching effectiveness?
➔ Locality of references in time
Slide 13
Name Popularity
1
Fraction of requests
0.8
0.6
0.4
0.2
mit-jan00
mit-dec00
kaist-may01
0
0
0.2
0.4
0.6
0.8
1
Fraction of query names
✔ The top 10% account for more than 68% of total
lookups
✔ A long tail : 9.0% are unique names
Slide 14
TTL Distribution
100
1 day
80
60
CDF
1 hour
40
15 minutes
20
5 minutes
0
1
10
100
mit-jan00
mit-dec00
kaist-may01
1000
10000
100000
TTL (sec)
✔ The fraction of accesses to short TTLs has doubled
➔ Increased deployment of DNS-based server
selection
Slide 15
Trace-driven Simulation
✔ TCP connection workload
✔ Database containing name/IP/TTL bindings
✔ Algorithm
– Randomly divide TCP clients into groups of size s
– For each new TCP connection, determine the group
g and look for a name n in the cache of group g
– If n exists and the cached TTL has not expired,
record a hit. Otherwise record a miss
Slide 16
Effect of Sharing on Hit rates
100
Hit rate (%)
80
60
40
20
mit-jan00
mit-dec00
kaist-may01
0
10
20
30
40 50 60
Group size
70
80
90
100
✔ Most of the benefit of sharing is obtained with as few as
10 or 20 clients per cache
Slide 17
Impact of TTL on Hit rates
100
Hit rate (%)
80
60
40
20
group size = 1
group size = 25
group size = 1233
0
0
500 1000 1500 2000 2500 3000 3500 4000 4500 5000
TTL (sec)
✔ Most of the benefit of caching is achieved with TTLs
less than about 1000 seconds.
✔ 5-min TTLs would increase DNS traffic by factor of 1.5
✔
record caching is critical
Slide 18
Conclusions
✔ About a quarter of all DNS lookups never get an
answer, which corresponds over 50% of DNS packets in
the wide-area Internet
✔ The DNS retransmission protocol appears to be overly
persistent, but in 10%-12% of cases, no retransmissions
occur.
✔ Setting all A-record TTL’s to a value as small as 10
minutes is not likely to degrade the scalability of DNS
in any noticeable way.
✔ The cacheability of
records enhances scalability by
reducing load on the root and top-level name servers
Slide 19