DNS Performance and the Effectiveness of Caching
Jaeyeon Jung, Emil Sit, Hari Balakrishnan, Robert Morris
MIT Laboratory for Computer Science
November 2001

Motivation
✔ Identify the factors that affect client-perceived DNS performance
  – Response latency
  – Errors and failure modes of DNS
✔ Evaluate the effectiveness of DNS caching
  – How important is it for scalability?
  – Unanticipated uses of DNS (e.g. Web server selection)
  – 18% of flows were due to DNS in MCI wide-area backbone traces
Slide 1

DNS Lookup Sequence
[Figure: lookup path with numbered arrows 1 to 5. Labels: App, Stub Resolver, Local Network, Local Recursive DNS server, Local cache, Internet, Root server, .edu server, MIT server, Delegation.]
1. Host asks local server for address of www.mit.edu
2. Local DNS server doesn’t know, asks root. Root refers to a .edu server
3. The .edu server refers to an MIT server
4. The MIT server responds with an address
5. The local server caches response
✔ Query / response [referral or answer]
Slide 2

Terminology
✔ Mapping in the DNS name space
  – A record: name’s IP address
  – NS record: name of DNS server
✔ Caching in DNS
  – Time To Live: expiration time set by the originator of a name
  – Negative caching
Slide 3

Questions
✔ What is the ratio of TCP connections to DNS A record lookups?
✔ What is the number of DNS queries per lookup?
✔ DNS errors
  – What percentage of lookups never get an answer?
  – Performance of retransmission protocol
✔ DNS failures
✔ What is the effect of varying TTLs and degrees of cache sharing on cache hit rate?
Slide 4

Key Findings
✔ TCP / DNS lookup ratio suggests that the hit rate of DNS caches inside MIT is between 70% and 80%
✔ 23% of all client lookups in the most recent MIT trace fail to elicit any answer
✔ 13% of lookups result in an answer that indicates a failure.
Most of these failures indicate
✔ The percentage of TCP connections made to names with low TTL values increased from 12% to 25% in 2000
✔ Setting all A-record TTLs to a value as small as 10 minutes is not likely to degrade the scalability of DNS
Slide 5

The Data
✔ Collection Methodology
  – Wide-area DNS query/response
  – Outgoing TCP connections:
  – Anonymized internal addresses
✔ Analysis Methodology
  – Sliding window of 60 seconds
[Figure: timeline of one lookup with timestamps t1 through t7. Q1 to a.root-servers.net (A www.mit.edu), R1 from a.root-servers.net (NS .edu), Q2 to .edu sent twice, R2 from .edu (NS mit.edu), Q3 to mit.edu, R3 from mit.edu (A www.mit.edu). Latency: t7 − t1; retransmission: 1; referral: 2.]
Slide 6

Traced Network Topology
[Figure: Subnet #1, Subnet #2, Subnet #3, ..., Subnet #24 connect through a router to the external network; a collection machine is shown.]
✔ MIT LCS and AI
  – 24 internal subnetworks sharing the border router
  – Data collected in January and December 2000
✔ Paper has KAIST analysis
Slide 7

Basic Statistics
Date                     00/01/03-10    00/12/04-11
Total lookups              2,530,430      4,160,954
Unanswered                     23.5%          22.7%
Answered with success          64.3%          63.6%
Answered with failure          11.1%          13.1%
Zero answer                     1.0%           0.5%
Total query packets        6,039,582     10,617,796
TCP connections            4,521,348      5,347,003
#TCP : #valid answers           4.62           3.53
Slide 8

Unanswered Lookups
✔ Significant fraction of all DNS packets seen in the wide-area Internet: 63% and 59% of total query packets
                         00/01/03-10    00/12/04-11
Zero referrals                  5.5%           9.3%
Non-zero referrals             13.1%          10.3%
Loops                           4.9%           3.1%
- Unanswered lookups classified by type -
Slide 9

Retransmissions
[Figure: CDF (%) of the number of retransmissions (0 to 14) for mit-jan00 and mit-dec00, with separate curves for answered and zero-referral lookups.]
✔ 99.9% of answered lookups have at most 2 retransmissions
Slide 10

Suboptimal Retransmission Strategy
✔ Overly persistent retransmissions
  – Average # of retransmissions: 5 and 3.5
  – # of retransmissions of worst 5%: 12 and 6
✔ Inappropriate setting for the number of retries or excessive timeout value
  – No retransmissions within 60 seconds: 19% and 12%
Slide 11

Failures
7.68% 3.24%
11.14% 1.96%
✔ Inverse lookups for IP addresses with no inverse mapping:
✔ Invalid queries:
✔ Non-existent top-level domains:
➔ Large number of distinct names makes negative caching ineffective
Slide 12

DNS Caching
✔ How useful is it to share DNS caches among many client machines?
➔ Locality of references among clients
✔ What is the likely impact of choice of TTL on caching effectiveness?
➔ Locality of references in time
Slide 13

Name Popularity
[Figure: cumulative fraction of requests (0 to 1) versus fraction of query names (0 to 1) for mit-jan00, mit-dec00, and kaist-may01.]
✔ The top 10% account for more than 68% of total lookups
✔ A long tail: 9.0% are unique names
Slide 14

TTL Distribution
[Figure: CDF (%) of TTL in seconds (log scale, 1 to 100000) for mit-jan00, mit-dec00, and kaist-may01, with 5 minutes, 15 minutes, 1 hour, and 1 day marked.]
✔ The fraction of accesses to short TTLs has doubled
➔ Increased deployment of DNS-based server selection
Slide 15

Trace-driven Simulation
✔ TCP connection workload
✔ Database containing name/IP/TTL bindings
✔ Algorithm
  – Randomly divide TCP clients into groups of size s
  – For each new TCP connection, determine the group g and look for a name n in the cache of group g
  – If n exists and the cached TTL has not expired, record a hit. Otherwise record a miss
Slide 16

Effect of Sharing on Hit rates
[Figure: hit rate (%) versus group size (up to 100) for mit-jan00, mit-dec00, and kaist-may01.]
✔ Most of the benefit of sharing is obtained with as few as 10 or 20 clients per cache
Slide 17

Impact of TTL on Hit rates
[Figure: hit rate (%) versus TTL (0 to 5000 sec) for group size = 1, group size = 25, and group size = 1233.]
✔ Most of the benefit of caching is achieved with TTLs less than about 1000 seconds.
✔ 5-min TTLs would increase DNS traffic by a factor of 1.5
✔ NS record caching is critical
Slide 18

Conclusions
✔ About a quarter of all DNS lookups never get an answer, which corresponds to over 50% of DNS packets in the wide-area Internet
✔ The DNS retransmission protocol appears to be overly persistent, but in 10%-12% of cases, no retransmissions occur.
✔ Setting all A-record TTLs to a value as small as 10 minutes is not likely to degrade the scalability of DNS in any noticeable way.
✔ The cacheability of NS records enhances scalability by reducing load on the root and top-level name servers
Slide 19
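
The trace-driven simulation algorithm on Slide 16 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulator: the function name `simulate_hit_rate`, the `(timestamp, client, name)` record layout, the `fixed_ttl` option, and the toy trace are all assumptions made for the example. It also assumes each TCP connection has already been mapped to a DNS name, and that a miss inserts the name into the group's cache with its TTL.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (timestamp in seconds, client id, destination name).
Connection = Tuple[float, str, str]


def simulate_hit_rate(
    connections: List[Connection],
    ttl_db: Dict[str, float],
    group_size: int,
    fixed_ttl: Optional[float] = None,
    seed: int = 0,
) -> float:
    """Return the cache hit rate when clients share caches in groups of group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")

    # Randomly divide the TCP clients into groups of size s.
    clients = sorted({client for _, client, _ in connections})
    random.Random(seed).shuffle(clients)
    group_of = {client: i // group_size for i, client in enumerate(clients)}

    # One logical cache per group: (group, name) -> time at which the entry expires.
    expiry: Dict[Tuple[int, str], float] = {}
    hits = 0
    for timestamp, client, name in sorted(connections):
        key = (group_of[client], name)
        if key in expiry and timestamp < expiry[key]:
            hits += 1  # name is cached and its TTL has not expired
        else:
            # Miss: (re)insert the name. Assumption: fixed_ttl, when given,
            # overrides the per-name TTL, to mimic the "vary the TTL" experiment.
            ttl = fixed_ttl if fixed_ttl is not None else ttl_db[name]
            expiry[key] = timestamp + ttl
    return hits / len(connections) if connections else 0.0


# Toy data (invented for illustration only).
toy_trace: List[Connection] = [
    (0.0, "a", "www.mit.edu"),
    (10.0, "b", "www.mit.edu"),
    (400.0, "a", "www.mit.edu"),
]
toy_ttls = {"www.mit.edu": 300.0}

print(simulate_hit_rate(toy_trace, toy_ttls, group_size=1))  # no sharing
print(simulate_hit_rate(toy_trace, toy_ttls, group_size=2))  # both clients share a cache
```

Sweeping `group_size` corresponds to the sharing experiment on Slide 17, and sweeping `fixed_ttl` to the TTL experiment on Slide 18. In the toy trace, sharing lets client b reuse the entry fetched by client a, while the lookup at 400 seconds misses because the 300-second TTL has expired.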