Chapter 10

DNS-Based Botnet Detection

Introduction

This chapter discusses the detection of bots and botnets using the Domain Name System (DNS).[1] The first section below provides some essential background on aspects of the DNS protocol relevant to botnet detection. The subsequent section discusses how to design botnet detection heuristics using DNS, and presents selected case studies and tools.

Background

The Domain Name System (DNS) is a world-wide distributed database that stores information about named Internet resources. Although DNS holds many types of information about domains (for example, mappings between IP addresses and name servers, mail servers, and canonical names), only a few details are relevant to botnet detection. Here, we present a simplified overview of DNS, with a focus on recursive queries for A and CNAME records. Readers looking for a more detailed treatment of DNS are directed to RFCs 1034 and 1035, and to the numerous books on DNS.[2]

DNS Overview

As Don Marti once observed, "DNS is a consensus reality".[3] The mapping between any particular domain name and IP address depends on which server is queried, when, and whether that server performs caching, forwarding, or is an authority server. Network caches and application-layer caching (either through an OS stub resolver or a user application) can also affect mappings. Because of local variations in caching behavior, it is entirely likely that different hosts will receive different answer sets for the same domain. The DNS infrastructure, and resolvers such as BIND,[4] seek to minimize this; however, botmasters have crafted their networks to leverage this potential.

[1] See P. Mockapetris, "RFC 1034: Domain names - concepts and facilities", Nov. 1987, http://www.faqs.org/rfcs/rfc1034.html, and P. Mockapetris, "RFC 1035: Domain names - implementation and specification", Nov. 1987, http://www.faqs.org/rfcs/rfc1035.html
[2] Cricket Liu & Paul Albitz, "DNS and BIND", 5th Ed., O'Reilly, 2006. Ron Aitchison, "Pro DNS and BIND", Apress, New York, 2005.
[3] Don Marti, "[linux-elitists] ICANN frenzy!", March 2001, http://zgp.org/pipermail/linux-elitists/2001-March/001716.html
[4] Internet Systems Consortium, "Berkeley Internet Name Domain (BIND)", 2006, http://www.isc.org/index.pl?/sw/bind/

Figure 10.1: Typical propagation of a botnet, and resulting DNS usage.

To illustrate key principles of DNS relevant to botnet detection, let's consider the following scenario. A botmaster releases a virus, which spreads randomly. The virus forces victims to "rally" by joining a command-and-control (C&C) service, hosted at the domain evil.example.com.[5] From there, the botmaster may make use of the victims for other purposes, e.g., spamming, phishing, identity theft, or DDoS attacks. Figure 10.1 illustrates the propagation of the malware, written by the botmaster, and designated as "VX" in the diagram. Each victim in turn infects others, creating a victim cloud. Since the virus also forces victims to contact the C&C server (perhaps an IRC server, a web server, or a P2P network), infected hosts must perform DNS lookups of evil.example.com. In Figure 10.1, this is depicted for a single victim, with a dashed line showing an A-record query. The botmaster, who owns or has license to use the domain, controls the DNS resolution at the authority server. Thus, if the C&C service is taken down, the botmaster merely has to update the DNS mapping to point the domain at a new C&C server. Likewise, if network administrators block access to the IP address of the C&C site, the botmaster merely has to migrate or renumber the C&C's IP address.
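To make the rally step concrete, the short sketch below performs the same A-record lookup a newly infected host would make before connecting to its C&C. It is only an illustration: it assumes the third-party dnspython package, and it uses the chapter's fictitious evil.example.com, which (being reserved for documentation) will not actually resolve.

import dns.exception
import dns.resolver   # third-party dnspython package (assumed)

C2_DOMAIN = "evil.example.com"   # the chapter's fictitious C&C domain; a real lookup will fail

try:
    answer = dns.resolver.resolve(C2_DOMAIN, "A")
except dns.exception.DNSException:
    print(f"{C2_DOMAIN} does not resolve (expected for a reserved example domain)")
else:
    for rr in answer:
        # A bot would now open its TCP connection to rr.address; if the C&C is taken
        # down, the botmaster simply publishes a new A record at the authority server.
        print(f"{C2_DOMAIN} -> {rr.address} (cacheable for up to {answer.rrset.ttl} seconds)")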
Figure 10.1 therefore represents the general pattern of infection seen in many botnets. (There are of course variations that use a different cycle of infection.)

Note that Figure 10.1 shows only a simplified view of the DNS traffic. During the growth of a botnet, there are several distinct phases to the botnet's DNS traffic. Figure 10.2 shows a more detailed view of DNS resolution, while still omitting several possible scenarios. First, the host performing a lookup may consult the stub resolver, which may have a local cache. (This scenario is discussed in detail below.) If we presume the host cache does not contain the mapping for evil.example.com, the host then sends a DNS request (here, we presume an A-record query) to a recursive server. (Again, we omit the possibility of an iterative, or non-recursive, query.)

[5] The example.com, example.net and example.org domains are reserved under RFC 2606 for use in documentation. In this chapter, we'll use the fictitious third-level domain evil.example.com as an example of a domain associated with botnet activity. See D. Eastlake & A. Panitz, "RFC 2606: Reserved Top Level DNS Names", June 1999, http://www.faqs.org/rfcs/rfc2606.html

Caching resolvers perform lookups, and store the results for a prescribed period of time, the TTL period.[6] If we presume the caching resolver does not have a cached answer (and further presume that, at the time the host's request arrives, it has only cached the addresses of the root servers), then the caching server sends a request to the root servers.[7] Since the domain (evil.example.com) is not part of the root zone ("."), and the "com." zone has been delegated to another DNS server, the root servers cannot reply with an answer, and instead give the address of other name servers: the "com." TLD servers.[8] The recursive server then sends a query to the TLD servers. Since the query is for a host in the example.com zone, which has been further delegated, the TLD server returns the address of the example.com name server. The example.com server is, in this hypothetical example, the start of authority (SOA) for the zone, and provides the requested record to the recursive server. This answer is then sent in reply to the host's request, and cached by both the recursive server and the stub resolver. Additionally, all of the intermediary results (e.g., the address mappings of the TLD and SOA servers) are cached as well.

One can also examine many of these steps by using the dig utility, executed in trace mode. For example, the command dig maps.google.com +trace will have dig resolve the name iteratively, starting at the root, printing all the intermediate lookups. The trace will show the steps executed in finding the "com." servers, finding the google.com zone servers, and then ultimately locating the appropriate A-record or CNAME response.

[6] The TTL period of a DNS cache is different from the hop-count lifetime or TTL field found in routing. The TTL period is prescribed by the authority name server for the zone. Caching servers generally, but not always, follow the recommended caching time.
[7] The alert reader might spot the possible chicken-and-egg problem in this setup. A freshly booted caching server consults a root zone hints file, often a static file distributed with a DNS server, to learn the addresses of the root servers. One can obtain a copy of the hints file at http://www.internic.net/zones/named.root
[8] Separately, one can inspect the root's zone file by sending an AXFR request to a root server. For example, the command "dig @f.root-servers.net . axfr" will list all zone entries at the root. BIND 8 shipped with a useful script, $BIND8/contrib/misc/normalize_zone.pl, to format this output; however, this was removed in BIND 9.
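The following sketch performs the same walk that dig +trace displays, following delegations from a root server down to the zone's authority. It is a simplification under the same assumptions as Figure 10.2: it uses the third-party dnspython package, starts from a single hard-coded root server (a.root-servers.net), and ignores CNAME chasing, truncation, and the other complications noted below.

import dns.message
import dns.query
import dns.rdatatype
import dns.resolver   # all from the third-party dnspython package (assumed)

def iterative_resolve(qname, server="198.41.0.4"):   # 198.41.0.4 is a.root-servers.net
    """Follow referrals from the root until an answer (or a dead end) is reached."""
    while True:
        response = dns.query.udp(dns.message.make_query(qname, dns.rdatatype.A), server, timeout=5)
        if response.answer:                           # the zone's authority answered
            return response.answer
        # Otherwise we received a referral: pick a name server from the authority section.
        ns_rrsets = [r for r in response.authority if r.rdtype == dns.rdatatype.NS]
        if not ns_rrsets:
            return None
        ns_name = ns_rrsets[0][0].target.to_text()
        print(f"referred to {ns_name} for {qname}")
        # Use a glue A record from the additional section if present; else resolve the NS name.
        glue = [r for r in response.additional if r.rdtype == dns.rdatatype.A]
        server = glue[0][0].address if glue else dns.resolver.resolve(ns_name, "A")[0].address

if __name__ == "__main__":
    for rrset in iterative_resolve("maps.google.com") or []:
        print(rrset)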
Figure 10.3: Distributed victim networks, some in diurnal low phases, and the impact of their recursive servers on authority DNS servers.

Interacting with dig will further show how Figure 10.2 greatly simplifies matters. The diagram does not consider EDNS0[9] responses, truncation, and other DNS traffic and scenarios that routinely occur. But this general view is useful for understanding how world-wide epidemics of infections drive DNS resolution patterns, and affect cache refreshes at different servers.

Consider Figure 10.3, which shows the world-wide spread of our hypothetical botnet to various different networks. Each network has a different recursive resolver (depicted in Figure 10.3 on the edge of each network cloud). Victims within each network drive patterns of lookups directed at the caching resolver. When an entry in the caching resolver's local cache expires, the network's DNS server will, eventually, consult the start of authority (SOA) for a given domain. Note that many of the victims are located in different time zones, and might therefore generate less activity at night.[10] Since botnets usually have victims scattered around the world, recursive timeouts and authority refresh lookups arrive in rolling waves, depending on the number of victims in local areas. This observation has resulted in models that describe botnet growth patterns based on time zones.[11]

Stub Caching

Most operating systems provide a minimal DNS resolution service for use by applications. In most cases, both negative (i.e., NXDOMAIN)[12] and positive results are stored. For example, on Windows, the dnsrslvr.dll and dnsapi.dll libraries are used by most applications to resolve domain names.

[9] Paul Vixie, "Extension Mechanisms for DNS (EDNS0)", Aug. 1999, http://www.faqs.org/rfcs/rfc2671.html
[10] In many countries, electricity costs and local customs are such that machines are powered down at night. Upon reboot, the victims require new DNS resolutions, and refresh the local recursive server's cache entries.
[11] See David Dagon, Cliff Zou & Wenke Lee, "Modeling Botnet Propagation Using Time Zones", in "Proceedings of the 13th Annual Network and Distributed System Security Symposium", 2006.
[12] M. Andrews, "Negative Caching of DNS Queries (DNS NCACHE)", Mar. 1998, http://www.faqs.org/rfcs/rfc2308.html

Previously resolved domains (both successful and unsuccessful lookups) are stored by the host OS. This improves performance, since the host does not need to use the network to look up recently resolved domains. In some cases, particular applications operate their own DNS cache on top of the host's stub resolver. Most prominently, Microsoft Internet Explorer used to cache domains[13] for 24 hours (in IE 3.x), and more recently does so for 30 minutes (in IE 4.x, 5.x and 7.x).[14] Similarly, Firefox and Mozilla-based browsers cached DNS answers for 15 minutes (and more recently, since 2004, for 1 minute), regardless of the TTL value.

In many cases, researchers may need to control the caching behavior of the stub resolver and user applications. On Windows, this is typically done by running ipconfig /flushdns, after using ipconfig /displaydns to confirm the local DNS cache contents. On Mac OS X, one can simply use lookupd -flushcache. Other Unix systems, by default, do not cache DNS answers obtained by gethostbyname(3), gethostbyname_r(3), or the other <netdb.h> functions. A restart generally flushes the various caching utilities and daemons, e.g., nscd or named.
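As a convenience, these commands can be wrapped in a small helper script. The sketch below simply shells out to the commands named above; the exact commands vary by OS and version (for example, later Mac OS X releases replaced lookupd with other tools), so treat the command table as an assumption to adjust for your environment.

import platform
import subprocess

# Commands taken from the discussion above; adjust for your OS and version.
FLUSH_COMMANDS = {
    "Windows": [["ipconfig", "/flushdns"]],
    "Darwin":  [["lookupd", "-flushcache"]],          # older Mac OS X releases
    "Linux":   [["/etc/init.d/nscd", "restart"]],     # assumption: nscd installed as an init script
}

def flush_stub_cache():
    """Best-effort flush of the local stub resolver cache."""
    for cmd in FLUSH_COMMANDS.get(platform.system(), []):
        try:
            subprocess.run(cmd, check=True)
            print("ran:", " ".join(cmd))
        except (OSError, subprocess.CalledProcessError) as err:
            print("could not run", " ".join(cmd), "-", err)

if __name__ == "__main__":
    flush_stub_cache()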
Windows provides a variety of registry keys to control host-based DNS caching, including those for IE.[15] Table 10.1 provides a listing of many relevant registry keys that affect stub resolver behavior. For Firefox and Mozilla-class browsers, one merely browses to about:config, selects "New > Integer", and creates a property called network.dnsCacheExpiration with an integer value of 0 (zero).[16]

Researchers are not alone in their need to occasionally flush local DNS caches. Most bots include a primitive capability to flush stub DNS caches, either through a forked execution of "ipconfig /flushdns", or by using the DnsFlushResolverCache* functions in the dnsapi.dll library. Botmasters often flush the stub resolver's cache to increase their bots' network agility. A typical implementation appears in Code Listing 1, sampled from a common rBot source tree. This particular code block originated in the rBot family, but is now common to hundreds of bots. It essentially locates the address of the DnsFlushResolverCache() family of functions, and uses them to clear the cache. This side-steps the need to adjust the registry, and avoids forking a secondary process to invoke ipconfig. In many cases, this ability to flush the stub resolver's cache is exposed in the bot's instruction API. Thus, with a single command, a botmaster can remove any stale DNS entries in the victim hosts' stubs.

[13] Note that the browser's DNS cache is completely different from the browser's local cache of a website's content.
[14] Microsoft, Inc., "How Internet Explorer uses the cache for DNS host entries", Nov. 2004, http://support.microsoft.com/kb/263558 Internet Explorer 6.x only cached DNS CNAME responses, not A-records. Prior to XP SP2, if a resolution was pending, each click by a user would add another 2 minutes to the TTL period of a cached record, potentially creating an infinite cache.
[15] Microsoft, Inc., "How to Disable Client-Side DNS Caching in Windows XP and Windows Server 2003", Dec. 2005, http://support.microsoft.com/default.aspx?scid=kb\%3Ben-us\%3B318803
[16] Gordon Sheridan, "Network.dnsCacheExpiration", June 2001, http://kb.mozillazine.org/Network.dnsCacheExpiration One can similarly add user_pref("network.dnsCacheExpiration", 0); to the user's prefs.js file.

Caching Resolvers

With few exceptions, recursive servers provide DNS services to local networks. With the exception of open recursive servers,[17] the clients populating a recursive server's cache lines should tend to be those found within a local network (e.g., clients who obtain a DHCP lease, and have their /etc/resolv.conf settings provided by the DHCP daemon). Thus, it is often the case that recursive caches reflect the resolution behavior of the local user population.

[17] John Kristof, "DNS - Open Recursive Name Server Probing", 2006, http://condor.depaul.edu/~jkristof/orns/
Table 10.1 DNS Cache Registry Settings

DWORD:   MaxNegativeCacheTtl
Value:   0 (default 900 sec; 15 min.)
Comment: Time an NXDOMAIN response is cached; 0 eliminates negative caching.

DWORD:   MaxCacheEntryTtlLimit
Value:   0 (default 86400 sec; 1 day)
Comment: Maximum DNS cache time.

DWORD:   NetFailureCacheTime
Value:   0 (default 30 sec)
Comment: How long the DNS client stops sending queries when the network is down.

DWORD:   NegativeSOACacheTime
Value:   0 (default 120 sec)
Comment: The time an NXDOMAIN response from an SOA is cached.

Settings for the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters, and how they affect local caching behavior. Researchers may need to adjust these keys when investigating DNS problems, or when configuring honeypots. Note that many bots also adjust these settings from their defaults to improve bot performance.

Code Listing 1: DNS cache flushing in rBot

/* loaddlls.cpp */
// Dynamically load dnsapi.dll and locate the cache-flushing functions.
// (DFRC and DFRCEA are function-pointer typedefs declared elsewhere in the bot source.)
HMODULE dnsapi_dll = LoadLibrary("dnsapi.dll");
if (dnsapi_dll) {
    fDnsFlushResolverCache = (DFRC)GetProcAddress(dnsapi_dll, "DnsFlushResolverCache");
    fDnsFlushResolverCacheEntry_A = (DFRCEA)GetProcAddress(dnsapi_dll, "DnsFlushResolverCacheEntry_A");
    if (!fDnsFlushResolverCache || !fDnsFlushResolverCacheEntry_A)
        nodnsapi = TRUE;
} else {
    nodnsapierr = GetLastError();
    nodnsapi = TRUE;
}
//...

/* netutils.cpp */
// Called when the bot needs to clear stale C&C mappings (exposed via the bot's command interface).
BOOL FlushDNSCache(void)
{
    BOOL bRet = FALSE;
    if (fDnsFlushResolverCache)
        bRet = fDnsFlushResolverCache();
    return (bRet);
}

Figure 10.4: Botnet broken down by country of origin, over time. A clear diurnal pattern appears.

As noted above, botnets tend to have diverse victim populations, spread across different time zones. Figure 10.4 illustrates this point. Victims from a 350,000-member botnet were tracked for activity (e.g., connections to the C&C server), plotted along the Y-axis as SYN rates per one-minute epoch. Since the time line covers several days, a clear diurnal pattern appears. Closer inspection of a few countries shows that those in different time zones are phase shifted by an amount appropriate to their difference in time zones. (Compare, for example, England, in the GMT+0, or Zulu, time zone, with countries in Eastern Europe.) This illustrates how botnet activity is structured in waves of connections, divided by time zones. Because clients are in diurnal low phases at different times, depending on their country of origin, the DNS activity of these hosts varies similarly. Unless cached by the stub resolver or programmed to do otherwise, bots will have to preface their TCP connections to the C&C server with a (network) DNS lookup. In the context of botnets, this means recursive servers are less likely to refresh cache entries for malicious domains during off-peak hours. This property will help us design detection heuristics, discussed below.

DNS-Based Botnet Detection

Passive DNS Replication

A weakness in DNS routinely exploited by botnets is its lack of consistency and history. DNS servers merely show the current address of a domain (at least according to the resolver's cache), and not its prior history. While BIND and other DNS tools offer extensive output capabilities, logging is primarily used in debugging, not in production environments. Similarly, zone transfers are usually permitted only between trusted servers and secondaries, and are generally not available to the larger Internet community. Thus, even DNS operators often lack information about what records were cached. In many cases, only the zone maintainer is in a position to know the complete history of a domain's mapping.
Figure 10.5: Conceptual diagram of passive DNS sensor deployment, adapted from Weimer's FIRST paper. A sensor witnesses all mappings for a command-and-control server.

Botmasters are keenly aware of this, and routinely move C&C locations. With a large number of C&C servers, botmasters minimize the chance that the network can be disrupted through simple remediation. In some cases, botmasters turn botnets "on and off" to simulate remediation. For example, a malicious domain's address might be set to a non-routable address (e.g., an RFC 1918 address[18]) for a few hours. When more victims are needed, the domain's address is set back to the C&C server. This gives the impression that the domain has been remediated (and non-routed). The cycling of IP mappings for a domain complicates takedown efforts, and makes remediation (e.g., simple firewall rule creation) far more difficult.

One way to overcome this technique is to consult a passive DNS replication service.[19] Passive DNS replication was created by Florian Weimer, and constructs partial zone files by observing DNS traffic. The intuitive idea is fairly simple: one merely observes DNS traffic, and stores all completed resolutions. Over time, this approximates a zone file (or more precisely, the relevant portions of a zone file that users actually requested). Since the replicated data is placed in a database, one can further query not only the current resolution of a domain, but every address ever seen in an answer set for that domain.

Figure 10.5, adapted from Weimer's paper, shows a conceptual diagram of how passive DNS helps investigators track botnets. In order to reach the command-and-control server, infected hosts perform a DNS lookup of the domain. If we assume a low caching time for the domain, the recursive server for their network eventually contacts the SOA for the domain. (Alternatively, the victim's recursive server could be a forwarder, or otherwise consult another caching server. This is conceptually the same case.) As shown in Figure 10.5, the botmaster has a variety of C&C servers at the ready. Mitigation of one does not stop the botnet, since the botmaster also has the ability to update the DNS entry at the authority server. (This is popularly referred to as the "whack-a-mole" strategy of survival, after the name of a popular carnival amusement game.) A passive DNS sensor, shown in Figure 10.5, is deployed so that it observes the answers provided by the authority server. The sensor stores all answers from the authority, and not just the most current mapping. Migration of the C&C server is therefore observable. One can also use the passive DNS database to study the migration pattern and history of all domains (or at least those domains that users below the recursive server consulted). Legitimate Internet servers have very good reasons to stay put, and change their IP addresses very infrequently. In contrast, fraud-oriented servers, such as botnet C&Cs, phishing sites, and drop sites, have every motivation to change their IP addresses. Passive DNS gives us a window into this behavior, and is designed to track changes in DNS mappings.

[18] See IANA, "Special-Use IPv4 Addresses", Sept. 2002, http://www.faqs.org/rfcs/rfc3330.html, and Y. Rekhter, B. Moskowitz, D. Karrenberg, G. J. de Groot & E. Lear, "Address Allocation for Private Internets", Feb. 1996, http://www.faqs.org/rfcs/rfc1918.html
[19] Florian Weimer, "Passive DNS Replication", Apr. 2005, http://www.enyo.de/fw/software/dnslogger/first2005-paper.pdf
Investigators can therefore use passive DNS to discover the history of a C&C domain. A key variable is the amount of DNS traffic generated by the victims, relative to the tendency of the C&C server to migrate. If the C&C server migrates frequently, one needs enough victims to force a new cache "discovery" of the changed IP. Such sensors are therefore most useful when deployed at locations where sufficiently high volumes of traffic are expected.

The University of Stuttgart provides a web interface to a passive DNS service, at http://cert.uni-stuttgart.de/stats/dns-replication.php Users can make a rate-limited number of queries. When investigating a malicious domain, this web interface lets one query the passive DNS logger database. At first, it may seem incongruous that a DNS trace from a few networks (many in Germany) would provide clues to, say, a security investigation in a New Zealand network, or in some other distant part of the world. But since botnets do not differentiate between their victims, it is very likely that passive DNS sensors on remote networks will provide useful clues to local investigations. Victims in other networks may provide clues useful to your own network.

To further illustrate the utility of passive DNS, consider the following approach, taken in response to a botnet alert on a local network.

• Assume one discovers an infected host, H, attempting to contact a C&C server, called C1.
  o One can remediate host H, but how do you stop other potential victims in the network from reaching the command-and-control site?
  o One can of course block access to C1, assuming the domain has no other legitimate uses. But the botmaster can update the C&C site to a new address, C2. And once you discover the new site, they can move the C&C site to any other address, Ci. How can one discover the other C&C sites, without having your local machines first become victims?
• The local network administrator often has difficult choices:
  o One can obtain the malware binary, and run it in a honeypot to track the botnet. This is difficult, since binaries are often not easily obtained in the early hours of an infection. Further, maintaining this technical capability can be expensive for small networks, and may violate local policies about handling live malware.
  o One can instead leave the infected host H connected to the botnet at host C1, and watch what other C&C machines it later reaches. This approach of course risks local assets, and is usually not an option for networks that have taken the trouble to draft security incident response policies.
• One can instead consult a passive DNS service, and ask what other mappings were seen for the C&C domain. For a given domain, a passive DNS database can report what other IP addresses have been associated with it. A firewall rule can then block access to the servers at C1, C2, ... Ci, even if your local victim H has only contacted the first command-and-control site, C1.

Similarly, the list of associated IPs for a botnet C&C domain may help one expand a local investigation. In effect, victims in remote networks become "canaries in the coal mine", and indicate what other IPs are associated with a botnet outbreak. This lets administrators rapidly deploy comprehensive firewall blocking rules. Further, passive DNS logs help remediation. For example, one can locate other victims in the local network by consulting flow logs, to see who has recently contacted any of the C&C's possible IP addresses. Network administrators battling botnets are encouraged to run passive DNS servers of their own.
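For those who want to experiment, the sketch below shows the core of such a sensor: it watches DNS responses on the wire and records every A-record answer it sees, keyed by domain and address, so that a later query returns the full set of IPs (C1 ... Ci) ever observed for a C&C domain. It is only a minimal illustration, assuming the third-party scapy package, packet-capture privileges, and a capture point that can see the relevant DNS responses; a production sensor such as Weimer's dnslogger handles far more record types and scale.

#!/usr/bin/env python3
"""Minimal passive DNS collector sketch: record every A-record answer seen on the wire."""
import sqlite3
import time

from scapy.all import DNS, DNSRR, sniff   # third-party scapy package (assumed)

db = sqlite3.connect("passive-dns.db")
db.execute("""CREATE TABLE IF NOT EXISTS answers (
                  qname TEXT, rdata TEXT, ttl INTEGER,
                  first_seen REAL, last_seen REAL,
                  PRIMARY KEY (qname, rdata))""")

def record(pkt):
    """Store each A record found in a DNS response's answer section."""
    if not pkt.haslayer(DNS) or pkt[DNS].qr != 1:      # responses only
        return
    now, rr = time.time(), pkt[DNS].an
    while isinstance(rr, DNSRR):                       # walk the chained answer records
        if rr.type == 1:                               # type 1 == A record
            qname = rr.rrname.decode(errors="replace").rstrip(".")
            db.execute("""INSERT INTO answers VALUES (?,?,?,?,?)
                          ON CONFLICT(qname, rdata) DO UPDATE SET last_seen = ?""",
                       (qname, str(rr.rdata), rr.ttl, now, now, now))
        rr = rr.payload
    db.commit()

def history(qname):
    """Every IP ever observed for a domain, e.g., all C&C addresses C1 ... Ci."""
    return list(db.execute(
        "SELECT rdata, first_seen, last_seen FROM answers WHERE qname = ?", (qname,)))

if __name__ == "__main__":
    sniff(filter="udp port 53", prn=record, store=False)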
In most cases, local network privacy rules would not prevent the sharing of information about remote, third-party domains. (Passive DNS only shares what answer came from a DNS server, not the time of the lookup, or the user who requested the information.) And even if privacy rules prevent sharing this information with others, collecting it may still help local investigations.

Heuristics

The preceding section discussed the use of sensor tools to investigate botnets. Using particular aspects of the DNS traffic, one can further design additional detection heuristics. Here, we have a needle-in-the-haystack problem. Many recursive servers handle hundreds of thousands of packets per second, for thousands of different domains. Differentiating the legitimate domains from the botnet C&C domains is a complex research problem. There are far too many domains for a human to review by hand. In the following sections, we discuss monitoring techniques that help one classify domains as either benign or suspect. Note that these factors are not decisive, and would not be suitable for an automated response system. Rather, they are heuristics that help a human expedite a review of suspect domains.

TTL Monitoring

DNS responses optionally have a TTL field, which suggests how long (in seconds) a server and application should cache the answer.[20] Note that caching is itself optional (though recommended), and the time given to a cache is also optional. The field is an unsigned 32-bit field, ranging from 0 (meaning "do not cache") to 2^32 - 1. In some DNS resolvers, the value can be set in a zone file, where each domain has an optional TTL period set by the $TTL directive.[21] Many legitimate services use long caching times, e.g., 86400 seconds or more. This relieves the load on authority servers, and generally provides resolvers with shorter network paths to a cache and faster responses. Lengthy cache times are appropriate and often used for legitimate servers because they seldom change IP addresses. This is not universally true, of course. Some legitimate servers (most famously cnn.com, which tends to use 5-minute cache times) opt for a shorter cache time. This provides them flexibility in handling large spikes of exponentially arriving traffic. But these situations tend to be the minority; most legitimate sites use longer TTL periods.

[20] R. Elz & R. Bush, "Clarifications to the DNS Specification", July 1997, http://www.faqs.org/rfcs/rfc2181.html
[21] M. Andrews, "Negative Caching of DNS Queries (DNS NCACHE)", March 1998, http://www.faqs.org/rfcs/rfc2308.html

As discussed above, botmasters tend to migrate C&C servers, to avoid remediation and frustrate takedown efforts. To maximize the number of victims that migrate to a new C&C server, botmasters tend to favor shorter TTL periods for their domains. This of course is not universally the case, but botnets that use lengthy TTL periods must keep C&C servers up for at least that length of time, or suffer a loss in victim population with each server migration.

Figure 10.6(a): A histogram of 443 sampled botnet C&C TTLs.

Figure 10.6(b): A CDF of botnet C&C TTLs. Note that the majority of the population is under a few hours.

Thus, while not universally the case, there is nonetheless a subclass of botnets that favor low TTL periods. This is demonstrated in Figure 10.6(a), which shows a distribution of botnet C&C TTLs. The sampling started with 443 "active botnets". Botnets already flagged for abuse at the SOA were excluded; these typically had a TTL > 86400. Many of the remaining domains have very short TTL periods for the life of the botnet. Figure 10.6(b) shows the same population in a cumulative distribution function (CDF) graph. In general, a CDF shows what fraction of an overall distribution falls below a particular threshold. In Figure 10.6(b), we see that 50% of the population had a TTL below 2 hours, and 85% of the population used less than 3 hours. With noted exceptions (e.g., Akamai, cnn.com), TTL periods for legitimate sites are often set in days. Short TTL periods are therefore an indication (but not proof) that a domain is suspicious. At the very least, they help one rank domains, so that an analyst is more productive in a manual review.[22]

Caution should be used when focusing exclusively on short TTL values. First, this parameter is easily manipulated by botmasters. (For example, most Dynamic DNS services provide a simple interface to let domain owners adjust TTL values.) As such, botnet detection based solely on TTL values is extremely brittle. Second, it is quite common for legitimate domain owners to shorten TTL values in advance of IP renumberings or server migration. That is, many legitimate domains with long TTL values (e.g., TTL = 86400) will shorten TTLs in advance of network maintenance that results in IP changes, all to minimize disruption to clients.

[22] Machine learning models are also possible, but are beyond the scope of this chapter.

Network administrators observing DNS traffic at their network edge can use short TTLs to prioritize suspicious domains, and assist investigations. Similarly, researchers who encounter domains gathered from honeypots and binary analysis should note short TTL periods, since they are a hallmark of suspicious activities.
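As a small illustration of this ranking idea, the sketch below resolves a candidate list of domains and sorts them so that the shortest-lived answers surface first for analyst review. It assumes the third-party dnspython package, a hypothetical watch list, and a 3-hour threshold taken from the CDF discussion above; note that a recursive resolver returns the remaining cache time, so for the zone's configured TTL one would query its authority server directly.

import dns.exception
import dns.resolver   # third-party dnspython package (assumed)

SHORT_TTL = 3 * 3600   # roughly the 85th-percentile value from Figure 10.6(b)

def rank_by_ttl(domains):
    """Return (ttl, domain) pairs, shortest-lived first, for manual review."""
    ranked = []
    for domain in domains:
        try:
            answer = dns.resolver.resolve(domain, "A")
        except dns.exception.DNSException:
            continue                       # NXDOMAIN, timeout, etc.: skip in this sketch
        ranked.append((answer.rrset.ttl, domain))
    return sorted(ranked)

if __name__ == "__main__":
    watch_list = ["www.example.com", "evil.example.com"]   # hypothetical candidates
    for ttl, domain in rank_by_ttl(watch_list):
        flag = "REVIEW" if ttl < SHORT_TTL else "ok"
        print(f"{ttl:>8}s  {flag:<6}  {domain}")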
Request Rates

As noted above, client populations distributed around the world tend to fall into different time zones, with different diurnal patterns. Each hour of a day, a new population of victims potentially comes online. This in turn drives large spikes in recursive traffic. As illustrated in Figure 10.4, these recursive lookups ultimately result in traffic directed towards an authority server.

To help identify suspicious domains, we can rank and prioritize domains based on their associated request rates. The theory is that malicious domains (with large numbers of victims) should tend to have a larger volume of recursive and SOA refreshes. The problem, of course, is that some legitimate domains also have very high request rates. To address this problem, we can look at patterns of resolutions associated with different levels of a domain. We can classify DNS requests as either second-level domain (SLD) requests, such as example.com, or third-level subdomain (3LD) requests, such as foo.example.com.

To avoid increased costs and additional risks, botmasters tend to create botnets within 3LDs, all under a common SLD. For example, a botmaster may purchase the string example.com from a registrar, and then also arrange for DNS service for the 3LDs botnet1.example.com, botnet2.example.com, and so on. Botmasters use subdomains in order to avoid creating a new, different SLD for each new botnet, e.g., example1.com, example2.com. Each transaction to create such a domain involves risk. The seller may be recording the originating IP for the transaction, requiring the botmaster to use numerous stepping stones or proxies.
Some registrars are careful about screening and validating the whois contact information provided by the domain purchaser. Some dynamic DNS registrars require phone numbers and other identification. If the purchase is performed with stolen user accounts, there is a further risk of being caught. Since many DNS providers offer subdomain packages (e.g., a few free subdomains with DNS service), using subdomains allows the botmaster to reuse a purchased domain, and minimize both cost and risk.

Botmasters see another advantage in using subdomains. Even if service to a 3LD is suspended, service to other 3LDs within the same SLD is usually not disrupted. So, if botnet1.example.com is blocked, traffic to normaluser.example.com and botnet2.example.com is not disrupted. This lets botmasters create multiple, redundant DDNS services for their networks, all using the same SLD.

Figure 10.7: Comparison of Canonical DNS Request Rates.

By comparison, most normal users do not employ subdomains when adding subcategories to an existing site. For example, if a legitimate company owns example.com, and wants to add subcategories of pages on their web site, they are more likely to expand the URL (e.g., example.com/products) than to use a 3LD subdomain (e.g., products.example.com). This lets novice web developers create new content cheaply and quickly, without the need to perform complicated DNS updates (and implement virtual host handling in the web server) following each change to a web site.

This is, of course, essentially a sociological observation about how botmasters and normal users behave when creating subdomains and domain content. There will be exceptions, and the behavior of both groups can also change. But the motivating factors (risk, cost, and convenience) should persist. We therefore assume that, in the large, this observation may hold for a class of botnets (but certainly not all).

This fact helps us design a simple detection system. We can score domains based on the number of sibling and child domain lookups that occur. Thus, we can "penalize" the ranking of a domain using the traffic volumes sent to its sister domains. For example, if one observes large amounts of legitimate traffic to google.com, and large volumes of botnet traffic to botnet1.example.com and botnet2.example.com, we can sift out the botnets by scoring the parent zone, example.com, based on the traffic directed at its children. One can think of this as ranking families of domains, based on the amount of traffic sent to the parent zone's subtree. Similarly, it mirrors some of the analysis provided by dig, when run in trace mode, as discussed above. Logically, this scoring must start at the SLD level.

Figure 10.7 shows an application of this technique. After monitoring DNS traffic at a busy service provider for several weeks, approximately 1.28 million DNS requests were sampled. Figure 10.7 shows the average lookup rate for normal hosts, in requests per hour. When SLD traffic is placed into a canonical form (based on the volume of traffic directed to its subdomains), it becomes much easier to distinguish the normal and bot traffic. Since botnet traffic tends to favor a family of related domains (e.g., botnet1.example.com, botnet2.example.com), ranking domains based on the traffic to a particular subtree helps separate the signal (bot traffic) from the noise (normal traffic).

Once again, it's important to note that this heuristic is not by itself a complete classifier. Further, one must appreciate that botmasters are human adversaries, and always have a chance to respond to any detection system. This particular heuristic is based more on the risk factors that influence a botmaster's decision process than on empirical (and potentially brittle) observations. As such, it may prove more useful and resilient than other detection strategies.
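To make the subtree-scoring idea described above concrete, here is a minimal sketch that weights each SLD by the lookups sent to its child (3LD or deeper) names and ranks the resulting families for review. The query log, the crude two-label SLD extraction (real code should consult the public suffix list to handle TLDs such as co.uk), and the example names are all hypothetical.

from collections import Counter, defaultdict

def sld_of(name):
    """Crude second-level-domain extraction (assumes a simple TLD; see caveat above)."""
    labels = name.rstrip(".").lower().split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else name

def score_slds(queried_names):
    """Weight each SLD by the volume of lookups directed at its subdomain subtree."""
    per_name = Counter(queried_names)
    scores = defaultdict(int)
    for name, count in per_name.items():
        sld = sld_of(name)
        if name != sld:              # only 3LD-or-deeper lookups penalize the parent zone
            scores[sld] += count
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical query log: bots rallying to sibling 3LDs under a single SLD, mixed
    # with ordinary lookups. Legitimate "www." children score too; the output is a
    # ranking aid for an analyst, not a classifier.
    log = (["botnet1.example.com"] * 500 + ["botnet2.example.com"] * 400 +
           ["www.google.com"] * 300 + ["example.com"] * 10)
    for sld, score in score_slds(log):
        print(f"{score:>6}  {sld}")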
Conclusion

This chapter has discussed key properties of DNS, how botnets affect DNS traffic, and which DNS-based tools and heuristics are useful in detecting botnets. Botnets generate large waves of DNS traffic. This is dampened by the impact of caching, both at the host application/stub level and at the recursive level. The discussion above noted how researchers may need to adjust local caching behavior, and how botnets already do the same.

The detection and remediation of botnets is assisted by DNS sensors as well. Passive DNS in particular gives researchers an opportunity to mount more comprehensive responses. By logging all addresses associated with a domain, passive DNS lets administrators expand investigations, and implement more complete remediations.

This chapter also discussed how some heuristics can be used to identify suspicious domains. For example, low TTL values and weighted traffic volumes can help rank and prioritize domain traffic. While not applicable to all botnets, these approaches have some demonstrated utility. The reader is urged to follow the example used in designing these heuristics: researchers should consider properties that are inherent in a botnet's behavior, and less likely to change over time.

Solutions Fast Track

How do Botnets Use DNS?

• Botmasters frequently create multiple, redundant command-and-control centers.
• By manipulating the DNS entry for the C&C domain, botmasters can migrate victims between different C&C centers.
• Because caching delays the propagation of a new IP for a C&C center, botmasters seek to minimize cache times. Bots frequently flush the stub and application caches. Further, botmasters minimize recursive DNS server cache times by setting a low TTL period for the C&C domain.
• Because botnets often have victims all over the world, victims in different time zones generate traffic in waves. This includes DNS resolutions, which can be reduced by caching behavior.

Using DNS to Assist Botnet Response

• To detect the multiple IP addresses often associated with a botnet domain, one can use a passive DNS service. Passive DNS stores all resolutions associated with a domain. This lets one learn not only where a C&C is located, but where it used to be located.
• Local administrators may also run their own passive DNS collection logger.
• Because of the large numbers of domains found in most DNS traces, heuristics are needed to rank-order or prioritize suspicious domains.

Using DNS to Detect Botnets

• To increase the network agility of a botnet, botmasters favor short TTL periods, i.e., short times that a domain is cached by a host or caching DNS server.
• Empirical evidence suggests the vast majority of bot-oriented domains are cached for only a few hours. With a few noted exceptions, most legitimate domains are cached for a period of days.
• Since botnets often have many victims, large volumes of DNS traffic are associated with botnets, particularly at authority DNS servers. Many legitimate domains also experience large volumes of traffic. To help distinguish the two, one can score the traffic for subdomains. For example, traffic to a domain can be weighted by the traffic directed to its sibling subdomains.
• These detection techniques are merely heuristics. Botmasters are human adversaries, and can respond to detection strategies.
Frequently Asked Questions

Q: What is DNS caching?
A: Answers from DNS servers are often stored in caching DNS servers, in stub resolver (host OS) cache lines, or by particular applications, such as IE and Firefox. When a host performs a DNS lookup, it consults any relevant local application caches, host caches, and finally any cache associated with a recursive server. This improves DNS performance, but also affects the type of traffic a researcher may find when investigating a botnet.

Q: Why do bots flush local DNS caches?
A: Botnets are mobile, and often use multiple C&C sites. To shift victim populations between C&C servers, botmasters merely have to change the DNS entries. Local caching of previous C&C locations delays or prevents victims from reaching a new C&C. Thus, bots often flush the local DNS cache, using various utilities or host API functions.

Q: Where can I access a passive DNS service?
A: One is available from http://cert.uni-stuttgart.de/stats/dns-replication.php

Q: What will passive DNS show me?
A: For a given domain, you can obtain every previous resolution of the domain observed by the network. This includes mail server, name server, CNAME, and A records, among others.

Q: What DNS properties are useful for botnet detection?
A: Botnets often, but not always, use low TTL periods for C&C domains. In other words, the caching period for a botnet domain is often short, under a few hours. This contrasts with legitimate domains, with a few noted exceptions. Because botnets have large numbers of victims, they often create large spikes of DNS lookups at authority servers, and at recursive servers.

Q: Can't botmasters evade this sort of DNS-based detection?
A: Of course. Botmasters get a turn to respond to any detection regime. Factors such as weighted subdomain request volume and low TTL periods, however, are likely to remain valid for a class of botnets. There are practical reasons for botmasters to continue using low TTL values, and difficulties in adjusting volumes of traffic before victim rallying has completed. Still, these are heuristics, and they have a shelf life.