Strider HoneyMonkeys: Active, Client-Side Honeypots for Finding Malicious Websites

Yi-Min Wang, Yuan Niu, Hao Chen, Doug Beck, Xuxian Jiang, Roussi Roussev, Chad Verbowski, Shuo Chen, and Sam King

Abstract— A serious threat is emerging on the Internet where malicious Web sites exploit browser vulnerabilities to install malicious software. Traditional honeypots are typically server-side, passively drawing attacks from malicious clients. In this paper, we describe the Strider HoneyMonkey Exploit Detection System, which uses client-side honeypots to actively hunt for potentially malicious Web sites. Our HoneyMonkey system detects unauthorized persistent-state changes resulting from successful vulnerability exploits; therefore, it can detect both known and zero-day exploits. The system also tracks URL redirection and analyzes cloaking to expose the back-end malicious Web sites hiding behind front-end doorway pages that attract browser traffic. Within the first month of deployment, the system identified 752 unique URLs that could successfully exploit unpatched Windows XP machines. During daily monitoring of these 752 exploit-URLs, we discovered a malicious Web site that hosted a zero-day exploit of an unpatched javaprxy.dll vulnerability and that was hiding behind 25 doorway pages. The Microsoft Security Response Center confirmed our report as the first "in-the-wild", zero-day exploit of this vulnerability. Additionally, by scanning the most popular one million URLs as determined by a search engine, we found over seven hundred exploit-URLs (0.071%), many of which served popular content related to celebrities, song lyrics, wallpapers, video game cheats, and wrestling.

Index Terms—Security, Virtual Machines, Software Vulnerability, World-Wide Web

(Yi-Min Wang, Doug Beck, Chad Verbowski, and Shuo Chen are with Microsoft Corporation. Hao Chen and Yuan Niu are with the Department of Computer Science, the University of California, Davis. Xuxian Jiang is with George Mason University. Roussi Roussev is with Florida Institute of Technology. Sam King is with the University of Illinois at Urbana-Champaign. Part of this work was performed while these three coauthors were summer interns at Microsoft Research.)

1 INTRODUCTION
Internet attacks that use malicious or hacked Web sites to exploit unpatched browser vulnerabilities are on the rise. These malicious Web servers install malcode on visiting client machines without requiring any user action [6],[10],[42]. In 2004 [66] and again in 2006 [67], for example, attackers compromised third-party ad servers, which then infected vulnerable machines that visited pages serving ads from these servers. Although manual analyses of these events [11],[12],[15],[24],[41],[45],[48] help discover the exploited vulnerabilities and installed malware, such efforts do not scale. We need an automated approach that can find malicious Web sites efficiently and can help us study the landscape of this problem.
We describe the Strider HoneyMonkey system for automatic, systematic Web patrol. "Honey" means "honeypot" [18], and "Monkey" refers to "monkey programs" that drive Web browsers to mimic human browsing. As illustrated in Figure 1, while common server-side honeypots expose server-side vulnerabilities and passively await attacks from malicious clients, HoneyMonkeys are client-side honeypots with vulnerable client-side software that actively visit potentially malicious Web servers. The HoneyMonkey system adopts a black-box, signature-free, state-change-based approach to exploit detection:
instead of directly detecting the acts of vulnerability exploits, the system uses the Strider Tracer [53] to catch unauthorized file creations and configuration changes that result from successful exploits. This allows us to pipeline the same HoneyMonkey program on multiple Virtual Machines (VMs) whose OSs and applications have different patch levels to evaluate the strength of attacks from different Web sites.

[Figure 1: Honeypot versus HoneyMonkey. A traditional server-side honeypot exposes vulnerable server software and passively waits for exploit traffic from malicious clients; a HoneyMonkey's monkey program drives vulnerable client software to actively visit malicious doorway pages, which redirect to exploit providers.]

URL-redirection tracking [56] is another key aspect of the HoneyMonkey exploit analysis. Large-scale exploiters typically structure their Web sites in two layers: doorway pages at the front end that attract browser traffic and redirect it to exploit providers at the back end (see Figure 1). We distinguish two types of doorway pages: content providers, which host attractive content (such as pornography, celebrity information, game cheats, etc.) at well-known URLs to draw direct browser traffic, versus "throw-away" spam doorways, which attract traffic by appearing in top results at major search engines through the use of questionable search engine optimization techniques (commonly referred to as Web spam [17] or search spam). The HoneyMonkey uses the Strider URL Tracer [60] to capture all the front-end and back-end exploit sites involved in detected malicious activities.
We demonstrate the effectiveness of our method by discovering large communities of malicious Web sites and by deriving the redirection relationships between them. We describe our experience of identifying a new zero-day exploit using this system. We expose hundreds of malicious Web pages from amongst the most popular Web sites and thousands of pages that are involved in Web spamming. Finally, we propose a comprehensive anti-exploit process based on this monitoring system to improve Internet safety.
This paper is organized as follows. Section 2 provides background information on the problem space by describing the techniques used in actual client-side exploits of popular Web browsers. Section 3 gives an overview of the Strider HoneyMonkey Exploit Detection System and its surrounding Anti-Exploit Process. Section 4 evaluates the effectiveness of HoneyMonkey in both known-vulnerability and zero-day exploit detection, and presents an analysis of the exploit data to help prioritize investigation tasks. Section 5 discusses the limitations of and possible attacks on the current HoneyMonkey system and describes several countermeasures, including an enhancement based on a Vulnerability-Specific Exploit Detection mechanism. Section 6 surveys related work and Section 7 concludes the paper.

2 BROWSER-BASED VULNERABILITY EXPLOITS
Malicious Web sites typically exploit browser vulnerabilities in four steps: code obfuscation, URL redirection and cloaking, vulnerability exploitation, and malware installation.
2.1 Code Obfuscation
To foil human investigation and signature-based detection by anti-malware programs, some Web sites use a combination of the following code obfuscation techniques: (1) dynamic code injection using the document.write() function inside a script; (2) unreadable, long strings with encoded characters, such as "%28" and "&#104", which are then decoded either by the unescape() function inside a script or directly by the browser; (3) a custom decoding routine included in a script; and (4) sub-string replacement using the replace() function. Code obfuscation makes it difficult for signature-based detectors to detect new variants of existing exploit code.

2.2 URL Redirection and Cloaking
2.2.1 URL Redirection
Most malicious Web sites automatically redirect browser traffic to additional URLs. Specifically, when a browser visits a primary URL, the response from that URL instructs the browser to automatically visit one or more secondary URLs, which may or may not affect the content that is displayed to the user. Such redirections typically use mechanisms from one of the following three categories: (1) protocol redirection using HTTP 302 Temporary Redirect; (2) HTML tags, including <iframe>, <frame> inside <frameset>, and <META http-equiv=refresh>; (3) script functions, including window.location.replace(), window.location.href(), window.open(), window.showModalDialog(), and <link_ID>.click(). Since benign Web sites also commonly use redirection to enrich content, forbidding redirection is infeasible.
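For illustration only, the following minimal Python sketch shows how the second and third categories of redirection primitives listed above could be recognized in a fetched page; this is not how HoneyMonkey tracks redirections (it intercepts network-level APIs inside the browser process, as described in Section 3.1.1), and the protocol-level HTTP 302 case is visible only in the HTTP response headers, not in the page body. The pattern names are our own.

    # Illustrative static check for HTML/script redirection primitives.
    # HoneyMonkey itself intercepts network-level APIs instead of parsing HTML.
    import re

    REDIRECT_PATTERNS = {
        "html_iframe_or_frame": re.compile(r"<(iframe|frame)\b[^>]*\bsrc\s*=", re.I),
        "html_meta_refresh": re.compile(r"<meta[^>]+http-equiv\s*=\s*[\"']?refresh", re.I),
        "script_location": re.compile(
            r"window\.location\.(replace|href)|window\.open|window\.showModalDialog", re.I),
    }

    def find_redirection_primitives(html_text):
        return [name for name, pat in REDIRECT_PATTERNS.items() if pat.search(html_text)]

    if __name__ == "__main__":
        page = '<html><iframe src="http://exploit.example/x.htm" width="0"></iframe></html>'
        print(find_redirection_primitives(page))   # ['html_iframe_or_frame']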
2.2.2 URL Cloaking
In some cases, spammers use redirection and cloaking synergistically to serve optimized pages. For example, in crawler-browser cloaking, a spam Web page contains scripts that, when executed, will modify the page. Since most crawlers run with script execution off, the page seen by crawlers is different from the one seen by humans. In a recent paper [37] we reported a new type of cloaking, click-through cloaking, which serves different content based on the Referer field in HTTP requests and the browser referrer variable. We discovered that some malicious Web site operators were using search spamming to draw traffic, and that some of them were using cloaking to spam more effectively. We divide click-through cloaking into three categories: server-side cloaking, client-side cloaking, and combination techniques.
Server-side cloaking checks the Referer field in the HTTP request header for a search engine URL. For example, www.intheribbons.com/win330/2077_durwood.html serves spam content from lotto.gamblingfoo.com to click-through users, but serves a "404 Not Found" page to non-click-through users. To defeat this cloaking, one can issue a query of "url:www.intheribbons.com/win330/2077_durwood.html" in a search engine to fool the Referer field check. In some cases, the pages are scripted with checks for special queries not expected of normal users. For example, "link:", "linkdomain:", "site:", and "url:" (or "info:"; to show information about particular Web pages, some search engines accept "url:" queries while others accept "info:" queries) are often used by those investigating spam, but not by normal search users. However, all these checks can be circumvented by faking the Referer field in the HTTP header.
The advantage gained by faking the Referer field is nullified when pages use client-side cloaking to distinguish between fake and real Referer field data by running a script in the client's browser to check the document.referrer variable. Example 1 shows a script used by the spam URL naha.org/old/tmp/evans-sara-real-fine-place/index.html. The script checks whether the document.referrer string contains the name of any major search engine. If it does, the browser redirects to ppcan.info/mp3re.php and eventually to spam; otherwise, the browser stays at the current doorway page. To defeat this simple client-side cloaking, issuing a query of the form "url:link1" is sufficient; this allows us to fake a click-through from a real search engine page.

    var url = document.location + "";
    exit = true;
    ref = escape(document.referrer);
    if ((ref.indexOf('search') == -1) && (ref.indexOf('google') == -1) &&
        (ref.indexOf('find') == -1) && (ref.indexOf('yahoo') == -1) &&
        (ref.indexOf('aol') == -1) && (ref.indexOf('msn') == -1) &&
        (ref.indexOf('altavista') == -1) && (ref.indexOf('ask') == -1) &&
        (ref.indexOf('alltheweb') == -1) && (ref.indexOf('dogpile') == -1) &&
        (ref.indexOf('excite') == -1) && (ref.indexOf('netscape') == -1) &&
        (ref.indexOf('fast') == -1) && (ref.indexOf('seek') == -1) &&
        (ref.indexOf('find') == -1) && (ref.indexOf('searchfeed') == -1) &&
        (ref.indexOf('about.com') == -1) && (ref.indexOf('dmoz') == -1) &&
        (ref.indexOf('accoona') == -1) && (ref.indexOf('crawler') == -1)) {
      exit = false;
    }
    if (exit) {
      p = location;
      r = escape(document.referrer);
      location = 'http://ppcan.info/mp3re.php?niche=Evans, Sara&ref=' + r;
    }

Example 1: A basic client-side cloaking script

More vigilant spammers may include checks for "link:", "linkdomain:", and "site:" as well as for the spam URL's domain name in the "url:" and "info:" cases. These checks are shown in Example 2 for lossovernigh180.blogspot.com. Again, the script determines which page will be served.

    function is_se_traffic() {
      if (document.referrer) {
        if (document.referrer.indexOf("google") > 0 ||
            document.referrer.indexOf("yahoo") > 0 ||
            document.referrer.indexOf("msn") > 0 ||
            document.referrer.indexOf("live") > 0 ||
            document.referrer.indexOf("search.blogger.com") > 0 ||
            document.referrer.indexOf("www.ask.com") > 0) {
          if (document.referrer.indexOf(document.domain) < 0 &&
              document.referrer.indexOf("link%3A") < 0 &&
              document.referrer.indexOf("linkdomain%3A") < 0 &&
              document.referrer.indexOf("site%3A") < 0) {
            return true;
          }
        }
      }
      return false;
    }

Example 2: Script fragment for document.referrer checking from zeppele.com/9726_5fcb7_Vp8.js, which lossovernigh180.blogspot.com redirects to

Techniques exist to combine both server- and client-side scripting in an attempt to foil the efforts of spam investigators, to whom the client-side script logic is exposed. Combination techniques extract referrer information directly from the client-side document.referrer variable and pass it along to a spam domain for server-side checks. For example, lawweekly.student.virginia.edu/wwwboard/messages/007.html sends document.referrer information to 4nrop.com as part of the URL. If the referrer information passes the server-side check, the browser is redirected to raph.us. Otherwise, the browser is redirected to a bogus 404 error page. This spammer has also attacked other .edu Web sites and set up cloaked pages with similar behavior: pbl.cc.gatech.edu/bmed3200a/10 and languages.uconn.edu/faculty/CVs/data-10.php are two such examples. Example 3 shows an obfuscated script using the combination techniques in mywebpage.netscape.com/superphrm2/ordertramadol.htm.
By replacing eval() with alert(), we were able to see that emaxrdr.com is sent the document.referrer information and then redirects the browser either to the spam page pillserch.com or to spampatrol.org, a fake spam-reporting Web site.

    <script>
    var params = "f=pharmacy&cat=tramadol";
    function kqqw(s) {
      var Tqqe = String("qwertyuioplkjhgfdsazxcvbnmQWERTYUIOPLKJHGFDSAZXCVBNM_1234567890");
      var tqqr = String(s);
      var Bqqt = String("");
      var Iqqy, pqqu, Yqqi = tqqr.length;
      for (Iqqy = 0; Iqqy < Yqqi; Iqqy += 2) {
        pqqu = Tqqe.indexOf(tqqr.charAt(Iqqy)) * 63;
        pqqu += Tqqe.indexOf(tqqr.charAt(Iqqy + 1));
        Bqqt = Bqqt + String.fromCharCode(pqqu);
      }
      return (Bqqt);
    }
    eval(kqqw('wKwVwLw2wXwJwCw1qXw4wMwDw1wJqGqHq8qHqSqHw_ ... Bw1qHqSqHq0qHqFq7'));
    </script>

Example 3: Obfuscated script for combination cloaking

2.3 Vulnerability Exploitation
Many malicious Web pages attempt to exploit multiple browser vulnerabilities to maximize the chance of a successful attack. Figure 2 shows an example HTML fragment that uses various primitives to load multiple files from different URLs on the same server to exploit three vulnerabilities fixed in Microsoft Security Bulletins MS05-002 [34], MS03-011 [35], and MS04-013 [36]. If any of the exploits succeeds, it downloads and executes a Trojan downloader named win32.exe. Note that although Internet Explorer is a popular target, other browsers also have exploitable vulnerabilities.

2.4 Malware Installation
Most exploits eventually install malcode on the victim machines to enable further attacks. We have observed a plethora of malcode types installed through browser exploits, including viruses that infect files, backdoors that open entry points for future unauthorized access, bot programs that allow the attacker to control a whole network of compromised systems, Trojan downloaders that connect to the Internet and download other programs, Trojan droppers that drop files from themselves without accessing the Internet, and Trojan proxies that redirect network traffic. Some exploits also install spyware programs and rogue anti-spyware programs.

3 THE STRIDER HONEYMONKEY SYSTEM
The HoneyMonkey system automatically detects and analyzes Web sites that exploit Web browsers. Figure 3 illustrates the HoneyMonkey Exploit Detection System, shown inside the dotted square, and the surrounding Anti-Exploit Process, which includes both automatic and manual components.

3.1 Redirection and Cloaking Analysis
3.1.1 Redirection Analysis
Many exploit-URLs do not contain exploits but instead act as front-end doorway pages to attract browser traffic. They are either content providers serving "attractive" content (such as pornography or celebrity information), or spam pages that get into top results at major search engines through questionable search engine optimization techniques [17],[37]. These doorway pages then redirect to back-end exploit providers, which execute the exploits. We track the URLs visited through browser redirection by intercepting APIs at the network level within each browser process [56]. When the HoneyMonkey runs in its "redirection analysis" mode, it checks all the automatically visited URLs. In Section 4, we present how our URL redirection analysis helps identify major exploit providers that receive traffic from a large number of doorway pages, and helps illustrate how exploit providers organize their Web pages to facilitate customized malware installations for their affiliates. Finally, we are able to positively identify the Web pages that actually perform the exploits by implementing an option in our redirection tracker to block all redirection traffic.
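The redirection data lends itself to a simple graph representation. The sketch below is our own illustration of how intercepted per-URL redirection events might be folded into a site-level graph that later feeds the ranking algorithms of Section 4; it assumes a hypothetical add_redirection() entry point and is not the Strider URL Tracer API.

    # Illustrative data model (not the actual Strider URL Tracer): fold observed
    # redirection events into a site-level graph keyed by hostname.
    from collections import defaultdict
    from urllib.parse import urlparse

    class RedirectionGraph:
        def __init__(self):
            self.out_edges = defaultdict(set)   # site -> sites it redirects traffic to

        def add_redirection(self, src_url, dst_url):
            src = urlparse(src_url).hostname
            dst = urlparse(dst_url).hostname
            if src and dst and src != dst:
                self.out_edges[src].add(dst)

        def doorway_sites(self):
            # Sites that send traffic but receive none are likely doorway/content providers.
            receivers = {d for dsts in self.out_edges.values() for d in dsts}
            return [s for s in self.out_edges if s not in receivers]

    if __name__ == "__main__":
        g = RedirectionGraph()
        g.add_redirection("http://doorway.example/lyrics.html", "http://exploit.example/x.htm")
        g.add_redirection("http://doorway.example/lyrics.html", "http://ads.example/a.htm")
        print(g.doorway_sites())   # ['doorway.example']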
3.1.2 Cloaked Malicious Web Pages
In a recent search spam investigation, described in Section 4.4, we found thousands of malicious pages whose operators were using search spam techniques to attract more traffic. Of those malicious pages, hundreds use click-through cloaking (Section 2.2.2). For example, on October 13, 2006, mandevillechevrolet . com (spaces added for safety), which appeared as the top Yahoo search result for "Mandeville Chevrolet", would install a malware program named "ane.exe" under C:\ if the visitor's machine was vulnerable. This spam page used a client-side script to check document.referrer and redirected click-through users to hqualityporn . com.
In a case study using mandevillechevrolet . com as a seed, we extracted 118 suspicious domains hosted on its nearby IP addresses. We scanned the 118 domain-level pages and found that 90 (76%) of the 118 URLs were cloaked URLs hiding behind three malicious domains: dinet.info, frlynx.info, and joutweb.net. We re-scanned the same group of URLs again in mid-November 2006 and found that the three domains had been replaced by tisall.info, frsets.info, and recdir.org (hosted on the same pair of IP addresses, 85.255.115.227 and 66.230.138.194), but their cloaking behavior remained the same.
Given these findings, it is important for security investigators to adopt anti-cloaking techniques in their investigations. We present one such tool in the following section.

3.1.3 Anti-cloaking Scanner and Redirection-Diff Spam Detection Tool
To defeat cloaking, we could exploit the weaknesses of individual cloaking techniques, but this approach would be slow, laborious, and unreliable. Instead, our end-to-end anti-cloaking scanner visits Web sites by clicking through search-result pages. Given a URL, the scanner derives likely keywords targeted by the spammer and queries search engines so that the document.referrer variable is correctly set. We have observed that click-through cloaking almost always indicates spam; therefore, our redirection-diff approach uses the presence of cloaking as a spam detection mechanism [55]. Specifically, our tool scans each suspicious URL twice, once with anti-cloaking and once without, and compares the two sets of redirection domains. If the two sets are different, the tool flags the URL as cloaking.
False positives can exist under some circumstances: (1) some Web sites serve rotating ads from different domains; (2) some Web sites rotate final destinations to distribute visits among their traffic-affiliate programs; (3) some Web accesses may fail. In practice, we find the following improvements effective for reducing false positives: (1) we can detect the majority of cloaked pages by comparing only the final destinations from the two scans; (2) we can perform the scans and diffs on each suspect URL multiple times and flag only the URLs that yield consistent diff results; (3) given a group of URLs expected to exhibit similar cloaking behavior, we can perform the scan and diff for each URL once and exclude those inconsistent with the rest of the group. Given a confirmed cloaked URL, we can construct a group of suspect URLs by examining domains hosted on nearby IP addresses.
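A minimal sketch of the redirection-diff idea follows. The scan_redirection_domains() helper is a stand-in for a HoneyMonkey-style scan that returns the set of domains a page redirects through, with or without a simulated search-engine click-through; it is stubbed here so the example runs, and the repeated-scan loop corresponds to the second false-positive-reduction step described above.

    # Sketch of redirection-diff cloaking detection (stubbed scanner, illustrative only).
    def scan_redirection_domains(url, click_through):
        # Real version: drive a browser, optionally with a search-engine referrer,
        # and record the domains reached through automatic redirection.
        return {"lotto.gamblingfoo.com"} if click_through else {"404.example.net"}

    def is_cloaking(url, rounds=3):
        # Flag the URL only if the two scans differ consistently across rounds,
        # which reduces false positives from rotating ads or failed accesses.
        diffs = 0
        for _ in range(rounds):
            if scan_redirection_domains(url, True) != scan_redirection_domains(url, False):
                diffs += 1
        return diffs == rounds

    if __name__ == "__main__":
        print(is_cloaking("http://cloaked.example/page.html"))   # True with the stub above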
3.2 Exploit Detection System
The exploit detection system is the heart of the HoneyMonkey design (Figure 3). This system consists of a 3-stage pipeline of virtual machines. Given a large list of input URLs with a potentially low exploit-URL density, each HoneyMonkey in Stage 1 starts with a scalable mode by visiting N URLs simultaneously inside one unpatched VM. When the HoneyMonkey detects an exploit, it switches to the basic, one-URL-per-VM mode to re-test each of the N suspects to determine which ones are exploit-URLs. Stage-2 HoneyMonkeys scan the Stage-1-detected exploit-URLs and perform recursive redirection analysis to identify all Web pages involved in exploit activities and to determine their relationships. Stage-3 HoneyMonkeys continuously scan Stage-2-detected exploit-URLs using (nearly) fully patched VMs to detect attacks exploiting the latest vulnerabilities.

[Figure 3: HoneyMonkey Exploit Detection System and Anti-Exploit Process. The detection system is a pipeline of Stage-1 HoneyMonkeys (unpatched virtual machines without redirection analysis), Stage-2 HoneyMonkeys (unpatched virtual machines with redirection analysis), and Stage-3 HoneyMonkeys ((nearly) fully patched virtual machines with redirection analysis). A list of URLs of interest, supplemented by seed-based suspect URL hunting and access-density analysis, feeds Stage 1; the stages output exploit-URLs, a redirection graph of exploit-URLs, and zero-day or latest-patched-vulnerability exploit-URLs, which flow to anti-spyware and anti-virus teams, browser and other related teams, the Internet safety enforcement team (URL blocking), and the security response center.]
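The Stage-1 logic described above can be summarized as follows. The two scan helpers are stand-ins for real VM-based scans (open the URLs in a clean VM and report whether an unauthorized persistent-state change occurred); they are stubbed against a made-up "bad" URL so the sketch runs.

    # Sketch of Stage-1: visit N URLs in one unpatched VM (scalable mode); if an
    # exploit is detected, re-test each URL alone (basic mode) to pin down the culprits.
    TOY_BAD_URLS = {"http://exploit.example/x.htm"}

    def scan_batch_in_one_vm(urls):
        # Real version: open all URLs in one unpatched VM and report any unauthorized
        # persistent-state change (file/registry writes outside the browser sandbox).
        return any(u in TOY_BAD_URLS for u in urls)

    def scan_single_url(url):
        # Real version: one URL per clean VM (the basic mode).
        return url in TOY_BAD_URLS

    def stage1_scan(url_list, batch_size=10):
        exploit_urls = []
        for i in range(0, len(url_list), batch_size):
            batch = url_list[i:i + batch_size]
            if scan_batch_in_one_vm(batch):                                   # scalable mode
                exploit_urls += [u for u in batch if scan_single_url(u)]      # basic mode
        return exploit_urls

    if __name__ == "__main__":
        urls = ["http://benign.example/a", "http://exploit.example/x.htm", "http://benign.example/b"]
        print(stage1_scan(urls, batch_size=3))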
We used a network of 20 machines to produce the results reported in this paper. Each machine had a CPU speed between 1.7 and 3.2 GHz and a memory size between 512 MB and 2 GB, and ran one VM configured with 256 MB to 512 MB of RAM. Each VM supported up to 10 simultaneous browser processes in the scalable mode, with each process visiting a different URL. Due to the way HoneyMonkeys detect exploits (discussed in Section 3.2.1), there is a trade-off between the scan rate and the robustness of exploit detection: if the HoneyMonkey does not wait long enough, or if too many simultaneous browser processes cause excessive slowdown, some exploit pages may not perform a detectable attack (e.g., beginning a software installation). Through extensive experiments, we determined that a wait time of two minutes was a good trade-off. Taking into account the overhead of restarting VMs in a clean state, each machine was able to scan and analyze between 3,000 and 4,000 URLs per day. (We have since improved the scalability of the system to a scan rate of 8,000 URLs per day per machine in the scalable mode.) In contrast, the basic mode scans between 500 and 700 URLs per day per machine. We expect that using a more sophisticated VM platform that enables significantly more VMs per host machine and faster rollback [50] would significantly increase our scalability.

3.2.1 Exploit Detection
Although we could manually build signature-based detection code for known vulnerabilities and exploits, this approach would not scale and could not detect zero-day exploits. Instead, we take the following black-box, non-signature-based approach: a monkey program launches a browser instance to visit each input URL and then waits for a few minutes to allow for possible delayed downloading. Since the monkey does not click any dialog box to permit software installation, any executable files or registry entries created outside the browser sandbox indicate exploits. This approach has the additional important advantage of detecting both known-vulnerability exploits and zero-day exploits uniformly. Specifically, the same monkey program running on unpatched machines to detect a broad range of vulnerability exploits (as shown in Stages 1 and 2) can run on fully patched machines to detect zero-day exploits, as shown in Stage 3.
At the end of each visit, the HoneyMonkey generates an XML report containing the following five pieces of information:
(i) Executable files created or modified outside the browser sandbox folders: this is the primary mechanism for exploit detection. We implemented it on top of the Strider Tracer [53], which uses a file-tracing driver to efficiently record every single file read/write operation.
(ii) Processes created: Strider Tracer also tracks all child processes created by the browser process.
(iii) Windows registry entries created or modified: Strider Tracer additionally includes a driver that efficiently records every single registry [16] read/write. To highlight the most critical entries, we use the Strider Gatekeeper and GhostBuster filters [54,55], which target registry entries most frequently attacked by spyware, Trojans, and rootkits based on an extensive study. This allows HoneyMonkey to detect exploits that modify critical configuration settings (such as the browser home page and the wallpaper) without creating executable files.
(iv) Vulnerability exploited: to provide additional information and to address limitations of the black-box approach, we have developed and incorporated a vulnerability-specific detector, to be discussed in Section 5. This is based on the vulnerability signature of the exploit, rather than on any particular piece of malcode.
(v) Redirect-URLs visited: since malcode is often laundered through other sites, this module allows us to track redirections to determine both the real source of the malcode and those involved in the distribution chain, as described in Section 3.1.1.
To ease the cleanup of infected state, we run HoneyMonkeys inside a VM. (Our current implementation uses Microsoft Virtual PC and Virtual Server.) Upon detecting an exploit, the monkey saves its logs and notifies the Monkey Controller on the host machine to destroy the infected VM and re-spawn a clean HoneyMonkey, which then continues to visit the remaining URL list. The Monkey Controller then passes the detected exploit-URL to the next monkey in the pipeline to further investigate the strength of the exploit.
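Conceptually, the core check in item (i) above amounts to a diff of the persistent state before and after a visit, restricted to paths outside the browser's sandbox folders. The sketch below is only a user-level illustration of that idea (with made-up paths); the real HoneyMonkey relies on the Strider Tracer's kernel-level file and registry tracing rather than snapshot diffing.

    # Conceptual sketch of state-change-based detection: executables created outside
    # the browser sandbox during the visit are flagged. Paths below are examples only.
    import os

    EXECUTABLE_EXTS = {".exe", ".dll", ".sys", ".com", ".scr"}

    def snapshot_executables(root, sandbox_dirs):
        found = set()
        for dirpath, _, filenames in os.walk(root):
            if any(dirpath.lower().startswith(s.lower()) for s in sandbox_dirs):
                continue  # changes inside the browser sandbox do not indicate exploits
            for name in filenames:
                if os.path.splitext(name)[1].lower() in EXECUTABLE_EXTS:
                    found.add(os.path.join(dirpath, name))
        return found

    def detect_exploit(before, after):
        return sorted(after - before)

    if __name__ == "__main__":
        sandbox = ["C:\\Users\\monkey\\AppData\\Local\\Microsoft\\Windows\\Temporary Internet Files"]
        pre = snapshot_executables("C:\\", sandbox)
        # ... the monkey visits the URL and waits a couple of minutes ...
        post = snapshot_executables("C:\\", sandbox)
        print(detect_exploit(pre, post))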
3.3 Anti-Exploit Process
The Anti-Exploit Process generates the input URL lists for HoneyMonkeys to scan, analyzes the output exploit-URLs, and takes appropriate actions.

3.3.1 Generating Input URL Lists
We use four sources for generating suspicious URLs for analysis. The first category consists of suspicious URLs, such as known Web sites that host malware [9], links appearing in phishing or spam emails [45] or instant messages, Web pages serving questionable content such as pornography, URL names that are typos of popular sites [G05], Web sites involved in DNS cache poisoning [20,26,44], etc.
The second category consists of the most popular Web pages, which can potentially infect a large client population. Examples include the top 100,000 Web sites based on browser traffic ranking [3], the top-N search results of popular queries, and the top N million Web sites based on click-through counts as measured by search engines.
The third category comes from seed-based suspect hunting, as illustrated in the bottom-left corner of Figure 3. Confirmed exploit-URLs are used as "seeds" to locate more potentially malicious Web pages. For example, we have observed that depth-N crawling of exploit pages containing a large number of links often leads to the discovery of even more exploit pages. Similarly, as we will demonstrate in Section 4.4, when an exploit-URL is comment-spammed at an open forum (i.e., it is injected into a publicly accessible comment field to inflate its popularity [38]), other URLs that appear on the same forum page are also likely malicious [37].
The fourth category encompasses URL lists of a more localized scope. For example, an organization may want to regularly verify that its Web pages have not been compromised to exploit visitors, or a user may want to investigate whether any recently visited URL was responsible for his spyware infection.

3.3.2 Acting on Output Exploit-URL Data
Stage 1 Output – Exploit-URLs
The percentage of exploit-URLs in a given list can be used to measure the risk of Web surfing. For example, by comparing the percentage numbers from two URL lists corresponding to two different search categories (e.g., gambling versus shopping), we can assess the relative risk of malware infection for people with different browsing habits.

Stage 2 Output – Traffic Redirection Analysis
The HoneyMonkey system currently serves as a lead-generation tool for the Internet safety enforcement team in the Microsoft legal department. The redirection analysis and subsequent investigations of the malicious behavior of the installed malware programs provide a prioritized list for potential enforcement actions that include sending site-takedown notices, notifying law enforcement agencies, and filing civil suits against the individuals responsible for distributing the malware programs [61],[62]. We have successfully shut down many malicious URLs discovered by the HoneyMonkey.
Due to the international nature of the exploit community, access blocking may be more appropriate and effective than legal action in many cases. Blocking can be implemented at different levels: search engines can remove exploit-URLs from their databases [57], Internet Service Providers (ISPs) can black-list exploit-URLs to protect their entire customer base, corporate proxy servers can prevent employees from accessing any of the exploit-URLs, and individual users can block their machines from communicating with any exploit sites by editing their local "hosts" files to map those server hostnames to a local loopback IP address.
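The per-machine blocking option just mentioned can be scripted directly. The sketch below is illustrative only: the hostnames are made up, and the default path shown is the usual Windows location of the hosts file.

    # Illustrative per-machine blocking: map exploit-site hostnames to loopback
    # in the local hosts file (example hostnames; Windows default path shown).
    HOSTS_PATH = "C:\\Windows\\System32\\drivers\\etc\\hosts"   # /etc/hosts on Unix

    def block_exploit_sites(hostnames, hosts_path=HOSTS_PATH):
        lines = ["127.0.0.1 {}  # blocked exploit site\n".format(h) for h in hostnames]
        with open(hosts_path, "a", encoding="ascii") as f:
            f.writelines(lines)

    if __name__ == "__main__":
        # Write to a scratch file for the demo instead of the real hosts file.
        block_exploit_sites(["exploit-provider.example", "doorway.example"],
                            hosts_path="hosts.test")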
Exploit-URLs also provide valuable leads to our anti-spyware product team. Each installed program is tagged with an "exploit-based installation without user permission" attribute. This clearly distinguishes the program from other, more benign spyware programs that are always installed after a user accepts the licensing agreement.

Stage 3 Output – Zero-Day Exploit-URLs
By constantly monitoring all known exploit-URLs using HoneyMonkeys running on fully patched machines, the system can detect zero-day exploits as soon as those pages start serving them. Because zero-day exploits are extremely damaging, whether they are deployed in the wild is the most critical piece of information in the decision process for security guidance, patch development, and patch release. When a HoneyMonkey detects a zero-day exploit, it reports the URL to the Microsoft Security Response Center, which works closely with the enforcement team and the groups owning the software with the vulnerability to thoroughly investigate the case and determine the most appropriate course of action. We will discuss an actual case in Section 4.2.
Due to the unavoidable delay between patch release and patch deployment, it is important to know whether the vulnerabilities fixed in a newly released patch are being actively exploited in the wild. Such latest-patched-vulnerability exploit monitoring can be achieved by running HoneyMonkeys on nearly fully patched machines that miss only the latest patch. The results illustrate the prevalence of such exploits and help guide the urgency of patch deployment.

4 EXPERIMENTAL EVALUATION
We present experimental results in four sections: scanning suspicious URLs, zero-day exploit detection, scanning popular URLs, and scanning spam URLs. We refer to the first and the third sets of data as "suspicious-list data" and "popular-list data", respectively. All experiments were performed with Internet Explorer browser version 6.0.
We note that the statistics reported in this paper do not allow us to calculate the total number of end-hosts exploited by the malicious Web sites we have found. Such calculations would require knowing precisely the number of machines that have visited each exploit page and whether each machine has patched the specific vulnerabilities that the visited page attempts to exploit.

4.1 Scanning Suspicious URLs
4.1.1 Summary Statistics
Our first experiment aimed at gathering a list of the most likely candidates for exploit-URLs in order to get the highest hit rate possible. In the May/June 2005 time frame, we collected 16,190 potentially malicious URLs from three sources: (1) a Web search of "known-bad" Web sites involved in the installation of malicious spyware programs [9]; (2) a Web search for Windows "hosts" files [21] that are used to block advertisements and bad sites by controlling the domain name-to-IP address mapping; (3) depth-2 crawling of some of the discovered exploit-URLs.
We used the Stage-1 HoneyMonkeys running on unpatched WinXP SP1 and SP2 VMs to scan the 16,190 URLs and identified 207 as exploit-URLs; this translates into a density of 1.28%. This serves as an upper bound on the infection rate: if a user does not patch his machine at all and he exclusively visits risky Web sites with questionable content, his machine will get exploited by approximately one out of every 100 URLs he visits. We will discuss the exploit-URL density for normal browsing behavior in Section 4.3.
After recursive redirection analysis by Stage-2 HoneyMonkeys, the list expanded from 207 URLs to 752 URLs, a 263% increase. This reveals that there is a sophisticated network of exploit providers hiding behind URL redirection to perform malicious activities. Table 1 shows the breakdown of the 752 exploit-URLs among different service packs (SP1 or SP2) and patch levels, where "UP" stands for "UnPatched", "PP" stands for "Partially Patched", and "FP" stands for "Fully Patched". As expected, the SP1-UP number is much higher than the SP2-UP number because the former has more known vulnerabilities that have existed for a longer time.
Table 1. Exploit statistics for Windows XP as a function of patch levels (May/June 2005 data)

                                      # Exploit-URLs    # Exploit Sites
    Total                                  752                288
    SP1 Unpatched (SP1-UP)                 688                268
    SP2 Unpatched (SP2-UP)                 204                115
    SP2 Partially Patched (SP2-PP)          17                 10
    SP2 Fully Patched (SP2-FP)               0                  0

The SP2-PP numbers are the numbers of exploit pages and sites that successfully exploited a WinXP SP2 machine partially patched up to early 2005. The fact that the numbers are one order of magnitude lower than their SP2-UP counterparts demonstrates the importance of patching. An important observation is that only a small percentage of exploit sites are updating their exploit capabilities to keep up with the latest vulnerabilities, even though proof-of-concept exploit code for most of the vulnerabilities is publicly posted. We believe this is due to three factors: (1) upgrading and testing new exploit code incurs some cost, which needs to be traded off against the increase in the number of victim machines; (2) some vulnerabilities are more difficult to exploit than others; for example, some of the attacks are nondeterministic or take longer, so most exploiters tend to stay with existing, reliable exploits and only upgrade when they find the next easy target; (3) most security-conscious Web users diligently apply patches, and exploit sites with "advanced" capabilities are likely to draw attention from knowledgeable users and become targets for investigation.
The SP2-FP numbers again demonstrate the importance of software patching: none of the 752 exploit-URLs was able to exploit a fully updated WinXP SP2 machine according to our May/June 2005 data. As we describe in Section 4.2, there was a period of time in early July when this was no longer true. We were able to quickly identify and report the few exploit providers capable of infecting fully patched machines, which led to actions that shut them down.

4.1.2 Site Ranking
Figure 4 illustrates the site and redirection relationships of the 17 exploit-URLs for SP2-PP. These are among the most powerful exploit pages in terms of the number of machines they are capable of infecting and should be considered high priorities for investigation. Each rectangular node represents an exploit-URL, and nodes inside a dashed-line box are pages hosted on the same Web site. Each arrow represents an automatic traffic redirection. Nodes that do not receive redirected traffic are most likely content providers, while nodes that receive traffic from multiple exploit sites are most likely exploit providers. (We anonymize the names of malicious Web sites so as not to interfere with the ongoing investigations by law-enforcement agencies and our legal department.)

[Figure 4: SP2-PP Redirection Graph (17 URLs, 10 sites). Doorway pages on the left redirect to exploit providers on the right; node R is the most heavily connected exploit provider.]

We define for each node a connection count that represents the number of cross-site arrows, both incoming and outgoing, directly connected to it. Such numbers provide a good indication of the degree of involvement in the exploit network. It is clear from the picture that node R has the highest connection count; therefore, blocking access to this site would be the most effective starting point, as it would disrupt nearly half of this exploit network.
The redirection graph for the 688 SP1-UP exploit-URLs is much more complex and is omitted here. Most of the URLs appear to be pornography pages, and the aggressive traffic redirection among them accounts for the complexity of the bulk of the graph. In the isolated corners, we found a shopping site redirecting traffic to five advertising companies that serve exploiting advertisements, a screensaver freeware site, and over 20 exploit search sites. Next, we describe two ranking algorithms that help prioritize the investigations of these hundreds of URLs and sites.

Site ranking based on connection counts
Figure 5 illustrates the top 15 exploit sites for SP1-UP according to their connection counts among the 268 SP1-UP exploit sites from the suspicious list. The bar height indicates how many other sites a given site has a direct traffic-redirection relationship with and likely reflects how entrenched a site owner is with the exploit community. The bar for each site is composed of three segments of different colors: a black segment represents the number of sites that redirect traffic here; a white segment represents the number of sites to which traffic is redirected; a gray segment indicates the number of sites that have a two-way traffic-redirection relationship with the given site. For example, site #15 corresponds to a content provider who is redirecting traffic to multiple exploit providers and sharing traffic with a few other content providers. Site #7 corresponds to an exploit provider that is receiving traffic from multiple Web sites. Sites #4, #5, and #9 are pornography sites that play a complicated role: they redirect traffic to many exploit providers and receive traffic from many content providers. Their heavy involvement in exploit activities and the fact that they are registered to the same owner suggest that they may be set up primarily for exploit purposes.

[Figure 5: Top 15 exploit sites ranked by connection counts, among the 268 SP1-UP exploit sites from the suspicious list. Each bar breaks the count into the number of sites from which traffic is received, the number of sites to which traffic is redirected, and the number of sites with two-way redirection.]

Site ranking, categorization, and grouping play a key role in the anti-exploit process because they serve as the basis for deciding the most effective resource allocation for monitoring, investigation, blocking, and legal actions. For example, high-ranking exploit sites in Figure 5 should be heavily monitored because a zero-day exploit page connected to any of them would likely affect a large number of Web sites. Legal investigations should focus on top exploit providers, rather than on content providers that are mere traffic doorways and do not perform exploits themselves.
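As a sketch of the connection-count ranking just described (our own rendering of the idea, not the production code), each site's score is simply the number of distinct sites it exchanges redirection traffic with, split into incoming-only, outgoing-only, and two-way relationships. The edge set and site names below are made up.

    # Sketch: rank sites by connection count from a set of (source, destination) edges.
    from collections import defaultdict

    def rank_by_connection_count(edges):
        outgoing, incoming = defaultdict(set), defaultdict(set)
        for src, dst in edges:
            outgoing[src].add(dst)
            incoming[dst].add(src)
        ranking = []
        for site in set(outgoing) | set(incoming):
            two_way = outgoing[site] & incoming[site]
            score = len(outgoing[site] | incoming[site])   # distinct connected sites
            ranking.append((score, len(incoming[site]) - len(two_way),
                            len(outgoing[site]) - len(two_way), len(two_way), site))
        return sorted(ranking, reverse=True)

    if __name__ == "__main__":
        edges = {("doorway1", "exploitA"), ("doorway2", "exploitA"),
                 ("porn1", "exploitA"), ("exploitA", "porn1")}
        for score, n_in, n_out, n_two, site in rank_by_connection_count(edges):
            print(site, score, n_in, n_out, n_two)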
Site ranking based on the number of hosted exploit-URLs
Figure 6 illustrates the top 32 sites, each hosting five or more exploit-URLs. This ranking helps to highlight those Web sites whose internal page hierarchy provides important insights. First, some Web sites host a large number of exploit pages with a well-organized hierarchical structure. For example, the #1 site hosts 24 exploit pages that are organized by what appear to be account names for affiliates; many others organize their exploit pages by affiliate IDs or referring site names; some even organize their pages by the names of the vulnerabilities they exploit, and a few of them have the word "exploit" as part of the URL names. The second observation is that some sophisticated Web sites use transient URL names that contain random strings. This is designed to make investigations more difficult. Site ranking based on the number of hosted exploit-URLs helps highlight such sites so that they are prioritized higher for investigation. The zero-day exploits discussed in the next sub-section provide a good example of this.

[Figure 6: Top 32 SP1-UP exploit sites ranked by the number of hosted exploit-URLs.]

4.2 Zero-Day Exploit Detection
In early July 2005, a Stage-3 HoneyMonkey discovered our first zero-day exploit. The javaprxy.dll vulnerability was known at that time without an available patch [28,29], and whether it was actually being exploited was a critical piece of information not previously known. The HoneyMonkey system detected the first exploit page within 2.5 hours of scanning, and it was confirmed to be the first in-the-wild exploit-URL of the vulnerability reported to the Microsoft Security Response Center. A second exploit-URL was detected in the next hour. These two occupy positions #132 and #179, respectively, in our list of 752 monitored URLs. This information enabled the response center to provide customers with a security advisory and a follow-up security bulletin [29],[46].
During the subsequent five days, HoneyMonkey detected that 26 of the 752 exploit-URLs upgraded to the zero-day exploit. Redirection analysis further revealed that 25 of them were redirecting traffic to a previously unknown exploit provider site that was hosting exploit-URLs with names in the following form:
http://82.179.166.2/[8 chars]/test2/iejp.htm
where [8 chars] consists of 8 random characters that appeared to change gradually over time. Takedown notices were sent after further investigation of the installed malware programs, and most of the 25 Web pages stopped exploiting the javaprxy.dll vulnerability shortly after that.

Latest-Patched-Vulnerability Exploit Monitoring
One day after the patch release, HoneyMonkey detected another jump in the number of exploit-URLs for the vulnerability: 53 URLs from 12 sites were upgraded in the subsequent six days. Redirection analysis revealed that all of them were redirecting traffic to a previously known exploit provider (ranked #1 in Figure 6) who added a new exploit page for javaprxy.dll to increase its infection base. A takedown notice was sent, and all 53 URLs stopped exploiting within a couple of days.

Important Observations
This experience provides concrete evidence that the HoneyMonkey system can potentially evolve into a full-fledged, systematic and automatic zero-day exploit monitoring system for browser-based attacks. We make the following observations from the initial success:
(i) Monitoring easy-to-find exploit-URLs is effective: we predicted that monitoring the 752 exploit-URLs would be useful for detecting zero-day exploits because the fact that we could find them quickly within the first month implies that they are more popular and easier to reach. Although zero-day exploits are extremely powerful, they still need to connect to popular Web sites in order to receive traffic to exploit.
If they connect to any of the monitored URLs in our list, the HoneyMonkey can quickly detect the exploits and identify the exploit providers behind the scene through redirection analysis. Our zero-day exploit detection experience confirmed the effectiveness of this approach.
(ii) Monitoring content providers with well-known URLs is effective: we predicted that monitoring content providers would be useful for tracking the potentially dynamic behavior of exploit providers. Unlike exploit providers, who can easily move from one IP address to another and use random URLs, content providers need to maintain their well-known URLs in order to continue attracting browser traffic. The HoneyMonkey takes advantage of this fundamental weakness in the browser-based exploit model and utilizes the content providers as convenient doorways into the exploit network. The effectiveness of this approach was further confirmed by the detection of another zero-day exploit of the msdds.dll vulnerability around mid-August 2005 [30]: 48 doorway pages redirecting to three exploit providers were detected by early September. Of particular importance was a doorway URL from the popular list connecting to an exploit provider who was downloading a malware program not in the signature set of any anti-malware scanners at that time. This demonstrated the value of using HoneyMonkey for malware sample collection to improve the effectiveness of anti-malware programs, as depicted in the bottom row of Figure 3.
(iii) Monitoring highly ranked and advanced exploit-URLs is effective: we predicted that the top exploit sites we identified are more likely to upgrade their exploits because they have a serious investment in this business. Also, Web sites that appear in the SP2-PP graph are more likely to upgrade because they appeared to be more up-to-date exploiters. Both predictions have been shown to be true: the first detected zero-day exploit-URL belongs to the #9 site in Figure 5 (which is registered to the same email address that also owns the #4 and #5 sites), and 7 of the top 10 sites in Figure 5 upgraded to the javaprxy.dll exploit; nearly half of the SP2-PP exploit-URLs in Figure 4 upgraded as well.

4.3 Scanning Popular URLs
By specifically searching for potentially malicious Web sites, we were able to obtain a list of URLs of which 1.28% of the pages perform exploits. A natural question that most Web users will ask is: if I never visit those risky Web sites that serve questionable content, do I have to worry about vulnerability exploits? To answer this question, around July 2005, we gathered the most popular one million URLs as measured by the click-through counts from a search engine and tested them with the HoneyMonkey system. Table 2 summarizes the comparison of key data from this popular list with the previous data from the suspicious list.

Table 2. Comparison of suspicious and popular list data

                                                        Suspicious List     Popular List
    # URLs scanned                                          16,190           1,000,000
    # Exploit-URLs                                        207 (1.28%)       710 (0.071%)
    # Exploit-URLs after redirection (expansion factor)   752 (263%)       1,036 (46%)
    # Exploit sites                                            288               470
    SP2-to-SP1 ratio                                      204/688 = 0.30    131/980 = 0.13

4.3.1 Summary Statistics
Before redirection analysis
Of the one million URLs, HoneyMonkey determined that 710 were exploit pages. This translates into a density of 0.071%, which is between one and two orders of magnitude lower than the 1.28% number from the suspicious-list data. The distribution of exploit-URLs among the ranked list is fairly uniform, which implies that the next million URLs likely exhibit a similar distribution and so there are likely many more exploit-URLs to be discovered. Eleven of the 710 exploit pages are very popular: they are among the top 10,000 of the one million URLs that we scanned. This demonstrates the need for constant, automatic Web patrol of popular pages in order to protect the Internet from large-scale infections.

After redirection analysis
The Stage-2 HoneyMonkey redirection analysis expanded the list of 710 exploit-URLs to 1,036 URLs hosted by 470 sites. This (1,036-710)/710 = 46% expansion is much lower than the 263% expansion in the suspicious-list data, suggesting that the redirection network behind the former is less complex. The SP2-to-SP1 ratio of 0.13 is lower than its counterpart of 0.30 from the suspicious-list data (see Table 2). This suggests that, overall, the exploit capabilities in the popular list are not as advanced as those in the suspicious list, which is consistent with the findings from our manual analysis.
Intersecting the 470 exploit sites with the 288 sites from Section 4.1 yields only 17 sites. These numbers suggest that the degree of overlap between the suspicious list, generally with more powerful attacks, and the popular list is not alarmingly high at this point. But more and more exploit sites from the suspicious list may try to "infiltrate" the popular list to increase their infection base. In total, we have collected 1,780 exploit-URLs hosted by 741 sites.

4.3.2 Site Ranking
Site ranking based on connection counts
Figure 7 illustrates the top 15 SP1-UP exploit sites by connection counts. There are several interesting differences between the two data sets behind the suspicious-list plot (Figure 5) and the popular-list plot (Figure 7). First, there is not a single pair of exploit sites in the popular-list data that are doing two-way traffic redirection, which appears to be unique to the malicious pornography community. Second, while it is not uncommon to see Web sites redirecting traffic to more than 10 or even 20 sites in the suspicious list, sites in the popular-list data redirect traffic to at most 4 sites. This suggests that aggressive traffic redirection/exchange is also a phenomenon unique to the malicious pornography community. Finally, the top four exploit providers in the popular list clearly stand out. None of them have any URLs in the original list of one million primary URLs, but all of them are behind a large number of doorway pages.

[Figure 7: Top 15 exploit sites ranked by connection counts, among the 426 SP1-UP exploit sites from the popular list. Each bar breaks the count into the number of sites from which traffic is received, the number of sites to which traffic is redirected, and the number of sites with two-way redirection.]
The #1 site provides exploits to 75 Web sites primarily in the following five categories: (1) celebrities, (2) song lyrics, (3) wallpapers, (4) video game cheats, and (5) wrestling. The #2 site receives traffic from 72 Web sites, the majority of which are located in one particular country. The #3 site is behind 56 related Web sites that serve cartoon-related pornographic content. The #4 site appears to be an advertising company serving exploiting ads through Web sites that belong to similar categories as those covered by the #1 site.

Site ranking based on the number of hosted exploit-URLs
Figure 8 illustrates the top 32 sites hosting more than one exploit URL. Unlike Figure 7, which highlights mostly exploit provider sites, Figure 8 highlights many content provider sites that host a large number of exploit pages containing a similar type of content. Again, the top four sites stand out: the #1 site is a content provider of video-game-cheat information for multiple game consoles. The #2 site (which also appears as the third entry in Figure 7) hosts a separate URL for each different Web site from which it receives traffic. The #3 site is a content provider that has a separate entry page for each celebrity figure. The #4 site is a content provider of song lyrics with one entry page per celebrity singer.

[Figure 8: Top 32 sites ranked by the number of exploit-URLs, among the 426 SP1-UP exploit sites.]

4.4 Scanning Spam URLs
Between July and September 2006, we used the HoneyMonkey system to scan the top search results of a set of spammer-targeted keywords at the three major search engines and identified five malicious spam URLs. For each URL, we used the search engine to find a forum that had been spammed with the URL, extracted all URLs spammed on the same forum page, and scanned them to see if we could discover additional malicious URLs. Table 3 summarizes the results. For each of the five malicious URLs, we found many other seemingly related URLs spammed on the same forum page, and scan results showed that 14% to 49% of these related URLs were also malicious. In total, from this small set of five "seed URLs", we were able to find 1,054 unique, comment-spammed malicious URLs, most of which were doorway pages residing on free-hosting Web sites.

Table 3: Malicious URLs in Spammed Forums

    Malicious URL                                                       Spammed at (cached pages of) these forums
    1. http://eclatlantus.yiffyhost.com/fesseeprogenco.html             http://members.tripod.com/deportivolapigeon/guestbook.htm
    2. http://humorrise.sitesled.com/burberryhandbag-supplier-wholesale.html   http://sgtrois.webator.net/?2004/11/02/4-un-cdgratuit-offert-par-ladisq
    3. http://alexandrsultz.sitesled.com/chloe/handbag.html             http://www.no-nation.org/article347.html
    4. http://nowodus.tripod.com/new-mexicolottery.html                 http://aose.ift.ulaval.ca/modules.php?op=modload&name=News&file=article&sid=137
    5. http://lermon.t35.com                                            http://68.15.204.73/ngallery/albums/3/1.aspx
    # Malicious URLs (%) of # Spam URLs, across the five forums: 688 (47%) of 1,463; 174 (14%) of 1,280; 164 (49%) of 338; 22 (26%) of 85; 6 (18%) of 33.

Figure 9 illustrates the distribution of malicious URLs among the top-15 domains according to the number of malicious URLs hosted. Clearly, several of these domains are heavily targeted by exploiters.

[Figure 9: Number of comment-spammed malicious URLs hosted on each of the top-15 domains (free-hosting domains such as t35.com, sitesled.com, tripod.com, porkyhost.com, 100megsfree5.com, and 250free.com).]

We also observed that many same-named malicious sub-domains existed on multiple domains. Table 4 illustrates the distribution of the top-10 sub-domains (in rows) across top domains hosting malicious sites (in columns), with "X" indicating that the sub-domain existed on the hosting domain as a malicious site. This is demonstrative of the malicious-site hierarchical structure mentioned in Section 4.1.2, in which exploiters use their account names as the sub-domain names so that payment tracking can be portable across different hosting domains.

[Table 4: Distribution of Malicious Sub-domains. Rows are the top-10 sub-domain (account) names, including Kalovalex, Maripirs, Sadoviktor, Vinnerira, and Footman; columns are hosting domains such as sitesled.com, t35.com, php5.cz, 1accesshost.com, and myopenweb.com; an "X" marks that the sub-domain existed on that hosting domain as a malicious site.]
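The sub-domain correlation summarized in Table 4 is straightforward to compute once the doorway URLs are collected. The following sketch (illustrative only; the URLs are made up) groups comment-spammed URLs by their left-most host label to surface account names reused across free-hosting domains.

    # Sketch: group spammed doorway URLs by their left-most host label to find
    # spammer account names reused across free-hosting domains (Table 4 pattern).
    from collections import defaultdict
    from urllib.parse import urlparse

    def group_by_subdomain(urls):
        groups = defaultdict(set)
        for url in urls:
            host = urlparse(url).hostname or ""
            labels = host.split(".")
            if len(labels) >= 3:                       # e.g. kalovalex.sitesled.com
                account, hosting_domain = labels[0], ".".join(labels[1:])
                groups[account].add(hosting_domain)
        return groups

    if __name__ == "__main__":
        urls = ["http://kalovalex.sitesled.com/a.html",
                "http://kalovalex.t35.com/b.html",
                "http://maripirs.sitesled.com/c.html"]
        for account, domains in group_by_subdomain(urls).items():
            print(account, sorted(domains))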
In addition, we found a total of 75 behind-the-scenes, malicious URLs that received redirection traffic from these thousands of doorway URLs and were responsible for the actual vulnerability-exploit activities. In particular, the following three URLs were behind all five sets of malicious URLs:
http:/ /zllin.info/n/us14/index.php
http:/ /linim.net/fr/?id=us14
http:/ /alllinx.info/fr/?id=us14
All three domains share the same registrant with an Oklahoma state address, whose owner appears to be a major exploiter involved in search spamming.

5 DISCUSSIONS
Now that the effectiveness of the HoneyMonkey system is widely known [22], we expect that exploit sites will start adopting techniques to evade HoneyMonkey detection. We discuss three types of potential evasion techniques and our countermeasures.
Since it has become clear that a weakness of the HoneyMonkey is the time window between a successful exploit that allows foreign code execution and the subsequent execution of the HoneyMonkey exploit detection code, we have developed and integrated a tool called the Vulnerability-Specific Exploit Detector (VSED), which allows the HoneyMonkey to detect and record the first sign of an exploit. Such a detector only works for known vulnerabilities, though; detecting zero-day exploits of totally unknown vulnerabilities remains a challenge. The VSED tool will be discussed in Section 5.4.

5.1 Identifying HoneyMonkey Machines
There are three ways for an exploit site to identify HoneyMonkey machines and skip exploits.
(i) Targeting HoneyMonkey IP addresses: the easiest way to avoid detection is to black-list the IP addresses of HoneyMonkey machines. We plan to counter this by running the HoneyMonkey network behind multiple ISPs with dynamically assigned IP addresses. If an exploit site wants to black-list all IP addresses belonging to these ISPs, it will need to sacrifice a significant percentage of its infection base. One market research study of ISP client membership shows that the top 10 US ISPs service over 62% of US Internet users [25].
(ii) Performing a test to determine whether a human is present: currently, HoneyMonkeys do not click on any dialog box.
A malicious Web site could introduce a one-time dialog box that asks a simple question; after the user clicks the OK button to prove that he is human, the Web site drops a cookie to suppress the dialog box on future visits. More sophisticated Web sites could replace the simple dialog box with a CAPTCHA Turing test [2], although this would raise suspicion because most non-exploiting sites do not use such tests. We will need to incorporate additional intelligence into the HoneyMonkeys to handle dialog boxes and to detect CAPTCHA tests if and when we see Web sites adopting such techniques to evade detection.

(iii) Detecting the presence of a VM or the HoneyMonkey code: Malicious code could detect a VM by executing a series of instructions with high virtualization overhead and comparing the elapsed time to an external reference [50]; by detecting the use of reserved x86 opcodes normally used only by specific VMs [33]; by leveraging information leaked by sensitive, non-privileged instructions [43]; or by observing certain file directory contents known to be associated with UML (User-Mode Linux) [8], or a specific hardware configuration, default MAC address, or I/O backdoor associated with VMware [23]. Most VM-detection techniques exist because x86 processors are not fully virtualizable. Fortunately, both Intel [52] and AMD [40] have proposed architecture extensions that would make x86 processors fully virtualizable and thus make detecting a VM more difficult. In the meantime, we can adopt anti-detection techniques that target known VM-detection methods [8],[50]. As VMs are increasingly used as general computing platforms, the approach of detecting HoneyMonkeys by detecting VMs will become less effective.

Alternatively, we have developed techniques that allow us to also run HoneyMonkeys on non-virtual machines, so that the results can be cross-checked to identify sophisticated attackers. We implemented support to efficiently checkpoint our system (both memory and disk state) when it is in a known-good state and to roll back to that checkpoint after an attack has been detected. To checkpoint memory, we utilized the hibernation functionality already present in Windows to efficiently store and restore memory snapshots. To support disk checkpoints, we implemented copy-on-write disk functionality by modifying the generic Windows disk class driver that is used by most disks today. Our copy-on-write implementation divides the physical disk into two equally sized partitions. We use the first partition to hold the default disk image that we roll back to when restoring a checkpoint, and the second partition as scratch space to store all disk writes made after taking a checkpoint. We maintain a bitmap in memory to record which blocks have been written, so that we know which partition contains the most recent version of each block. As a result, no extra disk reads or writes are needed to provide copy-on-write functionality, and a rollback can be accomplished simply by zeroing out the bitmap.
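The block-level bookkeeping described above can be summarized by a small model. The following Python sketch is illustrative only; the actual mechanism lives inside a modified Windows disk class driver, and the class and method names here are ours:

class CopyOnWriteDisk:
    """Toy model of the checkpoint/rollback disk: the first partition holds the
    checkpointed image, the second is scratch space for post-checkpoint writes."""

    def __init__(self, checkpoint_blocks):
        self.base = list(checkpoint_blocks)       # checkpointed image (never overwritten)
        self.scratch = [None] * len(self.base)    # writes made after the checkpoint
        self.dirty = [False] * len(self.base)     # in-memory bitmap: which partition is current

    def write(self, block, data):
        self.scratch[block] = data                # redirect the write to the scratch partition
        self.dirty[block] = True

    def read(self, block):
        # The bitmap decides which partition holds the most recent copy,
        # so no extra disk I/O is needed to support copy-on-write.
        return self.scratch[block] if self.dirty[block] else self.base[block]

    def rollback(self):
        # Restoring the checkpoint amounts to forgetting the scratch writes.
        self.dirty = [False] * len(self.base)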
To provide further protection, we can adopt resource-hiding techniques to hide the driver from sophisticated attackers who try to detect it in order to identify a HoneyMonkey machine.

Some exploit sites may be able to obtain the "signatures" of the HoneyMonkey logging infrastructure and build a detection mechanism that allows them to disable the logging or tamper with the log. Since such detection code can be executed only after a successful exploit, we can use VSED to detect occurrences of exploits and highlight those that do not have a corresponding file-creation log. Additionally, we are incorporating log-signing techniques to detect missing or modified log entries. We note that some classes of exploits require writing a file to disk and then executing that file to run arbitrary code. These exploits cannot escape our detection by trying to identify a HoneyMonkey machine, because our file-based detection occurs before they can execute any code.

5.2 Exploiting without Triggering HoneyMonkey Detection

Currently, HoneyMonkey cannot detect exploits that do not make any persistent-state changes or that make such changes only inside the browser sandbox. Even with this limitation, the HoneyMonkey is able to detect most of today's Trojans, backdoors, and spyware programs, which rely on significant persistent-state changes to enable automatic restart upon reboot. Again, the VSED tool can help address this limitation.

HoneyMonkeys wait only a few minutes on each URL, so a possible evasion technique is to delay the exploit. However, such delays reduce the chance of successful infections because real users may close the browser before the exploit happens. We plan to run HoneyMonkeys with random wait times and to highlight exploit pages that exhibit inconsistent behaviors across runs for more in-depth manual analysis.

5.3 Randomizing the Attacks

Exploit sites may try to inject nondeterministic behavior to complicate HoneyMonkey detection. For example, they may randomly exploit only one in every N browser visits. We consider this an acceptable trade-off: while it would require multiple scans by the HoneyMonkeys to detect an exploit, it also forces the exploit sites to reduce their infection rates by a factor of N. If a major exploit provider is behind more than N monitored content providers, the HoneyMonkey can still detect it through redirection tracking in one round of scans.

Exploit sites may also try to randomize URL redirections by selecting, on each visit, a random subset of machines to receive forwarded traffic from a large set of infected machines that are made to host exploit code. Our site-ranking algorithm based on connection counts should discourage this, because such sites would end up ranking themselves higher for investigation. They would also reveal the identities of the infected machines, whose owners can be notified to clean up their machines.

5.4 Vulnerability-Specific Exploit Detector (VSED)

To address some of the limitations discussed above and to provide additional information on the exact vulnerabilities being exploited, we have developed a vulnerability-specific detector, called VSED, and integrated it into the HoneyMonkey. The VSED tool implements a source-code-level, vulnerability-specific intrusion detection technique similar to IntroVirt [31]. For each vulnerability, we manually write "predicates" that test the state of the monitored program to determine when an attacker is about to trigger the vulnerability. VSED operates by inserting breakpoints within the buggy code to stop execution before potentially malicious code runs, allowing secure logging of an exploit alert. For example, VSED could detect a buffer overflow involving the "strcpy" function by setting a breakpoint right before the buggy "strcpy" executes. Once VSED stops the application, the predicate examines the variables passed into "strcpy" to determine whether an overflow is about to happen.
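As an illustration, such a strcpy predicate reduces to a bounds comparison at the breakpoint. The following sketch is hypothetical: in a real deployment the destination capacity and source bytes would be read out of the stopped process, and the alert would be written to the secure log rather than passed to a callback.

def strcpy_would_overflow(dest_capacity: int, src: bytes) -> bool:
    """True if copying src plus its NUL terminator would exceed the destination buffer."""
    return len(src) + 1 > dest_capacity

def on_strcpy_breakpoint(dest_capacity: int, src: bytes, log_alert) -> None:
    """Invoked when the breakpoint placed just before the vulnerable strcpy fires;
    the alert is recorded before the buggy copy is allowed to execute."""
    if strcpy_would_overflow(dest_capacity, src):
        log_alert({"predicate": "strcpy-overflow",
                   "src_len": len(src),
                   "dest_capacity": dest_capacity})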
One limitation of VSED is that it cannot identify zero-day exploits of unknown vulnerabilities. To evaluate the effectiveness of VSED for detecting browser-based exploits, we wrote predicates for six recent IE vulnerabilities and tested them against the exploit-URLs from both the suspicious list and the popular list. Although we have not yet built a comprehensive list of predicates, we can already pinpoint the vulnerabilities exploited by hundreds of exploit-URLs.

6 RELATED WORK

There is a rich body of literature on honeypots. Most honeypots are deployed to mimic vulnerable servers waiting for attacks from client machines [18],[39],[27],[32]. In contrast, HoneyMonkeys mimic clients drawing attacks from malicious servers. To our knowledge, there are six other projects related to the concept of client-side honeypots. Three are active client honeypots: HoneyClient [19] and Moshchuk et al.'s study of spyware [63], both of which were done in parallel with HoneyMonkey, and the more recent work of Provos et al. [65]. The other three are passive client honeypots used as part of intrusion detection systems: email quarantine, shadow honeypots, and SpyProxy.

In parallel to our work, the HoneyClient project [19] shares the same goal of identifying browser-based attacks. However, the project has not published any deployment experience or any data on detected exploit-URLs. There are also several major differences in implementation: HoneyMonkeys are VM-based, use a pipeline of machines with different patch levels, and track URL redirections.

Also parallel to HoneyMonkey are systems deployed for measurement studies of malware on the Web. Moshchuk et al. [63] focused on spyware. Their tools actively sought out executables for download as well as Web sites performing "drive-by downloads." They used virtual machines in a state-change-based approach, together with anti-spyware tools, to determine whether the executables were malicious. Their system had built-in heuristics to deal with nondeterministic elements such as time bombs, pop-up windows, and other JavaScript tricks. They determined through three separate crawls that, although there was a reduction in malicious Web pages, spyware's presence on the Web remains significant.

Provos et al. [65] conducted another measurement study, also using a state-change-based approach. Their goal was to gauge the state of Web-based malware over a twelve-month period by identifying and finding examples of four injection mechanisms: Web server security, user-contributed content, advertising, and third-party widgets. Using data from Google crawls, they identified potentially malicious URLs, which they fed to Internet Explorer running in a virtual machine. Using anti-virus engines and detecting state changes to the file system, the registry, and the set of spawned processes, they classified URLs as harmless, malicious, or inconclusive. They found not only that malware is distributed across large numbers of URLs and domains, but also that malware binaries change frequently. Furthermore, they discovered that machines compromised by Web-based malware communicate with malicious servers in botnet-like structures. In May 2007, the same team released a blog post [68] that re-emphasized the significance of the finding that 0.1% of sampled URLs were malicious; in the context of billions of URLs, this means that millions are malicious. Their findings are consistent with our finding that 0.071% of URLs are malicious.
Sidiroglou et al. [47] described an email quarantine system that intercepts every incoming message, "opens" all suspicious attachments inside instrumented virtual machines, uses behavior-based anomaly detection to flag potentially malicious actions, quarantines flagged emails, and delivers only messages that are deemed safe. Anagnostakis et al. [4] proposed the technique of "shadow honeypots," which is applicable to both servers and clients. The key idea is to combine anomaly detection with honeypots by diverting suspicious traffic identified by anomaly detectors to a shadow version of the target application that is instrumented to detect potential attacks and filter out false positives. As a demonstration of client-side protection, the authors deployed their prototype on the Mozilla Firefox browser.

SpyProxy [64] is optimized to detect malicious Web sites transparently and in real time, using a state-change-based approach. While the HoneyMonkeys are designed for detection and measurement, SpyProxy is designed to act as a proxy for client browsers, intercepting HTTP requests and executing Web content on the fly in a virtual machine before forwarding the results to the client. The authors deployed SpyProxy as a network service and tested it against a list of 100 malicious Web pages from 45 Web sites; SpyProxy was 100% effective in blocking malware in that experiment.

The three client-side honeypots described above are passive in that they are given existing traffic and do not actively solicit traffic. In contrast, HoneyMonkeys are active and are responsible for seeking out malicious Web sites and drawing attack traffic from them. The former approach has the advantage of providing effective, focused protection of a targeted population. The latter has the advantages of staying out of the application's critical path and achieving broader coverage, but it does require additional defense against potential traps and black holes during the recursive redirection analysis. The two approaches are complementary and can be used in conjunction to provide maximum protection.

Existing honeypot techniques can be categorized using two other criteria: (1) physical honeypots [32] with dedicated physical machines versus virtual honeypots built on Virtual Machines [51],[49]; and (2) low-interaction honeypots [39], which only simulate the network protocol stacks of different operating systems, versus high-interaction honeypots [27], which provide an authentic decoy system environment. HoneyMonkeys belong to the category of high-interaction, virtual honeypots.

In contrast with the black-box, state-change-based detection approach used in HoneyMonkey, several papers have proposed vulnerability-oriented detection methods, which can be further divided into vulnerability-specific and vulnerability-generic methods. The former include Shield [58], a network-level filter designed to detect worms exploiting known vulnerabilities, and IntroVirt [31], a technique for specifying and monitoring vulnerability-specific predicates at the code level. The latter include system-call-based intrusion detection systems [13],[14], memory layout randomization [5],[59], non-executable pages [1], and pointer encryption [7]. An advantage of vulnerability-oriented techniques is the ability to detect an exploit earlier and to identify the exact vulnerability being exploited.
As discussed in Section 5.4, we have incorporated IntroVirt-style, vulnerability-specific detection capability into the HoneyMonkey.

7 SUMMARY

We have presented the design and implementation of the Strider HoneyMonkey system as the first systematic method for automated Web patrol in the hunt for malicious Web sites that exploit browser vulnerabilities. Our analyses of two sets of data showed that the densities of malicious URLs are 1.28% and 0.071%, respectively. In total, we have identified a large community of 741 Web sites hosting 1,780 exploit-URLs. Furthermore, using a small set of only five seed URLs, we discovered 1,054 additional malicious URLs that were spammed in the comment fields of public forums. We proposed redirection analysis to capture the relationships between exploit sites, and site-ranking algorithms based on the number of directly connected sites and the number of hosted exploit-URLs to identify major players. Our success in detecting the first-reported, in-the-wild, zero-day exploit-URL of the javaprxy.dll vulnerability demonstrates the effectiveness of our approach, which monitors easy-to-find content providers with well-known URLs as well as top exploit providers with advanced exploit capabilities. Finally, we discussed several techniques that malicious Web sites could adopt to evade HoneyMonkey detection, which motivated us to incorporate an additional vulnerability-specific exploit detection mechanism to complement the HoneyMonkey's core black-box exploit detection approach.

ACKNOWLEDGMENT

The authors wish to thank Ming Ma for his help in finding malicious spam pages.

REFERENCES

[1] S. Andersen and V. Abella, "Data Execution Prevention: Changes to Functionality in Microsoft Windows XP Service Pack 2, Part 3: Memory Protection Technologies," http://www.microsoft.com/technet/prodtechnol/winxppro/maintain/sp2mempr.mspx.
[2] L. von Ahn, M. Blum, and J. Langford, "Telling Humans and Computers Apart Automatically," in Communications of the ACM, Feb. 2004.
[3] Alexa, http://www.alexa.com/.
[4] K. Anagnostakis, S. Sidiroglou, P. Akritidis, K. Xinidis, E. Markatos, and A. Keromytis, "Detecting Targeted Attacks Using Shadow Honeypots," in Proc. USENIX Security Symposium, August 2005.
[5] PaX Address Space Layout Randomization (ASLR), http://pax.grsecurity.net/docs/aslr.txt.
[6] Xpire.info, http://www.vitalsecurity.org/xpire-splitinfinityserverhack_malwareinstall-condensed.pdf, Nov. 2004.
[7] C. Cowan, S. Beattie, J. Johansen, and P. Wagle, "PointGuard: Protecting pointers from buffer overflow vulnerabilities," in Proc. USENIX Security Symposium, August 2003.
[8] C. Carella, J. Dike, N. Fox, and M. Ryan, "UML Extensions for Honeypots in the ISTS Distributed Honeypot Project," in Proc. IEEE Workshop on Information Assurance, 2004.
[9] "Webroot: CoolWebSearch Top Spyware Threat," TechWeb, March 30, 2005, http://www.techweb.com/showArticle.jhtml?articleID=160400314.
[10] Download.Ject, http://www.microsoft.com/security/incident/download_ject.mspx, June 2004.
[11] Ben Edelman, "Who Profits from Security Holes?" Nov. 2004, http://www.benedelman.org/news/111804-1.html.
[12] "Follow the Money; or, why does my computer keep getting infested with spyware?" http://www.livejournal.com/users/tacit/125748.html.
[13] S. Forrest, S. Hofmeyr, A. Somayaji, and T. Longstaff, "A sense of self for Unix processes," in Proc. IEEE Symposium on Security and Privacy, May 1996.
[14] H. Feng, O. Kolesnikov, P. Fogla, W. Lee, and W. Gong, "Anomaly detection using call stack information," in Proc. IEEE Symposium on Security and Privacy, May 2003.
[15] "Googkle.com installed malware by exploiting browser vulnerabilities," http://www.f-secure.com/v-descs/googkle.shtml.
[16] Archana Ganapathi, Yi-Min Wang, Ni Lao, and Ji-Rong Wen, "Why PCs Are Fragile and What We Can Do About It: A Study of Windows Registry Problems," in Proc. IEEE Dependable Computing and Communications Symposium, June 2004.
[17] Z. Gyongyi and H. Garcia-Molina, "Web Spam Taxonomy," in Proc. International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[18] The Honeynet Project, http://www.honeynet.org/.
[19] HoneyClient Development Project, http://www.honeyclient.org/.
[20] "Another round of DNS cache poisoning," Handlers Diary, March 30, 2005, http://isc.sans.org/.
[21] hpHOSTS community managed hosts file, http://www.hosts-file.net/downloads.html.
[22] Strider HoneyMonkey Exploit Detection, http://research.microsoft.com/HoneyMonkey.
[23] T. Holz and F. Raynal, "Detecting Honeypots and other suspicious environments," in Proc. IEEE Workshop on Information Assurance and Security, 2005.
[24] "iframeDOLLARS dot biz partnership maliciousness," http://isc.sans.org/diary.php?date=2005-05-23.
[25] ISP Ranking by Subscriber, http://www.isp-planet.com/research/rankings/index.html.
[26] "Scammers use Symantec, DNS holes to push adware," InfoWorld.com, March 7, 2005, http://www.infoworld.com/article/05/03/07/HNsymantecholesandadware_1.html?DESKTOP%20SECURITY.
[27] Xuxian Jiang and Dongyan Xu, "Collapsar: A VM-Based Architecture for Network Attack Detention Center," in Proc. USENIX Security Symposium, Aug. 2004.
[28] Microsoft Security Advisory (903144): A COM Object (Javaprxy.dll) Could Cause Internet Explorer to Unexpectedly Exit, http://www.microsoft.com/technet/security/advisory/903144.mspx.
[29] Microsoft Security Bulletin MS05-037: Vulnerability in JView Profiler Could Allow Remote Code Execution (903235), http://www.microsoft.com/technet/security/bulletin/ms05-037.mspx.
[30] Microsoft Security Advisory (906267): A COM Object (Msdds.dll) Could Cause Internet Explorer to Unexpectedly Exit, http://www.microsoft.com/technet/security/advisory/906267.mspx.
[31] Ashlesha Joshi, Sam King, George Dunlap, and Peter Chen, "Detecting Past and Present Intrusions Through Vulnerability-Specific Predicates," in Proc. ACM Symposium on Operating Systems Principles, 2005.
[32] Sven Krasser, Julian Grizzard, Henry Owen, and John Levine, "The Use of Honeynets to Increase Computer Network Security and User Awareness," in Journal of Security Education, vol. 1, no. 2/3, pp. 23-37, March 2005.
[33] Lallous, "Detect if your program is running inside a Virtual Machine," March 2005, http://www.codeproject.com/system/VmDetect.asp.
[34] Microsoft Security Bulletin MS05-002: Vulnerability in Cursor and Icon Format Handling Could Allow Remote Code Execution, http://www.microsoft.com/technet/security/Bulletin/MS05-002.mspx.
[35] Microsoft Security Bulletin MS03-011: Flaw in Microsoft VM Could Enable System Compromise, http://www.microsoft.com/technet/security/Bulletin/MS03-011.mspx.
[36] Microsoft Security Bulletin MS04-013: Cumulative Security Update for Outlook Express, http://www.microsoft.com/technet/security/Bulletin/MS04-013.mspx.
[37] Y. Niu, Y. M. Wang, H. Chen, M. Ma, and F. Hsu, "A Quantitative Study of Forum Spamming Using Context-based Analysis," in Proc. Network and Distributed System Security Symposium, 2007.
[38] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," http://dbpubs.stanford.edu:8090/pub/1999-66, Jan. 29, 1998.
[39] Niels Provos, "A Virtual Honeypot Framework," in Proc. USENIX Security Symposium, Aug. 2004.
[40] AMD Pacifica Virtualization Technology, http://enterprise.amd.com/downloadables/Pacifica.ppt.
[41] "Russians use affiliate model to spread spyware," http://www.itnews.com.au/newsstory.aspx?CIaNID=18926.
[42] Team Register, "Bofra exploit hits our ad serving supplier," http://www.theregister.co.uk/2004/11/21/register_adserver_attack/, November 2004.
[43] Red Pill, http://invisiblethings.org/papers/redpill.html.
[44] Symantec Gateway Security Products DNS Cache Poisoning Vulnerability, http://securityresponse.symantec.com/avcenter/security/Content/2004.06.21.html.
[45] "Michael Jackson suicide spam leads to Trojan horse," Sophos, June 9, 2005, http://www.sophos.com/virusinfo/articles/jackotrojan.html.
[46] "What is Strider HoneyMonkey," http://research.microsoft.com/honeymonkey/article.aspx, Aug. 2005.
[47] Stelios Sidiroglou and Angelos D. Keromytis, "A Network Worm Vaccine Architecture," in Proc. Information Security Practice and Experience Conference, April 2005.
[48] Michael Ligh, "Tri-Mode Browser Exploits - MHTML, ANI, and ByteVerify," http://www.mnin.org/write/2005_trimode.html, April 30, 2005.
[49] Know Your Enemy: Learning with User-Mode Linux. Building Virtual Honeynets using UML, http://www.honeynet.org/papers/uml/.
[50] Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex Snoeren, Geoff Voelker, and Stefan Savage, "Scalability, Fidelity and Containment in the Potemkin Virtual Honeyfarm," in Proc. ACM Symposium on Operating Systems Principles, Oct. 2005.
[51] Know Your Enemy: Learning with VMware. Building Virtual Honeynets using VMware, http://www.honeynet.org/papers/vmware/.
[52] Vanderpool Technology, Technical Report, Intel Corporation, 2005.
[53] Yi-Min Wang et al., "STRIDER: A Black-box, State-based Approach to Change and Configuration Management and Support," in Proc. Usenix LISA, Oct. 2003.
[54] Yi-Min Wang et al., "Gatekeeper: Monitoring Auto-Start Extensibility Points (ASEPs) for Spyware Management," in Proc. Usenix LISA, 2004.
[55] Yi-Min Wang, Doug Beck, Binh Vo, Roussi Roussev, and Chad Verbowski, "Detecting Stealth Software with Strider GhostBuster," in Proc. International Conference on Dependable Systems and Networks, June 2005.
[56] Yi-Min Wang, Doug Beck, Jeffrey Wang, Chad Verbowski, and Brad Daniels, "Strider Typo-Patrol: Discovery and Analysis of Systematic Typo-Squatting," in Proc. 2nd Workshop on Steps to Reducing Unwanted Traffic on the Internet (SRUTI), July 2006.
[57] Yi-Min Wang, Doug Beck, Xuxian Jiang, and Roussi Roussev, "Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities," Microsoft Research Technical Report MSR-TR-2005-72, July 2005, ftp://ftp.research.microsoft.com/pub/tr/TR-2005-72.pdf.
[58] Helen J. Wang, Chuanxiong Guo, Daniel R. Simon, and Alf Zugenmaier, "Shield: Vulnerability-Driven Network Filters for Preventing Known Vulnerability Exploits," in Proc. ACM SIGCOMM, August 2004.
[59] J. Xu, Z. Kalbarczyk, and R. K. Iyer, "Transparent Runtime Randomization for Security," in Proc. Symposium on Reliable and Distributed Systems (SRDS), October 2003.
[60] Strider URL Tracer, http://research.microsoft.com/URLTracer.
[61] "Microsoft, Washington AG sue antispyware company," http://www.computerworld.com/printthis/2006/0,4814,108047,00.html, Jan. 25, 2006.
[62] "Attorney General McKenna Announces $1 Million Settlement in Washington's First Spyware Suit," http://www.atg.wa.gov/releases/2006/rel_Secure_Computer_Restitution_2006.html, Dec. 4, 2006.
[63] Alexander Moshchuk, Tanya Bragin, Steven D. Gribble, and Henry M. Levy, "A Crawler-based Study of Spyware on the Web," in Proc. Network and Distributed System Security Symposium, San Diego, CA, February 2006.
[64] Alexander Moshchuk, Tanya Bragin, Damien Deville, Steven D. Gribble, and Henry M. Levy, "SpyProxy: Execution-based Detection of Malicious Web Content," in Proc. USENIX Security Symposium, Boston, MA, August 2007.
[65] Niels Provos, Dean McNamee, Panayiotis Mavrommatis, Ke Wang, and Nagendra Modadugu, "The Ghost in the Browser: Analysis of Web-based Malware," in Proc. Workshop on Hot Topics in Understanding Botnets, Cambridge, MA, April 2007.
[66] "Bofra exploit hits our ad serving supplier," http://www.theregister.com/2004/11/21/register_adserver_attack/, Nov. 11, 2004.
[67] Brian Krebs, "Hacked Ad Seen on MySpace Served Spyware to a Million," http://blog.washingtonpost.com/securityfix/2006/07/myspace_ad_served_adware_to_mo.html, Jul. 19, 2006.
[68] Panayiotis Mavrommatis and Niels Provos, "Introducing Google's online security efforts," http://googleonlinesecurity.blogspot.com/2007/05/introducing-googles-antimalware.html, May 21, 2007.