Spamscatter: Characterizing Internet Scam Hosting Infrastructure
David S. Anderson, Chris Fleizach, Stefan Savage, and Geoffrey M. Voelker
University of California, San Diego
Aug. 9th, 2007 Usenix Security 2007

Spamscatter Introduction

Motivation
• 70 billion spam messages are sent every day for a simple reason: advertising websites.
• A scam, then, is any website marketed using spam.
• This online resource is directly implicated in the spam profit cycle, meaning it is rarer and more valuable.
• Characterizing the scam infrastructure helps
  – Reveal the dynamics and business pressures exerted on spammers
  – Identify means to reduce unwanted sites and spam

Spamscatter Approach
• Mine a large quantity of spam
  – Extract URLs
  – Probe machines hosting the scams
• This works because URLs must be correct
  – Follow the scent of money…
• All we need is a reliably large source of spam
  – We have access to a four-letter top-level domain producing 150K spam messages per day

Understanding scams
• Are scams distributed across different servers?
• Do different scams share the same server?
• How long do scams stay active? How reliable is their hosting?
• Where are scam servers located?
• Why is it useful to study these characteristics?

Spamscatter Methodology

Spamscatter and the Scam

Methodology
• Data collection
  – Extract links from large spam feed
  – Probe links every 3 hours for 7 days
  – Record browser redirection
  – Save screenshots
• Analysis
  – Identify scams across servers and domains
  – Report on distributed and shared infrastructure, lifetime, stability, and location

Identifying Scams
• Goal: Identify multiple hosts in the same scam, since many scams are spread across different IPs and domain names
• Naïve approaches:
  1. Correlate independent spam emails
  2. Use HTML content returned from the web server
• Limitations:
  – Spam has too much chaff and obfuscation
  – HTML is uninteresting and mostly composed of images
  – Web crawlers fail with frames, iframes, and JavaScript

Image Shingling
• Solution: Use rendered screenshots of web pages for correlation.
  – How to compare upwards of 10,000 images?
• Image shingling, based on the text shingling idea [BRO97]
  – Fragment images into blocks and hash the blocks
  – Two images are similar if T% of the hashed blocks are the same (T = 70-80%)
  – Shingling allows us to essentially compare all images in O(N lg N)
  – Resilient to small variations among images

Spamscatter An Example Scam

An Example Scam: “Downloadable Software”
Scam Perspective
• 99 observed virtual hosts
• 3 IP addresses
• Operated for months
• 85 senders
• No forwarding used
• 5535 probes (97% successful)

Clustering with Image Shingling
• Images differ slightly
• Some pages rotate content

Location
• 2 web servers in China; 1 web server in Russia
• 85 senders from 30 countries (28 from US)
Map legend: Blue – Web servers hosting Downloadable Software; Red – Spam relays (hosts that sent us spam)

Shared Infrastructure
• One of the IPs (221.4.246.3) hosting “Downloadable Software” was also hosting “Toronto Pharmacy”
• Server located in Guangzhou, China

Spamscatter Results

Summary Statistics
1 week of spam collection: Nov. 28th – Dec. 4th
2 weeks of probing: Nov. 28th – Dec. 11th

  Count        Share of previous row   Description
  1,087,711                            Spam messages
  319,700      30%                     contain links
  36,390       11.3%                   are distinct links
  7,029        19.3%                   resolve to unique IP addresses
  2,334        33.2%                   resolve to distinct scams

Spamscatter Results - Infrastructure

Distributed Infrastructure
To what extent is the infrastructure distributed for scams?
• Most scams are not distributed:
  – 94% used one IP
• Top three distributed scams were extensive
  – 22, 30, and 45 IPs
• Top three virtual-hosted scams
  – 110, 695, and 3029 domain names

Shared Infrastructure
To what extent do multiple scams share infrastructure?
• 38% of scams hosted on a machine with at least one other scam
• 10 IPs hosted 10 or more scams
• Top three shared IPs
  – 15, 18, and 22 scams

Spamscatter Results - Lifetime

Scam Lifetime & Stability
How long are scams active, and how reliable are the hosts?
• Scam web hosts seem to be taken down shortly after scams disappear
• Overall scam lifetime approached two weeks
• Reliability is high, usually > 97%

Spam campaign lifetime
How long do spam campaigns last for a scam?
• 137 spam messages per scam (average)
• Most spam campaigns relatively short
  – 88% last 20 hours or less
• Only 8% last more than 2 days
• Scam lifetimes considerably longer
  – on average one week
Figure annotations: < 20 hours, < 2 days

Spamscatter Results - Location

Location
Where are scam hosting servers located?
Map legend: Blue – Web servers; Red – Spam relays

Location

Web Servers
  Rank  Country  Count   Percent
  1.    usa      5884    57.40%
  2.    chn      741     7.23%
  3.    can      379     3.70%
  4.    gbr      315     3.07%
  5.    fra      314     3.06%
  6.    deu      258     2.52%
  7.    rus      185     1.80%
  8.    kor      181     1.77%

Spam Relays
  Rank  Country  Count   Percent
  1.    usa      54159   14.50%
  2.    fra      26371   7.06%
  3.    esp      25196   6.75%
  4.    chn      24833   6.65%
  5.    pol      21199   5.68%
  6.    ind      20235   5.42%
  7.    deu      18678   5.00%
  8.    kor      17446   4.67%

Spamscatter Results - Categorization

Scam Categorization
  Scam category                   % of scams
  Uncategorized                   29.57%
  Information Technology          16.67%
  Dynamic Content                 11.52%
  Business and Economy            6.23%
  Shopping                        4.30%
  Financial Data and Services     3.61%
  Illegal or Questionable         2.15%
  Adult                           1.80%
  Message Boards and Clubs        1.80%
  Web Hosting                     1.63%

Lifetime of scams with Categorization
• More than 40% of malicious scams disappear before 120 hours
• The same is true for less than 15% of all scams

Spamscatter Conclusion

Summary
• Started with over 1 million spam messages and coalesced them into fewer than 2,500 scams.
• Image shingling allowed us to scalably determine whether two sites were part of the same scam
• Most scams use one web server (vulnerable to blacklisting)
  – Scams may use many virtual domains that point to one IP
• Most scams are not malicious per se
• Scam infrastructure is more stable, longer lived, and more concentrated in the US than spam senders

Spammers beware; These boffins are on the prowl
Questions and Answers

Spamscatter Supplementary Information

Spamscope Visibility
• Collected spam from news.admin.net-abuse.sightings, a newsgroup for contributing spam
• For a 3 day period, we saw
  – 6,977 spam messages from the newsgroup: 205 scams
  – 113,216 spam messages from our feed: 1,687 scams
• 12% of the newsgroup scams were in ours
• The “largest” scams (most emails and most domains/IP) were seen in both feeds

Spamscatter Results - Blacklisting

Blacklists
  Host type    Classification  % of hosts
  Spam relay   Open proxy      72.27%
  Spam relay   Spam host       5.86%
  Scam host    Open proxy      2.06%
  Scam host    Spam host       14.86%
9.7% of the scam hosts also sent us spam

Spamscatter Supplementary Information

Web Server OS
  Rank  Operating system            Percent
  1     Linux recent 2.4 (1)        11.97%
  2     Windows 2000 (SP1+)         11.05%
  3     Akamai ???                  10.86%
  4     Windows 2000 SP4            8.25%
  5     Linux recent 2.4 (2)        7.84%
  6     FreeBSD 4.6-4.8             7.72%
  7     Slashdot or BusinessWeek    7.04%
  8     FreeBSD 5.0                 6.49%
  9     Windows XP SP1              5.90%
  10    Linux older 2.4             5.56%

URL Classification
  Category                                    Percent
  WISP Dynamic Content                        17.931%
  WISP Uncategorized                          13.965%
  WISP Illegal or Questionable                10.306%
  WISP Information Technology                 9.051%
  WISP Shopping                               4.872%
  WISP Business and Economy                   4.733%
  WISP Financial Data and Services            4.626%
  WISP Personals and Dating                   1.867%
  WISP Advertisements                         1.249%
  WISP Educational Institutions               1.247%
  WISP Pay-to-Surf                            1.022%
  WISP Search Engines and Portals             0.884%
  WISP Supplements and Unregulated Compounds  0.865%
  WISP Sex                                    0.862%

Image Clustering
1 week of spam collection: Nov. 28th – Dec. 4th
2 weeks of probing: Nov. 28th – Dec. 11th

  Count        Description
  2,541,486    Total probes
  250,864      9.8% of probes result in a captured image
  9,572        3.8% of screenshots are the 'first' screenshot for a scam
  2,334        Clusters detected by image shingling

Image Shingling
• For a typical day of screenshots, we tested various thresholds
• A 70% threshold provided a good balance between flexibility and accuracy

Overlap of pairs of scams on the same server
For scams running on the same server, how much time do they overlap?
• 96% of all scam pairs overlapped with each other when they remained active
• Only 10% of scams fully overlapped each other
Figure annotation: One week

IP ranges
What are the network locations of scams and spam relays?
• The cumulative distribution of IP addresses is highly nonuniform
• Majority of spam relays (60%) fall between 58.* -> 91.*
• Most scams (50%) fall between 64.* -> 72.*
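
Sketch: image shingling comparison
The image shingling idea from the Methodology slides (fragment images into blocks, hash the blocks, and call two images similar if T% of the hashed blocks match) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the 4-pixel block size, the SHA-1 hash, the multiset comparison, and the representation of a screenshot as a grid of integers are all assumptions made for illustration; only the 70% threshold comes from the slides. The sketch compares a single pair of images and does not attempt the O(N lg N) all-images comparison.

```python
import hashlib
from collections import Counter


def shingle(pixels, block=4):
    """Split a 2-D grid of pixel values into block x block tiles and
    return one hash per tile (hypothetical helper, block size assumed)."""
    hashes = []
    for top in range(0, len(pixels), block):
        for left in range(0, len(pixels[0]), block):
            tile = tuple(
                tuple(row[left:left + block])
                for row in pixels[top:top + block]
            )
            hashes.append(hashlib.sha1(repr(tile).encode()).hexdigest())
    return hashes


def similarity(image_a, image_b, block=4):
    """Fraction of tile hashes the two images share, counted as a multiset
    and normalized by the larger image (one possible reading of 'T%')."""
    a = Counter(shingle(image_a, block))
    b = Counter(shingle(image_b, block))
    shared = sum((a & b).values())
    return shared / max(sum(a.values()), sum(b.values()))


def same_scam(image_a, image_b, threshold=0.70, block=4):
    """Treat two screenshots as the same scam when enough tiles match."""
    return similarity(image_a, image_b, block) >= threshold


# Toy 8x8 "screenshots": four 4x4 tiles filled with the values 0, 1, 2, 3.
a = [[(r // 4) * 2 + (c // 4) for c in range(8)] for r in range(8)]
b = [row[:] for row in a]
b[0][0] = 99                      # small variation confined to one tile
c = [[255] * 8 for _ in range(8)]  # unrelated page

print(similarity(a, b), same_scam(a, b))  # 0.75 True
print(similarity(a, c), same_scam(a, c))  # 0.0 False
```

Because only the tile containing the changed pixel hashes differently, a small variation costs one tile out of four here, which is the sense in which the slides describe shingling as resilient to small variations among images.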