Using big data analytics to identify malicious content: a case study on spam emails
Mamoun Alazab & Roderic Broadhurst
Mamoun.alazab@anu.edu.au
http://cybercrime.anu.edu.au

Outline
• Background
• Cybercrime and SPAM?
• Importance of Big Data Analytics
• Data description & Analysis
• Summary
• Q&A

Background (ANU Cybercrime Observatory¹)
Team Research Interests:
― Criminology/Sociology
― Organised Crime
― Law & Regulation
― Information Security
― Malware Analysis
― Phishing attacks
― Police and media cases
― Computer Forensics
¹ http://cybercrime.anu.edu.au

[Diagram labels: Spammed Messages, Social Networking Websites, Worm, Malicious Websites, Removable Devices, Install Malware, Become Zombie]
Spam, as a form of ‘social engineering’, lets malware reach ‘high volume, low value’ targets, which makes it one of the most popular means of spreading and injecting malware onto computers.

Cybercrime definition from legislation
Definition of high tech crime (Australian Federal Police)¹: High tech crime offences are defined in Commonwealth legislation in Part 10.7 - Computer Offences of the Criminal Code Act 1995 and include:
• computer intrusions (for example, malicious hacking)
• unauthorised modification of data, including destruction of data
• denial-of-service (DoS) attacks
• distributed denial of service (DDoS) attacks using botnets
• the creation and distribution of malicious software (for example, viruses, worms, trojans).
¹ AFP, link: http://www.afp.gov.au/policing/cybercrime/hightech-crime.aspx

Spam (Def.)
• It is hard to define the term spam accurately.
• Some hold that spam is an issue of consent, not content, while others believe it is an issue of content, not consent. Still others believe it is about quantity or scale.
• In general, the word spam is commonly used to describe unsolicited e-mails that are sent in bulk¹. Certain definitions also stress the commercial nature of spam².
¹ Commission communication on unsolicited commercial communications or "spam", p. 5
² For example, the US CAN-SPAM Act of 2003 establishes requirements for those who send commercial e-mail.

Spam (Def.), cont.
• In Australia, ACMA defines spam as unsolicited commercial electronic messages; a single electronic message can also be considered spam under Australian law.
• Spamhaus, on the other hand, defines spam differently and considers an email to be spam only if it is both unsolicited and sent in bulk.
• The terms "bulk", "commercial" and "unsolicited" are in themselves problematic, as they do not provide enough flexibility to deal with the variety of content that is distributed using modern digital means of communication.

Why Big Data when fighting spam
• Dealing with spam introduces a number of Big Data challenges. The total size and scale of the data is enormous.
• In the 1990s, the average PC user received one or two spam messages a day. The amount of spam was estimated at 200 billion messages sent per day (circa August 2010; see Josh Halliday, 2011; Symantec, MAAWG, 2013).
• The suppression of spam requires an understanding of complex patterns of behavior and the capacity to identify new types of spam.
• Around 96% of all email messages are estimated to be spam.

Why Big Data when fighting spam, cont.
• Of all spam emails sent on any one day, an average of 3.3% contained malicious attachments; the share is higher for suspect web pages (perhaps 1 in 5; ephemeral).
• Spammers collect gross worldwide revenue on the order of $200 million per year (Google, Microsoft, Yahoo, 2012).
• Spam is now associated with recent crime toolkits (e.g. Blackhole, Zeus).

Focus
• Emails containing malicious content: they attempt to compromise the security of a computer by luring the recipient to click on a fake or infected URL that links to a malicious Web site (‘landing page’) or downloads a malicious attachment with a zero-day exploit, regardless of the source (e.g. phishing or spear phishing).
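
The two delivery vectors named on the Focus slide, embedded URLs and attachments, can be pulled out of a raw message roughly as follows. This is only a minimal sketch using Python's standard `email` package; the function name `extract_indicators`, the simplified URL pattern, and the sample message are illustrative assumptions, not the Observatory's actual pipeline.

```python
import re
from email import message_from_string

# Simplified URL pattern (an assumption): real spam often obfuscates links,
# so a production extractor would need to be far more tolerant.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_indicators(raw: str) -> dict:
    """Return URLs found in text parts and the file names of attachments."""
    msg = message_from_string(raw)
    urls, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        filename = part.get_filename()
        if filename:
            # Any part that declares a file name is treated as an attachment.
            attachments.append(filename)
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            text = payload.decode(charset, errors="replace")
            urls.extend(URL_RE.findall(text))
    return {"urls": urls, "attachments": attachments}


# Hypothetical sample message, made up for illustration only.
SAMPLE = "\n".join([
    "From: sender@example.com",
    "Subject: Your parcel",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="XYZ"',
    "",
    "--XYZ",
    "Content-Type: text/plain; charset=utf-8",
    "",
    "Track your parcel at http://example.com/track today.",
    "--XYZ",
    "Content-Type: application/zip",
    'Content-Disposition: attachment; filename="tracking_instructions.pdf.zip"',
    "Content-Transfer-Encoding: base64",
    "",
    "UEsDBA==",
    "--XYZ--",
    "",
])

if __name__ == "__main__":
    print(extract_indicators(SAMPLE))
```

The extracted URLs and file names are the items one would then submit to a scanner; the sketch deliberately stops before that step.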
Data Set
• Data provided by the Australian Communications and Media Authority's (ACMA) Spam Intelligence Database (SID) and the Computer Emergency Response Team (CERT) Australia.
• SID: three sources received in anonymised form. Only two data sources have been processed thus far (2012).
• The raw data (spam) run into the millions of messages. Our analysis looked only at messages that appear to have been relayed through Australia, for example where the last-hop IP address was located in Australia.

Month    Habul data set      Botnet data set
         (# spam emails)     (# spam emails)
Jan                   67              31,991
Feb                  104              49,085
Mar                   75              45,413
Apr                   65              33,311
May                   83              28,415
Jun                   94              11,587
Jul                   72              16,251
Aug                   85              21,970
Sep                  363              27,819
Oct                   73              13,426
Nov                  193              17,145
Dec                   95              20,696
Total              1,369             317,109

3 V’s of Big Data
Machine learning and data mining are well-established techniques in the world of IT, especially among web companies and startups. Spam detection is made possible by mining the huge amount of data available. However, “big data” is not only about Volume, but also about Velocity and Variety (the 3 V’s of big data).
• Volume: Data Quantity
• Velocity: Data Speed
• Variety: Data Type

Email Attacks
• Malware and phishing are becoming combined
  – Poisoned attachments (e.g. custom PDF exploits)
  – Links to web sites with malware (web browser exploits)
  – Install Trojan or remote access software
• Attackers use
  – Fake domains: PayPal vs. PayPaI (the last letter is a capital I, not a lowercase L)
  – Compromised sites: hosting malicious software
  – URL shortening services: hide the real URL
  – Droppers: malicious code on a site that drops malware when a webpage is visited
  – Spear-phishing: targets specific groups or individuals, usually attractive targets with limited guardianship
  – ‘Socially engineered’ deceptive/tailored email content (e.g. advance fee frauds etc.)

Trends (1/2)
• Spam and spam campaigns are often sent
  – In large quantities
  – At certain times or time frames
• Seemingly harmless URLs that can redirect to compromised Web sites.
• Inconspicuous file names and extensions, e.g. tracking_instructions.pdf.zip
• Social engineering tactics: attackers use common business terms in file names as spear-phishing bait, e.g. ups, amazon, HP_document, etc.

Trends (2/2)
• Same attachment with different email bodies.
• Same email body with different attachments.
• ZIP files remain the preferred format for malware delivery over email (potentially delivering a high payload).
• Malware is delivered in ZIP file format in an estimated 91% of identified cases in our data.
• Malware authors (spammers) focus on evasion (e.g. double extensions, obfuscation, changing code).
• URLs that seem not to be working: evidence of a so-called “Waterhole” attack.

Zeus Virus
[Figure panels: Zeus code injection; Legitimate webpage]

Ransomware

Identifying malicious spam emails
• Parsed the raw data into a database, then extracted attachments and URLs and uploaded them to VirusTotal¹.
¹ Free online virus checker that offers support for academic researchers, to scan for viruses and suspicious content. VirusTotal uses over 40 different virus scanners; we consider an attachment or URL to be malicious if at least one scanner shows a positive result. https://www.virustotal.com

Attachment

URLs
URLs seem not to be working: evidence of a Waterhole attack.

Waterhole attack - Blackhole Exploit Kit (finding)
[Diagram labels: http://comromised.com/../index.com; Malicious website]

Conclusion
• Predicting spam messages that contain malicious content is not possible without the systematic analysis of big data, prior knowledge of current threats, and likely developments in modus operandi.
• We propose using only spam email text to predict malicious attachments and URLs (ask for the paper)
  – Novel features to capture text patterns
  – Self-contained (no external resources)
• We show we can predict malicious attachments up to 95.2%, and up to 68.1% for URLs.
• Machine learning and data analytics based on Big Data will improve the discovery of targeted attacks and persistent threats (ask for the 2nd paper).

Thank you and visit http://cybercrime.anu.edu.au
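
Backup sketch
Two rules mentioned in the talk lend themselves to a short illustration: the labelling rule (an attachment or URL counts as malicious if at least one scanner reports a positive) and the double-extension evasion trick (e.g. tracking_instructions.pdf.zip). The sketch below is a hedged illustration only. The function names, the extension lists, and the scanner-result dictionary are assumptions made for this example; it does not call VirusTotal and does not reproduce the features used in the papers.

```python
from pathlib import PurePosixPath
from typing import Mapping

# Extensions treated as harmless-looking decoys (illustrative assumption).
DECOY_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".jpg", ".txt"}
# Extensions treated as archives or executables (illustrative assumption).
CARRIER_EXTENSIONS = {".zip", ".rar", ".exe", ".scr"}


def has_double_extension(filename: str) -> bool:
    """True if a decoy extension is directly followed by a carrier extension."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    return (
        len(suffixes) >= 2
        and suffixes[-2] in DECOY_EXTENSIONS
        and suffixes[-1] in CARRIER_EXTENSIONS
    )


def is_malicious(scan_results: Mapping[str, bool]) -> bool:
    """Labelling rule from the talk: one positive scanner is enough.

    `scan_results` maps a scanner name to whether it flagged the item;
    this input format is hypothetical, not a real scanner API response.
    """
    return any(scan_results.values())


if __name__ == "__main__":
    print(has_double_extension("tracking_instructions.pdf.zip"))
    print(is_malicious({"scanner_a": False, "scanner_b": True}))
```

The "at least one scanner" rule favours recall over precision: a single false positive from any scanner labels the item malicious, which is worth keeping in mind when reading the reported prediction figures.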