Using big data analytics to identify malicious content: a case study

Using big data analytics to identify malicious
content: a case study on spam emails
Mamoun Alazab & Roderic Broadhurst
Mamoun.alazab@anu.edu.au
http://cybercrime.anu.edu.au
Outline
• Background
• Cybercrime and SPAM?
• Importance of Big Data Analytics
• Data description & Analysis
• Summary
• Q&A
Background (ANU Cybercrime Observatory [1])
Team
Research Interests:
― Criminology/Sociology
― Organised Crime
― Law & Regulation
― Information Security
― Malware Analysis
― Phishing attacks
― Police and media cases
― Computer Forensics
[1] http://cybercrime.anu.edu.au
[Figure: malware infection paths. Labels: Spammed Messages, Social Networking Websites, Worm, Malicious Websites, Removable Devices; Install Malware; Become Zombie]
Spam, as a form of ‘social engineering’, enables malware to reach ‘high volume, low
value’ targets, which makes it one of the most popular means of spreading and
installing malware on computers.
Cybercrime definition from legislation
Definition of high tech crime (Australian Federal Police) [1]:
High tech crime offences are defined in Commonwealth legislation in
Part 10.7 - Computer Offences of the Criminal Code Act 1995 and
include:
• computer intrusions (for example, malicious hacking)
• unauthorised modification of data, including destruction of data
• denial-of-service (DoS) attacks
• distributed denial of service (DDoS) attacks using botnets
• the creation and distribution of malicious software (for example,
viruses, worms, trojans).
[1] AFP, http://www.afp.gov.au/policing/cybercrime/hightech-crime.aspx
Spam (Def.)
• It is hard to define the term spam accurately.
• Some argue that spam is an issue of consent, not content, while
others believe it is an issue of content, not consent. Still others
believe it is about quantity or scale.
• In general, the word spam is commonly used to describe unsolicited
e-mails that are sent in bulk [1]. Certain definitions also stress the
commercial nature of spam [2].
[1] Commission communication on unsolicited commercial communications or "spam", p. 5
[2] For example, the US CAN-SPAM Act of 2003 establishes requirements for those who send
commercial e-mail.
Cont.
• In Australia, ACMA defines spam as unsolicited commercial
electronic messages; even a single electronic message can be
considered spam under Australian law.
• Spamhaus, on the other hand, defines spam differently and
considers an email to be spam only if it is both unsolicited and sent in
bulk.
• "Bulk", "commercial" and "unsolicited" are in themselves
problematic, as they do not provide enough flexibility to deal with the
variety of content that is distributed using modern digital means
of communication.
Why Big Data when fighting spam
• Dealing with spam introduces a number of Big Data
challenges. The total size and scale of the data is enormous.
• In the 1990s, the average PC user received one or two spam
messages a day. By around August 2010, the volume of spam was
estimated at 200 billion messages sent per day (see Josh
Halliday, 2011; Symantec, MAAWG, 2013).
• The suppression of spam involves the need to understand
complex patterns of behavior and the capacity to identify new
types of spam.
• Around 96% of all email messages are estimated to be spam.
Cont.
• Of all spam emails sent on any one day, an average of
3.3% contain malicious attachments; the share linking to
suspect web pages is higher (perhaps 1 in 5, although such
pages are ephemeral).
• Spammers collect gross worldwide revenue on the order
of $200 million per year (Google, Microsoft, Yahoo,
2012).
• Spam is now associated with recent crime toolkits
(e.g. Blackhole, Zeus).
Focus
• Emails containing malicious content, regardless
of the source (i.e. phishing or spear phishing).
They attempt to compromise the security of a
computer by luring the recipient to click on a
fake or infected URL that either links to a
malicious web site (‘landing page’) or
downloads a malicious attachment with a
zero-day exploit.
Data Set
• Data provided by the Australian Communications and Media Authority's
(ACMA) Spam Intelligence Database (SID) & the Computer
Emergency Response Team (CERT) Australia.
• SID: three sources, received in anonymised form. Only two data sources
have been processed thus far (2012).
• The raw data (spam) run into the millions of messages. Our analysis only
looked at messages which appear to have been relayed through
Australia, for example where the last-hop IP address was located in Australia.
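The "relayed through Australia" filter described above can be sketched as follows. This is only an illustration, not the authors' pipeline: it assumes that "last hop" means the sending host recorded in the topmost Received header (each relay prepends its own header), and it takes the geolocation step as a hypothetical caller-supplied function `is_in_country`, since the slides do not say which IP database was used.

```python
import re
from email import message_from_string

# Matches a bracketed IPv4 literal such as [203.0.113.7] in a Received header.
IPV4_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def last_hop_ip(raw_message):
    """Return the IPv4 address found in the topmost Received header, or None."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return None
    match = IPV4_IN_BRACKETS.search(received[0])
    return match.group(1) if match else None


def relayed_through(raw_message, is_in_country):
    """Apply a caller-supplied geolocation check (hypothetical) to the last hop."""
    ip = last_hop_ip(raw_message)
    return ip is not None and is_in_country(ip)


# Made-up message using documentation-only IP ranges.
SAMPLE = (
    "Received: from mail.example.net (mail.example.net [203.0.113.7])\n"
    "\tby mx.example.org with ESMTP; Mon, 1 Oct 2012 10:00:00 +1000\n"
    "Received: from unknown (unknown [198.51.100.23])\n"
    "\tby mail.example.net with SMTP; Mon, 1 Oct 2012 09:59:58 +1000\n"
    "From: sender@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
```

In practice `is_in_country` would wrap a GeoIP lookup; Received headers below the topmost one can be forged by the sender, which is one reason to look only at the last hop.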
Month    Habul data set     Botnet data set
         (# spam emails)    (# spam emails)
Jan                  67              31,991
Feb                 104              49,085
Mar                  75              45,413
Apr                  65              33,311
May                  83              28,415
Jun                  94              11,587
Jul                  72              16,251
Aug                  85              21,970
Sep                 363              27,819
Oct                  73              13,426
Nov                 193              17,145
Dec                  95              20,696
Total             1,369             317,109
3 V’s of Big Data
Machine learning and data mining are well established techniques in
the world of IT and especially among web companies and startups.
Spam detection is made possible by mining the huge amount of data
available and at play. However, “big data” is not only about Volume, but
also about Velocity, and Variety (The 3V’s of big data).
• Volume: data quantity
• Velocity: data speed
• Variety: data type
Email Attacks
• Malware and phishing are becoming combined
– Poisoned attachments (Ex. custom PDF exploits)
– Links to web sites with malware (web browser exploits)
– Install Trojan or remote access software
• Attackers use
– Fake domains: PayPal vs. PayPaI (capital I in place of lowercase L)
– Compromised sites: hosting malicious software
– URL shortening services: hide the real URL
– Droppers: malicious code on sites that drops malware when a user visits
a site or webpage
– Spear-phishing: targets specific groups or individuals, usually attractive
targets with limited guardianship
– ‘Socially engineered’ deceptive/tailored email content (e.g. advance-fee
frauds)
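The fake-domain trick above (PayPaI for PayPal) can be illustrated with a small lookalike check. This is a minimal sketch, not a method from the talk: the watch list `KNOWN_BRANDS` and the `CONFUSABLES` map are hypothetical and deliberately tiny, and the check works on the domain as it is displayed in the email, because the deception is visual.

```python
# Hypothetical watch list and confusable-character map; illustrative only.
KNOWN_BRANDS = {"paypal.com", "amazon.com", "ups.com"}
CONFUSABLES = str.maketrans({"I": "l", "1": "l", "0": "o"})


def skeleton(domain):
    """Map visually confusable characters to one canonical form, then lowercase."""
    return domain.translate(CONFUSABLES).lower()


def impersonated_brand(displayed_domain):
    """Return the brand a displayed domain imitates, or None.

    A domain that is exactly a known brand (ignoring case) is not flagged.
    """
    if displayed_domain.lower() in KNOWN_BRANDS:
        return None
    skel = skeleton(displayed_domain)
    for brand in KNOWN_BRANDS:
        if skel == skeleton(brand):
            return brand
    return None
```

For example, `impersonated_brand("PayPaI.com")` returns `"paypal.com"`, while the genuine `"PayPal.com"` returns `None`. A production check would need a much larger confusable table (including Unicode homoglyphs) than the three characters assumed here.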
Trends (1/2)
• Spam and spam campaigns are often sent
– In large quantities
– At certain times or time frames
• Seemingly harmless URLs that can redirect to compromised web
sites.
• Inconspicuous file names and extensions,
e.g. tracking_instructions.pdf.zip
• Social engineering tactics: attackers use common business
terms in file names as spear-phishing bait, e.g. ups,
amazon, HP_document.
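The double-extension pattern in file names such as tracking_instructions.pdf.zip can be flagged with a simple rule. The sketch below is an assumption-laden illustration rather than the detection logic used in the study: the two extension sets are hypothetical examples, and the rule only fires when a document-like extension is immediately followed by an archive or executable one.

```python
from pathlib import PurePosixPath

# Hypothetical extension sets, chosen only for illustration.
ARCHIVE_OR_EXECUTABLE = {".zip", ".exe", ".scr", ".js"}
DOCUMENT_LIKE = {".pdf", ".doc", ".docx", ".xls", ".jpg", ".txt"}


def has_double_extension(filename):
    """True if a document-like extension is followed by an archive/executable one."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    if len(suffixes) < 2:
        return False
    return suffixes[-1] in ARCHIVE_OR_EXECUTABLE and suffixes[-2] in DOCUMENT_LIKE
```

With these sets, `tracking_instructions.pdf.zip` and `invoice.doc.exe` are flagged, while `report.pdf` and `archive.tar.gz` are not.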
Trends (2/2)
• Same attachments with different email bodies.
• Same email body with different attachments.
• ZIP files remain the preferred format for malware
delivery over email (potentially delivering a high payload).
• Malware is delivered in ZIP file format in an estimated
91% of identified cases in our data.
• Malware authors (spammers) focus on evasion (e.g.
double extensions, obfuscation, changing code).
• URLs that appear not to work: evidence of a so-called
“Waterhole” attack.
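The "same attachment, different email body" trend can be surfaced by grouping messages on a content hash. The sketch below is one possible way to do this and is not taken from the talk; it assumes each email has already been reduced to a hypothetical `(body_text, attachment_bytes)` pair and uses exact SHA-256 matching, so it will miss attachments that differ by even one byte.

```python
import hashlib
from collections import defaultdict


def sha256_hex(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def shared_attachments(emails):
    """Find attachments that were sent with more than one distinct body.

    emails: iterable of (body_text, attachment_bytes) pairs (assumed format).
    Returns {attachment_hash: set of body hashes}.
    """
    bodies_by_attachment = defaultdict(set)
    for body, attachment in emails:
        body_hash = sha256_hex(body.encode("utf-8"))
        bodies_by_attachment[sha256_hex(attachment)].add(body_hash)
    return {h: b for h, b in bodies_by_attachment.items() if len(b) > 1}
```

Swapping the roles of body and attachment in the same function gives the converse case, the same body sent with different attachments.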
Zeus Virus
[Figure: screenshots labelled ‘Zeus code injection’ and ‘Legitimate webpage’]
Ransomware
Identifying malicious spam emails
• Parsed the raw data into a database, then extracted attachments
and URLs and uploaded them to VirusTotal [1].
[1] VirusTotal is a free online virus checker that offers support for academic researchers, scanning
for viruses and suspicious content. It uses over 40 different virus scanners; we consider an attachment
or URL to be malicious if at least one scanner shows a positive result.
https://www.virustotal.com
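The extraction step and the "at least one scanner" rule above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the authors' code: the URL regex is deliberately rough, the upload to VirusTotal is left out entirely, and `scanner_verdicts` is a hypothetical mapping of scanner name to a boolean result rather than VirusTotal's actual response format.

```python
import hashlib
import re
from email import message_from_bytes
from email.message import EmailMessage

# Rough URL matcher for illustration; real-world extraction needs more care.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_artifacts(raw_bytes):
    """Return ([(filename, sha256 of content)], [urls]) for one raw email."""
    msg = message_from_bytes(raw_bytes)
    attachments, urls = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        filename = part.get_filename()
        if filename:
            attachments.append((filename, hashlib.sha256(payload).hexdigest()))
        elif part.get_content_maintype() == "text":
            text = payload.decode("utf-8", errors="replace")
            urls.extend(URL_PATTERN.findall(text))
    return attachments, urls


def is_malicious(scanner_verdicts):
    """Rule from the slide: malicious if at least one scanner is positive.

    scanner_verdicts: {scanner_name: bool}, a hypothetical input format.
    """
    return any(scanner_verdicts.values())


# Build a made-up sample message with one URL and one attachment.
_sample = EmailMessage()
_sample["Subject"] = "Tracking"
_sample.set_content("See http://example.com/track?id=1 for details.")
_sample.add_attachment(b"PK\x03\x04", maintype="application", subtype="zip",
                       filename="tracking_instructions.pdf.zip")
SAMPLE_RAW = _sample.as_bytes()
```

The one-positive threshold maximises recall at the cost of false positives from a single over-eager scanner, so a stricter threshold could be substituted in `is_malicious` if precision matters more.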
Attachment
URLs
URLs that appear not to work: evidence of a Waterhole attack.
Waterhole attack - Blackhole Exploit Kit (finding)
http://comromised.com/../index.com
Malicious website
Conclusion
• Predicting which spam messages contain malicious content is not possible
without the systematic analysis of big data, prior knowledge of current
threats, and likely developments in modus operandi.
• We propose using only spam email text to predict malicious attachments and
URLs (ask for the paper)
– Novel features to capture text patterns
– Self-contained (no external resources)
• We show we can predict malicious attachments up to 95.2%, and up to
68.1% for URLs.
• Machine learning and data analytics based on Big Data will improve the
discovery of targeted attacks and persistent threats (ask for the 2nd paper).
Thank you
and visit
http://cybercrime.anu.edu.au