Security Data Science (SDS)
ENEE 759D | ENEE 459D | CMSC 858Z
Prof. Tudor Dumitraș
Assistant Professor, ECE
University of Maryland, College Park
http://ter.ps/759d
https://www.facebook.com/SDSAtUMD

Introducing Your Instructor
Tudor Dumitraș
Office: AVW 3425
Email: tdumitra@umiacs.umd.edu
Course Website: http://ter.ps/759d
Office Hours: Mon 2-3 pm

My Background
• Ph.D. at Carnegie Mellon University
  – Research in distributed systems and fault-tolerant middleware
• Worked at Symantec Research Labs
  – Built WINE platform for Big Data experiments in security
  – WINE currently used by academic researchers and Symantec engineers
• Joined UMD faculty
• Research and teaching on applied security and systems
  – Focus on solving security problems with data analysis techniques

SDS In A Nutshell
• Course objectives
  – Ability to understand and interpret scholarly publications, to explain their key ideas, and to provide constructive feedback
  – Ability to apply some of these ideas in practice
• Topics
  – Vulnerabilities and exploits
  – Spam infrastructures
  – Failures of cryptosystems
  – Pay per install
  – Internet worms
  – Attacks against physical infrastructure
  – Denial of service
  – Targeted attacks
  – Botnets
  – Economic implications of cybercrime
• Grading
  – 50% paper reviews and class participation
  – 50% projects

We Are Swimming in Data
• Data created/reproduced in 2010: 1,200 exabytes
• Data collected to find the Higgs boson: 1 gigabyte / s
• Yahoo: 200 petabytes across 20 clusters
• Security:
  – Global spam in 2011: 62 billion / day
  – Malware variants created in 2011: 403 million

Why So Much Data?
• We can store it
  – 6¢ / GB
  – 29¢ / GB (SAS HDD)
• We can generate it
  – Most data is machine-generated
  – Most malware samples are variants of other malware, generated automatically (repacking, obfuscation)
What to do with all this data?

Three Stories about Data

WHAT QUESTIONS TO ASK ON A FIRST DATE?
The Power of Big Data

If You Want to Know …
Do my date and I have long-term potential?
… ask:
  Q Do you like horror movies?
  Q Have you ever traveled around another country alone?
  Q Wouldn't it be fun to chuck it all and go live on a sailboat?
Data: 275,000 user submitted questions, 34,260 real world couples
Top 3 user rated questions, about:
• God
• Sex
• Smoking
[Figure labels: 3.7×; Psychology; Likelihood of coincidence; Data]

Online Dating and Big Data
• eHarmony
  – Analyzes hundreds of behavioral variables, most collected automatically
  – CTO: former search engineer at Yahoo!
• OkCupid: “We do math to get you dates”
  – Founded by Harvard math & CS majors
• PlentyOfFish: “Building this matching system was harder than [being] cited in the paper that won the Fields Medal”
Source: CNN Money

Early 1900s: Most Factories Had Private Generators
Electricity was critical for business, but not widely available
Source: Nicholas Carr

Is he an engineer?
Data analytics provide remarkable insight
Does she date engineers?
Applications in many disciplines
Source: OkCupid

What Is Data Science?
• Also known as …
  … Big Data analytics
  … Machine intelligence
  … Data-intensive computing
  … Data wrangling
  … Data munging
  … Data jujitsu
Source: Drew Conway

IMPROVING MACHINE TRANSLATION
The Unreasonable Effectiveness of Data

2005 NIST Machine Translation Competition
English-Arabic competition
• Google’s first entry
  – None of the engineers spoke Arabic
• Simple statistical approach
• Trained using United Nations documents
  – 200 million translated words
  – 1 trillion monolingual words

For many hard problems there appears to be a threshold of sufficient data
A. Halevy, et al., CACM 2009.

What is Security Data Science?
• Also known as …
  … Security analytics
  … Surveillance analytics
• Applying data science methods to security problems

Security Principles in 60 Seconds [J. Saltzer & M.
Schroeder, SOSP 1973]
• Economy of mechanism: Keep the protection mechanism as simple and small as possible
• Fail-safe defaults: Base access decisions on permission rather than exclusion
• Complete mediation: Check every access to every object
• Open design: Do not keep the design secret
• Separation of privilege: Require two keys to unlock, not one
• Least privilege: Grant every program/user the least set of privileges necessary to complete the job
• Least common mechanism: Minimize the amount of mechanism common to more than one user and depended on by all users
• Psychological acceptability: Design interfaces for ease of use

Security in Practice (Source: C. Nachenberg, Symantec)
• 1986: Simple computer viruses
  – Defense: anti-virus
• 1990: Polymorphic viruses (decryption logic + encrypted malicious code)
  – Defense: “universal” decoder, emulation
• 1995: Macro viruses
  – Defense: AV vendor cooperation, digital signatures for macros
• 1999: Worms
  – Defense: Vulnerability-specific signatures
• 2004: Web-based malware
  – Defense: behavior blocking
• 2006: Auto-generated malware
  – Defense: reputation based security
• 2010 (but probably earlier): Targeted attacks (physical infrastructure, 0-day, etc.)
  – Defense: ??

UNDERSTANDING ZERO-DAY ATTACKS
The Need for Security Data Science

Zero-Day Attacks: Recent Examples
Zero-day attack = cyber attack exploiting a software vulnerability before the public disclosure of the vulnerability
• 2011: Attack against RSA
• 2010: Stuxnet
• 2009: Operation Aurora against Google

Price of Zero-Day Exploits on the Black Market
The Economist, March 2013

Hydraq Trojan also displayed this obfuscation. Additional links joining the various exploits together included a shared command-and-control infrastructure. Trojans dropped by different exploits were connecting to the same servers to retrieve commands from the attackers.
Some compromised websites used in the watering hole attacks had two different exploits injected into them one after the other. Yet another connection is the use of similar encryption in documents and malicious executables. A technique used to pass data to a SWF file was re-used in multiple attacks. Finally, the same family of Trojan was dropped from multiple different exploits. Figure 7 illustrates the connections between the various exploits. (Source: Symantec)
[Figure 7: Links between different exploits]

The Elderwood Project
Group with “seemingly unlimited” supply of zero-day exploits

Zero-Day Attacks: Open Questions
Decade-long open questions
• How common are zero-day attacks?
• How long can they remain undiscovered?
• What happens after disclosure?
Vulnerability timeline: Creation, Exploit used in attacks, Vulnerability disclosed (“day zero”), Security patch released, All hosts patched
[Timeline labels: Zero-day attack; Prior work [Arbaugh 2000, Frei 2008, McQueen 2009, Shahzad 2012]]

Zero-Day Attacks: Open Questions (cont’d)
Decade-long questions: Why still open?
• Rare events, hard to observe in small data sets
• Need data analysis at scale
[Figure: malware variants (log scale, 1 to 100000) versus time in weeks relative to disclosure (t0), from -100 to 150. Annotations: “Rare events”; “Before disclosure: Targeted attacks”; “After disclosure: Large-scale attacks”. Vulnerabilities plotted: CVE-2009-4324, CVE-2009-0658, CVE-2009-0084, CVE-2010-1241, CVE-2010-2862, CVE-2010-0480, CVE-2009-0561, CVE-2010-2883, CVE-2009-3126, CVE-2008-2249, CVE-2009-2501, CVE-2008-0015, CVE-2010-0028, CVE-2011-1331, CVE-2009-1134]
Vulnerability timeline: Creation, Exploit used in attacks, Vulnerability disclosed (“day zero”), Security patch released, All hosts patched

Research in Security Data Science
Challenge 1: Find the needle in the haystack
  – Example: Identify and measure zero-day attacks
[Figure: malware variants (axis ticks 10, 10^3, 10^5) versus time in weeks relative to disclosure (T0), from -100 to 150. Annotations: “Targeted attacks before disclosure”; “Rare events”; “403 million new malware variants created in 2011”]
Challenge 2: Ensure generally applicable and repeatable results
  – The threat landscape changes frequently
Challenge 3: Deal with new and advanced threats
  – Skilled and persistent hackers can bypass firewalls, anti-virus, password-protected systems, two-factor authentication, physical isolation […]
Your thesis topic goes here

What is Security Data Science?
(re-visited)
• Systems knowledge: develop technologies needed to store and process massive data sets
• Statistics & machine learning knowledge: analyze the data and extract information
• Security knowledge: ask the right questions about cyber attacks
• Data scientists are in high demand in the cybersecurity industry
  “Booz Allen may be recruiting more [data scientists] than Google or Facebook” (The Economist, June 2013)

Course Content
• Introduction to Security Data Science
• Hands-on emphasis – this is largely an unexplored research area
  – Team-based projects
  – Reviews of scholarly publications
  – No textbook
• Specific things you can expect to learn
  – Selected topics in security
  – System skills: Experiment design, data analysis, scalability
  – Team skills: Cooperating to achieve your team goals
  – Speaking/writing skills: Presenting paper/project findings, providing constructive feedback

This is an Advanced Course
• You are responsible for holding up your end of the educational bargain
  – I expect you to attend classes and to complete reading assignments
  – I expect you to learn how to analyze data and to try things out for yourself
  – I expect you to know how to find research literature on security topics
    • The required readings provide starting points
  – I expect you to manage your time
    • In general there will be one written assignment due before each lecture
• Learning material in this course requires participation
  – This is not a sit-back-and-listen kind of course; class participation is required for understanding the material and makes up a part of your grade!
• Different grading criteria for graduate and undergraduate students

Reading Assignments
• Readings: 1-2 papers before each lecture
  – Not light reading – some papers require several readings to understand
  – For next time: C. Kanich et al., 'Spamalytics: An Empirical Analysis of Spam Marketing Conversion,' ACM CCS, 2008.
  – Check course web page (still in flux) for next readings and links to papers
• Homeworks: review the papers you read using a defined template
  – Submit homework by email to tdumitra@umiacs.umd.edu
    • We might switch to a Web based submission system in the future
  – Due at 6 pm the evening before class
  – BibTeX template: Summary, Contributions, Weaknesses, Opinion (optional)
  – I will provide feedback on some of your written critiques; no email means your writeup is satisfactory
• In-class discussion: stand up and talk about the papers
  – Volunteers are preferred
  – Students randomly selected if no volunteers

Discuss …
Do my date and I have long-term potential?
… ask:
  Q Do you like horror movies?
  Q Have you ever traveled around another country alone?
  Q Wouldn't it be fun to chuck it all and go live on a sailboat?
Data: 275,000 user submitted questions, 34,260 real world couples
Top 3 user rated questions, about:
• God
• Sex
• Smoking
[Figure labels: 3.7×; Psychology; Likelihood of coincidence; Data]

Course Projects
• Pilot project: two-week individual projects
  – Propose a security problem and a data set that you could analyze to solve it
    • Some ideas are available on the web page
  – Conduct preliminary data analysis and write a report
  – Propose projects by September 9th (soft deadline)
  – Submit report by September 18th
• Group project: ten-week group project
  – Deeper investigation of promising approaches
  – Submit written report and present findings during last week of class
    • 2 checkpoints along the way (schedule on the course web page)
  – Form teams and propose projects by September 30th
• Peer reviews: review at least 2 project reports from other students
  – Use skills learned from paper reviews
  – Post project proposals, reports and reviews on Piazza

Pre-Requisite Knowledge
• Good programming skills
  – Knowledge of languages commonly used in data analysis, like Matlab or R, is a plus
  – To brush up: ‘Data Analysis and Visualization with MATLAB for Beginners’ seminar, on September 12 at 5pm,
    Room 1110 Kim Engineering Building
• Ability to come up to speed on advanced security topics
  – Covered in the paper readings
  – Basic knowledge of security (CMSC 414, ENEE 459C or equivalent) is a plus
• Ability to come up to speed on data analytics
  – Lectures provide light-duty tutorials, but you will need to pick up the details as you go along

Policies
• “Showing up is 80% of life” – Woody Allen
  – Participation in in-class discussions is required for full credit
  – You can get an “A” with a few missed assignments, but reserve these for emergencies (conference trips, waking up sick, etc.)
  – Notify the instructor if you need to miss a class, and submit your homework on time
• UMD’s Code of Academic Integrity applies, modified as follows:
  – Complete your homework entirely on your own. After you hand in your homework, you are welcome (and encouraged) to discuss it with others
  – Discuss the problems and concepts involved in the project, but produce your own project implementation, report and presentation
    • Group projects are the result of team work
• See class web site for the official version

Classroom Protocol
• Please arrive on time; lecture begins promptly
  – I also promise to end on time
  – Handouts, readings and homework templates are posted on the class web page
• Questions are encouraged
  – If you don’t understand, ask; other students are probably struggling too
  – Explain the content of your reading assignment, and the underlying reasoning, to the rest of the class
  – Your reasons don't have to be “right” – you just have to be able to explain them
• There is no way to cover everything
  – If there is an interesting aspect that we do not cover in class, feel free to incorporate that in your projects

Grading Criteria
• Straight scale: A≥90; B≥80; C≥70; D<70
  – 50% Written paper critique and class discussion
    • 24 assignments x 2 points each + 2 points for this lecture
  – 50% Projects
    • 30 points for group project, 10 points for pilot project, 10 points for project reviews
  – 10% Subjective evaluation
• Expectations
  – Graduate students: you can explain the contributions and weaknesses of the papers you read
  – Undergraduates: you demonstrate a general understanding of the papers
• Unsatisfactory participation means:
  – You did not read the papers
  – You did not produce a working implementation for your project, or you do not understand how the implementation works

Review of Lecture
• What did we learn?
  – Data analytics provide real benefits
  – Analyzing large data sets allows tackling long-standing hard problems
  – Difference between security principles and security in practice
  – Examples of security problems that require insights from large data sets
• I want to emphasize
  – This is a systems course, not a pen-and-paper course
  – You will be expected to build a real, working, data analysis tool
• What’s next?
  – Basic statistics and experimental design
  – Pilot project: proposal, approach, expectations
• Deadline reminder
  – Post pilot project proposal on Piazza by Monday (soft deadline)
  – First homework due on Sunday at 6 pm

Dive In
http://ter.ps/759d
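
Example: measuring a zero-day window
The lecture defines a zero-day attack as one that exploits a vulnerability before its public disclosure, and asks how long such attacks can remain undiscovered. A minimal sketch of that measurement is given here, assuming we already know, for each vulnerability, the date an exploit was first observed and the disclosure date. The identifiers, dates, and function names are hypothetical and chosen only for illustration; this is not the method used in the studies the slides cite.

```python
from datetime import date

# Hypothetical records: (vulnerability id, first observed exploit, public disclosure).
# The identifiers and dates are invented for illustration only.
OBSERVATIONS = [
    ("VULN-A", date(2010, 1, 4), date(2010, 6, 7)),   # exploit seen before disclosure
    ("VULN-B", date(2011, 3, 14), date(2011, 3, 1)),  # exploit seen only after disclosure
]


def zero_day_window_days(first_exploit, disclosed):
    """Days the exploit was in use before disclosure, or 0 if it was not a zero-day."""
    return max((disclosed - first_exploit).days, 0)


def classify(observations):
    """Map each vulnerability id to (is_zero_day, days exploited before disclosure)."""
    result = {}
    for vuln_id, first_exploit, disclosed in observations:
        window = zero_day_window_days(first_exploit, disclosed)
        result[vuln_id] = (window > 0, window)
    return result


if __name__ == "__main__":
    for vuln_id, (is_zero_day, window) in classify(OBSERVATIONS).items():
        label = "zero-day" if is_zero_day else "post-disclosure"
        print(vuln_id, label, window)
```

In practice the hard part is not this arithmetic but obtaining the first-exploit date, since, as the slides note, these are rare events that only show up in data collected at scale.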