Data Mining for Security Applications: Detecting Malicious Executables
Mr. Mehedy M. Masud (PhD Student), Prof. Latifur Khan, Prof. Bhavani Thuraisingham
Department of Computer Science, The University of Texas at Dallas

Outline and Acknowledgement
● Vision for assured information sharing
● Handling different trust levels
● Defensive operations between untrustworthy partners
● Detecting malicious executables using data mining
Research funded by the Air Force Office of Scientific Research and Texas Enterprise Funds.

Vision: Assured Information Sharing
● (Diagram) Agencies A, B and C each publish their component data/policy, which together make up the data/policy for the coalition.
● Partner categories: 1. trustworthy partners, 2. semi-trustworthy partners, 3. untrustworthy partners, 4. dynamic trust.

Our Approach
● Integrate the Medicaid claims data and mine the data; next, enforce policies and determine how much information has been lost by enforcing them.
  – Prof. Khan, Dr. Awad (postdoc) and student workers (MS students)
● Apply game theory and probing techniques to extract information from semi-trustworthy partners.
  – Prof. Murat Kantarcioglu and Ryan Layfield (PhD student)
● Data mining for defensive and offensive operations, e.g., malicious code detection and honeypots.
  – Prof. Latifur Khan and Mehedy Masud
● Dynamic trust levels and peer-to-peer communication.
  – Prof. Kevin Hamlen and Nathalie Tsybulnik (PhD student)

Introduction: Detecting Malicious Executables using Data Mining
● What are malicious executables?
  – Programs that harm computer systems: viruses, exploits, denial-of-service (DoS) tools, flooders, sniffers, spoofers, Trojans, etc.
  – They exploit software vulnerabilities on a victim and may remotely infect other victims.
  – They incur great losses. Example: the Code Red epidemic cost $2.6 billion.

Malicious Code Detection: Traditional Approach
● Signature based
  – Requires signatures to be generated by human experts
  – Therefore not effective against "zero-day" attacks

State of the Art: Automated Detection
● Behavioural approaches: analyse behaviours such as source and destination addresses, attachment types, statistical anomalies, etc.
● Content-based approaches: analyse the content of the malicious executable
  – Autograph (H. Ah-Kim, CMU): based on an automated signature-generation process
  – N-gram analysis (Kolter, J. Z. and Maloof, M. A.): based on mining features and using machine learning

New Ideas
● Content-based approaches consider only machine code (byte code).
● Is it possible to also consider higher-level code for malicious code detection? Yes:
  – Disassemble the binary executable and retrieve the assembly program
  – Extract important features from the assembly program
  – Combine them with machine-code features

Feature Extraction
● Binary n-gram features
  – Sequences of n consecutive bytes of the binary executable
● Assembly n-gram features
  – Sequences of n consecutive assembly instructions
● System API call features
  – DLL function call information

The Hybrid Feature Retrieval Model
● Collect training samples of normal and malicious executables
● Extract features
● Train a classifier and build a model
● Test the model against test samples

Hybrid Feature Retrieval (HFR)
● Training and testing (model diagrams)
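As a rough illustration of the train/test loop above, the following is a minimal Python sketch. It assumes the hybrid features have already been converted into feature vectors, and it uses scikit-learn's SVC purely for illustration; it is not the project's actual code.

    # Minimal sketch of the HFR train/test loop (assumes scikit-learn is available).
    # X_train / X_test: feature vectors built from the hybrid feature set;
    # y_train / y_test: labels, 1 = malicious, 0 = benign.
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def train_and_evaluate(X_train, y_train, X_test, y_test):
        model = SVC(kernel="rbf")              # C-SVC with an RBF kernel, as in the experiments later
        model.fit(X_train, y_train)            # build the model from training samples
        predictions = model.predict(X_test)    # test the model against unseen samples
        return accuracy_score(y_test, predictions)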
Feature Extraction: Binary n-gram Features
● Features are extracted from the byte code in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
● Example: given the 11-byte sequence 0123456789abcdef012345,
  – the 2-grams (2-byte sequences) are: 0123, 2345, 4567, 6789, 89ab, abcd, cdef, ef01, 0123, 2345
  – the 4-grams (4-byte sequences) are: 01234567, 23456789, 456789ab, ..., ef012345, and so on.
● Problem:
  – Large dataset; too many features (millions!)
● Solution:
  – Use secondary memory and efficient data structures
  – Apply feature selection

Feature Extraction: Assembly n-gram Features
● Features are extracted from the assembly programs in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
● Example: given the three instructions "push eax"; "mov eax, dword[0f34]"; "add ecx, eax", the 2-grams are:
  (1) "push eax"; "mov eax, dword[0f34]"
  (2) "mov eax, dword[0f34]"; "add ecx, eax"
● Problem: the same problem as with binary n-grams
● Solution: the same solution

Feature Selection
● Select the best K features
● Selection criterion: information gain
● The gain of an attribute A on a collection of examples S is given by
  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
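To make the binary n-gram extraction described above concrete, here is a minimal Python sketch. The function name and the use of an in-memory dictionary are illustrative assumptions; the actual system relies on secondary memory and efficient data structures because the feature set runs into the millions.

    # Slide a window over the raw bytes of an executable and count n-grams.
    # data is a bytes object; n is the gram size in bytes (2, 4, 6, ...).
    def extract_byte_ngrams(data, n):
        counts = {}
        for i in range(len(data) - n + 1):
            gram = data[i:i + n].hex()          # e.g. b'\x01\x23' -> '0123'
            counts[gram] = counts.get(gram, 0) + 1
        return counts

    # The 11-byte sequence from the example above yields ten 2-grams.
    sample = bytes.fromhex("0123456789abcdef012345")
    print(extract_byte_ngrams(sample, 2))       # {'0123': 2, '2345': 2, '4567': 1, ...}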
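Likewise, the information-gain criterion used to pick the best K features could be computed as in the sketch below, assuming binary-valued features per sample; this only illustrates the formula and is not the project's code.

    import math

    def entropy(labels):
        # Shannon entropy of a list of class labels.
        total = len(labels)
        return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                    for c in set(labels))

    def information_gain(feature_values, labels):
        # Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v|/|S|) * Entropy(S_v)
        gain = entropy(labels)
        for v in set(feature_values):
            subset = [lab for f, lab in zip(feature_values, labels) if f == v]
            gain -= (len(subset) / len(labels)) * entropy(subset)
        return gain

    def select_best_k(features, labels, k):
        # features maps a feature name to its list of 0/1 values over all samples.
        ranked = sorted(features,
                        key=lambda name: information_gain(features[name], labels),
                        reverse=True)
        return ranked[:k]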
Experiments
● Dataset
  – Dataset 1: 838 malicious and 597 benign executables
  – Dataset 2: 1082 malicious and 1370 benign executables
  – Malicious code collected from VX Heavens (http://vx.netlux.org)
● Disassembly
  – Pedisassem (http://www.geocities.com/~sangcho/index.html)
● Training and testing
  – Support Vector Machine (SVM): C-support vector classifiers with an RBF kernel

Results
● HFS = Hybrid Feature Set, BFS = Binary Feature Set, AFS = Assembly Feature Set
● (Result charts comparing the three feature sets)

Future Plans
● System API calls
  – Appear to be very useful
  – Need to consider the frequency of calls
  – Call sequence patterns (following program paths)
  – Actions immediately preceding or following a call
● Detect malicious code by program slicing
  – Requires further analysis

Data Mining to Detect Buffer Overflow Attack
Mohammad M. Masud, Latifur Khan, Bhavani Thuraisingham
Department of Computer Science, The University of Texas at Dallas

Introduction
● Goal
  – Intrusion detection, e.g., worm attacks and buffer overflow attacks
● Main contributions
  – 'Worm' code detection by data mining coupled with 'reverse engineering'
  – Buffer overflow detection by combining data mining with static analysis of assembly code

Background
● What is a 'buffer overflow'?
  – A situation in which a fixed-size buffer is overflown by a larger input.
● How does it happen? Example:
      ........
      char buff[100];
      gets(buff);
      ........
  – (Diagram: the input string is written into buff on the stack.)

Background (cont.)
● Then what?
  – (Stack diagrams) The overflowing input overwrites the return address on the stack; the new return address points to the memory location holding the attacker's code.

Background (cont.)
● So what?
  – The program may crash, or
  – The attacker can execute his arbitrary code
● The attacker can now
  – Execute any system function
  – Communicate with some host, download some 'worm' code and install it
  – Open a backdoor to take full control of the victim
● How do we stop it?

Background (cont.)
● Stopping buffer overflows
  – Preventive approaches
  – Detection approaches
● Preventive approaches
  – Finding bugs in source code. Problem: only works when source code is available.
  – Compiler extensions. Same problem.
  – OS/hardware modifications
● Detection approaches
  – Capture the symptoms of running code. Problem: may require a long running time.
  – Automatically generate signatures of buffer overflow attacks.

CodeBlocker (Our Approach)
● A detection approach
● Based on the observation that attack messages usually contain code while normal messages contain data
● Main idea: check whether a message contains code
● Problem to solve: distinguishing code from data

Severity of the Problem
● It is not easy to detect the actual instruction sequence in a given string of bits.

Our Solution
● Apply data mining
● Formulate the problem as a classification problem (code vs. data)
● Collect a set of training examples containing both kinds of instances
● Train on the data with a machine learning algorithm and obtain a model
● Test this model against new messages

CodeBlocker Model
● (Model diagram)

Feature Extraction: Disassembly
● We apply the SigFree tool implemented by Xinran Wang et al. (Penn State).

Feature Extraction
● Features are extracted using
  – N-gram analysis
  – Control-flow analysis

Feature Extraction (cont.): N-gram Analysis
● What is an n-gram? A sequence of n instructions.
● Traditional approach: the flow of control is ignored.
  – In the example assembly program and its corresponding instruction flow graph (IFG), the 2-grams are 02, 24, 46, ..., CE.

Feature Extraction (cont.): Control-flow-based N-gram Analysis
● Proposed control-flow-based approach: the flow of control is considered.
  – In the same example, the 2-grams are 02, 24, 46, ..., CE, E6.

Feature Extraction (cont.): Control-flow Analysis
● Generated features:
  – Invalid Memory Reference (IMR)
  – Undefined Register (UR)
  – Invalid Jump Target (IJT)
● Checking IMR
  – Memory is referenced using register addressing while the register value is undefined, e.g., mov ax, [dx + 5]
● Checking UR
  – Check whether the register value is set properly before use
● Checking IJT
  – Check that the jump target does not violate an instruction boundary

Feature Extraction (cont.)
● Why n-gram analysis?
  – Intuition: in general, disassembled executables show a different pattern of instruction usage than disassembled data.
● Why control-flow analysis?
  – Intuition: genuine code should contain no invalid memory references or invalid jump targets.

Putting It Together
● Compute all possible n-grams
● Select the best k of them
● Compute a feature vector (binary vector) for each training example
● Supply these vectors to the training algorithm
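The distinction between traditional and control-flow-based n-grams described above can be illustrated in Python as follows; the successor-list representation of the instruction flow graph and the function names are assumptions made for this sketch, not the tool's actual interface.

    # Traditional 2-grams: pairs of instructions that are adjacent in the file.
    # Control-flow-based 2-grams: pairs (i, j) where j is a control-flow successor
    # of i in the instruction flow graph (IFG), so a jump's target also forms a gram.
    def traditional_2grams(instructions):
        return [(instructions[i], instructions[i + 1])
                for i in range(len(instructions) - 1)]

    def control_flow_2grams(instructions, successors):
        # successors[i] lists the indices reachable from instruction i in the IFG.
        grams = []
        for i, succ_list in successors.items():
            for j in succ_list:
                grams.append((instructions[i], instructions[j]))
        return grams

    # Tiny illustration: instruction 1 both falls through to 2 and jumps to 3.
    insns = ["push eax", "jz L1", "mov ecx, eax", "ret"]
    ifg = {0: [1], 1: [2, 3], 2: [3], 3: []}
    print(control_flow_2grams(insns, ifg))
    # Includes the ("jz L1", "ret") gram that a plain sliding window would miss.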
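Similarly, heavily simplified versions of the undefined-register (UR) and invalid-jump-target (IJT) checks might look like the sketch below; the real analysis operates on fully decoded assembly rather than these toy tuples.

    # Undefined Register (UR) check: flag a register that is read before any
    # instruction has assigned it a value.
    def has_undefined_register_use(instructions):
        # instructions: list of (mnemonic, dest_register, src_register) tuples,
        # with None where an operand is absent or is not a register.
        defined = set()
        for mnemonic, dest, src in instructions:
            if src is not None and src not in defined:
                return True          # register read before any value was assigned to it
            if dest is not None:
                defined.add(dest)    # this instruction defines its destination register
        return False

    # Invalid Jump Target (IJT) check: a jump target must fall on an instruction
    # boundary, i.e. on an offset at which a decoded instruction actually starts.
    def has_invalid_jump_target(jump_targets, instruction_offsets):
        boundaries = set(instruction_offsets)
        return any(target not in boundaries for target in jump_targets)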
Experiments
● Dataset
  – Real traces of normal messages
  – Real attack messages
  – Polymorphic shellcodes
● Training and testing
  – Support Vector Machine (SVM)

Results
● CFBn: control-flow-based n-gram features
● CFF: control-flow features
● (Result charts)

Novelty / Contribution
● We introduce the notion of control-flow-based n-grams
● We combine control-flow analysis with data mining to distinguish code from data
● Significant improvement over other methods (e.g., SigFree)

Advantages
● Fast testing
● Signature-free operation
● Low overhead
● Robust against many obfuscations

Limitations
● Needs samples of attack and normal messages
● May not be able to detect a completely new type of attack

Future Work
● Find more features
● Apply dynamic analysis techniques
● Semantic analysis

References / Suggested Readings
– X. Wang, C. Pan, P. Liu, and S. Zhu. SigFree: A signature-free buffer overflow attack blocker. In USENIX Security, July 2006.
– J. Z. Kolter and M. A. Maloof. Learning to detect malicious executables in the wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pages 470-478, 2004.

Email Worm Detection (Behavioural Approach)

The Model
● Outgoing emails → feature extraction → training data / test data → machine learning classifier → clean or infected?

Feature Extraction
● Per-email features
  – Binary-valued features: presence of HTML; script tags/attributes; embedded images; hyperlinks; presence of binary and text attachments; MIME types of file attachments
  – Continuous-valued features: number of attachments; number of words/characters in the subject and body
● Per-window features
  – Number of emails sent; number of unique email recipients; number of unique sender addresses
  – Average number of words/characters per subject and body; average word length
  – Variance in the number of words/characters per subject and body; variance in word length
  – Ratio of emails with attachments

Feature Reduction and Selection
● Principal Component Analysis (PCA)
  – Reduces higher-dimensional data to a lower dimension
  – Helps reduce noise and overfitting
● Decision tree
  – Used to select the best features

Experiments
● Dataset
  – Contains instances of both normal and viral emails
  – Six worm types: bagle.f, bubbleboy, mydoom.m, mydoom.u, netsky.d, sobig.f
  – Collected from UC Berkeley
● Training and testing
  – Decision tree: C4.5 algorithm (J48) in Weka
  – Support Vector Machine (SVM) and Naïve Bayes (NB)

Results
● (Result charts)

Conclusion and Future Work
● Three approaches have been tested:
  – Apply a classifier directly
  – Apply dimension reduction (PCA) and then classify
  – Apply feature selection (decision tree) and then classify
● The decision tree has the best performance
● Future plans
  – Combine content-based with behavioural approaches
  – Offensive operations: honeypots, information operations
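To illustrate the dimension-reduction path compared in the conclusion (PCA followed by a classifier), here is a minimal Python sketch; scikit-learn is assumed purely for illustration, whereas the experiments above were run with Weka's J48 along with SVM and Naïve Bayes.

    # Minimal sketch: reduce the per-email / per-window feature vectors with PCA,
    # then classify with a decision tree (assumes scikit-learn).
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    def build_pca_classifier(n_components=10):
        # n_components is an illustrative choice, not a value from the experiments.
        return make_pipeline(PCA(n_components=n_components), DecisionTreeClassifier())

    # Usage: model = build_pca_classifier(); model.fit(X_train, y_train);
    # model.predict(X_test) labels each email as clean (0) or infected (1).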