272: Software Engineering, Fall 2012
Instructor: Tevfik Bultan
Lecture 17: Code Mining

Code Mining
• There is a lot of code that is available for everyone to access
  – Can we learn from it?
• One of the active research directions in software engineering is to mine existing code for various purposes, such as
  – To discover common behaviors
    • which can then be used to extract specifications such as interface usage patterns, etc.
  – To discover anomalies
    • which can then be used to find bugs or problematic behaviors

We will discuss two papers that do this
• Today we will discuss two papers that use code mining for different purposes:
  – "Graph-based Mining of Multiple Object Usage Patterns" Tung Nguyen, Hoan Nguyen, Nam Pham, Jafar Al-Kofahi, and Tien Nguyen. 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2009).
  – "Mining Specifications of Malicious Behavior" Mihai Christodorescu, Somesh Jha, Christopher Kruegel. Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2007).

Mining for Object Usage Patterns
• We discussed papers that automatically extract behavioral interfaces for classes
  – The papers we discussed earlier focus on the usage of a single class and try to identify the ordering of method calls on a single object
• However, there may be usage patterns that involve multiple objects
  – Moreover, usage patterns may involve
    • control flow structures
      – such as calling a method within a loop
    • and data dependencies
      – such as one argument in a method call being dependent on an argument in another method call

GrouMiner
• GrouMiner is a tool that extracts usage patterns for objects, taking into account both
  – temporal usage orders (as in the interface extraction papers we discussed already), and
  – data dependencies
• It defines a graph-based object usage model (groum) and extracts these models from existing code

Object Usage Model
• A groum is a directed acyclic graph
  – nodes are labeled, edges are not labeled
• Nodes correspond to
  – actions: method calls, accesses to object fields
  – control flow structures: conditions, branches, or loops such as if, while, for statements
• Edges represent
  – temporal ordering: if A is used before (or always generated before) B, then there is an edge from A to B
  – data dependency: there is an edge from A to B if there is a data dependency between A and B
• A groum can represent multiple objects
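To make the model concrete, the following Java snippet (a hypothetical example of ours, not one taken from the paper) is annotated with the groum nodes and edges its object usages would induce:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Hypothetical example: reading a file with FileReader, BufferedReader,
    // and StringBuilder induces a multi-object groum.
    public class ReadFileExample {
        public static String readAll(String path) throws IOException {
            FileReader fr = new FileReader(path);        // action node: FileReader.<init>
            BufferedReader br = new BufferedReader(fr);  // action node: BufferedReader.<init>
                                                         // data dependency edge: fr is the constructor argument
            StringBuilder sb = new StringBuilder();      // action node: StringBuilder.<init>
            String line;
            while ((line = br.readLine()) != null) {     // control node: WHILE
                                                         // action node: BufferedReader.readLine (inside the loop)
                sb.append(line);                         // action node: StringBuilder.append
                                                         // data dependency edge: line links readLine to append
            }
            br.close();                                  // action node: BufferedReader.close
            // Temporal edges follow the sequential order of the actions, e.g.,
            // FileReader.<init> -> BufferedReader.<init> -> readLine -> close.
            return sb.toString();                        // action node: StringBuilder.toString
        }
    }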
How to Extract the Object Usage Model?
• The temporal ordering of the nodes in a groum is extracted from the AST by adding edges between nodes that are sequentially ordered
• The data dependency edges are extracted using an intra-procedural dependency analysis
  – Identify the variables involved in each action to determine the dependencies and add edges to represent them

Extracting Code Skeletons
• A groum model can be un-parsed and converted to a code skeleton
• This code skeleton demonstrates the usage pattern as code rather than as a directed acyclic graph
• This approach can be used as a reverse engineering technique to discover different usage patterns in the code
• However, there will be many groums in a given code base, and not all of them should be reported as usage patterns
  – They have to be filtered somehow

Usage Pattern Mining
• GrouMiner uses graph mining techniques to identify the common usage patterns in the code
• They determine the frequency of a pattern by computing the number of its independent occurrences
  – If the frequency of a pattern is higher than a threshold, then it is reported
• The graph mining algorithm determines the common patterns efficiently by
  – identifying common graph patterns incrementally, starting with graphs with a small number of nodes and then finding other patterns based on the sub-graph relationship
  – checking equivalence of patterns approximately, using a vector representation that summarizes the features of a pattern rather than doing an exact matching

Anomaly Detection
• Using the graph mining algorithm, they can identify anomalous usages
  – They identify an anomalous usage as a sub-graph of an identified pattern that is not extensible to that pattern
  – This is considered a violation of the pattern
• A violation is considered an anomaly when it is too rare
  – i.e., common violations are not reported as anomalies
• They discuss two types of anomaly detection: 1) anomaly detection in a given project, and 2) anomaly detection when a project changes
• Anomaly detection can be used to identify errors
  – An anomalous usage may correspond to a violation of an interface and may point to a bug
• However, when anomaly detection is used as a bug finding approach, it generates a lot of false positives (87.8% in one case)
  – i.e., many identified anomalies do not correspond to errors
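The rarity test above can be illustrated with a minimal sketch (our own simplification, not GrouMiner's actual algorithm): a violation is reported only when it is rare relative to correct uses of the pattern.

    // Minimal sketch of a rarity-based anomaly filter. The counting of
    // pattern occurrences and violations is assumed to be done elsewhere.
    public class AnomalyFilter {
        // patternCount: locations that exhibit the full pattern
        // violationCount: locations that match part of the pattern but
        //                 cannot be extended to the full pattern
        // threshold: e.g., 0.1 means a violation is "rare" if it accounts
        //            for less than 10% of all occurrences
        public static boolean isAnomaly(int patternCount, int violationCount,
                                        double threshold) {
            int total = patternCount + violationCount;
            if (total == 0) {
                return false;                            // pattern never seen
            }
            double violationRate = (double) violationCount / total;
            return violationRate > 0 && violationRate < threshold;
        }

        public static void main(String[] args) {
            // 48 correct uses vs. 2 violations: rare enough to report.
            System.out.println(isAnomaly(48, 2, 0.1));   // true
            // 30 correct uses vs. 20 violations: too common to be an anomaly.
            System.out.println(isAnomaly(30, 20, 0.1));  // false
        }
    }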
Mining Specifications of Malicious Behavior
• In the second paper we are discussing, code mining is used to find specifications of malicious behavior
• Computer security applications rely on manually written specifications to identify malicious code automatically
• However, the manual specification task is hard and time consuming
  – This paper tries to automate the specification of malicious behavior

The approach
• The presented approach works in three steps:
  1. Collect execution traces from malware and benign programs
  2. Construct the corresponding dependence graphs
  3. Compute the specification of malicious behavior as the difference of the dependence graphs
• Note that in this approach mining is done on the execution traces
  – In the paper we discussed earlier, mining was done on the source code

How to represent behavior?
• They identify some requirements for the representation of behaviors:
  1. A specification must not contain independent operations
  2. A specification must relate the dependent operations
  3. A specification should capture only security relevant operations
• To meet these requirements, they focus only on system calls and represent malicious behavior as a dependence graph of system calls
• This representation satisfies their requirements
  – Independent calls will not be connected in this representation
  – Dependent calls will be connected
  – Only the system calls will be tracked, since they correspond to the security relevant operations

How to represent the behavior?
• The behavior is represented as a special type of dependence graph
• Since they are interested in system security, they decide to model execution behavior as a sequence of system calls
• Each node of the dependence graph they construct corresponds to a system call
• The edges of the dependence graph correspond to constraints that represent the dependences between two system calls
  – such as argument 1 of call1 being equal to argument 2 of call2

More on dependence graphs
• The dependence graphs they construct are directed acyclic graphs
• Each node corresponds to a system call
  – They define a simple type system for the arguments of the system calls
• Edges represent dependencies, which are characterized as logic formulas
  – A logic system that allows constraints with modular and bit-vector arithmetic, arrays, and existential and universal quantifiers is sufficient

Comparing Benign Programs and Malware
• The presented approach first constructs the dependence graphs for the execution traces of the benign programs and the malicious programs
• Then they construct the minimal contrast subgraph of a malware dependence graph and the benign dependence graph
  – The smallest subgraph of the first graph that does not appear in the second
  – (a simplified sketch of this contrast idea is given at the end of these notes)

Empirical evaluation
• The presented approach is applied to 16 well-known malware examples
• For these 16 examples, the algorithm successfully discovers the same behavioral features as those independently provided by human experts
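Below is a minimal sketch of the contrast idea referenced above. It is a strong simplification of the paper's minimal contrast subgraph computation: the real algorithm contrasts whole subgraphs whose edges carry logic formulas over call arguments, whereas this sketch only reports malware edges whose string-labeled form never occurs in the benign graph. The system call names and constraint strings are hypothetical.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of contrasting a malware dependence graph against a benign one.
    public class ContrastSketch {
        // An edge of a system-call dependence graph; the dependency constraint
        // is kept as plain text here instead of a logic formula.
        record Edge(String fromCall, String toCall, String constraint) {}

        static Set<String> edgeLabels(List<Edge> graph) {
            Set<String> labels = new HashSet<>();
            for (Edge e : graph) {
                labels.add(e.fromCall() + "|" + e.constraint() + "|" + e.toCall());
            }
            return labels;
        }

        // Malware edges that never occur in the benign graph: a crude
        // stand-in for the difference between the two dependence graphs.
        static List<Edge> contrastEdges(List<Edge> malware, List<Edge> benign) {
            Set<String> benignLabels = edgeLabels(benign);
            List<Edge> contrast = new ArrayList<>();
            for (Edge e : malware) {
                String label = e.fromCall() + "|" + e.constraint() + "|" + e.toCall();
                if (!benignLabels.contains(label)) {
                    contrast.add(e);
                }
            }
            return contrast;
        }

        public static void main(String[] args) {
            // Hypothetical traces: the malware writes a file and then registers
            // it in the registry; the benign program only creates and writes a file.
            List<Edge> malware = List.of(
                new Edge("NtCreateFile", "NtWriteFile", "ret(call1) == arg1(call2)"),
                new Edge("NtWriteFile", "NtSetValueKey", "arg1(call1) == arg3(call2)"));
            List<Edge> benign = List.of(
                new Edge("NtCreateFile", "NtWriteFile", "ret(call1) == arg1(call2)"));
            System.out.println(contrastEdges(malware, benign));
        }
    }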