272: Software Engineering Fall 2012 Instructor: Tevfik Bultan Lecture 17: Code Mining

Code Mining
• There is a lot of code that is available for everyone to access
– Can we learn from them?
• One of the active research directions in software engineering is to
mine existing code for various purposes such as
– To discover common behaviors
• which can then be used to extract specifications such as
interfaces, usage patterns, etc.
– To discover anomalies
• which can then be used to find bugs or problematic behaviors
We will discuss two papers that do this
• Today we will discuss two papers that use code mining for different
purposes:
– "Graph-based Mining of Multiple Object Usage Patterns" Tung
Nguyen, Hoan Nguyen, Nam Pham, Jafar Al-Kofahi, and Tien
Nguyen. 7th joint meeting of the European Software Engineering
Conference and the ACM SIGSOFT Symposium on the
Foundations of Software Engineering (ESEC/FSE 2009).
– "Mining Specifications of Malicious Behavior" Mihai
Christodorescu, Somesh Jha, Christopher Kruegel. Joint meeting
of European Software Engineering Conference and ACM
SIGSOFT Symposium on the Foundations of Software
Engineering (ESEC/FSE 2007).
Mining for Object Usage Patterns
• We discussed papers that automatically extract behavioral interfaces
for classes
– The papers we discussed earlier focus on usage of a single class
and try to identify the ordering of method calls on a single object
• However, there may be usage patterns that involve multiple objects
– Moreover, usage patterns may involve
• control flow structures
– such as calling a method within a loop
• and data dependencies
– such as one argument in a method call being dependent
on an argument in another method call
GrouMiner
• GrouMiner is a tool that extracts object usage patterns, taking
into account both
– temporal usage orders (like we have seen in the interface
extraction papers we discussed already), and
– data dependencies
• It defines a graph-based object usage model (groum) and extracts
these models from existing code
Object Usage Model
• A groum is a directed acyclic graph
– nodes are labeled, edges are not labeled
• Nodes correspond to
– actions: method calls, access to object fields
– control flow structures: conditions, branches or loops such as if,
while, for statements
• Edges represent
– temporal ordering: If A is used before (or always generated
before) B then there is an edge from A to B
– data dependency: There is an edge from A to B if there is a data
dependency between A and B
• A groum can represent multiple objects
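The model described above can be sketched as a small data structure. This is a minimal illustration, not GrouMiner's implementation: the class name Groum, its methods, and the node labels (Java-style names such as BufferedReader.readLine) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Groum:
    """Hypothetical groum: labeled nodes, unlabeled directed edges."""
    labels: dict = field(default_factory=dict)  # node id -> label
    edges: set = field(default_factory=set)     # (source id, target id)

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, src, dst):
        self.edges.add((src, dst))

    def is_acyclic(self):
        # Kahn's algorithm: repeatedly remove nodes with no incoming edges.
        indegree = {n: 0 for n in self.labels}
        for _, dst in self.edges:
            indegree[dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        removed = 0
        while ready:
            node = ready.pop()
            removed += 1
            for src, dst in self.edges:
                if src == node:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        return removed == len(self.labels)


def example_groum():
    # Two objects (a FileReader and a BufferedReader) in one groum,
    # with a control-flow node for the loop around readLine.
    g = Groum()
    for node_id, label in enumerate([
        "FileReader.<init>", "BufferedReader.<init>", "WHILE",
        "BufferedReader.readLine", "BufferedReader.close",
    ]):
        g.add_node(node_id, label)
    for src, dst in [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3), (1, 4)]:
        g.add_edge(src, dst)
    return g
```

The example graph mixes action nodes and a control-flow node, and its edges carry no labels, matching the description on this slide.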
How to Extract the Object Usage Model?
• The temporal ordering of the nodes in a groum is extracted from the
AST by adding edges between nodes that are sequentially ordered
• The data dependency edges are extracted using an intra-procedural
dependency analysis
– Identify the variables involved in each action to determine the
dependencies and add edges to represent the dependencies
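The two kinds of edges can be illustrated on a straight-line sequence of actions. This sketch assumes a heavy simplification: each action is given as a label plus the set of variables it involves, and two actions are treated as data dependent when they share a variable. The real tool works on the AST of the program; the function name build_groum_edges and the input format are hypothetical.

```python
def build_groum_edges(actions):
    """actions: list of (label, set of variable names) in program order.

    Returns (temporal_edges, data_edges) as sets of (i, j) index pairs.
    """
    temporal = set()
    data = set()
    # Temporal ordering: connect actions that are sequentially ordered.
    for i in range(len(actions) - 1):
        temporal.add((i, i + 1))
    # Data dependency (simplified): an earlier and a later action
    # that involve at least one common variable.
    for i in range(len(actions)):
        for j in range(i + 1, len(actions)):
            if actions[i][1] & actions[j][1]:
                data.add((i, j))
    return temporal, data
```

For example, an action on variables {"fr", "br"} followed by one on {"br"} yields both a temporal edge and a data-dependency edge between them, while an unrelated action on {"sb"} only receives a temporal edge.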
Extracting Code Skeletons
• A groum model can be un-parsed and converted to a code skeleton
• This code skeleton will demonstrate the usage pattern as code rather
than a directed acyclic graph
• This approach can be used as a reverse engineering approach to
discover different usage patterns in the code
• However, there will be many groums in a given code and not all of
them should be reported as usage patterns
– They should be filtered somehow
Usage Pattern Mining
• GrouMiner uses graph mining techniques to identify the common
usage patterns in the code
• They determine the frequency of a pattern by computing the number
of independent occurrences
– If the frequency of a pattern is higher than a threshold, then it is
reported
• The graph mining algorithm determines the common patterns
efficiently by
– identifying common graph patterns incrementally, starting with
graphs with a small number of nodes and then finding other patterns
based on the sub-graph relationship
– checking equivalence of patterns approximately using a vector
representation that summarizes the features of a pattern, rather
than doing an exact matching
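The idea of approximate equivalence plus a frequency threshold can be sketched as follows. The feature choice here (counts of node labels and of label pairs on edges) is an assumption made for illustration and is not the vector representation used in the paper; the function names are hypothetical, and the incremental pattern growth is not modeled.

```python
from collections import Counter


def feature_vector(labels, edges):
    """Summarize a graph by counting its node labels and edge label pairs."""
    features = Counter(("node", label) for label in labels.values())
    features.update(("edge", labels[s], labels[d]) for s, d in edges)
    # A frozenset of (feature, count) pairs is hashable and ignores node ids.
    return frozenset(features.items())


def frequent_patterns(graphs, threshold):
    """graphs: list of (labels, edges). Report vectors seen more than threshold times."""
    counts = Counter(feature_vector(labels, edges) for labels, edges in graphs)
    return {vec: n for vec, n in counts.items() if n > threshold}
```

Two graphs with the same features but different node identifiers map to the same vector, so they count as occurrences of one pattern without an exact graph matching. Distinct graphs can also collide on the same vector, which is the price of the approximation.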
Anomaly Detection
• Using the graph mining algorithm they can identify anomalous usages
– They identify an anomalous usage as a sub-graph of an identified
pattern that is not extensible to that pattern
– This is considered a violation of the pattern
• A violation is considered an anomaly when it is too rare
– i.e., common violations are not reported as anomalies
• They discuss two types of anomaly detection: 1) anomaly detection in
a given project, 2) anomaly detection when a project changes
• Anomaly detection can be used to identify errors
– An anomalous usage may correspond to violation of an interface
and may point to a bug
• However, when anomaly detection is used as a bug finding approach
it generates a lot of false positives (87.8% in one case)
– i.e., many identified anomalies do not correspond to errors
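The rarity rule can be sketched under simplifying assumptions: a pattern and each usage are represented as sets of label-pair edges, a usage violates the pattern when its edges form a non-empty proper subset of the pattern, and violations are reported only when they make up a small share of all relevant usages. The function name find_anomalies and the max_ratio parameter are hypothetical; the paper's actual criteria may differ.

```python
def find_anomalies(pattern, usages, max_ratio=0.1):
    """pattern: frozenset of (label, label) edges.
    usages: dict mapping a usage name to its set of edges.

    Returns the names of violating usages if violations are rare,
    otherwise an empty list (common violations are not anomalies).
    """
    complete = [name for name, edges in usages.items() if pattern <= edges]
    violating = [name for name, edges in usages.items()
                 if edges and edges < pattern]
    total = len(complete) + len(violating)
    if total == 0 or len(violating) / total > max_ratio:
        return []
    return violating
```

With 19 usages that contain the full pattern and one that omits the final call, the single incomplete usage is reported. If half of the usages omit the call, nothing is reported, since the omission is then too common to be considered anomalous.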
Mining Specifications of Malicious Behavior
• In the second paper we are discussing, code mining is used to find
specifications of malicious behavior
• Computer security applications rely on manually written specifications
to identify malicious code automatically
• However, the manual specification task is hard and time consuming
– This paper tries to automate the specification of malicious
behavior
The approach
• The presented approach works in three steps
1. Collect execution traces from malware and benign programs
2. Construct the corresponding dependence graphs
3. Compute specification of malicious behavior as difference of
dependence graphs
• Note that in this approach mining is done on the execution traces
– In the paper we discussed earlier, mining was done on the
source code
How to represent behavior?
• They identify some requirements for representation of behaviors:
1. A specification must not contain independent operations
2. A specification must relate the dependent operations
3. A specification should capture only security relevant operations
• To meet these requirements they focus only on system calls and
represent malicious behavior as a dependence graph of system calls
• This representation satisfies their requirements
– Independent calls will not be connected in this representation
– Dependent calls will be connected
– Only the system calls will be tracked since they correspond to
the security relevant operations
How to represent the behavior?
• The behavior is represented as a special type of dependence graph
• Since they are interested in system security, they decide to model
execution behavior as a sequence of system calls
• Each node of the dependence graph they construct corresponds to a
system call
• The edges of the dependence graph correspond to constraints that
represent the dependences between two system calls
– For example, argument 1 of call 1 is equal to argument 2 of call 2
More on dependence graphs
• The dependence graphs they construct are directed acyclic graphs
• Each node corresponds to a system call
– They define a simple type system for the arguments of the system
calls
• Edges represent dependencies which are characterized as logic
formulas
– A logic system that allows constraints with modular and bit-vector
arithmetic, arrays, and existential and universal quantifiers is
sufficient
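A dependence graph of this kind can be sketched from a recorded trace. This illustration assumes that a trace is a list of (name, arguments, result) tuples and models only equality constraints, which is far simpler than the logic described above. The call names open, read, and close and the function name dependence_edges are placeholders, not taken from the paper.

```python
def dependence_edges(trace):
    """trace: list of (name, args, result) tuples in execution order.

    Returns (i, j, constraint) triples: call j uses a value that
    call i returned or was itself given as an argument.
    """
    edges = []
    for i, (_, args_i, result_i) in enumerate(trace):
        # Values associated with the earlier call: its result and arguments.
        values = [("ret", result_i)]
        values += [(f"arg{k}", a) for k, a in enumerate(args_i, 1)]
        for j in range(i + 1, len(trace)):
            _, args_j, _ = trace[j]
            for k, arg in enumerate(args_j, 1):
                for source, value in values:
                    if value is not None and value == arg:
                        edges.append(
                            (i, j, f"{source}(call{i}) == arg{k}(call{j})"))
    return edges
```

Calls that share no values stay unconnected, so independent operations do not end up related in the graph, and edges always point from an earlier call to a later one, which keeps the graph acyclic.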
Comparing Benign Programs and Malware
• The presented approach first constructs the dependence graphs for
the execution traces of the benign programs and the malware
• Then they construct the minimal contrast subgraph of a malware
dependence graph and the benign dependence graph
– The smallest subgraph of the first graph that does not appear in
the second
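The contrast idea can be illustrated in a strongly simplified setting. This sketch assumes each dependence graph is a set of edges between system call names, so that nodes are identified by name and the subgraph relation reduces to set inclusion; under that assumption, the smallest subgraphs of the malware graph that do not appear in any benign graph are its single edges missing from all benign graphs. The actual approach has to match graphs structurally, which this does not attempt. The function name contrast_edges and the call names are hypothetical.

```python
def contrast_edges(malware_edges, benign_graphs):
    """malware_edges: set of (call name, call name) dependence edges.
    benign_graphs: list of such edge sets, one per benign program.

    Returns the malware edges that occur in no benign graph, sorted.
    """
    seen_in_benign = set()
    for edges in benign_graphs:
        seen_in_benign |= set(edges)
    return sorted(set(malware_edges) - seen_in_benign)
```

An edge such as reading a file and then sending its contents over the network would survive the difference only if no benign trace shows the same dependence.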
Empirical evaluation
• The presented approach is applied to 16 well-known malware
examples
• For these 16 examples, the algorithm successfully discovers the
same behavioral features as those independently provided by human
experts