Reverse Engineering and Software Clustering for Remodularization
Technical Report CS-010-2002
Mehreen Saeed and Onaiza Maqbool
Supervisor : Dr. Haroon Atique Babri
Software Engineering Research Group SERG
Computer Science Department
Lahore University of Management Sciences LUMS
Abstract
The objective of this report is to explore the area of software reverse
engineering. It consists of two parts. Part I of the report deals with the
general area of software reverse engineering. It details the objectives of
reverse engineering and outlines some of the tools and technologies used
in this area. Part II of the report is more focused and investigates the
area of software clustering in a thorough and detailed manner with its
applications to reverse engineering. It also presents the detailed results
of software clustering experiments that were carried out as part of this
research.
Acknowledgements
For making this endeavor a success, we acknowledge the support
provided by the following:
Dr. Syed Zahoor Hassan, for finding time to offer suggestions and guiding us in the right
direction, and Dr. Mansoor Sarwar, for his interest and encouragement.
Dr. Hamayun Mian, Chairman PITB, with whom discussions were held in the initial phases of
this research.
Johannes Martin of the Rigi group, University of Victoria, Canada, for providing files in the Rigi
Standard Format (RSF) for Xfig.
Lahore University of Management Sciences (LUMS) for providing the
financial support and resources to carry out this research.
Table of Contents
1 Reverse Engineering .......................................................................................................... 1
1.1 Objectives of Reverse Engineering ............................................................................... 2
1.2 Reverse Engineering Areas ........................................................................................... 3
1.3 Code Reverse Engineering Process ................................................................................... 3
1.3.1 Levels of Abstraction ............................................................................................ 4
1.3.2 Program Representation ........................................................................................ 5
1.3.3 Exchange Formats ................................................................................................. 7
1.3.3.1 Graph eXchange Language GXL ............................................................. 8
1.3.3.2 Rigi Standard Format RSF and Tuple Attribute TA Language ..................... 8
1.4 Reverse Engineering Techniques .................................................................................. 9
1.4.1 Function Abstraction ........................................................................................... 10
1.4.2 GraphLog............................................................................................................. 10
1.4.3 Knowledge Based Program Analysis .................................................................. 11
1.4.4 Graph Parsing Approach ..................................................................................... 11
1.4.5 Concept Analysis ................................................................................................. 12
1.4.5.1 Mathematical Foundation ............................................................................ 12
1.4.5.2 Concept Analysis Applications ................................................................... 14
1.4.6 Design Recovery System (DESIRE) ................................................................... 15
1.4.7 Recognizing Program Plans ................................................................................ 16
1.4.8 Rigi ...................................................................................................................... 16
1.4.9 Grok ..................................................................................................................... 17
1.4.10 Dali .................................................................................................................. 18
1.5 Case Studies on Reverse Engineering ......................................................................... 18
1.5.1 Xfig ...................................................................................................................... 18
1.5.2 SORTIE ............................................................................................................... 19
1.5.3 SQL/DS ............................................................................................................... 20
1.6 Tools for Reverse Engineering .................................................................................... 21
2 Software Clustering ......................................................................................................... 22
2.1 Clustering : An Overview ............................................................................................ 22
2.1.1 Entity identification and feature selection ........................................................... 22
2.1.2 Definition of a similarity measure ....................................................................... 23
2.1.2.1 Association coefficients............................................................................... 23
2.1.2.2 Distance measures ....................................................................................... 24
2.1.2.3 Correlation Coefficients .............................................................................. 25
2.1.2.4 Probabilistic Measures................................................................................. 27
2.1.3 Clustering ............................................................................................................ 27
2.1.3.1 Hierarchical ................................................................................................. 27
2.1.3.2 Partitional Algorithms ................................................................................. 30
2.1.4 Assessment .......................................................................................................... 30
2.2 Clustering Software ..................................................................................................... 30
2.2.1.1 Observations Regarding Association, Correlation and Distance Metrics.... 32
2.2.2 Assessment of Results ......................................................................................... 32
2.2.3 Use of Precision and Recall ................................................................................. 33
2.3 Our Approach .............................................................................................................. 34
2.3.1 Entity identification and feature selection ........................................................... 34
2.3.2 Definition of a similarity measure ....................................................................... 35
2.3.3 Clustering and the Combined Algorithm............................................................. 35
2.4 Methods Used in the Past ............................................................................................ 37
2.5 Clustering : A Theoretical Framework ........................................................................ 39
2.5.1 The Iteration Versus the Total Clusters Graph .................................................... 40
2.5.2 Black Hole System .............................................................................................. 40
2.5.3 Glass Cloud System ............................................................................................. 41
2.5.4 Equidistant Objects .............................................................................................. 42
2.5.5 Structured Program .............................................................................................. 45
2.5.6 Unstructured Program ......................................................................................... 47
2.6 Experiments with Clustering for Software Re-Modularization ................................... 51
2.6.1 The Test System .................................................................................................. 51
2.6.2 Clustering Techniques ......................................................................................... 52
2.6.3 Evaluation of Results ............................................................................................ 53
2.6.4 Analysis of the Xfig System ................................................................................ 53
2.6.5 Analysis of d_files Sub_System .......................................................................... 53
2.6.6 Summary and Discussion of Results ................................................................... 59
2.7 Future Directions ......................................................................................................... 61
A. Experimental Results ....................................................................................................... 63
A.1
Analysis of e_files Sub_System ............................................................................. 63
A.2
Analysis of f-files SubSystem ................................................................................ 66
A.3
Analysis of u_files SubSystem ................................................................................ 69
A.4
Analysis of w_files SubSystem .............................................................................. 72
A.5
Analysis of the Entire System ................................................................................ 75
B. Software Repositories ...................................................................................................... 79
C. Summary of Reverse Engineering Methods .................................................................... 80
D. Detailed Summary of Reverse Engineering Methods ..................................................... 81
Annotated Bibliography .......................................................................................................... 83
Table of Figures
Figure 1: The code reverse engineering process ....................................................................... 3
Figure 2 : The forward and reverse engineering of source code ............................................... 4
Figure 3 : Parse Tree and the Corresponding AST for the Expression b*b-4*a*c [Grune, Bal,
Jacobs, Langendoen, 2001] ............................................................................................... 6
Figure 4 : Resource Flow Graph Generated by Rigi for Linked List Program [Müller, Tilley,
Orgun, Corrie, Madhavji, 1992]. ....................................................................................... 7
Figure 5 : The Directed Graph and a part of its corresponding GXL for a Small Program
[Website GXL]. ................................................................................................................. 8
Figure 6 : Concept Lattice for the Attributes and Facts in Table 1 ......................................... 14
Figure 7 : The Reverse Engineering Process in Rigi ............................................................... 17
Figure 8 : Divisive Clustering Algorithm ................................................................................ 27
Figure 9 : Agglomerative Clustering ....................................................................................... 28
Figure 10 : Initial Clusters ....................................................................................................... 28
Figure 11 : After Step 1 ........................................................................................................... 29
Figure 12 : Clusters Formed at Step 3 Using Single Linkage Algorithm ................................ 29
Figure 13 : Clusters Formed Using Complete Linkage Algorithm ......................................... 29
Figure 14 : Precision and Recall Graph for an Ideal Clustering Algorithm ............................ 34
Figure 15: Iteration Vs. Total Clusters for the Black Hole System ......................................... 41
Figure 16: Pancaked Structure [Pressman, 97]........................................................................ 41
Figure 17: The general curve obtained for Iteration Vs. The total Number of Clusters .......... 43
Figure 18: Iteration Vs. Total Clusters for the Combined Algorithm for Equidistant Objects 43
Figure 19: Iteration Vs. the Total Number of Clusters for Equidistant Objects with 100%
similarity.......................................................................................................................... 44
Figure 20 : A Structured Program ........................................................................................... 46
Figure 21: Clustering Process for the Structured Program ...................................................... 46
Figure 22: Unstructured Program ............................................................................................ 47
Figure 23: Clustering Process for the Unstructured Program .................................................. 48
Figure 24: Clusters Formed for the Unstructured Program ..................................................... 48
Figure 25: Iteration Vs. Number of Clusters for the Jaccard Similarity Using Combined
Algorithm. ....................................................................................................................... 49
Figure 26: Clusters formed for Jaccard Similarity Measure Using Combined Features ......... 49
Figure 27: Clustering Process for Simple Similarity Co-efficient Using Complete Linkage . 50
Figure 28: Clusters Formed Using Simple Similarity Coefficient and Complete Linkage
Algorithm ........................................................................................................................ 50
Figure 29: Comparison of Features Using Jaccard Similarity and Complete Algorithm ........ 54
Figure 30: Comparison of Similarity Metrics Using Complete Algorithm and All Features.. 56
Figure 31: Comparison of Similarity Metrics Using Combined Algorithm and All Features 56
Figure 32: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 57
Figure 33: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the
Combined and Complete Algorithms Using All Features ............................................... 58
Figure 34: Comparison of Features Using Different Similarity Metrics and Complete
Algorithm ........................................................................................................................ 63
Figure 35: Comparison Similarity Metrics for the Complete Algorithm Using All Features . 64
Figure 36: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
......................................................................................................................................... 64
Figure 37: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 65
Figure 38: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the
Combined and Complete Algorithms Using All Features ............................................... 66
Figure 39: Comparison of Features Using Different Similarity Metrics and Complete
Algorithm ........................................................................................................................ 67
Figure 40: Comparison Similarity Metrics for the Complete Algorithm Using All Features . 67
Figure 41: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
......................................................................................................................................... 68
Figure 42: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 68
Figure 43: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the
Combined and Complete Algorithms Using All Features ............................................... 69
Figure 44: Comparison of Features Using Different Similarity Metrics and Complete
Algorithm ........................................................................................................................ 70
Figure 45: Comparison Similarity Metrics for the Complete Algorithm Using All Features . 70
Figure 46: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
......................................................................................................................................... 71
Figure 47: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 71
Figure 48: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the
Combined and Complete Algorithms Using All Features ............................................... 72
Figure 49: Comparison of Features Using Different Similarity Metrics and Complete
Algorithm ........................................................................................................................ 73
Figure 50: Comparison Similarity Metrics for the Complete Algorithm Using All Features . 73
Figure 51: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
......................................................................................................................................... 74
Figure 52: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 74
Figure 53: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the
Combined and Complete Algorithms Using All Features ............................................... 75
Figure 54: Comparison of Features Using Different Similarity Metrics and Complete
Algorithm ........................................................................................................................ 76
Figure 55: Comparison Similarity Metrics for the Complete Algorithm Using All Features . 76
Figure 56: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
......................................................................................................................................... 77
Figure 57: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 77
Figure 58: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the
Combined and Complete Algorithms Using All Features ............................................... 78
PART I : REVERSE ENGINEERING
1
Reverse Engineering
The term “reverse engineering” was initially used for the extraction of design and structure of
hardware components by analyzing the finished products. Now the term “reverse engineering” is
also widely applicable to software systems. One of the first known definitions of software reverse
engineering is by Chikofsky and Cross in their seminal paper “Reverse Engineering and Design
Recovery: A Taxonomy” [Chikofsky, Cross II, 1990]. They define reverse engineering as:
“The process of analyzing a subject system to
 identify the system’s components and their interactions, and
 create representations of the system in another form or at a higher level of abstraction.”
Note that the above definition of reverse engineering is not restricted to software only. The process of
reverse engineering is carried out to obtain an understanding of the system when the original
design or plan of the system is not available. The intention can be to make an identical ‘clone’ of
the system or to carry out maintenance on the system when the original design is not available.
When considering software systems, it is often the case that long-term changes in the system lead
to a deviation of the system from its original design. For legacy systems, written ten or even
twenty years ago, the original designers or developers of the system are no longer available. Such
systems might be too costly to replace. In such situations it is often desirable to carry out reverse
engineering processes to gain a better understanding of the system. This process helps the
maintenance engineers in adapting the system to the changing needs of the organization and also
in reengineering and migrating the system to a newer platform.
Chikofsky and Cross point out that reverse engineering is “a process of examination, not
change or replication”. Forward engineering involves the process of moving from a high level
abstraction in the form of client requirements or specifications of the system to a lower level
implementation of the system, in the form of source code, that is platform dependent. Reverse
engineering is, therefore, the opposite process of forward engineering. Here the goal is to
“examine” the system to move from low level details available in the form of source code to
higher abstract levels. The higher abstract levels can be in the form of syntax trees or dependency
graphs that give a better understanding of the source code. At a higher level of abstraction, the
process of reverse engineering identifies the subsystems of an overall design, ultimately leading
to an understanding of the logical design, the functional specifications and the requirements or the
specifications of the system.
The objective of this report is not to look at the financial aspects of reverse engineering but to
conduct an in-depth study of its technical aspects. In this part of the report we explore the
objectives of reverse engineering, the reverse engineering processes and the various approaches
to reverse engineering. It also outlines some of the work done in the past and also a few case
studies pertaining to this area.
1.1
Objectives of Reverse Engineering
The main goal of reverse engineering is to study a system and gain a better understanding of the
system. This understanding or insight is required for re-documenting the system, maintaining it,
and re-designing or restructuring it according to a present-day paradigm. Such a study would also
be required to migrate or reengineer the system from an old platform to a new one.
Chikofsky and Cross define various objectives of reverse engineering as [Chikofsky, Cross II,
1990] :

Cope with complexity of large voluminous systems : The main problem encountered with
maintaining large legacy systems is the sheer volume of code that has to be understood.
Reverse engineering tools help in extracting relevant information from large complex systems
and, hence, help the software developers control the processes and products in systems
evolution.

Generate alternate views of software programs : Reverse engineering tools are designed to
generate various alternate views of software programs like call graphs, dependency graphs,
resource flow graphs etc. A graphical representation of such views helps the maintenance
engineers gain a better insight of the system.

Recover lost information from legacy systems : Over a period of time a system tends to
deviate from its original design. The process of reverse engineering helps in recovering lost
information of its design, specifications and requirements.

Detect side effects of haphazard initial design and successive modifications : Haphazard
design together with modifications in the system leads to side effects in the system. The
performance of the system can degrade or it can produce unpredictable results. The process
of reverse engineering can provide these observations and identify problem areas of the
system.

Synthesize higher abstractions : As mentioned earlier the goal of reverse engineering is not
just to provide an insight of source code but also to generate higher level abstract forms of the
system like the logical design, functional specifications or the requirements of the system.

Facilitate re-use by detecting reusable software components : An important application of
reverse engineering is to identify components in a software system and reuse them in new
systems to save development costs.
1.2
Reverse Engineering Areas
Müller et al. in their paper “Reverse Engineering: A Roadmap” have identified two main areas of
reverse engineering [Müller et al., 2000]:
• Code reverse engineering
• Data reverse engineering
Code reverse engineering is the retrieval of information from source code to extract various levels
of abstraction. Data reverse engineering, on the other hand, focuses on data generated by a
software system and finding the mapping of this data with the logical data structures being used
in the program. Therefore, code reverse engineering tells us “how” the processing of information
takes place and data reverse engineering focuses on “what” information is processed.
Code reverse engineering has applications in the area of program understanding, system
reengineering, architecture recovery and restructuring, studying software evolution etc. Initially
the process of reverse engineering was considered only to pertain to code reverse engineering.
However, data reverse engineering gained the focus of attention by researchers in the past few
years when solving the Y2K problem. It is applicable where massive software changes pertaining
to data are required like the structure of date change in the Y2K problem or the European
currency change in 2002. This report mainly focuses on the area of code reverse engineering
rather than data reverse engineering. The area of code reverse engineering is therefore presented
in detail here.
1.3
Code Reverse Engineering Process
Figure 1: The code reverse engineering process
Figure 1 illustrates the code reverse engineering process. The source code is parsed by the parser.
The parser produces an abstract form of the code in terms of a syntax tree, a dependency graph, a
resource flow graph etc. The analyzer then analyzes the source code for further documentation or
visualization. A reverse engineering process is the reverse of a forward engineering process. In
this case one moves from lower level implementation details to a more abstract form.
1.3.1
Levels of Abstraction
Figure 2 : The forward and reverse engineering of source code
Figure 2 illustrates the forward and reverse engineering processes. Harandi and Ning have
defined the following levels of abstraction [Harandi, Ning, 1990]:

Implementation Level : At the implementation level of abstraction a program’s source code
is directly represented into a form from which it is possible to directly obtain the source code.
However, an abstraction at the implementation level is independent of the programming
language used. Examples of this form of representation are the abstract syntax trees or the
symbol tables.

Structure Level : At this level of abstraction, the source code of a program is shown in less
detail but more stress is laid on understanding the basic detailed structure or design of the
system. Examples of this form of abstraction are the representation of a system by means of
call graphs, resource flow graphs or program dependency graphs etc.

Function Level : Functional level abstraction tends to group together functionally equivalent
but structurally different parts of source code. This form of abstraction provides the software
engineers with a bird’s eye view of the entire system where they can gain an insight of the
basic architecture of the system. It presents the main modules present in the system and their
interaction with each other.

Domain Level : The domain level abstraction replaces the algorithmic nature of data to
domain specific concepts. Abstraction at the functional level gives the ‘how’ information to
the software engineer and the domain level abstraction answers the questions ‘what type of
information is being processed’ and ‘what is being done’ and not ‘how’ it is being done. This
level gives an insight into the nature of the problem being solved by the system and details
the requirements and the specifications of the software.
In Figure 1, the parser is only used to convert source code into a format that is independent of
the original programming language. The real intelligence lies in the analyzer that documents or
produces abstract high level forms of the system. Some of the methods for analyzing the source
code are described later in this report in section 1.4 “Reverse Engineering Techniques”.
1.3.2
Program Representation
As we can see, the parsing stage of reverse engineering is language dependent. The second
stage is language independent. The trend in reverse engineering is to build analyzers that are
independent of the programming language used. Such a framework is also called the Intentional
Programming Framework where the programs are stored, edited and processed as source
graphs and are not dependent upon the programming language [Visser, 2001]. The possible types
of outputs produced by the parser and processed further for reverse engineering a system are:
Figure 3 : Parse Tree and the Corresponding AST for the Expression b*b-4*a*c [Grune, Bal, Jacobs,
Langendoen, 2001]

Abstract syntax tree AST: The syntax tree or a parse tree depicts how the various segments
of program text are viewed in terms of grammar [Grune, Bal, Jacobs, Langendoen, 2001]. It
is a very detailed data structure showing the syntactic information of a program. Usually, this
detailed information is not required for reverse engineering and hence the abstract syntax tree
is used instead. The abstract syntax tree is less detailed than a parse tree and usually omits the
information regarding the grammar of the language. An example of both a parse tree and an AST
is shown in Figure 3 (a small code sketch follows this list).

Tree or graph : A program can also be represented by means of trees, directed acyclic
graphs or full fledged graphs with cycles. Of particular interest are program dependency
graphs PDGs that are directed graphs whose vertices are connected by means of several types
of edges. The vertices represent the assignment statements and predicates of a program. The
edges represent control and data dependencies. A PDG can represent programs with only one
procedure. Similar to PDG, system dependency graphs are used to represent programs with
multiple procedure calls. Another type of graph used to represent programs is the resource
flow graph that is a directed weighted graph, where the vertices of the graph are components
of the system and the edges are dependencies induced by the resource supplier-client relation.
A directed edge from node a to node b represents a dependency of a on b and the weight on
the edge represents the resources being exchanged between a and b [Müller, Uhl, 1990]. An
example of a resource flow graph as generated by the reverse engineering tool Rigi is shown
in Figure 4.
Figure 4 : Resource Flow Graph Generated by Rigi for Linked List Program [Müller, Tilley, Orgun, Corrie,
Madhavji, 1992].
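Following up on the abstract syntax tree bullet above, the sketch below is a minimal illustration of our
own (it is not taken from any of the tools discussed here) of how the AST of Figure 3's expression
b*b-4*a*c might be represented with plain Python tuples and then evaluated by walking the tree:

# A minimal AST sketch for the expression b*b - 4*a*c of Figure 3.
# Each node is a tuple: ('num', value), ('var', name), or (operator, left, right).
ast = ('-',
       ('*', ('var', 'b'), ('var', 'b')),
       ('*', ('*', ('num', 4), ('var', 'a')), ('var', 'c')))

def evaluate(node, env):
    # Recursively evaluate an AST node against a variable environment.
    kind = node[0]
    if kind == 'num':
        return node[1]
    if kind == 'var':
        return env[node[1]]
    left, right = evaluate(node[1], env), evaluate(node[2], env)
    return left * right if kind == '*' else left - right

print(evaluate(ast, {'a': 1, 'b': 5, 'c': 4}))   # prints 9

A reverse engineering parser works in the opposite direction, producing such a tree from the source text,
but the data structure it manipulates has essentially this shape.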
1.3.3
Exchange Formats
Figure 1 depicts the reverse engineering process and shows that the process of analyzing the
source code can be kept language independent. To achieve this end, there is a need to define
exchange formats that allow the output from a parser to be reused by any analyzer. Various
exchange formats have been defined to represent the AST or the dependency graph of a program.
Two of the more popular exchange formats, namely the Graph eXchange Language GXL and the
Rigi Standard Format RSF, are described here.
1.3.3.1
Graph eXchange Language GXL
GXL appears to be the emerging standard for exchanging the output between a parser and a
reverse engineering tool. It is based on the Dagstuhl Middle Model DMM, which is a model for
repository software in reverse engineering [Website GXL]. GXL is an XML sub-language and
can be used to represent any type of schema graphs. It is used to represent AST level abstraction
or architectural level abstraction and can be used to exchange instance data and schema data.
GXL represents a typed attributed directed graph. The directed graph for a small program and its
corresponding GXL is taken from GXL’s website and is shown in Figure 5.

<?xml version="1.0"?>
<!DOCTYPE gxl SYSTEM "../../../gxl1.0.dtd">
<gxl xmlns:xlink="http://www.w3.org/1999/xlink">
  <graph id="simpleExample">
    <type xlink:href="../../schema/gxl/simpleExampleSchema.gxl#simpleExampleSchema"/>
    <node id="p">
      <type xlink:href="../../schema/gxl/simpleExampleSchema.gxl#Proc"/>
      <attr name="file">
        <string>main.c</string>
      </attr>
    </node>
    <node id="q">
      <type xlink:href="../../schema/gxl/simpleExampleSchema.gxl#Proc"/>
      <attr name="file">
        <string>test.c</string>
      </attr>
    </node>
    <node id="v">
      <type xlink:href="../../schema/gxl/simpleExampleSchema.gxl#Var"/>
      <attr name="line">
        <int>225</int>
      </attr>
    </node>
    ...
  </graph>
</gxl>

Figure 5 : The Directed Graph and a part of its corresponding GXL for a Small Program [Website GXL].
1.3.3.2
Rigi Standard Format RSF and Tuple Attribute TA Language
The tuple attribute language is used to represent certain types of information of a program by
means of graphs [Holt, 1997]. TA is a two-level language: the first level, called the Tuple
sub-language, is for recording facts about the program, and the second, called the Attribute
sub-language, is for recording its schema. The tuple language has facts or tuples in the form of a
triple. The first attribute of the tuple represents the verb, the second the subject and the third the
object, e.g. if module P calls module Q in a program, this would be represented as:
Call P Q
This statement also represents two nodes P and Q of a graph connected together by an edge
labeled “call”. Call is the verb, P the subject and Q the object. This syntax for triples was
invented in the Rigi Project [Müller, Tilley, Orgun, Corrie, Madhavji, 1992], where it is called
RSF, for Rigi Standard Form. This form was extended by Holt [Holt, 1997]. Hence, the entire
program can be represented by means of a colored graph and stored using this language.
Therefore the language is a means for storing graphs and hence acts like a graph repository. The
attribute sub-language is used to assign an attribute to each edge or node in the colored graph
representing the program e.g.
Call P Q {color = black}
P{color = red}
The first statement assigns a color to the verb/edge and the second statement assigns a color to the
node (P). This language can also be used to record the position of the vertices in the graph.
Hence, the TA language is simple yet useful and can serve as a repository for storing
information about graphs that represent a program's structure.
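As a rough sketch of how such fact triples might be loaded into an in-memory graph structure, consider
the Python fragment below. It is our own illustration: the facts extend the Call P Q example above with
an invented Data relation, and a real RSF/TA parser handles quoting, schemas and nested attribute
records far more carefully.

# Minimal sketch: read RSF/TA-style facts into edge and attribute tables.
facts = """\
Call P Q {color = black}
Call P R
Data Q buffer
P {color = red}
"""

edges, node_attrs, edge_attrs = [], {}, {}
for line in facts.splitlines():
    line = line.strip()
    if not line:
        continue
    attrs = {}
    if '{' in line:                        # split off an attribute record, if any
        line, record = line.split('{', 1)
        for pair in record.rstrip('}').split(','):
            key, value = pair.split('=')
            attrs[key.strip()] = value.strip()
    parts = line.split()
    if len(parts) == 3:                    # a fact triple: verb subject object
        verb, subject, obj = parts
        edges.append((verb, subject, obj))
        if attrs:
            edge_attrs[(verb, subject, obj)] = attrs
    elif len(parts) == 1:                  # an attribute record attached to a node
        node_attrs[parts[0]] = attrs

print(edges)       # [('Call', 'P', 'Q'), ('Call', 'P', 'R'), ('Data', 'Q', 'buffer')]
print(node_attrs)  # {'P': {'color': 'red'}}
print(edge_attrs)  # {('Call', 'P', 'Q'): {'color': 'black'}}

The edge list is exactly the colored graph described above, which is why the TA language can be viewed
as a simple graph repository.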
1.4
Reverse Engineering Techniques
This section describes various techniques for software reverse engineering and outlines some of the
work done by various researchers. A summary of these methods is presented in the appendices.
Many of the techniques described are on code reverse engineering for re-modularization, program
understanding or ease of software maintenance. The major challenge of code reverse engineering
is to extract meaningful abstract level design and requirements information from source code.
Rich and Wills point out several difficulties in this regard [Rich, Wills, 1990]:

Syntactic Variation : There are different ways and methods of solving the same problem. A
programmer can achieve the same net flow of data and control in many different possible
ways. All programmers have different styles and different approaches to coding a problem.
Even the same programmer, when given the same problem twice would code it differently.

Non-Contiguousness : The logic for solving a single problem may be scattered throughout
the program. It might not be localized to a single module or procedure but spread at different
places in the program.

Implementation Variation : A problem can be implemented in many different ways. The
same problem can have different solutions at an abstract level and hence different
implementations for solving it.

Overlapping Implementations : Two or more distinct abstract forms or levels of a system
may be implemented using the same lines of code or the same portion of a program, making
it difficult to identify them individually.

Unrecognizable Code : A program understanding system should be able to ignore the
idiosyncrasies present in the source code and only pick up relevant information.
In the following sections we describe some of the reverse engineering technologies.
1.4.1
Function Abstraction
[Hausler, Pleszkoch, Linger, Hevner 1990] propose program understanding through “function
abstraction”. Abstracting program functions means that the precise function of a program or subprogram is determined by analyzing how it manipulates the data. The objective of this technique
is to extract business rules by producing higher level abstractions. A summary of this method is
as follows:
• Transform the program into a structured form by getting rid of all goto statements.
• Identify two classes of programs: proper programs and prime programs. A proper program is a
flow chart program having a single entry and single exit node and a path through each node.
Prime programs are irreducible proper programs. Hausler et al. identified three types of primes,
namely iteration (e.g. do-while), alternation (i.e. if-then) and sequence (i.e. begin-end).
• Construct a proper program by repeatedly replacing function nodes by prime sub-programs.
This process is termed function expansion.
• Use program slicing techniques based on data structures to abstract program primes one
variable at a time.
• Determine the functionality of each prime by analyzing the usage patterns of data or trace
tables. This technique could vary depending upon the type of prime program being analyzed.
• After abstracting the overall program function, translate the mathematical constructs into
English text.
By performing function expansion the problem of determining a program’s function reduces to
determining the functionality of each program prime. The technique of function abstraction was
not implemented but developed only as a methodology; it lays down the basis for attaining a
program's function abstraction. However, straightforward abstracting techniques can lead to
program functions that are difficult to read and understand. There still remains the problem of
generating useful as well as functionally correct abstractions.
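As a toy illustration of abstracting a prime from its trace table (our own sketch, not how Hausler et al.
formalized the technique), consider a single while-loop prime whose function is recovered by inspecting
the usage pattern of one variable at a time:

# Toy sketch: derive the function of a while-loop prime from its trace table.
def trace_loop(a):
    i, s, table = 0, 0, []
    while i < len(a):                      # the prime being analyzed
        s = s + a[i]
        i = i + 1
        table.append({'iteration': i, 'i': i, 's': s})
    return s, table

result, table = trace_loop([3, 1, 4])
for row in table:
    print(row)
# The usage pattern of s across the rows suggests the abstracted function
# s = a[0] + a[1] + ... + a[n-1], i.e. the prime computes the sum of a.
print(result == sum([3, 1, 4]))            # True

The English-text translation of the recovered function ("s is the sum of the elements of a") is the kind of
higher level abstraction the methodology aims to produce.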
1.4.2
GraphLog
Joint research conducted by IBM and University of Toronto developed a graphical language
GraphLog as an aid to software engineering for querying software structures by means of visual
graphs [Consens, Mendelzon, Ryman, 1992]. Their main motivation for developing this language
was to construct a platform for bringing together two aspects of program understanding namely
query and visualization. The software system can be represented as a directed graph and queries
can be constructed by drawing graphs using a graphical editor. Consens et al. proposed using this
language as a basis for evaluating the quality of a software system by defining quality metrics
such as cyclic dependency among packages. This language can also be used to extract additional
information from source code by input queries using visual graphs. The basis of source code
representation is via ER diagrams and the underlying mechanism for GraphLog is based upon
predicate calculus.
1.4.3
Knowledge Based Program Analysis
Harandi and Ning [Harandi, Ning, 1990] describe building a knowledge based system ‘PAT’ for
enhancing program understanding and conducting program analysis. Their implementation of a
model system helps maintainers understand three basic issues.
1. What high-level concepts does a program implement?
2. How does the program encode these concepts in terms of low-level concepts?
3. Is the implementation of the recognized concepts correct?
PAT has a built in parser that re-writes the entire program into a language independent form
namely the event base. The event base has a set of objects called events that represent the
syntactic and semantic concepts present in the program. The program events are stored in a
hierarchical structure with low level concepts representing the source code. At the higher level
programming patterns and strategies are used. Using this event set the “understander” recognizes
higher level events that represent function oriented concepts. The newly recognized events are
added to the event set and the process is repeated till no more high-level events are recognized.
The final event set then has information about the high level concepts that a program implements.
A deductive inference rule engine is used to implement the “understander’s” core functionality.
A plan base is stored in the system by a domain expert. The plan base is a set of program plans
that represent the analysis knowledge of the program. It is stored as inference rules from which
high level events can be derived. The prototype system built by Harandi and Ning has 100
program events and a few dozen program plan rules. However, for real life systems several
hundred event classes and plans would be required.
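A toy flavour of this kind of rule-based event recognition is sketched below. It is our own illustration:
the event names and rules are invented, and PAT's actual event and plan representations are far richer
than a flat set of strings.

# Toy forward-chaining sketch: low-level program events are repeatedly rewritten
# into higher-level events until no rule fires any more (cf. PAT's "understander").
rules = [
    # (set of required events, higher-level event that is recognized)
    ({'init-counter', 'loop-over-array', 'increment-counter'}, 'count-elements'),
    ({'count-elements', 'divide-by-count'}, 'compute-average'),
]

events = {'init-counter', 'loop-over-array', 'increment-counter', 'divide-by-count'}

changed = True
while changed:
    changed = False
    for required, derived in rules:
        if required <= events and derived not in events:
            events.add(derived)
            changed = True

print(sorted(events))
# ['compute-average', 'count-elements', 'divide-by-count', 'increment-counter', ...]

The final event set now contains the high-level concepts ('count-elements', 'compute-average') that the
program implements, which is precisely the information the understander reports back to the maintainer.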
1.4.4
Graph Parsing Approach
Rich and Wills [Rich, Wills, 1990] describe a technique called the ‘Graph Parsing Approach’ to
recognize a program’s design by identifying the commonly used data structures and algorithms.
They have defined the term ‘cliches’ for the commonly used programming structures like
enumerations, accumulators, binary searches etc. They describe a recognizer that translates the
source code of a program into plan calculus. A “Plan” is a hierarchical graph structure made of
boxes that represent operations and tests and arrows denote control and data flow. The
commonly used program constructs i.e. cliches are input into the system by an expert with
knowledge about those constructs in the form of plans.
The recognition of clichés starts at the parsing stage where a parser translates the source code into
plans which are directed graphs. The cliché recognizer then identifies sub-graphs and replaces
them with abstract operations using pattern matching techniques. Matches are made with the
program constructs already stored in the database. The main drawback of this technique is that in
cliché recognition, the search techniques employed are complex and too expensive to implement.
The issue of automatically learning program plans also needs to be addressed.
1.4.5
Concept Analysis
Concept analysis provides a way of grouping together similar entities based on their common
attributes. The mathematical foundation for concept analysis rests on the Lattice Theory
introduced by Birkhoff in 1940 [Birkhoff, 1940]. In 1996 Snelting introduced this idea to
software engineering for inferring and extracting software hierarchies from raw source code
[Snelting, 1996]. To introduce the reader to this area of reverse engineering the following terms
are introduced:
• Module : A syntactic unit used to group entities together. An interface and an optional
implementation are a part of a module.
• Component : A group of related elements with a common goal that unifies them together.
• Atomic Component : A non-hierarchical component that consists of related global constants,
variables, sub-programs and/or user defined types.
1.4.5.1
Mathematical Foundation
Concept analysis is based on a relation R between a set of objects O and a set of attributes A as
given by:
R ⊆ O × A
The triple C = (O, A, R) is called a formal context. The set of common attributes for a set of objects
Os ⊆ O is defined as:
σ(Os) = { a ∈ A | ∀ o ∈ Os : (o, a) ∈ R }
The set of common objects for a set of attributes As ⊆ A is defined as:
τ(As) = { o ∈ O | ∀ a ∈ As : (o, a) ∈ R }
The following example is used to illustrate the above terms. It is taken from [Lindig, Snelting,
1997]. Consider the object versus attributes table, in which an 'x' marks that the object in that row
possesses the attribute in that column:

      A1  A2  A3  A4  A5  A6  A7  A8
O1    x   x
O2            x   x   x
O3            x   x       x   x   x
O4            x   x   x   x   x   x

Table 1: Object Versus Attributes Table
For the above table:
σ({O1}) = {A1, A2}
σ({O2}) = {A3, A4, A5}
σ({O3}) = {A3, A4, A6, A7, A8}
τ({A3, A4}) = {O2, O3, O4}
τ({A1, A2}) = {O1}
A concept is a pair (X, Y) consisting of a set of objects X and a set of attributes Y such that the
following is satisfied:
Y = σ(X), X = τ(Y)
A concept can be defined as a maximal collection of objects sharing common attributes. For a
concept c = (O, A), O is the set of objects, called the extent of c (extent(c)), and A is the set of
attributes, called the intent of c (intent(c)). Table 2 shows the concepts for the object
attribute table illustrated in Table 1.
C1   ({O1, O2, O3, O4},   ∅)
C2   ({O2, O3, O4},       {A3, A4})
C3   ({O1},               {A1, A2})
C4   ({O2, O4},           {A3, A4, A5})
C5   ({O3, O4},           {A3, A4, A6, A7, A8})
C6   ({O4},               {A3, A4, A5, A6, A7, A8})
C7   (∅,                  {A1, A2, A3, A4, A5, A6, A7, A8})

Table 2 : Concepts for Table 1
The set of all concepts forms a partial order governed by the following relationship:
(O1, A1) ≤ (O2, A2) ⟺ A1 ⊇ A2
(O1, A1) ≤ (O2, A2) ⟺ O1 ⊆ O2
The set of concepts taken together with this partial ordering forms a complete lattice known as a
concept lattice. The concept lattice for the example given in this section is presented in Figure 6.
Figure 6 : Concept Lattice for the Attributes and Facts in Table 1
A node of the concept lattice is labeled with an attribute Ai ∈ A if it is the largest concept having
Ai in its intent, and with an object Oi ∈ O if it is the smallest concept having Oi in its extent. The
concept lattice gives an insight into the structure of the relationship of objects. The above figure
shows that there are two disjoint sets of objects, the first being O1 with attributes A1 and A2, and
the second being O2, O3 and O4, which share the remaining attributes.
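To make the definitions concrete, the following brute-force sketch (our own illustration, using the
relation of Table 1) computes σ, τ and all concepts; practical tools use far more efficient
lattice-construction algorithms than this exhaustive enumeration.

# Brute-force concept analysis over the Table 1 relation (illustration only).
from itertools import combinations

R = {  # object -> set of attributes, as in Table 1
    'O1': {'A1', 'A2'},
    'O2': {'A3', 'A4', 'A5'},
    'O3': {'A3', 'A4', 'A6', 'A7', 'A8'},
    'O4': {'A3', 'A4', 'A5', 'A6', 'A7', 'A8'},
}
objects = set(R)
attributes = set().union(*R.values())

def sigma(os):    # common attributes of a set of objects
    return set.intersection(*(R[o] for o in os)) if os else set(attributes)

def tau(attrs):   # common objects of a set of attributes
    return {o for o in objects if attrs <= R[o]}

concepts = set()
for r in range(len(objects) + 1):
    for os in combinations(sorted(objects), r):
        extent, intent = set(os), sigma(set(os))
        if tau(intent) == extent:          # (extent, intent) is maximal: a concept
            concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(extent), sorted(intent))  # prints the seven concepts of Table 2

Treating sub-programs as objects and global variables as attributes, exactly this kind of computation
underlies the modularization approaches described next.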
1.4.5.2
Concept Analysis Applications
Lindig and Snelting use concept analysis to identify higher level modules in a program [Lindig,
Snelting, 1997]. They treat subprograms as objects and global variables as attributes to derive a
concept lattice and hence concept partitions. The concept lattice provides multiple possibilities
for modularization of programs with each partition representing a possible modularization of the
original program. It can be used to provide modularization at a coarse or a finer level of
granularity. Lindig and Snelting conducted a case study on a Fortran program with 100KLOC,
317 subroutines and 492 global variables. However, they failed to restructure the program using
concept analysis.
Siff and Reps used concept analysis to detect abstract data types to identify classes in a C
program to convert it to C++ [Siff, Reps, 1997]. Sub-programs were treated as objects and structs
or records in C were treated as attributes. They also introduced the idea of using negative
examples i.e. absent features as attributes (e.g. a subprogram does not have attribute X). They
successfully demonstrated their approach on small problems. However, larger programs are too
complex to handle and for that they suggest manual intervention.
Canfora et al. used concept analysis to identify sets of variables and to extract persistent objects
of data files and their accessor routines for COBOL programs [Canfora, Cimitile, Lucia, Lucca,
1999]. They treated COBOL programs as objects and the files they accessed as attributes. Their
approach is semi-automatic and relies on manual inspection of the concept lattice.
Concept analysis is an interesting new area of reverse engineering. However, it has an
exponential time complexity and a space complexity of O(2^k). Its advantages are that it is
based on a sound mathematical background and gives a powerful insight into the types of
relationships and their interactions.
1.4.6
Design Recovery System (DESIRE)
DESIRE is a tool that is a product of research at MCC by a team led by Biggerstaff [Biggerstaff,
Mitbander,Webster, 1994] and [Biggerstaff, 1989]. The basic philosophy behind this system is
that informal linguistic knowledge is a necessary component for program understanding. This
knowledge can only be input into a system through the interaction of an expert. The objective of
this technique is to find the mapping of human-oriented concepts to realizations in the form of
source code within a specific problem.
Biggerstaff proposes a design recovery process in which the various modules of the program, key
data items and software engineering artifacts are identified from source code [Biggerstaff, 1989].
Also, the informal design abstractions and their relationship to source code is pointed out. From
this a library of reusable modules is built by interacting with a software engineer who relates
these reusable modules to fragments of source code. The design of an unknown system could
then be recovered by finding similar mappings of source code to abstract concepts using the reuse
library.
The DESIRE tool is used to aid maintenance and software engineers for the better understanding
of a system and also for documenting and maintaining it. It consists of a knowledge based pattern
recognizer and prolog-based inference engine. It has a built in parser to take the source code as
input and generate a parse tree. A plain text browser is also a part of the system to show
relationships between data items and files. Different views can be generated based upon
different input queries. Prolog is used for querying the source code.
In [Biggerstaff, Mitbander,Webster, 1994] it is proposed that abstract concepts be associated with
the source code. They propose finding the candidate concepts through the following:



Find the breakpoints set in the program and their associated comments and map them onto the
associated functions and global variables. Generate a slice based on the functions and global
variables found.
Identify a cluster of functions related together through their use of shared global variables or
through shared control paths.
Use knowledge base of domain model.
DESIRE is mainly being used for debugging or porting and documentation or program
understanding. The main problem encountered is the building of the knowledge base and relating
abstract concepts with pieces of source code.
1.4.7
Recognizing Program Plans
A programming plan is also termed a cliché and describes design elements in terms of common
implementation patterns [Quilici, 1994]. Quilici describes an approach to recognizing program
plans to enhance program understanding and ultimately identifying modules and classes within a
program to port C code to C++. The main motivation behind this approach is the empirical study
of student programmers to see how they understand programs. The main elements of this
approach are:
• Convert the source program to an AST augmented with data and control flow.
• Take a library of plan hierarchies. [Quilici, 1994] uses a library of plans which was developed
to understand COBOL programs. A plan consists of two parts, namely a recognition rule
having the components of the plan and the constraints on them, and a plan definition which has
the attributes of the plan.
• Map general program plans to particular program fragments of source code. The approach
followed uses a hybrid top-down, bottom-up approach for identifying these plans. A top-down
approach determines what goals a program might achieve and identifies which plans would
achieve these goals. It then tries to match program plans to pieces of source code. A bottom-up
approach starts at the level of program statements and tries to match them to program plans.
This approach relies heavily on the plan library. Determining whether a plan is present in a
program or not is an NP-hard problem and computationally very demanding. [Quilici, Woods,
Zhang, 2000] describe constraint-based approaches to solving this problem in an efficient
manner. In [Deursen, Woods, Quilici, 2000] the plan recognition technique has been applied to
solve the Y2K problem by using correct and incorrect date manipulation plans and matching them
with fragments of source code.
1.4.8
Rigi
Rigi has been developed at the University of Victoria [Müller, Tilley, Orgun, Corrie, Madhavji,
1992], [Müller, Wong, Tilley, 1994]. It is a framework for program understanding designed to
aid software engineers gain an insight into the structure and design of a program. The Rigi tool
takes as input the RSF (described in section 1.3.3.2) of a program generated by a parser
and displays the dependencies of various components present in the system and their related
attributes. Rigi is a visual tool that represents the source code as a directed weighted graph, where
vertices are components and edges are dependencies (Resource Dependency Graph). A directed
edge from a to b with a weight w indicates that a provides a set of syntactic modules to b. It is a
semi-automatic reverse engineering tool where the software engineer is able to define subsystems and write scripts to cluster software entities together based on their criteria.
In Rigi, the reverse engineering process is illustrated in Figure 7. It comprises the following:
• Generate the RSF for the source code using the Rigi parser or any other suitable tool.
• Compose and display the resource dependency graph using the graph editor. Allow the user
to cluster or group together various components of the system to construct sub-systems.
• Compute the interfaces between sub-systems by analyzing / propagating the dependencies
between them.
• Evaluate the sub-system structures with respect to various software engineering principles
like low coupling and high cohesion.
• Capture and display the relevant views of the system.
Figure 7 : The Reverse Engineering Process in Rigi
1.4.9
Grok
Grok is a tool to aid software maintenance and it was developed by SWAG Group, University of
Waterloo [Fahmy, Holt, Cordy, 2001]. It is based upon relational algebra and can be used to
specify various types of architectural transformations. The input to Grok is RSF (Rigi Standard
Format) which is a format for representing source code by means of resource flow graphs. The
transformations specified are therefore applied on this graph and are equivalent to graph
transformations. The transformation produced represents high level architectural information of
the system making it easier to understand. Fahmy et al. describe the application of Grok to two
real world systems and its effectiveness in specifying various transformations such as lifting, hide
interior, hide exterior, diagnosis etc. The main advantage of using Grok is that it efficiently
processes large graphs. However, relational algebra has its limitations: it cannot be used for
generalized pattern matching, and it cannot express transformations in which the nodes and
edges along a route have to be stored because they themselves constitute the pattern.
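The flavour of such a transformation can be sketched with ordinary set operations (our own
illustration: the module and function names are invented, and Grok itself provides a dedicated scripting
notation over RSF facts). The example lifts function-level calls to module-level calls by composing the
contain and call relations:

# Sketch of a Grok-style "lifting" step: function-level calls are lifted to calls
# between the modules that contain those functions.
contain = {('mod_draw', 'draw_line'), ('mod_draw', 'draw_arc'),
           ('mod_ui', 'handle_click')}
call = {('handle_click', 'draw_line'), ('draw_line', 'draw_arc')}

def compose(r, s):
    # Relational composition r o s = {(a, c) | (a, b) in r and (b, c) in s}.
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def inverse(r):
    return {(b, a) for (a, b) in r}

# lifted = contain o call o contain^-1, restricted to distinct modules
lifted = {(m, n) for (m, n) in compose(compose(contain, call), inverse(contain))
          if m != n}
print(lifted)   # {('mod_ui', 'mod_draw')}

Because every step is a plain relational operation, the same transformation scales to the large fact bases
extracted from real systems, which is the efficiency advantage noted above.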
1.4.10 Dali
Work on Dali has been carried out by Carnegie Mellon Software Engineering Institute [Kazman,
Carrière, 1998]. They have developed a workbench to generate various architectural views of the
system to aid in software understanding, maintenance and re-use. Various tools like profiling
tools, parsers, lexical analyzers are used to extract different views of the system. The view
information is then stored in an SQL database, one table per relation. The advantage of using
different tools is that both static and dynamic information can be captured. Static information can
be acquired directly from source code. However, the entire information of a system cannot be
determined at compile time because of late bindings due to polymorphism or function pointers
etc. Hence, tools like gprof can be used to capture runtime information and generate runtime
specific views of the system.
After generating different views of the system and storing them in an SQL database, SQL queries
are used to specify clustering patterns to reduce the complexity of a system. Each relation is
defined as a table and union and join queries can be specified to ‘fuse’ different views together.
The views can be refined by using information from other views. These views are then imported
to Rigi. Rigi is used as a visual tool to model, analyze and manipulate the different views of the
system and requires the input of an expert already familiar with the system. Different views are
combined together and further refined to give a better view of the system. For the sake of
analysis Dali applies an export/import model where the Rigi model is exported to another tool.
After being analyzed the resulting model is imported back to Rigi, hence, generating an overall
picture of the system.
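As a purely illustrative sketch of this style of view fusion, the snippet below stores two hypothetical call relations in SQLite, one table per relation, and fuses them with union and join queries. The table and column names are assumptions, not Dali's actual schema.

# Minimal sketch (not Dali's actual schema): fusing a static and a dynamic view
# stored one-table-per-relation in an SQL database, as described above.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE static_calls  (caller TEXT, callee TEXT);
    CREATE TABLE dynamic_calls (caller TEXT, callee TEXT);  -- e.g. gathered with gprof
    INSERT INTO static_calls  VALUES ('main', 'parse'), ('parse', 'lex');
    INSERT INTO dynamic_calls VALUES ('main', 'parse'), ('main', 'draw_cb');
""")

# 'Fuse' the two views: the union gives all observed dependencies, while the
# join keeps only those confirmed both statically and at run time.
fused = db.execute("SELECT caller, callee FROM static_calls "
                   "UNION SELECT caller, callee FROM dynamic_calls").fetchall()
confirmed = db.execute("SELECT s.caller, s.callee FROM static_calls s "
                       "JOIN dynamic_calls d ON s.caller = d.caller "
                       "AND s.callee = d.callee").fetchall()
print(fused)
print(confirmed)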
The main advantages of Dali are that it uses SQL queries for defining clustering patterns and that the Dali workbench integrates various tools for architectural understanding. However, patterns have to be written by an expert familiar with the system; Dali has no analytic capability to generate them automatically. Also, the clustering patterns are specific to a certain system and in most cases cannot be reused.
1.5
Case Studies on Reverse Engineering
Various reverse engineering tools and techniques were applied to real life projects. This section provides a description of three such case studies:
 Xfig
 SORTIE
 SQL/DS
1.5.1 Xfig
A workshop for the demonstration of various reverse engineering tools was organized by IBM’s
Centre for Advanced Studies CAS in CASCON’99 in Toronto. The details and results of this
demonstration can be found in [Sim, Storey,2000] and the text in this section has been taken from
this source. Five tools participated in this workshop, namely Rigi, Lemma, PBS, TkSee and IBM’s Visual Age C++. A sixth team used standard UNIX tools. They were all assigned the
problem of analyzing the source code for Xfig 3.2.1 which is a drawing package available on
Unix platforms. It comprises 75,000 lines of ANSI C code. The purpose of the workshop was to
have a common platform for evaluating different tools and to establish benchmarks for comparing
their performance and functionality. The teams were assigned two reverse engineering tasks
namely documentation and evaluating the structure of the program. They were also given the
option of performing one maintenance task out of three.
The documentation produced by the teams was rather short and brief. The teams had varying
opinions on the structure of the program and the quality of source code. In terms of quality PBS
said the subsystems exhibited low cohesion and high coupling, while Lemma said they exhibited
low coupling and high cohesion. The PBS team observed that the original code had eroded over
time and Rigi’s interpretation was that the design had improved over the past releases. All the
teams had varying results on the number of GOTO statements contained in the source code.
However, despite the varying results on the reverse engineering tasks the teams had the same
approach to maintenance tasks.
The main problem encountered by all the teams was the initial phase of the reverse engineering task, i.e. parsing.
In the structured demonstration the tools were placed into three categories, namely visualization (PBS and Rigi), advanced search (TkSee and Lemma) and code creation (Visual Age and UNIX tools).
It was interesting to note that tools from the same category produced similar results. It was
concluded that visualization tools do better on reverse engineering tasks and the search tools are
better suited for maintenance tasks. Visualization and searching tools complement each other.
The workshop also concluded with a few lessons for tool users. When selecting a tool for
program comprehension it is necessary to know in advance the purpose of using the tool. A
visualization tool is better suited for large-scale maintenance reverse engineering problems where
an understanding of the architectural design is required. A search tool is more suitable for day to
day maintenance tasks. Also, a new tool has a greater chance of being accepted if it complements
existing tools and works in collaboration with them.
1.5.2 SORTIE
SORTIE is a legacy system used by forestry researchers to explore and understand forest dynamics [Website SORTIE]; it also provides a test environment for forest management decisions. The software is less than 28K lines of code including comments and source listings. Originally this program was written in C but was later ported to C++ using the
Borland IDE. Various teams working on reverse engineering tools were invited to analyze the
existing structure of this program and suggest a new architecture based upon the user
requirements and the extracted structure. Also, one of the goals of this project was to exchange
data at various stages from the results of other participating tools and use the output as an input to
the next stage. GXL was defined as the standard for exchanging data between different tools.
Participating tools for the SORTIE project were Rigi, cppx, TKSee, Bauhaus, SGG P.U.R.E,
Columbus, KLOC Worksuite, VIBRO and PBS. Almost all of these teams could not go on to the
re-architecture phase because of the difficulty in parsing the source code of SORTIE. Nearly all
of them concluded that re-engineering the existing source code was unrealistic because of the
poor design of the system. All the teams suggested re-writing the entire software from scratch
keeping in perspective the user requirements.
1.5.3 SQL/DS
SQL/DS is a multi-million line relational database management system that has evolved since
1976. It was originally written in PL/I to run on VM and it is over 3,000,000 lines of code. The
analysis of this system formed the CAS program understanding project [Buss et al., 1994] with
participation from six research groups. The groups are from IBM Software Solutions Toronto
Laboratory Centre for Advanced Studies, the National Research Council of Canada, McGill
University, the University of Michigan, the University of Toronto and the University of Victoria.
The various goals of this project were:
 Code correctness
 Performance enhancement
 Detecting uninitialized data
 Pointer errors and memory leaks
 Detecting data type mismatches
 Finding incomplete uses of record fields
 Finding similar code fragments
 Localizing algorithm plans
 Recognizing high complexity or inefficient code
 Predicting the impact of change
The IBM team used REFINE to parse the SQL/DS source code into a form suitable for analysis and applied a defect filtering technique; a toolkit was built on top of REFINE for this purpose. They also performed a design quality metrics analysis to predict the product’s quality and found that defects caused by design errors accounted for 43% of the total product defects.
The University of Victoria used Rigi as a tool for performing Structural Re-documentation. Rigi
was used to parse the source code and produce a resource-flow graph of the software. Later
subsystem composition techniques were combined with human pattern matching skills to manage
the complexity of the system. Rigi has a graph editor from which diagrams of software structures
such as call graphs, module interconnection graphs and inclusion dependencies were produced.
The University of Toronto performed textual analysis of code using pattern matching techniques.
Their method is based on fingerprinting an appropriate subset of sub-strings in the source code.
Using this technique a number of redundancies in the code were detected and instances of ‘cut
and copy’ of code were found.
The University of Michigan developed the SCRUPLE software for searching source code. SCRUPLE is based upon a pattern-based query language.
McGill University used REFINE to convert the program to an object-oriented annotated abstract syntax tree; the tools GRASP and PROUST were used for this purpose. Several problems were encountered in recognizing programming plans. One was that pattern matching techniques require precise recognition, whereas the source code may contain redundant or irrelevant code. Pattern matching schemes based on graph transformations are expensive and can have high complexity. This team also detected ‘clones’ in the source code using five similarity metrics.
1.6 Tools for Reverse Engineering
The following is a non-exhaustive list of existing reverse engineering tools:
 Rigi : University of Victoria, Canada
 Cppx : SWAG Group University of Waterloo, Canada
 TKSee : KBRE University of Ottawa
 Bauhaus : University of Stuttgart, Germany
 SGG P.U.R.E : University of Berne, Switzerland
 Columbus/CAN : Research group on Artificial Intelligence, Hungarian Academy of Sciences,
University of Szeged
 KLOC Work Suite : KLOCwork Solutions Corporation
 VIBRO : Visualisation Research Group, University of Durham UK
 PBS : SWAG Group University of Waterloo, Canada
 REFINE : Reasoning Inc
 SCRUPLE : University of Michigan
 Lemma : IBM
 Visual Age C++ : IBM Toronto Lab
PART II : SOFTWARE CLUSTERING
2 Software Clustering
In this section of the report a detailed overview of clustering and its applications to software
reverse engineering is presented. We’ll start by giving a general overview of clustering
techniques, algorithms and methods. The application of clustering to grouping together similar
software entities is then described. A theoretical framework for clustering software artifacts is
built up, describing different hypothetical situations where software clustering can be applied. To evaluate the suitability of software clustering methods, we selected an open source project and applied clustering to it from the point of view of re-modularization. We describe these experiments and present our results in detail.
2.1 Clustering : An Overview
To ease software maintenance efforts, it is essential to understand the existing software system. Realizing this has led to research on techniques for analyzing code and for program comprehension. Although analyzing the code has a number of benefits, this analysis leads to
understanding the system "in-the-small". To gain understanding of the system "in-the-large", we
require techniques for uncovering the over-all software architecture i.e. we need to understand the
various components within the system and their interactions with one another. This architectural/
structural view is of immense value when changes are made in the existing components or when
new components are added.
The activity of grouping together similar entities is not a new one, nor is it tied to the computer
science field alone. The grouping of entities is described by the term "clustering" and is relevant
in varied disciplines such as biology, archaeology, economics, psychology, geography etc.
Clustering methods aim at discovering/extracting some sort of a structure based on the
relationships between entities. Given the large amount of data associated with every entity, and
the complexity of relationships between them, it is natural that different methods come up with
different clusterings.
Irrespective of the discipline in which clustering is being applied, the following generic steps can be identified in any clustering activity:
 Entity identification and feature selection
 Definition of a similarity measure
 Clustering
 Assessment
2.1.1 Entity identification and feature selection
The individual characteristics used to identify an entity are referred to as features and may be
qualitative or quantitative [Gowda, Diday, 1992]. These features are grouped together into a
feature vector, sometimes called a pattern. A large number of features may be used to describe an
entity, but it is useful to utilize only those features that are descriptive and discriminatory [Murty,
Jain, Flynn, 1999].
2.1.2 Definition of a similarity measure
Clusters are defined on the basis of proximity of two patterns, therefore it is necessary to devise/select a measure for similarity. [Wiggerts, 97] describes the following categories of similarity measures:
 Association coefficients
 Distance measures
 Correlation measures
 Probabilistic measures
2.1.2.1 Association coefficients
Association coefficients take the presence or absence of a feature into account. The features are
thus assumed to be binary, with 1 denoting presence and 0 denoting absence. The following table
is constructed to find association coefficients between entity i and entity j.

                              Entity j
                              1        0
        Entity i     1        a        b
                     0        c        d

Table 3: Table of Similarities
In the table a represents the number of features for which both entities have the value 1, b
represents the number of features present in entity i but absent in entity j, c represents the number
of features present in entity j but absent in entity i, and d represents the number of features absent
in both entities. Some frequently used similarity measures Sim(i,j) between entity i and entity j
are:
Simple Coefficient

$Sim(i,j) = \frac{a+d}{a+b+c+d}$

Jaccard Coefficient

$Sim(i,j) = \frac{a}{a+b+c}$

Sørensen-Dice Coefficient

$Sim(i,j) = \frac{2a}{2a+b+c}$
As can be seen, the coefficients differ in two ways [Wiggerts, 97]:
 The contribution of the 0-0 matches. The Jaccard and Sørensen-Dice coefficients do not take into account the 0-0 matches.
 The weight of the matches and mismatches. The Sørensen-Dice coefficient assigns double weight to the 1-1 matches. In the simple coefficient, matches and mismatches are given equal weight.
Hence, the Jaccard and Sørensen-Dice coefficients take into account only the presence of features. On the other hand, the simple coefficient also takes into account the absence of features: for this metric, similarity will be high if features are absent from both entities.
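The small sketch below illustrates how the counts a, b, c and d and the three association coefficients can be computed from two binary feature vectors; the example vectors are arbitrary.

# Sketch: association coefficients for two binary feature vectors
# (1 = feature present, 0 = feature absent).
def abcd(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def simple(x, y):
    a, b, c, d = abcd(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, _ = abcd(x, y)
    return a / (a + b + c)

def sorensen_dice(x, y):
    a, b, c, _ = abcd(x, y)
    return 2 * a / (2 * a + b + c)

x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
print(simple(x, y), jaccard(x, y), sorensen_dice(x, y))   # 0.6 0.5 0.666...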
2.1.2.2 Distance measures
The most popular distance measure is the Euclidean distance, which evaluates the proximity of
entities in space. The larger the distance, the greater is the dissimilarity of the entities. Hence the
measure is zero if and only if the entities have the same score on all the features.
Two of the more popular distance measures are the Euclidean distance and the Camberra metric, given by:

Euclidean Distance

$D(X,Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

For binary features, where the feature vector comprises only zeros and ones, this metric can be reduced to a term consisting of a, b, c and d as defined in Table 3. This form can be computed using the table below.

    x_i    y_i    contribution to the sum of (x_i - y_i)^2
    0      0      0
    1      0      b
    0      1      c
    1      1      0

Table 4 : Computation of Euclidean Distance for Binary Features

From the above table it is clear that the distance measure reduces to:

$D(X,Y) = \sqrt{b + c}$
Camberra Distance Measure

$C(X,Y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i + y_i}$

To calculate the Camberra distance measure in terms of a, b, c and d we construct the following table:

    x_i    y_i    contribution to the sum of |x_i - y_i| / (x_i + y_i)
    0      0      0
    1      0      b
    0      1      c
    1      1      0

Table 5 : Computation of Camberra Distance Metric for Binary Features

In the above table it is assumed that the term is zero when both x_i and y_i are zero. This assumption is made because the distance between two null vectors is taken as zero. Hence, we can conclude that the Camberra distance metric is given by:

$C(X,Y) = b + c$
where b and c are defined in Table 3.
Clearly we can see that the Camberra metric and the Euclidean metric would give similar results
for binary features. Both these metrics take into account the total number of mismatches of zeros
and ones in two entities when calculating the distance between them. The higher the mismatch
the more the distance between the two entities.
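The following small sketch, using arbitrary example vectors, shows the two distance measures for binary feature vectors and how they reduce to expressions in b and c only (see Tables 4 and 5).

# Sketch: Euclidean and Camberra distances for binary feature vectors.
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def camberra(x, y):
    # the 0/0 term is taken as zero, as assumed in the text
    return sum(abs(xi - yi) / (xi + yi) for xi, yi in zip(x, y) if xi + yi > 0)

x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
# here b = 1 and c = 1, so euclidean = sqrt(2) and camberra = 2
print(euclidean(x, y), camberra(x, y))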
2.1.2.3 Correlation Coefficients
Correlation coefficients may be used to correlate features. The most popular coefficient is the Pearson product moment correlation coefficient, whose value lies between -1 and 1, with 0 denoting no correlation. Negative correlation denotes high dissimilarity and positive correlation is taken as high similarity. The correlation measure is given by:

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$

$= \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$
where the summation is taken from 1 to n where n = a+b+c+d. When using only binary
features the correlation measure reduces to terms of a, b, c and d, defined in Table 3, using the
following table:
    x_i    y_i    Σxy    Σx    Σy    Σx²    Σy²
    0      0      0      0     0     0      0
    1      0      0      b     0     b      0
    0      1      0      0     c     0      c
    1      1      a      a     a     a      a

Table 6 : Computation of Correlation for Binary Features
Hence the correlation metric reduces to the following:

$r = \frac{a - \frac{(a+b)(a+c)}{a+b+c+d}}{\sqrt{\left((a+b) - \frac{(a+b)^2}{a+b+c+d}\right)\left((a+c) - \frac{(a+c)^2}{a+b+c+d}\right)}} = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \qquad (1)$
Comparison with Association Coefficient
From (1) it can be seen that both the correlation coefficient and the Jaccard coefficient are one, indicating maximum similarity, if b and c are zero and a is non-zero. Also, if d is very large compared to a, b and c, then the correlation coefficient reduces to the following expression:

$r \approx \frac{a}{\sqrt{(a+b)(a+c)}} \qquad \text{for } d \gg a,\ d \gg b,\ d \gg c$
The above form involves terms only with a, b and c and we can see that this measure is now very
similar to the Jaccard coefficient. Hence, if two entities have very sparse feature vectors then
their correlation coefficient behaves in a very similar way to the Jaccard coefficient. This result
is important when considering software artifacts for clustering purposes as, generally, the feature
vector obtained is very sparse.
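The following small numerical sketch, with arbitrary counts chosen so that d dominates, illustrates equation (1) and its limiting form for sparse feature vectors.

# Sketch: the correlation (phi) coefficient of equation (1) for a sparse pair
# of binary feature vectors, compared with its limiting form and with Jaccard.
from math import sqrt

def phi(a, b, c, d):
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def jaccard(a, b, c):
    return a / (a + b + c)

a, b, c, d = 3, 1, 2, 200              # d >> a, b, c, as is typical for software data
print(phi(a, b, c, d))                 # ~0.66
print(a / sqrt((a + b) * (a + c)))     # ~0.67, the limiting form given above
print(jaccard(a, b, c))                # 0.5, for comparison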
2.1.2.4 Probabilistic Measures
These measures emphasize the idea that agreement on rare features is more important than
agreement on frequently encountered features. Such measures take into account the distribution
of frequencies of features present over the set of entities. According to Sneath these measures
are close to the correlation coefficient [Sneath, Sokal, 1973].
2.1.3 Clustering
The clustering step performs the actual grouping based on similarity measures that were
determined. A number of clustering algorithms are available and may be chosen based on their
suitability. It may be worthwhile to keep in mind that:
 One clustering technique cannot be expected to uncover/extract all clusters, owing to the variety of domains and data sets and the complexity of relationships between them.
 It is unreasonable to expect that there will be one ideal structure. It is possible, indeed probable, that more than one structure exists that is equally valid.
 Algorithms may impose their own structure on the set of data.
Algorithms to perform clustering are broadly divided into two categories namely:
1. Hierarchical
2. Partitional
2.1.3.1 Hierarchical
These algorithms follow a top-down or bottom-up approach, more formally known as the divisive and agglomerative approaches respectively.
Divisive
In this approach, we start with one cluster containing all entities and divide a cluster into two at
each successive step. These algorithms are not widely used, due to the complexity associated with
computing the number of possible divisions at every step [Tzerpos, Holt, 1998]. Figure 8 (b)
illustrates the first step in a divisive algorithm; initially all entities are in one cluster.
(a) One Cluster
(b) After First Iteration
Figure 8 : Divisive Clustering Algorithm
Agglomerative
In this approach we start with the entities and group them together to finally obtain one structure.
Each entity is regarded initially as a singleton cluster. In the first step, two most similar singleton
clusters are joined together. The situation is as shown in Figure 9:
(a) Singleton Clusters
(b) After First Iteration
Figure 9 : Agglomerative Clustering
At step two, we must determine whether two singleton clusters are most similar or whether the
newly formed cluster and some singleton clusters are most similar. In order to do this, the
similarity (or the distance) between the newly formed cluster and each of the existing clusters
must be updated. There are different algorithms to deal with this issue [Davey, Burd, 2000]:
Single Linkage (Nearest Neighbor Approach):
Single Link (A, B U C) = Max (similarity (A, B), similarity (A, C))
Complete Linkage (Furthest Neighbor Approach):
Complete Link (A, B U C) = Min (similarity (A, B), similarity (A, C))
Weighted Average Linkage:
Weighted Link (A, B U C) = 1/2 (similarity (A, B))+1/2(similarity (A, C))
Unweighted Average Linkage:
Unweighted Link (A, B U C) = (similarity (A, B) * size(B) + similarity (A, C) * size(C)) / (size(B) + size(C))
The single linkage rule produces clusters that are isolated but may be non-compact. To illustrate,
consider the following figure:
Figure 10 : Initial Clusters
         A     B     C     D     E
    A    -
    B    2     -
    C    8     5     -
    D    13    11    6     -
    E    14    12    7     3     -
The table above lists the entities along with some distances of interest, the greater the distance,
the lesser is the similarity. According to the single linkage rule, the first two entities to be
clustered together will be A and B, and the new distance between AUB and C will be given by 5
(Minimum of 8 and 5). This is shown in Figure 11.
Figure 11 : After Step 1
At step 2, D and E will be grouped together, with the new distance between DUE and C given as
6 (Minimum of 6 and 7). Since the distance between C and AUB is less than the distance between
C and DUE, at step 3 we’ll get the clusters illustrated in Figure 12:
Figure 12 : Clusters Formed at Step 3 Using Single Linkage Algorithm
On the other hand, if complete linkage was used, the situation at step 3 would be as depicted in
Figure 13:
Figure 13 : Clusters Formed Using Complete Linkage Algorithm
The reason for the difference is that when new distances are computed at step 1, the distance
between AUB and C is given as 8 (Maximum of 8 and 5). At step 2, the distance between DUE
and C is given by 7 (Maximum of 6 and 7).
As can be seen from the figure, complete linkage produces compact clusters i.e. in order for an
entity to join a cluster, it must be similar to every existing entity within the cluster, as opposed to
the single linkage approach, where the entity may be similar to only one entity within the cluster
[Wiggerts, 97]
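The sketch below reproduces the worked example above: agglomerative clustering of the entities A to E with single-linkage and complete-linkage update rules. The representation (clusters as concatenated labels) is an illustrative choice, not a particular tool's implementation.

# Sketch: agglomerative clustering over the distance table for A..E.
def agglomerate(dist, linkage, merges):
    # dist maps frozenset({label1, label2}) of cluster labels to a distance
    dist = dict(dist)
    clusters = {"A", "B", "C", "D", "E"}
    for _ in range(merges):
        p, q = min(((p, q) for p in clusters for q in clusters if p < q),
                   key=lambda pq: dist[frozenset(pq)])
        merged = p + q
        for r in clusters - {p, q}:
            dist[frozenset({merged, r})] = linkage(dist[frozenset({p, r})],
                                                   dist[frozenset({q, r})])
        clusters = (clusters - {p, q}) | {merged}
    return clusters

pairs = {("A", "B"): 2, ("A", "C"): 8, ("A", "D"): 13, ("A", "E"): 14,
         ("B", "C"): 5, ("B", "D"): 11, ("B", "E"): 12,
         ("C", "D"): 6, ("C", "E"): 7, ("D", "E"): 3}
dist = {frozenset(k): v for k, v in pairs.items()}

print(agglomerate(dist, min, 3))   # single linkage:   {'ABC', 'DE'}, as in Figure 12
print(agglomerate(dist, max, 3))   # complete linkage: {'AB', 'CDE'}, as in Figure 13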
2.1.3.2 Partitional Algorithms
These algorithms start with some initial partition and modify the partition at every step in such a way that some criterion is optimized [Tzerpos, Holt, 1998]. Two popular approaches are:
 Squared error algorithms : The first step is to choose some initial partition with clusters. At every step, each entity is assigned to the cluster whose center is closest to it. The new cluster centers are re-computed and the assignment of entities is repeated till the clusters become stable. (A small sketch follows this list.)
 Graph theoretic algorithms : The problem is represented as a graph and clusters are depicted via subgraphs with certain properties, e.g. we may construct a minimal spanning tree and then delete the edges with maximum weights to obtain clusters.
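The following is a minimal sketch of the squared-error (k-means style) iteration described in the first bullet, for points in the plane; the data and initial centers are arbitrary examples.

# Sketch: squared-error partitional clustering (k-means style iteration).
def squared_error_clustering(points, centres, iterations=10):
    for _ in range(iterations):
        # assign each entity to the cluster whose center is closest to it
        groups = {i: [] for i in range(len(centres))}
        for p in points:
            i = min(range(len(centres)),
                    key=lambda i: (p[0] - centres[i][0]) ** 2 + (p[1] - centres[i][1]) ** 2)
            groups[i].append(p)
        # re-compute the cluster centers
        centres = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centres[i]
                   for i, g in groups.items()]
    return centres, groups

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centres, groups = squared_error_clustering(points, [(0, 0), (10, 10)])
print(centres)   # roughly one center near (0.33, 0.33) and one near (9.33, 9.33)
print(groups)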
2.1.4 Assessment
It is important to assess and validate the structure of the system obtained after clustering. There
are typically three kinds of validation studies [Murty, Jain, Flynn, 1999]:
External Assessment
This assessment compares the structure obtained with some already available (a priori) structure
Internal Assessment
This assessment is intrinsic, i.e. it tries to determine whether the structure is intrinsically
appropriate for the data.
Relative Assessment
This assessment compares two structures relative to one another.
2.2 Clustering Software
The application of cluster analysis to software artifacts is a comparatively new area of research in which many issues have not yet been addressed. The major goal of software clustering is to
automate the process of discovering high level abstract subsystems within the source code. Such
subsystems can be used for software visualization, re-modularization, architecture discovery etc.
The higher level subsystems also aid the maintenance engineers in gaining a better understanding
of the source code and making the necessary changes. Lung points out that software clustering can be applied to software during various phases of the life cycle [Lung, 1998]. It can support software architecture partitioning during the forward engineering process and aid in recovering software architecture during the reverse engineering process.
According to Tzerpos and Holt [Tzerpos, Holt, 1998], it would benefit the software community to use the wide variety of clustering techniques already available, rather than re-invent them, to derive a software system's architecture. They find this a good idea because:
 With software, we have a fairly good idea of what a cluster should look like. Software engineering principles of cohesion, coupling and information hiding can guide the clustering process.
 A software system can have more than one valid view, which is what different clustering algorithms can give us.
 Even if a clustering algorithm “imposes” a structure, with legacy software this is not a problem. The point of view is that any structure is better than no structure.
The process of clustering together similar objects depends upon the nature and size of data. It
also depends upon the similarity measures and clustering algorithms being used. Clustering itself
can impose a structure on the data it is being applied to. When applying a clustering technique
one has to keep in mind the nature of the data and use the appropriate parameters suited to that
type of data.
When clustering together software entities with the aim of obtaining a better re-modularization, we would aim for high cohesion and low coupling [Davey, Burd, 2000]. For example, if functions that access the same data types or variables are placed into the same module, the module would tend to have communicational cohesion.
The above argument brings out the need to tailor our clustering algorithm to the type of software
being clustered. Researchers have pointed out that different clustering techniques behave
differently when applied to different types of software systems [Davey, Burd, 2000],[ Anquetil,
Lethbridge, 1999]. A technique might impose an artificial structure on the existing system
instead of bringing out the natural one. In this report we introduce a new clustering algorithm called the ‘combined’ algorithm, which is more intuitive for grouping together software entities. It is appropriate for situations where there are binary feature vectors and the presence and absence of a feature are taken into account when grouping objects or entities together.
When clustering software, the steps that are carried out are the same as for any other artifact. The first step is to identify the entities in a software system. Files and functions are the most popular candidates and researchers have used both as entities. The reasons for using functions rather than files (used in large systems) have been listed by [Davey, Burd, 2000]:
 Clustering of functions is intuitive
 Functions reflect the functionality of the system more clearly
The next stage is to select features based on which similarity measures will be derived. [Anquetil,
Lethbridge, 1999] divide features into formal and non-formal descriptive features. They call a
feature formal if it consists of information that has a direct impact on, or is a direct consequence
of, the software system’s behavior. They have identified some formal features for software:
 Rout : Routines called by the entity
 Var : Global variables referred to by the entity
 Type : User defined types referred to by the entity
 Macros : Macros used by the entity
 File : Files included by the entity
The source for these formal features is the code for the software.
Non-formal features use information that has no direct influence on the system’s behavior. Two
non-formal features used by Anquetil and Lethbridge are:
 Ident : References to words in identifiers declared or used in the entity
 Cmt : References to words in comments describing the entity
Any of the similarity measures identified in Section 2.1.2 may be used for software. In the case of binary features, the formulas may be simplified considerably. For clustering, experiments have been carried out using both hierarchical and partitional algorithms.
2.2.1.1 Observations Regarding Association, Correlation and Distance Metrics
From the discussion of the association, correlation and distance metrics we can make the following observations:
 The Jaccard and Sørensen-Dice metrics perform almost identically. This result has also been pointed out by Davey and Burd [Davey, Burd, 2000].
 The simple metric does not seem very intuitive for software clustering as it takes into account the absence of features.
 The correlation metric is equivalent to the Jaccard metric when the number of features absent from both entities is very large compared to the count of present or mismatched features. This is usually the case.
 For clustering, the Camberra distance metric and the Euclidean distance will perform equivalently.
 Similarity will be low if b and c are high: the Jaccard coefficient will be low and both the Camberra and Euclidean distances (the inverse of similarity) will be high. However, when dealing with software, the Camberra metric and the Euclidean distance are not appropriate for the case when all of a, b and c are zero. In this case the distance is zero, indicating high similarity, which is misleading. For example, two functions like sort and swap may not access any common features, so the distance between them would be zero. The distance metrics, in this case, indicate high similarity, which does not seem intuitively correct.
2.2.2 Assessment of Results
For assessment of results, external assessment is often used. The groups of objects made during the clustering process are known as a partition. When agglomerative clustering is used, a partition is made during each iteration of the algorithm. To evaluate the quality of these partitions, the partitions made by the clustering algorithm are compared with a reference system, also called the expert decomposition. Generally the expert decomposition is obtained by using directories as clusters when files are the entities, or the files themselves as clusters when functions are the entities. However, the expert decomposition is best made by the designer or developer of the system, as they have the best knowledge of the system under consideration. The clustering made by the clustering algorithm is also termed the test clustering.
The following external measures have been defined:
2.2.3 Use of Precision and Recall
[Anquetil, Lethbridge, 1999] and [Davey, Burd, 2000] use precision and recall as a measure for
assessing the quality of partitions. Precision and recall are defined using the concept of intra
pairs and inter pairs. Intra pairs are the pairs of entities in the same cluster whereas the inter
pairs are pairs of entities in two different clusters. Precision and recall are defined by taking all
the intra pairs in the expert decomposition and the test clustering and are given by:
 Precision : Percentage of intra pairs in the test clustering that are also in the expert decomposition.
 Recall : Percentage of intra pairs in the expert decomposition that are also in the test clustering.
Mathematically, if A denotes the intra pairs of the test clustering and B represents the intra pairs
of the expert partition then the precision p and recall r are given by:
$p = \frac{|A \cap B|}{|A|} \qquad\qquad r = \frac{|A \cap B|}{|B|}$
Precision and recall are measures that are often used for evaluating the quality of results in
information retrieval systems [Kontogiannis, 97]. Recall measures the percentage of intra pairs
that are relevant to the expert clustering. Ideally a recall of 100% would be required. Precision,
on the other hand measures the noise in the test clustering.
From the above definitions it is clear that there is a tradeoff between precision and recall. It is
desirable that both precision and recall are high for a test clustering. However, this is generally
not the case. When using agglomerative clustering, at the start of the clustering process all clusters are singletons and the system has zero recall and 100% precision. As the clustering proceeds, the precision decreases and the recall increases. When the entire system is one big cluster the recall is 100% but the precision is very low.
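A small sketch of these definitions follows; the two example partitions are arbitrary, and a partition is represented simply as a list of clusters (sets of entity names).

# Sketch: precision and recall computed from intra pairs, as defined above.
from itertools import combinations

def intra_pairs(partition):
    return {frozenset(p) for cluster in partition for p in combinations(sorted(cluster), 2)}

def precision_recall(test, expert):
    A, B = intra_pairs(test), intra_pairs(expert)
    common = A & B
    precision = 100.0 * len(common) / len(A) if A else 100.0   # all-singleton test -> 100%
    recall = 100.0 * len(common) / len(B) if B else 100.0
    return precision, recall

expert = [{"f1", "f2", "f3"}, {"f4", "f5"}]
test   = [{"f1", "f2"}, {"f3", "f4", "f5"}]
print(precision_recall(test, expert))   # (50.0, 50.0)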
For an ideal case, where the algorithm is making cluster partitions that agree exactly with the
expert decomposition, the precision would stay constantly at 100% and recall would rise to 100%
till the total number of intra pairs in the expert partition is equal to the total number of intra pairs
in the test clustering. At this point precision and recall graphs would crossover. After that
precision would start decreasing but recall would remain constantly at 100%. An example of this
process is shown below in Figure 14.
Figure 14 : Precision and Recall Graph for an Ideal Clustering Algorithm
2.3 Our Approach
Having established clustering to be a viable approach for finding structure within a software
system, we chose the Xfig utility source code for applying clustering techniques. The experiments
are detailed in the later part of this report.
2.3.1 Entity identification and feature selection
The first step was to identify the entities in a software system. In this report, we use functions in
the code as entities. The next stage is to select features based on which similarity measures will
be derived. We used the Xfig files parsed into the Rigi standard format [Müller, Tilley, Orgun,
Corrie, Madhavji, 1992]. Based on the available data, we used the following features:
 Call : Functions called by the entity
 Global : Global variables referred to by the entity
 Type : User defined types referred to by the entity
To compute similarity measures, we used the method of association coefficients, which takes the
presence or absence of a feature into account. The data about calls, globals and types is placed
into a matrix in the following manner:
                    Calls           Globals         Type
                    C1      C2      G1      G2      T1
    Function1       1       0       1       1       0
    Function2       1       1       0       1       0
For functions 1 and 2, the values for a, b, c and d will be:
a = 2 (number of features present in 1 and 2)
b = 1 (number of features present in 1 but absent in 2)
c = 1 (number of features present in 2 but absent in 1)
d = 1 (number of features absent in 1 and 2)
2.3.2 Definition of a similarity measure
We have used the following similarity measures in our experiments:
Simple (i,j) = (a+d)/(a+b+c+d)
For functions 1 and 2, the Simple measure gives the result 3/5 = 0.6
Jaccard (i,j) = a/(a+b+c)
For functions 1 and 2, the Jaccard measure gives the result 2/4 = 0.5
Sørensen-Dice (i,j) = 2a/(2a+b+c)
For functions 1 and 2, the Sørensen-Dice measure gives the result 4/6 = 0.67
2.3.3 Clustering and the Combined Algorithm
We used the agglomerative hierarchical clustering algorithm to arrive at clusters. The techniques
used by our experiments included single linkage, complete linkage, weighted linkage and
unweighted linkage. In addition to this we also define a new algorithm called the combined
algorithm for clustering. When two clusters are merged the similarity or the distance between the
newly formed cluster and the rest of the entities in the system is re-calculated. This new distance
is calculated based on the similarity values of the two entities that are grouped together with the
rest of the entities. We define this new distance using the ‘combined’ algorithm for binary
features as follows:
Suppose the two entities i, j that are merged together have their corresponding binary feature
vectors given by vi and vj. The new feature vector vk then associated with the merged cluster
would be given by taking the logic OR between the two feature vectors:
vk = vi OR vj
Now the new distance between the newly formed cluster and the rest of the entities or clusters
within the system would be calculated based on this new feature vector vk and the feature vector
of the rest of the entities and clusters. An example would clarify the combined algorithm.
Suppose there are three entities in the system given by A,B,C whose feature vectors are given by:
    Entity    Feature Vector
    A         {1 1 0 0 0 1}
    B         {1 0 1 0 0 1}
    C         {1 0 0 0 0 0}
The corresponding similarity matrix using the Jaccard coefficient is given by:
         A      B      C
    A    -      1/2    1/3
    B    1/2    -      1/3
    C    1/3    1/3    -
If the Jaccard coefficient is used then A, B are more similar and hence would be merged together
into AB and the new raw data matrix would be given by:
    Entity    Feature Vector
    AB        {1 1 1 0 0 1}
    C         {1 0 0 0 0 0}
The new similarity measure between AB and C is then re-calculated from this raw data matrix and is equal to 1/4.
We argue that the combined clustering algorithm is very suitable for clustering software artifacts, as the feature vector usually denotes references to functions, global variables or types. When two entities are merged, the new entity accesses the features of both merged entities, and hence a new feature vector should be associated with it. This new feature vector can be obtained by taking the OR of the older feature vectors.
In short, the following steps are involved in a clustering process (a sketch of these steps follows the list):
 Step 1 : Compute the Raw Data Matrix : Get the feature vector for each module or entity. The matrix of feature vectors forms the raw data matrix. The rows denote the entities and the columns represent the features.
 Step 2 : Compute the Similarity Matrix : The similarity matrix represents the similarity between each pair of modules or components. It is a square, symmetric matrix; only the lower triangular part is required for computations. The values in the similarity matrix are determined by the association coefficient, distance coefficient or correlation coefficient being used.
 Step 3 : Apply the Clustering Algorithm : Find the two entities with the maximum similarity. Merge the two entities into a new cluster and calculate the new cluster's distance from the other entities using any one of the single linkage, complete linkage, weighted average or unweighted average clustering algorithms. Repeat merging entities till only one cluster remains.
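The sketch below puts the three steps together, using the Jaccard coefficient and the 'combined' update rule of Section 2.3.3 (feature vectors of merged clusters are OR-ed); it reproduces the small A, B, C example above. It is an illustrative sketch, not the tool used for the experiments reported later.

# Sketch: raw data matrix -> similarity -> agglomerative merging with the combined rule.
def jaccard(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b_plus_c = sum(1 for xi, yi in zip(x, y) if xi != yi)
    return a / (a + b_plus_c) if a + b_plus_c else 0.0

def combined_clustering(raw):                 # Step 1: raw data matrix {entity: feature vector}
    clusters = dict(raw)
    history = []
    while len(clusters) > 1:
        names = list(clusters)
        # Step 2: similarity matrix (only the most similar pair is needed at each step)
        p, q = max(((p, q) for i, p in enumerate(names) for q in names[i + 1:]),
                   key=lambda pq: jaccard(clusters[pq[0]], clusters[pq[1]]))
        # Step 3: merge the two most similar clusters; OR their feature vectors
        merged = [int(xi or yi) for xi, yi in zip(clusters.pop(p), clusters.pop(q))]
        clusters[p + q] = merged
        history.append((p, q))
    return history

raw = {"A": [1, 1, 0, 0, 0, 1],
       "B": [1, 0, 1, 0, 0, 1],
       "C": [1, 0, 0, 0, 0, 0]}
print(combined_clustering(raw))   # [('A', 'B'), ('AB', 'C')]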
2.4 Methods Used in the Past
One of the earliest works on software clustering is by Schwanke and Platoff [Schwanke, Platoff, 1989]. They introduced the idea of the “shared neighbors” technique for grouping together software entities to achieve low coupling and high cohesion. They argue that when grouping together similar entities, more importance should be given to shared neighbors than to connection strengths in a resource flow graph. They used a hierarchical ascending classification method, similar to agglomerative clustering, to group together entities based on various features like variables, data types, macros and procedures. They also introduced the idea of “Maverick analysis”, used to refine a partition by identifying entities (called ‘Mavericks’) that were placed in the wrong partition [Schwanke, 91]. They developed a tool by the name of ARCH for automatic clustering.
Anquetil and Lethbridge conducted detailed clustering experiments using different open source projects and also a legacy system [Anquetil, Lethbridge, 1999]. They compared various
hierarchical and non-hierarchical clustering algorithms. They treated an entire source code file as
an entity and various formal and non-formal features associated with them were used for
clustering. They have introduced two methods for the evaluation of clustering results namely
expert criteria and design criteria. The expert criterion is based upon precision and recall. The
design criterion, on the other hand, is based upon the notion of metrics used to measure coupling
and cohesion.
Kontogiannis conducted interesting work on finding similar software entities [Kontogiannis, 97]. Their goal was not to re-modularize software entities but to detect programming patterns within
source code files. Their work is relevant to software clustering as it shows that various software
design metrics like structural complexity, data complexity, McCabe complexity, Albrecht metric
and Kafura Metric pertaining to a certain part of source code can be used as its fingerprint. Such
a feature vector can be associated with various parts of source code and used to detect patterns
within source code.
Davey and Burd also conducted several detailed experiments to compare various features,
similarity metrics and algorithms to be used for clustering [Davey, Burd, 2000]. They used three
different parts of legacy C source code to conduct their experiments. The experiments conducted
by them were only on small systems (maximum 140 functions). However, their work
demonstrates the suitability of using software clustering methods to group together software
artifacts.
Lung uses clustering for recovering software architecture and for re-modularization [Lung, 1998]. They use a design metric, namely data complexity, to compare the old system with the newly modularized system produced by their clustering algorithm. Their work shows
that clustering software based on common features reduces the data complexity of the system and
leads to a better architecture.
Tzerpos and Holt use ACDC, an Algorithm for Comprehension Driven Clustering, to group together software artifacts [Tzerpos, Holt, 2000]. Their main aim was to discover subsystems to ease the understanding of the software system. Their algorithm proceeds in two steps. In the first step they discover a possible skeleton decomposition of the system using a pattern driven approach. They define various possible subsystem patterns, namely the source file pattern, directory structure pattern, body header pattern, leaf collection pattern, support library pattern, central dispatcher pattern and subgraph dominator pattern. Using these patterns a
skeleton decomposition is formed. In the second stage of the algorithm they use a technique
known as “Orphan Adoption” that tends to place isolated entities in one of the possible
subsystems identified during the first phase of the algorithm.
The table below summarizes the work on software clustering conducted in the past.

[Kontogiannis, 97]
  Objective : Detect programming patterns; detection of clones
  Features / metric : Structural complexity, data complexity, McCabe complexity, Albrecht metric, Kafura metric
  Similarity / distance measure : Euclidean distance; correlation
  Clustering algorithm : Use distance as a threshold and group together components whose distance falls within this threshold
  Systems used : TCSH, CLIPS, BASH, ROGER

[Anquetil, Lethbridge, 1999]
  Objective : Remodularization
  Features / metric : Formal features (type, variable, routine, file, macro, all (union of the above)); non-formal features (references to words in identifiers, references to words in comments)
  Similarity / distance measure : Taxonomic (distance); Camberra; Jaccard; simple matching; Sørensen-Dice
  Clustering algorithm : Agglomerative hierarchical (single linkage, complete linkage, weighted average linkage, unweighted average linkage)
  Systems used : gcc, Mosaic, Linux, Telecom
  Evaluation : Expert criteria (precision and recall); design criteria (coupling and cohesion)

[Sartipi, Kontogiannis, 2001]
  Objective : Recover the architecture of a system
  Features / metric : Function, type, global variable
  Similarity / distance measure : Component association; mutual association
  Clustering algorithm : Supervised clustering algorithm; Bunch (hill climbing, non-hierarchical)
  Systems used : CLIPS, Xfig
  Evaluation : Precision and recall

[Davey, Burd, 2000]
  Objective : Remodularization
  Features / metric : Global, type, call
  Similarity / distance measure : Jaccard; Sørensen-Dice; Pearson correlation coefficient; Camberra
  Clustering algorithm : Agglomerative hierarchical (single linkage, complete linkage, weighted average linkage, unweighted average linkage)
  Systems used : 3 samples of C source code taken from legacy systems
  Evaluation : Precision and recall

[Schwanke, Platoff, 1989]
  Objective : Extract the conceptual architecture of the system
  Features / metric : Feature counts based on shared neighbors; names used in a module (variables, data types, macros, procedures)
  Similarity / distance measure : Category utility = size * purity
  Clustering algorithm : Hierarchical ascending classification method (similar to agglomerative clustering), which also splits clusters during clustering (Maverick analysis)
  Systems used : DOSE, Tiled Window Manager (TWM)
  Evaluation : Compare the clusters formed with the modules/procedures defined in common; precision and recall

[Schwanke, 91]
  Objective : Develop a tool to provide modularization advice
  Features / metric : Non-local names such as procedures, macros, typedefs, variables, and individual field names of structured types and variables
  Clustering algorithm : Hierarchical ascending classification method (similar to agglomerative clustering), which also splits clusters during clustering (Maverick analysis); a version of Arch's batch clustering tool
  Evaluation : Jacknife method, 52 procedures divided into a training set (48) and a test set (4)

[Tzerpos, Holt, 2000]
  Objective : Evaluate the similarity of two different decompositions of a system to determine the stability of a clustering algorithm
  Features / metric : Similarity metric MoJo
  Clustering algorithm : Bunch (software for clustering)
  Systems used : Random and field data from a legacy system
  Evaluation : Quality metric based on MoJo; stability measure

[Tzerpos, Holt, 2000]
  Objective : Extracting sub-systems for program comprehension
  Features / metric : Source file pattern, directory structure pattern, body header pattern, leaf collection pattern, support library pattern, central dispatcher pattern, subgraph dominator pattern
  Clustering algorithm : ACDC algorithm
  Systems used : TOBEY, Linux
  Evaluation : Quality measure

[Müller, Uhl, 1990]
  Objective : Extracting sub-system structures, to be used in Rigi
  Features / metric : A directed weighted graph over files; a directed edge from a to b indicates that module a depends on b, and the weight indicates the resources being exchanged between a and b
  Clustering algorithm : (k,2)-partite graph composition

[Lung, 1998]
  Objective : Software architecture recovery and restructuring
  Features / metric : Data complexity
  Clustering algorithm : UPGMA (unweighted pair-group method using arithmetic averages)
  Systems used : Real-time telecommunication application

Table 7 : Summary of Software Clustering Methods Used in the Past
2.5 Clustering : A Theoretical Framework
This section outlines various scenarios from the point of view of clustering. It also relates the clustering process to the design of a system by comparing well structured and unstructured architectures. The following configurations and scenarios are discussed in this section, where the Black Hole and Gas Cloud configurations are types of partitions also discussed by [Anquetil, Lethbridge, 1999]:
 Black Hole System
 Gas Cloud System
 Equidistant Objects
 Structured Program
 Unstructured Program
2.5.1 The Iteration Versus the Total Clusters Graph
The clustering process will be illustrated using the iteration versus the total number of clusters graph. To compare the performance of different features, similarity metrics and clustering algorithms, the total number of non-singleton clusters is plotted against the iteration number. In the beginning there are no non-singleton clusters, hence the graph starts at zero. When two singleton clusters merge, a new non-singleton cluster is formed and the curve rises. If a singleton cluster merges with an existing non-singleton cluster, the total number of clusters remains constant. If two non-singleton clusters merge, the total number of clusters in the system decreases, leading to a decline in the curve. At the end of the clustering process, the bigger clusters are merged together till only one cluster remains, leading to a decline in the curve. In short, there are three possibilities when analyzing the iteration versus total number of clusters graph:
 A rising curve indicates the merger of two singleton clusters
 A constant curve indicates the merger of a singleton cluster with a non-singleton cluster
 A falling curve indicates the merger of two non-singleton clusters
The graph obtained provides an ad-hoc, subjective measure for evaluating the success of the approach, as pointed out by [Davey, Burd, 2000]. It can be used as a rough metric to evaluate the usefulness of similarity measures and clustering algorithms. The total number of non-singleton clusters reflects the quality of a clustering process. If a clustering process produces a large number of clusters, these clusters will be small groups of functions that are similar with respect to the features they access. If they access the same data types or variables then it is likely that these clusters are cohesive. Davey and Burd [Davey, Burd, 2000] suggest that such a clustering process will provide a more suitable modularization than one where the maximum number of clusters is small, as the latter case signifies a large black hole cluster that tends to attract other entities towards it.
2.5.2 Black Hole System
In this configuration a single object, termed the ‘Black Hole Object’, is highly similar to the other objects in the system. The rest of the objects are far apart from each other, resulting in the black hole object absorbing the rest of the clusters in the system. One possible raw data matrix that leads to the black hole system is given below:
    Modules               SelectSpline    SelectLine    SelectCircle    SelectRectangle
    SelectObject M0       1               1             1               1
    CreateSpline M1       1               0             0               0
    CreateLine M2         0               1             0               0
    CreateCircle M3       0               0             1               0
    CreateRectangle M4    0               0             0               1
The system has five modules M0, M1, M2, M3 and M4. The columns indicate the features or
data structures accessed by these modules. In this case the features are the functions called by
these modules. The module ‘SelectObject M0’ is an abstract module that calls all the other
functions. The rest of the modules M1, M2, M3 and M4 are independent entities that perform the
required task. The data matrix is a binary matrix with a ‘one’ indicating that a module calls a
certain function. The clustering process for this system is depicted by the graph of iteration
versus total number of clusters in Figure 15.
Figure 15: Iteration Vs. Total Clusters for the Black Hole System
The graph shows that the total number of clusters remains constant: one cluster containing M0 is formed and it attracts all the singleton clusters towards it.
Relationship to Design
The scenario described in this section pertains to the ‘pancaked structure’ as described by
Pressman [Pressman, 97]. It depicts one controlling module calling many other functions and the
rest of the modules being independent entities. A pictorial representation of this structure is
shown in Figure 16. Such configurations should be avoided as Pressman points out that such
structures do not make ‘effective use of factoring’. Also, a more reasonable distribution of
control should be planned.
Figure 16: Pancaked Structure [Pressman, 97]
2.5.3 Gas Cloud System
A gas cloud system results when the objects or entities being clustered are infinitely far apart from one another and have no common features. A possible raw data matrix leading to this
configuration is shown in Table 8. This scenario is also similar to the ‘pancaked structure’ as
described by Pressman [Pressman, 97] and illustrated in Figure 16. Such structures should be
avoided.
    Modules    Feature1    Feature2    Feature3    Feature4
    M1         1           0           0           0
    M2         0           1           0           0
    M3         0           0           1           0
    M4         0           0           0           1
Table 8 : Raw Data Matrix for the Gas Cloud System
2.5.4 Equidistant Objects
When considering equidistant objects there are a number of possibilities. In this sub-section we
discuss two such possibilities:
 Objects partially similar to one another
 Objects with 100% similarity with one another
Objects partially similar to one another
To visualize a system with partially similar objects consider the example of 4 modules
CreateSpline, CreateLine, CreateCircle and CreateRectangle with their corresponding raw data
matrix given by:
    Modules               PosX    PosY    StartX    StartY    EndX    EndY
    CreateSpline M1       1       1       1         0         0       0
    CreateLine M2         1       0       0         1         1       0
    CreateCircle M3       0       1       0         1         0       1
    CreateRectangle M4    0       0       1         0         1       1
The Jaccard Similarity Matrix for the above raw data matrix is given by:
    Modules    M1     M2     M3     M4
    M1         -      1/5    1/5    1/5
    M2         1/5    -      1/5    1/5
    M3         1/5    1/5    -      1/5
    M4         1/5    1/5    1/5    -
The table of similarity values between all pairs of entities indicates that the objects are equidistant from each other: each pair of entities has a Jaccard similarity measure equal to 1/5 = 0.2. In this sub-section the clustering of such entities is analyzed. It is assumed that when selecting among pairs of entities with equal similarity for merging into a cluster, preference is given to merging singleton clusters.
If single linkage, complete linkage, weighted average linkage or unweighted average linkage is
used to group together equidistant objects, the general curve obtained for iteration versus number
of clusters is given below in Figure 17.
Figure 17: The general curve obtained for Iteration Vs. the Total Number of Clusters
The curve shows that the singleton clusters are merged at the beginning of the clustering process, resulting in a peak in the curve. After all the singleton clusters have been merged, the curve drops as non-singleton clusters are merged together.
On the other hand, if the combined algorithm is used, then the curve obtained is shown below in Figure 18:
Figure 18: Iteration Vs. Total Clusters for the Combined Algorithm for Equidistant Objects
The clustering process starts by merging two singleton clusters resulting in a black hole cluster
that tends to attract the rest of the singleton clusters towards it. The nature of the combined algorithm is such that when two singleton clusters are merged, the resulting cluster has a greater degree of similarity with the rest of the objects than the two individual singleton clusters had. The rest of the curve is constant, indicating the merging of singleton clusters with the main black hole cluster.
Relationship to Design
The above example demonstrates loosely cohesive objects that exhibit a small degree of similarity with each other. A threshold should be set to determine whether these modules should be grouped together or not. If the modules are not functionally related to each other then it means that the variables or features used do not have a semantic relationship to one another and are being used as needed by any function. These features can also be termed
utility features. On the other hand, if the modules were intended to be functionally similar as per
design of the entire system then the features or variables being accessed by them could be
grouped together into a bigger structure during the re-modularization stage.
Objects with 100% Similarity
Objects being 100% similar with respect to the features they access would all be grouped together
by the clustering process into one strongly cohesive cluster. A raw data matrix for such a
scenario is shown in the table.
    Modules               PosX    PosY    StartX    StartY
    CreateSpline M1       1       1       1         1
    CreateLine M2         1       1       1         1
    CreateCircle M3       1       1       1         1
    CreateRectangle M4    1       1       1         1
The Jaccard Similarity Matrix for the above raw data matrix is given by:
    Modules    M1    M2    M3    M4
    M1         -     1     1     1
    M2         1     -     1     1
    M3         1     1     -     1
    M4         1     1     1     -
The clustering process for such a situation is shown in Figure 19:
Figure 19: Iteration Vs. the Total Number of Clusters for Equidistant Objects with 100% similarity
The curve is similar to the curve for partially similar equidistant objects when the clustering
algorithm being used is single, complete, weighted or unweighted.
Relationship to Design
Equidistant objects with 100% similarity represent one cohesive module from the point of view of
design. If the re-modularization process is being carried out with the aim of migrating the system to one conforming to the object oriented paradigm, then all the modules can be grouped together as operations of the same class. However, the proportion of such equidistant modules to the total number of modules present in the system should be kept in view when re-modularizing software. If the proportion of such objects is very large, then the features being used to classify them into different clusters are not discriminating enough to form smaller groupings of clusters. An architecture with groups/clusters comprising equidistant objects is preferred, where each group independently performs its own functions and is disjoint from the rest of the groups.
2.5.5 Structured Program
In this sub-section the clustering process for a structured program is reviewed. A structured
program containing modules M0,M1, …, M14 is illustrated by the tree in Figure 20. The straight
lines represent the calls being made from one module to another and the dashed lines represent
the data structures being accessed by that module. If we consider the data structure as a feature
and cluster similar objects together on the basis of this feature then the clusters obtained are
indicated by the dashed circles containing the objects. The clusters are highly cohesive with a
Jaccard similarity measure of 1 amongst themselves.
The corresponding raw data matrix for the modules M0 to M14 is shown in Table 9 below.
Modules   T1   T2   T3   T4
M0         0    0    0    0
M1         1    0    0    0
M2         0    1    0    0
M3         0    0    1    0
M4         0    0    0    1
M5         1    0    0    0
M6         0    1    0    0
M7         0    1    0    0
M8         0    0    1    0
M9         0    0    1    0
M10        0    0    1    0
M11        0    0    0    1
M12        0    0    0    1
M13        0    0    0    1
M14        0    0    0    1
Table 9: Raw Data Matrix for the Structured Program
Figure 20: A Structured Program
Also, the clustering process that leads to the above clusters is depicted in Figure 21.
Figure 21: Clustering Process for the Structured Program
In the beginning of the clustering process, singleton clusters are merged, leading to a rise in
the curve. After all singleton pairs have been merged, the remaining singleton objects are merged
into the clusters they are most similar to, producing a horizontal line as the total number of
clusters remains constant. Once all the singletons have been absorbed, similar clusters are
grouped together, leading to a decline in the curve.
The final partition of the system thus obtained is also termed a "Planetary System" by
[Anquetil, Lethbridge, 1999]. Such a system has several cohesive sub-systems that are
interconnected to form the entire system. This is the ideal case. In the next section, a system is
considered that can result from modifications to the above system.
2.5.6 Unstructured Program
Figure 22: Unstructured Program
This section explores one last case of finding structure within programs that started with a
good design but evolved into a more haphazard system due to subsequent modifications and
enhancements. The structured program illustrated in Figure 20 can turn into such an unstructured
system. A possible conceptual picture of the entire system, deviating from the initial design of
the structured program, is shown in Figure 22. It can be noted that the initial design and
structure are still buried deep within the system, though this may not be obvious from merely
visualizing it. The system consists of the same modules M0, M1, …, M14. However, the data
dependencies of the modules have increased, and the calls between the horizontal partitions have
also increased in number. In the text that follows, an analysis of a few similarity measures and
algorithms used for clustering is presented in detail.
Jaccard Similarity Measure Using Complete Algorithm
The iteration versus total number of clusters curve for the Jaccard similarity measure and the
complete algorithm is shown in Figure 23.
Figure 23: Clustering Process for the Unstructured Program
The clusters formed using the complete algorithm with Jaccard similarity measure are shown in
Figure 24. For these clusters a threshold of 0.5 was used to stop the clustering process.
Figure 24: Clusters Formed for the Unstructured Program
Jaccard Similarity Measure Using Combined Algorithm
The iteration versus total number of clusters curve for the Jaccard similarity measure and the
combined algorithm is shown in Figure 25.
Figure 25: Iteration Vs. Number of Clusters for the Jaccard Similarity Using Combined Algorithm
The clusters formed are shown in the figure below (Figure 26).
Figure 26: Clusters Formed for Jaccard Similarity Measure Using Combined Features
For this algorithm a similarity threshold of 0.5 was used to stop the clustering process.
Simple Similarity Coefficient Using Complete Algorithm
The iteration versus total number of clusters curve for the Simple Similarity Coefficient and the
complete algorithm is shown in Figure 27.
Figure 27: Clustering Process for Simple Similarity Co-efficient Using Complete Linkage
The clusters formed using the Simple Similarity Coefficient and the complete linkage algorithm
are shown below in Figure 28. A threshold of 0.888 was used to terminate the clustering process.
Figure 28: Clusters Formed Using Simple Similarity Coefficient and Complete Linkage Algorithm
Discussion of Results
In the previous text the clustering process using different similarity coefficients and clustering
algorithms was presented. It can be seen from the graphs of iteration versus total number of
clusters that the curve is the same for both the Jaccard and the Simple coefficient. Also, the total
number of clusters being formed as the process continues is the same for these similarity
measures. The curves obtained using weighted, unweighted and single linkage algorithms for the
Jaccard coefficient are also similar but not shown here.
The clusters formed for the complete and the combined algorithm are also the same. However,
the ones formed using the simple matching coefficient are slightly different. This measure takes
the absence of features into account and groups together the modules having the zero feature
vector into one cluster.
It can be seen from the pictures of the clusters formed for the unstructured program (Figure 24,
Figure 26 and Figure 28) that the clustering process has identified the clusters or groups of
entities as intended by the initial design of the system (shown in Figure 20). This shows that the
clustering process is able to recover the original design or architecture of a system that has been
obscured by later enhancements and modifications.
2.6 Experiments with Clustering for Software Re-Modularization
To evaluate the suitability of applying clustering algorithms to software artifacts for software
re-modularization, a few experiments were carried out. The details of these experiments and the
results obtained are presented in this section.
2.6.1 The Test System
The test system used for clustering software artifacts is Xfig Version 3.2.3. Xfig is an open
source drawing tool that runs under the X-Windows system [Website Xfig]. Xfig was also used as the
model system for a competition of reverse engineering tools [Sim, Storey, 2000]. It is written in
the C programming language and consists of 75K lines of source code. Its source code is
distributed over around 100 source files and around 75 include files. There is no documentation
regarding the structure or implementation of Xfig. However, usage manuals are available for this
system.
The Xfig source code has been parsed using the Rigi parser [Martin, Wong, Winter, Müller, 2000]
and its facts are available at [Website Xfig RSF]. The facts are in Rigi Standard Format (RSF),
which is a Tuple Attribute (TA) format. For the experiments with software clustering, the raw
source code of Xfig was not used; the available RSF file was used instead. The RSF file contains
around 200,000 facts, including facts for calls, accesses, references etc. Analysis of Xfig shows
around 1700 functions. The RSF file was imported into an SQL-compliant RDBMS to ease access to the
facts during the clustering process.
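The following is a hedged sketch of how RSF facts can be turned into per-function feature sets for clustering. It assumes each RSF line is a whitespace-separated "verb subject object" triple; the verb names and the file name used below are illustrative rather than the exact ones in the Xfig fact base:

# Hedged sketch: loading RSF-style facts into per-function feature sets.
# Assumes each RSF line is a whitespace-separated triple "verb subject object"
# (e.g. "call  f_save  fopen"); the verbs "call", "access" and "reference" are
# illustrative, not necessarily the exact verbs used in the Xfig RSF file.
from collections import defaultdict

def load_features(rsf_path, verbs=("call", "access", "reference")):
    features = defaultdict(set)          # function name -> set of feature strings
    with open(rsf_path) as rsf:
        for line in rsf:
            parts = line.split()
            if len(parts) != 3:
                continue                 # skip attribute records and blank lines
            verb, subject, obj = parts
            if verb in verbs:
                # tag each feature with its verb so calls, globals and types
                # can later be selected individually or combined as 'all'
                features[subject].add(f"{verb}:{obj}")
    return features

# features = load_features("xfig.rsf")   # hypothetical file name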
The goal of the software clustering experiments is to find a structure within the source code of
Xfig so that higher abstract level sub-systems can be identified. In the long run such an analysis
can also be useful in identifying objects or classes within the system to move it to an object
oriented paradigm. In Xfig a consistent naming convention of source code files is used as
follows:
 d_* files are intended for drawing shapes
 e_* files are related to editing
 f_* files have file related functions
 u_* files pertain to utilities for drawing and editing
 w_* files contain the X-windows related calls
The following table lists the total number of functions/entities in each sub-system:
Sub-System   Total Functions   Total Source Files
d_files            94                 10
e_files           369                 19
f_files           139                 17
u_files           422                 18
w_files           637                 31
Table 10: The Xfig Sub-Systems
The above source file prefixes can, therefore, help in determining and categorizing the various
sub-systems within Xfig. In the experiments described in the following text, clustering was first
carried out individually on each of the above mentioned sub-systems. It was then also carried out
on the entire system, treated as a whole.
2.6.2 Clustering Techniques
For the analysis of Xfig, an agglomerative hierarchical clustering algorithm was used to group
together software artifacts. Each function was treated as a single entity or object to be grouped.
The features used to describe each entity are as follows:
 Call : The calls made by a function
 Global : The global variables referred to by a function
 Type : The types accessed by a function, i.e. the data structures defined in C using 'struct' or the user defined types declared using 'typedef'
The experiments were conducted using the following similarity measures:
 Jaccard coefficient
 Sorensen-Dice coefficient
 Simple matching coefficient
 Pearson's correlation coefficient
 Canberra distance metric
The following clustering algorithms were used:
 Single linkage
 Complete linkage
 Weighted average linkage
 Unweighted average linkage
 Combined algorithm
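A minimal sketch of the agglomerative loop is given below for one of the configurations listed above (Jaccard similarity with complete linkage and a stopping threshold). It is only an illustration of the procedure under those assumptions, not the implementation used for the experiments:

# Minimal sketch of agglomerative clustering with complete linkage over
# Jaccard similarities, stopping when no pair of clusters is similar enough.
def cluster(features, threshold=0.5):
    # features: dict mapping entity name -> set of feature strings
    clusters = [[name] for name in features]

    def jaccard(f1, f2):
        union = len(f1 | f2)
        return len(f1 & f2) / union if union else 0.0

    def complete_link(c1, c2):
        # complete linkage: similarity of the least similar pair of members
        return min(jaccard(features[a], features[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = max(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda p: complete_link(clusters[p[0]], clusters[p[1]]))
        if complete_link(clusters[i], clusters[j]) < threshold:
            break                        # stop: no sufficiently similar pair left
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters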
2.6.3 Evaluation of Results
To evaluate the results of clustering, precision and recall have been used [Anquetil, Lethbridge,
1999]. To calculate the precision and recall of a test partition, an expert decomposition is
required. We constructed the expert decomposition by placing all the functions in one source file
in one cluster. This is based on the assumption that all functions placed in one file perform
similar tasks.
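A small sketch of the intra-pair computation of precision and recall is given below. It follows the usual definition used with these measures: a pair of entities counts as an intra pair of a decomposition when both members fall in the same cluster.

# Sketch of intra-pair precision/recall against an expert decomposition.
from itertools import combinations

def intra_pairs(partition):
    pairs = set()
    for cluster in partition:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs

def precision_recall(test, expert):
    test_pairs, expert_pairs = intra_pairs(test), intra_pairs(expert)
    common = test_pairs & expert_pairs
    precision = 100.0 * len(common) / len(test_pairs) if test_pairs else 0.0
    recall = 100.0 * len(common) / len(expert_pairs) if expert_pairs else 0.0
    return precision, recall

# Example expert decomposition: one cluster per source file (names illustrative)
# expert = [["func_a", "func_b"], ["func_c", "func_d", "func_e"]]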
2.6.4 Analysis of the Xfig System
In the text that follows, the results of the clustering experiments conducted on the 5 sub-systems
of Xfig are presented. In these experiments the following are analyzed:
 The comparison of features when using different similarity measures
 The comparison of various similarity metrics
 The comparison of various clustering algorithms
 Precision and recall for various similarity metrics using different clustering algorithms
The behavior of all the sub-systems is almost identical, with a few exceptions, when the
clustering process is applied to them. The results for d_files sub-system will be presented in
detail. The graphs of the results of comparison for the rest of the sub-systems and the entire
system are detailed in the appendices.
In the following section the analysis of d-files is presented in detail and the results of the rest
of the sub-systems are summarized.
2.6.5 Analysis of d_files Sub-System
In this sub-section a detailed analysis of d_files sub-system is carried out and the results are
presented. The d_files pertain to drawing objects and there are a total of 10 source files in this
sub-system.
Comparison of features
The following graphs show the performance of the features all, call, global and type when using
different similarity metrics and the complete clustering algorithm.
Figure 29: Comparison of Features Using (a) Jaccard, (b) Simple, (c) Correlation and (d) Canberra Similarity Measures with the Complete Algorithm
The above graphs illustrate that the performance of the Jaccard metric is very similar to that of
the correlation similarity measure. As explained previously in Section 2.1.2, the Jaccard metric J
is given by:
J = a/(a+b+c)
The correlation measure for binary features is given by:
r = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
The two similarity measures perform equivalently when d is much larger than a, as explained in
Section 2.1.2.3. For experiments with software clustering, d tends to be very large because the
feature vector associated with a module treated as an entity is sparse. This explains why the
behavior of the correlation and the Jaccard measures tends to be similar.
The graphs of Figure 29 also show that the simple and the Canberra metrics perform equivalently.
The simple metric is given by (a+d)/(a+b+c+d) and the Canberra metric is given by (b+c).
However, for software clustering the Jaccard similarity measure seems to be the most intuitive, as
it is based on the presence of features, whereas the correlation and simple metrics also take into
account the absence of features.
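The following small numeric check (illustrative only, with made-up vectors) shows the effect described above: for long, sparse binary feature vectors, where d dominates, the correlation coefficient orders pairs of entities much as the Jaccard coefficient does.

# Illustrative check: Jaccard vs. correlation (phi) on sparse binary vectors.
from math import sqrt

def counts(x, y):
    a = sum(1 for i, j in zip(x, y) if i and j)           # present in both
    b = sum(1 for i, j in zip(x, y) if i and not j)       # present only in x
    c = sum(1 for i, j in zip(x, y) if not i and j)       # present only in y
    d = sum(1 for i, j in zip(x, y) if not i and not j)   # absent in both
    return a, b, c, d

def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return a / (a + b + c) if a + b + c else 0.0

def phi(x, y):
    a, b, c, d = counts(x, y)
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

n = 1000                                  # long, sparse vectors: d dominates
u = [1] * 4 + [0] * (n - 4)
v = [1] * 3 + [0, 1] + [0] * (n - 5)      # shares 3 of u's 4 features
w = [0] * 4 + [1] * 4 + [0] * (n - 8)     # shares none of u's features
print(jaccard(u, v), phi(u, v))           # both clearly higher than ...
print(jaccard(u, w), phi(u, w))           # ... the values for the disjoint pair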
All the above graphs also show a sharp peak for the 'type' feature, indicating that there are many
'equidistant objects' with respect to this feature. On the other hand, the curves for the 'call'
feature are more spread out and leveled at the top; the same is true for the curves of 'all'
features. The 'global' feature, like the 'type' feature, also has a sharp peak at the beginning of
the clustering process, signifying the merging of identical singleton objects. After the peak, the
graphs level off, illustrating the merger of singleton objects with clusters.
The aforementioned phenomenon holds for all sub-systems except the f_files sub-system. In that
sub-system the Jaccard and correlation metrics, when applied to 'all' features, produce the maximum
number of clusters. However, the shape of the graph is similar to that of the rest of the
sub-systems when using 'all' features. The shape of the graph when using 'all' features and the
'call' feature with the Jaccard and correlation metrics resembles that of the graph for a
structured system, signifying the importance of these features and metrics with respect to the
clustering process.
Although Davey and Burd suggest that the feature which produces the maximum number of clusters is
the best feature [Davey, Burd, 2000], the shapes of the graphs should also be taken into account
when evaluating a clustering. The 'type' feature seems to produce the maximum number of clusters
during the clustering process; however, the curve suggests that many objects or functions are
equidistant with respect to this feature (as shown by the sharp peak). A careful analysis also
shows that the feature vector for 'type' is very sparse and many functions access the same type, so
this feature provides insufficient discrimination between different entities. The curves for 'all'
features and the 'call' feature, however, seem to suggest more suitable partitions for the Jaccard
and correlation metrics. The maximum number of clusters is almost the same as for the 'type'
feature, but the curve is more spread out and flatter at the top, illustrating the merger of
non-singleton clusters with singleton clusters in the middle of the clustering process. This shows
that many of the entities are not equidistant with respect to these features and that they provide
more discrimination between the different functions.
Comparison of Similarity Metrics
To compare the various similarity metrics for the 'd-files' sub-system, the graph of iteration
versus the total number of clusters is plotted for the various similarity metrics using all
features and the complete and combined algorithms, in Figure 30 and Figure 31. When using 'all'
features and the complete algorithm, the performance of all similarity metrics with respect to the
total number of clusters being formed is almost the same. However, when the combined algorithm is
used, the performance of correlation deteriorates considerably. This behavior can be explained as
follows. At the beginning of the clustering process the feature vector of every entity is sparse,
hence the value of d is very large compared to a. As the total number of functions in a cluster
increases, the feature vectors are ORed together and the resulting vector is no longer very sparse.
Hence the value of d becomes comparable to the value of a, and the performance of correlation
differs from the performance of the Jaccard association coefficient.
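A minimal sketch of this merge step, under the assumption (as described above) that the combined algorithm represents a merged cluster by the OR, i.e. the set union, of its members' feature vectors, is:

# Sketch of the combined algorithm's merge step: the merged cluster gets a
# new feature vector formed by ORing (set union of) its members' features,
# and similarities are then recomputed against that combined vector rather
# than against the individual members.
def combined_merge(cluster_features, i, j):
    # cluster_features: list of feature sets, one per current cluster
    merged = cluster_features[i] | cluster_features[j]   # binary OR of features
    cluster_features[i] = merged
    del cluster_features[j]
    return cluster_features

# After each merge, the Jaccard similarity between the merged cluster and any
# other cluster k is simply jaccard(cluster_features[i], cluster_features[k]).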
The graphs obtained also show that the behavior of the Sorensen-Dice and Jaccard association
coefficients is almost identical. This matches the results obtained by Davey and Burd.
Figure 30: Comparison of Similarity Metrics Using Complete Algorithm and All Features
Figure 31: Comparison of Similarity Metrics Using Combined Algorithm and All Features
Comparison of Algorithms
Figure 32: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features
To compare the performance of the various clustering algorithms, the total number of clusters is
plotted for the combined, complete, single, weighted and unweighted algorithms when using the
Jaccard similarity metric. All algorithms seem to perform almost equivalently for the d-files
sub-system. However, this result does not hold in general for the rest of the sub-systems, where
the performance of the single link algorithm is much worse than that of the complete link,
combined, weighted and unweighted algorithms. Davey and Burd point out that when single linkage is
used, the bigger clusters are merged rather than singletons unless the distance between them is
very large. Hence, single link creates a large number of isolated, loosely coupled clusters. The
complete linkage algorithm, on the other hand, tends to push the clusters apart, as it depends upon
the original similarities calculated from the feature vectors; it therefore tends to create a
larger number of highly cohesive clusters. Both Anquetil and Lethbridge [Anquetil, Lethbridge,
1999] and Davey and Burd [Davey, Burd, 2000] find that the performance of the weighted and
un-weighted algorithms lies between that of single and complete linkage. The same is true for the
rest of the sub-systems of Xfig except d-files.
The results on d-files and the rest of the sub-systems also show that the performance of the
combined algorithm is almost equivalent to that of the complete link algorithm. Since we are
experimenting with software artifacts the use of combined algorithm seems more intuitive as it
associates a new feature vector with a new cluster by combining all the facts together.
Precision and Recall
Figure 33: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features
Figure 33 illustrates the precision and recall graphs for the d-files sub-system for the correlation
and Jaccard similarity metrics when using the complete and combined algorithms for ‘all’
features. As can be seen by the graphs, the precision is high at the beginning of the clustering
process and low at the end. As opposed to that, the recall is zero at the start of the process and
rises to 100% at the end when one big cluster is formed. The crossover point for precision and
recall signifies the point when the total number of intra pairs of the expert clustering is equal to
the total number of intra pairs in the test clustering. When precision is greater than recall, it
means that the total number of intra pairs in the algorithm’s test clustering is less than that of the
expert’s. Conversely, when the total number of pairs in the expert clustering is less than that of
the test clustering then the recall is higher than precision. The graphs illustrate this tradeoff
between precision and recall.
Davey and Burd [Davey, Burd, 2000] and [Anquetil, Lethbridge, 1999] suggest that the precision
and recall can be used to evaluate the performance of different algorithms and similarity metrics.
It is obvious that the higher the values of both precision and recall, the better the performance
of the parameters used for the clustering process. Davey and Burd also indicate that the later the
crossover point occurs in the clustering process, the better the partitions produced, the reason
being that more functions will have been considered for clustering if the crossover occurs at a
later point.
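One simple way the crossover figures reported below could be computed is sketched here. This helper is only an illustration (an assumption about the bookkeeping, not the report's actual tooling): it locates the first iteration at which recall reaches precision and expresses it as a percentage of all iterations.

# Illustrative helper: crossover point of precision/recall curves as a
# percentage of the clustering iterations.
def crossover_point(precision, recall):
    # precision, recall: per-iteration percentages of equal length
    total = len(precision)
    for i, (p, r) in enumerate(zip(precision, recall)):
        if r >= p:
            return 100.0 * i / total
    return 100.0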
For the Jaccard similarity metric, the crossover point for the complete and combined algorithms
occurs at almost the same point. However, the height of this point is slightly greater for the
combined algorithm than for the complete algorithm for the d_files, e_files and f_files
sub-systems, indicating the better performance of the combined algorithm relative to the complete
algorithm.
The following table shows the crossover point for the various sub-systems:

Sub-System   Complete   Combined
d_files         18%        33%
e_files         22%        29%
f_files         25%        30%
u_files         33%        30%
w_files         17%        17%
In almost all the sub-systems, when using the correlation metric with the complete algorithm, the
crossover point occurs later in the clustering process than for the Jaccard similarity measure.
However, the height of this point is much lower than for both the Jaccard complete and Jaccard
combined algorithms. The "correlation-combined" combination performs the worst in all the
sub-systems, as its crossover point occurs much earlier in the clustering process and its height is
much lower.
2.6.6 Summary and Discussion of Results
In this section a detailed analysis of clustering software artifacts using the Xfig system has been
carried out. The various features for clustering, similarity measures and clustering algorithms
were compared. The experiments were carried out on 5 sub-systems of Xfig and also the entire
Xfig system. Generally, the results obtained on the 5 sub-systems and the entire system were
identical. The following observations have been made.
 For the Xfig system, the 'type' feature is not discriminating enough. The 'call' feature performs quite well. However, a combination of the 'type', 'call' and 'global' features gives the best performance.
 If the total number of absent features is very large compared to the total number of present features, then the correlation metric performs similarly to the Jaccard metric. The Jaccard metric also performs identically to the Sorensen-Dice similarity measure. Intuitively, for software clustering the Jaccard and Sorensen-Dice measures are more suitable than the simple and Canberra metrics.
 The performance of the weighted and un-weighted clustering algorithms lies in between that of the single link and complete link algorithms. The complete link algorithm produces better clusterings than the single link algorithm, and the combined algorithm produces better results than the complete link algorithm. For software clustering, the combined algorithm also seems more intuitive than the complete algorithm.
Analysis of Partitions and Cohesion
When analyzing a partition, it was noted that many of the clusters being formed are logically
cohesive. Pressman describes 7 levels of cohesion [Pressman, 1997], given below in order from
loosely cohesive to strongly cohesive:
 Coincidental
 Logical
 Temporal
 Procedural
 Communicational
 Sequential
 Functional
Pressman points out that the scale of cohesion is non-linear: low-end cohesion is much worse than
the middle range, which is almost as good as high-end cohesion. For the clustering process we need
to define measures that quantify these levels of cohesion and pin-point those features that
influence a partition towards being strongly cohesive.
As an example consider the following functions in the d-files sub-system:
 Create_lineobject
 Create_regpoly
 Line_drawing_selected
 Regpoly_drawing_selected
When using the Jaccard similarity measure and the complete algorithm on the basis of the 'type'
feature, the clustering process groups together create_lineobject and create_regpoly, and it groups
together line_drawing_selected and regpoly_drawing_selected, as the 'type' feature vectors of each
pair are more similar. The clusters obtained are therefore logically cohesive: the functions are
logically related to each other and perform the same kind of operation, but on different types of
objects. This is a loose form of cohesion. Ideally, we would like to form clusters that manipulate
the same data type; it would be more appropriate to place create_lineobject and
line_drawing_selected in one cluster and create_regpoly and regpoly_drawing_selected in another.
The above discussion brings out the need for defining quantitative measures of cohesion and for
analyzing the different features to estimate their relative importance to cohesion. For the above
example, function names could also be used to group together similar objects. However, this feature
is only useful if the reverse engineer is certain that consistent naming conventions have been used
in the entire source code. It is also necessary to assign weights to the individual feature vectors
for the 'global', 'type' and 'call' features when using the combination of 'all' features, so that
the resulting clusters are functionally cohesive rather than loosely cohesive.
2.7 Future Directions
This section pin-points areas of software clustering that are worth further investigation. They
are outlined in the text that follows:
 Quality of Partition: It is necessary to evaluate the quality of a partition using measures other than precision and recall. This can be done by using metrics such as data complexity or structural complexity. The quality of a clustering could then be assessed by comparing these metrics with the expert's. This can also help in deciding the criterion for stopping the clustering process or determining the height of the cut for the dendrogram made by the clustering process.
 Various Measures of Cohesion: As pointed out in Section 2.6.6, there is a need to define measures that quantify the various levels of cohesion and to assess the relative importance of various features with respect to these levels. This will help in evaluating the quality of a partition and also in determining the weights to be assigned to individual features during the clustering process.
 Assessing the Relative Importance of Features Used: The features used for the experiments described in this report have been assigned equal weights when using the 'all' feature. It is necessary to devise methods to assess their relative importance for software clustering and to use weighted features. Also, binary feature vectors have been used for these experiments; it would be interesting to count the relative frequencies of usage of these features by individual functions and compare the results with binary features. Additionally, the 'call', 'type' and 'global' features should be compared with other features like function names, comments in the source code etc. for the Xfig system.
 Defining New Similarity Measures: So far, researchers have been conducting experiments using the association coefficients and the distance or correlation measures. Other similarity measures, such as probabilistic measures, remain unexplored for clustering software artifacts. There is a need to come up with new similarity measures that are relevant to the software artifacts being grouped together and, more importantly, that pertain to the type of software they are applied to, e.g. legacy code, open source, embedded systems, real time systems etc.
 Comparison with Other Systems: The results obtained from Xfig need to be compared with other systems. It would be interesting to compare open-source systems with legacy systems. Also, the clustering process for different types of systems, such as embedded systems, device drivers and real time systems, can be compared.
 Relationship to Software Architecture: The clustering process identifies the various groups of entities in a software system and pin-points the various sub-systems present in the entire software. However, a clustering process can also impose a structure of its own, depending upon the nature of the algorithm and the similarity measure used. It would be interesting to relate these algorithms and similarity measures to the various types of defined software architectures as pointed out by Shaw and Garlan [Shaw, Garlan, 1996].
 Studying Software Evolution: Software clustering can help study how a software system evolves. Studying the previous and current versions of the same program can help a software engineer gain an insight into the initial design and the corresponding life cycle of changes that took place in the software.
APPENDICES
A. Experimental Results
The appendices present the results of clustering when applied to the following sub-systems of Xfig:
 e-files
 f-files
 u-files
 w-files
 The entire system
A.1 Analysis of e_files Sub-System
In this section an analysis of all source files of Xfig with the prefix "e_" is carried out.
Comparison of Features
Figure 34: Comparison of Features Using Different Similarity Metrics and Complete Algorithm
Comparison of Similarity Metrics
Figure 35: Comparison of Similarity Metrics for the Complete Algorithm Using All Features
Figure 36: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
Comparison of Algorithms
Figure 37: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features
Precision and Recall
Figure 38: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features
A.2 Analysis of f_files Sub-System
In this section an analysis of all source files with the prefix "f_" is carried out.
Comparison of features
Figure 39: Comparison of Features Using Different Similarity Metrics and Complete Algorithm
Comparison of Similarity Metrics
Figure 40: Comparison of Similarity Metrics for the Complete Algorithm Using All Features
Figure 41: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
Comparison of Algorithms
Figure 42: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features
Precision and Recall
Figure 43: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features
A.3 Analysis of u_files Sub-System
In this section the results obtained from clustering the sub-system of all files with the prefix "u_" are presented.
Comparison of features
Figure 44: Comparison of Features Using Different Similarity Metrics and Complete Algorithm
Comparison of Similarity Metrics
Figure 45: Comparison of Similarity Metrics for the Complete Algorithm Using All Features
Figure 46: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
Comparison of Algorithms
Figure 47: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features
Precision and Recall
Figure 48: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features
A.4 Analysis of w_files Sub-System
This section presents the clustering results for the "w_" sub-system.
Comparison of features
Figure 49: Comparison of Features Using Different Similarity Metrics and Complete Algorithm
Comparison of Similarity Metrics
Figure 50: Comparison of Similarity Metrics for the Complete Algorithm Using All Features
Figure 51: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
Comparison of Algorithms
Figure 52: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features
Precision and Recall
Figure 53: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features
A.5 Analysis of the Entire System
Comparison of features
Figure 54: Comparison of Features Using Different Similarity Metrics and Complete Algorithm
Comparison of Similarity Metrics
Figure 55: Comparison of Similarity Metrics for the Complete Algorithm Using All Features
Figure 56: Comparison of Similarity Metrics for the Combined Algorithm Using All Features
Comparison of Algorithms
Figure 57: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features
Precision and Recall
Figure 58: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features
B. Software Repositories
Important links to repositories that contain program facts or databases:
 Links to software guinea pigs:
http://plg.uwaterloo.ca/~holt/guinea_pig/
 Links to Rigi Projects by Johannes Martin :
http://www.rigi.csc.uvic.ca/~jmartin/rigi-projects/
 Links to work done by Anquetil and Lethbridge [Anquetil, Lethbridge, 1999]. This site also
contains the facts for the experiments conducted by them :
http://www.site.uottawa.ca/~anquetil/
C. Summary of Reverse Engineering Methods
Method | Technique (Broad Category) | Abstraction Level | Technology
Rigi | Clustering; semi-automatic | Between function and structure level | (k,2)-partite graphs and clustering based on metrics; semi-automatic
Function Abstraction | Rule-based | Domain level | Use of prime and proper programs; pattern matching for replacing prime programs with higher level abstractions
Knowledge based program analysis (PAT) | Rule-based | Function level | Knowledge based system with a deductive inference rule engine
Graph parsing approach | Rule-based | Function level | Flow graphs and flow graph grammar rules; plan calculus
GraphLog | Predicate calculus | Structure level | Visual tool for representing queries on source code; based on predicate calculus
Concept analysis (Siff & Reps) | Lattice theory | Between function and structure level | Concept analysis using lattice theory; use of negative attributes for building the concept lattice
DESIRE (Biggerstaff) | Rule-based | Structure level | Use of informal knowledge in design recovery; knowledge based pattern recognizer and prolog-based inference engine
Plan recognition | Rule-based | Function level | Matching program plans to code fragments
GROK | Relational algebra and architectural transformation | Between structure level and function level | Architectural transformations using relational algebra
Dali | SQL queries | Between structure level and function level | Extension of Rigi; use of SQL for specifying clustering patterns; use of dynamic and static information to generate views; requires manual intervention
D. Detailed Summary of Reverse Engineering Methods
Each method is summarized below, giving its technology, research group, input, output, objective, advantages, and disadvantages/required future extensions/problems, together with a cross-reference to the corresponding description in Part I where one is given.

Rigi (University of Victoria)
Technology: (k,2)-partite graphs and clustering based on metrics; semi-automatic.
Input: RFG (Resource Flow Graph).
Output: Layered sub-system hierarchies.
Objective: Structural re-documentation; extracting the system architecture.
Advantages: Visualization tool; interactive system; the technique based on (k,2)-partite graphs caters for low coupling and high cohesion.
Disadvantages/Problems: If the system is too complex it is difficult to form sub-system hierarchies; the metrics used for automatic sub-system detection are ad hoc and may require an expert's input.
Function Abstraction (IBM & University of Maryland). See Function Abstraction on page 10.
Technology: Use of prime and proper programs; pattern matching for replacing prime programs with higher level abstractions.
Input: Source code.
Output: High level abstractions, going up to business rules.
Objective: Extract high level business rules.
Advantages: Converts an unstructured program to a structured one; the method helps in understanding programs by abstracting the smaller units within them.
Disadvantages/Problems: Method not implemented and needs to be formalized; if the system is too complex, the prime programs extracted may be too large and not understandable.
Knowledge based program analysis (PAT) (University of Illinois). See Knowledge Based Program Analysis on page 11.
Technology: Knowledge based system with a deductive inference rule engine.
Input: The set of events generated by a program parser, fed into the system as programming plans.
Output: High level function-level concepts.
Objective: Identify what high level concepts are implemented by a program, how they are implemented, and whether they are implemented correctly.
Advantages: Tries to simulate the behaviour of an expert by inputting an expert's knowledge into the system.
Disadvantages/Problems: For a real system, several hundred event classes and plans would be required, which could only be acquired with the help of a domain expert.
Graph parsing approach (MIT). See Graph Parsing Approach on page 11.
Technology: Flow graphs and flow graph grammar rules; plan calculus.
Input: A graphical representation of source code called the plan calculus, which is converted to flow graphs.
Output: A design tree in which commonly used programming constructs are replaced by clichés.
Objective: Identify the clichés (commonly used data structures and algorithms) being used in the program.
Advantages: The design tree produced is a good tool for program maintenance; the generic nature of the technique can help cope with syntactic variation, non-contiguousness, implementation variation and overlapping implementations.
Disadvantages/Problems: The nature of the problem is such that it requires exhaustive search, which will not scale up to programs of large size and complexity; all clichés are pre-input into the system and there is no mechanism for learning them.
GraphLog (University of Toronto & IBM). See GraphLog on page 10.
Technology: Visual tool for representing queries on source code; based on predicate calculus.
Input: ER model of the software.
Output: Answers to queries presented by the user: partition code into overlay modules; a modified design of better quality; interactively selected views of the model.
Objective: Present source code queries through a visual tool; use software quality metrics to aid the software engineer towards a better understanding of the system and towards improving its design.
Advantages: Use of metrics for better design; a visual tool for presenting complex queries; also supports recursive queries.
Disadvantages/Problems: As this is an interactive tool, some automation might be required for generating the various views of systems that are too complex; manually removing one defect, such as a cyclic dependency, can also introduce other design defects such as violation of implementation hiding.
Concept analysis (Siff & Reps) (University of Wisconsin)
Technology: Concept analysis using lattice theory; use of negative attributes for building the concept lattice.
Input: Abstract syntax tree annotated with type information.
Output: Concept partitions that offer a possibility of C++ classes.
Objective: Convert C code to C++.
Advantages: Strong mathematical foundation for understanding; if the modularization is too fine it can be made coarser, and vice versa.
Disadvantages/Problems: Exponential time complexity poses a big problem for scalability to bigger systems; real life systems, when analyzed, have interferences in the concept lattice, which introduces the problem of automatic partitioning.
Description: A concept can be defined as a maximal collection of objects sharing common attributes. For a concept c = (O, A), O is the set of objects, called the extent of c, i.e. extent(c); A is the set of attributes, called the intent of c, i.e. intent(c). Table 2 shows the concepts for the object-attribute table illustrated in Table 1. The set of all concepts forms a partial order governed by the following relationship:
(O1, A1) ≤ (O2, A2)  iff  A1 ⊇ A2
(O1, A1) ≤ (O2, A2)  iff  O1 ⊆ O2
The set of concepts taken together with this partial ordering forms a complete lattice known as a concept lattice; the concept lattice for the example is presented in Figure 6. A node of the concept lattice is labeled with an attribute Ai in A if it is the largest concept having Ai in its intent, and with an object Oi in O if it is the smallest concept having Oi in its extent. The concept lattice gives an insight into the structure of the relationships among objects; in the example it shows two disjoint sets of objects, the first being O1 with attributes A1 and A2, and the second being O2, O3 and O4 sharing the other attribute.
Lindig and Snelting use concept analysis to identify higher level modules in a program [Lindig, Snelting, 1997]. They treat subprograms as objects and global variables as attributes to derive a concept lattice and hence concept partitions. The concept lattice provides multiple possibilities for modularization of programs, with each partition representing a possible modularization of the original program; it can be used to provide modularization at a coarse or a finer level of granularity. Lindig and Snelting conducted a case study on a Fortran program with 100 KLOC, 317 subroutines and 492 global variables. However, they failed to restructure the program using concept analysis.
Siff and Reps used concept analysis to detect abstract data types and to identify classes in a C program in order to convert it to C++ [Siff, Reps, 1997]. Sub-programs were treated as objects and structs or records in C were treated as attributes. They also introduced the idea of using negative examples, i.e. absent features, as attributes (e.g. a subprogram does not have attribute X). They successfully demonstrated their approach on small problems. However, larger programs are too complex to handle, and for those they suggest manual intervention.
Canfora et al. used concept analysis to identify sets of variables and to extract persistent objects of data files and their accessor routines for COBOL programs [Canfora, Cimitile, Lucia, Lucca, 1999]. They treated COBOL programs as objects and the files they accessed as attributes. Their approach is semi-automatic and relies on manual inspection of the concept lattice.
Concept analysis is an interesting new area of reverse engineering. However, it has an exponential time complexity and a space complexity of O(2^K). Its advantages are that it is based on a sound mathematical background and gives a powerful insight into the types of relationships and their interactions.

DESIRE (Biggerstaff) (MCC). See Design Recovery System (DESIRE) on page 13.
Technology: Use of informal knowledge in design recovery; knowledge based pattern recognizer and prolog-based inference engine; Prolog for querying source code.
Input: C source code parsed to form parse trees.
Output: Plain text web browser showing relationships between data items and files; different views can be generated based upon different input queries.
Objective: Recover the design of an existing system and construct a re-use library; use this re-use library for design recovery of legacy systems and to acquire domain knowledge.
Advantages: Use of informal knowledge for recovering the design and building the re-use library; use of the re-use library for extracting domain knowledge; maps human-oriented concepts to realizations within the source code; provides a tool for documenting and understanding source code.
Disadvantages/Problems: The problem of relating abstract concepts to source code still remains.
Plan recognition (University of Hawaii, Manoa). See Recognizing Program Plans on page 16.
Technology: Matching of program plans to code fragments.
Input: C source code parsed to an AST augmented with data and control flow; a library of program plans.
Output: Identification of program segments that implement a certain plan.
Objective: Enhance program understanding and recognize clichés within programs; also identify modules in order to convert C to C++ code.
Advantages: Based on empirical studies of how programmers conduct program understanding.
Disadvantages/Problems: Identifying program plans within programs is computationally complex, and the method described may not scale up to real world systems with high complexity.
GROK (SWAG Group, University of Waterloo). See Grok on page 17.
Technology: Architectural transformations using relational algebra.
Input: RSF (Rigi Standard Format).
Output: Architectural transformations showing relations between different software components.
Objective: Extracting high level structure information to ease software maintenance.
Advantages: Efficiently processes large graphs; based on sound mathematical foundations; many of the transformations that occur during software maintenance can easily be specified using Grok.
Disadvantages/Problems: Some transformations are difficult to express in relational algebra; not well suited for generalized pattern matching where some of the nodes and edges along a path represent a pattern and have to be stored after being visited.

Dali (Carnegie Mellon University). See Dali on page 18.
Technology: Extension of Rigi; use of SQL for specifying clustering patterns; use of dynamic and static information to generate views; requires manual intervention.
Input: SQL database storing the different views generated by different tools such as parsers, lexical analysers etc.
Output: Various architectural views of the system.
Objective: Extraction of the architectural information of a system for documenting and understanding the system, and also for maintenance and reuse.
Advantages: Support for SQL to specify clustering patterns; various views of the system are generated from different tools, giving both a dynamic and a static picture of the system.
Disadvantages/Problems: No analytic capabilities in Dali; patterns have to be written by someone who understands the system.
Annotated Bibliography
[Anquetil, Lethbridge, 1999] N.Anquetil, T.C.Lethbridge, “Experiments with Clustering as a Software
Remodularization Method”. The Sixth Working Conference on Reverse Engineering (WCRE’99), 1999.
Compares various clustering algos and features used for clustering. Uses hierarchical clustering.
Experiments on gcc, mosaic, Linux, Telecom. Nice description of similarity metrics and
clustering algos given in this paper.
[Biggerstaff, 1989] T.J.Biggerstaff, “Design Recovery for Maintenance and Reuse”. IEEE Computer,
22(7), pages 36-49, July 1989.
Very commonly referred to paper. Describes the idea of building a re-use library for design
recovery. Also, describes the prototype developed called ‘DESIRE’.
[Biggerstaff, Mitbander, Webster, 1994] T.J.Biggerstaff, B.G.Mitbander,D.E.Webster, “Program
Understanding and the Concept Assignment Problem”. Communications of the ACM, 37(5), pages 72-83,
May 1994.
Describes further work on DESIRE and what could be the other possibilities for candidate
concepts.
[Birkhoff, 1940] G.Birkhoff. Lattice Theory, 1st ed., American Mathematical Society, Providence, R.I., 1940.
We do not have the above reference. Lattice theory lays the foundation of mathematical concept
analysis.
[Buss et al., 1994] E.Buss, R.De Mori, J.Henshaw, H.Johnson, K.Kontogiannis, E.Merlo, H.Müller,
J.Mylopoulos, S.Paul, A.Prakash, M.Stanley, S.Tilley, J.Troster, K.Wong, “Investigating Reverse
Engineering Technologies: The CAS Program Understanding Project”, IBM Systems Journal, 33(3), 1994.
Paper describes the various reverse engineering techniques of SQL/DS source code as done by 5
teams.
[Canfora, Cimitile, Lucia, Lucca, 1999] G.Canfora, A.Cimitile, A.De Lucia, G.A. Di Lucca, "A Case Study
of Applying an Eclectic Approach to Identify Objects in Code", International Workshop on Program
Comprehension, pages 136-143, IEEE Computer Society Press, Pittsburgh, 1999.
Identify objects within COBOL programs using files as attributes and subprograms as objects
[Chikofsky, Cross II, 1990] E.J.Chikofsky, J.H.Cross II, “Reverse Engineering and Design Recovery: A
Taxonomy.” IEEE Software, 7(1), pages 13-17, January 1990.
An introductory paper for reverse engineering. Describes basic reverse engineering terminology.
Is very commonly referred to.
[Consens, Mendelzon, Ryman, 1992] M.Consens, A.Mendelzon, A.Ryman, "Visualizing and Querying
Software Structures." 14th International Conference on Software Engineering, Melbourne, Australia,
pages 138-156, May 1992.
One of the first papers on GraphLog. Is a very commonly quoted paper. Describes the use of
predicate calculus for presenting queries based on graphs.
[Davey, Burd, 2000] J.Davey, E.Burd, "Evaluating the Suitability of Data Clustering for Software
Remodularization". The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane,
Australia, pages 268-276, 2000.
Compares various similarity metrics, cluster distance algorithms and feature sets. Good
paper to read. Uses hierarchical clustering.
[Deursen, Woods, Quilici, 2000] A.V.Deursen, S.Woods, A.Quilici, “Program Plan Recognition For Year
2000 Tools”, Science Of Computer Programing, 36(2-3), pages 303-325, 2000.
Work on plan recognition and its application to Y2K problem
[Fahmy, Holt, Cordy, 2001] H.M.Fahmy, R.C.Holt, J.R.Cordy , “Wins and Losses of Algebraic
Transformations of Software Architectures”. Automated Software Engineering ASE 2001, San Diego,
California, November 26-29, 2001.
Describes GROK, a tool based on relational algebra for graph transformations. To be used for
software maintenance
[Godfrey, 2001] M.W.Godfrey, “Practical Data Exchange for Reverse Engineering Frameworks: Some
Requirements, Some Experience, Some Headaches”. Software Engineering Notes 26(1), pages 50-52,
January 2001.
Describes work on TAXFORM by SWAG Group and also on integrating various reverse
engineering tools.
[Grune, Bal, Jacobs, Langendoen, 2001] D.Grune, H.E.Bal, C.J.H.Jacobs, K.G.Langendoen, “Modern
Compiler Design”, John Wiley and Sons, Ltd, 2001.
Very good, handy book on compilers
[Harandi, Ning, 1990] M.T.Harandi, J.Q.Ning, “Knowledge-Based Program Analysis”. IEEE Software
7(1), pages 74-81, January 1990.
Describes how a knowledge based system of program events and program plans can be built.
Pioneer paper.
[Hausler, Pleszkoch, Linger, 1990] P.A.Hausler, M.G.Pleszkoch, R.C.Linger, “Using Function Abstraction
to Understand Program Behavior.” IEEE Software 7(1), pages 55-63, January 1990.
Describes the use of proper and prime programs to get function abstraction. Theoretical
overview, not implemented. Pioneer paper.
[Holt, 1997] R.C.Holt, "An Introduction to TA: The Tuple Attribute Language",
http://www.swag.uwaterloo.ca/pbs/
Introduces the tuple attribute format.
[Jain, Murty, Flynn, 1999] A.K.Jain, M.N.Murty, P.J.Flynn, "Data Clustering: A Review". ACM
Computing Surveys, 31(3), September 1999.
A general review of data clustering and clustering algorithms with respect to various applications.
[Kazman, Carrière, 1998] R.Kazman, S.J.Carrière, "View Extraction and View Fusion in Architectural
Understanding". The 5th International Conference on Software Reuse, Victoria, BC, Canada, June 1998.
Describes Dali workbench for creating the different views of a system
[Kontogiannis, 97] K.Kontogiannis, “Evaluation Experiments on the Detection of Programming Patterns
Using Software Metrics”. The Fourth Working Conference on Reverse Engineering (WCRE'97), 1997.
Describes 5 similarity metrics to be used for detecting clones and also uses recall and precision to
evaluate them. Experiments on unix utilities.
[Lung, 1998] C.H.Lung, "Software Architecture Recovery and Restructuring through Clustering
Techniques". Third International Conference on Software Architecture, November 1998.
Gives an interesting idea regarding the evaluation of clustering using data complexity.
[Lindig, Snelting, 1997] C.Lindig, G.Snelting, "Assessing Modular Structure of Legacy Code Based on
Mathematical Concept Analysis". International Conference on Software Engineering, pages 349-359,
Boston, 1997.
Nice tutorial on concept analysis. They use subprograms as objects and globals as attributes.
[Maarek, Berry, Kaiser, 1991] Y.S.Maarek, D.M.Berry, G.E.Kaiser, "An Information Retrieval Approach for
Automatically Constructing Software Libraries". IEEE Transactions on Software Engineering, 17(8),
pages 800-813, August 1991.
A very commonly referred-to paper. One of the earlier works on software clustering.
[Martin, Wong, Winter, Müller, 2000] J.Martin, K.Wong, B. Winter, H.A.Müller, “Analyzing xfig Using
the Rigi Tool Suite”. The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane,
Australia, 2000.
Describes the Rigi approach to xfig reverse engineering, conducted for the reverse engineering
competition held at CASCON'99.
[Murty, Jain, Flynn, 1999] "Data Clustering: A Review", ACM Computing Surveys, 31(3), September,
1999.
We don’t have this reference
[Müller et al., 2000] H.A.Müller, J.H.Jahnke, D.B.Smith, M.A.Storey, S.R.Tilley, K.Wong, “Reverse
Engineering: A Roadmap”. The 22nd International Conference on Software Engineering, Limerick, Ireland,
pages 49-60, June 2000.
Describes reverse engineering techniques and outlines some possible future research
areas for the field.
[Müller, Tilley, Orgun, Corrie, Madhavji, 1992] H.A.Müller, S.R.Tilley, M.Orgun, B. Corrie,
N.H.Madhavji, "A Reverse Engineering Environment Based on Spatial and Visual Software
Interconnection Models”. Proceedings of the Fifth ACM SIGSOFT Symposium on Software Development
Environments (SIGSOFT’92), Virginia, pages 88-98, in ACM Software Engineering Notes, 17(5),
December 1992.
Has an overview of the reverse engineering approach in Rigi and the spatial and visual
information represented by it.
[Müller, Uhl, 1990] H.A.Müller, J.S.Uhl, "Composing subsystem structures using (k,2)-partite graphs".
Proceedings of the 1990 Conference on Software Maintenance (CSM 1990), pages 12-19, November 1990.
The paper has an overview of Rigi's method of subsystem composition by making (k,2)-partite
graphs and composing by interconnection strength and common neighbours.
[Müller, Wong, Tilley, 1994] H. A. Müller, K. Wong, and S. R. Tilley. "Understanding software systems
using reverse engineering technology.” The 62nd Congress of L'Association Canadienne Francaise pour
l'Avancement des Sciences Proceedings (ACFAS 1994).
Describes Rigi's approach to reverse engineering. Has a good section on reverse engineering
approaches and can be used for a literature survey.
[Pressman, 1997] R.S.Pressman, "Software Engineering: A Practitioner's Approach". The McGraw-Hill
Companies, Inc, 1997.
Pressman's book on software engineering.
[Nelson, 1996] M.L.Nelson, "A Survey of Reverse Engineering and Program Comprehension". Available at
http://citeseer.nj.nec.com/nelson96survey.html, 1996.
A very basic paper describing reverse engineering approaches, definitions and techniques.
[Quilici, 1994] A.Quilici, “A Memory-Based Approach to Recognizing Programming Plans”.
Communications of the ACM, 37(5), pages 84-93, May 1994.
Interesting section on experiments conducted with student programmers on how they understand
programs. Describes their plan-based approach.
[Quilici, Woods, Zhang, 2000] A.Quilici, S.Woods, Y.Zhang, “Program Plan Matching: Experiments With
A Constraint-Based Approach", Science of Computer Programming, 36(2-3), pages 285-302, 2000.
Describes how plan recognition can be improved using a constraint-based approach for scalability
to larger and more complex systems.
[Rich, Wills, 1990] C.Rich, L.M.Wills, "Recognizing a program's design: A graph parsing approach". IEEE
Software, 7(1), pages 82-89, January 1990.
Describes the use of clichés and programming plans. Pioneer paper.
[Sartipi, Kontogiannis, 2001] K.Sartipi, K.Kontogiannis, “Component Clustering Based on Maximal
Association”. Eighth Working Conference on Reverse Engineering (WCRE’01), 2001.
Supervised clustering. The work relates to concept analysis. Introduces a new similarity measure,
component association. Experiments on CLIPS and Xfig.
[Schwanke, 1991] R.W.Schwanke, "An Intelligent Tool for Re-engineering Software Modularity".
Thirteenth International Conference on Software Engineering (ICSE'91), pages 83-92, May 1991.
One of the pioneer papers on clustering. Also introduces maverick analysis.
[Schwanke, Platoff, 1989] R.W.Schwanke, M.A.Platoff, "Cross References are Features". Second
International Conference on Software Configuration Management, ACM Press, pages 86-95, 1989. Also
available as ACM SIGSOFT Software Engineering Notes, 14(7), pages 86-95, November 1989.
Pioneer paper. Introduces the concept of using shared neighbours for the similarity metric.
[Shaw, Garlan, 1996] M.Shaw, D.Garlan, "Software Architecture: Perspectives on an Emerging
Discipline". Prentice-Hall Inc., 1996.
This is a book on software architecture and models.
[Siff, Reps, 1997] M.Siff, T.Reps , “Identifying Modules via Concept Analysis”. International Conference
on Software Maintenance, pages 170-179, IEEE Computer Society, October, 1997.
Describes their approach to concept analysis. Has a nice 'concept analysis primer' section.
Explains how to identify C++ classes from C code.
[Sim, Storey,2000] S.E.Sim, M.D.Storey, “A Structured Demonstration of Program Comprehension
Tools”. The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, 2000.
Describes results of the use of various tools on the Xfig source code. The demonstration was held
at CASCON'99 and the results were presented at WCRE'00.
[Sneath, Sokal, 1973] P.H.A.Sneath, R.R.Sokal, “Numerical Taxonomy”. Series of books in biology.
W.H.Freeman and Company, San Francisco, 1973.
We don’t have this reference
[Snelting, 1996] G.Snelting, “Reengineering of Configurations Based on Mathematical Concept Analysis”.
ACM Transactions on Software Engineering and Methodology, 5(2), pages 146-189, April 1996.
One of the introductory papers on concept analysis.
[Tzerpos, Holt, 1998] V.Tzerpos, R.C.Holt, "Software Botryology: Automatic Clustering of Software
Systems". Ninth International Workshop on Database and Expert Systems Applications (DEXA'98),
Vienna, Austria, August 1998.
A very general paper on clustering. Points to some past trends and future research directions for
software clustering.
[Tzerpos, Holt, 1999] V.Tzerpos, R.C.Holt, "MoJo: A Distance Metric for Software Clustering". The
Sixth Working Conference on Reverse Engineering (WCRE'99), Atlanta, October 1999.
Introduces a new metric for comparing clustering approaches. The metric evaluates the similarity
between two different partitions of a system.
[Tzerpos, Holt, 2000] V.Tzerpos, R.C.Holt, “ACDC: An Algorithm for Comprehension-Driven
Clustering”, The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia,
2000.
Describes the use of various subsystem patterns and orphan adoption, and has a nice
literature survey section.
[Visser, 2001] E.Visser, "A Survey of Rewriting Strategies in Program Transformation Systems". First
International Workshop on Reduction Strategies in Rewriting and Programming (WRS 2001). Also in
Electronic Notes in Theoretical Computer Science, Volume 57,
http://www.elsevier.nl/locate/entcs/volume57.html, 2001.
Very detailed paper on the various program transformation systems.
[Website GXL] http://www.gupro.de/GXL/
GXL’s website.
[Website Xfig] http://www.xfig.org
Xfig's website.
[Website Xfig RSF] http://www.rigi.csc.uvic.ca/~jmartin/rigi-projects/
Website where RSF files for a couple of projects, generated by Rigi, can be found.
[Website SORTIE] http://www.csr.uvic.ca/chisel/collab/casestudy.html
CASCON's website. Has related information on the SORTIE project (a demonstration for the
evaluation of various reverse engineering tools).
[Wiggerts, 1997] T.A.Wiggerts, "Using Clustering Algorithms in Legacy Systems Remodularization".
Fourth Working Conference on Reverse Engineering (WCRE'97), October 1997.
A very nice literature survey and a good review of similarity metrics and clustering algorithms. A good
starting point for clustering.