Reverse Engineering and Software Clustering for Remodularization
Technical Report CS-010-2002
Mehreen Saeed and Onaiza Maqbool
Supervisor: Dr. Haroon Atique Babri
Software Engineering Research Group (SERG)
Computer Science Department
Lahore University of Management Sciences (LUMS)

Abstract
The objective of this report is to explore the area of software reverse engineering. It consists of two parts. Part I of the report deals with the general area of software reverse engineering. It details the objectives of reverse engineering and outlines some of the tools and technologies used in this area. Part II of the report is more focused and investigates the area of software clustering in a thorough and detailed manner, along with its applications to reverse engineering. It also presents the detailed results of the software clustering experiments that were carried out as part of this research.

Acknowledgements
For making this endeavor a success, we acknowledge the support provided by the following:
Dr Syed Zahoor Hassan, for finding time to offer suggestions and guiding us in the right direction, and Dr Mansoor Sarwar, for his interest and encouragement.
Dr Hamayun Mian, Chairman PITB, with whom discussions were held in the initial phases of this research.
Johannes Martin of the Rigi group, University of Victoria, Canada, for providing files in the Rigi Standard Format (RSF) for Xfig.
Lahore University of Management Sciences (LUMS), for providing the financial support and resources to carry out this research.

Table of Contents
1 Reverse Engineering  1
1.1 Objectives of Reverse Engineering  2
1.2 Reverse Engineering Areas  3
1.3 Code Reverse Engineering Process  3
1.3.1 Levels of Abstraction  4
1.3.2 Program Representation  5
1.3.3 Exchange Formats  7
1.3.3.1 Graphical Exchange Language GXL  8
1.3.3.2 Rigi Standard Format RSF and Tuple Attribute TA Language  8
1.4 Reverse Engineering Techniques  9
1.4.1 Function Abstraction  10
1.4.2 GraphLog  10
1.4.3 Knowledge Based Program Analysis  11
1.4.4 Graph Parsing Approach  11
1.4.5 Concept Analysis  12
1.4.5.1 Mathematical Foundation  12
1.4.5.2 Concept Analysis Applications  14
1.4.6 Design Recovery System (DESIRE)  15
1.4.7 Recognizing Program Plans  16
1.4.8 Rigi  16
1.4.9 Grok  17
1.4.10 Dali  18
1.5 Case Studies on Reverse Engineering  18
1.5.1 Xfig  18
1.5.2 SORTIE  19
1.5.3 SQL/DS  20
1.6 Tools for Reverse Engineering  21
2 Software Clustering  22
2.1 Clustering: An Overview  22
2.1.1 Entity identification and feature selection  22
2.1.2 Definition of a similarity measure  23
2.1.2.1 Association coefficients  23
2.1.2.2 Distance measures  24
2.1.2.3 Correlation Coefficients  25
2.1.2.4 Probabilistic Measures  27
2.1.3 Clustering  27
2.1.3.1 Hierarchical  27
2.1.3.2 Partitional Algorithms  30
2.1.4 Assessment  30
2.2 Clustering Software  30
2.2.1.1 Observations Regarding Association, Correlation and Distance Metrics  32
2.2.2 Assessment of Results  32
2.2.3 Use of Precision and Recall  33
2.3 Our Approach  34
2.3.1 Entity identification and feature selection  34
2.3.2 Definition of a similarity measure  35
2.3.3 Clustering and the Combined Algorithm  35
2.4 Methods Used in the Past  37
2.5 Clustering: A Theoretical Framework  39
2.5.1 The Iteration Versus the Total Clusters Graph  40
2.5.2 Black Hole System  40
2.5.3 Glass Cloud System  41
2.5.4 Equidistant Objects  42
2.5.5 Structured Program  45
2.5.6 Unstructured Program  47
2.6 Experiments with Clustering for Software Re-Modularization  51
2.6.1 The Test System  51
2.6.2 Clustering Techniques  52
2.6.3 Evaluation of Results  53
2.6.4 Analysis of the Xfig System  53
2.6.5 Analysis of d_files Sub_System  53
2.6.6 Summary and Discussion of Results  59
2.7 Future Directions  61
A. Experimental Results  63
A.1 Analysis of e_files Sub_System  63
A.2 Analysis of f_files SubSystem  66
A.3 Analysis of u_files SubSystem  69
A.4 Analysis of w_files SubSystem  72
A.5 Analysis of the Entire System  75
B. Software Repositories  79
C. Summary of Reverse Engineering Methods  80
D. Detailed Summary of Reverse Engineering Methods  81
Annotated Bibliography  83

Table of Figures
Figure 1: The code reverse engineering process  3
Figure 2: The forward and reverse engineering of source code
4 Figure 3 : Parse Tree and the Corresponding AST for the Expression b*b-4*a*c [Grune, Bal, Jacobs, Langendoen, 2001] ............................................................................................... 6 Figure 4 : Resource Flow Graph Generated by Rigi for Linked List Program [Müller, Tilley, Orgun, Corrie, Madhavji, 1992]. ....................................................................................... 7 Figure 5 : The Directed Graph and a part of its corresponding GXL for a Small Program [Website GXL]. ................................................................................................................. 8 Figure 6 : Concept Lattice for the Attributes and Facts in Table 1 ......................................... 14 Figure 7 : The Reverse Engineering Process in Rigi ............................................................... 17 Figure 8 : Divisive Clustering Algorithm ................................................................................ 27 Figure 9 : Agglomerative Clustering ....................................................................................... 28 Figure 10 : Initial Clusters ....................................................................................................... 28 Figure 11 : After Step 1 ........................................................................................................... 29 Figure 12 : Clusters Formed at Step 3 Using Single Linkage Algorithm ................................ 29 Figure 13 : Clusters Formed Using Complete Linkage Algorithm ......................................... 29 Figure 14 : Precision and Recall Graph for an Ideal Clustering Algorithm ............................ 34 Figure 15: Iteration Vs. Total Clusters for the Black Hole System ......................................... 41 Figure 16: Pancaked Structure [Pressman, 97]........................................................................ 41 Figure 17: The general curve obtained for Iteration Vs. The total Number of Clusters .......... 43 Figure 18: Iteration Vs. Total Clusters for the Combined Algorithm for Equidistant Objects 43 Figure 19: Iteration Vs. the Total Number of Clusters for Equidistant Objects with 100% similarity.......................................................................................................................... 44 Figure 20 : A Structured Program ........................................................................................... 46 Figure 21: Clustering Process for the Structured Program ...................................................... 46 Figure 22: Unstructured Program ............................................................................................ 47 Figure 23: Clustering Process for the Unstructured Program .................................................. 48 Figure 24: Clusters Formed for the Unstructured Program ..................................................... 48 Figure 25: Iteration Vs. Number of Clusters for the Jaccard Similarity Using Combined Algorithm. ....................................................................................................................... 49 Figure 26: Clusters formed for Jaccard Similarity Measure Using Combined Features ......... 49 Figure 27: Clustering Process for Simple Similarity Co-efficient Using Complete Linkage . 
50 Figure 28: Clusters Formed Using Simple Similarity Coefficient and Complete Linkage Algorithm ........................................................................................................................ 50 Figure 29: Comparison of Features Using Jaccard Similarity and Complete Algorithm ........ 54 Figure 30: Comparison of Similarity Metrics Using Complete Algorithm and All Features.. 56 Figure 31: Comparison of Similarity Metrics Using Combined Algorithm and All Features 56 Figure 32: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 57 Figure 33: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features ............................................... 58 Figure 34: Comparison of Features Using Different Similarity Metrics and Complete Algorithm ........................................................................................................................ 63 Figure 35: Comparison Similarity Metrics for the Complete Algorithm Using All Features . 64 vi Figure 36: Comparison of Similarity Metrics for the Combined Algorithm Using All Features ......................................................................................................................................... 64 Figure 37: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 65 Figure 38: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features ............................................... 66 Figure 39: Comparison of Features Using Different Similarity Metrics and Complete Algorithm ........................................................................................................................ 67 Figure 40: Comparison Similarity Metrics for the Complete Algorithm Using All Features . 67 Figure 41: Comparison of Similarity Metrics for the Combined Algorithm Using All Features ......................................................................................................................................... 68 Figure 42: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 68 Figure 43: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features ............................................... 69 Figure 44: Comparison of Features Using Different Similarity Metrics and Complete Algorithm ........................................................................................................................ 70 Figure 45: Comparison Similarity Metrics for the Complete Algorithm Using All Features . 70 Figure 46: Comparison of Similarity Metrics for the Combined Algorithm Using All Features ......................................................................................................................................... 71 Figure 47: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features 71 Figure 48: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features ............................................... 72 Figure 49: Comparison of Features Using Different Similarity Metrics and Complete Algorithm ........................................................................................................................ 
73 Figure 50: Comparison of Similarity Metrics for the Complete Algorithm Using All Features ... 73 Figure 51: Comparison of Similarity Metrics for the Combined Algorithm Using All Features ... 74 Figure 52: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features ... 74 Figure 53: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features ... 75 Figure 54: Comparison of Features Using Different Similarity Metrics and Complete Algorithm ... 76 Figure 55: Comparison of Similarity Metrics for the Complete Algorithm Using All Features ... 76 Figure 56: Comparison of Similarity Metrics for the Combined Algorithm Using All Features ... 77 Figure 57: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features ... 77 Figure 58: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features ... 78

PART I : REVERSE ENGINEERING

1 Reverse Engineering

The term “reverse engineering” was initially used for the extraction of the design and structure of hardware components by analyzing the finished products. The term is now also widely applied to software systems. One of the first known definitions of software reverse engineering is by Chikofsky and Cross in their seminal paper “Reverse Engineering and Design Recovery: A Taxonomy” [Chikofsky, Cross II, 1990]. They define reverse engineering as:

“The process of analyzing a subject system to identify the system’s components and their interactions, and to create representations of the system in another form or at a higher level of abstraction.”

Note that the above definition of reverse engineering is not restricted to software only. The process of reverse engineering is carried out to obtain an understanding of a system when its original design or plan is not available. The intention can be to make an identical ‘clone’ of the system, or to carry out maintenance on the system when the original design is not available. When considering software systems, it is often the case that long term changes lead to a deviation of the system from its original design. For legacy systems, written ten or even twenty years ago, the original designers or developers of the system are no longer available. Such systems might be too costly to replace. In such situations it is often desirable to carry out reverse engineering to gain a better understanding of the system. This process helps the maintenance engineers in adapting the system to the changing needs of the organization, and also in reengineering and migrating the system to a newer platform. Chikofsky and Cross point out that reverse engineering is “a process of examination, not change or replication”.

Forward engineering involves moving from a high level abstraction, in the form of client requirements or specifications of the system, to a lower level implementation of the system in the form of source code that is platform dependent.
Reverse engineering is, therefore, the opposite of forward engineering. Here the goal is to “examine” the system in order to move from the low level details available in the form of source code to higher levels of abstraction. The higher abstract levels can be in the form of syntax trees or dependency graphs that give a better understanding of the source code. At a higher level of abstraction, the process of reverse engineering identifies the subsystems of an overall design, ultimately leading to an understanding of the logical design, the functional specifications and the requirements of the system.

The objective of this report is not to look at the financial aspects of reverse engineering but to conduct an in-depth study of its technical aspects. In this part of the report we explore the objectives of reverse engineering, the reverse engineering process and the various approaches to reverse engineering. It also outlines some of the work done in the past and a few case studies pertaining to this area.

1.1 Objectives of Reverse Engineering

The main goal of reverse engineering is to study a system and gain a better understanding of it. This understanding or insight is required for re-documenting the system, maintaining it, and re-designing or restructuring it to a present day paradigm. Such a study would also be required to migrate or reengineer the system from an old platform to a new one. Chikofsky and Cross define the various objectives of reverse engineering as follows [Chikofsky, Cross II, 1990]:

Cope with the complexity of large, voluminous systems: The main problem encountered in maintaining large legacy systems is the sheer volume of code that has to be understood. Reverse engineering tools help in extracting relevant information from large complex systems and, hence, help the software developers control the processes and products in systems evolution.

Generate alternate views of software programs: Reverse engineering tools are designed to generate various alternate views of software programs, such as call graphs, dependency graphs, resource flow graphs etc. A graphical representation of such views helps the maintenance engineers gain a better insight into the system.

Recover lost information from legacy systems: Over a period of time a system tends to deviate from its original design. The process of reverse engineering helps in recovering lost information about its design, specifications and requirements.

Detect side effects of haphazard initial design and successive modifications: Haphazard design together with modifications to the system leads to side effects. The performance of the system can degrade or it can produce unpredictable results. The process of reverse engineering can provide these observations and identify problem areas of the system.

Synthesize higher abstractions: As mentioned earlier, the goal of reverse engineering is not just to provide an insight into source code but also to generate higher level abstract forms of the system, such as the logical design, functional specifications or the requirements of the system.

Facilitate re-use by detecting reusable software components: An important application of reverse engineering is to identify components in a software system and reuse them in present systems to save development costs.

1.2 Reverse Engineering Areas

Müller et al.
in their paper “Reverse Engineering: A Roadmap” have identified two main areas of reverse engineering [Müller et al., 2000]:

Code reverse engineering
Data reverse engineering

Code reverse engineering is the retrieval of information from source code to extract various levels of abstraction. Data reverse engineering, on the other hand, focuses on the data generated by a software system and on finding the mapping of this data to the logical data structures being used in the program. Therefore, code reverse engineering tells us “how” the processing of information takes place, whereas data reverse engineering focuses on “what” information is processed. Code reverse engineering has applications in the areas of program understanding, system reengineering, architecture recovery and restructuring, studying software evolution etc. Initially the process of reverse engineering was considered to pertain only to code reverse engineering. However, data reverse engineering gained the attention of researchers in the past few years when solving the Y2K problem. It is applicable where massive software changes pertaining to data are required, like the change in the structure of dates in the Y2K problem or the European currency change in 2002. This report mainly focuses on the area of code reverse engineering rather than data reverse engineering. The area of code reverse engineering is therefore presented in detail here.

1.3 Code Reverse Engineering Process

[Figure 1: The code reverse engineering process — source code is parsed (language dependent) into an exchange format, which is then analyzed and documented or visualized (language independent).]

Figure 1 illustrates the code reverse engineering process. The source code is parsed by the parser. The parser produces an abstract form of the code in terms of a syntax tree, a dependency graph, a resource flow graph etc. The analyzer then analyzes this representation for further documentation or visualization. A reverse engineering process is the reverse of a forward engineering process: one moves from lower level implementation details to a more abstract form.

1.3.1 Levels of Abstraction

[Figure 2: The forward and reverse engineering of source code — forward engineering moves from the requirements specification through the formal specification and design specification to an implementation-specific form; reverse engineering moves in the opposite direction.]

Figure 2 illustrates the forward and reverse engineering processes. Harandi and Ning have defined the following levels of abstraction [Harandi, Ning, 1990]:

Implementation Level: At the implementation level of abstraction a program’s source code is represented in a form from which it is possible to directly obtain the source code. However, an abstraction at the implementation level is independent of the programming language used. Examples of this form of representation are abstract syntax trees or symbol tables.

Structure Level: At this level of abstraction, the source code of a program is shown in less detail and more stress is laid on understanding the basic structure or design of the system. Examples of this form of abstraction are the representation of a system by means of call graphs, resource flow graphs or program dependency graphs etc.

Function Level: Functional level abstraction tends to group together functionally equivalent but structurally different parts of source code.
This form of abstraction provides the software engineers with a bird’s eye view of the entire system, from which they can gain an insight into its basic architecture. It presents the main modules present in the system and their interactions with each other.

Domain Level: The domain level abstraction replaces algorithmic constructs with domain specific concepts. Abstraction at the functional level gives the ‘how’ information to the software engineer, whereas the domain level abstraction answers the questions ‘what type of information is being processed’ and ‘what is being done’ rather than ‘how’ it is being done. This level gives an insight into the nature of the problem being solved by the system and details the requirements and the specifications of the software.

In Figure 1, the parser is only used to convert the source code into a format that is independent of the original programming language. The real intelligence lies in the analyzer that documents or produces abstract high level forms of the system. Some of the methods for analyzing the source code are described later in this report in section 1.4 “Reverse Engineering Techniques”.

1.3.2 Program Representation

As we can see, the parsing stage of reverse engineering is language dependent, while the second stage is language independent. The trend in reverse engineering is to build analyzers that are independent of the programming language used. Such a framework is also called the Intentional Programming Framework, where the programs are stored, edited and processed as source graphs and are not dependent upon the programming language [Visser, 2001]. The possible types of outputs produced by the parser, and processed further for reverse engineering a system, are described below (a small programmatic illustration of both forms follows at the end of this section):

[Figure 3: Parse Tree and the Corresponding AST for the Expression b*b-4*a*c [Grune, Bal, Jacobs, Langendoen, 2001]]

Abstract syntax tree (AST): The syntax tree or parse tree depicts how the various segments of program text are viewed in terms of the grammar [Grune, Bal, Jacobs, Langendoen, 2001]. It is a very detailed data structure showing the syntactic information of a program. Usually, this detailed information is not required for reverse engineering and hence the abstract syntax tree is used instead. An abstract syntax tree is less detailed than a parse tree, and usually the information regarding the grammar of the language is not shown in it. An example of both a parse tree and an AST is shown in Figure 3.

Tree or graph: A program can also be represented by means of trees, directed acyclic graphs or full fledged graphs with cycles. Of particular interest are program dependency graphs (PDGs), which are directed graphs whose vertices are connected by several types of edges. The vertices represent the assignment statements and predicates of a program, and the edges represent control and data dependencies. A PDG can represent programs with only one procedure. Similar to PDGs, system dependency graphs are used to represent programs with multiple procedure calls. Another type of graph used to represent programs is the resource flow graph, which is a directed weighted graph where the vertices are components of the system and the edges are dependencies induced by the resource supplier-client relation. A directed edge from node a to node b represents a dependency of a on b, and the weight on the edge represents the resources being exchanged between a and b [Müller, Uhl, 1990]. An example of a resource flow graph as generated by the reverse engineering tool Rigi is shown in Figure 4.

Figure 4: Resource Flow Graph Generated by Rigi for a Linked List Program [Müller, Tilley, Orgun, Corrie, Madhavji, 1992].
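To make the two representations above concrete, the following small sketch (in Python, whose standard ast module exposes abstract syntax trees directly; the systems analyzed in this report are C programs handled by dedicated parsers such as Rigi's) parses the expression of Figure 3 and also keeps a resource flow graph as a plain adjacency structure. The component names in the graph are invented for illustration.

    import ast

    # Implementation-level view: the AST of the expression from Figure 3.
    tree = ast.parse("b*b - 4*a*c", mode="eval")
    print(ast.dump(tree.body))

    # Structure-level view: a resource flow graph kept as an adjacency mapping
    # from each component to the components it depends on (names are invented).
    resource_flow = {
        "main": ["list_insert", "list_print"],
        "list_insert": ["malloc"],
        "list_print": ["printf"],
    }
    for client, suppliers in resource_flow.items():
        for supplier in suppliers:
            print(client, "->", supplier)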
1.3.3 Exchange Formats

Figure 1 depicts the reverse engineering process and shows that the process of analyzing the source code can be kept language independent. To achieve this end, there is a need to define exchange formats that allow the output from a parser to be reused by any analyzer. Various exchange formats have been defined to represent the AST or the dependency graph of a program. Two of the more popular exchange formats, namely the Graphical Exchange Language GXL and the Rigi Standard Format RSF, are described here.

1.3.3.1 Graphical Exchange Language GXL

GXL appears to be the emerging standard for exchanging output between a parser and a reverse engineering tool. It is based on the Dagstuhl Middle Model DMM, which is a model for repository software in reverse engineering [Website GXL]. GXL is an XML sub-language and can be used to represent any type of schema graph. It can represent AST level abstraction or architectural level abstraction, and can be used to exchange both instance data and schema data. GXL represents a typed attributed directed graph. The directed graph for a small program and its corresponding GXL is taken from GXL's website and is shown in Figure 5.

[Figure 5: The Directed Graph and a part of its corresponding GXL for a Small Program [Website GXL] — the GXL listing declares a graph "simpleExample" whose typed nodes (e.g. procedures p and q, and a variable v) carry attributes such as the source file and line number.]
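The following is a minimal, hedged sketch of how a fact extractor might emit a GXL-style document using Python's xml.etree.ElementTree. The node type names, attribute names and schema reference loosely mirror the simple example published on the GXL website, but they are illustrative assumptions, not a complete or validated GXL schema.

    import xml.etree.ElementTree as ET

    # Minimal GXL-style document: two procedure nodes and one "call" edge.
    # The schema href and type names are illustrative placeholders.
    gxl = ET.Element("gxl")
    graph = ET.SubElement(gxl, "graph", id="simpleExample")

    def add_node(graph, node_id, node_type, filename):
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "type", {"xlink:href": "simpleExampleSchema.gxl#" + node_type})
        attr = ET.SubElement(node, "attr", name="file")
        ET.SubElement(attr, "string").text = filename

    add_node(graph, "p", "Proc", "main.c")
    add_node(graph, "q", "Proc", "test.c")

    edge = ET.SubElement(graph, "edge", {"from": "p", "to": "q"})
    ET.SubElement(edge, "type", {"xlink:href": "simpleExampleSchema.gxl#call"})

    print(ET.tostring(gxl, encoding="unicode"))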
1.3.3.2 Rigi Standard Format RSF and Tuple Attribute TA Language

The Tuple Attribute (TA) language is used to represent certain types of information about a program by means of graphs [Holt, 1997]. TA is a two level language: the first level, also called the Tuple Sub-language, records facts about the program, and the second, also called the Attribute Sub-language, records its schema. The tuple language records facts or tuples in the form of triples. The first element of the tuple represents the verb, the second the subject and the third the object. For example, if module P calls module Q in a program, this would be represented as:

Call P Q

This statement also represents two nodes P and Q of a graph, connected together by an edge labeled “call”. Call is the verb, P the subject and Q the object. This syntax for triples was invented in the Rigi Project [Müller, Tilley, Orgun, Corrie, Madhavji, 1992], where it is called RSF, for Rigi Standard Form. The form was later extended by Holt [Holt, 1997]. Hence, the entire program can be represented by means of a colored graph and stored using this language. The language is therefore a means for storing graphs and acts like a graph repository. The attribute sub-language is used to assign attributes to each edge or node in the colored graph representing the program, e.g.

Call P Q {color = black}
P {color = red}

The first statement assigns a color to the verb/edge and the second assigns a color to the node P. The language can also be used to record the positions of the vertices in the graph. Hence, the TA language is simple and very useful, and can be considered a repository for storing information about graphs that represent a program structure.
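Because RSF is essentially a stream of verb-subject-object triples, a few lines of code suffice to load a fact base into per-verb graphs. The sketch below is only an illustration with invented facts; it is not the Rigi toolchain itself.

    from collections import defaultdict

    # A few invented RSF facts: "verb subject object", one triple per line.
    rsf = """\
    call main list_insert
    call main list_print
    call list_insert malloc
    use  list_insert head
    """

    # Group the fact base into one directed graph per verb (call graph, use graph, ...).
    graphs = defaultdict(lambda: defaultdict(set))
    for line in rsf.splitlines():
        if not line.strip():
            continue
        verb, subject, obj = line.split()
        graphs[verb][subject].add(obj)

    for callee in sorted(graphs["call"]["main"]):
        print("main calls", callee)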
1.4 Reverse Engineering Techniques

This section outlines various techniques for software reverse engineering and some of the work done by various researchers. A summary of these methods is presented in the appendices. Many of the techniques described address code reverse engineering for re-modularization, program understanding or ease of software maintenance. The major challenge of code reverse engineering is to extract meaningful abstract level design and requirements information from source code. Rich and Wills point out several difficulties in this regard [Rich, Wills, 1990]:

Syntactic Variation: There are different ways and methods of solving the same problem. A programmer can achieve the same net flow of data and control in many different ways. All programmers have different styles and different approaches to coding a problem. Even the same programmer, when given the same problem twice, would code it differently.

Non-Contiguousness: The logic for solving a single problem may be scattered throughout the program. It might not be localized to a single module or procedure but spread across different places in the program.

Implementation Variation: A problem can be implemented in many different ways. The same problem can have different solutions at an abstract level and hence different implementations for solving it.

Overlapping Implementations: Two or more distinct abstract forms or levels of a system may be implemented using the same lines of code or the same portion of a program, making it difficult to identify them individually.

Unrecognizable Code: A program understanding system should be able to ignore the idiosyncrasies present in the source code and only pick up relevant information.

In the following sections we describe some of these reverse engineering technologies.

1.4.1 Function Abstraction

[Hausler, Pleszkoch, Linger, Hevner, 1990] propose program understanding through “function abstraction”. Abstracting program functions means that the precise function of a program or subprogram is determined by analyzing how it manipulates data. The objective of this technique is to extract business rules by producing higher level abstractions. A summary of the method is as follows:

Transform the program into a structured form by getting rid of all goto statements.

Group programs into two classes: proper programs and prime programs. A proper program is a flow chart program having a single entry and a single exit node and a path through each node. Prime programs are irreducible proper programs. Hausler et al. identified three types of primes, namely iteration (e.g. do-while), alternation (i.e. if-then) and sequence (i.e. begin-end).

Construct a proper program by repeatedly replacing function nodes by prime sub-programs. This process is termed function expansion.

Use program slicing techniques based on data structures to abstract program primes one variable at a time.

The functionality of each prime is determined by analyzing the usage patterns of data or trace tables. This technique can vary depending upon the type of prime program being analyzed.

After abstracting the overall program function, translate the mathematical constructs to English text.

By performing function expansion the problem of determining a program’s function reduces to determining the functionality of each program prime. The technique of function abstraction was not implemented but only developed as a methodology; it lays down the basis for attaining a program’s function abstraction. However, straightforward abstraction techniques can lead to program functions that are difficult to read and understand. There still remains the problem of generating abstractions that are useful as well as functionally correct.

1.4.2 GraphLog

Joint research conducted by IBM and the University of Toronto developed a graphical language, GraphLog, as an aid to software engineering for querying software structures by means of visual graphs [Consens, Mendelzon, Ryman, 1992]. The main motivation for developing this language was to construct a platform bringing together two aspects of program understanding, namely query and visualization. The software system can be represented as a directed graph, and queries can be constructed by drawing graphs using a graphical editor. Consens et al. proposed using this language as a basis for evaluating the quality of a software system by defining quality metrics such as cyclic dependency among packages. The language can also be used to extract additional information from source code by inputting queries as visual graphs. The basis of the source code representation is ER diagrams, and the underlying mechanism of GraphLog is based upon predicate calculus.

1.4.3 Knowledge Based Program Analysis

Harandi and Ning [Harandi, Ning, 1990] describe building a knowledge based system, PAT, for enhancing program understanding and conducting program analysis. Their implementation of a model system helps maintainers understand three basic issues:

1. What high level concepts does a program implement?
2. How does the program encode the concepts in terms of low-level concepts?
3. Is the implementation of the recognized concepts correct?

PAT has a built-in parser that re-writes the entire program into a language independent form, namely the event base. The event base has a set of objects called events that represent the syntactic and semantic concepts present in the program. The program events are stored in a hierarchical structure, with low level concepts representing the source code. At the higher levels, programming patterns and strategies are used. Using this event set the “understander” recognizes higher level events that represent function oriented concepts. The newly recognized events are added to the event set and the process is repeated till no more high-level events are recognized. The final event set then has information about the high level concepts that a program implements. A deductive inference rule engine is used to implement the “understander’s” core functionality. A plan base is stored in the system by a domain expert. The plan base is a set of program plans that represent the analysis knowledge of the program. It is stored as inference rules from which high level events can be derived. The prototype system built by Harandi and Ning has 100 program events and a few dozen program plan rules.
However, for real life systems several hundred event classes and plans would be required.

1.4.4 Graph Parsing Approach

Rich and Wills [Rich, Wills, 1990] describe a technique called the ‘Graph Parsing Approach’ to recognize a program’s design by identifying the commonly used data structures and algorithms. They have defined the term ‘clichés’ for commonly used programming structures like enumerations, accumulators, binary searches etc. They describe a recognizer that translates the source code of a program into the plan calculus. A “plan” is a hierarchical graph structure made of boxes that represent operations and tests, and arrows that denote control and data flow. The commonly used program constructs, i.e. clichés, are input into the system in the form of plans by an expert with knowledge about those constructs. The recognition of clichés starts at the parsing stage, where a parser translates the source code into plans, which are directed graphs. The cliché recognizer then identifies sub-graphs and replaces them with abstract operations using pattern matching techniques. Matches are made against the program constructs already stored in the database. The main drawback of this technique is that the search techniques employed in cliché recognition are complex and too expensive to implement. The issue of automatically learning program plans also needs to be addressed.

1.4.5 Concept Analysis

Concept analysis provides a way of grouping together similar entities based on their common attributes. The mathematical foundation for concept analysis rests on the lattice theory introduced by Birkhoff in 1940 [Birkhoff, 1940]. In 1996 Snelting introduced this idea to software engineering for inferring and extracting software hierarchies from raw source code [Snelting, 1996]. To introduce the reader to this area of reverse engineering the following terms are defined:

Module: A syntactic unit used to group entities together. An interface and an optional implementation are a part of a module.

Component: A group of related elements with a common goal that unifies them.

Atomic Component: A non-hierarchical component that consists of related global constants, variables, sub-programs and/or user defined types.

1.4.5.1 Mathematical Foundation

Concept analysis is based on a relation $R$ between a set of objects $O$ and a set of attributes $A$:

$R \subseteq O \times A$

The triple $C = (O, A, R)$ is called a formal context. The set of common attributes of a set of objects $O_s \subseteq O$ is defined as:

$\sigma(O_s) = \{ a \in A \mid \forall o \in O_s : (o, a) \in R \}$

The set of common objects of a set of attributes $A_s \subseteq A$ is defined as:

$\tau(A_s) = \{ o \in O \mid \forall a \in A_s : (o, a) \in R \}$

The following example, taken from [Lindig, Snelting, 1997], is used to illustrate the above terms. Consider the object versus attributes table (an x marks that the object in the row has the attribute in the column):

        A1  A2  A3  A4  A5  A6  A7  A8
    O1   x   x
    O2           x   x   x
    O3           x   x       x   x   x
    O4           x   x   x   x   x   x

Table 1: Object versus attributes table

For the above table:

$\sigma(\{O_1\}) = \{A_1, A_2\}$
$\sigma(\{O_3\}) = \{A_3, A_4, A_6, A_7, A_8\}$
$\tau(\{A_3, A_4\}) = \{O_2, O_3, O_4\}$
$\tau(\{A_1, A_2\}) = \{O_1\}$

A concept is a pair $(X, Y)$ consisting of a set of objects and a set of attributes such that:

$Y = \sigma(X), \quad X = \tau(Y)$

A concept can thus be defined as a maximal collection of objects sharing common attributes. For a concept $c = (O, A)$, $O$ is the set of objects, called the extent of $c$, i.e. extent($c$), and $A$ is the set of attributes, called the intent of $c$, i.e. intent($c$). Table 2 shows the concepts for the object attribute table illustrated in Table 1.

    C1 = ({O1, O2, O3, O4}, {})
    C2 = ({O2, O3, O4}, {A3, A4})
    C3 = ({O1}, {A1, A2})
    C4 = ({O2, O4}, {A3, A4, A5})
    C5 = ({O3, O4}, {A3, A4, A6, A7, A8})
    C6 = ({O4}, {A3, A4, A5, A6, A7, A8})
    C7 = ({}, {A1, A2, A3, A4, A5, A6, A7, A8})

Table 2: Concepts for Table 1

The set of all concepts forms a partial order governed by the following relationship:

$(O_1, A_1) \leq (O_2, A_2) \iff A_1 \supseteq A_2 \iff O_1 \subseteq O_2$

The set of concepts taken together with this partial ordering forms a complete lattice, known as the concept lattice. The concept lattice for the example given in this section is presented in Figure 6.

[Figure 6: Concept Lattice for the Attributes and Facts in Table 1]

A node of the concept lattice is labeled with an attribute $A_i \in A$ if it is the largest concept having $A_i$ in its intent, and with an object $O_i \in O$ if it is the smallest concept having $O_i$ in its extent. The concept lattice gives an insight into the structure of the relationships between objects. The figure shows that there are two disjoint sets of objects: the first is O1 with attributes A1 and A2, and the second is O2, O3 and O4, which share the remaining attributes.
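The definitions above translate almost directly into code. The following brute-force sketch (illustrative only; practical tools use far more efficient lattice construction algorithms, since the number of concepts can grow exponentially) computes all concepts of the Table 1 context by closing every subset of objects under the composition of σ and τ.

    from itertools import chain, combinations

    # Formal context from Table 1: each object maps to its attribute set.
    context = {
        "O1": {"A1", "A2"},
        "O2": {"A3", "A4", "A5"},
        "O3": {"A3", "A4", "A6", "A7", "A8"},
        "O4": {"A3", "A4", "A5", "A6", "A7", "A8"},
    }
    attributes = set().union(*context.values())

    def sigma(objs):
        """Common attributes of a set of objects (sigma of the empty set is all attributes)."""
        return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

    def tau(attrs):
        """Common objects of a set of attributes."""
        return {o for o, feats in context.items() if attrs <= feats}

    # A pair (X, Y) is a concept iff Y = sigma(X) and X = tau(Y).
    concepts = set()
    for subset in chain.from_iterable(combinations(context, r) for r in range(len(context) + 1)):
        extent = tau(sigma(set(subset)))   # close the object set
        intent = sigma(extent)
        concepts.add((frozenset(extent), frozenset(intent)))

    for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
        print(sorted(extent), sorted(intent))   # prints the seven concepts of Table 2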
1.4.5.2 Concept Analysis Applications

Lindig and Snelting use concept analysis to identify higher level modules in a program [Lindig, Snelting, 1997]. They treat subprograms as objects and global variables as attributes to derive a concept lattice and hence concept partitions. The concept lattice provides multiple possibilities for the modularization of a program, with each partition representing a possible modularization of the original program. It can be used to provide modularization at a coarse or a finer level of granularity. Lindig and Snelting conducted a case study on a Fortran program with 100 KLOC, 317 subroutines and 492 global variables. However, they failed to restructure the program using concept analysis.

Siff and Reps used concept analysis to detect abstract data types in order to identify classes in a C program and convert it to C++ [Siff, Reps, 1997]. Sub-programs were treated as objects and structs or records in C were treated as attributes. They also introduced the idea of using negative examples, i.e. absent features, as attributes (e.g. a subprogram does not have attribute X). They successfully demonstrated their approach on small problems. However, larger programs are too complex to handle, and for those they suggest manual intervention.

Canfora et al. used concept analysis to identify sets of variables and to extract persistent objects (data files) and their accessor routines from COBOL programs [Canfora, Cimitile, Lucia, Lucca, 1999]. They treated COBOL programs as objects and the files they accessed as attributes. Their approach is semi-automatic and relies on manual inspection of the concept lattice.

Concept analysis is an interesting new area of reverse engineering. However, it has an exponential time complexity and a space complexity of $O(2^k)$. Its advantages are that it is based on a sound mathematical foundation and gives a powerful insight into the types of relationships and their interactions.

1.4.6 Design Recovery System (DESIRE)

DESIRE is a tool that is a product of research at MCC by a team led by Biggerstaff [Biggerstaff, Mitbander, Webster, 1994], [Biggerstaff, 1989]. The basic philosophy behind this system is that informal linguistic knowledge is a necessary component of program understanding. This knowledge can only be input into a system through the interaction of an expert.
The objective of this technique is to find the mapping of human-oriented concepts to their realizations in the form of source code within a specific problem. Biggerstaff proposes a design recovery process in which the various modules of the program, key data items and software engineering artifacts are identified from source code [Biggerstaff, 1989]. The informal design abstractions and their relationship to source code are also pointed out. From this a library of reusable modules is built by interacting with a software engineer, who relates these reusable modules to fragments of source code. The design of an unknown system can then be recovered by finding similar mappings of source code to abstract concepts using the reuse library.

The DESIRE tool is used to aid maintenance and software engineers in better understanding a system and also in documenting and maintaining it. It consists of a knowledge based pattern recognizer and a Prolog-based inference engine. It has a built-in parser to take the source code as input and generate a parse tree. A plain text browser is also part of the system to show relationships between data items and files. Different views can be generated based upon different input queries, and Prolog is used for querying the source code. In [Biggerstaff, Mitbander, Webster, 1994] it is proposed that abstract concepts be associated with the source code. They propose finding the candidate concepts through the following:

Find the breakpoints set in the program and their associated comments, and map them onto the associated functions and global variables.

Generate a slice based on the functions and global variables found.

Identify a cluster of functions related together through their use of shared global variables or through shared control paths.

Use a knowledge base of the domain model.

DESIRE is mainly used for debugging or porting and for documentation or program understanding. The main problem encountered is the building of the knowledge base and relating the abstract concepts to pieces of source code.

1.4.7 Recognizing Program Plans

A programming plan, also termed a cliché, describes design elements in terms of common implementation patterns [Quilici, 1994]. Quilici describes an approach to recognizing program plans to enhance program understanding and ultimately identify modules and classes within a program in order to port C code to C++. The main motivation behind this approach is the empirical study of student programmers to see how they understand programs. The main elements of this approach are:

Convert the source program to an AST augmented with data and control flow.

Take a library of plan hierarchies. [Quilici, 1994] uses a library of plans which was developed to understand COBOL programs. A plan consists of two parts, namely a recognition rule, having the components of the plan and the constraints on them, and a plan definition, which has the attributes of the plan.

Map general program plans to particular program fragments of source code. The approach followed uses a hybrid top down, bottom up approach for identifying these plans. A top down approach determines what goals a program might achieve and identifies which plans would achieve these goals. It then tries to match program plans to pieces of source code. A bottom up approach starts at the level of program statements and tries to match them to program plans. This approach relies heavily on the plan library.
Determining whether a plan is present in a program or not is an NP-hard problem and computationally very demanding. [Quilici, Woods, Zhang, 2000] describe constraint-based approaches to solving this problem in an efficient manner. In [Deursen, Woods, Quilici, 2000] the plan recognition technique has been applied to solve the Y2K problem by using correct and incorrect date manipulation plans and matching them with fragments of source code.

1.4.8 Rigi

Rigi has been developed at the University of Victoria [Müller, Tilley, Orgun, Corrie, Madhavji, 1992], [Müller, Wong, Tilley, 1994]. It is a framework for program understanding designed to help software engineers gain an insight into the structure and design of a program. The Rigi tool takes as input the RSF (described in section 1.3.3.2 on page 8) of a program generated by a parser and displays the dependencies of the various components present in the system and their related attributes. Rigi is a visual tool that represents the source code as a directed weighted graph, where vertices are components and edges are dependencies (the resource dependency graph). A directed edge from a to b with a weight w indicates that a provides a set of syntactic modules to b. It is a semi-automatic reverse engineering tool in which the software engineer is able to define subsystems and write scripts to cluster software entities together based on their own criteria.

The reverse engineering process in Rigi is illustrated in Figure 7. It comprises the following:

Generate the RSF for the source code using the Rigi parser or any other suitable tool.

Compose and display the resource dependency graph using the graph editor.

Allow the user to cluster or group together various components of the system to construct sub-systems.

Compute the interfaces between sub-systems by analyzing / propagating the dependencies between them.

Evaluate the sub-system structures with respect to various software engineering principles, like low coupling and high cohesion.

Capture and display the relevant views of the system.

[Figure 7: The Reverse Engineering Process in Rigi — extract the RFG, compose subsystem hierarchies using the graph editor, compute interfaces among subsystems by analyzing / propagating dependencies, evaluate the reconstructed subsystem structure with respect to software engineering principles such as low coupling and high cohesion, and capture relevant views.]

1.4.9 Grok

Grok is a tool to aid software maintenance, developed by the SWAG Group at the University of Waterloo [Fahmy, Holt, Cordy, 2001]. It is based upon relational algebra and can be used to specify various types of architectural transformations. The input to Grok is RSF (Rigi Standard Format), a format for representing source code by means of resource flow graphs. The transformations specified are therefore applied to this graph and are equivalent to graph transformations. The transformation produced represents high level architectural information about the system, making it easier to understand. Fahmy et al. describe the application of Grok to two real world systems and its effectiveness in specifying various transformations such as lifting, hide interior, hide exterior, diagnosis etc. The main advantage of using Grok is that it efficiently processes large graphs. However, a limitation of relational algebra is that it cannot be used for generalized pattern matching: it cannot represent transformations in which the nodes and edges along a route have to be stored and together represent a pattern.
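Since Grok's transformations are relational-algebraic, their flavour can be conveyed with ordinary set operations. The sketch below shows the classic "lifting" transformation, raising function-level call facts to the module level by composing a containment relation with the call relation. The facts and the Python formulation are illustrative assumptions, not Grok's own scripting language.

    # Relations as sets of (source, target) pairs -- the style of fact base that
    # Grok manipulates with relational algebra. Facts are invented for illustration.
    call = {("f", "g"), ("f", "h"), ("g", "k")}
    contain = {("M1", "f"), ("M1", "g"), ("M2", "h"), ("M2", "k")}

    def compose(r, s):
        """Relational composition r o s: (a, c) such that (a, b) in r and (b, c) in s."""
        return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

    def inverse(r):
        return {(b, a) for (a, b) in r}

    # "Lift" function-level calls to module level: contain . call . contain^-1
    lifted = compose(compose(contain, call), inverse(contain))
    print(sorted(lifted))   # module-level call edges, e.g. ('M1', 'M2')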
1.4.10 Dali

Work on Dali has been carried out by the Carnegie Mellon Software Engineering Institute [Kazman, Carrière, 1998]. They have developed a workbench to generate various architectural views of a system to aid in software understanding, maintenance and re-use. Various tools like profilers, parsers and lexical analyzers are used to extract different views of the system. The view information is then stored in an SQL database, one table per relation. The advantage of using different tools is that both static and dynamic information can be captured. Static information can be acquired directly from source code. However, the entire information of a system cannot be determined at compile time, because of late bindings due to polymorphism or function pointers etc. Hence, tools like gprof can be used to capture runtime information and generate runtime specific views of the system.

After generating different views of the system and storing them in an SQL database, SQL queries are used to specify clustering patterns that reduce the complexity of the system. Each relation is defined as a table, and union and join queries can be specified to ‘fuse’ different views together. The views can be refined by using information from other views. These views are then imported into Rigi. Rigi is used as a visual tool to model, analyze and manipulate the different views of the system, and requires the input of an expert already familiar with the system. Different views are combined together and further refined to give a better view of the system. For the sake of analysis, Dali applies an export/import model where the Rigi model is exported to another tool; after being analyzed, the resulting model is imported back into Rigi, hence generating an overall picture of the system.

The main advantages of Dali are that it uses SQL queries for defining clustering patterns and that the workbench integrates various tools for architectural understanding. However, the patterns have to be written by an expert familiar with the system, and there are no analytic capabilities in Dali to generate them automatically. Also, the clustering patterns are specific to a certain system and in most cases cannot be reused.
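As an illustration of the kind of SQL clustering pattern described above, the following sketch stores two invented relations in an in-memory SQLite database (one table per relation, in the spirit of Dali) and queries for calls that cross candidate subsystem boundaries. The table layouts, file names and facts are assumptions made purely for the example, not Dali's actual schema.

    import sqlite3

    # One table per extracted relation; facts are invented for illustration.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE defines (file TEXT, function TEXT);
        INSERT INTO defines VALUES
            ('d_arc.c', 'arc_drawing_selected'),
            ('d_arc.c', 'create_arcobject'),
            ('d_box.c', 'box_drawing_selected');
        CREATE TABLE calls (caller TEXT, callee TEXT);
        INSERT INTO calls VALUES
            ('arc_drawing_selected', 'create_arcobject'),
            ('box_drawing_selected', 'create_arcobject');
    """)

    # A clustering pattern: treat the defining file as a candidate subsystem
    # and count the calls that cross subsystem boundaries.
    query = """
        SELECT d1.file, d2.file, COUNT(*) AS cross_calls
        FROM calls c
        JOIN defines d1 ON c.caller = d1.function
        JOIN defines d2 ON c.callee = d2.function
        WHERE d1.file <> d2.file
        GROUP BY d1.file, d2.file
    """
    for row in db.execute(query):
        print(row)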
The teams had varying opinions on the structure of the program and the quality of the source code. In terms of quality, PBS said the subsystems exhibited low cohesion and high coupling, while Lemma said they exhibited low coupling and high cohesion. The PBS team observed that the original code had eroded over time, whereas Rigi's interpretation was that the design had improved over the past releases. The teams also reported varying results on the number of GOTO statements contained in the source code. However, despite the varying results on the reverse engineering tasks, the teams took the same approach to the maintenance tasks. The main problem encountered by all the tools was in the parsing stage: all of the teams had difficulty performing the initial phase of the reverse engineering task, i.e. parsing.

In the structured demonstration the tools were placed into three categories, namely visualization (PBS and Rigi), advanced search (TkSee and Lemma) and code creation (Visual Age and UNIX tools). It was interesting to note that tools from the same category produced similar results. It was concluded that visualization tools do better on reverse engineering tasks while search tools are better suited to maintenance tasks; visualization and searching tools complement each other. The workshop also concluded with a few lessons for tool users. When selecting a tool for program comprehension it is necessary to know in advance the purpose of using the tool. A visualization tool is better suited for large-scale maintenance and reverse engineering problems where an understanding of the architectural design is required, whereas a search tool is more suitable for day to day maintenance tasks. Also, a new tool has a greater chance of being accepted if it complements existing tools and works in collaboration with them.

1.5.2 SORTIE
SORTIE is a legacy code used by forestry researchers for research in understanding forest dynamics [Website SORTIE]; it also provides a test environment for forest management decisions. The software is less than 28K lines of code, including comments and source listings. Originally this program was written in C but was later ported to C++ using the Borland IDE. Various teams working on reverse engineering tools were invited to analyze the existing structure of this program and suggest a new architecture based upon the user requirements and the extracted structure. One of the goals of this project was also to exchange data between the participating tools at various stages, using the output of one stage as the input to the next. GXL was defined as the standard for exchanging data between the different tools.
Participating tools for the SORTIE project were Rigi, cppx, TkSee, Bauhaus, SGG P.U.R.E, Columbus, KLOCwork Suite, VIBRO and PBS. Almost all of these teams could not go on to the re-architecture phase because of the difficulty in parsing the source code of SORTIE. Nearly all of them concluded that re-engineering the existing source code was unrealistic because of the poor design of the system. All the teams suggested re-writing the entire software from scratch, keeping the user requirements in perspective.

1.5.3 SQL/DS
SQL/DS is a multi-million line relational database management system that has evolved since 1976. It was originally written in PL/I to run on VM and is over 3,000,000 lines of code. The analysis of this system formed the CAS program understanding project [Buss et al., 1994], with participation from six research groups: the IBM Software Solutions Toronto Laboratory Centre for Advanced Studies, the National Research Council of Canada, McGill University, the University of Michigan, the University of Toronto and the University of Victoria. The goals of this project were:
Code correctness
Performance enhancement
Detecting uninitialized data
Pointer errors and memory leaks
Detecting data type mismatches
Finding incomplete uses of record fields
Finding similar code fragments
Localizing algorithm plans
Recognizing high complexity or inefficient code
Predicting the impact of change

The IBM team used REFINE to parse the SQL/DS source code into a form suitable for analysis and applied a defect filtering technique; a toolkit was built on top of REFINE for this purpose. They also performed a design quality metrics analysis to predict the product's quality and found that defects caused by design errors accounted for 43% of the total product defects.

The University of Victoria used Rigi as a tool for performing structural re-documentation. Rigi was used to parse the source code and produce a resource-flow graph of the software. Subsystem composition techniques were then combined with human pattern matching skills to manage the complexity of the system. Rigi has a graph editor from which diagrams of software structures such as call graphs, module interconnection graphs and inclusion dependencies were produced.

The University of Toronto performed textual analysis of the code using pattern matching techniques. Their method is based on fingerprinting an appropriate subset of sub-strings in the source code. Using this technique a number of redundancies in the code were detected and instances of 'cut and copy' of code were found.

The University of Michigan has developed the SCRUPLE software for searching source code. SCRUPLE is based upon a pattern-based query language.

McGill used REFINE for converting the program to an object-oriented annotated abstract syntax tree; the tools GRASP and PROUST were used for this purpose. Several problems were encountered in recognizing programming plans. One was that pattern matching techniques require precise recognition, whereas the source code may contain redundant or irrelevant code. Pattern matching schemes based on graph transformations are expensive and can have high complexity. This team also detected 'clones' in the source code using five similarity metrics.

1.6 Tools for Reverse Engineering
The following is a list (not exhaustive) of existing tools for reverse engineering:
Rigi : University of Victoria, Canada
Cppx : SWAG Group, University of Waterloo, Canada
TkSee : KBRE Group, University of Ottawa
Bauhaus : University of Stuttgart, Germany
SGG P.U.R.E : University of Berne, Switzerland
Columbus/CAN : Research Group on Artificial Intelligence, Hungarian Academy of Sciences, University of Szeged
KLOCwork Suite : KLOCwork Solutions Corporation
VIBRO : Visualisation Research Group, University of Durham, UK
PBS : SWAG Group, University of Waterloo, Canada
REFINE : Reasoning Inc
SCRUPLE : University of Michigan
Lemma : IBM
Visual Age C++ : IBM Toronto Lab

PART II : SOFTWARE CLUSTERING

2 Software Clustering
In this section of the report a detailed overview of clustering and its applications to software reverse engineering is presented. We start by giving a general overview of clustering techniques, algorithms and methods. The application of clustering to grouping together similar software entities is then described.
A theoretical framework for clustering software artifacts is built up, describing different hypothetical situations in which software clustering can be applied. To evaluate the suitability of software clustering methods, we selected an open source project and applied clustering to this system from the point of view of re-modularization. We describe these experiments in detail and present our results.

2.1 Clustering : An Overview
To ease software maintenance efforts, it is essential to understand the existing software system. Realizing this has led to research on techniques for analyzing code and for program comprehension. Although analyzing the code has a number of benefits, this analysis leads to understanding the system "in-the-small". To gain an understanding of the system "in-the-large", we require techniques for uncovering the overall software architecture, i.e. we need to understand the various components within the system and their interactions with one another. This architectural/structural view is of immense value when changes are made to existing components or when new components are added.

The activity of grouping together similar entities is not a new one, nor is it tied to the computer science field alone. The grouping of entities is described by the term "clustering" and is relevant in varied disciplines such as biology, archaeology, economics, psychology and geography. Clustering methods aim at discovering/extracting some sort of structure based on the relationships between entities. Given the large amount of data associated with every entity, and the complexity of the relationships between them, it is natural that different methods come up with different clusterings. Irrespective of the discipline in which clustering is applied, the following generic steps can be identified in any clustering activity:
Entity identification and feature selection
Definition of a similarity measure
Clustering
Assessment

2.1.1 Entity identification and feature selection
The individual characteristics used to identify an entity are referred to as features and may be qualitative or quantitative [Gowda, Diday, 1992]. These features are grouped together into a feature vector, sometimes called a pattern. A large number of features may be used to describe an entity, but it is useful to utilize only those features that are descriptive and discriminatory [Murty, Jain, Flynn, 1999].

2.1.2 Definition of a similarity measure
Clusters are defined on the basis of the proximity of two patterns; therefore it is necessary to devise/select a measure of similarity. [Wiggerts, 97] describes the following categories of similarity measures:
Association coefficients
Distance measures
Correlation measures
Probabilistic measures

2.1.2.1 Association coefficients
Association coefficients take the presence or absence of a feature into account. The features are thus assumed to be binary, with 1 denoting presence and 0 denoting absence. The following table is constructed to find association coefficients between entity i and entity j:

                      Entity i
                      1        0
Entity j     1        a        c
             0        b        d

Table 3: Table of Similarities

In the table, a represents the number of features for which both entities have the value 1, b represents the number of features present in entity i but absent in entity j, c represents the number of features present in entity j but absent in entity i, and d represents the number of features absent in both entities.
Some frequently used similarity measures Sim(i, j) between entity i and entity j are:

Simple Coefficient: Sim(i, j) = (a + d) / (a + b + c + d)
Jaccard Coefficient: Sim(i, j) = a / (a + b + c)
Sørensen-Dice Coefficient: Sim(i, j) = 2a / (2a + b + c)

As can be seen, the coefficients differ in two ways [Wiggerts, 97]:
The contribution of the 0-0 matches. The Jaccard and Sørensen-Dice coefficients do not take the 0-0 matches into account.
The weight of the matches and mismatches. The Sørensen-Dice coefficient assigns double weight to the 1-1 matches. In the simple coefficient, matches and mismatches are given equal weight.

Hence, the Jaccard and Sørensen-Dice coefficients take into account only the presence of features, while the simple coefficient also takes into account the absence of features. For this metric, similarity would be high even when the features common to both entities are absent ones.

2.1.2.2 Distance measures
The most popular distance measure is the Euclidean distance, which evaluates the proximity of entities in space. The larger the distance, the greater the dissimilarity of the entities; the measure is zero if and only if the entities have the same score on all the features. Two of the more popular distance measures are the Euclidean distance and the Camberra metric, given by:

Euclidean Distance: D(X, Y) = sqrt( sum over i = 1..n of (x_i - y_i)^2 )

For binary features, where the feature vector comprises only zeros and ones, this metric can be reduced to a term consisting of a, b, c and d, defined in Table 3. This form can be computed using the table below:

x_i                     0    1    0    1
y_i                     0    0    1    1
sum of (x_i - y_i)^2    0    b    c    0

Table 4 : Computation of Euclidean Distance for Binary Features

From the above table it is clear that the distance measure reduces to:

D(X, Y) = sqrt(b + c)

Camberra Distance Measure: C(X, Y) = sum over i = 1..n of |x_i - y_i| / (x_i + y_i)

To calculate the Camberra distance measure in terms of a, b, c and d we construct the following table:

x_i                                 0    1    0    1
y_i                                 0    0    1    1
sum of |x_i - y_i| / (x_i + y_i)    0    b    c    0

Table 5 : Computation of Camberra Distance Metric for Binary Features

In the above table it is assumed that the term is zero when both x_i and y_i are zero. This assumption is made because the distance between two NULL vectors would be taken as zero. Hence, we can conclude that the Camberra distance metric is given by:

C(X, Y) = b + c

where b and c are defined in Table 3. Clearly, the Camberra metric and the Euclidean metric give similar results for binary features. Both metrics take into account the total number of mismatches of zeros and ones between two entities when calculating the distance between them: the higher the mismatch, the greater the distance between the two entities.

2.1.2.3 Correlation Coefficients
Correlation coefficients may be used to correlate features. The most popular coefficient is the Pearson product moment correlation coefficient, whose value lies between -1 and 1, with 0 denoting no correlation. A negative correlation denotes high dissimilarity and a positive correlation is taken as high similarity. The correlation measure is given by:

r = sum of (x_i - mean(x))(y_i - mean(y)) / sqrt( sum of (x_i - mean(x))^2 * sum of (y_i - mean(y))^2 )
  = ( sum of xy - (sum of x)(sum of y)/n ) / sqrt( ( sum of x^2 - (sum of x)^2/n ) ( sum of y^2 - (sum of y)^2/n ) )

where each summation is taken from 1 to n and n = a + b + c + d.
When using only binary features, the correlation measure reduces to terms of a, b, c and d, defined in Table 3, using the following table:

x_i             0    1    0    1
y_i             0    0    1    1
sum of xy       0    0    0    a
sum of x        0    b    0    a
sum of y        0    0    c    a
sum of x^2      0    b    0    a
sum of y^2      0    0    c    a

Table 6 : Computation of Correlation for Binary Features

Hence the correlation metric reduces to the following:

r = ( a - (a + b)(a + c)/(a + b + c + d) ) / sqrt( ( (a + b) - (a + b)^2/(a + b + c + d) ) ( (a + c) - (a + c)^2/(a + b + c + d) ) )
  = ( ad - bc ) / sqrt( (a + b)(c + d)(a + c)(b + d) )          (1)

Comparison with Association Coefficients
From (1) it can be seen that both the correlation coefficient and the Jaccard coefficient equal one, indicating maximum similarity, when b and c are zero and a is non-zero. Also, if d is very large compared to a, b and c, then the correlation coefficient reduces to the following expression:

r ≈ a / sqrt( (a + b)(a + c) )          for d >> a, d >> b and d >> c

The above form involves terms only in a, b and c, and we can see that this measure is now very similar to the Jaccard coefficient. Hence, if two entities have very sparse feature vectors, their correlation coefficient behaves in a very similar way to the Jaccard coefficient. This result is important when considering software artifacts for clustering purposes as, generally, the feature vectors obtained are very sparse.

2.1.2.4 Probabilistic Measures
These measures emphasize the idea that agreement on rare features is more important than agreement on frequently encountered features. Such measures take into account the distribution of the frequencies of features over the set of entities. According to Sneath these measures are close to the correlation coefficient [Sneath, Sokal, 1973].
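For concreteness, the following sketch (our own illustration, not taken from any of the cited papers) writes the binary-feature measures discussed above directly in terms of the a, b, c, d counts of Table 3. The final two lines show the observation made above for sparse vectors: the correlation coefficient is already close to its limiting form a / sqrt((a+b)(a+c)).

import math

def abcd(x, y):
    """The a, b, c, d counts of Table 3 for two binary feature vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def simple(x, y):
    a, b, c, d = abcd(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, _ = abcd(x, y)
    return a / (a + b + c)

def sorensen_dice(x, y):
    a, b, c, _ = abcd(x, y)
    return 2 * a / (2 * a + b + c)

def euclidean(x, y):                      # reduces to sqrt(b + c) for binary features
    _, b, c, _ = abcd(x, y)
    return math.sqrt(b + c)

def camberra(x, y):                       # reduces to b + c for binary features
    _, b, c, _ = abcd(x, y)
    return b + c

def correlation(x, y):                    # (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)), equation (1)
    a, b, c, d = abcd(x, y)
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

x = [1, 1, 0, 0] + [0] * 100              # sparse vectors: d is much larger than a, b, c
y = [1, 0, 1, 0] + [0] * 100
print(jaccard(x, y), correlation(x, y))   # about 0.33 and about 0.49 (close to 1/sqrt(4) = 0.5)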
2.1.3 Clustering
The clustering step performs the actual grouping based on the similarity measures that were determined. A number of clustering algorithms are available and may be chosen based on their suitability. It may be worthwhile to keep in mind that:
One clustering technique cannot be expected to uncover/extract all clusters, owing to the variety of domains and data sets and the complexity of the relationships between them.
It is unreasonable to expect that there will be one ideal structure. It is possible, indeed probable, that more than one structure exists that is equally valid.
Algorithms may impose their own structure on the set of data.

Algorithms to perform clustering are broadly divided into two categories, namely:
1. Hierarchical
2. Partitional

2.1.3.1 Hierarchical
These algorithms follow a top-down or a bottom-up approach, more formally known as the divisive and agglomerative approaches respectively.

Divisive
In this approach, we start with one cluster containing all entities and divide a cluster into two at each successive step. These algorithms are not widely used, due to the complexity associated with computing the number of possible divisions at every step [Tzerpos, Holt, 1998]. Figure 8 (b) illustrates the first step in a divisive algorithm; initially all entities are in one cluster.

Figure 8 : Divisive Clustering Algorithm ((a) one cluster, (b) after the first iteration)

Agglomerative
In this approach we start with the individual entities and group them together to finally obtain one structure. Each entity is initially regarded as a singleton cluster. In the first step, the two most similar singleton clusters are joined together. The situation is as shown in Figure 9.

Figure 9 : Agglomerative Clustering ((a) singleton clusters, (b) after the first iteration)

At step two, we must determine whether two singleton clusters are most similar, or whether the newly formed cluster and some singleton cluster are most similar. In order to do this, the similarity (or the distance) between the newly formed cluster and each of the existing clusters must be updated. There are different algorithms to deal with this issue [Davey, Burd, 2000]:

Single Linkage (Nearest Neighbor Approach): Single Link (A, B U C) = Max (similarity (A, B), similarity (A, C))
Complete Linkage (Furthest Neighbor Approach): Complete Link (A, B U C) = Min (similarity (A, B), similarity (A, C))
Weighted Average Linkage: Weighted Link (A, B U C) = 1/2 similarity (A, B) + 1/2 similarity (A, C)
Unweighted Average Linkage: Unweighted Link (A, B U C) = ( similarity (A, B) * size(B) + similarity (A, C) * size(C) ) / ( size(B) + size(C) )

The single linkage rule produces clusters that are isolated but may be non-compact. To illustrate, consider five entities A, B, C, D and E (Figure 10), with the distances between them given by the table below; the greater the distance, the lesser the similarity.

Figure 10 : Initial Clusters

      A     B     C     D     E
A     -
B     2     -
C     8     5     -
D     13    11    6     -
E     14    12    7     3     -

According to the single linkage rule, the first two entities to be clustered together will be A and B, and the new distance between AUB and C will be given by 5 (the minimum of 8 and 5). This is shown in Figure 11.

Figure 11 : After Step 1

At step 2, D and E will be grouped together, with the new distance between DUE and C given as 6 (the minimum of 6 and 7). Since the distance between C and AUB is less than the distance between C and DUE, at step 3 we get the clusters illustrated in Figure 12.

Figure 12 : Clusters Formed at Step 3 Using Single Linkage Algorithm

On the other hand, if complete linkage were used, the situation at step 3 would be as depicted in Figure 13.

Figure 13 : Clusters Formed Using Complete Linkage Algorithm

The reason for the difference is that when the new distances are computed at step 1, the distance between AUB and C is given as 8 (the maximum of 8 and 5). At step 2, the distance between DUE and C is given by 7 (the maximum of 6 and 7). As can be seen from the figure, complete linkage produces compact clusters, i.e. in order for an entity to join a cluster it must be similar to every existing entity within the cluster, as opposed to the single linkage approach, where the entity may be similar to only one entity within the cluster [Wiggerts, 97].
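To make the update rules concrete, here is a minimal sketch (our own, not taken from the cited work) of agglomerative clustering over a distance matrix. Because the worked example above is stated in terms of distances, single linkage keeps the smaller of the two distances and complete linkage the larger; the five-entity distances reproduce the A to E example.

import itertools

def agglomerate(entities, pair_dist, linkage="single"):
    """entities: list of labels; pair_dist: {('A', 'B'): 2, ...}; returns the merge order."""
    clusters = [frozenset([e]) for e in entities]
    dist = {frozenset([frozenset([i]), frozenset([j])]): v
            for (i, j), v in pair_dist.items()}
    merges = []
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda p: dist[frozenset(p)])
        merged = a | b
        for c in clusters:
            if c in (a, b):
                continue
            da, db = dist[frozenset([a, c])], dist[frozenset([b, c])]
            if linkage == "single":        # nearest neighbour: keep the smaller distance
                new = min(da, db)
            elif linkage == "complete":    # furthest neighbour: keep the larger distance
                new = max(da, db)
            elif linkage == "weighted":    # weighted average linkage
                new = (da + db) / 2
            else:                          # unweighted average linkage (size-weighted)
                new = (da * len(a) + db * len(b)) / (len(a) + len(b))
            dist[frozenset([merged, c])] = new
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append((set(a), set(b)))
    return merges

d = {('A', 'B'): 2, ('A', 'C'): 8, ('B', 'C'): 5, ('A', 'D'): 13, ('B', 'D'): 11,
     ('C', 'D'): 6, ('A', 'E'): 14, ('B', 'E'): 12, ('C', 'E'): 7, ('D', 'E'): 3}
print(agglomerate(list("ABCDE"), d, "single"))    # A,B then D,E then C joins A,B
print(agglomerate(list("ABCDE"), d, "complete"))  # A,B then D,E then C joins D,E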
2.1.3.2 Partitional Algorithms
These algorithms start with some initial partition and modify the partition at every step in such a way that some criterion is optimized [Tzerpos, Holt, 1998]. Two popular approaches are:
Squared error algorithms : The first step is to choose some initial partition into clusters. At every step, each entity is assigned to the cluster whose center is closest to it. The new cluster centers are re-computed and the assignment of entities is repeated till the clusters become stable.
Graph theoretic algorithms : The problem is represented as a graph and clusters are identified as subgraphs with certain properties, e.g. we may construct a minimal spanning tree and then delete the edges with maximum weights to obtain clusters.

2.1.4 Assessment
It is important to assess and validate the structure of the system obtained after clustering. There are typically three kinds of validation studies [Murty, Jain, Flynn, 1999]:
External Assessment : This assessment compares the structure obtained with some already available (a priori) structure.
Internal Assessment : This assessment is intrinsic, i.e. it tries to determine whether the structure is intrinsically appropriate for the data.
Relative Assessment : This assessment compares two structures relative to one another.

2.2 Clustering Software
The application of cluster analysis to software artifacts is a comparatively new area of research in which many issues have not yet been addressed. The major goal of software clustering is to automate the process of discovering high level abstract subsystems within the source code. Such subsystems can be used for software visualization, re-modularization, architecture discovery etc. The higher level subsystems also help maintenance engineers gain a better understanding of the source code and make the necessary changes. Lung points out that software clustering can be applied to software at various stages of the life cycle [Lung, 1998]: it can support software architecture partitioning during the forward engineering process and aid in recovering the software architecture during the reverse engineering process. According to Tzerpos and Holt [Tzerpos, Holt, 1998], it would benefit the software community to use the wide variety of clustering techniques already available, rather than re-inventing them, to derive a software system's architecture. They find this a good idea because:
With software, we have a fairly good idea of what a cluster should look like. Software engineering principles of cohesion, coupling and information hiding can guide the clustering process.
A software system can have more than one valid view, which is what different clustering algorithms can give us.
Even if a clustering algorithm "imposes" a structure, with legacy software this is not a problem. The point of view is that any structure is better than no structure.

The process of clustering together similar objects depends upon the nature and size of the data. It also depends upon the similarity measures and clustering algorithms being used. Clustering itself can impose a structure on the data it is applied to. When applying a clustering technique, one has to keep in mind the nature of the data and use the parameters suited to that type of data. When clustering together software entities with the aim of obtaining a better re-modularization, we look for high cohesion and low coupling [Davey, Burd, 2000]. For example, if functions that access the same data types or variables are placed into the same module, the module would tend to have communicational cohesion. The above argument brings out the need to tailor the clustering algorithm to the type of software being clustered. Researchers have pointed out that different clustering techniques behave differently when applied to different types of software systems [Davey, Burd, 2000], [Anquetil, Lethbridge, 1999]. A technique might impose an artificial structure on the existing system instead of bringing out the natural one.

In this report we introduce a new clustering algorithm, called the 'combined' algorithm, which is more intuitive for grouping together software entities. It is appropriate to situations where there are binary feature vectors and both the presence and the absence of a feature are taken into account when grouping objects or entities together. When clustering software, the steps carried out are the same as for any other artifact. The first step is to identify the entities in the software system. Files and functions are the most popular candidates, and researchers have used both as entities.
The reasons for using functions rather than files (which are typically used for large systems) as entities have been listed by [Davey, Burd, 2000]:
Clustering of functions is intuitive
Functions reflect the functionality of the system more clearly

The next stage is to select the features from which similarity measures will be derived. [Anquetil, Lethbridge, 1999] divide features into formal and non-formal descriptive features. They call a feature formal if it consists of information that has a direct impact on, or is a direct consequence of, the software system's behavior. They have identified some formal features for software:
Rout : Routines called by the entity
Var : Global variables referred to by the entity
Type : User defined types referred to by the entity
Macros : Macros used by the entity
File : Files included by the entity

The source for these formal features is the code of the software. Non-formal features use information that has no direct influence on the system's behavior. Two non-formal features used by Anquetil and Lethbridge are:
Ident : References to words in the identifiers declared or used in the entity
Cmt : References to words in the comments describing the entity

Any of the similarity measures identified in section 2.1.2 may be used for software. In the case of binary features, the formulas may be simplified considerably. For clustering, experiments have been carried out using both hierarchical and partitional algorithms.

2.2.1.1 Observations Regarding Association, Correlation and Distance Metrics
From the discussion of the association, correlation and distance metrics we can make the following observations:
The Jaccard and Sørensen-Dice metrics perform almost identically. This has also been pointed out by Davey and Burd [Davey, Burd, 2000].
The simple metric does not seem very intuitive for software clustering, as it takes the absence of features into account.
The correlation metric is equivalent to the Jaccard metric when the number of absent features is very large compared to the count of present or mismatched features. This is usually the case.
For clustering, the Camberra distance metric and the Euclidean distance perform equivalently: if b and c are high, similarity is low, the Jaccard coefficient is low, and both the Camberra and Euclidean distances (the inverse of similarity) are high.
However, when dealing with software, the Camberra and Euclidean distances are not appropriate for the case when a, b and c are all zero. In this case the distance is zero, indicating high similarity, which is misleading. For example, two functions like sort and swap may not access any common features, so the distance between them would be zero. The distance metrics in this case indicate high similarity, which does not seem intuitively correct.

2.2.2 Assessment of Results
For the assessment of results, external assessment is often used. The groups of objects made during the clustering process are known as a partition. When agglomerative clustering is used, a partition is made during each iteration of the algorithm. To evaluate the quality of these partitions, the partitions made by the clustering algorithm are compared with a reference system, also called the expert decomposition. Generally the expert decomposition is obtained with directories being used as clusters when files are used as entities, or the files themselves being considered as clusters when functions are used as entities.
However, the expert decomposition is best made by the designer or developer of the system, as they have the best knowledge of the system under consideration. The clustering produced by the clustering algorithm is also termed the test clustering. The following external measures have been defined:

2.2.3 Use of Precision and Recall
[Anquetil, Lethbridge, 1999] and [Davey, Burd, 2000] use precision and recall as measures for assessing the quality of partitions. Precision and recall are defined using the concept of intra pairs and inter pairs. Intra pairs are pairs of entities in the same cluster, whereas inter pairs are pairs of entities in two different clusters. Precision and recall are defined by considering all the intra pairs in the expert decomposition and the test clustering:
Precision : the percentage of intra pairs in the test clustering that are also in the expert decomposition.
Recall : the percentage of intra pairs in the expert decomposition that are also in the test clustering.

Mathematically, if A denotes the intra pairs of the test clustering and B denotes the intra pairs of the expert partition, then the precision p and recall r are given by:

p = |A ∩ B| / |A|
r = |A ∩ B| / |B|

Precision and recall are measures that are often used for evaluating the quality of results in information retrieval systems [Kontogiannis, 97]. Recall measures the percentage of the expert decomposition's intra pairs that are recovered by the test clustering; ideally a recall of 100% is required. Precision, on the other hand, measures the noise in the test clustering. From the above definitions it is clear that there is a tradeoff between precision and recall. It is desirable that both precision and recall be high for a test clustering; however, this is generally not the case. When agglomerative clustering starts, all clusters are singletons and the system has zero recall and 100% precision. As the clustering proceeds the precision decreases and the recall increases. When the entire system is one big cluster, the recall is 100% but the precision is very low.

In the ideal case, where the algorithm makes cluster partitions that agree exactly with the expert decomposition, precision would stay constant at 100% and recall would rise to 100% until the total number of intra pairs in the test clustering equals the total number of intra pairs in the expert partition. At this point the precision and recall graphs would cross over. After that, precision would start decreasing while recall would remain constant at 100%. An example of this process is shown in Figure 14.

Figure 14 : Precision and Recall Graph for an Ideal Clustering Algorithm (percentage vs. iteration)
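A short sketch of these two measures over intra-cluster pairs follows; it is our own illustration, and the entity names f1 to f5 and the two decompositions are hypothetical.

from itertools import combinations

def intra_pairs(clustering):
    """All unordered pairs of entities that share a cluster."""
    return {frozenset(p) for cluster in clustering
            for p in combinations(sorted(cluster), 2)}

def precision_recall(test, expert):
    a, b = intra_pairs(test), intra_pairs(expert)
    common = a & b
    precision = 100 * len(common) / len(a) if a else 100.0   # all-singleton clustering
    recall = 100 * len(common) / len(b) if b else 100.0
    return precision, recall

expert = [{"f1", "f2", "f3"}, {"f4", "f5"}]    # hypothetical expert decomposition
test = [{"f1", "f2"}, {"f3", "f4", "f5"}]      # hypothetical test clustering
print(precision_recall(test, expert))          # (50.0, 50.0)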
2.3 Our Approach
Having established clustering to be a viable approach for finding structure within a software system, we chose the Xfig utility source code for applying clustering techniques. The experiments are detailed in the later part of this report.

2.3.1 Entity identification and feature selection
The first step was to identify the entities in the software system. In this report, we use the functions in the code as entities. The next stage is to select the features from which similarity measures will be derived. We used the Xfig files parsed into the Rigi Standard Format [Müller, Tilley, Orgun, Corrie, Madhavji, 1992]. Based on the available data, we used the following features:
Call : Functions called by the entity
Global : Global variables referred to by the entity
Type : User defined types referred to by the entity

To compute similarity measures, we used the method of association coefficients, which takes the presence or absence of a feature into account. The data about calls, globals and types is placed into a matrix in the following manner:

              Calls          Globals        Type
              C1     C2      G1     G2      T1
Function1     1      0       1      1       0
Function2     1      1       0      1       0

For functions 1 and 2, the values for a, b, c and d will be:
a = 2 (number of features present in both 1 and 2)
b = 1 (number of features present in 1 but absent in 2)
c = 1 (number of features present in 2 but absent in 1)
d = 1 (number of features absent in both 1 and 2)

2.3.2 Definition of a similarity measure
We have used the following similarity measures in our experiments:
Simple (i, j) = (a + d) / (a + b + c + d). For functions 1 and 2, the Simple measure gives the result 3/5 = 0.6.
Jaccard (i, j) = a / (a + b + c). For functions 1 and 2, the Jaccard measure gives the result 2/4 = 0.5.
Sørensen-Dice (i, j) = 2a / (2a + b + c). For functions 1 and 2, the Sørensen-Dice measure gives the result 4/6 = 0.67.

2.3.3 Clustering and the Combined Algorithm
We used the agglomerative hierarchical clustering algorithm to arrive at clusters. The techniques used in our experiments included single linkage, complete linkage, weighted linkage and unweighted linkage. In addition, we define a new algorithm, called the combined algorithm, for clustering. When two clusters are merged, the similarity or distance between the newly formed cluster and the rest of the entities in the system has to be re-calculated; ordinarily this new distance is calculated from the similarity values between the two merged entities and the rest of the entities. For binary features we define this new distance using the 'combined' algorithm as follows. Suppose the two entities i and j that are merged have corresponding binary feature vectors vi and vj. The new feature vector vk associated with the merged cluster is obtained by taking the logical OR of the two feature vectors:

vk = vi OR vj

The new distance between the newly formed cluster and the remaining entities or clusters in the system is then calculated from this new feature vector vk and the feature vectors of the remaining entities and clusters. An example clarifies the combined algorithm. Suppose there are three entities A, B and C in the system, whose feature vectors are:

Entity    Feature Vector
A         {1 1 0 0 0 1}
B         {1 0 1 0 0 1}
C         {1 0 0 0 0 0}

The corresponding similarity matrix using the Jaccard coefficient is:

      A      B      C
A     -
B     1/2    -
C     1/3    1/3    -

If the Jaccard coefficient is used, A and B are the most similar and hence are merged together into AB, and the new raw data matrix becomes:

Entity    Feature Vector
AB        {1 1 1 0 0 1}
C         {1 0 0 0 0 0}

The new similarity measure between AB and C is re-calculated from this raw data matrix and is equal to 1/4. We argue that the combined clustering algorithm is very suitable for clustering software artifacts, as the feature vector usually denotes references to functions, global variables or types. When two entities are merged, the new entity accesses the features of both, and hence a new feature vector, obtained by taking the OR of the older feature vectors, is associated with it.
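The sketch below (our own illustration) shows the combined update on the A, B, C example above: when two clusters merge, their binary feature vectors are ORed and similarities are recomputed from the merged vector.

def jaccard(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    mismatches = sum(1 for xi, yi in zip(x, y) if xi != yi)
    return a / (a + mismatches) if a + mismatches else 0.0

def combine(x, y):
    """Feature vector of the merged cluster: element-wise OR."""
    return [xi | yi for xi, yi in zip(x, y)]

vectors = {"A": [1, 1, 0, 0, 0, 1], "B": [1, 0, 1, 0, 0, 1], "C": [1, 0, 0, 0, 0, 0]}
print(jaccard(vectors["A"], vectors["B"]))      # 0.5 -> A and B merge first
vectors["AB"] = combine(vectors.pop("A"), vectors.pop("B"))
print(jaccard(vectors["AB"], vectors["C"]))     # 0.25, the re-computed similarity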
In short, the following steps are involved in a clustering process:
Step 1 : Compute the Raw Data Matrix : Get the feature vector for each module or entity. The matrix of feature vectors forms the raw data matrix. The rows denote entities and the columns represent features.
Step 2 : Compute the Similarity Matrix : The similarity matrix represents the similarity between each pair of modules or components. It is a square, symmetric matrix, so only the lower triangular part is required for computations. The values in the similarity matrix are determined by the association coefficient, distance coefficient or correlation coefficient being used.
Step 3 : Apply the Clustering Algorithm : Find the two entities with the maximum similarity. Merge the two entities into a new cluster and calculate the new cluster's distance from the other entities using one of the single linkage, complete linkage, weighted average or unweighted average algorithms, or the combined algorithm. Repeat merging entities till only one cluster remains.

2.4 Methods Used in the Past
Some of the earliest work on software clustering was done by Schwanke and Platoff [Schwanke, Platoff, 1989]. They introduced the idea of the "shared neighbors" technique for grouping together software entities to achieve low coupling and high cohesion. They argue that when grouping similar entities, more importance should be given to shared neighbors than to connection strengths in a resource flow graph. They used a hierarchical ascending classification method, similar to agglomerative clustering, to group entities based on features such as variables, data types, macros and procedures. They also introduced the idea of "Maverick analysis", used to refine a partition by identifying entities (called 'Mavericks') that were placed in the wrong partition [Schwanke, 91]. They developed a tool by the name of ARCH for automatic clustering.

Anquetil and Lethbridge conducted detailed clustering experiments using different open source projects and also a legacy system [Anquetil, Lethbridge, 1999]. They compared various hierarchical and non-hierarchical clustering algorithms. They treated an entire source code file as an entity, and various formal and non-formal features associated with the files were used for clustering. They introduced two methods for the evaluation of clustering results, namely expert criteria and design criteria. The expert criterion is based upon precision and recall. The design criterion, on the other hand, is based upon metrics used to measure coupling and cohesion.

Kontogiannis conducted interesting work on finding similar software entities [Kontogiannis, 97]. The goal was not to re-modularize software entities but to detect programming patterns within source code files. This work is relevant to software clustering as it shows that various software design metrics, such as structural complexity, data complexity, McCabe complexity, the Albrecht metric and the Kafura metric, computed for a certain part of the source code, can be used as its fingerprint. Such a feature vector can be associated with various parts of the source code and used to detect patterns within it.

Davey and Burd also conducted several detailed experiments to compare various features, similarity metrics and algorithms to be used for clustering [Davey, Burd, 2000]. They used three different parts of legacy C source code to conduct their experiments.
The experiments they conducted were only on small systems (a maximum of 140 functions). However, their work demonstrates the suitability of software clustering methods for grouping together software artifacts.

Lung uses clustering for recovering software architecture and for re-modularization [Lung, 1998]. A design metric, namely data complexity, is used to compare the performance of the clustering algorithm on the old system and the newly modularized system. The work shows that clustering software based on common features reduces the data complexity of the system and leads to a better architecture.

Tzerpos and Holt use ACDC (Algorithm for Comprehension Driven Clustering) to group together software artifacts [Tzerpos, Holt, 2000]. Their main aim was to discover subsystems to ease the understanding of the software system. The algorithm proceeds in two steps. In the first step, a possible skeleton decomposition of the system is discovered using a pattern driven approach. Various possible subsystem patterns are defined, namely the source file pattern, directory structure pattern, body-header pattern, leaf collection pattern, support library pattern, central dispatcher pattern and subgraph dominator pattern. Using these patterns a skeleton decomposition is formed. In the second stage of the algorithm, a technique known as "Orphan Adoption" is used to place isolated entities in one of the subsystems identified during the first phase of the algorithm.

Table 7 summarizes the work related to software clustering conducted in the past.
[Kontogiannis, 97]
Objective: Detect programming patterns; detection of clones
Features / metrics: Structural complexity, data complexity, McCabe complexity, Albrecht metric, Kafura metric
Similarity / distance measure: Euclidean distance, correlation
Clustering algorithm: Use distance as a threshold and group together components whose distance falls within this threshold
Systems used: TCSH, CLIPS, BASH, ROGER

[Anquetil, Lethbridge, 1999]
Objective: Re-modularization
Features / metrics: Formal features (type, variable, routine, file, macro, all — the union of the above); non-formal features (references to words in identifiers, references to words in comments)
Similarity / distance measure: Taxonomic (distance), Camberra, Jaccard, simple matching, Sørensen-Dice
Clustering algorithm: Agglomerative hierarchical (single linkage, complete linkage, weighted average linkage, unweighted average linkage); Bunch (hill climbing) non-hierarchical algorithm
Systems used: gcc, Mosaic, Linux, Telecom legacy system
Evaluation: Expert criteria (precision and recall); design criteria (coupling and cohesion)

[Sartipi, Kontogiannis, 2001]
Objective: Recover the architecture of a system
Features / metrics: Function, type, global variable
Similarity / distance measure: Component association, mutual association
Clustering algorithm: Supervised clustering algorithm
Systems used: CLIPS, Xfig
Evaluation: Precision, recall

[Davey, Burd, 2000]
Objective: Re-modularization
Features / metrics: Global, type, call
Similarity / distance measure: Jaccard, Sørensen-Dice, Pearson correlation coefficient, Camberra
Clustering algorithm: Agglomerative hierarchical (single linkage, complete linkage, weighted average linkage, unweighted average linkage)
Systems used: 3 samples of C source code taken from legacy systems
Evaluation: Precision, recall

[Schwanke, Platoff, 1989]
Objective: Extract the conceptual architecture of the system
Features / metrics: Feature count based on similar neighbors; names used in a module (variables, data types, macros, procedures)
Similarity / distance measure: Category utility = size * purity
Clustering algorithm: Hierarchical ascending classification method (similar to agglomerative clustering); also splits clusters during clustering (Maverick analysis)
Systems used: DOSE, Tiled Window Manager (TWM)
Evaluation: Compare clusters formed with the modules/procedures defined in common; precision and recall

[Schwanke, 91]
Objective: Develop a tool to provide modularization advice
Features / metrics: Non-local names such as procedures, macros, typedefs, variables, and individual field names of structured types and variables; similarity metric
Clustering algorithm: Hierarchical ascending classification method (similar to agglomerative clustering); also splits clusters during clustering (Maverick analysis)
Systems used: Random and field data from a legacy system
Evaluation: Jackknife method, 52 procedures divided into a training set (48) and a test set (4)

[Tzerpos, Holt, 2000] (MoJo)
Objective: Evaluate the similarity of two different decompositions of a system to determine the stability of a clustering algorithm
Features / metrics: MoJo
Clustering algorithm: Bunch (clustering software); a version of Arch's batch clustering tool
Systems used: TOBEY, Linux
Evaluation: Stability measure

[Tzerpos, Holt, 2000] (ACDC)
Objective: Extracting sub-systems for program comprehension
Features / metrics: Source file pattern, directory structure pattern, body-header pattern, leaf collection pattern, support library pattern, central dispatcher pattern, subgraph dominator pattern
Clustering algorithm: ACDC algorithm
Systems used: TOBEY, Linux
Evaluation: Quality measure based on MoJo

[Müller, Uhl, 1990]
Objective: Extracting sub-system structures, to be used in Rigi
Features / metrics: A directed weighted graph; a directed edge from a to b indicates that module a depends on b, and the weight indicates the resources being exchanged between a and b
Clustering algorithm: (k,2)-partite graph composition

[Lung, 1998]
Objective: Software architecture recovery and restructuring
Features / metrics: Data complexity
Clustering algorithm: UPGMA (unweighted pair-group method using arithmetic averages)
Systems used: Real time telecommunication application
Evaluation: Data complexity

Table 7 : Summary of Software Clustering Methods Used in the Past

2.5 Clustering : A Theoretical Framework
This section outlines various scenarios from the point of view of clustering. It also relates the clustering process to the design of a system by comparing well structured and unstructured architectures. The following configurations and scenarios are discussed in this section, where the black hole configuration and the gas cloud system are types of partitions also discussed by [Anquetil, Lethbridge, 1999]:
Black Hole System
Gas Cloud System
Equidistant Objects
Structured Program
Unstructured Program

2.5.1 The Iteration Versus the Total Clusters Graph
The clustering process will be illustrated using the graph of iteration versus the total number of clusters. To compare the performance of different features, similarity metrics and clustering algorithms, the total number of non-singleton clusters is plotted against the iteration number. In the beginning there are no non-singleton clusters, hence the graph starts at zero total clusters. When two singleton clusters are merged, a new non-singleton cluster is formed and the curve rises. If a singleton cluster merges with an existing non-singleton cluster, the total number of clusters remains constant. If two non-singleton clusters merge together, the total number of clusters in the system decreases, leading to a decline in the curve. At the end of the clustering process, the bigger clusters are merged together till only one cluster remains, again leading to a decline in the curve. In short, there are three possibilities when analyzing the iteration versus total number of clusters graph:
A rising curve indicates the merger of two singleton clusters into a new non-singleton cluster
A constant curve illustrates the merger of a singleton cluster with a non-singleton cluster
A falling curve represents the merger of two non-singleton clusters
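The bookkeeping behind this graph can be sketched as follows (our own illustration): given the sizes of the two clusters merged at each iteration, we track how many non-singleton clusters exist after each merge. The merge sequence used in the example is hypothetical.

def non_singleton_counts(merge_sizes):
    count, curve = 0, []
    for left, right in merge_sizes:
        if left == 1 and right == 1:
            count += 1            # two singletons form a new cluster: the curve rises
        elif left > 1 and right > 1:
            count -= 1            # two non-singleton clusters merge: the curve falls
        # a singleton absorbed by an existing cluster leaves the count unchanged
        curve.append(count)
    return curve

# Hypothetical sequence: (1,1) new cluster, (1,2) absorption, (1,1) new cluster, (3,2) merger
print(non_singleton_counts([(1, 1), (1, 2), (1, 1), (3, 2)]))   # [1, 1, 2, 1]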
The graph obtained provides an ad-hoc, subjective measure for evaluating the success of the approach, as pointed out by [Davey, Burd, 2000]. It can be used as a rough metric to evaluate the usefulness of similarity measures and clustering algorithms. The total number of non-singleton clusters represents the quality of a clustering process. If a clustering process produces a large number of clusters, these clusters will be small groups of functions that are identical with respect to the features they access. If they access the same data types or variables, then it is likely that these clusters are cohesive. Davey and Burd [Davey, Burd, 2000] suggest that such a clustering process will provide a more suitable modularization than one where the maximum total number of clusters is small, as the latter case signifies a large black hole cluster that tends to attract other entities towards it.

2.5.2 Black Hole System
In this configuration a single object, termed a 'black hole object', is highly similar to the other objects in the system. The rest of the objects are far apart from each other, resulting in the black hole object absorbing the rest of the clusters in the system. One possible raw data matrix that leads to the black hole system is given below:

Modules                SelectSpline  SelectLine  SelectCircle  SelectRectangle
SelectObject M0             1             1            1              1
CreateSpline M1             1             0            0              0
CreateLine M2               0             1            0              0
CreateCircle M3             0             0            1              0
CreateRectangle M4          0             0            0              1

The system has five modules M0, M1, M2, M3 and M4. The columns indicate the features or data structures accessed by these modules; in this case the features are the functions called by the modules. The module SelectObject M0 is an abstract module that calls all the other functions, while the rest of the modules M1, M2, M3 and M4 are independent entities that perform the required task. The data matrix is a binary matrix, with a 'one' indicating that a module calls a certain function. The clustering process for this system is depicted by the graph of iteration versus total number of clusters in Figure 15.

Figure 15: Iteration Vs. Total Clusters for the Black Hole System

The graph shows that the total number of clusters remains constant: one cluster containing M0 is formed and it attracts all the singleton clusters towards it.

Relationship to Design
The scenario described in this section pertains to the 'pancaked structure' described by Pressman [Pressman, 97]. It depicts one controlling module calling many other functions, with the rest of the modules being independent entities. A pictorial representation of this structure is shown in Figure 16. Such configurations should be avoided; as Pressman points out, such structures do not make 'effective use of factoring', and a more reasonable distribution of control should be planned.

Figure 16: Pancaked Structure [Pressman, 97]

2.5.3 Gas Cloud System
A gas cloud system results when the objects or entities being clustered are infinitely far apart from one another and have no common features. A possible raw data matrix leading to this configuration is shown in Table 8.
This scenario is also similar to the 'pancaked structure' described by Pressman [Pressman, 97] and illustrated in Figure 16. Such structures should be avoided.

Modules    Feature1  Feature2  Feature3  Feature4
M1            1         0         0         0
M2            0         1         0         0
M3            0         0         1         0
M4            0         0         0         1

Table 8 : Raw Data Matrix for the Gas Cloud System

2.5.4 Equidistant Objects
When considering equidistant objects there are a number of possibilities. In this sub-section we discuss two of them:
Objects partially similar to one another
Objects with 100% similarity to one another

Objects partially similar to one another
To visualize a system with partially similar objects, consider the example of four modules CreateSpline, CreateLine, CreateCircle and CreateRectangle with the following raw data matrix:

Modules                PosX  PosY  StartX  StartY  EndX  EndY
CreateSpline M1         1     1      1       0      0     0
CreateLine M2           1     0      0       1      1     0
CreateCircle M3         0     1      0       1      0     1
CreateRectangle M4      0     0      1       0      1     1

The Jaccard similarity matrix for the above raw data matrix is:

      M1     M2     M3     M4
M1    -
M2    1/5    -
M3    1/5    1/5    -
M4    1/5    1/5    1/5    -

The table of similarity values between all pairs of entities indicates that the objects are equidistant from each other; each pair of entities has a Jaccard similarity measure equal to 1/5 = 0.2. In this sub-section the clustering of such entities is analyzed. It is assumed that when selecting pairs of entities with equal similarities for merging into a cluster, preference is given to merging singleton clusters. If single linkage, complete linkage, weighted average linkage or unweighted average linkage is used to group together equidistant objects, the general curve obtained for iteration versus number of clusters is shown in Figure 17.

Figure 17: The General Curve Obtained for Iteration Vs. the Total Number of Clusters

The curve shows that singleton clusters are merged into non-singleton clusters at the beginning of the clustering process, resulting in a peak in the curve. After all the singleton clusters have been merged, the curve drops as non-singleton clusters are merged together.

On the other hand, if the combined algorithm is used, then the curve obtained is as shown in Figure 18.

Figure 18: Iteration Vs. Total Clusters for the Combined Algorithm for Equidistant Objects

The clustering process starts by merging two singleton clusters, resulting in a black hole cluster that tends to attract the rest of the singleton clusters towards it. The nature of the combined algorithm is such that when two singleton clusters are merged, the resulting cluster has a greater similarity to the rest of the objects than either of the two individual singleton clusters. The rest of the curve is constant, indicating the merging of singleton clusters with the main black hole cluster.

Relationship to Design
The above examples demonstrate loosely cohesive objects that exhibit a small degree of similarity with each other. A set threshold should be used to determine whether these modules should be grouped together or not. If the modules are not functionally related to each other, it means that the variables or features used do not have a semantic relationship to one another and are being used as needed by any function. Such features can also be termed utility features.
On the other hand, if the modules were intended to be functionally similar as per the design of the entire system, then the features or variables accessed by them could be grouped together into a bigger structure during the re-modularization stage.

Objects with 100% Similarity
Objects that are 100% similar with respect to the features they access would all be grouped together by the clustering process into one strongly cohesive cluster. A raw data matrix for such a scenario is shown below:

Modules                PosX  PosY  StartX  StartY
CreateSpline M1         1     1      1       1
CreateLine M2           1     1      1       1
CreateCircle M3         1     1      1       1
CreateRectangle M4      1     1      1       1

The Jaccard similarity matrix for the above raw data matrix is:

      M1    M2    M3    M4
M1    -
M2    1     -
M3    1     1     -
M4    1     1     1     -

The clustering process for such a situation is shown in Figure 19.

Figure 19: Iteration Vs. the Total Number of Clusters for Equidistant Objects with 100% Similarity

The curve is similar to the curve for partially similar equidistant objects when the clustering algorithm being used is single, complete, weighted or unweighted linkage.

Relationship to Design
Equidistant objects with 100% similarity represent one cohesive module from the point of view of design. If the re-modularization process is being carried out with the aim of migrating the system to one conforming to the object oriented paradigm, then all the modules can be grouped together as operations of the same class. However, the proportion of such equidistant modules to the total number of modules present in the system should be kept in view when re-modularizing software. If the proportion of such objects is very large, it means that the features being used to classify them into different clusters are not sufficient and discriminating enough to form a smaller grouping of clusters. An architecture with groups/clusters comprising equidistant objects is preferred, where each group independently performs its own functions and is disjoint from the rest of the groups.

2.5.5 Structured Program
In this sub-section the clustering process for a structured program is reviewed. A structured program containing modules M0, M1, ..., M14 is illustrated by the tree in Figure 20. The straight lines represent the calls made from one module to another and the dashed lines represent the data structures accessed by each module. If we consider the data structures as features and cluster similar objects together on the basis of this feature, then the clusters obtained are indicated by the dashed circles containing the objects. The clusters are highly cohesive, with a Jaccard similarity measure of 1 amongst their members. The corresponding raw data matrix for the modules M0 to M14 is shown in Table 9.

Modules   M0  M1  M2  M3  M4  M5  M6  M7  M8  M9  M10  M11  M12  M13  M14
T1         0   1   0   0   0   1   0   0   0   0   0    0    0    0    0
T2         0   0   1   0   0   0   1   1   0   0   0    0    0    0    0
T3         0   0   0   1   0   0   0   0   1   1   1    0    0    0    0
T4         0   0   0   0   1   0   0   0   0   0   0    1    1    1    1

Table 9: Raw Data Matrix for the Structured Program

Figure 20 : A Structured Program
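As a quick check of this example (our own sketch, not part of the report's tool chain), the Jaccard similarity computed from Table 9 is exactly 1 for every pair of modules that access the same data structure, so the four dashed clusters of Figure 20 fall out directly; M0, which accesses no data structure, stays a singleton.

from itertools import combinations

# Which modules access which data structure, taken from Table 9 (M0 accesses none).
types = {"T1": ["M1", "M5"],
         "T2": ["M2", "M6", "M7"],
         "T3": ["M3", "M8", "M9", "M10"],
         "T4": ["M4", "M11", "M12", "M13", "M14"]}

modules = [m for mods in types.values() for m in mods]
vectors = {m: [1 if m in mods else 0 for mods in types.values()] for m in modules}

def jaccard(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    mismatches = sum(1 for xi, yi in zip(x, y) if xi != yi)
    return a / (a + mismatches) if a + mismatches else 0.0

for m1, m2 in combinations(modules, 2):
    if jaccard(vectors[m1], vectors[m2]) == 1.0:
        print(m1, m2, "share a data structure and end up in the same cluster")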
The clustering process that leads to the above clusters is depicted in Figure 21.

Figure 21: Clustering Process for the Structured Program

In the beginning of the clustering process, singleton clusters are merged, leading to a rise in the curve. After all singleton pairs have been merged, the remaining singleton objects are merged into the clusters they are similar to, leading to a horizontal line as the total number of clusters remains constant. Once all the singletons are merged, similar clusters are grouped together, leading to a decline in the curve. The final partition of the system thus obtained is also termed the "Planetary System" by [Anquetil, Lethbridge, 1999]. Such a system has several cohesive sub-systems that are interconnected to form the entire system. This is the ideal case. In the next section a system is considered which can result from modifications to the above system.

2.5.6 Unstructured Program

Figure 22: Unstructured Program

This section explores one last case: finding structure within programs that started with a good design but evolved into a more haphazard system due to modifications and enhancements. The structured program illustrated in Figure 20 can turn into an unstructured system as a result of such enhancements and modifications. A possible conceptual picture of the resulting system, which deviates from the initial design of the structured program, is shown in Figure 22. It can be noted that the initial design and structure of the system is still buried deep within the system, though it may not be obvious from merely visualizing the system. The system consists of the same modules M0, M1, ..., M14. However, the data dependencies of the modules have increased and the calls between the horizontal partitions have also increased in number. In the text that follows, an analysis of a few similarity measures and algorithms used for clustering is presented in detail.

Jaccard Similarity Measure Using the Complete Algorithm
The graph depicting the clustering process using the Jaccard similarity measure and the complete linkage algorithm is shown in Figure 23, which plots the iteration versus the total number of clusters.

Figure 23: Clustering Process for the Unstructured Program

The clusters formed using the complete linkage algorithm with the Jaccard similarity measure are shown in Figure 24. For these clusters a threshold of 0.5 was used to stop the clustering process.

Figure 24: Clusters Formed for the Unstructured Program

Jaccard Similarity Measure Using the Combined Algorithm
The graph depicting the clustering process using the Jaccard similarity measure and the combined algorithm is shown in Figure 25, which plots the iteration versus the total number of clusters.

Figure 25: Iteration Vs. Number of Clusters for the Jaccard Similarity Using the Combined Algorithm

The clusters formed are shown in Figure 26. For this algorithm a similarity threshold of 0.5 was used to stop the clustering process.

Figure 26: Clusters Formed for the Jaccard Similarity Measure Using the Combined Algorithm

Simple Similarity Coefficient Using the Complete Algorithm
The graph depicting the clustering process using the simple similarity coefficient and the complete linkage algorithm is shown in Figure 27, which plots the iteration versus the total number of clusters.
Simple Similarity Coefficient Using Complete Algorithm

The graph depicting the clustering process using the simple similarity coefficient and the complete algorithm is shown in Figure 27, which shows the iteration versus the total number of clusters curve.

Figure 27: Clustering Process for the Simple Similarity Coefficient Using Complete Linkage

The clusters formed using the simple similarity coefficient and the complete linkage algorithm are shown in Figure 28. A threshold of 0.888 was used to terminate the clustering process.

Figure 28: Clusters Formed Using the Simple Similarity Coefficient and Complete Linkage Algorithm

Discussion of Results

In the previous text the clustering process using different similarity coefficients and clustering algorithms was presented. It can be seen from the graphs of iteration versus total number of clusters that the curve is the same for both the Jaccard and the simple coefficient. Also, the total number of clusters formed as the process continues is the same for these similarity measures. The curves obtained using the weighted, unweighted and single linkage algorithms for the Jaccard coefficient are also similar but are not shown here. The clusters formed by the complete and the combined algorithm are also the same. However, the ones formed using the simple matching coefficient are slightly different: this measure takes the absence of features into account and groups the modules having the zero feature vector into one cluster. It can be seen from the pictures of the clusters formed for the unstructured program (Figure 24, Figure 26, Figure 28) that the clustering process has identified the clusters or groups of entities as intended by the initial design of the system (shown in Figure 20). This shows that the clustering process is able to recover the original design or architecture of a system that has been obscured by later enhancements and modifications.

2.6 Experiments with Clustering for Software Re-Modularization

To evaluate the suitability of applying clustering algorithms to software artifacts for software re-modularization, a few experiments were carried out. Details of these experiments and the results obtained are presented in this section.

2.6.1 The Test System

The test system used for clustering software artifacts is Xfig Version 3.2.3. Xfig is an open source drawing tool that runs under the X Window System [Website Xfig]. Xfig was also used as a model system for a competition of reverse engineering tools [Sim, Storey, 2000]. It is written in the C programming language and consists of 75K lines of source code. Its source code is distributed over around 100 source files and around 75 include files. There is no documentation regarding the structure or implementation of Xfig; however, usage manuals are available for this system. The Xfig source code has been parsed using the Rigi parser [Martin, Wong, Winter, Müller, 2000] and its facts are available at [Website Xfig RSF]. The facts are in Rigi Standard Format (RSF), which is a Tuple Attribute (TA) format. For the experiments with software clustering, the raw source code of Xfig was not used; the available RSF file was used instead. The RSF file has around 200,000 facts, including facts for calls, accesses, references etc. Analysis of Xfig shows around 1700 functions. The RSF file was exported to an SQL compliant RDBMS to ease access to the facts during the clustering process.
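As an illustration of this step, the sketch below loads RSF-style facts (whitespace-separated triples) into a small SQL table so that the facts for a given function can be queried. The file name, relation name and function name are assumptions made for illustration; the actual schema used in the experiments is not described here.

```python
# A minimal sketch, not the schema actually used in the experiments: loading
# RSF-style facts ("verb subject object" triples) into an SQL table so that the
# features of each function can be queried easily.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (verb TEXT, subj TEXT, obj TEXT)")

with open("xfig.rsf") as rsf:                        # hypothetical RSF dump
    for line in rsf:
        parts = line.split()
        if len(parts) == 3:                          # skip malformed lines
            conn.execute("INSERT INTO fact VALUES (?, ?, ?)", parts)

# e.g. all functions called by create_lineobject (function name illustrative)
rows = conn.execute(
    "SELECT obj FROM fact WHERE verb = 'call' AND subj = ?",
    ("create_lineobject",)).fetchall()
print(rows)
```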
The goal of the software clustering experiments is to find a structure within the source code of Xfig so that higher level sub-systems can be identified. In the long run such an analysis can also be useful in identifying objects or classes within the system in order to move it to an object oriented paradigm.

In Xfig a consistent naming convention for source code files is used, as follows:
- d_* files are intended for drawing shapes
- e_* files are related to editing
- f_* files have file related functions
- u_* files pertain to utilities for drawing and editing
- w_* files contain the X-Windows related calls

The following table lists the total number of functions/entities in each sub-system:

Sub-System   Total Functions   Total Source Files
d_files            94                 10
e_files           369                 19
f_files           139                 17
u_files           422                 18
w_files           637                 31

Table 10: The Xfig Sub-Systems

The above source files can, therefore, help in determining and categorizing the various sub-systems within Xfig. In the experiments described in the following text, clustering was carried out individually on each of the above mentioned sub-systems. It was then also carried out on the entire system, treated as a whole.

2.6.2 Clustering Techniques

For the analysis of Xfig, an agglomerative hierarchical clustering algorithm was used to group together software artifacts. Each function was treated as a single entity or object to be grouped. The features used to describe each entity are as follows:
- Call: the calls made by a function
- Global: the global variables referred to by a function
- Type: the types accessed by a function. The types are the data structures defined in the C language using 'struct' or the user defined types declared using 'typedef'.

The experiments were conducted using the following similarity measures:
- Jaccard coefficient
- Sorensen-Dice coefficient
- Simple matching coefficient
- Pearson's correlation coefficient
- Canberra distance metric

The following clustering algorithms were used:
- Single linkage
- Complete linkage
- Weighted average linkage
- Unweighted average linkage
- Combined algorithm

2.6.3 Evaluation of Results

To evaluate the results of clustering, precision and recall have been used [Anquetil, Lethbridge, 1999]. To calculate the precision and recall of a test partition, an expert decomposition is required. We constructed the expert decomposition by placing all the functions in one source file in one cluster. This is based on the assumption that all functions placed in one file perform similar functions.

2.6.4 Analysis of the Xfig System

In the text that follows, the results of the clustering experiments conducted on the 5 sub-systems of Xfig are presented. In these experiments the following are analyzed:
- the comparison of features when using different similarity measures
- the comparison of various similarity metrics
- the comparison of various clustering algorithms
- precision and recall for various similarity metrics using different clustering algorithms

The behavior of all the sub-systems is almost identical, with a few exceptions, when the clustering process is applied to them. The results for the d_files sub-system are presented in detail; the graphs of the comparison results for the rest of the sub-systems and for the entire system are given in the appendices. In the following section the analysis of d_files is presented in detail and the results for the rest of the sub-systems are summarized.

2.6.5 Analysis of d_files Sub-System

In this sub-section a detailed analysis of the d_files sub-system is carried out and the results are presented. The d_files pertain to drawing objects and there are a total of 10 source files in this sub-system.
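For illustration, the sketch below shows one way a binary feature vector could be assembled for each function over the union of 'call', 'global' and 'type' features. The fact list and the function and feature names are made up for the example and are not taken from Xfig.

```python
# A minimal sketch, under our own assumptions, of building one binary feature
# vector per function from a list of (relation, function, feature) facts.

facts = [
    ("call",   "create_lineobject", "add_point"),
    ("global", "create_lineobject", "cur_pointposn"),
    ("type",   "create_lineobject", "F_line"),
    ("call",   "create_regpoly",    "add_point"),
    ("type",   "create_regpoly",    "F_line"),
]

functions = sorted({subj for _, subj, _ in facts})
features  = sorted({(verb, obj) for verb, _, obj in facts})   # the 'all' feature set

vectors = {f: [0] * len(features) for f in functions}
for verb, subj, obj in facts:
    vectors[subj][features.index((verb, obj))] = 1            # mark the feature as present

for f, v in vectors.items():
    print(f, v)
```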
Comparison of Features

The following graphs show the performance of the features all, call, global and type when using different similarity metrics and the complete clustering algorithm.

Figure 29: Comparison of Features Using Different Similarity Metrics and the Complete Algorithm [panels: (a) Jaccard, (b) Simple, (c) Correlation, (d) Canberra]

The above graphs illustrate that the performance of the Jaccard metric is very similar to that of the correlation similarity measure. As explained previously in Section 2.1.2, the Jaccard metric J is given by:

J = a/(a+b+c)

The correlation measure for binary features is given by:

(ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))

The two similarity measures perform equivalently when d is much larger than a, as explained in Section 2.1.2.3. For experiments with software clustering, d tends to be very large because the feature vector associated with a module treated as an entity is a sparse vector. This explains why the behavior of the correlation and the Jaccard measures tends to be similar. The graphs of Figure 29 also show that the simple and the Canberra metric perform equivalently. The simple metric is given by (a+d)/(a+b+c+d) and the Canberra metric is given by (b+c). However, for software clustering the Jaccard similarity measure seems to be the most intuitive, as it is based on the presence of features, whereas the correlation and simple metrics also take into account the absence of features.

All the above graphs also show a sharp peak for the 'type' feature, indicating that there are many 'equidistant objects' with respect to the 'type' feature. On the other hand, the curves for the 'call' feature are more spread out and levelled at the top. The same is true for the curves of 'all' features. The 'global' feature, like the 'type' feature, also has a sharp peak at the beginning of the clustering process, signifying the merging of identical singleton objects. After the peak, the graphs seem to level off, illustrating the merger of singleton objects with clusters.

The aforementioned phenomenon holds for all sub-systems except the f_files sub-system. In this sub-system the Jaccard and correlation metrics, when applied to 'all' features, produce the maximum number of clusters. However, the shape of the graph is similar to the rest of the sub-systems when using 'all' features. The shape of the graph for the 'all' and 'call' features, when using the Jaccard and correlation metrics, resembles that of the graph for a structured system, signifying the importance of these features and metrics with respect to the clustering process.
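To make the relationships between these formulas concrete, the following minimal sketch computes all four measures from the counts a, b, c and d for a pair of sparse binary vectors. The vectors are illustrative only; with d much larger than a, Jaccard and correlation give similarly high values, while the simple matching coefficient is dominated by d and the Canberra distance reduces to b + c.

```python
# A minimal sketch relating the four measures discussed above on sparse binary data.

import math

def counts(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)   # shared features
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)   # absent from both
    return a, b, c, d

x = [1, 1, 1, 1, 0] + [0] * 195        # 200 features, mostly absent
y = [1, 1, 1, 0, 1] + [0] * 195
a, b, c, d = counts(x, y)               # a=3, b=1, c=1, d=195

jaccard     = a / (a + b + c)                                                       # 0.60
correlation = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))    # ~0.74
simple      = (a + d) / (a + b + c + d)                                             # 0.99
canberra    = b + c                                  # 2 (a distance, not a similarity)
print(jaccard, round(correlation, 2), simple, canberra)
```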
Although Davey and Burd suggest that the feature that produces the maximum number of clusters is the best feature [Davey, Burd, 2000], the shapes of the graphs should also be taken into account when evaluating a particular clustering. The 'type' feature seems to produce the maximum number of clusters during the clustering process; however, the curve suggests that many objects or functions are equidistant with respect to the type feature (as shown by the sharp peak). A careful analysis also shows that the feature vector for 'type' is very sparse and many functions access the same type, showing that this feature provides insufficient discrimination between different entities. The curves for the 'all' and 'call' features, however, seem to suggest more suitable partitions for the Jaccard and correlation metrics. The maximum total number of clusters is almost the same as for the 'type' feature, but the curve is more spread out and flatter at the top, illustrating the merger of non-singleton clusters with singleton clusters in the middle of the clustering process. This shows that many of the entities are not equidistant with respect to these features and that they provide more discrimination between the different functions.

Comparison of Similarity Metrics

To compare the various similarity metrics for the d_files sub-system, the graph of iteration versus the total number of clusters is plotted for the various similarity metrics using all features and the complete and combined algorithms, in Figure 30 and Figure 31. When using 'all' features and the complete algorithm, the performance of all similarity metrics with respect to the total number of clusters being formed is almost the same. However, when the combined algorithm is used, the performance of correlation deteriorates considerably. This behavior can be explained as follows. At the beginning of the clustering process the feature vector for every entity is sparse, hence the value of d is very large compared to a. As the total number of functions in a cluster increases, the feature vectors are 'ORed' together and the resulting vector is no longer very sparse. Hence the value of d becomes comparable to the value of a and the performance of correlation differs from that of the Jaccard association coefficient. The graphs obtained also show that the behavior of the Sorensen-Dice and Jaccard association coefficients is almost identical. This matches the results obtained by Davey and Burd.

Figure 30: Comparison of Similarity Metrics Using Complete Algorithm and All Features

Figure 31: Comparison of Similarity Metrics Using Combined Algorithm and All Features

Comparison of Algorithms

Figure 32: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features

To compare the performance of the various clustering algorithms, the total number of clusters is plotted for the combined, complete, single, weighted and unweighted algorithms when using the Jaccard similarity metric. All algorithms seem to perform almost equivalently for the d_files sub-system. However, this result is not generally true for the rest of the sub-systems, where the performance of the single link algorithm is much worse than that of the complete link, combined, weighted and unweighted algorithms. Davey and Burd point out that when single linkage is used, the bigger clusters are merged rather than the singletons unless the distance between them is very large; hence single link creates a large number of isolated, loosely coupled clusters. The complete linkage algorithm, on the other hand, tends to push the clusters apart and depends upon the original similarities calculated from the feature vectors. It therefore tends to create a higher number of highly cohesive clusters.
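The combined algorithm mentioned above associates a new feature vector with a merged cluster by combining the members' facts. A minimal sketch of that update step, under our own assumptions, is given below; it is not the exact implementation used for the experiments.

```python
# A minimal sketch of the "combined" cluster update: a feature is present in a
# cluster if any of its members exhibits it (the members' vectors are ORed).

def or_vectors(v1, v2):
    """Combine two binary feature vectors element-wise."""
    return [x | y for x, y in zip(v1, v2)]

def merge(cluster_a, cluster_b):
    """Each cluster is (members, feature_vector); merging ORs the vectors."""
    members_a, vec_a = cluster_a
    members_b, vec_b = cluster_b
    return (members_a + members_b, or_vectors(vec_a, vec_b))

c1 = (["create_lineobject"],      [1, 0, 1, 0])      # names and vectors illustrative
c2 = (["line_drawing_selected"],  [1, 1, 0, 0])
print(merge(c1, c2))   # (['create_lineobject', 'line_drawing_selected'], [1, 1, 1, 0])
```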
Both Lethbridge and Anquetil [Anquetil, Lethbridge, 1999] and Davey and Burd [Davey, Burd, 2000] find that the performance of the weighted and unweighted algorithms lies between that of single and complete linkage. The same is true for the rest of the sub-systems of Xfig except d_files. The results on d_files and the rest of the sub-systems also show that the performance of the combined algorithm is almost equivalent to that of the complete link algorithm. Since we are experimenting with software artifacts, the use of the combined algorithm seems more intuitive, as it associates a new feature vector with a new cluster by combining all the facts together.

Precision and Recall

Figure 33: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features [panels: (a) Jaccard complete, (b) Jaccard combined, (c) Correlation complete, (d) Correlation combined]

Figure 33 illustrates the precision and recall graphs for the d_files sub-system for the correlation and Jaccard similarity metrics when using the complete and combined algorithms with 'all' features. As can be seen from the graphs, precision is high at the beginning of the clustering process and low at the end. In contrast, recall is zero at the start of the process and rises to 100% at the end, when one big cluster is formed. The crossover point for precision and recall signifies the point at which the total number of intra pairs of the expert clustering is equal to the total number of intra pairs in the test clustering. When precision is greater than recall, the total number of intra pairs in the algorithm's test clustering is less than that of the expert's. Conversely, when the total number of intra pairs in the expert clustering is less than that of the test clustering, recall is higher than precision. The graphs illustrate this tradeoff between precision and recall.

Davey and Burd [Davey, Burd, 2000] and [Anquetil, Lethbridge, 1999] suggest that precision and recall can be used to evaluate the performance of different algorithms and similarity metrics. It is obvious that the higher the values of both precision and recall, the better the performance of the parameters used for the clustering process. Davey and Burd also indicate that the later the crossover point occurs in the clustering process, the better the partitions produced, the reason being that more functions have been considered for clustering if the crossover occurs at a later point. For the Jaccard similarity metric, the crossover point for the complete and combined algorithms occurs at almost the same iteration. However, the height of this point is slightly higher for the combined algorithm than for the complete algorithm for the d_files, e_files and f_files sub-systems, indicating the better performance of the combined algorithm compared to the complete algorithm. The following table shows the crossover point for the various sub-systems:

             d_files   e_files   f_files   u_files   w_files
Complete       18%       22%       25%       33%       17%
Combined       33%       29%       30%       30%       17%
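For reference, the precision and recall values plotted in these graphs are computed over intra-cluster pairs, as defined above. The sketch below illustrates that computation on small, made-up partitions; the expert partition groups functions by source file.

```python
# A minimal sketch, following the intra-pair definition used in the text, of
# precision and recall for a test partition against an expert decomposition.

from itertools import combinations

def intra_pairs(partition):
    """All unordered pairs of entities placed in the same cluster."""
    pairs = set()
    for cluster in partition:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def precision_recall(test, expert):
    t, e = intra_pairs(test), intra_pairs(expert)
    common = t & e
    precision = 100 * len(common) / len(t) if t else 100.0
    recall    = 100 * len(common) / len(e) if e else 100.0
    return precision, recall

expert = [["f1", "f2", "f3"], ["f4", "f5"]]      # functions grouped by source file
test   = [["f1", "f2"], ["f3", "f4", "f5"]]      # partition produced by clustering
print(precision_recall(test, expert))            # (50.0, 50.0)
```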
In almost all the sub-systems, when using the correlation metric with the complete algorithm, the crossover point occurs later in the clustering process than for the Jaccard similarity measure. However, the height of this point is much lower than for both the Jaccard complete and Jaccard combined algorithms. The "correlation-combined" combination performs the worst in all the sub-systems, as its crossover point occurs much earlier in the clustering process and its height is much lower.

2.6.6 Summary and Discussion of Results

In this section a detailed analysis of clustering software artifacts using the Xfig system has been carried out. The various features for clustering, similarity measures and clustering algorithms were compared. The experiments were carried out on the 5 sub-systems of Xfig and also on the entire Xfig system. Generally, the results obtained on the 5 sub-systems and on the entire system were identical. The following observations have been made:
- For the Xfig system, the 'type' feature is not discriminating enough. The 'call' feature performs quite well. However, a combination of the 'type', 'call' and 'global' features gives the best performance.
- If the total number of absent features is very large compared to the total number of present features, then the correlation metric performs similarly to the Jaccard metric. The Jaccard metric also performs identically to the Sorensen-Dice similarity measure. Intuitively, for software clustering the Jaccard and Sorensen-Dice measures are more suitable than the simple matching and Canberra metrics.
- The performance of the weighted and unweighted clustering algorithms lies in between that of the single link and complete link algorithms. The complete link algorithm produces better clustering than the single link algorithm, and the combined algorithm produces better results than the complete link algorithm. For software clustering, the combined algorithm also seems more intuitive than the complete algorithm.

Analysis of Partitions and Cohesion

When analyzing a partition, it was noted that many of the clusters being formed are logically cohesive. Pressman describes 7 levels of cohesion [Pressman, 1997], given here in order from loosely cohesive to strongly cohesive:
- Coincidental
- Logical
- Temporal
- Procedural
- Communicational
- Sequential
- Functional

Pressman points out that the scale of cohesion is non-linear: low end "cohesiveness" is much worse than the middle range, which is almost as good as high end cohesion. For the clustering process we need to define measures to quantify these levels of cohesion and pin-point those features that influence a partition towards being strongly cohesive. As an example, consider the following functions in the d_files sub-system:
- create_lineobject
- create_regpoly
- line_drawing_selected
- regpoly_drawing_selected

When using the Jaccard similarity measure and the complete algorithm on the basis of the 'type' feature, the clustering process groups together create_lineobject and create_regpoly. It also groups together line_drawing_selected and regpoly_drawing_selected, as the 'type' feature vectors of this pair are more similar. The clusters thus obtained are logically cohesive: the functions are logically related to each other and perform the same type of operation, but on different types of objects. This is a loose form of cohesion.
Ideally, we would like to form clusters that manipulate the same data type. It would be more appropriate to group create_lineobject and line_drawing_selected in one cluster and to place create_regpoly and regpoly_drawing_selected in another cluster. The above discussion brings out the need for defining quantitative measures of cohesion and for analyzing the different features to estimate their relative importance to cohesion. For the above example, function names can also be used to group together similar objects. However, this feature is only useful if the reverse engineer is certain that consistent naming conventions have been used throughout the source code. It is also necessary to assign weights to the individual feature vectors for the "global", "type" and "call" features when using the combination of 'all' features, so that the resulting clusters are functionally cohesive rather than loosely cohesive.

2.7 Future Directions

This section pin-points areas of software clustering that are worth further investigation. They are outlined in the text that follows.

Quality of Partition: It is necessary to evaluate the quality of a partition using measures other than precision and recall. This can be done by using metrics such as data complexity or structural complexity. The quality of a clustering could then be assessed by comparing these metrics with the expert's. This can also help in deciding the criterion for stopping the clustering process or determining the height of the cut for the dendrogram made by the clustering process.

Various Measures of Cohesion: As pointed out in Section 2.6.6, there is a need to define measures that quantify the various levels of cohesion and to assess the relative importance of various features with respect to these levels of cohesion. This will help in evaluating the quality of a partition and also in determining the weights to be assigned to individual features during the clustering process.

Assessing the Relative Importance of Features Used: The features used for the experiments described in this report were assigned equal weights when using the 'all' feature. It is necessary to devise methods to assess their relative importance for software clustering and to use weighted features. Also, binary feature vectors were used for these experiments; it would be interesting to count the relative frequencies of usage of these features by individual functions and compare them with binary features. Additionally, the 'call', 'type' and 'global' features should also be compared with other features, such as function names and comments in the source code, for the Xfig system.

Defining New Similarity Measures: So far researchers have been conducting experiments using the association coefficients and the distance or correlation measures. There are still other similarity measures, such as the probabilistic measures, that remain unexplored for clustering software artifacts. There is a need to come up with new similarity measures that are relevant to the software artifacts being grouped together and, more importantly, that pertain to the type of software they are being applied to, e.g. legacy code, open source, embedded systems, real time systems etc.

Comparison with Other Systems: The results obtained from Xfig need to be compared with other systems. It would be interesting to compare open-source systems with legacy systems.
Also, the clustering process for different types of systems, like embedded systems, device drivers and real time systems, can be compared.

Relationship to Software Architecture: The clustering process identifies the various groups of entities in a software system. It also pin-points the various sub-systems present in the entire software. However, a clustering process can impose a structure of its own as well, depending upon the nature of the algorithm and the similarity measure used. It would be interesting to relate these algorithms and similarity measures to the various types of defined software architectures, as pointed out by Shaw and Garlan [Shaw, Garlan, 1996].

Studying Software Evolution: Software clustering can help study how a software system evolves. Studying previous and current versions of the same program can help a software engineer gain an insight into the initial design and the corresponding life cycle of changes that took place in the software.

APPENDICES

A. Experimental Results

The appendices present the results of clustering when applied to the following sub-systems of Xfig: e_files, f_files, u_files, w_files, and the entire system.

A.1 Analysis of e_files Sub-System

In this section the analysis of all source files of Xfig with prefix "e_" is carried out.

Comparison of Features

Figure 34: Comparison of Features Using Different Similarity Metrics and Complete Algorithm [panels: (a) Jaccard, (b) Simple, (c) Correlation, (d) Canberra]

Comparison of Similarity Metrics

Figure 35: Comparison of Similarity Metrics for the Complete Algorithm Using All Features

Figure 36: Comparison of Similarity Metrics for the Combined Algorithm Using All Features

Comparison of Algorithms

Figure 37: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features

Precision and Recall

Figure 38: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features [panels: (a) Jaccard complete, (b) Jaccard combined, (c) Correlation complete, (d) Correlation combined]

A.2 Analysis of f_files Sub-System

In this section an analysis of all source files with prefix "f_" is carried out.
Comparison of Features

Figure 39: Comparison of Features Using Different Similarity Metrics and Complete Algorithm [panels: (a) Jaccard, (b) Simple, (c) Correlation, (d) Canberra]

Comparison of Similarity Metrics

Figure 40: Comparison of Similarity Metrics for the Complete Algorithm Using All Features

Figure 41: Comparison of Similarity Metrics for the Combined Algorithm Using All Features

Comparison of Algorithms

Figure 42: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features

Precision and Recall

Figure 43: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features [panels: (a) Jaccard complete, (b) Jaccard combined, (c) Correlation complete, (d) Correlation combined]

A.3 Analysis of u_files Sub-System

In this section the results obtained from clustering the sub-system of all files with the name prefix "u_" are presented.
Comparison of Features

Figure 44: Comparison of Features Using Different Similarity Metrics and Complete Algorithm [panels: (a) Jaccard, (b) Simple, (c) Correlation, (d) Canberra]

Comparison of Similarity Metrics

Figure 45: Comparison of Similarity Metrics for the Complete Algorithm Using All Features

Figure 46: Comparison of Similarity Metrics for the Combined Algorithm Using All Features

Comparison of Algorithms

Figure 47: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features

Precision and Recall

Figure 48: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features [panels: (a) Jaccard complete, (b) Jaccard combined, (c) Correlation complete, (d) Correlation combined]

A.4 Analysis of w_files Sub-System

This section presents the clustering results for the "w_" sub-system.
Comparison of Features

Figure 49: Comparison of Features Using Different Similarity Metrics and Complete Algorithm [panels: (a) Jaccard, (b) Simple, (c) Correlation, (d) Canberra]

Comparison of Similarity Metrics

Figure 50: Comparison of Similarity Metrics for the Complete Algorithm Using All Features

Figure 51: Comparison of Similarity Metrics for the Combined Algorithm Using All Features

Comparison of Algorithms

Figure 52: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features

Precision and Recall

Figure 53: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features [panels: (a) Jaccard complete, (b) Jaccard combined, (c) Correlation complete, (d) Correlation combined]

A.5 Analysis of the Entire System

Comparison of Features

Figure 54: Comparison of Features Using Different Similarity Metrics and Complete Algorithm [panels: (a) Jaccard, (b) Simple, (c) Correlation, (d) Canberra]

Comparison of Similarity Metrics

Figure 55: Comparison of Similarity Metrics for the Complete Algorithm Using All Features

Figure 56: Comparison of Similarity Metrics for the Combined Algorithm Using All Features

Comparison of Algorithms

Figure 57: Comparison of Algorithms for the Jaccard Similarity Metric Using All Features

Precision and Recall

Figure 58: Precision and Recall for the Jaccard and Correlation Similarity Metrics for the Combined and Complete Algorithms Using All Features [panels: (a) Jaccard complete, (b) Jaccard combined, (c) Correlation complete, (d) Correlation combined]
B. Software Repositories

Important links to repositories that contain program facts or databases:
- Links to software guinea pigs: http://plg.uwaterloo.ca/~holt/guinea_pig/
- Links to Rigi projects by Johannes Martin: http://www.rigi.csc.uvic.ca/~jmartin/rigi-projects/
- Links to the work done by Anquetil and Lethbridge [Anquetil, Lethbridge, 1999]; this site also contains the facts for the experiments conducted by them: http://www.site.uottawa.ca/~anquetil/

C. Summary of Reverse Engineering Methods

Method: Rigi. Technique (broad category): clustering; semi-automatic. Abstraction level: between function and structure level. Technology: (k,2)-partite graphs and clustering based on metrics; semi-automatic.

Method: Function Abstraction. Technique: rule-based. Abstraction level: domain level. Technology: use of prime and proper programs; pattern matching for replacing prime programs with higher level abstractions.

Method: Knowledge based program analysis (PAT). Technique: rule-based. Abstraction level: function level. Technology: knowledge based system with a deductive inference rule engine.

Method: Graph parsing approach. Technique: rule-based. Abstraction level: function level. Technology: flow graphs and flow graph grammar rules; plan calculus.

Method: GraphLog. Technique: predicate calculus. Abstraction level: structure level. Technology: visual tool for representing queries on source code, based on predicate calculus.

Method: Concept analysis (Siff & Reps). Technique: lattice theory. Abstraction level: between function and structure level. Technology: concept analysis using a concept lattice; use of negative attributes for building the concept lattice.

Method: DESIRE (Biggerstaff). Technique: rule-based. Abstraction level: structure level. Technology: use of informal knowledge in design recovery; knowledge based pattern recognizer and Prolog-based inference engine.

Method: Plan recognition. Technique: rule-based. Abstraction level: function level. Technology: matching program plans to code fragments.

Method: GROK. Technique: relational algebra and architectural transformation. Abstraction level: between structure and function level. Technology: architectural transformations using relational algebra.

Method: Dali. Technique: SQL queries. Abstraction level: between structure and function level. Technology: extension of Rigi; use of SQL for specifying clustering patterns; use of dynamic and static information to generate views; requires manual intervention.

D. Detailed Summary of Reverse Engineering Methods

Rigi
Technology: (k,2)-partite graphs and clustering based on metrics; semi-automatic.
Research group: University of Victoria.
Input: RFG (Resource Flow Graph).
Output: Layered sub-system hierarchies.
Objective: Structural re-documentation; extracting the system architecture.
Advantages: Visualization tool; interactive system; the technique based on (k,2)-partite graphs caters for low coupling and high cohesion.
Disadvantages / required extensions: If the system is too complex then it is difficult to form sub-system hierarchies. The metrics used for automatic sub-system detection are ad hoc and may require an expert's input.

Function Abstraction (see Function Abstraction on page 10)
Technology: Use of prime and proper programs; pattern matching for replacing prime programs with higher level abstractions.
Research group: IBM and University of Maryland.
Input: Source code.
Output: High level abstractions, going up to business rules.
Objective: Extract high level business rules.
Advantages: Converts an unstructured program to a structured one. The method helps in understanding programs by abstracting the smaller units within them.
Disadvantages / required extensions: Method not implemented and needs to be formalized. If the system is too complex then the prime programs extracted may be too large and not understandable.
Knowledge based program analysis (PAT) (see Knowledge Based Program Analysis on page 11)
Technology: Knowledge based system with a deductive inference rule engine.
Research group: University of Illinois.
Input: The set of events generated by a program parser; these are fed into the system as programming plans.
Output: High level function level concepts.
Objective: Identify what high level concepts are implemented by a program, how they are implemented, and whether they are implemented correctly.
Advantages: Tries to simulate the behaviour of an expert by inputting an expert's knowledge into the system.
Disadvantages / required extensions: For a real system several hundred event classes and plans will be required, which could only be acquired with the help of a domain expert.

Graph parsing approach (see Graph Parsing Approach on page 11)
Technology: Flow graphs and flow graph grammar rules; plan calculus.
Research group: MIT.
Input: Graphical representation of source code called the plan calculus, which is converted to flow graphs.
Output: A design tree in which programming constructs are replaced by clichés.
Objective: Identify the clichés (commonly used data structures and algorithms) being used in the program.
Advantages: The design tree produced is a good tool for program maintenance. The generic nature of this technique can help cope with syntactic variation, non-contiguousness, implementation variation and overlapping implementations.
Disadvantages / required extensions: The nature of the problem is such that it requires exhaustive search, which will not scale up to programs of large size and complexity. All clichés are pre-input into the system and there is no mechanism for learning them.

GraphLog (see GraphLog on page 10)
Technology: Visual tool for representing queries on source code, based on predicate calculus.
Research group: University of Toronto and IBM.
Input: ER model of the source code.
Output: Answers to queries presented by the user: partition of code into overlay modules; a modified design of better quality; interactively selected views of the model.
Objective: Present source code queries through a visual tool. Use software quality metrics to aid the software engineer towards a better understanding of the system and to improve its design.
Advantages: Use of metrics for better design. Visual tool for presenting complex queries; also supports recursive queries.
Disadvantages / required extensions: As this is an interactive tool, for systems that are too complex some automation might be required for generating the various views of the system. Manually removing one defect, like a cyclic dependency, can also introduce other design defects, like violation of implementation hiding.

Concept analysis (Siff & Reps)
Technology: Concept analysis using a concept lattice; use of negative attributes for building the concept lattice.
Research group: University of Wisconsin.
Input: Abstract syntax tree annotated with type information.
Output: Concept partitions that offer a possibility of C++ classes.
Objective: Convert C code to C++.
Advantages: Strong mathematical foundation for understanding. If the modularization is too fine it can be made coarser, and vice versa.
Disadvantages / required extensions: Exponential time complexity poses a big problem for scalability to bigger systems. Real life systems, when analyzed, have interferences in the concept lattice that introduce the problem of automatic partitioning.
Use this re- Using the re-use library for extracting be defined as a pattern querying source between data, use library for design domain knowledge. recognizer and code. items and files. recovery of legacy Map human-oriented concepts to maximal prolog-based Different views systems and acquire realizations within the source code. inference engine. can be domain knowledge. Provide tool for documenting and collection of generated based understanding source code. objects sharing upon different input queries. common attributes. For a concept c = (O,A), O is the set of objects and called the extent of c i.e. extent(c). A is the set of attributes and called the intent of c i.e. intent(c). Table 2 shows the concepts for the object attribute table illustrated in Table 1. 82 C1 C2 C3 C4 C5 C6 C7 Table Concepts { O1, O2, O3, O4}, { O2, O3, O4}, { O1} { O2, O4}, { O3, O4}, { O4}, , 2 {A {A {A {A {A {A : for Table 1 The set of all concepts form a partial order governed by the following relationship: (O1, A1) (O2 , A2 ) A1 A2 (O1, A1) (O2 , A2 ) O1 O2 The set of concepts taken together with partial ordering form a complete lattice known as a concept 83 lattice. The cocept lattice for the example given in this section is presented in Figure 6. Figure 6 : Concept Lattice for the Attributes and Facts in Table 1 The nodes of the concept lattice are labeled with attributes AiA if it is the largest concept having Ai in its intent. It is also labeled with objects OiO if it is the smallest concept having Oi in its intent. 84 The concept lattice gives an insight into the structure of the relationship of objects. The above figure shows that there are two disjoint set of objects, the first one being O1 with attributes A1 and A2. The second one being O2, O3, O4 sharing the other attribute. 2.7.1.1 Lindig Snelting 85 Conc ept Analy sis Appli cation s and use concept analysis to identify higher level modules in a program [Lindig, Snelting, 1997]. They treat subprograms as objects and global variables as attributes to derive a concept lattice and hence concept partitions. The concept lattice provides multiple possibilities for modularization of programs with each partition representing a possible modularization of the original 86 program. It can be used to provide modularization at a coarse or a finer level of granularity. Lindig and Snelting conducted a case study on a Fortran program with 100KLOC, 317 subroutines and 492 global variables. However, they failed to restructure the program using concept analysis. Siff and Reps used concept analysis to detect abstract data types to 87 identify classes in a C program to convert it to C++ [Siff, Reps, 1997]. Sub-programs were treated as objects and structs or records in C were treated as attributes. They also introduced the idea of using negative examples i.e. absent features as attributes (e.g. a subprogram does not have attribute X). They successfully demonstrated their approach on small problems. 88 However, larger programs are too complex to handle and for that they suggest manual intervention. Canfora et al. used concept analysis to identify sets of variables and to extract persistent objects of data files and their accessor routines for COBOL programs [Canfora, Cimitile, Lucia, Lucca, 1999]. They treated COBOL programs as objects and the 89 files they accessed as attributes. Their approach is semiautomatic and relies on manual inspection of the concept lattice. Concept analysis is an interesting new area of reverse engineering. 
Plan recognition (see Recognizing Program Plans on page 16)
Technology: Matching of program plans to code fragments.
Research group: University of Hawaii, Manoa.
Input: C source code parsed to an AST augmented with data and control flow; a library of program plans.
Output: Identification of program segments that implement a certain plan.
Objective: Enhance program understanding and recognize clichés within programs; also identify modules in order to convert C to C++ code.
Advantages: Based on empirical studies of how programmers conduct program understanding.
Disadvantages / required extensions: Identifying program plans within programs is computationally complex, and the method described may not scale up to real world systems of high complexity.

GROK (see Grok on page 17)
Technology: Architectural transformations using relational algebra.
Research group: SWAG Group, University of Waterloo.
Input: RSF (Rigi Standard Format).
Output: Architectural transformations showing the relations between different software components.
Objective: Extracting high level structural information to ease software maintenance.
Advantages: Efficiently processes large graphs. Based on sound mathematical foundations. Many of the transformations that occur during software maintenance can be easily specified using Grok.
Disadvantages / required extensions: Some transformations are difficult to express in relational algebra. Not well suited for generalized pattern matching where some of the nodes and edges along a path represent a pattern and have to be stored after being visited.

Dali (see Dali on page 18)
Technology: Extension of Rigi; use of SQL for specifying clustering patterns; use of dynamic and static information to generate views; requires manual intervention.
Research group: Carnegie Mellon University.
Input: SQL database storing the different views generated by different tools such as parsers, lexical analysers etc.
Output: Various architectural views of the system.
Objective: Extraction of the architectural information of a system for documenting and understanding the system, and also for maintenance and reuse.
Advantages: Support for SQL to specify clustering patterns. Various views of the system are generated from different tools, giving both a dynamic and a static picture of the system.
Disadvantages / required extensions: No analytic capabilities in Dali. Patterns have to be written by someone who understands the system.

Annotated Bibliography

[Anquetil, Lethbridge, 1999] N.Anquetil, T.C.Lethbridge, "Experiments with Clustering as a Software Remodularization Method". The Sixth Working Conference on Reverse Engineering (WCRE'99), 1999.
Compares various clustering algorithms and features used for clustering. Uses hierarchical clustering. Experiments on gcc, Mosaic, Linux, Telecom. A nice description of similarity metrics and clustering algorithms is given in this paper.

[Biggerstaff, 1989] T.J.Biggerstaff, "Design Recovery for Maintenance and Reuse". IEEE Computer, 22(7), pages 36-49, July 1989.
Very commonly referred to paper. Describes the idea of building a re-use library for design recovery. Also describes the prototype developed, called DESIRE.

[Biggerstaff, Mitbander, Webster, 1994] T.J.Biggerstaff, B.G.Mitbander, D.E.Webster, "Program Understanding and the Concept Assignment Problem". Communications of the ACM, 37(5), pages 72-83, May 1994.
Describes further work on DESIRE and other possibilities for candidate concepts.

[Birkhoff, 1940] G.Birkhoff, Lattice Theory, 1st ed., American Mathematical Society, Providence, R.I., 1940.
We do not have the above reference.
Lattice theory lays the foundation of mathematical concept analysis.

[Buss et al., 1994] E.Buss, R.De Mori, J.Henshaw, H.Johnson, K.Kontogiannis, E.Merlo, H.Müller, J.Mylopoulos, S.Paul, A.Prakash, M.Stanley, S.Tilley, J.Troster, K.Wong, "Investigating Reverse Engineering Technologies: The CAS Program Understanding Project". IBM Systems Journal, 33(3), 1994.
The paper describes the various reverse engineering techniques applied to the SQL/DS source code by five teams.

[Canfora, Cimitile, Lucia, Lucca, 1999] G.Canfora, A.Cimitile, A.De Lucia, G.A. Di Lucca, "A Case Study of Applying an Eclectic Approach to Identify Objects in Code". International Workshop on Program Comprehension, pages 136-143, IEEE Computer Society Press, Pittsburgh, 1999.
Identify objects within COBOL programs using files as attributes and subprograms as objects.

[Chikofsky, Cross II, 1990] E.J.Chikofsky, J.H.Cross II, "Reverse Engineering and Design Recovery: A Taxonomy". IEEE Software, 7(1), pages 13-17, January 1990.
An introductory paper for reverse engineering. Describes basic reverse engineering terminology. Is very commonly referred to.

[Consens, Mendelzon, Ryman, 1992] M.Consens, A.Mendelzon, A.Ryman, "Visualizing and Querying Software Structures". 14th International Conference on Software Engineering, Melbourne, Australia, pages 138-156, May 1992.
One of the first papers on GraphLog and a very commonly quoted paper. Describes the use of predicate calculus for presenting queries based on graphs.

[Davey, Burd, 2000] J.Davey, E.Burd, "Evaluating the Suitability of Data Clustering for Software Remodularization". The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, pages 268-276, 2000.
Compares various similarity metrics, new cluster distance algorithms and feature sets. Good paper to read. Uses hierarchical clustering.

[Deursen, Woods, Quilici, 2000] A.V.Deursen, S.Woods, A.Quilici, "Program Plan Recognition For Year 2000 Tools". Science of Computer Programming, 36(2-3), pages 303-325, 2000.
Work on plan recognition and its application to the Y2K problem.

[Fahmy, Holt, Cordy, 2001] H.M.Fahmy, R.C.Holt, J.R.Cordy, "Wins and Losses of Algebraic Transformations of Software Architectures". Automated Software Engineering (ASE 2001), San Diego, California, November 26-29, 2001.
Describes GROK, a tool based on relational algebra for graph transformations, to be used for software maintenance.

[Godfrey, 2001] M.W.Godfrey, "Practical Data Exchange for Reverse Engineering Frameworks: Some Requirements, Some Experience, Some Headaches". Software Engineering Notes, 26(1), pages 50-52, January 2001.
Describes work on TAXFORM by the SWAG Group and also on integrating various reverse engineering tools.

[Grune, Bal, Jacobs, Langendoen, 2001] D.Grune, H.E.Bal, C.J.H.Jacobs, K.G.Langendoen, "Modern Compiler Design". John Wiley and Sons, Ltd, 2001.
Very good, handy book on compilers.

[Harandi, Ning, 1990] M.T.Harandi, J.Q.Ning, "Knowledge-Based Program Analysis". IEEE Software, 7(1), pages 74-81, January 1990.
Describes how a knowledge based system of program events and program plans can be built. Pioneer paper.

[Hausler, Pleszkoch, Linger, 1990] P.A.Hausler, M.G.Pleszkoch, R.C.Linger, "Using Function Abstraction to Understand Program Behavior". IEEE Software, 7(1), pages 55-63, January 1990.
Describes the use of proper and prime programs to get function abstraction. Theoretical overview, not implemented. Pioneer paper.
[Holt, 1997] R.C.Holt, "An Introduction to TA: The Tuple Attribute Language", 1997. http://www.swag.uwaterloo.ca/pbs/
Introduces the tuple attribute format.

[Jain, Murty, Flynn, 1999] A.K.Jain, M.N.Murty, P.J.Flynn, "Data Clustering: A Review". ACM Computing Surveys, 31(3), September 1999.
A general review of data clustering and clustering algorithms with respect to various applications.

[Kazman, Carrière, 1998] R.Kazman, S.J.Carrière, "View Extraction and View Fusion in Architectural Understanding". The 5th International Conference on Software Reuse, Victoria, BC, Canada, June 1998.
Describes the Dali workbench for creating different views of a system.

[Kontogiannis, 1997] K.Kontogiannis, "Evaluation Experiments on the Detection of Programming Patterns Using Software Metrics". The Fourth Working Conference on Reverse Engineering (WCRE'97), 1997.
Describes 5 similarity metrics to be used for detecting clones and also uses recall and precision to evaluate them. Experiments on Unix utilities.

[Lung, 1998] C.H.Lung, "Software Architecture Recovery and Restructuring through Clustering Techniques". Third International Conference on Software Architecture, November 1998.
Gives an interesting idea regarding evaluation of clustering using data complexity.

[Lindig, Snelting, 1997] C.Lindig, G.Snelting, "Assessing Modular Structure of Legacy Code Based on Mathematical Concept Analysis". International Conference on Software Engineering, pages 349-359, Boston, 1997.
Nice tutorial on concept analysis. They use subprograms as objects and globals as attributes.

[Maarek, Berry, Kaiser, 1991] Y.S.Maarek, D.M.Berry, G.E.Kaiser, "An Information Retrieval Approach for Automatically Constructing Software Libraries". IEEE Transactions on Software Engineering, 17(8), pages 800-813, August 1991.
Very commonly referred to paper. One of the earlier works on software clustering.

[Martin, Wong, Winter, Müller, 2000] J.Martin, K.Wong, B.Winter, H.A.Müller, "Analyzing xfig Using the Rigi Tool Suite". The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, 2000.
Describes the Rigi approach to Xfig reverse engineering conducted for the reverse engineering competition held at CASCON'99.

[Murty, Jain, Flynn, 1999] "Data Clustering: A Review". ACM Computing Surveys, 31(3), September 1999.
We don't have this reference.

[Müller et al., 2000] H.A.Müller, J.H.Jahnke, D.B.Smith, M.A.Storey, S.R.Tilley, K.Wong, "Reverse Engineering: A Roadmap". The 22nd International Conference on Software Engineering, Limerick, Ireland, pages 49-60, June 2000.
Describes reverse engineering techniques and outlines some of the possible future research areas for this field.

[Müller, Tilley, Orgun, Corrie, Madhavji, 1992] H.A.Müller, S.R.Tilley, M.Orgun, B.Corrie, N.H.Madhavji, "A Reverse Engineering Environment Based on Spatial and Visual Software Interconnection Models". Proceedings of the Fifth ACM SIGSOFT Symposium on Software Development Environments (SIGSOFT'92), Virginia, pages 88-98, in ACM Software Engineering Notes, 17(5), December 1992.
Has an overview of the reverse engineering approach in Rigi and the spatial and visual information represented by it.

[Müller, Uhl, 1990] "Composing Subsystem Structures Using (k,2)-partite Graphs". Proceedings of the 1990 Conference on Software Maintenance (CSM 1990), pages 12-19, November 1990.
The paper has an overview of Rigi's method of subsystem composition by making (k,2)-partite graphs and composing by interconnection strength and common neighbour.
[Müller, Wong, Tilley, 1994] H.A.Müller, K.Wong, S.R.Tilley, "Understanding Software Systems Using Reverse Engineering Technology". The 62nd Congress of L'Association Canadienne Francaise pour l'Avancement des Sciences Proceedings (ACFAS 1994).
Describes Rigi's approach to reverse engineering. Has a good section on reverse engineering approaches and can be used for a literature survey.

[Pressman, 1997] R.S.Pressman, "Software Engineering: A Practitioner's Approach". The McGraw-Hill Companies, Inc, 1997.
Pressman's book on software engineering.

[Nelson, 1996] M.L.Nelson, "A Survey of Reverse Engineering and Program Comprehension". http://citeseer.nj.nec.com/nelson96survey.html, 1996.
A very basic paper describing reverse engineering approaches, definitions and techniques.

[Quilici, 1994] A.Quilici, "A Memory-Based Approach to Recognizing Programming Plans". Communications of the ACM, 37(5), pages 84-93, May 1994.
Has an interesting section on experiments conducted with student programmers on how they understand programs. Describes their plan based approach.

[Quilici, Woods, Zhang, 2000] A.Quilici, S.Woods, Y.Zhang, "Program Plan Matching: Experiments With A Constraint-Based Approach". Science of Computer Programming, 36(2-3), pages 285-302, 2000.
Describes how plan recognition can be improved using a constraint based approach for scalability to complex and bigger systems.

[Rich, Wills, 1990] "Recognizing a Program's Design: A Graph Parsing Approach". IEEE Software, 7(1), pages 82-89, January 1990.
Describes the use of clichés and programming plans. Pioneer paper.

[Sartipi, Kontogiannis, 2001] K.Sartipi, K.Kontogiannis, "Component Clustering Based on Maximal Association". Eighth Working Conference on Reverse Engineering (WCRE'01), 2001.
Supervised clustering. The work relates to concept analysis. Introduces a new similarity measure, component association. Experiments on CLIPS and Xfig.

[Schwanke, 1991] R.W.Schwanke, "An Intelligent Tool for Re-engineering Software Modularity". Thirteenth International Conference on Software Engineering (ICSE'91), pages 83-92, May 1991.
One of the pioneer papers on clustering. Also introduces maverick analysis.

[Schwanke, Platoff, 1989] R.W.Schwanke, M.A.Platoff, "Cross References are Features". Second International Conference on Software Configuration Management, ACM Press, pages 86-95, 1989. Also available as ACM SIGSOFT Software Engineering Notes, 14(7), pages 86-95, November 1989.
Pioneer paper. Introduces the concept of using shared neighbours for the similarity metric.

[Shaw, Garlan, 1996] M.Shaw, D.Garlan, "Software Architecture: Perspectives on an Emerging Discipline". Prentice-Hall Inc., 1996.
This is a book on software architecture and models.

[Siff, Reps, 1997] M.Siff, T.Reps, "Identifying Modules via Concept Analysis". International Conference on Software Maintenance, pages 170-179, IEEE Computer Society, October 1997.
Describes their approach to concept analysis. Has a nice concept analysis primer section. Explains how to identify C++ classes from C code.

[Sim, Storey, 2000] S.E.Sim, M.D.Storey, "A Structured Demonstration of Program Comprehension Tools". The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, 2000.
Describes the results of the use of various tools on the Xfig source code. The demonstration was held at CASCON'99 and the results were presented at WCRE'00.

[Sneath, Sokal, 1973] P.H.A.Sneath, R.R.Sokal, "Numerical Taxonomy". Series of books in biology, W.H.Freeman and Company, San Francisco, 1973.
We don't have this reference.

[Snelting, 1996] G.Snelting, "Reengineering of Configurations Based on Mathematical Concept Analysis". ACM Transactions on Software Engineering and Methodology, 5(2), pages 146-189, April 1996.
One of the introductory papers on concept analysis.

[Tzerpos, Holt, 1998] V.Tzerpos, R.C.Holt, "Software Botryology: Automatic Clustering of Software Systems". Ninth International Workshop on Database and Expert Systems Applications (DEXA'98), Vienna, Austria, August 1998.
A very general paper on clustering. Points to some past trends and future research directions for software clustering.

[Tzerpos, Holt, 1999] V.Tzerpos, R.C.Holt, "MoJo: A Distance Metric for Software Clustering". The Sixth Working Conference on Reverse Engineering (WCRE'99), Atlanta, October 1999.
Introduces a new metric for comparing clustering approaches. The metric evaluates the similarity between two different partitions of a system.

[Tzerpos, Holt, 2000] V.Tzerpos, R.C.Holt, "ACDC: An Algorithm for Comprehension-Driven Clustering". The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, 2000.
Describes the use of various features for subsystem patterns and Orphan Adoption, and has a nice literature survey section.

[Visser, 2001] E.Visser, "A Survey of Rewriting Strategies in Program Transformation Systems". First International Workshop on Reduction Strategies in Rewriting and Programming (WRS 2001). Also in Electronic Notes in Theoretical Computer Science, Volume 57, http://www.elsevier.nl/locate/entcs/volume57.html, 2001.
Very detailed paper on the various program transformation systems.

[Website GXL] http://www.gupro.de/GXL/
GXL's website.

[Website Xfig] http://www.xfig.org

[Website Xfig RSF] http://www.rigi.csc.uvic.ca/~jmartin/rigi-projects/
Website where RSF for a couple of projects can be found. The RSF was generated by Rigi.

[Website SORTIE] http://www.csr.uvic.ca/chisel/collab/casestudy.html
CASCON's website. Has related information on the SORTIE project (a demonstration for the evaluation of various reverse engineering tools).

[Wiggerts, 1997] T.A.Wiggerts, "Using Clustering Algorithms in Legacy Systems Remodularization". Fourth Working Conference on Reverse Engineering (WCRE'97), October 1997.
Very nice literature survey and a good review of similarity metrics and clustering algorithms. A good starting point for clustering.