Metarule-Guided Association Rule Mining for Program Understanding
Technical Report CS-014(1)-2004
Onaiza Maqbool, Haroon Babri
Computer Science Department, Lahore University of Management Sciences (LUMS)

Acknowledgment

We would like to thank the Lahore University of Management Sciences for providing funding for this research. Thanks to Rainer Koschke of the Bauhaus group at the University of Stuttgart for providing the RSF for CVS, Aero, Mosaic and Bash, and to Johannes Martin of the Rigi group, University of Victoria, for providing the RSF for Xfig.

List of Figures

Figure 1: Our rule mining approach
Figure 2: Global variable access (Xfig)
Figure 3: Global variable access (CVS)
Figure 4: Global variable access (Aero)
Figure 5: Global variable access (Mosaic)
Figure 6: Global variable access (Xfig)
Figure 7: Global variable access breakdown
Figure 8: Function calls (Xfig)
Figure 9: Function calls (CVS)
Figure 10: Function calls (Aero)
Figure 11: Function calls (Mosaic)
Figure 12: Function calls (Xfig)
Figure 13: Function calls (d_*files sub-system)
Figure 14: Function calls (e_*files sub-system)
Figure 15: Function calls (f_*files sub-system)
Figure 16: Function calls (u_*files sub-system)
Figure 17: Function calls (w_*files sub-system)
Figure 18: Types accessed by a subsystem (d_*files)
Figure 19: Types accessed by a subsystem (e_*files)
Figure 20: Types accessed by a subsystem (f_*files)
Figure 21: Types accessed by a subsystem (u_*files)
Figure 22: Types accessed by a subsystem (w_*files)
Figure 23: Function, global variable and user defined type statistics
Figure 24: Statistics for function calls, global variable accesses and user defined types accesses
Figure 25: Global variable access statistics
Figure 26: A comparison of accesses to global variables
Figure 27: Function call statistics
Figure 28: A comparison of calls to functions
Figure 29: Frequently accessed functions
Figure 30: A comparison of frequently accessed functions

List of Tables

Table 1: A set of transactions representing sales of items
Table 2: A sample set of transactions illustrating items used in our experiments
Table 3: Re-engineering patterns and metarules
Table 4: Test system characteristics
Table 5: Statistics of Xfig sub-systems
Table 6: Global variables usage summary
Table 7: Global variable access breakdown in test systems
Table 8: Global variables accessed by one function
Table 9: Functions called by one function
Table 10: Functions called together with high confidence
Table 11: Function call summary
Table 12: Function calls made in various sub-systems
Table 13: Number of function calls made to functions in each sub-system
Table 14: Five or more function calls made to functions in each sub-system
Table 15: Global variables with high association
Table 16: User defined types with high association
Table 17: Types accessed by functions in the test systems
Table 18: Global variables accessed by maximum number of functions
Table 19: Accesses to user defined types in various Xfig sub-systems

Table of Contents

1. Introduction
2. Related work
3. An overview of association rule mining
4. Re-engineering patterns for structured legacy systems
5. Metarule-guided association rule mining for program understanding
5.1. Selecting items and transactions
5.2. Setting appropriate thresholds for association rules
5.3. Selecting interesting association rules in the re-engineering context
5.4. Metarules and re-engineering patterns
6. Experiments and results
6.1 Pattern 1: Reduce global variable usage
6.2 Pattern 2: Localize variables
6.3 Pattern 3: Increase locality of reference
6.4 Pattern 4: Identify utilities
6.5 Pattern 5: Increase data modularity
6.6 Pattern 6: Strengthen encapsulation
6.7 Pattern 7: Select appropriate storage classes
6.8 Pattern 8: Generate alternative views
6.9 Discussion of results
7. Conclusions
References
Appendix

1. Introduction

Legacy systems are old software systems that are crucial to the operation of a business. These systems are expected to have undergone changes in their lifetime due to changes in requirements, business conditions and technology. It is quite likely that such changes were made without proper regard to software engineering principles. The result is often a deteriorated structure, which is unstable but cannot be discarded because it is costly to do so. Another reason for retaining these legacy systems is that they embed business knowledge which is not documented elsewhere. Since it is often not feasible to discard a system and develop a new one, techniques must be employed to improve the structure of the existing system.
An effective strategy for change must be devised. Strategies for change mentioned in [1] include maintenance, architectural transformation and re-engineering. Re-engineering is a process that re-implements legacy systems to make them more maintainable [1]. According to [2], re-engineering is any activity that improves one's understanding of software, or prepares/improves the software itself, usually for increased maintainability, reusability or evolvability. Since software maintenance usually accounts for over 50% of project effort [1], [3], [4], making it the single most expensive software engineering activity [1] and perhaps the most important life cycle phase [5], the need for re-engineering to ease the maintenance effort is justified. The re-engineering option should be chosen when system quality has been degraded by regular change but change is still required, i.e. the system under consideration has low quality but high business value, and the re-engineering effort is less risky and less costly than system replacement.

The re-engineering effort starts with gaining an understanding of the software system, a process known as reverse engineering. Understanding is critical to many activities including maintenance, enhancement, reuse, design of a similar system and training [6]. Reverse engineering has been heralded as one of the most promising technologies to combat the legacy systems problem [7]. However, gaining system understanding is difficult because documentation for the system is often not available and source code files are the only means of information regarding the system. According to [8], system understanding takes up 47% of the software maintenance effort. Hall [9] places the system understanding effort at 47%-62%. Tools and techniques are thus required to make the program understanding task easier.
Tools provide automated support for system understanding at the procedural level by extracting the procedural design, or at a higher level by extracting the architectural design. Tools for procedural design extraction are based on techniques including knowledge-based systems with inference rule engines [10], graph parsing [11] and program plan recognition [12]. Techniques utilized by architectural design recovery tools include the composition of sub-system hierarchies using (k,2)-partite graphs [13], graph transformations using relational algebra [14] and view extraction and fusion using SQL [15].

In the past, the application of deductive techniques to different aspects of software engineering, including program understanding, has been more frequent than the use of inductive techniques [16]. Deductive techniques usually employ a knowledge base as their underlying technology, which is used to deduce relations within the software system. Great effort is required to build the knowledge base and continuously maintain it. Moreover, algorithms used for deductions may be computationally demanding [16]. Researchers have thus started exploring the use of inductive or data mining techniques in software engineering. Data mining is considered one of the most promising interdisciplinary developments in the information industry [17].

In this report, our focus is on the use of data mining, or more specifically, association rule mining, for discovering interesting relationships within legacy system components which lead to program understanding. These relationships can be used to identify recurring problems within legacy systems, and are thus related to re-engineering patterns. Patterns were first adopted by the software community as a way of documenting recurring solutions to design problems [18]. Since then, they have been used as an effective means to communicate best practice in various aspects of software development including re-engineering.
Re-engineering patterns for object-oriented legacy systems were identified based on experiences during the FAMOOS project [19]. Our focus in this work is on traditional legacy systems developed using the structured approach. This report presents an approach to gain insight into the structure of these legacy systems by applying metarule-guided association rule mining. Metarule-guided association rule mining is carried out as a two-step process. In the first step, we formulate metarules based on knowledge of typical problems in legacy systems. These metarules place constraints on the form of the association rules to be mined as well as on the values of their interestingness measures. In the second step, metarule-guided mining is applied to the legacy system source code to mine association rules. By relating the metarules to re-engineering patterns, system understanding is gained, which allows suggestions for making subsequent changes and optimizations to the source code for better maintainability.

Thus we make two major contributions in this work. First, we show how metarule-guided association rule mining can be used to mine legacy system source code to gain insight into its structure and detect re-engineering opportunities. Second, we formulate re-engineering patterns for structured legacy systems and relate each metarule to a re-engineering pattern, thus presenting solutions to the problems identified by a metarule.

The organization of this report is as follows. Section 2 gives an account of related work. In section 3 we present an overview of association rule mining. Section 4 introduces re-engineering patterns. Section 5 details our approach and presents metarules and related patterns. Section 6 gives the results of applying association rule mining to test systems. Finally, we present the conclusions in section 7.

2. Related work

Based on studies of the IBM programming process and the OS/360 operating system, Lehman proposed laws that guide the evolution of E-type software systems [20]. A study of the evolution of the Logica plc Fastwire (FW) financial transaction system, reported in 1997, supports most of the laws formulated earlier [21]. Both these studies focus on system growth in terms of the number of modules within the system in order to determine trends in software evolution. More recently, Godfrey and Tu carried out studies to understand the evolution of open source software [22], [23]. An interesting observation is that the growth rate of most open source software systems is super-linear, violating Lehman's law of software evolution which suggests that the growth of systems slows down as their size increases. Godfrey and Tu also present an approach for studying the architectural evolution of software systems [24], which integrates visualization and software metrics. The evolution status of various entities is modeled by determining which entities are new, have been deleted, or remain unchanged. Unlike these studies by Lehman and Godfrey, which model structural changes at the entity level, our selection of items in the form of functions, global variables and user defined types enables the study of the characteristics of legacy system evolution at a more detailed level.

Visualization can also be used to gain an understanding of a legacy system's overall structure and its evolution. Lanza et al. [25] propose a lightweight approach to understanding object-oriented legacy systems using a combination of software metrics and visualization. They visualize various aspects of the system: for example, a system hotspot view helps in identifying very large and small classes, while a system complexity view is based on inheritance hierarchies. These views are helpful in indicating problematic areas which may require deeper study.
The evolution of legacy object-oriented systems can also be studied using the same lightweight approach [26].

The use of data mining for software understanding is gaining popularity, as evidenced by recent work in this area. One possible categorization of the work done would be according to the mining technique employed. Popular techniques include association rule mining, concept learning and cluster analysis. Another useful categorization is according to the application or purpose of the work. Keeping in view that some researchers employ more than one technique to arrive at results, in this report we chose to categorize research according to its purpose, e.g. architecture recovery, re-modularization, maintenance support, facilitating re-use, interaction pattern mining and program comprehension.

The architecture recovery of software systems using data mining techniques is discussed in [27] - [30]. The data mining technique employed in these papers is primarily association rule mining. Identification of sub-systems based on associations (ISA) was proposed by Oca and Carver [27]. They used association rule mining for the extraction of data cohesive sub-systems by grouping together programs that use the same data files. Experiments were performed on a 25 KLOC Cobol system with 28 programs and 36 data files, and sub-systems were successfully identified. However, the results obtained were not compared with any existing documentation or decomposition. A very similar approach is that of [28], where the use of a representation model (RM) is discussed to represent the sub-systems identified using the ISA methodology in [27]. Sartipi et al. [29] discuss a technique for recovering the structural view of a legacy system's architecture based on association rule mining, clustering and matches with architectural plans. Rather than using programs as sub-system entities as in [27], functions are used.
Moreover, associations between functions are determined based on the variables and types they access, and the function calls they make. After closely associated function groups have been identified, a branch and bound algorithm is used for matching with architectural plans. In experiments performed with the CLIPS system (40 KLOC, 734 functions, 59 aggregate data types and 163 global variables), a precision level of 90% and a recall level of 60% is maintained between the conceptual and concrete architecture recovered through the defined process. Tjortjis et al. [30] also employ association rule mining to arrive at decompositions of a system at the function level. Their rule mining approach is similar to the one discussed in [29], where attributes are variables, data types and calls to functions. Groups of functions, i.e. sub-systems, are created by finding common attributes participating in the same association rules. Experiments performed on a Cobol system, with programs of an average size of 1000 lines of code, show that the results, compared with a mental model constructed by a developer of the system, are satisfactory.

Software re-modularization using clustering techniques is discussed in [31] - [36]. Tzerpos and Holt [31] present the case for using clustering techniques to re-modularize software, after the techniques have been adapted to fit the peculiarities of the software domain. In [32], Wiggerts provides a framework to apply cluster analysis for re-modularization. The clustering process is described in some detail, along with commonly used similarity measures and clustering algorithms. Experiments with clustering as a re-modularization method are described in [33] and [34]. Both papers conclude by recommending similarity measures and clustering algorithms which yield good experimental results for software artifacts. A theoretical explanation of some previous experimental results obtained by researchers in the area is provided in [35].
The paper also describes a new clustering algorithm, which gives better results than the currently employed algorithms for clustering software. In [36], the weighted combined algorithm for software clustering is presented. This algorithm shows improvement in clustering results over previously employed clustering algorithms.

Data mining techniques have also been employed to support the maintenance task [16], [37] - [39]. The use of inductive techniques to extract a maintenance relevance relation (MRR) from the source code, maintenance logs and historical maintenance update records is discussed in [16], [37], [38]. An MRR simply indicates that if a software engineer needs to understand file1, he/she probably also needs to understand file2. The problem has been presented as a concept learning problem, and a decision tree classifier is used for classifying file pairs as relevant, not relevant and potentially relevant. Relevance indicates that two files were modified in the same update, and potential relevance indicates that both files were looked at in the same update. Experiments performed on the SX2000 system, a large legacy telephony switching system with 1.9 MLOC and 4700 files, show that the 2-class problem, with relevant and non-relevant classes only, yields better results than the 3-class problem. It is also seen that combining text-based features with syntactic features yields better results than using syntactic features alone. Zimmermann et al. [39] apply association rule mining to ease maintenance by mining version histories. Association rule mining is used to predict likely changes after a change has been made, prevent errors due to incomplete changes and detect coupling between items which is not revealed by program analysis. Mining is carried out through the ROSE tool developed for this purpose. The tool reads a version archive and groups changes into transactions so that rules describing them are formed.
ROSE was tested on 8 open source projects and was found to be a helpful tool in suggesting further changes and in warning about missing changes.

Software re-use can be facilitated by data mining techniques [40] - [45]. Software library reuse patterns have been mined using generalized association rules in [40]. The discovered patterns serve as guides to identify typical usage of the software library, i.e. the combination of classes and member functions that are typically re-used by applications. The paper extends earlier work by the author [41], in which association rules rather than generalized association rules were used. Re-use patterns for the KDE core libraries version 1.1.2 were mined by analyzing 76 applications. Reusable components are identified by finding similarities between components using lexical rather than structural techniques in [42]. McCarey et al. [43] utilize collaborative filtering for recommending reusable components by enabling prediction of the utility of an item based on the user's previous history and the opinions of other like-minded users. The building of a digital library for source code to facilitate reuse of code segments is discussed in [44]. Garg [45] discusses the sharing of knowledge of re-usable components across multiple projects.

El-Ramly et al. [46] apply sequential data mining to user activities to discover interaction patterns depicting how users interact with systems. Interaction pattern mining is applied to legacy and web-based systems. For legacy systems, these patterns reflect active services. For web-based systems, the patterns can be used for re-engineering the website for easier and faster access.

Program comprehension can be aided by source code mining [47], [48]. Balanyi and Ferenc [47] automatically search for design patterns in C++ code. A tool called Columbus is used to build an internal representation of the source code, which is compared with pattern descriptions written in DPML, a language based on XML.
Four publicly available C++ projects were used for experiments. Some problems were faced because of rule violations in implementation, and when the structures of patterns were similar. However, the overall results were satisfactory. The comprehension of C++ programs by clustering together similar entities based on their attributes is proposed in [48]. Four entities are used: classes, member data, member functions and parameters. Results of applying the process to three open source systems have been presented. Analysis of the results reveals correlations between classes, thus revealing portions of code that have common characteristics and are expected to change together.

A workshop for mining software repositories for assisting in program understanding and studying evolution was held recently [49]. Papers presented in the workshop covered various aspects, including the infrastructure required for extraction of information, use of mining for program understanding, identification of change patterns, defect analysis, software re-use and process and community analysis.

The present work employs association rule mining to find interesting associations between global variables, user defined types and function calls within a system and relates them to re-engineering patterns. Although association rule mining has been employed by other researchers, e.g. Sartipi et al. [29] and Tjortjis et al. [30], their purpose and technique are different from the ones proposed in this work. The purpose of association rule mining in the case of Sartipi et al. is architectural design recovery, i.e. decomposition of the legacy system into modules, where a module is a collection of functions, data types and variables. The item set used by them consists of global variables, data types and function calls within the system. The transaction set consists of the global variables and data types accessed and the functions called by a function.
The Apriori algorithm is used to generate frequent item sets, which indicate interesting associations between functions. Functions that are associated are candidates for being placed in the same module. The architectural query language (AQL) is used to describe a system's conceptual architecture. Based on this architecture and the frequent item sets, clustering and a branch and bound algorithm are used to instantiate the concrete architecture. Although the item and transaction sets employed in [29] are similar to the ones we employ in this work, it is evident that the problem addressed is very different. Whereas Sartipi et al. use association rules to indicate associations between functions to discover modules, in this work our purpose in applying association rule mining is to gain program understanding and indicate problem areas where improvements may be made. To achieve this purpose we find associations not only between functions, but between functions and other items as well, e.g. between global variables and user defined types.

Tjortjis et al. also employ association rule mining for grouping together similar entities within a software system. The item set used by them consists of variables, data types and calls to blocks of code (modules), where modules may be functions, procedures or classes. The transaction set thus consists of the variables and types accessed and the calls made by modules. In their algorithm, large item sets are first generated by finding item sets that have a higher support than a user-defined threshold. From these item sets, association rules with confidence greater than a user-defined threshold are generated. Finally, groups of modules are created based on the number of common association rules. Thus in their case, similar to the rule mining approach in [29], rules are mined to group modules.
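In all of these approaches, frequent item set generation is the basic building block. The following is a minimal sketch of Apriori-style generation of the kind relied on in [29] and [30]; it is an illustration only, not any of the cited authors' implementations, and the transaction contents (attribute names such as "var_buf") are invented for the example:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent item sets with support >= min_support.

    transactions: list of sets of items; min_support: fraction of transactions.
    A size-(k+1) candidate is kept only if all of its k-subsets are frequent
    (the Apriori pruning property).
    """
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Frequent 1-item sets
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    result, k = {}, 1
    while current:
        result.update({s: support(s) for s in current})
        # Join frequent k-item sets to form (k+1)-item candidates, then prune
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        current = {c for c in candidates if support(c) >= min_support}
        k += 1
    return result

# Illustrative transactions: each is the set of attributes (variables, types,
# called functions) used by one function, mirroring the style of [29], [30].
transactions = [
    {"var_buf", "type_node", "call_log"},
    {"var_buf", "type_node"},
    {"var_buf", "call_log"},
    {"type_list"},
]
frequent = apriori(transactions, min_support=0.5)
```

With these invented transactions, the frequent item sets at a 50% support threshold include {var_buf, type_node} and {var_buf, call_log}, while {type_list} is pruned.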
We, on the other hand, use metarule-guided association rule mining to find associations between items, which can be used to identify potential problems in a system. By relating our mined association rules to re-engineering patterns, we find solutions to commonly occurring problems. To the best of our knowledge, research has not been conducted to relate association rule mining to re-engineering patterns as is done in this work.

3. An overview of association rule mining

Association rule mining is a data mining technique that finds interesting association or correlation relationships among a large set of data items [17]. Traditionally, association rule mining has been employed as a useful tool to support business decision making by discovering interesting relationships among business transaction records. To illustrate the concept of association rule mining, consider a set of items I = {i1, i2, …, in}. Let D be a set of transactions, with each transaction T corresponding to a subset of I. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I and A ∩ B = ∅.

As an example, consider a set of computer accessories (CDs, memory sticks, microphones, speakers) that are available at a certain store. These accessories form the set of items I of interest to us. Every sale made represents a transaction T. Suppose the sales made are represented by the following set of transactions D in Table 1:

Transaction ID | Items
T1 | CD, memory stick
T2 | CD
T3 | microphone, speaker
T4 | CD, speaker, microphone
T5 | memory stick, microphone, speaker

Table 1: A set of transactions representing sales of items

Association rules in the above case represent items that tend to be sold together, e.g. the association rule microphone ⇒ speaker shows that those who buy microphones also tend to buy speakers. This association rule can also be written as buys(X, "microphone") ⇒ buys(X, "speaker"), where the variable X represents customers who bought the items.
Since this association rule contains a single predicate (buys), it is referred to as a single-dimensional association rule. When more than one predicate is present, the rule is referred to as multidimensional. A large number of such association rules may exist in a given set of transactions, and not all of them may be of interest, e.g. the association rule CD ⇒ speaker can be inferred from the transaction set in Table 1, but this association is clearly not as strong (or interesting) as the one between microphones and speakers. An association rule is said to be interesting if it is easily understood, valid, useful, novel, or validates a hypothesis that the user sought to confirm [17]. To find interesting rules, support and confidence are commonly used as objective measures of interestingness. Support represents the percentage of transactions in D which contain both A and B. Confidence is the percentage of transactions in D containing A that also contain B. Another measure of interestingness is coverage: the coverage of an association rule is the proportion of transactions in D that contain A. In terms of probability:

Support(A ⇒ B) = P(A ∪ B)
Confidence(A ⇒ B) = P(B | A)
Coverage(A ⇒ B) = P(A)

In Table 1, microphone ⇒ speaker is an association rule with support 3/5, confidence 1, and coverage 3/5. This rule is more interesting than CD ⇒ speaker, which has support 1/5, confidence 1/3, and coverage 3/5. To make the data mining process more effective, a user may place various constraints under which mining is to be performed. These include interestingness constraints, which specify thresholds on measures of interestingness (such as support, confidence and coverage), and rule constraints, which place restrictions on the number of predicates and/or on the relationships that exist among them, based on the analyst's experience, expectations or intuition regarding the data [17].
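The three measures can be computed directly from the transactions in Table 1. The following minimal Python sketch (the function and variable names are ours, not from the report) reproduces the values quoted above:

```python
# Transactions from Table 1, each represented as a set of items.
transactions = [
    {"CD", "memory stick"},                      # T1
    {"CD"},                                      # T2
    {"microphone", "speaker"},                   # T3
    {"CD", "speaker", "microphone"},             # T4
    {"memory stick", "microphone", "speaker"},   # T5
]

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(A, B):
    # Fraction of transactions containing both A and B.
    return count(A | B) / len(transactions)

def confidence(A, B):
    # Fraction of transactions containing A that also contain B.
    return count(A | B) / count(A)

def coverage(A, B):
    # Fraction of transactions containing A.
    return count(A) / len(transactions)

A, B = {"microphone"}, {"speaker"}
print(support(A, B), confidence(A, B), coverage(A, B))  # 0.6 1.0 0.6
```

Running the same functions on CD ⇒ speaker yields support 0.2 and confidence 1/3, confirming that the microphone ⇒ speaker rule is the stronger of the two.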
Rule constraints thus specify the form of the association rules to be mined and are expressed as rule templates or 'metarules'. As an example, suppose we have customer-related data for the sales in Table 1, and we want to associate customer characteristics with sales of speakers. A metarule for the information in which we are interested is of the form P(X,Y) ⇒ buys(X, "speaker"), where variable X represents a customer and variable Y takes on values of the attribute assigned to the predicate variable P. During the mining process, rules which match this metarule are found. One example of a matching rule is age(X, "young") ⇒ buys(X, "speaker"). Since both the predicate variable P and variable Y may vary, the rules gender(X, "male") ⇒ buys(X, "speaker") and age(X, "old") ⇒ buys(X, "speaker") also satisfy the metarule.

4. Re-engineering patterns for structured legacy systems

Patterns have been used in various disciplines to enable collective experiences to be documented and shared. In the software engineering domain, they have been used effectively to communicate best practices in various aspects of software development, including the development process, design, testing and re-engineering. As pointed out in section 1, re-engineering is carried out to improve the structure of legacy software systems to make them more maintainable. Although there are different reasons for re-engineering, e.g. improving performance, porting the system to a new platform, or exploiting new technology, the actual technical problems within legacy systems are often similar, and hence some general techniques or patterns can be utilized to aid in the re-engineering task [18]. Re-engineering patterns record experiences about understanding and modifying legacy software systems and thus present solutions to recurring re-engineering problems.
Mens and Tourwe [50] point out that understanding what a legacy system does, which parts of it to change, and the potential impact of those changes is a significant problem. Unlike forward engineering, re-engineering is not supported by established processes. Re-engineering patterns fill this gap by providing expertise that can be consulted. The re-engineering cycle typically involves some form of reverse engineering to understand the system and to identify problem areas, followed by forward engineering to restructure the system. The re-engineering patterns in [18] focus on the re-engineering cycle of object-oriented legacy systems, although some of the identified patterns may apply to software systems which are not object-oriented. Examples of typical problems in object-oriented legacy systems include lack of cohesion within classes, misuse of inheritance and violation of encapsulation [18]. In this work, we present re-engineering patterns for legacy systems developed using the structured approach. These patterns help in identifying recurring problems within structured code and offer suggestions for improvement. Like other software patterns, they capture structural-level problems that may not be evident from any one part of a legacy system. It is relevant to note that the re-engineering patterns in [18] capture the entire re-engineering cycle, from 'setting the right direction' and 'first contact' with the system, to actually restructuring (in the context of object-oriented systems, refactoring) the system to remove some of the problems which appear in object-oriented code as it evolves. Most of the patterns that deal with the initial phases of re-engineering apply to object-oriented as well as structured systems. However, most of the patterns dealing with restructuring address problems within object-oriented code.
In this work, we do not attempt to re-capture all re-engineering patterns since, as already pointed out, many patterns, especially those involving the initial re-engineering phases, are equally valid for structured systems. Our focus is on identifying problems which arise in structured systems as they evolve. Thus the patterns identified in this work complement those identified in [18] by addressing some typical problems within legacy structured systems.

5. Metarule-guided association rule mining for program understanding

In this work, we employ association rule mining to aid the design discovery process of structured legacy systems. To make our mining process more effective, we use constraint-based mining. Constraints are placed on the form of association rules to be mined and on interestingness measure values. These constraints are specified as metarules. For simplicity, in the following sections a metarule refers to a combination of association rule form and interestingness measure thresholds, rather than a constraint on association rule form only, because we employ such a combination to reveal system characteristics. The metarules we develop are based on knowledge of some typical problems in legacy systems and serve as templates to mine meaningful association rules. Each metarule is related to a re-engineering pattern. Re-engineering patterns focus on the process of discovery, i.e. they discover an existing design, identify problems and then present appropriate solutions [18]. A metarule related to a re-engineering pattern aids in this discovery process by capturing symptoms of typical problems that arise in legacy software systems.

[Figure 1 diagram: in step 1, a body of knowledge about legacy systems is used for metarule formulation; in step 2, the metarules guide the association rule mining process applied to the test legacy systems; the mined association rules are related to re-engineering patterns.]
Figure 1: Our rule mining approach

Figure 1 illustrates our rule mining approach as a two-step process. In the first step, we use knowledge of generic problems within structured legacy systems to arrive at metarules. In the next step, metarule-guided mining is carried out on the test legacy systems to find association rules which serve as indicators of problems present in the system. We relate metarules to re-engineering patterns by describing how a re-engineering pattern presents a solution to problems within a legacy system, as identified by the mined association rules. In this section, we discuss various factors that must be taken into account during the formulation of metarules. Results of carrying out metarule-guided mining on test systems are described in section 6.

5.1. Selecting items and transactions

A metarule can be thought of as a hypothesis regarding relationships that a user is interested in confirming [17]. As a first step, it is necessary to identify items between which relationships are to be mined. In our case, the guiding principle is to choose items which facilitate understanding of the source code and identification of problems, and which also allow suggestions for restructuring the code for greater maintainability. Most existing legacy software systems have been developed using the structured approach, with functions or routines forming the basic components. Moreover, in legacy software the use of global variables is often widespread, leading to difficulty in understanding the code. In view of these facts, we decided to use functions and global variables as items. We also decided to use user defined types, since user defined types become potential data objects when code is restructured as object-oriented code.
To illustrate the item and transaction sets, let F = {f1, f2, ..., fn} be the set of all functions, G = {g1, g2, ..., gk} the set of all global variables, and U = {u1, u2, ..., um} the set of all user defined types present in a software system. The item set I is the union of these sets, i.e. I = F ∪ G ∪ U. Each transaction T corresponds to a subset of I and denotes the functions called, and the global variables and user defined types accessed¹, by a function within the software system. As an example, consider a software system consisting of two functions (f1 and f2) only. Suppose f1 accesses global variables g1 and g2, user defined types u1, u3 and u5, and calls the function f2. Suppose f2 accesses global variables g2 and g3, user defined types u1, u2 and u3, and does not call any function. The transaction set for this system is presented in Table 2.

Calling function   Global variables   User defined types   Called functions
f1                 g1, g2             u1, u3, u5           f2
f2                 g2, g3             u1, u2, u3           -

Table 2: A sample set of transactions illustrating items used in our experiments

Thus in the transaction set that we use, each transaction consists of items in the form of the global variables and user defined types accessed by a function, and the function calls made by the function. Therefore, there are as many transactions as there are functions in the software system. To obtain results that are meaningful in the re-engineering context, the function which accesses the global variables and user defined types and makes the function calls is also treated as an item.

5.2. Setting appropriate thresholds for association rules

Association rules mined using rule constraints which restrict the rule form may not all be interesting. To find interesting association rules which indicate problems or re-engineering opportunities, we also require interestingness constraints. Traditionally, high thresholds of support, confidence and coverage have been used for finding interesting association rules [17].
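Building the transaction set from extracted facts can be sketched as follows. The fact tuples below are illustrative only (the report derives them from parsed source code); each transaction collects the globals, types, and callees of one function:

```python
# Hypothetical extracted facts for the two-function example above.
accesses_global = [("f1", "g1"), ("f1", "g2"), ("f2", "g2"), ("f2", "g3")]
accesses_type = [("f1", "u1"), ("f1", "u3"), ("f1", "u5"),
                 ("f2", "u1"), ("f2", "u2"), ("f2", "u3")]
calls = [("f1", "f2")]

def build_transactions(functions):
    """One transaction per function: the items it accesses or calls."""
    transactions = {}
    for f in functions:
        items = set()
        items |= {g for fn, g in accesses_global if fn == f}
        items |= {u for fn, u in accesses_type if fn == f}
        items |= {callee for fn, callee in calls if fn == f}
        transactions[f] = items
    return transactions

t = build_transactions(["f1", "f2"])
# t["f1"] == {"g1", "g2", "u1", "u3", "u5", "f2"}
# t["f2"] == {"g2", "g3", "u1", "u2", "u3"}
```

The resulting dictionary corresponds row-by-row to Table 2, with one transaction per function in the system.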
We use both low and high thresholds of the three measures: coverage, support and confidence. Low thresholds can reveal interesting facts about the software and provide insight into its structure, as illustrated by the patterns we describe in section 5.4. In general, assigning fixed values to thresholds, such as 0.7 for high and 0.3 for low, is not desirable for the following reasons [18]. Firstly, threshold values should be selected based on the coding standards used by the development team, and these may not be available. Secondly, tools that employ thresholds often display only abnormal entities, so the user is not aware of the number of normal entities, which may result in a distorted view of the system. A study on software metrics undertaken with various threshold values [51] revealed that observations were independent of the threshold value. For these reasons, it may not be meaningful to fix an absolute threshold, since system characteristics such as size and the number of global variables, user defined types and functions can vary widely from system to system. Since the confidence measure does not depend on these system characteristics and relates only to the association between two items, setting a threshold for confidence is less problematic than for the other measures. Instead of setting absolute thresholds, researchers often advocate other techniques for interpreting results. Demeyer et al. [18] suggest that data be presented in a table and sorted according to various measurement values to obtain a meaningful interpretation. To determine a relationship among data points describing one or two variables, Fenton and Pfleeger [52] use box plots and scatter plots. Box plots utilize the median and quartiles to represent the data and suggest outliers. Whereas a box plot hides most of the expected

¹ By an access, we mean a reference to a variable by using or setting its value, or taking its address.
behaviour and shows unusual values, a scatter plot graphs all the data points to reveal trends. Histograms are also used to reveal the distribution of data. For displaying and interpreting results, we adopt the following approach. Each re-engineering pattern in section 5.4 has an associated metarule, which specifies the association rule form. For metarules involving low or high thresholds of support and coverage, we mine the system to find all association rules of the form specified by the metarule. As suggested in [18], [52], the entire results are displayed using histograms or other meaningful graphs. High and low thresholds for the support and coverage measures are then determined for the system under consideration interactively, by visually inspecting the results of metarule-guided mining. Interactive determination of thresholds may be simplified by indicating certain measures, e.g. the mean, median and quartiles, on the graph.

5.3. Selecting interesting association rules in the re-engineering context

It is relevant to note that if we employ user defined types, functions and global variables as items, and use the coverage, support and confidence interestingness measures with thresholds of low, high and one, the number of possible metarules is quite large². Useful metarules need to be filtered using subjective interestingness measures. These measures are based on user beliefs about the data and find rules interesting if they are unexpected or offer information on which the user can act [17]. We reduce the total possible metarules to a small subset by keeping in mind that our purpose is to use these metarules to identify re-engineering opportunities. Thus we select metarules that highlight some typical problems in legacy structured systems. Since we use items based on the set of global variables, user defined types and functions, our metarules are further restricted to problems related to this item set.
Among the typical problems that occur in legacy code is lack of modularity, i.e. strong coupling between modules, which hampers evolution [18]. Although software is expected to evolve over time, the guideline is that evolution should not degrade the quality of the software. One way to reduce coupling between modules, and thus reduce side effects when changes are made, is to reduce the number of global variables or to reduce their scope. Metarules 1 and 2, which are related to patterns 1 and 2 (Table 3), are used to detect global variables that can be localized and thus reduce coupling. Another means to ensure that side effects are detected and do not propagate throughout the system is to identify coupled functions and place them together. Metarule 3, related to pattern 3 (Table 3), helps in identifying these coupled functions. Duplicated functionality is another typical problem in legacy systems, since quite often various teams re-implement similar functionality [18]. To avoid this problem, functions should share code by making use of utility functions. The use of utilities is facilitated if these utilities are identified and catalogued for reference. Metarule 4, related to pattern 4 (Table 3), deals with the identification of utilities. Legacy systems may need to be modularized so that they can be more easily understood and maintained. One modularization option is to convert a structured system to an object-oriented system by identifying data items that are potential objects, and related functions which act upon these objects. Metarules 5 and 6 address this issue (Table 3). Patterns and related metarules are briefly described in section 5.4; details are presented in the appendix.

5.4. Metarules and re-engineering patterns

According to our description of metarules in the introduction to section 5, a metarule places a restriction on the form of the association rules to be mined, as well as on interestingness measure values.
Rule constraints are restrictions that may be placed, for example, on the relationships between predicates, or on the values that these predicates may assume.

² If we consider global variables, user defined types, and functions as items and restrict the maximum number of predicates on the left- and right-hand sides of association rules to one, the possible number of metarules is 3² = 9. Differentiating between calling functions and called functions increases this number to 4² = 16. Additionally, we have three interestingness measures, each of which can have three possible values; thus the number of metarules is 4² × 3³ = 432. This number is even larger if we remove the restriction on the maximum number of predicates.

As an example, consider a metarule of the form:

Global(X,Y) ⇒ Accessed by(X)   (M1)

A more general form of this metarule is P1(X,Y) ⇒ P2(X), where P1 and P2 are predicate variables that are instantiated to the attributes Global and Accessed by in M1. Thus in M1, a restriction is placed on the attribute to which a predicate is instantiated, as well as on the relationship between the predicates. Variable X represents functions, and variable Y represents global variables used by the function. When association rules are mined using this template, the global variables accessed by the various functions in the software system are returned. To find association rules that are meaningful in the re-engineering context, an additional constraint is placed on interestingness measures. Thus the complete metarule (for details, refer to Table 3 and the appendix) is represented as:

Global(X,Y) ⇒ Accessed by(X)   Coverage: Low   (M2)

When association rules are mined using this template, we are able to identify a re-engineering opportunity, i.e. global variables which are used by few functions within the system. In this work, we use the convention in M2 to describe metarules. Predicate names (attributes) are self-explanatory. Some metarules which are interesting in the re-engineering context, their implications, and the related re-engineering patterns are shown in Table 3. In some of the metarules identified, only one of the three interestingness measures has been used; this indicates that the values of the other two measures do not influence the result. However, it is possible to identify metarules where a combination of measures gives interesting results, as is evident from pattern 5. We discuss the benefits of employing the re-engineering patterns related to metarules, as well as issues and problems, in the appendix. There are various popular forms for expressing patterns, e.g.
the Alexandrian form, the Gang of Four (GOF) form and the Coplien form [53]. To describe our re-engineering patterns, we adapt the pattern form used by Demeyer et al. [18]. Related patterns are grouped into categories to aid comprehension.

Pattern category 1: Enhance modularity and control side effects

Pattern 1: Reduce global variable usage
Intent: Remove global variables that are used by very few functions.
Metarule 1 — Form: Global(X,Y) ⇒ Accessed by(X); Coverage: Low.
Implication: Only a small proportion of functions in the system access the global variable(s) on the LHS.

Pattern 2: Localize variables
Intent: Reduce the scope of variables that have larger scope than necessary.
Metarule 2 — Form: Global(X,Y) ⇒ Accessed by(X); Confidence: One.
Implication: The global variables are used by one function only.

Pattern 3: Increase locality of reference
Intent: Group related functions together to improve understandability and maintainability.
Metarule 3-1 — Form: Function(X,Y) ⇒ Called by(X); Confidence: One.
Implication: The functions are called by one function only.
Metarule 3-2 — Form: Function(X,Y) ⇒ Function(X,Y); Confidence: High; Support: High.
Implication: Whenever one function is called, there is a high probability that the other function is also called.

Pattern category 2: Avoid duplicated functionality

Pattern 4: Identify utilities
Intent: Identify utility routines so that they can be shared by functions.
Metarule 4 — Form: Function(X,Y) ⇒ Called by(X); Coverage: High.
Implication: Function(s) on LHS are called by most of the functions in the system.

Pattern category 3: Re-modularize for maintainability

Pattern 5: Increase data modularity
Intent: Place related data items into a structure.
Metarule 5-1 — Form: Global(X,Y) ⇒ Global(X,Y); Confidence: High; Support: High.
Implication: Whenever one global variable is accessed, there is a high probability that the other global variable is also accessed.
Metarule 5-2 — Form: Type(X,Y) ⇒ Type(X,Y); Confidence: High; Support: High.
Implication: Whenever one type is accessed, there is a high probability that the other type is also accessed.

Pattern 6: Strengthen encapsulation
Intent: Identify potential classes.
Metarule 6 — Form: Accessed by(X) ⇒ Type(X,Y); Confidence: One.
Implication: Function(s) access the type(s) on the RHS.

Pattern category 4: Improve performance

Pattern 7: Select appropriate storage classes
Intent: Make access to frequently accessed variables efficient.
Metarule 7 — Form: Global(X,Y) ⇒ Accessed by(X); Coverage: High.
Implication: Global variable on LHS is used by most of the functions in the system.

Pattern category 5: Focus on high-level architecture

Pattern 8: Generate alternative views
Intent: Focus on architecture from multiple perspectives.
Metarule 8 — Form: Type(X,Y) ⇒ Accessed by(X); Coverage: High (within a sub-system).
Implication: User defined type(s) are used by most of the functions within the sub-system.

Table 3: Re-engineering patterns and metarules

6. Experiments and results

For conducting metarule-guided association rule mining, the source code of five open source software systems written in C was used. A brief description of these systems is provided in Table 4.

                      CVS                     Aero                  Mosaic       Bash        Xfig
Description           Version control system  Rigid body simulator  Web browser  Unix shell  Drawing tool
Version               1.8                     1.7                   2.6          1.14.4      3.2.3
Lines of code (LOC)   30k                     31k                   37k          38k         75k
Functions             810                     686                   818          892         1661
Global variables      419                     535                   348          539         1746
User defined types    375                     814                   112          198         828

Table 4: Test system characteristics

CVS, Aero, Mosaic and Bash have been used for architecture recovery experiments in [54], and Xfig has been used for architecture recovery experiments in [36]. Our data mining experiments help in gaining an understanding of the test systems by providing a more detailed view of their structure.
The source files for the five systems were parsed using the Rigi tool, and the relevant 'facts' were stored in an exchange format called the Rigi Standard Format (RSF) [55], [56]. The transaction set discussed in section 5.1 was developed from this fact set. In the following sections, we present the results of applying association rule mining to our test systems and analyze these results. It is relevant to note that the Xfig system consists of five major sub-systems, whose source code files can be identified by their names. We have experimented with the 95 files in these sub-systems, leaving out 4 files which have not been considered. We do not expect that the average figures and percentages obtained for Xfig would change substantially on inclusion of these 4 files. Table 5 presents statistics related to the Xfig sub-systems.

System         Purpose               Functions
Xfig d_*files  Drawing tasks         94
Xfig e_*files  Editing tasks         369
Xfig f_*files  File related tasks    139
Xfig u_*files  Utility files         422
Xfig w_*files  Window related tasks  637

Table 5: Statistics of Xfig sub-systems

6.1 Pattern 1: Reduce global variable usage
(Metarule 1 form: Global(X,Y) ⇒ Accessed by(X)   Coverage: Low)

To evaluate the applicability of Pattern 1, the source code of the five systems was mined using metarule 1. Table 6 summarizes the statistics obtained as a result of mining association rules using this metarule.
                                             CVS         Aero          Mosaic       Bash       Xfig
Number of association rules
satisfying the metarule                      1469        2628          858          1714       8280
Number of functions accessing global
variables (%age of total functions)          391 (48%)   422 (62%)     370 (45%)    545 (61%)  1329 (80%)
Median number of functions
accessing a global variable                  2           3             2            2          3
Average number of functions
accessing a global variable                  3.99        5.29          2.93         4.12       5.97
Standard deviation                           5.96        10.31         4.32         5.70       15.73
Most frequently accessed global variable     Noexec      aktuelleWelt  current_win  rl_point   XtStrings
Coverage of the most frequently accessed
global variable (# of functions)             0.068 (55)  0.152 (104)   0.057 (47)   0.07 (62)  0.139 (231)

Table 6: Global variable usage summary

The metarule associated with Pattern 1 can be used to derive useful information from the system when coverage is low. It can be seen from Table 6 that the coverage values for the accessed global variables in all five systems remain below 0.2. Even in Mosaic, which has the lowest coverage value, the most frequently accessed global variable is accessed by 47 functions; the number of functions is higher in the other systems. If a global variable is accessed by 47 functions, there is reason to define it as global. It is obvious that a coverage value of 0.2 cannot be considered low in the above systems. This result illustrates why we do not recommend setting absolute thresholds, and the difficulty of setting a single threshold that can be used for all systems. Instead of trying to set an absolute threshold, we depict our results using a graph which can be inspected visually to identify unusual values that represent restructuring opportunities. To enable the graph to be interpreted interactively, the median, quartiles, and upper and lower control limits (defined as mean ± 2 × standard deviation) may be depicted on the graph. As an example, the global variable usage graph for Bash, as mined by metarule 1, is depicted in Figure 2.
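The interactive threshold approach described above can be sketched as follows: compute the coverage of each Global(X,Y) ⇒ Accessed by(X) rule, then derive control limits from the data itself rather than a fixed cut-off. The access map below is illustrative, not taken from the report's fact base:

```python
import statistics

# Hypothetical map: global variable -> set of functions accessing it.
accesses = {
    "rl_point": {f"fn{i}" for i in range(62)},  # widely used global
    "noninc_history_pos": {"noninc_search", "noninc_dosearch"},
    "special_vars": {"initialize_shell_variables"},
}
total_functions = 892  # Bash, from Table 4

counts = {g: len(fns) for g, fns in accesses.items()}
# Coverage of Global(X,Y) => Accessed by(X): fraction of all functions
# (transactions) that access the global.
coverage = {g: c / total_functions for g, c in counts.items()}

# Control limits as suggested in the text: mean +/- 2 * standard deviation.
mean = statistics.mean(counts.values())
sd = statistics.pstdev(counts.values())
upper, lower = mean + 2 * sd, mean - 2 * sd

# Candidates for Pattern 1: globals accessed by very few functions.
low_usage = sorted(g for g, c in counts.items() if c <= 2)
# -> ['noninc_history_pos', 'special_vars']
```

In practice the low-usage cut-off would be chosen by visually inspecting the graph of `counts`, with the control limits overlaid, rather than fixed at 2 as in this sketch.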
[Figure 2: Global variable access (Bash) — histogram of the number of functions accessing each global variable]

It is evident from Figure 2 that a substantial proportion of the global variables in Bash are accessed by very few functions. The graphs of the other four systems show a very similar global usage trend (Figure 3 – Figure 6).

[Figure 3: Global variable access (CVS)]
[Figure 4: Global variable access (Aero)]
[Figure 5: Global variable access (Mosaic)]
[Figure 6: Global variable access (Xfig)]

It is interesting to note that the median number of functions accessing a global variable is almost the same (2 or 3) for all systems. The average number of functions accessing a global variable is also similar. Table 7 depicts the breakdown of global variable access in all five test systems, and Figure 7 shows the breakdown specifically for Bash.

                                     CVS   Aero  Mosaic  Bash  Xfig
Percentage of global variables
accessed by more than 5 functions    13%   13%   9%      14%   18%
Percentage of global variables
accessed by 5 or fewer functions     87%   87%   91%     86%   82%

Table 7: Global variable access breakdown in test systems

[Figure 7: Global variable access breakdown (Bash) — 14% of globals accessed by more than 5 functions, 86% by 5 or fewer]

Metarule-guided rule mining has thus successfully identified a re-engineering opportunity.
To improve modularity and hence software quality, it is useful to examine global variables accessed by very few functions, in order to ascertain whether there is a need to define these variables globally. As an example, consider the variable noninc_history_pos, used in just two functions, noninc_search and noninc_dosearch, in Bash. The only valid reason for defining this variable as global is that its value is to be retained throughout the duration of the program. Even in such a case, the variable should be defined as static to restrict its scope; an examination of the code shows that the variable has indeed been defined as static. On the other hand, the rl_kill_ring_length variable, also used by just two functions (rl_yank_pop and rl_kill_text) within readline.c, is not a static variable. If a variable's value is not required to persist throughout the lifetime of the program, it should be passed as a parameter to the relevant functions instead of being defined globally. Even if a global variable is required, its scope should be restricted according to usage to reduce coupling.

6.2 Pattern 2: Localize variables
(Metarule 2 form: Global(X,Y) ⇒ Accessed by(X)   Confidence: One)

To evaluate the applicability of pattern 2, the source code of the five systems was mined using metarule 2. Table 8 summarizes the statistics obtained as a result of mining association rules using this metarule.

                                CVS  Aero  Mosaic  Bash  Xfig
Number of association rules
satisfying the metarule         77   72    104     115   310
Percentage of global variables
accessed by one function        18%  13%   30%     21%   18%

Table 8: Global variables accessed by one function

It is interesting to note that in all systems, quite a large percentage of global variables is used by only one function. The fact that these variables have been defined as global seems to indicate a poor design, or a design that has deteriorated over time.
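Mining with metarule 2 reduces to selecting globals whose access set has exactly one element: for such a global, every rule Global(X,Y) ⇒ Accessed by(X) has confidence one. A minimal sketch, with an illustrative access map rather than the report's fact base:

```python
# Hypothetical map: global variable -> set of functions accessing it.
accesses = {
    "msg_buf": {"rl_message"},
    "special_vars": {"initialize_shell_variables"},
    "rl_point": {"rl_forward", "rl_backward", "rl_refresh_line"},
}

# Confidence of Global(X,Y) => Accessed by(X) is one exactly when a
# single function accesses the global; such globals are candidates
# for localization (Pattern 2).
single_access = sorted(g for g, fns in accesses.items() if len(fns) == 1)
# -> ['msg_buf', 'special_vars']
```

Each flagged global would then be inspected as the text describes: if its value must persist across calls it can at least be made static; otherwise it can become a local variable or parameter.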
However, the definition of these variables as global is justified if they represent settings that must persist throughout the program's lifetime. For example, a detailed look at Bash reveals that global variables accessed by single functions do indeed represent system settings. The msg_buf variable, which represents a buffer for messages, is used by one function only (rl_message). If this buffer is accessed more than once during the execution of the program, there is reason for it to be global. The fact that this variable has been defined as static shows an effort by the developers to reduce variable scope. However, further opportunities for reducing variable scope within the system may be identified. For example, the globally defined variable special_vars is also used by only one function but has not been defined as static.

6.3 Pattern 3: Increase locality of reference (Metarule 3-1 form: Function(X,Y) ⇒ Called by(X), Confidence: One; Metarule 3-2 form: Function(X,Y) ⇒ Function(X,Y), Confidence: High, Support: High)

To evaluate the applicability of pattern 3, the source code of the five systems was mined using metarule 3-1 and metarule 3-2. Table 9 summarizes statistics obtained as a result of mining association rules using metarule 3-1.

                                                     CVS        Aero       Mosaic     Bash       Xfig
Number of association rules satisfying the metarule  236        250        321        298        368
Percentage of functions called by one function       29%        36%        39%        33%        22%
Percentage of functions residing in the same file
as the calling function                              54% (128)  65% (163)  56% (181)  74% (220)  73% (270)
Percentage of functions residing in a different
file                                                 46% (108)  35% (87)   44% (140)  26% (78)   27% (98)

Table 9: Functions called by one function

It can be seen from Table 9 that in all systems, most of the functions called by a single function reside in the same file as the calling function. Bash and Xfig have the highest percentages of such functions.
Functions residing in a different file from the calling function need to be examined in more detail. A valid reason for placing a called function in a different file is that it may be a 'utility' function which has been placed together with other utility functions in a separate file. For example, in Bash the function all_visible_variables, present in the file variables.c, is called by the function variable_completion_function in bashline.c. variables.c contains functions related to shell variables, whereas functions within bashline.c are related to reading lines of input. In this case, there is a valid reason for placing all_visible_variables in variables.c, along with other functions related to shell variables. If this is not the case, i.e. the called function appears to be logically related to the calling function, it should be placed in the same file as the calling function to increase efficiency and ease maintenance by increasing locality of reference.

Table 10 summarizes statistics about functions that are called together with high confidence. In this case, we use a threshold of 0.7 for confidence. It should be noted that setting absolute thresholds for coverage and support may not be meaningful (see section 5.2), since they depend on system size. Confidence, however, denotes the association of two items with each other and is not affected by the size of the system. Thus it is meaningful to set a threshold, e.g. 0.7, which denotes that if one function is called, there is a 70% probability that the other function is also called. A problem that may arise in the application of this pattern is that a function may be associated with more than one function, making it difficult to decide which functions to place in the same file. In such a case, the support of the association rule may be helpful; it denotes the number of functions that call the two functions together.
If two functions are called together by many functions, it is more useful to place them in close proximity than functions that are called together by only a few functions.

                                                     CVS    Aero   Mosaic  Bash   Xfig
Number of association rules satisfying the
metarule (confidence > 0.7)                          4353   2598   3821    3644   3584
Number of association rules satisfying the
metarule (confidence = 1)                            3974   2523   3762    3564   3424
Highest support for high confidence                  138    21     22      31     59
Highest support for confidence = 1                   28     21     14      10     20
Functions in same file for high confidence           23%    30%    25%     43%    48%
Functions in different files                         77%    70%    75%     57%    52%

Table 10: Functions called together with high confidence

It can be seen from Table 10 that most of the functions called together are placed in different files. Thus there is scope for applying this pattern to improve understandability.

6.4 Pattern 4: Identify utilities (Metarule 4 form: Function(X,Y) ⇒ Called by(X), Coverage: High)

To evaluate the applicability of pattern 4, the source code of the five systems was mined using metarule 4. Table 11 summarizes statistics obtained as a result of mining association rules using this metarule.
                                                     CVS          Aero        Mosaic      Bash         Xfig
Number of association rules satisfying the metarule  3899         1708        1472        2075         4798
Number of functions making function calls
(%age of total functions)                            631 (78%)    469 (68%)   518 (63%)   694 (78%)    1302 (78%)
Median number of calls made                          2            1           1           2            2
Average number of calls made                         6.73         3.80        2.57        3.18         4.28
Standard deviation                                   17.64        7.79        4.77        8.46         8.45
Number of functions to which 5 or more calls
are made (%age of total functions)                   148 (18%)    75 (11%)    55 (7%)     98 (11%)     239 (14%)
Number of functions to which 10 or more calls
are made (%age of total functions)                   84 (10%)     39 (6%)     22 (3%)     29 (3%)      101 (6%)
Most frequently called function                      error        fprintf     mo_fetch_window_by_id    xmalloc    put_msg
Coverage of the most frequently called function
(# of functions)                                     0.259 (210)  0.138 (95)  0.072 (59)  0.219 (195)  0.08 (133)

Table 11: Function call summary

The metarule associated with pattern 4 can be used to derive useful information from the system when coverage is high. It can be seen from Table 11 that coverage values for the called functions in the five systems remain below 0.3. In CVS, which has the highest coverage, the most frequently called function is called by 210 functions. If a function is called by 210 functions, there is reason to assume that it is some kind of utility function. As with pattern 1, setting a threshold is difficult, because with a high absolute threshold of, say, 0.7, none of the functions would be considered frequently accessed. Instead of trying to set an absolute threshold, we depict the results using a graph which can be inspected interactively to identify frequently called functions. To make the interpretation of the graph easier, the median, quartiles, and upper and lower control limits may be depicted on the graph. As an example, the function call graph for Bash is depicted in Figure 8.
[Figure: Bash - Function calls made; number of functions calling each function]
Figure 8: Function calls (Bash)

It is easy to identify functions called by a large number of functions from Figure 8. The graphs of the other four systems show a very similar function call trend (Figure 9 – Figure 12).

[Figure: CVS - Function calls made]
Figure 9: Function calls (CVS)

[Figure: Aero - Function calls made]
Figure 10: Function calls (Aero)

[Figure: Mosaic - Function calls made]
Figure 11: Function calls (Mosaic)

[Figure: Xfig - Function calls made]
Figure 12: Function calls (Xfig)

It is interesting to note that the median number of functions calling a function is almost the same (1 or 2) for all systems. The average number of functions calling a function is also similar. An examination of the Bash code shows that 75% of the functions to which 20 or more calls are made reside in the files general.c, error.c or variable.c. This indicates that utility functions are placed in separate files in Bash, making it easier to reference such functions. To identify the utility functions within the various sub-systems of Xfig, associations between the functions of the software and the functions within each sub-system were noted. Figure 13 – Figure 17 show the calls made within the various sub-systems. The graphs are similar, depicting that function calls in each sub-system follow a similar trend.
[Figure: Xfig - Function calls made in the d_*files subsystem]
Figure 13: Function calls (d_*files sub-system)

[Figure: Xfig - Function calls made in the e_*files subsystem]
Figure 14: Function calls (e_*files sub-system)

[Figure: Xfig - Function calls made in the f_*files subsystem]
Figure 15: Function calls (f_*files sub-system)

[Figure: Xfig - Function calls made in the u_*files subsystem]
Figure 16: Function calls (u_*files sub-system)

[Figure: Xfig - Function calls made in the w_*files subsystem]
Figure 17: Function calls (w_*files sub-system)

Table 12 shows some statistics of the function calls made. It can be observed that the average number of calls made in all the sub-systems is very similar.

Sub-system   Average number   Standard    Highest number of calls made   Coverage
             of calls made    deviation   (function)
d_*files     3.27             5.00        37 (draw_mousefun_canvas)      0.394
e_*files     3.44             5.19        46 (set_cursor)                0.125
f_*files     2.03             2.98        38 (file_msg)                  0.273
u_*files     3.09             3.70        33 (set_action_object)         0.078
w_*files     3.02             5.13        69 (put_msg)                   0.108

Table 12: Function calls made in various sub-systems

Table 13 shows the function calls made to functions in the various sub-systems by functions within each sub-system. It appears that the u_*files sub-system contains utility functions, because the maximum number of functions is called from this sub-system. This was our initial assumption, which has been confirmed by the results obtained. The d_*files sub-system functions make more calls to functions within the u_*files sub-system than to any other sub-system.
However, functions in the remaining sub-systems make more calls to functions within their own sub-system, with the second highest number of calls going to the u_*files sub-system. Function calls are made to functions in all other sub-systems, but the d_*files sub-system appears to contain the least number of utilities. These figures show that the Xfig system is quite well structured, but its structure can be improved by further localizing calls to a function's own sub-system or to the u_*files sub-system. Table 14 shows call results for functions to which 5 or more calls are made. Results are similar to those in Table 13, with the maximum number of functions being called from the u_*files sub-system and no calls to d_*files functions except from functions within d_*files.

[Table 13: Number of function calls made to functions in each sub-system — a matrix of call counts with one row per calling sub-system (d_, e_, f_, u_, w_*files) and one column per called sub-system, plus columns for miscellaneous files and row totals]

[Table 14: Five or more function calls made to functions in each sub-system — the same matrix restricted to functions receiving 5 or more calls]

6.5 Pattern 5: Increase data modularity (Metarule 5-1 form: Global(X,Y) ⇒ Global(X,Y), Confidence: High, Support: High; Metarule 5-2 form: Type(X,Y) ⇒ Type(X,Y), Confidence: High, Support: High)

To evaluate the applicability of pattern 5, the source code of the five systems was mined using metarule 5-1 and metarule 5-2.
In this case, we use a threshold of 0.7 for confidence. It should be noted that setting absolute thresholds for coverage and support may not be meaningful (see section 5.2), since they depend on system size. Confidence, however, denotes the association of two items with each other and is not affected by the size of the system. Thus it is meaningful to set a threshold, e.g. 0.7, which denotes that if one global variable (or user defined type) is accessed, there is a 70% probability that the other global variable (user defined type) is also accessed. However, the confidence measure alone cannot be used to arrive at a meaningful result. A global variable may be associated with more than one global variable; similarly, a user defined type may be associated with more than one user defined type. It is thus difficult to decide which global variables or user defined types to place in one structure. In such a case, the support of the association rule may be helpful. Support denotes the number of functions which access the two global variables or user defined types together. If two global variables or user defined types are accessed together by many functions, it is more useful to place them in a single structure than those that are accessed together by only a few functions.

                                                     CVS          Aero         Mosaic      Bash         Xfig
Number of association rules satisfying the metarule  1923         5199         2602        1635         14983
Highest support for global variables occurring
together with high confidence (# of functions)       0.022 (18)   0.099 (68)   0.007 (6)   0.047 (42)   0.127 (211)

Global variables occurring together with highest support:
  CVS: pending_error, pending_error_text
  Aero: KoerperKoordAnAus, KoordAnAus
  Mosaic: size_of_cached_cd_array, cached_cd_array
  Bash: rl_end, rl_point
  Xfig: _ArgCount, _ArgList

Table 15: Global variables with high association

Table 15 summarizes statistics about global variables with high association. The names of the global variables in all five systems reveal that they are related to one another.
Through the use of association rule mining, such variables may be identified and, after inspection, placed in a structure. Similar to Table 15, Table 16 summarizes statistics about user defined types with high association.3

                                                     CVS         Aero        Mosaic      Bash        Xfig
Number of association rules satisfying the metarule  41          54          10          14          422
Highest support for user defined types occurring
together with high confidence (# of functions)       0.010 (8)   0.017 (12)  0.010 (8)   0.003 (3)   0.128 (213)

User defined types occurring together with highest support:
  CVS: DIR, dirent
  Aero: TKollision, TReal
  Mosaic: mo_hotlist, mo_hot_item
  Bash: KEYMAP_ENTRY_ARRAY, keymap
  Xfig: Arg, WidgetList

Table 16: User defined types with high association

6.6 Pattern 6: Strengthen encapsulation (Metarule 6 form: Accessed by(X) ⇒ Type(X,Y), Confidence: One)

To evaluate the applicability of pattern 6, the source code of the systems was mined using metarule 6. Table 17 summarizes statistics obtained as a result of mining association rules using this metarule.

3 For CVS, Aero, Mosaic and Bash, accesses to user defined types are reported for local variables within functions.

                                                     CVS         Aero        Mosaic      Bash        Xfig
Number of association rules satisfying the metarule  489         476         294         319         3937
Number of functions accessing user defined types
(%age of total functions)                            299 (37%)   279 (41%)   248 (30%)   249 (28%)   1364 (82%)
Number of types accessed                             80          63          47          51          149
Median number of functions accessing a type          2.5         3           3           3           5
Average number of functions accessing a type         6.11        7.56        6.26        6.25        26.42
Standard deviation                                   10.74       14.56       12.56       9.36        53.92

Table 17: Types accessed by functions in the test systems

Identification of potential objects is facilitated by mining association rules of the described form. As an example, consider some of the user defined types identified within Bash. The user_info type is accessed by 10 functions which reside in 7 different files.
This type can be grouped together with the functions that access it to form a class which represents user information along with the member functions that access this information. Some types are accessed by functions within a single file, making the task of conversion to an object-oriented design easier. For example, the JOB type is accessed by 22 functions (start_job, delete_job, find_job, etc.), 21 of which reside in the file job.c. In a similar manner, types and their accessing functions may also be identified within the other systems.

6.7 Pattern 7: Select appropriate storage classes (Metarule 7 form: Global(X,Y) ⇒ Accessed by(X), Coverage: High)

Table 18 shows the global variables accessed by the largest number of functions in the five test systems. Only the three most widely accessed global variables are shown for each system.

          Global variable   Number of functions   Coverage
CVS       noexec            55                    0.068
          server_active     45                    0.056
          __ctype_b         36                    0.044
Aero      aktuelleWelt      104                   0.152
          __iob             72                    0.105
          KoordAnAus        68                    0.099
Mosaic    current_win       47                    0.057
          srcTrace          43                    0.053
          cciTrace          23                    0.028
Bash      rl_point          62                    0.07
          rl_end            46                    0.052
          __ctype_b         39                    0.044
Xfig      XtStrings         231                   0.139
          _ArgCount         211                   0.127
          _ArgList          211                   0.127

Table 18: Global variables accessed by maximum number of functions

If the functions accessing these global variables are called frequently, the global variables are candidate register variables.

6.8 Pattern 8: Generate alternative views (Metarule 8 form: Type(X,Y) ⇒ Accessed by(X), Coverage: High (within a sub-system))

Table 19 summarizes the statistics for types accessed by functions in various sub-systems of Xfig. Figure 18 – Figure 22 show the types accessed within the sub-systems.
Sub-system   Average number of    Standard    Highest number of     Coverage
             accesses to types    deviation   accesses (Type)
d_*files     6.28                 5.92        22 (Cursor)           0.234
e_*files     17.34                25.89       99 (F_compound)       0.268
f_*files     6.49                 8.21        37 (FILE)             0.266
u_*files     15.10                25.33       127 (F_compound)      0.301
w_*files     16.74                35.40       257 (WidgetList)      0.403

Table 19: Accesses to user defined types in various Xfig sub-systems

[Figure: Xfig - Types accessed within the d_*files subsystem; number of functions accessing each type]
Figure 18: Types accessed by a subsystem (d_*files)

[Figure: Xfig - Types accessed within the e_*files subsystem]
Figure 19: Types accessed by a subsystem (e_*files)

[Figure: Xfig - Types accessed within the f_*files subsystem]
Figure 20: Types accessed by a subsystem (f_*files)

[Figure: Xfig - Types accessed within the u_*files subsystem]
Figure 21: Types accessed by a subsystem (u_*files)

[Figure: Xfig - Types accessed by the w_*files subsystem]
Figure 22: Types accessed by a subsystem (w_*files)

It can be seen that some of the types are associated with a certain sub-system only, whereas some of them are
accessed by more than one sub-system, e.g. the types F_line, F_spline, F_arc, etc. are accessed by all sub-systems. The reason is that these shapes are drawn in the d_*files sub-system, edited in the e_*files sub-system, and so on. Thus for an object-oriented view, these types will be associated with functions across sub-systems. On the other hand, types accessed within a certain sub-system only may be associated with functions within that sub-system.

6.9 Summary of results

Statistics obtained as a result of carrying out association rule mining on the test legacy systems are summarized in Figure 23 – Figure 30. Statistics presented in Figure 23 and Figure 24 are related to the number of functions, global variables and user defined types in the test systems.

Figure 23: Function, global variable and user defined type statistics

Figure 24: Statistics for function calls, global variable accesses and user defined type accesses

It can be seen that the number of functions and global variables is similar across all systems except Xfig. The number of functions making function calls, accessing global variables and accessing user defined types is also comparable across these systems. The number of functions accessing user defined types is larger for Xfig as compared to the other systems, and also larger than the number of its functions accessing global variables and making function calls.

Figure 25: Global variable access statistics

Figure 26: A comparison of accesses to global variables

Figure 25 and Figure 26 present statistics related to the access of global variables by functions within the test systems, which is an indicator of the global coupling present in these systems. It is interesting to note that although the percentage of functions accessing global variables varies from 45% to 80%, the percentage of global variables accessed by 5 or fewer functions is very similar across all systems.
Mosaic has the smallest percentage of functions accessing global variables, and also the highest percentage of global variables accessed by 5 or fewer functions. These results show that all five systems present opportunities for applying pattern 1 and pattern 2 to reduce coupling by reducing the number of global variables. In cases where a global variable is required, its scope may be reduced to make the program easier to understand and manage.

Figure 27: Function call statistics

Figure 28: A comparison of calls to functions

Figure 27 and Figure 28 present statistics related to functions called by a single function within the systems. The percentage of functions making function calls is similar across all systems, varying from 63% to 78%. The percentage of functions called by only one function is also similar across all systems, varying from 22% to almost 40% in Mosaic. Thus the opportunity to apply pattern 3 to increase locality of reference is present in all systems. Functions that are called by a single function and yet reside in a different file need to be examined. CVS and Mosaic have the highest percentages of such functions, which should be examined to determine the feasibility of placing them in the same file as the calling function.

Figure 29: Frequently accessed functions

Figure 30: A comparison of frequently accessed functions

Figure 29 and Figure 30 present statistics on functions within the systems to which 5 or more calls are made. This percentage varies from 7% to 18%, in general remaining low. CVS has the highest percentage of functions to which 5 or more calls are made. Thus in all systems there exist functions that are accessed by many functions. It is useful to examine these functions, identify them as utilities, and place them in appropriate files. Table 15, Table 16 and Table 17 summarize statistics related to system re-modularization.
The applicability of pattern 5 is evident for all the systems, as global variables and user defined types that occur together have been identified. It is suggested that pattern 5 be applied before pattern 6, so that related data items are placed together. Once data has been modularized, functions accessing these types may be mined using pattern 6, thus taking a step towards identifying potential classes in an object-oriented design.

7. Conclusions

In this report we present an association rule mining approach to the problem of understanding a software system given only its source code. To make the association rule mining process more effective, we employ metarule-guided mining. Metarules specify constraints on the mining process. We use two types of constraints: rule constraints, which restrict the form of the association rules mined, and interestingness constraints, which restrict the possible values of interestingness measures. Furthermore, we relate each metarule to a re-engineering pattern that presents solutions to the problems identified by the metarule. To illustrate the feasibility and effectiveness of employing metarule-guided mining for program understanding, we restrict ourselves to a small number of metarules and re-engineering patterns that address some typical problems within legacy systems developed using the structured approach. Metarule-guided mining was used to analyze the structure of five legacy systems and extract meaningful association rules which provided useful insight into each system's overall structure and suggested strategies for its improvement. The extracted association rules capture associations between functions, global variables and user defined types within a system. They highlight potential problems within the code and identify areas which require more detailed study.
As illustrated, these association rules applied along with re-engineering patterns can be used to restructure the code for maintainability, and if required, to re-modularize the code e.g. by converting a structured design to an object-oriented design. A manual inspection to carry out the same tasks would have taken a much longer time. Our experiments with five open source systems reveal similar results in terms of the average and percentage values in the patterns discussed. For example, the average and median number of functions accessing a global variable is very similar in all five systems, as is also the average and median number of function calls made. This result is interesting, especially since the test systems have different application domains and the number of functions, global variables and user defined types within these systems is also different. Our experimental results show how rules extracted by the mining process can be used to identify potential problems in, and suggest improvements to, the legacy systems under study. Thus metarule-guided association rule mining is effective in revealing interesting characteristics, trends and nature of open source legacy systems, and can be used as a basis for restructuring or re-modularizing them for greater maintainability. 36 References [1] I. Sommerville, Software Engineering, Fifth Edition, Addison Wesley, 2000. [2] R.S. Arnold, Software Reengineering, IEEE Computer Society Press, 1993. [3] R.S. Pressman, Software Engineering A Practitioner’s Approach, Fifth Edition, Mc Graw Hill, 2001. [4] S.L. Pfleeger, Software Engineering Theory and Practice, Prentice Hall, 1998. [5] R.L. Glass, Frequently Forgotten Fundamental Facts about Software Engineering, IEEE Software, May/June 2001. [6] T.J.Biggerstaff, “Design Recovery for Maintenance and Reuse”. IEEE Computer, 22(7), pages 36-49, July 1989. [7] H.A. Muller, M. Story, J.H. Jahnke, D.B. Smith, A.R. Tilley, K. 
Wong, “Reverse Engineering: A Roadmap”, The 22nd International Conference on Software Engineering (ICSE’00), June 2000 [8] G. Parikh, N. Zvegintzov, Tutorial on Software Maintenance, IEEE Computer Society Press, 1983. [9] R.P. Hall, Seven Ways to Cut Software Maintenance Costs, Datamation, July 1987. [10] M.T.Harandi, J.Q.Ning, “Knowledge-Based Program Analysis”, IEEE Software 7(1), pages 74-81, January 1990 . [11] Rich, Wills “Recognizing a program’s design: A graph parsing approach”, IEEE Software, 7(1), pages 82-89, January 1990. [12] A.Quilici, “A Memory-Based Approach to Recognizing Programming Plans”. Communications of the ACM, 37(5), pages 84-93, May 1994. [13] H. A. Müller, K. Wong, and S. R. Tilley “Understanding software systems using reverse engineering technology.” The 62nd Congress of L'Association Canadienne Francaise pour l'Avancement des Sciences Proceedings (ACFAS) 1994. [14] H.M.Fahmy, R.C.Holt, J.R.Cordy , “Wins and Losses of Algebraic Transformations of Software Architectures”. Automated Software Engineering ASE 2001, San Diego, California, November 26-29, 2001. [15] R.Kazman, S.J.Carrière, "View Extraction and View Fusion in Architectural Understanding". The 5th International Conference on Software Reuse, Victoria, BC, Canada, June 1998. [16] J.S. Shirabad, T.C. Lethbridge, S. Matwin, Supporting software maintenance by mining software update records, International Conference on Software Maintenance, (ICSM) 2001. [17] Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000. [18] Demeyer, S., Ducasse, S., Nierstrasz, O., Object-Oriented Reengineering Patterns, Morgan Kaufmann, 2003. [19] Website http://www.iam.unibe.ch/~scg/Archive/famoos/ [20] M.M. Lehman “Program, Life Cycles and the Laws of Software Evolution”, Proceedings of the IEEE, September 1980. [21] M.M. Lehman, J.F. Ramil, P.D. Wernik, D.E. Perry, W.M. 
Turski, “Metrics and Laws of Software Evolution – The NinetiesView”, International Symposium on Software Metrics, 1997. 37 [22] M.W. Godfrey, Q. Tu, “Evolution in Open Source Software: A Case Study”, International Conference on Software Maintenance (ICSM) 2000. [23] M.W. Godfrey, Q. Tu, “Growth, Evolution and Structural Change in Open Source Software”, IWPSE 2001. [24] Q. Tu, M.W. Godfrey, “An Integrated Approach for Studying Architectural Evolution”, International Workshop on Program Comprehension (IWPC), 2002. [25] M. Lanza, S. Ducasse, “Polymetric Views – A Lightweight Visual Approach to Reverse Engineering”, IEEE transactions on Software Engineering, September 2003. [26] M. Lanza, S. Ducasse, “Understanding Software Evolution Using a Combination of Software Visualization and Software Metrics”, Proceedings LMO’02, 2002. [27] C. Montes de Oca, D. L .Carver, Identification of Data Cohesive Subsystems using Data Mining Techniques, International Conference on Software Maintenance, (ICSM) November 1998. [28] C. Montes de Oca, D. L .Carver, A Visual Representation Model for Software Subsystem Decomposition, Working Conference on Reverse Engineering (WCRE'98), October, 1998. [29] K. Sartipi, K. Kontogiannis, F. Mavaddat, Architectural Design Recovery Using Data Mining Techniques, Conference on Software Maintenance and Reengineering (CSMR’00), February, 2000. [30] Tjortjis, C., Sinos, L., Layzell, P., Facilitating Program Comprehension by Mining Association Rules from Source Code, 11th IEEE International Workshop on Program Comprehension (IWPC'03), May, 2003. [31] V. Tzerpos, R.C. Holt, “Software Botryology: Automatic Clustering of Software Systems”, Ninth International Workshop on Database and Expert Systems Applications (DEXA’98), August 1998. [32] T.A. Wiggerts, “Using clustering algorithms in legacy systems remodularization,” Fourth Working Conference on Reverse Engineering (WCRE’97), October 1997. 
[33] N. Anquetil, T.C. Lethbridge, “Experiments with Clustering as a Software Remodularization Method”, The Sixth Working Conference on Reverse Engineering (WCRE’99), 1999.
[34] J. Davey, E. Burd, “Evaluating the Suitability of Data Clustering for Software Remodularization”, The Seventh Working Conference on Reverse Engineering (WCRE’00), Brisbane, Australia, 2000.
[35] M. Saeed, O. Maqbool, H.A. Babri, S.M. Sarwar, S.Z. Hassan, “Software Clustering Techniques and the Use of the Combined Algorithm”, Conference on Software Maintenance and Reengineering (CSMR’03), March 2003.
[36] O. Maqbool, H.A. Babri, “The Weighted Combined Algorithm: A Linkage Algorithm for Software Clustering”, Conference on Software Maintenance and Reengineering (CSMR’04), March 2004.
[37] J.S. Shirabad, T.C. Lethbridge, S. Matwin, “Mining the Maintenance History of a Legacy Software System”, International Conference on Software Maintenance (ICSM), 2003.
[38] J.S. Shirabad, T.C. Lethbridge, S. Matwin, “Mining the Software Change Repository of a Legacy Telephony System”, Proceedings of the 1st International Workshop on Mining Software Repositories (MSR), 2004.
[39] T. Zimmermann, P. Weißgerber, S. Diehl, A. Zeller, “Mining Version Histories to Guide Software Changes”, Proceedings of the 26th International Conference on Software Engineering (ICSE), 2004.
[40] A. Michail, “Data Mining Library Reuse Patterns Using Generalized Association Rules”, Proceedings of the 22nd International Conference on Software Engineering (ICSE), June 2000.
[41] A. Michail, “Data Mining Library Reuse Patterns in User-Selected Applications”, 14th IEEE International Conference on Automated Software Engineering (ASE), October 1999.
[42] R. Amin, M. Cinneide, T. Veale, “LASER: A Lexical Approach to Analogy in Software Reuse”, Proceedings of the 1st International Workshop on Mining Software Repositories (MSR), 2004.
[43] F. McCarey, M. Cinneide, N. Kushmerick, “A Case Study on Recommending Reusable Software Components Using Collaborative Filtering”, Proceedings of the 1st International Workshop on Mining Software Repositories (MSR), 2004.
[44] Y. Yusof, O.F. Rana, “Template Mining in Source-Code Digital Libraries”, Proceedings of the 1st International Workshop on Mining Software Repositories (MSR), 2004.
[45] P.K. Garg, T. Gschwind, K. Inoue, “Multi-Project Software Engineering: An Example”, Proceedings of the 1st International Workshop on Mining Software Repositories (MSR), 2004.
[46] M. El-Ramly, E. Stroulia, “Mining Software Usage Data”, Proceedings of the 1st International Workshop on Mining Software Repositories (MSR), 2004.
[47] Z. Balanyi, R. Ferenc, “Mining Design Patterns from C++ Source Code”, International Conference on Software Maintenance (ICSM), 2003.
[48] Y. Kanellopoulos, C. Tjortjis, “Data Mining Source Code to Facilitate Program Comprehension: Experiments on Clustering Data Retrieved from C++ Programs”, International Workshop on Program Comprehension (IWPC), 2004.
[49] Website: MSR 2004, http://msr.uwaterloo.ca
[50] T. Mens, T. Tourwe, “A Survey of Software Refactoring”, IEEE Transactions on Software Engineering, February 2004.
[51] S. Demeyer, S. Ducasse, “Metrics, Do They Really Help?”, Proceedings LMO’99, Paris, 1999.
[52] N. Fenton, S.L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, Second Edition, PWS Publishing Company, 1997.
[53] L. Rising, The Patterns Handbook: Techniques, Strategies and Applications, Cambridge University Press, 1998.
[54] R. Koschke, “Atomic Architectural Component Recovery for Program Understanding and Evolution”, PhD Thesis, University of Stuttgart, 2000.
[55] Website: http://www.bauhaus-stuttgart.de/bauhaus
[56] J. Martin, K. Wong, B. Winter, H.A. Müller, “Analyzing xfig Using the Rigi Tool Suite”, The Seventh Working Conference on Reverse Engineering (WCRE’00), Brisbane, Australia, 2000.
[57] R.B. Murray, C++ Strategies and Tactics, Addison-Wesley, 1993.
[58] S. McConnell, Code Complete: A Practical Handbook of Software Construction, Microsoft Press, 1993.
[59] M.A. Weiss, Efficient C Programming: A Practical Approach, Prentice-Hall, 1995.
[60] C. Riva, “Reverse Architecting: An Industrial Experience Report”, Seventh Working Conference on Reverse Engineering (WCRE’00), 2000.
[61] K. Wong, S. Tilley, H. Müller, M.A. Storey, “Structural Redocumentation: A Case Study”, IEEE Software, January 1995.

Appendix

Pattern category 1: Enhance modularity and control side effects

Pattern 1: Reduce global variable usage

Intent: Remove global variables that are used by very few functions.

Metarule 1: Global(X,Y) -> Accessed by(X)   [Coverage: Low]
Implication: Only a small proportion of functions in the system access the global variable(s) on the LHS.

Problem: As software evolves, its structure becomes complex and its internal quality degrades. Due to high coupling between system components in such a deteriorated structure, a single change often results in a number of side effects. Functions which access the same global variables are highly coupled (common coupling). How can such coupling among functions be reduced?

Forces:
This problem is difficult because
1. To improve system structure, the structure must first be understood. In the absence of documentation, gaining a structural view and identifying dependencies may be difficult and time consuming.
2. Some form of coupling may be necessary within the system. Highly coupled functions whose coupling levels may be reduced need to be identified.
3. A change, e.g. changing a shared global variable to a parameter that is passed between the related functions, may require changes at a number of places within the code.
Solving this problem is feasible because
1. With the proper tools, some forms of coupling between functions can be identified.

Solution: Identify global variables which are used by few functions within the system. Reduce common coupling by reducing the use of such global variables.
Rather than defining such variables as global, pass them as parameters to the relevant functions. If passing a global variable as a parameter is not convenient, its scope may be restricted to a single file by defining it as static.

Tradeoffs:
Pros: When a large number of variables are to be shared amongst functions, global variables are convenient. Moreover, global variables have a longer lifetime than automatic variables, making it simpler to share information between functions that do not call each other. However, unless the global data is read-only, the use of global variables results in undesirable coupling between functions, leading to difficulties in program understanding and maintenance. By removing unnecessary global variables and restricting their usage, coupling among components is reduced, making it easier to trace faults and avoid unintentional changes to data.
Cons & Difficulties: Careful evaluation of each global variable is required to decide whether it should be passed as a parameter, declared static, or left as it is. Even if very few functions in the system access the global variable, the designers of the system may have had valid reasons for defining it as global.

Pattern 2: Localize variables

Intent: Reduce the scope of variables that have a larger scope than necessary.

Metarule 2: Global(X,Y) -> Accessed by(X)   [Confidence: One]
Implication: Each global variable on the LHS is used by one function only.

Problem: Legacy systems are likely to contain a large number of global variables, which reduces the manageability of the systems. How can a function accessing global variables be made easier to read and manage?

Forces:
This problem is difficult because
1. A function may access a large number of global variables.
2. Global variables are not the only factor that makes a function more difficult to read and understand.
Solving this problem is feasible because
1. You have an idea of the global variable usage within the system (Pattern 1).

Solution: Identify global variables that are accessed by one function only. Make each such variable a local variable of that function.

Tradeoffs:
Pros: If a variable is used by one function, there may be no reason for defining it as global. Localizing a global variable reduces the chances of error and makes functions easier to read and maintain.
Cons & Difficulties: Careful evaluation of each identified global variable is required before a decision to reduce its scope is taken. The rationale for a developer's decision to define a variable as global is rarely documented.

Pattern 3: Increase locality of reference

Intent: Group related functions together to improve understandability and maintainability.

Metarule 3-1: Function(X,Y) -> Called by(X)   [Confidence: One]
Implication: Each function on the LHS is called by one function only.

Metarule 3-2: Function(X,Y) -> Function(X,Y)   [Confidence: High, Support: High]
Implication: Whenever one function is called, there is a high probability that the other function is also called.

Problem: A legacy system is often large and may consist of a large number of functions. Changes made within a software system may involve changes to various related functions. How can changes be localized?

Forces:
This problem is difficult because
1. It is necessary to identify functions that may be affected by a change.
2. A function may be related to a number of other functions, so it is difficult to decide the set of functions with which its relation is strongest.
Solving this problem is feasible because
1. Localized changes ease the maintenance task.

Solution: Identify functions between which there appears to be a strong relation. Place such functions in close proximity, perhaps in the same file. It may be possible to inline functions that are called by only one function.
Tradeoffs:
Pros: Grouping together related functions promotes modular continuity [3], because changes in requirements are localized instead of resulting in system-wide changes. Localized changes ease the maintenance task. Also, inlining a function can improve performance, and is clearly beneficial if the function expansion is shorter than the code for the calling sequence.
Cons & Difficulties: Placing related functions in the same file may be useful for legacy applications that have been developed using the structured approach. It may not be a feasible option in some cases, e.g. when a system has a layered architecture and function calls are made across different layers, or in object-oriented architectures where the functional separation is related to data. If the inlining option is chosen, it is important to remember that large inline functions save only a small percentage of run time but carry a higher space penalty. Also, functions with loops should almost never be inlined [57], because the run time of a loop is likely to swamp the function call overhead. Large inlined functions may also make it difficult to understand the functionality of the calling function.

Pattern category 2: Avoid duplicated functionality

Pattern 4: Identify utilities

Intent: Identify utility routines so that they can be shared by functions.

Metarule 4: Function(X,Y) -> Called by(X)   [Coverage: High]
Implication: The function(s) on the LHS are called by most of the functions in the system.

Problem: In legacy systems, functionality is often duplicated due to different teams re-implementing similar functionality. To avoid this problem, functions should share code by making use of utility routines. How can the use of utility routines be facilitated?

Forces:
This problem is difficult because
1. To facilitate the use of utilities, the utilities must be identified.
2. Many such utility routines may be present, all of which need to be catalogued for reference.
Solving this problem is feasible because
1. Utility routines may have certain characteristics which facilitate identification.

Solution: Identify functions that are called by many functions within the system. Treat these functions as utility functions. It may be useful to place groups of related utility functions in separate files.

Tradeoffs:
Pros: Utility functions represent re-usable components of a structured system. Re-usable components can lead to measurable benefits in terms of reduced development cycle time and project cost, and increased productivity.
Cons & Difficulties: In order for functions to be re-used effectively, they must be properly catalogued for easy reference, standardized for easy application and validated for easy integration [3]. If the number of such functions is large, related functions need to be identified and grouped together; otherwise searching for the appropriate function may be time consuming.

Pattern category 3: Re-modularize for maintainability

Pattern 5: Increase data modularity

Intent: Place related data items into a structure.

Metarule 5-1: Global(X,Y) -> Global(X,Y)   [Confidence: High, Support: High]
Implication: Whenever one global variable is accessed, there is a high probability that the other global variable is also accessed.

Metarule 5-2: Type(X,Y) -> Type(X,Y)   [Confidence: High, Support: High]
Implication: Whenever one type is accessed, there is a high probability that the other type is also accessed.

Problem: Legacy systems are likely to contain a number of global variables and user-defined types. How can relations between global variables (user-defined types) be clarified? How can the global variables (user-defined types) be more easily managed?

Forces:
This problem is difficult because
1. A large number of global variables and user-defined types may exist in a system, and examining them may be time consuming.
2. A global variable (user-defined type) may be related to many other global variables (user-defined types). Examining such complex relationships and arriving at meaningful conclusions may be very difficult.
Solving this problem is feasible because
1. The purpose of a function is much easier to understand if relationships between data are clear and data is well managed.

Solution: Identify related global variables (user-defined types). Examine them to see which of them form coherent entities. If they are logically related, combine them into a structure.

Tradeoffs:
Pros: Combining variables/types into structures leads to code that is easier to understand and change. If at some stage a shift is to be made to an object-oriented design paradigm, the structures become data attributes of potential classes.
Cons & Difficulties: It may be the case that one global variable/type is associated with a high degree of confidence with a number of other global variables/types, i.e. when the global variable/type is accessed, a number of other global variables/types are accessed. In this case, to avoid a large structure that hinders rather than promotes understandability, the software engineer needs to study the code and analyze which global variables/types are to be combined into a structure.

Pattern 6: Strengthen encapsulation

Intent: Identify potential classes.

Metarule 6: Accessed by(X) -> Type(X,Y)   [Confidence: One]
Implication: The function(s) access the type(s) on the RHS.

Problem: Many legacy systems were developed using the structured approach. To facilitate re-use and ease maintenance, how can these systems be modularized as object-oriented systems?

Forces:
This problem is difficult because
1. Identifying potential classes accurately in such large systems is not easy.
Solving this problem is feasible because
1. Classes contain related data attributes for which operations are defined. Related data attributes can be identified using Pattern 5 (Increase data modularity).

Solution: Identify collections of functions and the common data set they access.
These represent potential classes and should be packaged together to strengthen encapsulation and provide information hiding.

Tradeoffs:
Pros: Large programs that use information hiding have been found to be easier to modify, by a factor of four, than programs that don't [58]. Information hiding forms a foundation for both structured and object-oriented design.
Cons & Difficulties: Functions may access more than one type, in which case a careful study of the code is required to decide the type with which a function should be associated. It may be the case that the types accessed by a function form a coherent entity which can be transformed into a structure (see Pattern 5).

Pattern category 4: Improve performance

Pattern 7: Select appropriate storage classes

Intent: Make access to frequently accessed variables efficient.

Metarule 7: Global(X,Y) -> Accessed by(X)   [Coverage: High]
Implication: The global variable(s) on the LHS are used by most of the functions in the system.

Problem: Legacy systems are likely to contain a number of global variables. How can these variables be accessed more efficiently?

Forces:
This problem is difficult because
1. A large number of global variables may exist in a system, and examining all of them may be time consuming.
2. It may not be feasible to increase access efficiency for variables that are not frequently accessed.
Solving this problem is feasible because
1. Accesses to memory locations (variables) are much slower than accesses to CPU registers and often become a performance bottleneck.

Solution: Identify global variables that are frequently used. Declare them to be register variables.

Tradeoffs:
Pros: Declaring register variables is a suggestion to the compiler that they be placed in a register instead of memory. This provides a certain degree of control over program efficiency, and may result in faster access and speed improvement [59].
Cons & Difficulties: The storage of program variables in registers may interfere with the compiler's conventional usage of registers, thus slowing down execution of a program. Knowledge of the architecture and the compiler is therefore required before such declarations can be of use. It should also be kept in mind that the declaration is only a suggestion, which the compiler may choose to ignore [59]. Finally, not all application languages support the declaration of register variables.

Pattern category 5: Focus on high-level architecture

Pattern 8: Generate alternative views

Intent: Focus on architecture from multiple perspectives.

Metarule 8: Type(X,Y) -> Accessed by(X)   [Coverage: High (within a sub-system)]
Implication: The user-defined type(s) are used by most of the functions within the sub-system.

Problem: How can software architecture (i.e. the various sub-systems) be viewed from multiple perspectives so that large-scale structural changes are easier to make?

Forces:
This problem is difficult because
1. Software architectures are mostly implicit.
Solving this problem is feasible because
1. A clear understanding of the architecture is required to adapt a system to changing requirements in rapidly evolving domains, to reduce cost by sharing components across several projects, and to define product family architectures [60].

Solution: Generate a data view of a sub-system by highlighting the association of types with that sub-system.

Tradeoffs:
Pros: Sub-systems and their inter-relationships represent the architecture of a software system. It is important to focus on the architecture from multiple perspectives so that large-scale structural changes are easier to make [61]. A study of the types associated with a sub-system provides an alternative 'data' view, which is particularly important if changes are to be made to types or functions within the sub-system.
Cons & Difficulties: Different sub-systems of the software system may access the same types, making it difficult to associate a type with one sub-system only. In such a case, it may be feasible to apply Pattern 6 to the entire software system in order to group together functions with related types and hence arrive at an alternative modularization based on data abstraction.