Metarule-Guided Association Rule Mining for Program Understanding
Technical Report CS-014(1)-2004
Onaiza Maqbool, Haroon Babri
Computer Science Department
Lahore University of Management Sciences (LUMS)
Acknowledgment
We would like to thank the Lahore University of Management Sciences for providing funding for this
research. Thanks to Rainer Koschke of the Bauhaus group at the University of Stuttgart for providing the RSF
for CVS, Aero, Mosaic and Bash, and to Johannes Martin of the Rigi group, University of Victoria, for
providing the RSF for Xfig.
List of Figures
Figure 1: Our rule mining approach ..................................................................................................................... 9
Figure 2: Global variable access (Xfig) .............................................................................................................. 16
Figure 3: Global variable access (CVS) ............................................................................................................. 16
Figure 4: Global variable access (Aero) ............................................................................................................. 16
Figure 5: Global variable access (Mosaic) ......................................................................................................... 17
Figure 6: Global variable access (Xfig) .............................................................................................................. 17
Figure 7: Global variable access breakdown ...................................................................................................... 18
Figure 8: Function calls (Xfig) ........................................................................................................................... 21
Figure 9: Function calls (CVS) ........................................................................................................................... 22
Figure 10: Function calls (Aero)......................................................................................................................... 22
Figure 11: Function calls (Mosaic) ..................................................................................................................... 22
Figure 12: Function calls (Xfig) ......................................................................................................................... 23
Figure 13: Function calls (d_*files sub-system) ................................................................................................. 23
Figure 14: Function calls (e_*files sub-system) ................................................................................................. 24
Figure 15: Function calls (f_*files sub-system) ................................................................................................. 24
Figure 16: Function calls (u_*files sub-system) ................................................................................................. 24
Figure 17: Function calls (w_*files sub-system) ................................................................................................ 25
Figure 18: Types accessed by a subsystem (d_*files) ........................................................................................ 29
Figure 19: Types accessed by a subsystem (e_*files) ........................................................................................ 30
Figure 20: Types accessed by a subsystem (f_*files) ......................................................................................... 30
Figure 21: Types accessed by a subsystem (u_*files) ........................................................................................ 30
Figure 22: Types accessed by a subsystem (w_*files) ....................................................................................... 31
Figure 23: Function, global variable and user defined type statistics ................................................................ 31
Figure 24: Statistics for function calls, global variable accesses and user defined types accesses .................... 32
Figure 25: Global variable access statistics ........................................................................................................ 32
Figure 26: A comparison of accesses to global variables ................................................................................... 32
Figure 27: Function call statistics ....................................................................................................................... 33
Figure 28: A comparison of calls to functions.................................................................................................... 33
Figure 29: Frequently accessed functions........................................................................................................... 34
Figure 30: A comparison of frequently accessed functions................................................................................ 34
List of Tables
Table 1: A set of transactions representing sales of items.................................................................................... 6
Table 2: A sample set of transactions illustrating items used in our experiments ............................................. 10
Table 3: Re-engineering patterns and metarules ................................................................................................ 13
Table 4: Test system characteristics ................................................................................................................... 14
Table 5: Statistics of Xfig sub-systems .............................................................................................................. 14
Table 6: Global variables usage summary .......................................................................................................... 15
Table 7: Global variable access breakdown in test systems ............................................................................... 17
Table 8: Global variables accessed by one function........................................................................................... 18
Table 9: Functions called by one function.......................................................................................................... 19
Table 10: Functions called together with high confidence ................................................................................. 20
Table 11: Function call summary ....................................................................................................................... 21
Table 12: Function calls made in various sub-systems....................................................................................... 25
Table 13: Number of function calls made to functions in each sub-system ....................................................... 26
Table 14: Five or more function calls made to functions in each sub-system.................................................... 26
Table 15: Global variables with high association ............................................................................................... 27
Table 16: User defined types with high association ........................................................................................... 27
Table 17: Types accessed by functions in the test systems ................................................................................ 28
Table 18: Global variables accessed by maximum number of functions ........................................................... 29
Table 19: Accesses to user defined types in various Xfig sub-systems ............................................................. 29
Table of Contents
1. Introduction ................................................................................................................................................ 1
2. Related work ............................................................................................................................................... 3
3. An overview of association rule mining ................................................................................................... 6
4. Re-engineering patterns for structured legacy systems ......................................................................... 8
5. Metarule-guided association rule mining for program understanding ................................................ 9
5.1. Selecting items and transactions ...................................................................................................... 9
5.2. Setting appropriate thresholds for association rules ...................................................................... 10
5.3. Selecting interesting association rules in the re-engineering context ............................................ 11
5.4. Metarules and re-engineering patterns ........................................................................................... 11
6. Experiments and results .......................................................................................................................... 14
6.1 Pattern 1: Reduce global variable usage ........................................................................................ 14
6.2 Pattern 2: Localize variables .......................................................................................................... 18
6.3 Pattern 3: Increase locality of reference ......................................................................................... 19
6.4 Pattern 4: Identify utilities .............................................................................................................. 20
6.5 Pattern 5: Increase data modularity ................................................................................................ 26
6.6 Pattern 6: Strengthen encapsulation ............................................................................................... 27
6.7 Pattern 7: Select appropriate storage classes ................................................................................. 28
6.8 Pattern 8: Generate alternative views ............................................................................................. 29
6.9 Discussion of results ....................................................................................................................... 31
7. Conclusions ............................................................................................................................................... 35
References .......................................................................................................................................................... 36
Appendix ............................................................................................................................................................ 39
1. Introduction
Legacy systems are old software systems that are crucial to the operation of a business. These systems are
expected to have undergone changes in their lifetime due to changes in requirements, business conditions and
technology. It is quite likely that such changes were made without proper regard to software engineering
principles. The result is often a deteriorated structure, which is unstable but cannot be discarded because
replacement is costly. Another reason for retaining these legacy systems is that they embed business
knowledge which is not documented elsewhere.
Since it is often not feasible to discard a system and develop a new one, techniques must be employed to
improve the structure of the existing system. An effective strategy for change must be devised. Strategies for
change mentioned in [1] include maintenance, architectural transformation and re-engineering. Re-engineering
is a process that re-implements legacy systems to make them more maintainable [1]. According to [2], re-engineering is any activity that improves one’s understanding of software or prepares/improves the software
itself, usually for increased maintainability, reusability or evolvability.
Given the fact that software maintenance usually accounts for over 50% of project effort [1],[3],[4], making
it the single most expensive software engineering activity [1], and perhaps the most important life cycle phase
[5], the need for re-engineering to ease the maintenance effort is justified. The re-engineering option should be
chosen when system quality has been degraded by regular change, but change is still required, i.e. the system
under consideration has low quality but a high business value, and the re-engineering effort is less risky and
less costly than system replacement.
The re-engineering effort starts with gaining an understanding of the software system, a process known as
reverse engineering. Understanding is critical to many activities including maintenance, enhancement, reuse,
design of a similar system and training [6]. Reverse engineering has been heralded as one of the most
promising technologies to combat the legacy systems problem [7]. However, gaining system understanding is
difficult because documentation for the system is often not available and source code files are the only means
of information regarding the system. According to [8], system understanding takes up 47% of the software
maintenance effort. Hall [9] places the system understanding effort at 47%-62%. Tools and techniques are
thus required to make the program understanding task easier. Tools provide automated support for system
understanding at the procedural level by extracting the procedural design or at a higher level by extracting the
architectural design. Tools for procedural design extraction are based on techniques including knowledge
based systems with inference rule engines [10], graph parsing [11] and program plan recognition [12].
Techniques utilized by architectural design recovery tools include the composition of sub-system hierarchies
using (k,2)-partite graphs [13], graph transformations using relational algebra [14] and view extraction and
fusion using SQL [15].
In the past, the application of deductive techniques to different aspects of software engineering, including
program understanding, has been more frequent than the use of inductive techniques [16]. Deductive
techniques usually employ some knowledge base as their underlying technology which is used to deduce
relations within the software system. Great effort is required to build the knowledge base and continuously
maintain it. Moreover, algorithms used for deductions may be computationally demanding [16]. Researchers
have thus started exploring the use of inductive or data mining techniques in software engineering. Data
mining is considered one of the most promising interdisciplinary developments in the information industry
[17].
In this report, our focus is on the use of data mining, or more specifically, on association rule mining for
discovering interesting relationships within legacy system components which lead to program understanding.
These relationships can be used to identify recurring problems within legacy systems, and are thus related to
re-engineering patterns. Patterns were first adopted by the software community as a way of documenting
recurring solutions to design problems [18]. Since then, they have been used as an effective means to
communicate best practice in various aspects of software development including re-engineering. Re-engineering patterns for object-oriented legacy systems were identified based on experiences during the
FAMOOS project [19]. Our focus in this work is on traditional legacy systems developed using the structured
approach. Our report presents an approach to gain insight about the structure of these legacy systems by
applying metarule-guided association rule mining. Metarule-guided association rule mining is carried out as a
two step process. In the first step, we formulate metarules based on knowledge of typical problems in legacy
systems. These metarules place constraints on the form of association rules to be mined as well as on values of
interestingness measures. In the second step, metarule-guided mining is applied to the legacy system source
code to mine association rules. By relating the metarules to re-engineering patterns, system understanding is
gained, which allows suggestions for making subsequent changes and optimizations to the source code for
better maintainability. Thus we make two major contributions in this work. First, we show how metarule-guided association rule mining can be used to mine legacy system source code to gain insight into its structure
and detect re-engineering opportunities. Second, we formulate re-engineering patterns for structured legacy
systems and relate each metarule to a re-engineering pattern, thus presenting solutions to problems identified
by a metarule.
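The two steps above can be sketched in code. In the following illustrative Python fragment, the item kinds, thresholds, and rule representation are our own simplifications, not the exact metarule templates presented later in the report: a metarule constrains both the form of candidate association rules and the minimum values of their interestingness measures, and mined rules are filtered against it.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    # A mined association rule, reduced to the kind of item on each side
    # plus its interestingness measures (illustrative representation).
    antecedent_kind: str   # e.g. "function", "global_variable", "user_defined_type"
    consequent_kind: str
    support: float
    confidence: float

class Metarule(NamedTuple):
    # Step 1: a metarule formulated from knowledge of typical
    # legacy-system problems, constraining rule form and thresholds.
    antecedent_kind: str
    consequent_kind: str
    min_support: float
    min_confidence: float

    def matches(self, rule: Rule) -> bool:
        # A mined rule satisfies the metarule if its form and its
        # interestingness values meet the stated constraints.
        return (rule.antecedent_kind == self.antecedent_kind
                and rule.consequent_kind == self.consequent_kind
                and rule.support >= self.min_support
                and rule.confidence >= self.min_confidence)

# Example metarule: functions that access one global variable
# also access another (thresholds are hypothetical).
meta = Metarule("global_variable", "global_variable", 0.05, 0.9)

# Step 2: filter candidate rules mined from the source code.
mined = [Rule("global_variable", "global_variable", 0.10, 0.95),
         Rule("function", "function", 0.20, 0.99)]
interesting = [r for r in mined if meta.matches(r)]
```

Formulating the metarule first keeps the mining step focused: only rules whose shape matches a known legacy-system problem are retained for interpretation against a re-engineering pattern.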
The organization of this report is as follows. Section 2 gives an account of related work. In section 3 we
present an overview of association rule mining. Section 4 introduces re-engineering patterns. Section 5 details
our approach and presents metarules and related patterns. Section 6 gives the results of applying association
rule mining to test systems. Finally, we present the conclusions in section 7.
2. Related work
Based on studies of the IBM programming process and the OS/360 operating system, Lehman proposed
laws that guide the evolution of E-type software systems [20]. A study of the evolution of Logica plc Fastwire
(FW) financial transaction system, reported in 1997, supports most of the laws formulated earlier [21]. Both
these studies focus on system growth in terms of the number of modules within the system in order to
determine trends in software evolution. More recently, Godfrey and Tu carried out studies to understand the
evolution of open source software [22], [23]. An interesting observation is that the growth rate of most open
source software systems is super-linear, violating Lehman’s law of software evolution which suggests that the
growth of systems slows down as their size increases. Godfrey and Tu also present an approach for studying
architectural evolution of software systems [24], which integrates visualization and software metrics.
Evolution status of various entities is modeled by determining which entities are new, have been deleted, or
remain unchanged. Unlike these studies by Lehman and Godfrey, which model structural changes at the entity
level, our selection of items in the form of functions, global variables and user defined types will enable the
study of characteristics of legacy system evolution at a more detailed level.
Visualization can also be used to gain an understanding of a legacy system’s overall structure and its
evolution. Lanza et al. [25] propose a lightweight approach to understanding object-oriented legacy systems
using a combination of software metrics and visualization. They visualize various aspects of the system: for
example, a system hotspot view helps in identifying very large and very small classes, while a system
complexity view is based on inheritance hierarchies. These views are helpful in indicating problematic areas
which may require a deeper
study. The evolution of legacy object-oriented systems can also be studied using the same lightweight
approach [26].
The use of data mining for software understanding is gaining popularity as evidenced by recent work in this
area. One possible categorization of the work done would be according to the mining technique employed.
Popular techniques include association rule mining, concept learning and cluster analysis. Another useful
categorization is according to the application or purpose of work. Keeping in view that some researchers
employ more than one technique to arrive at results, in this report we chose to categorize research according to
its purpose e.g. architecture recovery, re-modularization, maintenance support, facilitating re-use, interaction
pattern mining and program comprehension.
The architecture recovery of software systems using data mining techniques is discussed in [27] - [30]. The
data mining technique employed in these papers is primarily association rule mining. The Identification of
sub-systems based on associations (ISA) was proposed by Oca and Carver [27]. They used association rule
mining for extraction of data cohesive sub-systems by grouping together programs that use the same data files.
Experiments were performed on a 25 KLOC Cobol system with 28 programs and 36 data files, and sub-systems were successfully identified. However, the results obtained were not compared with any existing
documentation or decomposition. A very similar approach is that of [28], where the use of a representation
model (RM) is discussed to represent the sub-systems identified using the ISA methodology in [27]. Sartipi et
al. [29] discuss a technique for recovering the structural view of a legacy system’s architecture based on
association rule mining, clustering and matches with architectural plans. Rather than using programs as sub-system entities as in [27], functions are used. Moreover, associations between functions are determined based
on the variables and types they access, and the function calls they make. After closely associated function
groups have been identified, a branch and bound algorithm is used for matching with architectural plans. In
experiments performed with the CLIPS system (40 KLOC, 734 functions, 59 aggregate data types and 163
global variables), a precision level of 90% and a recall level of 60% is maintained between the conceptual and
concrete architecture recovered through the defined process. Tjortjis et al. [30] also employ association rule
mining to arrive at decompositions of a system at the function level. Their rule mining approach is similar to
the one discussed in [29], where attributes are variables, data types and calls to functions. Groups of functions
i.e. sub-systems are created by finding common attributes participating in the same association rules.
Experiments performed on a Cobol system, with programs averaging 1000 lines of code, show that the results
compare satisfactorily with a mental model constructed by a developer of the system.
Software re-modularization using clustering techniques is discussed in [31] - [36]. Tzerpos and Holt [31]
present the case for using clustering techniques to re-modularize software, after the techniques have been
adapted to fit the peculiarities of the software domain. In [32], Wiggerts provides a framework to apply cluster
analysis for re-modularization. The clustering process is described in some detail, along with commonly used
similarity measures and clustering algorithms. Experiments with clustering as a re-modularization method are
described in [33] and [34]. Both papers conclude by recommending similarity measures and clustering
algorithms which yield good experimental results for software artifacts. A theoretical explanation to some
previous experimental results obtained by researchers in the area is provided in [35]. The paper also describes
a new clustering algorithm, which gives better results as compared to the currently employed algorithms for
clustering software. In [36], the weighted combined algorithm for software clustering is presented. This
algorithm shows improvement in clustering results as compared to previously employed clustering algorithms.
Data mining techniques have also been employed to support the maintenance task [16], [37] - [39]. The
use of inductive techniques to extract a maintenance relevance relation (MRR) from the source code,
maintenance logs and historical maintenance update records is discussed in [16], [37], [38]. An MRR simply
indicates that if a software engineer needs to understand file1, he/she probably also needs to understand file2.
The problem has been presented as a concept learning problem, and a decision tree classifier is used for
classifying file pairs as relevant, not relevant and potentially relevant. Relevance indicates that two files were
modified in the same update and potential relevance indicates that both files were looked at in the same
update. Experiments performed on the SX2000 system, a large legacy telephony switching system with 1.9
MLOC and 4700 files show that the 2-class problem, with relevant and non-relevant classes only, yields better
results than the 3-class problem. It is also seen that combining text based features with syntactic features
yields better results than using syntactic features alone. Zimmermann et al. [39] apply association rule mining
to ease maintenance by mining version histories. Association rule mining is used to predict likely changes
after a change has been made, prevent errors due to incomplete changes and detect coupling between items
which is not revealed by program analysis. Mining is carried out through the ROSE tool developed for this
purpose. The tool reads a version archive and groups changes into transactions so that rules describing them
are formed. ROSE was tested on 8 open source projects and was found to be a helpful tool in suggesting
further changes and in warning about missing changes.
Software re-use can be facilitated by data mining techniques [40] - [45]. Software library reuse patterns
have been mined using generalized association rules in [40]. The discovered patterns serve as guides to
identify typical usage of the software library i.e. the combination of classes and member functions that are
typically re-used by applications. The paper extends earlier work by the author [41], in which association rules
rather than generalized association rules were used. Re-use patterns for the KDE core libraries version 1.1.2
were mined by analyzing 76 applications. Reusable components are identified by finding similarities between
components using lexical rather than structural techniques in [42]. McCarey et al. [43] utilize collaborative
filtering for recommending reusable components by enabling prediction of the utility of an item based on the
user’s previous history and the opinions of other like-minded users. The building of a digital library for source
code to facilitate reuse of code segments is discussed in [44]. Garg [45] discusses the sharing of knowledge of
re-usable components across multiple projects.
El-Ramly et al. [46] apply sequential data mining to user activities to discover interaction patterns depicting
how users interact with systems. Interaction pattern mining is applied to legacy and web-based systems. For
legacy systems, these patterns reflect active services. For web-based systems, the patterns can be used for
reengineering the website for easier and faster access.
Program comprehension can be aided by source code mining [47], [48]. Balanyi and Ferenc [47]
automatically search for design patterns in C++ code. A tool called Columbus is used to build an internal
representation of the source code which is compared with pattern descriptions written in DPML, a language
based on XML. Four publicly available C++ projects were used for experiments. Some problems were faced
because of rule violations in implementation, and when the structures of patterns were similar. However, the
overall results were satisfactory. The comprehension of C++ programs by clustering together similar entities
based on their attributes is proposed in [48]. Four entities are used: classes, member data, member functions
and parameters. Results of applying the process to three open source systems have been presented. Analysis of
the results reveals correlations between classes, thus revealing portions of code that have common
characteristics and are expected to change together.
A workshop for mining software repositories for assisting in program understanding and studying evolution
was held recently [49]. Papers presented in the workshop covered various aspects, including the infrastructure
required for extraction of information, use of mining for program understanding, identification of change
patterns, defect analysis, software re-use and process and community analysis.
The present work employs association rule mining to find interesting associations between global variables,
user defined types and function calls within a system and relates them to re-engineering patterns. Although
association rule mining has been employed by other researchers, e.g. Sartipi et al. [29] and Tjortjis et al. [30],
their purpose and technique is different from the one proposed in this work. The purpose of association rule
mining in the case of Sartipi et al. is architectural design recovery, i.e. decomposition of the legacy system
into modules, where a module is a collection of functions, data types and variables. The item set used by them
consists of global variables, data types and function calls within the system. The transaction set consists of
global variables, data types accessed and functions called by a function. The Apriori algorithm is used to
generate frequent item sets, which indicate interesting association between functions. Functions that are
associated are candidates for being placed in the same module. The architectural query language (AQL) is
used to describe a system’s conceptual architecture. Based on this architecture and the frequent item sets,
clustering and a branch and bound algorithm are used to instantiate the concrete architecture. Although the
item and transaction sets employed in [29] are similar to the ones we employ in this work, it is evident that the
problem addressed is very different. Whereas Sartipi et al. use association rules to indicate association
between functions to discover modules, in this work, our purpose in applying association rule mining is to
gain program understanding and indicate problem areas where improvements may be made. To achieve this
purpose we find associations not only between functions, but between functions and other items as well e.g.
between global variables and user defined types.
Tjortjis et al. also employ association rule mining for grouping together similar entities within a software
system. The item set used by them consists of variables, data types and calls to blocks of code (modules),
where modules may be functions, procedures or classes. The transaction set thus consists of variables, types
accessed and calls made by modules. In their algorithm, large item sets are first generated by finding item sets
with support above a user-defined threshold. From these item sets, association rules with confidence
greater than a user-defined threshold are generated. Finally groups of modules are created based on the
number of common association rules. Thus in their case, similar to the rule mining approach in [29], rules are
mined to group modules. We, on the other hand, use metarule-guided association rule mining to find
associations between items, which can be used to identify potential problems in a system. By relating our
mined association rules to re-engineering patterns, we find solutions to commonly occurring problems. To the
best of our knowledge, research has not been conducted to relate association rule mining to re-engineering
patterns as is done in this work.
3. An overview of association rule mining
Association rule mining is a data mining technique that finds interesting association or correlation
relationships among a large set of data items [17]. Traditionally, association rule mining has been employed as
a useful tool to support business decision making by discovering interesting relationships among business
transaction records.
To illustrate the concept of association rule mining, consider a set of items I = {i1, i2, …, in}. Let D be a set
of transactions, with each transaction T corresponding to a subset of I. An association rule is an implication of
the form A ⇒ B, where A ⊂ I, B ⊂ I and A ∩ B = ∅. As an example, consider a set of computer accessories
(CDs, memory sticks, microphones, speakers) that are available at a certain store. These accessories form the
set of items I of interest to us. Every sale made represents a transaction T. Suppose the sales made are
represented in the form of the following set of transactions D in Table 1:
Transaction ID   Items
T1               CD, memory stick
T2               CD
T3               microphone, speaker
T4               CD, speaker, microphone
T5               memory stick, microphone, speaker
Table 1: A set of transactions representing sales of items
Association rules in the above case represent items that tend to be sold together, e.g. the association rule microphone ⇒ speaker shows that those who buy microphones also tend to buy speakers. This association rule can also be written as buys(X, “microphone”) ⇒ buys(X, “speaker”), where the variable X represents customers who bought the items. Since this association rule contains a single predicate (buys), the association rule is referred to as a single-dimensional association rule. When more than one predicate is present, the rule is referred to as multidimensional.
A large number of such association rules may exist in a given set of transactions, and not all of them may be of interest. For example, the association rule CD ⇒ speaker can be inferred from the transaction set in Table 1, but it is clear that this association is not as strong (or interesting) as the one between microphones and speakers. An
association rule is said to be interesting if it is easily understood, valid, useful, novel or validates a hypothesis
that the user sought to confirm [17]. To find interesting rules, support and confidence are commonly used as
objective measures of interestingness. Support represents the percentage of transactions in D which contain
both A and B. Confidence is the percentage of transactions in D containing A that also contain B. Another
measure of interestingness is coverage. The coverage of an association rule is the proportion of transactions in
D that contain A. In terms of probability:
Support(A ⇒ B) = P(A ∪ B)
Confidence(A ⇒ B) = P(B | A)
Coverage(A ⇒ B) = P(A)
In Table 1, microphone ⇒ speaker is an association rule with support 3/5, confidence 1, and coverage 3/5.
This rule is more interesting than CD ⇒ speaker, which has support 1/5, confidence 1/3, and coverage 3/5.
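To make the three measures concrete, they can be computed directly from the transactions of Table 1. The following Python sketch is illustrative only and is not part of the mining tool described in this report:

```python
# Transactions from Table 1: each sale is a set of items.
transactions = [
    {"CD", "memory stick"},
    {"CD"},
    {"microphone", "speaker"},
    {"CD", "speaker", "microphone"},
    {"memory stick", "microphone", "speaker"},
]

def coverage(a):
    """Fraction of transactions that contain every item in A."""
    return sum(1 for t in transactions if a <= t) / len(transactions)

def support(a, b):
    """Fraction of transactions that contain every item in both A and B."""
    return sum(1 for t in transactions if a <= t and b <= t) / len(transactions)

def confidence(a, b):
    """Among transactions containing A, the fraction that also contain B."""
    return support(a, b) / coverage(a)

# microphone => speaker: support 3/5, confidence 1, coverage 3/5
mic_speaker = (support({"microphone"}, {"speaker"}),
               confidence({"microphone"}, {"speaker"}),
               coverage({"microphone"}))

# CD => speaker: support 1/5, confidence 1/3, coverage 3/5
cd_speaker = (support({"CD"}, {"speaker"}),
              confidence({"CD"}, {"speaker"}),
              coverage({"CD"}))
```

The subset operator `<=` on Python sets checks that every item of the rule side appears in the transaction, which matches the definitions above.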
In order to make the data mining process more effective, a user may place various constraints under which
mining is to be performed. These include interestingness constraints which involve specifying thresholds on
measures of interestingness (such as support, confidence and coverage), and rule constraints which place
restrictions on the number of predicates and/or on the relationships that exist among them, based on the
analyst’s experience, expectations or intuition regarding the data [17]. Rule constraints thus specify the form
of the association rule to be mined and are expressed as rule templates or ‘metarules’. As an example, suppose
we have customer-related data for the sales in Table 1, and we want to associate customer characteristics with sales of speakers. A metarule for the information in which we are interested is of the form P(X,Y) ⇒ buys(X, “speaker”), where variable X represents a customer and variable Y takes on values of the attribute assigned to the predicate variable P. During the mining process, rules which match this metarule are found. One example of a matching rule is age(X, “young”) ⇒ buys(X, “speaker”). Since both the predicate variable P and variable Y may vary, the rules gender(X, “male”) ⇒ buys(X, “speaker”) and age(X, “old”) ⇒ buys(X, “speaker”) also satisfy the metarule.
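Metarule matching can be sketched as a template filter over candidate rules. In this hypothetical Python illustration, single-dimensional rules are encoded as tuples, and the template fixes the consequent to buys(X, “speaker”) while leaving the antecedent predicate P and value Y free:

```python
# Hypothetical mined rules, each a single-predicate implication encoded as
# (lhs_predicate, lhs_value, rhs_predicate, rhs_value).
rules = [
    ("age",    "young", "buys", "speaker"),
    ("gender", "male",  "buys", "speaker"),
    ("age",    "old",   "buys", "speaker"),
    ("age",    "young", "buys", "CD"),
]

def matches_metarule(rule):
    """Template P(X, Y) => buys(X, "speaker"): the predicate variable P and
    the value Y are free; only the consequent is fixed."""
    _, _, rhs_pred, rhs_val = rule
    return rhs_pred == "buys" and rhs_val == "speaker"

matching = [r for r in rules if matches_metarule(r)]
# the first three rules satisfy the metarule; the CD rule does not
```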
4. Re-engineering patterns for structured legacy systems
Patterns have been used in various disciplines to enable collective experiences to be documented and
shared. In the software engineering domain, they have been used effectively to communicate best practices in
various aspects of software development including the development process, design, testing and re-engineering. As pointed out in section 1, re-engineering is carried out to improve the structure of legacy
software systems to make them more maintainable. Although there are different reasons for re-engineering
e.g. improving performance, porting the system to a new platform, exploiting new technology, the actual
technical problems within legacy systems are often similar and hence some general techniques or patterns can
be utilized to aid in the re-engineering task [18]. Re-engineering patterns record experiences about
understanding and modifying legacy software systems and thus present solutions to recurring re-engineering
problems. Mens and Tourwe [50] point out that trying to understand what a legacy system does, what parts of
the legacy system to change and the potential impact of these changes is a significant problem. Unlike forward engineering, which is supported by established processes, no such processes are available for re-engineering. Re-engineering patterns fill this gap by providing expertise that can be consulted.
The re-engineering cycle typically involves some form of reverse engineering to understand the system and
to identify problem areas, followed by forward engineering to restructure a system. Re-engineering patterns
in [18] focus on the re-engineering cycle of object-oriented legacy systems, although some of the identified
patterns may apply to software systems which are not object-oriented. Examples of typical problems in object-oriented legacy systems include lack of cohesion within classes, misuse of inheritance and violation of
encapsulation [18].
In this work, we present re-engineering patterns for legacy systems developed using the structured
approach. These patterns help in identifying recurring problems within structured code and offer suggestions
for improvement. Similar to other software patterns, they capture structural level problems that may not be
evident from any one part of a legacy system. It is relevant to note that the re-engineering patterns in [18]
capture the entire re-engineering cycle, starting from ‘setting the right direction’ and ‘first contact’ with the
system, to actually restructuring (in the context of object-oriented systems, refactoring) the system to remove
some of the problems which appear in object-oriented code as it evolves. Most of the patterns that deal with
initial phases of re-engineering apply to object-oriented as well as structured systems. However, most of the
patterns dealing with restructuring take into account problems within object-oriented code. In this work, we do
not attempt to re-capture all re-engineering patterns, since, as already pointed out, many patterns, especially those involving the initial re-engineering phases, are equally valid for structured systems. Our focus is on
identifying problems which arise in structured systems as they evolve. Thus the patterns identified in this
work complement those identified in [18] by addressing some typical problems within legacy structured
systems.
5. Metarule-guided association rule mining for program understanding
In this work, we employ association rule mining to aid the design discovery process of structured legacy
systems. To make our mining process more effective, we use constraint based mining. Constraints are placed
on the form of association rules to be mined and on interestingness measure values. These constraints are
specified as metarules. For simplicity, in the following sections 'metarule' refers to a combination of association rule form and interestingness measure thresholds, rather than a constraint on association rule form only, since we employ such a combination to reveal system characteristics. The metarules
we develop are based on knowledge of some typical problems in legacy systems and serve as templates to
mine meaningful association rules. Each metarule is related to a re-engineering pattern. Re-engineering
patterns focus on the process of discovery i.e. they discover an existing design, identify problems and then
present appropriate solutions [18]. A metarule related to a re-engineering pattern aids in this discovery process
by capturing symptoms of typical problems that arise in legacy software systems.
[Figure: a two-step flow diagram. In Step 1, a body of knowledge about legacy systems is used for metarule formulation. In Step 2, the metarules guide the association rule mining process, which is applied to the test legacy systems; the resulting association rules are related to re-engineering patterns, which suggest solutions to the problems identified by the rules.]
Figure 1: Our rule mining approach
Figure 1 illustrates our rule mining approach as a two step process. In the first step, we use knowledge of
generic problems within structured legacy systems to arrive at metarules. In the next step, metarule-guided
mining is carried out on the test legacy systems to find association rules which serve as indicators of problems
that are present in the system. We relate metarules to re-engineering patterns by describing how a re-engineering pattern presents a solution to problems within a legacy system, as identified by the mined
association rules.
In this section, we discuss various factors that must be taken into account during formulation of metarules.
Results of carrying out metarule-guided mining on test systems are described in section 6.
5.1. Selecting items and transactions
A metarule can be thought of as a hypothesis regarding relationships that a user is interested in confirming
[17]. As a first step, it is necessary to identify items between which relationships are to be mined. In our case,
the guiding principle is to choose items which facilitate understanding of the source code and identification of
problems, and also allow suggestions for restructuring the code for greater maintainability. Most of the legacy
software systems that exist have been developed using the structured approach, with functions or routines
forming basic components. Moreover, in legacy software, the use of global variables is often widespread
leading to difficulty in understanding the code. In view of these facts, we decided to use functions and global
variables as items. In addition, we use user defined types, since these become potential data objects when code is to be restructured as object-oriented code.
To illustrate the item and transaction sets, let F = {f1, f2,…,fn} be the set of all functions, G = {g1, g2,…,gk}
be the set of all global variables, and U = {u1, u2,…,um} be the set of all user defined types present in a
software system. The item set I is the union of these sets, i.e. I = F ∪ G ∪ U. Each transaction T corresponds to a subset of I and denotes the functions called, and the global variables and user defined types accessed¹ by a function
within the software system. As an example, consider a software system consisting of two functions (f1 and f2)
only. Suppose f1 accesses global variables g1 and g2, user defined types u1, u3 and u5, and calls the function f2.
Suppose f2 accesses global variables g2 and g3, user defined types u1, u2 and u3, and does not call any function.
The transaction set for this system is presented in Table 2.
Calling function   Global variables   User defined types   Called functions
f1                 g1, g2             u1, u3, u5           f2
f2                 g2, g3             u1, u2, u3           —
Table 2: A sample set of transactions illustrating items used in our experiments
Thus in the transaction set that we use, each transaction consists of items in the form of global variables
and user defined types accessed by a function, and function calls made by the function. Therefore, there are as
many transactions as the number of functions in the software system. To obtain results that are meaningful in
the re-engineering context, the function which accesses global variables, user defined types and makes
function calls is also treated as an item.
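As a sketch (assuming the simple fact tables below, which mirror Table 2), the transaction set can be built by taking, for each function, the function itself together with everything it accesses and calls:

```python
# Facts for the two-function example, mirroring Table 2.
globals_accessed = {"f1": {"g1", "g2"}, "f2": {"g2", "g3"}}
types_accessed   = {"f1": {"u1", "u3", "u5"}, "f2": {"u1", "u2", "u3"}}
functions_called = {"f1": {"f2"}, "f2": set()}

def build_transactions():
    """One transaction per function: the calling function itself plus the
    global variables and user defined types it accesses and the functions
    it calls."""
    return {f: {f} | globals_accessed[f] | types_accessed[f] | functions_called[f]
            for f in globals_accessed}

transactions = build_transactions()
# transactions["f1"] contains f1, g1, g2, u1, u3, u5 and f2
```

There is one transaction per function, and including the calling function as an item makes the "Accessed by" side of the metarules expressible.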
5.2. Setting appropriate thresholds for association rules
Association rules that are mined using rule based constraints which restrict the rule form may not all be
interesting. For finding interesting association rules which indicate problems or re-engineering opportunities,
we require interestingness constraints. Traditionally high thresholds of support, confidence and coverage have
been used for finding interesting association rules [17]. We use both low and high thresholds of the three
measures: coverage, support and confidence. Low thresholds can reveal interesting facts about the software
and provide insight into its structure, as illustrated by patterns that we describe in section 5.4. In general,
assigning fixed values to thresholds such as 0.7 for high and 0.3 for low is not desirable due to the following
reasons [18]. Firstly, threshold values should be selected based on coding standards used by the development
team, and these may not be available. Secondly, tools that employ thresholds often display only abnormal entities, so the user is not aware of the number of normal entities, which may result in a distorted view of the
system. A study on software metrics, undertaken with various threshold values [51] revealed that observations
were independent of threshold value. Due to the reasons cited, it may not be meaningful to fix an absolute
threshold, since system characteristics such as size, number of global variables, user defined types and
functions, can vary widely from system to system. Since the confidence measure does not depend on these
system characteristics and is related to the association between two items only, setting thresholds for the
confidence measure is not an issue as compared to the other measures.
Often, instead of setting absolute thresholds, researchers advocate other techniques for interpreting results. Demeyer et al. [18] suggest that data be presented in a table and sorted according to various
measurement values to obtain a meaningful interpretation. To determine a relationship among data points
describing one or two variables, Fenton and Pfleeger [52] use box plots and scatter plots. Box plots utilize
median and quartiles to represent the data and suggest outliers. Whereas a box plot hides most of the expected
¹ By an access, we mean a reference to a variable by using or setting its value, or taking its address.
behaviour and shows unusual values, a scatter plot graphs all the data points to reveal trends. Histograms are
also used to reveal the distribution of data.
For displaying and interpreting results, we adopt the following approach. Each re-engineering pattern in
section 5.4 has an associated metarule, which specifies the association rule form. For metarules involving low
or high thresholds of support and coverage, we mine the system to find all association rules of the form
specified by the metarule. As suggested in [18], [52], the entire results are displayed using histograms or other
meaningful graphs. High and low thresholds for the support and coverage measures are then determined for
the system under consideration interactively by visually inspecting the results of metarule-guided mining.
Interactive determination of thresholds may be simplified by indicating certain measures e.g. mean, median,
quartiles on the graph.
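A sketch of how such summary measures might be computed for a graph of access counts, using Python's statistics module with hypothetical data (the outlier criterion of mean plus two standard deviations is one possible choice):

```python
import statistics

# Hypothetical access counts: number of functions accessing each global.
counts = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 62]

median = statistics.median(counts)
q1, q2, q3 = statistics.quantiles(counts, n=4)
mean = statistics.mean(counts)
stdev = statistics.pstdev(counts)

# One possible outlier criterion: values beyond mean + 2 * standard deviation.
upper_limit = mean + 2 * stdev
outliers = [c for c in counts if c > upper_limit]
# the global accessed by 62 functions stands out as a restructuring candidate
```

Overlaying median, quartiles, and the control limit on the histogram lets a user pick system-specific "low" and "high" thresholds visually rather than fixing an absolute value.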
5.3. Selecting interesting association rules in the re-engineering context
It is relevant to note that if we employ user defined types, functions and global variables as items, and use
coverage, support and confidence interestingness measures with thresholds low, high and one, the number of
possible metarules is quite large². Useful metarules need to be filtered using subjective interestingness
measures. These measures are based on user beliefs in data and find rules interesting if they are unexpected or
offer information on which the user can act [17].
We reduce the total possible metarules to a small subset by keeping in mind that our purpose is to use these
metarules to identify re-engineering opportunities. Thus we select metarules that highlight some typical
problems in legacy structured systems. Since we use items based on the set of global variables, user defined
types and functions, our metarules are further restricted to problems that are related to this item set.
Amongst the typical problems that occur in legacy code is a lack of modularity, i.e. strong coupling between
modules, which hampers evolution [18]. Although software is expected to evolve over time, the guideline is
that evolution should not degrade the quality of software. One way to reduce coupling between modules and
thus reduce side effects when changes are made is to reduce the number of global variables or to reduce their
scope. Metarules 1 and 2, which are related to patterns 1 and 2 (Table 3), are used to detect global variables
that can be localized and thus reduce coupling. Another means to ensure that side effects are detected and do
not propagate throughout the system is to encourage functions that are coupled to be identified and placed
together. Metarule 3, related to pattern 3 (Table 3), helps in identifying these coupled functions. Duplicated
functionality is another typical problem in legacy systems, since quite often various teams re-implement
similar functionality [18]. To avoid this problem, functions should share code by making use of utility
functions. Use of utilities is facilitated if these utilities are identified and catalogued for reference. Metarule 4,
related to pattern 4 (Table 3), deals with identification of utilities. Legacy systems may need to be
modularized so that they can be more easily understood and maintained. One modularization option is to
convert a structured system to an object-oriented system, by identifying data items that are potential objects
and related functions which act upon these objects. Metarules 5 and 6 address this issue (Table 3). Patterns
and related metarules are briefly described in section 5.4. Details are presented in the appendix.
5.4. Metarules and re-engineering patterns
According to our description of metarules in the introduction to section 5, a metarule places a restriction on
the form of association rules to be mined as well as on interestingness measure values. Rule based constraints
are restrictions that may be placed, for example, on the relationships between predicates, or on the values that
these predicates may assume. As an example, consider a metarule of the form:
² If we consider global variables, user defined types, and functions as items and restrict the maximum number of predicates on the left and right hand sides of association rules to one, the possible number of metarules is 3². Differentiating between calling functions and called functions increases this number to 4². Additionally, we have three interestingness measures, each of which can have three possible values. Thus the number of metarules is 4² × 3³ = 432. This number is even larger if we remove the restriction on the maximum number of predicates.
Global(X,Y) ⇒ Accessed by(X)    (M1)
A more general form of this metarule is P1(X,Y) ⇒ P2(X), where P1 and P2 are predicate variables that are instantiated to the attributes Global and Accessed by in M1. Thus in M1, a restriction is placed on the attribute
to which a predicate is instantiated, as well as on the relationship between the predicates. Variable X
represents functions, and variable Y represents global variables used by the function. When association rules
are mined using this template, global variables accessed by various functions in the software system are
returned. To find association rules that are meaningful in the re-engineering context, an additional constraint
is placed on interestingness measures. Thus the complete metarule (for details, refer to Table 3 and the appendix) is represented as:
Global(X,Y) ⇒ Accessed by(X)   Coverage: Low    (M2)
When association rules are mined using this template, we are able to identify a re-engineering opportunity, i.e. global variables which are used by few functions within the system. In this work, we use the convention of M2 to describe metarules. Predicate names (attributes) are self-explanatory.
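A minimal sketch of mining under metarule M2 against a small fact base is shown below; the variable and function names, the threshold, and the system size are hypothetical:

```python
# Hypothetical fact base: for each global variable, the set of functions
# that access it. Names and sizes are invented for illustration.
global_accesses = {
    "g_config": {"init", "main", "parse", "report", "cleanup"},
    "g_tmpbuf": {"format_line"},
    "g_count":  {"inc_counter", "dec_counter"},
}
TOTAL_FUNCTIONS = 50  # assumed number of functions in the system

def mine_low_coverage_globals(threshold):
    """Metarule M2 sketch: Global(X, Y) => Accessed by(X), Coverage: Low.
    The coverage of such a rule is the fraction of all functions in the
    system that access the global variable Y."""
    hits = []
    for g, funcs in global_accesses.items():
        cov = len(funcs) / TOTAL_FUNCTIONS
        if cov < threshold:
            hits.append((g, cov))
    return hits

low = mine_low_coverage_globals(threshold=0.05)
# flags g_tmpbuf (coverage 0.02) and g_count (coverage 0.04)
```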
Some metarules which are interesting in the re-engineering context, their implications, and related re-engineering patterns are shown in Table 3. In some of the metarules identified, one out of the three interestingness measures has been used.
This indicates that the value of the other two measures does not influence the result. However, it is possible to
identify metarules where a combination of measures gives interesting results, as is evident from pattern 5.
We discuss benefits of employing the re-engineering patterns related to metarules, as well as issues and problems, in the appendix. There are various popular forms for expressing patterns, e.g. the Alexandrian form, the Gang of Four (GOF) form and the Coplien form [53]. To describe our re-engineering patterns, we adapt the pattern form used by Demeyer et al. [18]. Related patterns are grouped into categories to aid comprehension.
Pattern category 1: Enhance modularity and control side effects
Pattern 1: Reduce global variable usage
  Intent: Remove global variables that are used by very few functions.
  Metarule 1 — Form: Global(X,Y) ⇒ Accessed by(X)   Coverage: Low
  Implication: Only a small proportion of functions in the system access the global variable(s) on the LHS.
Pattern 2: Localize variables
  Intent: Reduce the scope of variables that have larger scope than necessary.
  Metarule 2 — Form: Global(X,Y) ⇒ Accessed by(X)   Confidence: One
  Implication: The global variables are used by one function only.
Pattern 3: Increase locality of reference
  Intent: Group related functions together to improve understandability and maintainability.
  Metarule 3-1 — Form: Function(X,Y) ⇒ Called by(X)   Confidence: One
  Implication: The functions are called by one function only.
  Metarule 3-2 — Form: Function(X,Y) ⇒ Function(X,Y)   Confidence: High, Support: High
  Implication: Whenever one function is called, there is a high probability that the other function is also called.
Pattern category 2: Avoid duplicated functionality
Pattern 4: Identify utilities
  Intent: Identify utility routines so that they can be shared by functions.
  Metarule 4 — Form: Function(X,Y) ⇒ Called by(X)   Coverage: High
  Implication: Function(s) on LHS are called by most of the functions in the system.
Pattern category 3: Re-modularize for maintainability
Pattern 5: Increase data modularity
  Intent: Place related data items into a structure.
  Metarule 5-1 — Form: Global(X,Y) ⇒ Global(X,Y)   Confidence: High, Support: High
  Implication: Whenever one global variable is accessed, there is a high probability that the other global variable is also accessed.
  Metarule 5-2 — Form: Type(X,Y) ⇒ Type(X,Y)   Confidence: High, Support: High
  Implication: Whenever one type is accessed, there is a high probability that the other type is also accessed.
Pattern 6: Strengthen encapsulation
  Intent: Identify potential classes.
  Metarule 6 — Form: Accessed by(X) ⇒ Type(X,Y)   Confidence: One
  Implication: Function(s) access the type(s) on the RHS.
Pattern category 4: Improve performance
Pattern 7: Select appropriate storage classes
  Intent: Make access to frequently accessed variables efficient.
  Metarule 7 — Form: Global(X,Y) ⇒ Accessed by(X)   Coverage: High
  Implication: Global variable on LHS is used by most of the functions in the system.
Pattern category 5: Focus on high-level architecture
Pattern 8: Generate alternative views
  Intent: Focus on architecture from multiple perspectives.
  Metarule 8 — Form: Type(X,Y) ⇒ Accessed by(X)   Coverage: High (within a sub-system)
  Implication: User defined type(s) are used by most of the functions within the sub-system.
Table 3: Re-engineering patterns and metarules
6. Experiments and results
For conducting metarule-guided association rule mining, source code of five open source software systems
written in C was used. A brief description of these systems is provided in Table 4.
                      CVS                     Aero                  Mosaic       Bash         Xfig
Description           Version control system  Rigid body simulator  Web browser  Unix shell   Drawing tool
Version               1.8                     1.7                   2.6          1.14.4       3.2.3
Lines of code (LOC)   30k                     31k                   37k          38k          75k
Functions             810                     686                   818          892          1661
Global variables      419                     535                   348          539          1746
User defined types    375                     814                   112          198          828
Table 4: Test system characteristics
CVS, Aero, Mosaic and Bash have been used for architecture recovery experiments in [54] and Xfig has
been used for architecture recovery experiments in [36]. Our data mining experiments are helpful in gaining
understanding of the test systems by providing a more detailed view of their structure.
The source files for the five systems have been parsed using the Rigi tool and relevant ‘facts’ have been
stored in an exchange format called the ‘Rigi Standard Format’ (RSF) [55], [56]. The transaction set discussed
in section 5.1 was developed from this fact set. In the following sections, we present results of applying
association rule mining to our test systems and analyze these results.
It is relevant to note that the Xfig system consists of five major subsystems, whose source code files can be
identified by their names. We have experimented with the 95 files in these sub-systems, leaving 4 files which
have not been considered. We do not expect that the average figures and percentages obtained for Xfig will
change substantially on inclusion of these 4 files. Table 5 presents statistics related to the Xfig sub-systems.
System           Purpose               Functions
Xfig d_* files   Drawing tasks         94
Xfig e_* files   Editing tasks         369
Xfig f_* files   File related tasks    139
Xfig u_* files   Utility files         422
Xfig w_* files   Window related tasks  637
Table 5: Statistics of Xfig sub-systems
6.1 Pattern 1: Reduce global variable usage
(Metarule 1 form: Global(X,Y) ⇒ Accessed by(X)   Coverage: Low)
To evaluate the applicability of Pattern 1, the source code of the five systems was mined using metarule 1.
Table 6 summarizes statistics obtained as a result of mining association rules using this metarule.
                                             CVS          Aero          Mosaic        Bash        Xfig
Number of association rules
satisfying the metarule                      1469         2628          858           1714        8280
Number of functions accessing global
variables (%age of total functions)          391 (48%)    422 (62%)     370 (45%)     545 (61%)   1329 (80%)
Median number of functions
accessing a global variable                  2            3             2             2           3
Average number of functions
accessing a global variable                  3.99         5.29          2.93          4.12        5.97
Standard deviation                           5.96         10.31         4.32          5.70        15.73
Most frequently accessed
global variable                              Noexec       aktuelleWelt  current_win   rl_point    XtStrings
Coverage of the most frequently
accessed global variable (# of functions)    0.068 (55)   0.152 (104)   0.057 (47)    0.07 (62)   0.139 (231)
Table 6: Global variables usage summary
The metarule associated with Pattern 1 can be used to derive useful information from the system when
coverage is low. It can be seen from Table 6 that coverage values for the accessed global variables in all five
systems remain below 0.2. Even in Mosaic, which has the least coverage value, the most frequently accessed
global variable is accessed by 47 functions. The number of functions is higher in the other systems. If a global
variable is accessed by 47 functions, there is reason to define it as global. It is obvious that a coverage value of
0.2 cannot be considered low in the above systems. This result illustrates why we do not recommend setting
absolute thresholds. It also illustrates the difficulty in setting a threshold which can be used for all systems.
Instead of trying to set an absolute threshold, we depict our results using a graph which can be inspected
visually to identify unusual values that represent restructuring opportunities. To enable the graph to be
interpreted interactively, the median, quartiles, and upper and lower control limits (defined as mean ± 2 × standard deviation) may be depicted on the graph. As an example, the global variable usage graph for Bash, as mined by metarule 1, is depicted in Figure 2.
[Figure: histogram of the number of functions accessing each global variable in Bash; y-axis ranges from 0 to 70.]
Figure 2: Global variable access (Bash)
It is evident from Figure 2 that a substantial proportion of global variables are being accessed by very few
functions in Bash. The graphs of the other four systems show a very similar global usage trend (Figure 3 –
Figure 6).
[Figure: histogram of the number of functions accessing each global variable in CVS; y-axis ranges from 0 to 60.]
Figure 3: Global variable access (CVS)
[Figure: histogram of the number of functions accessing each global variable in Aero; y-axis ranges from 0 to 120.]
Figure 4: Global variable access (Aero)
[Figure: histogram of the number of functions accessing each global variable in Mosaic; y-axis ranges from 0 to 50.]
Figure 5: Global variable access (Mosaic)
[Figure: histogram of the number of functions accessing each global variable in Xfig; y-axis ranges from 0 to 250.]
Figure 6: Global variable access (Xfig)
Figure 6: Global variable access (Xfig)
It is interesting to note that the median number of functions accessing a global variable is almost the same
(2 and 3) for all systems. The average number of functions accessing a global variable is also similar. Table 7
depicts the breakdown for global variable access in all five test systems, and Figure 7 shows the breakdown specifically for Bash.
                                    CVS   Aero   Mosaic   Bash   Xfig
Percentage of global variables
accessed by more than 5 functions   13%   13%    9%       14%    18%
Percentage of global variables
accessed by 5 or less functions     87%   87%    91%      86%    82%
Table 7: Global variable access breakdown in test systems
[Figure: pie chart for Bash; 86% of global variables are accessed by 5 or fewer functions, 14% by more than 5 functions.]
Figure 7: Global variable access breakdown
Metarule-guided rule mining has thus successfully identified a re-engineering opportunity. To improve
modularity, and hence software quality, it will be useful to examine global variables being accessed by very
few functions in order to ascertain whether there is a need to define these variables globally. As an example,
consider the variable noninc_history_pos, used in just two functions, noninc_search and
noninc_dosearch, in Bash. The only valid reason for defining this variable as global is that its value is to
be retained throughout the duration of the program. Even in such a case, the variable should be defined as
static to restrict its scope. An examination of the code shows that the variable has indeed been defined as
static. On the other hand, the rl_kill_ring_length variable, also used by just two functions,
rl_yank_pop and rl_kill_text within readline.c, is not a static variable. If the variable value is
not required to persist throughout the lifetime of the program, it should be passed as a parameter to the
relevant functions instead of being defined globally. Even if a global variable is required, its scope should be
restricted according to usage to reduce coupling.
6.2 Pattern 2: Localize variables
(Metarule 2 form: Global(X,Y) ⇒ Accessed by(X), Confidence: One)
To evaluate the applicability of pattern 2, the source code of the five systems was mined using metarule 2.
Table 8 summarizes statistics obtained as a result of mining association rules using this metarule.
                                                          CVS   Aero   Mosaic   Bash   Xfig
Number of association rules satisfying the metarule       77    72     104      115    310
Percentage of global variables accessed by one function   18%   13%    30%      21%    18%
Table 8: Global variables accessed by one function
It is interesting to note that in all systems, quite a large percentage of global variables is used by only
one function. The fact that these variables have been defined as global seems to indicate a poor design, or a
design that has deteriorated over time. However, the definition of these variables as global is justified if they
represent settings that are required to persist throughout a program's lifetime. For example, a detailed look at
Bash reveals that global variables being accessed by single functions do indeed represent system settings. The
msg_buf variable, which represents a buffer for messages, is used by one function only (rl_message). If
this buffer is accessed more than once during the execution of the program, there is reason for it to be global.
The fact that this variable has been defined as static shows an effort by developers to reduce variable scope.
However, opportunities for further reduction in variable scope within the system may be identified. For example,
the globally defined variable special_vars is also used by only one function but has not been defined as
static.
6.3 Pattern 3: Increase locality of reference
(Metarule 3-1 form: Function(X,Y) ⇒ Called by(X), Confidence: One
Metarule 3-2 form: Function(X,Y) ⇒ Function(X,Y), Confidence: High, Support: High)
To evaluate the applicability of pattern 3, the source code of the five systems was mined using metarule 3-1
and metarule 3-2. Table 9 summarizes statistics obtained as a result of mining association rules using metarule
3-1.
                                                               CVS        Aero       Mosaic     Bash       Xfig
Number of association rules satisfying the metarule            236        250        321        298        368
Percentage of functions called by one function                 29%        36%        39%        33%        22%
Percentage of functions residing in same file as
calling function                                               54% (128)  65% (163)  56% (181)  74% (220)  73% (270)
Percentage of functions residing in a different file           46% (108)  35% (87)   44% (140)  26% (78)   27% (98)
Table 9: Functions called by one function
It can be seen from Table 9 that in all systems, most of the functions called by a single function reside in the
same file as the calling function. Bash and Xfig have the highest percentages of such functions residing in the
same file. Functions residing in a different file from the calling function need to be examined in more detail. A
valid reason for placing a called function in a different file from the calling function is that the called function
may be a 'utility' function which has been placed together with other utility functions in a separate file. For
example, in Bash the function all_visible_variables present in the file variables.c is called by
the function variable_completion_function in bashline.c. variables.c contains functions
related to shell variables, whereas functions within bashline.c are related to reading lines of input. In this
case, there is valid reason for placing all_visible_variables in the file variables.c, along with other
functions related to shell variables. If this is not the case, i.e. the called function appears to be logically related
to the calling function, it should be placed in the same file as the calling function to increase efficiency and
ease maintenance by increasing locality of reference.
Table 10 summarizes statistics about functions that are called together with high confidence. In this case,
we use a threshold of 0.7 for confidence. It should be noted that setting absolute thresholds for coverage and
support may not be meaningful (see section 5.2), since they depend on system size. However, confidence
denotes the association of two items with each other, and is not affected by the size of the system. Thus it is
meaningful to set a threshold, e.g. 0.7, which denotes that if one function is called, there is a 70% probability
that the other function is also called. A problem that may arise in the application of this pattern is that a
function may be associated with more than one function, making it difficult to decide which functions to place
in the same file. In such a case, the support of the association rule, which denotes the number of functions that
call the two functions together, may be helpful. If two functions are called together by many functions, it is
more useful to place them in close proximity than functions that are called together by few functions.
                                                                 CVS    Aero   Mosaic   Bash   Xfig
Number of association rules satisfying the metarule
(confidence > 0.7)                                               4353   2598   3821     3644   3584
Number of association rules satisfying the metarule
(confidence = 1)                                                 3974   2523   3762     3564   3424
Highest support for high confidence                              138    21     22       31     59
Highest support for confidence = 1                               28     21     14       10     20
Functions in same file for high confidence                       23%    30%    25%      43%    48%
Functions in different files                                     77%    70%    75%      57%    52%
Table 10: Functions called together with high confidence
It can be seen from Table 10 that most of the functions called together are placed in different files. Thus there
is scope for applying this pattern to improve understandability.
6.4 Pattern 4: Identify utilities (Metarule 4 form: Function(X,Y) ⇒ Called by(X), Coverage: High)
To evaluate the applicability of pattern 4, the source code of the five systems was mined using metarule 4.
Table 11 summarizes statistics obtained as a result of mining association rules using this metarule.
                                                            CVS          Aero        Mosaic      Bash         Xfig
Number of association rules satisfying the metarule         3899         1708        1472        2075         4798
Number of functions making function calls
(%age of total functions)                                   631 (78%)    469 (68%)   518 (63%)   694 (78%)    1302 (78%)
Median number of calls made                                 2            1           1           2            2
Average number of calls made                                6.73         3.80        2.57        3.18         4.28
Standard deviation                                          17.64        7.79        4.77        8.46         8.45
Number of functions to which 5 or more calls are
made (%age of total functions)                              148 (18%)    75 (11%)    55 (7%)     98 (11%)     239 (14%)
Number of functions to which 10 or more calls are
made (%age of total functions)                              84 (10%)     39 (6%)     22 (3%)     29 (3%)      101 (6%)
Most frequently called function                             error        fprintf     mo_fetch_window_by_id  xmalloc  put_msg
Coverage of the most frequently called function
(# of functions)                                            0.259 (210)  0.138 (95)  0.072 (59)  0.219 (195)  0.08 (133)
Table 11: Function call summary
The metarule associated with pattern 4 can be used to derive useful information from the system when
coverage is high. It can be seen from Table 11 that coverage values for the called functions in the five systems
remain below 0.3. In CVS, which has the highest coverage, the most frequently called function is called by
210 functions. If a function is called by 210 functions, there is reason to assume that it is some kind
of utility function. As with pattern 1, setting a threshold is difficult, because with a high absolute threshold
of, say, 0.7, none of the functions would be considered frequently accessed.
Instead of trying to set an absolute threshold, we depict the results using a graph which can be inspected
interactively to identify frequently called functions. To make the interpretation of the graph easier, the median,
quartiles, and upper and lower control limits may be depicted on the graph. As an example, the function call
graph for Bash is depicted in Figure 8.
Figure 8: Function calls (Bash)
It is easy to identify functions called by a large number of functions from Figure 8. The graphs of the other
four systems show a very similar function call trend (Figure 9 – Figure 12).
Figure 9: Function calls (CVS)
Figure 10: Function calls (Aero)
Figure 11: Function calls (Mosaic)
Figure 12: Function calls (Xfig)
It is interesting to note that the median number of functions calling a function is almost the same (1 or 2)
for all systems. The average number of functions calling a function is also similar.
An examination of the Bash code shows that 75% of the functions to which 20 or more calls are made
reside in the files general.c, error.c or variable.c. This indicates that utility functions are
placed in separate files in Bash, making it easier to reference such functions.
To identify the utility functions within various sub-systems of Xfig, associations between functions within
the software and functions within each sub-system were noted. Figure 13 – Figure 17 show the calls made
within the various sub-systems. The graphs are similar, depicting that function calls in each sub-system follow
a similar trend.
Figure 13: Function calls (d_*files sub-system)
Figure 14: Function calls (e_*files sub-system)
Figure 15: Function calls (f_*files sub-system)
Figure 16: Function calls (u_*files sub-system)
Figure 17: Function calls (w_*files sub-system)
Table 12 shows some statistics of the function calls made. It can be observed that the average number of
calls made in all the sub-systems is very similar.
Sub-system   Average number    Standard    Highest number of              Coverage
             of calls made     deviation   calls made (function)
d_*files     3.27              5.00        37 (draw_mousefun_canvas)      0.394
e_*files     3.44              5.19        46 (set_cursor)                0.125
f_*files     2.03              2.98        38 (file_msg)                  0.273
u_*files     3.09              3.70        33 (set_action_object)         0.078
w_*files     3.02              5.13        69 (put_msg)                   0.108
Table 12: Function calls made in various sub-systems
Table 13 shows the function calls that are made to various functions by functions within a sub-system. It
appears that the u_*files sub-system contains utility functions, because the maximum number of functions is
called from this sub-system. This was our initial assumption, which has been confirmed through the results
obtained. The d_*files sub-system functions make more calls to functions within the u_*files sub-system than
to any other sub-system. However, the rest of the functions make more calls to functions within their own
sub-system, with the 2nd highest number of calls to the u_*files sub-system. Function calls are made to
functions in all other sub-systems, but the d_*files sub-system appears to contain the least number of utilities.
These figures show that the Xfig system is quite well structured, but its structure can be improved by further
localizing calls to a function's own sub-system or to the u_*files sub-system.
Table 14 shows call results for functions to which 5 or more calls are made. Results are similar to those in
Table 13, with the maximum number of functions being called from the u_*files sub-system and no calls to
the d_*files sub-system functions except from functions within d_*files.
Sub-system   Calls made     Calls made     Calls made     Calls made     Calls made     Calls made
             to functions   to functions   to functions   to functions   to functions   to functions
             in d_*files    in e_*files    in f_*files    in u_*files    in w_*files    in misc. files
d_*files     49             1              1              53             13             6
e_*files     6              224            16             177            51             11
f_*files     1              2              113            48             28             6
u_*files     2              10             9              302            43             10
w_*files     2              7              20             39             301            16
Total        60             244            159            619            436            49
Table 13: Number of function calls made to functions in each sub-system
Table 14: Five or more function calls made to functions in each sub-system
6.5 Pattern 5: Increase data modularity
(Metarule 5-1 form: Global(X,Y) ⇒ Global(X,Y), Confidence: High, Support: High,
Metarule 5-2 form: Type(X,Y) ⇒ Type(X,Y), Confidence: High, Support: High)
To evaluate the applicability of pattern 5, the source code of the five systems was mined using metarule 5-1
and metarule 5-2. In this case, we use a threshold of 0.7 for confidence. It should be noted that setting absolute
thresholds for coverage and support may not be meaningful (see section 5.2), since they depend on system
size. However, confidence denotes the association of two items with each other, and is not affected by the size
of the system. Thus it is meaningful to set a threshold, e.g. 0.7, which denotes that if a global variable (user
defined type) is accessed, there is a 70% probability that the other global variable (user defined type) is also
accessed. However, the confidence measure alone cannot be used to arrive at a meaningful result. A global
variable may be associated with more than one global variable. Similarly, a user defined type may be
associated with more than one user defined type. It is thus difficult to decide which global variables or user
defined types to place in one structure. In such a case, the support of the association rule may be helpful.
Support denotes the number of functions which access the two global variables or user defined types together.
If two global variables or user defined types are accessed together by many functions, it is more useful to
place them in a single structure than global variables or user defined types that are accessed together by
few functions.
                                             CVS               Aero               Mosaic             Bash         Xfig
Number of association rules satisfying
the metarule                                 1923              5199               2602               1635         14983
Highest support for global variables
occurring together with high confidence
(# of functions)                             0.022 (18)        0.099 (68)         0.007 (6)          0.047 (42)   0.127 (211)
Global variables occurring together          pending_error,    KoerperKoordAnAus, size_of_cached_    rl_end,      _ArgCount,
with highest support                         pending_error_text  KoordAnAus       cd_array,          rl_point     _ArgList
                                                                                  cached_cd_array
Table 15: Global variables with high association
Table 15 summarizes statistics about global variables with high association. The names of the global variables
in all five systems reveal that they are related to one another. Through the use of association rule mining, such
variables may be identified and, after inspection, placed in a structure. Similar to Table 15, Table 16
summarizes statistics about user defined types with high association³.
                                             CVS          Aero         Mosaic        Bash                 Xfig
Number of association rules satisfying
the metarule                                 41           54           10            14                   422
Highest support for user defined types
occurring together with high confidence
(# of functions)                             0.010 (8)    0.017 (12)   0.010 (8)     0.003 (3)            0.128 (213)
User defined types occurring together        DIR,         TKollision,  mo_hotlist,   KEYMAP_ENTRY_ARRAY,  Arg,
with highest support                         dirent       TReal        mo_hot_item   keymap               WidgetList
Table 16: User defined types with high association
6.6 Pattern 6: Strengthen encapsulation
(Metarule 6 form: Accessed by(X) ⇒ Type(X,Y), Confidence: One)
To evaluate the applicability of pattern 6, the source code of the systems was mined using metarule 6. Table
17 summarizes statistics obtained as a result of mining association rules using this metarule.
³ For CVS, Aero, Mosaic and Bash, accesses to user defined types are reported for local variables within functions.
                                                      CVS         Aero        Mosaic      Bash        Xfig
Number of association rules satisfying the metarule   489         476         294         319         3937
Number of functions accessing user defined types
(%age of total functions)                             299 (37%)   279 (41%)   248 (30%)   249 (28%)   1364 (82%)
Number of types accessed                              80          63          47          51          149
Median number of functions accessing a type           2.5         3           3           3           5
Average number of functions accessing a type          6.11        7.56        6.26        6.25        26.42
Standard deviation                                    10.74       14.56       12.56       9.36        53.92
Table 17: Types accessed by functions in the test systems
Identification of potential objects is facilitated by mining association rules of the described form. As an
example, consider some of the user defined types identified within Bash. The user_info type is accessed
by 10 functions which reside in 7 different files. This type can be grouped together with the functions that
access it to form a class, which represents user information along with member functions to access this
information. Some of the types are accessed by functions within one file, making the task of conversion to an
object-oriented design easier. For example, the JOB type is accessed by 22 functions (start_job,
delete_job, find_job etc.), 21 of which reside in the file job.c. In a similar manner, types and accessing
functions may also be identified easily within other systems.
6.7 Pattern 7: Select appropriate storage classes
(Metarule 7 form: Global(X,Y) ⇒ Accessed by(X), Coverage: High)
Table 18 shows the global variables accessed by the largest number of functions in each of the five test
systems. Only three global variables per system are shown.
System   Number of functions   Global variable   Coverage
CVS      55                    noexec            0.068
CVS      45                    server_active     0.056
CVS      36                    __ctype_b         0.044
Aero     104                   aktuelleWelt      0.152
Aero     72                    __iob             0.105
Aero     68                    KoordAnAus        0.099
Mosaic   47                    current_win       0.057
Mosaic   43                    srcTrace          0.053
Mosaic   23                    cciTrace          0.028
Bash     62                    rl_point          0.07
Bash     46                    rl_end            0.052
Bash     39                    __ctype_b         0.044
Xfig     231                   XtStrings         0.139
Xfig     211                   _ArgCount         0.127
Xfig     211                   _ArgList          0.127
Table 18: Global variables accessed by maximum number of functions
If the functions accessing these global variables are called frequently, the global variables are candidates for
register variables.
6.8 Pattern 8: Generate alternative views
(Metarule 8 form: Type(X,Y) ⇒ Accessed by(X), Coverage: High (within a sub-system))
Table 19 summarizes the statistics for types accessed by functions in various sub-systems of Xfig. Figure
18 – Figure 22 show the types accessed within the sub-systems.
Sub-system   Average number of    Standard    Highest number of       Coverage
             accesses to types    deviation   accesses (Type)
d_*files     6.28                 5.92        22 (Cursor)             0.234
e_*files     17.34                25.89       99 (F_compound)         0.268
f_*files     6.49                 8.21        37 (FILE)               0.266
u_*files     15.10                25.33       127 (F_compound)        0.301
w_*files     16.74                35.40       257 (WidgetList)        0.403
Table 19: Accesses to user defined types in various Xfig sub-systems
Figure 18: Types accessed by a subsystem (d_*files)
Figure 19: Types accessed by a subsystem (e_*files)
Figure 20: Types accessed by a subsystem (f_*files)
Figure 21: Types accessed by a subsystem (u_*files)
Figure 22: Types accessed by a subsystem (w_*files)
It can be seen that some of the types are associated with a certain sub-system only, whereas some are
accessed by more than one sub-system, e.g. the types F_line, F_spline, F_arc etc. are accessed by all
sub-systems. The reason for this is that these shapes are drawn in the d_*files sub-system, edited in the
e_*files sub-system, and so on. Thus for an object-oriented view, these types will be associated with functions
across sub-systems. On the other hand, types accessed within a certain sub-system only may be associated with
functions within that sub-system.
6.9 Summary of results
Statistics obtained as a result of carrying out association rule mining on the test legacy systems are
summarized in Figure 23 - Figure 30. Statistics presented in Figure 23 and Figure 24 are related to the number
of functions, global variables and user defined types in the test systems.
Figure 23: Function, global variable and user defined type statistics
Figure 24: Statistics for function calls, global variable accesses and user defined types accesses
It can be seen that the number of functions and global variables is similar across all systems except Xfig.
The numbers of functions making function calls, accessing global variables and accessing user defined types
are also comparable across these systems. The number of functions accessing user defined types is larger for
Xfig as compared to the other systems, and also larger than the number of functions accessing global variables
and making function calls.
Figure 25: Global variable access statistics
Figure 26: A comparison of accesses to global variables
Figure 25 and Figure 26 present statistics related to the access of global variables by functions within test
systems, which is an indicator of the global coupling present in these systems. It is interesting to note that
although the percentage of functions accessing global variables varies from 45% to 80%, the percentage of
global variables accessed by 5 or less functions is very similar across all systems. Mosaic has the smallest
percentage of functions accessing global variables, and also the highest percentage of global variables
accessed by 5 or less functions. These results show that all five systems present opportunities for applying
pattern 1 and pattern 2 to reduce coupling by reducing the number of global variables. In cases where a global
variable is required, its scope may be reduced to make the program easier to understand and manage.
Figure 27: Function call statistics
Figure 28: A comparison of calls to functions
Figure 27 and Figure 28 present statistics related to functions called by a single function within the
systems. The percentage of functions making function calls is similar across all systems, varying from 63% to
78%. The percentage of functions called by one function only is also similar across all systems, varying from
22% to almost 40% in Mosaic. Thus the opportunity to apply pattern 3 to increase locality of reference is
present in all systems. Functions that are called by a single function and yet reside in a different file need to be
examined. CVS and Mosaic have the highest percentages of such functions, which should be examined to
determine the feasibility of placing them in the same file as the calling function.
Figure 29: Frequently accessed functions
Figure 30: A comparison of frequently accessed functions
Figure 29 and Figure 30 present statistics of functions within the systems to which 5 or more calls are
made. This percentage varies from 7% to 18%, in general remaining low. CVS has the highest percentage of
functions to which 5 or more calls are made. Thus in all systems, there exist functions that are accessed by
many functions. It is useful to examine these functions and identify them as utilities, also placing them in
appropriate files.
Table 15, Table 16 and Table 17 summarize statistics related to system re-modularization. The applicability
of pattern 5 is evident for all the systems, as global variables and user defined types that occur together have
been identified. It is suggested that pattern 5 be applied before pattern 6, so that related data items are placed
together. Once data has been modularized, functions accessing these types may be mined using pattern 6, thus
taking a step towards indicating potential classes in an object-oriented design.
7. Conclusions
In this report we present an association rule mining approach for the problem of understanding a software
system given only the source code. To make the association rule mining process more effective, we employ
metarule-guided mining. Metarules specify constraints on the mining process. We use two types of
constraints: rule constraints and interestingness constraints. Rule constraints are used to restrict the
form of the association rules mined, whereas interestingness constraints are used to restrict the possible values
of interestingness measures. Furthermore, we relate each metarule to a re-engineering pattern that presents
solutions to the problems identified by a metarule. To illustrate the feasibility and effectiveness of employing
metarule-guided mining for program understanding, we restrict ourselves to a small number of metarules and
re-engineering patterns that address some typical problems within legacy systems developed using the
structured approach.
Metarule-guided mining was used to analyze the structure of five legacy systems and extract meaningful
association rules which provided useful insight about the software’s overall structure and suggested strategies
for its improvement. The extracted association rules capture associations between functions, global variables
and user defined types within the system. They highlight potential problems within the code and identify areas
which require a more detailed study. As illustrated, these association rules applied along with re-engineering
patterns can be used to restructure the code for maintainability, and if required, to re-modularize the code e.g.
by converting a structured design to an object-oriented design. A manual inspection to carry out the same
tasks would have taken a much longer time.
Our experiments with five open source systems reveal similar results in terms of the average and percentage
values for the patterns discussed. For example, the average and median number of functions accessing a global
variable is very similar in all five systems, as is the average and median number of function calls made.
This result is interesting, especially since the test systems have different application domains and the number
of functions, global variables and user defined types within these systems is also different. Our experimental
results show how rules extracted by the mining process can be used to identify potential problems in, and
suggest improvements to, the legacy systems under study. Thus metarule-guided association rule mining is
effective in revealing interesting characteristics, trends and nature of open source legacy systems, and can be
used as a basis for restructuring or re-modularizing them for greater maintainability.
36
References
[1] I. Sommerville, Software Engineering, Fifth Edition, Addison Wesley, 2000.
[2] R.S. Arnold, Software Reengineering, IEEE Computer Society Press, 1993.
[3] R.S. Pressman, Software Engineering A Practitioner’s Approach, Fifth Edition, Mc Graw Hill, 2001.
[4] S.L. Pfleeger, Software Engineering Theory and Practice, Prentice Hall, 1998.
[5] R.L. Glass, Frequently Forgotten Fundamental Facts about Software Engineering, IEEE Software,
May/June 2001.
[6] T.J.Biggerstaff, “Design Recovery for Maintenance and Reuse”. IEEE Computer, 22(7), pages 36-49,
July 1989.
[7] H.A. Muller, M. Story, J.H. Jahnke, D.B. Smith, A.R. Tilley, K. Wong, “Reverse Engineering: A
Roadmap”, The 22nd International Conference on Software Engineering (ICSE’00), June 2000
[8] G. Parikh, N. Zvegintzov, Tutorial on Software Maintenance, IEEE Computer Society Press, 1983.
[9] R.P. Hall, Seven Ways to Cut Software Maintenance Costs, Datamation, July 1987.
[10] M.T.Harandi, J.Q.Ning, “Knowledge-Based Program Analysis”, IEEE Software 7(1), pages 74-81,
January 1990 .
[11] Rich, Wills “Recognizing a program’s design: A graph parsing approach”, IEEE Software, 7(1), pages
82-89, January 1990.
[12] A.Quilici, “A Memory-Based Approach to Recognizing Programming Plans”. Communications of the
ACM, 37(5), pages 84-93, May 1994.
[13] H. A. Müller, K. Wong, and S. R. Tilley “Understanding software systems using reverse engineering
technology.” The 62nd Congress of L'Association Canadienne Francaise pour l'Avancement des Sciences
Proceedings (ACFAS) 1994.
[14] H.M.Fahmy, R.C.Holt, J.R.Cordy , “Wins and Losses of Algebraic Transformations of Software
Architectures”. Automated Software Engineering ASE 2001, San Diego, California, November 26-29,
2001.
[15] R.Kazman, S.J.Carrière, "View Extraction and View Fusion in Architectural Understanding". The 5th
International Conference on Software Reuse, Victoria, BC, Canada, June 1998.
[16] J.S. Shirabad, T.C. Lethbridge, S. Matwin, Supporting software maintenance by mining software update
records, International Conference on Software Maintenance, (ICSM) 2001.
[17] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.
[18] S. Demeyer, S. Ducasse, O. Nierstrasz, Object-Oriented Reengineering Patterns, Morgan Kaufmann,
2003.
[19] Website http://www.iam.unibe.ch/~scg/Archive/famoos/
[20] M.M. Lehman, “Programs, Life Cycles, and Laws of Software Evolution”, Proceedings of the IEEE,
September 1980.
[21] M.M. Lehman, J.F. Ramil, P.D. Wernick, D.E. Perry, W.M. Turski, “Metrics and Laws of Software
Evolution – The Nineties View”, International Symposium on Software Metrics, 1997.
[22] M.W. Godfrey, Q. Tu, “Evolution in Open Source Software: A Case Study”, International Conference on
Software Maintenance (ICSM) 2000.
[23] M.W. Godfrey, Q. Tu, “Growth, Evolution and Structural Change in Open Source Software”, IWPSE
2001.
[24] Q. Tu, M.W. Godfrey, “An Integrated Approach for Studying Architectural Evolution”, International
Workshop on Program Comprehension (IWPC), 2002.
[25] M. Lanza, S. Ducasse, “Polymetric Views – A Lightweight Visual Approach to Reverse Engineering”,
IEEE transactions on Software Engineering, September 2003.
[26] M. Lanza, S. Ducasse, “Understanding Software Evolution Using a Combination of Software
Visualization and Software Metrics”, Proceedings LMO’02, 2002.
[27] C. Montes de Oca, D. L .Carver, Identification of Data Cohesive Subsystems using Data Mining
Techniques, International Conference on Software Maintenance, (ICSM) November 1998.
[28] C. Montes de Oca, D. L .Carver, A Visual Representation Model for Software Subsystem
Decomposition, Working Conference on Reverse Engineering (WCRE'98), October, 1998.
[29] K. Sartipi, K. Kontogiannis, F. Mavaddat, Architectural Design Recovery Using Data Mining
Techniques, Conference on Software Maintenance and Reengineering (CSMR’00), February, 2000.
[30] C. Tjortjis, L. Sinos, P. Layzell, Facilitating Program Comprehension by Mining Association Rules
from Source Code, 11th IEEE International Workshop on Program Comprehension (IWPC'03), May
2003.
[31] V. Tzerpos, R.C. Holt, “Software Botryology: Automatic Clustering of Software Systems”, Ninth
International Workshop on Database and Expert Systems Applications (DEXA’98), August 1998.
[32] T.A. Wiggerts, “Using clustering algorithms in legacy systems remodularization,” Fourth Working
Conference on Reverse Engineering (WCRE’97), October 1997.
[33] N.Anquetil and T.C.Lethbridge, “Experiments with clustering as a software remodularization method,”
The Sixth Working Conference on Reverse Engineering (WCRE’99), 1999.
[34] J.Davey and E.Burd, “Evaluating the Suitability of Data Clustering for Software Remodularization”, The
Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, 2000.
[35] M.Saeed, O.Maqbool, H.A.Babri, S.M. Sarwar, S.Z. Hassan “Software Clustering Techniques and the
Use of the Combined Algorithm”, Conference on Software Maintenance and Re-engineering (CSMR’03),
March 2003.
[36] O.Maqbool, H.A.Babri, “The Weighted Combined Algorithm: A Linkage Algorithm for Software
Clustering”, Conference on Software Maintenance and Re-engineering (CSMR’04), March 2004.
[37] J.S. Shirabad, T.C. Lethbridge, S. Matwin, Mining the maintenance history of a legacy software system,
International Conference on Software Maintenance, (ICSM) , 2003.
[38] J.S. Shirabad, T.C. Lethbridge, S. Matwin, Mining the software change repository of a legacy telephony
system, Proceedings of the 1st International Workshop on Mining Software Repositories, 2004.
[39] T. Zimmermann, P. Weißgerber, S. Diehl, A. Zeller, Mining version histories to guide software changes,
Proceedings of the 26th International Conference on Software Engineering (ICSE), 2004.
[40] A. Michail, Data mining library reuse patterns using generalized association rules, Proceedings of the
22nd international conference on Software engineering, June 2000.
[41] A. Michail, Data Mining Library Reuse Patterns in User-Selected Applications, 14th IEEE International
Conference on Automated Software Engineering , October, 1999.
[42] R. Amin, M. Cinneide, T. Veale, LASER: A lexical approach to analogy in software reuse, Proceedings
of the 1st International Workshop on Mining Software Repositories, 2004.
[43] F. McCarey, M. Cinneide, N. Kushmerick, A case study on recommending reusable software components
using collaborative filtering, Proceedings of the 1st International Workshop on Mining Software
Repositories, 2004.
[44] Y. Yusof, O. F. Rana, Template mining in source-code digital libraries, Proceedings of the 1st
International Workshop on Mining Software Repositories, 2004.
[45] P. K. Garg, T. Gschwind, K. Inoue, Multi-project software engineering: An example, Proceedings of the
1st International Workshop on Mining Software Repositories, 2004.
[46] M. El-Ramly, E. Stroulia, Mining software usage data, Proceedings of the 1st International Workshop on
Mining Software Repositories, 2004.
[47] Z. Balanyi, R. Ferenc, Mining design patterns from C++ source code, International Conference on
Software Maintenance (ICSM) 2003.
[48] Y. Kanellopoulos, C. Tjortjis, Data mining source code to facilitate program comprehension:
Experiments on clustering data retrieved from C++ programs, International Workshop on Program
Comprehension (IWPC), 2004.
[49] Website MSR 2004 http://msr.uwaterloo.ca
[50] T. Mens, T. Tourwe, “A Survey of Software Refactoring”, IEEE transactions on Software Engineering,
February 2004.
[51] S. Demeyer, S. Ducasse, “Metrics, do they really help?”, Proceedings LMO’99, Paris 1999.
[52] N. Fenton, S.L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, Second Edition, PWS
Publishing Company, 1997.
[53] L. Rising, The Patterns Handbook Techniques, Strategies and Applications, Cambridge University Press,
1998.
[54] R. Koschke, “Atomic Architectural Component Recovery for Program Understanding and Evolution”,
PhD Thesis, University of Stuttgart, 2000.
[55] Website http://www.bauhaus-stuttgart.de/bauhaus
[56] J.Martin, K.Wong, B. Winter and H.A.Müller, “Analyzing xfig using the Rigi Tool Suite”, The Seventh
Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, 2000.
[57] R.B. Murray, C++ Strategies and Tactics, Addison Wesley, 1993.
[58] S. McConnell, Code Complete A Practical Handbook of Software Construction, Microsoft Press, 1993.
[59] M.A. Weiss, Efficient C Programming A Practical Approach, Prentice-Hall, 1995.
[60] C. Riva, “Reverse Architecting: an Industrial Experience Report”, Seventh Working Conference on
Reverse Engineering (WCRE’00), 2000.
[61] K. Wong, S. Tilley, H. Muller, M.A. Storey, Structural Redocumentation: A Case Study, IEEE Software,
January, 1995.
Appendix
Pattern category 1: Enhance modularity and control side effects
Pattern 1: Reduce global variable usage
Intent
Remove global variables that are used by very few functions.
Metarule 1
Rule: Global(X,Y) ⇒ Accessed by(X)
Coverage: Low
Implication
Only a small proportion of functions in the system access the global variable(s) on the LHS.
Problem
As software evolves, its structure becomes complex and its internal quality degrades. Due to high coupling
between system components in such a deteriorated structure, a single change often results in a number of
side effects. Functions which access the same global variables are highly coupled (common coupling).
How can such coupling among functions be reduced?
Forces
This problem is difficult because
1. To improve system structure, the structure must first be understood. In the absence of
documentation, gaining a structural view and identifying dependencies may be difficult and time
consuming.
2. Some form of coupling may be necessary within the system. Highly coupled functions whose
coupling levels may be reduced need to be identified.
3. A change, e.g. changing a shared global variable to a parameter that is passed between the related
functions, may require changes at a number of places within the code.
Solving this problem is feasible because
1. With the proper tools, some forms of coupling between functions can be identified.
Solution
Identify global variables which are used by few functions within the system. Reduce common coupling by
reducing the use of such global variables. Rather than defining such variables as global, pass them as
parameters within the relevant functions. If passing a global variable as a parameter is not convenient, its
scope may be restricted to a single file by defining it as static.
Tradeoffs
Pros
When a large number of variables are to be shared amongst functions, global variables are convenient.
Moreover, global variables have a longer lifetime than automatic variables, making it simpler to share
information between functions that do not call each other. However, unless the global data is read only, the
use of global variables results in undesirable coupling between functions, leading to difficulties in program
understanding and maintenance. By removing unnecessary global variables and restricting their usage,
coupling among components is reduced, making it easier to trace faults and avoid unintentional changes to
data.
Cons & Difficulties
Careful evaluation of each global variable is required to decide whether it should be passed as a parameter,
declared to be static or left as it is. Even if very few functions in the system access the global variable, the
designers of the system may have valid reasons for defining it as global.
Pattern 2: Localize variables
Intent
Reduce the scope of variables that have larger scope than necessary.
Metarule 2
Rule: Global(X,Y) ⇒ Accessed by(X)
Confidence: One
Implication
Each global variable on LHS is used by one function only.
Problem
Legacy systems are likely to contain a large number of global variables which reduces the manageability
of the systems. How can a function accessing global variables be made easier to read and manage?
Forces
This problem is difficult because
1. A function may access a large number of global variables.
2. Global variables are not the only factor that makes a function more difficult to read and understand.
Solving this problem is feasible because
1. You have an idea of the global variable usage within the system (Pattern 1).
Solution
Identify global variables that are accessed by one function only. Make the variable a local variable for that
function.
Tradeoffs
Pros
If a variable is used by one function, there may be no reason for defining it as global. Localizing a global
variable reduces chances of error and makes functions easier to read and maintain.
Cons & Difficulties
Careful evaluation of each identified global variable is required before a decision to reduce its scope is
taken. The rationale for a decision by a developer to define a variable as global is rarely documented.
Pattern 3: Increase locality of reference
Intent
Group related functions together to improve understandability and maintainability.
Metarule 3-1
Rule: Function(X,Y) ⇒ Called by(X)
Confidence: One
Implication
Each function on LHS is called by one function only.
Metarule 3-2
Rule: Function(X,Y) ⇒ Function(X,Y)
Confidence: High, Support: High
Implication
Whenever one function is called, there is a high probability that the other function is also called.
Problem
A legacy system is often large and may consist of a large number of functions. Changes made within a
software system may involve changes to various related functions. How can changes be localized?
Forces
This problem is difficult because
1. It is necessary to identify functions that may be affected by a change.
2. A function may be related to a number of other functions, so it is difficult to decide the set of
functions with which its relation is strongest.
Solving this problem is feasible because
1. Localized changes ease the maintenance task.
Solution
Identify functions between which there appears to be a strong relation. Place such functions in close
proximity, perhaps in the same file. It may be possible to inline functions being called by only one
function.
Tradeoffs
Pros
Grouping together related functions promotes modular continuity [3], because changes in requirements are
localized instead of resulting in system-wide changes. Localized changes ease the maintenance task. Also,
inlining a function called from a single site can improve performance, and is clearly beneficial if the
function expansion is shorter than the code for the calling sequence.
Cons & Difficulties
Placing related functions in the same file may be useful for legacy applications developed using the
structured approach. It may not be a feasible option in some cases, e.g. when a system has a layered
architecture and function calls are made across different layers, or in object-oriented architectures where
the functional separation is related to data.
In case the inlining option is chosen, it is important to remember that large inline functions will save a
small percentage of run time but will have a higher space penalty. Also functions with loops should almost
never be inlined [57] because the run time of a loop is likely to swamp the function call overhead. Large
inlined functions may also make it difficult to understand the functionality of the calling function.
Pattern category 2: Avoid duplicated functionality
Pattern 4: Identify utilities
Intent
Identify utility routines so that they can be shared by functions.
Metarule 4
Rule: Function(X,Y) ⇒ Called by(X)
Coverage: High
Implication
Function(s) on LHS are called by most of the functions in the system.
Problem
In legacy systems, functionality is often duplicated due to different teams re-implementing similar
functionality. To avoid this problem, functions should share code by making use of utility routines. How
can the use of utility routines be facilitated?
Forces
This problem is difficult because
1. To facilitate the use of utilities, the utilities must be identified.
2. Many such utility routines may be present, all of which need to be catalogued for reference.
Solving this problem is feasible because
1. Utility routines may have certain characteristics which facilitate identification.
Solution
Identify functions that are called by many functions within the system. Treat these functions as utility
functions. It may be useful to place groups of related utility functions in separate files.
Tradeoffs
Pros
Utility functions represent re-usable components of a structured system. Re-usable components can lead to
measurable benefits in terms of reduction in development cycle time and project cost, and increase in
productivity.
Cons & Difficulties
In order for functions to be re-used effectively, they must be properly catalogued for easy reference,
standardized for easy application and validated for easy integration [3]. In case the number of such
functions is large, related functions need to be identified and grouped together, otherwise searching for the
appropriate function may be time consuming.
Pattern category 3: Re-modularize for maintainability
Pattern 5: Increase data modularity
Intent
Place related data items into a structure.
Metarule 5-1
Rule: Global(X,Y) ⇒ Global(X,Y)
Confidence: High, Support: High
Implication
Whenever one global variable is accessed, there is a high probability that the other global variable is also
accessed.
Metarule 5-2
Rule: Type(X,Y) ⇒ Type(X,Y)
Confidence: High, Support: High
Implication
Whenever one type is accessed, there is a high probability that the other type is also accessed.
Problem
Legacy systems are likely to contain a number of global variables and user defined types. How can
relations between global variables (user defined types) be clarified? How can the global variables (user
defined types) be more easily managed?
Forces
This problem is difficult because
1. A large number of global variables and user defined types may exist in a system, and examining
them may be time consuming.
2. A global variable (user defined type) may be related to many other global variables (user defined
types). Examining such complex relationships and arriving at meaningful conclusions may be very
difficult.
Solving this problem is feasible because
1. The purpose of a function is much easier to understand if relationships between data are clear and
data is well managed.
Solution
Identify related global variables (user defined types). Examine them to see which of them form coherent
entities. If they are logically related, combine them into a structure.
Tradeoffs
Pros
Combining variables/types into structures leads to code that is easier to understand and change. If at some
stage, a shift is to be made to an object-oriented design paradigm, the structures become data attributes of
potential classes.
Cons & Difficulties
It may be the case that one global variable/type is associated, with a high degree of confidence, with a
number of other global variables/types, i.e. when the global variable/type is accessed, a number of other
global variables/types are accessed. In this case, to avoid a large structure that hinders rather than promotes
understandability, the software engineer needs to study the code and analyze which global variables/types
are to be combined into a structure.
Pattern 6: Strengthen encapsulation
Intent
Identify potential classes.
Metarule 6
Rule: Accessed by(X) ⇒ Type(X,Y)
Confidence: One
Implication
The function(s) access the type(s) on the RHS.
Problem
Many legacy systems were developed using the structured approach. To facilitate re-use and ease
maintenance, how can these systems be re-modularized as object-oriented systems?
Forces
This problem is difficult because
1. Identifying potential classes accurately in such large systems is not easy.
Solving this problem is feasible because
1. Classes contain related data attributes for which operations are defined. Related data attributes can
be identified using pattern 5 (Increase data modularity).
Solution
Identify collections of functions and the common data set they access. These represent potential classes
and should be packaged together to strengthen encapsulation and provide information hiding.
Tradeoffs
Pros
Large programs that use information hiding have been found to be easier to modify, by a factor of four, than
programs that do not [58]. Information hiding forms a foundation for both structured and object-oriented
design.
Cons & Difficulties
Functions may access more than one type, in which case a careful study of the code is required to decide
the type with which the function should be associated. It may be the case that the types accessed by a
function form a coherent entity which can be transformed into a structure (See pattern 5).
Pattern category 4: Improve performance
Pattern 7: Select appropriate storage classes
Intent
Make access to frequently accessed variables efficient.
Metarule 7
Rule: Global(X,Y) ⇒ Accessed by(X)
Coverage: High
Implication
Global variable on LHS is used by most of the functions in the system.
Problem
Legacy systems are likely to contain a number of global variables. How can these variables be accessed
more efficiently?
Forces
This problem is difficult because
1. A large number of global variables may exist in a system, and examining all of them may be time
consuming.
2. It may not be feasible to increase access efficiency for variables that are not frequently accessed.
Solving this problem is feasible because
1. Accesses to memory locations (variables) are much slower than accesses to CPU registers, and often
become a performance bottleneck.
Solution
Identify global variables that are frequently used. Declare them to be register variables.
Tradeoffs
Pros
Declaring a variable register suggests to the compiler that it be placed in a register instead of memory. This
provides a certain degree of control over program efficiency, and may result in faster access and speed
improvement [59].
Cons & Difficulties
The storage of program variables in registers may interfere with conventional usage of a register by the
compiler, thus slowing down execution of a program. Thus knowledge of the architecture and compiler is
required before such declarations can be of use. It is to be kept in mind that the declaration is a suggestion
to the compiler only, which the compiler may choose to ignore [59]. Also, not all programming languages
support the declaration of register variables.
Pattern category 5: Focus on high-level architecture
Pattern 8: Generate alternative views
Intent
Focus on architecture from multiple perspectives.
Metarule 8
Rule: Type(X,Y) ⇒ Accessed by(X)
Coverage: High (within a sub-system)
Implication
User defined type(s) are used by most of the functions within the sub-system.
Problem
How can software architecture (i.e. various sub-systems) be viewed from multiple perspectives so that large
scale structural changes are easier to make?
Forces
This problem is difficult because
1. Software architectures are mostly implicit.
Solving this problem is feasible because
1. A clear understanding of the architecture is required to adapt a system to changing requirements in
rapidly evolving domains, to reduce cost by sharing components across several projects and for
defining product family architectures [60].
Solution
Generate a data view of a sub-system by highlighting association of types with a certain sub-system.
Tradeoffs
Pros
Sub-systems and their inter-relationships represent the architecture of a software system. It is important to
focus on the architecture from multiple perspectives so that large scale structural changes are easier to
make [61]. A study of the types associated with a sub-system provides an alternative ‘data’ view which is
particularly important if changes are to be made to types or functions within the sub-system.
Cons & Difficulties
Different sub-systems of the software system may access the same types, making it difficult to associate a
type with one sub-system only. In such a case, it may be feasible to use Pattern 6 on the entire software
system in order to group together functions with related types and hence arrive at an alternative
modularization based on data abstraction.