Mining RCS Data

Catherine Stringfellow, Swetha Myneni, Raaji Vedala-Tirumala, Sreya Reddy
Department of Computer Science, Midwestern State University, 3410 Taft Blvd, Wichita Falls, TX 76308
Abstract - As software systems evolve, it becomes
important to know and to track the evolution of a
system and its components. Data mining tools help us
to find the relationships between these components
from patterns in the attributes of the data. The purpose
of this paper is to describe how relationships between
the attributes in the RCS (Revision Control System)
change data can be found. These relationships may
include which components of the system undergo
repeated maintenance, how many lines of code are
changed, the span of time in which files undergo
change, as well as the type of maintenance (according
to certain keywords used in the log). The technique is
illustrated on a flight simulator system consisting of
409 RCS files.
Keywords: maintenance, software evolution, RCS
data mining.
1 Introduction
It is important to track the evolution of a system
and its components as changes are made repeatedly.
Components become increasingly difficult to maintain
if these changes are not managed. The identification of
these components and the classification of their
changes help in maintaining a system or building an
entirely new system properly.
This information can be used when we need to
reengineer the components in the future. Problems will
increase if the changes are not understood in the early
stages. Late changes typically require changes to code
in multiple components, which leads to problems in
the software architecture of the system [1, 7]. Software
architecture problems are far more costly to fix and
thus it is better to identify them as soon as possible.
The change reports provided by RCS files help in
finding these changes in the components. Change
reports are written whenever developers change some
part of the code. The information for each change is
stored in the logs of each RCS file of the system. A
change may affect many components. In addition, the
time, author and the nature of the change are also
recorded.
Data mining analysis is a tool to find patterns
between attributes. Most existing analysis approaches
employ quantitative data (statistical or series
techniques) and qualitative data (such as details about
the people who fix defective components). Data
mining techniques analyze both qualitative and
quantitative data using association rule mining.
The decision tree learner is one of the most prominent
machine learning techniques; it is successful in
classification problems where attributes have discrete
values [1]. Thus, discretizing attribute values enables
association mining techniques to be more easily used.
This paper is organized in the following way.
Section 2 describes prior research, section 3 describes
the method followed in this case study, section 4 gives
the results of applying data mining on RCS data of an
industrial application system, and finally section 5
summarizes the paper.
2 Background
Data mining is commonly used in a wide range
of profiling practices, such as marketing, surveillance,
fraud detection and scientific discovery [6, 10]. It is a
technique that can also be quite helpful in effectively
managing the software development process [13].
Software defect data are typically used in reliability
modeling to predict the remaining number of defects
in order to assess software quality and make release
decisions [8].
Recent work using data mining techniques analyzes
data collected during different phases of software
development, including defect data [3].
A great deal of software engineering data exists
today and continues to grow, and the helpfulness of
that data in improving software development and
software quality has already been demonstrated. Xie,
Thummalapenta, Lo and Liu
describe categories of data, various mining algorithms
and the challenges in mining data [13].
Sahraoui, Boukadoum, Chawiche, Mai and
Serhani [4] use a fuzzy binary decision tree technique
to predict the stability among versions of an
application developed using object-oriented
techniques. This technique gives improved
performance in deriving association rules over the
classical decision tree technique, as well as solving the
threshold limit problem and the naive model problem,
in which decision support is not available during
development.
As software gets more complex, testing and
fixing defects become difficult to schedule. Rattikorn,
Kulkarni, Stringfellow, and Andrews [3] presented an
empirical approach that employed well-established
data mining algorithms to construct predictive models
from historical defect data. Their goal was to predict
an estimated time for fixing defects in testing. The
accuracy obtained from their predictive models is as
high as 93%, despite the fact that not all relevant
information was collected.
Association rule mining can help identify strong
relationships between different attribute values [3].
These relationships can also help to predict any or
many attribute values; hence, it is easy to be flooded
with association rules even for a small data set [11].
Association rules are an expression of different
regularities implied in the datasets, and some rules
found imply other rules. Some of these rules can be
meaningless and ambiguous: as a result it becomes
necessary to constrain these rules to a minimum
number of instances (e.g. 90% of the dataset) and set a
minimum limit for accuracy (e.g. 95% accuracy level).
The coverage or support of an association rule can be
defined as the number of instances for which the rule
predicts correctly. The accuracy or confidence of an
association rule can be defined as the proportion of
correctly predicted instances to all instances to which
the rule applies.
In rule mining, an attribute-value pair is called
an item and a combination of attribute-value pairs is
called an item set [1, 11]. Association rules are
generally sought for item sets. There are two steps in
mining association rules: 1) identifying all the item
sets that satisfy the minimum specified coverage
(support) and 2) generating from each of these item
sets the rules that satisfy the minimum confidence. It is
possible that some item sets may not generate any
rules and some may generate many rules.
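As an illustration (not part of the original study), the two steps can be sketched in Python on a toy, hypothetical set of discretized change records; the attribute names and bin labels below are invented, and support and confidence are computed exactly as defined above.

from itertools import combinations

# Toy, hypothetical change records; each attribute=value pair is an "item".
records = [
    {"span": "1_80", "logs": "1_4", "add": "0_4"},
    {"span": "1_80", "logs": "1_4", "add": "5_max"},
    {"span": "81_max", "logs": "5_max", "add": "5_max"},
    {"span": "1_80", "logs": "1_4", "add": "0_4"},
]

def support(item_set):
    # Fraction of records that contain every attribute=value pair in the item set.
    hits = sum(all(r.get(a) == v for a, v in item_set) for r in records)
    return hits / len(records)

MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.9

# Step 1: identify all item sets that satisfy the minimum coverage (support).
items = sorted({(a, v) for r in records for a, v in r.items()})
frequent = [s for n in (1, 2, 3)
            for s in combinations(items, n)
            if support(s) >= MIN_SUPPORT]

# Step 2: from each frequent item set, generate the rules that satisfy the
# minimum confidence (support of the whole set divided by support of its left side).
for item_set in frequent:
    for i in range(1, len(item_set)):
        for lhs in combinations(item_set, i):
            rhs = tuple(x for x in item_set if x not in lhs)
            conf = support(item_set) / support(lhs)
            if conf >= MIN_CONFIDENCE:
                print(lhs, "==>", rhs, "conf:(%.2f)" % conf)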
Much current software defect prediction work
focuses on the number of defects remaining in a
software system. This is to help developers detect
software defects and assist project managers in
allocating testing resources more effectively. Song,
Sheppard, Cartwright and Mair had results that show
for defect association prediction, the accuracy is very
high and the false-negative rate is very low [5]. They
also found that higher support and confidence levels
may not result in higher prediction accuracy and that a
sufficient number of rules is a precondition for high
prediction accuracy.
Srikant, Vu and Agrawal describe a technique to find
only rules that meet
certain constraints, building them into the mining
algorithm, in order to get a subset of rules of interest
in less execution time [6].
Several recent studies have focused on version
history analysis to identify components that may need
special attention in some maintenance task. Nikora
and Munson found that measurements of code churn
in an evolving software system can serve as predictors
for the number of faults inserted in a system during
development, and that this predictive ability can be
improved with a clear standard for the definition of a
fault [2]. Stringfellow, Amory, Potnuri, Georg and
Andrews use RCS change history data to determine
the level of coupling between components to identify
potential code decay [7]. They looked at grouping
changes according to a common author and a small
time interval for checking in the changed files, as well
as grouping changes to files within and between
components.
This paper uses RCS files as input. RCS systems
have many advantages: they manage multiple
revisions of files and automate the storing, retrieval,
logging and merging of revisions [7]. The automatic
logging makes it easy to know what changes were
made to a module, without having to compare source
listings. Revision numbers aid in retrieving the
changes. Included in the RCS data are the author, date,
and a log message summary for each file.
3 Method
The method in this study follows a mostly data-driven approach, although the authors are focused on
the task of maintenance. It also follows the steps in
the mining methodology outlined in Xie et al. [13].
Preprocessing involved extracting data from RCS
files, scrubbing the data, and importing the data into arff
files. Mining involved finding association rules using
WEKA software [10]. Witten and Frank’s book and
WEKA toolkit are excellent resources for data mining
[12].
A Python program extracts attributes from each
RCS log. The results are stored in the form of a text
file that is then imported to a spreadsheet and stored in
a csv file. That file is then converted to an arff file
using an online conversion tool. Finally, the arff file is opened
in WEKA to obtain rules showing different
relationships between attributes with association rule
mining [9].
3.1 RCS data extraction
Figure 1 shows the first part of an RCS file. The
head indicates the last log made to the file (in this
case 1.4, indicating there are 4 logs of changes made
to the file). Each log has the date, time, and author of
the change made.
head    1.4;
1.4
date    99.08.30.21.27.54;  author mikeh;  state Exp;
next    1.3;
1.3
date    99.07.14.22.32.48;  author mikeh;  state Exp;
next    1.2;
Fig 1: Change data in an RCS file with
attributes date, author, etc.
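A minimal sketch (not the authors' actual extraction program, and assuming well-formed RCS files with the layout of Figure 1) of how the revision, date and author fields could be pulled out in Python:

import re

# Matches delta entries of the form shown in Figure 1, e.g.
#   1.4
#   date 99.08.30.21.27.54; author mikeh; state Exp;
DELTA = re.compile(
    r"^(?P<rev>\d+\.\d+)\s*\n"
    r"date\s+(?P<date>[\d.]+);\s*author\s+(?P<author>\w+);\s*state\s+(?P<state>\w+);",
    re.MULTILINE)

def read_header(path):
    # Returns a (revision, date, author) tuple for each change logged in the header.
    with open(path, errors="replace") as f:
        # RCS writes its keywords in lower case; the header ends where the
        # descriptions ("desc") begin.
        header = f.read().split("\ndesc\n", 1)[0]
    return [(m["rev"], m["date"], m["author"]) for m in DELTA.finditer(header)]

# For a header like the one in Figure 1, read_header("example.c,v") (a
# hypothetical file name) would return
# [("1.4", "99.08.30.21.27.54", "mikeh"), ("1.3", "99.07.14.22.32.48", "mikeh")].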
The RCS data comes from a large flight
simulation system, consisting of 409 RCS files of type
.c,v and .h,v [8]. The RCS logs were first analyzed to
determine the ten most frequently occurring keywords
concerning changes in the RCS files. The keywords in
logs that are indicative of change were determined to
be add, delet, fix, bug, chang, fails, modify, error,
correct, mov, fault, debug, updat, problem and
replac. File attributes, such as span of time from the
first to the last change and the number of logs for a file
were of interest.
Figure 2 shows the descriptions of two changes
in the log. The descriptions are found at the end of the
RCS file after the code, after the string Desc. Log 1.3,
for example, has @d2 1, a2 1 in it, which means that
on line 2 there was one statement deleted and one
added. A programmer types in a description of the
change in a comment (inserted between two @’s).
The Python program extracts attribute values from
each change log, including log number, date of
change, author, number of deletes, number of adds and
the frequency of each keyword occurring in the log
associated with the change. These extracted attribute
values are stored in a csv file.
Desc @@
1.4
log
@fixed ccip/ccrp ct/hl code
@
text
@/*
1.3
log
@first checkin after tactics
@
text
@d2 1
a2 1
d61 2
a62 2
Fig 2: Example of two RCS logs.
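As an illustration only (not the authors' program), the per-revision log and text sections could be scanned in Python for the attributes just described; the keyword list is the one given above, the date and author would come from the header as in the earlier sketch, and @-escaping inside the revision text is ignored for simplicity.

import re

KEYWORDS = ["add", "delet", "fix", "bug", "chang", "fails", "modify", "error",
            "correct", "mov", "fault", "debug", "updat", "problem", "replac"]

# One section per revision, as in Figure 2:  1.3 / log / @...@ / text / @d2 1 ...@
SECTION = re.compile(
    r"^(?P<rev>\d+\.\d+)\nlog\n@(?P<log>.*?)@\ntext\n@(?P<text>.*?)@\n",
    re.MULTILINE | re.DOTALL)

def mine_logs(rcs_text):
    # Yields one attribute dictionary per change log found after the desc string.
    body = rcs_text.split("\ndesc\n", 1)[-1]
    for m in SECTION.finditer(body):
        row = {
            "log_no": m["rev"],
            # In the text section, "dN M" deletes M lines and "aN M" adds M lines.
            "num_deletes": sum(int(n) for n in re.findall(r"^d\d+ (\d+)", m["text"], re.M)),
            "num_adds": sum(int(n) for n in re.findall(r"^a\d+ (\d+)", m["text"], re.M)),
        }
        for kw in KEYWORDS:
            row[kw] = m["log"].lower().count(kw)   # keyword frequency in the log message
        yield row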
3.2 Converting to arff
The resulting csv file output from the Python
program is converted into an arff file using an online
conversion tool, as WEKA supports only arff files [10].
The arff file format is becoming an increasingly
common format for representing data to be mined for
useful information.
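For illustration, an arff version of this change data might look like the following (the attribute names are taken from those appearing in the extracted data and rules; the relation name and the values shown are hypothetical):

@relation rcs_changes

@attribute Author {mikeh,chris}
@attribute SpanInDays numeric
@attribute CountOfLogs numeric
@attribute NumAdds numeric
@attribute NumDeletes numeric

@data
mikeh,80,4,6,2
chris,430,12,150,40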
3.3 Data mining for association rules
Data mining is the process of extracting rules from
data. The WEKA software application is used in this
study to find a set of rules and relationships between
the change attributes extracted from the RCS files.
WEKA requires data to be nominal, so filters are
chosen to discretize the numeric attributes. After
applying bins (grouping) and filters, the rules are found
and the relationships between attributes are determined.
Discretizing is done by selecting a particular
attribute and then supplying the number of bins (and
their range of values) into which the attribute is to be
divided or grouped. Grouping data for a numeric
attribute into bins converts it into a nominal attribute.
Once all the attributes are nominal, association is
performed using an apriori filter.
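Outside the WEKA interface, the same binning step can be sketched in Python with pandas (purely an analogue for illustration; the csv name and column list are assumptions, and six equal-width bins are used to mirror the screenshot discussed below):

import pandas as pd

df = pd.read_csv("rcs_changes.csv")   # hypothetical output of the extraction step

# Group each numeric attribute into six equal-width bins, turning it into a
# nominal attribute, much as the WEKA discretize filter does.
for col in ["SpanInDays", "CountOfLogs", "NumAdds", "NumDeletes"]:
    df[col] = pd.cut(df[col], bins=6).astype(str)

print(df.dtypes)   # the binned columns now hold nominal (string) values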
Figure 3 shows a WEKA screenshot of a list
of the attributes extracted. The attribute SpanDays is
highlighted (SpanDays refers to the span of days that a
file underwent changes). It shows the six bins the
numeric counts fell into and a frequency graph. It also
shows minimum, maximum, mean, and standard
deviation values for those attributes. The maximum
span, for example, was 481 days.
Fig 3: Screenshot of the output from WEKA showing the graph for the attributes per log [7].
Some data scrubbing was performed. One of
the RCS files, for example, was ignored as it was not
formatted correctly for data extraction. In addition,
some data in the csv files when uploaded had zeros in
some of the columns, and these were deleted. As
already mentioned, data imported into WEKA had to
be converted from numerical format to nominal format
for data mining analysis.
4 Results
Using association rule mining and an apriori
filter, 100 rules were found between the attributes.
From these rules, we consider those that have a
confidence of 0.9 or above to find relationships
between the attributes. The most significant of these
rules are shown in Figure 4.
Rule1: SpanInDays=1_80 184 ==>
CountOfLogs=1_4 add=0_4
updatevalue=0_1 181 conf:(0.98)
Rule2: SpanInDays=1_80 184 ==>
CountOfLogs=1_4 chang=1_6
updatevalue=0_1 181 conf:(0.98)
Rule3: fail=0_max 409 ==> bug=0_max
409 conf:(1).
Rule4: mov=0_max 409 ==> bug=0_max
409 conf:(1).
Rule5: debug=0_max 409 ==>
fail=0_max 409 conf:(1)
Rule6: bug=0_max 409 ==> fail=0_max
correct=0_max 409 conf:(1)
Rule7: Author=chris 388 <==>
DateTime=529_max NumAdds=1_6 354
conf:(0.91)
Rule8: NumLogs=0_5
DateTime=529_max 257 ==> LogNo=0_4
NumAdds=1_6 250 conf:(0.97)
Fig 4: Rules found from association rule mining.
Figure 4 shows eight rules which have confidence
0.9 and above. Rules 1 and 2 (and several others of the
100 rules found) indicate that for files that undergo
change in a short span of time (here the bin was less
than 80 days), there are few occurrences of the
keyword add or chang (-e, -es, -ing, -ed). (In addition
there were few occurrences of the keywords updat,
remov or delet.) Also, for files with a shorter span,
these few changes were made by few authors over few
logs. Rules 3-5 show there is a strong correlation
between the keywords fail, bug, mov and debug; as a
result, these words can almost be taken as synonyms
and could probably be counted together.
The lines added by author Chris were contributed
mostly during the second half of the development
schedule, and he mainly added only a few lines of code
(Rule 7). Files with a small number of logged changes
mostly had only a small number of lines added (Rule 8).
5 Conclusion
This study shows that it is possible to mine RCS
or other version control data in a system to identify
certain change relationships. Identifying relationships
between the attributes in RCS data could help in
improving the performance of the system architecture
and in determining the components that need repeated
maintenance. Data mining is one of the prominent
tools that could help in finding these relationships and
rules with both qualitative and quantitative data.
Mining for change relationships in the early stages of
development may improve a system’s performance and
save money. If the changes are found only in later
stages, changing the whole system architecture could
lead to disaster.
6 References
[1] Anand, R., “Association rule mining for a medical
record system using WEKA,” File paper, Midwestern
State University, Fall 2006.
[2] Nikora, A., Munson, J., Developing fault predictors
for evolving software systems. Proc 9th Intl Software
Metrics Symposium, Sydney, Australia, Sept 2003,
338-349.
[3] Rattikorn, H., Kulkarni, A., Stringfellow, C.,
Andrews, A., Software defect data and predictability
for testing schedules. Proc 18th Intl Conf on Software
Engineering and Knowledge Engineering, SEKE’06,
San Francisco Bay, USA, Jul 2006, 715-717.
[4] Sahraoui, H., Boukadoum, M., Chawiche, H., Mai,
G., Serhani, M., A Fuzzy Logic Framework to
improve the performance and interpretation of rule
based quality prediction models for OO software, Proc
26th Annual Intl Conf on Computer Science Software
and Applications, England, Aug 2002, 131-138.
[5] Song, Q., Sheppard, M., Cartwright, M., Mair, C.,
Software defect association mining and defect
correction effort prediction. IEEE Trans on Software
Engineering, 32(2), Feb 2006, 69-82.
[6] Srikant, R., Vu, Q., Agrawal, R., Mining
association rules with item constraints. Proc 3rd Intl
Conf on Knowledge Discovery and Data Mining
(KDD’ 97), Aug 1997, 67-73.
[7] Stringfellow, C., Amory, C., Potnuri, D., Georg,
M., Andrews, A., Deriving change architectures from
RCS history, IASTED Conf. Software Engineering and
Applications, Cambridge, MA, Nov 2004, 210-215.
[8] Stringfellow, C., Mayrhauser, A., Applying the
SRGM Selection to Flight Simulation Failure Data,
Tech Report, Colorado State University, Fort Collins,
CO, 2000, 1-16.
[9] Tkalcic, M., Online conversion tool,
Ljubljana, slavnik.fe.unilj.si/markot/csv2arff/csv2arff.php.
[10] Waikato Environment for Knowledge Analysis
(WEKA). Data Mining Software in Java, Nov 12,
2006 http://www.cs.waikato.ac.nz/ml/weka/
[11] Williams, C., Hollingsworth, K., Automatic
mining of source code repositories to improve bug
finding techniques. IEEE Trans on Software
Engineering, 31(6), Jun 2005, 466-480.
[12] Witten, H., Frank, E., Data mining: practical
machine learning tools and techniques, 2nd ed.,
Elsevier Publications, 2005.
[13] Xie, T., Thummalapenta, S., Lo, D., Liu, C.,
“Data mining for software engineering,” Computer,
Aug 2009, pp. 55-62.