Mining RCS Data

Catherine Stringfellow, Swetha Myneni, Raaji Vedala-Tirumala, Sreya Reddy
Department of Computer Science, Midwestern State University, 3410 Taft Blvd, Wichita Falls, TX 76308

Abstract - As software systems evolve, it becomes important to know and to track the evolution of a system and its components. Data mining tools help us find relationships between these components from patterns in the attributes of the data. The purpose of this paper is to describe how relationships between the attributes in RCS (Revision Control System) change data can be found. These relationships may include which components of the system undergo repeated maintenance, how many lines of code are changed, the span of time in which files undergo change, as well as the type of maintenance (according to certain keywords used in the log). The technique is illustrated on a flight simulator system consisting of 409 RCS files.

Keywords: maintenance, software evolution, RCS data mining.

1 Introduction

It is important to track the evolution of a system and its components as changes are made repeatedly. Components become increasingly difficult to maintain if these changes are not managed. Identifying these components and classifying their changes helps in maintaining a system or in properly building an entirely new one. This information can be used when components need to be reengineered in the future. Problems increase if changes are not understood in the early stages. Late changes typically require changes to code in multiple components, which leads to problems in the software architecture of the system [1, 7]. Software architecture problems are far more costly to fix, so it is better to identify them as soon as possible. The change reports provided by RCS files help in finding these changes in the components. Change reports are written whenever developers change some part of the code.
The information for each change is stored in the logs of each RCS file of the system. A change may affect many components. In addition, the time, author and nature of the change are also recorded. Data mining analysis is a tool for finding patterns between attributes. Most existing analysis approaches employ quantitative data (statistical or time-series techniques) or qualitative data (such as details about the people who fix defective components). Data mining techniques analyze both qualitative and quantitative data using association rule mining. The decision tree learner is one of the most prominent machine learning techniques and is successful in classification problems where attributes have discrete values [1]. Thus, discretizing attribute values enables association mining techniques to be used more easily.

This paper is organized in the following way. Section 2 describes prior research, section 3 describes the method followed in this case study, section 4 gives the results of applying data mining to RCS data from an industrial application system, and finally section 5 summarizes the paper.

2 Background

Data mining is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery [6, 10]. It is a technique that can also be quite helpful in effectively managing the software development process [13]. Software defect data are typically used in reliability modeling to predict the number of remaining defects in order to assess software quality and make release decisions [8]. Recent work using data mining techniques analyzes data collected during different phases of software development, including defect data [3]. A great deal of software engineering data exists today and continues to grow, and the helpfulness of that data in improving software development and software quality has already been demonstrated.
Xie, Thummalapenta, Lo and Liu describe categories of data, various mining algorithms and the challenges in mining data [13]. Sahraoui, Boukadoum, Chawiche, Mai and Serhani [4] use a fuzzy binary decision tree technique to predict stability among versions of an application developed using object-oriented techniques. This technique gives improved performance in deriving association rules over the classical decision tree technique, as well as solving the threshold limit problem and the naive model problem, in which decision support is not available during development.

As software gets more complex, testing and fixing defects become difficult to schedule. Rattikorn, Kulkarni, Stringfellow and Andrews [3] presented an empirical approach that employed well-established data mining algorithms to construct predictive models from historical defect data. Their goal was to predict an estimated time for fixing defects found in testing. The accuracy obtained from their predictive models is as high as 93%, despite the fact that not all relevant information was collected.

Association rule mining can help identify strong relationships between different attribute values [3]. These relationships can help predict any number of attribute values; hence, it is easy to be flooded with association rules even for a small data set [11]. Association rules are an expression of different regularities implied in a dataset, and some rules found imply other rules. Some of these rules can be meaningless or ambiguous; as a result, it becomes necessary to constrain the rules to a minimum number of instances (e.g. 90% of the dataset) and to set a minimum limit for accuracy (e.g. a 95% accuracy level). The coverage or support of an association rule can be defined as the number of instances for which the rule predicts correctly.
The accuracy or confidence of an association rule can be defined as the proportion of correctly predicted instances to all instances to which the rule applies. In rule mining, an attribute-value pair is called an item, and a combination of attribute-value pairs is called an item set [1, 11]. Association rules are generally sought for item sets. There are two steps in mining association rules: 1) identifying all the item sets that satisfy the minimum specified coverage, and 2) generating from each of these item sets the rules that satisfy the minimum support and minimum confidence. It is possible that some item sets may not generate any rules and some may generate many rules.

Much current software defect prediction work focuses on the number of defects remaining in a software system. This is to help developers detect software defects and assist project managers in allocating testing resources more effectively. Song, Shepperd, Cartwright and Mair report results showing that, for defect association prediction, the accuracy is very high and the false-negative rate is very low [5]. They also found that higher support and confidence levels may not result in higher prediction accuracy, and that a sufficient number of rules is a precondition for high prediction accuracy. Srikant, Vu and Agrawal describe a technique to find only rules that meet certain constraints, building the constraints into the mining algorithm in order to get a subset of rules of interest in less execution time [6].

Several recent studies have focused on version history analysis to identify components that may need special attention in some maintenance task. Nikora and Munson found that measurements of code churn in an evolving software system can serve as predictors for the number of faults inserted in a system during development, and that this predictive ability can be improved with a clear standard for the definition of a fault [2].
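The support and confidence definitions above can be sketched in a few lines of code. This is a minimal illustration, not the study's implementation; the transaction data and item names are hypothetical.

```python
# Sketch: support and confidence of an association rule over a set of
# transactions, following the definitions in the text. Each transaction
# is a set of attribute=value items (here, one per change log).

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions matching the antecedent, the fraction that
    also match the consequent."""
    ante = sum(1 for t in transactions if antecedent <= t)
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return both / ante if ante else 0.0

# Hypothetical log data: four logs described by keyword items.
logs = [
    {"bug=1", "fix=1"},
    {"bug=1", "fix=1"},
    {"bug=1"},
    {"add=1"},
]
print(support(logs, {"bug=1"}))                # 0.75
print(confidence(logs, {"bug=1"}, {"fix=1"}))  # 0.6666666666666666
```

A rule such as bug=1 ==> fix=1 would thus be reported only if both values clear the minimum support and confidence thresholds chosen by the analyst.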
Stringfellow, Amory, Potnuri, Georg and Andrews use RCS change history data to determine the level of coupling between components in order to identify potential code decay [7]. They looked at grouping changes according to a common author and a small time interval for checking in the changed files, as well as grouping changes to files within and between components.

This paper uses RCS files as input. RCS systems have many advantages: they manage multiple revisions of files and automate the storing, retrieval, logging and merging of revisions [7]. The automatic logging makes it easy to know what changes were made to a module, without having to compare source listings. Revision numbers aid in retrieving the changes. Included in the RCS data are the author, date, and log message summary for each file.

3 Method

The method in this study follows a mostly data-driven approach, although the authors are focused on the task of maintenance. It also follows the steps in the mining methodology outlined in Xie et al. [13]. Preprocessing involved extracting data from the RCS files, scrubbing the data, and importing the data into arff files. Mining involved finding association rules using the WEKA software [10]. Witten and Frank's book and the WEKA toolkit are excellent resources for data mining [12].

A Python program extracts attributes from each RCS log. The results are stored in the form of a text file that is then imported into a spreadsheet and stored in a csv file. That file is then converted to an arff file using an online conversion tool. Finally, the arff file is opened in WEKA to obtain rules showing different relationships between attributes with association rule mining [9].

3.1 RCS data extraction

Figure 1 shows the first part of an RCS file. The head indicates the last revision logged to the file (in this case 1.4, indicating there are 4 logged changes to the file). Each log has the date, time, and author of the change made.
head 1.4;

1.4
date 99.08.30.21.27.54; author mikeh; state Exp;
next 1.3;

1.3
date 99.07.14.22.32.48; author mikeh; state Exp;
next 1.2;

Fig 1: Change data in an RCS file with attributes date, author, etc.

The RCS data come from a large flight simulation system, consisting of 409 RCS files of type .c,v and .h,v [8]. The RCS logs were first analyzed to determine the most frequently occurring keywords concerning changes in the RCS files. The keywords in logs indicative of change were determined to be add, delet, fix, bug, chang, fails, modify, error, correct, mov, fault, debug, updat, problem and replac. File attributes, such as the span of time from the first to the last change and the number of logs for a file, were also of interest.

Figure 2 shows the descriptions of two changes in the log. The descriptions are found at the end of the RCS file after the code, following the string desc. Log 1.3, for example, has @d2 1, a2 1 in it, which means that on line 2 one statement was deleted and one was added. A programmer types in a description of the change in a comment (inserted between two @'s). The Python program extracts attribute values from each change log, including log number, date of change, author, number of deletes, number of adds and the frequency of each keyword occurring in the log associated with the change. These extracted attribute values are stored in a csv file.

desc
@@

1.4
log
@fixed ccip/ccrp ct/hl code
@
text
@/*

1.3
log
@first checkin after tactics
@
text
@d2 1
a2 1
d61 2
a62 2

Fig 2: Example of two RCS logs.

3.2 Converting to arff

The csv file output by the Python program is converted into an arff file using an online conversion tool, as WEKA supports only arff files [10]. The arff (attribute-relation file format) is WEKA's native format for describing instances and their attributes.

3.3 Data mining for association rules

Data mining is the process of extracting rules from data.
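The extraction step of section 3.1 might look like the following sketch. This is not the authors' actual program; the field layout is inferred from Figures 1 and 2, and the revision messages and column names are assumptions.

```python
# Sketch: pull revision number, date and author from RCS delta headers,
# count change keywords in each log message, and emit csv rows.
import csv
import io
import re

KEYWORDS = ["add", "delet", "fix", "bug", "chang", "fails", "modify",
            "error", "correct", "mov", "fault", "debug", "updat",
            "problem", "replac"]

# Hypothetical minimal RCS delta section (real files carry more fields).
rcs_text = """1.4
date 99.08.30.21.27.54; author mikeh; state Exp;
1.3
date 99.07.14.22.32.48; author mikeh; state Exp;
"""

# Log messages keyed by revision; in a real file they follow the 'log' tag.
messages = {"1.4": "fixed ccip/ccrp ct/hl code",
            "1.3": "first checkin after tactics"}

delta_re = re.compile(r"^(\d+\.\d+)\ndate\s+([\d.]+);\s*author\s+(\w+);",
                      re.M)

def extract(text, logs):
    """Return one row per revision: number, date, author, keyword counts."""
    rows = []
    for num, date, author in delta_re.findall(text):
        msg = logs.get(num, "").lower()
        rows.append([num, date, author] + [msg.count(k) for k in KEYWORDS])
    return rows

rows = extract(rcs_text, messages)

# Write the rows as csv (to a string buffer here; a file in practice).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["rev", "date", "author"] + KEYWORDS)
writer.writerows(rows)
```

Counting deleted and added lines would work the same way, by summing the d and a directives in each revision's text section.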
The WEKA software application is used in this study to find a set of rules and the relationships between the change attributes extracted from the RCS files. WEKA's association rule miner requires data to be nominal, so filters are applied to discretize the numeric attributes. After applying bins (groupings) and filters, the rules are found and the relationships between attributes are determined. Discretizing is done by selecting a particular attribute and then supplying the number of bins (and their range of values) into which the attribute is to be divided or grouped. Grouping the data for a numeric attribute into bins converts it into a nominal attribute. Once all the attributes are nominal, association mining is performed using the Apriori algorithm.

Figure 3 shows a WEKA screenshot of a list of the attributes extracted. The attribute SpanDays is highlighted (SpanDays refers to the span of days over which a file underwent changes). It shows the six bins the numeric counts fell into and a frequency graph. It also shows the minimum, maximum, mean, and standard deviation values for those attributes. The maximum span, for example, was 481 days.

Fig 3: Screenshot of the output from WEKA showing the graph for the attributes per log [7].

Some data scrubbing was performed. One of the RCS files, for example, was ignored as it was not formatted correctly for data extraction. In addition, some data in the csv files had zeros in some of the columns when uploaded, and these were deleted. As already mentioned, data imported into WEKA had to be converted from numeric to nominal format for the data mining analysis.

4 Results

Using association rule mining and the Apriori algorithm, 100 rules were found between the attributes. From these, we consider the rules that have a minimum confidence of 0.9 or above to find relationships between the attributes. The most significant of these rules are shown in Figure 4.
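The equal-width binning that the discretize step performs can be sketched as follows. The bin labels mimic the value_value form seen in the rules, but the attribute values and bin count here are illustrative, not taken from the study's data.

```python
# Sketch: equal-width discretization of a numeric attribute into n_bins
# nominal labels, similar in spirit to WEKA's unsupervised Discretize
# filter (which computes bin boundaries the same way).

def discretize(values, n_bins):
    """Map each numeric value to a nominal label like '1_81'."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # Clamp the top value into the last bin.
        i = min(int((v - lo) / width), n_bins - 1) if width else 0
        a = lo + i * width
        b = a + width
        labels.append(f"{a:g}_{b:g}")
    return labels

# Hypothetical SpanDays values; 481 was the maximum reported in the study.
span_days = [1, 15, 80, 200, 481]
print(discretize(span_days, 6))
# ['1_81', '1_81', '1_81', '161_241', '401_481']
```

Once every numeric attribute is replaced by such labels, each row is a set of nominal attribute-value items, which is exactly the form association rule mining expects.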
Rule 1: SpanInDays=1_80 184 ==> CountOfLogs=1_4 add=0_4 updatevalue=0_1 181 conf:(0.98)
Rule 2: SpanInDays=1_80 184 ==> CountOfLogs=1_4 chang=1_6 updatevalue=0_1 181 conf:(0.98)
Rule 3: fail=0_max 409 ==> bug=0_max 409 conf:(1)
Rule 4: mov=0_max 409 ==> bug=0_max 409 conf:(1)
Rule 5: debug=0_max 409 ==> fail=0_max 409 conf:(1)
Rule 6: bug=0_max 409 ==> fail=0_max correct=0_max 409 conf:(1)
Rule 7: Author=chris 388 <==> DateTime=529_max NumAdds=1_6 354 conf:(0.91)
Rule 8: NumLogs=0_5 DateTime=529_max 257 ==> LogNo=0_4 NumAdds=1_6 250 conf:(0.97)

Fig 4: Rules found from association rule mining.

Figure 4 shows eight rules which have confidence of 0.9 and above. Rules 1 and 2 (and several others of the 100 rules found) indicate that for files that undergo change in a short span of time (here the bin was less than 80 days), there are few occurrences of the keyword add or chang (-e, -es, -ing, -ed). (In addition, there were few occurrences of the keywords updat, remov or delet.) Also, for files with a shorter span, these few changes were made by few authors over few logs. Rules 3-5 show there is a strong correlation between the keywords fail, bug, mov and debug; these words can almost be taken as synonyms and, as a result, could probably be counted together. Lines added by author Chris were mostly added during the second half of the development schedule, and he mainly added only a few lines of code at a time (Rule 7). Files with few logged changes mostly had only a few lines added (Rule 8).

5 Conclusion

This study shows that it is possible to mine RCS or other version control data in a system to identify certain change relationships. Identifying relationships between the attributes in RCS data could help in improving the system architecture and in determining the components that need repeated maintenance. Data mining is one of the prominent tools that can help in finding these relationships and rules, with both qualitative and quantitative data. Mining for change relationships may improve a system in the early stages of development and save money. If the changes are found only in later stages, changing the whole system architecture could lead to disaster.

References

[1] Anand, R., "Association rule mining for a medical record system using WEKA," File paper, Midwestern State University, Fall 2006.
[2] Nikora, A., Munson, J., "Developing fault predictors for evolving software systems," Proc 9th Intl Software Metrics Symposium, Sydney, Australia, Sept 2003, 338-349.
[3] Rattikorn, H., Kulkarni, A., Stringfellow, C., Andrews, A., "Software defect data and predictability for testing schedules," Proc 18th Intl Conf on Software Engineering and Knowledge Engineering (SEKE'06), San Francisco Bay, USA, Jul 2006, 715-717.
[4] Sahraoui, H., Boukadoum, M., Chawiche, H., Mai, G., Serhani, M., "A fuzzy logic framework to improve the performance and interpretation of rule-based quality prediction models for OO software," Proc 26th Annual Intl Computer Software and Applications Conf, England, Aug 2002, 131-138.
[5] Song, Q., Shepperd, M., Cartwright, M., Mair, C., "Software defect association mining and defect correction effort prediction," IEEE Trans on Software Engineering, 32(2), Feb 2006, 69-82.
[6] Srikant, R., Vu, Q., Agrawal, R., "Mining association rules with item constraints," Proc 3rd Intl Conf on Knowledge Discovery and Data Mining (KDD'97), Aug 1997, 67-73.
[7] Stringfellow, C., Amory, C., Potnuri, D., Georg, M., Andrews, A., "Deriving change architectures from RCS history," IASTED Conf on Software Engineering and Applications, Cambridge, MA, Nov 2004, 210-215.
[8] Stringfellow, C., Mayrhauser, A., "Applying the SRGM selection to flight simulation failure data," Tech Report, Colorado State University, Fort Collins, CO, 2000, 1-16.
[9] Tkalcic, M., Online conversion tool, Ljubljana, slavnik.fe.uni-lj.si/markot/csv2arff/csv2arff.php.
[10] Waikato Environment for Knowledge Analysis (WEKA), Data Mining Software in Java, Nov 12, 2006, http://www.cs.waikato.ac.nz/ml/weka/
[11] Williams, C., Hollingsworth, K., "Automatic mining of source code repositories to improve bug finding techniques," IEEE Trans on Software Engineering, 31(6), Jun 2005, 466-480.
[12] Witten, I., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Elsevier, 2005.
[13] Xie, T., Thummalapenta, S., Lo, D., Liu, C., "Data mining for software engineering," Computer, Aug 2009, 55-62.