IEEE Software

a) Title: ESTIMATING SOFTWARE DEVELOPMENT PROJECTS: A NEW APPROACH
b) Category: Papers
c) Authors and affiliations:

Isabel Ramos, Researcher
Dpto. de Lenguajes y Sistemas Informáticos. Universidad de Sevilla.
Facultad de Informática y Estadística. Avda. Reina Mercedes, s/n. 41012 - Sevilla (Spain).
Phone number: +34 954552776. Fax number: +34 954557139.
e-mail: isabel.ramos@lsi.us.es

José Cristóbal Riquelme, Researcher
Dpto. de Lenguajes y Sistemas Informáticos. Universidad de Sevilla.
Facultad de Informática y Estadística. Avda. Reina Mercedes, s/n. 41012 - Sevilla (Spain).
Phone number: +34 954552775. Fax number: +34 954557139.
e-mail: riquelme@lsi.us.es

ABSTRACT

Current simulation environments and dynamic models for software development projects (henceforth SDPs) make it feasible to build so-called SDP simulators. The main advantage of these simulation tools is that they can answer, at no cost, questions such as "How will the project evolve if ...?" before the project is executed, or "How would the project have evolved if ...?" once the project has finished. In this paper we present part of the results obtained by combining a tool that learns by producing rules with a dynamic model of SDPs. This combination allows us to obtain automatically management rules applicable to an SDP, in order to estimate good results for the variables that the project manager chooses.

1. INTRODUCTION

Among the tasks that an SDP manager must perform are planning, monitoring and development control.
For these tasks, managers basically rely on their own mental models, based on the experience accumulated in similar projects, and they lack formal models and tools that would improve the accuracy of the decisions to be taken.

Recently, a new kind of tool, the SDP simulator, has become established. These simulators allow project managers to experiment with several management policies at no cost and, in this way, to arrive at the most suitable decision. A dynamic model constitutes the essential core of these simulators. This model is obtained by observing the variables that define the state of the real project and the relations that govern its evolution over time. Current simulation environments (Stella, Vensim, iThink, Powersim, etc.) extend the usefulness of these models and facilitate the construction of simulators.

Simulating a dynamic model for an SDP makes it possible, before development begins (or once the project has finished), to know the impact: a) that a change of technology would have on the project [Chichakly, 93], b) of applying different management policies and c) of a change in the maturity level of the development organization.

Until now, to use a project simulator the manager had to know the initial estimations, the constraints of the project and of the development environment, and the management policies that he/she would apply. From these data the simulator provides the final results of the project (time, cost, etc.). If these results are not the expected ones, the process is repeated with modified management policies, and it continues until the project manager obtains the desired results.

In this paper, we propose a new approach to estimating the final results of a project: obtaining management rules¹ for SDPs automatically. Knowledge of these management rules can be obtained before the project's execution begins (or after the project has finished)
and it will allow us to obtain good results for the variables that the project manager chooses. That is, the manager now needs to know the initial estimations, the constraints of the project and of the development environment, and the results that he/she wishes to obtain. Our tool then suggests the management policies that he/she should apply (or should have applied, in the case of a post-mortem analysis).

To obtain the management rules automatically, we have combined the advantages of a rule-based learning system with the information provided by a dynamic model for SDPs. In the following sections, we first present a brief introduction to the concept of machine learning and to the tool used for this purpose; we then present the information that is given to the dynamic system for SDPs and how we have combined the two techniques. Finally, we apply these methods to a specific SDP in order to carry out a post-mortem analysis.

¹ We call a management rule a set of management policies (decisions) that the manager must adopt in order to achieve the final goals of the project.

2. MACHINE LEARNING

The computational techniques and tools designed to support the extraction of useful knowledge from databases are traditionally called machine learning. More recently, the names data mining and Knowledge Discovery in Databases (KDD) have come into use. In general, these techniques try to extract, automatically, information that is useful for decision support or for exploring and understanding the phenomenon that produced the data. A standard KDD process consists of several steps [Fayyad, 96], such as data preparation, data selection, data cleaning, data mining and proper interpretation of the results. Data mining can therefore be considered one particular step, consisting of the application of specific algorithms to extract patterns from data.
A wide variety of data mining algorithms is described in the literature of statistics, pattern recognition, machine learning and databases. Most data mining algorithms can be viewed as compositions of three basic techniques and principles: the model (classification, regression, clustering, linear function, etc.), the preference criterion (usually some form of goodness-of-fit function of the model to the data) and the search algorithm (genetic, greedy, gradient descent, etc.). The choice of a data mining method therefore depends on the model representation that we need. Given that our goal is to find rules that describe the behavior of an SDP, we have chosen to work with decision trees.

A decision tree is a classifier with the structure of a tree, where each node is either a leaf indicating a class, or an internal decision node that specifies a test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test. The main advantages of decision trees are their usefulness for finding structure in high-dimensional spaces and their immediate conversion into rules that are easily understood by humans. However, classification trees use univariate threshold decision boundaries, which may not be suitable for problems where the true decision boundaries are non-linear multivariate functions.

The most widespread decision tree algorithm is C4.5 [Quinlan, 93]. Basically, C4.5 is a recursive divide-and-conquer algorithm that builds the tree on the basis of the information gain criterion. The program output is a graphic representation of the tree found (Figure 1), a confusion matrix of the classification results and an estimated error rate.

If Condition 1 is V1: Class C1
If Condition 1 is V2:
    If Condition 2 is V3: Class C2
       Condition 2 is V4: Class C2
       Condition 2 is V5: Class C1

Figure 1: Example of a non-binary tree and its equivalent set of rules.
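The tree of Figure 1 can be written down as a small data structure and walked from the root to a leaf. The following Python sketch is only an illustration of that idea: the nested-dictionary layout and the names FIGURE_1_TREE, classify and tree_to_rules are assumptions of the sketch and are not part of C4.5 or of its output format.

```python
# Illustrative sketch only: the tree of Figure 1 as a nested dictionary.
# An internal node is {"test": attribute, "branches": {value: subtree}};
# a leaf is simply the class label (a string).
FIGURE_1_TREE = {
    "test": "Condition 1",
    "branches": {
        "V1": "C1",
        "V2": {
            "test": "Condition 2",
            "branches": {"V3": "C2", "V4": "C2", "V5": "C1"},
        },
    },
}


def classify(tree, example):
    """Walk from the root to a leaf, following at each internal node the
    branch that matches the example's value for the tested attribute."""
    node = tree
    while isinstance(node, dict):
        value = example[node["test"]]
        node = node["branches"][value]
    return node


def tree_to_rules(tree, conditions=()):
    """Return every root-to-leaf path as a (conditions, class) pair,
    i.e. the set of rules equivalent to the tree."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        step = ((tree["test"], value),)
        rules.extend(tree_to_rules(subtree, conditions + step))
    return rules
```

Calling classify(FIGURE_1_TREE, {"Condition 1": "V2", "Condition 2": "V5"}) follows the V2 branch and then the V5 branch, and tree_to_rules(FIGURE_1_TREE) lists one rule per leaf of the figure.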
The rules of Figure 1 can also be expressed as follows. If Condition 1 is V1, or Condition 1 is V2 and Condition 2 is V5, then Class C1. If Condition 1 is V2 and Condition 2 is V3 or V4, then Class C2.

C4.5 is very easy to set up and run: it only needs a declaration of the types and ranges of the attributes, in a file separate from the data, and it is executed with UNIX commands that take very few parameters. Its main disadvantage is that the regions obtained in continuous spaces are hyperrectangles, because the tests at the internal nodes have the form p_i <= L or p_i > U. However, given our purpose in this work, the results supplied by C4.5 are perfectly valid.

3. THE DYNAMIC MODEL AND C4.5

To obtain the database that is the input to C4.5, we have used a dynamic model for SDPs proposed in [Ramos, 98], called the Reduced Dynamic Model (RDM)², and implemented in the Vensim® simulation environment. The variables that describe the basic behaviour of a dynamic system are defined through differential equations. Furthermore, the model has a set of parameters that allow us to study different behaviours. These parameters are determined by the management policies that can be applied in SDPs: those related to the project environment (initial estimations, complexity of the software, etc.), those related to the development organization (personnel management, effort assignment, etc.) and those related to its maturity level (such as the average delays in carrying out error detection and correction activities). Table 1 shows the groups of parameters classified according to their function³.

PROJECT ENVIRONMENT
    Initial Estimation
    Project Complexity
ORGANIZATION ENVIRONMENT
    Management Policies: Effort assignment, Personnel management, Delivery time
    Maturity Degree: Average delays, Nominal values
    Others

Table 1: Classification of the parameters of the dynamic model for SDPs.
The values of the parameters can be chosen randomly within an interval defined by the user (for example, the average dedication of the technical personnel can vary between 20% and 100%, depending on the user's level of uncertainty). The model is then simulated, and a record is generated for the database with the values of the parameters and the final values obtained for the desired system variables (time, cost, number of errors, etc.). From this generated database, C4.5 learns by examining the supplied data and proposes a set of rules for decision-making (see Figure 2).

² This model makes it possible to know the evolution of an SDP in the early stages of the project, that is, when the available information is limited.

³ The number of parameters of a dynamic model varies from one model to another. For example, the model of [Abdel-Hamid, 91] has around 64, whereas the RDM has about 32.

[Figure 2. Diagram elements: Initial Estimations; Project and Organization Environment; Project Objectives; Project Simulation; D.B.; Machine learning; Management rules.]

Figure 2: Steps to follow for obtaining management rules from a dynamic model.

4. OBTAINING MANAGEMENT RULES

In the following sections we use the data of a real SDP described in [Abdel-Hamid, 91], which we will call PROJECT. This is a well-known project that has been amply validated by its authors. In section 4.1 we show some of the most significant parameters of the PROJECT environment and organization. In section 4.2 we give the initial estimations and the final values obtained for the cost and the delivery time. In section 4.3 we present the management rules, obtained automatically, that allow us to carry out a post-mortem analysis of the PROJECT; that is, to answer the following question: how could we have improved the final results of this project? In other words, which management policies should have been applied to improve the PROJECT's cost and time simultaneously?

4.1.
Project and organization environment

From among the parameters that define the development environment, both for the project and for the organization, and the maturity degree of the development organization, we have selected those that appear in Table 2, considering them representative of each of the blocks³ of Table 1. For each parameter, the table gives its name in the Reduced Dynamic Model, the interval of values that it can take, a brief description of its meaning and its units of measurement. For the specific SDP that we are going to analyze, the rest of the parameters [Abdel-Hamid, 91] are assumed not to vary.

NAME    INTERVAL       DESCRIPTION (UNITS)
DEDIC   (20 - 100)     Average dedication of the technical personnel (%).
RESQA   (5 - 15)       Average delay in carrying out Quality activities (days).
READE   (20 - 120)     Average delay in the adaptation of new technical personnel to the project (days).
RECON   (1 - 40)       Average delay in contracting technical personnel (days).
PORTE   (30 - 100)     Percentage of technicians at the beginning of the project in relation to the estimated average value (%).
TECCO   (1 - 4)        Technicians to contract for each experienced full-time technician (technicians).
RENOT   (5 - 15)       Average delay in notifying the real state of the project (days).
ESFPR   (0.1 - 0.25)   Nominal effort needed per error in the Tests stage (technician-days).
RETRA   (1 - 15)       Average delay in transferring surplus technical personnel to other projects (days).
POFOR   (10 - 40)      Average percentage of the experienced technicians' dedication to training (%).
INTAM   (0 - 0.5)      Initial underestimation of the project's size in lines of source code (LOC).

Table 2: Representative parameters of the project's environment and of the organization's environment.

The variables studied in the following sections are the cost and the delivery time of the PROJECT.
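As described in section 3, each database record is produced by drawing the parameters at random within their intervals and then simulating the model. A minimal Python sketch of that step for the intervals of Table 2 follows. Uniform sampling, the function names and the simulate callable are assumptions of the sketch: simulate stands in for a run of the RDM in Vensim, which is not reproduced here, and is expected to return the final delivery time and cost.

```python
import random

# Intervals copied from Table 2, in the units used by the table.
TABLE_2_INTERVALS = {
    "DEDIC": (20, 100),
    "RESQA": (5, 15),
    "READE": (20, 120),
    "RECON": (1, 40),
    "PORTE": (30, 100),
    "TECCO": (1, 4),
    "RENOT": (5, 15),
    "ESFPR": (0.1, 0.25),
    "RETRA": (1, 15),
    "POFOR": (10, 40),
    "INTAM": (0, 0.5),
}


def sample_parameters(rng):
    """Draw one value per parameter inside its interval.
    Assumption: a uniform distribution; the paper only says 'randomly'."""
    return {name: rng.uniform(low, high)
            for name, (low, high) in TABLE_2_INTERVALS.items()}


def generate_database(simulate, n_records, seed=0):
    """Build the records that would be handed to the learning tool.

    simulate is a hypothetical callable standing in for one run of the
    dynamic model: it takes a parameter dict and returns the pair
    (delivery time in days, cost in technician-days).
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n_records):
        params = sample_parameters(rng)
        time_days, cost_td = simulate(params)
        records.append({**params, "time": time_days, "cost": cost_td})
    return records
```

Each record holds the eleven sampled parameters plus the two simulated outcomes, which is the shape of input that a rule learner such as C4.5 needs once the outcomes have been turned into class labels.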
4.2. Initial estimations and project goals

The initially estimated values for the PROJECT's delivery time and cost were 320 days and 1111 technician-days (t-d); however, the real values were 387 days and 2092 technician-days [Abdel-Hamid, 91]. The final values therefore exceeded the initial estimations by about 21% and 88% respectively (387/320 ≈ 1.21 and 2092/1111 ≈ 1.88).

Next, we define the values that we want to obtain for the project time and cost. These values will be labeled GOOD.

Delivery time (days): Any value between 320 and 387 days will be labeled GOOD because it does not exceed the real result. All values greater than 387 days will be labeled BAD because they exceed the real result.

Cost (technician-days): Any value between 1111 and 2092 technician-days will be labeled GOOD because it does not exceed the real result. All values greater than 2092 technician-days will be labeled BAD because they exceed the real result.

In view of this outcome, we must ask ourselves: do management rules exist that might have improved the final results? In the following section we answer this question.

4.3. Management rules obtained

Specifically, we want to know: what values should the parameters have taken to improve the real results? A second question, which the development organization itself must answer, is: are these values easy to modify? The management rules obtained for PROJECT by applying the RDM and C4.5 are shown in Table 3.

READE <= 27
    RENOT <= 12; INTAM > 0.40; DEDIC > 0.6                  (rule 1)
    RENOT > 12;  RETRA > 10                                 (rule 2)
READE > 27
    RETRA <= 14
        INTAM <= 0.47; POFOR <= 0.13; ESFPR > 0.22          (rule 3)
        INTAM > 0.47;  POFOR <= 0.18                        (rule 4)
    RETRA > 14; DEDIC > 0.82                                (rule 5)

Table 3: Management rules to estimate GOOD results, simultaneously, for the delivery time and the cost.

To meet the proposed time and cost goals, five management rules have been obtained (Table 3).
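The labeling of section 4.2 and the first two rules of Table 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, values below the initial estimate are left unlabeled because the text does not classify them, and DEDIC and INTAM are taken as fractions, on the scale used in Table 3 rather than the percentages of Table 2. Only rules 1 and 2 are encoded.

```python
GOOD, BAD = "GOOD", "BAD"

# Initial estimates and real final values of the PROJECT (section 4.2).
ESTIMATED_TIME, REAL_TIME = 320, 387      # days
ESTIMATED_COST, REAL_COST = 1111, 2092    # technician-days


def label(value, estimated, real):
    """GOOD between the initial estimate and the real value (inclusive),
    BAD above the real value. Values below the estimate are not covered
    by section 4.2, so this sketch returns None for them."""
    if value > real:
        return BAD
    if value >= estimated:
        return GOOD
    return None


def rule_1(p):
    """Rule 1 of Table 3 (DEDIC and INTAM as fractions, as in Table 3)."""
    return (p["READE"] <= 27 and p["RENOT"] <= 12
            and p["INTAM"] > 0.40 and p["DEDIC"] > 0.6)


def rule_2(p):
    """Rule 2 of Table 3."""
    return p["READE"] <= 27 and p["RENOT"] > 12 and p["RETRA"] > 10


def matching_rules(p):
    """Names of the encoded rules that a parameter setting satisfies."""
    rules = {"rule 1": rule_1, "rule 2": rule_2}
    return [name for name, test in rules.items() if test(p)]
```

A parameter setting that satisfies one of these rules falls in a region that the learned tree classifies as GOOD for both delivery time and cost; a setting that matches neither may still satisfy one of the rules not encoded here.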
A general reading of the rules tells us which parameters are the most important for obtaining the desired values of time and cost simultaneously, and what range of values those parameters should take. In particular, management rules (1) and (2) tell us that the final results achieved for the delivery time and the cost could have been improved in either of two ways.

Rule 1: "If the integration of new personnel into the project (READE) had been less than or equal to 27 days, and the notification of the project's progress (RENOT) had been less than or equal to 12 days, and the initial underestimation of the size of the product in lines of source code (INTAM) had been greater than 40%, and the dedication of the technical personnel to the project (DEDIC) had been greater than 60%."

Rule 2: "If the integration of new personnel into the project (READE) had been less than or equal to 27 days, and the notification of the project's progress (RENOT) had been greater than 12 days, and the transfer of technical personnel to other projects (RETRA) had been greater than 10 days."

Figure 3 shows how time and effort would have evolved under management rule 2. This rule would have improved the final results for time and effort simultaneously, by 5% and 2% respectively. To achieve this, we would have had to raise the parameters RENOT (> 12 days) and RETRA (> 10 days), whose initial values were both 10 days, whereas the parameter READE (<= 27 days) would not have needed modification because its initial value was 20 days [Abdel-Hamid, 91].

[Plot: Delivery time (rule 2), in days, and Cost (rule 2), in t-d, against elapsed time in days.]

Figure 3: Time and cost evolution when rule 2 is applied.

Therefore, based on the previous management rules, we can answer the first of the questions mentioned above.
The answer is yes: PROJECT's final results could have been improved, and the required values of the parameters appear in the management rules of Table 3. The second question can only be answered by the project director and the other managers of the development organization.

Once the management rules have been obtained, it is the project manager who decides which rule or rules are the easiest to apply, depending on the specific project and on the software organization. In any case, he/she knows that if the parameters do not take the values given in the rules, the optimization of the variable or group of variables of interest is not guaranteed.

In view of the results obtained, and of the complexity of managing and controlling an SDP, we propose at least two basic criteria for choosing management rules: first, choose rules whose parameters are easy to control and to modify; second, if possible, choose rules that involve a small number of parameters.

5. CONCLUSIONS AND FUTURE WORK

Management rules for SDPs can be obtained before the execution of a project begins, in order to define the most adequate management policies for the project that is going to be carried out. They can also be obtained for projects that have already ended, in order to carry out a post-mortem analysis. These rules can be applied in order to:

- Obtain values that can be considered good (acceptable or bad) for any variable that we are interested in analyzing, either independently or simultaneously with other variables.
- Analyze which parameters are involved in the definition of the management policies and in the maturity level of the organization, and which of them are easy to modify.
- Study which of the previously mentioned parameters have the most influence on obtaining good results.

In short, we can say that it is possible to obtain management rules for an SDP automatically and to recognize which management policies guarantee the attainment of its goals.
In light of the potential offered by obtaining management rules from a dynamic model, our future work is directed towards the application of fuzzy logic techniques and the creation of a simulator for SDPs that can generate management rules in a multi-project environment.

6. REFERENCES

[Abdel-Hamid, 91] Abdel-Hamid, T.; Madnick, S.: “Software Project Dynamics: an integrated approach”. Prentice-Hall, 1991.

[Chichakly, 93] Chichakly, K. J.: “The bifocal vantage point: managing software projects from a Systems Thinking Perspective”. American Programmer, pp. 18-25. May, 1993.

[Fayyad, 96] Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.: “The KDD Process for Extracting Useful Knowledge from Volumes of Data”. Communications of the ACM. Vol. 39, No. 11, pp. 27-34. November, 1996.

[Quinlan, 93] Quinlan, J.: “C4.5: Programs for Machine Learning”. Morgan Kaufmann Pub. Inc., 1993.

[Ramos, 98] Ramos, I.; Ruiz, M.: “A Reduced Dynamic Model to Make Estimations in the Initial Stages of a Software Development Project”. INSPIRE III. Process Improvement through Training and Education. Edited by C. Hawkings, M. Ross, G. Staples, J. B. Thompson. pp. 172-185. September, 1998.