IMPROVED SIZE AND EFFORT ESTIMATION MODELS
FOR SOFTWARE MAINTENANCE
by
Vu Nguyen
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2010
Copyright 2010
Vu Nguyen
DEDICATION
To my parents,
Nguyễn Đình Hoàng and Nguyễn Thị Thiên,
You survived and struggled through one of the darkest chapters in the history of Vietnam
to raise and love your children.
ACKNOWLEDGEMENTS
I am deeply indebted to my advisor, Dr. Barry Boehm, for his encouragement,
support, and constructive advice. I have learned not only from his deep knowledge and
experience in software engineering but also from his remarkable personality. I would also
like to thank A. Winsor Brown, who brought me in to work on the CodeCount project and
helped make the tool more useful for the software development community.
I am grateful to Dr. Bert Steece, whom I regard as my unofficial advisor, for his
statistical insights that shaped my understanding of the statistical analysis applied to this
dissertation work. My thanks also go to other members of my qualifying exam and
defense committees, Dr. Nenad Medvidović, Dr. Ellis Horowitz, and Dr. Rick Selby.
Their criticism, encouragement, and suggestions helped shape my work.
I am thankful to my friends and faculty for their constructive feedback on my
work, including Dan Port, Tim Menzies, Jo Ann Lane, Brad Clark, Don Reifer,
Supannika Koolmanojwong, Pongtip Aroonvatanaporn, and Qi Li. Other colleagues Julie
Sanchez of the USC Center for Systems and Software Engineering, Marilyn A Sperka,
Ryan E Pfeiffer, and Michael Lee of the Aerospace Corporation, and Lori Vaughan of
Northrop Grumman also assisted me in various capacities.
This work was made possible with the support for data collection from the
Affiliates of the Center for Systems and Software Engineering and two organizations in
Vietnam and Thailand. In particular, Phuong Ngo, Ngoc Do, Phong Nguyen, Long Truong,
Ha Ta, Hoai Tang, Tuan Vo, and Phongphan Danphitsanuphan provided tremendous help
in collecting historical data from the organizations in Vietnam and Thailand.
I owe much to the Fulbright Program for financial support during my Master’s
program at USC, giving me an opportunity to fulfill my dream of studying abroad and
doing research with remarkable researchers in the software engineering research
community. My cultural and educational experiences in the United States made possible
by the program are priceless.
TABLE OF CONTENTS
Dedication ........................................................................................................................... ii
Acknowledgements............................................................................................................ iii
List of Tables .................................................................................................................... vii
List of Figures .................................................................................................................... ix
Abbreviations...................................................................................................................... x
Abstract ............................................................................................................................. xii
Chapter 1. Introduction ................................................................................................. 1
1.1 The Problem ................................................................................................ 2
1.2 A Solution .................................................................................................... 2
1.3 Research Hypotheses ................................................................................... 3
1.4 Definitions ................................................................................................... 6
Chapter 2. Related Work .............................................................................................. 7
2.1 Software Sizing ............................................................................................ 7
2.1.1 Code-based Sizing Metrics ......................................................................... 7
2.1.2 Functional Size Measurement (FSM) ....................................................... 10
2.2 Major Cost Estimation Models .................................................................. 16
2.2.1 SLIM ......................................................................................................... 17
2.2.2 SEER-SEM ............................................................................................... 19
2.2.3 PRICE-S .................................................................................................... 22
2.2.4 KnowledgePlan (Checkpoint) ................................................................... 24
2.2.5 COCOMO ................................................................................................. 25
2.3 Maintenance Cost Estimation Models ....................................................... 30
2.3.1 Phase-Level Models .................................................................................. 30
2.3.2 Release-Level Models ............................................................................... 32
2.3.3 Task-Level Models ................................................................................... 36
2.3.4 Summary of Maintenance Estimation Models .......................................... 44
Chapter 3. The Research Approach ............................................................................ 46
3.1 The Modeling Methodology ...................................................................... 46
3.2 The Calibration Techniques ....................................................................... 52
3.2.1 Ordinary Least Squares Regression .......................................................... 52
3.2.2 The Bayesian Analysis .............................................................................. 53
3.2.3 A Constrained Multiple Regression Technique ........................................ 55
3.3 Evaluation Strategies ................................................................................. 58
3.3.1 Model Accuracy Measures ....................................................................... 58
3.3.2 Cross-Validation ....................................................................................... 60
Chapter 4. The COCOMO II Model for Software Maintenance ................................ 62
4.1 Software Maintenance Sizing Methods ..................................................... 62
4.1.1 The COCOMO II Reuse and Maintenance Models .................................. 64
4.1.2 A Unified Reuse and Maintenance Model ................................................ 68
4.2 COCOMO II Effort Model for Software Maintenance ............................. 75
Chapter 5. Research Results ....................................................................................... 80
5.1 The Controlled Experiment Results ........................................................... 80
5.1.1 Description of the Experiment .................................................................. 80
5.1.2 Experiment Results ................................................................................... 84
5.1.3 Limitations of the Experiment .................................................................. 91
5.2 Delphi Survey Results ............................................................................... 92
5.3 Industry Sample Data ................................................................................ 96
5.4 Model Calibrations and Validation .......................................................... 104
5.4.1 The Bayesian Calibrated Model .............................................................. 105
5.4.2 The Constrained Regression Calibrated Models ..................................... 110
5.4.3 Reduced Parameter Models ..................................................................... 114
5.4.4 Local Calibration ..................................................................................... 117
5.5 Summary .................................................................................................. 122
Chapter 6. Contributions and Future Work .............................................................. 123
6.1 Contributions ............................................................................................ 123
6.2 Future Work ............................................................................................. 125
Bibliography ................................................................................................................... 129
Appendix A. UNFM and AA Rating Scale .................................................................... 138
Appendix B. Delphi Survey Form .................................................................................. 139
Appendix C. Data Collection Forms............................................................................... 156
Appendix D. The COCOMO II.2000 Parameters Used in the Experiment.................... 158
Appendix E. Histograms for the Cost Drivers ................................................................ 159
Appendix F. Correlation Matrix for Effort, Size, and Cost Drivers ............................... 170
LIST OF TABLES
Table 2-1. COCOMO Sizing Models ............................................................................... 27
Table 2-2. COCOMO II Calibrations ............................................................................... 29
Table 2-3. Maintenance Cost Estimation Models............................................................. 42
Table 4-1. Maintenance Model’s Initial Cost Drivers ...................................................... 76
Table 4-2. Ratings of Personnel Experience Factors (APEX, PLEX, LTEX).................. 77
Table 4-3. Ratings of RELY ............................................................................................. 78
Table 5-1. Summary of results obtained from fitting the models ..................................... 89
Table 5-2. Differences in Productivity Ranges................................................................. 93
Table 5-3. Rating Values for Cost Drivers from Delphi Survey ...................................... 94
Table 5-4. RELY Rating Values Estimated by Experts.................................................... 96
Table 5-5. Maintenance Core Data Attributes .................................................................. 98
Table 5-6. Summary Statistics of 86 Data Points ........................................................... 101
Table 5-7. Differences in Productivity Ranges between Bayesian Calibrated Model and
COCOMO II.2000......................................................................................... 105
Table 5-8. Rating Values for Cost Drivers from Bayesian Approach ............................ 107
Table 5-9. Estimation Accuracies Generated by the Bayesian Approach ...................... 109
Table 5-10. Estimation Accuracies of COCOMO II.2000 on the Data Set.................... 110
Table 5-11. Retained Cost Drivers of the Constrained Models ...................................... 111
Table 5-12. Estimation Accuracies of Constrained Approaches .................................... 112
Table 5-13. Estimation Accuracies of Constrained Approaches using LOOC Cross-validation ................................................................................................ 112
Table 5-14. Correlation Matrix for Highly Correlated Cost Drivers .............................. 115
Table 5-15. Estimation Accuracies of Reduced Calibrated Models ............................... 116
Table 5-16. Stratification by Organization ..................................................................... 119
Table 5-17. Stratification by Program ............................................................................ 119
Table 5-18. Stratification by Organization on 45 Releases ............................................ 120
LIST OF FIGURES
Figure 3-1. The Modeling Process.................................................................................... 48
Figure 3-2. A Posteriori Bayesian Update in the Presence of Noisy Data RUSE ............ 54
Figure 3-3. Boxplot of mean of PRED(0.3) on the COCOMO II.2000 data set .............. 56
Figure 3-4. Boxplot of mean of PRED(0.3) on the COCOMO 81 data set ...................... 56
Figure 4-1. Types of Code ................................................................................................ 63
Figure 4-2. Nonlinear Reuse Effects................................................................................. 65
Figure 4-3. AAM Curves Reflecting Nonlinear Effects ................................................... 72
Figure 5-1. Effort Distribution.......................................................................................... 85
Figure 5-2. Maintenance Project Collection Range.......................................................... 97
Figure 5-3. Distribution of Equivalent SLOC ................................................................. 101
Figure 5-4. Correlation between PM and EKSLOC ....................................................... 103
Figure 5-5. Correlation between log(PM) and log(EKSLOC)........................................ 103
Figure 5-6. Adjusted Productivity for the 86 Releases ................................................... 104
Figure 5-7. Adjusted Productivity Histogram for the 86 Releases ................................. 104
Figure 5-8. Productivity Ranges Calibrated by the Bayesian Approach ........................ 108
Figure 5-9. Productivity Ranges Generated by CMRE .................................................. 113
Figure 5-10. Distribution of TIME ................................................................................. 116
Figure 5-11. Distribution of STOR................................................................................. 116
ABBREVIATIONS
COCOMO       Constructive Cost Model
COCOMO II    Constructive Cost Model version II
CMMI         Capability Maturity Model Integration
EM           Effort Multiplier
PM           Person Month
OLS          Ordinary Least Squares
MSE          Mean Square Error
MAE          Mean Absolute Error
CMSE         Constrained Minimum Sum of Square Errors
CMAE         Constrained Minimum Sum of Absolute Errors
CMRE         Constrained Minimum Sum of Relative Errors
MMRE         Mean of Magnitude of Relative Errors
MRE          Magnitude of Relative Errors
PRED         Prediction level
ICM          Incremental Commitment Model
PR           Productivity Range
SF           Scale Factor
MODEL PARAMETERS
Size Parameters
AA           Assessment and Assimilation
AAF          Adaptation Adjustment Factor
AAM          Adaptation Adjustment Multiplier
AKSLOC       Kilo Source Lines of Code of the Adapted Modules
CM           Code Modified
DM           Design Modified
EKSLOC       Equivalent Kilo Source Lines of Code
ESLOC        Equivalent Source Lines of Code
IM           Integration Modified
KSLOC        Kilo Source Lines of Code
RKSLOC       Kilo Source Lines of Code of the Reused Modules
SLOC         Source Lines of Code
SU           Software Understanding
UNFM         Programmer Unfamiliarity
Cost Drivers
ACAP         Analyst Capability
APEX         Applications Experience
CPLX         Product Complexity
DATA         Database Size
DOCU         Documentation Match to Life-Cycle Needs
FLEX         Development Flexibility
LTEX         Language and Tool Experience
PCAP         Programmer Capability
PCON         Personnel Continuity
PERS         Personnel Capability
PLEX         Platform Experience
PMAT         Equivalent Process Maturity Level
PREC         Precedentedness of Application
PREX         Personnel Experience
PVOL         Platform Volatility
RELY         Required Software Reliability
RESL         Risk Resolution
SITE         Multisite Development
STOR         Main Storage Constraint
TEAM         Team Cohesion
TIME         Execution Time Constraint
TOOL         Use of Software Tools
ABSTRACT
Accurately estimating the cost of software projects is one of the most desired
capabilities in software development organizations. Accurate cost estimates not only help
the customer make successful investments but also assist the software project manager in
coming up with appropriate plans for the project and making reasonable decisions during
the project execution. Although software maintenance has repeatedly been reported to
account for the majority of the total software cost, software estimation research has
focused largely on new development and much less on maintenance.
In this dissertation, an extension to the well-known model for software estimation,
COCOMO II, is introduced for better determining the size of maintained software and
improving the effort estimation accuracy of software maintenance. While COCOMO II
emphasizes the cost estimation of software development, the extension captures various
characteristics of software maintenance through a number of enhancements to the
COCOMO II size and effort estimation models to support the cost estimation of software
maintenance.
Expert input and an industry data set of eighty completed software maintenance
projects from three software organizations were used to build the model. A number of
models were derived through various calibration approaches, and these models were then
evaluated using the industry data set. The full model, which was derived through the
Bayesian analysis, yields effort estimates within 30% of the actuals 51% of the time,
outperforming by 34% the original COCOMO II model when the latter was used to estimate
these projects. Further performance improvement was obtained when calibrating the
full model to each individual program, generating effort estimates within 30% of the
actuals 80% of the time.
Chapter 1. INTRODUCTION
Software maintenance is an important activity in software engineering. Over the
decades, software maintenance costs have been continually reported to account for a
large majority of software costs [Zelkowitz 1979, Boehm 1981, McKee 1984, Boehm
1988, Erlikh 2000]. This fact is not surprising. On the one hand, software environments
and requirements are constantly changing, which leads to new software system upgrades
to keep pace with the changes. On the other hand, the economic benefits of software
reuse have encouraged the software industry to reuse and enhance the existing systems
rather than to build new ones [Boehm 1981, 1999]. Thus, it is crucial for project
managers to estimate and manage the software maintenance costs effectively.
Software cost estimation plays an important role in software engineering practice,
often determining the success or failure of contract negotiation and project execution.
Cost estimation’s deliverables, such as effort, schedule, and staff requirements are
valuable pieces of information for project formation and execution. They are used as key
inputs for project bidding and proposal, budget and staff allocation, project planning,
progress monitoring and control, etc. Unreasonable and unreliable estimates are a major
cause of project failure, as evidenced by a 2007 CompTIA survey of 1,000 IT
respondents, which found that two of the three most-cited causes of IT project failure
concern unrealistic resource estimation [Rosencrance 2007].
Recognizing the importance of software estimation, the software engineering
community has put tremendous effort into developing models in order to help estimators
generate accurate cost estimates for software projects. In the last three decades, many
software estimation models and methods have been proposed and used, such as
COCOMO, SLIM, SEER-SEM, and Price-S.
1.1 THE PROBLEM
COCOMO is the most popular non-proprietary software estimation model in
literature as well as in industry. The model was built using historical data and
assumptions of software (ab initio) development projects. With a few exceptions, the
model’s properties (e.g., forms, cost factors, and constants) are assumed to be applicable
to estimating the cost of software maintenance. However, inherent differences exist
between software development and maintenance. For example, software maintenance
depends on quality and complexity of the existing architecture, design, source code, and
supporting documentation. The problem is that these differences make the model’s
properties less relevant in the software maintenance context, resulting in low estimation
accuracy. Unfortunately, there is a lack of empirical studies that evaluate and
extend COCOMO or other models, in order to better estimate the cost of software
maintenance.
1.2 A SOLUTION
Instead of using the COCOMO model that was designed for new development,
what if we build an extension that takes into account the characteristics of software
maintenance? Thus, my thesis is:
Improved COCOMO models that allow estimators to better determine the
equivalent size of maintained software and estimate maintenance effort can improve the
accuracy of effort estimation of software maintenance.
The goal of this study is to investigate such models. I will test a number of
hypotheses on the accuracy of alternative models that predict the effort of software
maintenance projects, on the explanatory power of such cost drivers as execution time and
memory constraints and their potential effects on quality attributes, and on whether
deleted code is a significant determinant of effort.
also evaluate the differences in the effects of the COCOMO cost drivers on the project's
effort between new development and maintenance projects. Finally, COCOMO models
for software maintenance will be introduced and validated using industry data sets.
1.3 RESEARCH HYPOTHESES
The main research question that this study attempts to address, reflecting the
proposed solution, is as follows:
Are there extended COCOMO models that allow estimators to better determine the
equivalent size of maintained software and estimate maintenance effort that can improve
the accuracy of effort estimation of software maintenance?
To address this question, I implement the modeling process as described in
Section 3.2. The process involves testing a number of hypotheses. The inclusion of
hypothesis tests in the process helps frame the discussion and analysis of the results
obtained during the modeling process. This section summarizes the hypotheses to be
tested in this dissertation.
It is clear from prior work discussed in Chapter 2 that, unlike sizing for new
development, there is no commonly used maintenance size metric. In software
(ab initio) development, a certain level of consistency is achieved; either SLOC or
Function Point is widely used. In software maintenance, on the other hand, some
maintenance effort estimation models use the sum of SLOC added, modified, and
deleted, while others do not include the SLOC deleted. Due to this inconsistency, evidence on
whether the SLOC deleted is a significant predictor of effort is desired. Thus, the
following hypothesis is investigated.
Hypothesis 1: The measure of the SLOC deleted from the modified modules is not
a significant size metric for estimating the effort of software maintenance projects.
This hypothesis was tested through a controlled experiment of student
programmers performing maintenance tasks. The SLOC deleted metric was used as a
predictor of effort in linear regression models, and the significance of the coefficient
estimate of this predictor was tested using the typical 0.05 level of significance.
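For concreteness, a minimal sketch of this kind of significance test is shown below; the data file and column names (added_sloc, modified_sloc, deleted_sloc, effort_hours) are hypothetical and are not the experiment's actual variables.

    # Illustrative only: test whether deleted SLOC carries a significant coefficient
    # when regressing maintenance effort on SLOC added, modified, and deleted.
    import pandas as pd
    import statsmodels.api as sm

    data = pd.read_csv("maintenance_tasks.csv")  # hypothetical data set
    X = sm.add_constant(data[["added_sloc", "modified_sloc", "deleted_sloc"]])
    y = data["effort_hours"]

    fit = sm.OLS(y, X).fit()
    print(fit.summary())
    print("deleted_sloc significant at 0.05:", fit.pvalues["deleted_sloc"] < 0.05)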
Software maintenance is different from new software development in many ways.
The software maintenance team works on a system whose legacy code and
documentation already exist; thus, it is constrained by the existing system’s requirements, architecture,
design, etc. This characteristic leads to differences in the project activities to be performed,
system complexity, necessary personnel skill set, etc. As a result, we expect that the
impact of each cost driver on the effort of software maintenance would be significantly
different in comparison with the COCOMO II.2000 model. We, therefore, investigate the
following hypothesis.
Hypothesis 2: The productivity ranges of the cost drivers in the COCOMO II
model for maintenance are different from those of the cost drivers in the COCOMO
II.2000 model.
The productivity range specifies the maximum impact of a cost driver on effort. In
other words, it indicates the percentage of effort increase or decrease if the rating of a
cost driver increases from the lowest to the highest level. This hypothesis was tested by
performing simple comparisons on the productivity ranges of the cost drivers between the
two models.
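As an illustration, the productivity range of a multiplicative cost driver can be computed as the ratio of its largest to its smallest effort-multiplier value; the rating values below are made up and are not the calibrated COCOMO II numbers.

    # Sketch: productivity range (PR) of a cost driver from its rating values.
    def productivity_range(rating_values):
        return max(rating_values) / min(rating_values)

    # Hypothetical effort-multiplier values for one driver, from its lowest to
    # highest rating level.
    example_ratings = [0.92, 1.00, 1.10, 1.26]
    print(round(productivity_range(example_ratings), 2))  # 1.37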
Finally, the estimation model must be validated and compared with other
estimation approaches regarding its estimation performance. As the model for
maintenance proposed in this study is an extension of the COCOMO II model, we will
compare its performance with that of the COCOMO II model and two other
simple but commonly used approaches: simple linear regression
and the productivity index.
Hypothesis 3: The COCOMO II model for maintenance outperforms the
COCOMO II.2000 model when estimating the effort of software maintenance projects.
Hypothesis 4: The COCOMO II model for maintenance outperforms the simple
linear regression model and the productivity index estimation method.
Hypotheses 3 and 4 were tested by comparing the estimation performance of
various models using the estimation accuracy metrics MMRE and PRED.
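The two accuracy metrics are simple to compute; the sketch below, with made-up actual and estimated person-month values, shows the definitions used throughout this dissertation.

    # MRE, MMRE, and PRED(l) for a set of actual vs. estimated efforts.
    def mre(actual, estimate):
        return abs(actual - estimate) / actual

    def mmre(actuals, estimates):
        return sum(mre(a, e) for a, e in zip(actuals, estimates)) / len(actuals)

    def pred(actuals, estimates, level=0.30):
        hits = sum(1 for a, e in zip(actuals, estimates) if mre(a, e) <= level)
        return hits / len(actuals)

    actual_pm = [10.0, 24.0, 7.5, 50.0]      # illustrative values
    estimate_pm = [12.0, 20.0, 7.0, 80.0]
    print(mmre(actual_pm, estimate_pm), pred(actual_pm, estimate_pm))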
1.4 DEFINITIONS
This section provides definitions of common terms and phrases referred to
throughout this dissertation. Software maintenance or maintenance is used to refer to the
work of modifying, enhancing, and providing cost-effective support to the existing
software. Software maintenance in this definition has a broader meaning than the IEEE
definition given in [IEEE 1999] as it includes minor and major functional enhancements
and error corrections. As opposed to software maintenance, new (ab initio) development
refers to the work of developing and delivering the new software product.
COCOMO II is used to refer to a set of estimation models that were developed
and released in [Boehm 2000b] as a major extension of COCOMO to distinguish itself
from the version originally created and published in [Boehm 1981]. COCOMO II.2000
refers to the COCOMO II model whose constants and cost driver ratings were released in
[Boehm 2000b]. COCOMO II model for maintenance is used to indicate the sizing
method and effort estimation model for software maintenance.
Chapter 2. RELATED WORK
Software cost estimation has attracted tremendous attention from the software
engineering research community. A number of studies have been published to address
cost estimation related problems, such as software sizing, software productivity factors,
cost estimation models for software development and maintenance. This chapter presents
a review of software sizing techniques in Section 2.1, major cost estimation models in
Section 2.2, and maintenance cost estimation models in Section 2.3.
2.1 SOFTWARE SIZING
Size is one of the most important attributes of a software product. It is a key
indicator of software cost and time; it is also a base unit to derive other metrics for
software project measurements, such as productivity and defect density. This section
describes the most popular sizing metrics and techniques that have been proposed and
applied in practice. These techniques can be categorized into code-based sizing metrics
and functional size measurements.
2.1.1 CODE-BASED SIZING METRICS
Code-based sizing metrics measure the size or complexity of software using the
programmed source code. Because a significant amount of effort is devoted to
programming, it is believed that an appropriate measure correctly quantifying the code
can be a good indicator of software cost. Halstead’s software length equation based
on a program’s operands and operators, McCabe’s Cyclomatic Complexity, the number of
modules, and source lines of code (SLOC), among others, have been proposed and used as
code-based sizing metrics. Of these, SLOC is the most popular. It is used as a primary
input by most major cost estimation models, such as SLIM, SEER-SEM, PRICE-S,
COCOMO, and KnowledgePlan (see Section 2.2).
Many different definitions of SLOC exist. SLOC can be the number of physical
lines, the number of physical lines excluding comments and blank lines, or the number of
statements commonly called logical SLOC, etc. To help provide a consistent SLOC
measurement, the SEI published a counting framework that consists of SLOC definitions,
counting rules, and checklists [Park 1992]. Boehm et al. adapted this framework for use
in the COCOMO models [Boehm 2000b]. USC’s Center for Systems and Software
Engineering (USC CSSE) has published further detailed counting rules for most of the
major languages along with the CodeCount tool (http://csse.usc.edu). In COCOMO, the number of
statements or logical SLOC, is the standard SLOC input. Logical SLOC is less sensitive
to formats and programming styles, but it is dependent on the programming languages
used in the source code [Nguyen 2007].
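As a rough illustration of the difference these definitions make, the sketch below counts physical SLOC (non-blank, non-comment-only lines) for a C-like language; real tools such as CodeCount apply far more detailed, language-specific rules, especially for logical SLOC.

    # Rough physical SLOC count: non-blank lines that are not comment-only lines.
    def physical_sloc(source_text):
        count = 0
        in_block_comment = False
        for line in source_text.splitlines():
            stripped = line.strip()
            if in_block_comment:
                if "*/" in stripped:
                    in_block_comment = False
                continue
            if not stripped or stripped.startswith("//"):
                continue
            if stripped.startswith("/*"):
                if "*/" not in stripped:
                    in_block_comment = True
                continue
            count += 1
        return count

    sample = """// driver
    int main() {
        return 0;  /* done */
    }
    """
    print(physical_sloc(sample))  # 3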
For software maintenance, the count of added/new, modified, unmodified,
adapted, reused, and deleted SLOC can be used. Software estimation models usually
aggregate these measures in a certain way to derive a single metric commonly called
effective SLOC or equivalent SLOC [Boehm 2000b, SEER-SEM, Jorgensen 1995, Basili
1996, De Lucia 2005]. Surprisingly, there is a lack of consensus on how to measure
SLOC for maintenance work. For example, COCOMO, SLIM, and SEER-SEM exclude
the deleted SLOC metric while KnowledgePlan and PRICE-S include this metric in their
size measures. Several maintenance effort estimation models proposed in the literature
use the size metric as the sum of SLOC added, modified, and deleted [Jorgensen 1995,
Basili 1996].
SLOC has been widely accepted for several reasons. It has been shown to be
highly correlated with software cost; thus, it is a relevant input for software
estimation models [Boehm 1981, 2000b]. In addition, code-based metrics can be easily
and precisely counted using software tools, eliminating inconsistencies in SLOC counts,
given that the same counting rules are applied. However, the source code is not available
in early project stages, which means that it is difficult to accurately measure SLOC until
the source code is available. Another limitation is that SLOC is dependent on the
programmer’s skills and programming styles. For example, an experienced programmer
may write fewer lines of code than an inexperienced one for the same purpose, resulting
in a problem called productivity paradox [Jones 2008]. A third limitation is a lack of a
consistent standard for measurements of SLOC. As aforementioned, SLOC could mean
different things, physical lines of code, physical lines of code excluding comments and
blanks, or logical SLOC. There is also no consistent definition of logical SLOC [Nguyen
2007]. The lack of consistency in measuring SLOC can cause low estimation accuracy as
described in [Jeffery 2000]. A fourth limitation is that SLOC is technology- and language-dependent.
Thus, it is difficult to compare productivity gains across projects using different
technologies.
2.1.2 FUNCTIONAL SIZE MEASUREMENT (FSM)
To address these limitations, Albrecht developed and published the Function
Point Analysis (FPA) method as an alternative to code-based sizing methods [Albrecht
1979]. Albrecht and Gaffney later extended and published the method in [Albrecht and
Gaffney 1983]. The International Function Point Users Group (IFPUG), a non-profit
organization, was later established to maintain and promote the practice. IFPUG has
extended and published several versions of the FPA Counting Practices Manual to
standardize the application of FPA [IFPUG 2004, 1999]. Other significant extensions to
the FPA method have been introduced and widely applied in practice, such as Mark II
FPA [Symons 1988] and COSMIC-FFP [Abran 1998].
2.1.2.1 FUNCTION POINT ANALYSIS (FPA)
FPA takes into account both static and dynamic aspects of the system. The static
aspect is represented by data stored or accessed by the system, and the dynamic aspect
reflects transactions performed to access and manipulate the data. FPA defines two data
function types (Logical Internal File, External Interface File) and three transaction
function types (External Input, External Output, Query). Function point counts are the
sum of the scores that are assigned to each of the data and the transactional functions.
The score of each of the data and transaction functions is determined by its type and its
complexity. Function point counts are then adjusted by a system complexity factor called the Value
Adjustment Factor (VAF) to obtain adjusted function point counts.
The IFPUG’s Function Point Counting Practices Manual (CPM) provides
guidelines, rules, and examples for counting function points. The manual specifies three
different types of function point counts for the Development project, the Enhancement
project, and the Application count. The Development project type refers to the count of
functions delivered to the user in the first installation of the new software. The
Enhancement project type refers to function point counts for modifications made to the
preexisting software. And the Application function point count measures the functions
provided to the user by a software product and is referred to as the baseline function point
count. It can be considered the actual function point count of the development project
developing and delivering the system. Thus, the application function point count can be
determined by applying the same process given for the development project if the user
requirements can be obtained from the working software.
In FPA, the enhancement project involves the changes that result in functions
added, modified or deleted from the existing system. The procedure and complexity
scales are the same as those of the development, except that it takes into account changes
in complexity of the modified functions and the overall system characteristics. The
process involves identifying added, modified, deleted functions and determining value
adjustment factors of the system before and after changes.
The Effective Function Point count (EFP) of the enhancement project is computed
using the formula
EFP = (ADD + CHGA) * VAFA + DEL * VAFB
(Eq. 2-1)
Where,
• ADD is the unadjusted function point count of added functions.
• CHGA is the unadjusted function point count of functions that are obtained by
modifying the preexisting ones. It is important to note that CHGA counts any
function that is modified, regardless of how much the function is changed.
• DEL is the unadjusted function point count of deleted functions.
• VAFA is the Value Adjustment Factor of the application after the
enhancement project is completed.
• VAFB is the Value Adjustment Factor of the preexisting application.
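A small sketch of Eq. 2-1, with made-up function point counts and VAF values, may make the calculation concrete.

    # Effective Function Point count for an enhancement project (Eq. 2-1).
    def effective_function_points(add, chga, dele, vaf_after, vaf_before):
        return (add + chga) * vaf_after + dele * vaf_before

    # Illustrative counts: 30 FP added, 45 FP changed, 10 FP deleted.
    print(effective_function_points(30, 45, 10, vaf_after=1.05, vaf_before=0.98))
    # (30 + 45) * 1.05 + 10 * 0.98 = 88.55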
2.1.2.2 MARK II FUNCTION POINT ANALYSIS
In 1988, Symons [1988] proposed Mark II Function Point Analysis (MkII FPA)
as an extension to Albrecht’s FPA. MkII FPA was later extended and published by the
United Kingdom Software Metrics Association (UKSMA) in the MkII FPA Counting
Practices Manual [UKSMA 1998]. The MkII FPA method is certified by ISO as an
international standard FSM method [ISO 2002].
MkII FPA measures the functionality of a software system by viewing the
software system as consisting of logical transactions. Each logical transaction is a finest-grained unit of a self-consistent process that is recognizable by the user. The logical
transaction consists of three constituent parts: input, processing, and output components.
The input and output components contain data element types that are sent across the
application boundary and processed by the processing component. The processing
component handles the input and output by referencing data entity types. Conceptually,
MkII FPA’s definitions of data entity types and data element types are similar to those of
ILFs/EIFs and DETs in Albrecht’s FPA.
The size of the input and output components is the count of data element types,
and the size of the processing component is the count of data entity types. The MkII
function point count, which is referred to as MFI, is determined by computing a weighted
sum of the sizes of the input, processing, and output components of all logical
transactions. That is,
MFI = Wi Ni + We Ne + Wo No
(Eq. 2-2)
Where, Wi, We, Wo are the weights of input data element types, data entity types,
and output data element types, respectively; and Ni, Ne, No are the counts of input data
element types, data entity types, and output data element types, respectively. The weights
Wi, We, Wo are calibrated using historical data by relating MFI to the effort required to
develop the respective functions. In the CPM version 1.3.1 [UKSMA 1998], the industry
average weights are Wi = 0.58, We = 1.66, and Wo = 0.26.
For sizing the software maintenance work (referred to as “changes” in CPM
1.3.1), the MkII function point count includes the size of the logical transactions that are
added, changed, or deleted. This involves counting all individual input/output data
element types and data entity types that are added, modified, or deleted. The formula (Eq.
2-2) can be used to calculate the MkII function point count of maintenance work, where
Ni, Ne, No are treated as the counts of input data element types, data entity types, and
output data element types that are added, modified, or deleted.
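The same weighted sum applies to both development and maintenance counting; a minimal sketch with the CPM 1.3.1 industry-average weights and made-up counts follows.

    # MkII function point count (Eq. 2-2) with the CPM 1.3.1 industry-average weights.
    W_I, W_E, W_O = 0.58, 1.66, 0.26

    def mkii_function_points(n_input, n_entity, n_output):
        # For maintenance work, the counts are of input/output data element types
        # and data entity types that are added, modified, or deleted.
        return W_I * n_input + W_E * n_entity + W_O * n_output

    print(mkii_function_points(n_input=12, n_entity=4, n_output=9))
    # 0.58*12 + 1.66*4 + 0.26*9 = 15.94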
2.1.2.3 COSMIC-FFP
Full Function Point, which is often referred to as FFP 1.0, is an FSM method
proposed by St-Pierre et al. [St-Pierre 1997]. FFP was designed as an extension to IFPUG
FPA to better measure the functional size of real-time software. The method was later
extended and renamed to COSMIC-FFP by the Common Software Measurement
International Consortium (COSMIC). Several extensions have been published, and the
latest version is COSMIC-FFP 2.2 [COSMIC 2003]. COSMIC-FFP has been certified
as an ISO international standard [ISO 2003].
COSMIC-FFP views the functionality of a software application as consisting of
functional processes, each having sub-processes. There are two types of sub-processes,
data movement type and data manipulation type. COSMIC-FFP does not handle the data
manipulation type separately, but it assumes some association between the two types. It
defines four types of data movement sub-processes, including Entry, Exit, Read, and
Write. An Entry is a movement of the data attributes contained in one data group from
the outside to the inside of the application boundary; an Exit moves a data group in the
opposite direction to that of the entry; and a Read or a Write refers to a movement of a
data group from or to storage. The COSMIC-FFP measurement method determines the
functional size by measuring the data movement sub-processes, each moving exactly one
data group. That is, the COSMIC-FFP function point count, which is measured in cfsu
(COSMIC Functional Size Unit), is computed by summing all Entries, Exits, Reads, and
Writes. The complexity of the measurement procedure involves identifying application
layers, application boundaries, and data groups that each sub-process handles. Detailed
rules and guidelines for this process are given in the COSMIC-FFP measurement manual
[COSMIC 2003].
In COSMIC-FFP, the functional size of the software maintenance work is the
simple sum of all Entry, Exit, Read, and Write data movement sub-processes whose
functional processes are affected by the change. By this calculation, COSMIC-FFP
assumes that three different types of change to the functionality (add, modify, and delete)
have the same level of complexity (e.g., the size of adding a function is counted the same
as that of modifying or deleting the function). That is, the COSMIC-FFP count for the
maintenance work (Sizecfsu) is computed as
Sizecfsu = Sizeadded + Sizemodified + Sizedeleted
(Eq. 2-3)
Where, Sizeadded, Sizemodified, and Sizedeleted are the COSMIC-FFP counts measured
in cfsu of the added, modified, and deleted data movements, respectively.
The above-discussed functional size measurement methods measure the size of
software maintenance consistently in the sense that added, modified, and deleted
functions are all included in the count, and they are assigned the same weight. Thus, the
number of function point counts assigned for a function is a constant regardless of
whether the function is added, modified, or deleted. This calculation also implies that the
effort required to add a function is expected to be the same as the effort to delete or
modify the same function.
2.2 MAJOR COST ESTIMATION MODELS
Many estimation models have been proposed and applied over the years. Instead
of describing them all, this section provides a brief review of major estimation models
that have been developed, continued to be applied, and marketed by respective
developers. These models include SLIM, SEER-SEM, PRICE-S, KnowledgePlan, and
COCOMO. There are several reasons for this selection. First, they represent the core set
of models that was developed in the early 1980’s and 1990’s. Second, they are still being
investigated and used widely in practice and literature. Their long history of extensions
and adoptions is proof of their robustness and usefulness. Third, these models perform
estimation for a broad range of software development and maintenance activities,
covering a number of phases of software lifecycles such as requirements, architecture,
implementation, testing, and maintenance.
2.2.1 SLIM
SLIM is one of the most popular cost estimation models that has been in the
market for decades. The model was originally developed in the late 1970s by Larry
Putnam of Quantitative Software Measurement (www.qsm.com), and its mathematical formulas and
analysis were published in [Putnam and Myers 1992]. As the model is proprietary, the
subsequent upgrades of the model structures and mathematical formulas are not available
in the public domain.
Generally, the SLIM model assumes that the staffing profile follows a form of
Rayleigh probability distribution of project staff buildup over time. The shapes and sizes
of the Rayleigh curve reflect the project size, manpower buildup index, and other
productivity parameters. The Rayleigh staffing level at time t is presented as
p(t) = \frac{K}{t_d^2} t e^{-t^2 / (2 t_d^2)}
(Eq. 2-4)
Where K is the total lifecycle effort and td the schedule to the peak of the staffing
curve. The quantity D = K/td^2 is considered the staffing complexity of the project. The total
lifecycle effort is calculated using the project size S, the technology factor C, and td, and
is defined as
K = \frac{S^3}{C^3} \times \frac{1}{t_d^4}
(Eq. 2-5)
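To make Eq. 2-4 and Eq. 2-5 concrete, the sketch below evaluates the Rayleigh staffing curve for made-up values of size S, technology constant C, and time-to-peak td; it only illustrates the published formulas, not the proprietary SLIM tool itself.

    import math

    def rayleigh_staffing(t, K, td):
        # Eq. 2-4: staffing level at time t for total life-cycle effort K, peak time td.
        return (K / td**2) * t * math.exp(-t**2 / (2 * td**2))

    def total_lifecycle_effort(S, C, td):
        # Eq. 2-5: K = (S^3 / C^3) * (1 / td^4)
        return (S**3 / C**3) / td**4

    K = total_lifecycle_effort(S=50_000, C=5_000, td=1.5)  # illustrative values
    for t in (0.5, 1.0, 1.5, 2.0, 3.0):
        print(t, round(rayleigh_staffing(t, K, td=1.5), 1))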
Boehm et al. note that some assumptions of staffing profile following the
Rayleigh distribution do not always hold in practice [Boehm 2000a]. For example, some
development practices such as maintenance and incremental development may employ a
constant level of staff. In subsequent adjustments, SLIM handles this limitation by
allowing the staffing profile to be adjusted by a staffing curve parameter. There are
multiple project lifecycle phases defined by the model, such as feasibility study,
functional design, main build (development), and maintenance. Each lifecycle phase may
have a different staffing curve parameter. In the main build phase, for example, the
staffing curve parameter can specify the curve as Medium Front Load (for staff peaking
at 40% of the phase), Medium Rear Load (peaking at 80% of the phase), or Rear Load
(peaking at the end of the phase). In the maintenance phase, this parameter specifies a flat
staffing curve.
SLIM views software maintenance as a software lifecycle phase following the
main build or development phase. The maintenance phase may have major
enhancements, minor enhancements, and baseline support including emergency fixes,
help desk support, infrastructure upgrades, operational support, small research projects,
etc. The maintenance phase can be estimated independently with the other phases.
The model uses the effective SLOC as a unit of project size. Function points and
user-defined metrics such as the number of modules, screens, etc. can be used, but they
have to be converted to effective SLOC using a ‘gear factor’. SLIM counts new code and
modified code, but it excludes deleted code. Clearly, the model assumes that new code
and modified code have the same influence on the maintenance effort.
2.2.2 SEER-SEM
SEER-SEM is a commercial and proprietary model developed and marketed by
Galorath, Inc. (www.galorath.com). This model is an extension of the Jensen model [Jensen 1983] from which
model structures and parameters are extended while sharing the same core formulas. Like
SLIM, the model uses the Rayleigh probability distribution of staffing profile versus time
to determine development effort. The size of the project and other parameters can change
the Rayleigh curve which then gives estimates for effort, time, and peak staff
accordingly.
In SEER-SEM, the traditional Rayleigh staffing level at time t is calculated as
p(t) = \frac{K}{t_d^2} t e^{-t^2 / (2 t_d^2)}
(Eq. 2-6)
Where K is the total life cycle effort and td the time to the peak of the staffing
curve; these terms are calculated as
K = D^{0.4} (S_e / C_{te})^{1.2}
(Eq. 2-7)
t_d = D^{-0.2} (S_e / C_{te})^{0.4}
(Eq. 2-8)
Where D is the staffing complexity, Se the effective size, and Cte the effective
technology. Although sharing the same meaning of D with SLIM, SEER-SEM defines D
as K/t_d^3, which is slightly different from that of SLIM. The derivative of the Rayleigh curve
p(t) at t = 0 is defined as staffing rate, generally measuring the number of people added
to the project per year. The effective technology Cte is an aggregated factor determined by
using a set of technology and environment parameters. The effective size Se is the size of
the project that can be measured in source lines of code (SLOC), function points, or other
units.
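A short sketch of Eq. 2-7 and Eq. 2-8 with made-up inputs is given below; the last printed value recovers D, illustrating the D = K/td^3 relationship noted above.

    # Eq. 2-7 and Eq. 2-8 with illustrative values for D, Se, and Cte.
    def seer_effort(D, Se, Cte):
        return D**0.4 * (Se / Cte)**1.2      # total life-cycle effort K

    def seer_schedule(D, Se, Cte):
        return D**-0.2 * (Se / Cte)**0.4     # time to staffing peak td

    D, Se, Cte = 12.0, 40_000, 4_000         # made-up inputs
    K = seer_effort(D, Se, Cte)
    td = seer_schedule(D, Se, Cte)
    print(round(K, 1), round(td, 2), round(K / td**3, 1))  # last value recovers D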
In SEER-SEM, the software maintenance cost covers necessary maintenance
activities performed to ensure the operation of the software after its delivery. These
software maintenance activities involve correcting faults, changing the software to new
operating environments, fine-tuning and perfecting the software, and minor
enhancements. They are usually triggered by change requests and software faults that are
found during the operation of the software. SEER-SEM uses a number of maintenance-specific parameters to estimate the maintenance cost for a given period of time, in
addition to the development cost of the software. The maintenance cost is allocated into
corrective, adaptive, perfective maintenance, and minor enhancement cost categories.
Major enhancements and re-engineering are not included in the software
maintenance cost, but are treated separately as a new software development project.
SEER-SEM handles major enhancements and new development similarly except that
their differences are reflected in the way that the effective size of the software is
determined.
The effective size, Se, of the software is calculated as
Se = New + [P x (0.4 A + 0.25 B + 0.35 C)]
(Eq. 2-9)
Where,
• New and P are the new size and the pre-existing software size, respectively. The
size can be measured in either function points or lines of code. Discarded code
is not included in calculating the effective size of the pre-existing software (P).
• A, B, and C are the respective percentages of code redesign, code
reimplementation, and code retest required to adapt the pre-existing software.
The element [P x (0.4 A + 0.25 B + 0.35 C)] in formula (Eq. 2-9) is the
equivalent size of rework or reuse of the pre-existing software. The parameters A, B, and
C are subjective, and their values range from 0 to 100%. The SEER-SEM user manual
provides formulas and guidelines to help determine these parameters [Galorath 2002]. It
does not give details, however, on how to reevaluate these parameters when the
completed software is available (collecting the actual size data is important because it is
used to recalibrate the model and to measure the actual productivity, defect density, etc.).
The 100% maximum limit on the parameters A, B, and C might result in underestimation
of the rework because it does not account for possible code expansion or for full retest
and integration of the pre-existing code.
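A minimal sketch of Eq. 2-9, with made-up sizes and with A, B, and C expressed as fractions, is shown below.

    # Effective size (Eq. 2-9): new code plus the equivalent size of reworked
    # pre-existing code.
    def seer_effective_size(new, preexisting, redesign, reimpl, retest):
        return new + preexisting * (0.4 * redesign + 0.25 * reimpl + 0.35 * retest)

    Se = seer_effective_size(new=5_000, preexisting=40_000,
                             redesign=0.20, reimpl=0.10, retest=0.50)
    print(Se)  # 5000 + 40000 * (0.08 + 0.025 + 0.175) = 16200.0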
2.2.3 PRICE-S
PRICE-S was originally developed by Frank Freiman of RCA for estimating
acquisition and development of hardware systems in the late 1960’s and then released as
a first commercial software cost estimation model in 1977. After multiple upgrades,
extensions, and changes of ownership, the current model, True S, is now implemented as
a component in the TruePlanning tool marketed by PRICE Systems (www.pricesystems.com). As True S is built
on the same core methodology as its predecessor, this review refers to the model by its
original name, PRICE-S.
PRICE-S is an activity based estimation model that estimates the effort required
for each activity. In PRICE-S, an activity represents “the work that people, equipment,
technologies, or facilities perform to produce a product or deliver a service” [PRICE-S
2009]. As described in [IPSA 2008], the effort E of each activity is modeled in the form
of
E = S x PB x PA
(Eq. 2-10)
Where,
• S is the software size that can be measured in source lines of code, function
points, predictive object points (POPs) or use case conversion points (UCCPs).
POPs is an object oriented (OO) metric introduced by PRICE Systems to
measure the size of OO projects [Minkiewicz 1998], and UCCPs is a metric
used to quantify the size of use cases.
• PB is the ideal baseline productivity of an industry or an application domain.
• PA is the productivity adjustment factor that accounts for the overall effects of
cost drivers on the productivity of the project.
PRICE-S views the software maintenance phase as an activity that focuses on
fixing latent defects, deployment, and changing the software release to improve
performance, efficiency, or portability. Thus, the maintenance cost is determined by these
activities rather than as a function of how much the software is modified. PRICE-S
assumes that the maintenance cost only includes changes required to improve
performance, efficiency, and portability or to correct defects. Other changes not included
in the maintenance cost (i.e., functional enhancements) are estimated the same way as the
development project. The model uses a number of maintenance-specific parameters to
calculate the cost of software maintenance.
Estimating the costs associated with functional enhancements and adaptations of pre-existing software involves specifying the amount of new, adapted, reused, changed, and
deleted code. Other size measures are also used such as the Percentage of Design
Adapted, Percentage of Code Adapted, and Percentage of Test Adapted, which are
similar to those of SEER-SEM. But unlike SEER-SEM, PRICE-S does not use the
effective size aggregated from different types of code in its model calculations. Rather, it
uses the different size components separately in various calculations to determine the
effort.
2.2.4 KNOWLEDGEPLAN (CHECKPOINT)
KnowledgePlan is a commercial software estimation tool developed and first
released by Software Productivity Research (SPR, www.spr.com) in 1997. KnowledgePlan is an
extension of several previous tools, Checkpoint and SPQR/20, which were originally
based on Capers Jones’ works [Jones 1997]. According to the KnowledgePlan user’s
guide, the model relies on knowledge bases of thousands of completed projects to provide
estimation and scheduling capabilities [KnowledgePlan 2005].
KnowledgePlan estimates effort at project, phase, and task levels of granularity.
In addition to effort, the model’s main outputs include resources, duration, defects,
schedules and dependencies. Estimating and scheduling project tasks are enabled through
the knowledge bases containing hundreds of standard task categories to represent typical
activities in software projects. These standard task categories cover planning,
management, analysis, design, implementation, testing, documentation, installation,
training, and maintenance. Each of these task categories is associated with predefined
inclusion rules and algorithms that are used to suggest relevant tasks and determine the
size of deliverables and productivity for the given size of the project being estimated.
The KnowledgePlan tool supports different project types such as new
development, enhancement, reengineering, reverse engineering, and maintenance and
support of a legacy system. KnowledgePlan uses eight sizing categories called code
types, including New, Reused, Leveraged, Prototype, Base, Changed, Deleted, and
System Base. Differences in project types may be reflected in the distribution of size
estimates of these categories. For example, a new development project typically has a
high proportion of New code while enhancement may have a high value of Base code.
2.2.5 COCOMO
The Constructive Cost Model (COCOMO), a well-known cost and schedule
estimation model, was originally published in the text Software Engineering Economics
[Boehm 1981]. This original model is often referred to as COCOMO 81. The model was
defined based on the analysis of 63 completed projects from different domains during the
1970s and the early 1980s. To address the issues emerging from advancements and
changes in technologies and development processes, the USC Center for Systems and
Software Engineering has developed and published COCOMO II. The model was
initially released in [Boehm 1995] and then published in the definitive book [Boehm
2000b]. Among the main upgrades are the introduction of new functional forms that use
scale factors, new cost drivers, and a new set of parameters’ values.
COCOMO II comprises three sub-models, Applications Composition, Early
Design, and Post-Architecture. The Applications Composition model is used to compute
the effort and schedule to develop the system that is integrated from reusable components
and other reusable assets using integrated development tools for design, construction,
integration, and test. The Applications Composition model has a different estimation
form from the other models. It uses a size input measured in terms of Application Points
or Object Points [Kauffman and Kumar 1993, Banker 1994] and a productivity rate to
calculate effort. The Early Design model is used in the early stages of the project when
the project information is not detailed enough for a fine-grained estimate. When the
detailed information is available (i.e., the high level design is complete, development
environment is determined), the Post-Architecture model is used instead. The Early
Design and Post-Architecture models use source lines of code as the basic size unit and
follow the same arithmetic form.
The general form of the effort formulas of the COCOMO 81, Early Design, and
Post-Architecture models can be written as
PM = A \times Size^{B} \times \prod_{i=1}^{p} EM_i
(Eq. 2-11)
Where,
• PM is the effort estimate in person months.
• A is a multiplicative constant, which can be calibrated using historical data.
• Size is the estimated size of the software, measured in KSLOC.
• B is an exponential constant (COCOMO 81) or a function of scale factors (COCOMO II).
• EM specifies the effort multipliers, which represent the multiplicative component of
the equation.
In COCOMO 81, the B term is an exponential constant, which is usually greater
than 1.0, indicating diseconomies of scale. In COCOMO II, B is defined as a function of
scale factors, in the form B = \beta_0 + \beta_1 \sum_{i=1}^{5} SF_i, where β0 and β1 are constants, and SFi is one
of the five scale factors. The COCOMO 81 model identifies 15 effort multipliers.
COCOMO II uses 7 in the Early Design model and 17 in the Post-Architecture model.
The effort multipliers are the cost drivers that have multiplicative impact on the effort of
the project while the scale factors have exponential impact. The Early Design and Post-Architecture models have the same set of scale factors, while the cost drivers in the Early
Design model were derived from those of the Post-Architecture model by combining
drivers that were found to be highly correlated. The COCOMO II Post-Architecture
model’s rating values of the model cost drivers were calibrated using the Bayesian
technique on a database of 161 project data points [Chulani 1999a].
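To make the arithmetic of (Eq. 2-11) concrete, the short R sketch below evaluates the effort formula for a hypothetical 40 KSLOC project. The constants A = 2.94, β0 = 0.91, and β1 = 0.01 are the commonly cited COCOMO II.2000 values, used here only for illustration; the scale-factor and effort-multiplier ratings are invented and are not calibrated values from this dissertation.

```r
# Hypothetical evaluation of Eq. 2-11 with COCOMO II-style scale factors.
A    <- 2.94                               # multiplicative constant (illustrative)
size <- 40                                 # estimated size in KSLOC
SF   <- c(3.72, 3.04, 4.24, 3.29, 4.68)    # five scale-factor ratings (invented)
B    <- 0.91 + 0.01 * sum(SF)              # exponent as a function of the scale factors
EM   <- c(1.10, 0.87, 1.00)                # a few effort-multiplier ratings (invented)
PM   <- A * size^B * prod(EM)              # person-months per Eq. 2-11
round(PM, 1)
```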
Table 2-1. COCOMO Sizing Models
Development Type | Source | Code Types | Sizing Model
New System | New | New/Added | New SLOC
New System | Pre-existing | Adapted, Reused | Reuse
Major enhancements | Pre-existing | Adapted, Reused | Reuse
Maintenance (repairs and minor updates) | Pre-existing | Added, Modified | Maintenance
COCOMO II provides different models to determine the source code size of the
project, dependent on the origin of the code (new, pre-existing) and types of change
(addition, modification, reuse, automatic translation). Table 2-1 shows the sizing models
for various development types. In COCOMO, software maintenance involves repairs and
minor updates that do not change its primary functions [Boehm 1981]. Major
enhancements, which include changes that are not considered as maintenance, are
estimated similarly to software new development except that the size is determined
differently. The model uses the same ratings of the cost drivers for both maintenance and
new software development with a few exceptions. The Required Development Schedule
(SCED) and Required Reusability (RUSE) cost drivers are not used in the estimation of
effort for maintenance, and the Required Software Reliability (RELY) cost driver has a
different impact scale (see Section 0).
2.2.5.1 MODEL CALIBRATIONS AND EXTENSIONS
As COCOMO is a non-proprietary model, its details are available in the public
domain, encouraging researchers and practitioners in the software engineering
community to independently evaluate the model. There have been many extensions
independently reported, e.g., [Kemerer 1987, Jeffery and Low 1990, Gulezian 1991,
Menzies 2006]. Menzies et al. use machine learning techniques to generate effort models
from the original COCOMO model [Menzies 2006]. Gulezian proposed a calibration
method by transforming the model equation into a linear form and estimating the model
parameters using standard linear regression techniques [Gulezian 1991]. This calibration
method has been adopted by the COCOMO development team in their calibration work,
e.g., [Clark 1998, Chulani 1999b, Yang and Clark 2003, Chen 2006, Nguyen 2008].
COCOMO has also been used to validate new estimation approaches such as
fuzzy logic and neural networks (e.g., [Idri 2000, Huang 2005, Reddy and Raju 2009]).
The COCOMO development team continues to calibrate and extend the model
using different calibration approaches on augmented data sets [Boehm and Royce
1989, Clark 1998, Chulani 1999b, Yang and Clark 2003, Chen 2006, Nguyen 2008].
Table 2-2 shows the best results obtained by main studies performed by the team.
PRED(0.30) is the percentage of estimates that fall within 30% of the actuals. For
example, PRED(0.30) = 52% indicates that the model in Clark [1998] produces estimates
that are within 30% of the actuals, 52% of the time. Two types of results are often
reported, fitting and cross-validation. The fitting approach uses the same data set for both
training and testing the model while the cross-validation approach uses different data
sets: one for training and the other for testing. The accuracies reported from cross-validation better indicate the performance of the model because the main reason to build
a model is to use it in estimating future projects.
Table 2-2. COCOMO II Calibrations

Study | Fitting PRED(0.30) | Cross-validation PRED(0.30)
COCOMOII.1997 [Clark 1998] | 52% | -
COCOMOII.2000 [Chulani 1999b] | 75% | 69%
COCOMOII.2003 [Yang and Clark 2003] | 56% | -
Chen [2006] | - | 76%
Nguyen [2008] | 80% | 75%
In COCOMOII.1997, the parameters are weighted averages of data-driven
regression results and expert-judgment rating scales, in which the latter scales account for
90% of the weight. The COCOMOII.2000 calibration was based on the Bayesian analysis
using a data set of 161 data points, and the COCOMOII.2003 calibration used the same
Bayesian analysis, but included 43 additional data points. On the same data set as
COCOMOII.2000, Chen et al. [2006] used a subset selection technique to prune model
parameters, and Nguyen et al. [2008] applied constraints on regression models; they both
attempted to reduce variances in the model. As shown in Table 2-2, their results indicate
noticeable improvements in estimation accuracies, suggesting that using appropriate
machine learning techniques can potentially improve the model performance.
2.3 MAINTENANCE COST ESTIMATION MODELS
Although the area of software maintenance estimation has received less attention
than that of new development, given the importance of software maintenance,
a number of models have been introduced and applied to estimating the maintenance
costs (see Table 2-3). These models address diverse sets of software maintenance work,
covering, for instance, error corrections, functional enhancements, technical renovations,
and reengineering. They can be roughly classified into three types based on the
granularity level of the estimation focus: phase-, release-, and task-level maintenance
estimation models.
2.3.1 PHASE-LEVEL MODELS
A set of the maintenance models focuses on the effort of routine maintenance
work for a certain period or the whole phase of the software maintenance. The routine
maintenance work refers to all activities performed during the operation of a software
system after it is delivered. It involves fault corrections, minor functional changes and
enhancements, and technical improvements of which the main purpose is to ensure the
regular operation of the system. The maintenance models integrated in COCOMO,
SEER-SEM, PRICE-S, SLIM, and KnowledgePlan are of this kind. In these models,
maintenance costs are usually a part of the estimates produced when estimating the cost
of a new system to be developed. Thus, the size of the system is a key input to estimate
the maintenance effort. Most of these models use additional cost drivers that are specific
to software maintenance. For example, COCOMO uses two drivers, namely software
understanding (SU) and the level of unfamiliarity of the programmer (UNFM) for its
sizing of the maintenance work; SEER-SEM uses such parameters as maintenance size
growth over time and maintenance rigor (the thoroughness of maintenance activities to
be performed) in its maintenance cost calculations [Galorath 2002].
Estimating the maintenance cost for a system of which the development cost is
being estimated is important for architecture trade-off analysis and for making
investment decisions on the system being evaluated. However, because many
assumptions are made about the system that has yet to be developed, it is difficult to
estimate the system maintenance cost accurately. This difficulty could be a reason that, to
the best of my knowledge, there is no empirical study published to evaluate and compare
the estimation accuracies of these models. Another possible reason is that these models,
with the exception of COCOMO, are proprietary and their details have not been fully
published, making them difficult to be investigated in the research context.
To estimate the adaptation and reuse work, COCOMO, SEER-SEM, PRICE-S,
SLIM, and KnowledgePlan provide methods to size the work and compute the effort and
schedule estimates using the same models developed for estimating new software
development. These models thus assume that the adaptation and reuse work has the
same characteristics as new software development. Unfortunately, this assumption has
never been validated empirically.
2.3.2 RELEASE-LEVEL MODELS
Instead of estimating the cost of the maintenance phase as a whole, another group
of models focuses on the maintenance cost at a finer-grained level, estimating the effort
of a planned set of maintenance tasks or a planned release. This approach usually
involves using data from the past releases and analyzing the changes to estimate the cost
for the next release.
Basili et al., together with characterizing the effort distribution of maintenance
releases, described a simple regression model to estimate the effort for maintenance
releases of different types such as error correction and enhancement [Basili 1996]. The
model uses a single variable, SLOC, which was measured as the sum of added, modified
and deleted SLOC including comments and blanks. The prediction accuracy was not
reported although the coefficient of determination was relatively high (R2 = 0.75),
indicating that SLOC is a good predictor of the maintenance effort.
Considering the maintenance work after the initial delivery of the system as being
organized into sequences of operational releases, Ramil and Lehman introduced and
evaluated linear regression models to estimate the effort required to evolve the system
from a release to the next [Ramil 2000, 2003]. Their models take into account all
maintenance tasks necessary to grow the system, which can include error corrections,
functional enhancements, technical improvements, etc. The model predictors are size
metrics measured at coarse granularity levels, modules (number of added, changed
modules) and subsystems (number of added, changed), and number of changes to
modules plus all changed modules. The models were calibrated and validated on the data
sets collected from two case studies. In terms of MMRE and MdMRE, the best model
achieved MMRE = 19.3%, MdMRE=14.0% and PRED(0.25)=44%. This best model
seems to be based on coarse-grained metrics (subsystems), which is consistent with a
prior finding by Lindvall [1998], in which coarse-grained metrics, e.g., the number of
classes, were shown to estimate change effort more accurately than other finer-grained
metrics. However, it is important to note that this best model did not generate the best
(highest) PRED(0.25), indicating that the model evaluation and ensuing inferences are
likely contingent upon the estimation accuracy indicators used [Myrtweit 2005].
Caivano et al. described a method and a supporting tool for dynamic calibration
of the effort estimation model of renewal (reverse engineering and restoration) projects
[Caivano 2001]. The model accepts the change information gathered during the project
execution and calibrates itself to better reflect dynamics, current and future trends of the
project. At the beginning of the project, the model starts with its most common form
calibrated from completed projects. During the project execution, the model may change
its predictors and constants using the stepwise regression technique. The method uses
fine-grained metrics obtained from the source code such as number of lines of source
code, McCabe’s cyclomatic complexity, Halstead’s complexity, number of modules
obtained after the renewal process. They validated the method using both data from a
legacy system and simulation. They found that fine-grained estimation and model
dynamic recalibration are effective for improving the model accuracy and confirmed that
the estimation model is process-dependent. A later empirical study further verifies these
conclusions [Baldassarre 2003]. Other studies have also reported some success in
improving the prediction accuracies by recalibrating estimation models in the iterative
development environment [Trendowicz 2006, Abrahamsson 2007].
Sneed proposed a model called ManCost for estimating software maintenance
cost by applying different sub-models for different types of maintenance tasks [Sneed
2004]. Sneed grouped maintenance tasks into four different types, hence four sub-models:
- Error correction: costs of error correction for a release.
- Routine change: maintenance costs of implementing routine change requests.
- Functional enhancement: costs of adding, modifying, deleting, and improving software functionality. The model treats this type the same as new development, in which the size can be measured in SLOC or Function Points, and the effort is estimated using adjusted size, complexity, quality, and other influence factors.
- Technical renovation: costs of technical improvements, such as performance optimization.
Sneed suggests that these task types have different characteristics; thus, each
requires an appropriate estimation sub-model. Nonetheless, the adjusted size and
productivity index are common measures used in these sub-models. The adjusted size
was determined by taking into account the effects of complexity and quality factors.
Although examples were given to explain the use of the sub-models to estimate the
maintenance effort, there was no validation reported to evaluate the estimation
performance of these sub-models. Sneed also proposed several extensions to account for
reengineering tasks [Sneed 2005] and maintenance of web applications [Sneed and
Huang 2007].
The wide acceptance of the FSM methods has attracted a number of studies to
develop and improve maintenance estimation models using the FSM metrics. Most of
these studies focused on the cost of adaptive maintenance tasks that enhance the system
by adding, modifying, and deleting existing functions. Having found that the FPA did not
reflect the size of small changes well, Abran and Maya presented an extension to the FPA
method by dividing the function complexity levels into finer intervals [Abran and Maya
1995]. This extension uses smaller size increments and respective weights to discriminate
small changes that were found to be common in the maintenance environment. They
validated the model on the data obtained from a financial institution and demonstrated
that a finer grained sizing technique better characterizes the size characteristics of small
maintenance activities.
Niessink and van Vliet described a Maintenance Function Point (MFP) model to
predict the effort required to implement non-corrective change requests [Niessink and
van Vliet 1998]. The model uses the same FPA procedure for enhancement to determine
the FP count, but the Unadjusted FP count was adjusted by a multiplicative factor,
namely Maintenance Impact Ratio (MIR), to account for the relative impact of a change.
The approaches were validated using the data set collected from a large financial
information system, the best model producing relatively low prediction accuracies
MMRE = 47% and PRED(0.25) = 28%. The result also shows that the size of the
component to be changed has a higher impact on the effort than the size of the change.
This result indicates that the maintainers might have spent time to investigate not only the
functions affected by the change but also other functions related to the change.
Abran et al. reported on the application of the COSMIC-FFP functional size
measurement method to building effort estimation models for adaptive maintenance
projects [Abran 2002]. They described the use of the functional size measurement
method in two field studies, one with the models built on 15 projects implementing
functional enhancements to a software program for linguistic applications, the other with
19 maintenance projects of a real-time embedded software program. The two field studies
did not use the same set of metrics, but they share three metrics: effort, Cfsu, and the
level of difficulty of the project. The authors showed that project effort and functional
size have a positive correlation and that this correlation is strong enough to build good
effort estimation models that use a single size measure. However, as they demonstrated, a
more reliable estimation model can be derived by taking into account the contribution of
other categorical factors, such as project difficulty.
2.3.3 TASK-LEVEL MODELS
The task-level model estimates the cost of implementing each maintenance task
which comes in the form of error reports or change requests. This type of model deals
with small effort estimates, usually ranging from a few hours to a month.
Sneed introduced a seven-step process and a tool called SoftCalc to estimate the
size and costs required to implement maintenance tasks [Sneed 1995]. The model uses
various size measures, including SLOC (physical lines of code and statements), function
points, object-points, and data points (the last two were originally proposed in the same
paper). The size of the impact domain, the proportion of the affected software, was
determined and then adjusted by complexity, quality, and project influence factors. The
maintenance effort was computed using the adjusted size and a productivity index.
Rather than generating an exact effort estimate, it can be beneficial to classify
maintenance change requests by level of difficulty or level of effort required, and to use
this classification to plan resources accordingly. Briand and Basili
[1992] proposed a modeling approach to building classification models for the
maintenance effort of change requests. The modeling procedure involves four high-level
steps: identifying predictable metrics, identifying significant predictable metrics,
generating a classification function, and validating the model. The range of each
generating a classification function, and validating the model. The range of each
predictable variable and effort was divided into intervals, each being represented by a
number called difficulty index. The effort range has five intervals (below one hour,
between one hour and one day, between one day and one week, between one week and
one month, above one month) which were indexed, from 1 to 5 respectively. To evaluate
the approach, Briand and Basili used a data set of 163 change requests from four different
projects at the NASA Goddard Space Flight Center. The approach produced the
classification models achieving from 74% to 93% classification correctness.
Briand and Basili’s modeling approach can be implemented to dynamically
construct models according to specific environments. Organizations can build a general
model that initially uses a set of predefined metrics, such as types of modification; the
number of components added, modified, deleted; the number of lines of code added,
modified, deleted. This general model is applied at the start of the maintenance phase
and is then refined when sufficient data becomes available. However, as Basili et al. pointed out,
it is difficult to determine the model’s inputs correctly as they are not available until the
change is implemented [Basili 1997].
Basili et al. presented a classification model that classifies the cost of rework in a
library of reusable software components, i.e. Ada files [Basili 1997]. The model, which
was constructed using the C4.5 mining algorithm [Quinlan 1993], determines which
component versions were associated with errors that require a high correction cost (more
than 5 hours) or a low correction cost (no more than 5 hours). Three internal product
metrics, the number of function calls, the number of declaration statements, the number
of exceptions, were shown to be relevant predictors of the model. As these metrics can be
collected from the component version to be corrected, the model can be a useful
estimation tool.
Jorgensen evaluated eleven different models to estimate the effort of individual
maintenance tasks using regression, neural networks, and pattern recognition approaches
[Jorgensen 1995]. In the last approach, the Optimized Set Reduction (OSR) was used to
select the most relevant subset of variables for the predictors of effort [Briand 1992b]. All
of the models use the maintenance task size, which is measured as the sum of added,
updated, and deleted SLOC, as a main variable. Four other predictors were selected as
they were significantly correlated with the maintenance productivity. These are all
indicator predictors: Cause (whether or not the task is corrective maintenance), Change
(whether or not more than 50% of effort is expected to be spent on modifying the
existing code compared to inserting and deleting the code), Mode (whether or not more
than 50% of effort is expected to be spent on development of new modules), Confidence
(whether or not the maintainer has high confidence in resolving the maintenance task).
Other variables, type of language, maintainer experience, task priority,
application age, and application size, were shown to have no significant correlation with
the maintenance productivity. As Jorgensen indicated, this result did not demonstrate that
each of these variables has no influence on maintenance effort. There were possible joint
effects of both investigated and non-investigated variables. For example, experienced
maintainers wrote more compact code while being assigned more difficult tasks than
inexperienced maintainers, and maintainers might be more experienced in large
applications than in the small ones. To validate the models, Jorgensen used a data set of
109 randomly selected maintenance tasks collected from different applications in the
same organization. Of the eleven models built and compared, the best-performing models
were the log-linear regression model and a hybrid model based on pattern recognition and
regression. The best model generated effort estimates with MMRE = 100% and
PRED(0.25) = 26% using a cross-validation approach [Bradley and Gong 1983]. This
performance is unsatisfactorily low. Considering low prediction accuracies achieved,
Jorgensen recommended that a formal model be used as supplementary to expert
predictions and suggested that the Bayesian analysis be an appropriate approach to
combining the estimates of the investigated models and expert predictions.
Fioravanti et al. proposed and evaluated a model and metrics for estimating the
adaptive maintenance effort of object-oriented systems [Fioravanti 1999, Fioravanti and
Nesi 2001]. Using the linear regression analysis, they derived the model and metrics
based on classical metrics previously proposed for effort estimation models. They
evaluated the model and metrics using the data collected from a real project, showing that
the complexity and the number of interfaces have the highest impact on the adaptive
maintenance effort.
Several previous studies have proposed and evaluated models for exclusively
estimating the effort required to implement corrective maintenance tasks. De Lucia et al.
used multiple linear regression to build effort estimation models for corrective
maintenance [De Lucia 2005]. Three models were built using coarse-grained metrics,
namely the number of tasks requiring source code modification, the number of tasks
requiring fixing of data misalignment, the number of other tasks, the total number of
tasks, and SLOC of the system to be maintained. They evaluated the models on 144
observations, each corresponding to a one-month period, collected from five corrective
maintenance projects in the same software services company. The best model, which
includes all metrics, achieved MMRE = 32.25% and PRED(0.25) = 49.31% using leave-more-out cross-validation. In comparison with the non-linear model previously used by
the company, the authors showed that the linear model with the same variables produces
higher estimation accuracies. They also showed that taking into account the difference in
the types of corrective maintenance tasks can improve the performance of the estimation
model.
Table 2-3. Maintenance Cost Estimation Models

Model/Study | Maintenance Task Type | Effort Estimated For | Modeling Approach | Key Input Metrics | Best Estimation Accuracies Reported6
COCOMO | Regular maintenance; Major enhancement (adaptation and reuse) | Maintenance period; Adaptation and reuse project | Linear and nonlinear regression; Bayesian analysis | SLOC added and modified, SU, UNFM, DM, CM, IM | -
KnowledgePlan | Regular maintenance; Major enhancement | Maintenance period; Adaptation and reuse project | Arithmetic | SLOC or IFPUG’s FP of code added, reused, leveraged, prototype, base, changed, deleted, system base | -
PRICE-S | Regular maintenance; Major enhancement | Maintenance period; Adaptation and reuse project | Arithmetic | SLOC, IFPUG’s FP, POP, or UCCP of new, adapted, deleted, reused code | -
SEER-SEM | Corrective; Adaptive; Perfective | Maintenance period; Adaptation and reuse project | Arithmetic | Maintenance rigor, Annual change rate, Years of maintenance, Effective size (measured in SLOC or IFPUG’s FP) | -
SLIM | Corrective; Adaptive; Perfective | Maintenance period | Heuristic | SLOC added and modified | -
Basili 1996 | Error correction and enhancement | Release | Simple linear regression | SLOC (sum of SLOC added, modified, deleted) | -
Basili 1997 | Corrective | Component version | Classification (C4.5 algorithm) | Number of function calls, declaration statements, exceptions | 76% classification correctness
Niessink and van Vliet 1998 | Adaptive (functional enhancement) | Individual change request | Linear regression | IFPUG’s FPA, Maintenance Impact Ratio (MIR) | MMRE=47%, PRED(0.25)=28%
Abran 1995 | Adaptive (functional enhancement) | A set of activities resulting in a maintenance work product (Release) | Arithmetic | Extended IFPUG’s FPA | -
Abran 2002 | Adaptive (functional enhancement) | Release (project) | Linear and nonlinear regression | COSMIC-FFP | MMRE=0.25, PRED(0.25)=53%

6 The selection of the best model is dependent on the performance indicators used [Myrtweit 2005]. In this table, the most common performance indicators MMRE and PRED(0.25) are reported, and MMRE is used to indicate the best model, i.e., the model that generates the lowest MMRE values.
Table 2-3. Continued

Model/Study | Maintenance Task Type | Effort Estimated For | Modeling Approach | Key Input Metrics | Best Estimation Accuracies Reported
Caivano 2001 | Adaptive (renewal, e.g., reverse engineering and restoration) | Iteration | Linear regression; Dynamic calibration | SLOC, McCabe’s complexity, Halstead’s complexity, Number of modules | -
Sneed 2004 | Various: Error correction; Routine change; Functional enhancement; Technical renovation | Release | Productivity index; COCOMO II | SLOC, IFPUG’s FP, Object-points [Sneed 1995], Number of errors, Test coverage, Complexity index, Quality index, Productivity index | -
Ramil 2000, 2003 | All task types for evolving the software | Release | Linear regression | Number of systems, modules | MMRE=19.3%, PRED(0.25)=44% (Cross-validation)
Sneed 1995 | Corrective; Adaptive; Perfective; Preventive | Individual maintenance task | Productivity index (multiplied by adjusted size) | SLOC, IFPUG’s FP, Object-points, Data-points, Complexity index, Quality index, Project influence factor | -
Briand and Basili 1992a | Corrective; Adaptive (and enhancive) | Individual change request | Classification | Modification type, Source of error, SLOC added, modified, deleted, Components added, modified, deleted, Objects (code or design or both) | 74%-93% classification correctness
Jorgensen 1995 | Corrective; Adaptive; Perfective; Preventive | Individual maintenance task | Linear regression; Neural networks; Pattern recognition | SLOC (sum of SLOC added, modified, deleted) | MMRE=100%, PRED(0.25)=26% (Cross-validation)
De Lucia 2005 | Corrective | Monthly period effort | Linear regression | Number of tasks, SLOC | MMRE=32.25%, PRED(0.25)=49.32% (Cross-validation)
2.3.4 SUMMARY OF MAINTENANCE ESTIMATION MODELS
It is clear from these maintenance estimation models that estimating the cost
required to develop and deliver a release has received the most attention. This focus on
release-level estimation may reflect the widespread practice in software maintenance of
organizing the maintenance work into a sequence of operational releases [Rajlich
2001]. Each release includes enhancements and changes that can be measured and used
as a size input for the effort estimation model. Source-code-based metrics are the
dominant size inputs, reflecting the fact that they are also the main size metrics for effort
estimation models in new development. Added and modified code are used in most of the
models, while only some of the models use deleted code.
Although these studies report a certain degree of success in applying effort
estimation models to the maintenance context, they exhibit several limitations. First,
with a few exceptions, e.g., [Briand and Basili 1992a; Caivano 2001], all of the models
are context-specific. They were built to address the estimation needs of a single
organization or a few organizations performing maintenance work with specific
processes, technologies, and people. This practical approach limits the statistical
inferences that can be made to other maintenance contexts different from where the
studies were carried out. Second, even for the models that are generic enough, there is a
lack of empirical studies documenting and validating the application
of the proposed models across maintenance contexts in multiple organizations with
diverse processes, technologies, and people. Building generic models that can be
recalibrated to specific organizations or even projects would be much needed in
maintenance estimation [Caivano 2001].
Chapter 3. THE RESEARCH APPROACH
This chapter presents my research approach and framework to derive the extended
COCOMO II size and effort models for software maintenance. Section 3.1 details the
8-step modeling process, which is based on the generic COCOMO II modeling methodology
discussed in [Boehm 2000b]. The techniques used to calibrate the proposed COCOMO II
effort model for software maintenance are discussed in Section 3.2. Section 3.3
presents the strategies used to validate and compare the performance of estimation
models.
3.1 THE MODELING METHODOLOGY
Figure 3-1 represents the process to be followed to derive the extended COCOMO
model for software maintenance. This process is based on the modeling methodology
proposed and applied to building several models in the COCOMO model suite, such as
COCOMO II, COQUALMO, CORADMO, and COCOTS [Boehm 2000b]. As the
COCOMO model for software maintenance addressed in this study is an extension of the
COCOMO II model, Steps 1, 3A, and 4 are performed with a consideration that the
model would share most, if not all, of the cost drivers used in the COCOMO II model.
Additionally, any step can be revisited if updated information or data would change the
final result.
Step 1: Analyze existing literature
This step involves reviewing the software cost modeling literature for identifying
important sizing metrics, potentially significant parameters, parameter definition issues,
and functional forms.
To maintain the consistencies among the COCOMO II models, we began this step
by taking into account all of the 22 cost drivers used in the COCOMO II model,
assuming that the cost of software maintenance is also influenced by the same set of
parameters. All sizing parameters used in the COCOMO II reuse and maintenance
models were also considered with a few changes in their definitions, and new metrics
(e.g., deleted code) were identified (see Section 4.1.2).
Step 2: Perform maintenance experiment to validate some size measures
To validate the impact of the size metrics (e.g., added, modified, and deleted
SLOC) on the maintenance cost, we performed a controlled experiment of student
programmers performing maintenance tasks. The results suggest that these size metrics
are significant parameters for sizing the software maintenance work. The results also
show that software understanding is as much as 50% of the software maintenance effort,
providing evidence for the proposed sizing method for software maintenance (see Section
4.1.2).
Figure 3-1. The Modeling Process
(Flowchart of the eight modeling steps: Step 1 Analyze existing literature; Step 2 Perform maintenance experiment to validate some size measures; Step 3A Perform behavioral analysis and identify relative significance of factors; Step 3B Determine sizing method for maintenance; Step 4 Determine form of effort model; Step 5A Perform expert-judgment and Delphi assessment; Step 5B Gather project data; Step 6 Test hypotheses about impact of parameters; Step 7 Calibrate model parameters and determine parameter variability; Step 8 Evaluate model performance.)
Step 3A: Perform Behavioral Analysis and Identify relative significance of
factors
In this step, a behavioral analysis is performed to understand the relative impact
of the potential parameters, which have been identified in Step 1, on project effort and
schedule (project duration). This step can result in parameters combined or dropped from
model forms to be determined in the next step.
Again, for the maintenance model, we performed this step by using the cost driver
rating values of the COCOMO II.2000 model and assuming that they would have a
similar relative impact on the cost of software maintenance.
Step 3B: Determine sizing method for maintenance
Once the potential size metrics and their relative impacts are identified, a method
for aggregating them into an equivalent size measure has to be identified. The insights
learned during the literature review of the existing estimation models for software
maintenance would help to determine a relevant sizing method for maintenance. Details
of this step are discussed in Section 4.1.2.
Step 4: Determine form of effort model
To ensure consistencies with the COCOMO II model, the COCOMO model for
maintenance uses the same effort form as the COCOMO II model.
Step 5A: Perform Expert-judgment and Delphi assessment
This step involves performing the Delphi exercise by surveying experts in the
field of software estimation to obtain their opinions about quantitative relationships
between each of the parameters and the maintenance effort. We apply a Wideband Delphi
process [Boehm 1981] to enable consensus among the experts. The experts are asked to
independently provide their numeric estimates for the impact of each parameter on
maintenance effort. Multiple rounds of surveys can be carried out during the process until
a convergence is achieved. Variances can be reduced by sharing the results from the
previous rounds with participants. The numeric rating values of the COCOMO II.2000
model are provided in the Delphi survey as initial results, providing the participants
information for relative comparison between the values for new software development
and software maintenance.
Step 5B: Gather project data
Industry project data is collected for the model validation, hypothesis test, and
calibration purposes. The data, including, at least, the actual effort, size metrics, and
symbolic rating level (e.g., Low, Nominal, High, etc.) of each cost driver are obtained
from completed maintenance projects. This step can be carried out without dependence
on the result of Step 5A because the cost drivers are based on symbolic rating levels
rather than numeric values.
Step 6: Test hypotheses about impact of parameters
The expert-determined numeric rating values (a priori) for the cost drivers
obtained from Step 5A and the sampling project data from Step 5B are used to test the
research hypotheses about the impact of the cost drivers on the project effort and
schedule. The data collinearity and dispersion should also be analyzed during this step.
This may suggest dropping or combining cost drivers in the model, resulting in multiple
model variations (e.g., different combinations of the parameters) that will be followed in
Step 7.
Step 7: Calibrate model parameters and Determine parameter variability
This step involves applying modeling techniques to calibrating the model
parameters using the expert-determined numeric rating values (a priori), the sampling
project data, and the result obtained from the analysis in Step 6. Multiple modeling
techniques are applied to different model variations obtained in Step 6. As a result, this
step may lead to multiple sets of calibrated parameters (i.e., multiple model variations).
The variability of each parameter is determined during this step.
In this research, we use multiple calibration techniques to calibrate the model
parameters. These techniques include the ordinary linear regression, the Bayesian
analysis, and our proposed constrained regression techniques. Further details on these
techniques are discussed in Section 3.2.
Step 8: Evaluate model performance
The purpose of this step is to evaluate and suggest the best models along with
their cost driver rating values and constants for estimating the cost of software
maintenance. This step involves using cross-validation approaches to test the models and
determine the model performance metrics. The best models are then chosen using the
model performance metrics obtained.
3.2 THE CALIBRATION TECHNIQUES
This section describes three calibration techniques, namely ordinary least squares
regression, Bayesian analysis, and constrained regression technique, which are applied to
calibrating the cost drivers of the model.
3.2.1 ORDINARY LEAST SQUARES REGRESSION
Ordinary least squares (OLS) regression is the most popular technique used to
build software cost estimation models. In COCOMO, the OLS is used for many purposes,
such as analyzing the correlation between cost drivers and the effort and generating
coefficients and their variances during the Bayesian analysis.
Suppose that a dataset has N observations, where the response is effort and p
predictors are, for example, size and cost drivers. Let xi = (xi1, xi2,…, xip), i = 1, 2, …, N,
be the vector of p predictors, and yi be the response for the ith observation. The model for
multiple linear regression can be expressed as
yi = β0 + β1xi1 + … + βpxip + εi    (Eq. 3-1)

where β0, β1, …, βp are the regression coefficients, and εi is the error term for the
ith observation. The corresponding prediction equation of (Eq. 3-1) is

ŷi = β̂0 + β̂1xi1 + … + β̂pxip    (Eq. 3-2)

where β̂0, β̂1, …, β̂p are the estimates of the coefficients, and ŷi is the estimate of the
response for the ith observation.
The ordinary least squares (OLS) estimates for the regression coefficients are
obtained by minimizing the sum of square errors. Thus, the response estimated from the
regression line minimizes the sum of squared distances between the regression line and
the observed response.
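As a concrete illustration of fitting (Eq. 3-1) by OLS, the R sketch below regresses log-transformed effort on log-transformed size and one cost driver. The data frame, its column names, and the numeric values are hypothetical; a real calibration would use the cost-driver ratings and project data gathered in Steps 5A and 5B.

```r
# Minimal OLS sketch of a COCOMO-style log-linear model (Eq. 3-1/3-2).
# All values are invented for illustration.
proj <- data.frame(
  log_PM    = log(c(12, 34, 7, 58, 21, 15, 44, 9)),            # actual person-months
  log_KSLOC = log(c(4.1, 11, 2.3, 18.5, 7.2, 5.0, 14.8, 3.1)),
  log_PCAP  = log(c(1.00, 0.87, 1.15, 1.00, 0.87, 1.15, 1.00, 0.87))
)
fit  <- lm(log_PM ~ log_KSLOC + log_PCAP, data = proj)  # OLS estimates of Eq. 3-2
summary(fit)                                            # coefficients and their variances
yhat <- predict(fit)                                    # fitted log-effort
```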
Although regression is a standard method for estimating software cost models, it
faces some major challenges. The model may be overfitted. This occurs when
unnecessary predictors remain in the model. With software cost data, some of the
predictors are highly correlated. Such collinearity may cause high variances and covariances in coefficients and result in poor predictive performance when one encounters
new data. We can sometimes ameliorate these problems by reducing the number of
predictor variables. By retaining only the most important variables, we increase the
interpretability of the model and reduce the cost of the data collection process. As
empirical evidence of this effectiveness, Chen et al. report that the reduced-parameter
COCOMO models can yield lower prediction errors and lower variance [Chen 2006].
3.2.2 THE BAYESIAN ANALYSIS
The Bayesian approach to calibrating the COCOMO II model was introduced by
Chulani et al. [Chulani 1999a]. This study produced a set of constants and cost drivers
officially published in the COCOMO II book [Boehm 2000b]. Since then, the Bayesian
approach has been used in a number of calibrations of the model using multiple
COCOMO data sets, e.g., [Yang and Clark 2003].
The Bayesian approach relies on Bayes’ theorem to combine the a priori
knowledge and the sample information in order to produce an a posteriori model. In the
COCOMO context, the a priori knowledge is the expert-judgment estimates and
variances of parameter values; the sample information is the data collected from
completed projects. Figure 3-2 shows the productivity range (the ratio between the
highest and lowest rating values) of the RUSE (Develop for Reusability) cost driver
obtained by combining a priori expert-judgment estimate and data-determined value.
Figure 3-2. A Posteriori Bayesian Update in the Presence of Noisy Data RUSE
[Boehm 2000b]
(Productivity range, i.e., highest rating divided by lowest rating: a priori experts’ Delphi 1.73; noisy data analysis 0.83; a posteriori Bayesian update 1.31.)
The Bayesian approach calculates the posterior mean b** and variance Var(b**) of
the coefficients as
b** = [(1/s²) X'X + H*]⁻¹ × [(1/s²) X'X b + H* b*]    (Eq. 3-3)

Var(b**) = [(1/s²) X'X + H*]⁻¹    (Eq. 3-4)

Where,
- X and s² are the matrix of parameters and the variance of the residual for the sample data, respectively.
- H* and b* are the inverse of the variance and the mean of the prior information (expert-judgment estimates), respectively.
To compute the posterior mean and variance of the coefficients, we need to
determine the mean and variance of the expert-judgment estimates and the sampling
information. Steps 5A and 5B of the modeling process (Figure 3-1) are followed to obtain
these data.
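The following R sketch shows how (Eq. 3-3) and (Eq. 3-4) combine the two sources of information. The design matrix, the simulated responses, and the prior means and precisions are all invented for illustration; in the actual calibration they would come from the project data (Step 5B) and the Delphi survey (Step 5A).

```r
# Minimal sketch of the Bayesian update of Eq. 3-3 and Eq. 3-4 (illustrative data).
set.seed(1)
X  <- cbind(1, matrix(rnorm(20), ncol = 2))        # 10 projects, intercept + 2 predictors
y  <- as.numeric(X %*% c(1.0, 0.9, 0.3) + rnorm(10, sd = 0.2))
ols    <- lm.fit(X, y)                             # sample (data-determined) estimates
b      <- ols$coefficients
s2     <- sum(ols$residuals^2) / (nrow(X) - ncol(X))
b_star <- c(1.0, 1.0, 0.25)                        # prior means (e.g., from the Delphi)
H_star <- diag(c(4, 25, 25))                       # prior precisions (inverse variances)

A_mat  <- t(X) %*% X / s2 + H_star
b_post <- solve(A_mat, t(X) %*% X %*% b / s2 + H_star %*% b_star)   # Eq. 3-3
V_post <- solve(A_mat)                                              # Eq. 3-4
```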
3.2.3 A CONSTRAINED MULTIPLE REGRESSION TECHNIQUE
In [Nguyen 2008], we proposed constrained regression techniques to calibrate the
COCOMO model coefficients. The technique estimates the model coefficients by
minimizing objective functions while imposing model constraints. The objective
functions represent the overall goal of the model, that is, to achieve high estimation
accuracies. The constraints can be considered subordinate goals, or a priori knowledge,
about the model. We validated the technique on two data sets used to construct
COCOMO 81 and COCOMO II.2000. The results indicate that the technique can
improve the performance of the COCOMO II model (see Figure 3-3 and Figure 3-4). On
both COCOMO II.2000 and COCOMO 81 data sets, the constrained techniques CMAE
and CMRE were found to outperform the other techniques compared. With this finding,
we will apply this technique to calibrate the model and compare the calibration results
obtained by this technique with those of the Bayesian analysis.
Figure 3-3. Boxplot of mean of PRED(0.3) on the COCOMO II.2000 data set
Figure 3-4. Boxplot of mean of PRED(0.3) on the COCOMO 81 data set
The technique consists of building three models for calibrating the coefficients in
Equation (Eq. 4-12). These models are based on three objective functions MSE, MAE,
and MRE that have been well investigated and applied to building or evaluating cost
estimation models. MSE is a technique minimizing the sum of square errors, MAE
minimizing the sum of absolute errors, and MRE minimizing the sum of relative errors.
The models examined include:
(1) Constrained Minimum Sum of Square Errors (CMSE):

Minimize Σ (yi − ŷi)²,  i = 1, …, N    (Eq. 3-5)
subject to MREi ≤ c and β̂j ≥ 0, i = 1, …, N, j = 0, …, p

(2) Constrained Minimum Sum of Absolute Errors (CMAE):

Minimize Σ |yi − ŷi|,  i = 1, …, N    (Eq. 3-6)
subject to MREi ≤ c and β̂j ≥ 0, i = 1, …, N, j = 0, …, p

(3) Constrained Minimum Sum of Relative Errors (CMRE):

Minimize Σ |yi − ŷi| / yi,  i = 1, …, N    (Eq. 3-7)
subject to MREi ≤ c and β̂j ≥ 0, i = 1, …, N, j = 0, …, p

Where, c ≥ 0 is the tuning parameter controlling the upper bound of MRE for each estimate, and MREi is the magnitude of relative error of the ith estimate, which is detailed in Section 3.3.1.
Estimating β̂0, β̂1, …, β̂p in Equations (Eq. 3-5), (Eq. 3-6), and (Eq. 3-7) is an
optimization problem. Equation (Eq. 3-5) is a quadratic programming problem, and
Equations (Eq. 3-6) and (Eq. 3-7) can be transformed to a form of linear
programming. A procedure for this transformation is discussed in Narula and Wellington
[1977]. In this study, we use quadratic and linear programming solvers (quadprog7 and
lpSolve8) provided in the R statistical packages to estimate the coefficients.
As discussed in [Nguyen 2008], one of the advantages of this technique is that the
a priori knowledge can be included in the regression models in the form of constraints to
adjust the estimates of coefficients. The constraints can be any functions of the model
parameters that are known prior to building the model. For example, in COCOMO the
estimates of coefficients should be non-negative (e.g., an increase in the parameter value
will result in an increase in effort). As the constraints are applied, the technique can
effectively prune parameters that are negative while adjusting other parameters to
minimize the objective function.
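To illustrate how such a constrained formulation can be solved in practice, the R sketch below sets up the CMAE model (Eq. 3-6) as a linear program, assuming the model is fit in log space so that the MRE bound of (Eq. 3-9) becomes the pair of linear constraints log(1 − c) ≤ ŷi − yi ≤ log(1 + c). This is only one possible encoding, written for illustration; it is not the exact implementation used in [Nguyen 2008].

```r
# Minimal sketch: CMAE (Eq. 3-6) as a linear program using the lpSolve package.
# Decision variables: the p+1 coefficients beta, plus u_i and v_i, the positive and
# negative parts of each residual (so |y_i - yhat_i| = u_i + v_i at the optimum).
# lpSolve treats all decision variables as non-negative, which also enforces beta_j >= 0.
library(lpSolve)

cmae_fit <- function(X, y, c = 0.5) {
  X <- cbind(1, X)                         # add intercept column
  N <- nrow(X); p <- ncol(X)
  obj <- c(rep(0, p), rep(1, 2 * N))       # minimize sum(u) + sum(v)
  eq  <- cbind(X, diag(N), -diag(N))       # X beta + u - v = y defines the residuals
  up  <- cbind(X, matrix(0, N, 2 * N))     # X beta <= y + log(1 + c)  (MRE bound)
  lo  <- cbind(X, matrix(0, N, 2 * N))     # X beta >= y + log(1 - c)  (MRE bound)
  sol <- lp("min", obj,
            const.mat = rbind(eq, up, lo),
            const.dir = c(rep("=", N), rep("<=", N), rep(">=", N)),
            const.rhs = c(y, y + log(1 + c), y + log(1 - c)))
  sol$solution[1:p]                        # estimated coefficients (intercept first)
}
```

If the MRE bound is too tight for a given data set, the program may be infeasible; relaxing c restores feasibility at the cost of larger relative errors.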
3.3 EVALUATION STRATEGIES
3.3.1 MODEL ACCURACY MEASURES
MMRE and PRED are the most widely used metrics for evaluating the accuracy
of cost estimation models. These metrics are calculated based on a number of actuals
observed and estimates generated by the model. They are derived from the basic
magnitude of the relative error MRE, which is defined as:

MREi = |yi − ŷi| / yi    (Eq. 3-8)

where yi and ŷi are the actual and the estimate of the ith observation, respectively.
Because yi is log-transformed, we calculate MREi using

MREi = |1 − e^(ŷi − yi)|    (Eq. 3-9)

The mean of MRE of N estimates is defined as

MMRE = (1/N) Σ MREi,  i = 1, …, N    (Eq. 3-10)

7 http://cran.r-project.org/web/packages/quadprog/index.html
8 http://cran.r-project.org/web/packages/lpSolve/index.html
As every estimate is included in calculating MMRE, extreme values of MRE can
significantly affect MMRE. To handle this problem, another important criterion used for
model evaluation is PRED. PRED(l) is defined as the percentage of estimates, where
MRE is not greater than l, that is PRED(l) = k/n, where k is the number of estimates with
MRE falling within l, and n is the total number of estimates. We can see that unlike
MMRE, PRED(l) is insensitive to errors greater than l. Another accuracy measure that
has been often reported in the software estimation research is the median of the
magnitude of relative errors (MdMRE). Unlike MMRE, the MdMRE measure provides
information about the concentration of errors and is not affected by extreme errors. Using
these measures as model comparison criteria, one model is said to outperform another if
it has lower MMRE, MdMRE, and higher PRED(l).
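The R helper functions below are a minimal sketch of these accuracy measures, assuming the actuals and estimates are on the log scale so that MRE follows (Eq. 3-9); the example values are invented.

```r
# Accuracy measures of Section 3.3.1 (log-scale actuals and estimates assumed).
mre   <- function(y, yhat) abs(1 - exp(yhat - y))            # Eq. 3-9
mmre  <- function(y, yhat) mean(mre(y, yhat))                # Eq. 3-10
mdmre <- function(y, yhat) median(mre(y, yhat))              # median of MREs
pred  <- function(y, yhat, l = 0.30) mean(mre(y, yhat) <= l) # PRED(l) = k/n

y    <- log(c(12, 34, 7, 58, 21))      # hypothetical actual efforts
yhat <- log(c(10, 40, 7.5, 50, 30))    # hypothetical estimates
c(MMRE = mmre(y, yhat), MdMRE = mdmre(y, yhat), PRED30 = pred(y, yhat, 0.30))
```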
In this research, the results are reported and compared mainly using PRED(0.3)
and MMRE. These measures are considered standard for reporting COCOMO calibration
and model improvement in previous studies [Chulani 1999, Chen 2006]. In addition,
to allow comparisons between the models investigated in this study with others,
PRED(0.25), PRED(0.50), and MdMRE measures are also reported.
3.3.2 CROSS-VALIDATION
The most important criterion for rejection or acceptance of a cost estimation
model is its ability to predict using new data. Ideally the prediction error of a new cost
model is calculated using data from future projects. This approach, however, is usually
impossible in practice because new data is not always available at the time the model is
developed. Instead, model developers have to use the data that is available to them for
both constructing and validating the model. This strategy is usually referred to as cross-validation.
While many cross-validation approaches have been proposed, the most common
are a simple holdout strategy and a computer-intensive method called K-fold cross
validation. The holdout approach splits the dataset into two distinct subsets: training
and test sets. The training set is used to fit the model and the test set provides estimates of
the prediction errors. K-fold cross validation divides the data into K subsets. Each time,
one of the K subsets is used as the test set and the other K-1 subsets form a training set.
Then, the average error across all K trials is computed. K-fold cross validation avoids the
issue of overly-optimistic results for prediction accuracy. This technique enables the user
to independently choose the size of each test set and the number of trials to use for
averaging the results. The variance of the resulting estimate is reduced as K is increased.
The disadvantage of this method is that the training algorithm has to be rerun K times,
resulting in additional computational effort. The K-fold cross-validation procedure can be described
in the following three steps:
Step 1. Randomly split the dataset into K subsets
Step 2. For each i = 1, 2, …, K:
• Build the model with the ith subset of the data removed
• Predict effort for observations in the ith subset
• Calculate MMREi and PRED(l)i for the ith subset, where l = 0.3, 0.25, 0.2, and 0.1, and MMREi is calculated as

MMREi = (1/P) Σ |yj − ŷ*j| / yj,  j = 1, …, P    (Eq. 3-11)

Where, P is the number of observations in the ith subset and ŷ*j is the estimate of the jth observation in the ith subset.
Step 3. Compute the overall MMRE and PRED(l) as the averages of MMREi and PRED(l)i over the K subsets.
The special case where K = N is often called leave-one-out cross-validation
(LOOC). In this method, the training set, which consists of N – 1 observations, is used to
build the model to test the remaining observation. LOOC appears to be the preferred cross-validation method for validating the performance of software estimation models.
One possible reason is that software estimation data is scarce, and thus, models cannot
afford to leave more data points out of the training set. Another reason is that the
approach reflects the reality in which all available data of an organization is used to
calibrate the model for future projects. Therefore, LOOC is used in this study.
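The R sketch below illustrates LOOC for a simple log-linear effort model, reporting MMRE and PRED(0.30) on the held-out estimates. The data frame and its values are hypothetical and exist only to show the mechanics of the procedure.

```r
# Minimal leave-one-out cross-validation (K = N) sketch with invented data.
proj <- data.frame(
  log_PM    = log(c(12, 34, 7, 58, 21, 15, 44, 9)),
  log_KSLOC = log(c(4.1, 11, 2.3, 18.5, 7.2, 5.0, 14.8, 3.1))
)
loo_mre <- sapply(seq_len(nrow(proj)), function(i) {
  fit  <- lm(log_PM ~ log_KSLOC, data = proj[-i, ])   # train without the ith project
  yhat <- predict(fit, newdata = proj[i, ])            # estimate the held-out project
  abs(1 - exp(yhat - proj$log_PM[i]))                  # MRE per Eq. 3-9
})
c(MMRE = mean(loo_mre), PRED30 = mean(loo_mre <= 0.30))
```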
Chapter 4. THE COCOMO II MODEL FOR SOFTWARE
MAINTENANCE
This chapter describes a number of extensions to the COCOMO II model for
software maintenance. Section 4.1 discusses the existing COCOMO II sizing methods
and their limitations, and then proposes modifications and extensions to the COCOMO II
models for software reuse and maintenance by unifying these models, redefining the
parameters and formulas for computing the equivalent size of the software maintenance
project. Section 4.2 presents an approach to extending the COCOMO II effort estimation
model for software maintenance.
4.1 SOFTWARE MAINTENANCE SIZING METHODS
Size is the most significant and essential factor for cost estimation models,
including those that estimate software maintenance cost (see Section 2.3). Thus, we must provide methods
to measure the size of the maintenance work effectively.
Unlike new development, where the code for core capabilities does not yet exist,
software maintenance relies on the preexisting code. As a result, the maintenance
sizing method must take into account different types of code and other factors, such as
the effects of understanding the preexisting code and the quality and quantity of the
preexisting code.
Figure 4-1. Types of Code
(Preexisting code, consisting of external modules and existing system modules, flows through manual development and maintenance or automatic translation into the delivered code, consisting of reused, adapted, new, and automatically translated modules.)
Figure 4-1 shows preexisting code as source inputs and delivered code as outputs
of the development and maintenance activities. They can be grouped into the following
types:
- External modules: the code taken from a source other than the system to be maintained. They can be proprietary or open-source code.
- Preexisting system modules: the code of the system to be upgraded or maintained.
- Reused modules: the preexisting code that is used as a black box without modifications.
- Adapted modules: the code that is changed from the preexisting code. The preexisting code is used as a white box in which source lines of code are added, deleted, modified, or left unmodified.
- New modules: the modules newly added to the updated system.
- Automatically translated modules: the code obtained by using code translation tools to translate the preexisting code for use in the updated system. In COCOMO, the automatically translated code is not included in the size of the maintenance and reuse work. Instead, the effort associated with the automatically translated code is estimated in a separate model different from the main COCOMO effort model. In the COCOMO estimation model for software maintenance, we also exclude the automatically translated code from the sizing method and effort estimation.
4.1.1 THE COCOMO II REUSE AND MAINTENANCE MODELS
COCOMO II provides two separate models for sizing software reuse and software
maintenance. The reuse model is used to compute the equivalent size of the code that is
reused and adapted from other sources. The reuse model can also be used for sizing major
software enhancements. On the other hand, the maintenance model is designed to
measure the size of minor enhancements and fault corrections.
The COCOMO II Reuse Sizing Model
The COCOMO II reuse model was derived on the basis of experience and
findings drawn from previous empirical studies on software reuse costs. Selby [1988]
performed an analysis of reuse costs of reused modules in the NASA Software
Engineering Laboratory, indicating nonlinear effects of the reuse cost function (Figure
4-2.). Gerlich and Denskat [1994] describe a formula to represent the number of interface
checks required in terms of the number of modules modified and the total number of
software modules, showing that the relationship between the number of interface checks
required and the number of modules modified is nonlinear. The cost of understanding and
testing the existing code could, in part, cause the nonlinear effects. Parikh and Zvegintzov
[1983] found that the effort required to understand the software to be modified takes 47
percent of the total maintenance effort.
Figure 4-2. Nonlinear Reuse Effects
(Plot of relative cost (AAM) against relative modification of size (AAF), with the Selby data and data summary; AAM worst case: AA = 8, SU = 50, UNFM = 1; AAM best case: AA = 0, SU = 10, UNFM = 0.)
The COCOMO II reuse model computes the equivalent SLOC for enhancing,
adapting, and reusing the pre-existing software. This model takes into account the amount
of software to be adapted, percentage of design modified (DM), the percentage of code
modified (CM), the percentage of integration and testing required (IM), the level of
software understanding (SU), and the programmer’s relative unfamiliarity with the
software (UNFM). The model is expressed as:
Equivalent KSLOC = Adapted KSLOC × (1 − AT/100) × AAM    (Eq. 4-1)
Where,
AAF = 0.4 * DM + 0.3 * CM + 0.3 * IM
(Eq. 4-2)
AAM = [AA + AAF × (1 + 0.02 × SU × UNFM)] / 100,  for AAF ≤ 50
AAM = [AA + AAF + SU × UNFM] / 100,  for AAF > 50    (Eq. 4-3)
AT is the total Automatic Translation code; AA is the degree of Assessment and
Assimilation; AAF is the Adaptation Adjustment Factor representing the amount of
modification; and AAM stands for Adaptation Adjustment Multiplier. The factors SU and
UNFM in the model are used to adjust for software comprehension effects on the
adaptation and reuse effort, reflecting the cost of understanding the software to be
modified [Parikh and Zvegintzov 1983, Nguyen 2009, Nguyen 2010]. Figure 4-2 shows
the region of possible AAM values specified by the parameters AAF, AA, SU, and UNFM.
Software Understanding (SU) measures the degree of understandability of the
existing software (how easy it is to understand the existing code). The rating scale ranges
from Very Low (very difficult to understand) to Very High (very easy to understand). SU
specifies how much increment to Equivalent SLOC (ESLOC) is needed if the
programmer is new to the existing code. Appendix Table A.3 describes the numeric SU
rating scale for each rating level. At the rating of Very Low, the developers would spend
50% of their effort for understanding the software for an equivalent amount of code.
Programmer Unfamiliarity (UNFM) measures the degree of unfamiliarity of
the programmer with the existing software. This factor is applied multiplicatively to the
software understanding effort increment (see Appendix Table A.2).
Assessment and Assimilation (AA) is the degree of assessment and assimilation
needed to determine whether a reused software module is appropriate to the system, and
to integrate its description into the overall product description. AA is measured as the
percentage of effort required to assess and assimilate the existing code as compared to the
total effort for software of comparable size (see Appendix Table A.1).
Equivalent SLOC is equivalent to SLOC of all new code that would be produced
by the same amount of effort. Thus, Equivalent SLOC would be equal to new SLOC if
the project is developed from scratch with all new code.
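As a worked illustration of (Eq. 4-1) through (Eq. 4-3), the R sketch below computes the equivalent size for a hypothetical adaptation of 20 KSLOC of preexisting code. The function name and the parameter values are illustrative only; the SU, UNFM, and AA ratings would normally be taken from the rating scales in the appendix tables.

```r
# Minimal sketch of the COCOMO II reuse sizing model (Eq. 4-1 to Eq. 4-3).
reuse_esloc <- function(adapted_ksloc, DM, CM, IM, SU, UNFM, AA = 0, AT = 0) {
  AAF <- 0.4 * DM + 0.3 * CM + 0.3 * IM                 # Eq. 4-2
  AAM <- if (AAF <= 50) {
    (AA + AAF * (1 + 0.02 * SU * UNFM)) / 100           # Eq. 4-3, AAF <= 50
  } else {
    (AA + AAF + SU * UNFM) / 100                        # Eq. 4-3, AAF > 50
  }
  adapted_ksloc * (1 - AT / 100) * AAM                  # Eq. 4-1, equivalent KSLOC
}

# 20 KSLOC adapted, 10% design and 20% code modified, 30% re-integration,
# moderately hard-to-understand code (SU = 30), fairly unfamiliar programmer (UNFM = 0.8):
reuse_esloc(20, DM = 10, CM = 20, IM = 30, SU = 30, UNFM = 0.8)
```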
The COCOMO II Maintenance Sizing Model
Depending on the availability of the data, several means can be used to calculate
the size of maintenance. One way is to determine the maintenance size based on the size
of the base code (BCS), the percentage of change to the base code named Maintenance
Change Factor (MCF), and an adjustment factor called Maintenance Adjustment Factor
(MAF).
Size = BCS x MCF x MAF
(Eq. 4-4)
Alternatively, COCOMO can measure the size of maintenance based on the size
of added and modified code, and adjusts it with the MAF factor. MAF is adjusted with the
SU and UNFM factors from the Reuse model. That is,
MAF = 1 + (SU × UNFM / 100)    (Eq. 4-5)

Thus,

Size = (Added + Modified) × [1 + SU × UNFM / 100]    (Eq. 4-6)

The maintenance size measure is then used as an input to the COCOMO II models
to generate the effort and schedule estimates. The COCOMO II model assumes that the
software maintenance cost is influenced by the same set of cost drivers and their ratings
as is the development cost, with some exceptions noted above.
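A small R sketch of the maintenance sizing of (Eq. 4-5) and (Eq. 4-6) is shown below; the added and modified SLOC counts and the SU and UNFM ratings are invented for illustration.

```r
# Minimal sketch of COCOMO II maintenance sizing (Eq. 4-5 and Eq. 4-6).
maint_size <- function(added, modified, SU, UNFM) {
  MAF <- 1 + (SU * UNFM) / 100        # Eq. 4-5, Maintenance Adjustment Factor
  (added + modified) * MAF            # Eq. 4-6, adjusted maintenance size in SLOC
}
maint_size(added = 3000, modified = 1500, SU = 30, UNFM = 0.6)
```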
4.1.2 A UNIFIED REUSE AND MAINTENANCE MODEL
This section presents my proposed sizing model for software maintenance that
unites and improves the existing COCOMO II reuse and maintenance models. This
model is proposed to address the following limitations of the existing models.
• The reuse model would underestimate code expansion. The software system grows over
time at a significant rate as the result of continuing functional expansions, enhancements,
and fault corrections [Lehman 1997]. However, because the change factors DM, CM, and
IM account for, respectively, the percentage of design, code, and integration and test
modified relative to the existing code (i.e., DM and CM values cannot exceed 100%), they
do not fully take into account the amount of added code that effectively expands the
existing code. For example, if a module grows from 5K to 15K SLOC, a rate of 300%, CM
can still be specified as at most 100%, consequently underestimating the updated module.
• According to the definitions of the reuse and maintenance models above, we
can see that there is no clear distinction between the reused software and the
maintained software. In fact, Basili describes three models that treat software
maintenance as a reuse-oriented process [Basili 1990]. But the two sizing
models do not share the same set of parameters. Moreover, there is no way to
convert between the reuse’s equivalent size and maintenance size.
• There is a lack of connection between the equivalent size estimated by the
reuse model and the completed code (e.g., added, modified, deleted SLOC).
That is, the question of how to compute the equivalent size of completed code
is not addressed in the model. This is important because the actual size is used
to calibrate the estimation model for future estimation.
The model is derived from the COCOMO reuse model. Specifically, it uses the same
parameters SU and UNFM while redefining the other parameters and model formulas.
• DM is the percentage of the design modification made to the analysis and
design artifacts of the preexisting software affected by the changes for the new
release or product. DM does not include the design related to the code
expansion (e.g., new classes and methods) because the code expansion is
taken into account by CM. The DM value ranges from 0 to 100%.
• CM is the percentage of code added, modified, and deleted relative to the size
of the preexisting modules affected by the changes for the new release or
product. In other words, CM is equal to the sum of SLOC added, modified,
and deleted divided by the total SLOC of the preexisting code. It includes
code expansions, which may go beyond the size of the preexisting code, and
thus CM can exceed 100%.
• IM is the percentage of integration and test needed for the preexisting modules
to be adapted into the new release or product, relative to the normal amount of
integration and test for the preexisting modules affected by the changes. IM
may exceed 100% if the integration and test is required for other parts of the
system that are related to the changes or some special integration and test is
required to validate and verify the whole system. Like DM, IM does not
include the integration and test for the code expansion as CM accounts for this
effect.
Using the parameters DM, CM, IM, SU, and UNFM, the formulas used to
compute AAF and AAM are presented as
AAF = 0.4 * DM + CM + 0.3 * IM
(Eq. 4-7)
AAM = {AA + AAF + [1 − (1 − AAF/100)^2] * SU * UNFM} / 100,   if AAF ≤ 100
AAM = [AA + AAF + SU * UNFM] / 100,                           if AAF > 100
(Eq. 4-8)
Although Equations (Eq. 4-7) and (Eq. 4-8) are based on those of the COCOMO
reuse model, they are different in several ways. In the AAF formula, the CM coefficient is
1 instead of 0.3 while the coefficients for DM and IM are the same as those in Equation
(Eq. 4-2). This change reflects the new definition of CM, which accounts for code
expansion and considers that a modified or deleted SLOC costs the same as an added SLOC.
AAF represents the equivalent relative size of the changes for the new release or product,
and its value is greater than 0% and may exceed 100%. For AAM, Equation (Eq. 4-3) presents
a curve consisting of two straight lines joining at AAF = 50%, which is less intuitive because
the break at AAF = 50% has not been demonstrated empirically or theoretically. The new
AAM Equation (Eq. 4-8) smooths the curve when AAF ≤ 100%. The difference between the
old and new AAM curves is shown in Figure 4-3. This difference is most significant when
AAF is close to 50%: the old equation overestimates AAM as compared to the new one. The
difference decreases in the direction moving from the AAM worst case to the AAM best case.
As a result, the two equations produce the same AAM values for the AAM best case.
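The comparison described above can be reproduced with a small Python sketch; the two helper functions below implement the old AAM curve (Eq. 4-3) and the new smoothed curve (Eq. 4-8), and the example evaluates them at the worst-case and best-case settings of Figure 4-3. The function names are illustrative only.

    def aam_old(aaf, aa, su, unfm):
        """COCOMO II reuse AAM (Eq. 4-3), with a break at AAF = 50."""
        if aaf <= 50:
            return (aa + aaf * (1 + 0.02 * su * unfm)) / 100.0
        return (aa + aaf + su * unfm) / 100.0

    def aam_new(aaf, aa, su, unfm):
        """Unified-model AAM (Eq. 4-8) with the smoothed interface-checking term."""
        if aaf <= 100:
            return (aa + aaf + (1 - (1 - aaf / 100.0) ** 2) * su * unfm) / 100.0
        return (aa + aaf + su * unfm) / 100.0

    # Worst case of Figure 4-3 (AAF = 50, AA = 8, SU = 50, UNFM = 1):
    print(aam_old(50, 8, 50, 1), aam_new(50, 8, 50, 1))   # 1.08 versus 0.955
    # Best case (AAF = 50, AA = 0, SU = 10, UNFM = 0): both give 0.50
    print(aam_old(50, 0, 10, 0), aam_new(50, 0, 10, 0))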
Now, let us show that the smoothed curve of the new AAM Equation (Eq. 4-8) can be
derived while maintaining the nonlinear effects discussed above. First, it is important to
note that AAM is proposed as a model representing the nonlinear effects involved in the
module interface checking and testing interfaces directly affected or related to the
changes. It, therefore, takes into account not only the size of the changes but also the size
of the modules to be changed. Second, we assume that there is one interface between two
modules.
Figure 4-3. AAM Curves Reflecting Nonlinear Effects
(Relative Cost plotted against Relative Modification of Size (AAF), from 0 to 120%, for the old and new AAM equations; the curves are bounded by the AAM worst case (AAF = 50, AA = 8, SU = 50, UNFM = 1) and the AAM best case (AAF = 50, AA = 0, SU = 10, UNFM = 0).)
Let n be the number of modules of the system and x be the percentage of
modification. The number of interfaces among n modules is
n * (n − 1) / 2 ≈ n^2 / 2   (for n >> 0)
As the number (percentage) of unmodified modules is 100 − x, the number of interfaces
among the unmodified modules can be approximated as (100 − x)^2 / 2. The number of
interfaces remaining to be checked is 100^2 / 2 − (100 − x)^2 / 2. Therefore, the percentage
of interfaces to be checked is 1 − (1 − x/100)^2. Here, x is the percentage of modification,
which represents AAF, so we obtain 1 − (1 − AAF/100)^2 as the percentage of code that
requires checking. The quantity [1 − (1 − AAF/100)^2] * SU * UNFM in Equation (Eq. 4-8)
accounts for the effects of understanding the interfaces to be checked.
Although different in form, the percentage of code that requires checking,
1 − (1 − AAF/100)^2, is close to the result of Gerlich and Denskat [1994], which
demonstrates that the number of interface checks required, N, is
N = k * (m − k) + k * (k − 1) / 2
(Eq. 4-9)
where k and m are the number of modified modules and total modules in the
software, respectively.
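A brief numerical check, under the assumption of one interface between each pair of modules, shows that the smoothed term 1 − (1 − AAF/100)^2 tracks the Gerlich and Denskat count closely; the module counts in the Python sketch below are hypothetical.

    def interface_fraction(x):
        """Fraction of interfaces affected when x percent of the modules change."""
        return 1 - (1 - x / 100.0) ** 2

    def gerlich_denskat_fraction(k, m):
        """Interface checks of Eq. 4-9 divided by the m*(m-1)/2 total interfaces."""
        n_checks = k * (m - k) + k * (k - 1) / 2.0
        return n_checks / (m * (m - 1) / 2.0)

    m = 200
    for k in (20, 60, 100):              # 10%, 30%, 50% of the modules modified
        x = 100.0 * k / m
        print(x, round(interface_fraction(x), 3), round(gerlich_denskat_fraction(k, m), 3))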
The unified model classifies the delivered code into three different module types,
reused, adapted, and new, as described above. Considering the differences among these
types, we use different parameters and formulas to measure the size of each type
separately. The size of deleted modules is not included in the model.
New Modules
These modules are added to the system; thus, their size is simply the added KSLOC count
(KSLOCadded) without considering the effects of module checking and understanding.
Adapted Modules
The equivalent size (EKSLOCadapted) is measured using the size of the preexisting
modules to be adapted (AKSLOC) and the AAM factor described in Equation (Eq. 4-8).
EKSLOC adapted = AKSLOC * AAM
Reused Modules
As these modules are not modified, the DM, CM, SU, and UNFM are all zero. Thus,
the equivalent KSLOC (EKSLOC) of the reused modules is computed as
EKSLOC reused = RKSLOC * AAM ,
where RKSLOC is the KSLOC of the reused modules, and AAM is computed as
AAM = (AAreused + 0.3 * IMreused) / 100
AAreused is the degree of assessment and assimilation needed to determine the
modules relevant for reuse in the maintained system. It is measured as the percentage of
effort spent to assess and assimilate the existing code versus the total effort needed to
write the reused modules from scratch.
Finally, the equivalent SLOC is computed by the formula
EKSLOC = KSLOC added + EKSLOC adapted + EKSLOC reused
(Eq. 4-10)
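Putting the three module types together, the Python sketch below computes Equation (Eq. 4-10) end to end; the input values are hypothetical, and the dictionary-based interface is only a convenience for the example.

    def eksloc_unified(added_ksloc, adapted, reused):
        """Equivalent KSLOC per the unified sizing model (Eq. 4-10)."""
        # Adapted modules: new AAF with CM coefficient 1 (Eq. 4-7) and smoothed AAM (Eq. 4-8).
        aaf = 0.4 * adapted["dm"] + adapted["cm"] + 0.3 * adapted["im"]
        if aaf <= 100:
            aam = (adapted["aa"] + aaf
                   + (1 - (1 - aaf / 100.0) ** 2) * adapted["su"] * adapted["unfm"]) / 100.0
        else:
            aam = (adapted["aa"] + aaf + adapted["su"] * adapted["unfm"]) / 100.0
        eksloc_adapted = adapted["aksloc"] * aam

        # Reused modules: DM = CM = SU = UNFM = 0, so AAM = (AA + 0.3 * IM) / 100.
        eksloc_reused = reused["rksloc"] * (reused["aa"] + 0.3 * reused["im"]) / 100.0

        return added_ksloc + eksloc_adapted + eksloc_reused

    size = eksloc_unified(
        added_ksloc=4.0,
        adapted=dict(aksloc=20.0, dm=10, cm=25, im=30, su=30, unfm=0.4, aa=4),
        reused=dict(rksloc=50.0, aa=2, im=10),
    )
    print(round(size, 2))   # about 16.38 equivalent KSLOC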
4.2 COCOMO II EFFORT MODEL FOR SOFTWARE MAINTENANCE
We first assume that the cost of software maintenance follows the same form as
the COCOMO II model. In other words, the model is nonlinear and consists of additive,
multiplicative, and exponential components [Boehm and Valerdi 2008]. Furthermore, the
cost drivers’ definitions and rating levels remain the same except that the Developed for
Reusability (RUSE) and Required Development Schedule (SCED) cost drivers were
eliminated, and rating levels for the Required Software Reliability (RELY), Applications
Experience (APEX), Platform Experience (PLEX), and Language and Tool Experience
(LTEX) were adjusted. Details of the changes and rationale for the changes are given as
follows.
Elimination of SCED and RUSE
As compared with the COCOMO II model, the Developed for Reusability (RUSE)
and Required Development Schedule (SCED) cost drivers were excluded from the effort
model for software maintenance, and the initial rating scales of the Required Software
Reliability (RELY) cost driver were adjusted to reflect the characteristics of software
maintenance projects. As defined in COCOMO II, the RUSE cost driver “accounts for
additional effort needed to develop components intended for reuse on current or future
projects.” This additional effort is spent on requirements, design, documentation, and
testing activities to ensure the software components are reusable. In software
maintenance, the maintenance team usually adapts, reuses, or modifies the existing reusable
components, and thus this additional effort is less relevant than it is in development.
Additionally, the sizing method already accounts for additional effort needed for
integrating and testing the reused components through the IM parameter.
Table 4-1. Maintenance Model’s Initial Cost Drivers

Scale Factors
  PREC   Precedentedness of Application
  FLEX   Development Flexibility
  RESL   Risk Resolution
  TEAM   Team Cohesion
  PMAT   Equivalent Process Maturity Level

Effort Multipliers
  Product Factors
    RELY   Required Software Reliability
    DATA   Database Size
    CPLX   Product Complexity
    DOCU   Documentation Match to Life-Cycle Needs
  Platform Factors
    TIME   Execution Time Constraint
    STOR   Main Storage Constraint
    PVOL   Platform Volatility
  Personnel Factors
    ACAP   Analyst Capability
    PCAP   Programmer Capability
    PCON   Personnel Continuity
    APEX   Applications Experience
    LTEX   Language and Tool Experience
    PLEX   Platform Experience
  Project Factors
    TOOL   Use of Software Tools
    SITE   Multisite Development
In COCOMO II, the SCED cost driver “measures the schedule constraint imposed
on the project team developing the software.” [Boehm 2000b] The ratings define the
percentage of schedule compressed or extended from a Nominal rating level. According
to COCOMO II, schedule compressions require extra effort while schedule extensions do
not, and thus, ratings above Nominal, which represent schedule extensions, are
assigned the same value, 1.0, as Nominal. In software maintenance, the schedule
constraint is less relevant since the existing system is operational and the maintenance
team can produce quick fixes for urgent requests rather than accelerating the schedule for
the planned release.
Adjustments for APEX, PLEX, LTEX, and RELY
The personnel experience factors (APEX, PLEX, and LTEX) were adjusted by
increasing the number of years of experience required for each rating. That is, if the
maintenance team has an average of 3 years of experience then the rating is Nominal
while in COCOMO II the rating assigned for this experience is High. The ratings of
APEX, PLEX, and LTEX are shown in Table 4-2. The reason for this adjustment is that
the maintenance team tends to remain with the same system longer than a development
team does. More often than not, the team continues to maintain the system after developing it.
Table 4-2. Ratings of Personnel Experience Factors (APEX, PLEX, LTEX)

                    Very Low     Low      Nominal   High      Very High
APEX, PLEX, LTEX    ≤ 6 months   1 year   3 years   6 years   12 years
RELY is “the measure of the extent to which the software must perform its
intended function over a period of time.” [Boehm 2000b] In software maintenance, the
RELY rating values are not monotonic, i.e., they do not consistently increase or decrease
as the RELY rating moves from Very Low to Very High (see Table 4-3). The Very Low
multiplier is higher than the Nominal 1.0 multiplier due to the extra effort of extending
and debugging sloppy software, and the Very High multiplier is higher due to the extra effort
in CM, QA, and V&V needed to keep the product at a Very High RELY level.
Table 4-3. Ratings of RELY

  Very Low  (slight inconvenience):                1.23
  Low       (low, easily recoverable losses):      1.10
  Nominal   (moderate, easily recoverable losses): 1.00
  High      (high financial loss):                 0.99
  Very High (risk to human life):                  1.07
The following will describe the effort form, parameters, and the general
transformation technique to be used for the model calibration. The effort estimation
model can be written in the following general nonlinear form
PM = A * Size^E * ∏ EMi,   i = 1, ..., p
(Eq. 4-11)
Where
PM = effort estimate in person months
A = multiplicative constant
Size = estimated size of the software, measured in KSLOC. Increasing size has
local additive effects on the effort. Size is referred to as an additive factor.
EM = effort multipliers. These factors have global impacts on the cost of the
overall system.
E = is defined as a function of the scale factors, in the form E = B + Σ βi * SFi, i = 1, ..., 5.
Similar to the effort multipliers, the scale factors have global effects across
the system but their effects are associated with the size of projects. They
have more impact on the cost of larger-sized projects than smaller-sized
projects.
From Equation (Eq. 4-11), it is clear that we need to determine the numerical
values of the constants, scale factors, and effort multipliers. Moreover, these constants
and parameters have to be tuned to historical data so that the model better reflects the
effects of the factors in practice and improves the estimation performance. This process is
often referred to as calibration.
As Equation (Eq. 4-11) is nonlinear, we need to linearize it by applying the
natural logarithmic transformation:
log(PM) = β0 + β1 log(Size) + β2 SF1 log(Size) + … +
β6 SF5 log(Size) + β7 log(EM1) + … + β23 log(EM17)
(Eq. 4-12)
Equation (Eq. 4-12) is a linear form and its coefficients can be estimated using a
typical multiple linear regression approach such as ordinary least squares regression. This
is a typical method that was used to calibrate the model coefficients and constants by
[Clark 1997, Chulani 1999, Yang and Clark 2003, Nguyen 2008]. Applying a calibration
technique, we can obtain the estimates of coefficients in Equation (Eq. 4-12). The
estimates of coefficients are then used to compute the constants and parameter values in
Equation (Eq. 4-11) [Boehm 2000b].
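A minimal Python sketch of this calibration step is given below, assuming NumPy and a data set already expressed as arrays; it fits the log-linearized form of Equation (Eq. 4-12) by ordinary least squares and is illustrative rather than the exact procedure used for the dissertation data.

    import numpy as np

    def calibrate_loglinear(size, scale_factors, effort_multipliers, pm):
        """Fit the log-linearized effort model (Eq. 4-12) by ordinary least squares.

        size: (n,) KSLOC; scale_factors: (n, 5); effort_multipliers: (n, p); pm: (n,).
        Returns the estimated coefficients beta_0 ... beta_k; A = exp(beta_0).
        """
        log_size = np.log(size)
        X = np.column_stack([
            np.ones_like(log_size),              # beta_0 = log(A)
            log_size,                            # beta_1 = constant B
            scale_factors * log_size[:, None],   # beta_2..beta_6: SF_i * log(Size)
            np.log(effort_multipliers),          # remaining betas: log(EM_i)
        ])
        beta, *_ = np.linalg.lstsq(X, np.log(pm), rcond=None)
        return beta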
Chapter 5. RESEARCH RESULTS
5.1 THE CONTROLLED EXPERIMENT RESULTS
(Parts of this section were adapted from [Nguyen 2010].)
Hypothesis 1 states that the SLOC deleted from the modified modules is not a
significant size metric for estimating the maintenance effort. One approach to testing this
hypothesis is to validate and compare the estimation accuracies of the model using the
deleted SLOC and those of the model not using the deleted SLOC. Unfortunately, due to
the effects of other factors on the software maintenance effort, this approach is
impractical. Thus, the controlled experiment method was used as an approach to testing
this hypothesis. In a controlled experiment, various effects can be isolated.
We performed a controlled experiment of student programmers performing
maintenance tasks on a small C++ program [Nguyen 2009, Nguyen 2010]. The purpose
of the study was to assess size and effort implications and labor distributions of three
different maintenance types and to describe estimation models to predict the
programmer’s effort on maintenance tasks.
5.1.1 DESCRIPTION OF THE EXPERIMENT
We recruited 1 senior and 23 computer-science graduate students who were
participating in our directed research projects. Participation in the experiment was
voluntary, although we gave participants a small incentive by exempting them from
the final assignment. By the time the experiment was carried out, all participants had been
asked to compile and test the program as a part of their directed research work. However,
according to our pre-experiment survey, their level of unfamiliarity with the program
code (UNFM) varies from “Completely unfamiliar” to “Completely familiar”. We rated
UNFM as “Completely unfamiliar” if the participant had not read the code and as
“Completely familiar” if the participant had read and understood the code, and modified
some parts of the program prior to the experiment.
The performance of participants is affected by many factors such as programming
skills, programming experience, and application knowledge [Boehm 2000b]. We assessed
the expected performance of participants through pre-experiment surveys and review of
participants’ resumes. All participants claimed to have programming experience in either
C/C++ or Java or both, and 22 participants already had working experience in the
software industry. On average, participants claimed to have 3.7 years of programming
experience and 1.9 years of working experience in the software industry.
We ranked participants by their expected performance based on their C/C++
programming experience, industry experience, and level of familiarity with the program.
We then carefully assigned participants to three groups in a manner that the performance
capability among the groups is balanced as much as possible. As a result, we had seven
participants in the enhancive group, eight in the reductive group, and nine in the
corrective group.
Participants performed the maintenance tasks individually in two sessions in a
software engineering lab. Two sessions had the total time limit of 7 hours, and
participants were allowed to schedule their time to complete these sessions. If
participants did not complete all tasks in the first session, they continued the second
session on the same or a different day. Prior to the first session, participants were asked to
complete a pre-experiment questionnaire on their understanding of the program and then
were told how the experiment would be performed. Participants were given the original
source code, a list of maintenance activities, and a timesheet form. Participants were
required to record time on paper for every activity performed to complete maintenance
tasks. The time information includes start clock time, stop clock time, and interruption
time measured in minute. Participants recorded their time for each of the following
activities:
• Task comprehension includes reading, understanding task requirements, and asking for further clarification.
• Isolation involves locating and understanding code segments to be adapted.
• Editing code includes programming and debugging the affected code.
• Unit test involves performing tests on the affected code.
We focused on the context of software maintenance where the programmers
perform quick fixes according to customer’s maintenance requests [Basili 1990]. Upon
receiving the maintenance request, the programmers validate the request and contact the
submitter for clarifications if needed. They then investigate the program code to identify
relevant code fragments, edit, and perform unit tests on the changes [Basili 1996, Ko
2005].
Obviously, the activities above do not include design modifications because small
changes and enhancements hardly affect the system design. Indeed, since we focus on the
maintenance quick-fix, the maintenance request often does not affect the existing design.
Integration test activities are also not included as the program is by itself the only
component, and we perform acceptance testing independently to certify the completion of
tasks.
Participants performed the maintenance tasks for the UCC program, an enhanced
version of the CodeCount tool. The program was a utility used to count and compare
SLOC-related metrics such as statements, comments, directive statements, and data
declarations of a source program. (This same program was also used to collect the sample
data for calibrating the model proposed in this dissertation). UCC was written in C++ and
had 5,188 logical SLOC in 20 classes.
The maintenance tasks were divided into three groups, enhancive, reductive, and
corrective, each being assigned to one respective participant group. These maintenance
types fall into the business rules cluster, according to the typology proposed by [Chapin
2001]. There were five maintenance tasks for the enhancive group and six for the other
groups.
The enhancive tasks require participants to add five new capabilities that allow
the program to take an extra input parameter, check the validity of the input and notify
users, count for and while statements, and display a progress indicator. Since these
capabilities are located in multiple classes and methods, participants had to locate the
appropriate code to add and possibly modify or delete the existing code. We expected
that the majority of code would be added for the enhancive tasks unless participants had
enough time to replace the existing code with a better version of their own.
The reductive tasks ask for deleting six capabilities from the program. These
capabilities involve handling an input parameter, counting blank lines, and generating a
count summary for the output files. The reductive tasks emulate possible needs from
customers who do not want to include certain capabilities in the program because of
redundancy, performance issues, platform adaptation, etc. Similar to the enhancive tasks,
participants need to locate the appropriate code and delete lines of code, or possibly
modify and add new code to meet the requirements.
The corrective tasks call for fixing six capabilities that were not working as
expected. Each task is equivalent to a user request to fix a defect of the program. Similar
to the enhancive and reductive tasks, corrective tasks handle input parameters, counting
functionality, and output files. We designed these tasks in such a way that they required
participants to mainly modify the existing lines of code.
5.1.2 EXPERIMENT RESULTS
Maintenance time was calculated as the duration between finish and start time
excluding the interruption time if any. The resulting timesheet had a total of 490 records
totaling 4,621 minutes. On average, each participant recorded 19.6 activities with a total
of 192.5 minutes or 3.2 hours. We did not include the acceptance test effort because it
was done independently after the participants completed and submitted their work.
Indeed, in the real-world situation the acceptance test is usually performed by customers
or independent teams, and their effort is often not recorded as the effort spent by the
maintenance team.
The sizes of changes were collected in terms of the number of SLOC added,
modified, and deleted by comparing the original with the modified version. These SLOC
values were then adjusted using the proposed sizing method to obtain equivalent SLOC.
We measured the SLOC of task-relevant code fragments (TRCF) by summing the size of
all affected methods. As a SLOC corresponds to one logical source statement, one
modified SLOC can easily be distinguished from a combination of one added and one
deleted SLOC.
Figure 5-1. Effort Distribution
(Four pie charts showing the proportions of effort spent on task comprehension, isolation, editing code, and unit test for the enhancive, corrective, and reductive groups, and overall.)
The first three charts in Figure 5-1 show the distribution of effort of four different
activities by participants in each group. The fourth chart shows the overall distribution
effort by combining all three groups. Participants spent the largest proportion of time on
coding, and they spent much more time on the isolation activity than the testing activity.
By comparing the distribution of effort among the groups, we can see that proportions of
effort spent on the maintenance activities vary widely among the three groups. The task
comprehension activity required the smallest proportions of effort. The corrective group
spent the largest share of time for code isolation, twice as much as that of the enhancive
group, while the reductive group spent much more time on unit test as compared with the
other groups. That is, updating or deleting existing program capabilities requires a high
proportion of effort for isolating the code while adding new program capabilities needs a
large majority of effort for editing code.
The enhancive group spent 53% of total time on editing, twice as much as that
spent by the other groups. At the same time, the corrective group needed 51% of the total
time on program comprehension related activities including task comprehension and code
isolation. Overall, these activities account for 42% of the total time. These results largely
support the COCOMO II’s assumption in the size parameter Software Understanding
(SU) and the previous studies of effort distribution of maintenance tasks which suggest
that program comprehension requires up to 50% of maintainer’s total effort [Boehm
2000b, Basili 1996, Fjelstad and Hamlen 1983].
Predicting Participant’s Maintenance Effort
With the data obtained from the experiment, we built models to explain and
predict time spent by each participant on his or her maintenance tasks. Previous studies
have identified numerous factors that affect the cost of maintenance work. These factors
reflect the characteristics of the platform, program, product, and personnel of
maintenance work [Boehm 2000b]. In the context of this experiment, personnel factors
are most relevant. Other factors are relatively invariant, hence irrelevant, because
participants performed the maintenance tasks in the same environment, same product,
and same working set. Therefore, we examined the models that use only factors that are
relevant to the context of this experiment.
Effort Adjustment Factor (EAF) is the product of the effort multipliers defined in
the COCOMO II model, representing overall effects of the model’s multiplicative factors
on effort. In this experiment, we defined EAF as the product of the programmer
capability (PCAP), language and tools experience (LTEX), and platform experience
(PLEX). We used the same rating values for these cost drivers that are defined in the
COCOMO II Post-Architecture model. We rated PCAP, LTEX, PLEX values based on
participant’s GPA, experience, and pre-test and post-test scores. The numeric values of these
parameters are given in Appendix D. If the rating fell in between two defined rating
levels, we divided the scale into finer intervals by using a linear extrapolation from the
defined values of two adjacent rating levels. This technique allowed specifying more
precise ratings for the cost drivers.
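For illustration, the rating refinement and the EAF product can be written as the small Python sketch below; the multiplier values in the example are hypothetical placeholders rather than published COCOMO II ratings.

    def interpolate_rating(lower_value, upper_value, fraction):
        """Linearly refine a cost-driver multiplier between two adjacent rating
        levels; fraction is 0.0 at the lower level and 1.0 at the next level."""
        return lower_value + fraction * (upper_value - lower_value)

    def eaf(pcap, ltex, plex):
        """Effort Adjustment Factor used in the experiment: PCAP * LTEX * PLEX."""
        return pcap * ltex * plex

    # Hypothetical example: PCAP halfway between a Nominal (1.00) and a High (0.85) value.
    print(eaf(interpolate_rating(1.00, 0.85, 0.5), 1.00, 0.91))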
We investigated the following models
M1: E = β0 + β1 * S1 * EAF
(Eq. 5-1)
M2: E = β0 + (β1 * Add + β2 * Mod + β3 * Del) * EAF
(Eq. 5-2)
M3: E = β0 + β1 * S2 * EAF
(Eq. 5-3)
M4: E = β0 + (β1 * Add + β2 * Mod) * EAF
(Eq. 5-4)
Where,
• E is the total minutes that the participant spends on his or her completed maintenance tasks.
• Add, Mod, and Del represent the number of SLOC added, modified, and deleted by the participant for all completed maintenance tasks, respectively.
• S1 is the total equivalent SLOC that was added, modified, and deleted by the participant, that is, S1 = Add + Mod + Del.
• S2 is the total equivalent SLOC that was added and modified, or S2 = Add + Mod.
• EAF is the effort adjustment factor described above. As we can see in the models’ equations, the SLOC metrics Add, Mod, and Del are all adjusted by EAF, taking into account the capability and experience of the participant.
Models M3 and M4 differ from models M1 and M2 in that they do not include the
variable Del. Thus, differences in the performance of models M3 and M4 versus models
M1 and M2 will reflect the effect of the deleted SLOC metric. The estimates of
coefficients in model M2 determine how this model differs from model M1. This
difference is subtle but significant because M2 accounts for the impact of each type of
SLOC metrics on maintenance effort.
We collected a total of 24 data points, each having SLOC added (Add),
modified (Mod), deleted (Del), actual effort (E), and effort adjustment factor (EAF).
Fitting the 24 data points to models M1, M2, M3, M4 using least squares regression, we
obtained the results presented in Table 5-1.
Table 5-1. Summary of results obtained from fitting the models

              M1                 M2                 M3                  M4
R2            0.50               0.75               0.55                0.64
β0            78.1 (p = 10^-3)   43.9 (p = 0.06)    110.1 (p = 10^-7)   79.1 (p = 4.8*10^-4)
β1            2.2 (p = 10^-4)    2.8 (p = 10^-7)    2.2 (p = 10^-5)     2.3 (p = 10^-6)
β2                               5.3 (p = 10^-5)                        4.6 (p = 2.7*10^-4)
β3                               1.3 (p = 0.02)
MMRE          33%                20%                28%                 27%
PRED(0.3)     58%                79%                75%                 79%
PRED(0.25)    46%                71%                75%                 71%
The p-values are shown next to the estimates of coefficients. In all models, the
estimates of all coefficients but β0 in M2 are statistically significant (p ≤ 0.05). It is
important to note that β1, β2, and β3 in model M2 are the estimates of coefficients of the
Add, Mod, and Del variables, respectively. They reflect variances in the productivity of
three maintenance types that were discussed above. These estimates show that the Add,
Mod, and Del variables have significantly different impacts on the effort estimate of M2.
One modified SLOC affects as much as two added or four deleted SLOC. That is,
modifying one SLOC is much more expensive than adding or deleting it. As shown in
Table 5-1, although Del has the least impact on effort as compared to Add and Mod, the
correlation between Del and the maintenance effort is statistically significant (p = 0.02).
Based on this result, Hypothesis 1 is therefore rejected.
Models M1 and M3, which both use a single combined size parameter, have the
same slope value (β1 = 2.2), indicating that the size parameters S1 and S2 have the same
impact on the effort.
The estimates of the intercept (β0) in the models indicate the average overhead of
the participant’s maintenance tasks. The overhead seems to come from non-coding
activities such as task comprehension and unit test, and these activities do not result in
any changes in source code. Model M3 has the highest overhead (110 minutes), which
seems to compensate for the absence of the deleted SLOC in the model.
The coefficient of determination (R2) values suggest that 75% of the variability in
the effort is predicted by the variables in M2 while only 50%, 55%, and 64% is predicted
by the variables in M1, M3, and M4, respectively. It is interesting to note that
both models M3 and M4, which did not include the deleted SLOC, generated higher R2
values than did model M1. Moreover, the R2 values obtained by models M2 and M4 are
higher than those of models M1 and M3 that use a single combined size metric.
The MMRE, PRED(0.3), and PRED(0.25) values indicate that M2 is the best
performer, and it outperforms M1, the worst performer, by a wide margin. Model M2
produced estimates with a lower error average (MMRE = 20%) than did M1 (MMRE =
33%). For model M2, 79% of the estimates (19 out of 24) have the MRE values of less
than or equal to 30%. In other words, the model produced effort estimates that are within
30% of the actuals 79% of the time. Comparing the performance of M2 and M4, we can
see that the deleted SLOC contributes to improving the performance of M2 over M4 as
these models have the same variables except the deleted SLOC. This result reinforces the
rejection of Hypothesis 1.
5.1.3 LIMITATIONS OF THE EXPERIMENT
As the controlled experiment was performed with student programmers in a lab
setting, it exhibits a number of limitations. Differences in
environment settings between the experiment and real software maintenance may limit
the ability to generalize the conclusions of this experiment. First, professional
maintainers may have more experience than our participants. However, as all of the
participants, with the exception of the senior, were graduate students, and most of the
participants including the senior had industry experience, the difference in experience is
not a major concern. Second, professional maintainers may be thoroughly familiar with
the program, e.g., they are the original programmers. The experiment may not be
generalized for this case although many of our participants were generally familiar with
the program. Third, a real maintenance process may be different in several ways, such as
more maintenance activities (e.g., design change and code inspection) and collaboration
among programmers. In this case, the experiment can be generalized only to the four
investigated maintenance activities that are performed by an individual programmer with no
interaction or collaboration with other programmers.
5.2 DELPHI SURVEY RESULTS
Expert-judgment estimation, as the name implies, is an estimation methodology
that relies on the experts to produce project estimates based on their experience as
opposed to using formal estimation methods. Expert-judgment estimation is useful in the
case where information is not sufficient or a high level of uncertainty exists in the project
being estimated [Boehm 2000b]. Of many expert-judgment techniques introduced,
Wideband Delphi has been applied successfully in determining initial rating values for
the COCOMO-like models, such as COCOMO II.2000 and COSYSMO.
In this study, the Delphi exercise was also employed to reach expert consensus
regarding the initial rating scales of the maintenance effort model. The results are treated
as prior knowledge to be used in the Bayesian and other calibration techniques.
The Delphi form was distributed to a number of software maintainers, project
managers, and researchers who have been involved in maintaining software projects and
in software maintenance research. In the Delphi form, the definitions of parameters, predetermined rating levels, and descriptions of each rating level (see Appendix B) were
given. The form also includes the initial rating scales from COCOMO II.2000. These
initial rating scales were intended to provide participants information on the current
experience about software development cost so that participants can draw similarities and
differences between software development and maintenance to provide estimates for
software maintenance.
Although the survey was distributed to various groups of participants, it turned
out that only participants who were familiar with COCOMO returned the form with their
estimates. Many who declined to participate explained that they did not have sufficient
COCOMO knowledge and experience to provide reasonable estimates. As a result, eight
participants actually returned the form with their estimates.
Table 5-2. Differences in Productivity Ranges

Cost Driver   Delphi Survey   COCOMO II.2000   Difference
PMAT          1.41            1.43             -0.02
PREC          1.31            1.33             -0.02
TEAM          1.30            1.29             0.01
FLEX          1.26            1.26             0.00
RESL          1.39            1.38             0.01
PCAP          1.83            1.76             0.07
RELY          1.22            1.24             -0.02
CPLX          2.29            2.38             -0.09
TIME          1.55            1.63             -0.08
STOR          1.35            1.46             -0.11
ACAP          1.77            2.00             -0.23
PLEX          1.45            1.40             0.05
LTEX          1.46            1.43             0.03
DATA          1.38            1.42             -0.04
DOCU          1.52            1.52             0.00
PVOL          1.46            1.49             -0.03
APEX          1.60            1.51             0.09
PCON          1.69            1.59             0.10
TOOL          1.55            1.50             0.05
SITE          1.56            1.53             0.04
The participants’ years of experience in COCOMO range from three years to over
three decades. Three participants have recently been involved in software maintenance
estimation research, and four participants were involved in industry maintenance projects
at the time of the survey. With the level of experience the participants bring, it is safe to
assert that the estimates provided by them are reliable.
Table 5-3. Rating Values for Cost Drivers from Delphi Survey

Cost Driver                               Acronym  Very Low  Low    Nominal  High   Very High  Extra High  PRs    PR Vars
Equivalent Process Maturity Level         PMAT     7.46      5.98   4.50     2.87   1.24       0.00        1.41   0.0026
Precedentedness of Application            PREC     5.86      4.65   3.59     2.18   1.12       0.00        1.31   0.0009
Team Cohesion                             TEAM     5.65      4.75   3.46     2.38   1.21       0.00        1.30   0.0038
Development Flexibility                   FLEX     4.94      4.06   3.03     2.01   1.00       0.00        1.26   0.0000
Risk Resolution                           RESL     7.22      5.77   4.32     2.85   1.47       0.00        1.39   0.0034
Programmer Capability                     PCAP     1.34      1.19   1.00     0.85   0.73                   1.83   0.0248
Required Software Reliability             RELY     1.22      1.11   1.00     1.03   1.12                   1.22   0.0009
Product Complexity                        CPLX     0.73      0.87   1.00     1.16   1.33       1.67        2.29   0.0529
Execution Time Constraint                 TIME                      1.00     1.10   1.26       1.55        1.55   0.0155
Main Storage Constraint                   STOR                      1.00     1.05   1.13       1.35        1.35   0.0119
Analyst Capability                        ACAP     1.33      1.17   1.00     0.87   0.75                   1.77   0.1432
Platform Experience                       PLEX     1.20      1.10   1.00     0.91   0.83                   1.45   0.0071
Language and Tool Experience              LTEX     1.20      1.09   1.00     0.90   0.82                   1.46   0.0140
Database Size                             DATA               0.89   1.00     1.11   1.22                   1.38   0.0134
Documentation Match to Life-Cycle Needs   DOCU     0.83      0.91   1.00     1.13   1.26                   1.52   0.0255
Platform Volatility                       PVOL               0.89   1.00     1.17   1.31                   1.46   0.0060
Applications Experience                   APEX     1.25      1.13   1.00     0.88   0.79                   1.60   0.0190
Personnel Continuity                      PCON     1.35      1.18   1.00     0.90   0.80                   1.69   0.0994
Use of Software Tools                     TOOL     1.18      1.10   1.00     0.89   0.76                   1.55   0.0138
Multisite Development                     SITE     1.23      1.10   1.00     0.92   0.84       0.79        1.56   0.0143
The results of the Delphi survey are presented in Table 5-3 with the last two
columns showing the productivity range and variances of productivity ranges for each of
the cost drivers. The productivity range represents the maximum impact of the respective
cost driver on effort. As an illustration, changing the ACAP rating from its Very High to
its Very Low value would require 77% additional effort.
Initially, the Delphi survey was planned to be carried out in two rounds. However,
as shown in the last column of Table 5-3, the experts’ estimates converged even in the
first round. Therefore, I decided not to run the second round. One possible explanation
for this early convergence is the participants’ familiarity with the COCOMO model and
its cost drivers. In addition, the initial rating values of the cost drivers were provided,
offering information for participants to compare with COCOMO II estimates.
Table 5-2 shows the differences in the productivity ranges between the COCOMO
II.2000 model and the Delphi survey. The Difference column indicates an increase (if
positive) or a decrease (if negative) in level of impact of the respective cost driver on
effort. As can be seen in Table 5-3 and Table 5-2, a few cost drivers have their
productivity ranges changed significantly. The Product Complexity (CPLX) cost driver still has the
highest influence on effort, but its impact in software maintenance is less than in software
development, indicating that the experts believe that having the legacy system will reduce
the effort spent for maintaining the same system (although the complexity of the system
remains the same). Other cost drivers, namely Analyst Capability (ACAP), Applications
Experience (APEX), Personnel Continuity (PCON), Execution Time Constraint (TIME),
Main Storage Constraint (STOR), and Programmer Capability (PCAP), also show
considerable changes in their productivity ranges from the values of
COCOMO II.2000. Noticeably, ACAP is considered less influential in software
maintenance than in new development, and PCAP is more influential than ACAP. This
result indicates that the programmer is believed to have a greater impact on maintenance cost than the analyst,
while the opposite is seen in COCOMO II.2000.
Table 5-4. RELY Rating Values Estimated by Experts

  Very Low  (slight inconvenience):                1.22
  Low       (low, easily recoverable losses):      1.11
  Nominal   (moderate, easily recoverable losses): 1.00
  High      (high financial loss):                 1.03
  Very High (risk to human life):                  1.12
As the RELY multipliers have been adjusted, both the Very Low and Very High
RELY multipliers are higher than the Nominal 1.0 multiplier. Due to this change,
RELY’s productivity range in software maintenance is much different from that of
COCOMO II. As shown in Table 5-4, a Very Low RELY requires an additional 25% of
effort as compared with a Nominal RELY, while Very High RELY also requires
additional effort, although by a smaller percentage, 12%.
5.3 INDUSTRY SAMPLE DATA
Due to the diverse nature of software maintenance projects, the data collection
process involves establishing and applying consistent criteria for inclusion of completed
maintenance projects. Only projects that satisfy all of the following criteria are collected:
• The project starting and ending dates are clearly specified.
• The general available release that accumulates features maintained or added from the starting date is delivered at the end of the project (see Figure 5-2). In this context, the project is intended to deliver the respective release.
• The project team uses the legacy code of the existing system as a base for the maintenance work.
• The maintenance project type includes error correction and functional enhancement. Reengineering and language-migration projects are not included.
• The project equivalent size must be at least 2,000 SLOC. (This requirement is implied in the COCOMO model.)
Of course, other criteria treated as conditions for collecting complete data are not
included in the list above because they are implicitly required. For example, the source
code versions at the starting and ending dates must be available to allow counting the
added, modified, deleted, and unmodified lines of code and modules.
Figure 5-2. Maintenance Project Collection Range
(Timeline from Release N to Release N+1: the maintenance project for Release N+1 begins at Baseline 1, when the project for Release N+1 starts, and ends at Baseline 2, when Release N+1 is delivered and the project for Release N+2 begins.)
Table 5-5. Maintenance Core Data Attributes

Category: General
  Product        Sanitized name or identifier of the software product. A product may have multiple releases. The first release is a new development-type project, which is not included in the database for software maintenance.
  Release        Release number. Each release is associated with a maintenance project.

Category: Size metrics
  SLOC Adapted       Sum of SLOC added, modified, and deleted in adapted modules.
  SLOC Pre-Adapted   SLOC count of the preexisting modules to be adapted.
  SLOC Added         SLOC count of new modules.
  SLOC Reused        SLOC count of reused modules.
  DM                 The percentage of design modified.
  IM                 The percentage of integration and test needed for the preexisting modules.
  SU                 Software Understanding increment.
  UNFM               Programmer Unfamiliarity with the software product.
  AA                 Assessment and Assimilation increment.

Category: Effort and cost drivers
  Effort         Total time in person-months spent by the maintenance team delivering the release.
  Cost drivers   A set of 20 cost drivers proposed for the maintenance effort model. Generally, six base rating levels are defined: Very Low, Low, Nominal, High, Very High, and Extra High.
  Increment      The increment percentage from the base rating level of each cost driver.
The data collection forms (see Appendix C) and the accompanied guidelines were
distributed to data providers. Table 5-5 lists core data attributes collected for each project.
Other optional attributes, such as application type or domain, project starting and ending
dates, and programming languages used were also included in the data collection forms.
These data attributes are grouped into three categories: project general information, size
measures, and effort and cost drivers. One organization used its own data collection form,
but its core data attributes were consistent with the definitions provided in our forms.
For the cost drivers, an actual rating that falls between two defined rating levels
can be specified, allowing finer-grained increments in the rating scale to more closely
describe the true value of the cost driver. The increment attribute can be specified to
increase the base rating, and the numeric rating for the cost driver is a linear extrapolation
from the base rating and the next defined value.
CM is the percentage of code added, modified, and deleted in the adapted modules. Thus, it is
computed as the ratio between SLOC Adapted and SLOC Pre-Adapted, or CM = SLOC
Adapted / SLOC Pre-Adapted.
In total, 86 releases that met the above criteria were collected. All 86 releases
were completed and delivered in the years between 1997 and 2009. The application
domains of these releases can be classified into data processing, military-ground,
management information systems, utilities, web, and others. Of these data points, 64
came from an organization that is a member of the center’s Affiliates, 14 from a CMMI-level 5
organization in Vietnam, and 8 from a CMMI-level 3 organization in Thailand. The first
organization has been an active COCOMO user and calibrated the model for their
internal use. The other organizations collected and analyzed project size, effort, and other
metrics as a part of their CMMI-compliant processes. The organization in Vietnam
granted me permission to interview project managers and team members to fill out the
data collection forms. For the organization in Thailand, several interviews with the
representative were carried out in order to validate the data points provided by the project
teams. These granted accesses helped alleviate variations that may have been caused by
inconsistency in understanding the data attributes.
The size metrics were collected by using code counting tools to compare
differentials between two baselines of the source program (see Figure 5-2). These size
metrics are based on the logical SLOC definition originally described in Park [1992] and
later adopted into the definition checklist for logical SLOC counts in the COCOMO
model [Boehm 2000b]. According to this checklist, one source statement is counted as
one logical SLOC, and thus, blanks and comments are excluded. One organization
reported using various code counting tools, and the other organizations used the
CodeCount tool (http://sunset.usc.edu/research/CODECOUNT/) for collecting the size metrics. Although using the same counting
definition, variations in the results generated by these SLOC counting tools may exist
[Nguyen 2007]. This problem, which is caused by inconsistent interpretations of the
counting definition, is a known limitation of the SLOC metrics [Jones 1978].
Nonetheless, the SLOC counts among the releases of the same program are highly
consistent as each program used the same SLOC counting tool.
Effort was collected in person-hour and converted into person-month using
COCOMO’s standard 152 person-hours per person-month, avoiding variations created by
different definitions of person-month among the organizations. However, as discussed in
[Chulani 1999], unrecorded overtime can cause variations in the actual effort reported.
Another source of variation originates from the subjective judgment of the cost drivers.
As discussed in [Chulani 1999], these variations can be alleviated by locally calibrating
the model to the organization.
Table 5-6. Summary Statistics of 86 Data Points

                    Average   Median   Max      Min
Size (EKSLOC)       64.1      39.6     473.4    2.8
Effort (PM)         115.2     58.7     1505.1   4.9
Schedule (Months)   10.5      10.2     36.9     1.8
Figure 5-3. Distribution of Equivalent SLOC
(ESLOC Adapted 60.7%, ESLOC Added 31.8%, ESLOC Reused 7.5%.)
The summary statistics of the 86 data points are presented in Table 5-6. The size
in Equivalent SLOC is computed from the size metrics using the proposed sizing method
discussed in Section 4.1.2. The Equivalent SLOC count is different from the final SLOC
count of the delivered release. Due to the inclusion of SLOC Reused and SLOC Pre-Adapted, the final SLOC count of the delivered release is usually higher than Equivalent
KSLOC. As shown in Figure 5-3, on average, the projects spent the majority of effort on
adapting the pre-existing modules, almost a third for adding new modules, and only a
small percentage (7.5%) for testing and integrating the reused modules.
The scatter plot of PM versus Equivalent KSLOC of the dataset shown in Figure
5-4 indicates that the dataset is skewed, with far fewer large projects than small projects, and
and the variability of PM is considerably higher in large projects. Additionally, there is
one extreme project that has effort almost three times as much as that of the second
largest project. These characteristics are often seen in software datasets [Chulani 1999,
Kitchenham and Mendes 2009]. A typical approach to handling these datasets is to apply
the logarithmic transformation on both effort and size metrics. The logarithmic
transformation takes into account both linear and non-linear relationships, and it helps
ensure linear relationships between log-transformed variables. The scatter plot in Figure
5-5 shows that the logarithmic transformation can appropriately resolve these issues that
exist in the data set.
Outliers can distort the linear regression model, affecting the accuracy and
stability of the model. Unfortunately, software data sets often contain outliers. This
problem is caused by inconsistency and ambiguity in the definition of software terms
(i.e., size and effort), imprecision in the data collection process, and the lack of
standardized software processes [Miyazaki 1994, Basili 1996, Chulani 1999]. To handle
possible outliers in software data sets, several techniques have been used often, including
building robust regression models, transforming the data, and identifying and eliminating
outliers from the rest of the data set [Miyazaki 1994, Kitchenham 2009]. To some extent,
all of these techniques were used in this study.
Figure 5-4. Correlation between PM and EKSLOC
(Scatter plot of PM, 0 to 1600, against EKSLOC, 0 to 500.)
Figure 5-5. Correlation between log(PM) and log(EKSLOC)
(Scatter plot of log(PM), 0 to 3.5, against log(EKSLOC), 0 to 3.)
The productivity, measured as Equivalent KSLOC divided by PM, can be used to
detect extreme data points. The raw productivity measure alone is not sufficient, however,
because the combined impact of the cost drivers may outweigh the variability of the
productivity; the productivity therefore needs to be adjusted for the effects of the cost
drivers. Figures 5-6 and 5-7 show the adjusted productivities for the 86 releases, indicating
six releases with unusually high productivities. In these releases, the adjusted productivities
exceed 2.0 Equivalent KSLOC/PM, while in most releases they are below 0.5 Equivalent
KSLOC/PM. A closer inspection of these data points revealed that they came from the same
program and had Very High or Extra High ratings for all personnel factors (ACAP, PCAP,
APEX, PLEX, and LTEX). These extreme ratings are unusual, and thus these data points
were eliminated; the final data set with the remaining 80 releases is used for analysis and
calibration in this study.
Figure 5-6. Adjusted Productivity for the 86 Releases
(Adjusted productivity plotted by release number.)
Figure 5-7. Adjusted Productivity Histogram for the 86 Releases
5.4 MODEL CALIBRATIONS AND VALIDATION
This section describes the results of the calibrations that were completed using the
calibration techniques including OLS, the Bayesian, and the constrained regression
techniques. Eighty data points of the final data set were used in these calibrations.
5.4.1 THE BAYESIAN CALIBRATED MODEL
The Bayesian analysis was used to calibrate the model cost drivers and constants.
The process involves using the prior information as initial values for the cost drivers,
fitting the sample data by linear regression, and applying the Bayesian equations (Eq.
3-3) and (Eq. 3-4) to compute the estimates of coefficients and their variances.
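A sketch of this combination step is shown below, assuming the prior mean vector and prior precision matrix obtained from the Delphi results are already available; the update shown is the standard normal-conjugate form, given here only for illustration since Equations (Eq. 3-3) and (Eq. 3-4) appear earlier in the dissertation and are not reproduced in this section.

    import numpy as np

    def bayesian_posterior(X, y, prior_mean, prior_precision):
        """Combine regression (sample) information with expert (prior) information.

        X, y: the log-linear design matrix and log-effort vector of Eq. 4-12.
        Weights each source by its precision; illustrative, not the exact tooling used.
        """
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        dof = X.shape[0] - X.shape[1]
        s2 = np.sum((y - X @ beta_hat) ** 2) / dof            # residual variance
        sample_precision = X.T @ X / s2
        post_cov = np.linalg.inv(sample_precision + prior_precision)
        post_mean = post_cov @ (sample_precision @ beta_hat + prior_precision @ prior_mean)
        return post_mean, post_cov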
Table 5-7. Differences in Productivity Ranges between Bayesian Calibrated Model and COCOMO II.2000

Cost Driver   Bayesian Calibrated   COCOMO II.2000   Difference
PMAT          1.41                  1.43             -0.03
PREC          1.31                  1.33             -0.02
TEAM          1.29                  1.29             0.01
FLEX          1.26                  1.26             -0.01
RESL          1.39                  1.38             0.01
PCAP          1.79                  1.76             0.02
RELY          1.22                  1.24             -0.02
CPLX          2.22                  2.38             -0.16
TIME          1.55                  1.63             -0.08
STOR          1.35                  1.46             -0.11
ACAP          1.61                  2.00             -0.39
PLEX          1.44                  1.40             0.04
LTEX          1.46                  1.43             0.03
DATA          1.36                  1.42             -0.06
DOCU          1.53                  1.52             0.01
PVOL          1.46                  1.49             -0.04
APEX          1.58                  1.51             0.08
PCON          1.49                  1.59             -0.10
TOOL          1.55                  1.50             0.05
SITE          1.53                  1.53             0.01

Constants
A             3.16                  2.94             0.22
B             0.78                  0.91             -0.13
Table 5-8 summarizes the rating values of the 20 cost drivers calibrated by the
Bayesian analysis using the 80 releases of the final data set. Undefined rating scales for
the cost drivers at the corresponding levels are indicated by blank cells. It
should be noted that the rating value for the Nominal rating of the effort multipliers is
1.0, which implies that the nominal estimated effort is not adjusted by the effort
multipliers.
Hypothesis 2 states that the productivity ranges of the cost drivers in the
COCOMO II model for maintenance are different from those of COCOMO II.2000.
Different approaches used to calibrate the COCOMO II model for maintenance result in
different sets of productivity ranges. For testing this hypothesis, the productivity ranges
calibrated through the Bayesian analysis were used to compare with those of COCOMO
II.2000. This comparison is valid since the Bayesian analysis was also applied to
calibrating COCOMO II.2000’s productivity ranges.
Table 5-7 presents the productivity ranges of the COCOMO II maintenance model calibrated
by the Bayesian approach and of COCOMO II.2000, and the differences between the two
models. For the effort multipliers, the productivity range is the ratio between the largest
and the smallest rating value. The productivity ranges of the scale factors are for
100-EKSLOC projects and are computed as PR_SFi = 100^(B + 0.01 * SFimax) / 100^B,
where B and SFimax are the constant B and the maximum rating value of scale factor i.
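These productivity-range calculations can be expressed compactly in Python as follows; the PMAT example reuses the calibrated values quoted in this section (B = 0.78, maximum PMAT rating 7.41), and small differences from the tabulated numbers are due to rounding of the published rating values.

    def pr_effort_multiplier(rating_values):
        """Productivity range of an effort multiplier: max rating / min rating."""
        return max(rating_values) / min(rating_values)

    def pr_scale_factor(sf_max, b, eksloc=100.0):
        """Productivity range of a scale factor for a project of `eksloc` KSLOC:
        100^(B + 0.01 * SFmax) / 100^B."""
        return eksloc ** (b + 0.01 * sf_max) / eksloc ** b

    print(round(pr_scale_factor(7.41, 0.78), 2))                           # about 1.41 (PMAT)
    print(round(pr_effort_multiplier([1.32, 1.18, 1.00, 0.86, 0.74]), 2))  # about 1.78 (PCAP)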
Table 5-8. Rating Values for Cost Drivers from Bayesian Approach

Cost Driver                               Acronym  Very Low  Low    Nominal  High   Very High  Extra High
Equivalent Process Maturity Level         PMAT     7.41      5.94   4.47     2.85   1.23       0.00
Precedentedness of Application            PREC     5.85      4.65   3.59     2.18   1.12       0.00
Team Cohesion                             TEAM     5.61      4.72   3.44     2.36   1.20       0.00
Development Flexibility                   FLEX     4.94      4.06   3.03     2.01   1.00       0.00
Risk Resolution                           RESL     7.17      5.73   4.29     2.83   1.46       0.00
Programmer Capability                     PCAP     1.32      1.18   1.00     0.86   0.74
Required Software Reliability             RELY     1.25      1.11   1.00     1.00   1.08
Product Complexity                        CPLX     0.74      0.87   1.00     1.15   1.32       1.64
Execution Time Constraint                 TIME                      1.00     1.10   1.26       1.55
Main Storage Constraint                   STOR                      1.00     1.05   1.13       1.35
Analyst Capability                        ACAP     1.27      1.14   1.00     0.89   0.79
Platform Experience                       PLEX     1.20      1.10   1.00     0.91   0.83
Language and Tool Experience              LTEX     1.20      1.09   1.00     0.90   0.82
Database Size                             DATA               0.89   1.00     1.11   1.22
Documentation Match to Life-Cycle Needs   DOCU     0.83      0.91   1.00     1.13   1.26
Platform Volatility                       PVOL               0.89   1.00     1.17   1.30
Applications Experience                   APEX     1.25      1.13   1.00     0.88   0.79
Personnel Continuity                      PCON     1.26      1.14   1.00     0.92   0.84
Use of Software Tools                     TOOL     1.18      1.10   1.00     0.89   0.76
Multisite Development                     SITE     1.22      1.10   1.00     0.92   0.84       0.80
It is clear that the differences in the productivity ranges of the Product Complexity
(CPLX), Main Storage Constraint (STOR), Analyst Capability (ACAP), and Personnel
Continuity (PCON) cost drivers are most noticeable. The impact of CPLX on effort
decreased by 0.16 in productivity range, from 2.38 to 2.22; in other words, the maximum
effort penalty incurred by moving the CPLX rating from its lowest to its highest level is
smaller in maintenance than in development. The CPLX cost driver is thus seen to be less
important in maintenance than in development, which is similar to the experts’ estimates.
The impact of ACAP was also reduced significantly, by 0.39 (from 2.00 to 1.61), making it
less influential than PCAP. As visualized in Figure 5-8, PCAP is the second most influential
cost driver, behind only CPLX. This result matches the experts’ estimates. From the changes
in the productivity ranges of the cost drivers, one can see that the capability and experience
of the programmer have higher impacts on the maintenance productivity than those of the
analyst. Based on this result, Hypothesis 2 is therefore supported.
Figure 5-8. Productivity Ranges Calibrated by the Bayesian Approach
(Bar chart of the productivity ranges listed in Table 5-7, ordered from RELY (1.22) at the low end to PCAP (1.79) and CPLX (2.22) at the high end.)
The constant B of the scale exponent decreased significantly as compared with
that of COCOMO II.2000. If the scale factors are all rated Nominal, the scale exponent is
0.97. Being slightly below 1.0, this value suggests economies rather than diseconomies of
scale in the model. However, taking the variance into account and using the t-distribution,
the 95% confidence interval for the scale exponent E with 58 degrees of freedom is
0.86 ≤ E ≤ 1.08, so neither economies nor diseconomies of scale can be ruled out.
Table 5-9. Estimation Accuracies Generated by the Bayesian Approach

                  PRED(0.30)  PRED(0.25)  PRED(0.50)  MMRE  MdMRE
Fitting           51%         41%         71%         48%   30%
Cross-validation  46%         39%         71%         45%   31%
The first row in Table 5-9 shows the fitting estimation accuracies of the model
calibrated by the Bayesian approach, using the 20 cost drivers and the 80-release data set,
and the second row shows the estimation accuracies generated using the LOOC cross-validation
approach. As can be seen in the table, the LOOC cross-validation produced lower accuracies
than the fitting approach because it separates the testing data point from the training set.
The estimates of two releases had MRE errors of 200% or higher, making the MMRE values
noticeably high in both cases, much higher than the medians, the MdMRE values. A closer look
at these data points indicates that they have high productivities, but the information
available in the data set does not reveal irregularities that would deem them outliers. If
these data points were removed from the model, the MMRE value would decrease to 42%, but the
PRED(0.30) value would remain unaffected.
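For reference, the accuracy measures used throughout this chapter can be computed as in the following sketch (an illustration with made-up numbers, not the dissertation's evaluation code): MRE is the relative error of a single estimate, MMRE and MdMRE are its mean and median over the sample, and PRED(x) is the fraction of estimates whose MRE does not exceed x.

from statistics import mean, median

def mre(actual, estimate):
    return abs(actual - estimate) / actual

def mmre(actuals, estimates):
    return mean(mre(a, e) for a, e in zip(actuals, estimates))

def mdmre(actuals, estimates):
    return median(mre(a, e) for a, e in zip(actuals, estimates))

def pred(actuals, estimates, level=0.30):
    errors = [mre(a, e) for a, e in zip(actuals, estimates)]
    return sum(err <= level for err in errors) / len(errors)

# Illustrative values only, not the 80-release data set
actuals, estimates = [100, 80, 60, 45, 30], [90, 95, 58, 60, 29]
print(pred(actuals, estimates, 0.30), round(mmre(actuals, estimates), 2))

In the LOOC cross-validation reported in the second row of Table 5-9, the model is refit once per release with that release held out, and these metrics are then computed on the held-out estimates.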
109
Table 5-10. Estimation Accuracies of COCOMO II.2000 on the Data Set

PRED(0.30)  PRED(0.25)  MMRE  MdMRE
38%         31%         56%   37%
Hypothesis 3 proposes that the COCOMO II model for maintenance outperforms
the COCOMO II.2000 model. To test this hypothesis, the COCOMO II.2000 model was
used to estimate the effort of the same 80 releases, and the resulting estimation accuracies
in terms of MMRE and PRED were compared with those of the COCOMO II model for
maintenance. It should be noted that the COCOMO II.2000 model's cost drivers and
sizing method differ from those of the proposed model.
The estimation accuracies generated by the COCOMO II.2000 model are shown in
Table 5-10. It is clear that the COCOMO II.2000 model performs poorly on the data set
of 80 releases, producing estimates within 30% of the actuals only 38% of the time, with
the maximum relative error (MRE) reaching 346%. The proposed model calibrated using the
Bayesian approach outperforms COCOMO II.2000 by a wide margin, a 34% relative improvement
in PRED(0.30). COCOMO II.2000 also underperforms the proposed model in terms of MMRE,
56% versus 48%. Thus, Hypothesis 3 is supported.
5.4.2 THE CONSTRAINED REGRESSION CALIBRATED MODELS
The constrained regression approaches (CMRE, CMSE, and CMAE) were used to
calibrate the model's constants and cost drivers. The only difference among these
approaches is the objective function: CMRE minimizes the sum of relative errors, CMSE
the sum of squared errors, and CMAE the sum of absolute errors. As these approaches
optimize their objective functions subject to fixed constraints, cost drivers can be
pruned from the model.
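The following Python sketch illustrates the three objective functions on a linearized COCOMO-style model. It is an illustration only: the non-negativity bound stands in for the actual constraints used in the dissertation (detailed in [Nguyen 2008]), and the design matrix X is assumed to hold the log-transformed size and cost-driver terms.

import numpy as np
from scipy.optimize import minimize

def fit_constrained(X, y_log_effort, objective="MRE"):
    actual = np.exp(y_log_effort)

    def loss(beta):
        est = np.exp(X @ beta)                # back-transform to effort
        if objective == "MRE":                # CMRE: sum of relative errors
            return np.sum(np.abs(actual - est) / actual)
        if objective == "MSE":                # CMSE: sum of squared errors
            return np.sum((actual - est) ** 2)
        return np.sum(np.abs(actual - est))   # CMAE: sum of absolute errors

    beta0 = np.zeros(X.shape[1])
    bounds = [(0, None)] * X.shape[1]         # assumed stand-in constraint
    return minimize(loss, beta0, bounds=bounds, method="L-BFGS-B").x

A coefficient driven to the boundary of its constraint corresponds to a cost driver that is effectively pruned from the model, which is how the three approaches can end up retaining different driver subsets.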
Table 5-11. Retained Cost Drivers of the Constrained Models

Model  # Cost Drivers  Retained Cost Drivers
CMRE   12              ACAP, APEX, CPLX, DOCU, FLEX, LTEX, PCAP, PCON, PLEX, SITE, STOR, TOOL
CMSE   14              ACAP, APEX, CPLX, DATA, DOCU, FLEX, LTEX, PCAP, PCON, PLEX, SITE, STOR, TIME, TOOL
CMAE   12              APEX, CPLX, DATA, DOCU, FLEX, LTEX, PCAP, PCON, PLEX, SITE, TIME, TOOL
As shown in Table 5-11, of the 20 cost drivers in the initial model, twelve were
retained by CMRE and CMAE and fourteen by CMSE. Each approach retained a different set
of cost drivers, but all three retained the same ten: APEX, CPLX, DOCU, FLEX, LTEX,
PCAP, PCON, PLEX, SITE, and TOOL. Of the five scale factors, only Development
Flexibility (FLEX) remains in all models. This result is quite unexpected and
counterintuitive, as FLEX was rated the second least influential cost driver by the experts.
In general, the difference in the sets of retained cost drivers is explained by the
fact that each approach optimizes a different objective and thus reaches a different
optimal solution. These optimal solutions are mainly driven by the sample data set, and
they are highly affected by the imprecision and variability in that data set.
111
Table 5-12. Estimation Accuracies of Constrained Approaches

Approach  PRED(0.30)  PRED(0.25)  PRED(0.50)  MMRE  MdMRE
CMRE      60%         56%         76%         37%   23%
CMSE      51%         43%         79%         39%   30%
CMAE      58%         54%         71%         42%   23%
Table 5-13. Estimation Accuracies of Constrained Approaches using LOOC Cross-validation

Approach  PRED(0.30)  PRED(0.25)  PRED(0.50)  MMRE  MdMRE
CMRE      51%         40%         73%         43%   30%
CMSE      41%         35%         69%         49%   35%
CMAE      51%         48%         67%         52%   28%
As illustrated in Figure 5-9, which shows the productivity ranges generated by the
CMRE approach, eight cost drivers were pruned from the model: four of the five scale
factors (all but FLEX) and four effort multipliers (PVOL, DATA, TIME, and RELY),
leaving the 12 most relevant cost drivers in the model. The Storage Constraint (STOR)
cost driver appears to be the most influential driver, with a productivity range of 2.81,
and the personnel factors except LTEX are among the least influential. This result appears
counterintuitive since it contradicts the Delphi results, which indicate that personnel
factors have the most significant impact on effort and that STOR is not considered highly
influential. It should be noted that, unlike the Bayesian approach, which adjusts the
data-driven estimates with the experts' estimates, the constrained regression techniques
rely heavily on the sample data to generate estimates. Thus, like multiple linear
regression, they suffer from correlation between cost drivers and from the lack of
dispersion in the sample data set. These problems are further analyzed in Section 5.4.3.
Figure 5-9. Productivity Ranges Generated by CMRE
(Productivity ranges, in increasing order: PVOL 1.00, DATA 1.00, TIME 1.00, RELY 1.00, RESL 1.00, TEAM 1.00, PREC 1.00, PMAT 1.00, ACAP 1.10, DOCU 1.29, SITE 1.33, APEX 1.37, PCAP 1.45, PCON 1.50, PLEX 1.60, TOOL 2.21, LTEX 2.33, FLEX 2.36, CPLX 2.68, STOR 2.81.)
The estimation accuracies produced by the three constrained regression approaches
are listed in Table 5-12. Considering the PRED(0.30), PRED(0.25), and MMRE metrics,
it is clear that CMRE and CMAE outperform both COCOMO II.2000 and the Bayesian
calibrated model, improving PRED(0.30) from the 38% produced by COCOMO II.2000 to
60%, a 58% improvement. It is important to note that CMSE's PRED(0.30) and
PRED(0.25) values are as low as those of the Bayesian calibrated model. A possible
reason for this similarity is that both CMSE and the Bayesian calibrated model optimize
the same quantity, the sum of square errors. This quantity is dominated by the large
absolute errors of large projects and thus tolerates relatively high square errors in
small projects, resulting in high MRE errors in the estimates of these small projects
[Nguyen 2008]. The LOOC cross-validation results in Table 5-13 further confirm the
improvement in the performance of the constrained models over the Bayesian calibrated
model. This observation is consistent with the results reported in [Nguyen 2008], where
the performance of these approaches was compared with the Lasso, Ridge, Stepwise, and
OLS regression techniques using other COCOMO data sets.
5.4.3 REDUCED PARAMETER MODELS
Strong correlations and a lack of dispersion in cost drivers can negatively affect
the stability of an estimation model built on the assumptions of multiple linear
regression. Analyzing the correlation and dispersion properties of the data set may
therefore reveal redundant drivers that contribute little to improving the accuracy of the
estimation model. Similar to previous COCOMO studies [Chulani 1999, Valerdi 2005], this
study investigated a reduced model with a smaller set of cost drivers by aggregating cost
drivers that are highly correlated or lack dispersion. Two cost drivers are aggregated if
their correlation coefficient is 0.65 or greater.
Table 5-14 shows the correlation matrix of the highly correlated drivers, all of which
belong to the personnel factors. ACAP and PCAP were aggregated into Personnel Capability
(PERS), and APEX, LTEX, and PLEX into Personnel Experience (PREX). These aggregations are
the same as those of the COCOMO Early Design model, except that PCON was not aggregated
into PREX.
114
Table 5-14. Correlation Matrix for Highly Correlated Cost Drivers

       ACAP   PCAP   APEX   LTEX   PLEX
ACAP   1.00   0.74   0.44   0.58   0.49
PCAP          1.00   0.39   0.53   0.50
APEX                 1.00   0.72   0.58
LTEX                        1.00   0.75
PLEX                               1.00
Figure 5-10 and Figure 5-11 show that the Execution Time Constraint (TIME) and
Main Storage Constraint (STOR) ratings are positively skewed, with 82 and 76 of the 86
data points rated Nominal, respectively. Moreover, while the defined ratings for TIME and
STOR range from Nominal to Very High, no data point was rated High or Very High for TIME,
and none was rated Very High for STOR. This lack of dispersion can result in high variances
in the coefficients of these drivers, since the impact of the non-Nominal ratings cannot be
determined in their absence. One explanation for this lack of dispersion is that new
technologies and advancements in storage and processing facilities have made TIME and STOR
insignificant for systems that are not tightly constrained by storage and processing
limitations. In addition, all of the projects in the data set are non-critical,
non-embedded systems, which typically do not depend on such constraints. Thus, the TIME
and STOR cost drivers were eliminated from the reduced model. In all, seven cost drivers
were removed and two aggregate drivers (PERS and PREX) were added, resulting in a reduced
model with 15 cost drivers, compared to the full model with 20 cost drivers.
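A sketch of this kind of screening is shown below; it assumes the ratings are available as a pandas DataFrame with one numeric column per cost driver and one row per release, and the 90% dominance cut-off used to flag a lack of dispersion is an illustrative choice rather than a threshold taken from the dissertation.

import pandas as pd

def screen_drivers(ratings: pd.DataFrame, corr_threshold=0.65, dominance=0.90):
    corr = ratings.corr()
    high_corr = [
        (a, b, round(corr.loc[a, b], 2))
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if abs(corr.loc[a, b]) >= corr_threshold          # aggregation candidates
    ]
    low_dispersion = [
        col for col in ratings.columns
        if ratings[col].value_counts(normalize=True).iloc[0] >= dominance
    ]                                                      # e.g., TIME and STOR above
    return high_corr, low_dispersion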
115
Figure 5-10. Distribution of TIME
Figure 5-11. Distribution of STOR
(Bar charts of the number of data points at each rating level, Nominal through Very High.)
Table 5-15. Estimation Accuracies of Reduced Calibrated Models

Approach  # Cost Drivers  PRED(0.30)  PRED(0.25)  PRED(0.50)  MMRE  MdMRE
Bayesian  15              51%         38%         70%         36%   22%
CMRE      9               59%         53%         74%         36%   22%
CMSE      9               54%         46%         70%         37%   27%
CMAE      10              60%         51%         73%         38%   22%
The reduced model was calibrated to the data set of 80 releases using the
Bayesian and constrained regression approaches. The estimation accuracies of the
reduced calibrated models are presented in Table 5-15. As can be seen by comparing
the results in Table 5-9, Table 5-12, and Table 5-15, the performance of the reduced
calibrated models is as good as that of the models calibrated using all 20 initial cost
drivers. One difference was observed: the reduced model calibrated by the Bayesian
approach has a smaller MMRE value than the corresponding full model.
5.4.4 LOCAL CALIBRATION
A series of stratifications by organization and by program was performed using four
different approaches:

• Productivity index. The project effort estimate is determined by dividing project size by a productivity index, i.e., Effort = Size / Productivity Index, where the productivity index is the average productivity of past projects or an industry census. It is the simplest model for estimating effort given that the productivity index is known. This model is often referred to as the baseline, and a more sophisticated model is only useful if it outperforms the baseline model [Jorgensen 1995].

• Simple linear regression between effort and size. Simple linear regression is used to derive the model whose response and predictor are the logarithmic transformations of PM and equivalent KSLOC, respectively (see the sketch after this list).

• Bayesian analysis. The full model with all the drivers calibrated using the Bayesian analysis was used as a basis to fit local data sets and compute the constants A and B in the COCOMO effort estimation model (Eq. 4-11). The resulting local models differ from each other only in the constants A and B. The full model, instead of the reduced model, was used since it allows organizations to select from the full model the cost drivers most suitable for their processes.

• Constrained regression with CMRE. This local calibration approach used the set of 12 cost drivers and the constants A and B obtained from the constrained regression with CMRE. The constants A and B were further calibrated to local data sets based on organization and program.
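The sketch below shows the first two approaches in the list above, with effort in person-months and size in equivalent KSLOC; the function and variable names are illustrative, not taken from the dissertation's tooling.

import numpy as np

def productivity_index_estimate(size, past_sizes, past_efforts):
    # Effort = Size / Productivity Index, with the index taken as the
    # average productivity (size per person-month) of past releases.
    productivity_index = np.mean(np.array(past_sizes) / np.array(past_efforts))
    return size / productivity_index

def loglog_regression_estimate(size, past_sizes, past_efforts):
    # Simple linear regression of log(PM) on log(EKSLOC), back-transformed.
    b1, b0 = np.polyfit(np.log(past_sizes), np.log(past_efforts), 1)
    return np.exp(b0 + b1 * np.log(size))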
Using the approaches above, the following eight local calibrations were investigated:

• C1. The productivity index of each organization was computed and used to estimate the projects within the same organization.
• C2. The simple linear regression was applied to each individual organization.
• C3. Stratification by organization using the Bayesian approach.
• C4. Stratification by organization using the constrained approach CMRE.
• C5. The productivity index of the previous releases in each program was used to determine the effort of the current release in the same program.
• C6. The simple linear regression was applied to each individual program that has five or more releases.
• C7. Stratification by program using the Bayesian approach.
• C8. Stratification by program using the constrained approach CMRE.
118
To avoid overfitting, which is caused in part by a lack of degrees of freedom,
each parameter should be calibrated with at least five data points [Boehm 2000b]. As the
local calibrations above have at most one parameter, only the organizations and programs
with five or more releases were included. As a result, all 80 releases were used for the
calibrations by organization (C1–C4) across three organizations, and 45 releases were used
for the calibrations by program (C5–C8) across six programs. To enable cross-comparisons
between the organization-based and program-based calibrations, the local models for C1–C4
were also tested on the same 45 releases used in the program-based calibrations.
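For the Bayesian-based and CMRE-based stratifications, only the constants A and B are refit locally while the cost-driver values stay fixed. A minimal sketch of that step, assuming the standard COCOMO II form PM = A × Size^(B + 0.01 × ΣSF) × ΠEM for Eq. 4-11, is:

import numpy as np

def calibrate_A_B(sizes, efforts, sf_sums, em_products):
    # sizes: EKSLOC; efforts: actual PM; sf_sums: sum of scale-factor values;
    # em_products: product of effort multipliers, one entry per release.
    x = np.log(np.asarray(sizes, dtype=float))
    y = (np.log(np.asarray(efforts, dtype=float) / np.asarray(em_products, dtype=float))
         - 0.01 * np.asarray(sf_sums, dtype=float) * x)    # leaves log(A) + B*log(Size)
    B, logA = np.polyfit(x, y, 1)
    return np.exp(logA), B

Applying this refit per organization corresponds to the C3/C4-style calibrations, and applying it per program to the C7/C8-style calibrations.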
Table 5-16. Stratification by Organization

Calibration  #Releases  PRED(0.30)  PRED(0.25)  MMRE  MdMRE
C1           80         48%         40%         44%   37%
C2           80         35%         34%         50%   42%
C3           80         59%         54%         38%   23%
C4           80         64%         62%         34%   21%
Table 5-17. Stratification by Program

Calibration  #Releases  PRED(0.30)  PRED(0.25)  MMRE  MdMRE
C5           45         64%         53%         27%   24%
C6           45         69%         64%         25%   20%
C7           45         80%         71%         22%   18%
C8           45         79%         72%         21%   16%
119
Table 5-18. Stratification by Organization on 45 Releases

Calibration  #Releases  PRED(0.30)  PRED(0.25)  MMRE  MdMRE
C1           45         40%         40%         44%   37%
C2           45         31%         27%         53%   40%
C3           45         63%         55%         40%   24%
C4           45         60%         58%         34%   22%
It can be seen from Table 5-16 and Table 5-17 that the calibrations based on the
Bayesian and CMRE models outperform both the productivity index and simple linear
regression approaches. C3 and C4, whose performances are comparable, are more
favorable than C1 and C2 where stratification by organization is concerned. Similarly, C7
and C8 outperform C5 and C6 when calibrating to each individual program. It is clear
that the effects of the cost drivers in the Bayesian and CMRE models contribute positively
to improving the performance of the models. Even in the same program, the cost drivers'
ratings change from one release to another. For example, if the attrition rate is low,
the programmers are more familiar with the system and more experienced with the languages
and tools used in the program. If the attrition rate is high, on the other hand, the
overall programmer capability and experience could be lower than in the previous release.
Because of these effects, the productivity of the releases of the same program does not
remain the same.
The program-based calibrations appear to generate better estimates than the
organization-based calibrations. This finding is further confirmed when the local models
in C1, C2, C3, and C4 were tested on the same data set of 45 releases used in the
program-based calibrations (see Table 5-18). One explanation for the superiority of the
program-based calibrations is that possible variations in application domains,
technologies, and software processes were minimized when considering only releases in
the same program. Another explanation is that variations in the data collection process
are reduced because the same code counting tool was used and the same data collector
rated the cost drivers for all releases in the same program.
The best performers are the local models that were calibrated using the Bayesian
and CMRE models on each individual program. They produce estimates within 30%
of the actuals 80% of the time. This result is comparable to the COCOMO II.2000
model when it was stratified by organization [Chulani 1999].
Comparing the accuracies in Tables 5-9, 5-12, 5-13, 5-16, 5-17, and 5-18, one can
see that the generic models calibrated using the Bayesian and CMRE approaches can be
further improved by calibrating them to each individual organization and program.
Moreover, these generic models are more favorable than the productivity index and the
simple linear regression in the stratification by organization, indicating that the generic
models may be useful in the absence of sufficient data for local calibration. This finding
confirms the previous COCOMO studies by Chulani [1999], Menzies et al. [2006], and
Valerdi [2005] and other studies, such as Maxwell et al. [1999], which suggest that local
calibration improves the performance of a software estimation model.
Hypothesis 4 states that the COCOMO II model for maintenance outperforms the
simple linear regression and the productivity index method. As shown above, this
hypothesis is supported.
121
5.5 SUMMARY
The data set of 80 releases from three organizations was used to validate the proposed
extensions to the size and effort estimation models. The effort estimation model was
calibrated to the data set using a number of techniques, including linear regression,
Bayesian analysis, and constrained regression. Local calibrations to organization and
program were also performed and compared with the less sophisticated approaches, the
productivity index and simple linear regression. The best model, which was calibrated
using the releases of each individual program, can produce estimates with PRED(0.30) =
80% and MMRE = 0.22, outperforming the less sophisticated but commonly used
productivity index and simple linear regression.
122
Chapter 6. CONTRIBUTIONS AND FUTURE WORK
This dissertation has presented and evaluated the COCOMO II model for
maintenance, an extension to COCOMO II for better estimating the cost of software
maintenance projects. The main goal of this work was to improve COCOMO II’s
estimation performance on software maintenance projects by enhancing its sizing method
and effort estimation model. The best resulting model can generate estimates within 30%
of the actuals 80% of the time when it is calibrated to each individual program. It is fair
to claim that this goal was achieved.
This chapter discusses the contributions of this work and directions for future
work.
6.1 CONTRIBUTIONS
This work makes a number of contributions to software estimation research and
practice. First, a unified method for measuring the size of software reuse and
maintenance has been proposed, allowing the maintainer to measure the size of both
software reuse and maintenance work using the same parameters and formulas. This
extension is important because it is well known that there is no clear distinction
between software reuse and maintenance. The method also allows measuring the actual
size of the completed reuse and maintenance work based on the delivered source code,
making a connection between the estimated and the actual size. As shown above, the
locally calibrated model can generate more accurate estimates. Thus, the ability to
determine the actual equivalent size of maintenance work for the calibration process is a
key step toward improving software estimation accuracy.
Second, an adjusted set of cost drivers and their rating values was derived from an
industry data set of 80 releases from three organizations. This set of cost drivers was
incorporated into the COCOMO II effort model for software maintenance. The model can be
used as is to estimate maintenance projects in organizations where data is insufficient
for calibration, or as a base model to be locally calibrated to individual organizations
or programs.
Third, one important feature of COCOMO II is that it gives software project
managers the capability to perform tradeoff analyses of the cost implications of
software decisions. With the set of cost drivers and rating scales for software
maintenance obtained from this work, software maintainers can perform the same
tradeoff analyses regarding the cost, functionality, performance, and quality of the
maintenance work.
Fourth, three constrained regression approaches for building and calibrating cost
estimation models were introduced and evaluated as a broader part of this dissertation.
From the analyses of these approaches on three different COCOMO data sets,
COCOMO 81 [Boehm 1981], COCOMO II [Boehm 2000], and the data set of 80
maintenance releases, it can be asserted that the constrained regression approaches
CMRE and CMAE are favorable in terms of improving estimation performance when
calibrating the COCOMO model.
124
Fifth, this work provides evidence that the size of deleted statements in a modified
module is a significant size measure for estimating the cost of software maintenance. This
finding implies that the cost of deleting source code in a modified module is neither
minimal nor zero. Thus, it is suggested that estimation models using SLOC include the
deleted SLOC metric in their size measures.
Finally, evidence was presented for the superiority of the locally calibrated model over
the generic model in terms of estimation accuracy. This evidence supports previous
studies (e.g., [Chulani 1999, Menzies et al. 2006, Valerdi 2005]) that claim local
calibration improves estimation accuracy, whereas other studies (e.g., [Briand et al. 1999,
Mendes et al. 2005]) report contradictory results in this regard. Moreover, the COCOMO II
model for maintenance was shown to outperform less sophisticated models, such as the
productivity index and simple linear regression. This finding indicates that organizations
should invest in models more sophisticated than these simple ones if estimation accuracy
is a concern.
6.2 FUTURE WORK
There are a number of directions for future work worth exploring. These involve
further calibrating the model, extending it to other types of maintenance and to iterative
and agile software development methods, and improving the constrained regression
approaches.
• Software estimation modeling is a continual process. As advancements in technologies, software processes, languages, and supporting tools have accelerated, the productivity of software projects increases over the years [Nguyen 2010b]. The software estimation model therefore has to be recalibrated and extended to reflect software development and maintenance practice more closely. Moreover, the three organizations from which the data set was collected are not representative of the software industry, so the model calibrated using this data set may not fully reflect the true practice of software maintenance. Considering this work as a step toward making COCOMO a better model for estimating the cost of software maintenance, the maintenance model should be calibrated with data points from a variety of sources. Understandably, collecting software cost-related metrics from industry is difficult: because of the sensitivity of these metrics, organizations are reluctant to share them. It is even more challenging to collect maintenance data, since it requires more detailed sizing metrics and organizations often do not have a software maintenance process as rigorous as its development counterpart to mandate data collection and analysis.
• This work has shown that performance improves when the model is calibrated to a program, essentially capitalizing on features common to all releases in the same program, such as application domain, programming language, and operating platform. Building domain-specific, language-specific, or platform-specific models by calibrating the generic model with data sets that share the same feature(s) would potentially improve estimation accuracy further.
126
• The initial cost driver ratings were estimated by eight experts who had considerable experience with COCOMO. It would be desirable to obtain inputs from a more diverse group of participants, including those who are not familiar with COCOMO. To be effective, some training on COCOMO (e.g., its formulas, assumptions, and definitions of the cost drivers) would be needed to give participants sufficient knowledge before they complete the survey.
• The scope of this work was limited to the functional enhancement and fault correction types of software maintenance. This limitation was imposed, in part, by the data set collected for calibration and validation. Future work would extend the model to support other types of software maintenance, such as reengineering, reverse engineering, language and data migration, performance improvement, program perfection, documentation, training, and other cost-effective support activities such as those defined in [Chapin 2001].
• In iterative software development models, such as the Incremental Commitment Model (ICM) [Pew and Mavor 2007], and in many agile methods, the project schedule is divided into iterations, each involving refinements, enhancements, and additions of plans, software requirements, design, source code, tests, and other work products based on previous iterations. The end of an iteration is marked by a milestone that offers deliverables, and each iteration can be viewed as a maintenance project [Basili 1990]. It would be interesting to study how the proposed model can be applied to estimating size and effort for iterations within a development project. In addition, the sizing method can be used to compute the actual equivalent SLOC in order to understand how productivity and the different types of size metrics (adapted, reused, and added SLOC) change over time during the project.
• The constrained regression approaches were shown to be favorable for calibrating the COCOMO model. However, as they rely heavily on the sample data, they suffer from the common problems of software data sets, such as collinearity, skewness, heteroscedasticity, outliers, and high variation, and can produce counterintuitive estimates for cost drivers. A future study may consider imposing constraints on the value ranges of the cost drivers. These value ranges (e.g., the lowest and highest productivity range for each cost driver) should be determined by experts through the same Delphi survey.
128
BIBLIOGRAPHY
Abrahamsson P., Moser R., Pedrycz W., Sillitti A., Succi G. (2007), "Effort Prediction in
Iterative Software Development Processes -- Incremental Versus Global
Prediction Models", Proceedings of the First International Symposium on
Empirical Software Engineering and Measurement (ESEM)
Abran A. and Maya M. (1995), "A Sizing Measure for Adaptive Maintenance Work
Products," Proceedings of the 11th International Conference on Software
Maintenance (ICSM), pp.286-294
Abran A., St-Pierre D., Maya M., Desharnais J.M. (1998), "Full function points for
embedded and real-time software", Proceedings of the UKSMA Fall Conference,
London, UK, 14.
Abran A., Silva I., Primera L. (2002), "Field studies using functional size measurement in
building estimation models for software maintenance", Journal of Software
Maintenance and Evolution, Vol 14, part 1, pp. 31-64
Albrecht A.J. (1979), “Measuring Application Development Productivity,” Proc. IBM
Applications Development Symp., SHARE-Guide, pp. 83-92.
Albrecht A.J. and Gaffney J. E. (1983) "Software Function, Source Lines of Code, and
Development Effort Prediction: A Software Science Validation," IEEE
Transactions on Software Engineering, vol. SE-9, no. 6, November
Banker R., Kauffman R., and Kumar R. (1994), “An Empirical Test of Object-Based
Output Measurement Metrics in a Computer Aided Software Engineering (CASE)
Environment,” Journal of Management Information System.
Basili V.R., (1990) "Viewing Maintenance as Reuse-Oriented Software Development,"
IEEE Software, vol. 7, no. 1, pp. 19-25, Jan.
Basili V.R., Briand L., Condon S., Kim Y.M., Melo W.L., Valett J.D. (1996),
“Understanding and predicting the process of software maintenance releases,”
Proceedings of International Conference on Software Engineering, Berlin,
Germany, pp. 464–474.
Basili V.R., Condon S.E., Emam K.E., Hendrick R.B., Melo W. (1997) "Characterizing
and Modeling the Cost of Rework in a Library of Reusable Software
Components". Proceedings of the 19th International Conference on Software
Engineering, pp.282-291
129
Boehm B.W. (1981), “Software Engineering Economics”, Prentice-Hall, Englewood
Cliffs, NJ, 1981.
Boehm B.W. (1988), “Understanding and Controlling Software Costs”, IEEE
Transactions on Software Engineering.
Boehm B.W., Royce W. (1989), “Ada COCOMO and Ada Process Model,” Proc. Fifth
COCOMO User’s Group Meeting, Nov.
Boehm B.W., Clark B., Horowitz E., Westland C., Madachy R., Selby R. (1995), “Cost
models for future software life cycle processes: COCOMO 2.0, Annals of
Software Engineering 1, Dec., pp. 57–94.
Boehm B.W. (1999), "Managing Software Productivity and Reuse," Computer 32, Sept.,
pp.111-113
Boehm B.W., Abts C., Chulani S. (2000), “Software development cost estimation
approaches: A survey,” Annals of Software Engineering 10, pp. 177-205.
Boehm B.W., Horowitz E., Madachy R., Reifer D., Clark B.K., Steece B., Brown A.W.,
Chulani S., and Abts C. (2000), “Software Cost Estimation with COCOMO II,”
Prentice Hall.
Boehm B.W., Valerdi R. (2008), "Achievements and Challenges in Cocomo-Based
Software Resource Estimation," IEEE Software, pp. 74-83, September/October
Efron B., Gong G. (1983), "A leisurely look at the bootstrap, the jackknife and cross-validation", American Statistician 37 (1), pp. 36–48.
Briand L.C., Basili V., Thomas W.M. (1992), “A pattern recognition approach for
software engineering analysis”, IEEE Transactions on Software Engineering 18
(11) 931–942.
Briand L.C. & Basili V.R. (1992) “A Classification Procedure for an Effective
Management of Changes during the Software Maintenance Process”, Proc. ICSM
’92, Orlando, FL
Briand L.C., El-Emam K., Maxwell K., Surmann D., and Wieczorek I., “An Assessment
and Comparison of Common Cost Estimation Models,” Proc. 21st International
Conference on Software Engineering, pp. 313-322, 1999.
Caivano D., Lanubile F., Visaggio G. (2001), “Software renewal process comprehension
using dynamic effort estimation”, Proceedings of International Conference on
Software Maintenance, Florence, Italy, pp. 209–218.
130
Chapin N., Hale J.E., Kham K.M, Ramil J.F., Tan W. (2001), “Types of software
evolution and software maintenance,” Journal of Software Maintenance: Research
and Practice, v.13 n.1, p.3-30, Jan.
Chen Z. (2006), "Reduced-Parameter Modeling for Cost Estimation Models," PhD
Thesis, the University of Southern California.
Chulani S., Boehm B.W., Steece B. (1999), “Bayesian analysis of empirical software
engineering cost models,” IEEE Transactions on Software Engineering, vol. 25
n.4, pp. 573-583, July/August.
Chulani S. (1999), "Bayesian Analysis of Software Cost and Quality Models", PhD
Thesis, The University of Southern California.
Clark B., Chulani S., and Boehm B.W., (1998), “Calibrating the COCOMO II Post
Architecture Model,” International Conference on Software Engineering, April.
Clark B. (2003), "Calibration of COCOMO II.2003", 17th International Forum on
COCOMO and Software Cost Modeling,
DOI=http://sunset.usc.edu/events/2002/cocomo17/Calibration%20fo%20COCO
MO%20II.2003%20Presentation%20-%20Clark.pdf
Conte S.D., Dunsmore H.E., Shen V.Y. (1986), “Software Engineering Metrics and
Models,” Menlo Park, California, Benjamin/Cummings.
COSMIC (2003) COSMIC FFP v.2.2, Measurement Manual.
De Lucia A., Pompella E., Stefanucci S. (2003), “Assessing the maintenance processes of
a software organization: an empirical analysis of a large industrial project”, The
Journal of Systems and Software 65 (2), 87–103.
De Lucia A., Pompella E., and Stefanucci S. (2005), “Assessing effort estimation models
for corrective maintenance through empirical studies”, Information and Software
Technology 47, pp. 3–15
Erlikh L. (2000). “Leveraging legacy system dollars for E-business”. (IEEE) IT Pro,
May/June, 17-23.
Fioravanti F., Nesi P., Stortoni F. (1999), "Metrics for Controlling Effort During
Adaptive Maintenance of Object Oriented Systems," IEEE International
Conference on Software Maintenance (ICSM'99), 1999
Fioravanti F. & Nesi P. (2001), “Estimation and prediction metrics for adaptative
maintenance effort of object-oriented systems”, IEEE Transactions on Software
Engineering 27 (12), 1062–1084.
131
Fjelstad R.K. and Hamlen W.T. (1983), “Application Program Maintenance Study:
Report to Our Respondents”, In Tutorial on Software Maintenance, G. Parikh and
N. Zvegintzov, Eds., IEEE Computer Society Press, Los Angeles, CA. pp.11-27.
Galorath 2002, SEER-SEM Software Estimation, Planning and Project Control - User's
Manual, Galorath Incorporated.
Gerlich R., and Denskat U. (1994), "A Cost Estimation Model for Maintenance and High
Reuse, Proceedings," ESCOM 1994, Ivrea, Italy.
Huang X., Ho D., Ren J., Capretz L.F. (2005), "Improving the COCOMO model using a
neuro-fuzzy approach", Applied Soft Computing, Vol. 5, p. 29-40
Idri A., Kjiri L., Abran A. (2000), "COCOMO Cost Model Using Fuzzy Logic", 7th
International Conference on Fuzzy Theory and Technology, pp. 219-223.
IEEE (1998) IEEE Std. 1219-1998, Standard for Software Maintenance, IEEE Computer
Society Press, Los Alamitos, CA.
IFPUG (1999), "IFPUG Counting Practices Manual - Release. 4.1," International
Function Point Users Group, Westerville, OH
IFPUG (2004), "IFPUG Counting Practices Manual - Release. 4.2," International
Function Point Users Group, Princeton Junction, NJ.
ISO (1997), ISO/IEC 14143-I:1997--Information technology--Software measurement
Functional size measurement Definition of concepts, International Organization
for Standardization, Geneva, Switzerland, 1997.
ISO (2002), ISO/IEC 20968: Software Engineering - MkII Function Point Analysis –
Counting Practices Manual
ISO (2003) ISO/IEC 19761: COSMIC Full Function Points Measurement Manual, v.2.2.
ISO (1998), “ISO/IEC 14143-1:1998 – Information technology – Software measurement
– Functional size measurement – Definition of concepts”.
ISPA (2008) "Parametric Estimating Handbook," International Society of Parametric
Analysts, 4th Ed.
Jeffery D.R., Low G. (1990), "Calibrating estimation tools for software development",
Software engineering journal, vol. 5, no. 4, July, pp. 215-222
Jeffery D.R., Ruhe M., Wieczorek I. (2000), “A comparative study of cost modeling
techniques using public domain multi-organizational and company-specific data”,
Information and Software Technology 42 (14) 1009–1016.
132
Jensen R. (1983), “An Improved Macrolevel Software Development Resource Estimation
Model,” In Proceedings of 5th ISPA Conference, April 1983, pp. 88–92.
Jones, T. C. (1978), “Measuring programming quality and productivity.” IBM Systems
Journal, vol. 17, no. 1 Mar., pp. 39-63
Jones T.C. (1997), “Applied Software Measurement”, McGraw Hill.
Jones T.C. (2008), "Applied Software Measurement: Global Analysis of Productivity and
Quality", 3rd Ed., McGraw-Hill.
Jorgensen M. (1995), “Experience with the accuracy of software maintenance task effort
prediction models”, IEEE Transactions on Software Engineering 21 (8) 674–681.
Kauffman R. and Kumar R. (1993), “Modeling Estimation Expertise in Object Based
ICASE Environments,” Stern School of Business Report, New York University,
January.
Kemerer C.F. (1987), “An Empirical Validation of Software Cost Estimation Models,”
Comm. ACM, vol. 30, no. 5, pp. 416 – 429, May. DOI =
http://doi.acm.org/10.1145/22899.22906
Kitchenham B.A., Travassos G.H.,Mayrhauser A.v., Niessink F., Schneidewind N.F.,
Singer J., Takada S., Vehvilainen R., Yang H. (1999), "Toward an ontology of
software maintenance," Journal of Software Maintenance 1999; 11(6):365–389.
Kitchenham, B.A. and Mendes, E. (2009). “Why comparative effort prediction studies
may be invalid.” In Proceedings of the 5th international Conference on Predictor
Models in Software Engineering, pp. 1-5.
KnowledgePlan (2005), Software Productivity Research LLC, “KnowledgePlan User’s
Guide, Version 4.1”, SPR.
Ko A.J., Aung H.H., Myers B.A. (2005). "Eliciting design requirements for maintenance-oriented IDEs: a detailed study of corrective and perfective maintenance tasks". Proceedings of the International Conference on Software Engineering (ICSE 2005). IEEE Computer Society; pp. 126–135.
Lehman M.M., Ramil J.F., Wernick P.D., Perry D.E., Turski W.M. (1997), "Metrics and
Laws of Software Evolution - The Nineties View," Software Metrics, IEEE
International Symposium on Software Metrics (METRICS'97), pp.20-32
Lindvall M. (1998), “Monitoring and measuring the change-prediction process at
different granularity levels: an empirical study”, Software Process Improvement
and Practice 4, 3–10.
133
Maxwell K., Wassenhove L.V., and Dutta S. (1999), “Performance Evaluation of General
and Company Specific Models in Software Development Effort Estimation,”
Management Science, vol. 45, no. 6, pp. 77-83, June.
McKee J. (1984). “Maintenance as a function of design”. Proceedings of the AFIPS
National Computer Conference, 187-193.
Mendes E., Lokan C., Harrison R., and Triggs C. (2005), “A Replicated Comparison of
Cross-Company and Within-Company Effort Estimation Models Using the
ISBSG Database,” 11th IEEE International Software Metrics Symposium
(METRICS'05)
Menzies T., Port D., Chen Z., Hihn J., and Sherry S., (2005), "Validation Methods for
Calibrating Software Effort Models", International Conference on Software
Engineering, May 15–21.
Menzies T., Chen Z., Hihn J., Lum K. (2006), “Selecting Best Practices for Effort
Estimation,” IEEE Transactions on Software Engineering
Minkiewicz, A. (1997), “Measuring Object Oriented Software with Predictive Object
Points,” PRICE Systems.
DOI=http://www.pricesystems.com/white_papers/Measuring%20Object%20Orie
nted%20Software%20with%20Predictive%20Object%20Points%20July%20'97%
20-%20Minkiewicz.pdf
Minkiewicz A.F. (1997), “Measuring Object Oriented Software with Predictive Object
Points,” Proc. Conf. Applications in Software Measurements (ASM ’97).
Miyazaki Y, Terakado M, Ozaki K, and Nozaki H. (1994) “Robust Regression for
Developing Software Estimation Models.” Journal of Systems and Software, Vol.
27, pp. 3-16.
Moad J. (1990). “Maintaining the competitive edge”. Datamation 61-62, 64, 66.
Myrtweit I., Stensrud E., Shepperd M. (2005), Reliability and Validity in Comparative
Studies of Software Prediction Models. IEEE Transactions on Software
Engineering, Vol. 31, No. 5
Narula S.C., Wellington J.F. (1977), “Prediction, Linear Regression and the Minimum
Sum of Relative Errors,” Technometrics, Vol. 19, No. 2
NESMA (2008), "FPA according to NESMA and IFPUG; the present situation",
http://www.nesma.nl/download/artikelen/FPA%20according%20to%20NESMA
%20and%20IFPUG%20-%20the%20present%20situation%20(vs%202008-0701).pdf
134
Nguyen V., Deeds-Rubin S., Tan T., Boehm B.W. (2007), “A SLOC Counting Standard,”
The 22nd International Annual Forum on COCOMO and Systems/Software Cost
Modeling. DOI = http://csse.usc.edu/csse/TECHRPTS/2007/usc-csse-2007737/usc-csse-2007-737.pdf
Nguyen V., Steece B., Boehm B.W. (2008), “A constrained regression technique for
COCOMO calibration”, Proceedings of the 2nd ACM-IEEE international
symposium on Empirical software engineering and measurement (ESEM), pp.
213-222
Nguyen V., Boehm B.W., Danphitsanuphan P. (2009), “Assessing and Estimating
Corrective, Enhancive, and Reductive Maintenance Tasks: A Controlled
Experiment.” Proceedings of 16th Asia-Pacific Software Engineering Conference
(APSEC 2009), Dec.
Nguyen V., Boehm B.W., Danphitsanuphan P. (2010), “A Controlled Experiment in
Assessing and Estimating Software Maintenance Tasks”, APSEC Special Issue,
Information and Software Technology Journal, 2010.
Nguyen V., Huang L., Boehm B.W. (2010), “Analysis of Productivity Over Years”,
Technical Report, USC Center for Systems and Software Engineering.
Niessink F., van Vliet H. (1998), “Two case study in measuring maintenance effort”,
Proceedings of International Conference on Software Maintenance, Bethesda,
MD, USA, pp. 76–85.
Parikh G. and Zvegintzov N. (1983). The World of Software Maintenance, Tutorial on
Software Maintenance, IEEE Computer Society Press, pp. 1-3.
Park R.E. (1992), "Software Size Measurement: A Framework for Counting Source
Statements," CMU/SEI-92-TR-11, Sept.
Pew R. W. and Mavor A. S. (2007), “Human-System Integration in the System
Development Process: A New Look”, National Academy Press
Price-S (2009), TruePlanning User Manual, PRICE Systems, www.pricesystems.com
Putnam L. & Myers W. (1992), "Measures for Excellence: Reliable Software on Time,
within Budget," Prentice Hall PTR.
Quinlan J.R. (1993), "C4.5: Programs for Machine Learning," Morgan Kaufmann
Publishers, Sao Mateo, CA.
Rajlich V., Wilde N., Page H. (2001), "Software Cultures and Evolution", IEEE
Computer, Sept., pp. 24-28
135
Ramil J.F. (2000), "Algorithmic cost estimation for software evolution", Proceedings of
International Conference on Software Engineering, Limerick, Ireland, pp. 701–
703.
Ramil J.F. (2003), “Continual Resource Estimation for Evolving Software," PhD Thesis,
University of London, Imperial College of Science, Technology and Medicine.
Reddy C.S., Raju K. (2009), "An Improved Fuzzy Approach for COCOMO's Effort Estimation using Gaussian Membership Function".
Rosencrance L. (2007), "Survey: Poor communication causes most IT project failures,"
Computerworld
Selby R. (1988), Empirically Analyzing Software Reuse in a Production Environment, In
Software Reuse: Emerging Technology, W. Tracz (Ed.), IEEE Computer Society
Press, pp. 176-189.
Sneed H.M., (1995), "Estimating the Costs of Software Maintenance Tasks," IEEE
International Conference on Software Maintenance, pp. 168-181
Sneed H.M., (2004), "A Cost Model for Software Maintenance & Evolution," IEEE
International Conference on Software Maintenance, pp. 264-273
Sneed H.M., (2005), "Estimating the Costs of a Reengineering Project," 12th Working
Conference on Reverse Engineering, pp. 111-119
Sneed H.M., Huang S. (2007), "Sizing Maintenance Tasks for Web Applications," 11th
European Conference on Software Maintenance and Reengineering, 2007. CSMR
'07.
Swanson E.B. (1976), "The dimensions of maintenance," Proceedings 2nd International
Conference on Software Engineering. IEEE Computer Society: Long Beach CA,
1976; 492–497.
Symons C.R. (1988) "Function Point Analysis: Difficulties and Improvements," IEEE
Transactions on Software Engineering, vol. 14, no. 1, pp. 2-11
Symons C.R. (1991), "Software Sizing and Estimating: Mk II FPA". Chichester,
England: John Wiley.
Trendowicz A., Heidrich J., Münch J., Ishigai Y., Yokoyama K., Kikuchi N. (2006),
“Development of a hybrid cost estimation model in an iterative manner”,
Proceedings of the 28th International Conference on Software Engineering,
China.
UKSMA (1998) MkII Function Point Analysis Counting Practices Manual. United
Kingdom Software Metrics Association. Version 1.3.1
136
Valerdi R. (2005), "The Constructive Systems Engineering Cost Model (COSYSMO)",
PhD Thesis, The University of Southern California.
Yang Y. and Clark B. (2003), "COCOMO II.2003 Calibration Status", CSE Annual
Research Review,
DOI=http://sunset.usc.edu/events/2003/March_2003/COCOMO_II_2003_Recalib
ration.Pdf
Zelkowitz M.V., Shaw A.C., Gannon J.D. (1979). “Principles of Software Engineering
and Design”. Prentice-Hall
137
APPENDIX A. UNFM AND AA RATING SCALE
Table A.1: Rating Scale for Assessment and Assimilation Increment (AA)

AA Increment  Level of AA Effort
0             None
2             Basic module search and documentation
4             Some module Test and Evaluation (T&E), documentation
6             Considerable module T&E, documentation
8             Extensive module T&E, documentation
Table A.2: Rating Scale for Programmer's Unfamiliarity (UNFM)

UNFM Increment  Level of Unfamiliarity
0.0             Completely familiar
0.2             Mostly familiar
0.4             Somewhat familiar
0.6             Considerably familiar
0.8             Mostly unfamiliar
1.0             Completely unfamiliar
Table A.3: Rating Scale for Software Understanding Increment (SU)

Structure — Very Low: very low cohesion, high coupling, spaghetti code. Low: moderately low cohesion, high coupling. Nominal: reasonably well structured; some weak areas. High: high cohesion, low coupling. Very High: strong modularity, information hiding in data/control structures.
Application Clarity — Very Low: no match between program and application world views. Low: some correlation between program and application. Nominal: moderate correlation between program and application. High: good correlation between program and application. Very High: clear match between program and application world views.
Self-Descriptiveness — Very Low: obscure code; documentation missing, obscure or obsolete. Low: some code commentary and headers; some useful documentation. Nominal: moderate level of code commentary, headers, documentation. High: good code commentary and headers; useful documentation; some weak areas. Very High: self-descriptive code; documentation up-to-date, well organized, with design rationale.
Current SU Increment to ESLOC — Very Low: 50%. Low: 40%. Nominal: 30%. High: 20%. Very High: 10%.
Your estimates — Very Low: ___. Low: ___. Nominal: ___. High: ___. Very High: ___.
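These parameters feed the equivalent-size computation. The sketch below shows the standard COCOMO II.2000 adaptation adjustment in which AA, SU, and UNFM appear, using the reuse parameters DM, CM, IM, and AT from that model; it is given for orientation only, since the unified sizing method proposed in this dissertation extends this baseline form.

def equivalent_sloc(adapted_sloc, dm, cm, im, su, unfm, aa, at=0.0):
    # dm, cm, im: percent of design / code / integration modified;
    # su: software understanding increment (10-50); unfm: unfamiliarity (0.0-1.0);
    # aa: assessment and assimilation increment (0-8); at: percent of code
    # automatically translated. Standard COCOMO II.2000 form, shown as a baseline.
    aaf = 0.4 * dm + 0.3 * cm + 0.3 * im
    if aaf <= 50:
        aam = (aa + aaf * (1 + 0.02 * su * unfm)) / 100
    else:
        aam = (aa + aaf + su * unfm) / 100
    return adapted_sloc * (1 - at / 100) * aam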
138
APPENDIX B. DELPHI SURVEY FORM
COCOMO FOR MAINTENANCE AND REUSE
I. PARTICIPANT INFORMATION:
Your Name:
Corporation name:
Division:
Location (City, State):
Email address:
Phone:
Date prepared:
Years of experience in cost modeling and estimation:
Years of experience with COCOMO:
What category would best describe your application domain? (Check all that apply)
Management of Info System
Operating Systems
Process Control
Command and Control
Military - Airborne
Signal Processing
Communications
Military – Ground, Sea
Simulation
Engineering and Science
Military – Missile
Testing
Environment/Tools
Military – Space
Web
Utilities
Other (please specify):
Contact Person:
Vu Nguyen
Research Assistant
Center for Systems and Software Engineering
University of Southern California
Email: nguyenvu@usc.edu
Phone: (323) 481-1585
Fax: (213) 740-4927
139
II. INTRODUCTION
The Constructive Cost Model (COCOMO) is a cost and schedule estimation model that
was originally published in 1981. The latest extension of the model is COCOMO II
which was published in 2000. Among the main upgrades in COCOMO II are the
introduction of new functional forms that use new cost drivers, and a set of rating scales.
The model was built mainly from, and targeted at, development projects.
In an effort to better support software maintenance estimation, we have proposed an
extension to COCOMO size and effort models for estimating software maintenance
projects. The extended model consists of a sub-model for sizing and a sub-model for
estimating the effort. These models use the same set of COCOMO II cost drivers.
However, the rating scales of the cost drivers would differ from those of COCOMO II
due to differences between development and maintenance/enhancement projects.
In this Delphi exercise, we need your help to determine rating scales of the drivers on
size and effort of software maintenance.
We define software maintenance as the work of modifying and enhancing the existing
software. It includes functionality repairs and minor or major functional enhancements. It
excludes reengineering and programming language migration (e.g., migrating
programming language from C++ to Web languages). In addition, we assume that
software maintenance projects have specified starting and ending dates (see Figure 1). A
new release of the product is delivered at the ending date of the maintenance project. To
be consistent with the COCOMO II development model, the estimated effort covers the
Elaboration and Construction phases of the release. Inception and Transition may vary
significantly, as shown in Table A.6 on page 309 of the COCOMO II book.
Figure 1 – Maintenance projects timeline (maintenance project N+1 starts from the code baseline of Release N and ends with the delivery of Release N+1)
III. INSTRUCTIONS
The questionnaire consists of three parts: size drivers, effort multipliers, and scale factors.
Each part groups drivers that affect the project effort in the same fashion.
For your convenience, we provide the current values of the drivers given in COCOMO II
version 2000 (denoted as CII.2000). These values are the result of combining expert
estimates and historical data using the Bayesian approach. They were intended mainly
for development projects.
140
We also provide CII.2000’s Productivity Range (PR) for each of the cost drivers (effort
multipliers and scale factors). The Productivity Range represents the relative impact of the
cost driver on effort.
IV. MAINTENANCE SIZE DRIVERS
The extended model for maintenance uses a set of size drivers to determine equivalent
Source Lines of Code (SLOC) for the amount of code to be written. Unlike new
development, in software maintenance the team needs to assess and assimilate the
preexisting code. They also need to comprehend the preexisting code for their
maintenance work.
We need your help to determine typical percentages of effort required for these activities.
Software Understanding (SU)
SU size driver measures the degree of understandability of the existing software (how
easy it is to understand the existing code). The rating scale ranges from Very Low (very
difficult to understand) to Very High (very easy to understand), corresponding to the
percentage of effort required to understand the preexisting code for programmers who are
completely unfamiliar with the code. For example, the values in the COCOMO II model
specify that Very Low SU requires 50% of the total effort spent for understanding the
preexisting code.
Structure — Very Low: very low cohesion, high coupling, spaghetti code. Low: moderately low cohesion, high coupling. Nominal: reasonably well structured; some weak areas. High: high cohesion, low coupling. Very High: strong modularity, information hiding in data/control structures.
Application Clarity — Very Low: no match between program and application world views. Low: some correlation between program and application. Nominal: moderate correlation between program and application. High: good correlation between program and application. Very High: clear match between program and application world views.
Self-Descriptiveness — Very Low: obscure code; documentation missing, obscure or obsolete. Low: some code commentary and headers; some useful documentation. Nominal: moderate level of code commentary, headers, documentation. High: good code commentary and headers; useful documentation; some weak areas. Very High: self-descriptive code; documentation up-to-date, well organized, with design rationale.
Current SU Increment to ESLOC — Very Low: 50%. Low: 40%. Nominal: 30%. High: 20%. Very High: 10%.
Your estimates — Very Low: ___. Low: ___. Nominal: ___. High: ___. Very High: ___.
Rationale (optional):
141
V. EFFORT MULTIPLIERS
These are the 17 effort multipliers used in COCOMO II Post-Architecture model to
adjust the nominal effort, Person Months, to reflect the software product under
development. They are grouped into four categories: product, platform, personnel, and
project.
Nominal is the baseline rating level and is assigned the value of 1. A rating value below 1
indicates a reduction in effort, and a value above 1 an increase in effort. For example, a
Very High DATA rating of 1.28 in COCOMO II.2000 requires 28% more effort than a Nominal
DATA rating.
The Productivity Range is the ratio between the largest and the smallest effort multiplier.
For example, the PR of 1.42 for DATA below indicates that productivity increases by 42%
with a Low rather than a Very High testing database size.
A. Product Factors
Data Base Size (DATA)
This measure attempts to capture the effect that large data requirements have on product
development. The rating is determined by calculating D/P; the size of the database is
important to consider because of the effort required to generate the test data that will
be used to exercise the program.

D/P = Database Size (Bytes) / Program Size (SLOC)

DATA is rated Low if D/P is less than 10 and Very High if D/P is greater than 1000.

DATA rating: Low = D/P < 10; Nominal = 10 ≤ D/P < 100; High = 100 ≤ D/P < 1000; Very High = D/P ≥ 1000.
CII.2000 values: Low 0.90, Nominal 1.0, High 1.14, Very High 1.28; Productivity Range (PR) = 1.28/0.90 = 1.42.
Your estimates: Low ___, Nominal 1.0, High ___, Very High ___.
Rationale (optional):
142
Required Software Reliability (RELY)
This is the measure of the extent to which the software must perform its intended
function over a period of time. If the effect of a software failure is only slight
inconvenience, then RELY is Very Low. If a failure would risk human life, then RELY is
Very High.
The Productivity Range is less relevant here because the maintenance multipliers are not
always monotonic: the Very Low multiplier is higher than the Nominal 1.0 multiplier due to
the extra effort of extending and debugging sloppy software, and the Very High multiplier
is higher due to the extra effort in CM, QA, and V&V needed to keep the product at a Very
High RELY level.

RELY rating: Very Low = slight inconvenience; Low = low, easily recoverable losses; Nominal = moderate, easily recoverable losses; High = high financial loss; Very High = risk to human life.
CII.2000 values: Very Low 1.23, Low 1.10, Nominal 1.0, High 0.99, Very High 1.07; Productivity Range (PR) = 1.23/0.99 = 1.24.
Your estimates: Very Low ___, Low ___, Nominal 1.0, High ___, Very High ___.
Rationale (optional):
143
Product Complexity (CPLX)
The table below provides the new COCOMO II CPLX rating scale. Complexity is
divided into five areas: control operations, computational operations, device-dependent
operations, data management operations, and user interface management operations.
Select the area or combination of areas that characterize the product or a sub-system of
the product. The complexity rating is the subjective weighted average of these areas.
Very Low (CII.2000 multiplier 0.73)
- Control operations: straight-line code with a few non-nested structured programming operators: DOs, CASEs, IF-THEN-ELSEs.
- Computational operations: evaluation of simple expressions, e.g., A = B + C * (D - E).
- Device-dependent operations: simple read and write statements with simple formats.
- Data management operations: simple arrays in main memory; simple COTS-DB queries.
- User interface management operations: simple input forms, report generators.

Low (0.87)
- Control operations: straightforward nesting of structured programming operators; mostly simple predicates.
- Computational operations: evaluation of moderate-level expressions, e.g., D = SQRT(B**2 - 4.*A*C).
- Device-dependent operations: no cognizance needed of particular processor or I/O device characteristics; I/O done at GET/PUT level.
- Data management operations: single-file subsetting with no data structure changes, no edits, no intermediate files.
- User interface management operations: use of simple graphic user interface (GUI) builders.

Nominal (1.0)
- Control operations: mostly simple nesting; some intermodule control; decision tables; simple callbacks or message passing.
- Computational operations: use of standard math and statistical routines; basic matrix/vector operations.
- Device-dependent operations: I/O processing includes device selection, status checking and error processing.
- Data management operations: multi-file input and single-file output; simple structural changes.
- User interface management operations: simple use of widget set.

High (1.17)
- Control operations: highly nested structured programming operators with many compound predicates; queue and stack control; homogeneous, distributed processing.
- Computational operations: basic numerical analysis: multivariate interpolation, ordinary differential equations; basic truncation, round-off concerns.
- Device-dependent operations: operations at physical I/O level (physical storage address translations; seeks, reads, etc.); optimized I/O overlap.
- Data management operations: simple triggers activated by data stream contents; complex data restructuring.
- User interface management operations: widget set development and extension; simple voice I/O, multimedia.

Very High (1.34)
- Control operations: reentrant and recursive coding; fixed-priority interrupt handling; task synchronization, complex callbacks, heterogeneous distributed processing.
- Computational operations: difficult but structured numerical analysis: near-singular matrix equations, partial differential equations; simple parallelization.
- Device-dependent operations: routines for interrupt diagnosis, servicing, masking; communication line handling; performance-intensive embedded systems.
- Data management operations: distributed database coordination; complex triggers; search optimization.
- User interface management operations: moderately complex 2D/3D, dynamic graphics, multimedia.

Extra High (1.74)
- Control operations: multiple resource scheduling with dynamically changing priorities; microcode-level control; distributed hard real-time control.
- Computational operations: difficult and unstructured numerical analysis: highly accurate analysis of noisy, stochastic data.
- Device-dependent operations: device timing-dependent coding, micro-programmed operations; performance-critical embedded systems.
- Data management operations: highly coupled, dynamic relational and object structures; natural language data management.
- User interface management operations: complex multimedia, virtual reality.

CII.2000 values: Very Low 0.73, Low 0.87, Nominal 1.0, High 1.17, Very High 1.34, Extra High 1.74; Productivity Range (PR) = 1.74/0.73 = 2.38.
Your estimates: Very Low ___, Low ___, Nominal 1.0, High ___, Very High ___, Extra High ___.
Rationale (optional):_
___
Documentation match to life-cycle needs (DOCU)
Several software cost models have a cost driver for the level of required documentation.
In COCOMO II, the rating scale for the DOCU cost driver is evaluated in terms of the
suitability of the project's documentation to its life-cycle needs. The rating scale goes
from Very Low (many life-cycle needs uncovered) to Very High (very excessive for lifecycle needs).
DOCU rating scale (CII.2000 effort multipliers):
  Very Low:  Many life-cycle needs uncovered (0.81)
  Low:       Some life-cycle needs uncovered (0.91)
  Nominal:   Right-sized to life-cycle needs (1.0)
  High:      Excessive for life-cycle needs (1.11)
  Very High: Very excessive for life-cycle needs (1.23)
  Productivity Range (PR): 1.23/0.81 = 1.52
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
B. Platform Factors
The platform refers to the target-machine complex of hardware and infrastructure
software (previously called the virtual machine). The factors have been revised to reflect
this as described in this section. Some additional platform factors were considered, such
as distribution, parallelism, embeddedness, and real-time operations.
Execution Time Constraint (TIME)
This is a measure of the execution time constraint imposed upon a software system. The
rating is expressed in terms of the percentage of available execution time expected to be
used by the system or subsystem consuming the execution time resource. The rating ranges from Nominal, less than 50% of the execution time resource used, to Extra High, 95% of the execution time resource consumed.
TIME rating scale (CII.2000 effort multipliers):
  Nominal:    ≤ 50% use of available execution time (1.0)
  High:       70% (1.11)
  Very High:  85% (1.29)
  Extra High: 95% (1.63)
  Productivity Range (PR): 1.63/1.0 = 1.63
  Your estimates: ___
  Rationale (optional): ___
Main Storage Constraint (STOR)
This rating represents the degree of main storage constraint imposed on a software
system or subsystem. Given the remarkable increase in available processor execution
time and main storage, one can question whether these constraint variables are still
relevant. However, many applications continue to expand to consume whatever resources
are available, making these cost drivers still relevant. The rating ranges from Nominal, less than 50% of available storage used, to Extra High, 95%.
STOR rating scale (CII.2000 effort multipliers):
  Nominal:    ≤ 50% use of available storage (1.0)
  High:       70% (1.05)
  Very High:  85% (1.17)
  Extra High: 95% (1.46)
  Productivity Range (PR): 1.46/1.0 = 1.46
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
Platform Volatility (PVOL)
"Platform" is used here to mean the complex of hardware and software (OS, DBMS, etc.)
the software product calls on to perform its tasks. If the software to be developed is an
operating system then the platform is the computer hardware. If a database management
system is to be developed then the platform is the hardware and the operating system. If a
network text browser is to be developed then the platform is the network, computer
hardware, the operating system, and the distributed information repositories. The
platform includes any compilers or assemblers supporting the development of the
software system. This rating ranges from low, where there is a major change every 12
months, to very high, where there is a major change every two weeks.
PVOL rating scale (CII.2000 effort multipliers):
  Low:       major change every 12 mo.; minor change every 1 mo. (0.87)
  Nominal:   major: 6 mo.; minor: 2 wk. (1.0)
  High:      major: 2 mo.; minor: 1 wk. (1.15)
  Very High: major: 2 wk.; minor: 2 days (1.30)
  Productivity Range (PR): 1.30/0.87 = 1.49
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
C. Personnel Factors
Analyst Capability (ACAP)
Analysts are personnel who work on requirements, high-level design, and detailed design. The major attributes that should be considered in this rating are analysis and design ability, efficiency and thoroughness, and the ability to communicate and cooperate. The rating should not consider the level of experience of the analyst; that is rated with APEX, PLEX, and LTEX. Analysts who fall in the 15th percentile are rated Very Low and those who fall in the 90th percentile are rated Very High.
ACAP rating scale (CII.2000 effort multipliers):
  Very Low:  15th percentile (1.42)
  Low:       35th percentile (1.19)
  Nominal:   55th percentile (1.0)
  High:      75th percentile (0.85)
  Very High: 90th percentile (0.71)
  Productivity Range (PR): 1.42/0.71 = 2.00
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
Programmer Capability (PCAP)
Current trends continue to emphasize the importance of highly capable analysts. However, the increasing role of complex COTS packages, and the significant productivity leverage associated with programmers' ability to deal with these COTS packages, indicates a trend toward higher importance of programmer capability as well.
Evaluation should be based on the capability of the programmers as a team rather than as individuals. Major factors that should be considered in the rating are ability, efficiency and thoroughness, and the ability to communicate and cooperate. The experience of the programmers should not be considered here; it is rated with APEX, PLEX, and LTEX. A Very Low rated programmer team is in the 15th percentile and a Very High rated programmer team is in the 90th percentile.
PCAP rating scale (CII.2000 effort multipliers):
  Very Low:  15th percentile (1.34)
  Low:       35th percentile (1.15)
  Nominal:   55th percentile (1.0)
  High:      75th percentile (0.88)
  Very High: 90th percentile (0.76)
  Productivity Range (PR): 1.34/0.76 = 1.76
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
Applications Experience (APEX)
This rating is dependent on the level of applications experience of the project team
developing the software system or subsystem. The ratings are defined in terms of the
project team's equivalent level of experience with this type of application. A very low
rating is for application experience of less than 6 months. A very high rating is for
experience of 12 years or more. Amounts of experience have been shifted one place to
the left to account for the generally higher levels of experience in maintenance personnel.
APEX rating scale (CII.2000 effort multipliers):
  Very Low:  ≤ 6 months (1.22)
  Low:       1 year (1.10)
  Nominal:   3 years (1.0)
  High:      6 years (0.88)
  Very High: 12 years (0.81)
  Productivity Range (PR): 1.22/0.81 = 1.51
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
Platform Experience (PLEX)
The Post-Architecture model broadens the productivity influence of PLEX, recognizing
the importance of understanding the use of more powerful platforms, including more
graphic user interface, database, networking, and distributed middleware capabilities.
PLEX rating scale (CII.2000 effort multipliers):
  Very Low:  ≤ 6 months (1.19)
  Low:       1 year (1.09)
  Nominal:   3 years (1.0)
  High:      6 years (0.91)
  Very High: 12 years (0.85)
  Productivity Range (PR): 1.19/0.85 = 1.40
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
Language and Tool Experience (LTEX)
This is a measure of the level of programming language and software tool experience of
the project team developing the software system or subsystem. Software development
includes the use of tools that perform requirements and design representation and
analysis, configuration management, document extraction, library management, program
style and formatting, consistency checking, etc. In addition to experience in programming with a specific language, the supporting tool set also affects development time. A Very Low rating is given for experience of less than 6 months. A Very High rating is given for experience of 12 or more years.
LTEX rating scale (CII.2000 effort multipliers):
  Very Low:  ≤ 6 months (1.20)
  Low:       1 year (1.09)
  Nominal:   3 years (1.0)
  High:      6 years (0.91)
  Very High: 12 years (0.84)
  Productivity Range (PR): 1.20/0.84 = 1.43
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
Personnel Continuity (PCON)
The rating scale for PCON is in terms of the project's annual personnel turnover: from 3% per year (Very High continuity) to 48% per year (Very Low continuity).
PCON rating scale (CII.2000 effort multipliers):
  Very Low:  48% / year (1.29)
  Low:       24% / year (1.12)
  Nominal:   12% / year (1.0)
  High:      6% / year (0.90)
  Very High: 3% / year (0.81)
  Productivity Range (PR): 1.29/0.81 = 1.51
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
D. Project Factors
Use of Software Tools (TOOL)
Software tools have improved significantly since the 1970s-era projects that were used to calibrate COCOMO. The TOOL rating ranges from simple edit and code (Very Low) to integrated lifecycle management tools (Very High).
TOOL rating scale (CII.2000 effort multipliers):
  Very Low:  edit, code, debug (1.17)
  Low:       simple, frontend, backend CASE, little integration (1.09)
  Nominal:   basic lifecycle tools, moderately integrated (1.0)
  High:      strong, mature lifecycle tools, moderately integrated (0.90)
  Very High: strong, mature, proactive lifecycle tools, well integrated with processes, methods, reuse (0.78)
  Productivity Range (PR): 1.17/0.78 = 1.50
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
Multisite Development (SITE)
Given the increasing frequency of multisite developments, and indications that multisite
development effects are significant, the SITE cost driver has been added in COCOMO II.
Determining its cost driver rating involves the assessment and averaging of two factors:
site collocation (from fully collocated to international distribution) and communication
support (from surface mail and some phone access to full interactive multimedia).
SITE (Communications) rating scale (CII.2000 effort multipliers):
  Very Low:   Some phone, mail (1.22)
  Low:        Individual phone, FAX (1.09)
  Nominal:    Narrowband email (1.0)
  High:       Wideband electronic communication (0.93)
  Very High:  Wideband electronic communication, occasional video conferencing (0.86)
  Extra High: Interactive multimedia (0.80)
  Productivity Range (PR): 1.22/0.80 = 1.53
  Your estimates: ___ (Nominal = 1.0)
  Rationale (optional): ___
VI. SCALE FACTORS
Software cost estimation models often have an exponential factor to account for the
relative economies or diseconomies of scale encountered in different sized software
projects. The exponent, B, in the following equation is used to capture these effects.
B = 0.91 + 0.01 × Σ SFi,  where the sum runs over the five scale factors (i = 1..5)
If B < 1.0, the project exhibits economies of scale. If the product's size is doubled, the
project effort is less than doubled. The project's productivity increases as the product size
increases. Some project economies of scale can be achieved via project-specific tools
(e.g., simulations, testbeds) but in general these are difficult to achieve. For small
projects, fixed start-up costs such as tool tailoring and setup of standards and
administrative reports are often a source of economies of scale.
If B = 1.0, the economies and diseconomies of scale are in balance. This linear model is
often used for cost estimation of small projects. It is used for the COCOMO II
Applications Composition model.
If B > 1.0, the project exhibits diseconomies of scale. This is generally due to two main
factors: growth of interpersonal communications overhead and growth of large-system
integration overhead. Larger projects will have more personnel, and thus more
interpersonal communications paths consuming overhead. Integrating a small product as
part of a larger product requires not only the effort to develop the small product but also
the additional overhead effort to design, maintain, integrate, and test its interfaces with
the remainder of the product.
COCOMO II uses five scale factors, each having six rating levels. These drivers were selected because they are a significant source of exponential variation in a project's effort or productivity. Each scale driver has a range of rating levels, from Very Low to Extra High, and each rating level has a weight, SF, whose specific value is called a scale factor. A project's scale factors, SFi, are summed across all of the factors and used to determine the scale exponent, B.
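To make the mechanics concrete, the sketch below computes B and the resulting diseconomy of scale. The weight values are the published COCOMO II.2000 Nominal weights for the five scale factors (an assumption here, not values taken from this dissertation's data), and the multiplicative constant and effort multipliers of the full effort equation are omitted.

    # Sketch: summing the five scale factor weights into the exponent B and showing
    # its effect on relative effort.  The weights are the published COCOMO II.2000
    # Nominal values (assumed for illustration); effort is relative only, since the
    # multiplicative constant and the effort multipliers are left out.

    nominal_sf = {"PREC": 3.72, "FLEX": 3.04, "RESL": 4.24, "TEAM": 3.29, "PMAT": 4.68}

    def scale_exponent(sf_weights):
        return 0.91 + 0.01 * sum(sf_weights.values())

    def relative_effort(ksloc, sf_weights):
        return ksloc ** scale_exponent(sf_weights)

    B = scale_exponent(nominal_sf)              # 0.91 + 0.01 * 18.97 = 1.0997
    ratio = relative_effort(200, nominal_sf) / relative_effort(100, nominal_sf)
    print(round(B, 4), round(ratio, 3))         # 1.0997 2.143 -> doubling size more than doubles effort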
Instructions:
For simplicity, we will use a 100KSLOC project as the baseline and determine the
Productivity Range of each scale factor for this project. The Extra High rating level is
fixed at 1.0, indicating no increment in effort. It can be considered the baseline rating
level. The Very Low rating level specifies a maximum increment in effort and
corresponds to the PR for the 100KSLOC project. For example, the Productivity Range of the PREC scale factor for the 100KSLOC project is 1.33, which indicates that reducing the PREC rating level from Extra High to Very Low would require an additional 33% of effort.
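As a cross-check of the 1.33 figure, the PR follows directly from the exponent formula. The sketch below assumes the published COCOMO II.2000 PREC weights (Very Low = 6.20, Extra High = 0.00), which are quoted here as an assumption since this appendix does not restate them.

    # Sketch: for a 100-KSLOC project, the Productivity Range of a scale factor is
    # the effort ratio obtained by moving that factor alone from Extra High to
    # Very Low.  Assumes the published COCOMO II.2000 PREC weights.

    size_ksloc = 100
    delta_sf = 6.20 - 0.00                # increase in the summed scale factor weights
    pr = size_ksloc ** (0.01 * delta_sf)  # effort ratio = 100 ** 0.062
    print(round(pr, 2))                   # -> 1.33, i.e., about 33% more effort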
Precedentedness (PREC)
If a product is similar to several previously developed projects, then the precedentedness
is high.
PREC rating scale (CII.2000 values for a 100KSLOC project):
  Very Low:   thoroughly unprecedented (1.33)
  Low:        largely unprecedented (1.26)
  Nominal:    somewhat unprecedented (1.19)
  High:       generally familiar (1.12)
  Very High:  largely familiar (1.06)
  Extra High: thoroughly familiar (1.0)
  Your estimates: ___ (Extra High = 1.0)
  Rationale (optional): ___
Development Flexibility (FLEX)
FLEX determines the degree of development flexibility (establishing requirements,
architectures, design, implementation, testing) to which the team must conform. The
more rigorous the conformance requirements are, the less flexibility the team has.
FLEX rating scale (CII.2000 values for a 100KSLOC project):
  Very Low:   rigorous (1.26)
  Low:        occasional relaxation (1.21)
  Nominal:    some relaxation (1.15)
  High:       general conformity (1.10)
  Very High:  some conformity (1.05)
  Extra High: general goals (1.0)
  Your estimates: ___ (Extra High = 1.0)
  Rationale (optional): ___
Architecture / Risk Resolution (RESL)
The RESL scale factor measures the level of risk eliminated at the time of estimation. The rating levels are allocated according to the percentage of critical risks eliminated.
RESL* rating scale (CII.2000 values for a 100KSLOC project):
  Very Low:   little (20%) (1.38)
  Low:        some (40%) (1.30)
  Nominal:    often (60%) (1.22)
  High:       generally (75%) (1.14)
  Very High:  mostly (90%) (1.07)
  Extra High: full (100%) (1.0)
  (*) % significant module interfaces specified, % significant risks eliminated.
  Your estimates: ___ (Extra High = 1.0)
  Rationale (optional): ___
Team Cohesion (TEAM)
TEAM accounts for the sources of project turbulence and entropy because of difficulties
in synchronizing the project’s stakeholders: users, customers, developers, maintainers,
interfacers, others. These difficulties may arise from differences in stakeholder
objectives and cultures; difficulties in reconciling objectives; and stakeholders' lack of
experience and familiarity in operating as a team.
TEAM rating scale (CII.2000 values for a 100KSLOC project):
  Very Low:   very difficult interactions (1.29)
  Low:        some difficult interactions (1.22)
  Nominal:    basically cooperative interactions (1.16)
  High:       largely cooperative (1.11)
  Very High:  highly cooperative (1.05)
  Extra High: seamless interactions (1.0)
  Your estimates: ___ (Extra High = 1.0)
  Rationale (optional): ___
Process Maturity (PMAT)
PMAT specifies SEI’s Capability Maturity Model (CMM) or an equivalent process
maturity level that the project follows.
PMAT rating scale (CII.2000 values for a 100KSLOC project):
  Very Low:   SW-CMM Level 1 (Lower) (1.43)
  Low:        SW-CMM Level 1 (Upper) (1.33)
  Nominal:    SW-CMM Level 2 (1.24)
  High:       SW-CMM Level 3 (1.15)
  Very High:  SW-CMM Level 4 (1.07)
  Extra High: SW-CMM Level 5 (1.0)
  Your estimates: ___ (Extra High = 1.0)
  Rationale (optional): ___
APPENDIX C. DATA COLLECTION FORMS
I. GENERAL FORM
University of Southern California
Center for Systems and Software Engineering
DATA COLLECTION FORM
Software Maintenance
Contact Person: Mr. Vu Nguyen
Email: nguyenvu@usc.edu
1. Project Name/Identifier:
2. Version/Release Number:
3. Project ID:
4. Date Prepared:
5. Originator:
6. Application type (check one or more if applicable):
Command and control
Management of information system
Simulation
Communications
Operating systems
Process control
Testing
Engineering & Science
Signal processing
Utility tools
Outsourcing
Other
7. Maintenance type (check one or more):
Check one or more types of the maintenance tasks on your project. Also, provide rough estimates of the effort spent for each maintenance type. Refer to the Guide sheet for details on these maintenance types.
  Corrective maintenance, % effort:
  Perfective maintenance, % effort:
  Adaptive maintenance, % effort:
  Major enhancement, % effort:
8. Component general size information
List components and their related measures. Only components implemented or changed by the project team should be listed: external components developed outside the project team must not be included, and components that were fully removed must also not be included.
REVL, DM, IM, SU, UNFM, and AA are subjective measures. Please refer to the Guide sheet for instructions.
Columns: No. | Component | Languages | REVL (%) | DM (%) | IM (%) | SU | AA | UNFM | SA
  REVL – Requirements Evolution and Volatility
  DM – Design Modified
  IM – Integration Modified
  SU – Software Understanding
  AA – Assessment and Assimilation
  UNFM – Programmer Unfamiliarity
  SA – Software Age in Years
  SLOC – Logical Source Lines of Code
9. Actual efforts by phase
Enter the phase, actual effort, and schedule if actual effort and schedule per phase or anchor point are available. Otherwise, provide them for the whole project instead.
Columns: Phase/Anchor Point | Actual Effort (PM) | Schedule (Months)
10. UCC Size Metrics
Sizing metrics can be obtained by using the UCC tool to compare two versions of the program. Please report the following size metrics (Metric | Definition | Count):
  SLOC Adapted:     Sum of SLOC added, modified, and deleted of modified modules
  SLOC Pre-Adapted: SLOC of pre-existing modules before they are modified
  SLOC Added:       SLOC of added modules
  SLOC Reused:      SLOC of reused/unmodified modules
  SLOC Deleted:     SLOC of deleted modules
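For illustration only, the sketch below rolls per-module diff counts up into the five metrics requested in item 10. The module names and counts are hypothetical, and the sketch does not reproduce the UCC tool's own output format; it assumes the per-module counts have already been collected.

    # Sketch (not the UCC tool itself): given per-module logical-SLOC diff counts
    # from comparing two releases, roll them up into the five size metrics of
    # item 10.  The module data below are hypothetical.

    modules = [
        # name,       status,       added, modified, deleted, pre_existing_sloc
        ("billing.c", "modified",     120,       80,      30,              2400),
        ("report.c",  "modified",      40,       15,       5,               900),
        ("export.c",  "added",        650,        0,       0,                 0),
        ("legacy.c",  "unmodified",     0,        0,       0,              1800),
        ("oldui.c",   "deleted",        0,        0,     700,               700),
    ]

    metrics = {
        "SLOC Adapted":     sum(a + m + d for _, s, a, m, d, _ in modules if s == "modified"),
        "SLOC Pre-Adapted": sum(pre for _, s, *_, pre in modules if s == "modified"),
        "SLOC Added":       sum(a for _, s, a, *_ in modules if s == "added"),
        "SLOC Reused":      sum(pre for _, s, *_, pre in modules if s == "unmodified"),
        "SLOC Deleted":     sum(d for _, s, _, _, d, _ in modules if s == "deleted"),
    }
    for name, count in metrics.items():
        print(f"{name}: {count}")   # e.g. SLOC Adapted: 290, SLOC Added: 650, ...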
II. COST DRIVERS
Cost Driver Ratings
Columns: Driver | Rating Level | % Increment | Rationale (optional)
Scale Factors
  Precedentedness of Application (PREC): Somewhat unprecedented (NOM)
  Development Flexibility (FLEX): Occasional relaxation allowed (NOM)
  Risk Resolution (RESL): 60% of significant risks eliminated (NOM)
  Team Cohesion (TEAM): Basically cooperative interactions (NOM)
  Equivalent Process Maturity Level (PMAT): SW-CMM Level 3
Product Factors
  Required Software Reliability (RELY): Errors cause small, easily recoverable loss (NOM)
  Database Size (DATA): 10 <= DB bytes per LOC < 100 (average size database) (NOM)
  Product Complexity (CPLX): Average complexity (NOM)
  Documentation Match to Life-Cycle Needs (DOCU): Documentation appropriate to life-cycle needs (NOM)
Platform Factors
  Execution Time Constraint (TIME): Program uses <= 50% of available execution time (NOM)
  Main Storage Constraint (STOR): Program uses <= 50% of available storage (NOM)
  Platform Volatility (PVOL): Major change every 6 months; minor change every 2 weeks (NOM)
Personnel Factors
  Analyst Capability (General) (ACAP): 55th percentile (average)
  Programmer Capability (General) (PCAP): 55th percentile (average)
  Personnel Continuity (Turnover) (PCON): 12% / year (NOM)
  Applications Experience (APEX): 3 years of experience (NOM)
  Language and Tool Experience (LTEX): 3 years of experience (NOM)
  Platform Experience (PLEX): 3 years of experience (NOM)
Project Factors
  Use of Software Tools (TOOL): Basic lifecycle tools; moderately integrated (NOM)
  Multisite Development (SITE): Multi-city or multi-company; narrowband email (NOM)
APPENDIX D. THE COCOMO II.2000 PARAMETERS USED IN
THE EXPERIMENT
Parameter   Very Low   Low     Nominal   High    Very High   Extra High
PCAP        1.34       1.15    1.00      0.88    0.76        –
PLEX        1.19       1.09    1.00      0.91    0.85        –
LTEX        1.20       1.09    1.00      0.91    0.84        –
SU          50         40      30        20      10          –
UNFM        0.0        0.20    0.40      0.60    0.80        1.00
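For reference, the sketch below shows where SU and UNFM (together with DM, CM, IM, and AA) enter the published COCOMO II.2000 reuse formula when adapted code is converted into equivalent new SLOC. This is the baseline COCOMO II model, not necessarily the revised sizing model developed in this dissertation, and the input values are hypothetical.

    # Sketch: the standard COCOMO II.2000 reuse model, showing where the SU and
    # UNFM parameters from the table above are used.  Inputs are hypothetical.

    def equivalent_sloc(adapted_sloc, dm, cm, im, su, unfm, aa, at=0.0):
        """adapted_sloc: SLOC of pre-existing code being adapted; dm/cm/im: percent
        design/code/integration modified; su: software understanding increment
        (10-50); unfm: programmer unfamiliarity (0-1); aa: assessment and
        assimilation increment (0-8); at: percent automatically translated."""
        aaf = 0.4 * dm + 0.3 * cm + 0.3 * im          # adaptation adjustment factor
        if aaf <= 50:
            aam = (aa + aaf * (1 + 0.02 * su * unfm)) / 100
        else:
            aam = (aa + aaf + su * unfm) / 100
        return adapted_sloc * (1 - at / 100) * aam    # equivalent new SLOC

    # Hypothetical release: 20,000 adapted SLOC; 15% design, 30% code, and 40%
    # integration modified; Nominal SU (30) and UNFM (0.4); AA increment of 4.
    print(round(equivalent_sloc(20_000, dm=15, cm=30, im=40, su=30, unfm=0.4, aa=4)))   # -> 7496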
APPENDIX E. HISTOGRAMS FOR THE COST DRIVERS
This appendix lists the histograms for the cost drivers of the 86 releases in the data set, including the 6 outliers. These histograms show the numeric values instead of the symbolic rating levels because the increment between two adjacent rating levels was also provided, which better specifies the actual value of each rating.
[Histograms omitted from this text rendering: one panel per cost driver (PMAT, PREC, TEAM, FLEX, RESL, PCAP, RELY, CPLX, TIME, STOR, ACAP, PLEX, LTEX, DATA, DOCU, PVOL, APEX, PCON, TOOL, and SITE), each plotting No. of Releases against the driver's numeric rating value.]
APPENDIX F. CORRELATION MATRIX FOR EFFORT, SIZE, AND
COST DRIVERS
The lower-triangular correlation matrix is shown in column blocks (PM denotes effort; EKSLOC denotes equivalent size in KSLOC).

          PM     EKSLOC
PM        1.00
EKSLOC    0.84   1.00
PMAT      0.04   0.17
PREC      0.25   0.27
TEAM      0.39   0.39
FLEX      0.68   0.72
RESL      0.35   0.35
PCAP     -0.02  -0.22
RELY     -0.19  -0.11
CPLX      0.03   0.03
TIME      0.00  -0.03
STOR     -0.02  -0.01
ACAP      0.00  -0.17
PLEX      0.11  -0.10
LTEX      0.00  -0.21
DATA      0.17   0.20
DOCU      0.03  -0.04
PVOL      0.02   0.03
APEX      0.08  -0.14
PCON      0.03  -0.05
TOOL      0.26   0.15
SITE      0.19   0.01

          PMAT   PREC   TEAM   FLEX
PMAT      1.00
PREC      0.07   1.00
TEAM      0.15   0.43   1.00
FLEX     -0.05   0.28   0.43   1.00
RESL      0.21   0.14   0.26  -0.04
PCAP      0.16  -0.10   0.09  -0.27
RELY      0.21  -0.09  -0.26  -0.39
CPLX     -0.26  -0.09   0.03   0.26
TIME     -0.10  -0.25  -0.10   0.09
STOR      0.34  -0.15  -0.28  -0.25
ACAP     -0.07  -0.03  -0.16  -0.37
PLEX      0.20  -0.04   0.08  -0.32
LTEX      0.18  -0.02  -0.19  -0.29
DATA      0.08   0.45   0.50   0.23
DOCU     -0.23  -0.06  -0.21  -0.06
PVOL     -0.12   0.20   0.24   0.01
APEX      0.21   0.13  -0.06  -0.18
PCON      0.35  -0.17  -0.07  -0.36
TOOL     -0.18   0.30   0.63   0.55
SITE     -0.21   0.45   0.37   0.04

          RESL   PCAP
RESL      1.00
PCAP      0.12   1.00
RELY      0.32   0.06
CPLX     -0.01  -0.08
TIME     -0.21  -0.43
STOR      0.26  -0.16
ACAP      0.17   0.74
PLEX      0.50   0.50
LTEX      0.30   0.53
DATA      0.12   0.07
DOCU     -0.19  -0.18
PVOL      0.05   0.26
APEX      0.41   0.39
PCON      0.16   0.51
TOOL     -0.10   0.02
SITE      0.06   0.19

          RELY   CPLX
RELY      1.00
CPLX      0.09   1.00
TIME     -0.09  -0.01
STOR      0.21  -0.13
ACAP      0.21  -0.01
PLEX      0.07  -0.36
LTEX      0.09  -0.23
DATA     -0.32  -0.07
DOCU      0.03   0.05
PVOL     -0.09   0.15
APEX      0.23  -0.29
PCON      0.10  -0.21
TOOL     -0.43   0.23
SITE      0.00   0.01

          TIME   STOR   ACAP   PLEX   LTEX   DATA   DOCU   PVOL   APEX   PCON   TOOL   SITE
TIME      1.00
STOR      0.09   1.00
ACAP     -0.38  -0.08   1.00
PLEX     -0.33   0.24   0.49   1.00
LTEX     -0.37   0.07   0.58   0.75   1.00
DATA     -0.16  -0.05  -0.27  -0.03  -0.23   1.00
DOCU      0.49   0.26   0.19  -0.02   0.01  -0.39   1.00
PVOL     -0.29  -0.04   0.37   0.26   0.18   0.19   0.27   1.00
APEX     -0.27   0.06   0.44   0.58   0.72  -0.30  -0.07  -0.17   1.00
PCON     -0.13  -0.11   0.46   0.32   0.30  -0.20  -0.04  -0.10   0.32   1.00
TOOL      0.12  -0.42  -0.25  -0.20  -0.33   0.60  -0.13   0.18  -0.25  -0.19   1.00
SITE      0.14  -0.10   0.37   0.11  -0.03   0.16   0.38   0.26   0.13   0.08   0.38   1.00