CONCEPTUAL DATA MODEL-BASED
SOFTWARE SIZE ESTIMATION FOR
INFORMATION SYSTEMS
Written by:
Hee Beng Kuan Tan,
Yuan Zhao and
Hongyu Zhang
Reviewed by:
Joan Baldriche
OUTLINE

 Introduction
   Motivation
   Problem Statement
   Conceptual Data Model
 Building the models
   Holdout method
   10-Fold cross-validation method
   Testing the models
 Proposed software size estimation
   Rationale behind using a conceptual data model
   Earlier exploration
   Proposed method
 Validation of the proposed method
   Data collection
   Summary of the results
   Threats to validity
   Further investigation
 Comparison to existing methods
 Size to effort estimation
 Conclusions
 Paper critique
 Questions
INTRODUCTION
MOTIVATION

 The estimation of software size plays a key role in a project’s success.
 Overestimation may lead to the abandonment of essential projects or the loss of projects to competitors.
 Underestimation may result in huge financial losses. It is also likely to adversely affect the success and quality of projects.
INTRODUCTION
PROBLEM STATEMENT

 Existing software sizing methods rely on information that is difficult to derive in the early stage of software development.
 This lack of early-stage information often leads to flawed estimations, which can be fatal for a project.
 The conceptual data model offers enough information to produce a very accurate estimation of the size of the project.
INTRODUCTION
CONCEPTUAL DATA MODEL

 A conceptual data model is a diagram identifying the entities and the relationships between them, in order to gain, reflect, and document understanding of the organization’s business from a data perspective.
 It includes the important entities and the relationships among them.
 It typically does not contain attributes, or only the significant ones.
BUILDING THE MODELS

 Regression analysis is a classical statistical technique for building estimation models. It is one of the most commonly used methods in econometric work.
 A multiple linear regression model, which expresses the estimated value of a dependent variable y as a function of k independent variables x1, x2, . . . , xk, is a commonly used method for regression analysis:
   y = β0 + β1x1 + β2x2 + · · · + βkxk,
 where β0, β1, β2, . . . , βk are the coefficients to be estimated from a random sample.
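As a concrete illustration, such a model can be fitted by ordinary least squares. This is a minimal sketch on made-up synthetic data (the "true" coefficients 2, 3, and 0.5 are invented for the demo), not the authors' setup:

```python
import numpy as np

# Synthetic sample for illustration: the assumed model is
# y = 2 + 3*x1 + 0.5*x2, plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 2))
y = 2 + 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=30)

# Prepend a column of ones so that beta[0] plays the role of the intercept β0.
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: choose beta minimizing ||X1 @ beta - y||².
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta)  # close to [2, 3, 0.5]
```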
BUILDING THE MODELS
HOLDOUT METHOD

 The holdout method uses separate datasets for model building and validation.
 Any model built from a random sample should be validated against another random sample.
 The size of the dataset for building the model should not be less than five times the total number of independent variables.
 The size of the latter sample should not be less than half the size of the former sample.
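A minimal sketch of the holdout procedure under the rules of thumb above. All data here is synthetic and the model y = 1 + 2·x1 + x2 + 0.5·x3 is made up; the sizes 30 and 15 satisfy both constraints for k = 3 variables:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3                     # number of independent variables
n_build, n_hold = 30, 15  # 30 >= 5*k and 15 >= 30/2: both rules of thumb hold

# Synthetic data from the assumed model above, with a little noise.
X = rng.uniform(0, 10, size=(n_build + n_hold, k))
y = 1 + X @ np.array([2.0, 1.0, 0.5]) + rng.normal(0, 0.1, size=len(X))

# Random split: one sample builds the model, the other validates it.
idx = rng.permutation(len(X))
build, hold = idx[:n_build], idx[n_build:]

A = np.column_stack([np.ones(n_build), X[build]])
beta, *_ = np.linalg.lstsq(A, y[build], rcond=None)

# Validate against the holdout sample the model never saw.
pred = np.column_stack([np.ones(n_hold), X[hold]]) @ beta
mmre = np.mean(np.abs(pred - y[hold]) / y[hold])
print(mmre)  # small: the model generalizes to the unseen sample
```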
BUILDING THE MODELS
10-FOLD CROSS-VALIDATION METHOD

 The 10-fold cross-validation is the most commonly used k-fold cross-validation (setting the integer k to 10).
 It uses only one dataset, which it divides into k approximately equal partitions. Each partition is in turn used for testing while the remainder is used for training.
 That is, the (1 − 1/k) portion of the dataset is used for training and the 1/k portion of the dataset is used for testing.
 The procedure is repeated k times so that, at the end, every instance has been used exactly once for testing.
 The best-fitted parameters are chosen to build and validate the model based on the whole dataset.
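The rotation described above can be sketched as follows. The data is synthetic and noiseless (so each fold's fit is essentially exact); numpy's array_split handles the "approximately equal" partitions:

```python
import numpy as np

def k_fold_indices(n, k=10, seed=2):
    """Shuffle range(n) and split it into k approximately equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

# Noiseless toy data: y = 4 + 2*x1 + x2 (coefficients made up for the demo).
X = np.random.default_rng(3).uniform(0, 10, size=(50, 2))
y = 4 + 2 * X[:, 0] + X[:, 1]

folds = k_fold_indices(len(X), k=10)
errors = []
for i, test_idx in enumerate(folds):
    # The (1 - 1/k) portion trains; the 1/k portion tests; roles rotate.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    A = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    beta, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
    pred = np.column_stack([np.ones(len(test_idx)), X[test_idx]]) @ beta
    errors.append(np.mean(np.abs(pred - y[test_idx])))

# Every instance was used exactly once for testing across the 10 rounds.
print(np.mean(errors))  # essentially zero on this noiseless toy data
```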
TESTING THE MODELS

 These are some of the tests used when building this kind of model:
 Significance test
 Fitness test
 Multicollinearity test
 Extreme case test
 The most commonly used measures for model validation in software engineering are:
 MMRE (Mean Magnitude of Relative Error)
 PRED(0.25) (prediction at level 0.25: the proportion of estimates whose relative error is at most 25%)
 Their acceptable values are not more than 0.25 and not less than 0.75, respectively.
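The two measures can be computed directly from actual and estimated sizes. The KLOC figures below are made up purely for illustration:

```python
def mmre(actual, predicted):
    """Mean Magnitude of Relative Error: mean of |actual - predicted| / actual."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

def pred_level(actual, predicted, level=0.25):
    """PRED(level): proportion of estimates with relative error <= level."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) / a <= level)
    return hits / len(actual)

# Hypothetical actual vs. estimated sizes in KLOC (numbers are made up).
actual = [10.0, 20.0, 40.0, 8.0]
estimated = [11.0, 18.0, 30.0, 8.4]

print(mmre(actual, estimated))        # 0.125 — within the 0.25 threshold
print(pred_level(actual, estimated))  # 1.0 — all estimates within 25%
```

On these numbers the hypothetical model would pass both acceptance criteria (MMRE ≤ 0.25 and PRED(0.25) ≥ 0.75).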
PROPOSED SOFTWARE SIZE ESTIMATION
RATIONALE BEHIND USING A CONCEPTUAL DATA MODEL

 "Information system" refers to a database application that supports business processes and functions by maintaining a database using a standard database management system.
 Information systems are very data-intensive.
 Information systems usually require the following in order to maintain and retrieve data in and from the database:
 A graphical user interface
 Information delivery through printed reports
 For each business entity, concept, and relationship type, they provide programs to create, reference, and remove their instances.
 They contain simple data analysis functions such as sum, count, average, minimum, and maximum.
 They provide error-correction programs for canceling extraneous input instances submitted by users.
PROPOSED SOFTWARE SIZE ESTIMATION
RATIONALE BEHIND USING A CONCEPTUAL DATA MODEL

 The size of the source code of an information system in this class depends basically on its database, which in turn depends on its conceptual data model.
 This observation suggests that the size of an information system is well characterized by its conceptual data model. Moreover, in the early stage of software development, a conceptual data model can be constructed more accurately than the information required by existing software sizing methods can be obtained.
 In fact, the construction of a conceptual data model in the early stage of software development is quite common practice in industry.
PROPOSED SOFTWARE SIZE ESTIMATION
EARLIER EXPLORATION

 A preliminary method using a multiple linear regression model to estimate the LOC of information systems from their conceptual data model was proposed in Tan and Zhao [2004, 2006].
 In the preliminary method, LOC was estimated from the total number of entity types (E), the total number of relationship types (R), and the total number of attributes (A) in an ER diagram representing the system’s conceptual data model:
 KLOC = β0 + β1(E) + β2(R) + β3(A).
PROPOSED SOFTWARE SIZE ESTIMATION
EARLIER EXPLORATION

 The results from the previous study showed some inconsistencies: there was no consistency in the values of the coefficients across the models built, which was unexpected.
 Most organizations in the software industry hold the view that their systems are confidential and should not be accessed by external parties. This prevented the data from being reviewed.
 Further investigation showed that most of the data supplied by companies in the industry was wrong.
PROPOSED SOFTWARE SIZE ESTIMATION
PROPOSED METHOD

 A new multiple linear regression model was proposed:
 KLOC = β0 + βC(C) + βR(R) + βA(Ā)
 C is the total number of classes in the conceptual data model.
 R is the total number of unidirectional relationship types in the conceptual data model.
 Ā is the average number of attributes per class in the conceptual data model, that is, Ā = A/C, where A is the total number of attributes defined explicitly in the conceptual data model.
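Once the coefficients have been fitted, applying the model is straightforward. The counts and coefficient values below are hypothetical placeholders for illustration, not the fitted values from the paper:

```python
def estimate_kloc(c, r, a_total, b0, b_c, b_r, b_a):
    """KLOC = β0 + βC·C + βR·R + βA·Ā, with Ā = A / C."""
    a_bar = a_total / c
    return b0 + b_c * c + b_r * r + b_a * a_bar

# Hypothetical conceptual data model: 40 classes, 55 unidirectional
# relationship types, 320 explicitly defined attributes (so Ā = 8).
# The coefficients are placeholders and must be fitted from a sample.
kloc = estimate_kloc(c=40, r=55, a_total=320, b0=2.0, b_c=0.9, b_r=0.4, b_a=0.3)
print(kloc)  # 2.0 + 36.0 + 22.0 + 2.4 = 62.4
```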
PROPOSED SOFTWARE SIZE ESTIMATION
PROPOSED METHOD

 The use of the proposed LOC estimation method centers on the construction of a conceptual data model using a class diagram. The conceptual data model should be constructed as follows:
 Identify all the business entities and concepts required to be tracked or monitored by the system. Each resulting entity and concept is represented as a class.
 Identify all the interactions between business entities, between concepts, and between the business entities and concepts represented. All the interactions are represented as associations and aggregations.
 For each business entity and concept represented as a class, identify the attributes required to describe the entity or concept, and represent them as the attributes of the class.
VALIDATION OF THE PROPOSED METHOD

 The proposed method was validated using the two methods for building and validating multiple linear regression models that were discussed earlier.
 Both methods were applied to five pairs of datasets, categorized into three groups according to the programming language used; the two datasets in each pair, labeled A and B, were collected independently from the software industry and open-source systems.
VALIDATION OF THE PROPOSED METHOD
DATA COLLECTION

 The main objective of collecting data from open-source systems was to have larger datasets.
 As mentioned before, because of confidentiality concerns, many companies did not allow access to their documentation to verify the accuracy of the data supplied, so their data was removed from the study.
 For each system sampled, the conceptual data model of the system was manually reverse engineered from its database schema and program source code.
VALIDATION OF THE PROPOSED METHOD
SUMMARY OF THE RESULTS

 Using the MMRE and PRED(0.25) measures, all the datasets fell within the acceptable ranges (MMRE not more than 0.25, PRED(0.25) not less than 0.75).
 In most of the models built, the magnitude of the coefficient for Ā was the lowest among the coefficients for the three independent variables. This is consistent with intuitive understanding: much of the complexity in information systems is caused by the maintenance of class instances and the navigation between these instances. The number of attributes in classes does not affect complexity as much as the number of classes and relationship types. Therefore, C and R have much greater influence than Ā on LOC.
VALIDATION OF THE PROPOSED METHOD
THREATS TO VALIDITY

 Clearly, due to the confidentiality issues, it was not possible for the samples to cover the full range of possible LOC values. So the models built are not appropriate for systems with LOC values significantly larger or smaller than those in the respective samples.
 As we can see, the proposed method is not ready for general use, but it shows promise for later experiments with larger datasets.
VALIDATION OF THE PROPOSED METHOD
FURTHER INVESTIGATION

 To address the general applicability of the proposed estimation method and the models built, the common characteristics of the datasets were investigated.
 In comparable studies, 14 different factors have been used as adjustments to the models: data communications, distributed data processing, performance, heavily used configuration, transaction rate, online data entry, end-user efficiency, online update, complex processing, reusability, installation ease, operational ease, multiple sites, and facilitate change.
 In addition to those, this study added another four, for a total of 18: reuse level, security, GUI requirement, and external report and inquiry.
COMPARISON TO EXISTING METHODS

 The proposed approach was compared to two of the most widely used size and effort estimation methods in the field:
 FP (Function Point) method: a function point is a unit of measurement expressing the amount of business functionality an information system provides to a user.
 Use-Case Point method: a use-case point is based on the number and complexity of the use cases in the system and the number and complexity of the actors on the system.
COMPARISON TO EXISTING METHODS
FP METHOD

 One major problem of the FP method is the difficulty of obtaining the required information in the early stage of software development. Since the proposed method is based on a conceptual data model, it has a clear advantage over the FP method.
 In terms of accuracy, the proposed method was shown to be very similar to the FP method.
 The values of MMRE and PRED(0.25) in the first comparison were 0.14 and 0.77 for the FP method, versus 0.14 and 0.81 for the proposed method.
 The values of MMRE and PRED(0.25) in the second comparison were 0.17 and 0.75 for the FP method, versus 0.21 and 0.79 for the proposed method.
COMPARISON TO EXISTING METHODS
USE-CASE POINT METHOD

 In the use-case point method, it was not easy to estimate the required information (such as the complexity of the use cases) in the early stage of software development.
 After building two linear regression models for the use-case point method, both the MMRE and PRED(0.25) values were far from the acceptable ranges.
 The use-case point method failed to build a reasonable model for estimating LOC from the systems in the datasets used.
SIZE TO EFFORT ESTIMATION

 Software effort estimation is a crucial task in the software industry.
 Estimated effort and market factors are key for developers to price a software development project.
 Project plans and schedules are also based on the estimated effort.
 Effort estimation methods are mostly algorithmic in nature, although there have also been explorations into the use of machine learning and other nonalgorithmic methods.
SIZE TO EFFORT ESTIMATION

 The major drivers used in effort estimation models are size, team experience, technology, reuse resources, and required reliability.
 Among these, size is a main driver.
 A good size estimate is very important for effort estimation, and the estimation of size is still challenging.
 The proposed approach can be used to estimate the LOC of systems in the early stages of development, and the results can then be used to better estimate effort.
CONCLUSIONS

 A novel method for estimating LOC for a large class of information systems has been proposed and validated.
 Experiments comparing the proposed method with the well-known FP method and the use-case point method on estimating LOC showed that the proposed method is comparable in terms of accuracy.
 There is no doubt that much work is still required to fine-tune the proposed method in order to put it into practical use.
 It is also highly probable that the proposed method can estimate project effort and cost directly based on the three independent variables used (C, R, and Ā), in a similar way to the FP method.
PAPER CRITIQUE
STRENGTHS

 Being an extension of the same authors’ previous work, the paper shows the depth of knowledge they brought to it.
 A lot of detail in data selection to ensure the validity of the results.
 A lot of testing to ensure the models did not exhibit the same mistakes as in the first paper.
PAPER CRITIQUE
WEAKNESSES

 Due to confidentiality concerns, many companies did not allow access to their documentation to verify the accuracy of the data supplied, so their data was removed from the study, leaving out data that could have produced a different outcome.
 It does not yet compare against the other methods in estimating effort.
 A lot of repetition throughout the paper, making it a bit too long.
 Could have been organized a bit better.
PAPER CRITIQUE
SUGGESTIONS

 Repeat the experiment with projects of different sizes, going outside the LOC ranges used here, and compare the results.
 Expand on this research to generalize the method using the 18 proposed factors.
 Enhance the study to estimate effort from the three independent variables used (C, R, and Ā).
QUESTIONS