CONCEPTUAL DATA MODEL-BASED SOFTWARE SIZE ESTIMATION FOR INFORMATION SYSTEMS
Written by: Hee Beng Kuan Tan, Yuan Zhao and Hongyu Zhang
Reviewed by: Joan Baldriche

OUTLINE
- Introduction: Motivation; Problem Statement; Conceptual Data Model
- Building the Models: Holdout Method; 10-Fold Cross-Validation Method; Testing the Models
- Proposed Software Size Estimation: Rationale Behind Using a Conceptual Data Model; Earlier Exploration; Proposed Method
- Validation of the Proposed Method: Data Collection; Summary of the Results; Threats to Validity; Further Investigation
- Comparison to Existing Methods
- Size to Effort Estimation
- Conclusions
- Paper Critique
- Questions

INTRODUCTION: MOTIVATION
Software size estimation plays a key role in a project's success. Overestimation may lead to the cancellation of essential projects or the loss of projects to competitors. Underestimation may result in huge financial losses, and is also likely to adversely affect the success and quality of projects.

INTRODUCTION: PROBLEM STATEMENT
Existing software sizing methods rely on information that is difficult to derive in the early stage of software development. This lack of early information is what often leads to flawed estimates, which can be fatal for a project. The conceptual data model, by contrast, offers enough early information to produce an accurate estimate of project size.

INTRODUCTION: CONCEPTUAL DATA MODEL
A conceptual data model is a diagram identifying the entities in an organization's business and the relationships between them, in order to gain, reflect, and document understanding of the business from a data perspective. It includes the important entities and the relationships among them, and it typically contains no attributes, or only the significant ones.

BUILDING THE MODELS
Regression analysis is a classical statistical technique for building estimation models and one of the most commonly used methods in econometric work. A common choice is a multiple linear regression model, which expresses the estimated value of a dependent variable y as a function of k independent variables x1, x2, ..., xk:
y = β0 + β1x1 + β2x2 + ... + βkxk,
where β0, β1, β2, ..., βk are the coefficients to be estimated from a random sample.

BUILDING THE MODELS: HOLDOUT METHOD
The holdout method uses separate datasets for model building and validation: any model built from a random sample should be validated against another random sample. The sample used for building the model should contain at least five times as many observations as there are independent variables, and the validation sample should be at least half the size of the building sample.

BUILDING THE MODELS: 10-FOLD CROSS-VALIDATION METHOD
10-fold cross-validation is the most commonly used form of k-fold cross-validation (with k set to 10). It uses only one dataset, divided into k approximately equal partitions. Each partition is used in turn for testing while the remainder is used for training; that is, a (1 − 1/k) portion of the dataset is used for training and a 1/k portion for testing. The procedure is repeated k times so that, at the end, every instance has been used exactly once for testing. The best-fitted parameters are then chosen to build and validate the model on the whole dataset.

TESTING THE MODELS
These are some of the tests applied when building such models:
- Significance test
- Fitness test
- Multicollinearity test
- Extreme case test
The most widely used measures for model validation in software engineering are MMRE (Mean Magnitude of Relative Error) and PRED(0.25) (the proportion of estimates within 25% of the actual value). Their acceptable values are at most 0.25 and at least 0.75, respectively.
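To make the model-building and validation workflow above concrete, here is a minimal sketch in Python, assuming NumPy and scikit-learn are available. The dataset and coefficients are synthetic stand-ins invented for illustration; the paper's actual datasets are not public.

```python
# Sketch: fit a multiple linear regression on a building sample,
# validate on a holdout sample, and score with MMRE and PRED(0.25).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# 45 hypothetical observations of k = 3 independent variables (the
# building sample must hold at least 5 * k = 15 observations per the
# rule stated above).
X = rng.uniform(1.0, 50.0, size=(45, 3))
y = 2.0 + X @ np.array([0.8, 1.2, 0.1]) + rng.normal(0.0, 2.0, size=45)

# Holdout: the validation sample is half the size of the building
# sample, i.e. one third of the whole dataset.
X_build, X_val, y_build, y_val = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

model = LinearRegression().fit(X_build, y_build)  # estimates beta_0..beta_k
y_hat = model.predict(X_val)

mre = np.abs(y_val - y_hat) / y_val  # magnitude of relative error per case
mmre = mre.mean()                    # MMRE: acceptable if <= 0.25
pred_25 = (mre <= 0.25).mean()       # PRED(0.25): acceptable if >= 0.75

print(f"MMRE = {mmre:.3f}, PRED(0.25) = {pred_25:.3f}")
```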
PROPOSED SOFTWARE SIZE ESTIMATION: RATIONALE BEHIND USING A CONCEPTUAL DATA MODEL
Information system here refers to a database application that supports business processes and functions by maintaining a database through a standard database management system. Information systems are very data-intensive, and they typically require the following to maintain and retrieve data in and from the database:
- A graphical user interface.
- Information delivery through printed reports.
- For each business entity, concept, and relationship type, programs to create, reference, and remove their instances.
- Simple data analysis functions such as sum, count, average, minimum, and maximum.
- Error-correction programs for canceling erroneous input instances submitted by users.

PROPOSED SOFTWARE SIZE ESTIMATION: RATIONALE BEHIND USING A CONCEPTUAL DATA MODEL
The size of the source code of an information system in this class depends essentially on its database, which in turn depends on its conceptual data model. This observation suggests that the size of an information system is well characterized by its conceptual data model. Moreover, in the early stage of software development, a conceptual data model can be constructed more accurately than the information required by existing sizing methods can be obtained. In fact, constructing a conceptual data model early in development is quite common practice in industry.

PROPOSED SOFTWARE SIZE ESTIMATION: EARLIER EXPLORATION
A preliminary method using a multiple linear regression model to estimate the LOC of information systems from their conceptual data model was proposed in Tan and Zhao [2004, 2006]. It estimated LOC from the total number of entity types (E), the total number of relationship types (R), and the total number of attributes (A) in an ER diagram representing the system's conceptual data model:
KLOC = β0 + β1(E) + β2(R) + β3(A).

PROPOSED SOFTWARE SIZE ESTIMATION: EARLIER EXPLORATION
The results of that study showed some inconsistencies: the coefficient values were not consistent across the models built, which was unexpected. Most organizations in the software industry regard their systems as confidential and off-limits to external parties, which prevented the data from being reviewed. Further investigation showed that most of the data supplied by companies in the industry was wrong.

PROPOSED SOFTWARE SIZE ESTIMATION: PROPOSED METHOD
A new multiple linear regression model was proposed:
KLOC = β0 + βC(C) + βR(R) + βA(Ā)
where:
- C is the total number of classes in the conceptual data model.
- R is the total number of unidirectional relationship types in the conceptual data model.
- Ā is the average number of attributes per class in the conceptual data model, that is, Ā = A/C, where A is the total number of attributes defined explicitly in the conceptual data model.
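To ground these three inputs, here is a toy illustration of counting C, R, and Ā. The dictionary representation is only a stand-in for a class diagram, and the class, attribute, and relationship names are invented:

```python
# Illustrative only: derive the model inputs (C, R, A_bar) from a toy
# conceptual data model represented as plain Python data structures.
classes = {
    "Customer": ["name", "address", "phone"],
    "Order":    ["date", "status", "total"],
    "Product":  ["sku", "description", "price"],
}
# Unidirectional relationship types (associations/aggregations), as
# (source class, target class) pairs:
relationships = [
    ("Customer", "Order"),   # a customer places orders
    ("Order", "Product"),    # an order contains products
]

C = len(classes)                                   # total classes
R = len(relationships)                             # total relationship types
A = sum(len(attrs) for attrs in classes.values())  # explicitly defined attributes
A_bar = A / C                                      # average attributes per class

print(f"C = {C}, R = {R}, A_bar = {A_bar:.2f}")
# These values are then plugged into KLOC = b0 + bC*C + bR*R + bA*A_bar.
```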
PROPOSED SOFTWARE SIZE ESTIMATION: PROPOSED METHOD
Using the proposed LOC estimation method centers on constructing a conceptual data model as a class diagram, built as follows:
- Identify all the business entities and concepts that the system must track or monitor. Represent each resulting entity and concept as a class.
- Identify all the interactions between business entities, between concepts, and between business entities and concepts. Represent all interactions as associations and aggregations.
- For each business entity and concept represented as a class, identify the attributes required to describe the entity or concept, and represent them as the attributes of the class.

VALIDATION OF THE PROPOSED METHOD
The proposed method was validated using the two methods for building and validating multiple linear regression models discussed earlier. Both methods were applied to five pairs of datasets, categorized into three groups according to the programming language used. Each pair consists of two datasets, labeled A and B, collected independently from the software industry and from open-source systems.

VALIDATION OF THE PROPOSED METHOD: DATA COLLECTION
The main objective of collecting data from open-source systems was to obtain larger datasets. As mentioned before, owing to confidentiality concerns many companies did not allow access to their documentation to verify the accuracy of the supplied data, so their data was removed from the study. For each system sampled, the conceptual data model was manually reverse-engineered from the system's database schema and program source code.

VALIDATION OF THE PROPOSED METHOD: SUMMARY OF THE RESULTS
By the MMRE and PRED(0.25) measures, all the models built fell within the acceptable thresholds of at most 0.25 and at least 0.75, respectively. In most of the models, the magnitude of the coefficient for Ā was the lowest among the three independent variables, which is consistent with intuition: much of the complexity in information systems comes from maintaining class instances and navigating between them, and the number of attributes in classes affects this complexity far less than the number of classes and relationship types. Therefore, C and R have much greater influence than Ā on LOC.
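As a rough sketch of how the 10-fold cross-validation procedure described earlier could be applied to the proposed model, the snippet below fits KLOC = β0 + βC(C) + βR(R) + βA(Ā) on invented (C, R, Ā, KLOC) observations and reports MMRE and PRED(0.25). Both the data and the "true" coefficients are hypothetical and bear no relation to the paper's fitted models:

```python
# Sketch: 10-fold cross-validation of the proposed three-variable model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
n = 30
C = rng.integers(10, 80, size=n)    # total classes per system
R = rng.integers(10, 120, size=n)   # total unidirectional relationship types
A_bar = rng.uniform(3, 15, size=n)  # average attributes per class (A / C)
X = np.column_stack([C, R, A_bar])
# Hypothetical relationship, with C and R dominating A_bar:
kloc = 1.5 + 0.9 * C + 0.6 * R + 0.05 * A_bar + rng.normal(0.0, 2.0, size=n)

mre = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], kloc[train_idx])
    pred = model.predict(X[test_idx])
    mre.extend(np.abs(kloc[test_idx] - pred) / kloc[test_idx])

mre = np.array(mre)  # every system was tested exactly once
print(f"MMRE = {mre.mean():.3f}, PRED(0.25) = {(mre <= 0.25).mean():.3f}")
```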
VALIDATION OF THE PROPOSED METHOD: THREATS TO VALIDITY
Clearly, due to the confidentiality issues, the samples could not cover the full range of possible LOC values, so the models built are not appropriate for systems whose LOC is significantly larger or smaller than the LOC values in the respective samples. The proposed method is therefore not ready for general use, but it shows promise for later experiments with larger datasets.

VALIDATION OF THE PROPOSED METHOD: FURTHER INVESTIGATION
To address the general applicability of the proposed estimation method and the models built, the common characteristics of the datasets were investigated. Comparable studies have used 14 factors to adjust their models: data communications, distributed data processing, performance, heavily used configuration, transaction rate, online data entry, end-user efficiency, online update, complex processing, reusability, installation ease, operational ease, multiple sites, and facilitation of change. This study added another four, for a total of 18: reuse level, security, GUI requirement, and external report and inquiry.

COMPARISON TO EXISTING METHODS
The proposed approach was compared with two of the most widely used size and effort estimation methods in the field:
- FP (Function Point) method. A function point is a unit of measurement expressing the amount of business functionality an information system provides to a user.
- Use-case point method. A use-case point is based on the number and complexity of the use cases in the system and the number and complexity of the actors on the system.

COMPARISON TO EXISTING METHODS: FP METHOD
One major problem of the FP method is the difficulty of obtaining the required information in the early stage of software development; since the proposed method is based on a conceptual data model, it has a clear advantage here. In terms of accuracy, the proposed method proved very similar to the FP method: in the first comparison, the MMRE and PRED(0.25) values were 0.14 and 0.77 for the FP method versus 0.14 and 0.81 for the proposed method; in the second comparison, they were 0.17 and 0.75 versus 0.21 and 0.79 for the proposed method.

COMPARISON TO EXISTING METHODS: USE-CASE POINT METHOD
For the use-case point method, the required information (such as the complexity of the use cases) is not easy to estimate in the early stage of software development. After two linear regression models were built for the use-case point method, both the MMRE and PRED(0.25) values fell far outside the acceptable ranges: the use-case point method failed to produce a reasonable model for estimating LOC from the systems in the datasets used.

SIZE TO EFFORT ESTIMATION
Software effort estimation is a crucial task in the software industry. Estimated effort and market factors are key inputs when developers price a software development project, and project plans and schedules are also based on the estimated effort. Effort estimation methods are mostly algorithmic in nature, though there have also been explorations into machine learning and other nonalgorithmic methods.

SIZE TO EFFORT ESTIMATION
The major drivers used in effort estimation models are size, team experience, technology, reuse resources, and required reliability. Among these, size is a main driver: a good size estimate is very important for effort estimation, yet size estimation remains challenging. The proposed approach can be used to estimate the LOC of a system in the early stages of development and then feed the result into an effort estimate.
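The paper does not prescribe a particular size-to-effort model; as an illustration only, a well-known algorithmic model such as Basic COCOMO could convert the estimated size into effort. The coefficients below are the published Basic COCOMO values for "organic" projects, and the KLOC value is a hypothetical output of the proposed estimation model:

```python
# Illustration only: turn an estimated size into effort using Basic
# COCOMO (not the method proposed in the paper).
def basic_cocomo_effort(kloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Effort in person-months: a * KLOC ** b (Basic COCOMO, organic mode)."""
    return a * kloc ** b

estimated_kloc = 45.0  # hypothetical estimate from the C/R/A_bar model
print(f"Estimated effort: {basic_cocomo_effort(estimated_kloc):.1f} person-months")
```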
CONCLUSIONS
A novel method for estimating the LOC of a large class of information systems has been proposed and validated. Experiments comparing the proposed method with the well-known FP method and the use-case point method on estimating LOC showed that the proposed method is comparable in terms of accuracy. There is no doubt that much work is still required to fine-tune the proposed method and put it into practical use. It is also highly probable that the proposed method can estimate project effort and cost directly from the three independent variables used (C, R, and Ā), in a similar way to the FP method.

PAPER CRITIQUE: STRENGTHS
- As an extension of the same authors' previous work, the paper reflects the deep knowledge they brought to it.
- Great care in data selection to ensure the validity of the results.
- Extensive testing to ensure the models did not repeat the mistakes of the first paper.

PAPER CRITIQUE: WEAKNESSES
- Due to confidentiality concerns, many companies did not allow access to their documentation to verify the accuracy of the supplied data, so their data was removed from the study, leaving out data that could have changed the outcome.
- The method has not yet been compared with the other methods on effort estimation.
- Considerable repetition throughout, making the paper somewhat too long; it could also have been organized a bit better.

PAPER CRITIQUE: SUGGESTIONS
- Repeat the experiment with projects of different sizes, going outside the ranges used here, and compare the results.
- Expand the research to generalize the method using the 18 proposed factors.
- Extend the study to estimate effort directly from the three independent variables used (C, R, and Ā).

QUESTIONS