Visualization and Analysis of Open Source Software Evolution using An Evolution Curve Method

Dr. Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų 50-415, Kaunas, Lithuania
Email: robertas.damasevicius@ktu.lt
http://soften.ktu.lt/~damarobe

Eighth International Baltic Conference on Databases and Information Systems, June 2-5, 2008, Tallinn, Estonia

Context and Problem

Software systems are:
- designed, constructed and used by people
- components in larger socio-technical systems
Software design is:
- a social process embedded within organizational and cultural structures
- influenced by social processes such as programmer collaboration in teams
Open source software systems:
- free to use
- source code is freely available
- developed by many programmers
- continuously evolve
Aim: analysis of open source software evolution using metrics

What is software evolution?

Definition: a continuing process in time during which some essential software properties are changed
Activities: modification, adaptation, maintenance, and other activities which occur after the delivery of the first operational release to the users
Importance: costs devoted to system maintenance and evolution account for more than 90% of total software costs (Erlikh, 1990)

Forces and factors of open source software evolution

Evolution of open source systems:
- less strict control and management model
- usually started by a single developer (seed)
- attracted users become co-developers
- governed by the needs of users and spontaneous collaboration of co-developers
Evolution mechanisms:
- natural selection, competition
- variation-increasing & variation-decreasing
- influenced by psychological, intellectual, social and cultural, economic and business factors

Software metrics
Common:
- Source lines of code
- Cyclomatic complexity
- Halstead metrics
- Number of classes and interfaces
- R.C. Martin’s software package metrics
- Cohesion, Coupling, …
Specific software evolution metrics:
- SDI metric
- L-metric
- AICC metric
- G-metric
Software development models:
- Statistical models
- Rayleigh model
- Halstead’s Software Science model
- COCOMO model

Lehman’s “Laws of Software Evolution”

Formulated by M.M. Lehman in the 1980s:
- Law of Continuing Change
- Law of Increasing Complexity
- Law of Statistically Smooth Growth
- Law of Organisational Stability
- Law of Conservation of Familiarity
- Law of Continuing Growth
- Law of Declining Quality
- Law of Feedback System
Evolution forces:
- Growth
- Maintenance

Transition-based model of evolution

[Figure: a software characteristic plotted against time, showing gradual change, sudden change and transitions]

Stages: many, often overlapping
Transitions: breakpoints between stages, which represent significant changes. Transitions occur because, as a system evolves, its structure must be regularly adapted to the changing requirements and environment.
Gradual change: a slow process of incremental change caused by accumulating maintenance steps or gradual decay
Sudden change: significant changes in the evolving system or in the process by which it is evolved

Information-theoretic methods

Shannon entropy
A measure of the uncertainty associated with a random variable.
If an information source generates a series of symbols $x_i$ belonging to an alphabet of size n according to a known probability distribution $p(x_i)$, the entropy function H of a sequence X can be defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

- High entropy: higher complexity of the system’s code
- Low entropy: there are some repeated patterns of source code; code maintenance is required

Kolmogorov Complexity
Measures the ‘complexity’ (i.e., information content) of an object by the length of the smallest program that generates it.
The Kolmogorov Complexity $K_\varphi(x)$ of an object x in the description system φ is the length of the shortest program w capable of producing x:

$$K_\varphi(x) = \min_{w} \{\, |w| : \varphi(w) = x \,\}$$

Evolution curve method (1)

Motivation: the addition of new features to a software system changes the basic software characteristics (complexity/entropy) of the system.
Idea: use the change of software size and complexity as a means to determine different stages of evolution of a software system.
Inspiration: Z-curve [1] and DNA walk [2] methods used in analyzing complex genetic sequences.

[1] R. Zhang, C.T. Zhang. Z Curves, an Intuitive Tool for Visualizing and Analyzing DNA sequences. J. Biomol. Struc. Dynamics 11, 767–782, 1994.
[2] S. Paxia, A. Rudra, Y. Zhou, B. Mishra. A Random Walk down the Genomes: DNA Evolution in VALIS. IEEE Computer 35(7):73-79, 2002.

Evolution curve method (2)

The E-curve is composed of a series of nodes $E_i(x_i, y_i)$, whose coordinates are $x_i$ and $y_i$ ($i = 1, 2, \ldots, N$), where N is the number of versions of the analyzed software system. The nodes $E_i$ are connected sequentially with straight segments.
The coordinates $x_i$ and $y_i$ are calculated iteratively:

$$x_i = \begin{cases} x_{i-1} + 1, & \text{if } K_i > K_{i-1} \\ x_{i-1}, & \text{if } K_i = K_{i-1} \\ x_{i-1} - 1, & \text{if } K_i < K_{i-1} \end{cases}
\qquad
y_i = \begin{cases} y_{i-1} + 1, & \text{if } H_i > H_{i-1} \\ y_{i-1}, & \text{if } H_i = H_{i-1} \\ y_{i-1} - 1, & \text{if } H_i < H_{i-1} \end{cases}$$

where $K_i$ is the Kolmogorov Complexity of the i-th version of a software system and $H_i$ is the Shannon entropy of the i-th version of the system.

Evolution curve method (3)

The two dimensions of the Evolution curve, x (relative information content) and y (relative complexity), represent two independent (orthogonal) characteristics of a software system:
- x-dimension: the amount of information contained in a software system; an estimate of software size
- y-dimension: the information entropy of a software system; an estimate of software complexity

Software evolution stages

- Software Growth: system is actively developed
- Software Maintenance: system becomes simpler, often at a cost of its size
- Software Improvement: system becomes more complex and generic
- Software Shrink: functionality of a system is reduced

[Figure: EVOLUTION diagram with the four stages GROWTH, MAINTENANCE, IMPROVEMENT and SHRINK arranged on Size and Complexity axes]

Trends of Evolution curve

- Actively developed systems: long upward trends of growth
- Mature, stable systems: long downward trends of maintenance

[Figure: two panels, "Actively Developed Systems" and "Mature Systems", each plotting Complexity against Size]

Case studies

Source: SourceForge
- 7-zip: archiver; 82 versions, 5 years, 160K LOC
- Grip: CD player/ripper; 36 versions, 14K LOC
- eMule: P2P file sharing client

Case study: eMule
eMule: one of the biggest P2P file sharing clients
- coded in Microsoft Visual C++ using MFC
- free software, released under the GNU GPL
- source code first released at version 0.02 on July 6, 2002
- latest release contains 222,680 lines of code
- actively developed by 5 developers
- current development status is “Production/Stable”
- for analysis, 68 versions of eMule source code were used

eMule: Entropy

[Figure: entropy of eMule; labelled versions: 015a, 030a, 018a]

eMule: Size

[Figure: size of eMule with the fitted curve $y = A + B x + C x^2$, where A = 7676.17, B = 4324.67, C = 177.488, r = 0.9935]

eMule’s Evolution curve

[Figure: Evolution curve of eMule; labelled versions: 30e, 47c, 23b, 44b, 25b]

What does the changelog say?
Conclusions

The software evolution process can be divided into 4 stages:
- software growth: the size and complexity of the developed software are increasing
- software maintenance: the aim is to contain complexity and fix software bugs
- software improvement: the aim is to contain software system size at a cost of increasing complexity
- software shrink: both software size and complexity are trimmed
The Evolution curve method:
- can identify software evolution stages
- can identify the initial development status of the analyzed software system:
  - actively developed systems show long growth trends
  - mature systems show maintenance and improvement trends
- is independent of the software implementation language

Ongoing Research and Further Work

- Analysis of other entropy measures such as block entropy and Rényi entropies
- Dynamic models of software evolution: differential equations, etc. (paper submitted to Journal of Software Maintenance and Evolution)
- More case studies (paper submitted to Computing and Information Systems Journal)

Thank You. Any Questions?

7-zip: Evolution curve

[Figure: Evolution curve of 7-zip]

Grip: Evolution curve

[Figure: Evolution curve of Grip]
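Supplementary sketch

The iterative rule from the "Evolution curve method (2)" slide can be sketched in Python as below. This is only an illustration under stated assumptions, not the author's implementation. Kolmogorov Complexity is not computable, so the sketch substitutes the zlib-compressed length as an upper-bound estimate. Entropy is taken over the byte distribution. The curve is assumed to start at (0, 0), and each step moves +1 when a measure grows, -1 when it falls, and 0 when it is unchanged. The mapping of steps to stage names is one reading of the Conclusions slide. All function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy H, in bits per symbol, of the byte distribution in data."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # sum of p * log2(1/p), equivalent to -sum of p * log2(p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def complexity_estimate(data: bytes) -> int:
    """Compressed length in bytes: an assumed stand-in for Kolmogorov Complexity K."""
    return len(zlib.compress(data, 9))


def measure(data: bytes):
    """Return the pair (K estimate, H) for one version of a system."""
    return complexity_estimate(data), shannon_entropy(data)


def step(prev, curr) -> int:
    """+1 if the measure grew, -1 if it shrank, 0 if it is unchanged."""
    return (curr > prev) - (curr < prev)


def evolution_curve(ks, hs, start=(0, 0)):
    """Build E-curve nodes from per-version K and H values.

    The starting node is an assumption; the slides do not state it.
    """
    if len(ks) != len(hs):
        raise ValueError("ks and hs must have the same length")
    if not ks:
        return []
    nodes = [start]
    for i in range(1, len(ks)):
        x, y = nodes[-1]
        nodes.append((x + step(ks[i - 1], ks[i]), y + step(hs[i - 1], hs[i])))
    return nodes


# Assumed reading of the stages: x tracks size, y tracks complexity.
STAGES = {
    (1, 1): "growth",        # size up, complexity up
    (1, -1): "maintenance",  # size up, complexity down
    (-1, 1): "improvement",  # size down, complexity up
    (-1, -1): "shrink",      # size down, complexity down
}


def stages(nodes):
    """Label each segment of the curve; steps with an unchanged measure are 'flat'."""
    labels = []
    for (x0, y0), (x1, y1) in zip(nodes, nodes[1:]):
        labels.append(STAGES.get((x1 - x0, y1 - y0), "flat"))
    return labels
```

To apply the sketch, one could concatenate the source files of each release into a single byte string, call `measure` on each release in chronological order, and pass the resulting K and H sequences to `evolution_curve`. Because only the direction of change is used, the curve does not depend on the absolute scale of either measure.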