Yu et al. DRAFT — NOT FOR PUBLICATION Page 1 of 19

Maintaining Linux: The Role of “current”

Liguo Yu
Tennessee Technological University
Cookeville, TN, USA
yul@csc.tntech.edu

Stephen R. Schach, Kai Chen
Vanderbilt University
Nashville, TN, USA
srs@vuse.vanderbilt.edu
kai.chen@vanderbilt.edu

Abstract

We examined 249 versions of Linux, and performed definition–use analysis to determine the role played by global variable current in each version. We examined three versions of Linux in detail: versions 1.2.0, 2.2.10, and 2.4.20. For each of those versions, we display the common coupling induced by current within that version using a graphical notation that reflects definitions and uses. We also measured the relationship between the number of instances of current and the size of Linux. We found that the number of instances increased much faster than the size of the kernel but slower than the total size of the product. Furthermore, nonkernel modules were the major source of the increase of instances of global variable current. These increases were largely within nonkernel folder arch, which contains architecture-dependent source code, and in nonkernel folder drivers, which contains all the driver programs. Consequently, as more drivers are added to Linux and as more platforms are supported, problems with maintainability caused by current will be exacerbated.

Key Words

Maintainability, coupling, dependencies, common coupling, definition–use analysis, Linux

Contact Details: Stephen R. Schach, srs@vuse.vanderbilt.edu, +1.615.322.2924

1. Introduction

The coupling between two modules is a measure of the degree of interaction between those modules and, hence, of the dependency between the modules.
Common coupling is considered to present risks for software development, especially for maintenance [1], [2]. (Two modules are defined to be common coupled if they both reference the same global variable.) The open-source software development life-cycle model can best be described as continuous maintenance, as encapsulated in the dictum “release early and often” [3]. Accordingly, it is vital that there be as little common coupling as possible in open-source software.

We have recently published a series of papers in which we have shown that there is reason to question the long-term maintainability of Linux [4–7], the pre-eminent open-source operating system. In a longitudinal study of 400 successive versions of Linux [4–6], we showed that the number of lines of code in each kernel module increases linearly with version number, whereas the number of instances of common coupling between each kernel module and all the other Linux modules grows exponentially. Both results were significant at the 99.99% level. In view of the deleterious effect of common coupling, we concluded that the resulting dependencies between modules had the potential of rendering Linux hard to maintain in the future.

We then categorized common coupling in terms of the possible impact a change to a global variable would have on the kernel (every installation of Linux consists of all the kernel modules, together with a subset of the nonkernel modules specific to that installation) [7]. The most deleterious form of common coupling is category 5. In this paper, we investigate the role played by category-5 global variable current, with regard to the maintainability of Linux.

In Section 2, we discuss software dependencies, and in Section 3 we outline our categorization of common coupling in kernel-based software in terms of definition–use analysis. We discuss common coupling in Linux in Section 4, especially the role played by global variable current.
We present our results in Section 5. Section 6 contains a discussion of our results, our conclusions, and an outline of future work.

2. Software dependencies

Coupling is a measure of the degree of dependency between two software components (classes, modules, packages, or the like). A good software system should have high cohesion within each component and low coupling between components. There are several different coupling categorizations, including [8–10], all of which include common coupling. Common coupling is considered to be a strong form of coupling, that is, it induces a high degree of dependency between software components, making the components difficult to understand, maintain, and reuse [11]. Coupling between components strengthens the dependency of one component on others and increases the probability that changes in one component may affect other components, which makes maintenance difficult and likely to introduce regression faults.

Coupling has not yet been explicitly shown to be related to maintainability. However, it has been shown that coupling is related to fault-proneness of a software system [2], [12], [13]. If a module is fault-prone then it will have to undergo repeated maintenance, and these frequent changes are likely to compromise its maintainability. Furthermore, these frequent changes will not always be restricted to the fault-prone module itself; it is not uncommon to have to modify more than one module to fix a single fault. Consequently, the fault-proneness of one module can adversely affect the maintainability of a number of other modules. In other words, it is easy to believe that strong coupling can have a deleterious effect on maintainability.

In this paper, we use common coupling to represent the dependencies between software components and use it to measure the maintainability of a software component.

3. Definition–use analysis and common coupling

Each occurrence of a variable in source code is either a definition of that variable or a use of that variable. A definition of a variable x is a statement that assigns a value to x. The most common form of definition is an assignment statement, such as x = 12. A use of a variable x is a statement that utilizes the value of x, such as y = x - 5. From the creation of a variable to the destruction of that variable, each time the variable is invoked, it is either assigned a new value (a definition) or its present value is used (a use).

Common coupling induces dependencies between components. In [7] we used definition–use analysis to categorize common coupling in kernel-based software. Many software products, especially operating systems and database management systems, comprise a kernel, a set of components common to all installations, together with a set of architecture-specific or hardware-specific nonkernel components [14, 15]. We refer to a software product that comprises a kernel together with optional nonkernel components as kernel-based software.

The kernel is the most important part of a kernel-based software product. Therefore, the maintainability of the kernel reflects the maintainability of the kernel-based software product. Common coupling within a kernel-based product increases the dependency of the kernel on other components and, therefore, decreases the maintainability of the kernel.

In one of our previous studies [7], we presented an ordered categorization of common coupling within kernel-based software. Global variables are divided into five categories on the basis of definition–use relations, from the least deleterious (category 1) to the most harmful (category 5). For example, a category-1 global variable is defined in kernel components but has no uses in kernel components.
Because there is no use of a category-1 global variable in a kernel component, definitions in other components (kernel or nonkernel) cannot affect kernel components. Consequently, all kernel components are independent with respect to this global variable, and the presence of a category-1 global variable will not cause difficulties for kernel component maintenance.

On the other hand, a category-5 global variable is
(a) Defined in one or more kernel components Ki, i = 1, …, n;
(b) Defined in one or more nonkernel components NKj, j = 1, …, m; and
(c) Used in one or more kernel components.

A kernel module that uses a category-5 global variable is therefore vulnerable to a modification made in a kernel module Ki or a nonkernel module NKj in which that global variable is defined. It is extremely difficult to minimize the impact of changes that involve category-5 global variables. (For details of global variable categories 2, 3, and 4, the reader is referred to [7].)

4. Common coupling in Linux

We analyzed version 2.4.20 of Linux using our categorization of common coupling in kernel-based software [7]. An overview of our results is shown in Table 1, which shows that Linux has 99 distinct global variables. Altogether, there are 1,022 instances of global variables in kernel modules: 276 definitions and 746 uses. Similarly, there are 14,088 instances of global variables in nonkernel modules, making a total of 15,110 instances in all. (In Linux, modules are generally referred to as files.)

Table 1. Definitions and uses of global variables in Linux 2.4.20 [7].

Number of global variables: 99

                      Instances of    Instances    Total number
                      definitions     of uses      of instances
Kernel modules              276           746          1,022
Nonkernel modules         1,667        12,421         14,088
Overall                                               15,110

Table 2. Definitions and uses of the 20 category-5 global variables [7].
("Modules" is the number of modules containing a global variable.)

                   Kernel modules                 Nonkernel modules              Overall
Global variable    Modules  Definitions  Uses     Modules  Definitions  Uses     instances
current                18       114       382      1,071     1,403     6,795      8,694
Others                 37        66        76        319       224       327        693
Overall                55       180       458      1,390     1,627     7,122      9,387

As shown in Table 2, 20 of the 99 global variables in version 2.4.20 of Linux fall into category 5. Of these, current is the most prevalent. It appears in 18 kernel modules in which there are 114 instances of definitions and 382 instances of uses. It appears in 1,071 nonkernel modules in which there are 1,403 instances of definitions and 6,795 instances of uses. Adding the definitions and the uses yields a total of 8,694 instances. That is, more than half of the 15,110 instances of global variables in Linux are instances of current.

Global variable current first appeared in an early version of Linux and is still present in the latest version. Because of the important role played by current with respect to common coupling within the Linux kernel, we studied its evolution in different versions of Linux. Unlike one of our previous studies [4], which considered only the number of instances of global variables and ignored the definition–use property, our study here expands on our earlier work and focuses on the evolution of the most widely utilized category-5 global variable. Our new results support our previous results [4], [7]. We use these new results to make a further prediction of the future maintainability of the Linux kernel.

4.1 Global variable current

Global variable current was first introduced in version 1.0.0 of Linux.
In version 1.0.9, it was defined as a pointer to a structure task_struct in kernel module sched.c:

    struct task_struct *current = &init_task;

From version 1.3.31 onward, current was defined as a preprocessor macro. For example, in version 2.4.20 current is defined as a macro that expands to get_current(), an inline function that returns a pointer to a structure task_struct. In both version 1.0.9 and version 2.4.20, current can be viewed as a pointer to a structure task_struct; the redefinition of current as a preprocessor macro appears to have been done solely to increase efficiency; the two implementations are in every way functionally equivalent.

Data structure task_struct describes a process or task in the system. During scheduling, the kernel relies upon a linked list of runnable tasks to determine which task should be run next. This linked list is a list of data structures of type task_struct, each of which contains information about a particular task. Data structure task_struct contains 83 field variables; 60 are primitive types, 3 are composite data structures, and 20 are pointers to composite data structures. For example, variable state is used to represent whether this task is runnable and, if so, whether it is interruptible. The composite data structures or pointers to composite data structures are used to reference the domain in which the process is executing; the files that a process has open; the binary file format that Linux understands; real time timer, signal handler, memory management, and file system information; and so on [16].

The definition–use analysis of each instance of current was performed on the basis of the theory outlined in Section 3. For example, the statement

    current->state = TASK_RUNNING;

was considered a definition of current, because the value of current (or, more precisely, the data structure to which it points) is changed.
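To make the equivalence of the two forms of current concrete, the following minimal C sketch may help. It is an illustration under stated assumptions, not kernel code: the two-field task_struct, the constant values, the CURRENT_AS_GLOBAL switch, and the helper functions mark_running and current_state are all hypothetical. With either form, the statement current->state = TASK_RUNNING; compiles unchanged and writes to the structure to which current points.

```c
/* Hypothetical, heavily reduced stand-in for the kernel's task_struct. */
struct task_struct {
    long state;
    int need_resched;
};

/* Constant values assumed for illustration only. */
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1

static struct task_struct init_task = { TASK_INTERRUPTIBLE, 0 };

#ifdef CURRENT_AS_GLOBAL
/* Pointer form (as in version 1.0.9): current is a global variable. */
struct task_struct *current = &init_task;
#else
/* Macro form: current expands to a call of an inline function.
 * This stand-in simply returns &init_task; the real kernel locates
 * the running task in an architecture-specific way. */
static inline struct task_struct *get_current(void)
{
    return &init_task;
}
#define current get_current()
#endif

/* A definition of current in the sense used here: the structure
 * to which current points is changed. */
void mark_running(void)
{
    current->state = TASK_RUNNING;
}

/* Helper that reads the state through current (hypothetical). */
long current_state(void)
{
    return current->state;
}
```

Compiling with -DCURRENT_AS_GLOBAL selects the pointer form; without it, the macro form is used. In both cases a call to mark_running() leaves current_state() equal to TASK_RUNNING, which is why a definition–use analysis can treat the two forms alike.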
Conversely, the statement

    if (current->need_resched)
        x = 1;
    else
        x = 0;

was considered a use of current, because the value of current (or, more precisely, the data structure to which it points) is referenced, but not changed.

4.2 Linux versions

Linux has two kinds of versions, even-numbered versions and odd-numbered versions, depending on whether the second digit in the version number is even or odd. Odd-numbered versions are referred to as development versions, released for future development of even-numbered versions. Even-numbered versions are referred to as stable versions; these versions are released for use.

In Linux, development versions and stable versions are developed in parallel. When a development version appears to be mature, it becomes part of the stable tree (for example, stable version 1.2.0 is based on development version 1.1.95). In order to handle the issue of parallel development, we considered only the tree stem from version 1.0.0 to version 2.4.20 and ignored all the other branches. These versions in the tree stem constitute successive versions because no parallel development is present in the tree stem; every pair of adjoining versions is connected by the ancestor and offspring relationship.

There were a total of 496 subversions of Linux from version 1.0.0 to version 2.4.20. In order to make our research project manageable, we examined every second subversion, 249 in all. For each of these subversions, we determined
(1) the size of the kernel (in KLOC);
(2) the size of the nonkernel (in KLOC);
(3) the number of kernel modules in which current appears;
(4) the number of nonkernel modules in which current appears;
(5) the number of definitions and uses of global variable current in kernel modules;
(6) the number of definitions and uses of global variable current in nonkernel modules;
(7) the relation of the size of kernel to the number of instances of current in kernel modules; and
(8) the relation of the size of the nonkernel to the number of instances of current in nonkernel modules.

5. Results

5.1 Graphical notation

We use an arrow to represent a definition–use relation between modules. A single-headed arrow pointing from module A to module B means that the global variable is defined in A and used in B. A double-headed arrow between module A and module B means the global variable is defined in both A and B and used in both A and B. We use the pair (d, u) to indicate (number of definitions, number of uses) of a global variable within a module.

This notation is utilized in Figure 1, which shows the definitions and uses of global variable current in version 1.2.0. The top box denotes the 129 nonkernel modules; in those modules, there are 296 definitions and 769 uses of current. The large lower box denotes the kernel. The 10 kernel modules fall into two groups (grouped by the dotted lines). The six modules in the oval all contain definitions and uses of current. For example, there are 13 definitions and 34 uses of current in exit.c. The four modules at the bottom all use current, but do not define it.

[Figure 1 content, given as (d, u) pairs]
Nonkernel: 129 nonkernel modules (296, 769)
Kernel modules that define and use current: signal.c (7, 7), sched.c (9, 42), exit.c (13, 34), sys.c (25, 58), itimer.c (6, 6), exec_domain.c (2, 7)
Kernel modules that only use current: fork.c (–, 10), panic.c (–, 1), ksyms.c (–, 1), printk.c (–, 1)

Figure 1. Definitions and uses of global variable current in version 1.2.0 of Linux.

[Figure 2 content, given as (d, u) pairs]
Nonkernel: 508 nonkernel modules (1034, 3277)
Kernel modules that define and use current: kmod.c (3, 18), sys.c (38, 96), signal.c (8, 62), exit.c (7, 21), sched.c (9, 24), itimer.c (7, 9), fork.c (1, 21), exec_domain.c (4, 14)
Kernel modules that only use current: acct.c (–, 17), sysctl.c (–, 4), capability.c (–, 9), panic.c (–, 1)

Figure 2. Definitions and uses of global variable current in version 2.2.10 of Linux.

[Figure 3 content, given as (d, u) pairs]
Nonkernel: 1071 nonkernel modules (1403, 6795)
Kernel modules that define and use current: softirq.c (3, 4), fork.c (2, 25), signal.c (13, 69), timer.c (2, 8), kmod.c (2, 13), sched.c (14, 29), exit.c (14, 35), sys.c (49, 115), acct.c (2, 21), uid16.c (2, 12), itimer.c (7, 9), exec_domain.c (4, 14)
Kernel modules that only use current: ptrace.c (–, 12), context.c (–, 3), printk.c (–, 1), capability.c (–, 9), sysctl.c (–, 2), panic.c (–, 1)

Figure 3. Definitions and uses of global variable current in version 2.4.20 of Linux.

Figure 2 shows the definitions and uses of global variable current in version 2.2.10, and Figure 3 shows the definitions and uses of global variable current in version 2.4.20. Comparing these three figures, we see that current has similar distributions in both kernel and nonkernel modules, but with increasing complexity and frequency. We now examine the relationship between the number of instances of current and the size of Linux.

5.2 Number of instances versus the size of the product

Table 3 shows data from several versions of Linux. Each of these versions is associated with a major change in Linux. From version 1.1.78 to version 2.4.20, the size of the kernel (in KLOC) increased about 187 percent, the total size increased about 1,600 percent, and the total number of instances of current increased about 638 percent. In other words, for global variable current, the number of instances increased much faster than the size of the kernel but slower than the total size of the product.
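The percentages quoted above can be checked with a short calculation. The following C sketch is purely illustrative; the names percent_increase and print_growth are hypothetical, and the input values are the version 1.1.78 and version 2.4.20 entries of Table 3.

```c
#include <stdio.h>

/* Percentage increase from old_value to new_value. */
double percent_increase(double old_value, double new_value)
{
    return (new_value - old_value) / old_value * 100.0;
}

/* Table 3 endpoints: index 0 is version 1.1.78, index 1 is version 2.4.20. */
static const double KERNEL_KLOC[2] = { 4.95, 14.23 };
static const double TOTAL_KLOC[2]  = { 251.55, 4274.69 };
static const double INSTANCES[2]   = { 1178.0, 8694.0 };

/* Print the three growth figures, rounded to whole percentages. */
void print_growth(void)
{
    printf("kernel size:          %.0f%%\n",
           percent_increase(KERNEL_KLOC[0], KERNEL_KLOC[1]));
    printf("total size:           %.0f%%\n",
           percent_increase(TOTAL_KLOC[0], TOTAL_KLOC[1]));
    printf("instances of current: %.0f%%\n",
           percent_increase(INSTANCES[0], INSTANCES[1]));
}
```

Calling print_growth() from a main function reproduces the comparison: the growth in instances of current lies well above the growth of the kernel and well below the growth of the whole product.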
During this period, instances of current in nonkernel modules increased much faster than instances in kernel modules; nonkernel modules were the major source of the increase of instances of global variable current.

Table 3: Global variable current in several versions of Linux

           Definitions   Uses in    Definitions   Uses in      Total
           in kernel     kernel     in nonkernel  nonkernel    number of   Kernel   Total
Version    modules       modules    modules       modules      instances   KLOC     KLOC
1.1.78          61          167         263           687        1,178       4.95     251.55
1.2.0           62          167         296           769        1,294       4.98     281.99
1.3.0           62          164         319           822        1,367       5.06     310.55
2.0.0           66          181         546         1,404        2,197       6.93     674.43
2.1.0           62          179         527         1,386        2,154       6.95     694.04
2.2.0           79          298         998         3,216        4,591      10.07   1,603.79
2.3.0           77          196       1,029         3,260        4,562      10.25   1,690.24
2.4.0           99          323       1,216         5,152        6,790      13.09   2,980.09
2.4.20         114          382       1,403         6,795        8,694      14.23   4,274.69

Generally, each release of a new Linux version is larger than the previous version because of additional functionality, new architectures, or more drivers. To see how the size of the product (in KLOC) is related to the total number of instances of global variable current, we selected 35 versions of Linux from 1.1.78 to 2.4.18, chosen so that the size of the product grows in intervals of about 100 KLOC. Figure 4 shows the total number of instances of current in these versions. It can be seen that the total number of instances of global variable current increases approximately linearly with the size of the product.

[Figure 4 plot. Vertical axis: Total Instances of “current” (0 to 9000). Horizontal axis: Size of the Product (in KLOC) (0 to 4000).]

Figure 4: Total number of instances of current versus the size of the product in KLOC.

5.3 Driver modules and arch modules

Linux source code is organized into folders.
For example, in version 2.4.20 there is one kernel folder and 11 nonkernel folders. Architecture-dependent source code is put in nonkernel folder arch, and nonkernel folder drivers contains all the driver programs.

As shown in Table 2, most of the instances of current are in nonkernel modules. To understand how nonkernel modules contribute to the increase in the number of instances, we studied how the instances of current are distributed in nonkernel modules. In version 1.1.78, 60 percent of the nonkernel modules belonged to the drivers folder or the arch folder. By version 2.4.20, this figure had increased to 75 percent. During this time, the percentage of nonkernel instances of current in these two folders increased from 40 to 79 percent.

From this analysis, we can draw two conclusions. First, driver and architecture-dependent modules grew faster than the other nonkernel modules. Second, the number of instances of current in the drivers and arch folders also grew faster than the number in other nonkernel modules. This growth has resulted in the drivers and arch folders contributing nearly 80 percent of nonkernel instances of current in Linux version 2.4.20. This is shown in Figure 5.

[Figure 5 plot. Vertical axis: percentage (40% to 80%). Horizontal axis: Version number (1.1.78, 1.2.0, 1.3.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0, 2.4.20). Series: size; instances of "current".]

Figure 5: Size of drivers and arch modules in nonkernel modules (in KLOC) as a percentage of the total size of the nonkernel modules; and instances of current in those two sets of modules, as a percentage of the total number of instances of current in nonkernel modules.

6. Discussion, conclusions, and future work

Each installation of Linux consists of all the kernel modules, plus a set of nonkernel modules specific to that installation, its architecture, and its drivers.
It might therefore be argued that, in any one installation of Linux, the number of instances of current in nonkernel modules is likely to be far smaller than the 8,198 instances in version 2.4.20, the version we have presented in detail. From the viewpoint of maintenance, however, what is important is the total number of instances of current. First, if a change is made to a global variable, it has to be consistently made to every instance of that global variable. Thus, the total number of instances of current is what counts, not the number in a specific installation. Second, every definition of current constitutes a potential source of vulnerability from the viewpoint of maintenance of the Linux kernel. From Table 2, we see that there are 114 instances of definitions of current in kernel modules, and 1,403 instances of definitions of current in nonkernel modules. That is, there are 1,517 instances of definitions of current that could affect a kernel module if a modification were made to the module containing that definition of current.

As pointed out in the introduction, the open-source life-cycle model essentially constitutes continuous maintenance. Consequently, we have shown that, unless Linux is restructured with minimal common coupling, current will be a major potential source of problems as Linux evolves in the future.

Linux is continuously growing with more driver and architecture-dependent modules being added. Adding more drivers means more tasks will be associated with global variable current. In addition, if more platforms are supported, more platform-specific tasks will be added, too, causing further instances of current to be added. This will result in even greater increases in the number of instances of current in nonkernel modules.
Although this increase in instances of current is mainly in nonkernel modules, it will surely affect uses of current in kernel modules, because current is a category-5 global variable. That is, as Linux grows, the problems caused by current will be exacerbated.

In this work, we have treated current as a simple global variable; we have ignored the fact that current is effectively a pointer to a structure with multiple components. For example, we have treated a statement like

    current->errno = 0;

as a definition of current. We are now examining current at a lower level of abstraction of the definition–use relation. The above statement, for example, can also be viewed as a use of (dereferenced) global pointer current plus a definition of integer errno, a field of task_struct. We believe that this investigation will highlight those fields of task_struct that are the most problematic from the viewpoint of the maintainability of Linux.

References

[1] L. C. Briand, J. Daly, V. Porter, and J. Wüst, “A Comprehensive Empirical Validation of Design Measures for Object-Oriented Systems,” Proceedings of the 5th International Software Metrics Symposium, Bethesda, MD, Nov. 1998, pp. 246–57.
[2] D. A. Troy and S. H. Zweben, “Measuring the Quality of Structured Designs,” Journal of Systems and Software 2 (June 1981), pp. 112–120.
[3] E. S. Raymond, The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, O’Reilly & Associates, Sebastopol, CA, 2000. Also available at www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/.
[4] S. R. Schach, B. Jin, D. R. Wright, G. Z. Heller, and A. J. Offutt, “Maintainability of the Linux Kernel,” IEE Proceedings - Software 149 (February 2002), pp. 18–23.
[5] S. R. Schach and J. Offutt, “On the Nonmaintainability of Open-Source Software,” Proceedings of the 2nd Workshop on Open Source Software Engineering, Orlando, FL, May 2002, pp. 47–49.
[6] J. Offutt, “Open-source Software: More or Less Secure and Reliable?” Panel at International Symposium on Software Reliability Engineering (ISSRE ’02), Annapolis, MD, Nov. 2002.
[7] L. Yu, S. R. Schach, K. Chen, and J. Offutt, “Categorization of Common Coupling and its Application to the Maintainability of the Linux Kernel,” IEEE Transactions on Software Engineering 30 (October 2004), pp. 694–706.
[8] W. P. Stevens, G. J. Myers, and L. L. Constantine, “Structured Design,” IBM Systems Journal 13 (No. 2, 1974), pp. 115–39.
[9] J. Offutt, M. J. Harrold, and P. Kolte, “A Software Metric System for Module Coupling,” Journal of Systems and Software 20 (March 1993), pp. 295–308.
[10] M. Page-Jones, The Practical Guide to Structured Systems Design, Yourdon Press, New York, 1980.
[11] S. R. Schach, B. Jin, D. R. Wright, G. Z. Heller, and J. Offutt, “Quality Impacts of Clandestine Common Coupling,” Software Quality Journal 11 (July 2003), pp. 211–18.
[12] D. Kafura and S. Henry, “Software Quality Metrics Based on Interconnectivity,” Journal of Systems and Software 2 (May 1981), pp. 121–31.
[13] R. W. Selby and V. R. Basili, “Analyzing Error-Prone System Structure,” IEEE Transactions on Software Engineering 17 (February 1991), pp. 141–52.
[14] P. Brinch Hansen, “The Nucleus of a Multiprogramming System,” Communications of the ACM 4 (April 1970), pp. 238–41.
[15] T. Härder, “New Approaches to Object Processing in Engineering Databases,” Proceedings of the International Workshop on Object-Oriented Database Systems, Pacific Grove, CA, Sept. 1986, p. 217.
[16] D. Rusling, “The Linux Kernel,” 1999, www.linuxhq.com/guides/TLK/tlk.html