Yu et al.
DRAFT — NOT FOR PUBLICATION
Page 1 of 19
Maintaining Linux: The Role of “current”
Liguo Yu
Tennessee Technological University
Cookeville, TN, USA
yul@csc.tntech.edu
Stephen R. Schach, Kai Chen
Vanderbilt University
Nashville, TN, USA
srs@vuse.vanderbilt.edu
kai.chen@vanderbilt.edu
Abstract
We examined 249 versions of Linux, and performed definition–use analysis to determine the
role played by global variable current in each version. We examined three versions of Linux in
detail: versions 1.2.0, 2.2.10, and 2.4.20. For each of those versions, we display the common
coupling induced by current within that version using a graphical notation that reflects
definitions and uses. We also measured the relationship between the number of instances of
current and the size of Linux. We found that the number of instances increased much faster than
the size of the kernel but slower than the total size of the product. Furthermore, nonkernel
modules were the major source of the increase of instances of global variable current. These
increases were largely within nonkernel folder arch, which contains architecture-dependent
source code, and in nonkernel folder drivers, which contains all the driver programs.
Consequently, as more drivers are added to Linux and as more platforms are supported,
problems with maintainability caused by current will be exacerbated.
Key Words
Maintainability, coupling, dependencies, common coupling, definition–use analysis, Linux
Contact Details:
Stephen R. Schach
srs@vuse.vanderbilt.edu
+1.615.322.2924
1. Introduction
The coupling between two modules is a measure of the degree of interaction between those
modules and, hence, of the dependency between the modules. Common coupling is considered
to present risks for software development, especially for maintenance [1], [2] (two modules are
defined to be common coupled if they both reference the same global variable).
The open-source software development life-cycle model can best be described as
continuous maintenance, as encapsulated in the dictum “release early and often” [3].
Accordingly, it is vital that there be as little common coupling as possible in open-source software.
We have recently published a series of papers in which we have shown that there is reason to
question the long-term maintainability of Linux [4–7], the pre-eminent open-source operating
system. In a longitudinal study of 400 successive versions of Linux [4–6], we showed that the
number of lines of code in each kernel module increases linearly with version number, whereas
the number of instances of common coupling between each kernel module and all the other
Linux modules grows exponentially. Both results were significant at the 99.99% level. In view
of the deleterious effect of common coupling, we concluded that the resulting dependencies
between modules had the potential of rendering Linux hard to maintain in the future.
We then categorized common coupling in terms of the possible impact a change to a global
variable would have on the kernel (every installation of Linux consists of all the kernel modules,
together with a subset of the nonkernel modules specific to that installation) [7]. The most
deleterious form of common coupling is category 5. In this paper, we investigate the role played
by category-5 global variable current, with regard to the maintainability of Linux.
In Section 2, we discuss software dependencies, and in Section 3 we outline our
categorization of common coupling in kernel-based software in terms of definition–use analysis.
We discuss common coupling in Linux in Section 4, especially the role played by global
variable current. We present our results in Section 5. Section 6 contains a discussion of our
results, our conclusions, and an outline of future work.
2. Software dependencies
Coupling is a measure of the degree of dependency between two software components
(classes, modules, packages, or the like). A good software system should have high cohesion
within each component and low coupling between components. There are several different
coupling categorizations, including [8–10], all of which include common coupling. Common
coupling is considered to be a strong form of coupling, that is, it induces a high degree of
dependency between software components, making the components difficult to understand,
maintain, and reuse [11].
Coupling between components strengthens the dependency of one component on others and
increases the probability that changes in one component may affect other components, which
makes maintenance difficult and likely to introduce regression faults. Coupling has not yet been
explicitly shown to be related to maintainability. However, it has been shown that coupling is
related to fault-proneness of a software system [2], [12], [13]. If a module is fault-prone then it
will have to undergo repeated maintenance, and these frequent changes are likely to compromise
its maintainability. Furthermore, these frequent changes will not always be restricted to the
fault-prone module itself; it is not uncommon to have to modify more than one module to fix a
single fault.
Consequently, the fault-proneness of one module can adversely affect the
maintainability of a number of other modules. In other words, it is easy to believe that strong
coupling can have a deleterious effect on maintainability. In this paper, we use common
coupling to represent the dependencies between software components and use it to measure the
maintainability of a software component.
3. Definition–use analysis and common coupling
Each occurrence of a variable in source code is either a definition of that variable or a use of
that variable. A definition of a variable x is a statement that assigns a value to x. The most
common form of definition is an assignment statement, such as x = 12. A use of a variable x
is a statement that utilizes the value of x, such as y = x - 5. From the creation of a variable to
the destruction of that variable, each time the variable is invoked, it is either assigned a new
value (a definition) or its present value is used (a use). Common coupling induces dependencies
between components.
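The distinction between a definition and a use can be illustrated with a short C fragment. This is a minimal sketch for illustration only; the global variable x mirrors the example in the text, and the function name definition_use_demo is hypothetical.

```c
/* Hypothetical global variable, visible to every module that declares it. */
int x;

/* Each occurrence of a variable below is labeled as a definition or a use. */
int definition_use_demo(void)
{
    int y;

    x = 12;       /* definition of x: a new value is assigned */
    y = x - 5;    /* use of x: its present value is read (and y is defined) */
    return y;     /* use of y */
}
```

If a second module also assigned to x, the value read in the statement y = x - 5 would depend on that other module; this is the dependency that common coupling induces.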
In [7] we used definition-use analysis to categorize common coupling in kernel-based
software.
Many software products, especially operating systems and database management
systems, comprise a kernel, a set of components common to all installations, together with a set
of architecture-specific or hardware-specific nonkernel components [14, 15]. We refer to a
software product that is comprised of a kernel together with optional nonkernel components as
kernel-based software.
The kernel is the most important part of a kernel-based software product. Therefore, the
maintainability of the kernel reflects the maintainability of the kernel-based software product.
Common coupling within a kernel-based product increases the dependency of the kernel on other
components and, therefore, decreases the maintainability of the kernel.
In one of our previous studies [7], we presented an ordered categorization of common
coupling within kernel-based software. Global variables are divided into five categories on the
basis of definition–use relations, from the least deleterious (category 1) to the most harmful
(category 5). For example, a category-1 global variable is defined in kernel components but has
no uses in kernel components. Because there is no use of a category-1 global variable in a kernel
component, definitions in other components (kernel or nonkernel) cannot affect kernel
components. Consequently, all kernel components are independent with respect to this global
variable, and the presence of a category-1 global variable will not cause difficulties for kernel
component maintenance.
On the other hand, a category-5 global variable is
(a) Defined in one or more kernel components Ki, i = 1, …, n;
(b) Defined in one or more nonkernel components NKj, j = 1, …, m; and
(c) Used in one or more kernel components.
A kernel module that uses a category-5 global variable is therefore vulnerable to a
modification made in a kernel module Ki or a nonkernel module NKj in which that global
variable is defined. It is extremely difficult to minimize the impact of changes that involve
category-5 global variables.
(For details of global variable categories 2, 3, and 4, the reader is referred to [7].)
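Conditions (a), (b), and (c) can be expressed as a simple predicate over per-variable counts. The following is a minimal sketch, not the tool used in [7]; the structure gv_counts and the function is_category5 are hypothetical names introduced here. Categories 1 through 4 are omitted because their complete definitions are given only in [7].

```c
#include <stdbool.h>

/* Hypothetical summary of where one global variable is defined and used. */
struct gv_counts {
    int kernel_defs;      /* definitions in kernel components */
    int kernel_uses;      /* uses in kernel components */
    int nonkernel_defs;   /* definitions in nonkernel components */
};

/* True when conditions (a), (b), and (c) of category 5 all hold. */
bool is_category5(const struct gv_counts *c)
{
    return c->kernel_defs > 0        /* (a) defined in a kernel component */
        && c->nonkernel_defs > 0     /* (b) defined in a nonkernel component */
        && c->kernel_uses > 0;       /* (c) used in a kernel component */
}
```

A variable with kernel definitions and nonkernel definitions but no kernel uses would fail condition (c) and so would not be classified as category 5 by this predicate.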
4. Common coupling in Linux
We analyzed version 2.4.20 of Linux using our categorization of common coupling in kernel-based software [7]. An overview of our results is shown in Table 1, which shows that Linux has
99 distinct global variables. Altogether, there are 1,022 instances of global variables in kernel
modules: 276 definitions and 746 uses. Similarly, there are 14,088 instances of global variables
in nonkernel modules, making a total of 15,110 instances in all. (In Linux, modules are
generally referred to as files.)
Table 1. Definitions and uses of global variables in Linux 2.4.20 [7].

                                        Kernel     Nonkernel
                                        modules    modules      Overall
Number of global variables                                           99
Number of instances of definitions          276       1,667
Number of instances of uses                 746      12,421
Total number of instances                 1,022      14,088      15,110

Table 2. Definitions and uses of the 20 category-5 global variables [7].

            Kernel modules                       Nonkernel modules                    Overall
Global      Modules      Number of    Number     Modules      Number of    Number     number of
variable    containing   definitions  of uses    containing   definitions  of uses    instances
            a global                             a global
            variable                             variable
current          18          114        382        1,071        1,403       6,795       8,694
Others           37           66         76          319          224         327         693
Overall          55          180        458        1,390        1,627       7,122       9,387

As shown in Table 2, 20 of the 99 global variables in version 2.4.20 of Linux fall into
category 5. Of these, current is the most prevalent. It appears in 18 kernel modules in which
there are 114 instances of definitions and 382 instances of uses. It appears in 1,071 nonkernel
modules in which there are 1,403 instances of definitions and 6,795 instances of uses. Adding
the definitions and the uses yields a total of 8,694 instances. That is, more than half of the
15,110 instances of global variables in Linux are instances of current.
Global variable current first appeared in an early version of Linux and is still present in the
latest version.
Because of the important role played by current with respect to common
coupling within the Linux kernel, we studied its evolution in different versions of Linux. Unlike
one of our previous studies [4], which considered only the number of instances of global
variables and ignored the definition-use property, our study here expands on our earlier work and
focuses on the evolution of the most widely utilized category-5 global variable. Our new results
support our previous results [4], [7]. We use these new results to make a further prediction of the
future maintainability of the Linux kernel.
4.1 Global variable current
Global variable current was first introduced in version 1.0.0 of Linux. In version 1.0.9, it
was defined as a pointer to a structure task_struct in kernel module sched.c:
struct task_struct *current = &init_task;
From version 1.3.31 onward, current was defined as a preprocessor macro. For example, in
version 2.4.20 current is defined as a macro that expands to get_current(), an inline function that
returns a pointer to a structure task_struct. In both version 1.0.9 and version 2.4.20, current
can be viewed as a pointer to a structure task_struct. The redefinition of current as a
preprocessor macro appears to have been done solely to increase efficiency; the two
implementations are functionally equivalent.
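The two styles of definition can be contrasted in a small user-space sketch. This is an illustration under stated assumptions, not kernel source: task_struct is reduced to a single field, current_ptr and running_task are hypothetical names, and the body of get_current() here merely returns a stored pointer, whereas the kernel's implementation is architecture-specific.

```c
/* Simplified stand-in for the kernel's task_struct (the real one has many fields). */
struct task_struct {
    long state;
};

static struct task_struct init_task = { 0 };

/* Early style: an ordinary global pointer (named current_ptr here to avoid a clash). */
struct task_struct *current_ptr = &init_task;

/* Later style: current is a macro that expands to a call of an inline function.
 * Assumption: this body is a placeholder for the architecture-specific lookup. */
static struct task_struct *running_task = &init_task;

static inline struct task_struct *get_current(void)
{
    return running_task;
}
#define current get_current()

/* Client code reads the same under either style. */
long read_state(void)
{
    return current->state;    /* a use of current */
}

void mark_state(long s)
{
    current->state = s;       /* a definition of current, in the sense used in this paper */
}
```

Because client statements such as current->state = s are written identically in both styles, the definition–use classification of each instance does not depend on which style is in effect.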
Data structure task_struct describes a process or task in the system. During scheduling,
the kernel relies upon a linked list of runnable tasks to determine which task should be run next.
This linked list is a list of data structures of type task_struct, each of which contains
information about a particular task. Data structure task_struct contains 83 field variables; 60
are primitive types, 3 are composite data structures, and 20 are pointers to composite data
structures. For example, variable state is used to represent whether this task is runnable and, if
so, whether it is interruptible. The composite data structures or pointers to composite data
structures are used to reference the domain in which the process is executing; the files that a
process has open; the binary file format that Linux understands; real-time timer, signal handler,
memory management, and file system information; and so on [16].
The definition-use analysis of each instance of current was performed on the basis of the
theory outlined in Section 3. For example, the statement
current->state = TASK_RUNNING;
was considered a definition of current, because the value of current (or, more precisely, the data
structure to which it points) is changed. Conversely, the statement
if (current->need_resched) x = 1; else x = 0;
was considered a use of current, because the value of current (or, more precisely, the data
structure to which it points) is referenced, but not changed.
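The rule applied to the two statements above can be approximated by a small textual classifier. This is a rough sketch only, not the analysis tool used in this study: the names occurrence and classify_current are hypothetical, only the first whole-word occurrence of current in a statement is examined, and shift-assignments, prefix increments, and definitions made through pointers or function calls are not recognized.

```c
#include <ctype.h>
#include <string.h>

enum occurrence { OCC_NONE, OCC_USE, OCC_DEFINITION };

static int is_ident_char(int c)
{
    return isalnum(c) || c == '_';
}

/* Classify the first whole-word occurrence of `current` in one C statement. */
enum occurrence classify_current(const char *stmt)
{
    const char *p = stmt;

    /* Find `current` as a whole identifier, not as part of a longer name. */
    for (;;) {
        p = strstr(p, "current");
        if (p == NULL)
            return OCC_NONE;
        if ((p == stmt || !is_ident_char((unsigned char)p[-1])) &&
            !is_ident_char((unsigned char)p[7]))
            break;
        p += 7;
    }
    p += 7;

    /* Skip any chain of member accesses such as ->state or ->mm->count. */
    while (p[0] == '-' && p[1] == '>') {
        p += 2;
        while (is_ident_char((unsigned char)*p))
            p++;
    }
    while (isspace((unsigned char)*p))
        p++;

    /* A following assignment operator makes the occurrence a definition. */
    if (p[0] == '=' && p[1] != '=')
        return OCC_DEFINITION;
    if (p[0] != '\0' && strchr("+-*/%&|^", p[0]) != NULL && p[1] == '=')
        return OCC_DEFINITION;
    if ((p[0] == '+' && p[1] == '+') || (p[0] == '-' && p[1] == '-'))
        return OCC_DEFINITION;
    return OCC_USE;
}
```

Applied to the two statements quoted above, the first is classified as a definition because the field access is followed by an assignment operator, and the second as a use because it is not.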
4.2 Linux versions
Linux has two kinds of versions, even-numbered versions and odd-numbered versions,
depending on whether the second digit in the version number is even or odd. Odd-numbered
versions are referred to as development versions, released for future development of even-numbered versions. Even-numbered versions are referred to as stable versions; these versions
are released for use.
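The even/odd convention can be checked mechanically from a version string. The sketch below is illustrative; the function name parse_series is hypothetical, and it assumes version strings of the form major.minor.patch, as used for the versions discussed in this paper.

```c
#include <stdbool.h>
#include <stdio.h>

/* Returns true and sets *stable if `version` parses as major.minor.patch. */
bool parse_series(const char *version, bool *stable)
{
    int major, minor, patch;

    if (sscanf(version, "%d.%d.%d", &major, &minor, &patch) != 3)
        return false;
    *stable = (minor % 2 == 0);   /* even second number: stable version */
    return true;
}
```

For instance, the string "2.4.20" is reported as a stable version and "1.1.95" as a development version.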
In Linux, development versions and stable versions are developed in parallel. When a
development version appears to be mature, it becomes part of the stable tree (for example, stable
version 1.2.0 is based on development version 1.1.95). In order to handle the issue of parallel
development, we considered only the tree stem from version 1.0.0 to version 2.4.20 and ignored
all the other branches. These versions in the tree stem constitute successive versions because no
parallel development is present in the tree stem; every pair of adjoining versions is connected by
the ancestor and offspring relationship. There were a total of 496 subversions of Linux from
version 1.0.0 to version 2.4.20. In order to make our research project manageable, we examined
every second subversion, 249 in all.
For each of these subversions, we determined (1) the size of the kernel (in KLOC); (2) the
size of the nonkernel (in KLOC); (3) the number of kernel modules in which current appears;
(4) the number of nonkernel modules in which current appears; (5) the number of definitions
and uses of global variable current in kernel modules; (6) the number of definitions and uses of
global variable current in nonkernel modules; (7) the relation of the size of kernel to the number
of instances of current in kernel modules; and (8) the relation of the size of the nonkernel to the
number of instances of current in nonkernel modules.
5. Results
5.1 Graphical notation
We use an arrow to represent a definition–use relation between modules. A single-headed
arrow pointing from module A to module B means that the global variable is defined in A and
used in B. A double-headed arrow between module A and module B means the global variable
is defined in both A and B and used in both A and B. We use the pair (d, u) to indicate (number
of definitions, number of uses) of a global variable within a module.
This notation is utilized in Figure 1, which shows the definitions and uses of global variable
current in version 1.2.0. The top box denotes the 129 nonkernel modules; in those modules,
there are 296 definitions and 769 uses of current. The large lower box denotes the kernel. The
10 kernel modules fall into two groups (grouped by the dotted lines). The six modules in the
oval all contain definitions and uses of current. For example, there are 13 definitions and 34
uses of current in exit.c. The four modules at the bottom all use current, but do not define it.
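The notation can also be captured in a small data structure. The following is a minimal sketch with hypothetical names (du_pair, arrow_from_to, double_headed); it assumes that a dash in a (d, u) pair is recorded as a count of zero.

```c
#include <stdbool.h>

/* The pair (d, u) for one module: definitions and uses of a global variable. */
struct du_pair {
    const char *module;   /* file name, e.g. "exit.c" */
    int d;                /* number of definitions (0 where the figure shows a dash) */
    int u;                /* number of uses */
};

/* Single-headed arrow from a to b: the variable is defined in a and used in b. */
bool arrow_from_to(const struct du_pair *a, const struct du_pair *b)
{
    return a->d > 0 && b->u > 0;
}

/* Double-headed arrow: defined in both modules and used in both modules. */
bool double_headed(const struct du_pair *a, const struct du_pair *b)
{
    return arrow_from_to(a, b) && arrow_from_to(b, a);
}
```

With the version 1.2.0 values, exit.c (13, 34) and sys.c (25, 58) are joined by a double-headed arrow, whereas exit.c and fork.c (–, 10) are joined only by a single-headed arrow pointing from exit.c to fork.c.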
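[Figure 1 content, shown as (d, u) per module]
Nonkernel: 129 nonkernel modules (296, 769)
Kernel modules that define and use current: signal.c (7, 7); sched.c (9, 42); exit.c (13, 34); sys.c (25, 58); itimer.c (6, 6); exec_domain.c (2, 7)
Kernel modules that only use current: fork.c (–, 10); panic.c (–, 1); ksyms.c (–, 1); printk.c (–, 1)
Figure 1. Definitions and uses of global variable current in version 1.2.0 of Linux.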
[Figure 2 content, shown as (d, u) per module]
Nonkernel: 508 nonkernel modules (1034, 3277)
Kernel modules that define and use current: kmod.c (3, 18); sys.c (38, 96); signal.c (8, 62); exit.c (7, 21); sched.c (9, 24); itimer.c (7, 9); fork.c (1, 21); exec_domain.c (4, 14)
Kernel modules that only use current: acct.c (–, 17); sysctl.c (–, 4); capability.c (–, 9); panic.c (–, 1)
Figure 2. Definitions and uses of global variable current in version 2.2.10 of Linux.
[Figure 3 content, shown as (d, u) per module]
Nonkernel: 1071 nonkernel modules (1403, 6795)
Kernel modules that define and use current: softirq.c (3, 4); fork.c (2, 25); signal.c (13, 69); timer.c (2, 8); kmod.c (2, 13); sched.c (14, 29); exit.c (14, 35); sys.c (49, 115); acct.c (2, 21); uid16.c (2, 12); itimer.c (7, 9); exec_domain.c (4, 14)
Kernel modules that only use current: ptrace.c (–, 12); context.c (–, 3); printk.c (–, 1); capability.c (–, 9); sysctl.c (–, 2); panic.c (–, 1)
Figure 3. Definitions and uses of global variable current in version 2.4.20 of Linux.
Figure 2 shows the definitions and uses of global variable current in version 2.2.10; and
Figure 3 shows the definitions and uses of global variable current in version 2.4.20. Comparing
these three figures, we see that current has similar distributions in both kernel and nonkernel
modules, but with increasing complexity and frequency.
We now examine the relationship between the number of instances of current and the size of
Linux.
5.2 Number of instances versus the size of the product
Table 3 shows data from several versions of Linux. Each of these versions is associated with
a major change in Linux. From version 1.1.78 to version 2.4.20, the size of the kernel (in
KLOC) increased about 187 percent, the total size increased about 1,600 percent, and the total
number of instances of current increased about 638 percent. In other words, for global variable
current, the number of instances increased much faster than the size of the kernel but slower
than the total size of the product. During this period, instances of current in nonkernel modules
increased much faster than instances in kernel modules; nonkernel modules were the major
source of the increase of instances of global variable current.
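The percentages quoted above follow from the first and last rows of Table 3. As a worked check, the computation can be sketched as follows; the function name percent_increase is hypothetical.

```c
/* Percentage increase from an old value to a new value. */
double percent_increase(double old_value, double new_value)
{
    return (new_value - old_value) / old_value * 100.0;
}
```

Calling percent_increase(4.95, 14.23) corresponds to the growth in kernel size, percent_increase(251.55, 4274.69) to the growth in total size, and percent_increase(1178, 8694) to the growth in the total number of instances of current between versions 1.1.78 and 2.4.20.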
Table 3: Global variable current in several versions of Linux

           Instances in kernel       Instances in nonkernel     Total
           modules                   modules                    number of
Version    Definitions    Uses       Definitions    Uses        instances    Kernel KLOC    Total KLOC
1.1.78          61         167           263          687         1,178          4.95          251.55
1.2.0           62         167           296          769         1,294          4.98          281.99
1.3.0           62         164           319          822         1,367          5.06          310.55
2.0.0           66         181           546        1,404         2,197          6.93          674.43
2.1.0           62         179           527        1,386         2,154          6.95          694.04
2.2.0           79         298           998        3,216         4,591         10.07        1,603.79
2.3.0           77         196         1,029        3,260         4,562         10.25        1,690.24
2.4.0           99         323         1,216        5,152         6,790         13.09        2,980.09
2.4.20         114         382         1,403        6,795         8,694         14.23        4,274.69

Generally, each release of a new Linux version is larger than the previous version due to
additional functionality, new architecture, or more drivers added. To see how the size of the
product (in KLOC) is related to the total number of instances of global variable current, we
selected 35 versions of Linux from 1.1.78 to 2.4.18; the size of the product grew in intervals of
about 100 KLOC. Figure 4 shows the total number of instances of current in these versions. It
can be seen that the total number of instances of global variable current increases approximately
linearly with the size of the product.
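An approximately linear relationship of this kind can be summarized by an ordinary least-squares line. The sketch below is illustrative only and is not the analysis behind Figure 4: it fits just the nine versions listed in Table 3, because the 35 versions plotted in Figure 4 are not tabulated here. The names fit_line, total_kloc, and total_instances are hypothetical.

```c
#include <stddef.h>

/* Ordinary least-squares fit of y = slope * x + intercept.
 * Returns 0 on success, -1 if the fit is undefined. */
int fit_line(const double *x, const double *y, size_t n,
             double *slope, double *intercept)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0, denom;
    size_t i;

    if (n < 2)
        return -1;
    for (i = 0; i < n; i++) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    denom = (double)n * sxx - sx * sx;
    if (denom == 0.0)
        return -1;              /* all x values identical */
    *slope = ((double)n * sxy - sx * sy) / denom;
    *intercept = (sy - *slope * sx) / (double)n;
    return 0;
}

/* Total KLOC and total instances of current for the nine versions in Table 3. */
const double total_kloc[] = {
    251.55, 281.99, 310.55, 674.43, 694.04, 1603.79, 1690.24, 2980.09, 4274.69
};
const double total_instances[] = {
    1178, 1294, 1367, 2197, 2154, 4591, 4562, 6790, 8694
};
```

The fitted slope estimates the number of additional instances of current per additional KLOC of product size over these nine versions.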
[Figure 4 plot. Vertical axis: Total Instances of “current” (0 to 9000). Horizontal axis: Size of the Product (in KLOC) (0 to 4000).]
Figure 4: Total number of instances of current versus the size of the product in KLOC.
5.3 Driver modules and arch modules
Linux source code is organized into folders. For example, in version 2.4.20 there is one
kernel folder and 11 nonkernel folders. Architecture-dependent source code is put in nonkernel
folder arch and nonkernel folder drivers contains all the driver programs. As shown in Table 2,
most of the instances of current are in nonkernel modules. To understand how nonkernel
modules contribute to the increase in the number of instances, we studied how the instances of
current are distributed in nonkernel modules.
In version 1.1.78, 60 percent of the nonkernel modules belong to the drivers folder or the
arch folder. By version 2.4.20, this figure had increased to 75 percent. During this time, the
percentage of nonkernel instances of current in these two folders increased from 40 to 79
percent. From this analysis, we can draw two conclusions. First, driver and architecture-dependent modules grew faster than the other nonkernel modules. Second, the number of
instances of current in the drivers and arch folders also grew faster than the number in other
nonkernel modules. This growth has resulted in the drivers and arch folders contributing to
nearly 80 percent of nonkernel instances of current in Linux version 2.4.20. This is shown in
Figure 5.
[Figure 5 plot. Vertical axis: percentage (40% to 80%). Horizontal axis: Version number (1.1.78, 1.2.0, 1.3.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0, 2.4.20). Two series: size; instances of "current".]
Figure 5: Size of drivers and arch modules in nonkernel modules (in KLOC) as a percentage of
the total size of the nonkernel modules; and instances of current in those two sets of modules, as
a percentage of the total number of instances of current in nonkernel modules
6. Discussion, conclusions, and future work
Each installation of Linux consists of all the kernel modules, plus a set of nonkernel modules
specific to that installation, its architecture, and its drivers. It might therefore be argued that, in
any one installation of Linux, the number of instances of current in nonkernel modules is likely
to be far smaller than the 8,198 instances in version 2.4.20, the version we have presented in
detail. From the viewpoint of maintenance, however, what is important is the total number of
instances of current. First, if a change is made to a global variable, it has to be consistently
made to every instance of that global variable. Thus, the total number of instances of current is
what counts, not the number in a specific installation. Second, every definition of current
constitutes a potential source of vulnerability from the viewpoint of maintenance of the Linux
kernel. From Table 2, we see that there are 114 instances of definitions of current
in kernel modules, and 1,403 instances of definitions of current in nonkernel modules.
That is, there are 1,517 instances of definitions of current that could affect a kernel module if a
modification were made to the module containing that definition of current.
As pointed out in the introduction, the open-source life-cycle model essentially constitutes
continuous maintenance. Consequently, we have shown that, unless Linux is restructured with
minimal common coupling, current will be a major potential source of problems as Linux
evolves in the future.
Linux is continuously growing with more driver and architecture-dependent modules
being added. Adding more drivers means more tasks will be associated with global variable
current. In addition, if more platforms are supported, more platform-specific tasks will be
added, too, causing further instances of current to be added. This will result in even greater
increases in the number of instances of current in nonkernel modules. Although this increase in
instances of current is mainly in nonkernel modules, it will surely affect uses of current in
kernel modules, because current is a category-5 global variable. That is, as Linux grows, the
problems caused by current will be exacerbated.
In this work, we have treated current as a simple global variable; we have ignored the fact
that current is effectively a pointer to a structure with multiple components. For example, we
have treated a statement like
current->errno = 0;
as a definition of current. We are now examining current at a lower level of abstraction of the
definition–use relation. The above statement, for example, can also be viewed as a use of
(dereferenced) global pointer current plus a definition of integer errno, a field of task_struct.
We believe that this investigation will highlight those fields of task_struct that are the most
problematic from the viewpoint of the maintainability of Linux.
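As a first step toward such a field-level view, the field accessed through current can be extracted from a statement and tallied separately. The sketch below is a hypothetical illustration, not the tool under development: the function name current_field is an assumption, and it handles only the first textual occurrence of current-> in a statement.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copies the name of the field accessed through `current->` into buf.
 * Returns 1 if a field name was found, 0 otherwise. */
int current_field(const char *stmt, char *buf, size_t buflen)
{
    const char *p = strstr(stmt, "current->");
    size_t n = 0;

    if (p == NULL || buflen == 0)
        return 0;
    p += strlen("current->");
    /* Copy identifier characters, leaving room for the terminator. */
    while ((isalnum((unsigned char)*p) || *p == '_') && n + 1 < buflen)
        buf[n++] = *p++;
    buf[n] = '\0';
    return n > 0;
}
```

For the statement current->errno = 0; this yields the field name errno, which could then be counted as a definition of that field alongside a use of the pointer current.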
References
[1]
L. C. Briand, J. Daly, V. Porter, and J. Wüst, “A Comprehensive Empirical Validation of
Design Measures for Object-Oriented Systems,” Proceedings of the 5th International
Software Metrics Symposium, Bethesda, MD, Nov. 1998, pp. 246–57.
[2]
D. A. Troy and S. H. Zweben, “Measuring the Quality of Structured Designs,” Journal of
Systems and Software 2 (June 1981), pp. 112–120.
[3]
E. S. Raymond, The Cathedral and the Bazaar: Musings on Linux and Open Source by
an Accidental Revolutionary, O’Reilly & Associates, Sebastopol, CA, 2000.
Also
available at www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/.
[4]
S. R. Schach, B. Jin, D. R. Wright, G. Z. Heller, and A. J. Offutt, “Maintainability of the
Linux Kernel,” IEE Proceedings—Software 149 (February 2002), pp. 18–23.
[5]
S. R. Schach and J. Offutt, “On the Nonmaintainability of Open-Source Software,”
Proceedings of the 2nd Workshop on Open Source Software Engineering, Orlando, FL,
May 2002, pp. 47–49.
[6]
J. Offutt, “Open-source Software: More or Less Secure and Reliable?” Panel at
International Symposium on Software Reliability Engineering (ISSRE ‘02), Annapolis,
MD, Nov. 2002.
[7]
L. Yu, S. R. Schach, K. Chen, and J. Offutt, “Categorization of Common Coupling and its
Application to the Maintainability of the Linux Kernel,” IEEE Transactions on Software
Engineering 30 (October 2004), pp. 694–706.
[8]
W. P. Stevens, G. J. Myers, and L. L. Constantine, “Structured Design,” IBM Systems
Journal 13 (No. 2, 1974), pp. 115–39.
[9]
J. Offutt, M. J. Harrold, and P. Kolte, “A Software Metric System for Module Coupling,”
Journal of Systems and Software 20 (March, 1993), pp. 295–308.
[10]
M. Page-Jones, The Practical Guide to Structured Systems Design. Yourdon Press, New
York, 1980.
[11]
S. R. Schach, B. Jin, D. R. Wright, G. Z. Heller, and J. Offutt, “Quality Impacts of
Clandestine Common Coupling,” Software Quality Journal 11 (July 2003), pp. 211–18.
[12]
D. Kafura and S. Henry, “Software Quality Metrics Based on Interconnectivity,” Journal
of Systems and Software 2 (May 1981), pp. 121–31.
[13]
R. W. Selby and V. R. Basili, “Analyzing Error-Prone System Structure,” IEEE
Transactions on Software Engineering 17 (February 1991), pp. 141–52.
[14]
P. Brinch Hansen, “The Nucleus of a Multiprogramming System,” Communications of
the ACM 4 (April 1970), pp. 238–41.
[15]
T. Härder, “New Approaches to Object Processing in Engineering Databases,”
Proceedings of the International Workshop on Object-Oriented Database Systems,
Pacific Grove, CA, Sept. 1986, p. 217.
[16]
D. Rusling, “The Linux Kernel,” 1999, www.linuxhq.com/guides/TLK/tlk.html