Evolution, Growth, and Cloning in Linux: A Case Study

advertisement
Evolution, Growth, and Cloning
in Linux: A Case Study
University of Waterloo
Michael W. Godfrey
Davor Svetinovic
Qiang Tu
University of Waterloo
Overview
• Ongoing CSER project:
– Investigating growth and evolution of open source
software
• Linux, vim, gcc, …
• Lehman’s laws of evolution and Linux
– Why is Linux still growing so fast?
• Hyp: cloning is common
• Case study of Linux SCSI drivers (in progress)
– How/why does cloning really occur?
– Parallel evolution?
– How well do clone detection tools work in spotting
“real-world” cloning?
What is software evolution?
“Evolution is what happens
while you’re busy
making other plans.”
• Usually, we consider evolution to begin once the first
version has been delivered:
– Maintenance is the planned set of tasks to effect changes.
• e.g., corrective, perfective, adaptive, preventive
– Evolution is what actually happens to the software.
Lehman’s Laws of software
evolution in a nutshell
• Observations:
– (Most) useful software must evolve or die.
– As a software system gets bigger, its resulting
complexity tends to limit its ability to grow.
– Development progress/effort is (more or less) constant.
• Advice:
– Need to manage complexity.
– Do periodic redesigns.
– Treat software and its development process as a
feedback system (and not as a passive theorem).
Lehman’s examples
Growth of Linux
Growth of compressed tarfile
Growth in number of source files (*.[ch])
20,000,000
6000
Development releases (1.1, 1.3, 2.1, 2.3)
Stable releases (1.0, 1.2, 2.0, 2.2)
16,000,000
14,000,000
Size in bytes
# of source code files (*.[ch] )
18,000,000
12,000,000
10,000,000
8,000,000
6,000,000
4,000,000
2,000,000
5000
Stable releases (1.0, 1.2, 2.0, 2.2)
4000
3000
2000
1000
0
Jan 1993
Jun 1994
Oct 1995
Mar 1997
Jul 1998
Dec 1999
Development releases (1.1, 1.3, 2.1, 2.3)
0
Apr 2001
Jan 1993
Growth in number of functions, variables, macros
Oct 1995
Mar 1997
Jul 1998
Dec 1999
Apr 2001
Growth in LOC (w. and w/o comments)
2,500,000
140,000
Total LOC ("wc -l") -- development releases
120,000
100,000
Development releases (1.1, 1.3, 2.1, 2.3)
2,000,000
Total LOC ("wc -l") -- stable releases
Total LOC uncommented -- development releases
Stable releases (1.0, 1.2, 2.0, 2.2)
Total LOC uncommented -- stable releases
Total LOC
# of global fcns, variables, and macros
Jun 1994
80,000
60,000
1,500,000
1,000,000
40,000
500,000
20,000
0
0
Jan 1993
Jun 1994
Oct 1995
Mar 1997
Jul 1998
Dec 1999
Apr 2001
Jan 1993
Jun 1994
Oct 1995
Mar 1997
Jul 1998
Dec 1999
Apr 2001
Observations and hypotheses
• Growth along devel. path is super-linear
y = .21*x^2 + 252*x + 90,055
r2=.997
y = size in LOC
x = days since v1.0
r2 is “coefficient of determination” using least squares
[Lehman/Turski’s model: y’ = y + E/y^2  (3Ex)^(1/3)]
– Linux’s strong growth is continuing.
– This is stronger growth at MLOC level than observed by
others (Lehman, Gall), even for other OSs.
Linux growth phenomena
Growth of major subsystems
Average and median .c file size
1,200,000
700
arch
1,000,000
Total uncommented LOC
Uncommented LOC
600
500
400
300
200
Average .c file size -- dev. releases
Average .c file size -- stable releases
Median .c file size -- dev. releases
Median .c file size -- stable releases
100
0
Jan 1993
drivers
Jun 1994
Oct 1995
Mar 1997
Jul 1998
Dec 1999
800,000
600,000
include
net
fs
kernel
mm
400,000
ipc
lib
200,000
Apr 2001
init
0
Jan 1993
Average and median .h file size
Jun 1994
Oct 1995
Mar 1997
Jul 1998
Dec 1999
Apr 2001
SS LOC as percentage of whole system
Percentage of total system uncommented LOC
140
120
Uncommented LOC
100
80
60
40
Average .h file size -- dev. releases
Average .h file size -- stable releases
Median .h file size -- dev. releases
Median .h file size -- stable releases
20
0
Jan 1993
Jun 1994
Oct 1995
Mar 1997
Jul 1998
Dec 1999
Apr 2001
70.0
60.0
drivers
arch
include
net
fs
kernel
mm
ipc
lib
init
50.0
40.0
30.0
20.0
10.0
0.0
Jan 1993
Jun 1994
Oct 1995
Mar 1997
Jul 1998
Dec 1999
Apr 2001
Linux growth phenomena
Growth of drivers SS
SS growth as percentage of whole system
(ignoring drivers SS)
25.0
20.0
15.0
250,000
Total uncommented LOC
Percentage of total system uncommented LOC
300,000
arch
include
net
fs
kernel
mm
ipc
lib
init
30.0
10.0
200,000
150,000
100,000
50,000
5.0
0.0
Jan 1993
drivers/net
drivers/scsi
drivers/char
drivers/video
drivers/isdn
drivers/sound
drivers/acorn
drivers/block
drivers/cdrom
drivers/usb
drivers/"others"
0
Jun 1994
Oct 1995
Mar 1997
Jul 1998
Dec 1999
Apr 2001
Jan 1993
Jun 1994
Growth of small core SSs
6000
35,000
5000
4000
3000
2000
Dec 1999
Apr 2001
30,000
25,000
20,000
15,000
10,000
Jul 1998
Dec 1999
Apr 2001
arch/ppc/
arch/sparc/
arch/sparc64/
arch/m68k/
arch/mips/
arch/i386/
arch/alpha/
arch/arm/
arch/sh/
arch/s390/
5,000
1000
0
0
Jan 1993
Jul 1998
40,000
kernel
mm
ipc
lib
init
Total uncommented LOC
Total uncommented LOC
7000
Mar 1997
Growth of arch SS
9000
8000
Oct 1995
Jun 1994
Oct 1995
Mar 1997
Jul 1998
Dec 1999
Apr 2001
Jan 1993
Jun 1994
Oct 1995
Mar 1997
Why has Linux been able to
continue its geometric growth?
• Core code quality is carefully maintained
• Architecture/problem domain
– It’s largely drivers
– Much of the code is “parallel”
– It’s not as big as you might think
• Vanilla configuration used only 15% of files
• Development model (OSD) and its sociology
– Popularity and visibility has encouraged outsiders (both
hackers and industry) to contribute
– “Clone and hack” is an acceptable development style
Case study: Linux SCSI drivers
• Nice, controlled experiment:
– Large body of code, multiple versions, well used system,
open source
– SCSI drivers all do similar tasks
– Source comments shows cloning has occurred!
• Approx. 500 releases of Linux since 1994.
• Kernel v2.3.39: (released Jan 2000)
– 5000 source files, 2.2 MLOC, 10 hardware architectures
– drivers/scsi has 212 source files, 166 KLOC,
Goals of case study
• Examine “real world” cloning:
– How common is it?
– Why is it done?
– What do the “cloning patterns” look like?
• Examine parallel evolution:
– What kinds of changes are common?
– Do developers (need to) change clone relatives too?
• Is there a better design structure lurking?
• Compare against clone detection tools
– Are detections tools looking for the right indications of
cloning?
SCSI Subsystem - Size (rel. 2.2.16)
•
•
•
•
•
•
Number of source files: 211
Number of functions: 2512
Number of lines: 254,953
% of comments: 38
Number of low-level drivers: 80
File size:
– on average ~3000 lines
– large multi-card drivers ~15,000 lines
SCSI Subsystem - Architecture
• Upper Layer
– Uniform way of handling devices
– Hard Disk, CD-ROM Disk, Tape, Generic
• Middle Layer
– “bridge” between Upper Layer and Low-Level Devices
• Low-Level Device Drivers
– low-level driver functionality and management
Clones Expected?
• Why did we expect to find clones:
–
–
–
–
–
–
–
–
Every driver must implement uniform interface
Design of subsystem does not support other forms of reuse
Driver logic is relatively simple (!)
Devices from same family  more cloning
Completely different hardware  less or no cloning
Open source  anyone can reuse code
Easier and more efficient to reuse existing code
Reused code already tested, so probably better quality than if
we build it from scratch
Clones - Manual Inspection
• From source code comments, we have found:
esp.[ch]
jazz_esp.[ch]
cyberstorm.[ch]
dec_esp.[ch] cyberstormII.[ch] mca_53c9x.[ch] blz2060.[ch]
qlogicisp.[ch]
fdomain.[ch]
sd.[ch]
t128.[ch]
qlogicpti.[ch]
fd_mcs.[ch]
sr.[ch]
pas16.[ch]
fastlane.[ch]
Types of Changes Detected
•
•
•
•
•
•
•
Names of variables
Initialization parameters and constants
Driver specific initialization logic removed/added
Small change in supporting functions
Small changes in driver management code
Comments are updated
Code changed is highly embedded into other code,
which makes extraction of that code hard
Automatic Clone Detection
• We have looked for commercial and research
clone detection software
• Clone Finder - www. studio501.com
– free trial edition (C, C++)
– easy to use
– groups clones and highlights them in the source code
• Clone DR [Baxter] www.semdesigns.com (future)
– Cobol trial edition (supports also C, C++, Java)
• Merlo et al. tool (future)
Clone Finder Results
•
•
•
•
•
•
•
Number of files scanned: 8
Number of source lines: 4081
Elapsed time in seconds: 0.44
Number of Groupings: 14
Number of Blocks within those groupings: 30
Total number of duplicated lines: 373
Percent of source lines which are duplicated: 9.14
Something missed?
cyberstorm.c
….
static void dma_dump_state(struct NCR_ESP *esp)
{
ESPLOG(("esp%d: dma -- cond_reg<%02x>\n",
esp->esp_id, ((struct cyber_dma_registers *)
(esp->dregs))->cond_reg));
ESPLOG(("intreq:<%04x>, intena:<%04x>\n",
custom.intreqr, custom.intenar));
}
cyberstormII.c
….
static void dma_dump_state(struct NCR_ESP *esp)
{
ESPLOG(("esp%d: dma -- cond_reg<%02x>\n",
esp->esp_id, ((struct cyberII_dma_registers *)
(esp->dregs))->cond_reg));
ESPLOG(("intreq:<%04x>, intena:<%04x>\n",
custom.intreqr, custom.intenar));
}
static void dma_init_read(struct NCR_ESP *esp, __u32 addr, int
length)
{
struct cyber_dma_registers *dregs =
(struct cyber_dma_registers *) esp->dregs;
static void dma_init_read(struct NCR_ESP *esp, __u32 addr, int
length)
{
struct cyberII_dma_registers *dregs =
(struct cyberII_dma_registers *) esp->dregs;
…….
cache_clear(addr, length);
cache_clear(addr, length);
addr &= ~(1);
dregs->dma_addr0 = (addr >> 24) & 0xff;
dregs->dma_addr1 = (addr >> 16) & 0xff;
dregs->dma_addr2 = (addr >> 8) & 0xff;
dregs->dma_addr3 = (addr
) & 0xff;
ctrl_data &= ~(CYBER_DMA_WRITE);
addr &= ~(1);
dregs->dma_addr0 = (addr >> 24) & 0xff;
dregs->dma_addr1 = (addr >> 16) & 0xff;
dregs->dma_addr2 = (addr >> 8) & 0xff;
dregs->dma_addr3 = (addr
) & 0xff;
}
……...
How to Solve Cloning “Problem”
• Clone management through development process?
– Unlikely in this case, since it’s hard to incorporate into
open source development
• Automatic clone detection and removal?
– Not clear that tools are adequate for “real world”
cloning problems
– Software developed and maintained by different parties
– Architecture of the subsystem would be “broken”
Proposed Clone Solution
• Combination of clone control and removal:
– Make driver “template” that separates generic code
from driver specific one
– Clearly indicate which parts of driver are to be changed
and which not
– “Alarm” other developers when bug discovered in
common code
• This allows independent development, preserves
architecture, and simplifies design
• Applicable to all “plug-in” based software
Conclusion
• It’s not clear that current clone detection tools
“do the right thing”
• Theory developed on clone management, detection,
and removal is not universally applicable to all
types of applications, languages, and designs
– Need more qualitative analysis of “cloning in the real
world”
• Combination of different approaches should give
the best results
Ongoing & Future Work
• More detailed qualitative analysis of “cloning in
the real world”
• More investigation of relative effectiveness of
clone detection tools
• Investigation of “parallel evolution” by
maintenance type
– bug fixes
– new features
– restructuring
• Investigate another driver family, see if results are
similar e.g., Linux network card drivers
Download