SOFE-3980U-W2022-Week10-11
SOFE 3980U: Software Quality
Software Metrics
Instructor:
Akramul Azim, PhD
Winter 2022
Faculty of Engineering and Applied Science
University of Ontario Institute of Technology (UOIT)
(Slides from Dr. Ivan Bruha on Software Metrics)
How many Lines of Code?
https://www.youtube.com/watch?v=8Io6IRiwYio
SOFTWARE QUALITY METRICS
BASICS
How many Lines of Code?
with TEXT_IO; use TEXT_IO;
procedure Main is
-- This program copies characters from an input
-- file to an output file. Termination occurs
-- either when all characters are copied or
-- when a NUL character is input
  Nullchar, Eof : exception;
  Char : CHARACTER;
  Input_file, Output_file, Console : FILE_TYPE;
begin
  loop
    Open (FILE => Input_file, MODE => IN_FILE,
          NAME => "CharsIn");
    Open (FILE => Output_file, MODE => OUT_FILE,
          NAME => "CharOut");
    Get (Input_file, Char);
    if END_OF_FILE (Input_file) then
      raise Eof;
    elsif Char = ASCII.NUL then
      raise Nullchar;
    else
      Put (Output_file, Char);
    end if;
  end loop;
exception
  when Eof => Put (Console, "no null characters");
  when Nullchar => Put (Console, "null terminator");
end Main;
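Depending on the counting convention (all physical lines, non-blank lines, non-comment lines, executable statements) the program above gives several different answers, which is the point of the question. A minimal line-counting sketch in Python, under the assumption that the program is saved as "main.adb" (a hypothetical file name) and that comment lines start with Ada's "--":

# Sketch: count lines of a source file under different conventions.
def count_loc(path):
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    physical = len(lines)
    non_blank = sum(1 for l in lines if l.strip())
    # non-comment LOC: ignore blank lines and Ada-style "--" comment lines
    ncloc = sum(1 for l in lines
                if l.strip() and not l.strip().startswith("--"))
    return {"physical": physical, "non_blank": non_blank, "NCLOC": ncloc}

print(count_loc("main.adb"))   # three different 'sizes' for the same program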
Software Quality Models

[Diagram: a Use - Factor - Criteria - Metrics quality model. The uses 'product operation' and 'product revision' are decomposed into factors (Usability, Reliability, Efficiency, Reusability, Maintainability, Portability, Testability); each factor is decomposed into criteria (Communicativeness, Accuracy, Consistency, Device efficiency, Accessibility, Completeness, Structuredness, Conciseness, Device independence, Legibility, Self-descriptiveness, Traceability); and each criterion is assessed by METRICS.]
Definition of system reliability
The reliability of a system is the probability that
the system will execute without failure in a
given environment for a given period of time.
Implications:
• No single reliability number for a given system: it depends on how the system is used
• Use probability to express our uncertainty
• Time dependent
What is a software failure?
Alternative views:
• Formal view
  – Any deviation from specified program behaviour is a failure
  – Conformance with the specification is all that matters
  – This is the view adopted in computer science
• Engineering view
  – Any deviation from required, specified or expected behaviour is a failure
  – If an input is unspecified, the program should produce a 'sensible' output appropriate for the circumstances
  – This is the view adopted in dependability assessment
Human errors, faults, and failures
human error --can lead to--> fault --can lead to--> failure

• Human Error: the designer's mistake
• Fault: the encoding of an error into a software document/product
• Failure: the deviation of the software system from specified or expected behaviour
Processing errors
In the absence of fault tolerance:

  human error --> fault --(triggered by an input)--> processing error --> failure
Relationship between faults and failures
[Figure: faults mapped to the failures they cause, with failures sized by MTTF; 35% of all faults only lead to very rare failures (MTTF > 5,000 years).]
The relationship between faults
and failures
• Most faults are benign
• For most faults: removal will not lead to greatly improved
reliability
• Large reliability improvements only come when we
eliminate the small proportion of faults which lead to the
more frequent failures
• Does not mean we should stop looking for faults, but
warns us to be careful about equating fault counts with
reliability
The ‘defect density’ measure: an
important health warning
• Defects = {faults} ∪ {failures}
  – but sometimes defects = {faults} or defects = {failures}
• System defect density = (number of defects found) / (system size)
  – where size is usually measured in thousands of lines of code (KLOC)
• Defect density is used as a de-facto measure of software
quality.
• What are industry ‘norms’ and what do they mean?
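As a worked illustration of the ratio above, a minimal sketch in Python (module names and counts are invented):

# Sketch: defect density = defects found / size in KLOC (illustrative numbers only).
def defect_density(defects_found, loc):
    return defects_found / (loc / 1000.0)   # defects per KLOC

modules = {"parser": (12, 4200), "ui": (7, 9800)}   # hypothetical (defects, LOC) pairs
for name, (defects, loc) in modules.items():
    print(name, round(defect_density(defects, loc), 2), "defects/KLOC")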
Defect density vs module size

[Figure: defect density plotted against lines of code per module, contrasting the theoretical curve with what is actually observed.]
A Study in Relative Efficiency of Testing
Methods
Testing Type           Defects found per hour
Regular use            0.21
Black box              0.282
White box              0.322
Reading/Inspections    1.057
R B Grady, ‘Practical Software metrics for Project Management
and Process Improvement’, Prentice Hall, 1992
The problem with ‘problems’
• Defects
• Faults
• Failures
• Anomalies
• Bugs
• Crashes
Incident Types
• Failure (in pre or post release)
• Fault
• Change request
Generic Data
Applicable to all incident types
What: Product details
Where (Location): Where is it?
Who: Who found it?
When (Timing): When did it occur?
What happened (End Result): What was observed?
How (Trigger): How did it arise?
Why (Cause): Why did it occur?
Severity/Criticality/Urgency
Change
Example: Failure Data
What: ABC Software Version 2.3
Where: Norman’s home PC
Who: Norman
When: 13 Jan 2000 at 21:08 after 35 minutes of operational
use
End result: Program crashed with error message xyz
How: Loaded external file and clicked the command Z.
Why: <BLANK - refer to fault>
Severity: Major
Change: <BLANK>
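The generic fields above (what, where, who, when, end result, trigger, cause, severity, change) could be held in a simple record; a minimal sketch in Python, with field names taken from the slides and the class name invented:

# Sketch of a generic incident record covering failures, faults and change requests.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentReport:            # hypothetical name
    what: str                    # product details
    where: str                   # location
    who: str                     # who found it
    when: str                    # when it occurred
    end_result: str              # what was observed
    trigger: str                 # how it arose
    cause: Optional[str]         # why it occurred (may be blank, as in the failure example)
    severity: str                # severity / criticality / urgency
    change: Optional[str] = None # change made or requested

failure = IncidentReport(
    what="ABC Software Version 2.3", where="Norman's home PC", who="Norman",
    when="13 Jan 2000 at 21:08", end_result="Program crashed with error message xyz",
    trigger="Loaded external file and clicked the command Z",
    cause=None, severity="Major")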
Example: Fault Data (1) - reactive
What: ABC Software Version 2.3
Where: Help file, section 5.7
Who: Norman
When: 15 Jan 2000, during formal inspection
End result: Likely to cause users to enter invalid passwords
How: The text wrongly says that passwords are case sensitive
Why: <BLANK>
Urgency: Minor
Change: Suggest rewording as follows ...
Example: Fault Data (2) - responsive
What: ABC Software Version 2.3
Where: Function <abcd> in Module <ts0023>
Who: Simon
When: 14 Jan 2000, after 2 hours investigation
What happened: Caused reported failure id <0096>
How: <BLANK>
Why: Missing exception code for command Z
Urgency: Major
Change: exception code for command Z added to function
<abcd> and also to function <efgh>. Closed on 15 Jan
2000.
Example: Change Request
What: ABC Software Version 2.3
Where: File save menu options
Who: Norman
When: 20 Jan 2000
End result: <BLANK>
How: <BLANK>
Why: Must be able to save files in ascii format - currently not
possible
Urgency: Major
Change: Add function to enable ascii format file saving
Tracking incidents to components
Incidents need to be traceable to identifiable components but at what level of granularity?
• Unit
• Module
• Subsystem
• System
Fault classifications used in Eurostar
control system
Cause
error in software design
error in software implementation
error in test procedure
deviation from functional specification
hardware not configured as specified
change or correction induced error
clerical error
other (specify)
Category
category not applicable
initialisation
logic/control structure
interface (external)
interface (internal)
data definition
data handling
computation
timing
other (specify)
Summary of Software Metrics Basics
• Software quality is a multi-dimensional notion
• Defect density is a common (but confusing) way of
measuring software quality
• Much data collection focuses on ‘incident types: failures,
faults, and changes. There are ‘who, when, where,..’ type
data to collect in each case
• System components must be identified at appropriate levels
of granularity
SOFTWARE METRICS
PRACTICE
Why software measurement?
• To assess software products
• To assess software methods
• To help improve software processes
From Goals to Actions
Goals
Measures
Data
Facts/trends
Decisions
Actions
Goal Question Metric (GQM)
• There should be a clearly-defined need for every
measurement.
• Begin with the overall goals of the project or product.
• From the goals, generate questions whose answers will
tell you if the goals are met.
• From the questions, suggest measurements that can help
to answer the questions.
From Basili and Rombach's Goal-Question-Metric paradigm, described in their 1988 IEEE Transactions on Software Engineering paper on the TAME project.
GQM Example
Goal: Identify fault-prone modules as early as possible

Questions:
• What do we mean by a 'fault-prone' module?
• Does 'complexity' impact fault-proneness?
• How much testing is done per module?
• ...

Metrics:
• 'Defect data' for each module: # faults found per testing phase, # failures traced to the module
• 'Size/complexity data': KLOC, complexity metrics
• 'Effort data' for each module: testing effort per testing phase
The Metrics Plan
For each technical goal this contains information about
• WHY metrics can address the goal
• WHAT metrics will be collected, how they will be defined, and how
they will be analyzed
• WHO will do the collecting, who will do the analyzing, and who will see
the results
• HOW it will be done - what tools, techniques and practices will be used
to support metrics collection and analysis
• WHEN in the process and how often the metrics will be collected and
analyzed
• WHERE the data will be stored
The Enduring LOC Measure
• LOC: Number of Lines Of Code
• The simplest and most widely used measure of
program size. Easy to compute and automate
• Used (as normalising measure) for
– productivity assessment (LOC/effort)
– effort/cost estimation (Effort = f(LOC))
– quality assessment/estimation (defects/LOC)
• Alternative (similar) measures
  – KLOC: Thousands of Lines Of Code
  – KDSI: Thousands of Delivered Source Instructions
  – NCLOC: Non-Comment Lines of Code
  – Number of Characters or Number of Bytes
Example: Software Productivity at
Toshiba
[Chart: instructions per programmer-month (scale 0-300) from 1972 to 1982, with productivity rising after the Software Workbench System was introduced.]
Problems with LOC type measures
• No standard definition
• Measures length of programs rather than size
• Wrongly used as a surrogate for:
– effort
– complexity
– functionality
• Fails to take account of redundancy and reuse
• Cannot be used comparatively for different types of programming
languages
• Only available at the end of the development life-cycle
Fundamental software size attributes
• length the physical size of the product
• functionality measures the functions supplied by the
product to the user
• complexity
– Problem complexity measures the complexity of the underlying
problem.
– Algorithmic complexity reflects the complexity/efficiency of the
algorithm implemented to solve the problem
– Structural complexity measures the structure of the software used to
implement the algorithm (includes control flow structure, hierarchical
structure and modular structure)
– Cognitive complexity measures the effort required to understand the
software.
The search for more discriminating
metrics
Measures that:
• capture cognitive complexity
• capture structural complexity
• capture functionality (or functional complexity)
• are language independent
• can be extracted at early life-cycle phases
The 1970’s: Measures of Source Code
Characterized by
• Halstead’s ‘Software Science’ metrics
• McCabe’s ‘Cyclomatic Complexity’ metric
Influenced by:
• Growing acceptance of structured programming
• Notions of cognitive complexity
Halstead’s Software Science Metrics
A program P is a collection of tokens, classified as
either operators or operands.
n1 = number of unique operators
n2 = number of unique operands
N1 = total occurrences of operators
N2 = total occurrences of operands
Length of P is N = N1 + N2; vocabulary of P is n = n1 + n2
Theory: the estimated length of P is N^ = n1 log2 n1 + n2 log2 n2
Theory: the effort required to generate P is
    E = (n1 * N2 * N * log2 n) / (2 * n2)    (elementary mental discriminations)
Theory: the time required to program P is T = E/18 seconds
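A small sketch of these formulas in Python; the token counts are supplied by hand (a real tool would tokenise the source and classify operators and operands):

# Sketch of Halstead's Software Science measures from given token counts.
import math

def halstead(n1, n2, N1, N2):
    n = n1 + n2                                        # vocabulary
    N = N1 + N2                                        # length
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)    # estimated length
    E = (n1 * N2 * N * math.log2(n)) / (2 * n2)        # effort (elementary mental discriminations)
    T = E / 18                                         # time in seconds
    return N, N_hat, E, T

# Hypothetical counts for a small program:
print(halstead(n1=10, n2=7, N1=26, N2=19))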
McCabe’s Cyclomatic Complexity
Metric v
If G is the control flowgraph of program P, and G has e edges (arcs) and n nodes, then
    v(P) = e - n + 2
v(P) is the number of linearly independent paths in G.

[Figure: example flowgraph with e = 16 and n = 13, so v(P) = 5.]

More simply, if d is the number of decision nodes in G then
    v(P) = d + 1

McCabe proposed: v(P) < 10 for each module P
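A minimal sketch of both forms of the calculation; the example flowgraph is hypothetical, not the one in the slide's figure:

# Sketch: cyclomatic complexity v = e - n + 2 from a control flowgraph,
# and the shortcut v = d + 1 from the number of decision nodes.
def cyclomatic(edges, nodes):
    return len(edges) - len(nodes) + 2

def cyclomatic_from_decisions(num_decisions):
    return num_decisions + 1

# Hypothetical flowgraph of a simple if/else:
nodes = ["start", "cond", "then", "else", "end"]
edges = [("start", "cond"), ("cond", "then"), ("cond", "else"),
         ("then", "end"), ("else", "end")]

print(cyclomatic(edges, nodes))        # 5 - 5 + 2 = 2
print(cyclomatic_from_decisions(1))    # one decision node -> 2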
The 1980’s: Early Life-Cycle Measures
• Predictive process measures - effort and cost estimation
• Measures of designs
• Measures of specifications
Software Cost Estimation
See that building
on the screen?
I want
to know
its weight
How can I tell by
just looking at the
screen? I don’t
have any instruments
or context
I don’t care. You’ve
got your eyes and
a thumb and I want
the answer to the
nearest milligram
Simple COCOMO Effort Prediction
effort = a (size)^b

effort = person months
size = KDSI (predicted)
a, b constants depending on type of system:
  'organic':        a = 2.4   b = 1.05
  'semi-detached':  a = 3.0   b = 1.12
  'embedded':       a = 3.6   b = 1.2
COCOMO Development Time
Prediction
time = a (effort)^b

effort = person months
time = development time (months)
a, b constants depending on type of system:
  'organic':        a = 2.5   b = 0.32
  'semi-detached':  a = 2.5   b = 0.35
  'embedded':       a = 2.5   b = 0.38
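A small sketch chaining the two formulas, using the constants from the slides (the 20 KDSI input is an arbitrary example):

# Sketch of basic COCOMO: effort = a * size^b (person months),
# then development time = c * effort^d (months).
COCOMO = {
    "organic":       {"a": 2.4, "b": 1.05, "c": 2.5, "d": 0.32},
    "semi-detached": {"a": 3.0, "b": 1.12, "c": 2.5, "d": 0.35},
    "embedded":      {"a": 3.6, "b": 1.2,  "c": 2.5, "d": 0.38},
}

def cocomo(kdsi, mode="organic"):
    p = COCOMO[mode]
    effort = p["a"] * kdsi ** p["b"]    # person months
    time = p["c"] * effort ** p["d"]    # months
    return effort, time

effort, time = cocomo(20, "embedded")   # e.g. a predicted 20 KDSI embedded system
print(round(effort, 1), "person months over", round(time, 1), "months")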
Regression Based Cost Modelling
[Figure: log-log plot of effort E (10 to 10,000 person months) against size S (1K to 10,000K), with a fitted line of slope b and intercept log a. Fitting log E = log a + b * log S is equivalent to fitting the model E = a * S^b.]
Albrecht’s Function Points
Count the number of:
External inputs
External outputs
External inquiries
External files
Internal files
giving each a ‘weighting factor’
The Unadjusted Function Count (UFC) is the sum of
all these weighted scores
To get the Adjusted Function Count (FP), multiply
by a Technical Complexity Factor (TCF)
FP = UFC x TCF
Function Points: Example
Spell-Checker Spec: The checker accepts as input a document file and an
optional personal dictionary file. The checker lists all words not contained
in either of these files. The user can query the number of words processed
and the number of spelling errors found at any stage during processing
[Context diagram: the User supplies the document file and personal dictionary and issues 'words processed' and 'errors found' enquiries; the Spelling Checker looks words up in the Dictionary and returns the '# words processed' message, the '# errors' message and a report on misspelt words.]
A = # external inputs = 2, B =# external outputs = 3, C = # inquiries = 2,
D = # external files = 2, E = # internal files = 1
Assuming average complexity in each case
UFC = 4A + 5B + 4C +10D + 7E = 58
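The same calculation as a sketch, using the 'average' weights quoted on the slide (4, 5, 4, 10, 7); full FP counting also grades each item as simple, average or complex and then applies the TCF:

# Sketch: Unadjusted Function Count with the average weights from the slide.
WEIGHTS = {"external_inputs": 4, "external_outputs": 5,
           "inquiries": 4, "external_files": 10, "internal_files": 7}

def ufc(counts):
    return sum(WEIGHTS[k] * v for k, v in counts.items())

spell_checker = {"external_inputs": 2, "external_outputs": 3,
                 "inquiries": 2, "external_files": 2, "internal_files": 1}
print(ufc(spell_checker))   # 58, as in the example
# FP = UFC * TCF, where TCF is the Technical Complexity Factor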
Function Points: Applications
• Used extensively as a ‘size’ measure in preference to LOC
• Examples:
  – Productivity = FP / (person months of effort)
  – Quality = defects / FP
  – Effort prediction: E = f(FP)
Function Points and Program Size
Language                  Source statements per FP
Assembler                 320
C                         150
Algol                     106
COBOL                     106
FORTRAN                   106
Pascal                    91
RPG                       80
PL/1                      80
MODULA-2                  71
PROLOG                    64
LISP                      64
BASIC                     64
4GL Database              40
APL                       32
SMALLTALK                 21
Query languages           16
Spreadsheet languages     6
The 1990’s: Broader Perspective
• Reports on Company-wide measurement
programmes
• Benchmarking
• Impact of SEI’s CMM process assessment
• Use of metrics tools
• Measurement theory as a unifying framework
• Emergence of international software measurement
standards
– measuring software quality
– function point counting
– general data collection
Process improvement at
Motorola
[Chart: in-process defects per MAELOC (scale 0-1000) plotted against SEI maturity level 1 to 5, falling as the level rises.]
Flashback: The SEI Capability Maturity Model
Level 5: Optimising
Process change management
Technology change management
Defect prevention
Level 4: Managed
Software quality management
Quantitative process management
Level 3: Defined
Peer reviews
Training programme
Intergroup coordination
Integrated s/w management
Organization process definition/focus
Level 2: Repeatable
S/W configuration management
S/W QA S/W project planning
S/W subcontract management
S/W requirements management
Level 1: Initial/ad-hoc
IBM Space Shuttle Software Metrics
Program (1)
[Charts: early detection rate and total inserted error rate across releases.]
IBM Space Shuttle Software Metrics
Program (2)
[Chart: predicted total error rate trend (errors per KLOC, scale 0-14) across onboard flight software releases 1 to 8F, showing actual and expected values within 95% high and low bounds.]
IBM Space Shuttle Software
Metrics Program (3)
[Chart: onboard flight software failures occurring per base system (scale 0-25) for basic operational increments 8B, 8C and 8D.]
SOFTWARE METRICS FRAMEWORK
Software Measurement Activities
Cost
Estimation
Algorithmic
complexity
Function
Points
Productivity
Models
Software
Quality
Models
Structural
Measures
Complexity
Metrics
Reliability
Models
GQM
Are these diverse activities related?
Opposing Views on Measurement?
‘‘When you can measure what you are speaking about, and
express it in numbers, you know something about it; but
when you cannot measure it, when you cannot express it
in numbers, your knowledge is of a meagre kind.”
Lord Kelvin
“In truth, a good case could be made that if your knowledge
is meagre and unsatisfactory, the last thing in the world
you should do is make measurements. The chance is
negligible that you will measure the right things
accidentally.”
George Miller
Definition of Measurement
Measurement is the process of empirical
objective assignment of numbers to
entities, in order to characterise a specific
attribute.
• Entity: an object or event
• Attribute: a feature or property of an entity
• Objective: the measurement process must
be based on a well-defined rule whose results
are repeatable
Example Measures
ENTITY             ATTRIBUTE         MEASURE
Person             Age               Years at last birthday
Person             Age               Months since birth
Source code        Length            # Lines of Code (LOC)
Source code        Length            # Executable statements
Testing process    Duration          Time in hours from start to finish
Tester             Efficiency        Number of faults found per KLOC
Testing process    Fault frequency   Number of faults found per KLOC
Source code        Quality           Number of faults found per KLOC
Operating system   Reliability       Mean time to failure, rate of occurrence of failures
Avoiding Mistakes in Measurement
Common mistakes in software measurement can be
avoided simply by adhering to the definition of
measurement. In particular:
• You must specify both entity and attribute
• The entity must be defined precisely
• You must have a reasonable, intuitive understanding of
the attribute before you propose a measure
Be Clear of Your Attribute
It is a mistake to propose a ‘measure’ if there is no
consensus on what attribute it characterises.
o Results of an IQ test
– intelligence?
– or verbal ability?
– or problem solving skills?
o # defects found / KLOC
– quality of code?
– quality of testing?
A Cautionary Note
We must not re-define an attribute to fit in with an
existing measure.
His IQ rating
is zero - he
didn’t manage
a single answer
Well I know he can’t
write yet, but I’ve always
regarded him as a
rather intelligent dog
Types and uses of measurement
• Two distinct types of measurement:
– direct measurement
– indirect measurement
• Two distinct uses of measurement:
– for assessment
– for prediction
Measurement for prediction requires a prediction
system
Some Direct Software Measures
• Length of source code (measured by LOC)
• Duration of testing process (measured by elapsed time in
hours)
• Number of defects discovered during the testing process
(measured by counting defects)
• Effort of a programmer on a project (measured by person
months worked)
Some Indirect Software Measures
Programmer productivity     = LOC produced / person months of effort
Module defect density       = number of defects / module size
Defect detection efficiency = number of defects detected / total number of defects
Requirements stability      = number of initial requirements / total number of requirements
Test effectiveness ratio    = number of items covered / total number of items
System spoilage             = effort spent fixing faults / total project effort
Predictive Measurement
Measurement for prediction requires a prediction system.
This consists of:
• Mathematical model
– e.g. 'E = aS^b', where E is effort in person months (to be predicted), S is size (LOC), and a and b are constants.
• Procedures for determining model parameters
– e.g. ‘Use regression analysis on past project data to determine a and b’.
• Procedures for interpreting the results
– e.g. ‘Use Bayesian probability to determine the likelihood that your prediction
is accurate to within 10%’
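A sketch of the 'determine a and b by regression on past project data' step, fitting the log-linear form log E = log a + b log S; the project data below is invented for illustration:

# Sketch: calibrate E = a * S^b by least-squares regression on logs of past projects.
import math

past = [(10_000, 24), (50_000, 90), (120_000, 250), (400_000, 900)]  # hypothetical (LOC, person months)
xs = [math.log(s) for s, _ in past]
ys = [math.log(e) for _, e in past]

n = len(past)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

print(f"E = {a:.3f} * S^{b:.3f}")
print("Predicted effort for 80 KLOC:", round(a * 80_000 ** b, 1), "person months")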
No Shortcut to Accurate Prediction
‘‘Testing your methods on a sample of past data gets to the
heart of the scientific approach to gambling. Unfortunately
this implies some preliminary spadework, and most people
skimp on that bit, preferring to rely on blind faith instead’’
[Drapkin and Forsyth 1987]
Software prediction (such as cost estimation) is no different
from gambling in this respect
Products, Processes, and Resources
Resources
Processes
Products
Process: a software related activity or event
– testing, designing, coding, etc.
Product: an object which results from a process
– test plans, specification and design documents, source and object code,
minutes of meetings, etc.
Resource: an item which is input to a process
– people, hardware, software, etc.
Internal and External Attributes
Let X be a product, process, or resource
• External attributes of X are those which can only be
measured with respect to how X relates to its
environment
– e.g. reliability or maintainability of source code (product)
• Internal attributes of X are those which can be measured
purely in terms of X itself
– e.g. length or structuredness of source code (product)
The Framework Applied
ENTITIES and example ATTRIBUTES

PRODUCTS (Specification, Source Code, ...)
  Internal: length, functionality, modularity, structuredness, reuse, ...
  External: maintainability, reliability, ...

PROCESSES (Design, Test, ...)
  Internal: time, effort, # spec faults found, # failures observed, ...
  External: stability, cost-effectiveness, ...

RESOURCES (People, Tools, ...)
  Internal: age, price, CMM level, size, ...
  External: productivity, usability, quality, ...
CASE STUDY :
COMPANY OBJECTIVES
• Monitor and improve product reliability
– requires information about actual operational failures
• Monitor and improve product maintainability
– requires information about fault discovery and fixing
• ‘Process improvement’
– too high a level objective for metrics programme
– previous objectives partially characterise process improvement
General System Information
• 27 releases since Nov '87 implementation
• Currently 1.6 Million LOC in main system (15.2%
increase from 1991 to 1992)
[Chart: LOC in the main system, split by language (COBOL and Natural), for 1991 and 1992, on a scale of 0 to 1,600,000.]
Main Data
Fault Number   Week In   System Area   Fault Type   Week Out   Hours to Repair
...
F254           92/14     C2            P            92/17      5.5
...
• ‘faults’ are really failures (the lack of a distinction caused
problems)
• 481 (distinct) cleared faults during the year
• 28 system areas (functionally cohesive)
• 11 classes of faults
• Repair time: actual time to locate and fix defect
Case Study Components
• 28 ‘System areas’
– All closed faults traced to system area
• System areas made up of Natural, Batch COBOL, and CICS
COBOL programs
– Typically 80 programs in each. Typical program 1000 LOC
• No documented mapping of program to system area
• For most faults: ‘batch’ repair and reporting
– No direct, recorded link between fault and program in most cases
• No database with program size information
• No historical database to capture trends
Single Incident Close Report
Fault id: F752
Reported: 18/6/92
Definition: Logically deleted work done records appear on enquiries
Description: Causes misleading info to users. Amend ADDITIONAL WORK PERFORMED RDVIPG2A to ignore work done records with FLAG-AMEND = 1 or 2
Programs changed: RDVIPG2A, RGHXXZ3B
SPE: Joe Bloggs
Date closed: 26/6/92
Single Incident Close Report:
Improved Version
Fault id: F752
Reported: 18/6/92
Trigger: Delete work done record, then open enquiry
End result: Deleted records appear on enquiries, providing misleading info to users
Cause: Omission of appropriate flag variables for work done records
Change: Amend ADDITIONAL WORK PERFORMED in RDVIPG2A to ignore work done records with FLAG-AMEND = 1 or 2
Programs changed: RDVIPG2A, RGHXXZ3B
SPE: Joe Bloggs
Date closed: 26/6/92
Fault Classification
Non-orthogonal:
Data
Micro
JCL
Operations
Misc
Unresolved
Program
Query
Release
Specification
User
Missing Data
• Recoverable
– Size information
– Static/complexity information
– Mapping of faults to programs
– Severity categories
• Non-recoverable
– Operational usage per system area
– Success/failure of fixes
– Number of repeated failures
‘Reliability’ Trend
[Chart: faults received per week (scale 0-50) across weeks 10 to 50 of the year.]
Identifying Fault Prone Systems?
Number of faults per system area (1992)

[Bar chart: faults per system area (scale 0-90), for system areas from C2 to J.]
Analysis of Fault Types
Faults by fault type (total 481 faults)

[Pie chart: slices for Program, Data, User, Query, Unresolved, Release, Misc and Others.]
Fault Types and System Areas
Most common faults over system areas

[Chart: counts (scale 0-70) of the most common fault types (Program, Data, User, Release, Unresolved, Query, Miscellaneous) across system areas C2, C, J, G, G2, N, T, C3, W, D, F and C1.]
Maintainability Across System Areas
Mean time to repair a fault (by system area)

[Bar chart: hours to repair (scale 0-10) for system areas D, O, S, W1, F, W, C3, P, L, G, C1, J, T, D1, G2, N, Z, C, C2, G1 and U.]
Maintainability Across Fault Types
Mean time to repair a fault (by fault type)

[Bar chart: hours to repair (scale 0-9) for each fault type.]
Case study results with additional
data: System Structure
Normalised Fault Rates (1)

[Bar chart: faults per KLOC (scale 0-20) by system area (C2, C3, P, C, L, G2, N, J, G, F, W, G1, S, D, O, W1, C4, M, D1, I, Z, B).]
Normalised Fault Rates (2)
[Bar chart: faults per KLOC (scale 0-1.20) for the same system areas with C2 omitted, so the remaining areas can be compared.]
Case Study 1 Summary
• The ‘hard to collect’ data was mostly all there
– Exceptional information on post-release ‘faults’ and maintenance
effort
– It is feasible to collect this crucial data
• Some ‘easy to collect’ (but crucial) data was
omitted or not accessible
– The addition to the metrics database of some basic information
(mostly already collected elsewhere) would have enabled
proactive activity.
– Goals almost fully met with the simple additional data.
– Crucial explanatory analysis possible with simple additional data
– Goals of monitoring reliability and maintainability only partly met
with existing data
SOFTWARE METRICS:
MEASUREMENT THEORY AND
STATISTICAL ANALYSIS
Natural Evolution of Measures
As our understanding of an attribute grows, it is possible
to define more sophisticated measures; e.g.
temperature of liquids:
• 200BC - rankings, ‘‘hotter than’’
• 1600 - first thermometer preserving ‘‘hotter than’’
• 1720 - Fahrenheit scale
• 1742 - Centigrade scale
• 1854 - Absolute zero, Kelvin scale
Measurement Theory Objectives
Measurement theory is the scientific basis for all types of
measurement. It is used to determine formally:
• When we have really defined a measure
• Which statements involving measurement are
meaningful
• What the appropriate scale type is
• What types of statistical operations can be applied to
measurement data
Measurement Theory: Key Components
• Empirical relation system
– the relations which are observed on entities in the real world which characterise
our understanding of the attribute in question,
e.g. ‘Fred taller than Joe’ (for height of people)
• Representation condition
– real world entities are mapped to number (the measurement mapping) in such a
way that all empirical relations are preserved in numerical relations and no new
relations are created
e.g. M(Fred) > M(Joe) precisely when Fred is taller than Joe
• Uniqueness Theorem
– Which different mappings satisfy the representation condition,
e.g. we can measure height in inches, feet, centimetres, etc but all such
mappings are related in a special way.
Representation Condition
[Diagram: the measurement mapping M takes real-world entities to numbers. The empirical relation 'Joe taller than Fred' is preserved under M as the numerical relation M(Joe) > M(Fred), e.g. M(Joe) = 72 and M(Fred) = 63.]
Meaningfulness in Measurement
Some statements involving measurement appear more
meaningful than others:
• Fred is twice as tall as Jane
• The temperature in Tokyo today is twice that in
London
• The difference in temperature between Tokyo and
London today is twice what it was yesterday
Formally a statement involving measurement is
meaningful if its truth value is invariant of
transformations of allowable scales
Measurement Scale Types
Some measures seem to be of a different ‘type’ to
others, depending on what kind of statements are
meaningful. The 5 most important scale types of
measurement are:
• Nominal
• Ordinal
• Interval
• Ratio
• Absolute
Increasing order
of sophistication
Nominal Scale Measurement
• Simplest possible measurement
• Empirical relation system consists only of different
classes; no notion of ordering.
• Any distinct numbering of the classes is an
acceptable measure (could even use symbols
rather than numbers), but the size of the numbers
have no meaning for the measure
Ordinal Scale Measurement
• In addition to classifying, the classes are also ordered
with respect to the attribute
• Any mapping that preserves the ordering (i.e. any
monotonic function) is acceptable
• The numbers represent ranking only, so addition and
subtraction (and other arithmetic operations) have no
meaning
Interval Scale Measurement
• Powerful, but rare in practice
• Distances between entities matter, but not ratios
• The mapping must preserve order and intervals
• Examples:
– Timing of events’ occurrence, e.g. could measure these in units
of years, days, hours etc, all relative to different fixed events.
Thus it is meaningless to say ‘‘Project X started twice as early as
project Y’’, but meaningful to say ‘‘the time between project X
starting and now is twice the time between project Y starting
and now’’
– Air Temperature measured on Fahrenheit or Centigrade scale
Ratio Scale Measurement
Common in physical sciences. Most useful scale of
measurement
• Ordering, distance between entities, ratios
• Zero element (representing total lack of the attribute)
• Numbers start at zero and increase at equal intervals
(units)
• All arithmetic can be meaningfully applied
Absolute Scale Measurement
• Absolute scale measurement is just counting
• The attribute must always be of the form of
‘number of occurrences of x in the entity’
– number of failures observed during integration testing
– number of students in this class
• Only one possible measurement mapping (the
actual count)
• All arithmetic is meaningful
Validation of Measures
• Validation of a software measure is the process of
ensuring that the measure is a proper numerical
characterisation of the claimed attribute
• Example:
  – A valid measure of length of programs must not contradict any intuitive notion about program length
  – If program P2 is bigger than P1 then m(P2) > m(P1)
  – If m(P1) = 7 and m(P2) = 9, then if P1 and P2 are concatenated, m(P1;P2) must equal m(P1) + m(P2) = 16
• A stricter criterion is to demonstrate that the measure is
itself part of valid prediction system
Validation of Prediction Systems
• Validation of a prediction system, in a given environment,
is the process of establishing the accuracy of the
predictions made by empirical means
– i.e. by comparing predictions against known data points
• Methods
– Experimentation
– Actual use
• Tools
– Statistics
– Probability
Scale Types Summary
Scale Type   Characteristics
Nominal      Entities are classified. No arithmetic meaningful.
Ordinal      Entities are classified and ordered. Cannot use + or -.
Interval     Entities classified, ordered, and differences between them understood ('units'). No zero, but can use ordinary arithmetic on intervals.
Ratio        Zeros, units, ratios between entities. All arithmetic.
Absolute     Counting; only one possible measure. All arithmetic.
Meaningfulness and Statistics
The scale type of a measure affects what operations it is
meaningful to perform on the data
Many statistical analyses use arithmetic operators
These techniques cannot be used on certain data, particularly nominal and ordinal measures
Example: The Mean
• Suppose we have a set of values {a1, a2, ..., an} and wish to compute the 'average'
• The mean is (a1 + a2 + ... + an) / n
• The mean is not a meaningful average for a set of ordinal scale data
Alternative Measures of Average
Median: The midpoint of the data when it is
arranged in increasing order. It divides the data
into two equal parts
Suitable for ordinal data. Not suitable for nominal
data since it relies on order having meaning.
Mode: The commonest value
Suitable for nominal data
Summary of Meaningful Statistics
Scale Type   Average           Spread
Nominal      Mode              Frequency
Ordinal      Median            Percentile
Interval     Arithmetic mean   Standard deviation
Ratio        Geometric mean    Coefficient of variation
Absolute     Any               Any
Non-Parametric Techniques
• Most software measures cannot be assumed to be
normally distributed. This restricts the kind of analytical
techniques we can apply.
• Hence we use non-parametric techniques:
– Pie charts
– Bar graphs
– Scatter plots
– Box plots
Box Plots
• Graphical representation of the spread of data.
• Consists of a box with tails drawn relative to a scale.
• Constructing the box plot:
  – Arrange the data in increasing order
  – The box is defined by the median, upper quartile (u) and lower quartile (l) of the data; the box length is b = u - l
  – The upper tail is u + 1.5b, the lower tail is l - 1.5b
  – Mark any data items outside the upper or lower tail (outliers)
  – If necessary, truncate the tails (usually at 0) to avoid meaningless concepts like negative lines of code

[Diagram: a box plot drawn against a scale, showing the lower tail, lower quartile, median, upper quartile, upper tail and an outlier marked 'x'.]
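A sketch of the construction just described; quartile conventions vary between textbooks, and this version uses a simple median-of-halves rule:

# Sketch: box-plot statistics (median, quartiles, box length, tails, outliers).
def median(sorted_xs):
    n = len(sorted_xs)
    mid = n // 2
    return sorted_xs[mid] if n % 2 else (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

def box_plot_stats(xs):
    xs = sorted(xs)
    half = len(xs) // 2
    l = median(xs[:half])                    # lower quartile
    u = median(xs[half + len(xs) % 2:])      # upper quartile
    b = u - l                                # box length
    lower_tail, upper_tail = l - 1.5 * b, u + 1.5 * b
    outliers = [x for x in xs if x < lower_tail or x > upper_tail]
    return {"median": median(xs), "l": l, "u": u,
            "lower_tail": lower_tail, "upper_tail": upper_tail, "outliers": outliers}

kloc = [10, 23, 26, 31, 31, 40, 47, 52, 54, 67, 70, 75, 83, 83, 100, 110, 200]
print(box_plot_stats(kloc))   # KLOC column of the example that follows; 200 is an outlier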
Box Plots: Examples

System   KLOC   MOD   FD
A        10     36    15
B        23     22    43
C        26     15    61
D        31     33    10
E        31     15    43
F        40     13    57
G        47     22    58
H        52     16    65
I        54     15    50
J        67     18    60
K        70     10    50
L        75     34    96
M        83     16    51
N        83     18    61
P        100    12    32
Q        110    20    78
R        200    21    48

[Box plots of each measure: for KLOC the median is 54, the upper quartile 83 and the upper tail 161, with system R an outlier; for MOD the box runs from 15 to 22 with median 18 and tails at 4.5 and 32.5; for FD the box runs from 43 to 61 with median 51 and tails at 16 and 88. Systems A, D and L are marked as outliers on the MOD and FD plots.]
Scatterplots
• Scatterplots are used to represent data for which two
measures are given for each entity
• Two dimensional plot where each axis represents one
measure and each entity is plotted as a point in the 2D plane
Example Scatterplot: Length vs Effort
[Scatterplot: effort in months (0-60) against length in KLOC (0-30), one point per entity.]
Determining Relationships
[The same scatterplot with a linear fit and a non-linear fit drawn through the data, and a few points far from both fits flagged as possible outliers.]
Causes of Outliers
• There may be many causes of outliers, some acceptable
and others not. Further investigation is needed to
determine the cause
• Example: A long module with few errors may be due to:
– the code being of high quality
– the module being especially simple
– reuse of code
– poor testing
Only the last requires action, although if it is the first it would be
useful to examine further explanatory factors so that the good
lessons can be learnt (was it use of a special tool or method,
was it just because of good people or management, or was it
just luck?)
Control Charts
• Help you to see when your data are within acceptable
bounds
• By watching the data trends over time, you can decide
whether to take action to prevent problems before they
occur.
• Calculate the mean and standard deviation of the data,
and then two control limits.
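A minimal sketch of that calculation; the mean +/- 3 standard deviations convention and the data are assumptions, since the slide only says 'two control limits':

# Sketch: control limits as mean +/- k standard deviations (k = 3 assumed here).
import statistics

def control_limits(values, k=3):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return mean - k * sd, mean, mean + k * sd

prep_per_inspection_hour = [1.8, 2.4, 2.1, 3.2, 2.6, 1.9, 2.8]   # hypothetical data
lcl, mean, ucl = control_limits(prep_per_inspection_hour)
for i, v in enumerate(prep_per_inspection_hour, 1):
    flag = "outside limits" if not (lcl <= v <= ucl) else ""
    print(f"component {i}: {v:.2f} {flag}")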
Control Chart Example
[Control chart: preparation hours per hour of inspection (scale 0 to 4.0) for components 1 to 7, plotted against the mean and the upper and lower control limits.]
EMPIRICAL RESULTS
Case study: Basic data
Number of faults per release and test phase:

Release                          Function test   System test   Site test   Operation
n   (sample size 140 modules)    916             682           19          52
n+1 (sample size 246 modules)    2292            1008          108         238
• Major switching system software
• Modules randomly selected from those that
were new or modified in each release
• Module is typically 2,000 LOC
• Only distinct faults that were fixed are counted
• Numerous metrics for each module
Hypotheses tested
• Hypotheses relating to Pareto principle of
distribution of faults and failures
• Hypotheses relating to the use of early fault data
to predict later fault and failure data
• Hypotheses about metrics for fault prediction
• Benchmarking hypotheses
Hypothesis 1a: a small number of modules contain most of
the faults discovered during testing
[Chart: cumulative % of faults (0-100) against % of modules (30, 60, 90); a small proportion of the modules accounts for most of the faults found during testing.]
Hypothesis 1b:
• If a small number of modules contain most of the faults
discovered during pre-release testing then this is simply
because those modules constitute most of the code size.
• For release n, the 20% of the modules which account for
60% of the faults (discussed in hypothesis 1a) actually
make up just 30% of the system size. The result for
release n+1 was almost identical.
Hypothesis 2a: a small number of modules
contain most of the operational faults?
[Chart: cumulative % of failures (0-100) against % of modules (10 to 100); a small proportion of the modules accounts for most of the operational failures.]
Hypothesis 2b
if a small number of modules contain most of the
operational faults then this is simply because those
modules constitute most of the code size.
• No: very strong evidence in favour of a converse
hypothesis:
most operational faults are caused by faults in a
small proportion of the code
• For release n, 100% of operational faults contained
in modules that make up just 12% of entire system
size. For release n+1, 80% of operational faults
contained in modules that make up 10% of the
entire system size.
Higher incidence of faults in function testing (FT) implies
higher incidence of faults in system testing (ST)?
[Chart: % of accumulated faults in system test (ST) and function test (FT) against % of modules (15% to 90%).]
Hypothesis 4:Higher incidence of faults pre-release implies
higher incidence of faults post-release?
• At the module level
• This hypothesis underlies the wide acceptance
of the fault-density measure
Pre-release vs post-release faults
Modules that are 'fault-prone' pre-release are NOT 'fault-prone' post-release - this demolishes most defect prediction models.

[Scatterplot: post-release faults (0-35) against pre-release faults (0-160) per module.]
Size metrics good predictors of fault and
failure prone modules?
• Hypothesis 5a: Smaller modules are less likely to
be failure prone than larger ones
• Hypothesis 5b: Size metrics are good predictors of the number of pre-release faults in a module
• Hypothesis 5c: Size metrics are good predictors of
number of post-release faults in a module
• Hypothesis 5d: Size metrics are good predictors of
a module’s (pre-release) fault-density
• Hypothesis 5e: Size metrics are good predictors of
a module’s (post-release) fault-density
Plotting faults against size
Correlation, but poor prediction.

[Scatterplot: faults (0-160) against lines of code (0-10,000) per module.]
Cyclomatic complexity against pre- and post-release faults

[Scatterplots: pre-release faults (0-160) and post-release faults (0-35) against cyclomatic complexity (0-3000) per module.]

Cyclomatic complexity is no better at prediction than KLOC (for either pre- or post-release faults).
Defect density vs size

Size is no indicator of defect density (this demolishes many software engineering assumptions).

[Scatterplot: defects per KLOC (0-35) against module size (0-10,000).]
Benchmarking hypotheses
Do software systems produced in similar environments
have broadly similar fault densities at similar testing and
operational phases?
Release                          Pre-release fault density   Post-release fault density
n   (sample size 140 modules)    6.6                         0.23
n+1 (sample size 246 modules)    5.93                        0.63
Evaluating Software
Engineering Technologies
through Measurement
The Uncertainty of Reliability Achievement
methods
• Software engineering is dominated by revolutionary
methods that are supposed to solve the software crisis
• Most methods focus on fault avoidance
• Proponents of methods claim theirs is best
• Adopting a new method can require a massive overhead
with uncertain benefits
• Potential users have to rely on what the experts say
Use of Measurement in Evaluating Methods
• Measurement is the only truly convincing means of
establishing the efficacy of a method/tool/technique
• Quantitative claims must be supported by empirical
evidence
We cannot rely on anecdotal evidence.
There is simply too much at stake.
Actual Promotional Claims for Formal
Methods
• Productivity gains of 250%
• Maintenance effort reduced by 80%
• Software integration time-scales cut to 1/6
What are we to make of such claims?
Formal Methods for Safety Critical Systems
• Wide consensus that formal methods must be used
• Formal methods mandatory in Def Stan 00-55
‘‘These mathematical approaches provide us with the best
available approach to the development of high-integrity
systems.’’
McDermid JA, ‘Safety critical systems: a vignette’, IEE Software Eng J, 8(1), 2-3,
1993
SMARTIE Formal Methods Study
CDIS Air
Traffic Control System
Best quantitative evidence yet to support FM
• Mixture of formally (VDM, CCS) and informally developed
modules.
• The techniques used resulted in extraordinarily high
levels of reliability (0.81 failures per KLOC).
• Little difference in total number of pre-delivery faults for
formal and informal methods (though unit testing
revealed fewer errors in modules developed using formal
techniques), but clear difference in the post-delivery
failures.
Relative sizes and changes reported for
each design type in delivered code
Design Type   Delivered   Fault-report-generated      Changes    Modules of this   Delivered modules   % delivered
              LOC         changes in delivered code   per KLOC   design type       changed             modules changed
FSM           19064       260                         13.6       67                52                  78%
VDM           61061       1539                        25.2       352               284                 81%
VDM/CCS       22201       202                         9.1        82                57                  70%
Formal        102326      2001                        19.6       501               393                 78%
Informal      78278       1644                        21.0       469               335                 71%
Code changes by design type for modules
requiring many changes
Design Type   Total modules   Modules with >5      % of modules   Modules with >10     % of modules
              changed         changes per module   changed        changes per module   changed
FSM           58              11                   16%            8                    12%
VDM           284             89                   25%            35                   19%
VDM/CCS       58              11                   13%            3                    4%
Formal        400             111                  22%            46                   9%
Informal      556             108                  19%            31                   7%
Changes Normalized by KLOC for
Delivered Code by Design Type
[Chart: code changes per quarter per KLOC of delivered code (scale 0-10), over quarters 0 to 8, for FSM, Informal, VDM and VDM/CCS designs.]
Faults discovered during unit testing
Design Type   Faults discovered   Modules of this design type   Faults per module
FSM           43                  77                            .56
VDM           184                 352                           .52
VDM/CCS       11                  83                            .13
Formal        238                 512                           .46
Informal      487                 692                           .70
Changes to delivered code as a result of
post-delivery problems
Design Type   Changes   LOC of this design type   Changes per KLOC   Modules of this design type   Changes per module
FSM           6         19064                     .31                67                            .09
VDM           44        61061                     .72                352                           .13
VDM/CCS       9         22201                     .41                82                            .11
Formal        59        102326                    .58                501                           .12
Informal      126       78278                     1.61               469                           .27
Post-delivery problems discovered in each
problem category
Category          Number of problems
1                 6
2                 6
3                 46
Specification     1
Design            1
Testing           29
Documentation     11
None assigned     47
Post-delivery problem rates reported in
the literature
Source                             Language   Failures per KLOC   Formal methods used?
Siemens operating system           Assembly   6-15                No
NAG scientific libraries           Fortran    3.00                No
CDIS air traffic control support   C          .81                 Yes
Lloyd's language parser            C          1.40                Yes
IBM cleanroom development          Various    3.40                Partly
IBM normal development             Various    30.0                No
Satellite planning study           Fortran    6-16                No
Unisys communications software     Ada        2-9                 No
SOFTWARE METRICS FOR RISK
AND UNCERTAINTY
The Classic size driven approach
• Since mid-1960’s LOC used as surrogate for different
notions of software size
• LOC used as driver in early resource prediction and
defect prediction models
• Drawbacks of LOC led to complexity metrics and function
points
• ...But approach to both defects prediction and resource
prediction remains ‘size’ driven
Predicting road fatalities
[Diagrams: the naïve model predicts the number of fatalities directly from the month; the causal/explanatory model relates month, weather conditions, road conditions, number of journeys and average speed to the number of fatalities.]
Predicting software effort
[Diagrams: the naïve model predicts effort directly from size; the causal/explanatory model relates problem complexity, size, schedule, resource quality and product quality to effort.]
Typical software/systems assessment
problem
“Is this system sufficiently reliable to ship?”
You might have:
• Measurement data from testing
• Empirical data
• Process/resource information
• Proof of correctness
• ….
None alone is sufficient
So decisions inevitably involve expert judgement
What we really need for assessment
We need to be able to incorporate:
• uncertainty
• diverse process and product information
• empirical evidence and expert judgement
• genuine cause and effect relationships
• incomplete information
We also want visibility of all assumptions
Bayesian Belief Nets (BBNs)
• Powerful graphical framework in which to reason about
uncertainty using diverse forms of evidence
• Nodes of graph represent uncertain variables
• Arcs of graph represent causal or influential relationships
between the variables
• Associated with each node is a probability table (NPT)
[Example: a small BBN with nodes A, B, C and D and node probability tables P(A|B,C), P(B|C), P(C) and P(D).]
Defects BBN (simplified)
[Graph: nodes for Problem Complexity, Design Effort, Defects Introduced, Testing Effort, Defects Detected, Residual Defects, Operational Usage and Operational Defects, linked by causal arcs.]
Bayes’ Theorem
A: 'Person has cancer'    p(A) = 0.1   (prior probability)
B: 'Person is smoker'     p(B) = 0.5
p(B|A) = 0.8              (likelihood)

What is p(A|B), the posterior probability?

    p(A|B) = p(B|A) p(A) / p(B)

So p(A|B) = 0.8 x 0.1 / 0.5 = 0.16
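The same calculation as a one-line sketch:

# Sketch: Bayes' theorem, posterior = likelihood * prior / evidence.
def posterior(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

print(posterior(p_b_given_a=0.8, p_a=0.1, p_b=0.5))   # 0.16, as on the slide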
Bayesian Propagation
• Applying Bayes theorem to update all probabilities when
new evidence is entered
• Intractable even for small BBNs
• Breakthrough in late 1980s - fast algorithm
• Tools like Hugin implement efficient propagation
• Propagation is multi-directional
• Make predictions even with missing/incomplete data
Classic approach to defect modelling
[Diagram: complexity and functionality determine the solution/problem size/complexity, and the quality of staff and tools determines the resources/process quality; both drive the predicted number of defects.]
Problems with classic defects modelling
approach
• Fails to distinguish different notions of ‘defect’
• Statistical approaches often flawed
• Size/complexity not causal factors
• Obvious causal factors not modelled
• Black box models hide crucial assumptions
• Cannot handle uncertainty
• Cannot be used for real risk assessment
[Illustration: two contrasting scenarios - many defects found pre-release and few after, versus few defects found pre-release and many after.]
Schematic of classic resource model
[Diagram: nodes for complexity, functionality, required reliability, solution/problem size, quality of staff and tools (resource quality), solution quality, and the required resources (required effort and required duration).]
Problems with classic approach to
resource prediction
• Based on historical projects which happened to be completed (but
not necessarily successful)
• Obvious causal factors not modelled, or modelled incorrectly (solution size should never be a 'driver')
• Flawed assumption that resource levels are not already fixed in
some way before estimation (i.e. cannot handle realistic contraints)
• Statistical approaches often flawed
• Black box models hide crucial assumptions
• Cannot handle uncertainty
• Cannot be used for real risk assessment
Classic approach cannot handle questions
we really want to ask
• For a problem of this size, and given these limited resources, how
likely am I to achieve a product of suitable quality?
• How much can I scale down the resources if I am prepared to put
up with a product of specified lesser quality?
• The model predicts that I need 4 people over 2 years to build a
system of this kind of size. But I only have funding for 3 people over
one year. If I cannot sacrifice quality, how good do the staff have to
be to build the systems with the limited resources?
Schematic of ‘resources’ BBN
[Diagram: a BBN whose nodes include problem size (complexity, functionality), solution size, proportion implemented, quality of staff and tools, required resources (required effort and duration), actual resources (actual effort and duration), the appropriateness of the actual resources, solution quality and solution reliability.]
“Appropriateness of resources” Subnet
[Subnet nodes: required_effort, required_duration, number_staff, actual_effort, actual_duration, appropriate_effort, appropriate_duration, appropriate_resources.]
[Tool screenshots: specific values entered for problem size; high accuracy now required; actual resources entered; actual resource quality entered.]
Software defects and resource prediction
summary
• Classical approaches:
– Mainly regression-based black-box models
– Predicted_attribute = f(size)
– Crucial assumptions often hidden
– Obvious causal factors not modelled
– Cannot handle uncertainty
– Cannot be used for real risk assessment
• BBNs provide realistic alternative approach
Conclusions: Benefits of BBNs
• Help risk assessment and decision making in a wide range of
applications
• Model cause-effect relationships, uncertainty
• Incorporate expert judgement
• Combine diverse types of information
• All assumptions visible and auditable
• Ability to forecast with missing data
• Rigorous, mathematical semantics
• Good tool support
References
• Fenton NE and Pfleeger SL, ‘Software Metrics: A Rigorous &
Practical Approach’
• Software metrics slides by Ivan Bruha