From the Editor
Editor in Chief: Steve McConnell, Construx Software (software@construx.com)
IEEE Software, March/April 2000

Cargo Cult Software Engineering
In the South Seas there is a cargo cult of people. During the war they saw airplanes with lots of good materials, and they want the same thing to happen now. So they've arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head for headphones and bars of bamboo sticking out like antennas -- he's the controller -- and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they're missing something essential, because the planes don't land.
-- Richard Feynman, in Surely You're Joking, Mr. Feynman!, W. W. Norton & Company, New York, reprint ed., 1997
I find it useful to draw a contrast between two different organizational development styles: process-oriented and commitment-oriented development.
Process-oriented development achieves
its effectiveness through skillful planning, carefully defined processes, efficient
use of available time, and skillful application of software engineering best practices.
(Copyright © 2000 Steven C. McConnell. All Rights Reserved.)
This style of development succeeds because
the organization that uses it is constantly
improving. Even if its early attempts are ineffective, steady attention to process means
each successive attempt will work better
than the previous one.
Commitment-oriented development goes
by several names, including hero-oriented
development and individual empowerment.
Commitment-oriented organizations are
characterized by hiring the best possible
people; asking them for total commitment
to their projects; empowering them with
nearly complete autonomy; motivating them
to an extreme degree; and then seeing that
they work 60, 80, or 100 hours a week until the project is finished. Commitment-oriented development derives its potency
from its tremendous motivational ability;
study after study has found that individual
motivation is by far the largest single contributor to productivity. Developers make
voluntary, personal commitments to the
projects they work on, and they often go to
extraordinary lengths to make their projects
succeed.
Organizational Imposters
When used knowledgeably, either development style can produce high-quality software economically and quickly. However,
both development styles have pathological
look-alikes that don’t work nearly as well
and that can be difficult to distinguish from
the genuine articles.
The process-imposter organization bases
its practices on a slavish devotion to process
for process’s sake. These organizations look at process-oriented organizations, such as NASA’s Software Engineering Laboratory and IBM’s former Federal Systems Division, and
observe that those organizations generate lots of documents and hold frequent meetings. The imposters conclude that if they generate an equivalent number of documents and hold a
comparable number of meetings, they
will be similarly successful. If they
generate more documentation and
hold more meetings, they will be even
more successful! But they don’t understand that the documentation and
the meetings are not responsible for
the success; these are the side effects
of a few specific, effective processes. I
call these organizations bureaucratic
because they put the form of software
processes above the substance. Their
misuse of process is demotivating,
which hurts productivity. And they’re
not very enjoyable to work for.
The commitment-imposter organization focuses primarily on motivating people to work long hours. These
organizations look at successful companies such as Microsoft and observe
that they generate very little documentation, offer stock options to employees, and then require mountains of
overtime. They conclude that if they,
too, minimize documentation, offer
stock options, and require extensive
overtime, they will be successful. The
less documentation and the more
overtime the better! But these organizations miss the fact that Microsoft
and other successful commitment-oriented companies don’t require overtime. They hire people who love to
create software. They team these people with other people who love to create software just as much as they do.
They provide lavish organizational
support and rewards for creating software. And then they turn them loose.
The natural outcome is that software
developers and managers choose to
work long hours voluntarily. Imposter
organizations confuse the effect (long
hours) with the cause (high motivation). I call the imposter organizations
sweatshops because they emphasize
working hard rather than working
smart, and they tend to be chaotic and
ineffective. They’re not very enjoyable
to work for, either.
Cargo Cult Organizations
At first glance, these two kinds of
imposter organizations appear to be
exact opposites. One is incredibly bureaucratic, and the other is incredibly
chaotic. But one key similarity is actually more important than their superficial differences: Neither is very
effective because neither understands
what really makes its projects succeed
or fail. They go through the motions
of looking like effective organizations
that are stylistically similar. But without any real understanding of why
the practices work, they are essentially just sticking pieces of bamboo
in their ears and hoping their projects
will land safely. Many of their projects end up crashing, because these
are just two different varieties of
cargo cult software engineering, similar in their lack of understanding of
what makes software projects work.
Cargo cult software engineering is
easy to identify. Its engineer proponents justify their practices by saying,
“We’ve always done it this way in the
past,” or “Our company standards require us to do it this way”—even
when those ways make no sense. They
refuse to acknowledge the trade-offs
involved in either process-oriented or
commitment-oriented development.
Both have strengths and weaknesses.
When presented with more effective,
new practices, cargo cult software engineers prefer to stay in their wooden
huts of familiar, comfortable, and
not necessarily effective work habits.
“Doing the same thing again and
again and expecting different results is
a sign of insanity,” the old saying
goes. It’s also a sign of cargo cult software engineering.
The Real Debate
In this magazine and in many
other publications, we spend our time
debating whether process is good or
individual empowerment (in other
words, commitment-oriented development) might be better. This is a
false dichotomy. Process is good, and
so is individual empowerment. The
two can exist side by side. Processoriented organizations can ask for an
extreme commitment on specific projects. Commitment-oriented organizations can use software engineering
practices skillfully.
The difference between these two
approaches really comes down to differences of style and personality. I have
worked on several projects of each
style and have liked different things
about each style. Some developers enjoy working methodically on an 8-to-5
schedule, which is more common in
process-oriented companies. Other developers enjoy the focus and excitement that comes with making a 24 × 7
commitment to a project. Commitment-oriented projects are more exciting on average, but a process-oriented
project can be just as exciting when it
has a well-defined and inspiring mission. Process-oriented organizations
seem to degenerate into their pathological look-alikes less often than commitment-oriented organizations do,
but either style can work well if it is
skillfully planned and executed.
The fact that both project styles
have pathological look-alikes has
muddied the debate. Some projects
conducted in each style succeed, and
some fail. That lets a process advocate point to process successes and
commitment failures and claim that
process is the key to success. It lets
the commitment advocate do the
same thing, in reverse.
The issue that has fallen by the
wayside while we’ve been debating
is so blatant that, like Edgar Allan Poe's purloined letter, we have
overlooked it. We should not be debating process versus commitment;
we should be debating competence
versus incompetence. The real difference is not which style we
choose, but what education, training, and understanding we bring to
bear on the project. Rather than
sticking with the old, misdirected
debate, we should look for ways to
raise the average level of developer
and manager competence. That will
improve our chances of success regardless of which development style
we choose.
Syntax, Semantics, Micronesian cults and Novice Programmers | Microsoft Docs
03/01/2004 • 6 minutes to read
I've had this idea in me for a long time now that I've been struggling with getting out
into the blog space. It has to do with the future of programming, declarative languages,
Microsoft's language and tools strategy, pedagogic factors for novice and experienced
programmers, and a bunch of other stuff. All these things are interrelated in some fairly
complex ways. I've come to the realization that I simply do not have time to organize
these thoughts into one enormous essay that all hangs together and makes sense. I'm
going to do what blogs do best -- write a bunch of (comparatively!) short articles each
exploring one aspect of this idea. If I'm redundant and prolix, so be it.
Today I want to blog a bit about novice programmers. In future essays, I'll try to tie that
into some ideas about the future of pedagogic languages and languages in general.
I'd appreciate your feedback on whether this makes sense or whether it's a bunch of useless theoretical posturing. I'd also appreciate your feedback on what you think are the vital concepts that you had to grasp when you were learning to program, and what you stress when you mentor new programmers.
An intern at another company wrote me recently to say "I am working on a project for an
internship that has lead me to some scripting in vbscript. Basically I don't know what I am
doing and I was hoping you could help. " The writer then included a chunk of script and a
feature request. I've gotten requests like this many times over the years; there are a lot
of novice programmers who use script, for the obvious reason that we designed it to be
appealing to novices.
Well, as I wrote last Thursday, there are times when you want to teach an intern to fish,
and times when you want to give them a fish. I could give you the line of code that
implements the feature you want. And then I could become the feature request server
for every intern who doesn't know what they're doing… nope. Not going to happen.
Sorry. Down that road lies cargo cult programming, and believe me, you want to avoid
that road.
What's cargo cult programming? Let me digress for a moment. The idea comes from a
true story, which I will briefly summarize:
During the Second World War, the Americans set up airstrips on various tiny islands in
the Pacific. After the war was over and the Americans went home, the natives did a
perfectly sensible thing -- they dressed themselves up as ground traffic controllers and
waved those sticks around. They mistook cause and effect -- they assumed that the guys
waving the sticks were the ones making the planes full of supplies appear, and that if
only they could get it right, they could pull the same trick. From our perspective, we
know that it's the other way around -- the guys with the sticks are there because the planes need them to land. No planes, no guys.
The cargo cultists had the unimportant surface elements right, but did not see enough of the whole picture to succeed. They understood the form but not the substance. There are lots of cargo cult programmers -- programmers who understand what the code does, but not how it does it. Therefore, they cannot make meaningful changes to the program. They tend to proceed by making random changes, testing, and changing again until they manage to come up with something that works.
(Incidentally, Richard Feynman wrote a great essay on cargo cult science. Do a web
search, you'll find it.)
Beginner programmers: do not go there! Programming courses for beginners often
concentrate heavily on getting the syntax right. By "syntax" I mean the actual letters and
numbers that make up the program, as opposed to "semantics", which is the meaning of
the program. As an analogy, "syntax" is the set of grammar and spelling rules of English,
"semantics" is what the sentences mean. Now, obviously, you have to learn the syntax of
the language -- unsyntactic programs simply do not run. But what they don't stress in
these courses is that the syntax is the easy part; it's the semantics that counts. The cargo cultists had the syntax -- the formal outward appearance -- of an airstrip down cold, but they sure got the semantics wrong.
To make some more analogies, it's like playing chess. Anyone can learn how the pieces move. Playing a game where the strategy makes sense is the hard (and interesting) part.
Every VBScript statement has a meaning.
Passing the
right arguments in the right order will come with practice, but getting the meaning right
requires thought. You will eventually find that some programming languages have nice
syntax and some have irritating syntax, but that it is largely irrelevant. It doesn't matter
whether I'm writing a program in VBScript, C, Modula3 or Algol68 -- all these languages
have different syntaxes, but very similar semantics.
You also need to understand and use abstractions. High-level languages like VBScript already give you a huge amount of abstraction away from the underlying hardware and make it easy to do even more abstract things.
Beginner programmers often do not understand what abstraction is. Here's a silly
example. Suppose you needed for some reason to compute 1 + 2 + 3 + .. + n for some
integer n. You could write a program like this:
n = InputBox("Enter an integer")
Sum = 0
For i = 1 To n
Sum = Sum + i
Next
MsgBox Sum
Now suppose you wanted to do this calculation many times. You could replicate the middle four lines over and over again in your program, or you could abstract the operation into a function:
Function Sum(n)
Sum = 0
For i = 1 To n
Sum = Sum + i
Next
End Function
n = InputBox("Enter an integer")
MsgBox Sum(n)
That is abstraction as a labor-saving device -- you can write up routines that make your code look cleaner because you have less duplication. But that is not the real power of abstraction. The power of abstraction is that it hides the implementation details from the callers. One day you realize that your sum function is inefficient, and you can use
Gauss's formula instead. You throw away your old implementation and replace it with
the much faster:
Function Sum(n)
Sum = n * (n + 1) / 2
End Function
The code which calls the function doesn't need to be changed. If you had not abstracted
this operation away, you'd have to change all the places in your code that used the old
algorithm.
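The same point can be sketched outside VBScript. Here is a small Python version (mine, not the post's) of the two interchangeable implementations; callers are unaffected by swapping one for the other:

```python
def sum_loop(n):
    # First implementation: the straightforward O(n) loop.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_gauss(n):
    # Second implementation: Gauss's closed form, O(1).
    # Either n or n + 1 is even, so the product is always divisible by 2.
    return n * (n + 1) // 2

# The abstraction boundary: any caller of "sum" works with either body.
for n in (1, 10, 100):
    assert sum_loop(n) == sum_gauss(n)
```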
A study of the history of programming languages reveals that we've been moving steadily towards languages which support more and more powerful abstractions. Machine language abstracts the electrical signals in the machine, allowing you to program with numbers. Assembly language abstracts the numbers into opcodes and addresses. C abstracts the opcodes into higher concepts like loops and functions. C++ abstracts even farther by allowing variables to refer to objects, which contain both data and code. XAML abstracts away the notion of a class by providing a markup language for object relationships.
To sum up, Eric's advice for novice programmers is: concentrate on the meaning of programs -- the semantics and the abstractions -- not just the syntax. The rest is just practice.
February 29, 2004
My next piece of advice would be: Learn to use your debugger.
I see it so often on message boards where a novice's code isn't working right, and they
Werk
Titel: Numerische Mathematik
Verlag: Springer Verlag
Jahr: 1969
Kollektion: Mathematica
Digitalisiert: Niedersächsische Staats- und Universitätsbibliothek Göttingen
Werk Id: PPN362160546_0013
PURL: http://resolver.sub.uni-goettingen.de/purl?PPN362160546_0013
LOG Id: LOG_0038
LOG Titel: Gaussian Elimination is not Optimal.
LOG Typ: article
Übergeordnetes Werk
Werk Id: PPN362160546
PURL: http://resolver.sub.uni-goettingen.de/purl?PPN362160546
Adaptive Strassen's Matrix Multiplication
Paolo D'Alberto, Yahoo!, Sunnyvale, CA (pdalbert@yahoo-inc.com)
Alexandru Nicolau, Dept. of Computer Science, University of California Irvine (nicolau@ics.uci.edu)
ABSTRACT
Strassen's matrix multiplication (MM) has benefits with respect to any (highly tuned) implementation of MM because Strassen's algorithm reduces the total number of operations. Strassen achieved this operation reduction by replacing computationally expensive MMs with matrix additions (MAs). For architectures with simple memory hierarchies, having fewer operations directly translates into an efficient utilization of the CPU and, thus, faster execution. However, for modern architectures with complex memory hierarchies, the operations introduced by the MAs have limited in-cache data reuse and thus poor memory-hierarchy utilization, thereby overshadowing the (improved) CPU utilization and making Strassen's algorithm (largely) useless on its own.
In this paper, we investigate the interaction between Strassen's effective performance and the memory-hierarchy organization. We show how to exploit Strassen's full potential across different architectures. We present an easy-to-use adaptive algorithm that combines a novel implementation of Strassen's idea with the MM from automatically tuned linear algebra software (ATLAS) or GotoBLAS. An additional advantage of our algorithm is that it applies to matrices of any size and shape and works equally well with row or column major layout. Our implementation consists of introducing a final step in the ATLAS/GotoBLAS installation process that estimates whether or not we can achieve any additional speedup using our Strassen adaptation algorithm. Then we install our codes, validate our estimates, and determine the specific performance.
We show that, by the right combination of Strassen's algorithm with ATLAS/GotoBLAS, our approach achieves up to 30%/22% speedup versus ATLAS/GotoBLAS alone on modern high-performance single processors. We consider and present the complexity and the numerical analysis of our algorithm, and, finally, we show performance for 17 (uniprocessor) systems.

Categories and Subject Descriptors: G.4 [Mathematics of Computing]: Mathematical Software; D.2.8 [Software Engineering]: Metrics (complexity measures, performance measures); D.2.3 [Software Engineering]: Coding Tools and Techniques (top-down programming)
General Terms: Algorithms
Keywords: Matrix Multiplication, Fast Algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS'07, June 18-20, 2007, Seattle, WA, USA. Copyright 2007 ACM 978-1-59593-768-1/07/0006 ...$5.00.
1. INTRODUCTION
In the last 30 years, the complexity of processors on a chip has closely followed Moore's law; that is, the number of transistors per chip doubles every 18 months. Unfortunately, the steady increase of processor integration does not always result in a proportional increase of the system performance on a given application
(program); that is, the same application does not double its performance when it runs on a new state-of-the-art architecture every 18
months. In fact, the performance of an application is the result of an
intricate synergy between the two constituent parts of the system:
on one side, the architecture composed of processors, memory hierarchy and devices, and, on the other side, the code composed of
algorithms, software packages and libraries steering the computation on the hardware. When the architecture evolves, the code must
adapt so that the system can deliver peak performance.
Our main interest is the design and implementation of codes that
embrace the architecture evolution. We want to write efficient and
easy to maintain codes, which any developer can use for several
generations of architectures. Adaptive codes attempt to provide
just that. In fact, they are an effective solution for the efficient utilization of (and portability across) complex and always-changing
architectures (e.g., [16, 11, 30, 19]). In this paper, we discuss a
single but fundamental kernel in dense linear algebra: matrix multiply (MM) for any size and shape matrices stored in double precision and in standard row or column major layout [20, 15, 14, 33, 3,
18].
We extend Strassen’s algorithm to deal with rectangular and
arbitrary-size matrices so as to exploit better data locality and number of operations, and thus performance, than previously proposed
versions [22]. We consider the performance effects of Strassen’s
applied to rectangular matrices directly (i.e., exploiting fewer operations) or, after a cache-oblivious problem division, to (almost)
square matrices only (i.e., exploiting data locality). We show that
for some architectures, the latter can outperform the former (in contrast to what was estimated by Knight [25]), and we show that both
must build on top of highly efficient O(n^3) MMs based on the
state-of-the-art adaptive software packages such as ATLAS [11]
and hand tuned packages such as GotoBLAS [18]. In fact, we
show that choosing the right combination of our algorithm with
these highly tuned MMs, we can achieve an execution-time reduction up to 30% and 22% when comparet to using alone ATLAS
problem, on which Strassen can be applied, and into (extremely irregular) subproblems deploying matrix-by-vector
and vector-by-vector computations.
and GotoBLAS respectively. We present an extensive quantitative
evaluation of our algorithm performance for a large set of different architectures to demonstrate that our approach is beneficial and
portable. We discuss also the numerical stability of our algorithm
and its practical error evaluation.
The paper is organized as follows. In Section 2, we discuss the
related work and we highlight our contributions. In Section 3, we
present our algorithm and discuss its practical numerical stability
and complexity. In Section 4, we present our experimental results;
in particular, in Section 4.1, we discuss the performance for any
matrix shape and, in Section 4.2, for square matrices only. We
conclude in Section 5.
2.
2. Our algorithm applies Strassen’s strategy recursively as
many time as a function of the problem size. If the problem
size is large enough, the algorithm has a recursion depth that
goes as deep as there is any performance advantage. This is
in contrast to the approach in [22] where Strassen’s strategy
is applied just once. Furthermore, we determine the recursion point empirically by micro-benchmarking at installation
time; unlike as in [32], where Strassen’s algorithm is studied
in isolation without any performance comparison with highperformance MM.
RELATED WORK
Strassen’s algorithm is the first and the most used fast algorithm
(i.e., breaking the O(n3 ) operation count). Strassen discovered that
the original recursive algorithm of complexity O(n3 ) can be reorganized in such a way that one computationally expensive recursive MM step can be replaced with 18 cheaper matrix additions
(MA). As a result, Strassen’s algorithm has (asymptotically) fewer
operations (i.e., multiplications and additions) O(n2.86 ). Another
variant is by Winograd; Winograd replaced one MM with 15 MAs
and improved Strassen’s complexity by a constant factor. Our approach can be applied to the Winograd’s variant as well; however,
Winograd’s algorithm is beyond the scope of this paper and it will
be investigated separately and in the future.
In practice however for small matrices, Strassen’s has a significant overhead and a conventional MM results in better performance. To overcome this, several authors have shown hybrid algorithms; that is, deploying Strassen’s MM in conjunction with
conventional MM [5, 4, 20], where for a specific problem size n1 ,
or recursion point [22], Strassen’s algorithm yields the computation to the conventional MM implementations. 1 With the deployment of modern and faster architectures (with fast CPUs and
relatively slow memory), Strassen’s has performance appeal for
larger and larger problems, undermining its practical benefits. In
other words, the evolution of modern architectures —i.e., capable
of solving large problems fast— presents a scenario where the recursion point has started increasing [2]. However, the demand for
solving larger and larger problems has increased together with the
development of modern architectures; this sheds a completely new
light on what/when a problem size is actually practical, making
Strassen’s approach extremely powerful. Given a modern architecture, finding for what problem sizes Strassen’s is beneficial has
never been so compelling.
Our approach has three contributions/advantages with respect to previous approaches.

1. Our algorithm divides the MM problem into a set of balanced subproblems without any matrix padding or peeling, so it achieves a balanced workload and predictable performance. This balanced division leads to a cleaner algorithm formulation, an easier parallelization, and little or no work in combining the solutions of the subproblems (conquer step). This strategy differs from the division process proposed by Huss-Lederman et al. [22, 20, 1], leading to fewer operations and better data locality. In that approach [22], the problem division is a function of the matrix sizes such that, for odd matrix sizes, the problem is divided into a large even-size

¹ Thus, for a problem of size n ≤ n1, this hybrid algorithm is the conventional MM; for every matrix size n ≥ n1, the hybrid algorithm is faster because it applies Strassen's strategy and thus exploits all its performance benefits.
3. We store matrices in standard row- or column-major format and, at any time, we can yield control to a highly tuned MM such as ATLAS's DGEMM without any overhead. Thus, we can use our algorithm in combination with these highly tuned MM routines with no modifications or change-of-layout overheads (i.e., estimated as 5–10% of the total execution time [32]).
In the literature, there are other fast MM algorithms. For example, Pan showed a bilinear algorithm, O(n^2.79), that is asymptotically faster than Strassen-Winograd [27], and he presented a survey of the topic [28] with best bound O(n^2.49). A practical implementation of Pan's algorithm is presented by Kaporin [23, 24]. New approaches have emerged recently, which promise to be practical and numerically stable [7, 12]. The asymptotically fastest algorithm to date is by Coppersmith and Winograd [8], O(n^2.376).
3. BALANCED MATRIX MULTIPLICATION
In this section, we introduce our algorithm, which is a composition of three different layers/algorithms: at the top level (in this section), we use a cache-oblivious algorithm [17] so as to reduce the problem to almost square matrices; at the middle level (Section 3.1), we deploy Strassen's algorithm to reduce the computation work; at the lower level, we deploy ATLAS's [33] or GotoBLAS's [18] MM to exploit the architecture's characteristics. In the following, we briefly introduce our notation and then our algorithms.
We identify the size of a matrix A ∈ M_{m×n} as σ(A)=m×n, where m is the number of rows and n the number of columns of the matrix A. Matrix multiplication is defined for operands of sizes σ(C)=m×p, σ(A)=m×n and σ(B)=n×p, and identified as C=AB, where the component c_{i,j} at row i and column j of the result matrix C is defined as c_{i,j} = Σ_{k=0}^{n−1} a_{i,k} b_{k,j}.

In this paper, we use a simplified notation to identify submatrices. We choose to divide a matrix M logically into at most four submatrices; we label them so that M0 is the first and the largest submatrix, M2 is logically beneath M0, M1 is on the right of M0, and M3 is beneath M1 and to the right of M2 (see Figure 1 for how to divide a matrix into two submatrices, and into four). This notation is taken from [6].
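This labeling can be sketched in plain Python; `quadrants` is a hypothetical helper, taking the ceiling split for the first block so that M0 is the largest submatrix, matching the σ sizes given later in Section 3.1.

```python
def quadrants(M):
    """Split a matrix (list of lists) into the four logical submatrices
    of the paper's notation: M0 is the (largest) top-left block,
    M1 is to its right, M2 beneath M0, M3 beneath M1."""
    m, n = len(M), len(M[0])
    hm, hn = (m + 1) // 2, (n + 1) // 2   # ceil(m/2), ceil(n/2)
    M0 = [row[:hn] for row in M[:hm]]
    M1 = [row[hn:] for row in M[:hm]]
    M2 = [row[:hn] for row in M[hm:]]
    M3 = [row[hn:] for row in M[hm:]]
    return M0, M1, M2, M3
```

For a 5×7 matrix this yields blocks of sizes 3×4, 3×3, 2×4 and 2×3, i.e. ⌈m/2⌉/⌊m/2⌋ rows by ⌈n/2⌉/⌊n/2⌋ columns.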
In Table 1, we present the framework of our cache-oblivious algorithm, which we identify as Balanced MM. This algorithm divides the problem and specifies the operands in such a way that the subproblems have a balanced workload and similar operand shapes and sizes. In fact, this algorithm reduces the problem to almost square matrices; that is, A is almost square if m ≤ γ_m·n or n ≤ γ_n·m, and γ is usually equal to 2. We aim at the efficient utilization of the higher levels of the memory hierarchy; that is, the memory pages, often organized using a small and fully associative cache, the translation look-aside buffer (TLB). This formulation has been proven (asymptotically) optimal [17] in the number of misses. We present detailed performance of this strategy for any matrix size and for three systems in Section 4.

Table 1: Balanced Matrix Multiplication C=AB with σ(A)=m×n and σ(B)=n×p

  Condition                            Computation       Operand Sizes
  if m ≥ γ_m·max(n, p)                 C0=A0B            σ(A0)=⌈m/2⌉×n, σ(C0)=⌈m/2⌉×p,
    (γ_m=2, A is tall)                 C2=A2B            σ(A2)=⌊m/2⌋×n, σ(C2)=⌊m/2⌋×p
  else if p ≥ γ_p·max(m, n)            C0=AB0            σ(C0)=m×⌈p/2⌉, σ(B0)=n×⌈p/2⌉,
    (γ_p=2, B is long)                 C1=AB1            σ(C1)=m×⌊p/2⌋, σ(B1)=n×⌊p/2⌋
  else if n ≥ γ_n·max(m, p)            C=A0B0            σ(A0)=m×⌈n/2⌉, σ(B0)=⌈n/2⌉×p,
    (γ_n=2, B is tall and A is long)   C=C + A1B2        σ(A1)=m×⌊n/2⌋, σ(B2)=⌊n/2⌋×p
  else (A, B and C almost square)      Strassen C=A∗sB   see Section 3.1
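The driver of Table 1 can be sketched as the following recursion; this is a simplified illustration, with np.dot standing in for the Strassen/HASA layer that the paper hands the almost-square leaves to.

```python
import numpy as np

def balanced_mm(A, B, gamma=2, base=np.dot):
    """Sketch of the Table-1 cache-oblivious driver: halve the most
    elongated dimension (ceiling/floor split) until C, A and B are
    almost square, then hand the leaf off to `base`."""
    m, n = A.shape
    p = B.shape[1]
    if m >= gamma * max(n, p):            # A is tall: split rows of A and C
        hm = (m + 1) // 2
        return np.vstack([balanced_mm(A[:hm], B, gamma, base),
                          balanced_mm(A[hm:], B, gamma, base)])
    if p >= gamma * max(m, n):            # B is long: split columns of B and C
        hp = (p + 1) // 2
        return np.hstack([balanced_mm(A, B[:, :hp], gamma, base),
                          balanced_mm(A, B[:, hp:], gamma, base)])
    if n >= gamma * max(m, p):            # inner dimension: split and accumulate
        hn = (n + 1) // 2
        return (balanced_mm(A[:, :hn], B[:hn], gamma, base)
                + balanced_mm(A[:, hn:], B[hn:], gamma, base))
    return base(A, B)                     # almost square: leaf MM
```

Each split is a ceiling/floor halving, so no padding or peeling is ever introduced and the two subproblems differ in size by at most one row or column.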
This technique breaks the general problem down into small and regular problems to better exploit data locality in the memory hierarchy. Knight [25] presented evidence that this approach is not optimal in the number of operations; for example, applying Strassen's decomposition directly to the rectangular matrices should achieve better performance. In Section 4.1, we address this issue quantitatively and show that we must exploit data locality as much as the operation reduction.

In the following section (Section 3.1), we describe our version of Strassen's algorithm for any (almost square) matrices, thereby completing the algorithm specified in Table 1. We call it the hybrid ATLAS/GotoBLAS–Strassen algorithm (HASA).
3.1 Hybrid ATLAS/GotoBLAS–Strassen algorithm (HASA)

In this section, we present our generalization of Strassen's MM algorithm. Our algorithm reduces the number of passes through the data as well as the number of computations because of a balanced division process. In practice, this algorithm is more efficient than previous approaches [31, 22, 32]. The algorithm applies to matrices of any size (and, thus, shape) rather than only to square matrices as in [9, 10].

Figure 1: Logical decomposition of matrices in sub-matrices (C, A and B are each partitioned into four quadrants C0–C3, A0–A3, B0–B3, with C = A∗B)

Matrices C, A and B in the MM computation are composed of four balanced sub-matrices (see Figure 1). Consider the operand matrix A with σ(A)=m×n. Now, A is logically composed of four matrices: A0 with σ(A0)=⌈m/2⌉×⌈n/2⌉, A1 with σ(A1)=⌈m/2⌉×⌊n/2⌋, A2 with σ(A2)=⌊m/2⌋×⌈n/2⌉, and A3 with σ(A3)=⌊m/2⌋×⌊n/2⌋ (similarly, B is composed of four submatrices, where σ(B0)=⌈n/2⌉×⌈p/2⌉).

We generalized Strassen's algorithm so as to compute the MM regardless of matrix size, as shown in Table 2. Notice that, by dividing a matrix into almost square submatrices as described previously, we can achieve a balanced division into subcomputations. However, now we do not have submatrices of the same sizes. In such a scenario, we defined the MA between matrices so that we have a fully functional recursive algorithm; that is, we reduce the computation to 7 well-defined MMs, where the left-operand columns match the right-operand rows, and MAs are between almost square matrices, as follows.

Consider a matrix addition X=Y + Z (subtraction is similar). Intuitively, when the resulting matrix X is larger than the addenda Y or Z, the computation is performed as if the smaller operands are extended and padded with zeros. Otherwise, if the result matrix is smaller than the operands, the computation is performed as if the larger operands are cropped to fit the result matrix. Formally, X=Y + Z is defined so that σ(X)=m×n, σ(Y)=p×q, σ(Z)=r×s and x_{i,j}=f(i,j) + g(i,j), where:

  f(i,j) = y_{i,j} if 0 ≤ i < p ∧ 0 ≤ j < q, and 0 otherwise
  g(i,j) = z_{i,j} if 0 ≤ i < r ∧ 0 ≤ j < s, and 0 otherwise

Table 2: HASA C=A∗sB with σ(A)=m×n and σ(B)=n×p

  Computation                              Operand Sizes
  if RecursionPoint(A,B)                   (e.g., max(m, n, p) < 100)
    ATLAS/GotoBLAS C=A∗aB                  (Solve directly)
  else {                                   (Divide et impera)
    T2=B1−B3, M3=A0∗sT2,                   σ(T2)=⌈n/2⌉×⌊p/2⌋,
      C1=M3, C3=M3                         σ(M3)=⌈m/2⌉×⌊p/2⌋
    T1=A2−A0, T2=B0+B1, M6=T1∗sT2,         σ(T1)=⌈m/2⌉×⌈n/2⌉, σ(T2)=⌈n/2⌉×⌈p/2⌉,
      C3=C3+M6                             σ(M6)=⌈m/2⌉×⌈p/2⌉
    T1=A2+A3, M2=T1∗sB0,                   σ(T1)=⌊m/2⌋×⌈n/2⌉,
      C2=M2, C3=C3−M2                      σ(M2)=⌊m/2⌋×⌈p/2⌉
    T1=A0+A3, T2=B0+B3, M1=T1∗sT2,         σ(T1)=⌈m/2⌉×⌈n/2⌉, σ(T2)=⌈n/2⌉×⌈p/2⌉,
      C0=M1, C3=C3+M1                      σ(M1)=⌈m/2⌉×⌈p/2⌉
    T1=A0+A1, M5=T1∗sB3,                   σ(T1)=⌈m/2⌉×⌊n/2⌋,
      C0=C0−M5, C1=C1+M5                   σ(M5)=⌈m/2⌉×⌊p/2⌋
    T1=A1−A3, T2=B2+B3, M7=T1∗sT2,         σ(T1)=⌈m/2⌉×⌊n/2⌋, σ(T2)=⌊n/2⌋×⌈p/2⌉,
      C0=C0+M7                             σ(M7)=⌈m/2⌉×⌈p/2⌉
    T2=B2−B0, M4=A3∗sT2,                   σ(T2)=⌊n/2⌋×⌈p/2⌉,
      C0=C0+M4, C2=C2+M4                   σ(M4)=⌊m/2⌋×⌈p/2⌉
  }

Correctness. To reduce the total number of multiplications, Strassen's algorithm implicitly computes some products that are necessary and some that are artificial and must be carefully removed from the final result. For example, the product A0B0, which is a term of M1, is a singular product and is required for the computation of C0; in contrast, A0B3 is an artificial product,² computed in the same expression, because of the way the algorithm adds submatrices of A and B together in preparation for the recursive MM. Every artificial product must be removed (i.e., subtracted) by combining the different MAs (e.g., M1 + M5) so as to achieve the final correct result. By construction, Strassen's algorithm is correct; here, the only concern is our adaptation to any matrix size: First, we note that all MMs are well defined —i.e., the number of columns of the left operand is equal to the number of rows of the right operand. Second, all singular products are correctly computed and added. Third, all artificial products are correctly computed and removed.
3.2 Numerical considerations
The classic MM has component-wise and norm-wise error bounds as follows:

  Component-wise: |C − Ċ| ≤ n·u·|A||B| + O(u²),
  Norm-wise: ‖C − Ċ‖ ≤ n²·u·‖A‖‖B‖ + O(u²),
where we identify the exact result with C and the computed solution with Ċ, |A| has components |a_{i,j}|, and ‖A‖ = max_{ij} |a_{ij}|. For this algorithm, we can apply the same numerical analysis used for Strassen's algorithm. Strassen's algorithm has been proved weakly stable. Brent and Higham [5, 20] showed that the stability of the algorithm worsens as the depth of the recursion increases:

  ‖C − Ċ‖ ≤ [(n/n1)^{log₂ 12}·(n1² + 5n1) − 5n]·u·‖A‖‖B‖ + O(u²)
           ≤ 3^ℓ·n²·u·‖A‖‖B‖ + O(u²)    (1)
Again, n1 is the size at which Strassen's algorithm yields to the usual MM, and u is the inherent floating point precision. We denote with ℓ the recursion depth; that is, the number of times we divide the problem. This is a weaker error bound (a norm-wise bound) than the one for the classic algorithm (a component-wise bound), and it is also a pessimistic estimate in practice. Demmel and Higham [13] have shown that, in practice, fast MMs can lead to fast and accurate results when they are used in the blocked computations of the LAPACK routines.

Here, we follow the same procedure used by Higham [21] to quantify the error for our algorithm and for large problem sizes, so as to investigate the effect of the error in double precision (Figure 2).
Input. We restrict the input matrix values to specific intervals: [−1, 1] and [0, 1]. We then initialize the input matrices using a uniformly distributed random number generator.
Reference Doubly Compensated Summation. We consider the output of the computation C = AB. We compute every element c_{ij} independently: we first perform the dot product of the row vector a_{i∗} by the column vector b_{∗j}, storing the temporary products z_k = a_{ik}b_{kj} into a temporary vector z. Then, we sort the vector in decreasing order, such that |z_i| ≥ |z_j| for i < j, using the adaptive sorting library [26]. Finally, we add the vector components so as to compute the reference output using Priest's doubly compensated summation (DCS) procedure [29] in extended double precision, as described in [21]. This is our baseline or ultimate reference.
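A sketch of Priest's doubly compensated summation follows; it assumes, as the procedure above requires, that the inputs are already sorted by decreasing magnitude (here in plain double precision rather than the extended precision used for the paper's baseline).

```python
def dcs(values):
    """Priest's doubly compensated summation (sketch).

    `values` must be pre-sorted by decreasing magnitude; the carry c
    re-injects the rounding error of each partial sum."""
    s, c = values[0], 0.0
    for x in values[1:]:
        y = c + x
        u = x - (y - c)       # rounding error of c + x
        t = y + s
        v = y - (t - s)       # rounding error of y + s
        z = u + v
        s = t + z
        c = z - (s - t)       # carry for the next step
    return s
```

A small demonstration of why compensation matters: naive left-to-right accumulation can lose small addends entirely, while DCS on the magnitude-sorted values recovers them.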
Comparison. We compare the output-value difference (w.r.t. the DCS-based MM) of the ATLAS algorithm, HASA (the Strassen algorithm using ATLAS's MM), and the naive row-by-column algorithm (RBC) using an accumulator register, for which the summation is not compensated and the values are in the original order.

² The product A0B3 cannot be computed directly because it is not defined in general: columns(A0) > rows(B3), and thus we should pad B3 or crop A0 properly.

Figure 2: Pentium 4 3.2GHz, maximum error estimation w.r.t. DCS for inputs in [−1,+1] and [0,+1]: recursion point n1=900; 1 recursion for 900≤n≤1800, 2 recursions for 1800≤n≤3600, and 3 recursions for 3600≤n
Considerations. In Figure 2, we show the error evaluation w.r.t. the DCS MM for square matrices only. The error ratio of HASA over ATLAS is no larger than 10 for both ranges [0, 1] and [−1, 1]. These error ratios are less dramatic than what an upper-bound analysis would suggest: each recursion increases the error roughly as 2^ℓ instead of 3^ℓ and, in practice, by no more than a factor of 10 (one precision digit w.r.t. the 16 available).
3.3 Recursion Point and Complexity
Strassen’s algorithm embodies two different locality properties
because its two basic computations exploit different data locality:
matrix multiply MM has spatial and temporal locality, and matrix
addition MA has only spatial locality. In this section, we describe
how these data reuses affect the algorithm performance and thus
our strategy.
The 7 MMs take 2π(⌈m/2⌉⌈n/2⌉p + mn⌈p/2⌉ + ⌈m/2⌉⌊n/2⌋⌈p/2⌉) seconds, where π is the efficiency of MM —i.e., pi for product— and 1/π is simply the floating point operations per second (FLOPS) of the computation.

The 22 MAs (18 matrix additions and 4 matrix copies) take α[5⌈n/2⌉(⌈m/2⌉ + ⌈p/2⌉) + 3mp] seconds (see Table 2), where α is the efficiency of MA —i.e., alpha for addition. For example, we perform five MAs on submatrices of A: three additions with A0 (3⌈m/2⌉⌈n/2⌉), one with A1 (⌈m/2⌉⌊n/2⌋), and one with A2 (⌊m/2⌋⌈n/2⌉). Similarly, we perform MAs on submatrices of B and C.

Thus, we find that the problem size [m, n, p] at which to yield control to the ATLAS/GotoBLAS algorithm is when Equation 2 is satisfied:

  ⌊m/2⌋⌈n/2⌉⌈p/2⌉ ≤ (α/2π)·[5⌈n/2⌉(⌈m/2⌉ + ⌈p/2⌉) + 3mp].    (2)

We assume that π and α are functions of the matrix size only, as we explain in the following.

Layout effects. The performance of MA is not affected by a specific matrix layout (i.e., row-major format or Z-Morton) or shape as long as we can exploit the only viable reuse: spatial data reuse. We know that data reuse (spatial/temporal) is crucial for matrix multiply. In practice, ATLAS and GotoBLAS cope rather well with the effects of a (limited) row/column major format, often reaching 90% of peak performance. Thus, we can assume for practical purposes that π and α are functions of the matrix size only.

Square Matrix MM. We combine the performance properties of both matrix multiplications and matrix additions with a more specific analysis for square matrices only; that is, n = m = p. We can then simplify Equation 2, and we find that the recursion point n1 is

  n1 = 22·(α/π).    (3)

For example, if we assume a ratio α/π=50 (this is common for the systems tested in this work), we find that the recursion point corresponds to the problem (matrix) size n1 = 1100. For problems of size ℓ·n1 ≤ n < (ℓ+1)·n1, we may apply Strassen's division ℓ times. In fact, the factors π and α are easy to estimate by benchmarking, and we can determine the specific recursion point n1 by a linear search.

4. EXPERIMENTAL RESULTS

We split this experimental section: we present experimental results for rectangular matrices and square matrices separately. For rectangular matrices, we investigate the effects of the problem sizes and shapes w.r.t. the algorithm choice and the architecture. For square matrices, we show how effective our approach is for a large set of different architectures and odd matrix sizes, and we show detailed performance for a representative set of architectures.

4.1 Rectangular matrices

Given a matrix multiply C = AB with σ(A)=m×n and σ(B)=n×p, we characterize this problem size by a triplet s = [m, n, p]. For presentation purposes, we opted to represent one matrix multiplication by its number of operations x = 2·∏_{i=0}^{2} s_i and plot it on the abscissa (otherwise the performance plot would have to be a 4-dimensional graph, [time, m, n, p]).

4.1.1 HP zv6000, Athlon-64 2GHz using ATLAS, data locality vs. operations

In this section, we turn our attention to a general performance comparison of the Balanced MM (Table 1) and HASA (Table 2) with respect to the routine cblas dgemm —i.e., ATLAS's MM for matrices in double precision— for dense but rectangular matrices. In Figure 3, we present the relative performance; we then discuss the process used to collect these results and offer an interpretation.

We investigated the input space s∈T×T×T with T={100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 8000, 10000, 12000}; however, we ran only the problems that have a working set (i.e., the operands C, A and B, and the one-recursion-step temporaries M, T1 and T2) smaller than the main-memory size (i.e., 512MB). We present relative performance results for two algorithms with respect to cblas dgemm: Balanced and HASA dynamic.

Figure 3: HP ZV6005: Relative performance with respect to ATLAS's cblas dgemm.
Balanced is the algorithm where we determine the recursion point for HASA when we install our codes on the architecture. First, we found the experimental recursion point for square matrices, which is ṅ1=1500. We then set HASA (Table 2) to stop the recursion when at least one matrix size is smaller than 1500.
HASA dynamic is the algorithm where we determine the recursion point for every specific problem size at run time (no cache-oblivious strategy). To achieve the same performance as the Balanced algorithm for square matrices, we set a coefficient ε=20 as an additive contribution to the ratio: α_s/π_s ∼ α/π + ε. At run time and for each problem s, we measure the performance of the MAs (i.e., α_s) and the performance of cblas dgemm (i.e., π_s). We set the recursion point as the problem size satisfying

  2⌊m/2⌋⌈n/2⌉⌈p/2⌉ ≤ (α_s/π_s + ε)·[5⌈n/2⌉(⌈m/2⌉ + ⌈p/2⌉) + 3mp].
Performance observations. ATLAS's peak performance is 3.52 GFLOPS; Balanced's and HASA dynamic's peak performance is normalized to 3.8 GFLOPS (i.e., 2·m·n·p/ExecutionTime; this overestimates the actual number of operations per second but is still a valid comparison measure). The erratic performance of both algorithms (i.e., from 50% speedup to 20% slowdown, especially for small and very large sizes w.r.t. ATLAS's cblas dgemm) is not a problem of the recursion-point determination, which may cause either the recursion to improve performance unexpectedly or the recursion overhead to choke the execution time. The main reason for the sudden speedups is the poor performance of cblas dgemm; the main reason for the slowdowns is the access of hard-disk (swap) space by our algorithm.
We conclude that Strassen's algorithm can be applied successfully to rectangular matrices and that we can achieve significant performance improvements in doing so. However, the improvements are often the result of better data-locality utilization rather than just a reduction of operations. For example, the HASA dynamic algorithm is bound to have fewer floating point operations than Balanced, because the former applies Strassen's division more times than the latter, especially for non-square matrices; however, the Balanced algorithm achieves on average a 1.3% execution-time reduction whereas HASA dynamic achieves on average 0.5%.³ In general, Balanced presents very predictable performance with often better peak performance than HASA dynamic.

Figure 4: Optiplex GX280, Pentium 4 3.2GHz, using GotoBLAS's DGEMM.
4.1.2 GotoBLAS, Strassen vs. Faster MM
Recently, GotoBLAS's MM has replaced ATLAS's and has become the fastest MM for most state-of-the-art architectures. In this section, we show that our algorithm is almost unaffected by the change of the leaf implementation (i.e., the kernel to which the recursion yields, ATLAS or GotoBLAS), leading to comparable and even better improvements.

We investigated the input space s∈T×T×T with T={1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000}. We present relative performance results for two algorithms with respect to GotoBLAS DGEMM (resp. Balanced and HASA) and for two architectures: Athlon 64 2.4GHz and Pentium 4 3.2GHz. Balanced and HASA are tuned offline (i.e., the recursion point is determined once and for all).

Optiplex GX280, Pentium 4 3.2GHz using GotoBLAS. GotoBLAS's peak performance is 5.5 GFLOPS; Balanced's and HASA dynamic's peak performance is normalized to 6.2 GFLOPS (for comparison purposes). For this architecture, the recursion point is empirically found at n1 = 1000, and we stop the recursion when a matrix size is smaller than n1. By construction, the HASA algorithm has fewer instructions than Balanced because it applies Strassen's division to larger matrices; however, HASA achieves a smaller relative improvement w.r.t. GotoBLAS MM than what the Balanced algorithm achieves (on average Balanced achieves a 7.2% speedup and HASA achieves 5.8%), see Figure 4. This suggests that the algorithm with better data locality delivers better performance than the algorithm with fewer operations (for this architecture).
Notice that the performance plot has an erratic behavior, which
is similar to the previous scenario (i.e., Figure 3); however, in this
case, the performance is affected by the architecture as we show in
the following using the same library but a different system.
Altura 939, Athlon-64 2.5GHz using GotoBLAS. GotoBLAS's peak performance is 4.5 GFLOPS; Balanced's and HASA dynamic's peak performance is normalized to 5.4 GFLOPS (for comparison). For this architecture (with a faster memory hierarchy and processor than in Section 4.1), the recursion point is empirically found at n1 = 900, and we stop the recursion when a matrix size is smaller than n1. Both algorithms, Balanced and HASA, have similar performance; that is, on average HASA achieves a 7.7% speedup and Balanced achieves 7.5%. Also, for this architecture, the performance is very predictable and the performance plot clearly shows the levels of recursion applied, see Figure 5 (clearly 3 levels and a fourth for very large problems).

³ The input set has mostly small problems, thus the average time reduction is biased towards small values.

Figure 5: Altura 939, Athlon 64 2.5GHz, using GotoBLAS's DGEMM.
4.2 Square Matrices
For square matrices, we measure only the performance of HASA because the Balanced algorithm will directly call HASA. For each machine, we had three basic kernels: the C-code implementation cblas dgemm of MM in double precision from ATLAS; the hybrid ATLAS/Strassen algorithm (HASA) —as in Table 2, where the leaf computation is cblas dgemm— and our hand-coded MA, which is tailored to each architecture. In Table 3, we present a summary of the experimental results (for rectangular matrices as well) but, in this section, we present details for only four architectures (see [9, 10] for more results).
Installation. We installed the ATLAS software package on every architecture. The ATLAS routine cblas dgemm is the MM we used as reference: we time this routine on each architecture so as to determine our baseline and the coefficient π (i.e., for square matrices of size 1000), and this routine is also the leaf computation of HASA.

We timed the execution of MA (i.e., for square matrices of size 1000) [9, 10] and thus determined α. Once we have π and α, we determine the theoretical recursion point n1 = 22·(α/π), which we have used to install our codes. We then determined experimentally the recursion point ṅ1 (i.e., ṅ1 = 22·(α/π + ε)) based on a simple linear search.
Performance presentation. We present two measures of performance: the relative execution time of HASA over cblas dgemm, and cblas dgemm's MFLOPS relative to the ideal machine peak performance (i.e., the maximum number of multiply-add operations per second). In fact, the execution time is what any final user cares about when comparing two different algorithms. However, a measure of performance for cblas dgemm, such as MFLOPS, shows whether HASA improves the performance of a MM kernel that is either efficiently or poorly designed. This basic measure is important inasmuch as it shows the room for improvement for both cblas dgemm and HASA.
We use the following terminology: HASA is the recursion algorithm for which the recursion point is based on the experimental ṅ1 (i.e., ṅ1 = 22·(α/π + ε)), which is different for each architecture and is statically computed once. S-k is Strassen's algorithm for which k is the recursion depth before yielding to cblas dgemm, independently of the problem size. Note that we did not report negative relative performance for S-k because it would have produced mostly negative bars cluttering the charts and the results; so, for clarity, we omitted the corresponding negative bars in the charts. The performance results for the systems in Table 3 are obtained by collecting the best performance among several trials, and thus with hot caches.

Performance interpretation. In principle, the S-1 algorithm should have about sp1 = (1 − ṅ1/n)/8 relative speedup, where ṅ1 is the recursion point as found in Table 3 and n is the problem size (e.g., it is about 7–10% for our set of systems and n=5000). The speedup, for the algorithm with recursion depth ℓ, is additive, sp = Σ_{i=1}^{ℓ} sp_i; however, each recursion contribution is always less than the first one (sp_i < sp1), because the number of operations saved decreases going down the division process. In the best case, for a three-level recursion depth, we should achieve about 3·sp1 ∼ 24–30%. We achieve such a speedup for at least one architecture, the system ES40, Figure 7 (and a similar trend for the Altura system, Figure 5). However, for the other architectures, a recursion depth of three is often harmful.

Table 3: Systems and performance: (1/π)·10⁶ is the performance of cblas dgemm or DGEMM (GotoBLAS) in MFLOPS for n=1000; (1/α)·10⁶ is the performance of MA in MFLOPS for n=1000; n1 is the theoretical recursion point as estimated by 22·(α/π); instead, ṅ1 is the measured recursion point.

  System           Processors               (1/π)·10⁶  (1/α)·10⁶  n1    ṅ1    Figure
  Fujitsu HAL 300  SPARC64 100MHz           177        10         390   400   –
  RX1600           Itanium 2@1.0GHz         3023       105        487   725   –
  ES40             Alpha ev67 4@667MHz      1240       41         665   700   Fig. 7
  Altura 939       Athlon 64 2.45GHz        4320       110        860   900   Fig. 5
  Optiplex GX280   Pentium 4 3.2GHz         4810       120        900   1000  Fig. 4
  RP5470           8600 PA-RISC 550MHz      763        21         772   1175  Fig. 8
  Ultra 5          UltraSparc2 300MHz       407        9          984   1225  –
  ProLiant DL140   Xeon 2@3.2GHz            2395       53         995   1175  –
  ProLiant DL145   Opteron 2@2.2GHz         3888       93         918   1175  Fig. 9
  Ultra-250        UltraSparc2 2@300MHz     492        10         1061  1300  –
  HP ZV 6005       Athlon 64 2GHz           3520       71         1106  1500  Fig. 3
  Sun-Fire-V210    UltraSparc3 1GHz         1140       22         1140  1150  –
  Sun Blade        UltraSparc2 500MHz       460        8          1191  1884  –
  ASUS             AthlonXP 2800+ 2GHz      2160       39         1218  1300  –
  Unknown server   Itanium 2@700MHz         2132       27         1737  2150  –
  Fosa             Pentium III 800MHz       420        4          2009  N/A   –
  SGI O2           MIPS 12K 300MHz          320        2          2816  N/A   –

5. CONCLUSIONS

We have presented a practical implementation of Strassen's algorithm, which applies an adaptive algorithm to exploit highly tuned MMs, such as ATLAS's. We differ from previous approaches in that we use an easy-to-adapt recursive algorithm based on a balanced division process. This division process simplifies the algorithm and enables us to combine a simple performance model with highly tuned MM kernels so as to determine, off-line and at installation time, the best strategy. We have tested the performance of our approach extensively on 17 systems, and we have shown that Strassen's algorithm is not always applicable and that, for modern systems, the recursion point is quite large; that is, the problem size where Strassen's algorithm starts having a performance edge is quite large (i.e., matrix sizes larger than 1000 × 1000). However, speedups up to 30% over already tuned MMs are observed using this hybrid approach.

We conclude by observing that a sound experimentation environment in combination with a simple complexity model —which quantifies the interactions among the kernels of an application and the underlying architecture— can go a long way in helping the design of complex-but-portable codes. Such metrics can improve the design of the algorithms and may serve as a foundation for a fully automated approach.
6. ACKNOWLEDGMENTS
The first author worked on this project during his post-doctoral fellowship in the SPIRAL Project at the Department of Electrical and Computer Engineering at Carnegie Mellon University, and his work was supported in part by DARPA through the Department of the Interior grant NBCH1050009.
7. REFERENCES
[1] D. Bailey, K. Lee, and H. Simon. Using Strassen's algorithm to accelerate the solution of linear systems. J. Supercomput., 4(4):357–371, 1990.
[2] D. H. Bailey and H. R. P. Ferguson. A Strassen-Newton algorithm for high-speed parallelizable matrix inversion. In Supercomputing '88: Proceedings of the 1988 ACM/IEEE Conference on Supercomputing, pages 419–424. IEEE Computer Society Press, 1988.
[3] G. Bilardi, P. D'Alberto, and A. Nicolau. Fractal matrix multiplication: A case study on portability of cache performance. In Workshop on Algorithm Engineering 2001, Aarhus, Denmark, 2001.
[4] R. P. Brent. Algorithms for matrix multiplication. Technical Report TR-CS-70-157, Stanford University, Mar 1970.
[5] R. P. Brent. Error analysis of algorithms for matrix multiplication and triangular decomposition using Winograd's identity. Numerische Mathematik, 16:145–156, 1970.
[6] S. Chatterjee, A.R. Lebeck, P.K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proc. 11th ACM SIGPLAN, June 1999.
[7] H. Cohn, R. Kleinberg, B. Szegedy, and C. Umans. Group-theoretic algorithms for matrix multiplication, Nov 2005.
Figure 6: RX1600 Itanium 2@1.0GHz (top: % speedup w.r.t. cblas_dgemm for HASA and S-1, S-2, S-3; bottom: % of peak achieved by cblas_dgemm).
Figure 7: ES40 Alpha ev67 4@667MHz.

[8] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pages 1–6, 1987.
[9] P. D'Alberto and A. Nicolau. Adaptive Strassen and ATLAS's DGEMM: A fast square-matrix multiply for modern high-performance systems. In The 8th International Conference on High Performance Computing in Asia Pacific Region (HPC Asia), pages 45–52, Beijing, Dec 2005.
[10] P. D'Alberto and A. Nicolau. Using recursion to boost ATLAS's performance. In The Sixth International Symposium on High Performance Computing (ISHPC-VI), 2005.
[11] J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R.C. Whaley, and K. Yelick. Self-adapting linear algebra algorithms and software. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2), 2005.
[12] J. Demmel, I. Dumitriu, O. Holtz, and R. Kleinberg. Fast matrix multiplication is stable, Mar 2006.
[13] J. Demmel and N. Higham. Stability of block algorithms with fast level-3 BLAS. ACM Transactions on Mathematical Software, 18(3):274–291, 1992.
[14] N. Eiron, M. Rodeh, and I. Steinwarts. Matrix multiplication: A case study of algorithm engineering. In Proceedings WAE'98, Saarbrücken, Germany, Aug 1998.
[15] J.D. Frens and D.S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. Proc. 1997 ACM Symp. on Principles and Practice of Parallel Programming, 32(7):206–216, July 1997.
[16] M. Frigo and S. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):216–231, 2005.
[17] M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, 1999.
[18] K. Goto and R.A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software.
[19] J.A. Gunnels, F.G. Gustavson, G.M. Henry, and R.A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4):422–455, December 2001.
[20] N.J. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Trans. Math. Softw., 16(4):352–368, 1990.
[21] N.J. Higham. Accuracy and Stability of Numerical Algorithms, Second Edition. SIAM, 2002.
[22] S. Huss-Lederman, E.M. Jacobson, A. Tsao, T. Turnbull, and J.R. Johnson. Implementation of Strassen's algorithm for matrix multiplication. In Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CDROM), page 32. ACM Press, 1996.
Figure 8: RP5470 8600 PA-RISC 550MHz.
Figure 9: ProLiant DL145 Opteron 2@2.2GHz.
(Both figures plot % speedup w.r.t. cblas_dgemm and % of peak for cblas_dgemm against N, for HASA, S-1, S-2, and S-3.)
© 1968 Nature Publishing Group
A Taxonomy for Test Oracles
Douglas Hoffman
Software Quality Methods, LLC.
Phone 408-741-4830
Fax 408-867-4550
doug.hoffman@acm.org
Keywords: Automated Testing, Model of Testing, Software Under Test, Test Oracles, Test Verification, Test Validation
Abstract
Software test automation is often a difficult and complex process. The most familiar aspects of test automation are organizing and running test cases and capturing and verifying test results. A set of expected results is needed for each test case in order to check the actual results. These expected results are often generated by a mechanism called a test oracle. This paper describes classes of oracles for various types of automated software verification and validation. Several relevant characteristics of oracles are described, along with the advantages and disadvantages of each class.
Background
Software testing is a process of providing inputs to software under test (SUT) and evaluating the
results. In software testing, the mechanism used to generate expected results is called an oracle. (In this
paper, the first letter will be capitalized when referring to an Oracle for a specific test.) Many different
approaches can be used to generate, capture, and compare test results. The author, for example, at one
time or another has used the following methods for generating expected results:
• Manual verification of results (human oracle)
• Separate program implementing the same algorithm
• Simulator of the software system to produce parallel results
• Debugged hardware simulator to emulate hardware and software operations
• Earlier version of the software
• Same version of software on a different hardware platform
• Check of specific values for known responses
• Verification of consistency of generated values and end points
• Sampling of values against independently generated expected results
Test automation usually requires incorporation of Oracles into the testing process so test outcomes
can be evaluated. Automating the verification of results has significant implications on both the test case
and Oracle design. Because of the current high machine speeds and low cost of memory, test cases can
generate very large amounts of data, with corresponding amounts of Oracle data needed for
comparison. One or both sets of data can be generated and stored for comparison and then discarded if no differences are found. When data comparison is incorporated into test cases, effort is required to design each test to include error handling, reporting differences, and capturing error results. When the comparisons are done separately the effort is not repeated, but standards must be employed for formatting and storing inputs and results.

Quality Week 1998, Douglas Hoffman, 3/30/98
Many organizations today depend on a human oracle to verify test results. The tester is expected to know how the software should work and to recognize when it misbehaves. This often happens by default for manual testing, and is usually the case for GUI testing. A human oracle is not satisfactory for several reasons when test cases are automated. The volume of data from automated tests is often overwhelming, and a person may not be able to analyze displayed information before the system changes it. Not all effects of a test case are available and displayed for a person to observe. The automated testing process is tedious and requires concentration for arbitrarily long periods. A person also quickly becomes trained on what to expect, and once trained is likely to overlook minor deviations (errors).
A worse situation occurs with automated tests when tests run without benefit of any verification. The
result from merely running a test is nearly always the same whether or not a fault is encountered –
program termination. Based on experience, very few errors cause noticeable abnormal test termination.
Unless test results are verified, it takes a spectacular event to reveal that an error has occurred. When
a batch of automated tests is run with only cursory checks, we may only learn that something went
wrong somewhere, without a clue about the likely cause. Some automated mechanism is needed to
check the results from automated tests.
Creating an oracle to verify values for a mathematical subroutine may be straightforward by using a
different algorithm, language, compiler, etc. At the other extreme, an Oracle for the interrupt handling of
an operating system kernel is far more difficult to create. Hardware and system emulators need to be
created, and parallel mechanisms for causing specific events need to be put in place for both the SUT
and the Oracle. Timing and synchronization between the SUT and the Oracle are also extremely difficult to manage well enough to correctly verify software operation.
The difficulty in creating most test oracles falls somewhere between the two extremes. It is often
impractical to generate complete sets of expected results. It is particularly difficult to generate expected
information for file directories, machine registers, system tables, memory, etc. Usually these aspects and
side-effects of the SUT are ignored when tests are verified unless there is a gross, obvious problem.
This is also true when the tests are manually run.
A Simple Model for Automated Tests
Figure 1 shows an Input-Process-Output model for black box testing. The test case is a set of inputs, and verification is done by observing the results. SUTs very seldom fit this model, however, as they have multiple, complex inputs and results. We need to know the values for all of the inputs and check all of the results in order to know whether the SUT responds properly. Also, some of the results from software execution are only indirectly related to the functions we are exercising in our test. Test
results include such things as residual values left in memory, program states for the SUT and other
software, instrument control signals, and data base values.
Figure 1: I-P-O Testing Model (Test Inputs → System Under Test → Test Results)
Figure 2 shows a more complete model for software testing, including more categories of inputs to
and results from a test. To determine whether the SUT responds properly, we need to know or set all
of the inputs and check all of the results. Because of the vast number of possible outcomes from running a program, test designers select what they consider to be relevant inputs and results, and then choose a
subset of these to use in predicting and verifying program behavior. The test case input values are only
one part of the stimulus for a test, and even thorough test plans identify only some of the test case
preconditions. The environmental inputs are seldom spelled out in detail.
Figure 2: Expanded Testing Model (Test Inputs, Precondition Data, Precondition Program State, and Environmental Inputs flow into the System Under Test; Test Results, Postcondition Data, Postcondition Program State, and Environmental Results flow out)
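One way to make the expanded model concrete is a small record type with one slot per category; the field names below simply mirror Figure 2's labels, and the comparison helper is an illustrative sketch, not code from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class TestRun:
    """Hypothetical record of one automated test, following the expanded
    model: everything fed to the SUT and everything observed afterward."""
    test_inputs: dict = field(default_factory=dict)
    precondition_data: dict = field(default_factory=dict)
    precondition_program_state: dict = field(default_factory=dict)
    environmental_inputs: dict = field(default_factory=dict)
    test_results: dict = field(default_factory=dict)
    postcondition_data: dict = field(default_factory=dict)
    postcondition_program_state: dict = field(default_factory=dict)
    environmental_results: dict = field(default_factory=dict)

def differences(run, oracle_run):
    """Report every result category where the Oracle's prediction
    disagrees with what the test run actually observed."""
    diffs = []
    for cat in ("test_results", "postcondition_data",
                "postcondition_program_state", "environmental_results"):
        predicted = getattr(oracle_run, cat)
        observed = getattr(run, cat)
        for key, value in predicted.items():
            if observed.get(key) != value:
                diffs.append((cat, key))
    return diffs
```

In practice an Oracle predicts only some of these categories; the helper compares just the keys the Oracle chose to predict, which matches the paper's point that test designers verify a selected subset of results.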
Several observations can be made when introducing an oracle into the model. Different types of
oracles are needed for different types of software. The domain, range, and form of input and output
data varies substantially between programs. Most software has multiple forms of inputs and results so
several oracles may be needed for a single software program. Different characteristics in a program may
require separate oracles. For example, a program’s results may include computed functions, screen
navigations, and asynchronous event handling. Several oracles may need to work together because of
interactions of common inputs. In the case of a word processor, pagination changes are based upon
characteristics such as the font and font size, while the test case may be about color compatibility. An
oracle for pagination has to factor in fonts even when a test case is about color. Although an oracle may
be excellent at predicting certain results, only the SUT running in the target environment will process all
of the inputs and provide all of the results. No matter how meticulous we are in creating an oracle, we
will not achieve both independence and completeness.
Because it may be impractical for a single oracle to model all system behaviors of the SUT, this paper will assume that oracles are created for specific purposes. This simplifying assumption holds since
an oracle that completely models SUT behavior can be considered to be composed of several special
purpose oracles focusing on specific SUT behaviors. The special purpose oracle can then completely
predict SUT behaviors for which it is designed. We can add other oracles to predict other behaviors
and results from the SUT. (In practice, most test oracles focus on modeling straightforward behaviors,
and we apply different oracles at different times to check program behaviors such as functionality,
screen navigations, or memory use.) The characteristics of these focused oracles can be at the extremes
of our measurements.
Characteristics of Oracles
There are several characteristics we might measure relating an oracle to the SUT. Table 1 provides a list of some useful measures for oracles. Each of these characteristics describes a correspondence between an oracle and the SUT, and its measure can range from no relationship to exact duplication. Completeness, for example, can range from no predictions (which is not very useful) to exact duplication in all results categories (a second implementation of the SUT).
• Completeness of information from oracle
• Accuracy of information from oracle
• Independence of oracle from SUT
   • Algorithms
   • Sub-programs and libraries
   • System platform
   • Operating environment
• Speed of predictions
• Time of execution of oracle
• Usability of results
• Correspondence (currency) of oracle through changes in the SUT

Table 1: Oracle Characteristics
It is easy to see that the more complete and accurate an oracle is, the more complex it has to be.
Indeed, if the oracle exactly predicts all results from the SUT it will be at least as complex. This also
means that the better an oracle is at providing expected results, the more likely that detected differences
are due to faults in the oracle rather than the SUT. Likewise, the more an oracle predicts about program
state and environment conditions, the more dependent the oracle is on the SUT and operating
environment. This dependence makes the oracle more complex and more difficult to maintain. It also
means that faults may be missed because both the SUT and the oracle may contain the fault.
Software tests themselves can be classified in many different ways. Manual testing brings up images
of a human providing input and interpreting results as the means of testing. Yet, humans sometimes need
books, tables, calculators, or even programs (an Oracle) to know the expected result. Automated
testing does not mean mechanical reproduction of manual tests. Automated tests that include evaluation
of results need some kind of oracle regardless of the type or purpose of the tests. Yet, the mechanism
for evaluation of results ranges from none (the program or system didn’t crash) to exact (all values,
displays, files, etc., are verified). Various levels of effort and exactness are appropriate under different
circumstances. The nature and complexity of an oracle is also dependent upon those circumstances.
Types of Oracles
Real world oracles vary widely in their characteristics. Although the mechanics of various oracles
may be vastly different, a few classes can be identified which correspond with automated test
approaches. These types of oracles are categorized based upon the outputs from the oracle rather than
the method of generation of the results. Thus, an oracle that uses a lookup table to derive values may be
the same type of oracle as one that implements an alternate algorithm to compute the values. The type
descriptions define the purpose of the oracle and its method of use. Five types are identified and defined
below. They are labeled True, Stochastic, Heuristic, Sampling, and Consistent oracles.
A “True oracle” faithfully reproduces all relevant results for a SUT using an independent platform, algorithms, processes, compilers, code, etc. The same values are fed to the SUT and the Oracle for results comparison. An Oracle for an algorithm or subroutine can be straightforward enough for this type of oracle to be considered. The sin() function, for example, can be implemented separately using different algorithms and the results compared to exhaustively test the function (assuming the availability of sufficient machine cycles). For a given test case, all values input to the SUT are verified to be “correct” using the Oracle’s separate algorithm. The less the SUT has in common with the Oracle, the more confidence in the correctness of the results (since common hardware, compilers, operating systems, algorithms, etc., may inject errors that affect both the SUT and the Oracle in the same way). Test cases employing a True oracle are usually limited by available machine time and system resources.
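As a sketch of this idea (not code from the paper), the fragment below pits Python's math.sin, standing in for the SUT, against a hypothetical independent Taylor-series implementation acting as the True Oracle; the grid, tolerance, and function names are all assumptions:

```python
import math

def taylor_sin(x, terms=20):
    """Independent Oracle implementation: Taylor series for sin(x).

    Deliberately shares no algorithm with the SUT's library sin, so a
    common-mode error in one is unlikely to hide in the other."""
    x = math.fmod(x, 2 * math.pi)     # reduce argument so the series converges fast
    total, term = 0.0, x
    for n in range(terms):
        total += term
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))  # next alternating term
    return total

def true_oracle_check(inputs, tolerance=1e-9):
    """Feed the same values to SUT (math.sin) and Oracle; return mismatches."""
    return [x for x in inputs
            if abs(math.sin(x) - taylor_sin(x)) > tolerance]

# An exhaustive-style sweep over a fine grid, bounded only by machine time.
grid = [i * 0.001 for i in range(-10000, 10001)]
assert true_oracle_check(grid) == []
```

The limiting factor is exactly the one the paper names: the Oracle is as expensive to run as the SUT, so coverage is capped by available machine cycles.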
A “Stochastic” approach focuses on verifying a statistically selected sample of values. This is most useful when resources are limited and only a relatively small number of inputs can be included in the tests. Values are drawn from the input ranges so that each possible value is equally likely. For the sin() example, a pseudo-random number generator may be used to select the input values. The same values are fed to the SUT and the Oracle for results comparison. The statistically random input selection yields a test case with no bias from the data chosen. It also means that suspect or error-prone areas of the software are no more or less likely to be exercised than any other area. Either the Oracle has to be substantial enough to accept arbitrary inputs, or the pseudo-random sequence needs to be known in advance and an Oracle created for those particular values.
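A minimal sketch of the approach, with the SUT and Oracle passed in as plain functions; the signature, sample size, and fixed seed are illustrative choices, not the paper's:

```python
import math
import random

def stochastic_oracle_check(sut, oracle, lo, hi, samples=1000, tol=1e-9, seed=42):
    """Compare SUT and Oracle on a statistically selected sample.

    Every value in [lo, hi] is equally likely, so the data carries no
    tester bias; the fixed seed makes the pseudo-random sequence known
    in advance, so an Oracle could be pre-built for exactly these values."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(samples):
        x = rng.uniform(lo, hi)            # uniform: no area favored over another
        if abs(sut(x) - oracle(x)) > tol:
            mismatches.append(x)
    return mismatches

# For the sin() example the Oracle would be an independent implementation;
# math.sin stands in for both sides here only to keep the sketch short.
assert stochastic_oracle_check(math.sin, math.sin, 0.0, 2 * math.pi) == []
```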
A “Heuristic oracle” reproduces selected results for the SUT, and the remaining values can be checked using simpler algorithms or consistency checks based on a heuristic. For the sin() function, a Heuristic Oracle might generate only the specific values for sin(π/2), sin(π), sin(3π/2), and sin(2π) [whose results are 1, 0, -1, 0]. The test can then feed values between the four points to the SUT at very small increments. A heuristic is applied to verify that the SUT returns values that are progressively greater (or less) than the last value. Although the heuristic approach will accept many functions that are incorrect, the Oracle is very easy to implement (especially when compared to a True Oracle), runs much faster, and will find most faults.
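The four anchor points plus the monotonicity heuristic can be sketched as follows; the step size and tolerance are arbitrary choices, not values from the paper:

```python
import math

# Anchor points from the paper's example: sin(pi/2)=1, sin(pi)=0,
# sin(3*pi/2)=-1, sin(2*pi)=0.
ANCHORS = [(math.pi / 2, 1.0), (math.pi, 0.0),
           (3 * math.pi / 2, -1.0), (2 * math.pi, 0.0)]

def heuristic_oracle_check(sut, step=1e-4, tol=1e-9):
    """Verify exact values only at the anchors; between them, apply the
    heuristic that successive outputs move monotonically toward the next
    anchor. Cheap and fast, but it will accept many incorrect functions."""
    for x, expected in ANCHORS:
        if abs(sut(x) - expected) > tol:
            return False
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        direction = 1 if y1 > y0 else -1   # rising or falling on this span
        prev = sut(x0)
        x = x0 + step
        while x < x1:
            cur = sut(x)
            if (cur - prev) * direction < 0:   # moved the wrong way
                return False
            prev, x = cur, x + step
    return True

assert heuristic_oracle_check(math.sin)
```

This Oracle never computes sin itself between the anchors, which is exactly why it is so much cheaper than a True Oracle, and also why a wrong-but-monotonic implementation would slip past it.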
The “Sampling” approach uses a selected set of values. The values are chosen according to some criterion other than statistical randomness. Boundary values, specific integers, midpoints, minima, and maxima are examples often chosen when testing. Often, values are selected because they are easy to generate, recognize, or recall. (These are all selected samples that are not statistically random.) Once the values are selected, an Oracle can be created that provides the expected results. Software testing usually includes some effort based on sampling to focus on areas likely to have faults and on critical functions and features. The key difference between the Stochastic oracle and the Sampling oracle is the method of selecting input and result values.
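A Sampling Oracle for sin() might be nothing more than a hand-built table of specially chosen points; the particular sample below is illustrative:

```python
import math

# Hand-picked sample: boundaries, easy-to-recall points, the maximum and
# minimum — chosen for meaning, not statistical randomness.
SAMPLE_ORACLE = {
    0.0: 0.0,                 # lower boundary of the period
    math.pi / 2: 1.0,         # maximum
    math.pi: 0.0,             # midpoint
    3 * math.pi / 2: -1.0,    # minimum
    2 * math.pi: 0.0,         # upper boundary
}

def sampling_oracle_check(sut, tol=1e-9):
    """Verify the SUT only at the specially selected sample points."""
    return [x for x, expected in SAMPLE_ORACLE.items()
            if abs(sut(x) - expected) > tol]

assert sampling_oracle_check(math.sin) == []
```

The code differs from the Stochastic sketch only in where the input values come from, which is the paper's point: the selection method, not the comparison mechanics, distinguishes the two types.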
A “Consistent” oracle uses the results from one test run as the Oracle for subsequent tests. This is particularly useful for evaluating the effects of changes from one revision to another. The Oracle in this situation comes from a simulator, an equivalent product, software on an alternate platform, or an earlier version of the SUT. The values being compared can include intermediate results, call trees, data values, or any other data extracted from the SUT automatically. The Oracle-generated data is usually too voluminous to be thoroughly or exhaustively verified. The value in comparing results from the SUT and the Oracle comes from evaluating and explaining any differences. Because very large volumes of data can be stored and compared, the test cases can cover large input and result ranges. Although historic faults may remain when this technique is used, new faults and side-effects are often exposed and fixes are confirmed.
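One common concrete form of a Consistent oracle is a golden-file comparison, sketched below; the file layout, paths, and helper names are assumptions, not the paper's:

```python
import json
import math
import os
import tempfile

def run_sut(inputs):
    """Stand-in SUT: any program whose outputs can be captured per run."""
    return {str(x): math.sin(x) for x in inputs}

def consistent_oracle_check(results, golden_path, tol=1e-12):
    """Compare run n against the stored run n-1; the first run only records.

    A difference is not proof of a fault, just something to evaluate and
    explain — and faults present in the golden run are carried forward."""
    if not os.path.exists(golden_path):
        with open(golden_path, "w") as f:
            json.dump(results, f)            # run n-1 becomes the Oracle
        return []
    with open(golden_path) as f:
        golden = json.load(f)
    return [k for k in golden
            if k not in results or abs(golden[k] - results[k]) > tol]

path = os.path.join(tempfile.gettempdir(), "sut_golden_run.json")
if os.path.exists(path):
    os.remove(path)                          # ensure a clean first run
inputs = [0.1 * i for i in range(100)]
consistent_oracle_check(run_sut(inputs), path)           # records run n-1
assert consistent_oracle_check(run_sut(inputs), path) == []
```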
Table 2 summarizes the five types of oracles and some of their characteristics.
True Oracle
  Definition: Independent generation of all expected results
  Example of use: Algorithm Validation
  Advantages: Possibility for exhaustive testing
  Disadvantages: Expensive implementation; possibly long execution times

Stochastic
  Definition: Verify a randomly selected sample
  Example of use: Operational Verification
  Advantages: Can automate tests with a simple Oracle
  Disadvantages: May miss systematic and specific errors; can be time consuming to verify

Heuristic
  Definition: Verify selected points, use a heuristic for the remainder
  Example of use: Algorithm Verification
  Advantages: Easier than a True Oracle
  Disadvantages: Can miss systematic errors and incorrect algorithms

Sampling
  Definition: Verify a specially selected sample
  Example of use: Boundary Testing
  Advantages: Very fast verification possible with a simple Oracle
  Disadvantages: May miss systematic or specific errors

Consistent
  Definition: Compare run n results with run n-1
  Example of use: Regression Test
  Advantages: Fastest; can generate and verify large amounts of data
  Disadvantages: Original run may include unknown errors

Table 2: Five Types of Oracles
Other Remarks on Oracles
Data from the Oracle can be generated before, parallel to, or after the test case is run. If the Oracle
data is generated before the test, the inputs for the test case need to be known and the expected results
must be stored in suitable form for comparison during or after testing. Early Oracle data generation is
useful when the Oracle is slow, and it is required for the consistency approach. When the test case
performs comparisons with expected results the Oracle has to run before or in parallel with the test
case. Parallel running of an Oracle presumes that the Oracle runs quickly enough to be practical. When
test results are stored and checked after test execution, the timing of Oracle data generation can be
independent of test execution. Such after-the-test verification can be done using stored results from a
test run with either stored or real-time generated Oracle output.
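After-the-test verification can be sketched as a comparison of two stored artifacts, decoupling Oracle timing from test execution; the paths and JSON format here are illustrative assumptions:

```python
import json
import os
import tempfile

def verify_after_the_fact(results_path, oracle_path, tol=1e-9):
    """Separate verification step: the test run's results and the Oracle's
    predictions were both stored earlier, so the Oracle may run before,
    during, or long after the tests themselves."""
    with open(results_path) as f:
        results = json.load(f)
    with open(oracle_path) as f:
        expected = json.load(f)
    return [k for k in expected
            if k not in results or abs(results[k] - expected[k]) > tol]

# Hypothetical stored artifacts: results captured during the test run,
# Oracle data generated at any convenient time.
tmp = tempfile.gettempdir()
results_path = os.path.join(tmp, "run_results.json")
oracle_path = os.path.join(tmp, "oracle_data.json")
with open(results_path, "w") as f:
    json.dump({"case1": 0.5, "case2": 1.0}, f)
with open(oracle_path, "w") as f:
    json.dump({"case1": 0.5, "case2": 2.0}, f)

assert verify_after_the_fact(results_path, oracle_path) == ["case2"]
```

Because the comparison routine is shared, the per-test effort the paper mentions (error handling, difference reporting) is paid once, at the cost of agreeing on a storage format for inputs and results.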
Test results can be verified manually, within the test case, or automated separately. Manual
verification requires both test results and Oracle data be available for comparison and is limited by
human processing capabilities. Verification within a test case means that the Oracle data has to be
available when the test case runs, which means either prior or parallel running of the Oracle. The test
case also needs to be designed to perform the collection, comparing, and reporting of results. Separate
automation of results comparison requires that results from the test run are saved and that either the
Oracle results are likewise saved or generated as needed by the verification routines.
Care must be taken during test planning to decide on the method of results comparison. Oracles are
required for verification and the nature of an oracle depends on several factors under the control of the
test designer and automation architect. Different Oracles may be used for a single automated test and a
single oracle may serve many test cases. If test results are to be analyzed, some type of oracle is
required.
Douglas Hoffman
Bio:
Douglas Hoffman is an independent consultant with Software Quality Methods, LLC. He has been
in the software engineering and quality assurance fields for over 25 years and now teaches courses and
consults with management in strategic and tactical planning for software quality. For five years he served
as Chairman of the Santa Clara Valley Software Quality Association (SSQA), a Task Group of the
American Society for Quality (ASQ). He has been a participant at dozens of software quality
conferences and has been Program Chairman for several international conferences on software quality.
He is a member of the ACM and IEEE and is active in the ASQ as a Senior Member, participating in
the Software Division, the Santa Clara Valley Section, and the Software Quality Task Group. He is
Certified by ASQ as a Software Quality Engineer and has been a registered ISO 9000 Lead Auditor.
He has a BA in Computer Science, an MS in Electrical Engineering, and an MBA.
Douglas’ experience includes consulting, teaching, managing, and engineering in the computer
systems and software industries. He has over fifteen years experience in creating and transforming
software quality and development groups, and twenty years of business management experience. His
work in corporate, quality assurance, development, manufacturing, and support organizations makes
him very well versed in technical and managerial issues in the computer industry. Douglas has taught
technical and managerial courses in high schools, universities, and corporations for over 25 years.