from the editor Editor in Chief: Steve McConnell ■ Construx Software ■ software@construx.com Cargo Cult Software Engineering In the South Seas there is a cargo cult of people. During the war they saw airplanes with lots of good materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head for headphones and bars of bamboo sticking out like antennas—he’s the controller— and they wait for the airplanes to land. They’re doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn’t work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.—Richard Feynman, in Surely You’re Joking, Mr. Feynman! WW Norton & Company, New York, reprint ed., 1997 I find it useful to draw a contrast between two different organizational development styles: process-oriented and commitment-oriented development. Process-oriented development achieves its effectiveness through skillful planning, carefully defined processes, efficient use of available time, and skillful applicaCopyright @ 2000 Steven C. McConnell. All Rights Reserved. tion of software engineering best practices. This style of development succeeds because the organization that uses it is constantly improving. Even if its early attempts are ineffective, steady attention to process means each successive attempt will work better than the previous one. Commitment-oriented development goes by several names, including hero-oriented development and individual empowerment. Commitment-oriented organizations are characterized by hiring the best possible people; asking them for total commitment to their projects; empowering them with nearly complete autonomy; motivating them to an extreme degree; and then seeing that they work 60, 80, or 100 hours a week until the project is finished. Commitmentoriented development derives its potency from its tremendous motivational ability; study after study has found that individual motivation is by far the largest single contributor to productivity. Developers make voluntary, personal commitments to the projects they work on, and they often go to extraordinary lengths to make their projects succeed. Organizational Imposters When used knowledgeably, either development style can produce high-quality software economically and quickly. However, both development styles have pathological look-alikes that don’t work nearly as well and that can be difficult to distinguish from the genuine articles. The process-imposter organization bases its practices on a slavish devotion to process March/April 2000 IEEE SOFTWARE 11 FROM THE EDITOR EDITOR-IN-CHIEF: Steve McConnell 10662 Los Vaqueros Circle Los Alamitos, CA 90720-1314 software@construx.com EDITORS-IN-CHIEF EMERITUS: Carl Chang, Univ. of Illinois, Chicago Alan M. Davis, Omni-Vista EDITORIAL BOARD Maarten Boasson, Hollandse Signaalapparaten Terry Bollinger, The MITRE Corp. Andy Bytheway, Univ. of the Western Cape David Card, Software Productivity Consortium Larry Constantine, Constantine & Lockwood Christof Ebert, Alcatel Telecom Robert L. Glass, Computing Trends Lawrence D. Graham, Black, Lowe, and Graham Natalia Juristo, Universidad Politécnica de Madrid Warren Keuffel Karen Mackey, Cisco Systems Brian Lawrence, Coyote Valley Software Tomoo Matsubara, Matsubara Consulting Stephen Mellor, Project Technology Wolfgang Strigel, Software Productivity Centre Jeffrey M. Voas, Reliable Software Technologies Corp. Karl E. Wiegers, Process Impact INDUSTRY ADVISORY BOARD Robert Cochran, Catalyst Software Annie Kuntzmann-Combelles, Objectif Technologie Enrique Draier, Netsystem SA Eric Horvitz, Microsoft Takaya Ishida, Mitsubishi Electric Corp. Dehua Ju, ASTI Shanghai Donna Kasperson, Science Applications International Günter Koch, Austrian Research Centers Wojtek Kozaczynski, Rational Software Corp. Masao Matsumoto, Univ. of Tsukuba Susan Mickel, BoldFish Deependra Moitra, Lucent Technologies, India Melissa Murphy, Sandia National Lab Kiyoh Nakamura, Fujitsu Grant Rule, Guild of Independent Function Point Analysts Chandra Shekaran, Microsoft Martyn Thomas, Praxis M A G A Z I N E O P E R AT I O N S C O M M I T T E E Sorel Reisman (chair), William Everett (vice chair), James H. Aylor, Jean Bacon, Thomas J. (Tim) Bergin, Wushow Chou, George V. Cybenko, William I. Grosky, Steve McConnell, Daniel E. O’Leary, Ken Sakamura, Munindar P. Singh, James J. Thomas, Yervant Zorian P U B L I C AT I O N S B O A R D Sallie Sheppard (vice president), Sorel Reisman (MOC chair), Rangachar Kasturi (TOC chair), Jon Butler (POC chair), Angela Burgess (publisher), Laurel Kaleda (IEEE representative), Jake Aggarwal, Laxmi Bhuyan, Lori Clarke, Alberto del Bimbo, Mike Liu, Mike Williams (secretary), Zhiwei Xu 12 IEEE SOFTWARE March/ April 2000 for process’s sake. These organizations look at process-oriented organizations, such as NASA’s Software Engineering Laboratory and IBM’s former Federal Systems Division, and observe that those organizations generate lots of documents and hold frequent meetings. The imposters conclude that if they generate an equivalent number of documents and hold a comparable number of meetings, they will be similarly successful. If they generate more documentation and hold more meetings, they will be even more successful! But they don’t understand that the documentation and the meetings are not responsible for the success; these are the side effects of a few specific, effective processes. I call these organizations bureaucratic because they put the form of software processes above the substance. Their misuse of process is demotivating, which hurts productivity. And they’re not very enjoyable to work for. The commitment-imposter organization focuses primarily on motivating people to work long hours. These organizations look at successful companies such as Microsoft and observe that they generate very little documentation, offer stock options to employees, and then require mountains of overtime. They conclude that if they, too, minimize documentation, offer stock options, and require extensive overtime, they will be successful. The less documentation and the more overtime the better! But these organizations miss the fact that Microsoft and other successful commitment-oriented companies don’t require overtime. They hire people who love to create software. They team these people with other people who love to create software just as much as they do. They provide lavish organizational support and rewards for creating software. And then they turn them loose. The natural outcome is that software developers and managers choose to work long hours voluntarily. Imposter organizations confuse the effect (long hours) with the cause (high motivation). I call the imposter organizations sweatshops because they emphasize working hard rather than working smart, and they tend to be chaotic and ineffective. They’re not very enjoyable to work for, either. Cargo Cult Organizations At first glance, these two kinds of imposter organizations appear to be exact opposites. One is incredibly bureaucratic, and the other is incredibly chaotic. But one key similarity is actually more important than their superficial differences: Neither is very effective because neither understands what really makes its projects succeed or fail. They go through the motions of looking like effective organizations that are stylistically similar. But without any real understanding of why the practices work, they are essentially just sticking pieces of bamboo in their ears and hoping their projects will land safely. Many of their projects end up crashing, because these are just two different varieties of cargo cult software engineering, similar in their lack of understanding of what makes software projects work. Cargo cult software engineering is easy to identify. Its engineer proponents justify their practices by saying, “We’ve always done it this way in the past,” or “Our company standards require us to do it this way”—even when those ways make no sense. They refuse to acknowledge the trade-offs involved in either process-oriented or commitment-oriented development. Both have strengths and weaknesses. When presented with more effective, new practices, cargo cult software engineers prefer to stay in their wooden huts of familiar, comfortable, and not necessarily effective work habits. “Doing the same thing again and again and expecting different results is a sign of insanity,” the old saying goes. It’s also a sign of cargo cult software engineering. The Real Debate In this magazine and in many other publications, we spend our time debating whether process is good or individual empowerment (in other words, commitment-oriented development) might be better. This is a false dichotomy. Process is good, and FROM THE EDITOR D E PA R T M E N T E D I T O R S Bookshelf: Warren Keuffel, wkeuffel@computer.org so is individual empowerment. The two can exist side by side. Processoriented organizations can ask for an extreme commitment on specific projects. Commitment-oriented organizations can use software engineering practices skillfully. The difference between these two approaches really comes down to differences of style and personality. I have worked on several projects of each style and have liked different things about each style. Some developers enjoy working methodically on an 8-to-5 schedule, which is more common in process-oriented companies. Other developers enjoy the focus and excitement that comes with making a 24 × 7 commitment to a project. Commitment-oriented projects are more exciting on average, but a process-oriented project can be just as exciting when it has a well-defined and inspiring mission. Process-oriented organizations seem to degenerate into their pathological look-alikes less often than commitment-oriented organizations do, but either style can work well if it is skillfully planned and executed. The fact that both project styles have pathological look-alikes has muddied the debate. Some projects conducted in each style succeed, and some fail. That lets a process advocate point to process successes and commitment failures and claim that process is the key to success. It lets the commitment advocate do the same thing, in reverse. The issue that has fallen by the wayside while we’ve been debating is so blatant that, like Edgar Allen Poe’s Purloined Letter, we have overlooked it. We should not be debating process versus commitment; we should be debating competence versus incompetence. The real difference is not which style we choose, but what education, training, and understanding we bring to bear on the project. Rather than sticking with the old, misdirected debate, we should look for ways to raise the average level of developer and manager competence. That will improve our chances of success regardless of which development style we choose. In the Next Issue Requirements Engineering Validating Requirements for Voice Communication How Videoconferencing and Computer Support Affect Requirements Negotiation A Reference Model for Requirements and Specifications Also: Automated Support for GQM Measurement Linguistic Rules for OO Analysis In Future Issues CMM across the Enterprise: Reports from a Range of CMM Level 5 Companies Process Diversity: Other Ways to Build Software Malicious IT: The Software vs. the People Software Engineering in the Small Applying Usability Techniques during Development Recent Developments in Software Estimation Global Software Development Culture at Work: Karen Mackey, Cisco Systems, kmackey@best.com Loyal Opposition: Robert Glass, Computing Trends, rglass@indiana.edu Manager: Don Reifer, dreifer@sprintmail.com Quality Time: Jeffrey Voas, Reliable Software Technologies Corp., jmvoas@rstcorp.com Soapbox: Tomoo Matsubara, Matsubara Consulting, matsu@computer.org Softlaw: Larry Graham, Black, Lowe, and Graham, graham@blacklaw.com STAFF Managing Editor Dale C. Strok dstrok@computer.org Group Managing Editor Dick Price Associate Editor Dennis Taylor News Editor Crystal Chweh Staff Editor Jenny Ferrero Assistant Editors Cheryl Baltes and Shani Murray Magazine Assistants Dawn Craig and Angela Delgado software@computer.org Art Director Toni Van Buskirk Cover Illustration Dirk Hagner Technical Illustrator Alex Torres Production Artist Carmen Flores-Garvey Executive Director and Chief Executive Officer T. Michael Elliott Publisher Angela Burgess Membership/Circulation Marketing Manager Georgann Carter Advertising Manager Patricia Garvey Advertising Assistant Debbie Sims CONTRIBUTING EDITORS Dale Adams, Greg Goth, Nancy Mead, Ware Myers, Keri Schreiner, Pradip Srimani Editorial: All submissions are subject to editing for clarity, style, and space. Unless otherwise stated, bylined articles and departments, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE Software does not necessarily constitute endorsement by the IEEE or the IEEE Computer Society. To Submit: Send 2 electronic versions (1 word-processed and 1 postscript or PDF) of articles to Magazine Assistant, IEEE Software, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314; software@computer.org. Articles must be original and not exceed 5,400 words including figures and tables, which count for 200 words each. March/April 2000 IEEE SOFTWARE 13 1 of 17 Syntax, Semantics, Micronesian cults and Novice Programmers | Microsoft Docs 03/01/2004 • 6 minutes to read I've had this idea in me for a long time now that I've been struggling with getting out into the blog space. It has to do with the future of programming, declarative languages, Microsoft's language and tools strategy, pedagogic factors for novice and experienced programmers, and a bunch of other stuff. All these things are interrelated in some fairly complex ways. I've come to the realization that I simply do not have time to organize these thoughts into one enormous essay that all hangs together and makes sense. I'm going to do what blogs do best -- write a bunch of (comparatively!) short articles each exploring one aspect of this idea. If I'm redundant and prolix, so be it. Today I want to blog a bit about novice programmers. In future essays, I'll try to tie that into some ideas about the future of pedagogic languages and languages in general. : I'd appreciate your feedback on whether this makes sense or it's a bunch of useless theoretical posturing. : I'd appreciate your feedback on what you think are the vital concepts that you had to grasp when you were learning to program, and what you stress when you mentor new programmers. An intern at another company wrote me recently to say "I am working on a project for an internship that has lead me to some scripting in vbscript. Basically I don't know what I am doing and I was hoping you could help. " The writer then included a chunk of script and a feature request. I've gotten requests like this many times over the years; there are a lot of novice programmers who use script, for the obvious reason that we designed it to be appealing to novices. Well, as I wrote last Thursday, there are times when you want to teach an intern to fish, and times when you want to give them a fish. I could give you the line of code that implements the feature you want. And then I could become the feature request server for every intern who doesn't know what they're doing… nope. Not going to happen. Sorry. Down that road lies cargo cult programming, and believe me, you want to avoid that road. 2 of 17 Syntax, Semantics, Micronesian cults and Novice Programmers | Microsoft Docs What's cargo cult programming? Let me digress for a moment. The idea comes from a true story, which I will briefly summarize: During the Second World War, the Americans set up airstrips on various tiny islands in the Pacific. After the war was over and the Americans went home, the natives did a perfectly sensible thing -- they dressed themselves up as ground traffic controllers and waved those sticks around. They mistook cause and effect -- they assumed that the guys waving the sticks were the ones making the planes full of supplies appear, and that if only they could get it right, they could pull the same trick. From our perspective, we know that it's the other way around -- the guys with the sticks are there the planes need them to land. No planes, no guys. The cargo cultists had the unimportant surface elements right, but did not see enough of the whole picture to succeed. They understood the but not the . There are lots of cargo cult programmers -. Therefore, they cannot make meaningful changes to the program. They tend to proceed by making random changes, testing, and changing again until they manage to come up with something that works. (Incidentally, Richard Feynman wrote a great essay on cargo cult science. Do a web search, you'll find it.) Beginner programmers: do not go there! Programming courses for beginners often concentrate heavily on getting the syntax right. By "syntax" I mean the actual letters and numbers that make up the program, as opposed to "semantics", which is the meaning of the program. As an analogy, "syntax" is the set of grammar and spelling rules of English, "semantics" is what the sentences mean. Now, obviously, you have to learn the syntax of the language -- unsyntactic programs simply do not run. But what they don't stress in these courses is that The cargo cultists had the syntax -- the formal outward appearance -- of an airstrip down cold, but they sure got the semantics wrong. To make some more analogies, it's like playing chess. Anyone can learn . Playing a game where the strategy makes sense is the hard (and interesting) part. Every VBScript statement has a meaning. Passing the right arguments in the right order will come with practice, but getting the meaning right 3 of 17 Syntax, Semantics, Micronesian cults and Novice Programmers | Microsoft Docs requires thought. You will eventually find that some programming languages have nice syntax and some have irritating syntax, but that it is largely irrelevant. It doesn't matter whether I'm writing a program in VBScript, C, Modula3 or Algol68 -- all these languages have different syntaxes, but very similar semantics. You also need to understand and use are . High-level languages like VBScript already give you a huge amount of abstraction away from the underlying hardware and make it easy to do even more abstract things. Beginner programmers often do not understand what abstraction is. Here's a silly example. Suppose you needed for some reason to compute 1 + 2 + 3 + .. + n for some integer n. You could write a program like this: n = InputBox("Enter an integer") Sum = 0 For i = 1 To n Sum = Sum + i Next MsgBox Sum Now suppose you wanted to do this calculation many times. You could replicate the middle four lines over and over again in your program, or you could : Function Sum(n) Sum = 0 For i = 1 To n Sum = Sum + i Next End Function n = InputBox("Enter an integer") MsgBox Sum(n) That is -- you can write up routines that make your code look cleaner because you have less duplication. But . The power of abstraction is that . One day you realize that your sum function is inefficient, and you can use 4 of 17 Syntax, Semantics, Micronesian cults and Novice Programmers | Microsoft Docs Gauss's formula instead. You throw away your old implementation and replace it with the much faster: Function Sum(n) Sum = n * (n + 1) / 2 End Function The code which calls the function doesn't need to be changed. If you had not abstracted this operation away, you'd have to change all the places in your code that used the old algorithm. A study of the history of programming languages reveals that we've been moving steadily towards languages which support more and more powerful abstractions. Machine language abstracts the program with in the machine, allowing you to . Assembly language abstracts the abstracts the into into higher concepts like abstracts even farther by allowing variables to refer to .C . C++ which contain both . XAML abstracts away the notion of a class by providing a for object relationships. To sum up, Eric's advice for novice programmers is: The rest is just practice. February 29, 2004 The comment has been removed February 29, 2004 My next piece of advice would be: Learn to use your debugger. I see it so often on message boards where a novice's code isn't working right, and they Werk Titel: Numerische Mathematik Verlag: Springer Verlag Jahr: 1969 Kollektion: Mathematica Digitalisiert: Niedersächsische Staats- und Universitätsbibliothek Göttingen Werk Id: PPN362160546_0013 PURL: http://resolver.sub.uni-goettingen.de/purl?PPN362160546_0013 LOG Id: LOG_0038 LOG Titel: Gaussian Elimination is not Optimal. LOG Typ: article Übergeordnetes Werk Werk Id: PPN362160546 PURL: http://resolver.sub.uni-goettingen.de/purl?PPN362160546 Terms and Conditions The Goettingen State and University Library provides access to digitized documents strictly for noncommercial educational, research and private purposes and makes no warranty with regard to their use for other purposes. Some of our collections are protected by copyright. Publication and/or broadcast in any form (including electronic) requires prior written permission from the Goettingen State- and University Library. Each copy of any part of this document must contain there Terms and Conditions. With the usage of the library's online system to access or download a digitized document you accept the Terms and Conditions. Reproductions of material on the web site may not be made for or donated to other repositories, nor may be further reproduced without written permission from the Goettingen State- and University Library. For reproduction requests and permissions, please contact us. If citing materials, please give proper attribution of the source. Contact Niedersächsische Staats- und Universitätsbibliothek Göttingen Georg-August-Universität Göttingen Platz der Göttinger Sieben 1 37073 Göttingen Germany Email: gdz@sub.uni-goettingen.de Adaptive Strassen’s Matrix Multiplication Paolo D’Alberto Alexandru Nicolau Yahoo! Sunnyvale, CA Dept. of Computer Science University of California Irvine pdalbert@yahoo-inc.com nicolau@ics.uci.edu ABSTRACT General Terms Strassen’s matrix multiplication (MM) has benefits with respect to any (highly tuned) implementations of MM because Strassen’s reduces the total number of operations. Strassen achieved this operation reduction by replacing computationally expensive MMs with matrix additions (MAs). For architectures with simple memory hierarchies, having fewer operations directly translates into an efficient utilization of the CPU and, thus, faster execution. However, for modern architectures with complex memory hierarchies, the operations introduced by the MAs have a limited in-cache data reuse and thus poor memory-hierarchy utilization, thereby overshadowing the (improved) CPU utilization, and making Strassen’s algorithm (largely) useless on its own. In this paper, we investigate the interaction between Strassen’s effective performance and the memory-hierarchy organization. We show how to exploit Strassen’s full potential across different architectures. We present an easy-to-use adaptive algorithm that combines a novel implementation of Strassen’s idea with the MM from automatically tuned linear algebra software (ATLAS) or GotoBLAS. An additional advantage of our algorithm is that it applies to any size and shape matrices and works equally well with row or column major layout. Our implementation consists of introducing a final step in the ATLAS/GotoBLAS-installation process that estimates whether or not we can achieve any additional speedup using our Strassen’s adaptation algorithm. Then we install our codes, validate our estimates, and determine the specific performance. We show that, by the right combination of Strassen’s with ATLAS/GotoBLAS, our approach achieves up to 30%/22% speed-up versus ATLAS/GotoBLAS alone on modern high-performance single processors. We consider and present the complexity and the numerical analysis of our algorithm, and, finally, we show performance for 17 (uniprocessor) systems. Algorithms Categories and Subject Descriptors G.4 [Mathematics of Computing]: Mathematical Software; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures; D.2.3 [Software Engineering]: Coding Tools and Techniques—Top-down programming Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS’07 June 18-20, Seattle, WA, USA. Copyright 2007 ACM 978-1-595993-768-1/07/0006 ...$5.00. Keywords Matrix Multiplications, Fast Algorithms 1. INTRODUCTION In the last 30 years, the complexity of processors on a chip is following accurately Moore’s law; that is, the number of transistors per chip doubles every 18 months. Unfortunately, the steady increase of processor integration does not always result in a proportional increase of the system performance on a given application (program); that is, the same application does not double its performance when it runs on a new state-of-the-art architecture every 18 months. In fact, the performance of an application is the result of an intricate synergy between the two constituent parts of the system: on one side, the architecture composed of processors, memory hierarchy and devices, and, on the other side, the code composed of algorithms, software packages and libraries steering the computation on the hardware. When the architecture evolves, the code must adapt so that the system can deliver peak performance. Our main interest is the design and implementation of codes that embrace the architecture evolution. We want to write efficient and easy to maintain codes, which any developer can use for several generations of architectures. Adaptive codes attempt to provide just that. In fact, they are an effective solution for the efficient utilization of (and portability across) complex and always-changing architectures (e.g., [16, 11, 30, 19]). In this paper, we discuss a single but fundamental kernel in dense linear algebra: matrix multiply (MM) for any size and shape matrices stored in double precision and in standard row or column major layout [20, 15, 14, 33, 3, 18]. We extend Strassen’s algorithm to deal with rectangular and arbitrary-size matrices so as to exploit better data locality and number of operations, and thus performance, than previously proposed versions [22]. We consider the performance effects of Strassen’s applied to rectangular matrices directly (i.e., exploiting fewer operations) or, after a cache-oblivious problem division, to (almost) square matrices only (i.e., exploiting data locality). We show that for some architectures, the latter can outperform the former, (in contrast of what estimated by Knight [25]), and we show that both must build on top of highly efficient O(n3 ) MMs based on the state-of-the-art adaptive software packages such as ATLAS [11] and hand tuned packages such as GotoBLAS [18]. In fact, we show that choosing the right combination of our algorithm with these highly tuned MMs, we can achieve an execution-time reduction up to 30% and 22% when comparet to using alone ATLAS problem, on which Strassen can be applied, and into (extremely irregular) subproblems deploying matrix-by-vector and vector-by-vector computations. and GotoBLAS respectively. We present an extensive quantitative evaluation of our algorithm performance for a large set of different architectures to demonstrate that our approach is beneficial and portable. We discuss also the numerical stability of our algorithm and its practical error evaluation. The paper is organized as follows. In Section 2, we discuss the related work and we highlight our contributions. In Section 3, we present our algorithm and discuss its practical numerical stability and complexity. In Section 4, we present our experimental results; in particular, in Section 4.1, we discuss the performance for any matrix shape and, in Section 4.2, for square matrices only. We conclude in Section 5. 2. 2. Our algorithm applies Strassen’s strategy recursively as many time as a function of the problem size. If the problem size is large enough, the algorithm has a recursion depth that goes as deep as there is any performance advantage. This is in contrast to the approach in [22] where Strassen’s strategy is applied just once. Furthermore, we determine the recursion point empirically by micro-benchmarking at installation time; unlike as in [32], where Strassen’s algorithm is studied in isolation without any performance comparison with highperformance MM. RELATED WORK Strassen’s algorithm is the first and the most used fast algorithm (i.e., breaking the O(n3 ) operation count). Strassen discovered that the original recursive algorithm of complexity O(n3 ) can be reorganized in such a way that one computationally expensive recursive MM step can be replaced with 18 cheaper matrix additions (MA). As a result, Strassen’s algorithm has (asymptotically) fewer operations (i.e., multiplications and additions) O(n2.86 ). Another variant is by Winograd; Winograd replaced one MM with 15 MAs and improved Strassen’s complexity by a constant factor. Our approach can be applied to the Winograd’s variant as well; however, Winograd’s algorithm is beyond the scope of this paper and it will be investigated separately and in the future. In practice however for small matrices, Strassen’s has a significant overhead and a conventional MM results in better performance. To overcome this, several authors have shown hybrid algorithms; that is, deploying Strassen’s MM in conjunction with conventional MM [5, 4, 20], where for a specific problem size n1 , or recursion point [22], Strassen’s algorithm yields the computation to the conventional MM implementations. 1 With the deployment of modern and faster architectures (with fast CPUs and relatively slow memory), Strassen’s has performance appeal for larger and larger problems, undermining its practical benefits. In other words, the evolution of modern architectures —i.e., capable of solving large problems fast— presents a scenario where the recursion point has started increasing [2]. However, the demand for solving larger and larger problems has increased together with the development of modern architectures; this sheds a completely new light on what/when a problem size is actually practical, making Strassen’s approach extremely powerful. Given a modern architecture, finding for what problem sizes Strassen’s is beneficial has never been so compelling. Our approach has three contributions/advantages with respect to previous approaches. 1. Our algorithm divides the MM problems into a set of balanced subproblems without any matrix padding or peeling, so it achieves balanced workload and predictable performance. This balanced division leads to: a cleaner algorithm formulation, an easier parallelization, and little or no work in combining the solutions of the subproblems (conquer step). This strategy differs from the division process proposed by Huss-Lederman et al. [22, 20, 1] leading to fewer operations and better data locality. In this approach [22], the problem division is a function of the matrix sizes such that for oddmatrix sizes, the problem is divided into a large even-size 1 Thus, for a problem of size n ≤ n1 , this hybrid algorithm is the conventional MM; for every matrix size n ≥ n1 , the hybrid algorithm is faster because it applies Strassen’s strategy and thus it exploits all its performance benefits. 3. We store matrices in standard row or column major format and, at any time, we can yield control to a highly tuned MM such as ATLAS’s DGEM M without any overhead. Thus, we can use our algorithm in combination with these highly tuned MM routines with no modifications or change of layout overheads (i.e., estimated as 5–10% of the total execution time [32]). In the literature, there are other fast MM algorithms. For example, Pan showed a bilinar algorithm that is asymptotically faster than Strassen-Winograd [27] O(n2.79 ) and he presented a survey of the topic [28] with best O(n2.49 ). The practical implementation of Pan’s algorithm is presented by Kaporini [23, 24]. New approaches are emerging recently, which promise to be practical and numerically stable [7, 12]. The fastest to date is by Coppersmith and Winograd [8] O(n2.376 ). 3. BALANCED MATRIX MULTIPLICATION In this section, we introduce our algorithm, which is a composition of three different layers/algorithms: on the top level (in this section), we use a cache oblivious algorithm [17] so to reduce the problem to almost square matrices; in the middle level (Section 3.1), we deploy Strassen’s algorithm to reduce the computation work; in the lower level, we deploy ATLAS’s [33] or GotoBLAS [18] to unleash the architecture characteristics. In the following, we introduce briefly our notations and then our algorithms. We identify the size of a matrix A ∈ Mm×n as σ(A)=m×n, where m is the number of rows and n the number of columns of the matrix A. Matrix multiplication is defined for operands of sizes σ(C)=m×p, σ(A)=m×n and σ(B)=n×p, and identified as C=AB, where the component ci,j Pat row i and column j of the result matrix C is defined as ci,j = n k=0 ai,k bk,j . In this paper, we use a simplified notation to identify submatrices. We choose to divide logically a matrix M into at most four submatrices; we label them so that M0 is the first and the largest submatrix, M2 is logically beneath M0 , M1 is on the right of M0 , and M3 is beneath M1 and to the right of M2 (e.g., how to divide a matrix into two submatrices, and into four, see Figure 1). This notation is taken from [6]. In Table 1, we present the framework of our cache-oblivious algorithm that we identify as Balanced MM. This algorithm divides the problem and specifies the operands in such a way that the subproblems have balanced workload and similar operand shapes and sizes. In fact, this algorithm reduces the problem to almost square matrices; that is, A is almost square if m ≤ γm n or n ≤ γn m and γ is usually equal to 2. We aim at the efficient utilization of the higher level of the memory hierarchy; that is, the memory pages Table 1: Balanced Matrix Multiplication C=AB with σ(A)=m×n and σ(B)=n×p Computation Operand Sizes if m ≥ γm max(n, p) then C0 =A0 B C2 =A2 B γm =2 (A is tall) σ(A0 )=d m e×n, σ(C0 )=d m e×p, 2 2 σ(A2 )=b m c×n, σ(C2 )=b m c×p 2 2 else if p ≥ γp max(m, n) then C0 =AB0 C1 =AB1 γp =2 (B is long) σ(C0 )=m×d p2 e, σ(B0 )=n×d p2 e, σ(C1 )=m×b p2 c, σ(B1 )=n×b p2 c, else if n ≥ γn max(m, p) then C=A0 B0 C=C + A1 B2 γn =2 (B is tall and A is long) e, σ(B0 )=d n e×p σ(A0 )=m×d n 2 2 σ(A1 )=m×b n c, σ(B2 )=b n c×p 2 2 else A, B and C almost square see Section 3.1 Strassen C=A∗s B often organized using small and fully associative cache, table lookaside buffer (TLB). This formulation has been proven (asymptotically) optimal [17] in the number of misses. We present detailed performance of this strategy for any matrix size and for three systems in Section 4. This technique breaks down the general problem into small and regular problems to exploit better data locality in the memory hierarchy. Knight [25] presented evidence showing that this approach is not optimal in the sense of number of operations; for example, using directly Strassen decomposition to the rectangular matrices should achieve better performance. In Section 4.1, we address this issue quantitatively and show that we must exploit data locality as much as the operation reduction. In the following section (Section 3.1), we describe our version of Strassen’s algorithm for any (almost square) matrices and, thereby completing the algorithm specified in Table 1. We call it hybrid ATLAS/GotoBLAS–Strassen algorithm (HASA). 3.1 a matrix into almost square submatrices as described previously, we can achieve a balanced division into subcomputations. However, now we do not have submatrices of the same sizes. In such a scenario, we defined the MA between matrices so that we have a fully functional recursive algorithm; that is, we reduce the computation to 7 well defined MMs, where the left-operand columns match the right operand rows, and MAs are between almost square matrices as follows. Consider a matrix addition X=Y + Z (subtraction is similar). Intuitively, when the resulting matrix X is larger than the addenda Y or Z, the computation is performed as if the smaller operands are extended and padded with zeros. Otherwise, if the result matrix is smaller than the operands, the computation is performed as if the larger operands are cropped to fit the result matrix. Formally, X=Y + Z is defined so that σ(X)=m×n, σ(Y)=p×q and σ(Z)=r×s and xi,j =f (i, j) + g(i, j) where: ( yi,j if 0 ≤ i < p ∧ 0 ≤ j < q f (i, j)= 0 otherwise ( zi,j if 0 ≤ i < r ∧ 0 ≤ j < s g(i, j)= 0 otherwise Table 2: HASA C=A∗s B with σ(A)=m×n and σ(B)=n×p Computation if RecursionPoint(A,B) then ATLAS/GotoBLAS C=A∗a B (Divide et impera) e×b p2 c σ(T2 )=d n 2 m σ(M3 )=d 2 e×b p2 c T1 =A2 − A0 T2 =B0 + B1 M6 =T1 ∗s T2 C3 =C3 +M6 e×d n e σ(T1 )=d m 2 2 p σ(T2 )=d n e×d e 2 2 σ(M6 )=d m e×d p2 e 2 T1 =A2 + A3 M2 =T1 ∗s B0 C2 =M2 , C3 =C3 −M2 σ(T1 )=b m c×d n e 2 2 c×d p2 e σ(M2 )=b m 2 T1 =A0 + A3 T2 =B0 + B3 M1 =T1 ∗s T2 C0 =M1 , C3 =C3 +M1 σ(T1 )=d m e×d n e 2 2 n σ(T2 )=d 2 e×d p2 e σ(M1 )=d m e×d p2 e 2 T1 =A0 + A1 M5 =T1 ∗s B3 C0 =C0 −M5 , C1 =C1 +M5 σ(T1 )=d m e×b n c 2 2 σ(M5 )=d m e×b p2 c 2 T1 =A1 − A3 T2 =B2 + B3 M7 =T1 ∗s T2 C0 =C0 +M7 σ(T1 )=d m e×b n c 2 2 σ(T2 )=b n c×d p2 e 2 σ(M7 )=d m e×d p2 e 2 T2 =B2 − B0 M4 =A3 ∗s T2 C0 =C0 +M4 , C2 =C2 +M4 σ(T2 )=b n c×d p2 e 2 p σ(M4 )=b m c×d e 2 2 In this section, we present our generalization of Strassen’s MM algorithm. Our algorithm reduces the number of passes through the data as well as the number of computations because of a balanced division process. In practice, this algorithm is more efficient than previous approaches [31, 22, 32]. The algorithm applies to any matrix sizes (and, thus, shape) rather than only to square matrices like in [9, 10]. A C0 C1 C2 C3 B A0 A1 A2 A3 = B0 B1 B2 B3 * Figure 1: Logical decomposition of matrices in sub-matrices Matrices C, A and B in the MM computation are composed of four balanced sub-matrices (see Figure 1). Consider the operand matrix A with σ(A)=m×n. Now, A is logically composed of four matrices: A0 with σ(A0 )=d m e×d n2 e, A1 with 2 n m n e×b c, A with σ(A )=b c×d e and A3 with σ(A1 )=d m 2 2 2 2 2 2 m n σ(A3 )=b 2 c×b 2 c (similarly, B is composed of four submatrices where σ(B0 )=d n2 e×d p2 e). We generalized Strassen’s algorithm in order to compute the MM regardless of matrix size, as shown in Table 2. Notice that, dividing (e.g., max(m, n, p) < 100) (Solve directly) else { T2 =B1 − B3 M3 =A0 ∗s T2 C1 =M3 , C3 =M3 Hybrid ATLAS/GotoBLAS–Strassen algorithm (HASA) C Operand Sizes } Correctness. To reduce the total number of multiplications, Strassen’s algorithm computes implicitly some products that are necessary and some that are artificial and must be carefully removed from the final result. For example, the product A0 B0 , which is a term of M1 , is a singular product and it is required for the computation of C0 ; in contrast, A0 B3 is an artificial product,2 computed in the same expression, because of the way the algorithm adds submatrices of A and B together in preparation of the recursive MM. Every artificial product must be removed (i.e., subtracted) by combining the different MAs (e.g., M1 + M5 ) so as to achieve the final correct result. By construction, Strassen’s algorithm is correct; here, the only concern is our adaptation to any matrix sizes: First, we note that all MM are well defined —i.e., the number of columns of the left operands is equal to the number of rows of the right operand. Second, all singular products are correctly computed and added. Third, all artificial products are correctly computed and removed. Error w.r.t. DCS 8.0E-13 7.0E-13 ATLAS HASA RBC Input [-1,+1] 6.0E-13 5.0E-13 4.0E-13 3.0E-13 2.0E-13 1.0E-13 0.0E+00 3.2 Numerical considerations 1000 3000 4000 5000 Size Error w.r.t. DCS The classic MM has component-wise and norm-wise error bounds as follows: Component-wise: Norm-wise: 2000 2 |C − Ċ| ≤ nu|A||B| + O(u ), kC − Ċk ≤ n2 ukAkkBk + O(u2 ) . 1.2E-11 1.0E-11 ATLAS HASA RBC Input [0,+1] 8.0E-12 where we identify the exact result with C and the computed solution with Ċ, |A| has components |ai,j |, and kAk= maxij |aij |. For this algorithm, we can apply the same numerical analysis used for Strassen’s algorithm. Strassen’s has been proved weakly stable. Brent and Higham [5, 20] showed that the stability of the algorithm is worsening as the depth of the recursion increases. 6.0E-12 4.0E-12 2.0E-12 kC − Ċk ≤ [( nn1 )log 12 (n21 + 5n1 ) − 5n]ukAkkBk + O(u2 ) ≤ 3` n2 ukAkkBk + O(u2 ) (1) Again, n1 is the size where Strassen’s algorithm yields to the usual MM, and u is the inherent floating point precision. We denote with ` the recursion depth; that is, the number of times we divide the problem. This is a weaker error bound (a norm-wise bound) than the one for the classic algorithm (a component-wise bound) and it is also a pessimistic estimation (i.e., in practice). Demmel and Higman [13] have shown that, in practice, fast MMs can lead to fast and accurate results when they are used in blocked computation of the LAPACK routines. Here, we follows the same procedure used by Higham [21] to quantify the error for our algorithm and for large problems sizes such as to investigate the effect of error in double precision (Figure 2). Input. We restrict the input matrix values to a specific range or intervals: [−1, 1] and [0, 1]. We then initialize the input matrices using a uniformly distributed random number generator. Reference Doubly Compensated Summation. We consider the output of the computation C = AB. We compute every element cij independently and by performing first a dot product of the row vector ai∗ by the column vector b∗j and we store the temporary products zk = aik bkj into a temporary vector z. Then, we sort the vector in decreasing order such that |zi | ≥ |zj | with i < j using the adaptive sorting library [26]. Eventually, we add the vector components so that to compute the reference output using Priest’s doubly compensated summation (DCS) procedure [29] in extended double precision as described in [21]. This is our baseline or ultimate reference. Comparison. We compare the output-value difference (w.r.t of the DCS based MM) of ATLAS algorithm, HASA (Strassen algo2 The product A0 B3 cannot be computed directly because is not defined in general: the columns(A0 ) > rows(B3 ) and thus we should pad B3 or crop A0 properly. 0.0E+00 1000 2000 3000 4000 5000 Size Figure 2: Pentium 4 3.2GHz Maximum error estimation: Recursion point n1 =900, 1 recursion 900≤n≤1800, 2 recursions 1800≤n≤3600, and 3 recursions 3600≤n rithm using ATLAS’s MM) and the naive row-by-column algorithm (RBC) using an accumulator register, for which the summation is not compensated and the values are in the original order. Considerations. In Figure 2, we show the error evaluation w.r.t. the DCS MM for square matrices only. For Strassen’s algorithm the error ratio of HASA over ATLAS is no larger than 10 for both ranges [0, 1] and [−1, 1]. These error ratios are less dramatic than what an upper bound analysis would suggest. That is, each recursion could increases the error as 2` instead of 3` , and in practice, no more than 10 (one precision digit w.r.t. the 16 available). 3.3 Recursion Point and Complexity Strassen’s algorithm embodies two different locality properties because its two basic computations exploit different data locality: matrix multiply MM has spatial and temporal locality, and matrix addition MA has only spatial locality. In this section, we describe how these data reuses affect the algorithm performance and thus our strategy. ed n2 ep + mnd p2 e + d m eb n2 cd p2 e) secThe 7 MMs take 2π(d m 2 2 onds, where π is the efficiency of MM —i.e., pi for product— and 1 is simply the floating point operation per second (FLOPS) of π the computation. The 22 MAs (18 matrix additions and 4 matrix copies) take α[5d n2 e(d m e + d p2 e) + 3mp] seconds (see Table 2), where α is 2 the efficiency of MA —i.e., alpha for addition. For example, we 4. EXPERIMENTAL RESULTS We split this experimental section. We present experimental results for rectangular matrices and square matrices separately. For rectangular matrices, we investigate the effects of the problem sizes and shapes w.r.t. the algorithm choice and the architecture. For square matrices, we show how effective our approach is for a large set of different architectures and odd matrix sizes, and we show detailed performance for a representative set of architectures. 4.1 Rectangular matrices Given a matrix multiply C = AB with σ(A)=m×n and σ(B)=n×p, we characterize this problem size by a triplet s = [m, n, p]. For presentation purpose, we opted to represent Q one matrix multiplication by its number of operations x=2 2i=0 si and plot it on the abscissa (otherwise the performance plot must be a 4-dimension graph, [time, m, n, p]). 4.1.1 HP zv6000, Athlon-64 2GHz using ATLAS, data locality vs. operations In this section, we turn our attention to a general performance comparison of the balanced MM (Table 1) and HASA (Table 2) with respect to the routine cblas dgemm —i.e., ATLAS’s MM for matrices in double precision— for dense but rectangular matrices. In Figure 3, we present the relative performance, we then discuss the process used to collect these results and offer an interpretation. We investigated the input space s∈T×T×T with T={100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 8000, 10000, 12000}; however, we ran only the problems that have a working set (i.e., operands, C, A and B, and one-recursion-step temporaries, 100*(ATLAS-Our Algorithm)/ATLAS perform five MAs of submatrices of A: we perform three additions with A0 , 3d m ed n2 e, one with A1 , d m eb n2 c, and one with A2 , 2 2 m n b 2 cd 2 e. Similarly, we perform MAs with submatrices of B and C. Thus, we find that the problem size [m, n, p] to yield control to the ATLAS/GotoBLAS’s algorithm is when Equation 2 is satisfied: α n m p m n p [5d e(d e + d e) + 3mp]. (2) b cd ed e ≤ 2 2 2 2π 2 2 2 We assume that π and α are functions of the matrix size only, as we explain in the following. Layout effects. The performance of MA is not affected by a specific matrix layout (i.e., row-major format or Z-Morton) or shape as long as we can exploit the only viable reuse: spatial data reuse. We know that data reuse (spatial/temporal) is crucial for matrix multiply. In practice, ATLAS and GotoBLAS cope rather well with the effects of a (limited) row/column major format reaching often 90% of peak performance. Thus, we can assume for practical purpose that π and α are functions of the matrix size only. Square Matrix MM. Combining the performance properties of both matrix multiplications and matrix additions with a more specific analysis for only square matrices; that is, n = m = p. We can simplify Equation 2 and we find that the recursion point n1 is α n1 = 22 . (3) π For example, if we assume a ratio α/π=50 (this is common for the systems tested in this work), we find that the recursion point corresponds to the problem (matrix) size n1 > 1100. For problems of size (`)n1 ≤ n < (` + 1)n1 , we may apply Strassen’s ` times. In fact, the factors π and α are easy to estimate by benchmarking and we can determine the specific recursion point n1 by a linear search. 60 40 20 0 0 20 40 60 80 100 120 140 -20 Billions Operations -40 -60 -80 Balanced HASA dynamic Figure 3: HP ZV6005: Relative performance with respect to ATLAS’s cblas dgemm. M, T1 and T2 ) smaller than the main-memory size (i.e., 512MB). We present relative performance results for two algorithms with respect to cblas dgemm: Balanced and HASA dynamic. Balanced is the algorithm where we determine the recursion point for HASA when we install our codes into this architecture. First, we found the experimental recursion point for square matrices, which is ṅ1 =1500. We then set the HASA (Table 2) to stop the recursion when at least one matrix size is smaller than 1500. HASA dynamic is the algorithm where we determine the recursion point for every specific problem size at runtime (no cacheoblivious strategy). To achieve the same performance of the Balanced algorithm for square matrices, we set a coefficient =20 as s s an additive contribution to the ratio α ∼ α + . At run time and πs πs for each problem s, we measure the performance of MAs (i.e., αs ) and the performance of cblas dgemm (i.e., πs ). We set the recursion point as the problem size satisfying 2b αs n m p m n p cd ed e ≤ ( + )[5d e(d e + d e) + 3mp]. 2 2 2 πs 2 2 2 Performance observations. ATLAS’s peak performance is 3.52 GFLOPS; Balanced and HASA dynamic peak performance is normalized to 3.8 GFLOPS (i.e., 2 ∗ m ∗ n ∗ p/ExecutionT ime, it is an overestimate of the actual number of operations per second but it is still a valid comparison measure). The erratic performance of both algorithms (i.e., from 50% speedup to 20% slowdown and especially for small sizes and very large sizes w.r.t. ATLAS’s cblas dgemm) is not a problem of the recursion point determination, which may cause either the recursion to improve performance unexpectedly or the recursion overhead to choke the execution time. The main reason for the sudden speedups is the poor performance of cblas dgemm; the main reason for the slowdowns is the access of the hard-disk memory space by our algorithm. We conclude that Strassen’s algorithm can be applied successfully to rectangular matrices and we can achieve significant improvements in performance in doing so. However, the improvements are often the result of a better data-locality utilization than just a reduction of operations. For example, HASA dynamic algorithm is bound to have fewer floating point operations than Balanced, because the former apply Strassen’s division more times than the latter especially for non-square matrices; however, Balanced algorithm achieves on average 1.3% execution-time reduc- 20 100*(Goto - Our Algorithm)/ Goto 100*(Goto - Our Algorithm)/ Goto 50 40 30 20 10 0 0 -10 100 200 300 400 Billions Operations 15 10 5 -20 Balanced HASA -30 0 -40 0 Balanced HASA 50 100 150 200 250 300 350 400 450 Billions Operations -50 -5 Figure 4: Optiplex GX280, Pentium 4 3.2GHz, using GotoBLAS’s DGEMM. tion instead HASA dynamic achieves on average 0.5%. 3 In general, Balanced presents very predictable performance with often better peak performance than HASA dynamic. 4.1.2 GotoBLAS, Strassen vs. Faster MM Recently, GotoBLAS MM has replaced ATLAS MM and the former has become the fastest MM for most of state-of-the-art architectures. In this section, we show that our algorithm is almost unaffected by the change of the leaf implementation (i.e., when the recursion yields to ATLAS/GotoBLAS), leading to comparable and even better improvements. We investigated the input space s∈T×T×T with T={1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000}. We present relative performance results for two algorithms with respect to GotoBLAS DGEMM (resp. Balanced and HASA) and for two architectures Athlon 64 2.4GHz and Pentium 4 3.2 GHz. Balanced and HASA are tuned offline (i.e., recursion point determined once for all). Optiplex GX280, Pentium 4 3.2GHz using GotoBLAS. GotoBLAS peak performance is 5.5 GFLOPS; Balanced and HASA dynamic peak performance is normalized to 6.2 GFLOPS (for comparison purpose). For this architecture, the recursion point is empirically found at n1 = 1000 and we stop the recursion when a matrix size is smaller than n1 . By construction, HASA algorithm has fewer instructions than Balanced because it applies Strasssen’s division to larger matrices; however, HASA achieves smaller relative improvement w.r.t. GotoBLAS MM than what the Balance algorithm achieves (on average Balanced achieves 7.2% speedup and HASA achieves 5.8%), see Figure 4. This suggests that the algorithm with better data locality delivers better performance than the algorithm with fewer operations (for this architecture). Notice that the performance plot has an erratic behavior, which is similar to the previous scenario (i.e., Figure 3); however, in this case, the performance is affected by the architecture as we show in the following using the same library but a different system. Altura 939, Athlon-64 2.5GHz using GotoBLAS. GotoBLAS peak performance is 4.5 GFLOPS, Balanced and HASA dynamic peak performance is normalized to 5.4 GFLOPS (as comparison). For this architecture (with a faster memory hierarchy and processor than in Section 4.1), the recursion point is empirically found at n1 = 900 and we stop the recursion when a matrix size is smaller 3 The input set has mostly small problems, thus the average time reduction is biased towards small values. Figure 5: Altura 939, Athlon 64 2.5GHz, using GotoBLAS’s DGEMM. than n1 . Both algorithms Balanced and HASA have similar performance; that is, on average HASA achieves 7.7% speedup and Balanced achieves 7.5% speedup. Also, for this architectures, the performance is very predictable and the performance plot shows the levels of recursion applied clearly, see Figure 5 (clearly 3 levels and a fourth for very large problems). 4.2 Square Matrices For square matrices, we measure only the performance of HASA because the Balanced algorithm will call directly HASA. For each machine, we had three basic kernels: the C-code implementation cblas dgemm of MM in double precision from ATLAS, hybrid ATLAS/Strassen algorithm (HASA) —in Table 2 where the leaf computation is the cblas dgemm— and our hand-coded MA, which is tailored to each architecture. In Table 3, we present a summary of the experimental results (for rectangular matrices as well) but, in this section, we present details for only four architectures (see [9, 10] for more results). Installation. We installed the software package ATLAS on every architectures. The ATLAS routine cblas dgemm is the MM we used as reference: we time this routine for each architecture so to determine our baseline and the coefficient π (i.e.,, for square matrices of size 1000), and this routine is also the leaf computation of HASA. We timed the execution of MA (i.e.,, for square matrices of size 1000) [9, 10] and we determined α. Once we have π and α, we π determined the recursion point n1 =22 α , which we have used to install our codes. We then determined experimentally the recursion π point ṅ1 (i.e., n1 =22( α + )) based on a simple linear search. Performance presentation. We present two measures of performance: the relative execution time of HASA over cblas dgemm and cblas dgemm relative MFLOPS over ideal machine peak performance (i.e., maximum number of multiply-add operations per second). In fact, the execution time is what any final user cares about when comparing two different algorithms. However, a measure of performance for cblas dgemm, such as MFLOPS, shows whether or not HASA improves the performance of a MM kernel that is either efficiently or poorly designed. This basic measure is important in as much as it shows the space for improvement for both cblas dgemm and HASA. We use the following terminology: HASA is the recursion algorithm for which the recursion point is based on the experimental Table 3: Systems and performance: π1 106 is the performance of cblas dgemm or DGEMM (GotoBLAS) in MFLOPS for n=1000; 1 106 is the performance of MA in MFLOPS for n=1000; n1 is the theoretical recursion point as estimated in 22 α ; instead, ṅ1 is the α π measured recursion point. System Fujitsu HAL 300 RX1600 ES40 Altura 939 Optiplex GX280 RP5470 Ultra 5 ProLiant DL140 ProLiant DL145 Ultra-250 HP ZV 6005 Sun-Fire-V210 Sun Blade ASUS Unknown server Fosa SGI O2 Processors SPARC64 100MHz Itanium 2@1.0GHz Alpha ev67 4@667MHz Athlon 64 2.45MHz Pentium 4 3.2GHz 8600 PA-RISC 550MHz UltraSparc2 300MHz Xeon 2@3.2GHz Opteron 2@2.2GHz UltraSparc2 2@300MHz Athlon 64 2GHz UltraSparc3 1GHz UltraSparc2 500MHz AthlonXP 2800+ 2GHz Itanium 2@700MHz Pentium III 800MHz MIPS 12K 300MHz π ṅ1 (i.e., =22( α + )), which is different for each architecture, and it is statically computed once. The S-k is the Strassen’s algorithm for which k is the recursion depth before yielding to cblas dgemm, independently of the problem size. Note that we did not report negative relative performance for S-k because they would have mostly negative bars cluttering the charts and the results. So for clarity, we omitted the correspondent negative bars in the charts. The performance obtained by the systems in Table 3, are obtained by the collection of the best performance among several trials and thus with hot caches. Performance interpretation. In principle, the S-1 algorithm should have about sp1 = (1 − ṅ1 /n)/8 relative speedup where ṅ1 is the recursion point as found as in Table 3 and n is the problem size (e.g., it is about 7–10% for our set of systems and n=5000). The speedup, P` for the algorithm with recursion depth `, is additive, sp = i=1 spi ; however, each recursion contribution is always less than the first one (spi < sp1 ), because the number of operations saved is decreasing when going down the division process. In the best case, for a three level recursion depth, we should achieve about 3sp1 ∼ 24−30%. We achieve such a speedup for at least one architecture, the system ES40, Figure 7 (and similar trend for the Altura system Figure 5). However, for the other architectures, a recursion depth of three is often harmful. 5. CONCLUSIONS We have presented a practical implementation of Strassen’s algorithm, which applies an adaptive algorithm to exploit highly tuned MMs, such as ATLAS’s. We differ from previous approaches in that we use an-easy-to-adapt recursive algorithm using a balanced division process. This division process simplifies the algorithm and it enables us to combine an easy performance model and highly tuned MM kernels so that to determine off-line and at installation what is the best strategy. We have tested extensively the performance of our approach on 17 systems and we have shown that Strassen is not always applicable and, for modern systems, the recursion point is quite large; that is, the problem size where Strassen’s start having a performance edge is quite large (i.e., matrix sizes larger than 1000 × 1000). However, speedups up to 30% are observed over already tuned MM using this hybrid approach. We conclude by observing that a sound experimentation envi- 1 106 π 1 106 α 177 3023 1240 4320 4810 763 407 2395 3888 492 3520 1140 460 2160 2132 420 320 10 105 41 110 120 21 9 53 93 10 71 22 8 39 27 4 2 n1 =22 α π 390 487 665 860 900 772 984 995 918 1061 1106 1140 1191 1218 1737 2009 2816 ṅ1 400 725 700 900 1000 1175 1225 1175 1175 1300 1500 1150 1884 1300 2150 N/A N/A Figure – – Fig. 7 Fig. 5 Fig. 4 Fig. 8 – – Fig. 9 – Fig. 3 – – – – – – ronment in combination with a simple complexity model —which quantifies the interactions among the kernels of an application and the underlying architecture— can go a long way in helping the design of complex-but-portable codes. Such metrics can improve the design of the algorithms and may serve as a foundation for a fully automated approach. 6. ACKNOWLEDGMENTS The first author worked on this project during his post-doctorate fellowship in the SPIRAL Project at the Department of Electric and Computer Engineering in the Carnegie Mellon University and his work was supported in part by DARPA through the Department of Interior grant NBCH1050009. 7. REFERENCES [1] D. Bailey, K. Lee, and H. Simon. Using strassen’s algorithm to accelerate the solution of linear systems. J. Supercomput., 4(4):357–371, 1990. [2] D. H. Bailey and H. R. P. Gerguson. A Strassen-Newton algorithm for high-speed parallelizable matrix inversion. In Supercomputing ’88: Proceedings of the 1988 ACM/IEEE conference on Supercomputing, pages 419–424. IEEE Computer Society Press, 1988. [3] G. Bilardi, P. D’Alberto, and A. Nicolau. Fractal matrix multiplication: a case study on portability of cache performance. In Workshop on Algorithm Engineering 2001, Aarhus, Denmark, 2001. [4] R. P. Brent. Algorithms for matrix multiplication. Technical Report TR-CS-70-157, Stanford University, Mar 1970. [5] R. P. Brent. Error analysis of algorithms for matrix multiplication and triangular decomposition using Winograd’s identity. Numerische Mathematik, 16:145–156, 1970. [6] S. Chatterjee, A.R. Lebeck, P.K. Patnala, and M. Thottethodi. Recursive array layout and fast parallel matrix multiplication. In Proc. 11-th ACM SIGPLAN, June 1999. [7] H. Cohn, R. Kleinberg, B. Szegedy, and C. Umans. Group-theoretic algorithms for matrix multiplication, Nov 2005. % speedup w.r.t. cblas_dgemm % speedup w.r.t. cblas_dgemm 16 HASA S-1 S-2 S-3 13 10 7 4 22 17 12 7 1 2 725 1175 1625 2075 2525 2975 3425 3875 4325 4775 N 700 -3 100 100 80 60 950 1400 1850 2300 2750 3200 3650 4100 4550 5000 N -2 % cblas_dgemm w.r.t. peak % cblas_dgemm w.r.t. peak HASA S-1 S-2 S-3 27 80 60 40 40 20 20 0 0 950 725 1175 1625 2075 2525 2975 3425 3875 4325 4775 1400 1850 2300 2750 3200 3650 4100 4550 Figure 6: RX1600 Itanium 2@1.0GHz. [8] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the 19-th annual ACM conference on Theory of computing, pages 1–6, 1987. [9] P. D’Alberto and A. Nicolau. Adaptive Strassen and ATLAS’s DGEMM: A fast square-matrix multiply for modern high-performance systems. In The 8th International Conference on High Performance Computing in Asia Pacific Region (HPC asia), pages 45–52, Beijing, Dec 2005. [10] P. D’Alberto and A. Nicolau. Using recursion to boost ATLAS’s performance. In The Sixth International Symposium on High Performance Computing (ISHPC-VI), 2005. [11] J. Demmel, J. Dongarra, E. Eijkhout, E. Fuentes, E. Petitet, V. Vuduc, R.C. Whaley, and K. Yelick. Self-Adapting linear algebra algorithms and software. Proceedings of the IEEE, special issue on ”Program Generation, Optimization, and Adaptation”, 93(2), 2005. [12] J. Demmel, J. Dumitriu, O. Holtz, and R. Kleinberg. Fast matrix multiplication is stable, Mar 2006. [13] J. Demmel and N. Higham. Stability of block algorithms with fast level-3 BLAS. ACM Transactions on Mathematical Software, 18(3):274–291, 1992. [14] N. Eiron, M. Rodeh, and I. Steinwarts. Matrix multiplication: a case study of algorithm engineering. In Proceedings WAE’98, Saarbru̇cken, Germany, Aug 1998. [15] J.D. Frens and D.S. Wise. Auto-Blocking matrix-multiplication or tracking BLAS3 performance from 5000 N N Figure 7: ES40 Alpha ev67 4@667MHz. [16] [17] [18] [19] [20] [21] [22] source code. Proc. 1997 ACM Symp. on Principles and Practice of Parallel Programming, 32(7):206–216, July 1997. M. Frigo and S. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, special issue on ”Program Generation, Optimization, and Adaptation”, 93(2):216–231, 2005. M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran. Cache oblivious algorithms. In Proceedings 40th Annual Symposium on Foundations of Computer Science, 1999. K. Goto and R.A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software. J.A. Gunnels, F.G. Gustavson, G.M. Henry, and R.A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4):422–455, December 2001. N.J. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Trans. Math. Softw., 16(4):352–368, 1990. N.J. Higham. Accuracy and Stability of Numerical Algorithms, Second Edition. SIAM, 2002. S. Huss-Lederman, E.M. Jacobson, A. Tsao, T. Turnbull, and J.R. Johnson. Implementation of Strassen’s algorithm for matrix multiplication. In Supercomputing ’96: Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), page 32. ACM Press, 1996. HASA S-1 S-2 S-3 12 % speedup w.r.t. cblas_dgemm % speedup w.r.t. cblas_dgemm 15 9 6 HASA S-1 S-2 S-3 19 14 9 3 4 0 1625 2075 2525 2975 3425 3875 4325 4775 N -1 1175 1625 2075 2525 2975 3425 3875 4325 4775 N 2975 3425 3875 4325 4775 N 100 100 % cblas_dgemm w.r.t. peak % cblas_dgemm w.r.t. peak 1175 80 60 80 60 40 40 20 20 0 0 1175 1625 2075 2525 2975 3425 3875 4325 4775 N 1175 1625 2075 2525 Figure 8: RP5470 8600 PA-RISC 550MHz. Figure 9: ProLiant DL145 Opteron 2@2.2GHz. [23] I. Kaporin. A practical algorithm for faster matrix multiplication. Numerical Linear Algebra with Applications, 6(8):687–700, 1999. Centre for Supercomputer and Massively Parallel Applications, Computing Centre of the Russian Academy of Sciences, Vavilova 40, Moscow 117967, Russia. [24] Igor Kaporin. The aggregation and cancellation techniques as a practical tool for faster matrix multiplication. Theor. Comput. Sci., 315(2-3):469–510, 2004. [25] P. Knight. Fast rectangular matrix multiplication and QR-decomposition. Linear algebra and its applications, 221:69–81, 1995. [26] X. Li, M. Garzaran, and D. Padua. Optimizing sorting with genetic algorithms. In In In Proc. of the Int. Symp. on Code Generation and Optimization, pages 99–110, March 2005. [27] V. Pan. Strassen’s algorithm is not optimal: Trililnear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations. In FOCS, pages 166–176, 1978. [28] V. Pan. How can we speed up matrix multiplication? SIAM Review, 26(3):393–415, 1984. [29] D. Priest. Algorithms for arbitrary precision floating point arithmetic. In P. Kornerup and D. W. Matula, editors, Proceedings of the 10th IEEE Symposium on Computer Arithmetic (Arith-10), pages 132–144, Grenoble, France, 1991. IEEE Computer Society Press, Los Alamitos, CA. [30] M. Püschel, J.M.F. Moura, J. Johnson, D. Padua, M. Veloso, B.W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R.W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on ”Program Generation, Optimization, and Adaptation”, 93(2), 2005. [31] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 14(3):354–356, 1969. [32] M. Thottethodi, S. Chatterjee, and A.R. Lebeck. Tuning Strassen’s matrix multiplication for memory efficiency. In Proc. Supercomputing, Orlando, FL, nov 1998. [33] R. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), pages 1–27. IEEE Computer Society, 1998. © 1968 Nature Publishing Group © 1968 Nature Publishing Group © 1968 Nature Publishing Group © 1968 Nature Publishing Group A Taxonomy for Test Oracles Douglas Hoffman Software Quality Methods, LLC. Phone 408-741-4830 Fax 408-867-4550 doug.hoffman@acm.org Keywords: Automated Testing, Model of Testing, Software Under Test, Test Oracles, Test Verification, Test Validation Abstract Software test automation is often a difficult and complex process. The most familiar aspects of test automation are organizing and running of test cases and capturing and verifying test results. A set of expected results are needed for each test case in order to check the test results. Generation of these expected results is often done using a mechanism called a test oracle. This paper describes classes of oracles for various types of automated software verification and validation. Several relevant characteristics of oracles are included and the advantages and disadvantages for each class covered. Background Software testing is a process of providing inputs to software under test (SUT) and evaluating the results. In software testing, the mechanism used to generate expected results is called an oracle. (In this paper, the first letter will be capitalized when referring to an Oracle for a specific test.) Many different approaches can be used to generate, capture, and compare test results. The author, for example, at one time or another has used the following methods for generating expected results: • • • • • • • • • Manual verification of results (human oracle) Separate program implementing the same algorithm Simulator of the software system to produce parallel results Debugged hardware simulator to emulate hardware and software operations Earlier version of the software Same version of software on a different hardware platform Check of specific values for known responses Verification of consistency of generated values and end points Sampling of values against independently generated expected results Test automation usually requires incorporation of Oracles into the testing process so test outcomes can be evaluated. Automating the verification of results has significant implications on both the test case and Oracle design. Because of the current high machine speeds and low cost of memory, test cases can generate very large amounts of data, with corresponding amounts of Oracle data needed for comparison. One or both sets of data can be generated and stored for comparison and then discarded if Quality Week 1998 Douglas Hoffman Page 1 3/30/98 no differences are found. When data comparison is incorporated into test cases, effort is required to design each test to include error handling, reporting differences, and capturing error results. When the comparisons are done separately the effort is not repeated, but standards must be employed for formatting and storing inputs and results. Many organizations today depend on a human oracle to verify test results. The tester is expected to know how the software will work, and they are expected to know when the software misbehaves. This often happens by default for manual testing, and is usually the case for GUI testing. A human oracle is not satisfactory for several reasons when test cases are automated. The volume of data from automated tests is often overwhelming. A person may not keep up with analyzing displayed information before the system changes it. Not all effects of a test case are available and displayed for a person to observe. The automated testing process is tedious and requires concentration for arbitrarily long periods. A person also becomes quickly trained on what to expect, and once trained is likely to overlook minor deviations (errors). A worse situation occurs with automated tests when tests run without benefit of any verification. The result from merely running a test is nearly always the same whether or not a fault is encountered – program termination. Based on experience, very few errors cause noticeable abnormal test termination. Unless test results are verified it requires a spectacular event to show that an error has occurred. When a batch of automated tests is run with only cursory checks, we may only learn that something went wrong somewhere, without a clue about the likely cause. Some automated mechanism is needed to check the results from automated tests. Creating an oracle to verify values for a mathematical subroutine may be straightforward by using a different algorithm, language, compiler, etc. At the other extreme, an Oracle for the interrupt handling of an operating system kernel is far more difficult to create. Hardware and system emulators need to be created, and parallel mechanisms for causing specific events need to be put in place for both the SUT and the Oracle. Timing and synchronization between the SUT and Oracle are also extremely difficult to manage to correctly verify software operation. The difficulty in creating most test oracles falls somewhere between the two extremes. It is often impractical to generate complete sets of expected results. It is particularly difficult to generate expected information for file directories, machine registers, system tables, memory, etc. Usually these aspects and side-effects of the SUT are ignored when tests are verified unless there is a gross, obvious problem. This is also true when the tests are manually run. A Simple Model for Automated Tests Figure 1 shows an Input-Process-Output model for black box testing. The test case is a set of inputs and verification is done by observing the results. SUT’s very seldom fit this model, however, as they have multiple, complex inputs and results. We need to know the values for all of the inputs and check all of the results in order to know whether the SUT responds properly. Also, some of the results from software execution are only indirectly related to the functions we are exercising in our test. Test Quality Week 1998 Douglas Hoffman Page 2 3/30/98 results include such things as residual values left in memory, program states for the SUT and other software, instrument control signals, and data base values. System Under Test Test Inputs Test Results Figure 1: I-P-O Testing Model Figure 2 shows a more complete model for software testing, including more categories of inputs to and results from a test. To determine whether the SUT responds properly, we need to know or set all of the inputs and check all of the results. Because of the vast possible outcomes from running a program, test designers select what they consider are relevant inputs and results, and then choose a subset of these to use in predicting and verifying program behavior. The test case input values are only one part of the stimulus for a test, and even thorough test plans identify only some of the test case preconditions. The environmental inputs are seldom spelled out in detail. Test Inputs Test Results Precondition Data Postcondition Data Precondition Program State System Under Test Environmental Inputs Postcondition Program State Environmental Results Figure 2: Expanded Testing Model Quality Week 1998 Douglas Hoffman Page 3 3/30/98 Several observations can be made when introducing an oracle into the model. Different types of oracles are needed for different types of software. The domain, range, and form of input and output data varies substantially between programs. Most software has multiple forms of inputs and results so several oracles may be needed for a single software program. Different characteristics in a program may require separate oracles. For example, a program’s results may include computed functions, screen navigations, and asynchronous event handling. Several oracles may need to work together because of interactions of common inputs. In the case of a word processor, pagination changes are based upon characteristics such as the font and font size, while the test case may be about color compatibility. An oracle for pagination has to factor in fonts even when a test case is about color. Although an oracle may be excellent at predicting certain results, only the SUT running in the target environment will process all of the inputs and provide all of the results. No matter how meticulous we are in creating an oracle, we will not achieve both independence and completeness. Because using a single oracle may be impractical to model all system behaviors for the SUT, this paper will assume that oracles are created for specific purposes. This simplifying assumption holds since an oracle that completely models SUT behavior can be considered to be composed of several special purpose oracles focusing on specific SUT behaviors. The special purpose oracle can then completely predict SUT behaviors for which it is designed. We can add other oracles to predict other behaviors and results from the SUT. (In practice, most test oracles focus on modeling straightforward behaviors, and we apply different oracles at different times to check program behaviors such as functionality, screen navigations, or memory use.) The characteristics of these focused oracles can be at the extremes of our measurements. Characteristics of Oracles There are several characteristics we might measure relating an oracle to the SUT. Table 1 provides a list of some useful measures for oracles. Each of these characteristics describe a correspondence between an oracle and the SUT and measures can range from no relationship to exact duplication. Completeness, for example, can range from no predictions (which is not very useful) to exact duplication in all results categories (a second implementation of the SUT). • • • • • • • Quality Week 1998 Completeness of information from oracle Accuracy of information from oracle Independence of oracle from SUT • Algorithms • Sub-programs and libraries • System platform • Operating environment Speed of predictions Time of execution of oracle Usability of results Correspondence (currency) of oracle through changes in the SUT Douglas Hoffman Page 4 3/30/98 Table 1: Oracle Characteristics It is easy to see that the more complete and accurate an oracle is, the more complex it has to be. Indeed, if the oracle exactly predicts all results from the SUT it will be at least as complex. This also means that the better an oracle is at providing expected results, the more likely that detected differences are due to faults in the oracle rather than the SUT. Likewise, the more an oracle predicts about program state and environment conditions, the more dependent the oracle is on the SUT and operating environment. This dependence makes the oracle more complex and more difficult to maintain. It also means that faults may be missed because both the SUT and the oracle may contain the fault. Software tests themselves can be classified in many different ways. Manual testing brings up images of a human providing input and interpreting results as the means of testing. Yet, humans sometimes need books, tables, calculators, or even programs (an Oracle) to know the expected result. Automated testing does not mean mechanical reproduction of manual tests. Automated tests that include evaluation of results need some kind of oracle regardless of the type or purpose of the tests. Yet, the mechanism for evaluation of results ranges from none (the program or system didn’t crash) to exact (all values, displays, files, etc., are verified). Various levels of effort and exactness are appropriate under different circumstances. The nature and complexity of an oracle is also dependent upon those circumstances. Types of Oracles Real world oracles vary widely in their characteristics. Although the mechanics of various oracles may be vastly different, a few classes can be identified which correspond with automated test approaches. These types of oracles are categorized based upon the outputs from the oracle rather than the method of generation of the results. Thus, an oracle that uses a lookup table to derive values may be the same type of oracle as one that implements an alternate algorithm to compute the values. The type descriptions define the purpose of the oracle and its method of use. Five types are identified and defined below. They are labeled True, Stochastic, Heuristic, Sampling, and Consistent oracles. A “True oracle” faithfully reproduces all relevant results for a SUT using independent platform, algorithms, processes, compilers, code, etc. The same values are fed to the SUT and the Oracle for results comparison. The Oracle for an algorithm or subroutine can be straightforward enough for this type of oracle to be considered. The sin() function, for example, can be implemented separately using different algorithms and the results compared to exhaustively test the results (assuming the availability of sufficient machine cycles). For a given test case all values input to the SUT are verified to be “correct” using the Oracle’s separate algorithm. The less the SUT has in common with the Oracle, the more confidence in the correctness of the results (since common hardware, compilers, operating systems, algorithms, etc., may inject errors that effect both the SUT and Oracle the same way). Test cases employing a true oracle are usually limited by available machine time and system resources. Quality Week 1998 Douglas Hoffman Page 5 3/30/98 A “Stochastic” approach focuses on verifying a statistically selected sample of values. This is most useful when resources are limited and only a relatively small amount of inputs will be included in the tests. For all inputs and ranges for the inputs, values are selected which are equally likely. For the sin() example, a pseudo-random number generator may be used to select the input values. The same values are fed to the SUT and the Oracle for results comparison. The statistically random input selection results in a test case that has no bias from the data chosen. It also means that suspect or error prone areas of the software are no more or less likely to be encountered than any other area. Either the Oracle has to be substantial enough to be able to accept arbitrary inputs or the pseudo-random sequence needs to be known in advance and an Oracle created for those particular values. A “Heuristic oracle” reproduces selected results for the SUT and the remaining values can be checked using simpler algorithms or consistency checks based on a heuristic. For the sin() function, a Heuristic Oracle might generate only the specific values for sin(;/2), sin(;), sin(3;/2), sin(2;) [whose results are 1, 0, -1, 0]. The test can then give values between the four points at very small increments to the SUT. A heuristic is applied to verify that the SUT returns values that are progressively greater (or less) than the last value. Although the heuristic approach will accept many functions that are incorrect, the Oracle is very easy to implement (especially when compared to a True Oracle), runs much faster, and will find most faults. The “Sampling” approach uses a selected set of values. The values are selected because of some criteria other than statistical randomness. Boundary values, specific integers, midpoints, minima, and maxima are examples often chosen when testing. Often, values are selected because they are easy to generate, recognize, or recall. (These are all selected samples that are not statistically random.) Once the values are selected, an Oracle can be created that provides the expected reslults. Software testing usually includes some effort based on Sampling to focus on areas likely to have faults and critical functions and features. The key difference between the Stochastic oracle and Sampling oracle is in the method of selection of input and result values. A “Consistent”oracle uses the results from one test run as the Oracle for subsequent tests. This is particularly useful for evaluating the effects of changes from one revision to another. The Oracle in this situation comes from a simulator, equivalent product, software from an alternate platform, or an early version of the SUT. The values being compared can include intermediate results, call trees, data values, or any other data extracted from the SUT automatically. The Oracle-generated data is usually too voluminous to be thoroughly or exhaustively verified. The value in comparing results from the SUT and the Oracle is from evaluating and explaining any differences. Because very large volumes of data can be stored and compared, the test cases can cover large input and result ranges. Although historic faults may remain when this technique is used, new faults and side-effects are often exposed and fixes are confirmed. Table 2 summarizes the five types of oracles and some of their characteristics. Quality Week 1998 Douglas Hoffman Page 6 3/30/98 True Oracle Stochastic Sampling Consistent Verify selected points, use a heuristic for remainder Algorithm Verification Verify a specially selected sample Compare run n results with n-1 Boundary Testing Regression Test Can automate tests with a simple Oracle Easier than True Oracle Very fast verification possible with simple Oracle May miss systematic and specific errors. Can be time consuming to verify Can miss systematic errors and incorrect algorithms May Miss Systematic or Specific Errors Fastest; Can generate and verify large amounts of data Original run may include unknown errors Definition Independent generation of expected results Verify a randomly selected sample Example of use Advantages Algorithm Validation Operational Verification Possibility for exhaustive testing Disadvantages Expensive implementation. Possibly long execution times Heuristic Table 2: Five Types of Oracles Other Remarks on Oracles Data from the Oracle can be generated before, parallel to, or after the test case is run. If the Oracle data is generated before the test, the inputs for the test case need to be known and the expected results must be stored in suitable form for comparison during or after testing. Early Oracle data generation is useful when the Oracle is slow, and it is required for the consistency approach. When the test case performs comparisons with expected results the Oracle has to run before or in parallel with the test case. Parallel running of an Oracle presumes that the Oracle runs quickly enough to be practical. When test results are stored and checked after test execution, the timing of Oracle data generation can be independent of test execution. Such after-the-test verification can be done using stored results from a test run with either stored or real-time generated Oracle output. Test results can be verified manually, within the test case, or automated separately. Manual verification requires both test results and Oracle data be available for comparison and is limited by human processing capabilities. Verification within a test case means that the Oracle data has to be available when the test case runs, which means either prior or parallel running of the Oracle. The test case also needs to be designed to perform the collection, comparing, and reporting of results. Separate automation of results comparison requires that results from the test run are saved and that either the Oracle results are likewise saved or generated as needed by the verification routines. Care must be taken during test planning to decide on the method of results comparison. Oracles are required for verification and the nature of an oracle depends on several factors under the control of the test designer and automation architect. Different Oracles may be used for a single automated test and a single oracle may serve many test cases. If test results are to be analyzed, some type of oracle is required. Quality Week 1998 Douglas Hoffman Page 7 3/30/98 Douglas Hoffman Software Quality Methods, LLC. Phone 408-741-4830 Fax 408-867-4550 doug.hoffman@acm.org Bio: Douglas Hoffman is an independent consultant with Software Quality Methods, LLC. He has been in the software engineering and quality assurance fields for over 25 years and now teaches courses and consults with management in strategic and tactical planning for software quality. For five years he served as Chairman of the Santa Clara Valley Software Quality Association (SSQA), a Task Group of the American Society for Quality (ASQ). He has been a participant at dozens of software quality conferences and has been Program Chairman for several international conferences on software quality. He is a member of the ACM and IEEE and is active in the ASQ as a Senior Member, participating in the Software Division, the Santa Clara Valley Section, and the Software Quality Task Group. He is Certified by ASQ as a Software Quality Engineer and has been a registered ISO 9000 Lead Auditor. He has a BA in Computer Science, an MS in Electrical Engineering, and an MBA. Douglas’ experience includes consulting, teaching, managing, and engineering in the computer systems and software industries. He has over fifteen years experience in creating and transforming software quality and development groups, and twenty years of business management experience. His work in corporate, quality assurance, development, manufacturing, and support organizations makes him very well versed in technical and managerial issues in the computer industry. Douglas has taught technical and managerial courses in high schools, universities, and corporations for over 25 years. Quality Week 1998