David Padua - Encyclopedia of Parallel Computing (2011, Springer)

Encyclopedia of Parallel Computing
David Padua (Ed.)
With  Figures and  Tables
Editor-in-Chief
David Padua
University of Illinois at Urbana-Champaign
Urbana, IL
USA
ISBN ----
e-ISBN ----
DOI ./----
Print and electronic bundle ISBN: ----
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 
© Springer Science+Business Media, LLC 
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer
Science+Business Media, LLC,  Spring Street, New York, NY , USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be
taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Parallelism, the capability of a computer to execute operations concurrently, has been a constant throughout the
history of computing. It impacts hardware, software, theory, and applications. The fastest machines of the past few
decades, the supercomputers, owe their performance advantage to parallelism. Today, physical limitations have
forced the adoption of parallelism as the preeminent strategy of computer manufacturers for continued performance gains of all classes of machines, from embedded and mobile systems to the most powerful servers. Parallelism
has been used to simplify the programming of certain applications which react to or simulate the parallelism of the
natural world. At the same time, parallelism complicates programming when the objective is to take advantage of
the existence of multiple hardware components to improve performance. Formal methods are necessary to study
the correctness of parallel algorithms and implementations and to analyze their performance on different classes of
real and theoretical systems. Finally, parallelism is crucial for many applications in the sciences, engineering, and
interactive services such as search engines.
Because of its importance and the challenging problems it has engendered, there have been numerous research
and development projects during the past half century. This Encyclopedia is our attempt to collect accurate and clear
descriptions of the most important of those projects. Although not exhaustive, with over  entries the Encyclopedia covers most of the topics that we identified at the outset as important for a work of this nature. Entries include
many of the best known projects and span all the important dimensions of parallel computing including machine
design, software, programming languages, algorithms, theoretical issues, and applications.
This Encyclopedia is the result of the work of many, whose dedication made it possible. The  Editorial Board
Members created the list of entries, did most of the reviewing, and suggested authors for the entries. Colin Robertson, the Managing Editor, Jennifer Carlson, Springer’s Reference Development editor, and Editorial Assistants Julia
Koerting and Simone Tavenrath, worked for  long years coordinating the recruiting of authors and the review and
submission process. Melissa Fearon, Springer’s Senior Editor, helped immensely with the coordination of authors
and Editorial Board Members, especially during the difficult last few months. The nearly  authors wrote crisp,
informative entries. They include experts in all major areas, come from many different nations, and span several
generations. In many cases, the author is the lead designer or researcher responsible for the contribution reported
in the entry. It was a great pleasure for me to be part of this project. The enthusiasm of everybody involved made
this a joyful enterprise. I believe we have put together a meaningful snapshot of parallel computing in the last 
years and presented a believable glimpse of the future. I hope the reader agrees with me on this and finds the entries,
as I did, valuable contributions to the literature.
David Padua
Editor-in-Chief
University of Illinois at Urbana-Champaign
Urbana, IL
USA
Editors
David Padua
Editor-in-Chief
University of Illinois at Urbana-Champaign
Urbana, IL
USA
Sarita Adve
Editorial Board Member
University of Illinois at Urbana-Champaign
Urbana, IL
USA
Colin Robertson
Managing Editor
University of Illinois at Urbana-Champaign
Urbana, IL
USA
Gheorghe S. Almasi
Editorial Board Member
IBM T. J. Watson Research Center
Yorktown Heights, NY
USA
Srinivas Aluru
Editorial Board Member
Iowa State University
Ames, IA
USA
Gianfranco Bilardi
Editorial Board Member
University of Padova
Padova
Italy
David Bader
Editorial Board Member
Georgia Tech
Atlanta, GA
USA
Siddharta Chatterjee
Editorial Board Member
IBM Systems & Technology Group
Austin, TX
USA
Luiz DeRose
Editorial Board Member
Cray Inc.
St. Paul, MN
USA
José Duato
Editorial Board Member
Universitat Politècnica de València
València
Spain
Jack Dongarra
Editorial Board Member
University of Tennessee
Knoxville, TN
USA
Oak Ridge National Laboratory
Oak Ridge, TN
USA
University of Manchester
Manchester
UK
Paul Feautrier
Editorial Board Member
Ecole Normale Supérieure de Lyon
Lyon
France
María J. Garzarán
Editorial Board Member
University of Illinois at Urbana-Champaign
Urbana, IL
USA
William Gropp
Editorial Board Member
University of Illinois at Urbana-Champaign
Urbana, IL
USA
Michael Gerndt
Editorial Board Member
Technische Universität München
Garching
Germany
Thomas Gross
Editorial Board Member
ETH Zurich
Zurich
Switzerland
James C. Hoe
Editorial Board Member
Carnegie Mellon University
Pittsburgh, PA
USA
Hironori Kasahara
Editorial Board Member
Waseda University
Tokyo
Japan
Laxmikant Kale
Editorial Board Member
University of Illinois at Urbana-Champaign
Urbana, IL
USA
Christian Lengauer
Editorial Board Member
University of Passau
Passau
Germany
José E. Moreira
Editorial Board Member
IBM Thomas J. Watson Research Center
Yorktown Heights, NY
USA
Keshav Pingali
Editorial Board Member
The University of Texas at Austin
Austin, TX
USA
Yale N. Patt
Editorial Board Member
The University of Texas at Austin
Austin, TX
USA
Markus Püschel
Editorial Board Member
ETH Zurich
Zurich
Switzerland
Ahmed H. Sameh
Editorial Board Member
Purdue University
West Lafayette, IN
USA
Vivek Sarkar
Editorial Board Member
Rice University
Houston, TX
USA
Pen-Chung Yew
Editorial Board Member
University of Minnesota at Twin Cities
Minneapolis, MN
USA
List of Contributors
DENNIS ABTS
Google Inc.
Madison, WI
USA
SARITA V. ADVE
University of Illinois at Urbana-Champaign
Urbana, IL
USA
SRINIVAS ALURU
Iowa State University
Ames, IA
USA
and
Indian Institute of Technology Bombay
Mumbai
India
GUL AGHA
University of Illinois at Urbana-Champaign
Urbana, IL
USA
JASMIN AJANOVIC
Intel Corporation
Portland, OR
USA
SELIM G. AKL
Queen’s University
Kingston, ON
Canada
PATRICK AMESTOY
Université de Toulouse ENSEEIHT-IRIT
Toulouse cedex 
France
BABA ARIMILLI
IBM Systems and Technology Group
Austin, TX
USA
ROGER S. ARMEN
Thomas Jefferson University
Philadelphia, PA
USA
HASAN AKTULGA
Purdue University
West Lafayette, IN
USA
DOUGLAS ARMSTRONG
Intel Corporation
Champaign, IL
USA
JOSÉ I. ALIAGA
Universitat Jaume I
Castellón
Spain
DAVID I. AUGUST
Princeton University
Princeton, NJ
USA
ERIC ALLEN
Oracle Labs
Austin, TX
USA
CEVDET AYKANAT
Bilkent University
Ankara
Turkey
GEORGE ALMASI
IBM
Yorktown Heights, NY
USA
DAVID A. BADER
Georgia Institute of Technology
Atlanta, GA
USA
MICHAEL BADER
Universität Stuttgart
Stuttgart
Germany
SCOTT BIERSDORFF
University of Oregon
Eugene, OR
USA
DAVID H. BAILEY
Lawrence Berkeley National Laboratory
Berkeley, CA
USA
GIANFRANCO BILARDI
University of Padova
Padova
Italy
RAJEEV BALASUBRAMONIAN
University of Utah
Salt Lake City, UT
USA
ROBERT BJORNSON
Yale University
New Haven, CT
USA
UTPAL BANERJEE
University of California at Irvine
Irvine, CA
USA
GUY BLELLOCH
Carnegie Mellon University
Pittsburgh, PA
USA
ALESSANDRO BARDINE
Università di Pisa
Pisa
Italy
ROBERT BOCCHINO
Carnegie Mellon University
Pittsburgh, PA
USA
MUTHU MANIKANDAN BASKARAN
Reservoir Labs, Inc.
New York, NY
USA
HANS J. BOEHM
HP Labs
Palo Alto, CA
USA
CÉDRIC BASTOUL
University Paris-Sud  - INRIA Saclay Île-de-France
Orsay
France
ERIC J. BOHM
University of Illinois at Urbana-Champaign
Urbana, IL
USA
AARON BECKER
University of Illinois at Urbana-Champaign
Urbana, IL
USA
MATTHIAS BOLLHÖFER
TU Braunschweig Institute of Computational Mathematics
Braunschweig
Germany
MICHAEL W. BERRY
The University of Tennessee
Knoxville, TN
USA
DAN BONACHEA
Lawrence Berkeley National Laboratory
Berkeley, CA
USA
ABHINAV BHATELE
University of Illinois at Urbana-Champaign
Urbana, IL
USA
PRADIP BOSE
IBM Corp. T.J. Watson Research Center
Yorktown Heights, NY
USA
MARIAN BREZINA
University of Colorado at Boulder
Boulder, CO
USA
ÜMIT V. ÇATALYÜREK
The Ohio State University
Columbus, OH
USA
JEFF BROOKS
Cray Inc.
St. Paul, MN
USA
LUIS H. CEZE
University of Washington
Seattle, WA
USA
HOLGER BRUNST
Technische Universität Dresden
Dresden
Germany
BRADFORD L. CHAMBERLAIN
Cray Inc.
Seattle, WA
USA
HANS-JOACHIM BUNGARTZ
Technische Universität München
Garching
Germany
ERNIE CHAN
NVIDIA Corporation
Santa Clara, CA
USA
MICHAEL G. BURKE
Rice University
Houston, TX
USA
RONG-GUEY CHANG
National Chung Cheng University
Chia-Yi
Taiwan
ALFREDO BUTTARI
Université de Toulouse ENSEEIHT-IRIT
Toulouse cedex 
France
BARBARA CHAPMAN
University of Houston
Houston, TX
USA
ERIC J. BYLASKA
Pacific Northwest National Laboratory
Richland, WA
USA
DAVID CHASE
Oracle Labs
Burlington, MA
USA
ROY H. CAMPBELL
University of Illinois at Urbana-Champaign
Urbana, IL
USA
DANIEL CHAVARRÍA-MIRANDA
Pacific Northwest National Laboratory
Richland, WA
USA
WILLIAM CARLSON
Institute for Defense Analyses
Bowie, MD
USA
NORMAN H. CHRIST
Columbia University
New York, NY
USA
MANUEL CARRO
Universidad Politécnica de Madrid
Madrid
Spain
MURRAY COLE
University of Edinburgh
Edinburgh
UK
PHILLIP COLELLA
University of California
Berkeley, CA
USA
KAUSHIK DATTA
University of California
Berkeley, CA
USA
SALVADOR COLL
Universidad Politécnica de Valencia
Valencia
Spain
JIM DAVIES
Oxford University
UK
GUOJING CONG
IBM
Yorktown Heights, NY
USA
JAMES H. COWNIE
Intel Corporation (UK) Ltd.
Swindon
UK
ANTHONY P. CRAIG
National Center for Atmospheric Research
Boulder, CO
USA
ANTHONY CURTIS
University of Houston
Houston, TX
USA
JAMES DEMMEL
University of California at Berkeley
Berkeley, CA
USA
MONTY DENNEAU
IBM Corp., T.J. Watson Research Center
Yorktown Heights, NY
USA
JACK B. DENNIS
Massachusetts Institute of Technology
Cambridge, MA
USA
MARK DEWING
Intel Corporation
Champaign, IL
USA
H. J. J. VAN DAM
Pacific Northwest National Laboratory
Richland, WA
USA
VOLKER DIEKERT
Universität Stuttgart FMI
Stuttgart
Germany
FREDERICA DAREMA
National Science Foundation
Arlington, VA
USA
JACK DONGARRA
University of Tennessee
Knoxville, TN
USA
ALAIN DARTE
École Normale Supérieure de Lyon
Lyon
France
DAVID DONOFRIO
Lawrence Berkeley National Laboratory
Berkeley, CA
USA
RAJA DAS
IBM Corporation
Armonk, NY
USA
RON O. DROR
D. E. Shaw Research
New York, NY
USA
IAIN DUFF
Science & Technology Facilities Council
Didcot, Oxfordshire
UK
MICHAEL DUNGWORTH
SANDHYA DWARKADAS
University of Rochester
Rochester, NY
USA
RUDOLF EIGENMANN
Purdue University
West Lafayette, IN
USA
WU-CHUN FENG
Virginia Tech
Blacksburg, VA
USA
and
Wake Forest University
Winston-Salem, NC
USA
JOHN FEO
Pacific Northwest National Laboratory
Richland, WA
USA
JEREMY T. FINEMAN
Carnegie Mellon University
Pittsburgh, PA
USA
E. N. (MOOTAZ) ELNOZAHY
IBM Research
Austin, TX
USA
JOSEPH A. FISHER
Miami Beach, FL
USA
JOEL EMER
Intel Corporation
Hudson, MA
USA
CORMAC FLANAGAN
University of California at Santa Cruz
Santa Cruz, CA
USA
BABAK FALSAFI
Ecole Polytechnique Fédérale de Lausanne
Lausanne
Switzerland
JOSÉ FLICH
Technical University of Valencia
Valencia
Spain
PAOLO FARABOSCHI
Hewlett Packard
Sant Cugat del Valles
Spain
CHRISTINE FLOOD
Oracle Labs
Burlington, MA
USA
PAUL FEAUTRIER
Ecole Normale Supérieure de Lyon
Lyon
France
MICHAEL FLYNN
Stanford University
Stanford, CA
USA
KARL FEIND
SGI
Eagan, MN
USA
JOSEPH FOGARTY
University of South Florida
Tampa, FL
USA
PIERFRANCESCO FOGLIA
Università di Pisa
Pisa
Italy
PEDRO J. GARCIA
Universidad de Castilla-La Mancha
Albacete
Spain
TRYGGVE FOSSUM
Intel Corporation
Hudson, MA
USA
MICHAEL GARLAND
NVIDIA Corporation
Santa Clara, CA
USA
GEOFFREY FOX
Indiana University
Bloomington, IN
USA
KLAUS GÄRTNER
Weierstrass Institute for Applied Analysis and Stochastics
Berlin
Germany
MARTIN FRÄNZLE
Carl von Ossietzky Universität
Oldenburg
Germany
ED GEHRINGER
North Carolina State University
Raleigh, NC
USA
FRANZ FRANCHETTI
Carnegie Mellon University
Pittsburgh, PA
USA
ROBERT A. VAN DE GEIJN
The University of Texas at Austin
Austin, TX
USA
STEFAN M. FREUDENBERGER
Freudenberger Consulting
Zürich
Switzerland
AL GEIST
Oak Ridge National Laboratory
Oak Ridge, TN
USA
HOLGER FRÖNING
University of Heidelberg
Heidelberg
Germany
THOMAS GEORGE
IBM Research
Delhi
India
KARL FÜRLINGER
Ludwig-Maximilians-Universität München
Munich
Germany
MICHAEL GERNDT
Technische Universität München
München
Germany
EFSTRATIOS GALLOPOULOS
University of Patras
Patras
Greece
AMOL GHOTING
IBM Thomas J. Watson Research Center
Yorktown Heights, NY
USA
ALAN GARA
IBM T.J. Watson Research Center
Yorktown Heights, NY
USA
JOHN GILBERT
University of California
Santa Barbara, CA
USA
ROBERT J. VAN GLABBEEK
NICTA
Sydney
Australia
and
The University of New South Wales
Sydney
Australia
and
Stanford University
Stanford, CA
USA
SERGEI GORLATCH
Westfälische Wilhelms-Universität Münster
Münster
Germany
KAZUSHIGE GOTO
The University of Texas at Austin
Austin, TX
USA
ALLAN GOTTLIEB
New York University
New York, NY
USA
STEVEN GOTTLIEB
Indiana University
Bloomington, IN
USA
N. GOVIND
Pacific Northwest National Laboratory
Richland, WA
USA
SUSAN L. GRAHAM
University of California
Berkeley, CA
USA
ANANTH Y. GRAMA
Purdue University
West Lafayette, IN
USA
DON GRICE
IBM Corporation
Poughkeepsie, NY
USA
LAURA GRIGORI
Laboratoire de Recherche en Informatique, Université Paris-Sud
Paris
France
WILLIAM GROPP
University of Illinois at Urbana-Champaign
Urbana, IL
USA
ABDOU GUERMOUCHE
Université de Bordeaux
Talence
France
JOHN A. GUNNELS
IBM Corp
Yorktown Heights, NY
USA
ANSHUL GUPTA
IBM T.J. Watson Research Center
Yorktown Heights, NY
USA
JOHN L. GUSTAFSON
Intel Corporation
Santa Clara, CA
USA
ROBERT H. HALSTEAD
Curl Inc.
Cambridge, MA
USA
KEVIN HAMMOND
University of St. Andrews
St. Andrews
UK
JAMES HARRELL
ROBERT HARRISON
Oak Ridge National Laboratory
Oak Ridge, TN
USA
JOHN C. HART
University of Illinois at Urbana-Champaign
Urbana, IL
USA
MICHAEL HEATH
University of Illinois at Urbana-Champaign
Urbana, IL
USA
HERMANN HELLWAGNER
Klagenfurt University
Klagenfurt
Austria
DANNY HENDLER
Ben-Gurion University of the Negev
Beer-Sheva
Israel
BRUCE HENDRICKSON
Sandia National Laboratories
Albuquerque, NM
USA
ROBERT HENSCHEL
Indiana University
Bloomington, IN
USA
MANUEL HERMENEGILDO
Universidad Politécnica de Madrid
Madrid
Spain
IMDEA Software Institute
Madrid
Spain
OSCAR HERNANDEZ
Oak Ridge National Laboratory
Oak Ridge, TN
USA
PAUL HILFINGER
University of California
Berkeley, CA
USA
KEI HIRAKI
The University of Tokyo
Tokyo
Japan
H. PETER HOFSTEE
IBM Austin Research Laboratory
Austin, TX
USA
CHRIS HSIUNG
Hewlett Packard
Palo Alto, CA
USA
JONATHAN HU
Sandia National Laboratories
Livermore, CA
USA
KIERAN T. HERLEY
University College Cork
Cork
Ireland
THOMAS HUCKLE
Technische Universität München
Garching
Germany
MAURICE HERLIHY
Brown University
Providence, RI
USA
WEN-MEI HWU
University of Illinois at Urbana-Champaign
Urbana, IL
USA
FRANÇOIS IRIGOIN
MINES ParisTech/CRI
Fontainebleau
France
KRISHNA KANDALLA
The Ohio State University
Columbus, OH
USA
KEN’ICHI ITAKURA
Japan Agency for Marine-Earth Science and Technology
(JAMSTEC)
Yokohama
Japan
LARRY KAPLAN
Cray Inc.
Seattle, WA
USA
JOSEPH F. JAJA
University of Maryland
College Park, MD
USA
JOEFON JANN
T. J. Watson Research Center, IBM Corp.
Yorktown Heights, NY
USA
KARL JANSEN
NIC, DESY Zeuthen
Zeuthen
Germany
PRITISH JETLEY
University of Illinois at Urbana-Champaign
Urbana, IL
USA
WIBE A. DE JONG
Pacific Northwest National Laboratory
Richland, WA
USA
TEJAS S. KARKHANIS
IBM T.J. Watson Research Center
Yorktown Heights, NY
USA
RAJESH K. KARMANI
University of Illinois at Urbana-Champaign
Urbana, IL
USA
GEORGE KARYPIS
University of Minnesota
Minneapolis, MN
USA
ARUN KEJARIWAL
Yahoo! Inc.
Sunnyvale, CA
USA
MALEQ KHAN
Virginia Tech
Blacksburg, VA
USA
LAXMIKANT V. KALÉ
University of Illinois at Urbana-Champaign
Urbana, IL
USA
THILO KIELMANN
Vrije Universiteit
Amsterdam
The Netherlands
ANANTH KALYANARAMAN
Washington State University
Pullman, WA
USA
GERRY KIRSCHNER
Cray Incorporated
St. Paul, MN
USA
AMIR KAMIL
University of California
Berkeley, CA
USA
CHRISTOF KLAUSECKER
Ludwig-Maximilians-Universität München
Munich
Germany
KATHLEEN KNOBE
Intel Corporation
Cambridge, MA
USA
V. S. ANIL KUMAR
Virginia Tech
Blacksburg, VA
USA
ANDREAS KNÜPFER
Technische Universität Dresden
Dresden
Germany
KALYAN KUMARAN
Argonne National Laboratory
Argonne, IL
USA
GIORGOS KOLLIAS
Purdue University
West Lafayette, IN
USA
JAMES LA GRONE
University of Houston
Houston, TX
USA
K. KOWALSKI
Pacific Northwest National Laboratory
Richland, WA
USA
ROBERT LATHAM
Argonne National Laboratory
Argonne, IL
USA
QUINCEY KOZIOL
The HDF Group
Champaign, IL
USA
BRUCE LEASURE
Saint Paul, MN
USA
DIETER KRANZLMÜLLER
Ludwig-Maximilians-Universität München
Munich
Germany
MANOJKUMAR KRISHNAN
Pacific Northwest National Laboratory
Richland, WA
USA
JENQ-KUEN LEE
National Tsing-Hua University
Hsin-Chu
Taiwan
CHARLES E. LEISERSON
Massachusetts Institute of Technology
Cambridge, MA
USA
CHI-BANG KUAN
National Tsing-Hua University
Hsin-Chu
Taiwan
CHRISTIAN LENGAUER
University of Passau
Passau
Germany
DAVID J. KUCK
Intel Corporation
Champaign, IL
USA
RICHARD LETHIN
Reservoir Labs, Inc.
New York, NY
USA
JEFFERY A. KUEHN
Oak Ridge National Laboratory
Oak Ridge, TN
USA
ALLEN LEUNG
Reservoir Labs, Inc.
New York, NY
USA
JOHN M. LEVESQUE
Cray Inc.
Knoxville, TN
USA
PEDRO LÓPEZ
Universidad Politécnica de Valencia
Valencia
Spain
MICHAEL LEVINE
GEOFF LOWNEY
Intel Corporation
Hudson, MA
USA
JEAN-YVES L’EXCELLENT
ENS Lyon
Lyon
France
JIAN LI
IBM Research
Austin, TX
USA
XIAOYE SHERRY LI
Lawrence Berkeley National Laboratory
Berkeley, CA
USA
ZHIYUAN LI
Purdue University
West Lafayette, IN
USA
VICTOR LUCHANGCO
Oracle Labs
Burlington, MA
USA
PIOTR LUSZCZEK
University of Tennessee
Knoxville, TN
USA
OLAV LYSNE
The University of Oslo
Oslo
Norway
CALVIN LIN
University of Texas at Austin
Austin, TX
USA
XIAOSONG MA
North Carolina State University
Raleigh, NC
USA
and
Oak Ridge National Laboratory
Raleigh, NC
USA
HESHAN LIN
Virginia Tech
Blacksburg, VA
USA
ARTHUR B. MACCABE
Oak Ridge National Laboratory
Oak Ridge, TN
USA
HANS-WOLFGANG LOIDL
Heriot-Watt University
Edinburgh
UK
KAMESH MADDURI
Lawrence Berkeley National Laboratory
Berkeley, CA
USA
RITA LOOGEN
Philipps-Universität Marburg
Marburg
Germany
JAN-WILLEM MAESSEN
Google
Cambridge, MA
USA
KONSTANTIN MAKARYCHEV
IBM T.J. Watson Research Center
Yorktown Heights, NY
USA
PHILLIP MERKEY
Michigan Technological University
Houghton, MI
USA
JUNICHIRO MAKINO
National Astronomical Observatory of Japan
Tokyo
Japan
JOSÉ MESEGUER
University of Illinois at Urbana-Champaign
Urbana, IL
USA
ALLEN D. MALONY
University of Oregon
Eugene, OR
USA
MICHAEL METCALF
Berlin
Germany
MADHAV V. MARATHE
Virginia Tech
Blacksburg, VA
USA
ALBERTO F. MARTÍN
Universitat Jaume I
Castellón
Spain
GLENN MARTYNA
IBM Thomas J. Watson Research Center
Yorktown Heights, NY
USA
ERIC R. MAY
University of Michigan
Ann Arbor, MI
USA
SALLY A. MCKEE
Chalmers University of Technology
Goteborg
Sweden
SAMUEL MIDKIFF
Purdue University
West Lafayette, IN
USA
KENICHI MIURA
National Institute of Informatics
Tokyo
Japan
BERND MOHR
Forschungszentrum Jülich GmbH
Jülich
Germany
JOSÉ E. MOREIRA
IBM T.J. Watson Research Center
Yorktown Heights, NY
USA
ALAN MORRIS
University of Oregon
Eugene, OR
USA
MIRIAM MEHL
Technische Universität München
Garching
Germany
J. ELIOT B. MOSS
University of Massachusetts
Amherst, MA
USA
BENOIT MEISTER
Reservoir Labs, Inc.
New York, NY
USA
MATTHIAS MÜLLER
Technische Universität Dresden
Dresden
Germany
PETER MÜLLER
ETH Zurich
Zurich
Switzerland
ALLEN NIKORA
California Institute of Technology
Pasadena, CA
USA
YOICHI MURAOKA
Waseda University
Tokyo
Japan
ROBERT W. NUMRICH
City University of New York
New York, NY
USA
ANCA MUSCHOLL
Université Bordeaux 
Talence
France
RAVI NAIR
IBM Thomas J. Watson Research Center
Yorktown Heights, NY
USA
STEPHEN NELSON
MARIO NEMIROVSKY
Barcelona Supercomputer Center
Barcelona
Spain
RYAN NEWTON
Intel Corporation
Hudson, MA
USA
ROCCO DE NICOLA
Università di Firenze
Firenze
Italy
ALEXANDRU NICOLAU
University of California Irvine
Irvine, CA
USA
JAREK NIEPLOCHA†
Pacific Northwest National Laboratory
Richland, WA
USA
† deceased
STEVEN OBERLIN
LEONID OLIKER
Lawrence Berkeley National Laboratory
Berkeley, CA
USA
DAVID PADUA
University of Illinois at Urbana-Champaign
Urbana, IL
USA
SCOTT PAKIN
Los Alamos National Laboratory
Los Alamos, NM
USA
BRUCE PALMER
Pacific Northwest National Laboratory
Richland, WA
USA
DHABALESWAR K. PANDA
The Ohio State University
Columbus, OH
USA
SAGAR PANDIT
University of South Florida
Tampa, FL
USA
YALE N. PATT
The University of Texas at Austin
Austin, TX
USA
OLIVIER PÈNE
Université de Paris-Sud-XI
Orsay Cedex
France
WILFRED POST
Oak Ridge National Laboratory
Oak Ridge, TN
USA
PAUL PETERSEN
Intel Corporation
Champaign, IL
USA
CHRISTOPH VON PRAUN
Georg-Simon-Ohm University of Applied Sciences
Nuremberg
Germany
BERNARD PHILIPPE
Campus de Beaulieu
Rennes
France
FRANCO P. PREPARATA
Brown University
Providence, RI
USA
MICHAEL PHILIPPSEN
University of Erlangen-Nuremberg
Erlangen
Germany
COSIMO ANTONIO PRETE
Università di Pisa
Pisa
Italy
JAMES C. PHILLIPS
University of Illinois at Urbana-Champaign
Urbana, IL
USA
ANDREA PIETRACAPRINA
Università di Padova
Padova
Italy
KESHAV PINGALI
The University of Texas at Austin
Austin, TX
USA
TIMOTHY PRINCE
Intel Corporation
Santa Clara, CA
USA
JEAN-PIERRE PROST
Morteau
France
GEPPINO PUCCI
Università di Padova
Padova
Italy
TIMOTHY M. PINKSTON
University of Southern California
Los Angeles, CA
USA
MARKUS PÜSCHEL
ETH Zurich
Zurich
Switzerland
ERIC POLIZZI
University of Massachusetts
Amherst, MA
USA
ENRIQUE S. QUINTANA-ORTÍ
Universitat Jaume I
Castellón
Spain
STEPHEN W. POOLE
Oak Ridge National Laboratory
Oak Ridge, TN
USA
PATRICE QUINTON
ENS Cachan Bretagne
Bruz
France
RAM RAJAMONY
IBM Research
Austin, TX
USA
ARUN RAMAN
Princeton University
Princeton, NJ
USA
LAWRENCE RAUCHWERGER
Texas A&M University
College Station, TX
USA
JAMES R. REINDERS
Intel Corporation
Hillsboro, OR
USA
STEVEN P. REINHARDT
JOHN REPPY
University of Chicago
Chicago, IL
USA
MARÍA ENGRACIA GÓMEZ REQUENA
Universidad Politécnica de Valencia
Valencia
Spain
DANIEL RICCIUTO
Oak Ridge National Laboratory
Oak Ridge, TN
USA
YVES ROBERT
Ecole Normale Supérieure de Lyon
France
ARCH D. ROBISON
Intel Corporation
Champaign, IL
USA
A. W. ROSCOE
Oxford University
Oxford
UK
ROBERT B. ROSS
Argonne National Laboratory
Argonne, IL
USA
CHRIS ROWEN
CEO, Tensilica
Santa Clara, CA
USA
DUNCAN ROWETH
Cray (UK) Ltd.
UK
SUKYOUNG RYU
Korea Advanced Institute of Science and Technology
Daejeon
Korea
VALENTINA SALAPURA
IBM Research
Yorktown Heights, NY
USA
JOEL H. SALTZ
Emory University
Atlanta, GA
USA
ROLF RIESEN
IBM Research
Dublin
Ireland
AHMED SAMEH
Purdue University
West Lafayette, IN
USA
TANGUY RISSET
INSA Lyon
Villeurbanne
France
MIGUEL SANCHEZ
Universidad Politécnica de Valencia
Valencia
Spain
BENJAMIN SANDER
Advanced Micro Devices, Inc.
Austin, TX
USA
MICHAEL L. SCOTT
University of Rochester
Rochester, NY
USA
PETER SANDERS
Universität Karlsruhe
Karlsruhe
Germany
MATOUS SEDLACEK
Technische Universität München
Garching
Germany
DAVIDE SANGIORGI
Università di Bologna
Bologna
Italy
JOEL SEIFERAS
University of Rochester
Rochester, NY
USA
VIVEK SARIN
Texas A&M University
College Station, TX
USA
FRANK OLAF SEM-JACOBSEN
The University of Oslo
Oslo
Norway
VIVEK SARKAR
Rice University
Houston, TX
USA
OLAF SCHENK
University of Basel
Basel
Switzerland
MICHAEL SCHLANSKER
Hewlett-Packard Inc.
Palo Alto, CA
USA
STEFAN SCHMID
Telekom Laboratories/TU Berlin
Berlin
Germany
ANDRÉ SEZNEC
IRISA/INRIA, Rennes
Rennes
France
JOHN SHALF
Lawrence Berkeley National Laboratory
Berkeley, CA
USA
MEIYUE SHAO
Umeå University
Umeå
Sweden
MARTIN SCHULZ
Lawrence Livermore National Laboratory
Livermore, CA
USA
DAVID E. SHAW
D. E. Shaw Research
New York, NY
USA
and
Columbia University
New York, NY
USA
JAMES L. SCHWARZMEIER
Cray Inc.
Chippewa Falls, WI
USA
XIAOWEI SHEN
IBM Research
Armonk, NY
USA
SAMEER SHENDE
University of Oregon
Eugene, OR
USA
EDGAR SOLOMONIK
University of California at Berkeley
Berkeley, CA
USA
GALEN M. SHIPMAN
Oak Ridge National Laboratory
Oak Ridge, TN
USA
MATTHEW SOTTILE
Galois, Inc.
Portland, OR
USA
HOWARD JAY SIEGEL
Colorado State University
Fort Collins, CO
USA
M’HAMED SOULI
Université des Sciences et Technologies de Lille
Villeneuve d’Ascq cédex
France
DANIEL P. SIEWIOREK
Carnegie Mellon University
Pittsburgh, PA
USA
WYATT SPEAR
University of Oregon
Eugene, OR
USA
FEDERICO SILLA
Universidad Politécnica de Valencia
Valencia
Spain
EVAN W. SPEIGHT
IBM Research
Austin, TX
USA
BARRY SMITH
Argonne National Laboratory
Argonne, IL
USA
MARK S. SQUILLANTE
IBM
Yorktown Heights, NY
USA
BURTON SMITH
Microsoft Corporation
Redmond, WA
USA
ALEXANDROS STAMATAKIS
Heidelberg Institute for Theoretical Studies
Heidelberg
Germany
MARC SNIR
University of Illinois at Urbana-Champaign
Urbana, IL
USA
GUY L. STEELE, JR.
Oracle Labs
Burlington, MA
USA
LAWRENCE SNYDER
University of Washington
Seattle, WA
USA
THOMAS L. STERLING
Louisiana State University
Baton Rouge, LA
USA
MARCO SOLINAS
Università di Pisa
Pisa
Italy
TJERK P. STRAATSMA
Pacific Northwest National Laboratory
Richland, WA
USA
PAULA E. STRETZ
Virginia Tech
Blacksburg, VA
USA
JOSEP TORRELLAS
University of Illinois at Urbana-Champaign
Urbana, IL
USA
THOMAS M. STRICKER
Zürich, CH
Switzerland
JESPER LARSSON TRÄFF
University of Vienna
Vienna
Austria
JIMMY SU
University of California
Berkeley, CA
USA
HARI SUBRAMONI
The Ohio State University
Columbus, OH
USA
SAYANTAN SUR
The Ohio State University
Columbus, OH
USA
JOHN SWENSEN
CPU Technology
Pleasanton, CA
USA
PHILIP TRINDER
Heriot-Watt University
Edinburgh
UK
RAFFAELE TRIPICCIONE
Università di Ferrara and INFN Sezione di Ferrara
Ferrara
Italy
MARK TUCKERMAN
New York University
New York, NY
USA
RAY TUMINARO
Sandia National Laboratories
Livermore, CA
USA
HIROSHI TAKAHARA
NEC Corporation
Tokyo
Japan
BORA UÇAR
ENS Lyon
Lyon
France
MICHELA TAUFER
University of Delaware
Newark, DE
USA
MARAT VALIEV
Pacific Northwest National Laboratory
Richland, WA
USA
VINOD TIPPARAJU
Oak Ridge National Laboratory
Oak Ridge, TN
USA
NICOLAS VASILACHE
Reservoir Labs, Inc.
New York, NY
USA
ALEXANDER TISKIN
University of Warwick
Coventry
UK
MARIANA VERTENSTEIN
National Center for Atmospheric Research
Boulder, CO
USA
JENS VOLKERT
Johannes Kepler University Linz
Linz
Austria
TONG WEN
University of California
Berkeley, CA
USA
YEVGEN VORONENKO
Carnegie Mellon University
Pittsburgh, PA
USA
R. CLINT WHALEY
University of Texas at San Antonio
San Antonio, TX
USA
RICHARD W. VUDUC
Georgia Institute of Technology
Atlanta, GA
USA
ANDREW B. WHITE
Los Alamos National Laboratory
Los Alamos, NM
USA
GENE WAGENBRETH
University of Southern California
Topanga, CA
USA
BRIAN WHITNEY
Oracle Corporation
Hillsboro, OR
USA
DALI WANG
Oak Ridge National Laboratory
Oak Ridge, TN
USA
ROLAND WISMÜLLER
University of Siegen
Siegen
Germany
JASON WANG
LSTC
Livermore, CA
USA
ROBERT W. WISNIEWSKI
IBM
Yorktown Heights, NY
USA
GREGORY R. WATSON
IBM
Yorktown Heights, NY
USA
DAVID WOHLFORD
Reservoir Labs, Inc.
New York, NY
USA
ROGER WATTENHOFER
ETH Zürich
Zurich
Switzerland
FELIX WOLF
Aachen University
Aachen
Germany
MICHAEL WEHNER
Lawrence Berkeley National Laboratory
Berkeley, CA
USA
DAVID WONNACOTT
Haverford College
Haverford, PA
USA
JOSEF WEIDENDORFER
Technische Universität München
München
Germany
PATRICK H. WORLEY
Oak Ridge National Laboratory
Oak Ridge, TN
USA
SUDHAKAR YALAMANCHILI
Georgia Institute of Technology
Atlanta, GA
USA
FIELD G. VAN ZEE
The University of Texas at Austin
Austin, TX
USA
KATHERINE YELICK
University of California at Berkeley and Lawrence Berkeley
National Laboratory
Berkeley, CA
USA
LIXIN ZHANG
IBM Research
Austin, TX
USA
PEN-CHUNG YEW
University of Minnesota at Twin Cities
Minneapolis, MN
USA
BOBBY DALTON YOUNG
Colorado State University
Fort Collins, CO
USA
CLIFF YOUNG
D. E. Shaw Research
New York, NY
USA
GABRIEL ZACHMANN
Clausthal University
Clausthal-Zellerfeld
Germany
GENGBIN ZHENG
University of Illinois at Urbana-Champaign
Urbana, IL
USA
HANS P. ZIMA
California Institute of Technology
Pasadena, CA
USA
JAROSLAW ZOLA
Iowa State University
Ames, IA
USA
A
Ab Initio Molecular Dynamics
Car-Parrinello Method
Access Anomaly
Race Conditions
Actors
Rajesh K. Karmani, Gul Agha
University of Illinois at Urbana-Champaign, Urbana,
IL, USA
Definition
Actors is a model of concurrent computation for developing parallel, distributed, and mobile systems. Each
actor is an autonomous object that operates concurrently and asynchronously, receiving and sending messages to other actors, creating new actors, and updating
its own local state. An actor system consists of a collection of actors, some of whom may send messages to, or
receive messages from, actors outside the system.
Preliminaries
An actor has a name that is globally unique and a behavior which determines its actions. In order to send an
actor a message, the actor’s name must be used; a name
cannot be guessed but it may be communicated in a
message. When an actor is idle, and it has a pending message, the actor accepts the message and does
the computation defined by its behavior. As a result,
the actor may take three types of actions: send messages, create new actors, and update its local state. An
actor’s behavior may change as it modifies its local state.
Actors do not share state: an actor must explicitly send
a message to another actor in order to affect the latter’s
behavior. Each actor carries out its actions concurrently
(and asynchronously) with other actors. Moreover, the
path a message takes as well as network delays it may
encounter are not specified. Thus, the arrival order of
messages is indeterminate. The key semantic properties of the standard Actor model are encapsulation of
state and atomic execution of a method in response
to a message, fairness in scheduling actors and in the
delivery of messages, and location transparency enabling
distributed execution and mobility.
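The three kinds of action named above (send a message, create a new actor, update local state) can be illustrated with a minimal sketch. It is not taken from the entry, and it assumes a thread-per-actor design built on Python's standard `queue` and `threading` modules. The class names `Actor`, `Counter`, and `Collector`, the message tags, and the `stop` helper are all hypothetical. Python's `queue.Queue` delivers in FIFO order, which is stronger than the Actor model requires, and ordinary object references stand in for unguessable actor names.

```python
import queue
import threading


class Actor:
    """Minimal actor: private state, a mailbox, and its own thread of control."""

    def __init__(self, name):
        self.name = name                      # stand-in for the globally unique name
        self._mailbox = queue.Queue()         # pending messages
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronous send: enqueue the message and return at once."""
        self._mailbox.put(message)

    def stop(self):
        """Hypothetical shutdown helper; not part of the Actor model."""
        self._mailbox.put(None)
        self._thread.join()

    def _run(self):
        # One message at a time: each handler runs atomically with
        # respect to this actor's local state.
        while True:
            message = self._mailbox.get()
            if message is None:
                return
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Collector(Actor):
    """Bridge to the outside world: forwards every message to a result queue."""

    def __init__(self, name):
        self.results = queue.Queue()
        super().__init__(name)

    def receive(self, message):
        self.results.put(message)


class Counter(Actor):
    def __init__(self, name):
        self.count = 0                        # local state, touched only by this actor
        super().__init__(name)

    def receive(self, message):
        kind, payload = message
        if kind == "add":
            self.count += payload                    # update local state
        elif kind == "report":
            payload.send(("value", self.count))      # send a message
        elif kind == "clone":
            child = Counter(self.name + "/child")    # create a new actor
            child.send(("add", self.count))
            payload.send(("created", child))         # communicate the new name


def add_many(target, times):
    for _ in range(times):
        target.send(("add", 1))


def demo():
    collector = Collector("collector")
    counter = Counter("counter")
    # Two concurrent senders; their messages interleave in an unspecified order.
    senders = [threading.Thread(target=add_many, args=(counter, 100)) for _ in range(2)]
    for t in senders:
        t.start()
    for t in senders:
        t.join()
    counter.send(("report", collector))
    total = collector.results.get(timeout=5)
    counter.send(("clone", collector))
    _, child = collector.results.get(timeout=5)
    child.send(("report", collector))
    child_total = collector.results.get(timeout=5)
    for actor in (counter, child, collector):
        actor.stop()
    return total, child_total


if __name__ == "__main__":
    print(demo())
```

Because the counter handles one message at a time, no lock around `count` is needed even though two threads send to it concurrently. Fairness and location transparency are not modeled in this sketch.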
Advantages of the Actor Model
In the object-oriented programming paradigm, an
object encapsulates data and behavior. This separates
the interface of an object (what an object does) from its
representation (how it does it). Such separation enables
modular reasoning about object-based programs and
facilitates their evolution. Actors extend the advantages of objects to concurrent computations by separating control (where and when) from the logic of a
computation.
The Actor model of programming [] allows
programs to be decomposed into self-contained, autonomous, interactive, asynchronously operating components. Due to their asynchronous operation, actors
provide a model for the nondeterminism inherent in
distributed systems, reactive systems, mobile systems,
and any form of interactive computing.
History
The concept of actors has developed over  decades. The
earliest use of the term “actors” was in Carl Hewitt’s
Planner [] where the term referred to rule-based
active entities which search a knowledge base for patterns to match, and in response, trigger actions. For
the next  decades, Hewitt’s group worked on actors as
agents of computation, and it evolved as a model of concurrent computing. A brief history of actor research can
be found in []. The commonly used definition of actors
today follows the work of Agha () which defines
actors using a simple operational semantics [].
Actors. Fig.  Actors are concurrent objects that communicate through messages and may create new actors. An actor
may be viewed as an object augmented with its own control, a mailbox, and a globally unique, immutable name
David Padua (ed.), Encyclopedia of Parallel Computing, DOI ./----,
© Springer Science+Business Media, LLC 
In recent years, the Actor model has gained in
popularity with the growth of parallel and distributed
computing platforms such as multicore architectures,
cloud computers, and sensor networks. A number
of actor languages and frameworks have been developed. Some early actor languages include ABCL,
POOL, ConcurrentSmalltalk, ACT++, CEiffel (see []
for a review of these), and HAL []. Actor languages and frameworks in current use include Erlang
(from Ericsson) [], E (Erights.org), Scala Actors
library (EPFL) [], Ptolemy (UC Berkeley) [], ASP
(INRIA) [], JCoBox (University of Kaiserslautern)
[], SALSA (UIUC and RPI) [], Charm++ []
and ActorFoundry [] (both from UIUC), the Asynchronous Agents Library [] and Axum [] (both from
Microsoft), and Orleans framework for cloud computing from Microsoft Research []. Some well-known
open source applications built using actors include
Twitter’s message queuing system and the Lift Web Framework; commercial applications include the Facebook Chat system and Vendetta’s game engine.
Illustrative Actor Language
In order to show how actor programs work, consider a simple imperative actor language ActorFoundry
that extends Java. A class defining an actor behavior
extends osl.manager.Actor. Messages are handled by
methods; such methods are annotated with @message.
The create(class, args) method creates an actor
instance of the specified actor class, where args correspond to the arguments of a constructor in the class.
Each newly created actor has a unique name that is initially known only to the creator at the point where the
creation occurs.
Execution Semantics
The semantics of ActorFoundry can be informally
described as follows. Consider an ActorFoundry program P that consists of a set of actor definitions.
An actor communicates with another actor in P by
sending asynchronous (non-blocking) messages using
the send statement: send(a,msg) has the effect of eventually appending the contents of msg to the mailbox of the actor a. However, the call to send returns
immediately, that is, the sending actor does not wait
for the message to arrive at its destination. Because
actors operate asynchronously, and the network has
indeterminate delays, the arrival order of messages is
nondeterministic. However, we assume that messages
are eventually delivered (a form of fairness).
At the beginning of execution of P, the mailbox of
each actor is empty and some actor in the program must
receive a message from the environment. The ActorFoundry runtime first creates an instance of a specified
actor and then sends a specified message to it, which
serves as P’s entry point.
Each actor can be viewed as executing a loop with
the following steps: remove a message from its mailbox
(often implemented as a queue), decode the message,
and execute the corresponding method. If an actor’s
mailbox is empty, the actor blocks – waiting for the next
message to arrive in the mailbox (such blocked actors
are referred to as idle actors). The processing of a message may cause the actor’s local state to be updated,
new actors to be created, and messages to be sent.
Because of the encapsulation property of actors, there is
no interference between messages that are concurrently
processed by different actors.
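The receive loop described above can be sketched in plain Java. This is a minimal illustration and not the ActorFoundry runtime: the class name CounterActor, the Message type, and the "add" and "stop" message names are all assumptions made for the example, and the "stop" message exists only so that the demonstration can finish.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of an actor's receive loop; not the ActorFoundry runtime.
final class CounterActor implements Runnable {
    // A message names a method and carries one integer argument (a simplification).
    static final class Message {
        final String method;
        final int arg;
        Message(String method, int arg) { this.method = method; this.arg = arg; }
    }

    private final BlockingQueue<Message> mailbox = new LinkedBlockingQueue<>();
    private int total = 0;   // local state, touched only by the actor's own thread

    // Asynchronous send: append to the mailbox and return immediately.
    void send(String method, int arg) {
        mailbox.add(new Message(method, arg));
    }

    @Override
    public void run() {
        try {
            while (true) {
                Message m = mailbox.take();       // blocks while the mailbox is empty (idle actor)
                if (m.method.equals("add")) {     // decode, then run the matching method
                    total += m.arg;
                } else if (m.method.equals("stop")) {
                    return;                       // demo-only: lets the example terminate
                }                                 // other message types are ignored in this sketch
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Sends the values 1..10 to a fresh actor and returns its final local state.
    static int demo() {
        CounterActor actor = new CounterActor();
        Thread thread = new Thread(actor);
        thread.start();
        for (int i = 1; i <= 10; i++) actor.send("add", i);
        actor.send("stop", 0);
        try {
            thread.join();                        // after join, reading total is safe
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return actor.total;
    }
}
```

Because only the actor's own thread touches total, no locking is needed around the state update; the mailbox is the only shared structure.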
An actor program “terminates” when every actor
created by the program is idle and the actors are
not open to the environment (otherwise the environment could send new messages to their mailboxes in the future). Note that an actor program need
not terminate – in particular, certain interactive programs and operating systems may continue to execute
indefinitely.
Listing  shows the HelloWorld program in ActorFoundry. The program comprises two actor definitions: the HelloActor and the WorldActor. An
instance of the HelloActor can receive one type of
message, the greet message, which triggers the execution of the greet method. The greet method serves
as P’s entry point, in lieu of the traditional main
method.
On receiving a greet message, the HelloActor
sends a print message to the stdout actor (a builtin actor representing the standard output stream)
along with the string “Hello.” As a result, “Hello”
will eventually be printed on the standard output
stream. Next, it creates an instance of the WorldActor.
The HelloActor sends an audience message to the
WorldActor, thus delegating the printing of “World”
to it. Note that due to asynchrony in communication, it is possible for “World” to be printed before
“Hello.”
Listing  Hello World! program in ActorFoundry
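public class HelloActor extends Actor {
    // Entry point of the program: runs when a greet message is received.
    @message
    public void greet() throws RemoteCodeException {
        ActorName other = null;
        send(stdout, "print", "Hello");     // asynchronous: returns immediately
        other = create(WorldActor.class);   // name initially known only to this actor
        send(other, "audience");            // delegate the printing of "World"
    }
}

public class WorldActor extends Actor {
    @message
    public void audience() throws RemoteCodeException {
        send(stdout, "print", "World");
    }
}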
Synchronization
Synchronization in actors is achieved through communication. Two types of commonly used communication
patterns are Remote Procedure Call (RPC)-like messaging and local synchronization constraints. Language
constructs can enable actor programmers to specify
such patterns. Such language constructs are definable
in terms of primitive actor constructs, but providing
them as first-class linguistic objects simplifies the task
of writing parallel code.
RPC-Like Messaging
RPC-like communication is a common pattern of
message-passing in actor programs. In RPC-like communication, the sender of a message waits for the
reply to arrive before the sender proceeds with processing other messages. For example, consider the pattern
shown in Fig.  for a client actor which requests a quote
from a travel service. The client wishes to wait for the
quote to arrive before it decides whether to buy the trip,
or to request a quote from another service.
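As a rough illustration of the quote request just described, the sketch below builds an RPC-like interaction out of asynchronous sends in plain Java. It is not ActorFoundry code: the names RpcSketch, Msg, and call, the request identifier, and the price 420 are assumptions made for the example. The client sends a request, then keeps taking messages from its own mailbox until the matching reply arrives, setting unrelated messages aside.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: RPC-like messaging built from asynchronous sends.
final class RpcSketch {
    static final class Msg {
        final String kind;                  // "quote", "reply", or any other message type
        final long requestId;               // correlates a reply with its request
        final int value;                    // payload (a price in this example)
        final BlockingQueue<Msg> replyTo;   // sender's mailbox, or null if no reply is expected
        Msg(String kind, long requestId, int value, BlockingQueue<Msg> replyTo) {
            this.kind = kind; this.requestId = requestId; this.value = value; this.replyTo = replyTo;
        }
    }

    // Send a request, then wait for the matching reply; other messages are buffered.
    static int call(BlockingQueue<Msg> service, BlockingQueue<Msg> self,
                    List<Msg> deferred, long requestId) throws InterruptedException {
        service.add(new Msg("quote", requestId, 0, self));    // asynchronous send
        while (true) {
            Msg m = self.take();                               // next incoming message
            if (m.kind.equals("reply") && m.requestId == requestId) {
                return m.value;                                // the reply we were waiting for
            }
            deferred.add(m);                                   // unrelated: keep for later
        }
    }

    // Returns {quoted price, number of messages deferred while waiting}.
    static int[] demo() {
        BlockingQueue<Msg> client = new LinkedBlockingQueue<>();
        BlockingQueue<Msg> service = new LinkedBlockingQueue<>();
        Thread travelService = new Thread(() -> {
            try {
                Msg req = service.take();
                req.replyTo.add(new Msg("reply", req.requestId, 420, null));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        travelService.start();
        client.add(new Msg("news", 0, 0, null));   // an unrelated message is already pending
        List<Msg> deferred = new ArrayList<>();
        try {
            int price = call(service, client, deferred, 7L);
            travelService.join();
            return new int[] { price, deferred.size() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

While call is waiting, the client processes nothing else, which is exactly the dependency that makes overuse of this pattern costly.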
Without a high-level language abstraction to express the
RPC-like message pattern, a programmer has to explicitly implement the following steps in their program:
1. The client actor sends a request.
2. The client then checks incoming messages.
3. If the incoming message corresponds to the reply
to its request, the client takes the appropriate action
(accept the offer or keep searching).
4. If an incoming message does not correspond to the
reply to its request, the message must be handled
(e.g., by being buffered for later processing), and the
client continues to check messages for the reply.
Actors. Fig.  A client actor requesting quotes from
multiple competing services using RPC-like
communication. The dashed slanted arrows denote
messages and the dashed vertical arrows denote that the
actor is waiting or is blocked during that period in its life
RPC-like messaging is almost universally supported
in actor languages and libraries. RPC-like messages are
particularly useful in two kinds of common scenarios.
One scenario occurs when an actor wants to send an
ordered sequence of messages to a particular recipient –
in this case, it wants to ensure that a message has been
received before it sends another. A variant of this scenario is where the sender wants to ensure that the target
actor has received a message before it communicates
this information to another actor. A second scenario is
when the state of the requesting actor is dependent on
the reply it receives. In this case, the requesting actor
cannot meaningfully process unrelated messages until
it receives a response.
Because RPC-like messages are similar to procedure calls in sequential languages, programmers have
a tendency to overuse them. Unfortunately, inappropriate usage of RPC-like messages introduces unnecessary
dependencies in the code. This may not only make the
program’s execution more inefficient than it needs to be,
it can lead to deadlocks and livelocks (where an actor
ignores or postpones processing messages, waiting for
an acknowledgment that never arrives).
Local Synchronization Constraints
Asynchrony is inherent in distributed systems and
mobile systems. One implication of asynchrony is that
the number of possible orderings in which messages
may arrive is exponential in the number of messages
that are “pending” at any time (i.e., messages that have
been sent but have not been received). Because a sender
may be unaware of what the state of the actor it is sending a message to will be when the latter receives the
message, it is possible that the recipient may not be in
a state where it can process the message it is receiving. For example, a spooler may not have a job when
some printer requests one. As another example, messages to actors representing individual matrix elements
(or groups of elements) asking them to process different iterations in a parallel Cholesky decomposition
algorithm need to be monotonically ordered. The need
for such orderings leads to considerable complexity in
concurrent programs, often introducing bugs or inefficiencies due to suboptimal implementation strategies.
For example, in the case of Cholesky decomposition,
imposing a global ordering on the iterations leads to
highly inefficient execution on multicomputers [].
Consider the example of a print spooler. A
“get” message from an idle printer to its spooler may
arrive when the spooler has no jobs to return to the printer.
One way to address this problem is for the spooler to
refuse the request. The printer then needs to repeatedly
poll the spooler until the latter has a job. This technique
is called busy waiting; busy waiting can be expensive:
it may prevent the waiting actor from doing other
work while it “waits,” and it results in unnecessary message traffic. An alternative is for the spooler to buffer the
“get” message for deferred processing. The effect of such
buffering is to change the order in which the messages
are processed in a way that guarantees that the number of put messages processed by the spooler is never
less than the number of get messages processed by
the spooler.
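The buffering alternative can be sketched as follows. This is a single-threaded illustration in plain Java, with each method standing in for the handling of one message; the class name Spooler, the handler names, and the dispatched log are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: a spooler that defers "get" requests instead of refusing them.
final class Spooler {
    private final Queue<String> jobs = new ArrayDeque<>();            // jobs received via put
    private final Queue<String> waitingPrinters = new ArrayDeque<>(); // deferred get requests
    final List<String> dispatched = new ArrayList<>();                // "printer<-job" log for the demo

    // Handles a put message: hand the job to a waiting printer if any, else store it.
    void put(String job) {
        String printer = waitingPrinters.poll();
        if (printer != null) dispatch(printer, job);
        else jobs.add(job);
    }

    // Handles a get message: serve it now if a job exists, otherwise defer it.
    void get(String printer) {
        String job = jobs.poll();
        if (job != null) dispatch(printer, job);
        else waitingPrinters.add(printer);
    }

    private void dispatch(String printer, String job) {
        dispatched.add(printer + "<-" + job);   // stands in for sending the job to the printer
    }

    static List<String> demo() {
        Spooler s = new Spooler();
        s.get("p1");       // arrives before any job: deferred, not refused
        s.put("a.pdf");    // releases the deferred get
        s.put("b.pdf");
        s.get("p2");
        return s.dispatched;
    }
}
```

The printer p1 never polls: its single request is answered as soon as a job arrives. Note that the deferral logic sits inside the handlers themselves in this version.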
Actors. Fig.  A file actor communication with a client is
constrained using local synchronization constraints. The
vertical arrows depict the timeline of the life of an actor
and the slanted arrows denote messages. The labels inside
a circle denote the messages that the file actor can accept
in that particular state
(Figure labels: the client sends Open, Read, Read, Read, Close; the file actor accepts “open”, then “read” | “close”.)
If pending messages are buffered explicitly inside the
body of an actor, the code specifying the functionality
(the how or representation) of the actor is mixed with
the logic determining the order in which the actor processes the messages (the when). Such mixing violates the
software principle of separation of concerns. Researchers
have proposed various constructs to enable programmers to specify the correct orderings in a modular and
abstract way, specifically, as logical formulae (predicates) over the state of an actor and the type of messages. Many actor languages and frameworks provide
such constructs; examples include local synchronization constraints in ActorFoundry, and pattern matching
on sets of messages in Erlang and Scala Actors library.
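To make the idea of a constraint as a predicate concrete, here is a minimal plain-Java sketch of a file actor that accepts "open" only when closed and "read" or "close" only when open. It does not use ActorFoundry's actual constraint syntax; the names FileActor, enabled, handle, and deliver are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch: ordering logic expressed as a predicate, separate from behavior.
final class FileActor {
    private boolean isOpen = false;                           // local state
    private final List<String> pending = new LinkedList<>();  // mailbox contents, in arrival order
    final List<String> processed = new ArrayList<>();         // order in which messages were handled

    // The constraint (the "when"): a predicate over the state and the message type.
    private boolean enabled(String msg) {
        return msg.equals("open") ? !isOpen : isOpen;         // read and close only while open
    }

    // The behavior (the "how"), free of any ordering logic.
    private void handle(String msg) {
        if (msg.equals("open")) isOpen = true;
        else if (msg.equals("close")) isOpen = false;
        processed.add(msg);
    }

    // Accept a message into the mailbox, then process every pending message whose constraint holds.
    void deliver(String msg) {
        pending.add(msg);
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Iterator<String> it = pending.iterator(); it.hasNext(); ) {
                String m = it.next();
                if (enabled(m)) {
                    it.remove();
                    handle(m);
                    progress = true;
                    break;                                    // state changed: rescan from the oldest message
                }
            }
        }
    }

    static List<String> demo() {
        FileActor f = new FileActor();
        f.deliver("read");    // arrives too early: stays pending
        f.deliver("open");    // enables the pending read
        f.deliver("close");
        return f.processed;
    }
}
```

The handler never mentions ordering, and the predicate never mentions what a read does, so either can change without touching the other.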
Patterns of Actor Programming
Two common patterns of parallel programming are
pipeline and divide-and-conquer []. These patterns are
illustrated in Fig. a and b, respectively.
An example of the pipeline pattern is an image processing network (see Fig. a) in which a stream of
images is passed through a series of filtering and transforming stages. The output of the last stage is a stream
of processed images. This pattern has been demonstrated by an image processing example, written using
the Asynchronous Agents Library [], which is part of
the Microsoft Visual Studio .
A map-reduce graph is an example of the divide-and-conquer pattern (see Fig. b). A master actor maps
the computation onto a set of workers and the output from each of these workers is reduced in the “join
continuation” behavior of the master actor (possibly
modeled as a separate actor) (e.g., see []). Other examples of the divide-and-conquer pattern are naïve parallel
quicksort [] and parallel mergesort. The synchronization idioms discussed above may be used in succinct
encoding of these patterns in actor programs since these
patterns essentially require ordering the processing of
some messages.
Actors. Fig.  Patterns of actor programming (from top): (a) pipeline pattern (b) divide-and-conquer pattern
Semantic Properties
As mentioned earlier, some important semantic properties of the pure Actor model are encapsulation and
atomic execution of methods (where a method represents computation in response to a message), fairness,
and location transparency []. We discuss the implications of these properties.
Note that not all actor languages enforce all these
properties. Often, the implementations compromise
some actor properties, typically because it is simpler
to achieve efficient implementations by doing so. However, it is possible by sophisticated program transformations, compilation, and runtime optimizations to regain
almost all the efficiency that is lost in a naïve language
implementation, although doing so is more challenging for library-like actor frameworks []. By failing to
enforce some actor properties in an actor language or
framework implementation, actor languages add to the
burden of the programmers, who have to then ensure
that they write programs in a way that does not violate
the property.
Encapsulation and Atomicity
Encapsulation implies that no two actors share state.
This is useful for enforcing an object-style decomposition in the code. In sequential object-based languages,
this has led to the natural model of atomic change in
objects: an object invokes (sends a message to) another
object, which finishes processing the message before
accepting another message from a different object. This
allows us to reason about the behavior of the object in
response to a message, given the state of the target object
when it received the message. In a concurrent computation, it is possible for a message to arrive while an actor
is busy processing another message. Now if the second message is allowed to interrupt the target actor and
modify the target’s state while the target is still processing the first message, it is no longer feasible to reason
about the behavior of the target actor based on what
the target’s state was when it received the first message.
This makes it difficult to reason about the behavior of
the system as such interleaving of messages may lead to
erroneous and inconsistent states.
Instead, the target actor processes messages one
at a time, in a single atomic step consisting of all
actions taken in response to a given message [].
By dramatically reducing the nondeterminism that
must be considered, such atomicity provides a macrostep semantics which simplifies reasoning about actor
programs. Macro-step semantics is commonly used
by correctness-checking tools; it significantly reduces
the state-space exploration required to check a property against an actor program’s potential executions
(e.g., see []).
Fairness
The Actor model assumes a notion of fairness which
states that every actor makes progress if it has some
computation to do, and that every message is eventually
delivered to the destination actor, unless the destination
actor is permanently disabled. Fairness enables modular reasoning about the liveness properties of actor
programs []. For example, if an actor system A is composed with an actor system B where B includes actors
that are permanently busy, the composition does not
affect the progress of the actors in A. A familiar example
where fairness would be useful is in browsers. Problems are often caused by the composition of browser
components with third-party plug-ins: in the absence
of fairness, such plug-ins sometimes result in browser
crashes and hang-ups.
Location Transparency
In the Actor model, the actual location of an actor does
not affect its name. Actors communicate by exchanging messages with other actors, which could be on the
same core, on the same CPU, or on another node in
the network. Location transparent naming provides an
abstraction for programmers, enabling them to program without worrying about the actual physical location of actors. Location transparent naming facilitates
automatic migration in the runtime, much as indirection in addressing facilitates compaction following
garbage collection in sequential programming.
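The indirection that location-transparent naming provides can be illustrated with a small name table. This is a single-process sketch in plain Java, not a distributed runtime; the class NameService and the node identifiers are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: senders address actor names; a table resolves the current location.
final class NameService {
    private final Map<String, String> location = new HashMap<>();        // actor name -> node id
    private final Map<String, List<String>> delivered = new HashMap<>(); // node id -> messages received

    void register(String actorName, String node) {
        location.put(actorName, node);
    }

    // Migration updates only the mapping; the actor's name does not change.
    void migrate(String actorName, String newNode) {
        location.put(actorName, newNode);
    }

    // The sender supplies a name, never a location.
    void send(String actorName, String msg) {
        String node = location.get(actorName);
        if (node == null) throw new IllegalArgumentException("unknown actor: " + actorName);
        delivered.computeIfAbsent(node, k -> new ArrayList<>()).add(actorName + ":" + msg);
    }

    // Messages that have been routed to the given node so far.
    List<String> inboxOf(String node) {
        List<String> inbox = delivered.get(node);
        return inbox == null ? new ArrayList<String>() : inbox;
    }
}
```

A sender that calls send("worker", ...) before and after migrate("worker", "nodeB") uses the same name both times; only the routing changes.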
Mobility is defined as the ability of a computation
to migrate across different nodes. Mobility is important for load-balancing, fault-tolerance, and reconfiguration. In particular, mobility is useful in achieving
scalable performance, particularly for dynamic, irregular applications []. Moreover, employing different
distributions in different stages of a computation may
improve performance. In other cases, the optimal or
correct performance depends on runtime conditions
such as data and workload, or security characteristics
of different platforms. For example, web applications
may be migrated to servers or to mobile clients depending on the network conditions and capabilities of the
client [].
Mobility may also be useful in reducing the energy
consumed by the execution of parallel applications.
Different parts of an application often involve different parallel algorithms and the energy consumption of
an algorithm depends on how many cores the algorithm is executed on and at what frequency these cores
operate []. Mobility facilitates dynamic redistribution
of a parallel computation to the appropriate number
of cores (i.e., to the number of cores that minimize
energy consumption for a given performance requirement and parallel algorithm) by migrating actors. Thus,
mobility could be an important feature for energy-aware programming of multicore (manycore) architectures. Similarly, energy savings may be facilitated by
being able to migrate actors in sensor networks and
clouds.
Implementations
Erlang is arguably the best-known implementation of
the Actor model. It was developed to program telecom switches at Ericsson about  years ago. Some
recent actor implementations have been listed earlier.
Many of these implementations have focused on a
particular domain such as the Internet (SALSA), distributed applications (Erlang and E), sensor networks
(ActorNet), and, more recently, multicore processors
(Scala Actors library, ActorFoundry, and many others in
development).
It has been noted that a faithful but naïve implementation of the Actor model can be highly inefficient []
(at least on the current generation of architectures).
Consider three examples:
. An implementation that maps each actor to a separate process may have a high cost for actor creation.
. If the number of cores is less than the number of
actors in the program (sometimes termed CPU oversubscription), an implementation mapping actors to
separate processes may suffer from high context
switching cost.
. If two actors are located on the same sequential
node, or on a shared-memory processor, it may
be an order of magnitude more efficient to pass a
reference to the message contents rather than to
make a copy of the actual message contents.
These inefficiencies may be addressed by compilation and runtime techniques, or through a combination of the two. The implementation of the ABCL
language [] demonstrates some early ideas for optimizing both intra-node and internode execution and
communication between actors. The Thal language
project [] shows that encapsulation, fairness, and universal naming in an actor language can be implemented
efficiently on commodity hardware by using a combination of compiler and runtime. The Thal implementation also demonstrates that various communication
abstractions such as RPC-like communication, local
synchronization constraints, and join expressions can
also be supported efficiently using various compile-time
program transformations.
The Kilim framework develops a clever post-compilation continuation-passing style (CPS) transform (“weaving”) on Java-based actor programs for
supporting lightweight actors that can pause and
resume []. Kilim and Scala also add type systems to
support safe but efficient messages among actors on a
shared node [, ]. Recent work suggests that ownership transfer between actors, which enables safe and
efficient messaging, can be statically inferred in most
cases [].
On distributed platforms such as cloud computers or grids, because of latency in sending messages
to remote actors, an important technique for achieving good performance is communication–computation
overlap. Decomposition into actors and the placement
of actors can significantly determine the extent of this
overlap. Some of these issues have been effectively
addressed in the Charm++ runtime []. Decomposition and placement issues are also expected to show up
on scalable manycore architectures since these architectures cannot be expected to support constant time access
to shared memory.
Finally, note that the notion of garbage in actors
is somewhat complex. Because an actor name may
be communicated in a message, it is not sufficient to
mark the forward acquaintances (references) of reachable actors as reachable. The inverse acquaintances of
reachable actors that may be potentially active need to
be considered as well (these actors may send a message to a reachable actor). Efficient garbage collection
of distributed actors remains an open research problem
because of the problem of taking efficient distributed
snapshots of the reachability graph in a running
system [].
Tools
Several tools are available to aid in writing, maintaining, debugging, model checking, and testing actor
programs. Both Erlang and Scala have a plug-in for
the popular, open source IDE (Integrated Development
Environment) called Eclipse (http://www.eclipse.org).
A commercial testing tool for Erlang programs called
QuickCheck [] is available. The tool enables programmers to specify program properties and input generators which are used to generate test inputs.
JCute [] is a tool for automatic unit testing of programs written in a Java actor framework. Basset []
works directly on executable (Java bytecode) actor
programs and is easily retargetable to any actor language that compiles to bytecode. Basset understands
the semantic structure of actor programs (such as the
macro-step semantics), enabling efficient path exploration through the Java Pathfinder (JPF) – a popular tool for model checking programs []. The term
rewriting system Maude provides an Actor module to
specify program behavior; it has been used to model
check actor programs []. There has also been work on
runtime monitoring of actor programs [].
Extensions and Abstractions
A programming language should facilitate the process of writing programs by being close to the conceptual level at which a programmer thinks about a
problem rather than at the level at which it may be
implemented. Higher level abstractions for concurrent
programming may be defined in interaction languages
which allow patterns to be captured as first-class
objects []. Such abstractions can be implemented
through an adaptive, reflective middleware []. Besides
programming abstractions for concurrency in the pure
(asynchronous) Actor model, there are variants of the
Actor model, such as for real-time, which extend the
model [, ]. Two interaction patterns are discussed
to illustrate the ideas of interaction patterns.
Pattern-Directed Communication
Recall that a sending actor must know the name of
a target actor before the sending actor can communicate with the target actor. This property, called locality,
is useful for compositional reasoning about actor programs – if it is known that only some actors can send
a message to an actor A, then it may be possible to
figure out what types of messages A may get and perhaps specify some constraints on the order in which
it may get them. However, real-world programs generally create an open system which interacts with their
external environment. This means that having ways of
discovering actors which provide certain services can be
helpful. For example, if an actor migrates to some environment, discovering a printer in that environment may
be useful.
Pattern-directed communication allows programmers to declare properties of a group of actors, so that the
actual recipients of a message can be discovered from these properties and
chosen at runtime. In the ActorSpace model, an actor
specifies recipients in terms of patterns over properties
that must be satisfied by the recipients. The sender may
send the message to all actors (in some group) that satisfy the property, or to a single representative actor [].
There are other models for pattern-based communication. In Linda, potential recipients specify a pattern for
messages they are interested in []. The sending actors
simply inserts a message (called tuple in Linda) into
Actors
a tuple-space, from where the receiving actors may read
or remove the tuples if the tuple matches the pattern of
messages the receiving actor is interested in.
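The sender-side form of pattern-directed communication can be sketched as follows. This is a plain-Java illustration loosely modeled on the description of ActorSpace above, not its actual interface: the class ActorSpaceSketch, the use of regular expressions over a single attribute string, and the choice of the first registered match as the "single representative" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: recipients are selected by a pattern over advertised attributes.
final class ActorSpaceSketch {
    private final Map<String, String> attributes = new LinkedHashMap<>();      // actor name -> attribute
    private final Map<String, List<String>> mailboxes = new LinkedHashMap<>(); // actor name -> messages

    // An actor makes itself discoverable by advertising an attribute.
    void register(String name, String attribute) {
        attributes.put(name, attribute);
        mailboxes.put(name, new ArrayList<String>());
    }

    // Names of all registered actors whose attribute matches the pattern, in registration order.
    private List<String> matching(String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            if (p.matcher(e.getValue()).matches()) result.add(e.getKey());
        }
        return result;
    }

    // Deliver to every matching actor; returns the number of recipients.
    int broadcast(String regex, String msg) {
        List<String> targets = matching(regex);
        for (String t : targets) mailboxes.get(t).add(msg);
        return targets.size();
    }

    // Deliver to one matching actor (here the first registered); false if none matches.
    boolean sendToOne(String regex, String msg) {
        List<String> targets = matching(regex);
        if (targets.isEmpty()) return false;
        mailboxes.get(targets.get(0)).add(msg);
        return true;
    }

    List<String> mailboxOf(String name) {
        return mailboxes.get(name);
    }
}
```

The sender never learns the recipients' names; it states only what kind of actor it is looking for, for instance the pattern "printer/.*".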
Coordination
Actors help simplify programming by increasing the
granularity at which programmers need to reason about
concurrency, namely, they may reason in terms of the
potential interleavings of messages to actors, instead
of in terms of the interleavings of accesses to shared
variables within actors. However, developing actor programs is still complicated and prone to errors. A key
cause of complexity in actor programs is the large number of possible interleavings of messages to groups of
actors: if these message orderings are not suitably constrained, some possible execution orders may fail to
meet the desired specification.
Recall that local synchronization constraints postpone the dispatch of a message based on the contents of
the messages and the local state of the receiving actor
(see section “Local Synchronization Constraints”). Synchronizers, on the other hand, change the order in which
messages are processed by a group of actors by defining constraints on the ordering of messages processed at
different actors in the group. For example, if
withdrawal and deposit messages must be processed
atomically by two different actors, a Synchronizer can
specify that they must be scheduled together. Synchronizers are described in [].
In the standard actor semantics, an actor that knows
the name of a target actor may send the latter a message. An alternate semantics introduces the notion of a
channel; a channel is used to establish communication
between a given sender and a given recipient. Recent
work on actor languages has introduced stateful channel contracts to constrain the order of messages between
two actors. Channels are a central concept for communication between actors in both Microsoft’s Singularity platform [] and Microsoft’s Axum language [],
while they can be optionally introduced between two
actors in Erlang. Channel contracts specify a protocol that governs the communication between the two
end points (actors) of the channel. The contracts are
stated in terms of state transitions based on observing
messages on the channel.
From the perspective of each end point (actor), the
channel contract specifies the interface of the other end
point (actor) in terms of not only the type of messages
but also the ordering on messages. In Erlang, contracts
are enforced at runtime, while in Singularity a more
restrictive notion of typed contracts makes it feasible to
check the constraints at compile time.
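A contract stated as state transitions over observed messages can be checked at runtime with a small state machine. The sketch below is plain Java and does not reproduce the contract syntax of Singularity, Axum, or any Erlang library; the class ChannelContract and its methods allow and observe are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a channel contract as a state machine over message types.
final class ChannelContract {
    // state -> (message type -> next state)
    private final Map<String, Map<String, String>> transitions = new HashMap<>();
    private String state;

    ChannelContract(String initialState) {
        this.state = initialState;
    }

    // Declares that messageType may be observed in state from, moving the channel to state to.
    ChannelContract allow(String from, String messageType, String to) {
        transitions.computeIfAbsent(from, k -> new HashMap<String, String>()).put(messageType, to);
        return this;
    }

    // Runtime check: advance on a permitted message; report a violation otherwise.
    boolean observe(String messageType) {
        Map<String, String> outgoing = transitions.get(state);
        String next = (outgoing == null) ? null : outgoing.get(messageType);
        if (next == null) return false;   // contract violated; the state is left unchanged
        state = next;
        return true;
    }

    String state() {
        return state;
    }
}
```

A file protocol, for instance, could be declared as new ChannelContract("closed").allow("closed", "open", "opened").allow("opened", "read", "opened").allow("opened", "close", "closed"); a "read" observed in the closed state is then rejected.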
Current Status and Perspective
Actor languages have been used for parallel and distributed computing in the real world for some time
(e.g., Charm++ for scientific applications on supercomputers [], Erlang for distributed applications []). In
recent years, interest in actor-based languages has been
growing, both among researchers and among practitioners. This interest is triggered by emerging programming platforms such as multicore computers and cloud
computers. In some cases, such as cloud computing,
web services, and sensor networks, the Actor model
is a natural programming model because of the distributed nature of these platforms. Moreover, as multicore architectures are scaled, multicore computers will
also look more and more like traditional multicomputer
platforms. This is illustrated by the -core Single-Chip
Cloud Computer (SCC) developed by Intel [] and the
-core TILE-Gx by Tilera []. However, the argument for using actor-based programming languages is
not simply that they are a good match for distributed
computing platforms; it is that Actors is a good model in
which to think about concurrency. Actors simplify the
task of programming by extending object-based design
to concurrent (parallel, distributed, mobile) systems.
Bibliography
. Agha G, Callsen CJ () Actorspace: an open distributed
programming paradigm. In: Proceedings of the fourth ACM
SIGPLAN symposium on principles and practice of parallel programming (PPoPP), San Diego, CA. ACM, New York,
pp –
. Agha G, Frolund S, Kim WY, Panwar R, Patterson A, Sturman D
() Abstraction and modularity mechanisms for concurrent
computing. IEEE Trans Parallel Distr Syst ():–
. Agha G () Actors: a model of concurrent computation in
distributed systems. MIT Press, Cambridge, MA
. Agha G () Concurrent object-oriented programming. Commun ACM (): –

A

A
Actors
. Arts T, Hughes J, Johansson J, Wiger U () Testing telecoms
software with quviq quickcheck. In: ERLANG ’: proceedings of
the  ACM SIGPLAN workshop on Erlang. ACM, New York,
pp –
. Agha G, Kim WY () Parallel programming and complexity
analysis using actors. In: Proceedings: third working conference
on massively parallel programming models, , London. IEEE
Computer Society Press, Los Alamitos, CA, pp –
. Agha G, Mason IA, Smith S, Talcott C () A foundation for
actor computation. J Funct Program ():–
. Armstrong J () Programming Erlang: software for a concurrent World. Pragmatic Bookshelf, Raleigh, NC
. Astley M, Sturman D, Agha G () Customizable middleware
for modular distributed software. Commun ACM ():–
. Astley M (–) The actor foundry: a java-based actor programming environment. Open Systems Laboratory, University of
Illinois at Urbana-Champaign, Champaign, IL
. Briot JP, Guerraoui R, Lohr KP () Concurrency and distribution in object-oriented programming. ACM Comput Surv ():
–
. Chang PH, Agha G () Towards context-aware web applications. In: th IFIP international conference on distributed applications and interoperable systems (DAIS), , Paphos, Cyprus.
LNCS , Springer, Berlin
. Clavel M, Durán F, Eker S, Lincoln P, Martí-Oliet N, Meseguer J,
Talcott C () All about maude – a high-performance logical framework: how to specify, program and verify systems in
rewriting logic. Springer, Berlin
. Carriero N, Gelernter D () Linda in context. Commun ACM
:–
. Caromel D, Henrio L, Serpette BP () Asynchronous sequential processes. Inform Comput ():–
. Intel Corporation. Single-chip cloud computer. http://
techresearch.intel.com/ProjectDetails.aspx?Id=
. Microsoft Corporation. Asynchronous agents library. http://
msdn.microsoft.com/enus/library/dd(VS.).aspx
. Microsoft Corporation. Axum programming language. http://
msdn.microsoft.com/en-us/devlabs/dd.aspx
. Tilera Corporation. TILE-Gx processor family. http://tilera.com/
products/processors/TILE-Gxfamily
. Fähndrich M, Aiken M, Hawblitzel C, Hodson O, Hunt G, Larus
JR, Levi S () Language support for fast and reliable messagebased communication in singularity OS. SIGOPS Oper Syst Rev
():–
. Feng TH, Lee EA () Scalable models using model transformation. In: st international workshop on model based architecting
and construction of embedded systems (ACESMB), Toulouse,
France
. Houck C, Agha G () Hal: a high-level actor language and its
distributed implementation. In: st international conference on
parallel processing (ICPP), vol II, Ann Arbor, MI, pp –
. Hewitt C () PLANNER: a language for proving theorems in
robots. In: Proceedings of the st international joint conference
on artificial intelligence, Morgan Kaufmann, San Francisco, CA,
pp –
. Haller P, Odersky M () Actors that unify threads and events.
In: th International conference on coordination models and languages, vol  of lecture notes in computer science, Springer,
Berlin
. Haller P, Odersky M () Capabilities for uniqueness and borrowing. In: D’Hondt T (ed) ECOOP  object-oriented programming, vol  of lecture notes in computer science, Springer,
Berlin, pp –
. Kim WY, Agha G () Efficient support of location transparency
in concurrent object-oriented programming languages. In: Supercomputing ’: proceedings of the  ACM/IEEE conference on
supercomputing, San Diego, CA. ACM, New York, p 
. Korthikanti VA, Agha G () Towards optimizing energy costs
of algorithms for shared memory architectures. In: SPAA ’:
proceedings of the nd ACM symposium on parallelism in algorithms and architectures, Santorini, Greece. ACM, New York,
pp –
. Kale LV, Krishnan S () Charm++: a portable concurrent
object oriented system based on c++. ACM SIGPLAN Not ():
–
. Karmani RK, Shali A, Agha G () Actor frameworks for the
JVM platform: a comparative analysis. In: PPPJ ’: proceedings of the th international conference on principles and practice of programming in java, Calgary, Alberta. ACM, New York,
pp –
. Lauterburg S, Dotta M, Marinov D, Agha G () A framework for state-space exploration of Java-based actor programs.
In: ASE ’: proceedings of the  IEEE/ACM international conference on automated software engineering, Auckland, New Zealand. IEEE Computer Society, Washington, DC,
pp –
. Lee EA () Overview of the ptolemy project. Technical report
UCB/ERL M/. University of California, Berkeley
. Lauterburg S, Karmani RK, Marinov D, Agha G () Evaluating ordering heuristics for dynamic partial-order reduction
techniques. In: Fundamental approaches to software engineering
(FASE) with ETAPS, , LNCS , Springer, Berlin
. Negara S, Karmani RK, Agha G () Inferring ownership transfer for efficient message passing. In: To appear in the th ACM
SIGPLAN symposium on principles and practice of parallel programming (PPoPP). ACM, New York
. Ren S, Agha GA () Rtsynchronizer: language support for real-time specifications in distributed systems. ACM SIGPLAN Not
():–
. Sturman D, Agha G () A protocol description language for
customizing failure semantics. In: Proceedings of the thirteenth
symposium on reliable distributed systems, Dana Point, CA. IEEE
Computer Society Press, Los Alamitos, CA, pp –
. Sen K, Agha G () Automated systematic testing of open
distributed programs. In: Fundamental approaches to software
engineering (FASE), volume  of lecture notes in computer
science, Springer, Berlin, pp –
. Kliot G, Larus J, Pandya R, Thelin J, Bykov S, Geller A ()
Orleans: a framework for cloud computing. Technical report
MSR-TR--, Microsoft Research
. Singh V, Kumar V, Agha G, Tomlinson C () Scalability of
parallel sorting on mesh multicomputers. In: Parallel processing
symposium, . Proceedings, fifth international. Anaheim, CA,
pp –
. Srinivasan S, Mycroft A () Kilim: isolation typed actors for
java. In: Proceedings of the European conference on object oriented programming (ECOOP), Springer, Berlin
. Schäfer J, Poetzsch-Heffter A () Jcobox: generalizing active
objects to concurrent components. In: Proceedings of the
th European conference on object-oriented programming,
ECOOP’, Maribor, Slovenia. Springer, Berlin/Heidelberg
pp –
. Sen K, Vardhan A, Agha G, Rosu G () Efficient decentralized
monitoring of safety in distributed systems. In: ICSE ’: proceedings of the th international conference on software engineering, Edinburgh, UK. IEEE Computer Society. Washington, DC,
pp –
. Varela C, Agha G () Programming dynamically reconfigurable open systems with SALSA. ACM SIGPLAN Notices ():
–
. Venkatasubramanian N, Agha G, Talcott C () Scalable distributed garbage collection for systems of active objects. In:
Bekkers Y, Cohen J (eds) International workshop on memory
management, ACM SIGPLAN and INRIA, St. Malo, France.
Lecture notes in computer science, vol , Springer, Berlin,
pp –
. Visser W, Havelund K, Brat G, Park S () Model checking programs. In: Proceedings of the th IEEE international conference
on automated software engineering, ASE ’, Grenoble, France.
IEEE Computer Society, Washington, DC, pp 
. Yonezawa A (ed) () ABCL: an object-oriented concurrent
system. MIT Press, Cambridge, MA
Affinity Scheduling
Mark S. Squillante
IBM, Yorktown Heights, NY, USA
Synonyms
Cache affinity scheduling; Resource affinity scheduling
Definition
Affinity scheduling is the allocation, or scheduling, of
computing tasks on the computing nodes where they
will be executed more efficiently. Such affinity of a task
for a node can be based on any aspects of the computing node or computing task that make execution
more efficient, and it is most often related to different speeds or overheads associated with the resources
of the computing node that are required by the
computing task.
Discussion
Introduction
In parallel computing environments, it may be more
efficient to schedule a computing task on one computing node than on another. Such affinity of a specific task for a particular node can arise from many
sources, where the goal of the scheduling policy is
typically to optimize a functional of response time
and/or throughput. For example, the affinity might be
based on how fast the computing task can be executed on a computing node in an environment comprised of nodes with heterogeneous processing speeds.
As another example, the affinity might concern the
resources associated with the computing nodes such
that each node has a collection of available resources,
each task must execute on a node with a certain
set of resources, and the scheduler attempts to maximize performance subject to these imposed system
constraints.
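The two examples above (heterogeneous node speeds and resource constraints) can be combined in a short illustrative sketch. It is not taken from any of the cited work: the `Node` and `Task` types, their fields, and the `pick_node` helper are hypothetical names introduced here, and the estimated completion time is only one possible choice of objective.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    speed: float                        # relative processing speed (assumed known)
    resources: frozenset = frozenset()  # resources available on this node
    backlog: float = 0.0                # work already queued on this node


@dataclass
class Task:
    name: str
    work: float                         # amount of work, in speed-independent units
    needs: frozenset = frozenset()      # resources the task requires


def pick_node(task, nodes):
    """Return the eligible node with the smallest estimated completion time.

    A node is eligible only if it offers every resource the task needs.
    Returns None when no node satisfies the constraints.
    """
    eligible = [n for n in nodes if task.needs <= n.resources]
    if not eligible:
        return None
    # Estimated completion time: queued work plus this task, scaled by speed.
    return min(eligible, key=lambda n: (n.backlog + task.work) / n.speed)
```

With a fast node lacking some resource and a slow node that has it, a task requiring that resource is placed on the slow node, while an unconstrained task goes to whichever node is expected to finish it first.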
Another form of affinity scheduling is based on the
state of the memory system hierarchy. More specifically, it may be more efficient in parallel computing environments to schedule a computing task on a
particular computing node than on any other if relevant data, code, or state information already resides
in the caches or local memories associated with the
node. This is the form most commonly referred to as
affinity scheduling, and it is the primary focus of this
entry. The use of such affinity information in general-purpose multiprocessor systems can often improve performance in terms of functionals of response time and
throughput, particularly if this information is inexpensive to obtain and exploit. On the other hand, the performance benefits of this form of affinity scheduling
often depend upon a number of factors and vary over
time, and thus there is a fundamental scheduling tradeoff between scheduling tasks where they execute most
efficiently and keeping the workload shared among
nodes.
Affinity scheduling in general-purpose multiprocessor systems is described next within its historical
context, followed by a brief discussion of the performance trade-off between affinity scheduling and load
sharing.
General-Purpose Multiprocessor Systems
Many of the general-purpose parallel computing environments introduced in the s were shared-memory
multiprocessor systems of modest size in comparison with the parallel computers of today. Most of the
general-purpose multiprocessor operating systems at
the time implemented schedulers based on simple priority schemes that completely ignored the affinity a task
might have for a specific processor due to the contents of
processor caches. On the other hand, as processor cycle
times improved at a much faster rate than main memory
access times, researchers were observing the increasingly important relative performance impact of cache
misses (refer to, e.g., []). The relative costs of a cache
miss were also increasing due to other issues including cache coherency [] and memory bus interference.
As a matter of fact, the caches in one of the general-purpose multiprocessor systems at the time were not
intended to reduce the memory access time at all, but
rather to reduce memory bus interference and protect
the bus from most processor-memory references (refer
to, e.g., []).
In one of the first studies of such processor-cache
affinity issues, Squillante and Lazowska [, ] considered the performance implications of scheduling based
on the affinity of a task for the cache of a particular processor. The typical behaviors of tasks in general-purpose
multiprocessor systems can include alternation between
executing at a processor and releasing this processor
either to perform I/O or synchronization operations (in
which cases the task is not eligible for scheduling until
completion of the operation) or because of quantum
expiration or preemption (in which cases the task is
suspended to allow execution of another task). Upon
returning and being scheduled on a processor, the task
may experience an initial burst of cache misses and the
duration of this burst depends, in part, upon the number of blocks belonging to the task that are already resident in the cache of the processor. Continual increases
in cache sizes at the time further suggested that a significant portion of the working set of a task may reside in
the cache of a specific processor under these scenarios.
Scheduling decisions that disregard such cache reload
times may cause significant increases in the execution
times of individual tasks as well as reductions in the
performance of the entire system. These cache-reload
effects may be further compounded by an increase in
the overhead of cache coherence protocols (due to a
larger number of cache invalidations resulting from
modification of a task’s data still resident in another
processor’s cache) and an increase in bus traffic and
interference (due to cache misses).
To this end, Squillante and Lazowska [, ] formulate and analyze mathematical models to investigate
fundamental principles underlying the various performance trade-offs associated with processor-cache
affinity scheduling. These mathematical models represent general abstractions of a wide variety of general-purpose multiprocessor systems and workloads, ranging
from more traditional time-sharing parallel systems
and workloads to more recent dynamic
coscheduling parallel systems and workloads (see,
e.g., []). Several different scheduling policies are
considered, spanning the entire spectrum from ignoring processor-cache affinity to fixing tasks to execute
on specific processors. The results of this modeling
analysis illustrate and quantify the benefits and limitations of processor-cache affinity scheduling in general-purpose shared-memory multiprocessor systems with
respect to improving and degrading the first two statistical moments of response time and throughput. In
particular, the circumstances under which performance
improvement and degradation can be realized and the
importance of exploiting processor-cache affinity information depend upon many factors. Some of the most
important of these factors include the size of processor caches, the locality of the task memory references,
the size of the set of cache blocks in active use by the
task (its cache footprint), the ratio of the cache footprint
loading time to the execution time of the task per visit
to a processor, the time spent non-schedulable by the
task between processor visits, the processor activities in
between such visits, the system architecture, the parallel computing workload, the system scheduling strategy,
and the need to adapt scheduling decisions with changes
in system load.
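One of the factors listed above, the ratio of the cache footprint loading time to the execution time per processor visit, can be illustrated with a deliberately simple calculation. This is only a sketch under strong assumptions (a fixed miss penalty per block and no overlap between misses and computation); it is not the model analyzed in the cited studies, and the function names are hypothetical.

```python
def reload_time(footprint_blocks, resident_blocks, miss_penalty):
    """Time to reload the part of the footprint that is no longer cached.

    footprint_blocks: blocks in active use by the task (its cache footprint)
    resident_blocks:  blocks of the task still in the cache of the chosen processor
    miss_penalty:     assumed constant cost of fetching one block
    """
    missing = max(footprint_blocks - resident_blocks, 0)
    return missing * miss_penalty


def reload_ratio(footprint_blocks, resident_blocks, miss_penalty, execution_time):
    """Ratio of footprint reload time to execution time per processor visit."""
    return reload_time(footprint_blocks, resident_blocks, miss_penalty) / execution_time
```

Under these assumptions, a small ratio suggests that ignoring affinity costs little for that task, whereas a ratio approaching or exceeding one suggests that the reload burst dominates the visit.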
A large number of empirical research studies subsequently followed to further examine affinity scheduling across a broad range of general-purpose parallel
computing systems, workloads, scheduling strategies,
and resource types. Gupta et al. [] use a detailed
multiprocessor simulator to evaluate various scheduling strategies, including gang scheduling (coscheduling), two-level scheduling (space-sharing) with process
control, and processor-cache affinity scheduling, with
a focus on the performance impact of scheduling on
cache behavior. Processor-cache affinity scheduling is shown to yield a relatively small
but noticeable performance improvement, which can
be explained by the factors identified above under the
multiprocessor system and workload studied in [].
Two-level scheduling with process control and gang
scheduling are shown to provide the highest levels of
performance, with process control outperforming gang
scheduling when applications have large working sets
that fit within a cache. Vaswani and Zahorjan [] then
developed an implementation of a space-sharing strategy (under which processors are partitioned among
applications) to examine the implications of cache affinity in shared-memory multiprocessor scheduling under
various scientific application workloads. The results of
this study illustrate that processor-cache affinity within
such a space-sharing strategy provides negligible performance improvements for the three scientific applications considered. It is interesting to note that the
models in [, ], parameterized by the system and
application measurements and characteristics in [],
also show that affinity scheduling yields negligible performance benefits in these circumstances. Devarakonda
and Mukherjee [] consider various implementation
issues involved in exploiting cache affinity to improve
performance, arguing that affinity is most effective
when implemented through a thread package which
supports the multiplexing of user-level threads on operating system kernel-level threads. The results of this
study show that a simple scheduling strategy can yield
significant performance improvements under an appropriate application workload and a proper implementation approach, which can once again be explained by the
factors identified above. Torrellas et al. [, ] study
the performance benefits of cache-affinity scheduling
in shared-memory multiprocessors under a wide variety of application workloads, including various scientific, software development, and database applications, and under a time-sharing strategy that was in
widespread use by the vast majority of parallel computing environments at the time. The authors conclude
that affinity scheduling yields significant performance
improvements for a few of the application workloads
and moderate performance gains for most of the application workloads considered in their study, while not
degrading the performance of the rest of these workloads. These results can once again be explained by the
factors identified above.
In addition to affinity scheduling in general-purpose shared-memory multiprocessors, a number
of related issues have arisen with respect to other
types of resources in parallel systems. One particularly interesting application area concerns affinity-based
scheduling in the context of parallel network protocol
processing, as first considered by Salehi et al. (see []
and the references cited therein), where parallel computation is used for protocol processing to support
high-bandwidth, low-latency networks. In particular,
Salehi et al. [] investigate affinity-based scheduling issues with respect to: supporting a large number of streams concurrently; receive-side and send-side
protocol processing (including data-touching protocol
processing); stream burstiness and source locality of
network traffic; and improving packet-level concurrency and caching behavior. The approach taken to evaluate various affinity-based scheduling policies is based
on a combination of multiprocessor system measurements, analytic modeling, and simulation. This combination of methods is used to illustrate and quantify
the potentially significant benefits and effectiveness of
affinity-based scheduling in multiprocessor networking. Across all of the workloads considered, the authors
find benefits in managing threads and free memory by
taking affinity issues into account, with different forms
of affinity-based scheduling performing best under different workload conditions.
The trends of increasing cache and local memory sizes, and of processor cycle times decreasing at much
faster rates than memory access times, continued
over time. This includes generations of nonuniform memory-access (NUMA) shared-memory and
distributed-memory multiprocessor systems in which
the remoteness of memory accesses can have an even
more significant impact on performance. The penalty
for not adhering to processor affinities can be considerably more significant in such NUMA and distributed-memory multiprocessor systems, where the actual
cause of the larger costs depends upon the memory
management policy but is typically due to remote
access or demand paging of data stored in non-local
memory modules. These trends in turn have caused
the performance benefits of affinity scheduling in parallel computing environments to continue to grow. The
vast majority of parallel computing systems available
today, therefore, often exploit various forms of affinity
scheduling throughout distinct aspects of the parallel
computing environment. There continue to be important differences, however, among the various forms of
affinity scheduling depending upon the system architecture, application workload, and scheduling strategy, in
addition to new issues arising from some more recent
trends such as power management.
Affinity Scheduling and Load Sharing Trade-off
In parallel computing environments that employ affinity scheduling, the system often allocates computing
tasks on the computing nodes where they will be executed most efficiently. Conversely, underloaded nodes
are often inevitable in parallel computing environments
due to factors such as the transient nature of system
load and the variability of task service times. If computing tasks are always executed on the computing nodes
for which they have affinity, then the system may suffer from load sharing problems as tasks are waiting at
overloaded nodes while other nodes are underloaded.
On the one hand, if processor affinities are not followed, then the system may incur significant penalties
as each computing task must establish its working set in
close proximity to a computing node before it can proceed. On the other hand, scheduling decisions cannot be
based solely on task-node affinity, else other scheduling
criteria, such as fairness, may be sacrificed. Hence, there
is a fundamental scheduling trade-off between keeping the workload shared among nodes and scheduling
tasks where they execute most efficiently. An adaptive
scheduling policy is needed that determines, as a function of system load, the appropriate balance between
the extremes of strictly balancing the workload among
all computing nodes and abiding by task-node affinities blindly.
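The balance described above can be sketched as a single dispatch decision. The following is an illustrative policy only, not one proposed in the cited literature: `choose_node` and its `threshold` parameter are hypothetical, and in an adaptive policy the threshold would itself be chosen as a function of system load.

```python
def choose_node(queue_lengths, affinity_node, threshold):
    """Pick the node on which to run a task.

    The task stays on its affinity node unless that node's queue exceeds
    the shortest queue by more than `threshold` tasks.
    threshold = 0            -> close to strict load balancing
    threshold = float("inf") -> affinities are followed blindly
    """
    least_loaded = min(range(len(queue_lengths)), key=queue_lengths.__getitem__)
    imbalance = queue_lengths[affinity_node] - queue_lengths[least_loaded]
    return affinity_node if imbalance <= threshold else least_loaded
```

For example, with queue lengths `[5, 1, 3]` and affinity for node 0, a threshold of 2 sends the task to node 1, while an infinite threshold keeps it on node 0.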
One such form of load sharing has received considerable attention with respect to distributed system
environments. Static policies, namely those that use
information about the average behavior of the system while ignoring the current state, have been proposed and studied by numerous researchers including
Bokhari [], and Tantawi and Towsley []. These policies have been shown, in many cases, to provide better
performance than policies that do not attempt to share
the system workload. Other studies have shown that
more adaptive policies, namely those that make decisions based on the current state of the system, have the
potential to greatly improve system performance over
that obtained with static policies. Furthermore, Livny
and Melman [], and Eager, Lazowska, and Zahorjan [] have shown that much of this potential can
be realized with simple methods. This potential has
prompted a number of studies of specific adaptive load
sharing policies, including the research conducted by
Barak and Shiloh [], Wang and Morris [], Eager
et al. [], and Mirchandaney et al. [].
While there are many similarities between the
affinity scheduling and load sharing trade-off in
distributed and shared-memory (as well as some
distributed-memory) parallel computing environments,
there can be several important differences due to distinct characteristics of these diverse parallel system
architectures. One of the most important differences
concerns the direct costs of moving a computing task. In
a distributed system, these costs are typically incurred
by the computing node from which the task is being
migrated, possibly with an additional network delay.
Subsequent to this move, the processing requirements
of the computing task are identical to those at the computing node where the task has affinity. On the other
hand, the major costs of task migration in a shared-memory (as well as distributed-memory) system are
the result of a larger service demand at the computing node to which the computing task is migrated,
reflecting the time required to either establish the working set of the task in closer proximity to this node or
remotely access the working set from this node. Hence,
there is a shift of the direct costs of migration from the
(overloaded) computing node for which the computing task has affinity to the (underloaded) node receiving
the task.
Motivated by these and related differences between
distributed and shared-memory systems, Squillante
and Nelson [] investigated the fundamental tradeoff between affinity scheduling and load sharing in
Affinity Scheduling
shared-memory (as well as some distributed-memory)
parallel computing environments. More specifically, the
authors consider the question of how expensive task
migration must become before it is not beneficial to
have an underloaded computing node migrate a computing task waiting at another node, as it clearly would
be beneficial to migrate such a task if the cost to do
so were negligible. Squillante and Nelson [], therefore,
formulate and analyze mathematical models to investigate this and related questions concerning the conditions under which it becomes detrimental to migrate
a task away from its affinity node with respect to the
costs of not adhering to these task-node affinities. The
results of this modeling analysis illustrate and quantify
the potentially significant benefits of migrating waiting
tasks to underloaded nodes in shared-memory multiprocessors even when migration costs are relatively
large, particularly at moderate to heavy loads. By sharing the collection of computing tasks among all computing nodes, a combination of affinity scheduling and
threshold-based task migration can yield performance
that is better than a non-migratory policy even with
a larger service demand for migrated tasks, provided
proper threshold values are employed. These modeling analysis results also demonstrate the potential for
unstable behavior under task migration policies when
improper threshold settings are employed, where optimal policy thresholds avoiding such behavior are provided as a function of system load and the relative
processing time of migrated tasks.
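A minimal sketch of threshold-based task migration with a larger service demand for migrated tasks is given below. It is an illustration under simplifying assumptions rather than the policy analyzed in the work discussed above: the function name, the receiver-initiated rule (an idle node pulls work), and the multiplicative `migration_factor` are all hypothetical choices made here.

```python
from collections import deque


def migrate_if_beneficial(queues, idle_node, threshold, migration_factor):
    """Let an idle node pull one waiting task from the longest queue.

    queues:           list of deques holding the service demands of waiting tasks
    idle_node:        index of the node looking for work
    threshold:        minimum number of waiting tasks at the donor node
    migration_factor: >= 1; inflates the service demand of a migrated task
                      to reflect re-establishing its working set

    Returns (donor_index, inflated_demand), or None if nothing is migrated.
    """
    if queues[idle_node]:
        return None  # the node still has tasks with affinity for it
    donor = max(range(len(queues)), key=lambda i: len(queues[i]))
    if len(queues[donor]) < max(threshold, 1):
        return None  # no node is loaded enough to justify a migration
    demand = queues[donor].pop()          # take the most recently queued task
    inflated = demand * migration_factor  # migrated tasks cost more to serve
    queues[idle_node].append(inflated)
    return donor, inflated
```

Raising `threshold` makes migration rarer, and raising `migration_factor` makes each migration more expensive; the interplay between the two is the kind of question that the modeling analysis described above addresses.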
Related Entries
Load Balancing, Distributed Memory
Operating System Strategies
Scheduling Algorithms
Bibliographic Notes and Further
Reading
Affinity scheduling based on the different speeds
of executing tasks in heterogeneous processor environments has received considerable attention in the
literature, including both deterministic models and
stochastic models for which, as examples, the interested reader is referred to [, ] and [, ],
respectively. Affinity scheduling based on resource
requirement constraints has also received considerable
attention in the literature (see, e.g., [, , ]). Examples of shared-memory multiprocessor systems from
the s include the DEC Firefly [] and the Sequent
Symmetry []. Multiprocessor operating systems of
this period that completely ignored processor-cache
affinity include Mach [], DYNIX [], and Topaz [].
This entry primarily focuses on affinity scheduling
based on the state of the memory system hierarchy
where it can be more efficient to schedule a computing task on a particular computing node than on
another if any relevant information already resides in
caches or local memories in close proximity to the node.
This entry also considers the fundamental trade-off
between affinity scheduling and load sharing. A number
of important references have been provided throughout, to which the interested reader is referred together
with the citations provided therein. Many more references on these subjects are widely available, in addition
to the wide variety of strategies that have been proposed
to address affinity scheduling and related performance
trade-offs.
Bibliography
. Accetta M, Baron R, Bolosky W, Golub D, Rashid R, Tevanian A,
Young M () Mach: a new kernel foundation for UNIX development. In: Proceedings of USENIX Association summer technical conference, Atlanta, GA, June . USENIX Association,
Berkeley, pp –
. Archibald J, Baer J-L () Cache coherence protocols: evaluation using a multiprocessor simulation model. ACM Trans
Comput Syst ():–
. Barak A, Shiloh A () A distributed load-balancing policy for
a multicomputer. Softw Pract Exper ():–
. Bokhari SH () Dual processor scheduling with dynamic reassignment. IEEE Trans Softw Eng SE-():–
. Conway RW, Maxwell WL, Miller LW () Theory of scheduling. Addison-Wesley, Reading
. Craft DH () Resource management in a decentralized system.
In: Proceedings of symposium on operating systems principles,
October . ACM, New York, pp –
. Devarakonda M, Mukherjee A () Issues in implementation
of cache-affinity scheduling. In: Proceedings of winter USENIX
conference, January . USENIX Association, Berkeley,
pp –
. Eager DL, Lazowska ED, Zahorjan J () Adaptive load sharing
in homogeneous distributed systems. IEEE Trans Softw Eng SE():–
. Eager DL, Lazowska ED, Zahorjan J () A comparison of
receiver-initiated and sender-initiated adaptive load sharing.
Perform Evaluation ():–

. Gupta A, Tucker A, Urushibara S () The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In: Proceedings of ACM SIGMETRICS conference on measurement and
modeling of computer systems, May . ACM, New York,
pp –
. Horowitz E, Sahni S () Exact and approximate algorithms for
scheduling nonidentical processors. J ACM ():–
. Jouppi NP, Wall DW () Available instruction-level parallelism
for superscalar and superpipelined machines. In: Proceedings of
international symposium on computer architecture, April .
ACM Press, New York, pp –
. Lin W, Kumar PR () Optimal control of a queueing system
with two heterogeneous servers. IEEE Trans Automatic Contr
():–
. Livny M, Melman M () Load balancing in homogeneous
broadcast distributed systems. In: Proceedings of ACM computer network performance symposium. ACM Press, New York,
pp –
. Mirchandaney R, Towsley D, Stankovic JA () Adaptive load
sharing in heterogeneous systems. J Parallel Distrib Comput
:–
. Salehi JD, Kurose JF, Towsley D () The effectiveness of affinity-based scheduling in multiprocessor networking (extended version). IEEE/ACM Trans Netw ():–
. Sequent Computer Systems () Symmetry technical summary.
Sequent Computer Systems Inc, Beaverton
. Squillante MS, Lazowska ED () Using processor-cache affinity information in shared-memory multiprocessor scheduling.
Technical Report --, Department of Computer Science,
University of Washington, June . Minor revision, Feb 
. Squillante MS, Lazowska ED () Using processor-cache affinity
information in shared-memory multiprocessor scheduling. IEEE
Trans Parallel Distrib Syst ():–
. Squillante MS, Nelson RD () Analysis of task migration
in shared-memory multiprocessors. In: Proceedings of ACM
SIGMETRICS conference on measurement and modeling of computer systems, May . ACM, New York, pp –
. Squillante MS, Zhang Y, Sivasubramaniam A, Gautam N,
Franke H, Moreira J () Modeling and analysis of dynamic
coscheduling in parallel and distributed environments. In: Proceedings of ACM SIGMETRICS conference on measurement and
modeling of computer systems, June . ACM, New York,
pp –
. Tantawi AN, Towsley D () Optimal static load balancing in
distributed computer systems. J ACM ():–
. Thacker C, Stewart LC, Satterthwaite EH Jr () Firefly: a multiprocessor workstation. IEEE Trans Comput C-():–
. Thakkar SS, Gifford PR, Fieland GF () The balance multiprocessor system. IEEE Micro ():–
. Torrellas J, Tucker A, Gupta A () Benefits of cache-affinity
scheduling in shared-memory multiprocessors: a summary.
In: Proceedings of ACM SIGMETRICS conference on measurement and modeling of computer systems, May . ACM,
New York, pp –
. Torrellas J, Tucker A, Gupta A () Evaluating the performance
of cache-affinity scheduling in shared-memory multiprocessors.
J Parallel Distrib Comput ():–
. Vaswani R, Zahorjan J () The implications of cache affinity on processor scheduling for multiprogrammed, shared memory multiprocessors. In: Proceedings of symposium on operating
systems principles, October . ACM, New York, pp –
. Wang YT, Morris R () Load sharing in distributed systems.
IEEE Trans Comput C-():–
. Weinrib A, Gopal G () Decentralized resource allocation
for distributed systems. In: Proceedings of IEEE INFOCOM ’, San Francisco, April . IEEE, Washington, DC,
pp –
. Yu OS () Stochastic bounds for heterogeneous-server queues
with Erlang service times. J Appl Probab :–
Ajtai–Komlós–Szemerédi Sorting Network
See AKS Network
AKS Network
Joel Seiferas
University of Rochester, Rochester, NY, USA
Synonyms
Ajtai–Komlós–Szemerédi sorting network; AKS sorting
network; Logarithmic-depth sorting network
Definition
AKS networks are O(log n)-depth networks of 2-item sorters that sort their n input items by following the 1983 design by Miklós Ajtai, János Komlós, and Endre Szemerédi.
Discussion
Comparator Networks for Sorting
Following Knuth [9, Section 5.3.4] or Cormen et al. [8, Chapter 27], consider algorithms that reorder their input sequences I (of items, or “keys,” from some totally ordered universe) via “oblivious comparison–exchanges.” Each comparison–exchange is effected by a “comparator” from a data position i to another data position j, which permutes I[i] and I[j] so that I[i] ≤ I[j]. (So a comparator amounts to an application of a 2-item sorter.) The sequence of such comparison–exchanges depends only on the number n of input items, but not on whether items get switched. Independent comparison–exchanges, involving disjoint pairs of data positions, can be allowed to take place at the same time, up to ⌊n/2⌋ comparison–exchanges per parallel step.
Each such algorithm can be reformulated so that every comparison–exchange [i : j] has i < j (Knuth’s standard form) [, Exercise ..-, or , Exercise .-]. Following Knuth, restrict attention to such algorithms and represent them by left-to-right parallel horizontal “time lines” for the n data positions (also referred to as registers), connecting pairs of them by vertical lines to indicate comparison–exchanges. A famous example is Batcher’s family of networks for sorting “bitonic” sequences [5]. The network for n = 16 is depicted in Fig. 1. Since the maximum depth is 4, four steps suffice. In general, Batcher’s networks sort bitonic input sequences of lengths n that are powers of 2, in log₂ n steps.
If such a network reorders every length-n input sequence into sorted order (i.e., so that I[i] ≤ I[j] holds whenever i < j does), then call it a sorting network. Since the concatenation of a sorted sequence and the reverse of a sorted sequence is bitonic, Batcher’s networks can be used to merge sorted sequences in Θ(log n) parallel steps (using Θ(n log n) comparators), and hence to implement merge sort in Θ(log² n) parallel steps (using Θ(n log² n) comparators).
The total number Θ(n log² n) of comparators used by Batcher’s elegant sorting networks is worse by a factor of Θ(log n) than the number Θ(n log n) = Θ(log n!) of two-outcome comparisons required for the best (nonoblivious) sorting algorithms, such as merge sort and heap sort [, ]. At the expense of simplicity and practical multiplicative constants, the “AKS” sorting networks of Ajtai, Komlós, and Szemerédi [1–3] close this gap, sorting in O(log n) parallel steps, using O(n log n) comparators.
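The step and comparator counts for merge sort built from bitonic mergers can be checked on a small instance. The sketch below is one possible standard-form construction, not necessarily the exact layout Batcher gives: the first step of each merge compares mirrored positions, which plays the role of reversing the second sorted half. The names `merge_steps`, `sorting_steps`, and `apply_network` are hypothetical.

```python
import random


def merge_steps(lo, size):
    """Steps that merge the two sorted halves of positions lo .. lo+size-1."""
    # Mirror step: afterwards each half is bitonic and the halves are separated.
    steps = [[(lo + i, lo + size - 1 - i) for i in range(size // 2)]]
    gap = size // 4
    while gap >= 1:
        steps.append([(i, i + gap) for i in range(lo, lo + size)
                      if ((i - lo) // gap) % 2 == 0])
        gap //= 2
    return steps


def sorting_steps(n):
    """Merge sort as a comparator network; n is assumed to be a power of 2."""
    steps = []
    size = 2
    while size <= n:
        per_block = [merge_steps(lo, size) for lo in range(0, n, size)]
        for layer in zip(*per_block):  # blocks of equal size are merged in parallel
            steps.append([pair for block_layer in layer for pair in block_layer])
        size *= 2
    return steps


def apply_network(steps, items):
    data = list(items)
    for step in steps:
        for i, j in step:
            if data[i] > data[j]:
                data[i], data[j] = data[j], data[i]
    return data


network = sorting_steps(16)
depth = len(network)                          # 1 + 2 + 3 + 4 parallel steps
comparators = sum(len(step) for step in network)
sample = random.Random(1).sample(range(16), 16)
print(depth, comparators, apply_network(network, sample))
```

For n = 2^k this gives depth k(k + 1)/2 and (n/2) · k(k + 1)/2 comparators, in line with the Θ(log² n) and Θ(n log² n) bounds.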
The Sorting-by-Splitting Approach
The AKS algorithm is based on a “sorting-by-splitting”
or “sorting-by-classifying” approach that amounts to an
ideal version of “quicksort” [, ]: Separate the items
to be sorted into a smallest half and a largest half, and
A
continue recursively on each half. But the depth of just
a halving network would already have to be Ω(log n),
again seeming to lead, through recursion, only to sorting networks of depth Ω(log n).
(To see that (all) the outputs of a perfect halver
have to be at depth Ω(log n), consider the computation on a sequence of n identical items, and consider
any particular output item. If its depth is only d, then
there are n − d input items that can have no effect
on that output item. If this number exceeds n/, then
these uninvolved input items can all be made larger or
smaller to make the unchanged output wrong. (In fact, a
slightly more careful argument by Alekseev [, ] shows
also that the total number of comparators has to be
Ω(n log n), again seeming to leave no room for a good
recursive result.))
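The counting step in the parenthetical argument can be written out as a short worked inequality. It uses only the fact that an output at depth d can depend on at most 2^d inputs.

```latex
\[
  n - 2^{d} > \frac{n}{2}
  \;\Longleftrightarrow\;
  2^{d} < \frac{n}{2}
  \;\Longleftrightarrow\;
  d < \log_2 n - 1 ,
\]
so every output of a perfect halver must have depth
$d \ge \log_2 n - 1 = \Omega(\log n)$.
```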
AKS Network. Fig. 1 Batcher’s bitonic merger for n = 16 elements. Each successive comparator (vertical line) reorders its inputs (top and bottom endpoints) into sorted order. [Figure: registers I[0] through I[15] drawn as horizontal lines, with comparators grouped into Step 1 through Step 4.]
AKS Network. Fig. 2 Setting for approximate halving (prefix case). [Figure: an axis marked 0, λ′, 1/2, and 1, with regions labeled “prefix” and “misplacement area.”]
AKS Network. Fig. 3 Setting for approximate λ-separation (prefix case). [Figure: an axis marked 0, λ′, λ, and 1, with regions labeled “prefix” and “misplacement area.”]
The , AKS breakthrough [] was twofold: to
notice that there are shallow approximate separators,
and to find a way to tolerate their errors in sorting by
classification.
Approximate Separation via Approximate
Halving via Bipartite Expander Graphs
The approximate separators defined, designed, and used are based on the simpler definition and existence of approximate halvers. The criterion for approximate halving is a relative bound (ε) on the number of misplacements from each prefix or suffix (of relative length λ′ up to λ = 1/2) of a completely sorted result to the wrong half of the result actually produced (see Fig. 2). For approximate separation, the “wrong fraction” one-half is generalized to include even larger fractions 1 − λ. (See Fig. 3.)
Definition 1 For each ε > 0 and λ ≤ 1/2, the criterion for ε-approximate λ-separation of a sequence of n input elements is that, for each λ′ ≤ λ, at most ελ′n of the ⌊λ′n⌋ smallest (respectively, largest) elements do not get placed among the ⌊λn⌋ first (respectively, last) positions. For λ = 1/2, this is called ε-halving.
Although these definitions do not restrict n, they will be needed and used only for even n.
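The criterion of Definition 1 can be tested directly on a proposed output. The sketch below assumes the output is given as a list of ranks (0 for the smallest item) in position order. The function name `is_separated` is hypothetical. Because the bound ελ′n is tightest at λ′ = k/n for each prefix size k = ⌊λ′n⌋, it suffices to test at most εk misplacements for each k up to ⌊λn⌋.

```python
from fractions import Fraction
from math import floor


def is_separated(output, eps, lam):
    """Check eps-approximate lam-separation of a list of ranks 0..n-1."""
    n = len(output)
    width = floor(lam * n)
    first = set(output[:width])        # ranks found in the first floor(lam*n) positions
    last = set(output[n - width:])     # ranks found in the last floor(lam*n) positions
    for k in range(1, width + 1):
        small_misplaced = sum(1 for r in range(k) if r not in first)
        large_misplaced = sum(1 for r in range(n - k, n) if r not in last)
        if small_misplaced > eps * k or large_misplaced > eps * k:
            return False
    return True


half = Fraction(1, 2)
almost = [0, 1, 2, 4, 3, 5, 6, 7]      # ranks 3 and 4 each sit in the wrong half
print(is_separated(almost, Fraction(1, 4), half),
      is_separated(almost, Fraction(1, 5), half))
```

Exact fractions are used for ε and λ so that the floor ⌊λn⌋ is not affected by floating-point rounding.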
The AKS construction will use constant-depth networks that perform both ε-halving and ε-approximate λ-separation for some λ smaller than 1/2. Note that neither of these quite implies the other, since the former constrains more prefixes and suffixes (all the way up to half the length), while the latter counts more misplacements (the ones to a longer “misplacement area”) as significant.
Lemma 1 For each ε > 0, ε-halving can be performed by comparator networks (one for each even n) of constant depth (depending on ε, but independent of n).
Proof From the study of “expander graphs” [, e.g.,], using the fact that ((1 − ε)/ε) · ε < 1, start with a constant d, determined by ε, and a d-regular (n/2)-by-(n/2) bipartite graph with the following expansion property: Each subset S of either part, with ∣S∣ ≤ εn/2, has more than ((1 − ε)/ε)∣S∣ neighbors.
The ε-halver uses a first-half-to-second-half comparator corresponding to each edge in the d-regular bipartite expander graph. This requires depth only d, by repeated application of a minimum-cut argument, to extract d matchings [, Exercise .-, for example].
To see that the result is indeed an ε-halver, consider, in any application of the network, the final positions of the strays from the m ≤ n/2 smallest elements. Those positions and all their neighbors must finally contain elements among the m smallest. If the fraction of strays were more than ε, then this would add up to more than εm + ((1 − ε)/ε)εm = m, a contradiction.
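The shape of the network in this proof (one comparator per edge, organized as d matchings from the first half to the second half) can be sketched as follows. As an assumption for illustration only, the d matchings are drawn at random, which gives a d-regular bipartite multigraph but does not guarantee the expansion property the lemma needs. The names `halver_layers`, `run`, and `strays` are hypothetical.

```python
import random


def halver_layers(n, matchings):
    """One layer per matching p: comparators (i, n/2 + p[i]) for i in the first half."""
    half = n // 2
    return [[(i, half + p[i]) for i in range(half)] for p in matchings]


def run(layers, items):
    data = list(items)
    for layer in layers:
        for i, j in layer:
            if data[i] > data[j]:
                data[i], data[j] = data[j], data[i]
    return data


def strays(output, m):
    """How many of the m smallest ranks end up in the second half."""
    half = len(output) // 2
    return sum(1 for rank in output[half:] if rank < m)


rng = random.Random(0)
n, d = 64, 6
random_matchings = [rng.sample(range(n // 2), n // 2) for _ in range(d)]
layers = halver_layers(n, random_matchings)
ranks = rng.sample(range(n), n)
out = run(layers, ranks)
print([strays(out, m) for m in (4, 8, 16, 32)])
```

Whatever the graph, the final contents satisfy out[i] ≤ out[j] along every edge (i, j), because a first-half register only ever receives minima and a second-half register only maxima. That is the fact the stray-counting argument relies on.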
Using approximate halvers in approximate separation and exact sorting resembles such mundane tasks
as sweeping dirt with an imperfect broom or clearing
snow with an imperfect shovel. There has to be an efficient, converging strategy for cleaning up the relatively
few imperfections. Subsequent strokes should be aimed
at where the concentrations of imperfections are currently known to lie, rather than again at the full job –
something like skimming spilled oil off the surface of
the Gulf of Mexico.
The use of approximate halvers in approximate separation involves relatively natural shrinkage of “stroke
size” to sweep more extreme keys closer to the ends.
Lemma 2 For each ε > 0 and λ ≤ 1/2, ε-approximate λ-separation and simultaneous ε-halving can be performed by comparator networks (one for each even n) of constant depth (depending on ε and λ, but independent of n).
Proof For ε₀ small in terms of ε and λ, make use of the ε₀-halvers already provided by Lemma 1 (the result for λ = 1/2):
First apply the ε₀-halver for length n to the whole sequence, and then work separately on each resulting half, so that the final result will remain an ε₀-halver. The halves are handled symmetrically, in parallel; so focus on the first half. In terms of m = ⌊λn⌋ (assuming n ≥ 1/λ), apply ε₀-halvers to the ⌈log₂(n/m)⌉ − 1 prefixes of lengths 2m, 4m, 8m, …, 2^⌈log₂(n/m)−1⌉ m, in reverse order, where the last one listed (first one performed) is simplified as if the inputs beyond the first n/2 were all some very large element. Then the total number of elements from the smallest ⌊λ′n⌋ (for any λ′ ≤ λ) that do not end up among the first ⌊λn⌋ = m positions is at most ε₀λ′n in each of the ⌈log₂(n/m)⌉ intervals (m, 2m], (2m, 4m], (4m, 8m], …, (2^⌈log₂(n/m)−2⌉ m, n/2], and (n/2, n], for a total of at most ⌈log₂(n/m)⌉ε₀λ′n.
For any chosen c < 1 (close to 1, making log₂(1/c) close to 0), the following holds: Unless n is small (n < (1/λ)(1/(1 − c))),
2^(log₂(n/m)) = n/m = n/⌊λn⌋ < n/(λn − 1) ≤ n/(cλn) = (1/c)(1/λ),
so that log₂(n/m) is less than log₂(1/c) + log₂(1/λ), and the number of misplaced elements above is at most ⌈log₂(1/c) + log₂(1/λ)⌉ε₀λ′n. This is bounded by ελ′n as required, provided ε₀ ≤ ε/⌈log₂(1/c) + log₂(1/λ)⌉.
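The prefix schedule and the requirement on the inner halver quality in this proof can be computed for concrete values. In the sketch below, `prefix_schedule` and `inner_epsilon` are hypothetical names, and the sample values of n, λ, ε, and c are arbitrary illustrations rather than parameters taken from the construction.

```python
from fractions import Fraction
from math import ceil, log2


def prefix_schedule(n, lam):
    """Return m, the number of intervals, and the prefix lengths in application order."""
    m = int(lam * n)                   # floor(lam * n) for a nonnegative Fraction
    intervals = 0                      # ceil(log2(n / m)), computed with integers
    while (m << intervals) < n:
        intervals += 1
    lengths = [m << e for e in range(1, intervals)]   # 2m, 4m, ..., as listed in the proof
    return m, intervals, lengths[::-1]                # applied in reverse order


def inner_epsilon(eps, lam, c):
    """Largest inner halver error allowed by the final condition of the proof."""
    return eps / ceil(log2(1 / c) + log2(1 / lam))


lam = Fraction(1, 8)
m, intervals, lengths = prefix_schedule(128, lam)
print(m, intervals, lengths, inner_epsilon(Fraction(1, 10), lam, Fraction(9, 10)))
```

The depth of the resulting separator is the depth of one inner halver times the number of halver applications, which is independent of n once ε and λ are fixed.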
Sorting with Approximate Separators
Without loss of generality, assume the n items to be sorted are distinct, and that n is a power of 2 larger than . To keep track of the progress of the recursive classification, and the (now not necessarily contiguous) sets of registers to which approximate halvers and separators should be applied, consider each of the registers always to occupy one of n − 1 = 1 + 2 + 4 + ⋅⋅⋅ + n/2 bags, one bag corresponding to each nonsingleton binary subinterval of the length-n index sequence: one whole, its first and second halves (two halves), their first and second halves (four quarters), …, their first and second halves (n/2 contiguous pairs). These binary subintervals, and hence the associated bags, correspond nicely to the nodes of a complete binary tree, with each nontrivial binary subinterval serving as parent for its first and second halves, which serve respectively as its left child and right child. (The sets, and even the numbers, of registers corresponding to these bags will change with time, according to time-determined baggings and rebaggings by the algorithm.)
Based on its actual rank (position in the actual sorted order), each register’s current item can be considered native to one bag at each level of the tree. For example, if it lies in the second quartile, it is native to the second of the four bags that are grandchildren of the root. Sometimes an item will occupy a bag to which it is not native, where it is a stranger; more specifically, it will be j-strange if its bag is at least j steps off its native path down from the root. (So the 1-strangers are all the strangers; and the additional 0-strangers are actually native, and not strangers at all.)
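The notions of native bag and j-strangeness can be made concrete with a small sketch. It assumes, for illustration, that a bag is addressed by its depth and its left-to-right index on that level (both counted from 0) and that ranks run from 0 to n − 1. The names `native_bag` and `strangeness` are hypothetical.

```python
def native_bag(rank, n, depth):
    """Index of the depth-`depth` bag to which the item of this rank is native."""
    levels = n.bit_length() - 1        # log2(n) for n a power of 2
    return rank >> (levels - depth)


def strangeness(rank, n, depth, index):
    """Largest j for which the item is j-strange in bag (depth, index); 0 means native."""
    for up in range(depth + 1):
        # Walk up from the bag until an ancestor lies on the item's native path.
        if (index >> up) == native_bag(rank, n, depth - up):
            return up
    raise ValueError("index is not a valid bag on this level")


# Rank 5 of 16 lies in the second quartile (ranks 4..7).
print(native_bag(5, 16, 2),
      strangeness(5, 16, 2, 1),
      strangeness(5, 16, 2, 0),
      strangeness(5, 16, 2, 3))
```

An item with strangeness s is j-strange for every j ≤ s, matching the remark that 1-strangers are all the strangers.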
Initially, consider all n registers to be in the root bag (the one corresponding to the whole sequence of indices), to which all contents are native. The strategy is to design and follow a schedule, oblivious to the particular data, of applications of certain comparator networks and conceptual rebaggings of the results, that is guaranteed to leave all items in the bags of constant-height subtrees to which they are native. Then to finish, it will suffice to apply a separate sorting network to the O(1) registers in the bags of each of these constant-size subtrees.
The Structure of the Bagging Schedule
Each stage of the network acts separately on the contents of each nonempty bag, which is an inductively predictable subsequence of the n registers. In terms of the bag’s current “capacity” b (an upper bound on its number of registers), a certain fixed “skimming” fraction λ, and other parameters to be chosen later (in retrospect), it applies an approximate separator from Lemma 2 to that sequence of registers, and it evacuates the results to the parent and children bags as follows: If there is a parent, then “kick back” to it ⌊λb⌋ items (or as many as are available, if there are not that many) from each end of the results, where too-small and too-large items will tend to accumulate. If the excess (between these ends) is odd (which will be inductively impossible at the root), then kick back any one additional register (the middle one, say) to the parent. Send the first and second halves of any remaining excess down to the respective children. Note that this plan can fail to be feasible in only one case: the number of registers to be evacuated exceeds 2⌊λb⌋ + 1, but the bag has no children (i.e., it is a leaf).
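The evacuation rule can be sketched as a function that only counts registers, leaving the separator and the actual data movement aside. The name `evacuation_plan` and its return convention are hypothetical.

```python
from fractions import Fraction
from math import floor


def evacuation_plan(registers, capacity, lam, has_parent, has_children):
    """Return (from_left_end, from_right_end, extra_middle, down_to_each_child)."""
    kick = floor(lam * capacity)
    left = right = extra = 0
    if has_parent:
        left = min(kick, registers)            # as many as are available
        right = min(kick, registers - left)
    excess = registers - left - right
    if excess % 2 == 1:
        if not has_parent:
            raise ValueError("odd excess at the root")
        extra = 1                              # one additional register goes up
    down_each = (excess - extra) // 2
    if down_each > 0 and not has_children:
        raise ValueError("a leaf cannot pass registers down")
    return left, right, extra, down_each


lam = Fraction(1, 10)
print(evacuation_plan(100, 120, lam, True, True),
      evacuation_plan(7, 20, lam, True, True),
      evacuation_plan(3, 10, lam, True, False))
```

The only failure for a bag with a parent is the one named in the text: a leaf holding more than 2⌊λb⌋ + 1 registers.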
The parameter b is an imposed capacity that will increase (exponentially) with the depth of the bag in the tree but decrease (also exponentially) with time, thus “squeezing” all the items toward the leaves, as desired. Aim to choose the parameters so that the squeezing is slow enough that the separators from Lemma 2 have time to successfully skim and reroute all strangers back to their native paths.
To complete a (not yet proved) description of a network of depth O(log n) to sort n items, here is a preview
of one set of adequate parameters:
λ = ε = / and b = n ⋅ d (.)t ,
where d ≥ 0 is the depth and t ≥ 0 is the number of
previous stages. It turns out that adopting these parameters enables iteration until bleaf < , and that at that
time every item will be in a height- subtree to which
it is native, so that the job can be finished with disjoint
-sorters.
A Suitable Invariant
In this and the next section, the parameters are carefully reintroduced, and constraints on them are accumulated, sufficient for the analysis to work out. For A comfortably larger than 1 ( in the preview above) and ν less than but close to 1 (. in the preview above), define the capacity (b above) of a depth-d bag after t stages to be nν^t A^d. (Again note the dynamic reduction with time, so that this capacity eventually becomes small even compared to the number of items native to the bag.) Let λ < 1 be chosen as indicated later, and let ε > 0 be small.
Subject to the constraints accumulated in section “Argument that the Invariant is Maintained”, it will be shown there that each successful iteration of the separation–rebagging procedure described in section “The Structure of the Bagging Schedule” (i.e., each stage of the network) preserves the following four-clause invariant.
1. Alternating levels of the tree are entirely empty.
2. On each level, the number of registers currently in each bag (or in the entire subtree below) is the same (and hence at most the size of the corresponding binary interval).
3. The number of registers currently in each bag is bounded by the current capacity of the bag.
4. For each j ≥ 1, the number of j-strangers currently in the registers of each bag is bounded by λε^(j−1) times the bag’s current capacity.
How much successful iteration is enough? Until the leaf capacities b dip below the constant 1/λ. At that point, the subtrees of smallest height k such that (1/λ)(1/A)^(k+1) < 1 contain all the registers, because higher-level capacities are at most b/A^(k+1) < (1/λ)(1/A)^(k+1) < 1. And the contents are correctly classified, because the number of (k − i + 1)-strangers in each bag at each height i ≤ k is bounded by λε^(k−i) b/A^i ≤ λb < 1. So the job can be finished with independent 2^(k+1)-sorters, of depth at most (k + 1)(k + 2)/2 []. In fact, for the right choice of parameters (holding 1/λ to less than A ), k as small as  can suffice, so that the job can be finished with mere -sorters, of depth just . Therefore, the number t of successful iterations need only exceed
((1 + log₂ A)/log₂(1/ν)) log₂ n − log₂(A/λ)/log₂(1/ν) = O(log n),
since that is (exactly) enough to get the leaf capacity nν^t A^((log₂ n)−1) down to 1/λ.
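The iteration count can be checked numerically against the leaf-capacity formula nν^t A^((log₂ n) − 1). The parameter values below are arbitrary placeholders chosen only to exercise the formula. They are not the values recommended in this entry, and the names `leaf_capacity` and `stages_needed` are hypothetical.

```python
from math import ceil, log2


def leaf_capacity(n, t, A, nu):
    """Capacity of a leaf bag (depth log2(n) - 1) after t stages."""
    return n * nu ** t * A ** (log2(n) - 1)


def stages_needed(n, A, nu, lam):
    """Value of t at which the leaf capacity equals 1/lam."""
    return ((1 + log2(A)) / log2(1 / nu)) * log2(n) - log2(A / lam) / log2(1 / nu)


n, A, nu, lam = 2 ** 20, 10.0, 0.9, 0.01   # illustrative values only
t_star = stages_needed(n, A, nu, lam)
print(t_star, leaf_capacity(n, t_star, A, nu), leaf_capacity(n, ceil(t_star), A, nu))
```

For fixed A, ν, and λ the count grows linearly in log₂ n, which is the source of the O(log n) depth.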
As noted in the previous section, only one thing can prevent successful iteration of the proposed procedure for each stage: 2⌊λb⌋ + 1 being less than the number of items to be evacuated from a bag with current capacity b but with no children. Since such a bag is a leaf, it follows from Clause 2 of the invariant that the number of items is at most 2. Thus the condition is 2⌊λb⌋ + 1 < 2, implying the goal b < 1/λ has already been reached.
Therefore, it remains only to choose parameters
such that each successful iteration does preserve the
entire invariant.
Argument that the Invariant Is Maintained
Only Clauses 3 and 4 are not immediately clear. First consider the former: that capacity will continue after the next iteration to bound the number of registers in each bag. This is nontrivial only for a bag that is currently empty. If the current capacity of such a bag is b ≥ A, then the next capacity can safely be as low as
(number of registers from below) + (number of registers from above)
≤ (4λbA + 2) + b/(2A)
= b(4λA + 1/(2A)) + 2
≤ b(4λA + 1/(2A) + 2/A), since b ≥ A.
In the remaining Clause 3 case, the current capacity of the empty bag is b < A. Therefore, all higher bags’ capacities are bounded by b/A < 1, so that the n registers are equally distributed among the subtrees rooted on the level below, an even number of registers per subtree. Since each root on that level has passed down an equal net number of registers to each of its children, it currently holds an even number of registers and will not kick back an odd register to the bag of interest. In this case, therefore, the next capacity can safely be as low as just 4λbA.
In either Clause 3 case, therefore, any ν ≥ 4λA + 5/(2A) will work.
Finally, turn to restoration of Clause 4. First, consider the relatively easy case of j > 1. Again, this is nontrivial only for a bag that is currently empty. Suppose the current capacity of such a bag is b. What is a bound on the number of j-strangers after the next step? It is at most
(all (j + 1)-strangers currently in children) + ((j − 1)-strangers currently in parent, and not filtered out by the separation)
< 2bAλε^j + ε((b/A)λε^(j−2))
≤ λε^(j−1)νb, provided 2Aε + 1/A ≤ ν.
Note that the bound ε((b/A)λε^(j−2)) exploits the “filtering” performance of an approximate separator: At most fraction ε of the “few” smallest (or largest) are permuted “far” out of place.
All that remains is the more involved Clause 4 case of j = 1. Consider any currently empty bag B, of current capacity b. At the root there are always no strangers; so assume B has a parent, D, and a sibling, C. Let d be the number of registers in D.
There are three sources of 1-strangers in B after the next iteration, two previously strange, essentially as above, and one newly strange:
1. Current 2-strangers at children (at most 2λεbA).
2. Unfiltered current 1-strangers in D (at most ε(λ(b/A)), as above).
3. Items in D that are native to C but that now get sent to B instead.
For one of these last items to get sent down to B, it must get permuted by the approximate separator of Lemma 2 into “B’s half” of the registers. The number that do is at most the number of C-native items in excess of d/2, plus the number of “halving errors” by the approximate separator. By the approximate halving behavior, the latter is at most εb/(2A), leaving only the former to estimate.
For this remaining estimate, compare the current “actual” distribution with a more symmetric “benchmark” distribution that has an unchanged number of registers in each bag, but that, for each bag C′ on the same level as B, has only C′-native items below C′ and has d/2 C′-native items in the parent D′ of C′. (If d is odd, then the numbers of items in D′ native to its two children will be ⌊d/2⌋ and ⌈d/2⌉, in either order.) That there is such a redistribution follows from Clause 2 of the invariant: Start by partitioning the completely sorted list among the bags on B’s level, and then move items down and up in appropriate numbers to fill the budgeted vacancies.
In the benchmark distribution, the number of C-native items in excess of d/2 is 0. If the actual distribution is to have an excess, where can the excess C-native items come from, in terms of the benchmark distribution? They can come only from C-native items on levels above D’s and from a net reduction in the number of C-native items in C’s subtree. The latter can only be via the introduction into C’s subtree of items not native to C. By Clause 4 of the invariant, the number of these can be at most
2λεbA + 8λε³bA³ + 32λε⁵bA⁵ + 128λε⁷bA⁷ + …
= 2λεbA(1 + (2εA)² + ((2εA)²)² + ((2εA)²)³ + …) < 2λεbA/(1 − (2εA)²).
If 2^i is the number of bags C′ on the level of C, then the total number of items on levels above D’s is at most
2^(i−3) b/A³ + 2^(i−5) b/A⁵ + 2^(i−7) b/A⁷ + … .
Since the number native to each such C′ is the same, the number native to C is at most 1/2^i times as much:
b/(8A³) + b/(32A⁵) + b/(128A⁷) + …
< b/((8A³)(1 − 1/(4A²)))
= b/(8A³ − 2A).
So the total number of 1-strangers from all sources is at most
2λεbA + ε(λ(b/A)) + εb/(2A) + 2λεbA/(1 − (2εA)²) + b/(8A³ − 2A).
This total is indeed bounded, as needed, by λνb (λ times the new capacity), provided
2λεA + ελ/A + ε/(2A) + 2λεA/(1 − (2εA)²) + 1/(8A³ − 2A) ≤ λν.
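The conditions collected in this section can be checked mechanically for a candidate parameter set: ν ≥ 4λA + 5/(2A) from the Clause 3 argument, ν ≥ 2Aε + 1/A from the j > 1 case, and the last inequality from the j = 1 case. The sketch below uses exact fractions. The function name `satisfies_constraints` is hypothetical, and the sample values are placeholders chosen for illustration, not the values quoted in this entry.

```python
from fractions import Fraction


def satisfies_constraints(A, lam, eps, nu):
    """Check the inequalities accumulated in the invariant-maintenance argument."""
    if not (A > 1 and nu < 1 and eps > 0 and lam < 1):
        return False
    if 2 * eps * A >= 1:                       # the geometric series must converge
        return False
    clause3 = nu >= 4 * lam * A + Fraction(5) / (2 * A)
    clause4_far = nu >= 2 * A * eps + Fraction(1) / A
    strangers = (2 * lam * eps * A
                 + eps * lam / A
                 + eps / (2 * A)
                 + 2 * lam * eps * A / (1 - (2 * eps * A) ** 2)
                 + Fraction(1) / (8 * A ** 3 - 2 * A))
    clause4_near = lam * nu >= strangers
    return clause3 and clause4_far and clause4_near


A = Fraction(10)
small = Fraction(1, 100)
print(satisfies_constraints(A, small, small, Fraction(7, 10)),
      satisfies_constraints(A, small, small, Fraction(1, 2)))
```

With these placeholder values the second call fails only because ν = 1/2 is below 4λA + 5/(2A) = 0.65.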
This completes the argument, subject to the following accumulated constraints:
A > 1,
ν < 1,
ε > 0,
λ < 1,
ν ≥ 4λA + 5/(2A),
ν ≥ 2Aε + 1/A,
λν ≥ 2λεA + ελ/A + ε/(2A) + 2λεA/(1 − (2εA)²) + 1/(8A³ − 2A).
Here is one choice order and strategy that works out neatly:
1. Choose A big
2. Choose λ between 1/(8A³ − 2A) and /(A)
3. Choose ε small
4. Choose ν within the resulting allowed range
For example, A = , λ = /, ε = /, and ν = /. In fact, perturbing λ and ε to / makes it possible to hold the parameter “k” of section “A Suitable Invariant” to  and get by with -sorters at the very end, as previewed in section “The Structure of the Bagging Schedule”.
Concluding Remarks
Thus the AKS networks are remarkable sorting algorithms, one for each n, that sort their n inputs in just O(log n) oblivious compare–exchange steps that involve no concurrent reading or writing.
The amazing kernel of this celebrated result is that the number of such steps for the problem of approximate halving (in provable contrast to the problem of perfect halving) does not have to grow with n at all. This is yet another reason (of many) to marvel at the existence of constant-degree expander graphs. And it is another reason to consider approximate solutions to fundamental algorithms. (Could some sort of approximate merging algorithm similarly lead to a fast sorting algorithm based on merge sort?)
Algorithmically most interesting is the measured use of approximate halvers to clean up their own errors in a design originally conceived for perfect halvers. The “one-step-backward, two-steps-forward” approach is a valuable one that should and indeed does show up in many other settings. This benefit holds regardless of whether the constant hidden in the big-O for this particular result can ever be reduced to something practical.
Related Entries
Bitonic Sort
Sorting
Bibliographic Notes and Further
Reading
Ajtai, Komlós, and Szemerédi submitted the first version of their design and argument directly to a journal [1]. Inspired by Joel Spencer’s ideas for a more accessible exposition, they soon published a quite different conference version [2]. The most influential version, on which most other expositions are based, was the one eventually published by Paterson [12]. The version presented here is based on a much more recent simplification [13], with fewer distinct parameters, but thus with less potential for fine-tuning more quantitative results.
The embedded expander graphs, and thus the constant hidden in the O(log n) depth estimate, make these networks impractical to build or use. The competition for the constant is Batcher’s extra log₂ n factor [5], which is relatively quite small for any remotely practical value of n.
The shallowest approximate halvers can be produced by more direct arguments rather than via
expander graphs. By a direct and careful counting argument, Paterson [] (the theorem in his appendix, specialized to the case α = ) proves that the depth of an
ε-halver need be no more than ⌈( log ε)/ log( − ε) +
/ε − ⌉. Others [, ] have less precisely sketched or
promised asymptotically better results.
Even better, one can focus on the construction
of approximate separators. Paterson [], for example,
notes the usefulness in that construction (as presented
above, for Lemma ) of somewhat weaker (and thus
shallower) halvers, which perform well only on extreme
sets of sizes bounded in terms of an additional parameter α ≤ . Tailoring the parameters to the different levels
of approximate halvers in the construction, he manages to reduce the depth of separators enough to reduce
the depth of the resulting sorting network to less than
 log n.
Ajtai, Komlós, and Szemerédi [] announce that
design in terms of generalized, multi-way comparators
(i.e., M-sorter units) can lead to drastically shallower
approximate halvers and “near-sorters” (their original
version [] of separators). Chvátal [] pursues this idea
carefully and arrives at an improved final bound less
than  log n, although still somewhat short of the
 log n privately claimed by Komlós. To date, the
tightest analyses have appeared only as sketches in preliminary conference presentations or mere promises
of future presentation [], or as unpublished technical
reports [].
Leighton [10] shows, as a corollary of the AKS result (regardless of the networks), that there is an n-node degree- network that can sort n items in O(log n) steps.
Considering the VLSI model of parallel computation, Bilardi and Preparata [6] show how to lay out the AKS sorting networks to sort n O(log n)-bit numbers in optimal area O(n²) and time O(log n).
Bibliography
1. Ajtai M, Komlós J, Szemerédi E () Sorting in c log n parallel steps. Combinatorica ():–
2. Ajtai M, Komlós J, Szemerédi E () An O(n log n) sorting network, Proceedings of the fifteenth annual ACM symposium on theory of computing, Association for computing machinery, Boston, pp –
3. Ajtai M, Komlós J, Szemerédi E () Halvers and expanders, Proceedings, rd annual symposium on foundations of computer science, IEEE computer society press, Los Alamitos, pp –
4. Alekseev VE () Sorting algorithms with minimum memory. Kibernetika ():–
5. Batcher KE () Sorting networks and their applications, AFIPS conference proceedings, Spring joint computer conference, vol , Thompson, pp –
6. Bilardi G, Preparata FP () The VLSI optimality of the AKS sorting network. Inform Process Lett ():–
7. Chvátal V () Lecture notes on the new AKS sorting network, DCS-TR-, Computer Science Department, Rutgers University
8. Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduction to algorithms, nd edn. The MIT Press, Cambridge
9. Knuth DE () The art of computer programming, vol , Sorting and searching, nd edn. Addison–Wesley, Reading
10. Leighton T () Tight bounds on the complexity of parallel sorting. IEEE Trans Comput C-():–
11. Manos H () Construction of halvers. Inform Process Lett ():–
12. Paterson MS () Improved sorting networks with O(log N) depth. Algorithmica (–):–
13. Seiferas J () Sorting networks of logarithmic depth, further simplified. Algorithmica ():–
14. Seiferas J () On the counting argument for the existence of expander graphs, manuscript
AKS Sorting Network
See AKS Network
Algebraic Multigrid
Marian Brezina¹, Jonathan Hu², Ray Tuminaro²
¹University of Colorado at Boulder, Boulder, CO, USA
²Sandia National Laboratories, Livermore, CA, USA
Synonyms
AMG
Definition
Multigrid refers to a family of iterative algorithms for
solving large sparse linear systems associated with a
broad class of integral and partial differential equations [, , ]. The key to its success lies in the use
of efficient coarse scale approximations to dramatically
accelerate the convergence so that an accurate approximation is obtained in only a few iterations at a cost that
is linearly proportional to the problem size. Algebraic
multigrid methods construct these coarse scale approximations utilizing only information from the finest resolution matrix. Use of algebraic multigrid solvers has
become quite common due to their optimal execution
time and relative ease-of-use.
Discussion
Introduction
Multigrid algorithms are used to solve large sparse linear systems
Ax = b (1)
where A is an n × n matrix, b is a vector of length n, and one seeks a vector x also of length n. When these matrices arise from elliptic partial differential equations (PDEs), multigrid algorithms are often provably optimal in that they obtain a solution with O(n/p + log p) floating point operations where p is the number of processes employed in the calculation. Their rapid convergence rate is a key feature. Unlike most simpler iterative methods, the number of iterations required to reach a given tolerance does not degrade as the system becomes larger (e.g., when A corresponds to a PDE and the mesh used in discretization is refined).
Multigrid development started in the 1970s with the pioneering works of Brandt [] and Hackbusch [, ], though the basic ideas first appeared in the 1960s [, , ]. The first methods would now be classified as geometric multigrid (GMG). Specifically, applications supply a mesh hierarchy, discrete operators corresponding to the PDE discretization on all meshes, interpolation operators to transfer solutions from a coarse resolution mesh to the next finer one, and restriction operators to transfer solutions from a fine resolution mesh to the next coarser one. In geometric multigrid, the inter-grid transfers are typically based on a geometrical relationship of a coarse mesh and its refinement, such as using linear interpolation to take a solution on a coarse mesh and define one on the next finer mesh. While early work developed, proved, and demonstrated optimally efficient algorithms, it was the development of algebraic multigrid (AMG) that paved the way for non-multigrid experts to realize this optimal behavior across a broad range of applications including combustion calculations, computational fluid dynamics, electromagnetics, radiation transfer, semiconductor device modeling, structural analysis, and thermal calculations. While better understood for symmetric positive definite (SPD) systems, AMG methods have achieved notable success on other types of linear systems such as those involving strongly convective flows [, ] or drift diffusion equations.
The first algebraic multigrid scheme appeared in the mid-1980s and is often referred to as classical algebraic multigrid [, ]. This original technique is still
one of the most popular, and many improvements have
helped extend its applicability on both serial and parallel
computers. Although classical algebraic multigrid is frequently referred to as just algebraic multigrid, the notation C-AMG is used here to avoid confusion with other
algebraic multigrid methods. The list of alternative
AMG methods is quite extensive, but this entry only
addresses one alternative referred to as smoothed aggregation [, ], denoted here as SA. Both C-AMG and
SA account for the primary AMG use by most scientists. Some publicly available AMG codes include [,
, , , ]. A number of commercial software products also contain algebraic multigrid solvers. Today,
research continues on algebraic multigrid to further
expand applicability and robustness.
Geometric Multigrid
Although multigrid has been successfully applied to a
wide range of problems, its basic principles are easily understood by first studying the two-dimensional
Poisson problem
$$-u_{xx} - u_{yy} = f$$ ()
defined over the unit square with homogeneous Dirichlet conditions imposed on the boundary. Discretization
leads to an SPD matrix problem
$$Ax = b.$$ ()
The Gauss–Seidel relaxation method is one of the oldest
iterative solution techniques for solving (). It can be
written as
x(k+) ← (D + L)− (b − Ux(k) )
()
where x(k) denotes the approximate solution at the kth
iteration, D is the diagonal of A, L is the strictly lower triangular portion of A, and U is the strictly upper triangular portion of A. While the iteration is easy to implement
and economical to apply, a prohibitively large number
of iterations (which increases as the underlying mesh
is refined) are often required to achieve an accurate
solution [].
Figure  depicts errors ($e^{(k)} = x^{(*)} - x^{(k)}$, where $x^{(*)}$ is the exact solution) over the
problem domain as a function of iteration number k for a typical situation on a  ×  mesh. Notice
that while the error remains quite large after four iterations, it has become locally much
smoother. Denoting by $r^{(k)} = b - Ax^{(k)}$ the residual, the error is found by solving
$Ae^{(k)} = r^{(k)}$. Of course, solving this residual equation appears to be no easier than solving
the original system. However, the basic multigrid idea is to recognize that smooth error is well
represented on a mesh with coarser resolution. Thus, a coarse version can be formed and used to
obtain an approximation with less expense by inverting a smaller problem. This approximation is
then used to perform the update $x^{(*)} \approx x^{(k+1)} = x^{(k)} + e_c^{(k)}$, where $e_c^{(k)}$ is
defined by interpolating to the fine mesh the solution of the coarse resolution version of
$Ae^{(k)} = r^{(k)}$. This is referred to as coarse-grid correction. Since the coarse-level correction
may introduce oscillatory components back into the error, it is optionally followed by application
of a small number of simple relaxation steps (the post-relaxation). If the size of the coarse
discretization matrix is small enough, Gaussian elimination can be applied to directly obtain a
solution of the coarse level equations. If the size of the coarse system matrix remains large, the
multigrid idea can be applied recursively.

Algebraic Multigrid. Fig.  Errors as a function of k, the Gauss–Seidel iteration number (panels a, b, c show k = 0, k = 2, k = 4)

    MGV(A_ℓ, x_ℓ, b_ℓ, ℓ):
        if ℓ ≠ ℓ_max
            x_ℓ ← S_ℓ^pre(A_ℓ, x_ℓ, b_ℓ)
            r_ℓ ← b_ℓ − A_ℓ x_ℓ
            x_{ℓ+1} ← 0
            x_{ℓ+1} ← MGV(A_{ℓ+1}, x_{ℓ+1}, I_ℓ^{ℓ+1} r_ℓ, ℓ+1)
            x_ℓ ← x_ℓ + I_{ℓ+1}^ℓ x_{ℓ+1}
            x_ℓ ← S_ℓ^post(A_ℓ, x_ℓ, b_ℓ)
        else x_ℓ ← A_ℓ^{−1} b_ℓ

Algebraic Multigrid. Fig.  Multigrid V-cycle
Figure  illustrates what is referred to as a multigrid
V-cycle for solving the linear system $A_\ell x_\ell = b_\ell$. Subscripts are introduced to distinguish different resolution
approximations. On the finest level, where one seeks a solution, $A_\ell = A$, while $A_{\ell_{max}}$ denotes
the coarsest level system where Gaussian elimination
can be applied without incurring prohibitive cost. $I_\ell^{\ell+1}$
restricts residuals from level ℓ to level ℓ + 1, and $I_{\ell+1}^{\ell}$
prolongates from level ℓ + 1 to level ℓ. $S_\ell^{pre}()$ and $S_\ell^{post}()$
denote a basic iterative scheme (e.g., Gauss-Seidel) that
is applied to smooth the error. It will be referred to as
relaxation in the remainder of the entry, although it
is also commonly called smoothing. The right side of
Fig.  depicts the algorithm flow for a four-level method.
The lowest circle represents the direct solver. The circles to the left of this represent pre-relaxation while
those to the right indicate post-relaxation. Finally, the
downward arrows indicate restriction while the upward
arrows correspond to interpolation or prolongation, followed by correction of the solution. Each relaxation in
the hierarchy effectively damps errors that are oscillatory with respect to the mesh at that level. The net effect
is that for a well-designed multigrid, errors at all frequencies are substantially reduced by one sweep of a
V-cycle iteration.
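The recursion in the V-cycle figure can be sketched for a one-dimensional Poisson model problem. This is a minimal illustration under stated assumptions, not the entry's implementation: the mesh has n = 2^p − 1 interior points, relaxation is one Gauss–Seidel sweep, restriction is full weighting, prolongation is linear interpolation, and the coarse operator is obtained by rediscretizing with spacing 2h. All function names are chosen here for illustration.

```python
import math


def apply_A(x, h):
    """Matrix-free 1D operator -u'' with zero Dirichlet boundary values."""
    n = len(x)
    out = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * x[i] - left - right) / (h * h))
    return out


def relax(x, b, h):
    """One Gauss-Seidel sweep (used as both pre- and post-relaxation)."""
    x = list(x)
    n = len(x)
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        x[i] = 0.5 * (h * h * b[i] + left + right)
    return x


def restrict(r):
    """Full weighting; coarse point j sits on fine point 2j + 1."""
    nc = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(nc)]


def prolong(xc, n):
    """Linear interpolation from nc coarse points to n = 2*nc + 1 fine points."""
    x = [0.0] * n
    for j, v in enumerate(xc):
        x[2 * j] += 0.5 * v
        x[2 * j + 1] += v
        x[2 * j + 2] += 0.5 * v
    return x


def mgv(x, b, h):
    """One V-cycle for A x = b on a mesh with spacing h."""
    n = len(x)
    if n == 1:
        return [b[0] * h * h / 2.0]          # direct solve on the coarsest level
    x = relax(x, b, h)                        # pre-relaxation
    r = [bi - ai for bi, ai in zip(b, apply_A(x, h))]
    ec = mgv([0.0] * ((n - 1) // 2), restrict(r), 2.0 * h)
    x = [xi + ei for xi, ei in zip(x, prolong(ec, n))]  # coarse-grid correction
    return relax(x, b, h)                     # post-relaxation


def residual_norm(x, b, h):
    return math.sqrt(sum((bi - ai) ** 2 for bi, ai in zip(b, apply_A(x, h))))
```

Starting from a zero guess on a 63-point mesh, repeated calls to `mgv` should drive `residual_norm` down by a roughly constant factor per cycle, in contrast to the slow decay of plain relaxation.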
Different multigrid variations visit coarse meshes
more frequently. The V-cycle (depicted here) and a
W-cycle are the most common. The W-cycle is obtained
by adding a second $x_{\ell+1} \leftarrow$ MGV() invocation immediately after the current one in Fig. . In terms of cost, grid
transfers are relatively inexpensive compared to relaxation. The relaxation cost is typically proportional to the
number of unknowns, or $ck^3$ on a k×k×k mesh where c is
a constant. Ignoring the computational expense of grid
transfers and assuming that coarse meshes are defined
by halving the resolution in each coordinate direction,
the V-cycle cost is
$$\text{V-cycle cost} \approx c\left(k^3 + \frac{k^3}{8} + \frac{k^3}{64} + \frac{k^3}{512} + \cdots\right) \approx \frac{8}{7} \times \text{fine-level relaxation cost}.$$
That is, the extra work associated with the coarse level
computations is almost negligible.
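The geometric series behind this estimate is easy to evaluate for other settings. The sketch below assumes, as in the text, that every level halves the resolution in each coordinate direction; the function name and the `visits_per_level` parameter (1 for a V-cycle, 2 for a W-cycle) are introduced here only for illustration.

```python
def cycle_work_ratio(dim, levels, visits_per_level=1):
    """Relaxation work of one cycle divided by the fine-level relaxation work.

    Level j holds 2**(-dim * j) of the fine unknowns and is visited
    visits_per_level**j times per cycle.
    """
    return sum((visits_per_level / 2.0 ** dim) ** j for j in range(levels))
```

Under these assumptions a three-dimensional V-cycle with many levels approaches 8/7, a two-dimensional V-cycle approaches 4/3, and a three-dimensional W-cycle also approaches 4/3, so the W-cycle's extra coarse-level visits remain affordable in this model.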
In addition to being inexpensive, coarse-level computations dramatically accelerate convergence so
that a solution is obtained in a few iterations. Figure  illustrates convergence histories for a
Jacobi iteration, a Gauss–Seidel iteration, and a multigrid V-cycle iteration where $S_\ell^{pre}()$
and $S_\ell^{post}()$ correspond to one Gauss–Seidel sweep. These experiments use a standard finite
difference discretization of () on four different meshes with a zero initial guess and a random
right-hand side. Figure  clearly depicts the multigrid advantage. The key to this rapid convergence
lies in the complementary nature of relaxation and the coarse level corrections. Relaxation
eliminates high frequency errors while the coarse level correction eliminates low frequency errors
on each level. Furthermore, errors that are relatively low frequency on a given level appear
oscillatory on some coarser level, and will be efficiently eliminated by relaxation on that level.

[Figure: residual norm ||r||2 (logarithmic axis) versus iterations for Jacobi, Gauss–Seidel, and multigrid on 31 × 31 and 63 × 63 meshes]
Algebraic Multigrid. Fig.  Residuals as a function of k, the Gauss–Seidel iteration number

For complex PDE operators, it can sometimes be challenging to define multigrid components in a way
that preserves this complementary balance. Jacobi and Gauss-Seidel relaxations, for example, do not
effectively smooth errors in directions of weak coupling when applied to highly anisotropic
diffusion PDEs. This is easily seen by examining a damped Jacobi iteration
$$x^{(k+1)} \leftarrow x^{(k)} + \omega D^{-1}(b - Ax^{(k)})$$ ()
where D is the diagonal matrix associated with A and ω is a scalar damping parameter typically
between zero and one. The error propagation is governed by
$$e^{(k+1)} = (I - \omega D^{-1}A)\,e^{(k)} = (I - \omega D^{-1}A)^{k+1}\,e^{(0)}$$ ()
where $e^{(k)} = x^{(*)} - x^{(k)}$. Clearly, error components in eigenvector directions associated
with small eigenvalues of $D^{-1}A$ are almost unaffected by the Jacobi iteration. Gauss-Seidel
exhibits similar behavior, though the analysis is more complex. When A corresponds to a standard
Poisson operator, these eigenvectors are all low frequency modes, and thus effectively attenuated
through the coarse-grid correction process. However, when applied to $u_{xx} + \epsilon u_{yy}$ (with
$\epsilon \ll 1$), eigenvectors associated with low eigenvalues are smooth functions in the x
direction, but may be oscillatory in the y direction. Within geometric multigrid schemes, where
coarsening is dictated by geometrical considerations, these oscillatory error modes are not reduced,
and multigrid convergence suffers when based on a simple point-wise relaxation, such as Jacobi. It
may thus be necessary to employ more powerful relaxation (e.g., line relaxation) to make up for
deficiencies in the choice of coarse grid representations.

Algebraic Multigrid
Algebraic multigrid differs from geometric multigrid
in that $I_\ell^{\ell+1}$ and $I_{\ell+1}^{\ell}$ are not defined from geometric
information. They are constructed using only $A_\ell$ and,
optionally, a small amount of additional information (to
be discussed). Once grid transfers on a given level are
defined, a Galerkin projection is employed to obtain a
coarse level discretization:
$$A_{\ell+1} \leftarrow I_\ell^{\ell+1} A_\ell I_{\ell+1}^{\ell}.$$
Further, $I_\ell^{\ell+1}$ is generally taken as the transpose of the
prolongation operator for symmetric problems. In this
case, an algebraic V-cycle iteration can be completely
specified by a procedure for generating prolongation
matrices along with a choice of appropriate relaxation
method. In contrast to geometric multigrid, the relaxation process is typically selected upfront, but the prolongation is automatically tailored to the problem at
hand. This is referred to as operator-dependent prolongation, and can overcome problems such as anisotropies
of which geometric multigrid coarsening may not be
aware.
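As a small illustration of the Galerkin projection with restriction taken as the transpose of prolongation, the sketch below forms $P^T A P$ for a one-dimensional model matrix. The choice of `tridiag(-1, 2, -1)` and of linear interpolation as the prolongator P are assumptions for this example; in AMG the prolongator would instead be built from the matrix entries.

```python
def matmul(X, Y):
    """Dense matrix product for lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def poisson_1d(n):
    """Dense tridiag(-1, 2, -1)."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def linear_interpolation(nc):
    """(2*nc + 1) x nc prolongator; coarse point j sits on fine point 2j + 1."""
    n = 2 * nc + 1
    P = [[0.0] * nc for _ in range(n)]
    for j in range(nc):
        P[2 * j][j] = 0.5
        P[2 * j + 1][j] = 1.0
        P[2 * j + 2][j] = 0.5
    return P


def galerkin(A, P):
    """Coarse operator A_c = P^T A P."""
    return matmul(transpose(P), matmul(A, P))
```

For this model problem the coarse matrix is again tridiagonal (a scaled copy of the fine one), so the same construction can be repeated on the next level. With less regular prolongators the coarse matrix may fill in, which is the sparsity concern discussed later in this section.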
Recalling the Jacobi error propagation formula (),
components associated with large eigenvalues of A are
easily damped by relaxation while those associated with
small eigenvalues remain. (Without loss of generality, one can assume that A has been scaled so that
its diagonal is identically one. Thus, A as opposed to
$D^{-1}A$ is used to simplify the discussion.) In geometric
multigrid, the focus is on finding a relaxation scheme
such that all large eigenvalues of its error propagation
operator correspond to low frequency eigenvectors. In
algebraic multigrid, one assumes that a simple (e.g.,
Gauss–Seidel) relaxation is employed where the error
propagation satisfies
∣∣Sℓ e∣∣A ℓ ≤ ∣∣e∣∣A ℓ − α∣∣e∣∣A
ℓ
= ∣∣e∣∣A ℓ − α∣∣r∣∣
()
where α is a positive constant independent of the mesh
size. Equation () indicates that errors associated with
a large Aℓ -norm are reduced significantly while those
associated with small residuals are relatively unchanged
by relaxation. The focus is now on constructing $I_\ell^{\ell+1}$
and $I_{\ell+1}^{\ell}$ such that eigenvectors associated with small
eigenvalues of Aℓ are transferred between grid levels
accurately. That is, error modes that are not damped
by relaxation on a given level must be transferred to
another level so that they can be reduced there. Note
that these components need not be smooth in the geometric sense, and are therefore termed algebraically
smooth.
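Algebraically smooth error that is not geometrically smooth can be observed directly on an anisotropic model problem. The sketch below is an illustration under stated assumptions: it uses the 5-point discretization of $-u_{xx} - \epsilon u_{yy}$ on an n × n interior grid with zero Dirichlet boundary values, applies one damped Jacobi sweep (ω = 2/3 is an arbitrary choice) to the sine mode with wavenumbers (i, j), and returns the factor by which the mode is scaled. The function name is hypothetical.

```python
import math


def jacobi_damping(i, j, n, eps, omega=2.0 / 3.0):
    """Scaling of the error mode sin(i*pi*x) * sin(j*pi*y) under one damped
    Jacobi sweep, measured by applying the sweep to the mode itself."""
    h = 1.0 / (n + 1)
    # e[q][p]: q indexes the y direction, p indexes the x direction.
    e = [[math.sin(i * math.pi * (p + 1) * h) * math.sin(j * math.pi * (q + 1) * h)
          for p in range(n)] for q in range(n)]

    def at(q, p):
        return e[q][p] if 0 <= q < n and 0 <= p < n else 0.0

    diag = 2.0 + 2.0 * eps  # diagonal entry (the common 1/h^2 factor cancels)
    new = [[e[q][p] - omega / diag * (diag * e[q][p]
                                      - at(q, p - 1) - at(q, p + 1)
                                      - eps * (at(q - 1, p) + at(q + 1, p)))
            for p in range(n)] for q in range(n)]
    num = math.sqrt(sum(v * v for row in new for v in row))
    den = math.sqrt(sum(v * v for row in e for v in row))
    return num / den
```

With a small ε, the mode that is smooth in x but highly oscillatory in y (for example `jacobi_damping(1, 15, 15, 1e-3)`) is barely reduced, whereas with ε = 1 the same mode is damped strongly. Such a mode is algebraically smooth for the anisotropic operator even though it oscillates on the mesh.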
More formally, the general principle for grid transfers applied to symmetric problems is that they satisfy
an approximation property, for example,
$$\|e - I_{\ell+1}^{\ell}\,\hat{e}\|^2 \le \frac{\beta}{\|A_\ell\|}\,\|e\|_{A_\ell}^2$$ ()
where ê is a coarse vector minimizing the left side and β
is a constant independent of mesh size and e. Basically,
() requires that vectors with smaller Aℓ -norm be more
accurately captured on coarse meshes. That is, high and
low energy, measured by the Aℓ -norm, replaces the
geometric notions of high and low frequency.
It is possible to show that a two-level method that
employs a coarse grid correction satisfying () followed
by a relaxation method Sℓ satisfying () is itself convergent, independent of problem size. The error propagation
operator associated with the coarse-grid correction is
given by
$$T_\ell = I - I_{\ell+1}^{\ell}\left(I_\ell^{\ell+1} A_\ell I_{\ell+1}^{\ell}\right)^{-1} I_\ell^{\ell+1} A_\ell.$$ ()
Note that () moves fine error to the coarse level, performs an exact solve, interpolates the result back to the
fine level, and differences it with the original fine error.
ℓ
Assuming that the grid transfer Iℓ+
is full-rank, the following estimate for a two-level multigrid method can be
proved using () and ():
∣∣S ℓ Tℓ ∣∣A ℓ
√
α
≤ − ,
β
()
where post-relaxation only is assumed to simplify the
discussion. That is, a coarse-grid correction followed
by relaxation converges at a rate independent of the
mesh size. Given the general nature of problems treated
with algebraic multigrid, however, sharp multilevel convergence results (as opposed to two-level results) are
difficult to obtain. The best known multilevel AMG convergence bounds are of the form
$1 - 1/C(L)$, where the
constant C(L) depends polynomially on the number of
multigrid levels [].
In addition to satisfying some approximation property, grid transfers must be practical. The
sparsity patterns of $I_{\ell+1}^{\ell}$ and $I_\ell^{\ell+1}$ effectively determine the number
of nonzeros in $A_{\ell+1}$. If prohibitively large, the method
is impractically expensive. Most AMG methods construct grid transfers in several stages. The first stage
defines a graph associated with $A_\ell$. The second stage
coarsens this graph. Coarsening fixes the dimensions of
$I_{\ell+1}^{\ell}$ and $I_\ell^{\ell+1}$. Coarsening may also effectively determine
the sparsity pattern of the grid transfer matrices or,
within some AMG methods, the sparsity pattern may be
constructed separately. Finally, the actual grid transfer
coefficients are determined so that the approximation
property is satisfied and that relaxation on the coarse
grid is effective.
One measure of expense is the operator complexity, $\sum_i \mathrm{nnz}(A_i)/\mathrm{nnz}(A)$, which compares the number of
nonzeros on all levels to the number of nonzeros in
the finest grid matrix A. The operator complexity gives
an indication both of the amount of memory required
to store the AMG preconditioner and the cost to apply
it. Another measure of complexity is the matrix stencil size, which is the average number of coefficients in
a row of Aℓ . Increases in matrix stencil size can lead
to increases in communication and matrix evaluation
costs. In general, AMG methods must carefully balance accuracy with both types of complexity. Increasing
the complexity of an AMG method often leads to better convergence properties, at the cost of each iteration
being more expensive. Conversely, decreasing the complexity of a method tends to lead to a method that
converges more slowly, but is cheaper to apply.
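Both complexity measures are simple ratios of nonzero counts. The sketch below shows one way to compute them. The function names, the convention that index 0 is the finest level, and the toy one-dimensional hierarchy are assumptions made for illustration.

```python
def operator_complexity(nnz_per_level):
    """Nonzeros summed over all levels divided by nonzeros on the finest level.

    nnz_per_level[0] is taken to be the finest level (a convention chosen here).
    """
    return sum(nnz_per_level) / nnz_per_level[0]


def average_stencil_size(nnz, nrows):
    """Average number of coefficients per matrix row."""
    return nnz / nrows


def tridiagonal_hierarchy(n, levels):
    """Toy hierarchy: every level is tridiagonal and n -> (n - 1) // 2."""
    sizes = []
    for _ in range(levels):
        sizes.append(n)
        n = (n - 1) // 2
    return [3 * m - 2 for m in sizes], sizes
```

For example, nonzero counts of 100, 30, and 10 on three levels give an operator complexity of 1.4, meaning the hierarchy costs about 40% more storage than the fine matrix alone.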
Within C-AMG, a matrix graph is defined with the
help of a strength-of-connection measure. In particular, each vector unknown corresponds to a vertex. A
graph edge between vertex i and j is only added if i
is strongly influenced by j, that is, if $-a_{ij} \ge \epsilon \max_{k \ne i}(-a_{ik})$,
where $\epsilon > 0$ is independent of i and j. The basic idea is
to ignore weak connections when defining the matrix
graph in order to coarsen only in directions of strong
connections (e.g., to avoid difficulties with anisotropic
applications). The structure of $I_{\ell+1}^{\ell}$ is determined by
selecting a subset of vertices, called C-points, from the
fine graph. The C-points constitute the coarse graph vertices; the remaining unknowns are called F-points and
will interpolate from the values at C-points. If C-points
are too close to each other, the resulting complexities are
high. If they are too far apart, convergence rates tend to
suffer. To avoid these two situations, the selection process attempts to satisfy two conditions. The first is that
if point j strongly influences an F-point i, then j should
either be a C-point or j and i should be strongly influenced by a common C-point. The second is that the
set of C-points should form a maximal independent set
such that no two C-points are strongly influenced by
each other. For isotropic problems, the aforementioned
C-AMG algorithm tends to select every other unknown
as a C-point. This yields a coarsening rate of roughly
$2^d$, where d is the spatial problem dimension. With the
C-points identified, the interpolation coefficients are
calculated. C-points themselves are simply interpolated
via injection. The strong influence of an F-point j on a
different F-point i is given by the weighted interpolation
from the C-points common to i and j. The latter is done
in such a manner so as to preserve interpolation of constants. Additionally, it is based on the assumption that
residuals are small after relaxation. That is, $A_\ell e \approx 0$, or,
equivalently, modes associated with small eigenvalues
should be accurately interpolated. Further details can be
found in [].
SA also coarsens based on the graph associated
with matrix Aℓ . In contrast to C-AMG, however, the
coarse grid unknowns are formed from disjoint groups,
or aggregates, of fine grid unknowns. Aggregates are
formed around initially-selected unknowns called root
points. An unknown i is included in an aggregate if it
is strongly coupled to the root point j. More precisely,
unknowns i and j are said to be strongly coupled if $|a_{ij}| > \theta\sqrt{|a_{ii}||a_{jj}|}$,
where $\theta \ge 0$ is a tuning parameter independent of i and j. Because the aggregates produced by SA
tend to have a graph diameter of 3, SA generally coarsens
at a rate of $3^d$, where d is the geometric problem dimension. After aggregate identification, a tentative prolongator is constructed so that it exactly interpolates a set
of user-defined vectors. For scalar PDEs this is often just
a single vector corresponding to the constant function.
For more complex operators, however, it may be necessary to provide several user-defined vectors. In three-dimensional elasticity, it is customary to provide six
vectors that represent the six rigid body modes (three
translations and three rotations). In general, these vectors should be near-kernel vectors that are generally
problematic for the relaxation. The tentative prolongator is defined by restricting the user-defined vectors
to each aggregate. That is, each column of the tentative prolongator is nonzero only for degrees of freedom
within a single aggregate, and the values of these nonzeros correspond to one of the user-defined vectors. The
columns are locally orthogonalized within an aggregate
by a QR algorithm to improve linear independence. The
final result is that the user-defined modes are within the
range space of the tentative prolongator, the prolongator
columns are orthonormal, and the tentative prolongator is quite sparse because each column’s nonzeros are
associated with a single aggregate. Unfortunately, individual tentative prolongator columns have a high energy
(or large A-norm). This essentially implies that some
algebraically smooth modes are not accurately represented throughout the mesh hierarchy even though the
user-defined modes are accurately represented. To rectify this, one step of damped Jacobi is applied to all
columns of the tentative prolongator:
$$I_{\ell+1}^{\ell} = (I - \omega D_\ell^{-1} A_\ell)\,\tilde{I}_{\ell+1}^{\ell}$$
where $\tilde{I}_{\ell+1}^{\ell}$ denotes the tentative prolongator, ω is the Jacobi damping parameter, and $D_\ell$ is
the diagonal of $A_\ell$. This reduces the energy in each of
the columns while preserving the interpolation of user-defined low energy modes. Further details can be found
in [, ].
Parallel Algebraic Multigrid
Distributed memory parallelization of most simulations based on PDEs is accomplished by dividing the
computational domain into subdomains, where one
subdomain is assigned to each processor. A processor
is then responsible for updating unknowns associated
within its subdomain only. To do this, processors occasionally need information from other processors; that
information is obtained via communication. Partitioning into boxes or cubes is straightforward for logically rectangular meshes. There are several tools that
automate the subdivision of domains for unstructured
meshes [, , ]. A general goal during subdivision
is to assign an equal amount of work to each processor and to reduce the amount of communication
between processors by minimizing the surface area of
the subdomains.
Multigrid parallelization follows in a similar fashion. For example, V-cycle computations within a mesh
are performed in parallel, but each mesh in the hierarchy is addressed one at a time as in standard multigrid. A multigrid cycle (see Fig. ) consists almost
entirely of relaxation and sparse matrix–vector products. This means that the parallel performance depends
entirely on these two basic kernels. Parallel matrix–
vector products are quite straightforward, as are parallel
Jacobi and parallel Chebyshev relaxation (which are
often preferred in this setting []). Chebyshev relaxation requires an estimate of the largest eigenvalue of
$A_\ell$ (which is often available or easily estimated) but not
that of the smallest eigenvalue, as relaxation need only
damp the high end of the spectrum. Unfortunately, construction of efficient parallel Gauss–Seidel algorithms
is challenging and relies on sophisticated multicoloring
schemes on unstructured meshes []. As an alternative,
most large parallel AMG packages support an option to
employ Processor Block (or local) Gauss–Seidel. Here,
each processor performs Gauss–Seidel as a subdomain
solver for a block Jacobi method. While Processor Block
Gauss–Seidel is easy to parallelize, the overall multigrid
convergence rate usually suffers and can even lead to
divergence if not suitably damped [].
Given a multigrid hierarchy, the main unique aspect
associated with parallelization boils down to partitioning all the operators in the hierarchy. As these operators are associated with matrix graphs, it is actually
the graphs that must be partitioned. The partitioning
of the finest level graph is typically provided by an outside application and so it ignores the coarse graphs that
are created during multigrid construction. While coarse
graph partitioning can also be done in this fashion, it
is desirable that the coarse and fine graph partitions
“match” in some way so that inter-processor communication during grid transfers is minimized. This is usually
done by deriving coarse graph partitions from the fine
graph partition. For example, when the coarse graph
vertices are a subset of fine vertices, it is natural to
simply use the fine graph partitioning on the coarse
graph. If the coarse graph is derived by agglomerating
elements and the fine graph is partitioned by elements,
the same idea holds. In cases without a simple correspondence between coarse and fine graphs, it is often
natural to enforce a similar condition that coarse vertices reside on the same processors that contain most
of the fine vertices that they interpolate []. Imbalance
can result if the original matrix A itself is poorly loadbalanced, or if the coarsening rate differs significantly
among processes, leading to imbalance on coarse levels.
Synchronization occurs within each level of the multigrid cycle, so the time to process a level is determined
by the slowest process. Even when work is well balanced,
processors may have only a few points as the total number of graph vertices can diminish rapidly from one
level to a coarser level. In fact, it is quite possible that
the total number of graph vertices on a given level is
actually less than the number of cores/processors on
a massively parallel machine. In this case some processors are forced to remain idle; more generally,
if processes have only a few points, computation time
can be dominated by communication. A partial solution
is to redistribute and load-balance points on a subset
of processes. While this certainly leaves processes idle
during coarse level relaxation, this can speed up run
times because communication occurs among fewer processes and expensive global all-to-all communication
patterns are avoided. A number of alternative multigrid
algorithms that attempt to process coarse level corrections in parallel (as opposed to the standard sequential
approach) have been considered []. Most of these,
however, suffer drawbacks associated with convergence
rates or even more complex load balancing issues.
The parallel cost of a simple V-cycle can be estimated
to help understand the general behavior. In particular,
assume that run time on a single level is modeled by
$$T_{sl} = c_1\left(\frac{k}{q}\right) + c_2\left(\alpha + \beta\left(\frac{k}{q}\right)^{2/3}\right)$$
where k is the number of degrees of freedom on the
level, q is the number of processors, and α + βw measures the cost of sending a message of length w from one
processor to another on a distributed memory machine.
The constants $c_1$ and $c_2$ reflect the ratio of computation to communication inherent in the smoothing and
matrix-vector products. Then, if the coarsening rate per
dimension in 3D is γ, it is easy to show that the run
time of a single AMG V-cycle with several levels, ℓ, is
approximately
$$T_{amg} = c_1\left(\frac{k}{q}\right)\left(\frac{\gamma^3}{\gamma^3 - 1}\right) + c_2\,\beta\left(\frac{k}{q}\right)^{2/3}\left(\frac{\gamma^2}{\gamma^2 - 1}\right) + c_2\,\alpha\,\ell$$
where now k is the number of degrees of freedom
on the finest mesh. For γ = 2, a standard multigrid
coarsening rate, this becomes
$$T_{amg} = \frac{8}{7}\,c_1\left(\frac{k}{q}\right) + \frac{4}{3}\,c_2\,\beta\left(\frac{k}{q}\right)^{2/3} + c_2\,\alpha\,\ell.$$
Comparing this with the single level execution time,
we see that the first term is pure computation and it
increases by a factor of about 1.14 to account for the hierarchy of levels.
The second term reflects communication bandwidth
and it increases by a factor of about 1.33. The third term is communication latency and it increases by a factor of ℓ. Thus, it is to
be expected that parallel multigrid spends a higher
percentage of time communicating than a single level
method. However, if (k/q) (the number of degrees
of freedom per processor) is relatively large, then the
first term dominates and so inefficiencies associated
with the remaining terms are not significant. However,
when this first term is not so dominant, then inefficiencies associated with coarse level computations are a
concern. Despite a possible loss of efficiency, the convergence benefits of multigrid far outweigh these concerns
and so, generally, multilevel methods remain far superior to single level methods, even on massively parallel
systems.
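The cost model can be explored numerically. The sketch below assumes the three-dimensional form of the model, in which computation scales with k/q and message volume with (k/q)^(2/3). It sums the single-level cost over the hierarchy and compares the result with the geometric-series closed form. Function names and all parameter values used with them are illustrative only.

```python
def single_level_time(k, q, c1, c2, alpha, beta):
    """Assumed 3D model: compute ~ k/q, message volume ~ (k/q)**(2/3)."""
    return c1 * (k / q) + c2 * (alpha + beta * (k / q) ** (2.0 / 3.0))


def vcycle_time(k, q, c1, c2, alpha, beta, gamma, levels):
    """Sum the single-level model over the hierarchy; the number of unknowns
    shrinks by gamma**3 from one level to the next."""
    return sum(single_level_time(k / gamma ** (3 * j), q, c1, c2, alpha, beta)
               for j in range(levels))


def vcycle_time_closed_form(k, q, c1, c2, alpha, beta, gamma, levels):
    """Geometric-series limit of the sum (accurate when there are many levels)."""
    return (c1 * (k / q) * gamma ** 3 / (gamma ** 3 - 1)
            + c2 * beta * (k / q) ** (2.0 / 3.0) * gamma ** 2 / (gamma ** 2 - 1)
            + c2 * alpha * levels)
```

Shrinking k/q while holding the other parameters fixed makes the latency term, which grows with the number of levels, an increasing share of the total. This is the regime in which coarse-level inefficiencies become a concern.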
The other significant issue associated with parallel algebraic multigrid is parallelizing the construction
phase of the algorithm. In principle this construction
phase is highly parallel due to the locality associated
with most algorithms for generating grid transfers and
setting up smoothers. The primary challenge typically
centers on the coarsening aspect of algebraic multigrid. For example, the smoothed aggregation algorithm
requires construction of aggregates, then a tentative
prolongator, followed by a prolongator smoothing step.
Prolongator smoothing as well as Galerkin projection (to build coarse level discretizations) require only
an efficient parallel matrix–matrix multiplication algorithm. Further, construction of the tentative prolongator only requires that the near-null space be injected
in a way consistent with the aggregates followed by
a series of independent small QR computations. It is,
however, the aggregation phase that can be difficult.
While highly parallel in principle, it is difficult to get
the same quality of aggregates without having some
inefficiencies. Aggregation is somewhat akin to placing floor tiles in a room. If several workers start at
different locations and lay tile simultaneously, the end
result will likely lead to the trimming of many interior
tiles so that things fit. Unfortunately, a large degree of
irregularity in aggregates can either degrade the convergence properties of the overall method or significantly
increase the cost per iteration. Further, once each process (or worker in our analogy) has only a few rows of
the relatively coarse operator Aℓ , the rate of coarsening
slows, thus leading to more multigrid levels. Instead of
naively allowing each process to coarsen its portion of
the graph independently, several different aggregation-based strategies have been proposed in the context of
SA, based on parallel maximal independent sets [, ]
or using graph-partitioning algorithms []. The latter
can increase the coarsening rate and decrease the operator complexity. In both cases, an aggregate can span several processors. Parallel variants of C-AMG coarsening
have also been developed to minimize these effects.
The ideas are quite similar to the smoothed aggregation context, but the details tend to be distinct due to
differences in coarsening rates and the fact that one is
based on aggregation while the other centers on identifying vertex subsets. Subdomain blocking [] coarsens
from the process boundary inward, and CLJP coarsening [, ] is based on selecting parallel maximal
independent sets. A more recently developed aggressive coarsening variant, PMIS [], addresses the high
complexities sometimes seen with CLJP.
We conclude this section with some performance
figures associated with the ML multigrid package that is
part of the Trilinos framework. ML implements parallel smoothed aggregation along the lines just discussed.
Figure  illustrates weak parallel scaling on a semiconductor simulation. Each processor has approximately
,  degrees of freedom so that the problem size
increases as more processors are used. In an ideal situation, the total simulation time would
remain constant as the problem size and processor
count increase. Examination of Fig.  shows that execution times rise only slightly due to relatively constant
run time per iteration.
The emergence of very large scale multi-core
architectures presents increased stresses on the underlying multigrid parallelization algorithms and so research
must continue to properly take advantage of these new
architectures.
Related Multigrid Approaches
When solving difficult problems, it is often advantageous to “wrap” the multigrid solver with a Krylov
method. This amounts to using a small number of AMG
iterations as a preconditioner in a Krylov method. The
additional cost per iteration amounts to an additional
residual computation and a small number of inner
products, depending on Krylov method used. Multigrid
methods exhibiting a low-dimensional deficiency can
have good convergence rates restored this way.
[Figure: "Weak Scaling Study: Average Time for Different Preconditioners", 2 × 1.5 micron BJT Steady-State Drift-Diffusion, Bias 0.3V. Vertical axis: average CPU time per Newton step (Prec+Lin Sol) (s). Horizontal axis: unknowns (logarithmic), with points labeled by processor count from 4p to 4096p. Curves: 1-level DD ILU; 3-level NSA W(1,1) agg125; 3-level PGSA W(1,1) agg125. Machine: Red Storm, 1 core of 2.4 GHz dual core Opteron]
Algebraic Multigrid. Fig.  Weak scaling timings for a semiconductor parallel simulation. The left image is the
steady-state electric potential; red represents high potential, blue indicates low potential. The scaling results compare
GMRES preconditioned by domain-decomposition, a nonideal AMG method, and an AMG method intended for
nonsymmetric problems, respectively. Results courtesy of []

Standard implementations of C-AMG and SA base
their coarsening on assumptions of smoothness. For
C-AMG, this is that an algebraically smooth error
behaves locally as a constant. When this is not the case,
convergence will suffer. One advantage of SA is that it is
possible to construct methods capable of approximating any prescribed user-defined error component accurately on coarse meshes. Thus, in principle, if one knows
difficult components for relaxation, suitable grid transfers can be defined that accurately transfer these modes
to coarse levels. However, there are applications where
such knowledge is lacking. For instance, QCD problems may have multiple difficult components that are
not only oscillatory, but are not a priori known. Adaptive multigrid methods have been designed to address
such problems [, ]. The key feature is that they determine the critical components and modify coarsening to
ensure their attenuation. These methods can be intricate, but are based on the simple observation that an
iteration process applied to the homogeneous problem, $A_\ell x_\ell = 0$, will either converge with satisfactory
rate, or reveal, as its solution, the error that the current
method does not attenuate. As the method improves,
any further components are revealed more rapidly. The
adaptive methods are currently more extensively developed within the SA framework, but progress has also
been made within the context of C-AMG [].
Another class of methods that attempt to take
advantage of additional available information are
known as AMGe [, ]. These utilize local finite element information, such as local stiffness matrices, to
construct inter-grid transfers. Although this departs
from the AMG framework, variants related to the
methodology have been designed that do not explicitly
require local element matrices [, ].
AMG methods have been successfully applied
directly to high-order finite element discretizations.
However, given the much denser nature of a matrix,
$A_H$, obtained from high-order discretization, it is usually advantageous to precondition the high-order system by an inverse of the low-order discretization, $A_L$,
corresponding to the problem over the same nodes
that are used to define the high-order Lagrange interpolation. One iteration of AMG can then be used
to efficiently approximate the action of $(A_L)^{-1}$, and
the use of $A_H$ is limited to computing a fine-level
residual, which may be computed in a matrix-free
fashion.
Related Entries
Metrics
Preconditioners for Sparse Iterative Methods
Rapid Elliptic Solvers
Acknowledgment
Sandia National Laboratories is a multiprogram laboratory operated by Sandia Corporation, a wholly
owned subsidiary of Lockheed Martin company, for the
US Department of Energy’s National Nuclear Security
Administration under contract DE-AC-AL.
Bibliography
. Adams MF () A parallel maximal independent set algorithm.
In: Proceedings th Copper Mountain conference on iterative
methods, Copper Mountain
. Adams MF () A distributed memory unstructured GaussSeidel algorithm for multigrid smoothers. In: ACM/IEEE
Proceedings of SC: High Performance Networking and
Computing, Denver
. Adams M, Brezina M, Hu J, Tuminaro R (July ) Parallel
multigrid smoothing: polynomial versus Gauss-Seidel. J Comp
Phys ():–
. Bakhvalov NS () On the convergence of a relaxation method
under natural constraints on an elliptic operator. Z Vycisl Mat Mat
Fiz :–
. Boman E, Devine K, Heaphy R, Hendrickson B, Leung V, Riesen
LA, Vaughan C, Catalyurek U, Bozdag D, Mitchell W, Teresco J.
Zoltan () .: Parallel partitioning, load balancing, and datamanagement services; user’s guide. Sandia National Laboratories,
Albuquerque, NM, . Tech. Report SAND-W. http://
www.cs.sandia.gov/Zoltan/ughtml/ug.html
. Brandt A () Multi-level adaptive solutions to boundary value
problems. Math Comp :–
. Brandt A, McCormick S, Ruge J () Algebraic multigrid
(AMG) for sparse matrix equations. In: Evans DJ (ed) Sparsity
and its applications. Cambridge University Press, Cambridge
. Brezina M, Cleary AJ, Falgout RD, Henson VE, Jones JE, Manteuffel TA, McCormick SF, Ruge JW () Algebraic multigrid based
on element interpolation (AMGe). SIAM J Sci Comp :–

. Brezina M, Falgout R, MacLachlan S, Manteuffel T, McCormick
S, Ruge J () Adaptive smoothed aggregation (αSA) multigrid.
SIAM Rev ():–
. Brezina M, Falgout R, MacLachlan S, Manteuffel T, McCormick
S, Ruge J () Adaptive algebraic multigrid. SIAM J Sci Comp
:–
. Brezina M, Manteuffel T, McCormick S, Ruge J, Sanders G ()
Towards adaptive smoothed aggregation (αSA) for nonsymmetric problems. SIAM J Sci Comput :
. Brezina M () SAMISdat (AMG) version . - user’s guide
. Briggs WL, Henson VE, McCormick S () A multigrid tutorial, nd ed. SIAM, Philadelphia
. Chartier T, Falgout RD, Henson VE, Jones J, Manteuffel T,
McCormick S, Ruge J, Vassilevski PS () Spectral AMGe
(ρAMGe). SIAM J Sci Comp :–
. Chow E, Falgout R, Hu J, Tuminaro R, Meier-Yang U ()
A survey of parallelization techniques for multigrid solvers.
In: Parallel processing for scientific computing, SIAM book
series on software, environments, and tools. SIAM, Philadelphia,
pp –
. Cleary AJ, Falgout RD, Henson VE, Jones JE () Coarse-grid
selection for parallel algebraic multigrid. In: Proceedings of the
fifth international symposium on solving irregularly structured
problems in parallel. Lecture Notes in Computer Science, vol .
Springer New York, pp –
. Dohrmann CR () Interpolation operators for algebraic
multigrid by local optimization. SIAM J Sci Comp :–
. Fedorenko RP (/) A relaxation method for solving elliptic
difference equations. Z Vycisl Mat Mat Fiz :– (). Also
in U.S.S.R. Comput Math Math Phys :– ()
. Fedorenko RP () The speed of convergence of one iterative process. Z. Vycisl. Mat Mat Fiz :–. Also in U.S.S.R.
Comput Math Math Phys :–
. Gee M, Siefert C, Hu J, Tuminaro R, Sala M () ML
. smoothed aggregation user’s guide. Technical Report
SAND-, Sandia National Laboratories, Albuquerque,
NM, 
. Hackbusch W () On the convergence of a multi-grid iteration
applied to finite element equations. Technical Report Report , Institute for Applied Mathematics, University of Cologne, West
Germany
. Hackbusch W () Multigrid methods and applications, vol .
Computational mathematics. Springer, Berlin
. Hackbusch W () Convergence of multi-grid iterations
applied to difference equations. Math Comput ():–
. Henson VE, Vassilevski PS () Element-free AMGe: general
algorithms for computing interpolation weights in AMG. SIAM J
Sci Comp :–
. Henson VE, Yang UM () BoomerAMG: a parallel algebraic
multigrid solver and preconditioner. Appl Numer Math ():–
. Karypis G, Kumar V () Multilevel k-way partitioning scheme
for irregular graphs. Technical Report -, Department of
Computer Science, University of Minnesota
. Karypis G, Kumar V () ParMETIS: Parallel graph partitioning and sparse matrix ordering library. Technical Report -,
Department of Computer Science, University of Minnesota
. Krechel A, Stüben K () Parallel algebraic multigrid based on
subdomain blocking. Parallel Comput ():–
. Lin PT, Shadid JN, Tuminaro RS, Sala M () Performance of
a Petrov-Galerkin algebraic multilevel preconditioner for finite
element modeling of the semiconductor device drift-diffusion
equations. Int J Num Meth Eng. Early online publication,
doi:./nme.
. Mavriplis DJ () Parallel performance investigations of an
unstructured mesh Navier-Stokes solver. Intl J High Perf Comput
Appl :–
. Prometheus. http://www.columbia.edu/_m/promintro.html.
. Stefan Reitzinger. http://www.numa.unilinz.ac.at/research/projects/pebbles.html
. Ruge J, Stüben K () Algebraic multigrid (AMG). In:
McCormick SF (ed) Multigrid methods, vol  of Frontiers in
applied mathematics. SIAM, Philadelphia, pp –
. Sala M, Tuminaro R () A new Petrov-Galerkin smoothed
aggregation preconditioner for nonsymmetric linear systems.
SIAM J Sci Comput ():–
. De Sterck H, Yang UM, Heys JJ () Reducing complexity in
parallel algebraic multigrid preconditioners. SIAM J Matrix Anal
Appl ():–
. Trottenberg U, Oosterlee C, Schüller A () Multigrid. Academic, London
. Tuminaro R, Tong C () Parallel smoothed aggregation multigrid: aggregation strategies on massively parallel machines. In:
Donnelley J (ed) Supercomputing  proceedings, 
. Vaněk P, Brezina M, Mandel J () Convergence of algebraic
multigrid based on smoothed aggregation. Numerische Mathematik :–
. Vaněk P, Mandel J, Brezina M () Algebraic multigrid by
smoothed aggregation for second and fourth order elliptic problems. Computing :–
. Varga RS () Matrix iterative analysis. Prentice-Hall, Englewood Cliffs
. Yang UM () On the use of relaxation parameters in hybrid
smoothers. Numer Linear Algebra Appl :–
Algorithm Engineering
Peter Sanders
Universitaet Karlsruhe, Karlsruhe, Germany
Synonyms
Experimental parallel algorithmics
Definition
Algorithmics is the subdiscipline of computer science
that studies the systematic development of efficient
algorithms. Algorithm Engineering (AE) is a methodology for algorithmic research that views design, analysis,
implementation, and experimental evaluation of algorithms as a cycle driving algorithmic research. Further
components are realistic models, algorithm libraries,
and a multitude of interrelations to applications. Fig. 
gives an overview. A more detailed definition can be
found in []. This article is concerned with particular
issues that arise in engineering parallel algorithms.

Discussion
Introduction
The development of algorithms is one of the core areas
of computer science. After the early days of the s–
s, in the s and s, algorithmics was largely
viewed as a subdiscipline of computer science that is
concerned with “paper-and-pencil” theoretical work –
design of algorithms with the goal of proving worst-case performance guarantees. However, in the s
it became more and more apparent that this purely
theoretical approach delays the transfer of algorithmic results into applications. Therefore, in algorithm
engineering implementation and experimentation are
viewed as equally important as design and analysis of
algorithms. Together these four components form a
feedback cycle: Algorithms are designed, then analyzed
and implemented. Together with experiments using
realistic inputs, this process induces new insights that
lead to modified and new algorithms. The methodology
of algorithm engineering is augmented by using realistic models that form the basis of algorithm descriptions,
analysis, and implementation and by algorithm libraries
that give high quality reusable implementations.
The history of parallel algorithms is exemplary for
the above development, where many clever algorithms
were developed in the s that were based on the Parallel Random Access Machine (PRAM) model of computation. While this yielded interesting insights into the
basic aspects of parallelizable problems, it has proved
quite difficult to implement PRAM algorithms on mainstream parallel computers.
The remainder of this article closely follows Fig. ,
giving one section for each of the main areas: models, design, analysis, implementation, experiments,
instances/benchmarks, and algorithm libraries.
Models
Parallel computers are complex systems containing processors, memory modules, and networks connecting
them. It would be very complicated to take all these
aspects into account at all times when designing, analyzing, and implementing parallel algorithms. Therefore we need simplified models. Two families of such
models have proved very useful: In a shared memory
machine, all processors access the same global memory. In a distributed memory machine, several sequential
computers communicate via an interconnection network. While these are useful abstractions, the difficulty
is to make these models more concrete by specifying
what operations are supported and how long they take.
For example, shared memory models have to specify
how long it takes to access a memory location. From
sequential models we are accustomed to constant access
time and this also reflects the best case behavior of
many parallel machines. However, in the worst case,
most real world machines will exhibit severe contention
when many processors access the same memory module. Hence, despite many useful models (e.g., QRQW
PRAM – Queue Read Queue Write Parallel Random
Access Machine []), there remains a considerable gap
to reality when it comes to large-scale shared memory
systems.
[Figure: diagram with the labels realistic models, real inputs, design, analysis, implementation, experiments, falsifiable hypotheses, induction, deduction, performance guarantees, algorithm libraries, application engineering, and applications.]
Algorithm Engineering. Fig.  Algorithm engineering as a cycle of design, analysis, implementation, and experimental
evaluation driven by falsifiable hypotheses
The situation is better for distributed memory
machines, in particular, when the sequential machines
are connected by a powerful switch. We can then
assume that all processors can communicate with an
arbitrary partner with predictable performance. The
LogP model [] and the Bulk Synchronous Parallel
model [] put this into a simple mathematical form.
Equally useful are folklore models where one simply
defines the time needed for exchanging a message as the
sum of a startup overhead and a term proportional to
the message length. Then the assumption is usually that
every processor can only communicate with a single
other processor at a time, or perhaps it can receive from
one processor and send to another one. Also note that
algorithms designed for a distributed memory model
often yield efficient implementations on shared memory
machines.
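The folklore cost model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `message_time` and `sequential_sends`, the parameter names `startup` and `per_byte`, and the numeric values are hypothetical and are not measurements of any machine.

```python
def message_time(length, startup, per_byte):
    """Time to exchange one message: a startup overhead plus a term
    proportional to the message length (folklore linear model)."""
    return startup + per_byte * length


def sequential_sends(lengths, startup, per_byte):
    """Total time if one processor sends the messages one after another,
    reflecting the assumption that it talks to one partner at a time."""
    return sum(message_time(n, startup, per_byte) for n in lengths)


if __name__ == "__main__":
    # Hypothetical parameters: 2 microseconds startup, 1 ns per byte.
    startup, per_byte = 2e-6, 1e-9
    for n in (8, 1024, 1 << 20):
        print(n, message_time(n, startup, per_byte))
```

Under such a model, short messages are dominated by the startup overhead and long messages by the length-proportional term, which is why combining many small messages into one is often worthwhile.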
Design
As in algorithm theory, we are interested in efficient algorithms. However, in algorithm engineering,
it is equally important to look for simplicity, implementability, and possibilities for code reuse. In particular, since it can be very difficult to debug parallel code,
the algorithms should be designed for simplicity and
testability. Furthermore, efficiency means not just
asymptotic worst-case efficiency, but we also have to
look at the constant factors involved. For example, we
have to be aware that operations where several processors interact can be a large constant factor more
expensive than local computations. Furthermore, we
are not only interested in worst-case performance but
also in the performance for real-world inputs. In particular, some theoretically efficient algorithms have similar
best-case and worst-case behavior whereas the algorithms used in practice perform much better on all but
contrived examples.
Analysis
Even simple and proven practical algorithms are often
difficult to analyze and this is one of the main reasons
for gaps between theory and practice. Thus, the analysis
of such algorithms is an important aspect of AE.
For example, a central problem in parallel processing is partitioning of large graphs into approximately
equal-sized blocks such that few edges are cut. This
problem has many applications, for example, in scientific computing. Currently available algorithms with
performance guarantees are too slow for practical use.
Practical methods iteratively contract the graph while
preserving its basic structure until only few nodes are
left, compute an initial solution on this coarse representation, and then improve by local search on each level.
These algorithms are very fast and yield good solutions
in many situations, yet no performance guarantees are
known (see [] for a recent overview).
Implementation
Despite huge efforts in parallel programming languages
and in parallelizing compilers, implementing parallel
algorithms is still one of the main challenges in the
algorithm engineering cycle. There are several reasons
for this. First, there are huge semantic gaps between
the abstract algorithm description, the programming
tools used, and the actual hardware. In particular, really
efficient codes often use fairly low-level programming
interfaces such as MPI or atomic memory operations
in order to keep the overheads for processor interaction manageable. Perhaps more importantly, debugging
parallel programs is notoriously difficult.
Since performance is the main reason for using
parallel computers in the first place, and because of
the complexity of parallel hardware, performance tuning is an important part of the implementation phase.
Although the line between implementation and experimentation is blurred here, there are differences. In
particular, performance tuning is less systematic. For
example, there is no need for reproducibility, detailed
studies of variances, etc., when one finds out that
sequential file I/O is a prohibitive bottleneck or when it
turns out that a collective communication routine of a
particular MPI implementation has a performance bug.
Experiments
Meaningful experiments are the key to closing the
cycle of the AE process. Compared to the natural sciences, AE is in the privileged situation where it can
perform many experiments with relatively little effort.

However, the other side of the coin is highly nontrivial
planning, evaluation, archiving, postprocessing, and
interpretation of results. The starting point should
always be falsifiable hypotheses on the behavior of the
investigated algorithms, which stem from the design,
analysis, implementation, or from previous experiments. The result is a confirmation, falsification, or
refinement of the hypothesis. The results complement
the analytic performance guarantees, lead to a better
understanding of the algorithms, and provide ideas for
improved algorithms, more accurate analysis, or more
efficient implementation.
Experiments with parallel algorithms are challenging because the number of processors (let alone other
architectural parameters) provides another degree of
freedom for the measurements, because even parallel
programs without randomization may show nondeterministic behavior on real machines, and because large
parallel machines are an expensive resource.
Experiments comparing the quality of the computed results are not much different from the sequential case. In the following, we therefore concentrate on
performance measurements.
Measuring Running Time
The CPU time is a good way to characterize the time
used by a sequential process (without I/Os), even in
the presence of some operating system interferences.
In contrast, in parallel programs we have to measure
the actual elapsed time (wall-clock time) in order to
capture all aspects of the parallel program, in particular, communication and synchronization overheads.
Of course, the experiments must be performed on an
otherwise unloaded machine, by using dedicated job
scheduling and by turning off unnecessary components
of the operating system on the processing nodes. Usually, further aspects of the program, like startup, initialization, and shutdown, are not interesting for the measurement. Thus timing is usually done as follows: All
processors perform a barrier synchronization immediately before the piece of program to be timed; one
processor x notes down its local time and all processors
execute the program to be measured. After another barrier synchronization, processor x measures the elapsed
time. As long as the running time is large compared to
the time for a barrier synchronization, this is an easy
way to measure wall-clock time. To get reliable results,
averaging over many repeated runs is advisable. Nevertheless, the measurements may remain unreliable since
rare delays, for example, due to work done by the operating system, can become quite frequent when they can
independently happen on any processor.
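The timing protocol described above can be sketched as follows. This is a minimal illustration, assuming Python threads as stand-ins for processors and `threading.Barrier` as the barrier synchronization; the names `timed_parallel_section` and `average_time` are hypothetical. Because Python threads share one interpreter, the sketch demonstrates the measurement protocol rather than real parallel performance.

```python
import threading
import time


def timed_parallel_section(num_workers, work):
    """Run work(rank) on num_workers threads and return the wall-clock
    time between two barriers, as measured by worker 0."""
    barrier = threading.Barrier(num_workers)
    elapsed = [None]

    def worker(rank):
        barrier.wait()                  # all workers start together
        if rank == 0:
            start = time.perf_counter() # worker 0 notes its local time
        work(rank)                      # the piece of program to be timed
        barrier.wait()                  # wait until every worker is done
        if rank == 0:
            elapsed[0] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return elapsed[0]


def average_time(num_workers, work, repetitions=5):
    """Average the measurement over repeated runs."""
    samples = [timed_parallel_section(num_workers, work)
               for _ in range(repetitions)]
    return sum(samples) / len(samples)


if __name__ == "__main__":
    # Hypothetical workload: worker r sleeps for 10 ms times (r + 1).
    print(average_time(4, lambda r: time.sleep(0.01 * (r + 1))))
```

The measured time is governed by the slowest worker, since worker 0 stops its clock only after the second barrier.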
Speedup and Efficiency
In parallel computing, running time depends on the
number of processors used and it is sometimes difficult
to see whether a particular execution time is good or
bad considering the amount of resources used. Therefore, derived measures are often used that express this
more directly. The speedup is the ratio of the running time of the fastest known sequential implementation to the running time of the parallel implementation. The speedup
directly expresses the impact of parallelization. The relative speedup is easier to measure because it compares
with the parallel algorithm running on a single processor. However, note that the relative speedup can
significantly overstate the actual usefulness of the parallel algorithm, since the sequential algorithm may be
much faster than the parallel algorithm run on a single
processor.
For a fixed input and a good parallel algorithm, the
speedup will usually start slightly below one for a single
processor, and initially goes up linearly with the number
of processors. Eventually, the speedup curve gets more
and more flat until parallelization overheads become
so large that the speedup goes down again. Clearly, it
makes no sense to add processors beyond the maximum
of the speedup curve. Usually it is better to stop much
earlier in order to keep the parallelization cost-effective.
Efficiency, the ratio of the speedup to the number of processors, more directly expresses this. Efficiency usually
starts somewhere below one and then slowly decreases
with the number of processors. One can use a threshold
for the minimum required efficiency to decide on the
maximum number of efficiently usable processors.
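The derived measures above can be computed directly from measured running times. The following sketch uses hypothetical timings; the function names and the threshold value are assumptions chosen for illustration.

```python
def speedup(t_seq, t_par):
    """Ratio of the sequential running time to the parallel running time."""
    return t_seq / t_par


def efficiency(t_seq, t_par, p):
    """Speedup divided by the number of processors."""
    return speedup(t_seq, t_par) / p


def max_efficient_processors(t_seq, parallel_times, threshold):
    """Largest processor count whose efficiency reaches the threshold.

    parallel_times maps a processor count p to the measured parallel time.
    Returns None if no processor count qualifies.
    """
    good = [p for p, t in parallel_times.items()
            if efficiency(t_seq, t, p) >= threshold]
    return max(good) if good else None


if __name__ == "__main__":
    # Hypothetical measurements (seconds); not taken from any real system.
    t_best_sequential = 100.0
    times = {1: 110.0, 2: 56.0, 4: 30.0, 8: 18.0, 16: 14.0}
    for p, t in times.items():
        absolute = speedup(t_best_sequential, t)
        relative = speedup(times[1], t)   # compares with the parallel code on 1 processor
        print(p, round(absolute, 2), round(relative, 2),
              round(efficiency(t_best_sequential, t, p), 2))
    print(max_efficient_processors(t_best_sequential, times, 0.5))
```

In this made-up data set the relative speedup is always larger than the absolute speedup, because the parallel program on one processor is slower than the best sequential program.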
Often, parallel algorithms are not really used to
decrease execution time but to increase the size of the
instances that can be handled in reasonable time. From
the point of view of speedup and efficiency, this is good
news because for a scalable parallel algorithm, by sufficiently increasing the input size, one can efficiently use
any number of processors. One can check this experimentally by scaling the input size together with the
number of processors. An interesting property of an
algorithm is how much one has to increase the input
size with the number of processors. The isoefficiency
function [] expresses this relation analytically, giving
the input size needed to achieve some given, constant
efficiency. As usual in algorithmics, one uses asymptotic notation to get rid of constant factors and lower
order terms.
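As a small worked illustration of this idea, assume a hypothetical cost model in which a problem of size n takes time n sequentially and n/p + c log2(p) on p processors (for example, a local phase followed by a tree-shaped combination step). The efficiency is then n / (n + c p log2 p), so keeping it constant requires n to grow proportionally to p log p. The model, the constant `c`, and the function names below are assumptions made for this sketch only.

```python
import math


def efficiency_model(n, p, c=1.0):
    """Efficiency under the assumed model T_seq = n, T_par = n/p + c*log2(p)."""
    t_seq = n
    t_par = n / p + c * math.log2(p)
    return t_seq / (p * t_par)


def input_size_for_efficiency(p, target, c=1.0):
    """Input size n that yields the target efficiency on p processors.

    Solving n / (n + c*p*log2(p)) = target for n gives
    n = target / (1 - target) * c * p * log2(p).
    """
    return target / (1.0 - target) * c * p * math.log2(p)


if __name__ == "__main__":
    for p in (4, 16, 64, 256):
        n = input_size_for_efficiency(p, 0.8)
        print(p, n, efficiency_model(n, p))
```

In asymptotic notation the required input size in this model is Θ(p log p), independent of the chosen target efficiency and of the constant c.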
Speedup Anomalies
Occasionally, efficiency exceeding one (also called
superlinear speedup) causes confusion. By Brent’s principle (a single processor can simulate a p-processor
algorithm with a uniform slowdown factor of p) this
should be impossible. However, genuine superlinear
absolute speedup can be observed if the program relies
on resources of the parallel machine not available to
a simulating sequential machine, for example, main
memory or cache.
A second reason is that the computations done by
an algorithm can be done in many different ways, some
leading to a solution fast, some more slowly. Hence, the
parallel program can be “lucky” to find a solution more
than p times earlier than the sequential program. Interestingly, such effects do not always disappear when averaging over all inputs. For example, Schöning [] gives
a randomized algorithm for finding satisfying assignments to formulas in propositional calculus that are
in conjunctive normal form. This algorithm becomes
exponentially faster when run in parallel on many (possibly simulated) processors. Moreover, its worst-case
performance is better than any sequential algorithm.
Brent’s principle is not violated since the best sequential
algorithm turns out to be the emulation of the parallel
algorithm.
Finally, there are many cases where superlinear
speedup is not genuine, mostly because the sequential
algorithm used for comparison is not really the best one
for the inputs considered.
Instances and Benchmarks
Benchmarks have a long tradition in parallel computing. Although their most visible use is for comparing
different machines, they are also helpful within the
AE cycle. During implementation, benchmarks of basic
operations help to select the right approach. For example, SKaMPI [] measures the performance of most MPI
calls and thus helps to decide which of several possible calls to use, or whether a manual implementation
could help.
Benchmark suites of input instances for an important computational problem can be key to consistent
progress on this problem. Compared to the alternative that each working group uses its own inputs, this
has obvious advantages: there can be a wider range of
inputs, results are easier to compare, and bias in instance
selection is less likely. For example, Chris Walshaw’s
graph partitioning archive [] has become an important
reference point for graph partitioning.
Synthetic instances, though less realistic than real-world inputs, can also be useful since they can be generated in any size and sometimes are good as stress
tests for the algorithms (though it is often the other way
round – naively constructed random instances are likely
to be unrealistically easy to handle). For example, for the
graph partitioning problem, one can generate graphs
that almost look like random graphs but have a predesigned structure that can be more or less easy to detect
according to tunable parameters.
Algorithm Libraries
Algorithm libraries are made by assembling implementations of a number of algorithms using the methods
of software engineering. The result should be efficient,
easy to use, well documented, and portable. Algorithm
libraries accelerate the transfer of know-how into applications. Within algorithmics, libraries simplify comparisons of algorithms and the construction of software
that builds on them. The software engineering involved
is particularly challenging, since the applications to
be supported are unknown at library implementation
time and because the separation of interface and (often
highly complicated) implementation is very important.
Compared to an application-specific reimplementation,
using a library should save development time without
leading to inferior performance. Compared to simple,
easy to implement algorithms, libraries should improve
performance. To summarize, the triangle between generality, efficiency, and ease of use leads to challenging trade-offs because often optimizing one of these
aspects will deteriorate the others. It is also worth mentioning that correctness of algorithm libraries is even
more important than for other software because it is
extremely difficult for a user to debug a library code

that has not been written by his team. All these difficulties imply that implementing algorithms for use in a
library is several times more difficult than implementations for experimental evaluation. On the other hand,
a good library implementation might be used orders
of magnitude more frequently. Thus, in AE there is a
natural mechanism leading to many exploratory implementations and a few selected library codes that build
on previous experimental experience.
In parallel computing, there is a fuzzy boundary between software libraries whose main purpose is
to shield the programmer from details of the hardware and genuine algorithm libraries. For example, the
basic functionality of MPI (message passing) is of the
first kind, whereas its collective communication routines have a distinctively algorithmic flavor. The Intel
Threading Building Blocks (http://www.threadingbuildingblocks.org) offer several algorithmic tools including a load balancer hidden behind a task concept and
distributed data structures such as hash tables. The standard libraries of programming languages can also be
parallelized. For example, there is a parallel version of
the C++ STL in the GNU distribution [].
The Computational Geometry Algorithms Library
(CGAL, http://www.cgal.org) is a very sophisticated
example of an algorithms library that is now also getting
partially parallelized [].
.
.
.
.
.
.
.
.
.
of parallel computation. In: th ACM SIGPLAN symposium on
principles and practice of parallel programming, pp –, San
Diego, – May, . ACM, New York
Gibbons PB, Matias Y, Ramachandran V () The queue-read
queue-write pram model: accounting for contention in parallel
algorithms. SIAM J Comput ():–
Grama AY, Gupta A, Kumar V () Isoefficiency: measuring the
scalability of parallel algorithms and architectures. IEEE Concurr
():–
Reussner R, Sanders P, Prechelt L, Müller M () SKaMPI: a
detailed, accurate MPI benchmark. In: EuroPVM/MPI, number
 in LNCS, pp –
Sanders P () Algorithm engineering – an attempt at a definition. In: Efficient Algorithms. Lecture Notes in Computer Science,
vol . Springer, pp –
Schöning U () A probabilistic algorithm for k-sat and constraint satisfaction problems. In: th IEEE symposium on foundations of computer science, pp –
Singler J, Sanders P, Putze F () MCSTL: the multi-core standard template library. In: th international Euro-Par conference.
LNCS, vol . Springer, pp –
Soperm AJ, Walshaw C, Cross M () A combined evolutionary search and multilevel optimisation approach to graph
partitioning. J Global Optim ():–
Valiant L () A bridging model for parallel computation. Commun ACM ()
Walshaw C, Cross M () JOSTLE: parallelmultilevel graphpartitioning software – an overview. In: Magoules F (ed) Mesh
partitioning tech-niques and domain decomposition techniques,
pp –. Civil-Comp Ltd. (Invited chapter)
Conclusion
This article explains how algorithm engineering (AE)
provides a methodology for research in parallel algorithmics that allows to bridge gaps between theory and
practice. AE does not abolish theoretical analysis but
contains it as an important component that, when applicable, provides particularly strong performance and
robustness guarantees. However, adding careful implementation, well-designed experiments, realistic inputs,
algorithm libraries, and a process coupling all of this
together provides a better way to arrive at algorithms
useful for real-world applications.
Algorithmic Skeletons
Parallel Skeletons
All Prefix Sums
Reduce and Scan
Scan
for Distributed Memory, Message-Passing
Systems
Bibliography
. Batista VHF, Millman DL, Pion S, Singler J () Parallel geometric algorithms for multi-core computers. In: th ACM symposium on computational geometry, pp –
. Culler D, Karp R, Patterson D, Sahay A, Schauser KE, Santos E,
Subramonian R, Eicken Tv () LogP: towards a realistic model
Allen and Kennedy Algorithm
Parallelism Detection in Nested Loops, Optimal
Allgather
Jesper Larsson Träff , Robert A. van de Geijn

University of Vienna, Vienna, Austria

The University of Texas at Austin, Austin, TX, USA
Synonyms
All-to-all broadcast; Collect; Concatenation; Gather-to-all; Gossiping; Total exchange
Definition
Among a group of processing elements (nodes) each
node has a data item that is to be transferred to all other
nodes, such that all nodes in the group end up having all
of the data items. The allgather operation accomplishes
this total data exchange.
Discussion
The reader may consider first visiting the entry on collective communication.
It is assumed that the p nodes in the group of nodes
participating in the allgather operation are indexed consecutively, each having an index i with $0 \le i < p$.
It is furthermore assumed that the data items are to
be collected in some fixed order determined by the
node indices; each node may apply a different order.
Assume that each node i initially has a vector of data
$x_i$ of some number of elements $n_i$, with $n = \sum_{i=0}^{p-1} n_i$
being the total number of subvector elements. Upon
completion of the allgather operation each node i will
have the full vector x consisting of the subvectors $x_i$,
$i = 0, \ldots, p-1$. This is shown in Fig. .
All nodes in the group are assumed to explicitly take part in the allgather communication operation. If all subvectors $x_i$ have the same number of elements $n_i = n/p$ the allgather operation is said to be regular, otherwise irregular. It is
commonly assumed that all nodes know the size of all
subvectors in advance. The Message Passing Interface
(MPI), for instance, makes this assumption for both
its regular MPI_Allgather and its irregular MPI_Allgatherv collective operations. Other collective interfaces that support the operation make similar
assumptions. This can be assumed without loss of generality. If it is not the case, a special, one item per node
allgather operation can be used to collect the subvector
sizes at all nodes.
[Figure: before the operation, each of the three nodes holds only its own subvector; after the operation, every node holds all three subvectors.]
Allgather. Fig.  Allgather on three nodes
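The semantics described above can be illustrated with a small sequential simulation. This is only a sketch of what the operation computes, not of how a communication library implements it: the nodes are modeled as entries of a Python list, and the function names `allgather`, `is_regular`, and `allgather_unknown_sizes` are hypothetical.

```python
def allgather(subvectors):
    """Simulate allgather: subvectors[i] is the data held by node i.

    Returns one result per node; every node ends up with the
    concatenation of all subvectors in increasing index order.
    """
    full = [item for sub in subvectors for item in sub]
    return [list(full) for _ in subvectors]


def is_regular(subvectors):
    """True if all nodes contribute the same number of elements."""
    return len({len(sub) for sub in subvectors}) == 1


def allgather_unknown_sizes(subvectors):
    """Irregular allgather when the sizes are not known in advance.

    Step 1: a regular, one-item-per-node allgather collects the sizes.
    Step 2: with all sizes known everywhere, the data is gathered.
    """
    sizes = allgather([[len(sub)] for sub in subvectors])
    data = allgather(subvectors)
    return sizes, data


if __name__ == "__main__":
    nodes = [["a"], ["b", "c"], ["d"]]   # three nodes, irregular sizes
    sizes, data = allgather_unknown_sizes(nodes)
    print(is_regular(nodes), sizes[0], data[0])
```

The size exchange in step 1 is itself a regular allgather, which is why knowing the sizes in advance can be assumed without loss of generality.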
The allgather operation is a symmetric variant of
the broadcast operation in which all nodes concurrently
broadcast their data item, and is therefore often referred
to as all-to-all broadcast. It is semantically equivalent
to a gather operation that collects data items from all
nodes at a specific root node, followed by a broadcast from that root node, or to p concurrent gather
operations with each node serving as root in one such
operation. This explains the term allgather. The term
concatenation can be used to emphasize that the data
items are gathered in increasing index order at all nodes
if such is the case. When concerned with the operation
for specific communication networks (graphs) the term
gossiping has been used. The problem has, like broadcast, been extensively studied in the literature, and other
terminology is occasionally found.
Lower Bounds
To obtain lower bounds for the allgather operation a fully connected, k-ported, homogeneous communication system with linear communication cost is assumed. This means that
● All nodes can communicate directly with all other nodes, at the same communication cost,
● In each communication operation, a node can send up to k distinct messages to k other nodes, and simultaneously receive k messages from k possibly different nodes, and
● The cost of transmitting a message of size n (in some unit) between any two nodes is modeled by a simple, linear function α + nβ. Here α is a start-up latency and β the transfer cost per unit.
With these approximations, lower bounds on the number of communication rounds during which nodes are involved in k-ported communication, and the total amount of data that have to be transferred along a critical path of any algorithm, can be easily established:
● Since the allgather operation implies a broadcast of a data item from each node, the number of communication rounds is at least $\lceil \log_{k+1} p \rceil$. This follows because the number of nodes to which this particular item has been broadcast can at most increase by a factor of k + 1 per round, such that the number of nodes to which the item has been broadcast after d rounds is at most $(k+1)^d$.
● The total amount of data to be received by node j is $n - n_j$, and since this can be received over k simultaneously active ports, a lower bound is $\max_{0 \le j < p} \lceil (n - n_j)/k \rceil$. For the regular case this is $\lceil (n - n/p)/k \rceil = \lceil (p-1)n/(pk) \rceil$.
In the linear cost model, a lower bound for the allgather operation is therefore
$$\lceil \log_{k+1} p \rceil \alpha + \max_{0 \le j < p} \left\lceil \frac{n - n_j}{k} \right\rceil \beta.$$
For regular allgather problems this simplifies to
$$\lceil \log_{k+1} p \rceil \alpha + \left\lceil \frac{(p-1)n}{pk} \right\rceil \beta.$$
The model approximations abstract from any specific
network topology, and for specific networks, including
networks with a hierarchical communication structure,
better, more precise lower bounds can sometimes be
established. The network diameter will for instance provide another lower bound on the number of communication rounds.
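As a small worked illustration, the lower bound in the linear cost model can be evaluated numerically. The sketch below is not part of any library; the function names `ceil_log` and `allgather_lower_bound` and the parameter layout (a list of subvector sizes, the port count k, and the constants α and β) are assumptions made for this example.

```python
def ceil_log(p, base):
    """Smallest d with base**d >= p, computed with integers only."""
    d, reach = 0, 1
    while reach < p:
        reach *= base
        d += 1
    return d


def allgather_lower_bound(sizes, k, alpha, beta):
    """Lower bound ceil(log_{k+1} p)*alpha + max_j ceil((n - n_j)/k)*beta.

    sizes[j] is the number of elements n_j contributed by node j
    (hypothetical input format chosen for this sketch).
    """
    p = len(sizes)
    n = sum(sizes)
    rounds = ceil_log(p, k + 1)
    # -(-a // k) is integer ceiling division.
    volume = max(-(-(n - nj) // k) for nj in sizes)
    return rounds * alpha + volume * beta
```

For a regular problem with p = 8 nodes, four elements per node, and k = 1, the round term is 3α and the data term is 28β, in agreement with the formula (p − 1)n/p with n = 32.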
Algorithms
At first it is assumed that k = 1 and that the allgather is regular, that is, the size $n_i$ of each subvector $x_i$ is equal to n/p. A brief survey of practically useful, common algorithmic approaches to the problem follows.
Ring, Mesh, and Hypercube
A flexible class of algorithms can be described by first viewing the nodes as connected as a ring where node i sends to node (i + 1) mod p and receives from node (i − 1) mod p. An allgather operation can be accomplished in p − 1 communication rounds. In round j, 0 ≤ j < p − 1, node i sends subvector $x_{(i-j) \bmod p}$ and receives subvector $x_{(i-1-j) \bmod p}$.
The cost of this algorithm is
$$(p-1)\left(\alpha + \frac{n}{p}\beta\right) = (p-1)\alpha + \frac{(p-1)n}{p}\beta.$$
This simple algorithm achieves the lower bound in the second term, and is useful when the vector size n is large. The linear, first term is far from the lower bound.
If the p nodes are instead organized in an $r_0 \times r_1$ mesh, with $p = r_0 r_1$, the complete allgather operation can be accomplished by first gathering (simultaneously) along the first dimension, and secondly gathering the larger subvectors along the second dimension. The cost of this algorithm becomes
$$(r_0-1)\left(\alpha + \frac{n}{p}\beta\right) + (r_1-1)\left(\alpha + r_0\frac{n}{p}\beta\right) = (r_0+r_1-2)\alpha + \frac{(r_0-1)n + (r_1-1)r_0 n}{p}\beta = (r_0+r_1-2)\alpha + \frac{(p-1)n}{p}\beta.$$
Generalizing the approach, if the p nodes are organized in a d-dimensional mesh with $p = r_0 \times r_1 \times \cdots \times r_{d-1}$, the complete allgather operation can be accomplished by d successive allgather operations along the d dimensions. The cost of this algorithm becomes
$$(r_{d-1}-1)\left(\alpha + \frac{n}{p}\beta\right) + (r_{d-2}-1)\left(\alpha + r_{d-1}\frac{n}{p}\beta\right) + \cdots = \left(\sum_{j=0}^{d-1}(r_j-1)\right)\alpha + \frac{(p-1)n}{p}\beta.$$
Notice that as the number of dimensions increases, the α term decreases while the (optimal) β term does not change.
If $p = 2^d$ so that $d = \log_2 p$ this approach yields an algorithm with cost
$$\left(\sum_{j=0}^{\log_2 p - 1}(2-1)\right)\alpha + \frac{(p-1)n}{p}\beta = (\log_2 p)\alpha + \frac{(p-1)n}{p}\beta.$$
The allgather in each dimension now involves only two nodes, and becomes a bidirectional exchange of data. This algorithm maps easily to hypercubes, and conversely a number of essentially identical algorithms were originally developed for hypercube systems. For fully connected networks it is restricted to the situation where p is a power of two, and does not easily, without loss of theoretical performance, generalize to arbitrary values of p. It is sometimes called the bidirectional exchange algorithm, and relies only on restricted, telephone-like bidirectional communication. For this algorithm both the α and the β terms achieve their respective lower bounds. In contrast to, for instance, the broadcast operation, optimality can be achieved without the use of pipelining techniques.
    j ← 1
    j′ ← 2
    while j′ < p do
      /* next round */
      par /* simultaneous send-receive */
        Send subvector (x_i, x_{(i+1) mod p}, ..., x_{(i+j−1) mod p}) to node (i − j) mod p
        Receive subvector (x_{(i+j) mod p}, x_{(i+j+1) mod p}, ..., x_{(i+j+j−1) mod p}) from node (i + j) mod p
      endpar
      j ← j′
      j′ ← 2j′
    endwhile
    /* last subvector */
    j′ ← p − j
    par /* simultaneous send-receive */
      Send subvector (x_i, x_{(i+1) mod p}, ..., x_{(i+j′−1) mod p}) to node (i − j) mod p
      Receive subvector (x_{(i+j) mod p}, x_{(i+j+1) mod p}, ..., x_{(i+j+j′−1) mod p}) from node (i + j) mod p
    endpar
Allgather. Fig. 2 The dissemination allgather algorithm for node i, 0 ≤ i < p, and 1 < p
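The ring and bidirectional exchange schemes of the section on ring, mesh, and hypercube algorithms can be checked with a small sequential simulation. The following is only a sketch under simplifying assumptions: each "node" is a Python dictionary, a round is modeled by taking a snapshot of all messages before delivering them, and the function names `ring_allgather` and `exchange_allgather` are made up for this illustration.

```python
def ring_allgather(blocks):
    """Simulate the (p - 1)-round ring allgather; blocks[i] is x_i at node i."""
    p = len(blocks)
    have = [{i: blocks[i]} for i in range(p)]  # have[i][s] is x_s stored at node i
    for j in range(p - 1):
        # In round j node i sends x_{(i-j) mod p} to its right neighbor.
        sent = [((i - j) % p, have[i][(i - j) % p]) for i in range(p)]
        for i in range(p):
            src, data = sent[(i - 1) % p]      # message from the left neighbor
            have[i][src] = data
    return [[have[i][s] for s in range(p)] for i in range(p)]


def exchange_allgather(blocks):
    """Simulate the bidirectional exchange allgather; p must be a power of two."""
    p = len(blocks)
    if p & (p - 1):
        raise ValueError("bidirectional exchange needs p to be a power of two")
    have = [{i: blocks[i]} for i in range(p)]
    rounds, dist = 0, 1
    while dist < p:
        before = [dict(h) for h in have]       # state at the start of the round
        for i in range(p):
            # Node i and node i XOR dist swap everything gathered so far.
            have[i].update(before[i ^ dist])
        dist *= 2
        rounds += 1
    return [[have[i][s] for s in range(p)] for i in range(p)], rounds
```

In the simulation the ring variant works for any p, whereas the exchange variant finishes in log2 p rounds but rejects node counts that are not powers of two, mirroring the restriction stated in the text.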
Dissemination Allgather
On networks supporting bidirectional, fully connected, single-ported communication the allgather problem can be solved in the optimal $\lceil \log_2 p \rceil$ number of communication rounds for any p as shown in Fig. 2. In round k, k = 0, 1, ... (here k is the round index, not the number of ports), node i communicates with nodes $(i + 2^k) \bmod p$ and $(i - 2^k) \bmod p$, and the size of the subvectors sent and received doubles in each round, with the possible exception of the last. Since each node sends and receives p − 1 subvectors each of n/p elements the total cost of the algorithm is
$$\lceil \log_2 p \rceil \alpha + \frac{(p-1)n}{p}\beta,$$
for any number of nodes p, and meets the lower bound for single-ported communication systems. It can be generalized optimally (for some k) to k-ported communication systems.
This scheme is useful in many settings for implementation of other collective operations. The $\lceil \log_2 p \rceil$-regular communication pattern is an instance of a so-called circulant graph.
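The dissemination pattern can likewise be simulated sequentially. The sketch below folds the final, possibly shorter round into the main loop instead of treating it separately, which is a simplification chosen here for brevity; the name `dissemination_allgather` and the list-based buffers are assumptions of this example.

```python
def dissemination_allgather(blocks):
    """Simulate the dissemination allgather for any p >= 1.

    buf[i] holds, in order, x_i, x_{(i+1) mod p}, ... as gathered by node i.
    Returns the gathered vectors in index order and the number of rounds.
    """
    p = len(blocks)
    buf = [[blocks[i]] for i in range(p)]
    rounds = 0
    j = 1                       # distance, equal to the number of subvectors held
    while j < p:
        count = min(j, p - j)   # the last round may carry fewer subvectors
        # Node i receives the first `count` subvectors of node (i + j) mod p,
        # which simultaneously sends its own prefix to node (i - j) mod p.
        incoming = [buf[(i + j) % p][:count] for i in range(p)]
        for i in range(p):
            buf[i].extend(incoming[i])
        j *= 2
        rounds += 1
    # buf[i][t] is x_{(i+t) mod p}; rotate into increasing index order.
    ordered = [[buf[i][(s - i) % p] for s in range(p)] for i in range(p)]
    return ordered, rounds
```

For p = 5 the simulation uses three rounds with distances 1, 2, and 4, and only the last round transfers a single subvector instead of four.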
Composite Algorithms
Different, sometimes practically efficient algorithms for
the allgather operation can be derived by combinations
of algorithms for broadcast and gather. The full vector
can be gathered at a chosen root node and broadcast
from this node to all other nodes. This approach is
inherently a factor of two off from the optimal number
of communication rounds, since both gather and broadcast require at least $\lceil \log_2 p \rceil$ communication rounds
even for fully connected networks. Other algorithms
can be derived from broadcast or gather, by doing these
operations simultaneously with each of the p nodes
acting as either broadcast or gather root node.
Related Entries
Broadcast
Collective Communication
Message Passing Interface (MPI)
Bibliographic Notes and Further Reading
The allgather operation is a symmetric variant of the broadcast operation, and together with this, one of the most studied collective communication operations. Early theoretical surveys under different communication assumptions can be found in [, , , ].
For (near-)optimal algorithms for hypercubes, meshes,
and tori, see [, ]. Practical implementations, for
instance, for MPI for a variety of parallel systems
have frequently been described, see, for instance, [, ,
, ]. Algorithms and implementations that exploit
multi-port communication capabilities can be found in
[, , ]. Algorithms based on ring shifts are sometimes called cyclic or bucket algorithms; in [, ] it is
discussed how to create hybrid algorithms for meshes
of lower dimension and fully connected architectures
where the number of nodes is not a power of two. The
ideas behind these algorithms date back to the early
days of distributed memory architectures [, ]. The
dissemination allgather algorithm is from the fundamental work of Bruck et al. [], although the term
is not used in that paper [, ]. Attention has almost
exclusively been given to the regular variant of the
problem. A pipelined algorithm for very large, irregular
allgather problems was given in []. More general bibliographic notes can be found in the entry on collective
communication.
Bibliography
. Benson GD, Chu C-W, Huang Q, Caglar SG () A comparison of MPICH allgather algorithms on switched networks. In:
Recent advances in parallel virtual machine and message passing interface, th European PVM/MPI users’ group meeting.
Lecture notes in computer science, vol . Springer, Berlin,
pp –
. Bruck J, Ho C-T, Kipnis S, Upfal E, Weathersby D () Efficient
algorithms for all-to-all communications in multiport messagepassing systems. IEEE Trans Parallel Distrib Syst ():–
. Chan E, Heimlich M, Purkayastha A, van de Geijn RA ()
Collective communication: theory, practice, and experience. Concurrency Comput: Pract Exp ():–
. Chan E, van de Geijn RA, Gropp W, Thakur R () Collective communication on architectures that support simultaneous
communication over multiple links. In: ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP),
ACM, New York, pp –
. Chen M-S, Chen J-C, Yu PS () On general results for all-to-all
broadcast. IEEE Trans Parallel Distrib Syst ():–
. Fox G, Johnson M, Lyzenga G, Otto S, Salmon J, Walker D ()
Solving problems on concurrent processors, vol I. Prentice-Hall,
Englewood Cliffs
. Fraigniaud P, Lazard E () Methods and problems of communication in usual networks. Discret Appl Math (–):–
. Hedetniemi SM, Hedetniemi T, Liestman AL () A survey
of gossiping and broadcasting in communication networks. Networks :–
. Hensgen D, Finkel R, Manber U () Two algorithms for barrier
synchronization. Int J Parallel Program ():–
. Ho C-T, Kao M-Y () Optimal broadcast in all-port wormholerouted hypercubes. IEEE Trans Parallel Distrib Syst ():–
. Krumme DW, Cybenko G, Venkataraman KN () Gossiping
in minimal time. SIAM J Comput ():–
. Mamidala AR, Vishnu A, Panda DK () Efficient shared memory and RDMA based design for MPI Allgather over InfiniBand.
In: Recent advances in parallel virtual machine and message passing interface, th European PVM/MPI users’ group meeting.
Lecture notes in computer science, vol . Springer, Berlin,
pp –
. Mitra P, Payne DG, Schuler L, van de Geijn R () Fast collective
communication libraries, please. In: Intel Supercomputer Users’
Group Meeting, University of Texas,  June 
. Qian Y, Afsahi A () RDMA-based and SMP-aware multi-port
all-gather on multi-rail QsNetII SMP clusters. In: International
conference on parallel processing (ICPP ) Xi’ an, China, p. 
. Saad Y, Schultz MH () Data communication in parallel architectures. Parallel Comput ():–
. Träff JL () Efficient allgather for regular SMP-clusters. In:
Recent advances in parallel virtual machine and message passing
interface, th European PVM/MPI users’ group meeting, Lecture
notes in computer science, vol . Springer, Berlin, pp –
. Träff JL, Ripke A, Siebert C, Balaji P, Thakur R, Gropp W ()
A pipelined algorithm for large, irregular all-gather problems. Int
J High Perform Comput Appl ():–
. Yang Y, Wang J () Near-optimal all-to-all broadcast in multidimensional all-port meshes and tori. IEEE Trans Parallel Distrib
Syst ():–
All-to-All
Jesper Larsson Träff , Robert A. van de Geijn

University of Vienna, Vienna, Austria

The University of Texas at Austin, Austin, TX, USA
Synonyms
Complete exchange; Index; Personalized all-to-all
exchange; Transpose
Definition
Among a set of processing elements (nodes) each
node has distinct (personalized) data items destined
for each of the other nodes. The all-to-all operation
accomplishes this total data exchange among the set of
nodes, such that each node ends up having an individual
data item from each of the other nodes.
Discussion
The reader may consider first visiting the entry on collective communication.
Let the p nodes be indexed consecutively, 0, 1, ..., p − 1. Initially each node i has a (column) vector of data $x^{(i)}$ that is further subdivided into subvectors $x_j^{(i)}$ for j = 0, ..., p − 1. The subvector $x_j^{(i)}$ is to be sent to node j from node i. Upon completion of the all-to-all exchange operation node i will have the vector consisting of the subvectors $x_i^{(j)}$ for j = 0, ..., p − 1. In effect, the matrix consisting of the p columns $x^{(i)}$ is transposed with the ith row consisting of subvectors $x_i^{(j)}$ originally distributed over the p nodes j = 0, ..., p − 1 now transferred to node i. This transpose all-to-all operation is illustrated in Fig. 1. The subvectors $x_i^{(i)}$ do not have to be communicated, but for symmetry reasons it is convenient to think of the operation as if each node also contributes a subvector to itself.
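The transpose view of the operation can be made concrete in a few lines. This is a sketch of the semantics only, with no communication modeled; the name `alltoall_reference` and the nested-list layout, in which `send[i][j]` plays the role of the subvector node i holds for node j, are assumptions of this example.

```python
def alltoall_reference(send):
    """Reference semantics of all-to-all as a block transpose.

    send[i][j] is the subvector that node i holds for node j; the result
    recv[i][j] is the subvector that node i obtains from node j.
    """
    p = len(send)
    return [[send[j][i] for j in range(p)] for i in range(p)]
```

With three nodes and `send[i][j] = (i, j)`, node 0 ends up with `[(0, 0), (1, 0), (2, 0)]`, that is, row 0 of the original block matrix.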
The all-to-all exchange can be interpreted as an actual matrix transpose operation in the case where all subvectors $x_i^{(j)}$ have the same number of elements n. In that case the all-to-all problem is called regular and the operation is also sometimes termed index. The all-to-all operation is, however, also well defined in cases where subvectors have different numbers of elements. Subvectors could for instance differ for each row index j, e.g., $n_j = |x_j^{(i)}|$ for all nodes i, or each subvector could have a possibly different number of elements $n_{ij} = |x_j^{(i)}|$ without any specified relation between the subvector sizes. In all such cases, the all-to-all problem is called irregular. Likewise, the subvectors $x_j^{(i)}$ can be structured objects, for instance matrix blocks. It is common to assume that for both regular and irregular all-to-all problems each node knows not only the number of elements of all subvectors it has to send to other nodes, but also the number of elements in all subvectors it has to receive from other nodes. This can be assumed without loss of generality, since a regular all-to-all operation with subvectors of size one can be used to exchange and collect the required information on number of elements to be sent and received.
All-to-All. Fig. 1 All-to-all communication for three nodes (before: node i holds $x_0^{(i)}$, $x_1^{(i)}$, $x_2^{(i)}$; after: node i holds $x_i^{(0)}$, $x_i^{(1)}$, $x_i^{(2)}$)
All-to-all communication is the most general and most communication intensive collective data-exchange communication pattern. Since it is allowed that some $|x_j^{(i)}| = 0$ and that some subvectors $x_j^{(i)}$ are identical for some i and j, the irregular all-to-all operation subsumes other common collective communication patterns like broadcast, gather, scatter, and allgather. Algorithms for specific patterns are typically considerably more efficient than general, all-to-all algorithms, which motivates the inclusion of a spectrum of collective communication operations in collective communication interfaces and libraries. The Message-Passing Interface (MPI), for instance, has both regular MPI_Alltoall and irregular MPI_Alltoallv and MPI_Alltoallw operations in its repertoire, as well as operations for the specialized operations broadcast, gather, scatter, and allgather, and structured data can be flexibly described by so-called user-defined datatypes.
All-to-all communication is required in FFT computations, matrix transposition, generalized permutations, etc., and is thus a fundamental operation in a large
number of scientific computing applications.
Lower Bounds
The (regular) all-to-all communication operation requires that each node exchanges data with each other node. Therefore, a lower bound on the all-to-all communication time will be determined by a minimal bisection and the bisection bandwidth of the communication system or network. The number of subvectors that have to cross any bisection of the nodes (i.e., partition into two roughly equal-sized subsets) is $p^2/2$ (for even p), namely $p/2 \times p/2 = p^2/4$ subvectors from each subset of the partition. The number of communication links that can be simultaneously active in a minimal partition determines the number of communication rounds that are at the least needed to transfer all subvectors across the bisection assuming that subvectors are not combined. A d-dimensional, symmetric torus with unidirectional communication has a bisection of $k^{d-1}$ with $k = \sqrt[d]{p}$, and the number of required communication rounds is therefore $p\sqrt[d]{p}/2$. A hypercube (which can also be construed as a $\log_2 p$ dimensional mesh) has bisection p/2. The low bisection of the torus limits the bandwidth that can be achieved for all-to-all communication. If the bisection bandwidth is B vector elements per time unit, any all-to-all algorithm requires at least $np^2/(2B)$ units of time.
Assume now a fully connected, homogeneous,
k-ported, bidirectional send-receive communication
network. Each node can communicate directly with any
other node at the same cost, and can at the same time
receive data from at most k nodes and send distinct data
to at most k, possibly different nodes. In this model,
the following general lower bounds on the number of
communication rounds and the amount of data transferred in sequence per node can be proved. Here n is
the amount of data per subvector for each node.
● A complete all-to-all exchange requires at least $\lceil \log_{k+1} p \rceil$ communication rounds and transfers at least $\lceil n(p-1)/k \rceil$ units of data per node.
● Any all-to-all algorithm that uses $\lceil \log_{k+1} p \rceil$ communication rounds must transfer at least $\Omega\left(\frac{np}{k+1}\log_{k+1} p\right)$ units of data per node.
● Any all-to-all algorithm that transfers exactly $n(p-1)/k$ units of data per node requires at least $(p-1)/k$ communication rounds.
These lower bounds bound the minimum required
time for the all-to-all operation. In contrast to, for
instance, allgather communication, there is a trade-off
between the number of communication rounds and
the amount of data transferred. In particular, it is not
possible to achieve a logarithmic number of communication rounds and a linear number of subvectors per
node. Instead, fewer rounds can only be achieved at the
expense of combining subvectors and sending the same
subvector several times. As a first approximation, communication cost is modeled by a linear cost function
such that the time to transfer n units of data is α + nβ
for a start-up latency α and cost per unit β. All-to-all communication time is then at least α times the number of
communication rounds plus β times the units of data
transferred in sequence by a node.
Algorithms
In the following, the regular all-to-all operation is considered. Each node i has a subvector $x_j^{(i)}$ to transfer to each other node j, and all subvectors have the same number of elements $n = |x_i^{(j)}|$.
Fully Connected Systems
In a direct algorithm each node i sends each subvector directly to its destination node j. It is assumed
that communication takes place in rounds, in which
all or some nodes send and/or receive data. The difficulty is to organize the communication in such a way
that no node stays idle for too many rounds, and that
in the same round no node is required to send or
receive more than the k subvectors permitted by the
k-ported assumption.
For one-ported, bidirectional systems the simplest algorithm takes p − 1 communication rounds. In round r, 1 ≤ r < p, node i sends subvector $x_{(i+r) \bmod p}^{(i)}$ to node (i + r) mod p, and receives subvector $x_i^{((i-r) \bmod p)}$ from node (i − r) mod p. If required, subvector $x_i^{(i)}$ is copied in a final noncommunication round. This algorithm achieves the lower bound on data transfer, at the expense of p − 1 communication rounds, and trivially generalizes to k-ported systems. In this case, the last round may not be able to utilize all k communication ports.
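The one-ported direct algorithm can be traced with a short simulation. The sketch below is illustrative only: `direct_alltoall` is a made-up name, `send[i][j]` stands for the subvector node i holds for node j, and each round is executed sequentially although all p transfers of a round are meant to happen at the same time.

```python
def direct_alltoall(send):
    """Simulate the (p - 1)-round direct all-to-all on a one-ported system."""
    p = len(send)
    recv = [[None] * p for _ in range(p)]
    rounds = 0
    for r in range(1, p):
        receivers = set()
        for i in range(p):
            dst = (i + r) % p            # node i sends to node (i + r) mod p
            recv[dst][i] = send[i][dst]  # so dst receives from (dst - r) mod p
            receivers.add(dst)
        # One-ported check: every node receives exactly one message per round.
        assert len(receivers) == p
        rounds += 1
    for i in range(p):
        recv[i][i] = send[i][i]          # final self-copy, no communication
    return recv, rounds
```

The inner assertion documents why the schedule respects the one-ported restriction: for fixed r, the map from i to (i + r) mod p is a permutation of the nodes.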
For single-ported systems (k = ) with weaker
bidirectional communication capabilities allowing only
that each node i sends and receives from the same
node j (often referred to as the telephone model)
in a communication round, or even unidirectional
communication capabilities allowing each node to only
either send or receive in one communication round, a
different algorithm solves the problem. Such an algorithm is described in the following.
A fully connected communication network can be viewed as a complete, undirected graph with processing nodes modeled as graph nodes. For even p the complete graph can be partitioned into p − 1 disjoint 1-factors (perfect matchings), each of which associates each node i with a different node j. For odd p this is not possible, but the formula j = (r − i) mod p for r = 0, ..., p − 1 associates a different node j with i for each round r such that if i is associated with j in round r, then j is associated with i in the same round. In each round there is exactly one node that becomes associated with itself. This can be used to achieve the claimed 1-factorization for even p. Let the last node i = p − 1 be special. Perform the p − 1 rounds on the p − 1 non-special nodes. In the round where a non-special node becomes associated with itself, it instead performs an exchange with the special node. It can be shown that p and p − 1 communication rounds for odd and even p, respectively, is optimal. The algorithm is depicted in Fig. 2. It takes p − 1 communication rounds for even p, and p rounds for odd p. If a self-copy is required it can for odd p be done in the round where a node is associated with itself and for even p either before or after the exchange. Note that in the case where p is even the formula j = (r − i) mod p also pairs node i with itself in some round, and the self-exchange could be done in this round. This would lead to an algorithm with p rounds, in some of which some nodes perform the self-exchange. If bidirectional communication is not supported, each exchange between nodes i and j can be accomplished by the smaller numbered node sending and then receiving from the larger numbered node, and conversely the larger numbered node receiving and then sending to the smaller numbered node.
If p is a power of two the pairing j = i ⊕ r, where ⊕ denotes the bitwise exclusive-or operation, likewise produces a 1-factorization and this has often been used.
    if odd(p) then
      for r ← 0, 1, ..., p − 1 do
        j ← (r − i) mod p
        if i ≠ j then
          par /* simultaneous send-receive */
            Send subvector x_j^(i) to node j
            Receive subvector x_i^(j) from node j
          end par
        end if
      end for
    else if i < p − 1 then /* p even */
      /* non-special nodes */
      for r ← 0, 1, ..., p − 2 do
        j ← (r − i) mod (p − 1)
        if i = j then j ← p − 1
        par /* simultaneous send-receive */
          Send subvector x_j^(i) to node j
          Receive subvector x_i^(j) from node j
        end par
      end for
    else /* special node */
      for r ← 0, 1, ..., p − 2 do
        if even(r) then j ← r/2 else j ← (p − 1 + r)/2
        par /* simultaneous send-receive */
          Send subvector x_j^(i) to node j
          Receive subvector x_i^(j) from node j
        end par
      end for
    end if
All-to-All. Fig. 2 Direct, telephone-like all-to-all communication based on 1-factorization of the complete p node graph. In each communication round r each node i becomes associated with a unique node j with which it performs an exchange. The algorithm falls into three special cases for odd and even p, respectively, and for the latter for nodes i < p − 1 and the special node i = p − 1. For odd p the number of communication rounds is p, and p − 1 for even p. In both cases, this is optimal in the number of communication rounds
Hypercube
By combining subvectors the number of communication rounds and thereby the number of message
start-ups can be significantly reduced. The price is an
increased communication volume because subvectors
will have to be sent multiple times via intermediate
nodes. Such indirect algorithms were pioneered for
hypercubes and later extended to other networks and
communication models. In particular for the fully connected, k-ported model algorithms exist that give the
optimal trade-off (for many values of k and p) between
the number of communication rounds and the amount
of data transferred per node.
A simple, indirect algorithm for the d-dimensional hypercube that achieves the lower bound on the number of communication rounds at the expense of a logarithmic factor more data is as follows. Each hypercube node i communicates with each of its d neighbors, and the nodes pair up with their neighbors in the same fashion. In round r, node i pairs up and performs an exchange with neighbor $j = i \oplus 2^{d-r}$ for r = 1, ..., d. In the first round, node i sends in one message the $2^{d-1}$ subvectors destined to the nodes of the (d − 1)-dimensional subcube to which node j belongs. It receives from node j the $2^{d-1}$ subvectors for the (d − 1)-dimensional subcube to which it itself belongs. In the second round, node i pairs up with a neighbor of a (d − 2)-dimensional subcube, and exchanges the $2^{d-2}$ own subvectors and the additional $2^{d-2}$ subvectors for this subcube received in the first round. In general, in each round r node i receives and sends $2^{r-1} \cdot 2^{d-r} = 2^{d-1} = p/2$ subvectors, namely those originating at $2^{r-1}$ nodes and destined for $2^{d-r}$ nodes.
Total cost of this algorithm in the linear cost model is $\log_2 p \left(\alpha + \beta \frac{p}{2} n\right)$.
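The indirect hypercube algorithm can be simulated to confirm both the final data placement and the per-round volume of p/2 subvectors. This is a sketch under simplifying assumptions: subvectors are tagged with (source, destination) pairs, rounds are executed sequentially, and the name `hypercube_alltoall` is chosen only for this example.

```python
def hypercube_alltoall(send):
    """Simulate the d-round indirect all-to-all on a hypercube with p = 2**d nodes.

    send[i][j] is the subvector node i holds for node j.  Returns the received
    data and the number of subvectors each node forwards in every round.
    """
    p = len(send)
    d = p.bit_length() - 1
    if p < 1 or (1 << d) != p:
        raise ValueError("p must be a power of two")
    # held[i] maps (source, destination) to the subvector currently at node i.
    held = [{(i, j): send[i][j] for j in range(p)} for i in range(p)]
    per_round = []
    for r in range(1, d + 1):
        bit = 1 << (d - r)                       # neighbor is i XOR 2**(d - r)
        outgoing = []
        for i in range(p):
            # Forward everything whose destination lies in the other subcube.
            out = {key: val for key, val in held[i].items() if (key[1] ^ i) & bit}
            for key in out:
                del held[i][key]
            outgoing.append(out)
        for i in range(p):
            held[i ^ bit].update(outgoing[i])
        per_round.append(len(outgoing[0]))
    recv = [[held[i][(j, i)] for j in range(p)] for i in range(p)]
    return recv, per_round
```

For p = 8 the simulation reports four forwarded subvectors per node in each of the three rounds, matching the count p/2 per round in the cost formula.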
Irregular All-to-All Communication
The irregular all-to-all communication problem is considerably more difficult both in theory and in practice. In the general case with no restrictions on the sizes of the subvectors to be sent and received, finding communication schedules that minimize the number of communication rounds and/or the amount of data transferred is an NP-complete optimization problem. For problem instances that are not too irregular a decomposition into a sequence of more regular problems and other collective operations sometimes works, and such approaches have been used in practice. Heuristics and approximation algorithms for many variations of the problem have been considered in the literature. For concrete communication libraries, a further practical difficulty is that full information about the problem to be solved, that is, the sizes of all subvectors for all the nodes, is typically not available to any single node. Instead each node knows only the sizes of the subvectors it has to send (and receive). To solve the problem this information has to be centrally gathered, entailing a sequential bottleneck, or distributed algorithms or heuristics for computing a communication schedule must be employed.
Related Entries
Allgather
Collective Communication
FFT (Fast Fourier Transform)
Hypercubes and Meshes
MPI (Message Passing Interface)
Bibliographic Notes and Further Reading
All-to-all communication has been studied since the early days of parallel computing, and many of the results presented here can be found in early, seminal work [, ]. Classical references on indirect all-to-all communication for hypercubes are [, ], see also later [] for combinations of different approaches leading to hybrid algorithms. The generalization to arbitrary node counts for the k-ported, bidirectional communication model was developed in [], which also proves the lower bounds on number of rounds and the trade-off between amount of transferred data and required number of communication rounds. Variants and combinations of these algorithms have been implemented in various MPI-like libraries for collective communication [, , ].
A proof that a 1-factorization of the p node complete graph exists when p is even can be found in [] and elsewhere.
All-to-all algorithms not covered here for meshes
and tori can be found in [, , , ], and algorithms
for multistage networks in [, ]. Lower bounds for
tori and meshes based on counting link load can be
found in []. Implementation considerations for some
of these algorithms for MPI for the Blue Gene systems
can be found in [].
Irregular all-to-all exchange algorithms and implementations have been considered in [, ] (decomposition into a series of easier problems) and later in [, ],
the former summarizing many complexity results.
Bibliography
. Bala V, Bruck J, Cypher R, Elustondo P, Ho A, Ho CT, Kipnis S,
Snir M () CCL: a portable and tunable collective communications library for scalable parallel computers. IEEE T Parall Distr
():–
. Bokhari SH () Multiphase complete exchange: a theoretical
analysis. IEEE T Comput ():–
. Bruck J, Ho CT, Kipnis S, Upfal E, Weathersby D () Efficient
algorithms for all-to-all communications in multiport messagepassing systems. IEEE T Parall Distr ():–
. Fox G, Johnson M, Lyzenga G, Otto S, Salmon J, Walker D ()
Solving problems on concurrent processors, vol I. Prentice-Hall,
Englewood Cliffs
. Goldman A, Peters JG, Trystram D () Exchanging messages
of different sizes. J Parallel Distr Com ():–
. Harary F () Graph theory. Addison-Wesley, Reading, Mass
. Johnsson SL, Ho CT () Optimum broadcasting and personalized communication in hypercubes. IEEE T Comput ():
–
. Kumar S, Sabharwal Y, Garg R, Heidelberger P () Optimization of all-to-all communication on the blue gene/l supercomputer. In: International conference on parallel processing (ICPP),
Portland, pp –
. Lam CC, Huang CH, Sadayappan P () Optimal algorithms for
all-to-all personalized communication on rings and two dimensional tori. J Parallel Distr Com ():–
. Liu W, Wang CL, Prasanna VK () Portable and scalable algorithm for irregular all-to-all communication. J Parallel Distr Com
:–
. Massini A () All-to-all personalized communication on
multistage interconnection networks. Discrete Appl Math
(–):–
. Ranka S, Wang JC, Fox G () Static and run-time algorithms
for all-to-many personalized communication on permutation
networks. IEEE T Parall Distr ():–
. Ranka S, Wang JC, Kumar M () Irregular personalized communication on distributed memory machines. J Parallel Distr
Com ():–
. Ritzdorf H, Träff JL () Collective operations in NEC’s highperformance MPI libraries. In: International parallel and distributed processing symposium (IPDPS ), p 
. Saad Y, Schultz MH () Data communication in parallel architectures. Parallel Comput ():–
. Scott DS () Efficient all-to-all communication patterns in
hypercube and mesh topologies. In: Proceedings th conference
on distributed memory concurrent Computers, pp –
. Suh YJ, Shin KG () All-to-all personalized communication in
multidimensional torus and mesh networks. IEEE T Parall Distr
():–
. Suh YJ, Yalamanchili S () All-to-all communication with
minimum start-up costs in D/D tori and meshes. IEEE T Parall
Distr ():–
. Thakur R, Gropp WD, Rabenseifner R () Improving the performance of collective operations in MPICH. Int J High Perform
C :–
. Tseng YC, Gupta SKS () All-to-all personalized communication in a wormhole-routed torus. IEEE T Parall Distr ():
–
. Tseng YC, Lin TH, Gupta SKS, Panda DK () Bandwidthoptimal complete exchange on wormhole-routed D/D torus
networks: A diagonal-propagation approach. IEEE T Parall Distr
():–
. Yang Y, Wang J () Optimal all-to-all personalized exchange
in self-routable multistage networks. IEEE T Parall Distr ():
–
. Yang Y, Wang J () Near-optimal all-to-all broadcast in multidimensional all-port meshes and tori. IEEE T Parall Distr
():–
All-to-All Broadcast
Allgather
Altivec
IBM Power Architecture
Vector Extensions, Instruction-Set Architecture (ISA)
AMD Opteron Processor
Barcelona
Benjamin Sander
Advanced Micro Devices, Inc., Austin, TX, USA
Synonyms
Microprocessors
Definition
The AMD Opteron™ processor codenamed “Barcelona”
was introduced in September  and was notable
as the industry’s first “native” quad-core x processor, containing four cores on a single piece of silicon.
The die also featured an integrated L cache which was
accessible by all four cores, Rapid Virtualization Indexing to improve virtualization performance, new power
management features including split power planes for
the integrated memory controller and the cores, and
a higher-IPC (Instructions per Clock) core including
significantly higher floating-point performance. Finally,
the product was a plug-in replacement for the previous-generation AMD Opteron processor – “Barcelona” used
the same Direct Connect system architecture, the same
physical socket including commodity DDR memory,
and fit in approximately the same thermal envelope.
Discussion
Previous Opteron Generations
The first-generation AMD Opteron processor was
introduced in April of . AMD Opteron introduced
the AMD instruction set architecture, which was an
evolutionary extension to the x instruction set that
provided additional registers and -bit addressing [].
An important differentiating feature of AMD was
that it retained compatibility with the large library of
-bit x applications which had been written over
the -year history of that architecture – other -bit
architectures at that time either did not support the
x ISA or supported it only through a slow emulation
mode. The first-generation AMD Opteron also introduced Direct Connect architecture which featured a
memory controller integrated on the same die as the
processor, and HyperTransport™ Technology bus links
to directly connect the processors and enable glueless
two-socket and four-socket multiprocessor systems.
The second-generation AMD Opteron processor
was introduced two years later in April of  and
enhanced the original offering by integrating two cores
(Dual-Core) with the memory controller. The dual-core
AMD Opteron was later improved with DDR memory for higher bandwidth, and introduced AMD-V™
hardware virtualization support.
“Barcelona” introduced the third generation of the
AMD Opteron architecture, in September of .
System-on-a-Chip
“Barcelona” was the industry’s first native quad-core
processor. As shown in Fig. , the processor contained
four processing cores (each with a private  k L
cache), a shared  MB L cache, an integrated crossbar
which enabled efficient multiprocessor communication,
a DDR memory controller, and three HyperTransport links. The tight integration on a single die enabled
efficient communication between all four cores, notably
through the shared L cache, and also allowed more
flexible power management.
The L cache was a non-inclusive architecture –
the contents of the L caches were independent and
typically not replicated in the L, enabling more efficient use of the cache hierarchy. The L and L caches
were designed as a victim-cache architecture. A newly
retrieved cache line would initially be written to the
L cache. Eventually, the processor would evict the line
from the L cache and would then write it into the L
cache. Finally, the line would be evicted from the L
cache and the processor would write it to the L cache.
The shared cache used a “sharing-aware” fill and
replacement policy. Consider the case where a linefill request hits in the L cache: The processor has the
option of leaving a copy of the line in the L (predicting that other cores are likely to want that line in the
future) or moving the line to the requesting core, invalidating it from the L to leave a hole which could be
used by another fill. Lines which were likely to be shared
(for example, instruction code or lines which had been
shared in the past) would leave a copy of the line in
the L cache for other cores to access. Lines which were
likely to be private (for example, requested with write
permission or never shared in the past) would be moved
from the L to the L.
The L cache maintained a history of which lines had
been shared to guide the fill policy. Additionally, the
sharing history information influenced the L replacement policy: the processor would preferentially keep
lines which had been shared in the past by granting
them an additional trip through the replacement policy.
AMD Opteron Processor Barcelona. Fig.  Quad-Core AMD Opteron™ processor design
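The sharing-aware fill decision described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the flag names, and the rule that a request for write permission takes priority over sharing history are assumptions, not details given in the text.

```python
def fill_action_on_shared_cache_hit(is_instruction, wants_write, shared_before):
    """Decide what happens to a line that hits in the shared cache.

    Returns "keep_copy" (leave the line in the shared cache for other cores)
    or "move_to_core" (hand the line to the requester and free the slot).
    """
    # Assumption: a request for write permission marks the line as private,
    # regardless of its history. The text does not state a precedence.
    if wants_write:
        return "move_to_core"
    # Instruction code and previously shared lines are predicted to be shared.
    if is_instruction or shared_before:
        return "keep_copy"
    # Never shared before: predicted private, so the slot is freed.
    return "move_to_core"
```

A real implementation would also feed the same sharing history into the replacement policy, which this sketch leaves out.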
In client CPUs, AMD later introduced a “Triple-Core” product based on the same die used by
“Barcelona,” with three functional cores rather than four. Targeted at the consumer
market, it served users who valued an affordable product with more than dual-core processing power.
Socket Compatibility
“Barcelona” was socket-compatible with the previous-generation Dual-Core AMD Opteron™ product:
“Barcelona” had the same pinout, used the same two
channels of DDR memory, and also fit in a similar
thermal envelope enabling the use of the same cooling and thermal solution. Even though “Barcelona” used
four cores (rather than two) and each core possessed
twice the peak floating-point capability of the previous generation, “Barcelona” was able to fit in the same
thermal envelope through the combination of a smaller
process (“Barcelona” was the first AMD server product based on  nm SOI technology) and a reduction
in peak operating frequency (initially “Barcelona” ran at
. GHz). The doubling of core density in the same platform was appealing and substantially increased performance on many important server workloads compared
to the previous-generation AMD Opteron product.
Customers could upgrade their AMD Opteron products
to take advantage of quad-core “Barcelona,” and OEMs
could leverage much of their platform designs from the
previous generation.
Delivering More Memory Bandwidth
As mentioned above, the “Barcelona” processor’s socket compatibility eased its introduction to the marketplace.
However, the four cores on a die placed additional
load on the memory controller – the platform had the
same peak bandwidth as the previous generation: Each
socket still had two channels of DDR memory, running at up to  MHz and delivering a peak bandwidth of . GB/s (per socket). To feed the additional
cores, the memory controller in “Barcelona” included
several enhancements which improved the delivered
bandwidth.
One notable feature was the introduction of “independent” memory channels. In the second-generation
AMD Opteron product, each memory channel serviced
half of a -byte memory request – i.e., in parallel
the memory controller would read  bytes from each
channel. This was referred to as a “ganged” controller
organization and has the benefit of perfectly balancing
the memory load between the two memory channels.
However, this organization also effectively reduces the
number of available DRAM banks by a factor of two:
when a DRAM page is opened on the first channel, the
controller opens the same page on the other channel at
the same time. Effectively, the ganged organization creates pages which are twice as big, but provides half as
many DRAM pages as a result.
With four cores on a die all running four different
threads, the memory accesses tend to be unrelated and
more random. DRAMs support only a small number
of open banks, and requests which map to the same
bank but at different addresses create a situation called
a “page conflict.” The page conflict can dramatically
reduce the efficiency and delivered bandwidth of the
DRAM, because the DRAM has to repeatedly switch
between the multiple pages which are competing for
the same open bank resources. Additionally, write traffic coming from the larger multi-level “Barcelona” cache
hierarchy created another level of mostly random memory traffic which had to be efficiently serviced. All of
these factors led to a new design in which the two memory channels were controlled independently – i.e., each
channel could independently service an entire -byte
cache line rather than using both channels for the same
cache line. This change enabled the two channels to
independently determine which pages to open, effectively doubling the number of banks and enabling the
design to better cope with the more complex memory
stream coming from the quad-core processor.
A low-order address bit influenced the channel
selection, which served to spread the memory traffic between the two controllers and avoid overloading
one of the channels. The design hashed the low-order
address bit with other higher-order address bits to spread
the traffic more evenly, and in practice, applications
showed a near-equal allocation between the two channels, enabling the benefits of the extra banks provided
by the independent channels along with equal load balancing. The independent channels also resulted in a
longer burst length (twice as long as the ganged organization), which reduced pressure on the command bus;
essentially each Read or Write command performed
twice as much work with the independent organization. This was especially important for some DIMMs
which required commands to be sent for two consecutive cycles (“T” mode).
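The channel-selection hash described above can be illustrated with a short Python sketch. The text does not say which address bits were used, so the bit positions below (bit 6 as the low-order bit, bits 12, 16, and 20 as the hashed higher-order bits) and the function names are purely hypothetical choices for illustration.

```python
def select_channel(address, low_bit=6, hash_bits=(12, 16, 20)):
    """Pick memory channel 0 or 1 for a physical address.

    Assumption: the channel is one low-order address bit XORed with a few
    higher-order bits. The bit positions are hypothetical, not AMD's.
    """
    channel = (address >> low_bit) & 1
    for bit in hash_bits:
        channel ^= (address >> bit) & 1
    return channel


def channel_histogram(addresses, **hash_options):
    """Count how many of the given addresses map to each channel."""
    counts = [0, 0]
    for address in addresses:
        counts[select_channel(address, **hash_options)] += 1
    return counts


# A 128-byte stride never toggles bit 6, so an unhashed selector would send
# every access to channel 0. Hashing in higher bits spreads the stream.
strided = range(0, 1024 * 128, 128)
unhashed = channel_histogram(strided, hash_bits=())
hashed = channel_histogram(strided)
```

In this sketch the strided stream lands entirely on one channel without the hash and splits evenly with it, which is the load-balancing effect the paragraph describes.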
“Barcelona” continued to support a dynamic open-page policy, in which the memory controller leaves
DRAM pages in the “open” state if the pages are likely
to be accessed in the future. As compared to a closed-page design (which always closes the DRAM pages), the
dynamic open-page design can improve latency and
bandwidth in cases where the access stream has locality to the same page (by leaving the page open), and
also deliver best-possible bandwidth when the access
stream contains conflicts (by recognizing the page-conflict stream and closing the page). “Barcelona” introduced a new predictor which examined the history of
accesses to each bank to determine whether to leave the
page open or to close it. The predictor was effective at
increasing the number of page hits (delivering lower
latency) and reducing the number of page conflicts
(improving bandwidth).
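One way to picture a history-based open-page predictor is a small saturating counter per bank. The sketch below is an assumption made for illustration only: the text does not describe the actual predictor structure, and the class name, counter width, and threshold are hypothetical.

```python
class OpenPagePredictor:
    """Per-bank saturating counter that votes to keep a DRAM page open.

    Hypothetical model: not the structure used in the actual controller.
    """

    def __init__(self, num_banks, max_count=3, threshold=2):
        self.max_count = max_count
        self.threshold = threshold
        self.counter = [threshold] * num_banks  # start by leaving pages open
        self.last_row = [None] * num_banks      # row touched by the previous access
        self.page_open = [False] * num_banks    # is that row still open?

    def access(self, bank, row):
        """Record an access and return 'hit', 'conflict', or 'miss'."""
        same_row = self.last_row[bank] == row
        if self.page_open[bank]:
            outcome = "hit" if same_row else "conflict"
        else:
            outcome = "miss"  # bank was closed, so the row must be activated

        # Train on history: a repeat of the same row argues for leaving the
        # page open, a different row argues for closing it after each access.
        if self.last_row[bank] is not None:
            if same_row:
                self.counter[bank] = min(self.max_count, self.counter[bank] + 1)
            else:
                self.counter[bank] = max(0, self.counter[bank] - 1)

        self.last_row[bank] = row
        self.page_open[bank] = self.counter[bank] >= self.threshold
        return outcome


predictor = OpenPagePredictor(num_banks=2)
local_stream = [predictor.access(0, 7) for _ in range(4)]
conflict_stream = [predictor.access(1, row) for row in (1, 2, 1, 2, 1, 2)]
```

With these settings, the stream with locality turns into page hits after the first activation, while the alternating stream suffers one conflict and is then handled as a closed-page stream.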
“Barcelona” also introduced a new DRAM prefetcher
which monitored read traffic and prefetched the next
cache line when it detected a pattern. The prefetcher
had sophisticated pattern detection logic which could
detect both forward and backward patterns, unit and
non-unit strides, as well as some more complicated
patterns. The prefetcher targeted a dedicated buffer to
store the prefetched data (rather than a cache) and thus
the algorithm could aggressively exploit unused DRAM
bandwidth without concern for generating cache pollution. The prefetcher also had a mechanism to throttle
itself if the prefetches were inaccurate or if
the DRAM bandwidth was consumed by non-prefetch
requests. Later generations of the memory controller
would improve on both the prefetch and the throttling
mechanisms.
The DRAM controller was also internally replumbed with wider busses (the main busses were
increased from -bits to -bits) and with additional
buffering. Many of these changes served to prepare
the controller to support future higher-speed memory technologies. One notable change was the addition
of write-bursting logic, which buffered memory writes
until a watermark level was reached, at which time the
controller would burst all the writes to the DRAM. As
compared to trickling the writes to the DRAM as they
arrived at the memory controller, the write-bursting
mode was another bandwidth optimization. Typically,
the read and write requests address different banks, so
switching between a read mode and a write mode both
minimizes the number of read/write bus turnarounds
and minimizes the associated page open/close traffic.
“Barcelona” Core Architecture
The “Barcelona” core was based on the previous
“K” core design but included comprehensive IPC
(Instructions-per-Clock) improvements throughout the
entire pipeline. One notable feature was the “Wide
Floating-Point Accelerator,” which doubled the raw
computational floating-point data paths and floating-point execution units from -bits to -bits. A
. GHz “Barcelona” core possessed a peak rate of
 GFlops of single-precision computation; the quad-core “Barcelona” could then deliver  GFlops at peak
(twice as many cores, each with twice as much performance). This doubling of each core’s raw computation bandwidth was accompanied by a doubling of the
instruction fetch bandwidth (from -bytes/cycle to -bytes/cycle) and a doubling of the data cache bandwidth
(two -bit loads could be serviced each clock cycle),
enabling the rest of the pipeline to feed the new higher-bandwidth floating-point units. Notably, SSE instructions include one or more prefix bytes, frequently use
the REX prefix to encode additional registers, and can
be quite large. The increased instruction fetch bandwidth was therefore important to keep the pipeline
balanced and able to feed the high-throughput wide
floating-point units.
“Barcelona” introduced a new “unaligned SSE
mode,” which allowed a single SSE instruction to both
load and execute an SSE operation, without concern for
alignment. Previously, users had to use two instructions
(an unaligned load followed by an execute operation).
This new mode further reduced pressure on the instruction decoders and also reduced register pressure. The
mode relaxed a misguided attempt to simplify the architecture by penalizing unaligned operations and was
later adopted as the x standard. Many important algorithms such as video decompression can benefit from
vector instructions but cannot guarantee alignment in
the source data (for example, if the input data is compressed).
The “Barcelona” pipeline included an improved
branch predictor, using more bits for the global history
to improve the accuracy and also doubling the size of
the return stack. “Barcelona” added a dedicated predictor for indirect branches to improve performance when
executing virtual functions commonly used in modern programming styles. The wider -byte instruction fetch improved SSE instruction throughput
as well as throughput in codes with large integer
instructions, particularly when using some of the more
complicated addressing modes.
“Barcelona” added a Sideband Stack Optimizer feature which executed common stack operations (i.e.,
the PUSH and POP instructions) with dedicated stack
adjustment logic. The logic broke the serial dependence
chains seen in consecutive strings of PUSH and POP
instructions (common at function entry and exit), and
also freed the regular functional units to execute other
operations.
“Barcelona” also improved the execution core with
a data-dependent divide, which provided an early
out for the common case where the dividend was
small. “Barcelona” introduced the SSEa instruction
set, which added a handful of instructions including
leading-zero count and population count, bit INSERT
and EXTRACT, and streaming single-precision store
operations.
The “Barcelona” core added an out-of-order load feature, which enabled loads to bypass other loads in the
pipeline. Other memory optimizations included a wider
L bus, larger data and instruction TLBs,  GB page
size, and -bit physical address to support large server
database footprints. The L data TLB was expanded to
 entries, and each entry could hold any of the three
page sizes in the architecture ( K,  M, or  GB); this
provided flexibility for the architecture to efficiently run
applications with both legacy and newer page sizes.
Overall the “Barcelona” core was a comprehensive but evolutionary improvement over the previous
design. The evolutionary improvement provided a consistent optimization strategy for compilers and software developers: optimizations for previous-generation
AMD Opteron™ processors were largely effective on
the “Barcelona” core as well.
Virtualization and Rapid Virtualization
Indexing
In , virtualization was an emerging application class
driven by customer desire to more efficiently utilize
multi-core server systems and thus was an important
target for the quad-core “Barcelona.” One performance
bottleneck in virtualized applications was the address
translation performed by the hypervisor – the hypervisor virtualizes the physical memory in the system and
thus has to perform an extra level of address translation
between the guest physical and the actual host physical address. Effectively, the hypervisor creates another
level of page tables for this final level of translation.
Previous-generation processors performed this translation with a software-only technique called “shadow paging.” Shadow paging required a large number of hypervisor intercepts (to maintain the page tables) which
slowed performance and also suffered from an increase
in the memory footprint. “Barcelona” introduced Rapid
Virtualization Indexing (also known as “nested paging”) which provided hardware support for performing
the final address translation; effectively, the hardware
was aware of both the host and guest page tables and
could walk both as needed []. “Barcelona” also provided translation caching structures to accelerate the
nested table walk.
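The two levels of translation can be made concrete with a deliberately simplified Python sketch. It assumes a single 4 KB page size and flat, single-level page tables modeled as dictionaries. Real guest and host tables are multi-level, and in a nested walk each guest table pointer is itself a guest-physical address that must go through the host tables. All names here are illustrative.

```python
PAGE_SIZE = 4096  # assumption: only 4 KB pages are modeled


def translate(page_table, address):
    """Map an address through a flat page table {page number: frame number}."""
    page_number, offset = divmod(address, PAGE_SIZE)
    return page_table[page_number] * PAGE_SIZE + offset


def nested_translate(guest_page_table, host_page_table, guest_virtual):
    """Guest virtual -> guest physical -> host physical."""
    guest_physical = translate(guest_page_table, guest_virtual)
    return translate(host_page_table, guest_physical)


guest_page_table = {2: 7}   # guest virtual page 2 -> guest physical page 7
host_page_table = {7: 40}   # guest physical page 7 -> host physical page 40
host_physical = nested_translate(guest_page_table, host_page_table, 2 * PAGE_SIZE + 0x10)
```

With shadow paging, the hypervisor would instead maintain a combined table mapping guest virtual pages directly to host physical pages and would have to intercept guest page-table updates to keep it current.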
“Barcelona” continued to provide AMD-V™ hardware virtualization support, tagged TLBs to reduce
TLB flushing when switching between guests, and
the Device Exclusion Vector for security. Additionally, well-optimized virtualization applications typically
demonstrate a high degree of local memory accesses,
i.e., accesses to the integrated memory controller rather
than to another socket in the multi-socket system.
AMD’s Direct Connect architecture, which provided
lower latency and higher bandwidth for local memory
accesses, was thus particularly well suited for running
virtualization applications.

Power Reduction Features
The “Barcelona” design included dedicated power supplies for the CPU cores and the memory controller,
allowing the voltage for each to be separately controlled.
This allowed the cores to operate at reduced power consumption levels while the memory controller continued
to run at full speed and service memory requests from
other cores in the system. In addition, the core design
included the use of fine-gaters to reduce the power
to logic on the chip which was not currently in use.
One example was the floating-point unit – for integer code, when the floating-point unit was not in use,
“Barcelona” would gate the floating-point unit and significantly reduce the consumed power. The fine-gaters
could be re-enabled in a single cycle and did not cause
any visible increased latency.
The highly integrated “Barcelona” design also
reduced the overall system chip count and thus reduced
system power. Notably, the AMD Opteron™ system
architecture integrated the northbridge on the same die
as the processor (reducing system chip count by one
device present in some legacy system architectures),
and also used commodity DDR memory (which
consumed less power than the competing FB-DIMM
standard).
Future Directions
The AMD Opteron™ processor codenamed “Barcelona”
was the third generation in the AMD Opteron processor line. “Barcelona” was followed by the AMD Opteron
processor codenamed “Shanghai,” which was built in
 nm SOI process technology, included a larger  M
shared L cache, further core and northbridge performance improvements, faster operating frequencies, and
faster DDR and HT interconnect frequencies. “Shanghai” was introduced in November of .
“Shanghai” was followed by the AMD Opteron processor codenamed “Istanbul,” which integrated six cores
onto a single die, and again plugged into the same
socket as “Barcelona” and “Shanghai.” “Istanbul” also
included an “HT Assist” feature which substantially
reduced probe traffic in the system. HT Assist adds
a cache directory to each memory controller; the directory tracks lines in the memory range serviced by the
memory controller which are cached somewhere in the
system. Frequently, a memory access misses the directory (indicating the line is not cached anywhere in the
system) and thus the system can immediately return
the requested data without having to probe the system.
HT Assist enables the AMD Opteron platform to efficiently scale bandwidth to -socket and -socket server
systems.
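The effect of such a directory can be sketched as a set of line addresses that are known to be cached somewhere. This is a toy model with hypothetical names. It ignores directory capacity, line states, and which socket holds the line.

```python
class CacheDirectory:
    """Toy directory for one memory controller's address range."""

    def __init__(self):
        self.cached_lines = set()  # lines that some cache in the system holds
        self.probes_sent = 0

    def line_cached(self, line):
        self.cached_lines.add(line)

    def line_evicted(self, line):
        self.cached_lines.discard(line)

    def read(self, line):
        """Return 'memory' on a directory miss, 'probe' on a directory hit."""
        if line in self.cached_lines:
            self.probes_sent += 1  # a cached copy exists and must be probed
            return "probe"
        return "memory"            # not cached anywhere: answer from DRAM at once


directory = CacheDirectory()
directory.line_cached(0x1000)
results = [directory.read(0x1000), directory.read(0x2000), directory.read(0x3000)]
```

Without the directory, all three reads in this example would have required probes. With it, only the read of the cached line does.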
The next AMD Opteron™ product is on the near
horizon as well. Codenamed “Magny-Cours,” this processor is planned for introduction in the first quarter of
, includes up to -cores in each socket, and introduces a new G platform. Each G socket contains 
DDR channels and  HyperTransport links. “MagnyCours” continues to use evolved versions of the core
and memory controller that were initially introduced in
“Barcelona.”
“Barcelona” broke new ground as the industry’s
first native quad-core device, including a shared L
cache architecture and leadership memory bandwidth.
The “Barcelona” core was designed with an evolutionary approach, and delivered higher core performance
(especially on floating-point codes), and introduced
new virtualization technologies to improve memory
translation performance and ease Hypervisor implementation. Finally, “Barcelona” was plug-compatible
with the previous-generation AMD Opteron processors, leveraging the stable AMD Direct Connect architecture and cost-effective commodity DDR memory
technology.
Bibliography
. Advanced Micro Devices, Inc. x-TM Technology White
Paper. http://www.amd.com/us-en/assets/content_type/white_
papers_and_tech_docs/x-_wp.pdf
. Advanced Micro Devices, Inc. () AMD-VTM Nested Paging.
http://developer.amd.com/assets/NPT-WP-%-final-TM.pdf
. Advanced Micro Devices, Inc. () AMD architecture tech
docs. http://www.amd.com/us-en/Processors/DevelopWithAMD/
,___,.html
. Sander B () Core optimizations for system-level performance. http://www.instat.com/fallmpf//conf.htm http://www.
instat.com/Fallmpf//
Amdahl’s Argument
Amdahl’s Law
Amdahl’s Law
John L. Gustafson
Intel Labs, Santa Clara, CA, USA
Synonyms
Amdahl’s argument; Fixed-size speedup; Law of diminishing returns; Strong scaling
Definition
Amdahl’s Law says that if you apply P processors
to a task that has serial fraction f, the predicted
net speedup is
$$\text{Speedup} = \frac{1}{f + \dfrac{1-f}{P}}.$$
More generally, it shows the speedup that results from
applying any performance enhancement by a factor of
P to only one part of a given workload.
A corollary of Amdahl’s Law, often confused with
the law itself, is that even when one applies a very
large number of processors P (or other performance
enhancement) to a problem, the net improvement in
speed cannot exceed 1/f.
Discussion
Graphical Explanation
The diagram in Fig.  graphically explains the formula
in the definition.
The model sets the time required to solve the present
workload (top bar) to unity. The part of the workload
that is serial, f, is unaffected by parallelization. (See discussion below for the effect of including the time for
interprocessor communication.) The model assumes
that the remainder of the time, 1 − f, parallelizes perfectly so that it takes only 1/P as much time as on the
serial processor. The ratio of the top bar to the bottom
bar is thus 1/(f + (1 − f)/P).
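The model just described translates directly into a few lines of Python. The function names below are illustrative only, and the second function expresses the corollary that the speedup can never exceed the reciprocal of the serial fraction.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's Law for serial fraction f on P processors."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    # Time for the present workload is 1; the reduced time is f + (1 - f)/P.
    reduced_time = serial_fraction + (1.0 - serial_fraction) / processors
    return 1.0 / reduced_time


def amdahl_limit(serial_fraction):
    """Upper bound on speedup as the number of processors grows without limit."""
    if serial_fraction == 0:
        return float("inf")
    return 1.0 / serial_fraction


speedup = amdahl_speedup(0.1, 10)  # 1 / (0.1 + 0.9/10), roughly 5.26
bound = amdahl_limit(0.1)          # roughly 10, however many processors are used
```

For example, with a serial fraction of 0.1, ten processors give a speedup of only about 5.26, and no number of processors can push it past about 10.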
History
In the late s, research interest increased in the idea
of achieving higher computing performance by using
many computers working in parallel. At the Spring
 meeting of the American Federation of Information Processing Societies (AFIPS), organizers set up
a session entitled “The best approach to large computing capability – A debate.” Daniel Slotnick presented “Unconventional Systems,” a description of a
-processor ensemble controlled by a single instruction stream, later known as the ILLIAC IV []. IBM’s
chief architect, Gene Amdahl, presented a counterargument entitled “Validity of the single processor approach
to achieving large scale computing capabilities” []. It
was in this presentation that Amdahl made a specific
argument about the merits of serial mainframes over
parallel (and pipelined) computers.
Amdahl’s Law. Fig.  Graphical explanation of Amdahl’s Law (top bar: time for present workload, split into serial fraction f and parallel fraction 1–f; bottom bar: reduced time, f plus (1–f)/P, with P processors applied to the parallel fraction)
The formula known as Amdahl’s Law does not
appear anywhere in that paper. Instead, the paper shows
a hand-drawn graph that includes the performance
speedup of a -processor system over a single processor, as the fraction of parallel work increases from % to
%. There are no numbers or labels on the axes in the
original, but Fig.  reproduces his graph more precisely.
Amdahl estimated that about % of an algorithm
was inherently serial, and data management imposed
another % serial overhead, which he showed as the
gray area in the figure centered about % parallel content. He asserted this was the most probable region of
operation. From this, he concluded that a parallel system like the one Slotnick described would only yield
from about X to X speedup. At the debate, he presented the formula used to produce the graph, but did
not include it in the text of the paper itself.
This debate was so influential that in less than a year,
the computing community was referring to the argument against parallel computing as “Amdahl’s Law.” The
person who first coined the phrase may have been Willis
H. Ware, who in  first put into print the phrase
“Amdahl’s Law” and the usual form of the formula, in
a RAND report titled “The Ultimate Computer” [].
The argument rapidly became part of the commonly
accepted guidelines for computer design, just as the Law
of Diminishing Returns is a classic guideline in economics and business. In the early s, computer users
attributed the success of Cray Research vector computers over rivals such as those made by CDC to Cray’s
better attention to Amdahl’s Law. The Cray designs did
not take vector pipelining to such extreme levels relative to the rest of their system and often got a higher
fraction of peak performance as a result. The widely
used textbook on computer architecture by Hennessy
and Patterson [] harkens back to the traditional view
of Amdahl’s Law as guidance for computer designers,
particularly in its earlier editions.
The formula Amdahl used was simply the use of elementary algebra to combine two different speeds, here
defined as work per unit time, not distance per unit time:
it simply compares two cases of the net speed as the
total work divided by the total time. This common result
is certainly not due to Amdahl, and he was chagrined
at receiving credit for such an obvious bit of mathematics. “Amdahl’s Law” really refers to the argument
that the formula (along with its implicit assumptions
about typical serial fractions and the way computer
costs and workloads scale) predicts harsh limits on what
parallel computing can achieve. For those who wished
to avoid the change to their software that parallelism
would require, either for economic or emotional reasons, Amdahl’s Law served as a technical defense for
their preference.
Amdahl’s Law. Fig.  Amdahl’s original -processor speedup graph (reconstructed); performance versus fraction of arithmetic that can be run in parallel
Estimates of the “Serial Fraction” Prove
Pessimistic
The algebra of Amdahl’s Law is unassailable since it
describes the fundamental way speeds add algebraically,
for a fixed amount of work. However, the estimate of
the serial fraction f (originally %, give or take %)
was only an estimate by Amdahl and was not based on
mathematics.
There exist algorithms that are inherently almost
% serial, such as a time-stepping method for a
physics simulation involving very few spatial variables.
There are also algorithms that are almost % parallel,
such as ray tracing methods for computer graphics, or
computing the Mandelbrot set. It therefore seems reasonable that there might be a rather even distribution of
serial fraction from  to  over the entire space of computer applications. The following figure shows another
common way to visualize the effects of Amdahl’s Law,
with speedup as a function of the number of processors.
Figure  shows performance curves for serial fractions
., ., ., and . for a -processor computer system.
The limitations of Amdahl’s Law for performance
prediction were highlighted in , when IBM scientist Alan Karp publicized a skeptical challenge (and a
token award of $) to anyone who could demonstrate
a speedup of over  times on three real computer
applications []. He had just returned from a conference at which startup companies nCUBE and Thinking
Machines had announced systems with over , processors, and Karp gave the community  years to solve
the problem, putting a deadline at the end of  to
achieve the goal. Karp suggested fluid dynamics, structural analysis, and econometric modeling as the three
application areas to draw from, to avoid the use of
contrived and unrealistic applications. The published
speedups of the  era tended to be less than tenfold
and used applications of little economic value (like how
to place N queens on a chessboard so that no two can
attack each other).
The Karp Challenge was widely distributed by
e-mail but received no responses for years, suggesting
that if -fold speedups were possible, they required
more than a token amount of effort. C. Gordon Bell,
also interested in promoting the advancement of computing, proposed a similar challenge but with two alterations: He raised the award to $,, and said that it
would be given annually to the greatest parallel speedup
achieved on three real applications, but only awarded if
the speedup was at least twice that of the previous award.
This definition was the original Gordon Bell Prize [],
and Bell envisioned that the first award might be for
something close to tenfold speedup, with increasingly
difficult advances after that.
Amdahl’s Law. Fig.  Speedup curves for large serial fractions (speedup, as time reduction, versus number of processors; ideal linear increase and serial fractions f = 0.1, 0.2, 0.3, and 0.4)
Amdahl’s Law. Fig.  Speedup curves for smaller serial fractions (speedup, as time reduction, versus number of processors; ideal linear increase and serial fractions f = 0.001, 0.01, and 0.1)
By late , Sandia scientists John Gustafson, Gary
Montry, and Robert Benner undertook to demonstrate high parallel speedup on applications from fluid
dynamics, structural mechanics, and acoustic wave
propagation. They recognized that an Amdahl-type
speedup (now called “strong scaling”) was more challenging to achieve than some of the speedups claimed
for distributed memory systems like Caltech’s Cosmic
Cube [] that altered the problem according to the
number of processors in use. However, using a ,processor nCUBE , the three Sandia researchers were
able to achieve performance on , processors ranging from X to X that of a single processor running the same size problem, implying the Amdahl serial
fraction was only .–. for those applications.
This showed that the historical estimates of values for
the serial fraction in Amdahl’s formula might be far too
high, at least for some applications. While the mathematics of Amdahl’s Law is unassailable, it was a poorly
substantiated opinion that the actual values of the serial
fraction for computing workloads would always be too
high to permit effective use of parallel computing.
Figure  shows Amdahl’s Law, again for a processor system, but with serial fractions of ., .,
and ..
The TOP list ranks computers by their ability to solve a dense system of linear equations.
In November , the top-ranked system (Jaguar,
Oak Ridge National Laboratories) achieved over %
parallel efficiency using , computing cores.
For Amdahl’s Law to hold, the serial fraction must be
about one part per million for this system.
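The link between an observed speedup and the serial fraction it implies can be checked numerically. The following is a minimal sketch, assuming the standard form of Amdahl's Law, speedup = 1 / (f + (1 − f)/N). The function names and the sample values (1,000 processors at 50% efficiency) are illustrative assumptions, not the measured figures for any system named above.

```python
def amdahl_speedup(f, n):
    """Speedup on n processors when a fraction f of the work is serial."""
    return 1.0 / (f + (1.0 - f) / n)


def implied_serial_fraction(speedup, n):
    """Invert Amdahl's Law: the f that would yield `speedup` on n processors."""
    return (n / speedup - 1.0) / (n - 1.0)


# Hypothetical example: 1,000 processors running at 50% parallel efficiency.
n = 1000
f = implied_serial_fraction(0.5 * n, n)
print(f)                      # about 0.001
print(amdahl_speedup(f, n))   # recovers a speedup of about 500
```

Even a modest loss of efficiency at large N corresponds to a very small serial fraction, which is why high efficiencies on large machines imply values of f far below the historical estimates.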
Observable Fraction and Superlinear
Speedup
For many scientific codes, it is simple to instrument
and measure the amount of time f spent in serial execution. One can place timers in the program around
serial regions and obtain an estimate of f that might or
might not strongly depend on the input data. One can
then apply this fraction for Amdahl’s Law estimates of
time reduction, or Gustafson’s Law estimates of scaled
speedup. Neither law takes into account communication costs or intermediate degrees of parallelism.
A more common practice is to measure the parallel speedup as the number of processors is varied, and
fit the resulting curve to derive f . This approach may
yield some guidance for programmers and hardware
developers, but it confuses serial fraction with communication overhead, load imbalance, changes in the
relative use of the memory hierarchy, and so on. The
term “strong scaling” refers to the requirement to keep
the problem size the same for any number of processors.
A common phenomenon that results from “strong scaling” is that when spreading a problem across more and
more processors, the memory per processor goes down
to the point where the data fits entirely in cache, resulting in superlinear speedup []. Sometimes, the superlinear speedup effects and the communication overheads
partially cancel out, so what appears to be a low value
of f is actually the result of the combination of the
two effects. In modern parallel systems, performance
analysis with Amdahl’s original law alone will usually
be inaccurate, since so many other parallel processing
phenomena have large effects on the speedup.
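The curve-fitting practice described in this section, and the way it mixes the serial fraction with other effects, can be illustrated with a small sketch. It assumes the model 1/S = f + (1 − f)/N and a one-parameter least-squares fit. The synthetic data and the overhead term (0.0005 per processor) are hypothetical and chosen only for illustration.

```python
def fit_serial_fraction(processors, speedups):
    """Least-squares estimate of f from the model 1/S = f + (1 - f)/N.

    Requires at least one measurement with N > 1.
    """
    num = 0.0
    den = 0.0
    for n, s in zip(processors, speedups):
        x = 1.0 / n
        y = 1.0 / s
        # Rearranged model: y - x = f * (1 - x)
        num += (y - x) * (1.0 - x)
        den += (1.0 - x) ** 2
    return num / den


procs = [2, 4, 8, 16, 32, 64]

# Speedups generated from Amdahl's Law alone, with f = 0.02.
clean = [1.0 / (0.02 + 0.98 / n) for n in procs]
print(fit_serial_fraction(procs, clean))   # recovers about 0.02

# Same program plus a hypothetical communication overhead that grows with N.
noisy = [1.0 / (0.02 + 0.98 / n + 0.0005 * n) for n in procs]
print(fit_serial_fraction(procs, noisy))   # larger than 0.02
```

The second fit returns a larger "serial fraction" even though the serial work is unchanged, which is the confusion the text warns about.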
Impact on Parallel Computing
Even at the time of the  AFIPS conference, there was
already enough investment in serial computing software
that the prospect of rewriting all of it to use parallelism was quite daunting. Amdahl’s Law served as a
strong defense against having to rewrite the code to
exploit multiple processors. Efforts to create experimental parallel computer systems proceeded in the decades
that followed, especially in academic or research laboratory settings, but the major high-performance computer companies like IBM, Digital, Cray, and HP did
not create products with a high degree of parallelism,
and cited Amdahl’s Law as the reason. It was not until
the  solution to the Karp Challenge, which made
clear that Amdahl’s Law need not limit the utility of
highly parallel computing, that vendors began developing commercial parallel products in earnest.
Implicit Assumptions, and Extensions
to the Law
Fixed Problem Size
The assumption that the computing community overlooked for  years was that the problem size is fixed.
If one applies many processors (or other performance
enhancement) to a workload, it is not necessarily true
that users will keep the workload fixed and accept
shorter times for the execution of the task. It is common to increase the workload on the faster machine to
the point where it takes the same amount of time as
before.
In comparing two things, the scientific approach is
to control all but one variable. The natural choice when
comparing the performance of two computations is to
run the same problem in both situations and look for a
change in the execution time. If speed is w/t where w
is work and t is time, then speedup for two situations
is the ratio of the speeds: $(w_1/t_1)/(w_2/t_2)$. By keeping
the work the same, that is, $w_1 = w_2$, the speedup simplifies to $t_2/t_1$, and this avoids the difficult problem of
defining “work” for a computing task. Amdahl’s Law
uses this “fixed-size speedup” assumption. While the
assumption is reasonable for small values of speedup, it
is less reasonable when the speeds differ by many orders
of magnitude. Since the execution time of an application
tends to match human patience (which differs according to application), people might scale the problem such
that the time is constant and thus is the controlled variable. That is, $t_1 = t_2$, and the speedup simplifies to $w_1/w_2$.
See Gustafson’s Law.
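The two choices of controlled variable can be compared side by side. The sketch below assumes a simple model that this section does not spell out: the parallel part scales perfectly, and the serial fraction is measured on the one-processor run for the fixed-size case and on the N-processor run for the fixed-time case. The fixed-time expression s + (1 − s)N is the usual statement of Gustafson's Law. The function names and the value 0.05 are illustrative.

```python
def fixed_size_speedup(f, n):
    """Same work, ratio of times. f = serial fraction of the one-processor run."""
    return 1.0 / (f + (1.0 - f) / n)


def fixed_time_speedup(s, n):
    """Same time, ratio of work. s = serial fraction of the n-processor run."""
    return s + (1.0 - s) * n


for n in (10, 100, 1000):
    print(n,
          round(fixed_size_speedup(0.05, n), 1),
          round(fixed_time_speedup(0.05, n), 1))
```

Because the two fractions are measured on different runs, equal numeric values of f and s do not describe the same program. The comparison only shows how differently the two measures grow with N.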
System Cost: Linear with the Number
of Processors?
Another implicit assumption is that system cost is linear
in the number of processors, so anything less than perfect speedup implies that cost-effectiveness goes down
every time an approach uses more processors. At the
time Amdahl made his argument in , this was a
reasonable assumption: a system with two IBM processors would probably have cost almost exactly twice
that of an individual IBM processor. Amdahl’s paper
even states that “. . . by putting two processors side by
side with shared memory, one would find approximately
. times as much hardware,” where the additional .
hardware is for sharing the memory with a crossbar
switch. He further estimated that memory conflicts
would add so much time that net price performance of
a dual-processor system would be . that of a single
processor.
His cost assumptions are not valid for present-era
system designs. As Moore’s Law has decreased the cost
of transistors to the point where a single silicon chip
holds many processors in the same package that formerly held a single processor, it is apparent that system
costs are far below linear in the number of processors.
Processors share software and other facilities that can
cost much more than individual processor cores. Thus,
while Amdahl’s algebraic formula is true, the implications it provided in  for optimal system design have
changed. For example, it might be that increasing the
number of processors by a factor of  only provides a
net speedup of .X for the workload, but if the quadrupling of processors only increases system cost by
.X, the cost-effectiveness of the system increases with
parallelism. Put another way, the point of diminishing
returns for adding processors in  might have been a
single processor. With current economics, it might be
a very large number of processors, depending on the
application workload.
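The cost argument reduces to one ratio: speedup divided by relative system cost. The sketch below uses hypothetical figures (a speedup of 2 from four times as many processors, with cost ratios of 4 and 1.25) that are chosen for illustration and are not taken from the text.

```python
def cost_effectiveness_ratio(speedup, cost_ratio):
    """Change in performance per unit cost; above 1 the larger system is the better buy."""
    return speedup / cost_ratio


# Assumption of the 1960s: system cost grows in step with processor count.
print(cost_effectiveness_ratio(speedup=2.0, cost_ratio=4.0))    # 0.5

# Multicore-era assumption: four cores share one package, memory, and software.
print(cost_effectiveness_ratio(speedup=2.0, cost_ratio=1.25))   # 1.6
```

With linear cost, any speedup short of perfect lowers cost-effectiveness. With sublinear cost, the same imperfect speedup can raise it.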

All-or-None Parallelism
In the Amdahl model, there are only two levels of concurrency for the use of N processors: N-fold parallel or
serial. A more realistic and detailed model recognizes
that the amount of exploitable parallelism might vary
from one to N processors through the execution of a
program []. The speedup is then
Speedup=/ ( f + f / + f / + . . . + fN /N) ,
where fj is the fraction of the program that can be run
on j processors in parallel, and f + f + . . . + fN = .
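A short sketch can evaluate this multilevel model, in which the speedup is the reciprocal of the sum of each fraction divided by its degree of parallelism. The function name and the sample fractions are illustrative assumptions.

```python
def multilevel_speedup(fractions):
    """fractions[j-1] is the share of the work that can use exactly j processors."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return 1.0 / sum(f / j for j, f in enumerate(fractions, start=1))


# Amdahl's all-or-none case on 4 processors: 10% serial, 90% fully parallel.
print(multilevel_speedup([0.1, 0.0, 0.0, 0.9]))   # about 3.08

# Same 10% serial share, but the rest spread over intermediate degrees.
print(multilevel_speedup([0.1, 0.2, 0.3, 0.4]))   # about 2.5
```

The all-or-none case reproduces Amdahl's formula. Spreading the parallel work over lower degrees of parallelism reduces the speedup further.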
Sharing of Resources
Because the parallelism model was that of multiple
processors controlled by a single instruction stream,
Amdahl formulated his argument for the parallelism of
a single job, not the parallelism of multiple users running multiple jobs. For parallel computers with multiple
instruction streams, if the duration of a serial section is
longer than the time it takes to swap in another job in
the queue, there is no reason that N − 1 of the N processors need to go idle as long as there are users waiting
for the system. As the previous section mentions, the
degree of parallelism can vary throughout a program.
A sophisticated queuing system can allocate processing
resources to other users accordingly, much as systems
partition memory dynamically for different jobs.
Communication Cost
Because Amdahl formulated his argument in , he
treated the cost of communication of data between
processors as negligible. At that time, computer arithmetic took so much longer than data motion that the
data motion was overlapped or insignificant. Arithmetic speed has improved much more than the speed of
interprocessor communication, so many have improved
Amdahl’s Law as a performance model by incorporating
communication terms in the formula.
Some have suggested that communication costs are
part of the serial fraction of Amdahl’s Law, but this
is a misconception. Interprocessor communication can
be serial or parallel just as the computation can. For
example, a communication algorithm may ask one processor to send data to all others in sequence (completely
serial) or it may ask each processor j to send data to
processor j − 1, except that processor 1 sends to processor N, forming a communication ring (completely
parallel).
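The contrast between the two communication patterns can be made concrete by counting rounds, where a round is a set of messages that can all be in flight at once. The sketch assumes processors numbered 1 through N and a machine in which every processor can send one message per round. The function names are illustrative.

```python
def one_to_all_in_sequence(n):
    """Processor 1 sends to each other processor in turn: one message per round."""
    return [[(1, dst)] for dst in range(2, n + 1)]


def ring_shift(n):
    """Each processor j sends to j - 1, and processor 1 sends to n: a single round."""
    return [[(j, j - 1 if j > 1 else n) for j in range(1, n + 1)]]


n = 8
print(len(one_to_all_in_sequence(n)))   # 7 rounds, one sender active at a time
print(len(ring_shift(n)))               # 1 round, all 8 processors sending
```

The amount of communication alone does not determine its cost. What matters is how much of it is forced into sequence.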
Analogies
Without mentioning Amdahl’s Law by name, others
have referred, often humorously, to the limitations of
parallel processing. Fred Brooks, in The Mythical Man-Month (), pointed out the futility of trying to complete software projects in less time by adding more
people to the project []. “Brooks’ Law” is his observation that adding engineers to a project can actually make
the project take longer. Brooks quoted the well-known
quip, “Nine women can’t have a baby in one month,” and
may have been the first to apply that quip to computer
technology.
From  to , Ambrose Bierce wrote a collection of cynical definitions called The Devil’s Dictionary.
It includes the following definition:
▸ Logic, n. The art of thinking and reasoning in strict
accordance with the limitations and incapacities of
human misunderstanding. The basis of logic is the
syllogism, consisting of a major and a minor premise
and a conclusion – thus:
Major Premise: Sixty men can do a piece of work 60 times
as quickly as one man.
Minor Premise: One man can dig a post-hole in 60 s;
therefore –
Conclusion: Sixty men can dig a post-hole in 1 s.
This may be called the syllogism arithmetical, in which,
by combining logic and mathematics, we obtain a double certainty, and are twice blessed.
In showing the absurdity of using 60 processors
(men) for an inherently serial task, he predated Amdahl
by almost  years.
Transportation provides accessible analogies for
Amdahl’s Law. For example, if one takes a trip at
 miles per hour and immediately turns around, how
fast does one have to go to average  miles per hour?
This is a trick question that many people incorrectly
answer, “ miles per hour.” To average  miles per
hour, one would have to travel back at infinite speed
and instantly. For a fixed travel distance, just as for
a fixed workload, speeds do not combine as a simple
arithmetic average. This is contrary to our intuition,
which may be the reason some consider Amdahl’s Law
such a profound observation.
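The travel analogy can be worked through with illustrative numbers. The sketch assumes an outbound speed of 30 miles per hour, a value chosen here for the example, and computes the round-trip average for increasingly fast return legs.

```python
def round_trip_average(out_speed, back_speed):
    """Average speed over equal distances out and back (the harmonic mean)."""
    return 2.0 / (1.0 / out_speed + 1.0 / back_speed)


for back in (30, 90, 1000, 1e9):
    print(back, round_trip_average(30, back))
```

Returning at 90 miles per hour yields an average of only 45, and no finite return speed brings the average up to twice the outbound speed. The slow leg plays the role of the serial fraction.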
Related Entries
Brent’s Theorem
Gustafson’s Law
Metrics
Pipelining
Perspective
Amdahl’s  argument became a justification for the
avoidance of parallel computing for over  years. It
was appropriate for many of the early parallel computer designs that shared an instruction stream or the
memory fabric or other resources. By the s, hardware approaches emerged that looked more like collections of autonomous computers that did not share
anything yet were capable of cooperating on a single
task. It was not until Gustafson published his alternative formulation for parallel speedup in , along
with several examples of actual ,-fold speedups
from a -processor system, that the validity of the
parallel-computing approach became widely accepted
outside the academic community. Amdahl’s Law still
is the best rule-of-thumb when the goal of the performance improvement is to reduce execution time for a
fixed task, whereas Gustafson’s Law is the best rule-of-thumb when the goal is to increase the problem size for
a fixed amount of time. Amdahl’s and Gustafson’s Laws
do not contradict one another, nor is either a corollary or equivalent of the other. They are for different
assumptions and different situations.
Gene Amdahl, in a  personal interview, stated
that he never intended his argument to be applied to
the case where each processor had its own operating
system and data management, and would have been
far more open to the idea of parallel computing as
a viable approach had it been posed that way. He is
now a strong advocate of parallel architectures and sits
on the technical advisory board of Massively Parallel
Technologies, Inc.
With the commercial introduction of single-image
systems with over , processors such as Blue Gene,
and clusters with similar numbers of server processor cores, it becomes increasingly unrealistic to use a
fixed-size problem to compare the performance of a
single processor with that of the entire system. Thus,
scaled speedup (Gustafson’s Law) applies to measure
performance of the largest systems, with Amdahl’s Law
applied mainly where the number of processors changes
over a narrow range.
Bibliographic Notes and Further
Reading
Amdahl’s original  paper is short and readily available online, but as stated in the Discussion section, it
has neither the formula nor any direct analysis. An
objective analysis of the Law and its implications can
be found in [] or []. The series of editions of the
textbook on computer architecture by Hennessy and
Patterson [] began in  with a strong alignment to
Amdahl’s  debate position against parallel computing, and has evolved toward a less strident stance in more recent
editions.
For a rigorous mathematical treatment of Amdahl’s
Law that covers many of the extensions and refinements mentioned in the Discussion section, see [].
One of the first papers to show how fixed-sized speedup
measurement is prone to superlinear speedup effects
is [].
A classic  work on speedup and efficiency is
“Speedup versus efficiency in parallel systems” by D. L.
Eager, J. Zahorjan, and E. D. Lazowska, in IEEE Transactions, March , –. DOI=./..
Bibliography
. Amdahl GM () Validity of the single-processor approach
to achieve large scale computing capabilities. AFIPS Joint
Spring Conference Proceedings  (Atlantic City, NJ, Apr. –
), AFIPS Press, Reston VA, pp –, At http://wwwinst.eecs.berkeley.edu/∼n/paper/Amdahl.pdf
. Bell G (interviewed) (July ) An interview with Gordon Bell.
IEEE Software, ():–
. Brooks FP () The mythical man-month: Essays on software
engineering. Addison-Wesley, Reading. ISBN ---
. Gustafson JL (April ) Fixed time, tiered memory, and superlinear speedup. Proceedings of the th distributed memory conference, vol , pp –. ISBN: ---
. Gustafson JL, Montry GR, Benner RE (July ) Development
of parallel methods for a -processor hypercube. SIAM J Sci
Statist Comput, ():–
. Gustafson (May ) Reevaluating Amdahl’s law. Commun
ACM, ():–. DOI=./.

. Hennessy JL, Patterson DA (, , , ) Computer
architecture: A quantitative approach. Elsevier Inc.
. Hwang K, Briggs F () Computer architecture and parallel
processing, McGraw-Hill, New York. ISBN: 
. Karp A () http://www.netlib.org/benchmark/karp-challenge
. Lewis TG, El-Rewini H () Introduction to parallel computing, Prentice Hall. ISBN: ---, –
. Seitz CL () Experiments with VLSI ensemble machines. Journal of VLSI and computer systems, vol . No. , pp –
. Slotnick D () Unconventional systems. AFIPS joint spring
conference proceedings  (Atlantic City, NJ, Apr. –). AFIPS
Press, Reston VA, pp –
. Sun X-H, Ni L () Scalable problems and memory-bounded
speedup. Journal of parallel and distributed computing, vol .
No , pp –
. Ware WH () The Ultimate Computer. IEEE spectrum, vol .
No. , pp –
AMG
Algebraic Multigrid
Analytics, Massive-Scale
Massive-Scale Analytics
Anomaly Detection
Race Detection Techniques
Intel Parallel Inspector
Anton, A Special-Purpose
Molecular Simulation Machine
Ron O. Dror, Cliff Young, David E. Shaw
D. E. Shaw Research, New York, NY, USA
Columbia University, New York, NY, USA
Definition
Anton is a special-purpose supercomputer architecture designed by D. E. Shaw Research to dramatically
accelerate molecular dynamics (MD) simulations of
biomolecular systems. Anton performs massively parallel computation on a set of identical MD-specific
ASICs that interact in a tightly coupled manner using a
specialized high-speed communication network. Anton
enabled, for the first time, the simulation of proteins at
an atomic level of detail for periods on the order of a
millisecond – about two orders of magnitude beyond
the previous state of the art – allowing the observation
of important biochemical phenomena that were previously inaccessible to both computational and experimental study.
Discussion
Introduction
Classical molecular dynamics (MD) simulations give
scientists the ability to trace the motions of biological molecules at an atomic level of detail. Although
MD simulations have helped yield deep insights into
the molecular mechanisms of biological processes in
a way that could not have been achieved using only
laboratory experiments [, ], such simulations have
historically been limited by the speed at which they can
be performed on conventional computer hardware.
A particular challenge has been the simulation
of functionally important biological events that often
occur on timescales ranging from tens of microseconds
to a millisecond, including the “folding” of proteins into
their native three-dimensional shapes, the structural
changes that underlie protein function, and the interactions between two proteins or between a protein and
a candidate drug molecule. Such long-timescale simulations pose a much greater challenge than simulations
of larger chemical systems at more moderate timescales:
the number of processors that can be used effectively in
parallel scales with system size but not with simulation
length, because of the sequential dependencies within a
simulation.
Anton, a specialized, massively parallel supercomputer developed by D. E. Shaw Research, accelerated
such calculations by several orders of magnitude compared with the previous state of the art, enabling the
simulation of biological processes on timescales that
might otherwise not have been accessible for many
years. The first -node Anton machine (Fig. ), which
became operational in late , completed an all-atom
protein simulation spanning more than a millisecond
of biological time in  []. By way of comparison, the longest such simulation previously reported in
the literature, which was performed on general-purpose
computer hardware using the MD code NAMD, was 
microseconds (μs) in length []; at the time, few other
published simulations had reached  μs (Table ).
Anton, A Special-Purpose Molecular Simulation Machine. Fig.  A -node Anton machine
Anton, A Special-Purpose Molecular Simulation Machine. Table  The longest (to our knowledge) all-atom MD simulations of proteins in explicitly represented water published through the end of 
Length (μs) | Protein | Hardware | Software | Citation
, | BPTI | Anton | [native] | []
 | gpW | Anton | [native] | []
 | WW domain | x cluster | NAMD | [, ]
 | Villin HP- | x cluster | NAMD | []
 | Villin HP- | x cluster | GROMACS | []
 | Rhodopsin | Blue Gene/L | Blue Matter | [, ]
 | β AR | x cluster | Desmond | []
An Anton machine comprises a set of identical
processing nodes, each containing a specialized MD
computation engine implemented as a single ASIC
(Fig. ). These processing nodes are connected through
a specialized high-performance network to form a
three-dimensional torus. Anton was designed to use
both novel parallel algorithms and special-purpose
logic to dramatically accelerate those calculations that
dominate the time required for a typical MD simulation []. The remainder of the simulation algorithm
is executed by a programmable portion of each chip
that achieves a substantial degree of parallelism while
preserving the flexibility necessary to accommodate
anticipated advances in physical models and simulation
methods.
Anton was created to attack a somewhat different problem than the ones addressed by several other
projects that have deployed significant computational
resources for MD simulations. The Folding@Home
project [], for example, uses hundreds of thousands of
PCs (made available over the Internet by volunteers) to
simulate a very large number of separate molecular trajectories, each of which is limited to the timescale accessible on a single PC. While a great deal can be learned
from a large number of independent MD trajectories,
many other important problems require the examination of a single, very long trajectory – the principal
task for which Anton was designed. Other projects
have produced special-purpose hardware (e.g., FASTRUN [], MDGRAPE [], and MD Engine []) to
accelerate the most computationally expensive elements
of an MD simulation. Such hardware reduces the effective cost of simulating a given period of biological time,
but Amdahl’s law and communication bottlenecks prevent the efficient use of enough such chips in parallel
to extend individual simulations beyond timescales of a
few microseconds.
Anton, A Special-Purpose Molecular Simulation Machine. Fig.  Block diagram of a single Anton ASIC, comprising the specialized high-throughput interaction subsystem, the more general-purpose flexible subsystem, six inter-chip torus links, an intra-chip communication ring, and two memory controllers
Anton was named after Anton van Leeuwenhoek,
often referred to as the “father of microscopy.” In
the seventeenth century, van Leeuwenhoek built highprecision optical instruments that allowed him to visualize bacteria and other microorganisms, as well as
blood cells and spermatozoa, revealing for the first
time an entirely new biological world. In pursuit of an
analogous goal, Anton (the machine) was designed as
a sort of “computational microscope,” providing contemporary biological and biomedical researchers with
a tool for understanding organisms and their diseases
at previously inaccessible spatial and temporal scales.
Anton has enabled substantial advances in the study
of the processes by which proteins fold, function, and
interact with drugs [, ].
Structure of a Molecular Dynamics
Computation
An MD simulation computes the motion of a collection of atoms – for example, a protein surrounded by
water – over a period of time according to the laws of
classical physics. Time is broken into a series of discrete time steps, each representing a few femtoseconds
of simulated time. For each time step, the simulation
performs a computationally intensive force calculation
for each atom, followed by a less expensive integration
operation that advances the positions and velocities of
the atoms.
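The structure of a time step, an expensive force calculation followed by a cheaper integration, can be sketched with a toy system. The example below assumes two particles on a line joined by a harmonic spring and uses the velocity Verlet integrator, a common choice in MD codes. Neither the force field nor the integrator is claimed to be the one Anton uses, and all names and parameter values are illustrative.

```python
def forces(positions, k=1.0, rest=1.0):
    """Toy force field: one harmonic spring between two particles on a line."""
    stretch = (positions[1] - positions[0]) - rest
    return [k * stretch, -k * stretch]


def step(positions, velocities, masses, dt, f_old):
    """One velocity Verlet time step: half-kick, drift, new forces, half-kick."""
    n = len(positions)
    half_v = [velocities[i] + 0.5 * dt * f_old[i] / masses[i] for i in range(n)]
    new_x = [positions[i] + dt * half_v[i] for i in range(n)]
    f_new = forces(new_x)   # the dominant cost in a real MD simulation
    new_v = [half_v[i] + 0.5 * dt * f_new[i] / masses[i] for i in range(n)]
    return new_x, new_v, f_new


x, v, m = [0.0, 1.5], [0.0, 0.0], [1.0, 1.0]
f = forces(x)
for _ in range(1000):
    x, v, f = step(x, v, m, 0.01, f)
print(x, v)
```

Each step depends on the positions produced by the previous one, which is the sequential dependency that prevents parallelism from scaling with simulation length.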
Forces are evaluated based on a model known as
a force field. Anton supports a variety of commonly
used biomolecular force fields, which express the total
force on an atom as a sum of three types of component
forces: () bonded forces, which involve interactions
between small groups of atoms connected by one or
more covalent bonds; () van der Waals forces, which
include interactions between all pairs of atoms in the
system, but which fall off quickly with distance and
are typically only evaluated for nearby pairs of atoms;
and () electrostatic forces, which include interactions
between all pairs of charged atoms, and fall off slowly
with distance.
Electrostatic forces are typically computed by one
of several fast, approximate methods that account for
long-range effects without requiring the explicit interaction of all pairs of atoms. Anton, like most MD codes for
general-purpose hardware, divides electrostatic interactions into two contributions. The first decays rapidly
with distance, and is thus computed directly for all
atom pairs separated by less than some cutoff radius.
This contribution and the van der Waals interactions
together constitute the range-limited interactions. The
second contribution (long-range interactions) decays
more slowly, but can be expressed as a convolution
and efficiently computed using fast Fourier transforms
(FFTs) []. This process requires the mapping of
charges from atoms to nearby mesh points before the
FFT computations (charge spreading), and the calculation of forces on atoms based on values associated with
nearby mesh points after the FFT computations (force
interpolation).
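The division of the electrostatic interaction into a rapidly decaying part and a smooth part can be illustrated numerically. The text does not name the splitting function. The sketch assumes the Ewald-style split based on the error function, which many MD codes use, and a hypothetical splitting parameter beta.

```python
import math


def split_coulomb(r, beta):
    """Split 1/r into a short-range part and a smooth long-range part."""
    short_range = math.erfc(beta * r) / r   # decays rapidly; cut off at some radius
    long_range = math.erf(beta * r) / r     # smooth; suited to mesh-based FFT evaluation
    return short_range, long_range


beta = 0.3   # hypothetical splitting parameter, in inverse angstroms
for r in (1.0, 5.0, 10.0, 15.0):
    s, l = split_coulomb(r, beta)
    print(r, s, l, s + l - 1.0 / r)   # last column: deviation from 1/r
```

The two parts always sum to 1/r, while the short-range part becomes negligible within a few multiples of 1/beta. That decay is what justifies evaluating it only for atom pairs inside a cutoff radius.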
The Role of Specialization in Anton
During the five years spent designing and building
Anton, the number of transistors on a chip increased
by roughly tenfold, as predicted by Moore’s law. Anton,
on the other hand, enabled simulations approximately
, times faster than was possible at the beginning
of that period, providing access to biologically critical millisecond timescales significantly sooner than
would have been possible on commodity hardware.
Achieving this performance required reengineering
how MD is done, simultaneously considering changes
to algorithms, software, and, especially, hardware.
Hardware specialization allows Anton to redeploy
resources in ways that benefit MD. Compared to other
high-performance computing applications, MD uses
much computation and communication but surprisingly little memory. Anton exploits this property by
using only SRAMs and small first-level caches on the
ASIC, constraining all code and data to fit on-chip in
normal operation (for chemical systems that exceed
SRAM size, Anton pages state to each node’s local
DRAM). The area that would have been spent on large
caches and aggressive memory hierarchies is instead
dedicated to computation and communication. Each
Anton ASIC contains dedicated, specialized hardware
datapaths to evaluate the range-limited interactions
and perform charge spreading and force interpolation, packing much more computational logic on a
chip than is typical of general-purpose architectures.
Each ASIC also contains programmable processors with
specialized instruction set architectures tailored to the
remainder of the MD computation. Anton’s specialized
network fabric not only delivers bandwidth and latency
two orders of magnitude better than Gigabit Ethernet,
but also sustains a large fraction of peak network bandwidth when delivering small packets and provides hardware support for common MD communication patterns
such as multicast [].
The most computationally intensive parts of an MD
simulation – in particular, the electrostatic interactions – are also the most well established and unlikely to
change as force field models evolve, making these calculations particularly amenable to hardware acceleration.
Dramatically speeding up MD, however, requires that
one accelerates more than just an “inner loop.” Calculation of electrostatic and van der Waals forces accounts
for roughly % of the computational time for a representative MD simulation on a single general-purpose
processor. Amdahl’s law states that no matter how much
one accelerates this calculation, the remaining computations, left unaccelerated, would limit the maximum
speedup to a factor of . Hence, Anton dedicates a
significant fraction of silicon area to accelerating other
tasks, such as bonded force computation and integration, incorporating programmability as appropriate to
accommodate a variety of force fields and simulation
features.
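The limit that motivates this design choice follows from Amdahl's law in its acceleration form: if a fraction p of the run time is made k times faster, the overall speedup is 1 / ((1 − p) + p/k), which can never exceed 1 / (1 − p). The sketch below uses a hypothetical p of 0.9, and the function name is illustrative.

```python
def accelerated_speedup(p, k):
    """Overall speedup when a fraction p of the run time is made k times faster."""
    return 1.0 / ((1.0 - p) + p / k)


p = 0.9   # hypothetical share of time spent in the accelerated force calculations
for k in (10, 100, 1000):
    print(k, round(accelerated_speedup(p, k), 2))
print("limit:", 1.0 / (1.0 - p))
```

Beyond a certain point, making the accelerated part faster yields almost nothing. Further gains have to come from shrinking the unaccelerated remainder.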

System Architecture
The building block of an Anton system is a node, which
includes an ASIC with two major computational subsystems (Fig. ). The first is the high-throughput interaction subsystem (HTIS), designed for computing massive
numbers of range-limited pairwise interactions of various forms. The second is the flexible subsystem, which is
composed of programmable cores used for the remaining, less structured part of the MD calculation. The
Anton ASIC also contains a pair of high-bandwidth
DRAM controllers (augmented with the ability to accumulate forces and other quantities), six high-speed
(. Gbit/s per direction) channels that provide communication to neighboring ASICs, and a host interface
that communicates with an external host computer for
input, output, and general control of the Anton system. The ASICs are implemented in -nm technology and clocked at  MHz, with the exception of the
arithmetically intensive portion of the HTIS, which is
clocked at  MHz.
An Anton machine may incorporate between  and
, nodes, each of which is responsible for updating
the position of particles within a distinct region of space
during a simulation. One -node machine, one
-node machine, ten -node machines, and several
smaller machines were operational as of June . For
a given machine size, the nodes are connected to form a
three-dimensional torus (i.e., a three-dimensional mesh
that wraps around in each dimension, which maps naturally to the periodic boundary conditions used during
most MD simulations). Four nodes are incorporated in
each node board, and  node boards fit in a -inch
rack; larger machines are constructed by linking racks
together.
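The wraparound connectivity of a three-dimensional torus can be expressed in a few lines. The sketch assumes nodes addressed by integer coordinates (x, y, z) and an 8 × 8 × 8 arrangement chosen for illustration. The function name is hypothetical.

```python
def torus_neighbors(node, dims):
    """Six neighbors of node (x, y, z) in a torus of size dims = (nx, ny, nz)."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]   # wrap around the edge
            neighbors.append(tuple(coord))
    return neighbors


print(torus_neighbors((0, 0, 7), (8, 8, 8)))
# [(7, 0, 7), (1, 0, 7), (0, 7, 7), (0, 1, 7), (0, 0, 6), (0, 0, 0)]
```

A node on the face of the machine has the same six links as an interior node, just as an atom near the edge of a periodic simulation box interacts with atoms on the opposite side.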
Almost all computation on Anton uses fixed-point
arithmetic, which can be thought of as operating on
twos-complement numbers in the range [−, ). In practice, most of the quantities handled in an MD simulation fall within well-characterized, bounded ranges
(e.g., bonds are between  and  Å in length), so there
is no need for software or hardware to dynamically
normalize fixed-point values. Use of fixed-point arithmetic reduces die area requirements and facilitates the
achievement of certain desirable numerical properties:
for example, repeated Anton simulations will produce
bitwise identical results even when performed on different numbers of nodes, and molecular trajectories
produced by Anton in certain modes of operation are
exactly reversible (a physical property guaranteed by
Newton’s laws but rarely achieved in numerical simulation) [].
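The reproducibility property follows from the fact that integer addition is associative while floating-point addition is not. The sketch below assumes a hypothetical 32-bit format in which a value is an integer divided by 2^31, and compares sums taken in two different orders. Python integers stand in for the hardware accumulators.

```python
import random

SCALE = 1 << 31   # hypothetical format: value = integer / 2**31, range [-1, 1)


def to_fixed(x):
    """Quantize a real number in [-1, 1) to a fixed-point integer."""
    return int(round(x * SCALE))


random.seed(0)
values = [random.uniform(-1e-3, 1e-3) for _ in range(10000)]
shuffled = values[:]
random.shuffle(shuffled)

# Floating-point sums can depend on the order of accumulation.
print(sum(values) == sum(shuffled))

# Fixed-point sums cannot, so any partitioning across nodes agrees bit for bit.
print(sum(map(to_fixed, values)) == sum(map(to_fixed, shuffled)))   # True
```

When forces are accumulated in fixed point, the result does not depend on how the contributions are grouped or on the order in which they arrive. That independence is a precondition for bitwise identical results on different numbers of nodes.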
The High-Throughput Interaction Subsystem
(HTIS)
The HTIS is the largest computational accelerator in
Anton, handling the range-limited interactions, charge
spreading, and force interpolation. These tasks account
for a substantial majority of the computation involved
in an MD simulation and require several hundred
microseconds per time step on general-purpose supercomputers. The HTIS accelerates these computations
such that they require just a few microseconds on
Anton, using an array of  hardwired pairwise point
interaction modules (PPIMs) (Fig. ). The heart of each
PPIM is a force calculation pipeline that computes the
force between a pair of particles; this is a -stage
pipeline (at  MHz) of adders, multipliers, function
evaluation units, and other specialized datapath elements. The functional units of this pipeline use customized numerical precisions: bit width varies across
the different stages to minimize die area while ensuring
an accurate -bit result. The HTIS keeps these pipelines
operating at high utilization through careful choreography of data flow both between chips and within a
chip. A single HTIS can perform , interactions
per microsecond; a modern x core, by contrast, can
perform about  interactions per microsecond [].
Despite its name, the HTIS also addresses latency: a node Anton performs the entire range-limited interaction computation of a ,-atom MD time step in
just  μs, over two orders of magnitude faster than any
contemporaneous general-purpose computer.
The computation is parallelized across chips using
a novel technique, the NT method [], which requires
less communication bandwidth than traditional methods for parallelizing range-limited interactions. Figure 
shows the spatial volume from which particle data must
be imported into each node using the NT method
compared with the import volume required by the traditional “half-shell” approach. As the level of parallelism increases, the import volume of the NT method
becomes progressively smaller in both absolute and
asymptotic terms than that of the traditional method.
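The trend described above can be explored numerically with a small Python sketch. The geometry used here is an assumption made for illustration and is not spelled out in this entry: a cubic home box of side `b`, a cutoff radius `r`, a half-shell taken as half of the region within `r` of the box, a tower extending a distance `r` above and below the box, and a plate taken as half of the slab of thickness `b` lying within `r` of the box. The function names are hypothetical.

```python
import math


def half_shell_import_volume(b, r):
    """Half of the region within distance r of a cubic box of side b (box excluded)."""
    faces = 6 * b**2 * r
    edges = 3 * math.pi * b * r**2
    corners = (4.0 / 3.0) * math.pi * r**3
    return (faces + edges + corners) / 2


def nt_import_volume(b, r):
    """Imported part of the tower plus imported part of the half-plate."""
    tower = 2 * b**2 * r                          # columns above and below the box
    plate = b * (2 * b * r + math.pi * r**2 / 2)  # half-slab of thickness b
    return tower + plate


R = 1.0  # cutoff radius in arbitrary units
for b in (4.0, 2.0, 1.0, 0.5, 0.25):  # smaller boxes correspond to more nodes
    hs = half_shell_import_volume(b, R)
    nt = nt_import_volume(b, R)
    print(f"b={b:5.2f}  half-shell={hs:8.3f}  NT={nt:8.3f}  NT/half-shell={nt / hs:.3f}")
```

Under these assumptions the half-shell volume tends to a constant proportional to r³ as the box shrinks, while the NT volume tends to zero, so the ratio falls steadily as parallelism increases. For boxes much larger than the cutoff, the same formulas give the two methods comparable import volumes.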
[Figure: block diagram of the HTIS (PPIM array, HTIS communication ring interfaces, particle memory, interaction control block) with a detail view of one PPIM datapath]
Anton, A Special-Purpose Molecular Simulation Machine. Fig.  High-throughput interaction subsystem (HTIS) and
detail of a single pairwise point interaction module (PPIM). In addition to the  PPIMs, the HTIS includes two
communication ring interfaces, a buffer area for particle data, and an embedded control core called the interaction control
block. The U-shaped arrows show the flow of data through the particle distribution and force reduction networks. Each
PPIM includes eight matchmaking units (shown stacked), a number of queues, and a force calculation pipeline that
computes pairwise interactions
The NT method requires that each chip compute
interactions between particles in one spatial region (the
tower) and particles in another region (the plate). The
HTIS uses a streaming approach to bring together all
pairs of particles from the two sets. Figure  depicts
the internal structure of the HTIS, which is dominated
by the two halves of the PPIM array. The HTIS loads
tower particles into the PPIM array and streams the
plate particles through the array, past the tower particles. Each plate particle accumulates the force from its
interactions with tower particles as it streams through
the array. While the plate particles are streaming by,
each tower particle also accumulates the force from its
interactions with plate particles. After the plate particles
have been processed, the accumulated tower forces are
streamed out.
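The streaming scheme can be mimicked in software as a pair of nested loops. The sketch below is only a functional model in Python: the one-dimensional toy force `pair_force` and the function names are hypothetical, and applying an equal and opposite force to the tower particle is an assumption based on Newton's third law rather than a description of the hardware.

```python
def pair_force(xi, xj):
    """Toy 1D force on particle i due to particle j (stand-in for the real pipeline)."""
    return xj - xi


def stream_plate_past_tower(tower, plate):
    """Return (tower_forces, plate_forces) after streaming every plate particle."""
    tower_forces = [0.0] * len(tower)      # tower particles stay resident in the array
    plate_forces = []
    for p in plate:                        # each plate particle streams through in turn
        f_p = 0.0
        for k, t in enumerate(tower):
            f = pair_force(p, t)           # force on the plate particle
            f_p += f
            tower_forces[k] -= f           # equal and opposite force on the tower particle
        plate_forces.append(f_p)           # the particle leaves with its accumulated force
    return tower_forces, plate_forces      # tower forces are read out at the end


tower_f, plate_f = stream_plate_past_tower([0.0, 1.0], [2.0])
print(tower_f, plate_f)  # [2.0, 1.0] [-3.0]
```

In this model the forces over all particles sum to zero, which is a convenient sanity check for any implementation that accumulates both sides of each interaction.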
Not all plate particles need to interact with all tower
particles; some pairs, for example, exceed the cutoff distance. To improve the utilization of the force calculation
pipelines, each PPIM includes eight dedicated match
units that collectively check each arriving plate particle
against the tower particles stored in the PPIM to determine which pairs may need to interact, using a low-precision distance test. Each particle pair that passes
this test and satisfies certain other criteria proceeds
to the PPIM’s force calculation pipeline. As long as at
least one-eighth of the pairs checked by the match units
proceed, the force calculation pipeline approaches full
utilization.
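The one-eighth threshold follows from a simple throughput argument, sketched below in Python. The model assumes that each match unit checks one pair per cycle and that the force pipeline accepts at most one pair per cycle; both are simplifications made for illustration. The function `coarse_match` is a hypothetical example of a conservative low-precision test and does not describe the test Anton actually implements.

```python
MATCH_UNITS = 8  # from the text: eight match units per PPIM


def pipeline_utilization(pass_fraction):
    """Modeled fraction of cycles in which the force pipeline receives a pair."""
    return min(1.0, MATCH_UNITS * pass_fraction)


def coarse_match(p, t, cutoff):
    """Hypothetical low-precision test: never rejects a pair within the cutoff."""
    # Rounding each coordinate moves a point by at most 0.5 per axis, so the
    # coarse separation exceeds the true one by at most sqrt(3) in 3D.
    slack = 3 ** 0.5
    d2 = sum((round(a) - round(b)) ** 2 for a, b in zip(p, t))
    return d2 <= (cutoff + slack) ** 2


print(pipeline_utilization(1 / 8))   # 1.0
print(pipeline_utilization(1 / 16))  # 0.5
```

A test of this kind may pass some pairs that lie slightly beyond the cutoff, which is acceptable as long as a later, more precise stage discards them.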
In addition to range-limited interactions, the HTIS
performs charge spreading and force interpolation.
Anton is able to map these tasks to the HTIS by employing a novel method for efficient electrostatics computation, k-space Gaussian Split Ewald (k-GSE) [],
which employs radially symmetric spreading and interpolation functions. This radial symmetry allows the
hardware that computes pairwise nonbonded interactions between pairs of particles to be reused for
interactions between particles and grid points. Both the
k-GSE method and the NT method were developed
while re-examining fundamental MD algorithms during the design phase of Anton.
The Flexible Subsystem
Although the HTIS handles the most computationally
intensive parts of an Anton calculation, the flexible subsystem performs a far wider variety of tasks. It initiates each force computation phase by sending particle
positions to multiple ASICs. It handles those parts of
force computation not performed in the HTIS, including calculation of bonded force terms and the FFT. It
performs all integration tasks, including updating positions and velocities, constraining some particle pairs to
be separated by a fixed distance, modulating temperature and pressure, and migrating atoms between nodes
as the molecular system evolves. Lastly, it performs all
boot, logging, and maintenance activities. The computational details of these tasks vary substantially from one
MD simulation to another, making programmability a
requirement.
The flexible subsystem contains eight geometry cores
(GCs) that were designed at D. E. Shaw Research to perform fast numerical computations, four control cores
(Tensilica LXs) that coordinate the overall data flow
in the Anton system, and four data transfer engines
that allow communication to be hidden behind computation. The GCs perform the bulk of the flexible
subsystem’s computational tasks, and they have been
customized in a number of ways to speed up MD.
Each GC is a dual-issue, statically scheduled SIMD
processor with pipelined multiply accumulate support.
The GC’s basic data type is a vector of four -bit
fixed-point values, and two independent SIMD operations on these vectors issue each cycle. The GC’s
instruction set includes element-wise vector operations
(for example, vector addition), more complicated vector operations such as a dot product (which is used
extensively in calculating bonded forces and applying
distance constraints), and scalar operations that read
and write arbitrary scalar components of the vector registers (essentially accessing the SIMD register file as a
larger scalar register file).
Anton: A Special-Purpose Molecular Simulation Machine.
Fig.  Import regions associated with two parallelization
methods for range-limited pairwise interactions. (a) In a
traditional spatial decomposition method, each node
imports particles from the half-shell region so that they
can interact with particles in the home box. (b) In the NT
method, each node computes interactions between
particles in a tower region and particles in a plate region.
Both of these regions include the home box, but particles
in the remainder of each region must be imported. In both
methods, each pair of particles within the cutoff radius of
one another will have their interaction computed on some
node of the machine
Each of the four control cores manages a corresponding programmable data transfer engine, used to
coordinate communication and synchronization for the
flexible subsystem. In addition to the usual system interface and cache interfaces, each control core also connects to a -KB scratchpad memory, which holds MD
simulation data for background transfer by the data
transfer engine. These engines can be programmed to
write data from the scratchpad to network destinations
and to monitor incoming writes for synchronization
purposes. The background data transfer capability provided by these engines is crucial for performance, as it
enables overlapped communication and computation.
The control cores also handle maintenance tasks, which
tend not to be performance-critical (e.g., checkpointing
every million time steps).
Considerable effort went into keeping the flexible
subsystem from becoming an Amdahl’s law bottleneck.
Careful scheduling allows some of the tasks performed
by the flexible subsystem to be partially overlapped with
or completely hidden behind communication or HTIS
computation []. Adjusting parameters for the algorithm used to evaluate electrostatic forces (including the
cutoff radius and the FFT grid density) shifts computational load from the flexible subsystem to the HTIS [].
A number of mechanisms balance load among the cores
of the flexible subsystem and across flexible subsystems
on different ASICs to minimize Anton’s overall execution time. Even with these and other optimizations, the
flexible subsystem remains on the critical path for up to
one-third of Anton’s overall execution time.
Communication Subsystem
The communication subsystem provides high-speed,
low-latency communication both between ASICs and
among the subsystems within an ASIC []. Within
a chip, two -bit, -MHz communication rings
link all subsystems and the six inter-chip torus ports.
Between chips, each torus link provides .-Gbit/s full-duplex communication with a hop latency around  ns.
The communication subsystem supports efficient multicast, provides flow control, and provides class-based
admission control with rate metering.
In addition to achieving high bandwidth and low
latency, Anton supports fine-grained inter-node communication, delivering half of peak bandwidth on messages of just  bytes. These properties are critical to
delivering high performance in the communication-intensive tasks of an MD simulation. A -node Anton
machine, for example, performs a  ×  × , spatially
distributed D FFT in under  μs, an order of magnitude faster than the contemporary implementations in
the literature [].
Software
Although Anton’s hardware architecture incorporates
substantial flexibility, it was designed to perform variants of a single application: molecular dynamics.
Anton’s software architecture exploits this fact to maximize application performance by eliminating many of
the layers of a traditional software stack. The Anton
ASICs, for example, do not run a traditional operating
system; instead, the control cores of the flexible subsystem run a loader that installs code and data on the
machine, simulates for a time, then unloads the results
of the completed simulation segment.
Programming Anton is complicated by the heterogeneous nature of its computational units. The geometry cores of the flexible subsystem are programmed in
assembly language, while the control cores of the flexible subsystem and a control processor in the HTIS are
programmed in C augmented with intrinsics. Various
fixed-function hardware units, such as the PPIMs, are
programmed by configuring state machines and filling
tables. Anton’s design philosophy emphasized performance over ease of programmability, although increasingly sophisticated compilers and other tools to simplify
programming are under development.
Anton Performance
Figure  shows the performance of a -node Anton
machine on several different chemical systems, varying in size and composition. On the widely used Joint
AMBER-CHARMM benchmark system, which contains , atoms and represents the protein dihydrofolate reductase (DHFR) surrounded by water, Anton
simulates . μs per day of wall-clock time []. The
fastest previously reported simulation of this system was
obtained using a software package, called Desmond,
which was developed within our group for use on commodity clusters []. This Desmond simulation executed
at a rate of  nanoseconds (ns) per day on a node .-GHz Intel Xeon E cluster connected by
a DDR InfiniBand network, using only two of the eight
cores on each node in order to maximize network bandwidth per core []. (Using more nodes, or more cores
per node, leads to a decrease in performance as a result
of an increase in communication requirements.) Due
to considerations related to the efficient utilization of
resources, however, neither Desmond nor other high-performance MD codes for commodity clusters are typically run at such a high level of parallelism, or in a
configuration with most cores on each node idle. In
practice, the performance realized in such cluster-based
simulations is generally limited to speeds on the order of
 ns/day. The previously published simulations listed
in Table , for example, ran at  ns/day or less – over
two orders of magnitude short of the performance we
have demonstrated on Anton.
Anton machines with fewer than  nodes may
prove more cost effective when simulating certain
smaller chemical systems. A -node Anton machine
can be partitioned, for example, into four -node
machines, each of which achieves . μs/day on the
DHFR system – well over % of the . μs/day
achieved when parallelizing the same simulation across
all  nodes. Configurations with more than 
nodes deliver increased performance for larger chemical systems, but do not benefit chemical systems with
only a few thousand atoms, for which the increase
in communication latency outweighs the increase in
parallelism.
[Figure: performance (simulated μs/day, axis from 0 to 20) versus system size (thousands of atoms, axis from 0 to 120) for water-only and protein-in-water systems; labeled systems: gpW, DHFR, aSFP, NADHOx, FtsZ, T7Lig]
Anton: A Special-Purpose Molecular Simulation Machine. Fig.  Performance of a -node Anton machine for
chemical systems of different sizes. All simulations used .-femtosecond time steps with long-range interactions
evaluated at every other time step; additional simulation parameters can be found in Table  of reference []
Prior to Anton’s completion, few reported all-atom
protein simulations had reached  μs, the longest being
a -μs simulation that took over  months on the
NCSA Abe supercomputer [] (Table ). On June ,
, Anton completed the first millisecond-long simulation – more than  times longer than any reported
previously. This ,-μs simulation modeled a protein called bovine pancreatic trypsin inhibitor (BPTI)
(Fig. ), which had been the subject of many previous MD simulations; in fact, the first MD study of
a protein, published in  [], simulated BPTI for
. ps. The Anton simulation, which was over  million times longer, revealed unanticipated behavior that
was not evident at previously accessible timescales,
including transitions among several distinct structural
states [, ].
Future Directions
Commodity computing benefits from economies of
scale but imposes limitations on the extent to which
an MD simulation, and many other high-performance
computing problems, can be accelerated through parallelization. The design of Anton broke with commodity designs, embraced specialized architecture and
co-designed algorithms, and achieved a three-order-of-magnitude speedup over a development period of
approximately five years. Anton has thus given scientists, for the first time, the ability to perform MD
simulations on the order of a millisecond – 
times longer than any atomically detailed simulation previously reported on either general-purpose
or special-purpose hardware. For computer architects, Anton’s level of performance raises the questions
of which other high-performance computing problems might be similarly accelerated, and whether the
economic or scientific benefits of such acceleration
would justify building specialized machines for those
problems.
In its first two years of operation, Anton has
begun to serve as a “computational microscope,” allowing the observation of biomolecular processes that
have been inaccessible to laboratory experiments and
that were previously well beyond the reach of computer simulation. Anton has revealed, for example,
the atomic-level mechanisms by which certain proteins fold (Fig. ) and the structural dynamics underlying the function of important drug targets [,
]. Anton’s predictive power has been demonstrated
through comparison with experimental observations
[, ]. Anton thus provides a powerful complement
to laboratory experiments in the investigation of fundamental biological processes, and holds promise as a tool
for the design of safe, effective, precisely targeted drugs.
Anton: A Special-Purpose Molecular Simulation Machine. Fig.  Two renderings of a protein (BPTI) taken from a
molecular dynamics simulation on Anton. (a) The entire simulated system, with each atom of the protein represented by a
sphere and the surrounding water represented by thin lines. For clarity, water molecules in front of the protein are not
pictured. (b) A “cartoon” rendering showing important structural elements of the protein (secondary and tertiary
structure)
[Figure panels: (a) t = 5 μs; (b) t = 25 μs; (c) t = 50 μs]
Anton: A Special-Purpose Molecular Simulation
Machine. Fig.  Unfolding and folding events in a -μs
simulation of the protein gpW, at a temperature that
equally favors the folded and unfolded states. Panel (a)
shows a snapshot of a folded structure early in the
simulation, (b) is a snapshot after the protein has partially
unfolded, and (c) is a snapshot after it has folded again.
Anton has also simulated the folding of several proteins
from a completely extended state to the experimentally
observed folded state
Related Entries
Amdahl’s Law
Distributed-Memory Multiprocessor
GRAPE
IBM Blue Gene Supercomputer
NAMD (NAnoscale Molecular Dynamics)
N-Body Computational Methods
QCDSP and QCDOC Computers
Bibliographic Notes and Further Reading
The first Anton machine was developed at D. E. Shaw
Research between  and . The overall architecture was described in [], with the HTIS, the
flexible subsystem, and the communication subsystem
described in more detail in [], [], and [], respectively. Initial performance results were presented in [],
and initial scientific results in [] and []. Other aspects
of the Anton architecture, software, and design process
are described in several additional papers [, , , ].
A number of previous projects built specialized
hardware for MD simulation, including MD-GRAPE
[], MD Engine [], and FASTRUN []. Extensive
effort has focused on efficient parallelization of MD
on general-purpose architectures, including IBM’s Blue
Gene [] and commodity clusters [, , ]. More
recently, MD has also been ported to the Cell BE processor [] and to GPUs [].
.
Bibliography
. Bhatele A, Kumar S, Mei C, Phillips JC, Zheng G, Kalé LV
() Overcoming scaling challenges in biomolecular simulations across multiple platforms. In: Proceedings of the IEEE international parallel and distributed processing symposium, Miami
. Bowers KJ, Chow E, Xu H, Dror RO, Eastwood MP, Gregersen
BA, Klepeis JL, Kolossváry I, Moraes MA, Sacerdoti FD, Salmon
JK, Shan Y, Shaw DE () Scalable algorithms for molecular
dynamics simulations on commodity clusters. In: Proceedings
of the ACM/IEEE conference on supercomputing (SC). IEEE,
New York
. Chow E, Rendleman CA, Bowers KJ, Dror RO, Hughes DH,
Gullingsrud J, Sacerdoti FD, Shaw DE () Desmond performance on a cluster of multicore processors. D. E. Shaw
Research Technical Report DESRES/TR--, New York.
http://deshawresearch.com
. Dror RO, Arlow DH, Borhani DW, Jensen MØ, Piana S, Shaw
DE () Identification of two distinct inactive conformations of
the β  -adrenergic receptor reconciles structural and biochemical
observations. Proc Natl Acad Sci USA :–
. Dror RO, Grossman JP, Mackenzie KM, Towles B, Chow E,
Salmon JK, Young C, Bank JA, Batson B, Deneroff MM, Kuskin
JS, Larson RH, Moraes MA, Shaw DE () Exploiting nanosecond end-to-end communication latency on Anton. In:
Proceedings of the conference for high performance computing,
networking, storage and analysis (SC). IEEE, New York
. Ensign DL, Kasson PM, Pande VS () Heterogeneity even
at the speed limit of folding: large-scale molecular dynamics
study of a fast-folding variant of the villin headpiece. J Mol
Biol :–
. Fine RD, Dimmler G, Levinthal C () FASTRUN: a special
purpose, hardwired computer for molecular simulation. Proteins
:–
. Fitch BG, Rayshubskiy A, Eleftheriou M, Ward TJC, Giampapa
ME, Pitman MC, Pitera JW, Swope WC, Germain RS () Blue
Matter: scaling of N-body simulations to one atom per node. IBM
J Res Dev :
. Freddolino PL, Liu F, Gruebele MH, Schulten K () Ten-microsecond MD simulation of a fast-folding WW domain. Biophys J :L–L
. Freddolino PL, Park S, Roux B, Schulten K () Force field bias
in protein folding simulations. Biophys J :–
. Freddolino P, Schulten K () Common structural transitions
in explicit-solvent simulations of villin headpiece folding. Biophys J :–
. Grossfield A, Pitman MC, Feller SE, Soubias O, Gawrisch K
() Internal hydration increases during activation of the
G-protein-coupled receptor rhodopsin. J Mol Biol :–
. Grossman JP, Salmon JK, Ho CR, Ierardi DJ, Towles B, Batson B, Spengler J, Wang SC, Mueller R, Theobald M, Young
C, Gagliardo J, Deneroff MM, Dror RO, Shaw DE ()
.
.
.
.
.
.
.
.
.
.
.
.
Hierarchical simulation-based verification of Anton, a specialpurpose parallel machine. In: Proceedings of the th IEEE international conference on computer design (ICCD ’), Lake Tahoe
Grossman JP, Young C, Bank JA, Mackenzie K, Ierardi DJ, Salmon
JK, Dror RO, Shaw DE () Simulation and embedded software development for Anton, a parallel machine with heterogeneous multicore ASICs. In: Proceedings of the th IEEE/ACM/
IFIP international conference on hardware/software codesign
and system synthesis (CODES/ISSS ’)
Hess B, Kutzner C, van der Spoel D, Lindahl E () GROMACS
: algorithms for highly efficient, load-balanced, and scalable
molecular simulation. J Chem Theor Comput :–
Ho CR, Theobald M, Batson B, Grossman JP, Wang SC,
Gagliardo J, Deneroff MM, Dror RO, Shaw DE () Post-silicon
debug using formal verification waypoints. In: Proceedings of the
design and verification conference and exhibition (DVCon ’),
San Jose
Khalili-Araghi F, Gumbart J, Wen P-C, Sotomayor M, Tajkhorshid
E, Schulten K () Molecular dynamics simulations of membrane channels and transporters. Curr Opin Struct Biol :–
Klepeis JL, Lindorff-Larsen K, Dror RO, Shaw DE () Long-timescale molecular dynamics simulations of protein structure
and function. Curr Opin Struct Biol :–
Kuskin JS, Young C, Grossman JP, Batson B, Deneroff MM, Dror
RO, Shaw DE () Incorporating flexibility in Anton, a specialized machine for molecular dynamics simulation. In: Proceedings
of the th annual international symposium on high-performance
computer architecture (HPCA ’). IEEE, New York
Larson RH, Salmon JK, Dror RO, Deneroff MM, Young C, Grossman JP, Shan Y, Klepeis JL, Shaw DE () High-throughput
pairwise point interactions in Anton, a specialized machine
for molecular dynamics simulation. In: Proceedings of the th
annual international symposium on high-performance computer
architecture (HPCA ’). IEEE, New York
Luttman E, Ensign DL, Vishal V, Houston M, Rimon N, Øland
J, Jayachandran G, Friedrichs MS, Pande VS () Accelerating molecular dynamic simulation on the cell processor and
PlayStation . J Comput Chem :–
Martinez-Mayorga K, Pitman MC, Grossfield A, Feller SE, Brown
MF () Retinal counterion switch mechanism in vision evaluated by molecular simulations. J Am Chem Soc :–
McCammon JA, Gelin BR, Karplus M () Dynamics of folded
proteins. Nature :–
Pande VS, Baker I, Chapman J, Elmer SP, Khaliq S, Larson SM,
Rhee YM, Shirts MR, Snow CD, Sorin EJ, Zagrovic B ()
Atomistic protein folding simulations on the submillisecond
time scale using worldwide distributed computing. Biopolymers
:–
Piana S, Sarkar K, Lindorff-Larsen K, Guo M, Gruebele M, Shaw
DE () Computational design and experimental testing of the
fastest-folding β-sheet protein. J Mol Biol :–
Rosenbaum DM, Zhang C, Lyons JA, Holl R, Aragao D, Arlow
DH, Rasmussen SGF, Choi H-J, DeVree BT, Sunahara RK,
Chae PS, Gellman SH, Dror RO, Shaw DE, Weis WI, Caffrey M,
Gmeiner P, Kobilka BK () Structure and function of an irreversible agonist-β  adrenoceptor complex. Nature :–
. Shan Y, Klepeis JL, Eastwood MP, Dror RO, Shaw DE ()
Gaussian split Ewald: a fast Ewald mesh method for molecular
simulation. J Chem Phys :
. Shaw DE () A fast, scalable method for the parallel evaluation of distance-limited pairwise particle interactions. J Comput
Chem :–
. Shaw DE, Deneroff MM, Dror RO, Kuskin JS, Larson RH, Salmon
JK, Young C, Batson B, Bowers KJ, Chao JC, Eastwood MP,
Gagliardo J, Grossman JP, Ho CR, Ierardi DJ, Kolossváry I, Klepeis
JL, Layman T, McLeavey C, Moraes MA, Mueller R, Priest EC,
Shan Y, Spengler J, Theobald M, Towles B, Wang SC () Anton:
a special-purpose machine for molecular dynamics simulation.
In: Proceedings of the th annual international symposium on
computer architecture (ISCA ’). ACM, New York
. Shaw DE, Dror RO, Salmon JK, Grossman JP, Mackenzie
KM, Bank JA, Young C, Deneroff MM, Batson B, Bowers
KJ, Chow E, Eastwood MP, Ierardi DJ, Klepeis JL, Kuskin JS,
Larson RH, Lindorff-Larsen K, Maragakis P, Moraes MA, Piana
S, Shan Y, Towles B () Millisecond-scale molecular dynamics simulations on Anton. In: Proceedings of the conference for
high performance computing, networking, storage and analysis
(SC). ACM, New York
. Shaw DE, Maragakis P, Lindorff-Larsen K, Piana S, Dror RO, Eastwood MP, Bank JA, Jumper JM, Salmon JK, Shan Y, Wriggers W
() Atomic-level characterization of the structural dynamics
of proteins. Science :–
. Stone JE, Phillips JC, Freddolino PL, Hardy DJ, Trabuco LG,
Schulten K () Accelerating molecular modeling applications
with graphics processors. J Comput Chem :–
. Taiji M, Narumi T, Ohno Y, Futatsugi N, Suenaga A, Takada
N, Konagaya A () Protein Explorer: a petaflops special-purpose computer system for molecular dynamics simulations.
In: Proceedings of the ACM/IEEE conference on supercomputing
(SC ’), Phoenix, AZ. ACM, New York
. Toyoda S, Miyagawa H, Kitamura K, Amisaki T, Hashimoto E,
Ikeda H, Kusumi A, Miyakawa N () Development of MD
Engine: high-speed accelerator with parallel processor design for
molecular dynamics simulations. J Comput Chem :–
. Young C, Bank JA, Dror RO, Grossman JP, Salmon JK, Shaw
DE () A  ×  × , spatially distributed D FFT in four
microseconds on Anton. In: Proceedings of the conference for
high performance computing, networking, storage and analysis
(SC). ACM, New York
Application-Specific Integrated
Circuits
VLSI Computation
Applications and Parallelism
Computational Sciences
Architecture Independence
Network Obliviousness
Area-Universal Networks
Universality in VLSI Computation
Array Languages
Calvin Lin
University of Texas at Austin, Austin, TX, USA
Definition
An array language is a programming language that supports the manipulation of entire arrays – or portions of
arrays – as a basic unit of operation.
Discussion
Array languages provide two primary benefits: (1) They
raise the level of abstraction, providing conciseness and
programming convenience; (2) they provide a natural
source of data parallelism, because the multiple elements of an array can typically be manipulated concurrently. Both benefits derive from the removal of control
flow. For example, the following array statement assigns
each element of the B array to its corresponding element
in the A array:
A := B;
The same assignment could be written in a scalar
language using an explicit looping construct:
for i := 1 to n
  for j := 1 to n
    A[i][j] := B[i][j];
The array statement is conceptually simpler because
it removes the need to iterate over individual array
elements, which includes the need to name individual
elements. At the same time, the array expression admits
more parallelism because it does not over-specify the
order in which the pairs of elements are evaluated; thus,
the elements of the B array can be assigned to the
elements of the A array in any order, provided that
they obey array language semantics. (Standard array
language semantics dictate that the righthand side of
an assignment be fully evaluated before its value is
assigned to the variable on the lefthand side of the
assignment.)
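The practical effect of this rule shows up when the same array appears on both sides of an assignment. The Python sketch below models a shift of an array by one position, first with array semantics (the right-hand side is read in full before anything is written) and then with a naive element-by-element loop. The function names are hypothetical, and Python lists stand in for arrays.

```python
def shift_array_semantics(a):
    """Model of A[1:] := A[:-1] with the right-hand side evaluated first."""
    result = list(a)
    rhs = a[:-1]                 # snapshot of the entire right-hand side
    result[1:] = rhs
    return result


def shift_naive_loop(a):
    """Element-by-element loop that overwrites values it has yet to read."""
    result = list(a)
    for i in range(1, len(result)):
        result[i] = result[i - 1]    # reads an element that was just overwritten
    return result


print(shift_array_semantics([1, 2, 3, 4]))  # [1, 1, 2, 3]
print(shift_naive_loop([1, 2, 3, 4]))       # [1, 1, 1, 1]
```

Only the first version matches array language semantics; the loop imposes an evaluation order that changes the result.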
For the above example, the convenience of array
operators appears to be minimal, but in the context of
a parallel computer, the benefits to programmer productivity can be substantial, because a compiler can
translate the above array statement to efficient parallel code, freeing the programmer from having to deal
with many low-level details that would be necessary if
writing in a lower-level language, such as MPI. In particular, the compiler can partition the work and compute
loop bounds, handling the messy cases where values do
not divide evenly by the number of processors. Furthermore, if the compiler can statically identify locations
where communication must take place, it can allocate
extra memory to cache communicated values, and it can
insert communication where necessary, including any
necessary marshalling of noncontiguous data. Even for
shared memory machines, the burden of partitioning
work, inserting appropriate synchronization, etc., can
be substantial.
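One of the details that a compiler can take over is the computation of per-processor loop bounds when the iteration count does not divide evenly. The Python sketch below shows one common block-partitioning scheme, in which the first few processors receive one extra iteration; it is an illustration of the idea, not the scheme used by any particular compiler, and the function name `block_bounds` is hypothetical.

```python
def block_bounds(n, num_procs, rank):
    """Half-open range [lo, hi) of the n iterations owned by processor `rank`."""
    base, extra = divmod(n, num_procs)
    lo = rank * base + min(rank, extra)          # earlier ranks may hold one extra item
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


print([block_bounds(10, 4, r) for r in range(4)])
# [(0, 3), (3, 6), (6, 8), (8, 10)]
```

The ranges cover every iteration exactly once, and no two processors differ by more than one iteration.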
Array Indexing
Array languages can be characterized by the mechanisms that they provide for referring to portions of
an array. The first array language, APL, provided no
method of accessing portions of an array. Instead, in
APL, all operators are applied to all elements of their
array operands.
Languages such as Fortran  use array slices to concisely specify indices for each dimension of an array. For
example, the following statement assigns the upper left
 ×  corner of array A to the upper left corner of array B.
B(1:3,1:3) = A(1:3,1:3)
The problem with slices is that they introduce considerable redundancy, and they force the programmer to perform index calculations, which can be error
prone. For example, consider the Jacobi iteration, which
computes for each array element the average of its four
nearest neighbors:
B(2:n,2:n) = (A(1:n-1,2:n) + &
              A(3:n+1,2:n) + &
              A(2:n,1:n-1) + &
              A(2:n,3:n+1)) / 4
The problem becomes much worse, of course, for
higher-dimensional arrays.
The C∗ language [], which was designed for
the Connection Machine, simplifies array indexing by
defining indices that are relative to each array element.
For example, the core of the Jacobi iteration can be
expressed in C∗ as follows, where active is an array
that specifies the elements of the array where the computation should take place:
where (active)
{
B = ([.-1][.]A + [.+1][.]A
+ [.][.-1]A + [.][.+1]A)/4;
}
C∗ ’s relative indexing is less error prone than slices:
Each relative index focuses attention on the differences
among the array references, so it is clear that the first two
array references refer to the neighbors above and below
each element and that the last two array references refer
to the neighbors to the left and right of each element.
The ZPL language [] further raises the level of
abstraction by introducing the notion of a region to represent index sets. Regions are a first-class language construct, so regions can be named and manipulated. The
ability to name regions is important because – for software maintenance and readability reasons – descriptive
names are preferable to constant values. The ability to
manipulate regions is important because – like C∗ ’s relative indexing – it defines new index sets relative to
existing index sets, thereby highlighting the relationship
between the new index set and the old index set.
To express the Jacobi iteration in ZPL, programmers
would use the At operator (@) to translate an index set
by a named vector:
[R]
B := (A@north + A@south +
A@west + A@east)/4;
The above code assumes that the programmer has
defined the region R to represent the index set
[:n][:n], has defined north to be a vector whose
value is [−, ], has defined south to be a vector
whose value is [+,], and so forth. Given these definitions, the region R provides a base index set [:n][:n]
for every reference to a two-dimensional array in this
statement, so R applies to every occurrence of A in
this statement. The direction north shifts this base
index set by − in the first dimension, etc. Thus, the
above statement has the same meaning as the Fortran
 slice notation, but it uses named values instead of
hard-coded constants.
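The idea of a named region combined with named direction vectors can be sketched in plain Python. This is only an illustration of the concept, not ZPL: the helper names `region`, `at`, and `jacobi_step` are hypothetical, and the direction values (north as a shift of -1 in the first dimension, and so on) are assumed from the relative offsets used in the C∗ example.

```python
# Illustrative sketch only: regions are lists of (i, j) pairs and
# directions are tuples. None of these names come from ZPL itself.

def region(lo, hi):
    """Index set [lo..hi] x [lo..hi] (inclusive) as a list of (i, j) pairs."""
    return [(i, j) for i in range(lo, hi + 1) for j in range(lo, hi + 1)]

def at(array, index, direction):
    """Read `array` at `index` translated by `direction` (the role of @)."""
    i, j = index
    di, dj = direction
    return array[i + di][j + dj]

# Named directions instead of hard-coded offsets (values are assumptions).
north, south = (-1, 0), (1, 0)
west, east = (0, -1), (0, 1)

def jacobi_step(a, r):
    """One Jacobi sweep over region r; cells outside r are copied unchanged."""
    b = [row[:] for row in a]
    for i, j in r:
        b[i][j] = (at(a, (i, j), north) + at(a, (i, j), south)
                   + at(a, (i, j), west) + at(a, (i, j), east)) / 4
    return b

n = 3
# (n+2) x (n+2) grid: a border of 1.0 around an interior of 0.0.
A = [[1.0 if i in (0, n + 1) or j in (0, n + 1) else 0.0
      for j in range(n + 2)] for i in range(n + 2)]
R = region(1, n)      # the interior index set plays the role of region R
B = jacobi_step(A, R)
```

Because the loop body mentions only `R` and the four named directions, changing the problem size or the stencil shape means editing one definition rather than every array reference.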
ZPL provides other region operators to support
other common cases, and it uses regions in other ways,
for example, to declare array variables. More significantly, regions are quite general, as they can represent
sparse index sets and hierarchical index sets.
More recently, the Chapel language from Cray []
provides an elegant form of relative indexing that can
take multiple forms. For example, the below Chapel
code looks almost identical to the ZPL equivalent,
except that it includes the base region, R, in each array
index expression:
T[R] = (A[R+north] + A[R+south]+
A[R+east] + A[R+west])/4.0;
Alternatively, the expression could be written from the
perspective of a single element in the index set:
[ij in R] T(ij) = (A(ij+north) +
A(ij+south) +
A(ij+east) +
A(ij+west))/4.0;
The above formulation is significant because it allows
the variable ij to represent an element of an
arbitrary tuple, where the tuple (referred to as a domain
in Chapel) could represent many different types of index
sets, including sparse index sets, hierarchical index sets,
or even nodes of a graph.
Finally, the following Chapel code fragment shows
that the tuple can be decomposed into its constituent
parts, which allows arbitrary arithmetic to be performed on each part separately:
[(i,j) in R]
T(i,j) = (A(i-1,j)+
A(i+1,j)+
A(i,j-1)+
A(i,j+1))/4.0;
The FIDIL language [] represents an early attempt
to raise the level of abstraction, providing support for
irregular grids. In particular, FIDIL supports index sets,
known as domains, which are first-class objects that
can be manipulated through union, intersection, and
difference operators.
Array Operators
Array languages can also be characterized by the set of
array operators that they provide. All array languages
support elementwise operations, which are the natural extension of scalar operators to array operands. The
Array Indexing section showed examples of the elementwise + and = operators.
While element-wise operators are useful, at some
point, data parallel computations need to be combined
or summarized, so most array languages provide reduction operators, which combine – or reduce – multiple
values of an array into a single scalar value. For example, the values of an array can be reduced to a scalar
by summing them or by computing their maximum or
minimum value.
Some languages provide additional power by allowing reductions to be applied to a subset of an array’s
dimensions. For example, each row of values in a two-dimensional array can be reduced to produce a single column of values. In general, a partial reduction
accepts an n-dimensional array of values and produces
an m-dimensional array of values, where m < n. This
construct can be further generalized by allowing the
programmer to specify an arbitrary associative and
commutative function as the reduction operator.
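A full reduction, a partial reduction, and a reduction with a programmer-supplied operator can be sketched as follows. This is a sequential Python illustration with hypothetical helper names; an array language would distribute the combining work across processors, which is possible because the operator is associative.

```python
from functools import reduce
import operator

def reduce_all(op, a):
    """Full reduction: combine every element of a 2-D array into one scalar."""
    return reduce(op, (x for row in a for x in row))

def reduce_rows(op, a):
    """Partial reduction: collapse each row to one value (2-D in, 1-D out)."""
    return [reduce(op, row) for row in a]

def reduce_cols(op, a):
    """Partial reduction along the other dimension: one value per column."""
    return [reduce(op, col) for col in zip(*a)]

A = [[1, 2, 3],
     [4, 5, 6]]

total = reduce_all(operator.add, A)       # sum of all elements
row_sums = reduce_rows(operator.add, A)   # one sum per row
col_max = reduce_cols(max, A)             # user-chosen operator: max
```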
Parallel prefix operators – also known as scan operators – are an extension of reductions that produce an
array of partial values instead of a single scalar value.
For example, the prefix sum accepts as input n values
and computes all sums $x_1 + x_2 + x_3 + \cdots + x_k$ for $1 \le k \le n$.
Other parallel prefix operations are produced
by replacing the + operator with some other associative
operator. (When applied to multi-dimensional arrays,
the array indices are linearized in some well-defined
manner, e.g., using Row Major Order.)
The parallel prefix operator is quite powerful
because it provides a general mechanism for parallelizing computations that might seem to require sequential
iteration. In particular, sequential loop iterations that
accumulate information as they iterate can typically be
solved using a parallel prefix.
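To make the connection between an accumulating loop and a scan concrete, the sketch below computes an inclusive scan twice: once with the obvious sequential loop, and once by recursive doubling, where every element update within a round is independent of the others. The doubling scheme is one common formulation chosen for illustration, not necessarily the algorithm of the works cited in this entry, and the function names are hypothetical.

```python
import operator

def scan_sequential(op, xs):
    """Inclusive scan with an accumulating loop: n - 1 dependent steps."""
    out, acc = [], None
    for k, x in enumerate(xs):
        acc = x if k == 0 else op(acc, x)
        out.append(acc)
    return out

def scan_by_doubling(op, xs):
    """Inclusive scan in about log2(n) rounds.

    In each round, element k combines with the element `shift` positions
    to its left. All updates in a round read only the previous round's
    values, so a parallel machine could perform them simultaneously.
    Requires `op` to be associative.
    """
    cur = list(xs)
    shift = 1
    while shift < len(cur):
        cur = [cur[k] if k < shift else op(cur[k - shift], cur[k])
               for k in range(len(cur))]
        shift *= 2
    return cur

values = [3, 1, 4, 1, 5]
prefix_sums = scan_by_doubling(operator.add, values)
running_max = scan_by_doubling(max, values)
```

The doubling version performs more total operations than the loop but needs far fewer dependent rounds, which is the trade that makes it attractive on parallel hardware.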

Given the ability to index an individual array element, programmers can directly implement their own
reduction and scan code, but there are several benefits of language support. First, reduction and scan are
common abstractions, so good linguistic support makes
these abstractions easier to read and write. Second,
language support allows their implementations to be
customized to the target machine. Third, compiler support introduces nontrivial opportunities for optimization: When multiple reductions or scans are performed
in sequence, their communication components can be
combined to reduce communication costs.
Languages such as Fortran  and APL also provide additional operators for flattening, re-shaping, and
manipulating arrays in other powerful ways. These languages also provide operators such as matrix multiplication and matrix transpose that treat arrays as matrices.
The inclusion of matrix operations blurs the distinction
between array languages and matrix languages, which
are described in the next section.
Matrix Languages
Array languages should not be confused with matrix
languages, such as Matlab []. An array is a programming language construct that has many uses. By contrast, a matrix is a mathematical concept that carries
additional semantic meaning. To understand this distinction, consider the following statement:
A = B * C;
In an array language, the above statement assigns to
A the element-wise product of B and C. In a matrix
language, the statement computes the matrix product of B and C.
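The two readings of the statement give different results even for small inputs, as the following Python sketch shows. The function names are hypothetical and the arrays are plain nested lists.

```python
def elementwise_product(b, c):
    """Array-language reading: A[i][j] = B[i][j] * C[i][j]."""
    return [[x * y for x, y in zip(row_b, row_c)]
            for row_b, row_c in zip(b, c)]

def matrix_product(b, c):
    """Matrix-language reading: A[i][j] = sum over k of B[i][k] * C[k][j]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*c)]
            for row in b]

B = [[1, 2],
     [3, 4]]
C = [[5, 6],
     [7, 8]]

A_array = elementwise_product(B, C)
A_matrix = matrix_product(B, C)
```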
The most popular matrix language, Matlab, was
originally designed as a convenient interactive interface
to numeric libraries, such as EISPACK and LINPACK,
that encourages exploration. Thus, for example, there
are no variable declarations. These interactive features
make Matlab difficult to parallelize, because they inhibit
the compiler’s ability to carefully orchestrate the
computation and communication.
Future Directions
Array language support can be extended in two dimensions. First, the restriction to flat dense arrays is too
limiting for many computations, so language support
for sparsely populated arrays, hierarchical arrays, and
irregular pointer-based data structures is also needed.
In principle, Chapel’s domain construct supports all of
these extensions of array languages, but further research
is required to fully support these data structures and
to provide good performance. Second, it is important
to integrate task parallelism with data parallelism, as is
being explored in languages such as Chapel and X.
Related Entries
Chapel (Cray Inc. HPCS Language)
Fortran  and Its Successors
HPF (High Performance Fortran)
NESL
ZPL
Bibliographic Notes and Further
Reading
The first array language, APL [], was developed in the
early s and has often been referred to as the first
write-only language because of its terse, complex nature.
Subsequent array languages that extended
more conventional languages began to appear in the
late s. For example, extensions of imperative languages include C∗ [], FIDIL [], Dataparallel C [],
and Fortran  []. NESL [] is a functional language
that includes support for nested one dimensional arrays.
In the early s, High Performance Fortran (HPF)
was a data parallel language that extended Fortran 
and Fortran  to provide directives about data distribution. At about the same time, the ZPL language []
showed that a more abstract notion of an array’s index
set could lead to clear and concise programs.
More recently, the DARPA-funded HighProductivity Computing Systems project led to the
development of Chapel [] and X [], which both integrate array languages with support for task parallelism.
Ladner and Fischer [] presented key ideas of the
parallel prefix algorithm, and Blelloch [] elegantly
demonstrated the power of the scan operator for array
languages.
Bibliography
. Adams JC, Brainerd WS, Martin JT, Smith BT, Wagener JL ()
Fortran  handbook. McGraw-Hill, New York
. Blelloch G () Programming parallel algorithms. Comm ACM
():–
. Blelloch GE () NESL: a nested data-parallel language. Technical Report CMUCS--, School of Computer Science,
Carnegie Mellon University, Pittsburgh, PA, January 
. Chamberlain BL () The design and implementation of a
region-based parallel language. PhD thesis, University of Washington, Department of Computer Science and Engineering,
Seattle, WA
. Chamberlain BL, Callahan D, Zima HP () Parallel programmability and the Chapel language. Int J High Perform Comput Appl ():–
. Ebcioglu K, Saraswat V, Sarkar V () X: programming
for hierarchical parallelism and non-uniform data access. In:
International Workshop on Language Runtimes, OOPSLA ,
Vancouver, BC
. Amos Gilat () MATLAB: an introduction with applications,
nd edn. Wiley, New York
. Hatcher PJ, Quinn MJ () Data-parallel programming on
MIMD computers. MIT Press, Cambridge, MA
. Hilfinger PN, Colella P () FIDIL: a language for scientific programming. Technical Report UCRL-, Lawrence Livermore
National Laboratory, Livermore, CA, January 
. Iverson K () A programming language. Wiley, New York
. Ladner RE, Fischer MJ () Parallel prefix computation. JACM
():–
. Lin C, Snyder L () ZPL: an array sublanguage. In: Banerjee U,
Gelernter D, Nicolau A, Padua D (eds) Languages and compilers
for parallel computing. Springer-Verlag, New York, pp –
. Rose JR, Steele Jr GL () C∗ : an extended C language for
data parallel programming. In: nd International Conference on
Supercomputing, Santa Clara, CA, March 
Array Languages, Compiler
Techniques for
Jenq-Kuen Lee , Rong-Guey Chang , Chi-Bang Kuan

National Tsing-Hua University, Hsin-Chu, Taiwan

National Chung Cheng University, Chia-Yi, Taiwan
Synonyms
Compiler optimizations for array languages
Definition
Compiler techniques for array languages generally
include compiler support, optimizations, and code generation for programs expressed in or annotated with
array languages. These compiler techniques
mainly take advantage of the data parallelism explicit in
array operations and intrinsic functions to deliver performance on parallel architectures, such as vector processors, multi-processors, and VLIW processors.
Discussion
Introduction to Array Languages
Several programming languages provide
a rich set of array operations and intrinsic functions,
along with array constructs, to assist data-parallel programming. They include Fortran , High Performance Fortran (HPF), APL, and MATLAB. Most
of them provide array intrinsic functions
and array operations that manipulate the data elements of
multidimensional arrays concurrently without requiring iterative statements. Among these array languages,
Fortran  is a typical example; it provides an
extensive set of array operations and intrinsic functions,
as shown in Table . The following paragraphs give several examples that introduce
the array operations and intrinsic functions supported by Fortran . Although the examples
use only Fortran  array operations for illustration, the array programming concepts and compiler
techniques apply to array languages in general.
Fortran  extends earlier Fortran language features
to allow a variety of scalar operations and intrinsic functions to be applied to arrays. These array operations and
intrinsic functions take array objects as inputs, perform
a specific operation on array elements concurrently, and
return results as scalars or arrays. The code fragment
below is an array accumulation example, which involves
two two-dimensional arrays and one array-add operation. In this example, every element of array S is
updated by adding the element of array
A in the corresponding position. The result,
also a two-dimensional  ×  array, is stored
back in array S, so that each element of S holds the
element-wise sum of array A and array S.
    integer S( , ), A( , )
    S = S + A
Besides primitive array operations, Fortran  also
provides programmers a set of array intrinsic functions
to manipulate array objects. These intrinsic functions,
listed in Table , include functions for data movement,
matrix multiplication, array reduction, compaction, etc.
In the following paragraphs, two examples using array
intrinsic functions are provided: the first one with
array reduction and the second one with array data
movement.
The first example, shown in the code fragment
below, presents a way to reduce an array with a specific operator. The SUM intrinsic function, one instance
of the array reduction functions in Fortran , sums up an
input array along a specified dimension. In this example, a two-dimensional  ×  array A, composed of
 elements in total, is passed to SUM with its first dimension
specified as the target for reduction. After the reduction,
SUM returns a one-dimensional array consisting of four elements, each of
which is the sum of the array elements along the
first dimension.
    integer A( , ), S()
    S = SUM(A , )
The second example presents a way to reorganize
data elements within an input array with array intrinsic
functions. The circular-shift (CSHIFT) intrinsic function in Fortran  performs data movement over an
input array along a given dimension. Given a two-dimensional array, it can be considered to shift the data
elements of the input array to the left or right, up
or down, in a circular manner. The first argument of
CSHIFT indicates the input array to be shifted, while
the remaining two arguments specify the shift amount and
the dimension along which data are shifted. In the code
fragment below, a two-dimensional array, A, is
shifted by a one-element offset along the first dimension. If the initial contents of array A are as labeled on the
left-hand side of Fig. , then after the circular shift its
contents are moved to the new positions shown on the
right-hand side of Fig. , where the first row of A sinks
to the bottom and the rest move upward by one element.
    integer A( , )
    A = CSHIFT(A ,  , )
Array Languages, Compiler Techniques for. Table  Array
intrinsic functions in Fortran 
Array intrinsics   Functionality
CSHIFT             Circular-shift elements of the input array along one specified dimension
DOT_PRODUCT        Compute dot-product of two input arrays as two vectors
EOSHIFT            End-off shift elements of the input array along one specified dimension
MATMUL             Matrix multiplication
MERGE              Combine two conforming arrays under the control of an input mask
PACK               Pack an array under the control of an input mask
Reduction          Reduce an array by one specified dimension and operator
RESHAPE            Construct an array of a specified shape from elements of the input array
SPREAD             Replicate an array by adding one dimension
Section move       Perform data movement over a region of the input array
TRANSPOSE          Matrix transposition
UNPACK             Unpack an array under the control of an input mask
Compiler Techniques for Array Languages
The examples so far illustrate the concise representation offered by array operations and intrinsic functions.
The main advantage of array operations is that they
expose data parallelism to compilers for concurrent data processing. Compilers can use the exposed data parallelism
to generate efficient code
for parallel architectures, such as vector processors, multi-processors, and VLIW processors.
The following paragraphs describe
two compiler techniques for compiling Fortran  array
operations. Although these
techniques were originally developed for Fortran , they
are also applicable to other array languages, such as
Matlab and APL.
The first technique covers array operation synthesis for consecutive array operations, which treats array
intrinsic functions as mathematical functions. In the
synthesis technique, each array intrinsic function has
a data access function that specifies its data access pattern and the mapping between its input and
output arrays. By providing data access functions
for all array operations, this technique can synthesize
multiple array operations into a composite data access
function, which expresses the data accesses and computation needed to generate the target arrays from the source arrays.
Compiling programs directly from the composite
data access function can greatly improve performance
by reducing the redundant data movement and temporary
storage required for passing intermediate results.
     11 12 13 14          21 22 23 24
A =  21 22 23 24    A′ =  31 32 33 34
     31 32 33 34          41 42 43 44
     41 42 43 44          11 12 13 14
Array Languages, Compiler Techniques for. Fig.  Array
contents change after CSHIFT (shift amount = ,
dimension = )
The second compiler technique concerns compiler
support and optimizations for sparse array programs.
In contrast with dense arrays, sparse arrays contain
far more zero elements than nonzero elements.
Arrays with this characteristic arise
in many scientific applications. For
example, sparse linear systems, such as the Harwell–Boeing
matrices, are widely used. As with dense arrays,
there is demand for array operations
and intrinsic functions that express data parallelism
in sparse matrices. Later paragraphs
illustrate how to support sparse array
operations in Fortran  and cover some
optimization techniques for sparse programs.
Synthesis for Array Operations
The following paragraphs describe an array operation synthesis technique for compiling Fortran  array
operations. The technique can be applied to programs that contain consecutive array operations. Array operations here include
not only those of Fortran  but all array operations
that can be formalized as data access functions. With
this technique, compilers can generate efficient code
by removing the redundant data movement and temporary
storage that are often introduced when compiling array
programs.
Consecutive Array Operations
As shown in the previous examples, array operations
take one or more arrays as inputs, conduct a specific
operation to them, and return results in an array or a
scalar value. For more advanced usage, multiple array
operations can be cascaded to express compound computation over arrays. An array operation within the consecutive array operations takes inputs either from input
arrays or intermediate results from others, processing
data elements and passing its results to the next. In this
way, they conceptually describe a particular relationship
between source arrays and target arrays.
The following code fragment is an example of
consecutive array operations. It involves three array
intrinsic functions, TRANSPOSE, RESHAPE, and CSHIFT, and
three arrays, A(4, 4), B(4, 4), and C(16). The cascaded
array operations first transpose array A and reshape
array C into two 4 × 4 matrices, then sum the
two 4 × 4 intermediate results, and finally circular-shift
the sum along its first dimension. This example shows how concisely
compound array operations can
manipulate arrays without iterative statements.
    integer A(4, 4), B(4, 4), C(16)
    B = CSHIFT((TRANSPOSE(A) + RESHAPE(C, /4, 4/)), 1, 1)
To compile programs with consecutive array operations, a straightforward scheme translates
each array operation into a parallel loop and creates temporary arrays to pass intermediate results to the remaining
array operations. Taking the previous code fragment as an
example, a compiler first separates the TRANSPOSE function from the consecutive array operations
and creates a temporary array T1 to hold the transposed
result. Similarly, another array T2 is created for the
RESHAPE function, and these two temporary arrays, T1
and T2, are summed into another temporary array, T3.
Finally, T3 is passed to CSHIFT to produce
the final result in the target array B.
    integer A(4, 4), B(4, 4), C(16)
    integer T1(4, 4), T2(4, 4), T3(4, 4)
    T1 = TRANSPOSE(A)
    T2 = RESHAPE(C, /4, 4/)
    T3 = T1 + T2
    B = CSHIFT(T3, 1, 1)
This straightforward scheme is inefficient as it introduces unnecessary data movement between temporary
arrays and temporary storages for passing intermediate
results. To compile array programs in a more efficient
way, array operation synthesis can be applied to obtain
a function F at compile-time such that B = F(A, C),
which is the synthesized data access function of the
compound operations and it is functionally identical to
the original sequence of array operations. The synthesized data access function specifies a direct mapping
from source arrays to target arrays, and can be used by
compilers to generate efficient code without introducing
temporary arrays.
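The cost of the straightforward scheme is easy to see in a Python sketch that mirrors it step by step. The sketch assumes a 4 × 4 problem with zero-based indices, the reshape mapping T2[i][j] = C[i + j*4] used later in this entry, and a circular shift of one along the first dimension; all function names are hypothetical.

```python
N = 4

def transpose(a):
    """T1[i][j] = A[j][i]"""
    return [[a[j][i] for j in range(N)] for i in range(N)]

def reshape(c):
    """T2[i][j] = C[i + j*N]: the 1-D input fills the result column by column."""
    return [[c[i + j * N] for j in range(N)] for i in range(N)]

def add(x, y):
    """T3[i][j] = T1[i][j] + T2[i][j]"""
    return [[x[i][j] + y[i][j] for j in range(N)] for i in range(N)]

def cshift_rows(t, shift=1):
    """B[i][j] = T3[(i + shift) mod N][j]"""
    return [t[(i + shift) % N][:] for i in range(N)]

def naive(a, c):
    """One full pass and one N x N temporary per array operation."""
    t1 = transpose(a)
    t2 = reshape(c)
    t3 = add(t1, t2)
    return cshift_rows(t3)

A = [[10 * i + j for j in range(N)] for i in range(N)]
C = list(range(100, 100 + N * N))
B = naive(A, C)
```

Each element is read and written four times on its way to `B`, and three temporaries of the same size as `B` are allocated. The synthesized function F(A, C) described above avoids both costs.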
Data Access Functions
The concept of data access functions needs further elaboration. Each array operation has its own
data access function, which specifies the element-wise mapping between its input and output arrays. Because array operations in Fortran  take various forms, their
data access functions look different as well. The following paragraphs present three types of data access functions,
for different types of array operations,
as the basis of array operation synthesis.
The first type of data access function is for
array operations that have a single source array
and access data continuously, such as
TRANSPOSE and SPREAD in Fortran . For an
array operation with an n-dimensional target array T and
an m-dimensional source array S, written
T = Array_Operation(S), the data access function
can be represented as shown at the end of this paragraph. Each $f_i$
on the right-hand side is an index function that produces one array subscript of the source array S, and
$i_1$ to $i_n$ are the index variables of the
target array T. For example, the data access function
for T = TRANSPOSE(S) is T[i, j] = S[j, i],
where the index functions are by definition $f_1(i, j) = j$
and $f_2(i, j) = i$.
$$T[i_1, i_2, \cdots, i_n] = S[f_1(i_1, i_2, \cdots, i_n), f_2(i_1, i_2, \cdots, i_n), \cdots, f_m(i_1, i_2, \cdots, i_n)]$$
The second type of data access function is for array
operations that also have a single source array but access data in segments,
which means that data in the
arrays are accessed in parts or with strides. Because of segmented data access, the data access function cannot
be represented in a single continuous form. Instead,
it has to be described by multiple data access patterns, each of which covers a disjoint array index
range. To represent an array index range, a notation
called a segmentation descriptor can be used, which is a
boolean predicate of the form
$$\phi(/f_1(i_1, i_2, \cdots, i_n), \cdots, f_m(i_1, i_2, \cdots, i_n)/, /l_1 : u_1 : s_1, l_2 : u_2 : s_2, \cdots, l_m : u_m : s_m/)$$
where $f_i$ is an index function and $l_i$, $u_i$, and $s_i$ are
the lower bound, upper bound, and stride of the
index function $f_i(i_1, i_2, \cdots, i_n)$. The stride $s_i$ can be
omitted if it is equal to one, representing contiguous
data access. For example, the segmentation descriptor
ϕ(/i, j/, / : ,  : /) delimits the range of (i = ,
j =  : ).
With a notation for segmented index
ranges in place, consider array operations that have a single source array and segmented data access, such
as CSHIFT and EOSHIFT in Fortran . For an
array operation with an n-dimensional target array T
and an m-dimensional source array S, written
T = Array_Operation(S), the data access function can
be represented as follows:
$$T[i_1, i_2, \cdots, i_n] = \begin{cases} S[f_1(i_1, i_2, \cdots, i_n), f_2(i_1, i_2, \cdots, i_n), \cdots, f_m(i_1, i_2, \cdots, i_n)] & \mid \gamma_1 \\ S[g_1(i_1, i_2, \cdots, i_n), g_2(i_1, i_2, \cdots, i_n), \cdots, g_m(i_1, i_2, \cdots, i_n)] & \mid \gamma_2 \\ \cdots \end{cases}$$
where $\gamma_1$ and $\gamma_2$ are two separate array index ranges for
the index functions $f_i$ and $g_i$, respectively. Take CSHIFT as
an example. For the array operation B = CSHIFT(A, 1, 1),
where the input A and the output B are both 4 × 4 arrays,
the data access function is divided into two parts, indicating that
different sections of the target array B are computed
by different formulas: for the index range i = 3, j = 0 ∼ 3,
B[i, j] is assigned the value of A[i − 3, j]; for the index
range i = 0 ∼ 2, j = 0 ∼ 3, B[i, j] is assigned the value
of A[i + 1, j].
$$B[i, j] = \begin{cases} A[i - 3, j] & \mid \phi(/i, j/, /3 : 3, 0 : 3/) \\ A[i + 1, j] & \mid \phi(/i, j/, /0 : 2, 0 : 3/) \end{cases}$$
The third type of data access function is for array
operations with multiple source arrays and continuous
data access. For an array operation with k source arrays,
written T = Array_Operation($S_1, S_2, \cdots, S_k$), the data
access function can be represented as shown at the end of this paragraph,
where F is a k-ary function that describes how the
desired output is derived from k data elements of the
input arrays. Each element from the source arrays is associated
with an index function of the form $f_i^j$, where i and j
indicate its input array number and dimension, respectively. Take whole-array addition C(:,:) = A(:,:) + B(:,:)
as an example: its data access function can be represented as
C[i, j] = F(A[i, j], B[i, j]), where F(x, y) = x + y.
$$T[i_1, i_2, \cdots, i_n] = F(S_1[f_1^1(i_1, i_2, \cdots, i_n), f_1^2(i_1, i_2, \cdots, i_n), \cdots, f_1^{d_1}(i_1, i_2, \cdots, i_n)], \cdots, S_k[f_k^1(i_1, i_2, \cdots, i_n), f_k^2(i_1, i_2, \cdots, i_n), \cdots, f_k^{d_k}(i_1, i_2, \cdots, i_n)])$$
Synthesis of Array Operations
With data access functions defined,
the mechanism of array operation synthesis can now be described. For illustration, the synthesis framework
can be roughly partitioned into three major steps:
1. Build a parse tree for the consecutive array
operations
2. Provide a data access function for each array
operation
3. Synthesize the collected data access functions into a
composite one
Throughout the three steps, the consecutive array operations shown in the previous example are used
as a running example to illustrate the synthesis flow. The
code fragment is replicated here for reference.
    integer A(4, 4), B(4, 4), C(16)
    B = CSHIFT((TRANSPOSE(A) + RESHAPE(C, /4, 4/)), 1, 1)
In the first step, a parse tree is constructed for the
consecutive array operations. In the parse tree, source
arrays are leaf nodes while the target array is the root, and
each internal node corresponds to an array operation.
The parse tree for the running example is provided in
Fig. . Each temporary array created in the straightforward compilation is given a unique name
for identification. In the example, temporary arrays T3,
T1, and T2 are attached to the nodes of “+,” TRANSPOSE,
and RESHAPE as their intermediate results. The root
node is labeled with the target array, array B in this
example, which will contain the final results of the
consecutive array operations.
    B  = CSHIFT
      T3 = +
        T1 = TRANSPOSE
          A
        T2 = RESHAPE
          C
Array Languages, Compiler Techniques for. Fig.  The
parse tree for consecutive array operations in the running
example
In the second step, data access functions are provided for all array operations. In the running example,
the data access function of T1 = TRANSPOSE(A) is
T1[i, j] = A[j, i]; the data access function of T2 =
RESHAPE(C, /4, 4/) is T2[i, j] = C[i + j ∗ 4]; the
data access function of T3 = T1 + T2 is T3[i, j] =
F(T1[i, j], T2[i, j]), where F(x, y) = x + y; and the data
access function of B = CSHIFT(T3, 1, 1) is as follows:
$$B[i, j] = \begin{cases} T3[i + 1, j] & \mid \phi(/i, j/, /0 : 2, 0 : 3/) \\ T3[i - 3, j] & \mid \phi(/i, j/, /3 : 3, 0 : 3/) \end{cases}$$
In the last step, the collected data access functions
are synthesized, starting from the data access
function of the array operation at the root node. Throughout the synthesis process, every temporary array on the
right-hand side (RHS) of a data access function is replaced
by the data access function that defines that temporary
array. During the substitution, other temporary arrays may appear in the RHS of the
updated data access function. The process repeats until
all temporary arrays in the RHS have been replaced with source
arrays.
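The substitution process can be imitated in Python by representing each data access function as a function of the target indices and composing them from the root downward. This is a run-time sketch of what a compiler would do symbolically at compile time; the names are hypothetical, and the sizes (4 × 4, zero-based indices, reshape mapping C[i + j*4], shift of one along the first dimension) are assumptions drawn from the running example.

```python
N = 4

# Each helper takes the data access function(s) of its operand(s) and
# returns the data access function of its result. No array is built.

def transpose_of(src):
    return lambda i, j: src(j, i)                  # T1[i, j] = A[j, i]

def reshape_of(src1d):
    return lambda i, j: src1d(i + j * N)           # T2[i, j] = C[i + j*N]

def add_of(f, g):
    return lambda i, j: f(i, j) + g(i, j)          # T3 = F(T1, T2)

def cshift_of(src):
    # Two segments: rows 0..N-2 read row i+1; row N-1 wraps to row 0.
    return lambda i, j: src(i + 1, j) if i <= N - 2 else src(i - (N - 1), j)

def synthesize(a, c):
    """Substitute operand functions into the root until only A and C remain."""
    read_a = lambda i, j: a[i][j]
    read_c = lambda k: c[k]
    t1 = transpose_of(read_a)
    t2 = reshape_of(read_c)
    t3 = add_of(t1, t2)
    return cshift_of(t3)

A = [[10 * i + j for j in range(N)] for i in range(N)]
C = list(range(100, 100 + N * N))

F = synthesize(A, C)                                # B = F(A, C)
B = [[F(i, j) for j in range(N)] for i in range(N)]
```

Here `t1`, `t2`, and `t3` are functions rather than arrays, so each element of `B` is computed directly from elements of `A` and `C`.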
For the running example, the synthesis process
begins with the data access function of CSHIFT, because
CSHIFT is the root operation in the parse tree. The data access
function of CSHIFT, given earlier, has
temporary array T3 on its right-hand side. To
substitute T3, the data access function of T3 = T1 + T2,
which specifies T3[i, j] = F(T1[i, j], T2[i, j]), is used.
The substitution yields the
following data access function:
$$B[i, j] = \begin{cases} F(T1[i + 1, j], T2[i + 1, j]) & \mid \phi(/i, j/, /0 : 2, 0 : 3/) \\ F(T1[i - 3, j], T2[i - 3, j]) & \mid \phi(/i, j/, /3 : 3, 0 : 3/) \end{cases}$$
The updated data access function contains two
new temporary arrays, T1 and T2. Therefore, the process continues by substituting T1 with T1[i, j] = A[j, i]
(the data access function of TRANSPOSE) and T2 with
T2[i, j] = C[i + j ∗ 4] (the data access function of
RESHAPE). At the end of the synthesis process, a composite
data access function is derived as follows, containing only the target array B and the source arrays A and C,
without any temporary arrays.
$$B[i, j] = \begin{cases} F(A[j, i + 1], C[i + 1 + j * 4]) & \mid \phi(/i, j/, /0 : 2, 0 : 3/) \\ F(A[j, i - 3], C[i - 3 + j * 4]) & \mid \phi(/i, j/, /3 : 3, 0 : 3/) \end{cases}$$
The derived data access function describes a direct,
element-wise mapping from data elements in the
source arrays to those of the target array. The mapping
described by the synthesized data access function
is visualized in Fig. . Since the synthesized data access
function consists of two formulas with different segmented index ranges, two portions of
the target array B, the gray blocks and white blocks in
Fig. , are computed in different manners.
[Figure: a 4 × 4 grid with rows indexed by i = 0..3 and columns by j = 0..3, divided into two regions labeled
B(i, j) = A(j, i+1) + C(i+1+j∗4) and B(i, j) = A(j, i−3) + C(i−3+j∗4)]
Array Languages, Compiler Techniques for. Fig. 
Constituents of the target array are described by the
synthesized data access function
Code Generation with Data Access Functions
After deriving a composite data access function,
compilers can generate a parallel loop with FORALL
statements for the consecutive array operations. For the
running example, compilers may generate the pseudo
code below, with no redundant data movement between
temporary arrays and therefore no additional storage
required.
    integer A(4, 4), B(4, 4), C(16)
    FORALL i = 0 to 3, j = 0 to 3
      IF (i, j) ∈ ϕ(/i, j/, /0 : 2, 0 : 3/) THEN
        B[i, j] = A[j, i + 1] + C[i + 1 + j ∗ 4]
      IF (i, j) ∈ ϕ(/i, j/, /3 : 3, 0 : 3/) THEN
        B[i, j] = A[j, i − 3] + C[i − 3 + j ∗ 4]
    END FORALL
Optimizations on Index Ranges
The code generated in the previous step consists of a single nested loop. It is straightforward but inefficient, because it includes two if-statements to ensure that different parts of the target array are computed by different equations. These guard statements, introduced by the segmented index ranges, are evaluated at runtime in every iteration and thus degrade performance. To optimize programs with segmented index ranges, the loop can be further divided into multiple loops, each of which has bounds that cover exactly one array access range. With this index range optimization, the loop for the running example can be divided into two loops for the two disjoint access ranges.

[Figure: the first two dimensions of A form a 6 × 6 grid; each cell holds either a zero vector (shown as 0) or a vector of six elements along the third dimension.]
0              <0,0,0,5,3,0>  0              0              0              0
0              <2,0,1,0,0,0>  <0,0,9,3,0,0>  0              0              0
<9,0,0,0,6,0>  0              0              0              <1,0,6,2,3,5>  0
0              0              0              0              0              <0,2,0,3,4,0>
0              0              <1,0,0,0,7,2>  0              0              0
0              <2,4,2,8,0,0>  0              0              0              <0,0,0,7,3,4>
Array Languages, Compiler Techniques for. Fig.  A three-dimensional sparse array A(6, 6, 6)
Finally, we conclude this section with the optimized code for the running example, shown below. Because each loop covers exactly one access range, no guard statements are needed and array data are accessed efficiently. For more information about array operation
synthesis, readers can refer to [–].
integer A( , ), B( , ), C( )
FORALL i =  to  , j =  to 
  B[i, j] = A[j, i + 1] + C[i + 1 + j ∗ 4]
END FORALL
FORALL i =  to  , j =  to 
  B[i, j] = A[j, i − 3] + C[i − 3 + j ∗ 4]
END FORALL
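The effect of the index range optimization can be sketched in Python. This is only an illustration under stated assumptions: 0-based indices, a 4 × 4 target array and a 16-element C (the sizes suggested by the figure of the running example), arbitrary test data, and the hypothetical function names `guarded` and `split`.

```python
# Sketch: one guarded loop nest versus two guard-free loop nests.
# Assumptions (not from the source): 0-based indices, 4 x 4 arrays,
# a 16-element C, and arbitrary test data.
A = [[10 * r + c for c in range(4)] for r in range(4)]
C = list(range(100, 116))

def guarded():
    """Single loop nest; every iteration evaluates both range guards."""
    B = [[None] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i <= 2:                      # first segmented index range
                B[i][j] = A[j][i + 1] + C[i + 1 + j * 4]
            if i == 3:                      # second segmented index range
                B[i][j] = A[j][i - 3] + C[i - 3 + j * 4]
    return B

def split():
    """Two loop nests whose bounds cover one access range each; no guards."""
    B = [[None] * 4 for _ in range(4)]
    for i in range(0, 3):
        for j in range(4):
            B[i][j] = A[j][i + 1] + C[i + 1 + j * 4]
    for i in range(3, 4):
        for j in range(4):
            B[i][j] = A[j][i - 3] + C[i - 3 + j * 4]
    return B

same = guarded() == split()
```

Both functions fill every element of B with the same values; the split version simply moves the range tests out of the loop body and into the loop bounds.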
Support and Optimizations for Sparse Array Operations
This section introduces another compiler technique, which targets the support and optimization of array operations on sparse arrays. In contrast to dense arrays, sparse arrays contain more zero elements than nonzero elements. Because of this characteristic, they are better suited than dense arrays to many data-specific applications, such as network theory and earthquake detection, and they are also used extensively in scientific computation and numerical analysis. To help readers understand sparse arrays, a sparse array A(6, 6, 6) is depicted in Fig. . The sparse array A is a three-dimensional 6 × 6 × 6 array whose first two dimensions contain many zero elements, each of which is a zero vector along the third dimension.
Compression and Distribution Schemes for Sparse Arrays
Because of their sparsity, sparse arrays are usually represented in a compressed format to save computation and memory space. In addition, to process sparse workloads efficiently in distributed memory environments, sparse computations are often partitioned and distributed across processors. For these reasons, both compression and distribution schemes have to be considered when supporting parallel sparse computation. Table  lists the compression and distribution schemes available for two-dimensional sparse arrays. These schemes are described in more detail in the following paragraphs.
Compression Schemes
A one-dimensional array can be stored either in a dense representation or in a pair-wise sparse representation. In the sparse representation, an array of index and value pairs records the nonzero elements, and no space is spent on zero elements. For example, the one-dimensional array ⟨0, 0, 0, 5, 3, 0⟩ can be compressed into the pairs (4, 5) and (5, 3), which represent the two nonzero elements in the array: the fourth element, equal to five, and the fifth element, equal to three.
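The pair-wise sparse representation of a one-dimensional array can be sketched as follows. The helper names `compress_1d` and `expand_1d` are hypothetical, and 1-based indices are assumed so that the pairs match the wording "fourth element" and "fifth element".

```python
def compress_1d(dense):
    """Return (index, value) pairs for the nonzero elements, using 1-based indices."""
    return [(i, v) for i, v in enumerate(dense, start=1) if v != 0]

def expand_1d(pairs, length):
    """Rebuild the dense array of the given length from (index, value) pairs."""
    dense = [0] * length
    for i, v in pairs:
        dense[i - 1] = v
    return dense

# Six-element array whose fourth element is five and fifth element is three.
pairs = compress_1d([0, 0, 0, 5, 3, 0])
restored = expand_1d(pairs, 6)
```

Only two pairs are stored for the six-element array, and expanding them restores the original dense array.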
Array Languages, Compiler Techniques for. Table  Compression and distribution schemes for two-dimensional arrays

Compression scheme                 Distribution scheme
Compressed row storage (CRS)       (Block, *)
Compressed column storage (CCS)    (*, Block)
Dense representation               (Block, Block)

[Figure: a two-dimensional array shown in its dense representation, in a CRS view, and in a CCS view, together with the RO, CO, and DA fields of its CRS and CCS encodings.]
Array Languages, Compiler Techniques for. Fig.  CRS and CCS compression schemes for a two-dimensional array
For two-dimensional arrays, there are two compression schemes: Compressed Row Storage (CRS) and Compressed Column Storage (CCS). They compress a two-dimensional array either by rows or by columns. The CRS scheme regards a two-dimensional array as a one-dimensional array of its rows, whereas the CCS scheme regards it as a one-dimensional array of its columns. Both use a CO and DA tuple, an index and data pair, to represent a nonzero element in a one-dimensional array, that is, in a row for CRS or in a column for CCS. The CO fields hold the index offsets of the nonzero values within a CRS row or a CCS column. In addition to the CO and DA pairs, both the CRS and CCS representations contain an RO list that records the position, in the CO and DA lists, at which each row or column starts. Fig.  shows an example that encodes a two-dimensional array with the CRS and CCS schemes, where a total of  nonzero values are recorded, with their values in the DA fields and their offsets within the CRS row or CCS column in the CO fields.
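A minimal sketch of the two encodings is given below. It assumes 1-based RO and CO values and an RO list that ends with one extra entry (number of nonzeros plus one), and it uses an illustrative 4 × 4 matrix that is not claimed to be the one in the figure. The function names `to_crs` and `to_ccs` are hypothetical.

```python
def to_crs(dense):
    """Encode a dense 2-D list row by row into (RO, CO, DA), all 1-based."""
    ro, co, da = [1], [], []
    for row in dense:
        for col, v in enumerate(row, start=1):
            if v != 0:
                co.append(col)        # offset of the nonzero within its row
                da.append(v)          # the nonzero value itself
        ro.append(len(da) + 1)        # position where the next row starts
    return ro, co, da

def to_ccs(dense):
    """CCS is CRS applied to the transpose: columns play the role of rows."""
    transposed = [list(col) for col in zip(*dense)]
    return to_crs(transposed)

# Illustrative matrix (an assumption, not necessarily the figure's array).
M = [[34, 0, 11, 0],
     [0, 21, 0, 0],
     [0, 97, 0, 9],
     [51, 0, 28, 0]]
crs = to_crs(M)
ccs = to_ccs(M)
```

The same seven nonzero values appear in both DA lists, ordered by rows in CRS and by columns in CCS.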
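[Figure: CRS-style encoding of the three-dimensional array. Each DA entry is itself a one-dimensional sparse array given by its Value and Index lists.]
RO:  1  2  4  6  7  8  10
CO:  2  2  3  1  5  6  3  2  6
DA:
  entry 1   Value: 5 3         Index: 4 5
  entry 2   Value: 2 1         Index: 1 3
  entry 3   Value: 9 3         Index: 3 4
  entry 4   Value: 9 6         Index: 1 5
  entry 5   Value: 1 6 2 3 5   Index: 1 3 4 5 6
  entry 6   Value: 2 3 4       Index: 2 4 5
  entry 7   Value: 1 7 2       Index: 1 5 6
  entry 8   Value: 2 4 2 8     Index: 1 2 3 4
  entry 9   Value: 7 3 4       Index: 4 5 6
Array Languages, Compiler Techniques for. Fig.  A compression scheme for the three-dimensional array in Fig. 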
Compression schemes for higher-dimensional arrays can be constructed by using 1-d and 2-d sparse arrays as building blocks. Figure  shows a compression scheme for the three-dimensional sparse array in Fig. . The three-dimensional sparse array is represented by a two-dimensional CRS structure in which each element at the first level is a one-dimensional sparse array. Similarly, one can employ 2-d sparse arrays as bases to build four-dimensional sparse arrays: the representation is a two-dimensional compressed structure in which each DA field is itself a two-dimensional compressed structure.
The code fragment below shows one way to implement three-dimensional sparse arrays for both compression schemes, in which each data field contains a real number. The derived data type sparse3d_real can be used to declare sparse arrays with the CRS or CCS compression scheme, as in the example shown in Fig. . All array primitive operations, such as +, −, and ∗, and the array intrinsic functions applied to the derived type can be overloaded with sparse implementations. In this way, data stored in sparse arrays can be manipulated as if they were stored in dense arrays. For example, one can multiply two sparse arrays via a sparse matmul, in which only nonzero elements are computed.
type sparse3d_real
  type (descriptor) :: d
  integer, pointer, dimension(:) :: RO
  integer, pointer, dimension(:) :: CO
  type (sparse1d_real), pointer, dimension(:) :: DA
end type sparse3d_real
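The nesting of a one-dimensional sparse array inside a two-dimensional CRS structure can be sketched in Python, using the 6 × 6 grid of vectors from the figure of the three-dimensional sparse array. The function name `compress_3d`, the use of `None` for an all-zero vector, and the 1-based indices are assumptions made for this illustration.

```python
Z = None  # marks a cell whose vector along the third dimension is all zero

GRID = [
    [Z, [0, 0, 0, 5, 3, 0], Z, Z, Z, Z],
    [Z, [2, 0, 1, 0, 0, 0], [0, 0, 9, 3, 0, 0], Z, Z, Z],
    [[9, 0, 0, 0, 6, 0], Z, Z, Z, [1, 0, 6, 2, 3, 5], Z],
    [Z, Z, Z, Z, Z, [0, 2, 0, 3, 4, 0]],
    [Z, Z, [1, 0, 0, 0, 7, 2], Z, Z, Z],
    [Z, [2, 4, 2, 8, 0, 0], Z, Z, Z, [0, 0, 0, 7, 3, 4]],
]

def compress_3d(grid):
    """CRS over the first two dimensions; each DA entry is a list of (index, value) pairs."""
    ro, co, da = [1], [], []
    for row in grid:
        for col, vec in enumerate(row, start=1):
            if vec is not None and any(vec):
                co.append(col)
                da.append([(k, v) for k, v in enumerate(vec, start=1) if v != 0])
        ro.append(len(da) + 1)
    return ro, co, da

ro, co, da = compress_3d(GRID)
```

Under these assumptions the resulting RO and CO lists agree with the values recoverable from the compression figure.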
Distribution Schemes
In distributed memory environments, the data distribution of sparse arrays needs to be considered. The distribution schemes currently considered for sparse arrays are general block partitions based on the number of nonzero elements. For two-dimensional arrays, there are the (Block, ∗), (∗, Block), and (Block, Block) distributions. In the (Block, ∗) scheme a sparse array is distributed by rows, while in the (∗, Block) scheme it is distributed by columns. In the (Block, Block) distribution, an array is distributed by both rows and columns. Similarly, distribution schemes for higher-dimensional arrays can be realized by extending the distribution options to more dimensions.
The following code fragment illustrates how to assign a distribution scheme to a sparse array in a distributed environment. The three-dimensional sparse array in Fig.  is again used for explanation. First, the sparse array is declared as a 3-D sparse array with the derived type sparse3d_real. Next, the bound function is used to specify the shape of the sparse array, that is, its size in each dimension. After shape binding, the sparse array is assigned a (Block, Block, ∗) distribution scheme through the distribution function, by which the sparse array is partitioned and distributed along its first two dimensions. Figure  shows the resulting partition and distribution over a distributed environment with four processors, where each partition covers the data assigned to one target processor.
use sparse
type (sparse3d_real) :: A
call bound(A, 6, 6, 6)
call distribution(A, Block, Block, *)
...
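One possible reading of "block partitions based on the number of nonzero elements" is sketched below for a one-dimensional split of the rows: contiguous row blocks are chosen so that each processor receives a roughly equal share of the nonzeros. The greedy rule and the function name `block_partition` are assumptions made for illustration; they are not the partitioning algorithm of the cited work.

```python
def block_partition(nnz_per_row, nprocs):
    """Return nprocs contiguous (start, end) row ranges with roughly equal nonzero counts."""
    total = sum(nnz_per_row)
    n_rows = len(nnz_per_row)
    cuts, acc = [0], 0
    for r, n in enumerate(nnz_per_row):
        acc += n
        # Close a block each time the running count reaches the next equal share.
        while len(cuts) < nprocs and acc * nprocs >= total * len(cuts):
            cuts.append(r + 1)
    while len(cuts) < nprocs:          # pad with empty blocks if cuts are missing
        cuts.append(n_rows)
    cuts.append(n_rows)
    return [(cuts[k], cuts[k + 1]) for k in range(nprocs)]

# Nonzero counts per row of the 6 x 6 x 6 example array (2, 4, 7, 3, 3, 7).
two_way = block_partition([2, 4, 7, 3, 3, 7], 2)
three_way = block_partition([2, 4, 7, 3, 3, 7], 3)
```

With two processors each block of three rows holds 13 of the 26 nonzeros; a uniform split by row count would give the same ranges here only by coincidence.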
Programming with Sparse Arrays
With derived data types and operator overloading, sparse matrix computations can be expressed as concisely as dense computations. The code fragment in Fig.  is part of a Fortran 90 program excerpted from the book Numerical Recipes in Fortran 90 []. Its computation kernel, named banded multiplication (banmul), calculates b = Ax, where the first input, A, is a matrix and the second input, x, is a vector. (For more information about banded matrix-vector multiplication, readers can refer to the book [].) In Fig. , both the inputs and the output are declared as dense arrays, and three Fortran 90 array intrinsic functions, EOSHIFT, SPREAD, and SUM, are used to produce the result.

[Figure: the three-dimensional sparse array A partitioned along its first two dimensions among four processors.]
Array Languages, Compiler Techniques for. Fig.  A three-dimensional array is distributed on four processors by (Block, Block, ∗)

integer, parameter :: row = 1000
real, dimension(row, 2∗row−1) :: A
real, dimension(row) :: x, b
integer, dimension(2∗row−1) :: shift
b = sum(A∗eoshift(spread(x, dim=2, ncopies=2∗row−1),
        dim=1, shift=arth(−row+1, 1, 2∗row−1)), dim=2)
Array Languages, Compiler Techniques for. Fig.  Numerical routines for banded matrix multiplication

integer, parameter :: row = 1000
type (sparse2d_real) :: A
type (sparse1d_real) :: x, b
integer, dimension(2∗row−1) :: shift
call bound(A, row, 2∗row−1)
call bound(x, row)
call bound(b, row)
b = sum(A∗eoshift(spread(x, dim=2, ncopies=2∗row−1),
        dim=1, shift=arth(−row+1, 1, 2∗row−1)), dim=2)
Array Languages, Compiler Techniques for. Fig.  Sparse implementation for the banmul routine
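To make the data flow of the banmul expression concrete, the following pure-Python sketch emulates the three intrinsics on small dense lists. It assumes 0-based indices and a band storage in which column k of A holds the diagonal with offset k − (n − 1); the helper names `spread_cols`, `eoshift_rows`, `to_band`, and `banmul` are hypothetical, and the sketch is a dense emulation, not the sparse library discussed in this entry.

```python
def spread_cols(x, ncopies):
    """SPREAD(x, dim=2, ncopies): a matrix whose every column is a copy of x."""
    return [[xi] * ncopies for xi in x]

def eoshift_rows(m, shifts):
    """EOSHIFT(m, dim=1, shift=shifts): column k moves by shifts[k]; vacated slots become 0."""
    n = len(m)
    return [[m[i + s][k] if 0 <= i + s < n else 0 for k, s in enumerate(shifts)]
            for i in range(n)]

def to_band(full):
    """Pack an n x n matrix into n x (2n-1) band storage: band[i][k] = full[i][i + k - (n - 1)]."""
    n = len(full)
    return [[full[i][i + k - (n - 1)] if 0 <= i + k - (n - 1) < n else 0
             for k in range(2 * n - 1)] for i in range(n)]

def banmul(a, x):
    """b = sum(a * eoshift(spread(x)), dim=2) for band-stored a."""
    n = len(x)
    shifts = [k - (n - 1) for k in range(2 * n - 1)]   # arth(-row+1, 1, 2*row-1)
    t2 = eoshift_rows(spread_cols(x, 2 * n - 1), shifts)
    return [sum(a[i][k] * t2[i][k] for k in range(2 * n - 1)) for i in range(n)]

full = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
x = [1, 2, 3]
b = banmul(to_band(full), x)
direct = [sum(full[i][j] * x[j] for j in range(3)) for i in range(3)]
```

For this small tridiagonal matrix the emulated expression and the direct matrix-vector product give the same vector.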
When the input and output are sparse matrices with many zero elements, the dense representation and computation are inefficient. With the support of sparse data types and array operations, the banmul kernel can be rewritten into a sparse version, as shown in Fig. . Comparing Fig.  with Fig. , one can see that the sparse implementation is very similar to the dense version; the only differences are the array declarations and the extra function calls for array shape binding. By changing the inputs and outputs from dense to sparse arrays, a large amount of space is saved, because only nonzero elements are recorded. Moreover, the computation kernel is unchanged thanks to operator overloading, and it tends to perform better because only nonzero elements are processed. Note that the compression and distribution schemes of the sparse implementation are not specified for the input and output arrays, so the default settings are used. In advanced programming, these schemes can be configured via routines exposed to programmers, which can be used to tune programs for particular compressed data organizations. The following paragraphs discuss how to select proper compression and distribution schemes for sparse programs.
Selection of Compression and Distribution Schemes
A program with sparse arrays gives rise to an optimization problem: how to select the distribution and compression schemes for the sparse arrays in the program. In the next paragraphs, two combinations of compression and distribution schemes for the banmul kernel are presented to show how the choice of schemes affects performance. First, Fig.  shows the expression tree of the banmul kernel in Fig. , where T1, T2, and T3 beside the internal nodes are temporary arrays introduced by compilers to pass intermediate results between array operations. As mentioned before, programmers can choose compression and distribution schemes for the input and output arrays through exposed routines. Nevertheless, proper compression and distribution schemes still have to be determined for the temporary arrays, since they are introduced by compilers rather than by programmers.
The reason why the compression schemes of input, output, and temporary arrays matter is as follows. When an array operation is applied to arrays in different compression schemes, data conversion is required to reorganize the compressed arrays from one compression scheme to another so that they conform to a common format. This is because array operations are designed to process input arrays and return results in one conventional scheme, either CRS or CCS. The conversion cost implicit in array operations can hurt performance considerably and should be avoided.
Array Languages, Compiler Techniques for. Table  Assignments with less compression conversions

Array   Compressed scheme   Distribution scheme
x       1d                  (Block)
T1      CCS                 (Block, *)
T2      CCS                 (Block, *)
T3      CCS                 (Block, *)
A       CCS                 (Block, *)
b       1d                  (Block)

[Figure: expression tree of b = Sum(A ∗ Eoshift(Spread(x))), with the temporary arrays T1, T2, and T3 labeling the internal nodes.]
Array Languages, Compiler Techniques for. Fig.  Expression tree with intrinsic functions in banmul

Array Languages, Compiler Techniques for. Table  Assignments with extra compression conversions needed

Array   Compressed scheme   Distribution scheme
x       1d                  (Block)
T1      CRS                 (Block, *)
T2      CCS                 (Block, *)
T3      CRS                 (Block, *)
A       CRS                 (Block, *)
b       1d                  (Block)
In Table  and Table , we present two combinations of compression and distribution schemes for the banmul kernel. To focus on compression schemes, all arrays use the same distribution schemes: (Block) for one-dimensional arrays and (Block, *) for two-dimensional arrays. In the assignment that needs extra conversions, the input array A and the three temporary arrays are in sparse representation, with only T2 assigned to the CCS scheme and the rest assigned to the CRS scheme. When the conversion cost is considered, the assignment in which all two-dimensional arrays use CCS is the better of the two, because the mixed assignment requires two additional conversions: one to convert the result of Spread to CCS and another to convert the result of Eoshift to CRS.
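The comparison of the two assignments can be reproduced with a small counting sketch. It assumes a simplified model in which each operation works in the scheme of its two-dimensional result, so that every two-dimensional operand stored in the other scheme costs one conversion; the names `OPS` and `count_conversions` are hypothetical.

```python
# Operations of the banmul expression tree as (result, operands).
# x and b are one-dimensional and need no CRS/CCS conversion in this model.
OPS = [("T1", ["x"]), ("T2", ["T1"]), ("T3", ["A", "T2"]), ("b", ["T3"])]

def count_conversions(scheme):
    """Count 2-d operands whose scheme differs from that of the operation's 2-d result."""
    count = 0
    for result, operands in OPS:
        target = scheme[result]
        if target == "1d":
            continue                      # result is a vector: nothing to conform to
        count += sum(1 for op in operands if scheme[op] not in ("1d", target))
    return count

all_ccs = {"x": "1d", "T1": "CCS", "T2": "CCS", "T3": "CCS", "A": "CCS", "b": "1d"}
mixed = {"x": "1d", "T1": "CRS", "T2": "CCS", "T3": "CRS", "A": "CRS", "b": "1d"}
n_all_ccs = count_conversions(all_ccs)
n_mixed = count_conversions(mixed)
```

Under this model the uniform CCS assignment needs no conversion, while the mixed assignment needs two: T1 before Eoshift and T2 before the multiplication with A.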
Sparsity, Cost Models, and the Optimal Selection
To compile sparse programs for distributed environments, communication costs need to be considered in addition to conversion costs. In order to minimize the total execution time, an overall cost model is needed that covers the three factors with a major impact on performance: computation, communication, and compression conversion. The computation and communication costs model the time spent computing and transferring the nonzero elements of sparse arrays, while the conversion cost mentioned before models the time spent converting data from one compression scheme to another.
All these costs are closely related to the number of nonzero elements in the arrays, referred to as the sparsity of the arrays, which can be used to infer the best scheme combination for a sparse program. The sparsity information can be provided by programmers who have knowledge of the program's behavior, or it can be obtained through profiling or the advanced probabilistic inference schemes discussed in []. With sparsity information, the selection process can be formulated as a cost function that estimates the overall execution time of a sparse program on distributed memory environments. This concludes the discussion of compiler support and optimizations for sparse arrays. For more information about compiler techniques for sparse arrays, readers can refer to [, ].
Related Entries
Array Languages
BLAS (Basic Linear Algebra Subprograms)
Data Distribution
Dense Linear System Solvers
Distributed-Memory Multiprocessor
HPF (High Performance Fortran)
Locality of Reference and Parallel Processing
Metrics
Reduce and Scan
Shared-Memory Multiprocessors
Bibliographic Notes and Further Reading
For more information about compiler techniques for array operation synthesis, readers can refer to papers [–]. Among them, the work in [, ] focuses on compiling array operations supported by Fortran 90. Besides the synthesis framework, these papers also provide solutions to the performance anomalies incurred by the presence of common subexpressions and one-to-many array operations. In the follow-up work [], the synthesis framework is extended to synthesize HPF array operations on distributed memory environments, where communication issues are addressed.
For more information about compiler support and optimizations for sparse programs, readers can refer to papers [, ]. The work in [] supports Fortran 90 array operations for sparse arrays on parallel environments, considering both compression schemes and distribution schemes for multidimensional sparse arrays. This work also provides a complete complexity analysis of the sparse array implementations and reports that the complexity is proportional to the number of nonzero elements in the sparse arrays, which is consistent with the conventional design criteria for sparse algorithms and data structures.
The support of sparse arrays and operations in [] is divided into a two-level implementation: in the low-level implementation, a sparse array needs to be specified with compression and distribution schemes; in the high-level implementation, all intrinsic functions and operations are overloaded for sparse arrays, and compression and distribution details are hidden in the implementations. In the work [], a compilation scheme is proposed to transform high-level representations into low-level implementations, taking into account the three costs of computation, communication, and conversion.
Besides the compiler techniques mentioned in this entry, there is a large body of literature on compiler techniques for dense and sparse array programs, not limited to Fortran 90. Among this literature is a series of papers on compiling ZPL [–], a language that defines a concise representation for describing data-parallel computations. In [], an approach similar to array operation synthesis is proposed to achieve the same goal through loop fusion and array contraction. For other sparse program optimizations, there are also several research efforts on Matlab [–]. One design and implementation of sparse array support in Matlab is provided in [], and related compiler techniques for Matlab sparse arrays are discussed in [, ].
Bibliography
. Press WH, Teukolsky SA, Vetterling WT, Flannery BP () Numerical recipes in Fortran 90: the art of parallel scientific computing. Cambridge University Press, New York
. Hwang GH, Lee JK, Ju DC () An array operation synthesis scheme to optimize Fortran 90 programs. ACM SIGPLAN Notices (ACM PPoPP Issue)
. Hwang GH, Lee JK, Ju DC () A function-composition approach to synthesize Fortran 90 array operations. J Parallel Distrib Comput
. Hwang GH, Lee JK, Ju DC () Array operation synthesis to optimize HPF programs on distributed memory machines. J Parallel Distrib Comput
. Chang RG, Chuang TR, Lee JK () Parallel sparse supports for array intrinsic functions of Fortran 90. J Supercomputing ():–
. Chang RG, Chuang TR, Lee JK () Support and optimization for parallel sparse programs with array intrinsics of Fortran 90. Parallel Comput
. Lin C, Snyder L () ZPL: an array sublanguage. th International Workshop on Languages and Compilers for Parallel Computing, Portland
. Chamberlain BL, Choi S-E, Christopher Lewis E, Lin C, Snyder L, Weathersby D () Factor-join: a unique approach to compiling array languages for parallel machines. th International Workshop on Languages and Compilers for Parallel Computing, San Jose, California
. Christopher Lewis E, Lin C, Snyder L () The implementation and evaluation of fusion and contraction in array languages. International Conference on Programming Language Design and Implementation, San Diego
. Shah V, Gilbert JR () Sparse matrices in Matlab*P: design and implementation. th International Conference on High Performance Computing, Springer, Heidelberg
. Buluç A, Gilbert JR () On the representation and multiplication of hypersparse matrices. nd IEEE International Symposium on Parallel and Distributed Processing, Miami, Florida
. Buluç A, Gilbert JR () Challenges and advances in parallel sparse matrix-matrix multiplication. International Conference on Parallel Processing, Portland
Asynchronous Iterations
Asynchronous Iterative Algorithms
Asynchronous Iterative Algorithms
Giorgos Kollias, Ananth Y. Grama, Zhiyuan Li
Purdue University, West Lafayette, IN, USA
Synonyms
Asynchronous iterations; Asynchronous iterative computations
Definition
In iterative algorithms, a grid of data points is updated iteratively until some convergence criterion is met. The update of each data point depends on the latest updates of its neighboring points. Asynchronous iterative algorithms are a class of parallel iterative algorithms that relax strict data dependencies, and hence do not require the latest updates when these are not ready, while still ensuring convergence. Such relaxation may result in the use of inconsistent data, which may lead to an increased iteration count and hence more computational operations. On the other hand, the time spent waiting for the latest updates performed on remote processors may be reduced. Where waiting time dominates the computation, a parallel program based on an asynchronous algorithm may outperform its synchronous counterparts.
Discussion
Introduction
Scaling application performance to a large number of processors must overcome challenges stemming from high communication and synchronization costs. While these costs have improved over the years, a corresponding increase in processor speeds has tempered the impact of these improvements. Indeed, the speed gap between arithmetic and logical operations on the one hand and memory access and message passing operations on the other has a significant impact even on parallel programs executing on a tightly coupled shared-memory multiprocessor implemented on a single semiconductor chip.
System techniques such as prefetching and multithreading use concurrency to hide communication and
synchronization costs. Application programmers can
complement such techniques by reducing the number of communication and synchronization operations
when they implement parallel algorithms. By analyzing the dependencies among computational tasks, the
programmer can determine a minimal set of communication and synchronization points that are sufficient for
maintaining all control and data dependencies embedded in the algorithm. Furthermore, there often exist
several different algorithms to solve the same computational problem. Some may perform more arithmetic
operations than others but require less communication and synchronization. A good understanding of the
available computing system in terms of the tradeoff
between arithmetic operations versus the communication and synchronization cost will help the programmer
select the most appropriate algorithm.
Going beyond the implementation techniques mentioned above requires the programmer to find ways to
relax the communication and synchronization requirement in specific algorithms. For example, an iterative
algorithm may perform a convergence test in order to
determine whether to start a new iteration. To perform such a test often requires gathering data which
are scattered across different processors, incurring the
communication overhead. If the programmer is familiar with the algorithm’s convergence behavior, such a
convergence test may be skipped until a certain number of iterations have been executed. To further reduce
the communication between different processors, algorithm designers have also attempted to find ways to
relax the data dependencies implied in conventional
parallel algorithms such that the frequency of communication can be substantially reduced. The concept
of asynchronous iterative algorithms, which dates back
over  decades, was developed as a result of such attempts.
With the emergence of parallel systems with tens of thousands of processors and the deep memory hierarchy accessible to each processor, asynchronous iterative algorithms have recently generated a new level of interest.
Motivating Applications
Among the most computation-intensive applications currently solved using large-scale parallel platforms are iterative linear and nonlinear solvers, eigenvalue solvers, and particle simulations. Typical time-dependent simulations based on iterative solvers involve multiple levels at which asynchrony can be exploited. At the lowest level, the kernel operations in these solvers are sparse matrix-vector products and vector operations. In a graph-theoretic sense, a sparse matrix-vector product can be thought of as edge-weighted accumulations at nodes in a graph, where nodes correspond to rows and columns and edges correspond to matrix entries. Indeed, this view of a matrix-vector product forms the basis for parallelization using graph partitioners. Repeated matrix-vector products, say, to compute A^(n) y while solving Ax = b, require synchronization between accumulations. However, it can be shown that, within some tolerance, relaxing strict synchronization still maintains the convergence guarantees of many solvers, although the actual iteration count may be larger. Similarly, vector operations for computing residuals and intermediate norms can also relax synchronization requirements. At the next higher level, if the problem is nonlinear in nature, a quasi-Newton scheme is generally used. As before, convergence can be guaranteed even when the Jacobian solves are not exact. Finally, in time-dependent explicit schemes, some time skew can be tolerated in typical simulations, provided global invariants are maintained.
Particle systems exhibit behavior similar to that described above. Spatial hierarchies have been well explored in this context to derive fast methods such as the Fast Multipole Method and the Barnes-Hut method. Temporal hierarchies have also been explored, albeit to a lesser extent. Temporal hierarchies represent one form of relaxed synchrony, since the entire system does not evolve in lockstep. Informally, in a synchronous system with an explicit scheme, the state of a particle is determined by the state of the system in the immediately preceding time step(s). Under relaxed models of synchrony, one can constrain the system state to be determined by the prior state of the system within a prescribed time window. So long as system invariants such as energy are maintained, this approach is valid for many applications. For example, in protein folding and molecular docking, the objective is to find minimum energy states, and it has been shown to be possible to converge to true minimum energy states under such relaxed models of synchrony.
While these are two representative examples from scientific domains, nonscientific algorithms lend themselves to relaxed synchrony just as well. For example, in information retrieval, PageRank algorithms can tolerate relaxed dependencies in iterative computations [].
Iterative Algorithms
An iterative algorithm is typically organized as a series of steps essentially of the form
x(t + 1) ← f(x(t))
where the operator f(⋅) is applied to some data x(t) to produce new data x(t + 1). Here the integer t counts the number of steps, starting with x(0), and captures the notion of time. Given certain properties of f(⋅), for example that it contracts (or pseudo-contracts) and has a unique fixed point, an iterative algorithm is guaranteed to produce at least an approximation, within a prescribed tolerance, to the solution of the fixed-point equation x = f(x), although the exact number of steps needed cannot be known in advance.
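The iteration scheme can be illustrated with a scalar example. The sketch below assumes f = cos, which is a contraction near its unique fixed point, and uses the difference between successive iterates as the stopping test; the function name `iterate` and the tolerance are choices made for this illustration only.

```python
import math

def iterate(f, x0, tol=1e-10, max_steps=1000):
    """Apply x <- f(x) until successive iterates differ by at most tol."""
    x = x0
    for t in range(1, max_steps + 1):
        x_new = f(x)
        if abs(x_new - x) <= tol:
            return x_new, t            # approximate fixed point and steps used
        x = x_new
    raise RuntimeError("no convergence within max_steps")

x_star, steps = iterate(math.cos, 1.0)
```

The number of steps returned depends on the starting point and the tolerance, which reflects the remark that the step count cannot be known in advance.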
In the simplest computation scenario, data x reside
in some Memory Entity (ME) and operator f (⋅) usually
consists of both operations to be performed by some
Processing Entity (PE) and parameters (i.e., data)
hosted by some ME. For example, if the iteration step
is a left matrix-vector multiplication x ← Ax, x is considered to be the “data” and the matrix itself with the
set of incurred element-by-element multiplications and
subsequent additions to be the “operator”, where the
“operator” part also contains the matrix elements as its
parameters.
However, in practice this picture can get complicated:
● Ideally, the x data and f(⋅) parameters should be readily available to the f(⋅) operations. This means that one would like all data to fit in the register file of a typical PE, or at least in its caches (data locality). This is not possible except for very small problems which, most of the time, are not of interest to researchers. Data typically reside either in the main memory or, for larger problems, in a slow secondary memory. Designing data access strategies that feed the PE at the fastest possible rate, in effect optimizing data flow across the memory hierarchy while preserving the semantics of the algorithm, is important in this respect. Modern multicore architectures, in which many PEs share some levels of the memory hierarchy and compete for the interconnection paths, introduce new dimensions of complexity in the effective mapping of an iterative algorithm.
● There are cases in which data x are assigned to more than one PE and f(⋅) is decomposed. This can be the result of the sheer volume of the data, of the parameters of f(⋅), or of the computational complexity of applying f(⋅), which may perform a large number of operations in each iteration on each data unit. In some rarer cases the data itself or the operator remains distributed by nature throughout the computation. Here, the PEs can be the cores of a single processor, the nodes (each consisting of multiple cores and processors) in a shared-memory machine, or similar nodes in networked machines. The latter can be viewed as PEs accessing a network of memory hierarchies of multiple machines, i.e., an extra level of ME interconnection paths.
Synchronous Iterations and Their Problems
The decomposition of x and f(⋅), as mentioned above, necessitates the synchronization of all PEs involved at each step. Typically, each fragment fi(⋅) of the decomposed operator may need many, if not all, of the fragments {xj} of the newly computed data x for an iteration step. So, to preserve the exact semantics, it should wait for these data fragments to become available, in effect causing its PE to synchronize with all the PEs executing the operator fragments that update the required {xj}. Waiting for the availability of data at the end of each iteration makes these iterations “synchronous,” and this practice preserves the semantics. In networked environments, such synchronous iterations can be coupled either with “synchronous communications” (i.e., synchronous global communication at the end of each step) or with “asynchronous communications” that overlap computation and communication (i.e., asynchronously sending new data fragments as soon as they are locally produced and blocking on the receipt of new data fragments at the end of each step) [].
Synchronization between PEs is crucial to enforcing correctness of the iterative algorithm implementation but it also introduces idle synchronization phases
between successive steps. For a moment consider the
extreme case of a parallel iterative computation where
for some PE the following two conditions happen to
hold:
. It completes its iteration step much faster than the
other PEs.
. Its input communication links from the other PEs
are very slow.
It is evident that this PE will suffer a very long
idle synchronization phase, i.e., time spent doing nothing (c.f. Fig. ). Along the same line of thought, one
could also devise the most unfavorable instances of
computation/communication time assignments for the
other PEs which suffer lengthy idle synchronization
phases. It is clear that the synchronization penalty for
iterative algorithms can be severe.
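The penalty can be made concrete with a toy timing model. The following Python sketch assumes, purely for illustration, that each PE has a constant compute time per step and a constant delay on its incoming links, and that a PE may start its next step only after finishing the current one and receiving the fragments of all other PEs. The function name and the numbers are hypothetical and do not come from the discussion above.

```python
def synchronous_idle_times(compute, link_delay, steps):
    """Return total idle time per PE over `steps` synchronous iterations.

    compute[i]    : time PE i needs for one local step (assumed constant)
    link_delay[i] : delay of messages arriving at PE i (assumed constant)
    """
    n = len(compute)
    start = [0.0] * n          # time at which each PE starts the current step
    idle = [0.0] * n
    for _ in range(steps):
        finish = [start[i] + compute[i] for i in range(n)]
        new_start = []
        for i in range(n):
            # PE i must wait for the fragments of every other PE to arrive.
            arrivals = [finish[j] + link_delay[i] for j in range(n) if j != i]
            ready = max([finish[i]] + arrivals)
            idle[i] += ready - finish[i]
            new_start.append(ready)
        start = new_start
    return idle

# PE 0 is fast but has slow input links; PE 1 is slow with fast links.
idle = synchronous_idle_times(compute=[1.0, 4.0], link_delay=[3.0, 0.5], steps=3)
```

In this model the fast PE with slow in-links accumulates almost all of the idle time, which is the extreme case described above.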
Asynchronous Iterative Algorithms. Fig.  Very long idle periods due to synchronization between A and B executing
synchronous iterations. Arrows show exchange messages, the horizontal axis is time, boxes with numbers denote iteration
steps, and dotted boxes denote idle time spent in the synchronization. Note that such boxes will be absent in
asynchronous iterations
Asynchronous Iterations: Basic Idea and
Convergence Issues
The essence of asynchronous iterative algorithms is
to reduce synchronization penalty by simply eliminating the synchronization phases in iterative algorithms
described above. In other words, each PE is permitted
to proceed to its next iteration step without waiting for
updated data from other PEs: it uses newly arrived
data if available and otherwise reuses the old data. This
approach, however, fails to retain the temporal ordering of the operations in the original algorithm. The
established convergence properties are usually altered
as a consequence. The convergence behavior of the new
asynchronous algorithm is much harder to analyze than
the original [], the main difficulty being the introduction of multiple independent time lines in practice,
one for each PE. Even though the convergence analysis uses a global time line, the state space does not
involve studying the behavior of the trajectory of only
one point (i.e., the shared data at the end of each iteration as in synchronous iterations) but rather of a set of
points, one for each PE. An extra complexity is injected
by the flexibility of the local operator fragments fi to be
applied to rather arbitrary combinations of data components from the evolution history of their argument
lists. Although some assumptions may underlie a particular asynchronous algorithm so that certain scenarios of
reusing old data components are excluded, a PE is free
to use, in a later iteration, data components older than
those of the current one. In other words, one can have
non-FIFO communication links for local data updates
between the PEs.
The most general strategy for establishing convergence of asynchronous iterations for a certain problem
is to move along the lines of the Asynchronous Convergence Theorem (ACT) []. One tries to construct a
sequence of boxes (Cartesian products of intervals where
data components are known to lie) with the property
that the image of each box under f (⋅) will be contained
in the next box in this sequence, given that the original (synchronous) iterative algorithm converges. This
nested box structure ensures that as soon as all local
states enter such a box, no communication scenario can
make any of the states escape to an outer box (deconverge), since updating variables through communication is essentially a coordinate exchange. On the other
hand the synchronous convergence assumption guarantees that local computation also cannot produce any
state escaping the current box, its only effect being a
possible inclusion of some of the local data components in the respective intervals defining some smaller
(nested) box. The argument here is that if communications are nothing but state transitions parallel to box
facets and computations just drive some state coordinates in the facet of some smaller (nested) box, then
their combination will continuously drive the set of
local states to successively smaller boxes and ultimately
toward convergence within a tolerance.
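In symbols, the nested-box construction can be sketched as follows. The notation is introduced here for illustration only and is not taken from the text: X(k) denotes the k-th box, X_i(k) its i-th factor, and p the number of data fragments. The statement paraphrases the usual form of the theorem and omits technical details.

```latex
\begin{align*}
  &\text{Nesting:}
    && \cdots \subset X(k+1) \subset X(k) \subset \cdots \subset X(0), \\
  &\text{Box condition:}
    && X(k) = X_1(k) \times X_2(k) \times \cdots \times X_p(k), \\
  &\text{Synchronous convergence:}
    && x \in X(k) \;\Longrightarrow\; f(x) \in X(k+1).
\end{align*}
```

Under these conditions, once all local states lie in X(k) they remain there, and after every fragment has been recomputed and communicated at least once they lie in X(k+1). If, in addition, the boxes shrink to the fixed point of f, the asynchronous iterates converge to it.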
It follows that, since a synchronous iteration is just a
special case in this richer framework, its asynchronous
counterpart will typically not converge under every asynchronous scenario when only minimal assumptions are made. Two broad
classes of asynchronous scenarios are usually identified:
Totally asynchronous The only constraint here is that, in
the local computations, data components are ultimately updated. Thus, no component ever becomes
arbitrarily old as the global time line progresses,
potentially indefinitely, assuming that any local
computation marks a global clock tick. Under such a
constraint, ACT can be used to prove, among other
things, that the totally asynchronous execution of
the classical iteration x ← Ax + b, where x and
b are vectors and A a matrix distributed row-wise,
converges to the correct solution provided that the
spectral radius of the modulus matrix is less than
unity (ρ(|A|) < 1). This is a very interesting result
from a practical standpoint as it directly applies to
classical stationary iterative algorithms (e.g., Jacobi)
with nonnegative iteration matrix, those commonly
used for benchmarking the asynchronous model
implementations versus the synchronous ones.
Partially asynchronous The constraint here is stricter:
each PE must perform at least one local iteration within the next B global clock ticks (for B
a fixed integer) and it must not use components
that are computed B ticks prior or older. Moreover, for the locally computed components, the most
recent must always be used. Depending on how
restricted the value of B may be, partially asynchronous algorithms can be classified in two types,
Type I and Type II. The Type I is guaranteed to
converge for arbitrarily chosen B. An interesting
case of Type I has the form x ← Ax where x is
a vector and A is a column-stochastic matrix. This
case arises in computing stationary distributions of
Markov chains [] like the PageRank computation. Another interesting case is found in gossip-like
algorithms for the distributed computation of statistical quantities, where A = A(t) is a time-varying
row-stochastic matrix []. For Type II partially
asynchronous algorithms, B is restricted to be a
function of the structure and the parameters of the
linear operator itself to ensure convergence. Examples include gradient and gradient projection algorithms, e.g., those used in solving optimization and
constrained optimization problems respectively. In
these examples, unfortunately, the step parameter in
the linearized operator must vary in inverse proportion to B in order to attain convergence. This
implies that, although choosing large steps could
accelerate the computation, the corresponding B
may be too small to enforce in practice.
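The behavior of the iteration x ← Ax + b under stale data can be explored with a small sequential simulation. The sketch below is illustrative only and makes several assumptions that are not in the text: one component is updated per global clock tick in round-robin order, each foreign component is read from a state that is a random number of ticks old (at most B − 1), and the matrix, vector, and function name are made up. Bounded staleness is a special case of the totally asynchronous model, so convergence is expected here because ρ(|A|) < 1.

```python
import random

def async_linear_iteration(A, b, ticks, B, seed=0):
    """Simulate x <- Ax + b where each update may read stale components.

    One component is updated per global clock tick (round-robin), and every
    foreign component it reads is taken from a state at most B - 1 ticks old.
    """
    rng = random.Random(seed)
    n = len(b)
    history = [[0.0] * n]                    # history[t] = state after tick t
    for t in range(ticks):
        i = t % n                            # component updated at this tick
        current = history[-1]
        acc = b[i]
        for j in range(n):
            if j == i:
                xj = current[j]              # own component: always the newest
            else:
                delay = rng.randrange(B)     # staleness in 0 .. B-1 ticks
                xj = history[max(0, len(history) - 1 - delay)][j]
            acc += A[i][j] * xj
        new_state = list(current)
        new_state[i] = acc
        history.append(new_state)
    return history[-1]

A = [[0.2, -0.3], [0.4, 0.1]]    # row sums of |A| are 0.5, so rho(|A|) <= 0.5 < 1
b = [1.0, 2.0]
x = async_linear_iteration(A, b, ticks=400, B=5)
residual = max(abs(sum(A[i][j] * x[j] for j in range(2)) + b[i] - x[i])
               for i in range(2))
```

The residual of the fixed point equation x = Ax + b is used as the error measure; changing the seed or B alters the trajectory but, for this matrix, not the limit.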
Implementation Issues
Termination Detection
In a synchronous iterative algorithm, global convergence can be easily decided. It can consist of a local convergence detection (e.g., to make sure a certain norm is
below a local threshold in a PE) which triggers global
convergence tests in all subsequent steps. The global
tests can be performed by some PE acting as the monitor
to check, at the end of each step, if the error (determined by substituting the current shared data x in the
fixed point equation) gets below a global threshold. At
that point, the monitor can signal all PEs to terminate.
However, in an asynchronous setting, local convergence
detection at some PE does not necessarily mean that
the computation is near the global convergence, due to
the fact that such a PE may compute too fast and not
receive much input from its possibly slow communication in-links. Thus, in effect, there could also be the
case in which each PE computes more or less in isolation from the others an approximation to a solution
to the less-constrained fixed point problem concerning its respective local operator fragment fi (⋅) but not
f (⋅). Hence, local convergence detection may trigger
the global convergence tests too early. Indeed, experiments have shown long phases in which PEs continually
enter and exit their “locally converged” status. With the
introduction of an extra waiting period for the local
convergence status to stabilize, one can avoid the premature trigger of the global convergence detection procedure which may be either centralized or distributed.
Taking centralized global detection for example, there
is a monitor PE which decides global convergence and
notifies all iterating PEs accordingly. A practical strategy is to introduce two integer “persistence parameters” (a localPersistence and a globalPersistence). If
local convergence is preserved for more than localPersistence iterations, the PE will notify the monitor. As
soon as the monitor finds out that all the PEs remain
in locally converged status for more than globalPersistence of their respective checking cycles, it signals
global convergence to all the PEs.
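The two persistence parameters can be sketched in Python as a pair of counters. The class and method names below are hypothetical, message passing between the PEs and the monitor is replaced by direct method calls, and the sketch assumes that a PE keeps reporting its status in every checking cycle once it has started to do so.

```python
class LocalDetector:
    """Runs inside an iterating PE; reports once convergence has persisted."""

    def __init__(self, threshold, local_persistence):
        self.threshold = threshold
        self.local_persistence = local_persistence
        self.streak = 0                      # consecutive locally converged iterations

    def update(self, local_error):
        """Return True when the PE should report 'locally converged'."""
        self.streak = self.streak + 1 if local_error < self.threshold else 0
        return self.streak > self.local_persistence


class Monitor:
    """Centralized monitor deciding global convergence."""

    def __init__(self, num_pes, global_persistence):
        self.global_persistence = global_persistence
        self.streaks = [0] * num_pes         # consecutive positive reports per PE

    def report(self, pe, locally_converged):
        """Record one checking cycle of PE `pe`; return True on global convergence."""
        self.streaks[pe] = self.streaks[pe] + 1 if locally_converged else 0
        return all(s > self.global_persistence for s in self.streaks)


detector = LocalDetector(threshold=1e-3, local_persistence=2)
flags = [detector.update(e) for e in [1e-4, 1e-4, 1.0, 1e-4, 1e-4, 1e-4]]

monitor = Monitor(num_pes=2, global_persistence=1)
decisions = [monitor.report(pe, True) for pe in [0, 1, 0, 1]]
```

A single iteration with a large local error resets the streak, so a PE that oscillates in and out of its locally converged status never reaches the monitor.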
Another solution, with a simple proof of correctness, is to embed a conservative iteration number in
messages, defined as the smallest iteration number of
all those messages used to construct the argument list of
the current local computation, incremented by one [].
This is strongly reminiscent of the ideas in the ACT and
practically the iteration number is the box counter in
that context, where nested boxes are marked by successively greater such numbers. In this way, one can precompute some bound for the target iteration number for
a given tolerance, with simple local checks and minimal
communication overhead. A more elaborate scheme for
asynchronous computation termination detection in a
distributed setting can be found in the literature [].
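The bookkeeping behind the conservative iteration number is small enough to state as code. In the following sketch the function names are hypothetical, and including the number attached to the PE's own data in the minimum is an assumption made here; the text only mentions the messages used to build the argument list.

```python
def conservative_iteration_number(own_number, received_numbers):
    """Number to attach to the result of the current local computation.

    own_number       : number carried by the PE's own data used in this step
    received_numbers : numbers carried by the messages whose data were used
    """
    return min([own_number] + list(received_numbers)) + 1


def may_terminate(number, target):
    """Local check against a precomputed target iteration number."""
    return number >= target


tag = conservative_iteration_number(own_number=7, received_numbers=[3, 5])
```

A PE that has iterated many times on stale inputs therefore does not advance its number, which is what makes the count conservative.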
Asynchronous Communication
The essence of asynchronous algorithms is not to let
local computations be blocked by communications with
other PEs. This nonblocking feature is easier to implement on a shared memory system than on a distributed
memory system. This is because on a shared memory
system, computation is performed either by multiple
threads or multiple processes which share the physical address space. In either case, data communication
can be implemented by using shared buffers. When a
PE sends out a piece of data, it locks the data buffer
until the write completes. If another PE tries to receive
the latest data but finds the data buffer locked, instead
of blocking itself, the PE can simply continue iterating
the computation with its old, local copy of the data.
On the other hand, to make sure the receive operation retrieves the data in the buffer in its entirety,
the buffer must also be locked until the receive is
complete. The sending PE hence may also find itself
locked out of the buffer. Conceptually, one can give
the send operation the privilege to preempt the receive
operation, and the latter may erase the partial data
just retrieved and let the send operation deposit the
most up to date data. However, it may be easier to
just let the send operation wait for the lock, especially
because the expected waiting will be much shorter than
what is typically experienced in a distributed memory
system.
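The shared-buffer scheme just described, in its simpler variant where the sender waits for the lock, might look as follows in Python. The class and method names are made up for illustration, and the sketch uses threads within one process as stand-ins for PEs.

```python
import threading

class SharedBuffer:
    """A data fragment shared between a sending PE and a receiving PE."""

    def __init__(self, data):
        self._lock = threading.Lock()
        self._data = list(data)

    def send(self, data):
        # The sender simply waits for the lock; the wait is expected to be short.
        with self._lock:
            self._data = list(data)

    def try_receive(self, old_copy):
        """Return the latest data if the buffer is free, else the caller's old copy."""
        if not self._lock.acquire(blocking=False):
            return old_copy                  # buffer busy: keep iterating on old data
        try:
            return list(self._data)
        finally:
            self._lock.release()


buffer = SharedBuffer([1.0, 2.0])
local = buffer.try_receive([0.0, 0.0])       # buffer is free: fresh data are copied
buffer.send([3.0, 4.0])
```

The receiver never blocks: a failed non-blocking acquire is treated as "no new data", which is exactly the reuse of old data that the asynchronous model permits.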
On distributed memory systems, the communication protocols inevitably perform certain blocking operations. For example, on some communication layer, a
send operation will be considered incomplete until the
receiver acknowledges the safe receipt of the data. Similarly, a receive operation on a certain layer will be
considered incomplete until the arriving data become
locally available. From this point of view, a communication operation in effect synchronizes the communicating partners, which is not quite compatible with the
asynchronous model. For example, the so-called nonblocking send and probe operations assume the existence of a hardware device, namely, the network interface
card that is independent of the PE. However, in order
to avoid expensive copy operations between the application and the network layers, the send buffer is usually
shared by both layers, which means that before the PE
can reuse the buffer, e.g., while preparing for the next
send in the application layer, it must wait until the buffer
is emptied by the network layer. (Note that the application layer cannot force the network layer to undo its
ongoing send operation.) This practically destroys the
asynchronous semantics. To overcome this difficulty,
computation and communication must be decoupled,
e.g., implemented as two separate software modules,
such that the blocking due to communication does not
impede the computation. Mechanisms must be set up to
facilitate fast data exchange between these two modules,
e.g., by using another data buffer to copy data between
them. Now, the computation module will act like a PE
sending data in a shared memory system while the communication module acts like a PE receiving data. One
must also note that multiple probes by a receiving PE
are necessary since multiple messages with the same
“envelope” might have arrived during a local iteration.
This again has a negative impact on the performance
of an asynchronous algorithm, because the receiving PE
must find the most recent of these messages and discard
the rest.
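One way to picture the decoupling is a mailbox that sits between the two modules and retains only the newest message per envelope. The sketch below is a hypothetical illustration and not the interface of any message passing library: the class and function names are invented, each message is assumed to carry an iteration number, and a Python list stands in for real network receives.

```python
import threading

class LatestMailbox:
    """Keeps only the newest message per envelope (source, tag)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}                     # envelope -> (iteration, payload)

    def deposit(self, envelope, iteration, payload):
        """Called by the communication module for every arriving message."""
        with self._lock:
            held = self._slots.get(envelope)
            if held is None or iteration > held[0]:
                self._slots[envelope] = (iteration, payload)   # overwrite older data

    def take(self, envelope):
        """Called by the computation module; returns None if nothing new arrived."""
        with self._lock:
            return self._slots.pop(envelope, None)


def communication_module(incoming, mailbox):
    """Stand-in for a receiver thread; `incoming` replaces real network receives."""
    for envelope, iteration, payload in incoming:
        mailbox.deposit(envelope, iteration, payload)


mailbox = LatestMailbox()
arrivals = [(("PE1", "x"), 4, [0.4]), (("PE1", "x"), 6, [0.6]), (("PE1", "x"), 5, [0.5])]
receiver = threading.Thread(target=communication_module, args=(arrivals, mailbox))
receiver.start()
receiver.join()                              # joined only to make the demo deterministic
newest = mailbox.take(("PE1", "x"))
```

The computation module pays for one short lock acquisition instead of probing repeatedly. Because `take` empties the slot, a late, older message could still be deposited afterwards, so the computation module would also need to compare iteration numbers against the data it already holds.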
The issues listed above have been discussed in
the context of the MPI library []. Communication
libraries such as Jace [] implemented in Java or its
C++ port, CRAC [], address some of the shortcomings
of “general-purpose” message passing libraries. Internally they use separate threads for the communication
and computation activities with synchronized queues of
buffers for the messages and automatic overwriting of
the older ones.
Recent Developments and Potential Future
Directions
Two-Stage Iterative Methods and Flexible
Communications
There exist cases in which an algorithm can be
restructured so as to introduce new opportunities for
parallelization and thus yield more places to inject asynchronous semantics. Notable examples are two-stage
iterative methods for solving linear systems of equations
of the form Ax = b. Matrix A is written as A = M − N
with M being a block diagonal part. With such splitting, the iteration reads as Mx(t + 1) ← Nx(t) + b,
and its computation can be decomposed row-wise and
distributed to the PEs, i.e., a block Jacobi iteration.
However, to get a new data fragment xi(t + 1) at
each iteration step at some PE, a new linear system
of equations must be solved, which can be done by
a new splitting of the local Mi part, resulting in a nested
iteration. In a synchronous version, new data fragments are exchanged only at the end of each iteration to
coordinate execution. The asynchronous model applied
in this context [] not only relaxes the timing in
xi(t + 1) exchanges in the synchronous model, but
it also introduces asynchronous exchanges even during the nested iterations (toward xi(t + 1)) and their
immediate use in the computation and not at the
end of the respective computational phases. This idea
was initially used for solving systems of nonlinear
equations [].
Asynchronous Tiled Iterations (or Parallelizing
Data Locality)
As mentioned in Sect. , the PEs executing the operator and the MEs hosting its parameters and data must
have fast interconnection paths. Restructuring the computation so as to maximize reuse of cached data across
iterations has been studied in the past []. These are
tiling techniques [] applied to chunks of iterations
(during which convergence is not tested) and coupled
with strategies for breaking the dependencies so induced. In this way data locality is considerably increased
but opportunities for parallelization are confined only
within the current tile data and not the whole data set
as is the general case, e.g., in iterative stencil computations. Furthermore, additional synchronization barriers, whose number scales with the number of tiles,
are introduced. In a very recent work [], the asynchronous execution of tiled chunks is proposed for
regaining the parallelization degree of nontiled iterations: each PE is assigned a set of tiles (its sub-grid) and
performs the corresponding loops without synchronizing with the other PEs. Only the convergence test at the
end of such a phase enforces synchronization. So on
the one hand, locality is preserved since each PE traverses its current tile data only and on the other hand
all available PEs execute concurrently in a similar fashion without synchronizing, resulting in a large degree of
parallelization.
When to Use Asynchronous Iterations?
Asynchronism can enter an iteration in both natural
and artificial ways. In naturally occurring asynchronous
iterations, PEs are either asynchronous by default or
cannot acceptably be synchronized; computation over sensor networks and distributed routing over data networks
are such examples.
However, the asynchronous execution of a parallelized algorithm enters artificially, in the sense that
most of the time it comes as a variation of the synchronous parallel port of a sequential one. Typically,
there is a need to accelerate the sequential algorithm
and as a first step it is parallelized, albeit in synchronous
mode in order to preserve semantics. Next, in the presence of large synchronization penalties (when PEs are
heterogeneous both in terms of computation and communication as in some Grid installations) or extra flexibility needs (such as asynchronous computation starts,
dynamic changes in data or topology, non-FIFO communication channels), asynchronous implementations
are evaluated. Note that in all those cases of practical interest, networked PEs are implied. The interesting
aspect of [, ] is that it broadens the applicability of
the asynchronous paradigm in shared memory setups
for yet another purpose, which is to preserve locality
but without losing parallelism itself as a performance
boosting strategy.
Related Entries
Memory Models
Synchronization
Bibliographic Notes and Further
Reading
The asynchronous computation model, as an alternative to the synchronous one, has a life span of
almost  decades. It started with its formal description in the pioneering work of Chazan and Miranker
[] and its first experimental investigations by Baudet
[] back in the s. Perhaps the most extensive and systematic treatment of the subject is contained in a book by Bertsekas and Tsitsiklis in the
s [], particularly in its closing three chapters. During the last  decades an extensive literature has been accumulated [, , , , , , ,
, , , ]. Most of these works explore theoretical extensions and variations of the asynchronous
model coupled with very specific applications. However in the most recent ones, focus has shifted to more
practical, implementation-level aspects [, , ],
since the asynchronous model seems appropriate for
the realization of highly heterogeneous, Internet-scale
computations [, , ].
Bibliography
. Bahi JM, Contassot-Vivier S, Couturier R, Vernier F () A
decentralized convergence detection algorithm for asynchronous
parallel iterative algorithms. IEEE Trans Parallel Distrib Syst
():–
. Bahi JM, Contassot-Vivier S, Couturier R () Asynchronism
for iterative algorithms in a global computing environment. In:
th Annual International Symposium on high performance computing systems and applications (HPCS’). IEEE, Moncton,
Canada, pp –
. Bahi JM, Contassot-Vivier S, Couturier R () Coupling
dynamic load balancing with asynchronism in iterative algorithms on the computational Grid. In: th International Parallel
and Distributed Processing Symposium (IPDPS’), p . IEEE,
Nice, France
. Bahi JM, Contassot-Vivier S, Couturier R () Performance
comparison of parallel programming environments for implementing AIAC algorithms. In: th International Parallel and
Distributed Processing Symposium (IPDPS’). IEEE, Santa
Fe, USA
. Bahi JM, Contassot-Vivier S, Couturier R () Parallel iterative algorithms: from sequential to Grid computing. Chapman &
Hall/CRC, Boca Raton, FL
. Bahi JM, Domas S, Mazouzi K () Combination of Java and
asynchronism for the Grid: a comparative study based on a parallel power method. In: th International Parallel and Distributed
Processing Symposium (IPDPS ’), pp a, . IEEE, Santa Fe,
USA, April 
. Bahi JM, Domas S, Mazouzi K (). Jace: a Java environment for distributed asynchronous iterative computations. In:
th Euromicro Conference on Parallel, Distributed and NetworkBased Processing (EUROMICRO-PDP’), pp –. IEEE,
Coruna, Spain
. Baudet GM () Asynchronous iterative methods for multiprocessors. JACM ():–
. El Baz D () A method of terminating asynchronous iterative algorithms on message passing systems. Parallel Algor Appl
:–
. El Baz D () Communication study and implementation analysis of parallel asynchronous iterative algorithms on message
passing architectures. In: Parallel, distributed and network-based
processing, . PDP ’. th EUROMICRO International Conference, pp –. Weimar, Germany
. El Baz D, Gazen D, Jarraya M, Spiteri P, Miellou JC () Flexible
communication for parallel asynchronous methods with application to a nonlinear optimization problem. D’Hollander E, Joubert
G et al (eds). In: Advances in Parallel Computing: Fundamentals, Application, and New Directions. North Holland, vol ,
pp –
. El Baz D, Spiteri P, Miellou JC, Gazen D () Asynchronous
iterative algorithms with flexible communication for nonlinear
network flow problems. J Parallel Distrib Comput ():–
. Bertsekas DP () Distributed asynchronous computation of
fixed points. Math Program ():–
. Bertsekas DP, Tsitsiklis JN () Parallel and distributed computation. Prentice-Hall, Englewood Cliffs, NJ
. Blathras K, Szyld DB, Shi Y () Timing models and local
stopping criteria for asynchronous iterative algorithms. J Parallel
Distrib Comput ():–
. Blondel VD, Hendrickx JM, Olshevsky A, Tsitsiklis JN ()
Convergence in multiagent coordination, consensus, and flocking. In: Decision and Control,  and  European Control Conference. CDC-ECC’. th IEEE Conference on,
pp –
. Chazan D, Miranker WL () Chaotic relaxation. J Linear
Algebra Appl :–
. Couturier R, Domas S () CRAC: a Grid environment to solve
scientific applications with asynchronous iterative algorithms. In:
Parallel and Distributed Processing Symposium, . IPDPS
. IEEE International, p –
. Elsner L, Koltracht I, Neumann M () On the convergence
of asynchronous paracontractions with application to tomographic reconstruction from incomplete data. Linear Algebra
Appl :–
. Frommer A, Szyld DB () On asynchronous iterations. J Comput Appl Math (–):–
. Frommer A, Schwandt H, Szyld DB () Asynchronous
weighted additive schwarz methods. ETNA :–
. Frommer A, Szyld DB () Asynchronous two-stage iterative
methods. Numer Math ():–
. Liu L, Li Z () Improving parallelism and locality with
asynchronous algorithms. In: th ACM SIGPLAN Symposium
on principles and practice of parallel programming (PPoPP),
pp –, Bangalore, India
. Lubachevsky B, Mitra D () A chaotic, asynchronous algorithm
for computing the fixed point of a nonnegative matrix of unit
spectral radius. JACM ():–
. Miellou JC, El Baz D, Spiteri P () A new class of asynchronous iterative algorithms with order intervals. Mathematics
of Computation, ():–
. Moga AC, Dubois M () Performance of asynchronous
linear iterations with random delays. In: Proceedings of the
th International Parallel Processing Symposium (IPPS ’),
pp –
. Song Y, Li Z () New tiling techniques to improve cache temporal locality. ACM SIGPLAN Notices ACM SIGPLAN Conf
Program Lang Design Implement ():–
. Spiteri P, Chau M () Parallel asynchronous Richardson
method for the solution of obstacle problem. In: Proceedings of th Annual International Symposium on High Performance Computing Systems and Applications, Moncton, Canada,
pp –
. Strikwerda JC () A probabilistic analysis of asynchronous
iteration. Linear Algebra Appl (–):–
. Su Y, Bhaya A, Kaszkurewicz E, Kozyakin VS () Further
results on convergence of asynchronous linear iterations. Linear
Algebra Appl (–):–
. Szyld DB () Perspectives on asynchronous computations for
fluid flow problems. First MIT Conference on Computational
Fluid and Solid Mechanics, pp –
. Szyld DB, Xu JJ () Convergence of some asynchronous nonlinear multisplitting methods. Num Algor (–):–
. Uresin A, Dubois M () Effects of asynchronism on the convergence rate of iterative algorithms. J Parallel Distrib Comput
():–
. Wolfe M () More iteration space tiling. In: Proceedings of the
 ACM/IEEE conference on Supercomputing, p . ACM,
Reno, NV
. Kollias G, Gallopoulos E, Szyld DB () Asynchronous iterative computations with Web information retrieval structures: The
PageRank case. In: Joubert GR, Nagel WE, et al (Eds). Parallel
Computing: Current and Future issues of High-End computing,
NIC Series. John von Neumann-Institut für Computing, Jülich,
Germany, vol , pp –
. Kollias G, Gallopoulos E () Asynchronous Computation of
PageRank computation in an interactive multithreading environment. In: Frommer A, Mahoney MW, Szyld DB (eds) Web
Information Retrieval and Linear Algebra Algorithms, Dagstuhl
Seminar Proceedings. IBFI, Schloss Dagstuhl, Germany, ISSN:
–
Asynchronous Iterative
Computations
Asynchronous Iterative Algorithms
ATLAS (Automatically Tuned
Linear Algebra Software)
R. Clint Whaley
University of Texas at San Antonio, San Antonio,
TX, USA
Synonyms
Numerical libraries
Definition
ATLAS [–, , ] is an ongoing research project
that uses empirical tuning to optimize dense linear
algebra software. The fruits of this research are embodied in an empirical tuning framework available as
an open source/free software package (also referred
to as “ATLAS”), which can be downloaded from the
ATLAS homepage []. ATLAS generates optimized
libraries which are also often collectively referred to as
“ATLAS,” “ATLAS libraries,” or more precisely, “ATLAStuned libraries.” In particular, ATLAS provides a full
implementation of the BLAS [, , , ] (Basic
Linear Algebra Subprograms) API, and a subset of
optimized LAPACK [] (Linear Algebra PACKage) routines. Because dense linear algebra is rich in operand
reuse, many routines can run tens or hundreds of times
faster when tuned for the hardware than when written
naively. Unfortunately, highly tuned codes are usually
not performance portable (i.e., a code transformation
that helps performance on architecture A may reduce
performance on architecture B).
The BLAS API provides basic building block linear algebra operations, and was designed to help ease
the performance portability problem. The idea was to
design an API that provides the basic computational
needs for most dense linear algebra algorithms, so that
when this API has been tuned for the hardware, all
higher-level codes that rely on it for computation automatically get the associated speedup. Thus, the job of
optimizing a vast library such as LAPACK can be largely
handled by optimizing the much smaller code base
involved in supporting the BLAS API. The BLAS are
split into three “levels” based on how much cache reuse
they enjoy, and thus how computationally efficient they
can be made to be. In order of efficiency, the BLAS levels are: Level 3 BLAS [], which involve matrix–matrix
operations that can run near machine peak, Level 2
BLAS [, ], which involve matrix–vector operations,
and Level 1 BLAS [, ], which involve vector–vector
operations. The Level 1 and 2 BLAS have the same
order of memory references as floating point operations (FLOPS), and so will run at roughly the speed of
memory for out-of-cache operation.
The BLAS were extremely successful as an API,
allowing dense linear algebra to run at near-peak rates
of execution on many architectures. However, with
hardware changing at the frantic pace dictated by
Moore’s Law, it was an almost impossible task for hand
tuners to keep BLAS libraries up-to-date on even those
fortunate families of machines that enjoyed their attentions. Even worse, many architectures did not have anyone willing and able to provide tuned BLAS, which
left investigators with codes that literally ran orders
of magnitude slower than they should, representing
huge missed opportunities for research. Even on systems where a vendor provided BLAS implementations,
license issues often prevented their use (e.g., SUN provided an optimized BLAS for the SPARC, but only
licensed its use for their own compilers, which left
researchers using other languages such as High Performance Fortran without BLAS; this was one of the
original motivations to build ATLAS).
Empirical tuning arose as a response to this need
for performance portability. The idea is simple enough
in principle: Rather than hand-tune operations to the
architecture, write a software framework that can vary
the implementation being optimized (through techniques such as code generation) so that thousands
of inter-related transformation combinations can be
empirically evaluated on the actual machine in question. The framework uses actual timings to discover
which combinations of transformations lead to high
performance on this particular machine, resulting in
portably efficient implementations regardless of architecture. Therefore, instead of waiting months (or even
years) for a hand tuner to do the same thing, the user
need only install the empirical tuning package, which
will produce a highly tuned library in a matter of hours.
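The principle can be illustrated in a few lines of Python. This is a toy sketch under stated assumptions and not how ATLAS itself performs its search: the three dot-product variants, the timing routine, and all names are hypothetical, and the kernels tuned in practice are far lower level. What it does share with the approach described above is that the choice is made from timings taken on the machine at hand rather than from a static model.

```python
import time

def dot_simple(x, y):
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total

def dot_unrolled(x, y):
    """Same operation, unrolled by two."""
    total0 = total1 = 0.0
    n = len(x) - len(x) % 2
    for i in range(0, n, 2):
        total0 += x[i] * y[i]
        total1 += x[i + 1] * y[i + 1]
    if n < len(x):                           # leftover element for odd lengths
        total0 += x[n] * y[n]
    return total0 + total1

def dot_builtin(x, y):
    return sum(a * b for a, b in zip(x, y))

def best_time(func, args, repeats=5):
    """Smallest wall-clock time over several runs, to damp timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

def empirical_search(candidates, args):
    """Time every candidate on this machine and return the fastest one."""
    timings = {f.__name__: best_time(f, args) for f in candidates}
    winner = min(candidates, key=lambda f: timings[f.__name__])
    return winner, timings

x = [float(i % 7) for i in range(20001)]
y = [float(i % 5) for i in range(20001)]
winner, timings = empirical_search([dot_simple, dot_unrolled, dot_builtin], (x, y))
```

Which variant wins depends on the interpreter and the hardware, which is the point: the search result is a property of the installation, not of the source code.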
ATLAS Software Releases and Version
Numbering
ATLAS almost always has two current software releases
available at any one time. The first is the stable release,
which is the safest version to use. The stable release
has undergone extensive testing, and is known to work
on many different platforms. Further, every known bug
in the stable release is tracked (along with associated
fixes) in the ATLAS errata file []. When errors affecting answer accuracy are discovered in the stable release,
a message is sent to the ATLAS error list [], which
any user can sign up for. In this way, users get updates
anytime the library they are using might have an error,
and they can update the software with the supplied
patch if the error affects them. Stable releases happen
relatively rarely (say once every year or two).
The second available package is the developer release,
which is meant to be used by ATLAS developers, contributors, and people happy to live on the bleeding edge.
Developer releases typically contain a host of features
and performance improvements not available in the stable release, but many of these features will have been
exposed to minimal testing (a new developer release
may have only been crudely tested on a single platform,
whereas a new stable release will have been extensively
tested on dozens of platforms). Developer releases happen relatively often (it is not uncommon to release two
in the same week).
Each ATLAS release comes with a version number, which is composed of: <major number>.<minor
number>.<update number>. The meaning of these
terms is:
Major number: Major release numbers are changed
only when fairly large, sweeping changes are made.
Changes in the API are the most likely to cause
a major release number to increment. For example, when ATLAS went from supporting only matrix
multiply to all the Level 3 BLAS, the major number changed; the same happened when ATLAS
went from supporting only Level 3 BLAS to
all BLAS.
Minor number: Minor release numbers are changed at
each official release. Even numbers represent stable
releases, while odd minor numbers are reserved for
developer releases.
Update number: Update numbers are essentially patches
on a particular release. For instance, stable ATLAS
releases only occur roughly once per year or two.
As errors are discovered, they are errata-ed, so that
a user can apply the fixes by hand. When enough
errata are built up that it becomes impractical to
apply the important ones by hand, an update release
is issued. So, stable updates are typically bug fixes,
or important system workarounds, while developer
updates often involve substantial new code. A typical number of updates to a stable release might be
something like . A developer release may have any
number of updates.
So, .. would be a stable release, with one group of fixes
already applied. .. would be the th update (th
release) of the associated developer release.
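The numbering scheme can be captured in a few lines of code. The following is a minimal sketch, not part of ATLAS: the function name classify_release and the sample version strings are hypothetical and chosen only to exercise the even/odd rule for the minor number.

```python
from typing import NamedTuple


class Release(NamedTuple):
    major: int
    minor: int
    update: int
    kind: str  # "stable" or "developer"


def classify_release(version: str) -> Release:
    """Split <major>.<minor>.<update> and classify by minor-number parity."""
    major, minor, update = (int(part) for part in version.split("."))
    # Even minor numbers mark stable releases; odd ones mark developer releases.
    kind = "stable" if minor % 2 == 0 else "developer"
    return Release(major, minor, update, kind)


if __name__ == "__main__":
    # Hypothetical version strings, used only for illustration.
    for v in ("1.2.1", "1.3.7"):
        print(v, "->", classify_release(v))
```

A tool that decides whether to apply errata by hand or wait for an update release could start from such a classification.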
Essentials of Empirical Tuning
Any package that adapts software based on timings
falls into a classification that ATLAS shares, which we
call AEOS (Automated Empirical Optimization of Software). These packages can vary strongly on details, but
they must have some commonalities:
. The search must be automated in some way, so that
an expert hand-tuner is not required.
→ ATLAS has a variety of searches for different
operations, all of which can be found in the
ATLAS/tune directory.
. The decision of whether a transformation is useful
or not must be empirical, in that an actual timing
measurement on the specific architecture in question is performed, as opposed to the traditional
application of transformations using static heuristics or profile counts.
→ ATLAS has a plethora of timers and testers,
which can be found in ATLAS/tune and
ATLAS/bin. These timers must be much
more accurate and context-sensitive than typical timers, since optimization decisions are
based on them. ATLAS uses the methods
described in [] to ensure high-quality
timings.
. These methods must have some way to vary/adapt
the software being tuned. ATLAS currently uses
parameterized adaptation, multiple implementation,
and source generation (see Methods of Software
Adaptation for details).
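The three requirements above (an automated search, an empirical decision, and a way to vary the code) can be illustrated with a very small search driver. This is a sketch under stated assumptions, not ATLAS code: the names time_call, empirical_search, and manual_sum are hypothetical, and taking the minimum over a few runs is only a crude stand-in for the careful timing methods the text refers to.

```python
import time
from typing import Callable, Dict, Tuple

Candidate = Callable[[], object]


def time_call(fn: Candidate, repeats: int = 5) -> float:
    """Return the best (minimum) wall-clock time over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def empirical_search(candidates: Dict[str, Candidate],
                     timer: Callable[[Candidate], float] = time_call
                     ) -> Tuple[str, Dict[str, float]]:
    """Time every candidate on this machine and return the fastest one."""
    timings = {name: timer(fn) for name, fn in candidates.items()}
    winner = min(timings, key=timings.get)
    return winner, timings


def manual_sum(values):
    """A deliberately plain alternative implementation to compete with sum()."""
    total = 0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    data = list(range(100_000))
    winner, timings = empirical_search({
        "builtin_sum": lambda: sum(data),
        "manual_loop": lambda: manual_sum(data),
    })
    print("fastest on this machine:", winner)
```

The timer is passed in as a parameter so that a more accurate, context-sensitive timer can replace the simple one without changing the search.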
Methods of Software Adaptation
Parameterized adaptation: The simplest method is
having runtime or compile-time variables that cause
different behaviors depending on input values. In linear algebra, the most important of such parameters is
probably the blocking factor(s) used in blocked algorithms, which, when varied, varies the data cache utilization. Other parameterized adaptations in ATLAS
include a large number of crossover points (empirically found points in some parameter space where a
second algorithm becomes superior to a first). Important crossover points in ATLAS include: whether the problem size is large enough to withstand a data copy,
whether the problem is large enough to utilize parallelism, whether a problem dimension is close enough
to degenerate that a special-case algorithm should be
used, etc.
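A crossover point of this kind can be located by comparing two algorithms over a range of problem sizes and recording the first size at which the second is no slower. The sketch below assumes hypothetical names (find_crossover, choose_algorithm) and uses synthetic cost functions in place of real timing runs so that the result is reproducible; it does not reproduce any ATLAS search.

```python
from typing import Callable, Iterable, Optional


def find_crossover(sizes: Iterable[int],
                   cost_first: Callable[[int], float],
                   cost_second: Callable[[int], float]) -> Optional[int]:
    """Return the smallest size at which the second algorithm is no slower.

    cost_first and cost_second stand in for timing runs of two algorithms
    at problem size n. Returns None if no crossover is observed.
    """
    for n in sorted(sizes):
        if cost_second(n) <= cost_first(n):
            return n
    return None


def choose_algorithm(n: int, crossover: Optional[int]) -> str:
    """Runtime dispatch driven by the empirically found parameter."""
    if crossover is not None and n >= crossover:
        return "second"
    return "first"


if __name__ == "__main__":
    # Synthetic cost models (assumptions): the first algorithm has no setup
    # cost but scales poorly; the second pays a fixed overhead up front.
    no_overhead = lambda n: float(n * n)
    with_overhead = lambda n: 50.0 * n + 200.0
    point = find_crossover(range(1, 101), no_overhead, with_overhead)
    print("crossover at n =", point)
    print("n = 10 uses", choose_algorithm(10, point))
```

The tuned value is then an ordinary runtime or compile-time variable, which is what distinguishes parameterized adaptation from the source code adaptation discussed next.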
Not all important tuning variables can be handled
by parameterized adaptation (simple examples include
instruction cache size, choice of combined or separate
multiply and add instructions, length of floating point
and fetch pipelines, etc.), since varying them actually
requires changing the underlying source code. This then
brings in the need for the second method of software
adaptation, source code adaptation, which involves actually generating differing implementations of the same
operation.
ATLAS presently uses two methods of source code
adaptation, which are discussed in greater detail below.
. Multiple implementation: Formalized search of
multiple hand-written implementations of the kernel in question. ATLAS uses multiple implementation in the tuning of all levels of the BLAS.
. Source generation: Write a program that can generate differing implementations of a given algorithm based on a series of input parameters. ATLAS
presently uses source generation in tuning matrix
multiply (and hence the entire Level  BLAS).
Multiple implementation: Perhaps the simplest
approach for source code adaptation is for an empirical
tuning package to supply various hand-tuned implementations, and then the search heuristic can be as simple as trying each implementation in turn until the best
is found. At first glance, one might suspect that supplying these multiple implementations would make even
this approach to source code adaptation much more
difficult than the traditional hand-tuning of libraries.
However, traditional hand-tuning is not the mere application of known techniques it may appear when examined casually. Knowing the size and properties of your
level  cache is not sufficient to choose the best blocking
factor, for instance, as this depends on a host of interlocking factors which often defy a priori understanding
in the real world. Therefore, it is common in hand-tuned
optimizations to utilize the known characteristics of the
machine to narrow the search, but then the programmer
writes various implementations and chooses the best.
For multiple implementation, this process remains
the same, but the programmer adds a search and timing layer to accomplish what would otherwise be done
by hand. In the simplest cases, the time to write this
layer may not be much if any more than the time the
implementer would have spent doing the same process in a less formal way by hand, while at the same
time capturing at least some of the flexibility inherent in empirical tuning. Due to its obvious simplicity,
this method is highly parallelizable, in the sense that
multiple authors can meaningfully contribute without
having to understand the entire package. In particular,
various specialists on given architectures can provide
hand-tuned routines without needing to understand
other architectures or the higher-level codes (e.g., timers,
search heuristics, and higher-level routines which utilize
these basic kernels). Therefore, writing a multiple
implementation framework can allow for outside contribution of hand-tuned kernels in an open source/free
software framework such as ATLAS.
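The search and timing layer described above can be sketched as a registry of hand-written kernels plus a loop that tests and times each one. Everything here is illustrative and hypothetical (the KERNELS registry, the register decorator, and the three dot-product variants); it is not the ATLAS multiple implementation search, only a small instance of the same idea.

```python
import time
from typing import Callable, Dict, Optional, Sequence

Kernel = Callable[[Sequence[float], Sequence[float]], float]
KERNELS: Dict[str, Kernel] = {}


def register(name: str) -> Callable[[Kernel], Kernel]:
    """Decorator a contributor would use to add a hand-written kernel."""
    def wrap(fn: Kernel) -> Kernel:
        KERNELS[name] = fn
        return fn
    return wrap


@register("simple_loop")
def dot_loop(x, y):
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total


@register("generator_sum")
def dot_sum(x, y):
    return sum(a * b for a, b in zip(x, y))


@register("unrolled_by_2")
def dot_unrolled(x, y):
    n = len(x)
    t0 = t1 = 0.0
    for i in range(0, n - 1, 2):
        t0 += x[i] * y[i]
        t1 += x[i + 1] * y[i + 1]
    if n % 2:  # leftover element when the length is odd
        t0 += x[n - 1] * y[n - 1]
    return t0 + t1


def time_once(kernel: Kernel, x, y) -> float:
    start = time.perf_counter()
    kernel(x, y)
    return time.perf_counter() - start


def search(x, y, repeats: int = 3, tol: float = 1e-9) -> Optional[str]:
    """Test each registered kernel, time the correct ones, return the fastest."""
    reference = dot_loop(x, y)
    best_name, best_time = None, float("inf")
    for name, kernel in KERNELS.items():
        if abs(kernel(x, y) - reference) > tol * max(1.0, abs(reference)):
            continue  # reject kernels that give wrong answers
        elapsed = min(time_once(kernel, x, y) for _ in range(repeats))
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


if __name__ == "__main__":
    xs = [float(i) for i in range(10_000)]
    print("selected kernel:", search(xs, xs))
```

A contributor only needs to write one decorated function; the tester and timer are shared, which is the property that makes outside contribution practical.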
Source generation: In source generation, a source
generator (i.e., a program that writes other programs)
is produced. This source generator takes as parameters the various source code adaptations to be made.
As before, simple examples include loop unrolling factors, choice of combined or separate multiply and add
instructions, length of floating point and fetch pipelines,
and so on. Depending on the parameters, the source
generator produces a routine with the requisite characteristics. The great strength of source generators is
their ultimate flexibility, which can allow for far greater
tunings than could be produced by all but the best
hand-coders. However, generator complexity tends to
go up along with flexibility, so that these programs
rapidly become almost insurmountable barriers to outside contribution.
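A toy source generator makes the idea concrete. The sketch below takes one parameter, the loop unrolling factor, and writes the source of a dot-product routine with that many accumulators. It is an assumption-laden illustration: the ATLAS generator emits C, whereas this sketch emits and compiles Python so that it stays self-contained, and the names generate_dot_source and build_kernel are hypothetical.

```python
from typing import Callable, Sequence


def generate_dot_source(unroll: int) -> str:
    """Emit Python source for a dot product unrolled `unroll` times."""
    if unroll < 1:
        raise ValueError("unroll factor must be at least 1")
    accumulators = [f"acc{k}" for k in range(unroll)]
    lines = [
        f"def dot_u{unroll}(x, y):",
        "    n = len(x)",
        f"    main = n - n % {unroll}",
        "    " + " = ".join(accumulators) + " = 0.0",
        f"    for i in range(0, main, {unroll}):",
    ]
    for k, acc in enumerate(accumulators):
        lines.append(f"        {acc} += x[i + {k}] * y[i + {k}]")
    lines += [
        "    for i in range(main, n):  # cleanup loop for leftover elements",
        "        acc0 += x[i] * y[i]",
        "    return " + " + ".join(accumulators),
    ]
    return "\n".join(lines) + "\n"


def build_kernel(unroll: int) -> Callable[[Sequence[float], Sequence[float]], float]:
    """Compile the generated source and return the resulting function."""
    namespace: dict = {}
    exec(generate_dot_source(unroll), namespace)
    return namespace[f"dot_u{unroll}"]


if __name__ == "__main__":
    print(generate_dot_source(2))
    kernel = build_kernel(4)
    print(kernel([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0] * 6))
```

Even in this small case the generator is harder to read than any single routine it produces, which hints at why generator complexity grows quickly with flexibility.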
ATLAS therefore combines these two methods of
source adaptation, where the GEMM kernel source
generator produces strict ANSI/ISO C for maximal
architectural portability. Multiple implementation is
utilized to encourage outside contribution, and allows
for extreme architectural specialization via assembly implementations. Parameterized adaptation is then
combined with these two techniques to fully tune the
library.
Both multiple implementation and code generation
are specific to the kernel being tuned, and can be either
platform independent (if written to use portable languages such as C) or platform dependent (as when
assembly is used or generated). Empirically tuned compilers can relax the kernel-specific tuning requirement,
and there has been some initial work on utilizing this
third method of software adaptation [, ] for ATLAS,
but this has not yet been incorporated into any ATLAS
release.
Search and Software Adaptation for the
Level  BLAS
ATLAS’s Level  BLAS are implemented as GEMM-based BLAS, and so ATLAS’s empirical tuning is all
done for matrix multiply (see [] and [] for descriptions of ATLAS’s GEMM-based BLAS). As of this
writing, the most current stable release is .., and
the newest developer release is ... Some features
discussed below are available only in the later developer
releases; this is noted whenever true.
GEMM (GEneral rectangular Matrix Multiply)
is empirically tuned using all discussed methods
(parameterized adaptation, multiple implementation, and
source generation). Parameterization is mainly used for
crossover points and cache blocking, and ATLAS uses
both its methods of source code adaptation in order to
optimize GEMM:
. Code generation: ATLAS’s main code generator
produces ANSI C implementations of ATLAS’s
matrix multiply kernel []. The code generator can
be found in ATLAS/tune/blas/gemm/emit_mm.c.
ATLAS/tune/blas/gemm/mmsearch.c is the master GEMM search that not only exercises the options to emit_mm.c, but also invokes
all subservient searches. With emit_mm.c and
mmsearch.c, ATLAS has a general-purpose code
generator that can work on any platform with an
ANSI C compiler. However, most compilers are
generally unsuccessful in vectorizing these types
of code (and vectorization has become critically
important, especially on the ×), and so from
version .. and later, ATLAS has a code generator written by Chad Zalkin that generates SSE
vectorized code using gcc’s implementation of the
Intel SSE intrinsics. The vectorized source generator
is ATLAS/tune/blas/gemm/mmgen_sse.c,
and the search that exercises its options is
ATLAS/tune/blas/gemm/mmksearch_sse.c.
. Multiple implementation: ATLAS also tunes
GEMM using multiple implementation, and this
search can be found in
ATLAS/tune/blas/gemm/ummsearch.c, and all the hand-tuned
kernel implementations which are searched can be
found in ATLAS/tune/blas/gemm/CASES/.
Empirical Tuning in the Rest of the Package
Currently, the Level  and  BLAS are tuned
only via parameterization and multiple implementation
searches. ATLAS .. and greater has some prototype
SSE generators for matrix vector multiply and rank-
update (most important of the Level  BLAS routines)
available in ATLAS/tune/blas/gemv/mvgen_sse.c and
ATLAS/tune/blas/ger/r1gen_sse.c, respectively. These generators are currently not
debugged, and lack a search, but may be used in later
releases.
ATLAS can also autotune LAPACK’s blocking factor, as discussed in []. ATLAS currently autotunes
the QR factorization for both parallel and serial
implementations.
Parallelism in ATLAS
ATLAS is currently used in two main ways in parallel programming. Parallel programmers call ATLAS’s
serial routines directly in algorithms they themselves
have parallelized. The other main avenue of parallel
ATLAS use is by programmers who write serial algorithms
that gain implicit parallelism by calling ATLAS’s parallelized BLAS implementation. ATLAS currently parallelizes only the Level  BLAS (the Level  and  are
typically bus-bound, so parallelization of these operations can sometimes lead to slowdown due to increased
bus contention). In the current stable (..), the parallel BLAS use pthreads. However, recent research []
forced a complete rewrite of the threading system for
the developer release which results in as much as a doubling of parallel application performance. The developer
release also supports Windows threads and OpenMP in
addition to pthreads. Ongoing work involves improving parallelism at the LAPACK level [], and empirically
tuning parallel crossover points, which may lead to the
safe parallelization of the Level  and  BLAS.
Discussion
History of the Project
The research that eventually grew into ATLAS was
undertaken by R. Clint Whaley in early , as a direct
response to a problem from ongoing research on parallel programming and High Performance Fortran (HPF).
The Innovative Computing Laboratory (ICL) of the University of Tennessee at Knoxville (UTK) had two small
clusters, one consisting of SPARC chips, and the other
PentiumPROs. For the PentiumPROs, no optimized
BLAS were available. For the SPARC cluster, SUN provided an optimized BLAS, but licensed them so that
they could not be used with non-SUN compilers, such
as the HPF compiler. This led to the embarrassment
of having a -processor parallel algorithm run slower
than a serial code on the same chips, due to lack of a
portably optimal BLAS.
The PHiPAC effort [] from Berkeley comprised the
first systematic attempt to harness automated empirical
tuning in this area. PHiPAC did not deliver the required
performance on the platforms being used, but the idea
was obviously sound. Whaley began working on this
idea on nights and weekends in an attempt to make
the HPF results more credible. Eventually, a working
prototype was produced, and was demonstrated to the
director of ICL (Jack Dongarra), and it became the fulltime project for Whaley, who was eventually joined on
the project by Antoine Petitet. ATLAS has been under
continuous development since that time, and has followed Whaley to a variety of institutions, as shown in
the ATLAS timeline below. Please note that ATLAS is an
open source project, and many people who are not mentioned here
have contributed substantially to ATLAS.
Please see ATLAS/doc/AtlasCredits.txt for a
rough description of developer contribution to ATLAS.
Rough ATLAS Timeline Including Stable
Releases
Early : Development of prototype by Whaley in
spare time.
Mid : Whaley full time on ATLAS development at
ICL/UTK.
Dec : Technical report describing ATLAS published. ATLAS v . released, provides S/D GEMM
only.
Sep : Version . released, using SuperScalar
GEMM-based BLAS [] to provide entire real Level
 BLAS in both precisions.
Mid : Antoine Petitet joins ATLAS group at ICL/
UTK.
Feb : Version . released. Automated install and
configure steps, all Level  BLAS supported in all
four types/precisions, C interface to BLAS added.
Dec : Version .Beta released. ATLAS provides
complete BLAS support with C and F interfaces.
GEMM generator can generate all transpose cases,
saving the need to copy small matrices. Addition of
LU, Cholesky, and associated LAPACK routines.
Dec : Version . released. Pthreads support for
parallel Level  BLAS. Support for user-contribution
of kernels, and associated multiple implementation
search.
Mid : Antoine leaves to take job at SUN/France.
Jan : Whaley begins PhD work in iterative compilation at FSU.
Jun : Version . released. Level  BLAS optimized.
Addition of LAPACK inversion and related routines.
Addition of sanity check after install.
Dec : Version . released. Numerous optimizations, but no new API coverage.
Jul : Whaley begins as assistant professor at UTSA.
Mar : NSF/CRI funding for ATLAS obtained.
Dec : Version . released. Complete rewrite of
configure and install, as part of overall modernization of package (bitrotted from  to , where
there was no financial support, and so ATLAS work
only done on volunteer basis). Numerous fixes and
optimizations, but API support unchanged. Addition of ATLAS install guide. Extended LAPACK
timing/testing available.
Ongoing Research and Development
There are a host of areas in ATLAS that require significant research in order to improve. Since , there has
been continuous ATLAS R&D coming from Whaley’s
group at the University of Texas at San Antonio. Here,
we discuss only those areas that the ATLAS team is actually currently investigating (time of writing: November ). Initial work on these ideas is already in
the newest developer releases, and at least some of the
results will be available in the next stable release (..).
Kernel Tuning
Improving ATLAS’s kernel tuning is an ongoing focus.
One area of interest involves exploiting empirical compilation [, ] for all kernels, and adding additional
vectorized source generators, particularly for the Level 
and  BLAS. Additionally, it is important to add support for tuning the Level  and  BLAS to different
cache states, as discussed in []. There is also an ongoing effort to rewrite the ATLAS timing and tuning
frameworks so that others can easily plug in their own
searches or empirical tuning frameworks for ATLAS to
automatically use.
Improvements in Parallel Performance
There is ongoing work aimed at extending [] to
cover more OSes, to further reduce overheads, and to empirically tune a host of crossover points
between degrees of parallelism, which should improve
current performance and enable opportunities to safely
parallelize additional operations. Further work along
the lines of [] is being pursued in order to more effectively parallelize LAPACK. Finally, use of massively parallel GPUs is being investigated based on the impressive
initial work of [, , ].
LAPACK
In addition to the parallel work mentioned in the previous section, the ATLAS group is expanding the coverage
of routines to include all Householder factorizationrelated routines, based on the ideas presented in []
(this will result in ATLAS providing a native implementation, with full C and Fortran interfaces, for
all dense factorizations). Another ongoing investigation involves extending the LAPACK tuning discussed
in [] to handle more routines more efficiently. Finally,
the ATLAS group is researching error control [] with
an eye to keeping error bounds low while using faster
algorithms.
Bibliography
. Anderson E, Bai Z, Bischof C, Demmel J, Dongarra J, Du
Croz J, Greenbaum A, Hammarling S, McKenney A, Ostrouchov S, Sorensen D () LAPACK users’ guide, rd edn. SIAM,
Philadelphia, PA
. Bilmes J, Asanović K, Chin CW, Demmel J () Optimizing
matrix multiply using PHiPAC: a portable, high-performance,
ANSI C coding methodology. In: Proceedings of the ACM
SIGARCH International Conference on SuperComputing, Vienna,
Austria, July 
. Castaldo AM, Whaley RC () Minimizing startup costs for
performance-critical threading. In: Proceedings of the IEEE
international parallel and distributed processing symposium,
Rome, Italy, May 
. Castaldo AM, Whaley RC () Scaling LAPACK panel
operations using parallel cache assignment. In: Accepted for publication in th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Bangalore, India,
January 
. Castaldo AM, Whaley RC, Chronopoulos AT () Reducing
floating point error in dot product using the superblock family of
algorithms. SIAM J Sci Comput ():–
. Dongarra J, Du Croz J, Duff I, Hammarling S () A set of level
 basic linear algebra subprograms. ACM Trans Math Softw ():
–
. Dongarra J, Du Croz J, Hammarling S, Hanson R () Algorithm : an extended set of basic linear algebra subprograms:
model implementation and test programs. ACM Trans Math
Softw ():–
. Dongarra J, Du Croz J, Hammarling S, Hanson R () An
extended set of FORTRAN basic linear algebra subprograms.
ACM Trans Math Softw ():–
. Elmroth E, Gustavson F () Applying recursion to serial and
parallel QR factorization leads to better performance. IBM J Res
Dev ():–
. Hanson R, Krogh F, Lawson C () A proposal for standard
linear algebra subprograms. ACM SIGNUM Newsl ():–
. Kågström B, Ling P, van Loan C () Gemm-based level 
blas: high performance model implementations and performance
evaluation benchmark. ACM Trans Math Softw ():–
. Lawson C, Hanson R, Kincaid D, Krogh F () Basic linear
algebra subprograms for fortran usage. ACM Trans Math Softw
():–
. Li Y, Dongarra J, Tomov S () A note on autotuning GEMM
for GPUs. Technical Report UT-CS--, University of Tennessee, January 
. Whaley RC et al. Atlas mailing lists. http://math-atlas.sourceforge.net/faq.html#lists
. Volkov V, Demmel J () Benchmarking GPUs to tune
dense linear algebra. In Supercomputing . Los Alamitos,
November 
. Volkov V, Demmel J (). LU, QR and Cholesky factorizations
using vector capabilities of GPUs. Technical report, University of
California, Berkeley, CA, May 
. Whaley RC () Atlas errata file. http://math-atlas.sourceforge.
net/errata.html
. Whaley RC () Empirically tuning lapack’s blocking factor for
increased performance. In: Proceedings of the International Multiconference on Computer Science and Information Technology,
Wisla, Poland, October 
. Whaley RC, Castaldo AM () Achieving accurate and
context-sensitive timing for code optimization. Softw Practice
Exp ():–
. Whaley RC, Dongarra J () Automatically tuned linear algebra
software. Technical Report UT-CS--, University of Tennessee, TN, December . http://www.netlib.org/lapack/lawns/lawn.ps
. Whaley RC, Dongarra J () Automatically tuned linear algebra software. In: SuperComputing : high performance networking and computing, San Antonio, TX, USA, . CD-ROM
proceedings. Winner, best paper in the systems category. http://www.cs.utsa.edu/~whaley/papers/atlas_sc.ps
. Whaley RC, Dongarra J () Automatically tuned linear algebra
software. In: Ninth SIAM conference on parallel processing for
scientific computing, . CD-ROM proceedings.
. Whaley RC, Petitet A () Atlas homepage. http://math-atlas.
sourceforge.net/
. Whaley RC, Petitet A () Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw
Practice Exp ():–, February . http://www.cs.utsa.
edu/~whaley/papers/spercw.ps
. Whaley RC, Petitet A, Dongarra JJ () Automated empirical
optimization of software and the ATLAS project. Parallel Comput
(–):–
. Whaley RC, Whalley DB () Tuning high performance
kernels through empirical compilation. In: The  international conference on parallel processing, Oslo, Norway, June
, pp –
. Yi Q, Whaley RC () Automated transformation for
performance-critical kernels. In: ACM SIGPLAN symposium on library-centric software design, Montreal, Canada,
October 
Atomic Operations
Synchronization
Transactional Memories
Automated Empirical
Optimization
Autotuning
Automated Empirical Tuning
Autotuning
Automated Performance Tuning
Autotuning
Automated Tuning
Autotuning
Automatically Tuned Linear
Algebra Software (ATLAS)
ATLAS (Automatically Tuned Linear Algebra Software)
Autotuning
Richard W. Vuduc
Georgia Institute of Technology, Atlanta,
GA, USA
Synonyms
Automated empirical optimization; Automated empirical tuning; Automated performance tuning; Automated
tuning; Software autotuning
Definition
Automated performance tuning, or autotuning, is an
automated process, guided by experiments, of selecting
one from among a set of candidate program implementations to achieve some performance goal. “Performance goal” may mean, for instance, the minimization
of execution time, energy delay, storage, or approximation error. An “experiment” is the execution of a
benchmark and observation of its results with respect
to the performance goal. A system that implements an
autotuning process is referred to as an autotuner. An
autotuner may be a stand-alone code generation system
or may be part of a compiler.
Discussion
Introduction: From Manual to Automated
Tuning
When tuning a code by hand, a human programmer typically engages in the following iterative process.
Given an implementation, the programmer develops
or modifies the program implementation, performs an
experiment to measure the implementation’s performance, and then analyzes the results to decide whether
the implementation has met the desired performance
goal; if not, he or she repeats these steps. The use of
an iterative, experiment-driven approach is typically
necessary when the program and hardware performance behavior is too complex to model explicitly in
another way.
Autotuning attempts to automate this process. More
specifically, the modern notion of autotuning is a process consisting of the following components, any or all
of which are automated in a given autotuning approach
or system:
● Identification of a space of candidate implementations. That is, the computation is associated with
some space of possible implementations that may be
defined implicitly through parameterization.
● Generation of these implementations. That is, an
autotuner typically possesses some facility for producing (generating) the actual code that corresponds to any given point in the space of candidates.
● Search for the best implementation, where best is
defined relative to the performance goal. This search
may be guided by an empirically derived model
and/or actual experiments, that is, benchmarking
candidate implementations.
Bilmes et al. first described this particular notion of
autotuning in the context of an autotuner for dense
matrix–matrix multiplication [].
Intellectual Genesis
The modern notion of autotuning can be traced historically to several major movements in program
generation.
The first is the work by Rice on polyalgorithms.
A polyalgorithm is a type of algorithm that, given
a particular input instance, selects from among several
candidate algorithms to carry out the desired computation []. Rice’s later work included his attempt to formalize algorithm selection mathematically, along with
an approximation theory approach to its solution, as
well as applications in both numerical algorithms and
operating system scheduler selection []. The key influence on autotuning is the notion of multiple candidates
and the formalization of the selection problem as an
approximation and search problem.
A second body of work is in the area of profile- and feedback-directed compilation. In this work, the
compiler instruments the program to gather and
store data about program behavior at runtime, and
then uses this data to guide subsequent compilation of the same program. This area began with
the introduction of detailed program performance
measurement [] and measurement tools []. Soon
after, measurement became an integral component in
the compilation process itself in several experimental compilers, including Massalin’s superoptimizer for
exploring candidate instruction schedules, as well as
the peephole optimizers of Chang et al. []. Dynamic
or just-in-time compilers employ similar ideas; see
Smith’s survey of the state-of-the-art in this area as
of  []. The key influence on modern autotuning is the idea of automated measurement-driven
transformation.
The third body of work that has exerted a strong
influence on current work in autotuning is that of formalized domain-driven code generation systems. The
first systems were developed for signal processing and
featured high-level transformation/rewrite systems that
manipulated symbolic formulas and translated these
formulas into code [, ]. The key influence on modern autotuning is the notion of high-level mathematical representations of computations and the automated
transformations of these representations.
Autotuning as it is largely studied today began
with the simultaneous independent development of the
PHiPAC autotuner for dense matrix–matrix multiplication [], FFTW for the fast Fourier transform [], and
the OCEANS iterative compiler for embedded systems
[]. These were soon followed by additional systems for
dense linear algebra (ATLAS) [] and signal transforms
(SPIRAL) [].
Contemporary Issues and Approaches
A developer of an autotuner must, in implementing
his or her autotuning system or approach, consider
a host of issues for each of the three major autotuning components, that is, identification of candidate
implementations, generation, and search. The issues cut
across components, though for simplicity the following
exposition treats them separately.
Identification of Candidates
The identification of candidates may involve characterization of the target computations and/or anticipation of
the likely target hardware platforms. Key design points
include what candidates are possible, who specifies the
candidates, and how the candidates are represented for
subsequent code generation and/or transformation.
For a given computation there may be
multiple candidate algorithms and data structures, for
example, which linear solver to use or what type of graph data
structure. If the computation is expressed as code, the
space may be defined through some parameterization
of the candidate transformations, for example, depth
of loop unrolling, cache blocking/tiling sizes. There
may be numerous other implementation issues, such
as where to place data or how to schedule tasks and
communication.
Regarding who identifies these candidates, possibilities include the autotuner developer; the autotuner
itself, for example, through preprogrammed and possibly extensible rewrite rules; or even the end-user programmer, for example, through a meta-language or
directives.
Among target architectures, autotuning researchers
have considered all of the different categories of sequential and parallel architectures. These include everything
from superscalar cache-based single- and multicore
systems to vector-enabled systems and shared and distributed memory multiprocessor systems.
Code Generation
Autotuners employ a variety of techniques for producing the actual code.
One approach is to build a specialized code generator that can only produce code for a particular computation or family of computations. The generator itself
might be as conceptually simple as a script that, given a
few parameter settings, produces an output implementation. It could also be as sophisticated as a program that
takes an abstract mathematical formula as input, using
symbolic algebra and rewriting to transform the formula to an equivalent formula, and translating the formula to an implementation. In a compiler-based autotuner, the input could be code in a general-purpose language, with conventional compiler technology enabling
the transformation or rewriting of that input to some
implementation output. In other words, there is a large
design space for the autotuner code generator component. Prominent examples exist using combinations of
these techniques.
Search
Given a space of implementations, selecting the best
implementation is, generically, a combinatorial search
problem. Key questions are what type of search to use,
when to search, and how to evaluate the candidate
implementations during search.
Among types of search, the simplest is an exhaustive approach, in which the autotuner enumerates all
possible candidates and experimentally evaluates each
one. To reduce the potential infeasibility of exhaustive
search, numerous pruning heuristics are possible. These
include random search, simulated annealing, statistical experimental design approaches, machine-learning
guided search, genetic algorithms, and dynamic programming, among others.
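The trade-off between exhaustive search and a pruned search can be seen in a small sketch. The functions exhaustive_search and random_search below are hypothetical, the parameter space (an unroll factor and a block size) is invented for illustration, and a synthetic cost function replaces real benchmarking so that the example is reproducible.

```python
import itertools
import random
from typing import Callable, Sequence, Tuple

Point = Tuple[int, ...]


def exhaustive_search(space: Sequence[Sequence[int]],
                      cost: Callable[[Point], float]) -> Tuple[Point, int]:
    """Evaluate every point; return the best point and the experiment count."""
    points = list(itertools.product(*space))
    return min(points, key=cost), len(points)


def random_search(space: Sequence[Sequence[int]],
                  cost: Callable[[Point], float],
                  budget: int, seed: int = 0) -> Tuple[Point, int]:
    """Evaluate only `budget` randomly sampled points."""
    rng = random.Random(seed)
    samples = [tuple(rng.choice(axis) for axis in space) for _ in range(budget)]
    return min(samples, key=cost), budget


if __name__ == "__main__":
    # Hypothetical space: (unroll factor, block size).
    space = [[1, 2, 4, 8], [16, 32, 64]]
    # Synthetic cost standing in for a measured run time.
    cost = lambda p: abs(p[0] - 4) + abs(p[1] - 32)
    print("exhaustive:", exhaustive_search(space, cost))
    print("random:    ", random_search(space, cost, budget=5))
```

The exhaustive search is guaranteed to find the best point in the space but its cost grows with the product of the axis sizes, while the random search runs a fixed number of experiments and may return a worse point.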
On the question of when to search, the main categories are off-line and at runtime. An off-line search
would typically occur once per architecture or architectural family, prior to use by an application. A runtime
search, by contrast, occurs when the computation is
invoked by the end-user application. Hybrid approaches
are of course possible. For example, a series of offline benchmarks could be incorporated into a runtime
model that selects an implementation. Furthermore,
tuning could occur across multiple application invocations through historical recording and analysis mechanisms, as is the case in profile- and feedback-directed
compilation.
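A hybrid scheme of the kind mentioned above can be sketched in two phases: an off-line phase that benchmarks candidate implementations at a few problem sizes and records the winners, and a runtime phase that consults the recorded table. The names offline_tune and runtime_select, the implementation labels, and the synthetic cost functions are all assumptions made for this illustration.

```python
import bisect
from typing import Callable, Dict, List, Sequence, Tuple


def offline_tune(sizes: Sequence[int],
                 costs: Dict[str, Callable[[int], float]]
                 ) -> List[Tuple[int, str]]:
    """Off-line phase: record the cheapest implementation at each size."""
    return [(n, min(costs, key=lambda name: costs[name](n)))
            for n in sorted(sizes)]


def runtime_select(table: List[Tuple[int, str]], n: int) -> str:
    """Runtime phase: use the winner at the largest benchmarked size <= n.

    Sizes below the smallest benchmarked size fall back to the first entry.
    """
    keys = [size for size, _ in table]
    idx = max(bisect.bisect_right(keys, n) - 1, 0)
    return table[idx][1]


if __name__ == "__main__":
    # Synthetic cost models standing in for off-line benchmark results.
    costs = {
        "low_overhead": lambda n: float(n * n),
        "high_overhead": lambda n: 50.0 * n + 200.0,
    }
    table = offline_tune([10, 100, 1000], costs)
    print(table)
    print("n = 500 uses", runtime_select(table, 500))
```

The runtime cost is a single table lookup, so the expensive experiments are paid once per platform rather than on every invocation.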
The question of evaluation concerns the methodology for carrying out the experiment. One major
issue is whether this evaluation is purely experimental or guided by some predictive model (or both). The
model itself may be parameterized and parameter values learned (in the statistical sense) during tuning. A
second major issue is under what context evaluation
occurs. That is, tuning may depend on features of the
input data, so conducting an experiment in the right
context could be critical to effective tuning.
There are numerous debates and issues within this
community at present, a few examples of which follow:
● How will we measure the success of autotuners, in terms of performance, productivity, and/or other measures?
● To what extent can entire programs be autotuned, or will large successes be largely limited to relatively small library routines and small kernels extracted from programs?
● For what applications is data-dependent tuning really necessary?
● Is there a common infrastructure (tools, languages, intermediate representations) that could support autotuning broadly, across application domains?
● Where does the line between an autotuner and a “traditional” compiler lie?
● When is search necessary, rather than analytical models? Always? Never? Or only sometimes, perhaps as a “stopgap” measure when porting to new platforms? How does an autotuner know when to stop?
Related Entries
Algorithm Engineering
ATLAS (Automatically Tuned Linear Algebra
Software)
Benchmarks
Code Generation
FFTW
Profiling
Spiral
Additional Pointers
The literature on work related to autotuning is large and
growing. At the time of this writing, the last major comprehensive surveys of autotuning projects had appeared
in the Proceedings of the IEEE special issue edited by
Moura et al. [] and the article by Vuduc et al. [,
Section ], which include many of the methodological
aspects of autotuning described here.
There are numerous community-driven efforts to
assemble autotuning researchers and disseminate their
results. These include: the U.S. Department of Energy
(DOE) sponsored workshop on autotuning, organized
under the auspices of CScADS []; the International
Workshop on Automatic Performance Tuning []; and
the Collective Tuning (cTuning) wiki [], to name a few.
Bibliography
. (). Collective Tuning Wiki. http://ctuning.org
. (). International Workshop on Automatic Performance Tuning. http://iwapt.org
. Aarts B, Barreteau M, Bodin F, Brinkhaus P, Chamski Z, Charles
H-P, Eisenbeis C, Gurd J, Hoogerbrugge J, Hu P, Jalby W, Knijnenburg PMW, O’Boyle MFP, Rohou E, Sakellariou R, Schepers H, Seznec A, Stöhr E, Verhoeven M, Wijshoff HAG ()
OCEANS: Optimizing Compilers for Embedded Applications. In:
Proc. Euro-Par, vol  of LNCS, Passau, Germany. Springer,
Berlin / Heidelberg
. Bilmes J, Asanovic K, Chin C-W, Demmel J () Optimizing
matrix multiply using PHiPAC: A portable, high-performance,
ANSI C coding methodology. In Proc. ACM Int’l Conf Supercomputing (ICS), Vienna, Austria, pp –
. Center for Scalable Application Development Software ()
Workshop on Libraries and Autotuning for Petascale Applications. http://cscads.rice.edu/workshops/summer/autotuning
. Chang PP, Mahlke SA, Mei W, Hwu W () Using profile information to assist classic code optimizations. Software: Pract Exp
():–
. Covell MM, Myers CS, Oppenheim AV () Symbolic and
knowledge-based signal processing, chapter : computer-aided
algorithm design and rearrangement. Signal Processing Series.
Prentice-Hall, pp –
. Frigo M, Johnson SG () A fast Fourier transform compiler.
ACM SIGPLAN Notices ():–. Origin: Proc. ACM Conf.
Programming Language Design and Implementation (PLDI)
. Graham SL, Kessler PB, McKusick MK () gprof: A call graph
execution profiler. ACM SIGPLAN Notices ():–. Origin:
Proc. ACM Conf. Programming Language Design and Implementation (PLDI)
. Johnson JR, Johnson RW, Rodriguez D, Tolimieri R ()
A methodology for designing, modifying, and implementing
Fourier Transform algorithms on various architectures. Circuits,
Syst Signal Process ():–
. Knuth DE () An empirical study of FORTRAN programs.
Software: Practice Exp ():–
. Moura JMF, Püschel M, Dongarra J, Padua D, (eds) ()
Proceedings of the IEEE: Special Issue on Program Generation,
Optimization, and Platform Adaptation, vol . IEEE Comp Soc.
http://ieeexplore.ieee.org/xpl/tocresult.jsp?isNumber=&
puNumber=
. Püschel M, Moura JMF, Johnson J, Padua D, Veloso M,
Singer B, Xiong J, Franchetti F, Gacic A, Voronenko Y, Chen K,
Johnson RW, Rizzolo N () SPIRAL: Code generation for DSP
transforms. Proc. IEEE: Special issue on “Program Generation,
Optimization, and Platform Adaptation” ():–
. Rice JR () A polyalgorithm for the automatic solution of nonlinear equations. In Proc. ACM Annual Conf./Annual Mtg. New
York, pp –
. Rice JR () The algorithm selection problem. In: Alt F, Rubinoff
M, Yovits MC (eds) Adv Comp :–
. Smith MD () Overcoming the challenges to feedback-directed optimization. ACM SIGPLAN Notices ():–
. Vuduc R, Demmel J, Bilmes J () Statistical models for empirical search-based performance tuning. Int’l J High Performance
Comp Appl (IJHPCA) ():–
. Whaley RC, Petitet A, Dongarra J () Automated empirical
optimizations of software and the ATLAS project. Parallel Comp
(ParCo) (–):–

B
Backpressure
Flow Control
Bandwidth-Latency Models (BSP,
LogP)
Thilo Kielmann , Sergei Gorlatch

Vrije Universiteit, Amsterdam, The Netherlands

Westfälische Wilhelms-Universität Münster, Münster,
Germany
Synonyms
Message-passing performance models; Parallel communication models
Definition
Bandwidth-latency models are a group of performance
models for parallel programs that focus on modeling
the communication between the processes in terms of
network bandwidth and latency, allowing quite precise
performance estimations. While originally developed
for distributed-memory architectures, these models
also apply to machines with nonuniform memory
access (NUMA), like the modern multi-core
architectures.
Discussion
Introduction
The foremost goal of parallel programming is to speed
up algorithms that would be too slow when executed sequentially. Achieving this so-called speedup
requires a deep understanding of the performance of
the inter-process communication and synchronization,
together with the algorithm’s computation. Both computation and communication/synchronization performance strongly depend on properties of the machine
architecture in use. The strength of the bandwidth-latency models is that they model quite precisely
the communication and synchronization operations.
When the parameters of a given machine are fed
into the performance analysis of a parallel algorithm,
bandwidth-latency models lead to rather precise and
expressive performance evaluations. The most important models of this class are bulk synchronous parallel
(BSP) and LogP, as discussed below.
The BSP Model
The bulk synchronous parallel (BSP) model was proposed by Valiant [] to overcome the shortcomings
of the traditional PRAM (Parallel Random Access
Machine) model, while keeping its simplicity. None of
the suggested PRAM models offers a satisfying forecast of the behavior of parallel machines for a wide
range of applications. The BSP model was developed
as a bridge between software and hardware developers:
if the architecture of parallel machines is designed as
prescribed by the BSP model, then software developers can rely on the BSP-like behavior of the hardware.
Furthermore, it should not be necessary to continually
adapt the model of an application to new hardware
details in order to benefit from the higher efficiency of
emerging architectures.
The BSP model is an abstraction of a machine with
physically distributed memory that represents communication as a global bundle instead of
single point-to-point transfers. A BSP model machine
consists of a number of processors equipped with memory, a connection network for point-to-point messages
between processors, and a synchronization mechanism,
which allows a barrier synchronization of all processors.
Calculations on a BSP machine are organized as a
sequence of supersteps as shown in Fig. :
● In each superstep, each processor executes local computations and may perform communication operations.
● A local computation can be executed in every time unit (step).
● The result of a communication operation does not become effective before the next superstep begins, i.e., the receiver cannot use received data until the current superstep is finished.
● At the end of every superstep, a barrier synchronization using the synchronization mechanism of the machine takes place.
[Figure: processors plotted against time; labels: Superstep, Local computations, Global communications, Barrier synchronization]
Bandwidth-Latency Models (BSP, LogP). Fig.  The BSP model executes calculations in supersteps. Each superstep comprises three phases: (1) simultaneous local computations performed by each process, (2) global communication operations for exchanging data between processes, (3) barrier synchronization for finalizing the communication operation and enabling access to received data by the receiving processes
The BSP model machine can be viewed as a MIMD
(Multiple Instruction Multiple Data) system, because
the processes can execute different instructions simultaneously. It is loosely synchronous at the superstep level,
compared to the instruction-level tight synchrony in the
PRAM model: within a superstep, different processes
execute asynchronously at their own pace. There is a single address space, and a processor can access not only its
local memory but also any remote memory in another
processor, the latter imposing communication.
Within a superstep, each computation operation
uses only data in the processor’s local memory. These
data are put into the local memory either at the program
start-up time or by the communication operations of
previous supersteps. Therefore, the computation operations of a process are independent of other processes,
and multiple processes are not allowed to read
or write the same memory location in the same step.
Because of the barrier synchronization, all memory and
communication operations in a superstep must completely finish before any operation of the next superstep
begins. These restrictions imply that a BSP computer
has a sequential consistency memory model.
A program execution on a BSP machine is characterized using the following four parameters []:
● p: the number of processors.
● s: the computation speed of processors, expressed as the number of computation steps that can be executed by a processor per second. In each step, one arithmetic operation on local data can be executed.
● l: the number of steps needed for the barrier synchronization.
● g: the average number of steps needed for transporting a memory word between two processors of the machine.
In a real parallel machine, there are many different patterns of communication between processors. For
simplicity, the BSP model abstracts the communication
operations using the h-relation concept: an h-relation is
an abstraction of any communication operation, where
each node sends at most h words to various nodes and
each node receives at most h words. On a BSP computer,
the time to realize any h-relation is not longer than g ⋅ h.
The BSP model is more realistic than the PRAM
model, because it accounts for several overheads
ignored by PRAM:
● To account for load imbalance, the computation time w is defined as the maximum number of steps spent on computation operations by any processor.
● The synchronization overhead is l, which has a lower bound equal to the communication network latency (i.e., the time for a word to propagate through the physical network) and is always greater than zero.
● The communication overhead is g ⋅ h steps, i.e., g ⋅ h is the time to execute the most time-consuming h-relation. The value of g is platform-dependent: it is smaller on a computer with more efficient communication support.
For a real parallel machine, the value of g depends on
the bandwidth of the communication network, the communication protocols, and the communication library.
The value of l depends on the diameter of the network,
as well as on the communication library. Both parameters are usually estimated using benchmark programs.
Since the value s is used for normalizing the values l
and g, only p, l, and g are independent parameters. The
execution time of a BSP program is the sum of the execution times of all its supersteps. The execution time of
each superstep comprises three terms: (1) the maximum
local computation time of all processes, w, (2) the cost
of global communication realizing an h-relation, and
(3) the cost of the barrier synchronization finalizing the
superstep:
$$T_{\text{superstep}} = w + g \cdot h + l \qquad (1)$$
BSP allows for overlapping of the computation, the
communication, and the synchronization operations
within a superstep. If all three types of operations are
fully overlapped, then the time for a superstep becomes
max(w, g⋅h, l). However, usually the more conservative
w + g ⋅h + l is used.
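The superstep cost formula can be evaluated mechanically once w and h are known. The following Python sketch shows one way to do so. The function names, the traffic-matrix representation, and the sample numbers are assumptions made for this illustration. They are not part of the BSP model or of BSPlib.

```python
def h_relation(traffic):
    """h for a traffic matrix, where traffic[i][j] is the number of words
    processor i sends to processor j: the largest number of words any
    processor sends or receives."""
    p = len(traffic)
    sent = [sum(traffic[i][j] for j in range(p) if j != i) for i in range(p)]
    received = [sum(traffic[i][j] for i in range(p) if i != j) for j in range(p)]
    return max(max(sent), max(received))

def superstep_cost(work, traffic, g, l, overlap=False):
    """Cost of one superstep in computation steps.

    work[i] is the number of local steps on processor i, so w = max(work).
    With overlap=False this is the conservative w + g*h + l; with
    overlap=True it is the fully overlapped bound max(w, g*h, l)."""
    w = max(work)
    h = h_relation(traffic)
    if overlap:
        return max(w, g * h, l)
    return w + g * h + l

def program_cost(supersteps, g, l):
    """Sum of the conservative superstep costs; supersteps is a list of (work, traffic) pairs."""
    return sum(superstep_cost(work, traffic, g, l) for work, traffic in supersteps)

# Hypothetical example: 4 processors, each sending 5 words to its ring neighbor.
work = [100, 120, 90, 110]
traffic = [[0, 5, 0, 0],
           [0, 0, 5, 0],
           [0, 0, 0, 5],
           [5, 0, 0, 0]]
conservative = superstep_cost(work, traffic, g=4, l=50)                # 120 + 4*5 + 50
overlapped = superstep_cost(work, traffic, g=4, l=50, overlap=True)    # max(120, 20, 50)
```

Taking w as the maximum over processors, rather than the average, is what lets the model account for load imbalance.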
The BSP model was implemented in a so-called
BSPlib library [, ] that provides operations for the initialization of supersteps, execution of communication
operations, and barrier synchronizations.
The LogP Model
In [], Culler et al. criticize BSP’s assumption that the
length of supersteps must allow to realize arbitrary
h-relations, which means that the granularity has a
lower bound. Also, the messages that have been sent
during a superstep are not available for a recipient
before the following superstep begins, even if a message
is sent and received within the same superstep. Furthermore, the BSP model assumes hardware support for
the synchronization mechanism, although most existing parallel machines do not provide such a support.
Because of these problems, the LogP model [] has been
devised, which is arguably more closely related to the
modern hardware used in parallel machines.
In analogy to BSP, the LogP model assumes that
a parallel machine consists of a number of processors
equipped with memory. The processors can communicate using point-to-point messages through a communication network. The behavior of the communication
is described using four parameters L, o, g, and P, which
give the model the name LogP:
● L (latency) is an upper bound for the network’s latency, i.e., the maximum delay between sending and receiving a message.
● o (overhead) describes the period of time in which a processor is busy with sending or receiving a message; during this time, no other work can be performed by that processor.
● g (gap) denotes the minimal timespan between sending or receiving two messages back-to-back.
● P is the number of processors of the parallel machine.
Figure  shows a visualization of the LogP parameters []. Except P, all parameters are measured in
time units or multiple machine cycle units. The model
assumes a network with finite capacity. From the definitions of L and g, it follows that, for a given pair of source
and destination nodes, the number of messages that can
be on their way through the network simultaneously is
limited by L/g. A processor trying to send a message
that would exceed this limitation will be blocked until
the network can accept the next message. This property
models the network bandwidth where the parameter g
reflects the bottleneck bandwidth, independent of the
bottleneck’s location, be it the network link, or be it the
processing time spent on sending or receiving. g thus
denotes an upper bound for o.
[Figure labels: P processors; M and P modules; Overhead o (twice); Latency L; Communication network]
Bandwidth-Latency Models (BSP, LogP). Fig.  Parameter
visualization of the LogP model
LogP is both praised and criticized for this capacity
constraint. The advantage of this constraint is a very
realistic modeling of communication performance, as
the bandwidth capacity of a communication path can
be limited by any entity between the sender’s memory and the receiver’s memory. Due to this focus on
point-to-point communication, LogP and its variants have
been successfully used for modeling various computer
communication problems, like the performance of the
Network File System (NFS) [] or collective communication operations from the Message Passing Interface (MPI) []. The disadvantage of the capacity constraint is that LogP exposes the sensitivity of a parallel computation to the communication performance
of individual processors. This way, analytic performance modeling is much harder with LogP, as compared to models with a higher abstraction level like
BSP [].
While LogP assumes that the processors are working asynchronously, it requires that no message may
exceed a preset size, i.e., larger messages must be
fragmented. The latency of a single message is not
predictable, but it is limited by L. This means, in particular, that messages can overtake each other such
that the recipient may potentially receive the messages
out of order. The values of the parameters L, o, and g
depend on hardware characteristics, the communication software used, and the underlying communication
protocols.
The time to communicate a message from one node
to another (i.e., the start-up time t ) consists of three
terms: t = o + L + o. The first o is called the send overhead, which is the time at the sending node to execute
a message send operation in order to inject a message
into the network. The second o is called the receive
overhead, which is the time at the receiving node to execute a message receive operation. For simplicity, the two
overheads are assumed equal and called the overhead
o, i.e., o is the length of a time period that a node is
engaged in sending or receiving a message. During this
time, the node cannot perform other operations (e.g.,
overlapping computations).
In the LogP model, the runtime of an algorithm is
determined as the maximum runtime across all processors. A consequence of the LogP model is that the access
to a data element in memory of another processor costs
$T_m = 2L + 4o$ time units (a message round-trip), of
which one half is used for reading, and the other half
for writing. A sequence of n pipelined messages can be
delivered in $T_n = L + 2o + (n - 1)g$ time units, as shown
in Fig. .
[Figure: timeline of message segments 1 to 5; labels: g, o, L; axis: Time]
Bandwidth-Latency Models (BSP, LogP). Fig.  Modeling
the transfer of a large message in n segments using the
LogP model. The last message segment is sent at time
$T_s = (n - 1) \cdot g$ and will be received at time $T_r = T_s + 2o + L$
A strong point of LogP is its simplicity. This simplicity, however, can, at times, lead to inaccurate performance modeling. The most obvious limitation is the
restriction to small messages only. This was overcome
by introducing a LogP variant, called LogGP []. It contains an additional parameter G (Gap per Byte) as the
time needed to send a byte from a large message. $1/G$
is the bandwidth per processor. The time for sending a
message consisting of n bytes is then $T_n = o + (n - 1)G + L + o$.
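As a small worked example, the LogGP expression can be coded directly. The parameter values below are hypothetical and only serve to show how the per-byte gap G dominates the cost of a long message.

```python
def loggp_message_time(n_bytes, L, o, G):
    """LogGP time to send one message of n_bytes: o + (n - 1) * G + L + o."""
    return o + (n_bytes - 1) * G + L + o

def loggp_bandwidth(G):
    """Per-processor bandwidth implied by the gap per byte (bytes per time unit)."""
    return 1.0 / G

# Hypothetical values: L = 6, o = 2 time units, G = 0.5 time units per byte.
L, o, G = 6, 2, 0.5
one_byte = loggp_message_time(1, L, o, G)       # reduces to o + L + o
long_message = loggp_message_time(1001, L, o, G)
bandwidth = loggp_bandwidth(G)
```

With these values a 1001-byte message costs 510 time units, of which 500 come from the (n − 1)G term. For a one-byte message the expression falls back to the plain LogP cost o + L + o.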
A more radical extension is the parameterized LogP
model [], PLogP, for short. PLogP has been designed
taking observations about communication software,
like the Message Passing Interface (MPI), into account.
These observations are: (1) Overhead and gap strongly
depend on the message size; some communication
libraries even switch between different transfer implementations for short, medium, and large messages.
(2) Send and receive overhead can strongly differ, as
the handling of asynchronously incoming messages
needs a fundamentally different implementation than
a synchronously invoked send operation. In the PLogP
model (see Fig. ), the original parameters o and g have
been replaced by the send overhead $o_s(m)$, the receive
overhead $o_r(m)$, and the gap $g(m)$, where m is the
message size. L and P remain the same as in LogP.
[Figure: sender and receiver timelines; labels: $g(m)$, $o_s(m)$, $o_r(m)$, L; axis: Time]
Bandwidth-Latency Models (BSP, LogP). Fig. 
Visualization of a message transport with m bytes using
the PLogP model. The sender is busy for $o_s(m)$ time. The
message has been received at $T = L + g(m)$, out of which
the receiver had been busy for $o_r(m)$ time
PLogP allows precise performance modeling when
parameters for the relevant message sizes of an application are used. In [], this has been demonstrated
for MPI’s collective communication operations, even
for hierarchical communication networks with different sets of performance parameters. Pješivac-Grbović
et al. [] have shown that PLogP provides flexible and
accurate performance modeling.
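Because PLogP parameters are functions of the message size, a practical evaluation needs some representation of g(m). The sketch below assumes, purely for illustration, that g(m) has been measured at a few message sizes and is linearly interpolated in between. The table values, the interpolation scheme, and the function names are all assumptions rather than part of the PLogP definition. Only the arrival time T = L + g(m) is taken from the model as described above.

```python
from bisect import bisect_left

def interpolate(table, m):
    """Piecewise-linear interpolation over (size, value) pairs sorted by size;
    values are clamped outside the measured range."""
    sizes = [size for size, _ in table]
    if m <= sizes[0]:
        return table[0][1]
    if m >= sizes[-1]:
        return table[-1][1]
    k = bisect_left(sizes, m)
    (s0, v0), (s1, v1) = table[k - 1], table[k]
    return v0 + (v1 - v0) * (m - s0) / (s1 - s0)

def plogp_arrival_time(m, L, gap_table):
    """Time at which an m-byte message has been received: T = L + g(m)."""
    return L + interpolate(gap_table, m)

# Hypothetical measurements of g(m) in microseconds at three message sizes (bytes).
GAP_TABLE = [(1, 10.0), (1024, 30.0), (65536, 800.0)]
LATENCY = 50.0

t_small = plogp_arrival_time(1, LATENCY, GAP_TABLE)
t_kilobyte = plogp_arrival_time(1024, LATENCY, GAP_TABLE)
t_between = plogp_arrival_time(512.5, LATENCY, GAP_TABLE)
```

The same table-driven approach could be used for the send and receive overheads. How densely to sample the message sizes is a measurement question that this sketch does not address.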
Concluding Remarks
Historically, the major focus of parallel algorithm development had been the PRAM model, which ignores data
access and communication cost and considers only load
balance and extra work. PRAM is very useful in understanding the inherent concurrency of an application,
which is the first conceptual step in developing a parallel program; however, it does not take into account
important realities of particular systems, such as the fact
that data access and communication costs are often the
dominant components of execution time.
The bandwidth-latency models described in this
chapter articulate the performance issues against which
software must be designed. Based on a clearer understanding of the importance of communication costs
on modern machines, models like BSP and LogP
help analyze communication cost and hence improve
the structure of communication. These models expose
the important costs associated with a communication
event, such as latency, bandwidth, or overhead, allowing
algorithm designers to factor them into the comparative
analysis of parallel algorithms. Even more, the emphasis on modeling communication cost has shifted to the
cost at the nodes that are the endpoints of the communication message, such that the number of messages
and contention at the endpoints have become more
important than mapping to network technologies. In
fact, both the BSP and LogP models ignore network
topology, modeling network delay as a constant value.
An in-depth comparison of BSP and LogP has been
performed in [], showing that both models are roughly
equivalent in terms of expressiveness, slightly favoring BSP for its higher-level abstraction. But it was
exactly this model that was found to be too restrictive
by the designers of LogP []. Both models have their
advantages and disadvantages. LogP is better suited for
modeling applications that actually use point-to-point
communication, while BSP is better and simpler for
data-parallel applications that fit the superstep model.
The BSP model also provides an elegant framework
that can be used to reason about communication and
parallel performance. The major contribution of both
models is the explicit acknowledgement of communication costs that are dependent on properties of the
underlying machine architecture.
The BSP and LogP models are important steps
toward a realistic architectural model for designing and
analyzing parallel algorithms. By experimenting with
the values of the key parameters in the models, it is possible to determine how an algorithm will perform across
a range of architectures and how it should be structured
for different architectures or for portable performance.
Related Entries
Amdahl’s Law
BSP (Bulk Synchronous Parallelism)
Collective Communication
Gustafson’s Law
Models of Computation, Theoretical
PRAM (Parallel Random Access Machines)
Bibliographic Notes and Further
Reading
The presented models and their extensions were originally introduced in papers [, , ] and described in
textbooks [–], which were partially used for writing this entry. Various models similar to
BSP and LogP have been proposed: Queuing Shared
Memory (QSM) [], LPRAM [], Decomposable BSP
[], etc. Further papers in the list of references deal
with particular applications of the models and their
classification and standardization.
Bibliography
. Alexandrov A, Ionescu M, Schauser KE, Scheiman C ()
LogGP: incorporating long messages into the LogP model – one
step closer towards a realistic model for parallel computation.
In: th ACM Symposium on Parallel Algorithms and Architectures (SPAA’), Santa Barbara, California, pp –, July 
. Bilardi G, Herley KT, Pietracaprina A, Pucci G, Spirakis P ()
BSP versus LogP. Algorithmica :–
. Culler DE, Dusseau AC, Martin RP, Schauser KE () Fast parallel sorting under LogP: from theory to practice. In: Portability
and Performance for Parallel Processing, Wiley, Southampton,
pp –, 
. Culler DE, Karp R, Sahay A, Schauser KE, Santos E,
Subramonian R, von Eicken T () LogP: towards a realistic model of parallel computation. In: th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming
(PPoPP’), pp –, 
. Goudreau MW, Hill JM, Lang K, McColl WF, Rao SD, Stefanescu
DC, Suel T, Tsantilas T () A proposal for a BSP Worldwide standard. Technical Report, BSP Worldwide, www.bspwordwide.org, 
. Hill M, McColl W, Skillicorn D () Questions and answers
about BSP. Scientific Programming ():–
. Kielmann T, Bal HE, Verstoep K () Fast measurement of
LogP parameters for message passing platforms. In: th Workshop on Runtime Systems for Parallel Programming (RTSPP),
held in conjunction with IPDPS , May 
. Kielmann T, Bal HE, Gorlatch S, Verstoep K, Hofman RFH
() Network performance-aware collective communication
for clustered wide area systems. Parallel Computing ():
–
. Martin RP, Culler DE () NFS sensitivity to high performance networks. In: ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems,
pp –, 
. Pješivac-Grbović J, Angskun T, Bosilca G, Fagg GE, Gabriel E,
Dongarra JJ () Performance analysis of MPI collective operations. In: th International Workshop on Performance Modeling,
Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS’), April 
. Valiant LG () A bridging model for parallel computation.
Communications of the ACM ():–
. Rauber T, Rünger G () Parallel programming: for multicore
and cluster systems. Springer, New York
. Hwang K, Xu Z () Scalable parallel computing.
WCB/McGraw-Hill, New York
. Culler DE, Singh JP, Gupta A () Parallel computer architecture – a hardware/software approach. Morgan Kaufmann, San
Francisco
. Gibbons PB, Matias Y, Ramachandran V () Can a shared-memory model serve as a bridging-model for parallel computation? Theor Comput Syst ():–
. Aggarwal A, Chandra AK, Snir M () Communication complexity of PRAMs. Theor Comput Sci :–
. De la Torre P, Kruskal CP () Submachine locality in the bulk
synchronous setting. In: Proceedings of the EUROPAR’, LNCS
, pp –, Springer, Berlin, August 
Banerjee’s Dependence Test
Utpal Banerjee
University of California at Irvine, Irvine, CA, USA
Definition
Banerjee’s Test is a simple and effective data dependence test widely used in automatic vectorization and
parallelization of loops. It detects dependence between
statements caused by subscripted variables by analyzing
the subscripts.
Discussion
Introduction
A restructuring compiler transforms a sequential program into a parallel form that can run efficiently on a
parallel machine. The task of the compiler is to discover
the parallelism that may be hiding in the program and
bring it out into the open. The first step in this process is to compute the dependence structure imposed
on the program operations by the sequential execution
order. The parallel version of the sequential program
must obey the same dependence constraints, so that it
computes the same final values as the original program.
(If an operation B depends on an operation A in the
sequential program, then A must be executed before
B in the parallel program.) To detect possible dependence between two operations, one needs to know if
they access the same memory location during sequential execution and in which order. Banerjee’s Test provides a simple mechanism for dependence detection in
loops when subscripted variables are involved.
In this essay, Banerjee’s test is developed for one-dimensional array variables in assignment statements
within a perfect loop nest. Pointers are given for extensions to more complicated situations. The first section
below is on mathematical preliminaries, where certain
concepts and results are presented that are essential for
an understanding of the test. After that the relevant
dependence concepts are explained, and then the test
itself is discussed.
Mathematical Preliminaries
Linear Diophantine Equations
Let Z denote the set of all integers. An integer b divides
an integer a, if there exists an integer c such that a = bc.
For a list of integers $a_1, a_2, \ldots, a_m$, not all zero, the greatest common divisor or gcd is the largest positive integer
that divides each member of the list. It is denoted by
$\gcd(a_1, a_2, \ldots, a_m)$. The gcd of a list of zeros is defined
to be 0.
A linear diophantine equation in m variables is an
equation of the form
$$a_1 x_1 + a_2 x_2 + \cdots + a_m x_m = c$$
where the coefficients $a_1, a_2, \ldots, a_m$ are integers not all
zero, c is an integer, and $x_1, x_2, \ldots, x_m$ are integer variables. A solution to this equation is a sequence of integers $(i_1, i_2, \ldots, i_m)$ such that $\sum_{k=1}^{m} a_k i_k = c$. The following
theorem is a well-known result in Number Theory.
Theorem 1
The linear diophantine equation
$$a_1 x_1 + a_2 x_2 + \cdots + a_m x_m = c$$
has a solution if and only if $\gcd(a_1, a_2, \ldots, a_m)$ divides c.
Proof The “only if” part is easy to prove. If the equation
has a solution, then there are integers $i_1, i_2, \ldots, i_m$ such
that $\sum_{k=1}^{m} a_k i_k = c$. Since $\gcd(a_1, a_2, \ldots, a_m)$ divides each
$a_k$, it must also divide c. To get the “if” part (and derive
the general solution), see the proof of Theorem .
in [].
Lexicographic Order
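For any positive integer m, the set of all integer
m-vectors $(i_1, i_2, \ldots, i_m)$ is denoted by $\mathbf{Z}^m$. The zero
vector $(0, 0, \ldots, 0)$ is abbreviated as $\mathbf{0}$. Addition and
subtraction of members of $\mathbf{Z}^m$ are defined coordinatewise in the usual way.
For $1 \le \ell \le m$, a relation $\prec_\ell$ in $\mathbf{Z}^m$ is defined as
follows: If $\mathbf{i} = (i_1, i_2, \ldots, i_m)$ and $\mathbf{j} = (j_1, j_2, \ldots, j_m)$ are
vectors in $\mathbf{Z}^m$, then $\mathbf{i} \prec_\ell \mathbf{j}$ if
$$i_1 = j_1,\ i_2 = j_2,\ \ldots,\ i_{\ell-1} = j_{\ell-1},\ \text{and}\ i_\ell < j_\ell.$$
The lexicographic order $\prec$ in $\mathbf{Z}^m$ is then defined by
requiring that $\mathbf{i} \prec \mathbf{j}$ if $\mathbf{i} \prec_\ell \mathbf{j}$ for some $\ell$ in $1 \le \ell \le m$.
It is often convenient to write $\mathbf{i} \prec_{m+1} \mathbf{j}$ when $i_1 = j_1$,
$i_2 = j_2, \ldots, i_m = j_m$, that is, $\mathbf{i} = \mathbf{j}$. The notation $\mathbf{i} \preceq \mathbf{j}$
means either $\mathbf{i} \prec \mathbf{j}$ or $\mathbf{i} = \mathbf{j}$, that is, $\mathbf{i} \prec_\ell \mathbf{j}$ for some $\ell$ in
$1 \le \ell \le m + 1$. Note that $\preceq$ is a total order in $\mathbf{Z}^m$.
The associated relations $\succ$ and $\succeq$ are defined in the
usual way: $\mathbf{j} \succ \mathbf{i}$ means $\mathbf{i} \prec \mathbf{j}$, and $\mathbf{j} \succeq \mathbf{i}$ means $\mathbf{i} \preceq \mathbf{j}$. ($\succeq$ is
also a total order in $\mathbf{Z}^m$.) An integer vector $\mathbf{i}$ is positive
if $\mathbf{i} \succ \mathbf{0}$, nonnegative if $\mathbf{i} \succeq \mathbf{0}$, and negative if $\mathbf{i} \prec \mathbf{0}$.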
Let $\mathbf{R}$ denote the field of real numbers. The sign
function $\operatorname{sgn} : \mathbf{R} \to \mathbf{Z}$ is defined by
$$\operatorname{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0, \end{cases}$$
for each x in $\mathbf{R}$. The direction vector of any vector
$(i_1, i_2, \ldots, i_m)$ in $\mathbf{Z}^m$ is $(\operatorname{sgn}(i_1), \operatorname{sgn}(i_2), \ldots, \operatorname{sgn}(i_m))$.
Note that a vector is positive (negative) if and only if its
direction vector is positive (negative).
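The level-wise relation, the lexicographic order, and direction vectors translate directly into code. The following Python sketch is one possible rendering. The function names are chosen for this illustration, and vectors are assumed to be tuples of equal length.

```python
def sgn(x):
    """Sign function: 1, 0, or -1."""
    return (x > 0) - (x < 0)

def lex_level(i, j):
    """Return the level l (1-based) at which i precedes j, i.e. the first
    coordinate where they differ with i smaller; m + 1 if i == j;
    None if i follows j in lexicographic order."""
    for level, (a, b) in enumerate(zip(i, j), start=1):
        if a < b:
            return level
        if a > b:
            return None
    return len(i) + 1

def lex_less(i, j):
    """Strict lexicographic order: i precedes j at some level 1..m."""
    level = lex_level(i, j)
    return level is not None and level <= len(i)

def direction_vector(v):
    """Coordinate-wise sign of an integer vector."""
    return tuple(sgn(x) for x in v)

def is_positive(v):
    """A vector is positive if the zero vector strictly precedes it."""
    return lex_less((0,) * len(v), v)

level = lex_level((1, 2, 3), (1, 3, 0))     # first difference at coordinate 2
direction = direction_vector((0, -4, 7))
```

A vector is positive exactly when its first nonzero coordinate is positive, which is why a vector and its direction vector are always positive or negative together.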
Extreme Values of Linear Functions
For any positive integer m, let Rm denote the
m-dimensional Euclidean space consisting of all real
m-vectors. It is a real vector space where vector addition and scalar multiplication are defined coordinatewise. The concepts of the previous subsection stated in
terms of integer vectors can be trivially extended to the
realm of real vectors. For a real number a, we define the
positive part a+ and the negative part a− as in []:
\[ a^+ = \frac{|a| + a}{2} = \max(a, 0) \quad\text{and}\quad a^- = \frac{|a| - a}{2} = \max(-a, 0). \]
Thus, $a^+ = a$ and $a^- = 0$ for $a \ge 0$, while $a^+ = 0$ and $a^- = -a$ for $a \le 0$. For example, $5^+ = 5$, $5^- = 0$, $(-5)^+ = 0$, and $(-5)^- = 5$. The following lemma lists the basic properties of positive and negative parts of a number (Lemma . in [], Lemma . in []).
Lemma 1 For any real number $a$, the following statements hold:
1. $a^+ \ge 0$, $a^- \ge 0$,
2. $a = a^+ - a^-$, $|a| = a^+ + a^-$,
3. $(-a)^+ = a^-$, $(-a)^- = a^+$,
4. $(a^+)^+ = a^+$, $(a^+)^- = 0$,
5. $(a^-)^+ = a^-$, $(a^-)^- = 0$,
6. $-a^- \le a \le a^+$.
The next lemma gives convenient expressions for the
extreme values of a simple function. (This is Lemma .
in []; it generalizes Lemma . in [].)
Lemma 2 Let $a, p, q$ denote real constants, where $p < q$. The minimum and maximum values of the function $f(x) = ax$ on the interval $p \le x \le q$ are $(a^+ p - a^- q)$ and $(a^+ q - a^- p)$, respectively.
Proof For $p \le x \le q$, the following hold:
\[ a^+ p \le a^+ x \le a^+ q \]
\[ -a^- q \le -a^- x \le -a^- p, \]
since $a^+ \ge 0$ and $-a^- \le 0$. Adding these two sets of inequalities, one gets
\[ a^+ p - a^- q \le ax \le a^+ q - a^- p, \]
since $a = a^+ - a^-$. To complete the proof, note that these bounds for $f(x)$ are actually attained at the end points $x = p$ and $x = q$, and therefore they are the extreme values of $f$. For example, when $a \ge 0$, it follows that
\[ a^+ p - a^- q = ap = f(p) \]
\[ a^+ q - a^- p = aq = f(q). \]
The following theorem gives the extreme values of a function of two variables in some simple domains; these results will be needed in the next section.
Theorem 2 Let $a, b, q$ denote real constants, where $q > 0$. Define a function $f : \mathbb{R}^2 \to \mathbb{R}$ by $f(x, y) = ax - by$ for each $(x, y) \in \mathbb{R}^2$. Let $A$ denote the rectangle $\{(x, y) \in \mathbb{R}^2 : 0 \le x \le q,\ 0 \le y \le q\}$. The extreme values of $f$ on $A$ under an additional restriction are as follows:
1. If $x = y$, then
\[ -(a - b)^- q \le ax - by \le (a - b)^+ q. \]
2. If $x \le y - 1$, then
\[ -b - (a^- + b)^+ (q - 1) \le ax - by \le -b + (a^+ - b)^+ (q - 1). \]
3. If $x \ge y + 1$, then
\[ a - (b^+ - a)^+ (q - 1) \le ax - by \le a + (b^- + a)^+ (q - 1). \]
4. If $(x, y)$ varies freely in $A$, then
\[ -(a^- + b^+) q \le ax - by \le (a^+ + b^-) q. \]
Proof By Lemma 2, $0 \le x \le q$ implies $-a^- q \le ax \le a^+ q$. This result and Lemma 1 are used repeatedly in the following proof.
Case 1. Let $x = y$. Then $ax - by = (a - b)x$. Since $0 \le x \le q$, one gets
\[ -(a - b)^- q \le ax - by \le (a - b)^+ q. \]
Case 2. Let $0 \le x \le y - 1 \le q - 1$. Then $ax \le a^+ (y - 1)$. Hence,
\[ ax - by = -b + [ax - b(y - 1)] \le -b + (a^+ - b)(y - 1). \]
Since $0 \le y - 1 \le q - 1$, an upper bound for $(ax - by)$ is given by:
\[ ax - by \le -b + (a^+ - b)^+ (q - 1). \]
To derive a lower bound, replace $a$ with $-a$ and $b$ with $-b$ to get
\[ -ax + by \le b + (a^- + b)^+ (q - 1), \]
so that
\[ -b - (a^- + b)^+ (q - 1) \le ax - by. \]
Case 3. Let $x \ge y + 1$. Then $0 \le y \le x - 1 \le q - 1$. In Case 2, replace $x$ by $y$, $y$ by $x$, $a$ by $-b$, and $b$ by $-a$, to get
\[ a - ((-b)^- - a)^+ (q - 1) \le (-b)y - (-a)x \le a + ((-b)^+ + a)^+ (q - 1), \]
or
\[ a - (b^+ - a)^+ (q - 1) \le ax - by \le a + (b^- + a)^+ (q - 1). \]
Case 4. Since $0 \le x \le q$, one gets $-a^- q \le ax \le a^+ q$. Also, since $0 \le y \le q$, it follows that $-b^- q \le by \le b^+ q$, that is, $-b^+ q \le -by \le b^- q$. Hence,
\[ -a^- q - b^+ q \le ax - by \le a^+ q + b^- q. \]
This gives the bounds for Case 4.
As in the proof of Lemma 2, it can be shown in each case that each bound is actually attained at some point of the corresponding domain. Hence, the bounds represent the extreme values of $f$ in each case.
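The four pairs of bounds can be checked numerically. The sketch below assumes integer $a$, $b$ and an integer $q \ge 1$, and compares each closed-form bound with a brute-force search over the integer points of the domain; for integer $q$ the vertices of each domain are integer points, so the two should agree. The names `pos`, `neg`, `bounds`, and `brute_force` are illustrative only.

```python
def pos(a):
    """Positive part a+ = max(a, 0)."""
    return max(a, 0)


def neg(a):
    """Negative part a- = max(-a, 0)."""
    return max(-a, 0)


def bounds(a, b, q, case):
    """Closed-form (min, max) of a*x - b*y on 0 <= x, y <= q under a restriction."""
    if case == "eq":      # x = y
        return -neg(a - b) * q, pos(a - b) * q
    if case == "lt":      # x <= y - 1
        return (-b - pos(neg(a) + b) * (q - 1),
                -b + pos(pos(a) - b) * (q - 1))
    if case == "gt":      # x >= y + 1
        return (a - pos(pos(b) - a) * (q - 1),
                a + pos(neg(b) + a) * (q - 1))
    if case == "free":    # no extra restriction
        return -(neg(a) + pos(b)) * q, (pos(a) + neg(b)) * q
    raise ValueError(case)


def brute_force(a, b, q, case):
    """(min, max) of a*x - b*y over the integer points of the same domain."""
    allowed = {
        "eq": lambda x, y: x == y,
        "lt": lambda x, y: x <= y - 1,
        "gt": lambda x, y: x >= y + 1,
        "free": lambda x, y: True,
    }[case]
    values = [a * x - b * y
              for x in range(q + 1) for y in range(q + 1) if allowed(x, y)]
    return min(values), max(values)
```

A loop over small coefficients, such as `a, b` in `range(-3, 4)` and `q` in `range(1, 5)`, exercises every sign combination of the positive and negative parts.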
The Euclidean space $\mathbb{R}^m$ is an inner product space, where the inner product of two vectors $\mathbf{x} = (x_1, x_2, \ldots, x_m)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ is defined by $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{k=1}^{m} x_k y_k$. The inner product defines a norm (length), the norm defines a metric (distance), and the metric defines a topology. In this topological vector space, one can talk about bounded sets, open and closed sets, compact sets, and connected sets. The following theorem is easily derived from well-known results in topology. (See the topology book by Kelley [8] for details.)
Theorem 3 Let $f : \mathbb{R}^m \to \mathbb{R}$ be a continuous function. If a set $A \subset \mathbb{R}^m$ is closed, bounded, and connected, then $f(A)$ is a finite closed interval of $\mathbb{R}$.
Proof Note that $\mathbb{R}^m$ and $\mathbb{R}$ are both Euclidean spaces. A subset of a Euclidean space is compact if and only if it is closed and bounded. Thus, the set $A$ is compact and connected. Since $f$ is continuous, it maps a compact set onto a compact set, and a connected set onto a connected set. Hence, the set $f(A)$ is compact and connected, that is, closed, bounded, and connected. Therefore, $f(A)$ must be a finite closed interval of $\mathbb{R}$.
Corollary 1 Let $f : \mathbb{R}^m \to \mathbb{R}$ denote a continuous function, and let $A$ be a closed, bounded, and connected subset of $\mathbb{R}^m$. Then $f$ assumes a minimum and a maximum value on $A$. And for any real number $c$ satisfying the inequalities
\[ \min_{\mathbf{x} \in A} f(\mathbf{x}) \le c \le \max_{\mathbf{x} \in A} f(\mathbf{x}), \]
the equation $f(\mathbf{x}) = c$ has a solution $\mathbf{x}_0 \in A$.
Proof By Theorem 3, the image $f(A)$ of $A$ is a finite closed interval $[\alpha, \beta]$ of $\mathbb{R}$. Then for each $\mathbf{x} \in A$, one has $\alpha \le f(\mathbf{x}) \le \beta$. So, $\alpha$ is a lower bound and $\beta$ an upper bound for $f$ on $A$. Since $\alpha \in f(A)$ and $\beta \in f(A)$, there are points $\mathbf{x}_1, \mathbf{x}_2 \in A$ such that $f(\mathbf{x}_1) = \alpha$ and $f(\mathbf{x}_2) = \beta$. Thus, $f$ assumes a minimum and a maximum value on $A$, and $\alpha = \min_{\mathbf{x} \in A} f(\mathbf{x})$ and $\beta = \max_{\mathbf{x} \in A} f(\mathbf{x})$. If $\alpha \le c \le \beta$, then $c \in f(A)$, that is, $f(\mathbf{x}_0) = c$ for some $\mathbf{x}_0 \in A$.
Dependence Concepts
For more details on the material of this section, see Chapter  in []. An assignment statement has the form
    S:  x = E
where $S$ is a label, $x$ a variable, and $E$ an expression. Such a statement reads the memory locations specified in $E$, and writes the location $x$. The output variable of $S$ is $x$ and its input variables are the variables in $E$. Let $\mathrm{Out}(S) = \{x\}$ and denote by $\mathrm{In}(S)$ the set of all input variables of $S$.
The basic program model is a perfect nest of loops $\mathbf{L} = (L_1, L_2, \ldots, L_m)$ shown in Fig. 1. For $1 \le k \le m$, the index variable $I_k$ of $L_k$ runs from 0 to some positive integer $N_k$ in steps of 1. The index vector of the loop nest is $\mathbf{I} = (I_1, I_2, \ldots, I_m)$. An index point or index value of the nest is a possible value of the index vector, that is, a vector $\mathbf{i} = (i_1, i_2, \ldots, i_m) \in \mathbb{Z}^m$ such that $0 \le i_k \le N_k$ for $1 \le k \le m$. The subset of $\mathbb{Z}^m$ consisting of all index points is the index space of the loop nest. During sequential execution of the nest, the index vector starts at the index point $\mathbf{0} = (0, 0, \ldots, 0)$ and ends at the point $(N_1, N_2, \ldots, N_m)$ after traversing the entire index space in the lexicographic order.
    L1:  do I1 = 0, N1, 1
    L2:    do I2 = 0, N2, 1
             ...
    Lm:      do Im = 0, Nm, 1
               H(I1, I2, ..., Im)
             enddo
             ...
           enddo
         enddo
Banerjee’s Dependence Test. Fig. 1 A perfect loop nest
The body of the loop nest $\mathbf{L}$ is denoted by $H(\mathbf{I})$ or $H$; it is assumed to be a sequence of assignment statements. A given index value $\mathbf{i}$ defines a particular instance $H(\mathbf{i})$ of $H(\mathbf{I})$, which is an iteration of $\mathbf{L}$. An iteration $H(\mathbf{i})$ is executed before an iteration $H(\mathbf{j})$ if and only if $\mathbf{i} \prec \mathbf{j}$.
A typical statement in the body of the loop nest is denoted by $S(\mathbf{I})$ or $S$, and its instance for an index value $\mathbf{i}$ is written as $S(\mathbf{i})$. Let $S$ and $T$ denote any two (not necessarily distinct) statements in the body. Statement $T$ depends on statement $S$, if there exist a memory location $M$, and two index points $\mathbf{i}$ and $\mathbf{j}$, such that
1. The instances S(i) of S and T(j) of T both reference (read or write) M
2. In the sequential execution of the program, S(i) is executed before T(j).
If i and j are a pair of index points that satisfy these two
conditions, then it is convenient to say that T(j) depends
on S(i). Thus, T depends on S if and only if at least one
instance of T depends on at least one instance of S. The
concept of dependence can have various attributes as
described below.
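Let $\mathbf{i} = (i_1, i_2, \ldots, i_m)$ and $\mathbf{j} = (j_1, j_2, \ldots, j_m)$ denote a pair of index points, such that $T(\mathbf{j})$ depends on $S(\mathbf{i})$. If $S$ and $T$ are distinct and $S$ appears lexically before $T$ in $H$, then $S(\mathbf{i})$ is executed before $T(\mathbf{j})$ if and only if $\mathbf{i} \preceq \mathbf{j}$. Otherwise, $S(\mathbf{i})$ is executed before $T(\mathbf{j})$ if and only if $\mathbf{i} \prec \mathbf{j}$. If $\mathbf{i} \preceq \mathbf{j}$, there is a unique integer $\ell$ in $1 \le \ell \le m + 1$ such that $\mathbf{i} \prec_\ell \mathbf{j}$. If $\mathbf{i} \prec \mathbf{j}$, there is a unique integer $\ell$ in $1 \le \ell \le m$ such that $\mathbf{i} \prec_\ell \mathbf{j}$. This integer $\ell$ is a dependence level for the dependence of $T$ on $S$. The vector $\mathbf{d} \in \mathbb{Z}^m$, defined by $\mathbf{d} = \mathbf{j} - \mathbf{i}$, is a distance vector for the same dependence. Also, the direction vector $\boldsymbol{\sigma}$ of $\mathbf{d}$, defined by
\[ \boldsymbol{\sigma} = (\operatorname{sgn}(j_1 - i_1), \operatorname{sgn}(j_2 - i_2), \ldots, \operatorname{sgn}(j_m - i_m)), \]
is a direction vector for this dependence. Note that both $\mathbf{d}$ and $\boldsymbol{\sigma}$ are nonnegative vectors if $S$ precedes $T$ in $H$, and both are positive vectors otherwise. The dependence of $T(\mathbf{j})$ on $S(\mathbf{i})$ is carried by the loop $L_\ell$ if the dependence level $\ell$ is between 1 and $m$. Otherwise, the dependence is loop-independent.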
A statement can reference a memory location only
through one of its variables. The definition of dependence is now extended to make explicit the role played
by the variables in the statements under consideration.
A variable u(I) of the statement S and a variable v(I) of
the statement T cause a dependence of T on S, if there
are index points i and j, such that
1. The instance u(i) of u(I) and the instance v(j) of v(I) both represent the same memory location;
2. In the sequential execution of the program, S(i) is executed before T(j).
If these two conditions hold, then the dependence caused by u(I) and v(I) is characterized as
1. A flow dependence if u(I) ∈ Out(S) and v(I) ∈ In(T);
2. An anti-dependence if u(I) ∈ In(S) and v(I) ∈ Out(T);
3. An output dependence if u(I) ∈ Out(S) and v(I) ∈ Out(T);
4. An input dependence if u(I) ∈ In(S) and v(I) ∈ In(T).
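The four-way classification depends only on whether each of the two variables is written or read by its statement. A minimal sketch, in which the function name and the boolean encoding are assumptions made for illustration:

```python
def dependence_kind(u_is_output: bool, v_is_output: bool) -> str:
    """Classify the dependence of T on S caused by u (in S) and v (in T).

    u_is_output: True if u is the output variable of S, False if it is an input.
    v_is_output: True if v is the output variable of T, False if it is an input.
    """
    if u_is_output and not v_is_output:
        return "flow"      # S writes, T reads
    if not u_is_output and v_is_output:
        return "anti"      # S reads, T writes
    if u_is_output and v_is_output:
        return "output"    # both write
    return "input"         # both read
```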
Banerjee’s Test
In its simplest form, Banerjee’s Test is a necessary condition for the existence of dependence of one statement on another at a given level, caused by a pair of one-dimensional array variables. This form of the test is described in Theorem 4. Theorem 5 gives the version of the test that deals with dependence with a fixed direction vector. Later on, pointers are given for extending the test in different directions.
Theorem 4 Consider any two assignment statements $S$ and $T$ in the body $H$ of the loop nest of Fig. 1. Let $X(f(\mathbf{I}))$ denote a variable of $S$ and $X(g(\mathbf{I}))$ a variable of $T$, where $X$ is a one-dimensional array, $f(\mathbf{I}) = a_0 + \sum_{k=1}^{m} a_k I_k$, $g(\mathbf{I}) = b_0 + \sum_{k=1}^{m} b_k I_k$, and the $a$’s and the $b$’s are all integer constants. If $X(f(\mathbf{I}))$ and $X(g(\mathbf{I}))$ cause a dependence of $T$ on $S$ at a level $\ell$, then the following two conditions hold:
(A) $\gcd(a_1 - b_1, \ldots, a_{\ell-1} - b_{\ell-1}, a_\ell, \ldots, a_m, b_\ell, \ldots, b_m)$ divides $(b_0 - a_0)$;
(B) $\alpha \le b_0 - a_0 \le \beta$, where
\[ \alpha = -b_\ell - \sum_{k=1}^{\ell-1} (a_k - b_k)^- N_k - (a_\ell^- + b_\ell)^+ (N_\ell - 1) - \sum_{k=\ell+1}^{m} (a_k^- + b_k^+) N_k \]
\[ \beta = -b_\ell + \sum_{k=1}^{\ell-1} (a_k - b_k)^+ N_k + (a_\ell^+ - b_\ell)^+ (N_\ell - 1) + \sum_{k=\ell+1}^{m} (a_k^+ + b_k^-) N_k. \]
Proof Assume that the variables $X(f(\mathbf{I}))$ and $X(g(\mathbf{I}))$ do cause a dependence of statement $T$ on statement $S$ at a level $\ell$. If $S$ precedes $T$ in $H$, the dependence level $\ell$ could be any integer in $1 \le \ell \le m + 1$. Otherwise, $\ell$ must be in the range $1 \le \ell \le m$. By hypothesis, there exist two index points $\mathbf{i} = (i_1, i_2, \ldots, i_m)$ and $\mathbf{j} = (j_1, j_2, \ldots, j_m)$ such that $\mathbf{i} \prec_\ell \mathbf{j}$, and $X(f(\mathbf{i}))$ and $X(g(\mathbf{j}))$ represent the same memory location.
The restrictions on $i_1, i_2, \ldots, i_m, j_1, j_2, \ldots, j_m$ are as follows:
\[ 0 \le i_k \le N_k, \quad 0 \le j_k \le N_k \qquad (1 \le k \le m) \]
\[ i_k = j_k \qquad (1 \le k \le \ell - 1) \]
\[ i_\ell \le j_\ell - 1. \]
(As $i_\ell$ and $j_\ell$ are integers, $i_\ell < j_\ell$ means $i_\ell \le j_\ell - 1$.) Since $f(\mathbf{i})$ and $g(\mathbf{j})$ must be identical, it follows that
\[ b_0 - a_0 = \sum_{k=1}^{\ell-1} (a_k i_k - b_k j_k) + (a_\ell i_\ell - b_\ell j_\ell) + \sum_{k=\ell+1}^{m} (a_k i_k - b_k j_k). \]
Because $i_k = j_k$ for $1 \le k \le \ell - 1$, this equation is equivalent to
\[ b_0 - a_0 = \sum_{k=1}^{\ell-1} (a_k - b_k) i_k + (a_\ell i_\ell - b_\ell j_\ell) + \sum_{k=\ell+1}^{m} (a_k i_k - b_k j_k). \]
First, think of $i_1, i_2, \ldots, i_m, j_\ell, j_{\ell+1}, \ldots, j_m$ as integer variables. Since this second form is a linear diophantine equation that has a solution, condition (A) of the theorem is implied by Theorem 1.
Next, think of $i_1, i_2, \ldots, i_m, j_\ell, j_{\ell+1}, \ldots, j_m$ as real variables. Note that for $1 \le k < t \le m$, there is no relation between the pair of variables $(i_k, j_k)$ and the pair of variables $(i_t, j_t)$. Hence, the minimum (maximum) value of the right-hand side of the equation for $b_0 - a_0$ can be computed by summing the minimum (maximum) values of all the individual terms $(a_k i_k - b_k j_k)$. For each $k$ in $1 \le k \le m$, one can compute the extreme values of the term $(a_k i_k - b_k j_k)$ by using a suitable case of Theorem 2. It is then clear that $\alpha$ is the minimum value of the right-hand side, and $\beta$ is its maximum value. Hence, $(b_0 - a_0)$ must lie between $\alpha$ and $\beta$. This is condition (B) of the theorem.
Remarks
1. In general, the two conditions of Theorem 4 are necessary for existence of dependence, but they are not sufficient. Suppose both conditions hold. First, think in terms of real variables. Let $P$ denote the subset of $\mathbb{R}^{2m}$ defined by the restrictions on the index variables listed in the proof (the loop bounds, the equalities $i_k = j_k$ for $1 \le k \le \ell - 1$, and $i_\ell \le j_\ell - 1$). Then $P$ is a closed, bounded, and connected set. The right-hand side of the dependence equation represents a real-valued continuous function on $\mathbb{R}^{2m}$. Its extreme values on $P$ are given by $\alpha$ and $\beta$. Condition (B) implies that there is a real solution to the dependence equation that satisfies these restrictions (by Corollary 1 to Theorem 3). On the other hand, condition (A) implies that there is an integer solution to the dependence equation without any further constraints (by Theorem 1). Theoretically, the two conditions together do not quite imply the existence of an integer solution with the constraints needed to guarantee the dependence of $T(\mathbf{j})$ on $S(\mathbf{i})$. However, note that using Theorem 4, one would never falsely conclude that a dependence does not exist when in fact it does.
2. If one of the two conditions of Theorem 4 fails to hold, then there is no dependence. If both of them hold, there may or may not be dependence. In practice, however, it usually turns out that when the conditions hold, there is dependence. This can be explained by the fact that there are certain types of array subscripts for which the conditions are indeed sufficient, and these are the types most commonly encountered in practice. Theorem . in [] shows that the conditions are sufficient, if there is an integer $t > 0$ such that $a_k, b_k \in \{-t, 0, t\}$ for $1 \le k \le m$. Psarris et al. [11] prove the sufficiency for another large class of subscripts commonly found in real programs.
For the dependence problem of Theorem 4, the equation $f(\mathbf{i}) = g(\mathbf{j})$, in either of the two forms derived in the proof, is the dependence equation; condition (A) is the gcd Test, and condition (B) is Banerjee’s Test.
Example 1 Consider the loop nest of Fig. 2, where $X$ is a one-dimensional array, and the constant terms $a_0, b_0$ in the subscripts are integers unspecified for the moment. Suppose one needs to check if $T$ is output-dependent on $S$ at level 2. For this problem, $m = 3$, $\ell = 2$, and
\[ (N_1, N_2, N_3) = (100, 50, 40), \quad (a_1, a_2, a_3) = (1, 2, -1), \quad (b_1, b_2, b_3) = (3, -1, 2). \]
    L1:  do I1 = 0, 100, 1
    L2:    do I2 = 0, 50, 1
    L3:      do I3 = 0, 40, 1
    S:         X(a0 + I1 + 2I2 − I3) = ···
    T:         X(b0 + 3I1 − I2 + 2I3) = ···
             enddo
           enddo
         enddo
Banerjee’s Dependence Test. Fig. 2 Loop nest of Example 1
Since $\gcd(a_1 - b_1, a_2, a_3, b_2, b_3) = \gcd(-2, 2, -1, -1, 2) = 1$, condition (A) of Theorem 4 is always satisfied. To test condition (B), evaluate $\alpha$ and $\beta$:
\[ \alpha = -b_2 - (a_1 - b_1)^- N_1 - (a_2^- + b_2)^+ (N_2 - 1) - (a_3^- + b_3^+) N_3 = -319 \]
\[ \beta = -b_2 + (a_1 - b_1)^+ N_1 + (a_2^+ - b_2)^+ (N_2 - 1) + (a_3^+ + b_3^-) N_3 = 148. \]
Condition (B) then becomes
\[ -319 \le b_0 - a_0 \le 148. \]
First, choose $a_0$ and $b_0$ so that $(b_0 - a_0)$ is outside this range. Then Banerjee’s Test is not satisfied. By Theorem 4, statement $T$ is not output-dependent on statement $S$ at level 2.
Next, take $a_0 = b_0$. Then $b_0 - a_0 = 0$ is within the range, and Banerjee’s Test is satisfied. Theorem 4 cannot guarantee the existence of the dependence under question. However, there exist two index points $\mathbf{i}$ and $\mathbf{j}$ of the loop nest, such that $\mathbf{i} \prec_2 \mathbf{j}$, the instance $S(\mathbf{i})$ of statement $S$ is executed before the instance $T(\mathbf{j})$ of statement $T$, and both instances write the same memory location. Thus, statement $T$ is indeed output-dependent on statement $S$ at level 2.
Theorem 4 dealt with dependence at a level. It is now extended to a result that deals with dependence with a given direction vector. A direction vector here has the general form $\boldsymbol{\sigma} = (\sigma_1, \sigma_2, \ldots, \sigma_m)$, where each $\sigma_k$ is either unspecified, or has one of the values $0, 1, -1$. When $\sigma_k$ is unspecified, one writes $\sigma_k = *$. (As components of a direction vector, many authors use ‘=, <, >’ in place of $0, 1, -1$, respectively.)
Theorem 5 Consider any two assignment statements $S$ and $T$ in the body $H$ of the loop nest of Fig. 1. Let $X(f(\mathbf{I}))$ denote a variable of $S$ and $X(g(\mathbf{I}))$ a variable of $T$, where $X$ is a one-dimensional array, $f(\mathbf{I}) = a_0 + \sum_{k=1}^{m} a_k I_k$, $g(\mathbf{I}) = b_0 + \sum_{k=1}^{m} b_k I_k$, and the $a$’s and the $b$’s are all integer constants. If these variables cause a dependence of $T$ on $S$ with a direction vector $(\sigma_1, \sigma_2, \ldots, \sigma_m)$, then the following two conditions hold:
(A) The gcd of all integers in the three lists
\[ \{a_k - b_k : \sigma_k = 0\}, \quad \{a_k : \sigma_k \ne 0\}, \quad \{b_k : \sigma_k \ne 0\} \]
divides $(b_0 - a_0)$;
(B) $\alpha \le b_0 - a_0 \le \beta$, where
\[ \alpha = -\sum_{\sigma_k = 0} (a_k - b_k)^- N_k - \sum_{\sigma_k = 1} \left[ b_k + (a_k^- + b_k)^+ (N_k - 1) \right] + \sum_{\sigma_k = -1} \left[ a_k - (b_k^+ - a_k)^+ (N_k - 1) \right] - \sum_{\sigma_k = *} (a_k^- + b_k^+) N_k \]
\[ \beta = \sum_{\sigma_k = 0} (a_k - b_k)^+ N_k + \sum_{\sigma_k = 1} \left[ -b_k + (a_k^+ - b_k)^+ (N_k - 1) \right] + \sum_{\sigma_k = -1} \left[ a_k + (b_k^- + a_k)^+ (N_k - 1) \right] + \sum_{\sigma_k = *} (a_k^+ + b_k^-) N_k. \]
Proof All expressions of the form $(a_k - b_k)$, where $1 \le k \le m$ and $\sigma_k = 0$, constitute the list $\{a_k - b_k : \sigma_k = 0\}$. The other two lists have similar meanings. The sum $\sum_{\sigma_k = 0}$ is taken over all values of $k$ in $1 \le k \le m$ such that $\sigma_k = 0$. The other three sums have similar meanings. As in the case of Theorem 4, the proof of this theorem follows directly from Theorem 1 and Theorem 2.
For the dependence problem with a direction vector, condition (A) of Theorem 5 is the gcd Test and condition (B) is Banerjee’s Test. Comments similar to those given in the Remarks above also apply to Theorem 5.
The expressions for α and β in Banerjee’s Test
may look complicated even though their numerical
evaluation is quite straightforward. With the goal of
developing the test quickly while avoiding complicated
formulas, the perfect loop nest of Fig. 1 was taken as the
model program, and the variables were assumed to be
one-dimensional array elements. However, Banerjee’s
Test can be extended to cover multidimensional array
elements in a much more general program. Detailed
references are given in the section on further reading.
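To show that the numerical evaluation is indeed straightforward, here is a minimal Python sketch of the gcd Test and Banerjee’s Test for a given direction vector, following the sums in the direction-vector theorem. The function name, argument layout, and the use of the string `"*"` for an unspecified component are assumptions of this sketch. A level-$\ell$ problem is expressed here by the direction vector with zeros in positions before $\ell$, a 1 in position $\ell$, and `"*"` afterwards.

```python
from functools import reduce
from math import gcd


def pos(a):
    """Positive part a+ = max(a, 0)."""
    return max(a, 0)


def neg(a):
    """Negative part a- = max(-a, 0)."""
    return max(-a, 0)


def banerjee_direction_test(a0, a, b0, b, N, sigma):
    """Return (gcd_ok, alpha, beta, bounds_ok) for subscripts
    f(I) = a0 + sum(a[k]*I[k]) and g(I) = b0 + sum(b[k]*I[k]),
    loop upper limits N, and direction vector sigma (entries 0, 1, -1, "*").
    """
    gcd_terms = []
    alpha = beta = 0
    for ak, bk, Nk, sk in zip(a, b, N, sigma):
        if sk == 0:                      # i_k = j_k
            gcd_terms.append(ak - bk)
            alpha -= neg(ak - bk) * Nk
            beta += pos(ak - bk) * Nk
            continue
        gcd_terms += [ak, bk]
        if sk == 1:                      # i_k <= j_k - 1
            alpha -= bk + pos(neg(ak) + bk) * (Nk - 1)
            beta += -bk + pos(pos(ak) - bk) * (Nk - 1)
        elif sk == -1:                   # i_k >= j_k + 1
            alpha += ak - pos(pos(bk) - ak) * (Nk - 1)
            beta += ak + pos(neg(bk) + ak) * (Nk - 1)
        elif sk == "*":                  # no relation between i_k and j_k
            alpha -= (neg(ak) + pos(bk)) * Nk
            beta += (pos(ak) + neg(bk)) * Nk
        else:
            raise ValueError(f"bad direction component: {sk!r}")
    g = reduce(gcd, (abs(t) for t in gcd_terms), 0)
    c = b0 - a0
    gcd_ok = (c == 0) if g == 0 else (c % g == 0)
    return gcd_ok, alpha, beta, alpha <= c <= beta


# Coefficients and loop limits of the example loop nest, at level 2.
# The choice a0 = b0 = 0 is an assumption made for this illustration.
result = banerjee_direction_test(0, [1, 2, -1], 0, [3, -1, 2],
                                 [100, 50, 40], [0, 1, "*"])
```

With these inputs the sketch reproduces the bounds of the example, and the index points `i = (0, 0, 0)` and `j = (0, 2, 1)` give one concrete pair for which both subscripts evaluate to the same array element when `a0 = b0`.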
Related Entries
Code Generation
Dependence Abstractions
Dependences
Loop Nest Parallelization
Omega Test
Parallelization, Automatic
Parallelism Detection in Nested Loops, Optimal
Unimodular Transformations
Bibliographic Notes and Further
Reading
Banerjee’s Test first appeared in Utpal Banerjee’s MS thesis [2] at the University of Illinois, Urbana-Champaign, in . In that thesis, the test is developed for checking dependence at any level when the subscript functions $f(\mathbf{I})$ and $g(\mathbf{I})$ are polynomials in $I_1, I_2, \ldots, I_m$. The case where $f(\mathbf{I})$ and $g(\mathbf{I})$ are linear (affine) functions of the index variables is then derived as a special case (Theorem .). Theorem . of [2] appeared as Theorem  in the  paper by Banerjee, Chen, Kuck, and Towle [7]. Theorem 4 presented here is essentially the same theorem, but has a stronger gcd Test. (The stronger gcd Test was pointed out by Kennedy in [9].)
Banerjee’s Test for a direction vector of the form $\boldsymbol{\sigma} = (0, \ldots, 0, 1, -1, *, \ldots, *)$ was given by Kennedy in [9]. It is a special case of Theorem 5 presented here. See also the comprehensive paper by Allen and Kennedy [1]. Wolfe and Banerjee [12] give an extensive treatment of the dependence problem involving direction vectors. (The definition of the negative part of a number used in [12] is slightly different from that used here and in most other publications.)
The first book on dependence analysis [3] was published in ; it gave the earliest coverage of Banerjee’s Test in a book form. Dependence Analysis [6] published in  is a completely new work that subsumes the material in [3].
It is straightforward to extend Theorems 4 and 5 to the case where we allow the loop limits to be arbitrary integer constants, as long as the stride of each loop is kept at 1. See theorems . and . in []. This model
can be further extended to test the dependence of a
statement T on a statement S when the nest of loops
enclosing S is different from the nest enclosing T. See
theorems . and . in [].
A loop of the form “do $I = p, q, \theta$,” where $p, q, \theta$ are integers and $\theta \ne 0$, can be converted into the loop “do $\hat{I} = 0, N, 1$,” where $I = p + \hat{I}\theta$ and $N = \lfloor (q - p)/\theta \rfloor$. The new variable $\hat{I}$ is the iteration variable of the loop. Using this process of loop normalization, one can convert any nest of loops to a standard form if the loop limits and strides are integer constants. (See [].)
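Loop normalization can be sketched as follows. The function names are illustrative; Python's `//` operator rounds toward negative infinity, so it matches the floor in the formula for both positive and negative strides.

```python
def normalized_limit(p: int, q: int, theta: int) -> int:
    """Upper limit N of the normalized loop 'do Ihat = 0, N, 1'."""
    if theta == 0:
        raise ValueError("stride must be nonzero")
    return (q - p) // theta


def original_indices(p: int, q: int, theta: int) -> list:
    """Values taken by I = p + Ihat*theta as Ihat runs from 0 to N.

    A negative N means the loop body is never executed.
    """
    n = normalized_limit(p, q, theta)
    return [p + ihat * theta for ihat in range(0, n + 1)]
```

For example, `original_indices(10, 1, -3)` yields the same index values as the loop “do I = 10, 1, −3”.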
Consider now the dependence problem posed by
variables that come from a multidimensional array,
where each subscript is a linear (affine) function of the
index variables. The gcd Test can be generalized to handle this case; see Section . of [] and Section . of
[]. Also, Banerjee’s Test can be applied separately to
the subscripts in each dimension; see Theorem . in
[]. If the test is not satisfied in one particular dimension, then there is no dependence as a whole. But,
if the test is satisfied in each dimension, there is no
definite conclusion. Another alternative is array linearization; see Section . in []. Zhiyuan Li et al. [10]
did a study of dependence analysis for multidimensional array elements that did not involve subscript-by-subscript testing or array linearization. The general method presented in Chapter  of [] includes Li’s
λ-test.
The dependence problem is quite complex when one
has an arbitrary loop nest with loop limits that are linear functions of index variables, and multidimensional
array elements with linear subscripts. To understand the
general problem and some methods of solution, see the
developments in [] and [].
Many researchers have studied Banerjee’s Test over
the years; the test in various forms can be found in many
publications. Dependence Analysis [] covers this test
quite extensively. For a systematic development of the
dependence problem, descriptions of the needed mathematical tools, and applications of dependence analysis
to program transformations, see the books [–] in
the series on Loop Transformations for Restructuring
compilers.
Bibliography
1. Allen JR, Kennedy K (Oct ) Automatic translation of FORTRAN programs to vector form. ACM Trans Program Lang Syst ():–
2. Banerjee U (Nov ) Data dependence in ordinary programs. MS Thesis, Report –, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois
3. Banerjee U () Dependence analysis for supercomputing. Kluwer, Norwell
4. Banerjee U () Loop transformations for restructuring compilers: the foundations. Kluwer, Norwell
5. Banerjee U () Loop transformations for restructuring compilers: loop parallelization. Kluwer, Norwell
6. Banerjee U () Loop transformations for restructuring compilers: dependence analysis. Kluwer, Norwell
7. Banerjee U, Chen S-C, Kuck DJ, Towle RA (Sept ) Time and parallel processor bounds for FORTRAN-like loops. IEEE Trans Comput C-():–
8. Kelley JL () General topology. D. Van Nostrand Co., New York
9. Kennedy K (Oct ) Automatic translation of FORTRAN programs to vector form. Rice Technical Report --, Department of Mathematical Sciences, Rice University, Houston, Texas
10. Li Z, Yew P-C, Zhu C-Q (Jan ) An efficient data dependence analysis for parallelizing compilers. IEEE Trans Parallel Distrib Syst ():–
11. Psarris K, Klappholz D, Kong X (June ) On the accuracy of the Banerjee test. J Parallel Distrib Comput ():–
12. Wolfe M, Banerjee U (Apr ) Data dependence and its application to parallel processing. Int J Parallel Programming ():–
Barnes-Hut
See N-Body Computational Methods
Barriers
See NVIDIA GPU; Synchronization
Basic Linear Algebra Subprograms (BLAS)
See BLAS (Basic Linear Algebra Subprograms)
Behavioral Equivalences
Rocco De Nicola
Università degli Studi di Firenze, Firenze, Italy
Synonyms
Behavioral relations; Extensional equivalences
Definition
Behavioral equivalences serve to establish in which
cases two reactive (possibly concurrent) systems offer
similar interaction capabilities relative to other systems representing their operating environment. Behavioral equivalences have been mainly developed in the
context of process algebras, mathematically rigorous languages that have been used for describing and verifying
properties of concurrent communicating systems. By
relying on the so-called structural operational semantics (SOS), labeled transition systems are associated to
each term of a process algebra. Behavioral equivalences
are used to abstract from unwanted details and identify
those labeled transition systems that react “similarly”
to external experiments. Due to the large number of
properties which may be relevant in the analysis of concurrent systems, many different theories of equivalences
have been proposed in the literature. The main contenders consider those systems equivalent that (1) perform the same sequences of actions, or (2) perform the same sequences of actions and after each sequence are ready to accept the same sets of actions, or (3) perform the same sequences of actions and after each sequence exhibit, recursively, the same behavior. This approach
leads to many different equivalences that preserve
significantly different properties of systems.
Introduction
In many cases, it is useful to have theories which can
be used to establish whether two systems are equivalent or whether one is a satisfactory “approximation” of
another. It can be said that a system $S_1$ is equivalent to
a system $S_2$ whenever “some” aspects of the externally
observable behavior of the two systems are compatible.
If the same formalism is used to model what is required
of a system (its specification) and how it can actually be
built (its implementation), then it is possible to use theories based on equivalences to prove that a particular
concrete description is correct with respect to a given
abstract one. If a step-wise development method is
used, equivalences may permit substituting large specifications with equivalent concise ones. In general it
is useful to be able to interchange subsystems proved
behaviorally equivalent, in the sense that one subsystem
may replace another as part of a larger system without
affecting the behavior of the overall system.
The kind of equivalences, or approximations,
involved depends very heavily on how the systems
under consideration will be used. In fact, the way
a system is used determines the behavioral aspects
which must be taken into account and those which
can be ignored. It is then important to know, for
the considered equivalence, the systems properties it
preserves.
In spite of the general agreement on taking an
extensional approach for defining the equivalence of
concurrent or nondeterministic systems, there is still
disagreement on what “reasonable” observations are
and how their outcomes can be used to distinguish
or identify systems. Many different theories of equivalences have been proposed in the literature for models
which are intended to be used to describe and reason
about concurrent or nondeterministic systems. This is
mainly due to the large number of properties which
may be relevant in the analysis of such systems. Almost
all the proposed equivalences are based on the idea
that two systems are equivalent whenever no external
observation can distinguish them. In fact, for any given
system it is not its internal structure which is of interest
but its behavior with respect to the outside world, i.e.,
its effect on the environment and its reactions to stimuli
from the environment.
One of the most successful approaches for describing the formal, precise, behavior of concurrent systems is the so-called operational semantics. Within this
approach, concurrent programs or systems are modeled
as labeled transition systems (LTSs) that consist of a set
of states, a set of transition labels, and a transition relation. The states of the transition systems are programs,
while the labels of the transitions between states represent the actions (instructions) or the interactions that
are possible in a given state.
When defining behavioral equivalence of concurrent systems described as LTSs, one might think that
it is possible to consider systems equivalent if they
give rise to the same (isomorphic) LTSs. Unfortunately,
this would lead to unwanted distinctions, e.g., it would
consider the two LTSs below different
[Figure: two LTSs whose transitions are all labeled a, one over the state p and one over the states q and q1.]
in spite of the fact that their behavior is the same; they can (only) execute infinitely many a-actions, and they should thus be considered equivalent.
The basic principles for any reasonable equivalence
can be summarized as follows. It should:
● Abstract from states (consider only the actions)
● Abstract from internal behavior
● Identify processes whose LTSs are isomorphic
● Consider two processes equivalent only if both can execute the same action sequences
● Allow replacing a subprocess by an equivalent counterpart without changing the overall semantics of the system
However, these criteria are not sufficiently insightful
and discriminative, and the above adequacy requirements turn out to be still too loose. They have given rise
to many different kinds of equivalences, even when all
actions are considered visible.
The main equivalences over LTSs introduced in the
literature consider as equivalent those systems that:
1. Perform the same sequences of actions
2. Perform the same sequences of actions and after each sequence are ready to accept the same sets of actions
3. Perform the same sequences of actions and after each sequence exhibit, recursively, the same behavior
These three different criteria lead to three groups of equivalences that are known as traces equivalences, decorated-traces equivalences, and bisimulation-based equivalences. Equivalences in different classes behave differently relative to the three labeled transition systems in Fig. 1. The three systems represent the specifications of three vending machines that accept two coins and deliver coffee or tea. The trace-based equivalences equate all of them, the bisimulation-based equivalences distinguish all of them, and the decorated-traces equivalences distinguish the leftmost system from the other two, but equate the central and the rightmost one.
Many of these equivalences have been reviewed []; here, only the main ones are presented. First, equivalences that treat invisible (τ) actions as normal actions will be presented; then their variants that abstract from internal actions will be introduced.

[Figure: three labeled transition systems with initial states p, q, and r; their transitions are labeled coin1, coin2, coffee, and tea.]
Behavioral Equivalences. Fig. 1 Three vending machines
The equivalences will be formally defined on states of LTSs of the form $\langle Q, A_\tau, \xrightarrow{\mu} \rangle$, where $Q$ is a set of states, ranging over $p, q, p', q_1, \ldots$, $A_\tau$ is the set of labels, ranging over $a, b, c, \ldots$, that also contains the distinct silent action $\tau$, and $\xrightarrow{\mu}$ is the set of transitions. In the following, $s$ will denote a generic element of $A_\tau^*$, the set of all sequences of actions that a process might perform.
Traces Equivalence
The first equivalence is known as traces equivalence
and is perhaps the simplest of all; it is imported from
automata theory that considers those automata equivalent that generate the same language. Intuitively, two
processes are deemed traces equivalent if and only if
they can perform exactly the same sequences of actions.
In the formal definition, $p \xrightarrow{s} p'$, with $s = \mu_1 \mu_2 \ldots \mu_n$, denotes the sequence $p \xrightarrow{\mu_1} p_1 \xrightarrow{\mu_2} p_2 \ldots \xrightarrow{\mu_n} p'$ of transitions.
Two states $p$ and $q$ are traces equivalent ($p \simeq_T q$) if:
1. $p \xrightarrow{s} p'$ implies $q \xrightarrow{s} q'$ for some $q'$ and
2. $q \xrightarrow{s} q'$ implies $p \xrightarrow{s} p'$ for some $p'$.
A drawback of ≃T is that it is not sensitive to deadlocks. For example, if we consider the two LTSs below:
[Figure: two LTSs, one with states p1, p2, p3, p4 and one with states q1, q2, q3, over the actions a and b.]
we have that P ≃T Q , but P , unlike Q , after performing action a, can reach a state in which it cannot perform
any action, i.e., a deadlocked state.
Traces equivalence identifies all of the three LTSs of
Fig. . Indeed, it is not difficult to see that the three vending machines can perform the same sequences of visible
actions. Nevertheless, a customer with a definite preference for coffee who is offered a choice among the
three machines would certainly select
the leftmost one, since the others do not let him choose
what to drink.
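As a rough illustration, traces equivalence of two finite LTSs without τ-actions can be checked by following, in lockstep, the sets of states that each system can reach after the same sequence of actions (a subset construction in the style of automata theory). The sketch below is not part of the original entry: the dictionary encoding, the function names, and the state names of the three vending machines (one possible reading of the figure) are assumptions.

```python
from collections import deque

# Assumed encoding: state -> list of (action, successor) pairs.
# One reading of the three vending machines; state names are illustrative.
LTS = {
    "p": [("coin1", "p1")], "p1": [("coin2", "p2")],
    "p2": [("coffee", "p3"), ("tea", "p4")],
    "q": [("coin1", "q1")], "q1": [("coin2", "q2"), ("coin2", "q3")],
    "q2": [("coffee", "q4")], "q3": [("tea", "q5")],
    "r": [("coin1", "r1"), ("coin1", "r2")],
    "r1": [("coin2", "r3")], "r2": [("coin2", "r4")],
    "r3": [("coffee", "r5")], "r4": [("tea", "r6")],
}

def step(lts, states, action):
    """All states reachable from `states` by one `action` transition."""
    return frozenset(t for s in states for a, t in lts.get(s, ()) if a == action)

def traces_equivalent(lts, p, q):
    """True iff p and q can perform exactly the same sequences of actions."""
    start = (frozenset([p]), frozenset([q]))
    seen, queue = {start}, deque([start])
    while queue:
        ps, qs = queue.popleft()
        # Try every action enabled on at least one side.
        for a in {a for s in ps | qs for a, _ in lts.get(s, ())}:
            nps, nqs = step(lts, ps, a), step(lts, qs, a)
            if not nps or not nqs:
                return False  # only one side can extend the current trace
            if (nps, nqs) not in seen:
                seen.add((nps, nqs))
                queue.append((nps, nqs))
    return True

print(traces_equivalent(LTS, "p", "q"), traces_equivalent(LTS, "q", "r"))
```

Under this encoding all three machines are identified, in line with the discussion above; the check terminates because only finitely many pairs of state sets exist.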
Bisimulation Equivalence
The classical alternative to traces equivalence is bisimilarity (also known as observational equivalence) that
considers equivalent two systems that can simulate each
other step after step []. Bisimilarity is based on the
notion of bisimulation:
A relation $R \subseteq Q \times Q$ is a bisimulation if, for any pair of states p and q such that $\langle p, q \rangle \in R$, the following holds:
1. For all $\mu \in A_\tau$ and $p' \in Q$, if $p \xrightarrow{\mu} p'$ then $q \xrightarrow{\mu} q'$ for some $q' \in Q$ s.t. $\langle p', q' \rangle \in R$
2. For all $\mu \in A_\tau$ and $q' \in Q$, if $q \xrightarrow{\mu} q'$ then $p \xrightarrow{\mu} p'$ for some $p' \in Q$ s.t. $\langle p', q' \rangle \in R$
Two states p, q are bisimilar ($p \sim q$) if there exists a
bisimulation R such that $\langle p, q \rangle \in R$.
This definition corresponds to the circular definition below that more clearly shows that two systems are
bisimilar (observationally equivalent) if they can perform the same action and reach bisimilar states. This
recursive definition can be solved with the usual fixed-point techniques.
Two states $p, q \in Q$ are bisimilar, written $p \sim q$, if and
only if for each $\mu \in A_\tau$:
1. if $p \xrightarrow{\mu} p'$ then $q \xrightarrow{\mu} q'$ for some $q'$ such that $p' \sim q'$;
2. if $q \xrightarrow{\mu} q'$ then $p \xrightarrow{\mu} p'$ for some $p'$ such that $p' \sim q'$.
Bisimilarity distinguishes all machines of Fig. . This
is because the basic idea behind bisimilarity is that two
states are considered equivalent if by performing the
same sequences of actions from these states it is possible to reach equivalent states. It is not difficult to see
that bisimilarity distinguishes the first and the second
machine of Fig.  because after receiving two coins (coin1
and coin2) the first machine still offers the user the possibility of choosing between having coffee or tea while
the second does not. To see that the second and the
third machine are also distinguished, it is sufficient to consider only the states reachable after just inserting coin1,
because already after this insertion the user loses
control of the third machine. Indeed, there is no way for
this machine to reach a state bisimilar to the one that the
second machine reaches after accepting coin1.
Testing Equivalence
The formulation of bisimilarity is mathematically very
elegant and has received much attention also in
other fields of computer science []. However, some
researchers do consider it too discriminating: two processes may be deemed unrelated even though there is
no practical way of ascertaining it. As an example, consider the two rightmost vending machines of Fig. . They
are not bisimilar because after inserting the first coin in
one case there is still the illusion of having the possibility of choosing what to drink. Nevertheless, a customer
would not be able to appreciate their differences since
there is no possibility of deciding what to drink with
both machines.
Testing equivalence has been proposed [] (see
also []) as an alternative to bisimilarity; it takes
to the extreme the claim that when defining behavioral equivalences, one does not want to distinguish
between systems that cannot be taken apart by external
observers and bases the definition of the equivalences
on the notions of observers, observations, and successful observations. Equivalences are defined that consider
equivalent those systems that satisfy (lead to successful
observations by) the same sets of observers. An observer
is an LTS with actions in $A_{\tau,w} \triangleq A_\tau \cup \{w\}$, with $w \notin A$.
To determine whether a state q satisfies an observer with
initial state o, the set OBS(q, o) of all computations from
⟨q, o⟩ is considered.
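Given an LTS $\langle Q, A_\tau, \xrightarrow{\mu} \rangle$ and an observer $\langle O, A_{\tau,w}, \xrightarrow{\mu} \rangle$, and a state $q \in Q$ and the initial state $o \in O$, an observation c from $\langle q, o \rangle$ is a maximal sequence of pairs $\langle q_i, o_i \rangle$, such that $\langle q_0, o_0 \rangle = \langle q, o \rangle$. The transition $\langle q_i, o_i \rangle \xrightarrow{\mu} \langle q_{i+1}, o_{i+1} \rangle$ can be proved using the following inference rule:
$$\frac{E \xrightarrow{\mu} E' \qquad F \xrightarrow{\mu} F'}{\langle E, F \rangle \xrightarrow{\mu} \langle E', F' \rangle} \quad \mu \in A_\tau$$
An observation from $\langle q, o \rangle$ is successful if it contains a configuration $\langle q_n, o_n \rangle \in c$, with $n \geq 0$, such that $o_n \xrightarrow{w} o$ for some $o$.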
When analyzing the outcome of observations, one
has to take into account that, due to nondeterminism,
a process satisfies an observer sometimes or a process
satisfies an observer always. This leads to the following
definitions:
1. q may satisfy o if there exists an observation from
⟨q, o⟩ that is successful.
2. q must satisfy o if all observations from ⟨q, o⟩ are
successful.
These notions can be used to define may, must, and
testing equivalence.
May equivalence : p is may equivalent to q (p ≃m q) if,
for all possible observers o:
p may satisfy o if and only if q may satisfy o;
Must equivalence : p is must equivalent to q (p ≃M q) if,
for all possible observers o:
p must satisfy o if and only if q must satisfy o.
Testing equivalence : p is testing equivalent to q (p ≃test q)
if p ≃m q and p ≃M q.
The three vending machines of Fig.  are may equivalent, but only the two rightmost ones are must equivalent and testing equivalent. Indeed, in most cases must
equivalence implies may equivalence, and thus in most
cases must and testing do coincide. The two leftmost
machines are not must equivalent because one of them, after
receiving the two coins, cannot refuse to
(must) deliver the drink chosen by the customer, while
the other can.
May and must equivalences have nice alternative
characterizations. It has been shown that may equivalence coincides with traces equivalence and that must
equivalence coincides with failures equivalence, another
well-studied relation that is inspired by traces equivalence but takes into account the possible interactions
(failures) after each trace and is thus more discriminating than traces equivalence []. Failures equivalence
relies on pairs of the form ⟨s, F⟩, where s is a trace and
F is a set of labels. Intuitively, ⟨s, F⟩ is a failure for a process if it can perform the sequence of actions s to evolve
into a state from which no action in F is possible. This
equivalence can be formulated on LTS as follows:
Failures equivalence: Two states p and q are failures equivalent, written $p \simeq_F q$, if and only if they possess the
same failures, i.e., if for any $s \in A_\tau^*$ and for any $F \subseteq A_\tau$:
1. $p \xrightarrow{s} p'$ and $Init(p') \cap F = \emptyset$ implies $q \xrightarrow{s} q'$ for some $q'$ and $Init(q') \cap F = \emptyset$
2. $q \xrightarrow{s} q'$ and $Init(q') \cap F = \emptyset$ implies $p \xrightarrow{s} p'$ for some $p'$ and $Init(p') \cap F = \emptyset$
where Init(q) represents the immediate actions of
state q.
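The definition can be turned into a small check for finite LTSs without τ-actions. The sketch relies on an observation that is not spelled out in the entry: a failure of p witnessed by a state p′ is matched by q exactly when some q′ reached by the same trace satisfies Init(q′) ⊆ Init(p′) (choose F as the complement of Init(p′)). The encoding, the function names, and the vending-machine state names are assumptions.

```python
from collections import deque

# Assumed encoding: state -> list of (action, successor) pairs.
# One reading of the three vending machines; state names are illustrative.
LTS = {
    "p": [("coin1", "p1")], "p1": [("coin2", "p2")],
    "p2": [("coffee", "p3"), ("tea", "p4")],
    "q": [("coin1", "q1")], "q1": [("coin2", "q2"), ("coin2", "q3")],
    "q2": [("coffee", "q4")], "q3": [("tea", "q5")],
    "r": [("coin1", "r1"), ("coin1", "r2")],
    "r1": [("coin2", "r3")], "r2": [("coin2", "r4")],
    "r3": [("coffee", "r5")], "r4": [("tea", "r6")],
}

def init(lts, s):
    """Immediate actions of state s."""
    return frozenset(a for a, _ in lts.get(s, ()))

def step(lts, states, action):
    return frozenset(t for s in states for a, t in lts.get(s, ()) if a == action)

def covered(lts, xs, ys):
    """Every state in xs has a counterpart in ys that refuses at least as much."""
    return all(any(init(lts, y) <= init(lts, x) for y in ys) for x in xs)

def failures_equivalent(lts, p, q):
    start = (frozenset([p]), frozenset([q]))
    seen, queue = {start}, deque([start])
    while queue:
        ps, qs = queue.popleft()  # states reached by p and q after one common trace
        if not (covered(lts, ps, qs) and covered(lts, qs, ps)):
            return False
        for a in {a for s in ps | qs for a, _ in lts.get(s, ())}:
            nxt = (step(lts, ps, a), step(lts, qs, a))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

print(failures_equivalent(LTS, "q", "r"), failures_equivalent(LTS, "p", "q"))
```

Under this encoding the central and rightmost machines are failures equivalent while the leftmost one is distinguished from both, which agrees with the must-equivalence discussion above.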
Hierarchy of Equivalences
The equivalences considered above can be precisely
related (see [] for a first study). Their relationships
over the class of finite transition systems with only visible actions are summarized by the figure below, where
the upward arrow indicates containment of the induced
relations over states and ≡ indicates coincidence.
[Figure: hierarchy diagram with nodes labeled F, M, T, and m.]
Overall, the figure states that may testing gives rise
to a relation ($\simeq_m$) that coincides with traces equivalence, while must testing gives rise to a relation $\simeq_M$
that coincides with failures equivalence. For the considered class of systems, it also holds that must and testing
equivalence $\simeq_{test}$ do coincide. Thus, bisimilarity implies
testing equivalence that in turn implies traces
equivalence.
Weak Variants of the Equivalences
When considering abstract versions of systems making
use of invisible actions, it turns out that all equivalences
considered above are too discriminating. Indeed, traces,
testing/failures, and observation equivalence would distinguish the two machines of Fig.  that, nevertheless,
exhibit similar observable behaviors: get a coin and
deliver a coffee. The second one can be obtained, e.g.,
from the term
coin.grinding.coffee.nil
by hiding the grinding action that is irrelevant for the
customer.
[Figure: two vending machines with initial states p0 and q0, over the actions coin, τ, and coffee.]
Behavioral Equivalences. Fig.  Weakly equivalent
vending machines
Because of this overdiscrimination, weak variants of
the equivalences have been defined that permit ignoring
(to different extents) internal actions when considering
the behavior of systems. The key step of their definition is the introduction of a new transition relation
that ignores silent actions. Thus, $q \stackrel{a}{\Rightarrow} q'$ denotes that
q reduces to q′ by performing the visible action a possibly preceded and followed by any number (also 0)
of invisible actions (τ). The transition $q \stackrel{s}{\Rightarrow} q'$, instead,
denotes that q reduces to q′ by performing the sequence
s of visible actions, each of which can be preceded and
followed by τ-actions, while $\stackrel{\epsilon}{\Rightarrow}$ indicates that only
τ-actions, possibly none, are performed.
Weak traces equivalence The weak variant of traces
equivalence is obtained by simply replacing the transitions $p \xrightarrow{s} p'$ above with the observable transitions
$p \stackrel{s}{\Rightarrow} p'$.
Two states p and q are weak traces equivalent ($p \approxeq_T q$)
if for any $s \in A^*$:
1. $p \stackrel{s}{\Rightarrow} p'$ implies $q \stackrel{s}{\Rightarrow} q'$ for some $q'$
2. $q \stackrel{s}{\Rightarrow} q'$ implies $p \stackrel{s}{\Rightarrow} p'$ for some $p'$
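The weak transition relation can be illustrated on finite LTSs by closing sets of states under τ-steps before and after each visible action. The following sketch is only an illustration: the label "tau", the function names, and the encoding of the two weakly equivalent vending machines (including the choice of which machine carries the hidden grinding step) are assumptions.

```python
TAU = "tau"  # assumed label for the silent action

# Assumed encoding of the two machines: coin.coffee and coin.tau.coffee.
LTS = {
    "p0": [("coin", "p1")], "p1": [("coffee", "p2")],
    "q0": [("coin", "q1")], "q1": [(TAU, "q2")], "q2": [("coffee", "q3")],
}

def tau_closure(lts, states):
    """States reachable from `states` by zero or more silent steps."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for a, t in lts.get(s, ()):
            if a == TAU and t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def weak_step(lts, states, action):
    """Weak transition: silent steps, one visible `action`, silent steps."""
    before = tau_closure(lts, states)
    after = {t for s in before for a, t in lts.get(s, ()) if a == action}
    return tau_closure(lts, after)

def weak_run(lts, state, trace):
    """States reachable from `state` by the visible sequence `trace`."""
    current = tau_closure(lts, [state])
    for a in trace:
        current = weak_step(lts, current, a)
    return current

print(sorted(weak_step(LTS, {"q0"}, "coin")))
print(bool(weak_run(LTS, "p0", ["coin", "coffee"])),
      bool(weak_run(LTS, "q0", ["coin", "coffee"])))
```

Both machines can perform the visible sequence coin, coffee, and neither can perform coffee first, so under this encoding they have the same weak traces.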
Weak testing equivalence To define the weak variants
of may, must, and testing equivalences (denoted by
$\approxeq_m$, $\approxeq_M$, $\approxeq_{test}$ respectively), it suffices to change experiments so that processes and observers can freely perform silent actions. To this purpose, one only needs
to change the inference rule of the observation step
$\langle q_i, o_i \rangle \xrightarrow{\mu} \langle q_{i+1}, o_{i+1} \rangle$, which can now be proved using:
$$\frac{E \xrightarrow{\tau} E'}{\langle E, F \rangle \xrightarrow{\tau} \langle E', F \rangle} \qquad \frac{F \xrightarrow{\tau} F'}{\langle E, F \rangle \xrightarrow{\tau} \langle E, F' \rangle} \qquad \frac{E \xrightarrow{a} E' \qquad F \xrightarrow{a} F'}{\langle E, F \rangle \xrightarrow{a} \langle E', F' \rangle} \quad a \in A$$
To define, instead, weak failures equivalence, it suffices to
replace $p \xrightarrow{s} p'$ with $p \stackrel{s}{\Rightarrow} p'$ in the definition of its strong
variant. It holds that weak traces equivalence coincides with weak may equivalence, and that weak failures
equivalence $\approxeq_F$ coincides with weak must equivalence.
Weak bisimulation equivalence For defining weak observational equivalence, a new notion of (weak) bisimulation is defined that again assigns a special role to
τ’s. To avoid having four items, the definition below
requires that the candidate bisimulation relations be
symmetric:
A symmetric relation $R \subseteq Q \times Q$ is a weak bisimulation if, for any pair of states p and q such that $\langle p, q \rangle \in R$,
the following holds:
● For all $a \in A$ and $p' \in Q$, if $p \xrightarrow{a} p'$ then $q \stackrel{a}{\Rightarrow} q'$ for some $q' \in Q$ s.t. $\langle p', q' \rangle \in R$
● For all $p' \in Q$, if $p \xrightarrow{\tau} p'$ then $q \stackrel{\epsilon}{\Rightarrow} q'$ for some $q' \in Q$ s.t. $\langle p', q' \rangle \in R$
Two states p, q are weakly bisimilar ($p \approx q$) if there
exists a weak bisimulation R such that $\langle p, q \rangle \in R$.
The figure below describes the intuition behind
weak bisimilarity. In order to consider two states, say
p and q, equivalent, it is necessary that for each visible action performed by one of them the other has
to have the possibility of performing the same visible
action possibly preceded and followed by any number
of invisible actions.
[Figure: p and p′ linked by a; q, q1, …, qn linked by τ; qn and q′1 linked by a; q′1, q′2, …, q′ linked by τ.]
Branching bisimulation equivalence An alternative to
weak bisimulation has also been proposed that considers those τ-actions important that appear in branching points of systems descriptions: only silent actions
that do not eliminate possible interaction with external
observers are ignored.
A symmetric relation $R \subseteq Q \times Q$ is a branching
bisimulation if, for any pair of states p and q such that
$\langle p, q \rangle \in R$, if $p \xrightarrow{\mu} p'$, with $\mu \in A_\tau$ and $p' \in Q$, at least one
of the following conditions holds:
● $\mu = \tau$ and $\langle p', q \rangle \in R$
● $q \stackrel{\epsilon}{\Rightarrow} q'' \xrightarrow{\mu} q'$ for some $q', q'' \in Q$ such that $\langle p, q'' \rangle \in R$ and $\langle p', q' \rangle \in R$
Two states p, q are branching bisimilar ($p \approx_b q$)
if there exists a branching bisimulation R such that
$\langle p, q \rangle \in R$.
The figure below describes the intuition behind
branching bisimilarity; it corresponds to the definition above although it might appear, at first glance,
more demanding. In order to consider two states, say
p and q, equivalent, it is necessary, like for weak bisimilarity, that for each visible action performed by one
of them the other has to have the possibility of performing the same visible action possibly preceded and
followed by any number of invisible actions. Branching
bisimilarity, however, imposes the additional requirement that the performed internal actions are not used
to change equivalence class. Thus, all states reached via
τ’s before performing the visible action are required
to be equivalent to p, while all states reached via τ’s
after performing the visible action are required to be
equivalent to p′.
[Figure: p and p′ linked by a; q, q1, …, qn linked by τ; qn and q′1 linked by a; q′1, q′2, …, q′ linked by τ.]

Hierarchy of Weak Equivalences
Like for the strong case, also weak equivalences can be
clearly related. Their relationships over the class of finite
transition systems with invisible actions, but without
τ-loops (so-called non-divergent or strongly convergent
LTSs) are summarized by the figure below, where the
upward arrow indicates containment of the induced
relations over states.
[Figure: hierarchy diagram with nodes labeled F, M, T, and m.]
Thus, over strongly convergent LTSs with silent
actions, branching bisimilarity implies weak bisimilarity, and this implies testing and failures equivalences;
and these imply traces equivalence.
A number of counterexamples can be provided to
show that the implications of the figure above are proper
and thus that the converse does not hold.
The two LTSs reported below are weak traces equivalent and weakly may equivalent, but are distinguished
by all the other equivalences.
[Figure: two LTSs with states named p… and q….]
Indeed, they can perform exactly the same weak traces,
but while the former can silently reach a state in which
an a-action can be refused the second cannot.
The next two LTSs are equated by weak trace and
weak must equivalences, but are distinguished by weak
bisimulation and branching bisimulation.
[Figure: two LTSs with states named p… and q….]
Both of them, after zero or more silent actions, can
be either in a state where both actions a and b are possible or in a state in which only a b transition is possible.
However, via a τ-action, the topmost system can reach a
state that has no equivalent one in the bottom one, thus
they are not weakly bisimilar.
The next two LTSs are instead equated by weak
bisimilarity, and thus by weak trace and weak must
equivalences, but are not branching bisimilar.
[Figure: two LTSs with states named p… and q….]
It is easy to see that from the states p and q , the
same visible action is possible and bisimilar states can
be reached. The two states p and q are instead not
branching bisimilar because p , in order to match the
a action of q to q and reach a state equivalent to
q , needs to reach p through p , but these two states,
connected by a τ-action, are not branching bisimilar.
It is worth concluding that the two LTSs of Fig.  are
equated by all the considered weak equivalences.
Future Directions
The study on behavioral equivalences of transition systems is still continuing. LTSs are increasingly used as the
basis for specifying and proving properties of reactive
systems. For example, they are used in model checking as
the model against which logical properties are checked.
It is then important to be able to use minimal systems
that are, nevertheless, equivalent to the original larger
ones so that preservation of the checked properties is
guaranteed. Thus, further research is expected on devising efficient algorithms for equivalence checking and on
understanding more precisely the properties of systems
that are preserved by the different equivalences.
Related Entries
Actors
Bisimulation
CSP (Communicating Sequential Processes)
Pi-Calculus
Process Algebras
Bibliographic Notes and Further Reading
The theories of equivalences can be found in a number of books targeted to describing the different process
algebras. The theory of bisimulation is introduced in [],
while failure and trace semantics are considered in []
and []. The testing approach is presented in [].
Moreover, interesting papers relating the different
approaches are [], the first paper to establish precise
relationships between the many equivalences proposed
in the literature, and the two papers by R. van Glabbeek:
[], considering systems with only visible actions, and
[], considering also systems with invisible actions.
In his two companion papers, R. van Glabbeek provides a uniform, model-independent account of many
of the equivalences proposed in the literature and
proposes several motivating testing scenarios, phrased
in terms of “button pushing experiments” on reactive
machines to capture them.
Bisimulation and its relationships with modal logics are studied in depth in [], while a deep study of its
origins and its use in other areas of computer science
is provided in []. Branching bisimulation was first
introduced in [], while the testing-based equivalences
were introduced in []. Failure semantics was first introduced in [].
Bibliography
. Baeten JCM, Weijland WP () Process algebra. Cambridge
University Press, Cambridge
. Brookes SD, Hoare CAR, Roscoe AW () A theory of communicating sequential processes. J ACM ():–
. De Nicola R () Extensional equivalences for transition systems. Acta Informatica ():–
. De Nicola R, Hennessy M () Testing equivalences for
processes. Theor Comput Sci :–
. Hennessy M () Algebraic theory of processes. The MIT Press,
Cambridge
. Hennessy M, Milner R () Algebraic laws for nondeterminism
and concurrency. J ACM ():–
. Hoare CAR () Communicating sequential processes.
Prentice-Hall, Englewood Cliffs
. Milner R () Communication and concurrency. Prentice-Hall,
Upper Saddle River
. Roscoe AW () The theory and practice of concurrency.
Prentice-Hall, Hertfordshire
. Sangiorgi D () On the origins of bisimulation and coinduction. ACM Trans Program Lang Syst ():.–.
. van Glabbeek RJ () The linear time-branching time
spectrum I: the semantics of concrete, sequential processes.
In: Bergstra JA, Ponse A, Smolka SA (eds) Handbook of process
algebra, Elsevier, Amsterdam, pp –
. van Glabbeek RJ () The linear time-branching time spectrum II. In: Best E (ed) CONCUR ’, th international conference on concurrency theory, Hildesheim, Germany, Lecture
notes in computer science, vol . Springer-Verlag, Heidelberg,
pp –
Behavioral Relations
Behavioral Equivalences
Benchmarks
Jack Dongarra, Piotr Luszczek
University of Tennessee, Knoxville, TN, USA
Definition
Computer benchmarks are computer programs that
form standard tests of the performance of a computer

and the software through which it is used. They
are written to a particular programming model and
implemented by specific software, which is the final
arbiter as to what the programming model is. A
benchmark is therefore testing a software interface to
a computer, and not a particular type of computer
architecture.
Discussion
The basic goal of performance modeling is to measure,
predict, and understand the performance of a computer
program or set of programs on a computer system. In
other words, it transcends the measurement of basic
architectural and system parameters and is meant to
enhance the understanding of the performance behavior of full complex applications. However, the programs
and codes used in different areas of science differ in
a large number of features. Therefore the performance
of full application codes cannot be characterized in a
general way independent of the application and code
used. The understanding of the performance characteristics is tightly bound to the specific computer program
code used. Therefore the careful selection of an interesting program for analysis is the crucial first step in
any more detailed and elaborate investigation of full
application code performance. The applications of performance modeling are numerous, including evaluation
of algorithms, optimization of code implementations,
parallel library development, comparison of system
architectures, parallel system design, and procurement
of new systems.
A number of projects such as Perfect, NPB, ParkBench, HPC Challenge, and others have laid the
groundwork for a new era in benchmarking and evaluating the performance of computers. The complexity of these machines requires a new level of detail in
measurement and comprehension of the results. The
quotation of a single number for any given advanced
architecture is a disservice to manufacturers and users
alike, for several reasons. First, there is a great variation in performance from one computation to another
on a given machine; typically the variation may be one
or two orders of magnitude, depending on the type
of machine. Secondly, the ranking of similar machines
often changes as one goes from one application to
another, so the best machine for circuit simulation may
not be the best machine for computational fluid dynamics. Finally, the performance depends greatly on a combination of compiler characteristics and the human effort
that was expended on obtaining the results.
The conclusions drawn from a benchmark study of
computer performance depend not only on the basic
timing results obtained, but also on the way these are
interpreted and converted into performance figures.
The choice of the performance metric may itself influence the conclusions. For example, is it desirable to
have a computer that generates the most megaflop per
second (or has the highest Speedup), or the computer
that solves the problem in the least time? It is now
well known that high values of the first metrics do
not necessarily imply the second property. This confusion can be avoided by choosing a more suitable metric that reflects solution time directly, for example,
either the Temporal, Simulation, or Benchmark performance, defined below. This issue of the sensible choice
of performance metric is becoming increasingly important with the advent of massively parallel computers
which have the potential of very high Gigaflop rates,
but have much more limited potential for reducing
solution time.
Given the time of execution T and the floating-point operation count, several different performance
measures can be defined. Each metric has its own uses,
and gives different information about the computer and
algorithm used in the benchmark. It is important therefore to distinguish the metrics with different names,
symbols and units, and to understand clearly the difference between them. Much confusion and wasted work
can arise from optimizing a benchmark with respect to
an inappropriate metric. If the performance of different
algorithms for the solution of the same problem needs
to be compared, then the correct performance metric to
use is the Temporal Performance, which is defined as the
inverse of the execution time
$$R_T = 1/T.$$
A special case of temporal performance occurs for
simulation programs in which the benchmark problem
is defined as the simulation of a certain period of physical time, rather than a certain number of timesteps. In
this case, the term “simulation performance” is used,
and it is measured in units such as simulated days per
day (written sim-d/d or ‘d’/d) in weather forecasting,
where the apostrophe is used to indicate “simulated,” or
simulated pico-seconds per second (written sim-ps/s or
‘ps’/s) in electronic device simulation. It is important to
use simulation performance rather than timestep/s for
comparing different simulation algorithms which may
require different sizes of timestep for the same accuracy
(e.g., an implicit scheme that can use a large timestep,
compared with an explicit scheme that requires a much
smaller step). In order to compare the performance of
a computer on one benchmark with its performance on
another, account must be taken of the different amounts
of work (measured in flop) that the different problems
require for their solution. The benchmark performance
is defined as the ratio of the floating-point operation count and the execution time
$$R_B = F_B / T.$$
The units of benchmark performance are Gigaflop/s
(benchmark name), where the name of the benchmark
is included in parentheses to emphasize that the performance may depend strongly on the problem being
solved, and to emphasize that the values are based on
the nominal benchmark flop-count. In other contexts
such performance figures would probably be quoted
as examples of the so-called sustained performance of
a computer. For comparing the observed performance
with the theoretical capabilities of the computer hardware, the actual number of floating-point operations
performed $F_H$ is computed, and from it the actual hardware performance
$$R_H = F_H / T.$$
Parallel speedup is a popular metric that has been
used for many years in the study of parallel computer
performance. Speedup is usually defined as the ratio of
the execution time on one processor $T_1$ and the execution time
on p processors $T_p$.
One factor that has hindered the use of full application codes for benchmarking parallel computers in the
past is that such codes are difficult to parallelize and to
port between target architectures. In addition, full application codes that have been successfully parallelized are
often proprietary, and/or subject to distribution restrictions. To minimize the negative impact of these factors, the use of compact applications was proposed in
many benchmarking efforts. Compact applications are
typical of those found in research environments (as
opposed to production or engineering environments),
and usually consist of up to a few thousand lines of
source code. Compact applications are distinct from
kernel applications since they are capable of producing scientifically useful results. In many cases, compact
applications are made up of several kernels, interspersed
with data movements and I/O operations between the
kernels.
Any of the performance metrics, R, can be described
with a two-parameter Amdahl saturation, for a fixed
problem size as a function of the number of
processors p,
$$R = R_\infty / (1 + p_{1/2}/p)$$
where $R_\infty$ is the saturation performance approached as
$p \to \infty$ and $p_{1/2}$ is the number of processors required
to reach half the saturation performance. This universal Amdahl curve [, ] could be matched against the
actual performance curves by changing values of the
two parameters ($R_\infty$, $p_{1/2}$).
Related Entries
HPC Challenge Benchmark
LINPACK Benchmark
Livermore Loops
TOP
Bibliography
. Hockney RW () A framework for benchmark analysis. Supercomputer (IX-):–
. Addison C, Allwright J, Binsted N, Bishop N, Carpenter B, Dalloz P,
Gee D, Getov V, Hey A, Hockney R, Lemke M, Merlin J, Pinches M,
Scott C, Wolton I () The genesis distributed-memory benchmarks. Part : methodology and general relativity benchmark with
results for the SUPRENUM computer. Concurrency: Practice and
Experience ():–
Beowulf Clusters
Clusters

Beowulf-Class Clusters
Clusters
Bernstein’s Conditions
Paul Feautrier
Ecole Normale Supérieure de Lyon, Lyon, France
Definition
Bernstein’s conditions [] are a simple test for deciding
if statements or operations can be interchanged without
modifying the program results. The test applies to operations which read and write memory at well defined
addresses. If u is an operation, let M(u) be the set of
(addresses of) the memory cells it modifies, and R(u)
the set of cells it reads. Operations u and v can be
reordered if:
$$M(u) \cap M(v) = M(u) \cap R(v) = R(u) \cap M(v) = \emptyset \qquad ()$$
If these conditions are met, one says that u and v commute or are independent.
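As a small illustration, the test can be written with ordinary set operations once the modified and read sets of each operation are known. In the sketch below, variable names stand in for memory addresses, and the function name and the sample statements are assumptions chosen for the example.

```python
def bernstein_independent(m_u, r_u, m_v, r_v):
    """True iff M(u)∩M(v), M(u)∩R(v), and R(u)∩M(v) are all empty."""
    return not (m_u & m_v) and not (m_u & r_v) and not (r_u & m_v)

# u: a = b + c      v: d = b * 2   (both only read b, so they commute)
print(bernstein_independent({"a"}, {"b", "c"}, {"d"}, {"b"}))

# u: a = b + c      v: b = 7       (v modifies a cell that u reads)
print(bernstein_independent({"a"}, {"b", "c"}, {"b"}, set()))

# u: x = x + 1      v: x = x + 2   (both modify and read x)
print(bernstein_independent({"x"}, {"x"}, {"x"}, {"x"}))
```

Note that two operations may share cells in their read sets without violating the conditions; only overlaps that involve a modified cell matter.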
Note that in most languages, each operation writes
at most one memory cell: M(u) is a singleton. However,
there are exceptions: multiple and parallel assignments,
vector assignments among others.
The importance of this result stems from the fact
that most program optimizations consist – or at least,
involve – moving operations around. For instance, to
improve cache performance, one must move all uses of
a datum as near as possible to its definition. In parallel programming, if u and v are assigned to different
threads or processors, their order of execution may be
unpredictable, due to arbitrary decisions of a scheduler
or to the presence of competing processes. In this case,
if Bernstein’s conditions are not met, u and v must be
kept in the same thread.
Checking Bernstein’s conditions is easy for operations accessing scalars (but beware of aliases), is more
difficult for array accesses, and is almost impossible for
pointer dereferencing. See the Dependences entry for
an in-depth discussion of this question.
Discussion
Notations and Conventions
In this essay, a program is represented as a sequence
of operations, i.e., of instances of high level statements
or machine instructions. Such a sequence is called a
trace. Each operation has a unique name, u, and a
text T(u), usually specified as a (high-level language)
statement. There are many schemes for naming operations: for polyhedral programs, one may use integer
vectors, and operations are executed in lexicographic
order. For flowchart programs, one may use words of a
regular language to name operations, and if the program
has function calls, words of a context-free language [].
In the last two cases, u is executed before v iff u is a
prefix of v. In what follows, u ≺ v is a shorthand for
“u is executed before v.” For sequential programs, ≺ is
a well-founded total order: there is no infinite chain
$x_1, x_2, \ldots, x_i, \ldots$ such that $x_{i+1} \prec x_i$. This is equivalent
to stipulating that a program execution has a beginning,
but may not have an end.
All operations will be assumed deterministic: the
state of memory after execution of u depends only on
T(u) and on the previous state of memory.
For static control programs, one can enumerate the
unique trace – or at least describe it – once and for all.
One can also consider static control program families,
where the trace depends on a few parameters which are
known at program start time. Lastly, one can consider
static control parts of programs or SCoPs. Most of this
essay will consider only static control programs.
When applying Bernstein’s conditions, one usually
considers a reference trace, which comes from the original program, and a candidate trace, which is the result
of some optimization or parallelization. The problem is
to decide whether the two traces are equivalent, in a
sense to be discussed later. Since program equivalence
is in general undecidable, one has to restrict the set of
admissible transformations. Bernstein’s conditions are
especially useful for dealing with operation reordering.
Commutativity
To prove that Bernstein’s conditions are sufficient for
commutativity, one needs the following facts:
● When an operation u is executed, the only memory cells which may be modified are those whose
addresses are in M(u).
● The values stored in M(u) depend only on u and on
the values read from R(u).
▸ Consider two operations u and v which satisfy (Eq. ).
Assume that u is executed first. When v is executed
later, it finds in R(v) the same values as if it were
executed first, since M(u) and R(v) are disjoint.
Hence, the values stored in M(v) are the same, and
they do not overwrite the values stored by u, since
M(u) and M(v) are disjoint. The same reasoning
applies if v is executed first.
The fact that u and v do not meet Bernstein’s conditions is written u ⊥ v to indicate that u and v cannot be
executed in parallel.
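As an illustration, the test for u ⊥ v reduces to three set intersections. The following Python sketch is not part of the original entry: the record `Op`, its fields `reads` and `writes` (standing for R(u) and M(u)), and the function names are hypothetical. The three disjointness tests are the ones used in the commutativity proof above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """An operation described only by its read set R and modified set M."""
    name: str
    reads: frozenset
    writes: frozenset


def independent(u: Op, v: Op) -> bool:
    """True when u and v satisfy Bernstein's conditions."""
    return (u.writes.isdisjoint(v.writes)      # M(u) and M(v) are disjoint
            and u.writes.isdisjoint(v.reads)   # M(u) and R(v) are disjoint
            and u.reads.isdisjoint(v.writes))  # R(u) and M(v) are disjoint


def conflict(u: Op, v: Op) -> bool:
    """The relation written u ⊥ v in the text."""
    return not independent(u, v)


u = Op("u", frozenset({"b", "c"}), frozenset({"a"}))  # a = b + c
v = Op("v", frozenset({"b"}), frozenset({"d"}))       # d = 2 * b
w = Op("w", frozenset({"a"}), frozenset({"b"}))       # b = a - 1
```

Here u and v may be exchanged, whereas u ⊥ w because w reads the cell modified by u.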
Atomicity
When dealing with parallel programs, commutativity is
not enough for correctness. Consider for instance two
operations u and v with T(u) = [x = x + 1] and
T(v) = [x = x + 2]. These two operations commute, since their sequential execution in whatever order
is equivalent to a unique operation w such that T(w) =
[x = x + 3]. However, each one is compiled into
a sequence of more elementary machine instructions,
which, when executed in parallel, may result in x being
increased by 1 or 2 or 3 (see the figure, where r1 and r2 are
processor registers).
Observe that these two operations do not satisfy
Bernstein’s conditions. In contrast, operations that satisfy Bernstein’s conditions do not need to be protected
by critical sections when run in parallel. The reason is
that neither operation modifies the input of the other,
and that they write in distinct memory cells. Hence, the
stored values do not depend on the order in which the
writes are interleaved.
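The loss of atomicity can be reproduced by brute force. The sketch below is an illustration under stated assumptions, not the author's code: each of x = x + 1 and x = x + 2 is modeled as a hypothetical three-step load/add/store sequence, and every interleaving of the two sequences is executed on a shared dictionary.

```python
from itertools import combinations


def load(reg):
    return lambda s: s.update({reg: s["x"]})      # reg = x


def add(reg, k):
    return lambda s: s.update({reg: s[reg] + k})  # reg += k


def store(reg):
    return lambda s: s.update({"x": s[reg]})      # x = reg


THREAD_1 = [load("r1"), add("r1", 1), store("r1")]  # x = x + 1
THREAD_2 = [load("r2"), add("r2", 2), store("r2")]  # x = x + 2


def interleavings(n1, n2):
    """Yield every schedule of n1 + n2 steps as a tuple of thread ids."""
    total = n1 + n2
    for slots in combinations(range(total), n1):
        chosen = set(slots)
        yield tuple(1 if i in chosen else 2 for i in range(total))


def run(schedule):
    """Execute one schedule from x = 0 and return the final value of x."""
    state = {"x": 0, "r1": 0, "r2": 0}
    steps = {1: iter(THREAD_1), 2: iter(THREAD_2)}
    for thread in schedule:
        next(steps[thread])(state)
    return state["x"]


def final_values():
    """Set of final values of x over all interleavings."""
    return {run(s) for s in interleavings(len(THREAD_1), len(THREAD_2))}
```

Only the schedules in which one thread completes before the other starts yield x = 3. Any schedule in which both loads precede both stores loses one of the updates.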
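[Figure: three interleavings of the machine instructions of processors P #1 and P #2, each starting from x = 0]
Interleave 1: P #1: r1 = x; P #2: r2 = x; P #1: r1 += 1; P #2: r2 += 2; P #1: x = r1; P #2: x = r2. Result: x = 2.
Interleave 2: P #1: r1 = x; P #1: r1 += 1; P #1: x = r1; P #2: r2 = x; P #2: r2 += 2; P #2: x = r2. Result: x = 3.
Interleave 3: P #1: r1 = x; P #2: r2 = x; P #1: r1 += 1; P #2: r2 += 2; P #2: x = r2; P #1: x = r1. Result: x = 1.
Bernstein’s Conditions. Fig.  Several possible interleaves
of x = x + 1 and x = x + 2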
Legality
Here, the question is to decide whether a candidate
trace is equivalent to a reference trace, where the two
traces contain exactly the same operations. There are
two possibilities for deciding equivalence. Firstly, if the
traces are finite, one may examine the state of memory
after their termination. There is equivalence if these two
states are identical. Another possibility is to construct
the history of each memory cell. This is a list of values
ordered in time. A new value is appended to the history
of x each time an operation u such that x ∈ M(u) is executed. Two traces are equivalent if all cells have the same
history. This is clearly a stronger criterion than equality of the final memory; it has the advantage of being
applicable both to terminating programs and to nonterminating systems. The histories are especially simple
when a trace has the single assignment property: there is
only one operation that writes into x. In that case, each
history has only one element.
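The second criterion can be made concrete with a small sketch. It is only an illustration: the encoding of an operation as a (target cell, function of the memory) pair and the function name `histories` are assumptions introduced here, and every operation is assumed to modify a single cell.

```python
from collections import defaultdict


def histories(trace, initial):
    """Run a trace; return the history of every written cell and the final memory."""
    memory = dict(initial)
    history = defaultdict(list)
    for target, compute in trace:
        memory[target] = compute(memory)
        history[target].append(memory[target])  # new value appended in time order
    return dict(history), memory


reference = [
    ("a", lambda m: m["b"] + 1),       # a = b + 1
    ("c", lambda m: m["d"] * 2),       # c = d * 2
    ("a", lambda m: m["a"] + m["c"]),  # a = a + c
]
initial = {"a": 0, "b": 1, "c": 0, "d": 5}

# Candidate trace: the first two operations, which are independent, are swapped.
candidate = [reference[1], reference[0], reference[2]]
# Another reordering, which moves the last operation to the front.
broken = [reference[2], reference[0], reference[1]]
```

The candidate trace gives every cell the same history as the reference trace, while the last reordering changes the history of a.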
Terminating Programs
A terminating program is specified by a finite list of
operations, $[u_1, \ldots, u_n]$, in order of sequential execution. There is a dependence relation $u_i \to u_j$ iff $i < j$
and $u_i \perp u_j$.
All reorderings $[v_1, \ldots, v_n]$ of the u such that
the execution order of dependent operations is not
modified:
$u_i \to u_j,\ u_i = v_{i'},\ u_j = v_{j'} \Rightarrow i' < j'$
are legal.
▸ The proof is by a double induction. Let k be the length
of the common prefix of the two programs:
$u_i = v_i,\ i = 1, \ldots, k.$
Note that k may be zero. The element $u_{k+1}$ occurs somewhere among the v, at position i > k. The element $v_{i-1}$
occurs among the u at position j > k + 1 (see Fig. ). It
follows that $u_{k+1} = v_i$ and $u_j = v_{i-1}$ are ordered differently in the two programs, and hence must satisfy Bernstein’s condition. $v_{i-1}$ and $v_i$ can therefore be exchanged
without modifying the result of the reordered program.
Continuing in this way, $v_i$ can be brought to position
k + 1, which means that the common prefix has been
extended one position to the right. This process can be
continued until the length of the prefix is n. The two
programs are now identical, and the final result of the
candidate trace has not been modified.
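The legality test can be tried exhaustively on a small program. The sketch below is an illustration only: the four operations, their tuple encoding (name, read set, modified cell, function) and the helper names are hypothetical, and each operation is assumed to modify exactly one cell.

```python
from itertools import permutations

# Each operation: (name, read set R, modified cell, function of the memory).
OPS = [
    ("u1", {"b"}, "a", lambda m: m["b"] + 1),            # a = b + 1
    ("u2", {"d"}, "c", lambda m: m["d"] * 2),            # c = d * 2
    ("u3", {"a", "c"}, "e", lambda m: m["a"] + m["c"]),  # e = a + c
    ("u4", {"e"}, "b", lambda m: m["e"] - 1),            # b = e - 1
]
INITIAL = {"a": 0, "b": 1, "c": 0, "d": 5, "e": 0}


def dependent(u, v):
    """True when u and v do not satisfy Bernstein's conditions."""
    return u[2] == v[2] or u[2] in v[1] or v[2] in u[1]


def is_legal(reference, candidate):
    """True when every dependent pair keeps its reference order."""
    position = {op[0]: i for i, op in enumerate(candidate)}
    return all(
        position[u[0]] < position[v[0]]
        for i, u in enumerate(reference)
        for v in reference[i + 1:]
        if dependent(u, v)
    )


def run(trace, initial):
    """Execute a trace sequentially and return the final memory."""
    memory = dict(initial)
    for _name, _reads, target, compute in trace:
        memory[target] = compute(memory)
    return memory


legal = [list(p) for p in permutations(OPS) if is_legal(OPS, list(p))]
```

In this example only u1 and u2 may be exchanged, so two of the 24 permutations are legal, and both produce the final memory of the reference order.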
[Figure: the list u, with elements $u_k$, $u_{k+1}$ and $u_j$, drawn against the list v, with elements $v_k$, $v_{i-1}$ and $v_i$]
Bernstein’s Conditions. Fig.  The commutation Lemma
The property which has just been proved is crucial
for program optimization, since it gives a simple test for
the legality of statement motion, but what is its import
for parallel programming?
The point is that when parallelizing a program, its
operations are distributed among several processors or
among several threads. Most parallel architectures do
not try to combine simultaneous writes to the same
memory cell, which are arbitrarily ordered by the bus
arbiter or a similar device. It follows that if one is only
interested in the final result, each parallel execution is
equivalent to some interleave of the several threads of
the program. Taking care that operations which do not
satisfy Bernstein’s condition are executed in the order
specified by the original sequential program guarantees
deterministic execution and equivalence to the sequential program.
Single Assignment Programs
A trace is in single assignment form if, for each memory cell x, there is one and only one operation u such
that x ∈ M(u). Any trace can be converted to (dynamic)
single assignment form – at least in principle – by the
following method.
Let A be an (associative) array indexed by the operation names. Assuming that all M(u) are singletons,
operation u now writes into A[u] instead of M(u).
The source of cell x at u, noted σ(x, u), is defined as:
● x ∈ M(σ(x, u))
● σ(x, u) ≺ u
● there is no v such that σ(x, u) ≺ v ≺ u and x ∈ M(v)
In words, σ(x, u) is the last write to x that precedes u.
Now, in the text of u, replace all occurrences of y ∈ R(u)
by A[σ(y, u)]. That the new trace has the single assignment property is clear. It is equivalent to the reference
trace in the following sense: for each cell x, construct
a history by appending the value of A[u] each time an
operation u such that x ∈ M(u) is executed. Then the
histories of a cell in the reference trace and in the single
assignment trace are identical.
▸ Let us say that an operation u has a discrepancy for x if
the value assigned to x by u in the reference trace is different from the value of A[u] in the single assignment
trace. Let $u_0$ be the earliest such operation. Since all
operations are assumed deterministic, this means that
there is a cell $y \in R(u_0)$ whose value is different from
$A[\sigma(y, u_0)]$. Hence $\sigma(y, u_0) \prec u_0$ also has a discrepancy,
a contradiction.
Single assignment programs (SAP) were first proposed by Tesler and Enea [] as a tool for parallel
programming. In a SAP, the sets M(u) ∩ M(v) are
always empty, and if there is a non-empty R(u) ∩ M(v)
where u ≺ v, it means that some variable is read before
being assigned, a programming error. Some authors
[] then noticed that a single assignment program is a
collection of algebraic equations, which simplifies the
construction of correctness proofs.
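The conversion to dynamic single assignment form described in this section can be sketched as follows. This is an illustration, not the author's algorithm as published: the tuple encoding of operations and the function names are hypothetical, and a cell that is read before any operation has written it is taken from the initial memory, which is an assumption made here for convenience.

```python
def run_reference(trace, initial):
    """Ordinary execution: return the history of each written cell."""
    memory = dict(initial)
    history = {}
    for _name, reads, target, compute in trace:
        memory[target] = compute({y: memory[y] for y in reads})
        history.setdefault(target, []).append(memory[target])
    return history


def run_single_assignment(trace, initial):
    """Execution where operation u writes A[u] and reads A[sigma(y, u)]."""
    A = {}        # one cell per operation name
    source = {}   # source[y] is sigma(y, u) for the operation u being executed
    history = {}
    for name, reads, target, compute in trace:
        inputs = {y: A[source[y]] if y in source else initial[y] for y in reads}
        A[name] = compute(inputs)
        source[target] = name  # this operation is now the last write to target
        history.setdefault(target, []).append(A[name])
    return A, history


# Each operation: (name, cells read, cell modified, function of the values read).
TRACE = [
    ("u1", ["x"], "x", lambda v: v["x"] + 1),            # x = x + 1
    ("u2", ["x"], "y", lambda v: v["x"] * 2),            # y = x * 2
    ("u3", ["x", "y"], "x", lambda v: v["x"] + v["y"]),  # x = x + y
]
INITIAL = {"x": 1, "y": 0}
```

Every A[u] is written exactly once, and the histories rebuilt from A coincide with those of the reference execution.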
Non-Terminating Systems
The reader may have noticed that the above legality proof depends on the finiteness of the program
trace. What happens when one wants to build a nonterminating parallel system, as found for instance in
signal processing applications or operating systems?
For assessing the correctness of a transformation, one
cannot observe the final result, which does not exist.
Besides, one clearly needs some fairness hypothesis: it
would not do to execute all even numbered operations,
ad infinitum, and then to execute all odd numbered
operations, even if Bernstein’s conditions would allow
it. The needed property is that for all operations u in the
reference trace, there is a finite integer n such that u is
the n-th operation in the candidate trace.
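The fairness hypothesis can be illustrated with two infinite schedules of operations numbered 0, 1, 2, and so on, all assumed pairwise independent. The generators and the helper `position` below are hypothetical names. A finite horizon can of course only illustrate unfairness, not prove it.

```python
from itertools import count, islice


def unfair():
    """All even-numbered operations first; the odd ones are never reached."""
    yield from count(0, 2)
    yield from count(1, 2)


def fair():
    """Alternate even-numbered and odd-numbered operations."""
    for k in count(0):
        yield 2 * k
        yield 2 * k + 1


def position(op, schedule, horizon):
    """Index of op among the first `horizon` operations of schedule, or None."""
    for n, current in enumerate(islice(schedule, horizon)):
        if current == op:
            return n
    return None
```

In the fair schedule every operation has a finite rank, whereas no horizon, however large, reaches an odd-numbered operation in the unfair one.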
Consider first the case of two single assignment
traces, one of which is the reference trace, the other having been reordered while respecting dependences. Let u
be an operation. By the fairness hypothesis, u is present
in both traces. Assume that the values written in A[u]
by the two traces are distinct. As above, one can find
an operation v such that A[v] is read by u, A[v] has different values in the two traces, and v ≺ u in the two
traces. One can iterate this process indefinitely, which
contradicts the well-foundedness of ≺.
Consider now two ordinary traces. After conversion
to single assignment, one obtains the same values for the
A[u]. If one extracts a history for each cell x as above,
one obtains two identical sequences of values, since operations that write to x are in dependence and hence are
ordered in the same direction in the two traces.
Observe that this proof applies also to terminating
traces. If the cells of two terminating traces have identical histories, it obviously follows that the final memory
states are identical. On the other hand, the proof for
terminating traces applies also, in a sequential context, to operations which commute without satisfying
Bernstein’s conditions.
Dynamic Control Programs
The presence of tests whose outcome cannot be
predicted at compile time greatly complicates program
analysis. The simplest case is that of well structured programs, which use only the if then else construct. For
such programs, a simple syntactical analysis allows the
compiler to identify all tests which have an influence on
the execution of each operation. One has to take into
account three new phenomena:
● A test is an operation in itself, which has a set of read
cells, and perhaps a set of modified cells if the source
language allows side effects
● An operation cannot be executed before the outcomes of all controlling tests are known
● No dependence exists for two operations which
belong to opposite branches of a test
A simple solution, known as if-conversion [], can
be used to solve all three problems at once. Each test:
if(e) then . . . else . . . is replaced by a new operation b
= e; where b is a fresh boolean variable. Each operation
in the range of the test is guarded by b or ¬b, depending
on whether the operation is on the then or else branch
of the test. In the case of nested tests, this transformation
is applied recursively; the result is that each operation
is guarded by a conjunction of the b’s or their complements. Bernstein’s conditions are then applied to the
resulting trace, the b variables being included in the read
and modified sets as necessary. This ensures that the test
is not executed too early, and since there is a dependence
from a test to each enclosed operation, that no operation
is executed before the test results are known. One must
take care not to compute dependences between operations u and v which have incompatible guards $g_u$ and $g_v$
such that $g_u \wedge g_v = \mathit{false}$.
The case of while loops is more complex. Firstly, the
construction while(true) is the simplest way of writing
a non-terminating program, whose analysis has been
discussed above. Anything that follows an infinite loop
is dead code, and no analysis is needed for it. Consider
now a terminating loop:
while(p) do S;
The several executions of the continuation predicate,
p, must be considered as operations. Strictly speaking,
one cannot execute an instance of S before the corresponding instance of p, since if the result of p is false,
S is not executed. On the other hand, there must be a
dependence from S to p, since otherwise the loop would
not terminate. Hence, a while loop must be executed
sequentially. The only way out is to run the loop speculatively, i.e., to execute instances of the loop body before
knowing the outcome of the continuation predicate, but
this method is beyond the scope of this essay.
Related Entries
Dependences
Polyhedron Model
Bibliographic Notes and Further
Reading
See Allen and Kennedy’s book [] for many uses of the
concept of dependence in program optimization.
For more information on the transformation to
Single Assignment form, see [] or []. For the use of
Single Assignment Programs for hardware synthesis,
see [] or [].
Bibliography
. Allen JR, Kennedy K, Porterfield C, Warren J () Conversion
of control dependence to data dependence. In: Proceedings of
the th ACM SIGACT-SIGPLAN symposium on principles of
programming languages, POPL ’, ACM, New York, pp –
. Amiranoff P, Cohen A, Feautrier P () Beyond iteration vectors: instancewise relational abstract domains. In: Static analysis
symposium (SAS ’), Seoul, August
. Arsac J () La construction de programmes structurés.
Dunod, Paris
. Bernstein AJ () Analysis of programs for parallel processing.
IEEE Trans Electron Comput EC-:–
. Feautrier P () Dataflow analysis of scalar and array references.
Int J Parallel Program ():–
. Feautrier P () Array dataflow analysis. In: Pande S, Agrawal D
(eds) Compiler optimizations for scalable parallel systems. Lecture notes in computer science, vol , chapter . Springer,
Berlin, pp –
. Kennedy K, Allen R () Optimizing compilers for modern
architectures: a dependence-based approach. Morgan Kaufmann,
San Francisco
. Leverge H, Mauras C, Quinton P () The alpha language and
its use for the design of systolic arrays. J VLSI Signal Process
:–
. Tesler LG, Enea HJ () A language design for concurrent
processes. In: AFIPS SJCC , Thomson Book Co., pp –
. Verdoolaege S, Nikolov H, Stefanov T () Improved derivation
of process networks. In: Digest of the th workshop on optimization for DSP and embedded systems, New York, March ,
pp –
Bioinformatics
Srinivas Aluru
Iowa State University, Ames, IA, USA
Indian Institute of Technology Bombay, Mumbai, India
Synonyms
Computational biology
Definition
Bioinformatics and/or computational biology is broadly
defined as the development and application of informatics techniques for solving problems arising in biological
sciences.
Discussion
The terms “Bioinformatics” and “Computational Biology” are broadly used to represent research in computational models, methods, databases, software, and
analysis tools aimed at solving applications in the biological sciences. Although the origins of the field can be
traced as far back as , the field exploded in prominence during the s with the conception and execution of the human genome project, and such explosive growth continues to this date. This long time line
is punctuated by the occasional development of new
technologies in the field which generally created new
types of data acquisition (such as microarrays to capture gene expression developed in the mid-s) or
more rapid acquisition of data (such as the development of next-generation sequencing technologies in the
mid-s). Generally, these developments have ushered in new avenues of bioinformatics research due
to new applications enabled by novel data sources, or
increases in the scale of data that need to be archived
and analyzed, or new applications that come within
reach due to improved scales and efficiencies. For example, the rapid adoption of microarray technologies for
measuring gene expressions ushered in the era of systems biology; the relentless increases in cost efficiencies
in sequencing enabled genome sequencing for many
species, which then formed the foundation for the field
of comparative genomics.
Thanks to the aforementioned advances and our
continually improving knowledge of how biological systems are designed and operate, bioinformatics developed into a broad field with several well-defined subfields of specialization – computational genomics, comparative genomics, metagenomics, phylogenetics, systems biology, structural biology, etc. Several entries in
this encyclopedia are designed along the lines of such
subfields whenever a sufficient body of work exists in
development of parallel methods in the area. This entry
contains general remarks about the field of parallel
computational biology; the readers are referred to the
related entries for an in-depth discussion and appropriate references for specific topics. An alternative view to
classifying bioinformatics research relies on the organisms of study – (1) microbial organisms, (2) plants,
and (3) humans/animals. Even though the underlying bioinformatics techniques are generally applicable
across organisms, the key target applications tend to
vary based on this organismal classification. In studying
microbial organisms, a key goal is the ability to engineer them for increased production of certain products
or to meet certain environmental objectives. In agricultural biotechnology, key challenges being pursued
include increasing yields, increasing nutritional content,
developing robust crops and biofuel production. On the
other hand, much of the research on humans is driven
by medical concerns and understanding and treating
complex diseases.
Perhaps the oldest studied problem in computational biology is that of molecular dynamics, originally
studied in computational chemistry but increasingly
being applied to the study of protein structures in biology. Outside of this classical area, research in parallel
computational biology can be traced back to the development of parallel sequence alignment algorithms in
the late s. One of the early applications where parallel bioinformatics methods proved crucial is that of
genome assembly. At the human genome scale, this
requires inferring a sequence by assembling tens of millions of randomly sampled fragments of it, which would
take weeks to months if done serially. Although parallelism was initially
used only in phases where it is obvious, the resulting improvements in time-to-solution proved adequate, and subsequently motivated the development of more sophisticated parallel
bioinformatics methods for genome assembly. Spurred
by the enormous data sizes, problem complexities, and
multiple scales of biological systems, interest in parallel computational biology continues to grow. However,
the development of parallel methods in bioinformatics
represents a mere fraction of the problems for which
sequential solutions have been developed. In addition,
progress in parallel computational biology has not been
uniform across all subfields of computational biology.
For example, there has been little parallel work in the
field of comparative genomics, not counting trivial uses
of parallelism such as in all-pairs alignments. In other
fields such as systems biology, work in parallel methods is in its nascent stages with certain problem areas
(such as network inference) targeted more than others. The encyclopedia entries reflect this development
and cover the topical areas that reflect strides in parallel
computational biology.
Going forward, there are compelling developments
that favor growing prominence of parallel computing
in the field of bioinformatics and computational biology. One such development is the creation of high-throughput second- and third-generation sequencing
technologies. After more than three decades of sequencing DNA one fragment at a time, next-generation
sequencing technologies permit the simultaneous
sequencing of a large number of DNA fragments. With
throughputs increasing at the rate of a factor of  per
year, sequencers that were generating a few million
DNA reads per experiment in  are delivering
upward of  billion reads per experiment by early .
The high throughput sequencing data generated by
these systems is impacting many subfields of computational biology, and the data deluge is severely straining
the limits of what can be achieved by sequential methods. Next-generation sequencing is shifting individual
investigators into terascale and bigger organizations
into petascale, necessitating the development of parallel
methods. Many other types of high-throughput instrumentation are becoming commonplace in biology, raising further complexities in massive scale, heterogeneous
data integration and multiscale modeling of biological systems. Emerging high performance computing
paradigms such as Clouds and manycore GPU platforms are also providing impetus to the development of
parallel bioinformatics methods.
Related Entries
Genome Assembly
Homology to Sequence Alignment, From
Phylogenetics
Protein Docking
Suffix Trees
Systems Biology, Network Inference in
Bibliographic Notes and Further
Reading
Readers who wish to conduct an in-depth study of
bioinformatics and computational biology may find the
comprehensive handbook on computational biology a
useful reference []. Works on development of parallel
methods in computational biology can be found in the
book chapter [], the survey article [], the first edited
volume on this subject [], and several journal special
issues [–]. The annual IEEE International Workshop
on High Performance Computational Biology held since
 provides the primary forum for research dissemination in this field and the readers may consult the proceedings (www.hicomb.org) for scoping the progress in
the field and for further reference.
Bibliography
. Aluru S (ed) () Handbook of computational molecular biology. Chapman & Hall/CRC Computer and Information Science
Series, Boca Raton
. Aluru S, Amato N, Bader DA, Bhandarkar S, Kale L, Marinescu D, Samatova N () Parallel computational biology.
In: Heroux MA, Raghavan P, Simon HD (eds) Parallel Processing for Scientific Computing (Software, Environments and Tools).
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, pp –
. Aluru S, Bader DA () Special issue on high performance
computational biology. J Parallel Distr Com (–):–
. Aluru S, Bader DA () Special issue on high performance
computational biology. Concurrency-Pract Ex ():–
. Aluru S, Bader DA () Special issue on high performance
computational biology. Parallel Comput ():–
. Amato N, Aluru S, Bader DA () Special issue on high performance computational biology. IEEE Trans Parallel Distrib Syst
():–
. Bader DA () Computational biology and high-performance
computing. Commun ACM ():–
. Zomaya AY (ed) () Parallel computing for bioinformatics and
computational biology: models, enabling technologies, and case
studies. Wiley, Hoboken
Bisimilarity
Bisimulation
Bisimulation
Robert J. van Glabbeek
NICTA, Sydney, Australia
The University of New South Wales, Sydney, Australia
Stanford University, Stanford, CA, USA
Synonyms
Bisimulation equivalence; Bisimilarity
Definition
Bisimulation equivalence is a semantic equivalence relation on labeled transition systems, which are used to
represent distributed systems. It identifies systems with
the same branching structure.
Discussion
Labeled Transition Systems
A labeled transition system consists of a collection of
states and a collection of transitions between them. The
transitions are labeled by actions from a given set A that
happen when the transition is taken, and the states may
be labeled by predicates from a given set P that hold in
that state.
Definition  Let A and P be sets (of actions and predicates, respectively).
A labeled transition system (LTS) over A and P is a triple
(S, →, ⊧) with:
● S a class (of states).
● → a collection of binary relations $\xrightarrow{a} \subseteq S \times S$ – one
for every a ∈ A – (the transitions),
such that for all s ∈ S the class $\{t \in S \mid s \xrightarrow{a} t\}$ is a
set.
● ⊧ ⊆ S × P. s ⊧ p says that predicate p ∈ P holds in
state s ∈ S.
LTSs with A a singleton (i.e., with → a single binary
relation on S) are known as Kripke structures, the models
of modal logic. General LTSs (with A arbitrary) are the
Kripke models for polymodal logic. The name “labeled
transition system” is employed in concurrency theory.
There, the elements of S represent the systems one is
interested in, and $s \xrightarrow{a} t$ means that system s can
evolve into system t while performing the action a. This
approach identifies states and systems: The states of a
system s are the systems reachable from s by following the transitions. In this realm P is often taken to
be empty, or it contains a single predicate √ indicating
successful termination.
Definition  A process graph over A and P is a tuple
g = (S, I, →, ⊧) with (S, →, ⊧) an LTS over A and P in
which S is a set, and I ∈ S.
Process graphs are used in concurrency theory to
disambiguate between states and systems. A process
graph (S, I, →, ⊧) represents a single system, with S the
set of its states and I its initial state. In the context of
an LTS (S, →, ⊧) two concurrent systems are modeled
by two members of S; in the context of process graphs,
they are two different graphs. The nondeterministic finite
automata used in automata theory are process graphs
with a finite set of states over a finite alphabet A and a set
P consisting of a single predicate denoting acceptance.
Bisimulation Equivalence
Bisimulation equivalence is defined on the states of a
given LTS, or between different process graphs.
Definition  Let (S, →, ⊧) be an LTS over A and P. A
bisimulation is a binary relation R ⊆ S × S, satisfying:
∧ if sRt then s ⊧ p ⇔ t ⊧ p for all p ∈ P.
∧ if sRt and $s \xrightarrow{a} s'$ with a ∈ A, then there exists a t′
with $t \xrightarrow{a} t'$ and s′Rt′.
∧ if sRt and $t \xrightarrow{a} t'$ with a ∈ A, then there exists an
s′ with $s \xrightarrow{a} s'$ and s′Rt′.
Two states s, t ∈ S are bisimilar, denoted s ↔ t, if there
exists a bisimulation R with sRt.
Bisimilarity turns out to be an equivalence relation
on S, and is also called bisimulation equivalence.
Definition  Let g = (S, I, →, ⊧) and h = (S′, I′, →′,
⊧′) be process graphs over A and P. A bisimulation
between g and h is a binary relation R ⊆ S × S′, satisfying IRI′ and the same three clauses as above. g and h are
bisimilar, denoted g ↔ h, if there exists a bisimulation
between them.
[Figure: two process graphs over the actions a, b and c, related by a ↔ sign; in one the choice between b and c is made before the a-action, in the other after it]
Example The two process graphs above (over A =
{a, b, c} and P = {√}), in which the initial states are
indicated by short incoming arrows and the final states
(the ones labeled with √) by double circles, are not
bisimulation equivalent, even though in automata theory they accept the same language. The choice between
b and c is made at a different moment (namely, before
vs. after the a-action); that is, the two systems have
a different branching structure. Bisimulation semantics
distinguishes systems that differ in this manner.
Modal Logic
(Poly)modal logic is an extension of propositional logic
with formulas ⟨a⟩φ, saying that it is possible to follow
an a-transition after which the formula φ holds. Modal
formulas are interpreted on the states of labeled transition systems. Two systems are bisimilar iff they satisfy
the same infinitary modal formulas.
Definition  The language L of polymodal logic over
A and P is given by:
● ⊺ ∈ L
● p ∈ L for all p ∈ P
● if φ, ψ ∈ L then φ ∧ ψ ∈ L
● if φ ∈ L then ¬φ ∈ L
● if φ ∈ L and a ∈ A then ⟨a⟩φ ∈ L
Basic (as opposed to poly-) modal logic is the special case where ∣A∣ = 1; there ⟨a⟩φ is simply denoted
◇φ. The Hennessy–Milner logic is polymodal logic with
P = ∅. The language L∞ of infinitary polymodal logic
over A and P is obtained from L by additionally allowing $\bigwedge_{i \in I} \varphi_i$ to be in L∞ for arbitrary index sets I and
$\varphi_i$ ∈ L∞ for i ∈ I. The connectives ⊺ and ∧ are then the
special cases I = ∅ and ∣I∣ = 2.
Definition  Let (S, →, ⊧) be an LTS over A and P.
The relation ⊧ ⊆ S × P can be extended to the satisfaction
relation ⊧ ⊆ S × L∞, by defining
● $s \models \bigwedge_{i \in I} \varphi_i$ if $s \models \varphi_i$ for all i ∈ I – in particular, s ⊧ ⊺
for any state s ∈ S
● s ⊧ ¬φ if s ⊭ φ
● s ⊧ ⟨a⟩φ if there is a state t with $s \xrightarrow{a} t$ and t ⊧ φ
Write L(s) for {φ ∈ L ∣ s ⊧ φ}, and likewise L∞(s) for {φ ∈ L∞ ∣ s ⊧ φ}.
Theorem  [] Let (S, →, ⊧) be an LTS and s, t ∈ S.
Then s ↔ t ⇔ L∞(s) = L∞(t).
In case the systems s and t are image finite, it suffices to consider finitary polymodal formulas only [].
In fact, for this purpose it is enough to require that one
of s and t is image finite.
Definition  Let (S, →, ⊧) be an LTS. A state t ∈ S is
reachable from s ∈ S if there are $s_i \in S$ and $a_i \in A$ for
i = 0, . . . , n with $s = s_0$, $s_{i-1} \xrightarrow{a_i} s_i$ for i = 1, . . . , n, and
$s_n = t$. A state s ∈ S is image finite if for every state t ∈ S
reachable from s and for every a ∈ A, the set $\{u \in S \mid t \xrightarrow{a} u\}$ is finite.
Theorem  [] Let (S, →, ⊧) be an LTS and s, t ∈ S with
s image finite. Then s ↔ t ⇔ L(s) = L(t).
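For finite systems, both sides of this characterization can be computed directly. The sketch below is an illustration under stated assumptions and is not taken from the entry: the LTS is a hypothetical dictionary from states to sets of (action, successor) pairs, P is taken to be empty, formulas are encoded as nested tuples, and bisimilarity is obtained by the naive method of deleting pairs that violate the transfer clauses until nothing changes. The two systems follow the example of the entry, one choosing between b and c after the a-action, the other before it.

```python
from itertools import product

# Finite LTS with P empty: state -> set of (action, successor) pairs.
LTS = {
    "g0": {("a", "g1")},                # g: a, then a choice between b and c
    "g1": {("b", "g2"), ("c", "g3")},
    "g2": set(),
    "g3": set(),
    "h0": {("a", "h1"), ("a", "h2")},   # h: the choice is made by the a-step
    "h1": {("b", "h3")},
    "h2": {("c", "h4")},
    "h3": set(),
    "h4": set(),
}


def bisimilarity(lts):
    """Largest bisimulation on a finite LTS, as a set of pairs of states."""
    rel = set(product(lts, repeat=2))
    changed = True
    while changed:
        changed = False
        for s, t in list(rel):
            forth = all(any(a == b and (s2, t2) in rel for b, t2 in lts[t])
                        for a, s2 in lts[s])
            back = all(any(a == b and (s2, t2) in rel for b, s2 in lts[s])
                       for a, t2 in lts[t])
            if not (forth and back):
                rel.discard((s, t))
                changed = True
    return rel


def sat(lts, s, phi):
    """Satisfaction relation for finitary formulas encoded as tuples."""
    kind = phi[0]
    if kind == "top":
        return True
    if kind == "not":
        return not sat(lts, s, phi[1])
    if kind == "and":
        return sat(lts, s, phi[1]) and sat(lts, s, phi[2])
    if kind == "dia":  # <a>phi: some a-transition leads to a state satisfying phi
        return any(a == phi[1] and sat(lts, t, phi[2]) for a, t in lts[s])
    raise ValueError(f"unknown connective: {kind}")


TOP = ("top",)
# <a>(<b>T and <c>T): after some a-step, both b and c are still possible.
PHI = ("dia", "a", ("and", ("dia", "b", TOP), ("dia", "c", TOP)))
```

The formula PHI holds in g0 but not in h0, which is consistent with the pair (g0, h0) being absent from the computed relation. The quadratic refinement loop is chosen for brevity; partition refinement would be the usual choice for larger systems.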
Non-well-Founded Sets
Another characterization of bisimulation semantics
can be given by means of Aczel’s universe V of
non-well-founded sets []. This universe is an extension of the Von Neumann universe of well-founded
sets, where the axiom of foundation (every chain $x_0 \ni x_1 \ni \cdots$ terminates) is replaced by an anti-foundation
axiom.
Definition  Let (S, →, ⊧) be an LTS, and let B denote
the unique function B : S → V satisfying, for all s ∈ S,
$B(s) = \{\langle a, B(t)\rangle \mid s \xrightarrow{a} t\}$.
It follows from Aczel’s anti-foundation axiom that
such a function exists. In fact, the axiom amounts to
saying that systems of equations like the one above
have unique solutions. B(s) could be taken to be the
branching structure of s. The following theorem then
says that two systems are bisimilar iff they have the same
branching structure.
Theorem  [] Let (S, →, ⊧) be an LTS and s, t ∈ S.
Then s ↔ t ⇔ B(s) = B(t).
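For a finite LTS without cycles, ordinary well-founded sets suffice and B can be computed by plain recursion. The sketch below makes that restriction explicit: the dictionary encoding, the state names and the function `branching_structure` are assumptions introduced for illustration, P is taken to be empty, and the recursion does not terminate on a cyclic system, which is precisely the case that calls for non-well-founded sets.

```python
from functools import lru_cache

# Acyclic LTS with P empty: state -> set of (action, successor) pairs.
LTS = {
    "g0": {("a", "g1")},
    "g1": {("b", "g2"), ("c", "g3")},
    "g2": set(),
    "g3": set(),
    "h0": {("a", "h1"), ("a", "h2")},
    "h1": {("b", "h3")},
    "h2": {("c", "h4")},
    "h3": set(),
    "h4": set(),
    "k0": {("a", "k1")},               # a renamed copy of the system g0
    "k1": {("c", "k2"), ("b", "k3")},
    "k2": set(),
    "k3": set(),
}


def branching_structure(lts):
    """Return the function B for a finite acyclic LTS, using frozensets as sets."""
    @lru_cache(maxsize=None)
    def B(s):
        return frozenset((a, B(t)) for a, t in lts[s])
    return B


B = branching_structure(LTS)
```

The states g0 and k0 receive the same set although their states carry different names, while g0 and h0 receive different sets.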
Abstraction
In concurrency theory it is often useful to distinguish
between internal actions, which do not admit interactions with the outside world, and external ones. As normally there is no need to distinguish the internal actions
from each other, they all have the same name, namely,
τ. If A is the set of external actions a certain class of systems may perform, then $A_\tau := A \mathbin{\dot\cup} \{\tau\}$. Systems in that
class are then represented by labeled transition systems
over $A_\tau$ and a set of predicates P. The variant of bisimulation equivalence that treats τ just like any action of A is
called strong bisimulation equivalence. Often, however,
one wants to abstract from internal actions to various
degrees. A system doing two τ actions in succession is
then considered equivalent to a system doing just one.
However, a system that can do either a or b is considered different from a system that can do either a or first
τ and then b, because if the former system is placed
in an environment where b cannot happen, it can still
do a instead, whereas the latter system may reach a
state (by executing the τ action) in which a is no longer
possible.
Several versions of bisimulation equivalence that
formalize these desiderata occur in the literature.
Branching bisimulation equivalence [], like strong
bisimulation, faithfully preserves the branching structure
of related systems. The notions of weak and delay bisimulation equivalence, which were both introduced by
Milner under the name observational equivalence, make
more identifications, motivated by observable machine-behaviour according to certain testing scenarios.
Write $s \Longrightarrow t$ for $\exists n \geq 0 : \exists s_0, \ldots, s_n : s = s_0 \xrightarrow{\tau} s_1 \xrightarrow{\tau} \cdots \xrightarrow{\tau} s_n = t$, that is, a (possibly empty) path of
τ-steps from s to t. Furthermore, for $a \in A_\tau$, write $s \xrightarrow{(a)} t$
for $s \xrightarrow{a} t \lor (a = \tau \land s = t)$. Thus $\xrightarrow{(a)}$ is the same as $\xrightarrow{a}$
for a ∈ A, and $\xrightarrow{(\tau)}$ denotes zero or one τ-step.
Definition  Let (S, →, ⊧) be an LTS over $A_\tau$ and P.
Two states s, t ∈ S are branching bisimulation equivalent,
denoted s ↔_b t, if they are related by a binary relation
R ⊆ S × S (a branching bisimulation), satisfying:
∧ if sRt and s ⊧ p with p ∈ P, then there is a $t_1$ with
$t \Longrightarrow t_1 \models p$ and $s R t_1$.
∧ if sRt and t ⊧ p with p ∈ P, then there is an $s_1$ with
$s \Longrightarrow s_1 \models p$ and $s_1 R t$.
∧ if sRt and $s \xrightarrow{a} s'$ with $a \in A_\tau$, then there are $t_1, t_2, t'$
with $t \Longrightarrow t_1 \xrightarrow{(a)} t_2 = t'$, $s R t_1$, and $s' R t'$.
∧ if sRt and $t \xrightarrow{a} t'$ with $a \in A_\tau$, then there are $s_1, s_2, s'$
with $s \Longrightarrow s_1 \xrightarrow{(a)} s_2 = s'$, $s_1 R t$, and $s' R t'$.
Delay bisimulation equivalence, ↔_d, is obtained by
dropping the requirements $s R t_1$ and $s_1 R t$. Weak bisimulation equivalence [], ↔_w, is obtained by furthermore
relaxing the requirements $t_2 = t'$ and $s_2 = s'$ to $t_2 \Longrightarrow t'$
and $s_2 \Longrightarrow s'$.
These definitions stem from concurrency theory.
On Kripke structures, when studying modal or temporal logics, normally a stronger version of the first two
conditions is imposed:
∧ if sRt and p ∈ P, then s ⊧ p ⇔ t ⊧ p.
For systems without τ’s all these notions coincide with
strong bisimulation equivalence.
Concurrency
When applied to parallel systems, capable of performing different actions at the same time, the versions of
bisimulation discussed here employ interleaving semantics: no distinction is made between true parallelism and
its nondeterministic sequential simulation. Versions of
bisimulation that do make such a distinction have been
developed as well, most notably the ST-bisimulation [],
which takes temporal overlap of actions into account,
and the history preserving bisimulation [], which even
keeps track of causal relations between actions. For
this purpose, system representations such as Petri nets
or event structures are often used instead of labeled
transition systems.
Bibliography
. Aczel P () Non-well-founded Sets, CSLI Lecture Notes .
Stanford University, Stanford, CA
. van Glabbeek RJ () Comparative concurrency semantics and
refinement of actions. PhD thesis, Free University, Amsterdam.
Second edition available as CWI tract , CWI, Amsterdam 
. Hennessy M, Milner R () Algebraic laws for nondeterminism
and concurrency. J ACM ():–
. Hollenberg MJ () Hennessy-Milner classes and process algebra. In: Ponse A, de Rijke M, Venema Y (eds) Modal logic and
process algebra: a bisimulation perspective, CSLI Lecture Notes ,
CSLI Publications, Stanford, CA, pp –
. Milner R () Operational and algebraic semantics of concurrent processes. In: van Leeuwen J (ed) Handbook of theoretical
computer science, Chapter . Elsevier Science Publishers B.V.,
North-Holland, pp –
Further Readings
Baeten JCM, Weijland WP () Process algebra. Cambridge
University Press
Milner R () Communication and concurrency. Prentice Hall, New
Jersey
Sangiorgi D () on the origins of bisimulation and coinduction.
ACM Trans Program Lang Syst (). doi: ./.
Bisimulation Equivalence
Bisimulation
Bitonic Sort
Selim G. Akl
Queen's University, Kingston, ON, Canada
Synonyms
Bitonic sorting network; Ditonic sorting
Definition
Bitonic Sort is a sorting algorithm that uses comparison-swap operations to arrange into nondecreasing order an input sequence of elements on which a linear order is defined (for example, numbers, words, and so on).
Discussion
Introduction
Henceforth, all inputs are assumed to be numbers, without loss of generality. For ease of presentation, it is also assumed that the length of the sequence to be sorted is an integer power of 2. The Bitonic Sort algorithm has the following properties:
1. Oblivious: The indices of all pairs of elements involved in comparison-swaps throughout the execution of the algorithm are predetermined, and do not depend in any way on the values of the input elements. It is important to note here that the elements to be sorted are assumed to be kept in an array and swapped in place. The indices that are referred to here are those in the array.
2. Recursive: The algorithm can be expressed as a procedure that calls itself to operate on smaller versions of the input sequence.
3. Parallel: It is possible to implement the algorithm using a set of special processors called "comparators" that operate simultaneously, each implementing a comparison-swap. A comparator receives a distinct pair of elements as input and produces that pair in sorted order as output in one time unit.
Bitonic sequence. A sequence $(a_1, a_2, \ldots, a_{2m})$ is said to be bitonic if and only if:
(a) Either there is an integer j, $1 \le j \le 2m$, such that
$a_1 \le a_2 \le \cdots \le a_j \ge a_{j+1} \ge a_{j+2} \ge \cdots \ge a_{2m}$
(b) Or the sequence does not initially satisfy the condition in (a), but can be shifted cyclically until the condition is satisfied.
For example, the sequence (, , , , , , , , )
is bitonic, as it satisfies condition (a). Similarly, the
sequence (, , , , , , ), which does not satisfy condition (a), is also bitonic, as it can be shifted cyclically
to obtain (, , , , , , ).
Let $(a_1, a_2, \ldots, a_{2m})$ be a bitonic sequence, and let $d_i = \min(a_i, a_{m+i})$ and $e_i = \max(a_i, a_{m+i})$, for $i = 1, 2, \ldots, m$. The following properties hold:
(a) The sequences $(d_1, d_2, \ldots, d_m)$ and $(e_1, e_2, \ldots, e_m)$ are both bitonic.
(b) $\max(d_1, d_2, \ldots, d_m) \le \min(e_1, e_2, \ldots, e_m)$.
In order to prove the validity of these two properties, it suffices to consider sequences of the form:
$a_1 \le a_2 \le \cdots \le a_{j-1} \le a_j \ge a_{j+1} \ge \cdots \ge a_{2m}$,
for some $1 \le j \le 2m$, since a cyclic shift of $\{a_1, a_2, \ldots, a_{2m}\}$ affects $\{d_1, d_2, \ldots, d_m\}$ and $\{e_1, e_2, \ldots, e_m\}$ similarly, while affecting neither of the two properties to be established. In addition, there is no loss in generality to assume that $m < j \le 2m$, since $\{a_{2m}, a_{2m-1}, \ldots, a_1\}$ is also bitonic and neither property is affected by such reversal.
There are two cases:
1. If $a_m \le a_{2m}$, then $a_i \le a_{m+i}$. As a result, $d_i = a_i$, and $e_i = a_{m+i}$, for $1 \le i \le m$, and both properties hold.
2. If $a_m > a_{2m}$, then since $a_{j-m} \le a_j$, an index k exists, where $j \le k < 2m$, such that $a_{k-m} \le a_k$ and $a_{k-m+1} > a_{k+1}$. Consequently:
$d_i = a_i$ and $e_i = a_{m+i}$ for $1 \le i \le k - m$,
and
$d_i = a_{m+i}$ and $e_i = a_i$ for $k - m < i \le m$.
Hence:
$d_i \le d_{i+1}$ for $1 \le i < k - m$,
and
$d_i \ge d_{i+1}$ for $k - m \le i < m$,
implying that $\{d_1, d_2, \ldots, d_m\}$ is bitonic. Similarly, $e_i \le e_{i+1}$, for $k - m \le i < m$, $e_m \le e_1$, $e_i \le e_{i+1}$, for $1 \le i < j - m$, and $e_i \ge e_{i+1}$, for $j - m \le i < k - m$, implying that $\{e_1, e_2, \ldots, e_m\}$ is also bitonic. Also,
$\max(d_1, d_2, \ldots, d_m) = \max(d_{k-m}, d_{k-m+1}) = \max(a_{k-m}, a_{k+1})$,
and
$\min(e_1, e_2, \ldots, e_m) = \min(e_{k-m}, e_{k-m+1}) = \min(a_k, a_{k-m+1})$.
Finally, since $a_k \ge a_{k+1}$, $a_k \ge a_{k-m}$, $a_{k-m+1} \ge a_{k-m}$, and $a_{k-m+1} \ge a_{k+1}$, it follows that:
$\max(a_{k-m}, a_{k+1}) \le \min(a_k, a_{k-m+1})$.
Sorting a Bitonic Sequence
Given a bitonic sequence $(a_1, a_2, \ldots, a_{2m})$, it can be sorted into a sequence $(c_1, c_2, \ldots, c_{2m})$, arranged in nondecreasing order, by the following algorithm $\mathrm{MERGE}_{2m}$:
Step 1. The two sequences $(d_1, d_2, \ldots, d_m)$ and $(e_1, e_2, \ldots, e_m)$ are produced.
Step 2. These two bitonic sequences are sorted independently and recursively, each by a call to $\mathrm{MERGE}_m$.
It should be noted that in Step 2 the two sequences can be sorted independently (and simultaneously if enough comparators are available), since no element of $(d_1, d_2, \ldots, d_m)$ is larger than any element of $(e_1, e_2, \ldots, e_m)$. The m smallest elements of the final sorted sequence are produced by sorting $(d_1, d_2, \ldots, d_m)$, and the m largest by sorting $(e_1, e_2, \ldots, e_m)$. The recursion terminates when $m = 1$, since $\mathrm{MERGE}_2$ is implemented directly by one comparison-swap (or one comparator).
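The two-step MERGE procedure of this section can be sketched in a few lines. This is an illustrative sequential version under stated assumptions: the names `min_max_halves` and `bitonic_merge` are not from the article, the input is a Python list that is bitonic, and its length is a power of 2. A real circuit would perform the two recursive calls simultaneously.

```python
def min_max_halves(a):
    """Step 1: pair a[i] with a[m + i]; return the minima d and the maxima e."""
    m = len(a) // 2
    d = [min(a[i], a[m + i]) for i in range(m)]
    e = [max(a[i], a[m + i]) for i in range(m)]
    return d, e


def bitonic_merge(a):
    """Sort a bitonic list whose length is a power of 2."""
    if len(a) <= 1:
        return list(a)
    d, e = min_max_halves(a)
    # Step 2: d and e are again bitonic and no element of d exceeds
    # any element of e, so the two halves can be sorted independently.
    return bitonic_merge(d) + bitonic_merge(e)


a = [1, 4, 6, 7, 5, 3, 2, 0]     # rises, then falls
d, e = min_max_halves(a)
print(d, e, max(d) <= min(e))
print(bitonic_merge(a))
```

For this input the minima are `[1, 3, 2, 0]` and the maxima `[5, 4, 6, 7]`, which illustrates properties (a) and (b) before the recursion continues.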
Sorting an Arbitrary Sequence
Algorithm MERGEm assumes that the input sequence
to be sorted is bitonic. However, it is easy to modify an arbitrary input sequence into a sequence of
bitonic sequences as follows. Let the input sequence
be (a , a , . . . , an ), and recall that, for simplicity, it
is assumed that n is a power of . Now the following
n/ comparisons-swaps are performed: For all odd i,
ai is compared with ai+ and a swap is applied if necessary. These comparison-swaps are numbered from
 to n/. Odd-numbered comparison-swaps place the
smaller of (ai , ai+ ) first and the larger of the pair second.
Even-numbered comparison-swaps place the larger of
(ai , ai+ ) first and the smaller of the pair second.
At the end of this first stage, n/ bitonic sequences
are obtained. Each of these sequences is of length 
and can be sorted using MERGE . These instances of
MERGE are numbered from  to n/. Odd-numbered
instances sort their inputs in nondecreasing order while
Bitonic Sort
even-numbered instances sort their inputs in nonincreasing order. This yields n/ bitonic sequences each
of length .
The process continues until a single bitonic sequence
of length n is produced and is sorted by giving it as input
to MERGEn . If a comparator is used to implement each
comparison-swap, and all independent comparisonswaps are allowed to be executed in parallel, then the
sequence is sorted in (( + log n) log n)/ time units (all
logarithms are to the base ). This is now illustrated.
Implementation as a Combinational Circuit
Figure 1 shows a schematic diagram of a comparator. The comparator receives two numbers x and y as input, and produces the smaller of the two on its left output line, and the larger of the two on its right output line.
Bitonic Sort. Fig. 1 A comparator
A combinational circuit for sorting is a device, built entirely of comparators, that takes an arbitrary sequence at one end and produces that sequence in sorted order at the other end. The comparators are arranged in rows. Each comparator receives its two inputs from the input sequence, or from comparators in the previous row, and delivers its two outputs to the comparators in the following row, or to the output sequence. A combinational circuit has no feedback: Data traverse the circuit in the same way as water flows from high to low terrain.
Figure 2 shows a schematic diagram of a combinational circuit implementation of algorithm $\mathrm{MERGE}_{2m}$, which sorts the bitonic sequence $(a_1, a_2, \ldots, a_{2m})$ into nondecreasing order from smallest, on the leftmost line, to largest, on the rightmost.
Bitonic Sort. Fig. 2 A circuit for sorting a bitonic sequence of length 2m
Clearly, a bitonic sequence of length 2 is sorted by a single comparator. Combinational circuits for sorting bitonic sequences of length 4 and 8 are shown in Figs. 3 and 4, respectively.
Bitonic Sort. Fig. 3 A circuit for sorting a bitonic sequence of length 4
Finally, a combinational circuit for sorting an arbitrary sequence of numbers, namely, the sequence (5, 3, 2, 6, 1, 4, 7, 5), is shown in Fig. 5. Comparators that reverse their outputs (i.e., those that produce the larger of their two inputs on the left output line, and the smaller on the right output line) are indicated with the letter R.
Analysis
The depth of a sorting circuit is the number of rows it contains, that is, the maximum number of comparators on a path from input to output. Depth, in other words, represents the time it takes the circuit to complete the sort. The size of a sorting circuit is defined as the number of comparators it uses.
Taking $2m = 2^i$, the depth of the circuit in Fig. 2, which implements algorithm $\mathrm{MERGE}_{2m}$ for sorting a bitonic sequence of length 2m, is given by the recurrence:
$$d(2) = 1$$
$$d(2^i) = 1 + d(2^{i-1}),$$
whose solution is $d(2^i) = i$. The size of the circuit is given by the recurrence:
$$s(2) = 1$$
$$s(2^i) = 2^{i-1} + 2\,s(2^{i-1}),$$
whose solution is $s(2^i) = i\,2^{i-1}$.
A circuit for sorting an arbitrary sequence of length n, such as the circuit in Fig. 5 where n = 8, consists of log n phases: In the ith phase, $n/2^i$ circuits are required, each implementing $\mathrm{MERGE}_{2^i}$, and having a size of $s(2^i)$ and a depth of $d(2^i)$, for $i = 1, 2, \ldots, \log n$. The depth and size of this circuit are
$$\sum_{i=1}^{\log n} d(2^i) = \sum_{i=1}^{\log n} i = \frac{(1 + \log n)\log n}{2},$$
and
$$\sum_{i=1}^{\log n} 2^{(\log n) - i}\, s(2^i) = \sum_{i=1}^{\log n} 2^{(\log n) - i}\, i\, 2^{i-1} = \frac{n(1 + \log n)\log n}{4},$$
respectively.
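The depth and size recurrences of the Analysis section, and the closed forms obtained by summing over the log n phases, can be cross-checked numerically. The sketch below uses hypothetical function names and evaluates the recurrences directly rather than building a circuit.

```python
def merge_depth(length):
    """d(2) = 1, d(2^i) = 1 + d(2^(i-1))."""
    return 1 if length == 2 else 1 + merge_depth(length // 2)


def merge_size(length):
    """s(2) = 1, s(2^i) = 2^(i-1) + 2 * s(2^(i-1))."""
    return 1 if length == 2 else length // 2 + 2 * merge_size(length // 2)


def sorter_depth_and_size(n):
    """Sum the phases: phase i uses n / 2^i mergers of length 2^i."""
    depth = size = 0
    length = 2
    while length <= n:
        depth += merge_depth(length)                 # mergers in a phase run side by side
        size += (n // length) * merge_size(length)
        length *= 2
    return depth, size


for n in (8, 1024):
    k = n.bit_length() - 1                           # k = log n for a power of 2
    print(n, sorter_depth_and_size(n), ((1 + k) * k // 2, n * (1 + k) * k // 4))
```

For n = 8 both the summed recurrences and the closed forms give depth 6 and size 24.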
Lower Bounds
In order to evaluate the quality of the Bitonic Sort circuit,
its depth and size are compared to two types of lower
bounds.
A lower bound on the number of comparisons. A
lower bound on the number of comparisons required in
the worst case by a comparison-based sorting algorithm
to sort a sequence of n numbers is derived as follows. All
comparison-based sorting algorithms are modeled by a
binary tree, each of whose nodes represents a comparison between two input elements. The leaves of the tree
represent the n! possible outcomes of the sort. A path
from the root of the tree to a leaf is a complete execution of an algorithm on a given input. The length of the
longest such path is log n!, which is a quantity on the
order of n log n.
Several sequential comparison-based algorithms,
such as Heapsort and Mergesort, for example, achieve
this bound, to within a constant multiplicative factor,
and are therefore said to be optimal. The Bitonic Sort circuit always runs in time on the order of $\log^2 n$ and is therefore significantly faster than any sequential algorithm. However, the number of comparators it uses, and therefore the number of comparisons it performs, is on the order of $n \log^2 n$, and consequently, it is not optimal in that sense.
Lower bounds on the depth and size. These lower bounds are specific to circuits. A lower bound on the depth of a sorting circuit for an input sequence of n elements is obtained as follows. Each comparator has two outputs, implying that an input element can reach at most $2^r$ locations after r rows. Since each element should be able to reach all n output positions, a lower bound on the depth of the circuit is on the order of log n.
Bitonic Sort. Fig. 4 A circuit for sorting a bitonic sequence of length 8
A lower bound on the size of a sorting circuit is obtained by observing that the circuit must be able to produce any of the n! possible output sequences for each input sequence. Since each comparator can be in one of two states (swapping or not swapping its two inputs), the total number of configurations in which a circuit with c comparators can be is $2^c$, and this needs to be at least equal to n!. Thus, a lower bound on the size of a sorting circuit is on the order of n log n.
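The step from the counting argument to the order-of-magnitude bound can be spelled out. The following worked derivation is a standard estimate added for illustration; it uses only the quantities c and n introduced above, with logarithms to the base 2.

```latex
\begin{align*}
2^{c} \ge n! \;&\Longrightarrow\; c \ge \log n!, \\
\log n! = \sum_{i=1}^{n} \log i \;&\ge\; \sum_{i=\lceil n/2 \rceil}^{n} \log i
        \;\ge\; \frac{n}{2}\,\log\frac{n}{2}, \\
\log n! \;&\le\; n \log n .
\end{align*}
```

Hence log n! is on the order of n log n, which gives both the size bound for circuits and the comparison bound for sequential algorithms.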
The Bitonic Sort circuit exceeds each of the lower
bounds on depth and size by a factor of log n, and is
therefore not optimal in that sense as well.
There exists a sorting circuit, known as the AKS circuit, which has a depth on the order of log n and a
size on the order of n log n. This circuit meets the lower
bound on the number of comparisons required to sort,
in addition to the lower bounds on depth and size specific to circuits. It is, therefore, theoretically optimal on
all counts.
In practice, however, the AKS circuit may not be
very useful. In addition to its high conceptual complexity, its depth and size expressions are preceded by
constants on the order of  . Even if the depth of the
AKS circuit were as small as  log n, in order for the
latter to be smaller than the depth of the Bitonic Sort
circuit, namely, ((1 + log n) log n)/2, the input sequence
will need to have the astronomical length of n >  ,
which is greater than the number of atoms that the
observable universe is estimated to contain.
Future Directions
An interesting open problem in this area of research is to design a sorting circuit with the elegance, regularity, and simplicity of the Bitonic Sort circuit, while at the same time matching the theoretical lower bounds on depth and size as closely as possible.
Bitonic Sort. Fig. 5 A circuit for sorting an arbitrary sequence of length 8
Related Entries
AKS Sorting Network
Bitonic Sorting, Adaptive
Odd-Even Sorting
Permutation Circuits
Sorting
Bibliographic Notes and Further Reading
In , Ken Batcher presented a paper at the AFIPS
conference, in which he described two circuits for
Bitonic Sort
sorting, namely, the Odd-Even Sort and the Bitonic
Sort circuits []. This paper pioneered the study of
parallel sorting algorithms and effectively launched
the field of parallel algorithm design and analysis.
Proofs of correctness of the Bitonic Sort algorithm
are provided in [, , ]. Implementations of Bitonic
Sort on other architectures beside combinational circuits have been proposed, including implementations
on a perfect shuffle computer [], a mesh of processors [, ], and on a shared-memory parallel
machine [].
In its simplest form, Bitonic Sort assumes that n, the
length of the input sequence, is a power of . When
n is not a power of , the sequence can be padded
with z zeros such that n + z is the smallest power
of  larger than n. Alternatively, several variants of
Bitonic Sort were proposed in the literature that are
capable of sorting input sequences of arbitrary length
[, , ].
Combinational circuits for sorting (also known as
sorting networks) are discussed in [–, , –, ].
A lower bound for oblivious merging is derived in []
and generalized in [], which demonstrates that the
bitonic merger is optimal to within a small constant factor. The AKS sorting circuit (whose name derives from
the initials of its three inventors) was first described
in [] and then in []. Unlike the Bitonic Sort circuit that is based on the idea of repeatedly merging
increasingly longer subsequences, the AKS circuit sorts
by repeatedly splitting the original input sequence into
increasingly shorter and disjoint subsequences. While
theoretically optimal, the circuit suffers from large multiplicative constants in the expressions for its depth and
size, making it of little use in practice, as mentioned
earlier. A formulation in [] manages to reduce the
constants to a few thousands, still a prohibitive number.
Descriptions of the AKS circuit appear in [, , , ].
Sequential sorting algorithms, including Heapsort and
Mergesort, are covered in [, ].
One property of combinational circuits, not shared
by many other parallel models of computation, is their
ability to allow several input sequences to be processed
simultaneously, in a pipeline fashion. This is certainly
true of sorting circuits: Once the elements of the first
input sequence have traversed the first row and moved
on to the second, a new sequence can enter the first row, and so on. If the circuit has depth D, then M sequences can be sorted in D + M − 1 time units [].
Bibliography
1. Ajtai M, Komlós J, Szemerédi E () An O(n log n) sorting network. In: Proceedings of the ACM symposium on theory of computing, Boston, Massachusetts, pp –
2. Ajtai M, Komlós J, Szemerédi E () Sorting in c log n parallel steps. Combinatorica :–
3. Akl SG () Parallel sorting algorithms. Academic Press, Orlando, Florida
4. Akl SG () The design and analysis of parallel algorithms. Prentice-Hall, Englewood Cliffs, New Jersey
5. Akl SG () Parallel computation: models and methods. Prentice Hall, Upper Saddle River, New Jersey
6. Batcher KE () Sorting networks and their applications. In: Proceedings of the AFIPS spring joint computer conference. Atlantic City, New Jersey, pp –. Reprinted in: Wu CL, Feng TS (eds) Interconnection networks for parallel and distributed processing. IEEE Computer Society, , pp –
7. Bilardi G, Nicolau A () Adaptive bitonic sorting: An optimal parallel algorithm for shared-memory machines. SIAM J Comput ():–
8. Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduction to algorithms, 2nd edn. MIT Press/McGraw-Hill, Cambridge, Massachusetts/New York
9. Floyd RW () Permuting information in idealized two-level storage. In: Miller RE, Thatcher JW (eds) Complexity of computer computations. Plenum Press, New York, pp –
10. Gibbons A, Rytter W () Efficient parallel algorithms. Cambridge University Press, Cambridge, England
11. JáJá J () An introduction to parallel algorithms. Addison-Wesley, Reading, Massachusetts
12. Knuth DE () The art of computer programming, vol . Addison-Wesley, Reading, Massachusetts
13. Leighton FT () Introduction to parallel algorithms and architectures. Morgan Kaufmann, San Mateo, California
14. Liszka KJ, Batcher KE () A generalized bitonic sorting network. In: Proceedings of the international conference on parallel processing, vol , pp –
15. Nakatani T, Huang ST, Arden BW, Tripathi SK () K-way bitonic sort. IEEE Trans Comput ():–
16. Nassimi D, Sahni S () Bitonic sort on a mesh-connected parallel computer. IEEE Trans Comput C-():–
17. Parberry I () Parallel complexity theory. Pitman, London
18. Paterson MS () Improved sorting networks with O(log N) depth. Algorithmica :–
19. Smith JR () The design and analysis of parallel algorithms. Oxford University Press, Oxford, England
20. Stone HS () Parallel processing with the perfect shuffle. IEEE Trans Comput C-():–
21. Thompson CD, Kung HT () Sorting on a mesh-connected parallel computer. Commun ACM ():–
22. Wang BF, Chen GH, Hsu CC () Bitonic sort with an arbitrary number of keys. In: Proceedings of the international conference on parallel processing. vol , Illinois, pp –
23. Yao AC, Yao FF () Lower bounds for merging networks. J ACM ():–
Bitonic Sorting Network
Bitonic Sort
Bitonic Sorting, Adaptive
Gabriel Zachmann
Clausthal University, Clausthal-Zellerfeld, Germany
Definition
Adaptive bitonic sorting is a sorting algorithm suitable for implementation on EREW parallel architectures. Similar to bitonic sorting, it is based on merging, which is recursively applied to obtain a sorted sequence. In contrast to bitonic sorting, it is data-dependent. Adaptive bitonic merging can be performed in $O(n/p)$ parallel time, p being the number of processors, and executes only O(n) operations in total. Consequently, adaptive bitonic sorting can be performed in $O((n \log n)/p)$ time, which is optimal. So, one of its advantages is that it executes a factor of O(log n) fewer operations than bitonic sorting. Another advantage is that it can be implemented efficiently on modern GPUs.
Discussion
Introduction
This chapter describes a parallel sorting algorithm,
adaptive bitonic sorting [], that offers the following
benefits:
● It needs only the optimal total number of comparison/exchange operations, O(n log n).
● The hidden constant in the asymptotic number of operations is less than in other optimal parallel sorting methods.
● It can be implemented in a highly parallel manner on modern architectures, such as a streaming architecture (GPUs), even without any scatter operations, that is, without random access writes.
One of the main differences between “regular” bitonic
sorting and adaptive bitonic sorting is that regular
bitonic sorting is data-independent, while adaptive
bitonic sorting is data-dependent (hence the name).
As a consequence, adaptive bitonic sorting cannot
be implemented as a sorting network, but only on architectures that offer some kind of flow control. Nonetheless, it is convenient to derive the method of adaptive
bitonic sorting from bitonic sorting.
Sorting networks have a long history in computer science research (see the comprehensive survey []). One reason is that sorting networks are a convenient way to describe parallel sorting algorithms on CREW-PRAMs or even EREW-PRAMs (the latter is also called PRAC, for "parallel random access computer").
In the following, let n denote the number of keys to be sorted, and p the number of processors. For the sake of clarity, n will always be assumed to be a power of 2. (In their original paper [], Bilardi and Nicolau have described how to modify the algorithms such that they can handle arbitrary numbers of keys, but these technical details will be omitted in this article.)
The first to present a sorting network with optimal
asymptotic complexity were Ajtai, Komlós, and Szemerédi []. Also, Cole [] presented an optimal parallel
merge sort approach for the CREW-PRAM as well as
for the EREW-PRAM. However, it has been shown that
neither is fast in practice for reasonable numbers of keys
[, ].
In contrast, adaptive bitonic sorting requires fewer than 2n log n comparisons in total, independent of the number of processors. On p processors, it can be implemented in $O((n \log n)/p)$ time, for $p \le n/\log n$.
Even with a small number of processors it is efficient in practice: in its original implementation, the
sequential version of the algorithm was at most by a
factor . slower than quicksort (for sequence lengths
up to  ) [].
Fundamental Properties
One of the fundamental concepts in this context is the
notion of a bitonic sequence.
Definition 1 (Bitonic sequence) Let $a = (a_0, \ldots, a_{n-1})$ be a sequence of numbers. Then, a is bitonic, iff it monotonically increases and then monotonically decreases, or if it can be cyclically shifted (i.e., rotated) to become monotonically increasing and then monotonically decreasing.
Figure 1 shows some examples of bitonic sequences.
In the following, it will be easier to understand any reasoning about bitonic sequences, if one considers them as being arranged in a circle or on a cylinder: then, there are only two inflection points around the circle. This is justified by Definition 1. Figure 2 depicts an example in this manner.
As a consequence, all index arithmetic is understood modulo n, that is, index i + k ≡ i + k mod n, unless otherwise noted, so indices range from 0 through n − 1.
As mentioned above, adaptive bitonic sorting can be regarded as a variant of bitonic sorting. In order to capture the notion of "rotational invariance" (in some sense) of bitonic sequences, it is convenient to define the following rotation operator.
Definition 2 (Rotation) Let $a = (a_0, \ldots, a_{n-1})$ and $j \in \mathbb{N}$. We define a rotation as an operator $R_j$ on the sequence a:
$$R_j a = (a_j, a_{j+1}, \ldots, a_{j+n-1})$$
This operation is performed by the network shown in Fig. 4. Such networks are comprised of elementary comparators (see Fig. 3).
Two other operators are convenient to describe sorting.
Definition 3 (Half-cleaner) Let $a = (a_0, \ldots, a_{n-1})$.
$$La = \left(\min(a_0, a_{n/2}), \ldots, \min(a_{n/2-1}, a_{n-1})\right),$$
$$Ua = \left(\max(a_0, a_{n/2}), \ldots, \max(a_{n/2-1}, a_{n-1})\right).$$
In [], a network that performs these operations together is called a half-cleaner (see Fig. 5).
Bitonic Sorting, Adaptive. Fig. 1 Three examples of sequences that are bitonic. Obviously, the mirrored sequences (either way) are bitonic, too
Bitonic Sorting, Adaptive. Fig. 2 Left: according to their definition, bitonic sequences can be regarded as lying on a cylinder or as being arranged in a circle. As such, they consist of one monotonically increasing and one decreasing part. Middle: in this point of view, the network that performs the L and U operators (see Fig. 5) can be visualized as a wheel of "spokes." Right: visualization of the effect of the L and U operators; the blue plane represents the median
Bitonic Sorting, Adaptive. Fig. 3 Comparator/exchange elements
Bitonic Sorting, Adaptive. Fig. 4 A network that performs the rotation operator
Bitonic Sorting, Adaptive. Fig. 5 A network that performs the L and U operators
It is easy to see that, for any j and a,
$$La = R_{-j \bmod n}\, L R_j a, \qquad (1)$$
and
$$Ua = R_{-j \bmod n}\, U R_j a. \qquad (2)$$
This is the reason why the cylinder metaphor is valid. The proof needs to consider only two cases: $j = n/2$ and $1 \le j < n/2$. In the former case, Eq. 1 becomes $La = L R_{n/2} a$, which can be verified trivially. In the latter case, Eq. 1 becomes
$$L R_j a = \left(\min(a_j, a_{j+n/2}), \ldots, \min(a_{n/2-1}, a_{n-1}), \ldots, \min(a_{j-1}, a_{j-1+n/2})\right) = R_j La.$$
Thus, with the cylinder metaphor, the L and U operators basically do the following: cut the cylinder with circumference n at any point, roll it around a cylinder with circumference n/2, and perform position-wise the max and min operator, respectively. Some examples are shown in Fig. 6.
Bitonic Sorting, Adaptive. Fig. 6 Examples of the result of the L and U operators. Conceptually, these operators fold the bitonic sequence (black), such that the part from indices n/2 + 1 through n (light gray) is shifted into the range 1 through n/2 (black); then, L and U yield the upper (medium gray) and lower (dark gray) hull, respectively
The following theorem states some important properties of the L and U operators.
Theorem 1 Given a bitonic sequence a,
$$\max\{La\} \le \min\{Ua\}.$$
Moreover, La and Ua are bitonic too.
In other words, each element of La is less than or equal to each element of Ua.
This theorem is the basis for the construction of the bitonic sorter []. The first step is to devise a bitonic merger (BM). We denote a BM that takes as input bitonic sequences of length n with $BM_n$. A BM is recursively defined as follows:
$$BM_n(a) = \left(BM_{n/2}(La),\; BM_{n/2}(Ua)\right).$$
The base case is, of course, a two-key sequence, which is handled by a single comparator. A BM can be easily represented in a network as shown in Fig. 7.
Bitonic Sorting, Adaptive. Fig. 7 Schematic, recursive diagram of a network that performs bitonic merging
Given a bitonic sequence a of length n, one can show that
$$BM_n(a) = \mathrm{Sorted}(a). \qquad (3)$$
It should be obvious that the sorting direction can be changed simply by swapping the direction of the elementary comparators.
Coming back to the metaphor of the cylinder, the first stage of the bitonic merger in Fig. 7 can be visualized as n/2 comparators, each one connecting an element of the cylinder with the opposite one, somewhat like spokes in a wheel. Note that here, while the cylinder can rotate freely, the "spokes" must remain fixed.
From a bitonic merger, it is straightforward to derive a bitonic sorter, $BS_n$, that takes an unsorted sequence, and produces a sorted sequence either up or down. Like the BM, it is defined recursively, consisting of two smaller bitonic sorters and a bitonic merger (see Fig. 8). Again, the base case is the two-key sequence.
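The rotation operator, the L and U operators, and the recursive bitonic merger lend themselves to a direct numerical check. The sketch below is illustrative only: the function names are assumptions, sequences are Python lists indexed from 0, and the outer rotation in Eq. 1 is applied to the half-length sequence, so its index arithmetic is modulo n/2.

```python
def rotate(a, j):
    """R_j: (a_j, a_{j+1}, ..., a_{j+n-1}), indices modulo len(a)."""
    j %= len(a)
    return a[j:] + a[:j]


def lower(a):
    """L: position-wise minimum of the two halves."""
    h = len(a) // 2
    return [min(a[i], a[i + h]) for i in range(h)]


def upper(a):
    """U: position-wise maximum of the two halves."""
    h = len(a) // 2
    return [max(a[i], a[i + h]) for i in range(h)]


def bitonic_merger(a):
    """BM_n(a) = (BM_{n/2}(La), BM_{n/2}(Ua)) for bitonic a, n a power of 2."""
    if len(a) <= 1:
        return list(a)
    return bitonic_merger(lower(a)) + bitonic_merger(upper(a))


a = [1, 4, 6, 7, 5, 3, 2, 0]                      # bitonic, n = 8
for j in range(len(a)):
    r = rotate(a, j)
    assert rotate(lower(r), -j) == lower(a)       # Eq. 1
    assert rotate(upper(r), -j) == upper(a)       # Eq. 2
    assert max(lower(r)) <= min(upper(r))         # Theorem 1
    assert bitonic_merger(r) == sorted(a)         # Eq. 3
print(lower(a), upper(a), bitonic_merger(a))
```

The loop runs over all rotations of one bitonic sequence, which is the "rotational invariance" that the cylinder metaphor expresses.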
Analysis of the Number of Operations of Bitonic Sorting
Since a bitonic sorter basically consists of a number of bitonic mergers, it suffices to look at the total number of comparisons of the latter.
The total number of comparators, C(n), in the bitonic merger $BM_n$ is given by:
$$C(n) = 2\,C\!\left(\frac{n}{2}\right) + \frac{n}{2}, \quad \text{with } C(2) = 1,$$
which amounts to
$$C(n) = \frac{1}{2}\, n \log n.$$
As a consequence, the bitonic sorter consists of $O(n \log^2 n)$ comparators.
Clearly, there is some redundancy in such a network, since n comparisons are sufficient to merge two sorted sequences. The reason is that the comparisons performed by the bitonic merger are data-independent.
Derivation of Adaptive Bitonic Merging
The algorithm for adaptive bitonic sorting is based on the following theorem.
Theorem 2 Let a be a bitonic sequence. Then, there is an index q such that
$$La = (a_q, \ldots, a_{q+n/2-1}) \qquad (4)$$
$$Ua = (a_{q+n/2}, \ldots, a_{q-1}) \qquad (5)$$
(Remember that index arithmetic is always modulo n.)
B
Bitonic Sorting, Adaptive
BS(n)
n/2-1
BS(n/2)
BM(n)
Sorted
n/2
Sorted
Bitonic
BS(n/2)
Sorted
0
Unsorted

n-1
Bitonic Sorting, Adaptive. Fig.  Schematic, recursive diagram of a bitonic sorting network
the median, each half must have length n . The indices
where the cut happens are q and q + n . Figure  shows
an example (in one dimension).
The following theorem is the final keystone for the
adaptive bitonic sorting algorithm.
m
0
q+n/2
q
U
n-1
Theorem  Any bitonic sequence a can be partitioned
into four subsequences (a , a , a , a ) such that either
L
Bitonic Sorting, Adaptive. Fig.  Visualization for the
proof of Theorem 
(La, Ua) = (a , a , a , a )
()
(La, Ua) = (a , a , a , a ).
()
n
,

()
or
Furthermore,
The following outline of the proof assumes, for the
sake of simplicity, that all elements in a are distinct. Let
m be the median of all ai , that is, n elements of a are less
than or equal to m, and n elements are larger. Because
of Theorem ,
max{La} ≤ m < min{Ua} .
Employing the cylinder metaphor again, the median
m can be visualized as a horizontal plane z = m that
cuts the cylinder. Since a is bitonic, this plane cuts the
sequence in exactly two places, that is, it partitions the
sequence into two contiguous halves (actually, any horizontal plane, i.e., any percentile partitions a bitonic
sequence in two contiguous halves), and since it is
∣a ∣ + ∣a ∣ = ∣a ∣ + ∣a ∣ =
∣a ∣ = ∣a ∣ ,
()
∣a ∣ = ∣a ∣ ,
()
and
where ∣a∣ denotes the length of sequence a.
Figure  illustrates this theorem by an example.
This theorem can be proven fairly easily too: the
length of the subsequences is just q and n − q, where q is
the same as in Theorem . Assuming that max{a } <
m < min{a }, nothing will change between those
two subsequences (see Fig. ). However, in that case
min{a } > m > max{a }; therefore, by swapping
Bitonic Sorting, Adaptive
a
a
0
n/2
B
m
n–1
B
0
q
a1
b
n/2 q + n/2 n – 1
a2
a3
a4
Ua
m
m
La
0
q
n/2
0
q
n/2 q + n/2 n – 1
d
a1
c
a4
La
a3
a2
Ua
Bitonic Sorting, Adaptive. Fig.  Example illustrating Theorem 
a and a (which have equal length), the bounds
max{(a , a )} < m < min{a , a )} are obtained. The
other case can be handled analogously.
Remember that there are n comparator-andexchange elements, each of which compares ai and
ai+ n . They will perform exactly this exchange of subsequences, without ever looking at the data.
Now, the idea of adaptive bitonic sorting is to find
the subsequences, that is, to find the index q that marks
the border between the subsequences. Once q is found,
one can (conceptually) swap the subsequences, instead
of performing n comparisons unconditionally.
Finding q can be done simply by binary search
driven by comparisons of the form (ai , ai+ n ).
Overall, instead of performing n comparisons in the
first stage of the bitonic merger (see Fig. ), the adaptive
bitonic merger performs log( n ) comparisons in its first
stage (although this stage is no longer representable by
a network).
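The binary search for the border can be sketched on a plain Python list. This is only an illustration under stated assumptions, not the tree-based procedure of the entry: the function names `find_swap_range` and `lower_upper` are hypothetical, the input length is assumed to be a power of two, and all keys are assumed to be distinct. The first comparison decides whether the pairs to be exchanged form a prefix or a suffix of the index range 0, ..., n/2-1; the remaining comparisons locate the border.

```python
def find_swap_range(a):
    """Return (indices_to_swap, comparisons) for a bitonic list a.

    Assumes len(a) is a power of two >= 2 and all keys are distinct.
    Index i is in the returned range iff a[i] > a[i + n/2].
    """
    n = len(a)
    half = n // 2
    # First comparison: do the pairs to exchange form a prefix [0, q)
    # or a suffix [q, n/2) of the index range 0 .. n/2-1?
    prefix_case = a[half - 1] < a[n - 1]
    comparisons = 1
    lo, hi = 0, half - 1              # the border q lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        must_swap = a[mid] > a[mid + half]
        if must_swap == prefix_case:
            lo = mid + 1              # border lies to the right of mid
        else:
            hi = mid                  # border is at mid or to its left
    swap = range(0, lo) if prefix_case else range(lo, half)
    return swap, comparisons


def lower_upper(a):
    """Build (La, Ua) by exchanging only the pairs found by the search."""
    half = len(a) // 2
    swap, comparisons = find_swap_range(a)
    b = list(a)
    for i in swap:                    # O(n) copying; the bitonic tree avoids this
        b[i], b[i + half] = b[i + half], b[i]
    return b[:half], b[half:], comparisons
```

For a list of length n = 2^k the sketch always spends exactly k comparisons. The element exchange is still linear here, which is the cost that the pointer-based representation is meant to remove.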
Let C(n) be the total number of comparisons performed by adaptive bitonic merging, in the worst case.
Then
$C(n) = 2\,C(\frac{n}{2}) + \log(n) = \sum_{i=0}^{k-1} 2^i \log(\frac{n}{2^i})$,
with $C(1) = 0$, $C(2) = 1$ and $n = 2^k$. This amounts
to
$C(n) = 2n - \log n - 2$.
The only question that remains is how to achieve the
data rearrangement, that is, the swapping of the subsequences $a^1$ and $a^3$ or $a^2$ and $a^4$, respectively, without
sacrificing the worst-case performance of O(n). This
can be done by storing the keys in a perfectly balanced
tree (assuming $n = 2^k$), the so-called bitonic tree. (The
tree can, of course, store only $2^k - 1$ keys, so the n-th
key is simply stored separately.) This tree is very similar
to a search tree, which stores a monotonically increasing sequence: when traversed in-order, the bitonic tree
produces a sequence that lists the keys such that there
are exactly two inflection points (when regarded as a
circular list).
Instead of actually copying elements of the sequence
in order to achieve the exchange of subsequences, the
adaptive bitonic merging algorithm swaps O(log n)
pointers in the bitonic tree. The recursion then works on
the two subtrees. With this technique, the overall number of operations of adaptive bitonic merging is O(n).
Details can be found in [].

Clearly, the adaptive bitonic sorting algorithm needs
O(n log n) operations in total, because it consists of
log(n) many complete merge stages (see Fig. ).
It should also be fairly obvious that the adaptive
bitonic sorter performs an (adaptive) subset of the
comparisons that are executed by the (nonadaptive)
bitonic sorter.
The Parallel Algorithm
So far, the discussion assumed a sequential implementation. Obviously, the algorithm for adaptive bitonic
merging can be implemented on a parallel architecture,
just like the bitonic merger, by executing recursive calls
on the same level in parallel.
Unfortunately, a naïve implementation would
require $O(\log^2 n)$ steps in the worst case, since there
are log(n) levels, each of which needs up to log(n) sequential comparisons. The bitonic merger achieves O(log n)
parallel time, because all pairwise comparisons within
one stage can be performed in parallel. But this is not
straightforward to achieve for the log(n) comparisons
of the binary-search method in adaptive bitonic merging, which are inherently sequential.
However, a careful analysis of the data dependencies
between comparisons of successive stages reveals that
the execution of different stages can be partially overlapped []. While La and Ua are being constructed in one stage
by moving down the tree in parallel layer by layer (occasionally swapping pointers), this process can be started
for the next stage, which begins one layer beneath the
one where the previous stage began, before the first stage
has finished, provided the first stage has progressed “far
enough” in the tree. Here, “far enough” means exactly
two layers ahead.
This leads to a parallel version of the adaptive bitonic
merge algorithm that executes in time $O(\frac{n}{p})$ for $p \in
O(\frac{n}{\log n})$, that is, it can be executed in $O(\log n)$ parallel
time.
Furthermore, the data that needs to be communicated between processors (either via memory, or via
communication channels) is in O(p).
It is straightforward to apply the classical sorting-by-merging approach here (see Fig. ), which yields the
adaptive bitonic sorting algorithm. This can be implemented on an EREW machine with p processors in
$O(\frac{n \log n}{p})$ time, for $p \in O(\frac{n}{\log n})$.
A GPU Implementation
Because adaptive bitonic sorting has excellent scalability (the number of processors, p, can go up to n/ log(n))
and the amount of inter-process communication is
fairly low (only O(p)), it is well suited for implementation on stream processing architectures. In addition, although it was designed for a random-access
architecture, adaptive bitonic sorting can be adapted
to a stream processor, which (in general) does not
have the ability of random-access writes. Finally, it can
be implemented on a GPU such that there are only
$O(\log^2 n)$ passes (by utilizing O(n/ log(n)) (conceptual) processors), which is very important, since the
number of passes is one of the main limiting factors on
GPUs.
This section provides more details on the implementation on a GPU, called “GPU-ABiSort” [, ].
For the sake of simplicity, the following always assumes
increasing sorting direction, and it is thus not explicitly
specified. As noted above, the sorting direction must
be reversed in the right branch of the recursion in the
bitonic sorter, which basically amounts to reversing the
comparison direction of the values of the keys, that is,
compare for < instead of > in the construction of La and Ua.
As noted above, the bitonic tree stores the sequence
$(a_0, \ldots, a_{n-2})$ in in-order, and the key $a_{n-1}$ is stored in
the extra node. An algorithm that
constructs (La, Ua) from a can traverse this bitonic tree
and swap pointers as necessary. The index q of the first theorem above (the border between La and Ua) is only determined implicitly. The two cases of the partition theorem above, that is, the two alternative forms of (La, Ua), can be distinguished simply by comparing elements $a_{n/2-1}$ and $a_{n-1}$.
This leads to Algorithm 1. Note that the root of
the bitonic tree stores element $a_{n/2-1}$ and the extra
node stores $a_{n-1}$. Applying this recursively yields Algorithm 2. Note that the bitonic tree needs to be constructed only once at the beginning during setup time.

Algorithm 1: Adaptive construction of La and Ua
(one stage of adaptive bitonic merging)
input : Bitonic tree, with root node r and extra
        node e, representing bitonic sequence a
output: La in the left subtree of r plus root r, and Ua
        in the right subtree of r plus extra node e
// phase 0: determine case
if value(r) < value(e) then
    case = 1
else
    case = 2
    swap value(r) and value(e)
( p, q ) = ( left(r) , right(r) )
for i = 1, . . . , log n − 1 do
    // phase i
    test = ( value(p) > value(q) )
    if test == true then
        swap values of p and q
        if case == 1 then
            swap the pointers left(p) and left(q)
        else
            swap the pointers right(p) and right(q)
    if ( case == 1 and test == false ) or ( case == 2 and test == true ) then
        ( p, q ) = ( left(p) , left(q) )
    else
        ( p, q ) = ( right(p) , right(q) )

Algorithm 2: Merging a bitonic sequence to obtain a
sorted sequence
input : Bitonic tree, with root node r and extra node
        e, representing bitonic sequence a
output: Sorted tree (produces sort(a) when
        traversed in-order)
construct La and Ua in the bitonic tree by Algorithm 1
call merging recursively with left(r) as root and r
    as extra node
call merging recursively with right(r) as root and
    e as extra node

Because branches are very costly on GPUs, one
should avoid as many conditionals in the inner loops
as possible. Here, one can exploit the fact that $R_{n/2}\,a =
(a_{n/2}, \ldots, a_{n-1}, a_0, \ldots, a_{n/2-1})$ is bitonic, provided a is
bitonic too. This operation basically amounts to swapping the two pointers left(root) and right(root). The
simplified construction of La and Ua is presented in
Algorithm 3. (Obviously, the simplified algorithm now
really needs trees with pointers, whereas Bilardi’s original bitonic tree could be implemented pointer-less
(since it is a complete tree). However, in a real-world
implementation, the keys to be sorted must carry pointers to some “payload” data anyway, so the additional
memory overhead incurred by the child pointers is at
most a factor ..)

Algorithm 3: Simplified adaptive construction of La
and Ua
input : Bitonic tree, with root node r and extra node
        e, representing bitonic sequence a
output: La in the left subtree of r plus root r, and Ua
        in the right subtree of r plus extra node e
// phase 0
if value(r) > value(e) then
    swap value(r) and value(e)
    swap pointers left(r) and right(r)
( p, q ) = ( left(r) , right(r) )
for i = 1, . . . , log n − 1 do
    // phase i
    if value(p) > value(q) then
        swap value(p) and value(q)
        swap pointers left(p) and left(q)
        ( p, q ) = ( right(p) , right(q) )
    else
        ( p, q ) = ( left(p) , left(q) )
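To make the pointer manipulation concrete, the simplified construction of La and Ua and the recursive merge can be sketched sequentially in Python. This is a minimal sketch under stated assumptions, not the GPU implementation: the class `Node` and the functions `build_bitonic_tree`, `construct_lower_upper`, `adaptive_bitonic_merge`, and `merge_bitonic` are hypothetical names, the input is assumed to be a bitonic sequence of distinct keys whose length is a power of two, and the loop over the phases is written as a descent until the leaves are passed.

```python
class Node:
    """One node of the bitonic tree (a key plus two child pointers)."""
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def build_bitonic_tree(seq):
    """Store seq[0 .. n-2] in in-order; seq[n-1] goes into the extra node."""
    n = len(seq)
    assert n >= 2 and n & (n - 1) == 0, "length must be a power of two"

    def build(lo, hi):                # half-open index range [lo, hi)
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        node = Node(seq[mid])
        node.left = build(lo, mid)
        node.right = build(mid + 1, hi)
        return node

    return build(0, n - 1), Node(seq[n - 1])


def construct_lower_upper(r, e):
    """Simplified construction of La and Ua (one stage of the merge)."""
    # phase 0: if necessary, continue with the sequence rotated by n/2
    if r.value > e.value:
        r.value, e.value = e.value, r.value
        r.left, r.right = r.right, r.left
    p, q = r.left, r.right
    while p is not None:              # one iteration per remaining phase
        if p.value > q.value:
            p.value, q.value = q.value, p.value
            p.left, q.left = q.left, p.left
            p, q = p.right, q.right
        else:
            p, q = p.left, q.left


def adaptive_bitonic_merge(r, e):
    """Sort the bitonic sequence stored in the tree (r, e) in place."""
    if r is None:
        return
    construct_lower_upper(r, e)
    adaptive_bitonic_merge(r.left, r)     # La: left subtree plus root
    adaptive_bitonic_merge(r.right, e)    # Ua: right subtree plus extra node


def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)


def merge_bitonic(seq):
    root, extra = build_bitonic_tree(seq)
    adaptive_bitonic_merge(root, extra)
    return in_order(root) + [extra.value]
```

For instance, `merge_bitonic([1, 4, 6, 7, 8, 5, 3, 2])` yields the keys in increasing order. Each call of `construct_lower_upper` touches only one root-to-leaf path of its subtree, which is where the O(log n) cost per stage comes from.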
Outline of the Implementation
As explained above, on each recursion level $j =
1, \ldots, \log(n)$ of the adaptive bitonic sorting algorithm,
$2^{\log n - j + 1}$ bitonic trees, each consisting of $2^{j-1}$ nodes,
have to be merged into $2^{\log n - j}$ bitonic trees of $2^j$ nodes.
The merge is performed in j stages. In each stage $k =
0, \ldots, j - 1$, the construction of La and Ua is executed
on $2^k$ subtrees of each tree. Therefore, $2^{\log n - j} \cdot 2^k$ instances of the La
/ Ua construction algorithm can be executed in parallel during that stage. On a stream architecture, this
potential parallelism can be exposed by allocating a
stream consisting of $2^{\log n - j + k}$ elements and executing
a so-called kernel on each element.

The La / Ua construction algorithm consists of j − k
phases, where each phase reads and modifies a pair
of nodes, (p, q), of a bitonic tree. Assume that a kernel implementation performs the operation of a single
phase of this algorithm. (How such a kernel implementation is realized without random-access writes will be
described below.) The temporary data that have to be
preserved from one phase of the algorithm to the next
one are just two node pointers (p and q) per kernel
instance. Thus, each of the $2^{\log n - j + k}$ elements of the allocated stream consists of exactly these two node pointers.
When the kernel is invoked on that stream, each kernel
instance reads a pair of node pointers, (p, q), from the
stream, performs one phase of the La/Ua construction
algorithm, and finally writes the updated pair of node
pointers (p, q) back to the stream.
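The stream sizes and phase counts described above can be tabulated with a few lines of Python. This is only a bookkeeping sketch: the generator name `stream_schedule` is hypothetical, and it assumes that level j consists of stages k = 0, ..., j-1, that stage k uses a stream of 2^(log n - j + k) elements, and that each instance runs j - k phases.

```python
def stream_schedule(log_n):
    """Yield (j, k, instances, phases) for every stage of every merge level."""
    for j in range(1, log_n + 1):          # recursion level of the sort
        for k in range(j):                 # stage within that level
            instances = 2 ** (log_n - j + k)   # stream length of this stage
            phases = j - k                     # kernel invocations per instance
            yield j, k, instances, phases
```

For log n = 3 this produces the tuples (1, 0, 4, 1), (2, 0, 2, 2), (2, 1, 4, 1), (3, 0, 1, 3), (3, 1, 2, 2), and (3, 2, 4, 1). Summing instances times phases over the stages of one level gives the number of comparisons of all merges on that level.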
Eliminating Random-Access Writes
Since GPUs do not support random-access writes (at
least, for almost all practical purposes, random-access
writes would kill any performance gained by the parallelism), the kernel has to be implemented so that it modifies
node pairs (p, q) of the bitonic tree without random-access writes. This means that it can output node pairs
from the kernel only via linear stream write. But this
way it cannot write a modified node pair to its original
location from where it was read. In addition, it cannot simply take an input stream (containing a bitonic
tree) and produce another output stream (containing
the modified bitonic tree), because then it would have to
process the nodes in the same order as they are stored in
memory, but the adaptive bitonic merge processes them
in a random, data-dependent order.
Fortunately, the bitonic tree is a linked data structure
where all nodes are directly or indirectly linked to the
root (except for the extra node). This allows us to change
the location of nodes in memory during the merge algorithm as long as the child pointers of their respective
parent nodes are updated (and the root and extra node
of the bitonic tree are kept at well-defined memory locations). This means that for each node that is modified its
parent node has to be modified also, in order to update
its child pointers.
Notice that Algorithm  basically traverses the
bitonic tree down along a path, changing some of the
nodes as necessary. The strategy is simple: output every node visited along this path to a stream. Since
the data layout is fixed and predetermined, the kernel can store the index of the children with the node
as it is being written to the output stream. One child
address remains the same anyway, while the other is
determined when the kernel is still executing for the
current node. Figure  demonstrates the operation of
the stream program using the described stream output
technique.
Complexity
A simple implementation on the GPU would need
$O(\log^3 n)$ phases (or “passes” in GPU parlance) in
total for adaptive bitonic sorting (log n merge levels, each with up to log n stages of up to log n phases), which amounts to
$O(\log^3 n)$ stream operations in total.
This is already very fast in practice. However, the
optimal complexity of $O(\log^2 n)$ passes can be achieved
exactly as described in the original work [], that is,
phase i of a stage k can be executed immediately after
phase i + 1 of stage k − 1 has finished. Therefore, the execution of a new stage can start at every other step of the
algorithm.
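The difference between the two schedules can be illustrated by counting passes. The sketch below is a simplified model, not a description of the actual implementation: it assumes that the merge levels j = 1, ..., log n are processed one after another, that stage k of level j has j - k phases, that in the overlapped variant stage k starts two steps after stage k - 1, and it ignores the local presorting and the special treatment of small subsequences. The function names are hypothetical.

```python
def passes_simple(log_n):
    """Every phase of every stage runs strictly one after another."""
    return sum(j - k for j in range(1, log_n + 1) for k in range(j))


def passes_overlapped(log_n):
    """Stage k of a level may start two steps after stage k-1."""
    total = 0
    for j in range(1, log_n + 1):
        # Stage k starts at step 2*k and runs for j-k steps; the level
        # is finished when the last of its stages has finished.
        total += max(2 * k + (j - k) for k in range(j))
    return total
```

Under these assumptions, sorting n = 2^20 keys takes 1540 passes with the simple schedule and 400 passes with the overlapped one, in line with the cubic versus quadratic growth in log n.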
The only difference from the simple implementation
is that kernels now must write to parts of the output
stream, because other parts are still in use.
GPU-Specific Details
For the input and output streams, it is best to apply
the ping-pong technique commonly used in GPU programming: allocate two such streams and alternatingly
use one of them as input and the other one as output
stream.
Preconditioning the Input
For merge-based sorting on a PRAM architecture (and
assuming p < n), it is a common technique to sort
locally, in a first step, p blocks of n/p values, that is, each
processor sorts n/p values using a standard sequential
algorithm.
The same technique can be applied here by implementing such a local sort as a kernel program. However,
since there is no random write access to non-temporary
memory from a kernel, the number of values that can be
sorted locally by a kernel is restricted by the number of
temporary registers.
[Figure: Bitonic Sorting, Adaptive. Fig.  To execute several instances of the adaptive La/Ua construction algorithm in parallel,
where each instance operates on a bitonic tree of $2^3$ nodes, three phases are required. This figure illustrates the operation
of these three phases (phase 0, phase 1, and phase 2 kernel). On the left, the node pointers contained in the input stream are shown as well as the comparisons
performed by the kernel program. On the right, the node pointers written to the output stream are shown as well as the
modifications of the child pointers and node values performed by the kernel program according to
Algorithm .]
On recent GPUs, the maximum output data size of
a kernel is  ×  bytes. Since usually the input consists
of key/pointer pairs, the method starts with a local sort
of -key/pointer pairs per kernel. For such small numbers of keys, an algorithm with asymptotic complexity
of O(n) performs faster than asymptotically optimal
algorithms.
After the local sort, a further stream operation
converts the resulting sorted subsequences of length
 pairwise to bitonic trees, each containing  nodes.
Thereafter, the GPU-ABiSort approach can be applied
as described above, starting with j = .
The Last Stage of Each Merge
Adaptive bitonic merging, being a recursive procedure,
eventually merges small subsequences, for instance of
length . For such small subsequences it is better to use
a (nonadaptive) bitonic merge implementation that can
be executed in a single pass of the whole stream.
Timings
The following experiments were done on arrays consisting of key/pointer pairs, where the key is a uniformly
distributed random -bit floating point value and the
pointer a -byte address. Since one can assume (without
loss of generality) that all pointers in the given array are
unique, these can be used as secondary sort keys for the
adaptive bitonic merge.

n            CPU sort      GPUSort    GPU-ABiSort
32,768       9–11 ms       4 ms       5 ms
65,536       19–24 ms      8 ms       8 ms
131,072      46–52 ms      18 ms      16 ms
262,144      98–109 ms     38 ms      31 ms
524,288      203–226 ms    80 ms      65 ms
1,048,576    418–477 ms    173 ms     135 ms

[Figure: Bitonic Sorting, Adaptive. Fig.  Timings on a GeForce  system. (There are two curves for the CPU sort, so as to
visualize that its running time is somewhat data-dependent.) The plot shows the time (in ms) over the sequence length (32,768 to 1,048,576) for CPU sort, GPUSort, and GPU-ABiSort.]
The experiments described in the following compare the implementation of GPU-ABiSort of [, ] with
sorting on the CPU using the C++ STL sort function (an
optimized quicksort implementation) as well as with the
(nonadaptive) bitonic sorting network implementation
on the GPU by Govindaraju et al., called GPUSort [].
Contrary to the CPU STL sort, the timings of GPU-ABiSort do not depend very much on the data to be
sorted, because the total number of comparisons performed by the adaptive bitonic sorting is not data-dependent.
Figure  shows the results of timings performed on
a PCI Express bus PC system with an AMD Athlon + CPU and an NVIDIA GeForce  GTX
GPU with  MB memory. According to these timings, the speedup
of GPU-ABiSort compared to CPU sorting is about 3.1–3.5
for $n \ge 2^{18}$. Furthermore, up to the maximum tested
sequence length $n = 2^{20}$ (= 1,048,576), GPU-ABiSort is
up to 1.3 times faster than GPUSort, and this speedup is
increasing with the sequence length n, as expected.
The timings of the GPU approaches assume that the
input data is already stored in GPU memory. When
embedding the GPU-based sorting into an otherwise
purely CPU-based application, the input data has to be
transferred from CPU to GPU memory, and afterwards
the output data has to be transferred back to CPU memory. However, the overhead of this transfer is usually
negligible compared to the achieved sorting speedup:
according to measurements by [], the transfer of one
million key/pointer pairs from CPU to GPU and back
takes in total roughly  ms on a PCI Express bus PC.
Conclusion
Adaptive bitonic sorting is not only appealing from a
theoretical point of view, but also from a practical one.
Unlike other parallel sorting algorithms that exhibit
optimal asymptotic complexity too, adaptive bitonic
sorting offers low hidden constants in its asymptotic
complexity and can be implemented on parallel architectures by a reasonably experienced programmer. The
practical implementation of it on a GPU outperforms
the implementation of simple bitonic sorting on the
same GPU by a factor of up to 1.3, and it is roughly a factor of 3 faster than
a standard CPU sorting implementation (STL).
Related Entries
AKS Network
Bitonic Sort
Non-Blocking Algorithms
Metrics
Bibliographic Notes and Further
Reading
As mentioned in the introduction, this line of research
began with the seminal work of Batcher [] in the
late s, who described parallel sorting as a network.
Research of parallel sorting algorithms was reinvigorated in the s, where a number of theoretical questions have been settled [, , , , , ].
Another wave of research on parallel sorting ensued
from the advent of affordable, massively parallel architectures, namely, GPUs, which are, more precisely,
streaming architectures. This spurred the development
of a number of practical implementations [, –, ,
, ].
Bibliography
1. Ajtai M, Komlós J, Szemerédi E () An O(n log n) sorting network. In: Proceedings of the fifteenth annual ACM symposium on theory of computing (STOC ’), New York, NY, pp –
2. Akl SG () Parallel sorting algorithms. Academic, Orlando, FL
3. Azar Y, Vishkin U () Tight comparison bounds on the complexity of parallel sorting. SIAM J Comput ():–
4. Batcher KE () Sorting networks and their applications. In: Proceedings of the  Spring joint computer conference (SJCC), Atlantic City, NJ, vol , pp –
5. Bilardi G, Nicolau A () Adaptive bitonic sorting: An optimal parallel algorithm for shared-memory machines. SIAM J Comput ():–
6. Cole R () Parallel merge sort. SIAM J Comput ():–. See correction in SIAM J Comput , 
7. Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduction to algorithms, 3rd edn. MIT Press, Cambridge, MA
8. Gibbons A, Rytter W () Efficient parallel algorithms. Cambridge University Press, Cambridge, England
9. Govindaraju NK, Gray J, Kumar R, Manocha D () GPUTeraSort: high performance graphics coprocessor sorting for large database management. Technical Report MSR-TR--, Microsoft Research (MSR), December . In: Proceedings of ACM SIGMOD conference, Chicago, IL
10. Govindaraju NK, Raghuvanshi N, Henson M, Manocha D () A cache-efficient sorting algorithm for database and data mining computations using graphics processors. Technical report, University of North Carolina, Chapel Hill
11. Greß A, Zachmann G () GPU-ABiSort: optimal parallel sorting on stream architectures. In: Proceedings of the th IEEE international parallel and distributed processing symposium (IPDPS), Rhodes Island, Greece, p 
12. Greß A, Zachmann G () GPU-ABiSort: optimal parallel sorting on stream architectures. Technical Report IfI--, TU Clausthal, Computer Science Department, Clausthal-Zellerfeld, Germany
13. Kipfer P, Westermann R () Improved GPU sorting. In: Pharr M (ed) GPU Gems 2: programming techniques for high-performance graphics and general-purpose computation. Addison-Wesley, Reading, MA, pp –
14. Leighton T () Tight bounds on the complexity of parallel sorting. In: STOC ’: Proceedings of the sixteenth annual ACM symposium on theory of computing, ACM, New York, NY, USA, pp –
15. Natvig L () Logarithmic time cost optimal parallel sorting is not yet fast in practice! In: Proceedings supercomputing ’, New York, NY, pp –
16. Purcell TJ, Donner C, Cammarano M, Jensen HW, Hanrahan P () Photon mapping on programmable graphics hardware. In: Proceedings of the  annual ACM SIGGRAPH/eurographics conference on graphics hardware (EGGH ’), ACM, New York, pp –
17. Satish N, Harris M, Garland M () Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of the  IEEE international symposium on parallel and distributed processing (IPDPS), IEEE Computer Society, Washington, DC, USA, pp –
18. Schnorr CP, Shamir A () An optimal sorting algorithm for mesh connected computers. In: Proceedings of the eighteenth annual ACM symposium on theory of computing (STOC), ACM, New York, NY, USA, pp –
19. Sintorn E, Assarsson U () Fast parallel GPU-sorting using a hybrid algorithm. J Parallel Distrib Comput ():–
BLAS (Basic Linear Algebra
Subprograms)
Robert van de Geijn, Kazushige Goto
The University of Texas at Austin, Austin, TX, USA
Definition
The Basic Linear Algebra Subprograms (BLAS) are an
interface to commonly used fundamental linear algebra
operations.
Discussion
Introduction
The BLAS interface supports portable high-performance
implementation of applications that are matrix and
vector computation intensive. The library or application developer focuses on casting computation in terms
of the operations supported by the BLAS, leaving the
architecture-specific optimization of that software layer
to an expert.

A Motivating Example
The use of the BLAS interface will be illustrated by considering the Cholesky factorization of an n×n matrix A.
When A is Symmetric Positive Definite (a property that
guarantees that the algorithm completes), its Cholesky
factorization is given by the lower triangular matrix L
such that $A = LL^T$.
An algorithm for this operation can be derived as
follows: Partition
$$A \rightarrow \begin{pmatrix} \alpha_{11} & \star \\ a_{21} & A_{22} \end{pmatrix} \quad \text{and} \quad L \rightarrow \begin{pmatrix} \lambda_{11} & 0 \\ l_{21} & L_{22} \end{pmatrix},$$
where $\alpha_{11}$ and $\lambda_{11}$ are scalars, $a_{21}$ and $l_{21}$ are vectors, $A_{22}$
is symmetric, $L_{22}$ is lower triangular, and the ⋆ indicates
the symmetric part of A that is not used. Then
$$\begin{pmatrix} \alpha_{11} & \star \\ a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} \lambda_{11} & 0 \\ l_{21} & L_{22} \end{pmatrix} \begin{pmatrix} \lambda_{11} & 0 \\ l_{21} & L_{22} \end{pmatrix}^T = \begin{pmatrix} \lambda_{11}^2 & \star \\ \lambda_{11} l_{21} & l_{21} l_{21}^T + L_{22} L_{22}^T \end{pmatrix}.$$
This yields the following algorithm for overwriting A
with L:
● $\alpha_{11} \leftarrow \sqrt{\alpha_{11}}$.
● $a_{21} \leftarrow a_{21}/\alpha_{11}$.
● $A_{22} \leftarrow -a_{21} a_{21}^T + A_{22}$, updating only the lower triangular part of $A_{22}$. (This is called a symmetric rank-1
update.)
● Continue by overwriting $A_{22}$ with $L_{22}$ where $A_{22} =
L_{22} L_{22}^T$.
A simple code in Fortran is given by
   do j=1, n
      A( j,j ) = sqrt( A( j,j ) )
      do i=j+1,n
         A( i,j ) = A( i,j ) / A( j,j )
      enddo
      do k=j+1,n
         do i=k,n
            A( i,k ) = A( i,k ) - A( i,j ) * A( k,j )
         enddo
      enddo
   enddo
Vector–Vector Operations (Level-1 BLAS)
The first BLAS interface was proposed in the 1970s when
vector supercomputers were widely used for computational science. Such computers could achieve near-peak
performance as long as the bulk of computation was cast
in terms of vector operations and memory was accessed
mostly contiguously. This interface is now referred to as
the Level-1 BLAS.
Let x and y be vectors of appropriate length and α
be a scalar. Commonly encountered vector operations are
multiplication of a vector by a scalar ($x \leftarrow \alpha x$), inner
(dot) product ($\alpha \leftarrow x^T y$), and scaled vector addition
($y \leftarrow \alpha x + y$). This last operation is known as an axpy:
alpha times x plus y.
The Cholesky factorization, coded in terms of such
operations, is given by
   do j=1, n
      A( j,j ) = sqrt( A( j,j ) )
      call dscal( n-j, 1.0d00/A( j,j ), A( j+1, j ), 1 )
      do k=j+1,n
         call daxpy( n-k+1, -A( k,j ), A( k,j ), 1, A( k, k ), 1 )
      enddo
   enddo
Here
● The first letter in dscal and daxpy indicates that
the computation is with double precision numbers.
● The call to dscal performs the computation $a_{21} \leftarrow
a_{21}/\alpha_{11}$.
● The loop
   do i=k,n
      A( i,k ) = A( i,k ) - A( i,j ) * A( k,j )
   enddo
is replaced by the call
   call daxpy( n-k+1, -A( k,j ), A( k,j ), 1, A( k, k ), 1 )
If the operations supported by dscal and daxpy
achieve high performance on a target architecture then
so will the implementation of the Cholesky factorization, since it casts most computation in terms of those
operations.
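The equivalence of the loop-based code and the version cast in terms of vector operations can be tried out in Python. The following is a sketch under stated assumptions and not an interface to a BLAS library: the matrix is a list of row lists, only the lower triangle is overwritten, and the helpers `scal_col` and `axpy_col` are hypothetical stand-ins that merely play the roles of dscal and daxpy on matrix columns.

```python
import math


def cholesky_loops(a):
    """Overwrite the lower triangle of the n x n matrix a with L (plain loops)."""
    n = len(a)
    for j in range(n):
        a[j][j] = math.sqrt(a[j][j])
        for i in range(j + 1, n):
            a[i][j] /= a[j][j]
        for k in range(j + 1, n):
            for i in range(k, n):
                a[i][k] -= a[i][j] * a[k][j]
    return a


def scal_col(a, col, first, alpha):
    """x <- alpha * x for x = a[first:, col] (role of dscal)."""
    for i in range(first, len(a)):
        a[i][col] *= alpha


def axpy_col(a, alpha, src, dst, first):
    """y <- alpha * x + y for x = a[first:, src], y = a[first:, dst] (role of daxpy)."""
    for i in range(first, len(a)):
        a[i][dst] += alpha * a[i][src]


def cholesky_level1(a):
    """Same factorization, expressed through the two vector operations."""
    n = len(a)
    for j in range(n):
        a[j][j] = math.sqrt(a[j][j])
        scal_col(a, j, j + 1, 1.0 / a[j][j])
        for k in range(j + 1, n):
            axpy_col(a, -a[k][j], j, k, k)
    return a
```

Both functions leave the strictly upper triangle of the input untouched, which mirrors the ⋆ entries in the derivation: that part of A is never referenced.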
A representative calling sequence for a Level-1 BLAS
routine is given by
   _axpy( n, alpha, x, incx, y, incy )
which implements the operation $y = \alpha x + y$. Here
● The “_” indicates the data type. The choices for this
first letter are
   s   single precision
   d   double precision
   c   single precision complex
   z   double precision complex
● The operation is identified as axpy: alpha times x
plus y.
● n indicates the number of elements in the vectors x
and y.
● alpha is the scalar α.
● x and y indicate the memory locations where the
first elements of x and y are stored, respectively.
● incx and incy equal the increment by which one
has to stride through memory to locate the elements
of vectors x and y, respectively.
The following are the most frequently used Level-1
BLAS:
   Routine/Function   Operation
   _swap              $x \leftrightarrow y$
   _scal              $x \leftarrow \alpha x$
   _copy              $y \leftarrow x$
   _axpy              $y \leftarrow \alpha x + y$
   _dot               $x^T y$
   _nrm2              $\|x\|_2$
   _asum              $\|\mathrm{re}(x)\|_1 + \|\mathrm{im}(x)\|_1$
   i_amax             $\min(k) : |\mathrm{re}(x_k)| + |\mathrm{im}(x_k)| = \max_i(|\mathrm{re}(x_i)| + |\mathrm{im}(x_i)|)$
Matrix–Vector Operations (Level-2 BLAS)
The next level of BLAS supports operations with matrices and vectors. The simplest example of such an operation is the matrix–vector product: $y \leftarrow Ax$ where x
and y are vectors and A is a matrix. Another example
is the computation $A_{22} = -a_{21} a_{21}^T + A_{22}$ (symmetric
rank-1 update) in the Cholesky factorization. This
operation can be recoded as
   do j=1, n
      A( j,j ) = sqrt( A( j,j ) )
      call dscal( n-j, 1.0d00 / A( j,j ), A( j+1, j ), 1 )
      call dsyr( 'Lower triangular', n-j, -1.0d00, &
                 A( j+1,j ), 1, A( j+1,j+1 ), lda )
   enddo
Here, dsyr is the routine that implements a double
precision symmetric rank-1 update. Readability of the
code is improved by casting computation in terms of
routines that implement the operations that appear in
the algorithm: dscal for $a_{21} = a_{21}/\alpha_{11}$ and dsyr for
$A_{22} = -a_{21} a_{21}^T + A_{22}$.
The naming convention for Level-2 BLAS routines
is given by
   _XXYY,
where
● “_” can take on the values s, d, c, z.
● XX indicates the shape of the matrix:
   XX   matrix shape
   ge   general (rectangular)
   sy   symmetric
   he   Hermitian
   tr   triangular
In addition, operations with banded matrices are
supported, which we do not discuss here.
● YY indicates the operation to be performed:
   YY   operation
   mv   matrix vector multiplication
   sv   solve vector
   r    rank-1 update
   r2   rank-2 update
A representative call to a Level-2 BLAS operation is
given by
   dsyr( uplo, n, alpha, x, incx, A, lda )
which implements the operation $A = \alpha x x^T + A$, updating the lower or upper triangular part of A by choosing uplo as 'Lower triangular' or 'Upper
triangular', respectively. The parameter lda (the
leading dimension of matrix A) indicates the increment
by which memory has to be traversed in order to address
successive elements in a row of matrix A.
The following table gives the most commonly used
Level-2 BLAS operations:
   Routine/Function   Operation
   _gemv              general matrix-vector multiplication
   _symv              symmetric matrix-vector multiplication
   _trmv              triangular matrix-vector multiplication
   _trsv              triangular solve vector
   _ger               general rank-1 update
   _syr               symmetric rank-1 update
   _syr2              symmetric rank-2 update
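The roles of the increments and of the leading dimension can be illustrated with a small Python model of column-major storage. This is a sketch, not the BLAS interface itself: the functions `axpy`, `syr_lower`, and `cholesky_level2` are hypothetical, the extra offset parameters `x0`, `y0`, and `a0` stand in for passing the address of a first element as in Fortran, only positive increments are handled, and entry (i, j) of a matrix (0-based) is assumed to live at position i + j*lda of a flat list.

```python
import math


def axpy(n, alpha, x, x0, incx, y, y0, incy):
    """y <- alpha*x + y; element i of x is x[x0 + i*incx], likewise for y."""
    for i in range(n):
        y[y0 + i * incy] += alpha * x[x0 + i * incx]


def syr_lower(n, alpha, x, x0, incx, a, a0, lda):
    """A <- alpha*x*x^T + A, lower triangle only, A stored column-major."""
    for j in range(n):
        xj = x[x0 + j * incx]
        for i in range(j, n):
            a[a0 + i + j * lda] += alpha * x[x0 + i * incx] * xj


def cholesky_level2(a, n, lda):
    """Cholesky factorization of a column-major matrix via scaling and syr."""
    for j in range(n):
        d = j + j * lda                       # position of A(j, j)
        a[d] = math.sqrt(a[d])
        for i in range(j + 1, n):             # role of dscal on column j
            a[i + j * lda] /= a[d]
        # rank-1 update of the trailing block that starts at A(j+1, j+1)
        syr_lower(n - j - 1, -1.0, a, d + 1, 1, a, d + 1 + lda, lda)
    return a
```

With this layout, an increment of 1 walks down a column, whereas an increment of lda walks along a row; for example, `axpy(3, 2.0, a, 0, 3, a, 1, 3)` adds twice the first row of a 3 × 3 matrix (lda = 3) to its second row. Choosing lda larger than n models a matrix embedded in a larger array.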
There are also interfaces for operations with banded
matrices stored in packed format as well as for operations with Hermitian matrices.
Matrix–Matrix Operations (Level-3 BLAS)
The problem with vector operations and matrix–vector
operations is that they perform $O(n)$ computations
with $O(n)$ data and $O(n^2)$ computations with $O(n^2)$
data, respectively. This makes it hard, if not impossible,
to leverage cache memory now that processing speeds
greatly outperform memory speeds, unless the problem
size is relatively small (fits in cache memory).
The solution is to cast computation in terms of
matrix–matrix operations like matrix–matrix multiplication. Consider again Cholesky factorization. Partition
$$
A \rightarrow \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
\quad \text{and} \quad
L \rightarrow \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix},
$$
where $A_{11}$ and $L_{11}$ are nb × nb submatrices. Then
$$
\begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}^T
=
\begin{pmatrix} L_{11} L_{11}^T & \star \\ L_{21} L_{11}^T & L_{21} L_{21}^T + L_{22} L_{22}^T \end{pmatrix}.
$$
This yields the algorithm
● $A_{11} \leftarrow L_{11}$ where $A_{11} = L_{11} L_{11}^T$ (Cholesky factorization
of a smaller matrix).
● $A_{21} \leftarrow L_{21}$ where $L_{21} L_{11}^T = A_{21}$ (triangular solve with
multiple right-hand sides).
● $A_{22} \leftarrow -L_{21} L_{21}^T + A_{22}$, updating only the lower triangular part of $A_{22}$ (symmetric rank-k update).
● Continue by overwriting $A_{22}$ with $L_{22}$ where $A_{22} = L_{22} L_{22}^T$.
A representative code in Fortran is given by
      do j=1, n, nb
         jb = min( nb, n-j+1 )
         call chol( jb, A( j, j ), lda )
         call dtrsm( 'Right', 'Lower triangular',
                     'Transpose', 'Nonunit diag',
                     n-j-jb+1, jb,
                     1.0d00, A( j, j ), lda,
                     A( j+jb, j ), lda )
         call dsyrk( 'Lower triangular', 'No transpose',
                     n-j-jb+1, jb,
                     -1.0d00, A( j+jb, j ), lda,
                     1.0d00, A( j+jb, j+jb ), lda )
      enddo
Here subroutine chol performs a Cholesky factorization; dtrsm and dsyrk are Level-3 BLAS routines:
● The call to dtrsm implements $A_{21} \leftarrow L_{21}$ where
$L_{21} L_{11}^T = A_{21}$.
● The call to dsyrk implements $A_{22} \leftarrow -L_{21} L_{21}^T + A_{22}$.
The bulk of the computation is now cast in terms
of matrix–matrix operations which can achieve high
performance.
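The three steps of the blocked algorithm can be sketched in plain Python to make the data flow explicit. This is an illustrative sketch only: the helper names `chol_block`, `trsm_block`, and `syrk_block` are hypothetical stand-ins for chol, dtrsm, and dsyrk, indices are zero-based, the matrix is a list of row lists rather than a column-major array, and no attempt is made at performance.

```python
import math


def chol_block(A, j, jb):
    """Unblocked Cholesky of the jb x jb diagonal block at (j, j), lower triangle in place."""
    for c in range(j, j + jb):
        A[c][c] = math.sqrt(A[c][c] - sum(A[c][k] ** 2 for k in range(j, c)))
        for r in range(c + 1, j + jb):
            t = A[r][c] - sum(A[r][k] * A[c][k] for k in range(j, c))
            A[r][c] = t / A[c][c]


def trsm_block(A, j, jb, n):
    """Overwrite the panel below the diagonal block with X, where X * L11^T equals the panel."""
    for r in range(j + jb, n):
        for c in range(j, j + jb):
            t = A[r][c] - sum(A[r][k] * A[c][k] for k in range(j, c))
            A[r][c] = t / A[c][c]


def syrk_block(A, j, jb, n):
    """Update the lower triangle of the trailing matrix: A22 <- A22 - L21 * L21^T."""
    for r in range(j + jb, n):
        for c in range(j + jb, r + 1):
            A[r][c] -= sum(A[r][k] * A[c][k] for k in range(j, j + jb))


def blocked_cholesky(A, nb):
    """Right-looking blocked Cholesky; only the lower triangle of A is read and written."""
    n = len(A)
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        chol_block(A, j, jb)
        trsm_block(A, j, jb, n)
        syrk_block(A, j, jb, n)
    return A


A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
blocked_cholesky(A, nb=2)
lower = [[A[i][k] if k <= i else 0.0 for k in range(3)] for i in range(3)]
```

In an actual implementation the two panel routines would be calls into an optimized Level-3 BLAS library, which is where the performance benefit comes from.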
The naming conventions for Level-3 BLAS routines
are similar to those for the Level-2 BLAS. A representative call to a Level-3 BLAS operation is given by
      dsyrk( uplo, trans, n, k, alpha, A,
             lda, beta, C, ldc )
which implements the operation $C \leftarrow \alpha A A^T + \beta C$
or $C \leftarrow \alpha A^T A + \beta C$ depending on whether trans
is chosen as 'No transpose' or 'Transpose',
respectively. It updates the lower or upper triangular
part of C depending on whether uplo equals 'Lower
triangular' or 'Upper triangular',
respectively. The parameters lda and ldc are the leading dimensions of arrays A and C, respectively.
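The semantics of the symmetric rank-k update can be illustrated with a small reference routine. This is a sketch under stated assumptions, not the library interface: the function `syrk` below is hypothetical, takes nested Python lists instead of arrays with leading dimensions, and accepts only the single letters 'L'/'U' and 'N'/'T'.

```python
def syrk(uplo, trans, alpha, A, beta, C):
    """Reference update C <- alpha*A*A^T + beta*C (trans 'N') or alpha*A^T*A + beta*C (trans 'T').

    Only the triangle of C named by uplo ('L' or 'U') is read and written.
    """
    n = len(C)
    if trans == "N":
        k = len(A[0])          # A is n x k

        def dot(i, j):
            return sum(A[i][p] * A[j][p] for p in range(k))
    else:
        k = len(A)             # A is k x n

        def dot(i, j):
            return sum(A[p][i] * A[p][j] for p in range(k))

    for i in range(n):
        cols = range(0, i + 1) if uplo == "L" else range(i, n)
        for j in cols:
            C[i][j] = alpha * dot(i, j) + beta * C[i][j]
    return C


A = [[1.0, 2.0],
     [3.0, 4.0]]
C = [[1.0, 9.0],
     [1.0, 1.0]]
syrk("L", "N", 1.0, A, 0.0, C)   # the entry C[0][1] in the upper triangle is left untouched
```

Touching only one triangle roughly halves the arithmetic compared with a general matrix-matrix multiplication, since the result is symmetric.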
The following table gives the most commonly
used Level-3 BLAS operations:
Routine/Function   Operation
_gemm              general matrix-matrix multiplication
_symm              symmetric matrix-matrix multiplication
_trmm              triangular matrix-matrix multiplication
_trsm              triangular solve with multiple right-hand sides
_syrk              symmetric rank-k update
_syr2k             symmetric rank-2k update
Impact on Performance
Figure 1 illustrates the performance benefits that come
from using the different levels of BLAS on a typical
architecture.
BLAS-Like Interfaces
CBLAS
A C interface for the BLAS, CBLAS, has also been
defined to simplify the use of the BLAS from C and C++.
The CBLAS support matrices stored in row and column
major format.
Sparse BLAS
Several efforts were made to define interfaces for BLAS-like operations with sparse matrices. These do not seem
to have caught on, possibly because the storage of sparse
matrices is much more complex.
Parallel BLAS
Parallelism with BLAS operations can be achieved in a
number of ways.
Multithreaded BLAS
On shared-memory architectures multithreaded BLAS
are often available. Such implementations achieve parallelism within each BLAS call without need for changing
code that is written in terms of the interface. Figure 2
shows the performance of the Cholesky factorization
codes when multithreaded BLAS are used on a multicore architecture.
PBLAS
As part of the ScaLAPACK project, an interface for
distributed memory parallel BLAS was proposed, the
PBLAS. The goal was to make this interface closely
resemble the traditional BLAS. A call to dsyrk
becomes
pdsyrk(uplo, trans, n, k, alpha, A,
iA, jA, descA, beta, C, iC, jC,
descC)
where the new parameters iA, jA, descA, etc., encapsulate information about the submatrix with which
to multiply and the distribution to a logical two-dimensional mesh of processing nodes.
Libflame
The libflame library that has resulted from the
FLAME project encompasses the functionality of the
BLAS as well as higher level linear algebra operations.
It uses an object-based interface so that a call to a BLAS
routine like _syrk becomes
      FLA_Syrk( uplo, trans, alpha, A,
                beta, C )
thus hiding many of the dimension and indexing details.
PLAPACK
The PLAPACK project provides an alternative to
ScaLAPACK. It also provides BLAS for distributed
memory architectures, but (like libflame) goes one
step further toward encapsulation. The call for parallel
symmetric rank-k update becomes
      PLA_Syrk( uplo, trans, alpha, A,
                beta, C )
where all information about the matrices, their distribution, and the storage of local submatrices is encapsulated in the parameters A and C.
BLAS (Basic Linear Algebra Subprograms). Fig. 1 Performance (GFlops versus n, one thread) of the different implementations of Cholesky
factorization that use different levels of BLAS; the curves are labeled Hand optimized, BLAS3, BLAS2, BLAS1, and triple loops. The target processor has a peak of . Gflops (billions of floating point
operations per second). BLAS1, BLAS2, and BLAS3 indicate that the bulk of computation was cast in terms of Level-1, -2, or
-3 BLAS, respectively
Available Implementations
Many of the software and hardware vendors market high-performance implementations of the BLAS.
Examples include IBM’s ESSL, Intel’s MKL, AMD’s
ACML, NEC’s MathKeisan, and HP’s MLIB libraries.
Widely used open source implementations include
ATLAS and the GotoBLAS. Comparisons of performance of some of these implementations are given in
Figs. 3 and 4.
Neither the details about the platform on which the performance data was gathered nor the versions of the
libraries that were used are given, because architectures
and libraries continuously change and therefore which
is faster or slower can easily change with the next release
of a processor or library.
Related Entries
ATLAS (Automatically Tuned Linear Algebra Software)
libflame
LAPACK
PLAPACK
ScaLAPACK
Bibliographic Notes and Further Reading
What came to be called the Level-1 BLAS were first
published in 1979, followed by the Level-2 BLAS in 1988
and Level-3 BLAS in 1990 [, , ].
Matrix–matrix multiplication (_gemm) is considered the most important operation, since high-performance implementations of the other Level-3
BLAS can be coded in terms of it []. Many implementations of _gemm are now based on the techniques developed by Kazushige Goto []. These techniques extend to
the high-performance implementation of other Level-3
BLAS [] and multithreaded architectures []. Practical algorithms for the distributed memory parallel
implementation of matrix–matrix multiplication, used
by ScaLAPACK and PLAPACK, were first discussed
in [, ] and for other Level-3 BLAS in [].
BLAS (Basic Linear Algebra Subprograms). Fig. 2 Performance (GFlops versus n) of the different implementations of Cholesky
factorization that use different levels of BLAS, using four threads on an architecture with four cores and a peak of .
Gflops; the curves are labeled Hand optimized, BLAS3, BLAS2, BLAS1, and triple loops
BLAS (Basic Linear Algebra Subprograms). Fig. 3 Performance (GFlops versus n, one thread) of different BLAS libraries for matrix–matrix
multiplication (dgemm); the curves are labeled GotoBLAS, MKL, ACML, and ATLAS
BLAS (Basic Linear Algebra Subprograms). Fig. 4 Parallel performance (GFlops versus n, four threads) of different BLAS libraries for matrix–matrix
multiplication (dgemm); the curves are labeled GotoBLAS, MKL, ACML, and ATLAS
As part of the BLAS Technical Forum, an effort was
made in the late 1990s to extend the BLAS interfaces to
include additional functionality []. Outcomes included
the CBLAS interface, which is now widely supported,
and an interface for Sparse BLAS [].
Bibliography
. Agarwal RC, Gustavson F, Zubair M () A high-performance
matrix multiplication algorithm on a distributed memory parallel
computer using overlapped communication. IBM J Res Dev ()
. Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S,
Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A,
Pozo R, Remington K, Whaley RC (June ) An updated set
of Basic Linear Algebra Subprograms (BLAS). ACM Trans Math
Softw ():–
. Chtchelkanova A, Gunnels J, Morrow G, Overfelt J, van de Geijn
RA (Sept ) Parallel implementation of BLAS: general techniques for level- BLAS. Concurrency: Pract Exp ():–
. Dongarra JJ, Du Croz J, Hammarling S, Duff I (March )
A set of level- basic linear algebra subprograms. ACM Trans
Math Softw ():–
. Dongarra JJ, Du Croz J, Hammarling S, Hanson RJ (March )
An extended set of FORTRAN basic linear algebra subprograms.
ACM Trans Math Softw ():–
. Duff IS, Heroux MA, Pozo R (June ) An overview of the
sparse basic linear algebra subprograms: the new standard from
the BLAS technical forum. ACM Trans Math Softw ():–
. Goto K, van de Geijn R () High-performance implementation of the level- BLAS. ACM Trans Math Softw ():–
. Goto K, van de Geijn RA () Anatomy of high-performance
matrix multiplication. ACM Trans Math Softw ():–
. Kågström B, Ling P, Van Loan C () GEMM-based level-
BLAS: high performance model implementations and performance evaluation benchmark. ACM Trans Math Softw ():
–
. Lawson CL, Hanson RJ, Kincaid DR, Krogh FT (Sept ) Basic
linear algebra subprograms for Fortran usage. ACM Trans Math
Softw ():–
. Marker B, Van Zee FG, Goto K, Quintana-Ortí G, van de Geijn RA
() Toward scalable matrix multiply on multithreaded architectures. In: Kermarrec A-M, Bougé L, Priol T (eds) Euro-Par,
LNCS , pp –
. van de Geijn R, Watts J (April ) SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Pract Exp
():–
Blocking
Tiling
Blue CHiP
Lawrence Snyder
University of Washington, Seattle, WA, USA
Synonyms
Blue CHiP project; CHiP architecture; CHiP computer;
Configurable, Highly parallel computer; Programmable
interconnect computer; Reconfigurable computer
Definition
The CHiP Computer is the Configurable, Highly
Parallel architecture, a multiprocessor composed of
processor-memory elements in a lattice of programmable switches []. Designed in  to exploit Very Large
Scale Integration (VLSI), the switches are set under
program control to connect and reconnect processors.
Though the CHiP architecture was the first use of programmable interconnect, it has since been applied most
widely in Field Programmable Gate Arrays (FPGAs).
The Blue CHiP Project, the effort to investigate and
develop the CHiP architecture, took Carver Mead’s “tall,
thin man” vision as its methodology. Accordingly, it
studied “How best to apply VLSI technology” from
five research perspectives: VLSI, architecture, software,
algorithms, and theory.
Discussion
Background and Motivation
In the late 1970s, as VLSI densities were improving
to allow significant functionality on a chip, Caltech’s
Carver Mead offered a series of short courses at several research universities []. He taught the basics of
chip design to computer science grad students and faculty. In his course, which required students to do a
project fabricated using ARPA’s multi-project chip facility, Mead argued that the best way to leverage VLSI
technology was for a designer to be a “tall, thin man,”
a person knowledgeable in the entire chip development
stack: electronics, circuits, layout, architecture, algorithms, software, and applications. “Thin” implied that
the person had a working knowledge of each level, but
might not be an expert. The Blue CHiP project adopted
Mead’s idea: Use the “tall, thin man” approach to create
new ways to apply VLSI technology.
The Blue CHiP Project and the CHiP architecture
were strongly influenced by ideas tested in a design
included in the MPC- multi-project chip, a batch
of chip designs created by Mead’s students []. That
design, called the Tree Organized Processor Structure
(referenced as AG- in the archive []), was a height
 binary tree with field-programmable processors for
evaluating Boolean expressions []. Though the design
used field-effect programmability, the binary tree was
embedded directly into the layout; see Fig. 1. Direct
embedding seemed unnecessarily restrictive. The lattice
structure of the CHiP architecture, developed in Spring,
, was to provide a more flexible communication
capability.
Because parallel computation was not well understood, because the CHiP computer was an entirely
new architecture, because nothing was known about
programmable interconnect, and because none of the
“soft” parts of the project (algorithms, OS, languages,
applications) had ever been considered in the context
of configurability, there was plenty of research to do.
And it involved so many levels that Mead’s “tall, thin
man” approach was effectively a requirement. Having
secured Office of Naval Research funding for the Blue
CHiP Project in Fall, , work began at the start
of .
Blue CHiP. Fig. 1 The tree organized processor structure,
an antecedent to the CHiP architecture []; the root is in the
center and its children are to its left and right
The CHiP Architecture Overview and
Operation
The focal point of the Blue CHiP project was the Configurable, Highly Parallel (CHiP) computer []. The CHiP
architecture used a set of processor elements (PEs) – bit processors with random access memory for program
and data – embedded in a regular grid, or lattice, of wires
and programmable switches. Interprocessor communication was implemented by setting the switches under
program control to connect PEs with direct, circuit-switched channels. The rationale was simple: Processor design was well understood, so a specific hardware
implementation made sense, but the optimal way to
connect together parallel processors was not known.
Building a dynamically configurable switching fabric
that could reconnect processes as the computation progressed permitted optimal communication.
Phases
The CHiP architecture exploited the observation that
large computations (worthy of parallel processing) are
typically expressed as a sequence of algorithms, or
phases, that perform the major steps of the computation.
Each phase – matrix multiplication, FFT, pivot selection, etc. – is a building block for the overall computation. The phases execute one after another, often being
repeated for iterative methods, simulations, approximations, etc. Phases tend to have a specific characteristic communication structure. For example, FFT uses
the butterfly graph to connect processors; the Kung-Leiserson systolic array matrix multiplication uses a
“hex array” interconnect [MC]; parallel prefix computations use a complete binary tree, etc. Rather than
trying to design a one-size-fits-all interconnect, the
CHiP architecture provides the mechanism to program it.
From an OS perspective a parallel computation on
the CHiP machine proceeds as follows:
   Load switch memories in the lattice with configuration settings for all phases
   Load PE memories with binary object code for all phases
   i = 0
   while phases not complete {
      Select the lattice configuration connecting processors as required for Phase[i]
      Run Phase[i] to completion
      i = i + 1
   }
Notice that lattice configurations are typically reused
during a program execution.
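The load-once, select-per-phase structure of this loop can be sketched as a small runnable model. Everything here is a hypothetical illustration rather than the CHiP system software: `run_computation`, the dictionary-based "memories", and the phase names are invented, and each phase is modeled as an ordinary function that receives the active configuration.

```python
def run_computation(configurations, phase_code, schedule):
    """Model of the phase loop: load everything once, then select and run phase by phase."""
    switch_memory = dict(configurations)   # load switch memories for all phases
    pe_memory = dict(phase_code)           # load PE memories for all phases
    trace = []
    for name in schedule:                  # while phases not complete
        active = switch_memory[name]       # select the stored lattice configuration
        result = pe_memory[name](active)   # run the phase to completion
        trace.append((name, result))
    return trace


configurations = {"fft": "butterfly", "reduce": "binary tree"}
phase_code = {
    "fft": lambda graph: "transform over " + graph,
    "reduce": lambda graph: "sum over " + graph,
}
# The "fft" configuration is stored once but selected twice.
trace = run_computation(configurations, phase_code, ["fft", "reduce", "fft"])
```

The point of the sketch is that reusing a configuration costs only a selection, not a reload, which is why iterative computations fit the model well.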
Configuration
Configuring a communication structure works as follows: The interconnection graph structure, or configuration, is stored in the lattice, and when selected,
it remains in effect for the period of that phase
of execution, allowing the PEs to communicate and
implement the algorithm. PE-to-PE communication is
realized by a direct circuit-switched connection []. The
overall graph structure is realized by setting each switch
appropriately.
A switch has circuitry to connect its d incident
edges (bit-parallel data wires), allowing up to d communication paths to cross the switch independently. See
Figure . A switch has memory to store the settings
needed for a set of configurations. Each switch stores
its setting for configuration i in its memory location i.
The lattice implements configuration i by selecting the
ith memory location for all switches.
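A toy model may help make the per-switch configuration memory concrete. The classes `Switch` and `Lattice` below, and the representation of a setting as a set of connected edge pairs, are assumptions made for illustration; they do not describe the actual switch circuitry.

```python
class Switch:
    """A switch stores one setting per configuration; memory[i] is the setting for configuration i."""

    def __init__(self, settings):
        self.memory = list(settings)
        self.selected = None

    def select(self, i):
        self.selected = self.memory[i]


class Lattice:
    """Selecting configuration i broadcasts the same memory index to every switch."""

    def __init__(self, switches):
        self.switches = list(switches)

    def configure(self, i):
        for switch in self.switches:
            switch.select(i)
        return [switch.selected for switch in self.switches]


# Each setting is a set of (edge, edge) pairs connected through the switch.
lattice = Lattice([
    Switch([{("E", "W")}, set()]),
    Switch([{("N", "S")}, {("N", "E")}]),
])
config0 = lattice.configure(0)
config1 = lattice.configure(1)
```

Changing the whole interconnection graph therefore requires no data movement at phase-change time, only a new index shared by all switches.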
Figure  shows schematic diagrams of a degree
 switch capable of  independent connections, that
is, paths that can crossover. Interconnections are programmed graphically, using an iconography shown in
Fig. a (and Fig. ). Figure b shows the connectivity allowing the wires to connect arbitrarily. Notice
that fan-out is possible. The independent connections require separate buses, shown as paired, overlapping diamonds. Higher degree switches (more incident
edges) require more buses; wider datapaths (-wide is
shown) require more diamonds. Figure c shows the
configuration setting in memory location M asserting
the connection of the east edge to the right bus. Notice
that memory is provided to store the connections for
each bus, and that there are three states: disconnected,
connect to left, and connect to right. Not shown in the
figure is the logic to select a common memory location
on all switches, that is, the logic selecting M, as well as
circuitry to ensure the quality of the signals.
Blue CHiP. Fig.  A schematic diagram for a degree four switch: (a) iconography used in programming
interconnection graphs; from top, unconnected, east-west connection, east-west and north-south connections, fan-out
connection, two corner-turning connections; (b) wire arrangement for four incident edges (-bit-parallel wires); the
buses implementing the connection are shown as overlapping diamonds; (c) detail for making a connection of the east
edge to the right bus per configuration stored in memory location M
Blue CHiP. Fig.  Sample switch lattices in which processors are represented by squares and switches by circles;
(a) an -degree, single wide lattice; (b) a -degree, double wide lattice []
As a final detail, notice that connections can cause
the order of the wires of a datapath to be flipped. Processor ports therefore contain logic to reverse the order of the bits
they receive. When a new configuration comes into effect, each connection sends and receives
an lsb, revealing whether the bits are true or reversed and
allowing the ports to select or deselect the flipping
hardware.
Processor Elements
Each PE is a computer, a processor with memory, in
which all input and output is accessible via ports. (Certain “edge PEs” also connect to external I/O devices.)
A lattice with switches having d incident edges hosts
PEs with d incident edges, and therefore d ports.
Programs communicate simply by sending and receiving through the ports; since the lattice implements
circuit-switched communication, the values can be sent
directly with little “packetizing” beyond a task designation in the receiving processor.
Each PE has its own memory, so in principle each PE
can execute its own computation in each phase; that is,
the CHiP computer is a MIMD, or multiple-instruction,
multiple-data, computer. As phase-based parallel algorithms became better understood, it became clear that
PEs typically execute only a few different computations
to implement a phase. For example, in parallel-prefix
algorithms, which have tree-connected processors, the
root, the interior nodes, and the leaves execute slightly
different computations. Because these are almost always
variations of each other, a single program, multiple data
(SPMD) model was adopted.
The Controller
The orchestration of phase execution is performed by an
additional processor, a “front end” machine, called the
controller. The controller loads the configuration settings into the switches, and loads the PEs with their code
for each phase. It then steps through the phase execution logic (see “Phases”), configuring the lattice for a
phase and executing it to completion; processors use the
controller’s network to report completion.
The controller code embodies the highest-level logic
of a computation, which tends to be very straightforward. The controller code for the Simple Benchmark []
   for m := 0 to mmax do
   begin
      phase(hydro);
      phase(viscos);
      phase(new_Δt);
      phase(thermo);
      phase(row_solve);
      phase(column_solve);
      phase(energy_bal);
   end.
illustrates this point and is typical.
Blue CHiP Project Research
Following Mead, the Blue CHiP Project conducted research in five topic areas: VLSI,
architecture, software, algorithms, and theory. These
five topics provide an appropriate structure for the
remainder of this entry, when extended with an additional section to summarize the ideas. Though much of
the work was targeted at developing a CHiP computer,
related topics were also studied.
VLSI: Programmable Interconnect
Because the CHiP concept was such a substantial departure from conventional computers and the parallel computers that preceded it, initial research focused less on
building chips, that is, the VLSI, and more on the overall
properties of the lattice that were the key to exploiting
VLSI. A simple field-effect programmable switch had
been worked out in the summer of , giving the team
confidence that fabricating a production switch would
be doable when the overall lattice design was complete.
The CHiP computer’s lattice is a regular grid of
(data) wires with switches at the cross points and processors inset at regular intervals. The lattice abstraction
uses circles for switches and squares for processors. An
important study was to understand how a given size lattice hosted various graphs; see Fig.  for typical abstractions. The one layer metal processes of the early s
caused the issue of lattice bandwidth to be a significant
concern.
Lattice Properties Among the key parameters []
that describe a lattice are the degree of the nodes, that
is, the number of incident edges, and the corridor width, that is, the
number of switches separating two PEs. The architectural question is: What is a sufficiently rich lattice to
handle most interconnection graphs? This is a graph
embedding question, and though graph embedding was
an active research area at the time, asymptotic results
were not helpful. Rather, it was important for us to
understand graph embeddings for a limited number of
nodes, say  or fewer, of the sort of graphs arising in
phase-based parallel computation.
What sort of communication graphs do parallel
algorithms use? Table  lists some typical graphs and
algorithms that use them; there are other, less familiar graphs. Since lattices permit nonplanar embeddings,
that is, wires can cross over, a sufficiently wide corridor
lattice of sufficiently high degree switches can embed
any graph in our domain. But economy is crucial, especially with VLSI technologies with few layers.
Embeddings It was found that graphs could be
embedded in very economical lattices. Examples of two
of the less obvious embeddings are shown in Fig. .
The embedding of the binary tree is an adaptation of
the “H-embedding” of a binary tree into the plane. The
shuffle exchange graph for  nodes of a single wide lattice is quite involved – shuffle edges are solid, exchange
edges are dashed. Shuffle exchange on  nodes is not
known to be embeddable in a single wide lattice, and
was assumed to require corridor width of . Notice that
once a graph has been embedded in a lattice, it can be
placed in a library for use by those who prefer to program using the logical communication structure rather
than the implementing physical graph.
Techniques Most graphs are easy to lay out because
most parallel computations use sparse graphs with little complexity. In programming layouts, however, one is
often presented with the same problem repeatedly. For
example, laying out a torus is easy, but minimizing the
wire length – communication time is proportional to
length – takes some technique. See Fig. . An interleaving of the row and column PEs of a torus by “folding” at
the middle vertically, and then “folding” at the middle
horizontally, produces a torus with interleaved processors. Each of a PE’s four neighbors is at most three
switches away. Opportunities to use this “trick” of interleaving PEs to shorten long wires arise in other cases.
A related idea (Fig. b) is to “lace” the communication channels in a corridor to maximize the number of
datapaths used.
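The effect of folding can be checked on a single ring, which is one row or one column of a torus. The sketch below is a simplified one-dimensional illustration, assuming that distance is measured in layout positions rather than in switches; the function names are hypothetical, and how positions translate into switch counts depends on the corridor width, which is not modeled here.

```python
def folded_order(n):
    """Place ring nodes 0..n-1 in interleaved order: 0, n-1, 1, n-2, 2, ..."""
    order = []
    lo, hi = 0, n - 1
    while lo <= hi:
        order.append(lo)
        if hi != lo:
            order.append(hi)
        lo += 1
        hi -= 1
    return order


def max_neighbor_distance(order):
    """Largest layout distance between two nodes that are adjacent on the ring."""
    n = len(order)
    position = {node: index for index, node in enumerate(order)}
    return max(abs(position[v] - position[(v + 1) % n]) for v in range(n))


natural = list(range(8))        # the wraparound edge spans the whole row
folded = folded_order(8)        # interleaving keeps every ring edge short
```

Applying the same interleaving independently to rows and to columns gives the folded torus layout, in which no edge spans the full width or height of the lattice.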
The conclusion from the study of lattice embeddings
indicated, generally, that the degree should be , and
that corridor width is sufficiently valuable that it should
be “as wide as possible.” Given the limitations of VLSI
technology at the time, only  was realistic.
Blue CHiP. Table  Communication graph families and
parallel algorithms that communicate using those graphs
Communication Graph               Parallel Applications
-degree, -degree mesh             Jacobi relaxation for Laplace equations
-degree, -degree mesh             Matrix multiplication, dynamic programming
Binary tree                       Aggregation (sum, max), parallel prefix
Toroidal -degree, -degree mesh    Simulations with periodic boundaries
Butterfly graph                   Fast Fourier transform
Shuffle-exchange graph            Sorting
Architecture: The Pringle
When the project began VLSI technology did not yet
have the densities necessary to build a prototype CHiP
computer beyond single-bit processors; fitting one serious processor on a die was difficult enough, much
less a lattice and multiple processors. Nevertheless, the
Roadmap of future technology milestones was promising, so the project opted to make a CHiP simulator to
use while waiting for technology to improve. The simulated CHiP was called the Pringle [] in reference to the
snack food by that name that also approximates a chip.
Blue CHiP. Fig.  Example embeddings; (a) a -node binary tree embedded in a  node, -degree, single wide lattice;
(b) a -node shuffle exchange graph embedded in a  node, -degree, single wide lattice
Blue CHiP. Fig.  Embedding techniques; (a) a -degree torus embedded using “alternating PE positions” to reduce the
length of communication paths to  switches; solid lines are vertical connections, dashed lines are horizontal connections;
(b) “lacing” a width  corridor to implement  independent datapaths, two of which have been highlighted in bold
Blue CHiP. Table  The Pringle datasheet; Pringle was an emulator for the CHiP architecture with an -bit datapath
Processor Elements
   Number of PEs               
   PE microprocessor           Intel 
   PE datapath width           -bits
   PE floating point chip      Intel 
   PE RAM size                  Kb
   PE EPROM size                Kb
   PE clock rate                MHz
Switch
   Switch structure            Polled Bus
   Switch clock rate            MHz
   Bus Bandwidth                Mb/s
   Switch Datapath              bits
Controller
   Controller microprocessor   Intel 
The Pringle was designed to behave like a CHiP
computer except that instead of using a lattice for interprocessor communication, it simulated the lattice with a
polled bus structure. Of course, this design serializes the
CHiP’s parallel communication, but the Pringle could
potentially advance  separate communications in one
polling sweep. The switch controller stored multiple
configurations in its memory (up to ), implementing
one per phase. See the datasheet in Table .
Two copies of Pringle were eventually built, one for
Purdue University and one for the University of Washington. Further data on the Pringle is available in the
reports [].
Software: Poker
The primary software effort was building the Poker Parallel Programming Environment []. Poker was targeted
at programming the CHiP Computer, which informed
the structure of the programming system substantially.
But there were other influences as well.
For historical context the brainstorming and design
of Poker began at Purdue in January  in a seminar
dedicated to that purpose. At that point, Xerox PARC’s
Alto was well established as the model for building a
modern workstation computer. It had introduced bitmapped graphic displays, and though they were still
extremely rare, the Blue CHiP Project was committed
to using them. The Alto project also supported interactive programming environments such as SmallTalk
and Interlisp. It was agreed that Poker needed to support interactive programming, too, but the principles
and techniques for building windows, menus, etc., were
not yet widely understood. Perhaps the most radical
aspect of the Alto, which also influenced the project,
was dedicating an entire computer to support the work
of one user. It was decided to use a VAX / for
that purpose. This decision was greeted with widespread
astonishment among the Purdue CS faculty, because
all of the rest of the departmental computing was performed by another time-shared VAX /.
The Poker Parallel Programming Environment
opened with a “configuration” screen, called CHiP
Params, that asked programmers for the parameters
of the CHiP computer they would be programming.
Items such as the number of processors, the degree of
the switches, etc. were requested. Most of this information would not change in a production setting using
a physical computer, but for a research project, it was
important.
Programming the CHiP machine involved developing phase programs for specific algorithms such as
matrix multiplication, and then assembling these phase
pieces into the overall computation []. Each phase
program requires these components:
● Switch Settings (SS) specification – defines the interconnection graph
● Code Names (CN) specification – assigns a process
to each PE, together with parameters
● Port Names (PN) specification – defines the names
used by each PE to refer to its neighbors
● I/O Names (IO) specification – defines the external
data files used and created by the phase
● Text files – process code written in an imperative
language (XX or C)
Not all phases read or write external files, so IO is
not always needed; all of the other specifications are
required.
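The components of a phase program can be summarized as a small record type. This is a hypothetical sketch for illustration only: the class `PhaseProgram`, its field names, and the two checks are invented here and do not reflect how Poker stored its specifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhaseProgram:
    switch_settings: dict            # SS: switch position -> setting
    code_names: dict                 # CN: PE -> (process name, parameters)
    port_names: dict                 # PN: PE -> {port name: direction}
    text_files: dict                 # process name -> source text
    io_names: Optional[dict] = None  # IO: only needed if the phase uses external files

    def missing_components(self):
        """Names of required specifications that are empty."""
        required = {
            "SS": self.switch_settings,
            "CN": self.code_names,
            "PN": self.port_names,
            "Text": self.text_files,
        }
        return [name for name, value in required.items() if not value]

    def undefined_processes(self):
        """Process names assigned to PEs in CN that have no text file."""
        assigned = {process for process, _ in self.code_names.values()}
        return sorted(assigned - set(self.text_files))


phase = PhaseProgram(
    switch_settings={(0, 0): "east-west"},
    code_names={0: ("sum", ()), 1: ("leaf", (3,))},
    port_names={0: {"child": "E"}, 1: {"parent": "W"}},
    text_files={"sum": "..."},
)
```

Keeping IO optional while the other four components are mandatory mirrors the rule stated above.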
Poker emphasized graphic programming and minimized symbolic programming. Programmers used one
of several displays – full-screen windows – to program
the different types of information describing a computation. Referring to Fig. , the windows were: Switch Settings for defining the interconnection graph (Fig. a),
Code Names for assigning a process (and parametric
data) to each processing element (Fig. b), Port Names
for naming a PE’s neighbors (Fig. c), I/O Names for
reading and writing to standard in and out, and Trace
View (Fig. d) for seeing the execution in the same form
as used in the “source” code.
In addition to the graphic display windows, a text
editor was used to write the process code, whose names
are assigned in the Code Names window. Process code
was written in a very simple language called XX. (The
letters XX were used until someone could think of an
appropriate name for the language, but soon it was simply called Dos Equis.) Later, a C-preprocessor called
Poker C was also implemented. One additional facility was a Command Request window for assembling
phases.
After building a set of phases, programmers needed
to assemble them, using the Command Request (CR)
interface, into a sequence of phase invocations to solve
the problem. This top-level logic is critical because when
a phase change occurs – for example, moving from a
pivot selection step to an elimination step – the switches
in the lattice typically change.
The overall structure of the Poker Environment is
shown in Fig.  []. The main target of CHiP programs was a generic simulator, since production level
computations were initially not needed. The Cosmic
Cube from Caltech and the Sequent were contemporary
parallel machines that were also targeted [].
Algorithms: Simple
In terms of developing CHiP-appropriate algorithms,
the Blue CHiP Project benefited from the wide interest at the time in systolic array computation. Systolic
arrays are data parallel algorithms that map easily to a
CHiP computer phase []. Other parallel algorithms
of the time – parallel prefix, nearest neighbor iterative
relaxation algorithms, etc., – were also easy to program.
So, with basic building blocks available, project personnel split their time between composing these known
algorithms (phases) into complete computations, and
developing new parallel algorithms. The SIMPLE computation illustrates the former activity.
SIMPLE The SIMPLE computation is, as its name
implies, a simple example of a Lagrangian hydrodynamics computation developed by researchers at Livermore
National Labs to illustrate the sorts of computations of
interest to them that would benefit from performance
improvements. The program existed only in Fortran, so
the research question became developing a clean parallel solution that eventually became the top-level logic
of the earlier Section “The Controller.” Phases had to
be defined, interconnection graphs had to be created,
data flow between phases had to be managed, and so
forth. This common task – converting from sequential
to parallel for the CHiP machine – is fully described for
SIMPLE in a very nice paper by Gannon and Panetta [].

Blue CHiP. Fig.  Poker screen shots of a phase program for a  processor CHiP computer for a master/slave computation;
(a) Switch Settings (, is the master), editing mode for SS is shown in Fig. a; (b) Code Names; (c) Port Names; and (d) Trace
View midway through the (interpreted) computation
New Algorithms Among the new algorithms were
image-processing computations, since they were thought
to be a domain benefiting from CHiP-style parallelism
[]. One particularly elegant connected components
counting algorithm illustrates the style.
Using a -degree version of a morphological transformation due to Levialdi, a bit array is modified using
two transformations applied at all positions,
[Figure: the two neighborhood patterns defining the transformations; both rules are stated in words in the text]
producing a series of morphed arrays. In words, the first
rule specifies that a -bit adjacent to -bits to its north,
northwest, and west becomes a -bit in the next generation; the second rule specifies that a -bit adjacent to
-bits to its north and west becomes a -bit in the next
generation; all other bits are unchanged; an isolated bit
that is about to disappear due to the first rule should be
counted as a component. See Fig. .
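The shrinking rule can be illustrated with a minimal sequential Python sketch; it is not the CHiP phase program. The bit values are not legible in this text, so the sketch assumes the usual reading of Levialdi's operator: a 1-bit whose north, northwest, and west neighbors are all 0 is cleared, a 0-bit whose north and west neighbors are both 1 is set, and a 1-bit with no 1-bit among its eight neighbors is counted as it is cleared. The name `count_components` and the list-of-lists image format are illustrative choices.

```python
def count_components(grid):
    """Count components by repeatedly applying the shrinking rule (assumed bit values)."""
    rows, cols = len(grid), len(grid[0])

    def at(g, r, c):
        # Pixels outside the array are treated as 0.
        return g[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    count = 0
    while any(any(row) for row in grid):
        nxt = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                n, w, nw = at(grid, r - 1, c), at(grid, r, c - 1), at(grid, r - 1, c - 1)
                if grid[r][c]:
                    if n or w or nw:
                        nxt[r][c] = 1          # the bit survives
                    else:
                        # First rule: the bit is cleared; count it if it is isolated.
                        neighbors = [at(grid, r + dr, c + dc)
                                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                                     if (dr, dc) != (0, 0)]
                        if not any(neighbors):
                            count += 1
                elif n and w:
                    nxt[r][c] = 1              # second rule: the bit is set
        grid = nxt
    return count
```

In a CHiP-style phase, each processor would apply the same rule to its own subarray after exchanging edge bits with its neighbors; the sketch above ignores that partitioning.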
The CHiP phase for the counting connected components computation assigns a rectangular subarray of
pixels to each processor, and uses a -degree mesh interconnection. PEs exchange bits around the edge of their
subarray with adjacent neighbors, and then apply the
rules to create the next iteration, repeating this exchange-and-update step. As is typical,
the controller is not involved in the phase until all processors signal that they have no more bits left in their
subarray, at which point the phase ends.
Blue CHiP. Fig.  Structure of the Poker parallel programming environment. The “front end” was the portion visible in the
graphic user interface; the “back ends” were the “platforms” on which a Poker program could be “run”
[Figure labels: Poker front-end with programming views (CHiP params, IO names, Port names, Code names, Switch settings), make view, command request, compilers (Poker C, XX), run-time view (trace, I/O), and text editor; program database; Poker back-ends with simulators (generic, Cosmic Cube), emulators (Pringle), and parallel computers (Pringle, Cosmic Cube, Sequent)]
Blue CHiP. Fig.  The sequence of ten-bit arrays produced using Levialdi’s transformation; pixels are blackened when
counted. The algorithm “melts” a component to the lower right corner of its bounding box, where it is counted
Theory: Graphs
Though the project emphasized the practical task of
exploiting VLSI, a few theorems were proved, too.
Two research threads are of interest. The first concerned properties of algorithms to convert data-driven
programs to loop programs. Data-driven is an asynchronous communication protocol where, for example,
a read of a port stalls if it is issued before the data arrives. The CHiP
machine was data driven, but the protocol carries overhead. The question was whether data-driven programs
could be more synchronous. The theorems concerned
properties of an algorithm to convert from data-driven
to synchronous; the results were reported [] and
implemented in Poker, but not heavily used.
The second thread turned out to be rather long. It
began with an analysis of yield in VLSI chip fabrication.
Called the Tile Salvage Problem, the analysis concerned
how the chips on a wafer, some of which had tested
faulty, could be grouped into large working blocks [].
The analysis showed that matching working pairs can be done in
polynomial time, but finding  ×  working blocks is NP-hard;
an optimal-within- approximation was developed for × blocks. When the work was shown to Tom
Leighton, he found closely related planar graph matching questions, and added/strengthened some results. He
told Peter Shor, who did the same. And finally, David
Johnson made still more improvements. The work was
finally published as Generalized Planar Matching [].
Summary
The Blue CHiP Project, which ran for six years, produced a substantial amount of research, most of which is
not reviewed here. Much of the research was integrated
across multiple topics by applying Mead’s “tall, thin
man” approach. The results in the hardware domain
included a field programmable switch, a programmable
communication fabric (lattice), an architecture, wafer-scale integration studies, and a hardware emulator
(Pringle) for the machine. The results in the software
domain included the Poker Parallel Programming Environment that was built on a graphic workstation, communication graph layouts, programs for the machine,
and new parallel algorithms. Related research included
studies in wafer scale integration, both theory and
design, as well as the extension of the CHiP approach
to signal processing and other domains. The work was
never directly applied because the technology of the day
was not sufficiently advanced;  years after the conception of the CHiP architecture, however, the era of
multiple-cores-per-chip offers suitable technology.
Apart from the nascent state of the technology, an
important conclusion of the project was that the problem in parallel computing is not a hardware problem,
but a software problem. The two key challenges were
plain:
● Performance – developing highly parallel computations that exploit locality,
● Portability – expressing parallel computations at a
high enough level of abstraction that a compiler can
target any MIMD machine.
The two challenges are enormous, and largely remain
unsolved into the twenty-first century. Regarding the
first, exploiting locality was a serious concern for the
CHiP architecture with its tight VLSI connection, making us very sensitive to the issue. But it was also clear
that locality is always essential in parallelism, and valuable for all computation. Regarding the second, it was
clear in retrospect that the project’s programming and
algorithm development were very tightly coupled to the
machine, as were other projects of the day. Whereas
CHiP computations could usually be hosted effectively on other computers, the converse was not true.
Shouldn’t all code be machine independent? The issue in
both cases concerned the parallel programming model.
These conclusions were embodied in a paper known
as the “Type Architecture” paper []. (Given the many
ways “type” is used in computer science, it was not
a good name; in this case it meant “characteristic
form” as in type species in biology.) The paper, among
other things, predicts the rise of message passing programming, and criticizes contemporary programming
approaches. Most importantly, it defines a machine
model – a generic parallel machine – called the CTA.
This machine plays the same role that the RAM or
von Neumann machine plays in sequential computing.
The CTA is visible today in applications ranging from
message passing programming to LogP analysis.
Related Entries
CSP (Communicating Sequential Processes)
Graph Algorithms
Networks, Direct
Reconfigurable Computer
Routing (Including Deadlock Avoidance)
Systolic Arrays
Universality in VLSI Computation
VLSI Computation
Bibliographic Notes and Additional Reading
The focal point of the project, the machine, and its software were strongly influenced by VLSI technology, but
the technology was not yet ready for a direct application of the approach; it would take until roughly
. The ideas that the project developed – phase-based parallelism, high-level parallel language, emphasis on locality, emphasis on data parallelism, etc. –
turned out to drive the follow-on research much more
than VLSI. Indeed, there were several follow-on efforts
to develop high performance, machine independent
parallel languages [], and eventually, ZPL []. The
problem has not yet been completely solved, but the
Chapel [] language is the latest to embody these
ideas.
Bibliography
. Snyder L () Introduction to the configurable, highly parallel
computer. Computer ():–, Jan 
. Conway L () The MPC adventures, Xerox PARC Tech.
Report VLSI--; also published at http://ai.eecs.umich.edu/
people/conway/VLSI/MPCAdv/MPCAdv.html
. Conway L MPC: a large-scale demonstration of a new way
to create systems in silicon, http://ai.eecs.umich.edu/people/
conway/VLSI/MPC/MPCReport.pdf
. Snyder L () Tree organized processor structure, A VLSI
parallel processor design, Yale University Technical Report
DCS/TR
. Snyder L () Introduction to the configurable, highly parallel computer, Technical Report CSD-TR-, Purdue University,
Nov 
. Snyder L () Programming processor interconnection
structures, Technical Report CSD-TR-, Purdue University,
Oct 
. Gannon DB, Panetta J () Restructuring SIMPLE for the CHiP
architecture. Parallel Computation ():–
. Kapauan AA, Field JT, Gannon D, Snyder L () The
Pringle parallel computer. Proceedings of the th international symposium on computer architecture, IEEE, New York,
pp –
. Snyder L () Parallel programming and the Poker programming environment. Computer ():–, July 
. Snyder L () The Poker (.) programmer’s guide, Purdue
University Technical Report TR-, Dec 
. Notkin D, Snyder L, Socha D, Bailey ML, Forstall B, Gates K,
Greenlaw R, Griswold WG, Holman TJ, Korry R, Lasswell G,
Mitchell R, Nelson PA () Experiences with poker. Proceedings of the ACM/SIGPLAN conference on parallel programming:
experience with applications, languages and systems
. Snyder L, Socha D () Poker on the cosmic cube: the first
retargetable parallel programming language and environment.
Proceedings of the international conference on parallel processing, Los Alamitos
. Kung HT, Leiserson CE () Algorithms for VLSI processor
arrays. In: Mead C, Conway L (eds) Introduction to VLSI systems,
Addison-Wesley, Reading
. Cypher RE, Sanz JLC, Snyder L () Algorithms for image component labeling on SIMD mesh connected computers. IEEE Trans
Comput ():–
. Cuny JE, Snyder L () Compilation of data-driven programs
for synchronous execution. Proceedings of the tenth ACM symposium on the principles of programming languages, Austin, pp
–
. Berman F, Leighton FT, Snyder L () Optimal tile salvage,
Purdue University Technical Report TR-, Jan 
. Berman F, Johnson D, Leighton T, Shor P, Snyder L () Generalized planar matching. J Algorithms :–
. Snyder L () Type architectures, shared memory, and the
corollary of modest potential, Annual review of computer science,
vol , Annual Reviews, Palo Alto
. Lin C () The portability of parallel programs across MIMD
computers, PhD Dissertation, University of Washington
. Chamberlain B, Choi S-E, Lewis E, Lin C, Snyder L and Weathersby W () The case for high-level parallel programming in
ZPL. IEEE Comput Sci Eng ():–, July–Sept 
. Chamberlain BL, Callahan D, Zima HP () Parallel programmability and the Chapel language. Int J High Perform Comput Appl ():–, Aug 
. Snyder L () Overview of the CHiP computer. In: Gray JP (ed)
VLSI , Academic, London, pp –
Blue CHiP Project
Blue CHiP
Blue Gene/L
IBM Blue Gene Supercomputer
Blue Gene/P
IBM Blue Gene Supercomputer

Blue Gene/Q
IBM Blue Gene Supercomputer
Branch Predictors
André Seznec
IRISA/INRIA, Rennes, Rennes, France
Definition
The branch predictor is a hardware mechanism that
predicts the address of the instruction following a
branch. To respect the sequential semantics of a program, instructions should be fetched, decoded, executed, and completed in program order, which
would lead to quite slow execution. Modern processors
therefore implement many hardware mechanisms to execute several instructions concurrently while still respecting the
sequential semantics. Branch instructions are a particular burden, since their result is needed to begin the execution of the subsequent instructions. To avoid stalling
the processor on every branch, branch predictors are implemented in hardware.
Discussion
Introduction
Most instruction set architectures have sequential semantics. However, most processors implement
pipelining, instruction-level parallelism, and out-of-order execution. Therefore, on state-of-the-art processors, when an instruction completes, more than one
hundred subsequent instructions may already be
in progress in the processor pipeline. Enforcing the
semantics of a branch is therefore a major issue: The
address of the instruction following a branch instruction is normally unknown before the branch completes.
Control flow instructions are quite frequent, up to
% in some applications. Therefore, to avoid stalling
the issuing of subsequent instructions until the branch
completes, microarchitects invented branch prediction,
i.e., the address of the instruction following a branch B
is predicted. The instructions following the branch
can then progress speculatively in the processor pipeline
without waiting for the completion of the execution of
branch B.
Anticipating the address of the next instruction to
be executed was recognized as an important issue very
early in the computer industry, back in the late s.
However, the concept of branch prediction was really
introduced around  by Smith []. On the occurrence of a branch, the effective information that must
be predicted is the address of the next instruction.
In practice, however, several different pieces of information are
predicted in order to predict this address.
First, it is impossible to know that an
instruction is a branch before it has been decoded; that
is, the branch nature of the instruction must be known
or predicted before fetching the next instruction. Second, on taken branches, the branch target is unknown
at instruction fetch time; that is, the potential target of
the branch must be predicted. Third, most branches
are conditional, so the direction of the branch, taken or
not-taken, must be predicted.
It is important to note that these pieces of information are not
of equal importance for performance. Failing to predict
that an instruction is a branch means that instructions
are fetched in sequence until the branch instruction
is decoded. Since decoding is performed early in the
pipeline, the instruction fetch stream can be repaired
very quickly. Likewise, failing to predict the target of
a direct branch is not very costly. The effective
target of a direct branch can be computed from the
instruction encoding and its address, so it becomes
available very early in the pipeline, generally at the end of the decode
stage. On the other hand, the direction of a conditional branch and the target of an indirect branch are
only known when the instruction has been executed,
i.e., very late in the pipeline, thus potentially generating
very long misprediction penalties, sometimes  or 
cycles.
Since in many applications most branches are conditional and the penalty on a direction misprediction
is high, when one refers to branch prediction, one
generally refers to predicting directions of conditional
branches. Therefore, most of this article is dedicated to
conditional branch predictions.
General Hardware Branch Prediction Principle
Some instruction sets have included some software
hints to help branch prediction. Hints like “likely taken”
and “likely not-taken” have been added to the encoding
of the branch instruction. These hints can be inserted
by the compilers based on application knowledge, e.g.,
a loop branch is likely to be taken, or on profiling information. However, the most efficient schemes are essentially hardware and do not rely on any instruction set
support.
The general principle used in hardware branch prediction schemes is to predict that the
behavior of the branch to be executed will be a replay
of its past behavior. Therefore, hardware branch predictors are based on memorization of the past behavior of
the branches and some limited hardware logic.
Predicting Branch Natures and Direct Branch Targets
For a given static instruction, its branch nature (is it a
branch or not) remains unchanged throughout the life of the program, except in the case of self-modifying code. The same
holds for the targets of direct branches. The Branch
Target Buffer, or BTB [], is a special cache that aims
at predicting whether an instruction is a branch and, if so, its
potential target. At fetch time, the BTB is checked with
the instruction Program Counter. On a hit, the instruction is predicted to be a branch, and the address stored in
the hitting line of the BTB is assumed to be the potential
target of the branch. The nature of the branch (unconditional, conditional, return, indirect jump) is also read
from the BTB.
A branch may miss the BTB, e.g., on its first execution or after its eviction on a conflict miss. In this case,
the branch is written in the BTB: its kind and its target are stored in association with its Program Counter.
Note that on a BTB hit, the target may be mispredicted
on indirect branches and returns. This case is discussed
in the section “Predicting Indirect Branch
Targets”.
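The lookup-then-fill behavior of the BTB can be sketched as a small direct-mapped cache. This is a minimal Python illustration under assumed parameters: the class names, the table size, the use of the full program counter as the tag, and the string encoding of the branch kind are all hypothetical choices, not details given in this entry.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BTBEntry:
    tag: int       # program counter of the branch (full PC used as tag here)
    kind: str      # e.g., "unconditional", "conditional", "return", "indirect"
    target: int    # last known target address

class BTB:
    def __init__(self, entries: int = 512):
        self.entries = entries
        self.lines: List[Optional[BTBEntry]] = [None] * entries

    def lookup(self, pc: int) -> Optional[BTBEntry]:
        """Return the entry on a hit (predicted branch), or None on a miss."""
        line = self.lines[pc % self.entries]
        return line if line is not None and line.tag == pc else None

    def insert(self, pc: int, kind: str, target: int) -> None:
        """Write a branch after a miss; a conflicting older entry is evicted."""
        self.lines[pc % self.entries] = BTBEntry(pc, kind, target)
```

A miss on `lookup` means the fetch unit continues in sequence; the branch is inserted once its kind and target are known.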
Predicting Conditional Branch Directions
Most branches are conditional branches and the penalty
on mispredicting the direction of a conditional branch
is really high on modern processors. Therefore, predicting the direction of conditional branches has received
a lot of attention from the research community in the
s and the early s.
PC-Based Prediction Schemes
It is natural to predict a branch based on its program
counter. When introducing conditional branch prediction [], Smith immediately introduced two schemes
that capture the most frequent behaviors of the conditional branches.
Since most branches in a program tend to be biased
toward taken or not-taken, the simplest scheme is to
predict that a branch will follow the same
direction as the last time it was executed.
A hardware implementation of this simple
scheme requires storing only one bit per branch.
When a conditional branch is executed, its direction is
recorded, e.g., in the BTB along with the branch target.
A single bit is used:  encodes not-taken and  encodes
taken. The next time the branch is fetched, the direction
stored in the BTB is used as a prediction. This scheme is
surprisingly efficient on most programs, often achieving
accuracy higher than %. However, for branches that
privilege one direction and from time to time branch
in the other direction, this -bit scheme tends to predict
the wrong direction twice in a row. This is the case,
for instance, for loop branches, which are taken except on the
last iteration. The -bit scheme mispredicts the last
iteration and then the first iteration of the next execution of the loop.
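The double misprediction on loop branches can be reproduced with a few lines of Python. This is a minimal sketch, assuming a last-outcome predictor for a single branch; the function name, the initial state, and the trace (a loop branch taken three times and then not taken, executed 100 times) are illustrative assumptions.

```python
def one_bit_mispredictions(outcomes, initial=True):
    """Predict each outcome as a repeat of the previous one; count the misses."""
    last = initial
    misses = 0
    for taken in outcomes:
        if last != taken:
            misses += 1
        last = taken          # remember only the most recent direction
    return misses

# Hypothetical trace: taken, taken, taken, not-taken, repeated 100 times.
trace = ([True] * 3 + [False]) * 100
```

On this trace every loop exit is mispredicted, and so is the first iteration of every later execution of the loop.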
Smith [] proposed a slightly more complex
scheme based on a saturated -bit counter automaton
(Fig. ): The counter is incremented on a taken branch
and decremented on a not-taken branch. The most significant bit of the counter is used as the prediction.
On branches exhibiting a strong bias, the -bit scheme
avoids to encounter two successive mispredictions after
one occurrence of the non-bias direction. This -bit
Predict Taken
T
T
Predict Not-Taken
T
2
3
N
1
N
2:Weakly Taken
3:Strongly Taken
T
0
N
N
0:Strongly Not Taken
1:Weakly Not Taken
Branch Predictors. Fig.  A -bit counter automaton

B
Branch Predictors
counter predictor is often referred to as a -bit bimodal
predictor.
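The counter automaton can be sketched in Python as follows. This is a minimal illustration for a single branch, assuming the four-state counter of the figure (states 0 to 3, prediction taken when the state is 2 or 3); the function name, the initial state, and the trace are hypothetical.

```python
def two_bit_mispredictions(outcomes, state=2):
    """Saturating counter: predict taken when the most significant bit is set."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Increment on taken, decrement on not-taken, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

# Hypothetical trace: a loop branch taken three times, then not taken, 100 times.
trace = ([True] * 3 + [False]) * 100
```

On this trace only the loop exit is mispredicted: the single not-taken outcome moves the counter from state 3 to state 2, which still predicts taken.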
History-Based Conditional Branch Prediction Schemes
In the early s, branch prediction accuracy became
an important issue. The performance of superscalar
processors is improved by any reduction of branch misprediction rate. The -bit and -bit prediction schemes
use a very limited history of the behavior of a program
to predict the outcome direction of a branch. Yeh and
Patt [] and Pan and So [] proposed to use more
information on the passed behavior of the program, trying to better isolate the context in which the branch is
executed.
Two families of predictors were defined. Local
history predictors use only the past behavior of the
program on the particular branch. Global history predictors use the past behavior of the whole program on
all branches.
for (i=0; i<100; i++)
    for (j=0; j<4; j++)
        loop body
If the last 3 iterations have been taken then predict not taken else predict taken
Branch Predictors. Fig.  A loop with four iterations
Local History Predictors
Using the past behavior of the branch to be predicted
appears attractive. For instance, on the loop nest illustrated in Fig. , the number of iterations in the inner
loop is fixed (). When the outcomes of the last three
occurrences of the branch are known, the outcome of
the present occurrence is completely determined: If the
last three occurrences of the branch were taken, then the
current occurrence is the last iteration and is therefore not-taken; otherwise the branch is taken. Local history is not
limited to capturing the behavior of branches with a fixed
number of iterations. For instance, it can capture the
behavior of any branch exhibiting a periodic behavior
as long as the history length is equal to or longer than the
period.
A local history predictor can be implemented as
illustrated in Fig. . For each branch, the history of its
past behavior is recorded as a vector of bits of length
L and stored in a local history table. The branch program counter is used as an index to read the local history
table. Then the branch history is combined with the
program counter to read the prediction table. The prediction table entries are generally saturating -bit counters. The history vector must be updated on each branch
occurrence and stored back in the local history table.
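[Figure: m bits of the PC index a table of 2^m local histories of L bits each; the L-bit history (histo) then indexes a PHT of 2^n 2-bit counters, which delivers the prediction (0/1)]
Branch Predictors. Fig.  A local history predictor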
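The two chained table reads can be sketched in Python. This is a minimal illustration, assuming a three-bit local history, two-bit saturating counters initialized to weakly not-taken, and a prediction table indexed by the concatenation of the table slot and the history; the class name, table sizes, branch address, and trace are hypothetical.

```python
class LocalHistoryPredictor:
    def __init__(self, history_bits=3, table_bits=6):
        self.history_bits = history_bits
        self.hist_mask = (1 << history_bits) - 1
        self.entries = 1 << table_bits
        self.local_history = [0] * self.entries              # first table
        self.pht = [1] * (self.entries << history_bits)      # second table

    def _pht_index(self, pc):
        slot = pc % self.entries                              # first read
        return (slot << self.history_bits) | self.local_history[slot]

    def predict(self, pc):
        return self.pht[self._pht_index(pc)] >= 2            # second read

    def update(self, pc, taken):
        i = self._pht_index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        slot = pc % self.entries
        shifted = (self.local_history[slot] << 1) | int(taken)
        self.local_history[slot] = shifted & self.hist_mask

def inner_loop_misses(predictor, pc=0x40, executions=100):
    """Run a branch taken three times then not taken; return per-branch miss flags."""
    misses = []
    for _ in range(executions):
        for taken in (True, True, True, False):
            misses.append(predictor.predict(pc) != taken)
            predictor.update(pc, taken)
    return misses
```

After a short warm-up, each of the four histories seen in steady state maps to a counter that has learned the single outcome following it, so the loop exit is no longer mispredicted. The sketch updates the history at once and so ignores the speculative-history problem.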
Effective implementation of local history predictors is quite difficult. First, the prediction requires
chaining two successive table reads, thus creating a difficult latency issue for providing the prediction in time for
use. Second, in aggressive superscalar processors, several branches (sometimes tens) are progressing speculatively in the pipeline. Several instances of the same
branch could have been speculatively fetched and predicted before the first occurrence is finally committed.
The branch prediction should be computed using the
speculative history. Thus, maintaining a correct speculative history for local history predictors is very complex
for wide-issue superscalar processors.
Global History Predictors
The outcome of a branch is often correlated with the
outcome of branches that have been recently executed.
For instance, in the example illustrated in Fig. , the outcome of the third branch is completely determined by
the outcome of the first two branches.
First-generation global history predictors such as
GAs [] or gshare [] (Fig. ) combine
a fixed-length global history vector with the program
counter of the branch to index a prediction table. These
predictors were shown to suffer from two antagonistic
phenomena. First, using a very long
history is sometimes needed to capture correlation. Second, using a long history results in possible destructive
conflicts in a limited-size prediction table.
A large body of research in the mid-s was dedicated to reducing the impact of these destructive conflicts,
or aliasing [, , , ].
By the end of the s, these dealiased global history predictors were known to be more accurate at equal
storage complexity than local predictors.
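A gshare-style index computation can be sketched in Python. This is a minimal illustration, assuming the program counter is XORed with a two-bit global history to index a table of two-bit counters; the class name, table size, history length, branch addresses, and the random conditions are hypothetical. The driver replays three correlated branches in which the third is taken exactly when the first two are.

```python
import random

class Gshare:
    def __init__(self, index_bits=10, history_bits=2):
        self.index_mask = (1 << index_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.history = 0                           # global history, newest bit lowest
        self.pht = [1] * (1 << index_bits)         # 2-bit counters, weakly not-taken

    def _index(self, pc):
        return (pc ^ self.history) & self.index_mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask

def b3_mispredictions(rounds=1000, seed=0):
    """Count mispredictions of the third branch over many rounds."""
    rng = random.Random(seed)
    predictor = Gshare()
    misses = 0
    for _ in range(rounds):
        c1, c2 = rng.random() < 0.5, rng.random() < 0.5
        for pc, taken in ((0x40, c1), (0x80, c2), (0xC0, c1 and c2)):
            if pc == 0xC0 and predictor.predict(pc) != taken:
                misses += 1
            predictor.update(pc, taken)
    return misses
```

The branch addresses are chosen so that no two branches share a table entry here; with a longer history or a smaller table, the destructive conflicts mentioned above would appear.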
Path or global history predictor: In some cases,
the global conditional branch history vector does not
uniquely determine the instruction path that leads to a
particular branch; e.g., an indirect branch or a return
may have occurred, and two paths can be represented by
the same global history vector. The path vector combining all the addresses of the last control flow instructions
that lead to a branch is unique. Using the path history
instead of the global history vector generally results in a
slightly higher accuracy [].
B1: if cond1 then ..
B2: if cond2 then ..
B3: if cond1 and cond2 then ..
Branch Predictors. Fig.  Branch correlation: outcome of
branch  is uniquely determined by the outcomes of
branches  and 
Hybrid Predictors
Global history predictors and local history predictors
were shown to capture different branch behaviors. In
, McFarling [] proposed to combine several predictors to improve their accuracy. Even combining a
global history predictor with a simple bimodal predictor
was shown to provide enhanced prediction accuracy.
The first hybrid predictor proposals used
a metapredictor to select a prediction. The metapredictor is also indexed with the program counter and
the branch history. The metapredictor learns which
predictor component is more accurate. The bc-gskew
predictor proposed for the cancelled Compaq EV
processor [] leveraged hybrid prediction to combine
several global history predictors, including a majority vote gskew predictor, with different history lengths.
However, a metapredictor is not a cost-effective solution
to select among more than two predictions [].
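The selection mechanism of a hybrid predictor with a metapredictor can be sketched in Python. This is a minimal illustration with several simplifying assumptions: the metapredictor is indexed by the program counter only (the text above also uses the branch history), it is a table of two-bit counters, and the two components are trivial stand-ins (`Static`) rather than real bimodal or global history predictors. All names are hypothetical.

```python
class Static:
    """Stand-in component that always predicts the same direction."""
    def __init__(self, taken):
        self.taken = taken

    def predict(self, pc):
        return self.taken

    def update(self, pc, taken):
        pass

class HybridPredictor:
    def __init__(self, first, second, entries=1024):
        self.components = (first, second)
        self.entries = entries
        self.meta = [1] * entries      # 2-bit counters; >= 2 selects the second component

    def predict(self, pc):
        chosen = self.components[int(self.meta[pc % self.entries] >= 2)]
        return chosen.predict(pc)

    def update(self, pc, taken):
        first_ok = self.components[0].predict(pc) == taken
        second_ok = self.components[1].predict(pc) == taken
        i = pc % self.entries
        # The metapredictor moves only when the two components disagree.
        if second_ok and not first_ok:
            self.meta[i] = min(3, self.meta[i] + 1)
        elif first_ok and not second_ok:
            self.meta[i] = max(0, self.meta[i] - 1)
        for component in self.components:
            component.update(pc, taken)
```

Any two objects with `predict` and `update` methods could be plugged in as components in place of the stand-ins.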
[Figure: n bits of the PC are hashed, e.g., by XOR, with the L-bit global history (histo) to index a PHT of 2^n 2-bit counters, which delivers the prediction (0/1)]
Branch Predictors. Fig.  The gshare predictor
Toward Using Very Long Global History
While some applications are predicted very accurately using a short branch history length, limit studies were showing that some benchmarks would benefit
from using a very long history, in the s of bits range.
In , Jiménez and Lin [] proposed to use neural-net-inspired techniques to combine predictions. In
perceptron predictors, the prediction is computed as the
sign of the dot product of a vector of signed prediction
counters, read in parallel, with the branch history vector
(Fig. ). Perceptron predictors allow the use of very long
global history vectors, e.g.,  bits, but suffer from a
long-latency prediction computation and require huge
predictor table volumes.
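The dot-product prediction and its threshold-based update can be sketched in Python. This is a minimal illustration, assuming one weight vector per branch address, a four-bit history encoded as -1/+1, and a training threshold of 10; the class name and all parameter values are hypothetical, and weight saturation is omitted.

```python
class PerceptronPredictor:
    def __init__(self, history_bits=4, theta=10):
        self.h = history_bits
        self.theta = theta                       # training threshold (assumed value)
        self.weights = {}                        # pc -> [bias, w1, ..., wh]
        self.history = [-1] * history_bits       # most recent outcome first, as -1/+1

    def _output(self, pc):
        w = self.weights.setdefault(pc, [0] * (self.h + 1))
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0             # sign of the sum

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when the sum is below the threshold.
        if (y >= 0) != taken or abs(y) < self.theta:
            w = self.weights[pc]
            w[0] += t
            for i, xi in enumerate(self.history):
                w[i + 1] += t * xi
        self.history = [t] + self.history[:-1]
```

On an alternating taken/not-taken branch, the weight of the most recent history bit quickly becomes strongly negative, so the predictor learns to invert the last outcome.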
[Figure: tables BIM, G0, G1, and Meta, indexed with n bits of address and history; a majority vote yields the e-gskew prediction, and the metaprediction selects the final prediction between the bimodal prediction and the e-gskew prediction]
Branch Predictors. Fig.  The bc-gskew predictor
[Figure: signed 8-bit counters are multiplied (X) by the branch history, encoded as (-1,+1), and summed (Σ); the sign of the sum is the prediction; update on mispredictions or if |SUM| < θ]
Branch Predictors. Fig.  The perceptron predictor
Building on top of the neural-net-inspired predictors, the GEometric History Length or GEHL predictor
(Fig. ) was proposed in  []. This predictor combines a few global history predictors (typically from 
to ) indexed with different history lengths. The prediction is computed as the sign of the sum of the read
predictions. The set of history lengths forms a geometric series, for instance, , , , , , , , . The use of
such a geometric series makes it possible to concentrate most of
the storage budget on short histories while still capturing long-run correlations on very long histories. Using a
medium number of predictor tables (–) and a maximal
history length of more than  bits was shown to be
realistic.
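The geometric history lengths and the sum-based prediction can be sketched in Python. This is a minimal illustration under assumed parameters: the length formula, table size, counter width, threshold, and the use of Python's built-in `hash` as the index function are hypothetical choices, and the example lengths are not the series quoted above.

```python
def geometric_lengths(tables, first, ratio):
    """L(0) = 0 and L(i) = ratio**(i-1) * first, rounded, for i >= 1."""
    return [0] + [int(ratio ** (i - 1) * first + 0.5) for i in range(1, tables)]

class GehlSketch:
    def __init__(self, lengths, index_bits=10, theta=8):
        self.lengths = lengths
        self.size = 1 << index_bits
        self.theta = theta
        self.tables = [[0] * self.size for _ in lengths]    # signed counters
        self.history = [0] * max(lengths)                   # most recent outcome first

    def _indices(self, pc):
        # Table i is indexed with the pc and the L(i) most recent history bits.
        return [hash((pc, tuple(self.history[:n]))) % self.size for n in self.lengths]

    def _sum(self, pc):
        return sum(table[i] for table, i in zip(self.tables, self._indices(pc)))

    def predict(self, pc):
        return self._sum(pc) >= 0                           # sign of the sum

    def update(self, pc, taken):
        s = self._sum(pc)
        if (s >= 0) != taken or abs(s) <= self.theta:
            for table, i in zip(self.tables, self._indices(pc)):
                table[i] = min(7, table[i] + 1) if taken else max(-8, table[i] - 1)
        self.history = [int(taken)] + self.history[:-1]
```

For example, `geometric_lengths(8, 2, 2.0)` gives eight lengths that double from 2 up to 128, with a first table that uses no history at all.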
[Figure: tables T0 to T4 indexed with history lengths L(0) to L(4); final prediction computation through a sum (Σ); prediction = sign]
Branch Predictors. Fig.  The GEHL predictor: tables are
indexed using different global history lengths that form a
geometric series
While using an adder tree was shown to be an effective final prediction computation function by the GEHL
predictor, partial tag matching may be even more storage effective. The TAGE predictor (Fig. ) proposed in
 [] uses the geometric history length principle,
but relies on partial tag matching for the final prediction
computation. Each table in the predictor is read, and the
prediction is provided by the hitting predictor component with the longest history. If there is no hitting component, then the prediction is provided by the default
predictor.
A realistic-size TAGE predictor currently represents the state
of the art in conditional branch predictors [], and its
misprediction rate is within % of the currently known
limits for conditional branch predictability [].
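The longest-match selection at the heart of TAGE can be sketched in Python. This is a heavily simplified illustration: the history lengths, table sizes, counter ranges, the use of Python's built-in `hash` for index and tag, and the allocation policy (one new entry in the next longer table on a misprediction) are all assumptions, and the usefulness field and its management are omitted.

```python
class TageSketch:
    def __init__(self, lengths=(4, 8, 16), index_bits=8, tag_bits=8):
        self.lengths = lengths
        self.size = 1 << index_bits
        self.tag_mask = (1 << tag_bits) - 1
        self.base = [1] * self.size                  # tagless base: 2-bit counters
        self.tables = [dict() for _ in lengths]      # index -> [tag, signed counter]
        self.history = []                            # most recent outcome first

    def _key(self, pc, t):
        x = hash((pc, tuple(self.history[:self.lengths[t]])))
        return x % self.size, (x // self.size) & self.tag_mask

    def predict(self, pc):
        """Return (prediction, provider); provider is None for the base predictor."""
        for t in reversed(range(len(self.lengths))):     # longest history first
            index, tag = self._key(pc, t)
            entry = self.tables[t].get(index)
            if entry is not None and entry[0] == tag:
                return entry[1] >= 0, t
        return self.base[pc % self.size] >= 2, None

    def update(self, pc, taken):
        prediction, provider = self.predict(pc)
        if provider is None:
            i = pc % self.size
            self.base[i] = min(3, self.base[i] + 1) if taken else max(0, self.base[i] - 1)
        else:
            entry = self.tables[provider][self._key(pc, provider)[0]]
            entry[1] = min(3, entry[1] + 1) if taken else max(-4, entry[1] - 1)
        longer = 0 if provider is None else provider + 1
        if prediction != taken and longer < len(self.lengths):
            index, tag = self._key(pc, longer)
            self.tables[longer][index] = [tag, 0 if taken else -1]
        self.history = ([int(taken)] + self.history)[:max(self.lengths)]
```

The `predict` method mirrors the rule stated above: the hitting component with the longest history provides the prediction, and the tagless base predictor is used when no tag matches.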
Predicting Indirect Branch Targets
Return and indirect jump targets are also only known at
execution time and must be predicted.
[Figure: the pc indexes a tagless base predictor; hashes of the pc with global histories h[0:L1], h[0:L2], and h[0:L3] index tagged tables whose entries hold a counter (ctr), a tag, and a u field; tag comparisons (=?) select the table that provides the prediction]
Branch Predictors. Fig.  The TAGE predictor: partial tag match determines the prediction
Predicting Return Targets
Kaeli and Emma [] remarked that, in most cases and
particularly for compiler generated code the return
address of procedure calls obey a very simple call-return
rule: The return address target is the address just following the call instruction. They also remarked that the
return addresses of the procedures could be predicted
through a simple return address stack (RAS). On a call,
the address of the next instruction in sequence is pushed
on the top of the stack. On a return, the target of the
return is popped from the top of the stack.
When the code is generated following the call–return
rule and if there is no branch misprediction
between the call and the return, an infinite-size RAS predicts the return target with % accuracy. However,
several difficulties arise in practical implementations.
The call–return rule is not always respected. The RAS is
size limited, but in practice a -entry RAS is sufficient
for most applications. The main difficulty is associated
with speculative execution: When returns followed by one or more calls
are fetched on the wrong path,
valid RAS entries are corrupted with wrong information, thus generating mispredictions on the returns on
the right path. Several studies [, , ] have addressed
this issue, and the effective accuracy of return target
prediction is close to perfect.
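The push-on-call, pop-on-return behavior of a size-limited RAS can be sketched in Python. This is a minimal illustration: the class name, the default of 16 entries, and the fixed instruction size of 4 bytes are assumptions, and the repair of entries corrupted by wrong-path execution is not modeled.

```python
from collections import deque

class ReturnAddressStack:
    def __init__(self, entries=16):
        # A bounded deque drops the oldest entry when the stack overflows.
        self.stack = deque(maxlen=entries)

    def on_call(self, call_pc, call_size=4):
        """Push the address of the instruction that follows the call."""
        self.stack.append(call_pc + call_size)

    def on_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None
```

With nested calls that fit in the stack, every return is predicted correctly; once the nesting depth exceeds the number of entries, the outermost return addresses are lost.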
Indirect Jump Target Predictions
To predict the targets of indirect jumps, one can use
the same kind of information that can be used for
predicting the direction of conditional branches, i.e.,
the global branch history or program path history.
First, it was proposed to use the last encountered
target, i.e., just the entry in the BTB. Chang et al. [] proposed to use a predictor indexed by the global history
and the program counter. Driesen and Holzle [] proposed to use a hybrid predictor based on tag matching.
Finally, Seznec and Michaud [] proposed ITTAGE, a
multicomponent indirect jump predictor based on partial tag matching and on geometric history lengths,
as in TAGE.
Conclusion
In modern processors, the whole set of branch predictors (conditional, returns, indirect jumps, BTB) is
an important performance enabler. They are particularly important on deeply pipelined wide-issue superscalar processors, but they are also becoming important
in low-power embedded processors, as these also
implement instruction-level parallelism.
A large body of research has addressed branch prediction during the s and the early s, and complex predictors inspired by this research have been
implemented in processors during the last decade. Current state-of-the-art branch predictors combine multiple predictions and rely on the use of global history.
These predictors cannot deliver a prediction in a single cycle. Since a prediction is needed on the very next
cycle, various techniques such as overriding predictors
[] or ahead pipelined predictors [] were proposed
and implemented.
Branch predictor accuracy seems to have reached a plateau since the introduction of TAGE. Radically new predictor ideas, perhaps exploiting new sources of predictability, will probably be needed to further increase predictor accuracy.
Bibliography
. Chang P-Y, Hao E, Patt YN () Target prediction for indirect
jumps. In: ISCA ’: Proceedings of the th annual international
symposium on Computer architecture, Denver, – June .
ACM Press, New York, pp –
. Diefendorff K () Compaq chooses SMT for alpha. Microprocessor Report, Dec 
. Driesen K, Holzle U () The cascaded predictor: economical
and adaptive branch target prediction. In: Proceeding of the st
Annual ACM/IEEE International Symposium on Microarchitecture, Dallas, Dec , pp –
. Eden AN, Mudge T () The yags branch predictor. In: Proceedings of the st Annual International Symposium on Microarchitecture, Dallas, Dec 
. Evers M () Improving branch behavior by understanding
branch behavior. Ph.D. thesis, University of Michigan
. Jiménez D () Reconsidering complex branch predictors. In:
Proceedings of the th International Symposium on High Performance Computer Architecture, Anaheim, – Feb . IEEE,
Los Alamitos
. Jiménez DA, Lin C () Dynamic branch prediction with perceptrons. In: HPCA: Proceedings of the th International Symposium on High Performance Computer Architecture, Monterrey,
– Jan . IEEE, Los Alamitos, pp –
. Jourdan S, Hsing T-H, Stark J, Patt YN () The effects of
mispredicted-path execution on branch prediction structures. In:
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Boston, – Oct . IEEE,
Los Alamitos
. Kaeli DR, Emma PG () Branch history table prediction of
moving target branches due to subroutine returns. SIGARCH
Comput Archit News ():–
. Lee C-C, Chen I-C, Mudge T () The bi-mode branch predictor. In: Proceedings of the th Annual International Symposium
on Microarchitecture, Dec 
. Lee J, Smith A () Branch prediction strategies and branch
target buffer design. IEEE Comput ():–
. McFarling S () Combining branch predictors, TN ,
DECWRL, Palo Alto, June 
. Michaud P, Seznec A, Uhlig R () Trading conflict and capacity aliasing in conditional branch predictors. In: Proceedings of
the th Annual International Symposium on Computer Architecture (ISCA), Denver, – June . ACM, New York
. Pan S, So K, Rahmeh J () Improving the accuracy of dynamic
branch prediction using branch correlation. In: Proceedings of the
th International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, – Oct
. ACM, New York
. Seznec A () Analysis of the o-gehl branch predictor. In: Proceedings of the nd Annual International Symposium on Computer Architecture, Madison, – June . IEEE, Los Alamitos
. Seznec A () The idealistic gtl predictor. J Instruction Level
Parallelism. http://www.jilp.org/vol
. Seznec A () The l-tage branch predictor. J Instruction Level
Parallelism. http://www.jilp.org/vol
. Seznec A, Fraboulet A () Effective ahead pipelining of the
instruction address generator. In: Proceedings of the th Annual
International Symposium on Computer Architecture, San Diego,
– June . IEEE, Los Alamitos
. Seznec A, Michaud P () A case for (partially)-tagged geometric history length predictors. J Instruction Level Parallelism.
http://www.jilp.org/vol
. Skadron K, Martonosi M, Clark D () Speculative updates
of local and global branch history: a quantitative analysis. J
Instruction-Level Parallelism 
. Smith J () A study of branch prediction strategies. In: Proceedings of the th Annual International Symposium on Computer
Architecture, May . ACM, New York, pp –
. Sprangle E, Chappell R, Alsup M, Patt Y () The agree predictor: a mechanism for reducing negative branch history interference. In: th Annual International Symposium on Computer
Architecture, Denver, – June . ACM, New York
. Vandierendonck H, Seznec A () Speculative return address
stack management revisited. ACM Trans Archit Code Optim
():–
. Yeh T-Y, Patt Y () Two-level adaptive branch prediction. In:
Proceedings of the th International Symposium on Microarchitecture, Albuquerque, – Nov . ACM, New York
. Yeh T-Y, Patt YN () Two-level adaptive training branch prediction. In: Proceedings of the th Annual International Symposium on Microarchitecture, Albuquerque, – Nov . ACM,
New York
Brent’s Law
Brent’s Theorem
Brent’s Theorem
John L. Gustafson
Intel Corporation, Santa Clara, CA, USA
Synonyms
Brent’s law
Definition
Assume a parallel computer where each processor can perform an arithmetic operation in unit time. Further, assume that the computer has exactly enough processors to exploit the maximum concurrency in an algorithm with N operations, such that T time steps suffice. Brent’s Theorem says that a similar computer with fewer processors, P, can perform the algorithm in time
$$T_P \le T + \frac{N - T}{P},$$
where P is less than or equal to the number of processors needed to exploit the maximum concurrency in the algorithm.
Discussion
Brent’s Theorem assumes a PRAM (Parallel Random Access Machine) model [, ]. The PRAM model is an idealized construct that assumes any number of processors can access any items in memory instantly, but then take unit time to perform an operation on those items. Thus, PRAM models can answer questions about how much arithmetic parallelism one can exploit in an algorithm if the communication cost were zero. Since many algorithms have very large amounts of theoretical PRAM-type arithmetic concurrency, Brent’s Theorem bounds the theoretical time the algorithm would take on a system with fewer processors.
Example
Suppose the task is to solve the following system of two equations in two unknowns, u and v:
au + bv = x
cu + dv = y
One solution method is Cramer’s Rule, which is far less efficient than Gaussian elimination in general, but exposes so much parallelism that it can take less time on a PRAM computer. Figure 1 shows how a PRAM with six processors can calculate the solution values, u and v, in only three time steps:
PRAM variables: a, b, c, d, x, y
Timestep 1: a×d, b×c, c×y, d×x, a×y, b×x
Timestep 2: ad – bc = D, dx – cy, ay – bx
Timestep 3: (dx – cy) ÷ D, (ay – bx) ÷ D
Brent’s Theorem. Fig. 1 Concurrency of a solver for two equations in two unknowns
Thus, T = 3 time steps, and P ≤ 6, the amount of concurrency in the first time step. The total number of arithmetic operations, N, is 11 (six multiplications, followed by three subtractions, followed by two divisions). Brent’s Theorem tells us an upper bound on the time a PRAM with four processors would take to perform the algorithm:
$$T_P \le T + \frac{N - T}{P}, \quad \text{so} \quad T_4 \le 3 + \frac{11 - 3}{4} = 5.$$
Timestep 1: a×d, b×c, c×y, d×x
Timestep 2: a×y, b×x
Timestep 3: ad – bc = D, dx – cy, ay – bx
Timestep 4: (dx – cy) ÷ D, (ay – bx) ÷ D
Brent’s Theorem. Fig. 2 Linear solver on only four processors
Brent’s Theorem says the four-processor system should require no more than five time steps. Figure 2 shows that four time steps suffice in this case, fewer than the five in the bound predicted by Brent’s Theorem. The figure omits dataflow arrows, for clarity.
[Figure: inputs k1 through k16 are added in pairs over Timesteps 1 to 4]
Brent’s Theorem. Fig. 3 Binary sum collapse with a maximum parallelism of eight processors
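The arithmetic of the example can be checked with a short sketch. It assumes the simplest schedule, in which each time step of the original algorithm is executed in as many sub-steps as P processors require; the function names are illustrative and not part of the original entry.

```python
from fractions import Fraction


def brent_bound(level_sizes, processors):
    """Upper bound T + (N - T) / P from Brent's Theorem.

    level_sizes[j] is the number of operations that can run concurrently
    in time step j of the fully parallel algorithm.
    """
    steps = len(level_sizes)          # T
    operations = sum(level_sizes)     # N
    return steps + Fraction(operations - steps, processors)


def level_by_level_time(level_sizes, processors):
    """Time when every original step is split into ceil(n_j / P) sub-steps."""
    return sum(-(-n // processors) for n in level_sizes)


# Cramer's Rule example: six multiplications, three subtractions, two divisions.
cramer_levels = [6, 3, 2]
four_processor_time = level_by_level_time(cramer_levels, 4)
four_processor_bound = brent_bound(cramer_levels, 4)
```

With four processors the level-by-level schedule needs four steps, within the bound of five, which matches the four-processor schedule of the example.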
Asymptotic Example: Parallel Summation
A common use of Brent’s Theorem is to place bounds
on the applicability of parallelism to problems parameterized by size. For example, to find the sum of a list of
n numbers, a PRAM can add the numbers in pairs, then
the resulting sums in pairs, and so on until the result is
a single summation value. Figure 3 shows such a binary
sum collapse for a set of 16 numbers.
In the following discussion, the notation lg(x) denotes the logarithm base 2 of x, and the notation ⌈x⌉ denotes the “ceiling” function, the smallest integer greater than or equal to x.
If n is exactly a power of 2, such that $n = 2^T$, then the PRAM can complete the sum in T time steps. That is, T = lg(n). If n is not an integer power of 2, then T = ⌈lg(n)⌉ because the summation takes one additional step. The total number of arithmetic operations for the summation of n numbers is n − 1 additions.
The maximum parallelism occurs in the first time step, when n/2 processors can operate on the list concurrently. If the number of processors in a PRAM is P, and P is less than n/2, then Brent’s Theorem shows [] that the execution time can be made less than or equal to
$$T_P = \lceil \lg(n) \rceil + \frac{(n - 1) - \lceil \lg(n) \rceil}{P}.$$
For values of n much larger than P, the above formula shows $T_P \approx n/P$.
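A small simulation makes the summation bound concrete. The sketch below assumes a level-by-level schedule in which an unpaired element is carried to the next level unchanged; the function names are hypothetical.

```python
from fractions import Fraction


def pairwise_sum(values, processors):
    """Sum a non-empty list by a binary collapse; return (total, time_steps).

    Each level adds neighbours in pairs. A level with m pairs costs
    ceil(m / processors) time steps on a machine with that many processors.
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        pairs = len(level) // 2
        next_level = [level[2 * i] + level[2 * i + 1] for i in range(pairs)]
        if len(level) % 2:
            next_level.append(level[-1])  # odd element waits for the next level
        steps += -(-pairs // processors)
        level = next_level
    return level[0], steps


def brent_sum_bound(n, processors):
    """ceil(lg n) + ((n - 1) - ceil(lg n)) / P, as an exact fraction."""
    depth = (n - 1).bit_length()  # equals ceil(lg n) for n >= 1
    return depth + Fraction(n - 1 - depth, processors)


total, steps = pairwise_sum(range(1, 17), processors=4)
bound = brent_sum_bound(16, processors=4)
```

For 16 numbers on four processors the collapse takes 5 steps against a bound of 6.75, and for large n the step count approaches n/P.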
Proof of Brent’s Theorem
Let $n_j$ be the number of operations in step j that the PRAM can execute concurrently, where j is 1, 2, . . . , T. Assume the PRAM has P processors, where $P \le \max_j(n_j)$.
A property of the ceiling function is that for any two positive integers k and m,
$$\left\lceil \frac{k}{m} \right\rceil \le \frac{k + m - 1}{m}.$$
Hence, the time for the PRAM to perform each time step j has the following bound:
$$\left\lceil \frac{n_j}{P} \right\rceil \le \frac{n_j + P - 1}{P}.$$
The time for all time steps has the bound
$$T_P \le \sum_{j=1}^{T} \left\lceil \frac{n_j}{P} \right\rceil \le \sum_{j=1}^{T} \frac{n_j + P - 1}{P} = \sum_{j=1}^{T} \frac{n_j}{P} + \sum_{j=1}^{T} \frac{P}{P} - \sum_{j=1}^{T} \frac{1}{P}.$$
Since the sum of the $n_j$ is the total number of operations N, and the sum of 1 for T steps is simply T, this simplifies to
$$T_P \le \frac{\sum_{j=1}^{T} n_j}{P} + \sum_{j=1}^{T} 1 - \frac{\sum_{j=1}^{T} 1}{P} = \frac{N}{P} + T - \frac{T}{P} = T + \frac{N - T}{P}.$$
Application to Superlinear Speedup
Ideal speedup is speedup that is linear in the number of
processors. Under the strict assumptions of Brent’s Theorem, superlinear speedup (or superunitary efficiency,
sometimes erroneously termed superunitary speedup) is impossible.
Superlinear speedup means that the time to perform an algorithm on a single processor is more than P times as long as the time to perform it on P processors. In other words, P processors are more than P times as fast as a single processor. According to Brent’s Theorem with P = 1,
$$T_1 \le T + \frac{N - T}{1} = N.$$
Speedup greater than the number of processors means $T_1 > P \cdot T_P$. Since P processors can perform at most P operations per time step, $N \le P \cdot T_P$, and therefore $T_1 \le N \le P \cdot T_P$. Brent’s Theorem thus says superlinear speedup is mathematically impossible. However, Brent’s Theorem is based
on the assumptions of the PRAM model that have little resemblance to real computer design, such as the
assumptions that every arithmetic operation requires
the same amount of time, and that memory access has
zero latency and infinite bandwidth. The debate over
the possibility of superlinear speedup first appeared in
 [, ], and the debate highlighted the oversimplification of the PRAM model for parallel computing
performance analysis.
Perspective
The PRAM model used by Brent’s Theorem assumes
that communication has zero cost, and arithmetic work
constitutes all of the work of an algorithm. The opposite is closer to the present state of computer technology (see Memory Wall), which greatly diminishes
the usefulness of Brent’s Theorem in practical problem
solving. It is so esoteric that it does not even provide
useful upper or lower bounds on how parallel processing might improve execution time, and modern performance analysis seldom uses it. Brent’s Theorem is
a mathematical model more related to graph theory
and partial orderings than to actual computer behavior.
When Brent constructed his model in , most computers took longer to perform arithmetic on operands
than they took to fetch and store the operands, so the
approximation was appropriate. More recent abstract
models of parallelism take into account communication costs, both bandwidth and latency, and thus can
provide better guidance for the parallel performance
bounds of current architectures.
Related Entries
Bandwidth-Latency Models (BSP, LogP)
Memory Wall
PRAM (Parallel Random Access Machines)
Bibliographic Notes and Further
Reading
As explained above, Brent’s Theorem reflects the computer design issues of the s, and readers should view
Brent’s original  paper in this context. It was in 
that the esoteric nature of the PRAM model became
clear, in the opposing papers [] and []. Faber, Lubeck,
and White [] assumed the PRAM model to state that
one cannot obtain superlinear speedup since computing
resources increase linearly with the number of processors. Parkinson [], having had experience with the ICL
Distributed Array Processor (DAP), based his model
and experience on premises very different from the
hypothetical PRAM model used in Brent’s Theorem.
Parkinson noted that simple constructs like the sum of
two vectors could take place superlinearly faster on P
processors because a shared instruction to add, sent to
the P processors, does not require the management of a
loop counter and address increment that a serial processor requires. The  work Introduction to Algorithms
by Cormen, Leiserson, and Rivest [] is perhaps the first
textbook to formally recognize that the PRAM model is
too simplistic, and thus Brent’s Theorem has diminished
predictive value.
Bibliography
. Brent RP () The parallel evaluation of general arithmetic
expressions. J ACM ():–
. Cole R () Faster optimal parallel prefix sums and list ranking.
Inf Control ():–
. Cormen TH, Leiserson CE, Rivest RL () Introduction to algorithms, MIT Press Cambridge
. Faber V, Lubeck OM, White AB () Superlinear speedup of
an efficient sequential algorithm is not possible. Parallel Comput
:–
. Helmbold DP, McDowell CE () Modeling speedup (n) greater
than n. In: Proceedings of the international conference on parallel
processing, :–
. Parkinson D () Parallel efficiency can be greater than unity.
Parallel Comput :–
. Smith JR () The design and analysis of parallel algorithms.
Oxford University Press, New York
Broadcast
Jesper Larsson Träff , Robert A. van de Geijn

University of Vienna, Vienna, Austria

The University of Texas at Austin, Austin, TX, USA
Synonyms
One-to-all broadcast; Copy
Definition
Among a group of processing elements (nodes), a designated root node has a data item to be communicated (copied) to all other nodes. The broadcast operation performs this collective communication operation.
Discussion
The reader may consider first visiting the entry on Collective Communication.
Let p be the number of nodes in the group that participate in the broadcast operation and number these nodes consecutively from 0 to p − 1. One node, the root with index r, has a vector of data, x of size n, to be communicated to the remaining p − 1 nodes:
Before: node r holds x; no other node holds x.
After: every node holds x.
All nodes are assumed to explicitly take part in the broadcast operation. It is generally assumed that before its execution all nodes know the index of the designated root node as well as the amount n of data to be broadcast. The data item x may be either a single, atomic unit or divisible into smaller, disjoint pieces. The latter can be exploited algorithmically when n is large.
Broadcast is a key operation on parallel systems with distributed memory. On shared memory systems, broadcast can be beneficial for improving locality and/or avoiding memory conflicts.
Lower Bounds
To obtain some simple lower bounds, it is assumed that
● All nodes can communicate through a communication network.
● Individual nodes perform communication operations that send and/or receive individual messages.
● Communication is through a single port, such that a node can be involved in at most one communication operation at a time. Such an operation can be either a send to or a receive from another node (unidirectional communication), a combined send to and receive from another node (bidirectional, telephone-like communication), or a send to and receive from two possibly different nodes (simultaneous send-receive, fully bidirectional communication).
● It is assumed that the communication medium is homogeneous and fully connected such that all nodes can communicate with the same costs and any maximal set of pairs of disjoint nodes can communicate simultaneously.
● A reasonable first approximation for the time for transferring a message of size n between (any) two nodes is α + nβ where α is the start-up cost (latency) and β is the cost per item transferred (inverse of the bandwidth).
Under these assumptions, two lower bounds for the broadcast operation can be easily justified, the first for the α term and the second for the β term:
● $\lceil \log_2 p \rceil \alpha$. Define a round of communication as a period during which each node can send at most one message and receive at most one message. In each round, the number of nodes that know message x can at most double, since each node that has x can send x to a new node. Thus, a minimum of $\lceil \log_2 p \rceil$ rounds are needed to broadcast the message. Each round costs at least α.
● nβ. If p > 1 then the message must leave the root node, requiring a time of at least nβ.
When assumptions about the communication system change, these lower bounds change as well. Lower bounds for mesh and torus networks, hypercubes, and many other communication networks are known.
Tree-Based Broadcast Algorithms
A well-known algorithm for broadcasting is the so-called Minimum Spanning Tree (MST) algorithm (which is a misnomer, since all broadcast trees are spanning trees and minimum for homogeneous communication media), which can be described as follows:
● Partition the set of nodes into two roughly equal-sized subsets.
● Send x from the root to a node in the subset that does not include the root. The receiving node will become a local root in that subset.
● Recursively broadcast x from the (local) root nodes in the two subsets.
[Figure: (a) Path; (b) Binary tree; (c) Fibonacci trees F0, F1, F2, F3, F4; (d) Binomial trees B0, B1, B2, B3, B4; (e) Star graph]
Broadcast. Fig. 1 Commonly used broadcast trees: (a) Path; (b) Binary Tree; (c) Fibonacci trees, the ith Fibonacci tree for i > 1 consists of a new root node connected to Fibonacci trees $F_{i-1}$ and $F_{i-2}$; (d) Binomial trees, the ith binomial tree for i > 0 consists of a new root with children $B_{i-1}, B_{i-2}, \ldots$; (e) Star
Under the stated communication model, the total cost
of the MST algorithm is ⌈log p⌉(α + nβ). It achieves
the lower bound for the α term but not for the β term
for which it is a logarithmic factor off from the lower
bound.
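The recursive description of the MST algorithm can be turned into a round-by-round schedule. The following is a sketch under the stated single-port model; it assumes nodes are numbered 0 to p − 1 and split into contiguous halves, and the choice of which node in the other half receives x is an illustrative one.

```python
def mst_broadcast_schedule(p, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs."""
    rounds = []

    def recurse(lo, hi, local_root, depth):
        # Broadcast within the contiguous node range lo .. hi - 1.
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        if local_root < mid:
            partner = mid                  # first node of the upper half
            lower_root, upper_root = local_root, partner
        else:
            partner = lo                   # first node of the lower half
            lower_root, upper_root = partner, local_root
        while len(rounds) <= depth:
            rounds.append([])
        rounds[depth].append((local_root, partner))
        recurse(lo, mid, lower_root, depth + 1)
        recurse(mid, hi, upper_root, depth + 1)

    recurse(0, p, root, 0)
    return rounds


def mst_broadcast_cost(p, n, alpha, beta):
    """Number of rounds times the per-message cost alpha + n * beta."""
    return len(mst_broadcast_schedule(p)) * (alpha + n * beta)


schedule = mst_broadcast_schedule(8, root=0)
```

For p = 8 the schedule has three rounds with one, two, and four transfers, so the number of informed nodes doubles in every round.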
If the data transfer between two nodes is represented by a communication edge between the two nodes, the MST algorithm constructs a binomial spanning tree over the set of nodes. An equivalent (also recursive) construction is as follows. The 0th binomial tree $B_0$ consists of a single node. For i > 0 the ith binomial tree $B_i$ consists of the root with i children $B_j$ for j = i − 1, . . . , 0. The number of nodes in $B_i$ is $2^i$. It can be seen that the number of nodes at level i is $\binom{\log_2 p}{i}$, from which the term binomial tree originates.
The construction and structure of the binomial (MST) tree is shown in Fig. 1d.
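The recursive construction and the level counts can be verified directly. In this sketch a tree is represented, purely for illustration, as the list of its child subtrees.

```python
from math import comb


def binomial_tree(i):
    """Return B_i as a nested list: the root's children are B_{i-1}, ..., B_0."""
    return [binomial_tree(j) for j in range(i - 1, -1, -1)]


def nodes_per_level(tree):
    """Count the nodes on each level, starting with the root level."""
    counts = []
    frontier = [tree]
    while frontier:
        counts.append(len(frontier))
        frontier = [child for node in frontier for child in node]
    return counts


levels = nodes_per_level(binomial_tree(4))
matches_binomial_coefficients = levels == [comb(4, level) for level in range(5)]
```

For $B_4$, which spans p = 16 nodes, the level counts are the binomial coefficients of 4 and sum to $2^4$.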
Pipelining
For large item sizes, pipelining is a general technique to improve the broadcast cost. Assume first that the nodes are communicating along a directed path with node 0 being the root and node i sending to node i + 1 for 0 ≤ i < p − 1 (node p − 1 only receives data). This is shown in Fig. 1a. The message to be broadcast is split into k blocks of size n/k each. In the first round of the algorithm, the first block is sent from node 0 to node 1. In the second round, this block is forwarded to node 2 while the second block is sent to node 1. In this fashion, a pipeline is established that communicates the blocks to all nodes.
Under the prior model, the cost of this algorithm is
$$(k + p - 2)\left(\alpha + \frac{n}{k}\beta\right) = (k + p - 2)\alpha + \frac{k + p - 2}{k} n\beta = (p - 2)\alpha + n\beta + k\alpha + \frac{p - 2}{k} n\beta.$$
It takes p − 1 communication rounds for the first piece to reach node p − 1, which afterward in each successive round receives a new block. Thus, an additional k − 1 rounds are required for a total of k + p − 2 rounds. Balancing the $k\alpha$ term against the $\frac{p-2}{k} n\beta$ term gives a minimum time of
$$\left(\sqrt{(p - 2)\alpha} + \sqrt{\beta n}\right)^2 = (p - 2)\alpha + 2\sqrt{(p - 2) n \alpha \beta} + n\beta$$
and best possible number of blocks $k = \sqrt{(p - 2)\beta n / \alpha}$. This meets the lower bound for the β term, but not for the α term. For very large n compared to p the linear pipeline can be a good (practical) algorithm. It is straightforward to implement and has very small extra overhead.
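The round count and the balancing argument for the linear pipeline can be checked numerically. The sketch below assumes p ≥ 2 and the simultaneous send-receive variant of the single-port model; the function names and the sample parameter values are illustrative only.

```python
from math import sqrt


def simulate_rounds(p, k):
    """Count rounds until the last node on the path holds all k blocks (p >= 2)."""
    held = [k] + [0] * (p - 1)   # number of blocks each node holds
    rounds = 0
    while held[-1] < k:
        # Decide all transfers from the state at the start of the round.
        senders = [i for i in range(p - 1) if held[i] > held[i + 1]]
        for i in senders:
            held[i + 1] += 1
        rounds += 1
    return rounds


def linear_pipeline_cost(p, n, k, alpha, beta):
    """(k + p - 2) rounds, each costing alpha + (n / k) * beta."""
    return (k + p - 2) * (alpha + (n / k) * beta)


def best_block_count(p, n, alpha, beta):
    """Block count that balances k * alpha against (p - 2) * n * beta / k."""
    return sqrt((p - 2) * beta * n / alpha)


def minimum_cost(p, n, alpha, beta):
    return (sqrt((p - 2) * alpha) + sqrt(beta * n)) ** 2


# Sample values chosen so that the optimal block count is an integer.
k_opt = best_block_count(p=10, n=800, alpha=1.0, beta=1.0)
cost_at_k_opt = linear_pipeline_cost(10, 800, int(k_opt), 1.0, 1.0)
```

In practice k must be an integer, so an implementation would evaluate the cost at the integers adjacent to the real-valued optimum.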
For the common case where p and/or the start-up latency α is large compared to n, the linear pipeline suffers from the nonoptimal (p − 2)α term, and algorithms with a shorter longest path from root node to receiving nodes will perform better. Such algorithms can apply pipelining using different, fixed degree trees, as illustrated in Fig. 1a–c. For instance, with a balanced binary tree as shown in Fig. 1b the broadcast time becomes
$$2(\log_2 p - 1)\alpha + 4\sqrt{(\log_2 p - 1)\alpha}\,\sqrt{\beta n} + 2\beta n.$$
The latency of this algorithm is significantly better than the linear pipeline, but the β term is a factor of 2 off from optimal. By using skewed trees instead, the time at which the last node receives the first block of x can be improved, which affects both the α and β terms. The Fibonacci tree shown in Fig. 1c, for instance, achieves a broadcast cost of
$$(\log_\Theta p - 1)\alpha + 2\sqrt{(\log_\Theta p - 1)\alpha}\,\sqrt{\beta n} + \beta n,$$
where $\Theta = \frac{1 + \sqrt{5}}{2}$. Finally, it should be noted that pipelining cannot be employed with any advantage for trees with nonconstant degrees like the binomial tree and the degenerated star tree shown in Fig. 1d–e.
A third lower bound for broadcasting of k data items can be justified similarly to the two previously introduced bounds, and is applicable to algorithms that apply pipelining.
● $k - 1 + \lceil \log_2 p \rceil$ rounds. First k − 1 items have to leave the root, and the number of rounds for the last item to arrive at some last node is an additional $\lceil \log_2 p \rceil$.
With this bound, the best possible broadcast cost in the linear cost model is therefore $(k - 1 + \log_2 p)(\alpha + \beta n / k)$ when x is (can be) divided into k roughly equal-sized blocks of size at most ⌈n/k⌉. Balancing the $k\alpha$ term against the $(\log_2 p - 1)\beta n / k$ term achieves a minimum time of
$$\left(\sqrt{(\log_2 p - 1)\alpha} + \sqrt{\beta n}\right)^2 = (\log_2 p - 1)\alpha + 2\sqrt{(\log_2 p - 1)\alpha}\,\sqrt{\beta n} + \beta n$$
with the best number of blocks being $k = \sqrt{(\log_2 p - 1)\beta n / \alpha}$.
Simultaneous Trees
None of the tree-based algorithms is optimal in the
sense of meeting the lower bound for both the α and the
β terms. A practical consequence of this is that broadcast implementations become cumbersome in that algorithms for different combinations of p and n have to be
maintained. The breakthrough results of Johnsson and
Ho [] show how employing multiple, simultaneously
active trees can be used to overcome these limitations
and yield algorithms that are optimal in the number of
communication rounds. The results were at first formulated for hypercubes or fully connected communication
networks where p is a power of two, and (much) later
extended to communication networks with arbitrary
number of nodes. The basic idea can also be employed
for meshes and hypercubes.
The idea is to embed simultaneous, edge disjoint
spanning trees in the communication network, and use
each of these trees to broadcast different blocks of the
input. If this embedding can be organized such that
each node has the same number of incoming edges
(from different trees) and outgoing edges (to possibly
different trees), a form of pipelining can be employed,
even if the individual trees are not necessarily fixed
degree trees. To illustrate the idea, consider the threedimensional hypercube, and assume that an infinite
number of blocks have to be broadcast.
[Figure: the three-dimensional hypercube with nodes 000 through 111]
Let the nodes be numbered in binary with the
broadcast root being node 000. The three edge-disjoint
trees used for broadcasting blocks , , , . . .; , , , . . .
and , , , . . .; respectively, are as shown below:
[Figure: the three edge-disjoint spanning trees of the three-dimensional hypercube, each drawn over the nodes 000 through 111]
The trees are in fact constructed as edge disjoint
spanning trees excluding node 000 rooted at the root
hypercube neighbors 001, 010, and 100. The root is
connected to each of these trees. In round , block 
is sent to the first tree, which in rounds , , and  is
responsible for broadcasting to its spanning subtree. In
round , block  is sent to the second tree which uses
rounds , , and  to broadcast, and in round  block
 is sent to the third tree which uses rounds , , and
 to broadcast this block. In round , the root sends
the third block, the broadcasting of which takes place
simultaneously with the broadcasting of the previous
blocks, and so on. The broadcasting is done along a regular pattern, which is symmetric for all nodes. For a
d-dimensional hypercube, node i in round t, t ≥ , sends
and receives a block to and from the node found by toggling bit t mod d. As can be seen, each node is a leaf
in tree j if bit j of i is , otherwise an internal node in
tree j. Leaves in tree j = t mod d receive block t − d
in round d. When node i is an internal node in tree j,
the block received in round j is sent to the children of
i which are the nodes found by toggling the next, more
significant bits after j until the next position in i that is a
. Thus, to determine which blocks are received and sent
by node i in each round, the number of zeroes from each
bit position in i until the next, more significant  will
suffice.
The algorithm for broadcasting k blocks in the optimal $k - 1 + \log_2 p$ number of rounds is given formally in Fig. 2. To turn the description above into an algorithm for an arbitrary, finite number of blocks, the following modifications are necessary. First, any block with a negative number is neither sent nor received. Second, for blocks with a number larger than k − 1, block k − 1 is taken instead. Third, if k − 1 is not a multiple of $\log_2 p$ the broadcast is started at a later round f such that indeed k + f − 1 is a multiple of $\log_2 p$. The table $\mathrm{BitDistance}_i[j]$ stores for each node i the distance from bit position j in i to the next 1 to the left of position j (with wrap around after d bits). The root 000 has no ones. This node only sends blocks, and therefore $\mathrm{BitDistance}_0[k] = k$ for 0 ≤ k < d. For each i, 0 ≤ i < p the table can be filled in O(log p) steps.
Composing from Other Collective
Communications
Another approach to implementing the broadcast operation follows from the observation that data can be broadcast to p nodes by scattering p disjoint blocks of x of size n/p across the nodes, and then reassembling the blocks at each node by an allgather operation. On a network that can host a hypercube, the scatter and allgather operations can be implemented at a cost of $(\log_2 p)\alpha + \frac{p - 1}{p} n\beta$ each, under the previously used communication cost model (see entries on scatter and allgather). This yields a cost of
$$2(\log_2 p)\alpha + 2\,\frac{p - 1}{p}\, n\beta.$$

f ← ((k mod d) + d − 1) mod d   /* start round for first phase */
t ← 0
while t < k + d − 1 do
    /* new phase consisting of (up to) d rounds */
    for j ← f to d − 1
        s ← t − d + (1 − i_j) ∗ BitDistance_i[j]   /* block to send */
        r ← t − d + i_j ∗ BitDistance_i[j]   /* block to receive */
        if s ≥ k then s ← k − 1
        if r ≥ k then r ← k − 1
        par   /* simultaneous send-receive with neighbor */
            if s ≥ 0 then Send block s to node (i xor 2^j)
            if r ≥ 0 then Receive block r from node (i xor 2^j)
        end par
        t ← t + 1   /* next round */
    end for
    f ← 0   /* next phases start from 0 */
endwhile
Broadcast. Fig. 2 The algorithm for node i, 0 ≤ i < p, for broadcasting k blocks in a d-dimensional hypercube or fully connected network with p = 2^d nodes. The algorithm requires the optimal number of k − 1 + d rounds. The jth bit of i is denoted i_j and is used to determine the hypercube neighbor of node i for round j
The cost of this algorithm is within a factor two of the
lower bounds for both the α and β term and is considerably simpler to implement than approaches that use
pipelining.
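The cost formulas of this entry can be put side by side to see where each approach pays off. This is a sketch under the linear cost model only; the function names and the sample values of p, n, α, and β are assumptions for illustration, and the pipelined costs use the real-valued optimal block count.

```python
from math import log2, sqrt


def mst_cost(p, n, alpha, beta):
    """Binomial (MST) tree: ceil(log2 p) rounds of the full message."""
    return (p - 1).bit_length() * (alpha + n * beta)


def linear_pipeline_min_cost(p, n, alpha, beta):
    """Linear pipeline with the balancing number of blocks."""
    return (sqrt((p - 2) * alpha) + sqrt(n * beta)) ** 2


def round_optimal_min_cost(p, n, alpha, beta):
    """Cost implied by the k - 1 + log2 p round lower bound."""
    return (sqrt((log2(p) - 1) * alpha) + sqrt(n * beta)) ** 2


def scatter_allgather_cost(p, n, alpha, beta):
    """Scatter followed by allgather on a network that can host a hypercube."""
    return 2 * log2(p) * alpha + 2 * (p - 1) / p * n * beta


def compare(p, n, alpha=1.0, beta=1.0):
    return {
        "mst": mst_cost(p, n, alpha, beta),
        "linear pipeline": linear_pipeline_min_cost(p, n, alpha, beta),
        "round optimal": round_optimal_min_cost(p, n, alpha, beta),
        "scatter/allgather": scatter_allgather_cost(p, n, alpha, beta),
    }


small_message = compare(p=64, n=1)
large_message = compare(p=64, n=10**6)
```

With these sample values the MST algorithm is cheaper than scatter/allgather for the one-item message, while scatter/allgather is roughly three times cheaper for the large message, which is the usual motivation for switching algorithms by message size.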
General Graphs
Good broadcast algorithms are known for many different communication networks under various cost
models. However, the problem of finding a best broadcast schedule for an arbitrary communication network
is a hard problem. More precisely, the following problem of determining whether a given number of rounds
suffices to broadcast in an arbitrary, given graph is NPcomplete [, Problem ND]: Given an undirected
graph G = (V, E), a root vertex r ∈ V, and an integer k (number of rounds), is there a sequence of vertex and edge subsets $\{r\} = V_0, E_1, V_1, E_2, V_2, \ldots, E_k, V_k = V$ with $V_i \subseteq V$, $E_i \subseteq E$, such that each $e \in E_i$ has one endpoint in $V_{i-1}$ and one in $V_i$, no two edges of $E_i$ share an endpoint, and $V_i = V_{i-1} \cup \{w \mid (v, w) \in E_i\}$?
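Although finding a shortest schedule is hard, checking a proposed schedule against the conditions above is straightforward. The following sketch assumes each edge in a round is given as an ordered pair (sender, receiver); the function name and the data layout are illustrative.

```python
def is_valid_broadcast_schedule(vertices, edges, root, rounds):
    """Check a candidate schedule E_1, ..., E_k for the graph (vertices, edges).

    rounds[i] lists the (sender, receiver) pairs used in round i + 1.
    """
    edge_set = {frozenset(edge) for edge in edges}
    informed = {root}                          # V_0
    for round_edges in rounds:
        used = set()
        newly_informed = set()
        for sender, receiver in round_edges:
            if frozenset((sender, receiver)) not in edge_set:
                return False                   # not an edge of the graph
            if sender not in informed:
                return False                   # sender must lie in V_{i-1}
            if sender in used or receiver in used:
                return False                   # edges of a round share no endpoint
            used.update((sender, receiver))
            newly_informed.add(receiver)
        informed |= newly_informed             # V_i
    return informed == set(vertices)


path_vertices = [0, 1, 2, 3]
path_edges = [(0, 1), (1, 2), (2, 3)]
three_rounds_ok = is_valid_broadcast_schedule(
    path_vertices, path_edges, 0, [[(0, 1)], [(1, 2)], [(2, 3)]]
)
```

The decision problem asks whether any valid schedule with k rounds exists; the checker only confirms a given one, which is the easy direction.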
Related Entries
Allgather
Collective Communication
Collective Communication, Network Support for
Message Passing Interface (MPI)
PVM (Parallel Virtual Machine)
Bibliographic Notes and Further
Reading
Broadcast is one of the most thoroughly studied collective communication operations. Classical surveys
with extensive treatment of broadcast (and allgather/
gossiping) problems under various communication and
network assumptions can be found in [, ]. For a
survey of broadcasting in distributed systems, which
raises many issues not discussed here, see []. Wellknown algorithms that have been implemented in,
for instance, Message Passing Interface (MPI) libraries
include the MST algorithm, binary trees, and scatterallgather approaches. Fibonacci trees for broadcast were
explored in []. Another skewed, pipelined tree structure termed fractional trees that yield close to optimal results for certain ranges of p and n was proposed
in [].
The basic broadcast techniques discussed in this
entry date back to the early days of parallel computing [, ]. The scatter/allgather algorithm was already
discussed in [] and was subsequently popularized for
mesh architectures in []. It was further popularized for
use in MPI implementations in []. A different implementation of this paradigm was given in []. Modular
construction of hybrid algorithms from MST broadcast
and scatter/allgather is discussed in [].
The classical paper by Johnsson and Ho []
introduced the simultaneous tree algorithm that was
discussed, which they call the n-ESBT algorithm. This
algorithm can be used for networks that can host a
hypercube and has been used in practice for hypercubes and fully connected systems. It achieves the
lower bound on the number of communication rounds
needed to broadcast k blocks of data. The exposition
given here is based on [, ]. It was for a number
of years an open problem how to achieve similar optimality for arbitrary p (and k). The first round-optimal
algorithms were given in [, ], but these seem not to
have been implemented. A different, explicit construction was found and described in [], and implemented
in an MPI library. A very elegant (and practical) extension of the hypercube n-ESBT algorithm to arbitrary p
was presented in []. That algorithm uses the hypercube algorithm for the largest i such that i ≤ p. Each
node that is not in the hypercube is paired up with a
hypercube node and the two nodes in each such pair
in an alternating fashion jointly carry out the work of a
hypercube node. Yet another, optimal to a lower-order
term algorithm based on using two edge-disjoint binary
trees was given in []. Disjoint spanning tree algorithms for multidimensional mesh and torus topologies
were given in [].
The linear model used for modeling communication (transfer) costs is folklore. An arguably more accurate performance model of communication networks
is the so-called LogGP model (and its variants), which
account more accurately for the time in which processors are involved in data transfers. With this model,
yet other broadcast tree structures yield best performance [, ]. The so-called postal model in which a
message sent at some communication round is received
a number of rounds λ later at the destination was used
in [, ] and gives rise to yet more tree algorithms.
Broadcasting in heterogeneous systems has recently
received renewed attention [, , ].
Finding minimum-round schedules for broadcast
remains NP-hard for many special networks []. General approximation algorithms have been proposed, for
instance, in [].
Bibliography
. Bar-Noy A, Kipnis S () Designing broadcasting algorithms in
the postal model for message-passing systems. Math Syst Theory
():–
. Bar-Noy A, Kipnis S, Schieber B () Optimal multiple message
broadcasting in telephone-like communication systems. Discret
App Math (–):–
. Barnett M, Payne DG, van de Geijn RA, Watts J () Broadcasting on meshes with wormhole routing. J Parallel Distrib Comput
():–
. Beaumont O, Legrand A, Marchal L, Robert Y () Pipelining broadcast on heterogeneous platforms. IEEE Trans Parallel
Distrib Syst ():–
. Bruck J, De Coster L, Dewulf N, Ho C-T, Lauwereins R () On
the design and implementation of broadcast and global combine
operations using the postal model. IEEE Trans Parallel Distrib
Syst ():–
. Bruck J, Cypher R, Ho C-T () Multiple message broadcasting
with generalized fibonacci trees. In: Symposium on Parallel and
Distributed Processing (SPDP). IEEE Computer Society Press.
Arlington, Texas, USA, pp –
. Chan E, Heimlich M, Purkayastha A, van de Geijn RA () Collective communication: theory, practice, and experience. Concurr
Comput ():–
. Culler DE, Karp RM, Patterson D, Sahay A, Santos EE, Schauser
KE, Subramonian R, von Eicken T () LogP: A practical model
of parallel computation. Commun ACM ():–
. Défago X, Schiper A, Urbán P () Total order broadcast and
multicast algorithms: taxonomy and survey. ACM Comput Surveys ():–
. Fox G, Johnson M, Lyzenga M, Otto S, Salmon J, Walker D ()
Solving problems on concurrent processors, vol I. Prentice-Hall,
Englewood Cliffs
. Fraigniaud P, Lazard E () Methods and problems of communication in usual networks. Discret Appl Math (–):–
. Fraigniaud P, Vial S () Approximation algorithms for broadcasting and gossiping. J Parallel Distrib Comput :–
. Garey MR, Johnson DS () Computers and intractability: a
guide to the theory of NP-completeness. Freeman, San Francisco
(With an addendum, )
. Hedetniemi SM, Hedetniemi T, Liestman AL () A survey
of gossiping and broadcasting in communication networks. Networks :–
. Jansen K, Müller H () The minimum broadcast time problem
for several processor networks. Theor Comput Sci (&):–
. Jia B () Process cooperation in multiple message broadcast.
Parallel Comput ():–
. Johnsson SL, Ho C-T () Optimum broadcasting and personalized communication in hypercubes. IEEE Trans Comput
():–
. Kwon O-H, Chwa K-Y () Multiple message broadcasting in
communication networks. Networks :–
. Libeskind-Hadas R, Hartline JRK, Boothe P, Rae G, Swisher J
() On multicast algorithms for heterogenous networks of
workstations. J Parallel Distrib Comput :–
. Liu P () Broadcast scheduling optimization for heterogeneous cluster systems. J Algorithms ():–
. Saad Y, Schultz MH () Data communication in parallel architectures. Parallel Comput ():–

. Sanders P, Sibeyn JF () A bandwidth latency tradeoff for
broadcast and reduction. Inf Process Lett ():–
. Sanders P, Speck J, Träff JL () Two-tree algorithms for
full bandwidth broadcast, reduction and scan. Parallel Comput
:–
. Santos EE () Optimal and near-optimal algorithms for k-item
broadcast. J Parallel Distrib Comput ():–
. Thakur R, Gropp WD, Rabenseifner R () Improving the performance of collective operations in MPICH. Int J High Perform
Comput Appl :–
. Träff JL () A simple work-optimal broadcast algorithm for
message-passing parallel systems. In: Recent advances in parallel virtual machine and message passing interface. th European
PVM/MPI users’ group meeting. Lecture Notes in Computer
Science, vol . Springer, pp –
. Träff JL, Ripke A () Optimal broadcast for fully connected processor-node networks. J Parallel Distrib Comput ():
–
. Watts J, van de Geijn RA () A pipelined broadcast for multidimensional meshes. Parallel Process Lett :–
BSP
Bandwidth-Latency Models (BSP, LogP)
BSP (Bulk Synchronous Parallelism)
BSP (Bulk Synchronous Parallelism)
Alexander Tiskin
University of Warwick, Coventry, UK
Definition
Bulk-synchronous parallelism is a type of coarse-grain
parallelism, where inter-processor communication follows the discipline of strict barrier synchronization.
Depending on the context, BSP can be regarded as
a computation model for the design and analysis of
parallel algorithms, or a programming model for the
development of parallel software.
Discussion
Introduction
The model of bulk-synchronous parallel (BSP) computation was introduced by Valiant [] as a “bridging model” for general-purpose parallel computing.
The BSP model can be regarded as an abstraction
of both parallel hardware and software, and supports an approach to parallel computation that is both
architecture-independent and scalable. The main principles of BSP are the treatment of a communication
medium as an abstract fully connected network, and the
decoupling of all interaction between processors into
point-to-point asynchronous data communication and
barrier synchronization. Such a decoupling allows an
explicit and independent cost analysis of local computation, communication, and synchronization, all of which
are viewed as limited resources.
BSP Computation
The BSP Model
A BSP computer (see Fig. ) contains
● p processors; each processor has a local memory and
is capable of performing an elementary operation or
a local memory access every time unit.
● A communication environment, capable of accepting
a word of data from every processor, and delivering
a word of data to every processor, every g time units.
● A barrier synchronization mechanism, capable of
synchronizing all the processors simultaneously
every l time units.
The processors may follow different threads of computation, and have no means of synchronizing with one
another apart from the global barriers.
[Figure: BSP (Bulk Synchronous Parallelism). Fig.  The BSP computer. Processor-memory pairs (PM) numbered 0, 1, ..., p−1, attached to the communication environment COMM. ENV. (g, l).]
A BSP computation is a sequence of supersteps (see
Fig. ). The processors are synchronized between supersteps; the computation within a superstep is completely
asynchronous. Consider a superstep in which every
processor performs a maximum of w local operations,
sends a maximum of $h_{out}$ words of data, and receives a
maximum of $h_{in}$ words of data. The value w is the local
computation cost, and $h = h_{out} + h_{in}$ is the communication
cost of the superstep. The total superstep cost is defined
as w + h ⋅ g + l, where the communication gap g and the
latency l are parameters of the communication environment. For a computation comprising S supersteps with
local computation costs $w_s$ and communication costs $h_s$,
1 ≤ s ≤ S, the total cost is W + H ⋅ g + S ⋅ l, where
● $W = \sum_{s=1}^{S} w_s$ is the total local computation cost.
● $H = \sum_{s=1}^{S} h_s$ is the total communication cost.
● S is the synchronization cost.
[Figure: BSP (Bulk Synchronous Parallelism). Fig.  BSP computation. Processors 0, 1, ..., p−1 proceeding through a sequence of supersteps.]
The values of W, H, and S typically depend on the
number of processors p and on the problem size.
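The cost formula above can be evaluated mechanically. The following is a minimal sketch, assuming each superstep is described by the three maxima w, h_out and h_in; the names Superstep and bsp_cost, and the numbers in the example, are illustrative and not part of the BSP model.

```python
from dataclasses import dataclass


@dataclass
class Superstep:
    w: int      # maximum number of local operations over all processors
    h_out: int  # maximum number of words sent by any processor
    h_in: int   # maximum number of words received by any processor

    @property
    def h(self) -> int:
        # communication cost of the superstep: h = h_out + h_in
        return self.h_out + self.h_in


def bsp_cost(supersteps, g, l):
    """Return (W, H, S, total), where total = W + H*g + S*l."""
    W = sum(s.w for s in supersteps)
    H = sum(s.h for s in supersteps)
    S = len(supersteps)
    return W, H, S, W + H * g + S * l


# Hypothetical two-superstep computation on a machine with g = 4, l = 200.
steps = [Superstep(w=100, h_out=10, h_in=10), Superstep(w=50, h_out=0, h_in=5)]
print(bsp_cost(steps, g=4, l=200))
```

Keeping W, H and S separate, rather than returning only the total, mirrors the way BSP algorithms are usually analyzed: each resource is stated independently of the machine parameters g and l.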
In order to utilize the computer resources efficiently,
a typical BSP program regards the values p, g, and l as
configuration parameters. Algorithm design should aim
to minimize local computation, communication, and
synchronization costs for any realistic values of these
parameters. The main BSP design principles are
● Load balancing, which helps to minimize both the
local computation cost W and the communication
cost H
● Data locality, which helps to minimize the communication cost H
● Coarse granularity, which helps to minimize (or
sometimes to trade off) the communication cost H
and the synchronization cost S
The term “data locality” refers here to placing a piece
of data in the local memory of a processor that “needs
it most.” It has nothing to do with the locality of a processor in a specific network topology, which is actively
discouraged from use in the BSP model. To distinguish these concepts, some authors use the terms “strict
locality” [] or “co-locality” [].
The values of network parameters g, l for a specific parallel computer can be obtained by benchmarking. The benchmarking process is described in []; the
resulting lists of machine parameters can be found in
[, ].
BSP vs Traditional Parallel Models
Traditionally, much of theoretical research in parallel
algorithms has been done using the Parallel Random
Access Machine (PRAM), proposed initially in [].
This model contains
● A potentially unlimited number of processors, each
capable of performing an elementary operation
every time unit
● Global shared memory, providing uniform access
for every processor to any location in one time unit
A PRAM computation proceeds in a sequence of
synchronous parallel steps, each taking one time unit.
Concurrent reading or writing of a memory location by
several processors within the same step may be allowed
or disallowed. The number of processors is potentially
unbounded, and is often considered to be a function of
the problem size. If the number of processors p is fixed,
the PRAM model can be viewed as a special case of the
BSP model, with g = l = 1 and communication realized
by reading from/writing to the shared memory.
Since the number of processors in a PRAM can be
unbounded, a common approach to PRAM algorithm
design is to associate a different processor with every
data item. Often, the processing of every item is identical; in this special case, the computation is called data-parallel. Programming models and languages designed
for data-parallel computation can benefit significantly
from the BSP approach to cost analysis; see [] for a
more detailed discussion.
It has long been recognized that in order to be practical, a model has to impose a certain structure on the
communication and/or synchronization patterns of a
parallel computation. Such a structure can be provided
by defining a restricted set of collective communication primitives, called skeletons, each with an associated cost model (see, e.g., []). In this context, a BSP
superstep can be viewed as a simple generalized skeleton (see also []). However, current skeleton proposals
concentrate on more specialized, and somewhat more
arbitrary, skeleton sets (see, e.g., []).
The CTA/Phase Abstractions model [], which
underlies the ZPL language, is close in spirit to skeletons. A CTA computation consists of a sequence of
phases, each with a simple and well-structured communication pattern. Again, the BSP model takes this
approach to the extreme, where a superstep can be
viewed as a phase allowing a single generic asynchronous communication pattern, with the associated
cost model.
Memory Efficiency
The original definition of BSP does not account for
memory as a limited resource. However, the model can
be easily extended by an extra parameter m, representing the maximum capacity of each processor’s local
memory. Note that this approach also limits the amount
of communication allowed within a superstep: h ≤ m.
One of the early examples of memory-sensitive BSP
algorithm design is given by [].
An alternative approach to reflecting memory cost
is given by the model CGM, proposed in []. A CGM
is essentially a memory-restricted BSP computer, where
memory capacity and maximum superstep communication are determined by the size of the input/output:
h ≤ m = O(N/p). A large number of algorithms have
been developed for the CGM, see, e.g., [].
Memory Management
The BSP model does not directly support shared memory, a feature that is often desirable for both algorithm
design and programming. Furthermore, the BSP model
does not properly address the issue of input/output,
which can also be viewed as accessing an external shared
memory. Virtual shared memory can be obtained by
PRAM simulation, a technique introduced in []. An
efficient simulation on a p-processor BSP computer is
possible if the simulated virtual PRAM has at least
p log p processors, a requirement known as slackness.
Memory access in the randomized simulation is made
uniform by address hashing, which ensures a nearly
random and independent distribution of virtual shared
memory cells.
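The effect of address hashing can be illustrated with a small experiment. The sketch below is an assumption-laden toy: it uses a simple randomly chosen linear hash modulo a prime, which is not necessarily the hash family used in the simulation cited above, and the names make_hash and max_load are hypothetical. It compares the busiest processor under a naive "address mod p" placement with the busiest processor under the hashed placement, for a strided access pattern.

```python
import random
from collections import Counter

PRIME = 2_147_483_647  # the prime 2**31 - 1, larger than every address used below


def make_hash(p, rng):
    """Return a randomly chosen map from virtual addresses to processors 0..p-1."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)

    def owner(address):
        return ((a * address + b) % PRIME) % p

    return owner


def max_load(addresses, p, owner):
    """Number of accessed cells held by the busiest processor."""
    loads = Counter(owner(x) for x in addresses)
    return max(loads.values())


rng = random.Random(0)            # fixed seed, so the run is reproducible
p = 8
owner = make_hash(p, rng)
addresses = [1024 * i for i in range(4096)]  # strided access pattern

naive = max_load(addresses, p, lambda x: x % p)  # every address maps to processor 0
hashed = max_load(addresses, p, owner)
print(naive, hashed)
```

With the naive placement all 4096 accessed cells reside on one processor, so a single superstep would have h = 4096 at that processor; hashing is intended to spread such regular patterns more evenly.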
In the automatic mode of BSP programming proposed in [] (see also []), shared memory simulation completely hides the network and processors’ local
memories. The algorithm designer and the programmer can enjoy the benefits of virtual shared memory;
however, data locality is destroyed, and, as a result, performance may suffer. A useful compromise is achieved
by the BSPRAM model, proposed in []. This model
can be seen as a hybrid of BSP and PRAM, where
each processor keeps its local memory, but in addition
there is a uniformly accessible shared (possibly external) memory. Automatic memory management can still
be achieved by the address-hashing technique of [];
additionally, there are large classes of algorithms, identified by [], for which simpler, slackness-free solutions
are possible.
In contrast to the standard BSP model, the BSPRAM
is meaningful even with a single processor (p = 1).
In this case, it models a sequential computation that
has access both to main and external memory (or the
cache and the main memory). Further connections
between parallel and external-memory computations
are explored, e.g., in [].
Paper [] introduced a more elaborate model EMBSP, where each processor, in addition to its local memory, can have access to several external disks, which may
be private or shared. Paper [] proposed a restriction
of BSPRAM, called BSPGRID, where only the external memory can be used for persistent data storage
between supersteps – a requirement reflecting some
current trends in processor architecture. Paper []
proposed a shared-memory model QSM, where normal processors have communication gap g, and each
shared memory cell is essentially a “mini-processor”
with communication gap d. Naturally, in a superstep
every such “mini-processor” can “send” or “receive” at
most p words (one for each normal processor), hence
the model is similar to BSPRAM with communication
gap g and latency dp + l.
Virtual shared memory is implemented in several
existing or proposed programming environments offering BSP-like functionality (see, e.g., [, ]).
Heterogeneity
In the standard BSP model, all processors are assumed
to be identical. In particular, all have the same local processing speed (which is an implicit parameter of the
model), and the same communication gap g. In practice, many parallel architectures are heterogeneous, i.e.,
include processors with different speeds and communication performances. This fact has prompted heterogeneous extensions of the BSP model, such as HBSP
[], and HCGM []. Both these extended models
introduce a processor’s speed as an explicit parameter.
Each processor has its own speed and communication
gap; these two parameters can be either independent
or linked (e.g., proportional). The barrier structure of
a computation is kept in both models.
Other Variants of BSP
The BSP* model [] is a refinement of the BSP
model with an alternative cost formula for small-sized
h-relations. Recognizing the fact that communication
of even a small amount of data incurs a constant-sized
overhead, the model introduces a parameter b, defined
as the minimum communication cost of any, even zero-sized, h-relation. Since the overhead reflected by the
parameter b can also be counted as part of superstep
latency, the BSP* computer with communication gap g
and latency l is asymptotically equivalent to a standard
BSP computer with gap g and latency l + b.
The E-BSP model [] is another refinement of the
BSP model, where the cost of a superstep is parameterized separately by the maximum amount of data sent,
the maximum amount of data received, the total volume
of communicated data, and the network-specific maximum distance of data travel. The OBSP* model [] is an
elaborate extension of BSP, which accounts for varying
computation costs of individual instructions, and allows
the processors to run asynchronously while maintaining “logical supersteps”. While E-BSP and OBSP* may
be more accurate on some architectures (in [], a linear array and a 2D mesh are considered), they lack the
generality and simplicity of pure BSP. A simplified version of E-BSP, asymptotically equivalent to pure BSP, is
defined in [].
The PRO approach [] is introduced as another
parallel computation model, but can perhaps be better
understood as an alternative BSP algorithm design philosophy. It requires that algorithms are work-optimal,
disregards point-to-point communication efficiency by
setting g = , and instead puts the emphasis on synchronization efficiency and memory optimality.
BSP Algorithms
Basic Algorithms
As a simple parallel computation model, BSP lends itself
to the design of efficient, well-structured algorithms.
The aim of BSP algorithm design is to minimize the
resource consumption of the BSP computer: the local
computation cost W, the communication cost H, the
synchronization cost S, and also sometimes the memory cost M. Since the aim of parallel computation is
to obtain speedup over sequential execution, it is natural to require that a BSP algorithm should be work-optimal relative to a particular “reasonable” sequential
algorithm, i.e., the local computation cost W should
be proportional to the running time of the sequential
algorithm, divided by p. It is also natural to allow a reasonable amount of slackness: an algorithm only needs
to be efficient for n ≫ p, where n is the problem size.
The asymptotic dependence of the minimum value of n
on the number of processors p must be clearly specified
by the algorithm.
Assuming that a designated processor holds a
value a, the broadcasting problem asks that a copy of a is
obtained by every processor. Two natural solutions are
possible: either direct or binary tree broadcast.
In the direct broadcast method, a designated processor makes p − 1 copies of a and sends them directly to the
destinations. The BSP costs are W = O(p), H = O(p),
S = O(1). In the binary tree method, initially, the designated processor is defined as being awake, and the
other processors as sleeping. The processors are woken
up in log p rounds. In every round, every awake processor makes a copy of a and sends it to a sleeping
processor, waking it up. The BSP costs are W = O(p),
H = O(log p), S = O(log p). Thus, there is a tradeoff between the direct and the binary tree broadcast
methods, and no single method is optimal (see [, ]).
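The tradeoff between the two broadcast methods can be made concrete by counting supersteps and communicated words. The following sketch is illustrative only: the function names are hypothetical, processor 0 is assumed to hold the value, and the local computation cost is left out so that only the terms H ⋅ g + S ⋅ l are compared.

```python
def direct_broadcast(p):
    """Return (S, H) for one superstep in which processor 0 sends to all others."""
    h_out = p - 1              # processor 0 sends p - 1 copies
    h_in = 1 if p > 1 else 0   # every other processor receives one word
    return 1, h_out + h_in


def tree_broadcast(p):
    """Return (S, H) for the binary tree method."""
    awake, S, H = 1, 0, 0
    while awake < p:
        awake += min(awake, p - awake)  # each awake processor wakes one sleeper
        S += 1
        H += 2                          # per round: one word sent, one received
    return S, H


def comm_sync_cost(S, H, g, l):
    """Communication and synchronization part of the BSP cost."""
    return H * g + S * l


for g, l in [(1, 100), (100, 1)]:
    direct = comm_sync_cost(*direct_broadcast(16), g=g, l=l)
    tree = comm_sync_cost(*tree_broadcast(16), g=g, l=l)
    print(g, l, direct, tree)
```

For p = 16, the direct method is cheaper when the latency dominates (g = 1, l = 100), and the tree method is cheaper when the gap dominates (g = 100, l = 1), which is the sense in which neither method is optimal.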
The array broadcasting problem asks, instead of a
single value, to broadcast an array a of size n ≥ p.
In contrast with the ordinary broadcast problem, there
exists an optimal method for array broadcast, known as
two-phase broadcast (a folklore result, described, e.g.,
in []). In this method, the array is partitioned into p

B

B
BSP (Bulk Synchronous Parallelism)
blocks of size n/p. The blocks are scattered across the
processors; then, a total-exchange of the blocks is performed. Assuming sufficient slackness, the BSP costs are
W = O(n/p), H = O(n/p), S = O().
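Two-phase broadcast can be simulated sequentially to check that every processor ends up with the full array and to count the words moved in each phase. This is a sketch under stated assumptions: processor 0 holds the array initially, p ≥ 2 divides n, and the name two_phase_broadcast is illustrative.

```python
def two_phase_broadcast(array, p):
    """Simulate scatter + total exchange; return (copies, scatter_h, exchange_h)."""
    n = len(array)
    assert p >= 2 and n % p == 0, "sketch assumes p >= 2 and p divides n"
    b = n // p
    # Phase 1 (scatter): processor 0 keeps block 0 and sends block q to processor q.
    held = [array[q * b:(q + 1) * b] for q in range(p)]
    scatter_h = (n - b) + b       # h_out = n - b at processor 0, h_in = b elsewhere
    # Phase 2 (total exchange): every processor sends its block to every other one.
    received = [[held[src] for src in range(p)] for _dst in range(p)]
    exchange_h = 2 * (p - 1) * b  # each processor sends and receives p - 1 blocks
    copies = [[x for block in received[q] for x in block] for q in range(p)]
    return copies, scatter_h, exchange_h


data = list(range(12))
copies, scatter_h, exchange_h = two_phase_broadcast(data, p=4)
print(scatter_h, exchange_h, all(c == data for c in copies))
```

In this count both phases move a number of words proportional to n per processor, whereas sending the whole array directly from one processor would give h_out = (p − 1)n in a single superstep.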
Many computational problems can be described as
computing a directed acyclic graph (dag), which characterizes the problem’s data dependencies. From now on,
it is assumed that a BSP algorithm’s input and output are
stored in the external memory. It is also assumed that all
problem instances have sufficient slackness.
The balanced binary tree dag of size n consists of
2n − 1 nodes, arranged in a rooted balanced binary
tree with n leaves. The direction of all edges can be
either top-down (from root to leaves), or bottom-up (from leaves to root). By partitioning into appropriate blocks, the balanced binary tree dag can be
computed with BSP costs W = O(n/p), H = O(n/p),
S = O(1) (see [, ]).
The butterfly dag of size n consists of n log n nodes,
and describes the data dependencies of the Fast Fourier
Transform algorithm on n points. By partitioning into
appropriate blocks, the butterfly dag can be computed
with BSP costs W = O(n log n/p), H = O(n/p), S = O(1) (see [, ]).
The ordered D grid dag of size n consists of n
nodes arranged in an n × n grid, with edges directed
top-to-bottom and left-to-right. The computation takes
n inputs to the nodes on the left and top borders, and
returns n outputs from the nodes on the right and bottom borders. By partitioning into appropriate blocks,
the ordered D grid dag can be computed with BSP costs
W = O(n /p), H = O(n), S = O(p) (see []).
The ordered D grid dag of size n consists of n nodes
arranged in an n×n×n grid, with edges directed top-tobottom, left-to-right, and front-to-back. The computation takes n inputs to the nodes on the front, left, and
top faces, and returns n outputs from the nodes on
the back, right, and bottom faces. By partitioning into
appropriate blocks, the ordered D grid dag can be computed with BSP costs W = O(n /p), H = O(n /p/ ),
S = O(p/ ) (see []).
Further Algorithms
The design of efficient BSP algorithms has become a
well-established topic. Some examples of BSP algorithms proposed in the past include list and tree contraction [], sorting [, , , , ], convex hull
computation [, , , , , , ], and selection
[, ]. In the area of matrix algorithms, some examples of the proposed algorithms include matrix–vector
and matrix–matrix multiplication [, , ], Strassen-type matrix multiplication [], triangular system solution and several versions of Gaussian elimination [,
, ], and orthogonal matrix decomposition [, ].
In the area of graph algorithms, some examples of the
proposed algorithms include Boolean matrix multiplication [], minimum spanning tree [], transitive closure [], the algebraic path problem and the all-pairs
shortest path problems [], and graph coloring [].
In the area of string algorithms, some examples of the
proposed algorithms include the longest common subsequence and edit distance problems [–, , , ],
and the longest increasing subsequence problem [].
BSP Programming
The BSPlib Standard
Based on the experience of early BSP programming
tools [, , ], the BSP programming community agreed on a common library standard BSPlib
[]. The aim of BSPlib is to provide a set of BSP
programming primitives, striking a reasonable balance between simplicity and efficiency. BSPlib is based
on the single program/multiple data (SPMD) programming model, and contains communication primitives for direct remote memory access (DRMA) and
bulk-synchronous message passing (BSMP). Experience
shows that DRMA, due to its simplicity and deadlock-free semantics, is the method of choice for all but the
most irregular applications. Routine use of DRMA is
made possible by the barrier synchronization structure
of BSP computation.
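The interplay between DRMA and barrier synchronization can be illustrated with a toy model in which remote writes are buffered and take effect only at the barrier. The class below is a sequential stand-in written for this illustration; its name and methods are hypothetical and it is not the BSPlib interface (in BSPlib itself, the corresponding primitives are bsp_put and bsp_sync).

```python
class ToyBSP:
    """Toy sequential model of DRMA: puts become visible only after sync()."""

    def __init__(self, p, size):
        self.p = p
        self.mem = [[0] * size for _ in range(p)]  # one registered array per processor
        self._pending = []                         # puts issued in the current superstep

    def put(self, dst, offset, value):
        # Record a remote write; the destination memory is not touched yet.
        self._pending.append((dst, offset, value))

    def sync(self):
        # Barrier: deliver all buffered puts, then start the next superstep.
        for dst, offset, value in self._pending:
            self.mem[dst][offset] = value
        self._pending.clear()


bsp = ToyBSP(p=4, size=4)
for pid in range(bsp.p):                  # SPMD style: the same code for every processor
    right = (pid + 1) % bsp.p
    bsp.put(right, 0, pid * 10)           # write into the right neighbour's array
before = [m[0] for m in bsp.mem]
bsp.sync()
after = [m[0] for m in bsp.mem]
print(before, after)
```

Because no processor ever waits for a matching receive, and remote data changes only at the barrier, this style cannot deadlock in the way mismatched sends and receives can.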
The two currently existing major implementations
of BSPlib are the Oxford BSP toolset [] and the PUB
library []. Both provide a robust environment for the
development of BSPlib applications, including mechanisms for optimizing communication [, ], load
balancing, and fault tolerance []. The PUB library also
provides a few additional primitives (oblivious synchronization, processor subsets, multithreading). Both the
Oxford BSP toolset and the PUB library include tools
for performance analysis and prediction. Recently, new
approaches have been developed to BSPlib implementation [, , ] and performance analysis [, ].
Beyond BSPlib
The Message Passing Interface (MPI) is currently the
most widely accepted standard of distributed-memory
parallel programming. In contrast to BSPlib, which
is based on a single programming paradigm, MPI
provides a diverse set of parallel programming patterns, allowing the programmer to pick-and-choose
a paradigm most suitable for the application. Consequently, the number of primitives in MPI is an order
of magnitude larger than in BSPlib, and the responsibility to choose the correct subset of primitives and to
structure the code rests with the programmer. It is not
surprising that a carefully chosen subset of MPI can be
used to program in the BSP style; an example of such an
approach is given by [].
The ZPL language [] is a global-view array language based on a BSP-like computation structure.
As such, it can be considered to be one of the earliest
high-level BSP programming tools. Another growing
trend is the integration of the BSP model with modern programming environments. A successful example
of integrating BSP with Python is given by the package Scientific Python [], which provides high-level
facilities for writing BSP code, and performs communication by calls to either BSPlib or MPI. Tools for
BSP programming in Java have been developed by
projects NestStep [] and JBSP []; a Java-like multithreaded BSP programming model is proposed in [].
A functional programming model for BSP is given by
the BSMLlib library []; a constraint programming
approach is introduced in []. Projects InteGrade []
and GridNestStep [] are aiming to implement the BSP
model using Grid technology.
Related Entries
Bandwidth-Latency Models (BSP, LogP)
Collective Communication
Dense Linear System Solvers
Functional Languages
Graph Algorithms
Load Balancing, Distributed Memory
Linear Algebra, Numerical
Models of Computation, Theoretical
Parallel Skeletons
PGAS (Partitioned Global Address Space) Languages
PRAM (Parallel Random Access Machines)
Reduce and Scan
Sorting
SPMD Computational Model
Synchronization
ZPL
Bibliographic Notes and Further Reading
A detailed treatment of the BSP model, BSP programming, and several important BSP algorithms is given
in the monograph []. Collection [] includes several
chapters dedicated to BSP and related models.
Bibliography
. Alverson GA, Griswold WG, Lin C, Notkin D, Snyder L ()
Abstractions for portable, scalable parallel programming. IEEE
Trans Parallel Distrib Syst ():–
. Alves CER, Cáceres EN, Castro Jr AA, Song SW, Szwarcfiter
JL () Efficient parallel implementation of transitive closure
of digraphs. In: Proceedings of EuroPVM/MPI, Venice. Lecture
notes in computer science, vol . Springer, pp –
. Alves CER, Cáceres EN, Dehne F, Song SW () Parallel
dynamic programming for solving the string editing problem on
a CGM/BSP. In: Proceedings of the th ACM SPAA, Winnipeg,
pp –
. Alves CER, Cáceres EN, Dehne F, Song SW () A parallel
wavefront algorithm for efficient biological sequence comparison.
In: Proceedings of ICCSA, Montreal. Lecture notes in computer
science, vol . Springer, Berlin, pp –
. Alves CER, Cáceres EN, Song SW () A coarse-grained parallel algorithm for the all-substrings longest common subsequence
problem. Algorithmica ():–
. Ballereau O, Hains G, Lallouet A () BSP constraint programming. In: Gorlatch S, Lengauer C (eds) Constructive methods for
parallel programming, vol . Advances in computation: Theory
and practice. Nova Science, New York, Chap 
. Bäumker A, Dittrich W, Meyer auf der Heide F () Truly efficient parallel algorithms: -optimal multisearch for an extension
of the BSP model. Theor Comput Sci ():–
. Bisseling RH () Parallel scientific computation: A structured
approach using BSP and MPI. Oxford University Press, New York
. Bisseling RH, McColl WF () Scientific computing on bulk
synchronous parallel architectures. Preprint , Department of
Mathematics, University of Utrecht, December 
. Blanco V, González JA, León C, Rodríguez C, Rodríguez G,
Printista M () Predicting the performance of parallel programs. Parallel Comput :–
. Bonorden O, Juurlink B, von Otte I, Rieping I () The
Paderborn University BSP (PUB) library. Parallel Comput ():
–

. Cáceres EN, Dehne F, Mongelli H, Song SW, Szwarcfiter JL ()
A coarse-grained parallel algorithm for spanning tree and connected components. In: Proceedings of Euro-Par, Pisa. Lecture
notes in computer science, vol . Springer, Berlin, pp –
. Calinescu R, Evans DJ () Bulk-synchronous parallel algorithms for QR and QZ matrix factorisation. Parallel Algorithms
Appl :–
. Chamberlain BL, Choi S-E, Lewis EC, Lin C, Snyder L, Weathersby WD () ZPL: A machine independent programming
language for parallel computers. IEEE Trans Softw Eng ():
–
. Cinque L, Di Maggio C () A BSP realisation of Jarvis’ algorithm. Pattern Recogn Lett ():–
. Cole M () Algorithmic skeletons. In: Hammond K,
Michaelson G (eds) Research Directions in Parallel Functional
Programming. Springer, London, pp –
. Cole M () Bringing skeletons out of the closet: A pragmatic
manifesto for skeletal parallel programming. Parallel Comput
:–
. Corrêa R et al (eds) () Models for parallel and distributed
computation: theory, algorithmic techniques and applications,
vol . Applied Optimization. Kluwer, Dordrecht
. Dehne F, Dittrich W, Hutchinson D () Efficient external
memory algorithms by simulating coarse-grained parallel algorithms. Algorithmica ():–
. Dehne F, Fabri A, Rau-Chaplin A () Scalable parallel computational geometry for coarse grained multicomputers. Int J
Comput Geom :–
. Diallo M, Ferreira A, Rau-Chaplin A, Ubéda S () Scalable
D convex hull and triangulation algorithms for coarse grained
multicomputers. J Parallel Distrib Comput :–
. Donaldson SR, Hill JMD, Skillicorn D () Predictable communication on unpredictable networks: Implementing BSP over
TCP/IP and UDP/IP. Concurr Pract Exp ():–
. Dymond P, Zhou J, Deng X () A D parallel convex hull
algorithm with optimal communication phases. Parallel Comput
():–
. Fantozzi C, Pietracaprina A, Pucci G () A general PRAM simulation scheme for clustered machines. Int J Foundations Comput
Sci ():–
. Fortune S, Wyllie J () Parallelism in random access machines.
In: Proceedings of ACM STOC, San Diego, pp –
. Gebremedhin AH, Essaïdi M, Lassous GI, Gustedt J, Telle JA
() PRO: A model for the design and analysis of efficient and
scalable parallel algorithms. Nordic J Comput ():–
. Gebremedhin AH, Manne F () Scalable parallel graph coloring algorithms. Concurr Pract Exp ():–
. Gerbessiotis AV, Siniolakis CJ () Deterministic sorting and
randomized median finding on the BSP model. In: Proceedings
of the th ACM SPAA, Padua, pp –
. Gerbessiotis AV, Siniolakis CJ, Tiskin A () Parallel priority queue and list contraction: The BSP approach. Comput
Informatics :–
. Gerbessiotis AV, Valiant LG () Direct bulk-synchronous parallel algorithms. J Parallel Distrib Comput ():–
. Goldchleger A, Kon F, Goldman A, Finger M, Bezerra GC. InteGrade: Object-oriented Grid middleware leveraging idle computing power of desktop machines. Concurr Comput Pract Exp
:–
. Goodrich M () Communication-efficient parallel sorting.
In: Proceedings of the th ACM STOC, Philadelphia, pp
–
. Gorlatch S, Lengauer C (eds) () Constructive methods for
parallel programming, vol . Advances in computation: Theory
and practice. Nova Science, New York
. Goudreau MW, Lang K, Rao SB, Suel T, Tsantilas T () Portable
and efficient parallel computing using the BSP model. IEEE Trans
Comput ():–
. Gu Y, Lee B-S, Cai W () JBSP: A BSP programming library in
Java. J Parallel Distrib Comput :–
. Hains G, Loulergue F () Functional bulk synchronous parallel programming using the BSMLlib library. In: Gorlatch S,
Lengauer C (eds) Constructive methods for parallel programming, vol . Advances in computation: Theory and practice.
Nova Science, New York, Chap 
. Hammond K, Michaelson G (eds) () Research directions in
parallel functional programming. Springer, London
. Heywood T, Ranka S () A practical hierarchical model of
parallel computation I: The model. J Parallel Distrib Comput
():–
. Hill J () Portability of performance in the BSP model. In:
Hammond K, Michaelson G (eds) Research directions in parallel
functional programming. Springer, London, pp –
. Hill JMD, Donaldson SR, Lanfear T () Process migration
and fault tolerance of BSPlib programs running on a network
of workstations. In: Pritchard D, Reeve J (eds) Proceedings of
Euro-Par, Southampton. Lecture notes in computer science, vol
. Springer, Berlin, pp –
. Hill JMD, Jarvis SA, Siniolakis C, Vasilev VP () Analysing
an SQL application with a BSPlib call-graph profiling tool. In:
Pritchard D, Reeve J (eds) Proceedings of Euro-Par. Lecture notes
in computer science, vol . Springer, Berlin, pp –
. Hill JMD, McColl WF, Stefanescu DC, Goudreau MW, Lang K,
Rao SB, Suel T, Tsantilas T, Bisseling RH () BSPlib: The BSP
programming library. Parallel Comput ():–
. Hill JMD, Skillicorn DB () Lessons learned from implementing BSP. Future Generation Comput Syst (–):–
. Hinsen K () High-level parallel software development with
Python and BSP. Parallel Process Lett ():–
. Ishimizu T, Fujiwara A, Inoue M, Masuzawa T, Fujiwara H ()
Parallel algorithms for selection on the BSP and BSP ∗ models. Syst
Comput Jpn ():–
. Juurlink BHH, Wijshoff HAG () The E-BSP model: Incorporating general locality and unbalanced communication into
the BSP model. In: Bougé et al (eds) Proceedings of Euro-Par
(Part II), Lyon. Lecture notes in computer science, vol .
Springer, Berlin, pp –
. Juurlink BHH, Wijshoff HAG () A quantitative comparison
of parallel computation models. ACM Trans Comput Syst ():
–
. Kee Y, Ha S () An efficient implementation of the BSP programming library for VIA. Parallel Process Lett ():–
. Keßler CW () NestStep: Nested parallelism and virtual
shared memory for the BSP model. J Supercomput :–
. Kim S-R, Park K () Fully-scalable fault-tolerant simulations
for BSP and CGM. J Parallel Distrib Comput ():–
. Krusche P, Tiskin A () Efficient longest common subsequence computation using bulk-synchronous parallelism. In: Proceedings of ICCSA, Glasgow. Lecture notes in computer science,
vol . Springer, Berlin, pp –
. Krusche P, Tiskin A () Longest increasing subsequences
in scalable time and memory. In: Proceedings of PPAM ,
Revised Selected Papers, Part I. Lecture notes in computer science,
vol . Springer, Berlin, pp –
. Krusche P, Tiskin A () New algorithms for efficient parallel string comparison. In: Proceedings of ACM SPAA, Santorini,
pp –
. Lecomber DS, Siniolakis CJ, Sujithan KR () PRAM programming: in theory and in practice. Concurr Pract Exp :–
. Mattsson H, Kessler CW () Towards a virtual shared memory programming environment for grids. In: Proceedings of
PARA, Copenhagen. Lecture notes in computer science, vol .
Springer-Verlag, pp –
. McColl WF () Scalable computing. In: van Leeuwen J (ed)
Computer science today: Recent trends and developments. Lecture notes in computer science, vol . Springer, Berlin, pp
–
. McColl WF () Universal computing. In: L. Bougé et al (eds)
Proceedings of Euro-Par (Part I), Lyon. Lecture notes in computer
science, vol . Springer, Berlin, pp –
. McColl WF, Miller Q () Development of the GPL language.
Technical report (ESPRIT GEPPCOM project), Oxford University Computing Laboratory
. McColl WF, Tiskin A () Memory-efficient matrix multiplication in the BSP model. Algorithmica (/):–
. Miller R () A library for bulk-synchronous parallel programming. In: Proceeding of general purpose parallel computing,
British Computer Society, pp –
. Morin P () Coarse grained parallel computing on heterogeneous systems. In: Proceedings of ACM SAC, Como, pp –
. Nibhanupudi MV, Szymanski BK () Adaptive bulksynchronous parallelism on a network of non-dedicated
workstations. In: High performance computing systems and
applications. Kluwer, pp –
. The Oxford BSP Toolset () http://www.bsp-worldwide.org/
implmnts/oxtool
. Ramachandran V () A general purpose shared-memory
model for parallel computation. In: Heath MT, Ranade A,
Schreiber RS (eds) Algorithms for parallel processing. IMA volumes in mathematics and applications, vol . Springer-Verlag,
New York
. Saukas LG, Song SW () A note on parallel selection on coarsegrained multicomputers. Algorithmica, (–):–, 
. Sibeyn J, Kaufmann M () BSP-like external-memory computation. In: Bongiovanni GC, Bovet DP, Di Battista G (eds)
.
.
.
.
.
.
.
.
.
.
.
.
.
.
B
Proceedings of CIAC, Rome. Lecture notes in computer science,
vol . Springer, Berlin, pp –
Skillicorn DB () Predictable parallel performance: The BSP
model. In: Corrêa R et al (eds) Models for parallel and distributed
computation: theory, algorithmic techniques and applications,
vol . Applied Optimization. Kluwer, Dordrecht, pp –
Song SW () Parallel graph algorithms for coarse-grained
multicomputers. In: Corrêa R et al (eds) Models for parallel
and distributed computation: theory, algorithmic techniques and
applications, vol , Applied optimization. Kluwer, Dordrecht,
pp –
Tiskin A () Bulk-synchronous parallel multiplication of
Boolean matrices. In: Proceedings of ICALP Aalborg. Lecture
notes in computer science, vol . Springer, Berlin, pp –
Tiskin A () The bulk-synchronous parallel random access
machine. Theor Comput Sci (–):–
Tiskin A () All-pairs shortest paths computation in the BSP
model. In: Proceedings of ICALP, Crete. Lecture notes in computer science, vol . Springer, Berlin, pp –
Tiskin A () A new way to divide and conquer. Parallel Process
Lett ():–
Tiskin A () Parallel convex hull computation by generalised regular sampling. In: Proceedings of Euro-Par, Paderborn.
Lecture notes in computer science, vol . Springer, Berlin,
pp –
Tiskin A () Efficient representation and parallel computation
of string-substring longest common subsequences. In: Proceedings of ParCo Malaga. NIC Series, vol . John von Neumann
Institute for Computing, pp –
Tiskin A () Communication-efficient parallel generic pairwise elimination. Future Generation Comput Syst :–
Tiskin A () Parallel selection by regular sampling. In: Proceedings of Euro-Par (Part II), Ischia. Lecture notes in computer
science, vol . Springer, pp –
Valiant LG () A bridging model for parallel computation.
Commun ACM ():–
Vasilev V () BSPGRID: Variable resources parallel computation and multiprogrammed parallelism. Parallel Process Lett
():–
Williams TL, Parsons RJ () The heterogeneous bulk synchronous parallel model. In: Rolim J et al (eds) Proceedings of
IPDPS workshops Cancun. Lecture notes in computer science,
vol . Springer, Berlin, pp –
Zheng W, Khan S, Xie H () BSP performance analysis and prediction: Tools and applications. In: Malyshkin V (ed) Proceedings
of PaCT, Newport Beach. Lecture notes in computer science, vol
. Springer, Berlin, pp –
Bulk Synchronous Parallelism
(BSP)
BSP (Bulk Synchronous Parallelism)

Bus: Shared Channel
Bus: Shared Channel
Buses and Crossbars
Buses and Crossbars
Rajeev Balasubramonian, Timothy M. Pinkston

University of Utah, Salt Lake City, UT, USA

University of Southern California, Los Angeles, CA, USA
Synonyms
Bus: Shared channel; Shared interconnect; Shared-medium network; Crossbar; Interconnection
network; Point-to-point switch; Switched-medium network
Definition
Bus: A bus is a shared interconnect used for connecting multiple components of a computer on a single chip
or across multiple chips. Connected entities either place
signals on the bus or listen to signals being transmitted on the bus, but the bus can transport signals from only one entity
at any given time.
Buses are popular communication media for broadcasts
in computer systems.
Crossbar: A crossbar is a non-blocking switching
element with N inputs and M outputs used for connecting multiple components of a computer where, typically,
N = M. The crossbar can simultaneously transport signals on any of the N inputs to any of the M outputs as
long as multiple signals do not compete for the same
input or output port. Crossbars are commonly used as
basic switching elements in switched-media network
routers.
Discussion
Introduction
Every computer system is made up of numerous components such as processor chips, memory chips, peripherals, etc. These components communicate with each
other via interconnects. One of the simplest interconnects used in computer systems is the bus. The bus
is a shared medium (usually a collection of electrical
wires) that allows one sender at a time to communicate with all sharers of that medium. If the interconnect
must support multiple simultaneous senders, more scalable designs based on switched-media must be pursued.
The crossbar represents a basic switched-media building
block for more complex but scalable networks.
In traditional multi-chip multiprocessor systems
(c. ), buses were primarily used as off-chip interconnects, for example, front-side buses. Similarly, crossbar functionality was implemented on chips that were
used mainly for networking. However, the move to
multi-core technology has necessitated the use of networks even within a mainstream processor chip to connect its multiple cores and cache banks. Therefore, buses
and crossbars are now used within mainstream processor chips as well as chip sets. The design constraints
for on-chip buses are very different from those of off-chip buses. Much of this discussion will focus on on-chip buses, which continue to be the subject of much
research and development.
Basics of Bus Design
A bus comprises a shared medium with connections to
multiple entities. An interface circuit allows each of the
entities either to place signals on the medium or sense
(listen to) the signals already present on the medium.
In a typical communication, one of the entities acquires
ownership of the bus (the entity is now known as the bus
master) and places signals on the bus. Every other entity
senses these signals and, depending on the content of
the signals, may choose to accept or discard them. Most
buses today are synchronous, that is, the start and end
of a transmission are clearly defined by the edges of a
shared clock. An asynchronous bus would require an
acknowledgment from the receiver so the sender knows
when the bus can be relinquished.
Buses are often collections of electrical wires, where each wire is typically
classified as “data,” “address,” or “control.” (Alternatively, buses can be a
collection of optical waveguides over which information is transmitted photonically [].)
In most systems, messages move among entities on the data
bus; the address bus specifies the entity that must receive
the message; and the control bus carries auxiliary signals such as arbitration requests and error correction
codes. There is another nomenclature that readers may
also encounter. If the network is used to implement a
cache coherence protocol, the protocol itself has three
types of messages: (1) DATA, which refers to blocks
of memory; (2) ADDRESS, which refers to the memory block’s address; and (3) CONTROL, which refers
to auxiliary messages in the protocol such as acknowledgments. Capitalized terms as above will be used to
distinguish message types in the coherence protocol
from signal types on the bus. For now, it will be assumed
that all three protocol message types are transmitted on
the data bus.
Arbitration Protocols
Since a bus is a shared medium that allows a single
master at a time, an arbitration protocol is required to
identify this bus master. A simple arbitration protocol
can allow every entity to have ownership of the bus for
a fixed time quantum, in a round-robin manner. Thus,
every entity can make a local decision on when to transmit. However, this wastes bus bandwidth when an entity
has nothing to transmit during its turn.
The most common arbitration protocol employs a
central arbiter; entities must send their bus requests to
the arbiter and the arbiter sends explicit messages to
grant the bus to requesters. If the requesting entity is
not aware of its data bus occupancy time beforehand,
the entity must also send a bus release message to the
arbiter after it is done. The request, grant, and release
signals are part of the control network. The request signal is usually carried on a dedicated wire between an
entity and the arbiter. The grant signal can also be implemented similarly, or as a shared bus that carries the ID of
the grantee. The arbiter has state to track data bus occupancy, buffers to store pending requests, and policies to
implement priority or fairness. The use of pipelining to
hide arbitration delays will be discussed shortly.
Arbitration can also be done in a distributed manner [], but such methods often incur latency or bandwidth penalties. In one example, a shared arbitration
bus is implemented with wired-OR signals. Multiple
entities can place a signal on the bus; if any entity places
a “one” on the bus, the bus carries “one,” thus using
wires to implement the logic of an OR gate. To arbitrate, all entities place their IDs on the arbitration bus;
the resulting signal is the OR of all requesting IDs. The
bus is granted to the entity with the largest ID and this
is determined by having each entity sequentially drop
out if it can determine that it is not the largest ID in the
competition.
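The wired-OR scheme described above can be sketched in a few lines. The sketch below is an illustration only: it assumes unique IDs of a fixed width and resolves them one bit at a time from the most significant bit, which is one common way to realize the "sequential drop out" step. The function names wired_or and arbitrate are hypothetical and do not come from any particular bus specification.

```python
def wired_or(bits):
    """Model a wired-OR line: it reads 1 if any driver places a 1 on it."""
    return int(any(bits))


def arbitrate(requesting_ids, id_width):
    """Return the winning (largest) ID among the requesters, or None.

    Assumption: bits are resolved from most significant to least
    significant. In each round every remaining contender drives its own
    bit; a contender that drove 0 but observes 1 on the line knows that a
    larger ID is competing and drops out.
    """
    contenders = set(requesting_ids)
    for bit in reversed(range(id_width)):
        line = wired_or((cid >> bit) & 1 for cid in contenders)
        if line == 1:
            contenders = {cid for cid in contenders if (cid >> bit) & 1}
    # IDs are assumed unique, so at most one contender survives.
    return contenders.pop() if contenders else None
```

With three requesters holding IDs 3, 5, and 6 on a 3-bit arbitration bus, arbitrate([3, 5, 6], 3) grants the bus to entity 6 after three rounds, which illustrates the latency penalty mentioned above: the number of rounds grows with the ID width.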
Pipelined Bus
Before a bus transaction can begin, an entity must arbitrate for the bus, typically by contacting a centralized
arbiter. The latency of request and grant signals can be
hidden with pipelining. In essence, the arbitration process (that is handled on the control bus) is overlapped
with the data transmission of an earlier message. An
entity can send a bus request to the arbiter at any time.
The arbiter buffers this request, keeps track of occupancy on the data bus, and sends the grant signal one
cycle before the data bus will be free. In a heavily loaded
network, the data bus will therefore rarely be idle and
the arbitration delay is completely hidden by the wait
for the data bus. In a lightly loaded network, pipelining
will not hide the arbitration delay, which is typically at
least three cycles: one cycle for the request signal, one
cycle for logic at the arbiter, and one cycle for the grant
signal.
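The effect of pipelining can be illustrated with a simplified timing model. The sketch below assumes first-come-first-served service, a fixed three-cycle arbitration latency as in the text, and requests given as hypothetical (arrival cycle, transfer length) pairs; it ignores details such as the grant being issued one cycle before the bus becomes free.

```python
ARBITRATION_CYCLES = 3  # request + arbiter logic + grant, as described above


def schedule(requests, pipelined):
    """Return the cycle at which each transfer starts on the data bus.

    requests: list of (arrival_cycle, transfer_cycles), sorted by arrival
    and served in that order (an assumption of this sketch).
    """
    bus_free = 0
    starts = []
    for arrival, length in requests:
        if pipelined:
            # Arbitration overlaps with the previous data transfer.
            start = max(arrival + ARBITRATION_CYCLES, bus_free)
        else:
            # Arbitration begins only once the data bus is free.
            start = max(arrival, bus_free) + ARBITRATION_CYCLES
        starts.append(start)
        bus_free = start + length
    return starts
```

For three back-to-back requests of four cycles each, the pipelined bus starts transfers at cycles 3, 7, and 11, while the non-pipelined bus starts them at cycles 3, 10, and 17. For two requests that arrive far apart, both variants pay the full three-cycle arbitration delay.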
Case Study: Snooping-Based Cache Coherence
Protocols
As stated earlier, the bus is a vehicle for transmission
of messages within a higher-level protocol such as a
cache coherence protocol. A single transaction within
the higher-level protocol may require multiple messages
on the bus. Very often, the higher-level protocol and the
bus are codesigned to improve efficiency. Therefore, as
a case study, a snooping bus-based coherence protocol
will be discussed.
Consider a single-chip multiprocessor where each
processor core has a private L1 cache, and a large L2
cache is shared by all the cores. The multiple L1 caches
and the multiple banks of the L2 cache are the entities connected to a shared bus (Fig. ). The higher-level
coherence protocol ensures that data in the L1 and L2
caches is kept coherent, that is, a data modification is
eventually seen by all caches and multiple updates to
one block are seen by all caches in exactly the same
order.
Buses and Crossbars. Fig.  Cores and L2 cache banks connected with a bus. The bus is composed of wires that handle data, address, and control

A number of coherence protocol operations will
now be discussed. When a core does not find its data
in its local L1 cache, it must send a request for the data
block to the other L1 caches and the L2 cache. The core’s L1
cache first sends an arbitration request for the bus to the
arbiter. The arbiter eventually sends the grant signal to
the requesting L1. The arbitration is done on the control
portion of the bus. The L1 then places the ADDRESS
of the requested data block on the data bus. On a synchronous bus, we are guaranteed that every other entity
has seen the request within one bus cycle. Each such
“snooping” entity now checks its L1 cache or L2 bank
to see if it has a copy of the requested block. Since
every lookup may take a different amount of time, a
wired-AND signal is provided within the control bus so
everyone knows that the snoop is completed. This is an
example of bus and protocol codesign (a protocol CONTROL message being implemented on the bus’ control
bus). The protocol requires that an L1 cache respond
with data if it has the block in “modified” state; otherwise, the
L2 cache responds with data. This is determined with
a wired-OR signal; all L1 caches place the outcome of
their snoop on this wired-OR signal and the L2 cache
accordingly determines if it must respond. The responding entity then fetches data from its arrays and places
it on the data bus. Since the bus is not released until
the end of the entire coherence protocol transaction,
the responder knows that the data bus is idle and need
not engage in arbitration (another example of protocol and bus codesign). Control signals let the requester
know that the data is available and the requester reads
the cache block off the bus.
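The responder selection in this transaction can be sketched as follows. This is a minimal illustration under stated assumptions: each L1 cache is modeled as a hypothetical dictionary from block address to a state string, and only the "modified" state matters for deciding who responds. The names MODIFIED, INVALID, and select_responder are illustrative and are not part of any real protocol definition.

```python
MODIFIED, SHARED, INVALID = "M", "S", "I"


def select_responder(l1_caches, requester, block):
    """Decide which entity supplies the block after a snoop.

    l1_caches: list of dicts mapping block address -> state, one per core.
    Each snooping L1 drives 1 onto the wired-OR line if it holds the block
    in modified state; the L2 responds only if the line stays at 0.
    """
    snoop_results = [
        index != requester and cache.get(block, INVALID) == MODIFIED
        for index, cache in enumerate(l1_caches)
    ]
    wired_or = any(snoop_results)
    if wired_or:
        return ("L1", snoop_results.index(True))
    return ("L2", None)
```

For example, if core 1 holds block 0x40 in modified state and core 0 requests it, the wired-OR line reads 1 and core 1's L1 responds; a request for a block that no L1 has modified is served by the L2.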
The use of a bus greatly simplifies the coherence
protocol. It serves as a serialization point for all coherence transactions. The timing of when an operation
is visible to everyone is well known. The broadcast of
operations allows every cache on the bus to be self-managing. Snooping bus-based protocols are therefore
much simpler than directory-based protocols on more
scalable networks.
As described, each coherence transaction is handled
atomically, that is, one transaction is handled completely before the bus is released for use by other transactions. This means that the data bus is often idle while
caches perform their snoops and array reads. Bus utilization can be improved with a split transaction bus.
Once the requester has placed its request on the data
bus, the data bus is released for use by other transactions. Other transactions can now use the data bus
for their requests or responses. When a transaction’s
response is ready, the data bus must be arbitrated for.
Every request and response must now carry a small tag
so responses can be matched up to their requests. Additional tags may also be required to match the wired-OR
signals to the request.
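The role of the tags on a split transaction bus can be sketched with a small bookkeeping class. This is an assumption-laden illustration: the class name SplitTransactionBus, the fixed pool of tags, and the retry-on-exhaustion behavior are hypothetical choices made for the sketch, and arbitration itself is not modeled.

```python
class SplitTransactionBus:
    """Tracks outstanding requests so responses can be matched by tag."""

    def __init__(self, num_tags):
        self.free_tags = list(range(num_tags))
        self.outstanding = {}  # tag -> (requester, address)

    def request(self, requester, address):
        """Place a request on the bus; return its tag, or None if no tag is free."""
        if not self.free_tags:
            return None  # the requester must retry later
        tag = self.free_tags.pop(0)
        self.outstanding[tag] = (requester, address)
        return tag

    def respond(self, tag, data):
        """Match a response to its request and recycle the tag."""
        requester, address = self.outstanding.pop(tag)
        self.free_tags.append(tag)
        return requester, address, data
```

Because the data bus is released between a request and its response, responses may return in a different order from the requests; the tag is what lets each requester recognize its own data.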
The split transaction bus design can be taken
one step further. Separate buses can be implemented
for ADDRESS and DATA messages. All requests
(ADDRESS messages) and corresponding wired-OR
CONTROL signals are carried on one bus. This bus
acts as the serialization point for the coherence protocol. Responders always use a separate bus to return
DATA messages. Each bus has its own separate arbiter
and corresponding control signals.
Bus Scalability
A primary concern with any bus is its lack of scalability. First, if many entities are connected to a bus, the
bus speed reduces because it must drive a heavier load
over a longer distance. In an electrical bus, the higher
capacitive load from multiple entities increases the RC delay; in an optical bus, the reduced number of photons received at
photodetectors from dividing the optical power budget
among multiple entities likewise increases the time to
detect bus signals. Second, with many entities competing for the shared bus, the wait-time to access the bus
increases with the number of entities. Therefore, conventional wisdom states that more scalable switched-media networks are preferred when connecting much
more than  or  entities []. However, the simplicity of bus-based protocols (such as the snooping-based
cache coherence protocol) makes buses attractive for small- or medium-scale symmetric multiprocessing (SMP)
systems. For example, the IBM POWERTM processor chip supports  cores on its SMP bus []. Buses
are also attractive because, unlike switched-media networks, they do not require energy-hungry structures
such as buffers and crossbars. Researchers have considered multiple innovations to extend the scalability of
buses, some of which are discussed next.
One way to scale the number of entities connected
using buses is, simply, to provide multiple buses, for
example, dual-independent buses or quad-independent
buses. This mitigates the second problem listed above
regarding the high rate of contention on a single bus, but
steps must still be taken to maintain cache coherency
via snooping on the buses. The Sun Starfire multiprocessor [], for example, uses four parallel buses for
ADDRESS requests, wherein each bus handles a different range of addresses. Tens of dedicated buses are
used to connect up to  IBM POWERTM processor chips in a coherent SMP system []. While this
option has high cost for off-chip buses because of
pin and wiring limitations, a multi-bus for an on-chip
network is not as onerous because of plentiful metal
area budgets.
Some recent works have highlighted the potential
of bus-based on-chip networks. Das et al. [] argue
that buses should be used within a relatively small cluster of cores because of their superior latency, power,
and simplicity. The buses are connected with a routed
mesh network that is employed for communication
beyond the cluster. The mesh network is exercised
infrequently because most applications exhibit locality.
Udipi et al. [] take this hierarchical network approach
one step further. As shown in Fig. , the intra-cluster
buses are themselves connected with an inter-cluster
bus. Bloom filters are used to track the buses that have
previously handled a given address. When coherence
transactions are initiated for that address, the Bloom
filters ensure that the transaction is broadcast only
to the buses that may find the address relevant. Locality optimizations such as page coloring help ensure that
bus broadcasts do not travel far, on average. Udipi et
al. also employ multiple buses and low-swing wiring to
further extend bus scalability in terms of performance
and energy.
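The filtering idea can be sketched with one Bloom filter per sub-bus. The sketch below is not the design of Udipi et al.; the filter size, the number of hash functions, the use of SHA-256 for hashing, and the names BloomFilter and buses_to_broadcast are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter; it may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored in a Python integer

    def _positions(self, address):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, address):
        for pos in self._positions(address):
            self.bits |= 1 << pos

    def may_contain(self, address):
        return all((self.bits >> pos) & 1 for pos in self._positions(address))


def buses_to_broadcast(filters, address):
    """Return the sub-buses whose filter says the address may be relevant."""
    return [bus_id for bus_id, f in enumerate(filters) if f.may_contain(address)]
```

A false positive only causes an unnecessary broadcast on one sub-bus, whereas a false negative would break coherence; this is why a structure with one-sided error suits the task.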
Crossbars
Buses are used as a shared fabric for communication among multiple entities. Communication on a bus
is always broadcast-style, i.e., even though a message is going from entity-A to entity-B, all entities
see the message and no other message can be simultaneously in transit. However, if the entities form a
“dance hall” configuration (Fig. a) with processors on
one side and memory on the other side, and most
communication is between processors and memory,
a crossbar interconnect becomes a compelling choice.
Although crossbars incur a higher wiring overhead
than buses, they allow multiple messages simultaneously to be in transit, thus increasing the network
bandwidth. Given this, crossbars serve as the basic
switching element within switched-media network
routers.

Buses and Crossbars. Fig.  A hierarchical bus structure that localizes broadcasts to relevant clusters (figure labels: Core, Sub-bus, Central-bus)

Buses and Crossbars. Fig.  (a) A “dance-hall” configuration of processors and memory. (b) The circuit for a  ×  crossbar

A crossbar circuit takes N inputs and connects each
input to any of the M possible outputs. As shown in
Fig. b, the circuit is organized as a grid of wires, with
inputs on the left, and outputs on the bottom. Each wire
can be thought of as a bus with a unique master, that is,
the associated input port. At every intersection of wires,
a pass transistor serves as a crosspoint connector to short
the two wires, if enabled, connecting the input to the
output. Small buffers can also be located at the crosspoints in buffered crossbar implementations to store
messages temporarily in the event of contention for the
intended output port. A crossbar is usually controlled
by a centralized arbiter that takes output port requests
from incoming messages and computes a viable assignment of input port to output port connections. This,
for example, can be done in a crossbar switch allocation stage prior to a crossbar switch traversal stage for
message transport. Multiple messages can be simultaneously in transit as long as each message is headed
to a unique output and each emanates from a unique
input. Thus, the crossbar is non-blocking. Some implementations allow a single input message to be routed to
multiple output ports.
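The arbiter's task of computing a viable assignment can be sketched with a simple greedy allocator. This is one possible policy assumed for illustration, not the allocator of any particular router: requests are considered in the order given, and the function name allocate is hypothetical.

```python
def allocate(requests):
    """Grant a conflict-free subset of (input_port, output_port) requests.

    A request is granted only if neither its input nor its output port has
    already been claimed in this allocation round, so all granted messages
    can traverse the crossbar simultaneously.
    """
    used_inputs, used_outputs = set(), set()
    grants = []
    for inp, out in requests:
        if inp not in used_inputs and out not in used_outputs:
            grants.append((inp, out))
            used_inputs.add(inp)
            used_outputs.add(out)
    return grants
```

Given the requests (0, 2), (1, 2), (2, 0), and (0, 1), the allocator grants (0, 2) and (2, 0); the other two lose because output 2 and input 0 are already taken, and they would retry in a later allocation round.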
A crossbar circuit has a cost that is proportional to
N × M. The circuit is replicated W times, where W represents the width of the link at one of the input ports.
It is therefore not a very scalable circuit. In fact, larger
centralized switches such as Butterfly and Benes switch
fabrics are constructed hierarchically from smaller
crossbars to form multistage indirect networks or MINs.
Such networks have a cost that is proportional to
N log(M) but have a more restrictive set of messages that
can be routed simultaneously without blocking.
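The difference in growth rates can be made concrete with a short calculation. The sketch below counts cost only up to constant factors; it assumes a base-2 logarithm and, as a further assumption, applies the link width W to the multistage network in the same way as to the crossbar. The function names are hypothetical.

```python
import math


def crossbar_cost(n_inputs, m_outputs, link_width=1):
    """Crosspoint count of a crossbar: N x M, replicated W times."""
    return n_inputs * m_outputs * link_width


def multistage_cost(n_inputs, m_outputs, link_width=1):
    """Cost proportional to N log(M); base 2 is assumed here."""
    return n_inputs * math.log2(m_outputs) * link_width
```

For N = M = 64 and a one-bit link, the crossbar needs on the order of 4096 crosspoints, whereas the multistage organization needs on the order of 384 units, at the price of blocking for some message combinations.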
A well-known example of a large-scale on-chip
crossbar is the Sun Niagara processor []. The crossbar connects eight processors to four L2 cache banks
in a “dance-hall” configuration. A recent example of
using a crossbar to interconnect processor cores in
other switched point-to-point configurations is the Intel
QuickPath Interconnect []. More generally, crossbars
find extensive use in network routers. Meshes and tori,
for example, implement a 5 × 5 crossbar in router
switches where the five input and output ports correspond to the North, South, East, West neighbors
and the node connected to each router. The mesh-connected Tilera Tile-GxTM -core processor is a
recent example [].
Related Entries
Cache Coherence
Collective Communication
Interconnection Networks
Networks, Direct
Network Interfaces
Networks, Multistage
PCI-Express
Routing (Including Deadlock Avoidance)
Switch Architecture
Switching Techniques
Bibliographic Notes and Further
Reading
For more details on bus design and other networks,
readers are referred to the excellent textbook by Dally
and Towles []. Recent papers in the architecture community that focus on bus design include those by Udipi
et al. [], Das et al. [], and Kumar et al. []. Kumar
et al. [] articulate some of the costs of implementing buses and crossbars in multi-core processors and
argue that the network must be codesigned with the
core and caches for optimal performance and power.
S. Borkar has made a compelling and thought-provoking argument for the widespread use of buses within multi-core
chips [, ]. The paper
by Charlesworth [] on Sun’s Starfire, while more than a
decade old, is an excellent reference that describes considerations when designing a high-performance bus for
a multi-chip multiprocessor. Future many-core processors may adopt photonic interconnects to satisfy the
high memory bandwidth demands of the many cores.
A single photonic waveguide can carry many wavelengths of light, each carrying a stream of data. Many
receiver “rings” can listen to the data transmission, each
ring contributing to some loss in optical energy. The
Corona paper by Vantrease et al. [] and the paper
by Kirman et al. [] are excellent references for more
details on silicon photonics, optical buses, and optical
crossbars.
The basic crossbar circuit has undergone little
change over the last several years. However, given the
recent interest in high-radix routers which increase the
input/output-port degree of the crossbar used as the
internal router switch, Kim et al. [] proposed hierarchical crossbar and buffered crossbar organizations to
facilitate scalability. Also, given the relatively recent shift
in focus to energy-efficient on-chip networks, Wang
et al. [] proposed techniques to reduce the energy
usage within crossbar circuits. They introduced a cut-through crossbar that is optimized for traffic that travels
in a straight line through a mesh network’s router. The
design places some restrictions on the types of message turns that can be simultaneously handled. Wang
et al. also introduce a segmented crossbar that prevents switching across the entire length of wires when
possible.
Bibliography
. Borkar S () Networks for multi-core chips – a contrarian
view, ISLPED Keynote. www.islped.org/X/BorkarISLPED.
pdf
. Borkar S () Networks for multi-core chips – a controversial
view. In: Workshop on on- and off-chip interconnection networks
for multicore systems (OCIN), Stanford
. Charlesworth A () Starfire: extending the SMP envelope.
IEEE Micro ():–
. Dally W, Towles B () Route packets, not wires: on-chip interconnection networks. In: Proceedings of DAC, Las Vegas
. Dally W, Towles B () Principles and practices of interconnection networks, st edn. Morgan Kaufmann, San Francisco
. Das R, Eachempati S, Mishra AK, Vijaykrishnan N, Das CR
() Design and evaluation of hierarchical on-chip network
topologies for next generation CMPs. In: Proceedings of HPCA,
Raleigh
. Intel Corp. An introduction to the Intel QuickPath interconnect.
http://www.intel.com/technology/quickpath/introduction.pdf
. Kim J, Dally W, Towles B, Gupta A () Microarchitecture of a
high-radix router. In: Proceedings of ISCA, Madison
. Kirman N, Kirman M, Dokania R, Martinez J, Apsel A, Watkins
M, Albonesi D () Leveraging optical technology in future
bus-based chip multiprocessors. In: Proceedings of MICRO,
Orlando
. Kongetira P () A -way multithreaded SPARC processor. In:
Proceedings of hot chips , Stanford. http://www.hotchips.org/
archives/
. Kumar R, Zyuban V, Tullsen D () Interconnections in multicore architectures: understanding mechanisms, overheads, and
scaling. In: Proceedings of ISCA, Madison
. Tendler JM () POWER processors: the beat goes on. http://
www.ibm.com / developerworks / wikis /download/attachments /
/POWER+-+The+Beat+Goes+On.pdf
. Tilera. TILE-Gx processor family product brief. http://www.
tilera.com /sites /default /files /productbriefs / PB_Processor_
A_v.pdf
. Udipi A, Muralimanohar N, Balasubramonian R () Towards
scalable, energy-efficient, bus-based on-chip networks. In:
Proceedings of HPCA, Bangalore
. Vantrease D, Schreiber R, Monchiero M, McLaren M, Jouppi
N, Fiorentino M, Davis A, Binkert N, Beausoleil R, Ahn J-H
() Corona: system implications of emerging nanophotonic
technology. In: Proceedings of ISCA, Beijing
. Wang H-S, Peh L-S, Malik S () Power-driven design of
router microarchitectures in on-chip networks. In: Proceedings
of MICRO, San Diego
Butterfly
Networks, Multistage

C
C*
Guy L. Steele Jr.
Oracle Labs, Burlington, MA, USA
Definition
C* (pronounced “see-star”) refers to two distinct data-parallel dialects of C developed by Thinking Machines
Corporation for its Connection Machine supercomputers. The first version () is organized around
the declaration of domains, similar to classes in C++,
but when code associated with a domain is activated,
it is executed in parallel within all instances of the
domain, not just a single designated instance. Compound assignment operators such as += are extended in
C* to perform parallel reduction operations. An elaborate theory of control flow allows use of C control statements in a MIMD-like, yet predictable, fashion despite
the fact that the underlying execution model is SIMD.
The revised version () replaces domains with shapes
that organize processors into multidimensional arrays
and abandons the MIMD-like control-flow theory.
Discussion
Of the four programming languages (*Lisp, C*, CM Fortran, and CM-Lisp) provided by Thinking Machines
Corporation for Connection Machine Systems, C* was
the most clever (indeed, perhaps too clever) in trying
to extend features of an already existing sequential language for parallel execution. To quote the language
designers:
▸ C* is an extension of the C programming language
designed to support programming in the data parallel style, in which the programmer writes code as if a
processor were associated with every data element. C*
features a single new data type (based on classes in
C++), a synchronous execution model, and a minimal
number of extensions to C statement and expression
syntax. Rather than introducing a plethora of new language constructs to express parallelism, C* relies on
existing C operators, applied to parallel data, to express
such notions as broadcasting, reduction, and interprocessor communication in both regular and irregular
patterns [, , ].
The original proposed name for the language was
*C, not only by analogy with *Lisp, but with a view for
the potential of making a similarly data-parallel extension to the C++ language, which would then naturally
be called *C++. However, the marketing department
of Thinking Machines Corporation decided that “C*”
sounded better. This inconsistency in the placement
of the “*” did confuse many Thinking Machines customers and others, resulting in frequent references to
“*C” anyway, and even to “Lisp*” on occasion.
The Initial Design of C*
The basic idea was to start with the C programming language and then augment it with the ability to declare
something like a C++ class, but with the keyword
class replaced with the keyword domain. As in C++,
a domain could have functions as well as variables as
members. However, the notion of method invocation
(calling a member function on a single specific instance)
was replaced by the notion of domain activation (calling
a member function, or executing code, on all instances
simultaneously and synchronously). Everything else in
the language was driven by that one design decision,
that one metaphor.
Two new keywords, mono and poly, were introduced to describe in which memory data resided – the
front-end computer or the Connection Machine processors, respectively. Variables declared within sequential code were mono by default, and variables declared
within parallel code were poly by default, so the principal use of these keywords was in describing pointer
types; for example, the declaration
mono int *poly p;
indicates that p is a poly pointer to a mono int (i.e.,
p holds many values, all stored in the Connection
Machine processors, one in each instance of the current domain; and each of these values is a pointer to an
integer that resides in the front-end processor, but each
of these pointers might point to a different front-end
integer).
C* systematically extended the standard operators
in C to have parallel semantics by applying two rules:
(1) if a binary operator has a scalar operand and a parallel operand, the scalar value is automatically replicated
to form a parallel value (an idea previously seen in both
APL and Fortran x), and (2) an operator applied to parallel operands is executed for all active processors as if
in some serial order. In this way, binary operators such
as + and - and % can be applied elementwise to many
sets of operands at once, and binary operators with
side effects – the compound assignment operators – are
guaranteed to have predictable sensible behavior; e.g., if
a is a scalar variable and b is a parallel variable, then the
effect of a+=b is to add every active element of b into
a, because it must behave as if each active element of b
were added into a in some sequential order. (In practice,
the implementation used a parallel algorithm to sum the
elements of b and then added that sum into a.)
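The two rules can be made concrete with a small model. The following Python sketch is not C*; it is a minimal emulation that assumes a parallel value can be represented as a list with one element per processor and that the active processors are given by a list of booleans. The helper names (replicate, elementwise_add, add_into_scalar) are hypothetical.

```python
def replicate(x, n):
    """Rule 1: a scalar operand is replicated to form a parallel value."""
    return list(x) if isinstance(x, list) else [x] * n

def elementwise_add(x, y, active):
    """Model of x + y on parallel data; inactive processors compute nothing (None)."""
    n = len(active)
    xs, ys = replicate(x, n), replicate(y, n)
    return [xv + yv if on else None for xv, yv, on in zip(xs, ys, active)]

def add_into_scalar(a, b, active):
    """Rule 2: model of a += b with scalar a and parallel b."""
    for bv, on in zip(b, active):  # behaves as if executed in some serial order
        if on:
            a += bv
    return a

active = [True, True, False, True]
b = [10, 20, 30, 40]
c = elementwise_add(b, 1, active)      # the scalar 1 is replicated
total = add_into_scalar(5, b, active)  # only active elements of b are added
```

Because addition is associative and commutative, any serial order gives the same total, which is what permits a parallel summation in an actual implementation.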
C* makes two extensions to the standard set of C
operators, both motivated by a desire to extend the parallel functionality provided by the standard operators
in a consistent manner. Two common arithmetic operations, min and max, are provided in C as preprocessor
macros rather than as operators; one disadvantage of a
macro is that it cannot be used as part of a compound
assignment operator, and so one cannot write a max= b
in the same way that one can write a+=b. C* introduces operators <? and >? to serve as min and max
operations. The designers commented:
▸ They may be understood in terms of their traditional
macro definitions
a <? b means ((a) < (b)) ? (a) : (b)
a >? b means ((a) > (b)) ? (a) : (b)
but of course the operators, unlike the macro definitions, evaluate each argument exactly once. The operators <? and >? are intended as mnemonic reminders
of these definitions. (Such mnemonic reminders are
important. The original design of C* used >< and <>
for the maximum and minimum operators. We quickly
discovered that users had some difficulty remembering
which was which.) [, ]
In addition, most of the binary compound operators are
pressed into service as unary reduction operators. Thus
+=b computes the sum of all the active values in b and
returns that sum as a mono value; similarly >?=b finds
the largest active value in b.
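As an illustration only, the unary reduction forms can be modeled in Python as a fold over the values held by active processors. The names op_min, op_max, and reduce_active are hypothetical, and the behavior when no processor is active is not taken from the C* definition; this sketch simply raises an error in that case.

```python
from functools import reduce
import operator

def op_min(a, b):
    """Model of a <? b; both arguments have already been evaluated exactly once."""
    return a if a < b else b

def op_max(a, b):
    """Model of a >? b."""
    return a if a > b else b

def reduce_active(op, values, active):
    """Model of a unary reduction such as +=b, <?=b, or >?=b."""
    selected = [v for v, on in zip(values, active) if on]
    if not selected:
        raise ValueError("no active processors")  # C* behavior not modeled here
    return reduce(op, selected)

b = [7, 3, 9, 1]
active = [True, True, True, False]
total = reduce_active(operator.add, b, active)  # +=b
smallest = reduce_active(op_min, b, active)     # <?=b
largest = reduce_active(op_max, b, active)      # >?=b
```

The value 1 held by the inactive fourth processor does not take part in any of the three reductions.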
C* also added a modulus operator %% because of
the great utility of modulus, as distinguished from the
remainder operator %, in performing array index calculations: when k is zero, the expression (k-1)%%n
produces a much more useful result, namely, n-1, than
does the expression (k-1)%n, which produces -1.
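The difference between the two operators is easy to check numerically. In the Python sketch below, c_remainder is a hypothetical helper that mimics C's % under the assumption that integer division truncates toward zero, and modulus models %% for a positive divisor using Python's own %, which already has floor semantics.

```python
def c_remainder(a, n):
    """C-style remainder, assuming integer division truncates toward zero."""
    q = abs(a) // abs(n)
    if (a < 0) != (n < 0):
        q = -q
    return a - q * n

def modulus(a, n):
    """Model of %% for positive n: the result always lies in 0..n-1."""
    return a % n

n = 4
# Index of the left neighbor of element k in a ring of n elements.
left_with_modulus = [modulus(k - 1, n) for k in range(n)]
left_with_remainder = [c_remainder(k - 1, n) for k in range(n)]
```

With the modulus form, element 0 wraps around to element n-1; with the remainder form it yields -1, which is not a valid index into the ring.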
Because a processor can contain pointers to variables residing in other processors, interprocessor communication can be expressed simply by dereferencing
such pointers. Thus if p is a poly pointer to a poly
int, then *p causes each active processor to fetch
an int value through its own p pointer, and *p = b
causes each active processor to store its b value indirectly through p (thus “sending a message” to some
other processor). The “combining router” feature of the
Connection Machine could be invoked by an expression such as *p += b, which might cause the b values to be sent to some smaller number of destinations, resulting in the summing (in parallel) of various subsets of the b values into the various individual
destinations.
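A rough Python model of pointer-based communication follows. It assumes that a poly pointer can be represented as the index of the processor it points to, and that the variable being addressed is a list x with one element per processor; fetch and send_add are hypothetical names, and the sequential loop only stands in for what the router would do in parallel.

```python
def fetch(x, p, active):
    """Model of reading *p: each active processor i reads x from processor p[i]."""
    return [x[p[i]] if on else None for i, on in enumerate(active)]

def send_add(x, p, b, active):
    """Model of *p += b: each active processor i adds b[i] into x at processor p[i]."""
    for i, on in enumerate(active):
        if on:
            x[p[i]] += b[i]  # messages with the same destination are combined by summing
    return x

x = [0, 0, 0, 0, 0]            # one int per processor
p = [0, 2, 0, 2, 1]            # destination processor chosen by each sender
b = [1, 2, 3, 4, 5]
active = [True, True, True, True, False]

send_add(x, p, b, active)
seen = fetch(x, p, active)
```

Processors 0 and 2 both send to processor 0, and processors 1 and 3 both send to processor 2, so two subsets of the b values are summed into two destinations. The inactive processor 4 sends nothing.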
C* has an elaborate theory of implicitly synchronized control flow that allows the programmer to code,
for the most part, as if each Connection Machine processor were executing its own copy of parallel code
independently as if it were ordinary sequential C code.
The idea is that each processor has its own program
counter (a “virtual pc”), but can make no progress until
a “master pc” arrives at that point in the code, at which
point that virtual pc (as well as every other virtual pc waiting at that particular code location) becomes active,
joining the master pc and participating in SIMD execution. Whenever the master pc reaches a conditional
branch, each active virtual pc that takes the branch
becomes inactive (thus, e.g., in an if statement, after
evaluation of the test expression, processors that need
to execute the else part take an implicit branch to
the else part and wait while other processors possibly
proceed to execute the “then” part). Whenever the master pc reaches the end of a statement, every virtual pc
becomes inactive, and the master pc is transferred from
the current point to a new point in the code, namely, the
earliest point that has waiting program counters, within
the innermost statement that has waiting program counters within it, that contains the current point. Frequently
this new point is the same as the current point, and frequently this fact can be determined at compile time;
but in general the effect is to keep trying to pull lagging program counters forward, in such a way that once
the master pc enters a block, it does not leave the block
until every virtual pc has left the block. Thus this execution rule respects block structure (and subroutine call
structure) in a natural manner, while allowing the programmer to make arbitrary use of goto statements
if desired. (This theory of control flow was inspired
by earlier investigations of a SIMD-like theory giving
rise to a superficially MIMD-like control structure in
Connection Machine Lisp [].)
A compiler for this version of C* was independently
implemented at the University of New Hampshire []
for a MIMD multicomputer (an N-Cube , whose
processors communicated by message passing through
a hypercube network).
Figure  shows an early () example of a C* program that identifies prime numbers by the method of
the Sieve of Eratosthenes, taken (with one minor correction) from []. In typical C style, N is defined to
be a preprocessor name that expands to the integer
literal 100000. The name bit is then defined to be
a synonym for the type of 1-bit integers. The domain
declaration defines a parallel processing domain named
SIEVE that has a field named prime of type bit, and
then declares an array named sieve of length N, each
element of which is an instance of this domain. (This
single statement could instead have been written as two
statements:
domain SIEVE { bit prime; };
domain SIEVE sieve[N];
In this respect domain declarations are very much like
C struct declarations.) The function find_primes
is declared within the domain SIEVE, so when it
is called, its body is executed within every instance
of that domain, with a distinct virtual processor of
the Connection Machine containing and processing
each such instance. The “magic identifier” this is a
pointer to the current domain instance, and its value is
therefore different within each executing instance; subtracting from it a pointer to the first instance in the
array, namely, &sieve[0], produces the index of that
instance within the sieve array, by the usual process
of C pointer arithmetic. Local variables value and
candidate are declared within each instance of the
domain, and therefore are parallel values. The while
statement behaves in such a way that different domain
instances can execute different numbers of iterations, as
appropriate to the data within that instance; when an
active instance computes a zero (false) value for the test
expression candidate, that instance simply becomes
inactive. When all instances have become inactive, then
the while statement completes after making active
exactly those instances that had been active when execution of the while statement began. The mono storage class keyword indicates that a variable should be
allocated just once (in the front-end processor), not
once within each domain instance. The unary operator <?= is the minimum-reduction operator, so the
expression (<?= value) returns the smallest integer
that any active domain instance holds in its value
variable. The array sieve can be indexed in the normal C fashion, by either a mono (front-end) index, as
in this example code, or by a poly (parallel) value,
which can be used to perform inter-domain (i.e., interprocessor) communication. The if statement is handled in much the same way as the while statement:
If an active instance computes a zero (false) value for
the test expression candidate, that instance simply
becomes inactive during execution of the “then” part of
the if statement, and then becomes active again. (If an
if statement has an else part, then processors that
had been active for the “then” part become temporarily
inactive during execution of the else part.)
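The behavior described for the example program can be traced with a sequential emulation. The Python sketch below is an assumption-laden model, not a translation tool: each domain instance is represented by one list index, the candidate list doubles as the set of instances still active in the while statement, and the function returns the indices whose prime field ends up set.

```python
def find_primes(n):
    """Sequential emulation of the domain-based sieve for n instances."""
    value = list(range(n))                 # this - &sieve[0] in each instance
    candidate = [v >= 2 for v in value]
    prime = [False] * n
    # while (candidate): an instance drops out once its own test is false
    while any(candidate):
        # mono int next_prime = (<?= value), taken over active instances only
        next_prime = min(v for v, c in zip(value, candidate) if c)
        prime[next_prime] = True           # sieve[next_prime].prime = 1
        for i in range(n):
            # if (value % next_prime == 0) candidate = 0;
            if candidate[i] and value[i] % next_prime == 0:
                candidate[i] = False
    return [i for i, is_prime in enumerate(prime) if is_prime]

primes_below_30 = find_primes(30)
```

The smallest remaining candidate is always prime, because every multiple of a smaller prime has already been made inactive, and it removes itself along with its multiples in the same iteration.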
The Revised Design of C*
In , the C* language was revised substantially []
to produce version .; this version was initially implemented for the Connection Machine model CM-
[, ] and later for the model CM- as well as the
model CM- [, ]. This revised design was intended
to be closer in style to ANSI C than to C++. The
biggest change was to introduce the notion of a shape,
#define N 100000
typedef int bit:1;
domain SIEVE { bit prime; } sieve[N];
void SIEVE::find_primes() {
    int value = this - &sieve[0];
    bit candidate = (value >= 2);
    prime = 0;
    while (candidate) {
        mono int next_prime = (<?= value);
        sieve[next_prime].prime = 1;
        if (value % next_prime == 0) candidate = 0;
    }
}
C*. Fig.  Example version  C* program for identifying prime numbers
which essentially describes a (possibly multidimensional) array of virtual processors. An ordinary variable
declaration may be tagged with the name of a shape,
which indicates that the declaration is replicated for
parallel processing, one instance for each position in
the shape. Where the initial design of C* required that
all instances of a domain be processed in parallel, the
new design allows declaration of several different shapes
(parallel arrays) having the same data layout, and different shapes may be chosen at different times for parallel
execution. Parallel execution is initiated using a newly
introduced with statement that specifies a shape:
with (shape) statement
The statement is executed with the specified shape as the
“current shape” for parallel execution. C operators may
be applied to parallel data much as in the original version of C*, but such data must have the current shape (a
requirement enforced by the compiler at compile time).
Positions within shapes may be selected by indexing.
In order to distinguish shape indexing from ordinary
array indexing, shape subscripts are written to the left
rather than to the right. Given these declarations:
shape [16][16][16]cube;
float:cube z;
int:cube a[10][10];
then cube is a three-dimensional shape having 4,096
(16 × 16 × 16) distinct positions, z is a parallel variable
consisting of one float value at each of these 4,096
shape positions, and a is a parallel array variable that has
a 10 × 10 array of int values at each of 4,096 shape positions (for a total of 409,600 distinct int values). Then
[3][13][5]z refers to one particular float value
within z at position (3, 13, 5) within the cube shape,
and [3][13][5]a[4][7] refers to one particular
int element within the particular 10 × 10 array at that
same position. One may also write [3][13][5]a to
refer to (the address of) that same 10 × 10 array. (This
use of left subscripts to index “across processors” and
right subscripts to index “within a processor” may be
compared to the use of square brackets and parentheses to distinguish two sorts of subscript in Co-Array
Fortran [].)
Although pointers can be used for interprocessor communication exactly as in earlier versions
of C*, such communication is more conveniently
expressed in C* version . by shape indexing (writing
[i][j][k]z = b rather than *p = b, for example),
thus affording a more array-oriented style of programming. Furthermore, version . of C* abandons the
entire “master program counter” execution model that
allowed all C control structures to be used for parallel
execution. Instead, statements such as if, while, and
switch are restricted to test nonparallel values, and a
newly introduced where statement tests parallel values, behaving very much like the where statement in
Fortran . The net effect of all these revisions is to give
version . of C* an execution model and programming
style more closely resembling those of *Lisp and CM
Fortran.
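The division of labor between the two kinds of statement can be sketched with explicit masks. In the Python model below, where is a hypothetical helper that narrows the current active set by a parallel condition and also returns the complementary set, which is how an else clause is assumed to behave here; an ordinary if must instead be given a single nonparallel value, for example the result of a reduction.

```python
def where(active, cond):
    """Split the active set by a parallel condition into (then, else) masks."""
    then_mask = [on and bool(c) for on, c in zip(active, cond)]
    else_mask = [on and not c for on, c in zip(active, cond)]
    return then_mask, else_mask

def assign(target, values, mask):
    """Store values[i] into target[i] at every active position."""
    for i, on in enumerate(mask):
        if on:
            target[i] = values[i]

x = [-2, 5, -7, 0]
y = [None] * len(x)
everyone = [True] * len(x)

# where (x < 0) y = -x; else y = x;
negative, nonnegative = where(everyone, [v < 0 for v in x])
assign(y, [-v for v in x], negative)
assign(y, x, nonnegative)

# A nested where can only narrow the active set further.
very_negative, _ = where(negative, [v < -5 for v in x])

# An if statement needs one nonparallel value, here an or-reduction.
any_negative = any(v < 0 for v, on in zip(x, everyone) if on)
```

Both branches of the where run one after the other under complementary masks, so every position receives exactly one assignment.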
Figure  shows a rewriting of the code in Fig.  into
C* version .. The domain declaration is replaced by
a shape declaration that specifies a one-dimensional
shape named SIEVE of size N. The global variable
prime of type bit is declared within shape SIEVE.
The function find_primes is no longer declared
within a domain. When it is called, the with statement establishes SIEVE as the current shape for parallel execution; conceptually, a distinct virtual processor
of the Connection Machine is associated with each position in the current shape. The function pcoord takes
an integer specifying an axis number and returns, at
each position of the current shape, an integer indicating
the index of the position along the specified axis; thus
in this example pcoord returns values ranging from
0 to 99999. Local variables value and candidate
are explicitly declared as belonging to shape SIEVE,
and therefore are parallel values. The while statement
of Fig.  becomes a while statement and a where
statement in Fig. ; the new while statement uses an
explicit or-reduction operator |= to decide whether
there are any remaining candidates; if there are, the
where statement makes inactive every position in the
shape for which its candidate value is zero. The declaration of local variable next_prime does not mention a shape, so it is not a parallel variable (note that
the mono keyword is no longer used). A left subscript
is used to assign 1 to a single value of prime at the
position within shape SIEVE indicated by the value of
next_prime.
The need to split what was a single while statement in earlier versions of C* into two nested statements
(a while with an or-reduction operator containing a
where that typically repeats the same expression) has
been criticized, as well as the syntax of type and shape
declarations and the lack of nested parallelism [].
Version . of C* [] introduced a “global/local programming” feature, allowing C* to be used for MIMD
programming on the model CM-. A special “prototype file” is used to specify which functions are local
#define N 100000
typedef int bit:1;
shape [N]SIEVE;
bit:SIEVE prime;
void find_primes() {
    with (SIEVE) {
        int:SIEVE value = pcoord(0);
        bit:SIEVE candidate = (value >= 2);
        prime = 0;
        while (|= candidate) {
            where (candidate) {
                int next_prime = (<?= value);
                [next_prime]prime = 1;
                if (value % next_prime == 0) candidate = 0;
            }
        }
    }
}
C*. Fig.  Example version  C* program for identifying prime numbers
and the calling interface to be used when global code
calls each local function. The idea is that a local function executes on a single processing node but can use
all the facilities of C*, including parallelism (which
might be implemented on the model CM- through
SIMD vector accelerator units). This facility is very
similar to “local subprograms” in High Performance
Fortran [].
Related Entries
Coarray Fortran
Connection Machine
Connection Machine Fortran
Connection Machine Lisp
HPF (High Performance Fortran)
*Lisp
Bibliography
. Frankel JL () A reference description of the C* language.
Technical Report TR-, Thinking Machines Corporation,
Cambridge, MA
. Koelbel CH, Loveman DB, Schreiber RS, Steele GL Jr, Zosel ME
() The High Performance Fortran handbook. MIT Press,
Cambridge, MA
. Numrich RW, Reid J () Co-array Fortran for parallel programming. SIGPLAN Fortran Forum ():–
. Quinn MJ, Hatcher PJ () Data-parallel programming on multicomputers. IEEE Softw ():–
. Rose JR, Steele GL Jr () C*: An extended C language for
data parallel programming. Technical Report PL –, Thinking
Machines Corporation, Cambridge, MA
. Rose JR, Steele GL Jr () C*: An extended C language for data
parallel programming. In: Supercomputing ’: Proceedings of
the second international conference on supercomputing, vol II:
Industrial supercomputer applications and computations. International Supercomputing Institute, Inc., St. Petersburg, Florida,
pp –
. Steele GL Jr, Daniel Hillis W () Connection Machine Lisp:
fine-grained parallel symbolic processing. In: LFP ’: Proc. 
ACM conference on LISP and functional programming, ACM
SIGPLAN/SIGACT/SIGART, ACM, New York, pp –, Aug

. Thinking Machines Corporation () Connection Machine
model CM- technical summary. Technical report HA-,
Cambridge, MA
. Thinking Machines Corporation () C* programming guide,
version . Pre-Beta. Cambridge, MA
. Thinking Machines Corporation () C* user’s guide, version
. Pre-Beta. Cambridge, MA
. Thinking Machines Corporation () C* programming guide.
Cambridge, MA
. Thinking Machines Corporation () Connection Machine
CM- technical summary, rd edn. Cambridge, MA
. Thinking Machines Corporation () C* . Alpha release
notes. Cambridge, MA
. Tichy WF, Philippsen M, Hatcher P () A critique of the
programming language C*. Commun ACM ():–
Cache Affinity Scheduling
Affinity Scheduling
Cache Coherence
Xiaowei Shen
IBM Research, Armonk, NY, USA
Definition
A shared-memory multiprocessor system provides a
global address space in which processors can exchange
information and synchronize with one another. When
shared variables are cached in multiple caches simultaneously, a memory store operation performed by one
processor can make data copies of the same variable
in other caches out of date. Cache coherence ensures a
coherent memory image for the system so that each processor can observe the semantic effect of memory access
operations performed by other processors in time.
Discussion
The cache coherence mechanism plays a crucial role in
the construction of a shared-memory system, because
of its profound impact on the overall performance and
implementation complexity. It is also one of the most
complicated problems in the design, because an efficient
cache coherence protocol usually incorporates various
optimizations.
Cache Coherence and Memory Consistency
The cache coherence protocol of a shared-memory multiprocessor system implements a memory consistency
model that defines the semantics of memory access
instructions. The essence of memory consistency is the
correspondence between each load instruction and the
store instruction that supplies the data retrieved by
the load instruction. The memory consistency model
of uniprocessor systems is intuitive: a load operation returns the most recent value written to the
address, and a store operation binds the value for subsequent load operations. In parallel systems, however,
notions such as “the most recent value” can become
ambiguous since multiple processors access memory
concurrently.
An ideal memory consistency model should allow
efficient implementations while still maintaining simple semantics for the architect and the compiler writer
to reason about. Sequential consistency [] has been a dominant memory model in parallel computing for decades
due to its simplicity. A system is sequentially consistent if the result of any execution is the same as if the
operations of all the processors were executed in some
sequential order, and the operations of each individual
processor appear in this sequence in the order specified
by its program.
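This definition can be checked mechanically on a small example. The Python sketch below enumerates every interleaving of two two-instruction threads that preserves each thread's program order and collects the possible results. The two-thread program (each thread stores to one variable and then loads the other) and all names are assumptions chosen for illustration.

```python
def interleavings(t0, t1):
    """Yield every merge of t0 and t1 that keeps each thread's program order."""
    if not t0:
        yield list(t1)
        return
    if not t1:
        yield list(t0)
        return
    for rest in interleavings(t0[1:], t1):
        yield [t0[0]] + rest
    for rest in interleavings(t0, t1[1:]):
        yield [t1[0]] + rest

def run(schedule):
    """Execute one sequential order against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, addr, arg in schedule:
        if op == "store":
            mem[addr] = arg          # arg is the value to store
        else:
            regs[arg] = mem[addr]    # arg is the destination register
    return regs["r0"], regs["r1"]

T0 = [("store", "x", 1), ("load", "y", "r0")]
T1 = [("store", "y", 1), ("load", "x", "r1")]

schedules = list(interleavings(T0, T1))
outcomes = {run(s) for s in schedules}
```

Under sequential consistency the result r0 = 0 and r1 = 0 never appears, because in any single sequential order at least one of the stores precedes the load of the same variable by the other thread.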
Sequential consistency is easy for programmers
to understand and use, but it often prohibits many
architectural and compiler optimizations. The desire to
achieve higher performance has led to relaxed memory models, which can provide more implementation
flexibility by exposing optimizing features such as
instruction reordering and data caching. Modern
microprocessors [, ] support selected relaxed memory
consistency models that allow memory accesses to be
reordered, and provide memory fences that can be used
to ensure proper memory access ordering constraints
whenever necessary.
It is worth noting that, as a reaction to ever-changing memory models and their complicated and
imprecise definitions, there is a desire to go back
to the simple, easy-to-understand sequential consistency, even though there are a plethora of problems
in its high-performance implementation []. Ingenious
solutions have been devised to maintain the sequential consistency semantics so that programmers cannot detect if and when the memory accesses are
out of order or nonatomic. For example, advances
in speculative execution may permit memory access
reordering without affecting the semantics of sequential
consistency.
Memory consistency models and cache coherence protocols are sometimes confused with each other.
A memory consistency model defines the semantics
of memory operations, in particular, for each memory
load operation, the data value that should be provided
by the memory system. The memory consistency model
is a critical part of the semantics of the Instruction-Set
Architecture of the system, and thus should be exposed
to the system programmer. A cache coherence protocol, in contrast, is an implementation-level protocol that
defines how caches should be kept coherent in a multiprocessor system in which data of a memory address
can be replicated in multiple caches, and thus should be
made transparent to the system programmer. Generally
speaking, in a shared-memory multiprocessor system,
the underlying cache coherence protocol, together with
some proper memory operation reordering constraint
often enforced when memory operations are issued,
implements the semantics of the memory consistency
model defined for the system.
Snoopy Cache Coherence
A symmetric multiprocessor (SMP) system generally
employs a snoopy mechanism to ensure cache coherence. When a processor reads an address not in its
cache, it broadcasts a read request on the bus or network, and the memory or the cache with the most up-to-date copy can then supply the data. When a processor
broadcasts its intention to write an address which it does
not own exclusively, other caches need to invalidate or
update their copies.
With snoopy cache coherence, when a cache miss
occurs, the requesting cache sends a cache request to
the memory and all its peer caches. When a peer cache
receives the cache request, it performs a cache snoop
operation and produces a cache snoop response indicating whether the requested data is found in the peer
cache and the state of the corresponding cache line.
A combined snoop response can be generated based
on cache snoop responses from all the peer caches. If
the requested data is found in a peer cache, the peer
cache can source the data to the requesting cache via
a cache-to-cache transfer, which is usually referred to
as a cache intervention. The memory is responsible for
supplying the requested data if the combined snoop
response shows that the data cannot be supplied by any
peer cache.
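The request and response flow just described can be sketched as follows. This Python model is a simplification under stated assumptions: each peer cache is a dictionary from address to a (state, data) pair, the combined snoop response is reduced to the list of peers able to source the line, and which states may source data is a parameter (can_supply) because that policy depends on the particular protocol. All names are hypothetical.

```python
def snoop(cache, addr):
    """Snoop response of one peer: the state of the line, 'I' if not present."""
    state, _ = cache.get(addr, ("I", None))
    return state

def read_miss(addr, peers, memory, can_supply=("M",)):
    """Return (source kind, supplying peer or None, data) for a read miss."""
    responses = {name: snoop(cache, addr) for name, cache in peers.items()}
    # Combined snoop response: which peers, if any, are able to source the data.
    suppliers = sorted(name for name, st in responses.items() if st in can_supply)
    if suppliers:
        supplier = suppliers[0]
        return "intervention", supplier, peers[supplier][addr][1]
    return "memory", None, memory[addr]

memory = {0x40: 7, 0x80: 1}
peers = {
    "c1": {0x80: ("M", 9)},      # c1 holds a modified copy of 0x80
    "c2": {0x40: ("S", 7)},      # c2 holds a shared copy of 0x40
}

from_modified = read_miss(0x80, peers, memory)
from_memory = read_miss(0x40, peers, memory)
from_shared = read_miss(0x40, peers, memory, can_supply=("M", "S"))
```

With the default policy only a modified line is sourced by a peer, so the shared line at 0x40 comes from memory; widening can_supply turns the same miss into a cache intervention.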
Example: The MESI Cache Coherence
Protocol
A number of snoopy cache coherence protocols have
been proposed. The MESI coherence protocol and its
variations have been widely used in SMP systems. As
the name suggests, MESI has four cache states, modified
(M), exclusive (E), shared (S), and invalid (I).
● I (invalid): The data is not valid. This is the initial
state or the state after a snoop invalidate hit.
● S (shared): The data is valid, and can also be valid
in other caches. This state is entered when the data
is sourced from the memory or another cache in
the modified state, and the corresponding snoop
response shows that the data is valid in at least one
of the other caches.
● E (exclusive): The data is valid, and has not been
modified. The data is exclusively owned, and cannot
be valid in another cache. This state is entered when
the data is sourced from the memory or another
cache in the modified state, and the corresponding
snoop response shows that the data is not valid in
another cache.
● M (modified): The data is valid and has been modified. The data is exclusively owned, and cannot be
valid in another cache. This state is entered when a
store operation is performed on the cache line.
With the MESI protocol, when a read cache miss occurs,
if the requested data is found in another cache and the
cache line is in the modified state, the cache with the
modified data supplies the data via a cache intervention
(and writes the most up-to-date data back to the memory). However, if the requested data is found in another
cache and the cache line is in the shared state, the cache
with the shared data does not supply the requested data,
since it cannot guarantee from the shared state that it is
the only cache that is to source the data. In this case, the
memory will supply the data to the requesting cache.
When a write cache miss occurs, if data of the memory address is cached in one or more other caches in the
shared state, all those cached copies in other caches are
invalidated before the write operation can be performed
in the local cache.
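The state transitions above can be collected into a small executable model. The Python sketch below tracks a single cache line across several caches. It follows the read-miss and write-miss behavior described in the text and adds two assumptions that the text does not spell out: a cache that held the line in the modified or exclusive state drops to shared when another cache reads it, and a modified copy is written back before it is invalidated by a remote write. The class and method names are hypothetical.

```python
class MesiLine:
    """One memory line as seen by n caches under a simplified MESI protocol."""

    def __init__(self, n_caches, memory_value=0):
        self.state = ["I"] * n_caches
        self.data = [None] * n_caches
        self.memory = memory_value

    def read(self, i):
        if self.state[i] != "I":
            return self.data[i]                  # read hit
        holders = [j for j, st in enumerate(self.state) if j != i and st != "I"]
        for j in holders:
            if self.state[j] == "M":
                self.memory = self.data[j]       # intervention plus write-back
            self.state[j] = "S"                  # assumption: M and E drop to S
        self.data[i] = self.memory
        self.state[i] = "S" if holders else "E"
        return self.data[i]

    def write(self, i, value):
        for j, st in enumerate(self.state):
            if j != i and st != "I":
                if st == "M":
                    self.memory = self.data[j]   # assumption: write back first
                self.state[j] = "I"              # invalidate every other copy
                self.data[j] = None
        self.data[i] = value
        self.state[i] = "M"

line = MesiLine(3, memory_value=5)
first = line.read(0)            # no other copy: cache 0 enters E
after_first = list(line.state)
second = line.read(1)           # cache 0 has a copy: both end up in S
after_second = list(line.state)
line.write(1, 8)                # cache 0 is invalidated, cache 1 enters M
after_write = list(line.state)
third = line.read(2)            # cache 1 supplies the data and memory is updated
after_third = list(line.state)
```

At every step at most one cache holds the line in the M or E state, and whenever one does, no other cache holds a valid copy.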
It should be pointed out that the MESI protocol
described above is just an exemplary protocol to show
the essence of cache coherence operations. It can be
modified or tailored in various ways for implementation optimization. For example, one can imagine that
a cache line in a shared state can provide data for a
read cache miss, rather than letting the memory provide the data. This may provide better response time
for a system in which cache-to-cache data transfer is
faster than memory-to-cache data transfer. Since there
can be more than one cache with the requested data in
the shared state, the cache coherence protocol needs to
specify which cache in the shared state should provide
the data, or how extra data copies should be handled in
case multiple caches provide the requested data at the
same time.
Example: An Enhanced MESI Cache
Coherence Protocol
In modern SMP systems, when a cache miss occurs,
if the requested data is found in both the memory
and a cache, supplying the data via a cache intervention is often preferred over supplying the data from
the memory, because cache-to-cache transfer latency is
usually smaller than memory access latency. Furthermore, when caches are on the same die or in the same
package module, there is usually more bandwidth available for cache-to-cache transfers, compared with the
bandwidth available for off-chip DRAM accesses.
The IBM Power- system [], for example, enhances
the MESI coherence protocol to allow more cache interventions. Compared with MESI, an enhanced coherence protocol allows data of a shared cache line to be
sourced via a cache intervention. In addition, if data
of a modified cache line is sourced from one cache to
another, the modified data does not have to be written
back to the memory immediately. Instead, a cache with
the most up-to-date data can be held responsible for
memory update when it becomes necessary to do so. An
exemplary enhanced MESI protocol employing seven
cache states is as follows.
● I (invalid): The data is invalid. This is the initial state
or the state after a snoop invalidate hit.
● SL (shared, can be sourced): The data is valid and
may also be valid in other caches. The data can
be sourced to another cache in the same module
via a cache intervention. This state is entered when
the data is sourced from another cache or from the
memory.
● S (shared): The data is valid, and may also be valid in
other caches. The data cannot be sourced to another
cache. This state is entered when a snoop read hit
from another cache in the same module occurs on a
cache line in the SL state.
● M (modified): The data is valid, and has been modified. The data is exclusively owned, and cannot be
valid in another cache. The data can be sourced to
another cache. This state is entered when a store
operation is performed on the cache line.
● ME (exclusive): The data is valid, and has not been
modified. The data is exclusively owned, and cannot
be valid in another cache.
● MU (unsolicited modified): The data is valid and
is considered to have been modified. The data is
exclusively owned and cannot be valid in another
cache.
● T (tagged): The data is valid and has been modified. The modified data has been sourced to another
cache. This state is entered when a snoop read hit
occurs on a cache line in the M state.
When data of a memory address is shared in multiple
caches in a single module, the single module can include
at most one cache in the SL state. The cache in the SL
state is responsible for supplying the shared data via a
cache intervention when a cache miss occurs in another
cache in the same module. At any time, the particular
cache that can source the shared data is fixed, regardless of which cache has issued the cache request. When
data of a memory address is shared in more than one
module, each module can include a cache in the SL state.
A cache in the SL state can source the data to another
cache in the same module, but cannot source the data
to a cache in a different module.
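The state transitions described above can be sketched in a few lines of Python. This is a minimal illustration for a single module, not the protocol of any particular machine: the `Module` class and its `read` and `write` methods are hypothetical names, the ME and MU states are omitted, and the behavior of a T line on a later snoop read (it simply stays in T) is an assumption that the text does not spell out.

```python
from enum import Enum

class State(Enum):
    I = "invalid"
    SL = "shared, can be sourced"
    S = "shared"
    M = "modified"
    T = "tagged"

class Module:
    """Tracks the state of one cache line in every cache of a single module."""

    def __init__(self, n_caches):
        self.state = [State.I] * n_caches

    def read(self, requester):
        if self.state[requester] is not State.I:
            return  # read hit: no state change in this sketch
        for peer, st in enumerate(self.state):
            if peer == requester:
                continue
            if st is State.SL:
                self.state[peer] = State.S   # snoop read hit on an SL line
            elif st is State.M:
                self.state[peer] = State.T   # snoop read hit on an M line
        # The requester received the data from a peer cache or from memory.
        self.state[requester] = State.SL

    def write(self, requester):
        for peer in range(len(self.state)):
            if peer != requester:
                self.state[peer] = State.I   # snoop invalidate hit
        self.state[requester] = State.M      # store performed on the line
```

Because every read miss demotes the previous SL holder to S before the requester enters SL, the sketch preserves the property that a module holds at most one SL copy of a line.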
In systems in which a cache-to-cache transfer can
take multiple message-passing hops, sourcing data from
different caches can result in different communication
latency and bandwidth consumption. When a cache
miss occurs in a requesting cache, if requested data is
shared in more than one peer cache, a peer cache that is
closest to the requesting cache is preferred to supply the
requested data to reduce communication latency and
bandwidth consumption of cache intervention. Thus, it
is probably desirable to enhance cache coherence mechanisms with cost-conscious cache-to-cache transfers to
improve overall performance in SMP systems.
Broadcast-Based Cache Coherence Versus
Directory-Based Cache Coherence
A major drawback of broadcast-based snoopy cache
coherence protocols is that a cache request is usually
broadcast to all caches in the system. This can seriously
degrade overall performance, system scalability, and power consumption, especially for large-scale
multiprocessor systems. Further, broadcasting cache
requests indiscriminately may consume enormous network bandwidth, while snooping peer caches unnecessarily may require excessive cache snoop ports. It is
worth noting that servicing a cache request may take
more time than necessary when far away caches are
snooped unnecessarily.
Unlike broadcast-based snoopy cache coherence
protocols, a directory-based cache coherence protocol
maintains a directory entry to record the cache sites in
which each memory block is currently cached []. The
directory entry is often maintained at the site in which
the corresponding physical memory resides. Since the
locations of shared copies are known, the protocol
engine at each site can maintain coherence by employing point-to-point protocol messages. The elimination
of broadcast overcomes a major limitation on scaling
cache coherent machines to large-scale multiprocessor
systems.
Typical directory-based protocols maintain a directory entry for each memory block to record the caches
in which the memory block is currently cached. With
a full-map directory structure, for example, each directory entry comprises one bit for each cache in the system, indicating whether the cache has a data copy of the
memory block. Given a memory address, its directory
entry is usually maintained in a node in which the corresponding physical memory resides. This node is often
referred to as the home of the memory address. When
a cache miss occurs, the requesting cache sends a cache
request to the home, which generates appropriate point-to-point coherence messages according to the directory
information.
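A full-map directory can be sketched as one bit vector per memory block. The code below is only an illustration under stated assumptions: the class name `FullMapDirectory`, its methods, and the modulo-based `home_node` mapping are hypothetical, and real protocols track more state than a presence vector.

```python
class FullMapDirectory:
    """Directory kept at a home node: one presence bit per cache per block."""

    def __init__(self, n_caches):
        self.n_caches = n_caches
        self.presence = {}  # block address -> integer used as a bit vector

    def sharers(self, block):
        bits = self.presence.get(block, 0)
        return [c for c in range(self.n_caches) if (bits >> c) & 1]

    def read_miss(self, block, requester):
        # Record the requester as a new sharer of the block.
        self.presence[block] = self.presence.get(block, 0) | (1 << requester)

    def write_miss(self, block, requester):
        # Invalidations are sent point-to-point, only to caches whose bit is set.
        targets = [c for c in self.sharers(block) if c != requester]
        self.presence[block] = 1 << requester
        return targets

def home_node(block, n_nodes):
    # Hypothetical mapping from a block address to its home node.
    return block % n_nodes
```

The storage cost is visible in the sketch: each entry needs as many bits as there are caches, which is the overhead that alternative directory structures try to reduce.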

However, directory-based cache coherence protocols
have various shortcomings. For example, maintaining a
directory entry for each memory block usually results
in significant storage overhead. Alternative directory
structures may reduce the storage overhead with performance compromises. Furthermore, accessing the directory
can be time consuming since directory information is usually stored in DRAM. Caching recently
used directory entries can potentially reduce directory access latencies but with increased implementation
complexity.
Accessing the directory takes three or four message-passing hops to service a cache request, compared with
two message-passing hops with snoopy cache coherence protocols. Consider a scenario in which a cache
miss occurs in a requesting cache, while the requested
data is modified in another cache. To service the cache
miss, the requesting cache sends a cache request to
the corresponding home. When the home receives the
cache request, it forwards the cache request to the cache
that contains the modified data. When the cache with
the modified data receives the forwarded cache request,
it sends the requested data to the requesting cache (an
alternative is to send the requested data to the home,
which will forward the requested data to the requesting
cache).
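The hop counts in this scenario can be enumerated directly. The following sketch uses a hypothetical function name and message labels; it only lists the messages described above for a miss to a block that is modified in another cache.

```python
def service_miss_to_modified_block(requester, home, owner, reply_via_home=False):
    """Return the (source, destination, message) hops for a directory protocol."""
    hops = [
        (requester, home, "request"),
        (home, owner, "forwarded request"),
    ]
    if reply_via_home:
        # Alternative: the owner returns the data to the home, which forwards it.
        hops.append((owner, home, "data"))
        hops.append((home, requester, "data"))
    else:
        # The owner sends the data straight to the requesting cache.
        hops.append((owner, requester, "data"))
    return hops
```

With `reply_via_home=False` the list has three entries, and with `reply_via_home=True` it has four, matching the three or four message-passing hops mentioned in the text.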
Cache Coherence for Network-Based
Multiprocessor Systems
In a modern shared-memory multiprocessor system,
caches can be interconnected with each other via a
message-passing network instead of a shared bus to
improve system scalability and performance. In a bus-based SMP system, the bus behaves as a central arbitrator that serializes all bus transactions. This ensures
a total order of bus transactions. In a network-based
multiprocessor system, in contrast, when a cache broadcasts a message, the message is not necessarily observed
atomically by all the receiving caches. For example, it
is possible that cache A multicasts a message to caches
B and C, cache B receives the multicast message and
then sends a message to cache C, and cache C receives
cache B’s message before receiving cache A’s multicast
message.
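The reordering in this example can be reproduced with a small event-driven simulation. The link latencies are hypothetical parameters, and the function `delivery_order` is an illustrative name, not part of any real system.

```python
import heapq

def delivery_order(latency):
    """latency[(src, dst)] is an assumed one-way delay on that link."""
    events = []  # heap of (arrival time, receiver, message)
    # Cache A multicasts a message to caches B and C at time 0.
    heapq.heappush(events, (latency["A", "B"], "B", "A's multicast"))
    heapq.heappush(events, (latency["A", "C"], "C", "A's multicast"))
    received = {"B": [], "C": []}
    b_replied = False
    while events:
        time, receiver, message = heapq.heappop(events)
        received[receiver].append(message)
        # On receiving A's multicast, B sends its own message to C.
        if receiver == "B" and not b_replied:
            b_replied = True
            heapq.heappush(events, (time + latency["B", "C"], "C", "B's message"))
    return received
```

When the path from A to C is slower than the path from A to B followed by the path from B to C, cache C observes B's message first, which is the situation a bus-based system rules out by serializing all transactions.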
Protocol correctness can be compromised when
multicast messages can be received in different orders
at different receiving caches. Appropriate mechanisms
are needed to guarantee correctness of cache coherence
for network-based multiprocessor systems [, ].
Bibliography
. Lamport L () How to make a multiprocessor computer that
correctly executes multiprocess programs. IEEE Trans Comput
C-():–
. May C, Silha E, Simpson R, Warren H () The PowerPC architecture: a specification for a new family of RISC processors. Morgan Kaufmann, San Francisco
. Intel Corporation () IA- application developer’s architecture guide
. Gniady C, Falsafi B, Vijaykumar T () Is SC+ILP=RC? In: Proceedings of the th annual international symposium on computer
architecture (ISCA ), Atlanta, – May , pp –
. Tendler J, Dodson J, Fields J, Le H, Sinharoy B () POWER-
system microarchitecture. IBM J Res Dev ():
. Chaiken D, Fields C, Kurihara K, Agarwal A () Directory-based cache coherence in large-scale multiprocessors. Computer
():–
. Martin M, Hill M, Wood D () Token coherence: decoupling
performance and correctness. In: Proceedings of the th annual
international symposium on computer architecture, San Diego, – June 
. Strauss K, Shen X, Torrellas J () Uncorq: unconstrained snoop
request delivery in embedded-ring multiprocessors. In: Proceedings of the th annual IEEE/ACM international symposium on
microarchitecture, Chicago, pp –, – Dec 
Cache-Only Memory Architecture
(COMA)
Josep Torrellas
University of Illinois at Urbana-Champaign, Urbana,
IL, USA
Synonyms
COMA (Cache-only memory architecture)
Definition
A Cache-Only Memory Architecture (COMA) is a
type of cache-coherent nonuniform memory access
(CC-NUMA) architecture. Unlike in a conventional
CC-NUMA architecture, in a COMA, every shared-memory module in the machine is a cache, where each
memory line has a tag with the line’s address and state.
As a processor references a line, it transparently brings
it to both its private cache(s) and its nearby portion of
the NUMA shared memory (Local Memory) – possibly
displacing a valid line from its local memory. Effectively,
each shared-memory module acts as a huge cache memory, giving the name COMA to the architecture. Since
the COMA hardware automatically replicates the data
and migrates it to the memory module of the node that
is currently accessing it, COMA increases the chances of
data being available locally. This reduces the possibility
of frequent long-latency memory accesses. Effectively,
COMA dynamically adapts the shared data layout to the
application’s reference patterns.
Discussion
Basic Concepts
In a conventional CC-NUMA architecture, each node
contains one or more processors with private caches and
a memory module that is part of the NUMA shared
memory. A page allocated in the memory module of
one node can be accessed by the processors of all other
nodes. The physical page number of the page specifies the node where the page is allocated. This node is
referred to as the Home Node of the page. The physical address of a memory line includes the physical page
number and the offset within that page.
In large machines, fetching a line from a remote
memory module can take several times longer than
fetching it from the local memory module. Consequently, for an application to attain high performance,
the local memory module must satisfy a large fraction
of the cache misses. This requires a good placement of
the program pages across the different nodes. If the program’s memory access patterns are too complicated for
the software to understand, individual data structures
may not end up being placed in the memory module of
the node that accesses them the most. In addition, when a
page contains data structures that are read and written
by different processors, it is hard to attain a good page
placement.
In a COMA, the hardware can transparently eliminate a certain class of remote memory accesses. COMA
does this by turning memory modules into large caches
called Attraction Memory (AM). When a processor
requests a line from a remote memory, the line is
inserted in both the processor’s cache and the node’s
AM. A line can be evicted from an AM if another line
needs the space. Ideally, with this support, the processor dynamically attracts its working set into its local
memory module. The lines the processor is not accessing overflow and are sent to other memories. Because a
large AM is more capable of containing a node’s current
working set than a cache is, more of the cache misses are
satisfied locally within the node.
There are three issues that need to be addressed in
COMA, namely finding a line, replacing a line, and dealing with the memory overhead. In the rest of this article,
these issues are described first, then different COMA
designs are outlined, and finally further readings are
suggested.
Finding a Memory Line
In a COMA, the address of a memory line is a global
identifier, not an indicator of the line’s physical location
in memory. Just like a normal cache, the AM keeps a tag
with the address and state of the memory line currently
stored in each memory location. On a cache miss, the
memory controller has to look up the tags in the local
AM to determine whether or not the access can be serviced locally. If the line is not in the local AM, a remote
request is issued to locate the block.
COMA machines have a mechanism to locate a line
in the system so that the processor can find a valid copy
of the line when a miss occurs in the local AM. Different mechanisms are used by different classes of COMA
machines.
One approach is to organize the machine hierarchically, with the processors at the leaves of the tree. Each
level in the hierarchy includes a directory-like structure,
with information about the status of the lines present in
the subtree extending from the leaves up to that level of
the hierarchy. To find a line, the processing node issues
a request that goes to successively higher levels of the
tree, potentially going all the way to the root. The process stops at the level where the subtree contains the line.
This design is called Hierarchical COMA [, ].
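The upward search can be sketched with a tree of directory nodes. The class `DirectoryNode` and the helpers `insert_line` and `find_line` are hypothetical names used only for illustration; the sketch records which lines are present in each subtree and ignores line state.

```python
class DirectoryNode:
    """One level of the hierarchy; `lines` holds every line present in its subtree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.lines = set()

def insert_line(leaf, address):
    # Record the line at the leaf and in every directory above it.
    node = leaf
    while node is not None:
        node.lines.add(address)
        node = node.parent

def find_line(leaf, address):
    """Climb from the requesting leaf and stop at the first level whose subtree has the line."""
    node, levels_climbed = leaf, 0
    while node is not None:
        if address in node.lines:
            return node.name, levels_climbed
        node, levels_climbed = node.parent, levels_climbed + 1
    return None, levels_climbed
```

The number of levels climbed grows with the distance between the requester and the nearest copy, which is the source of the latency concern raised for hierarchical designs.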
Another approach involves assigning a home node
to each memory line, based on the line’s physical
address. The line’s home has the directory entry for the
line. Memory lines can freely migrate, but directory
entries do not. Consequently, to locate a memory line,
a processor interrogates the directory in the line’s home
node. The directory always knows the state and location
of the line and can forward the request to the right node.
This design is called Flat COMA [].
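A minimal sketch of the Flat COMA lookup follows. The class `FlatComa`, the line-interleaved `home_of` mapping, and the single-holder directory are simplifying assumptions made for illustration; an actual directory would also track sharers and line state.

```python
class FlatComa:
    def __init__(self, n_nodes, line_size=64):
        self.n_nodes = n_nodes
        self.line_size = line_size
        self.am = [set() for _ in range(n_nodes)]          # lines held in each AM
        self.directory = [dict() for _ in range(n_nodes)]  # per home: line -> holder

    def home_of(self, address):
        # Hypothetical mapping: interleave lines across nodes by physical address.
        return (address // self.line_size) % self.n_nodes

    def place(self, address, node):
        # The line may live in any AM; only its directory entry is fixed at the home.
        self.am[node].add(address)
        self.directory[self.home_of(address)][address] = node

    def locate(self, address, requester):
        """Return the nodes visited while locating the line."""
        if address in self.am[requester]:
            return [requester]
        home = self.home_of(address)
        holder = self.directory[home][address]
        path = [requester]
        if home != requester:
            path.append(home)
        if holder != home:
            path.append(holder)  # the directory forwards the request
        return path
```

The path never exceeds three nodes in this sketch, regardless of machine size, because the home is computed from the address rather than discovered by search.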
Replacing a Memory Line
The AM acts as a cache, and lines can be displaced
from it. When a line is displaced in a plain cache, it is
either overwritten (if it is unmodified) or written back
to its home memory module, which guarantees a place
for the line.
A memory line in COMA does not have a fixed
backup location where it can be written to if it gets
displaced from an AM. Moreover, even an unmodified
line can be the only copy of that memory line in the
system, and it must not be lost on an AM displacement. Therefore, the system must keep track of the last
copy of a line. As a result, when a modified or otherwise unique line is displaced from an AM, it must be
relocated into another AM.
To guarantee that at least one copy of an unmodified line remains in the system, one of the line’s copies is
denoted as the Master copy. All other shared copies can
be overwritten if displaced, but the master copy must
always be relocated to another AM. When a master copy
or a modified line is relocated, the problem is deciding which node should take the line in its AM. If other
nodes already have one or more other shared copies of
the line, one of them becomes the master copy. Otherwise, another node must accept the line. This process is
called Line Injection.
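The displacement rules above can be condensed into a short decision function. This is a sketch under assumptions: the function name `displace`, the dictionaries it takes, and the choice of the lowest-numbered sharer as the new master are all illustrative, and a modified line is treated here as the sole (master) copy.

```python
def displace(line, node, copies, master):
    """Evict `line` from the AM of `node` and report the action required.

    copies: dict mapping each line to the set of nodes holding a copy
    master: dict mapping each line to the node holding its master copy
    """
    copies[line].discard(node)
    if master[line] != node:
        return "overwritten"              # a non-master shared copy can be dropped
    if copies[line]:
        master[line] = min(copies[line])  # arbitrary pick: another sharer becomes master
        return "master transferred"
    return "inject"                       # last copy: must be relocated to another AM
```

Only the third outcome generates relocation traffic, since the line has to be accepted by some other node's AM.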
Different line injection algorithms are possible. One
approach is for the displacing node to send requests to
other nodes asking if they have space to host the line [].
Another approach is to force one node to accept the line.
This, however, may lead to another line displacement.
A proposed solution is to relocate the new line to the
node that supplied the line that caused the displacement
in the first place [].
Dealing with Memory Overhead
A CC-NUMA machine can allocate all memory to
application or system pages. COMA, however, leaves a
portion of the memory unallocated to facilitate automatic data replication and migration. This unallocated
space supports the replication of lines across AMs.
It also enhances line migration to the AMs of the referencing nodes because less line relocation traffic is
needed.
Without unallocated space, every time a line is
inserted in the AM, another line would have to be relocated. The ratio between the allocated data size and the
total size of the AMs is called the Memory Pressure. If the
memory pressure is %, then % of the AM space is
available for data replication. Both the relocation traffic
and the number of AM misses increase with the memory pressure []. For a given memory size, choosing an
appropriate memory pressure is a trade-off between the
effect on page faults, AM misses, and relocation traffic.
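As a worked illustration of the definition, the ratio can be computed directly. The machine size and allocation below are hypothetical numbers chosen only to make the arithmetic easy to follow, and the function names are illustrative.

```python
def memory_pressure(allocated_bytes, am_bytes_per_node, n_nodes):
    """Ratio between the allocated data size and the total size of the AMs."""
    return allocated_bytes / (am_bytes_per_node * n_nodes)

def replication_fraction(pressure):
    """Fraction of the total AM space left over for replication."""
    return 1.0 - pressure

GIB = 2 ** 30
# Hypothetical machine: 4 nodes with 1 GiB of AM each, 3 GiB of allocated data.
pressure = memory_pressure(3 * GIB, 1 * GIB, 4)
spare = replication_fraction(pressure)
```

In this hypothetical configuration `pressure` is 0.75, so a quarter of the AM space remains available for replicated lines.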
Different Cache-Only Memory Architecture
Designs
Hierarchical COMA
The first designs of COMA machines follow what has
been called Hierarchical COMA. These designs organize the machine hierarchically, connecting the processors to the leaves of the tree. These machines include
the KSR- [] from Kendall Square Research, which has
a hierarchy of rings, and the Data Diffusion Machine
(DDM) [] from the Swedish Institute of Computer
Science, which has a hierarchy of buses.
Each level in the tree hierarchy includes a directory-like structure, with information about the status of the
lines extending from the leaves up to that level of the
hierarchy. To find a line, the processing node issues a
request that goes to successively higher levels of the tree,
potentially going all the way to the root. The process
stops at the level where the subtree contains the line.
In these designs, substantial latency occurs as the
memory requests go up the hierarchy and then down
to find the desired line. It has been argued that such
latency can offset the potential gains of COMA relative
to conventional CC-NUMA architectures [].
Flat COMA
A design called Flat COMA makes it easy to locate a
memory line by assigning a home node to each memory
line [] – based on the line’s physical address. The line’s
home has the directory entry for the line, like in a conventional CC-NUMA architecture. The memory lines
can freely migrate, but the directory entries of the memory lines are fixed in their home nodes. At a miss on a
line in an AM, a request goes to the node that is keeping
the directory information about the line. The directory
redirects the request to another node if the home does
not have a copy of the line. In Flat COMA, unlike in a
conventional CC-NUMA architecture, the home node
may not have a copy of the line even though no processor has written to the line. The line has simply been
displaced from the AM in the home node.
Because Flat COMA does not rely on a hierarchy to
find a block, it can use any high-speed network.
Simple COMA
A design called Simple COMA (S-COMA) [] transfers
some of the complexity in the AM line displacement and
relocation mechanisms to software. The general coherence actions, however, are still maintained in hardware for performance reasons. Specifically, in S-COMA,
the operating system sets aside space in the AM for
incoming memory blocks on a page-granularity basis.
The local Memory Management Unit (MMU) has mappings only for pages in the local node, not for remote
pages. When a node accesses for the first time a shared
page that is already in a remote node, the processor suffers a page fault. The operating system then allocates a
page frame locally for the requested line. Thereafter, the
hardware continues with the request, including locating
a valid copy of the line and inserting it, in the correct state, in the newly allocated page in the local AM.
The rest of the page remains unused until future
requests to other lines of the page start filling it. Subsequent accesses to the line get their mapping directly
from the MMU. There are no AM address tags to check
if the correct line is accessed.
Since the physical address used to identify a line in
the AM is set up independently by the MMU in each
node, two copies of the same line in different nodes are
likely to have different physical addresses. Shared data
needs a global identity so that different nodes can communicate. To this end, each node has a translation table
that converts local addresses to global identifiers and
vice versa.
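The per-node translation table can be sketched as a pair of dictionaries. The class `TranslationTable` and its method names are hypothetical, and the sketch translates at page granularity with the line offset carried through unchanged, which is an assumption about the layout rather than a statement of the S-COMA hardware design.

```python
class TranslationTable:
    """Per-node table converting local page frames to global page identifiers and back."""

    def __init__(self):
        self.local_to_global = {}
        self.global_to_local = {}

    def map_page(self, local_frame, global_page):
        # Called when the operating system allocates a local frame for a shared page.
        self.local_to_global[local_frame] = global_page
        self.global_to_local[global_page] = local_frame

    def to_global(self, local_frame, offset):
        return (self.local_to_global[local_frame], offset)

    def to_local(self, global_page, offset):
        return (self.global_to_local[global_page], offset)
```

Two nodes can then hold the same line at different local frames while still agreeing on its global identifier when they exchange coherence messages.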
Multiplexed Simple COMA
S-COMA sets aside memory space in page-sized
chunks, even if only one line of each page is present.
Consequently, S-COMA suffers from memory fragmentation. This can cause programs to have inflated
working sets that overflow the AM, inducing frequent
page replacements and resulting in high operating system overhead and poor performance.
Multiplexed Simple COMA (MS-COMA) [] eliminates this problem by allowing multiple virtual pages
in a given node to map to the same physical page at
the same time. This mapping is possible because not all the
lines on a virtual page are used at the same time.
A given physical page can now contain lines belonging
to different virtual pages if each line has a short virtual page ID. If two lines belonging to different pages
have the same page offset, they displace each other
from the AM. The overall result is a compression of the
application’s working set.
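The sharing of one physical page by several virtual pages can be sketched as follows. The class `MultiplexedPage` and its methods are illustrative names; each slot stores a short virtual page ID next to the line, as described above, and everything else about the design is simplified away.

```python
class MultiplexedPage:
    """One physical page whose line slots are shared by several virtual pages."""

    def __init__(self, lines_per_page):
        # Each slot holds (virtual page ID, data), or None when empty.
        self.slots = [None] * lines_per_page

    def insert(self, vpage_id, offset, data):
        """Store a line; return the virtual page ID of a displaced line, if any."""
        old = self.slots[offset]
        self.slots[offset] = (vpage_id, data)
        if old is not None and old[0] != vpage_id:
            return old[0]
        return None

    def lookup(self, vpage_id, offset):
        entry = self.slots[offset]
        if entry is not None and entry[0] == vpage_id:
            return entry[1]
        return None  # miss: the slot is empty or holds another page's line
```

Lines from different virtual pages coexist as long as their page offsets differ, and they displace each other only when the offsets collide.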
Further Readings
There are several papers that discuss COMA and
related topics. Dahlgren and Torrellas present a more
in-depth survey of COMA machine issues []. There
are several designs that combine COMA and conventional CC-NUMA architecture features, such as
NUMA with Remote Caches (NUMA-RC) [], Reactive
NUMA [], Excel-NUMA [], the Sun Microsystems’
WildFire multiprocessor design [], the IBM Prism
architecture [], and the Illinois I-ACOMA architecture
[]. A model for comparing the performance of COMA
and conventional CC-NUMA architectures is presented
by Zhang and Torrellas []. Soundararajan et al. []
describe the trade-offs related to data migration and
replication in CC-NUMA machines.
Bibliography
. Basu S, Torrellas J () Enhancing memory use in Simple COMA:
Multiplexed Simple COMA. In: International symposium on high-performance computer architecture, Las Vegas, February 
. Burkhardt H et al () Overview of the KSR computer system.
Technical Report , Kendall Square Research, Waltham,
February 
. Dahlgren F, Torrellas J () Cache-only memory architectures.
IEEE Computer Magazine ():–, June 
. Ekanadham K, Lim B-H, Pattnaik P, Snir M () PRISM: an
integrated architecture for scalable shared memory. In: International symposium on high-performance computer architecture,
Las Vegas, February 

. Falsafi B, Wood D () Reactive NUMA: a design for unifying S-COMA and CC-NUMA. In: International symposium on
computer architecture, Denver, June 
. Hagersten E, Koster M () WildFire: a scalable path for SMPs.
In: International symposium on high-performance computer
architecture, Orlando, January 
. Hagersten E, Landin A, Haridi S () DDM – a cache-only
memory architecture. IEEE Computer ():–
. Joe T, Hennessy J () Evaluating the memory overhead
required for COMA architectures. In: International symposium
on computer architecture, Chicago, April , pp –
. Moga A, Dubois M () The effectiveness of SRAM network
caches in clustered DSMs. In: International symposium on high-performance computer architecture, Las Vegas, February 
. Saulsbury A, Wilkinson T, Carter J, Landin A () An
argument for Simple COMA. In: International symposium on
high-performance computer architecture, Raleigh, January ,
pp –
. Soundararajan V, Heinrich M, Verghese B, Gharachorloo K,
Gupta A, Hennessy J () Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors. In:
International symposium on computer architecture, Barcelona,
June 
. Stenstrom P, Joe T, Gupta A () Comparative performance
evaluation of cache-coherent NUMA and COMA architectures.
In: International symposium on computer architecture, Gold
Coast, Australia, May , pp –
. Torrellas J, Padua D () The Illinois aggressive COMA multiprocessor project (I-ACOMA). In: Symposium on the frontiers
of massively parallel computing, Annapolis, October 
. Zhang Z, Cintra M, Torrellas J () Excel-NUMA: toward
programmability, simplicity, and high performance. IEEE
Trans Comput ():–. Special Issue on Cache Memory,
February 
. Zhang Z, Torrellas J () Reducing remote conflict misses:
NUMA with remote cache versus COMA. In: International
symposium on high-performance computer architecture, San
Antonio, February , pp –
Carbon Cycle Research
Terrestrial Ecosystem Carbon Modeling
Car-Parrinello Method
Mark Tuckerman , Eric J. Bohm , Laxmikant V.
Kalé , Glenn Martyna

New York University, New York, NY, USA

University of Illinois at Urbana-Champaign, Urbana,
IL, USA

IBM Thomas J. Watson Research Center,
Yorktown Heights, NY, USA
Synonyms
Ab initio molecular dynamics; First-principles molecular dynamics
Definition
A Car–Parrinello simulation is a molecular dynamics based calculation in which the finite-temperature
dynamics of a system of N atoms is generated using
forces obtained directly from electronic structure
calculations performed “on the fly” as the simulation
proceeds. A typical Car–Parrinello simulation employs
a density functional description of the electronic structure, a plane-wave basis expansion of the single-particle
orbitals, and periodic boundary conditions on the simulation cell. The original paper has seen an exponential rise in the number of citations, and the method
has become a workhorse for studying systems which
undergo nontrivial electronic structure changes.
Discussion
Caches, NUMA
NUMA Caches
Calculus of Mobile Processes
Pi-Calculus
Introduction
Atomistic modeling of many systems in physics,
chemistry, biology, and materials science requires
explicit treatment of chemical bond-breaking and forming events. The methodology of ab initio molecular
dynamics (AIMD), in which the finite-temperature
dynamics of a system of N atoms is generated using
forces obtained directly from the electronic structure
calculations performed “on the fly” as the simulation
proceeds, can describe such processes in a manner that
is both general and transferable from system to system. The Car–Parrinello method is one type of AIMD
simulation.
Assuming that a physical system can be described
classically in terms of its $N$ constituent atoms, having masses $M_1, \ldots, M_N$ and charges $Z_1 e, \ldots, Z_N e$, the
classical microscopic state of the system is completely determined by specifying the Cartesian positions $\mathbf{R}_1, \ldots, \mathbf{R}_N$ of its atoms and their corresponding
conjugate momenta $\mathbf{P}_1, \ldots, \mathbf{P}_N$ as functions of time. In a
standard molecular dynamics calculation, the time evolution of the system is determined by solving Hamilton’s
equations of motion
\[
\dot{\mathbf{R}}_I = \frac{\mathbf{P}_I}{M_I}, \qquad \dot{\mathbf{P}}_I = \mathbf{F}_I
\]
which can be combined into the second-order differential equations
\[
M_I \ddot{\mathbf{R}}_I = \mathbf{F}_I
\]
Here, $\dot{\mathbf{R}}_I = d\mathbf{R}_I/dt$ and $\dot{\mathbf{P}}_I = d\mathbf{P}_I/dt$ are the first derivatives with respect to time of position and momentum,
respectively, $\ddot{\mathbf{R}}_I = d^2\mathbf{R}_I/dt^2$ is the second derivative
of position with respect to time, and $\mathbf{F}_I$ is the total force
on atom $I$ due to all of the other atoms in the system.
The force $\mathbf{F}_I$ is a function $\mathbf{F}_I(\mathbf{R}_1, \ldots, \mathbf{R}_N)$ of all of the
atomic positions, hence Newton’s equations of motion
constitute a set of $3N$ coupled second-order ordinary
differential equations.
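To make the numerical solution of these equations concrete, the sketch below integrates them with the velocity Verlet scheme. This integrator is a common choice in molecular dynamics but is not named in the text, so it is an assumption here, as are the one-dimensional coordinates and the hypothetical two-atom spring force used as a stand-in for a real force field.

```python
def velocity_verlet(positions, momenta, masses, force, dt, n_steps):
    """Integrate M_I * d2R_I/dt2 = F_I(R_1, ..., R_N) for one-dimensional coordinates."""
    r, p = list(positions), list(momenta)
    f = force(r)
    for _ in range(n_steps):
        p = [pi + 0.5 * dt * fi for pi, fi in zip(p, f)]             # half step of dP/dt = F
        r = [ri + dt * pi / mi for ri, pi, mi in zip(r, p, masses)]  # full step of dR/dt = P/M
        f = force(r)                                                 # forces at the new positions
        p = [pi + 0.5 * dt * fi for pi, fi in zip(p, f)]             # second half step
    return r, p

def harmonic_pair_force(r, k=1.0):
    # Hypothetical force field: two atoms joined by a spring of stiffness k.
    d = r[0] - r[1]
    return [-k * d, k * d]

def total_energy(r, p, masses, k=1.0):
    kinetic = sum(pi * pi / (2.0 * mi) for pi, mi in zip(p, masses))
    return kinetic + 0.5 * k * (r[0] - r[1]) ** 2
```

The `force` argument is where the functional form of the forces enters: a force field supplies a cheap analytic expression, whereas an AIMD calculation would replace it with an electronic structure calculation at every step.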
Any molecular dynamics calculation requires the
functional form $\mathbf{F}_I(\mathbf{R}_1, \ldots, \mathbf{R}_N)$ of the forces as an input
to the method. In most molecular dynamics calculations, the forces are modeled using simple functions
that describe bond stretching, angle bending, torsion,
van der Waals, and Coulombic interactions and a set
of parameters for these interactions that are fit either
to experiment or to high-level ab initio calculations.
Such models are referred to as force fields, and while
force fields are useful for many types of applications,
they generally are unable to describe chemical bond-breaking and forming events and often neglect electronic polarization effects. Moreover, the parameters
cannot be assumed to remain valid in thermodynamic
states very different from the one for which they were
originally fit. Consequently, most force fields are unsuitable for studying chemical processes under varying
external conditions.
An AIMD calculation circumvents the need for
an explicit force field model by obtaining the forces
$\mathbf{F}_I(\mathbf{R}_1, \ldots, \mathbf{R}_N)$ at a given configuration $\mathbf{R}_1, \ldots, \mathbf{R}_N$ of
the nuclei from a quantum mechanical electronic structure calculation performed at this particular nuclear
configuration. To simplify the notation, let $\mathbf{R}$ denote the
complete set $\mathbf{R}_1, \ldots, \mathbf{R}_N$ of nuclear coordinates. Suppose
the forces on the Born–Oppenheimer electronic ground
state surface are sought. Let $|\Psi_0(\mathbf{R})\rangle$ and $\hat{H}_{\mathrm{el}}(\mathbf{R})$ denote,
respectively, the ground-state electronic wave function
and electronic Hamiltonian at the nuclear configuration $\mathbf{R}$. If the system contains $N_e$ electrons with position
operators $\hat{\mathbf{r}}_1, \ldots, \hat{\mathbf{r}}_{N_e}$, then the electronic Hamiltonian in
atomic units ($e = 1$, $\hbar = 1$, $m_e = 1$, $4\pi\epsilon_0 = 1$) is
\[
\hat{H}_{\mathrm{el}} = -\frac{1}{2} \sum_{i=1}^{N_e} \nabla_i^2
+ \sum_{i>j} \frac{1}{|\hat{\mathbf{r}}_i - \hat{\mathbf{r}}_j|}
- \sum_{i=1}^{N_e} \sum_{I=1}^{N} \frac{Z_I}{|\hat{\mathbf{r}}_i - \mathbf{R}_I|}
\]
where the first, second, and third terms are the electron
kinetic energy, the electron–electron Coulomb repulsion, and the electron–nuclear Coulomb attraction,
respectively. The interatomic forces are given exactly by
\[
\mathbf{F}_I(\mathbf{R}) = -\langle \Psi_0(\mathbf{R}) | \nabla_I \hat{H}_{\mathrm{el}}(\mathbf{R}) | \Psi_0(\mathbf{R}) \rangle
+ \sum_{J \neq I} Z_I Z_J \frac{\mathbf{R}_I - \mathbf{R}_J}{|\mathbf{R}_I - \mathbf{R}_J|^3}
\]
by virtue of the Hellmann–Feynman theorem.
In practice, it is usually not possible to obtain the
exact ground-state wave function $|\Psi_0(\mathbf{R})\rangle$, and, therefore, an approximate electronic structure method is
needed. The approximation most commonly employed
in AIMD calculations is the Kohn–Sham formulation
of density functional theory. In the Kohn–Sham theory, the full electronic wave function is replaced by
a set $\psi_s(\mathbf{r})$, $s = 1, \ldots, N_s$, of mutually orthogonal
single-electron orbitals (denoted collectively as $\psi(\mathbf{r})$)
and the corresponding electron density
\[
n(\mathbf{r}) = \sum_{s=1}^{N_s} f_s |\psi_s(\mathbf{r})|^2
\]
where $f_s$ is the occupation number of the state $s$. In
closed-shell calculations $N_s = N_e/2$ with $f_s = 2$, while for
open-shell calculations $N_s = N_e$ with $f_s = 1$. The Kohn–
Sham energy functional gives the total energy of the
system as
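\[
E[\psi, \mathbf{R}] = T_s[\psi] + E_H[n] + E_{\mathrm{ext}}[n] + E_{\mathrm{xc}}[n]
\]
where
\[
T_s[\psi] = -\frac{1}{2} \sum_{s=1}^{N_s} f_s \int d\mathbf{r}\, \psi_s^*(\mathbf{r}) \nabla^2 \psi_s(\mathbf{r})
\]
\[
E_H[n] = \frac{1}{2} \int d\mathbf{r}\, d\mathbf{r}'\, \frac{n(\mathbf{r})\, n(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}
\]
\[
E_{\mathrm{ext}}[n] = -\sum_{I=1}^{N} \int d\mathbf{r}\, \frac{Z_I\, n(\mathbf{r})}{|\mathbf{r} - \mathbf{R}_I|}
\]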
are the single-particle kinetic energy, the Hartree
energy, and the external energy, respectively. The functional dependence of the term $E_{\mathrm{xc}}[n]$, known as the
exchange and correlation energy, is not known and
must, therefore, be approximated. One of the most
commonly used approximations is referred to as the
generalized gradient approximation
\[
E_{\mathrm{xc}}[n] \approx \int d\mathbf{r}\, f(n(\mathbf{r}), \nabla n(\mathbf{r}))
\]
where $f$ is a scalar function of the density and its gradient. When the Kohn–Sham functional is minimized
with respect to the electronic orbitals subject to the
orthogonality condition $\langle \psi_s | \psi_{s'} \rangle = \delta_{ss'}$, then the interatomic forces are given by
\[
\mathbf{F}_I(\mathbf{R}) = -\nabla_I E[\psi^{(0)}, \mathbf{R}]
+ \sum_{J \neq I} Z_I Z_J \frac{\mathbf{R}_I - \mathbf{R}_J}{|\mathbf{R}_I - \mathbf{R}_J|^3}
\]
where $\psi^{(0)}$ denotes the set of orbitals obtained by the
minimization procedure.
Most AIMD calculations use a basis set for expanding the Kohn–Sham orbitals $\psi_s(\mathbf{r})$. Because periodic
boundary conditions are typically employed in molecular dynamics simulations, a useful basis set is a simple
plane-wave basis. In fact, when the potential is periodic,
the Kohn–Sham orbitals must be represented as Bloch
functions, $\psi_{s\mathbf{k}}(\mathbf{r}) = \exp(i\mathbf{k} \cdot \mathbf{r})\, u_s(\mathbf{r})$, where $\mathbf{k}$ is a vector
in the first Brillouin zone. However, if a large enough
simulation cell is used, $\mathbf{k}$ can be taken to be $(0, 0, 0)$
(the Gamma-point) for many chemical systems, as will
be done here. In this case, the plane-wave expansion of
$\psi_s(\mathbf{r})$ becomes the simple Fourier representation
\[
\psi_s(\mathbf{r}) = \frac{1}{\sqrt{V}} \sum_{\mathbf{g}} C_s(\mathbf{g})\, e^{i\mathbf{g} \cdot \mathbf{r}}
\]
where $\mathbf{g} = 2\pi\mathbf{n}/V^{1/3}$, with $\mathbf{n}$ a vector of integers,
denotes the Fourier-space vector corresponding to a
cubic box of volume $V$, and $\{C_s(\mathbf{g})\}$ is the set
of expansion coefficients. At the Gamma-point, the
orbitals are purely real, and the coefficients satisfy the
condition $C_s^*(\mathbf{g}) = C_s(-\mathbf{g})$, which means that only half
of the reciprocal space is needed to reconstruct the full
set of Kohn–Sham orbitals. A similar expansion
\[
n(\mathbf{r}) = \frac{1}{V} \sum_{\mathbf{g}} \tilde{n}(\mathbf{g})\, e^{i\mathbf{g} \cdot \mathbf{r}}
\]
is employed for the electronic density. Note that the
coefficients $\tilde{n}(\mathbf{g})$ depend on the orbital coefficients
$C_s(\mathbf{g})$. Because the density is real, the coefficients $\tilde{n}(\mathbf{g})$
satisfy $\tilde{n}^*(\mathbf{g}) = \tilde{n}(-\mathbf{g})$. Again, this condition means that
the full density can be reconstructed using only half of
the reciprocal space. In order to implement these expansions numerically, they must be truncated. The truncation criterion is based on the plane-wave kinetic energy
$|\mathbf{g}|^2/2$. Specifically, the orbital expansion is truncated at
a value $E_{\mathrm{cut}}$ such that $|\mathbf{g}|^2/2 < E_{\mathrm{cut}}$, and the density,
being determined from the squares of the orbitals, is
truncated using the condition $|\mathbf{g}|^2/2 < 4E_{\mathrm{cut}}$. When the
above two plane-wave expansions are substituted into
the Kohn–Sham energy functional, the resulting energy
is an ordinary function of the orbital expansion coefficients that must be minimized with respect to these
coefficients subject to the orthonormality constraint
$\sum_{\mathbf{g}} C_s^*(\mathbf{g})\, C_{s'}(\mathbf{g}) = \delta_{ss'}$.
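The truncation criterion can be illustrated by enumerating the integer vectors that survive a given cutoff. The sketch assumes a cubic box with Fourier-space vectors equal to 2π/V^(1/3) times a vector of integers, works in atomic units, and takes the density cutoff to be four times the orbital cutoff (an assumption that follows from the density being built from products of orbitals). The function names and the example values of the volume and cutoff are hypothetical.

```python
import math

def plane_waves(volume, e_cut, factor=1.0):
    """Integer vectors n whose plane wave satisfies |g|^2 / 2 < factor * e_cut."""
    unit = 2.0 * math.pi / volume ** (1.0 / 3.0)  # spacing of the reciprocal grid
    n_max = int(math.sqrt(2.0 * factor * e_cut) / unit)
    vectors = []
    for nx in range(-n_max, n_max + 1):
        for ny in range(-n_max, n_max + 1):
            for nz in range(-n_max, n_max + 1):
                g2 = unit * unit * (nx * nx + ny * ny + nz * nz)
                if 0.5 * g2 < factor * e_cut:
                    vectors.append((nx, ny, nz))
    return vectors

def half_space(vectors):
    """Keep one member of each (g, -g) pair, plus g = 0."""
    return [n for n in vectors if n >= tuple(-x for x in n)]

# Hypothetical example: box volume 1000 and orbital cutoff 2.0, both in atomic units.
orbital_set = plane_waves(1000.0, 2.0)
density_set = plane_waves(1000.0, 2.0, factor=4.0)
```

Because the retained set is symmetric under negation, `half_space` keeps just over half of the vectors, which is the saving that the real-valued orbitals and density make possible.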
An alternative to explicit minimization of the energy
was proposed by Car and Parrinello (see bibliographic
notes) based on the introduction of an artificial dynamics for the orbital coefficients
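$$\mu \ddot{C}_s(\mathbf{g}) = -\frac{\partial E}{\partial C_s^*(\mathbf{g})} - \sum_{s'} \Lambda_{ss'}\, C_{s'}(\mathbf{g})$$
$$M_I \ddot{\mathbf{R}}_I = -\frac{\partial E}{\partial \mathbf{R}_I}$$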
In the above equations, known as the Car–Parrinello
equations, μ is a mass-like parameter (having units of
energy×time²) that determines the time scale on which
the coefficients evolve, and Λ ss′ is a matrix of Lagrange
multipliers for enforcing the orthogonality condition
as a constraint. The mechanism of the Car–Parrinello
equations is the following: An explicit minimization of
the energy with respect to the orbital coefficients is carried out at a given initial configuration of the nuclei.
Following this, the Car–Parrinello equations are integrated numerically with a value of μ small enough to
ensure that the coefficients respond as quickly as possible to the motion of the nuclei. In addition, the orbital
coefficients must be assigned a fictitious kinetic energy
satisfying
$$\mu \sum_s \sum_{\mathbf{g}} |\dot{C}_s(\mathbf{g})|^2 \ll \sum_I M_I\, \dot{\mathbf{R}}_I^2$$
in order to ensure that the coefficients remain close to
the ground-state Born–Oppenheimer surface throughout the simulation.
In a typical calculation containing  – nuclei
and a comparable number of Kohn–Sham orbitals, the
number of coefficients per orbital is often in the range
 – , depending on the atomic types present in the
system. Although this may seem like a large number of
coefficients, the number would be considerably larger
without a crucial approximation applied to the external energy $E_{\rm ext}[n]$. Specifically, electrons in low-lying
energy states, also known as core electrons, are eliminated in favor of an atomic pseudopotential that requires
replacing the exact $E_{\rm ext}[n]$ by a functional of the following form
$$E_{\rm ext}[n,\psi] \approx \sum_{I=1}^{N} \int d\mathbf{r}\, n(\mathbf{r})\, v_l(\mathbf{r}-\mathbf{R}_I) + E_{\rm nl}[\psi] \equiv E_{\rm loc}[n] + E_{\rm nl}[\psi]$$
where $E_{\rm loc}[n]$ is a purely local energy term, and $E_{\rm nl}[\psi]$ is
an orbital-dependent functional known as the nonlocal
energy given by
$$E_{\rm nl}[\psi] = \sum_s \sum_I \sum_{l=0}^{\bar{l}-1} \sum_{m=-l}^{l} w_l\, |Z_{sIlm}|^2$$
with
$$Z_{sIlm} = \int d\mathbf{r}\, F_{lm}(\mathbf{r}-\mathbf{R}_I)\, \psi_s(\mathbf{r})$$
Here $F_{lm}(\mathbf{r})$ is a function particular to the pseudopotential, and l and m label angular momentum channels.
The value of l is summed up to a maximum $\bar{l} - 1$. Typically, $\bar{l}$ =  or  depending on the chemical composition
of the system, but higher values can be included when
necessary. Despite the pseudopotential approximation,
AIMD calculations in a plane-wave basis are computationally very intensive and can benefit substantially
from massive parallelization. In addition, a typical value
of the integration time step in a Car–Parrinello AIMD
simulation is . fs, which means that  – or more
electronic structure calculations are needed to generate
a trajectory of just – ps. Thus, the parallelization
scheme must be efficient and scale well with
the number of processors. The remainder of this entry
will be devoted to the discussion of the algorithm and
parallelization techniques for it.
Outline of the Algorithm
The calculation of the total energy and its derivatives with respect to orbital expansion coefficients and
nuclear positions consists of the following phases:
1. Phase I: Starting with the orbital coefficients in
reciprocal space, the electron kinetic energy and
its coefficient derivatives are evaluated using the
formula
$$T_s = \frac{1}{2} \sum_{s=1}^{N_s} f_s \sum_{\mathbf{g}} |\mathbf{g}|^2\, |C_s(\mathbf{g})|^2$$
2. Phase II: The orbital coefficients are transformed
from Fourier space to real space. This operation
requires Ns three-dimensional fast Fourier transforms (FFTs).
3. Phase III: The real-space coefficients are squared
and summed to generate the density n(r).
4. Phase IV: The real-space density n(r) is used to
evaluate the exchange-correlation functional and
its functional derivatives with respect to the density. Note that $E_{\rm xc}[n]$ is generally evaluated on
a regular mesh; hence, the functional derivatives
are replaced by ordinary derivatives at the mesh
points.
5. Phase V: The density is Fourier transformed to
reciprocal space, and the coefficients ñ(g) are used
to evaluate the Hartree and purely local part of the
pseudopotential energy using the formulas
$$E_{\rm H} = \frac{1}{2V} \sum_{\mathbf{g}\neq(0,0,0)} \frac{4\pi}{|\mathbf{g}|^2}\, |\tilde{n}(\mathbf{g})|^2$$
$$E_{\rm loc} = \frac{1}{V} \sum_{I=1}^{N} \sum_{\mathbf{g}} \tilde{n}^*(\mathbf{g})\, \tilde{v}_l(\mathbf{g})\, e^{-i\mathbf{g}\cdot\mathbf{R}_I}$$
and their derivatives and nuclear position derivatives, where $\tilde{v}_l(\mathbf{g})$ is the Fourier transform of the
potential $v_l(\mathbf{r})$.
6. Phase VI: The derivatives from Phase V are Fourier
transformed to real space and combined with the
derivatives from Phase IV. The combined functional
derivatives are multiplied against the real-space
orbital coefficients to produce part of the orbital
forces in real space.
7. Phase VII: The reciprocal-space orbital coefficients
are used to evaluate the nonlocal pseudopotential
energy and its derivatives.
8. Phase VIII: The forces from Phase VI are combined
and are then transformed back into reciprocal space,
an operation that requires Ns FFTs.
9. Phase IX: The nuclear–nuclear Coulomb repulsion
energy and its position derivatives are evaluated
using standard Ewald summation.
10. Phase X: The reciprocal-space forces are combined
with those from Phases I and VII to yield the
total orbital forces. These, together with the nuclear
forces, are fed into a numerical solver in order to
advance the nuclear positions and reciprocal-space
orbital coefficients to the next time step, and the process returns to Phase I. As part of this phase, the
condition of orthogonality
$$\langle \psi_s | \psi_{s'} \rangle = \sum_{\mathbf{g}} C_s^*(\mathbf{g})\, C_{s'}(\mathbf{g}) = \delta_{ss'}$$
is enforced as a holonomic constraint on the
dynamics.
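The first three phases can be sketched serially with NumPy, as a minimal illustration rather than the implementation described in this entry. The mesh size, box length, random toy orbitals, and double occupation (`occ`) are all assumptions made for the example, and no spherical cutoff is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
ngrid_1d, nstate = 8, 3            # hypothetical toy sizes
box_length = 6.0                   # hypothetical cubic box edge
volume = box_length ** 3
ngrid = ngrid_1d ** 3
dv = volume / ngrid                # volume element of the real-space mesh

# Toy setup: real, orthonormal orbitals on the mesh and their coefficients
# C_s(g), using psi(r) = V^(-1/2) * sum_g C(g) exp(i g.r).
q, _ = np.linalg.qr(rng.standard_normal((ngrid, nstate)))
psi = (q.T / np.sqrt(dv)).reshape(nstate, ngrid_1d, ngrid_1d, ngrid_1d)
coef = np.fft.fftn(psi, axes=(1, 2, 3)) * np.sqrt(volume) / ngrid

# Phase I: kinetic energy from the reciprocal-space coefficients.
g1d = 2.0 * np.pi * np.fft.fftfreq(ngrid_1d, d=box_length / ngrid_1d)
gx, gy, gz = np.meshgrid(g1d, g1d, g1d, indexing="ij")
g2 = gx**2 + gy**2 + gz**2
occ = np.full(nstate, 2.0)         # occupation numbers f_s (assumed)
kinetic = 0.5 * np.sum(occ[:, None, None, None] * g2 * np.abs(coef) ** 2)

# Phase II: one three-dimensional FFT per state, Fourier space -> real space.
psi_r = np.fft.ifftn(coef, axes=(1, 2, 3)) * ngrid / np.sqrt(volume)

# Phase III: square and sum the real-space orbitals to form the density.
density = np.sum(occ[:, None, None, None] * np.abs(psi_r) ** 2, axis=0)
n_electrons = density.sum() * dv
print(kinetic, n_electrons)
```

Integrating the density over the box recovers the total occupation, which is a convenient consistency check on the FFT normalization.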
Parallelization
Two basic strategies for parallelization of ab initio
molecular dynamics will be outlined. The first is a
hybrid state/grid-level parallelization scheme useful on
machines with a modest number of fast processors having large memory but with slow communication. This
scheme does not require a parallel fast Fourier transform (FFT) and can be coded up relatively easily starting from an optimized serial code. The second scheme
is a fine-grained parallelization approach based on parallelization of all operations and is useful for massively
parallel architectures. The tradeoff of this scheme is its
considerable increase in coding complexity. An intermediate scheme between these builds a parallel FFT into
the hybrid scheme as a means of reducing the memory
and some of the communication requirements. In all
such schemes if the FFT used is a complex-to-complex
FFT and the orbitals are real, then the states can be double packed into the FFT routine and transformed two
at a time, which increases the overall efficiency of the
algorithms to be presented below.
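The double-packing trick can be illustrated with NumPy as follows. This is a sketch under assumed toy sizes and random data; the variable names are hypothetical. Two real orbitals share one complex-to-complex transform in each direction, and the unpacking on the reciprocal-space side relies on the condition C*(g) = C(−g).

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (6, 6, 6)                      # hypothetical mesh
psi1 = rng.standard_normal(shape)      # two real orbitals on the mesh
psi2 = rng.standard_normal(shape)
c1 = np.fft.fftn(psi1)                 # their Hermitian-symmetric coefficients
c2 = np.fft.fftn(psi2)

# Reciprocal -> real: a single complex inverse FFT returns both orbitals,
# one in the real part and one in the imaginary part.
packed = np.fft.ifftn(c1 + 1j * c2)
psi1_back, psi2_back = packed.real, packed.imag

# Real -> reciprocal: a single complex FFT, then unpack with C*(g) = C(-g).
z = np.fft.fftn(psi1 + 1j * psi2)
z_neg = np.roll(z[::-1, ::-1, ::-1], 1, axis=(0, 1, 2))   # z evaluated at -g
c1_back = 0.5 * (z + np.conj(z_neg))
c2_back = -0.5j * (z - np.conj(z_neg))
```

This is why the step counts in the schemes below carry a factor of 0.5 in front of the number of states.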
Hybrid scheme – Let ncoef represent the number of
coefficients used to represent each Kohn–Sham orbital,
and let nstate represent the number of orbitals (also
called “states”). In a serial calculation, the coefficients
are then stored in two arrays a(ncoef,nstate)
and b(ncoef,nstate) holding the real and imaginary parts of the coefficients, respectively. Alternatively,
complex data typing could be used for the coefficient
array. Let nproc be the number of processors available
for the calculation. In the hybrid parallel scheme, the
coefficients are stored in one of two ways at each point
in the calculation. The default storage mode is called
“transposed” form in which the coefficient arrays are
dimensioned as a(ncoef/nproc,nstate) and
b(ncoef/nproc,nstate), so that each processor has a portion of each orbital for all of the states.
At certain points in the calculation, the coefficients are
transformed to “normal” form in which the arrays are
dimensioned as a(ncoef,nstate/nproc) and
b(ncoef,nstate/nproc), so that each processor
has a fraction of the orbitals but each of the orbitals is
complete on the processor.
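The two storage modes and the exchange between them can be emulated in a single process, as in the sketch below. The list-of-arrays representation of the processors and the helper names `to_normal` and `to_transposed` are assumptions for illustration; in a distributed code this exchange would typically be an all-to-all communication.

```python
import numpy as np

ncoef, nstate, nproc = 12, 4, 2        # hypothetical sizes, divisible by nproc
full = np.arange(ncoef * nstate, dtype=float).reshape(ncoef, nstate)

# Transposed form: rank p holds ncoef/nproc coefficients of every state.
nc_loc = ncoef // nproc
transposed = [full[p * nc_loc:(p + 1) * nc_loc, :].copy() for p in range(nproc)]

def to_normal(transposed_blocks):
    """Rank q gathers, from every rank, the coefficient rows of its own states."""
    nranks = len(transposed_blocks)
    ns_loc = transposed_blocks[0].shape[1] // nranks
    return [np.concatenate([blk[:, q * ns_loc:(q + 1) * ns_loc]
                            for blk in transposed_blocks], axis=0)
            for q in range(nranks)]

def to_transposed(normal_blocks):
    """Rank p gathers, from every rank, its coefficient rows for all states."""
    nranks = len(normal_blocks)
    rows = normal_blocks[0].shape[0] // nranks
    return [np.concatenate([blk[p * rows:(p + 1) * rows, :]
                            for blk in normal_blocks], axis=1)
            for p in range(nranks)]

normal = to_normal(transposed)         # each block: (ncoef, nstate/nproc)
round_trip = to_transposed(normal)     # each block: (ncoef/nproc, nstate)
```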
In the hybrid scheme, operations on the density,
both in real and reciprocal spaces, are carried out in
parallel. These terms include the Hartree and local
pseudopotential energies, which are carried out on
a spherical reciprocal-space grid, and the exchange-correlation energy, which is carried out on a rectangular
real-space grid. The Hartree and local pseudopotential
terms require arrays gdens_a(ndens/nproc) and
gdens_b(ndens/nproc) that hold, on each processor, a portion of the real and imaginary reciprocal-space density coefficients ñ(g). Here, ndens is the
number of reciprocal-space density coefficients. The
exchange-correlation energy makes use of an array
rdens(ngrid/nproc) that holds, on each processor, a portion of the real-space density n(r). Here,
ngrid is the number of points on the rectangular
grid.
Given this division of data over the processors, the
algorithm proceeds in the following steps:
1. With the coefficients in transposed form, each processor calculates its contribution to the electronic
kinetic energy and the corresponding forces on the
coefficients it holds. A reduction is performed to
obtain the total kinetic energy.
2. The coefficient arrays are transposed into normal
form, and each processor must subsequently perform 0.5*nstate/nproc three-dimensional
serial FFTs on its subset of states to obtain the
corresponding orbitals in real space. These orbitals
are stored as creal(ngrid,nstate/nproc).
Each processor sums the squares of its
nstate/nproc orbitals at each grid point.
3. Each processor performs a serial FFT on its portion of the density to obtain its contribution to the
reciprocal-space density coefficients ñ(g).
4. Reduce_scatter operations are used to sum
each processor’s contributions to the real and
reciprocal-space densities and distribute ngrid/nproc
and ndens/nproc real and reciprocal-space density values on each processor.
5. Each processor calculates its contribution to the
Hartree and local pseudopotential energies, Kohn–
Sham potential, and nuclear forces using its
reciprocal-space density coefficients and exchange-correlation energies and Kohn–Sham potential
using its real-space density coefficients.
6. As the full Kohn–Sham potential is needed on
each processor, Allgather operations are used
to collect the reciprocal and real-space potentials.
The reciprocal-space potential is additionally transformed to real space by a single FFT.
7. With the coefficients in normal form, the Kohn–
Sham potential contributions to the coefficient
forces are computed via the product $V_{\rm KS}(\mathbf{r})\psi_s(\mathbf{r})$.
Each processor computes this product for the
states it has.
8. Each processor transforms its coefficient force contributions $V_{\rm KS}(\mathbf{r})\psi_s(\mathbf{r})$ back to reciprocal space by
performing 0.5*nstate/nproc serial FFTs.
9. With the coefficients in normal form, each processor calculates its contribution to the nonlocal pseudopotential energy, coefficient forces, and nuclear
forces for its states.
10. Each processor adds its contributions to the nonlocal forces to those from Steps 1 and 8 to obtain the
total coefficient forces in normal form. The energies
and nuclear forces are reduced over processors.
11. The coefficients and coefficient forces are transformed back to transposed form, and the equations
of motion are integrated with the coefficients in
transposed form.
12. In order to enforce the orthogonality constraints,
various partially integrated coefficient and coefficient velocity arrays are multiplied together in
transposed form on each processor. In this way,
each processor has a set of Ns × Ns matrices that
are reduced over processors to obtain the corresponding Ns × Ns matrices needed to obtain
the Lagrange multipliers. The Lagrange multiplier
matrix is broadcast to all of the processors and each
processor applies this matrix on its subset of the
coefficient or coefficient velocity array.
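The Reduce_scatter step of the hybrid scheme can be emulated in a single process as follows. This is a sketch with assumed toy sizes in which the processors are represented by a Python list; the helper `reduce_scatter` is a hypothetical stand-in for the collective operation, not a call into any message-passing library.

```python
import numpy as np

rng = np.random.default_rng(2)
ngrid, nstate, nproc = 16, 4, 2        # hypothetical sizes
orbitals = rng.standard_normal((ngrid, nstate))   # toy real-space orbitals

# Normal form: rank p owns nstate/nproc complete orbitals and accumulates
# the sum of their squares on the full grid.
ns_loc = nstate // nproc
partial = [np.sum(orbitals[:, p * ns_loc:(p + 1) * ns_loc] ** 2, axis=1)
           for p in range(nproc)]

def reduce_scatter(contributions):
    """Sum the per-rank arrays, then give each rank one contiguous slice."""
    total = np.sum(contributions, axis=0)
    return np.split(total, len(contributions))

rdens = reduce_scatter(partial)        # rdens[p] holds ngrid/nproc values
```

After the operation no processor holds the full density, which is what allows the density-dependent energy terms to be evaluated in parallel.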
Intermediate scheme – The storage requirements of this
scheme are the same as those of the hybrid approach
except that the coefficient arrays are only used in their
transposed form. Since all FFTs are performed in parallel, a full grid is never required, which cuts down on
the memory and scratch-space requirements. The key to
this scheme is having a parallel FFT capable of achieving
good load balance in the presence of the spherical truncation of the reciprocal-space coefficient and density
grids.
This algorithm is carried out as follows:
1. With the coefficients in transposed form, each processor calculates its contribution to the electronic
kinetic energy and the corresponding forces on the
coefficients it holds. A reduction is performed to
obtain the total kinetic energy.
2. A set of 0.5*nstate three-dimensional parallel FFTs is performed in order to transform the
coefficients to corresponding orbitals in real space.
These orbitals are stored in an array with dimensions creal(ngrid/nproc,nstate). Since
each processor has a full set of states, the real-space
orbitals can simply be squared so that each processor
has the full density on its subset of ngrid/nproc
grid points.
3. A parallel FFT is performed on the density to
obtain the reciprocal-space density coefficients
ñ(g), which are also divided over the processors.
Each processor will have ndens/nproc coefficients.
4. Each processor calculates its contribution to the
Hartree and local pseudopotential energies, Kohn–
Sham potential, and nuclear forces using its
reciprocal-space density coefficients. Each processor
also calculates its contribution to the exchange-correlation energies and Kohn–Sham potential
using its real-space density coefficients.
5. A parallel FFT is used to transform the reciprocal-space Kohn–Sham potential to real space.
6. With the coefficients in transposed form, the Kohn–
Sham potential contributions to the coefficient
forces are computed via the product $V_{\rm KS}(\mathbf{r})\psi_s(\mathbf{r})$.
Each processor computes this product at the grid
points it has.
7. The coefficient force contributions $V_{\rm KS}(\mathbf{r})\psi_s(\mathbf{r})$ on
each processor are transformed back to reciprocal
space via a set of 0.5*nstate three-dimensional
parallel FFTs.
8. With the coefficients in transposed form, each
processor calculates the nonlocal pseudopotential
energy, coefficient forces, and nuclear forces using
its subset of reciprocal-space grid points.
9. Each processor adds its contributions to the nonlocal forces to those from Steps 1 and 7 to obtain
the total coefficient forces in transposed form.
The energies and nuclear forces are reduced over
processors.
10. In order to enforce the orthogonality constraints,
various partially integrated coefficient and coefficient velocity arrays are multiplied together in
transposed form on each processor. In this way,
each processor has a set of Ns × Ns matrices that
are reduced over processors to obtain the corresponding Ns × Ns matrices needed to obtain
the Lagrange multipliers. The Lagrange multiplier
matrix is broadcast to all of the processors and each
processor applies this matrix on its subset of the
coefficient or coefficient velocity array.
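The pattern used in the orthogonality step (local Ns × Ns products, a reduction, then local application of a small matrix) can be sketched as follows. The sizes and random coefficients are assumptions, and a symmetric (Löwdin) orthonormalization is swapped in purely as a stand-in for the Lagrange-multiplier constraint solve, which this entry does not spell out.

```python
import numpy as np

rng = np.random.default_rng(3)
ncoef, nstate, nproc = 20, 3, 4        # hypothetical sizes
coef = (rng.standard_normal((ncoef, nstate))
        + 1j * rng.standard_normal((ncoef, nstate)))

# Transposed form: rank p holds ncoef/nproc coefficients of every state.
blocks = np.split(coef, nproc, axis=0)

# Each rank forms its partial Ns x Ns overlap matrix from local data only;
# summing the partial matrices emulates the reduction over processors.
partial = [blk.conj().T @ blk for blk in blocks]
overlap = np.sum(partial, axis=0)

# Stand-in for the constraint solve: build S^(-1/2) from the small matrix.
w, u = np.linalg.eigh(overlap)
s_inv_sqrt = (u / np.sqrt(w)) @ u.conj().T

# The small matrix is "broadcast" and applied locally on every rank.
new_blocks = [blk @ s_inv_sqrt for blk in blocks]
new_coef = np.concatenate(new_blocks, axis=0)
```

Only Ns × Ns matrices cross processor boundaries in this pattern; the large coefficient arrays never leave the processor that owns them.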
3D parallel FFTs and fine-grained data decomposition – Any fine-grained parallelization scheme for
Car–Parrinello molecular dynamics requires a scalable three-dimensional FFT and a data decomposition
strategy that allows parallel operations to scale up to
large numbers of processors. Much of this is true even
for the intermediate scheme discussed above.
Starting with the D parallel FFT, when the full
set of indices of the coefficients Cs (g) are displayed,
the array appears as Cs (gx , gy , gz ), where the reciprocal√
space points lie within a sphere of radius Ecut .
A common approach for transforming this array
to real space is based on data transposes. First, a
set of one-dimensional FFTs is computed to yield
Cs (gx , gy , gz ) → C̃s (gx , gy , z). Since the data is dense
in real space, this operation transforms the sphere into
a cylinder. Next, a transpose is performed on the z index
to parallelize it and collect complete planes of gx and gy
along the cylindrical axis. Once this operation is complete, the remaining two one-dimensional FFTs can be
performed. The first maps the cylinder onto a rectangular slab, and the final FFT transforms the slab into a
dense real-space array ψ s (x, y, z).
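The sphere, cylinder, slab, and dense-array sequence can be reproduced in a single process with NumPy, as sketched below under assumed toy sizes. The transpose between processors is not modeled here; the comment only marks where it would occur.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8                                   # hypothetical mesh points per axis
coef = rng.standard_normal((n, n, n)) + 1j * rng.standard_normal((n, n, n))

# Keep only coefficients inside a sphere, mimicking the energy cutoff.
k = np.fft.fftfreq(n, d=1.0 / n)        # integer plane-wave indices
kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
coef[kx**2 + ky**2 + kz**2 > 3**2] = 0.0

# Step 1: 1D transforms along g_z, only for (g_x, g_y) columns that are
# not empty. The sphere becomes a cylinder that is dense in z.
columns = np.any(coef != 0.0, axis=2)
stage1 = np.zeros_like(coef)
stage1[columns] = np.fft.ifft(coef[columns], axis=-1)

# (In a parallel code, the transpose that makes each z plane local goes here.)

# Step 2: 1D transforms along g_y (cylinder -> slab), then along g_x
# (slab -> dense real-space array).
stage2 = np.fft.ifft(stage1, axis=1)
psi = np.fft.ifft(stage2, axis=0)
```

Skipping the empty columns in the first step is where the spherical truncation saves work: only the columns that pass through the sphere are transformed.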
Given this strategy for the D FFT, the optimal data
decomposition scheme is based on dividing the state
and density arrays into planes both in reciprocal space
in and in real space. Briefly, it is useful to think of dividing the coefficient array into data objects represented as
G(s, p), where p indexes the (gx , gy ) planes. The planes
are optimally grouped in such a way that complete lines
along the gz axis are created. This grouping is important
due to the spherical truncation of reciprocal space. This
type of virtualization allows parallelization tools such
as the Charm++ software to be employed as a way of
mapping the data objects to the physical processors as
discussed by Bohm et al. in []. The granularity of the
object G(s, p) is something that can be tuned for each
given architecture.
Once the data decomposition is accomplished, the
planes must be mapped to the physical processors.
A simple map that allocates all planes of a state to the
same processor, for example, would make all D FFT
transpose operations local, resulting in good performance. However, such a mapping is not scalable because
massively parallel architectures will have many more
processors than states. Another extreme would map all
planes of the same rank in all of the states to the same
processor. Unfortunately, this causes the transposes to
be highly nonlocal and this leads to a communication
bottleneck. Thus, the optimal mapping is a compromise between these two extremes, mapping collections
of planes in a state partition to the physical processors,
where the size of these collections depends on the number of processors and communication bandwidth. This
mapping then enables the parallel computation of overlap matrices obtained by summing objects with different
state indices s and s′ over reciprocal space.
This entry is focused on the parallelization of
plane-wave based Car–Parrinello molecular dynamics.
However, other basis sets offer certain advantages over
plane waves. These are localized real-space basis sets
useful for chemical applications where the orbitals
ψ s (r) can be easily transformed into a maximally spatially localized form. Future work will focus on developing parallelization strategies for ab initio molecular
dynamics calculations with such basis sets.
Bibliographic Notes and Further Reading
As noted in the introduction, the Car–Parrinello
method was introduced in Ref. [1] and is discussed
in greater detail in a number of review articles [2, 3]
and a recent book [4]. Further details on the algorithms presented in this entry and their implementation
in the open-source package PINY_MD can be found
in Refs. [5, 6]. A detailed discussion of the Charm++
runtime environment alluded to in the “Parallelization” section can be found in Ref. [7]. Finally, fine-grained
algorithms leveraging the Charm++ runtime
environment in the manner described in this article
are described in more detail in Refs. [8, 9]. These massively parallel techniques have been implemented in
the Charm++ based open-source ab initio molecular
dynamics package OpenAtom [10].
Bibliography
1. Car R, Parrinello M () Unified approach for molecular
dynamics and density-functional theory. Phys Rev Lett :
2. Marx D, Hutter J () Ab initio molecular dynamics: theory and
implementation in modern methods and algorithms of quantum
chemistry. In: Grotendorst J (ed) Forschungszentrum, Jülich, NIC
Series, vol . NIC Directors, Jülich, pp –
3. Tuckerman ME () Ab initio molecular dynamics: basic concepts, current trends, and novel applications. J Phys Condens Mat
:R
4. Marx D, Hutter J () Ab initio molecular dynamics: basic
theory and advanced methods. Cambridge University Press,
New York
5. Tuckerman ME, Yarne DA, Samuelson SO, Hughes AL, Martyna
GJ () Exploiting multiple levels of parallelism in molecular
dynamics based calculations via modern techniques and software paradigms on distributed memory computers. Comp Phys
Commun :
6. The open-source package PINY MD is freely available for download via the link http://homepages.nyu.edu/~mt/PINYMD/PINY.html
7. Kale LV, Krishnan S () Charm++: parallel programming with message-driven objects. In: Wilson GV, Lu P
(eds) Parallel programming using C++. MIT, Cambridge, pp
–
8. Vadali RV, Shi Y, Kumar S, Kale LV, Tuckerman ME, Martyna GJ
() Scalable fine-grained parallelization of plane-wave-based
ab initio molecular dynamics for large supercomputers. J Comp
Chem :
9. Bohm E, Bhatele A, Kale LV, Tuckerman ME, Kumar S, Gunnels
JA, Martyna GJ () Fine-grained parallelization of the Car–
Parrinello ab initio molecular dynamics method on the IBM Blue
Gene/L supercomputer. IBM J Res Dev :
10. OpenAtom is freely available for download via the link http://charm.cs.uiuc.edu/OpenAtom
CDC 
Control Data 
Cedar Multiprocessor
Pen-Chung Yew
University of Minnesota at Twin-Cities, Minneapolis,
MN, USA
Definition
The Cedar multiprocessor was designed and built at
the Center for Supercomputing Research and Development (CSRD) in the University of Illinois at Urbana-Champaign (UIUC) in s. The project brought
together a group of researchers in computer architecture, parallelizing compilers, parallel algorithms/applications,
and operating systems to develop a scalable, hierarchical, shared-memory multiprocessor. It
was the largest machine building effort in academia
since ILLIAC-IV. The machine became operational in
 and was decommissioned in . Some pieces of
the boards are still displayed in the Department of
Computer Science at UIUC.
Discussion
Introduction
The Cedar multiprocessor was the first scalable, cluster-based, hierarchical shared-memory multiprocessor of
its kind in s. It was designed and built at the
Center for Supercomputing Research and Development (CSRD) in the University of Illinois at Urbana-Champaign (UIUC). The project succeeded in building
a complete scalable shared-memory multiprocessor system with a working -cluster (-processor) hardware
prototype [], a parallelizing compiler for Cedar Fortran [, , ], and an operating system, called Xylem, for
scalable multiprocessor []. Real application programs
were ported and run on Cedar [, ] with performance
studies presented in []. The Cedar project was started
in  and the prototype became functional in .
Cedar had many features that were later used extensively in large-scale multiprocessor systems, such as
software-managed cache memory to avoid very expensive cache coherence hardware support; vector data
prefetching to cluster memories for hiding long memory latency; parallelizing compiler techniques that take
sequential applications and extract task-level parallelism from their loop structures; language extensions that include memory attributes of the data
variables to allow programmers and compilers to manage data locality more easily; and parallel dense/sparse
matrix algorithms that could speed up the most time
consuming part of many linear systems in large-scale
scientific applications.
Machine Organization of Cedar
The organization of Cedar consists of multiple clusters
of processors connected through two high-bandwidth
single-directional global interconnection networks
(GINs) to a globally shared memory system (GSM)
(see Fig. ). One GIN provides memory requests from
clusters to the GSM. The other GIN provides data and
responses from GSM back to clusters. In the Cedar
prototype built at CSRD, an Alliant FX/ system was
used as a cluster (See Fig. ). Some hardware components in FX/, such as the crossbar switch and the
interface between the shared cache and the crossbar
switch, were modified to accommodate additional ports
for the global network interface (GNI). GNI provides
pathways from a cluster to GIN. Each GNI board also
has a software-controlled vector prefetching unit (VPU)
that is very similar to the DMA engine in IBM’s later Cell
Broadband Engine.
[Figure: Cedar Multiprocessor. Fig.  Cedar machine organization. Clusters 0 through 3 connect through two stages of 8 × 8 switches (Stage 1 and Stage 2) to the global memory modules, each of which has a synchronization processor (SP).]
[Figure: Cedar Multiprocessor. Fig.  Cedar cluster. The computational elements (CEs), linked by the concurrency control bus, connect through an 8 × 8 cluster switch to the 4-way interleaved cache and the global interface; a memory bus connects the cache to the cluster memory modules (MEM) and the I/O subsystem.]
Cedar Cluster: In each Alliant system, there are
eight processors, called computational elements (CEs).
Those eight CEs are connected to a four-way interleaved
shared cache through an 8 × 8 crossbar switch, and four
ports to GNI that provide access to GIN and GSM (see
Fig. ). On the other side of the shared cache is a high-speed shared bus that is connected to multiple cluster
memory modules and interactive processors (IPs). IPs
handle input/output and network functions.
Each CE is a pipelined implementation of Motorola
 instruction set architecture, one of the most popular high-performance microprocessors in s. The
 ISA was augmented with vector instructions.
The vector unit includes both -bit floating-point and
-bit integer operations. It also has eight -word
vector registers. Each vector instruction could have one
memory- and one register-operand to balance the use
of registers and the requirement of memory bandwidth.
The clock cycle time was  ns, but each CE has a
peak performance of . Mflops for a multiply-and-add
-bit floating-point vector instruction. It was a very
high-performance microprocessor at that time, and was
implemented on one printed-circuit board.
There is a concurrency control bus (CCB) that connects all eight CEs to provide synchronization and
coordination of all CEs. Concurrency control instructions include fast fork, join, and fetch-and-increment
type of synchronization operations. For example, a single concurrent start instruction broadcast on CCB will
spread the iterations and initiate the execution of a concurrent loop among all eight CEs simultaneously. Loop
iterations are self-scheduled among CEs by fetch-and-incrementing a loop counter shared by all CEs. CCB
also supports a cascade synchronization that enforces an
ordered execution among CEs in the same cluster for a
concurrent loop that requires a sequential ordering for
a particular portion of the loop execution.
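Self-scheduling of loop iterations through a shared fetch-and-increment counter can be sketched in software as follows. This is only an analogy: Python threads and a lock stand in for the CEs and the hardware operation on the CCB, and the names `LoopCounter` and `run_concurrent_loop` are hypothetical.

```python
import threading

class LoopCounter:
    """Shared loop counter with an atomic fetch-and-increment (lock-emulated)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_increment(self):
        with self._lock:
            old = self._value
            self._value += 1
            return old

def run_concurrent_loop(n_iterations, n_workers, body):
    """Each worker repeatedly claims the next unclaimed iteration."""
    counter = LoopCounter()

    def worker(worker_id):
        while True:
            i = counter.fetch_and_increment()
            if i >= n_iterations:
                return
            body(i, worker_id)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

results = [None] * 100

def body(i, worker_id):
    results[i] = i * i      # each iteration writes only its own slot

run_concurrent_loop(len(results), 8, body)
```

Because every iteration index is handed out exactly once, faster workers simply claim more iterations, which balances the load without a static partition of the loop.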
Memory Hierarchy: The  GB physical memory
address space of Cedar is divided into two equal halves
between the cluster memory and the GSM. There are
 MB of GSM and  MB of cluster memory in each
cluster on the Cedar prototype. It also supports a virtual memory system with a  KB page size. GSM could
be directly addressed and shared by all clusters, but cluster memory is only addressable by the CEs within each
cluster. Data coherence among multiple copies of data
in different cluster memories is maintained explicitly
through software by either the programmer or the compiler.
The peak GSM bandwidth is  MB/s and  MB/s per
CE. The GSM is double-word interleaved, and aligned
among all global memory modules. There is a synchronization processor in each GSM module that could
execute each atomic Cedar synchronization instruction
issued from a CE and staged at GNI.
Data Prefetching: It was observed early in the Cedar
design phase that scientific application programs have
very poor cache locality due to their vector operations.
To compensate for the long access latency to GSM and
to overcome the limitation of two outstanding memory requests per CE in the Alliant microarchitectural
design, vector data prefetching is needed. The large GIN
bandwidth is ideal for supporting such data prefetching.
To avoid major changes to Alliant’s instruction
set architecture (ISA) by introducing additional data
prefetching instructions into its ISA, and also to avoid
major alterations to CE’s control and data paths by
including a data prefetching unit (PFU) inside its CE,
it was decided to build the PFU on the GNI board.
Prefetched data is stored in a -word (each word is byte) prefetch buffer inside PFU to avoid polluting the
shared data cache and to allow data reuse. A CE will
stage a prefetch operation at PFU by providing it with
the length, the stride, and the mask of the vector data to
be prefetched. The prefetching operation could then be
fired by providing the physical address of the first word
of the vector. Prefetching operation could be overlapped
with other computations or cluster memory operations.
When a page boundary is crossed during a data
prefetching operation, the PFU will be suspended until
the CE provides the starting address of the new page.
In the absence of page crossing, PFU will prefetch 
words without pausing into its prefetch buffer. Due to
hardware limitation, only one prefetch buffer is implemented in each PFU. The prefetch buffer will thus be
invalidated with another prefetch operation. Prefetched
data could return from GSM out of order because of
potential network and memory contentions. A presence
bit per data word in the prefetch buffer allows the CE
to both access the data without waiting for the completion of the prefetch instruction, and to access the
prefetch data in the same order as requested. Experiments
later showed that the PFU is extremely
useful in improving memory performance for scientific
applications.
Data prefetching was later incorporated extensively
into high-performance microprocessors that prefetch
data from main memory into the last-level cache memory. Sophisticated compiler techniques that could insert
prefetching instructions into user programs at compile time or runtime have been developed and used
extensively. Cache performance has shown significant
improvement with the help of data prefetching.
Global Interconnection Network (GIN): The Cedar
network is a multi-stage shuffle-exchange network as
shown in Fig. . It is constructed with  ×  crossbar
switches with -bit wide data paths and some control
signals for network flow control. As there are only processors on the Cedar prototype, there are only two
stages needed for each direction of GIN between clusters and GSM, i.e., it only need two clock cycles to go
from GNI to one of the  GSM modules if there is
no network contention. Hence, it provides a very low
latency and high-bandwidth communication path to
GSM from a CE. There is one GIN in each direction
between clusters and GSM as mentioned above. To cut
down possible signal noises and to maintain high reliability, all data signals between stages are implemented
using differential signals. GIN is packet-switched and
self-routed. Routing is based on a tag-controlled scheme
proposed in [] for multi-stage shuffle-exchange networks. There is a unique path between each pair of GIN
input and output ports. A two-word queue is provided
at each input and output port of the 8 × 8 crossbar switch.
Flow control between network stages is implemented to
prevent buffer overflow and maintain low network contention. The total network bandwidth is  MB/s, and
 MB/s each network port to match the bandwidth of
GSM and CE. Hence, the requests and data flow from
a CE to a GSM module and back to the CE. It forms
a circular pipeline with balanced bandwidth that could
support high-performance vector operations.
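Tag-controlled self-routing through a two-stage network of 8 × 8 switches can be sketched as below. This is the textbook destination-tag scheme for a shuffle-exchange (omega) network; the 64-port size, the port numbering, and the function name `route` are assumptions for illustration, since the exact wiring of the Cedar GIN is not given here.

```python
def route(source, dest, radix=8, stages=2):
    """Return the port position reached after each stage of the network."""
    n_ports = radix ** stages
    pos, path = source, []
    for stage in range(stages):
        # Perfect shuffle between stages: rotate the base-`radix` digits
        # of the current position one place to the left.
        pos = (pos * radix) % n_ports + pos // (n_ports // radix)
        # The switch picks its output port from one digit of the
        # destination tag, most significant digit first.
        digit = (dest // radix ** (stages - 1 - stage)) % radix
        pos = (pos // radix) * radix + digit
        path.append(pos)
    return path

# Every source reaches every destination, and the path is fully determined
# by the (source, destination) pair.
all_ok = all(route(s, d)[-1] == d for s in range(64) for d in range(64))
example_path = route(5, 42)
```

Each switch needs only one digit of the tag to make its decision, so a packet routes itself without any central control, and the path between a given input and output is unique.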
Memory-Based Synchronization: Given packet-switched multi-stage interconnection networks in
Cedar, instead of a shared bus, it is very difficult to
implement atomic (indivisible) synchronization operations such as a test-and-set or a fetch-and-add operation
efficiently on the system interconnect. Lock and unlock
operations are two low-level synchronization operations that will require multiple passes through GINs
and GSM to implement a higher-level synchronization
such as fetch-and-add, a very frequent synchronization
operation in parallel applications that could be used to
implement barrier synchronizations and self-scheduling
for parallel loops.
Cedar implements a sophisticated set of synchronization operations []. They are very useful in
supporting parallel loop execution that requires frequent fine-grained data synchronization between loop
iterations that have cross-iteration data dependences,
so-called doacross loops []. Cedar synchronization
instructions are basically test-and-operate operations,
where test is any relational operation on -bit data
(e.g., ≤) and operate could be a read, write, add, subtract,
or logical operation on -bit data. These Cedar synchronization operations are also staged and controlled
in a GNI board by each CE. There is also a synchronization unit at each GSM module that executes these
atomic operations at GSM. It is a very efficient and
effective way of implementing atomic global synchronization operations right at GSM.
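The test-and-operate idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class SyncCell, the method test_and_operate, and the helper fetch_and_add are hypothetical names, a lock stands in for the synchronization unit at the GSM module, and deriving fetch-and-add from an always-true test is one plausible reading, not a description of the Cedar instruction encoding.

```python
import operator
import threading


class SyncCell:
    """Hypothetical model of one GSM word guarded by a synchronization unit."""

    def __init__(self, value=0):
        self.value = value
        # The lock stands in for the hardware unit that makes the
        # test and the operation indivisible at the memory module.
        self._lock = threading.Lock()

    def test_and_operate(self, test, operand, operate, arg):
        """Atomically apply value = operate(value, arg) if test(value, operand) holds.

        Returns (test_passed, old_value).
        """
        with self._lock:
            old = self.value
            passed = test(old, operand)
            if passed:
                self.value = operate(old, arg)
            return passed, old


def fetch_and_add(cell, increment):
    # Assumption: an always-true test turns test-and-operate into fetch-and-add.
    _, old = cell.test_and_operate(lambda v, _: True, None, operator.add, increment)
    return old


cell = SyncCell(5)
ok, old = cell.test_and_operate(operator.le, 10, operator.add, 1)           # 5 <= 10 holds
failed, unchanged = cell.test_and_operate(operator.le, 3, operator.add, 1)  # 6 <= 3 fails
previous = fetch_and_add(cell, 4)
```

Because the test and the update happen in one indivisible step at the memory side, a requester needs a single round trip instead of a lock, read, write, and unlock sequence.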
Hardware Performance Monitoring System: It was
determined at the Cedar design time that performance
tuning and monitoring on a large-scale multiprocessor requires extensive support starting at the lowest
hardware level. Important and critical system signals
in all major system components that include GIN
and GSM must be made observable. To minimize the
amount of hardware changes needed on Alliant clusters, it was decided to use external monitoring hardware
to collect time-stamped event traces and histograms
of various hardware signals. This allows the hardware
performance monitoring system to evolve over time
without having to make major hardware changes on the
Cedar prototype. The Cedar hardware event tracers can
each collect M events and the histogrammers have K
-bit counters. These hardware tracers could be cascaded to capture more events. The triggering and stopping signals or events could come from special library
calls in application programs, or some hardware signals
such as a GSM request to indicate a shared cache miss.
Software tools were built to start and stop the tracers and to off-load data from the tracers and
counters for extensive performance analysis.
Xylem Operating System
The Xylem operating system [] provides support for parallel execution of application programs on Cedar. It is
based on the notion that a parallel user program is a flow
graph of executable nodes. New system calls are added
to allow Unix processes to create and control multiple tasks. It basically links the four separate operating
systems in Alliant clusters into one Cedar operating system. Xylem provides virtual memory, scheduling, and
file system services for Cedar.
The Xylem scheduler, rather than the Unix scheduler, schedules those tasks.
A Xylem task corresponds to one
or more executions of each node in the program flow graph. Support of multiprocessing in Xylem basically
includes create_task, delete_task, start_task, end_task,
and wait_task. Create_task and start_task are separate
because task creation is a very expensive operation, while starting a task is
relatively fast. Separating these two operations allows
more efficient management of tasks; the same holds for separating delete_task and end_task. For example,
helper tasks could be created at the beginning of a process and started later when parallel loops are being
executed. These helper tasks could then be put back into the
idle state without being deleted and recreated later.
There is no parent–child relationship between the
creator task and the created task, though. Either task
could wait for, start, or delete the other task. A task could
delete itself if it is the last task in the process. End_task
marks the task idle and stops execution. Wait_task
(tasknum) blocks the execution of the calling task until
the task specified by tasknum enters an idle state. A task
enters an idle state when it calls end_task. Hence, the
waiting task will be unblocked when the task it is waiting
for (identified by tasknum) ends execution.
Memory management of Xylem is based on a
paging system. It has the notion of global memory
pages and cluster memory pages, and provides kernels
to allow a task to control its own memory environment or that of any other task in the process. They
include kernels to allocate and de-allocate pages for
any task in the process; change the attributes of pages;
make copies and share pages with any other task in
the process; and unmap pages from any task in the
process. The attributes of a page include execute/no-execute, read-write/no-read-write, local/global/global-cached, and shared/private-copy/private-new/private.
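The life cycle of a Xylem helper task (created once, then started and returned to the idle state many times) can be illustrated with a small Python sketch. The function names mirror the kernels named in this section, but the signatures, the Task class, and the use of Python threads are assumptions made only for illustration; end_task is modeled implicitly when the work function returns.

```python
import threading


class Task:
    """Hypothetical stand-in for a Xylem task backed by one Python thread."""

    def __init__(self):
        self.idle = threading.Event()
        self.idle.set()                  # a newly created task is idle
        self._wake = threading.Event()
        self._work = None
        self._alive = True
        # Creating the thread models the expensive step, which is done only once.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            self._wake.wait()
            self._wake.clear()
            if not self._alive:
                return
            self._work()
            self.idle.set()              # models end_task: mark the task idle


def create_task():
    return Task()


def start_task(task, work):
    # Cheap compared with creation: hand over work and wake the idle task.
    task.idle.clear()
    task._work = work
    task._wake.set()


def wait_task(task):
    task.idle.wait()                     # block until the task is idle again


def delete_task(task):
    task._alive = False
    task._wake.set()
    task._thread.join()


results = []
helper = create_task()                   # paid once, at the start of the process
for chunk in range(3):                   # reused for several parallel regions
    start_task(helper, lambda c=chunk: results.append(c * c))
    wait_task(helper)
delete_task(helper)
```

The point of the split is visible in the loop: the helper is started three times but created and deleted only once.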
Programming Cedar
A parallel program can be written using Cedar Fortran,
a parallel dialect of the Fortran language, and Fortran
 on Cedar. Cedar Fortran supports all key features
of the Cedar system described above. Those Cedar features include its memory hierarchy, the data prefetching
capability from GSM, the Cedar synchronization operations, and the concurrency control features on CCB.
These features are supported through a language extension to Fortran . Programs written in Fortran  could
be restructured by a parallelizing restructurer, and translated into parallel programs in Cedar Fortran. They are
then fed into a backend compiler, mostly an enhanced
and modified Alliant compiler, to produce Cedar executables. The parallelizing restructurer was based on
the KAP restructurer [], a product of Kuck and
Associates (KAI).
Cedar Fortran has many features common to the
ANSI Technical Committee XH standard for parallel
Fortran whose basis was PCF Fortran developed by the
Parallel Computing Forum (PCF). The main features
encompass parallel loops, vector operations that include
vector reduction operations and a WHERE statement
that could mask vector assignment as in Fortran-,
declaration of visibility (or accessibility) of data, and
post/wait synchronization.
Several variations of parallel loops that take into
account the loop structure and the Cedar organization
are included in Cedar Fortran. There are basically two
types of parallel loops: doall loops and doacross loops.
A doacross loop is an ordered parallel loop whose loop
iterations start sequentially as in its original sequential order. It uses cascade synchronization on CCB to
enforce such sequential order on parts of its loop body
(see Fig. ). A doall loop is an unordered parallel loop
that enforces no order among the iterations of its loop.
However, a barrier synchronization is needed at the end
of the loop.
The syntactic form of all variations of these two
parallel loops is shown in Fig. .
There are three variations of the parallel loops: cluster loops, spread loops, and cross-cluster loops, denoted
with prefixes C, S, X, respectively in the concurrent loop
syntax shown in Fig. . CDOALL and CDOACROSS
loops require all CEs in a cluster to join in the execution
of the parallel loops. SDOALL and SDOACROSS loops
cause a single CE from each cluster to join the execution
of the parallel loops. This is not necessarily the best way
to execute a parallel loop, but if each loop iteration has
a large working set that could fill the cluster memory,
such scheduling can be very effective. Another common arrangement is to nest a CDOALL loop inside
an SDOALL loop, which engages all CEs in each cluster.
An XDOALL loop requires all CEs in all clusters to
execute the loop body.
{C/S/X}{DOALL/DOACROSS} index = start, end [,increment]
  [local declarations of data variables]
  [Preamble/Loop]
  Loop Body
  [Endloop/Postamble] (only SDO or XDO)
END {C/S/X}{DOALL/DOACROSS}
Cedar Multiprocessor. Fig.  Concurrent loop syntax
CDOACROSS j = 1, m
  A(j) = B(j) + C(j)
  call wait (1,1)
  D(j) = E(j) + D(j-1)
  call advance (1)
END CDOACROSS
Cedar Multiprocessor. Fig.  An example DOACROSS
loop with cascade synchronization
Local declarations of data variables in a CDOALL or
XDOALL are visible only to a single CE, while they are visible
to all CEs in a single cluster if in an SDOALL. The statements in the preamble of a loop are executed only once
by each CE when it first joins the execution of the loop
and prior to the execution of the loop body. The statements
in the postamble of a loop are executed only once after
all CEs complete the execution of the loop. A postamble is
only available in SDOALL and XDOALL.
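The ordering enforced by the wait and advance calls in the example DOACROSS loop can be sketched with Python threads. This is an illustration, not the CCB mechanism: the helpers wait_for and advance, the shared counter done, and the reading of the wait arguments as a dependence distance of one iteration are all assumptions.

```python
import threading

m = 8
B = [float(j) for j in range(m + 1)]
C = [1.0] * (m + 1)
E = [2.0] * (m + 1)
A = [0.0] * (m + 1)
D = [0.0] * (m + 1)          # D[0] is the initial value read by iteration 1

done = 0                     # number of iterations that have called advance
cond = threading.Condition()


def wait_for(j, distance):
    # Block until iteration j - distance has advanced.
    with cond:
        cond.wait_for(lambda: done >= j - distance)


def advance():
    global done
    with cond:
        done += 1
        cond.notify_all()


def iteration(j):
    A[j] = B[j] + C[j]        # independent statement, free to overlap
    wait_for(j, 1)            # plays the role of "call wait (1,1)"
    D[j] = E[j] + D[j - 1]    # statement with a cross-iteration dependence
    advance()                 # plays the role of "call advance (1)"


threads = [threading.Thread(target=iteration, args=(j,)) for j in range(1, m + 1)]
for t in reversed(threads):   # start out of order to show that order is enforced
    t.start()
for t in threads:
    t.join()
```

Only the statement that reads D(j-1) is serialized; the assignment to A(j) may run concurrently across iterations, which is where a doacross loop gains over sequential execution.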
By default, data declared outside of a loop in a Cedar
Fortran program are visible to all CEs in a single cluster. However, Cedar Fortran provides statements to explicitly declare
variables outside of a loop to be visible to all CEs in all
clusters (see Fig. ).
The GLOBAL and PROCESS COMMON statements in Fig.  declare that the data are visible to all CEs
in all clusters. A single copy of the data exists in global
memory. All CEs in all clusters could access the data, but
it is the programmer’s responsibility to maintain coherence if multiple copies of the data are kept in separate
cluster memories. The CLUSTER and COMMON statements declare that the data are visible to all CEs inside
a single cluster. A separate copy of the data is kept in the
cluster memory of each cluster that participates in the
execution of the program.
Implementation of Cedar Fortran on Cedar: All
parallel loops in Cedar Fortran are self-scheduled.
Iterations in a CDOALL or CDOACROSS loop are
dispatched by CCB inside each cluster. CCB also provides cascade synchronization for a CDOACROSS loop
through wait and advance calls in the loop as shown
in Fig. . The execution of SDOALL and XDOALL
are supported by the Cedar Fortran runtime library.
The library starts a requested number of helper tasks
by calling Xylem kernels in the beginning of the program execution. They remain idle until a SDOALL or
XDOALL starts. The helper tasks begin to compete for
loop iterations using self-scheduling.
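Self-scheduling, in which helper tasks compete for loop iterations, can be sketched as follows. The sketch assumes a shared iteration counter updated with a fetch-and-add style operation; the class IterationCounter, the function helper, and the chunk size of one are illustrative choices, not the Cedar Fortran runtime library's actual interface.

```python
import threading


class IterationCounter:
    """Shared loop index; next_chunk behaves like an atomic fetch-and-add."""

    def __init__(self, start, end):
        self._next = start
        self._end = end
        self._lock = threading.Lock()

    def next_chunk(self, size=1):
        with self._lock:
            first = self._next
            self._next += size
        # Empty range once the loop bounds are exhausted.
        return range(first, min(first + size, self._end + 1))


def helper(counter, body, log, name):
    # Each helper grabs the next unclaimed iteration until none remain.
    while True:
        chunk = counter.next_chunk()
        if len(chunk) == 0:
            return
        for i in chunk:
            body(i)
            log.append((name, i))


n = 100
out = [0] * (n + 1)
log = []
counter = IterationCounter(1, n)


def body(i):
    out[i] = i * i


helpers = [threading.Thread(target=helper, args=(counter, body, log, k)) for k in range(4)]
for t in helpers:
    t.start()
for t in helpers:
    t.join()
```

No iteration is assigned in advance, so a helper that finishes early simply claims more work, which balances load when iteration times vary.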
GLOBAL var [,var]
CLUSTER var [,var]
PROCESS COMMON /name/ var [,var]
COMMON /name/ var [,var]
Cedar Multiprocessor. Fig.  Variable declaration
statements in Cedar Fortran
Subroutine-level tasking is also supported in the
Cedar Fortran runtime library. It allows a new thread of
execution to be started for running a subroutine. In the
meantime, the main thread of execution continues following the subroutine call. The new thread's execution ends
when the subroutine returns. The new thread of execution could run on one of the idle helper tasks or
on a newly created task.
Cedar Fortran also supports vector prefetching by generating prefetch instructions before a vector register load
instruction. This can reduce the latency and overhead
of loading vector data from GSM. The Cedar synchronization instructions are used primarily in the runtime
library. They have proven useful in controlling loop self-scheduling. They are also available to a
Fortran programmer through runtime library routines.
Parallelizing Restructurer: There was a huge volume
of research work on parallelizing scientific application
programs written in Fortran before Cedar was built [].
Hence, there were many sophisticated program analysis
and parallelization techniques available through a very
advanced parallelizing restructurer based on the KAP
restructurer for Cedar [, ]. The parallelizing restructurer could convert a sequential program written in
FORTRAN  into a parallel program in Cedar Fortran that takes advantage of all unique features in Cedar.
Through this conversion and restructuring process, not
only loop-level parallelism is exposed and extracted, but
C
also data locality is enhanced through advance techniques such as strip-mining, data globalization, and
privatization [, ].
Performance measurements have been done extensively on Cedar [, , ]. Given the complexity of
Cedar architecture, parallelizing restructurer, OS, and
the programs themselves, it is very difficult to isolate the
individual effects at all levels that influence the final
performance of each program. The performance results
shown in Fig.  are the most thorough measurements
presented in []. A suite of scientific application benchmark programs called Perfect Benchmark [] were used
to measure Cedar performance. It was a collection
of Fortran programs that span a wide spectrum of
scientific and engineering applications from quantum
chromodynamics (QCD) to analog circuit simulation
(SPICE).
The table in Fig.  lists the performance improvement over the serial execution time of each individual
program. The second column shows the performance
improvement using the KAP-based parallelizing restructurer. The results show that, despite the use of the most advanced
parallelizing restructurer of that time, the overall performance
improvement is still quite limited. Hence, each
benchmark program was analyzed and parallelized
manually. The techniques are limited to those that could
be implemented in an automated parallelizing restructurer. The third column, for automatable transformations, shows the performance improvement that could
be achieved if those programs were restructured by a
more intelligent parallelizing restructurer.
Program   Compiled by Kap/Cedar   Auto. transforms       W/o Cedar synchronization   W/o prefetch          MFLOPS
          time (improvement)      time (improvement)     time (% slowdown)           time (% slowdown)     (YMP-8/Cedar)
ADM       689 (1.2)               73 (10.8)              81 (11%)                    83 (2%)               6.9 (3.4)
ARC2D     218 (13.5)              141 (20.8)             141 (0%)                    157 (11%)             13.1 (34.2)
BDNA      502 (1.9)               111 (8.7)              118 (6%)                    122 (3%)              8.2 (18.4)
DYFESM    167 (3.9)               60 (11.0)              67 (12%)                    100 (49%)             9.2 (6.5)
FLO52     100 (9.0)               63 (14.3)              64 (1%)                     79 (23%)              8.7 (37.8)
MDG       3200 (1.3)              182 (22.7)             202 (11%)                   202 (0%)              18.9 (1.1)
MG3D(a)   7929 (1.5)              348 (35.2)             346 (0%)                    350 (1%)              31.7 (3.6)
OCEAN     2158 (1.4)              148 (19.8)             174 (18%)                   187 (7%)              11.2 (7.4)
QCD       369 (1.1)               239 (1.8)              239 (0%)                    246 (3%)              1.1 (11.8)
SPEC77    973 (2.4)               156 (15.2)             156 (0%)                    165 (6%)              11.9 (4.8)
SPICE     95.1 (1.02)             NA                     NA                          NA                    0.5 (11.4)
TRACK     126 (1.1)               26 (5.3)               28 (8%)                     28 (0%)               3.1 (2.7)
TRFD      273 (3.2)               21 (41.1)              21 (0%)                     21 (0%)               20.5 (2.8)
(a) This version of MG3D includes the elimination of file I/O.
Cedar Multiprocessor. Fig.  Cedar performance improvement for Perfect Benchmarks []
It is quite clear that there was still a lot of potential for
improving the performance of parallelizing restructurers,
because manually parallelized programs show substantial improvement in overall performance. However,
after another decade of research on more advanced
restructuring techniques since the publication of [–, ], the consensus seems to be
that programmers must be given more
tools and control to parallelize and write their own parallel code, instead of relying totally on a parallelizing
restructurer to convert a sequential version of their code
automatically into a parallel form and expecting a
performance improvement equal to that of parallel
code implemented by the programmers themselves.
The fourth column in the table of Fig.  shows the
slowdown when Cedar synchronization instructions are not used, i.e., the performance benefit of these instructions. The fifth column shows the impact
of vector prefetching on overall performance. The
improvement from vector prefetching was not as significant as that obtained in later studies reported in the literature, because the Cedar backend compiler did not
use sophisticated algorithms to place the prefetch instructions, and the number of prefetch buffers
was too small.
Bibliography
1. Berry M et al () The perfect club benchmarks: effective
performance evaluation of supercomputers. Int J Supercomput
Appl ():–
2. Eigenmann R et al () Restructuring Fortran Programs for
Cedar. In: Proceedings of ICPP’, vol , pp –
3. Eigenmann R et al () The Cedar Fortran Project. CSRD
Report No. , University of Illinois at Urbana-Champaign
4. Eigenmann R, Hoeflinger J, Li Z, Padua DA () Experience
in the automatic parallelization of four perfect-benchmark programs. In: Proceedings for the fourth workshop on languages
and compilers for parallel computing, Santa Clara, CA, pp –,
August 
5. Emrath P et al () The xylem operating system. In: Proceedings
of ICPP’, vol , pp –
6. Gallivan K et al () Preliminary performance analysis of the
cedar multiprocessor memory system. In: Proceedings of 
ICPP, vol , pp –
7. Gallivan KA, Plemmons RJ, Sameh AH () Parallel algorithms
for dense linear algebra computations. SIAM Rev ():–
8. Guzzi MD, Padua DA, Hoeflinger JP, Lawrie DH () Cedar
Fortran and other vector and parallel Fortran dialects. J Supercomput ():–
9. Kuck D et al () The cedar system and an initial performance
study. In: Proceedings of international symposium on computer
architecture, San Diego, CA, pp –
10. Kuck & Associates, Inc () KAP User’s Guide. Champaign,
Illinois
11. Lawrie DH () Access and alignment of data in an array
processor. IEEE Trans Comput C-():–, Dec 
12. Midkiff S, Padua DA () Compiler algorithms for synchronization. IEEE Trans Comput C-():–
13. Padua DA, Wolfe MJ () Advanced compiler optimizations for
supercomputers. Commun ACM ():–
14. Zhu CQ, Yew PC () A scheme to enforce data dependence
on large multiprocessor systems. IEEE Trans Softw Eng
SE-():–, June 
CELL
Cell Broadband Engine Processor
Cell Broadband Engine Processor
H. Peter Hofstee
IBM Austin Research Laboratory, Austin, TX, USA
Synonyms
CELL; Cell processor; Cell/B.E.
Definition
The Cell Broadband Engine is a processor that conforms
to the Cell Broadband Engine Architecture (CBEA).
The CBEA is a heterogeneous architecture defined
jointly by Sony, Toshiba, and IBM that extends the
Power architecture with “Memory flow control” and
“Synergistic processor units.”
CBEA compliant processors are used in a variety
of systems, most notably the PlayStation  (now PS)
from Sony Computer Entertainment, the IBM QS,
QS, and QS server blades, Regza-Cell Televisions
from Toshiba, single processor PCI-express accelerator
boards, rackmount servers, and a variety of custom systems including the Roadrunner supercomputer at Los
Alamos National Laboratory that was the first to achieve
petaflop level performance on the Linpack benchmark
and was ranked as the world’s # supercomputer from
June  to November .
Discussion
Note: In what follows a system or chip that has multiple differing cores sharing memory is referred to as
heterogeneous and a system that contains multiple differing computational elements but that does not provide
shared memory between these elements is referred to as
hybrid.
Cell Broadband Engine
The Cell Broadband Engine Architecture (CBEA)
defines a heterogeneous architecture for shared-memory
multi-core processors and systems. Heterogeneity allows
a higher degree of efficiency and/or improved performance compared to conventional homogeneous
shared-memory multi-core processors by allowing
cores to gain efficiency through specialization.
The CBEA extends the Power architecture, and processors that conform to the CBEA are also
fully Power architecture compliant. Besides the Power
cores, the CBEA adds a new type of core: the Synergistic Processor Element (SPE). The SPE derives its efficiency and
performance from the following key attributes:
– A per-SPE local store for code and data
– Asynchronous transfers between the local store and
global shared memory
– A single-mode architecture
– A large register file
– A SIMD-only instruction set architecture
– Instructions to improve or avoid branches
– Mechanisms for fast communication and synchronization
Whereas a (traditional) CISC processor defines instructions that transform main memory locations directly,
and RISC processors transform only data in registers
and therefore must first load operands into registers, the
SPEs stage code and data from main memory to the
local store, where they are operated on by the SPE's RISC core, called the Synergistic Processor Unit (SPU) (Fig. ). The SPU operates
on the local store the way a conventional RISC processor
operates on memory, i.e., by loading data from the local
store into registers before it is transformed. Similarly,
results are produced in registers and must be staged
through the local store on their way to main memory. The
motivation for this organization is that while a large
enough register file allows a sufficient number of operations to be simultaneously executing to effectively hide
latencies to a store closely coupled to the core, a much
larger store or buffer is required to effectively hide the
latencies to main memory that, taking multi-core arbitration into account, can approach a thousand cycles.
In all the current implementations of the SPE the local
store is  kB, which is an order of magnitude smaller
than the size of a typical per-core on-chip cache for
similarly performing processors. SPEs can be effective
with an on-chip store that is this small because data
can be packed as it is transferred to the local store and,
because the SPE is organized to maximally utilize the
available memory bandwidth, data in the local store is
allowed to be replaced at a higher rate than is typical for
a hardware cache.
Cell Broadband Engine Processor. Fig.  Internal
organization of the SPE
In order to optimize main memory bandwidth, a
sufficient number of main memory accesses (put and
get) must be simultaneously executable and data access
must be nonspeculative. The latter is achieved by having
software issue put and get commands rather than having hardware cache pre-fetch or speculation responsible
for bringing multiple sets of data on to the chip simultaneously. To allow maximum flexibility in how put and
get commands are executed, without adding hardware
complexity, put and get semantics define these operations as asynchronous to the execution of the SPU. The
unit responsible for executing these commands is the
Memory Flow Control unit (MFC). The MFC supports
three mechanisms to issue commands.
Any unit in the system with the appropriate memory access privileges can issue an MFC command by
writing to or reading from memory-mapped command
registers.
The SPU can issue MFC commands by reading or
writing to a set of channels that provide a direct interface to the MFC for its associated SPU.
Finally, put-list and get-list commands instruct the
MFC to execute a list of put and get commands from the
local store. This can be particularly effective if a substantial amount of noncontiguous data needs to be gathered
into the local store or distributed back to main memory.
Put and get commands are issued with a tag that
allows groups of these commands to be associated in a
tag group. Synchronization between the MFC and the
SPU or the rest of the system is achieved by checking
on the completion status of a tag group or set of tag
groups. The SPU can avoid busy waiting by checking
on the completion status with a blocking MFC channel read operation. Put and get commands adhere to the
Power addressing model for main memory and specify
effective addresses for main (shared) memory that are
translated to real addresses according to the page and
segment tables maintained by the operating system.
While the local store, like the register file, is considered
private, access to shared memory follows the normal
Power architecture coherence rules.
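The asynchronous put and get model with tag groups can be illustrated with a toy Python model. This is not the real MFC interface: the class ToyMFC, its method names, and the use of a thread pool to stand in for the memory flow controller are assumptions made for illustration, and addresses are plain list indices rather than effective addresses.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyMFC:
    """Hypothetical model of a memory flow controller with tag groups."""

    def __init__(self, main_memory, local_store_size):
        self.main = main_memory
        self.local = [0] * local_store_size
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._groups = {}                        # tag -> pending transfers

    def get(self, ls_addr, ea, size, tag):
        """Start an asynchronous copy from main memory into the local store."""
        def copy():
            self.local[ls_addr:ls_addr + size] = self.main[ea:ea + size]
        self._groups.setdefault(tag, []).append(self._pool.submit(copy))

    def put(self, ls_addr, ea, size, tag):
        """Start an asynchronous copy from the local store to main memory."""
        def copy():
            self.main[ea:ea + size] = self.local[ls_addr:ls_addr + size]
        self._groups.setdefault(tag, []).append(self._pool.submit(copy))

    def wait_tag(self, tag):
        """Block until every command issued with this tag has completed."""
        for transfer in self._groups.pop(tag, []):
            transfer.result()

    def shutdown(self):
        self._pool.shutdown()


main = list(range(64)) + [0] * 64
mfc = ToyMFC(main, local_store_size=32)
mfc.get(0, 0, 16, tag=1)                 # two transfers in the same tag group
mfc.get(16, 32, 16, tag=1)
mfc.wait_tag(1)                          # blocking wait instead of busy waiting
mfc.local = [2 * v for v in mfc.local]   # compute on local store contents
mfc.put(0, 64, 32, tag=2)
mfc.wait_tag(2)
mfc.shutdown()
```

Grouping transfers under one tag lets the program issue several commands and then wait once for the whole group, which is the pattern the text describes.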
The design goals of supporting both a relatively large
local store and high single-thread performance imply
that bringing data from the local store to a register file
is a multi-cycle operation (six cycles in current SPE
implementations). In order to support efficient execution for programs that randomly access data in the local
store, a typical loop may be unrolled four to eight times.
Supporting this without creating a lot of register spills
requires a large register file. Thus the SPU was architected to provide a -entry general-purpose register
file. Further efficiency is gained by using a single register file for all data types including bit, byte, half-word
and word unsigned integers, and word and double-word
floating point. All of these data types are SIMD with a
width that equates to  bits. The unified register file
allows for a rich set of operations to be encoded in a
-bit instruction, including some performance critical
operations that specify three sources and an independent target including select (conditional assignment)
and multiply-add.
The abovementioned select instruction can quite
often be used to design branchless routines. A second
architectural mechanism that is provided to limit
branch penalties is a branch hint instruction that provides advance notice that an upcoming branch is predicted taken and also specifies its target so that the
code can be pre-fetched by the hardware and the branch
penalty avoided.
The Cell Broadband Engine and PowerXCelli processors combine eight SPEs, a Power core and highbandwidth memory controllers, and a configurable
off-chip coherence fabric onto a single chip. On-chip the
cores and controllers are interconnected with a highbandwidth coherent fabric. While physically organized
as a set of ring buses for data transfers, the intent of
this interconnect fabric is to allow all but those programs most finely tuned for performance to ignore its
bandwidth or connectivity limitations and only consider bandwidth to memory and I/O, and bandwidth in
and out of each unit (Fig. ).
Cell Broadband Engine Processor. Fig.  Organization of
the Cell Broadband Engine and PowerXCell 8i
[Diagram components: Power core, eight SPEs, element interconnect bus (on-chip coherent bus), XDR memory controller, off-chip I/O and coherent bus]
Cell B.E.-Based Systems
Cell Broadband Engine processors have been used in
a wide variety of systems. Each system is reviewed
briefly.
1. Sony PlayStation . This is perhaps the best-known
use of the Cell B.E. processor. In PlayStation 
the Cell B.E. processor is configured with a high-bandwidth I/O interface to connect to the RSX
graphics processor. A second I/O interface connects the Cell B.E. to a SouthBridge, which provides
the connectivity to optical BluRay, HDD, and other
storage and provides network connectivity. In the
PlayStation  application, seven of the eight synergistic processors are used. This benefits the manufacturing efficiency of the processor.
2. Numerous PlayStation –based clusters. These range
from virtual grids of PlayStations such as the grid
used for the “Folding at Home” application that first
delivered Petascale performance, to numerous clusters of PlayStations running the Linux operating system that are connected with Ethernet switches and used
for applications from astronomy to cryptography.
3. IBM QS, QS, and QS server blades. In these
blades two Cell processors are connected with
a high-bandwidth coherent interface. Each of the
Cell processors is also connected to a bridge chip
that provides an interface to PCI-express, Ethernet,
as well as other interfaces. The QS uses a version of the nm Cell processor with added double-precision floating-point capability that also supports
larger (DDR) memory capacities.
4. Cell-based PCI-express cards (multiple vendors).
These PCI-express cards have a single Cell B.E.
or PowerXCell 8i processor and PCI-express bridge.
The cards are intended for use as accelerators in
workstations.
5. The “Roadrunner” supercomputer at Los Alamos.
This supercomputer consists of a cluster of more
than , Dual Dual-Core Opteron-based IBM
server blades clustered together with an InfiniBand
network. Each Opteron blade is PCI-express connected to two QS server blades, i.e., one PowerXCell 8i processor per Opteron core on each blade.
This supercomputer, installed at Los Alamos in
, was the first to deliver a sustained Petaflop on
the Linpack benchmark used to rank supercomputers (top.org). Roadrunner was nearly three times
more efficient than the next supercomputer to reach
a sustained Petaflop.
6. The QPACE supercomputer developed by a European University consortium in collaboration with
IBM leverages the PowerXCell 8i processor in combination with an FPGA that combines the I/O
bridge and network switching functions. This
configuration allows the system to achieve the very
low communication latencies that are critical to
Quantum Chromo Dynamics calculations. QPACE
is a water-cooled system, and the efficiency of water-cooling in combination with the efficiency of the
PowerXCell 8i processor made this system top the
green list (green.org) in November 
(Figs. –).
In addition to the systems discussed above, the Cell
B.E. and PowerXCell 8i processors have been used in
televisions (Toshiba Regza-Cell), dual-Cell-processor
U servers, experimental blade servers that combine
FPGAs and Cell processors, and a variety of custom
systems used as subsystems in medical systems and in
aerospace and defense applications.
Programming Cell
The CBEA reflects the view that aspects of programs that are critical to their performance should be
brought under software control for best performance
and efficiency. Critical to application performance and
efficiency are:
– Thread concurrency (i.e., ability to run parts of the
code simultaneously on shared data)
– Data-level concurrency (i.e., ability to apply operations to multiple data simultaneously)
– Data locality and predictability (i.e., ability to predict what data will be needed next)
– Control flow predictability (i.e., ability to predict
what code will be needed next)
The threading model for the Cell processor is similar
to that of a conventional multi-core processor in that
main memory is coherently shared between SPEs and
Power cores.
Cell Broadband Engine Processor. Fig.  PlayStation  (configuration and system/cluster picture)
[Diagram components: XDR memory, Cell Broadband Engine, IO bridge, 1Gb Ethernet, RSX GPU]
Cell Broadband Engine Processor. Fig.  Roadrunner (system configuration and picture)
[Diagram components, Roadrunner accelerated node: IBM PowerXCell8i processors with DDR2 memory, PCIe bridges, AMD Opteron processors, IB adapters]
Cell Broadband Engine Processor. Fig.  QPACE (card and system configuration and picture)
[Diagram components, QPACE node card: IBM PowerXCell8i with DDR2 memory, Xilinx Virtex 5 FPGA]
The effective addresses used by the application
threads to reference shared memory are translated on
the SPEs in the same way, and based on the same segment and page tables that govern translation on the
Power cores. Threads can obtain locks in a consistent
manner on the Power cores and on the SPEs. To this
end the SPE supports atomic “get line and reserve”
and “store line conditional” commands that mirror the
Power core’s atomic load word and reserve and store
word conditional operations. An important practical
difference between a thread that runs on the Power core
and one that runs on an SPE is that the context on
the SPE is large and includes the  general-purpose
registers, the  kB local store, and the state of the
MFC. Therefore, unless an ABI is used that supports
cooperative multitasking (i.e., switching threads only
when there is minimal state in the SPE), doing a thread
switch is an expensive operation. SPE threads therefore are preferentially run to completion and when a
thread requires an operating system service it is generally preferable to service this with code on another
processor than to interrupt and context switch the SPE.
In the Linux operating system for Cell, SPE threads are
therefore first created as Power threads that initialize
and start the SPEs and then remain available to service
operating system functions on behalf of the SPE threads
they started. Note that while a context switch on an SPE
is expensive, once re-initialized, the local store is back in
the same state where the process was interrupted, unlike
a cache which typically suffers a significant number of
misses right after a thread switch. Because the local store
is part of the state of the SPE and is much larger than a
typical register file, a (preemptive) context switch on
an SPE is quite expensive. It is therefore best to think
of the SPE as a single-mode or batch-mode resource. If
the time to gather the inputs for a computation into the
local store leaves the SPE idle for a substantial amount
of time, then double-buffering of tasks within the same
user process can be an effective method to improve SPE
utilization.
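The double-buffering idea can be illustrated with a small sketch. This is not Cell SDK code: `fetch_block` and `compute` are hypothetical stand-ins for a DMA transfer into the local store and for an SPE kernel, and a single worker thread plays the role of the asynchronous transfer engine. The sketch only shows the control structure, in which the fetch of block k + 1 is started before the computation on block k begins.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(memory, index, block_size):
    """Hypothetical stand-in for a DMA transfer from global memory to a local buffer."""
    start = index * block_size
    return memory[start:start + block_size]


def compute(block):
    """Hypothetical stand-in for a kernel that works on data already held locally."""
    return sum(x * x for x in block)


def process_double_buffered(memory, block_size):
    """Overlap the fetch of block k + 1 with the computation on block k."""
    n_blocks = (len(memory) + block_size - 1) // block_size
    results = []
    if n_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch_block, memory, 0, block_size)
        for k in range(n_blocks):
            current = pending.result()        # wait until buffer k is filled
            if k + 1 < n_blocks:              # start filling the other buffer
                pending = dma.submit(fetch_block, memory, k + 1, block_size)
            results.append(compute(current))  # compute while the next fetch runs
    return results
```

For example, `process_double_buffered(list(range(10)), 4)` processes three blocks and returns one partial result per block.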
The SPU provides a SIMD instruction set that is
similar to the instruction sets of other media-enhanced
processors and therefore data-level concurrency is handled in much the same way. Languages such as OpenCL
provide language support for vector data types allowing
portable code to be constructed that leverages the SIMD
operations. Vectorizing compilers for Cell provide an
additional path toward leveraging the performance provided by the SIMD units while retaining source code
portability. The use of standard libraries, the implementation of which uses the SIMD instructions provided
by Cell directly, provides a third path toward leveraging the SIMD capabilities of Cell and other processors
without sacrificing portability. While compilers for the
Cell B.E. adhere to the language standards and thus also
accept scalar data types, scalar applications can incur performance penalties from aligning the
data in the SIMD registers.
The handling of data locality predictability and the
use of the local store is the most distinguishing characteristic of the Cell Broadband Engine. It is possible to
use the local store as a software-managed cache for code

and data with runtime support and thus remove the
burden of dealing with locality from both the programmer and the compiler. While this provides a path toward
code portability of standard multithreaded languages,
doing so generally results in poor performance due to
the overheads associated with software tag checks on
loads and stores (the penalties associated with issuing
the MFC commands on a cache miss are insignificant in
comparison to the memory latency penalties incurred
on a miss). On certain types of codes, data pre-fetching
commands generated by the compiler can be effective.
If the code is deeply vectorized then the local store can
be treated by the compiler as essentially a large vector
register file and if the vectors are sufficiently long this
can lead to efficient compiler-generated pre-fetching
and gathering of data. Streaming languages, where a
set of kernels is applied to data that is streamed from
and back to global memory can also be efficiently supported. Also, if the code is written in a functional or
task-oriented style which allows operands of a piece
of work to be identified prior to execution then again
compiler-generated pre-fetching of data can be highly
effective. Finally, languages that explicitly express data
locality and/or privacy can be effectively compiled to
leverage the local store. It is not uncommon for compilers or applications to employ multiple techniques,
e.g., the use of a software data cache for hard-to-prefetch data in combination with block pre-fetching of
vector data.
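The cost of a software-managed cache comes from the tag check that must be executed on every access. The following sketch, which is an illustration rather than the runtime shipped with any Cell compiler, models a direct-mapped, read-only software cache in front of a slow memory. The class name `SoftwareCache` and its parameters are assumptions made for this example.

```python
class SoftwareCache:
    """Direct-mapped, read-only software cache in front of a 'global memory' list."""

    def __init__(self, memory, num_lines=4, line_size=4):
        self.memory = memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines    # which memory line each slot holds
        self.lines = [None] * num_lines   # the cached data for each slot
        self.hits = 0
        self.misses = 0

    def load(self, address):
        line_number = address // self.line_size
        slot = line_number % self.num_lines
        # Software tag check: this comparison is paid on every load.
        if self.tags[slot] != line_number:
            self.misses += 1
            start = line_number * self.line_size
            # On a miss the whole line is copied in (a DMA transfer on real hardware).
            self.lines[slot] = self.memory[start:start + self.line_size]
            self.tags[slot] = line_number
        else:
            self.hits += 1
        return self.lines[slot][address % self.line_size]
```

Even when every access hits, the division, modulo, and comparison remain on the critical path of each load, which is the overhead the text refers to.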
Because the local store stores code as well as data,
software must also take care of bringing code into
the local store. Unlike the software data cache, a software instruction cache generally operates efficiently and
explicitly dealing with pre-fetching code to the local
store can be considered a second-order optimization
step for most applications. That said, there are cases,
such as hard-real-time applications, where explicit control over code locality is beneficial. As noted earlier,
the Cell B.E. provides branch hint instructions that
allow compilers to leverage information about preferred
branch directions.
Cell Processor Performance
Cell Broadband Engine processors occupy a middle
ground between CPUs and GPUs. On properly structured applications, an SPU is often an order of magnitude more efficient than a high-end CPU. This can
be seen by comparing the performance of the Cell B.E.
or PowerXCell 8i to that of dual-core processors that
require a similar amount of power and chip area in the
same semiconductor technology. For an application to
benefit from the Cell B.E. architecture it is most important that it be possible to structure the application such
that the majority of the data can be pre-fetched into
the local store prior to program execution. The SIMD width on the Cell B.E. is similar to that of CPUs of its
generation and thus the degree of data parallelism is not
a big distinguishing factor between Cell B.E. and CPUs.
GPUs are optimized for streaming applications and with
a higher degree of data parallelism can be more efficient
than Cell B.E. on those applications.
Related Entries
NVIDIA GPU
IBM Power Architecture
Bibliographic Notes and Further
Reading
An overview of Cell Broadband Engine is provided in
[1] with a more detailed look at aspects of the architecture in [2] and [3]. In [4] more detail is provided
on the implementation aspects of the microprocessor.
Reference [5] goes into more detail on the design and
programming philosophy of the Cell B.E., whereas references [6] and [7] provide insight into compiler design
for the Cell B.E. Reference [8] provides an overview
of the security architecture of the Cell B.E. References
[9] and [10] provide an introduction to performance
attributes of the Cell B.E.
Bibliography
1. Kahle JA, Day MN, Hofstee HP, Johns CR, Maeurer TR, Shippy D
() Introduction to the cell multiprocessor. IBM J Res Dev
(/):–
2. Johns CR, Brokenshire DA () Introduction to the Cell Broadband Engine architecture. IBM J Res Dev ():–
3. Gschwind M, Hofstee HP, Flachs B, Hopkins M, Watanabe Y,
Yamazaki T () Synergistic processing in cell’s multicore
architecture. IEEE Micro ():–
4. Flachs B, Asano S, Dhong SH, Hofstee HP, Gervais G, Kim R, Le
T, Liu P, Leenstra J, Liberty JS, Michael B, Oh H-J, Mueller SM,
Takahashi O, Hirairi K, Kawasumi A, Murakami H, Noro H,
Onishi S, Pille J, Silberman J, Yong S, Hatakeyama A, Watanabe Y, Yano N, Brokenshire DA, Peyravian M, To V, Iwata E
() Microarchitecture and implementation of the synergistic
processor in -nm and -nm SOI. IBM J Res Dev ():–
5. Keckler SW, Olukotun K, Hofstee HP (eds) () Multicore
processors and systems. Springer, New York
6. Eichenberger AE, O’Brien K, O’Brien KM, Wu P, Chen T, Oden
PH, Prener DA, Shepherd JC, So B, Sura Z, Wang A, Zhang T,
Zhao P, Gschwind M, Archambault R, Gao Y, Koo R () Using
advanced compiler technology to exploit the performance of the
Cell Broadband Engine architecture. IBM Syst J ():–
7. Perez JM, Bellens P, Badia RM, Labarta J () CellSs: making
it easier to program the Cell Broadband Engine processor. IBM J
Res Dev ():–
8. Shimizu K, Hofstee HP, Liberty JS () Cell Broadband Engine
processor vault security architecture. IBM J Res Dev ():–
9. Chen T, Raghavan R, Dale JN, Iwata E () Cell Broadband
Engine architecture and its first implementation – a performance
view. IBM J Res Dev ():–
10. Williams S, Shalf J, Oliker L, Husbands P, Kamil S, Yelick K
() The potential of the cell processor for scientific computing.
In: Proceedings of the third conference on computing frontiers,
Ischia, pp –
Cell Processor
Cell Broadband Engine Processor
Cell/B.E.
Cell Broadband Engine Processor
Cellular Automata
Matthew Sottile
Galois, Inc., Portland, OR, USA

Definition
A cellular automaton is a computational system defined
as a collection of cells that change state in parallel based
on a transition rule applied to each cell and a finite
number of its neighbors. A cellular automaton can
be treated as a graph in which vertices correspond to
cells and edges connect adjacent cells to define their
neighborhood. Cellular automata are well suited to parallel implementation using a Single Program, Multiple
Data (SPMD) model of computation. They have historically been used to study a variety of problems in the
biological, physical, and computing sciences.
Discussion
Definition of Cellular Automata
Cellular automata (CA) are a class of highly parallel
computational systems based on a transition function
applied to elements of a grid of cells. They were first
introduced in the 1940s by John von Neumann as an
effort to model self-reproducing biological systems in
a framework based on mathematical logic. The system
is evolved in time by applying this transition function to
all cells in parallel to generate the state of the system at
time t based on the state from time t − 1. Three properties
are common amongst different CA algorithms:
1. A grid of cells is defined such that for each cell there
exists a finite neighborhood of cells that influence
its state change. This neighborhood is frequently
defined to be other cells that are spatially adjacent.
The topology of the grid of cells determines the
neighbors of each cell.
2. A fixed set of state values is possible for each cell.
These can be as simple as boolean true/false values,
all the way up to real numbers. Most common examples are restricted to either booleans or a small set of
integer values.
3. A state transition rule that evolves the state of a cell
to a new state based on its current value and that of
its neighbors.
Cells are positioned at the nodes of a regular grid,
commonly based on rectangular, triangular, or hexagonal cells. The grid of cells may be finite or unbounded
in size. When finite grids are employed, the model must
account for the boundaries of the grid (Fig. 1). The
most common approaches taken are to either impose a
toroidal or cylindrical topology to the space in which all
or some edges wrap around, or adopt a fixed value for all
off-grid cells to represent an appropriate boundary condition. Finite grids are easiest to model computationally,
as unbounded grids will require a potentially large and
endlessly growing amount of memory to represent all
cells containing important state values.
Cellular Automata. Fig. 1 Common boundary types.
(a) Toroidal, (b) planar with fixed values off grid
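The two boundary treatments can be expressed as two ways of reading a neighbor whose coordinates may fall outside a finite 2D grid. The sketch below is a minimal illustration; the function names and the default off-grid value of 0 are assumptions made for this example.

```python
def neighbor_toroidal(grid, row, col):
    """Toroidal boundary: coordinates wrap around on every edge."""
    rows, cols = len(grid), len(grid[0])
    return grid[row % rows][col % cols]


def neighbor_fixed(grid, row, col, off_grid_value=0):
    """Planar boundary: every off-grid cell holds the same fixed value."""
    rows, cols = len(grid), len(grid[0])
    if 0 <= row < rows and 0 <= col < cols:
        return grid[row][col]
    return off_grid_value
```

A cylindrical topology follows by wrapping only one of the two coordinates and using the fixed value for the other.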
1D Grids
In a simple 1D CA, the grid is represented as an array
of cells. The neighborhood of a cell is the set of cells
that are spatially adjacent within some finite radius. For
example, in Fig. 2 a single cell is highlighted with a
neighborhood of radius 1, which includes one cell on
each side.
2D Grids
In a 2D CA, there are different ways in which a neighborhood can be defined. The most common are the von
Neumann and Moore neighborhoods as shown in Fig. 3.
For each cell, the eight surrounding cells are candidate
neighbors because they share either a face or a corner with it. The
von Neumann neighborhood includes only those cells
that share a face, while the Moore neighborhood also
includes those sharing a corner.
When considering D systems, the transition rule
computation bears a strong similarity to stencil-based
computations common in programs that solve systems
of partial differential equations. For example, a simple
stencil computation involves averaging all values in the
neighborhood of each point in a rectangular grid and
replacing the value of the point at the center with the
result. In the example of Conway’s game of life introduced later in this entry, there is a similar computation
for the transition rule, in which the number of cells that
Cellular Automata. Fig.  D neighborhood of radius 
a
are in the “on” state in the neighborhood of a cell is used
to determine the state of the central cell. In both cases
there is a neighborhood of cells where a reduction operation (such as +) is applied to their state to compute a
single value from which the new state of the central cell
is computed.
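The shared structure of the two computations, a reduction over a neighborhood followed by a function of the reduced value, can be sketched as follows. The offsets lists, the wrap-around boundary, and the choice to include the center point in the average are assumptions made for this illustration.

```python
# Neighbor offsets (row, column) for the two common 2D neighborhoods.
VON_NEUMANN = [(-1, 0), (1, 0), (0, -1), (0, 1)]
MOORE = VON_NEUMANN + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def neighborhood_sum(grid, row, col, offsets):
    """Apply the + reduction to the neighbors of (row, col); edges wrap around."""
    rows, cols = len(grid), len(grid[0])
    return sum(grid[(row + dr) % rows][(col + dc) % cols] for dr, dc in offsets)


def average_stencil(grid):
    """Replace each point by the mean of itself and its Moore neighborhood."""
    rows, cols = len(grid), len(grid[0])
    return [[(grid[r][c] + neighborhood_sum(grid, r, c, MOORE)) / 9.0
             for c in range(cols)]
            for r in range(rows)]
```

With boolean or 0/1 cells, `neighborhood_sum(grid, r, c, MOORE)` is exactly the count of "on" neighbors that a game of life style rule consumes.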
Common Cellular Automata
1D Boolean Automata
The simplest cellular automata are one-dimensional
boolean systems. In these, the grid is an array of boolean
values. The smallest neighborhood is that in which a
cell is influenced only by its two adjacent neighbors. The
state transition rule for a cell $c_i$ is defined as
$$c_i^{t+1} = R\left(c_i^{t},\, c_{i-1}^{t},\, c_{i+1}^{t}\right).$$
Wolfram Rule Scheme
Wolfram [12] defined a concise scheme for naming 1D
cellular automata that is based on a binary encoding
of the output value for a rule over a family of possible
inputs. The most common scheme is applied to systems
in which a cell is updated based only on the values of
itself and its two nearest neighbors. For any cell there
are only eight possible inputs to the transition rule R:
111, 110, 101, 100, 011, 010, 001, and 000. For each of these
inputs, R produces a single bit. Therefore, if each 3-bit
input is assigned a position in an 8-bit number and
that bit is set to either 0 or 1 based on the output of R for the
corresponding 3-bit input state, an 8-bit number can be
created that uniquely identifies the rule R.
For example, a common example rule is number ,
which in binary is written as . This means that
the input states , , , and  map to , and the
inputs , , , and  map to . It is common to see
these rules represented visually as in Fig. . The result of
this rule is shown in Fig. .
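The numbering scheme translates directly into code: the 3-bit input, read as an integer between 0 and 7, selects one bit of the 8-bit rule number. The sketch below is a minimal illustration in which the function names and the wrap-around boundary are assumptions; any rule number from 0 to 255 can be passed in.

```python
def rule_table(rule_number):
    """Map each 3-bit input (left, center, right) to the output bit of the rule."""
    return {(l, c, r): (rule_number >> (l << 2 | c << 1 | r)) & 1
            for l in (0, 1) for c in (0, 1) for r in (0, 1)}


def step(cells, rule_number):
    """Advance a 1D boolean CA by one time step with wrap-around edges."""
    table = rule_table(rule_number)
    n = len(cells)
    return [table[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]
```

For rule number 30, whose binary form is 00011110, repeated calls to `step` starting from a row with a single cell turned on produce the successive rows of the evolution.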
This numbering scheme can be extended to D cellular automata with larger neighborhoods. For example, a
CA with a neighborhood of five cells (two on each side
of the central cell) would have  =  possible input
states; so each rule could be summarized with a single
b
Cellular Automata. Fig.  D neighborhoods. (a) von
Neumann (b) Moore
Cellular Automata. Fig.  Rule  transition rule. Black is ,
white is 
Cellular Automata
Cellular Automata. Fig.  A sequence of time steps for
the D rule  automaton, starting from the top row with a
single cell turned on
Cellular Automata. Fig.  A three color D automaton
representing rule  with initial conditions 
-bit number. The D scheme can also be extended to
include cells that have more than two possible states.
Extensions to these larger state spaces yield example
rules that result in very complex dynamics as they
evolve. For example, rule  in Fig.  shows the evolution of the system starting from an initial condition of
 centered on the first row.
2D Boolean Automata
In 1970, Martin Gardner introduced a cellular automaton invented by John Conway [8] on 2D grids with
boolean-valued cells that is known as the game of life.
For each cell, the game of life transition rule considers all cells in the 8-cell Moore neighborhood. Cells that
contain a true value are considered to be “alive,” while
those that are false are “dead.” The rule is easily stated as
follows, where $c^t$ is the current state of the cell at time
step t, and N is the number of cells in its neighborhood
that are alive:
● If $c^t$ is alive:
– If N < 2 then $c^{t+1}$ is set to dead.
– If N > 3 then $c^{t+1}$ is set to dead.
– If N = 2 or N = 3 then $c^{t+1}$ is set to alive.
● If $c^t$ is dead and N = 3, $c^{t+1}$ is set to alive.
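The game of life rule can be sketched in a few lines. The function name `life_step` and the toroidal boundary are assumptions made for this illustration; a new grid is written on each step so that every cell reads only values from the previous time step.

```python
def life_step(grid):
    """Apply one game of life update to a 2D grid of booleans (toroidal edges)."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # N: number of live cells in the 8-cell Moore neighborhood.
            n = sum(grid[(r + dr) % rows][(c + dc) % cols]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            if grid[r][c]:
                new_grid[r][c] = n in (2, 3)   # survives only with 2 or 3 neighbors
            else:
                new_grid[r][c] = n == 3        # birth with exactly 3 neighbors
    return new_grid
```

Applying `life_step` to a row of three live cells yields a column of three live cells, and applying it again restores the row, which is the period-two oscillation of the blinker.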
This rule yields very interesting dynamic behavior as
it evolves. Many different types of structures have been
discovered, ranging from those that are static and stable,
to those that repeat through a fixed sequence of states, to
persistent structures that move and produce new structures. Figure 7a shows a typical configuration of the
space after a number of time steps from a random initial state. A number of interesting structures that appear
commonly are also shown, such as the self-propagating
glider (Fig. 7b), oscillating blinker (Fig. 7d), and a configuration that rapidly stabilizes and ceases to change
(Fig. 7c).
Lattice Gas Automata
Cellular automata have a history of being employed for
modeling physical systems in the context of statistical
mechanics, where the macroscopic behavior of the system is dictated primarily by microscopic interactions
with a small spatial extent. The Ising model used in solid
state physics is an example of a non-automaton system
that shares this characteristic of local dynamics leading
to a global behavior that mimics real physical systems.
In the 1980s, cellular automata became a topic of
interest in the physics community for modeling fluid
systems based on the observation that macroscopic
behavior of fluids is determined by the microscopic
behavior of a large ensemble of fluid particles. Instead
of considering cells as containing generic boolean values, one could encode more information in each cell
to represent entities such as interacting particles. Two
important foundational cellular automata systems are
described here that laid the basis for later models
of modern relevance such as the Lattice Boltzmann
Method (LBM). The suitability of CA-based algorithms
to parallelization is a primary reason for current interest
in the LBM for physical simulation.
HPP Lattice Gas
One of the earliest systems that gained attention was the
HPP automaton of Hardy, Pomeau and de Pazzis [9]. In
the simple single fluid version of this system, each cell
contains a 4-bit value. Each bit in this value corresponds
to a vector relating the cell to its immediately adjacent
neighbors (up, down, left, and right). A bit being on corresponds to a particle coming from the neighbor into
the cell, and a bit being off corresponds to the absence
of a particle arriving from the neighbor.

Cellular Automata. Fig. 7 Plots showing behavior observed while evolving the game of life CA. (a) Game of life over three
time steps; (b) two full periods of the glider; (c) feature that stabilizes after three steps; (d) two features that repeat every
two steps
Cellular Automata. Fig. 8 A sample HPP collision rule
The transition rule for the HPP automaton was constructed to represent conservation of momentum in a
physical system. For example, if a cell had a particle
arriving from above and from the left, but none from
the right or from below, then after a time step the cell
should produce a particle that leaves to the right and to
the bottom. Similarly, if a particle arrives from the left
and from the right, but from neither the top nor the bottom,
then two particles should exit from the top and bottom.
In both of these cases, the momentum of the system
does not change. This simple rule models the physical
properties of a system based on simple collision rules.
This is illustrated in Fig. 8 for a collision of two particles
entering a cell facing each other. If the cell encoded the
entry state as the binary value  where the bits correspond in order to North, East, South, and West, then
the exit state for this diagram would be defined as .
The rules for the system are simple to generate by
taking advantage of rotation invariant properties of the
system – the conservation laws underlying a collision
rule do not change if they are transformed by a simple
rotation. For example, in the cell illustrated here, rotation of the state by 90° corresponds to a circular shift of
the input and output bit encodings by one. As such, a
small set of rules need to be explicitly defined and the
remainder can be computed by applying rotations. The
same shortcut can be applied to generating the table of
transition rules for more sophisticated lattice gas systems due to the rotational invariance of the underlying
physics of the system.
From a computational perspective, this type of rule
set was appealing for digital implementation because
it could easily be encoded purely based on boolean
operators. This differed from traditional methods for
studying fluid systems in which systems of partial differential equations needed to be solved using floating
point arithmetic.
FHP Rules
The HPP system, while intuitively appealing, lacks
properties that are necessary for modeling realistic
physical systems – for example, the HPP system does
not exhibit isotropic behavior. The details related to the
physical basis of these algorithms are beyond the scope
of this entry, but are discussed at length by Doolen [5],
Wolf-Gladrow [4], and Rivet [10]. A CA inspired by the
early HPP automaton was created by Frisch, Hasslacher,
and Pomeau [7], leading to the family of FHP lattice gas
methods.
The primary advance of these new methods
over prior lattice gas techniques was that it became possible
to derive a rigorous relationship between the FHP cellular automaton and more common models for fluid flow
based on the Navier–Stokes equations, which require the solution
of systems of partial differential equations. Refinements
of the FHP model, and subsequent models based on it,
are largely focused on improving the correspondence
of the CA-based algorithms to the physical systems
that they model and on better matching accepted traditional
numerical methods.
In the D FHP automaton, instead of a neighborhood based on four neighbors, the system is defined
using a neighborhood of six cells arranged in a hexagonal grid with triangular cells. This change in the
underlying topology of the grid of cells was critical to improving the physical accuracy of the system
while maintaining a computationally simple CA-based
model. The exact same logic is applied in constructing
the transition rules in this case as for the HPP case –
the rules must obey conservation laws. The first instance
of the automaton was based on this simple extension of
HPP, in which the rules for transitions were intended
to model collisions in which the state represented only
particles arriving at a cell. Two example rules are shown
in Fig. 9, one that has two possible outcomes (each with
equal probability), and one that is deterministic. In both
cases, the starting state indicates the particles entering a
cell, and the exit state(s) show the particles leaving the
cell after colliding.
Later revisions of the model added a bit to the state
of each cell corresponding to a “rest particle”: a particle that, absent any additional particle arriving from
any neighbor, remained in the same position. Multiple
variants of the FHP-based rule set appeared in the literature during the 1980s and 1990s. These introduced
richer rule sets with more collision rules, with the effect
of producing models that corresponded to different viscosity coefficients []. Further developments added the
ability to model multiple fluids interacting, additional
forces influencing the flow, and so on.
Cellular Automata. Fig. 9 Two examples of FHP collision
rules
Modeling realistic fluid systems required models to
expand to three dimensions. To produce a model with
correct isotropy, d’Humières, Lallemand and Frisch [3]
employed a four-dimensional structure known as the
face centered hyper-cubic, or FCHC, lattice. The three-dimensional lattice necessary for building a lattice gas is
based on a dimension reducing procedure that projects
FCHC into three-dimensional space. Given this embedded lattice, a set of collision rules that obey the appropriate physical conservation laws can be defined much
like those for FHP in two dimensions.
Lattice gas methods were successful in demonstrating that, with careful selection of transition rules, realistic fluid behavior could be modeled with cellular
automata. The lattice Boltzmann method (LBM) is a
derivative of these early CA-based fluid models that
remains in use for physical modeling problems today.
In the LBM, a similar grid of cells is employed, but
the advancement of their state is based on continuous
valued function evaluation instead of a boolean state
machine.
Self-organized Criticality: Forest Fire Model
In , a model was proposed by Drossel and Schwabl [] to model forest fire behavior that built upon a
previous model proposed by Bak, Chen, and Tang [].
This model is one of a number of similar systems that
are used to study questions in statistical physics, such
as phase transitions and their critical points. A related
system is the sandpile model, in which a model is constructed of a growing pile of sand that periodically
experiences avalanches of different sizes.
As a cellular automaton, the forest fire model is
interesting because it is an automaton in which the rules
for updating a cell are probabilistic. The lattice gas models above include rules that are also probabilistic, but
unlike the forest fire system, the lattice gas probabilities are fixed to ensure accurate correspondence with
the physical system being modeled. Forest fire parameters are intended to be changed to study the response
of the overall system to their variation. Two probability
parameters are required: p and f, both of which are real
numbers between 0 and 1. The update rule for a cell is
defined as:
● A cell that is burning becomes empty (burned out).
● A cell will start burning if one or more neighbor cells
are burning.
● A cell will ignite with probability f regardless of how
many neighbors are burning.
● A cell that is empty turns into a non-burning cell
with probability p.
The third and fourth rules are the probabilistic parts
of the update algorithm. The third rule states that cells
can spontaneously combust without requiring burning
neighbors to ignite them. The fourth rule states that a
burned region will eventually grow back as a healthy,
non-burning cell. All four rules correspond to an intuitive notion for how a real forest functions. Trees can
be ignited by their neighbors; events can cause trees to
start burning in a non-burning region (such as by lightning or humans), and over time, burned regions will
eventually grow back.
This model differs from basic cellular automata due
to the requirement that for each cell update there may be
a required sampling of a pseudorandom number source
to determine whether or not spontaneous ignition or
new tree growth occurs. This has implications for parallel implementation because it requires the ability to
generate pseudorandom numbers in a parallel setting.
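The four update rules can be sketched as a single step function. Several choices here are assumptions made for this illustration: the state constants, the von Neumann neighborhood, treating off-grid cells as empty, and drawing from a single generator `rng`. In a parallel implementation each worker would instead need its own independent pseudorandom stream.

```python
import random

EMPTY, TREE, BURNING = 0, 1, 2


def forest_fire_step(grid, p, f, rng):
    """One forest fire update; off-grid cells are treated as empty."""
    rows, cols = len(grid), len(grid[0])

    def burning_neighbor(r, c):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == BURNING:
                return True
        return False

    new_grid = [row[:] for row in grid]   # read from grid, write to new_grid
    for r in range(rows):
        for c in range(cols):
            cell = grid[r][c]
            if cell == BURNING:
                new_grid[r][c] = EMPTY                  # rule 1: burns out
            elif cell == TREE:
                # rule 2: ignited by a neighbor; rule 3: spontaneous ignition
                if burning_neighbor(r, c) or rng.random() < f:
                    new_grid[r][c] = BURNING
            elif rng.random() < p:
                new_grid[r][c] = TREE                   # rule 4: regrowth
    return new_grid
```

A call such as `forest_fire_step(grid, 0.05, 0.001, random.Random(1))` advances the grid by one step; the parameter values are arbitrary examples.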
Universality in Cellular Automata
Cellular automata are not only useful for modeling
physical phenomena. They have also been used to study
computability theory. It has been shown that cellular automata exist that exhibit the property of computational universality. This is a concept that arises
in computability theory and is based on the notion
of Turing-completeness. Simply stated, a system that
exhibits Turing-completeness is able to perform any calculation by following a simple set of rules on input data.
In essence, given a carefully constructed input, the execution of the automaton will perform a computation
(such as arithmetic) that has been “programmed into”
the input data. This may be accomplished by finding
an encoding of an existing universal computing system
within the system which is to be shown as capable of
universal computation.
For example, the simple one-dimensional Rule 110
automaton was shown by Cook [2] to be capable of
universal computation. He accomplished this by identifying structures known as “spaceships” that are self-perpetuating and could be configured to interact. These
interactions could be used then to encode an existing computational system in the evolving state of the
Rule 110 system. The input data and program to execute would then be encoded in a linear form to serve
as the initial state of the automaton. A similar approach
was taken to show that the game of life could also
emulate any Turing machine or other universal computing system. Using self-perpetuating structures, their
generators, and structures that react with them, one
can construct traditional logic gates and connect them
together to form structures equivalent to a traditional
digital computer.
As Cook points out, a consequence of this is that it is
formally undecidable to predict certain behaviors, such
as reaching periodic states or a specific configuration
of bits. The property of universality does not imply that
encoding of a system like a digital computer inside a CA
would be at all efficient – the CA implementation would
be very slow relative to a real digital computer.
Parallel Implementation
Cellular automata fit well with parallel implementations
due to the high degree of parallelism present in the
update rules that define them. Each cell is updated in
parallel with all others, and the evolution of the system
proceeds by repeatedly applying the rule to update the
entire set of cells.
The primary consideration for parallel implementation of a cellular automaton is the dependencies
imposed by the update rule. Parallel implementation
of any sequential algorithm often starts with an analysis of the dependencies within the program, both in
terms of control and data. In a CA implementation,
there exists a data dependency between time steps due
to the need for state data from time step t to generate
the state for time t + 1.
For this discussion, consider the simple 1D boolean
cellular automaton. To determine the value of the ith cell
at step t of the evolution of the system, the update rule
requires the value of the cell itself and its neighbors (i − 1
and i + 1) at step t − 1. This dependency has two effects
that determine how the algorithm can be implemented
in parallel.
1. Given that the previous step t − 1 is not changed at
all and can be treated as read-only, all cell values for
step t can be updated in parallel.
2. An in-place update of the cells is not correct, as
any given cell i from step t − 1 is in the dependency set of multiple cells. If an in-place update
occurred, then it is possible that the value from step
t − 1 would be destructively overwritten and lost, disrupting the update for any other cell that requires its
step t − 1 value. A simple double-buffering scheme
can be used to overcome this issue for parallel implementations.
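Both observations can be combined in a short sketch: the cells are split into contiguous patches, each worker reads only the buffer for step t − 1 and writes only its own patch of the buffer for step t, and the two buffers are swapped after all workers finish. The function names, the use of a thread pool, and the wrap-around boundary are assumptions made for this illustration. Threads in standard Python do not speed up this computation, so the sketch shows the structure rather than a performance technique.

```python
from concurrent.futures import ThreadPoolExecutor


def update_patch(old, new, start, stop, rule_number):
    """Update cells [start, stop) of `new`, reading only from `old`."""
    n = len(old)
    for i in range(start, stop):
        left, center, right = old[(i - 1) % n], old[i], old[(i + 1) % n]
        new[i] = (rule_number >> (left << 2 | center << 1 | right)) & 1


def evolve(cells, rule_number, steps, workers=4):
    """Evolve a 1D boolean CA with double buffering and one patch per worker."""
    old, new = list(cells), [0] * len(cells)
    n = len(old)
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(steps):
            futures = [pool.submit(update_patch, old, new, a, b, rule_number)
                       for a, b in bounds]
            for future in futures:
                future.result()   # barrier: step t completes before t + 1 starts
            old, new = new, old   # swap the read and write buffers
    return old
```

Because no worker writes to the buffer that any other worker reads, the result does not depend on the number of workers or on the order in which the patches are processed.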
In more sophisticated systems such as those discussed earlier, the number of dependencies that each
cell has between time steps grows with the size of its
neighborhood. The game of life rule states that each cell
requires the state of nine cells to advance: the state of
the cell itself and that of its eight neighbors. The 3D
FCHC lattice gas has 24 channels per node to influence the transition rule, leading to a large number of
dependencies per cell.
Hardware Suitability
Historically, cellular automata were of interest on massively parallel processing (MPP) systems in which a
very large number of simple processing elements were
available. Each processing element was assigned a small
region of the set of cells (often a spatially contiguous patch), and iterated over the elements that it was
responsible for to apply the update rule. A limited
amount of synchronization would be required to ensure
that each processing element worked on the same
update step.
In the 1990s, there was a research effort at MIT to
build the CAM-8 [11], a parallel architecture designed
specifically for executing cellular automata. Machines
specialized for CA models could achieve performance
comparable to conventional parallel systems of the
era, including the Thinking Machines CM-2 and Cray
X-MP. The notable architectural features of the CAM-8
were the processing nodes based on DRAM for storing
cell state data, SRAM for storing a lookup table holding the transition rules for a CA, and a mesh-based
interconnection network connecting processing nodes
together for exchanging data. The machine could be
programmed for different CA rules, and was demonstrated on examples from fluid dynamics (lattice gases),
statistical physics (diffusion limited aggregation), image
processing, and large-scale logic simulation.
Another approach that has been taken to hardware implementation of CAs is the use of Field Programmable Gate Arrays (FPGAs) to encode the boolean
transition rule logic directly in hardware. More recently,
hardware present in accelerator-based systems such as
General Purpose Graphics Processing Units (GPGPUs)
presents a feature set similar to that of the traditional MPPs.
These devices present the programmer with the ability
to execute a large number of small parallel threads that
operate on large volumes of data. Each thread in a GPU
is very similar to the basic processing elements from a
traditional MPP. The appeal of these modern accelerators is that they can achieve performance comparable
to traditional supercomputers on small, specialized programs like cellular automata that are based on a single
small computation run in parallel on a large data set.
References
. Bak P, Chen K, Tang C () A forest fire model and some
thoughts on turbulence. Phys Lett A :–
. Cook M () Universality in elementary cellular automata.
Complex Syst ():–
. d’Humieres D, Lallemand P, Frisch U () Lattice gas models
for -D hydrodynamics. Europhys Lett :–
. Wolf-Gladrow DA () Lattice-gas cellular automata and lattice Boltzmann models: an introduction, volume  of Lecture
Notes in Mathematics. Springer, Berlin
. Doolen GD (ed) () Lattice gas methods: theory, application,
and hardware. MIT Press, Cambridge, MA
. Drossel B, Schwabl F () Self-organized critical forest-fire
model. Phys Rev Lett ():–
. Frisch U, Hasslacher B, Pomeau Y () Lattice-gas automata for
the Navier-Stokes equation. Phys Rev Lett ():–
. Gardner M () The fantastic combinations of John Conway’s
new solitaire game “life”. Sci Am :–
. Hardy J, Pomeau Y, de Pazzis O () Time evolution of a two-dimensional model system. J Math Phys ():–
. Rivet J-P, Boon JP () Lattice gas hydrodynamics, volume 
of Cambridge Nonlinear Science Series. Cambridge University
Press, Cambridge
. Toffoli T, Margolus N () Programmable matter: concepts and
realization. Phys D :–
. Wolfram S () Statistical mechanics of cellular automata. Rev
Mod Phys :–

Chaco
Bruce Hendrickson
Sandia National Laboratories, Albuquerque, NM, USA
Definition
Chaco was the first modern graph partitioning code
developed for parallel computing applications. Although
developed in the early s, the code continues to be
widely used today.
Discussion
Chaco is a software package that implements a variety of
graph partitioning techniques to support parallel applications. Graph partitioning is a widely used abstraction
for dividing work amongst the processors of a parallel
machine in such a way that each processor has about
the same amount of work to do, but the amount of
interprocessor communication is kept small. Chaco is a
serial code, intended to be used as a preprocessing step
to set up a parallel application. Chaco takes in a graph
that describes the data dependencies in a computation and outputs a description of how the computation
should be partitioned amongst the processors of a parallel
machine.
Chaco was developed by Bruce Hendrickson and
Rob Leland at Sandia National Laboratories. It provides implementations of a variety of graph partitioning algorithms. Chaco provided an advance
over the prior state of the art in a number of
areas [].
● Although multilevel partitioning was simultaneously co-invented by several groups [, , ], the
implementation in Chaco led to the embrace of
this approach as the best balance between speed
and quality for many practical problems. In parallel computing, this remains the dominant approach
to partitioning problems.
● All the algorithms in Chaco support graphs with
weights on both edges and vertices.
● Chaco provides a suite of partitioning algorithms
including spectral, geometric and multilevel
approaches.
● Chaco supports the coupling of global methods
(e.g., spectral or geometric partitioning) with local
refinement provided by an implementation of the
Fiduccia-Mattheyses (FM) algorithm.
● Chaco provides a robust yet efficient algorithm for
computing Laplacian eigenvectors which can be
used for spectral partitioning or for other algorithms.
● Chaco supports generalized spectral, combinatorial,
and multilevel algorithms that partition into more
than two parts at each level of recursion.
● Chaco has several approaches that consider the
topology of the target parallel computer during the
partitioning process.
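The two objectives of graph partitioning named in this entry, balanced work and small interprocessor communication, can be made concrete with a short Python sketch. It only evaluates a given partition; it is not Chaco's algorithm or interface. The function names edge_cut and part_loads, the edge-list representation, and the small example graph are all hypothetical.

```python
def edge_cut(edges, part):
    """Total weight of edges whose endpoints lie in different parts.

    This models the communication volume a partition would induce.
    """
    return sum(w for (u, v, w) in edges if part[u] != part[v])


def part_loads(vertex_weights, part, num_parts):
    """Sum of vertex weights in each part, modeling per-processor work."""
    loads = [0] * num_parts
    for v, w in vertex_weights.items():
        loads[part[v]] += w
    return loads


# A 2 x 3 grid graph with unit vertex and edge weights:
#   0 - 1 - 2
#   |   |   |
#   3 - 4 - 5
edges = [(0, 1, 1), (1, 2, 1), (3, 4, 1), (4, 5, 1),
         (0, 3, 1), (1, 4, 1), (2, 5, 1)]
vertex_weights = {v: 1 for v in range(6)}

# Two candidate two-way partitions (vertex -> part number).
by_rows = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
interleaved = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}

cut_rows = edge_cut(edges, by_rows)
cut_interleaved = edge_cut(edges, interleaved)
loads_rows = part_loads(vertex_weights, by_rows, 2)
loads_interleaved = part_loads(vertex_weights, interleaved, 2)
```

Both partitions are perfectly balanced, but splitting by rows cuts only the three vertical edges, whereas the interleaved assignment cuts every edge. A partitioner searches for assignments of the first kind.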
Related Entries
Load Balancing, Distributed Memory
Graph Partitioning
Hypergraph Partitioning
METIS and ParMETIS
Bibliographic Notes and Further
Reading
Chaco is available for download under an open source
license []. The Chaco Users Guide [] has much
more detailed information about the capabilities of the
code. Subsequent partitioning codes like METIS, Jostle,
PaToH, and Scotch have adopted and further refined
many of the ideas first prototyped in Chaco.
Acknowledgment
Sandia is a multiprogram laboratory operated by
Sandia Corporation, a Lockheed Martin Company,
for the US Department of Energy under contract DEAC-AL.
Bibliography
. Chaco: Software for partitioning graphs, http://www.sandia.gov/~bahendr/chaco.html
. Bui T, Jones C () A heuristic for reducing fill in sparse matrix
factorization. In: Proceedings of the th SIAM Conference on Parallel Processing for Scientific Computing, SIAM, Portland, OR,
pp –
. Cong J, Smith ML () A parallel bottom-up clustering algorithm with applications to circuit partitioning in VLSI design.
In: Proceedings th Annual ACM/IEEE International Design
Automation Conference, DAC ’, ACM, Dallas, TX, pp –
. Hendrickson B, Leland R () The Chaco user’s guide, version
.. Technical Report SAND–, Sandia National Laboratories, Albuquerque, NM, October 
. Hendrickson B, Leland R () A multilevel algorithm for partitioning graphs. In: Proceedings of the Supercomputing ’. ACM,
December . Previous version published as Sandia Technical
Report SAND –
Chapel (Cray Inc. HPCS Language)
Bradford L. Chamberlain
Cray Inc., Seattle, WA, USA
Synonyms
Cascade high productivity language
Definition
Chapel is a parallel programming language that
emerged from Cray Inc.’s participation in the High Productivity Computing Systems (HPCS) program sponsored by the Defense Advanced Research Projects
Agency (DARPA). The name Chapel derives from
the phrase “Cascade High Productivity Language,”
where “Cascade” is the project name for the Cray
HPCS effort. The HPCS program was launched with
the goal of raising user productivity on large-scale
parallel systems by a factor of ten. Chapel was
designed to help with this challenge by vastly improving the programmability of parallel architectures while
matching or beating the performance, portability,
and robustness of previous parallel programming
models.
Discussion
History
The Chapel language got its start in  during the first
phase of Cray Inc.’s participation in the DARPA HPCS
program. While exploring candidate system design concepts to improve user productivity, the technical leaders
of the Cray Cascade project decided that one component of their software solution would be to pursue
the development of an innovative parallel programming
language. Cray first reported their interest in pursuing a
new language to the HPCS mission partners in January
. The language was named Chapel later that year,
stemming loosely from the phrase “Cascade High Productivity Language.” The early phases of Chapel development were a joint effort between Cray Inc. and their
academic partners at Caltech/JPL (Jet Propulsion Laboratory).
The second phase of the HPCS program, from summer  through , saw a great deal of work in
the specification and early implementation of Chapel.
Chapel’s initial design was spearheaded by David
Callahan, Brad Chamberlain, and Hans Zima. This
group published an early description of their design in
a paper entitled “The Cascade High Productivity Language” []. Implementation of a Chapel compiler began
in the winter of , initially led by Brad Chamberlain
and John Plevyak (who also played an important role in
the language’s early design). In late , a rough draft
of the language specification was completed. Around
this same time, Steve Deitz joined the implementation
effort; he would go on to become one of Chapel’s most
influential long-term contributors. This group established the primary themes and concepts that set the
overall direction for the Chapel language. The motivation and vision of Chapel during this time were captured
in an article entitled “Parallel Programmability and the
Chapel Language” [], which provided a wish list of
sorts for productive parallel languages and evaluated
how well or poorly existing languages and Chapel met
the criteria.
By the summer of , the HPCS program was
transitioning into its third phase and the Chapel team’s
composition had changed dramatically, as several of
the original members moved on to other pursuits and
several new contributors joined the project. Of the original core team, only Brad Chamberlain and Steve Deitz
remained and they would go on to lead the design and
implementation of Chapel for the majority of phase III
of HPCS (ongoing at the time of this writing).
The transition to phase III also marked a point
when the implementation began to gain significant traction due to some important changes that were made
to the language and compiler design based on experience gained during phase II. In the spring of , the
first multi-threaded task-parallel programs began running on a single node. By the summer of , the first
task-parallel programs were running across multiple
nodes with distributed memory. And by the fall of ,
the first multi-node data-parallel programs were being
demonstrated.
During this time period, releases of Chapel also
began taking place, approximately every  months. The
very first release was made available in December 
on a request-only basis. The first release to the general public occurred in November . And in April
, the Chapel project more officially became an
open-source project by migrating the hosting of its code
repository to SourceForge.
Believing that a language cannot thrive and become
adopted if it is controlled by a single company, the
Chapel team has often stated its belief that as Chapel
grows in its capabilities and popularity, it should gradually be turned over to the broader community as
an open, consortium-driven language. Over time, an
increasing number of external collaborations have been
established with members of academia, computing centers, and industry, representing a step in this direction.
At the time of this writing, Chapel remains an active
and evolving project. The team’s current emphasis is on
expanding Chapel’s support for user-defined distributions, improving performance of key idioms, supporting users, and seeking out strategic collaborations.
Influences
Rather than extending an existing language, Chapel
was designed from first principles. It was decided that
starting from scratch was important to avoid inheriting features from previous languages that were not
well suited to large-scale parallel programming. Examples include pointer/array equivalence in C and common blocks in Fortran. Moreover, Chapel’s design team
believed that the challenging part of learning any language is learning its semantics, not its syntax. Consequently,
embedding new semantics in an established syntax can often cause more confusion than benefit. For these
reasons, Chapel was designed from a blank slate. That said,
Chapel’s design does contain many concepts and influences from previous languages, most notably ZPL, High
Performance Fortran (HPF), and the Tera/Cray MTA
extensions to C and Fortran, reflecting the backgrounds
of the original design team. Chapel utilizes a partitioned
global namespace for convenience and scalability, similar to traditional PGAS languages like UPC, Co-Array
Fortran, and Titanium; yet it departs from those languages in other ways, most notably by supporting more
dynamic models of execution and parallelism. Other
notable influences include CLU, ML, NESL, Java, C#,
C/C++, Fortran, Modula, and Ada.
Themes
Chapel’s design is typically described as being motivated
by five major themes: () support for general parallel programming, () support for global-view abstractions, () a multiresolution language design, () support
for programmer control over locality and affinity, and
() a narrowing of the gap between mainstream and
HPC programming models. This section provides an
overview of these themes.
General Parallel Programming
Chapel’s desire to support general parallel programming stems
from the observation that while programs and parallel architectures both typically contain many types
of parallelism at several levels, programmers typically
need to use a mix of distinct programming models
to express all levels/types of software parallelism and
to target all available varieties of hardware parallelism.
In contrast, Chapel aims to support multiple levels of
hardware and software parallelism using a unified set
of concepts for expressing parallelism and locality. To
this end, Chapel programs support parallelism at the
function, loop, statement, and expression levels. Chapel
language concepts support data parallelism, task parallelism, concurrent programming, and the ability to
compose these different styles within a single program
naturally. Chapel programs can be executed on desktop
multicore computers, commodity clusters, and large-scale systems developed by Cray Inc. or other vendors.
Global-View Abstractions
Chapel is described as supporting global-view abstractions for data and for control flow. In the tradition of
ZPL and High Performance Fortran, Chapel supports
the ability to declare and operate on large arrays in a
holistic manner even though they may be implemented
by storing their elements within the distributed memories of many distinct nodes. Chapel’s designers felt that
many of the most significant challenges to parallel programmability stem from the typical requirement that
programmers write codes in a cooperating executable
or Single Program, Multiple Data (SPMD) programming model. Such models require the user to manually
manage many tedious details including data ownership,
local-to-global index transformations, communication,
and synchronization. This overhead often clutters a program’s text, obscuring its intent and making the code
difficult to maintain and modify. In contrast, languages
that support a global view of data such as ZPL, HPF,
and Chapel shift this burden away from the typical
user and onto the compiler, runtime libraries, and data
distribution authors. The result is user code that is
cleaner and easier to understand, arguably without a
significant impact on performance.
Chapel departs from the single-threaded logical
execution models of ZPL and HPF by also providing a
global view of control flow. Chapel’s authors define this
as a programming model in which a program’s entry
point is executed by a single logical task, and then additional parallelism is introduced over the course of the
program through explicit language constructs such as
parallel loops and the creation of new tasks. Supporting a global view for control flow and data structures
makes parallel programming more like traditional programming by removing the requirement that users must
write programs that are complicated by details related to
running multiple copies of the program in concert as in
the SPMD model.
A Multiresolution Design
Another departure from ZPL and HPF is that those
languages provide high-level data-parallel abstractions without providing a means of abandoning those
abstractions in order to program closer to the machine
in a more explicit manner. Chapel’s design team felt it
was important for users to have such control in order
to program as close to the machine as their algorithm
requires, whether for reasons of performance or expressiveness. To this end, Chapel’s features are designed in
a layered manner so that when high-level abstractions
like its global-view arrays are inappropriate, the programmer can drop down to lower-level features and
control things more explicitly. As an example of this,
Chapel’s global-view arrays and data-parallel features
are implemented in terms of its lower-level task-parallel
and locality features for creating distinct tasks and mapping them to a machine’s processors. The result is that
users can write different parts of their program using
different levels of abstraction as appropriate for that
phase of the computation.
Locality and Affinity
The placement of data and tasks on large-scale machines
is crucial for performance and scalability due to the
latencies incurred by communicating with other processors over a network. For this reason, performance-minded programmers typically need to control where
data is stored on a large-scale system and where the
tasks accessing that data will execute relative to the data.
As multicore processors grow in the number and variety
of compute resources, such control over locality is likely
to become increasingly important for desktop programming as well. To this end, Chapel provides concepts
that permit programmers to reason about the compute resources that they are targeting and to indicate
where data and tasks should be located relative to those
compute resources.
Narrowing the Gap Between Mainstream
and HPC Languages
Chapel’s designers believe there to be a wide gap
between programming languages that are being used in
education and mainstream computing such as Java, C#,
Matlab, Perl, and Python and those being used by the
High Performance Computing community: Fortran, C,
and C++ in combination with MPI and OpenMP (and
in some circles, Co-Array Fortran and UPC – Unified Parallel C). It was believed that this gap should
be bridged in order to take advantage of productivity
improvements in modern language design while also
being able to better utilize the skills of the emerging
workforce. The challenge was to design a language
that would not alienate traditional HPC programmers
who were perhaps most comfortable in Fortran or C.
An example of such a design decision was to have
Chapel support object-oriented programming since it
is a staple of most modern languages and programmers, yet to make the use of objects optional so
that Fortran and C programmers would not need to
change their way of thinking about program and data
structure design. Another example was to make Chapel
an imperative, block-structured language, since the languages that have been most broadly adopted by both the
mainstream and HPC communities have been imperative rather than functional or declarative in nature.
Language Features
Data Parallelism
Chapel’s highest-level concepts are those relating to data
parallelism. The central concept for data-parallel programming in Chapel is the domain, which is a first-class
language concept for representing an index set, potentially distributed between multiple processors. Domains
are an extension of the region concept in ZPL. They
are used in Chapel to represent iteration spaces and
to declare arrays. A domain’s indices can be Cartesian
tuples of integers, representing dense or sparse index
sets on a regular grid. They may also be arbitrary values, providing the capability for storing arbitrary sets
or key/value mappings. Chapel also supports the notion
of an unstructured domain in which the index values are anonymous, providing support for irregular,
pointer-based data structures.
The following code declares three Chapel domains:
const D: domain(2) = [1..n, 1..n],
      DDiag: sparse subdomain(D)
           = [i in 1..n] (i,i),
      Employees: domain(string)
               = readNamesFromFile(infile);
In these declarations, D represents a regular 2-dimensional n × n index set; DDiag represents the sparse
subset of indices from D describing its main diagonal;
and Employees is a set of strings representing the names
of a group of employees.
Chapel arrays are defined in terms of domains and
represent a mapping from the domain’s indices to a set
of variables. The following declarations declare a pair of
arrays for each of the domains above:
var A, B: [D] real,
    X, Y: [DDiag] complex,
    Age, SSN: [Employees] int;
The first declaration defines two arrays A and B over
domain D, creating two n × n arrays of real floating
point values. The second creates sparse arrays X and Y
that store complex values along the main diagonal of D.
The third declaration creates two arrays of integers
representing the employees’ ages and social security
numbers.
Chapel users can express data parallelism using
forall loops over domains or arrays. As an example,
the following loops express parallel iterations over the
indices and elements of some of the previously declared
domains and arrays.
forall a in A do
  a += 1.0;
forall (i,j) in D do
  A(i,j) = B(i,j) + 1.0i * X(i,j);
forall person in Employees do
  if (Age(person) < 18) then
    SSN(person) = 0;
Scalar functions and operators can be promoted in
Chapel by calling them with array arguments. Such
promotions also result in data-parallel execution, equivalent to the forall loops above. For example, each of the
following whole-array statements will be computed in
parallel in an element-wise manner:
A = B + 1.0i * X;
B = sin(A);
Age += 1;
In addition to the use cases described above,
domains are used in Chapel to perform set intersections, to slice arrays and refer to subarrays, to
perform tensor or elementwise iterations, and to
dynamically resize arrays. Chapel also supports reduction and scan operators (including user-defined variations) for efficiently computing common collective
operations in parallel. In summary, domains provide
rich support for parallel operations on a varied set of
potentially distributed data aggregates.
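For readers more familiar with mainstream languages, the relationship between domains, arrays, promotion, and reductions can be approximated in a few lines of Python. This is only an analogy under stated assumptions: index sets are plain lists, arrays are dictionaries keyed by index, the helper promote is hypothetical, the employee names are invented, and everything runs serially, whereas the corresponding Chapel operations are parallel.

```python
from functools import reduce
from itertools import accumulate
import operator

n = 3

# Dense 2-dimensional "domain": the index set {1..n} x {1..n}.
D = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
# Sparse "subdomain" of D holding only the main diagonal.
DDiag = [(i, i) for i in range(1, n + 1)]
# Associative "domain" whose indices are strings (hypothetical names).
Employees = ["alice", "bob"]

# "Arrays" map every index of a domain to a variable.
B = {idx: float(10 * idx[0] + idx[1]) for idx in D}
Age = dict(zip(Employees, [30, 17]))


def promote(fn, *arrays):
    """Apply a scalar function element-wise to arrays over one domain."""
    domain = arrays[0].keys()
    return {idx: fn(*(a[idx] for a in arrays)) for idx in domain}


A = promote(lambda b: b + 1.0, B)     # whole-array form of a[i,j] = b[i,j] + 1.0
Age = promote(lambda a: a + 1, Age)   # whole-array form of Age += 1

# A sum reduction over all of A and a running-sum scan along the diagonal.
total = reduce(operator.add, (A[idx] for idx in D))
diag_scan = list(accumulate(A[idx] for idx in DDiag))
```

The point of the sketch is that the same index set drives both the storage (the dictionary keys) and the iteration (the loops inside promote, reduce, and accumulate), which is the role a domain plays in Chapel.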
Locales
Chapel’s primary concept for referring to machine
resources is called the locale. A locale in Chapel is an
abstract type, which represents a unit of the target architecture that can be used for reasoning about locality.
Locales support the ability to execute tasks and to store
variables, but the specific definition of a locale for
a given architecture is determined by the Chapel compiler.
In practice, an SMP node or multicore processor is
often defined to be the locale for a system composed of
commodity processors.
Chapel programmers specify the number of locales
that they wish to use on the executable’s command
line. The Chapel program requests the appropriate
resources from the target architecture and then spawns
the user’s code onto those resources for execution.
Within the Chapel program’s source text, the set of execution locales can be referred to symbolically using a
built-in array of locale values named Locales. Like any
other Chapel array, Locales can be sliced, indexed, or
reshaped to organize the locales in any manner that suits
the program. For example, the following statements
create customized views of the locale set. The first
divides the locales into two disjoint sets while the
second reshapes the locales into a 2-dimensional virtual
grid of nodes:
// given: const Locales: [0..#numLocales] locale;
const localeSetA = Locales[0..#localesInSetA],
      localeSetB = Locales[localesInSetA..];
const compGrid
      = Locales.reshape[1..2,numLocales/2];
Distributions
As mentioned previously, a domain’s indices may be distributed between the computational resources on which
a Chapel program is running. This is done by specifying
a domain map as part of the domain’s declaration, which
defines a mapping from the domain’s index set to a target locale set. Since domains are used to define iteration
spaces and arrays, this also implies a distribution of
the computations and data structures that are expressed
in terms of that domain. Chapel’s domain maps are
more than simply a mapping of indices to locales, however. They also define how each locale should store its
local domain indices and array elements, as well as
how operations such as iteration, random access, slicing, and communication are defined on the domains
and arrays. To this end, Chapel domain maps can be
thought of as recipes for implementing parallel, distributed data aggregates in Chapel. The domain map’s
functional interface is targeted by the Chapel compiler as it rewrites a user’s global-view array operations
down to the per-node computations that implement the
overall algorithm.
A well-formed domain map does not affect the
semantics of a Chapel program, only its implementation
and performance. In this way, Chapel programmers can
tune their program’s implementation simply by changing the domain declarations, leaving the bulk of the
computation and looping untouched. For example, the
previous declaration of domain D could be changed as
follows to specify that it should be distributed using an
instance of the Block distribution:
const D: domain(2) dmapped MyBlock
       = [1..n, 1..n];
However, none of the loops or operations written previously on D, A, or B would have to change.
Advanced users can write their own domain maps in
Chapel and thereby create their own implementations
of parallel, distributed data structures. While Chapel
provides a standard library of domain maps, a major
research goal of the language is to implement these
standard distributions using the same mechanism that
an end-user would rather than by embedding semantic knowledge of the distributions into the compiler and
runtime as ZPL and HPF did.
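To illustrate the index-to-locale mapping at the heart of a block-style domain map, the following Python sketch assigns the indices of an n × n index set to a small grid of locales and checks that a computation gives the same answer however the data are placed. It is a simplified model under stated assumptions and not Chapel's Block distribution: the functions block_bounds, owner, and owner_2d, the near-equal block formula, and the 2 × 2 locale grid are all hypothetical.

```python
def block_bounds(lo, hi, num_blocks, b):
    """Inclusive range of block b when lo..hi is cut into contiguous blocks."""
    n = hi - lo + 1
    start = lo + (b * n) // num_blocks
    end = lo + ((b + 1) * n) // num_blocks - 1
    return start, end


def owner(i, lo, hi, num_blocks):
    """Number of the block that contains index i (inverse of block_bounds)."""
    n = hi - lo + 1
    return ((i - lo + 1) * num_blocks - 1) // n


def owner_2d(i, j, n, grid_rows, grid_cols):
    """Locale grid coordinates that own index (i, j) of a 1..n x 1..n set."""
    return owner(i, 1, n, grid_rows), owner(j, 1, n, grid_cols)


n = 8
grid_rows, grid_cols = 2, 2

# A global-view "array" over the index set, keyed by (i, j).
A = {(i, j): float(i + j) for i in range(1, n + 1) for j in range(1, n + 1)}

# Each locale accumulates a partial sum over the indices it owns.
partial = {}
for (i, j), value in A.items():
    loc = owner_2d(i, j, n, grid_rows, grid_cols)
    partial[loc] = partial.get(loc, 0.0) + value

global_sum = sum(A.values())
distributed_sum = sum(partial.values())
```

The two sums agree, mirroring the statement above that a well-formed domain map changes where work happens but not what the program computes. A real domain map would additionally define local storage, iteration, slicing, and communication.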
Task Parallelism
As mentioned previously, Chapel’s data-parallel features are implemented in terms of its lower-level task-parallel features. The most basic task-parallel construct
in Chapel is the begin statement, which creates a new
task while allowing the original task to continue executing. For example, the following code starts a new task
to compute an FFT while the original task goes on to
compute a Jacobi iteration:
begin FFT(A);
Jacobi(B);
Inter-task coordination in Chapel is expressed in a
data-centric way using special variables called synchronization variables. In addition to storing a traditional
data value, these variables also maintain a logical full/
empty state. Reads to synchronization variables block
until the variable is full and leave the variable empty.
Conversely, writes block until the variable is empty and
leave it full. Variations on these default semantics are
provided via method calls on the variable. This leads to
the very natural expression of inter-task coordination.
Consider, for example, the elegance of the following
bounded buffer producer/consumer pattern:
var buffer: [0..#buffsize] sync int;
begin { // producer
  for i in 0..n do
    buffer[i%buffsize] = ...;
}
{ // consumer
  for j in 0..n do
    ...buffer[j%buffsize]...;
}
Because each element in the bounded buffer is declared
as a synchronized integer variable, the consumer will
not read an element from the buffer until the producer
has written to it, marking its state as full. Similarly, the
producer will not overwrite values in the buffer until the
consumer’s reads have reset the state to empty.
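The full/empty semantics described above can be emulated in Python to show what the bounded buffer is doing underneath. This is a sketch under stated assumptions rather than a description of how Chapel implements synchronization variables: the SyncVar class and its read and write methods are hypothetical, a condition variable stands in for the full/empty state, the produced values (squares) are arbitrary, and the loops run n iterations instead of the inclusive 0..n range of the Chapel code.

```python
import threading


class SyncVar:
    """A value paired with a logical full/empty state."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write(self, value):
        """Block until the variable is empty, then store and mark it full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read(self):
        """Block until the variable is full, then take the value and mark it empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


buffsize, n = 4, 20
buffer = [SyncVar() for _ in range(buffsize)]
consumed = []


def producer():
    for i in range(n):
        buffer[i % buffsize].write(i * i)


def consumer():
    for j in range(n):
        consumed.append(buffer[j % buffsize].read())


# Starting a thread plays the role of Chapel's begin statement;
# the original task carries on as the consumer.
producer_thread = threading.Thread(target=producer)
producer_thread.start()
consumer()
producer_thread.join()
```

With a single producer and a single consumer, each slot strictly alternates between full and empty, so the consumer sees the values in the order they were produced even though the buffer holds only four elements.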

In addition to these basic task creation and synchronization primitives, Chapel supports additional
constructs to create and synchronize tasks in structured
ways that support common task-parallel patterns.
Locality Control
While domain maps provide a high-level way of mapping iteration spaces and arrays to the target architecture, Chapel also provides a low-level mechanism for
controlling locality called the on-clause. Any Chapel
statement may be prefixed by an on-clause that indicates the locale on which that statement should execute.
On-clauses can take an expression of locale type as their
argument, which specifies that the statement should be
executed on the specified locale. They may also take
any other variable expression, in which case the statement will execute on the locale that stores that variable.
For example, the following modification to an earlier
example will execute the FFT on Locale #1 while executing the Jacobi iteration on the locale that owns element
i,j of B:
on Locales[1] do begin FFT(A);
on B(i,j) do Jacobi(B);
Base Language
Chapel’s base language was designed to support productive parallel programming, the ability to achieve
high performance, and the features deemed necessary
for supporting user-defined distributions effectively.
In addition to a fairly standard set of types, operators,
expressions, and statements, the base language supports
the following features:
● A rich compile-time language: Chapel supports the
ability to define functions that are evaluated at compile time, including functions that return types.
Users may also indicate conditionals that should be
folded at compile time as well as loops that should
be statically unrolled.
● Static type inference: Chapel supports a static type inference scheme in which specifications can
be omitted in most declaration settings, causing
the compiler to infer the types from context. For
example, a variable declaration may omit the variable’s type as long as it is initialized, in which case the
compiler infers its type from the initializing expression. This supports exploratory programming as in
scripting languages without the runtime overhead of
dynamic typing. It also supports generic programming and code reuse.
● Iterators: Chapel’s iterators are functions that yield
multiple values over their lifetime (as in CLU or
Ruby) rather than returning a single time. These
iterators can be used to control serial and parallel
loops.
● Configuration variables: Chapel’s configuration
variables are symbols whose default values can be
overridden on the command line of the compiler
or executable using argument parsing that is implemented automatically by the compiler.
● Tuples: These support the ability to group values
together in a lightweight manner, for example, to
return multiple values from a function or to represent multidimensional array indices using a single
variable.
Future Directions
At the time of this writing, the Chapel language is
still evolving based on user feedback, code studies, and
the implementation effort. Some notable areas where
additional work is expected in the future include:
● Support for heterogeneity: Heterogeneous systems
are becoming increasingly common, especially
those with heterogeneous processor types such as
traditional CPUs paired with accelerators such as
graphics processing units (GPUs). To support such
systems, it is anticipated that Chapel’s locale concept will need to be refined to expose architectural
substructures in abstract and/or concrete terms.
● Transactional memory: Chapel has plans to support
an atomic block for expressing transactional computations against memory, yet software transactional
memory (STM) is an active research area in general, and becomes even trickier in the distributed
memory context of Chapel programs.
● Exceptions: One of Chapel’s most notable omissions
is support for exception- and/or error-handling
mechanisms to deal with software failures in a
robust way, and perhaps also to be resilient to hardware failures. This is an area where the original
Chapel team felt unqualified to make a reasonable
proposal and intended to fill in that lack over time.
In addition to the above areas, future design and
implementation work is expected to take place in the
areas of task teams, dynamic load balancing, garbage
collection, parallel I/O, language interoperability, and
tool support.
Related Entries
Fortress (Sun HPCS Language)
HPF (High Performance Fortran)
NESL
PGAS (Partitioned Global Address Space) Languages
Tera MTA
ZPL
Bibliographic Notes and Further
Reading
The two main academic papers providing an overview
of Chapel’s approach are also two of the earliest: “The
Cascade High Productivity Language” [] and “Parallel
Programmability and the Chapel Language” []. While
the language has continued to evolve since these papers
were published, the overall motivations and concepts
are still very accurate. The first paper is interesting in
that it provides an early look at the original team’s
design while the latter remains a good overview of the
language’s motivations and concepts. For the most accurate description of the language at any given time, the
reader is referred to the Chapel Language Specification.
This is an evolving document that is updated as the language and its implementation improve. At the time of
this writing, the current version is . [].
Other important early works describing specific language concepts include “An Approach to Data Distributions in Chapel” by Roxana Diaconescu and Hans
Zima []. While the approach to user-defined distributions eventually taken by the Chapel team differs
markedly from the concepts described in this paper [],
it remains an important look into the early design being
pursued by the Caltech/JPL team. Another concept-related paper is “Global-view Abstractions for User-Defined Reductions and Scans” by Deitz, Callahan,
Chamberlain, and Snyder, which explored concepts for
user-defined reductions and scans in Chapel [].
In the trade press, a good overview of Chapel in
Q&A form entitled “Closing the Gap with the Chapel
Language” was published in HPCWire in  [].
Another Q&A-based document is a position paper
entitled “Multiresolution Languages for Portable yet
Efficient Parallel Programming,” which espouses the
use of multiresolution language design in efforts like
Chapel [].
For programmers interested in learning how to
use Chapel, there is perhaps no better resource
than the release itself, which is available as an
open-source download from SourceForge []. The
release is made available under the Berkeley Software
Distribution (BSD) license and contains a portable
implementation of the Chapel compiler along with documentation and example codes. Another resource is
a tutorial document that walks through some of the
HPC Challenge benchmarks, explaining how they can
be coded in Chapel. While the language has evolved
since the tutorial was last updated, it remains reasonably accurate and provides a gentle introduction to the
language []. The Chapel team also presents tutorials to
the community fairly often, and slide decks from these
tutorials are archived at the Chapel Web site [].
Many of the resources above, as well as other useful
resources such as presentations and collaboration ideas,
can be found at the Chapel project Web site hosted at
Cray [].
To read about Chapel’s chief influences, the best
resources are probably dissertations from the ZPL
team [, ], the High Performance Fortran Handbook [], and the Cray XMT Programming Environment User’s Guide [] (the Cray XMT is the current
incarnation of the Tera/Cray MTA).
Acknowledgments
This material is based upon work supported by the
Defense Advanced Research Projects Agency under its
Agreement No. HR---. Any opinions, findings and conclusions or recommendations expressed
in this material are those of the author(s) and do not
necessarily reflect the views of the Defense Advanced
Research Projects Agency.
Bibliography
. Callahan D, Chamberlain B, Zima H (April ) The Cascade
high productivity language. th International workshop on highlevel parallel programming models and supportive environments,
pp –, Santa Fe, NM
. Chamberlain BL (November ) The design and implementation of a region-based parallel language. PhD thesis, University of
Washington
. Chamberlain BL (October ) Multiresolution languages for
portable yet efficient parallel programming. http://chapel.cray.
com/papers/DARPA-RFI-Chapel-web.pdf. Accessed  May 

. Chamberlain BL, Callahan D, Zima HP (August ) Parallel
programmability and the Chapel language. Int J High Perform
Comput Appl ():–
. Chamberlain BL, Deitz SJ, Hribar MB, Wong WA (November
) Chapel tutorial using global HPCC benchmarks: STREAM
Triad, Random Access, and FFT (revision .). http://chapel.cray.
com/hpcc/hpccTutorial-..pdf. Accessed  May 
. Chamberlain BL, Deitz SJ, Iten D, Choi S-E () User-defined
distributions and layouts in Chapel: Philosophy and framework.
In: Hot-PAR ‘: Proceedings of the nd USENIX workshop on
hot topics, June 
. Chapel development site at SourceForge. http://sourceforge.net/
projects/chapel. Accessed  May 
. Chapel project website. http://chapel.cray.com. Accessed  May

. Cray Inc., Seattle, WA. Chapel Language Specification (version .), October . http://chapel.cray.com/papers.html.
Accessed  May 
. Cray Inc. Cray XMT Programming Environment User’s Guide,
March  (see http://docs.cray.com). Accessed  May 
. Deitz SJ () High-Level Programming Language Abstractions
for Advanced and Dynamic Parallel Computations. PhD thesis,
University of Washington
. Deitz SJ, Callahan D, Chamberlain BL, Snyder L (March )
Global-view abstractions for user-defined reductions and scans.
In: PPoPP ’: Proceedings of the eleventh ACM SIGPLAN
symposium on principles and practice of parallel programming,
pp –. ACM Press, New York
. Diaconescu R, Zima HP (August ) An approach to data
distributions in Chapel. Intl J High Perform Comput Appl
():–
. Feldman M, Chamberlain BL () Closing the parallelism
gap with the Chapel language. HPCWire, November .
http://www.hpcwire.com/hpcwire/--/closing_the_paral
lelism_gap_with_the_chapel_language.html. Accessed  May

. Koelbel CH, Loveman DB, Schreiber RS, Steele Jr GL, Zosel ME
(September ) the High Performance Fortran handbook. Scientific and engineering computation. MIT Press, Cambridge, MA
Charm++
Laxmikant V. Kalé
University of Illinois at Urbana-Champaign, Urbana,
IL, USA
Definition
Charm++ is a C++-based parallel programming system
that implements a message-driven migratable objects
programming model, supported by an adaptive runtime
system.
Discussion
Charm++ [] is a parallel programming system developed at the University of Illinois at Urbana-Champaign.
It is based on a message-driven migratable objects programming model, and consists of a C++-based parallel
notation, an adaptive runtime system (RTS) that automates resource management, a collection of debugging
and performance analysis tools, and an associated family of higher level languages. It has been used to program
several highly scalable parallel applications.
Motivation and Design Philosophy
One of the main motivations behind Charm++ is the
desire to create an optimal division of labor between
the programmer and the system: that is, to design a
programming system so that the programmers do what
they can do best, while leaving to the “system” what it
can automate best. It was observed that deciding what to
do in parallel is relatively easy for the application developer to specify; conversely, it has been very difficult for
a compiler (for example) to automatically parallelize a
given sequential program. On the other hand, automating resource management – which subcomputation to
carry out on what processor and which data to store on a
particular processor – is something that the system may
be able to do better than a human programmer, especially as the complexity of the resource management
task increases. Another motivation is to emphasize the
importance of data locality in the language, so that the
programmer is made aware of the cost of non-local data
references.
Programming Model in Abstract
In Charm++, computation is specified in terms of
collections of objects that interact via asynchronous
method invocations. Each object is called a chare.
Chares are assigned to processors by an adaptive runtime system, with an optional override by the programmer. A chare is a special kind of C++ object. Its behavior
is specified by a C++ class that is “special” only in the
sense that it must have at least one method designated
as an “entry” method. Designating a method as an entry
method signifies that it can be invoked from a remote
processor. The signatures of the entry methods (i.e., the
type and structure of their parameters) are specified in a
separate interface file, to allow the system to generate
code for packing (i.e., serializing) and unpacking the
parameters into messages. Other than the existence of
the interface files, a Charm++ program is written in a
manner very similar to standard C++ programs, and
thus will feel very familiar to C++ programmers.
The chares communicate via asynchronous method
invocations. Such a method invocation does not return
any value to the caller, and the caller continues with its
own execution. Of course, the called chare may choose
to send a value back by invoking an entry method upon
the caller object. Each chare has a globally valid ID (its
proxy), which can be passed around via method invocations. Note that the programmer refers to the
target chare only by its global ID, and not by the processor on which it resides. Thus, in the baseline Charm++
model, the processor is not a part of the ontology of the
programmer.
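The location independence described above can be illustrated with a small sketch in plain C++. It does not use the Charm++ runtime or its API: LocationTable, Proxy, and send are hypothetical names, and "sending" is reduced to looking up the processor that currently hosts the target. The point is only that the caller holds a global ID and never names a processor.

```cpp
#include <map>

// Hypothetical runtime-side table: which processor currently hosts each chare.
class LocationTable {
public:
    // Record (or update, after a migration) where a chare lives.
    void place(int chareId, int proc) { where_[chareId] = proc; }
    // Throws std::out_of_range if the chare is unknown.
    int lookup(int chareId) const { return where_.at(chareId); }

private:
    std::map<int, int> where_;
};

// What the programmer holds: a global ID only, never a processor number.
struct Proxy {
    int chareId;
};

// "Sending" through a proxy: the runtime, not the caller, resolves the
// destination processor. Returns the processor the message would go to.
inline int send(const LocationTable& table, Proxy target) {
    return table.lookup(target.chareId);
}
```

If the runtime moves the chare, the same Proxy value keeps working and the caller's code does not change.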
Chares can also create other chares. The creation of
new chares is also asynchronous in that the caller does
not wait until the new object is created. Programmers
typically do not specify the processor on which the new
chare is to be created; the system makes this decision at
runtime. The number of chares may vary over time, and
is typically much larger than the number of processors.
Message-Driven Scheduler
At any given time, there may be several pending method
invocations for the chares on a processor. Therefore, the
Charm++ runtime system employs a user-level scheduler (Fig. ) on each processor. The scheduler is user-level in the sense that the operating system is not aware
of it. Normal Charm++ methods are non-preemptive:
once a method begins execution, it returns control to
the scheduler only after it has completed execution.
Charm++. Fig.  Message-driven scheduler (Processor 1 and Processor 2, each with its chares, a scheduler, and a message queue)
The scheduler works with a queue of pending entry
method invocations. Note that this queue may include
asynchronous method invocations for chares located on
this processor, as well as “seeds” for the creation of new
chares. These seeds can be thought of as invocations of
the constructor entry method. The scheduler repeatedly
selects a message (i.e., a pending method invocation)
from the queue, identifies the object targeted, creating an object if necessary, unpacks the parameters from
the message if necessary, and then invokes the specified
method with the parameters. Only when the method
returns does the scheduler select the next message and repeat the
process.
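The scheduling loop described above can be sketched in plain C++, without the Charm++ runtime. This is a minimal single-processor illustration under stated assumptions, not the actual Charm++ scheduler: Message, Counter, and Scheduler are hypothetical names, parameter unpacking is replaced by a plain integer argument, and a message whose target does not yet exist plays the role of a seed.

```cpp
#include <deque>
#include <map>
#include <memory>

// Hypothetical chare-like object with a single entry method.
struct Counter {
    int total = 0;
    void add(int value) { total += value; }
};

// A pending method invocation: target object ID plus one argument.
struct Message {
    int targetId;
    int argument;
};

// Non-preemptive, message-driven scheduler for one simulated processor.
class Scheduler {
public:
    void enqueue(const Message& m) { queue_.push_back(m); }

    // Repeatedly pick a message, create the target if it does not exist yet
    // (seed), invoke the method, and only then move on to the next message.
    // Returns the number of messages delivered.
    int run() {
        int delivered = 0;
        while (!queue_.empty()) {
            Message m = queue_.front();
            queue_.pop_front();
            std::unique_ptr<Counter>& target = objects_[m.targetId];
            if (!target) target.reset(new Counter());  // construct on first use
            target->add(m.argument);                   // runs to completion
            ++delivered;
        }
        return delivered;
    }

    // Look up an object by ID; returns nullptr if it was never created.
    const Counter* find(int id) const {
        auto it = objects_.find(id);
        return it == objects_.end() ? nullptr : it->second.get();
    }

private:
    std::deque<Message> queue_;
    std::map<int, std::unique_ptr<Counter>> objects_;
};
```

Because each method runs to completion before the next message is selected, an object never needs locks to protect its own state in this model.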
Chare-arrays and Iterative Computations
The model described so far, with its support for dynamic
creation of work, is well-suited for expressing divide-and-conquer as well as divide-and-divide computations. The latter occur in state-space search. Charm++
(and its C-based precursor, Charm, and Chare Kernel
[]) were used in the late s for implementing parallel Prolog [] as well as several combinatorial search
applications [].
In principle, singleton chares could also be used to
create the arbitrary networks of objects that are required
to decompose data in science and engineering applications. For example, one can organize chares in a two-dimensional mesh network, and through some additional message passing, ensure that each chare knows
the ID of its four neighboring chares. However, this
method of creating a network of chares is quite cumbersome, as it requires extensive bookkeeping on the part of
the programmer. Instead, Charm++ supports indexed
collections of chares, called chare-arrays. A Charm++
computation may include multiple chare-arrays. Each
chare-array is a collection of chares of the same type.
Each chare is identified by an index that is unique within
its collection. Thus, an individual chare belonging to
a chare-array is completely identified by the ID of the
chare-array and its own index within it. Common index
structures include dense as well as sparse multidimensional arrays, but arbitrary indices such as strings or bit
vectors are also possible. Elements in a chare-array may
be created all at once, or can be inserted one at a time.
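With an indexed collection, the bookkeeping mentioned above largely disappears, because an element can compute its neighbors' indices from its own. The following plain C++ sketch shows this for a dense two-dimensional array. Index2D and neighbors are hypothetical names, not Charm++ API, and the wraparound (periodic) boundary is an assumption made for brevity.

```cpp
#include <array>

// Hypothetical 2D index of an element within a rows x cols chare-array.
struct Index2D {
    int row;
    int col;
};

inline bool operator==(const Index2D& a, const Index2D& b) {
    return a.row == b.row && a.col == b.col;
}

// Neighbors in the order north, south, west, east, with wraparound at the
// edges. No message exchange is needed to discover them: the element's own
// index and the array dimensions suffice.
inline std::array<Index2D, 4> neighbors(Index2D me, int rows, int cols) {
    return {{
        {(me.row + rows - 1) % rows, me.col},  // north
        {(me.row + 1) % rows, me.col},         // south
        {me.row, (me.col + cols - 1) % cols},  // west
        {me.row, (me.col + 1) % cols}          // east
    }};
}
```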
Method invocations can be broadcast to an entire
chare-array, or a section of it. Reductions over chare-arrays are also supported, where each chare in a chare-array contributes a value, and all submitted values are
combined via a commutative-associative operation. In

many other programming models, a reduction is a collective operation which blocks all callers. In contrast,
reductions in Charm++ are non-blocking, that is, asynchronous. A contribute call simply deposits the
value created by the calling chare into the system and
returns to its caller. At some later point after all the
values have been combined, the system delivers the result
to a user-specified callback. The callback, for example,
could be a broadcast to an entry method of the same
chare-array.
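The non-blocking behavior of contribute can be sketched in plain C++ as follows. This is an illustration only: SumReduction is a hypothetical class, the combining operation is fixed to integer addition, and the callback runs directly inside the last contribute call, whereas a real runtime would deliver it later as a message.

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for the runtime's reduction machinery (sum of ints).
class SumReduction {
public:
    SumReduction(int expected, std::function<void(int)> callback)
        : expected_(expected), callback_(std::move(callback)) {}

    // Deposit one value and return immediately; the caller never waits for
    // the other contributors. The callback fires once, when the last
    // expected contribution has arrived.
    void contribute(int value) {
        sum_ += value;
        if (++received_ == expected_) callback_(sum_);
    }

    int received() const { return received_; }

private:
    int expected_;
    std::function<void(int)> callback_;
    int sum_ = 0;
    int received_ = 0;
};
```

A caller that contributes early simply continues with other work; nothing in contribute depends on the remaining contributors having reached the same point.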
Charm++ does not allow generic global variables,
but it does allow “specifically shared variables.” The simplest of these are read-only variables, which are initialized in the main chare’s constructor, and are treated as
constants for the remainder of the program. The runtime system (RTS) ensures that a copy of each read-only
variable is available on all physical processors.
Benefits of Message-driven Execution
The message-driven execution model confers several
performance and/or productivity benefits.
Automatic and Adaptive Overlap of
Computation and Communication
Since objects are scheduled based on the availability
of messages, no single object can occupy the processor while waiting for some remote data. Instead, objects
that have asynchronous method invocations (messages)
waiting for them in the scheduler’s queue are allowed
to execute. This leads to a natural overlap of communication and computation, without any extra work from
the programmer. For example, a chare may send a message to a remote chare and wait for another message
from it before continuing. The ensuing communication
time, which would otherwise be an idle period, is naturally and automatically filled in (i.e., overlapped) by
the scheduler with useful computation, that is, processing of another message from the scheduler’s queue for
another chare.
Concurrent Composition
The ability to compose in parallel two individually parallel modules is referred to as concurrent composition.
Consider two modules P and Q that are both ready
to execute and have no direct dependencies among
them. With other programming models (e.g., MPI) that
associate work directly with processors, one typically
has two options. One may divide the set of processors
so that a subset is executing one module (P) while the
remaining processors execute the other (Q). Alternatively, one may sequentialize the modules, executing P
first, followed by Q, on all processors. Neither alternative is efficient. Allowing the two modules to interleave
the execution on all processors is often beneficial, but
is hard to express even with wild-card receives, and it
breaks abstraction boundaries between the modules in
any case. With message driven execution, such interleaving happens naturally, allowing idle time in one
module to be overlapped with computation in the other.
Coupled with the ability of the adaptive runtime system to migrate communicating objects closer to each
other, this adds up to strong support for concurrent
composition, and thereby for increased modularity.
Prefetching Data and Code
Since the message driven scheduler can examine its
queue, it knows what the next several objects scheduled
to execute are and what methods they will be executing. This information can be used to asynchronously
prefetch data for those objects, while the system executes the current object. This idea can be used by
the Charm++ runtime system for increasing efficiency
in various contexts, including on accelerators such as
the Cell processor, for out-of-core execution and for
prefetching data from DRAM to cache.
Capabilities of the Adaptive Runtime
System Based on Migratability of Chares
Other capabilities of the runtime system arise from the
ability to migrate chares across processors and the ability to place newly created chares on processors of its
choice.
Supporting Task Parallelism with Seed
Balancers
When a program calls for the creation of a singleton
chare, the RTS simply creates a seed for it. This seed
includes the constructor arguments and class information needed to create a new chare. Typically, these seeds
are initially stored on the same processor where they are
created, but they may be passed from processor to processor under the control of a runtime component called
the seed balancer. Different seed balancer strategies are
provided by the RTS. For example, in one strategy, each
processor monitors its neighbors’ queues in addition
to its own, and balances seeds between them as it sees
fit. In another strategy, a processor that becomes idle
requests work from a random donor – a work stealing strategy [, ]. Charm++ also includes strategies
that balance priorities and workloads simultaneously, in
order to give precedence to high-priority work over the
entire system of processors.
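The random-donor strategy mentioned above can be sketched in plain C++. This is a single-process simulation under stated assumptions, not one of the strategies shipped with the RTS: SeedBalancer and its methods are hypothetical names, a seed is reduced to an integer ID, and an idle processor makes exactly one request per call.

```cpp
#include <cstddef>
#include <deque>
#include <random>
#include <vector>

// Hypothetical seed: just the ID of the chare that would be constructed.
using Seed = int;

// One queue of seeds per simulated processor.
class SeedBalancer {
public:
    SeedBalancer(std::size_t numProcs, unsigned rngSeed)
        : queues_(numProcs), rng_(rngSeed) {}

    // A newly created seed is stored on the processor where it was created.
    void deposit(std::size_t proc, Seed s) { queues_[proc].push_back(s); }

    // Called when `idle` has no local work: ask one random other processor
    // and move a seed from the back of its queue, if it has any.
    // Returns true if a seed was obtained.
    bool stealFor(std::size_t idle) {
        if (queues_.size() < 2) return false;
        std::uniform_int_distribution<std::size_t> pick(0, queues_.size() - 2);
        std::size_t donor = pick(rng_);
        if (donor >= idle) ++donor;  // skip the idle processor itself
        if (queues_[donor].empty()) return false;
        queues_[idle].push_back(queues_[donor].back());
        queues_[donor].pop_back();
        return true;
    }

    std::size_t load(std::size_t proc) const { return queues_[proc].size(); }

private:
    std::vector<std::deque<Seed>> queues_;
    std::mt19937 rng_;
};
```

Moving a seed is cheap because the chare has not been constructed yet; only its constructor arguments travel.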
Migration-based Load Balancers
Elements of chare-arrays can be migrated across processors, either explicitly by the programmer or by the
runtime system. The Charm++ RTS leverages this capability to provide a suite of dynamic load balancing
strategies. One class of such strategies is based on the
principle of persistence, which is the empirical observation that in most science and engineering applications
expressed in terms of their natural objects, computational loads and communication patterns tend to persist
over time, even for dynamically evolving applications.
Thus, the recent past is a reasonable predictor of the near
future. Since the runtime system mediates communication and schedules computations, it can automatically
instrument its execution so as to measure computational loads and communication patterns accurately.
Load balancing strategies can use these measurements,
or alternatively, any other mechanisms for predicting
such patterns. Multiple load balancing strategies are
available to choose from. The choice may depend on
the machine context and applications, although one can
always use the default strategy provided. Programmers
can write their own strategy, either to specialize it to the
specific needs of the application or in the hope of doing
better than the provided strategies.
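To make the idea of a measurement-based strategy concrete, the sketch below implements one simple greedy heuristic in plain C++: objects are sorted by the load measured for them in the recent past, and each is assigned in turn to the processor that is currently least loaded. It is an assumption-laden illustration, not the interface that Charm++ strategies are written against: greedyAssign and maxLoad are hypothetical names, communication patterns are ignored, and numProcs is assumed to be at least 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Given per-object loads measured over recent steps, return a processor
// number for each object: heaviest object first, each placed on the
// processor that is least loaded so far.
inline std::vector<int> greedyAssign(const std::vector<double>& measuredLoad,
                                     int numProcs) {
    std::vector<std::size_t> order(measuredLoad.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return measuredLoad[a] > measuredLoad[b];
                     });

    std::vector<double> procLoad(numProcs, 0.0);
    std::vector<int> assignment(measuredLoad.size(), -1);
    for (std::size_t obj : order) {
        auto least = std::min_element(procLoad.begin(), procLoad.end());
        assignment[obj] = static_cast<int>(least - procLoad.begin());
        *least += measuredLoad[obj];
    }
    return assignment;
}

// Largest total load on any processor under a given assignment.
inline double maxLoad(const std::vector<double>& measuredLoad,
                      const std::vector<int>& assignment, int numProcs) {
    std::vector<double> procLoad(numProcs, 0.0);
    for (std::size_t i = 0; i < measuredLoad.size(); ++i)
        procLoad[assignment[i]] += measuredLoad[i];
    return *std::max_element(procLoad.begin(), procLoad.end());
}
```

The heuristic relies on the principle of persistence: the measured loads are only useful if the next few steps resemble the ones that were measured.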
Dynamically Altering the Sets
of Processors Used
A Charm++ program can be asked to change the set of
processors it is using at runtime, without requiring any
effort by the programmer. This can be useful for increasing the utilization of a cluster running multiple jobs that
arrive at unpredictable times. The RTS accomplishes
this by migrating objects and adjusting its runtime data
structures, such as spanning trees used in its collective
operations. Thus, a , ×, ×,  cube of data partitioned into  ×  ×  array of chares, each holding
×× data subcube, can be shrunk from  cores
to (say)  cores without significantly losing efficiency.
Of course, some cores will house  objects, instead of
the  objects they did earlier.
Fault Tolerance
Charm++ provides multiple levels of support for fault
tolerance, including alternative competing strategies.
At a basic level, it supports automated applicationlevel checkpointing by leveraging its ability to migrate
objects. With this, it is possible to create a checkpoint
of the program without requiring extra user code. More
interestingly, it is also possible to use a checkpoint created on P processors to restart the computation on a
different number of processors than P.
On appropriate machines and with job schedulers
that permit it, Charm++ can also automatically detect
and recover from faults. This requires that the job scheduler not kill a job if one of its nodes were to fail. At
the time of this writing, these schemes are available
on workstation clusters. The most basic strategy uses
the checkpoint created on disk, as described above, to
effect recovery. A second strategy avoids using disks
for checkpointing, instead creating two checkpoints of
each chare in the memory of two processors. It is suitable for those applications whose memory footprint at
the point of checkpointing is relatively small compared
with the available memory. Fortunately, many applications such as molecular dynamics and computational
astronomy fall into this category. When it can be used,
it is very fast, often accomplishing a checkpoint in less
than a second and recovery in a few seconds. However, both strategies described above send all processors
back to their checkpoints even when just one out of a
million processors has failed. This wastes all the computation performed by processors that did not fail. As the
number of processors increases and, consequently, the
mean time between failures (MTBF) decreases, this will become an untenable recovery strategy. A third experimental strategy in Charm++
sends only the failed processor(s) to their checkpoints
by using a message-logging scheme. It also leverages
the over-decomposition and migratability of Charm++
objects to parallelize the restart process. That is, the

objects from failed processors are reincarnated on multiple other processors, where they re-execute, in parallel, from their checkpoints using the logged messages.
Charm++ also provides a fourth proactive strategy to
handle situations where a future fault can be predicted,
say based on heat sensors, or estimates of increasing
(corrected) cache errors. The runtime simply migrates
objects away from such a processor and readjusts its
runtime data structures.
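The second strategy above, which keeps two in-memory copies of each checkpoint, can be sketched in plain C++ as follows. This is a simplified single-process model, not the Charm++ implementation: Processor, checkpoint, and recover are hypothetical names, a chare's state is reduced to one integer, the buddy of a processor is assumed to be the next processor in a ring, and at least two processors are assumed.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical chare state: one integer per chare ID.
using ChareStates = std::map<int, int>;

struct Processor {
    ChareStates live;       // chares currently running here
    ChareStates ownCopy;    // checkpoint of this processor's chares
    ChareStates buddyCopy;  // checkpoint held for the previous processor
};

// Each processor keeps its own checkpoint and also stores it on the next
// processor (its buddy), so two in-memory copies of every chare exist.
inline void checkpoint(std::vector<Processor>& procs) {
    const std::size_t n = procs.size();
    for (std::size_t p = 0; p < n; ++p) {
        procs[p].ownCopy = procs[p].live;
        procs[(p + 1) % n].buddyCopy = procs[p].live;
    }
}

// After `failed` loses its memory, every processor rolls back to the
// checkpoint; the failed one is refilled from the copy held by its buddy.
inline void recover(std::vector<Processor>& procs, std::size_t failed) {
    const std::size_t n = procs.size();
    procs[failed] = Processor();  // memory contents are gone
    ChareStates saved = procs[(failed + 1) % n].buddyCopy;
    for (std::size_t p = 0; p < n; ++p)
        if (p != failed) procs[p].live = procs[p].ownCopy;
    procs[failed].live = saved;
    procs[failed].ownCopy = saved;
    // Re-create the buddy copy that the failed processor had been holding.
    procs[failed].buddyCopy = procs[(failed + n - 1) % n].ownCopy;
}
```

Note that every processor rolls back in this sketch, which is exactly the cost that the message-logging strategy described above is meant to avoid.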
Associated Tools
Several tools have been created to support the development and tuning of Charm++ applications. LiveViz
allows one to inject messages into a running program
and display attributes and images from a running
simulation. Projections supports performance analysis
and visualization, including live visualization, parallel
on-line analysis, and log-based post-mortem analysis.
CharmDebug is a parallel debugger that understands
Charm++ constructs, and provides online access to
runtime data structures. It also supports a sophisticated
record-replay scheme and provisional message delivery
for dealing with nondeterministic bugs. The communication required by these tools is integrated in the runtime system, leveraging the message-driven scheduler.
No separate monitoring processes are necessary.
Code Example
Figures  and  show fragments from a simple Charm++
example program to give a flavor of the programming
model. The program is a simple Lennard-Jones molecular dynamics code. The computation is decomposed
into a one-dimensional array of LJ objects, each holding a subset of atoms. The interface file (Fig. ) describes
the main Charm-level entities, their types and signatures. The program has a read-only integer called
numChares that holds the size of the LJ chare-array. In
this particular program, the main chare is called Main
and has only one method, namely, its constructor. The
LJ class is declared as constituting a one-dimensional
array of chares. It has two entry methods in addition
to its constructor. Note the others[n] notation used
to specify a parameter that is an array of size n, where
n itself is another integer parameter. This allows the
system to generate code to serialize the parameters
into a message, when necessary. CkReductionMsg is a
system-defined type which is used as a target of entry
methods used in reductions.
Some important fragments from the C++ file that
define the program itself are shown in Fig. . Sequential
code not important for understanding the program is
omitted. The classes Main and LJ inherit from classes
generated by a translator based on the interface file. The
program execution consists of a number of time steps.
In each time step, each chare sends its particles on a
round-trip to visit all other chares. Whenever a packet
of particles visits a chare via the passOn method, the
chare calculates forces on each of the visiting (others)
particles due to each of its own particles. Thus, when
the particles return home after the round-trip, they
have accumulated forces due to all the other particles in
the system. A sequential call (integrate) then adds
local forces to the accumulated forces and calculates
new velocities and positions for each owned particle.
At this point, to make sure the time step is truly finished for all chares, the program uses an “asynchronous
reduction” via the contribute call. To emphasize the
asynchronous nature of this call, the example makes
the call before integrate. The contribute call simply
deposits the contribution into the system and continues
on to integrate. The contribute call specifies
that after all the array elements of LJ have contributed,
a callback will be made. In this case, the callback is
a broadcast to all the members of the chare-array at
the entry method startNextStep. The inherited variable thisProxy is a proxy to the entire chare-array.
Similarly, thisIndex refers to the index of the calling
chare in the chare-array to which it belongs.
mainmodule ljdyn {
  readonly int numChares;
  ...
  mainchare Main {
    entry Main(CkArgMsg *m);
  };
  array [1D] LJ {
    entry LJ(void);
    entry void passOn(int home, int n, Particle others[n]);
    entry void startNextStep(CkReductionMsg *m);
  };
};
Charm++. Fig.  A simple molecular dynamics program: interface file
/*readonly*/ int numChares;

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    // Process command-line arguments
    ...
    numChares = atoi(m->argv[1]);
    ...
    CProxy_LJ arr = CProxy_LJ::ckNew(numChares);
  }
};

class LJ : public CBase_LJ {
  int timeStep, numParticles, next;
  Particle* myParticles;
  ...
public:
  LJ() {
    ...
    myParticles = new Particle[numParticles];
    ... // initialize particle data
    next = (thisIndex + 1) % numChares;
    timeStep = 0;
    startNextStep((CkReductionMsg*)NULL);
  }

  void startNextStep(CkReductionMsg* m) {
    if (++timeStep > MAXSTEPS) {
      if (thisIndex == 0) { ckout << "Done\n" << endl; CkExit(); }
    } else
      thisProxy[next].passOn(thisIndex, numParticles, myParticles);
  }

  void passOn(int homeIndex, int n, Particle* others) {
    if (thisIndex != homeIndex) {
      interact(n, others); // add forces on "others" due to my particles
      thisProxy[next].passOn(homeIndex, n, others);
    } else { // particles are home, with accumulated forces
      CkCallback cb(CkIndex_LJ::startNextStep(NULL), thisProxy);
      contribute(cb); // asynchronous barrier
      integrate(n, others); // add forces and update positions
    }
  }

  void interact(int n, Particle* others) {
    /* add forces on "others" due to my particles */
  }

  void integrate(int n, Particle* others) {
    /* ... apply forces, update positions ... */
  }
};
Charm++. Fig.  A simple molecular dynamics program: fragments from the C++ file
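The message flow of one time step can be checked with a small sequential simulation in plain C++ that needs no Charm++ installation. It is only a sketch of the communication pattern: PassOn and simulateStep are hypothetical names, particles and forces are omitted, and a single queue stands in for the per-processor schedulers. It counts, for each home chare, how many other chares its packet interacts with before returning home.

```cpp
#include <deque>
#include <vector>

// One in-flight passOn invocation: the packet's home chare and its target.
struct PassOn {
    int home;
    int target;
};

// Simulate one time step of the ring on a single message queue.
// Returns, for each chare, how many other chares its packet interacted with.
inline std::vector<int> simulateStep(int numChares) {
    std::vector<int> visits(numChares, 0);
    std::deque<PassOn> queue;

    // startNextStep: every chare sends its particles to its successor.
    for (int i = 0; i < numChares; ++i)
        queue.push_back({i, (i + 1) % numChares});

    while (!queue.empty()) {
        PassOn m = queue.front();
        queue.pop_front();
        if (m.target != m.home) {
            ++visits[m.home];  // corresponds to interact(n, others)
            queue.push_back({m.home, (m.target + 1) % numChares});
        }
        // Otherwise the packet is home; the chare would call contribute
        // and then integrate.
    }
    return visits;
}
```

For a ring of n chares, every packet interacts with the other n - 1 chares, so each particle accumulates forces from all remote particles exactly once per step.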

Language Extensions and Features
The baseline programming model described so far is
adequate to express all parallel interaction structures.
However, for programming convenience, increased
productivity and/or efficiency, Charm++ supports a few
additional features. For example, individual entry methods can be marked as “threaded.” This results in the
creation of a user-level, lightweight thread whenever the
entry method is invoked. Unlike normal entry methods, which always complete their execution and return
control to the scheduler before other entry methods
are executed, threaded entry methods can block their
execution; of course they do so without blocking the
processor they are running on. In particular, they can
wait for a “future,” wait until another entry method
unblocks them, or make blocking method invocations.
An entry method can be tagged as blocking (actually
called a “sync” method), and such a method is capable of returning values, unlike normal methods that are
asynchronous and therefore have a return type of void.
Often, threaded entry methods are used to describe
the life cycle of a chare. Another notation within
Charm++, called “structured dagger,” accomplishes the
same effect without the need for a separate stack and
associated memory for each user level thread and the,
admittedly small, overhead associated with thread context switches. However, it requires that all dependencies
on remote data be expressed in this notation within the
text of an entry-method. In contrast, the thread of control may block waiting for remote data within functions
called from a threaded entry method.
Charm++ as described so far does not bring in the
notion of a “processor” in the programming model.
However, some low-level constructs that refer to processors are also provided to programmers and especially
to library writers. For example, when a chare is created,
one can optionally specify which processor to create it
on. Similarly, when a chare-array is created, one can
specify its initial mapping to processors. One can create
specialized chare-arrays, called groups, that have exactly
one member on each processor, which are useful for
implementing services such as load balancers.
Languages in the Charm Family
AMPI or Adaptive MPI is an implementation of the
message passing interface standard on top of the
Charm++ runtime system. Each MPI process is implemented as a user level thread that is embedded inside
a Charm++ object, as a threaded entry method. These
objects can be migrated across processors, as is usual
for Charm++ objects, thus bringing benefits of the
Charm++ adaptive runtime system, such as dynamic
load balancing and fault tolerance, to traditional MPI
programs. Since there may be multiple MPI “processes”
on each core, commensurate with the overdecomposition strategy of Charm++ applications, the MPI programs need to be modified in a mechanical, systematic
fashion to avoid conflict among the global variables.
Adaptive MPI provides tools for automating this process to some extent. As a result, a standard Adaptive
MPI program is also a legal MPI program, but the converse is true only if the use of global variables has been
handled via such modifications. In addition, Adaptive
MPI provides primitives such as asynchronous collectives, which are not part of the MPI  standard. An
asynchronous reduction, for example, carries out the
communication associated with a reduction in the background while the main program continues on with its
computation. A blocking call is then used to fetch the
result of the reduction.
Two recent languages in the Charm++ family are
multiphase shared arrays (MSA) and Charisma. These
are part of the Charm++ strategy of creating a toolbox consisting of incomplete languages that capture
some interaction modes elegantly and frameworks that
capture the needs of specific domains or data structures, both backed up by complete languages such as
Charm++ and AMPI. The compositionality afforded by
message-driven execution ensures that modules written
using multiple paradigms can be efficiently composed
in a larger application.
MSA is designed to support disciplined use of
shared address space. The computation consists of
collections of threads and multiple user-defined data
arrays, each partitioned into user-defined pages. Both
entities are implemented as migratable objects (i.e.,
chares) available to the Charm++ runtime system.
The threads can access the data in the arrays, but
each array is in only one of a restrictive set of
access modes at a time. Read-only, exclusive-write, and
accumulate are examples of the access modes supported by MSA. At designated synchronization points,
a program may change the access modes of one or
more arrays.
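The discipline of access modes can be sketched in plain C++ as a single-process stand-in for one shared array. This is not the MSA interface: PhasedArray, AccessMode, and sync are hypothetical names, there are no threads or pages, and the sketch does not check that each element has a single writer in exclusive-write mode. It only shows that an operation is rejected unless the array is in the matching mode, and that the mode changes only at an explicit synchronization point.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

enum class AccessMode { ReadOnly, ExclusiveWrite, Accumulate };

// Hypothetical stand-in for one shared array that is in exactly one
// access mode at a time.
class PhasedArray {
public:
    PhasedArray(std::size_t size, AccessMode initial)
        : data_(size, 0), mode_(initial) {}

    // Synchronization point: the only place where the mode may change.
    void sync(AccessMode next) { mode_ = next; }

    int read(std::size_t i) const {
        require(AccessMode::ReadOnly);
        return data_.at(i);
    }
    void write(std::size_t i, int value) {
        require(AccessMode::ExclusiveWrite);
        data_.at(i) = value;
    }
    void accumulate(std::size_t i, int value) {
        require(AccessMode::Accumulate);
        data_.at(i) += value;
    }

private:
    // Reject any operation that does not match the current mode.
    void require(AccessMode needed) const {
        if (mode_ != needed)
            throw std::logic_error(
                "operation not allowed in the current access mode");
    }
    std::vector<int> data_;
    AccessMode mode_;
};
```

Restricting each phase to one kind of access is what removes the need for fine-grained locking: within a phase, the permitted operations cannot conflict.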
Charisma, another language implemented on top
of Charm++, is designed to support computations
that exhibit a static data flow pattern among a set of
Charm++ objects. Such a pattern, where the flow of
messages remains the same from iteration to iteration, even though the content and the length of messages may change, is extremely common in science
and engineering applications. For such applications,
Charisma provides a convenient syntax that captures
the flow of values and control across multiple collections of objects clearly. In addition, Charisma provides a clean separation of sequential and parallel code
that is convenient for collaborative application development involving parallel programmers and domain
specialists.
Frameworks atop Charm++
In addition to its use as a language for implementing applications directly, Charm++ is seen as a back end
for the higher-level frameworks and languages described
above. Its utility in this context arises because of the
interoperability and runtime features it provides, which
one can leverage to put together a new domain-specific
framework relatively quickly. An example of such a
framework is ParFUM, which is aimed at unstructured mesh applications. ParFUM allows developers
of sequential codes based on such meshes to retarget
them to parallel machines, with relatively few changes. It
automates several commonly needed functions, including the exchange of boundary nodes (or boundary
layers, in general) with neighboring objects. Once a
code is ported to ParFUM, it can automatically benefit from other Charm++ features such as load balancing
and fault tolerance.
Applications
Some of the highly scalable applications developed
using Charm++ are in extensive use by scientists on
national supercomputers. These include NAMD (for
biomolecular simulations), OpenAtom (for electronic
structure simulations), and ChaNGa (for astrophysical
N-body simulations).
Origin and History
The Chare Kernel, a precursor of the Charm++ system, arose from the work on parallel Prolog at the
University of Illinois in Urbana-Champaign in the late
s. The implementation mechanism required a collection of computational entities, one for each active
clause (i.e., its activation record) of the underlying logic
program. Each of these entities typically received multiple responses from its children in the proof tree
and needed to create new nodes in the proof
tree, each of which had to fire new “tasks” for its
active clauses. The implementation led to a message-driven scheduler and dynamic creation of the seeds of
work. These entities were called chares, borrowing the
term used by an earlier parallel functional-languages
project, RediFlow. Chare Kernel essentially separated
this implementation mechanism from its parallel Prolog
context, into a C-based parallel programming paradigm
of its own. Charm had similarities (in particular, its
message-driven execution) with the earlier research on
reworking of Hewitt’s Actor framework by Agha and
Yonezawa [, ]. However, its intellectual progenitors
were in parallel logic and functional languages. With
the increase in popularity of C++, Charm++ became
the version of Charm for C++, which was a natural fit
for its object-based abstraction. As many researchers
in parallel logic programming shifted attention to scientific computations, indexed collections of migratable
chares were developed in Charm++ to simplify addressing chares in the mid-s. In the late s, Adaptive
MPI was developed in the context of applications being
developed at the Center for Simulation of Advanced
Rockets at Illinois. Charm++ continues to be developed
and maintained from the University of Illinois, and its
applications are in regular use at many supercomputers
around the world.
Availability and Usage
Charm++ runs on most parallel machines available at
the time this entry was written, including multicore
desktops, clusters, and large-scale proprietary supercomputers, running Linux, Windows, and other operating systems. Charm++ and its associated software tools and
libraries can be downloaded in source code and binary
forms from http://charm.cs.illinois.edu under a license
that allows free use for noncommercial purposes.
Related Entries
Actors
NAMD (NAnoscale Molecular Dynamics)
Combinatorial Search
Bibliographic Notes and Further
Reading
One of the earliest papers on the Chare Kernel, the
precursor of Charm++, was published in  [], followed by a more detailed description of the model and
its load balancers [, ]. This work arose out of earlier work on parallel Prolog []. The C++ based version
was described in an OOPSLA paper in  [] and
was expanded upon in a book on parallel C++ in [].
Early work on quantifying benefits of the programming
model is summarized in a later paper [].
Some of the early applications using Charm++ were
in symbolic computing and, specifically, in parallel
combinatorial search []. A scalable framework for supporting migratable arrays of chares is described in a
paper [], which is also useful for understanding the
programming model. An early paper [] describes support for migrating chares for dynamic load balancing.
Recent papers give an overview of Charm++ []
and its applications [].
. Agha G () Actors: a model of concurrent computation in
distributed systems. MIT, Cambridge
. Yonezawa A, Briot J-P, Shibayama E () Object-oriented concurrent programming in ABCL/. ACM SIGPLAN Notices, Proceedings OOPSLA ’, Nov , ():–
. Kale LV, Shu W () The Chare Kernel base language: preliminary performance results. In: Proceedings of the  international conference on parallel processing, St. Charles, August ,
pp –
. Kale LV () The Chare Kernel parallel programming language
and system. In: Proceedings of the international conference on
parallel processing, August , vol II, pp –
. Kale LV, Krishnan S () Charm++: parallel programming
with message-driven objects. In: Wilson GV, Lu P (eds) Parallel
programming using C++. MIT, Cambridge, pp –
. Gursoy A, Kale LV () Performance and modularity benefits of message-driven execution. J Parallel Distrib Comput
:–
. Lawlor OS, Kale LV () Supporting dynamic parallel object
arrays. Concurr Comput Pract Exp :–
. Brunner RK, Kale LV () Handling application-induced load
imbalance using parallel objects. In: Parallel and distributed computing for symbolic and irregular applications. World Scientific,
Singapore, pp –
. Kale Lv, Zheng G () Charm++ and AMPI: adaptive runtime
strategies via migratable objects. In: Parashar M (ed) Advanced
computational infrastructures for parallel and distributed applications. Wiley-Interscience, Hoboken, pp –
. Kale LV, Bohm E, Mendes CL, Wilmarth T, Zheng G () Programming petascale applications with Charm++ and AMPI. In:
Bader B (ed) Petascale computing: algorithms and applications.
Chapman & Hall, CRC, Boca Raton, pp –
Bibliography
. Kale LV, Krishnan S () CHARM++: a portable concurrent object oriented system based on C++. In: Paepcke A (ed)
Proceedings of OOPSLA’, ACM, New York, September ,
pp –
. Shu WW, Kale LV () Chare Kernel – a runtime support
system for parallel computations. J Parallel Distrib Comput
:–
. Kale LV () Parallel execution of logic programs: the
REDUCE-OR process model. In: Proceedings of the fourth international conference on logic programming, Melbourne, May
, pp –
. Kale LV, Ramkumar B, Saletore V, Sinha AB () Prioritization
in parallel symbolic computing. In: Ito T, Halstead R (eds) Lecture
notes in computer science, vol . Springer, pp –
. Lin Y-J, Kumar V () And-parallel execution of logic programs on a sharedmemory multiprocessor. J Logic Program
(//&):–
. Frigo M, Leiserson CE, Randall KH () The implementation
of the Cilk- multithreaded language. In: ACM SIGPLAN ’
conference on programming language design and implementation (PLDI), Montreal, June . vol  of ACM Sigplan Notices,
pp –
Checkpoint/Restart
Checkpointing
Checkpointing
Martin Schulz
Lawrence Livermore National Laboratory, Livermore,
CA, USA
Synonyms
Checkpoint-recovery; Checkpoint/Restart
Definition
In the most general sense, Checkpointing refers to the
ability to store the state of a computation in a way that
allows it to be continued at a later time without changing
the computation’s behavior. The preserved state is called
the Checkpoint and the continuation is typically referred
to as a Restart.
Checkpointing is most typically used to provide
fault tolerance to applications. In this case, the state of
the entire application is periodically saved to some kind
of stable storage, e.g., disk, and can be retrieved in case
the original application crashes due to a failure in the
underlying system. The application is then restarted (or
recovered) from the checkpoint that was created last
and continued from that point on, thereby minimizing
the time lost due to the failure.
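The cycle of periodic saving and restart from the most recent checkpoint can be sketched as follows. This is a minimal illustration under stated assumptions: the file naming scheme, the JSON format, and the function names save_checkpoint, load_latest_checkpoint, and run are hypothetical, and a local directory stands in for stable storage.

```python
import json
import os


def save_checkpoint(directory, step, state):
    """Write the state for `step`; rename last so a crash leaves no partial file."""
    final_path = os.path.join(directory, f"ckpt_{step:08d}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)


def load_latest_checkpoint(directory):
    """Return (step, state) of the newest checkpoint, or (0, None) if none exists."""
    names = sorted(n for n in os.listdir(directory)
                   if n.startswith("ckpt_") and n.endswith(".json"))
    if not names:
        return 0, None
    with open(os.path.join(directory, names[-1])) as f:
        record = json.load(f)
    return record["step"], record["state"]


def run(directory, total_steps, interval, fail_at=None):
    """Sum the integers 1..total_steps, checkpointing every `interval` steps."""
    step, state = load_latest_checkpoint(directory)
    total = state["total"] if state is not None else 0
    while step < total_steps:
        step += 1
        total += step
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        if step % interval == 0:
            save_checkpoint(directory, step, {"total": total})
    return total
```

After a failure, calling run again with the same directory resumes from the last completed checkpoint, so only the work done since that checkpoint is repeated.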
Discussion
Checkpointing is a mechanism to store the state of
a computation so that it can be retrieved at a later
point in time and continued. The process of writing the
computation’s state is referred to as Checkpointing, the
data written as the Checkpoint, and the continuation of
the application as Restart or Recovery. The execution
sequence between two checkpoints is referred to as a
Checkpointing Epoch or just Epoch.
As discussed in section “Checkpointing Types”,
Checkpointing can be accomplished either at system
level, transparently to the application (section “System-Level Checkpointing”), or at application level, integrated into an application (section “Application-Level
Checkpointing”). While the first type is easier to apply
for the end user, the latter one is typically more
efficient.
While checkpointing is useful for any kind of computation, it plays a special role for parallel applications (section “Parallel Checkpointing”), especially in
the area of High-Performance Computing. With rising numbers of processors in each system, the overall system availability is decreasing, making reliable
fault tolerance mechanisms, like checkpointing, essential. However, in order to apply checkpointing to parallel
applications, the checkpointing software needs to be
able to create globally consistent checkpoints across the
entire application, which can be achieved using either
coordinated (section “Coordinated Checkpointing”)
or uncoordinated (section “Uncoordinated Checkpointing”) checkpointing protocols.
Independent of the type and the underlying system, checkpointing systems always require some kind
of storage to which the checkpoint can be saved,
which is discussed in section “Checkpoint Storage
Considerations”.
Checkpointing is most commonly associated with
fault tolerance: It is used to periodically store the
state of an application to some kind of stable storage, such that, after a hardware or operating system failure, an application can continue its execution from the last checkpoint, rather than having to
start from scratch. The following entry will concentrate on this usage scenario, but will also discuss
some alternate scenarios in section “Alternate Usage
Scenarios”.
Checkpointing Types
Checkpointing can be implemented either at the system level, i.e., by the operating system or the system
environment, or within the application itself.
System-Level Checkpointing
In system-level checkpointing, the state of a computation is saved by an external entity, typically without the application’s knowledge or support. Consequently, the complete process information has to be
included in the checkpoint, as illustrated in Fig. .
This includes not only the complete memory footprint including data segments, the heap, and all stacks,
but also the register and CPU state as well as open files
and other resources. On restart, the complete memory footprint is restored, all file resources are made
available to the process again, and then the register
set is restored to its original state, including the program counter, allowing the application to continue at
the same point where it had been interrupted for the
checkpoint.
System-level checkpoint solutions can either be
implemented inside the kernel as a kernel module or
service, or at the user level. The former has the advantage that the checkpointer has full access to the target process as well as its resources. User-level checkpointers have to find other ways to gather this information, e.g., by intercepting all system calls. On the
flip side, user-level schemes are typically more portable
and easier to deploy, in particular in large-scale production environments with limited access for end
users.
[Figure: the process address space (text, data, heap, stack), the CPU state (registers, incl. SP and IP), and the external state (file I/O, sockets, …) all feed into the system-level checkpoint.]
Checkpointing. Fig.  System-level checkpointing
Application-Level Checkpointing
The alternative to system-level checkpointing is to
integrate the checkpointing capability into the actual
application, which leads to application-level checkpointing solutions. In such systems, the application is
augmented with the ability to write its own state into
a checkpoint as well as to restart from it. While this
requires explicit code inside the application and hence is
no longer transparent, it gives the application the ability
to decide when checkpoints should be taken (i.e., when
it is a good time to write the state, e.g., when memory
usage is low or no extra resources are used) and what
should be contained in the checkpoint (Fig. ). The latter enables applications to remove noncritical memory
regions, e.g., temporary fields, from the checkpoint and
hence reduce the checkpoint size.
To illustrate the latter point, let us consider a classical particle simulation in which the location and speed
of a set of particles is updated at each iteration step based
on the physical properties of the underlying system. The
complete state of the computation can be represented
by only the two arrays representing the coordinates and
velocities of all particles in the simulation. If checkpoints are taken at iteration boundaries after the update
of all arrays is complete, only these arrays need to be
stored to continue the application at a later time. Any
temporary array, e.g., used to compute forces between
particles, as well as stack information does not need to
be included. A system-level checkpointer, on the other
hand, would not be able to determine which parts of the
data are relevant and which are not and would have to
store the entire memory segment.
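The particle example can be sketched as follows. This is a minimal illustration, not code from any particular simulation package: the toy spring force, the time step, and the function names step, make_checkpoint, simulate, and restart are assumptions. The point is that the checkpoint holds only the iteration number, the positions, and the velocities, while the temporary force array is never saved.

```python
import pickle


def step(positions, velocities, dt=0.1):
    """Advance one iteration; `forces` is a temporary that is never checkpointed."""
    forces = [-x for x in positions]  # toy spring force toward the origin
    velocities = [v + f * dt for v, f in zip(velocities, forces)]
    positions = [x + v * dt for x, v in zip(positions, velocities)]
    return positions, velocities


def make_checkpoint(iteration, positions, velocities):
    """Serialize only the state needed to resume at an iteration boundary."""
    return pickle.dumps({"iteration": iteration,
                         "positions": positions,
                         "velocities": velocities})


def simulate(positions, velocities, start, stop):
    """Run iterations start..stop-1 and return the resulting state."""
    for _ in range(start, stop):
        positions, velocities = step(positions, velocities)
    return positions, velocities


def restart(checkpoint, stop):
    """Resume from a checkpoint and continue up to iteration `stop`."""
    state = pickle.loads(checkpoint)
    return simulate(state["positions"], state["velocities"],
                    state["iteration"], stop)
```

A run that is checkpointed after five iterations and restarted reaches the same final state as an uninterrupted run, even though the force array was never written.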
Application-level checkpointing is used in many
high-performance computing applications, especially
in simulation codes. There, the checkpointing routines are part of the application’s base design and are implemented by the programmer.
Additionally, systems like SRS [] provide toolboxes that
allow users to implement application-level checkpoints
in their codes on top of a simple and small API.
Trade-offs
Both approaches have distinct advantages and disadvantages. The key differences are summarized in
Table . While system-level checkpoints provide full
transparency to the user and require no special mechanism or consideration inside an application, this transparency is missing in application-level checkpointers.
On the other hand, this transparency comes at the price
of high implementation complexity for the checkpointing software. Not only must it be able to checkpoint
the complete system state of an arbitrary process with
arbitrary resources, but it also has to do so at any time,
independent of the state the process is in.
In contrast to system-level checkpointers, application-level approaches can exploit application-specific information to optimize the checkpointing process. They are
able to control the timing of the checkpoints and they
can limit the data that is written to the checkpoint,
which reduces the data that has to be saved and hence
the size of the checkpoints.
[Figure: the process address space with text (+ CP code), data, heap, and stack, and the application-level checkpoint.]
Checkpointing. Fig.  Application-level checkpointing

Checkpointing. Table  Key differences between system- and application-level checkpointing

System-level checkpointing             | Application-level checkpointing
Transparent                            | Integrated into the application
Implementation complexity high         | Implementation complexity medium or low
System specific                        | Portable
Checkpoints taken at arbitrary times   | Checkpoints taken at predefined locations
Full memory dump                       | Only save what is needed
Large checkpoint files                 | Checkpoint files only as large as needed

Combining the advantages of both approaches by
providing a transparent solution with the benefits of
an application-level approach is still a topic of basic
research. The main idea is, for each checkpoint inserted
into the application (either by hand or by a compiler),
to identify which variables need to be included in the
checkpoint at that location, i.e., those variables that are
in scope and that are actually used after the checkpoint location (and cannot be easily recomputed) [, ].
While this can eliminate the need to save some temporary arrays, current compiler analysis approaches are
not powerful enough to achieve the same efficiency as
manual approaches.
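The selection rule stated above reduces to a set computation once the three inputs are known. The sketch below assumes the sets are supplied by hand; in a compiler-based system they would come from scope and liveness analysis. The function name and the variable names in the usage note are hypothetical.

```python
def variables_to_checkpoint(in_scope, used_later, recomputable):
    """Variables in scope that are still needed and cannot be cheaply rebuilt."""
    return (set(in_scope) & set(used_later)) - set(recomputable)
```

For a checkpoint location where positions, velocities, forces, step, and mass_table are in scope, where everything except forces is used afterward, and where mass_table can be recomputed from the input, the rule selects positions, velocities, and step.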
Parallel Checkpointing
Checkpointing is of special importance in parallel systems. The larger number of components used for a
single computation naturally decreases the mean time
between failures, making system faults more likely.
Today’s largest systems already consist of over ,
processing cores, and systems with , to , cores
are common in High-Performance Computing [];
future architectures will have even more, as plans for
machines with over a million cores have already been
announced []. This scaling trend requires effective fault
tolerance solutions for such parallel platforms and their
applications.
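The effect of component count on the mean time between failures can be estimated with a simple model. The sketch below assumes that components fail independently with a constant failure rate and that any single component failure interrupts the computation; under these assumptions the failure rates add up, so the system-level value is the per-component value divided by the number of components. The function name and the numbers in the usage note are hypothetical.

```python
def system_mtbf(component_mtbf_hours, num_components):
    """Estimate system MTBF for independent components with constant failure rates."""
    if num_components < 1:
        raise ValueError("num_components must be at least 1")
    return component_mtbf_hours / num_components
```

With a hypothetical per-component value of 1,000,000 hours, a system of 100,000 such components would fail on average every 10 hours, which is shorter than many production runs.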
However, checkpointing a parallel application cannot be implemented by simply replicating a sequential
checkpointing mechanism to all tasks of a parallel application, since the individua