Programming Language Pragmatics (3ed., Elsevier, 2009) Scott M.L

In Praise of Programming Language Pragmatics, Third Edition
The ubiquity of computers in everyday life in the 21st century justifies the centrality of programming languages to computer science education. Programming languages is the area that connects the
theoretical foundations of computer science, the source of problem-solving algorithms, to modern
computer architectures on which the corresponding programs produce solutions. Given the speed
with which computing technology advances in this post-Internet era, a computing textbook must
present a structure for organizing information about a subject, not just the facts of the subject itself.
In this book, Michael Scott broadly and comprehensively presents the key concepts of programming
languages and their implementation, in a manner appropriate for computer science majors.
— From the Foreword by Barbara Ryder, Virginia Tech
Programming Language Pragmatics is an outstanding introduction to language design and implementation. It illustrates not only the theoretical underpinnings of the languages that we use, but also the
ways in which they have been guided by the development of computer architecture, and the ways in
which they continue to evolve to meet the challenge of exploiting multicore hardware.
— Tim Harris, Microsoft Research
Michael Scott has provided us with a book that is faithful to its title—Programming Language Pragmatics. In addition to coverage of traditional language topics, this text delves into the sometimes
obscure, but always necessary, details of fielding programming artifacts. This new edition is current
in its coverage of modern language fundamentals, and now includes new and updated material on
modern run-time environments, including virtual machines. This book is an excellent introduction
for anyone wishing to develop languages for real-world applications.
— Perry Alexander, Kansas University
Michael Scott has improved this new edition of Programming Language Pragmatics in big and small
ways. Changes include the addition of even more insightful examples, the conversion of Pascal
and MIPS examples to C and Intel x86, as well as a completely new chapter on run-time systems.
The additional chapter provides a deeper appreciation of the design and implementation issues of
modern languages.
— Eileen Head, Binghamton University
This new edition brings the gold standard of this dynamic field up to date while maintaining an
excellent balance of the three critical qualities needed in a textbook: breadth, depth, and clarity.
— Christopher Vickery, Queens College of CUNY
Programming Language Pragmatics provides a comprehensive treatment of programming language
theory and implementation. Michael Scott explains the concepts well and illustrates the practical
implications with hundreds of examples from the most popular and influential programming languages. With the welcome addition of a chapter on run-time systems, the third edition includes new
topics such as virtual machines, just-in-time compilation and symbolic debugging.
— William Calhoun, Bloomsburg University
Programming Language Pragmatics
THIRD EDITION
About the Author
Michael L. Scott is a professor and past chair of the Department of Computer Science at the University of Rochester. He received his Ph.D. in computer sciences in
1985 from the University of Wisconsin–Madison. His research interests lie at the
intersection of programming languages, operating systems, and high-level computer architecture, with an emphasis on parallel and distributed computing. He
is the designer of the Lynx distributed programming language and a co-designer
of the Charlotte and Psyche parallel operating systems, the Bridge parallel file
system, the Cashmere and InterWeave shared memory systems, and the RSTM
suite of transactional memory implementations. His MCS mutual exclusion lock,
co-designed with John Mellor-Crummey, is used in a variety of commercial and
academic systems. Several other algorithms, designed with Maged Michael, Bill
Scherer, and Doug Lea, appear in the java.util.concurrent standard library.
In 2006 he and Dr. Mellor-Crummey shared the ACM SIGACT/SIGOPS Edsger
W. Dijkstra Prize in Distributed Computing.
Dr. Scott is a Fellow of the Association for Computing Machinery, a Senior
Member of the Institute of Electrical and Electronics Engineers, and a member
of the Union of Concerned Scientists and Computer Professionals for Social
Responsibility. He has served on a wide variety of program committees and grant
review panels, and has been a principal or co-investigator on grants from the NSF,
ONR, DARPA, NASA, the Departments of Energy and Defense, the Ford Foundation, Digital Equipment Corporation (now HP), Sun Microsystems, IBM, Intel,
and Microsoft. The author of more than 100 refereed publications, he served as
General Chair of the 2003 ACM Symposium on Operating Systems Principles
and as Program Chair of the 2007 ACM SIGPLAN Workshop on Transactional
Computing and the 2008 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. In 2001 he received the University of Rochester’s
Robert and Pamela Goergen Award for Distinguished Achievement and Artistry
in Undergraduate Teaching.
Programming Language Pragmatics
THIRD EDITION
Michael L. Scott
Department of Computer Science
University of Rochester
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann Publishers is an imprint of Elsevier
30 Corporate Drive, Suite 400
Burlington, MA 01803
This book is printed on acid-free paper.
Copyright © 2009 by Elsevier Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks or
registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim,
the product names appear in initial capital or all capital letters. Readers, however, should contact the
appropriate companies for more complete information regarding trademarks and registration.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, scanning, or otherwise, without prior written
permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in
Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.com.
You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by
selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data
Application submitted.
ISBN 13: 978-0-12-374514-9
Cover image: Copyright © 2008, Michael L. Scott.
Beaver Lake, near Lowville, NY, in the foothills of the Adirondacks
For all information on all Morgan Kaufmann publications,
visit our Website at www.books.elsevier.com
Printed in the United States
Transferred to Digital Printing in 2011
To my parents,
Dorothy D. Scott and Peter Lee Scott,
who modeled for their children
the deepest commitment
to humanistic values.
Contents
Foreword
Preface
I FOUNDATIONS
1 Introduction
1.1 The Art of Language Design
1.2 The Programming Language Spectrum
1.3 Why Study Programming Languages?
1.4 Compilation and Interpretation
1.5 Programming Environments
1.6 An Overview of Compilation
1.6.1 Lexical and Syntax Analysis
1.6.2 Semantic Analysis and Intermediate Code Generation
1.6.3 Target Code Generation
1.6.4 Code Improvement
1.7 Summary and Concluding Remarks
1.8 Exercises
1.9 Explorations
1.10 Bibliographic Notes
2 Programming Language Syntax
2.1 Specifying Syntax: Regular Expressions and Context-Free Grammars
2.1.1 Tokens and Regular Expressions
2.1.2 Context-Free Grammars
2.1.3 Derivations and Parse Trees
2.2 Scanning
2.2.1 Generating a Finite Automaton
2.2.2 Scanner Code
2.2.3 Table-Driven Scanning
2.2.4 Lexical Errors
2.2.5 Pragmas
2.3 Parsing
2.3.1 Recursive Descent
2.3.2 Table-Driven Top-Down Parsing
2.3.3 Bottom-Up Parsing
2.3.4 Syntax Errors
2.4 Theoretical Foundations
2.4.1 Finite Automata
2.4.2 Push-Down Automata
2.4.3 Grammar and Language Classes
2.5 Summary and Concluding Remarks
2.6 Exercises
2.7 Explorations
2.8 Bibliographic Notes
3 Names, Scopes, and Bindings
3.1 The Notion of Binding Time
3.2 Object Lifetime and Storage Management
3.2.1 Static Allocation
3.2.2 Stack-Based Allocation
3.2.3 Heap-Based Allocation
3.2.4 Garbage Collection
3.3 Scope Rules
3.3.1 Static Scoping
3.3.2 Nested Subroutines
3.3.3 Declaration Order
3.3.4 Modules
3.3.5 Module Types and Classes
3.3.6 Dynamic Scoping
3.4 Implementing Scope
3.4.1 Symbol Tables
3.4.2 Association Lists and Central Reference Tables
3.5 The Meaning of Names within a Scope
3.5.1 Aliases
3.5.2 Overloading
3.5.3 Polymorphism and Related Concepts
3.6 The Binding of Referencing Environments
3.6.1 Subroutine Closures
3.6.2 First-Class Values and Unlimited Extent
3.6.3 Object Closures
3.7 Macro Expansion
3.8 Separate Compilation
3.8.1 Separate Compilation in C
3.8.2 Packages and Automatic Header Inference
3.8.3 Module Hierarchies
3.9 Summary and Concluding Remarks
3.10 Exercises
3.11 Explorations
3.12 Bibliographic Notes
4 Semantic Analysis
4.1 The Role of the Semantic Analyzer
4.2 Attribute Grammars
4.3 Evaluating Attributes
4.4 Action Routines
4.5 Space Management for Attributes
4.5.1 Bottom-Up Evaluation
4.5.2 Top-Down Evaluation
4.6 Decorating a Syntax Tree
4.7 Summary and Concluding Remarks
4.8 Exercises
4.9 Explorations
4.10 Bibliographic Notes
5 Target Machine Architecture
5.1 The Memory Hierarchy
5.2 Data Representation
5.2.1 Integer Arithmetic
5.2.2 Floating-Point Arithmetic
5.3 Instruction Set Architecture
5.3.1 Addressing Modes
5.3.2 Conditions and Branches
5.4 Architecture and Implementation
5.4.1 Microprogramming
5.4.2 Microprocessors
5.4.3 RISC
5.4.4 Multithreading and Multicore
5.4.5 Two Example Architectures: The x86 and MIPS
5.5 Compiling for Modern Processors
5.5.1 Keeping the Pipeline Full
5.5.2 Register Allocation
5.6 Summary and Concluding Remarks
5.7 Exercises
5.8 Explorations
5.9 Bibliographic Notes
II CORE ISSUES IN LANGUAGE DESIGN
6 Control Flow
6.1 Expression Evaluation
6.1.1 Precedence and Associativity
6.1.2 Assignments
6.1.3 Initialization
6.1.4 Ordering within Expressions
6.1.5 Short-Circuit Evaluation
6.2 Structured and Unstructured Flow
6.2.1 Structured Alternatives to goto
6.2.2 Continuations
6.3 Sequencing
6.4 Selection
6.4.1 Short-Circuited Conditions
6.4.2 Case / Switch Statements
6.5 Iteration
6.5.1 Enumeration-Controlled Loops
6.5.2 Combination Loops
6.5.3 Iterators
6.5.4 Generators in Icon
6.5.5 Logically Controlled Loops
6.6 Recursion
6.6.1 Iteration and Recursion
6.6.2 Applicative- and Normal-Order Evaluation
6.7 Nondeterminacy
6.8 Summary and Concluding Remarks
6.9 Exercises
6.10 Explorations
6.11 Bibliographic Notes
7 Data Types
7.1 Type Systems
7.1.1 Type Checking
7.1.2 Polymorphism
7.1.3 The Meaning of “Type”
7.1.4 Classification of Types
7.1.5 Orthogonality
7.2 Type Checking
7.2.1 Type Equivalence
7.2.2 Type Compatibility
7.2.3 Type Inference
7.2.4 The ML Type System
7.3 Records (Structures) and Variants (Unions)
7.3.1 Syntax and Operations
7.3.2 Memory Layout and Its Impact
7.3.3 With Statements
7.3.4 Variant Records (Unions)
7.4 Arrays
7.4.1 Syntax and Operations
7.4.2 Dimensions, Bounds, and Allocation
7.4.3 Memory Layout
7.5 Strings
7.6 Sets
7.7 Pointers and Recursive Types
7.7.1 Syntax and Operations
7.7.2 Dangling References
7.7.3 Garbage Collection
7.8 Lists
7.9 Files and Input/Output
7.9.1 Interactive I/O
7.9.2 File-Based I/O
7.9.3 Text I/O
7.10 Equality Testing and Assignment
7.11 Summary and Concluding Remarks
7.12 Exercises
7.13 Explorations
7.14 Bibliographic Notes
8 Subroutines and Control Abstraction
8.1 Review of Stack Layout
8.2 Calling Sequences
8.2.1 Displays
8.2.2 Case Studies: C on the MIPS; Pascal on the x86
8.2.3 Register Windows
8.2.4 In-Line Expansion
8.3 Parameter Passing
8.3.1 Parameter Modes
8.3.2 Call-by-Name
8.3.3 Special-Purpose Parameters
8.3.4 Function Returns
8.4 Generic Subroutines and Modules
8.4.1 Implementation Options
8.4.2 Generic Parameter Constraints
8.4.3 Implicit Instantiation
8.4.4 Generics in C++, Java, and C#
8.5 Exception Handling
8.5.1 Defining Exceptions
8.5.2 Exception Propagation
8.5.3 Implementation of Exceptions
8.6 Coroutines
8.6.1 Stack Allocation
8.6.2 Transfer
8.6.3 Implementation of Iterators
8.6.4 Discrete Event Simulation
8.7 Events
8.7.1 Sequential Handlers
8.7.2 Thread-Based Handlers
8.8 Summary and Concluding Remarks
8.9 Exercises
8.10 Explorations
8.11 Bibliographic Notes
9 Data Abstraction and Object Orientation
9.1 Object-Oriented Programming
9.2 Encapsulation and Inheritance
9.2.1 Modules
9.2.2 Classes
9.2.3 Nesting (Inner Classes)
9.2.4 Type Extensions
9.2.5 Extending without Inheritance
9.3 Initialization and Finalization
9.3.1 Choosing a Constructor
9.3.2 References and Values
9.3.3 Execution Order
9.3.4 Garbage Collection
9.4 Dynamic Method Binding
9.4.1 Virtual and Nonvirtual Methods
9.4.2 Abstract Classes
9.4.3 Member Lookup
9.4.4 Polymorphism
9.4.5 Object Closures
9.5 Multiple Inheritance
9.5.1 Semantic Ambiguities
9.5.2 Replicated Inheritance
9.5.3 Shared Inheritance
9.5.4 Mix-In Inheritance
9.6 Object-Oriented Programming Revisited
9.6.1 The Object Model of Smalltalk
9.7 Summary and Concluding Remarks
9.8 Exercises
9.9 Explorations
9.10 Bibliographic Notes
III ALTERNATIVE PROGRAMMING MODELS
10 Functional Languages
10.1 Historical Origins
10.2 Functional Programming Concepts
10.3 A Review/Overview of Scheme
10.3.1 Bindings
10.3.2 Lists and Numbers
10.3.3 Equality Testing and Searching
10.3.4 Control Flow and Assignment
10.3.5 Programs as Lists
10.3.6 Extended Example: DFA Simulation
10.4 Evaluation Order Revisited
10.4.1 Strictness and Lazy Evaluation
10.4.2 I/O: Streams and Monads
10.5 Higher-Order Functions
10.6 Theoretical Foundations
10.6.1 Lambda Calculus
10.6.2 Control Flow
10.6.3 Structures
10.7 Functional Programming in Perspective
10.8 Summary and Concluding Remarks
10.9 Exercises
10.10 Explorations
10.11 Bibliographic Notes
11 Logic Languages
11.1 Logic Programming Concepts
11.2 Prolog
11.2.1 Resolution and Unification
11.2.2 Lists
11.2.3 Arithmetic
11.2.4 Search/Execution Order
11.2.5 Extended Example: Tic-Tac-Toe
11.2.6 Imperative Control Flow
11.2.7 Database Manipulation
11.3 Theoretical Foundations
11.3.1 Clausal Form
11.3.2 Limitations
11.3.3 Skolemization
11.4 Logic Programming in Perspective
11.4.1 Parts of Logic Not Covered
11.4.2 Execution Order
11.4.3 Negation and the “Closed World” Assumption
11.5 Summary and Concluding Remarks
11.6 Exercises
11.7 Explorations
11.8 Bibliographic Notes
12 Concurrency
12.1 Background and Motivation
12.1.1 The Case for Multithreaded Programs
12.1.2 Multiprocessor Architecture
12.2 Concurrent Programming Fundamentals
12.2.1 Communication and Synchronization
12.2.2 Languages and Libraries
12.2.3 Thread Creation Syntax
12.2.4 Implementation of Threads
12.3 Implementing Synchronization
12.3.1 Busy-Wait Synchronization
12.3.2 Nonblocking Algorithms
12.3.3 Memory Consistency Models
12.3.4 Scheduler Implementation
12.3.5 Semaphores
12.4 Language-Level Mechanisms
12.4.1 Monitors
12.4.2 Conditional Critical Regions
12.4.3 Synchronization in Java
12.4.4 Transactional Memory
12.4.5 Implicit Synchronization
12.5 Message Passing
12.5.1 Naming Communication Partners
12.5.2 Sending
12.5.3 Receiving
12.5.4 Remote Procedure Call
12.6 Summary and Concluding Remarks
12.7 Exercises
12.8 Explorations
12.9 Bibliographic Notes
13 Scripting Languages
13.1 What Is a Scripting Language?
13.1.1 Common Characteristics
13.2 Problem Domains
13.2.1 Shell (Command) Languages
13.2.2 Text Processing and Report Generation
13.2.3 Mathematics and Statistics
13.2.4 “Glue” Languages and General-Purpose Scripting
13.2.5 Extension Languages
13.3 Scripting the World Wide Web
13.3.1 CGI Scripts
13.3.2 Embedded Server-Side Scripts
13.3.3 Client-Side Scripts
13.3.4 Java Applets
13.3.5 XSLT
13.4 Innovative Features
13.4.1 Names and Scopes
13.4.2 String and Pattern Manipulation
13.4.3 Data Types
13.4.4 Object Orientation
13.5 Summary and Concluding Remarks
13.6 Exercises
13.7 Explorations
13.8 Bibliographic Notes
IV A CLOSER LOOK AT IMPLEMENTATION
14 Building a Runnable Program
14.1 Back-End Compiler Structure
14.1.1 A Plausible Set of Phases
14.1.2 Phases and Passes
14.2 Intermediate Forms
14.2.1 Diana
14.2.2 The gcc IFs
14.2.3 Stack-Based Intermediate Forms
14.3 Code Generation
14.3.1 An Attribute Grammar Example
14.3.2 Register Allocation
14.4 Address Space Organization
14.5 Assembly
14.5.1 Emitting Instructions
14.5.2 Assigning Addresses to Names
14.6 Linking
14.6.1 Relocation and Name Resolution
14.6.2 Type Checking
14.7 Dynamic Linking
14.7.1 Position-Independent Code
14.7.2 Fully Dynamic (Lazy) Linking
14.8 Summary and Concluding Remarks
14.9 Exercises
14.10 Explorations
14.11 Bibliographic Notes
15 Run-time Program Management
15.1 Virtual Machines
15.1.1 The Java Virtual Machine
15.1.2 The Common Language Infrastructure
15.2 Late Binding of Machine Code
15.2.1 Just-in-Time and Dynamic Compilation
15.2.2 Binary Translation
15.2.3 Binary Rewriting
15.2.4 Mobile Code and Sandboxing
15.3 Inspection/Introspection
15.3.1 Reflection
15.3.2 Symbolic Debugging
15.3.3 Performance Analysis
15.4 Summary and Concluding Remarks
15.5 Exercises
15.6 Explorations
15.7 Bibliographic Notes
16 Code Improvement
16.1 Phases of Code Improvement
16.2 Peephole Optimization
16.3 Redundancy Elimination in Basic Blocks
16.3.1 A Running Example
16.3.2 Value Numbering
16.4 Global Redundancy and Data Flow Analysis
16.4.1 SSA Form and Global Value Numbering
16.4.2 Global Common Subexpression Elimination
16.5 Loop Improvement I
16.5.1 Loop Invariants
16.5.2 Induction Variables
16.6 Instruction Scheduling
16.7 Loop Improvement II
16.7.1 Loop Unrolling and Software Pipelining
16.7.2 Loop Reordering
16.8 Register Allocation
16.9 Summary and Concluding Remarks
16.10 Bibliographic Notes
A Programming Languages Mentioned
B Language Design and Language Implementation
C Numbered Examples
Bibliography
Index
Foreword
The ubiquity of computers in everyday life in the 21st century justifies the centrality of programming languages to computer science education. Programming
languages is the area that connects the theoretical foundations of computer science,
the source of problem-solving algorithms, to modern computer architectures on
which the corresponding programs produce solutions. Given the speed with which
computing technology advances in this post-Internet era, a computing textbook
must present a structure for organizing information about a subject, not just the
facts of the subject itself. In this book, Michael Scott broadly and comprehensively
presents the key concepts of programming languages and their implementation,
in a manner appropriate for computer science majors.
The key strength of Scott’s book is that he holistically combines descriptions of
language concepts with concrete explanations of how to realize them. The depth of
these discussions, which have been updated in this third edition to reflect current
research and practice, provides basic information as well as supplemental material
for the reader interested in a specific topic. By eliding some topics selectively,
the instructor can still create a coherent exploration of a subset of the subject
matter. Moreover, Scott uses numerous examples from real languages to illustrate
key points. For interested or motivated readers, additional in-depth and advanced
discussions and exercises are available on the book’s companion CD, enabling
students with a range of interests and abilities to further explore on their own the
fundamentals of programming languages and compilation.
I have taught a semester-long comparative programming languages course
using Scott’s book for the last several years. I emphasize to students that my
goal is for them to learn how to learn a programming language, rather than to
retain detailed specifics of any one programming language. The purpose of the
course is to teach students an organizational framework for learning new languages throughout their careers, a certainty in the computer science field. To this
end, I particularly like Scott’s chapters on programming language paradigms (i.e.,
functional, logic, object-oriented, scripting), and my course material is organized
in this manner. However, I also have included foundational topics such as memory
organization, names and locations, scoping, types, and garbage collection–all of
which benefit from being presented in a manner that links the language concept
to its implementation details. Scott’s explanations are to the point and intuitive,
with clear illustrations and good examples. Often, discussions are independent
of previously presented material, making it easier to pick and choose topics for
the syllabus. In addition, many supplemental teaching materials are provided on
the Web.
Of key interest to me in this new edition are the new Chapter 15 on run-time
environments and virtual machines (VMs), and the major update of Chapter
12 on concurrency. Given the current emphasis on virtualization, including a
chapter on VMs, such as Java’s JVM and CLI, facilitates student understanding
of this important topic and explains how modern languages achieve portability
over many platforms. The discussion of dynamic compilation and binary translation provides a contrast to the more traditional model of compilation presented
earlier in the book. It is important that Scott includes this newer compilation
technology so that a student can better understand what is needed to support the
newer dynamic language features described. Further, the discussions of symbolic
debugging and performance analysis demonstrate that programming language
and compiler technology pervade the software development cycle.
Similarly, Chapter 12 has been augmented with discussions of newer topics
that have been the focus of recent research (e.g., memory consistency models,
software transactional memory). A discussion of concurrency as a programming
paradigm belongs in a programming languages course, not just in an operating
systems course. In this context, language design choices easily can be compared
and contrasted, and their required implementations considered. This blurring
of the boundaries between language design, compilation, operating systems, and
architecture characterizes current software development in practice. This reality
is mirrored in this third edition of Scott’s book.
Besides these major changes, this edition features updated examples (e.g., in
X86 code, in C rather than Pascal) and enhanced discussions in the context of
modern languages such as C#, Java 5, Python, and Eiffel. Presenting examples in
several programming languages helps students understand that it is the underlying
common concepts that are important, not their syntactic differences.
In summary, Michael Scott’s book is an excellent treatment of programming
languages and their implementation. This new third edition provides a good reference for students, to supplement materials presented in lectures. Several coherent
tracks through the textbook allow construction of several “flavors” of courses that
cover much, but not all, of the material. The presentation is clear and comprehensive, with language design and implementation discussed together and supporting
one another.
Congratulations to Michael on a fine third edition of this wonderful book!
Barbara G. Ryder
J. Byron Maupin Professor of Engineering
Head, Department of Computer Science
Virginia Tech
Preface
A course in computer programming provides the typical student’s first
exposure to the field of computer science. Most students in such a course will
have used computers all their lives, for email, games, web browsing, word processing, social networking, and a host of other tasks, but it is not until they write their
first programs that they begin to appreciate how applications work. After gaining
a certain level of facility as programmers (presumably with the help of a good
course in data structures and algorithms), the natural next step is to wonder how
programming languages work. This book provides an explanation. It aims, quite
simply, to be the most comprehensive and accurate languages text available, in a
style that is engaging and accessible to the typical undergraduate. This aim reflects
my conviction that students will understand more, and enjoy the material more,
if we explain what is really going on.
In the conventional “systems” curriculum, the material beyond data structures (and possibly computer organization) tends to be compartmentalized into a
host of separate subjects, including programming languages, compiler construction, computer architecture, operating systems, networks, parallel and distributed
computing, database management systems, and possibly software engineering,
object-oriented design, graphics, or user interface systems. One problem with this
compartmentalization is that the list of subjects keeps growing, but the number of
semesters in a Bachelor’s program does not. More important, perhaps, many of the
most interesting discoveries in computer science occur at the boundaries between
subjects. The RISC revolution, for example, forged an alliance between computer architecture and compiler construction that has endured for 25 years. More
recently, renewed interest in virtual machines has blurred the boundaries between
the operating system kernel, the compiler, and the language run-time system.
Programs are now routinely embedded in web pages, spreadsheets, and user interfaces. And with the rise of multicore processors, concurrency issues that used to
concern only systems programmers have begun to impact everyday computing.
Increasingly, both educators and practitioners are recognizing the need to
emphasize these sorts of interactions. Within higher education in particular there
is a growing trend toward integration in the core curriculum. Rather than give the
typical student an in-depth look at two or three narrow subjects, leaving holes in all
the others, many schools have revised the programming languages and computer
organization courses to cover a wider range of topics, with follow-on electives
in various specializations. This trend is very much in keeping with the findings
of the ACM/IEEE-CS Computing Curricula 2001 task force, which emphasize the
growth of the field, the increasing need for breadth, the importance of flexibility
in curricular design, and the overriding goal of graduating students who “have
a system-level perspective, appreciate the interplay between theory and practice,
are familiar with common themes, and can adapt over time as the field evolves”
[CR01, Sec. 11.1, adapted].
The first two editions of Programming Language Pragmatics (PLP-1e and -2e)
had the good fortune of riding this curricular trend. This third edition continues
and strengthens the emphasis on integrated learning while retaining a central
focus on programming language design.
At its core, PLP is a book about how programming languages work. Rather than
enumerate the details of many different languages, it focuses on concepts that
underlie all the languages the student is likely to encounter, illustrating those
concepts with a variety of concrete examples, and exploring the tradeoffs that
explain why different languages were designed in different ways. Similarly, rather
than explain how to build a compiler or interpreter (a task few programmers will
undertake in its entirety), PLP focuses on what a compiler does to an input program, and why. Language design and implementation are thus explored together,
with an emphasis on the ways in which they interact.
Changes in the Third Edition
In comparison to the second edition, PLP-3e provides
1. A new chapter on virtual machines and run-time program management
2. A major revision of the chapter on concurrency
3. Numerous other reflections of recent changes in the field
4. Improvements inspired by instructor feedback or a fresh consideration of familiar topics
Item 1 in this list is perhaps the most visible change. It reflects the increasingly
ubiquitous use of both managed code and scripting languages. Chapter 15 begins
with a general overview of virtual machines and then takes a detailed look at
the two most widely used examples: the JVM and the CLI. The chapter also
covers dynamic compilation, binary translation, reflection, debuggers, profilers,
and other aspects of the increasingly sophisticated run-time machinery found in
modern language systems.
Item 2 also reflects the evolving nature of the field. With the proliferation
of multicore processors, concurrent languages have become increasingly important to mainstream programmers, and the field is very much in flux. Changes to
Chapter 12 (Concurrency) include new sections on nonblocking synchronization,
memory consistency models, and software transactional memory, as well as
increased coverage of OpenMP, Erlang, Java 5, and Parallel FX for .NET.
Other new material (Item 3) appears throughout the text. Section 5.4.4 covers
the multicore revolution from an architectural perspective. Section 8.7 covers
event handling, in both sequential and concurrent languages. In Section 14.2,
coverage of gcc internals includes not only RTL, but also the newer GENERIC
and Gimple intermediate forms. References have been updated throughout to
accommodate such recent developments as Java 6, C++ ’0X, C# 3.0, F#, Fortran
2003, Perl 6, and Scheme R6RS.
Finally, Item 4 encompasses improvements to almost every section of the
text. Topics receiving particularly heavy updates include the running example
of Chapter 1 (moved from Pascal/MIPS to C/x86); bootstrapping (Section 1.4);
scanning (Section 2.2); table-driven parsing (Sections 2.3.2 and 2.3.3); closures
(Sections 3.6.2, 3.6.3, 8.3.1, 8.4.4, 8.7.2, and 9.2.3); macros (Section 3.7); evaluation order and strictness (Sections 6.6.2 and 10.4); decimal types (Section 7.1.4);
array shape and allocation (Section 7.4.2); parameter passing (Section 8.3); inner
(nested) classes (Section 9.2.3); monads (Section 10.4.2); and the Prolog examples
of Chapter 11 (now ISO conformant).
To accommodate new material, coverage of some topics has been condensed. Examples include modules (Chapters 3 and 9), loop control (Chapter 6),
packed types (Chapter 7), the Smalltalk class hierarchy (Chapter 9), metacircular interpretation (Chapter 10), interconnection networks (Chapter 12), and
thread creation syntax (also Chapter 12). Additional material has moved to the
companion CD. This includes all of Chapter 5 (Target Machine Architecture),
unions (Section 7.3.4), dangling references (Section 7.7.2), message passing
(Section 12.5), and XSLT (Section 13.3.5). Throughout the text, examples
drawn from languages no longer in widespread use have been replaced with more
recent equivalents wherever appropriate.
Overall, the printed text has grown by only some 30 pages, but there are nearly
100 new pages on the CD. There are also 14 more “Design & Implementations”
sidebars, more than 70 new numbered examples, a comparable number of new
“Check Your Understanding” questions, and more than 60 new end-of-chapter
exercises and explorations. Considerable effort has been invested in creating a
consistent and comprehensive index. As in earlier editions, Morgan Kaufmann
has maintained its commitment to providing definitive texts at reasonable
cost: PLP-3e is less expensive than competing alternatives, but larger and more
comprehensive.
The PLP CD - See Note on page xxx
To minimize the physical size of the text, make way for new material, and allow
students to focus on the fundamentals when browsing, approximately 350 pages
of more advanced or peripheral material appears on the PLP CD. Each CD section
is represented in the main text by a brief introduction to the subject and an “In
More Depth” paragraph that summarizes the elided material.
Note that placement of material on the CD does not constitute a judgment
about its technical importance. It simply reflects the fact that there is more material
worth covering than will fit in a single volume or a single semester course. Since
preferences and syllabi vary, most instructors will probably want to assign reading
from the CD, and most will refrain from assigning certain sections of the printed
text. My intent has been to retain in print the material that is likely to be covered
in the largest number of courses.
Also contained on the CD are compilable copies of all significant code fragments
found in the text (in more than two dozen languages) and pointers to on-line
resources.
Design & Implementation Sidebars
Like its predecessors, PLP-3e places heavy emphasis on the ways in which language
design constrains implementation options, and the ways in which anticipated
implementations have influenced language design. Many of these connections and
interactions are highlighted in some 135 “Design & Implementations” sidebars.
A more detailed introduction to these sidebars appears on page 9 (Chapter 1).
A numbered list appears in Appendix B.
Numbered and Titled Examples
Examples in PLP-3e are intimately woven into the flow of the presentation. To
make it easier to find specific examples, to remember their content, and to refer
to them in other contexts, a number and a title for each is displayed in a marginal
note. There are nearly 1000 such examples across the main text and the CD. A
detailed list appears in Appendix C.
Exercise Plan
Review questions appear throughout the text at roughly 10-page intervals, at the
ends of major sections. These are based directly on the preceding material, and
have short, straightforward answers.
More detailed questions appear at the end of each chapter. These are
divided into Exercises and Explorations. The former are generally more challenging than the per-section review questions, and should be suitable for homework or brief projects. The latter are more open-ended, requiring web or
library research, substantial time commitment, or the development of subjective opinion. Solutions to many of the exercises (but not the explorations)
are available to registered instructors from a password-protected web site: visit
textbooks.elsevier.com/web/9780123745149.
How to Use the Book
Programming Language Pragmatics covers almost all of the material in the PL
“knowledge units” of the Computing Curricula 2001 report [CR01]. The book is
an ideal fit for the CS 341 model course (Programming Language Design), and
can also be used for CS 340 (Compiler Construction) or CS 343 (Programming
Paradigms). It contains a significant fraction of the content of CS 344 (Functional
Programming) and CS 346 (Scripting Languages). Figure 0.1 illustrates several
possible paths through the text.
Figure 0.1
Paths through the text. Darker shaded regions indicate supplemental “In More Depth” sections on the PLP CD.
Section numbers are shown for breaks that do not correspond to supplemental material.
Chapters shown: Part I: 1 Intro, 2 Syntax, 3 Names, 4 Semantics, 5 Architecture; Part II: 6 Control, 7 Types,
8 Subroutines, 9 Objects; Part III: 10 Functional, 11 Logic, 12 Concurrency, 13 Scripting; Part IV: 14 CodeGen,
15 Runtime, 16 Improvement. Section numbers marking breaks: 2.2, 2.3.2, 2.3.3, 8.3, 14.5, 15.2.
F: The full-year/self-study plan
R: The one-semester Rochester plan
P: The traditional Programming Languages plan;
   would also de-emphasize implementation material
   throughout the chapters shown
C: The compiler plan; would also de-emphasize design material
   throughout the chapters shown
Q: The 1+2 quarter plan: an overview quarter and two independent, optional
   follow-on quarters, one language-oriented, the other compiler-oriented
Shading key: supplemental (CD) section; to be skimmed by students in need of review
For self-study, or for a full-year course (track F in Figure 0.1), I recommend
working through the book from start to finish, turning to the PLP CD as each “In
More Depth” section is encountered. The one-semester course at the University of
Rochester (track R), for which the text was originally developed, also covers most
of the book, but leaves out most of the CD sections, as well as bottom-up parsing
(2.3.3) and the second halves of Chapters 14 (Building a Runnable Program)
and 15 (Run-time Program Management).
Some chapters (2, 4, 5, 14, 15, 16) have a heavier emphasis than others on
implementation issues. These can be reordered to a certain extent with respect
to the more design-oriented chapters. Many students will already be familiar
with much of the material in Chapter 5, most likely from a course on computer
organization; hence the placement of the chapter on the PLP CD. Some students
may also be familiar with some of the material in Chapter 2, perhaps from a course
on automata theory. Much of this chapter can then be read quickly as well, pausing
perhaps to dwell on such practical issues as recovery from syntax errors, or the
ways in which a scanner differs from a classical finite automaton.
A traditional programming languages course (track P in Figure 0.1) might leave
out all of scanning and parsing, plus all of Chapter 4. It would also de-emphasize
the more implementation-oriented material throughout. In place of these it could
add such design-oriented CD sections as the ML type system (7.2.4), multiple
inheritance (9.5), Smalltalk (9.6.1), lambda calculus (10.6), and predicate calculus
(11.3).
PLP has also been used at some schools for an introductory compiler course
(track C in Figure 0.1). The typical syllabus leaves out most of Part III (Chapters 10
through 13), and de-emphasizes the more design-oriented material throughout.
In place of these it includes all of scanning and parsing, Chapters 14 through 16,
and a slightly different mix of other CD sections.
For a school on the quarter system, an appealing option is to offer an introductory one-quarter course and two optional follow-on courses (track Q in Figure 0.1). The introductory quarter might cover the main (non-CD) sections of
Chapters 1, 3, 6, and 7, plus the first halves of Chapters 2 and 8. A language-oriented follow-on quarter might cover the rest of Chapter 8, all of Part III, CD
sections from Chapters 6 through 8, and possibly supplemental material on formal
semantics, type systems, or other related topics. A compiler-oriented follow-on
quarter might cover the rest of Chapter 2, Chapters 4–5 and 14–16, CD sections
from Chapters 3 and 8–9, and possibly supplemental material on automatic code
generation, aggressive code improvement, programming tools, and so on.
Whatever the path through the text, I assume that the typical reader has already
acquired significant experience with at least one imperative language. Exactly
which language it is shouldn’t matter. Examples are drawn from a wide variety of
languages, but always with enough comments and other discussion that readers
without prior experience should be able to understand easily. Single-paragraph
introductions to more than 50 different languages appear in Appendix A. Algorithms, when needed, are presented in an informal pseudocode that should be
self-explanatory. Real programming language code is set in "typewriter" font.
Pseudocode is set in a sans-serif font.
Supplemental Materials
In addition to supplemental sections, the PLP CD contains a variety of other
resources, including
Links to language reference manuals and tutorials on the Web
Links to open-source compilers and interpreters
Complete source code for all nontrivial examples in the book
A search engine for both the main text and the CD-only content
Additional resources are available on-line at textbooks.elsevier.com/web/9780123745149
(you may wish to check back from time to time). For instructors who have adopted the text, a password-protected page provides access to
Editable PDF source for all the figures in the book
Editable PowerPoint slides
Solutions to most of the exercises
Suggestions for larger projects
Acknowledgments for the Third Edition
In preparing the third edition I have been blessed with the generous assistance
of a very large number of people. Many provided errata or other feedback on
the second edition, among them Gerald Baumgartner, Manuel E. Bermudez,
William Calhoun, Betty Cheng, Yi Dai, Eileen Head, Nathan Hoot, Peter Ketcham,
Antonio Leitao, Jingke Li, Annie Liu, Dan Mullowney, Arthur Nunes-Harwitt,
Zongyan Qiu, Beverly Sanders, David Sattari, Parag Tamhankar, Ray Toal, Robert
van Engelen, Garrett Wollman, and Jingguo Yao. In several cases, good advice from
the 2004 class test went unheeded in the second edition due to lack of time; I am
glad to finally have the chance to incorporate it here. I also remain indebted to
the many individuals acknowledged in the first and second editions, and to the
reviewers, adopters, and readers who made those editions a success.
External reviewers for the third edition provided a wealth of useful suggestions; my thanks to Perry Alexander (University of Kansas), Hans Boehm (HP
Labs), Stephen Edwards (Columbia University), Tim Harris (Microsoft Research),
Eileen Head (Binghamton University), Doug Lea (SUNY Oswego), Jan-Willem
Maessen (Sun Microsystems Laboratories), Maged Michael (IBM Research),
Beverly Sanders (University of Florida), Christopher Vickery (Queens College,
City University of New York), and Garrett Wollman (MIT). Hans, Doug, and
Maged proofread parts of Chapter 12 on very short notice; Tim and Jan were
equally helpful with parts of Chapter 10. Mike Spear helped vet the transactional memory implementation of Figure 12.18. Xiao Zhang provided pointers for Section 15.3.3. Problems that remain in all these sections are entirely
my own.
In preparing the third edition, I have drawn on 20 years of experience teaching
this material to upper-level undergraduates at the University of Rochester. I am
grateful to all my students for their enthusiasm and feedback. My thanks as well
to my colleagues and graduate students, and to the department’s administrative,
secretarial, and technical staff for providing such a supportive and productive work
environment. Finally, my thanks to Barbara Ryder, whose forthright comments
on the first edition helped set me on the path to the second; I am honored to have
her as the author of the Foreword.
xxx
Preface
As they were on previous editions, the staff at Morgan Kaufmann have been a
genuine pleasure to work with, on both a professional and a personal level. My
thanks in particular to Nate McFadden, Senior Development Editor, who shepherded both this and the previous edition with unfailing patience, good humor,
and a fine eye for detail; to Marilyn Rash, who managed the book’s production;
and to Denise Penrose, whose gracious stewardship, first as Editor and then as
Publisher, has had a lasting impact.
Most important, I am indebted to my wife, Kelly, and our daughters, Erin and
Shannon, for their patience and support through endless months of writing and
revising. Computing is a fine profession, but family is what really matters.
Michael L. Scott
Rochester, NY
December 2008
PLP CD Content on a Companion Web Site
All content originally included on a CD is now available at this book’s companion
web site. Please visit the URL: http://www.elsevierdirect.com/9780123745149 and
click on “Companion Site.”
Part I Foundations
A central premise of Programming Language Pragmatics is that language design and implementation are intimately connected; it’s hard to study one without the other.
The bulk of the text—Parts II and III—is organized around topics in language design, but
with detailed coverage throughout of the many ways in which design decisions have been shaped
by implementation concerns.
The first five chapters—Part I—set the stage by covering foundational material in both
design and implementation. Chapter 1 motivates the study of programming languages, introduces the major language families, and provides an overview of the compilation process. Chapter 3 covers the high-level structure of programs, with an emphasis on names, the binding of
names to objects, and the scope rules that govern which bindings are active at any given time.
In the process it touches on storage management; subroutines, modules, and classes; polymorphism; and separate compilation.
Chapters 2, 4, and 5 are more implementation oriented. They provide the background
needed to understand the implementation issues mentioned in Parts II and III. Chapter 2
discusses the syntax, or textual structure, of programs. It introduces regular expressions and
context-free grammars, which designers use to describe program syntax, together with the scanning and parsing algorithms that a compiler or interpreter uses to recognize that syntax. Given
an understanding of syntax, Chapter 4 explains how a compiler (or interpreter) determines
the semantics, or meaning, of a program. The discussion is organized around the notion of
attribute grammars, which serve to map a program onto something else that has meaning,
such as mathematics or some other existing language. Finally, Chapter 5 provides an overview
of assembly-level computer architecture, focusing on the features of modern microprocessors
most relevant to compilers. Programmers who understand these features have a better chance
not only of understanding why the languages they use were designed the way they were, but
also of using those languages as fully and effectively as possible.
1 Introduction
EXAMPLE 1.1 GCD program in x86 machine language
The first electronic computers were monstrous contraptions, filling
several rooms, consuming as much electricity as a good-size factory, and costing millions of 1940s dollars (but with the computing power of a modern
hand-held calculator). The programmers who used these machines believed that
the computer’s time was more valuable than theirs. They programmed in machine
language. Machine language is the sequence of bits that directly controls a processor, causing it to add, compare, move data from one place to another, and
so forth at appropriate times. Specifying programs at this level of detail is an
enormously tedious task. The following program calculates the greatest common
divisor (GCD) of two integers, using Euclid’s algorithm. It is written in machine
language, expressed here as hexadecimal (base 16) numbers, for the x86 (Pentium)
instruction set.
    55 89 e5 53   83 ec 04 83   e4 f0 e8 31   00 00 00 89   c3 e8 2a 00
    00 00 39 c3   74 10 8d b6   00 00 00 00   39 c3 7e 13   29 c3 39 c3
    75 f6 89 1c   24 e8 6e 00   00 00 8b 5d   fc c9 c3 29   d8 eb eb 90
EXAMPLE 1.2 GCD program in x86 assembler
As people began to write larger programs, it quickly became apparent that a less
error-prone notation was required. Assembly languages were invented to allow
operations to be expressed with mnemonic abbreviations. Our GCD program
looks like this in x86 assembly language:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ebx
        subl    $4, %esp
        andl    $-16, %esp
        call    getint
        movl    %eax, %ebx
        call    getint
        cmpl    %eax, %ebx
        je      C
    A:  cmpl    %eax, %ebx
        jle     D
        subl    %eax, %ebx
    B:  cmpl    %eax, %ebx
        jne     A
    C:  movl    %ebx, (%esp)
        call    putint
        movl    -4(%ebp), %ebx
        leave
        ret
    D:  subl    %ebx, %eax
        jmp     B
Programming Language Pragmatics. DOI: 10.1016/B978-0-12-374514-9.00010-0
Copyright © 2009 by Elsevier Inc. All rights reserved.
Assembly languages were originally designed with a one-to-one correspondence between mnemonics and machine language instructions, as shown in this
example.1 Translating from mnemonics to machine language became the job of a
systems program known as an assembler. Assemblers were eventually augmented
with elaborate “macro expansion” facilities to permit programmers to define
parameterized abbreviations for common sequences of instructions. The correspondence between assembly language and machine language remained obvious
and explicit, however. Programming continued to be a machine-centered enterprise: each different kind of computer had to be programmed in its own assembly
language, and programmers thought in terms of the instructions that the machine
would actually execute.
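The correspondence between the two listings can be checked mechanically. The following is a minimal sketch in Python, not part of the original example: it assumes the 60 bytes of Example 1.1 are laid out from offset 0, and the byte offsets of the four jumps and of labels A through D were worked out by hand from the listing rather than taken from the text. Each two-byte short jump holds an opcode followed by a signed 8-bit displacement, measured from the start of the next instruction.

```python
# Machine code of the GCD program from Example 1.1 (60 bytes).
CODE = bytes.fromhex(
    "55 89 e5 53 83 ec 04 83 e4 f0 e8 31 00 00 00 89 c3 e8 2a 00 "
    "00 00 39 c3 74 10 8d b6 00 00 00 00 39 c3 7e 13 29 c3 39 c3 "
    "75 f6 89 1c 24 e8 6e 00 00 00 8b 5d fc c9 c3 29 d8 eb eb 90"
)

# x86 opcodes for the short (8-bit displacement) jumps used in the program.
SHORT_JUMPS = {0x74: "je", 0x75: "jne", 0x7E: "jle", 0xEB: "jmp"}

# Assumption: label offsets derived by hand from the assembly listing.
LABELS = {32: "A", 38: "B", 42: "C", 55: "D"}


def jump_target(code, offset):
    """Return (mnemonic, target offset) for the two-byte short jump at offset."""
    mnemonic = SHORT_JUMPS[code[offset]]
    disp = code[offset + 1]
    if disp >= 0x80:            # interpret the displacement as signed 8-bit
        disp -= 0x100
    return mnemonic, offset + 2 + disp


# Assumption: the four jumps sit at these byte offsets in the listing.
for off in (24, 34, 40, 57):
    mnemonic, target = jump_target(CODE, off)
    print(f"{off:2d}: {mnemonic:<3} {LABELS[target]}")

# The three cmpl %eax, %ebx instructions all share the encoding 39 c3.
print(CODE.count(bytes.fromhex("39c3")))
```

The decoded targets land on labels C, D, A, and B, in that order, which matches the jump operands in the assembly listing.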
As computers evolved, and as competing designs developed, it became increasingly frustrating to have to rewrite programs for every new machine. It also became
increasingly difficult for human beings to keep track of the wealth of detail in large
assembly language programs. People began to wish for a machine-independent
language, particularly one in which numerical computations (the most common
type of program in those days) could be expressed in something more closely
resembling mathematical formulae. These wishes led in the mid-1950s to the
development of the original dialect of Fortran, the first arguably high-level programming language. Other high-level languages soon followed, notably Lisp and
Algol.
Translating from a high-level language to assembly or machine language is the
job of a systems program known as a compiler.2 Compilers are substantially more
complicated than assemblers because the one-to-one correspondence between
source and target operations no longer exists when the source is a high-level
language. Fortran was slow to catch on at first, because human programmers,
with some effort, could almost always write assembly language programs that
would run faster than what a compiler could produce. Over time, however, the
performance gap has narrowed, and eventually reversed. Increases in hardware
complexity (due to pipelining, multiple functional units, etc.) and continuing
improvements in compiler technology have led to a situation in which a state-of-the-art compiler will usually generate better code than a human being will.
Even in cases in which human beings can do better, increases in computer speed
and program size have made it increasingly important to economize on programmer effort, not only in the original construction of programs, but in subsequent
program maintenance—enhancement and correction. Labor costs now heavily
outweigh the cost of computing hardware.
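For comparison, here is roughly what the GCD computation of Examples 1.1 and 1.2 looks like in a high-level language. This is a sketch in Python rather than code from the text. It takes its inputs as parameters instead of calling getint and putint, and it assumes both arguments are positive integers (with a zero argument the loop would not terminate).

```python
def gcd(a, b):
    """Greatest common divisor of two positive integers, by repeated subtraction."""
    while a != b:
        if a > b:
            a = a - b       # corresponds to subl %eax, %ebx in the assembly
        else:
            b = b - a       # corresponds to subl %ebx, %eax at label D
    return a


print(gcd(48, 18))
```

The loop and the two-way branch replace the four labels and four jumps of the assembly version, and nothing in the source mentions registers or the stack.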
1 The 22 lines of assembly code in the example are encoded in varying numbers of bytes in machine
language. The three cmp (compare) instructions, for example, all happen to have the same register
operands, and are encoded in the two-byte sequence (39 c3). The four mov (move) instructions
have different operands and lengths, and begin with 89 or 8b. The chosen syntax is that of the
GNU gcc compiler suite, in which results overwrite the last operand, not the first.
2 High-level languages may also be interpreted directly, without the translation step. We will return
to this option in Section 1.4. It is the principal way in which scripting languages like Python and
JavaScript are implemented.
1.1 The Art of Language Design
Today there are thousands of high-level programming languages, and new ones
continue to emerge. Human beings use assembly language only for special-purpose applications. In a typical undergraduate class, it is not uncommon to
find users of scores of different languages. Why are there so many? There are
several possible answers:
Evolution. Computer science is a young discipline; we’re constantly finding better
ways to do things. The late 1960s and early 1970s saw a revolution in “structured programming,” in which the goto-based control flow of languages like
Fortran, Cobol, and Basic3 gave way to while loops, case (switch) statements,
and similar higher level constructs. In the late 1980s the nested block structure
of languages like Algol, Pascal, and Ada began to give way to the object-oriented
structure of Smalltalk, C++, Eiffel, and the like.
Special Purposes. Many languages were designed for a specific problem domain.
The various Lisp dialects are good for manipulating symbolic data and complex
data structures. Icon and Awk are good for manipulating character strings. C is
good for low-level systems programming. Prolog is good for reasoning about
logical relationships among data. Each of these languages can be used successfully for a wider range of tasks, but the emphasis is clearly on the specialty.
Personal Preference. Different people like different things. Much of the parochialism of programming is simply a matter of taste. Some people love the terseness
of C; some hate it. Some people find it natural to think recursively; others prefer iteration. Some people like to work with pointers; others prefer the implicit
dereferencing of Lisp, Clu, Java, and ML. The strength and variety of personal
preference make it unlikely that anyone will ever develop a universally acceptable programming language.
Of course, some languages are more successful than others. Of the many that
have been designed, only a few dozen are widely used. What makes a language
successful? Again there are several answers:
Expressive Power. One commonly hears arguments that one language is more
“powerful” than another, though in a formal mathematical sense they are all
Turing complete—each can be used, if awkwardly, to implement arbitrary algorithms. Still, language features clearly have a huge impact on the programmer’s
ability to write clear, concise, and maintainable code, especially for very large
systems. There is no comparison, for example, between early versions of Basic
on the one hand, and Common Lisp or Ada on the other. The factors that
contribute to expressive power—abstraction facilities in particular—are a
major focus of this book.
3 The names of these languages are sometimes written entirely in uppercase letters and sometimes
in mixed case. For consistency’s sake, I adopt the convention in this book of using mixed case for
languages whose names are pronounced as words (e.g., Fortran, Cobol, Basic), and uppercase for
those pronounced as a series of letters (e.g., APL, PL/I, ML).
Ease of Use for the Novice. While it is easy to pick on Basic, one cannot deny its
success. Part of that success is due to its very low “learning curve.” Logo is
popular among elementary-level educators for a similar reason: even a 5-year-old can learn it. Pascal was taught for many years in introductory programming
language courses because, at least in comparison to other “serious” languages,
it is compact and easy to learn. In recent years Java has come to play a similar
role. Though substantially more complex than Pascal, it is much simpler than,
say, C++.
Ease of Implementation. In addition to its low learning curve, Basic is successful because it could be implemented easily on tiny machines, with limited
resources. Forth has a small but dedicated following for similar reasons.
Arguably the single most important factor in the success of Pascal was that
its designer, Niklaus Wirth, developed a simple, portable implementation of
the language, and shipped it free to universities all over the world (see Example 1.15).4 The Java designers took similar steps to make their language available
for free to almost anyone who wants it.
Standardization. Almost every widely used language has an official international
standard or (in the case of several scripting languages) a single canonical
implementation; and in the latter case the canonical implementation is almost
invariably written in a language that has a standard. Standardization—of both
the language and a broad set of libraries—is the only truly effective way
to ensure the portability of code across platforms. The relatively impoverished standard for Pascal, which is missing several features considered essential by many programmers (separate compilation, strings, static initialization,
random-access I/O), is at least partially responsible for the language’s drop
from favor in the 1980s. Many of these features were implemented in different
ways by different vendors.
Open Source. Most programming languages today have at least one open-source
compiler or interpreter, but some languages—C in particular—are much
more closely associated than others with freely distributed, peer-reviewed,
community-supported computing. C was originally developed in the early
1970s by Dennis Ritchie and Ken Thompson at Bell Labs,5 in conjunction
with the design of the original Unix operating system. Over the years Unix
evolved into the world’s most portable operating system—the OS of choice
for academic computer science—and C was closely associated with it. With
the standardization of C, the language has become available on an enormous
variety of additional platforms. Linux, the leading open-source operating system, is written in C. As of October 2008, C and its descendants account for 66%
of the projects hosted at the sourceforge.net repository.
4 Niklaus Wirth (1934–), Professor Emeritus of Informatics at ETH in Zürich, Switzerland, is
responsible for a long line of influential languages, including Euler, Algol W, Pascal, Modula,
Modula-2, and Oberon. Among other things, his languages introduced the notions of enumeration, subrange, and set types, and unified the concepts of records (structs) and variants (unions).
He received the annual ACM Turing Award, computing’s highest honor, in 1984.
5 Ken Thompson (1943–) led the team that developed Unix. He also designed the B programming
language, a child of BCPL and the parent of C. Dennis Ritchie (1941–) was the principal force
behind the development of C itself. Thompson and Ritchie together formed the core of an
incredibly productive and influential group. They shared the ACM Turing Award in 1983.
Excellent Compilers. Fortran owes much of its success to extremely good compilers. In part this is a matter of historical accident. Fortran has been around
longer than anything else, and companies have invested huge amounts of time
and money in making compilers that generate very fast code. It is also a matter
of language design, however: Fortran dialects prior to Fortran 90 lack recursion and pointers, features that greatly complicate the task of generating fast
code (at least for programs that can be written in a reasonable fashion without
them!). In a similar vein, some languages (e.g., Common Lisp) are successful
in part because they have compilers and supporting tools that do an unusually
good job of helping the programmer manage very large projects.
Economics, Patronage, and Inertia. Finally, there are factors other than technical
merit that greatly influence success. The backing of a powerful sponsor is one.
PL/I, at least to first approximation, owes its life to IBM. Cobol and, more
recently, Ada owe their life to the U.S. Department of Defense: Ada contains a
wealth of excellent features and ideas, but the sheer complexity of implementation would likely have killed it if not for the DoD backing. Similarly, C#, despite
its technical merits, would probably not have received the attention it has without the backing of Microsoft. At the other end of the life cycle, some languages
remain widely used long after “better” alternatives are available because of a
huge base of installed software and programmer expertise, which would cost
too much to replace.
DESIGN & IMPLEMENTATION
Introduction
Throughout the book, sidebars like this one will highlight the interplay of
language design and language implementation. Among other things, we will
consider the following.
Cases (such as those mentioned in this section) in which ease or difficulty
of implementation significantly affected the success of a language
Language features that many designers now believe were mistakes, at least
in part because of implementation difficulties
Potentially useful features omitted from some languages because of concern
that they might be too difficult or slow to implement
Language features introduced at least in part to facilitate efficient or elegant
implementations
Cases in which a machine architecture makes reasonable features unreasonably expensive
Various other tradeoffs in which implementation plays a significant role
A complete list of sidebars appears in Appendix B.
Clearly no single factor determines whether a language is “good.” As we study
programming languages, we shall need to consider issues from several points of
view. In particular, we shall need to consider the viewpoints of both the programmer and the language implementor. Sometimes these points of view will be in
harmony, as in the desire for execution speed. Often, however, there will be conflicts and tradeoffs, as the conceptual appeal of a feature is balanced against the
cost of its implementation. The tradeoff becomes particularly thorny when the
implementation imposes costs not only on programs that use the feature, but also
on programs that do not.
In the early days of computing the implementor’s viewpoint was predominant.
Programming languages evolved as a means of telling a computer what to do. For
programmers, however, a language is more aptly defined as a means of expressing algorithms. Just as natural languages constrain exposition and discourse, so
programming languages constrain what can and cannot easily be expressed, and
have both profound and subtle influence over what the programmer can think.
Donald Knuth has suggested that programming be regarded as the art of telling
another human being what one wants the computer to do [Knu84].6 This definition perhaps strikes the best sort of compromise. It acknowledges that both
conceptual clarity and implementation efficiency are fundamental concerns. This
book attempts to capture this spirit of compromise, by simultaneously considering
the conceptual and implementation aspects of each of the topics it covers.
1.2 The Programming Language Spectrum
EXAMPLE 1.3 Classification of programming languages
The many existing languages can be classified into families based on their model of
computation. Figure 1.1 shows a common set of families. The top-level division
distinguishes between the declarative languages, in which the focus is on what the
computer is to do, and the imperative languages, in which the focus is on how the
computer should do it.
Declarative languages are in some sense “higher level”; they are more in tune
with the programmer’s point of view, and less with the implementor’s point
of view. Imperative languages predominate, however, mainly for performance
reasons. There is a tension in the design of declarative languages between the desire
to get away from “irrelevant” implementation details, and the need to remain close
enough to the details to at least control the outline of an algorithm. The design of
efficient algorithms, after all, is what much of computer science is about. It is not
yet clear to what extent, and in what problem domains, we can expect compilers to
discover good algorithms for problems stated at a very high level of abstraction. In
any domain in which the compiler cannot find a good algorithm, the programmer
needs to be able to specify one explicitly.
6 Donald E. Knuth (1938–), Professor Emeritus at Stanford University and one of the foremost
figures in the design and analysis of algorithms, is also widely known as the inventor of the TeX
typesetting system (with which this book was produced) and of the literate programming methodology with which TeX was constructed. His multivolume The Art of Computer Programming has
an honored place on the shelf of most professional computer scientists. He received the ACM
Turing Award in 1974.

declarative
    functional                  Lisp/Scheme, ML, Haskell
    dataflow                    Id, Val
    logic, constraint-based     Prolog, spreadsheets
    template-based              XSLT
imperative
    von Neumann                 C, Ada, Fortran, . . .
    scripting                   Perl, Python, PHP, . . .
    object-oriented             Smalltalk, Eiffel, Java, . . .
Figure 1.1 Classification of programming languages. Note that the categories are fuzzy, and
open to debate. In particular, it is possible for a functional language to be object-oriented, and
many authors do not consider functional programming to be declarative.
Within the declarative and imperative families, there are several important
subclasses.
Functional languages employ a computational model based on the recursive
definition of functions. They take their inspiration from the lambda calculus,
a formal computational model developed by Alonzo Church in the 1930s. In
essence, a program is considered a function from inputs to outputs, defined in
terms of simpler functions through a process of refinement. Languages in this
category include Lisp, ML, and Haskell.
Dataflow languages model computation as the flow of information (tokens)
among primitive functional nodes. They provide an inherently parallel model:
nodes are triggered by the arrival of input tokens, and can operate concurrently.
Id and Val are examples of dataflow languages. Sisal, a descendant of Val, is
more often described as a functional language.
Logic- or constraint-based languages take their inspiration from predicate logic.
They model computation as an attempt to find values that satisfy certain
specified relationships, using goal-directed search through a list of logical rules.
Prolog is the best-known logic language. The term is also sometimes applied to
the SQL database language, the XSLT scripting language, and programmable
aspects of spreadsheets such as Excel and its predecessors.
The von Neumann languages are the most familiar and successful. They
include Fortran, Ada 83, C, and all of the others in which the basic means of
computation is the modification of variables.7 Whereas functional languages
are based on expressions that have values, von Neumann languages are based
on statements (assignments in particular) that influence subsequent computation via the side effect of changing the value of memory.
7 John von Neumann (1903–1957) was a mathematician and computer pioneer who helped to
develop the concept of stored program computing, which underlies most computer hardware. In
a stored program computer, both programs and data are represented as bits in memory, which
the processor repeatedly fetches, interprets, and updates.
Scripting languages are a subset of the von Neumann languages. They are
distinguished by their emphasis on “gluing together” components that were
originally developed as independent programs. Several scripting languages
were originally developed for specific purposes: csh and bash, for example,
are the input languages of job control (shell) programs; Awk was intended
for report generation; PHP and JavaScript are primarily intended for the generation of web pages with dynamic content (with execution on the server
and the client, respectively). Other languages, including Perl, Python, Ruby,
and Tcl, are more deliberately general purpose. Most place an emphasis
on rapid prototyping, with a bias toward ease of expression over speed of
execution.
Object-oriented languages trace their roots to Simula 67. Most are closely
related to the von Neumann languages, but have a much more structured
and distributed model of both memory and computation. Rather than picture computation as the operation of a monolithic processor on a monolithic
memory, object-oriented languages picture it as interactions among semi-independent objects, each of which has both its own internal state and subroutines to manage that state. Smalltalk is the purest of the object-oriented
languages; C++ and Java are the most widely used. It is also possible to
devise object-oriented functional languages (the best known of these is the
CLOS [Kee89] extension to Common Lisp), but they tend to have a strong
imperative flavor.
One might suspect that concurrent (parallel) languages also form a separate
class (and indeed this book devotes a chapter to the subject), but the distinction
between concurrent and sequential execution is mostly independent of the classifications above. Most concurrent programs are currently written using special
library packages or compilers in conjunction with a sequential language such as
Fortran or C. A few widely used languages, including Java, C#, and Ada, have
explicitly concurrent features. Researchers are investigating concurrency in each
of the language classes mentioned here.
EXAMPLE 1.4 GCD function in C
As a simple example of the contrast among language classes, consider the greatest common divisor (GCD) problem introduced at the beginning of this chapter.
The choice among, say, von Neumann, functional, or logic programming for
this problem influences not only the appearance of the code, but how the programmer thinks. The von Neumann version of the algorithm is very
imperative:
To compute the gcd of a and b, check to see if a and b are equal. If so, print one of them
and stop. Otherwise, replace the larger one by their difference and repeat.
C code for this algorithm appears at the top of Figure 1.2.
    int gcd(int a, int b) {                          // C
        while (a != b) {
            if (a > b) a = a - b;
            else b = b - a;
        }
        return a;
    }

    (define gcd                                      ; Scheme
      (lambda (a b)
        (cond ((= a b) a)
              ((> a b) (gcd (- a b) b))
              (else (gcd (- b a) a)))))

    gcd(A,B,G) :- A = B, G = A.                      % Prolog
    gcd(A,B,G) :- A > B, C is A-B, gcd(C,B,G).
    gcd(A,B,G) :- B > A, C is B-A, gcd(C,A,G).

Figure 1.2 The GCD algorithm in C (top), Scheme (middle), and Prolog (bottom). All three
versions assume (without checking) that their inputs are positive integers.
EXAMPLE 1.5 GCD function in Scheme
In a functional language, the emphasis is on the mathematical relationship of
outputs to inputs:
The gcd of a and b is defined to be (1) a when a and b are equal, (2) the gcd of b and
a - b when a > b, and (3) the gcd of a and b - a when b > a. To compute the gcd of
a given pair of numbers, expand and simplify this definition until it terminates.
A Scheme version of this algorithm appears in the middle of Figure 1.2. The
keyword lambda introduces a function definition; (a b) is its argument list. The
cond construct is essentially a multiway if . . . then . . . else. The difference of a
and b is written (- a b).
EXAMPLE 1.6 GCD rules in Prolog
In a logic language, the programmer specifies a set of axioms and proof rules
that allows the system to find desired values:
The proposition gcd(a, b, g) is true if (1) a, b, and g are all equal; (2) a is greater
than b and there exists a number c such that c is a - b and gcd(c, b, g) is true; or
(3) a is less than b and there exists a number c such that c is b - a and gcd(c, a,
g) is true. To compute the gcd of a given pair of numbers, search for a number g (and
various numbers c) for which these rules allow one to prove that gcd(a, b, g) is true.
A Prolog version of this algorithm appears at the bottom of Figure 1.2. It may be
easier to understand if one reads “if ” for :- and “and” for commas.
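The "search for a number g" reading of the Prolog rules can be loosely imitated in an imperative language. The sketch below is only an analogy, written in Python rather than Prolog: the hypothetical generator prove_gcd tries each of the three rules in turn and yields every g it can establish. Real Prolog systems rely on unification and backtracking, which are far more general than this.

```python
def prove_gcd(a, b):
    """Yield every g for which gcd(a, b, g) follows from the three rules.

    Illustrative only: assumes a and b are positive integers, as the
    versions in Figure 1.2 do.
    """
    if a == b:                          # rule 1: A = B, G = A
        yield a
    if a > b:                           # rule 2: C is A-B, gcd(C, B, G)
        yield from prove_gcd(a - b, b)
    if b > a:                           # rule 3: C is B-A, gcd(C, A, G)
        yield from prove_gcd(b - a, a)


solutions = list(prove_gcd(12, 18))     # every g the rules can prove
```

Because the three rule conditions are mutually exclusive, at most one rule applies at each step, so the search finds exactly one g for any pair of positive integers.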
It should be emphasized that the distinctions among language classes are not
clear-cut. The division between the von Neumann and object-oriented languages,
for example, is often very fuzzy, and most of the functional and logic languages
include some imperative features. The descriptions above are meant to capture
the general flavor of the classes, without providing formal definitions.
Imperative languages—von Neumann and object-oriented—receive the bulk
of the attention in this book. Many issues cut across family lines, however, and the
interested reader will discover much that is applicable to alternative computational
models in most chapters of the book. Chapters 10 through 13 contain additional
material on functional, logic, concurrent, and scripting languages.
1.3 Why Study Programming Languages?
Programming languages are central to computer science, and to the typical computer science curriculum. Like most car owners, students who have become familiar with one or more high-level languages are generally curious to learn about
other languages, and to know what is going on “under the hood.” Learning about
languages is interesting. It’s also practical.
For one thing, a good understanding of language design and implementation
can help one choose the most appropriate language for any given task. Most languages are better for some things than for others. Few programmers are likely to
choose Fortran for symbolic computing or string processing, but other choices are
not nearly so clear-cut. Should one choose C, C++, or C# for systems programming? Fortran or C for scientific computations? PHP or Ruby for a web-based
application? Ada or C for embedded systems? Visual Basic or Java for a graphical
user interface? This book should help equip you to make such decisions.
Similarly, this book should make it easier to learn new languages. Many languages are closely related. Java and C# are easier to learn if you already know
C++; Common Lisp if you already know Scheme; Haskell if you already know
ML. More important, there are basic concepts that underlie all programming languages. Most of these concepts are the subject of chapters in this book: types, control (iteration, selection, recursion, nondeterminacy, concurrency), abstraction,
and naming. Thinking in terms of these concepts makes it easier to assimilate the
syntax (form) and semantics (meaning) of new languages, compared to picking
them up in a vacuum. The situation is analogous to what happens in natural languages: a good knowledge of grammatical forms makes it easier to learn a foreign
language.
Whatever language you learn, understanding the decisions that went into its
design and implementation will help you use it better. This book should help you
with the following.
Understand obscure features. The typical C++ programmer rarely uses unions,
multiple inheritance, variable numbers of arguments, or the .* operator. (If
you don’t know what these are, don’t worry!) Just as it simplifies the assimilation of new languages, an understanding of basic concepts makes it easier to
understand these features when you look up the details in the manual.
Choose among alternative ways to express things, based on a knowledge of implementation costs. In C++, for example, programmers may need to avoid unnecessary temporary variables, and use copy constructors whenever possible, to
minimize the cost of initialization. In Java they may wish to use Executor
objects rather than explicit thread creation. With certain (poor) compilers,
they may need to adopt special programming idioms to get the fastest code:
pointers for array traversal in C; with statements to factor out common
address calculations in Pascal or Modula-3; x*x instead of x**2 in Basic. In
any language, they need to be able to evaluate the tradeoffs among alternative implementations of abstractions—for example between computation and
table lookup for functions like bit set cardinality, which can be implemented
either way.
Make good use of debuggers, assemblers, linkers, and related tools. In general, the
high-level language programmer should not need to bother with implementation details. There are times, however, when an understanding of those
details proves extremely useful. The tenacious bug or unusual system-building
problem is sometimes a lot easier to handle if one is willing to peek at the
bits.
Simulate useful features in languages that lack them. Certain very useful features
are missing in older languages, but can be emulated by following a deliberate
(if unenforced) programming style. In older dialects of Fortran, for example, programmers familiar with modern control constructs can use comments
and self-discipline to write well-structured code. Similarly, in languages with
poor abstraction facilities, comments and naming conventions can help imitate
modular structure, and the extremely useful iterators of Clu, C#, Python, and
Ruby (which we will study in Section 6.5.3) can be imitated with subroutines
and static variables. In Fortran 77 and other languages that lack recursion, an
iterative program can be derived via mechanical hand transformations, starting
with recursive pseudocode. In languages without named constants or enumeration types, variables that are initialized once and never changed thereafter can
make code much more readable and easy to maintain.
Make better use of language technology wherever it appears. Most programmers
will never design or implement a conventional programming language, but
most will need language technology for other programming tasks. The typical
personal computer contains files in dozens of structured formats, encompassing web content, word processing, spreadsheets, presentations, raster and vector graphics, music, video, databases, and a wide variety of other application
domains. Each of these structured formats has formal syntax and semantics,
which tools must understand. Code to parse, analyze, generate, optimize, and
otherwise manipulate structured data can thus be found in almost any sophisticated program, and all of this code is based on language technology. Programmers with a strong grasp of this technology will be in a better position to write
well-structured, maintainable tools.
In a similar vein, most tools themselves can be customized, via start-up configuration files, command-line arguments, input commands, or built-in extension languages (considered in more detail in Chapter 13). My home directory
holds more than 250 separate configuration (“preference”) files. My personal
configuration files for the emacs text editor comprise more than 1200 lines
of Lisp code. The user of almost any sophisticated program today will need
to make good use of configuration or extension languages. The designers of
such a program will need either to adopt (and adapt) some existing extension language, or to invent new notation of their own. Programmers with a
strong grasp of language theory will be in a better position to design elegant,
well-structured notation that meets the needs of current users and facilitates
future development.
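The tradeoff between computation and table lookup mentioned in the list above, using bit set cardinality as the example, can be made concrete. The following is a minimal sketch in Python, assuming sets are represented as 32-bit integers; the function and table names are illustrative and do not come from the text.

```python
def cardinality_by_computation(bits: int) -> int:
    """Count members by clearing the lowest set bit until none remain."""
    count = 0
    while bits:
        bits &= bits - 1        # removes exactly one 1 bit per iteration
        count += 1
    return count


# Precomputed answers for every possible byte value (256 entries).
BYTE_CARDINALITY = [cardinality_by_computation(b) for b in range(256)]


def cardinality_by_lookup(bits: int) -> int:
    """Count members of a 32-bit set with four table lookups."""
    return (BYTE_CARDINALITY[bits & 0xFF]
            + BYTE_CARDINALITY[(bits >> 8) & 0xFF]
            + BYTE_CARDINALITY[(bits >> 16) & 0xFF]
            + BYTE_CARDINALITY[(bits >> 24) & 0xFF])
```

The first version uses no extra memory but loops once per member; the second does a fixed amount of work per call at the cost of a 256-entry table. Which is faster depends on the machine and the implementation, which is exactly the kind of question the text says a programmer must be able to evaluate.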
Finally, this book should help prepare you for further study in language design
or implementation, should you be so inclined. It will also equip you to understand
the interactions of languages with operating systems and architectures, should
those areas draw your interest.
CHECK YOUR UNDERSTANDING
1. What is the difference between machine language and assembly language?
2. In what way(s) are high-level languages an improvement on assembly language? In what circumstances does it still make sense to program in assembler?
3. Why are there so many programming languages?
4. What makes a programming language successful?
5. Name three languages in each of the following categories: von Neumann,
functional, object-oriented. Name two logic languages. Name two widely used
concurrent languages.
6. What distinguishes declarative languages from imperative languages?
7. What organization spearheaded the development of Ada?
8. What is generally considered the first high-level programming language?
9. What was the first functional language?
10. Why aren’t concurrent languages listed as a category in Figure 1.1?
1.4 Compilation and Interpretation
EXAMPLE 1.7 Pure compilation
At the highest level of abstraction, the compilation and execution of a program in
a high-level language look something like this:

    Source program --> Compiler --> Target program
    Input --> Target program --> Output

The compiler translates the high-level source program into an equivalent target
program (typically in machine language), and then goes away. At some arbitrary
later time, the user tells the operating system to run the target program. The
compiler is the locus of control during compilation; the target program is the locus
of control during its own execution. The compiler is itself a machine language
program, presumably created by compiling some other high-level program. When
written to a file in a format understood by the operating system, machine language
is commonly known as object code.
EXAMPLE 1.8 Pure interpretation
An alternative style of implementation for high-level languages is known as
interpretation.

    Source program, Input --> Interpreter --> Output

Unlike a compiler, an interpreter stays around for the execution of the application. In fact, the interpreter is the locus of control during that execution. In
effect, the interpreter implements a virtual machine whose “machine language”
is the high-level programming language. The interpreter reads statements in that
language more or less one at a time, executing them as it goes along.
In general, interpretation leads to greater flexibility and better diagnostics
(error messages) than does compilation. Because the source code is being executed
directly, the interpreter can include an excellent source-level debugger. It can also
cope with languages in which fundamental characteristics of the program, such
as the sizes and types of variables, or even which names refer to which variables,
can depend on the input data. Some language features are almost impossible to
implement without interpretation: in Lisp and Prolog, for example, a program can
write new pieces of itself and execute them on the fly. (Several scripting languages,
including Perl, Tcl, Python, and Ruby, also provide this capability.) Delaying decisions about program implementation until run time is known as late binding; we
will discuss it at greater length in Section 3.1.
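The behavior of a pure interpreter, as described above, can be sketched in a few lines. The toy language here is invented for illustration (statements of the form `x = 4`, `y = x * 3`, and `print y`); it is not taken from the text. The function interpret reads one statement at a time, executes it immediately, and resolves each variable name only at the moment the statement runs.

```python
def interpret(source: str) -> list:
    """Execute a toy language one statement at a time.

    Hypothetical syntax: `x = 3`, `x = a + b` (also - and *), `print x`.
    Returns the list of printed values.
    """
    variables = {}              # name -> value, consulted on every access
    output = []

    def value_of(token):
        # Late binding: a name is resolved only when its statement runs.
        if token.lstrip("-").isdigit():
            return int(token)
        return variables[token]

    for line in source.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "print" and len(words) == 2:
            output.append(value_of(words[1]))
        elif len(words) == 3 and words[1] == "=":
            variables[words[0]] = value_of(words[2])
        elif len(words) == 5 and words[1] == "=" and words[3] in ("+", "-", "*"):
            left, right = value_of(words[2]), value_of(words[4])
            results = {"+": left + right, "-": left - right, "*": left * right}
            variables[words[0]] = results[words[3]]
        else:
            # The source text is still at hand, so diagnostics can quote it.
            raise SyntaxError(f"cannot interpret: {line!r}")
    return output
```

Because the interpreter holds the source text while it runs, its error message can quote the offending statement directly, which hints at why interpreters tend to give good diagnostics.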
Compilation, by contrast, generally leads to better performance. In general, a
decision made at compile time is a decision that does not need to be made at run
time. For example, if the compiler can guarantee that variable x will always lie
at location 49378, it can generate machine language instructions that access this
location whenever the source program refers to x. By contrast, an interpreter may
need to look x up in a table every time it is accessed, in order to find its location.
Since the (final version of a) program is compiled only once, but generally executed
many times, the savings can be substantial, particularly if the interpreter is doing
unnecessary work in every iteration of a loop.
EXAMPLE 1.9 Mixing compilation and interpretation
While the conceptual difference between compilation and interpretation is
clear, most language implementations include a mixture of both. They typically
look like this:

    Source program --> Translator --> Intermediate program
    Intermediate program, Input --> Virtual machine --> Output

We generally say that a language is “interpreted” when the initial translator is
simple. If the translator is complicated, we say that the language is “compiled.” The
distinction can be confusing because “simple” and “complicated” are subjective
terms, and because it is possible for a compiler (complicated translator) to produce
code that is then executed by a complicated virtual machine (interpreter); this is
in fact precisely what happens by default in Java. We still say that a language
is compiled if the translator analyzes it thoroughly (rather than effecting some
“mechanical” transformation), and if the intermediate program does not bear a
strong resemblance to the source. These two characteristics—thorough analysis
and nontrivial transformation—are the hallmarks of compilation.
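The translator-plus-virtual-machine arrangement can be sketched for a toy language (statements such as `x = 4`, `y = x * 3`, `print y`; the syntax, the tuple format, and the function names are all assumptions made for this illustration). The translator analyzes each statement once, replaces every variable name with a slot number, and rejects uses of unassigned names before anything runs. The virtual machine then executes the intermediate program without any name lookup.

```python
def translate(source: str):
    """Translate toy source into (op, dst, a, b) tuples plus a slot count."""
    slots = {}                  # name -> slot index, used only at translation time

    def operand(token):
        if token.isdigit():
            return ("const", int(token))
        if token not in slots:
            raise NameError(f"{token!r} is used before it is assigned")
        return ("slot", slots[token])

    def destination(name):
        return slots.setdefault(name, len(slots))

    program = []
    for line in source.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "print" and len(words) == 2:
            program.append(("print", None, operand(words[1]), None))
        elif len(words) == 3 and words[1] == "=":
            value = operand(words[2])
            program.append(("move", destination(words[0]), value, None))
        elif len(words) == 5 and words[1] == "=" and words[3] in ("+", "-", "*"):
            left, right = operand(words[2]), operand(words[4])
            program.append((words[3], destination(words[0]), left, right))
        else:
            raise SyntaxError(f"cannot translate: {line!r}")
    return program, len(slots)


def run(program, slot_count):
    """Execute the intermediate program; variables are array slots."""
    memory = [0] * slot_count
    output = []

    def load(operand):
        kind, value = operand
        return value if kind == "const" else memory[value]

    for op, dst, a, b in program:
        if op == "print":
            output.append(load(a))
        elif op == "move":
            memory[dst] = load(a)
        elif op == "+":
            memory[dst] = load(a) + load(b)
        elif op == "-":
            memory[dst] = load(a) - load(b)
        else:                   # "*"
            memory[dst] = load(a) * load(b)
    return output
```

Whether one calls this translator a compiler is, as the text says, a matter of degree: it does a little analysis (name resolution and a use-before-assignment check), but its output still mirrors the source closely.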
In practice one sees a broad spectrum of implementation strategies:
EXAMPLE 1.10 Preprocessing
Most interpreted languages employ an initial translator (a preprocessor) that
removes comments and white space, and groups characters together into tokens
such as keywords, identifiers, numbers, and symbols. The translator may
also expand abbreviations in the style of a macro assembler. Finally, it may
identify higher-level syntactic structures, such as loops and subroutines. The
goal is to produce an intermediate form that mirrors the structure of the source,
but can be interpreted more efficiently.
In some very early implementations of Basic, the manual actually suggested
removing comments from a program in order to improve its performance.
These implementations were pure interpreters; they would re-read (and then
ignore) the comments every time they executed a given part of the program.
They had no initial translator.

DESIGN & IMPLEMENTATION
Compiled and interpreted languages
Certain languages (e.g., Smalltalk and Python) are sometimes referred to as
“interpreted languages” because most of their semantic error checking must be
performed at run time. Certain other languages (e.g., Fortran and C) are sometimes referred to as “compiled languages” because almost all of their semantic
error checking can be performed statically. This terminology isn’t strictly correct: interpreters for C and Fortran can be built easily, and a compiler can
generate code to perform even the most extensive dynamic semantic checks.
That said, language design has a profound effect on “compilability.”

EXAMPLE 1.11 Library routines and linking
The typical Fortran implementation comes close to pure compilation. The
compiler translates Fortran source into machine language. Usually, however,
it counts on the existence of a library of subroutines that are not part of the
original program. Examples include mathematical functions (sin, cos, log,
etc.) and I/O. The compiler relies on a separate program, known as a linker, to
merge the appropriate library routines into the final program:

    Fortran program --> Compiler --> Incomplete machine language
    Incomplete machine language, Library routines --> Linker --> Machine language program

In some sense, one may think of the library routines as extensions to the
hardware instruction set. The compiler can then be thought of as generating
code for a virtual machine that includes the capabilities of both the hardware
and the library.
In a more literal sense, one can find interpretation in the Fortran routines
for formatted output. Fortran permits the use of format statements that control the alignment of output in columns, the number of significant digits and
type of scientific notation for floating-point numbers, inclusion/suppression
of leading zeros, and so on. Programs can compute their own formats on the fly.
The output library routines include a format interpreter. A similar interpreter
can be found in the printf routine of C and its descendants.
EXAMPLE 1.12 Post-compilation assembly
Many compilers generate assembly language instead of machine language.
This convention facilitates debugging, since assembly language is easier for
people to read, and isolates the compiler from changes in the format of
machine language files that may be mandated by new releases of the operating system (only the assembler must be changed, and it is shared by many
compilers).

    Source program --> Compiler --> Assembly language --> Assembler --> Machine language

EXAMPLE 1.13 The C preprocessor
Compilers for C (and for many other languages running under Unix) begin
with a preprocessor that removes comments and expands macros. The preprocessor can also be instructed to delete portions of the code itself, providing a
conditional compilation facility that allows several versions of a program to be
built from the same source.

    Source program --> Preprocessor --> Modified source program --> Compiler --> Assembly language

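Conditional compilation of the kind described above can be imitated with a very small line-oriented filter. The sketch below is written in Python and is not the C preprocessor: it handles only `#ifdef`, `#else`, and `#endif`, and the function name preprocess and its `defined` parameter are assumptions made for illustration. It matches directive text and never examines the code it passes through.

```python
def preprocess(lines, defined):
    """Keep or drop lines according to #ifdef / #else / #endif markers.

    `defined` is the set of names treated as defined for this build.
    """
    output = []
    keeping = [True]            # stack: is the current region being kept?
    for line in lines:
        words = line.split()
        directive = words[0] if words else ""
        if directive == "#ifdef":
            keeping.append(keeping[-1] and words[1] in defined)
        elif directive == "#else":
            current = keeping.pop()
            # Keep the else branch only if the enclosing region is kept
            # and the matching #ifdef branch was not.
            keeping.append(keeping[-1] and not current)
        elif directive == "#endif":
            keeping.pop()
        elif keeping[-1]:
            output.append(line)
    return output


source = [
    "int size;",
    "#ifdef DEBUG",
    "log(size);",
    "#else",
    "skip();",
    "#endif",
    "return size;",
]
debug_build = preprocess(source, {"DEBUG"})
release_build = preprocess(source, set())
```

The two calls at the end build two versions of the same source, one with the logging line and one without.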
EXAMPLE 1.14 Source-to-source translation (C++)
C++ implementations based on the early AT&T compiler actually generated
an intermediate program in C, instead of in assembly language. This C++
compiler was indeed a true compiler: it performed a complete analysis of the
syntax and semantics of the C++ source program, and with very few exceptions
generated all of the error messages that a programmer would see prior to
running the program. In fact, programmers were generally unaware that the C
compiler was being used behind the scenes. The C++ compiler did not invoke
the C compiler unless it had generated C code that would pass through the
second round of compilation without producing any error messages.

    Source program --> Preprocessor --> Modified source program --> C++ compiler --> C code --> C compiler --> Assembly language

Occasionally one would hear the C++ compiler referred to as a preprocessor, presumably because it generated high-level output that was in turn compiled. I consider this a misuse of the term: compilers attempt to “understand”
their source; preprocessors do not. Preprocessors perform transformations
based on simple pattern matching, and may well produce output that will
generate error messages when run through a subsequent stage of translation.
EXAMPLE 1.15 Bootstrapping
Many compilers are self-hosting: they are written in the language they
compile—Ada compilers in Ada, C compilers in C. This raises an obvious
question: how does one compile the compiler in the first place? The answer
is to use a technique known as bootstrapping, a term derived from the intentionally ridiculous notion of lifting oneself off the ground by pulling on one’s
bootstraps. In a nutshell, one starts with a simple implementation—often an
interpreter—and uses it to build progressively more sophisticated versions. We
can illustrate the idea with an historical example.
Many early Pascal compilers were built around a set of tools distributed by
Niklaus Wirth. These included the following.
– A Pascal compiler, written in Pascal, that would generate output in P-code,
a stack-based language similar to the byte code of modern Java compilers
– The same compiler, already translated into P-code
– A P-code interpreter, written in Pascal
To get Pascal up and running on a local machine, the user of the tool set
needed only to translate the P-code interpreter (by hand) into some locally
available language. This translation was not a difficult task; the interpreter was
small. By running the P-code version of the compiler on top of the P-code
interpreter, one could then compile arbitrary Pascal programs into P-code,
which could in turn be run on the interpreter. To get a faster implementation,
one could modify the Pascal version of the Pascal compiler to generate a locally
available variety of assembly or machine language, instead of generating P-code
(a somewhat more difficult task). This compiler could then be bootstrapped—
run through itself—to yield a machine code version of the compiler.
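
    Pascal to machine language compiler, in Pascal
        --> [Pascal to P-code compiler, in P-code]
        --> Pascal to machine language compiler, in P-code

    Pascal to machine language compiler, in Pascal
        --> [Pascal to machine language compiler, in P-code]
        --> Pascal to machine language compiler, in machine language
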
For a more modern example, suppose we were building one of the first
compilers for Java. If we had a C compiler already, we might start by writing,
in a simple subset of C, a compiler for an equally simple subset of Java. Once
this compiler was working, we could hand-translate the C code into our subset
of Java and run the compiler through itself. We could then repeatedly extend
the compiler to accept a larger subset of Java, bootstrap it again, and use the
extended language to implement an even larger subset.
DESIGN & IMPLEMENTATION
The early success of Pascal
The P-code-based implementation of Pascal, and its use of bootstrapping, are
largely responsible for the language’s remarkable success in academic circles
in the 1970s. No single hardware platform or operating system of that era
dominated the computer landscape the way the x86, Linux, and Windows do
today.8 Wirth’s toolkit made it possible to get an implementation of Pascal up
and running on almost any platform in a week or so. It was one of the first
great successes in system portability.
8 Throughout this book we will use the term “x86” to refer to the instruction set architecture of the
Intel 8086 and its descendants, including the various Pentium processors. Intel calls this architecture the IA-32, but x86 is a more generic term that encompasses the offerings of competitors such
as AMD as well.
1.4 Compilation and Interpretation
EXAMPLE
1.16
Compiling interpreted
languages
EXAMPLE
1.17
Dynamic and just-in-time
compilation
EXAMPLE
1.18
Microcode (firmware)
23
One will sometimes find compilers for languages (e.g., Lisp, Prolog, Smalltalk,
etc.) that permit a lot of late binding, and are traditionally interpreted. These
compilers must be prepared, in the general case, to generate code that performs
much of the work of an interpreter, or that makes calls into a library that
does that work instead. In important special cases, however, the compiler can
generate code that makes reasonable assumptions about decisions that won’t
be finalized until run time. If these assumptions prove to be valid the code will
run very fast. If the assumptions are not correct, a dynamic check will discover
the inconsistency, and revert to the interpreter.
In some cases a programming system may deliberately delay compilation
until the last possible moment. One example occurs in implementations of
Lisp or Prolog that invoke the compiler on the fly, to translate newly created
source into machine language, or to optimize the code for a particular input
set. Another example occurs in implementations of Java. The Java language
definition defines a machine-independent intermediate form known as byte
code. Byte code is the standard format for distribution of Java programs;
it allows programs to be transferred easily over the Internet, and then run
on any platform. The first Java implementations were based on byte-code
interpreters, but more recent (faster) implementations employ a just-in-time
compiler that translates byte code into machine language immediately before
each execution of the program. C#, similarly, is intended for just-in-time
translation. The main C# compiler produces .NET Common Intermediate
Language (CIL), which is then translated into machine language immediately
prior to execution. CIL is deliberately language independent, so it can be used
for code produced by a variety of front-end compilers.
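To make the notion of a byte-code interpreter concrete, here is a minimal sketch in C of an interpreter for a toy stack machine. The opcodes and the function run are hypothetical and are not those of Java byte code or CIL; the sketch shows only the decode-and-dispatch loop that an interpreter executes for every instruction, and that a just-in-time compiler avoids by translating the instruction sequence into machine language.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not the Java instruction set). */
enum { OP_PUSH, OP_ADD, OP_SUB, OP_MUL, OP_HALT };

/* Interprets `code` and returns the value on top of the stack at OP_HALT.
   A sketch: there is no checking for stack overflow or underflow. */
static int run(const int *code) {
    int stack[64];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next instruction */
    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = code[pc++];           break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp];   break;
        case OP_SUB:  sp--; stack[sp - 1] -= stack[sp];   break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp];   break;
        case OP_HALT: return stack[sp - 1];
        default:      return -1;   /* unknown opcode */
        }
    }
}
```

For instance, the sequence OP_PUSH 7, OP_PUSH 2, OP_SUB, OP_PUSH 3, OP_MUL, OP_HALT computes (7 - 2) * 3.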
On some machines (particularly those designed before the mid-1980s), the
assembly-level instruction set is not actually implemented in hardware, but in
fact runs on an interpreter. The interpreter is written in low-level instructions
called microcode (or firmware), which is stored in read-only memory and
executed by the hardware. Microcode and microprogramming are considered
further in Section 5.4.1.
As some of these examples make clear, a compiler does not necessarily translate from a high-level language into machine language. It is not uncommon for
compilers, especially prototypes, to generate C as output. A little further afield,
text formatters like TeX and troff are actually compilers, translating high-level
document descriptions into commands for a laser printer or phototypesetter.
(Many laser printers themselves incorporate interpreters for the PostScript page-description language.) Query language processors for database systems are also
compilers, translating languages like SQL into primitive operations on files. There
are even compilers that translate logic-level circuit specifications into photographic masks for computer chips. Though the focus in this book is on imperative
programming languages, the term “compilation” applies whenever we translate
automatically from one nontrivial language to another, with full analysis of the
meaning of the input.
1.5 Programming Environments
Compilers and interpreters do not exist in isolation. Programmers are assisted
in their work by a host of other tools. Assemblers, debuggers, preprocessors, and
linkers were mentioned earlier. Editors are familiar to every programmer. They
may be augmented with cross-referencing facilities that allow the programmer
to find the point at which an object is defined, given a point at which it is used.
Pretty-printers help enforce formatting conventions. Style checkers enforce syntactic or semantic conventions that may be tighter than those enforced by the compiler (see Exploration 1.12). Configuration management tools help keep track of
dependences among the (many versions of) separately compiled modules in a
large software system. Perusal tools exist not only for text but also for intermediate languages that may be stored in binary. Profilers and other performance
analysis tools often work in conjunction with debuggers to help identify the pieces
of a program that consume the bulk of its computation time.
In older programming environments, tools may be executed individually, at
the explicit request of the user. If a running program terminates abnormally with
a “bus error” (invalid address) message, for example, the user may choose to
invoke a debugger to examine the “core” file dumped by the operating system.
He or she may then attempt to identify the program bug by setting breakpoints,
enabling tracing and so on, and running the program again under the control of
the debugger. Once the bug is found, the user will invoke the editor to make an
appropriate change. He or she will then recompile the modified program, possibly
with the help of a configuration manager.
More recent environments provide much more integrated tools. When an
invalid address error occurs in an integrated development environment (IDE),
a new window is likely to appear on the user’s screen, with the line of source code
at which the error occurred highlighted. Breakpoints and tracing can then be set in
this window without explicitly invoking a debugger. Changes to the source can be
made without explicitly invoking an editor. If the user asks to rerun the program
after making changes, a new version may be built without explicitly invoking the
compiler or configuration manager.
The editor for an IDE may incorporate knowledge of language syntax, providing
templates for all the standard control structures, and checking syntax as it is
typed in. Internally, the IDE is likely to maintain not only a program’s source and
object code, but also a syntax tree. When the source is edited, the tree will be
updated automatically, often incrementally (without reparsing large portions of
the source). In some cases, structural changes to the program may be implemented
first in the syntax tree, and then automatically reflected in the source.
DESIGN & IMPLEMENTATION
Powerful development environments
Sophisticated development environments can be a two-edged sword. The
quality of the Common Lisp environment has arguably contributed to its
widespread acceptance. On the other hand, the particularity of the graphical
environment for Smalltalk (with its insistence on specific fonts, window styles,
etc.) has made it difficult to port the language to systems accessed through a
textual interface, or to graphical systems with a different “look and feel.”
IDEs are fundamental to Smalltalk—it is nearly impossible to separate the language from its graphical environment—and have been routinely used for Common Lisp since the 1980s. In more recent years, integrated environments have
largely displaced command-line tools for many languages and systems. Popular
open-source IDEs include Eclipse and NetBeans. Commercial systems include
the Visual Studio environment from Microsoft and the Xcode environment from
Apple. Much of the appearance of integration can also be achieved within sophisticated editors such as emacs.
CHECK YOUR UNDERSTANDING
11. Explain the distinction between interpretation and compilation. What are the
comparative advantages and disadvantages of the two approaches?
12. Is Java compiled or interpreted (or both)? How do you know?
13. What is the difference between a compiler and a preprocessor?
14. What was the intermediate form employed by the original AT&T C++ compiler?
15. What is P-code?
16. What is bootstrapping?
17. What is a just-in-time compiler?
18. Name two languages in which a program can write new pieces of itself “on
the fly.”
19. Briefly describe three “unconventional” compilers—compilers whose purpose
is not to prepare a high-level program for execution on a microprocessor.
20. List six kinds of tools that commonly support the work of a compiler within
a larger programming environment.
21. Explain how an IDE differs from a collection of command-line tools.
1.6 An Overview of Compilation
Compilers are among the most well-studied classes of computer programs. We
will consider them repeatedly throughout the rest of the book, and in Chapters 2,
4, 14, and 16 in particular. The remainder of this section provides an introductory
overview.
[Figure 1.3 diagram. Forms of information are flush left; the phase that consumes each form and produces the next is indented beneath it.]
Character stream
    Scanner (lexical analysis)                            [front end]
Token stream
    Parser (syntax analysis)                              [front end]
Parse tree
    Semantic analysis and intermediate code generation    [front end]
Abstract syntax tree or other intermediate form
    Machine-independent code improvement (optional)       [back end]
Modified intermediate form
    Target code generation                                [back end]
Target language (e.g., assembler)
    Machine-specific code improvement (optional)          [back end]
Modified target language
Symbol table (consulted by all phases)
Figure 1.3 Phases of compilation. Phases are listed on the right and the forms in which information is passed between phases are listed on the left. The symbol table serves throughout
compilation as a repository for information about identifiers.
EXAMPLE 1.19 Phases of compilation
In a typical compiler, compilation proceeds through a series of well-defined
phases, shown in Figure 1.3. Each phase discovers information of use to later
phases, or transforms the program into a form that is more useful to the subsequent phase.
The first few phases (up through semantic analysis) serve to figure out the
meaning of the source program. They are sometimes called the front end of the
compiler. The last few phases serve to construct an equivalent target program.
They are sometimes called the back end of the compiler. Many compiler phases
can be created automatically from a formal description of the source and/or target
languages.
One will sometimes hear compilation described as a series of passes. A pass is
a phase or set of phases that is serialized with respect to the rest of compilation:
it does not start until previous phases have completed, and it finishes before any
subsequent phases start. If desired, a pass may be written as a separate program,
reading its input from a file and writing its output to a file. Compilers are commonly divided into passes so that the front end may be shared by compilers for
more than one machine (target language), and so that the back end may be shared
by compilers for more than one source language. In some implementations the
front end and the back end may be separated by a “middle end” that is responsible
for language- and machine-independent code improvement. Prior to the dramatic
increases in memory sizes of the mid to late 1980s, compilers were also sometimes
divided into passes to minimize memory usage: as each pass completed, the next
could reuse its code space.
1.6.1 Lexical and Syntax Analysis
EXAMPLE 1.20 GCD program in C
Consider the greatest common divisor (GCD) problem introduced at the beginning of this chapter, and shown as a function in Figure 1.2 (page 13). Hypothesizing trivial I/O routines and recasting the function as a stand-alone program,
our code might look as follows in C.
int main() {
    int i = getint(), j = getint();
    while (i != j) {
        if (i > j) i = i - j;
        else j = j - i;
    }
    putint(i);
}
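The program above leaves getint and putint undefined. One plausible pair of definitions is sketched below; the helper read_int and the choice to exit on malformed input are assumptions of this sketch, not part of the original example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read one decimal integer from `in`, exiting
   with a message if none can be read. */
int read_int(FILE *in) {
    int n;
    if (fscanf(in, "%d", &n) != 1) {
        fprintf(stderr, "getint: expected an integer\n");
        exit(EXIT_FAILURE);
    }
    return n;
}

/* Read one integer from standard input. */
int getint(void) {
    return read_int(stdin);
}

/* Write one integer to standard output, followed by a newline. */
void putint(int n) {
    printf("%d\n", n);
}
```

Placed ahead of main in the same file, these definitions should allow the program to compile and run as written. Note that the subtraction loop terminates only when both inputs are positive.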
EXAMPLE 1.21 GCD program tokens
Scanning and parsing serve to recognize the structure of the program, without
regard to its meaning. The scanner reads characters (‘i’, ‘n’, ‘t’, ‘ ’, ‘m’, ‘a’, ‘i’, ‘n’,
‘(’, ‘)’, etc.) and groups them into tokens, which are the smallest meaningful units
of the program. In our example, the tokens are
    int     main    (       )       {
    int     i       =       getint  (       )       ,       j       =       getint  (       )       ;
    while   (       i       !=      j       )       {
    if      (       i       >       j       )       i       =       i       -       j       ;
    else    j       =       j       -       i       ;
    }
    putint  (       i       )       ;
    }
EXAMPLE 1.22 Context-free grammar and parsing
Scanning is also known as lexical analysis. The principal purpose of the scanner
is to simplify the task of the parser, by reducing the size of the input (there are
many more characters than tokens) and by removing extraneous characters like
white space. The scanner also typically removes comments and tags tokens with
line and column numbers, to make it easier to generate good diagnostics in later
phases. One could design a parser to take characters instead of tokens as input—
dispensing with the scanner—but the result would be awkward and slow.
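A scanner of the kind just described can be sketched in a few dozen lines of C. This is a minimal illustration, not the scanner of any real compiler: the Token type and the function scan are hypothetical, only the token classes needed by the GCD program are recognized, comments are not handled, and tokens are tagged with a line number but not a column.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN_LEN 31

typedef struct {
    char text[MAX_TOKEN_LEN + 1];   /* the characters of the token */
    int line;                       /* source line, for diagnostics */
} Token;

/* Splits `src` into at most `max` tokens and returns the number found,
   or -1 on error (a malformed token, or more than `max` tokens).
   Recognizes identifiers and keywords, decimal numbers, "!=", and
   one-character punctuation. */
int scan(const char *src, Token *out, int max) {
    int n = 0, line = 1;
    const char *p = src;
    while (*p != '\0') {
        if (*p == '\n') { line++; p++; continue; }
        if (isspace((unsigned char)*p)) { p++; continue; }   /* drop white space */
        if (n == max) return -1;
        const char *start = p;
        if (isalpha((unsigned char)*p) || *p == '_') {
            while (isalnum((unsigned char)*p) || *p == '_') p++;
        } else if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;
            if (isalpha((unsigned char)*p) || *p == '_') return -1;   /* e.g., 123abc */
        } else if (p[0] == '!' && p[1] == '=') {
            p += 2;
        } else if (strchr("(){};,=<>+-*/%", *p) != NULL) {
            p++;
        } else {
            return -1;   /* character that begins no token, e.g., '$' */
        }
        size_t len = (size_t)(p - start);
        if (len > MAX_TOKEN_LEN) return -1;
        memcpy(out[n].text, start, len);
        out[n].text[len] = '\0';
        out[n].line = line;
        n++;
    }
    return n;
}
```

The parser then sees only the array of tokens; white space has disappeared, and each token carries the line number needed for later diagnostics.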
Parsing organizes tokens into a parse tree that represents higher-level constructs
(statements, expressions, subroutines, and so on) in terms of their constituents.
Each construct is a node in the tree; its constituents are its children. The root of
the tree is simply “program”; the leaves, from left to right, are the tokens received
from the scanner. Taken as a whole, the tree shows how the tokens fit together
to make a valid program. The structure relies on a set of potentially recursive
rules known as a context-free grammar. Each rule has an arrow sign (−→) with
the construct name on the left and a possible expansion on the right.9 In C, for
example, a while loop consists of the keyword while followed by a parenthesized
Boolean expression and a statement:
    iteration-statement −→ while ( expression ) statement
The statement, in turn, is often a list enclosed in braces:
    statement −→ compound-statement
    compound-statement −→ { block-item-list_opt }
where
    block-item-list_opt −→ block-item-list
or
    block-item-list_opt −→ ε
and
    block-item-list −→ block-item
    block-item-list −→ block-item-list block-item
    block-item −→ declaration
    block-item −→ statement
Here ε represents the empty string; it indicates that block-item-list_opt can simply
be deleted. Many more grammar rules are needed, of course, to explain the full
structure of a program.
9 Theorists also study context-sensitive grammars, in which the allowable expansions of a construct
(the applicable rules) depend on the context in which the construct appears (i.e., on constructs
to the left and right). Context sensitivity is important for natural languages like English, but it is
almost never used in programming language design.
EXAMPLE 1.23 GCD program parse tree
A context-free grammar is said to define the syntax of the language; parsing
is therefore known as syntax analysis. There are many possible grammars for C
(an infinite number, in fact); the fragment shown above is taken from the sample grammar contained in the official language definition [Int99]. A full parse
tree for our GCD program (based on a full grammar not shown here) appears
in Figure 1.4. While the size of the tree may seem daunting, its details aren’t
particularly important at this point in the text. What is important is that (1)
each individual branching point represents the application of a single grammar
rule, and (2) the resulting complexity is more a reflection of the grammar than
it is of the input program. Much of it stems from (a) the use of such artificial
“constructs” as block-item-list and block-item-list_opt to generate lists of arbitrary
length, and (b) the use of the equally artificial assignment-expression, additive-expression, multiplicative-expression, and so on, to capture precedence and associativity in arithmetic expressions. We shall see in the following subsection that
much of this complexity can be discarded once parsing is complete.
In the process of scanning and parsing, the compiler checks to see that all of
the program’s tokens are well formed, and that the sequence of tokens conforms
to the syntax defined by the context-free grammar. Any malformed tokens (e.g.,
123abc or $@foo in C) should cause the scanner to produce an error message.
Any syntactically invalid token sequence (e.g., A = X Y Z in C) should lead to an
error message from the parser.
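How a parser checks a token sequence against grammar rules can be sketched with a small recursive-descent recognizer in C. The sketch rests on several simplifying assumptions: it covers only a fragment of the grammar shown earlier (compound statements, while loops, and expression statements), an "expression" is taken to be a single identifier, declarations are omitted, and the names Parser, expect, and parse are hypothetical. It reports only whether the input is valid; it builds no tree.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Parser state: a NULL-terminated array of token strings and a cursor. */
typedef struct {
    const char **tokens;
    size_t pos;
} Parser;

/* True if the current token is exactly `text`. */
static bool at(const Parser *p, const char *text) {
    const char *t = p->tokens[p->pos];
    return t != NULL && strcmp(t, text) == 0;
}

/* Consumes the current token if it is `text`; otherwise reports failure. */
static bool expect(Parser *p, const char *text) {
    if (!at(p, text)) return false;
    p->pos++;
    return true;
}

/* Simplification: an expression is a single identifier. */
static bool expression(Parser *p) {
    const char *t = p->tokens[p->pos];
    if (t == NULL || !(isalpha((unsigned char)t[0]) || t[0] == '_')) return false;
    if (strcmp(t, "while") == 0) return false;   /* keyword, not an identifier */
    p->pos++;
    return true;
}

static bool statement(Parser *p);

/* compound-statement -> { block-item-list_opt }
   The left-recursive list rule becomes a loop; zero iterations
   correspond to the epsilon production. */
static bool compound_statement(Parser *p) {
    if (!expect(p, "{")) return false;
    while (p->tokens[p->pos] != NULL && !at(p, "}"))
        if (!statement(p)) return false;
    return expect(p, "}");
}

/* statement -> compound-statement
             |  while ( expression ) statement
             |  expression ;                      */
static bool statement(Parser *p) {
    if (at(p, "{")) return compound_statement(p);
    if (expect(p, "while"))
        return expect(p, "(") && expression(p) && expect(p, ")") && statement(p);
    return expression(p) && expect(p, ";");
}

/* Returns true if the whole token sequence forms one valid statement. */
bool parse(const char **tokens) {
    Parser p = { tokens, 0 };
    return statement(&p) && p.tokens[p.pos] == NULL;
}
```

Each function corresponds to one construct of the grammar, and each successful call corresponds to one interior node of the parse tree. A real parser would also build that tree and produce a diagnostic at the point of failure rather than simply returning false.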
1.6.2 Semantic Analysis and Intermediate Code Generation
Semantic analysis is the discovery of meaning in a program. The semantic analysis
phase of compilation recognizes when multiple occurrences of the same identifier
are meant to refer to the same program entity, and ensures that the uses are consistent. In most languages the semantic analyzer tracks the types of both identifiers
and expressions, both to verify consistent usage and to guide the generation of
code in later phases.
To assist in its work, the semantic analyzer typically builds and maintains a symbol table data structure that maps each identifier to the information known about
it. Among other things, this information includes the identifier’s type, internal
structure (if any), and scope (the portion of the program in which it is valid).
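A symbol table can be sketched in C as an array of records with insert and lookup operations. The sketch below makes several assumptions: the names Symbol, SymbolTable, symtab_insert, and symtab_lookup are hypothetical, types are kept as descriptive strings, lookup is a linear search (production compilers typically use hashing), and no provision is made for discarding entries when a scope is closed.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 64

typedef struct {
    const char *name;   /* identifier as written in the source */
    const char *type;   /* type description, kept as text for simplicity */
    int scope_level;    /* 0 = global; larger values = more deeply nested */
} Symbol;

typedef struct {
    Symbol entries[MAX_SYMBOLS];
    int count;
} SymbolTable;

/* Adds an entry and returns its index, or -1 if the table is full. */
int symtab_insert(SymbolTable *t, const char *name, const char *type, int scope_level) {
    if (t->count == MAX_SYMBOLS) return -1;
    Symbol *s = &t->entries[t->count];
    s->name = name;
    s->type = type;
    s->scope_level = scope_level;
    return t->count++;
}

/* Returns the index of the most recently inserted entry for `name`
   (so that an inner declaration hides an outer one), or -1 if the
   identifier has not been declared. */
int symtab_lookup(const SymbolTable *t, const char *name) {
    for (int k = t->count - 1; k >= 0; k--)
        if (strcmp(t->entries[k].name, name) == 0) return k;
    return -1;
}
```

A result of -1 from symtab_lookup is exactly the situation in which a semantic analyzer would report the use of an undeclared identifier.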
Using the symbol table, the semantic analyzer enforces a large variety of rules
that are not captured by the hierarchical structure of the context-free grammar
and the parse tree. In C, for example, it checks to make sure that
Every identifier is declared before it is used.
No identifier is used in an inappropriate context (calling an integer as a subroutine, adding a string to an integer, referencing a field of the wrong type of
struct, etc.).
Subroutine calls provide the correct number and types of arguments.
Labels on the arms of a switch statement are distinct constants.
Any function with a non-void return type returns a value explicitly.
In many compilers, the work of the semantic analyzer takes the form of semantic
action routines, invoked by the parser when it realizes that it has reached a particular
point within a grammar rule.
Of course, not all semantic rules can be checked at compile time. Those that
can are referred to as the static semantics of the language. Those that must be
checked at run time are referred to as the dynamic semantics of the language. C
has very little in the way of dynamic checks (its designers opted for performance
over safety). Examples of rules that other languages enforce at run time include
the following.
Variables are never used in an expression unless they have been given a value.10
Pointers are never dereferenced unless they refer to a valid object.
Array subscript expressions lie within the bounds of the array.
Arithmetic operations do not overflow.
10 As we shall see in Section 6.1.3, Java and C# actually do enforce initialization at compile time,
but only by adopting a conservative set of rules for “definite assignment,” outlawing programs for
which correctness is difficult or impossible to verify at compile time.
[Figure 1.4: parse tree diagram for the GCD program, not reproduced here.]
Figure 1.4 Parse tree for the GCD program. The symbol ε represents the empty string. Dotted lines indicate a chain of one-for-one replacements, elided
to save space; the adjacent number indicates the number of omitted nodes. While the details of the tree aren’t important to the current chapter, the sheer
amount of detail is: it comes from having to fit the (much simpler) source code into the hierarchical structure of a context-free grammar.
[Figure 1.5 diagram. Syntax tree, shown here by indentation; numbers in parentheses are symbol table indices.]
program
    :=  (5)  call (3)
    :=  (6)  call (3)
    while
        ≠  (5)  (6)
        if
            >   (5)  (6)
            :=  (5)  − (5) (6)
            :=  (6)  − (6) (5)
    call (4) (5)

Index   Symbol   Type
1       void     type
2       int      type
3       getint   func : (1) → (2)
4       putint   func : (2) → (1)
5       i        (2)
6       j        (2)

Figure 1.5 Syntax tree and symbol table for the GCD program. Note the contrast to Figure 1.4:
the syntax tree retains just the essential structure of the program, omitting details that were needed
only to drive the parsing algorithm.
EXAMPLE 1.24 GCD program abstract syntax tree
When it cannot enforce rules statically, a compiler will often produce code
to perform appropriate checks at run time, aborting the program or generating an exception if one of the checks then fails. (Exceptions will be discussed in
Section 8.5.) Some rules, unfortunately, may be unacceptably expensive or impossible to enforce, and the language implementation may simply fail to check them.
In Ada, a program that breaks such a rule is said to be erroneous; in C its behavior
is said to be undefined.
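C itself performs none of these checks, but the tests that a compiler for a safer language might generate can be written out by hand. The following sketch shows a subscript check and an overflow check; the function names and the convention of returning false on failure are assumptions made for illustration. Generated code would typically abort the program or raise an exception instead.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Bounds check: stores a[index] in *out and returns true only if the
   subscript lies in the range 0 .. length-1. */
bool checked_load(const int *a, size_t length, long index, int *out) {
    if (index < 0 || (size_t)index >= length) return false;
    *out = a[index];
    return true;
}

/* Overflow check: stores x + y in *out and returns true only if the
   mathematical sum fits in an int. The test is made before adding,
   because signed overflow itself has undefined behavior in C. */
bool checked_add(int x, int y, int *out) {
    if ((y > 0 && x > INT_MAX - y) || (y < 0 && x < INT_MIN - y)) return false;
    *out = x + y;
    return true;
}
```

Each check costs a comparison or two on every array access or addition, which is the expense that C’s designers chose to avoid.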
A parse tree is sometimes known as a concrete syntax tree, because it demonstrates, completely and concretely, how a particular sequence of tokens can be
derived under the rules of the context-free grammar. Once we know that a token
sequence is valid, however, much of the information in the parse tree is irrelevant
to further phases of compilation. In the process of checking static semantic rules,
the semantic analyzer typically transforms the parse tree into an abstract syntax
tree (otherwise known as an AST, or simply a syntax tree) by removing most of the
“artificial” nodes in the tree’s interior. The semantic analyzer also annotates the
remaining nodes with useful information, such as pointers from identifiers to their
symbol table entries. The annotations attached to a particular node are known as
its attributes. A syntax tree for our GCD program is shown in Figure 1.5.
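One possible C representation of such a tree is sketched below. The Node type and its helpers are hypothetical; the single attribute shown is the symbol table index attached to each variable reference, as in Figure 1.5. The small evaluator is included only to exercise the structure: a compiler would traverse the tree to generate code rather than to compute values.

```c
#include <stdlib.h>

typedef enum { NODE_VAR, NODE_SUB, NODE_GT, NODE_ASSIGN } NodeKind;

typedef struct Node {
    NodeKind kind;
    int symbol;                  /* attribute: symbol table index (NODE_VAR only) */
    struct Node *left, *right;   /* operands; NULL for leaves */
} Node;

Node *new_node(NodeKind kind, int symbol, Node *left, Node *right) {
    Node *n = malloc(sizeof *n);
    if (n == NULL) abort();
    n->kind = kind;
    n->symbol = symbol;
    n->left = left;
    n->right = right;
    return n;
}

/* Leaf node referring to the variable with the given symbol table index. */
Node *var(int symbol) {
    return new_node(NODE_VAR, symbol, NULL, NULL);
}

/* Evaluates a tree, using `values` (indexed by symbol table index) as
   storage for the variables. The left child of an assignment must be
   a NODE_VAR. */
int eval(const Node *n, int *values) {
    switch (n->kind) {
    case NODE_VAR:    return values[n->symbol];
    case NODE_SUB:    return eval(n->left, values) - eval(n->right, values);
    case NODE_GT:     return eval(n->left, values) > eval(n->right, values);
    case NODE_ASSIGN: return values[n->left->symbol] = eval(n->right, values);
    }
    return 0;   /* not reached for valid node kinds */
}

void free_tree(Node *n) {
    if (n == NULL) return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}
```

With i at index 5 and j at index 6, the statement i = i - j becomes new_node(NODE_ASSIGN, 0, var(5), new_node(NODE_SUB, 0, var(5), var(6))): four leaves and interior nodes in place of the dozens of nodes that the parse tree devotes to the same statement.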
In many compilers, the annotated syntax tree constitutes the intermediate form
that is passed from the front end to the back end. In other compilers, semantic
analysis ends with a traversal of the tree that generates some other intermediate form. One common such form consists of a control flow graph whose nodes
resemble fragments of assembly language for a simple idealized machine. We will
consider this option further in Chapter 14, where a control flow graph for our
GCD program appears in Figure 14.3. In a suite of related compilers, the front
ends for several languages and the back ends for several machines would share a
common intermediate form.
1.6.3 Target Code Generation
EXAMPLE 1.25 GCD program assembly code
The code generation phase of a compiler translates the intermediate form into the
target language. Given the information contained in the syntax tree, generating
correct code is usually not a difficult task (generating good code is harder, as we
shall see in Section 1.6.4). To generate assembly or machine language, the code
generator traverses the symbol table to assign locations to variables, and then
traverses the intermediate representation of the program, generating loads and
stores for variable references, interspersed with appropriate arithmetic operations,
tests, and branches. Naive code for our GCD example appears in Figure 1.6, in
x86 assembly language. It was generated automatically by a simple pedagogical
compiler.
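The two traversals just described can be sketched in C for a single kind of statement. Everything here is an assumption made for illustration: the Variable type, the layout rule (4-byte variables at offsets -8, -12, and so on from ebp, chosen to agree with Figure 1.6), and the function names. The sketch handles only assignments of the form dst = a - b and writes GNU-style x86 assembly text into a buffer; it is not the pedagogical compiler mentioned above.

```c
#include <stdio.h>

/* Hypothetical record of where a variable lives in the stack frame. */
typedef struct {
    const char *name;
    int offset;   /* displacement from %ebp, in bytes */
} Variable;

/* First traversal: walk the table of variables and give each a 4-byte
   slot in the frame, at offsets -8, -12, ... */
void assign_locations(Variable *vars, int count) {
    for (int k = 0; k < count; k++)
        vars[k].offset = -8 - 4 * k;
}

/* Second traversal, reduced to one statement: emit naive code for
   "dst = a - b" by loading both operands, subtracting, and storing.
   Returns what snprintf returns (the length of the generated text). */
int gen_sub_assign(char *buf, size_t size, Variable dst, Variable a, Variable b) {
    return snprintf(buf, size,
        "movl %d(%%ebp), %%edi\n"    /* load a      */
        "movl %d(%%ebp), %%ebx\n"    /* load b      */
        "subl %%ebx, %%edi\n"        /* a - b       */
        "movl %%edi, %d(%%ebp)\n",   /* store dst   */
        a.offset, b.offset, dst.offset);
}
```

The generator reloads both operands for every statement, whether or not they are already in registers; that is what makes the resulting code naive.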
The assembly language mnemonics may appear a bit cryptic, but the comments
on each line (not generated by the compiler!) should make the correspondence
between Figures 1.5 and 1.6 generally apparent. A few hints: esp, ebp, eax, ebx,
and edi are registers (special storage locations, limited in number, that can be
accessed very quickly). -8(%ebp) refers to the memory location 8 bytes before
the location whose address is in register ebp; in this program, ebp serves as a
base from which we can find variables i and j. The argument to a subroutine
call instruction is passed by pushing it onto a stack, for which esp is the top-of-stack pointer. The return value comes back in register eax. Arithmetic operations
overwrite their second argument with the result of the operation.11
Often a code generator will save the symbol table for later use by a symbolic
debugger, by including it in a nonexecutable part of the target code.
1.6.4 Code Improvement
Code improvement is often referred to as optimization, though it seldom makes
anything optimal in any absolute sense. It is an optional phase of compilation
whose goal is to transform a program into a new version that computes the same
result more efficiently—more quickly or using less memory, or both.
11 As noted in footnote 1, these are GNU assembler conventions; Microsoft and Intel assemblers
specify arguments in the opposite order.
    pushl   %ebp                # \
    movl    %esp, %ebp          # ) reserve space for local variables
    subl    $16, %esp           # /
    call    getint              # read
    movl    %eax, -8(%ebp)      # store i
    call    getint              # read
    movl    %eax, -12(%ebp)     # store j
A:  movl    -8(%ebp), %edi      # load i
    movl    -12(%ebp), %ebx     # load j
    cmpl    %ebx, %edi          # compare
    je      D                   # jump if i == j
    movl    -8(%ebp), %edi      # load i
    movl    -12(%ebp), %ebx     # load j
    cmpl    %ebx, %edi          # compare
    jle     B                   # jump if i < j
    movl    -8(%ebp), %edi      # load i
    movl    -12(%ebp), %ebx     # load j
    subl    %ebx, %edi          # i = i - j
    movl    %edi, -8(%ebp)      # store i
    jmp     C
B:  movl    -12(%ebp), %edi     # load j
    movl    -8(%ebp), %ebx      # load i
    subl    %ebx, %edi          # j = j - i
    movl    %edi, -12(%ebp)     # store j
C:  jmp     A
D:  movl    -8(%ebp), %ebx      # load i
    push    %ebx                # push i (pass to putint)
    call    putint              # write
    addl    $4, %esp            # pop i
    leave                       # deallocate space for local variables
    mov     $0, %eax            # exit status for program
    ret                         # return to operating system
Figure 1.6 Naive x86 assembly language for the GCD program.
EXAMPLE 1.26 GCD program optimization
Some improvements are machine independent. These can be performed as
transformations on the intermediate form. Other improvements require an understanding of the target machine (or of whatever will execute the program in the
target language). These must be performed as transformations on the target program. Thus code improvement often appears as two additional phases of compilation, one immediately after semantic analysis and intermediate code generation,
the other immediately after target code generation.
Applying a good code improver to the code in Figure 1.6 produces the code
shown in Example 1.2 (page 5). Comparing the two programs, we can see that the
improved version is quite a lot shorter. Conspicuously absent are most of the loads
and stores. The machine-independent code improver is able to verify that i and
j can be kept in registers throughout the execution of the main loop. (This would
not have been the case if, for example, the loop contained a call to a subroutine that
might reuse those registers, or that might try to modify i or j.) The machine-specific code improver is then able to assign i and j to actual registers of the
target machine. For modern microprocessor architectures, particularly those with
so-called superscalar implementations (ones in which separate functional units
can execute instructions simultaneously), compilers can usually generate better
code than can human assembly language programmers.
CHECK YOUR UNDERSTANDING
22. List the principal phases of compilation, and describe the work performed by
each.
23. Describe the form in which a program is passed from the scanner to the parser;
from the parser to the semantic analyzer; from the semantic analyzer to the
intermediate code generator.
24. What distinguishes the front end of a compiler from the back end?
25. What is the difference between a phase and a pass of compilation? Under what
circumstances does it make sense for a compiler to have multiple passes?
26. What is the purpose of the compiler’s symbol table?
27. What is the difference between static and dynamic semantics?
28. On modern machines, do assembly language programmers still tend to write
better code than a good compiler can? Why or why not?
1.7 Summary and Concluding Remarks
In this chapter we introduced the study of programming language design and
implementation. We considered why there are so many languages, what makes
them successful or unsuccessful, how they may be categorized for study, and what
benefits the reader is likely to gain from that study. We noted that language design
and language implementation are intimately related to one another. Obviously
an implementation must conform to the rules of the language. At the same time,
a language designer must consider how easy or difficult it will be to implement
various features, and what sort of performance is likely to result for programs that
use those features.
Language implementations are commonly differentiated into those based on
interpretation and those based on compilation. We noted, however, that the difference between these approaches is fuzzy, and that most implementations include
a bit of each. As a general rule, we say that a language is compiled if execution is
preceded by a translation step that (1) fully analyzes both the structure (syntax)
and meaning (semantics) of the program, and (2) produces an equivalent program in a significantly different form. The bulk of the implementation material
in this book pertains to compilation.
Compilers are generally structured as a series of phases. The first few phases—
scanning, parsing, and semantic analysis—serve to analyze the source program. Collectively these phases are known as the compiler’s front end. The final
few phases—intermediate code generation, code improvement, and target code
generation—are known as the back end. They serve to build a target program—
preferably a fast one—whose semantics match those of the source.
Chapters 3, 6, 7, 8, and 9 form the core of the rest of this book. They cover fundamental issues of language design, both from the point of view of the programmer
and from the point of view of the language implementor. To support the discussion
of implementations, Chapters 2 and 4 describe compiler front ends in more detail
than has been possible in this introduction. Chapter 5 provides an overview of
assembly-level architecture. Chapters 14 through 16 discuss compiler back ends,
including assemblers and linkers, run-time systems, and code improvement techniques. Additional language paradigms are covered in Chapters 10 through 13.
Appendix A lists the principal programming languages mentioned in the text,
together with a genealogical chart and bibliographic references. Appendix B contains a list of “Design & Implementation” sidebars; Appendix C contains a list of
numbered examples.
1.8 Exercises
1.1 Errors in a computer program can be classified according to when they are
detected and, if they are detected at compile time, what part of the compiler
detects them. Using your favorite imperative language, give an example of
each of the following.
(a) A lexical error, detected by the scanner
(b) A syntax error, detected by the parser
(c) A static semantic error, detected by semantic analysis
(d) A dynamic semantic error, detected by code generated by the compiler
(e) An error that the compiler can neither catch nor easily generate code to
catch (this should be a violation of the language definition, not just a
program bug)
1.2 Consider again the Pascal tool set distributed by Niklaus Wirth (Example 1.15). After successfully building a machine language version of the
Pascal compiler, one could in principle discard the P-code interpreter and
the P-code version of the compiler. Why might one choose not to do so?
1.3 Imperative languages like Fortran and C are typically compiled, while scripting languages, in which many issues cannot be settled until run time, are
typically interpreted. Is interpretation simply what one “has to do” when
compilation is infeasible, or are there actually some advantages to interpreting a language, even when a compiler is available?
1.4 The gcd program of Example 1.20 might also be written
int main() {
    int i = getint(), j = getint();
    while (i != j) {
        if (i > j) i = i % j;
        else j = j % i;
    }
    putint(i);
}
Does this program compute the same result? If not, can you fix it? Under
what circumstances would you expect one or the other to be faster?
1.5 In your local implementation of C, what is the limit on the size of integers?
What happens in the event of arithmetic overflow? What are the implications
of size limits on the portability of programs from one machine/compiler to
another? How do the answers to these questions differ for Java? For Ada? For
Pascal? For Scheme? (You may need to find a manual.)
1.6 The Unix make utility allows the programmer to specify dependences among
the separately compiled pieces of a program. If file A depends on file B and
file B is modified, make deduces that A must be recompiled, in case any
of the changes to B would affect the code produced for A. How accurate
is this sort of dependence management? Under what circumstances will it
lead to unnecessary work? Under what circumstances will it fail to recompile
something that needs to be recompiled?
1.7 Why is it difficult to tell whether a program is correct? How do you go about
finding bugs in your code? What kinds of bugs are revealed by testing? What
kinds of bugs are not? (For more formal notions of program correctness,
see the bibliographic notes at the end of Chapter 4.)
1.9 Explorations
1.8 (a) What was the first programming language you learned? If you chose it,
why did you do so? If it was chosen for you by others, why do you think
they chose it? What parts of the language did you find the most difficult
to learn?
(b) For the language with which you are most familiar (this may or may
not be the first one you learned), list three things you wish had been
differently designed. Why do you think they were designed the way they
were? How would you fix them if you had the chance to do it over?
Would there be any negative consequences, for example in terms of
compiler complexity or program execution speed?
1.9 Get together with a classmate whose principal programming experience is
with a language in a different category of Figure 1.1. (If your experience is
mostly in C, for example, you might search out someone with experience
in Lisp.) Compare notes. What are the easiest and most difficult aspects
of programming, in each of your experiences? Pick a simple problem (e.g.,
sorting, or identification of connected components in a graph) and solve it
using each of your favorite languages. Which solution is more elegant (do
the two of you agree)? Which is faster? Why?
1.10 (a) If you have access to a Unix system, compile a simple program with
the -S command-line flag. Add comments to the resulting assembly
language file to explain the purpose of each instruction.
(b) Now use the -c command-line flag to generate a relocatable object file.
Using appropriate local tools (look in particular for nm, objdump,
or a symbolic debugger like gdb or dbx), identify the machine language
corresponding to each line of assembler.
(c) Using nm , objdump , or a similar tool, identify the undefined external
symbols in your object file. Now run the compiler to completion, to
produce an executable file. Finally, run nm or objdump again to see what
has happened to the symbols in part (b). Where did they come from—
how did the linker resolve them?
(d) Run the compiler to completion one more time, using the -v command-line
flag. You should see messages describing the various subprograms
invoked during the compilation process (some compilers use a different
letter for this option; check the man page). The subprograms may
include a preprocessor, separate passes of the compiler itself (often two),
probably an assembler, and the linker. If possible, run these subprograms
yourself, individually. Which of them produce the files described in the
previous subquestions? Explain the purpose of the various command-line
flags with which the subprograms were invoked.
1.11 Write a program that commits a dynamic semantic error (e.g., division by
zero, access off the end of an array, dereference of a null pointer). What
happens when you run this program? Does the compiler give you options to
control what happens? Devise an experiment to evaluate the cost of run-time
semantic checks. If possible, try this exercise with more than one language
or compiler.
1.12 C has a reputation for being a relatively “unsafe” high-level language. In
particular, it allows the programmer to mix operands of different sizes and
types in many more ways than its “safer” cousins. The Unix lint utility can
be used to search for potentially unsafe constructs in C programs. In effect,
many of the rules that are enforced by the compiler in other languages are
optional in C, and are enforced (if desired) by a separate program. What do
you think of this approach? Is it a good idea? Why or why not?
1.13 Using an Internet search engine or magazine indexing service, read up on the
history of Java and C#, including the conflict between Sun and Microsoft
over Java standardization. Some have claimed that C# is, at least in part,
Microsoft’s attempt to kill Java. Defend or refute this claim.
1.10 Bibliographic Notes
The compiler-oriented chapters of this book attempt to convey a sense of what
the compiler does, rather than explaining how to build one. A much greater level
of detail can be found in other texts. Leading options include the work of Aho
et al. [ALSU07] and of Cooper and Torczon [CT04]. Other excellent, though less
current, texts include those of Grune et al. [GBJL01], Appel [App97], and Fischer
and LeBlanc [FL88]. Popular texts on programming language design include those
of Louden [Lou03], Sebesta [Seb08], and Sethi [Set96].
Some of the best information on the history of programming languages can
be found in the proceedings of conferences sponsored by the Association for
Computing Machinery in 1978, 1993, and 2007 [Wex78, Ass93, Ass07]. Another
excellent reference is Horowitz’s 1987 text [Hor87]. A broader range of historical
material can be found in the quarterly IEEE Annals of the History of Computing.
Given the importance of personal taste in programming language design, it is
inevitable that some language comparisons should be marked by strongly worded
opinions. Examples include the writings of Dijkstra [Dij82], Hoare [Hoa81],
Kernighan [Ker81], and Wirth [Wir85a].
Much modern software development takes place in integrated programming
environments. Influential precursors to these environments include the Genera
Common Lisp environment from Symbolics Corp. [WMWM87] and the Smalltalk [Gol84], Interlisp [TM81], and Cedar [SZBH86] environments at the Xerox
Palo Alto Research Center.
2 Programming Language Syntax
EXAMPLE 2.1 Syntax of Arabic numerals
Unlike natural languages such as English or Chinese, computer languages
must be precise. Both their form (syntax) and meaning (semantics) must be
specified without ambiguity, so that both programmers and computers can tell
what a program is supposed to do. To provide the needed degree of precision,
language designers and implementors use formal syntactic and semantic notation.
To facilitate the discussion of language features in later chapters, we will cover this
notation first: syntax in the current chapter and semantics in Chapter 4.
As a motivating example, consider the Arabic numerals with which we represent numbers. These numerals are composed of digits, which we can enumerate
as follows (‘ | ’ means “or”):
digit −→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Digits are the syntactic building blocks for numbers. In the usual notation, we say
that a natural number is represented by an arbitrary-length (nonempty) string of
digits, beginning with a nonzero digit:
non_zero_digit −→ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number −→ non_zero_digit digit *
Here the “Kleene1 star” metasymbol (*) is used to indicate zero or more repetitions
of the symbol to its left.
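For readers who wish to experiment, the two rules above carry over almost symbol for symbol into the regular-expression notation of a scripting language. The following is a minimal sketch using Python's standard re module; the names DIGIT, NON_ZERO_DIGIT, and is_natural_number are ours, chosen only for illustration.

```python
import re

DIGIT = "[0-9]"
NON_ZERO_DIGIT = "[1-9]"

# natural_number -> non_zero_digit digit*
NATURAL_NUMBER = re.compile(NON_ZERO_DIGIT + DIGIT + "*")


def is_natural_number(text):
    """Return True if the entire string is generated by natural_number."""
    return NATURAL_NUMBER.fullmatch(text) is not None
```

Under these rules the string 2009 is accepted, while 007 and the empty string are not, since every numeral must begin with a nonzero digit.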
Of course, digits are only symbols: ink blobs on paper or pixels on a screen.
They carry no meaning in and of themselves. We add semantics to digits when
we say that they represent the natural numbers from zero to nine, as defined by
mathematicians. Alternatively, we could say that they represent colors, or the days
of the week in a decimal calendar. These would constitute alternative semantics
for the same syntax. In a similar fashion, we define the semantics of natural
numbers by associating a base-10, place-value interpretation with each string of
digits. Similar syntax rules and semantic interpretations can be devised for rational
numbers, (limited-precision) real numbers, arithmetic, assignments, control flow,
declarations, and indeed all of programming languages.
1 Stephen Kleene (1909–1994), a mathematician at the University of Wisconsin, was responsible
for much of the early development of the theory of computation, including much of the material
in Section 2.4.
Programming Language Pragmatics. DOI: 10.1016/B978-0-12-374514-9.00011-2
Copyright © 2009 by Elsevier Inc. All rights reserved.
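The base-10, place-value interpretation mentioned above can be written down in a few lines. This is only a sketch of one possible formulation; the function name natural_number_value is an assumption of ours, and the input is assumed to be a string of decimal digits.

```python
def natural_number_value(numeral):
    """Interpret a string of decimal digits as a base-10, place-value number."""
    value = 0
    for ch in numeral:
        # The semantics of a single digit: its position in the usual ordering.
        digit_value = "0123456789".index(ch)
        # Each digit to the right shifts the earlier ones by one decimal place.
        value = value * 10 + digit_value
    return value
```

Replacing the body of the loop with a lookup in a table of colors or weekdays would give the same syntax a different semantics.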
Distinguishing between syntax and semantics is useful for at least two reasons.
First, different programming languages often provide features with very similar
semantics but very different syntax. It is generally much easier to learn a new
language if one is able to identify the common (and presumably familiar) ideas
beneath the unfamiliar syntax. Second, there are some very efficient and elegant
algorithms that a compiler or interpreter can use to discover the syntactic structure
(but not the semantics!) of a computer program, and these algorithms can be used
to drive the rest of the compilation or interpretation process.
In the current chapter we focus on syntax: how we specify the structural rules
of a programming language, and how a compiler identifies the structure of a given
input program. These two tasks—specifying syntax rules and figuring out how
(and whether) a given program was built according to those rules—are distinct.
The first is of interest mainly to programmers, who want to write valid programs.
The second is of interest mainly to compilers, which need to analyze those programs. The first task relies on regular expressions and context-free grammars, which
specify how to generate valid programs. The second task relies on scanners and
parsers, which recognize program structure. We address the first of these tasks in
Section 2.1, the second in Sections 2.2 and 2.3.
In Section 2.4 (largely on the PLP CD) we take a deeper look at the formal
theory underlying scanning and parsing. In theoretical parlance, a scanner is a
deterministic finite automaton (DFA) that recognizes the tokens of a programming
language. A parser is a deterministic push-down automaton (PDA) that recognizes
the language’s context-free syntax. It turns out that one can generate scanners and
parsers automatically from regular expressions and context-free grammars. This
task is performed by tools like Unix’s lex and yacc .2 Possibly nowhere else in
computer science is the connection between theory and practice so clear and so
compelling.
2.1 Specifying Syntax: Regular Expressions and Context-Free Grammars
Formal specification of syntax requires a set of rules. How complicated (expressive) the syntax can be depends on the kinds of rules we are allowed to use.
2 At many sites, lex and yacc have been superseded by the GNU flex and bison tools. These
independently developed, noncommercial alternatives are available without charge from the Free
Software Foundation at www.gnu.org/software. They provide a superset of the functionality of
lex and yacc .
It turns out that what we intuitively think of as tokens can be constructed from
individual characters using just three kinds of formal rules: concatenation, alternation (choice among a finite set of alternatives), and so-called “Kleene closure”
(repetition an arbitrary number of times). Specifying most of the rest of what
we intuitively think of as syntax requires one additional kind of rule: recursion
(creation of a construct from simpler instances of the same construct). Any set of
strings that can be defined in terms of the first three rules is called a regular set,
or sometimes a regular language. Regular sets are generated by regular expressions
and recognized by scanners. Any set of strings that can be defined if we add recursion is called a context-free language (CFL). Context-free languages are generated
by context-free grammars (CFGs) and recognized by parsers. (Terminology can
be confusing here. The meaning of the word “language” varies greatly, depending
on whether we’re talking about “formal” languages [e.g., regular or context-free],
or programming languages. A formal language is just a set of strings, with no
accompanying semantics.)
2.1.1 Tokens and Regular Expressions
EXAMPLE 2.2 Lexical structure of C99
Tokens are the basic building blocks of programs—the shortest strings of characters with individual meaning. Tokens come in many kinds, including keywords, identifiers, symbols, and constants of various types. Some kinds of token
(e.g., the increment operator) correspond to only one string of characters. Others
(e.g., identifier) correspond to a set of strings that share some common form.
(In most languages, keywords are special strings of characters that have the right
form to be identifiers, but are reserved for special purposes.) We will use the word
“token” informally to refer to both the generic kind (an identifier, the increment
operator) and the specific string ( foo , ++ ); the distinction between these should
be clear from context.
Some languages have only a few kinds of token, of fairly simple form.
Other languages are more complex. C, for example, has almost 100 kinds of
tokens, including 37 keywords (double, if, return, struct, etc.); identifiers
(my_variable, your_type, printf, etc.); integer (0765, 0x1f5, 501),
floating-point (6.022e23), and character ('x', '\'', '\0170') constants; string
literals ("snerk", "say \"hi\"\n"); 54 “punctuators” (+, ], ->, *=, :, ||, etc.),
and two different forms of comments. There are provisions for international
character sets, string literals that span multiple lines of source code, constants of
varying precision (width), alternative “spellings” for symbols that are missing on
certain input devices, and preprocessor macros that build tokens from smaller
pieces. Other large, modern languages (Java, Ada 95) are similarly complex.
To specify tokens, we use the notation of regular expressions. A regular expression is one of the following.
1. A character
2. The empty string, denoted ε
3. Two regular expressions next to each other, meaning any string generated
by the first one followed by (concatenated with) any string generated by the
second one
4. Two regular expressions separated by a vertical bar ( | ), meaning any string
generated by the first one or any string generated by the second one
5. A regular expression followed by a Kleene star, meaning the concatenation of
zero or more strings generated by the expression in front of the star
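The five cases above can be read as five constructors. As a rough sketch (not a practical matching algorithm), the Python code below represents each regular expression as a function that returns every string it generates up to a given length; the bound is needed only because the Kleene star generates infinitely many strings. All of the names (char, empty, cat, alt, star, any_of) are ours.

```python
def char(c):
    """Case 1: a single character."""
    return lambda limit: {c} if limit >= 1 else set()


def empty():
    """Case 2: the empty string."""
    return lambda limit: {""}


def cat(first, second):
    """Case 3: a string from `first` followed by a string from `second`."""
    def generate(limit):
        return {x + y for x in first(limit) for y in second(limit - len(x))}
    return generate


def alt(first, second):
    """Case 4: a string from `first` or a string from `second`."""
    return lambda limit: first(limit) | second(limit)


def star(inner):
    """Case 5: zero or more strings from `inner`, concatenated."""
    def generate(limit):
        result = {""}
        frontier = {""}
        while frontier:
            longer = {x + y for x in frontier
                      for y in inner(limit - len(x)) if y}
            frontier = longer - result
            result |= frontier
        return result
    return generate


def any_of(chars):
    """Convenience: alternation among the characters of a nonempty string."""
    regex = char(chars[0])
    for c in chars[1:]:
        regex = alt(regex, char(c))
    return regex


digit = any_of("0123456789")
non_zero_digit = any_of("123456789")
natural_number = cat(non_zero_digit, star(digit))
```

Calling natural_number(2) yields the 9 one-digit and 90 two-digit numerals that begin with a nonzero digit.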
Parentheses are used to avoid ambiguity about where the various subexpressions start and end.3
EXAMPLE 2.3 Syntax of numeric constants
Consider, for example, the syntax of numeric constants accepted by a simple
hand-held calculator:
number −→ integer | real
integer −→ digit digit *
real −→ integer exponent | decimal ( exponent | ε )
decimal −→ digit * ( . digit | digit . ) digit *
exponent −→ ( e | E ) ( + | - | ε ) integer
digit −→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
The symbols to the left of the −→ signs provide names for the regular expressions. One of these (number) will serve as a token name; the others are simply
for convenience in building larger expressions.4 Note that while we have allowed
definitions to build on one another, nothing is ever defined in terms of itself,
even indirectly. Such recursive definitions are the distinguishing characteristic of
context-free grammars, described in Section 2.1.2. To generate a valid number,
we expand out the sub-definitions and then scan the resulting expression from left
to right, choosing among alternatives at each vertical bar, and choosing a number
of repetitions at each Kleene star. Within each repetition we may make different
choices at vertical bars, generating different substrings.
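Because none of the definitions refers to itself, the sub-definitions can be expanded mechanically into a single expression. The sketch below does this with Python's re module, building the pattern by string substitution; the variable and function names are ours, and an empty alternative stands in for the empty string.

```python
import re

DIGIT = r"[0-9]"
INTEGER = rf"{DIGIT}{DIGIT}*"
# ( e | E ) ( + | - | empty ) integer
EXPONENT = rf"(?:e|E)(?:\+|-|){INTEGER}"
# digit* ( . digit | digit . ) digit*
DECIMAL = rf"{DIGIT}*(?:\.{DIGIT}|{DIGIT}\.){DIGIT}*"
# integer exponent | decimal ( exponent | empty )
REAL = rf"(?:{INTEGER}{EXPONENT}|{DECIMAL}(?:{EXPONENT}|))"
NUMBER = re.compile(rf"(?:{INTEGER}|{REAL})")


def is_number(text):
    """Return True if the entire string is a numeric constant."""
    return NUMBER.fullmatch(text) is not None
```

With these definitions, strings such as 42, 3.14, .5, 5., and 6.022e23 are accepted, while a lone period or an exponent with no digits after it is rejected.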
Character Sets and Formatting Issues
Upper- and lowercase letters in identifiers and keywords are considered distinct
in some languages (e.g., Modula-2/3 and C and its descendants), and identical in
others (e.g., Ada, Common Lisp, Fortran 90, and Pascal). Thus foo , Foo , and FOO
all represent the same identifier in Ada, but different identifiers in C. Modula-2
and Modula-3 require keywords and predefined (built-in) identifiers to be written
in uppercase; C and its descendants require them to be written in lowercase.
A few languages (notably Modula-3 and Standard Pascal) allow only letters and
digits in identifiers. Most (including many actual implementations of Pascal) allow
underscores. A few (notably Lisp) allow a variety of additional characters. Some
languages (e.g., Java, C#, and Modula-3) have standard conventions on the use of
upper- and lowercase letters in names.5
3 Some authors use λ to represent the empty string. Some use a period (.), rather than juxtaposition, to indicate concatenation. Some use a plus sign (+), rather than a vertical bar, to indicate
alternation.
4 We have assumed here that all numeric constants are simply “numbers.” In many programming
languages, integer and real constants are separate kinds of token. Their syntax may also be more
complex than indicated here, to support such features as multiple lengths or nondecimal bases.
With the globalization of computing, non-Latin character sets have become
increasingly important. Many modern languages, including C99, C++, Ada 95,
Java, C#, and Fortran 2003 have explicit support for multibyte character sets,
generally based on the Unicode and ISO/IEC 10646 international standards. Most
modern programming languages allow non-Latin characters to appear within
comments and character strings; an increasing number allow them in identifiers
as well. Conventions for portability across character sets and for localization to a
given character set can be surprisingly complex, particularly when various forms
of backward compatibility are required (the C99 Rationale devotes five full pages
to this subject [Int99, pp. 19–23]); for the most part we ignore such issues here.
Some language implementations impose limits on the maximum length of
identifiers, but most avoid such unnecessary restrictions. Most modern languages
are also more-or-less free format, meaning that a program is simply a sequence of
tokens: what matters is their order with respect to one another, not their physical
position within a printed line or page. “White space” (blanks, tabs, carriage returns,
and line and page feed characters) between tokens is usually ignored, except to
the extent that it is needed to separate one token from the next. There are a few
exceptions to these rules. Some language implementations limit the maximum
length of a line, to allow the compiler to store the current line in a fixed-length
buffer. Dialects of Fortran prior to Fortran 90 use a fixed format, with 72 characters per line (the width of a paper punch card, on which programs were once
stored), and with different columns within the line reserved for different purposes.
Linebreaks serve to separate statements in several other languages, including
Haskell, Occam, SR, Tcl, and Python. Haskell, Occam, and Python also give special
significance to indentation. The body of a loop, for example, consists of precisely
those subsequent lines that are indented farther than the header of the loop.
DESIGN & IMPLEMENTATION
Formatting restrictions
Formatting limitations inspired by implementation concerns (as in the punch-card-oriented rules of Fortran 77 and its predecessors) have a tendency
to become unwanted anachronisms as implementation techniques improve.
Given the tendency of certain word processors to “fill” or auto-format text, the
linebreak and indentation rules of languages like Haskell, Occam, and Python
are somewhat controversial.
5 For the sake of consistency we do not always obey such conventions in this book: most examples
follow the common practice of C programmers, in which underscores, rather than capital letters,
separate the “subwords” of names.
Other Uses of Regular Expressions
Many readers will be familiar with regular expressions from the grep family of
tools in Unix, the search facilities of various text editors (notably emacs ), or
such scripting languages and tools as Perl, Python, Ruby, awk , and sed . Most
of these provide a rich set of extensions to the notation of regular expressions.
Some extensions, such as shorthand for “zero or one occurrences” or “anything
other than white space,” do not change the power of the notation. Others, such
as the ability to require a second occurrence, later in the input string, of the
same character sequence that matched an earlier part of the expression, increase
the power of the notation, so that it is no longer restricted to generating regular
sets. Still other extensions are designed not to increase the expressiveness of the
notation but rather to tie it to other language facilities. In many tools, for example,
one can bracket portions of a regular expression in such a way that when a string
is matched against it the contents of the corresponding substrings are assigned
into named local variables. We will return to these issues in Section 13.4.2, in the
context of scripting languages.
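The three kinds of extension described above can all be seen in Python's re module. The patterns below are a small illustration rather than a survey; the variable names and the sample assignment syntax are assumptions of ours.

```python
import re

# Shorthand that adds no power: "?" means zero or one occurrences.
# The same set could be written with alternation and the empty string.
optional_sign = re.compile(r"-?[0-9]+")

# An extension that adds power: \1 demands a second occurrence of
# whatever text group 1 matched, which no regular set can express in general.
repeated_word = re.compile(r"\b(\w+) \1\b")

# An extension that ties matching to other language facilities:
# named groups make the matched substrings available by name.
assignment = re.compile(r"(?P<name>[A-Za-z_]\w*)\s*:=\s*(?P<value>[0-9]+)")


def parse_assignment(text):
    """Return (name, value) for text of the form 'name := digits', else None."""
    match = assignment.fullmatch(text)
    if match is None:
        return None
    return match.group("name"), int(match.group("value"))
```

Here the substrings come back through a match object rather than being assigned directly into local variables; languages such as Perl take the more direct route.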
2.1.2 Context-Free Grammars
EXAMPLE 2.4 Syntactic nesting in expressions
Regular expressions work well for defining tokens. They are unable, however, to
specify nested constructs, which are central to programming languages. Consider
for example the structure of an arithmetic expression.
expr −→ id | number | - expr | ( expr )
| expr op expr
op −→ + | - | * | /
Here the ability to define a construct in terms of itself is crucial. Among other
things, it allows us to ensure that left and right parentheses are matched, something
that cannot be accomplished with regular expressions (see Section 2.4.3 for
more details). The arrow symbol (−→) means “can have the form”; for brevity it
is sometimes pronounced “goes to.”
Each of the rules in a context-free grammar is known as a production. The
symbols on the left-hand sides of the productions are known as variables, or
nonterminals. There may be any number of productions with the same left-hand
side. Symbols that are to make up the strings derived from the grammar are
known as terminals (shown here in typewriter font). They cannot appear on
the left-hand side of any production. In a programming language, the terminals
of the context-free grammar are the language’s tokens. One of the nonterminals,
usually the one on the left-hand side of the first production, is called the start
symbol. It names the construct defined by the overall grammar.
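The terminology above maps directly onto a simple data structure. In the sketch below, the expression grammar is stored as a Python dictionary from each left-hand side to its list of right-hand sides; the representation and the function names are ours, not a standard one.

```python
# One entry per left-hand side; each right-hand side is a list of symbols.
GRAMMAR = {
    "expr": [["id"], ["number"], ["-", "expr"], ["(", "expr", ")"],
             ["expr", "op", "expr"]],
    "op": [["+"], ["-"], ["*"], ["/"]],
}
START_SYMBOL = "expr"  # left-hand side of the first production


def nonterminals(grammar):
    """Symbols that appear on the left-hand side of some production."""
    return set(grammar)


def terminals(grammar):
    """Symbols that appear only on right-hand sides."""
    symbols = {symbol
               for alternatives in grammar.values()
               for rhs in alternatives
               for symbol in rhs}
    return symbols - nonterminals(grammar)
```

For this grammar the terminals are id, number, the two parentheses, and the four operator symbols.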
EXAMPLE 2.5 Extended BNF (EBNF)
The notation for context-free grammars is sometimes called Backus-Naur Form
(BNF), in honor of John Backus and Peter Naur, who devised it for the definition
of the Algol-60 programming language [NBB+ 63].6 Strictly speaking, the Kleene
star and meta-level parentheses of regular expressions are not allowed in BNF,
but they do not change the expressive power of the notation, and are commonly
included for convenience. Sometimes one sees a “Kleene plus” (+ ) as well; it
indicates one or more instances of the symbol or group of symbols in front of it.7
When augmented with these extra operators, the notation is often called extended
BNF (EBNF). The construct
id_list −→ id ( , id )*
is shorthand for
id_list −→ id
id_list −→ id_list , id
“Kleene plus” is analogous. Note that the parentheses here are metasymbols. In
Example 2.4 they were part of the language being defined, and were written in
fixed-width font.8
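One way to convince oneself that the two formulations of id_list describe the same strings is to write a recognizer for each. In the sketch below, tokens are assumed to be given as a Python list of the strings "id" and ","; the Kleene star becomes a loop, while the recursive production becomes a recursive call. The function names are ours.

```python
def is_id_list_ebnf(tokens):
    """id_list -> id ( , id )*   (the Kleene star becomes a loop)"""
    if not tokens or tokens[0] != "id":
        return False
    pos = 1
    while pos < len(tokens):
        if (tokens[pos] != "," or pos + 1 >= len(tokens)
                or tokens[pos + 1] != "id"):
            return False
        pos += 2
    return True


def is_id_list_bnf(tokens):
    """id_list -> id | id_list , id   (peel ', id' off the right end)"""
    if tokens == ["id"]:
        return True
    return (len(tokens) >= 3 and tokens[-2:] == [",", "id"]
            and is_id_list_bnf(tokens[:-2]))
```

The two functions agree on every input, though the recursive version reflects the left-to-right grouping that the second production imposes.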
Like the Kleene star and parentheses, the vertical bar is in some sense superfluous, though it was provided in the original BNF. The construct
op −→ + | - | * | /
can be considered shorthand for
op −→ +
op −→ -
op −→ *
op −→ /
which is also sometimes written
op −→ +
   −→ -
   −→ *
   −→ /
6 John Backus (1924–2007) was also the inventor of Fortran. He spent most of his professional
career at IBM Corporation, and was named an IBM Fellow in 1987. He received the ACM Turing
Award in 1977.
7 Some authors use curly braces ({ }) to indicate zero or more instances of the symbols inside.
Some use square brackets ([ ]) to indicate zero or one instances of the symbols inside—that is,
to indicate that those symbols are optional. In both regular and extended BNF, many authors use
::= instead of −→.
8 To avoid confusion, some authors place quote marks around any single character that is part of
the language being defined: id_list −→ id ( ‘,’ id )* ; expr −→ ‘(’ expr ‘)’.
Many tokens, such as id and number above, have many possible spellings (i.e.,
may be represented by many possible strings of characters). The parser is oblivious
to these; it does not distinguish one identifier from another. The semantic analyzer
does distinguish them, however; the scanner must save the spelling of each such
“interesting” token for later use.
2.1.3 Derivations and Parse Trees
EXAMPLE 2.6 Derivation of slope * x + intercept
A context-free grammar shows us how to generate a syntactically valid string
of terminals: Begin with the start symbol. Choose a production with the start
symbol on the left-hand side; replace the start symbol with the right-hand side
of that production. Now choose a nonterminal A in the resulting string, choose a
production P with A on its left-hand side, and replace A with the right-hand side
of P. Repeat this process until no nonterminals remain.
As an example, we can use our grammar for expressions to generate the string
“ slope * x + intercept ”:
expr =⇒ expr op expr
     =⇒ expr op id
     =⇒ expr + id
     =⇒ expr op expr + id
     =⇒ expr op id + id
     =⇒ expr * id + id
     =⇒ id(slope) * id(x) + id(intercept)
The =⇒ metasymbol is often pronounced “derives.” It indicates that the right-hand side was obtained by using a production to replace some nonterminal in the
left-hand side. At each line we have underlined the symbol A that is replaced in
the following line.
A series of replacement operations that shows how to derive a string of terminals
from the start symbol is called a derivation. Each string of symbols along the way
is called a sentential form. The final sentential form, consisting of only terminals,
is called the yield of the derivation. We sometimes elide the intermediate steps and
write expr =⇒∗ slope * x + intercept , where the metasymbol =⇒∗ means
“derives after zero or more replacements.” In this particular derivation, we have
chosen at each step to replace the right-most nonterminal with the right-hand side
of some production. This replacement strategy leads to a right-most derivation.
There are many other possible derivations, including left-most and options in-between.
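The replacement process is mechanical enough to automate. The sketch below replays a right-most derivation: given a start symbol and a sequence of productions, it applies each production to the right-most nonterminal and records every sentential form. The names NONTERMINALS, STEPS, and rightmost_derivation are ours, and the list of steps is simply the sequence of choices made in the derivation of slope * x + intercept.

```python
NONTERMINALS = {"expr", "op"}


def rightmost_derivation(start, steps):
    """Apply each (lhs, rhs) production to the right-most nonterminal in turn."""
    form = [start]
    forms = [list(form)]
    for lhs, rhs in steps:
        # Position of the right-most nonterminal in the current sentential form.
        index = max(i for i, symbol in enumerate(form)
                    if symbol in NONTERMINALS)
        if form[index] != lhs:
            raise ValueError(
                f"right-most nonterminal is {form[index]}, not {lhs}")
        form = form[:index] + list(rhs) + form[index + 1:]
        forms.append(list(form))
    return forms


STEPS = [
    ("expr", ["expr", "op", "expr"]),
    ("expr", ["id"]),
    ("op", ["+"]),
    ("expr", ["expr", "op", "expr"]),
    ("expr", ["id"]),
    ("op", ["*"]),
    ("expr", ["id"]),
]

for sentential_form in rightmost_derivation("expr", STEPS):
    print(" ".join(sentential_form))
```

Changing max to min in the function would replay a left-most derivation instead, provided the steps are supplied in the corresponding order.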
We saw in Chapter 1 that we can represent a derivation graphically as a parse
tree. The root of the parse tree is the start symbol of the grammar. The leaves of
the tree are its yield. Each internal node, together with its children, represents the
use of a production.
Figure 2.1 Parse tree for slope * x + intercept (grammar in Example 2.4).
Figure 2.2 Alternative (less desirable) parse tree for slope * x + intercept (grammar in
Example 2.4). The fact that more than one tree exists implies that our grammar is ambiguous.
EXAMPLE 2.7 Parse trees for slope * x + intercept
A parse tree for our example expression appears in Figure 2.1. This tree is
not unique. At the second level of the tree, we could have chosen to turn the
operator into a * instead of a + , and to further expand the expression on the
right, rather than the one on the left (see Figure 2.2). A grammar that allows
the construction of more than one parse tree for some string of terminals is
said to be ambiguous. Ambiguity turns out to be a problem when trying to build
a parser: it requires some extra mechanism to drive a choice between equally
acceptable alternatives.
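The degree of ambiguity can be measured by counting parse trees. The sketch below does so for a fragment of the expression grammar, restricted (as a simplifying assumption) to the productions expr −→ id and expr −→ expr op expr; every operator in the token sequence is a candidate for the root of the tree. The function name count_parse_trees is ours.

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}


def count_parse_trees(tokens):
    """Count parse trees for tokens under  expr -> id | expr op expr."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of distinct ways to derive tokens[lo:hi] from expr.
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in OPERATORS:
                # tokens[mid] is the operator at the root of this subtree.
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))
```

For id * id + id the count is 2, matching the two trees discussed above; with four operands it rises to 5, and it continues to grow rapidly as operands are added.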
A moment’s reflection will reveal that there are infinitely many context-free
grammars for any given context-free language.9 Some grammars, however, are
much more useful than others. In this text we will avoid the use of ambiguous
grammars (though most parser generators allow them, by means of disambiguating rules). We will also avoid the use of so-called useless symbols: nonterminals
that cannot generate any string of terminals, or terminals that cannot appear in
the yield of any derivation.
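Useless nonterminals of the first kind, those that cannot generate any string of terminals, can be found by a simple fixed-point computation. The following is a sketch under the assumption that a grammar is stored as a dictionary from left-hand sides to lists of right-hand sides; the sample grammar and its nonterminal loop are hypothetical, invented to show a symbol that never bottoms out.

```python
def generating_nonterminals(grammar):
    """Return the nonterminals that can derive at least one string of terminals."""
    generating = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs in generating:
                continue
            for rhs in alternatives:
                # Every symbol must be a terminal or already known to generate.
                if all(s not in grammar or s in generating for s in rhs):
                    generating.add(lhs)
                    changed = True
                    break
    return generating


SAMPLE = {
    "expr": [["id"], ["expr", "op", "expr"], ["loop"]],
    "op": [["+"], ["*"]],
    "loop": [["(", "loop", ")"]],  # hypothetical: every expansion contains loop
}

useless = set(SAMPLE) - generating_nonterminals(SAMPLE)
```

Here useless contains only loop. Detecting symbols that cannot appear in the yield of any derivation requires a second, similar pass that starts from the start symbol.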
9 Given a specific grammar, there are many ways to create other equivalent grammars. We could,
for example, replace A with some new symbol B everywhere it appears in the right-hand side of
a production, and then create a new production B −→ A.
Figure 2.3 Parse tree for 3 + 4 * 5, with precedence (grammar in Example 2.8).
EXAMPLE 2.8 Expression grammar with precedence and associativity
When designing the grammar for a programming language, we generally try to
find one that reflects the internal structure of programs in a way that is useful to
the rest of the compiler. (We shall see in Section 2.3.2 that we also try to find one
that can be parsed efficiently, which can be a bit of a challenge.) One place in which
structure is particularly important is in arithmetic expressions, where we can use
productions to capture the associativity and precedence of the various operators.
Associativity tells us that the operators in most languages group left to right, so
that 10 - 4 - 3 means (10 - 4) - 3 rather than 10 - (4 - 3) . Precedence
tells us that multiplication and division in most languages group more tightly
than addition and subtraction, so that 3 + 4 * 5 means 3 + (4 * 5) rather
than (3 + 4) * 5 . (These rules are not universal; we will consider them again in
Section 6.1.1.)
Here is a better version of our expression grammar:
1. expr −→ term | expr add_op term
2. term −→ factor | term mult_op factor
3. factor −→ id | number | - factor | ( expr )
4. add_op −→ + | -
5. mult_op −→ * | /
This grammar is unambiguous. It captures precedence in the way factor, term,
and expr build on one another, with different operators appearing at each level. It
captures associativity in the second halves of lines 1 and 2, which build subexprs
and subterms to the left of the operator, rather than to the right. In Figure 2.3, we
can see how building the notion of precedence into the grammar makes it clear
that multiplication groups more tightly than addition in 3 + 4 * 5 , even without
parentheses. In Figure 2.4, we can see that subtraction groups more tightly to the
left, so that 10 - 4 - 3 would evaluate to 3 , rather than to 9 .
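To see the grouping in action, one can write a small evaluator whose structure follows the three levels of the grammar. The sketch below is one possible rendering in Python, with several assumptions of ours: operands are limited to unsigned integer literals (identifiers are omitted, since they have no values here), and the left-recursive productions are implemented as loops that accumulate a result from left to right, which preserves left associativity. The class and function names are ours.

```python
import re


def tokenize(text):
    """Split the input into integer literals and single-character symbols."""
    return re.findall(r"[0-9]+|\S", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr -> term | expr add_op term
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value = value + self.term()
            else:
                value = value - self.term()
        return value

    def term(self):
        # term -> factor | term mult_op factor
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value = value * self.factor()
            else:
                value = value / self.factor()
        return value

    def factor(self):
        # factor -> number | - factor | ( expr )
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if re.fullmatch(r"[0-9]+", token):
            return int(token)
        if token == "-":
            return -self.factor()
        if token == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

With this evaluator, 10 - 4 - 3 yields 3 and 3 + 4 * 5 yields 23, in agreement with the parse trees.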
Figure 2.4 Parse tree for 10 - 4 - 3, with left associativity (grammar in Example 2.8).
CHECK YOUR UNDERSTANDING
1. What is the difference between syntax and semantics?
2. What are the three basic operations that can be used to build complex regular
expressions from simpler regular expressions?
3. What additional operation (beyond the three of regular expressions) is provided in context-free grammars?
4. What is Backus-Naur form? When and why was it devised?
5. Name a language in which indentation affects program syntax.
6. When discussing context-free languages, what is a derivation? What is a sentential form?
7. What is the difference between a right-most derivation and a left-most derivation?
8. What does it mean for a context-free grammar to be ambiguous?
9. What are associativity and precedence? Why are they significant in parse trees?
2.2 Scanning
Together, the scanner and parser for a programming language are responsible for
discovering the syntactic structure of a program. This process of discovery, or
syntax analysis, is a necessary first step toward translating the program into an
equivalent program in the target language. (It’s also the first step toward interpreting the program directly. In general, we will focus on compilation, rather than
interpretation, for the remainder of the book. Most of what we shall discuss either
has an obvious application to interpretation, or is obviously irrelevant to it.)
By grouping input characters into tokens, the scanner dramatically reduces the
number of individual items that must be inspected by the more computationally intensive parser. In addition, the scanner typically removes comments (so
the parser doesn’t have to worry about them appearing throughout the context-free grammar); saves the text of “interesting” tokens like identifiers, strings, and
numeric literals; and tags tokens with line and column numbers, to make it easier
to generate high-quality error messages in subsequent phases.
EXAMPLE 2.9 Tokens for a calculator language
In Examples 2.4 and 2.8 we considered a simple language for arithmetic expressions. In Section 2.3.1 we will extend this to create a simple “calculator language”
with input, output, variables, and assignment. For this language we will use the
following set of tokens.
assign −→ :=
plus −→ +
minus −→ -
times −→ *
div −→ /
lparen −→ (
rparen −→ )
id −→ letter ( letter | digit )*    except for read and write
number −→ digit digit* | digit* ( . digit | digit . ) digit*
In keeping with Algol and its descendants (and in contrast to the C-family languages), we have used := rather than = for assignment. For simplicity, we have
omitted the exponential notation found in Example 2.3. We have also listed
the tokens read and write as exceptions to the rule for id (more on this in
Section 2.2.2). To make the task of the scanner a little more realistic, we borrow
the two styles of comment from C:
comment −→ / * ( non-* | * non-/ )* * * /
| / / ( non-newline )* newline
Here we have used non-*, non-/, and non-newline as shorthand for the alternation
of all characters other than *, /, and newline, respectively.
EXAMPLE 2.10 An ad hoc scanner for calculator tokens
How might we go about recognizing the tokens of our calculator language? The
simplest approach is entirely ad hoc. Pseudocode appears in Figure 2.5. We can
structure the code however we like, but it seems reasonable to check the simpler
and more common cases first, to peek ahead when we need to, and to embed loops
for comments and for long tokens such as identifiers and numbers.

skip any initial white space (spaces, tabs, and newlines)
if cur_char ∈ { ‘(’, ‘)’, ‘+’, ‘-’, ‘*’ }
    return the corresponding single-character token
if cur_char = ‘:’
    read the next character
    if it is ‘=’ then return assign else announce an error
if cur_char = ‘/’
    peek at the next character
    if it is ‘*’ or ‘/’
        read additional characters until “*/” or newline is seen, respectively
        jump back to top of code
    else return div
if cur_char = ‘.’
    read the next character
    if it is a digit
        read any additional digits
        return number
    else announce an error
if cur_char is a digit
    read any additional digits and at most one decimal point
    return number
if cur_char is a letter
    read any additional letters and digits
    check to see whether the resulting string is read or write
    if so then return the corresponding token
    else return id
else announce an error
Figure 2.5 Outline of an ad hoc scanner for tokens in our calculator language.

DESIGN & IMPLEMENTATION
Nested comments
Nested comments can be handy for the programmer (e.g., for temporarily
“commenting out” large blocks of code). Scanners normally deal only with nonrecursive constructs, however, so nested comments require special treatment.
Some languages disallow them. Others require the language implementor to
augment the scanner with special-purpose comment-handling code. C++ and
C99 strike a compromise: /* ... */ style comments are not allowed to nest,
but /* ... */ and //... style comments can appear inside each other. The
programmer can thus use one style for “normal” comments and the other for
“commenting out.” (The C99 designers note, however, that conditional compilation (#if) is preferable [Int03a, p. 58].)
After finding a token the scanner returns to the parser. When invoked again it
repeats the algorithm from the beginning, using the next available characters of
input (including any that were peeked at but not consumed the last time).
As a rule, we accept the longest possible token in each invocation of the scanner.
Thus foobar is always foobar and never f or foo or foob . More to the point,
in a language like C, 3.14159 is a real number and never 3 , . , and 14159 . White
space (blanks, tabs, newlines, comments) is generally ignored, except to the extent
that it separates tokens (e.g., foo bar is different from foobar ).
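The outline of Figure 2.5 can be fleshed out in many ways. The Python sketch below is one possibility rather than a definitive implementation. It assumes the whole input is available as a string, so that "peeking" is just indexing one position ahead, and it treats end of input as ending a // comment. The names scan, SINGLE, and KEYWORDS are illustrative.

```python
import string

DIGITS = string.digits
LETTERS = string.ascii_letters
KEYWORDS = {"read", "write"}
SINGLE = {"(": "lparen", ")": "rparen", "+": "plus", "-": "minus", "*": "times"}

def scan(text):
    """Return a list of (token, image) pairs for the calculator language."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c in (" ", "\t", "\n"):              # skip white space
            i += 1
        elif c in SINGLE:
            tokens.append((SINGLE[c], c))
            i += 1
        elif c == ":":
            if text[i + 1:i + 2] != "=":
                raise ValueError(f"expected '=' after ':' at offset {i}")
            tokens.append(("assign", ":="))
            i += 2
        elif c == "/":
            nxt = text[i + 1:i + 2]             # peek; empty string at end of input
            if nxt == "/":                      # comment runs to the end of the line
                end = text.find("\n", i)
                i = n if end == -1 else end + 1
            elif nxt == "*":                    # comment runs to the closing */
                end = text.find("*/", i + 2)
                if end == -1:
                    raise ValueError("unterminated comment")
                i = end + 2
            else:
                tokens.append(("div", "/"))
                i += 1
        elif c in DIGITS or c == ".":
            start = i
            while i < n and text[i] in DIGITS:
                i += 1
            if i < n and text[i] == ".":        # at most one decimal point
                i += 1
                while i < n and text[i] in DIGITS:
                    i += 1
            image = text[start:i]
            if image == ".":
                raise ValueError(f"digit expected after '.' at offset {start}")
            tokens.append(("number", image))
        elif c in LETTERS:
            start = i
            while i < n and (text[i] in LETTERS or text[i] in DIGITS):
                i += 1                          # longest possible identifier
            image = text[start:i]
            tokens.append((image if image in KEYWORDS else "id", image))
        else:
            raise ValueError(f"unexpected character {c!r} at offset {i}")
    return tokens
```

The inner loops for numbers and identifiers are what enforce the "longest possible token" rule here: each keeps consuming characters for as long as the token can be extended.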
Figure 2.5 could be extended fairly easily to outline a scanner for some larger
programming language. The result could then be fleshed out, by hand, to create code in some implementation language. Production compilers often use such
ad hoc scanners; the code is fast and compact. During development, however, it is
usually preferable to build a scanner in a more structured way, as an explicit representation of a finite automaton. Finite automata can be generated automatically
from a set of regular expressions, making it easy to regenerate a scanner when
token definitions change.

Figure 2.6 Pictorial representation of a scanner for calculator tokens, in the form of a finite
automaton. This figure roughly parallels the code in Figure 2.5. States are numbered for reference
in Figure 2.12. Scanning for each token begins in the state marked “Start.” The final states, in
which a token is recognized, are indicated by double circles. Comments, when recognized, send
the scanner back to its start state, rather than a final state.

EXAMPLE 2.11 Finite automaton for a calculator scanner
An automaton for the tokens of our calculator language appears in pictorial
form in Figure 2.6. The automaton starts in a distinguished initial state. It then
moves from state to state based on the next available character of input. When it
reaches one of a designated set of final states it recognizes the token associated
with that state. The “longest possible token” rule means that the scanner returns
to the parser only when the next character cannot be used to continue the current
token.
2.2.1 Generating a Finite Automaton
While a finite automaton can in principle be written by hand, it is more common
to build one automatically from a set of regular expressions, using a scanner
generator tool. Because regular expressions are significantly easier to write and
modify than is an ad hoc scanner, automatically generated scanners are often used
during language or compiler development, or when ease of implementation is
more important than the last little bit of run-time performance. In effect, regular
expressions constitute a declarative programming language for a limited problem
domain, namely that of scanning.
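To give a feel for this declarative style, the sketch below states the calculator tokens as regular expressions and lets Python's re module do the matching. This is an illustration under stated assumptions, not how a scanner generator works internally: re is a backtracking matcher that tries alternatives from left to right and accepts the first that succeeds, so the patterns are ordered and written to give the longest match. The comment pattern is one common formulation, and the names TOKEN_SPEC, MASTER, and tokenize are hypothetical.

```python
import re

# (token name, regular expression) pairs, tried in order at each position.
TOKEN_SPEC = [
    ("comment", r"/\*(?:[^*]|\*+[^*/])*\*+/|//[^\n]*(?:\n|$)"),
    ("white_space", r"[ \t\n]+"),
    ("assign", r":="),
    ("plus", r"\+"),
    ("minus", r"-"),
    ("times", r"\*"),
    ("div", r"/"),
    ("lparen", r"\("),
    ("rparen", r"\)"),
    # digit* ( . digit | digit . ) digit*  |  digit digit*
    ("number", r"[0-9]*(?:\.[0-9]|[0-9]\.)[0-9]*|[0-9]+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))
KEYWORDS = {"read", "write"}

def tokenize(text):
    """Return (token, image) pairs, dropping white space and comments."""
    pos, tokens = 0, []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"lexical error at offset {pos}")
        kind, image = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("white_space", "comment"):
            continue
        if kind == "id" and image in KEYWORDS:   # keywords are exceptions to id
            kind = image
        tokens.append((kind, image))
    return tokens
```

Changing a token definition means editing one line of TOKEN_SPEC, which is the ease-of-modification argument made above. A generated scanner would compile the same expressions into a DFA instead of interpreting them.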
The example automaton of Figure 2.6 is deterministic: there is never any ambiguity about what it ought to do, because in a given state with a given input character
there is never more than one possible outgoing transition (arrow) labeled by that
character. As it turns out, however, there is no obvious one-step algorithm to convert a set of regular expressions into an equivalent deterministic finite automaton
(DFA). The typical scanner generator implements the conversion as a series of
three separate steps.
The first step converts the regular expressions into a nondeterministic finite
automaton (NFA). An NFA is like a DFA except that (1) there may be more than
one transition out of a given state labeled by a given character, and (2) there may
be so-called epsilon transitions: arrows labeled by the empty string symbol, ε. The
NFA is said to accept an input string (token) if there exists a path from the start
state to a final state whose non-epsilon transitions are labeled, in order, by the
characters of the token.
To avoid the need to search all possible paths for one that “works,” the second
step of a scanner generator translates the NFA into an equivalent DFA: an automaton that accepts the same language, but in which there are no epsilon transitions,
and no states with more than one outgoing transition labeled by the same character. The third step is a space optimization that generates a final DFA with the
minimum possible number of states.
From a Regular Expression to an NFA
EXAMPLE 2.12 Constructing an NFA for a given regular expression
A trivial regular expression consisting of a single character c is equivalent to a simple two-state NFA (in fact, a DFA), illustrated in part (a) of Figure 2.7. Similarly,
the regular expression ε is equivalent to a two-state NFA whose arc is labeled by ε.
Starting with this base we can use three subconstructions, illustrated in parts (b)
through (d) of the same figure, to build larger NFAs to represent the concatenation,
alternation, or Kleene closure of the regular expressions represented by smaller
NFAs. Each step preserves three invariants: there are no transitions into the initial
state, there is a single final state, and there are no transitions out of the final state.
These invariants allow smaller machines to be joined into larger machines without
any ambiguity about where to create the connections, and without creating any
unexpected paths.
EXAMPLE 2.13 NFA for d*( .d | d. ) d*
To make these constructions concrete, we consider a small but nontrivial
example: the decimal strings of Example 2.3. These consist of a string of decimal
digits containing a single decimal point. With only one digit, the point can come
at the beginning or the end: ( .d | d. ), where for brevity we use d to represent
any decimal digit. Arbitrary numbers of digits can then be added at the beginning
or the end: d*( .d | d. ) d*. Starting with this regular expression and using the
constructions of Figure 2.7, we illustrate the construction of an equivalent NFA
in Figure 2.8.

Figure 2.7 Construction of an NFA equivalent to a given regular expression. Part (a) shows
the base case: the automaton for the single letter c. Parts (b), (c), and (d), respectively, show
the constructions for concatenation, alternation, and Kleene closure. Each construction retains a
unique start state and a single final state. Internal detail is hidden in the diamond-shaped center
regions.
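The four constructions can be sketched in a few lines of Python. This is a hedged illustration, not the book's code: NFABuilder, closure, accepts, and decimal_nfa are hypothetical names, an edge label of None stands for an epsilon transition, and the label "digit" stands for d. Concatenation is assumed here to merge the final state of the first machine with the start state of the second, a choice that yields 14 states for the example, the same count as the numbered states of Figure 2.8.

```python
from itertools import count

class NFABuilder:
    """Builds NFAs as (start, final) pairs over a shared edge table."""
    def __init__(self):
        self.edges = {}               # state -> list of (label, target)
        self._ids = count(1)

    def _state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def symbol(self, c):              # part (a): a single input symbol
        start, final = self._state(), self._state()
        self.edges[start].append((c, final))
        return start, final

    def concat(self, a, b):           # part (b): a's final state absorbs b's start state
        self.edges[a[1]].extend(self.edges.pop(b[0]))
        return a[0], b[1]

    def alt(self, a, b):              # part (c): new start and final states
        start, final = self._state(), self._state()
        self.edges[start] += [(None, a[0]), (None, b[0])]
        self.edges[a[1]].append((None, final))
        self.edges[b[1]].append((None, final))
        return start, final

    def star(self, a):                # part (d): zero or more repetitions
        start, final = self._state(), self._state()
        self.edges[start] += [(None, a[0]), (None, final)]
        self.edges[a[1]] += [(None, a[0]), (None, final)]
        return start, final

def closure(edges, states):
    """All states reachable from `states` by epsilon transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        for label, target in edges[stack.pop()]:
            if label is None and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def accepts(edges, start, final, text):
    """Simulate the NFA by tracking every state it might be in."""
    current = closure(edges, {start})
    for ch in text:
        symbol = "digit" if ch in "0123456789" else ch
        moved = {t for s in current for label, t in edges[s] if label == symbol}
        current = closure(edges, moved)
    return final in current

def decimal_nfa():
    """NFA for d*( .d | d. ) d*."""
    b = NFABuilder()
    d = lambda: b.symbol("digit")
    dot = lambda: b.symbol(".")
    middle = b.alt(b.concat(dot(), d()), b.concat(d(), dot()))
    start, final = b.concat(b.concat(b.star(d()), middle), b.star(d()))
    return b.edges, start, final
```

Merging states in concat is safe only because of the invariants listed earlier: nothing enters the start state of the second machine, and nothing leaves the final state of the first.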
Figure 2.8 Construction of an NFA equivalent to the regular expression d*( .d | d. ) d*.
In the top row are the primitive automata for . and d, and the Kleene closure construction for
d*. In the second and third rows we have used the concatenation and alternation constructions
to build .d, d., and ( .d | d. ). The fourth row uses concatenation again to complete the NFA.
We have labeled the states in the final automaton for reference in subsequent figures.

From an NFA to a DFA
EXAMPLE 2.14 DFA for d*( .d | d. ) d*
With no way to “guess” the right transition to take from any given state, any practical implementation of an NFA would need to explore all possible transitions,
concurrently or via backtracking. To avoid such a complex and time-consuming
strategy, we can use a “set of subsets” construction to transform the NFA into
an equivalent DFA. The key idea is for the state of the DFA after reading a given
input to represent the set of states that the NFA might have reached on the same
input. We illustrate the construction in Figure 2.9 using the NFA from Figure 2.8.
Initially, before it consumes any input, the NFA may be in State 1, or it may make
epsilon transitions to States 2, 4, 5, or 8. We thus create an initial State A for our
DFA to represent this set. On an input of d, our NFA may move from State 2 to
State 3, or from State 8 to State 9. It has no other transitions on this input from any
of the states in A. From State 3, however, the NFA may make epsilon transitions
to any of States 2, 4, 5, or 8. We therefore create DFA State B as shown.
Figure 2.9 A DFA equivalent to the NFA at the bottom of Figure 2.8. Each state of the DFA
represents the set of states that the NFA could be in after seeing the same input. The DFA states and
the NFA state sets they represent are A[1, 2, 4, 5, 8], B[2, 3, 4, 5, 8, 9], C[6], D[6, 10, 11, 12, 14],
E[7, 11, 12, 14], F[7, 11, 12, 13, 14], and G[12, 13, 14].
On a . , our NFA may move from State 5 to State 6. There are no other transitions
on this input from any of the states in A, and there are no epsilon transitions out
of State 6. We therefore create the singleton DFA State C as shown. None of
States A, B, or C is marked as final, because none contains a final state of the
original NFA.
Returning to State B of the growing DFA, we note that on an input of d the
original NFA may move from State 2 to State 3, or from State 8 to State 9. From
State 3, in turn, it may move to States 2, 4, 5, or 8 via epsilon transitions. As
these are exactly the states already in B, we create a self-loop in the DFA. Given
a . , on the other hand, the original NFA may move from State 5 to State 6, or
from State 9 to State 10. From State 10, in turn, it may move to States 11, 12,
or 14 via epsilon transitions. We therefore create DFA State D as shown, with
a transition on . from B to D. State D is marked as final because it contains
state 14 of the original NFA. That is, given input d . , there exists a path from
the start state to the end state of the original NFA. Continuing our enumeration
of state sets, we end up creating three more, labeled E, F, and G in Figure 2.9.
Like State D, these all contain State 14 of the original NFA, and thus are marked
as final.
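The enumeration of state sets can be mechanized with a worklist. The Python sketch below is illustrative only. Its NFA edge list is a reconstruction from the prose above and from the state sets named in Figure 2.9, and should be read as an assumption rather than a transcription of Figure 2.8. None marks an epsilon edge, and closure and subset_construction are hypothetical names.

```python
# Edges of the 14-state NFA, reconstructed from the text (an assumption).
NFA = {
    1: [(None, 2), (None, 4)],
    2: [("d", 3)],
    3: [(None, 2), (None, 4)],
    4: [(None, 5), (None, 8)],
    5: [(".", 6)],
    6: [("d", 7)],
    7: [(None, 11)],
    8: [("d", 9)],
    9: [(".", 10)],
    10: [(None, 11)],
    11: [(None, 12), (None, 14)],
    12: [("d", 13)],
    13: [(None, 12), (None, 14)],
    14: [],
}
NFA_START, NFA_FINAL = 1, 14

def closure(states):
    """The given states plus everything reachable by epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for label, target in NFA[stack.pop()]:
            if label is None and target not in seen:
                seen.add(target)
                stack.append(target)
    return frozenset(seen)

def subset_construction(alphabet=("d", ".")):
    """Return (start, transitions, finals); each DFA state is a frozenset of NFA states."""
    start = closure({NFA_START})
    dfa, worklist = {}, [start]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue                      # this set has already been expanded
        dfa[current] = {}
        for ch in alphabet:
            moved = {t for s in current for label, t in NFA[s] if label == ch}
            if moved:
                target = closure(moved)
                dfa[current][ch] = target
                worklist.append(target)
    finals = {s for s in dfa if NFA_FINAL in s}
    return start, dfa, finals

start, dfa, finals = subset_construction()
print(len(dfa))   # 7
```

Under the assumed edge list, the seven sets produced are exactly those labeled A through G, and the four that contain State 14 are the final ones.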
In our example, the DFA ends up being smaller than the NFA, but this is only
because our regular language is so simple. In theory, the number of states in the
DFA may be exponential in the number of states in the NFA, but this extreme is
also uncommon in practice. For a programming language scanner, the DFA tends
to be larger than the NFA, but not outlandishly so. We consider space complexity
in more detail in Section 2.4.1.
Minimizing the DFA
EXAMPLE 2.15 Minimal DFA for d*( .d | d. ) d*

Figure 2.10 Minimization of the DFA of Figure 2.9. In each step we split a set of states to
eliminate a transition ambiguity.
Starting from a regular expression we have now constructed an equivalent DFA.
Though this DFA has seven states, a bit of thought suggests that a smaller one
should exist. In particular, once we have seen both a d and a . , the only valid
transitions are on d, and we ought to be able to make do with a single final
state. We can formalize this intuition, allowing us to apply it to any DFA, via the
following inductive construction.
Initially we place the states of the (not necessarily minimal) DFA into two
equivalence classes: final states and nonfinal states. We then repeatedly search for
an equivalence class X and an input symbol c such that when given c as input,
the states in X make transitions to states in k > 1 different equivalence classes.
We then partition X into k classes in such a way that all states in a given new class
would move to a member of the same old class on c . When we are unable to find
a class to partition in this fashion we are done.
In our example, the original placement puts States D, E, F, and G in one
class (final states) and States A, B, and C in another, as shown in the upper left
of Figure 2.10. Unfortunately, the start state has ambiguous transitions on both
d and . . To address the d ambiguity, we split ABC into AB and C, as shown
in the upper right. New State AB has a self-loop on d; new State C moves to
State DEFG. State AB still has an ambiguity on . , however, which we resolve
by splitting it into States A and B, as shown at the bottom of the figure. At
this point there are no further ambiguities, and we are left with a four-state
minimal DFA.
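The splitting procedure is short enough to try out directly. The Python sketch below is an illustration under assumptions: the transition table is a reading of the seven-state DFA of Figure 2.9 derived from the construction described earlier, a missing table entry is treated as a move to an implicit dead state, and minimize is a hypothetical name.

```python
# Seven-state DFA; a missing entry means "no transition" (implicit dead state).
DFA = {
    "A": {"d": "B", ".": "C"},
    "B": {"d": "B", ".": "D"},
    "C": {"d": "E"},
    "D": {"d": "F"},
    "E": {"d": "G"},
    "F": {"d": "G"},
    "G": {"d": "G"},
}
FINALS = {"D", "E", "F", "G"}

def minimize(dfa, finals, alphabet=("d", ".")):
    """Return the equivalence classes of states as a set of frozensets."""
    classes = [frozenset(finals), frozenset(set(dfa) - set(finals))]
    changed = True
    while changed:
        changed = False
        for cls in classes:
            for ch in alphabet:
                # Group the members of cls by the class their ch-transition lands in.
                groups = {}
                for state in cls:
                    target = dfa[state].get(ch)
                    key = next((c for c in classes if target in c), None)
                    groups.setdefault(key, set()).add(state)
                if len(groups) > 1:       # ambiguity: split cls and start over
                    classes.remove(cls)
                    classes.extend(frozenset(g) for g in groups.values())
                    changed = True
                    break
            if changed:
                break
    return set(classes)

print(len(minimize(DFA, FINALS)))   # 4
```

On this input the function first splits ABC into AB and C (ambiguity on d), then AB into A and B (ambiguity on .), leaving the four classes A, B, C, and DEFG described above.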
2.2.2 Scanner Code
EXAMPLE 2.16 Nested case statement automaton
We can implement a scanner that explicitly captures the “circles-and-arrows”
structure of a DFA in either of two main ways. One embeds the automaton in
the control flow of the program using gotos or nested case (switch) statements;
the other, described in the following subsection, uses a table and a driver. As a general rule, handwritten automata tend to use nested case statements, while most
(but not all [BC93]) automatically generated automata use tables. Tables are hard
to create by hand, but easier than code to create from within a program. Likewise,
nested case statements are easier to write and to debug than the ad hoc approach
of Figure 2.5, if not quite as efficient. Unix’s lex/flex tool produces C language
output containing tables and a customized driver.
The nested case statement style of automaton has the following general
structure.

state := 1                          -- start state
loop
    read cur_char
    case state of
        1 : case cur_char of
                ‘ ’, ‘\t’, ‘\n’ : ...
                ‘a’...‘z’ : ...
                ‘0’...‘9’ : ...
                ‘>’ : ...
                ...
        2 : case cur_char of
                ...
        ...
        n : case cur_char of
                ...

DESIGN & IMPLEMENTATION
Recognizing multiple kinds of token
One of the chief ways in which a scanner differs from a formal DFA is that it
identifies tokens in addition to recognizing them. That is, it not only determines
whether characters constitute a valid token; it also indicates which one. In
practice, this means that it must have separate final states for every kind of
token. We glossed over this issue in our RE-to-DFA constructions.
To build a scanner for a language with n different kinds of tokens, we begin with an
NFA of the sort suggested in the figure here. Given NFAs Mi, 1 ≤ i ≤ n (one machine for
each kind of token), we create a new start state with epsilon transitions to the start
states of the Mi. In contrast to the alternation construction of Figure 2.7(c), however,
we do not create a single final state; we keep the existing ones, each labeled by
the token for which it is final. We then apply
the NFA-to-DFA construction as before. (If final states for different tokens in
the NFA ever end up in the same state of the DFA, then we have ambiguous
token definitions. These may be resolved by changing the regular expressions
from which the NFAs were derived, or by wrapping additional logic around
the DFA.)
[Figure: a new start state with epsilon transitions to the start states of machines M1, M2, ..., Mn.]
In the DFA minimization construction, instead of starting with two equivalence classes (final and nonfinal states), we begin with n+1, including a separate
class for final states for each of the kinds of token. Exercise 2.5 explores this
construction for a scanner that recognizes both the integer and decimal types
of Example 2.3.
EXAMPLE 2.17 The nontrivial prefix problem
The outer case statement covers the states of the finite automaton. The inner
case statements cover the transitions out of each state. Most of the inner clauses
simply set a new state. Some return from the scanner with the current token. (If
the current character should not be part of that token, it is pushed back onto the
input stream before returning.)
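As a small instance of this style, the sketch below hand-codes the four-state minimal DFA for d*( .d | d. ) d* from Example 2.15. It is an illustration, not code from the book. Python's if/elif chains stand in for the nested case statements, the input is assumed to be a string indexed by position, and scan_decimal is a hypothetical name. Leaving pos unchanged on return plays the role of pushing the current character back.

```python
DIGITS = "0123456789"

def scan_decimal(text, pos=0):
    """Scan one d*( .d | d. ) d* token starting at pos; return (image, next_pos)."""
    state, start = "A", pos                        # "A" is the start state
    while True:
        ch = text[pos] if pos < len(text) else ""  # "" marks end of input
        is_digit = ch != "" and ch in DIGITS
        if state == "A":                           # nothing seen yet
            if is_digit:
                state = "B"
            elif ch == ".":
                state = "C"
            else:
                raise ValueError(f"no token at offset {pos}")
        elif state == "B":                         # digits, but no point yet
            if is_digit:
                state = "B"
            elif ch == ".":
                state = "DEFG"
            else:
                raise ValueError(f"decimal point expected at offset {pos}")
        elif state == "C":                         # point, but no digits yet
            if is_digit:
                state = "DEFG"
            else:
                raise ValueError(f"digit expected at offset {pos}")
        else:                                      # "DEFG": the single final state
            if is_digit:
                state = "DEFG"
            else:
                # ch cannot extend the token, so it is left unconsumed
                return text[start:pos], pos
        pos += 1

print(scan_decimal("3.14+x"))   # ('3.14', 4)
```

The outer chain selects on the state and each inner chain selects on the character, mirroring the outer and inner case statements.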
Two aspects of the code typically deviate from the strict form of a formal finite
automaton. One is the handling of keywords. The other is the need to peek ahead
when a token can validly be extended by two or more additional characters, but
not by only one.
As noted at the beginning of Section 2.1.1, keywords in most languages look
just like identifiers, but are reserved for a special purpose (some authors use the
term reserved word instead of keyword; see footnote 10). It is possible to write a finite automaton
that distinguishes between keywords and identifiers, but it requires a lot of states
(see Exercise 2.3). Most scanners, both handwritten and automatically generated,
therefore treat keywords as “exceptions” to the rule for identifiers. Before returning
an identifier to the parser, the scanner looks it up in a hash table or trie (a tree of
branching paths) to make sure it isn’t really a keyword.
Whenever one legitimate token is a prefix of another, the “longest possible
token” rule says that we should continue scanning. If some of the intermediate
strings are not valid tokens, however, we can’t tell whether a longer token is possible
without looking more than one character ahead. This problem arises with dot
characters (periods) in C. Suppose the scanner has just seen a 3 and has a dot
coming up in the input. It needs to peek at characters beyond the dot in order
to distinguish between 3.14 (a single token designating a real number), 3 . foo
(three tokens that the scanner should accept, even though the parser will object to
seeing them in that order), and 3 ... foo (again not syntactically valid, but three
separate tokens nonetheless). In general, upcoming characters that a scanner must
examine in order to make a decision are known as its look-ahead. In Section 2.3
we will see a similar notion of look-ahead tokens in parsing.

10 Keywords (reserved words) are not the same as predefined identifiers. Predefined identifiers can
be redefined to have a different meaning; keywords cannot. The scanner does not distinguish
between predefined and other identifiers. It does distinguish between identifiers and keywords.
C doesn’t really have any predefined identifiers, but many languages do. In Pascal, for example,
the names of built-in types and standard library functions are predefined but not reserved.

EXAMPLE 2.18 Look-ahead in Fortran scanning
In messier languages, a scanner may need to look an arbitrary distance ahead.
In Fortran IV, for example, DO 5 I = 1,25 is the header of a loop (it executes the
statements up to the one labeled 5 for values of I from 1 to 25), while DO 5 I
= 1.25 is an assignment statement that places the value 1.25 into the variable
DO5I . Spaces are ignored in (pre-Fortran 90) Fortran input, even in the middle
of variable names. Moreover, variables need not be declared, and the terminator
for a DO loop is simply a label, which the parser can ignore. After seeing DO , the
scanner cannot tell whether the 5 is part of the current token until it reaches the
comma or dot. It has been widely (but apparently incorrectly) claimed that NASA’s
Mariner 1 space probe was lost due to accidental replacement of a comma with
a dot in a case similar to this one in flight control software (see footnote 11). Dialects of Fortran
starting with Fortran 77 allow (in fact encourage) the use of alternative syntax
for loop headers, in which an extra comma makes misinterpretation less likely:
DO 5,I = 1,25 .
In C, the dot character problem can easily be handled as a special case. In
languages requiring larger amounts of look-ahead, the scanner can take a more
general approach. In any case of ambiguity, it assumes that a longer token will
be possible, but remembers that a shorter token could have been recognized at
some point in the past. It also buffers all characters read beyond the end of the
shorter token. If the optimistic assumption leads the scanner into an error state, it
“unreads” the buffered characters so that they will be seen again later, and returns
the shorter token.
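The general approach can be sketched as follows. The token set is a toy one invented for illustration (numbers that require a digit after the decimal point, a dot token, and identifiers), not C or Fortran, and the names char_class, TRANSITIONS, FINAL, next_token, and scan are hypothetical. Because the whole input is assumed to be in memory, "unreading" the buffered characters amounts to resuming the scan at the remembered offset.

```python
def char_class(ch):
    """Map a character to the symbol used in the transition table."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return ch

# Toy DFA (hypothetical): number -> digit+ ( . digit+ )?, dot -> ., id -> letter+
TRANSITIONS = {
    ("start", "digit"): "int", ("start", "."): "dot", ("start", "letter"): "id",
    ("int", "digit"): "int", ("int", "."): "int_dot",
    ("int_dot", "digit"): "frac", ("frac", "digit"): "frac",
    ("id", "letter"): "id",
}
FINAL = {"int": "number", "frac": "number", "dot": "dot", "id": "id"}

def next_token(text, pos):
    """Return (token, image, next_pos), backing up to the last final state if needed."""
    state, i = "start", pos
    last_final = None                 # (token, end offset) of the longest token so far
    while i < len(text):
        state = TRANSITIONS.get((state, char_class(text[i])))
        if state is None:             # error state: the optimistic assumption failed
            break
        i += 1
        if state in FINAL:
            last_final = (FINAL[state], i)
    if last_final is None:
        raise ValueError(f"lexical error at offset {pos}")
    token, end = last_final
    return token, text[pos:end], end  # characters past `end` will be seen again

def scan(text):
    pos, out = 0, []
    while pos < len(text):
        if text[pos] == " ":
            pos += 1
            continue
        token, image, pos = next_token(text, pos)
        out.append((token, image))
    return out

print(scan("3.foo"))   # [('number', '3'), ('dot', '.'), ('id', 'foo')]
```

On 3.foo the scanner optimistically consumes the dot, hoping for a fraction, reaches an error on f, and falls back to the shorter token 3. The dot is then rescanned as a token of its own.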
DESIGN & IMPLEMENTATION
Longest possible tokens
A little care in syntax design—avoiding tokens that are nontrivial prefixes of
other tokens—can dramatically simplify scanning. In straightforward cases
of prefix ambiguity the scanner can enforce the “longest possible token” rule
automatically. In Fortran, however, the rules are sufficiently complex that no
purely lexical solution suffices. Some of the problems, and a possible solution,
are discussed in an article by Dyadkin [Dya95].
11 In actuality, the faulty software for Mariner 1 appears to have stemmed from a missing “bar”
punctuation mark (indicating an average) in handwritten notes from which the software was
derived [Cer89, pp. 202–203]. The Fortran DO loop error does appear to have occurred in at least
one piece of NASA software, but no serious harm resulted [Web89].
2.2.3 Table-Driven Scanning
EXAMPLE 2.19 Table-driven scanning
In the preceding subsection we sketched how control flow—a loop and nested case
statements—can be used to represent a finite automaton. An alternative approach
represents the automaton as a data structure: a two-dimensional transition table.
A driver program (Figure 2.11) uses the current state and input character to index
into the table. Each entry in the table specifies whether to move to a new state (and
if so, which one), return a token, or announce an error. A second table indicates,
for each state, whether we might be at the end of a token (and if so, which one).
Separating this second table from the first allows us to notice when we pass a state
that might have been the end of a token, so we can back up if we hit an error state.
Example tables for our calculator tokens appear in Figure 2.12.
Like a handwritten scanner, the table-driven code of Figure 2.11 looks tokens up
in a table of keywords immediately before returning. An outer loop serves to filter
out comments and “white space”—spaces, tabs, and newlines. These character
sequences are not meaningful to the parser, and would in fact be very difficult to
represent in a grammar (Exercise 2.20).
2.2.4 Lexical Errors
The code in Figure 2.11 explicitly recognizes the possibility of lexical errors. In
some cases the next character of input may be neither an acceptable continuation
of the current token nor the start of another token. In such cases the scanner must
print an error message and perform some sort of recovery so that compilation
can continue, if only to look for additional errors. Fortunately, lexical errors are
relatively rare—most character sequences do correspond to token sequences—
and relatively easy to handle. The most common approach is simply to (1) throw
away the current, invalid token; (2) skip forward until a character is found that can
legitimately begin a new token; (3) restart the scanning algorithm; and (4) count
on the error-recovery mechanism of the parser to cope with any cases in which
the resulting sequence of tokens is not syntactically valid. Of course the need for
error recovery is not unique to table-driven scanners; any scanner must cope with
errors. We did not show the code in Figure 2.5, but it would have to be there in
practice.
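Steps (1) through (3) of this recovery strategy might look like the following Python sketch. It is an illustration under assumptions: the token set is a simplified, hypothetical one expressed with Python's re module, and the names TOKEN, CAN_START, and scan_with_recovery are invented. Step (4), recovery in the parser, is outside its scope.

```python
import re

# Hypothetical token set: identifiers, integers, :=, and single-character operators.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|:=|[-+*/()]")
# Characters that can legitimately begin a new token.
CAN_START = re.compile(r"[A-Za-z0-9:+*/()\-]")

def scan_with_recovery(text):
    """Return (tokens, errors), continuing past invalid input."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
            continue
        # (1) throw away the invalid token, consuming at least one character
        bad_start = pos
        pos += 1
        # (2) skip forward to a character that can begin a new token
        while (pos < len(text) and not text[pos].isspace()
               and not CAN_START.match(text, pos)):
            pos += 1
        errors.append(f"invalid input {text[bad_start:pos]!r} at offset {bad_start}")
        # (3) the enclosing loop restarts the scanning algorithm at pos
    return tokens, errors

print(scan_with_recovery("a := 3 $# b"))
```

Consuming at least one character in step (1) guarantees that the scanner makes progress even when the offending character could itself begin a token, as with a colon that is not followed by =.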
The code in Figure 2.11 also shows that the scanner must return both the kind
of token found and its character-string image (spelling); again this requirement
applies to all types of scanners. For some tokens the character-string image is
redundant: all semicolons look the same, after all, as do all while keywords. For
other tokens, however (e.g., identifiers, character strings, and numeric constants),
the image is needed for semantic analysis. It is also useful for error messages:
“undeclared identifier” is not as nice as “ foo has not been declared.”
state = 0 .. number_of_states
token = 0 .. number_of_tokens
scan_tab : array [char, state] of record
    action : (move, recognize, error)
    new_state : state
token_tab : array [state] of token            -- what to recognize
keyword_tab : set of record
    k_image : string
    k_token : token
-- these three tables are created by a scanner generator tool

tok : token
cur_char : char
remembered_chars : list of char
repeat
    cur_state : state := start_state
    image : string := null
    remembered_state : state := 0             -- none
    loop
        read cur_char
        case scan_tab[cur_char, cur_state].action
            move:
                if token_tab[cur_state] ≠ 0
                    -- this could be a final state
                    remembered_state := cur_state
                    remembered_chars := ε
                add cur_char to remembered_chars
                cur_state := scan_tab[cur_char, cur_state].new_state
            recognize:
                tok := token_tab[cur_state]
                unread cur_char               -- push back into input stream
                exit inner loop
            error:
                if remembered_state ≠ 0
                    tok := token_tab[remembered_state]
                    unread remembered_chars
                    remove remembered_chars from image
                    exit inner loop
                -- else print error message and recover; probably start over
        append cur_char to image
    -- end inner loop
until tok ∉ {white_space, comment}
look image up in keyword_tab and replace tok with appropriate keyword if found
return tok, image
Figure 2.11 Driver for a table-driven scanner, with code to handle the ambiguous case in which
one valid token is a prefix of another, but some intermediate string is not.
                              Current input character
State  space,tab  newline    /    *    (    )    +    -    :    =    .  digit  letter  other  token
    1         17       17    2   10    6    7    8    9   11    –   13     14      16      –
    2          –        –    3    4    –    –    –    –    –    –    –      –       –      –  div
    3          3       18    3    3    3    3    3    3    3    3    3      3       3      3
    4          4        4    4    5    4    4    4    4    4    4    4      4       4      4
    5          4        4   18    5    4    4    4    4    4    4    4      4       4      4
    6          –        –    –    –    –    –    –    –    –    –    –      –       –      –  lparen
    7          –        –    –    –    –    –    –    –    –    –    –      –       –      –  rparen
    8          –        –    –    –    –    –    –    –    –    –    –      –       –      –  plus
    9          –        –    –    –    –    –    –    –    –    –    –      –       –      –  minus
   10          –        –    –    –    –    –    –    –    –    –    –      –       –      –  times
   11          –        –    –    –    –    –    –    –    –   12    –      –       –      –
   12          –        –    –    –    –    –    –    –    –    –    –      –       –      –  assign
   13          –        –    –    –    –    –    –    –    –    –    –     15       –      –
   14          –        –    –    –    –    –    –    –    –    –   15     14       –      –  number
   15          –        –    –    –    –    –    –    –    –    –    –     15       –      –  number
   16          –        –    –    –    –    –    –    –    –    –    –     16      16      –  identifier
   17         17       17    –    –    –    –    –    –    –    –    –      –       –      –  white space
   18          –        –    –    –    –    –    –    –    –    –    –      –       –      –  comment
Figure 2.12 Scanner tables for the calculator language. These could be used by the code of Figure 2.11. States are numbered
as in Figure 2.6, except for the addition of two states (17 and 18) to “recognize” white space and comments. The right-hand
column represents table token_tab; the rest of the figure is scan_tab. Dashes indicate no way to extend the current token. Table
keyword_tab (not shown) contains the strings read and write.
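The tables of Figure 2.12 translate naturally into dictionaries. The Python sketch below is a simplified take on the driver of Figure 2.11, not a transcription of it. It assumes the input is a string, derives the move, recognize, and error actions from whether a table entry exists, and backs up by position instead of keeping a list of remembered characters. The names ANY, SCAN_TAB, TOKEN_TAB, column, next_token, and scan are illustrative.

```python
ANY = ["space", "newline", "/", "*", "(", ")", "+", "-", ":", "=", ".",
       "digit", "letter", "other"]

# scan_tab: state -> {input column -> next state}; absent entries are the dashes.
SCAN_TAB = {
    1: {"space": 17, "newline": 17, "/": 2, "*": 10, "(": 6, ")": 7, "+": 8,
        "-": 9, ":": 11, ".": 13, "digit": 14, "letter": 16},
    2: {"/": 3, "*": 4},
    3: {**dict.fromkeys(ANY, 3), "newline": 18},
    4: {**dict.fromkeys(ANY, 4), "*": 5},
    5: {**dict.fromkeys(ANY, 4), "/": 18, "*": 5},
    11: {"=": 12},
    13: {"digit": 15},
    14: {".": 15, "digit": 14},
    15: {"digit": 15},
    16: {"digit": 16, "letter": 16},
    17: {"space": 17, "newline": 17},
}
# token_tab: final states and the tokens they recognize.
TOKEN_TAB = {2: "div", 6: "lparen", 7: "rparen", 8: "plus", 9: "minus",
             10: "times", 12: "assign", 14: "number", 15: "number",
             16: "identifier", 17: "white_space", 18: "comment"}
KEYWORDS = {"read", "write"}

def column(ch):
    """Map an input character to its column in SCAN_TAB."""
    if ch in (" ", "\t"):
        return "space"
    if ch == "\n":
        return "newline"
    if "0" <= ch <= "9":
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "letter"
    return ch if ch in ANY else "other"

def next_token(text, pos):
    """Return (token, image, next_pos); token is None at end of input."""
    while pos < len(text):
        state, i, last_final = 1, pos, None
        while i < len(text):
            target = SCAN_TAB.get(state, {}).get(column(text[i]))
            if target is None:            # no way to extend the current token
                break
            state, i = target, i + 1
            if state in TOKEN_TAB:        # remember the most recent final state
                last_final = (state, i)
        if last_final is None:
            raise ValueError(f"lexical error at offset {pos}")
        state, end = last_final           # back up if we overshot
        token, image, pos = TOKEN_TAB[state], text[pos:end], end
        if token in ("white_space", "comment"):
            continue                      # the outer loop filters these out
        if token == "identifier" and image in KEYWORDS:
            token = image
        return token, image, pos
    return None, "", pos

def scan(text):
    tokens, pos = [], 0
    while True:
        token, image, pos = next_token(text, pos)
        if token is None:
            return tokens
        tokens.append((token, image))
```

The driver knows nothing about the calculator language. Scanning a different language would mean replacing SCAN_TAB, TOKEN_TAB, KEYWORDS, and column, which is what makes the tables a convenient target for a generator.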
2.2.5 Pragmas
Some languages and language implementations allow a program to contain
constructs called pragmas that provide directives or hints to the compiler. Pragmas that do not change program semantics—only the compilation process—
are sometimes called significant comments. In some languages the name is also
appropriate because, like comments, pragmas can appear anywhere in the source
program. In this case they are usually processed by the scanner: allowing them
anywhere in the grammar would greatly complicate the parser. In other languages
(Ada, for example), pragmas are permitted only at certain well-defined places
in the grammar. In this case they are best processed by the parser or semantic
analyzer.
Pragmas that serve as directives may:
• Turn various kinds of run-time checks (e.g., pointer or subscript checking) on or off
• Turn certain code improvements on or off (e.g., on in inner loops to improve performance; off otherwise to improve compilation speed)
• Enable or disable performance profiling (statistics gathering to identify program bottlenecks)
Some directives “cross the line” and change program semantics. In Ada, for example, the unchecked pragma can be used to disable type checking. In OpenMP,
which we will consider in Chapter 12, pragmas specify significant parallel extensions to Fortran, C and C++: creating, scheduling, and synchronizing threads. In
this case the principal rationale for expressing the extensions as pragmas rather
than more deeply integrated changes is to sharply delineate the boundary between
the core language and the extensions, and to share a common set of extensions
across languages.
Pragmas that serve (merely) as hints provide the compiler with information
about the source program that may allow it to do a better job:
- Variable x is very heavily used (it may be a good idea to keep it in a register).
- Subroutine F is a pure function: its only effect on the rest of the program is
  the value it returns.
- Subroutine S is not (indirectly) recursive (its storage may be statically allocated).
- 32 bits of precision (instead of 64) suffice for floating-point variable x.
The compiler may ignore these in the interest of simplicity, or in the face of
contradictory information.
CHECK YOUR UNDERSTANDING
10. List the tasks performed by the typical scanner.
11. What are the advantages of an automatically generated scanner, in comparison
to a handwritten one? Why do many commercial compilers use a handwritten
scanner anyway?
12. Explain the difference between deterministic and nondeterministic finite
automata. Why do we prefer the deterministic variety for scanning?
13. Outline the constructions used to turn a set of regular expressions into a
minimal DFA.
14. What is the “longest possible token” rule?
15. Why must a scanner sometimes “peek” at upcoming characters?
16. What is the difference between a keyword and an identifier?
17. Why must a scanner save the text of tokens?
18. How does a scanner identify lexical errors? How does it respond?
19. What is a pragma?
2.3 Parsing
EXAMPLE 2.20 Top-down and bottom-up parsing
The parser is the heart of a typical compiler. It calls the scanner to obtain the tokens
of the input program, assembles the tokens together into a syntax tree, and passes
the tree (perhaps one subroutine at a time) to the later phases of the compiler,
which perform semantic analysis and code generation and improvement. In effect,
the parser is “in charge” of the entire compilation process; this style of compilation
is sometimes referred to as syntax-directed translation.
As noted in the introduction to this chapter, a context-free grammar (CFG) is
a generator for a CF language. A parser is a language recognizer. It can be shown
that for any CFG we can create a parser that runs in O(n^3) time, where n is the
length of the input program.¹² There are two well-known parsing algorithms that
achieve this bound: Earley’s algorithm [Ear70] and the Cocke-Younger-Kasami
(CYK) algorithm [Kas65, You67]. Cubic time is much too slow for parsing sizable programs, but fortunately not all grammars require such a general and slow
parsing algorithm. There are large classes of grammars for which we can build
parsers that run in linear time. The two most important of these classes are called
LL and LR.
LL stands for “Left-to-right, Left-most derivation.” LR stands for “Left-to-right,
Right-most derivation.” In both classes the input is read left-to-right, and the
parser attempts to discover (construct) a derivation of that input. For LL parsers,
the derivation will be left-most; for LR parsers, right-most. We will cover LL
parsers first. They are generally considered to be simpler and easier to understand.
They can be written by hand or generated automatically from an appropriate
grammar by a parser-generating tool. The class of LR grammars is larger (i.e.,
more grammars are LR than LL), and some people find the structure of the LR
grammars more intuitive, especially in the handling of arithmetic expressions. LR
parsers are almost always constructed by a parser-generating tool. Both classes of
parsers are used in production compilers, though LR parsers are more common.
LL parsers are also called “top-down,” or “predictive” parsers. They construct a
parse tree from the root down, predicting at each step which production will be
used to expand the current node, based on the next available token of input. LR
parsers are also called “bottom-up” parsers. They construct a parse tree from the
leaves up, recognizing when a collection of leaves or other nodes can be joined
together as the children of a single parent.
¹² In general, an algorithm is said to run in time O(f(n)), where n is the length of the input, if
its running time t(n) is proportional to f(n) in the worst case. More precisely, we say t(n) =
O(f(n)) ⇐⇒ ∃ c, m [n > m −→ t(n) < c f(n)].

We can illustrate the difference between top-down and bottom-up parsing
by means of a simple example. Consider the following grammar for a comma-separated
list of identifiers, terminated by a semicolon:
    id_list −→ id id_list_tail
    id_list_tail −→ , id id_list_tail
    id_list_tail −→ ;
These are the productions that would normally be used for an identifier list in a
top-down parser. They can also be parsed bottom-up (most top-down grammars
can be). In practice they would not be used in a bottom-up parser, for reasons that
will become clear in a moment, but the ability to handle them either way makes
them good for this example.
Progressive stages in the top-down and bottom-up construction of a parse tree
for the string A, B, C; appear in Figure 2.13. The top-down parser begins by
predicting that the root of the tree (id_list) will be replaced by id id_list_tail.
It then matches the id against a token obtained from the scanner. (If the scanner
produced something different, the parser would announce a syntax error.)
The parser then moves down into the first (in this case only) nonterminal child
and predicts that id_list_tail will be replaced by , id id_list_tail. To make this
prediction it needs to peek at the upcoming token (a comma), which allows it
to choose between the two possible expansions for id_list_tail. It then matches
the comma and the id and moves down into the next id_list_tail. In a similar,
recursive fashion, the top-down parser works down the tree, left-to-right, predicting
and expanding nodes and tracing out a left-most derivation of the fringe of
the tree.
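The walk described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the text: the function name parse_id_list, the (kind, text) token pairs, and the returned list of names are all assumptions made for the example. The one point it is meant to show is that id_list_tail peeks at the upcoming token to choose between its two productions.

```python
def parse_id_list(tokens):
    """Predictive parse of
           id_list      -> id id_list_tail
           id_list_tail -> , id id_list_tail  |  ;
    tokens is a list of (kind, text) pairs; returns the identifier names."""
    pos = 0
    names = []

    def match(kind):
        # Consume the next token, which must be of the expected kind.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        text = tokens[pos][1]
        pos += 1
        return text

    def id_list():
        names.append(match("id"))
        id_list_tail()

    def id_list_tail():
        # Peek at the upcoming token to predict which production applies.
        kind = tokens[pos][0] if pos < len(tokens) else None
        if kind == ",":
            match(",")
            names.append(match("id"))
            id_list_tail()
        elif kind == ";":
            match(";")
        else:
            raise SyntaxError(f"unexpected token at position {pos}")

    id_list()
    if pos != len(tokens):
        raise SyntaxError("extra input after ';'")
    return names


tokens = [("id", "A"), (",", ","), ("id", "B"), (",", ","), ("id", "C"), (";", ";")]
print(parse_id_list(tokens))
```

Each nested function corresponds to one nonterminal, so the chain of calls mirrors the left-most derivation traced by the top-down parser.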
The bottom-up parser, by contrast, begins by noting that the left-most leaf of
the tree is an id. The next leaf is a comma and the one after that is another id. The
parser continues in this fashion, shifting new leaves from the scanner into a forest
of partially completed parse tree fragments, until it realizes that some of those
fragments constitute a complete right-hand side. In this grammar, that doesn’t
occur until the parser has seen the semicolon, which is the right-hand side of
id_list_tail −→ ; . With this right-hand side in hand, the parser reduces the semicolon to an
id_list_tail. It then reduces , id id_list_tail into another id_list_tail. After doing
this one more time it is able to reduce id id_list_tail into the root of the parse
tree, id_list.
At no point does the bottom-up parser predict what it will see next. Rather,
it shifts tokens into its forest until it recognizes a right-hand side, which it then
reduces to a left-hand side. Because of this behavior, bottom-up parsers are sometimes
called shift-reduce parsers. Moving up the figure, from bottom to top, we
can see that the shift-reduce parser traces out a right-most derivation, in reverse.
Because bottom-up parsers were the first to receive careful formal study, right-most
derivations are sometimes called canonical.
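The shift and reduce steps just described can be traced with a small Python sketch. It is a deliberate simplification, not a real LR parser: it reduces greedily whenever the top of the stack matches a right-hand side (trying longer right-hand sides first), which happens to be enough for this particular grammar. The names PRODUCTIONS and shift_reduce are assumptions made for the illustration.

```python
# Productions of the example grammar, longest right-hand side first.
PRODUCTIONS = [
    ("id_list_tail", [",", "id", "id_list_tail"]),
    ("id_list",      ["id", "id_list_tail"]),
    ("id_list_tail", [";"]),
]


def shift_reduce(tokens):
    """Return the list of actions taken while parsing tokens bottom-up."""
    stack, actions = [], []
    for tok in tokens:
        stack.append(tok)                      # shift the next leaf
        actions.append(f"shift {tok}")
        reduced = True
        while reduced:                         # reduce while a right-hand side is on top
            reduced = False
            for lhs, rhs in PRODUCTIONS:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    actions.append(f"reduce {' '.join(rhs)} to {lhs}")
                    reduced = True
                    break
    if stack != ["id_list"]:
        raise SyntaxError(f"could not reduce input to id_list: {stack}")
    return actions


for step in shift_reduce(["id", ",", "id", ",", "id", ";"]):
    print(step)
```

For the input A, B, C; the trace consists of six shifts followed by four reductions: nothing can be reduced until the semicolon has been shifted.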
There are several important subclasses of LR parsers, including SLR, LALR,
and “full LR.” SLR and LALR are important for their ease of implementation,
full LR for its generality. LL parsers can also be grouped into SLL and “full LL”
subclasses. We will cover the differences among them only briefly here; for further
information see any of the standard compiler-construction or parsing theory
textbooks [App97, ALSU07, AU72, CT04, FL88].
Figure 2.13 Top-down (left) and bottom-up parsing (right) of the input string A, B, C; . Grammar appears at lower left.
One commonly sees LL or LR (or whatever) written with a number in parentheses
after it: LL(2) or LALR(1), for example. This number indicates how many
tokens of look-ahead are required in order to parse. Most real compilers use just
one token of look-ahead, though more can sometimes be helpful. Terrence Parr’s
open-source ANTLR tool, in particular, uses multitoken look-ahead to enlarge
the class of languages amenable to top-down parsing [PQ95]. In Section 2.3.1 we
will look at LL(1) grammars and handwritten parsers in more detail. In Sections
2.3.2 and 2.3.3 we will consider automatically generated LL(1) and LR(1) (actually
SLR(1)) parsers.

EXAMPLE 2.21 Bounding space with a bottom-up grammar
The problem with our example grammar, for the purposes of bottom-up parsing,
is that it forces the compiler to shift all the tokens of an id_list into its forest
before it can reduce any of them. In a very large program we might run out of
space. Sometimes there is nothing that can be done to avoid a lot of shifting. In
this case, however, we can use an alternative grammar that allows the parser to
reduce prefixes of the id_list into nonterminals as it goes along:
    id_list −→ id_list_prefix ;
    id_list_prefix −→ id_list_prefix , id
                   −→ id
This grammar cannot be parsed top-down, because when we see an id on the
input and we’re expecting an id_list_prefix, we have no way to tell which of the two
possible productions we should predict (more on this dilemma in Section 2.3.2).
As shown in Figure 2.14, however, the grammar works well bottom-up.
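The space argument can be checked with a short Python sketch that measures how deep the stack grows under each grammar. As a simplifying assumption it uses a greedy reducer (reduce whenever the top of the stack matches a right-hand side, longest first) rather than a genuine LR automaton; the names max_stack_depth, TAIL_GRAMMAR, PREFIX_GRAMMAR, and id_list_tokens are invented for the example.

```python
def max_stack_depth(tokens, productions, start):
    """Parse tokens by greedy shift-reduce and report the deepest stack seen."""
    stack, deepest = [], 0
    for tok in tokens:
        stack.append(tok)                          # shift
        deepest = max(deepest, len(stack))
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in productions:           # longest right-hand sides listed first
                if stack[-len(rhs):] == rhs:
                    stack[-len(rhs):] = [lhs]      # reduce
                    reduced = True
                    break
    if stack != [start]:
        raise SyntaxError(f"could not reduce input to {start}: {stack}")
    return deepest


TAIL_GRAMMAR = [
    ("id_list_tail", [",", "id", "id_list_tail"]),
    ("id_list",      ["id", "id_list_tail"]),
    ("id_list_tail", [";"]),
]
PREFIX_GRAMMAR = [
    ("id_list_prefix", ["id_list_prefix", ",", "id"]),
    ("id_list",        ["id_list_prefix", ";"]),
    ("id_list_prefix", ["id"]),
]


def id_list_tokens(n):
    """Token kinds for a list of n identifiers: id , id , ... id ;"""
    return ["id"] + [",", "id"] * (n - 1) + [";"]


tokens = id_list_tokens(100)
print(max_stack_depth(tokens, TAIL_GRAMMAR, "id_list"))
print(max_stack_depth(tokens, PREFIX_GRAMMAR, "id_list"))
```

With the original grammar the stack depth grows with the length of the list, because every token is shifted before the first reduction. With the prefix grammar it stays at a small constant, because each new identifier is folded into id_list_prefix as soon as it arrives.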
2.3.1 Recursive Descent
EXAMPLE 2.22 Top-down grammar for a calculator language
To illustrate top-down (predictive) parsing, let us consider the grammar for a
simple “calculator” language, shown in Figure 2.15. The calculator allows values
to be read into (numeric) variables, which may then be used in expressions.
Expressions in turn can be written to the output. Control flow is strictly linear
(no loops, if statements, or other jumps). The end-marker ($$) pseudotoken is
produced by the scanner at the end of the input. This token allows the parser to
terminate cleanly once it has seen the entire program. As in regular expressions,
we use the symbol ε to denote the empty string. A production with ε on the
right-hand side is sometimes called an epsilon production.
It may be helpful to compare the expr portion of Figure 2.15 to the expression
grammar of Example 2.8 (page 50). Most people find that previous, LR grammar
to be significantly more intuitive. It suffers, however, from a problem similar to that
of the id_list grammar of Example 2.21: if we see an id on the input when expecting
an expr, we have no way to tell which of the two possible productions to predict.
The grammar of Figure 2.15 avoids this problem by merging the common prefixes
of right-hand sides into a single production, and by using new symbols (term_tail
and factor_tail) to generate additional operators and operands as required. The
transformation has the unfortunate side effect of placing the operands of a given
operator in separate right-hand sides. In effect, we have sacrificed grammatical
elegance in order to be able to parse predictively.
So how do we parse a string with our calculator grammar? We saw the basic
idea in Figure 2.13. We start at the top of the tree and predict needed productions
on the basis of the current left-most nonterminal in the tree and the current
input token. We can formalize this process in one of two ways. The first, described
in the remainder of this subsection, is to build a recursive descent parser whose
subroutines correspond, one-one, to the nonterminals of the grammar. Recursive
descent parsers are typically constructed by hand, though the ANTLR parser
generator constructs them automatically from an input grammar. The second
approach, described in Section 2.3.2, is to build an LL parse table which is then
read by a driver program. Table-driven parsers are almost always constructed
automatically by a parser generator. These two options (recursive descent and
table-driven) are reminiscent of the nested case statements and table-driven
approaches to building a scanner that we saw in Sections 2.2.2 and 2.2.3. It should
be emphasized that they implement the same basic parsing algorithm.

Figure 2.14 Bottom-up parse of A, B, C; using a grammar (lower left) that allows lists to be
collapsed incrementally.
    program −→ stmt_list $$
    stmt_list −→ stmt stmt_list | ε
    stmt −→ id := expr | read id | write expr
    expr −→ term term_tail
    term_tail −→ add_op term term_tail | ε
    term −→ factor factor_tail
    factor_tail −→ mult_op factor factor_tail | ε
    factor −→ ( expr ) | id | number
    add_op −→ + | -
    mult_op −→ * | /

Figure 2.15 LL(1) grammar for a simple calculator language.

EXAMPLE 2.23 Recursive descent parser for the calculator language
EXAMPLE 2.24 Recursive descent parse of a “sum and average” program
Handwritten recursive descent parsers are most often used when the language
to be parsed is relatively simple, or when a parser-generator tool is not available.
There are exceptions, however. In particular, recursive descent appears in recent
versions of the GNU compiler collection ( gcc ). Earlier versions used bison to
create a bottom-up parser automatically. The change was made in part for performance reasons and in part to enable the generation of higher-quality syntax error messages. (The bison code was easier to write, and arguably easier to
maintain.)
Pseudocode for a recursive descent parser for our calculator language appears
in Figure 2.16. It has a subroutine for every nonterminal in the grammar. It also
has a mechanism input_token to inspect the next token available from the scanner
and a subroutine (match) to consume and update this token, and in the process
verify that it is the one that was expected (as specified by an argument). If match
or any of the other subroutines sees an unexpected token, then a syntax error
has occurred. For the time being let us assume that the parse_error subroutine
simply prints a message and terminates the parse. In Section 2.3.4 we will consider
how to recover from such errors and continue to parse the remainder of the
input.
Suppose now that we are to parse a simple program to read two numbers and
print their sum and average:
read A
read B
sum := A + B
write sum
write sum / 2
The parse tree for this program appears in Figure 2.17. The parser begins by
calling the subroutine program. After noting that the initial token is a read,
program calls stmt_list and then attempts to match the end-of-file pseudotoken.
(In the parse tree, the root, program, has two children, stmt_list and $$.)
Procedure stmt_list again notes that the upcoming token is a read. This observation
allows it to determine that the current node (stmt_list) generates stmt
stmt_list (rather than ε). It therefore calls stmt and stmt_list before returning.
Continuing in this fashion, the execution path of the parser traces out a left-to-right
depth-first traversal of the parse tree. This correspondence between the
dynamic execution trace and the structure of the parse tree is the distinguishing
characteristic of recursive descent parsing. Note that because the stmt_list nonterminal
appears in the right-hand side of a stmt_list production, the stmt_list
subroutine must call itself. This recursion accounts for the name of the parsing
technique.
Without additional code (not shown in Figure 2.16), the parser merely verifies
that the program is syntactically correct (i.e., that none of the otherwise
parse_error clauses in the case statements are executed and that match always
sees what it expects to see). To be of use to the rest of the compiler—which
must produce an equivalent target program in some other language—the parser
must save the parse tree or some other representation of program fragments
as an explicit data structure. To save the parse tree itself, we can allocate and
link together records to represent the children of a node immediately before
executing the recursive subroutines and match invocations that represent those
children. We shall need to pass each recursive routine an argument that points
to the record that is to be expanded (i.e., whose children are to be discovered). Procedure match will also need to save information about certain tokens
(e.g., character-string representations of identifiers and literals) in the leaves of
the tree.
As we saw in Chapter 1, the parse tree contains a great deal of irrelevant detail
that need not be saved for the rest of the compiler. It is therefore rare for a parser to
construct a full parse tree explicitly. More often it produces an abstract syntax tree
or some other more terse representation. In a recursive descent compiler, a syntax
tree can be created by allocating and linking together records in only a subset of
the recursive calls.
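One way to build a syntax tree in only a subset of the recursive calls is sketched below in Python for the expr portion of the calculator grammar. This is an illustration under stated assumptions, not the book's code: the tuple representation (operator, left, right), the trick of passing the left operand into term_tail and factor_tail, and the regular-expression tokenizer (which silently skips characters it does not recognize) are all choices made for the example.

```python
import re


def tokenize(text):
    """Split an expression into number, identifier, and operator tokens."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text) + ["$$"]


def parse_expr(text):
    toks = tokenize(text)
    pos = 0

    def peek():
        return toks[pos]

    def advance():
        nonlocal pos
        tok = toks[pos]
        pos += 1
        return tok

    def expr():
        return term_tail(term())

    def term_tail(left):
        # term_tail -> add_op term term_tail | epsilon
        if peek() in ("+", "-"):
            op = advance()
            return term_tail((op, left, term()))    # one tree node per operator
        return left                                  # epsilon: no node is allocated

    def term():
        return factor_tail(factor())

    def factor_tail(left):
        # factor_tail -> mult_op factor factor_tail | epsilon
        if peek() in ("*", "/"):
            op = advance()
            return factor_tail((op, left, factor()))
        return left

    def factor():
        tok = advance()
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected )")
            return node                              # parentheses leave no node
        if tok in ("+", "-", "*", "/", ")", "$$"):
            raise SyntaxError(f"unexpected {tok}")
        return int(tok) if tok.isdigit() else tok

    tree = expr()
    if peek() != "$$":
        raise SyntaxError(f"unexpected {peek()}")
    return tree


print(parse_expr("sum / 2"))
print(parse_expr("A + B * (C - 1)"))
```

The parse tree for sum / 2 would contain nodes for expr, term, term_tail, factor, and factor_tail; the syntax tree built here contains a single operator node with two leaves. Passing the left operand down into the tail routines also keeps the operators left-associative.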
The trickiest part of writing a recursive descent parser is figuring out which
tokens should label the arms of the case statements. Each arm represents one
production: one possible expansion of the symbol for which the subroutine was
named. The tokens that label a given arm are those that predict the production.
A token X may predict a production for either of two reasons: (1) the right-hand
side of the production, when recursively expanded, may yield a string beginning
with X, or (2) the right-hand side may yield nothing (i.e., it is ε, or a string
of nonterminals that may recursively yield ε), and X may begin the yield of what
comes next. In the following subsection we will formalize this notion of prediction
using sets called FIRST and FOLLOW, and show how to derive them automatically
from an LL(1) CFG.
procedure match(expected)
    if input_token = expected then consume input_token
    else parse_error

-- this is the start routine:
procedure program
    case input_token of
        id, read, write, $$ :
            stmt_list
            match($$)
        otherwise parse_error

procedure stmt_list
    case input_token of
        id, read, write : stmt; stmt_list
        $$ : skip                        -- epsilon production
        otherwise parse_error

procedure stmt
    case input_token of
        id : match(id); match(:=); expr
        read : match(read); match(id)
        write : match(write); expr
        otherwise parse_error

procedure expr
    case input_token of
        id, number, ( : term; term_tail
        otherwise parse_error

procedure term_tail
    case input_token of
        +, - : add_op; term; term_tail
        ), id, read, write, $$ :
            skip                         -- epsilon production
        otherwise parse_error

procedure term
    case input_token of
        id, number, ( : factor; factor_tail
        otherwise parse_error

procedure factor_tail
    case input_token of
        *, / : mult_op; factor; factor_tail
        +, -, ), id, read, write, $$ :
            skip                         -- epsilon production
        otherwise parse_error

procedure factor
    case input_token of
        id : match(id)
        number : match(number)
        ( : match( ( ); expr; match( ) )
        otherwise parse_error

procedure add_op
    case input_token of
        + : match(+)
        - : match(-)
        otherwise parse_error

procedure mult_op
    case input_token of
        * : match(*)
        / : match(/)
        otherwise parse_error

Figure 2.16 Recursive descent parser for the calculator language. Execution begins in procedure program. The recursive calls trace out a traversal of the parse tree. Not shown is code to
save this tree (or some similar structure) for use by later phases of the compiler.
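For readers who want to run the parser of Figure 2.16, a direct Python transcription might look like the following. It is a sketch under assumptions: the scanner is replaced by a ready-made list of token kinds, parse_error raises SyntaxError instead of printing a message, and the class name Parser and the field names are invented here. Like the figure, it only recognizes valid programs and builds no tree.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens          # token kinds, ending with the "$$" marker
        self.pos = 0

    @property
    def input_token(self):
        return self.tokens[self.pos]

    def parse_error(self):
        raise SyntaxError(f"unexpected {self.input_token} at token {self.pos}")

    def match(self, expected):
        if self.input_token != expected:
            self.parse_error()
        self.pos += 1                 # consume the token

    def program(self):
        if self.input_token not in ("id", "read", "write", "$$"):
            self.parse_error()
        self.stmt_list()
        self.match("$$")

    def stmt_list(self):
        if self.input_token in ("id", "read", "write"):
            self.stmt()
            self.stmt_list()
        elif self.input_token != "$$":     # "$$" predicts the epsilon production
            self.parse_error()

    def stmt(self):
        if self.input_token == "id":
            self.match("id")
            self.match(":=")
            self.expr()
        elif self.input_token == "read":
            self.match("read")
            self.match("id")
        elif self.input_token == "write":
            self.match("write")
            self.expr()
        else:
            self.parse_error()

    def expr(self):
        if self.input_token not in ("id", "number", "("):
            self.parse_error()
        self.term()
        self.term_tail()

    def term_tail(self):
        if self.input_token in ("+", "-"):
            self.add_op()
            self.term()
            self.term_tail()
        elif self.input_token not in (")", "id", "read", "write", "$$"):
            self.parse_error()             # otherwise: epsilon production

    def term(self):
        if self.input_token not in ("id", "number", "("):
            self.parse_error()
        self.factor()
        self.factor_tail()

    def factor_tail(self):
        if self.input_token in ("*", "/"):
            self.mult_op()
            self.factor()
            self.factor_tail()
        elif self.input_token not in ("+", "-", ")", "id", "read", "write", "$$"):
            self.parse_error()             # otherwise: epsilon production

    def factor(self):
        if self.input_token in ("id", "number"):
            self.match(self.input_token)
        elif self.input_token == "(":
            self.match("(")
            self.expr()
            self.match(")")
        else:
            self.parse_error()

    def add_op(self):
        self.match("+" if self.input_token == "+" else "-")

    def mult_op(self):
        self.match("*" if self.input_token == "*" else "/")


# Token kinds for: read A  read B  sum := A + B  write sum  write sum / 2
TOKENS = "read id read id id := id + id write id write id / number $$".split()
parser = Parser(TOKENS)
parser.program()                      # returns normally for a valid program
print("parsed", parser.pos, "tokens")
```

Each method corresponds to one procedure of the figure, and the token sets in the if tests are the labels on the arms of the corresponding case statement.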
CHECK YOUR UNDERSTANDING
20. What is the inherent “big-O” complexity of parsing? What is the complexity
of parsers used in real compilers?
21. Summarize the difference between LL and LR parsing. Which one of them is
also called “bottom-up”? “Top-down”? Which one is also called “predictive”?
“Shift-reduce”? What do “LL” and “LR” stand for?
22. What kind of parser (top-down or bottom-up) is most common in production
compilers?
23. Why are right-most derivations sometimes called canonical?
24. What is the significance of the “1” in LR(1)?
25. Why might we want (or need) different grammars for different parsing algorithms?
26. What is an epsilon production?
27. What are recursive descent parsers? Why are they used mostly for small languages?
28. How might a parser construct an explicit parse tree or syntax tree?
Figure 2.17 Parse tree for the sum-and-average program of Example 2.24, using the grammar of Figure 2.15.

2.3.2 Table-Driven Top-Down Parsing
EXAMPLE 2.25 Driver and table for top-down parsing
EXAMPLE 2.26 Table-driven parse of the “sum and average” program
In a recursive descent parser, each arm of a case statement corresponds to a
production, and contains parsing routine and match calls corresponding to the
symbols on the right-hand side of that production. At any given point in the parse,
if we consider the calls beyond the program counter (the ones that have yet to
occur) in the parsing routine invocations currently in the call stack, we obtain a
list of the symbols that the parser expects to see between here and the end of the
program. A table-driven top-down parser maintains an explicit stack containing
this same list of symbols.
Pseudocode for such a parser appears in Figure 2.18. The code is language
independent. It requires a language-dependent parsing table, generally produced
by an automatic tool. For the calculator grammar of Figure 2.15, the table appears
in Figure 2.19.
To illustrate the algorithm, Figure 2.20 shows a trace of the stack and the input
over time, for the sum-and-average program of Example 2.24. The parser iterates around a loop in which it pops the top symbol off the stack and performs
terminal = 1 . . number_of_terminals
non_terminal = number_of_terminals + 1 . . number_of_symbols
symbol = 1 . . number_of_symbols
production = 1 . . number_of_productions
parse_tab : array [non_terminal, terminal] of record
    action : (predict, error)
    prod : production
prod_tab : array [production] of list of symbol
-- these two tables are created by a parser generator tool
parse_stack : stack of symbol

parse_stack.push(start_symbol)
loop
    expected_sym : symbol := parse_stack.pop
    if expected_sym ∈ terminal
        match(expected_sym)                     -- as in Figure 2.16
        if expected_sym = $$ then return        -- success!
    else
        if parse_tab[expected_sym, input_token].action = error
            parse_error
        else
            prediction : production := parse_tab[expected_sym, input_token].prod
            foreach sym : symbol in reverse prod_tab[prediction]
                parse_stack.push(sym)

Figure 2.18 Driver for a table-driven LL(1) parser.
Top-of-stack  Current input token
nonterminal   id      number  read    write   :=      (       )       +       -       *       /       $$
program       1       –       1       1       –       –       –       –       –       –       –       1
stmt_list     2       –       2       2       –       –       –       –       –       –       –       3
stmt          4       –       5       6       –       –       –       –       –       –       –       –
expr          7       7       –       –       –       7       –       –       –       –       –       –
term_tail     9       –       9       9       –       –       9       8       8       –       –       9
term          10      10      –       –       –       10      –       –       –       –       –       –
factor_tail   12      –       12      12      –       –       12      12      12      11      11      12
factor        14      15      –       –       –       13      –       –       –       –       –       –
add_op        –       –       –       –       –       –       –       16      17      –       –       –
mult_op       –       –       –       –       –       –       –       –       –       18      19      –

Figure 2.19 LL(1) parse table for the calculator language. Table entries indicate the production to predict (as numbered in
Figure 2.22). A dash indicates an error. When the top-of-stack symbol is a terminal, the appropriate action is always to match
it against an incoming token from the scanner. An auxiliary table, not shown here, gives the right-hand-side symbols for each
production.
Parse stack                                        Input stream         Comment
program                                            read A read B ...    initial stack contents
stmt_list $$                                       read A read B ...    predict program −→ stmt_list $$
stmt stmt_list $$                                  read A read B ...    predict stmt_list −→ stmt stmt_list
read id stmt_list $$                               read A read B ...    predict stmt −→ read id
id stmt_list $$                                    A read B ...         match read
stmt_list $$                                       read B sum := ...    match id
stmt stmt_list $$                                  read B sum := ...    predict stmt_list −→ stmt stmt_list
read id stmt_list $$                               read B sum := ...    predict stmt −→ read id
id stmt_list $$                                    B sum := ...         match read
stmt_list $$                                       sum := A + B ...     match id
stmt stmt_list $$                                  sum := A + B ...     predict stmt_list −→ stmt stmt_list
id := expr stmt_list $$                            sum := A + B ...     predict stmt −→ id := expr
:= expr stmt_list $$                               := A + B ...         match id
expr stmt_list $$                                  A + B ...            match :=
term term_tail stmt_list $$                        A + B ...            predict expr −→ term term_tail
factor factor_tail term_tail stmt_list $$          A + B ...            predict term −→ factor factor_tail
id factor_tail term_tail stmt_list $$              A + B ...            predict factor −→ id
factor_tail term_tail stmt_list $$                 + B write sum ...    match id
term_tail stmt_list $$                             + B write sum ...    predict factor_tail −→ ε
add_op term term_tail stmt_list $$                 + B write sum ...    predict term_tail −→ add_op term term_tail
+ term term_tail stmt_list $$                      + B write sum ...    predict add_op −→ +
term term_tail stmt_list $$                        B write sum ...      match +
factor factor_tail term_tail stmt_list $$          B write sum ...      predict term −→ factor factor_tail
id factor_tail term_tail stmt_list $$              B write sum ...      predict factor −→ id
factor_tail term_tail stmt_list $$                 write sum ...        match id
term_tail stmt_list $$                             write sum write ...  predict factor_tail −→ ε
stmt_list $$                                       write sum write ...  predict term_tail −→ ε
stmt stmt_list $$                                  write sum write ...  predict stmt_list −→ stmt stmt_list
write expr stmt_list $$                            write sum write ...  predict stmt −→ write expr
expr stmt_list $$                                  sum write sum / 2    match write
term term_tail stmt_list $$                        sum write sum / 2    predict expr −→ term term_tail
factor factor_tail term_tail stmt_list $$          sum write sum / 2    predict term −→ factor factor_tail
id factor_tail term_tail stmt_list $$              sum write sum / 2    predict factor −→ id
factor_tail term_tail stmt_list $$                 write sum / 2        match id
term_tail stmt_list $$                             write sum / 2        predict factor_tail −→ ε
stmt_list $$                                       write sum / 2        predict term_tail −→ ε
stmt stmt_list $$                                  write sum / 2        predict stmt_list −→ stmt stmt_list
write expr stmt_list $$                            write sum / 2        predict stmt −→ write expr
expr stmt_list $$                                  sum / 2              match write
term term_tail stmt_list $$                        sum / 2              predict expr −→ term term_tail
factor factor_tail term_tail stmt_list $$          sum / 2              predict term −→ factor factor_tail
id factor_tail term_tail stmt_list $$              sum / 2              predict factor −→ id
factor_tail term_tail stmt_list $$                 / 2                  match id
mult_op factor factor_tail term_tail stmt_list $$  / 2                  predict factor_tail −→ mult_op factor factor_tail
/ factor factor_tail term_tail stmt_list $$        / 2                  predict mult_op −→ /
factor factor_tail term_tail stmt_list $$          2                    match /
number factor_tail term_tail stmt_list $$          2                    predict factor −→ number
factor_tail term_tail stmt_list $$                                      match number
term_tail stmt_list $$                                                  predict factor_tail −→ ε
stmt_list $$                                                            predict term_tail −→ ε
$$                                                                      predict stmt_list −→ ε

Figure 2.20 Trace of a table-driven LL(1) parse of the sum-and-average program of Example 2.24.
the following actions. If the popped symbol is a terminal, the parser attempts
to match it against an incoming token from the scanner. If the match fails, the
parser announces a syntax error and initiates some sort of error recovery (see
Section 2.3.4). If the popped symbol is a nonterminal, the parser uses that nonterminal together with the next available input token to index into a two-dimensional
table that tells it which production to predict (or whether to announce a syntax
error and initiate recovery).
Initially, the parse stack contains the start symbol of the grammar (in our case,
program). When it predicts a production, the parser pushes the right-hand-side
symbols onto the parse stack in reverse order, so the first of those symbols ends up
at top-of-stack. The parse completes successfully when we match the end token,
$$ . Assuming that $$ appears only once in the grammar, at the end of the first
production, and that the scanner returns this token only at end-of-file, any syntax
error is guaranteed to manifest itself either as a failed match or as an error entry
in the table.
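The driver loop and the parse table can be put together in a short Python sketch. Several things here are assumptions made for the illustration: productions are numbered 1 to 19 in the order in which they appear when the calculator grammar is written in plain BNF (which agrees with the entries of Figure 2.19), the table is stored as a dictionary whose missing entries play the role of the dashes, and the input is a ready-made list of token kinds rather than the output of a scanner. The names PRODS, ROWS, PARSE_TAB, and ll1_parse are invented.

```python
# Production number -> (left-hand side, right-hand side symbols).
PRODS = {
    1: ("program", ["stmt_list", "$$"]),
    2: ("stmt_list", ["stmt", "stmt_list"]),
    3: ("stmt_list", []),
    4: ("stmt", ["id", ":=", "expr"]),
    5: ("stmt", ["read", "id"]),
    6: ("stmt", ["write", "expr"]),
    7: ("expr", ["term", "term_tail"]),
    8: ("term_tail", ["add_op", "term", "term_tail"]),
    9: ("term_tail", []),
    10: ("term", ["factor", "factor_tail"]),
    11: ("factor_tail", ["mult_op", "factor", "factor_tail"]),
    12: ("factor_tail", []),
    13: ("factor", ["(", "expr", ")"]),
    14: ("factor", ["id"]),
    15: ("factor", ["number"]),
    16: ("add_op", ["+"]),
    17: ("add_op", ["-"]),
    18: ("mult_op", ["*"]),
    19: ("mult_op", ["/"]),
}

# Nonterminal -> {production to predict: tokens that predict it}.
ROWS = {
    "program":     {1: "id read write $$"},
    "stmt_list":   {2: "id read write", 3: "$$"},
    "stmt":        {4: "id", 5: "read", 6: "write"},
    "expr":        {7: "id number ("},
    "term_tail":   {8: "+ -", 9: ") id read write $$"},
    "term":        {10: "id number ("},
    "factor_tail": {11: "* /", 12: "+ - ) id read write $$"},
    "factor":      {13: "(", 14: "id", 15: "number"},
    "add_op":      {16: "+", 17: "-"},
    "mult_op":     {18: "*", 19: "/"},
}
PARSE_TAB = {(nt, tok): prod
             for nt, row in ROWS.items()
             for prod, toks in row.items()
             for tok in toks.split()}


def ll1_parse(tokens):
    """Parse a list of token kinds ending in "$$"; return the predicted productions."""
    stack, pos, predictions = ["program"], 0, []
    while True:
        expected = stack.pop()
        if expected not in ROWS:                       # terminal: match it
            if tokens[pos] != expected:
                raise SyntaxError(f"expected {expected}, saw {tokens[pos]}")
            pos += 1
            if expected == "$$":
                return predictions                     # success
        else:                                          # nonterminal: predict
            prod = PARSE_TAB.get((expected, tokens[pos]))
            if prod is None:
                raise SyntaxError(f"unexpected {tokens[pos]} while expanding {expected}")
            predictions.append(prod)
            stack.extend(reversed(PRODS[prod][1]))     # first symbol ends up on top


tokens = "read id read id id := id + id write id write id / number $$".split()
print(ll1_parse(tokens))
```

The list returned for the sum-and-average program contains the production numbers of the predict steps in Figure 2.20, in order. Only the two tables depend on the language; the loop itself does not.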
Predict Sets
EXAMPLE 2.27 Predict sets for the calculator language
As we hinted at the end of Section 2.3.1, predict sets are defined in terms of simpler
sets called FIRST and FOLLOW, where FIRST(A) is the set of all tokens that could
be the start of an A and FOLLOW(A) is the set of all tokens that could come after
an A in some valid program. If we extend the domain of FIRST in the obvious
way to include strings of symbols, we then say that the predict set of a production
A −→ β is FIRST(β), plus FOLLOW(A) if β =⇒* ε. For notational convenience,
we define the predicate EPS such that EPS(β) ≡ β =⇒* ε.¹³
We can illustrate the algorithm to construct these sets using our calculator
grammar (Figure 2.15). We begin with “obvious” facts about the grammar and
build on them inductively. If we recast the grammar in plain BNF (no EBNF ‘|’
constructs), then it has 19 productions. The “obvious” facts arise from adjacent
pairs of symbols in right-hand sides. In the first production, we can see that
$$ ∈ FOLLOW(stmt_list). In the third (stmt_list −→ ε), EPS(stmt_list) = true.
In the fourth production (stmt −→ id := expr), id ∈ FIRST(stmt) and := ∈
FOLLOW(id). In the fifth and sixth productions (stmt −→ read id | write
expr), { read, write } ⊂ FIRST(stmt), and id ∈ FOLLOW(read). The complete set
of “obvious” facts appears in Figure 2.21.
¹³ Following conventional notation, we use uppercase Roman letters near the beginning of the
alphabet to represent nonterminals, uppercase Roman letters near the end of the alphabet to
represent arbitrary grammar symbols (terminals or nonterminals), lowercase Roman letters near
the beginning of the alphabet to represent terminals (tokens), lowercase Roman letters near the
end of the alphabet to represent token strings, and lowercase Greek letters to represent strings of
arbitrary symbols.

program −→ stmt_list $$                       $$ ∈ FOLLOW(stmt_list)
stmt_list −→ stmt stmt_list
stmt_list −→ ε                                EPS(stmt_list) = true
stmt −→ id := expr                            id ∈ FIRST(stmt) and := ∈ FOLLOW(id)
stmt −→ read id                               read ∈ FIRST(stmt) and id ∈ FOLLOW(read)
stmt −→ write expr                            write ∈ FIRST(stmt)
expr −→ term term_tail
term_tail −→ add_op term term_tail
term_tail −→ ε                                EPS(term_tail) = true
term −→ factor factor_tail
factor_tail −→ mult_op factor factor_tail
factor_tail −→ ε                              EPS(factor_tail) = true
factor −→ ( expr )                            ( ∈ FIRST(factor) and ) ∈ FOLLOW(expr)
factor −→ id                                  id ∈ FIRST(factor)
factor −→ number                              number ∈ FIRST(factor)
add_op −→ +                                   + ∈ FIRST(add_op)
add_op −→ -                                   - ∈ FIRST(add_op)
mult_op −→ *                                  * ∈ FIRST(mult_op)
mult_op −→ /                                  / ∈ FIRST(mult_op)

Figure 2.21 “Obvious” facts about the LL(1) calculator grammar.

From the “obvious” facts we can deduce a larger set of facts during a second
pass over the grammar. For example, in the second production (stmt_list −→ stmt
stmt_list) we can deduce that { id, read, write } ⊂ FIRST(stmt_list), because we
already know that { id, read, write } ⊂ FIRST(stmt), and a stmt_list can begin with
a stmt. Similarly, in the first production, we can deduce that $$ ∈ FIRST(program),
because we already know that EPS(stmt_list) = true.
In the eleventh production (factor tail −→ mult op factor factor tail), we can
deduce that { ( , id , number } ⊂ FOLLOW(mult op), because we already know
that { ( , id , number } ⊂ FIRST(factor), and factor follows mult op in the right-hand side. In the seventh production (expr −→ term term tail), we can deduce
that ) ∈ FOLLOW(term tail), because we already know that ) ∈ FOLLOW(expr),
and a term tail can be the last part of an expr. In this same production, we
can also deduce that ) ∈ FOLLOW(term), because the term tail can generate ε (EPS(term tail) = true ), allowing a term to be the last part of an expr.
There is more that we can learn from our second pass through the grammar, but the examples above cover all the different kinds of cases. To complete our calculation, we continue with additional passes over the grammar until
we don’t learn any more (i.e., we don’t add anything to any of the FIRST and
FOLLOW sets). We then construct the PREDICT sets. Final versions of all three
sets appear in Figure 2.22. The parse table of Figure 2.19 follows directly from
PREDICT.
The algorithm to compute EPS, FIRST, FOLLOW, and PREDICT sets appears, a bit
more formally, in Figure 2.23. It relies on the following definitions.
EPS(α) ≡ if α =⇒∗ ε then true else false
FIRST(α) ≡ { c : α =⇒∗ c β }
FOLLOW(A) ≡ { c : S =⇒+ α A c β }
PREDICT(A −→ α) ≡ FIRST(α) ∪ ( if EPS(α) then FOLLOW(A) else ∅ )
FIRST
    program        { id , read , write , $$ }
    stmt list      { id , read , write }
    stmt           { id , read , write }
    expr           { ( , id , number }
    term tail      { + , - }
    term           { ( , id , number }
    factor tail    { * , / }
    factor         { ( , id , number }
    add op         { + , - }
    mult op        { * , / }
    Also note that FIRST( c ) = { c } ∀ tokens c .

FOLLOW
    id             { + , - , * , / , ) , := , id , read , write , $$ }
    number         { + , - , * , / , ) , id , read , write , $$ }
    read           { id }
    write          { ( , id , number }
    (              { ( , id , number }
    )              { + , - , * , / , ) , id , read , write , $$ }
    :=             { ( , id , number }
    +              { ( , id , number }
    -              { ( , id , number }
    *              { ( , id , number }
    /              { ( , id , number }
    $$             ∅
    program        ∅
    stmt list      { $$ }
    stmt           { id , read , write , $$ }
    expr           { ) , id , read , write , $$ }
    term tail      { ) , id , read , write , $$ }
    term           { + , - , ) , id , read , write , $$ }
    factor tail    { + , - , ) , id , read , write , $$ }
    factor         { + , - , * , / , ) , id , read , write , $$ }
    add op         { ( , id , number }
    mult op        { ( , id , number }

PREDICT
     1. program −→ stmt list $$                        { id , read , write , $$ }
     2. stmt list −→ stmt stmt list                    { id , read , write }
     3. stmt list −→ ε                                 { $$ }
     4. stmt −→ id := expr                             { id }
     5. stmt −→ read id                                { read }
     6. stmt −→ write expr                             { write }
     7. expr −→ term term tail                         { ( , id , number }
     8. term tail −→ add op term term tail             { + , - }
     9. term tail −→ ε                                 { ) , id , read , write , $$ }
    10. term −→ factor factor tail                     { ( , id , number }
    11. factor tail −→ mult op factor factor tail      { * , / }
    12. factor tail −→ ε                               { + , - , ) , id , read , write , $$ }
    13. factor −→ ( expr )                             { ( }
    14. factor −→ id                                   { id }
    15. factor −→ number                               { number }
    16. add op −→ +                                    { + }
    17. add op −→ -                                    { - }
    18. mult op −→ *                                   { * }
    19. mult op −→ /                                   { / }

Figure 2.22 FIRST, FOLLOW, and PREDICT sets for the calculator language. EPS(A) is true iff A ∈ {stmt list, term tail, factor tail}.
Note that FIRST sets and EPS values for strings of length greater than one are
calculated on demand; they are not stored explicitly. The algorithm is guaranteed
to terminate (i.e., converge on a solution), because the sizes of the FIRST and
FOLLOW sets are bounded by the number of terminals in the grammar.
If in the process of calculating PREDICT sets we find that some token belongs
to the PREDICT set of more than one production with the same left-hand side,
then the grammar is not LL(1), because we will not be able to choose which of
the productions to employ when the left-hand side is at the top of the parse stack
(or we are in the left-hand side’s subroutine in a recursive descent parser) and
we see the token coming up in the input. This sort of ambiguity is known as a
predict-predict conflict; it can arise either because the same token can begin more
than one right-hand side, or because it can begin one right-hand side and can also
appear after the left-hand side in some valid program, and one possible right-hand
side can generate ε.
– – EPS values and FIRST sets for all symbols:
for all terminals c , EPS( c ) := false; FIRST( c ) := { c }
for all nonterminals X , EPS( X ) := if X −→ ε then true else false; FIRST( X ) := ∅
repeat
    outer: for all productions X −→ Y1 Y2 . . . Yk ,
        inner: for i in 1 . . k
            add FIRST( Yi ) to FIRST( X )
            if not EPS( Yi ) (yet) then continue outer loop
        EPS( X ) := true
until no further progress

– – Subroutines for strings, similar to inner loop above:
EPS( X1 X2 . . . Xn )
    for i in 1 . . n
        if not EPS( Xi ) then return false
    return true
FIRST( X1 X2 . . . Xn )
    return value := ∅
    for i in 1 . . n
        add FIRST( Xi ) to return value
        if not EPS( Xi ) then return

– – FOLLOW sets for all symbols:
for all symbols X , FOLLOW( X ) := ∅
repeat
    for all productions A −→ α B β ,
        add FIRST(β ) to FOLLOW( B )
    for all productions A −→ α B
            or A −→ α B β , where EPS(β ) = true,
        add FOLLOW( A ) to FOLLOW( B )
until no further progress

– – PREDICT sets for all productions:
for all productions A −→ α
    PREDICT( A −→ α ) := FIRST(α) ∪ (if EPS(α) then FOLLOW( A ) else ∅ )

Figure 2.23 Algorithm to calculate FIRST, FOLLOW, and PREDICT sets. The grammar is
LL(1) if and only if all PREDICT sets for productions with the same left-hand side are disjoint.
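The fixed-point computation of Figure 2.23 can be sketched in Python roughly as follows. This is an illustrative sketch, not the book's code: the names GRAMMAR, EPSILON, and analyze are assumptions, nonterminals are written with underscores, and an empty tuple stands for an ε right-hand side. Run on the LL(1) calculator grammar, it should reproduce the sets of Figure 2.22.

```python
EPSILON = ()  # assumption: an empty right-hand side encodes an epsilon production

GRAMMAR = [  # (left-hand side, right-hand side), in the order of Figure 2.22
    ("program", ("stmt_list", "$$")),
    ("stmt_list", ("stmt", "stmt_list")),
    ("stmt_list", EPSILON),
    ("stmt", ("id", ":=", "expr")),
    ("stmt", ("read", "id")),
    ("stmt", ("write", "expr")),
    ("expr", ("term", "term_tail")),
    ("term_tail", ("add_op", "term", "term_tail")),
    ("term_tail", EPSILON),
    ("term", ("factor", "factor_tail")),
    ("factor_tail", ("mult_op", "factor", "factor_tail")),
    ("factor_tail", EPSILON),
    ("factor", ("(", "expr", ")")),
    ("factor", ("id",)),
    ("factor", ("number",)),
    ("add_op", ("+",)),
    ("add_op", ("-",)),
    ("mult_op", ("*",)),
    ("mult_op", ("/",)),
]

def analyze(grammar):
    nonterminals = {lhs for lhs, _ in grammar}
    symbols = nonterminals | {s for _, rhs in grammar for s in rhs}
    eps = {s: False for s in symbols}
    first = {s: (set() if s in nonterminals else {s}) for s in symbols}
    follow = {s: set() for s in symbols}

    def eps_of(string):          # EPS for a string of symbols
        return all(eps[s] for s in string)

    def first_of(string):        # FIRST for a string of symbols
        result = set()
        for s in string:
            result |= first[s]
            if not eps[s]:
                break
        return result

    changed = True
    while changed:               # EPS and FIRST: repeat until no further progress
        changed = False
        for lhs, rhs in grammar:
            new_first = first[lhs] | first_of(rhs)
            new_eps = eps[lhs] or eps_of(rhs)
            if new_first != first[lhs] or new_eps != eps[lhs]:
                first[lhs], eps[lhs], changed = new_first, new_eps, True

    changed = True
    while changed:               # FOLLOW: repeat until no further progress
        changed = False
        for lhs, rhs in grammar:
            for i, sym in enumerate(rhs):
                rest = rhs[i + 1:]
                new = follow[sym] | first_of(rest)
                if eps_of(rest):
                    new |= follow[lhs]
                if new != follow[sym]:
                    follow[sym], changed = new, True

    predict = [first_of(rhs) | (follow[lhs] if eps_of(rhs) else set())
               for lhs, rhs in grammar]
    return eps, first, follow, predict

eps, first, follow, predict = analyze(GRAMMAR)
```

Here predict[i] holds the PREDICT set of production i + 1 in the figure's numbering. The sketch recomputes FIRST of each right-hand side on demand, as the text describes, rather than storing it.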
Writing an LL(1) Grammar
EXAMPLE 2.28 Left recursion
When working with a top-down parser generator, one has to acquire a certain facility in writing and modifying LL(1) grammars. The two most common obstacles
to “LL(1)-ness” are left recursion and common prefixes.
A grammar is said to be left recursive if there is a nonterminal A such that
A =⇒+ A α for some α. The trivial case occurs when the first symbol on the
right-hand side of a production is the same as the symbol on the left-hand side.
Here again is the grammar from Example 2.21, which cannot be parsed top-down:
id list −→ id list prefix ;
id list prefix −→ id list prefix , id
−→ id
The problem is in the second and third productions; with id list prefix at top-of-stack and an id on the input, a predictive parser cannot tell which of the
productions it should use. (Recall that left recursion is desirable in bottom-up
grammars, because it allows recursive constructs to be discovered incrementally,
as in Figure 2.14.)
EXAMPLE 2.29 Common prefixes
Common prefixes occur when two different productions with the same left-hand side begin with the same symbol or symbols. Here is an example that commonly appears in languages descended from Algol:
stmt −→ id := expr
−→ id ( argument list )
– – procedure call
Clearly id is in the FIRST set of both right-hand sides, and therefore in the PREDICT
set of both productions.
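A predict-predict conflict of this kind can be detected mechanically once the PREDICT sets are known. The following Python sketch assumes the sets are supplied as a dictionary keyed by (left-hand side, right-hand side); the function name predict_conflicts and the hand-written sets are illustrative assumptions, not part of the original text.

```python
from itertools import combinations

def predict_conflicts(predict_sets):
    """Return (lhs, rhs1, rhs2, shared tokens) for each pair of productions
    with the same left-hand side whose PREDICT sets overlap."""
    by_lhs = {}
    for (lhs, rhs), tokens in predict_sets.items():
        by_lhs.setdefault(lhs, []).append((rhs, tokens))
    conflicts = []
    for lhs, alternatives in by_lhs.items():
        for (rhs1, t1), (rhs2, t2) in combinations(alternatives, 2):
            shared = t1 & t2
            if shared:
                conflicts.append((lhs, rhs1, rhs2, shared))
    return conflicts

# PREDICT sets for the common-prefix fragment, written out by hand.
common_prefix = {
    ("stmt", ("id", ":=", "expr")): {"id"},
    ("stmt", ("id", "(", "argument_list", ")")): {"id"},
}
conflicts = predict_conflicts(common_prefix)
```

For this fragment the function reports one conflict on the token id, which is exactly why the grammar is not LL(1). A grammar is LL(1) when the returned list is empty.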
Both left recursion and common prefixes can be removed from a grammar
mechanically. The general case is a little tricky (Exercise 2.22), because the prediction problem may be an indirect one (e.g., S −→ A α and A −→ S β , or
S −→ A α, S −→ B β , A =⇒∗ c γ , and B =⇒∗ c δ). We can see the general
idea in the examples above, however.

DESIGN & IMPLEMENTATION
Recursive descent and table-driven LL parsing
When trying to understand the connection between recursive descent and table-driven LL parsing, it is tempting to imagine that the explicit stack of the table-driven parser mirrors the implicit call stack of the recursive descent parser, but
this is not the case.
A better way to visualize the two implementations
of top-down parsing is to remember that both are
discovering a parse tree via depth-first left-to-right
traversal. When we are at a given point in the parse
(say the circled node in the tree shown here), the
implicit call stack of a recursive descent parser holds
a frame for each of the nodes on the path back to
the root, created when the routine corresponding to
that node was called. (This path is shown in grey.)
But these nodes are immaterial. What matters for the rest of the parse, as
shown on the white path here, are the upcoming calls on the case statement
arms of the recursive descent routines. Those calls (those parse tree nodes)
are precisely the contents of the explicit stack of a table-driven LL parser.

EXAMPLE 2.30 Eliminating left recursion
Our left-recursive definition of id list can be replaced by the right-recursive
variant we saw in Example 2.20:
id list −→ id id list tail
id list tail −→ , id id list tail
id list tail −→ ;
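The mechanical transformation for the trivial (immediate) case can be sketched in Python as below. This is a sketch under stated assumptions: the function name and the name of the introduced tail nonterminal are hypothetical, it handles only productions of the form A −→ A α | β, and an empty tuple stands for ε. Its output differs slightly from the hand-written variant above, which also folds the terminating semicolon into the tail.

```python
def remove_immediate_left_recursion(lhs, alternatives, tail_name):
    """Rewrite A -> A alpha | beta as A -> beta T, T -> alpha T | epsilon."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == lhs]
    others = [alt for alt in alternatives if not alt or alt[0] != lhs]
    if not recursive:
        return {lhs: list(alternatives)}  # nothing to do
    return {
        lhs: [beta + (tail_name,) for beta in others],
        tail_name: [alpha + (tail_name,) for alpha in recursive] + [()],
    }

result = remove_immediate_left_recursion(
    "id_list_prefix",
    [("id_list_prefix", ",", "id"), ("id",)],
    "id_list_prefix_tail",
)
```

The result maps id_list_prefix to id followed by the new tail, and the tail to either ", id tail" or ε. Indirect left recursion (the general case mentioned earlier) is not covered by this sketch.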
EXAMPLE 2.31 Left factoring
Our common-prefix definition of stmt can be made LL(1) by a technique called
left factoring :
stmt −→ id stmt list tail
stmt list tail −→ := expr | ( argument list )
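A single step of left factoring can be sketched in Python as follows. The function name left_factor is an assumption, and the sketch makes a simplifying choice: it factors out only the longest prefix shared by all of the alternatives it is given, whereas a complete implementation would group alternatives by prefix and repeat.

```python
def left_factor(lhs, alternatives, tail_name):
    """Factor the longest prefix common to all alternatives into lhs,
    moving the differing suffixes to a new nonterminal tail_name."""
    prefix = []
    for column in zip(*alternatives):   # walk the alternatives symbol by symbol
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    if not prefix:
        return {lhs: list(alternatives)}
    n = len(prefix)
    return {
        lhs: [tuple(prefix) + (tail_name,)],
        tail_name: [alt[n:] for alt in alternatives],
    }

factored = left_factor(
    "stmt",
    [("id", ":=", "expr"), ("id", "(", "argument_list", ")")],
    "stmt_list_tail",
)
```

Applied to the two stmt productions, this yields the factored grammar shown above: stmt derives id followed by the tail, and the tail derives either := expr or ( argument_list ).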
EXAMPLE 2.32 Parsing a “dangling else”
Of course, simply eliminating left recursion and common prefixes is not guaranteed to make a grammar LL(1). There are infinitely many non-LL languages—
languages for which no LL grammar exists—and the mechanical transformations
to eliminate left recursion and common prefixes work on their grammars just
fine. Fortunately, the few non-LL languages that arise in practice can generally be
handled by augmenting the parsing algorithm with one or two simple heuristics.
The best known example of a “not quite LL” construct arises in languages
like Pascal, in which the else part of an if statement is optional. The natural
grammar fragment
stmt −→ if condition then clause else clause | other stmt
then clause −→ then stmt
else clause −→ else stmt | ε
is ambiguous (and thus neither LL nor LR); it allows the else in if C1 then
if C2 then S1 else S2 to be paired with either then . The less natural grammar
fragment
stmt −→ balanced stmt | unbalanced stmt
balanced stmt −→ if condition then balanced stmt else balanced stmt
| other stmt
unbalanced stmt −→ if condition then stmt
| if condition then balanced stmt else unbalanced stmt
can be parsed bottom-up but not top-down (there is no pure top-down grammar
for Pascal else statements). A balanced stmt is one with the same number of
then s and else s. An unbalanced stmt has more then s.
The usual approach, whether parsing top-down or bottom-up, is to use the
ambiguous grammar together with a “disambiguating rule,” which says that in the
case of a conflict between two possible productions, the one to use is the one that
occurs first, textually, in the grammar. In the ambiguous fragment above, the fact
that else clause −→ else stmt comes before else clause −→ ε ends up pairing
the else with the nearest then , as desired.
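In a recursive descent parser the disambiguating rule amounts to always taking the else branch when an else is the upcoming token. The Python sketch below illustrates this on a token list. It is a simplification under stated assumptions: conditions and other statements are single tokens, the function name parse_stmt and the tuple-shaped tree are illustrative, and error handling is minimal.

```python
def parse_stmt(tokens, pos=0):
    """Parse one stmt starting at tokens[pos]; return (tree, next position)."""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        if tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then'")
        then_part, pos = parse_stmt(tokens, pos + 3)
        else_part = None
        # Disambiguating rule: prefer else_clause -> else stmt over the
        # epsilon production whenever 'else' is the upcoming token.
        if pos < len(tokens) and tokens[pos] == "else":
            else_part, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_part, else_part), pos
    return tokens[pos], pos + 1  # other_stmt: a single token in this sketch

tree, end = parse_stmt("if C1 then if C2 then S1 else S2".split())
```

Because the inner call sees the else first, S2 ends up attached to the inner if, that is, to the nearest then.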
EXAMPLE 2.33 “Dangling else” program bug
Better yet, a language designer can avoid this sort of problem by choosing
different syntax. The ambiguity of the dangling else problem in Pascal leads to
problems not only in parsing, but in writing and maintaining correct programs.
Most Pascal programmers have at one time or another written a program like this
one:
if P <> nil then
    if P^.val = goal then
        foundIt := true
else
    endOfList := true
Indentation notwithstanding, the Pascal manual states that an else clause
matches the closest unmatched then —in this case the inner one—which is clearly
not what the programmer intended. To get the desired effect, the Pascal programmer must write
if P <> nil then begin
    if P^.val = goal then
        foundIt := true
end
else
    endOfList := true
EXAMPLE 2.34 End markers for structured statements
EXAMPLE 2.35 The need for elsif
Many other Algol-family languages (including Modula, Modula-2, and Oberon,
all more recent inventions of Pascal’s designer, Niklaus Wirth) require explicit end
markers on all structured statements. The grammar fragment for if statements
in Modula-2 looks something like this:
stmt −→ IF condition then clause else clause END | other stmt
then clause −→ THEN stmt list
else clause −→ ELSE stmt list | ε
The addition of the END eliminates the ambiguity.
Modula-2 uses END to terminate all its structured statements. Ada and Fortran 77 end an if with end if (and a while with end while , etc.). Algol 68 creates
its terminators by spelling the initial keyword backward ( if . . . fi , case . . . esac ,
do . . . od , etc.).
DESIGN & IMPLEMENTATION
The dangling else
A simple change in language syntax (eliminating the dangling else) not
only reduces the chance of programming errors, but also significantly simplifies parsing. For more on the dangling else problem, see Exercise 2.27 and
Section 6.4.

One problem with end markers is that they tend to bunch up. In Pascal one
can write
if A = B then ...
else if A = C then ...
else if A = D then ...
else if A = E then ...
else ...
With end markers this becomes
if A = B then ...
else if A = C then ...
else if A = D then ...
else if A = E then ...
else ...
end end end end
To avoid this awkwardness, languages with end markers generally provide an
elsif keyword (sometimes spelled elif ):
if A = B then ...
elsif A = C then ...
elsif A = D then ...
elsif A = E then ...
else ...
end
With elsif clauses added, the Modula-2 grammar fragment for if statements
looks like this:
stmt −→ IF condition then clause elsif clauses else clause END | other stmt
then clause −→ THEN stmt list
elsif clauses −→ ELSIF condition then clause elsif clauses | ε
else clause −→ ELSE stmt list | ε

CHECK YOUR UNDERSTANDING
29. Discuss the similarities and differences between recursive descent and table-driven top-down parsing.
30. What are FIRST and FOLLOW sets? What are they used for?
31. Under what circumstances does a top-down parser predict the production
A −→ α?
32. What sorts of “obvious” facts form the basis of FIRST set and FOLLOW set
construction?
33. Outline the algorithm used to complete the construction of FIRST and FOLLOW
sets. How do we know when we are done?
34. How do we know when a grammar is not LL(1)?
35. Describe two common idioms in context-free grammars that cannot be parsed
top-down.
36. What is the “dangling else ” problem? How is it avoided in modern languages?
2.3.3 Bottom-Up Parsing
Conceptually, as we saw at the beginning of Section 2.3, a bottom-up parser works
by maintaining a forest of partially completed subtrees of the parse tree, which it
joins together whenever it recognizes the symbols on the right-hand side of some
production used in the right-most derivation of the input string. It creates a new
internal node and makes the roots of the joined-together trees the children of that
node.
In practice, a bottom-up parser is almost always table-driven. It keeps the roots
of its partially completed subtrees on a stack. When it accepts a new token from
the scanner, it shifts the token into the stack. When it recognizes that the top few
symbols on the stack constitute a right-hand side, it reduces those symbols to their
left-hand side by popping them off the stack and pushing the left-hand side in
their place. The role of the stack is the first important difference between top-down and bottom-up parsing: a top-down parser’s stack contains a list of what
the parser expects to see in the future; a bottom-up parser’s stack contains a record
of what the parser has already seen in the past.
Canonical Derivations
EXAMPLE 2.36 Derivation of an id list
We also noted earlier that the actions of a bottom-up parser trace out a right-most (canonical) derivation in reverse. The roots of the partial subtrees, left-to-right, together with the remaining input, constitute a sentential form of the
right-most derivation. On the right-hand side of Figure 2.13, for example, we
have the following series of steps.
Stack contents (roots of partial trees)          Remaining input
(empty)                                          A, B, C;
id (A)                                           , B, C;
id (A) ,                                         B, C;
id (A) , id (B)                                  , C;
id (A) , id (B) ,                                C;
id (A) , id (B) , id (C)                         ;
id (A) , id (B) , id (C) ;
id (A) , id (B) , id (C) id list tail
id (A) , id (B) id list tail
id (A) id list tail
id list
 1. program −→ stmt list $$
 2. stmt list −→ stmt list stmt
 3. stmt list −→ stmt
 4. stmt −→ id := expr
 5. stmt −→ read id
 6. stmt −→ write expr
 7. expr −→ term
 8. expr −→ expr add op term
 9. term −→ factor
10. term −→ term mult op factor
11. factor −→ ( expr )
12. factor −→ id
13. factor −→ number
14. add op −→ +
15. add op −→ -
16. mult op −→ *
17. mult op −→ /
Figure 2.24 LR(1) grammar for the calculator language. Productions have been numbered for
reference in future figures.
The last four lines (the ones that don’t just shift tokens into the forest) correspond
to the right-most derivation:
id list =⇒ id id list tail
=⇒ id , id id list tail
=⇒ id , id , id id list tail
=⇒ id , id , id ;
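The trace above can be reproduced with a deliberately naive shift-reduce loop, sketched here in Python. It is not a general LR algorithm: it reduces greedily whenever the top of the stack matches a right-hand side, trying longer right-hand sides first, which happens to be sufficient for this particular grammar. The names PRODUCTIONS and shift_reduce are assumptions made for illustration.

```python
PRODUCTIONS = [  # longest right-hand sides first, so the greedy match prefers them
    ("id_list_tail", [",", "id", "id_list_tail"]),
    ("id_list", ["id", "id_list_tail"]),
    ("id_list_tail", [";"]),
]

def shift_reduce(tokens):
    """Return the final stack and the reductions performed, in order."""
    stack, reductions = [], []
    remaining = list(tokens)
    while True:
        for lhs, rhs in PRODUCTIONS:
            if stack[-len(rhs):] == rhs:      # handle found at top of stack
                del stack[-len(rhs):]
                stack.append(lhs)
                reductions.append((lhs, rhs))
                break
        else:                                  # no handle: shift or stop
            if not remaining:
                return stack, reductions
            stack.append(remaining.pop(0))

stack, reductions = shift_reduce(["id", ",", "id", ",", "id", ";"])
```

No reduction is possible until the semicolon has been shifted. The four reductions that follow, read in reverse order, give the right-most derivation shown above.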
EXAMPLE 2.37 Bottom-up grammar for the calculator language
The symbols that need to be joined together at each step of the parse to represent
the next step of the backward derivation are called the handle of the sentential
form. In the parse trace above, the handles are underlined.
In our id list example, no handles were found until the entire input had been
shifted onto the stack. In general this will not be the case. We can obtain a more
realistic example by examining an LR version of our calculator language, shown
in Figure 2.24. While the LL grammar of Figure 2.15 can be parsed bottom-up, the
version in Figure 2.24 is preferable for two reasons. First, it uses a left-recursive
production for stmt list. Left recursion allows the parser to collapse long statement
lists as it goes along, rather than waiting until the entire list is on the stack and then
collapsing it from the end. Second, it uses left-recursive productions for expr and
term. These productions capture left associativity while still keeping an operator
and its operands together in the same right-hand side, something we were unable
to do in a top-down grammar.
Modeling a Parse with LR Items
EXAMPLE 2.38 Bottom-up parse of the “sum and average” program
Suppose we are to parse the sum-and-average program from Example 2.24:
read A
read B
sum := A + B
write sum
write sum / 2
The key to success will be to figure out when we have reached the end of a right-hand side, that is, when we have a handle at the top of the parse stack. The trick
is to keep track of the set of productions we might be “in the middle of ” at any
particular time, together with an indication of where in those productions we
might be.
When we begin execution, the parse stack is empty and we are at the beginning
of the production for program. (In general, we can assume that there is only one
production with the start symbol on the left-hand side; it is easy to modify any
grammar to make this the case.) We can represent our location (more specifically,
the location represented by the top of the parse stack) with a • in the right-hand
side of the production:
program −→ • stmt list $$
When augmented with a •, a production is called an LR item. Since the • in
this item is immediately in front of a nonterminal (namely stmt list), we may be
about to see the yield of that nonterminal coming up on the input. This possibility
implies that we may be at the beginning of some production with stmt list on the
left-hand side:
program −→ • stmt list $$
stmt list −→ • stmt list stmt
stmt list −→ • stmt
And, since stmt is a nonterminal, we may also be at the beginning of any production
whose left-hand side is stmt:
program −→ • stmt list $$              (State 0)
stmt list −→ • stmt list stmt
stmt list −→ • stmt
stmt −→ • id := expr
stmt −→ • read id
stmt −→ • write expr
Since all of these last productions begin with a terminal, no additional items need
to be added to our list. The original item (program −→ • stmt list $$ ) is called
the basis of the list. The additional items are its closure. The list represents the
initial state of the parser. As we shift and reduce, the set of items will change,
always indicating which productions may be the right one to use next in the
derivation of the input string. If we reach a state in which some item has the • at
the end of the right-hand side, we can reduce by that production. Otherwise, as in
the current situation, we must shift. Note that if we need to shift, but the incoming
token cannot follow the • in any item of the current state, then a syntax error has
occurred. We will consider error recovery in more detail in Section 2.3.4.
Our upcoming token is a read . Once we shift it onto the stack, we know we
are in the following state:
stmt −→ read • id              (State 1)
This state has a single basis item and an empty closure: the • precedes a terminal.
After shifting the A , we have
stmt −→ read id •              (State 1′)
We now know that read id is the handle, and we must reduce. The reduction
pops two symbols off the parse stack and pushes a stmt in their place, but what
should the new state be? We can see the answer if we imagine moving back in time
to the point at which we shifted the read —the first symbol of the right-hand
side. At that time we were in the state labeled “State 0” above, and the upcoming
tokens on the input (though we didn’t look at them at the time) were read id .
We have now consumed these tokens, and we know that they constituted a stmt.
By pushing a stmt onto the stack, we have in essence replaced read id with stmt
on the input stream, and have then “shifted” the nonterminal, rather than its yield,
into the stack. Since one of the items in State 0 was
stmt list −→ • stmt
we now have
stmt list −→ stmt •            (State 0′)
Again we must reduce. We remove the stmt from the stack and push a stmt list in
its place. Again we can see this as “shifting” a stmt list when in State 0. Since two
of the items in State 0 have a stmt list after the • , we don’t know (without looking
ahead) which of the productions will be the next to be used in the derivation, but
we don’t have to know. The key advantage of bottom-up parsing over top-down
parsing is that we don’t need to predict ahead of time which production we shall
be expanding.
Our new state is as follows:
program −→ stmt list • $$              (State 2)
stmt list −→ stmt list • stmt
stmt −→ • id := expr
stmt −→ • read id
stmt −→ • write expr
The first two productions are the basis; the others are the closure. Since no item
has a • at the end, we shift the next token, which happens again to be a read ,
taking us back to State 1. Shifting the B takes us to State 1′ again, at which point
we reduce. This time, however, we go back to State 2 rather than State 0 before
shifting the left-hand-side stmt. Why? Because we were in State 2 when we began
to read the right-hand side.
The Characteristic Finite State Machine and LR Parsing Variants
An LR-family parser keeps track of the states it has traversed by pushing them into
the parse stack, along with the grammar symbols. It is in fact the states (rather than
the symbols) that drive the parsing algorithm: they tell us what state we were in
at the beginning of a right-hand side. Specifically, when the combination of state
and input tells us we need to reduce using production A −→ α, we pop length(α)
symbols off the stack, together with the record of states we moved through while
shifting those symbols. These pops expose the state we were in immediately prior
to the shifts, allowing us to return to that state and proceed as if we had seen A in
the first place.
We can think of the shift rules of an LR-family parser as the transition function
of a finite automaton, much like the automata we used to model scanners. Each
state of the automaton corresponds to a list of items that indicate where the parser
might be at some specific point in the parse. The transition for input symbol X
(which may be either a terminal or a nonterminal) moves to a state whose basis
consists of items in which the • has been moved across an X in the right-hand
side, plus whatever items need to be added as closure. The lists are constructed by
a bottom-up parser generator in order to build the automaton, but are not needed
during parsing.
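The closure and transition operations described above can be sketched in Python as follows, using the grammar of Figure 2.24. This is an illustrative sketch with assumed names (GRAMMAR, closure, goto): an item is represented as a pair (production index, dot position), with production indices starting at 0, so index i corresponds to production i + 1 in the figure.

```python
GRAMMAR = [  # Figure 2.24; index i is production i + 1
    ("program", ("stmt_list", "$$")),
    ("stmt_list", ("stmt_list", "stmt")),
    ("stmt_list", ("stmt",)),
    ("stmt", ("id", ":=", "expr")),
    ("stmt", ("read", "id")),
    ("stmt", ("write", "expr")),
    ("expr", ("term",)),
    ("expr", ("expr", "add_op", "term")),
    ("term", ("factor",)),
    ("term", ("term", "mult_op", "factor")),
    ("factor", ("(", "expr", ")")),
    ("factor", ("id",)),
    ("factor", ("number",)),
    ("add_op", ("+",)),
    ("add_op", ("-",)),
    ("mult_op", ("*",)),
    ("mult_op", ("/",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """Add an initial item for every nonterminal that appears after a dot."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for i, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (i, 0) not in result:
                    result.add((i, 0))
                    work.append((i, 0))
    return frozenset(result)

def goto(state, symbol):
    """Move the dot across symbol in every item that allows it, then close."""
    basis = {(p, d + 1) for p, d in state
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(basis)

state0 = closure({(0, 0)})            # State 0 of Example 2.38
state1 = goto(state0, "read")         # State 1: stmt -> read . id
state2 = goto(state0, "stmt_list")    # State 2
```

A parser generator would repeat goto from every discovered state and every grammar symbol until no new item sets appear. That worklist is omitted here to keep the sketch short.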
It turns out that the simpler members of the LR family of parsers, namely LR(0),
SLR(1), and LALR(1), all use the same automaton, called the characteristic finite-state machine, or CFSM. Full LR parsers use a machine with (for most grammars) a much larger number of states. The differences between the algorithms lie
in how they deal with states that contain a shift-reduce conflict: one item with
the • in front of a terminal (suggesting the need for a shift) and another with
the • at the end of the right-hand side (suggesting the need for a reduction).
An LR(0) parser works only when there are no such states. It can be proven that
with the addition of an end-marker (i.e., $$ ), any language that can be deterministically parsed bottom-up has an LR(0) grammar. Unfortunately, the LR(0)
grammars for real programming languages tend to be prohibitively large and
unintuitive.
SLR (simple LR) parsers peek at upcoming input and use FOLLOW sets to
resolve conflicts. An SLR parser will call for a reduction via A −→ α only if the
upcoming token(s) are in FOLLOW(A). It will still see a conflict, however, if the
tokens are also in the FIRST set of any of the symbols that follow a • in other
items of the state. As it turns out, there are important cases in which a token may
follow a given nonterminal somewhere in a valid program, but never in a context
described by the current state. For these cases global FOLLOW sets are too crude.
LALR (look-ahead LR) parsers improve on SLR by using local (state-specific)
look-ahead instead.
Conflicts can still arise in an LALR parser when the same set of items can occur
on two different paths through the CFSM. Both paths will end up in the same
state, at which point state-specific look-ahead can no longer distinguish between
them. A full LR parser duplicates states in order to keep paths disjoint when their
local look-aheads are different.
LALR parsers are the most common bottom-up parsers in practice. They are
the same size and speed as SLR parsers, but are able to resolve more conflicts.
Full LR parsers for real programming languages tend to be very large. Several
researchers have developed techniques to reduce the size of full-LR tables, but
LALR works sufficiently well in practice that the extra complexity of full LR is
usually not required. Yacc/bison produces C code for an LALR parser.
Bottom-Up Parsing Tables
EXAMPLE
2.39
CFSM for the bottom-up
calculator grammar
Like a table-driven LL(1) parser, an SLR(1), LALR(1), or LR(1) parser executes
a loop in which it repeatedly inspects a two-dimensional table to find out what
action to take. However, instead of using the current input token and top-of-stack
nonterminal to index into the table, an LR-family parser uses the current input
token and the current parser state (which can be found at the top of the stack).
“Shift” table entries indicate the state that should be pushed. “Reduce” table entries
indicate the number of states that should be popped and the nonterminal that
should be pushed back onto the input stream, to be shifted by the state uncovered
by the pops. There is always one popped state for every symbol on the right-hand side of the reducing production. The state to be pushed next can be found
by indexing into the table using the uncovered state and the newly recognized
nonterminal.
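The table-driven loop can be sketched in Python as shown below. Several things here are assumptions made to keep the example small: the grammar is a toy one (E −→ E + n | n) rather than the calculator grammar, the tables were built by hand, and the sketch uses the conventional separate ACTION and GOTO tables instead of pushing the recognized nonterminal back onto the input stream as described above. The stack holds states only.

```python
# Toy grammar (an assumption for illustration):
#   1: E -> E + n      2: E -> n
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1)}  # production -> (lhs, length of rhs)

ACTION = {  # (state, token) -> action, built by hand for the toy grammar
    (0, "n"): ("shift", 1),
    (1, "+"): ("reduce", 2), (1, "$$"): ("reduce", 2),
    (2, "+"): ("shift", 3), (2, "$$"): ("accept", None),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1), (4, "$$"): ("reduce", 1),
}
GOTO = {(0, "E"): 2}  # (uncovered state, nonterminal) -> next state

def lr_parse(tokens):
    """Return the production numbers used, in order of reduction."""
    stack = [0]
    reductions = []
    tokens = list(tokens) + ["$$"]
    pos = 0
    while True:
        state, token = stack[-1], tokens[pos]
        if (state, token) not in ACTION:
            raise SyntaxError(f"unexpected {token!r} in state {state}")
        kind, arg = ACTION[(state, token)]
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = PRODUCTIONS[arg]
            del stack[-length:]                       # one state per rhs symbol
            stack.append(GOTO[(stack[-1], lhs)])      # index by uncovered state
            reductions.append(arg)
        else:
            return reductions

trace = lr_parse(["n", "+", "n", "+", "n"])
```

The returned list of production numbers, read backward, is a right-most derivation of the input. A missing ACTION entry corresponds to a syntax error.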
The CFSM for our bottom-up version of the calculator grammar appears in
Figure 2.25. States 6, 7, 9, and 13 contain potential shift-reduce conflicts, but all of
these can be resolved with global FOLLOW sets. SLR parsing therefore suffices. In
State 6, for example, FIRST(add op) ∩ FOLLOW(stmt) = ∅. In addition to shift and
reduce rules, we allow the parse table as an optimization to contain rules of the
form “shift and then reduce.” This optimization serves to eliminate trivial states
such as 1′ and 0′ in Example 2.38, which had only a single item, with the • at
the end.
A pictorial representation of the CFSM appears in Figure 2.26. A tabular
representation, suitable for use in a table-driven parser, appears in Figure 2.27.
Pseudocode for the (language-independent) parser driver appears in Figure 2.28.
A trace of the parser’s actions on the sum-and-average program appears in
Figure 2.29.
Handling Epsilon Productions
EXAMPLE 2.40 Epsilon productions in the bottom-up calculator grammar
The careful reader may have noticed that the grammar of Figure 2.24, in addition
to using left-recursive rules for stmt list, expr, and term, differs from the grammar
of Figure 2.15 in one other way: it defines a stmt list to be a sequence of one or
more stmts, rather than zero or more. (This means, of course, that it defines a
different language.) To capture the same language as Figure 2.15, production 3 in
Figure 2.24,
stmt list −→ stmt
would need to be replaced with
stmt list −→ ε
EXAMPLE 2.41 CFSM with epsilon productions
Note that it does in general make sense to have an empty statement list. In the
calculator language it simply permits an empty program, which is admittedly
silly. In real languages, however, it allows the body of a structured statement to
be empty, which can be very useful. One frequently wants one arm of a case
or multiway if . . . then . . . else statement to be empty, and an empty while
loop allows a parallel program (or the operating system) to wait for a signal from
another process or an I/O device.
If we look at the CFSM for the calculator language, we discover that State 0 is
the only state that needs to be changed in order to allow empty statement lists.
The item
stmt list −→ • stmt
becomes
stmt list −→ • ε
which is equivalent to
stmt list −→ ε •
or simply
stmt list −→ •
The entire state is then
program −→ • stmt list $$            on stmt list shift and goto 2
stmt list −→ • stmt list stmt
stmt list −→ •                       on $$ reduce (pop 0 states, push stmt list on input)
stmt −→ • id := expr                 on id shift and goto 3
stmt −→ • read id                    on read shift and goto 1
stmt −→ • write expr                 on write shift and goto 4
The look-ahead for item
stmt list −→ •
is FOLLOW(stmt list), which is the end-marker, $$ . Since $$ does not appear in
the look-aheads for any other item in this state, our grammar is still SLR(1). It is
worth noting that epsilon productions commonly prevent a grammar from being
LR(0): if such a production shares a state with an item in which the dot precedes
a terminal, we won’t be able to tell whether to “recognize” ε without peeking
ahead.
State                                      Transitions

0.  program −→ • stmt list $$              on stmt list shift and goto 2
    ------------------------------
    stmt list −→ • stmt list stmt
    stmt list −→ • stmt                    on stmt shift and reduce (pop 1 state, push stmt list on input)
    stmt −→ • id := expr                   on id shift and goto 3
    stmt −→ • read id                      on read shift and goto 1
    stmt −→ • write expr                   on write shift and goto 4

1.  stmt −→ read • id                      on id shift and reduce (pop 2 states, push stmt on input)

2.  program −→ stmt list • $$              on $$ shift and reduce (pop 2 states, push program on input)
    stmt list −→ stmt list • stmt          on stmt shift and reduce (pop 2 states, push stmt list on input)
    ------------------------------
    stmt −→ • id := expr                   on id shift and goto 3
    stmt −→ • read id                      on read shift and goto 1
    stmt −→ • write expr                   on write shift and goto 4

3.  stmt −→ id • := expr                   on := shift and goto 5

4.  stmt −→ write • expr                   on expr shift and goto 6
    ------------------------------
    expr −→ • term                         on term shift and goto 7
    expr −→ • expr add op term
    term −→ • factor                       on factor shift and reduce (pop 1 state, push term on input)
    term −→ • term mult op factor
    factor −→ • ( expr )                   on ( shift and goto 8
    factor −→ • id                         on id shift and reduce (pop 1 state, push factor on input)
    factor −→ • number                     on number shift and reduce (pop 1 state, push factor on input)

5.  stmt −→ id := • expr                   on expr shift and goto 9
    ------------------------------
    expr −→ • term                         on term shift and goto 7
    expr −→ • expr add op term
    term −→ • factor                       on factor shift and reduce (pop 1 state, push term on input)
    term −→ • term mult op factor
    factor −→ • ( expr )                   on ( shift and goto 8
    factor −→ • id                         on id shift and reduce (pop 1 state, push factor on input)
    factor −→ • number                     on number shift and reduce (pop 1 state, push factor on input)

6.  stmt −→ write expr •                   on FOLLOW(stmt) = { id , read , write , $$ } reduce (pop 2 states, push stmt on input)
    expr −→ expr • add op term             on add op shift and goto 10
    ------------------------------
    add op −→ • +                          on + shift and reduce (pop 1 state, push add op on input)
    add op −→ • -                          on - shift and reduce (pop 1 state, push add op on input)

Figure 2.25 CFSM for the calculator grammar (Figure 2.24). Basis and closure items in each state are separated by a horizontal
rule. Trivial reduce-only states have been eliminated by use of “shift and reduce” transitions. (continued)
2.3 Parsing
State
7.
expr −→ term
term −→ term
mult op −→
mult op −→
8.
factor −→ (
..
..
..
.
expr −→
expr −→
term −→
term −→
factor −→
factor −→
factor −→
9.
*
/
on FOLLOW(expr) = { id , read , write , $$ , ) , + , - } reduce
(pop 1 state, push expr on input)
on mult op shift and goto 11
on * shift and reduce (pop 1 state, push mult op on input)
on / shift and reduce (pop 1 state, push mult op on input)
expr )
on expr shift and goto 12
mult op factor
term
expr add op term
factor
term mult op factor
( expr )
id
number
. .
stmt −→ id := expr
expr −→ expr add op term
add op −→
add op −→
10.
..
.
..
Transitions
..
+
-
expr −→ expr add op
..
..
.
.
term
term −→ factor
term −→ term mult op factor
factor −→ ( expr )
factor −→ id
factor −→ number
11.
term −→ term mult op
factor −→
factor −→
factor −→
12.
13.
..
.
.
factor
( expr )
id
number
on term shift and goto 7
on factor shift and reduce (pop 1 state, push term on input)
on ( shift and goto 8
on id shift and reduce (pop 1 state, push factor on input)
on number shift and reduce (pop 1 state, push factor on input)
on FOLLOW ( stmt ) = { id , read , write , $$ } reduce
(pop 3 states, push stmt on input)
on add op shift and goto 10
on + shift and reduce (pop 1 state, push add op on input)
on - shift and reduce (pop 1 state, push add op on input)
on term shift and goto 13
on factor shift and reduce (pop 1 state, push term on input)
on ( shift and goto 8
on id shift and reduce (pop 1 state, push factor on input)
on number shift and reduce (pop 1 state, push factor on input)
on factor shift and reduce (pop 3 states, push term on input)
on ( shift and goto 8
on id shift and reduce (pop 1 state, push factor on input)
on number shift and reduce (pop 1 state, push factor on input)
factor −→ ( expr )
expr −→ expr add op term
..
on ) shift and reduce (pop 3 states, push factor on input)
on add op shift and goto 10
add op −→
add op −→
+
-
on + shift and reduce (pop 1 state, push add op on input)
on - shift and reduce (pop 1 state, push add op on input)
..
.
expr −→ expr add op term
term −→ term mult op factor
mult op −→
mult op −→
Figure 2.25
..
.
*
/
(continued)
on FOLLOW(expr) = { id , read , write , $$ , ) , + , - } reduce
(pop 3 states, push expr on input)
on mult op shift and goto 11
on * shift and reduce (pop 1 state, push mult op on input)
on / shift and reduce (pop 1 state, push mult op on input)
95
[Figure 2.26: state diagram with States 0 through 13 and arcs labeled by the symbols shifted.]
Figure 2.26 Pictorial representation of the CFSM of Figure 2.25. Reduce actions are not shown.

Top-of-stack                            Current input symbol
state  sl   s    e    t    f    ao   mo   id   lit  r    w    :=   (    )    +    -    *    /    $$
0      s2   b3   –    –    –    –    –    s3   –    s1   s4   –    –    –    –    –    –    –    –
1      –    –    –    –    –    –    –    b5   –    –    –    –    –    –    –    –    –    –    –
2      –    b2   –    –    –    –    –    s3   –    s1   s4   –    –    –    –    –    –    –    b1
3      –    –    –    –    –    –    –    –    –    –    –    s5   –    –    –    –    –    –    –
4      –    –    s6   s7   b9   –    –    b12  b13  –    –    –    s8   –    –    –    –    –    –
5      –    –    s9   s7   b9   –    –    b12  b13  –    –    –    s8   –    –    –    –    –    –
6      –    –    –    –    –    s10  –    r6   –    r6   r6   –    –    –    b14  b15  –    –    r6
7      –    –    –    –    –    –    s11  r7   –    r7   r7   –    –    r7   r7   r7   b16  b17  r7
8      –    –    s12  s7   b9   –    –    b12  b13  –    –    –    s8   –    –    –    –    –    –
9      –    –    –    –    –    s10  –    r4   –    r4   r4   –    –    –    b14  b15  –    –    r4
10     –    –    –    s13  b9   –    –    b12  b13  –    –    –    s8   –    –    –    –    –    –
11     –    –    –    –    b10  –    –    b12  b13  –    –    –    s8   –    –    –    –    –    –
12     –    –    –    –    –    s10  –    –    –    –    –    –    –    b11  b14  b15  –    –    –
13     –    –    –    –    –    –    s11  r8   –    r8   r8   –    –    r8   r8   r8   b16  b17  r8

Figure 2.27 SLR(1) parse table for the calculator language. Table entries indicate whether to shift (s), reduce (r), or shift
and then reduce (b). The accompanying number is the new state when shifting, or the production that has been recognized
when (shifting and) reducing. Production numbers are given in Figure 2.24. Symbol names have been abbreviated for the sake
of formatting. A dash indicates an error. An auxiliary table, not shown here, gives the left-hand-side symbol and right-hand-side
length for each production.
CHECK YOUR UNDERSTANDING
37. What is the handle of a right sentential form?
38. Explain the significance of the characteristic finite-state machine in LR
parsing.
state = 1 . . number_of_states
symbol = 1 . . number_of_symbols
production = 1 . . number_of_productions
action_rec = record
    action : (shift, reduce, shift_reduce, error)
    new_state : state
    prod : production
parse_tab : array [state, symbol] of action_rec
prod_tab : array [production] of record
    lhs : symbol
    rhs_len : integer
-- these two tables are created by a parser generator tool

parse_stack : stack of record
    sym : symbol
    st : state

parse_stack.push(null, start_state)
cur_sym : symbol := scan                        -- get new token from scanner
loop
    cur_state : state := parse_stack.top.st     -- peek at state at top of stack
    if cur_state = start_state and cur_sym = start_symbol
        return                                  -- success!
    ar : action_rec := parse_tab[cur_state, cur_sym]
    case ar.action
        shift:
            parse_stack.push(cur_sym, ar.new_state)
            cur_sym := scan                     -- get new token from scanner
        reduce:
            cur_sym := prod_tab[ar.prod].lhs
            parse_stack.pop(prod_tab[ar.prod].rhs_len)
        shift_reduce:
            cur_sym := prod_tab[ar.prod].lhs
            parse_stack.pop(prod_tab[ar.prod].rhs_len - 1)
        error:
            parse_error

Figure 2.28 Driver for a table-driven SLR(1) parser. We call the scanner directly, rather than
using the global input token of Figures 2.16 and 2.18, so that we can set cur_sym to be an arbitrary
symbol.
39. What is the significance of the dot ( • ) in an LR item?
40. What distinguishes the basis from the closure of an LR state?
41. What is a shift-reduce conflict? How is it resolved in the various kinds of
LR-family parsers?
42. Outline the steps performed by the driver of a bottom-up parser.
Parse stack | Input stream | Comment
0 | read A read B ... |
0 read 1 | A read B ... | shift read
0 | stmt read B ... | shift id(A) & reduce by stmt −→ read id
0 | stmt_list read B ... | shift stmt & reduce by stmt_list −→ stmt
0 stmt_list 2 | read B sum ... | shift stmt_list
0 stmt_list 2 read 1 | B sum := ... | shift read
0 stmt_list 2 | stmt sum := ... | shift id(B) & reduce by stmt −→ read id
0 | stmt_list sum := ... | shift stmt & reduce by stmt_list −→ stmt_list stmt
0 stmt_list 2 | sum := A ... | shift stmt_list
0 stmt_list 2 id 3 | := A + ... | shift id(sum)
0 stmt_list 2 id 3 := 5 | A + B ... | shift :=
0 stmt_list 2 id 3 := 5 | factor + B ... | shift id(A) & reduce by factor −→ id
0 stmt_list 2 id 3 := 5 | term + B ... | shift factor & reduce by term −→ factor
0 stmt_list 2 id 3 := 5 term 7 | + B write ... | shift term
0 stmt_list 2 id 3 := 5 | expr + B write ... | reduce by expr −→ term
0 stmt_list 2 id 3 := 5 expr 9 | + B write ... | shift expr
0 stmt_list 2 id 3 := 5 expr 9 | add_op B write ... | shift + & reduce by add_op −→ +
0 stmt_list 2 id 3 := 5 expr 9 add_op 10 | B write sum ... | shift add_op
0 stmt_list 2 id 3 := 5 expr 9 add_op 10 | factor write sum ... | shift id(B) & reduce by factor −→ id
0 stmt_list 2 id 3 := 5 expr 9 add_op 10 | term write sum ... | shift factor & reduce by term −→ factor
0 stmt_list 2 id 3 := 5 expr 9 add_op 10 term 13 | write sum ... | shift term
0 stmt_list 2 id 3 := 5 | expr write sum ... | reduce by expr −→ expr add_op term
0 stmt_list 2 id 3 := 5 expr 9 | write sum ... | shift expr
0 stmt_list 2 | stmt write sum ... | reduce by stmt −→ id := expr
0 | stmt_list write sum ... | shift stmt & reduce by stmt_list −→ stmt_list stmt
0 stmt_list 2 | write sum ... | shift stmt_list
0 stmt_list 2 write 4 | sum write sum ... | shift write
0 stmt_list 2 write 4 | factor write sum ... | shift id(sum) & reduce by factor −→ id
0 stmt_list 2 write 4 | term write sum ... | shift factor & reduce by term −→ factor
0 stmt_list 2 write 4 term 7 | write sum ... | shift term
0 stmt_list 2 write 4 | expr write sum ... | reduce by expr −→ term
0 stmt_list 2 write 4 expr 6 | write sum ... | shift expr
0 stmt_list 2 | stmt write sum ... | reduce by stmt −→ write expr
0 | stmt_list write sum ... | shift stmt & reduce by stmt_list −→ stmt_list stmt
0 stmt_list 2 | write sum / ... | shift stmt_list
0 stmt_list 2 write 4 | sum / 2 ... | shift write
0 stmt_list 2 write 4 | factor / 2 ... | shift id(sum) & reduce by factor −→ id
0 stmt_list 2 write 4 | term / 2 ... | shift factor & reduce by term −→ factor
0 stmt_list 2 write 4 term 7 | / 2 $$ | shift term
0 stmt_list 2 write 4 term 7 | mult_op 2 $$ | shift / & reduce by mult_op −→ /
0 stmt_list 2 write 4 term 7 mult_op 11 | 2 $$ | shift mult_op
0 stmt_list 2 write 4 term 7 mult_op 11 | factor $$ | shift number(2) & reduce by factor −→ number
0 stmt_list 2 write 4 | term $$ | shift factor & reduce by term −→ term mult_op factor
0 stmt_list 2 write 4 term 7 | $$ | shift term
0 stmt_list 2 write 4 | expr $$ | reduce by expr −→ term
0 stmt_list 2 write 4 expr 6 | $$ | shift expr
0 stmt_list 2 | stmt $$ | reduce by stmt −→ write expr
0 | stmt_list $$ | shift stmt & reduce by stmt_list −→ stmt_list stmt
0 stmt_list 2 | $$ | shift stmt_list
0 | program | shift $$ & reduce by program −→ stmt_list $$
[done]

Figure 2.29 Trace of a table-driven SLR(1) parse of the sum-and-average program. States in the parse stack are shown in
boldface type. Symbols in the parse stack are for clarity only; they are not needed by the parsing algorithm. Parsing begins with
the initial state of the CFSM (State 0) in the stack. It ends when we reduce by program −→ stmt_list $$ , uncovering State 0
again and pushing program onto the input stream.
43. What kind of parser is produced by yacc/bison? By ANTLR?
44. Why are there never any epsilon productions in an LR(0) grammar?
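To make the interplay of the parse table (Figure 2.27) and the driver (Figure 2.28) concrete, the following is a minimal Python sketch rather than a definitive implementation. The names TABLE, PRODS, and parse are hypothetical, as are the token spellings. As a further assumption, the sketch keeps the pending input in a deque, so that a nonterminal can be "pushed on the input" in front of an unconsumed look-ahead token, as the transitions of Figure 2.25 describe; the input is assumed to end with $$.

```python
from collections import deque

# Productions of Figure 2.24: number -> (left-hand side, right-hand-side length).
PRODS = {
    1: ("program", 2), 2: ("stmt_list", 2), 3: ("stmt_list", 1),
    4: ("stmt", 3), 5: ("stmt", 2), 6: ("stmt", 2),
    7: ("expr", 1), 8: ("expr", 3), 9: ("term", 1), 10: ("term", 3),
    11: ("factor", 3), 12: ("factor", 1), 13: ("factor", 1),
    14: ("add_op", 1), 15: ("add_op", 1), 16: ("mult_op", 1), 17: ("mult_op", 1),
}

# Shared row fragments of Figure 2.27 (s = shift, r = reduce, b = shift and reduce).
EXPR_START = {"term": "s7", "factor": "b9", "id": "b12", "number": "b13", "(": "s8"}
ADD = {"add_op": "s10", "+": "b14", "-": "b15"}
MUL = {"mult_op": "s11", "*": "b16", "/": "b17"}
FOLLOW_STMT = ["id", "read", "write", "$$"]
FOLLOW_EXPR = FOLLOW_STMT + [")", "+", "-"]

def reduce_on(prod, symbols):
    """Build the reduce entries of one table row."""
    return {sym: f"r{prod}" for sym in symbols}

TABLE = {
    0: {"stmt_list": "s2", "stmt": "b3", "id": "s3", "read": "s1", "write": "s4"},
    1: {"id": "b5"},
    2: {"stmt": "b2", "id": "s3", "read": "s1", "write": "s4", "$$": "b1"},
    3: {":=": "s5"},
    4: {"expr": "s6", **EXPR_START},
    5: {"expr": "s9", **EXPR_START},
    6: {**ADD, **reduce_on(6, FOLLOW_STMT)},
    7: {**MUL, **reduce_on(7, FOLLOW_EXPR)},
    8: {"expr": "s12", **EXPR_START},
    9: {**ADD, **reduce_on(4, FOLLOW_STMT)},
    10: {**EXPR_START, "term": "s13"},
    11: {"factor": "b10", "id": "b12", "number": "b13", "(": "s8"},
    12: {**ADD, ")": "b11"},
    13: {**MUL, **reduce_on(8, FOLLOW_EXPR)},
}

def parse(tokens):
    """Return the production numbers used, in order, or raise SyntaxError."""
    pending = deque(tokens)          # input stream; must end with "$$"
    stack = [0]                      # states only; symbols are not needed
    used = []
    while True:
        state, cur = stack[-1], pending[0]
        if state == 0 and cur == "program":
            return used              # success
        action = TABLE[state].get(cur)
        if action is None:
            raise SyntaxError(f"unexpected {cur!r} in state {state}")
        kind, number = action[0], int(action[1:])
        if kind == "s":              # shift: consume symbol, enter new state
            pending.popleft()
            stack.append(number)
            continue
        lhs, rhs_len = PRODS[number]
        if kind == "b":              # shift and reduce: last symbol never gets a state
            pending.popleft()
            rhs_len -= 1
        if rhs_len:
            del stack[-rhs_len:]     # pop the states of the right-hand side
        pending.appendleft(lhs)      # push the left-hand side on the input
        used.append(number)
```

Calling parse on the token sequence of the sum-and-average program reproduces the sequence of reductions listed in the Comment column of Figure 2.29.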
2.3.4
Syntax Errors
EXAMPLE 2.42
A syntax error in C
Suppose we are parsing a C program and see the following code fragment in a
context where a statement is expected:
A = B : C + D;
We will detect a syntax error immediately after the B , when the colon appears
from the scanner. At this point the simplest thing to do is just to print an error
message and halt. This naive approach is generally not acceptable, however: it
would mean that every run of the compiler reveals no more than one syntax error.
Since most programs, at least at first, contain numerous such errors, we really need
to find as many as possible now (we’d also like to continue looking for semantic
errors). To do so, we must modify the state of the parser and/or the input stream
so that the upcoming token(s) are acceptable. We shall probably want to turn off
code generation, disabling the back end of the compiler: since the input is not a
valid program, the code will not be of use, and there’s no point in spending time
creating it.
In general, the term syntax error recovery is applied to any technique that allows
the compiler, in the face of a syntax error, to continue looking for other errors later
in the program. High-quality syntax error recovery is essential in any production-quality compiler. The better the recovery technique, the more likely the compiler
will be to recognize additional errors (especially nearby errors) correctly, and the
less likely it will be to become confused and announce spurious cascading errors
later in the program.
IN MORE DEPTH
On the PLP CD we explore several possible approaches to syntax error recovery. In
panic mode, the compiler writer defines a small set of “safe symbols” that delimit
clean points in the input. Semicolons, which typically end a statement, are a
good choice in many languages. When an error occurs, the compiler deletes input
tokens until it finds a safe symbol, and then “backs the parser out” (e.g., returns
from recursive descent subroutines) until it finds a context in which that symbol
might appear. Phrase-level recovery improves on this technique by employing
different sets of “safe” symbols in different productions of the grammar (right
parentheses when in an expression; semicolons when in a declaration). Context-specific look-ahead obtains additional improvements by differentiating among the
various contexts in which a given production might appear in a syntax tree. To
respond gracefully to certain common programming errors, the compiler writer
may augment the grammar with error productions that capture language-specific
idioms that are incorrect but are often written by mistake.
Niklaus Wirth published an elegant implementation of phrase-level and
context-specific recovery for recursive descent parsers in 1976 [Wir76, Sec. 5.9].
Exceptions (to be discussed further in Section 8.5) provide a simpler alternative if
supported by the language in which the compiler is written. For table-driven top-down parsers, Fischer, Milton, and Quiring published an algorithm in 1980 that
automatically implements a well-defined notion of locally least-cost syntax repair.
Locally least-cost repair is also possible in bottom-up parsers, but it is significantly
more difficult. Most bottom-up parsers rely on more straightforward phrase-level
recovery; a typical example can be found in yacc/bison.
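As a rough illustration of panic-mode recovery, the following Python sketch parses a toy statement form, id = id (op id)* ; , and treats the semicolon as the only safe symbol. The toy grammar and the names classify, parse_stmts, ParseError, and SAFE are assumptions made for this example; they are not taken from the PLP CD. An exception is used to "back the parser out" to the statement level.

```python
class ParseError(Exception):
    """Raised by the toy parser when the next token is not acceptable."""

SAFE = {";"}   # assumed set of safe symbols for this toy language

def classify(tok):
    """Map a token string to its (hypothetical) token kind."""
    if tok in {"+", "-", "*", "/"}:
        return "op"
    if tok in {"=", ";"}:
        return tok
    return "id" if tok.isidentifier() else "unknown"

def parse_stmts(tokens):
    """Parse a sequence of statements; return the list of error messages."""
    pos, errors = 0, []

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or classify(tokens[pos]) != kind:
            found = tokens[pos] if pos < len(tokens) else "end of input"
            raise ParseError(f"expected {kind}, found {found!r} at token {pos}")
        pos += 1

    def stmt():
        expect("id")
        expect("=")
        expect("id")
        while pos < len(tokens) and classify(tokens[pos]) == "op":
            expect("op")
            expect("id")
        expect(";")

    while pos < len(tokens):
        try:
            stmt()
        except ParseError as err:
            errors.append(str(err))
            # Panic mode: delete tokens up to and including the next safe symbol,
            # then resume where a new statement may begin.
            while pos < len(tokens) and tokens[pos] not in SAFE:
                pos += 1
            pos += 1
    return errors
```

On an input modeled on Example 2.42, such as A = B : C + D ; E = F ; , the sketch reports the misplaced colon once and then continues with the next statement instead of halting.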
2.4
Theoretical Foundations
Our understanding of the relative roles and computational power of scanners,
parsers, regular expressions, and context-free grammars is based on the formalisms of automata theory. In automata theory, a formal language is a set of strings
of symbols drawn from a finite alphabet. A formal language can be specified either
by a set of rules (such as regular expressions or a context-free grammar) that generates the language, or by a formal machine that accepts (recognizes) the language.
A formal machine takes strings of symbols as input and outputs either “yes” or
“no.” A machine is said to accept a language if it says “yes” to all and only those
strings that are in the language. Alternatively, a language can be defined as the set
of strings for which a particular machine says “yes.”
Formal languages can be grouped into a series of successively larger classes
known as the Chomsky hierarchy.14 Most of the classes can be characterized in
two ways: by the types of rules that can be used to generate the set of strings,
or by the type of formal machine that is capable of recognizing the language. As
we have seen, regular languages are defined by using concatenation, alternation,
and Kleene closure, and are recognized by a scanner. Context-free languages are a
proper superset of the regular languages. They are defined by using concatenation,
alternation, and recursion (which subsumes Kleene closure), and are recognized
by a parser. A scanner is a concrete realization of a finite automaton, a type of
formal machine. A parser is a concrete realization of a push-down automaton.
Just as context-free grammars add recursion to regular expressions, push-down
automata add a stack to the memory of a finite automaton. There are additional
levels in the Chomsky hierarchy, but they are less directly applicable to compiler
construction, and are not covered here.
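The difference between the two kinds of machine can be sketched in a few lines of Python. The first function below is a hand-coded finite automaton for letter ( letter | digit )* ; it remembers nothing but its current state. The second recognizes balanced parentheses and brackets, a context-free language that is not regular, by adding a stack. Both function names are hypothetical and chosen only for this illustration.

```python
def dfa_accepts_identifier(s):
    """Finite automaton: two states, no auxiliary memory."""
    state = "start"
    for ch in s:
        if state == "start" and ch.isalpha():
            state = "ident"
        elif state == "ident" and ch.isalnum():
            state = "ident"
        else:
            return False             # no transition: reject
    return state == "ident"          # "ident" is the only final state

def pda_accepts_balanced(s):
    """Push-down automaton: the same kind of loop, plus a stack."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in ("(", "["):
            stack.append(ch)         # remember the unmatched opener
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False         # mismatched or unexpected closer
        else:
            return False             # symbol outside the alphabet
    return not stack                 # accept only if every opener was matched
```

No fixed number of states suffices for the second language, because the nesting depth is unbounded; the stack supplies exactly the memory that recursion in the grammar requires.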
14 Noam Chomsky (1928–), a linguist and social philosopher at the Massachusetts Institute of
Technology, developed much of the early theory of formal languages.
It can be proven, constructively, that regular expressions and finite automata
are equivalent: one can construct a finite automaton that accepts the language
defined by a given regular expression, and vice versa. Similarly, it is possible to
construct a push-down automaton that accepts the language defined by a given
context-free grammar, and vice versa. The grammar-to-automaton constructions
are in fact performed by scanner and parser generators such as lex and yacc .
Of course, a real scanner does not accept just one token; it is called in a loop so
that it keeps accepting tokens repeatedly. As noted in the sidebar on page 60, this
detail is accommodated by having the scanner accept the alternation of all the
tokens in the language (with distinguished final states), and by having it continue
to consume characters until the current token cannot be extended any further.
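The loop-and-longest-match behavior can be sketched as follows. This is an illustration under stated assumptions, not the construction a scanner generator actually performs: instead of building one automaton for the alternation of all tokens, it tries a separate regular expression for each token kind at the current position and keeps the longest match. The token set and the names TOKEN_PATTERNS and scan are hypothetical.

```python
import re

# Assumed token kinds, loosely modeled on the calculator language.
TOKEN_PATTERNS = [
    ("number", re.compile(r"\d+(?:\.\d+)?")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("assign", re.compile(r":=")),
    ("op", re.compile(r"[-+*/()]")),
]

def scan(text):
    """Return a list of (kind, lexeme) pairs using the longest-match rule."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():      # white space separates tokens
            pos += 1
            continue
        best_kind, best_end = None, pos
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        if best_kind is None:
            raise ValueError(f"lexical error at position {pos}")
        tokens.append((best_kind, text[pos:best_end]))
        pos = best_end               # resume after the token just accepted
    return tokens
```

The kind recorded with each lexeme plays the role of the distinguished final state: it tells the caller which of the alternatives was recognized.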
IN MORE DEPTH
On the PLP CD we consider finite and pushdown automata in more detail. We give
an algorithm to convert a DFA into an equivalent regular expression. Combined
with the constructions in Section 2.2.1, this algorithm demonstrates the equivalence of regular expressions and finite automata. We also consider the sets of
grammars and languages that can and cannot be parsed by the various linear-time
parsing algorithms.
2.5
Summary and Concluding Remarks
In this chapter we have introduced the formalisms of regular expressions and
context-free grammars, and the algorithms that underlie scanning and parsing
in practical compilers. We also mentioned syntax error recovery, and presented
a quick overview of relevant parts of automata theory. Regular expressions and
context-free grammars are language generators: they specify how to construct valid
strings of characters or tokens. Scanners and parsers are language recognizers:
they indicate whether a given string is valid. The principal job of the scanner is
to reduce the quantity of information that must be processed by the parser, by
grouping characters together into tokens, and by removing comments and white
space. Scanner and parser generators automatically translate regular expressions
and context-free grammars into scanners and parsers.
Practical parsers for programming languages (parsers that run in linear time)
fall into two principal groups: top-down (also called LL or predictive) and bottom-up (also called LR or shift-reduce). A top-down parser constructs a parse tree
starting from the root and proceeding in a left-to-right depth-first traversal. A
bottom-up parser constructs a parse tree starting from the leaves, again working
left-to-right, and combining partial trees together when it recognizes the children
of an internal node. The stack of a top-down parser contains a prediction of what
will be seen in the future; the stack of a bottom-up parser contains a record of
what has been seen in the past.
Top-down parsers tend to be simple, both in the parsing of valid strings and
in the recovery from errors in invalid strings. Bottom-up parsers are more powerful, and in some cases lend themselves to more intuitively structured grammars,
though they suffer from the inability to embed action routines at arbitrary points
in a right-hand side (we discuss this point in more detail in Section 4.5.1). Both
varieties of parser are used in real compilers, though bottom-up parsers are more
common. Top-down parsers tend to be smaller in terms of code and data size, but
modern machines provide ample memory for either.
Both scanners and parsers can be built by hand if an automatic tool is
not available. Handbuilt scanners are simple enough to be relatively common.
Handbuilt parsers are generally limited to top-down recursive descent, and
are most commonly used for comparatively simple languages (e.g., Pascal but
not Ada). Automatic generation of the scanner and parser has the advantage
of increased reliability, reduced development time, and easy modification and
enhancement.
Various features of language design can have a major impact on the complexity
of syntax analysis. In many cases, features that make it difficult for a compiler to
scan or parse also make it difficult for a human being to write correct, maintainable
code. Examples include the lexical structure of Fortran and the if . . . then . . .
else statement of languages like Pascal. This interplay among language design,
implementation, and use will be a recurring theme throughout the remainder of
the book.
2.6
Exercises
2.1 Write regular expressions to capture the following.
(a) Strings in C. These are delimited by double quotes ( " ), and may not contain newline characters. They may contain double-quote or backslash
characters if and only if those characters are “escaped” by a preceding
backslash. You may find it helpful to introduce shorthand notation to
represent any character that is not a member of a small specified set.
(b) Comments in Pascal. These are delimited by (* and *) or by { and }.
(c) Numeric constants in C. These are octal, decimal, or hexadecimal integers, or decimal or hexadecimal floating-point values. An octal integer
begins with 0 , and may contain only the digits 0 – 7 . A hexadecimal integer begins with 0x or 0X , and may contain the digits 0 – 9 and a / A – f / F .
A decimal floating-point value has a fractional portion (beginning with
a dot) or an exponent (beginning with E or e ). Unlike a decimal integer,
it is allowed to start with 0 . A hexadecimal floating-point value has an
optional fractional portion and a mandatory exponent (beginning with
P or p ). In either decimal or hexadecimal, there may be digits to the left
of the dot, the right of the dot, or both, and the exponent itself is given in
decimal, with an optional leading + or - sign. An integer may end with
an optional U or u (indicating “unsigned”), and/or L or l (indicating
“long”) or LL or ll (indicating “long long”). A floating-point value may
end with an optional F or f (indicating “float”—single precision) or L
or l (indicating “long”—double precision).
(d) Floating-point constants in Ada. These match the definition of real in
Example 2.3 (page 44), except that (1) a digit is required on both sides
of the decimal point, (2) an underscore is permitted between digits,
and (3) an alternative numeric base may be specified by surrounding
the nonexponent part of the number with pound signs, preceded by a
base in decimal (e.g., 16#6.a7#e+2 ). In this latter case, the letters a . . f
(both upper- and lowercase) are permitted as digits. Use of these letters
in an inappropriate (e.g., decimal) number is an error, but need not be
caught by the scanner.
(e) Inexact constants in Scheme. Scheme allows real numbers to be explicitly
inexact (imprecise). A programmer who wants to express all constants
using the same number of characters can use sharp signs ( # ) in place
of any lower-significance digits whose values are not known. A base-10
constant without exponent consists of one or more digits followed by
zero or more sharp signs. An optional decimal point can be placed at the
beginning, the end, or anywhere in-between. (For the record, numbers
in Scheme are actually a good bit more complicated than this. For the
purposes of this exercise, please ignore anything you may know about
sign, exponent, radix, exactness and length specifiers, and complex or
rational values.)
(f) Financial quantities in American notation. These have a leading dollar
sign ( $ ), an optional string of asterisks ( * —used on checks to discourage fraud), a string of decimal digits, and an optional fractional part
consisting of a decimal point ( . ) and two decimal digits. The string of
digits to the left of the decimal point may consist of a single zero ( 0 ).
Otherwise it must not start with a zero. If there are more than three
digits to the left of the decimal point, groups of three (counting from
the right) must be separated by commas ( , ). Example: $**2,345.67 .
(Feel free to use “productions” to define abbreviations, so long as the
language remains regular.)
2.2 Show (as “circles-and-arrows” diagrams) the finite automata for Exercise 2.1.
2.3 Build a regular expression that captures all nonempty sequences of letters other than file , for , and from . For notational convenience, you may
assume the existence of a not operator that takes a set of letters as argument
and matches any other letter. Comment on the practicality of constructing
a regular expression for all sequences of letters other than the keywords of a
large programming language.
2.4 (a) Show the NFA that results from applying the construction of Figure 2.7
to the regular expression letter ( letter | digit )* .
(b) Apply the transformation illustrated by Example 2.14 to create an equivalent DFA.
(c) Apply the transformation illustrated by Example 2.15 to minimize the
DFA.
2.5 Starting with the regular expressions for integer and decimal in Example 2.3,
construct an equivalent NFA, the set-of-subsets DFA, and the minimal
equivalent DFA. Be sure to keep separate the final states for the two different kinds of token (see the sidebar on page 60). You will find the exercise
easier if you undertake it by modifying the machines in Examples 2.13
through 2.15.
2.6 Build an ad hoc scanner for the calculator language. As output, have it print
a list, in order, of the input tokens. For simplicity, feel free to simply halt in
the event of a lexical error.
2.7 Write a program in your favorite scripting language to remove comments
from programs in the calculator language (Example 2.9).
2.8 Build a nested- case -statements finite automaton that converts all letters in
its input to lowercase, except within Pascal-style comments and strings. A
Pascal comment is delimited by { and }, or by (* and *) . Comments do
not nest. A Pascal string is delimited by single quotes ( ’ . . . ’ ). A quote
character can be placed in a string by doubling it ( ’Madam, I’’m Adam.’ ).
This upper-to-lower mapping can be useful if feeding a program written
in standard Pascal (which ignores case) to a compiler that considers upper- and lowercase letters to be distinct.
2.9 (a) Describe in English the language defined by the regular expression a*
( b a* b a* )* . Your description should be a high-level characterization, one that would still make sense if we were using a different regular
expression for the same language.
(b) Write an unambiguous context-free grammar that generates the same
language.
(c) Using your grammar from part (b), give a canonical (rightmost) derivation of the string b a a b a a a b b .
2.10 Give an example of a grammar that captures right associativity for an
exponentiation operator (e.g., ** in Fortran).
2.11 Prove that the following grammar is LL(1):
    decl −→ ID decl_tail
    decl_tail −→ , decl
              −→ : ID ;
(The final ID is meant to be a type name.)
2.12 Consider the following grammar:
    G −→ S $$
    S −→ A M
    M −→ S | ε
    A −→ a E | b A A
    E −→ a B | b A | ε
    B −→ b E | a B B
(a) Describe in English the language that the grammar generates.
(b) Show a parse tree for the string a b a a .
(c) Is the grammar LL(1)? If so, show the parse table; if not, identify a
prediction conflict.
2.13 Consider the following grammar:
    stmt −→ assignment
         −→ subr_call
    assignment −→ id := expr
    subr_call −→ id ( arg_list )
    expr −→ primary expr_tail
    expr_tail −→ op expr
              −→ ε
    primary −→ id
            −→ subr_call
            −→ ( expr )
    op −→ + | - | * | /
    arg_list −→ expr args_tail
    args_tail −→ , arg_list
              −→ ε
(a) Construct a parse tree for the input string foo(a, b) .
(b) Give a canonical (rightmost) derivation of this same string.
(c) Prove that the grammar is not LL(1).
(d) Modify the grammar so that it is LL(1).
2.14 Consider the language consisting of all strings of properly balanced parentheses and brackets.
(a) Give LL(1) and SLR(1) grammars for this language.
(b) Give the corresponding LL(1) and SLR(1) parsing tables.
(c) For each grammar, show the parse tree for ([]([]))[](()) .
(d) Give a trace of the actions of the parsers in constructing these trees.
2.15 Consider the following context-free grammar:
    G −→ G B
      −→ G N
      −→ ε
    B −→ ( E )
    E −→ E ( E )
      −→ ε
    N −→ ( L ]
    L −→ L E
      −→ L (
      −→ ε
(a) Describe, in English, the language generated by this grammar. (Hint: B
stands for “balanced”; N stands for “nonbalanced”.) (Your description
should be a high-level characterization of the language, one that is
independent of the particular grammar chosen.)
(b) Give a parse tree for the string (( ]( ) .
(c) Give a canonical (rightmost) derivation of this same string.
(d) What is FIRST(E) in our grammar? What is FOLLOW(E)? (Recall that
FIRST and FOLLOW sets are defined for symbols in an arbitrary CFG,
regardless of parsing algorithm.)
(e) Given its use of left recursion, our grammar is clearly not LL(1). Does
this language have an LL(1) grammar? Explain.
2.16 Give a grammar that captures all levels of precedence for arithmetic expressions in C, as shown in Figure 6.1 (page 223). (Hint: This exercise is somewhat tedious. You’ll probably want to attack it with a text editor rather than a
pencil.)
2.17 Extend the grammar of Figure 2.24 to include if statements and while
loops, along the lines suggested by the following examples:
abs := n
if n < 0 then abs := 0 - abs fi
sum := 0
read count
while count > 0 do
read n
sum := sum + n
count := count - 1
od
write sum
Your grammar should support the six standard comparison operations
in conditions, with arbitrary expressions as operands. It also should allow
an arbitrary number of statements in the body of an if or while
statement.
2.18 Consider the following LL(1) grammar for a simplified subset of Lisp:
    P −→ E $$
    E −→ atom
      −→ ’ E
      −→ ( E Es )
    Es −→ E Es
       −→ ε
(a) What is FIRST(Es)? FOLLOW(E)? PREDICT(Es −→ ε)?
(b) Give a parse tree for the string (cdr ’(a b c)) $$ .
(c) Show the leftmost derivation of (cdr ’(a b c)) $$ .
(d) Show a trace, in the style of Figure 2.20, of a table-driven top-down
parse of this same input.
(e) Now consider a recursive descent parser running on the same input.
At the point where the quote token ( ’ ) is matched, which recursive
descent routines will be active (i.e., what routines will have a frame on
the parser’s run-time stack)?
2.19 Write top-down and bottom-up grammars for the language consisting of
all well-formed regular expressions. Arrange for all operators to be left-associative. Give Kleene closure the highest precedence and alternation the
lowest precedence.
2.20 Suppose that the expression grammar in Example 2.8 were to be used in
conjunction with a scanner that did not remove comments from the input,
but rather returned them as tokens. How would the grammar need to be
modified to allow comments to appear at arbitrary places in the input?
2.21 Build a complete recursive descent parser for the calculator language. As
output, have it print a trace of its matches and predictions.
2.22 Flesh out the details of an algorithm to eliminate left recursion and common
prefixes in an arbitrary context-free grammar.
2.23 In some languages an assignment can appear in any context in which an
expression is expected: the value of the expression is the right-hand side
of the assignment, which is placed into the left-hand side as a side effect.
Consider the following grammar fragment for such a language. Explain why
it is not LL(1), and discuss what might be done to make it so.
    expr −→ id := expr
         −→ term term_tail
    term_tail −→ + term term_tail | ε
    term −→ factor factor_tail
    factor_tail −→ * factor factor_tail | ε
    factor −→ ( expr ) | id
2.24 Construct the CFSM for the id list grammar in Example 2.20 (page 67) and
verify that it can be parsed bottom-up with zero tokens of look-ahead.
2.25 Modify the grammar in Exercise 2.24 to allow an id list to be empty. Is the
grammar still LR(0)?
2.26 Consider the following grammar for a declaration list:
decl list −→ decl list decl ; | decl ;
decl −→ id : type
type −→ int | real | char
−→ array const .. const of type
−→ record decl list end
Construct the CFSM for this grammar. Use it to trace out a parse (as in
Figure 2.29) for the following input program:
foo : record
a : char;
b : array 1 .. 2 of real;
end;
2.27 The dangling else problem of Pascal is not shared by Algol 60. To avoid
ambiguity regarding which then is matched by an else , Algol 60 prohibits
if statements immediately inside a then clause. The Pascal fragment
if C1 then if C2 then S1 else S2
must be written as either
if C1 then begin if C2 then S1 end else S2
or
if C1 then begin if C2 then S1 else S2 end
in Algol 60. Show how to write a grammar for conditional statements that
enforces this rule. (Hint: you will want to distinguish in your grammar
between conditional statements and nonconditional statements; some contexts will accept either, some only the latter.)
2.28–2.32 In More Depth.
2.7
Explorations
2.33 Some languages (e.g., C) distinguish between upper- and lowercase letters
in identifiers. Others (e.g., Ada) do not. Which convention do you prefer?
Why?
2.34 The syntax for type casts in C and its descendants introduces potential
ambiguity: is (x)-y a subtraction, or the unary negation of y , cast to type
x ? Find out how C, C++, Java, and C# answer this question. Discuss how
you would implement the answer(s).
2.35 What do you think of Haskell, Occam, and Python’s use of indentation to
delimit control constructs (Section 2.1.1)? Would you expect this convention
to make program construction and maintenance easier or harder? Why?
2.36 Skip ahead to Section 13.4.2 and learn about the “regular expressions” used in
scripting languages, editors, search tools, and so on. Are these really regular?
What can they express that cannot be expressed in the notation introduced
in Section 2.1.1?
2.37 Rebuild the automaton of Exercise 2.8 using lex/flex.
2.38 Find a manual for yacc/bison, or consult a compiler textbook [ALSU07,
Secs. 4.8.1 and 4.9.2] to learn about operator precedence parsing. Explain how
it could be used to simplify the grammar of Exercise 2.16.
2.39 Use lex/flex and yacc/bison to construct a parser for the calculator language. Have it output a trace of its shifts and reductions.
2.40 Repeat the previous exercise using ANTLR.
2.41–2.42 In More Depth.
2.8
Bibliographic Notes
Our coverage of scanning and parsing in this chapter has of necessity been brief.
Considerably more detail can be found in texts on parsing theory [AU72] and
compiler construction [ALSU07, FL88, App97, GBJL01, CT04]. Many compilers
of the early 1960s employed recursive descent parsers. Lewis and Stearns [LS68]
and Rosenkrantz and Stearns [RS70] published early formal studies of LL grammars and parsing. The original formulation of LR parsing is due to Knuth [Knu65].
Bottom-up parsing became practical with DeRemer’s discovery of the SLR and
LALR algorithms [DeR71]. W. L. Johnson et al. [JPAR68] describe an early scanner generator. The Unix lex tool is due to Lesk [Les75]. Yacc is due to S. C.
Johnson [Joh75].
Further details on formal language theory can be found in a variety of textbooks,
including those of Hopcroft, Motwani, and Ullman [HMU01] and Sipser [Sip97].
Kleene [Kle56] and Rabin and Scott [RS59] proved the equivalence of regular
expressions and finite automata.15 The proof that finite automata are unable to
recognize nested constructs is based on a theorem known as the pumping lemma,
due to Bar-Hillel, Perles, and Shamir [BHPS61]. Context-free grammars were
first explored by Chomsky [Cho56] in the context of natural language. Independently, Backus and Naur developed BNF for the syntactic description of
Algol 60 [NBB+63]. Ginsburg and Rice [GR62] recognized the equivalence of
the two notations. Chomsky [Cho62] and Evey [Eve63] demonstrated the equivalence of context-free grammars and push-down automata.
15 Dana Scott (1932–), Professor Emeritus at Carnegie Mellon University, is known principally
for inventing domain theory and launching the field of denotational semantics, which provides
a mathematically rigorous way to formalize the meaning of programming languages. Michael
Rabin (1931–), of Harvard University, has made seminal contributions to the concepts of nondeterminism and randomization in computer science. Scott and Rabin shared the ACM Turing
Award in 1976.
Fischer and LeBlanc’s text [FL88] contains an excellent survey of error recovery
and repair techniques, with references to other work. The phrase-level recovery
mechanism for recursive descent parsers described in Section 2.3.4 is due to
Wirth [Wir76, Sec. 5.9]. The locally least-cost recovery mechanism for table-driven LL parsers described in Section 2.3.4 is due to Fischer, Milton, and
Quiring [FMQ80]. Dion published a locally least-cost bottom-up repair algorithm
in 1978 [Dio78]. It is quite complex, and requires very large precomputed tables.
McKenzie, Yeatman, and De Vere subsequently showed how to effect the same
repairs without the precomputed tables, at a higher but still acceptable cost in
time [MYD95].
3
Names, Scopes, and Bindings
“High-level” programming languages take their name from the relatively
high level, or degree of abstraction, of the features they provide, relative to those
of the assembly languages they were originally designed to replace. The adjective “abstract,” in this context, refers to the degree to which language features
are separated from the details of any particular computer architecture. The early
development of languages like Fortran, Algol, and Lisp was driven by a pair of complementary goals: machine independence and ease of programming. By abstracting the language away from the hardware, designers not only made it possible to
write programs that would run well on a wide variety of machines, but also made
the programs easier for human beings to understand.
Machine independence is a fairly simple concept. Basically it says that a programming language should not rely on the features of any particular instruction
set for its efficient implementation. Machine dependences still become a problem
from time to time (standards committees for C, for example, are still debating
how to accommodate multiprocessors with relaxed memory consistency), but
with a few noteworthy exceptions (Java comes to mind) it has probably been 35
years since the desire for greater machine independence has really driven language
design. Ease of programming, on the other hand, is a much more elusive and
compelling goal. It affects every aspect of language design, and has historically
been less a matter of science than of aesthetics and trial and error.
This chapter is the first of five to address core issues in language design. The
others are Chapters 6 through 9. In Chapter 6 we will look at control-flow constructs, which allow the programmer to specify the order in which operations are
to occur. In contrast to the jump-based control flow of assembly languages, high-level control flow relies heavily on the lexical nesting of constructs. In Chapter 7
we will look at types, which allow the programmer to organize program data and
the operations on them. In Chapters 8 and 9 we will look at subroutines and
classes. In the current chapter we will look at names.
Programming Language Pragmatics. DOI: 10.1016/B978-0-12-374514-9.00012-4
Copyright © 2009 by Elsevier Inc. All rights reserved.
A name is a mnemonic character string used to represent something else.
Names in most languages are identifiers (alphanumeric tokens), though certain
other symbols, such as + or := , can also be names. Names allow us to refer to
variables, constants, operations, types, and so on using symbolic identifiers rather
than low-level concepts like addresses. Names are also essential in the context of
a second meaning of the word abstraction. In this second meaning, abstraction is
a process by which the programmer associates a name with a potentially complicated program fragment, which can then be thought of in terms of its purpose
or function, rather than in terms of how that function is achieved. By hiding
irrelevant details, abstraction reduces conceptual complexity, making it possible
for the programmer to focus on a manageable subset of the program text at any
particular time. Subroutines are control abstractions: they allow the programmer
to hide arbitrarily complicated code behind a simple interface. Classes are data
abstractions: they allow the programmer to hide data representation details behind
a (comparatively) simple set of operations.
We will look at several major issues related to names. Section 3.1 introduces the
notion of binding time, which refers not only to the binding of a name to the thing
it represents, but also in general to the notion of resolving any design decision
in a language implementation. Section 3.2 outlines the various mechanisms used
to allocate and deallocate storage space for objects, and distinguishes between
the lifetime of an object and the lifetime of a binding of a name to that object.1
Most name-to-object bindings are usable only within a limited region of a given
high-level program. Section 3.3 explores the scope rules that define this region;
Section 3.4 (mostly on the PLP CD) considers their implementation.
The complete set of bindings in effect at a given point in a program is known as
the current referencing environment. Section 3.6 expands on the notion of scope
rules by considering the ways in which a referencing environment may be bound
to a subroutine that is passed as a parameter, returned from a function, or stored in
a variable. Section 3.5 discusses aliasing, in which more than one name may refer
to a given object in a given scope; overloading, in which a name may refer to more
than one object in a given scope, depending on the context of the reference; and
polymorphism, in which a single object may have more than one type, depending
on context or execution history. Section 3.7 discusses macro expansion, which
can introduce new names via textual substitution, sometimes in ways that are at
odds with the rest of the language. Finally, Section 3.8 (mostly on the PLP CD)
discusses separate compilation.
1 For want of a better term, we will use the term “object” throughout Chapters 3–8 to refer to
anything that might have a name: variables, constants, types, subroutines, modules, and others. In
many modern languages “object” has a more formal meaning, which we will consider in Chapter 9.
3.1
The Notion of Binding Time
A binding is an association between two things, such as a name and the thing it
names. Binding time is the time at which a binding is created or, more generally,
the time at which any implementation decision is made (we can think of this
as binding an answer to a question). There are many different times at which
decisions may be bound:
Language design time: In most languages, the control flow constructs, the set of
fundamental (primitive) types, the available constructors for creating complex
types, and many other aspects of language semantics are chosen when the
language is designed.
Language implementation time: Most language manuals leave a variety of issues
to the discretion of the language implementor. Typical (though by no means
universal) examples include the precision (number of bits) of the fundamental
types, the coupling of I/O to the operating system’s notion of files, the organization and maximum sizes of stack and heap, and the handling of run-time
exceptions such as arithmetic overflow.
Program writing time: Programmers, of course, choose algorithms, data structures, and names.
Compile time: Compilers choose the mapping of high-level constructs to machine code, including the layout of statically defined data in memory.
Link time: Since most compilers support separate compilation—compiling different modules of a program at different times—and depend on the availability
of a library of standard subroutines, a program is usually not complete until
the various modules are joined together by a linker. The linker chooses the
overall layout of the modules with respect to one another, and resolves intermodule references. When a name in one module refers to an object in another
module, the binding between the two is not finalized until link time.
Load time: Load time refers to the point at which the operating system loads
the program into memory so that it can run. In primitive operating systems,
the choice of machine addresses for objects within the program was not finalized until load time. Most modern operating systems distinguish between virtual and physical addresses. Virtual addresses are chosen at link time; physical
addresses can actually change at run time. The processor’s memory management hardware translates virtual addresses into physical addresses during each
individual instruction at run time.
Run time: Run time is actually a very broad term that covers the entire span from
the beginning to the end of execution. Bindings of values to variables occur at
run time, as do a host of other decisions that vary from language to language.
Run time subsumes program start-up time, module entry time, elaboration
time (the point at which a declaration is first “seen”), subroutine call time,
block entry time, and statement execution time.
DESIGN & IMPLEMENTATION
Binding time
It is difficult to overemphasize the importance of binding times in the design
and implementation of programming languages. In general, early binding times
are associated with greater efficiency, while later binding times are associated
with greater flexibility. The tension between these goals provides a recurring
theme for later chapters of this book.
The terms static and dynamic are generally used to refer to things bound before
run time and at run time, respectively. Clearly “static” is a coarse term. So is
“dynamic.”
Compiler-based language implementations tend to be more efficient than
interpreter-based implementations because they make earlier decisions. For example, a compiler analyzes the syntax and semantics of global variable declarations
once, before the program ever runs. It decides on a layout for those variables
in memory and generates efficient code to access them wherever they appear in
the program. A pure interpreter, by contrast, must analyze the declarations every
time the program begins execution. In the worst case, an interpreter may reanalyze
the local declarations within a subroutine each time that subroutine is called. If
a call appears in a deeply nested loop, the savings achieved by a compiler that is
able to analyze the declarations only once may be very large. As we shall see in
the following section, a compiler will not usually be able to predict the address
of a local variable at compile time, since space for the variable will be allocated
dynamically on a stack, but it can arrange for the variable to appear at a fixed
offset from the location pointed to by a certain register at run time.
Some languages are difficult to compile because their definitions require fundamental decisions to be postponed until run time, generally in order to increase
the flexibility or expressiveness of the language. Smalltalk, for example, delays all
type checking until run time. All operations in Smalltalk are cast in the form of
“messages” to “objects.” A message is acceptable if and only if the object provides
a handler for it. References to objects of arbitrary types (classes) can then be
assigned into arbitrary named variables, as long as the program never ends up
sending a message to an object that is not prepared to handle it. This form of
polymorphism—allowing a variable name to refer to objects of multiple types—
allows the Smalltalk programmer to write very general-purpose code, which will
correctly manipulate objects whose types had yet to be fully defined at the time
the code was written. We will discuss polymorphism further in Chapters 7 and 9.
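The flavor of this run-time checking can be suggested with a small sketch in Python, which, like Smalltalk, defers the question of whether an object can handle a given operation until the operation is actually attempted. The class and function names below are hypothetical, chosen only for illustration.

```python
class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


def total_area(shapes):
    # No types are declared: the binding of "area" to a particular method
    # is resolved separately for each object, when the call executes.
    return sum(shape.area() for shape in shapes)


def try_area(thing):
    # An object that provides no handler for the "message" is detected
    # only at run time, when the call is attempted.
    try:
        return thing.area()
    except AttributeError:
        return None
```

Here total_area works for any mix of objects that respond to area, including classes written long after total_area itself, while try_area("text") discovers at run time that a string has no such handler.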
3.2
Object Lifetime and Storage Management
In any discussion of names and bindings, it is important to distinguish between
names and the objects to which they refer, and to identify several key events:
Creation of objects
Creation of bindings
References to variables, subroutines, types, and so on, all of which use bindings
Deactivation and reactivation of bindings that may be temporarily unusable
Destruction of bindings
Destruction of objects
The period of time between the creation and the destruction of a name-to-object binding is called the binding’s lifetime. Similarly, the time between the
creation and destruction of an object is the object’s lifetime. These lifetimes need
not necessarily coincide. In particular, an object may retain its value and the
potential to be accessed even when a given name can no longer be used to access
it. When a variable is passed to a subroutine by reference, for example (as it
typically is in Fortran or with var parameters in Pascal or ‘ & ’ parameters in C++),
the binding between the parameter name and the variable that was passed has a
lifetime shorter than that of the variable itself. It is also possible, though generally
a sign of a program bug, for a name-to-object binding to have a lifetime longer
than that of the object. This can happen, for example, if an object created via the
C++ new operator is passed as a & parameter and then deallocated ( delete -ed)
before the subroutine returns. A binding to an object that is no longer live is called
a dangling reference. Dangling references will be discussed further in Sections 3.6
and 7.7.2.
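The distinction between the lifetime of a binding and the lifetime of an object can be seen in a few lines of Python. This is only a sketch, with hypothetical names; Python passes object references, so the parameter binding behaves much like the by-reference case described above. Because Python reclaims objects automatically, the sketch cannot show a dangling reference, only the harmless case in which the object outlives a binding.

```python
def append_marker(items):
    # The name "items" is bound to the caller's list only while this call runs.
    items.append("marker")


def demo():
    log = []              # object created; binding of "log" to it created
    append_marker(log)    # binding of "items" created, used, and destroyed
    alias = log           # a second binding to the same object
    del log               # destroys one binding; the object itself survives
    return alias
```

The list returned by demo has outlived both the parameter binding items and the local binding log.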
Object lifetimes generally correspond to one of three principal storage allocation
mechanisms, used to manage the object’s space:
1. Static objects are given an absolute address that is retained throughout the
program’s execution.
2. Stack objects are allocated and deallocated in last-in, first-out order, usually in
conjunction with subroutine calls and returns.
3. Heap objects may be allocated and deallocated at arbitrary times. They require
a more general (and expensive) storage management algorithm.
3.2.1
Static Allocation
Global variables are the obvious example of static objects, but not the only one.
The instructions that constitute a program’s machine language translation can
also be thought of as statically allocated objects. In addition, we shall see examples in Section 3.3.1 of variables that are local to a single subroutine, but retain
their values from one invocation to the next; their space is statically allocated.
Numeric and string-valued constant literals are also statically allocated, for statements such as A = B/14.7 or printf("hello, world\n") . (Small constants
are often stored within the instruction itself; larger ones are assigned a separate
location.) Finally, most compilers produce a variety of tables that are used by run-time support routines for debugging, dynamic-type checking, garbage collection,
exception handling, and other purposes; these are also statically allocated. Statically allocated objects whose value should not change during program execution
(e.g., instructions, constants, and certain run-time tables) are often allocated in
protected, read-only memory, so that any inadvertent attempt to write to them
will cause a processor interrupt, allowing the operating system to announce a
run-time error.
EXAMPLE 3.1 Static allocation of local variables
Logically speaking, local variables are created when their subroutine is called,
and destroyed when it returns. If the subroutine is called repeatedly, each invocation is said to create and destroy a separate instance of each local variable. It is
not always the case, however, that a language implementation must perform work
at run time corresponding to these create and destroy operations. Recursion was
not originally supported in Fortran (it was added in Fortran 90). As a result, there
can never be more than one invocation of a subroutine active in an older Fortran
program at any given time, and a compiler may choose to use static allocation for
local variables, effectively arranging for the variables of different invocations to
share the same locations, and thereby avoiding any run-time overhead for creation
and destruction.
In many languages a named constant is required to have a value that can be
determined at compile time. Usually the expression that specifies the constant’s
value is permitted to include only other known constants and built-in functions
and arithmetic operators. Named constants of this sort, together with constant
literals, are sometimes called manifest constants or compile-time constants. Manifest
constants can always be allocated statically, even if they are local to a recursive
subroutine: multiple instances can share the same location.
In other languages (e.g., C and Ada), constants are simply variables that cannot be changed after elaboration time. Their values, though unchanging, can
sometimes depend on other values that are not known until run time. These
elaboration-time constants, when local to a recursive subroutine, must be allocated
on the stack. C# provides both options, explicitly, with the const and readonly
keywords.
Along with local variables and elaboration-time constants, the compiler typically stores a variety of other information associated with the subroutine,
including:
Arguments and return values. Modern compilers keep these in registers whenever
possible, but sometimes space in memory is needed.
Temporaries. These are usually intermediate values produced in complex calculations. Again, a good compiler will keep them in registers whenever possible.
Bookkeeping information. This may include the subroutine’s return address, a
reference to the stack frame of the caller (also called the dynamic link), additional saved registers, debugging information, and various other values that we
will study later.
DESIGN & IMPLEMENTATION
Recursion in Fortran
The lack of recursion in (pre-Fortran 90) Fortran is generally attributed to the
expense of stack manipulation on the IBM 704, on which the language was first
implemented. Many (perhaps most) Fortran implementations choose to use a
stack for local variables, but because the language definition permits the use
of static allocation instead, Fortran programmers were denied the benefits of
language-supported recursion for over 30 years.
3.2.2
Stack-Based Allocation
EXAMPLE 3.2 Layout of the run-time stack
If a language permits recursion, static allocation of local variables is no longer
an option, since the number of instances of a variable that may need to exist
at the same time is conceptually unbounded. Fortunately, the natural nesting of
subroutine calls makes it easy to allocate space for locals on a stack. A simplified
picture of a typical stack appears in Figure 3.1. Each instance of a subroutine at
run time has its own frame (also called an activation record) on the stack, containing arguments and return values, local variables, temporaries, and bookkeeping
information. Arguments to be passed to subsequent routines lie at the top of the
frame, where the callee can easily find them. The organization of the remaining
information is implementation-dependent: it varies from one language, machine,
and compiler to another.
Maintenance of the stack is the responsibility of the subroutine calling
sequence—the code executed by the caller immediately before and after the call—
and of the prologue (code executed at the beginning) and epilogue (code executed
at the end) of the subroutine itself. Sometimes the term “calling sequence” is used
to refer to the combined operations of the caller, the prologue, and the epilogue.
We will study calling sequences in more detail in Section 8.2.
While the location of a stack frame cannot be predicted at compile time (the
compiler cannot in general tell what other frames may already be on the stack), the
offsets of objects within a frame usually can be statically determined. Moreover,
the compiler can arrange (in the calling sequence or prologue) for a particular
register, known as the frame pointer, to always point to a known location within the
frame of the current subroutine. Code that needs to access a local variable within
the current frame, or an argument near the top of the calling frame, can do so
by adding a predetermined offset to the value in the frame pointer. As we discuss
in Section 5.3.1, almost every processor provides a displacement addressing
mechanism that allows this addition to be specified implicitly as part of an ordinary
load or store instruction. The stack grows “downward” toward lower addresses
in most language implementations. Some machines provide special push and pop
instructions that assume this direction of growth. Local variables, temporaries, and
bookkeeping information typically have negative offsets from the frame pointer.
Arguments and returns typically have positive offsets; they reside in the caller’s
frame.
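Frame-pointer-relative addressing can be simulated in a few lines of Python. The sketch below is not the layout of any particular machine or compiler: the Machine class, the frame layout (argument, return address, saved frame pointer, one local), and the offsets ARG_N and LOCAL_REST are all assumptions made for illustration. The stack grows toward lower addresses, the argument has a positive offset from the frame pointer, and the local variable has a negative one.

```python
class Machine:
    def __init__(self, words=64):
        self.mem = [0] * words
        self.sp = words   # points at the last used word; stack grows downward
        self.fp = words   # frame pointer of the current subroutine

    def push(self, value):
        self.sp -= 1
        self.mem[self.sp] = value

    def pop(self):
        value = self.mem[self.sp]
        self.sp += 1
        return value


ARG_N = 2         # argument n lives in the caller's frame: positive offset
LOCAL_REST = -1   # the local variable lives in the callee's frame: negative offset


def call_sum_to(m, n):
    # Caller's part of the calling sequence.
    m.push(n)                 # argument
    m.push(0)                 # stand-in for the return address
    result = sum_to(m)        # the result comes back "in a register"
    m.pop()                   # discard the return address
    m.pop()                   # discard the argument
    return result


def sum_to(m):
    # Prologue: save the caller's fp, establish our own, reserve one local.
    m.push(m.fp)
    m.fp = m.sp
    m.sp -= 1
    n = m.mem[m.fp + ARG_N]
    # Body: every access is a fixed offset from the frame pointer.
    m.mem[m.fp + LOCAL_REST] = 0 if n == 0 else call_sum_to(m, n - 1)
    result = m.mem[m.fp + ARG_N] + m.mem[m.fp + LOCAL_REST]
    # Epilogue: release the locals and restore the caller's fp.
    m.sp = m.fp
    m.fp = m.pop()
    return result
```

Each recursive invocation of sum_to finds its own copy of n and of its local at the same offsets, even though the frames themselves sit at different, unpredictable addresses.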
Figure 3.1
Stack-based allocation of space for subroutines. [Figure not reproduced. It shows a stack holding, from bottom to top, frames for Subroutine A, Subroutine B, Subroutine B, Subroutine C, and Subroutine D, with sp at the top of the stack, fp at the current frame, and a second fp position marked "when subroutine C is running." The fields shown within a frame are arguments to called routines, temporaries, local variables, miscellaneous bookkeeping, and the return address. The direction of stack growth is toward the top of the figure (usually lower addresses). The upper right shows the call pattern: procedure C calls D and then E; procedure B executes "if ... then B else C"; procedure A calls B; the main program calls A.] We assume here that subroutines have been called as shown in
the upper right. In particular, B has called itself once, recursively, before calling C . If D returns and C calls E , E ’s frame (activation
record) will occupy the same space previously used for D ’s frame. At any given time, the stack pointer ( sp ) register points to the
first unused location on the stack (or the last used location on some machines), and the frame pointer ( fp ) register points to a
known location within the frame of the current subroutine. The relative order of fields within a frame may vary from machine
to machine and compiler to compiler.
Even in a language without recursion, it can be advantageous to use a stack for
local variables, rather than allocating them statically. In most programs the pattern
of potential calls among subroutines does not permit all of those subroutines to
be active at the same time. As a result, the total space needed for local variables
of currently active subroutines is seldom as large as the total space across all
subroutines, active or not. A stack may therefore require substantially less memory
at run time than would be required for static allocation.
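The potential savings can be estimated from a program's call graph. The Python sketch below uses a made-up, recursion-free call graph and made-up frame sizes (both are assumptions for illustration): static allocation needs room for every frame at once, while a stack needs only enough for the most expensive chain of calls.

```python
# Hypothetical frame sizes, in words, and a hypothetical call graph.
FRAME_SIZE = {"main": 4, "A": 10, "B": 6, "C": 8, "D": 12}
CALLS = {"main": ["A", "B"], "A": ["C"], "B": ["D"], "C": [], "D": []}


def static_space(frame_size):
    # With static allocation, every subroutine's frame exists all the time.
    return sum(frame_size.values())


def max_stack_space(routine, frame_size, calls):
    # With a stack, only the frames on one call chain exist at any moment.
    # Assumes the call graph has no cycles (no recursion).
    deepest_callee = max(
        (max_stack_space(callee, frame_size, calls) for callee in calls[routine]),
        default=0,
    )
    return frame_size[routine] + deepest_callee
```

For this graph static allocation requires 40 words, whereas the deepest call chain (main, A, C, or equally main, B, D) requires only 22.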
3.2.3
Heap-Based Allocation
A heap is a region of storage in which subblocks can be allocated and deallocated
at arbitrary times.2 Heaps are required for the dynamically allocated pieces of
linked data structures, and for objects like fully general character strings, lists, and
sets, whose size may change as a result of an assignment statement or other update
operation.
2 Unfortunately, the term “heap” is also used for the common tree-based implementation of a
priority queue. These two uses of the term have nothing to do with one another.
Figure 3.2 Fragmentation. [Figure not reproduced: a heap of in-use and free blocks, with an allocation request shown alongside.] The shaded blocks are in use; the clear blocks are free. Cross-hatched
space at the ends of in-use blocks represents internal fragmentation. The discontiguous free blocks
indicate external fragmentation. While there is more than enough total free space remaining to
satisfy an allocation request of the illustrated size, no single remaining block is large enough.
EXAMPLE 3.3 External fragmentation in the heap
There are many possible strategies to manage space in a heap. We review the
major alternatives here; details can be found in any data-structures textbook. The
principal concerns are speed and space, and as usual there are tradeoffs between
them. Space concerns can be further subdivided into issues of internal and external fragmentation. Internal fragmentation occurs when a storage-management
algorithm allocates a block that is larger than required to hold a given object; the
extra space is then unused. External fragmentation occurs when the blocks that
have been assigned to active objects are scattered through the heap in such a way
that the remaining, unused space is composed of multiple blocks: there may be
quite a lot of free space, but no one piece of it may be large enough to satisfy some
future request (see Figure 3.2).
Many storage-management algorithms maintain a single linked list—the free
list —of heap blocks not currently in use. Initially the list consists of a single block
comprising the entire heap. At each allocation request the algorithm searches the
list for a block of appropriate size. With a first fit algorithm we select the first block
on the list that is large enough to satisfy the request. With a best fit algorithm we
search the entire list to find the smallest block that is large enough to satisfy the
request. In either case, if the chosen block is significantly larger than required, then
we divide it in two and return the unneeded portion to the free list as a smaller
block. (If the unneeded portion is below some minimum threshold in size, we
may leave it in the allocated block as internal fragmentation.) When a block is
deallocated and returned to the free list, we check to see whether either or both of
the physically adjacent blocks are free; if so, we coalesce them.
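The free-list scheme just described can be sketched in Python as follows. This is a toy model, not a production allocator: the heap is represented only by (start, size) pairs, and the class name, the MIN_SPLIT threshold, and the policy parameter are assumptions introduced for illustration.

```python
class FreeListHeap:
    MIN_SPLIT = 2   # smaller leftovers stay in the block (internal fragmentation)

    def __init__(self, size):
        self.free = [(0, size)]   # (start, size) pairs, kept sorted by address
        self.allocated = {}       # start -> size actually handed out

    def allocate(self, size, policy="first"):
        candidates = [(i, blk) for i, blk in enumerate(self.free) if blk[1] >= size]
        if not candidates:
            return None           # no single free block is large enough
        if policy == "best":
            # Best fit: the smallest block that is large enough.
            index, (start, block_size) = min(candidates, key=lambda c: c[1][1])
        else:
            # First fit: the first block on the list that is large enough.
            index, (start, block_size) = candidates[0]
        leftover = block_size - size
        if leftover >= self.MIN_SPLIT:
            self.free[index] = (start + size, leftover)   # split the block
            granted = size
        else:
            del self.free[index]                          # hand out the whole block
            granted = block_size
        self.allocated[start] = granted
        return start

    def deallocate(self, start):
        size = self.allocated.pop(start)
        self.free.append((start, size))
        self.free.sort()
        merged = [self.free[0]]
        for blk_start, blk_size in self.free[1:]:
            prev_start, prev_size = merged[-1]
            if prev_start + prev_size == blk_start:       # adjacent: coalesce
                merged[-1] = (prev_start, prev_size + blk_size)
            else:
                merged.append((blk_start, blk_size))
        self.free = merged
```

Given free blocks of 30, 20, and 30 words, a 15-word request is carved from the first block under first fit but from the 20-word block under best fit, which leaves the larger blocks intact at the price of a 5-word remnant.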
Intuitively, one would expect a best fit algorithm to do a better job of reserving
large blocks for large requests. At the same time, it has higher allocation cost than
a first fit algorithm, because it must always search the entire list, and it tends to
result in a larger number of very small “left-over” blocks. Which approach—first
fit or best fit—results in lower external fragmentation depends on the distribution
of size requests.
In any algorithm that maintains a single free list, the cost of allocation is linear
in the number of free blocks. To reduce this cost to a constant, some storage management algorithms maintain separate free lists for blocks of different sizes. Each
request is rounded up to the next standard size (at the cost of internal fragmentation) and allocated from the appropriate list. In effect, the heap is divided into
“pools,” one for each standard size. The division may be static or dynamic. Two
common mechanisms for dynamic pool adjustment are known as the buddy system
and the Fibonacci heap. In the buddy system, the standard block sizes are powers
of two. If a block of size 2^k is needed, but none is available, a block of size 2^{k+1} is
split in two. One of the halves is used to satisfy the request; the other is placed on
the kth free list. When a block is deallocated, it is coalesced with its “buddy”—the
other half of the split that created it—if that buddy is free. Fibonacci heaps are
similar, but use Fibonacci numbers for the standard sizes, instead of powers of
two. The algorithm is slightly more complex, but leads to slightly lower internal
fragmentation, because the Fibonacci sequence grows more slowly than 2^k.
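A minimal Python sketch of the buddy system appears below. The names BuddyHeap and size_class are hypothetical, and the sketch makes two simplifying assumptions: the heap occupies addresses 0 through 2^max_k - 1, so that a block's buddy can be found by flipping one bit of its address, and the caller supplies the size class again when deallocating. If no block of the requested size is free, the sketch splits the next larger available block repeatedly until one of the right size results.

```python
def size_class(n):
    # Smallest k such that 2**k >= n, for n >= 1 (rounding the request up).
    return (n - 1).bit_length()


class BuddyHeap:
    def __init__(self, max_k):
        self.max_k = max_k
        # free_lists[k] holds the start addresses of free blocks of size 2**k.
        self.free_lists = [[] for _ in range(max_k + 1)]
        self.free_lists[max_k].append(0)

    def allocate(self, k):
        j = k
        while j <= self.max_k and not self.free_lists[j]:
            j += 1                         # look for a larger block to split
        if j > self.max_k:
            return None
        start = self.free_lists[j].pop()
        while j > k:
            j -= 1                         # split; the upper half becomes free
            self.free_lists[j].append(start + 2 ** j)
        return start

    def deallocate(self, start, k):
        while k < self.max_k:
            buddy = start ^ (2 ** k)       # the other half of the original split
            if buddy not in self.free_lists[k]:
                break
            self.free_lists[k].remove(buddy)   # coalesce with the free buddy
            start = min(start, buddy)
            k += 1
        self.free_lists[k].append(start)
```

A request for 5 words falls in size class 3 and so consumes an 8-word block; the 3 unused words are internal fragmentation.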
The problem with external fragmentation is that the ability of the heap to satisfy
requests may degrade over time. Multiple free lists may help, by clustering small
blocks in relatively close physical proximity, but they do not eliminate the problem.
It is always possible to devise a sequence of requests that cannot be satisfied,
even though the total space required is less than the size of the heap. If memory
is partitioned among size pools statically, one need only exceed the maximum
number of requests of a given size. If pools are dynamically readjusted, one can
“checkerboard” the heap by allocating a large number of small blocks and then
deallocating every other one, in order of physical address, leaving an alternating
pattern of small free and allocated blocks. To eliminate external fragmentation, we
must be prepared to compact the heap, by moving already-allocated blocks. This
task is complicated by the need to find and update all outstanding references to a
block that is being moved. We will discuss compaction further in Sections 7.7.2
and 7.7.3.
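The checkerboard pattern is easy to reproduce in a small simulation. The following Python sketch assumes a hypothetical 16-cell heap managed by a first-fit search; the names heap, first_fit, and release are illustrative only.

```python
HEAP_SIZE = 16
heap = [None] * HEAP_SIZE           # None marks a free cell

def first_fit(size, tag):
    """Claim the first run of `size` contiguous free cells; return its start or None."""
    run = 0
    for i, cell in enumerate(heap):
        run = run + 1 if cell is None else 0
        if run == size:
            start = i - size + 1
            heap[start:i + 1] = [tag] * size
            return start
    return None

def release(start, size):
    """Mark `size` cells beginning at `start` as free again."""
    heap[start:start + size] = [None] * size

# Fill the heap with one-cell blocks, then free every other one.
blocks = [first_fit(1, f"b{i}") for i in range(HEAP_SIZE)]
for start in blocks[::2]:
    release(start, 1)

free_cells = heap.count(None)       # half of the heap is free ...
result = first_fit(2, "big")        # ... yet no two free cells are adjacent
```

Half of the cells are free at the end, but a request for two contiguous cells still fails; only moving the allocated blocks would help.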
3.2.4 Garbage Collection
Allocation of heap-based objects is always triggered by some specific operation in
a program: instantiating an object, appending to the end of a list, assigning a long
value into a previously short string, and so on. Deallocation is also explicit in some
languages (e.g., C, C++, and Pascal). As we shall see in Section 7.7, however, many
languages specify that objects are to be deallocated implicitly when it is no longer
possible to reach them from any program variable. The run-time library for such a
language must then provide a garbage collection mechanism to identify and reclaim
unreachable objects. Most functional and scripting languages require garbage
collection, as do many more recent imperative languages, including Modula-3,
Java, and C#.
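The notion of reachability can be made concrete with a small graph traversal. The sketch below covers only the identification step, under the assumption that the heap is modeled as a dictionary from object names to the names they reference; the function reachable and the sample heap are hypothetical, and real collectors (Section 7.7) are considerably more involved.

```python
def reachable(roots, references):
    """Return the set of objects reachable from `roots`.

    `references` maps each object name to the names it points to.
    """
    marked = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in marked:
            continue
        marked.add(obj)
        worklist.extend(references.get(obj, ()))
    return marked

# Hypothetical heap: a -> b -> c, plus an unreachable cycle d <-> e.
references = {"a": ["b"], "b": ["c"], "c": [], "d": ["e"], "e": ["d"]}
live = reachable(["a"], references)     # "a" plays the role of a program variable
garbage = set(references) - live        # candidates for reclamation
```

Objects d and e refer to each other, but neither can be reached from a program variable, so both count as garbage.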
The traditional arguments in favor of explicit deallocation are implementation
simplicity and execution speed. Even naive implementations of automatic garbage
collection add significant complexity to the implementation of a language with a
rich type system, and even the most sophisticated garbage collector can consume
nontrivial amounts of time in certain programs. If the programmer can correctly
identify the end of an object’s lifetime, without too much run-time bookkeeping,
the result is likely to be faster execution.
The argument in favor of automatic garbage collection, however, is compelling: manual deallocation errors are among the most common and costly bugs in
real-world programs. If an object is deallocated too soon, the program may follow
a dangling reference, accessing memory now used by another object. If an object
is not deallocated at the end of its lifetime, then the program may “leak memory,” eventually running out of heap space. Deallocation errors are notoriously
difficult to identify and fix. Over time, both language designers and programmers have increasingly come to consider automatic garbage collection an essential
language feature. Garbage-collection algorithms have improved, reducing their
run-time overhead; language implementations have become more complex in
general, reducing the marginal complexity of automatic collection; and leading-edge applications have become larger and more complex, making the benefits of
automatic collection ever more compelling.
CHECK YOUR UNDERSTANDING
1. What is binding time?
2. Explain the distinction between decisions that are bound statically and those
that are bound dynamically.
3. What is the advantage of binding things as early as possible? What is the
advantage of delaying bindings?
4. Explain the distinction between the lifetime of a name-to-object binding and
its visibility.
5. What determines whether an object is allocated statically, on the stack, or in
the heap?
6. List the objects and information commonly found in a stack frame.
7. What is a frame pointer? What is it used for?
8. What is a calling sequence?
9. What are internal and external fragmentation?
10. What is garbage collection?
11. What is a dangling reference?
3.3 Scope Rules
The textual region of the program in which a binding is active is its scope. In
most modern languages, the scope of a binding is determined statically, that is,
at compile time. In C, for example, we introduce a new scope upon entry to a
subroutine. We create bindings for local objects and deactivate bindings for global
objects that are “hidden” by local objects of the same name. On subroutine exit,
we destroy bindings for local variables and reactivate bindings for any global
objects that were hidden. These manipulations of bindings may at first glance
appear to be run-time operations, but they do not require the execution of any
code: the portions of the program in which a binding is active are completely
determined at compile time. We can look at a C program and know which names
refer to which objects at which points in the program based on purely textual
rules. For this reason, C is said to be statically scoped (some authors say lexically
scoped³). Other languages, including APL, Snobol, and early dialects of Lisp, are
dynamically scoped: their bindings depend on the flow of execution at run time.
We will examine static and dynamic scoping in more detail in Sections 3.3.1
and 3.3.6.
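The difference between the two disciplines can be seen in a short Python experiment. Python itself is statically scoped, so the dynamically scoped half of this sketch is an emulation that assumes an explicit stack of bindings; dynamic_env, lookup, and the dyn_ functions are hypothetical names introduced only for illustration.

```python
x = "global"

def show():
    return x                    # static scoping: always the module-level x

def caller():
    x = "caller's local"        # a distinct local x; show() does not see it
    return show()

static_result = caller()

# Emulating dynamic scoping with an explicit stack of bindings.
dynamic_env = [{"x": "global"}]

def lookup(name):
    for frame in reversed(dynamic_env):     # the most recent binding wins
        if name in frame:
            return frame[name]
    raise NameError(name)

def dyn_show():
    return lookup("x")

def dyn_caller():
    dynamic_env.append({"x": "caller's local"})   # binding created at run time
    try:
        return dyn_show()
    finally:
        dynamic_env.pop()                         # binding destroyed on exit

dynamic_result = dyn_caller()
```

Under static scoping the result depends only on where show is written; under the emulated dynamic scoping it depends on who called dyn_show.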
In addition to talking about the “scope of a binding,” we sometimes use the
word scope as a noun all by itself, without a specific binding in mind. Informally,
a scope is a program region of maximal size in which no bindings change (or
at least none are destroyed—more on this in Section 3.3.3). Typically, a scope
is the body of a module, class, subroutine, or structured control flow statement,
sometimes called a block. In C family languages it would be delimited with {...}
braces.
Algol 68 and Ada use the term elaboration to refer to the process by which
declarations become active when control first enters a scope. Elaboration entails
the creation of bindings. In many languages, it also entails the allocation of stack
space for local objects, and possibly the assignment of initial values. In Ada it
can entail a host of other things, including the execution of error-checking or
heap-space-allocating code, the propagation of exceptions, and the creation of
concurrently executing tasks (to be discussed in Chapter 12).
At any given point in a program’s execution, the set of active bindings is called
the current referencing environment. The set is principally determined by static
or dynamic scope rules. We shall see that a referencing environment generally
corresponds to a sequence of scopes that can be examined (in order) to find the
current binding for a given name.
In some cases, referencing environments also depend on what are (in a confusing use of terminology) called binding rules. Specifically, when a reference to a
subroutine S is stored in a variable, passed as a parameter to another subroutine,
or returned as a function value, one needs to determine when the referencing
environment for S is chosen, that is, when the binding between the reference to
S and the referencing environment of S is made. The two principal options are
deep binding, in which the choice is made when the reference is first created, and
shallow binding, in which the choice is made when the reference is finally used. We
will examine these options in more detail in Section 3.6.
3 Lexical scope is actually a better term than static scope, because scope rules based on nesting can
be enforced at run time instead of compile time if desired. In fact, in Common Lisp and Scheme
it is possible to pass the unevaluated text of a subroutine declaration into some other subroutine
as a parameter, and then use the text to create a lexically nested declaration at run time.
3.3.1 Static Scoping
In a language with static (lexical) scoping, the bindings between names and objects
can be determined at compile time by examining the text of the program, without
consideration of the flow of control at run time. Typically, the “current” binding
for a given name is found in the matching declaration whose block most closely
surrounds a given point in the program, though as we shall see there are many
variants on this basic theme.
The simplest static scope rule is probably that of early versions of Basic, in
which there was only a single, global scope. In fact, there were only a few hundred
possible names, each of which consisted of a letter optionally followed by a digit.
There were no explicit declarations; variables were declared implicitly by virtue of
being used.
Scope rules are somewhat more complex in (pre-Fortran 90) Fortran, though
not much more.⁴ Fortran distinguishes between global and local variables. The
scope of a local variable is limited to the subroutine in which it appears; it is not
visible elsewhere. Variable declarations are optional. If a variable is not declared,
it is assumed to be local to the current subroutine and to be of type integer if
its name begins with the letters I–N, or real otherwise. (Different conventions
for implicit declarations can be specified by the programmer. In Fortran 90, the
programmer can also turn off implicit declarations, so that use of an undeclared
variable becomes a compile-time error.)
Global variables in Fortran may be partitioned into common blocks, which are
then “imported” by subroutines. Common blocks are designed for separate compilation: they allow a subroutine to import only the sets of variables it needs.
Unfortunately, Fortran requires each subroutine to declare the names and types
of the variables in each of the common blocks it uses, and there is no standard mechanism to ensure that the declarations in different subroutines are the
same.
Semantically, the lifetime of a local Fortran variable (both the object itself and
the name-to-object binding) encompasses a single execution of the variable’s subroutine. Programmers can override this rule by using an explicit save statement.
4 Fortran and C have evolved considerably over the years. Unless otherwise noted, comments in this
text apply to the Fortran 77 dialect [Ame78a] (still more widely used than the newer Fortran 90).
Comments on C refer to all versions of the language (including the C99 standard [Int99]) unless
otherwise noted. Comments on Ada, likewise, refer to both Ada 83 [Ame83] and Ada 95 [Int95b]
unless otherwise noted.
#include <stdio.h>                  /* for sprintf */

/*
    Place into *s a new name beginning with the letter 'L' and
    continuing with the ASCII representation of a unique integer.
    Parameter s is assumed to point to space large enough to hold any
    such name; for the short ints used here, 7 characters suffice.
*/
void label_name (char *s) {
    static short int n;             /* C guarantees that static locals
                                       are initialized to zero */
    sprintf (s, "L%d\0", ++n);      /* "print" formatted output to s */
}
Figure 3.3 C code to illustrate the use of static variables.
EXAMPLE 3.4 Static variables in C
(Similar mechanisms appear in many other languages: in C one declares the variable static; in Algol one declares it own.) A save-ed (static, own) variable has
a lifetime that encompasses the entire execution of the program. Instead of a logically separate object for every invocation of the subroutine, the compiler creates
a single object that retains its value from one invocation of the subroutine to the
next. (The name-to-variable binding, of course, is inactive when the subroutine
is not executing, because the name is out of scope.)
As an example of the use of static variables, consider the code in Figure 3.3.
The subroutine label_name can be used to generate a series of distinct character-string names: L1, L2, .... A compiler might use these names in its assembly
language output.
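Python has no direct counterpart to save, static, or own, but a similar effect can be approximated with a closure. The sketch below is only an analogue, assuming that a counter captured by an inner function stands in for the static local n; make_label_namer is a hypothetical helper, not part of the C example.

```python
import itertools

def make_label_namer():
    counter = itertools.count(1)    # state that outlives each individual call

    def label_name():
        return f"L{next(counter)}"  # L1, L2, ... on successive calls

    return label_name

label_name = make_label_namer()
names = [label_name() for _ in range(3)]
```

As with the C version, the counter retains its value between calls but cannot be named by code outside the subroutine.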
3.3.2 Nested Subroutines
The ability to nest subroutines inside each other, introduced in Algol 60, is a
feature of many modern languages, including Pascal, Ada, ML, Python, Scheme,
Common Lisp, and (to a limited extent) Fortran 90. Other languages, including C
and its descendants, allow classes or other scopes to nest. Just as the local variables
of a Fortran subroutine are not visible to other subroutines, any constants, types,
variables, or subroutines declared within a block are not visible outside that block
in Algol-family languages. More formally, Algol-style nesting gives rise to the closest
nested scope rule for bindings from names to objects: a name that is introduced in
a declaration is known in the scope in which it is declared, and in each internally
nested scope, unless it is hidden by another declaration of the same name in
one or more nested scopes. To find the object corresponding to a given use of a
name, we look for a declaration with that name in the current, innermost scope.
If there is one, it defines the active binding for the name. Otherwise, we look
for a declaration in the immediately surrounding scope. We continue outward,
examining successively surrounding scopes, until we reach the outer nesting level
of the program, where global objects are declared. If no declaration is found at
any level, then the program is in error.
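The outward search prescribed by the closest nested scope rule can be sketched as a walk along a chain of scope objects. In this Python illustration the Scope class and its methods are hypothetical; a real compiler would perform this search in its symbol table at compile time. The scope and variable names are borrowed from Figure 3.4.

```python
class Scope:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent            # lexically enclosing scope, or None
        self.declarations = {}

    def declare(self, ident, meaning):
        self.declarations[ident] = meaning

    def resolve(self, ident):
        """Search the current scope, then each surrounding scope in turn."""
        scope = self
        while scope is not None:
            if ident in scope.declarations:
                return scope.declarations[ident]
            scope = scope.parent
        raise NameError(f"{ident} is not declared")

global_scope = Scope("global")
p1 = Scope("P1", global_scope)
p4 = Scope("P4", p1)
f1 = Scope("F1", p4)
p1.declare("X", "real X of P1")
f1.declare("X", "integer X of F1")      # hides the outer X inside F1
```

Resolving X from inside F1 finds the inner declaration; resolving it from P4 skips outward to P1; an undeclared name falls off the end of the chain and is reported as an error.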
EXAMPLE 3.5 Nested scopes
Many languages provide a collection of built-in, or predefined objects, such as
I/O routines, mathematical functions, and in some cases types such as integer
and char. It is common to consider these to be declared in an extra, invisible,
outermost scope, which surrounds the scope in which global objects are declared.
The search for bindings described in the previous paragraph terminates at this
extra, outermost scope, if it exists, rather than at the scope in which global objects
are declared. This outermost scope convention makes it possible for a programmer
to define a global object whose name is the same as that of some predefined object
(whose “declaration” is thereby hidden, making it unusable).
An example of nested scopes appears in Figure 3.4.⁵ In this example, procedure
P2 is called only by P1, and need not be visible outside. It is therefore declared
inside P1, limiting its scope (its region of visibility) to the portion of the program
shown here. In a similar fashion, P4 is visible only within P1, P3 is visible only
within P2, and F1 is visible only within P4. Under the standard rules for nested
scopes, F1 could call P2 and P4 could call F1, but P2 could not call F1.
Though they are hidden from the rest of the program, nested subroutines are
able to access the parameters and local variables (and other local objects) of the
surrounding scope(s). In our example, P3 can name (and modify) A1, X, and
A2, in addition to A3. Because P1 and F1 both declare local variables named X,
the inner declaration hides the outer one within a portion of its scope. Uses of
X in F1 refer to the inner X; uses of X in other regions of the code refer to the
outer X.
A name-to-object binding that is hidden by a nested declaration of the same
name is said to have a hole in its scope. In most languages the object whose name
is hidden is inaccessible in the nested scope (unless it has more than one name).
Some languages allow the programmer to access the outer meaning of a name
by applying a qualifier or scope resolution operator. In Ada, for example, a name
may be prefixed by the name of the scope in which it is declared, using syntax
that resembles the specification of fields in a record. My_proc.X, for example,
refers to the declaration of X in subroutine My_proc, regardless of whether some
other X has been declared in a lexically closer scope. In C++, which does not allow
subroutines to nest, ::X refers to a global declaration of X, regardless of whether
the current subroutine also has an X.⁶
5 This code is not contrived; it was extracted from an implementation of the FMQ error repair
algorithm described in Section 2.3.4.
6 The C++ :: operator is also used to name members (fields or methods) of a base class that are
hidden by members of a derived class; we will consider this use in Section 9.2.2.

procedure P1(A1 : T1);
var X : real;
    ...
    procedure P2(A2 : T2);
        ...
        procedure P3(A3 : T3);
            ...
        begin
            ...             (* body of P3 *)
        end;
        ...
    begin
        ...                 (* body of P2 *)
    end;
    ...
    procedure P4(A4 : T4);
        ...
        function F1(A5 : T5) : T6;
        var X : integer;
            ...
        begin
            ...             (* body of F1 *)
        end;
        ...
    begin
        ...                 (* body of P4 *)
    end;
    ...
begin
    ...                     (* body of P1 *)
end

Figure 3.4 Example of nested subroutines in Pascal. Vertical bars show the scope of each name
(note the hole in the scope of the outer X).
[Scope bars in the original figure: A1, X, P2, P4 (within P1); A2, P3 (within P2); A3 (within P3); A4, F1 (within P4); A5, X (within F1).]

Access to Nonlocal Objects
We have already seen (Section 3.2.2) that the compiler can arrange for a frame
pointer register to point to the frame of the currently executing subroutine at run
time. Using this register as a base for displacement (register plus offset) addressing,
target code can access objects within the current subroutine. But what about
objects in lexically surrounding subroutines? To find these we need a way to find
the frames corresponding to those scopes at run time. Since a nested subroutine
may call a routine in an outer scope, the order of stack frames at run time may
not necessarily correspond to the order of lexical nesting. Nonetheless, we can
be sure that there is some frame for the surrounding scope already in the stack,
since the current subroutine could not have been called unless it was visible, and it
could not have been visible unless the surrounding scope was active. (It is actually
possible in some languages to save a reference to a nested subroutine, and then
call it when the surrounding scope is no longer active. We defer this possibility to
Section 3.6.2.)
EXAMPLE 3.6 Static chains
The simplest way in which to find the frames of surrounding scopes is to
maintain a static link in each frame that points to the “parent” frame: the frame
of the most recent invocation of the lexically surrounding subroutine. If a subroutine is declared at the outermost nesting level of the program, then its frame
will have a null static link at run time. If a subroutine is nested k levels deep,
then its frame’s static link, and those of its parent, grandparent, and so on, will
form a static chain of length k at run time. To find a variable or parameter
declared j subroutine scopes outward, target code at run time can dereference
the static chain j times, and then add the appropriate offset. Static chains are
illustrated in Figure 3.5. We will discuss the code required to maintain them in
Section 8.2.

[Diagram: on the left, the nesting of subroutines A, B, C, D, and E; on the right, the stack of frames with their static links, from the frame pointer (fp) downward: C, D, B, E, A.]
Figure 3.5 Static chains. Subroutines A, B, C, D, and E are nested as shown on the left. If the
sequence of nested calls at run time is A, E, B, D, and C, then the static links in the stack will look
as shown on the right. The code for subroutine C can find local objects at known offsets from
the frame pointer. It can find local objects of the surrounding scope, B, by dereferencing its static
chain once and then applying an offset. It can find local objects in B’s surrounding scope, A, by
dereferencing its static chain twice and then applying an offset.
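The lookup procedure for static chains can be mimicked with linked frame objects. The Python sketch below is a model only: the Frame class, the access function, and the local variable names are hypothetical, and a dictionary stands in for fixed offsets within a frame. It assumes, as the call sequence A, E, B, D, C suggests, that E is nested directly inside A and that D, like C, is nested directly inside B.

```python
class Frame:
    def __init__(self, name, static_link, local_vars):
        self.name = name
        self.static_link = static_link  # frame of the lexically enclosing routine
        self.locals = local_vars        # stands in for offsets from the frame base

def access(frame, hops, var):
    """Follow the static chain `hops` times, then read `var` in that frame."""
    for _ in range(hops):
        frame = frame.static_link
    return frame.locals[var]

# Frames for the call sequence A, E, B, D, C (local variables are made up).
a = Frame("A", None, {"a_var": 1})      # outermost level: null static link
e = Frame("E", a, {"e_var": 2})
b = Frame("B", a, {"b_var": 3})
d = Frame("D", b, {"d_var": 4})
c = Frame("C", b, {"c_var": 5})
fp = c                                  # C is the currently executing routine
```

The number of hops is known at compile time, because it is simply the difference in nesting depth between the use of a name and its declaration. Note that the frames of D and E sit between C and its ancestors in the stack but are never visited.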
3.3.3 Declaration Order
In our discussion so far we have glossed over an important subtlety: suppose an
object x is declared somewhere within block B . Does the scope of x include the
portion of B before the declaration, and if so can x actually be used in that portion
of the code? Put another way, can an expression E refer to any name declared in
the current scope, or only to names that are declared before E in the scope?
Several early languages, including Algol 60 and Lisp, required that all declarations appear at the beginning of their scope. One might at first think that this rule
would avoid the questions in the preceding paragraph, but it does not, because
declarations may refer to one another.⁷
EXAMPLE 3.7 A “gotcha” in declare-before-use
In an apparent attempt to simplify the implementation of the compiler, Pascal
modified the requirement to say that names must be declared before they are used
(with special-case mechanisms to accommodate recursive types and subroutines).
At the same time, however, Pascal retained the notion that the scope of a declaration is the entire surrounding block. These two rules can interact in surprising
ways:
1.  const N = 10;
2.  ...
3.  procedure foo;
4.  const
5.      M = N;      (* static semantic error! *)
6.      ...
7.      N = 20;     (* local constant declaration; hides the outer N *)
Pascal says that the second declaration of N covers all of foo, so the semantic
analyzer should complain on line 5 that N is being used before its declaration.
The error has the potential to be highly confusing, particularly if the programmer
meant to use the outer N:
const N = 10;
...
procedure foo;
const
    M = N;          (* static semantic error! *)
var
    A : array [1..M] of integer;
    N : real;       (* hiding declaration *)
Here the pair of messages “N used before declaration” and “N is not a constant”
are almost certainly not helpful.
DESIGN & IMPLEMENTATION
Mutual recursion
Some Algol 60 compilers were known to process the declarations of a scope in
program order. This strategy had the unfortunate effect of implicitly outlawing
mutually recursive subroutines and types, something the language designers
clearly did not intend [Atk73].
7 We saw an example of mutually recursive subroutines in the recursive descent parsing of
Section 2.3.1. Mutually recursive types frequently arise in linked data structures, where nodes
of two types may need to point to each other.
EXAMPLE 3.8 Whole-block scope in C#
In order to determine the validity of any declaration that appears to use a
name from a surrounding scope, a Pascal compiler must scan the remainder of
the scope’s declarations to see if the name is hidden. To avoid this complication,
most Pascal successors (and some dialects of Pascal itself) specify that the scope
of an identifier is not the entire block in which it is declared (excluding holes), but
rather the portion of that block from the declaration to the end (again excluding
holes). If our program fragment had been written in Ada, for example, or in C,
C++, or Java, no semantic errors would be reported. The declaration of M would
refer to the first (outer) declaration of N .
C++ and Java further relax the rules by dispensing with the define-before-use
requirement in many cases. In both languages, members of a class (including
those that are not defined until later in the program text) are visible inside all
of the class’s methods. In Java, classes themselves can be declared in any order.
Interestingly, while C# echoes Java in requiring declaration before use for local
variables (but not for classes and members), it returns to the Pascal notion of
whole-block scope. Thus the following is invalid in C#.
class A {
    const int N = 10;
    void foo() {
        const int M = N;    // uses inner N before it is declared
        const int N = 20;

EXAMPLE 3.9 “Local if written” in Python
EXAMPLE 3.10 Declaration order in Scheme
Perhaps the simplest approach to declaration order, from a conceptual point
of view, is that of Modula-3, which says that the scope of a declaration is the
entire block in which it appears (minus any holes created by nested declarations),
and that the order of declarations doesn’t matter. The principal objection to this
approach is that programmers may find it counterintuitive to use a local variable
before it is declared. Python takes the “whole block” scope rule one step further
by dispensing with variable declarations altogether. In their place it adopts the
unusual convention that the local variables of subroutine S are precisely those
variables that are written by some statement in the (static) body of S. If S is nested
inside of T, and the name x appears on the left-hand side of assignment statements
in both S and T, then the x’s are distinct: there is one in S and one in T. Nonlocal variables are read-only unless explicitly imported (using Python’s global
statement). We will consider these conventions in more detail in Section 13.4.1,
as part of a general discussion of scoping in scripting languages.
In the interest of flexibility, modern Lisp dialects tend to provide several options
for declaration order. In Scheme, for example, the letrec and let* constructs
define scopes with, respectively, whole-block and declaration-to-end-of-block
semantics. The most frequently used construct, let, provides yet another option:
(let ((A 1))            ; outer scope, with A defined to be 1
  (let ((A 2)           ; inner scope, with A defined to be 2
        (B A))          ;    and B defined to be A
    B))                 ; return the value of B
Here the nested declarations of A and B don’t take effect until after the end
of the declaration list. Thus when B is defined, the redefinition of A has not
yet taken effect. B is defined to be the outer A, and the code as a whole
returns 1.
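The behavior of let can be reproduced in Python by treating the construct as the application of an anonymous function, which is how let is conventionally defined in terms of lambda. The sketch below assumes that a module-level A plays the role of the outer scope; the variable names let_result and let_star_result are hypothetical.

```python
A = 1       # outer scope, with A defined to be 1

# let: every initializer is evaluated in the outer scope, and only then
# are all of the new names bound at once.
let_result = (lambda A, B: B)(2, A)

# let*: each initializer sees the bindings made before it, which
# corresponds to nesting one function application inside another.
let_star_result = (lambda A: (lambda B: B)(A))(2)
```

With let, B receives the outer A; with let*, B receives the newly bound inner A.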
Declarations and Definitions
EXAMPLE 3.11 Declarations vs definitions in C
Recursive types and subroutines introduce a problem for languages that require
names to be declared before they can be used: how can two declarations each
appear before the other? C and C++ handle the problem by distinguishing between
the declaration of an object and its definition. A declaration introduces a name
and indicates its scope, but may omit certain implementation details. A definition describes the object in sufficient detail for the compiler to determine its
implementation. If a declaration is not complete enough to be a definition, then
a separate definition must appear somewhere else in the scope. In C we can
write
struct manager;                         /* declaration only */
struct employee {
    struct manager *boss;
    struct employee *next_employee;
    ...
};
struct manager {                        /* definition */
    struct employee *first_employee;
    ...
};
and
void list_tail(follow_set fs);          /* declaration only */
void list(follow_set fs)
{
    switch (input_token) {
        case id : match(id); list_tail(fs);
    ...
}
void list_tail(follow_set fs)           /* definition */
{
    switch (input_token) {
        case comma : match(comma); list(fs);
    ...
}
The initial declaration of manager needed only to introduce a name: since pointers to structures are all the same size, the compiler could determine the implementation
of employee without knowing any manager details. The initial declaration of
list_tail, however, must include the return type and parameter list, so the
compiler can tell that the call in list is correct.
Nested Blocks
EXAMPLE 3.12 Inner declarations in C
In many languages, including Algol 60, C89, and Ada, local variables can be
declared not only at the beginning of any subroutine, but also at the top of any
begin...end ({...}) block. Other languages, including Algol 68, C99, and all of
C’s descendants, are even more flexible, allowing declarations wherever a statement
may appear. In most languages a nested declaration hides any outer declaration
with the same name (Java and C# make it a static semantic error if the outer
declaration is local to the current subroutine).
Variables declared in nested blocks can be very useful, as for example in the
following C code:
{
    int temp = a;
    a = b;
    b = temp;
}
Keeping the declaration of temp lexically adjacent to the code that uses it makes the
program easier to read, and eliminates any possibility that this code will interfere
with another variable named temp.
No run-time work is needed to allocate or deallocate space for variables declared
in nested blocks; their space can be included in the total space for local variables
allocated in the subroutine prologue and deallocated in the epilogue. Exercise 3.9
considers how to minimize the total space required.
DESIGN & IMPLEMENTATION
Redeclarations
Some languages, particularly those that are intended for interactive use, permit
the programmer to redeclare an object: to create a new binding for a given name
in a given scope. Interactive programmers commonly use redeclarations to fix
bugs. In most interactive languages, the new meaning of the name replaces the
old in all contexts. In ML, however, the old meaning of the name may remain
accessible to functions that were elaborated before the name was redeclared.
This design choice in ML can sometimes be counterintuitive. It probably reflects
the fact that ML is usually compiled, bit by bit on the fly, rather than interpreted.
A language like Scheme, which is lexically scoped but usually interpreted, stores
the binding for a name in a known location. A program accesses the meaning of
the name indirectly through that location: if the meaning of the name changes,
all accesses to the name will use the new meaning. In ML, previously elaborated
functions have already been compiled into a form (often machine code) that
accesses the meaning of the name directly.
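Python behaves like the interpreted case described here: a function looks up the names it uses each time it runs, so a redeclaration is seen by functions defined earlier. The following sketch uses hypothetical function names and a deliberately misspelled string to stand for the bug being fixed.

```python
def greeting():
    return "helo"                       # buggy first version

def welcome():
    return greeting() + ", world"       # greeting is looked up at call time

before = welcome()

def greeting():                         # redeclaration: rebinds the same name
    return "hello"

after = welcome()
```

In ML, by contrast, the corresponding welcome function would continue to use the first greeting.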
CHECK YOUR UNDERSTANDING
12. What do we mean by the scope of a name-to-object binding?
13. Describe the difference between static and dynamic scoping.
14. What is elaboration?
15. What is a referencing environment?
16. Explain the closest nested scope rule.
17. What is the purpose of a scope resolution operator?
18. What is a static chain? What is it used for?
19. What are forward references? Why are they prohibited or restricted in many
programming languages?
20. Explain the difference between a declaration and a definition. Why is the distinction important?
3.3.4 Modules
A major challenge in the construction of any large body of software is how to
divide the effort among programmers in such a way that work can proceed on
multiple fronts simultaneously. This modularization of effort depends critically
on the notion of information hiding, which makes objects and algorithms invisible, whenever possible, to portions of the system that do not need them. Properly
modularized code reduces the “cognitive load” on the programmer by minimizing the amount of information required to understand any given portion of the
system. In a well-designed program the interfaces between modules are as “narrow” (i.e., simple) as possible, and any design decision that is likely to change is
hidden inside a single module. This latter point is crucial, since maintenance (bug
fixes and enhancement) consumes much more programmer time than does initial
construction for most commercial software.
In addition to reducing cognitive load, information hiding reduces the risk of
name conflicts: with fewer visible names, there is less chance that a newly introduced name will be the same as one already in use. It also safeguards the integrity
of data abstractions: any attempt to access objects outside of the subroutine(s) to
which they belong will cause the compiler to issue an “undefined symbol” error
message. Finally, it helps to compartmentalize run-time errors: if a variable takes
on an unexpected value, we can generally be sure that the code that modified it is
in the variable’s scope.
Encapsulating Data and Subroutines
Unfortunately, the information hiding provided by nested subroutines is limited
to objects whose lifetime is the same as that of the subroutine in which they
are hidden. When control returns from a subroutine, its local variables will no
longer be live: their values will be discarded. We have seen a partial solution to
this problem in the form of the save statement in Fortran and the static and
own variables of C and Algol.
Static variables allow a subroutine to have “memory”—to retain information
from one invocation to the next—while protecting that memory from accidental
access or modification by other parts of the program. Put another way, static variables allow programmers to build single-subroutine abstractions. Unfortunately,
they do not allow the construction of abstractions whose interface needs to consist of more than one subroutine. Suppose, for example, that we wish to construct
a stack abstraction. We should like to hide the representation of the stack—its
internal structure—from the rest of the program, so that it can be accessed only
through its push and pop routines. We can achieve this goal in many languages
through use of a module construct.
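The goal can be illustrated in Python, although only by emulation: the sketch below hides the representation inside a closure rather than a true module construct. The function make_stack is hypothetical; the names s, top, push, pop, and stack_size are borrowed from the stack example discussed in this section, and the choice of exceptions is an assumption.

```python
def make_stack(stack_size):
    s = [None] * stack_size     # representation, hidden from callers
    top = 0                     # index of the next free slot

    def push(elem):
        nonlocal top
        if top == stack_size:
            raise OverflowError("stack is full")
        s[top] = elem
        top += 1

    def pop():
        nonlocal top
        if top == 0:
            raise IndexError("stack is empty")
        top -= 1
        return s[top]

    return push, pop            # the only exported names

push, pop = make_stack(100)
push(1)
push(2)
```

Both routines share s and top, yet the rest of the program can reach the stack only through push and pop, which is what a single static variable cannot provide.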
Modules as Abstractions
EXAMPLE 3.13 Stack module in Modula-2
A module allows a collection of objects—subroutines, variables, types, and so
on—to be encapsulated in such a way that (1) objects inside are visible to each
other, but (2) objects on the inside are not visible on the outside unless explicitly
exported, and (3) (in many languages) objects outside are not visible on the inside
unless explicitly imported. Note that these rules affect only the visibility of objects;
they do not affect their lifetime.
Modules were one of the principal language innovations of the late 1970s and
early 1980s; they appear in Clu (which called them clusters), Modula (1, 2, and 3),
Turing, and Ada 83. They also appear in Haskell; in C++, Java, and C#; and in
the major scripting languages. Several languages, including Ada, Java, and Perl,
use the term package instead of module. Others, including C++, C#, and PHP,
use namespace. Modules can be emulated to some degree through use of the
separate compilation facilities of C; we discuss this possibility in Section 3.8.
As an example of the use of modules, consider the stack abstraction shown
in Figure 3.6. This stack can be embedded anywhere a subroutine might appear
in a Modula-2 program. Bindings to variables declared in a module are inactive
outside the module, not destroyed. In our stack example, s and top have the
same lifetime they would have had if not enclosed in the module. If stack is
declared at the program’s outermost nesting level, then s and top retain their
values throughout the execution of the program, though they are visible only to
the code inside push and pop . If stack is declared inside some subroutine sub ,
then s and top have the same lifetime as the local variables of sub . If stack is
declared inside some other module mod , then s and top have the same lifetime as
they would have had if not enclosed in either module. Type stack_index , which
is also declared inside stack , is likewise visible only inside push and pop . The
issue of lifetime is not relevant for types or constants, since they have no mutable
state.
Our stack abstraction has two imports: the type ( element ) and maximum
number ( stack_size ) of elements to be placed in the stack. Element and
stack_size must be declared in a surrounding scope; the compiler will complain if they are not. With the exception of predefined (pervasive) names like
integer and arctan , element and stack_size are the only names from surrounding scopes that will be visible inside stack . Our stack also has two exports:
push and pop . These are the only names inside of stack that will be visible in the
surrounding scope.

CONST stack_size = ...
TYPE element = ...
...
MODULE stack;
IMPORT element, stack_size;
EXPORT push, pop;
TYPE
    stack_index = [1..stack_size];
VAR
    s   : ARRAY stack_index OF element;
    top : stack_index;                  (* first unused slot *)

PROCEDURE error; ...

PROCEDURE push(elem : element);
BEGIN
    IF top = stack_size THEN
        error;
    ELSE
        s[top] := elem;
        top := top + 1;
    END;
END push;

PROCEDURE pop() : element;              (* A Modula-2 function is just a *)
BEGIN                                   (* procedure with a return type. *)
    IF top = 1 THEN
        error;
    ELSE
        top := top - 1;
        RETURN s[top];
    END;
END pop;

BEGIN
    top := 1;
END stack;

VAR x, y : element;
...
push(x);
...
y := pop;

Figure 3.6 Stack abstraction in Modula-2.
Imports and Exports
Most module-based languages allow the programmer to specify that certain
exported names are usable only in restricted ways. Variables may be exported
read-only, for example, or types may be exported opaquely, meaning that variables
of that type may be declared, passed as arguments to the module’s subroutines,
and possibly compared or assigned to one another, but not manipulated in any
other way.
Modules into which names must be explicitly imported are said to be closed
scopes. By extension, modules that do not require imports are said to be open
scopes. Imports serve to document the program: they increase modularity by
requiring a module to specify the ways in which it depends on the rest of the
program. They also reduce name conflicts by refraining from importing anything
that isn’t needed. Modules are closed in Modula (1, 2, and 3) and Haskell. An
increasingly common option, found in the modules of Ada, Java, C#, and Python,
among others, might be called selectively open scopes. In these languages a name
foo exported from module A is automatically visible in peer module B as A.foo .
It becomes visible as merely foo if B explicitly imports it.
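C++ namespaces give a rough feel for selectively open scopes. The sketch below is only an approximation: a C++ namespace has no export list, so every name in A is reachable from outside. The names foo, bar, use_qualified, and use_imported are illustrative.

```cpp
namespace A {
    int foo() { return 42; }    // reachable from other scopes as A::foo
    int bar() { return 7; }     // reachable from other scopes as A::bar
}

namespace B {
    // Qualified names are visible without any import.
    int use_qualified() { return A::foo() + A::bar(); }

    // An explicit using-declaration makes the short name foo visible.
    using A::foo;
    int use_imported() { return foo(); }

    // bar was not imported, so an unqualified call to bar() in this
    // namespace would be rejected by the compiler.
}
```

The using-declaration plays the role of the explicit import: it documents that B depends on A::foo, and it is the only thing that makes the unqualified name available.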
Unlike modules, subroutines are open scopes in most Algol family languages.
Important exceptions are Euclid, in which both module and subroutine scopes are
closed; Turing, Modula (1), and Perl, in which subroutines are optionally closed
(if a subroutine imports anything explicitly, then no other nonlocal names will
be visible); and Clu, which outlaws the use of nonlocal variables entirely (though
nonlocal constants and subroutines can still be used). As in the case of modules,
import lists serve to document the interface between a subroutine and the rest
of the program. It would appear that most language designers have decided the
documentation isn’t worth the inconvenience.8
Modules as Managers
EXAMPLE 3.14 Module as “manager” for a type
Modules facilitate the construction of abstractions by allowing data to be made
private to the subroutines that use them. When used as in Figure 3.6, however,
each module defines a single abstraction. If we want to have several stacks, we must
generally make the module a “manager” for instances of a stack type, which is then
exported from the module, as shown in Figure 3.7. The manager idiom requires
additional subroutines to create/initialize and possibly destroy stack instances, and
it requires that every subroutine ( push , pop , create ) take an extra parameter, to
specify the stack in question. Clu adopts the position that every module (“cluster”)
is the manager for a type. Data declared in the cluster (other than static variables in
subroutines) are automatically the representation of the managed type, and there
are special language features to export an opaque version of the representation to
users of the type.
8 There is an interesting analogy here to exception propagation. As we shall see in Section 8.5.1,
language designers display similar disagreement about whether the exceptions that may be thrown
out of a subroutine must be listed in the subroutine’s header.
CONST stack_size = ...
TYPE element = ...
...
MODULE stack_manager;
IMPORT element, stack_size;
EXPORT stack, init_stack, push, pop;
TYPE
    stack_index = [1..stack_size];
    stack = RECORD
        s   : ARRAY stack_index OF element;
        top : stack_index;              (* first unused slot *)
    END;

PROCEDURE init_stack(VAR stk : stack);
BEGIN
    stk.top := 1;
END init_stack;

PROCEDURE push(VAR stk : stack; elem : element);
BEGIN
    IF stk.top = stack_size THEN
        error;
    ELSE
        stk.s[stk.top] := elem;
        stk.top := stk.top + 1;
    END;
END push;

PROCEDURE pop(VAR stk : stack) : element;
BEGIN
    IF stk.top = 1 THEN
        error;
    ELSE
        stk.top := stk.top - 1;
        RETURN stk.s[stk.top];
    END;
END pop;

END stack_manager;

VAR A, B : stack;
VAR x, y : element;
...
init_stack(A);
init_stack(B);
...
push(A, x);
...
y := pop(B);

Figure 3.7 Manager module for stacks in Modula-2.

3.3.5 Module Types and Classes
EXAMPLE 3.15 Module types in Euclid
An alternative solution to the multiple instance problem can be found in Simula,
Euclid, and (in a slightly different sense) ML, which treat modules as types, rather
than simple encapsulation constructs. Given a module type, the programmer can
declare an arbitrary number of similar module objects. The skeleton of a Euclid
stack appears in Figure 3.8. As in the (single) Modula-2 stack of Figure 3.6, Euclid
allows the programmer to provide initialization code that is executed whenever
a new stack is created. Euclid also allows the programmer to specify finalization
code that will be executed at the end of a module’s lifetime. This feature is not
needed for an array-based stack, but would be useful if elements were allocated
from a heap, and needed to be reclaimed.
const stack_size := ...
type element : ...
...
type stack = module
    imports (element, stack_size)
    exports (push, pop)
    type
        stack_index = 1..stack_size
    var
        s   : array stack_index of element
        top : stack_index
    procedure push(elem : element) = ...
    function pop returns element = ...
    ...
    initially
        top := 1
end stack

var A, B : stack
var x, y : element
...
A.push(x)
...
y := B.pop

Figure 3.8 Module type for stacks in Euclid. Unlike the code in Figure 3.6, the code here can
be used to create an arbitrary number of stacks.

DESIGN & IMPLEMENTATION
Modules and separate compilation
One of the hallmarks of a good abstraction is that it tends to be useful in multiple contexts. To facilitate code reuse, many languages make modules the basis
of separate compilation. Modula-2 actually provided two different kinds of
modules: one (external modules) for separate compilation, the other (internal
modules, as in Figure 3.6) for textual nesting within a larger scope. Experience
with these options eventually led Niklaus Wirth, the designer of Modula-2, to
conclude that external modules were by far the more useful variety; he omitted
the internal version from his subsequent language, Oberon. Many would argue,
however, that internal modules find their real utility only when extended with
instantiation and inheritance. Indeed, as noted near the end of this section,
many object-oriented languages provide both modules and classes. The former support separate compilation and serve to minimize name conflicts; the
latter are for data abstraction.
To facilitate separate compilation, modules in many languages (Modula-2
and Oberon among them) can be divided into a declaration part (header) and
an implementation part (body), each of which occupies a separate file. Code
that uses the exports of a given module can be compiled as soon as the header
exists; it is not dependent on the body. In particular, work on the bodies of
cooperating modules can proceed concurrently once the headers exist. We will
return to the subjects of separate compilation and code reuse in Sections 3.8
and 9.1, respectively.

The difference between the module-as-manager and module-as-type approaches to abstraction is reflected in the client code at the end of Figures 3.7 and 3.8. With
module types, the programmer can think of the module’s subroutines as “belonging” to the stack in question ( A.push(x) ), rather than as outside entities to which
the stack can be passed as an argument ( push(A, x) ). Conceptually, there is a
separate pair of push and pop operations for every stack. In practice, of course,
it would be highly wasteful to create multiple copies of the code. As we shall see
in Chapter 9, all stacks share a single pair of push and pop operations, and the
compiler arranges for a pointer to the relevant stack to be passed to the operation
as an extra, hidden parameter. The implementation turns out to be very similar
to the implementation of Figure 3.7, but the programmer need not think of it
that way.9
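The hidden-parameter idea can be sketched in C++, where the extra argument is the implicit this pointer. The code below is an illustration rather than actual compiler output: the names stack_type and stack_rep, the int element type, and the capacity of 100 are assumptions, and overflow and underflow checks are omitted for brevity.

```cpp
#include <cstddef>

const std::size_t stack_size = 100;     // assumed capacity
typedef int element;                    // assumed element type

// Module-as-type style: the operations "belong" to each object.
class stack_type {
    element s[stack_size];
    std::size_t top = 0;                // index of the first unused slot
public:
    void push(element e) { s[top++] = e; }   // this is passed implicitly
    element pop() { return s[--top]; }
};

// Manager style, and roughly what the class above turns into: one
// shared routine per operation, with the instance passed explicitly.
struct stack_rep {
    element s[stack_size];
    std::size_t top;
};
void init_stack(stack_rep* self) { self->top = 0; }
void push(stack_rep* self, element e) { self->s[self->top++] = e; }
element pop(stack_rep* self) { return self->s[--self->top]; }
```

A call such as A.push(x) on a stack_type object and a call such as push(&B, x) on a stack_rep do the same work. Only the notation, and the way the programmer thinks about ownership of the operation, differs.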
Object Orientation
As an extension of the module-as-type approach to data abstraction, many languages now provide a class construct for object-oriented programming. To first
approximation, classes can be thought of as module types that have been augmented with an inheritance mechanism. Inheritance allows new classes to be
defined as extensions or refinements of existing classes. Inheritance facilitates a
programming style in which all or most operations are thought of as belonging
to objects, and in which new objects can inherit most of their operations from
existing objects, without the need to rewrite code. Classes have their roots in
Simula-67. They are the central innovation of object-oriented languages such as
Smalltalk, Eiffel, C++, Java, and C#. They are also fundamental to several scripting
languages, notably Python and Ruby. In a different style, inheritance mechanisms
can be found in several languages that are not usually considered object-oriented,
including Modula-3, Ada 95, and Oberon. We will examine inheritance and its
impact on scope rules in Chapter 9 and in Section 13.4.4.
9 It is interesting to note that Turing, which was derived from Euclid, reverts to Modula-2 style
modules, in order to avoid implementation complexity [HMRC88, p. 9].
EXAMPLE 3.16 N-ary methods in C++
Module types and classes (ignoring issues related to inheritance) require only
simple changes to the scope rules defined for modules in the previous subsection.
Every instance A of a module type or class (e.g., every stack) has a separate copy
of the module or class’s variables. These variables are then visible when executing
one of A ’s operations. They may also be indirectly visible to the operations of some
other instance B if A is passed as a parameter to one of those operations. This rule
makes it possible in most object-oriented languages to construct binary (or more-ary) operations that can manipulate the variables of more than one instance of a
class. In C++, for example, we could create an operation that determines which
of two stacks contains a larger number of elements:
class stack {
    ...
    bool deeper_than(stack other) {     // function declaration
        return (top > other.top);
    }
    ...
};
...
if (A.deeper_than(B)) ...
Within the deeper_than operation of stack A , top refers to A.top . Because
deeper_than is an operation of class stack , however, it is able to refer not
only to the variables of A (which it can access directly by name), but also to the
variables of any other stack to which it has a reference. Because these variables
belong to a different stack, deeper_than must name that stack explicitly, as for
example in other.top . In a module-as-manager style program, of course, module
subroutines would access all instance variables via parameters.
Modules Containing Classes
EXAMPLE 3.17 Modules and classes in a large application
While there is a clear progression from modules to module types to classes, it is
not necessarily the case that classes are an adequate replacement for modules in all
cases. Suppose we are developing a complex “first person” game. Class hierarchies
may be just what we need to represent characters, possessions, buildings, goals,
and a host of other data abstractions. At the same time, especially on a project with
a large team of programmers, we will probably want to divide the functionality of
the game into large-scale subsystems such as graphics and rendering, physics, and
strategy. These subsystems are really not data abstractions, and we probably don’t
want the option to create multiple instances of them. They are naturally captured
with traditional modules.
Many applications have a similar need for both multi-instance abstractions and
functional subdivision. In recognition of this fact, many languages, including C++,
Java, C#, Python, and Ruby, provide separate class and module mechanisms.
3.3.6 Dynamic Scoping
In a language with dynamic scoping, the bindings between names and objects
depend on the flow of control at run time, and in particular on the order in which
subroutines are called. In comparison to the static scope rules discussed in the
previous section, dynamic scope rules are generally quite simple: the “current”
binding for a given name is the one encountered most recently during execution,
and not yet destroyed by returning from its scope.

 1. n : integer                 -- global declaration
 2. procedure first
 3.     n := 1
 4. procedure second
 5.     n : integer             -- local declaration
 6.     first()
 7. n := 2
 8. if read_integer() > 0
 9.     second()
10. else
11.     first()
12. write_integer(n)

Figure 3.9 Static versus dynamic scoping. Program output depends on both scope rules and,
in the case of dynamic scoping, a value read at run time.
EXAMPLE 3.18 Static vs dynamic scoping
Languages with dynamic scoping include APL, Snobol, TeX (the typesetting language with which this book was created), and early dialects of Lisp [MAE+…].10
Because the flow of control cannot in general be predicted in advance, the bindings between names and objects in a language
with dynamic scoping cannot in general be determined by a compiler. As a result,
many semantic rules in a language with dynamic scoping become a matter of
dynamic semantics rather than static semantics. Type checking in expressions and
argument checking in subroutine calls, for example, must in general be deferred
until run time. To accommodate all these checks, languages with dynamic scoping
tend to be interpreted, rather than compiled.
Consider the program in Figure 3.9. If static scoping is in effect, this program
prints a 1. If dynamic scoping is in effect, the output depends on the value read
at line 8 at run time: if the input is positive, the program prints a 2; otherwise it
prints a 1. Why the difference? At issue is whether the assignment to the variable
n at line 3 refers to the global variable declared at line 1 or to the local variable
declared at line 5. Static scope rules require that the reference resolve to the closest
lexically enclosing declaration, namely the global n . Procedure first changes n to
1, and line 12 prints this value. Dynamic scope rules, on the other hand, require
that we choose the most recent, active binding for n at run time.
10 Scheme and Common Lisp are statically scoped, though the latter allows the programmer
to specify dynamic scoping for individual variables. Static scoping was added to Perl in version 5. The programmer now chooses static or dynamic scoping explicitly in each variable
declaration.
EXAMPLE 3.19 Run-time errors with dynamic scoping
EXAMPLE 3.20 Customization via dynamic scoping
We create a binding for n when we enter the main program. We create another
when and if we enter procedure second . When we execute the assignment statement at line 3, the n to which we are referring will depend on whether we entered
first through second or directly from the main program. If we entered through
second , we will assign the value 1 to second ’s local n . If we entered from the main
program, we will assign the value 1 to the global n . In either case, the write at
line 12 will refer to the global n , since second ’s local n will be destroyed, along
with its binding, when control returns to the main program.
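C++ is statically scoped, but the dynamic-scoping behavior of Figure 3.9 can be emulated with an explicit run-time stack of bindings. The sketch below is only an emulation under stated assumptions: the names bindings, lookup, and run are hypothetical, and the parameter input stands in for the value returned by read_integer().

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Run-time stack of bindings: the most recent active one is at the back.
std::vector<std::pair<std::string, int>> bindings;

// Dynamic lookup: search from the most recent binding toward the oldest.
int& lookup(const std::string& name) {
    for (auto it = bindings.rbegin(); it != bindings.rend(); ++it)
        if (it->first == name) return it->second;
    throw std::runtime_error("unbound name: " + name);
}

void first() { lookup("n") = 1; }       // line 3: n := 1

void second() {
    bindings.push_back({"n", 0});       // line 5: local declaration of n
    first();                            // line 6
    bindings.pop_back();                // local n is destroyed on return
}

// Main program of Figure 3.9; returns the value line 12 would write.
int run(int input) {
    bindings.clear();
    bindings.push_back({"n", 0});       // line 1: global declaration of n
    lookup("n") = 2;                    // line 7
    if (input > 0) second();            // lines 8-9
    else first();                       // lines 10-11
    return lookup("n");                 // line 12
}
```

With a positive input, first is reached through second, so the assignment at line 3 changes second's local n and the function returns 2. With any other input, first is called directly, the assignment changes the global n, and the function returns 1.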
With dynamic scoping, errors associated with the referencing environment may
not be detected until run time. In Figure 3.10, for example, the declaration of local
variable max_score in procedure foo accidentally redefines a global variable used
by function scaled_score , which is then called from foo . Since the global max_score
is an integer, while the local max_score is a floating-point number, dynamic
semantic checks in at least some languages will result in a type clash message at
run time. If the local max_score had been an integer, no error would have been
detected, but the program would almost certainly have produced incorrect results.
This sort of error can be very hard to find.
The principal argument in favor of dynamic scoping is that it facilitates the
customization of subroutines. Suppose, for example, that we have a library routine print_integer that is capable of printing its argument in any of several bases
(decimal, binary, hexadecimal, etc.). Suppose further that we want the routine to
use decimal notation most of the time, and to use other bases only in a few special
cases; we do not want to have to specify a base explicitly on each individual call.
We can achieve this result with dynamic scoping by having print_integer obtain its
base from a nonlocal variable print_base . We can establish the default behavior by
declaring a variable print_base and setting its value to 10 in a scope encountered
early in execution. Then, any time we want to change the base temporarily, we can
write:
begin                               -- nested block
    print_base : integer := 16      -- use hexadecimal
    print_integer(n)
DESIGN & IMPLEMENTATION
Dynamic scoping
It is not entirely clear whether the use of dynamic scoping in Lisp and other
early interpreted languages was deliberate or accidental. One reason to think
that it may have been deliberate is that it makes it very easy for an interpreter to
look up the meaning of a name: all that is required is a stack of declarations (we
examine this stack more closely in Section 3.4.2). Unfortunately, this simple
implementation has a very high run-time cost, and experience indicates that
dynamic scoping makes programs harder to understand. The modern consensus seems to be that dynamic scoping is usually a bad idea (see Exercise 3.17
and Exploration 3.33 for two exceptions).
max_score : integer                     -- maximum possible score
function scaled_score(raw_score : integer) : real
    return raw_score / max_score * 100
...
procedure foo
    max_score : real := 0               -- highest percentage seen so far
    ...
    foreach student in class
        student.percent := scaled_score(student.points)
        if student.percent > max_score
            max_score := student.percent

Figure 3.10 The problem with dynamic scoping. Procedure scaled_score probably does not
do what the programmer intended when dynamic scope rules allow procedure foo to change
the meaning of max_score .
EXAMPLE 3.21 Multiple interface alternative
EXAMPLE 3.22 Static variable alternative
The problem with this argument is that there are usually other ways to
achieve the same effect, without dynamic scoping. One option would be to
have print_integer use decimal notation in all cases, and create another routine,
print_integer_with_base , that takes a second argument. In a language like Ada
or C++, one could make the base an optional (default) parameter of a single
print_integer routine, or use overloading to give the same name to both routines.
(We will consider default parameters in Section 8.3.3; overloading is discussed in
Section 3.5.2.)
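The default-parameter option might look as follows in C++. This is a sketch under stated assumptions: format_integer is a hypothetical stand-in for print_integer that returns the digits as a string instead of printing them, and it assumes a base between 2 and 16.

```cpp
#include <string>

// Hypothetical stand-in for print_integer. The base defaults to 10,
// so ordinary callers never mention it. Assumes 2 <= base <= 16.
std::string format_integer(unsigned n, unsigned base = 10) {
    const char digits[] = "0123456789abcdef";
    if (n == 0) return "0";
    std::string out;
    while (n > 0) {
        out.insert(out.begin(), digits[n % base]);  // prepend next digit
        n /= base;
    }
    return out;
}
```

A call such as format_integer(255) produces decimal output, while format_integer(255, 16) asks for hexadecimal explicitly. The caller must know about the second parameter, which is exactly the limitation the next paragraph discusses.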
Unfortunately, using two different routines for printing (or one routine with
two different sets of parameters) requires that the caller know what is going on.
In our example, alternative routines work fine if the calls are all made in the
scope in which the local print_base variable would have been declared. If that
scope calls subroutines that in turn call print_integer , however, we cannot in
general arrange for the called routines to use the alternative interface. A second alternative to dynamic scoping solves this problem: we can create a static
variable, either global or encapsulated with print_integer inside an appropriate
module, that controls the base. To change the print base temporarily, we can then
write:
begin                                       -- nested block
    print_base_save : integer := print_base
    print_base := 16                        -- use hexadecimal
    print_integer(n)
    print_base := print_base_save
The possibility that we may forget to restore the original value, of course, is a
potential source of bugs. With dynamic scoping the value is restored automatically.
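In C++ the save-and-restore pattern can be made automatic with a small guard object whose destructor puts the old value back. This idiom (often called RAII) is not part of the discussion above; it is offered only as one way a statically scoped language can recover the automatic restoration. The names base_guard, format_integer, and hex_then_default are hypothetical.

```cpp
#include <string>

unsigned print_base = 10;       // static variable that controls the base

// Hypothetical stand-in for print_integer; returns digits as a string.
std::string format_integer(unsigned n) {
    const char digits[] = "0123456789abcdef";
    if (n == 0) return "0";
    std::string out;
    while (n > 0) {
        out.insert(out.begin(), digits[n % print_base]);
        n /= print_base;
    }
    return out;
}

// Saves the current base on construction, restores it on destruction.
class base_guard {
    unsigned saved;
public:
    explicit base_guard(unsigned new_base) : saved(print_base) {
        print_base = new_base;
    }
    ~base_guard() { print_base = saved; }
    base_guard(const base_guard&) = delete;
    base_guard& operator=(const base_guard&) = delete;
};

std::string hex_then_default(unsigned n) {
    std::string result;
    {
        base_guard g(16);               // nested block: use hexadecimal
        result = format_integer(n);
    }                                   // old base restored here
    return result + " " + format_integer(n);
}
```

Because the restoration happens in the destructor, it also runs if the nested block exits early, so the programmer cannot forget it.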
3.4 Implementing Scope
To keep track of the names in a statically scoped program, a compiler relies on a
data abstraction called a symbol table. In essence, the symbol table is a dictionary:
it maps names to the information the compiler knows about them. The most
basic operations are to insert a new mapping (a name-to-object binding) or to
look up the information that is already present for a given name. Static scope rules
add complexity by allowing a given name to correspond to different objects—and
thus to different information—in different parts of the program. Most variations
on static scoping can be handled by augmenting a basic dictionary-style symbol
table with enter_scope and leave_scope operations to keep track of visibility.
Nothing is ever deleted from the table; the entire structure is retained throughout compilation, and then saved for use by debuggers or run-time reflection
mechanisms.
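A minimal C++ sketch of such a table appears below. It is not the LeBlanc and Cook design or any particular compiler's structure; the class name symbol_table, the use of a plain string as the "information" for each name, and the parent-index representation of nesting are all assumptions made for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

class symbol_table {
    struct scope {
        std::map<std::string, std::string> names;  // name -> information
        int parent;             // index of enclosing scope, -1 if none
    };
    std::vector<scope> scopes;  // every scope ever entered; never erased
    int current = -1;           // index of the innermost open scope
public:
    void enter_scope() {
        scopes.push_back(scope{{}, current});
        current = static_cast<int>(scopes.size()) - 1;
    }
    // Precondition: some scope is open.
    void leave_scope() { current = scopes[current].parent; }

    // Precondition: some scope is open.
    void insert(const std::string& name, const std::string& info) {
        scopes[current].names[name] = info;
    }
    // Information for the closest enclosing declaration, or nullptr.
    const std::string* lookup(const std::string& name) const {
        for (int s = current; s != -1; s = scopes[s].parent) {
            auto it = scopes[s].names.find(name);
            if (it != scopes[s].names.end()) return &it->second;
        }
        return nullptr;
    }
};
```

Leaving a scope only changes which scope is current. The entries of the closed scope stay in the vector, which mirrors the point that nothing is deleted and the whole structure remains available after compilation.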
In a language with dynamic scoping, an interpreter (or the output of a compiler)
must perform operations analogous to symbol table insert and lookup at run time. In principle, any organization used for a symbol table in a compiler could
be used to track name-to-object bindings in an interpreter, and vice versa. In
practice, implementations of dynamic scoping tend to adopt one of two specific
organizations: an association list or a central reference table.
IN MORE DEPTH
A symbol table with visibility support can be implemented in several different
ways. One appealing approach, due to LeBlanc and Cook [CL83], is described on
the PLP CD, along with both association lists and central reference tables.
An association list (or A-list for short) is simply a list of name/value pairs. When
used to implement dynamic scoping it functions as a stack: new declarations are
pushed as they are encountered, and popped at the end of the scope in which they
appeared. Bindings are found by searching down the list from the top. A central
reference table avoids the need for linear-time search by maintaining an explicit
mapping from names to their current meanings. Lookup is faster, but scope entry
and exit are somewhat more complex, and it becomes substantially more difficult
to save a referencing environment for future use (we discuss this issue further in
Section 3.6.1).
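The central reference table can be sketched in C++ as a hash table from each name to a stack of its bindings. This is an illustrative sketch, not the organization described on the PLP CD: the class name central_table, the int-valued bindings, and the per-scope name lists used to undo declarations are assumptions.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

class central_table {
    // Each name maps to a stack of bindings, most recent at the back.
    std::unordered_map<std::string, std::vector<int>> table;
    // Names declared in each open scope, so scope exit can undo them.
    std::vector<std::vector<std::string>> scope_names;
public:
    void enter_scope() { scope_names.emplace_back(); }

    // Precondition: some scope is open.
    void declare(const std::string& name, int value) {
        table[name].push_back(value);
        scope_names.back().push_back(name);
    }
    // Precondition: some scope is open.
    void leave_scope() {
        for (const std::string& name : scope_names.back())
            table[name].pop_back();     // drop this scope's bindings
        scope_names.pop_back();
    }
    // Current binding: a hash lookup rather than a search down a list.
    int& lookup(const std::string& name) {
        auto it = table.find(name);
        if (it == table.end() || it->second.empty())
            throw std::out_of_range("unbound name: " + name);
        return it->second.back();
    }
};
```

Lookup no longer walks a list, but leave_scope must visit every name the scope declared, which illustrates why scope entry and exit become more complex with this organization.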
CHECK YOUR UNDERSTANDING
21. Explain the importance of information hiding.
22. What is an opaque export?
23. Why might it be useful to distinguish between the header and the body of a
module?
24. What does it mean for a scope to be closed?
25. Explain the distinction between “modules as managers” and “modules as
types.”
26. How do classes differ from modules?
27. Why does the use of dynamic scoping imply the need for run-time type
checking?
28. Give an argument in favor of dynamic scoping. Describe how similar benefits
can be achieved in a language without dynamic scoping.
29. Explain the purpose of a compiler’s symbol table.
3.5 The Meaning of Names within a Scope
So far in our discussion of naming and scopes we have assumed that there is a
one-to-one mapping between names and visible objects at any given point in a
program. This need not be the case. Two or more names that refer to the same
object at the same point in the program are said to be aliases. A name that can refer
to more than one object at a given point in the program is said to be overloaded.
3.5.1 Aliases
EXAMPLE 3.23 Aliasing with parameters
Simple examples of aliases occur in the common blocks and equivalence statements of Fortran, and in the variant records and unions of languages like Pascal
and C (we will discuss these topics in detail in Section 7.3.4). They also arise
naturally in programs that make use of pointer-based data structures. A more
subtle way to create aliases in many languages is to pass a variable by reference to
a subroutine that also accesses that variable directly. Consider the following code
in C++.
double sum, sum_of_squares;
...
void accumulate(double& x)      // x is passed by reference
{
    sum += x;
    sum_of_squares += x * x;
}
...
accumulate(sum);
If sum is passed as an argument to accumulate , then sum and x will be aliases
for one another, and the program will probably not do what the programmer
intended. This type of error was one of the principal motivations for making
subroutines closed scopes in Euclid and Turing, as described in Section 3.3.4.
Given import lists, the compiler can identify when a subroutine call would create
an alias, and the language can prohibit it.
EXAMPLE 3.24 Aliases and code improvement
As a general rule, aliases tend to make programs more confusing than they
otherwise would be. They also make it much more difficult for a compiler
to perform certain important code improvements. Consider the following C
code:
int a, b, *p, *q;
...
a = *p;     /* read from the variable referred to by p */
*q = 3;     /* assign to the variable referred to by q */
b = *p;     /* read from the variable referred to by p */
The initial assignment to a will, on most machines, require that *p be loaded into
a register. Since accessing memory is expensive, the compiler will want to hang on
to the loaded value and reuse it in the assignment to b . It will be unable to do so,
however, unless it can verify that p and q cannot refer to the same object—that is,
that *p and *q are not aliases. While verification of this sort is possible in many
common cases, in general it’s uncomputable.
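To see why the reload matters, the three statements can be wrapped in a function and called with and without aliased arguments. The function name read_write_read and its return value are assumptions introduced for this illustration.

```cpp
// Returns the difference between the two reads of *p. A compiler may
// reuse the first loaded value for the second read only if it can
// show that p and q never refer to the same object.
int read_write_read(int* p, int* q) {
    int a = *p;     // read from the variable referred to by p
    *q = 3;         // assign to the variable referred to by q
    int b = *p;     // must be reloaded if *p and *q may be aliases
    return b - a;   // zero exactly when the store left *p unchanged
}
```

Called on two distinct variables that initially hold 1 and 2, the function returns 0. Called with both pointers referring to a single variable that holds 1, it returns 2, because the store through q changes the value seen by the second read. Reusing the register copy would give the wrong answer in the second case.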
DESIGN & IMPLEMENTATION
Pointers in C and Fortran
The tendency of pointers to introduce aliases is one of the reasons why Fortran
compilers have tended, historically, to produce faster code than C compilers:
pointers are heavily used in C, but missing from Fortran 77 and its predecessors.
It is only in recent years that sophisticated alias analysis algorithms have allowed
C compilers to rival their Fortran counterparts in speed of generated code.
Pointer analysis is sufficiently important that the designers of the C99 standard
decided to add a new keyword to the language. The restrict qualifier, when
attached to a pointer declaration, is an assertion on the part of the programmer
that the object to which the pointer refers has no alias in the current scope.
It is the programmer’s responsibility to ensure that the assertion is correct;
the compiler need not attempt to check it. C99 also introduces strict aliasing.
This allows the compiler to assume that pointers of different types will never
refer to the same location in memory. Most compilers provide a command-line
option to disable optimizations that exploit this rule; otherwise (poorly written)
legacy programs may behave incorrectly when compiled at higher optimization
levels.
declare
    type month is (jan, feb, mar, apr, may, jun,
                   jul, aug, sep, oct, nov, dec);
    type print_base is (dec, bin, oct, hex);
    mo : month;
    pb : print_base;
begin
    mo := dec;      -- the month dec (since mo has type month)
    pb := oct;      -- the print_base oct (since pb has type print_base)
    print(oct);     -- error! insufficient context
                    --   to decide which oct is intended

Figure 3.11 Overloading of enumeration constants in Ada.

3.5.2 Overloading
EXAMPLE 3.25 Overloaded enumeration constants in Ada
EXAMPLE 3.26 Resolving ambiguous overloads
Most programming languages provide at least a limited form of overloading. In C,
for example, the plus sign ( + ) is used to name several different functions, including
signed and unsigned integer and floating-point addition. Most programmers don’t
worry about the distinction among these functions (all are based on the
same mathematical concept, after all), but they take arguments of different types
and perform very different operations on the underlying bits. A slightly more
sophisticated form of overloading appears in the enumeration constants of Ada.
In Figure 3.11, the constants oct and dec refer either to months or to numeric
bases, depending on the context in which they appear.
Within the symbol table of a compiler, overloading must be handled by arranging for the lookup routine to return a list of possible meanings for the requested
name. The semantic analyzer must then choose from among the elements of the
list based on context. When the context is not sufficient to decide, as in the call to
print in Figure 3.11, then the semantic analyzer must announce an error. Most
languages that allow overloaded enumeration constants allow the programmer to
provide appropriate context explicitly. In Ada, for example, one can say
print(month'(oct));
In Modula-3 and C#, every use of an enumeration constant must be prefixed with
a type name, even when there is no chance of ambiguity:
mo := month.dec;
pb := print_base.oct;
EXAMPLE 3.27 Overloading in Ada and C++
In C, C++, and standard Pascal, one cannot overload enumeration constants at
all; every constant visible in a given scope must be distinct.
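The restriction just described applies to the unscoped enum of C and of C++ as it stood when this edition was written. The later C++11 standard added scoped enumerations ( enum class ), which behave much like the Modula-3 and C# rule: the same constant name may appear in several types, but every use must be qualified. A brief sketch, reusing the type names of Figure 3.11:

```cpp
// Scoped enumerations (C++11 and later). dec and oct appear in both
// types without conflict, because each use names its type explicitly.
enum class month { jan, feb, mar, apr, may, jun,
                   jul, aug, sep, oct, nov, dec };
enum class print_base { dec, bin, oct, hex };

month mo = month::dec;              // the month dec
print_base pb = print_base::oct;    // the print_base oct
```

Declaring both types with plain enum instead would be rejected, since dec and oct would then be declared twice in the same scope.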
Both Ada and C++ have elaborate facilities for overloading subroutine names.
(Most of the C++ facilities carry over to Java and C#.) A given name may refer
to an arbitrary number of subroutines in the same scope, so long as the subroutines differ in the number or types of their arguments. C++ examples appear in
Figure 3.12.
struct complex {
    double real, imaginary;
};
enum base {dec, bin, oct, hex};
int i;
complex x;
void print_num(int n) { ...
void print_num(int n, base b) { ...
void print_num(complex c) { ...
print_num(i);         // uses the first function above
print_num(i, hex);    // uses the second function above
print_num(x);         // uses the third function above
Figure 3.12 Simple example of overloading in C++. In each case the compiler can tell which
function is intended by the number and types of arguments.
Redefining Built-in Operators
EXAMPLE 3.28 Operator overloading in Ada
EXAMPLE 3.29 Operator overloading in C++
Ada, C++, C#, Fortran 90, and Haskell also allow the built-in arithmetic operators
(+, -, *, etc.) to be overloaded with user-defined functions. Ada, C++, and C#
do this by defining alternative prefix forms of each operator, and defining the
usual infix forms to be abbreviations (or “syntactic sugar”) for the prefix forms.
In Ada, A + B is short for "+"(A, B). If "+" is overloaded, it must be possible to
determine the intended meaning from the types of A and B.
In C++ and C#, which are object-oriented, A + B may be short for either
operator+(A, B) or A.operator+(B). In the latter case, A is an instance of a
class (module type) that defines an operator+ function. In C++:
class complex {
    double real, imaginary;
    ...
public:
    complex operator+(complex other) {
        return complex(real + other.real, imaginary + other.imaginary);
    }
    ...
};
...
complex A, B, C;
...
C = A + B;            // uses user-defined operator+
C# syntax is similar.
This class-based style of operator abbreviation resembles a similar facility in Clu. Since the abbreviation expands to an unambiguous name (i.e., A’s
operator+, not any other), one might be tempted to say that no “real” overloading is involved, and this is in fact the case in Clu. In C++ and C#, however, there
may be more than one definition of A.operator+, allowing the second argument
to be of several types. Fortran 90 provides a special interface construct that can
be used to associate an operator with some named binary function.
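The two expansions described above can be sketched in C++ as follows. This is only an illustration: the type name Complex, its fields, and the particular set of overloads are assumptions made for the example, not definitions taken from the text. The member operator+ has two definitions (so the second argument may be of more than one type), and a non-member operator+ covers the case in which the left operand is a double.

```cpp
// Hypothetical Complex type, used only for illustration.
struct Complex {
    double real;
    double imaginary;

    // A + B with B a Complex: shorthand for A.operator+(B).
    Complex operator+(const Complex& other) const {
        return Complex{real + other.real, imaginary + other.imaginary};
    }

    // A + d with d a double: a second definition of A.operator+.
    Complex operator+(double d) const {
        return Complex{real + d, imaginary};
    }
};

// d + A cannot be a member of double, so it uses the
// non-member form operator+(d, A).
Complex operator+(double d, const Complex& c) {
    return c + d;
}
```

In each use the compiler selects a definition from the types of the two operands, exactly as it does for ordinary overloaded subroutine names.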
3.5.3 Polymorphism and Related Concepts
EXAMPLE 3.30 Overloading vs coercion
In the case of subroutine names, it is worth distinguishing overloading from the
closely related concepts of coercion and polymorphism. All three can be used, in
certain circumstances, to pass arguments of multiple types to (or return values
of multiple types from) a given named routine. The syntactic similarity, however,
hides significant differences in semantics and pragmatics.
Suppose, for example, that we wish to be able to compute the minimum of
two values of either integer or floating-point type. In Ada we might obtain this
capability using overloaded functions:
function min(a, b : integer) return integer is ...
function min(x, y : real) return real is ...
In C, however, we could get by with a single function:
double min(double x, double y) { ...
If the C function is called in a context that expects an integer (e.g., i = min(j, k)),
the compiler will automatically convert the integer arguments (j and k) to
floating-point numbers, call min, and then convert the result back to an integer
(via truncation). So long as floating-point (double) variables have at least as many
significant bits as integers (which they do in the case of 32-bit integers and 64-bit
double-precision floating-point), the result will be numerically correct.
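The conversions just described can be made visible with a small sketch, written here in C++ (whose rules for these arithmetic conversions follow C). The names min_double and min_via_coercion are illustrative assumptions; the first is renamed from min only to avoid confusion with the standard library's std::min.

```cpp
// The single floating-point routine from the text, under a
// hypothetical name.
double min_double(double x, double y) {
    return x < y ? x : y;
}

// Calls min_double in an integer context. The compiler inserts
// three conversions: j -> double, k -> double, and the double
// result -> int (by truncation) in the initialization of i.
int min_via_coercion(int j, int k) {
    int i = min_double(j, k);
    return i;
}
```

Assuming 32-bit int and 64-bit double, every int value is exactly representable as a double, so the round trip loses nothing.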
Coercion is the process by which a compiler automatically converts a value of
one type into a value of another type when that second type is required by the
surrounding context. Coercion is a somewhat controversial subject in language
design. As we shall see in Section 7.2.2, Ada coerces nothing but explicit constants,
subranges, and in certain cases arrays with the same type of elements. Pascal
will coerce integers to floating point in expressions and assignments. Fortran will
also coerce floating-point values to integers in assignments, at a potential loss of
precision. C will perform these same coercions on arguments to functions. Most
scripting languages provide a very rich set of built-in coercions. C++ allows the
programmer to extend its built-in set with user-defined coercions.
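As a hedged illustration of user-defined coercions in C++, the sketch below uses a hypothetical Millimeters type (not from the text). A non-explicit one-argument constructor lets the compiler coerce a double to a Millimeters, and a conversion operator lets it coerce in the other direction; the helper functions half and length_of are likewise invented for the example.

```cpp
// Hypothetical type: a length stored in millimeters.
struct Millimeters {
    double value;

    // Converting constructor: permits implicit double -> Millimeters.
    Millimeters(double v) : value(v) {}

    // Conversion operator: permits implicit Millimeters -> double.
    operator double() const { return value; }
};

// Expects a double; a Millimeters argument is coerced on the way in.
double half(double d) { return d / 2.0; }

// Expects a Millimeters; a double argument is coerced on the way in.
double length_of(Millimeters m) { return m.value; }
```

Marking the constructor or the operator explicit would disable the corresponding implicit conversion, which is one way C++ programmers limit the hidden effects of coercion.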
In Example 3.30, overloading allows the Ada compiler to choose between two
different versions of min, depending on the types of the arguments. Coercion
allows the C compiler to modify the arguments to fit a single subroutine. Polymorphism provides yet another option: it allows a single subroutine to accept
unconverted arguments of multiple types.
The term “polymorphic” comes from the Greek, and means “having multiple
forms.” It is applied to code—both data structures and subroutines—that can
work with values of multiple types. For this concept to make sense, the types must
generally have certain characteristics in common, and the code must not depend
on any other characteristics. The commonality is usually captured in one of two
main ways. In parametric polymorphism the code takes a type (or set of types) as
a parameter, either explicitly or implicitly. In subtype polymorphism the code is
designed to work with values of some specific type T, but the programmer can
define additional types to be extensions or refinements of T, and the polymorphic
code will work with these subtypes as well.
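The two forms of polymorphism can be contrasted in a short C++ sketch. All names here (smaller, Shape, Square, Rect, doubled_area) are assumptions chosen for illustration. The template takes a type as an explicit parameter and depends only on that type supporting operator<; the function doubled_area is written against one specific type and works unchanged with any type derived from it.

```cpp
#include <string>

// Parametric polymorphism: T is a type parameter. The code relies
// only on T supporting operator<.
template <typename T>
T smaller(T a, T b) {
    return b < a ? b : a;
}

// Subtype polymorphism: code written for Shape also works with
// any type that extends Shape.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

struct Rect : Shape {
    double w, h;
    Rect(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

// Accepts any subtype of Shape without knowing which one it is.
double doubled_area(const Shape& s) {
    return 2.0 * s.area();
}
```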
Explicit parametric polymorphism is also known as genericity. Generic facilities
appear in Ada, C++, Clu, Eiffel, Modula-3, Java, and C#, among others. Readers
familiar with C++ will know them by the name of templates. We will consider them
further in Sections 8.4 and 9.4.4. Implicit parametric polymorphism appears in
the Lisp and ML families of languages, and in the various scripting languages;
we will consider it further in Sections 7.2.4 and 10.3. Subtype polymorphism is
fundamental to object-oriented languages, in which subtypes (classes) are said to
inherit the methods of their parent types. We will consider inheritance further in
Section 9.4.
Generics (explicit parametric polymorphism) are usually, though not always,
implemented by creating multiple copies of the polymorphic code, one specialized for each needed concrete type. Inheritance (subtype polymorphism) is almost
always implemented by creating a single copy of the code, and by including in the
representation of objects sufficient “metadata” (data about the data) that the code
can tell when to treat them differently. Implicit parametric polymorphism can
be implemented either way. Most Lisp implementations use a single copy of the
code, and delay all semantic checks until run time. ML and its descendants perform all type checking at compile time. They typically generate a single copy
of the code where possible (e.g., when all the types in question are records that
share a similar representation), and multiple copies when necessary (e.g., when
polymorphic arithmetic must operate on both integer and floating-point numbers). Object-oriented languages that perform type checking at compile time,
including C++, Eiffel, Java, and C#, generally provide both generics and inheritance. Smalltalk (Section 9.6.1), Objective-C, Python, and Ruby use a single
mechanism (with run-time checking) to provide both parametric and subtype
polymorphism.
DESIGN & IMPLEMENTATION
Coercion and overloading
In addition to their semantic differences, coercion and overloading can have
very different costs. Calling an integer-specific version of min would be much
more efficient than calling the floating-point version of Example 3.30 with
integer arguments: it would use integer arithmetic for the comparison (which
may be cheaper in and of itself), and would avoid three conversion operations.
One of the arguments against supporting coercion in a language is that it tends
to impose hidden costs.
generic
    type T is private;
    with function "<"(x, y : T) return Boolean;
function min(x, y : T) return T;
function min(x, y : T) return T is
begin
    if x < y then return x;
    else return y;
    end if;
end min;
function string_min is new min(string, "<");
function date_min is new min(date, date_precedes);
Figure 3.13 Use of a generic subroutine in Ada.
EXAMPLE 3.31 Generic min function in Ada
EXAMPLE 3.32 Implicit polymorphism in Scheme
As a concrete example of generics, consider the overloaded min functions of
Example 3.30. The source code for the integer and floating-point versions is likely
to be very similar. We can exploit this similarity to define a single version that
works not only for integers and reals, but for any type whose values are totally
ordered. This code appears in Figure 3.13. The initial (bodyless) declaration of min
is preceded by a generic clause specifying that two things are required in order to
create a concrete instance of a minimum function: a type, T, and a corresponding
comparison routine. This declaration is followed by the actual code for min. Given
appropriate declarations of string and date types and comparison routines (not
shown), we can create functions to return the lesser of pairs of objects of these types
as shown in the last two lines. (The "<" operation mentioned in the definition
of string_min is presumably overloaded; the compiler resolves the overloading
by finding the version of "<" that takes arguments of type T, where T is already
known to be string.)
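For comparison, here is a rough C++ counterpart of the Ada generic in Figure 3.13, offered as a sketch rather than an exact translation. The template takes the element type and the comparison routine as parameters. The Date type and the body of date_precedes are hypothetical, since the text does not show them, and the name generic_min is used instead of min to avoid confusion with std::min.

```cpp
#include <string>

// T is the type parameter; Less is the comparison routine
// supplied when the template is instantiated.
template <typename T, typename Less>
T generic_min(const T& x, const T& y, Less less) {
    return less(x, y) ? x : y;
}

// Hypothetical date type and comparison routine.
struct Date {
    int year, month, day;
};

bool date_precedes(const Date& a, const Date& b) {
    if (a.year != b.year) return a.year < b.year;
    if (a.month != b.month) return a.month < b.month;
    return a.day < b.day;
}

// Analogues of string_min and date_min in Figure 3.13.
std::string string_min(const std::string& a, const std::string& b) {
    return generic_min(a, b,
        [](const std::string& x, const std::string& y) { return x < y; });
}

Date date_min(const Date& a, const Date& b) {
    return generic_min(a, b, date_precedes);
}
```

A typical C++ compiler generates a separate copy of generic_min for each combination of type arguments, which matches the usual implementation strategy for generics described earlier.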
With the implicit parametric polymorphism of Lisp, ML, and their descendants,
the programmer need not specify a type parameter. The Scheme definition of min
looks like this:
(define min (lambda (a b) (if (< a b) a b)))
It makes no mention of types. The typical Scheme implementation employs an
interpreter that examines the arguments to min and determines, at run time,
whether they support a < operator. (Like all Lisp dialects, Scheme puts function
names inside parentheses, right in front of the arguments. The lambda keyword
introduces the parameter list and body of a function.) Given the definition above,
the expression (min 123 456) evaluates to 123; (min 3.14159 2.71828) evaluates to 2.71828. The expression (min "abc" "def") produces a run-time error
when evaluated, because the string comparison operator is named string<?,
not <.
EXAMPLE 3.33 Implicit polymorphism in Haskell
The Haskell version of min is similar:
min a b = if a < b then a else b
This version works for values of any totally ordered type, including strings. It is
type checked at compile time, using a sophisticated system of type inference (to be
described in Section 7.2.4).
So what exactly is the difference between the overloaded min functions of
Example 3.30 and the generic version of Figure 3.13? The answer lies in the
generality of the code. With overloading the programmer must write a separate
copy of the code, by hand, for every type with a min operation. Generics allow the
compiler (in the typical implementation) to create a copy automatically for every
needed type. The similarity of the calling syntax and of the generated code has
led some authors to refer to overloading as ad hoc (special case) polymorphism.
There is no particular reason, however, for the programmer to think of generics in
terms of multiple copies: from a semantic (conceptual) point of view, overloaded
subroutines use a single name for more than one thing; a polymorphic subroutine
is a single thing.
CHECK YOUR UNDERSTANDING
30. What are aliases? Why are they considered a problem in language design and
implementation?
31. Explain the value of the restrict qualifier in C99.
32. Explain the differences among overloading, coercion, and polymorphism.
33. What is operator overloading? Explain its relationship to “ordinary” overloading in C++.
34. Define parametric and subtype polymorphism. Explain the distinction between explicit and implicit parametric polymorphism. Which is also known as
genericity?
35. Why is overloading sometimes referred to as ad hoc polymorphism?
3.6 The Binding of Referencing Environments
We have seen in Section 3.3 how scope rules determine the referencing environment of a given statement in a program. Static scope rules specify that the
referencing environment depends on the lexical nesting of program blocks in
which names are declared. Dynamic scope rules specify that the referencing environment depends on the order in which declarations are encountered at run time.
type person = record
    ...
    age : integer
    ...
threshold : integer
people : database
function older_than_threshold(p : person) : boolean
    return p.age ≥ threshold
procedure print_person(p : person)
    -- Call appropriate I/O routines to print record on standard output.
    -- Make use of nonlocal variable line_length to format data in columns.
    ...
procedure print_selected_records(db : database;
        predicate, print_routine : procedure)
    line_length : integer
    if device_type(stdout) = terminal
        line_length := 80
    else
        -- Standard output is a file or printer.
        line_length := 132
    foreach record r in db
        -- Iterating over these may actually be
        -- a lot more complicated than a 'for' loop.
        if predicate(r)
            print_routine(r)
-- main program
...
threshold := 35
print_selected_records(people, older_than_threshold, print_person)
Figure 3.14 Program to illustrate the importance of binding rules. One might argue that deep binding is appropriate for the environment of function older_than_threshold (for access to threshold), while shallow binding is appropriate for the environment of procedure print_person (for access to line_length).
EXAMPLE 3.34 Deep and shallow binding
An additional issue that we have not yet considered arises in languages that allow
one to create a reference to a subroutine, for example by passing it as a parameter.
When should scope rules be applied to such a subroutine: when the reference
is first created, or when the routine is finally called? The answer is particularly
important for languages with dynamic scoping, though we shall see that it matters
even in languages with static scoping.
A dynamic scoping example appears in Figure 3.14. Procedure print_selected_records
is assumed to be a general-purpose routine that knows how to traverse
the records in a database, regardless of whether they represent people, sprockets,
or salads. It takes as parameters a database, a predicate to make print/don’t print
decisions, and a subroutine that knows how to format the data in the records of
this particular database. In Section 3.3.6 we hypothesized a print_integer library
routine that would print in any of several bases, depending on the value of a
nonlocal variable print_base. Here we have hypothesized in a similar fashion that
print_person uses the value of nonlocal variable line_length to calculate the number
and width of columns in its output. In a language with dynamic scoping, it is
natural for procedure print_selected_records to declare and initialize this variable
locally, knowing that code inside print_routine will pick it up if needed. For this
coding technique to work, the referencing environment of print_routine must not
be created until the routine is actually called by print_selected_records. This late
binding of the referencing environment of a subroutine that has been passed as a
parameter is known as shallow binding. It is usually the default in languages with
dynamic scoping.
For function older_than_threshold, by contrast, shallow binding may not work
well. If, for example, procedure print_selected_records happens to have a local
variable named threshold, then the variable set by the main program to influence
the behavior of older_than_threshold will not be visible when the function is finally
called, and the predicate will be unlikely to work correctly. In such a situation, the
code that originally passes the function as a parameter has a particular referencing
environment (the current one) in mind; it does not want the routine to be called
in any other environment. It therefore makes sense to bind the environment at the
time the routine is first passed as a parameter, and then restore that environment
when the routine is finally called. This early binding of the referencing environment is known as deep binding. The need for deep binding is sometimes referred
to as the funarg problem in Lisp.
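The difference between the two binding times can be simulated in C++, even though C++ itself is statically scoped. The sketch below models dynamic scoping with an association list (a vector searched from the most recent binding backward). Everything in it is an assumption made for illustration: the names Env, lookup, call_with_local_threshold, and predicate_result are hypothetical, and the "deep" case copies the whole list where a real implementation would more likely save a pointer into it.

```cpp
#include <string>
#include <utility>
#include <vector>

// Association list simulating dynamic scoping: the most recent
// binding of a name is found by searching from the back.
using Env = std::vector<std::pair<std::string, int>>;

int lookup(const Env& env, const std::string& name) {
    for (auto it = env.rbegin(); it != env.rend(); ++it) {
        if (it->first == name) return it->second;
    }
    return -1;  // unbound; not expected in this sketch
}

// The predicate of Figure 3.14, reading threshold from whatever
// environment it is handed.
bool older_than_threshold(const Env& env, int age) {
    return age >= lookup(env, "threshold");
}

// A routine that happens to declare its own local named threshold
// before invoking the predicate passed to it.
bool call_with_local_threshold(Env& current, const Env* captured, int age) {
    current.push_back({"threshold", 100});  // local declaration
    // Deep binding: use the environment saved when the predicate was passed.
    // Shallow binding: use the environment in effect at the time of the call.
    bool result = older_than_threshold(captured ? *captured : current, age);
    current.pop_back();                     // the local goes out of scope
    return result;
}

bool predicate_result(bool deep, int age) {
    Env env;
    env.push_back({"threshold", 35});       // set by the main program
    Env saved = env;                        // snapshot taken at pass time
    return call_with_local_threshold(env, deep ? &saved : nullptr, age);
}
```

For a person of age 40, the deep-binding call compares against the main program's threshold of 35, while the shallow-binding call sees the unrelated local value 100 and gives the opposite answer.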
3.6.1 Subroutine Closures
Deep binding is implemented by creating an explicit representation of a referencing environment (generally the one in which the subroutine would execute if
called at the present time) and bundling it together with a reference to the subroutine. The bundle as a whole is referred to as a closure. Usually the subroutine
itself can be represented in the closure by a pointer to its code. In a language with
dynamic scoping, the representation of the referencing environment depends on
whether the language implementation uses an association list or a central reference table for run-time lookup of names; we consider these alternatives at the end
of Section 3.4.2.
Although shallow binding is usually the default in languages with dynamic
scoping, deep binding may be available as an option. In early dialects of Lisp, for
example, the built-in primitive function takes a function as its argument and
returns a closure whose referencing environment is the one in which the function
would execute if called at the present time. This closure can then be passed as a
parameter to another function. If and when it is eventually called, it will execute in
the saved environment. (Closures work slightly differently from “bare” functions
in most Lisp dialects: they must be called by passing them to the built-in primitives
funcall or apply.)
EXAMPLE 3.35 Binding rules with static scoping
Deep binding is generally the default in languages with static (lexical) scoping.
At first glance, one might be tempted to think that the binding time of referencing
environments would not matter in languages with static scoping. After all, the
meaning of a statically scoped name depends on its lexical nesting, not on the
flow of execution, and this nesting is the same whether it is captured at the time a
subroutine is passed as a parameter or at the time the subroutine is called. The catch
is that a running program may have more than one instance of an object that is
declared within a recursive subroutine. A closure in a language with static scoping
captures the current instance of every object, at the time the closure is created.
When the closure’s subroutine is called, it will find these captured instances, even
if newer instances have subsequently been created by recursive calls.
One could imagine combining static scoping with shallow binding [VF82],
but the combination does not seem to make much sense, and does not appear
to have been adopted in any language. Figure 3.15 contains a Pascal program
that illustrates the impact of binding rules in the presence of static scoping. This
program prints a 1. With shallow binding it would print a 2.
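The behavior of Figure 3.15 can be approximated in C++ with a lambda that captures the enclosing invocation's parameter by reference. This is a sketch under stated assumptions: C++ has no nested procedures, so the lambda B stands in for the nested Pascal procedure, and the hypothetical variable printed records the value that writeln would have output.

```cpp
#include <functional>

// Records what B would have printed, so the result can be inspected.
static int printed = 0;

void A(int I, const std::function<void()>& P) {
    // B refers to the I of the invocation of A in which B is created.
    auto B = [&I]() { printed = I; };
    if (I > 1) {
        P();       // calls the B created by the earlier invocation
    } else {
        A(2, B);   // pass B, bound to this invocation's I, as a parameter
    }
}

int run_binding_example() {
    printed = 0;
    A(1, []() {});  // plays the role of the empty procedure C
    return printed;
}
```

When P is finally called there are two instances of I on the stack. Because the closure was created in the first invocation of A, it sees the instance whose value is 1, matching the deep binding result described above.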
It should be noted that binding rules matter with static scoping only when
accessing objects that are neither local nor global, but are defined at some intermediate level of nesting. If an object is local to the currently executing subroutine,
then it does not matter whether the subroutine was called directly or through a
closure; in either case local objects will have been created when the subroutine
started running. If an object is global, there will never be more than one instance,
since the main body of the program is not recursive. Binding rules are therefore
irrelevant in languages like C, which has no nested subroutines, or Modula-2,
which allows only outermost subroutines to be passed as parameters, thus ensuring that any variable defined outside the subroutine is global. (Binding rules are
also irrelevant in languages like PL/I and Ada 83, which do not permit subroutines
to be passed as parameters at all.)
Suppose then that we have a language with static scoping in which nested
subroutines can be passed as parameters, with deep binding. To represent a closure
for subroutine S, we can simply save a pointer to S’s code together with the static
link that S would use (see Figure 3.5) if it were called right now, in the current
environment. When S is finally called, we temporarily restore the saved static link,
rather than creating a new one. When S follows its static chain to access a nonlocal
object, it will find the object instance that was current at the time the closure was
created. This instance may not have the value it had at the time the closure was
created, but its identity, at least, will reflect the intent of the closure’s creator.
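The representation just described, a code pointer bundled with a saved static link, can be spelled out by hand. The following C++ sketch models frames and static links explicitly; it is a simplified illustration rather than what any particular compiler generates, and the names Frame, Closure, B_code, C_code, A_explicit, and run_static_link_example are all hypothetical.

```cpp
// One activation record: the static link to the lexically
// enclosing frame, plus the local (here, parameter) I.
struct Frame {
    Frame* static_link;
    int I;
};

// A closure: a pointer to the routine's code together with the
// static link the routine would use if called right now.
struct Closure {
    int (*code)(Frame* static_link);
    Frame* env;
};

// Body of nested routine B: follows its static link to reach A's I.
int B_code(Frame* static_link) {
    return static_link->I;
}

// Body of the empty routine C.
int C_code(Frame*) {
    return 0;
}

// Calling through a closure restores the saved static link.
int call(const Closure& c) {
    return c.code(c.env);
}

int A_explicit(Frame* outer, int I, const Closure* P) {
    Frame self{outer, I};
    Closure B{B_code, &self};  // captures this invocation's frame
    if (I > 1) return call(*P);
    return A_explicit(outer, 2, &B);
}

int run_static_link_example() {
    Frame main_frame{nullptr, 0};
    Closure C{C_code, &main_frame};
    return A_explicit(&main_frame, 1, &C);
}
```

The closure for B is built in the first invocation, so its saved static link points to the frame in which I is 1, even though a newer frame with I equal to 2 is on top of the stack when B runs.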
3.6.2 First-Class Values and Unlimited Extent
program binding_example(input, output);
procedure A(I : integer; procedure P);
    procedure B;
    begin
        writeln(I);
    end;
begin (* A *)
    if I > 1 then
        P
    else
        A(2, B);
end;
procedure C; begin end;
begin (* main *)
    A(1, C);
end.
[Run-time stack, top to bottom: B; A (I == 2, P == B); A (I == 1, P == C); main program]
Figure 3.15 Deep binding in Pascal. At right is a conceptual view of the run-time stack.
Referencing environments captured in closures are shown as dashed boxes and arrows. When B
is called via formal parameter P, two instances of I exist. Because the closure for P was created in
the initial invocation of A, B’s static link (solid arrow) points to the frame of that earlier invocation.
B uses that invocation’s instance of I in its writeln statement, and the output is a 1.
In general, a value in a programming language is said to have first-class status
if it can be passed as a parameter, returned from a subroutine, or assigned into
a variable. Simple types such as integers and characters are first-class values in
most programming languages. By contrast, a “second-class” value can be passed
as a parameter, but not returned from a subroutine or assigned into a variable,
and a “third-class” value cannot even be passed as a parameter. As we shall see
in Section 8.3.2, labels are third-class values in most programming languages,
but second-class values in Algol. Subroutines display the most variation. They
are first-class values in all functional programming languages and most scripting
languages. They are also first-class values in C# and, with some restrictions, in
several other imperative languages, including Fortran, Modula-2 and -3, Ada 95,
C, and C++.[11] They are second-class values in most other imperative languages,
and third-class values in Ada 83.
[11] Some authors would say that first-class status requires anonymous function definitions (lambda expressions) that can be embedded in other expressions. C#, most scripting languages, and all functional languages meet this requirement, but most imperative languages do not.
[Left: stack with plus_x (x = 2, rtn = anon) above main program. Right: stack with anon (y = 3) above main program.]
Figure 3.16 The need for unlimited extent. When function plus_x is called in Example 3.36,
it returns (left side of the figure) a closure containing an anonymous function. The referencing
environment of that function encompasses both plus_x and main, including the local variables of plus_x itself. When the anonymous function is subsequently called (right side of the
figure), it must be able to access variables in the closure’s environment, in particular the x
inside plus_x.
EXAMPLE 3.36 Returning a first-class subroutine in Scheme
Our discussion of binding so far has considered only second-class subroutines.
First-class subroutines in a language with nested scopes introduce an additional
level of complexity: they raise the possibility that a reference to a subroutine may
outlive the execution of the scope in which that routine was declared. Consider
the following example in Scheme:
the following example in Scheme:
1. (define plus-x (lambda (x)
2.                  (lambda (y) (+ x y))))
3. ...
4. (let ((f (plus-x 2)))
5.   (f 3))                          ; returns 5
Here the let construct on line 4 declares a new function, f, which is the result of
calling plus-x with argument 2. Function plus-x is defined at line 1. It returns
the (unnamed) function declared at line 2. But that function refers to parameter x
of plus-x. When f is called at line 5, its referencing environment will include the
x in plus-x, despite the fact that plus-x has already returned (see Figure 3.16).
Somehow we must ensure that x remains available.
If local objects were destroyed (and their space reclaimed) at the end of each
scope’s execution, then the referencing environment captured in a long-lived closure might become full of dangling references. To avoid this problem, most functional languages specify that local objects have unlimited extent: their lifetimes
continue indefinitely. Their space can be reclaimed only when the garbage collection system is able to prove that they will never be used again. Local objects
(other than own/static variables) in most imperative languages have limited
extent: they are destroyed at the end of their scope’s execution. (C# and Smalltalk are exceptions to the rule, as are most scripting languages.) Space for local
objects with limited extent can be allocated on a stack. Space for local objects with
unlimited extent must generally be allocated on a heap.
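C++ (since the 2011 standard, which postdates much of this discussion) offers one way to see the issue. In the sketch below, the hypothetical function plus_x mirrors the Scheme plus-x. Its local x has limited extent, so the returned lambda captures x by value: the closure object carries its own copy, which survives after plus_x returns. Capturing by reference instead (writing [&x]) would leave exactly the kind of dangling reference described above.

```cpp
#include <functional>

// Returns a function that adds x to its argument. The capture [x]
// copies x into the closure object, so the copy outlives this call.
// A by-reference capture ([&x]) would dangle once plus_x returns.
std::function<int(int)> plus_x(int x) {
    return [x](int y) { return x + y; };
}
```

The copy lives inside the closure object rather than in the stack frame of plus_x, so the frame itself can still be reclaimed when plus_x returns.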
Given the desire to maintain stack-based allocation for the local variables
of subroutines, imperative languages with first-class subroutines must generally adopt alternative mechanisms to avoid the dangling reference problem for
closures. C, C++, and (pre-Fortran 90) Fortran, of course, do not have nested
subroutines. Modula-2 allows references to be created only to outermost subroutines (outermost routines are first-class values; nested routines are third-class
values). Modula-3 allows nested subroutines to be passed as parameters, but only
outermost routines to be returned or stored in variables (outermost routines are
first-class values; nested routines are second-class values). Ada 95 allows a nested
routine to be returned, but only if the scope in which it was declared is the same
as, or larger than, the scope of the declared return type. This containment rule,
while more conservative than strictly necessary (it forbids the Ada equivalent of
Figure 3.15), makes it impossible to propagate a subroutine reference to a portion
of the program in which the routine’s referencing environment is not active.
3.6.3 Object Closures
EXAMPLE 3.37 An object closure in Java
As noted in Section 3.6.1, the referencing environment in a closure will be nontrivial only when passing a nested subroutine. This means that the implementation of
first-class subroutines is trivial in a language without nested subroutines. At the
same time, it means that a programmer working in such a language is missing
a useful feature: the ability to pass a subroutine with context. In object-oriented
languages, there is an alternative way to achieve a similar effect: we can encapsulate our subroutine as a method of a simple object, and let the object’s fields hold
context for the method. In Java we might write the equivalent of Example 3.36 as
follows.
interface IntFunc {
    public int call(int i);
}
class PlusX implements IntFunc {
    final int x;
    PlusX(int n) { x = n; }
    public int call(int i) { return i + x; }
}
...
IntFunc f = new PlusX(2);
System.out.println(f.call(3));       // prints 5
Here the interface IntFunc defines a static type for objects enclosing a function
from integers to integers. Class PlusX is a concrete implementation of this type,
and can be instantiated for any constant x. Where the Scheme code in Example 3.36
captured x in the subroutine closure returned by (plus-x 2), the Java code here
captures x in the object closure returned by new PlusX(2).
DESIGN & IMPLEMENTATION
Binding rules and extent
Binding mechanisms and the notion of extent are closely tied to implementation issues. A-lists make it easy to build closures (Section 3.4.2), but so
do the non-nested subroutines of C and the rule against passing nonglobal
subroutines as parameters in Modula-2. In a similar vein, the lack of first-class
subroutines in most imperative languages reflects in large part the desire to
avoid heap allocation, which would be needed for local variables with unlimited extent.
EXAMPLE 3.38 Function objects in C++
An object that plays the role of a function and its referencing environment may
variously be called an object closure, a function object, or a functor. (This is unrelated
to use of the term functor in Prolog.) Object closures are sufficiently important
that some languages support them with special syntax. In C++, an object of a class
that overloads operator() can be called as if it were a function:
class int_func {
public:
    virtual int operator()(int i) = 0;
};
class plus_x : public int_func {
    const int x;
public:
    plus_x(int n) : x(n) { }
    virtual int operator()(int i) { return i + x; }
};
...
plus_x f(2);
cout << f(3) << "\n";                // prints 5
Object f could also be passed to any function that expected a parameter of class
int_func.
EXAMPLE 3.39 Delegates in C#
In C#, a first-class subroutine is an instance of a delegate type:
delegate int IntFunc(int i);
This type can be instantiated for any subroutine that matches the specified argument and return types. That subroutine may be static, or it may be a method of
some object:
static int Plus2(int i) { return i + 2; }
...
IntFunc f = new IntFunc(Plus2);
Console.WriteLine(f(3));             // prints 5
class PlusX {
    int x;
    public PlusX(int n) { x = n; }
    public int call(int i) { return i + x; }
}
...
IntFunc g = new IntFunc(new PlusX(2).call);
Console.WriteLine(g(3));             // prints 5
Here g is roughly equivalent to the C++ code of Example 3.38.
EXAMPLE 3.40 Delegates and unlimited extent
Remarkably, though C# does not permit subroutines to nest in the general case,
version 2 of the language allows delegates to be instantiated in-line from anonymous (unnamed) methods. These allow us to mimic the code of Example 3.36:
static IntFunc PlusY(int y) {
    return delegate(int i) { return i + y; };
}
...
IntFunc h = PlusY(2);
Here y has unlimited extent! The compiler arranges to allocate it in the heap,
and to refer to it indirectly through a hidden pointer, included in the closure.
This implementation incurs the cost of dynamic storage allocation (and eventual
garbage collection) only when it is needed; local variables remain in the stack in
the common case.
C# 3.0 provides an alternative lambda expression notation for anonymous
methods. This notation is particularly convenient for functions whose bodies
can be written as a single expression. The (one-line) body of PlusY above could
be replaced with
return i => i + y;
Anonymous delegates are heavily used for event handling in C# programs; we will
see examples in Section 8.7.2.
3.7 Macro Expansion
EXAMPLE 3.41 A simple assembly macro
EXAMPLE 3.42 Preprocessor macros in C
Prior to the development of high-level programming languages, assembly language programmers could find themselves writing highly repetitive code. To ease
the burden, many assemblers provided sophisticated macro expansion facilities.
Consider the task of loading an element of a two-dimensional array from memory into a register. As we shall see in Section 7.4.3, this operation can easily require
half a dozen instructions, with details depending on the hardware instruction set;
the size of the array elements; and whether the indices are constants, values in
memory, or values in registers. In many assemblers one can define a macro that
will replace an expression like ld2d(target_reg, array_name, row, column, row_size,
element_size) with the appropriate multi-instruction sequence. In a numeric program containing hundreds or thousands of array access operations, this macro
may prove extremely useful.
When C was created in the early 1970s, it was natural to include a macro
preprocessing facility:
#define LINE_LEN 80
#define DIVIDES(a,n) (!((n) % (a)))
/* true iff n has zero remainder modulo a */
#define SWAP(a,b) {int t = (a); (a) = (b); (b) = t;}
#define MAX(a,b) ((a) > (b) ? (a) : (b))
EXAMPLE 3.43 “Gotchas” in C macros
Macros like LINE_LEN avoided the need (in early versions of C) to support named
constants in the language itself. Perhaps more important, parameterized macros
like DIVIDES, MAX, and SWAP were much more efficient than equivalent C functions. They avoided the overhead of the subroutine call mechanism (including
register saves and restores), and the code they generated could be integrated into
any code improvements that the compiler was able to effect in the code surrounding the call.
Unfortunately, C macros suffer from several limitations, all of which stem
from the fact that they are implemented by textual substitution, and are not
understood by the rest of the compiler. Put another way, they provide a naming
and binding mechanism that is separate from—and often at odds with—the rest
of the programming language.
In the definition of DIVIDES, the parentheses around the occurrences of a and
n are essential. Without them, DIVIDES(y + z, x) would be replaced by (!(x %
y + z)), which is the same as (!((x % y) + z)), according to the rules of precedence. In a similar vein, SWAP may behave unexpectedly if the programmer writes
SWAP(x, t): textual substitution of arguments allows the macro’s declaration of t
to capture the t that was passed. MAX(x++, y++) may also behave unexpectedly,
since the increment side effects will happen more than once. Unfortunately, in
standard C we cannot avoid the extra side effects by assigning the parameters into
temporary variables: a C macro that “returns” a value must be an expression, and
declarations are one of many language constructs that cannot appear inside (see
also Exercise 3.22).
Modern languages and compilers have, for the most part, abandoned macros
as an anachronism. Named constants are type-safe and easy to implement, and
in-line subroutines (to be discussed in Section 8.2.4) provide almost all the performance of parameterized macros without their limitations. A few languages
(notably Scheme and Common Lisp) take an alternative approach, and integrate macros into the language in a safe and consistent way. So-called hygienic
macros implicitly encapsulate their arguments, avoiding unexpected interactions
with associativity and precedence. They rename variables when necessary to avoid
the capture problem, and they can be used in any expression context. Unlike
subroutines, however, they are expanded during semantic analysis, making them
generally unsuitable for unbounded recursion. Their appeal is that, like all macros,
they take unevaluated arguments, which they evaluate lazily on demand. Among
other things, this means that they preserve the multiple side effect “gotcha” of our
MAX example. Delayed evaluation was a bug in this context, but can sometimes be
a feature. We will return to it in Sections 6.1.5 (short-circuit Boolean evaluation),
8.3.2 (call-by-name parameters), and 10.4 (normal-order evaluation in functional
programming languages).
DESIGN & IMPLEMENTATION
Generics as macros
In some sense, the local stack module of Figure 3.6 (page 134) is a primitive
sort of generic module. Because it imports the element type and stack_size
constant, it can be inserted (with a text editor) into any context in which these
names are declared, and will produce a “customized” stack for that context
when compiled. Early versions of C++ formalized this mechanism by using
macros to implement templates. Later versions of C++ have made templates
(generics) a fully supported language feature, giving them much of the flavor of
hygienic macros. (More on templates and on template metaprogramming can
be found in Section 8.4.4.)
CHECK YOUR UNDERSTANDING
36. Describe the difference between deep and shallow binding of referencing environments.
37. Why are binding rules particularly important for languages with dynamic
scoping?
38. What are first-class subroutines? What languages support them?
39. What is a subroutine closure? What is it used for? How is it implemented?
40. What is an object closure? How is it related to a subroutine closure?
41. Describe how the delegates of C# extend and unify both subroutine and object
closures.
42. Explain the distinction between limited and unlimited extent of objects in a
local scope.
43. What are macros? What was the motivation for including them in C? What
problems may they cause?
3.8 Separate Compilation
Since most large programs are constructed and tested incrementally, and since the
compilation of a very large program can be a multihour operation, any language
designed to support large programs must provide for separate compilation.
IN MORE DEPTH
On the PLP CD we consider the relationship between modules and separate
compilation. Because they are designed for encapsulation and provide a narrow interface, modules are the natural choice for the “compilation units” of many
programming languages. The separate module headers and bodies of Modula-3
and Ada, for example, are explicitly intended for separate compilation, and reflect
experience gained with more primitive facilities in other languages. C and C++,
by contrast, must maintain backward compatibility with mechanisms designed
in the early 1970s. C++ includes a namespace mechanism that provides
module-like data hiding, but names must still be declared before they are used in
every compilation unit, and the mechanisms used to accommodate this rule are
purely a matter of convention. Java and C# break with the C tradition by requiring
the compiler to infer header information automatically from separately compiled
class definitions; no header files are required.
3.9 Summary and Concluding Remarks
This chapter has addressed the subject of names, and the binding of names to
objects (in a broad sense of the word). We began with a general discussion of the
notion of binding time—the time at which a name is associated with a particular
object or, more generally, the time at which an answer is associated with any open
question in language or program design or implementation. We defined the notion
of lifetime for both objects and name-to-object bindings, and noted that they
need not be the same. We then introduced the three principal storage allocation
mechanisms—static, stack, and heap—used to manage space for objects.
In Section 3.3 we described how the binding of names to objects is governed by
scope rules. In some languages, scope rules are dynamic: the meaning of a name is
found in the most recently entered scope that contains a declaration and that has
not yet been exited. In most modern languages, however, scope rules are static, or
lexical: the meaning of a name is found in the closest lexically surrounding scope
that contains a declaration. We found that lexical scope rules vary in important
but sometimes subtle ways from one language to another. We considered what
sorts of scopes are allowed to nest, whether scopes are open or closed, whether the
scope of a name encompasses the entire block in which it is declared, and whether
a name must be declared before it is used. We explored the implementation of
scope rules in Section 3.4. In Section 3.6 we considered the question of when
to bind a referencing environment to a subroutine that is passed as a parameter,
returned from a function, or stored in a variable.
Some of the more complicated aspects of lexical scoping illustrate the evolution
of language support for data abstraction, a subject to which we will return in
Chapter 9. We began by describing the own or static variables of languages
like Fortran, Algol 60, and C, which allow a variable that is local to a subroutine
to retain its value from one invocation to the next. We then noted that simple
modules can be seen as a way to make long-lived objects local to a group of
subroutines, in such a way that they are not visible to other parts of the program.
By selectively exporting names, a module may serve as the “manager” for one or
more abstract data types. At the next level of complexity, we noted that some
languages treat modules as types, allowing the programmer to create an arbitrary
number of instances of the abstraction defined by a module. Finally, we noted
that object-oriented languages extend the module-as-type approach (as well as
the notion of lexical scope) by providing an inheritance mechanism that allows
new abstractions (classes) to be defined as extensions or refinements of existing
classes.
In Section 3.5 we examined several ways in which bindings relate to one another.
Aliases arise when two or more names in a given scope are bound to the same
object. Overloading arises when one name is bound to multiple objects. Polymorphism allows a single body of code to operate on objects of more than one type,
depending on context or execution history. We noted that while similar effects
can sometimes be achieved through overloading, coercion, and polymorphism,
the underlying mechanisms are really very different. In Section 3.8 we considered
rules for separate compilation.
Among the topics considered in this chapter, we saw several examples of useful features (recursion, static scoping, forward references, first-class subroutines,
unlimited extent) that have been omitted from certain languages because of concern for their implementation complexity or run-time cost. We also saw an example of a feature (the private part of a module specification) introduced expressly
to facilitate a language’s implementation, and another (separate compilation in C)
whose design was clearly intended to mirror a particular implementation. In several additional aspects of language design (late vs early binding, static vs dynamic
scoping, support for coercions and conversions, toleration of pointers and other
aliases), we saw that implementation issues play a major role.
In a similar vein, apparently simple language rules can have surprising implications. In Section 3.3.3, for example, we considered the interaction of whole-block
scope with the requirement that names be declared before they can be used. Like
the do loop syntax and white space rules of Fortran (Section 2.2.2) or the if . . .
then . . . else syntax of Pascal (Section 2.3.2), poorly chosen scoping rules can
make program analysis difficult not only for the compiler, but for human beings
as well. In future chapters we shall see several additional examples of features
that are both confusing and hard to compile. Of course, semantic utility and
ease of implementation do not always go together. Many easy-to-compile features
(e.g., goto statements) are of questionable value at best. We will also see several
examples of highly useful and (conceptually) simple features, such as garbage
collection (Section 7.7.3) and unification (Sections 7.2.4, 8.4.4, and 11.2.1),
whose implementations are quite complex.
3.10 Exercises
3.1 Indicate the binding time (when the language is designed, when the program
is linked, when the program begins execution, etc.) for each of the following decisions in your favorite programming language and implementation.
Explain any answers you think are open to interpretation.
    The number of built-in functions (math, type queries, etc.)
    The variable declaration that corresponds to a particular variable reference (use)
    The maximum length allowed for a constant (literal) character string
    The referencing environment for a subroutine that is passed as a parameter
    The address of a particular library routine
    The total amount of space occupied by program code and data
3.2 In Fortran 77, local variables are typically allocated statically. In Algol and its
descendants (e.g., Pascal and Ada), they are typically allocated in the stack. In
Lisp they are typically allocated at least partially in the heap. What accounts
for these differences? Give an example of a program in Pascal or Ada that
would not work correctly if local variables were allocated statically. Give an
example of a program in Scheme or Common Lisp that would not work
correctly if local variables were allocated on the stack.
3.3 Give two examples in which it might make sense to delay the binding of an
implementation decision, even though sufficient information exists to bind
it early.
3.4 Give three concrete examples drawn from programming languages with
which you are familiar in which a variable is live but not in scope.
3.5 Consider the following pseudocode.
procedure main
    a : integer := 1
    b : integer := 2
    procedure middle
        b : integer := a
        procedure inner
            print a, b
        a : integer := 3
        -- body of middle
        inner()
        print a, b
    -- body of main
    middle()
    print a, b
Suppose this was code for a language with the declaration-order rules of C
(but with nested subroutines)—that is, names must be declared before use,
and the scope of a name extends from its declaration through the end of
the block. At each print statement, indicate which declarations of a and b
are in the referencing environment. What does the program print (or will
the compiler identify static semantic errors)? Repeat the exercise for the
declaration-order rules of C# (names must be declared before use, but the
scope of a name is the entire block in which it is declared) and of Modula-3
(names can be declared in any order, and their scope is the entire block in
which they are declared).
3.6 Consider the following pseudocode, assuming nested subroutines and static
scope.
procedure main
    g : integer
    procedure B(a : integer)
        x : integer
        procedure A(n : integer)
            g := n
        procedure R(m : integer)
            write_integer(x)
            x /:= 2 -- integer division
            if x > 1
                R(m + 1)
            else
                A(m)
        -- body of B
        x := a × a
        R(1)
    -- body of main
    B(3)
    write_integer(g)
(a) What does this program print?
(b) Show the frames on the stack when A has just been called. For each
frame, show the static and dynamic links.
(c) Explain how A finds g.
3.7 As part of the development team at MumbleTech.com, Janet has written a
list manipulation library for C that contains, among other things, the code
in Figure 3.17.
(a) Accustomed to Java, new team member Brad includes the following
code in the main loop of his program:
list_node* L = 0;
while (more_widgets()) {
    L = insert(next_widget(), L);
}
L = reverse(L);
typedef struct list_node {
    void* data;
    struct list_node* next;
} list_node;

list_node* insert(void* d, list_node* L) {
    list_node* t = (list_node*) malloc(sizeof(list_node));
    t->data = d;
    t->next = L;
    return t;
}

list_node* reverse(list_node* L) {
    list_node* rtn = 0;
    while (L) {
        rtn = insert(L->data, rtn);
        L = L->next;
    }
    return rtn;
}

void delete_list(list_node* L) {
    while (L) {
        list_node* t = L;
        L = L->next;
        free(t->data);
        free(t);
    }
}
Figure 3.17 List management routines for Exercise 3.7.
Sadly, after running for a while, Brad’s program always runs out of
memory and crashes. Explain what’s going wrong.
(b) After Janet patiently explains the problem to him, Brad gives it another
try:
list_node* L = 0;
while (more_widgets()) {
    L = insert(next_widget(), L);
}
list_node* T = reverse(L);
delete_list(L);
L = T;
This seems to solve the insufficient memory problem, but where the
program used to produce correct results (before running out of memory), now its output is strangely corrupted, and Brad goes back to Janet
for advice. What will she tell him this time?
3.8 Rewrite Figures 3.6 and 3.7 in C.
3.9 Consider the following fragment of code in C:
{
    int a, b, c;
    ...
    {
        int d, e;
        ...
        {
            int f;
            ...
        }
        ...
    }
    ...
    {
        int g, h, i;
        ...
    }
    ...
}
(a) Assume that each integer variable occupies 4 bytes. How much total
space is required for the variables in this code?
(b) Describe an algorithm that a compiler could use to assign stack frame
offsets to the variables of arbitrary nested blocks, in a way that minimizes
the total space required.
3.10 Consider the design of a Fortran 77 compiler that uses static allocation for
the local variables of subroutines. Expanding on the solution to the previous question, describe an algorithm to minimize the total space required
for these variables. You may find it helpful to construct a call graph data
structure in which each node represents a subroutine, and each directed arc
indicates that the subroutine at the tail may sometimes call the subroutine at
the head.
3.11 Consider the following pseudocode:
procedure P(A, B : real)
X : real
procedure Q(B, C : real)
Y : real
...
procedure R(A, C : real)
Z : real
...
...
– – (*)
Assuming static scope, what is the referencing environment at the location
marked by (*) ?
3.12 Write a simple program in Scheme that displays three different behaviors,
depending on whether we use let , let* , or letrec to declare a given set
of names. (Hint: to make good use of letrec , you will probably want your
names to be functions [ lambda expressions].)
3.13 Consider the following program in Scheme:
(define A
  (lambda ()
    (let* ((x 2)
           (C (lambda (P)
                (let ((x 4))
                  (P))))
           (D (lambda ()
                x))
           (B (lambda ()
                (let ((x 3))
                  (C D)))))
      (B))))
What does this program print? What would it print if Scheme used dynamic
scoping and shallow binding? Dynamic scoping and deep binding? Explain
your answers.
3.14 Consider the following pseudocode:
x : integer -- global
procedure set_x(n : integer)
    x := n
procedure print_x
    write_integer(x)
procedure first
    set_x(1)
    print_x
procedure second
    x : integer
    set_x(2)
    print_x
set_x(0)
first()
print_x
second()
print_x
What does this program print if the language uses static scoping? What does
it print with dynamic scoping? Why?
3.15 As noted in Section 3.6.3, C# has unusually sophisticated support for first-class subroutines. Among other things, it allows delegates to be instantiated
from anonymous nested methods, and gives local variables and parameters
unlimited extent when they may be needed by such a delegate. Consider the
implications of these features in the following C# program.
using System;
public delegate int UnaryOp(int n);
// type declaration: UnaryOp is a function from ints to ints
public class Foo {
    static int a = 2;
    static UnaryOp b(int c) {
        int d = a + c;
        Console.WriteLine(d);
        return delegate(int n) { return c + n; };
    }
    public static void Main(string[] args) {
        Console.WriteLine(b(3)(4));
    }
}
What does this program print? Which of a , b , c , and d , if any, is likely to
be statically allocated? Which could be allocated on the stack? Which would
need to be allocated in the heap? Explain.
3.16 Consider the programming idiom illustrated in Example 3.22. One of the
reviewers for this book suggests that we think of this idiom as a way to
implement a central reference table for dynamic scoping. Explain what is
meant by this suggestion.
3.17 If you are familiar with structured exception handling, as provided in Ada,
C++, Java, C#, ML, Python, or Ruby, consider how this mechanism relates to
the issue of scoping. Conventionally, a raise or throw statement is thought
of as referring to an exception, which it passes as a parameter to a handler-finding library routine. In each of the languages mentioned, the exception
itself must be declared in some surrounding scope, and is subject to the
usual static scope rules. Describe an alternative point of view, in which the
raise or throw is actually a reference to a handler, to which it transfers
control directly. Assuming this point of view, what are the scope rules for
handlers? Are these rules consistent with the rest of the language? Explain.
(For further information on exceptions, see Section 8.5.)
3.18 Consider the following pseudocode:
x : integer -- global
procedure set_x(n : integer)
    x := n
procedure print_x
    write_integer(x)
procedure foo(S, P : function; n : integer)
    x : integer := 5
    if n in {1, 3}
        set_x(n)
    else
        S(n)
    if n in {1, 2}
        print_x
    else
        P
set_x(0); foo(set_x, print_x, 1); print_x
set_x(0); foo(set_x, print_x, 2); print_x
set_x(0); foo(set_x, print_x, 3); print_x
set_x(0); foo(set_x, print_x, 4); print_x
Assume that the language uses dynamic scoping. What does the program
print if the language uses shallow binding? What does it print with deep
binding? Why?
3.19 Consider the following pseudocode:
x : integer := 1
y : integer := 2
procedure add
    x := x + y
procedure second(P : procedure)
    x : integer := 2
    P()
procedure first
    y : integer := 3
    second(add)
first()
write_integer(x)
(a) What does this program print if the language uses static scoping?
(b) What does it print if the language uses dynamic scoping with deep
binding?
(c) What does it print if the language uses dynamic scoping with shallow
binding?
3.20 In Section 3.5.3 we noted that while a single min function in C would
work for both integer and floating-point numbers, overloading would be
more efficient, because it would avoid the cost of type conversions. Give an
example in which overloading does not seem advantageous—one in which it
makes more sense to have a single function with floating-point parameters,
and perform coercion when integers are supplied.
3.21 (a) Write a polymorphic sorting routine in Scheme or Haskell.
(b) Write a generic sorting routine in C++, Java, or C#. (For hints, see
Section 8.4.)
(c) Write a nongeneric sorting routine using subtype polymorphism in
your favorite object-oriented language. Assume that the elements to be
sorted are members of some class derived from class ordered , which has
a method precedes such that a.precedes(b) is true if and only if a
comes before b in some canonical total order. (For hints, see Section 9.4.)
3.22 Can you write a macro in standard C that “returns” the greatest common
divisor of a pair of arguments, without calling a subroutine? Why or why
not?
3.23–3.29 In More Depth.
3.11 Explorations
3.30 Experiment with naming rules in your favorite programming language. Read
the manual, and write and compile some test programs. Does the language
use lexical or dynamic scoping? Can scopes nest? Are they open or closed?
Does the scope of a name encompass the entire block in which it is declared,
or only the portion after the declaration? How does one declare mutually
recursive types or subroutines? Can subroutines be passed as parameters,
returned from functions, or stored in variables? If so, when are referencing
environments bound?
3.31 List the keywords (reserved words) of one or more programming languages.
List the predefined identifiers. (Recall that every keyword is a separate token.
An identifier cannot have the same spelling as a keyword.) What criteria do
you think were used to decide which names should be keywords and which
should be predefined identifiers? Do you agree with the choices? Why or
why not?
3.32 If you have experience with a language like C, C++, or Pascal, in which
dynamically allocated space must be manually reclaimed, describe your
experience with dangling references or memory leaks. How often do these
bugs arise? How do you find them? How much effort does it take? Learn
about open source or commercial tools for finding storage bugs (Valgrind
is a popular open source example). Do such tools weaken the argument for
automatic garbage collection?
3.33 We learned in Section 3.3.6 that modern languages have generally abandoned dynamic scoping. One place it can still be found is in the so-called
environment variables of the Unix programming environment. If you are not
familiar with these, read the manual page for your favorite shell (command
interpreter: csh/tcsh, ksh/bash, etc.) to learn how these behave. Explain
why the usual alternatives to dynamic scoping (default parameters and static
variables) are not appropriate in this case.
3.34 Compare the mechanisms for overloading of enumeration names in Ada
and Modula-3 (Section 3.5.2). One might argue that the (historically more
recent) Modula-3 approach moves responsibility from the compiler to the
programmer: it requires even an unambiguous use of an enumeration constant to be annotated with its type. Why do you think this approach was
chosen by the language designers? Do you agree with the choice? Why or
why not?
3.35 Learn about tied variables in Perl. These allow the programmer to associate an ordinary variable with an (object-oriented) object in such a way
that operations on the variable are automatically interpreted as method
invocations on the object. As an example, suppose we write tie $my_var,
"my_class";. The interpreter will create a new object of class my_class,
which it will associate with scalar variable $my_var. For purposes of discussion, call that object O. Now, any attempt to read the value of $my_var
will be interpreted as a call to method O->FETCH(). Similarly, the assignment $my_var = value will be interpreted as a call to O->STORE(value).
Array, hash, and filehandle variables, which support a larger set of built-in
operations, provide access to a larger set of methods when tied.
Compare Perl’s tying mechanism to the operator overloading of C++.
Which features of each language can be conveniently emulated by the other?
3.36 Write a program in C++ or Ada that creates at least two concrete types or
subroutines from the same template/generic. Compile your code to assembly language and look at the result. Describe the mapping from source to
target code.
3.37 Do you think coercion is a good idea? Why or why not?
3.38 Give three examples of features that are not provided in some language with
which you are familiar, but that are common in other languages. Why do
you think these features are missing? Would they complicate the implementation of the language? If so, would the complication (in your judgment) be
justified?
3.39–3.43 In More Depth.
3.12 Bibliographic Notes
This chapter has traced the evolution of naming and scoping mechanisms through
a very large number of languages, including Fortran (several versions), Basic,
Algol 60 and 68, Pascal, Simula, C and C++, Euclid, Turing, Modula (1, 2, and 3),
Ada (83 and 95), Oberon, Eiffel, Perl, Tcl, Python, Ruby, Java, and C#. Bibliographic
references for all of these can be found in Appendix A.
Both modules and objects trace their roots to Simula, which was developed
by Dahl, Nygaard, Myhrhaug, and others at the Norwegian Computing Centre
in the mid-1960s. (Simula I was implemented in 1964; descriptions in this book
pertain to Simula 67.) The encapsulation mechanisms of Simula were refined in
the 1970s by the developers of Clu, Modula, Euclid, and related languages. Other
Simula innovations—inheritance and dynamic method binding in particular—
provided the inspiration for Smalltalk, the original and arguably purest of the
object-oriented languages. Modern object-oriented languages, including Eiffel,
C++, Java, C#, Python, and Ruby, represent to a large extent a reintegration of the
evolutionary lines of encapsulation on the one hand and inheritance and dynamic
method binding on the other.
The notion of information hiding originates in Parnas’s classic paper, “On the
Criteria to be Used in Decomposing Systems into Modules” [Par72]. Comparative
discussions of naming, scoping, and abstraction mechanisms can be found, among
other places, in Liskov et al.’s discussion of Clu [LSAS77], Liskov and Guttag’s
text [LG86, Chap. 4], the Ada Rationale [IBFW91, Chaps. 9–12], Harbison’s text
on Modula-3 [Har92, Chaps. 8–9], Wirth’s early work on modules [Wir80], and
his later discussion of Modula and Oberon [Wir88a, Wir07]. Further information
on object-oriented languages can be found in Chapter 9.
For a detailed discussion of overloading and polymorphism, see the survey by
Cardelli and Wegner [CW85]. Cailliau [Cai82] provides a lighthearted discussion
of many of the scoping pitfalls noted in Section 3.3.3. Abelson and Sussman [AS96,
p. 11n] attribute the term “syntactic sugar” to Peter Landin.
4 Semantic Analysis
In Chapter 2 we considered the topic of programming language syntax. In
the current chapter we turn to the topic of semantics. Informally, syntax concerns
the form of a valid program, while semantics concerns its meaning. Meaning
is important for at least two reasons: it allows us to enforce rules (e.g., type
consistency) that go beyond mere form, and it provides the information we need
in order to generate an equivalent output program.
It is conventional to say that the syntax of a language is precisely that portion
of the language definition that can be described conveniently by a context-free
grammar, while the semantics is that portion of the definition that cannot. This
convention is useful in practice, though it does not always agree with intuition.
When we require, for example, that the number of arguments contained in a
call to a subroutine match the number of formal parameters in the subroutine
definition, it is tempting to say that this requirement is a matter of syntax. After
all, we can count arguments without knowing what they mean. Unfortunately,
we cannot count them with context-free rules. Similarly, while it is possible to
write a context-free grammar in which every function must contain at least one
return statement, the required complexity makes this strategy very unattractive.
In general, any rule that requires the compiler to compare things that are separated
by long distances, or to count things that are not properly nested, ends up being a
matter of semantics.
Programming Language Pragmatics. DOI: 10.1016/B978-0-12-374514-9.00013-6
Copyright © 2009 by Elsevier Inc. All rights reserved.
Semantic rules are further divided into static and dynamic semantics, though
again the line between the two is somewhat fuzzy. The compiler enforces static
semantic rules at compile time. It generates code to enforce dynamic semantic
rules at run time (or to call library routines that do so). Certain errors, such as
division by zero, or attempting to index into an array with an out-of-bounds subscript, cannot in general be caught at compile time, since they may occur only
for certain input values, or certain behaviors of arbitrarily complex code. In special cases, a compiler may be able to tell that a certain error will always or never
occur, regardless of run-time input. In these cases, the compiler can generate an
error message at compile time, or refrain from generating code to perform the
check at run time, as appropriate. Basic results from computability theory, however, tell us that no algorithm can make these predictions correctly for arbitrary
programs: there will inevitably be cases in which an error will always occur, but
the compiler cannot tell, and must delay the error message until run time; there
will also be cases in which an error can never occur, but the compiler cannot tell,
and must incur the cost of unnecessary run-time checks.
Both semantic analysis and intermediate code generation can be described in
terms of annotation, or decoration of a parse tree or syntax tree. The annotations
themselves are known as attributes. Numerous examples of static and dynamic
semantic rules will appear in subsequent chapters. In this current chapter we
focus primarily on the mechanisms a compiler uses to enforce the static rules. We
will consider intermediate code generation (including the generation of code for
dynamic semantic checks) in Chapter 14.
In Section 4.1 we consider the role of the semantic analyzer in more detail,
considering both the rules it needs to enforce and its relationship to other phases
of compilation. Most of the rest of the chapter is then devoted to the subject
of attribute grammars. Attribute grammars provide a formal framework for the
decoration of a tree. This framework is a useful conceptual tool even in compilers
that do not build a parse tree or syntax tree as an explicit data structure. We
introduce the notion of an attribute grammar in Section 4.2. We then consider
various ways in which such grammars can be applied in practice. Section 4.3
discusses the issue of attribute flow, which constrains the order(s) in which nodes
of a tree can be decorated. In practice, most compilers require decoration of the
parse tree (or the evaluation of attributes that would reside in a parse tree if there
were one) to occur in the process of an LL or LR parse. Section 4.4 presents action
routines as an ad hoc mechanism for such “on-the-fly” evaluation. In Section 4.5
(mostly on the PLP CD) we consider the management of space for parse tree
attributes.
Because they have to reflect the structure of the CFG, parse trees tend to be
very complicated (recall the example in Figure 1.4). Once parsing is complete, we
typically want to replace the parse tree with a syntax tree that reflects the input
program in a more straightforward way (Figure 1.5). One particularly common
compiler organization uses action routines during parsing solely for the purpose
of constructing the syntax tree. The syntax tree is then decorated during a separate
traversal, which can be formalized, if desired, with a separate attribute grammar.
We consider the decoration of syntax trees in Section 4.6.
4.1
The Role of the Semantic Analyzer
Programming languages vary dramatically in their choice of semantic rules. In
Section 3.5.3, for example, we saw a range of approaches to coercion, from languages like Fortran and C, which allow operands of many types to be intermixed
in expressions, to languages like Ada, which do not. Languages also vary in the
extent to which they require their implementations to perform dynamic checks.
At one extreme, C requires no checks at all, beyond those that come “free” with
the hardware (e.g., division by zero, or attempted access to memory outside the
bounds of the program). At the other extreme, Java takes great pains to check as
many rules as possible, in part to ensure that an untrusted program cannot do
anything to damage the memory or files of the machine on which it runs. The
role of the semantic analyzer is to enforce all static semantic rules and to annotate the program with information needed by the intermediate code generator.
This information includes both clarifications (this is floating-point addition, not
integer; this is a reference to the global variable x) and requirements for dynamic
semantic checks.
In the typical compiler, the interface between semantic analysis and intermediate code generation defines the boundary between the front end and the back
end. The exact division of labor varies a bit from compiler to compiler: it can be
hard to say exactly where analysis (figuring out what the program means) ends
and synthesis (expressing that meaning in some new form) begins. Many compilers actually carry a program through more than one intermediate form. In one
common organization, described in more detail in Chapter 14, the semantic analyzer creates an annotated syntax tree, which the intermediate code generator then
translates into a linear form reminiscent of the assembly language for some idealized machine. After machine-independent code improvement, this linear form
is then translated into yet another form, patterned more closely on the assembly
language of the target machine. That form may then undergo machine-specific
code improvement.
Compilers also vary in the extent to which semantic analysis and intermediate
code generation are interleaved with parsing. With fully separated phases, the
parser passes a full parse tree on to the semantic analyzer, which converts it to
a syntax tree, fills in the symbol table, performs semantic checks, and passes it
on to the code generator. With fully interleaved phases, there may be no need
to build either the parse tree or the syntax tree in its entirety: the parser can call
semantic check and code generation routines on the fly as it parses each expression,
statement, or subroutine of the source. We will focus on an organization in which
construction of the syntax tree is interleaved with parsing (and the parse tree is not
built), but semantic analysis occurs during a separate traversal of the syntax tree.
Dynamic Checks
Many compilers that generate code for dynamic checks provide the option of disabling them if desired. It is customary in some organizations to enable dynamic
checks during program development and testing, and then disable them for production use, to increase execution speed. The wisdom of this practice is questionable: Tony Hoare, one of the key figures in programming language design,¹
has likened the programmer who disables semantic checks to a sailing enthusiast
who wears a life jacket when training on dry land, but removes it when going
to sea [Hoa89, p. 198]. Errors may be less likely in production use than they are
in testing, but the consequences of an undetected error are significantly worse.
Moreover, on multiissue, superscalar processors (described in Section 5.4.3), it
is often possible for dynamic checks to execute in instruction slots that would otherwise go unused, making them virtually free. On the other hand, some dynamic
checks (e.g., ensuring that pointer arithmetic in C remains within the bounds of
an array) are sufficiently expensive that they are rarely implemented.
¹ Among other things, C. A. R. Hoare (1934–) invented the quicksort algorithm and the case
statement, contributed to the design of Algol W, and was one of the leaders in the development
of axiomatic semantics. In the area of concurrent programming, he refined and formalized the
monitor construct (to be described in Section 12.4.1), and designed the CSP programming model
and notation. He received the ACM Turing Award in 1980.
Assertions
EXAMPLE
4.1
Assertions in Java
When reasoning about the correctness of their algorithms (or when formally
proving properties of programs via axiomatic semantics) programmers frequently
write logical assertions regarding the values of program data. Some programming
languages make these assertions a part of the language syntax. The compiler then
generates code to check the assertions at run time. An assertion is a statement that
a specified condition is expected to be true when execution reaches a certain point
in the code. In Java one can write
assert denominator != 0;
An AssertionError exception will be thrown if the semantic check fails at run
time.
Some languages (e.g., Euclid and Eiffel) also provide explicit support for invariants, preconditions, and post-conditions. These are essentially structured assertions.
An invariant is expected to be true at all “clean points” of a given body of code.
In Eiffel, the programmer can specify an invariant on the data inside a class: the
invariant will be checked, automatically, at the beginning and end of each of the
class’s methods (subroutines). Similar invariants for loops are expected to be true
before and after every iteration. Pre- and post-conditions are expected to be true
at the beginning and end of subroutines, respectively. In Euclid, a post-condition,
specified once in the header of a subroutine, will be checked not only at the end
of the subroutine’s text, but at every return statement as well.
DESIGN & IMPLEMENTATION
Dynamic semantic checks
In the past, language theorists and researchers in programming methodology
and software engineering tended to argue for more extensive semantic checks,
while “real-world” programmers “voted with their feet” for languages like C
and Fortran, which omitted those checks in the interest of execution speed.
As computers have become more powerful, and as companies have come to
appreciate the enormous costs of software maintenance, the “real-world” camp
has become much more sympathetic to checking. Languages like Ada and Java
have been designed from the outset with safety in mind, and languages like
C and C++ have evolved (to the extent possible) toward increasingly strict
definitions.
EXAMPLE
4.2
Assertions in C
Many languages support assertions via standard library routines or macros. In
C, for example, one can write
assert(denominator != 0);
If the assertion fails, the program will terminate abruptly with the message
myprog.c:42: failed assertion ‘denominator != 0’
The C manual requires assert to be implemented as a macro (or built into the
compiler) so that it has access to the textual representation of its argument, and
to the filename and line number on which the call appears.
Assertions, of course, could be used to cover the other three sorts of checks,
but not as clearly or succinctly. Invariants, preconditions, and post-conditions
are a prominent part of the header of the code to which they apply, and can
cover a potentially large number of places where an assertion would otherwise
be required. Euclid and Eiffel implementations allow the programmer to disable
assertions and related constructs when desired, to eliminate their run-time cost.
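As a rough illustration of how a precondition and post-condition stated once in a header can stand in for assertions scattered through a body, here is a minimal Python sketch. The decorator contract and the function clamped_quotient are hypothetical names invented for this example; they imitate the idea only and are not the Euclid or Eiffel mechanism.

```python
import functools


def contract(pre=None, post=None):
    """Attach an optional precondition and post-condition to a function."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            if pre is not None and not pre(*args):
                raise AssertionError(f"precondition of {fn.__name__} violated")
            result = fn(*args)
            # Checked here once, so it covers every return statement in fn.
            if post is not None and not post(result):
                raise AssertionError(f"postcondition of {fn.__name__} violated")
            return result
        return wrapper
    return decorate


@contract(pre=lambda num, den: den > 0, post=lambda q: q >= 0)
def clamped_quotient(num, den):
    if num < 0:
        return 0           # early return, still subject to the post-condition
    return num // den


print(clamped_quotient(7, 2))
print(clamped_quotient(-7, 2))
```

Because the check wraps the whole call, the early return is covered without a second assertion, which is the economy the text attributes to post-conditions.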
Static Analysis
In general, compile-time algorithms that predict run-time behavior are known
as static analysis. Such analysis is said to be precise if it allows the compiler to
determine whether a given program will always follow the rules. Type checking, for
example, is static and precise in languages like Ada and ML: the compiler ensures
that no variable will ever be used at run time in a way that is inappropriate for its
type. By contrast, languages like Lisp and Smalltalk obtain greater flexibility, while
remaining completely type-safe, by accepting the run-time overhead of dynamic
type checks. (We will cover type checking in more detail in Chapter 7.)
Static analysis can also be useful when it isn’t precise. Compilers will often
check what they can at compile time and then generate code to check the rest
dynamically. In Java, for example, type checking is mostly static, but dynamically
loaded classes and type casts may require run-time checks. In a similar vein, many
compilers perform extensive static analysis in an attempt to eliminate the need for
dynamic checks on array subscripts, variant record tags, or potentially dangling
pointers (again, to be discussed in Chapter 7).
If we think of the omission of unnecessary dynamic checks as a performance
optimization, it is natural to look for other ways in which static analysis may
enable code improvement. We will consider this topic in more detail in Chapter 16.
Examples include alias analysis, which determines when values can be safely cached
in registers, computed “out of order,” or accessed by concurrent threads; escape
analysis, which determines when all references to a value will be confined to a
given context, allowing it to be allocated on the stack instead of the heap, or to be
accessed without locks; and subtype analysis, which determines when a variable
in an object-oriented language is guaranteed to have a certain subtype, so that its
methods can be called without dynamic dispatch.
An optimization is said to be unsafe if it may lead to incorrect code in certain
programs. It is said to be speculative if it usually improves performance, but may
degrade it in certain cases. A compiler is said to be conservative if it applies optimizations only when it can guarantee that they will be both safe and effective.
By contrast, an optimistic compiler may make liberal use of speculative optimizations. It may also pursue unsafe optimizations by generating two versions of the
code, with a dynamic check that chooses between them based on information not
available at compile time. Examples of speculative optimization include nonbinding prefetches, which try to bring data into the cache before they are needed, and
trace scheduling, which rearranges code in hopes of improving the performance
of the processor pipeline and the instruction cache.
To eliminate dynamic checks, language designers may choose to tighten semantic rules, banning programs for which conservative analysis fails. The ML type
system, for example (Section 7.2.4), avoids the dynamic type checks of Lisp,
but disallows certain useful programming idioms that Lisp supports. Similarly,
the definite assignment rules of Java and C# (Section 6.1.3) allow the compiler to
ensure that a variable is always given a value before it is used in an expression, but
disallow certain programs that are legal (and correct) in C.
4.2
Attribute Grammars
EXAMPLE
4.3
Bottom-up CFG for
constant expressions
In Chapter 2 we learned how to use a context-free grammar to specify the syntax of
a programming language. Here, for example, is an LR (bottom-up) grammar for
arithmetic expressions composed of constants, with precedence and associativity:²
E −→ E + T
E −→ E - T
E −→ T
T −→ T * F
T −→ T / F
T −→ F
F −→ - F
F −→ ( E )
F −→ const
EXAMPLE
4.4
Bottom-up AG for
constant expressions
This grammar will generate all properly formed constant expressions over
the basic arithmetic operators, but it says nothing about their meaning. To tie
these expressions to mathematical concepts (as opposed to, say, floor tile patterns
or dance steps), we need additional notation. The most common is based on
attributes. In our expression grammar, we can associate a val attribute with each
E, T, F, and const in the grammar. The intent is that for any symbol S, S.val
will be the meaning, as an arithmetic value, of the token string derived from S.
We assume that the val of a const is provided to us by the scanner. We must
then invent a set of rules for each production, to specify how the vals of different
symbols are related. The resulting attribute grammar (AG) is shown in Figure 4.1.
² The addition of semantic rules tends to make attribute grammars quite a bit more verbose than
context-free grammars. For the sake of brevity, many of the examples in this chapter use very
short symbol names: E instead of expr, TT instead of term_tail.
1. E1 −→ E2 + T      ▷ E1.val := sum(E2.val, T.val)
2. E1 −→ E2 - T      ▷ E1.val := difference(E2.val, T.val)
3. E −→ T            ▷ E.val := T.val
4. T1 −→ T2 * F      ▷ T1.val := product(T2.val, F.val)
5. T1 −→ T2 / F      ▷ T1.val := quotient(T2.val, F.val)
6. T −→ F            ▷ T.val := F.val
7. F1 −→ - F2        ▷ F1.val := additive_inverse(F2.val)
8. F −→ ( E )        ▷ F.val := E.val
9. F −→ const        ▷ F.val := const.val
Figure 4.1 A simple attribute grammar for constant expressions, using the standard arithmetic
operations.
In this simple grammar, every production has a single rule. We shall see more
complicated grammars later, in which productions can have several rules. The
rules come in two forms. Those in productions 3, 6, 8, and 9 are known as copy
rules; they specify that one attribute should be a copy of another. The other rules
invoke semantic functions (sum, quotient, additive_inverse, etc.). In this example,
the semantic functions are all familiar arithmetic operations. In general, they
can be arbitrarily complex functions specified by the language designer. Each
semantic function takes an arbitrary number of arguments (each of which must
be an attribute of a symbol in the current production—no global variables are
allowed), and each computes a single result, which must likewise be assigned into
an attribute of a symbol in the current production. When more than one symbol
of a production has the same name, subscripts are used to distinguish them. These
subscripts are solely for the benefit of the semantic functions; they are not part of
the context-free grammar itself.
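To make the rules of Figure 4.1 concrete, the following sketch decorates an explicit parse tree in Python. The Node class, the RULES table, and the helper names tok, const, and n are all hypothetical, chosen for this illustration; Python's arithmetic operators stand in for the semantic functions sum, difference, product, quotient, and additive_inverse.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str                                   # "E", "T", "F", "const", or a punctuation token
    children: list = field(default_factory=list)
    val: object = None                            # for const, supplied by the "scanner"


# One semantic rule per production, keyed by (left-hand side, right-hand side).
RULES = {
    ("E", ("E", "+", "T")): lambda k: k[0].val + k[2].val,   # 1. sum
    ("E", ("E", "-", "T")): lambda k: k[0].val - k[2].val,   # 2. difference
    ("E", ("T",)):          lambda k: k[0].val,              # 3. copy rule
    ("T", ("T", "*", "F")): lambda k: k[0].val * k[2].val,   # 4. product
    ("T", ("T", "/", "F")): lambda k: k[0].val / k[2].val,   # 5. quotient
    ("T", ("F",)):          lambda k: k[0].val,              # 6. copy rule
    ("F", ("-", "F")):      lambda k: -k[1].val,             # 7. additive inverse
    ("F", ("(", "E", ")")): lambda k: k[1].val,              # 8. copy rule
    ("F", ("const",)):      lambda k: k[0].val,              # 9. copy rule
}


def decorate(node):
    """Post-order walk: children first, then the rule of this node's production."""
    for child in node.children:
        decorate(child)
    if node.children:                              # tokens have no production
        key = (node.symbol, tuple(c.symbol for c in node.children))
        node.val = RULES[key](node.children)


def tok(s):
    return Node(s)


def const(v):
    return Node("const", val=v)


def n(symbol, *kids):
    return Node(symbol, list(kids))


# Parse tree for (1 + 3) * 2 under the grammar of Figure 4.1.
tree = n("E", n("T",
         n("T", n("F", tok("("),
                  n("E", n("E", n("T", n("F", const(1)))),
                    tok("+"),
                    n("T", n("F", const(3)))),
                  tok(")"))),
         tok("*"),
         n("F", const(2))))
decorate(tree)
print(tree.val)
```

Each lambda reads only attributes of symbols in its own production and writes only the left-hand side's val, which mirrors the restriction stated above.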
EXAMPLE
4.5
AG to count the elements
of a list
In a strict definition of attribute grammars, copy rules and semantic function
calls are the only two kinds of permissible rules. In our examples we use a ▷ symbol
to introduce each code fragment corresponding to a single rule. In practice, it is
common to allow rules to consist of small fragments of code in some well-defined
notation (e.g., the language in which a compiler is being written), so that simple
semantic functions can be written out “in-line.” To count the elements of a list,
we might write
L −→ id              ▷ L.c := 1
L1 −→ L2 , id        ▷ L1.c := L2.c + 1
Here the rule on the second production performs an addition operation. Whether
in-line or explicit, semantic functions are not allowed to refer to any variables or
attributes outside the current production (we will relax this restriction when we
discuss action routines in Section 4.4).
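The list-counting rules translate almost directly into code. In this small sketch the nested-tuple encoding of the parse tree and the function name c are assumptions made for illustration only.

```python
def c(L):
    """Return the c attribute of a list node.

    A node is ("L", "id") for L -> id,
    or ("L", sublist, ",", "id") for L1 -> L2 , id.
    """
    if len(L) == 2:                 # L -> id
        return 1                    # L.c := 1
    _, L2, _comma, _id = L          # L1 -> L2 , id
    return c(L2) + 1                # L1.c := L2.c + 1


# Parse tree for the three-element list  id , id , id
tree = ("L", ("L", ("L", "id"), ",", "id"), ",", "id")
print(c(tree))
```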
Semantic functions must be written in some already-existing notation, because
attribute grammars do not really specify the meaning of a program; rather,
they provide a way to associate a program with something else that presumably has meaning. Neither the notation for semantic functions nor the types
of the attributes themselves (i.e., the domain of values passed to and returned
from semantic functions) is intrinsic to the AG notion. In the example above,
we have used an attribute grammar to associate numeric values with the symbols
in our grammar, using semantic functions drawn from ordinary arithmetic. In
the code generation phase of a compiler, we might associate fragments of target machine code with our symbols, using semantic functions written in some
existing programming language. If we were interested in defining the meaning
of a programming language in a machine-independent way, our attributes might
be domain theory denotations (these are the basis of denotational semantics). If
we were interested in proving theorems about the behavior of programs in our
language, our attributes might be logical formulas (this is the basis of axiomatic
semantics).³ These more formal concepts are beyond the scope of this text (but see
the Bibliographic Notes at the end of the chapter). We will use attribute grammars
primarily as a framework for building a syntax tree, checking semantic rules, and
(in Chapter 14) generating code.
³ It’s actually stretching things a bit to discuss axiomatic semantics in the context of attribute
grammars. Axiomatic semantics is intended not so much to define the meaning of programs as
to permit one to prove that a given program satisfies some desired property (e.g., computes some
desired function).
4.3
Evaluating Attributes
EXAMPLE
4.6
Decoration of a parse tree
The process of evaluating attributes is called annotation or decoration of the
parse tree. Figure 4.2 shows how to decorate the parse tree for the expression
(1 + 3) * 2, using the AG of Figure 4.1. Once decoration is complete, the value
of the overall expression can be found in the val attribute of the root of the
tree.
(Parse tree shown as an indented outline; the boxed val of each node appears in brackets.)
E [8]
  T [8]
    T [4]
      F [4]
        (
        E [4]
          E [1]
            T [1]
              F [1]
                const [1]
          +
          T [3]
            F [3]
              const [3]
        )
    *
    F [2]
      const [2]
Figure 4.2 Decoration of a parse tree for (1 + 3) * 2, using the attribute grammar of Figure 4.1. The val attributes of symbols are shown in boxes. Curving arrows show the attribute
flow, which is strictly upward in this case. Each box holds the output of a single semantic rule;
the arrow(s) entering the box indicate the input(s) to the rule. At the second level of the tree,
for example, the two arrows pointing into the box with the 8 represent application of the rule
T1.val := product(T2.val, F.val).
Synthesized Attributes
The attribute grammar of Figure 4.1 is very simple. Each symbol has at most
one attribute (the punctuation marks have none). Moreover, they are all so-called
synthesized attributes: their values are calculated (synthesized) only in productions
in which their symbol appears on the left-hand side. For annotated parse trees like
the one in Figure 4.2, this means that the attribute flow—the pattern in which
information moves from node to node—is entirely bottom-up.
An attribute grammar in which all attributes are synthesized is said to be
S-attributed. The arguments to semantic functions in an S-attributed grammar
are always attributes of symbols on the right-hand side of the current production, and the return value is always placed into an attribute of the left-hand side
of the production. Tokens (terminals) often have intrinsic properties (e.g., the
character-string representation of an identifier or the value of a numeric constant); in a compiler these are synthesized attributes initialized by the scanner.
Inherited Attributes
EXAMPLE
4.7
Top-down CFG and parse
tree for subtraction
In general, we can imagine (and will in fact have need of) attributes whose values
are calculated when their symbol is on the right-hand side of the current production. Such attributes are said to be inherited. They allow contextual information to
flow into a symbol from above or from the side, so that the rules of that production can be enforced in different ways (or generate different values) depending on surrounding context. Symbol table information is commonly passed
from symbol to symbol by means of inherited attributes. Inherited attributes
of the root of the parse tree can also be used to represent the external environment (characteristics of the target machine, command-line arguments to the
compiler, etc.).
As a simple example of inherited attributes, consider the following simplified
fragment of an LL(1) expression grammar (here covering only subtraction):
expr −→ const expr_tail
expr_tail −→ - const expr_tail | ε
For the expression 9 - 4 - 3, we obtain the following parse tree:
expr
  9
  expr_tail
    -
    4
    expr_tail
      -
      3
      expr_tail
        ε
If we want to create an attribute grammar that accumulates the value of the overall
expression into the root of the tree, we have a problem: because subtraction is
left associative, we cannot summarize the right subtree of the root with a single
numeric value. If we want to decorate the tree bottom-up, with an S-attributed
grammar, we must be prepared to describe an arbitrary number of right operands
in the attributes of the top-most expr_tail node (see Exercise 4.4). This is indeed
possible, but it defeats the purpose of the formalism: in effect, it requires us to
embed the entire tree into the attributes of a single node, and do all the real work
inside a single semantic function.
EXAMPLE
4.8
Decoration with
left-to-right attribute flow
If, however, we are allowed to pass attribute values not only bottom-up but
also left-to-right in the tree, then we can pass the 9 into the top-most expr_tail
node, where it can be combined (in proper left-associative fashion) with the 4.
The resulting 5 can then be passed into the middle expr_tail node, combined with
the 3 to make 2, and then passed upward to the root:
expr [2]
  const 9
  expr_tail [9] [2]
    -
    const 4
    expr_tail [5] [2]
      -
      const 3
      expr_tail [2] [2]
        ε
EXAMPLE
4.9
Top-down AG for
subtraction
To effect this style of decoration, we need the following attribute rules:
expr −→ const expr_tail
     ▷ expr_tail.st := const.val
     ▷ expr.val := expr_tail.val
expr_tail1 −→ - const expr_tail2
     ▷ expr_tail2.st := expr_tail1.st − const.val
     ▷ expr_tail1.val := expr_tail2.val
expr_tail −→ ε
     ▷ expr_tail.val := expr_tail.st
EXAMPLE
4.10
Top-down AG for constant
expressions
In each of the first two productions, the first rule serves to copy the left context
(value of the expression so far) into a “subtotal” (st) attribute; the second rule
copies the final value from the right-most leaf back up to the root. In the expr_tail
nodes of the picture in Example 4.8, the left box holds the st attribute; the right
holds val.
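The rules of Example 4.9 can be traced with a short sketch that stores st and val on explicit tree nodes. The class ExprTail and the functions decorate_tail and decorate_expr are hypothetical names for this illustration; an ExprTail whose const field is None stands for the ε production.

```python
class ExprTail:
    """One expr_tail node: either '- const expr_tail' or ε (const is None)."""

    def __init__(self, const=None, tail=None):
        self.const = const
        self.tail = tail
        self.st = None      # inherited: value of the expression so far
        self.val = None     # synthesized: final value, copied back up


def decorate_tail(node):
    if node.const is None:                      # expr_tail -> ε
        node.val = node.st                      # expr_tail.val := expr_tail.st
    else:                                       # expr_tail1 -> - const expr_tail2
        node.tail.st = node.st - node.const     # expr_tail2.st := expr_tail1.st − const.val
        decorate_tail(node.tail)
        node.val = node.tail.val                # expr_tail1.val := expr_tail2.val


def decorate_expr(const, tail):
    """expr -> const expr_tail; returns expr.val."""
    tail.st = const                             # expr_tail.st := const.val
    decorate_tail(tail)
    return tail.val                             # expr.val := expr_tail.val


# Parse tree for 9 - 4 - 3
bottom = ExprTail()
middle = ExprTail(3, bottom)
top = ExprTail(4, middle)
print(decorate_expr(9, top))
print([(t.st, t.val) for t in (top, middle, bottom)])
```

The st values flow down and to the right while the val values flow back up, so the subtraction is applied in left-associative order even though the tree leans to the right.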
We can flesh out the grammar fragment of Example 4.7 to produce a more complete expression grammar, as shown (with shorter symbol names) in Figure 4.3.
The underlying CFG for this grammar accepts the same language as the one in
Figure 4.1, but where that one was SLR(1), this one is LL(1). Attribute flow for a
parse of (1 + 3) * 2 , using the LL(1) grammar, appears in Figure 4.4. As in the
grammar fragment of Example 4.9, the value of the left operand of each operator
is carried into the TT and FT productions by the st (subtotal) attribute. The
relative complexity of the attribute flow arises from the fact that operators are left
associative, but the grammar cannot be left recursive: the left and right operands
of a given operator are thus found in separate productions. Grammars to perform
semantic analysis for practical languages generally require some non–S-attributed
flow.
1. E −→ T TT
     ▷ TT.st := T.val
     ▷ E.val := TT.val
2. TT1 −→ + T TT2
     ▷ TT2.st := TT1.st + T.val
     ▷ TT1.val := TT2.val
3. TT1 −→ - T TT2
     ▷ TT2.st := TT1.st − T.val
     ▷ TT1.val := TT2.val
4. TT −→ ε
     ▷ TT.val := TT.st
5. T −→ F FT
     ▷ FT.st := F.val
     ▷ T.val := FT.val
6. FT1 −→ * F FT2
     ▷ FT2.st := FT1.st × F.val
     ▷ FT1.val := FT2.val
7. FT1 −→ / F FT2
     ▷ FT2.st := FT1.st ÷ F.val
     ▷ FT1.val := FT2.val
8. FT −→ ε
     ▷ FT.val := FT.st
9. F1 −→ - F2
     ▷ F1.val := − F2.val
10. F −→ ( E )
     ▷ F.val := E.val
11. F −→ const
     ▷ F.val := const.val
Figure 4.3 An attribute grammar for constant expressions based on an LL(1) CFG. In this
grammar several productions have two semantic rules.
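One common way to realize an L-attributed grammar such as the one in Figure 4.3 is a recursive descent parser in which each inherited attribute becomes a parameter and each synthesized attribute becomes a return value. The sketch below takes that approach; the class Parser, its method names, the regular-expression tokenizer, and the "$" end marker are all assumptions made for illustration.

```python
import re


class Parser:
    def __init__(self, src):
        # Crude scanner: integers and single-character operators; anything else is skipped.
        self.toks = re.findall(r"\d+|[-+*/()]", src) + ["$"]   # "$" marks end of input
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, saw {self.peek()!r}")
        self.pos += 1

    def E(self):                              # 1. E -> T TT
        t_val = self.T()
        return self.TT(st=t_val)              #    TT.st := T.val; E.val := TT.val

    def TT(self, st):
        if self.peek() == "+":                # 2. TT1 -> + T TT2
            self.match("+")
            t_val = self.T()
            return self.TT(st=st + t_val)
        if self.peek() == "-":                # 3. TT1 -> - T TT2
            self.match("-")
            t_val = self.T()
            return self.TT(st=st - t_val)
        return st                             # 4. TT -> ε: TT.val := TT.st

    def T(self):                              # 5. T -> F FT
        f_val = self.F()
        return self.FT(st=f_val)

    def FT(self, st):
        if self.peek() == "*":                # 6. FT1 -> * F FT2
            self.match("*")
            f_val = self.F()
            return self.FT(st=st * f_val)
        if self.peek() == "/":                # 7. FT1 -> / F FT2
            self.match("/")
            f_val = self.F()
            return self.FT(st=st / f_val)
        return st                             # 8. FT -> ε: FT.val := FT.st

    def F(self):
        tok = self.peek()
        if tok == "-":                        # 9. F1 -> - F2
            self.match("-")
            return -self.F()
        if tok == "(":                        # 10. F -> ( E )
            self.match("(")
            e_val = self.E()
            self.match(")")
            return e_val
        if not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        self.match(tok)                       # 11. F -> const
        return int(tok)


def evaluate(src):
    parser = Parser(src)
    val = parser.E()
    parser.match("$")
    return val


print(evaluate("(1 + 3) * 2"))
print(evaluate("9 - 4 - 3"))
```

No parse tree is ever built here: the attributes live in parameters and return values on the call stack, and each is computed as soon as its inputs are known.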
Attribute Flow
Just as a context-free grammar does not specify how it should be parsed, an
attribute grammar does not specify the order in which attribute rules should be
invoked. Put another way, both notations are declarative: they define a set of valid
trees, but they don’t say how to build or decorate them. Among other things, this
means that the order in which attribute rules are listed for a given production
is immaterial; attribute flow may require them to execute in any order. If, in
Figure 4.3, we were to reverse the order in which the rules appear in productions
1, 2, 3, 5, 6, and/or 7 (listing the rule for symbol.val first), it would be a purely
cosmetic change; the grammar would not be altered.
(Parse tree shown as an indented outline; boxed attribute values appear in brackets.)
E [8]
  T [8]
    F [4]
      (
      E [4]
        T [1]
          F [1]
            const [1]
          FT [1] [1]
            ε
        TT [1] [4]
          +
          T [3]
            F [3]
              const [3]
            FT [3] [3]
              ε
          TT [4] [4]
            ε
      )
    FT [4] [8]
      *
      F [2]
        const [2]
      FT [8] [8]
        ε
  TT [8] [8]
    ε
Figure 4.4 Decoration of a top-down parse tree for (1 + 3) * 2, using the AG of Figure 4.3. Curving arrows again indicate attribute flow; the arrow(s) entering a given box represent the application of a single semantic rule. Flow in this case
is no longer strictly bottom-up, but it is still left-to-right. At FT and TT nodes, the left box holds the st attribute; the right
holds val.
We say an attribute grammar is well defined if its rules determine a unique set
of values for the attributes of every possible parse tree. An attribute grammar is
noncircular if it never leads to a parse tree in which there are cycles in the attribute
flow graph—that is, if no attribute, in any parse tree, ever depends (transitively)
on itself. (A grammar can be circular and still be well defined if attributes are
guaranteed to converge to a unique value.) As a general rule, practical attribute
grammars tend to be noncircular.
An algorithm that decorates parse trees by invoking the rules of an attribute
grammar in an order consistent with the tree’s attribute flow is called a translation
scheme. Perhaps the simplest scheme is one that makes repeated passes over a
tree, invoking any semantic function whose arguments have all been defined, and
stopping when it completes a pass in which no values change. Such a scheme is
said to be oblivious, in the sense that it exploits no special knowledge of either the
parse tree or the grammar. It will halt only if the grammar is well defined. Better
performance, at least for noncircular grammars, may be achieved by a dynamic
scheme that tailors the evaluation order to the structure of a given parse tree, for
example by constructing a topological sort of the attribute flow graph and then
invoking rules in an order consistent with the sort.
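The difference between an oblivious scheme and a dynamic one can be sketched on the rule instances of a single decorated tree, here the tree for 9 - 4 - 3 from Example 4.8. The attribute-instance names (tail1.st and so on), the rules dictionary, and the functions oblivious and dynamic are hypothetical; the topological sort uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each attribute instance maps to (semantic function, argument attribute instances).
# The rules are deliberately listed in an unhelpful order.
rules = {
    "expr.val":  (lambda v: v,          ["tail1.val"]),
    "tail1.val": (lambda v: v,          ["tail2.val"]),
    "tail2.val": (lambda v: v,          ["tail3.val"]),
    "tail3.val": (lambda st: st,        ["tail3.st"]),
    "tail3.st":  (lambda st, c: st - c, ["tail2.st", "const3.val"]),
    "tail2.st":  (lambda st, c: st - c, ["tail1.st", "const2.val"]),
    "tail1.st":  (lambda c: c,          ["const1.val"]),
}
scanner = {"const1.val": 9, "const2.val": 4, "const3.val": 3}


def oblivious(rules, known):
    """Repeat passes over all rules until a pass changes nothing."""
    vals = dict(known)
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for target, (fn, args) in rules.items():
            if all(a in vals for a in args):
                new = fn(*(vals[a] for a in args))
                if target not in vals or vals[target] != new:
                    vals[target] = new
                    changed = True
    return vals, passes


def dynamic(rules, known):
    """Evaluate each rule exactly once, in topological order of the flow graph."""
    graph = {target: set(args) for target, (_, args) in rules.items()}
    vals = dict(known)
    for attr in TopologicalSorter(graph).static_order():   # raises CycleError on a cycle
        if attr in rules:                                   # scanner-supplied values have no rule
            fn, args = rules[attr]
            vals[attr] = fn(*(vals[a] for a in args))
    return vals


obl_vals, obl_passes = oblivious(rules, scanner)
dyn_vals = dynamic(rules, scanner)
print(obl_vals["expr.val"], dyn_vals["expr.val"], obl_passes)
```

Both functions reach the same values, but the oblivious version needs several passes over the rule list, while the dynamic version pays for the sort once and then invokes each rule a single time.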
The fastest translation schemes, however, tend to be static—based on an analysis
of the structure of the attribute grammar itself, and then applied mechanically
to any tree arising from the grammar. Like LL and LR parsers, linear-time static
translation schemes can be devised only for certain restricted classes of grammars.
S-attributed grammars, such as the one in Figure 4.1, form the simplest such
class. Because attribute flow in an S-attributed grammar is strictly bottom-up (see
Figure 4.2), attributes can be evaluated by visiting the nodes of the parse tree in
exactly the same order that those nodes are generated by an LR-family parser. In
fact, the attributes can be evaluated on the fly during a bottom-up parse, thereby
interleaving parsing and semantic analysis (attribute evaluation).
The attribute grammar of Figure 4.3 is a good bit messier than that of
Figure 4.1, but it is still L-attributed: its attributes can be evaluated by visiting
the nodes of the parse tree in a single left-to-right, depth-first traversal (the same
order in which they are visited during a top-down parse—see Figure 4.4). If we say
that an attribute A.s depends on an attribute B.t if B.t is ever passed to a semantic
function that returns a value for A.s, then we can define L-attributed grammars
more formally with the following two rules: (1) each synthesized attribute of a
left-hand-side symbol depends only on that symbol’s own inherited attributes or
on attributes (synthesized or inherited) of the production’s right-hand-side symbols, and (2) each inherited attribute of a right-hand-side symbol depends only
on inherited attributes of the left-hand-side symbol or on attributes (synthesized
or inherited) of symbols to its left in the right-hand-side.
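The two conditions can be checked mechanically for a single production. In this sketch an attribute occurrence is a pair (position, name), with position 0 for the left-hand side and 1, 2, ... for right-hand-side symbols from left to right. The function is_l_attributed and the simplification that one set of names identifies all inherited attributes are assumptions made for illustration; the sketch also does not verify that each target is itself of the right kind.

```python
def is_l_attributed(rules, inherited_names):
    """rules: list of (target, deps) for one production.

    target and each dep are (position, attribute_name) pairs.
    inherited_names: attribute names assumed to be inherited for every symbol.
    """
    for (tpos, _tattr), deps in rules:
        for dpos, dattr in deps:
            if tpos == 0:
                # (1) Synthesized attribute of the LHS: may depend on the LHS's own
                #     inherited attributes or on any attribute of a RHS symbol.
                ok = dpos > 0 or dattr in inherited_names
            else:
                # (2) Inherited attribute of RHS symbol tpos: may depend on inherited
                #     attributes of the LHS or on any attribute of a symbol to its left.
                ok = (dpos == 0 and dattr in inherited_names) or (0 < dpos < tpos)
            if not ok:
                return False
    return True


# expr_tail1 -> - const expr_tail2   (positions 0, 1, 2, 3)
subtraction = [
    ((3, "st"), [(0, "st"), (2, "val")]),    # expr_tail2.st := expr_tail1.st − const.val
    ((0, "val"), [(3, "val")]),              # expr_tail1.val := expr_tail2.val
]
# A made-up rule in which a symbol inherits from a symbol to its right.
right_to_left = [((1, "st"), [(2, "val")])]

print(is_l_attributed(subtraction, {"st"}))
print(is_l_attributed(right_to_left, {"st"}))
```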
Because L-attributed grammars permit rules that initialize attributes of the
left-hand side of a production using attributes of symbols on the right-hand
side, every S-attributed grammar is also an L-attributed grammar. The reverse
is not the case: S-attributed grammars do not permit the initialization of
attributes on the right-hand side, so there are L-attributed grammars that are not
S-attributed.
S-attributed attribute grammars are the most general class of attribute grammars for which evaluation can be implemented on the fly during an LR parse.
L-attributed grammars are the most general class for which evaluation can be
implemented on the fly during an LL parse. If we interleave semantic analysis (and possibly intermediate code generation) with parsing, then a bottom-up
parser must in general be paired with an S-attributed translation scheme; a top-down parser must be paired with an L-attributed translation scheme. (Depending
on the structure of the grammar, it is often possible for a bottom-up parser to
accommodate some non–S-attributed attribute flow; we consider this possibility
in Section 4.5.1.) If we choose to separate parsing and semantic analysis into
separate passes, then the code that builds the parse tree or syntax tree must still
use an S-attributed or L-attributed translation scheme (as appropriate), but the
semantic analyzer can use a more powerful scheme if desired. There are certain
tasks, such as the generation of code for “short-circuit” Boolean expressions (to
be discussed in Sections 6.1.5 and 6.4.1), that are easiest to accomplish with a
non–L-attributed scheme.
One-Pass Compilers
A compiler that interleaves semantic analysis and code generation with parsing is
said to be a one-pass compiler.⁴ It is unclear whether interleaving semantic analysis
with parsing makes a compiler simpler or more complex; it’s mainly a matter of
taste. If intermediate code generation is interleaved with parsing, one need not
build a syntax tree at all (unless of course the syntax tree is the intermediate code).
Moreover, it is often possible to write the intermediate code to an output file on the
fly, rather than accumulating it in the attributes of the root of the parse tree. The
resulting space savings were important for previous generations of computers,
which had very small main memories. On the other hand, semantic analysis is
easier to perform during a separate traversal of a syntax tree, because that tree
reflects the program’s semantic structure better than the parse tree does, especially
with a top-down parser, and because one has the option of traversing the tree in
an order other than that chosen by the parser.
Building a Syntax Tree
EXAMPLE 4.11 Bottom-up and top-down AGs to build a syntax tree
If we choose not to interleave parsing and semantic analysis, we still need to
add attribute rules to the context-free grammar, but they serve only to create
the syntax tree—not to enforce semantic rules or generate code. Figures 4.5 and
4.6 contain bottom-up and top-down attribute grammars, respectively, to build
a syntax tree for constant expressions. The attributes in these grammars hold
neither numeric values nor target code fragments; instead they point to nodes
of the syntax tree. Function make_leaf returns a pointer to a newly allocated
syntax tree node containing the value of a constant. Functions make_un_op and
make_bin_op return pointers to newly allocated syntax tree nodes containing a
unary or binary operator, respectively, and pointers to the supplied operand(s).
Figures 4.7 and 4.8 show stages in the decoration of parse trees for (1 + 3) * 2,
using the grammars of Figures 4.5 and 4.6, respectively. Note that the final syntax
tree is the same in each case.

DESIGN & IMPLEMENTATION
Forward references
In Sections 3.3.3 and 3.4.1 we noted that the scope rules of many languages
require names to be declared before they are used, and provide special mechanisms to introduce the forward references needed for recursive definitions.
While these rules may help promote the creation of clear, maintainable code,
an equally important motivation, at least historically, was to facilitate the construction of one-pass compilers. With increases in memory size, processing
speed, and programmer expectations regarding the quality of code improvement, multipass compilers have become ubiquitous, and language designers
have felt free (as, for example, in the class declarations of C++, Java, and C#)
to abandon the requirement that declarations precede uses.

4 Most authors use the term one-pass only for compilers that translate all the way from source to
target code in a single pass. Some authors insist only that intermediate code be generated in a
single pass, and permit additional pass(es) to translate intermediate code to target code.

E1 −→ E2 + T
    E1.ptr := make_bin_op(“+”, E2.ptr, T.ptr)
E1 −→ E2 - T
    E1.ptr := make_bin_op(“–”, E2.ptr, T.ptr)
E −→ T
    E.ptr := T.ptr
T1 −→ T2 * F
    T1.ptr := make_bin_op(“×”, T2.ptr, F.ptr)
T1 −→ T2 / F
    T1.ptr := make_bin_op(“÷”, T2.ptr, F.ptr)
T −→ F
    T.ptr := F.ptr
F1 −→ - F2
    F1.ptr := make_un_op(“+/–”, F2.ptr)
F −→ ( E )
    F.ptr := E.ptr
F −→ const
    F.ptr := make_leaf(const.val)
Figure 4.5 Bottom-up (S-attributed) attribute grammar to construct a syntax tree. The
symbol +/− is used (as it is on calculators) to indicate change of sign.
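The three constructor functions described above can be sketched in Python. This is only an illustration under stated assumptions: the Node class, its field names, and the show helper are hypothetical and do not come from the text; only the names make_leaf, make_un_op, and make_bin_op and the order of the calls (the order in which a bottom-up parser would perform the reductions of Figure 4.5) follow the discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One syntax tree node (hypothetical layout, assumed for this sketch)."""
    op: str                          # "const" for leaves, else an operator symbol
    value: Optional[float] = None    # used only by leaves
    left: Optional["Node"] = None    # sole child of a unary operator
    right: Optional["Node"] = None   # second child of a binary operator


def make_leaf(val):
    """Return a new leaf holding the value of a constant."""
    return Node("const", value=val)


def make_un_op(op, child):
    """Return a new node for a unary operator and its operand."""
    return Node(op, left=child)


def make_bin_op(op, left, right):
    """Return a new node for a binary operator and its two operands."""
    return Node(op, left=left, right=right)


def show(n):
    """Render a tree as a fully parenthesized string (helper for inspection)."""
    if n.op == "const":
        return str(n.value)
    if n.right is None:
        return f"({n.op} {show(n.left)})"
    return f"({show(n.left)} {n.op} {show(n.right)})"


# Calls made, in order, as a bottom-up parser reduces (1 + 3) * 2.
one = make_leaf(1)                        # F -> const
three = make_leaf(3)                      # F -> const
plus = make_bin_op("+", one, three)       # E1 -> E2 + T
two = make_leaf(2)                        # F -> const
tree = make_bin_op("×", plus, two)        # T1 -> T2 * F

print(show(tree))
```

The copy rules of the grammar (E −→ T, T −→ F, F −→ ( E )) need no code here: they simply pass an existing pointer upward, which in this sketch is the same Python reference.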
CHECK YOUR UNDERSTANDING
1. What determines whether a language rule is a matter of syntax or of static
semantics?
2. Why is it impossible to detect certain program errors at compile time, even
though they can be detected at run time?
3. What is an attribute grammar?
4. What are programming assertions? What is their purpose?
5. What is the difference between synthesized and inherited attributes?
6. Give two examples of information that is typically passed through inherited
attributes.
7. What is attribute flow?
8. What is a one-pass compiler?
9. What does it mean for an attribute grammar to be S-attributed? L-attributed?
Noncircular? What is the significance of these grammar classes?
E −→ T TT
    TT.st := T.ptr
    E.ptr := TT.ptr
TT1 −→ + T TT2
    TT2.st := make_bin_op(“+”, TT1.st, T.ptr)
    TT1.ptr := TT2.ptr
TT1 −→ - T TT2
    TT2.st := make_bin_op(“–”, TT1.st, T.ptr)
    TT1.ptr := TT2.ptr
TT −→ ε
    TT.ptr := TT.st
T −→ F FT
    FT.st := F.ptr
    T.ptr := FT.ptr
FT1 −→ * F FT2
    FT2.st := make_bin_op(“×”, FT1.st, F.ptr)
    FT1.ptr := FT2.ptr
FT1 −→ / F FT2
    FT2.st := make_bin_op(“÷”, FT1.st, F.ptr)
    FT1.ptr := FT2.ptr
FT −→ ε
    FT.ptr := FT.st
F1 −→ - F2
    F1.ptr := make_un_op(“+/–”, F2.ptr)
F −→ ( E )
    F.ptr := E.ptr
F −→ const
    F.ptr := make_leaf(const.val)
Figure 4.6 Top-down (L-attributed) attribute grammar to construct a syntax tree. Here the
st attribute, like the ptr attribute (and unlike the st attribute of Figure 4.3), is a pointer to a syntax
tree node.
4.4 Action Routines
Just as there are automatic tools that will construct a parser for a given context-free grammar, there are automatic tools that will construct a semantic analyzer
(attribute evaluator) for a given attribute grammar. Attribute evaluator generators
have been used in syntax-based editors [RT88], incremental compilers [SDB84],
and various aspects of programming language research. Most production compilers, however, use an ad hoc, handwritten translation scheme, interleaving parsing
with at least the initial construction of a syntax tree, and possibly all of semantic
analysis and intermediate code generation. Because they are able to evaluate the
attributes of each production as it is parsed, they do not need to build the full
parse tree.

[Figure 4.7: four parse tree diagrams, labeled (a) through (d), with the syntax tree under construction shown beside each; diagram not reproduced.]
Figure 4.7 Construction of a syntax tree for (1 + 3) * 2 via decoration of a bottom-up parse
tree, using the grammar of Figure 4.5. This figure reads from bottom to top. In diagram (a), the
values of the constants 1 and 3 have been placed in new syntax tree leaves. Pointers to these
leaves propagate up into the attributes of E and T. In (b), the pointers to these leaves become
child pointers of a new internal + node. In (c) the pointer to this node propagates up into the
attributes of T, and a new leaf is created for 2. Finally, in (d), the pointers from T and F become
child pointers of a new internal × node, and a pointer to this node propagates up into the
attributes of E.

[Figure 4.8: three parse tree diagrams, labeled (a) through (c), with the syntax tree under construction shown beside each; diagram not reproduced.]
Figure 4.8 Construction of a syntax tree via decoration of a top-down parse tree, using the grammar of Figure 4.6. In the
top diagram, (a), the value of the constant 1 has been placed in a new syntax tree leaf. A pointer to this leaf then propagates to
the st attribute of TT. In (b), a second leaf has been created to hold the constant 3. Pointers to the two leaves then become
child pointers of a new internal + node, a pointer to which propagates from the st attribute of the bottom-most TT, where it
was created, all the way up and over to the st attribute of the top-most FT. In (c), a third leaf has been created for the constant
2. Pointers to this leaf and to the + node then become the children of a new × node, a pointer to which propagates from the
st of the lower FT, where it was created, all the way to the root of the tree.

EXAMPLE 4.12 Top-down action routines to build a syntax tree
EXAMPLE 4.13 Recursive descent and action routines
An ad hoc translation scheme that is interleaved with parsing takes the form
of a set of action routines. An action routine is a semantic function that the
programmer (grammar writer) instructs the compiler to execute at a particular
point in the parse. Most parser generators allow the programmer to specify action
routines. In an LL parser generator, an action routine can appear anywhere within
a right-hand side. A routine at the beginning of a right-hand side will be called
as soon as the parser predicts the production. A routine embedded in the middle
of a right-hand side will be called as soon as the parser has matched (the yield
of) the symbol to the left. The implementation mechanism is simple: when it
predicts a production, the parser pushes all of the right-hand side onto the stack,
including terminals (to be matched), nonterminals (to drive future predictions),
and pointers to action routines. When it finds a pointer to an action routine at the
top of the parse stack, the parser simply calls it.
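The stack mechanism just described can be sketched in a few lines of Python. This is a minimal illustration, not the driver of any particular parser generator: the tiny grammar (sums of integer constants), the predict table, and the routines push_const and add are all assumptions made for the example, and the action routines keep their data in an ad hoc value list rather than in formal attributes.

```python
def parse_sum(tokens):
    """Table-driven LL(1) parse of  E -> const TT ;  TT -> + const TT | epsilon,
    with action routines stored on the parse stack next to grammar symbols."""
    values = []      # ad hoc space used by the action routines (assumption)
    last = [None]    # most recently matched token, visible to the routines

    def push_const():            # runs right after a const has been matched
        values.append(last[0])

    def add():                   # runs after the right operand has been pushed
        right = values.pop()
        left = values.pop()
        values.append(left + right)

    # Predict table: (nonterminal, lookahead) -> right-hand side, in which
    # callables are action routines and strings are grammar symbols.
    table = {
        ("E", "const"): ["const", push_const, "TT"],
        ("TT", "+"): ["+", "const", push_const, add, "TT"],
        ("TT", "$$"): [],        # epsilon production
    }

    toks = list(tokens) + ["$$"]
    pos = 0
    stack = ["$$", "E"]          # top of stack is the end of the list
    while stack:
        top = stack.pop()
        tok = toks[pos]
        kind = "const" if isinstance(tok, int) else tok
        if callable(top):
            top()                                # action routine: just call it
        elif top in ("E", "TT"):
            rhs = table.get((top, kind))
            if rhs is None:
                raise SyntaxError(f"unexpected {tok!r}")
            stack.extend(reversed(rhs))          # predict: push the whole RHS
        else:
            if top != kind:
                raise SyntaxError(f"expected {top!r}, saw {tok!r}")
            last[0] = tok                        # match a terminal
            pos += 1
    return values[0]


print(parse_sum([1, "+", 3, "+", 2]))
```

Because add is pushed together with the rest of its right-hand side at prediction time, it runs only after everything to its left in that right-hand side has been matched, which is the behavior the paragraph above describes.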
To make this process more concrete, consider again our LL(1) grammar for
constant expressions. Action routines to build a syntax tree while parsing this
grammar appear in Figure 4.9. The only difference between this grammar and the
one in Figure 4.6 is that the action routines (delimited here with curly braces)
are embedded among the symbols of the right-hand sides; the work performed
is the same. The ease with which the attribute grammar can be transformed into
the grammar with action routines is due to the fact that the attribute grammar is
L-attributed. If it required more complicated flow, we would not be able to cast it
in the form of action routines.
As in ordinary parsing, there is a strong analogy between recursive descent
and table-driven parsing with action routines. Figure 4.10 shows the term_tail
routine from Figure 2.16 (page 74), modified to do its part in constructing a
syntax tree. The behavior of this routine mirrors that of productions 2 through
4 in Figure 4.9. The routine accepts as a parameter a pointer to the syntax tree
fragment contained in the attribute grammar’s TT1. Then, given an upcoming +
or - symbol on the input, it (1) calls add_op to parse that symbol (returning a
character string representation); (2) calls term to parse the attribute grammar’s T;
(3) calls make_bin_op to create a new tree node; (4) passes that node to term_tail,
which parses the attribute grammar’s TT2; and (5) returns the result.

DESIGN & IMPLEMENTATION
Attribute evaluators
Automatic evaluators based on formal attribute grammars are popular in language research projects because they save developer time when the language
definition changes. They are popular in syntax-based editors and incremental
compilers because they save execution time: when a small change is made to
a program, the evaluator may be able to “patch up” tree decorations significantly faster than it could rebuild them from scratch. For the typical compiler,
however, semantic analysis based on a formal attribute grammar is overkill: it
has higher overhead than action routines, and doesn’t really save the compiler
writer that much work.

E −→ T { TT.st := T.ptr } TT { E.ptr := TT.ptr }
TT1 −→ + T { TT2.st := make_bin_op(“+”, TT1.st, T.ptr) } TT2 { TT1.ptr := TT2.ptr }
TT1 −→ - T { TT2.st := make_bin_op(“–”, TT1.st, T.ptr) } TT2 { TT1.ptr := TT2.ptr }
TT −→ { TT.ptr := TT.st }
T −→ F { FT.st := F.ptr } FT { T.ptr := FT.ptr }
FT1 −→ * F { FT2.st := make_bin_op(“×”, FT1.st, F.ptr) } FT2 { FT1.ptr := FT2.ptr }
FT1 −→ / F { FT2.st := make_bin_op(“÷”, FT1.st, F.ptr) } FT2 { FT1.ptr := FT2.ptr }
FT −→ { FT.ptr := FT.st }
F1 −→ - F2 { F1.ptr := make_un_op(“+/–”, F2.ptr) }
F −→ ( E ) { F.ptr := E.ptr }
F −→ const { F.ptr := make_leaf(const.val) }
Figure 4.9 LL(1) grammar with action routines to build a syntax tree.

procedure term_tail(lhs : tree_node_ptr)
    case input_token of
        +, - :
            op : string := add_op
            return term_tail(make_bin_op(op, lhs, term))
                – – term is a recursive call with no arguments
        ), id, read, write, $$ :
            – – epsilon production
            return lhs
        otherwise parse_error
Figure 4.10 Recursive descent parsing with embedded “action routines.” Compare to the
routine with the same name in Figure 2.16 (page 74) and with productions 2 through 4 in
Figure 4.9.
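The five steps above translate almost directly into a conventional programming language. The following Python sketch is one possible rendering, under several assumptions: the Parser class, its regular-expression tokenizer, and the use of tuples for tree nodes (a stand-in for make_bin_op and make_un_op) are invented for illustration, and only constant expressions are handled, so the follow sets contain just ), + , - and $$.

```python
import re


class Parser:
    """Recursive descent for the LL(1) constant-expression grammar, building
    a syntax tree as it goes. Leaves are ints; inner nodes are tuples."""

    def __init__(self, text):
        # Crude tokenizer (assumption): integers and single-character
        # operators; any other character is silently skipped.
        self.toks = re.findall(r"\d+|[-+*/()]", text) + ["$$"]
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def match(self, expected=None):
        tok = self.toks[self.pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, saw {tok!r}")
        self.pos += 1
        return tok

    def expr(self):                       # E -> T TT
        return self.term_tail(self.term())

    def term_tail(self, lhs):             # lhs plays the role of TT1.st
        if self.peek() in ("+", "-"):
            op = self.match()             # (1) parse the add_op
            rhs = self.term()             # (2) parse T
            node = (op, lhs, rhs)         # (3) make_bin_op
            return self.term_tail(node)   # (4) parse TT2, (5) return result
        if self.peek() in (")", "$$"):
            return lhs                    # epsilon production: TT.ptr := TT.st
        raise SyntaxError(f"unexpected {self.peek()!r}")

    def term(self):                       # T -> F FT
        return self.factor_tail(self.factor())

    def factor_tail(self, lhs):
        if self.peek() in ("*", "/"):
            op = self.match()
            return self.factor_tail((op, lhs, self.factor()))
        if self.peek() in ("+", "-", ")", "$$"):
            return lhs
        raise SyntaxError(f"unexpected {self.peek()!r}")

    def factor(self):
        tok = self.peek()
        if tok == "-":                    # F1 -> - F2
            self.match()
            return ("neg", self.factor())
        if tok == "(":                    # F -> ( E )
            self.match("(")
            inner = self.expr()
            self.match(")")
            return inner
        if tok.isdigit():                 # F -> const
            return int(self.match())
        raise SyntaxError(f"unexpected {tok!r}")


def parse(text):
    p = Parser(text)
    tree = p.expr()
    p.match("$$")
    return tree


print(parse("(1 + 3) * 2"))
```

Passing the left operand down as a parameter is what makes the resulting tree left-associative even though the grammar itself is right-recursive: parse("8 - 3 - 2") groups as (8 - 3) - 2.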
Bottom-Up Evaluation
In an LR parser generator, one cannot in general embed action routines at arbitrary places in a right-hand side, since the parser does not in general know what
production it is in until it has seen all or most of the yield. LR parser generators therefore permit action routines only after the point at which the production
being parsed can be identified unambiguously (this is known as the trailing part
of the right-hand side; the ambiguous part is the left corner). If the attribute flow
of the action routines is strictly bottom-up (as it is in an S-attributed attribute
grammar), then execution at the end of right-hand sides is all that is needed. The
attribute grammars of Figures 4.1 and 4.5, in fact, are essentially identical to the
action routine versions. If the action routines are responsible for a significant part
of semantic analysis, however (as opposed to simply building a syntax tree), then
they will often need contextual information in order to do their job. To obtain
and use this information in an LR parse, they will need some (necessarily limited)
access to inherited attributes or to information outside the current production.
We consider this issue further in Section 4.5.1.
4.5 Space Management for Attributes
Any attribute evaluation method requires space to hold the attributes of the grammar symbols. If we are building an explicit parse tree, then the obvious approach
is to store attributes in the nodes of the tree themselves. If we are not building a
parse tree, then we need to find a way to keep track of the attributes for the symbols we have seen (or predicted) but not yet finished parsing. The details differ in
bottom-up and top-down parsers.
For a bottom-up parser with an S-attributed grammar, the obvious approach
is to maintain an attribute stack that directly mirrors the parse stack: next to
every state number on the parse stack is an attribute record for the symbol we
shifted when we entered that state. Entries in the attribute stack are pushed and
popped automatically by the parser driver; space management is not an issue for
the writer of action routines. Complications arise if we try to achieve the effect of
inherited attributes, but these can be accommodated within the basic attribute-stack framework.
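As a rough illustration of an attribute stack that mirrors the parse stack, the Python sketch below replays the shifts and reductions a bottom-up parser would perform for (1 + 3) * 2 under an S-attributed grammar that computes values. It is only a sketch: there is no real LR table here, the list of actions is written out by hand as a stand-in for the parser driver's decisions, and the function name run and the tuple encoding of actions are assumptions.

```python
def run(actions):
    """Maintain an attribute stack alongside an (implicit) LR parse stack.

    ("shift", attr)        pushes the attribute record of the shifted token.
    ("reduce", n, rule)    pops the n records of the right-hand side, applies
                           the production's rule, and pushes the record of the
                           left-hand side.
    """
    attr_stack = []
    for act in actions:
        if act[0] == "shift":
            attr_stack.append(act[1])
        else:
            _, n, rule = act
            start = len(attr_stack) - n
            rhs = attr_stack[start:]
            del attr_stack[start:]
            attr_stack.append(rule(*rhs))
    return attr_stack


copy = lambda x: x   # copy rules such as E -> T or T -> F

# Hand-written action sequence for (1 + 3) * 2 (assumption: this is the
# order in which an LR parser for the expression grammar would act).
script = [
    ("shift", None),                                 # (
    ("shift", 1),                                    # const 1
    ("reduce", 1, copy),                             # F -> const
    ("reduce", 1, copy),                             # T -> F
    ("reduce", 1, copy),                             # E -> T
    ("shift", None),                                 # +
    ("shift", 3),                                    # const 3
    ("reduce", 1, copy),                             # F -> const
    ("reduce", 1, copy),                             # T -> F
    ("reduce", 3, lambda e, _plus, t: e + t),        # E1 -> E2 + T
    ("shift", None),                                 # )
    ("reduce", 3, lambda _l, e, _r: e),              # F -> ( E )
    ("reduce", 1, copy),                             # T -> F
    ("shift", None),                                 # *
    ("shift", 2),                                    # const 2
    ("reduce", 1, copy),                             # F -> const
    ("reduce", 3, lambda t, _star, f: t * f),        # T1 -> T2 * F
    ("reduce", 1, copy),                             # E -> T
]

print(run(script))
```

The rules never push or pop anything themselves; all stack manipulation happens in run, which is the sense in which space management is not an issue for the writer of action routines.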
For a top-down parser with an L-attributed grammar, we have two principal
options. The first option is automatic, but more complex than for bottom-up
grammars. It still uses an attribute stack, but one that does not mirror the parse
stack. The second option has lower space overhead, and saves time by “shortcutting” copy rules, but requires action routines to allocate and deallocate space
for attributes explicitly.
In both families of parsers, it is common for some of the contextual information for action routines to be kept in global variables. The symbol table in
particular is usually global. We can be sure that the table will always represent the
current referencing environment, because we control the order in which action
routines (including those that modify the environment at the beginnings and
ends of scopes) are executed. In a pure attribute grammar we should need to
pass symbol table information into and out of productions through inherited and
synthesized attributes.
IN MORE DEPTH
We consider attribute space management in more detail on the PLP CD. Using
bottom-up and top-down grammars for arithmetic expressions, we illustrate
automatic management for both bottom-up and top-down parsers, as well as
the ad hoc option for top-down parsers.
program −→ stmt_list $$
stmt_list −→ stmt_list decl | stmt_list stmt | ε
decl −→ int id | real id
stmt −→ id := expr | read id | write expr
expr −→ term | expr add_op term
term −→ factor | term mult_op factor
factor −→ ( expr ) | id | int_const | real_const | float ( expr ) | trunc ( expr )
add_op −→ + | -
mult_op −→ * | /
Figure 4.11 Context-free grammar for a calculator language with types and declarations. The
intent is that every identifier be declared before use, and that types not be mixed in computations.
4.6 Decorating a Syntax Tree
EXAMPLE 4.14 Bottom-up CFG for calculator language with types
EXAMPLE 4.15 Syntax tree to average an integer and a real
EXAMPLE 4.16 Tree grammar for the calculator language with types
In our discussion so far we have used attribute grammars solely to decorate parse
trees. As we mentioned in the chapter introduction, attribute grammars can also
be used to decorate syntax trees. If our compiler uses action routines simply to
build a syntax tree, then the bulk of semantic analysis and intermediate code
generation will use the syntax tree as base.
Figure 4.11 contains a bottom-up CFG for a calculator language with types
and declarations. The grammar differs from that of Example 2.37 (page 88)
in three ways: (1) we allow declarations to be intermixed with statements, (2)
we differentiate between integer and real constants (presumably the latter contain a decimal point), and (3) we require explicit conversions between integer and real operands. The intended semantics of our language requires that
every identifier be declared before it is used, and that types not be mixed in
computations.
Extrapolating from the example in Figure 4.5, it is easy to add semantic functions or action routines to the grammar of Figure 4.11 to construct a syntax
tree for the calculator language (Exercise 4.21). The obvious structure for such a
tree would represent expressions as we did in Figure 4.7, and would represent a
program as a linked list of declarations and statements. As a concrete example,
Figure 4.12 contains the syntax tree for a simple program to print the average of
an integer and a real.
Much as a context-free grammar describes the possible structure of parse trees
for a given programming language, we can use a tree grammar to represent the
possible structure of syntax trees. As in a CFG, each production of a tree grammar
represents a possible relationship between a parent and its children in the tree.
The parent is the symbol on the left-hand side of the production; the children are
the symbols on the right-hand side. The productions used in Figure 4.12 might
look something like the following.

program −→ item
int_decl : item −→ id item
read : item −→ id item
real_decl : item −→ id item
write : item −→ expr item
null : item −→ ε
‘÷’ : expr −→ expr expr
‘+’ : expr −→ expr expr
float : expr −→ expr
id : expr −→ ε
real_const : expr −→ ε

Here the notation A : B on the left-hand side of a production means that A is one
variant of B, and may appear anywhere a B is expected on a right-hand side.
Tree grammars and context-free grammars differ in important ways. A context-free
grammar is meant to define (generate) a language composed of strings of
tokens, where each string is the fringe (yield) of a parse tree. Parsing is the process
of finding a tree that has a given yield. A tree grammar, as we use it here, is meant
to define (or generate) the trees themselves. We have no need for a notion of
parsing: we can easily inspect a tree and determine whether (and how) it can
be generated by the grammar. Our purpose in introducing tree grammars is to
provide a framework for the decoration of syntax trees. Semantic rules attached
to the productions of a tree grammar can be used to define the attribute flow of a
syntax tree in exactly the same way that semantic rules attached to the productions
of a context-free grammar are used to define the attribute flow of a parse tree. We
will use a tree grammar in the remainder of this section to perform static semantic
checking. In Chapter 14 we will show how additional semantic rules can be used
to generate intermediate code.

[Figure 4.12: syntax tree diagram, not reproduced. The program node heads a chain of items int_decl, read, real_decl, read, write, null; the expression under write is a ÷ node whose children are a + node (over float of a, and b) and the constant 2.0. The source program shown beside the tree is:]
int a
read a
real b
read b
write (float (a) + b) / 2.0
Figure 4.12 Syntax tree for a simple calculator program.

Class of node | Variants                                            | Inherited attributes | Synthesized attributes
program       | (none)                                              | (none)               | location, errors
item          | int_decl, real_decl, read, write, :=, null          | symtab, errors_in    | location, errors_out
expr          | int_const, real_const, id, +, –, ×, ÷, float, trunc | symtab               | location, type, errors, name (id only)
Figure 4.13 Classes of nodes for the syntax tree attribute grammar of Figure 4.14. With the
exception of name, all variants of a given class have all the class’s attributes.

EXAMPLE 4.17 Tree AG for the calculator language with types
A complete tree attribute grammar for our calculator language with types can be
constructed using the node classes, variants, and attributes shown in Figure 4.13.
The grammar itself appears in Figure 4.14. Once decorated, the program node
at the root of the syntax tree will contain a list, in a synthesized attribute, of all
static semantic errors in the program. (The list will be empty if the program is free
of such errors.) Each item or expr node has an inherited attribute symtab that
contains a list, with types, of all identifiers declared to the left in the tree. Each item
node also has an inherited attribute errors_in that lists all static semantic errors
found to its left in the tree, and a synthesized attribute errors_out to propagate
the final error list back to the root. Each expr node has one synthesized attribute
that indicates its type and another that contains a list of any static semantic errors
found inside.
Our handling of semantic errors illustrates a common technique. In order
to continue looking for other errors we must provide values for any attributes
that would have been set in the absence of an error. To avoid cascading error
messages, we choose values for those attributes that will pass quietly through
subsequent checks. In this specific case we employ a pseudotype called error ,
which we associate with any symbol table entry or expression for which we have
already generated a message.
Though it takes a bit of checking to verify the fact, our attribute grammar is
noncircular and well defined. No attribute is ever assigned a value more than once.
(The helper routines at the end of Figure 4.14 should be thought of as macros,
rather than semantic functions. For the sake of brevity we have passed them
entire tree nodes as arguments. Each macro calculates the values of two different
attributes. Under a strict formulation of attribute grammars each macro would
be replaced by two separate semantic functions, one per calculated attribute.)

program −→ item
    item.symtab := null
    program.errors := item.errors_out
    item.errors_in := null
int_decl : item1 −→ id item2
    declare_name(id, item1, item2, int)
    item1.errors_out := item2.errors_out
real_decl : item1 −→ id item2
    declare_name(id, item1, item2, real)
    item1.errors_out := item2.errors_out
read : item1 −→ id item2
    item2.symtab := item1.symtab
    if ⟨id.name, ?⟩ ∈ item1.symtab
        item2.errors_in := item1.errors_in
    else
        item2.errors_in := item1.errors_in + [id.name “undefined at” id.location]
    item1.errors_out := item2.errors_out
write : item1 −→ expr item2
    expr.symtab := item1.symtab
    item2.symtab := item1.symtab
    item2.errors_in := item1.errors_in + expr.errors
    item1.errors_out := item2.errors_out
‘:=’ : item1 −→ id expr item2
    expr.symtab := item1.symtab
    item2.symtab := item1.symtab
    if ⟨id.name, A⟩ ∈ item1.symtab        – – for some type A
        if A ≠ error and expr.type ≠ error and A ≠ expr.type
            item2.errors_in := item1.errors_in + [“type clash at” item1.location]
        else
            item2.errors_in := item1.errors_in + expr.errors
    else
        item2.errors_in := item1.errors_in + [id.name “undefined at” id.location] + expr.errors
    item1.errors_out := item2.errors_out
null : item −→ ε
    item.errors_out := item.errors_in
id : expr −→ ε
    if ⟨id.name, A⟩ ∈ expr.symtab        – – for some type A
        expr.errors := null
        expr.type := A
    else
        expr.errors := [id.name “undefined at” id.location]
        expr.type := error
int_const : expr −→ ε
    expr.type := int
real_const : expr −→ ε
    expr.type := real
‘+’ : expr1 −→ expr2 expr3
    expr2.symtab := expr1.symtab
    expr3.symtab := expr1.symtab
    check_types(expr1, expr2, expr3)
‘–’ : expr1 −→ expr2 expr3
    expr2.symtab := expr1.symtab
    expr3.symtab := expr1.symtab
    check_types(expr1, expr2, expr3)
‘×’ : expr1 −→ expr2 expr3
    expr2.symtab := expr1.symtab
    expr3.symtab := expr1.symtab
    check_types(expr1, expr2, expr3)
‘÷’ : expr1 −→ expr2 expr3
    expr2.symtab := expr1.symtab
    expr3.symtab := expr1.symtab
    check_types(expr1, expr2, expr3)
float : expr1 −→ expr2
    expr2.symtab := expr1.symtab
    convert_type(expr2, expr1, int, real, “float of non-int”)
trunc : expr1 −→ expr2
    expr2.symtab := expr1.symtab
    convert_type(expr2, expr1, real, int, “trunc of non-real”)

macro declare_name(id, cur_item, next_item : syntax_tree_node; t : type)
    if ⟨id.name, ?⟩ ∈ cur_item.symtab
        next_item.errors_in := cur_item.errors_in + [“redefinition of” id.name “at” cur_item.location]
        next_item.symtab := cur_item.symtab – ⟨id.name, ?⟩ + ⟨id.name, error⟩
    else
        next_item.errors_in := cur_item.errors_in
        next_item.symtab := cur_item.symtab + ⟨id.name, t⟩
macro check_types(result, operand1, operand2)
    if operand1.type = error or operand2.type = error
        result.type := error
        result.errors := operand1.errors + operand2.errors
    else if operand1.type ≠ operand2.type
        result.type := error
        result.errors := operand1.errors + operand2.errors + [“type clash at” result.location]
    else
        result.type := operand1.type
        result.errors := operand1.errors + operand2.errors
macro convert_type(old_expr, new_expr : syntax_tree_node; from_t, to_t : type; msg : string)
    if old_expr.type = from_t or old_expr.type = error
        new_expr.errors := old_expr.errors
        new_expr.type := to_t
    else
        new_expr.errors := old_expr.errors + [msg “at” old_expr.location]
        new_expr.type := error
Figure 4.14 Attribute grammar to decorate an abstract syntax tree for the calculator language with types. We use square
brackets to delimit error messages and pointed brackets to delimit symbol table entries. Juxtaposition indicates concatenation
within error messages; the ‘+’ and ‘–’ operators indicate insertion and removal in lists. We assume that every node has been
initialized by the scanner or by action routines in the parser to contain an indication of the location (line and column) at which
the corresponding construct appears in the source (see Exercise 4.22). The ‘?’ symbol is used as a “wild card”; it matches any
type.

EXAMPLE 4.18 Decorating a tree with the AG of Example 4.17
Figure 4.15 uses the grammar of Figure 4.14 to decorate the syntax tree of
Figure 4.12. The pattern of attribute flow appears considerably messier than in
previous examples in this chapter, but this is simply because type checking is
more complicated than calculating constants or building a syntax tree. Symbol
table information flows along the chain of items and down into expr trees. The
int_decl and real_decl nodes add new information; other nodes simply pass the
table along. Type information is synthesized at id : expr leaves by looking up an
identifier’s name in the symbol table. The information then propagates upward
within an expression tree, and is used to type-check operators and assignments
(the latter don’t appear in this example). Error messages flow along the chain
of items via the errors_in attributes, and then back to the root via the errors_out
attributes. Messages also flow up out of expr trees. Wherever a type check is
performed, the type attribute may be used to help create a new message to be
appended to the growing message list.
In our example grammar we accumulate error messages into a synthesized
attribute of the root of the syntax tree. In an ad hoc attribute evaluator we might
be tempted to print these messages on the fly as the errors are discovered. In
practice, however, particularly in a multipass compiler, it makes sense to buffer
the messages, so they can be interleaved with messages produced by other phases
of the compiler, and printed in program order at the end of compilation.
One could convert our attribute grammar into executable code using an automatic attribute evaluator generator. Alternatively, one could create an ad hoc
evaluator in the form of mutually recursive subroutines (Exercise 4.20). In the
latter case attribute flow would be explicit in the calling sequence of the routines.
We could then choose if desired to keep the symbol table in global variables, rather
than passing it from node to node through attributes. Most compilers employ the
ad hoc approach.

[Figure 4.15: the syntax tree of Figure 4.12 with attribute slots drawn beside each node and arrows showing attribute flow; diagram not reproduced. Legend: ei = errors_in, eo = errors_out, e = errors, s = symtab, t = type, n = name. The location attribute is not shown.]
Figure 4.15 Decoration of the syntax tree of Figure 4.12, using the grammar of Figure 4.14.
Location information, which we assume has been initialized in every node by the parser, contributes
to error messages, but does not otherwise propagate through the tree.
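To give a feel for the ad hoc alternative, here is a small Python sketch of a checker written as ordinary recursive subroutines over the syntax tree. It is an illustration under assumptions, not a transcription of the attribute grammar: nodes are encoded as tuples, the chain of items is a Python list rather than a linked structure, the symbol table is a single mutable dictionary shared by all routines instead of an inherited attribute, locations are omitted from messages, and the order of messages may differ from what the grammar would produce. The error pseudotype is used, as in the text, to keep one mistake from triggering a cascade of messages.

```python
ERROR = "error"   # pseudotype given to anything already reported as faulty


def check_expr(e, symtab, errors):
    """Return the type of expression e, appending any messages to errors."""
    kind = e[0]
    if kind == "int_const":
        return "int"
    if kind == "real_const":
        return "real"
    if kind == "id":
        name = e[1]
        if name in symtab:
            return symtab[name]
        errors.append(f"{name} undefined")
        return ERROR
    if kind in ("float", "trunc"):
        from_t, to_t = ("int", "real") if kind == "float" else ("real", "int")
        t = check_expr(e[1], symtab, errors)
        if t == from_t or t == ERROR:
            return to_t                    # an earlier error passes quietly
        errors.append(f"{kind} of non-{from_t}")
        return ERROR
    # Otherwise a binary operator: ("+", left, right) and so on.
    t1 = check_expr(e[1], symtab, errors)
    t2 = check_expr(e[2], symtab, errors)
    if t1 == ERROR or t2 == ERROR:
        return ERROR                       # already reported; stay silent
    if t1 != t2:
        errors.append(f"type clash at {kind}")
        return ERROR
    return t1


def check_program(items):
    """Walk the list of items, returning the list of static semantic errors."""
    symtab, errors = {}, []
    for item in items:
        kind = item[0]
        if kind in ("int_decl", "real_decl"):
            name = item[1]
            if name in symtab:
                errors.append(f"redefinition of {name}")
                symtab[name] = ERROR
            else:
                symtab[name] = "int" if kind == "int_decl" else "real"
        elif kind == "read":
            if item[1] not in symtab:
                errors.append(f"{item[1]} undefined")
        elif kind == "write":
            check_expr(item[1], symtab, errors)
        elif kind == ":=":
            name, rhs = item[1], item[2]
            t = check_expr(rhs, symtab, errors)
            if name not in symtab:
                errors.append(f"{name} undefined")
            elif ERROR not in (symtab[name], t) and symtab[name] != t:
                errors.append("type clash at :=")
    return errors


# The program of Figure 4.12, and a variant that omits the float conversion.
good = [("int_decl", "a"), ("read", "a"), ("real_decl", "b"), ("read", "b"),
        ("write", ("÷", ("+", ("float", ("id", "a")), ("id", "b")),
                   ("real_const", 2.0)))]
bad = good[:4] + [("write", ("÷", ("+", ("id", "a"), ("id", "b")),
                             ("real_const", 2.0)))]

print(check_program(good))
print(check_program(bad))
```

In the second program the clash is reported once, at the + node; the enclosing ÷ node sees an operand of type error and adds nothing further.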
CHECK YOUR UNDERSTANDING
10. What is the difference between a semantic function and an action routine?
11. Why can’t action routines be placed at arbitrary locations within the right-hand side of productions in an LR CFG?
12. What patterns of attribute flow can be captured easily with action routines?
13. Some compilers perform all semantic checks and intermediate code generation in action routines. Others use action routines to build a syntax tree and
then perform semantic checks and intermediate code generation in separate
traversals of the syntax tree. Discuss the tradeoffs between these two strategies.
14. What sort of information do action routines typically keep in global variables,
rather than in attributes?
15. Describe the similarities and differences between context-free grammars and
tree grammars.
16. How can a semantic analyzer avoid the generation of cascading error messages?
4.7 Summary and Concluding Remarks
This chapter has discussed the task of semantic analysis. We reviewed the sorts
of language rules that can be classified as syntax, static semantics, and dynamic
semantics, and discussed the issue of whether to generate code to perform dynamic
semantic checks. We also considered the role that the semantic analyzer plays in
a typical compiler. We noted that both the enforcement of static semantic rules
and the generation of intermediate code can be cast in terms of annotation, or
decoration, of a parse tree or syntax tree. We then presented attribute grammars
as a formal framework for this decoration process.
An attribute grammar associates attributes with each symbol in a context-free
grammar or tree grammar, and attribute rules with each production. Synthesized
attributes are calculated only in productions in which their symbol appears on the
left-hand side. The synthesized attributes of tokens are initialized by the scanner.
Inherited attributes are calculated in productions in which their symbol appears
within the right-hand side; they allow calculations internal to a symbol to depend
on the context in which the symbol appears. Inherited attributes of the start
symbol (goal) can represent the external environment of the compiler. Strictly
speaking, attribute grammars allow only copy rules (assignments of one attribute to
another) and simple calls to semantic functions, but we usually relax this restriction
to allow more-or-less arbitrary code fragments in some existing programming
language.
Just as context-free grammars can be categorized according to the parsing
algorithm(s) that can use them, attribute grammars can be categorized according
to the complexity of their pattern of attribute flow. S-attributed grammars, in
which all attributes are synthesized, can naturally be evaluated in a single bottom-up pass over a parse tree, in precisely the order the tree is discovered by an LR-family parser. L-attributed grammars, in which all attribute flow is depth-first
left-to-right, can be evaluated in precisely the order that the parse tree is predicted
and matched by an LL-family parser. Attribute grammars with more complex
patterns of attribute flow are not commonly used in production compilers, but
are valuable for syntax-based editors, incremental compilers, and various other
tools.
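As a rough illustration of bottom-up evaluation in an S-attributed setting, the following Python sketch decorates a small hand-built tree in post-order, which is the order in which an LR-family parser would discover it. The Node class, its single val attribute, and the decorate function are hypothetical names introduced for this sketch; they are not notation from the chapter.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A tree node; `val` is its single synthesized attribute."""
    symbol: str                                   # "const", "+", or "*"
    children: List["Node"] = field(default_factory=list)
    val: Optional[float] = None                   # set by the "scanner" for leaves


def decorate(node: Node) -> float:
    """Post-order (bottom-up) evaluation: children first, then the parent."""
    if node.symbol == "const":
        return node.val                           # token attribute, already initialized
    left, right = (decorate(child) for child in node.children)
    if node.symbol == "+":
        node.val = left + right
    elif node.symbol == "*":
        node.val = left * right
    else:
        raise ValueError(f"unknown symbol {node.symbol!r}")
    return node.val


# Tree for (1 + 3) * 2
tree = Node("*", [Node("+", [Node("const", val=1), Node("const", val=3)]),
                  Node("const", val=2)])
result = decorate(tree)
```

Because every attribute here depends only on attributes of children, no node needs information from its parent or siblings, which is what makes a single bottom-up pass sufficient.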
While it is possible to construct automatic tools to analyze attribute flow and
decorate parse trees, most compilers rely on action routines, which the compiler
writer embeds in the right-hand sides of productions to evaluate attribute rules at
specific points in a parse. In an LL-family parser, action routines can be embedded
at arbitrary points in a production’s right-hand side. In an LR-family parser,
action routines must follow the production’s left corner. Space for attributes in
a bottom-up compiler is naturally allocated in parallel with the parse stack, but
this complicates the management of inherited attributes. Space for attributes in a
top-down compiler can be allocated automatically, or managed explicitly by the
writer of action routines. The automatic approach has the advantage of regularity,
and is easier to maintain; the ad hoc approach is slightly faster and more flexible.
In a one-pass compiler, which interleaves scanning, parsing, semantic analysis,
and code generation in a single traversal of its input, semantic functions or action
routines are responsible for all of semantic analysis and code generation. More
commonly, action routines simply build a syntax tree, which is then decorated
during separate traversal(s) in subsequent pass(es).
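The common case, in which action routines do nothing more than build a syntax tree, might be sketched as follows. This is a minimal recursive descent parser for a small expression grammar chosen for illustration (it is not one of the book's figures); the tuple-building assignments play the role of action routines, and the names tokenize, Parser, and parse are assumptions of this sketch.

```python
import re


def tokenize(text):
    """Split input into number, identifier, and one-character operator tokens."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, saw {tok!r}")
        self.pos += 1
        return tok

    # expr -> term { (+|-) term }
    def expr(self):
        tree = self.term()
        while self.peek() in ("+", "-"):
            op = self.match()
            tree = (op, tree, self.term())    # action routine: build a binary node
        return tree

    # term -> factor { (*|/) factor }
    def term(self):
        tree = self.factor()
        while self.peek() in ("*", "/"):
            op = self.match()
            tree = (op, tree, self.factor())  # action routine: build a binary node
        return tree

    # factor -> ( expr ) | number | identifier
    def factor(self):
        if self.peek() == "(":
            self.match("(")
            tree = self.expr()
            self.match(")")
            return tree                       # parentheses leave no node behind
        tok = self.match()
        if not (tok[0].isalnum() or tok[0] == "_"):
            raise SyntaxError(f"unexpected {tok!r}")
        return tok                            # action routine: leaf node


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected {parser.peek()!r}")
    return tree
```

A later pass could then walk the resulting tuples to check types or generate code, without re-examining the token stream.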
In subsequent chapters (6–9 in particular) we will consider a wide variety
of programming language constructs. Rather than present the actual attribute
grammars required to implement these constructs, we will describe their semantics informally, and give examples of the target code. We will return to attribute
grammars in Chapter 14, when we consider the generation of intermediate code
in more detail.
4.8 Exercises
4.1 Basic results from automata theory tell us that the language
$L = \{a^n b^n c^n\} = \{\epsilon, abc, aabbcc, aaabbbccc, \ldots\}$ is not context free. It can be captured,
however, using an attribute grammar. Give an underlying CFG and a set of
attribute rules that associates a Boolean attribute ok with the root R of each
parse tree, such that R.ok = true if and only if the string corresponding to
the fringe of the tree is in L.
4.2 Modify the grammar of Figure 2.24 so that it accepts only programs that
contain at least one write statement. Make the same change in the solution
to Exercise 2.17. Based on your experience, what do you think of the idea of
using the CFG to enforce the rule that every function in C must contain at
least one return statement?
4.3 Give two examples of reasonable semantic rules that cannot be checked at
reasonable cost, either statically or by compiler-generated code at run time.
4.4 Write an S-attributed attribute grammar, based on the CFG of Example 4.7,
that accumulates the value of the overall expression into the root of the
tree. You will need to use dynamic memory allocation so that individual
attributes can hold an arbitrary amount of information.
Figure 4.16 Natural syntax tree for the Lisp expression (cdr '(a b c)). [Tree diagram: nested cons cells with leaves cdr, quote, a, b, c.]
4.5 As we shall learn in Chapter 10, Lisp programs take the form of parenthesized
lists. The natural syntax tree for a Lisp program is thus a tree of binary
cells (known in Lisp as cons cells), where the first child represents the first
element of the list and the second child represents the rest of the list. The
syntax tree for (cdr '(a b c)) appears in Figure 4.16. (The notation 'L is
syntactic sugar for (quote L).)
Extend the CFG of Exercise 2.18 to create an attribute grammar that
will build such trees. When a parse tree has been fully decorated, the root
should have an attribute v that refers to the syntax tree. You may assume
that each atom has a synthesized attribute v that refers to a syntax tree node
that holds information from the scanner. In your semantic functions, you
may assume the availability of a cons function that takes two references
as arguments and returns a reference to a new cons cell containing those
references.
4.6 Refer back to the context-free grammar of Exercise 2.13 (page 105). Add
attribute rules to the grammar to accumulate into the root of the tree a
count of the maximum depth to which parentheses are nested in the program
string. For example, given the string f1(a, f2(b * (c + (d - (e - f))))) ,
the stmt at the root of the tree should have an attribute with a count of 3
(the parentheses surrounding argument lists don’t count).
4.7 Suppose that we want to translate constant expressions into the postfix,
or “reverse Polish” notation of logician Jan Łukasiewicz. Postfix notation
does not require parentheses. It appears in stack-based languages such as
Postscript, Forth, and the P-code and Java byte code intermediate forms
mentioned in Section 1.4. It also serves as the input language of certain
Hewlett-Packard (HP) brand calculators. When given a number, an HP
calculator pushes it onto an internal stack. When given an operator, it pops
the top two numbers, applies the operator, and pushes the result. The display
shows the value at the top of the stack. To compute 2 × (5 − 3)/4 one would
enter 2 5 3 - * 4 / .
Using the underlying CFG of Figure 4.1, write an attribute grammar
that will associate with the root of the parse tree a sequence of calculator
button pushes, seq, that will compute the arithmetic value of the tokens
derived from that symbol. You may assume the existence of a function
buttons (c) that returns a sequence of button pushes (ending with ENTER
on an HP calculator) for the constant c. You may also assume the existence
of a concatenation function for sequences of button pushes.
4.8 Repeat the previous exercise using the underlying CFG of Figure 4.3.
4.9 Consider the following grammar for reverse Polish arithmetic expressions:
E −→ E E op | id
op −→ + | - | * | /
Assuming that each id has a synthesized attribute name of type string, and
that each E and op has an attribute val of type string, write an attribute
grammar that arranges for the val attribute of the root of the parse tree to
contain a translation of the expression into conventional infix notation. For
example, if the leaves of the tree, left to right, were “ A A B - * C / ,” then
the val field of the root would be “ ( ( A * ( A - B ) ) / C ) .” As an extra
challenge, write a version of your attribute grammar that exploits the usual
arithmetic precedence and associativity rules to use as few parentheses as
possible.
4.10 To reduce the likelihood of typographic errors, the digits comprising most
credit card numbers are designed to satisfy the so-called Luhn formula, standardized by ANSI in the 1960s, and named for IBM mathematician Hans
Peter Luhn. Starting at the right, we double every other digit (the second-to-last, fourth-to-last, etc.). If the doubled value is 10 or more, we add the
resulting digits. We then sum together all the digits. In any valid number
the result will be a multiple of 10. For example, 1234 5678 9012 3456
becomes 2264 1658 9022 6416, which sums to 64, so this is not a valid
number. If the last digit had been 2, however, the sum would have been 60,
so the number would potentially be valid.
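The arithmetic of the formula itself (as opposed to the attribute grammar the exercise asks for) can be checked with a few lines of ordinary code. The following Python sketch is one straightforward way to do so; the function names luhn_sum and luhn_valid are introduced here for illustration only.

```python
def luhn_sum(number: str) -> int:
    """Return the Luhn digit sum of `number`, ignoring spaces and other non-digits."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):   # i == 0 is the last digit
        if i % 2 == 1:                         # second-to-last, fourth-to-last, ...
            d *= 2
            if d >= 10:
                d -= 9                         # same as adding the two digits of d
        total += d
    return total


def luhn_valid(number: str) -> bool:
    """True if `number` contains at least one digit and its Luhn sum is a multiple of 10."""
    has_digit = any(ch.isdigit() for ch in number)
    return has_digit and luhn_sum(number) % 10 == 0
```

Applied to the two numbers discussed above, luhn_sum yields 64 and 60 respectively, so only the second passes the check.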
Give an attribute grammar for strings of digits that accumulates into the
root of the parse tree a Boolean value indicating whether the string is valid
according to Luhn’s formula. Your grammar should accommodate strings
of arbitrary length.
4.11 Consider the following CFG for floating-point constants, without exponential notation. (Note that this exercise is somewhat artificial: the language
in question is regular, and would be handled by the scanner of a typical
compiler.)
C −→ digits . digits
digits −→ digit more_digits
more_digits −→ digits | ε
digit −→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Augment this grammar with attribute rules that will accumulate the value
of the constant into a val attribute of the root of the parse tree. Your answer
should be S-attributed.
4.12 One potential criticism of the obvious solution to the previous problem is
that the values in internal nodes of the parse tree do not reflect the value,
in context, of the fringe below them. Create an alternative solution that
addresses this criticism. More specifically, create your grammar in such a
way that the val of an internal node is the sum of the val s of its children.
Illustrate your solution by drawing the parse tree and attribute flow for
12.34 . (Hint: you will probably want a different underlying CFG, and non–
L-attributed flow.)
4.13 Consider the following attribute grammar for type declarations, based on
the CFG of Exercise 2.11:
decl −→ ID decl_tail
    decl.t := decl_tail.t
    decl_tail.in_tab := insert (decl.in_tab, ID.n, decl_tail.t)
    decl.out_tab := decl_tail.out_tab
decl_tail −→ , decl
    decl_tail.t := decl.t
    decl.in_tab := decl_tail.in_tab
    decl_tail.out_tab := decl.out_tab
decl_tail −→ : ID ;
    decl_tail.t := ID.n
    decl_tail.out_tab := decl_tail.in_tab
Show a parse tree for the string A, B : C; . Then, using arrows and
textual description, specify the attribute flow required to fully decorate the
tree. (Hint: note that the grammar is not L-attributed.)
4.14 A CFG-based attribute evaluator capable of handling non–L-attributed
attribute flow needs to take a parse tree as input. Explain how to build a
parse tree automatically during a top-down or bottom-up parse (i.e., without explicit action routines).
4.15 Building on Example 4.13, modify the remainder of the recursive descent
parser of Figure 2.16 to build syntax trees for programs in the calculator
language.
4.16 Write an LL(1) grammar with action routines and automatic attribute
space management that generates the reverse Polish translation described
in Exercise 4.7.
4.17 (a) Write a context-free grammar for polynomials in x. Add semantic functions to produce an attribute grammar that will accumulate the polynomial’s derivative (as a string) in a synthesized attribute of the root of
the parse tree.
(b) Replace your semantic functions with action routines that can be evaluated during parsing.
4.18 (a) Write a context-free grammar for case or switch statements in the
style of Pascal or C. Add semantic functions to ensure that the same
label does not appear on two different arms of the construct.
(b) Replace your semantic functions with action routines that can be evaluated during parsing.
4.19 Write an algorithm to determine whether the rules of an arbitrary attribute
grammar are noncircular. (Your algorithm will require exponential time in
the worst case [JOR75].)
4.20 Rewrite the attribute grammar of Figure 4.14 in the form of an ad hoc
tree traversal consisting of mutually recursive subroutines in your favorite
programming language. Keep the symbol table in a global variable, rather
than passing it through arguments.
4.21 Write an attribute grammar based on the CFG of Figure 4.11 that will build
a syntax tree with the structure described in Figure 4.14.
4.22 Augment the attribute grammar of Figure 4.5, Figure 4.6, or Exercise 4.21 to
initialize a synthesized attribute in every syntax tree node that indicates the
location (line and column) at which the corresponding construct appears in
the source program. You may assume that the scanner initializes the location
of every token.
4.23 Modify the CFG and attribute grammar of Figures 4.11 and 4.14 to permit
mixed integer and real expressions, without the need for float and trunc .
You will want to add an annotation to any node that must be coerced to the
opposite type, so that the code generator will know to generate code to do
so. Be sure to think carefully about your coercion rules. In the expression
my_int + my_real , for example, how will you know whether to coerce the
integer to be a real, or to coerce the real to be an integer?
4.24 Explain the need for the A : B notation on the left-hand sides of productions in a tree grammar. Why isn’t similar notation required for context-free
grammars?
4.25–4.29 In More Depth.
4.9 Explorations
4.30 One of the most influential applications of attribute grammars was the
Cornell Synthesizer Generator [Rep84, RT88]. Learn how the Generator
used attribute grammars not only for incremental update of semantic information in a program under edit, but also for automatic creation of language-based
editors from formal language specifications. How general is this technique? What applications might it have beyond syntax-directed editing of
computer programs?
4.31 The attribute grammars used in this chapter are all quite simple. Most are
S- or L-attributed. All are noncircular. Are there any practical uses for more
complex attribute grammars? How about automatic attribute evaluators?
Using the Bibliographic Notes as a starting point, conduct a survey of
attribute evaluation techniques. Where is the line between practical techniques and intellectual curiosities?
4.32 The first validated Ada implementation was the Ada/Ed interpreter from
New York University [DGAFS+80]. The interpreter was written in the set-based language SETL [SDDS86] using a denotational semantics definition
of Ada. Learn about the Ada/Ed project, SETL, and denotational semantics.
Discuss how the use of a formal definition aided the development process.
Also discuss the limitations of Ada/Ed, and expand on the potential role of
formal semantics in language design, development, and prototype implementation.
4.33 Version 5 of the Scheme language manual [ADH+98] included a formal
definition of Scheme in denotational semantics. How long is this definition,
compared to the more conventional definition in English? How readable is
it? What do the length and the level of readability say about Scheme? About
denotational semantics? (For more on denotational semantics, see the texts
of Stoy [Sto77] or Gordon [Gor79].)
Version 6 of the manual [SDF+07] switched to operational semantics.
How does this compare to the denotational version? Why do you suppose
the standards committee made the change? (For more information, see the
paper by Matthews and Findler [MF08].)
4.34–4.35 In More Depth.
4.10 Bibliographic Notes
Much of the early theory of attribute grammars was developed by Knuth [Knu68].
Lewis, Rosenkrantz, and Stearns [LRS74] introduced the notion of an L-attributed
grammar. Watt [Wat77] showed how to use marker symbols to emulate inherited
attributes in a bottom-up parser. Jazayeri, Ogden, and Rounds [JOR75] showed
that exponential time may be required in the worst case to decorate a parse tree
with arbitrary attribute flow. Articles by Courcelle [Cou84] and Engelfriet [Eng84]
survey the theory and practice of attribute evaluation. The best-known attribute
grammar system for language-based editing is the Synthesizer Generator [RT88]
(a follow-on to the language-specific Cornell Program Synthesizer [TR81]) of
Reps and Teitelbaum. Magpie [SDB84] is an incremental compiler. Action routines to implement many language features can be found in the texts of Fischer
and LeBlanc [FL88] or Appel [App97]. Further notes on attribute grammars can
be found in the texts of Cooper and Torczon [CT04, pp. 171–188] or Aho et
al. [ALSU07, Chap. 5].
Marcotty, Ledgard, and Bochmann [MLB76] provide a survey of formal notations for programming language semantics. The seminal paper on axiomatic
semantics is by Hoare [Hoa69]. An excellent book on the subject is Gries’s The
Science of Programming [Gri81]. The seminal paper on denotational semantics is
by Scott and Strachey [SS71]. Texts on the subject include those of Stoy [Sto77]
and Gordon [Gor79].
5 Target Machine Architecture
As described in Chapter 1, a compiler is simply a translator. It translates
programs written in one language into programs written in another language.
This second language can be almost anything—some other high-level language,
phototypesetting commands, VLSI (chip) layouts—but most of the time it’s the
machine language for some available computer.
Just as there are many different programming languages, there are many different machine languages, though the latter tend to display considerably less diversity
than the former. Each machine language corresponds to a different processor architecture. Formally, an architecture is the interface between the hardware and the
software, that is, the language generated by a compiler, or by a programmer writing
for the bare machine. The implementation of the processor is a concrete realization
of the architecture, generally in hardware. To generate correct code, it suffices for
a compiler writer to understand the target architecture. To generate fast code, it
is generally necessary to understand the implementation as well, because it is the
implementation that determines the relative speeds of alternative translations of
a given language construct.
IN MORE DEPTH
Chapter 5 can be found in its entirety on the PLP CD. It provides a brief overview of
those aspects of processor architecture and implementation of particular importance to compiler writers, and may be worth reviewing even by readers who have
seen the material before. Principal topics include data representation, instruction
set architecture, the evolution of implementation techniques, and the challenges
of compiling for modern processors. Examples are drawn largely from the x86,
a legacy CISC (complex instruction set) architecture that dominates the desktop/laptop market, and the MIPS, a more modern RISC (reduced instruction set)
design used widely for embedded systems.
Programming Language Pragmatics. DOI: 10.1016/B978-0-12-374514-9.00014-8
Copyright © 2009 by Elsevier Inc. All rights reserved.
DESIGN & IMPLEMENTATION
Pseudo-assembly notation
At various times throughout the remainder of this book, we will need to consider sequences of machine instructions corresponding to some high-level
language construct. Rather than present these sequences in the assembly
language of some particular processor architecture, we will (in most cases)
rely on a simple notation designed to represent a generic RISC machine. The
following is a brief example that sums the elements of an n-element floating-point vector, V, and places the result in s.
    r1 := &V
    r2 := n
    f1 := 0
    goto L2
L1: f2 := *r1          -- load
    f1 +:= f2
    r1 +:= 8           -- floating-point numbers are 8 bytes long
    r2 -:= 1
L2: if r2 > 0 goto L1
    s := f1
The notation should in most cases be self-explanatory. It uses “assignment
statements” and operators reminiscent of high-level languages, but each line
of code corresponds to a single machine instruction, and registers are named
explicitly (the names of integer registers begin with ‘r’ ; those of floating-point
registers begin with ‘f’ ). Control flow is based entirely on goto s and subroutine
calls (not shown). Conditional tests assume that the hardware can perform
a comparison and branch in a single instruction, where the comparison tests
the contents of a register against a small constant or the contents of another
register.
Main memory in our notation can be accessed only by load and store instructions, which look like assignments to or from a register, with no arithmetic. We
do, however, assume the availability of displacement addressing, which allows
us to access memory at some constant offset from the address held in a register.
For example, to store register r1 to a local variable at an offset of 12 bytes from
the frame pointer ( fp ) register, we could say *(fp-12) := r1 .
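For readers who prefer to see the vector-summing loop in a high-level language, the following Python sketch mirrors the pseudo-assembly one instruction at a time. It is an approximation under stated assumptions: Python has no registers or raw addresses, so r1 here is a list index rather than a byte address, and the variable names simply echo the register names used in the sidebar.

```python
def vector_sum(V):
    """Sum a list of floats, following the structure of the pseudo-assembly loop."""
    r1 = 0            # stands in for the address of V (here, an index)
    r2 = len(V)       # r2 := n
    f1 = 0.0          # f1 := 0
    while r2 > 0:     # L2: if r2 > 0 goto L1
        f2 = V[r1]    # L1: f2 := *r1   (load)
        f1 += f2      # f1 +:= f2
        r1 += 1       # the pseudo-assembly adds 8 bytes; here we advance one element
        r2 -= 1       # r2 -:= 1
    return f1         # s := f1
```

Note that the test appears at the bottom of the pseudo-assembly loop (reached first via goto L2), so an empty vector executes the body zero times, just as the while loop does here.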
II Core Issues in Language Design
Having laid the foundation in Part I, we now turn to issues that lie at the core of most programming languages: control flow, data types, and abstractions of both control and data.
Chapter 6 considers control flow, including expression evaluation, sequencing, selection,
iteration, and recursion. In many cases we will see design decisions that reflect the sometimes
complementary but often competing goals of conceptual clarity and efficient implementation.
Several issues, including the distinction between references and values and between applicative
(eager) and lazy evaluation will recur in later chapters.
Chapter 7, the longest in the book, considers the subject of types. It begins with type systems
and type checking, including the notions of equivalence, compatibility, and inference of types.
It then presents a survey of high-level type constructors, including records and variants, arrays,
strings, sets, pointers, lists, and files. The section on pointers includes an introduction to garbage
collection techniques.
Both control and data are amenable to abstraction, the process whereby complexity is hidden behind a simple and well-defined interface. Control abstraction is the subject of Chapter 8.
Subroutines are the most common control abstraction, but we also consider exceptions and
coroutines, and return briefly to the subjects of continuations and iterators, introduced in
Chapter 6. The coverage of subroutines includes calling sequences, parameter-passing mechanisms, and generics, which support parameterization over types.
Chapter 9 returns to the subject of data abstraction, introduced in Chapter 3. In many modern languages this subject takes the form of object orientation, characterized by an encapsulation
mechanism, inheritance, and dynamic method dispatch (subtype polymorphism). Our coverage of object-oriented languages will also touch on constructors, access control, polymorphism,
closures, and multiple and mix-in inheritance.
6 Control Flow
Having considered the mechanisms that a compiler uses to enforce
semantic rules (Chapter 4) and the characteristics of the target machines for
which compilers must generate code (Chapter 5), we now return to core issues in
language design. Specifically, we turn in this chapter to the issue of control flow
or ordering in program execution. Ordering is fundamental to most (though not
all) models of computing. It determines what should be done first, what second,
and so forth, to accomplish some desired task. We can organize the language
mechanisms used to specify ordering into eight principal categories:
1. Sequencing: Statements are to be executed (or expressions evaluated) in a certain specified order—usually the order in which they appear in the program
text.
2. Selection: Depending on some run-time condition, a choice is to be made
among two or more statements or expressions. The most common selection
constructs are if and case ( switch ) statements. Selection is also sometimes
referred to as alternation.
3. Iteration: A given fragment of code is to be executed repeatedly, either a certain number of times, or until a certain run-time condition is true. Iteration
constructs include for / do , while , and repeat loops.
4. Procedural abstraction: A potentially complex collection of control constructs
(a subroutine) is encapsulated in a way that allows it to be treated as a single
unit, usually subject to parameterization.
5. Recursion: An expression is defined in terms of (simpler versions of) itself,
either directly or indirectly; the computational model requires a stack on
which to save information about partially evaluated instances of the expression.
Recursion is usually defined by means of self-referential subroutines.
6. Concurrency: Two or more program fragments are to be executed/evaluated
“at the same time,” either in parallel on separate processors, or interleaved on
a single processor in a way that achieves the same effect.
7. Exception handling and speculation: A program fragment is executed optimistically, on the assumption that some expected condition will be true. If that
condition turns out to be false, execution branches to a handler that executes
in place of the remainder of the protected fragment (in the case of exception
handling), or in place of the entire protected fragment (in the case of speculation). For speculation, the language implementation must be able to undo, or
“roll back,” any visible effects of the protected code.
8. Nondeterminacy: The ordering or choice among statements or expressions is
deliberately left unspecified, implying that any alternative will lead to correct
results. Some languages require the choice to be random, or fair, in some formal
sense of the word.
Programming Language Pragmatics. DOI: 10.1016/B978-0-12-374514-9.00016-1
Though the syntactic and semantic details vary from language to language, these
eight principal categories cover all of the control-flow constructs and mechanisms
found in most programming languages. A programmer who thinks in terms of
these categories, rather than the syntax of some particular language, will find it
easy to learn new languages, evaluate the tradeoffs among languages, and design
and reason about algorithms in a language-independent way.
Subroutines are the subject of Chapter 8. Concurrency is the subject of Chapter 12. Exception handling and speculation are discussed in those chapters as well,
in Sections 8.5 and 12.4.4. The bulk of the current chapter (Sections 6.3 through
6.7) is devoted to the five remaining categories. We begin in Section 6.1 by examining expression evaluation. We consider the syntactic form of expressions, the
precedence and associativity of operators, the order of evaluation of operands,
and the semantics of the assignment statement. We focus in particular on the distinction between variables that hold a value and variables that hold a reference to
a value; this distinction will play an important role many times in future chapters.
In Section 6.2 we consider the difference between structured and unstructured
( goto -based) control flow.
The relative importance of different categories of control flow varies significantly among the different classes of programming languages. Sequencing is
central to imperative (von Neumann and object-oriented) languages, but plays
a relatively minor role in functional languages, which emphasize the evaluation
of expressions, de-emphasizing or eliminating statements (e.g., assignments) that
affect program output in any way other than through the return of a value. Similarly, functional languages make heavy use of recursion, while imperative languages tend to emphasize iteration. Logic languages tend to de-emphasize or hide
the issue of control flow entirely: The programmer simply specifies a set of inference rules; the language implementation must find an order in which to apply
those rules that will allow it to deduce values that satisfy some desired property.
6.1 Expression Evaluation
An expression generally consists of either a simple object (e.g., a literal constant, or
a named variable or constant) or an operator or function applied to a collection of
operands or arguments, each of which in turn is an expression. It is conventional
to use the term operator for built-in functions that use special, simple syntax,
and to use the term operand for an argument of an operator. In most imperative
languages, function calls consist of a function name followed by a parenthesized,
comma-separated list of arguments, as in
EXAMPLE 6.1 A typical function call
    my_func(A, B, C)
Operators are typically simpler, taking only one or two arguments, and dispensing
with the parentheses and commas:
EXAMPLE 6.2 Typical operators
    a + b
    - c
As we saw in Section 3.5.2, some languages define their operators as syntactic
sugar for more “normal”-looking functions. In Ada, for example, a + b is short
for "+"(a, b) ; in C++, a + b is short for a.operator+(b) .
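Python offers a similar correspondence, which makes for a convenient illustration (Python is not one of the languages discussed above, and the Money class below is a hypothetical type invented for this sketch). For a user-defined class, the infix expression a + b is resolved through the class's __add__ method, and the standard operator module exposes the same operation as an ordinary function.

```python
import operator


class Money:
    """A tiny value type (hypothetical) that overloads the + operator."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        return Money(self.cents + other.cents)

    def __eq__(self, other):
        return isinstance(other, Money) and self.cents == other.cents


a, b = Money(150), Money(275)
infix_result = a + b                    # operator syntax
method_result = a.__add__(b)            # the method that a + b resolves to for this class
function_result = operator.add(a, b)    # the same operation in function-call form
```

All three results compare equal, which is the sense in which the operator is syntactic sugar for a function call.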
In general, a language may specify that function calls (operator invocations) employ prefix, infix, or postfix notation. These terms indicate, respectively, whether the function name appears before, among, or after its several
arguments:
    prefix:     op a b     or     op (a, b)     or     (op a b)
    infix:      a op b
    postfix:    a b op
Most imperative languages use infix notation for binary operators and prefix
notation for unary operators and (with parentheses around the arguments) other
functions. Lisp uses prefix notation for all functions, but with the third of the
variants above: in what is known as Cambridge Polish¹ notation, it places the
function name inside the parentheses:
EXAMPLE 6.3 Cambridge Polish (prefix) notation
    (* (+ 1 3) 2)              ; that would be (1 + 3) * 2 in infix
    (append a b c my_list)
A few languages, notably ML and the R scripting language, allow the user to
create new infix operators. Smalltalk uses infix notation for all functions (which it
calls messages), both built-in and user-defined. The following Smalltalk statement
sends a “ displayOn: at: ” message to graphical object myBox , with arguments
myScreen and 100@50 (a pixel location). It corresponds to what other languages
would call the invocation of the “ displayOn: at: ” function with arguments
myBox , myScreen , and 100@50 .
EXAMPLE 6.4 Mixfix notation in Smalltalk
    myBox displayOn: myScreen at: 100@50
¹ Prefix notation was popularized by Polish logicians of the early 20th century; Lisp-like parenthesized syntax was first employed (for noncomputational purposes) by philosopher W. V. Quine of
Harvard University (Cambridge, MA).
This sort of multiword infix notation occurs occasionally in other languages as
well.² In Algol one can say
EXAMPLE 6.5 Conditional expressions
    a := if b <> 0 then a/b else 0;
Here “ if . . . then . . . else ” is a three-operand infix operator. The equivalent
operator in C is written “. . . ? . . . : . . . ”:
a = b != 0 ? a/b : 0;
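Python, to take one more example outside the text, writes its conditional expression with the "then" operand first. The short sketch below shows the same guarded division; the function name safe_ratio is an assumption made for illustration.

```python
def safe_ratio(a, b):
    """Return a/b, or 0 when b is zero, using a conditional expression."""
    return a / b if b != 0 else 0
```

As in the Algol and C forms, only one of the two alternative operands is evaluated, so the division is never attempted when b is zero.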
Postfix notation is used for most functions in Postscript, Forth, the input language
of certain hand-held calculators, and the intermediate code of some compilers.
Postfix appears in a few places in other languages as well. Examples include the
pointer dereferencing operator ( ˆ ) of Pascal and the post-increment and decrement operators ( ++ and -- ) of C and its descendants.
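Evaluating postfix notation requires only a stack, which is part of its appeal for calculators and intermediate code. The following Python sketch shows the idea for the four basic arithmetic operators; it assumes whitespace-separated tokens and numeric operands, and the names OPS and eval_postfix are introduced here for illustration.

```python
import operator

# Binary operators supported by this sketch.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_postfix(text: str) -> float:
    """Evaluate a whitespace-separated postfix expression over numbers."""
    stack = []
    for tok in text.split():
        if tok in OPS:
            right = stack.pop()          # the operand pushed most recently
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(float(tok))     # operands are simply pushed
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]
```

For instance, the postfix form of 2 × (5 − 3)/4 is 2 5 3 - * 4 /, and eval_postfix("2 5 3 - * 4 /") evaluates it without needing any parentheses or precedence rules.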
6.1.1 Precedence and Associativity
EXAMPLE 6.6 A complicated Fortran expression
Most languages provide a rich set of built-in arithmetic and logical operators.
When written in infix notation, without parentheses, these operators lead to ambiguity as to what is an operand of what. In Fortran, for example, which uses **
for exponentiation, how should we parse a + b * c**d**e/f ? Should this be
grouped as
    ((((a + b) * c)**d)**e)/f
or
    a + (((b * c)**d)**(e/f))
or
    a + ((b * (c**(d**e)))/f)
or yet some other option? (In Fortran, the answer is the last of the options
shown.)
In any given language, the choice among alternative evaluation orders depends
on the precedence and associativity of operators, concepts we introduced in
Section 2.1.3. Issues of precedence and associativity do not arise in prefix or
postfix notation.
EXAMPLE 6.7 Precedence in four influential languages
Precedence rules specify that certain operators, in the absence of parentheses,
group “more tightly” than other operators. In most languages multiplication and
division group more tightly than addition and subtraction, so 2 + 3 × 4 is 14
and not 20. Details vary widely from one language to another, however. Figure 6.1
shows the levels of precedence for several well-known languages.
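The three candidate groupings of the Fortran expression a + b * c**d**e/f can be compared numerically. The sketch below uses Python rather than Fortran; this works only because Python happens to share the relevant rules for these particular operators (** is right-associative and groups more tightly than * and /, which in turn group more tightly than +). Floating-point operands are used so that division behaves the same way it would for Fortran reals; the specific values are arbitrary.

```python
a, b, c, d, e, f = 1.0, 2.0, 3.0, 2.0, 2.0, 4.0

unparenthesized = a + b * c**d**e / f

grouping_1 = ((((a + b) * c)**d)**e) / f      # everything left to right
grouping_2 = a + (((b * c)**d)**(e / f))      # * and / tighter than **
grouping_3 = a + ((b * (c**(d**e))) / f)      # the grouping Fortran uses
```

With these operands the three groupings give three different values, and the unparenthesized expression agrees with the third.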
2 Most authors use the term “infix” only for binary operators. Multiword operators may be called
“mixfix,” or left unnamed.
Fortran:
  1. **
  2. * , /
  3. + , - (unary and binary)
  4. .eq. , .ne. , .lt. , .le. , .gt. , .ge. (comparisons)
  5. .not.
  6. .and.
  7. .or.
  8. .eqv. , .neqv. (logical comparisons)
Pascal:
  1. not
  2. * , / , div , mod , and
  3. + , - (unary and binary), or
  4. < , <= , > , >= , = , <> , IN
C:
  1. ++ , -- (post-inc., dec.)
  2. ++ , -- (pre-inc., dec.), + , - (unary), & , * (address, contents of), ! , ~ (logical, bit-wise not)
  3. * (binary), / , % (modulo division)
  4. + , - (binary)
  5. << , >> (left and right bit shift)
  6. < , <= , > , >= (inequality tests)
  7. == , != (equality tests)
  8. & (bit-wise and)
  9. ^ (bit-wise exclusive or)
  10. | (bit-wise inclusive or)
  11. && (logical and)
  12. || (logical or)
  13. ?: (if . . . then . . . else)
  14. = , += , -= , *= , /= , %= , >>= , <<= , &= , ^= , |= (assignment)
  15. , (sequencing)
Ada:
  1. abs (absolute value), not , **
  2. * , / , mod , rem
  3. + , - (unary)
  4. + , - (binary), & (concatenation)
  5. = , /= , < , <= , > , >=
  6. and , or , xor (logical operators)
Figure 6.1 Operator precedence levels in Fortran, Pascal, C, and Ada. Within each language, the operators at the top of the list group most tightly.
The precedence structure of C (and, with minor variations, of its descendants, C++, Java, and C#) is substantially richer than that of most other languages. It is, in fact, richer than shown in Figure 6.1, because several additional
constructs, including type casts, function calls, array subscripting, and record
field selection, are classified as operators in C. It is probably fair to say that
most C programmers do not remember all of their language’s precedence levels.
The intent of the language designers was presumably to ensure that “the right
thing” will usually happen when parentheses are not used to force a particular
evaluation order. Rather than count on this, however, the wise programmer will
consult the manual or add parentheses.
EXAMPLE 6.8 A “gotcha” in Pascal precedence
It is also probably fair to say that the relatively flat precedence hierarchy of
Pascal is a mistake. In particular, novice Pascal programmers frequently write
conditions like
if A < B and C < D then (* ouch *)
Unless A , B , C , and D are all of type Boolean, which is unlikely, this code will
result in a static semantic error, since the rules of precedence cause it to group
as A < (B and C) < D . (And even if all four operands are of type Boolean,
the result is almost certain to be something other than what the programmer
intended.) Most languages avoid this problem by giving arithmetic operators
higher precedence than relational (comparison) operators, which in turn have
higher precedence than the logical operators. Notable exceptions include APL and
Smalltalk, in which all operators are of equal precedence; parentheses must be
used to specify grouping.
EXAMPLE 6.9 Common rules for associativity
Associativity rules specify whether sequences of operators of equal precedence
group to the right or to the left. Conventions here are somewhat more uniform
across languages, but still display some variety. The basic arithmetic operators
almost always associate left-to-right, so 9 - 3 - 2 is 4 and not 8 . In Fortran,
as noted above, the exponentiation operator ( ** ) follows standard mathematical
convention, and associates right-to-left, so 4**3**2 is 262144 and not 4096 .
In Ada, exponentiation does not associate: one must write either (4**3)**2 or
4**(3**2) ; the language syntax does not allow the unparenthesized form. In
languages that allow assignments inside expressions (an option we will consider
more in Section 6.1.2), assignment associates right-to-left. Thus in C, a = b =
a + c assigns a + c into b and then assigns the same value into a .
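The arithmetic claims above are easy to check in any language whose rules match. The following sketch uses Python, where subtraction associates left-to-right and ** associates right-to-left, as in Fortran. Note one assumption: Python's chained assignment is a statement rather than an expression, and it binds its targets left to right, so it only approximates the right-associative = of C. The variable names are arbitrary.

```python
# Subtraction associates left-to-right: (9 - 3) - 2, not 9 - (3 - 2).
left_assoc = 9 - 3 - 2

# Exponentiation associates right-to-left: 4 ** (3 ** 2), not (4 ** 3) ** 2.
right_assoc = 4 ** 3 ** 2
other_grouping = (4 ** 3) ** 2

# Chained assignment: a + c is computed once and the same value is given to both names.
a, c = 1, 5
a = b = a + c

print(left_assoc, right_assoc, other_grouping, a, b)
```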
Because the rules for precedence and associativity vary so much from one
language to another, a programmer who works in several languages is wise to
make liberal use of parentheses.
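As an illustration of why parentheses pay off, Python has a precedence surprise closely analogous to the Pascal one above. In Python the bit-wise & groups more tightly than the comparison operators, so a condition written with & in place of and is parsed much like the Pascal example. The values below are arbitrary and chosen only to make the difference visible.

```python
a, b, c, d = 1, 2, 4, 5

# Intended meaning: (a < b) and (c < d).
intended = (a < b) & (c < d)

# Without parentheses, & groups first, leaving the chained comparison a < (b & c) < d.
surprising = a < b & c < d

# The grouping Python actually uses, written out explicitly.
explicit = a < (b & c) < d

print(intended, surprising, explicit)
```

Here b & c is 0, so the unparenthesized condition is false even though both of the comparisons the programmer had in mind are true.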
6.1.2 Assignments
In a purely functional language, expressions are the building blocks of programs,
and computation consists entirely of expression evaluation. The effect of any
individual expression on the overall computation is limited to the value that
expression provides to its surrounding context. Complex computations employ
recursion to generate a potentially unbounded number of values, expressions, and
contexts.
In an imperative language, by contrast, computation typically consists of an
ordered series of changes to the values of variables in memory. Assignments
provide the principal means by which to make the changes. Each assignment
takes a pair of arguments: a value and a reference to a variable into which the
value should be placed.
In general, a programming language construct is said to have a side effect if it
influences subsequent computation (and ultimately program output) in any way
other than by returning a value for use in the surrounding context. Assignment is
perhaps the most fundamental side effect: while the evaluation of an assignment
may sometimes yield a value, what we really care about is the fact that it changes
the value of a variable, thereby influencing the result of any later computation in
which the variable appears.
Many (though not all) imperative languages distinguish between expressions,
which always produce a value, and may or may not have side effects, and statements,
which are executed solely for their side effects, and return no useful value. Given
the centrality of assignment, imperative programming is sometimes described as
“computing by means of side effects.”
At the opposite extreme, purely functional languages have no side effects. As
a result, the value of an expression in such a language depends only on the referencing environment in which the expression is evaluated, not on the time at
which the evaluation occurs. If an expression yields a certain value at one point in
time, it is guaranteed to yield the same value at any point in time. In fancier terms,
expressions in a purely functional language are said to be referentially transparent.
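The contrast can be sketched in a few lines of Python, which is not a purely functional language but lets us write both kinds of function. The names area and next_ticket are hypothetical, chosen for illustration.

```python
import itertools

def area(width, height):
    """No side effects: the result depends only on the arguments."""
    return width * height

_tickets = itertools.count(1)   # hidden state shared across calls

def next_ticket():
    """Has a side effect: each call advances the hidden counter."""
    return next(_tickets)

# The same expression yields the same value whenever it is evaluated.
same_1, same_2 = area(3, 4), area(3, 4)

# The same expression yields different values at different times.
first, second = next_ticket(), next_ticket()

print(same_1 == same_2, first == second)
```

A call to area can be replaced by its value anywhere it appears; a call to next_ticket cannot, because its value depends on when the call occurs.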
Haskell and Miranda are purely functional. Many other languages are mixed:
ML and Lisp are mostly functional, but make assignment available to programmers who want it. C#, Python, and Ruby are mostly imperative, but provide a
variety of features (first-class functions, polymorphism, functional values and
aggregates, garbage collection, unlimited extent) that allow them to be used in
a largely functional style. (We will return to functional programming, and the
features it requires, in several future sections, including 6.2.2, 6.6, 7.1.2, 7.7.3, 7.8,
and all of Chapter 10.)
References and Values
On the surface, assignment appears to be a very straightforward operation. Below
the surface, however, there are some subtle but important differences in the semantics of assignment in different imperative languages. These differences are often
invisible, because they do not affect the behavior of simple programs. They have
a major impact, however, on programs that use pointers, and will be explored in
further detail in Section 7.7. We provide an introduction to the issues here.
EXAMPLE 6.10 L-values and r-values
Consider the following assignments in C:
d = a;
a = b + c;
In the first statement, the right-hand side of the assignment refers to the value
of a , which we wish to place into d . In the second statement, the left-hand side
refers to the location of a , where we want to put the sum of b and c . Both
interpretations (value and location) are possible because a variable in C (and
in Pascal, Ada, and many other languages) is a named container for a value. We
sometimes say that languages like C use a value model of variables. Because of
their use on the left-hand side of assignment statements, expressions that denote
locations are referred to as l-values. Expressions that denote values (possibly the
value stored in a location) are referred to as r-values. Under a value model of
variables, a given expression can be either an l-value or an r-value, depending on
the context in which it appears.
EXAMPLE 6.11 L-values in C
Of course, not all expressions can be l-values, because not all values have a
location, and not all names are variables. In most languages it makes no sense to
say 2 + 3 = a , or even a = 2 + 3 , if a is the name of a constant. By the same token,
not all l-values are simple names; both l-values and r-values can be complicated
expressions. In C one may write
(f(a)+3)->b[c] = 2;
In this expression f(a) returns a pointer to some element of an array of pointers
to structures (records). The assignment places the value 2 into the c -th element
of field b of the structure pointed at by the third array element after the one to
which f ’s return value points.
EXAMPLE 6.12 L-values in C++
In C++ it is even possible for a function to return a reference to a structure,
rather than a pointer to it, allowing one to write
g(a).b[c] = 2;
We will consider references further in Section 8.3.1.
EXAMPLE 6.13 Variables as values and references
A language can make the distinction between l-values and r-values more explicit
by employing a reference model of variables. Languages that do this include
Algol 68, Clu, Lisp/Scheme, ML, Haskell, and Smalltalk. In these languages, a
variable is not a named container for a value; rather, it is a named reference
to a value. The following fragment of code is syntactically valid in both Pascal
and Clu:
b := 2;
c := b;
a := b + c;
A Pascal programmer might describe this code by saying: “We put the value 2 in b
and then copy it into c . We then read these values, add them together, and place
the resulting 4 in a .” The Clu programmer would say: “We let b refer to 2 and
then let c refer to it also. We then pass these references to the + operator, and let
a refer to the result, namely 4.”
Figure 6.2 The value (left) and reference (right) models of variables. Under the reference model, it becomes important to distinguish between variables that refer to the same object and variables that refer to different objects whose values happen (at the moment) to be equal. [Figure not reproduced: on the left, a , b , and c are boxes containing 4, 2, and 2; on the right, a refers to 4, while b and c refer to a single shared 2.]
These two ways of thinking are illustrated in Figure 6.2. With a value model of
variables, as in Pascal, any integer variable can contain the value 2. With a reference
model of variables, as in Clu, there is (at least conceptually) only one 2 (a sort of
Platonic Ideal) to which any variable can refer. The practical effect is the same in
this example, because integers are immutable: the value of 2 never changes, so we
can’t tell the difference between two copies of the number 2 and two references to
“the” number 2.
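Python, whose visible semantics follow a reference model for all variables, can be used to replay the fragment above. This is a sketch for illustration only; Python is not one of the languages listed earlier, and the is operator used here to test whether two names refer to the same object is Python-specific.

```python
b = 2
c = b                    # c now refers to the same object as b
a = b + c                # a refers to the result of the addition

shares_object = c is b   # both names refer to one integer object

c = c + 1                # builds a new integer and rebinds c; b is unaffected
print(a, b, c, shares_object)
```

Because integers cannot be changed in place, rebinding c has no effect on b , and the program behaves exactly as it would under a value model.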
In a language that uses the reference model, every variable is an l-value. When
it appears in a context that expects an r-value, it must be dereferenced to obtain
the value to which it refers. In most languages with a reference model (including
Clu), the dereference is implicit and automatic. In ML, the programmer must use
an explicit dereference operator, denoted with a prefix exclamation point. We will
revisit ML pointers in Section 7.7.1.
The difference between the value and reference models of variables becomes
particularly important (specifically, it can affect program output and behavior)
if the values to which variables refer can change “in place,” as they do in many
programs with linked data structures, or if it is possible for variables to refer
to different objects that happen to have the “same” value. In this latter case it
becomes important to distinguish between variables that refer to the same object
and variables that refer to different objects whose values happen (at the moment)
to be equal. (Lisp, as we shall see in Sections 7.10 and 10.3.3, provides more than
one notion of equality, to accommodate this distinction.) We will discuss the value
and reference models of variables further in Section 7.7.
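The distinction becomes visible as soon as the objects are mutable. In the following Python sketch (again chosen because Python uses a reference model; the variable names are arbitrary), == compares values while is compares object identity, roughly in the spirit of the multiple notions of equality mentioned above for Lisp.

```python
first = [1, 2, 3]
alias = first              # a second reference to the same list object
duplicate = list(first)    # a distinct list whose contents happen to be equal

equal_before = (alias == first) and (duplicate == first)

first.append(4)            # change the object "in place"

print(alias == first, alias is first)           # the alias sees the change
print(duplicate == first, duplicate is first)   # the duplicate does not
```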
DESIGN & IMPLEMENTATION
Implementing the reference model
It is tempting to assume that the reference model of variables is inherently
more expensive than the value model, since a naive implementation would
require a level of indirection on every access. As we shall see in Section 7.7.1,
however, most compilers for languages with a reference model use multiple
copies of immutable objects for the sake of efficiency, achieving exactly the
same performance for simple types that they would with a value model.
Java uses a value model for built-in types and a reference model for user-defined types (classes). C# and Eiffel allow the programmer to choose between
the value and reference models for each individual user-defined type. A C# class
is a reference type; a struct is a value type.
Boxing
EXAMPLE 6.14 Wrapper objects in Java 2
A drawback of using a value model for built-in types is that they can’t be passed
uniformly to methods that expect class-typed parameters. Early versions of Java
required the programmer to “wrap” objects of built-in types inside corresponding
predefined class types in order to insert them in standard container (collection)
classes:
import java.util.Hashtable;
...
Hashtable ht = new Hashtable();
...
Integer N = new Integer(13);         // Integer is a "wrapper" class
ht.put(N, new Integer(31));
Integer M = (Integer) ht.get(N);
int m = M.intValue();
The wrapper class was needed here because Hashtable expects a parameter of a
class derived from Object , and an int is not an Object .
EXAMPLE 6.15 Boxing in Java 5 and C#
More recent versions of Java perform automatic boxing and unboxing operations that avoid the need for wrappers in many cases:
ht.put(13, 31);
int m = (Integer) ht.get(13);
Here the compiler creates hidden Integer objects to hold the values 13 and 31 ,
so they may be passed to put as references. The Integer cast on the return value
is still needed, to make sure that the hash table entry for 13 is really an integer and
not, say, a floating-point number or string. C# “boxes” not only the arguments,
but the cast as well, eliminating the need for the Integer class entirely.
Orthogonality
One of the principal design goals of Algol 68 was to make the various features
of the language as orthogonal as possible. Orthogonality means that features can
be used in any combination, the combinations all make sense, and the meaning
of a given feature is consistent, regardless of the other features with which it is
combined. The name is meant to draw an explicit analogy to orthogonal vectors
in linear algebra: none of the vectors in an orthogonal set depends on (or can be
expressed in terms of) the others, and all are needed in order to describe the vector
space as a whole.
EXAMPLE 6.16 Expression orientation in Algol 68
Algol 68 was one of the first languages to make orthogonality a principal design
goal, and in fact few languages since have given the goal such weight. Among
other things, Algol 68 is said to be expression-oriented: it has no separate notion
of statement. Arbitrary expressions can appear in contexts that would call for
a statement in a language like Pascal, and constructs that are considered to be
statements in other languages can appear within expressions. The following, for
example, is valid in Algol 68:
begin
    a := if b < c then d else e;
    a := begin f(b); g(c) end;
    g(d);
    2 + 3
end
Here the value of the if . . . then . . . else construct is either the value of its then
part or the value of its else part, depending on the value of the condition. The
value of the “statement list” on the right-hand side of the second assignment is the
value of its final “statement,” namely the return value of g(c) . There is no need
to distinguish between procedures and functions, because every subroutine call
returns a value. The value returned by g(d) is discarded in this example. Finally,
the value of the code fragment as a whole is 5, the sum of 2 and 3.
C takes an approach intermediate between Pascal and Algol 68. It distinguishes
between statements and expressions, but one of the classes of statement is an
“expression statement,” which computes the value of an expression and then
throws it away; in effect, this allows an expression to appear in any context that
would require a statement in most other languages. Unfortunately, as we noted in
Section 3.7, the reverse is not the case: statements cannot in general be used in an
expression context. C provides special expression forms for selection and sequencing. Algol 60 defines if . . . then . . . else as both a statement and an expression.
EXAMPLE 6.17 A “gotcha” in C conditions
Both Algol 68 and C allow assignments within expressions. The value of an
assignment is simply the value of its right-hand side. Unfortunately, where most
of the descendants of Algol 60 use the := token to represent assignment, C follows
Fortran in simply using = . It uses == to represent a test for equality (Fortran uses
.eq. ). Moreover, C lacks a separate Boolean type. (C99 has a new _Bool type, but
it’s really just an unsigned integer type that holds only the values 0 and 1.) In any context that would require a Boolean value
in other languages, C accepts an integer (or anything that can be coerced to be an
integer). It interprets zero as false; any other value is true. As a result, both of the
following constructs are valid—common—in C:
if (a == b) {
    /* do the following if a equals b */
if (a = b) {
    /* assign b into a and then do
       the following if the result is nonzero */
Programmers who are accustomed to Ada or some other language in which = is
the equality test frequently write the second form above when the first is what is
intended. This sort of bug can be very hard to find.
Though it provides a true Boolean type ( bool ), C++ shares the problem of C,
because it provides automatic coercions from numeric, pointer, and enumeration
types. Java and C# eliminate the problem by disallowing integers in Boolean
contexts. The assignment operator is still = , and the equality test is still == , but the
statement if (a = b) ... will generate a compile-time type clash error unless a
and b are both boolean (Java) or bool (C#), which is generally unlikely.
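Python, shown here only as a further point of comparison, takes a syntactic route to the same end: a plain = is a statement and is rejected inside a condition, while assignment within an expression must be requested with the visually distinct := operator (available in Python 3.8 and later). The helper compiles below is our own; it simply asks the Python compiler whether a piece of source text is accepted.

```python
def compiles(source):
    """Return True if Python accepts the source text, False on a syntax error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

equality_ok = compiles("if a == b:\n    pass\n")
assignment_ok = compiles("if a = b:\n    pass\n")    # the C-style slip
walrus_ok = compiles("if (a := b):\n    pass\n")     # explicit assignment expression

print(equality_ok, assignment_ok, walrus_ok)
```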
Combination Assignment Operators
EXAMPLE 6.18 Updating assignments
Because they rely so heavily on side effects, imperative programs must frequently update a variable. It is thus common in many languages to see statements like
a = a + 1;
or, worse
b.c[3].d = b.c[3].d * e;
Such statements are not only cumbersome to write and to read (we must examine
both sides of the assignment carefully to see if they really are the same), they also
result in redundant address calculations (or at least extra work to eliminate the
redundancy in the code improvement phase of compilation).
EXAMPLE 6.19 Side effects and updates
If the address calculation has a side effect, then we may need to write a pair of
statements instead. Consider the following code in C:
void update(int A[], int index_fn(int n)) {
    int i, j;
    /* calculate i */
    ...
    j = index_fn(i);
    A[j] = A[j] + 1;
}
Here we cannot safely write
A[index_fn(i)] = A[index_fn(i)] + 1;
We have to introduce the temporary variable j because we don’t know whether
index_fn has a side effect or not. If it is being used, for example, to keep a log
of elements that have been updated, then we shall want to make sure that update
calls it only once.
EXAMPLE 6.20 Assignment operators
To eliminate the clutter and compile- or run-time cost of redundant address
calculations, and to avoid the issue of repeated side effects, many languages, beginning with Algol 68, and including C and its descendants, provide so-called assignment operators to update a variable. Using assignment operators, the statements
in Example 6.18 can be written as follows.
a += 1;
b.c[3].d *= e;
and the two assignments in the update function can be replaced with
A[index_fn(i)] += 1;
EXAMPLE 6.21 Prefix and postfix inc/dec
In addition to being aesthetically cleaner, the assignment operator form guarantees
that the address calculation is performed only once.
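The single evaluation can be observed directly in a language with the same guarantee. The sketch below uses Python, whose augmented assignment also evaluates the target's subexpressions only once; make_logged_index is a hypothetical helper that builds an index function whose side effect is to log each call, in the spirit of the index_fn discussed above.

```python
def make_logged_index():
    """Build an index function that records every call (its side effect)."""
    log = []
    def index_fn(n):
        log.append(n)
        return n % 4
    return index_fn, log

# Plain assignment: the index expression is written, and evaluated, twice.
A = [0, 0, 0, 0]
index_fn, log_plain = make_logged_index()
A[index_fn(6)] = A[index_fn(6)] + 1

# Assignment operator: the index expression is evaluated once.
B = [0, 0, 0, 0]
index_fn, log_update = make_logged_index()
B[index_fn(6)] += 1

print(A, len(log_plain))
print(B, len(log_update))
```

Both arrays end up with the same contents, but the log shows two calls in the first case and one in the second.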
As shown in Figure 6.1, C provides 10 different assignment operators, one for
each of its binary arithmetic and bit-wise operators. C also provides prefix and
postfix increment and decrement operations. These allow even simpler code in
update :
A[index_fn(i)]++;
or
++A[index_fn(i)];
More significantly, increment and decrement operators provide elegant syntax for
code that uses an index or a pointer to traverse an array:
A[--i] = b;
*p++ = *q++;
When prefixed to an expression, the ++ or -- operator increments or decrements
its operand before providing a value to the surrounding context. In the postfix
form, ++ or -- updates its operand after providing a value. If i is 3 and p and q
point to the initial elements of a pair of arrays, then b will be assigned into A[2]
(not A[3] ), and the second assignment will copy the initial elements of the arrays
(not the second elements).
EXAMPLE 6.22 Advantages of postfix inc/dec
The prefix forms of ++ and -- are syntactic sugar for += and -= . We could have
written
A[i -= 1] = b;
above. The postfix forms are not syntactic sugar. To obtain an effect similar to the
second statement above we would need an auxiliary variable and a lot of extra
notation:
*(t = p, p += 1, t) = *(t = q, q += 1, t);
Both the assignment operators ( += , -= ) and the increment and decrement
operators ( ++ , -- ) do “the right thing” when applied to pointers in C (assuming
those pointers point into an array). If p points to element i of an array, where each
element occupies n bytes (including any bytes required for alignment, as discussed
in Section 5.1), then p += 3 points to element i + 3, 3n bytes later in memory.
We will discuss pointers and arrays in C in more detail in Section 7.7.1.
Multiway Assignment
EXAMPLE 6.23 Simple multiway assignment
We have already seen that the right associativity of assignment (in languages that
allow assignment in expressions) allows one to write things like a = b = c . In
several languages, including Clu, ML, Perl, Python, and Ruby, it is also possible to
write
a, b = c, d;
Here the comma in the right-hand side is not the sequencing operator of C.
Rather, it serves to define an expression, or tuple, consisting of multiple r-values.
The comma operator on the left-hand side produces a tuple of l-values. The effect
of the assignment is to copy c into a and d into b .³
EXAMPLE 6.24 Advantages of multiway assignment
While we could just as easily have written
a = c; b = d;
the multiway (tuple) assignment allows us to write things like
a, b = b, a;        (* swap a and b *)
which would otherwise require auxiliary variables. Moreover, multiway assignment allows functions to return tuples, as well as single values:
a, b, c = foo(d, e, f);
This notation eliminates the asymmetry (nonorthogonality) of functions in most
programming languages, which allow an arbitrary number of arguments, but only
a single return.
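Both uses can be tried directly in Python, one of the languages that supports this notation. In the sketch below, min_max_mean is a hypothetical function invented for illustration; it returns three results as a single tuple, which the caller unpacks with a multiway assignment.

```python
a, b = 1, 2
a, b = b, a                      # swap without an auxiliary variable

def min_max_mean(values):
    """Return three results at once, packaged as a tuple."""
    return min(values), max(values), sum(values) / len(values)

lo, hi, mean = min_max_mean([3, 1, 4, 1, 5])
print(a, b, lo, hi, mean)
```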
3 The syntax shown here is for Perl, Python, and Ruby. Clu uses := for assignment. ML requires
parentheses around each tuple.
CHECK YOUR UNDERSTANDING
1. Name eight major categories of control-flow mechanisms.
2. What distinguishes operators from other sorts of functions?
3. Explain the difference between prefix, infix, and postfix notation. What is Cambridge Polish notation? Name two programming languages that use postfix
notation.
4. Why don’t issues of associativity and precedence arise in Postscript or Forth?
5. What does it mean for an expression to be referentially transparent?
6. What is the difference between a value model of variables and a reference
model of variables? Why is the distinction important?
7. What is an l-value? An r-value?
8. Why is the distinction between mutable and immutable values important in
the implementation of a language with a reference model of variables?
9. Define orthogonality in the context of programming language design.
10. What does it mean for a language to be expression-oriented?
11. What are the advantages of updating a variable with an assignment operator,
rather than with a regular assignment in which the variable appears on both
the left- and right-hand sides?
6.1.3 Initialization
Because they already provide a construct (the assignment statement) to set the
value of a variable, imperative languages do not always provide a means of specifying an initial value for a variable in its declaration. There are at least three
reasons, however, why such initial values may be useful:
1. As suggested in Figure 3.3 (page 124), a static variable that is local to a
subroutine needs an initial value in order to be useful.
2. For any statically allocated variable, an initial value that is specified in the
declaration can be preallocated in global memory by the compiler, avoiding
the cost of assigning an initial value at run time.
3. Accidental use of an uninitialized variable is one of the most common programming errors. One of the easiest ways to prevent such errors (or at least
ensure that erroneous behavior is repeatable) is to give every variable a value
when it is first declared.
Most languages allow variables of built-in types to be initialized in their declarations. A more complete and orthogonal approach to initialization requires
a notation for aggregates: built-up structured values of user-defined composite
types. Aggregates can be found in several languages, including C, Ada, Fortran 90,
and ML; we will discuss them further in Section 7.1.5.
It should be emphasized that initialization saves time only for variables that
are statically allocated. Variables allocated in the stack or heap at run time must
be initialized at run time.⁴ It is also worth noting that the problem of using an
uninitialized variable occurs not only after elaboration, but also as a result of any
operation that destroys a variable’s value without providing a new one. Two of the
most common such operations are explicit deallocation of an object referenced
through a pointer and modification of the tag of a variant record. We will consider
these operations further in Sections 7.7 and 7.3.4, respectively.
4 For variables that are accessed indirectly (e.g., in languages that employ a reference model of
variables), a compiler can often reduce the cost of initializing a stack or heap variable by placing
the initial value in static memory, and only creating the pointer to it at elaboration time.
If a variable is not given an initial value explicitly in its declaration, the language may specify a default value. In C, for example, statically allocated variables
for which the programmer does not provide an initial value are guaranteed to be
represented in memory as if they had been initialized to zero. For most types on
most machines, this is a string of zero bits, allowing the language implementation
to exploit the fact that most operating systems (for security reasons) fill newly
allocated memory with zeros. Zero-initialization applies recursively to the subcomponents of variables of user-defined composite types. Java and C# provide a
similar guarantee for the fields of all class-typed objects, not just those that are
statically allocated. Most scripting languages provide a default initial value for all
variables, of all types, regardless of scope or lifetime.
Dynamic Checks
Instead of giving every uninitialized variable a default value, a language or implementation can choose to define the use of an uninitialized variable as a dynamic
semantic error, and can catch these errors at run time. The advantage of the
semantic checks is that they will often identify a program bug that is masked or
made more subtle by the presence of a default value. With appropriate hardware
support, uninitialized variable checks can even be as cheap as default values, at
least for certain types. In particular, a compiler that relies on the IEEE standard
for floating-point arithmetic can fill uninitialized floating-point numbers with a
signaling NaN value, as discussed in Section 5.2.2. Any attempt to use such a
value in a computation will result in a hardware interrupt, which the language
implementation may catch (with a little help from the operating system), and use
to trigger a semantic error message.
For most types on most machines, unfortunately, the costs of catching all uses
of an uninitialized variable at run time are considerably higher. If every possible
bit pattern of the variable’s representation in memory designates some legitimate
value (and this is often the case), then extra space must be allocated somewhere
to hold an initialized/uninitialized flag. This flag must be set to “uninitialized” at
elaboration time and to “initialized” at assignment time. It must also be checked
(by extra code) at every use, or at least at every use that the code improver is unable
to prove is redundant.
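Python offers a convenient illustration of the dynamic approach for local variables: reading a local that has not yet been assigned raises UnboundLocalError at run time instead of silently yielding a default value. The function describe below is a hypothetical example written to trigger the check on one path only.

```python
def describe(j):
    if j > 0:
        label = "positive"
    # When j <= 0, label was never assigned; the use below is caught at run time.
    return label

ok = describe(3)

try:
    describe(-1)
    caught = False
except UnboundLocalError:
    caught = True

print(ok, caught)
```

The error surfaces only on executions that actually take the faulty path, which is the essential difference between a dynamic check and a static one.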
Definite Assignment
EXAMPLE 6.25 Programs outlawed by definite assignment
For local variables of methods, Java and C# define a notion of definite assignment
that precludes the use of uninitialized variables. This notion is based on the control
flow of the program, and can be statically checked by the compiler. Roughly
speaking, every possible control path to an expression must assign a value to every
variable in that expression. This is a conservative rule; it can sometimes prohibit
programs that would never actually use an uninitialized variable. In Java:
int i;
int j = 3;
...
if (j > 0) {
    i = 2;
}
...                         // no assignments to j in here
if (j > 0) {
    System.out.println(i);  // error: "i might not have been initialized"
}
While a human being might reason that i will be used only when it has previously
been given a value, it is uncomputable to make such determinations in the general
case, and the compiler does not attempt it.
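The flavor of such a conservative check can be conveyed by a toy analysis. The sketch below is far simpler than the actual rules of Java or C#: it handles only assignments, if statements, and expression statements, it works on Python-syntax source (parsed with the standard ast module) as a stand-in for Java, and all function names are our own. Its key step is the join after an if : a name counts as definitely assigned only if it was assigned on both paths.

```python
import ast

def names_read(expr):
    """All variable names that appear anywhere inside an expression."""
    return {n.id for n in ast.walk(expr) if isinstance(n, ast.Name)}

def report_uses(expr, assigned, errors):
    """Record an error for each name read before it is definitely assigned."""
    for name in sorted(names_read(expr) - assigned):
        errors.append(f"line {expr.lineno}: {name} might not have been initialized")

def check_block(stmts, assigned, errors):
    """Walk statements in order; return the set of definitely assigned names."""
    for stmt in stmts:
        if isinstance(stmt, ast.Assign):
            report_uses(stmt.value, assigned, errors)
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    assigned = assigned | {target.id}
        elif isinstance(stmt, ast.If):
            report_uses(stmt.test, assigned, errors)
            after_then = check_block(stmt.body, assigned, errors)
            after_else = check_block(stmt.orelse, assigned, errors)
            # Conservative join: keep only names assigned on both paths.
            assigned = after_then & after_else
        elif isinstance(stmt, ast.Expr):
            report_uses(stmt.value, assigned, errors)
        else:
            raise ValueError(f"unsupported statement: {type(stmt).__name__}")
    return assigned

def analyze(source, predefined=("print",)):
    """Return the list of possible uses of uninitialized variables."""
    errors = []
    check_block(ast.parse(source).body, set(predefined), errors)
    return errors

REJECTED = "j = 3\nif j > 0:\n    i = 2\nif j > 0:\n    print(i)\n"
ACCEPTED = "j = 3\nif j > 0:\n    i = 2\nelse:\n    i = 0\nprint(i)\n"

rejected_errors = analyze(REJECTED)
accepted_errors = analyze(ACCEPTED)
print(rejected_errors, accepted_errors)
```

The first program mirrors the Java example: the analysis never looks at the value of j , so it cannot see that the two conditions agree, and it reports the use of i . Adding an else branch that also assigns i satisfies the rule.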
Constructors
Many object-oriented languages (Java and C# among them) allow the programmer
to define types for which initialization of dynamically allocated variables occurs
automatically, even when no initial value is specified in the declaration. Some—
notably C++—also distinguish carefully between initialization and assignment.
Initialization is interpreted as a call to a constructor function for the variable’s type,
with the initial value as an argument. In the absence of coercion, assignment is
interpreted as a call to the type’s assignment operator or, if none has been defined,
as a simple bit-wise copy of the value on the assignment’s right-hand side. The
distinction between initialization and assignment is particularly important for
user-defined abstract data types that perform their own storage management.
A typical example occurs in variable-length character strings. An assignment to
such a string must generally deallocate the space consumed by the old value of the
string before allocating space for the new value. An initialization of the string must
simply allocate space. Initialization with a nontrivial value is generally cheaper
than default initialization followed by assignment, because it avoids deallocation of
the space allocated for the default value. We will return to this issue in Section 9.3.2.
Neither Java nor C# distinguishes between initialization and assignment: an
initial value can be given in a declaration, but this is the same as an immediate
subsequent assignment. Java uses a reference model for all variables of user-defined
object types, and provides for automatic storage reclamation, so assignment never
copies values. C# allows the programmer to specify a value model when desired
(in which case assignment does copy values), but otherwise mirrors Java.
6.1.4 Ordering within Expressions
EXAMPLE 6.26 Indeterminate ordering
While precedence and associativity rules define the order in which binary infix
operators are applied within an expression, they do not specify the order in which
the operands of a given operator are evaluated. For example, in the expression
a - f(b) - c * d
236
Chapter 6 Control Flow
we know from associativity that f(b) will be subtracted from a before performing
the second subtraction, and we know from precedence that the right operand
of that second subtraction will be the result of c * d , rather than merely c ,
but without additional information we do not know whether a - f(b) will be
evaluated before or after c * d . Similarly, in a subroutine call with multiple
arguments
f(a, g(b), h(c))
we do not know the order in which the arguments will be evaluated.
There are two main reasons why the order can be important:
EXAMPLE 6.27 A value that depends on ordering
EXAMPLE 6.28 An optimization that depends on ordering
1. Side effects: If f(b) may modify d , then the value of a - f(b) - c * d will
depend on whether the first subtraction or the multiplication is performed
first. Similarly, if g(b) may modify a and/or c , then the values passed to
f(a, g(b), h(c)) will depend on the order in which the arguments are
evaluated.
2. Code improvement: The order of evaluation of subexpressions has an impact
on both register allocation and instruction scheduling. In the expression a *
b + f(c) , it is probably desirable to call f before evaluating a * b , because the
product, if calculated first, would need to be saved during the call to f , and f
might want to use all the registers in which it might easily be saved. In a similar
vein, consider the sequence
a := B[i];
c := a * 2 + d * 3;
Here it is probably desirable to evaluate d * 3 before evaluating a * 2 , because
the previous statement, a := B[i] , will need to load a value from memory.
Because loads are slow, if the processor attempts to use the value of a in the
next instruction (or even the next few instructions on many machines), it will
have to wait. If it does something unrelated instead (i.e., evaluate d * 3 ), then
the load can proceed in parallel with other computation.
DESIGN & IMPLEMENTATION
Safety versus performance
A recurring theme in any comparison between C++ and Java is the latter’s willingness to accept additional run-time cost in order to obtain cleaner semantics
or increased reliability. Definite assignment is one example: it may force the
programmer to perform “unnecessary” initializations on certain code paths,
but in so doing it avoids the many subtle errors that can arise from missing initialization in other languages. Similarly, the Java specification mandates
automatic garbage collection, and its reference model of user-defined types
forces most objects to be allocated in the heap. As we shall see in Chapters 7
and 9, Java also requires both dynamic binding of all method invocations and
run-time checks for out-of-bounds array references, type clashes, and other
dynamic semantic errors. Clever compilers can reduce or eliminate the cost of
these requirements in certain common cases, but for the most part the Java
design reflects an evolutionary shift away from performance as the overriding
design goal.
Because of the importance of code improvement, most language manuals say
that the order of evaluation of operands and arguments is undefined. (Java and C#
are unusual in this regard: they require left-to-right evaluation.) In the absence
of an enforced order, the compiler can choose whatever order results in faster
code.
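The side-effect hazard of Example 6.27 can be reproduced in a minimal sketch. Python is used here only for illustration; because Python itself fixes left-to-right evaluation, the sketch forces each order explicitly. The names evaluate and state and the operand values are assumptions chosen for the demonstration, not part of the text.

```python
def evaluate(order):
    """Evaluate a - f(b) - c * d, where f has a side effect on d.

    order selects which operand of the second subtraction is computed first:
    "left-first" computes a - f(b) before c * d, "right-first" the reverse.
    """
    state = {"a": 10, "b": 1, "c": 2, "d": 3}   # hypothetical values

    def f(b):
        state["d"] += 1          # side effect: f modifies d
        return b

    def left():                  # the subexpression a - f(b)
        return state["a"] - f(state["b"])

    def right():                 # the subexpression c * d
        return state["c"] * state["d"]

    if order == "left-first":
        l = left()
        r = right()              # sees the modified d (4)
    else:
        r = right()              # sees the original d (3)
        l = left()
    return l - r


print(evaluate("left-first"))    # 9 - 2*4 = 1
print(evaluate("right-first"))   # 9 - 2*3 = 3
```

The same source expression yields two different values, which is why a language that leaves operand order unspecified cannot promise a single result for it.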
Applying Mathematical Identities
EXAMPLE 6.29 Optimization and mathematical “laws”
Some language implementations (e.g., for dialects of Fortran) allow the compiler
to rearrange expressions involving operators whose mathematical abstractions
are commutative, associative, and/or distributive, in order to generate faster code.
Consider the following Fortran fragment:
a = b + c
d = c + e + b
Some compilers will rearrange this as
a = b + c
d = b + c + e
They can then recognize the common subexpression in the first and second statements, and generate code equivalent to
a = b + c
d = a + e
Similarly,
a = b/c/d
e = f/d/c
may be rearranged as
t = c * d
a = b/t
e = f/t
EXAMPLE 6.30 Overflow and arithmetic “identities”
Unfortunately, while mathematical arithmetic obeys a variety of commutative, associative, and distributive laws, computer arithmetic is not as orderly. The
problem is that numbers in a computer are of limited precision. Suppose b , c ,
and d are all integers between 2 billion and 3 billion. With 32-bit arithmetic, the
expression b - c + d can be evaluated safely left-to-right (2^32 is a little less than
4.3 billion). If the compiler attempts to reorganize this expression as b + d - c ,
however (e.g., in order to delay its use of c ), then arithmetic overflow will occur.
Despite our intuition from math, this reorganization is unsafe.
Many languages, including Pascal and most of its descendants, provide dynamic
semantic checks to detect arithmetic overflow. In some implementations these
checks can be disabled to eliminate their run-time overhead. In C and C++,
the effect of arithmetic overflow is implementation-dependent. In Java, it is well
defined: the language definition specifies the size of all numeric types, and requires
two’s complement integer and IEEE floating-point arithmetic. In C#, the programmer can explicitly request the presence or absence of checks by tagging an
expression or statement with the checked or unchecked keyword. In a completely
different vein, Scheme, Common Lisp, and several scripting languages place no a
priori limit on the size of integers; space is allocated to hold extra-large values on
demand.
EXAMPLE 6.31 Reordering and numerical stability
Even in the absence of overflow, the limited precision of floating-point arithmetic can cause different arrangements of the “same” expression to produce significantly different results, invisibly. Single-precision IEEE floating-point numbers
devote 1 bit to the sign, 8 bits to the exponent (power of 2), and 23 bits to the
mantissa. Under this representation, a + b is guaranteed to result in a loss of
information if |log2(a/b)| > 23. Thus if b = -c , then a + b + c may appear
to be zero, instead of a , if the magnitude of a is small, while the magnitude of b
and c is large. In a similar vein, a number like 0.1 cannot be represented precisely,
because its binary representation is a “repeating decimal”: 0.000110011001. . . . For
certain values of x , (0.1 + x) * 10.0 and 1.0 + (x * 10.0) can differ by as
much as 25%, even when 0.1 and x are of the same magnitude.
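Both reordering hazards can be checked with a short sketch, written in Python purely for illustration. Python integers never overflow, so the helper checked simulates an overflow check on an unsigned 32-bit range (an assumption matching the 2^32 bound mentioned above); add32 simulates single-precision addition by rounding through the struct module. All function names and operand values here are hypothetical.

```python
import struct
from fractions import Fraction

LIMIT = 2**32   # assumed unsigned 32-bit range


def checked(n):
    """Return n if it fits in 32 unsigned bits; otherwise signal overflow."""
    if not 0 <= n < LIMIT:
        raise OverflowError(f"{n} does not fit in 32 bits")
    return n


def left_to_right(b, c, d):
    return checked(checked(b - c) + d)       # b - c + d


def reordered(b, c, d):
    return checked(checked(b + d) - c)       # b + d - c: b + d may overflow


def f32(x):
    """Round a Python float (an IEEE double) to IEEE single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add32(x, y):
    """Single-precision addition: add, then round the sum to 32 bits."""
    return f32(f32(x) + f32(y))


a, big = 1.0, 2.0**25                        # |log2(a/big)| = 25 > 23
lost = add32(add32(a, big), -big)            # (a + b) + c gives 0.0
kept = add32(a, add32(big, -big))            # a + (b + c) gives 1.0

# 0.1 has no exact binary representation, so the stored value differs from 1/10.
tenth_is_exact = Fraction(0.1) == Fraction(1, 10)   # False
```

With b = 2.5 billion, c = 2.2 billion, and d = 2.9 billion, left_to_right returns 3.2 billion, while reordered raises OverflowError on the intermediate sum b + d.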
DESIGN & IMPLEMENTATION
Evaluation order
Expression evaluation presents a difficult tradeoff between semantics and
implementation. To limit surprises, most language definitions require the
compiler, if it ever reorders expressions, to respect any ordering imposed by
parentheses. The programmer can therefore use parentheses to prevent the
application of arithmetic “identities” when desired. No similar guarantee exists
with respect to the order of evaluation of operands and arguments. It is therefore unwise to write expressions in which a side effect of evaluating one operand
or argument can affect the value of another. As discussed in Section 6.3,
some languages, notably Euclid and Turing, outlaw such side effects.
6.1.5 Short-Circuit Evaluation
EXAMPLE 6.32 Short-circuited expressions
Boolean expressions provide a special and important opportunity for code
improvement and increased readability. Consider the expression (a < b) and
(b < c) . If a is greater than b , there is really no point in checking to see whether
b is less than c ; we know the overall expression must be false. Similarly, in the
expression (a > b) or (b > c) , if a is indeed greater than b there is no point in
checking to see whether b is greater than c ; we know the overall expression must
be true. A compiler that performs short-circuit evaluation of Boolean expressions
will generate code that skips the second half of both of these computations when
the overall value can be determined from the first half.
EXAMPLE 6.33 Saving time with short-circuiting
Short-circuit evaluation can save significant amounts of time in certain
situations:
if (very_unlikely_condition && very_expensive_function()) ...
EXAMPLE 6.34 Short-circuit pointer chasing
But time is not the only consideration, or even the most important. Short-circuiting changes the semantics of Boolean expressions. In C, for example, one
can use the following code to search for an element in a list:
p = my_list;
while (p && p->key != val)
    p = p->next;
C short-circuits its && and || operators, and uses zero for both null and false,
so p->key will be accessed if and only if p is non-null. The syntactically similar code in Pascal does not work, because Pascal does not short-circuit and
and or :
p := my_list;
while (p <> nil) and (p^.key <> val) do    (* ouch! *)
    p := p^.next;
Here both of the <> relations will be evaluated before and -ing their results
together. At the end of an unsuccessful search, p will be nil , and the attempt
to access p^.key will be a run-time (dynamic semantic) error, which the compiler may or may not have generated code to catch. To avoid this situation, the
Pascal programmer must introduce an auxiliary Boolean variable and an extra
level of nesting:
p := my_list;
still_searching := true;
while still_searching do
    if p = nil then
        still_searching := false
    else if p^.key = val then
        still_searching := false
    else
        p := p^.next;
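The contrast between the two searches can be reproduced in a language that offers both kinds of operator. The following is a minimal sketch in Python, used here only for illustration: its and operator short-circuits, while & applied to Boolean values evaluates both operands. The Node class and the function names are assumptions introduced for the example.

```python
class Node:
    """A singly linked list cell (hypothetical helper for this sketch)."""

    def __init__(self, key, next=None):
        self.key = key
        self.next = next


def search(head, val):
    """Short-circuit version: p.key is examined only when p is not None."""
    p = head
    while p is not None and p.key != val:
        p = p.next
    return p


def search_eager(head, val):
    """Non-short-circuit version: both operands of & are always evaluated."""
    p = head
    while (p is not None) & (p.key != val):   # fails when p is None
        p = p.next
    return p


my_list = Node(1, Node(2, Node(3)))
```

A successful search behaves identically in both versions. An unsuccessful one returns None from search but raises AttributeError from search_eager, the analog of the Pascal run-time error described above.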
 1. function tally(word : string) : integer;
 2.     (* Look up word in hash table. If found, increment tally; If not
 3.        found, enter with a tally of 1. In either case, return tally. *)
        ...
 4. function misspelled(word : string) : Boolean;
 5.     (* Check to see if word is mis-spelled and return appropriate
 6.        indication. If yes, increment global count of mis-spellings. *)
        ...
 7. while not eof(doc_file) do begin
 8.     w := get_word(doc_file);
 9.     if (tally(w) = 10) and misspelled(w) then
10.         writeln(w)
11. end;
12. writeln(total_misspellings);
Figure 6.3 Pascal code that counts on the evaluation of Boolean operands.
EXAMPLE 6.35 Short-circuiting and other errors
Short-circuit evaluation can also be used to avoid out-of-bound subscripts:
const int MAX = 10;
int A[MAX];    /* indices from 0 to 9 */
...
if (i >= 0 && i < MAX && A[i] > foo) ...
division by zero:
if (d != 0 && n/d > threshold) ...
and various other errors.
EXAMPLE 6.36 When not to use short-circuiting
Short-circuiting is not necessarily as attractive for situations in which a Boolean
subexpression can cause a side effect. Suppose we wish to count occurrences of
words in a document, and print a list of all misspelled words that appear 10 or
more times, together with a count of the total number of misspellings. Pascal
code for this task appears in Figure 6.3. Here the if statement at line 9 tests the
conjunction of two subexpressions, both of which have important side effects. If
short-circuit evaluation is used, the program will not compute the right result.
The code can be rewritten to eliminate the need for non–short-circuit evaluation,
but one might argue that the result is more awkward than the version shown.
EXAMPLE 6.37 Optional short-circuiting
So now we have seen situations in which short-circuiting is highly desirable,
and others in which at least some programmers would find it undesirable. A few
languages provide both regular and short-circuit Boolean operators. In Ada, for
example, the regular Boolean operators are and and or ; the short-circuit operators
are the two-word operators and then and or else :
found_it := p /= null and then p.key = val;
...
if d = 0 or else n/d < threshold then ...
(Ada uses /= for “not equal.”) In C, the bit-wise & and | operators can be used as
non–short-circuiting alternatives to && and || when their arguments are logical
(0 or 1) values.
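The effect of the operator choice on the program of Figure 6.3 can be seen in a small sketch, given here in Python for illustration only. Python's and short-circuits, and its & on Boolean values does not, so the two play the roles of and then and and in Ada. The function count_misspellings, its parameters, and the reduced threshold are assumptions made to keep the example small.

```python
def count_misspellings(words, dictionary, threshold, short_circuit):
    """Return (words printed, total misspellings), as in Figure 6.3."""
    tallies = {}
    total_misspellings = 0
    printed = []

    def tally(w):
        # Side effect: update the count for w.
        tallies[w] = tallies.get(w, 0) + 1
        return tallies[w]

    def misspelled(w):
        # Side effect: update the running total of misspellings.
        nonlocal total_misspellings
        bad = w not in dictionary
        if bad:
            total_misspellings += 1
        return bad

    for w in words:
        if short_circuit:
            hit = (tally(w) == threshold) and misspelled(w)
        else:
            hit = (tally(w) == threshold) & misspelled(w)
        if hit:
            printed.append(w)
    return printed, total_misspellings


words = ["teh", "cat", "teh", "dgo"]
eager = count_misspellings(words, {"cat"}, 2, short_circuit=False)
lazy = count_misspellings(words, {"cat"}, 2, short_circuit=True)
```

Here eager is (["teh"], 3), the intended result, while lazy is (["teh"], 1): with short-circuiting, misspelled is called only for the one word whose tally has just reached the threshold, so the total is undercounted.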
If we think of and and or as binary operators, short circuiting can be considered
an example of delayed or lazy evaluation: the operands are “passed” unevaluated.
Internally, the operator evaluates the first operand in any case, the second only
when needed. In a language like Algol 68, which allows arbitrary control flow
constructs to be used inside expressions, conditional evaluation can be specified
explicitly with if . . . then . . . else ; see Exercise 6.12.
When used to determine the flow of control in a selection or iteration construct,
short-circuit Boolean expressions do not really have to calculate a Boolean value;
they simply have to ensure that control takes the proper path in any given situation.
We will look more closely at the generation of code for short-circuit expressions
in Section 6.4.1.
CHECK YOUR UNDERSTANDING
12. Given the ability to assign a value into a variable, why is it useful to be able to
specify an initial value?
13. What are aggregates? Why are they useful?
14. Explain the notion of definite assignment in Java and C#.
15. Why is it generally expensive to catch all uses of uninitialized variables at run
time?
16. Why is it impossible to catch all uses of uninitialized variables at compile time?
17. Why do most languages leave unspecified the order in which the arguments
of an operator or function are evaluated?
18. What is short-circuit Boolean evaluation? Why is it useful?
6.2 Structured and Unstructured Flow
EXAMPLE 6.38 Control flow with gotos in Fortran
Control flow in assembly languages is achieved by means of conditional and
unconditional jumps (branches). Early versions of Fortran mimicked the low-level
approach by relying heavily on goto statements for most nonprocedural
control flow:
if (A .lt. B) goto 10    ! ".lt." means "<"
...
10
The 10 on the bottom line is a statement label. Goto statements also featured
prominently in other early imperative languages.
Beginning in the late 1960s, largely in response to an article by Edsger Dijkstra
[Dij68b],5 language designers hotly debated the merits and evils of goto s. It seems
fair to say the detractors won. Ada and C# allow goto s only in limited contexts.
Modula (1, 2, and 3), Clu, Eiffel, Java, and most of the scripting languages do not
allow them at all. Fortran 90 and C++ allow them primarily for compatibility with
their predecessor languages. (Java reserves the token goto as a keyword, to make
it easier for a Java compiler to produce good error messages when a programmer
uses a C++ goto by mistake.)
The abandonment of goto s was part of a larger “revolution” in software engineering known as structured programming. Structured programming was the “hot
trend” of the 1970s, in much the same way that object-oriented programming was
the trend of the 1990s. Structured programming emphasizes top-down design
(i.e., progressive refinement), modularization of code, structured types (records,
sets, pointers, multidimensional arrays), descriptive variable and constant names,
and extensive commenting conventions. The developers of structured programming were able to demonstrate that within a subroutine, almost any well-designed
imperative algorithm can be elegantly expressed with only sequencing, selection,
and iteration. Instead of labels, structured languages rely on the boundaries of
lexically nested constructs as the targets of branching control.
Many of the structured control-flow constructs familiar to modern programmers were pioneered by Algol 60. These include the if . . . then . . . else construct
and both enumeration ( for ) and logically ( while ) controlled loops. The modern
case ( switch ) statement was introduced by Wirth and Hoare in Algol W [WH66]
as an alternative to the more unstructured computed goto and switch constructs
of Fortran and Algol 60, respectively. (The switch statement of C bears a closer
resemblance to the Algol W case statement than to the Algol 60 switch .)
6.2.1 Structured Alternatives to goto
Once the principal structured constructs had been defined, most of the controversy surrounding goto s revolved around a small number of special cases, each of
which was eventually addressed in structured ways. Where once a goto might have
been used to jump to the end of the current subroutine, most modern languages
provide an explicit return statement. Where once a goto might have been used
to escape from the middle of a loop, most modern languages provide a break or
exit statement for this purpose. (Some languages also provide a statement that
will skip the remainder of the current iteration only: continue in C; cycle in
Fortran 90; next in Perl.) More significantly, several languages allow a program
to return from a nested chain of subroutine calls in a single operation, and most
provide a way to raise an exception that propagates out to some surrounding context. Both of these capabilities might once have been attempted with (nonlocal)
goto s.
5 Edsger W. Dijkstra (1930–2002) developed much of the logical foundation of our modern
understanding of concurrency. He is also responsible, among many other contributions, for
the semaphores of Section 12.3.4 and for much of the practical development of structured programming. He received the ACM Turing Award in 1972.
Multilevel Returns
EXAMPLE 6.39 Escaping a nested subroutine
Returns and (local) goto s allow control to return from the current subroutine.
On occasion it may make sense to return from a surrounding routine. Imagine, for
example, that we are searching for an item matching some desired pattern within
a collection of files. The search routine might invoke several nested routines, or
a single routine multiple times, once for each place in which to search. In such a
situation certain historic languages, including Algol 60, PL/I, and Pascal, permit a
goto to branch to a lexically visible label outside the current subroutine:
function search(key : string) : string;
var rtn : string;
...
    procedure search_file(fname : string);
        ...
    begin
        ...
        for ... (* iterate over lines *)
            ...
            if found(key, line) then begin
                rtn := line;
                goto 100;
            end;
            ...
    end;
...
begin (* search *)
    ...
    for ... (* iterate over files *)
        ...
        search_file(fname);
        ...
100: return rtn;
end;
In the event of a nonlocal goto , the language implementation must guarantee
to repair the run-time stack of subroutine call information. This repair operation
is known as unwinding. It requires not only that the implementation deallocate
the stack frames of any subroutines from which we have escaped, but also that
it perform any bookkeeping operations, such as restoration of register contents,
that would have been performed when returning from those routines.
As a more structured alternative to the nonlocal goto , Common Lisp provides
a return-from statement that names the lexically surrounding function or block
from which to return, and also supplies a return value (eliminating the need for
the artificial rtn variable in Example 6.39).
EXAMPLE 6.40 Structured nonlocal transfers
But what if search_file were not nested inside of search ? We might, for
example, wish to call it from routines that search files in different orders. Algol 60,
Algol 68, and PL/I allow labels to be passed as parameters, so a dynamically nested
subroutine can perform a goto to a caller-defined location. Common Lisp again
provides a more structured alternative, also available in Ruby. In either language
an expression can be surrounded with a catch block, whose value can be provided
by any dynamically nested routine that executes a matching throw . In Ruby we
might write
def searchFile(fname, pattern)
    file = File.open(fname)
    file.each {|line|
        throw :found, line if line =~ /#{pattern}/
    }
end

match = catch :found do
    searchFile("f1", key)
    searchFile("f2", key)
    searchFile("f3", key)
    "not found\n"          # default value for catch,
end                        # if control gets this far
print match
Here the throw expression specifies a tag, which must appear in a matching catch ,
together with a value ( line ) to be returned as the value of the catch . (The if
clause attached to the throw performs a regular-expression pattern match, looking
for pattern within line . We will consider pattern matching in more detail in
Section 13.4.2.)
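In a language without a built-in catch and throw for multilevel returns, the same effect can be approximated with an exception that carries the tag and the value. The following Python sketch is one such approximation, not a description of how Ruby or Common Lisp implement the construct. The names Throw, throw, catch, search_lines, and search are all hypothetical, and lists of strings stand in for files.

```python
import re


class Throw(Exception):
    """Carries a tag and a value out to the matching catch."""

    def __init__(self, tag, value):
        super().__init__(tag)
        self.tag = tag
        self.value = value


def throw(tag, value):
    raise Throw(tag, value)


def catch(tag, body):
    """Run body(); its value is the result unless a matching throw occurs."""
    try:
        return body()
    except Throw as t:
        if t.tag == tag:
            return t.value      # value supplied by the dynamically nested throw
        raise                   # not ours: let an outer catch handle it


def search_lines(lines, pattern):
    for line in lines:
        if re.search(pattern, line):
            throw("found", line)


def search(files, pattern):
    def body():
        for lines in files:
            search_lines(lines, pattern)
        return "not found\n"    # default value, if control gets this far
    return catch("found", body)


files = [["alpha\n", "beta\n"], ["gamma\n"]]
```

Calling search(files, "mm") returns "gamma\n" directly from the nested routine, and search(files, "zz") returns the default "not found\n".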
Errors and Other Exceptions
EXAMPLE 6.41 Error checking with status codes
The notion of a multilevel return assumes that the callee knows what the caller
expects, and can return an appropriate value. In a related and arguably more
common situation, a deeply nested block or subroutine may discover that it is
unable to proceed with its usual function, and moreover lacks the contextual
information it would need to recover in any graceful way. Eiffel formalizes this
notion by saying that every software component has a contract —a specification
of the function it performs. A component that is unable to fulfill its contract is
said to fail. Rather than return in the normal way, it must arrange for control to
“back out” to some context in which the program is able to recover. Conditions
that require a program to “back out” are usually called exceptions. We mentioned
an example in Section 2.3.4, where we considered phrase-level recovery from
syntax errors in a recursive-descent parser.
The most straightforward but generally least satisfactory way to cope with
exceptions is to use auxiliary Boolean variables within a subroutine ( if still_ok
then ... ) and to return status codes from calls:
status := my_proc(args);
if status = ok then ...
The auxiliary Booleans can be eliminated by using a nonlocal goto or multilevel
return, but the caller to which we return must still inspect status codes explicitly. As a structured alternative, many modern languages provide an exception-handling mechanism for convenient, nonlocal recovery from exceptions. We will
discuss exception handling in more detail in Section 8.5. Typically the programmer appends a block of code called a handler to any computation in which an
exception may arise. The job of the handler is to take whatever remedial action is
required to recover from the exception. If the protected computation completes
in the normal fashion, execution of the handler is skipped.
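The difference between the two styles can be sketched briefly. The example below uses Python for illustration; the routines parse_port_status, parse_port, load_config, and start, and the default port value, are hypothetical names invented for the sketch.

```python
def parse_port_status(text):
    """Status-code style: return (ok, value); every caller must test ok."""
    if not text.isdigit():
        return False, None
    return True, int(text)


def parse_port(text):
    """Exception style: raise if the routine cannot fulfill its contract."""
    if not text.isdigit():
        raise ValueError(f"not a port number: {text!r}")
    return int(text)


def load_config(text):
    # An intermediate routine: it neither checks nor forwards a status code.
    return {"port": parse_port(text)}


def start(text, default=8080):
    try:
        return load_config(text)["port"]    # protected computation
    except ValueError:
        return default                      # handler: remedial action
```

When the protected computation completes normally, as in start("80"), the handler is skipped. For start("abc") the exception propagates through load_config without any explicit test there, and the handler supplies the default.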
Multilevel returns and structured exceptions have strong similarities. Both
involve a control transfer from some inner, nested context back to an outer context, unwinding the stack on the way. The distinction lies in where the computing
occurs. In a multilevel return the inner context has all the information it needs. It
completes its computation, generating a return value if appropriate, and transfers
to the outer context in a way that requires no post-processing. At an exception, by
contrast, the inner context cannot complete its work—it cannot fulfill its contract.
It performs an “abnormal” return, triggering execution of the handler.
Common Lisp and Ruby provide mechanisms for both multilevel returns and
exceptions, but this dual support is relatively rare. Most languages support only
exceptions; programmers implement multilevel returns by writing a trivial handler. In an unfortunate overloading of terminology, the names catch and throw ,
which Common Lisp and Ruby use for multilevel returns, are used for exceptions
in several other languages.
6.2.2 Continuations
The notion of nonlocal goto s that unwind the stack can be generalized by defining
what are known as continuations. In low-level terms, a continuation consists of
a code address and a referencing environment to be restored when jumping to
that address. In higher-level terms, a continuation is an abstraction that captures
a context in which execution might continue. Continuations are fundamental to
denotational semantics. They also appear as first-class values in certain languages
(notably Scheme and Ruby), allowing the programmer to define new control flow
constructs.
DESIGN & IMPLEMENTATION
Cleaning up continuations
The implementation of continuations in Scheme and Ruby is surprisingly
straightforward. Because local variables have unlimited extent in both languages, activation records must in general be allocated on the heap. As a result,
explicit deallocation is neither required nor appropriate when jumping through
a continuation; frames that are no longer accessible will eventually be reclaimed
by a general purpose garbage collector (to be discussed in Section 7.7.3). Restoration of state (e.g., saved registers) from escaped routines is not required either:
the continuation closure holds everything required to resume the captured
context.
Continuation support in Scheme takes the form of a general-purpose function
called call-with-current-continuation , sometimes abbreviated call/cc .
This function takes a single argument f , which is itself a function. It calls f , passing as argument a continuation c that captures the current program counter and
referencing environment. The continuation is represented by a closure, indistinguishable from the closures used to represent subroutines passed as parameters.
At any point in the future, f can call c to reestablish the captured context. If
nested calls have been made, control pops out of them, as it does with exceptions. More generally, however, c can be saved in variables, returned explicitly by
subroutines, or called repeatedly, even after control has returned from f (recall
that closures in Scheme have unlimited extent; see Section 3.6). Call/cc suffices
to build a wide variety of control abstractions, including goto s, midloop exit s,
multilevel returns, exceptions, iterators (Section 6.5.3), call-by-name parameters
(Section 8.3.1), and coroutines (Section 8.6). It even subsumes the notion of
returning from a subroutine, though it seldom replaces it in practice.
First-class continuations are an extremely powerful facility. They can be very
useful if applied in well-structured ways (i.e., to define new control-flow constructs). Unfortunately, they also allow the undisciplined programmer to construct completely inscrutable programs.
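Python has no call/cc, but the escaping use of a continuation (calling c while f is still active, so that control pops out of nested calls) can be imitated with an exception. The sketch below is a deliberately restricted approximation: the function call_with_escape and its helper class are hypothetical, and unlike a true Scheme continuation the value c cannot be used after f has returned, nor called repeatedly to re-enter a context.

```python
class _Escape(Exception):
    """Internal signal used to unwind back to call_with_escape."""

    def __init__(self, owner, value):
        super().__init__()
        self.owner = owner
        self.value = value


def call_with_escape(f):
    """Call f(c); calling c(v) inside f makes call_with_escape return v."""
    owner = object()                  # identifies this particular capture

    def c(value):
        raise _Escape(owner, value)

    try:
        return f(c)
    except _Escape as e:
        if e.owner is owner:
            return e.value
        raise                         # belongs to an enclosing capture


def product(xs):
    """Multiply the elements of xs, escaping early if a zero is found."""
    def body(c):
        result = 1
        for x in xs:
            if x == 0:
                c(0)                  # midloop exit through the continuation
            result *= x
        return result
    return call_with_escape(body)
```

Here product([2, 3, 4]) returns 24 by the normal path, while product([2, 0, 4]) returns 0 through the escape. The general facility described above would additionally allow c to be stored and invoked later, which this sketch does not attempt.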
6.3 Sequencing
Like assignment, sequencing is central to imperative programming. It is the principal means of controlling the order in which side effects (e.g., assignments) occur:
when one statement follows another in the program text, the first statement executes before the second. In most imperative languages, lists of statements can be
enclosed with begin . . . end or {. . . } delimiters and then used in any context
in which a single statement is expected. Such a delimited list is usually called
a compound statement. A compound statement optionally preceded by a set of
declarations is sometimes called a block.
In languages like Algol 68, which blur or eliminate the distinction between
statements and expressions, the value of a statement (expression) list is the value
of its final element. In Common Lisp, the programmer can choose to return the
value of the first element, the second, or the last. Of course, sequencing is a useless
operation unless the subexpressions that do not play a part in the return value have
side effects. The various sequencing constructs in Lisp are used only in program
fragments that do not conform to a purely functional programming model.
EXAMPLE 6.42 Side effects in a random number generator
Even in imperative languages, there is debate as to the value of certain kinds
of side effects. In Euclid and Turing, for example, functions (i.e., subroutines that
return values, and that therefore can appear within expressions) are not permitted
to have side effects. Among other things, side-effect freedom ensures that a Euclid
or Turing function, like its counterpart in mathematics, is always idempotent : if
called repeatedly with the same set of arguments, it will always return the same
value, and the number of consecutive calls (after the first) will not affect the results
of subsequent execution. In addition, side-effect freedom for functions means that
the value of a subexpression will never depend on whether that subexpression is
evaluated before or after calling a function in some other subexpression. These
properties make it easier for a programmer or theorem-proving system to reason
about program behavior. They also simplify code improvement, for example by
permitting the safe rearrangement of expressions.
Unfortunately, there are some situations in which side effects in functions are
highly desirable. We saw one example in the label name function of Figure 3.3
(page 124). Another arises in the typical interface to a pseudorandom number
generator:
procedure srand(seed : integer)
    -- Initialize internal tables.
    -- The pseudorandom generator will return a different
    -- sequence of values for each different value of seed.
function rand() : integer
    -- No arguments; returns a new “random” number.
Obviously rand needs to have a side effect, so that it will return a different value
each time it is called. One could always recast it as a procedure with a reference
parameter:
procedure rand(var n : integer)
but most programmers would find this less appealing. Ada strikes a compromise:
it allows side effects in functions in the form of changes to static or global variables,
but does not allow a function to modify its parameters.
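The tradeoff can be made concrete with a small sketch, written in Python for illustration. The generator is a simple linear congruential one whose constants are chosen only for the example, and the name rand_explicit, which threads the state through its argument and result in the spirit of the reference-parameter version, is hypothetical.

```python
_state = 1      # hidden state shared by srand and rand

MULTIPLIER = 1103515245     # illustrative constants, not from the text
INCREMENT = 12345
MODULUS = 2**31


def srand(seed):
    """Initialize the hidden state."""
    global _state
    _state = seed


def rand():
    """No arguments; the side effect on _state makes each call differ."""
    global _state
    _state = (MULTIPLIER * _state + INCREMENT) % MODULUS
    return _state


def rand_explicit(state):
    """Side-effect-free variant: the caller passes and receives the state."""
    new_state = (MULTIPLIER * state + INCREMENT) % MODULUS
    return new_state, new_state     # (random value, state for the next call)
```

Two consecutive calls to rand() return different values although the argument list is the same, which is exactly what the side effect buys. rand_explicit returns the same pair every time it is given the same state, at the cost of making every caller carry that state around.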
6.4 Selection
EXAMPLE 6.43 Selection in Algol 60
Selection statements in most imperative languages employ some variant of the
if . . . then . . . else notation introduced in Algol 60:
if condition then statement
else if condition then statement
else if condition then statement
...
else statement
As we saw in Section 2.3.2, languages differ in the details of the syntax. In Algol
60 and Pascal both the then clause and the else clause are defined to contain
a single statement (this can of course be a begin . . . end compound statement).
To avoid grammatical ambiguity, Algol 60 requires that the statement after the
then begin with something other than if ( begin is fine). Pascal eliminates this
restriction in favor of a “disambiguating rule” that associates an else with the
closest unmatched then . Algol 68, Fortran 77, and more modern languages avoid
the ambiguity by allowing a statement list to follow either then or else , with a
terminating keyword at the end of the construct.
EXAMPLE 6.44 Elsif / elif
To keep terminators from piling up at the end of nested if statements, most
languages with terminators provide a special elsif or elif keyword. In Modula-2, one writes
IF a = b THEN ...
ELSIF a = c THEN ...
ELSIF a = d THEN ...
ELSE ...
END
EXAMPLE 6.45 Cond in Lisp
In Lisp, the equivalent construct is
(cond
    ((= A B)
        (...))
    ((= A C)
        (...))
    ((= A D)
        (...))
    (T
        (...)))
Here cond takes as arguments a sequence of pairs. In each pair the first element is
a condition; the second is an expression to be returned as the value of the overall
construct if the condition evaluates to T ( T means “true” in most Lisp dialects).
6.4.1 Short-Circuited Conditions
While the condition in an if . . . then . . . else statement is a Boolean expression,
there is usually no need for evaluation of that expression to result in a Boolean
value in a register. Most machines provide conditional branch instructions that
capture simple comparisons. Put another way, the purpose of the Boolean expression in a selection statement is not to compute a value to be stored, but to cause
control to branch to various locations. This observation allows us to generate
particularly efficient code (called jump code) for expressions that are amenable to
the short-circuit evaluation of Section 6.1.5. Jump code is applicable not only to
selection statements such as if . . . then . . . else , but to logically controlled loops
as well; we will consider the latter in Section 6.5.5.
In the usual process of code generation, either via an attribute grammar or via
ad hoc syntax tree decoration, a synthesized attribute of the root of an expression
subtree acquires the name of a register into which the value of the expression will be
computed at run time. The surrounding context then uses this register name when
generating code that uses the expression. In jump code, inherited attributes of the
root inform it of the addresses to which control should branch if the expression
is true or false, respectively. Jump code can be generated quite elegantly by an
attribute grammar, particularly one that is not L-attributed (Exercise 6.11).
EXAMPLE 6.46 Code generation for a Boolean condition
Suppose, for example, that we are generating code for the following source:
if ((A > B) and (C > D)) or (E = F) then
then clause
else
else clause
In Pascal, which does not use short-circuit evaluation, the output code would look
something like this:
    r1 := A    -- load
    r2 := B
    r1 := r1 > r2
    r2 := C
    r3 := D
    r2 := r2 > r3
    r1 := r1 & r2
    r2 := E
    r3 := F
    r2 := r2 = r3
    r1 := r1 | r2
    if r1 = 0 goto L2
L1: then clause    -- (label not actually used)
    goto L3
L2: else clause
L3:
The root of the subtree for ((A > B) and (C > D)) or (E = F) would name r1 as the
register containing the expression value.
EXAMPLE 6.47 Code generation for short-circuiting
In jump code, by contrast, the inherited attributes of the condition’s root would
indicate that control should “fall through” to L1 if the condition is true, or branch
to L2 if the condition is false. Output code would then look something like this:
    r1 := A
    r2 := B
    if r1 <= r2 goto L4
    r1 := C
    r2 := D
    if r1 > r2 goto L1
L4: r1 := E
    r2 := F
    if r1 ≠ r2 goto L2
L1: then clause
    goto L3
L2: else clause
L3:
Here the value of the Boolean condition is never explicitly placed into a register.
Rather it is implicit in the flow of control. Moreover for most values of A , B , C , D ,
and E , the execution path through the jump code is shorter and therefore faster
(assuming good branch prediction) than the straight-line code that calculates the
value of every subexpression.
EXAMPLE 6.48 Short-circuit creation of a Boolean value
If the value of a short-circuited expression is needed explicitly, it can of course
be generated, while still using jump code for efficiency. The Ada fragment
found_it := p /= null and then p.key = val;
is equivalent to
if p /= null and then p.key = val then
found_it := true;
else
found_it := false;
end if;
and can be translated as
    r1 := p
    if r1 = 0 goto L1
    r2 := r1→key
    if r2 ≠ val goto L1
    r1 := 1
    goto L2
L1: r1 := 0
L2: found_it := r1
DESIGN & IMPLEMENTATION
Short-circuit evaluation
Short-circuit evaluation is one of those happy cases in programming language
design where a clever language feature yields both more useful semantics and a
faster implementation than existing alternatives. Other at least arguable examples include case statements, local scopes for for loop indices (Section 6.5.1),
with statements in Pascal (Section 7.3.3), and parameter modes in Ada
(Section 8.3.1).
The astute reader will notice that the first goto L1 can be replaced by goto L2 ,
since r1 already contains a zero in this case. The code improvement phase of the
compiler will notice this also, and make the change. It is easier to fix this sort of
thing in the code improver than it is to generate the better version of the code in
the first place. The code improver has to be able to recognize jumps to redundant
instructions for other reasons anyway; there is no point in building special cases
into the short-circuit evaluation routines.
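The jump-code scheme of this section can be sketched in a few lines of Python. This is only an illustration, not the algorithm of any particular compiler: the tuple-based syntax tree, the function names jump_code and compile_condition, and the label numbering are assumptions made for the example. The parameters on_true and on_false play the role of the inherited attributes, and next_label records which target physically follows the generated code, so that the corresponding branch can be omitted.

```python
from itertools import count

# Comparison operators and their negations (ASCII spellings).
NEGATE = {">": "<=", "<=": ">", "<": ">=", ">=": "<", "=": "!=", "!=": "="}

def jump_code(node, on_true, on_false, next_label, out, new_label):
    """Emit code that reaches on_true if node holds and on_false otherwise.

    on_true and on_false act as inherited attributes. next_label names the
    code that physically follows, so a branch to it is omitted.
    """
    kind = node[0]
    if kind == "cmp":
        _, op, left, right = node
        out.append(f"r1 := {left}")
        out.append(f"r2 := {right}")
        if on_true == next_label:        # fall through when true
            out.append(f"if r1 {NEGATE[op]} r2 goto {on_false}")
        else:
            out.append(f"if r1 {op} r2 goto {on_true}")
            if on_false != next_label:   # neither target follows directly
                out.append(f"goto {on_false}")
        return
    _, left, right = node
    middle = new_label()                 # start of the right operand's code
    start = len(out)
    if kind == "and":                    # left false: whole condition false
        jump_code(left, middle, on_false, middle, out, new_label)
    else:                                # "or": left true: whole condition true
        jump_code(left, on_true, middle, middle, out, new_label)
    if any(line.endswith(f"goto {middle}") for line in out[start:]):
        out.append(f"{middle}:")         # emit the label only if it is used
    jump_code(right, on_true, on_false, next_label, out, new_label)

def compile_condition(cond, on_true="L1", on_false="L2"):
    """Assumes the then clause (on_true) immediately follows the condition."""
    labels = count(4)
    out = []
    jump_code(cond, on_true, on_false, on_true, out,
              lambda: f"L{next(labels)}")
    return out

# ((A > B) and (C > D)) or (E = F)
cond = ("or",
        ("and", ("cmp", ">", "A", "B"), ("cmp", ">", "C", "D")),
        ("cmp", "=", "E", "F"))
code = compile_condition(cond)
```

For the condition of Example 6.46 the sketch produces the same sequence of loads and branches as the hand-written jump code, with != standing in for ≠ and the label L4 on a line of its own.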
6.4.2 Case/Switch Statements
EXAMPLE 6.49 Case statements and nested ifs
The case statements of Algol W and its descendants provide alternative syntax for
a special case of nested if . . . then . . . else . When each condition compares the
same integer expression to a different compile-time constant, then the following
code (written here in Modula-2)
i := ... (* potentially complicated expression *)
IF i = 1 THEN
clause A
ELSIF i IN 2, 7 THEN
clause B
ELSIF i IN 3..5 THEN
clause C
ELSIF (i = 10) THEN
clause D
ELSE
clause E
END
can be rewritten as
CASE ... (* potentially complicated expression *) OF
    1:     clause A
|   2, 7:  clause B
|   3..5:  clause C
|   10:    clause D
    ELSE   clause E
END
The elided code fragments (clause A, clause B, etc.) after the colons and the ELSE
are called the arms of the CASE statement. The lists of constants in front of the
colons are CASE statement labels. The constants in the label lists must be disjoint,
and must be of a type compatible with the tested expression. Most languages
allow this type to be anything whose values are discrete: integers, characters,
enumerations, and subranges of the same. C# allows strings as well.
The CASE statement version of the code above is certainly less verbose than the
IF . . . THEN . . . ELSE version, but syntactic elegance is not the principal motivation
for providing a CASE statement in a programming language. The principal
motivation is to facilitate the generation of efficient target code.
EXAMPLE 6.50 Translation of nested ifs
The IF . . . THEN . . . ELSE statement is most naturally translated as follows:
    r1 := ...    -- calculate tested expression
    if r1 ≠ 1 goto L1
    clause A
    goto L6
L1: if r1 = 2 goto L2
    if r1 ≠ 7 goto L3
L2: clause B
    goto L6
L3: if r1 < 3 goto L4
    if r1 > 5 goto L4
    clause C
    goto L6
L4: if r1 ≠ 10 goto L5
    clause D
    goto L6
L5: clause E
L6:
EXAMPLE 6.51 Jump tables
Rather than test its expression sequentially against a series of possible values,
the case statement is meant to compute an address to which it jumps in a single
instruction. The general form of the target code generated from a case statement
appears in Figure 6.4. The code at label L6 can take any of several forms. The most
common of these simply indexes into an array:
T:  &L1    -- tested expression = 1
    &L2
    &L3
    &L3
    &L3
    &L5
    &L2
    &L5
    &L5
    &L4    -- tested expression = 10
L6: r1 := ...    -- calculate tested expression
    if r1 < 1 goto L5
    if r1 > 10 goto L5    -- L5 is the “else” arm
    r1 -:= 1    -- subtract off lower bound
    r2 := T[r1]
    goto *r2
L7:
Here the “code” at label T is actually a table of addresses, known as a jump table.
It contains one entry for each integer between the lowest and highest values,
inclusive, found among the case statement labels. The code at L6 checks to make
sure that the tested expression is within the bounds of the array (if not, we should
execute the else arm of the case statement). It then fetches the corresponding
entry from the table and branches to it.
    goto L6    -- jump to code to compute address
L1: clause A
    goto L7
L2: clause B
    goto L7
L3: clause C
    goto L7
    ...
L4: clause D
    goto L7
L5: clause E
    goto L7
L6: r1 := ...    -- computed target of branch
    goto *r1
L7:
Figure 6.4 General form of target code generated for a five-arm case statement.
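The table-driven dispatch just described can be mimicked in Python, with a list of functions standing in for the table of addresses. This is a rough sketch only: build_jump_table and dispatch are names invented for the illustration, and the arms simply return a letter identifying the clause.

```python
def build_jump_table(arms, default):
    """Return (low, table) with one entry per value from the lowest to the
    highest label, inclusive; values with no label map to the default arm."""
    low, high = min(arms), max(arms)
    table = [arms.get(value, default) for value in range(low, high + 1)]
    return low, table

def dispatch(value, low, table, default):
    index = value - low                  # subtract off lower bound
    if index < 0 or index >= len(table):
        return default()                 # out of bounds: the "else" arm
    return table[index]()                # one indexed fetch, then the "jump"

# Arms of the running example: 1 -> A; 2, 7 -> B; 3..5 -> C; 10 -> D; else E.
clause = {name: (lambda name=name: name) for name in "ABCDE"}
arms = {1: clause["A"], 2: clause["B"], 7: clause["B"],
        3: clause["C"], 4: clause["C"], 5: clause["C"], 10: clause["D"]}
low, table = build_jump_table(arms, clause["E"])

results = [dispatch(v, low, table, clause["E"]) for v in (1, 2, 4, 7, 10, 6, 0, 11)]
```

The table built here has ten entries, in the same order as the entries at label T above; the cost of a dispatch does not depend on how many arms the statement has.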
Alternative Implementations
A linear jump table is fast. It is also space efficient when the overall set of case
statement labels is dense and does not contain large ranges. It can consume an
extraordinarily large amount of space, however, if the set of labels is nondense,
or includes large value ranges. Alternative methods to compute the address to
which to branch include sequential testing, hashing, and binary search. Sequential
testing (as in an if . . . then . . . else statement) is the method of choice if the total
number of case statement labels is small. It runs in time O(n), where n is the
number of labels. A hash table is attractive if the range of label values is large, but
has many missing values and no large ranges. With an appropriate hash function
it will run in time O(1). Unfortunately, a hash table requires a separate entry for
each possible value of the tested expression, making it unsuitable for statements
with large value ranges. Binary search can accommodate ranges easily. It runs in
time O(log n), with a relatively low constant factor.
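As an illustration of why binary search handles ranges gracefully, the following sketch keeps one entry per label range, sorted by lower bound, and uses the standard bisect module to find the candidate range. The function name make_case_search and the string results are assumptions for the example; a compiler would emit comparison instructions rather than call a library routine.

```python
from bisect import bisect_right

def make_case_search(ranges, default):
    """ranges: disjoint inclusive (low, high, arm) triples, in any order."""
    ordered = sorted(ranges, key=lambda r: r[0])
    lows = [low for low, _, _ in ordered]

    def select(value):
        i = bisect_right(lows, value) - 1    # last range starting at or below value
        if i >= 0 and value <= ordered[i][1]:
            return ordered[i][2]
        return default                       # no label matches: the "else" arm

    return select

select = make_case_search(
    [(1, 1, "A"), (2, 2, "B"), (7, 7, "B"), (3, 5, "C"), (10, 10, "D")],
    default="E")
```

A range such as 3..5 costs a single entry regardless of its width, whereas a jump table or hash table would need one entry per value.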
To generate good code for all possible case statements, a compiler needs to be
prepared to use a variety of strategies. During compilation it can generate code
for the various arms of the case statement as it finds them, while simultaneously
building up an internal data structure to describe the label set. Once it has seen
all the arms, it can decide which form of target code to generate. For the sake
of simplicity, most compilers employ only some of the possible implementations.
Many use binary search in lieu of hashing. Some generate only indexed jump tables;
others only that plus sequential testing. Users of less sophisticated compilers may
need to restructure their case statements if the generated code turns out to be
unexpectedly large or slow.
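One way a compiler might make this decision, once it has collected the whole label set, is sketched below. The thresholds (small, min_density, max_table) and the function name choose_strategy are purely hypothetical; real compilers use their own heuristics.

```python
def choose_strategy(ranges, small=4, min_density=0.5, max_table=1024):
    """ranges: disjoint inclusive (low, high) label ranges.
    All thresholds are illustrative assumptions, not taken from any compiler."""
    if len(ranges) <= small:
        return "sequential testing"          # few labels: O(n) is fine
    low = min(lo for lo, _ in ranges)
    high = max(hi for _, hi in ranges)
    span = high - low + 1                    # entries a jump table would need
    covered = sum(hi - lo + 1 for lo, hi in ranges)
    if covered / span >= min_density and span <= max_table:
        return "jump table"                  # dense and not too large
    if all(lo == hi for lo, hi in ranges):
        return "hash table"                  # sparse, but no ranges
    return "binary search"                   # sparse or huge, with ranges
```

With these assumed thresholds, the five-arm example of this section is dense enough for a jump table, while a label set containing wide ranges scattered over a large interval falls through to binary search.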
Syntax and Label Semantics
As with if . . . then . . . else statements, the syntactic details of case statements
vary from language to language. Different languages use different punctuation
to delimit labels and arms. More significantly, languages differ in whether they
permit label ranges, whether they permit (or require) a default ( else ) clause, and
in how they handle a value that fails to match any label at run time.
Standard Pascal does not permit a default clause: all values on which to take
action must appear explicitly in label lists. It is a dynamic semantic error for
the expression to evaluate to a value that does not appear. Most Pascal compilers
permit the programmer to add a default clause, labeled either else or otherwise ,
as a language extension. Modula allows an optional else clause. If one does not
appear in a given case statement, then it is a dynamic semantic error for the
tested expression to evaluate to a missing value. Ada requires arm labels to cover
all possible values in the domain of the tested expression’s type. If the type of
tested expression has a very large number of values, then this coverage must be
accomplished using ranges or an others clause. In some languages, notably C
and Fortran 90, it is not an error for the tested expression to evaluate to a missing
value. Rather, the entire construct has no effect when the value is missing.
The C switch Statement
C’s syntax for case ( switch ) statements (retained by C++ and Java) is unusual
in several respects:
switch (... /* tested expression */) {
    case 1:  clause A
             break;
    case 2:
    case 7:  clause B
             break;
    case 3:
    case 4:
    case 5:  clause C
             break;
    case 10: clause D
             break;
    default: clause E
             break;
}
DESIGN & IMPLEMENTATION
Case statements
Case statements are one of the clearest examples of language design driven by
implementation. Their primary reason for existence is to facilitate the generation of jump tables. Ranges in label lists (not permitted in Pascal or C) may
reduce efficiency slightly, but binary search is still dramatically faster than the
equivalent series of if s.
EXAMPLE 6.52 Fall-through in C switch statements
Here each possible value for the tested expression must have its own label within
the switch ; ranges are not allowed. In fact, lists of labels are not allowed, but the
effect of lists can be achieved by allowing a label (such as 2 , 3 , and 4 above) to
have an empty arm that simply “falls through” into the code for the subsequent
label. Because of the provision for fall-through, an explicit break statement must
be used to get out of the switch at the end of an arm, rather than falling through
into the next. There are rare circumstances in which the ability to fall through is
convenient:
letter_case = lower;
switch (c) {
    ...
    case 'A' :
        letter_case = upper;
        /* FALL THROUGH! */
    case 'a' :
        ...
        break;
    ...
}
Most of the time, however, the need to insert a break at the end of each arm—
and the compiler’s willingness to accept arms without breaks, silently—is a recipe
for unexpected and difficult-to-diagnose bugs. C# retains the familiar C syntax,
including multiple consecutive labels, but requires every nonempty arm to end
with a break , goto , continue , or return .
CHECK YOUR UNDERSTANDING
19. List the principal uses of goto , and the structured alternatives to each.
20. Explain the distinction between exceptions and multilevel returns.
21. What are continuations? What other language features do they subsume?
22. Why is sequencing a comparatively unimportant form of control flow in
Lisp?
23. Explain why it may sometimes be useful for a function to have side effects.
24. Describe the jump code implementation of the short-circuit Boolean evaluation.
25. Why do imperative languages commonly provide a case statement in addition
to if . . . then . . . else ?
26. Describe three different search strategies that might be employed in the implementation of a case statement, and the circumstances in which each would
be desirable.
6.5 Iteration
Iteration and recursion are the two mechanisms that allow a computer to perform similar operations repeatedly. Without at least one of these mechanisms,
the running time of a program (and hence the amount of work it can do and
the amount of space it can use) would be a linear function of the size of the
program text. In a very real sense, it is iteration and recursion that make computers useful. In this section we focus on iteration. Recursion is the subject of
Section 6.6.
Programmers in imperative languages tend to use iteration more than they use
recursion (recursion is more common in functional languages). In most languages,
iteration takes the form of loops. Like the statements in a sequence, the iterations of
a loop are generally executed for their side effects: their modifications of variables.
Loops come in two principal varieties, which differ in the mechanisms used to
determine how many times to iterate. An enumeration-controlled loop is executed
once for every value in a given finite set; the number of iterations is known before
the first iteration begins. A logically controlled loop is executed until some Boolean
condition (which must generally depend on values altered in the loop) changes
value. The two forms of loops share a single construct in Algol 60 and in Common
Lisp. They are distinct in most other languages.
6.5.1 Enumeration-Controlled Loops
EXAMPLE 6.53 Fortran 90 do loop
Enumeration-controlled iteration originated with the do loop of Fortran I. Similar mechanisms have been adopted in some form by almost every subsequent
language, but syntax and semantics vary widely. Even Fortran’s own loop has
evolved considerably over time. The Fortran 90 version (retained by Fortran 2003
and 2008) looks something like this:
do i = 1, 10, 2
...
enddo
Variable i is called the index of the loop. The expressions that follow the equals
sign are i ’s initial value, its bound, and the step size. With the values shown here, the
body of the loop (the statements between the loop header and the enddo delimiter)
will execute five times, with i set to 1, 3, . . . , 9 in successive iterations.
EXAMPLE 6.54 Modula-2 for loop
Many other languages provide similar functionality. In Modula-2 one says
FOR i := first TO last BY step DO
...
END
By choosing different values of first , last , and step , we can arrange to iterate
over an arbitrary arithmetic sequence of integers, namely i = first , first +
step , . . . , first + ⌊( last − first )/ step ⌋ × step .
Following the lead of Clu, many modern languages allow enumeration-controlled loops to iterate over much more general finite sets: the nodes of a
tree, for example, or the elements of a collection. We consider these more general
iterators in Section 6.5.3. For the moment we focus on arithmetic sequences. For
the sake of simplicity, we use the name “ for loop” as a general term, even for
languages that use a different keyword.
Code Generation for for Loops
EXAMPLE 6.55 Obvious translation of a for loop
Naively, the loop of Example 6.54 can be translated as
    r1 := first
    r2 := step
    r3 := last
L1: if r1 > r3 goto L2
    ...    -- loop body; use r1 for i
    r1 := r1 + r2
    goto L1
L2:
EXAMPLE 6.56 For loop translation with test at the bottom
A slightly better if less straightforward translation is
    r1 := first
    r2 := step
    r3 := last
    goto L2
L1: ...    -- loop body; use r1 for i
    r1 := r1 + r2
L2: if r1 ≤ r3 goto L1
This version is likely to be faster, because each of the iterations contains a single conditional branch, rather than a conditional branch at the top and an
unconditional branch at the bottom. (We will consider yet another version in
Exercise 16.4.)
Note that both of these translations employ a loop-ending test that is fundamentally directional: as shown, they assume that all the realized values of i will
be no larger than last . If the loop goes “the other direction” (that is, if first >
last , and step < 0) then we will need to use the inverse test to end the loop.
To allow the compiler to make the right choice, many languages restrict the generality of their arithmetic sequences. Commonly, step is required to be a compile-time constant. Ada actually limits the choices to ±1. Several languages, including
both Ada and Pascal, require special syntax for loops that iterate “backward” ( for
i in reverse 1..10 in Ada; for i := 10 downto 1 in Pascal).
EXAMPLE 6.57 For loop translation with an iteration count
Obviously, one can generate code that checks the sign of step at run time, and
chooses a test accordingly. The obvious translations, however, are either time or
space inefficient. An arguably more attractive approach, adopted by many Fortran
compilers, is to precompute the number of iterations, place this iteration count in
a register, decrement the register at the end of each iteration, and branch back to
the top of the loop if the count is not yet zero:
    r1 := first
    r2 := step
    r3 := max(⌊( last − first + step )/ step ⌋, 0)    -- iteration count
    -- NB: this calculation may require several instructions.
    -- It is guaranteed to result in a value within the precision of the machine,
    -- but we may have to be careful to avoid overflow during its calculation.
    if r3 ≤ 0 goto L2
L1: ...    -- loop body; use r1 for i
    r1 := r1 + r2
    r3 := r3 - 1
    if r3 > 0 goto L1
    i := r1
L2:
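The iteration-count translation can be tried out directly in Python. The sketch below assumes a nonzero integer step; the function names are invented for the illustration. Python's // operator rounds toward negative infinity, which together with the max gives the intended count for both positive and negative steps, and Python integers do not overflow, so the caution in the comments above does not arise here.

```python
def iteration_count(first, last, step):
    """max(floor((last - first + step) / step), 0); step must be nonzero."""
    return max((last - first + step) // step, 0)

def counted_loop(first, last, step):
    """Return the index values visited, ending the loop with a countdown only."""
    visited = []
    i = first
    remaining = iteration_count(first, last, step)
    while remaining > 0:
        visited.append(i)        # loop body; use i
        i += step
        remaining -= 1
    return visited
```

Note that the loop test never compares i with last, so the same code serves for loops that count up and loops that count down.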
EXAMPLE 6.58 A “gotcha” in the naive loop translation
The use of the iteration count avoids the need to test the sign of step within
the loop. Assuming we have been suitably careful in precomputing the count, it
also avoids a problem we glossed over in the naive translations of Examples 6.55
and 6.56: If last is near the maximum value representable by integers on our
machine, naively adding step to the final legitimate value of i may result in
arithmetic overflow. The “wrapped” number may then appear to be smaller (much
smaller!) than last , and we may have translated perfectly good source code into
an infinite loop.
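The overflow problem is easy to reproduce in a simulation. Because Python integers are unbounded, the sketch below wraps values to 8 bits by hand; the helper wrap8, the function naive_loop, and the safety limit that stops the otherwise endless loop are all assumptions made for the demonstration.

```python
def wrap8(x):
    """Wrap an integer to 8-bit two's complement, in the range -128..127."""
    return (x + 128) % 256 - 128

def naive_loop(first, last, step, limit=1000):
    """Test-at-top translation with 8-bit wraparound.
    Gives up after `limit` iterations so that the demonstration terminates."""
    i, iterations = first, 0
    while i <= last and iterations < limit:
        iterations += 1              # loop body would go here
        i = wrap8(i + step)          # may wrap from 127 to -128
    return iterations

ok = naive_loop(120, 126, 1)         # 7 iterations, as expected
runaway = naive_loop(120, 127, 1)    # hits the safety limit: i never exceeds 127
```

With last equal to the largest representable value, the incremented index wraps to −128, which still satisfies the test, so only the artificial limit ends the loop. An iteration count computed in advance would stop after eight iterations.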
Some processors, including the PowerPC, PA-RISC, and most CISC machines,
can decrement the iteration count, test it against zero, and conditionally branch,
all in a single instruction. For many loops this results in very efficient code.
Semantic Complications
The astute reader may have noticed that use of an iteration count is fundamentally
dependent on being able to predict the number of iterations before the loop begins
to execute. While this prediction is possible in many languages, including Fortran
and Ada, it is not possible in others, notably C and its descendants. The difference
stems largely from the following question: is the for loop construct only for
enumeration, or is it simply meant to make enumeration easy? If the language insists
on enumeration, then an iteration count works fine. If enumeration is only one
possible purpose for the loop—more specifically, if the number of iterations or
the sequence of index values may change as a result of executing the first few
iterations—then we may need to use a more general implementation, along the
lines of Example 6.56, modified if necessary to handle dynamic discovery of the
direction of the terminating test.
The choice between requiring and (merely) enabling enumeration manifests
itself in several specific questions:
1. Can control enter or leave the loop in any way other than through the enumeration mechanism?
2. What happens if the loop body modifies variables that were used to compute
the end-of-loop bound?
3. What happens if the loop body modifies the index variable itself?
4. Can the program read the index variable after the loop has completed, and if
so, what will its value be?
Questions (1) and (2) are relatively easy to resolve. Most languages allow a
break / exit statement to leave a for loop early. Fortran IV allowed a goto to
jump into a loop, but this was generally regarded as a language flaw; Fortran 77
and most other languages prohibit such jumps. Similarly, most languages (but not
C; see Section 6.5.2) specify that the bound is computed only once, before the first
iteration, and kept in a temporary location. Subsequent changes to variables used
to compute the bound have no effect on how many times the loop iterates.
EXAMPLE 6.59 Changing the index in a for loop
Questions (3) and (4) are more difficult. Suppose we write (in no particular
language)
for i := 1 to 10 by 2
    ...
    if i = 3
        i := 6
DESIGN & IMPLEMENTATION
Numerical imprecision
Among its many changes to the do loop of Fortran IV, Fortran 77 allowed
the index, bounds, and step size of the loop to be floating-point numbers, not
just integers. Interestingly, this feature was taken back out of the language in
Fortran 90.
The problem with real-number sequences is that limited precision can cause
comparisons (e.g., between the index and the bound) to produce unexpected
or even implementation-dependent results when the values are close to one
another. Should
for x := 1.0 to 2.0 by 1.0 / 3.0
execute three iterations or four? It depends on whether 1.0 / 3.0 is rounded up
or down. The Fortran 90 designers appear to have decided that such ambiguity
is philosophically inconsistent with the idea of finite enumeration. The programmer who wants to iterate over floating-point values must use an explicit
comparison in a pretest or post-test loop (Section 6.5.5).
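The effect is easy to observe with IEEE double-precision arithmetic, which Python uses for its float type. The sketch below counts the iterations of a loop that steps by repeated addition; the function name and the particular bounds are chosen for the illustration and differ from the example in the sidebar.

```python
def count_iterations(first, last, step):
    """Count iterations of 'for x := first to last by step' using repeated addition."""
    x, n = first, 0
    while x <= last:
        n += 1
        x += step
    return n

exact = count_iterations(0.0, 0.5, 0.125)   # 5: every value is exact in binary
short = count_iterations(0.0, 0.3, 0.1)     # 3, not 4: 0.1 + 0.1 + 0.1 > 0.3
```

In the second call the accumulated sum slightly exceeds 0.3, so the iteration a programmer would expect for x = 0.3 never happens.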
What should happen at the end of the i = 3 iteration? Should the next iteration
have i = 5 (the next element of the arithmetic sequence specified in the loop
header), i = 8 (2 more than 6), or even conceivably i = 7 (the next value of
the sequence after 6)? One can imagine reasonable arguments for each of these
options. To avoid the need to choose, many languages prohibit changes to the
loop index within the body of the loop. Fortran makes the prohibition a matter of
programmer discipline: the implementation is not required to catch an erroneous
update. Pascal provides an elaborate set of conservative rules [Int90, Sec. 6.8.3.9]
that allow the compiler to catch all possible updates. These rules are complicated
by the fact that the index variable is declared outside the loop; it may be visible to
subroutines called from the loop even if it is not passed as a parameter.
EXAMPLE 6.60 Inspecting the index after a for loop
If control escapes the loop with a break / exit , the natural value for the index
would seem to be the one that was current at the time of the escape. For “normal”
termination, on the other hand, the natural value would seem to be the first one
that exceeds the loop bound. Certainly that is the value that will be produced by
the implementation of Example 6.56. Unfortunately, as we noted in Example 6.57,
the “next” value for some loops may be outside the range of integer precision. For
other loops, it may be semantically invalid:
c : 'a'..'z'    -- character subrange
...
for c := 'a' to 'z' do
    ...
-- what comes after 'z'?
Requiring the post-loop value to always be the index of the final iteration is
unattractive from an implementation perspective: it would force us to replace
Example 6.56 with a translation that has an extra branch instruction in every
iteration:
    r1 := 'a'
    r2 := 'z'
    if r1 > r2 goto L3    -- Code improver may remove this test,
                          -- since 'a' and 'z' are constants.
L1: ...                   -- loop body; use r1 for i
    if r1 = r2 goto L2
    r1 := r1 + 1
    goto L1
L2: i := r1
L3:
Of course, the compiler must generate this sort of code in any event (or use an
iteration count) if arithmetic overflow may interfere with testing the terminating
condition. To permit the compiler to use the fastest correct implementation in all
cases, several languages, including Fortran 90 and Pascal, say that the value of the
index is undefined after the end of the loop.
An attractive solution to both the index modification problem and the post-loop value problem was pioneered by Algol W and Algol 68, and subsequently
adopted by Ada, Modula-3, and many other languages. In these, the header of the
loop is considered to contain a declaration of the index. Its type is inferred from
the bounds of the loop, and its scope is the loop’s body. Because the index is not
visible outside the loop, its value is not an issue. Of course, the programmer must
not give the index the same name as any variable that must be accessed within the
loop, but this is a strictly local issue: it has no ramifications outside the loop.
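As one concrete data point for questions (3) and (4), the following sketch shows what Python (whose loops are discussed in Section 6.5.3) does with the loop of Example 6.59. The function name is invented for the illustration. Python does not make the index local to the loop; instead, an assignment to the index inside the body merely rebinds the name and has no effect on the sequence being enumerated, and after normal termination the index retains the value of the final iteration.

```python
def visited_indices():
    seen = []
    for i in range(1, 11, 2):
        seen.append(i)
        if i == 3:
            i = 6        # rebinds the name only; the enumeration is unaffected
    return seen, i       # i still holds the index of the final iteration

seen, final = visited_indices()
```

Here the iteration after i = 3 has i = 5, the first of the three options considered above.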
6.5.2 Combination Loops
As noted briefly above, Algol 60 provides a single loop construct that subsumes
the properties of more modern enumeration and logically controlled loops. It can
specify an arbitrary number of “enumerators,” each of which can be a single value,
a range of values similar to that of modern enumeration-controlled loops, or an
expression with a terminating condition. Common Lisp provides an even more
powerful facility, with four separate sets of clauses, to initialize index variables (of
which there may be an arbitrary number), test for loop termination (in any of
several ways), evaluate body expressions, and clean up at loop termination.
EXAMPLE 6.61 Combination ( for ) loop in C
A much simpler form of combination loop appears in C and its successors.
Semantically, the C for loop is logically controlled. It was designed, however, to
make enumeration easy. Our Modula-2 example
FOR i := first TO last BY step DO
...
END
would usually be written in C as
for (i = first; i <= last; i += step) {
...
}
C defines this to be equivalent (apart from the behavior of any continue statement in the body) to
{
    i = first;
    while (i <= last) {
        ...
        i += step;
    }
}
This definition means that it is the programmer’s responsibility to worry about
the effect of overflow on testing of the terminating condition. It also means that
both the index and any variables contained in the terminating condition can be
modified by the body of the loop, or by subroutines it calls, and these changes will
affect the loop control. This, too, is the programmer’s responsibility.
Any of the three clauses in the for loop header can be null (the condition is
considered true if missing). Alternatively, a clause can consist of a sequence of
comma-separated expressions. The advantage of the C for loop over its while
loop equivalent is compactness and clarity. In particular, all of the code affecting
the flow of control is localized within the header. In the while loop, one must
read both the top and the bottom of the loop to know what is going on.
EXAMPLE 6.62 C for loop with a local index
While the logical iteration semantics of the C for loop eliminate any ambiguity
about the value of the index variable after the end of the loop, it may still be
convenient to make the index local to the body of the loop, by declaring it in the
header’s initialization clause. In Example 6.61, variable i must be declared in the
surrounding scope. If we instead write
for (int i = first; i <= last; i += step) {
...
}
then i will not be visible outside. It will still, however, be vulnerable to (deliberate
or accidental) modification within the loop.
6.5.3 Iterators
In all of the examples we have seen so far (with the possible exception of the
combination loops of Algol 60, Common Lisp, or C), a for loop iterates over the
elements of an arithmetic sequence. In general, however, we may wish to iterate
over the elements of any well-defined set (what are often called containers or
collections in object-oriented code). Clu introduced an elegant iterator mechanism
(also found in Python, Ruby, and C#) to do precisely that. Euclid and several more
recent languages, notably C++ and Java, define a standard interface for iterator
objects (sometimes called enumerators) that are equally easy to use, but not as
easy to write. Icon, conversely, provides a generalization of iterators, known as
generators, that combines enumeration with backtracking search.[6]
DESIGN & IMPLEMENTATION
For loops
Modern for loops reflect the impact of both semantic and implementation
challenges. Semantic challenges include changes to loop indices or bounds from
within the loop, the scope of the index variable (and its value, if any, outside
the loop), and goto s that enter or leave the loop. Implementation challenges
include the imprecision of floating-point values, the direction of the bottom-of-loop test, and overflow at the end of the iteration range. The “combination
loops” of C (Section 6.5.2) move responsibility for these challenges out of the
compiler and into the application program.
True Iterators
EXAMPLE 6.63 Simple iterator in Python
Clu, Python, Ruby, and C# allow any container abstraction to provide an iterator
that enumerates its items. The iterator resembles a subroutine that is permitted to contain yield statements, each of which produces a loop index value.
For loops are then designed to incorporate a call to an iterator. The Modula-2
fragment
FOR i := first TO last BY step DO
...
END
would be written as follows in Python.
for i in range(first, last + 1, step):
    ...
Here range is a built-in iterator that yields the integers from first to first +
⌊( last − first )/ step ⌋ × step in increments of step . (The second argument of
range is an exclusive bound, hence last + 1 ; this form assumes step > 0.)
When called, the iterator calculates the first index value of the loop, which
it returns to the main program by executing a yield statement. The yield
behaves like return , except that when control transfers back to the iterator
after completion of the first iteration of the loop, the iterator continues where
it last left off, not at the beginning of its code. When the iterator has no more
elements to yield it simply returns (without a value), thereby terminating the
loop.
In effect, an iterator is a separate thread of control, with its own program
counter, whose execution is interleaved with that of the for loop to which it supplies index values.[7] The iteration mechanism serves to “decouple” the algorithm
required to enumerate elements from the code that uses those elements.
EXAMPLE 6.64 Python iterator for tree enumeration
The range iterator is predefined in Python. As a more illustrative example, consider the preorder enumeration of values stored in a binary tree.
A Python iterator for this task appears in Figure 6.5. Invoked from the header
of a for loop, it yields the value in the root node (if any) for the first iteration
and then calls itself recursively, twice, to enumerate values in the left and right
subtrees.
[6] Unfortunately, terminology is not consistent across languages. Euclid uses the term “generator”
for what are called “iterator objects” here. Python uses it for what are called “true iterators” here.
[7] Because iterators are interleaved with loops in a very regular way, they can be implemented more
easily (and cheaply) than fully general threads. We will consider implementation options further
in Section 8.6.3.
class BinTree:
    def __init__(self):    # constructor
        self.data = self.lchild = self.rchild = None
    ...    # other methods: insert, delete, lookup, ...
    def preorder(self):
        if self.data is not None:
            yield self.data
        if self.lchild is not None:
            for d in self.lchild.preorder():
                yield d
        if self.rchild is not None:
            for d in self.rchild.preorder():
                yield d
Figure 6.5 Python iterator for preorder enumeration of the nodes of a binary tree. Because
Python is dynamically typed, this code will work for any data that support the operations needed
by insert , lookup , and so on (probably just < ). In a statically typed language, the BinTree class
would need to be generic.
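To make the resumption behavior of yield concrete, the sketch below defines a small range-like iterator and then drives it by hand with the built-in next function, which is what a for loop does implicitly. The name arith_seq is invented for the illustration, the bound is inclusive (unlike the built-in range), and a positive step is assumed.

```python
def arith_seq(first, last, step):
    """Yield first, first + step, ... up to and including last (step > 0)."""
    i = first
    while i <= last:
        yield i          # suspend here; resume at this point on the next request
        i += step
    # falling off the end returns without a value, which terminates the loop

it = arith_seq(1, 10, 4)     # creates the iterator; none of its body has run yet
a = next(it)                 # runs up to the first yield
b = next(it)                 # resumes after the yield, with i intact
rest = list(it)              # drains whatever remains
```

Between requests the iterator's local variable i and its position in the while loop are preserved automatically; no explicit data structure is needed to remember them.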
Iterator Objects
EXAMPLE 6.65 Java iterator for tree enumeration
As realized in most imperative languages, iteration involves both a special form
of for loop and a mechanism to enumerate values for the loop. These concepts
can be separated. Euclid, C++, and Java all provide enumeration-controlled loops
reminiscent of those of Python. They have no yield statement, however, and no
separate thread-like context to enumerate values; rather, an iterator is an ordinary
object (in the object-oriented sense of the word) that provides methods for initialization, generation of the next index value, and testing for completion. Between
calls, the state of the iterator must be kept in the object’s data members.
Figure 6.6 contains the Java equivalent of the BinTree class of Figure 6.5. Given
this code, we can write
BinTree<Integer> myTree = ...
...
for (Integer i : myTree) {
System.out.println(i);
}
DESIGN & IMPLEMENTATION
“True” iterators and iterator objects
While the iterator library mechanisms of C++ and Java are highly useful,
it is worth emphasizing that they are not the functional equivalents of “true”
iterators, as found in Clu, Python, Ruby, and C#. Their key limitation is the
need to maintain all intermediate state in the form of explicit data structures,
rather than in the program counter and local variables of a resumable execution
context.
The loop here is syntactic sugar for
for (Iterator<Integer> it = myTree.iterator(); it.hasNext();) {
Integer i = it.next();
System.out.println(i);
}
EXAMPLE
6.66
Iterator objects in C++
The expression following the colon in the more concise version of the loop must be
an object that supports the standard Iterable interface. This interface includes
an iterator() method that returns an Iterator object.
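Python makes the same distinction visible. The sketch below is an illustration under our own assumptions, not code from the figures: `Node` and `TreeIterator` are hypothetical names, and the iterator keeps its traversal state in an explicit stack, as the Java version does. The `while` loop spells out roughly what a Python `for` loop does with any iterator object.

```python
class Node:
    """Hypothetical minimal tree node, used only for this sketch."""

    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right


class TreeIterator:
    """Iterator object: all traversal state lives in self.stack."""

    def __init__(self, root):
        self.stack = [root] if root is not None else []

    def __iter__(self):
        return self

    def __next__(self):
        if not self.stack:
            raise StopIteration          # Python's counterpart of hasNext() == false
        n = self.stack.pop()
        if n.right is not None:
            self.stack.append(n.right)   # pushed first, so visited last
        if n.left is not None:
            self.stack.append(n.left)
        return n.val


root = Node(5, Node(3, Node(1), Node(4)), Node(8))

# Roughly what "for v in TreeIterator(root): seen.append(v)" expands to:
seen = []
it = iter(TreeIterator(root))
while True:
    try:
        v = next(it)
    except StopIteration:
        break
    seen.append(v)
```

Where Java tests `hasNext()` before each call to `next()`, Python folds the completion test into `next` itself, signaling exhaustion with the `StopIteration` exception.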
C++ takes a different tack. Rather than propose a special version of the for
loop that would interface with iterator objects, the designers of the C++ standard
library used the language’s unusually flexible overloading and reference mechanisms (Sections 3.5.2 and 8.3.1) to redefine comparison ( != ), increment ( ++ ),
dereference ( * ), and so on in a way that makes iterating over the elements of
a set look very much like using pointer arithmetic (Section 7.7.1) to traverse a
conventional array:
bin_tree<int> *my_tree = ...
...
for (bin_tree<int>::iterator n = my_tree->begin();
n != my_tree->end(); ++n) {
cout << *n << "\n";
}
C++ encourages programmers to think of iterators as if they were pointers. Iterator n in this example encapsulates all the state encapsulated by iterator it in
the (no syntactic sugar) Java code of Example 6.65. To obtain the next element of the set, however, the C++ programmer “dereferences” n , using the *
or -> operators. To advance to the following element, the programmer uses the
increment ( ++ ) operator. The end method returns a reference to a special iterator that “points beyond the end” of the set. The increment ( ++ ) operator must
return a reference that tests equal to this special iterator when the set has been
exhausted.
We leave the code of the C++ tree iterator to Exercise 6.17. The details are
somewhat messier than Figure 6.6, due to operator overloading, the value model
of variables (which requires explicit references and pointers), and the lack of
garbage collection. Also, because C++ lacks a common Object base class, its
containers must always be declared as generics and instantiated for some particular
element type.
Iterating with First-Class Functions
In functional languages, the ability to specify a function “in line” facilitates a
programming idiom in which the body of a loop is written as a function, with the
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Stack;

class BinTree<T> implements Iterable<T> {
    BinTree<T> left;
    BinTree<T> right;
    T val;
    ...
    // other methods: insert, delete, lookup, ...
    public Iterator<T> iterator() {
        return new TreeIterator(this);
    }
    private class TreeIterator implements Iterator<T> {
        // subtrees whose nodes have not yet been enumerated
        private Stack<BinTree<T>> s = new Stack<BinTree<T>>();
        TreeIterator(BinTree<T> n) {
            if (n.val != null) s.push(n);
        }
        public boolean hasNext() {
            return !s.empty();
        }
        public T next() {
            if (!hasNext()) throw new NoSuchElementException();
            BinTree<T> n = s.pop();
            // push right before left so that the left subtree is visited first
            if (n.right != null) s.push(n.right);
            if (n.left != null) s.push(n.left);
            return n.val;
        }
        public void remove() {
            throw new UnsupportedOperationException();
        }
    }
}
Figure 6.6
Java code for preorder enumeration of the nodes of a binary tree. The nested
TreeIterator class uses an explicit Stack object (borrowed from the standard library) to keep
track of subtrees whose nodes have yet to be enumerated. Java generics, specified as <T> type
arguments for BinTree , Stack , Iterator , and Iterable , allow next to return an object of the
appropriate type, rather than the undifferentiated Object . The remove method is part of the
Iterator interface, and must therefore be provided, if only as a placeholder.
EXAMPLE
6.67
Passing the “loop body”
to an iterator in Scheme
loop index as an argument. This function is then passed as the final argument to
an iterator. In Scheme we might write
(define uptoby
  (lambda (low high step f)
    (if (<= low high)
        (begin
          (f low)
          (uptoby (+ low step) high step f))
        '())))
We could then sum the first 50 odd numbers as follows.
(let ((sum 0))
  (uptoby 1 100 2
    (lambda (i)
      (set! sum (+ sum i))))
  sum)                             =⇒ 2500
EXAMPLE
6.68
Iteration with blocks in
Smalltalk
Here the body of the loop, (set! sum (+ sum i)) , is an assignment. The =⇒
symbol (not a part of Scheme) is used here to mean “evaluates to.”
Smalltalk, which we consider in Section 9.6.1, supports a similar idiom:
sum <- 0.
1 to: 100 by: 2 do:
[:i | sum <- sum + i]
Like a lambda expression in Scheme, a square-bracketed block in Smalltalk creates
a first-class function, which we then pass as argument to the to: by: do: iterator.
The iterator calls the function repeatedly, passing successive values of the index
variable i as argument. Iterators in Ruby employ a similar but somewhat less general mechanism: where a Smalltalk method can take an arbitrary number of blocks
as argument, a Ruby method can take only one. Continuations (Section 6.2.2) and
lazy evaluation (Section 6.6.2) also allow the Scheme/Lisp programmer to create
iterator objects and more traditional looking true iterators; we consider these
options in Exercises 6.32 and 6.33.
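The same idiom can be sketched in any language with first-class functions. The Python version below is a transliteration of the Scheme `uptoby` example under our own assumptions: the names `sum_odds` and `body` are hypothetical, and a loop replaces the Scheme recursion because Python implementations do not generally eliminate tail calls.

```python
def uptoby(low, high, step, f):
    """Call f on low, low+step, ... for as long as the value is <= high."""
    i = low
    while i <= high:
        f(i)
        i += step


def sum_odds():
    total = 0

    def body(i):
        # The "loop body": a nested function that updates the enclosing total.
        nonlocal total
        total += i

    uptoby(1, 100, 2, body)   # pass the body to the iterator
    return total


result = sum_odds()   # sum of the first 50 odd numbers
```

As in the Scheme code, the iterator knows nothing about summation; it simply calls whatever function it is given, once for each index value.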
Iterating without Iterators
EXAMPLE
6.69
Imitating iterators in C
In a language with neither true iterators nor iterator objects, we can still decouple
set enumeration from element use through programming conventions. In C, for
example, we might define a tree_iter type and associated functions that could
be used in a loop as follows:
bin_tree *my_tree;
tree_iter ti;
...
for (ti_create(my_tree, &ti); !ti_done(ti); ti_next(&ti)) {
bin_tree *n = ti_val(ti);
...
}
ti_delete(&ti);
There are two principal differences between this code and the more structured
alternatives: (1) the syntax of the loop is a good bit less elegant (and arguably
more prone to accidental errors), and (2) the code for the iterator is simply a
type and some associated functions. C provides no abstraction mechanism to
group them together as a module or a class. By providing a standard interface
for iterator abstractions, object-oriented languages like C++, Python, Ruby, Java,
and C# facilitate the design of higher-order mechanisms that manipulate whole
containers: sorting them, merging them, finding their intersection or difference,
and so on. We leave the C code for tree_iter and the various ti_ functions to
Exercise 6.18.
6.5.4
Generators in Icon
Icon generalizes the concept of iterators, providing a generator mechanism that
causes any expression in which it is embedded to enumerate multiple values on
demand.
IN MORE DEPTH
We consider Icon generators in more detail on the PLP CD. The language’s
enumeration-controlled loop, the every loop, can contain not only a generator, but any expression that contains a generator. Generators can also be used in
constructs like if statements, which will execute their nested code if any generated value makes the condition true, automatically searching through all the
possibilities. When generators are nested, Icon explores all possible combinations
of generated values, and will even backtrack where necessary to undo unsuccessful
control flow branches or assignments.
6.5.5
Logically Controlled Loops
EXAMPLE
6.70
While loop in Pascal
In comparison to enumeration-controlled loops, logically controlled loops have
many fewer semantic subtleties. The only real question to be answered is where
within the body of the loop the terminating condition is tested. By far the most
common approach is to test the condition before each iteration. The familiar
while loop syntax for this was introduced in Algol-W:
while condition do statement
To allow the body of the loop to be a statement list, most modern languages use
an explicit concluding keyword (e.g., end ), or bracket the body with delimiters
(e.g., { . . . }). A few languages (notably Python) indicate the body with an extra
level of indentation.
Post-test Loops
EXAMPLE
6.71
Post-test loop in Pascal and
Modula
Occasionally it is handy to be able to test the terminating condition at the
bottom of a loop. Pascal introduced special syntax for this case, which was
retained in Modula but dropped in Ada. A post-test loop allows us, for example,
to write
repeat
    readln(line)
until line[1] = '$';
instead of
readln(line);
while line[1] <> '$' do
    readln(line);
EXAMPLE
6.72
Post-test loop in C
The difference between these constructs is particularly important when the body
of the loop is longer. Note that the body of a post-test loop is always executed at
least once.
C provides a post-test loop whose condition works “the other direction” (i.e.,
“while” instead of “until”):
do {
    line = read_line(stdin);
} while (line[0] != '$');
Midtest Loops
EXAMPLE
6.73
Break statement in C
Finally, as we noted in Section 6.2.1, it is sometimes appropriate to test the terminating condition in the middle of a loop. In many languages this “midtest” can be
accomplished with a special statement nested inside a conditional: exit in Ada,
break in C, last in Perl. In Section 6.4.2 we saw a somewhat unusual use of
break to leave a C switch statement. More conventionally, C also uses break to
exit the closest for , while , or do loop:
for (;;) {
line = read_line(stdin);
if (all_blanks(line)) break;
consume_line(line);
}
EXAMPLE
6.74
Exiting a nested loop
in Ada
Here the missing condition in the for loop header is assumed to always be true.
(C programmers have traditionally preferred this syntax to the equivalent
while (1) , presumably because it was faster in certain early C compilers.)
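Python offers neither a post-test loop nor a dedicated midtest construct, so the usual convention is the same one shown for C: an unconditional loop with a `break` in the middle. The sketch below is illustrative only. The function name `consume_nonblank_lines` is hypothetical, and an in-memory `io.StringIO` stream stands in for standard input.

```python
import io


def consume_nonblank_lines(stream):
    """Collect lines until the first all-blank line (or end of input)."""
    consumed = []
    while True:                       # no terminating condition in the header
        line = stream.readline()
        if line.strip() == "":        # midtest: blank line, or "" at end of input
            break
        consumed.append(line.rstrip("\n"))
    return consumed


lines = consume_nonblank_lines(io.StringIO("alpha\nbeta\n   \ngamma\n"))
```

The read precedes the test and the "consume" step follows it, so neither a pretest nor a post-test loop could express this without duplicating the read or adding a flag.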
In some languages, an exit statement takes an optional loop-name argument
that allows control to escape a nested loop. In Ada we might write
outer: loop
    get_line(line, length);
    for i in 1..length loop
        exit outer when line(i) = '$';
        consume_char(line(i));
    end loop;
end loop outer;
EXAMPLE
6.75
Exiting a nested loop in
Perl
In Perl this would be
outer: while (<>) {                    # iterate over lines of input
    foreach $c (split //) {            # iterate over remaining chars
        last outer if ($c =~ '\$');    # exit main loop if we see a $ sign
        consume_char($c);
    }
}
Java extends the C/C++ break statement in a similar fashion, with optional labels
on loops.
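In a language without labeled exits, a common convention is to place the nested loops in a subroutine of their own and use `return` to leave all of them at once. The Python sketch below illustrates this workaround. It is our own example rather than a feature of the languages above, and `consume_until_dollar` is a hypothetical name.

```python
def consume_until_dollar(lines):
    """Return the characters seen before the first '$' in any line."""
    consumed = []
    for line in lines:               # plays the role of the loop labeled "outer"
        for ch in line:
            if ch == "$":
                return consumed      # leaves both loops at once
            consumed.append(ch)
    return consumed


chars = consume_until_dollar(["ab", "c$d", "ef"])
```

The price of the convention is that the loops must be factored into their own function, which a labeled `exit`, `last`, or `break` avoids.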
CHECK YOUR UNDERSTANDING
27. Describe three subtleties in the implementation of enumeration-controlled
loops.
28. Why do most languages not allow the bounds or increment of an enumeration-controlled loop to be floating-point numbers?
29. Why do many languages require the step size of an enumeration-controlled
loop to be a compile-time constant?
30. Describe the “iteration count” loop implementation. What problem(s) does it
solve?
31. What are the advantages of making an index variable local to the loop it
controls?
32. Does C have enumeration-controlled loops? Explain.
33. What is a container (a collection)?
34. Explain the difference between true iterators and iterator objects.
35. Cite two advantages of iterator objects over the use of programming conventions in a language like C.
36. Describe the approach to iteration typically employed in languages with first-class functions.
37. Give an example in which a midtest loop results in more elegant code than
does a pretest or post-test loop.
6.6
Recursion
Unlike the control-flow mechanisms discussed so far, recursion requires no special
syntax. In any language that provides subroutines (particularly functions), all that
is required is to permit functions to call themselves, or to call other functions that
then call them back in turn. Most programmers learn in a data structures class
that recursion and (logically controlled) iteration provide equally powerful means
of computing functions: any iterative algorithm can be rewritten, automatically,
as a recursive algorithm, and vice versa. We will compare iteration and recursion
in more detail in the first subsection below. In the following subsection we will
consider the possibility of passing unevaluated expressions into a function. While
usually inadvisable, due to implementation cost, this technique will sometimes
allow us to write elegant code for functions that are only defined on a subset of
the possible inputs, or that explore logically infinite data structures.
6.6.1
Iteration and Recursion
EXAMPLE
6.76
A “naturally iterative”
problem
As we noted in Section 3.2, Fortran 77 and certain other languages do not permit
recursion. A few functional languages do not permit iteration. Most modern
languages, however, provide both mechanisms. Iteration is in some sense the
more “natural” of the two in imperative languages, because it is based on the
repeated modification of variables. Recursion is the more natural of the two in
functional languages, because it does not change variables. In the final analysis,
which to use in which circumstance is mainly a matter of taste. To compute
a sum,
\[ \sum_{1 \le i \le 10} f(i) \]
it seems natural to use iteration. In C one would say:
typedef int (*int_func) (int);
int summation(int_func f, int low, int high) {
/* assume low <= high */
int total = 0;
int i;
for (i = low; i <= high; i++) {
total += f(i);
}
return total;
}
EXAMPLE
6.77
A “naturally recursive”
problem
To compute a value defined by a recurrence,
\[
\gcd(a, b) \equiv
\begin{cases}
a & \text{if } a = b \\
\gcd(a-b,\, b) & \text{if } a > b \\
\gcd(a,\, b-a) & \text{if } b > a
\end{cases}
\qquad \text{(positive integers } a, b\text{)}
\]
recursion may seem more natural:
int gcd(int a, int b) {
/* assume a, b > 0 */
if (a == b) return a;
else if (a > b) return gcd(a-b, b);
else return gcd(a, b-a);
}
EXAMPLE
6.78
Implementing problems
“the other way”
In both these cases, the choice could go the other way:
typedef int (*int_func) (int);
int summation(int_func f, int low, int high) {
/* assume low <= high */
if (low == high) return f(low);
else return f(low) + summation(f, low+1, high);
}
int gcd(int a, int b) {
/* assume a, b > 0 */
while (a != b) {
if (a > b) a = a-b;
else b = b-a;
}
return a;
}
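The equivalence of the two formulations is easy to check by running both. The following Python sketch mirrors the iterative and recursive summation routines above. The names `summation_iter`, `summation_rec`, and `square` are ours, chosen for illustration.

```python
def summation_iter(f, low, high):
    """Iterative sum of f(low) .. f(high); assumes low <= high."""
    total = 0
    for i in range(low, high + 1):
        total += f(i)                # repeated modification of a variable
    return total


def summation_rec(f, low, high):
    """Recursive formulation of the same sum; assumes low <= high."""
    if low == high:
        return f(low)
    return f(low) + summation_rec(f, low + 1, high)


def square(i):
    return i * i


iterative = summation_iter(square, 1, 10)
recursive = summation_rec(square, 1, 10)
```

Both compute 1 + 4 + ... + 100. The recursive version performs its additions as the calls return, so it needs one stack frame per term unless the implementation can optimize them away.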
Tail Recursion
EXAMPLE
6.79
Iterative implementation of
tail recursion
It is sometimes argued that iteration is more efficient than recursion. It is more
accurate to say that naive implementation of iteration is usually more efficient than
naive implementation of recursion. In the examples above, the iterative implementations of summation and greatest common divisor will be more efficient than the recursive
implementations if the latter make real subroutine calls that allocate space on a
run-time stack for local variables and bookkeeping information. An “optimizing”
compiler, however, particularly one designed for a functional language, will often
be able to generate excellent code for recursive functions. It is particularly likely
to do so for tail-recursive functions such as gcd above. A tail-recursive function
is one in which additional computation never follows a recursive call: the return
value is simply whatever the recursive call returns. For such functions, dynamically
allocated stack space is unnecessary: the compiler can reuse the space belonging
to the current iteration when it makes the recursive call. In effect, a good compiler
will recast the recursive gcd function above as follows.
int gcd(int a, int b) {
/* assume a, b > 0 */
start:
if (a == b) return a;
else if (a > b) {
a = a-b; goto start;
} else {
b = b-a; goto start;
}
}
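The same rewriting can be carried out by hand in a language whose implementation does not do it for us. CPython, for instance, does not eliminate tail calls, so the Python sketch below shows both forms side by side. The names `gcd_tail` and `gcd_loop` are ours, and the `while True` loop plays the role of the `start` label.

```python
def gcd_tail(a, b):
    """Tail-recursive gcd by repeated subtraction; assumes a, b > 0."""
    if a == b:
        return a
    elif a > b:
        return gcd_tail(a - b, b)    # nothing remains to do after this call
    else:
        return gcd_tail(a, b - a)


def gcd_loop(a, b):
    """The same algorithm after replacing each tail call by a jump."""
    while True:                      # stands in for the "start:" label
        if a == b:
            return a
        elif a > b:
            a = a - b                # overwrite the parameter, then loop again
        else:
            b = b - a


shallow = (gcd_tail(12, 18), gcd_loop(12, 18))
deep = gcd_loop(1, 100_000)          # about 100,000 iterations in one stack frame
```

Because the loop version reuses a single frame, it handles inputs whose recursion depth would exceed the interpreter's limit on nested calls.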
EXAMPLE
6.80
By-hand creation of
tail-recursive code
Even for functions that are not tail-recursive, automatic, often simple transformations can produce tail-recursive code. The general case of the transformation employs conversion to what is known as continuation-passing style [FWH01,
Chaps. 7–8]. In effect, a recursive function can always avoid doing any work after
returning from a recursive call by passing that work into the recursive call, in the
form of a continuation.
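As a small illustration of the idea, the recursive summation function can be put into continuation-passing style by hand. The Python sketch below is our own and only approximates the general transformation: the parameter `k` is the continuation, a function that says what to do with the result, and `summation_cps` is a hypothetical name.

```python
def summation_cps(f, low, high, k=lambda v: v):
    """Sum f(low)..f(high) in continuation-passing style; assumes low <= high."""
    if low == high:
        return k(f(low))
    # The pending addition is packaged into a new continuation, so the
    # recursive call is the last thing this invocation does.
    return summation_cps(f, low + 1, high, lambda rest: k(f(low) + rest))


total = summation_cps(lambda i: i * i, 1, 10)
```

Every call here is in tail position. The work that the original function did after returning now lives in the chain of continuations, so the space requirement has moved from the stack to the heap rather than disappeared.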
Some specific transformations (not based on continuation passing) are often
employed by skilled users of functional languages. Consider, for example, the
recursive summation function above, written here in Scheme:
(define summation (lambda (f low high)
  (if (= low high)
      (f low)                                        ; then part
      (+ (f low) (summation f (+ low 1) high)))))    ; else part
Recall that Scheme, like all Lisp dialects, uses Cambridge Polish notation for
expressions. The lambda keyword is used to introduce a function. As recursive
calls return, our code calculates the sum from “right to left”: from high down to
low . If the programmer (or compiler) recognizes that addition is associative, we
can rewrite the code in a tail-recursive form:
(define summation (lambda (f low high subtotal)
  (if (= low high)
      (+ subtotal (f low))
      (summation f (+ low 1) high (+ subtotal (f low))))))
Here the subtotal parameter accumulates the sum from left to right, passing it
into the recursive calls. Because it is tail recursive, this function can be translated
into machine code that does not allocate stack space for recursive calls. Of course,
the programmer won’t want to pass an explicit subtotal parameter to the initial
call, so we hide it (the parameter) in an auxiliary, “helper” function:
(define summation (lambda (f low high)
  (letrec ((sum-helper (lambda (low subtotal)
                         (let ((new_subtotal (+ subtotal (f low))))
                           (if (= low high)
                               new_subtotal
                               (sum-helper (+ low 1) new_subtotal))))))
    (sum-helper low 0))))
The let construct in Scheme serves to introduce a nested scope in which local
names (e.g., new_subtotal ) can be defined. The letrec construct permits the
definition of recursive functions (e.g., sum-helper ).
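For readers more familiar with imperative notation, the helper-function idiom can be sketched in Python as follows. This is a transliteration under our own naming (`sum_helper`, `new_subtotal`), with a nested function taking the place of `letrec`. Python will not reuse the stack frame for the tail call, so the sketch shows only the shape of the code, not the space saving.

```python
def summation(f, low, high):
    """Sum f(low)..f(high) with an accumulating helper; assumes low <= high."""

    def sum_helper(low, subtotal):
        new_subtotal = subtotal + f(low)     # accumulate from left to right
        if low == high:
            return new_subtotal
        return sum_helper(low + 1, new_subtotal)   # tail call: nothing follows

    return sum_helper(low, 0)   # callers never see the subtotal parameter


total = summation(lambda i: i * i, 1, 10)
```

The additions now happen on the way into the recursion rather than on the way out, which is what makes the recursive call the final action of each invocation.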
Thinking Recursively
EXAMPLE
6.81
Naive recursive Fibonacci
function
Detractors of functional programming sometimes argue, incorrectly, that recursion leads to algorithmically inferior programs. Fibonacci numbers, for example,
are defined by the mathematical recurrence
\[
F_n \equiv
\begin{cases}
1 & \text{if } n = 0 \text{ or } n = 1 \\
F_{n-1} + F_{n-2} & \text{otherwise}
\end{cases}
\qquad \text{(nonnegative integer } n\text{)}
\]
The naive way to implement this recurrence in Scheme is
(define fib (lambda (n)
  (cond ((= n 0) 1)
        ((= n 1) 1)
        (#t (+ (fib (- n 1)) (fib (- n 2)))))))    ; #t means 'true' in Scheme
EXAMPLE
6.82
Linear iterative Fibonacci
function
EXAMPLE
6.83
Efficient tail-recursive
Fibonacci function
Unfortunately, this algorithm takes exponential time, when linear time is possible.8
In C, one might write
int fib(int n) {
int f1 = 1; int f2 = 1;
int i;
for (i = 2; i <= n; i++) {
int temp = f1 + f2;
f1 = f2; f2 = temp;
}
return f2;
}
One can write this iterative algorithm in Scheme: Scheme includes (nonfunctional) iterative features. It is probably better, however, to draw inspiration from
the tail-recursive version of the summation example above, and write the following
O(n) recursive function:
(define fib (lambda (n)
  (letrec ((fib-helper (lambda (f1 f2 i)
                         (if (= i n)
                             f2
                             (fib-helper f2 (+ f1 f2) (+ i 1))))))
    (fib-helper 0 1 0))))
8 Actually, one can do substantially better than linear time using algorithms based on binary matrix
multiplication or closest-integer rounding of continuous functions, but these approaches suffer
from high constant-factor costs or problems with numeric precision. For most purposes the
linear-time algorithm is a reasonable choice.
For a programmer accustomed to writing in a functional style, this code is perfectly
natural. One might argue that it isn’t “really” recursive; it simply casts an iterative
algorithm in a tail-recursive form, and this argument has some merit. Despite
the algorithmic similarity, however, there is an important difference between the
iterative algorithm in C and the tail-recursive algorithm in Scheme: the latter has
no side effects. Each recursive call of the fib-helper function creates a new scope,
containing new variables. The language implementation may be able to reuse the
space occupied by previous instances of the same scope, but it guarantees that this
optimization will never introduce bugs.
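The difference in cost between the naive and the helper-based formulations can be measured directly. The Python sketch below is an illustration of our own: it follows the text's convention that F0 = F1 = 1, and the one-element list `counter` is a hypothetical device for counting calls.

```python
def fib_naive(n, counter):
    """Direct transcription of the recurrence; counter[0] counts the calls."""
    counter[0] += 1
    if n == 0 or n == 1:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_linear(n):
    """Helper-based version: one call per step from 0 up to n."""

    def fib_helper(f1, f2, i):
        if i == n:
            return f2
        return fib_helper(f2, f1 + f2, i + 1)   # no variable is ever modified

    return fib_helper(0, 1, 0)


calls = [0]
naive_value = fib_naive(20, calls)
linear_value = fib_linear(20)
```

For n = 20 the naive version makes 21,891 calls (twice the result, minus one), while the helper makes 21. Each invocation of `fib_helper` receives fresh parameters instead of overwriting old ones.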
6.6.2
Applicative- and Normal-Order Evaluation
Throughout the discussion so far we have assumed implicitly that arguments are
evaluated before passing them to a subroutine. This need not be the case. It is
possible to pass a representation of the unevaluated arguments to the subroutine
instead, and to evaluate them only when (if) the value is actually needed. The
former option (evaluating before the call) is known as applicative-order evaluation;
the latter (evaluating only when the value is actually needed) is known as normal-order evaluation. Normal-order evaluation is what naturally occurs in macros
(Section 3.7). It also occurs in short-circuit Boolean evaluation (Section 6.1.5),
call-by-name parameters (to be discussed in Section 8.3.1), and certain functional
languages (to be discussed in Section 10.4).
Algol 60 uses normal-order evaluation by default for user-defined functions
(applicative order is also available). This choice was presumably made to mimic
the behavior of macros (Section 3.7). Most programmers in 1960 wrote mainly in
assembler, and were accustomed to macro facilities. Because the parameter-passing
mechanisms of Algol 60 are part of the language, rather than textual abbreviations,
problems like misinterpreted precedence or naming conflicts do not arise. Side
effects, however, are still very much an issue. We will discuss Algol 60 parameters
in more detail in Section 8.3.1.
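Python evaluates arguments in applicative order, but normal order can be imitated by passing parameterless functions (thunks) and calling them wherever the value is needed. The sketch below rests on that assumption. All of the names in it are hypothetical, and `noisy` exists only to make each evaluation visible.

```python
log = []


def noisy(value):
    """Record each evaluation so that we can see when it happens."""
    log.append(value)
    return value


def choose_applicative(flag, a, b):
    return a if flag else b               # a and b were evaluated by the caller


def choose_normal(flag, a_thunk, b_thunk):
    return a_thunk() if flag else b_thunk()   # evaluate only the one we need


def twice(x_thunk):
    return x_thunk() + x_thunk()          # the argument is re-evaluated at each use


choose_applicative(True, noisy(1), noisy(2))
applicative_log = list(log)               # both arguments were evaluated

log.clear()
choose_normal(True, lambda: noisy(1), lambda: noisy(2))
normal_log = list(log)                    # the unused argument never ran

log.clear()
doubled = twice(lambda: noisy(3))
repeated_log = list(log)                  # the side effect happened twice
```

The last case shows why side effects remain an issue: under normal-order evaluation an argument expression with a side effect is executed once per use, not once per call.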
Lazy Evaluation
From the points of view of clarity and efficiency, applicative-order evaluation is
generally preferable to normal-order evaluation. It is therefore natural for it to
be employed in most languages. In some circumstances, however, normal-order
evaluation can actually lead to faster code, or to code that works when applicative-order evaluation would lead to a run-time error. In both cases, what matters is
that normal-order evaluation will sometimes not evaluate an argument at all, if
its value is never actually needed. Scheme provides for optional normal-order
evaluation in the form of built-in functions called delay and force .9 These
9 More precisely, delay is a special form, rather than a function. Its argument is passed to it
unevaluated.
EXAMPLE
6.84
Lazy evaluation of an
infinite data structure
functions provide an implementation of lazy evaluation. In the absence of side
effects, lazy evaluation has the same semantics as normal-order evaluation, but the
implementation keeps track of which expressions have already been evaluated, so
it can reuse their values if they are needed more than once in a given referencing
environment.
A delayed expression is sometimes called a promise. The mechanism used to
keep track of which promises have already been evaluated is sometimes called
memoization.10 Because applicative-order evaluation is the default in Scheme, the
programmer must use special syntax not only to pass an unevaluated argument,
but also to use it. In Algol 60, subroutine headers indicate which arguments are
to be passed which way; the point of call and the uses of parameters within
subroutines look the same in either case.
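A promise with memoization can be sketched in a few lines of Python. The class below is our own illustration of the mechanism, not the way any particular Scheme system implements `delay` and `force`. The names `Promise`, `expensive`, and `evaluations` are hypothetical.

```python
class Promise:
    """A delayed computation whose result is remembered once forced."""

    def __init__(self, thunk):
        self._thunk = thunk          # the unevaluated expression
        self._forced = False
        self._value = None

    def force(self):
        if not self._forced:
            self._value = self._thunk()   # evaluate at most once
            self._forced = True
            self._thunk = None            # the thunk is no longer needed
        return self._value


evaluations = []


def expensive():
    evaluations.append("ran")
    return 6 * 7


p = Promise(expensive)       # analogous to (delay (expensive))
before = len(evaluations)    # nothing has been evaluated yet
first = p.force()            # analogous to (force p): runs expensive()
second = p.force()           # reuses the remembered value
```

As in Scheme, the caller must ask for delayed evaluation explicitly when creating the promise, and the user of the value must force it explicitly.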
A common use of lazy evaluation is to create so-called infinite or lazy data
structures that are “fleshed out” on demand. The following example, adapted from
version 5 of the Scheme manual [ADH+98, p. 28], creates a “list” of all the natural
numbers:
(define naturals
  (letrec ((next (lambda (n) (cons n (delay (next (+ n 1)))))))
    (next 1)))
(define head car)
(define tail (lambda (stream) (force (cdr stream))))
Here cons can be thought of, roughly, as a concatenation operator. Car returns
the head of a list; cdr returns everything but the head. Given these definitions, we
can access as many natural numbers as we want, as shown in the example that follows.
DESIGN & IMPLEMENTATION
Normal-order evaluation
Normal-order evaluation is one of many examples we have seen where arguably
desirable semantics have been dismissed by language designers because of fear
of implementation cost. Other examples in this chapter include side-effect
freedom (which allows normal order to be implemented via lazy evaluation),
iterators (Section 6.5.3), and nondeterminacy (Section 6.7). As noted in the
sidebar on page 236, however, there has been a tendency over time to trade a bit
of speed for cleaner semantics and increased reliability. Within the functional
programming community, Miranda and its successor Haskell are entirely side-effect free, and use normal-order (lazy) evaluation for all parameters.
10 Within the functional programming community, the term “lazy evaluation” is often used for
any implementation that declines to evaluate unneeded function parameters; this includes both
naive implementations of normal-order evaluation and the memoizing mechanism described
here.
(head naturals)                  =⇒ 1
(head (tail naturals))           =⇒ 2
(head (tail (tail naturals)))    =⇒ 3
The list will occupy only as much space as we have actually explored. More elaborate lazy data structures (e.g., trees) can be valuable in combinatorial search
problems, in which a clever algorithm may explore only the “interesting” parts of
a potentially enormous search space.
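The stream of natural numbers can be imitated in Python with closures. The sketch below rests on our own assumptions: a stream is represented as a pair of a value and a memoizing promise for the rest, and `delay`, `naturals_from`, `head`, and `tail` are hypothetical helpers named after their Scheme counterparts.

```python
def delay(thunk):
    """Return a zero-argument function that runs thunk at most once."""
    cache = []

    def force():
        if not cache:
            cache.append(thunk())    # first use: evaluate and remember
        return cache[0]

    return force


def naturals_from(n):
    # A "cons cell": the head value paired with a promise for the rest.
    return (n, delay(lambda: naturals_from(n + 1)))


def head(stream):
    return stream[0]


def tail(stream):
    return stream[1]()               # force the promise


naturals = naturals_from(1)
first_three = [head(naturals),
               head(tail(naturals)),
               head(tail(tail(naturals)))]
same_cell = tail(naturals) is tail(naturals)   # memoized: built only once
```

Only the cells that have actually been visited exist. Forcing the same tail twice yields the very same cell, which is the memoization at work.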
6.7
Nondeterminacy
Our final category of control flow is nondeterminacy. A nondeterministic construct is one in which the choice between alternatives (i.e., between control paths)
is deliberately unspecified. We have already seen examples of nondeterminacy
in the evaluation of expressions (Section 6.1.4): in most languages, operator or
subroutine arguments may be evaluated in any order. Some languages, notably
Algol 68 and various concurrent languages, provide more extensive nondeterministic mechanisms, which cover statements as well.
IN MORE DEPTH
Further discussion of nondeterminism can be found on the PLP CD. Absent a
nondeterministic construct, the author of a code fragment in which order does
not matter must choose some arbitrary (artificial) order. Such a choice can make
it more difficult to construct a formal correctness proof. Some language designers
have also argued that it is inelegant. The most compelling uses for nondeterminacy
arise in concurrent programs, where imposing an arbitrary choice on the order
in which a thread interacts with its peers may cause the system as a whole to
deadlock. For such programs one may need to ensure that the choice among
nondeterministic alternatives is fair in some formal sense.
CHECK YOUR UNDERSTANDING
38. What is a tail-recursive function? Why is tail recursion important?
39. Explain the difference between applicative and normal order evaluation of
expressions. Under what circumstances is each desirable?
40. What is lazy evaluation? What are promises? What is memoization?
41. Give two reasons why lazy evaluation may be desirable.
42. Name a language in which parameters are always evaluated lazily.
43. Give two reasons why a programmer might sometimes want control flow to
be nondeterministic.
6.8
Summary and Concluding Remarks
In this chapter we introduced the principal forms of control flow found in
programming languages: sequencing, selection, iteration, procedural abstraction,
recursion, concurrency, exception handling and speculation, and nondeterminacy.
Sequencing specifies that certain operations are to occur in order, one after the
other. Selection expresses a choice among two or more control-flow alternatives.
Iteration and recursion are the two ways to execute operations repeatedly. Recursion defines an operation in terms of simpler instances of itself; it depends on procedural abstraction. Iteration repeats an operation for its side effect(s). Sequencing
and iteration are fundamental to imperative (especially von Neumann) programming. Recursion is fundamental to functional programming. Nondeterminacy
allows the programmer to leave certain aspects of control flow deliberately unspecified. We touched on concurrency only briefly; it will be the subject of Chapter 12.
Procedural abstractions (subroutines) are the subject of Chapter 8. Exception
handling and speculation will be covered in Sections 8.5 and 12.4.4.
Our survey of control-flow mechanisms was preceded by a discussion of expression evaluation. We considered the distinction between l-values and r-values, and
between the value model of variables, in which a variable is a named container
for data, and the reference model of variables, in which a variable is a reference
to a data object. We considered issues of precedence, associativity, and ordering within expressions. We examined short-circuit Boolean evaluation and its
implementation via jump code, both as a semantic issue that affects the correctness of expressions whose subparts are not always well defined, and as an
implementation issue that affects the time required to evaluate complex Boolean
expressions.
In our survey we encountered many examples of control-flow constructs whose
syntax and semantics have evolved considerably over time. Particularly noteworthy
has been the phasing out of goto -based control flow and the emergence of a consensus on structured alternatives. While convenience and readability are difficult
to quantify, most programmers would agree that the control flow constructs of
a language like Ada are a dramatic improvement over those of, say, Fortran IV.
Examples of features in Ada that are specifically designed to rectify control-flow
problems in earlier languages include explicit terminators ( end if , end loop ,
etc.) for structured constructs; elsif clauses; label ranges and default ( others )
clauses in case statements; implicit declaration of for loop indices as read-only
local variables; explicit return statements; multilevel loop exit statements; and
exceptions.
The evolution of constructs has been driven by many goals, including ease
of programming, semantic elegance, ease of implementation, and run-time efficiency. In some cases these goals have proven complementary. We have seen for
example that short-circuit evaluation leads both to faster code and (in many cases)
to cleaner semantics. In a similar vein, the introduction of a new local scope for
the index variable of an enumeration-controlled loop avoids both the semantic
problem of the value of the index after the loop and (to some extent) the implementation problem of potential overflow.
In other cases improvements in language semantics have been considered worth
a small cost in run-time efficiency. We saw this in the development of iterators:
like many forms of abstraction, they add a modest amount of run-time cost in
many cases (e.g., in comparison to explicitly embedding the implementation of
the enumerated set in the control flow of the loop), but with a large pay-back
in modularity, clarity, and opportunities for code reuse. In a similar vein, the
developers of Java would argue that for many applications the portability and
safety provided by extensive semantic checking, standard-format numeric types,
and so on are far more important than speed.
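The iterator trade-off described above can be sketched in Python. This is only an illustration under stated assumptions: the tuple-based tree representation and the names preorder, sum_embedded, and sample are hypothetical, chosen for this sketch. The generator keeps the traversal logic in one reusable place; the second function embeds the same traversal directly in the loop, avoiding the iterator machinery at the price of reuse.

```python
# Hypothetical tree representation for this sketch:
# a tree is either None or a tuple (left, value, right).

def preorder(tree):
    """Yield node values in preorder; the traversal lives in one place."""
    if tree is not None:
        left, value, right = tree
        yield value
        yield from preorder(left)
        yield from preorder(right)

def sum_embedded(tree):
    """Sum the values with the traversal embedded in the loop itself."""
    total, stack = 0, [tree]
    while stack:
        node = stack.pop()
        if node is not None:
            left, value, right = node
            total += value
            stack.append(right)   # pushed first, so visited after left
            stack.append(left)
    return total

sample = ((None, 1, None), 2, ((None, 3, None), 4, None))
```

The same generator can drive unrelated loops, for example sum(preorder(sample)) and list(preorder(sample)), whereas sum_embedded would have to be rewritten for each new loop body.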
In several cases, advances in compiler technology or in the simple willingness of
designers to build more complex compilers have made it possible to incorporate
features once considered too expensive. Label ranges in Ada case statements
require that the compiler be prepared to generate code employing binary search.
In-line functions in C++ eliminate the need to choose between the inefficiency of
tiny functions and the messy semantics of macros. Exceptions (as we shall see in
Section 8.5.3) can be implemented in such a way that they incur no cost in the
common case (when they do not occur), but the implementation is quite tricky.
Iterators, boxing, generics (Section 8.4), and first-class functions are likewise rather
tricky, but are increasingly found in mainstream imperative languages.
Some implementation techniques (e.g., rearranging expressions to uncover
common subexpressions, or avoiding the evaluation of guards in a nondeterministic construct once an acceptable choice has been found) are sufficiently
important to justify a modest burden on the programmer (e.g., adding parentheses where necessary to avoid overflow or ensure numeric stability, or ensuring
that expressions in guards are side-effect-free). Other semantically useful mechanisms (e.g., lazy evaluation, continuations, or truly random nondeterminacy) are
usually considered complex or expensive enough to be worthwhile only in special
circumstances (if at all).
In comparatively primitive languages, we can often obtain some of the benefits
of missing features through programming conventions. In early dialects of Fortran, for example, we can limit the use of goto s to patterns that mimic the control
flow of more modern languages. In languages without short-circuit evaluation,
we can write nested selection statements. In languages without iterators, we can
write sets of subroutines that provide equivalent functionality.
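As a minimal sketch of the nested-selection convention mentioned above, consider a bounds-guarded test. Python does have short-circuit operators, so the first function is shown only for comparison; the second obtains the same effect using nested if statements alone. The function names and parameters are hypothetical.

```python
def in_range_short_circuit(values, i, limit):
    # Relies on short-circuit evaluation: values[i] is never
    # evaluated when i is out of range.
    return 0 <= i < len(values) and values[i] < limit

def in_range_nested(values, i, limit):
    # Same effect without short-circuit operators:
    # each test guards the evaluation of the next one.
    result = False
    if 0 <= i:
        if i < len(values):
            if values[i] < limit:
                result = True
    return result
```

In both versions the subscript expression is reached only after the index has been checked, so an out-of-range i yields False rather than an error.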
6.9
Exercises
6.1 We noted in Section 6.1.1 that most binary arithmetic operators are left-associative in most programming languages. In Section 6.1.4, however,
we also noted that most compilers are free to evaluate the operands of a
binary operator in either order. Are these statements contradictory? Why or
why not?
Chapter 6 Control Flow
6.2 As noted in Figure 6.1, Fortran and Pascal give unary and binary minus the
same level of precedence. Is this likely to lead to nonintuitive evaluations of
certain expressions? Why or why not?
6.3 In Example 6.8 we described a common error in Pascal programs caused by
the fact that and and or have precedence comparable to that of the arithmetic operators. Show how a similar problem can arise in the stream-based
I/O of C++ (described in Section 7.9.3). (Hint: consider the precedence
of << and >> , and the operators that appear below them in the C column of
Figure 6.1.)
6.4 Translate the following expression into postfix and prefix notation:
[−b + sqrt(b × b − 4 × a × c)]/(2 × a)
Do you need a special symbol for unary negation?
6.5 In Lisp, most of the arithmetic operators are defined to take two or more
arguments, rather than strictly two. Thus (* 2 3 4 5) evaluates to 120,
and (- 16 9 4) evaluates to 3. Show that parentheses are necessary to
disambiguate arithmetic expressions in Lisp (in other words, give an example
of an expression whose meaning is unclear when parentheses are removed).
In Section 6.1.1 we claimed that issues of precedence and associativity do
not arise with prefix or postfix notation. Reword this claim to make explicit
the hidden assumption.
6.6 Example 6.31 claims that “For certain values of x , (0.1 + x) * 10.0 and
1.0 + (x * 10.0) can differ by as much as 25%, even when 0.1 and x are
of the same magnitude.” Verify this claim. (Warning: if you’re using an x86
processor, be aware that floating-point calculations [even on single-precision
variables] are performed internally with 80 bits of precision. Roundoff errors
will appear only when intermediate results are stored out to memory [with
limited precision] and read back in again.)
6.7 Is &(&i) ever valid in C? Explain.
6.8 Languages that employ a reference model of variables also tend to employ
automatic garbage collection. Is this more than a coincidence? Explain.
6.9 In Section 6.1.2 we noted that C uses = for assignment and == for equality
testing. The language designers state: “Since assignment is about twice as
frequent as equality testing in typical C programs, it’s appropriate that the
operator be half as long” [KR88, p. 17]. What do you think of this rationale?
6.10 Consider a language implementation in which we wish to catch every use of
an uninitialized variable. In Section 6.1.3 we noted that for types in which
every possible bit pattern represents a valid value, extra space must be used
to hold an initialized/uninitialized flag. Dynamic checks in such a system
can be expensive, largely because of the address calculations needed to access
the flags. We can reduce the cost in the common case by having the compiler
generate code to automatically initialize every variable with a distinguished
sentinel value. If at some point we find that a variable’s value is different
from the sentinel, then that variable must have been initialized. If its value is
the sentinel, we must double-check the flag. Describe a plausible allocation
strategy for initialization flags, and show the assembly language sequences
that would be required for dynamic checks, with and without the use of
sentinels.
6.11 Write an attribute grammar, based on the following context-free grammar,
that accumulates jump code for Boolean expressions (with short-circuiting)
into a synthesized attribute of condition, and then uses this attribute to
generate code for if statements.
stmt −→ if condition then stmt else stmt
     −→ other_stmt
condition −→ c_term | condition or c_term
c_term −→ relation | c_term and relation
relation −→ c_fact | c_fact comparator c_fact
c_fact −→ identifier | not c_fact | ( condition )
comparator −→ < | <= | = | <> | > | >=
(Hint: your task will be easier if you do not attempt to make the grammar L-attributed. For further details see Fischer and LeBlanc’s compiler
book [FL88, Sec. 14.1.4].)
6.12 Neither Algol 60 nor Algol 68 employs short-circuit evaluation for Boolean
expressions. In both languages, however, an if . . . then . . . else construct
can be used as an expression. Show how to use if . . . then . . . else to achieve
the effect of short-circuit evaluation.
6.13 Consider the following expression in C: a/b > 0 && b/a > 0 . What will
be the result of evaluating this expression when a is zero? What will be the
result when b is zero? Would it make sense to try to design a language in
which this expression is guaranteed to evaluate to false when either a or
b (but not both) is zero? Explain your answer.
6.14 As noted in Section 6.4.2, languages vary in how they handle the situation
in which the tested expression in a case statement does not appear among
the labels on the arms. C and Fortran 90 say the statement has no effect.
Pascal and Modula say it results in a dynamic semantic error. Ada says that
the labels must cover all possible values for the type of the expression, so
the question of a missing value can never arise at run time. What are the
tradeoffs among these alternatives? Which do you prefer? Why?
6.15 Write the equivalent of Figure 6.5 in C# 2.0, Ruby, or Clu. Write a second
version that performs an in-order enumeration, rather than preorder.
6.16 Revise the algorithm of Figure 6.6 so that it performs an in-order enumeration, rather than preorder.
6.17 Write a C++ preorder iterator to supply tree nodes to the loop in Example 6.66. You will need to know (or learn) how to use pointers, references,
inner classes, and operator overloading in C++. For the sake of (relative)
simplicity, you may assume that the data in a tree node is always an int ;
this will save you the need to use generics. You may want to use the stack
abstraction from the C++ standard library.
6.18 Write code for the tree_iter type ( struct ) and the ti_create , ti_done ,
ti_next , ti_val , and ti_delete functions employed in Example 6.69.
6.19 Write, in C#, Python, or Ruby, an iterator that yields
(a) all permutations of the integers 1 . . n
(b) all combinations of k integers from the range 1 . . n (0 ≤ k ≤ n).
You may represent your permutations and combinations using either a list
or an array.
6.20 Use iterators to construct a program that outputs (in some order) all structurally distinct binary trees of n nodes. Two trees are considered structurally
distinct if they have different numbers of nodes or if their left or right
subtrees are structurally distinct. There are, for example, five structurally
distinct trees of three nodes:
These are most easily output in “dotted parenthesized form”:
(((.).).)
((.(.)).)
((.).(.))
(.((.).))
(.(.(.)))
(Hint: think recursively! If you need help, see Section 2.2 of the text by
Finkel [Fin96].)
6.21 Build true iterators in Java using threads. (This requires knowledge of material in Chapter 12.) Make your solution as clean and as general as possible.
In particular, you should provide the standard Iterator or IEnumerable
interface, for use with extended for or foreach loops, but the programmer
should not have to write these. Instead, he or she should write a class with
an Iterate method, which should in turn be able to call a Yield method,
which you should also provide. Evaluate the cost of your solution. How
much more expensive is it than standard Java iterator objects?
6.22 In an expression-oriented language such as Algol 68 or Lisp, a while loop
(a do loop in Lisp) has a value as an expression. How do you think this
value should be determined? (How is it determined in Algol 68 and Lisp?)
Is the value a useless artifact of expression orientation, or are there reasonable programs in which it might actually be used? What do you think
should happen if the condition on the loop is such that the body is never
executed?
6.23 Consider a midtest loop, here written in C, that looks for blank lines in
its input:
for (;;) {
    line = read_line();
    if (all_blanks(line)) break;
    consume_line(line);
}
Show how you might accomplish the same task using a while or do
( repeat ) loop, if midtest loops were not available. (Hint: one alternative
duplicates part of the code; another introduces a Boolean flag variable.)
How do these alternatives compare to the midtest version?
6.24 Rubin [Rub87] used the following example (rewritten here in C) to argue in
favor of a goto statement:
int first_zero_row = -1;    /* none */
int i, j;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        if (A[i][j]) goto next;
    }
    first_zero_row = i;
    break;
next: ;
}
The intent of the code is to find the first all-zero row, if any, of an n × n
matrix. Do you find the example convincing? Is there a good structured
alternative in C? In any language?
6.25 Bentley [Ben86, Chap. 4] provides the following informal description of
binary search:
We are to determine whether the sorted array X[1..N] contains the element T . . . .
Binary search solves the problem by keeping track of a range within the array in
which T must be if it is anywhere in the array. Initially, the range is the entire array.
The range is shrunk by comparing its middle element to T and discarding half the
range. The process continues until T is discovered in the array or until the range
in which it must lie is known to be empty.
Write code for binary search in your favorite imperative programming language. What loop construct(s) did you find to be most useful? NB: when he
asked more than a hundred professional programmers to solve this problem,
Bentley found that only about 10% got it right the first time, without testing.
6.26 A loop invariant is a condition that is guaranteed to be true at a given
point within the body of a loop on every iteration. Loop invariants play a
major role in axiomatic semantics, a formal reasoning system used to prove
properties of programs. In a less formal way, programmers who identify
(and write down!) the invariants for their loops are more likely to write
correct code. Show the loop invariant(s) for your solution to the preceding
exercise. (Hint: you will find the distinction between < and ≤ [or between
> and ≥] to be crucial.)
6.27 If you have taken a course in automata theory or recursive function theory,
explain why while loops are strictly more powerful than for loops. (If you
haven’t had such a course, skip this question!) Note that we’re referring here
to Pascal-style for loops, not C-style.
6.28 Show how to calculate the number of iterations of a general Fortran 90-style
do loop. Your code should be written in an assembler-like notation, and
should be guaranteed to work for all valid bounds and step sizes. Be careful
of overflow! (Hint: While the bounds and step size of the loop can be either
positive or negative, you can safely use an unsigned integer for the iteration
count.)
6.29 Write a tail-recursive function in Scheme or ML to compute n factorial
(n! = ∏_{1≤i≤n} i = 1 × 2 × · · · × n). (Hint: You will probably want to define
a “helper” function, as discussed in Section 6.6.1.)
6.30 Is it possible to write a tail-recursive version of the classic quicksort algorithm? Why or why not?
6.31 Give an example in C in which an in-line subroutine may be significantly
faster than a functionally equivalent macro. Give another example in which
the macro is likely to be faster. (Hint: think about applicative vs normal-order evaluation of arguments.)
6.32 Use lazy evaluation ( delay and force ) to implement iterator objects in
Scheme. More specifically, let an iterator be either the null list or a pair
consisting of an element and a promise which when force d will return an
iterator. Give code for an uptoby function that returns an iterator, and a
for-iter function that accepts as arguments a one-argument function and
an iterator. These should allow you to evaluate such expressions as
(for-iter (lambda (e) (display e) (newline)) (uptoby 10 50 3))
Note that unlike the standard Scheme for-each , for-iter should not
require the existence of a list containing the elements over which to iterate;
the intrinsic space required for (for-iter f (uptoby 1 n 1)) should be
only O(1), rather than O(n).
6.33 (Difficult) Use call-with-current-continuation ( call/cc ) to implement the following structured nonlocal control transfers in Scheme. (This
requires knowledge of material in Chapter 10.) You will probably want to
consult a Scheme manual for documentation not only on call/cc , but on
define-syntax and dynamic-wind as well.
(a) Multilevel returns. Model your syntax after the catch and throw of
Common Lisp.
(b) True iterators. In a style reminiscent of Exercise 6.32, let an iterator be a
function which when call/cc -ed will return either a null list or a pair
consisting of an element and an iterator. As in that previous exercise,
your implementation should support expressions like
(for-iter (lambda (e)(display e) (newline))(uptoby 10 50 3))
Where the implementation of uptoby in Exercise 6.32 required the use
of delay and force , however, you should provide an iterator macro
(a Scheme special form) and a yield function that allows uptoby to
look like an ordinary tail-recursive function with an embedded yield :
(define uptoby
  (iterator (low high step)
    (letrec ((helper
              (lambda (next)
                (if (> next high) '()
                    (begin                ; else clause
                      (yield next)
                      (helper (+ next step)))))))
      (helper low))))
6.34–6.37 In More Depth.
6.10
Explorations
6.38 Loop unrolling (described in Exercise 5.23 and Section 16.7.1) is a code
transformation that replicates the body of a loop and reduces the number
of iterations, thereby decreasing loop overhead and increasing opportunities to improve the performance of the processor pipeline by reordering
instructions. Unrolling is traditionally implemented by the code improvement phase of a compiler. It can be implemented at source level, however, if
we are faced with the prospect of “hand optimizing” time-critical code on a
system whose compiler is not up to the task. Unfortunately, if we replicate
the body of a loop k times, we must deal with the possibility that the original
number of loop iterations, n, may not be a multiple of k. Writing in C, and
letting k = 4, we might transform the main loop of Exercise 5.23 from
i = 0;
do {
    sum += A[i]; squares += A[i] * A[i]; i++;
} while (i < N);
to
i = 0; j = N/4;
do {
    sum += A[i]; squares += A[i] * A[i]; i++;
    sum += A[i]; squares += A[i] * A[i]; i++;
    sum += A[i]; squares += A[i] * A[i]; i++;
    sum += A[i]; squares += A[i] * A[i]; i++;
} while (--j > 0);
do {
    sum += A[i]; squares += A[i] * A[i]; i++;
} while (i < N);
In 1983, Tom Duff of Lucasfilm realized that code of this sort can be
“simplified” in C by interleaving a switch statement and a loop. The result
is rather startling, but perfectly valid C. It’s known in programming folklore
as “Duff’s device”:
i = 0; j = (N+3)/4;
switch (N%4) {
    case 0: do{ sum += A[i]; squares += A[i] * A[i]; i++;
    case 3:     sum += A[i]; squares += A[i] * A[i]; i++;
    case 2:     sum += A[i]; squares += A[i] * A[i]; i++;
    case 1:     sum += A[i]; squares += A[i] * A[i]; i++;
            } while (--j > 0);
}
Duff announced his discovery with “a combination of pride and revulsion.”
He noted that “Many people . . . have said that the worst feature of C is
that switch es don’t break automatically before each case label. This code
forms some sort of argument in that debate, but I’m not sure whether it’s
for or against.” What do you think? Is it reasonable to interleave a loop and a
switch in this way? Should a programming language permit it? Is automatic
fall-through ever a good idea?
6.39 Using your favorite language and compiler, investigate the order of evaluation of subroutine parameters. Are they usually evaluated left-to-right or
right-to-left? Are they ever evaluated in the other order? (Can you be sure?)
Write a program in which the order makes a difference in the results of the
computation.
6.40 Consider the different approaches to arithmetic overflow adopted by Pascal,
C, Java, C#, and Common Lisp, as described in Section 6.1.4. Speculate as to
the differences in language design goals that might have caused the designers
to adopt the approaches they did.
6.41 Learn more about container classes and the design patterns (structured programming idioms) they support. Explore the similarities and differences
among the standard container libraries of C++, Java, and C#. Which of
these libraries do you find the most appealing? Why?
6.42 One of the most popular idioms for large-scale systems is the so-called
visitor pattern. It has several uses, one of which resembles the “iterating with
first-class functions” idiom of Examples 6.67 and 6.68. Briefly, elements of
a container class provide an accept method that expects as argument an
object that implements the visitor interface. This interface in turn has a
method named visit that expects an argument of element type. To iterate
over a collection, we implement the “loop body” in the visit method of
a visitor object. This object constitutes a closure of the sort described in
Section 3.6.3. Any information that visit needs (beyond the identity of the
“loop index” element) can be encapsulated in the object’s fields. An iterator
method for the collection passes the visitor object to the accept method
of each element. Each element in turn calls the visit method of the visitor
object, passing itself as argument.
Learn more about the visitor pattern. Use it to implement iterators for a
collection—preorder, inorder, and postorder traversals of a binary tree, for
example. How do visitors compare with equivalent iterator-based code? Do
they add new functionality? What else are visitors good for, in addition to
iteration?
6.43–6.46 In More Depth.
6.11
Bibliographic Notes
Many of the issues discussed in this chapter feature prominently in papers on
the history of programming languages. Pointers to several such papers can be
found in the Bibliographic Notes for Chapter 1. Fifteen papers comparing Ada,
C, and Pascal can be found in the collection edited by Feuer and Gehani [FG84].
References for individual languages can be found in Appendix A.
Niklaus Wirth has been responsible for a series of influential languages over a
30-year period, including Pascal [Wir71], its predecessor Algol W [WH66], and
the successors Modula [Wir77b], Modula-2 [Wir85b], and Oberon [Wir88b]. The
case statement of Algol W is due to Hoare [Hoa81]. Bernstein [Ber85] considers
a variety of alternative implementations for case , including multilevel versions
appropriate for label sets consisting of several dense “clusters” of values. Guarded
commands are due to Dijkstra [Dij75]. Duff’s device was originally posted to
netnews, the predecessor of Usenet news, in May of 1984. The original posting
appears to have been lost, but Duff’s commentary on it can be found at many
Internet sites, including www.lysator.liu.se/c/duffs-device.html.
Debate over the supposed merits or evils of the goto statement dates from
at least the early 1960s, but became a good bit more heated in the wake of a
1968 article by Dijkstra (“Go To Statement Considered Harmful” [Dij68b]). The
structured programming movement of the 1970s took its name from the text
of Dahl, Dijkstra, and Hoare [DDH72]. A dissenting letter by Rubin in 1987
(“ ‘GOTO Considered Harmful’ Considered Harmful” [Rub87]; Exercise 6.24)
elicited a flurry of responses.
What has been called the “reference model of variables” in this chapter is called
the “object model” in Clu; Liskov and Guttag describe it in Sections 2.3 and 2.4.2
of their text on abstraction and specification [LG86]. Clu iterators are described in
an article by Liskov et al. [LSAS77], and in Chapter 6 of the Liskov and Guttag text.
Icon generators are discussed in Chapters 11 and 14 of the text by Griswold and
Griswold [GG96]. The tree-enumeration algorithm of Exercise 6.20 was originally
presented (without iterators) by Solomon and Finkel [SF80].
Several texts discuss the use of invariants (Exercise 6.26) as a tool for writing
correct programs. Particularly noteworthy are the works of Dijkstra [Dij76] and
Gries [Gri81]. Kernighan and Plauger provide a more informal discussion of the
art of writing good programs [KP78].
The Blizzard [SFL+94] and Shasta [SG96] systems for software distributed
shared memory (S-DSM) make use of sentinels (Exercise 6.10). We will discuss
S-DSM in Section 12.2.1.
Michaelson [Mic89, Chap. 8] provides an accessible formal treatment of
applicative-order, normal-order, and lazy evaluation. Friedman, Wand, and
Haynes provide an excellent discussion of continuation-passing style [FWH01,
Chaps. 7–8].
7
Data Types
Most programming languages include a notion of type for expressions
and/or objects.1 Types serve two principal purposes:
EXAMPLE 7.1 Operations that leverage type information
EXAMPLE 7.2 Errors captured by type information
1. Types provide implicit context for many operations, so that the programmer
does not have to specify that context explicitly. In C, for instance, the expression a + b will use integer addition if a and b are of integer type; it will use
floating-point addition if a and b are of double (floating-point) type. Similarly, the operation new p in Pascal, where p is a pointer, will allocate a block of
storage from the heap that is the right size to hold an object of the type pointed
to by p ; the programmer does not have to specify (or even know) this size. In
C++, Java, and C#, the operation new my_type() not only allocates (and
returns a pointer to) a block of storage sized for an object of type my_type ; it
also automatically calls any user-defined initialization (constructor) function
that has been associated with that type.
2. Types limit the set of operations that may be performed in a semantically
valid program. They prevent the programmer from adding a character and a
record, for example, or from taking the arctangent of a set, or passing a file
as a parameter to a subroutine that expects an integer. While no type system
can promise to catch every nonsensical operation that a programmer might
put into a program by mistake, good type systems catch enough mistakes to
be highly valuable in practice.
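Both purposes can be illustrated with a small Python sketch. The function names combine and try_combine are hypothetical. Note one difference from the C example above: C selects integer or floating-point addition at compile time, whereas Python selects the operation, and detects the clash, at run time.

```python
def combine(a, b):
    # The meaning of + is chosen from the operand types;
    # the caller never states which kind of addition is wanted.
    return a + b

def try_combine(a, b):
    # Return the result, or the name of the exception raised
    # when the operand types rule the operation out.
    try:
        return combine(a, b)
    except TypeError as err:
        return type(err).__name__
```

Here combine(2, 3) uses integer addition, combine(2.5, 0.5) uses floating-point addition, and combine("ab", "cd") concatenates strings. An attempt to add a string and an integer is rejected: try_combine("ab", 3) reports a TypeError.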
Section 7.1 looks more closely at the meaning and purpose of types, and
presents some basic definitions. Section 7.2 addresses questions of type equivalence and type compatibility: when can we say that two types are the same, and
when can we use a value of a given type in a given context? Sections 7.3–7.9 consider syntactic, semantic, and pragmatic issues for some of the most important
1 Recall that unless otherwise noted we are using the term “object” informally to refer to anything
that might have a name. Object-oriented languages, which we will study in Chapter 9, assign a
narrower, more formal, meaning to the term.
Programming Language Pragmatics. DOI: 10.1016/B978-0-12-374514-9.00017-3
Copyright © 2009 by Elsevier Inc. All rights reserved.
composite types: records, arrays, strings, sets, pointers, lists, and files. The section
on pointers includes a more detailed discussion of the value and reference models of variables introduced in Section 6.1.2, and of the heap management issues
introduced in Section 3.2. The section on files (mostly on the PLP CD) includes
a discussion of input and output mechanisms. Section 7.10 considers what it
means to compare two complex objects for equality, or to assign one into the
other.
7.1
Type Systems
Computer hardware can interpret bits in memory in several different ways: as
instructions, addresses, characters, and integer and floating-point numbers of
various lengths (see Section 5.2 for more details). The bits themselves, however,
are untyped; the hardware on most machines makes no attempt to keep track
of which interpretations correspond to which locations in memory. Assembly
languages reflect this lack of typing: operations of any kind can be applied to
values in arbitrary locations. High-level languages, on the other hand, almost
always associate types with values, to provide the contextual information and
error checking alluded to above.
Informally, a type system consists of (1) a mechanism to define types and
associate them with certain language constructs, and (2) a set of rules for type
equivalence, type compatibility, and type inference. The constructs that must have
types are precisely those that have values, or that can refer to objects that have
values. These constructs include named constants, variables, record fields, parameters, and sometimes subroutines; literal constants (e.g., 17 , 3.14 , "foo" ); and
more complicated expressions containing these. Type equivalence rules determine
when the types of two values are the same. Type compatibility rules determine
when a value of a given type can be used in a given context. Type inference rules
define the type of an expression based on the types of its constituent parts or
(sometimes) the surrounding context. In a language with polymorphic variables
or parameters, it may be important to distinguish between the type of a reference
or pointer and the type of the object to which it refers: a given name may refer to
objects of different types at different times.
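The three kinds of rules can be made concrete with a toy checker, sketched here in Python. Everything in it is an assumption made for illustration: the toy language has only the types "int", "real", and "string", a single operator +, name equality as its equivalence rule, and one compatibility rule (an int may be used where a real is expected). It is not the rule set of any real language.

```python
# Hypothetical toy language: types are the strings "int", "real", "string".

def equivalent(t1, t2):
    # Type equivalence: here, simply the same type name.
    return t1 == t2

def compatible(actual, expected):
    # Type compatibility: an int may be used where a real is expected.
    return equivalent(actual, expected) or (actual == "int" and expected == "real")

def infer(expr):
    """Infer the type of an expression from the types of its parts.

    expr is a literal (int, float, or str) or a tuple ("+", left, right).
    """
    if isinstance(expr, int):
        return "int"
    if isinstance(expr, float):
        return "real"
    if isinstance(expr, str):
        return "string"
    op, left, right = expr
    lt, rt = infer(left), infer(right)
    if op == "+" and equivalent(lt, rt):
        return lt                       # same type on both sides
    if op == "+" and "string" not in (lt, rt) and (
            compatible(lt, rt) or compatible(rt, lt)):
        return "real"                   # mixed int/real arithmetic
    raise TypeError(f"type clash: {lt} {op} {rt}")
```

With these rules, infer(("+", 1, 2)) yields "int", infer(("+", 1, 2.5)) yields "real", and infer(("+", "foo", 17)) raises an error, since no rule allows a string and an int to be combined.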
Subroutines are considered to have types in some languages, but not in others.
Subroutines need to have types if they are first- or second-class values (i.e., if they
can be passed as parameters, returned by functions, or stored in variables). In each
of these cases there is a construct in the language whose value is a dynamically
determined subroutine; type information allows the language to limit the set
of acceptable values to those that provide a particular subroutine interface (i.e.,
particular numbers and types of parameters). In a statically scoped language that
never creates references to subroutines dynamically (one in which subroutines
are always third-class values), the compiler can always identify the subroutine to
which a name refers, and can ensure that the routine is called correctly without
necessarily employing a formal notion of subroutine types.
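As a rough sketch of why subroutine values need types, the following Python code accepts a subroutine as a parameter and limits the acceptable values to those with a particular interface. The names has_interface and apply_binary are hypothetical, and the check is only a run-time approximation of a subroutine type: it counts parameters using the standard inspect module but says nothing about their types.

```python
import inspect

def has_interface(f, n_params):
    # Run-time stand-in for a subroutine type:
    # f must be callable and declare exactly n_params parameters.
    return callable(f) and len(inspect.signature(f).parameters) == n_params

def apply_binary(f, x, y):
    # f is a dynamically determined subroutine; reject values
    # that do not provide the expected two-parameter interface.
    if not has_interface(f, 2):
        raise TypeError("expected a two-parameter subroutine")
    return f(x, y)
```

A call such as apply_binary(lambda a, b: a * b, 3, 4) succeeds, while passing a one-parameter function is rejected before the call is attempted. A statically typed language would perform the analogous check at compile time.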
7.1.1
Type Checking
Type checking is the process of ensuring that a program obeys the language’s type
compatibility rules. A violation of the rules is known as a type clash. A language is
said to be strongly typed if it prohibits, in a way that the language implementation
can enforce, the application of any operation to any object that is not intended
to support that operation. A language is said to be statically typed if it is strongly
typed and type checking can be performed at compile time. In the strictest sense
of the term, few languages are statically typed. In practice, the term is often applied
to languages in which most type checking can be performed at compile time, and
the rest can be performed at run time.
A few examples: Ada is strongly typed, and for the most part statically typed
(certain type constraints must be checked at run time). A Pascal implementation
can also do most of its type checking at compile time, though the language is not
quite strongly typed: untagged variant records (to be discussed in Section 7.3.4)
are its only loophole. C89 is significantly more strongly typed than its predecessor
dialects, but still significantly less strongly typed than Pascal. Its loopholes include
unions, subroutines with variable numbers of parameters, and the interoperability
of pointers and arrays (to be discussed in Section 7.7.1). Implementations of C
rarely check anything at run time.
Dynamic (run-time) type checking is a form of late binding, and tends to be
found in languages that delay other issues until run time as well. Lisp and Smalltalk are dynamically (though strongly) typed. Most scripting languages are also
dynamically typed; some (e.g., Python and Ruby) are strongly typed. Languages
with dynamic scoping are generally dynamically typed (or not typed at all): if the
compiler can’t identify the object to which a name refers, it usually can’t determine
the type of the object either.
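A small Python sketch shows what it means for a language to be strongly but dynamically typed. The function describe is hypothetical. It contains a type clash, but because checking is delayed until run time, the clash is reported only when the offending line actually executes.

```python
def describe(n):
    if n >= 0:
        return "count: " + str(n)
    # Type clash (string + integer). Python does not coerce silently,
    # but it also does not notice the problem until this line runs.
    return "negative: " + n
```

Calls with nonnegative arguments work indefinitely; the first call with a negative argument raises a TypeError. A statically typed language would reject the function at compile time.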
7.1.2
Polymorphism
Polymorphism (Section 3.5.3) allows a single body of code to work with objects of
multiple types. It may or may not imply the need for run-time type checking. As
implemented in Lisp, Smalltalk, and the various scripting languages, fully dynamic
typing allows the programmer to apply arbitrary operations to arbitrary objects.
Only at run time does the language implementation check to see that the objects
actually implement the requested operations. Because the types of objects can be
thought of as implied (unspecified) parameters, dynamic typing is said to support
implicit parametric polymorphism.
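Implicit parametric polymorphism can be sketched in Python with a single function body that works for any element type supporting the > operation. The name largest is hypothetical; the element type is never declared and acts as an implied parameter.

```python
def largest(items):
    # Assumes items is nonempty. Works for any element type that
    # implements ">"; this is checked only when ">" is applied.
    it = iter(items)
    best = next(it)
    for x in it:
        if x > best:
            best = x
    return best
```

The same code handles largest([3, 9, 2]), largest([2.5, 1.0]), and largest(["pear", "apple"]). Passing objects that do not implement > is not rejected in advance; the failure surfaces as a run-time TypeError at the comparison.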
Unfortunately, while powerful and straightforward, dynamic typing incurs significant run-time cost, and delays the reporting of errors. ML and its descendants
employ a sophisticated system of type inference to support implicit parametric
polymorphism in conjunction with static typing. The ML compiler infers for
every object and expression a (possibly unique) type that captures precisely those
properties that the object or expression must have to be used in the context(s) in
which it appears. With rare exceptions, the programmer need not specify the types
of objects explicitly. The task of the compiler is to determine whether there exists
a consistent assignment of types to expressions that guarantees, statically, that no
operation will ever be applied to a value of an inappropriate type at run time.
This job can be formalized as the problem of unification; we discuss it further in
Section 7.2.4.
DESIGN & IMPLEMENTATION
Dynamic typing
The growing popularity of scripting languages has led a number of prominent
software developers to publicly question the value of static typing. They ask:
given that we can’t check everything at compile time, how much pain is it worth
to check the things we can? As a general rule, it is easier to write type-correct
code than to prove that we have done so, and static typing requires such proofs.
As type systems become more complex (due to object orientation, generics,
etc.), the complexity of static typing increases correspondingly. Anyone who
has written extensively in Ada or C++ on the one hand, and in Python or
Scheme on the other, cannot help but be struck at how much easier it is to write
code, at least for modest-sized programs, without complex type declarations.
Dynamic checking incurs some run-time overhead, of course, and may delay the
discovery of bugs, but this is increasingly seen as insignificant in comparison
to the potential increase in human productivity. The choice between static
and dynamic typing promises to provide one of the most interesting language
debates of the coming decade.
In object-oriented languages, subtype polymorphism allows a variable X of
type T to refer to an object of any type derived from T . Since derived types
are required to support all of the operations of the base type, the compiler can
be sure that any operation acceptable for an object of type T will be acceptable for any object referred to by X . Given a straightforward model of inheritance, type checking for subtype polymorphism can be implemented entirely at
compile time. Most languages that envision such an implementation, including
C++, Eiffel, Java, and C#, also provide explicit parametric polymorphism (generics),
which allow the programmer to define classes with type parameters. Generics
are particularly useful for container (collection) classes: “list of T ” ( List<T> ),
“stack of T ” ( Stack<T> ), and so on, where T is left unspecified. Like subtype
polymorphism, generics can usually be type checked at compile time, though
Java sometimes performs redundant checks at run time for the sake of interoperability with preexisting nongeneric code. Smalltalk, Objective-C, Python, and Ruby
use a single mechanism for both parametric and subtype polymorphism, with
checking delayed until run time. We will consider generics further in Section 8.4,
and derived types in Chapter 9.
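The two kinds of polymorphism can be sketched side by side in Python. The classes Stack, Shape, and Square are hypothetical, invented for this example. Python's type annotations are optional hints: the interpreter does not enforce them, though an external static checker can, so the sketch shows the shape of the declarations rather than compile-time checking itself.

```python
from typing import Generic, List, TypeVar

T = TypeVar("T")


class Stack(Generic[T]):
    """A container whose element type T is left unspecified."""

    def __init__(self) -> None:
        self._items: List[T] = []

    def push(self, item: T) -> None:
        self._items.append(item)

    def pop(self) -> T:
        return self._items.pop()


class Shape:
    def area(self) -> float:
        raise NotImplementedError


class Square(Shape):
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side * self.side


# Subtype polymorphism: a variable of type Shape may refer to a Square.
x: Shape = Square(3.0)

# Parametric polymorphism: the same Stack code serves any element type.
shapes: Stack[Shape] = Stack()
shapes.push(x)
print(shapes.pop().area())
```

Because Square supports every operation of Shape, any use of x that is acceptable for a Shape is acceptable for the Square it refers to.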
7.1.3 The Meaning of “Type”
Some early high-level languages (e.g., Fortran 77, Algol 60, and Basic) provided a
small, built-in, and nonextensible set of types. As we saw in Section 3.3.1, Fortran
does not require variables to be declared; it incorporates default rules to determine
the type of undefined variables based on the spelling of their names (Basic has
similar rules). As noted in the previous subsection, some languages (ML, Miranda,
Haskell) infer types automatically at compile time, while others (Lisp, Smalltalk,
scripting languages) track them at run time. In most languages, however, users
must explicitly declare the type of every object, together with the characteristics
of every type that is not built in.
There are at least three ways to think about types, which we may call the denotational, constructive, and abstraction-based points of view. From the denotational
point of view, a type is simply a set of values. A value has a given type if it belongs to
the set; an object has a given type if its value is guaranteed to be in the set. From the
constructive point of view, a type is either one of a small collection of built-in types
(integer, character, Boolean, real, etc.; also called primitive or predefined types), or
a composite type created by applying a type constructor ( record , array , set , etc.)
to one or more simpler types. (This use of the term “constructor” is unrelated to
the initialization functions of object-oriented languages. It also differs in a more
subtle way from the use of the term in ML.) From the abstraction-based point of
view, a type is an interface consisting of a set of operations with well-defined and
mutually consistent semantics. For most programmers (and language designers),
types usually reflect a mixture of these viewpoints.
In denotational semantics (one of the leading ways to formalize the meaning
of programs), a set of values is known as a domain. Types are domains, and the
meaning of an expression is a value from the domain that represents the expression’s type. Some domains—the integers, for example—are simple and familiar.
Others can be quite complex. In fact, in denotational semantics everything has a
type—even statements with side effects. The meaning of an assignment statement
is a value from a domain whose elements are functions. Each function maps a
store—a mapping from names to values that represents the current contents of
memory—to another store, which represents the contents of memory after the
assignment.
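The idea that an assignment denotes a function from stores to stores can be made concrete with a small Python sketch. The names Store, Meaning, and assign are ours, and a dictionary stands in for the store; real denotational definitions are written in mathematical notation rather than code.

```python
from typing import Callable, Dict

Store = Dict[str, int]                 # a mapping from names to values
Meaning = Callable[[Store], Store]     # the domain of statement meanings


def assign(name: str, expr: Callable[[Store], int]) -> Meaning:
    """Meaning of `name := expr`: a function from one store to the next."""
    def meaning(store: Store) -> Store:
        new_store = dict(store)        # the old store is left untouched
        new_store[name] = expr(store)
        return new_store
    return meaning


# The statement  x := x + y  denotes a function on stores.
stmt = assign("x", lambda s: s["x"] + s["y"])
before = {"x": 1, "y": 2}
after = stmt(before)
print(before, after)
```

Here stmt is itself a value (a function), which is the sense in which even a statement with side effects has a type.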
One of the nice things about the denotational view of types is that it allows us
in many cases to describe user-defined composite types (records, arrays, etc.) in
terms of mathematical operations on sets. We will allude to these operations again
in Section 7.1.4. Because it is based on mathematical objects, the denotational view
of types usually ignores such implementation issues as limited precision and word
length. This limitation is less serious than it might at first appear: Checks for such
errors as arithmetic overflow are usually implemented outside of the type system
of a language anyway. They result in a run-time error, but this error is not called
a type clash.
When a programmer defines an enumerated type (e.g., enum hue {red,
green, blue} in C), he or she certainly thinks of this type as a set of values.
For most other varieties of user-defined type, however, one typically does not
think in terms of sets of values. Rather, one usually thinks in terms of the way the
type is built from simpler types, or in terms of its meaning or purpose. These ways
of thinking reflect the constructive and abstraction-based points of view. The constructive point of view was pioneered by Algol W and Algol 68, and is characteristic
of most languages designed in the 1970s and 1980s. The abstraction-based point
of view was pioneered by Simula-67 and Smalltalk, and is characteristic of modern
object-oriented languages. It can also be adopted as a matter of programming discipline in non–object-oriented languages. We will discuss the abstraction-based
point of view in more detail in Chapter 9. The remainder of this chapter focuses
on the constructive point of view.
7.1.4 Classification of Types
The terminology for types varies some from one language to another. This subsection presents definitions for the most common terms. Most languages provide
built-in types similar to those supported in hardware by most processors: integers,
characters, Booleans, and real (floating-point) numbers.
Booleans (sometimes called logicals) are typically implemented as single-byte
quantities, with 1 representing true and 0 representing false . C is unusual in
its lack of a Boolean type: where most languages would expect a Boolean value, C
expects an integer; zero means false , and anything else means true . As noted in
Section 6.5.4, Icon replaces Booleans with a more general notion of success and
failure.
Characters have traditionally been implemented as one-byte quantities as well,
typically (but not always) using the ASCII encoding. More recent languages (e.g.,
Java and C#) use a two-byte representation designed to accommodate (the commonly used portion of) the Unicode character set. Unicode is an international
standard designed to capture the characters of a wide variety of languages (see sidebar on page 295). The first 128 characters of Unicode ( \u0000 through \u007f )
are identical to ASCII. C++ provides both regular and “wide” characters, though
for wide characters both the encoding and the actual width are implementation
dependent. Fortran 2003 supports four-byte Unicode characters.
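The sidebar on multilingual character sets describes how UTF-8 marks multibyte characters with their leading bits. The following Python sketch encodes code points in the Basic Multilingual Plane by hand; the function name utf8_encode_bmp is ours, and the sketch deliberately ignores longer encodings and surrogate code points.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode a code point in the Basic Multilingual Plane as UTF-8."""
    if code_point < 0x80:                       # 7-bit ASCII: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("outside the BMP; longer encodings are not handled here")


for ch in ["A", "\u00a9", "\u20ac"]:
    encoded = utf8_encode_bmp(ord(ch))
    assert encoded == ch.encode("utf-8")        # agrees with Python's encoder
    print(hex(ord(ch)), [format(b, "08b") for b in encoded])
```

An ASCII character keeps its one-byte form with a leading 0 bit, which is why legacy ASCII files are already valid UTF-8.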
Numeric Types
A few languages (e.g., C and Fortran) distinguish between different lengths of
integers and real numbers; most do not, and leave the choice of precision to the
implementation. Unfortunately, differences in precision across language implementations lead to a lack of portability: programs that run correctly on one system
may produce run-time errors or erroneous results on another. Java and C# are
unusual in providing several lengths of numeric types, with a specified precision
for each.
DESIGN & IMPLEMENTATION
Multilingual character sets
The ISO 10646 international standard defines a Universal Character Set (UCS)
intended to include all characters of all known human languages. (It also sets
aside a “private use area” for such artificial [constructed] languages as Klingon,
Tengwar, and Cirth [Tolkien Elvish]. Allocation of this private space is coordinated by a volunteer organization known as the ConScript Unicode Registry.)
All natural languages currently employ codes in the 16-bit Basic Multilingual
Plane (BMP): 0x0000 through 0xfffd .
Unicode is an expanded version of ISO 10646, maintained by an international
consortium of software manufacturers. In addition to mapping tables, it covers
such topics as rendering algorithms, directionality of text, and sorting and
comparison conventions.
While recent languages have moved toward 16- or 32-bit internal character representations, these cannot be used for external storage (text files)
without causing severe problems with backward compatibility. To accommodate Unicode without breaking existing tools, Ken Thompson in 1992 proposed
a multibyte “expanding” code known as UTF-8 (UCS/Unicode Transformation
Format, 8-bit), and codified as a formal annex (appendix) to ISO 10646. In this
original design UTF-8 characters occupy a maximum of 6 bytes (the current definition stops at 4), at most 3 if they lie in the BMP, and only 1
if they are ordinary ASCII. The trick is to observe that ASCII is a 7-bit code; in
any legacy text file the most significant bit of every byte is 0. In UTF-8 a most
significant bit of 1 indicates a multibyte character. Two-byte codes begin with
the bits 110 . Three-byte codes begin with 1110 . Second and subsequent bytes
of multibyte characters always begin with 10 .
On some systems one also finds files encoded in one of 10 variants of the
older 8-bit ISO 8859 standard, but these are inconsistently rendered across platforms. On the web, non-ASCII characters are typically encoded with numeric
character references, which bracket a Unicode value, written in decimal or hex,
with an ampersand and a semicolon. The copyright symbol (©), for example,
is &#169; . Many characters also have symbolic entity names (e.g., &copy; ), but
not all browsers support these.
A few languages, including C, C++, C# and Modula-2, provide both signed and
unsigned integers (Modula-2 calls unsigned integers cardinals). A few languages
(e.g., Fortran, C99, Common Lisp, and Scheme) provide a built-in complex type,
usually implemented as a pair of floating-point numbers that represent the real
and imaginary Cartesian coordinates; other languages support these as a standard
library class. A few languages (e.g., Scheme and Common Lisp) provide a built-in
rational type, usually implemented as a pair of integers that represent the numerator and denominator. Most scripting languages support integers of arbitrary
precision; the implementation uses multiple words of memory where appropriate. Ada supports fixed-point types, which are represented internally by integers,
but have an implied decimal point at a programmer-specified position among
the digits. Several languages support decimal types that use a base-10 encoding
to avoid round-off anomalies in financial and human-centered arithmetic (see
the sidebar on decimal types).
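Several of these numeric types can be seen in Python, which we use here only as a convenient illustration; the variable names are ours. Fraction and Decimal come from the standard library modules fractions and decimal.

```python
from decimal import Decimal
from fractions import Fraction

# Arbitrary-precision integers: the implementation uses as many words as needed.
big = 2 ** 100
print(big)

# A rational is a numerator/denominator pair, kept in lowest terms.
third = Fraction(1, 3)
print(third + Fraction(1, 6))

# A complex number is a pair of floating-point Cartesian coordinates.
z = complex(3.0, 4.0)
print(z.real, z.imag, abs(z))

# Binary floating point cannot represent 0.1 exactly; a base-10 type can.
print(0.1 + 0.2 == 0.3)
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```

The last two lines show the round-off anomaly that decimal types are meant to avoid: the binary comparison fails, while the base-10 comparison succeeds.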
Integers, Booleans, and characters are all examples of discrete types (also called
ordinal types): the domains to which they correspond are countable (they have
a one-to-one correspondence with some subset of the integers), and have a well-defined notion of predecessor and successor for each element other than the first
and the last. (In most implementations the number of possible integers is finite,
but this is usually not reflected in the type system.) Two varieties of user-defined
types, enumerations and subranges, are also discrete. Discrete, rational, real, and
complex types together constitute the scalar types. Scalar types are also sometimes
called simple types.
DESIGN & IMPLEMENTATION
Decimal types
A few languages, notably Cobol and PL/I, provide a decimal type for fixed-point representation of integer quantities. These types were designed primarily
to exploit the binary-coded decimal (BCD) integer format supported by many
traditional CISC machines. BCD devotes one nibble (four bits, or half a byte)
to each decimal digit. Machines that support BCD in hardware can perform
arithmetic directly on the BCD representation of a number, without converting
it to and from binary form. This capability is particularly useful in business and
financial applications, which treat their data as both numbers and character
strings.
With the growth in on-line commerce, the past few years have seen renewed
interest in decimal arithmetic. The latest version of the IEEE 754 floating-point
standard, adopted in June 2008, includes decimal floating-point types in 32-,
64-, and 128-bit lengths. These represent both the mantissa (significant bits)
and exponent in binary, but interpret the exponent as a power of 10, not a
power of 2. At a given length, values of decimal type have greater precision
but smaller range than binary floating-point values. They are ideal for financial calculations, because they capture decimal fractions precisely. Designers
hope the new standard will displace existing incompatible decimal formats,
not only in hardware but also in software libraries, thereby providing the same
portability and predictability that the original 754 standard provided for binary
floating-point.
C# includes a 128-bit decimal type that is compatible with the new standard. Specifically, a C# decimal variable includes 96 bits of precision, a sign, and
a decimal scaling factor that can vary between 10^-28 and 10^28 . IBM, for which
business and financial applications have always been an important market, has
included a hardware implementation of the standard (64- and 128-bit widths)
in its POWER6 processor chips.
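The binary-coded decimal format mentioned in the decimal types sidebar is easy to sketch. The Python functions below, whose names are ours, pack a nonnegative integer one decimal digit per nibble and unpack it again; they model only the representation, not hardware BCD arithmetic or sign handling.

```python
def to_packed_bcd(n: int) -> bytes:
    """Pack a nonnegative integer as BCD, one decimal digit per nibble."""
    digits = str(n)
    if len(digits) % 2:                 # pad to a whole number of bytes
        digits = "0" + digits
    return bytes((int(hi) << 4) | int(lo)
                 for hi, lo in zip(digits[0::2], digits[1::2]))


def from_packed_bcd(data: bytes) -> int:
    """Recover the integer from its packed BCD representation."""
    return int("".join(f"{b >> 4}{b & 0x0F}" for b in data))


packed = to_packed_bcd(1995)
print(packed.hex())                     # hex digits read as the decimal digits
print(from_packed_bcd(packed))
```

Because each nibble holds one decimal digit, the hexadecimal dump of the packed bytes reads the same as the decimal number, which is what makes the format convenient for data treated as both numbers and character strings.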
Enumeration Types
EXAMPLE 7.3 Enumerations in Pascal
Enumerations were introduced by Wirth in the design of Pascal. They facilitate
the creation of readable programs, and allow the compiler to catch certain kinds
of programming errors. An enumeration type consists of a set of named elements.
In Pascal, one can write:
type weekday = (sun, mon, tue, wed, thu, fri, sat);
The values of an enumeration type are ordered, so comparisons are generally valid
( mon < tue ), and there is usually a mechanism to determine the predecessor or
successor of an enumeration value (in Pascal, tomorrow := succ(today) ). The
ordered nature of enumerations facilitates the writing of enumeration-controlled
loops:
for today := mon to fri do begin ...
It also allows enumerations to be used to index arrays:
var daily_attendance : array [weekday] of integer;
EXAMPLE 7.4 Enumerations as constants
An alternative to enumerations, of course, is simply to declare a collection of
constants:
const sun = 0; mon = 1; tue = 2; wed = 3; thu = 4; fri = 5; sat = 6;
In C, the difference between the two approaches is purely syntactic:
enum weekday {sun, mon, tue, wed, thu, fri, sat};
is essentially equivalent to
typedef int weekday;
const weekday sun = 0, mon = 1, tue = 2,
wed = 3, thu = 4, fri = 5, sat = 6;
EXAMPLE 7.5 Converting to and from enumeration type
In Pascal and most of its descendants, however, the difference between an enumeration and a set of integer constants is much more significant: the enumeration is a
full-fledged type, incompatible with integers. Using an integer or an enumeration
value in a context expecting the other will result in a type clash error at compile
time.
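The contrast between the two designs can be imitated in Python, whose standard enum module offers both flavors. The class names CWeekday and Weekday are ours. Note one important difference from Pascal: Python detects the mismatch when the offending operation executes, not at compile time.

```python
from enum import Enum, IntEnum


class CWeekday(IntEnum):
    """C-like: members are interchangeable with small integers."""
    sun = 0
    mon = 1
    tue = 2
    wed = 3
    thu = 4
    fri = 5
    sat = 6


class Weekday(Enum):
    """Pascal-like: a distinct type, not interchangeable with integers."""
    sun = 0
    mon = 1
    tue = 2
    wed = 3
    thu = 4
    fri = 5
    sat = 6


print(CWeekday.mon == 1)        # members behave as ints
print(CWeekday.mon + 1)         # arithmetic yields a plain int
print(Weekday.mon == 1)         # a different type, so never equal to an int

try:
    Weekday.mon < 2             # ordering against an int is rejected
except TypeError as err:
    print("type clash at run time:", err)
```

With the C-like version, nothing prevents a programmer from passing 17 where a weekday is expected; with the Pascal-like version, the distinct type makes such confusion detectable.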
Values of an enumeration type are typically represented by small integers, usually a consecutive range of small integers starting at zero. In many languages these
ordinal values are semantically significant, because built-in functions can be used
to convert an enumeration value to its ordinal value, and sometimes vice versa.
In Ada, these conversions employ the attributes pos and val : weekday’pos(mon)
= 1 and weekday’val(1) = mon .
EXAMPLE 7.6 Distinguished values for enums
Several languages allow the programmer to specify the ordinal values of enumeration types, if the default assignment is undesirable. In C, C++, and C#, one
could write
enum mips_special_regs {gp = 28, fp = 30, sp = 29, ra = 31};
(The intuition behind these values is explained in Section 5.4.5.) In Ada this declaration would be written
type mips_special_regs is (gp, sp, fp, ra); -- must be sorted
for mips_special_regs use (gp => 28, sp => 29, fp => 30, ra => 31);
EXAMPLE 7.7 Emulating distinguished enum values in Java 5
In recent versions of Java one can obtain a similar effect by giving values an
extra field (here named register ):
enum mips_special_regs { gp(28), fp(30), sp(29), ra(31);
    private final int register;
    mips_special_regs(int r) { register = r; }
    public int reg() { return register; }
}
...
int n = mips_special_regs.fp.reg();
As noted in Section 3.5.2, Pascal and C do not allow the same element name
to be used in more than one enumeration type in the same scope. Java and
C# do, but the programmer must identify elements using fully qualified names:
mips_special_regs.fp . Ada relaxes this requirement by saying that element
names are overloaded; the type prefix can be omitted whenever the compiler can
infer it from context.
Subrange Types
EXAMPLE 7.8 Subranges in Pascal
Like enumerations, subranges were first introduced in Pascal, and are found in
many subsequent languages. A subrange is a type whose values compose a contiguous subset of the values of some discrete base type (also called the parent
type). In Pascal and most of its descendants, one can declare subranges of integers, characters, enumerations, and even other subranges. In Pascal, subranges
look like this:
type test_score = 0..100;
workday = mon..fri;
EXAMPLE 7.9 Subranges in Ada
In Ada one would write
type test_score is new integer range 0..100;
subtype workday is weekday range mon..fri;
The range... portion of the definition in Ada is called a type constraint. In this
example test_score is a derived type, incompatible with integers. The workday
type, on the other hand, is a constrained subtype; workdays and weekdays can be
more or less freely intermixed. The distinction between derived types and subtypes
is a valuable feature of Ada; we will discuss it further in Section 7.2.1.
One could of course use integers to represent test scores, or a weekday to represent a workday . Using an explicit subrange has several advantages. For one thing,
it helps to document the program. A comment could also serve as documentation, but comments have a bad habit of growing out of date as programs change,
or of being omitted in the first place. Because the compiler analyzes a subrange
declaration, it knows the expected range of subrange values, and can generate
code to perform dynamic semantic checks to ensure that no subrange variable is
ever assigned an invalid value. These checks can be valuable debugging tools. In
addition, since the compiler knows the number of values in the subrange, it can
sometimes use fewer bits to represent subrange values than it would need to use
to represent arbitrary integers. In the example above, test_score values can be
stored in a single byte.
EXAMPLE 7.10 Space requirements of subrange type
Most implementations employ the same bit patterns for integers and subranges,
so subranges whose values are large require large storage locations, even if the
number of distinct values is small. The following type, for example,
type water_temperature = 273..373; (* degrees Kelvin *)
would be stored in at least two bytes. While there are only 101 distinct values in
the type, the largest (373) is too large to fit in a single byte in its natural encoding.
(An unsigned byte can hold values in the range 0 . . 255; a signed byte can hold
values in the range −128 . . 127.)
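Both the dynamic check and the storage calculation can be modeled in a few lines of Python. The class Subrange and its methods are hypothetical, written for this example; a real compiler would emit the check inline and choose storage sizes according to its own conventions.

```python
class Subrange:
    """A contiguous range low..high of integers with checked assignment."""

    def __init__(self, low: int, high: int) -> None:
        self.low, self.high = low, high

    def check(self, value: int) -> int:
        """The dynamic semantic check a compiler might emit on assignment."""
        if not self.low <= value <= self.high:
            raise ValueError(f"{value} not in {self.low}..{self.high}")
        return value

    def bytes_natural(self) -> int:
        """Bytes needed when values keep their ordinary integer encoding."""
        lo, hi = self.low, self.high
        size = 1
        while True:
            bits = 8 * size
            fits_unsigned = lo >= 0 and hi < 2 ** bits
            fits_signed = lo >= -(2 ** (bits - 1)) and hi < 2 ** (bits - 1)
            if fits_unsigned or fits_signed:
                return size
            size += 1


test_score = Subrange(0, 100)
water_temperature = Subrange(273, 373)
print(test_score.bytes_natural(), water_temperature.bytes_natural())
```

A call such as test_score.check(101) raises an error, playing the role of the run-time check. An implementation willing to store values as offsets from the lower bound could fit water_temperature in one byte, but that is not the natural encoding the text describes.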
DESIGN & IMPLEMENTATION
Multiple sizes of integers
The space savings possible with (small-valued) subrange types in Pascal and
Ada is achieved in several other languages by providing more than one size of
built-in integer type. C and C++, for example, support integer arithmetic on
signed and unsigned variants of char , short , int , long , and (in C99) long
long types, with monotonically nondecreasing sizes. (More specifically, the C99 standard requires ranges for these types corresponding to lengths of
at least 1, 2, 2, 4, and 8 bytes, respectively. In practice, one finds implementations in which plain
ints are 2, 4, or 8 bytes long, including some in which they are the same size as shorts but shorter
than longs, and some in which they are the same size as longs, and longer than shorts.)
Composite Types
Nonscalar types are usually called composite, or constructed types. They are generally created by applying a type constructor to one or more simpler types. Common
composite types include records (structures), variant records (unions), arrays, sets,
pointers, lists, and files. All but pointers and lists are easily described in terms of
mathematical set operations (pointers and lists can be described mathematically
as well, but the description is less intuitive).
Records (structures) were introduced by Cobol, and have been supported by most
languages since the 1960s. A record consists of a collection of fields, each of
which belongs to a (potentially different) simpler type. Records are akin to
mathematical tuples; a record type corresponds to the Cartesian product of
the types of the fields.
Variant records (unions) differ from “normal” records in that only one of a variant
record’s fields (or collections of fields) is valid at any given time. A variant
record type is the union of its field types, rather than their Cartesian product.
Arrays are the most commonly used composite types. An array can be thought of
as a function that maps members of an index type to members of a component type. Arrays of characters are often referred to as strings, and are often
supported by special-purpose operations not available for other arrays.
Sets, like enumerations and subranges, were introduced by Pascal. A set type is
the mathematical powerset of its base type, which must often be discrete.
A variable of a set type contains a collection of distinct elements of the base
type.
Pointers are l-values. A pointer value is a reference to an object of the pointer’s
base type. Pointers are often but not always implemented as addresses. They
are most often used to implement recursive data types. A type T is recursive
if an object of type T may contain one or more references to other objects of
type T .
Lists, like arrays, contain a sequence of elements, but there is no notion of mapping
or indexing. Rather, a list is defined recursively as either an empty list or a
pair consisting of a head element and a reference to a sublist. While the length
of an array must be specified at elaboration time in most (though not all)
languages, lists are always of variable length. To find a given element of a
list, a program must examine all previous elements, recursively or iteratively,
starting at the head. Because of their recursive definition, lists are fundamental
to programming in most functional languages.
Files are intended to represent data on mass-storage devices, outside the memory
in which other program objects reside. Like arrays, most files can be conceptualized as a function that maps members of an index type (generally
integer) to members of a component type. Unlike arrays, files usually have
a notion of current position, which allows the index to be implied
in consecutive operations. Files often display idiosyncrasies inherited from
physical input/output devices. In particular, the elements of some files must
be accessed in sequential order.
We will examine composite types in more detail in Sections 7.3 through 7.9.
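The set-theoretic descriptions above can be checked by brute force on tiny base types. In the Python sketch below all names are ours; we tag the members of the variant record so that the union is disjoint, which is an assumption beyond what the text states.

```python
from itertools import chain, combinations, product

hue = {"red", "green", "blue"}          # an enumeration, viewed as a set
flag = {False, True}

# Record: Cartesian product of the field types.
record_values = set(product(hue, flag))

# Variant record: (tagged) union of the field types.
variant_values = {("hue", h) for h in hue} | {("flag", f) for f in flag}

# Set type: powerset of the base type.
set_values = set(
    frozenset(c)
    for c in chain.from_iterable(combinations(hue, n)
                                 for n in range(len(hue) + 1))
)

# Array: a function from the index type to the component type.
array_values = [dict(zip(sorted(flag), image))
                for image in product(hue, repeat=len(flag))]

print(len(record_values), len(variant_values),
      len(set_values), len(array_values))
```

With 3 hues and 2 flags, the counts are the product (6), the sum (5), two to the power of the base type's size (8), and the component count raised to the index count (9).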
7.1.5 Orthogonality
EXAMPLE 7.11 Void (empty) type
In Section 6.1.2 we discussed the importance of orthogonality in the design of
expressions, statements, and control-flow constructs. In a highly orthogonal language, these features can be used, with consistent behavior, in almost any combination. Orthogonality is equally important in type system design. A highly
orthogonal language tends to be easier to understand, to use, and to reason about
in a formal way. We have noted that languages like Algol 68 and C enhance orthogonality by eliminating (or at least blurring) the distinction between statements
and expressions. To characterize a statement that is executed for its side effect(s),
and that has no useful values, some languages provide an “empty” type. In C and
Algol 68, for example, a subroutine that is meant to be used as a procedure is
generally declared with a “return” type of void . In ML, the empty type is called
unit . If the programmer wishes to call a subroutine that does return a value, but
the value is not needed in this particular case (all that matters is the side effect[s]),
then the return value in C can be discarded by “casting” it to void (casts will be
discussed in Section 7.2.1):
foo_index = insert_in_symbol_table(foo);
...
(void) insert_in_symbol_table(bar); /* don’t care where it went */
/* cast is optional; implied if omitted */
EXAMPLE 7.12 Making do without void
In a language (e.g., Pascal) without an empty type, the latter of these two calls
would need to use a dummy variable:
var dummy : symbol_table_index;
...
dummy := insert_in_symbol_table(bar);
The type systems of C and Pascal are more orthogonal than that of (pre-Fortran 90) Fortran. Where array elements in traditional Fortran were always of
scalar type, C and Pascal allow arbitrary types. Where array indices were always
integers (and still are in C), Pascal allows any discrete type. Where function return
values were always scalars (and still are in Pascal), C allows structures and pointers
to functions. At the same time, both C and Pascal retain significant nonorthogonality. As in traditional Fortran, Pascal requires the bounds of each array to be specified at compile time, except when the array is a formal parameter of a subroutine.
C requires a lower bound of zero on all array indices. Pascal has only second-class
functions. A much more uniformly orthogonal type system can be found in ML;
we consider it in Section 7.2.4.
One particularly useful aspect of type orthogonality is the ability to specify
literal values of arbitrary composite types. Composite literals are sometimes
known as aggregates. They are particularly valuable for the initialization of static
data structures; without them, a program may need to waste time performing
initialization at run time.
EXAMPLE 7.13 Aggregates in Ada
Ada provides aggregates for all its structured types. Given the following declarations
type person is record
name : string (1..10);
age : integer;
end record;
p, q : person;
A, B : array (1..10) of integer;
we can write the following assignments.
p := ("Jane Doe  ", 37);
q := (age => 36, name => "John Doe  ");
A := (1, 0, 3, 0, 3, 0, 3, 0, 0, 0);
B := (1 => 1, 3 | 5 | 7 => 3, others => 0);
Here the aggregates assigned into p and A are positional; the aggregates assigned
into q and B name their elements explicitly. The aggregate for B uses a shorthand
notation to assign the same value ( 3 ) into array elements 3 , 5 , and 7 , and to
assign a 0 into all unnamed fields. Several languages, including C, Fortran 90, and
Lisp, provide similar capabilities. ML provides a very general facility for composite
expressions, based on the use of constructors (discussed in Section 7.2.4).
CHECK YOUR UNDERSTANDING
1. What purpose(s) do types serve in a programming language?
2. What does it mean for a language to be strongly typed? Statically typed? What
prevents, say, C from being strongly typed?
3. Name two important programming languages that are strongly but dynamically typed.
4. What is a type clash?
5. Discuss the differences among the denotational, constructive, and abstraction-based views of types.
6. What is the difference between discrete and scalar types?
7. Give two examples of languages that lack a Boolean type. What do they use
instead?
8. In what ways may an enumeration type be preferable to a collection of named
constants? In what ways may a subrange type be preferable to its base type? In
what ways may a string be preferable to an array of characters?
9. What does it mean for a set of language features (e.g., a type system) to be
orthogonal?
10. What are aggregates?
7.2 Type Checking
In most statically typed languages, every definition of an object (constant, variable,
subroutine, etc.) must specify the object’s type. Moreover, many of the contexts in
which an object might appear are also typed, in the sense that the rules of the language constrain the types that an object in that context may validly possess. In the
subsections below we will consider the topics of type equivalence, type compatibility, and type inference. Of the three, type compatibility is the one of most concern
to programmers. It determines when an object of a certain type can be used in
a certain context. At a minimum, the object can be used if its type and the type
expected by the context are equivalent (i.e., the same). In many languages, however, compatibility is a looser relationship than equivalence: objects and contexts
are often compatible even when their types are different. Our discussion of type
compatibility will touch on the subjects of type conversion (also called casting),
which changes a value of one type into a value of another; type coercion, which
performs a conversion automatically in certain contexts; and nonconverting type
casts, which are sometimes used in systems programming to interpret the bits of
a value of one type as if they represented a value of some other type.
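The difference among these three notions can be previewed with a short Python sketch. It uses the standard struct module to reinterpret the bits of a single-precision value; the variable names are ours, and the choice of a 32-bit little-endian layout is an assumption made for the example.

```python
import struct

x = 3.75

# Type conversion: produce an int with (nearly) the same value; this truncates.
converted = int(x)

# Nonconverting cast: reinterpret the same 32 bits as an unsigned integer.
raw = struct.pack("<f", x)                  # IEEE 754 single precision
reinterpreted = struct.unpack("<I", raw)[0]

print(converted, hex(reinterpreted))

# Coercion: the language converts automatically in a mixed-type context.
print(1 + 2.5)
```

The conversion yields 3, a value close in meaning to 3.75, whereas the nonconverting cast yields the integer whose bit pattern happens to match the floating-point encoding.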
Whenever an expression is constructed from simpler subexpressions, the question arises: given the types of the subexpressions (and possibly the type expected
by the surrounding context), what is the type of the expression as a whole? This
question is answered by type inference. Type inference is often trivial: the sum of
two integers is still an integer, for example. In other cases (e.g., when dealing with
sets) it is a good bit trickier. Type inference plays a particularly important role in
ML, Miranda, and Haskell, in which all type information is inferred.
7.2.1 Type Equivalence
In a language in which the user can define new types, there are two principal ways
of defining type equivalence. Structural equivalence is based on the content of type
definitions: roughly speaking, two types are the same if they consist of the same
components, put together in the same way. Name equivalence is based on the lexical
occurrence of type definitions: roughly speaking, each definition introduces a new
type. Structural equivalence is used in Algol 68, Modula-3, and (with various
wrinkles) C and ML. Name equivalence is the more popular approach in recent
languages. It is used in Java, C#, standard Pascal, and most Pascal descendants,
including Ada.
EXAMPLE 7.14 Trivial differences in type
The exact definition of structural equivalence varies from one language to
another. It requires that one decide which potential differences between types are
important, and which may be considered unimportant. Most people would probably agree that the format of a declaration should not matter: otherwise identical
declarations that differ only in spacing or line breaks should still be considered
equivalent. Likewise, in a Pascal-like language with structural equivalence,
type R2 = record
a, b : integer
end;
should probably be considered the same as
type R3 = record
a : integer;
b : integer
end;
But what about
type R4 = record
b : integer;
a : integer
end;
Should the reversal of the order of the fields change the type? ML says no; most
languages say yes.
EXAMPLE 7.15 Other minor differences in type
In a similar vein, consider the following arrays, again in a Pascal-like notation:
type str = array [1..10] of char;
type str = array [0..9] of char;
Here the length of the array is the same in both cases, but the index values are
different. Should these be considered equivalent? Most languages say no, but some
(including Fortran and Ada) consider them compatible.
EXAMPLE 7.16 The problem with structural equivalence
To determine if two types are structurally equivalent, a compiler can expand
their definitions by replacing any embedded type names with their respective definitions, recursively, until nothing is left but a long string of type constructors, field
names, and built-in types. If these expanded strings are the same, then the types are
equivalent, and conversely. Recursive and pointer-based types complicate matters,
since their expansion does not terminate, but the problem is not insurmountable;
we consider a solution in Exercise 7.19.
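The expansion strategy just described can be sketched in C++. This is a minimal illustration under stated assumptions, not a description of any real compiler: the TypeDef representation, the TypeTable, and the functions expand and structurally_equivalent are hypothetical names introduced here, only built-in and record types are modeled, and recursive types are assumed to be absent (their expansion would not terminate).

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified type representation (not from the text):
// a definition is either a built-in type or a record whose fields
// refer to their types by name.
struct TypeDef {
    bool is_builtin;
    std::vector<std::pair<std::string, std::string>> fields;  // (field name, type name)
};

using TypeTable = std::map<std::string, TypeDef>;

// Expand a type name into a canonical string by replacing embedded type
// names with their definitions, recursively. Assumes every name is in
// the table and that no type is recursive.
std::string expand(const TypeTable& table, const std::string& name) {
    const TypeDef& def = table.at(name);
    if (def.is_builtin) return name;
    std::string out = "record{";
    for (const auto& field : def.fields) {
        out += field.first + ":" + expand(table, field.second) + ";";
    }
    return out + "}";
}

// Two types are structurally equivalent if their expansions are identical.
bool structurally_equivalent(const TypeTable& table,
                             const std::string& a, const std::string& b) {
    return expand(table, a) == expand(table, b);
}

// Example table mirroring R2 and R3 (same fields, same order) and R4
// (fields reversed) from the text.
TypeTable example_table() {
    TypeTable t;
    t["integer"] = TypeDef{true, {}};
    t["R2"] = TypeDef{false, {{"a", "integer"}, {"b", "integer"}}};
    t["R3"] = TypeDef{false, {{"a", "integer"}, {"b", "integer"}}};
    t["R4"] = TypeDef{false, {{"b", "integer"}, {"a", "integer"}}};
    return t;
}
```

With the table returned by example_table, R2 and R3 expand to the same string and are therefore equivalent, while R4 is not, because this sketch treats field order as significant. A language that ignores field order could sort the fields by name before expanding.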
Structural equivalence is a straightforward but somewhat low-level, implementation-oriented way to think about types. Its principal problem is an inability
to distinguish between types that the programmer may think of as distinct, but
which happen by coincidence to have the same internal structure:
7.2 Type Checking
1. type student = record
2.     name, address : string
3.     age : integer
4. type school = record
5.     name, address : string
6.     age : integer
7. x : student;
8. y : school;
9. ...
10. x := y;        -- is this an error?
Most programmers would probably want to be informed if they accidentally
assigned a value of type school into a variable of type student , but a compiler
whose type checking is based on structural equivalence will blithely accept such
an assignment.
Name equivalence is based on the assumption that if the programmer goes
to the effort of writing two type definitions, then those definitions are probably
meant to represent different types. In the example above, variables x and y will
be considered to have different types under name equivalence: x uses the type
declared at line 1; y uses the type declared at line 4.
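C++ struct declarations behave in this way, and can serve as an illustration. The following sketch declares two structs with identical fields and uses compile-time assertions to confirm that the compiler treats them as different types. The helper student_from_school is a hypothetical name introduced here to show that crossing between the two types requires an explicit, programmer-written step.

```cpp
#include <string>
#include <type_traits>

// Two record types with identical structure, as in the example above.
struct student {
    std::string name, address;
    int age;
};

struct school {
    std::string name, address;
    int age;
};

// Separately declared structs are distinct types (name equivalence),
// even though their fields match exactly.
static_assert(!std::is_same<student, school>::value,
              "student and school are different types");
static_assert(!std::is_assignable<student&, school>::value,
              "x = y is rejected at compile time");

// Hypothetical helper: moving data from one type to the other requires
// an explicit conversion written by the programmer.
student student_from_school(const school& y) {
    return student{y.name, y.address, y.age};
}
```

Under structural equivalence the second assertion would fail, since an assignment of a school to a student would be accepted.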
Variants of Name Equivalence
EXAMPLE 7.17 Alias types
One subtlety in the use of name equivalence arises in the simplest of type declarations:
TYPE new_type = old_type;    (* Modula-2 *)
Here new_type is said to be an alias for old_type. Should we treat them as two
names for the same type, or as names for two different types that happen to have
the same internal structure? The “right” approach may vary from one program to
another.
EXAMPLE 7.18 Semantically equivalent alias types
In Example 3.13 we considered a module that needs to import a type name:
TYPE stack_element = INTEGER;    (* alias *)
MODULE stack;
IMPORT stack_element;
EXPORT push, pop;
...
PROCEDURE push(elem : stack_element);
...
PROCEDURE pop() : stack_element;
...
Here stack is meant to serve as an abstraction that allows the programmer, via
textual inclusion, to create a stack of any desired type (in this case INTEGER ). This
code depends on alias types being considered equivalent; if they were not, the
programmer would have to replace stack_element with INTEGER everywhere it
occurs.3
EXAMPLE 7.19 Semantically distinct alias types
Unfortunately, there are other times when aliased types should probably not
be the same:
TYPE celsius_temp = REAL;
fahrenheit_temp = REAL;
VAR c : celsius_temp;
f : fahrenheit_temp;
...
f := c;    (* this should probably be an error *)
EXAMPLE 7.20 Derived types and subtypes in Ada
A language in which aliased types are considered distinct is said to have strict
name equivalence. A language in which aliased types are considered equivalent
is said to have loose name equivalence. Most Pascal-family languages (including
Modula-2) use loose name equivalence. Ada achieves the best of both worlds by
allowing the programmer to indicate whether an alias represents a derived type
or a subtype. A subtype is compatible with its base (parent) type; a derived type
is incompatible. (Subtypes of the same base type are also compatible with each
other.) Our examples above would be written:
subtype stack_element is integer;
...
type celsius_temp is new integer;
type fahrenheit_temp is new integer;
EXAMPLE 7.21 Name vs structural equivalence
One way to think about the difference between strict and loose name equivalence is to remember the distinction between declarations and definitions
(Section 3.3.3). Under strict name equivalence, a declaration type A = B is considered a definition. Under loose name equivalence it is merely a declaration; A
shares the definition of B .
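The two treatments of aliases can be approximated in C++, which offers a rough analogue of each. In the sketch below, a using declaration plays the role of a loose alias, while one-field wrapper structs give roughly the effect of strict name equivalence (similar in spirit to Ada's derived types). The wrapper structs and the to_fahrenheit function are hypothetical names introduced for illustration, and double is used for the temperatures as an assumption.

```cpp
#include <type_traits>

// A C++ alias behaves like loose name equivalence: stack_element is
// just another name for int.
using stack_element = int;
static_assert(std::is_same<stack_element, int>::value,
              "an alias names the same type");

// Hypothetical wrapper structs give the effect of strict name
// equivalence: the two temperature types share a representation but
// cannot be mixed by accident.
struct celsius_temp { double degrees; };
struct fahrenheit_temp { double degrees; };
static_assert(!std::is_assignable<fahrenheit_temp&, celsius_temp>::value,
              "f = c is a compile-time error");

// Mixing the two requires an explicit, named conversion.
fahrenheit_temp to_fahrenheit(celsius_temp c) {
    return fahrenheit_temp{c.degrees * 9.0 / 5.0 + 32.0};
}
```

The wrapper approach costs some notational convenience, since arithmetic on the wrapped value must go through the degrees field, but it lets the compiler catch an accidental f = c.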
Consider the following example:
1. type cell = ...              -- whatever
2. type alink = pointer to cell
3. type blink = alink
4. p, q : pointer to cell
5. r : alink
6. s : blink
7. t : pointer to cell
8. u : alink
3 One might argue here that the generics of more modern languages (Section 8.4) are a better way
to build abstractions, but there are many single-use cases where generics would be overkill, and
yet alias types still yield a more readable program.
Here the declaration at line 3 is an alias; it defines blink to be “the same as” alink .
Under strict name equivalence, line 3 is both a declaration and a definition, and
blink is a new type, distinct from alink . Under loose name equivalence, line 3 is
just a declaration; it uses the definition at line 2.
Under strict name equivalence, p and q have the same type, because they both
use the anonymous (unnamed) type definition on the right-hand side of line 4,
and r and u have the same type, because they both use the definition at line 2.
Under loose name equivalence, r , s , and u all have the same type, as do p and q .
Under structural equivalence, all six of the variables shown have the same type,
namely pointer to whatever cell is.
Both structural and name equivalence can be tricky to implement in the presence of separate compilation. We will return to this issue in Section 14.6.
Type Conversion and Casts
EXAMPLE 7.22 Contexts that expect a given type
In a language with static typing, there are many contexts in which values of a
specific type are expected. In the statement
a := expression
we expect the right-hand side to have the same type as a . In the expression
a+b
the overloaded + symbol designates either integer or floating-point addition; we
therefore expect either that a and b will both be integers, or that they will both be
reals. In a call to a subroutine,
foo(arg1, arg2, . . . , argN)
we expect the types of the arguments to match those of the formal parameters, as
declared in the subroutine’s header.
Suppose for the moment that we require in each of these cases that the types
(expected and provided) be exactly the same. Then if the programmer wishes to
use a value of one type in a context that expects another, he or she will need to
specify an explicit type conversion (also sometimes called a type cast ). Depending
on the types involved, the conversion may or may not require code to be executed
at run time. There are three principal cases:
1. The types would be considered structurally equivalent, but the language uses
name equivalence. In this case the types employ the same low-level representation, and have the same set of values. The conversion is therefore a purely
conceptual operation; no code will need to be executed at run time.
2. The types have different sets of values, but the intersecting values are represented in the same way. One type may be a subrange of the other, for example,
or one may consist of two’s complement signed integers, while the other is
unsigned. If the provided type has some values that the expected type does
not, then code must be executed at run time to ensure that the current value
is among those that are valid in the expected type. If the check fails, then
a dynamic semantic error results. If the check succeeds, then the underlying
representation of the value can be used, unchanged. Some language implementations may allow the check to be disabled, resulting in faster but potentially
unsafe code.
3. The types have different low-level representations, but we can nonetheless
define some sort of correspondence among their values. A 32-bit integer, for
example, can be converted to a double-precision IEEE floating-point number
with no loss of precision. Most processors provide a machine instruction to
effect this conversion. A floating-point number can be converted to an integer
by rounding or truncating, but fractional digits will be lost, and the conversion
will overflow for many exponent values. Again, most processors provide a
machine instruction to effect this conversion. Conversions between different
lengths of integers can be effected by discarding or sign-extending high-order
bytes.
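The second and third cases can be sketched in C++ (the first case requires no code at all). The sketch makes several assumptions that are not in the text: test_score is modeled as a hypothetical wrapper for the range 0..100, double is assumed to be IEEE double-precision, and the function names are invented for illustration. Because C++ itself performs no range or overflow checks on conversions, the checks are written by hand.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical subrange type standing in for a test score in 0..100.
struct test_score { int value; };

// Case 2: same representation, but a run-time check is needed because
// not every int is a valid test_score.
test_score to_test_score(int n) {
    if (n < 0 || n > 100) throw std::out_of_range("not a valid test_score");
    return test_score{n};
}

// Case 3, widening: every 32-bit integer is exactly representable as an
// IEEE double, so no check is needed, only a change of representation.
double to_real(std::int32_t n) {
    return static_cast<double>(n);
}

// Case 3, narrowing: fractional digits are discarded (truncation toward
// zero), and values outside the 32-bit range are reported as an error.
std::int32_t to_integer(double r) {
    if (!(r > -2147483649.0 && r < 2147483648.0)) {   // also rejects NaN
        throw std::overflow_error("value does not fit in 32 bits");
    }
    return static_cast<std::int32_t>(r);
}
```

Calling to_integer(3.99) yields 3 without complaint, since loss of fractional digits is not considered an error, whereas to_integer(1e10) throws.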
EXAMPLE 7.23 Type conversions in Ada
We can illustrate these options with the following examples of type conversions
in Ada:
n : integer;       -- assume 32 bits
r : real;          -- assume IEEE double-precision
t : test_score;    -- as in Example 7.9
c : celsius_temp;  -- as in Example 7.20
...
t := test_score(n);    -- run-time semantic check required
n := integer(t);       -- no check req.; every test_score is an int
r := real(n);          -- requires run-time conversion
n := integer(r);       -- requires run-time conversion and check
n := integer(c);       -- no run-time code required
c := celsius_temp(n);  -- no run-time code required
In each of the six assignments, the name of a type is used as a pseudofunction
that performs a type conversion. The first conversion requires a run-time check
to ensure that the value of n is within the bounds of a test_score . The second
conversion requires no code, since every possible value of t is acceptable for n . The
third and fourth conversions require code to change the low-level representation
of values. The fourth conversion also requires a semantic check. It is generally
understood that converting from a floating-point value to an integer results in
the loss of fractional digits; this loss is not an error. If the conversion results in
integer overflow, however, an error needs to result. The final two conversions
require no run-time code; the integer and celsius_temp types (at least as
we have defined them) have the same sets of values and the same underlying
representation. A purist might say that celsius_temp should be defined as new
integer range -273..integer’last , in which case a run-time semantic check
would be required on the final conversion.
EXAMPLE 7.24 Type conversions in C
A type conversion in C (what C calls a type cast) is specified by using the name
of the desired type, in parentheses, as a prefix operator:
r = (float) n;   /* generates code for run-time conversion */
n = (int) r;     /* also run-time conversion, with no overflow check */
C and its descendants do not by default perform run-time checks for arithmetic
overflow on any operation, though such checks can be enabled if desired in C#.
Nonconverting Type Casts
Occasionally, particularly in systems programs, one
needs to change the type of a value without changing the underlying implementation; in other words, to interpret the bits of a value of one type as if they were
another type. One common example occurs in memory allocation algorithms,
which use a large array of bytes to represent a heap, and then reinterpret portions of that array as pointers and integers (for bookkeeping purposes), or as
various user-allocated data structures. Another common example occurs in high-performance numeric software, which may need to reinterpret a floating-point
number as an integer or a record, in order to extract the exponent, significand,
and sign fields. These fields can be used to implement special-purpose algorithms
for square root, trigonometric functions, and so on.
EXAMPLE 7.25 Unchecked conversions in Ada
A change of type that does not alter the underlying bits is called a nonconverting
type cast, or sometimes a type pun. It should not be confused with use of the term
cast for conversions in languages like C. In Ada, nonconverting casts can be effected
using instances of a built-in generic subroutine called unchecked_conversion:
-- assume ’float’ has been declared to match IEEE single-precision
function cast_float_to_int is
new unchecked_conversion(float, integer);
function cast_int_to_float is
new unchecked_conversion(integer, float);
...
f := cast_int_to_float(n);
n := cast_float_to_int(f);
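As a C++ counterpart to the unchecked conversions above, the following sketch reinterprets the bits of a floating-point number in order to extract its sign, exponent, and significand fields, the numeric-software use mentioned earlier. It assumes that float is a 32-bit IEEE single-precision value, which is common but not guaranteed; the names float_fields, bits_of, and decompose are hypothetical. The bytes are copied with std::memcpy, one of several ways to express a nonconverting cast.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// The three fields of an IEEE 754 single-precision number.
struct float_fields {
    std::uint32_t sign;         // 1 bit
    std::uint32_t exponent;     // 8 bits, biased by 127
    std::uint32_t significand;  // 23 bits, without the implicit leading 1
};

// Nonconverting cast: view the bits of a float as an unsigned integer.
// No numeric conversion takes place; the bytes are copied unchanged.
std::uint32_t bits_of(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Split the bit pattern into its sign, exponent, and significand.
float_fields decompose(float f) {
    std::uint32_t bits = bits_of(f);
    return float_fields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

Under the stated assumption, 1.0 has the bit pattern 0x3F800000, so decompose(1.0f) reports a sign of 0, a biased exponent of 127, and a significand of 0.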
EXAMPLE 7.26 Conversions and nonconverting casts in C++
C++ inherits the casting mechanism of C, but also provides a family of semantically cleaner alternatives. Specifically, static_cast performs a type conversion, reinterpret_cast performs a nonconverting type cast, and dynamic_cast
allows programs that manipulate pointers of polymorphic types to perform
assignments whose validity cannot be guaranteed statically, but can be checked at
run time (more on this in Chapter 9). Syntax for each of these is that of a generic
function:
double d = ...
int n = static_cast<int>(d);
There is also a const_cast that can be used to remove read-only qualification.
C-style type casts in C++ are defined in terms of const_cast , static_cast ,
and reinterpret_cast ; the precise behavior depends on the source and target
types.
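The four cast operators can be compared side by side in a short sketch. The shape, circle, and square classes and the four function names are hypothetical, introduced only to give each cast something to operate on.

```cpp
#include <cstdint>

struct shape { virtual ~shape() {} };
struct circle : shape { double radius = 1.0; };
struct square : shape { double side = 2.0; };

// static_cast: an ordinary type conversion (here, truncation toward zero).
int to_int(double d) {
    return static_cast<int>(d);
}

// dynamic_cast: a downcast checked at run time; yields a null pointer
// if the object is not actually a circle.
const circle* as_circle(const shape* s) {
    return dynamic_cast<const circle*>(s);
}

// const_cast: removes read-only qualification. Writing through the
// result is valid only if the underlying object was not declared const.
void reset(const int& n) {
    const_cast<int&>(n) = 0;
}

// reinterpret_cast: a nonconverting cast; the pointer's bits are viewed
// as an integer, with an implementation-defined result.
std::uintptr_t address_bits(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}
```

Only dynamic_cast involves a run-time check; the other three are resolved entirely at compile time, although static_cast may still generate conversion code.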
Any nonconverting type cast constitutes a dangerous subversion of the language’s type system. In a language with a weak type system such subversions can
be difficult to find. In a language with a strong type system, the use of explicit nonconverting type casts at least labels the dangerous points in the code, facilitating
debugging if problems arise.
7.2.2 Type Compatibility
Most languages do not require equivalence of types in every context. Instead,
they merely say that a value’s type must be compatible with that of the context
in which it appears. In an assignment statement, the type of the right-hand side
must be compatible with that of the left-hand side. The types of the operands
of + must both be compatible with some common type that supports addition
(integers, real numbers, or perhaps strings or sets). In a subroutine call, the types
of any arguments passed into the subroutine must be compatible with the types
of the corresponding formal parameters, and the types of any formal parameters
passed back to the caller must be compatible with the types of the corresponding
arguments.
DESIGN & IMPLEMENTATION
Nonconverting casts
C programmers sometimes attempt a nonconverting type cast (type pun) by
taking the address of an object, converting the type of the resulting pointer,
and then dereferencing:
r = *((float *) &n);
This arcane bit of hackery usually incurs no run-time cost, because most (but
not all!) implementations use the same representation for pointers to integers
and pointers to floating-point values, namely an address. The ampersand
operator (&) means “address of,” or “pointer to.” The parenthesized (float *)
is the type name for “pointer to float” (float is a built-in floating-point type).
The prefix * operator is a pointer dereference. The overall construct causes the
compiler to interpret the bits of n as if it were a float. The reinterpretation
will succeed only if n is an l-value (has an address), and ints and floats have
the same size (again, this second condition is often but not always true in C). If
n does not have an address then the compiler will announce a static semantic
error. If int and float do not occupy the same number of bytes, then the
effect of the cast may depend on a variety of factors, including the relative size
of the objects, the alignment and “endian-ness” of memory (Section 5.2),
and the choices the compiler has made regarding what to place in adjacent
locations in memory. Safer and more portable nonconverting casts can be
achieved in C by means of unions (variant records); we consider this option in
Exercise 7.30.
The definition of type compatibility varies greatly from language to language.
Ada takes a relatively restrictive approach: an Ada type S is compatible with an
expected type T if and only if (1) S and T are equivalent, (2) one is a subtype of
the other (or both are subtypes of the same base type), or (3) both are arrays, with
the same numbers and types of elements in each dimension. Pascal is only slightly
more lenient: in addition to allowing the intermixing of base and subrange types,
it allows an integer to be used in a context where a real is expected.
Coercion
EXAMPLE 7.27 Coercion in Ada
Whenever a language allows a value of one type to be used in a context that expects
another, the language implementation must perform an automatic, implicit conversion to the expected type. This conversion is called a type coercion. Like the
explicit conversions discussed above, a coercion may require run-time code to
perform a dynamic semantic check, or to convert between low-level representations. Ada coercions sometimes need the former, though never the latter:
d : weekday;                           -- as in Example 7.3
k : workday;                           -- as in Example 7.9
type calendar_column is new weekday;
c : calendar_column;
...
k := d;    -- run-time check required
d := k;    -- no check required; every workday is a weekday
c := d;    -- static semantic error;
           -- weekdays and calendar_columns are not compatible
To perform this third assignment in Ada we would have to use an explicit conversion:
c := calendar_column(d);
EXAMPLE 7.28 Coercion in C
As we noted in Section 3.5.3, coercions are a controversial subject in language
design. Because they allow types to be mixed without an explicit indication of
intent on the part of the programmer, they represent a significant weakening
of type security. C, which has a relatively weak type system, performs quite a bit of
coercion. It allows values of most numeric types to be intermixed in expressions,
and will coerce types back and forth “as necessary.” Here are some examples:
short int s;
unsigned long int l;
char c;      /* may be signed or unsigned -- implementation-dependent */
float f;     /* usually IEEE single-precision */
double d;    /* usually IEEE double-precision */
...
s = l;   /* l's low-order bits are interpreted as a signed number. */
l = s;   /* s is sign-extended to the longer length, then
            its bits are interpreted as an unsigned number. */
s = c;   /* c is either sign-extended or zero-extended to s's length;
            the result is then interpreted as a signed number. */
f = l;   /* l is converted to floating-point. Since f has fewer
            significant bits, some precision may be lost. */
d = f;   /* f is converted to the longer format; no precision lost. */
f = d;   /* d is converted to the shorter format; precision may be lost.
            If d's value cannot be represented in single-precision, the
            result is undefined, but NOT a dynamic semantic error. */
Fortran 90 allows arrays and records to be intermixed if their types have the
same shape. Two arrays have the same shape if they have the same number of
dimensions, each dimension has the same size (i.e., the same number of elements),
and the individual elements have the same shape. (In some other languages, the
actual bounds of each dimension must be the same for the shapes to be considered
the same.) Two records have the same shape if they have the same number of fields,
and corresponding fields, in order, have the same shape. Field names do not matter,
nor do the actual high and low bounds of array dimensions.
Ada’s compatibility rules for arrays are roughly equivalent to those of Fortran 90. C provides no operations that take an entire array as an operand. C does,
however, allow arrays and pointers to be intermixed in many cases; we will discuss this unusual form of type compatibility further in Section 7.7.1. Neither Ada
nor C allows records (structures) to be intermixed unless their types are name
equivalent.
In general, modern compiled languages display a trend toward static typing
and away from type coercion. Some language designers have argued, however,
that coercions are a natural way in which to support abstraction and program
extensibility, by making it easier to use new types in conjunction with existing
ones. This ease-of-programming argument is particularly important for scripting languages (Chapter 13). Among more traditional languages, C++ provides
an extremely rich, programmer-extensible set of coercion rules. When defining a
new type (a class in C++), the programmer can define coercion operations to
convert values of the new type to and from existing types. These rules interact
in complicated ways with the rules for resolving overloading (Section 3.5.2); they
add significant flexibility to the language, but are one of the most difficult C++
features to understand and use correctly.
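A small sketch may make programmer-defined coercions concrete. The rational class below is hypothetical: its one-argument constructor defines a coercion from int, and its conversion operator defines a coercion to double. The function names half_of and half_of_three are likewise invented for illustration.

```cpp
// Hypothetical user-defined type: a rational number num/den.
class rational {
public:
    // A non-explicit one-argument constructor defines a coercion
    // from int to rational.
    rational(int n) : num_(n), den_(1) {}
    rational(int n, int d) : num_(n), den_(d) {}

    // A conversion operator defines a coercion from rational to double.
    operator double() const { return static_cast<double>(num_) / den_; }

    int num() const { return num_; }
    int den() const { return den_; }

private:
    int num_, den_;
};

rational half_of(rational r) {
    return rational(r.num(), r.den() * 2);
}

// The int 3 is coerced to rational for the call, and the rational
// result is then coerced to double for the return.
double half_of_three() {
    return half_of(3);
}
```

Neither coercion is visible at the point of use, which is both the convenience and the hazard. If an operator+ taking two rationals were added to this class, an expression such as r + 1 would become ambiguous, because the compiler could either coerce 1 to rational or coerce r to double; this is one instance of the interaction with overload resolution noted above.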
Overloading and Coercion
EXAMPLE 7.29 Coercion vs overloading of addends
We have noted (in Section 3.5.3) that overloading and coercion (as well as various forms of polymorphism) can sometimes be used to similar effect. It is worth
repeating some of the distinctions here. An overloaded name can refer to more
than one object; the ambiguity must be resolved by context. Consider the addition
of numeric quantities. In the expression a + b , + may refer to either the integer
or the floating-point addition operation. In a language without coercion, a and b
must either both be integer or both be real; the compiler chooses the appropriate
interpretation of + depending on their type. In a language with coercion, + refers
to the floating-point addition operation if either a or b is real; otherwise it refers to
the integer addition operation. If only one of a and b is real, the other is coerced to
match. One could imagine a language in which + was not overloaded, but rather
referred to floating-point addition in all cases. Coercion could still allow + to
take integer arguments, but they would always be converted to real. The problem
with this approach is that conversions from integer to floating-point format take
a non-negligible amount of time, especially on machines without hardware conversion instructions, and floating-point addition is significantly more expensive
than integer addition.
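C++ can be used to observe both mechanisms at work. In the sketch below, compile-time assertions show which interpretation of the built-in + is selected, and two hypothetical user-level functions, add and add_real, mirror the overloaded and non-overloaded designs discussed above.

```cpp
#include <type_traits>

// The built-in + is overloaded: the operation chosen depends on the
// operand types, and a mixed expression coerces the int operand.
static_assert(std::is_same<decltype(1 + 2), int>::value,
              "int + int selects integer addition");
static_assert(std::is_same<decltype(1 + 2.5), double>::value,
              "int + double coerces the int and selects floating-point addition");

// A hypothetical user-level analogue with one version per operand type.
int    add(int a, int b)       { return a + b; }
double add(double a, double b) { return a + b; }

// A hypothetical non-overloaded alternative: only floating-point
// addition exists, so integer arguments are always coerced to double.
double add_real(double a, double b) { return a + b; }
```

For the user-defined add, a mixed call such as add(1, 2.5) is rejected as ambiguous, because each version would require one coercion and neither is a better match. The built-in operator avoids this by fixing the rule that the integer operand is the one converted.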
In most languages literal constants (e.g., numbers, character strings, the empty
set [ [ ] ] or the null pointer [ nil ]) can be intermixed in expressions with values of
many types. One might say that constants are overloaded: nil for example might
be thought of as referring to the null pointer value for whatever type is needed in
the surrounding context. More commonly, however, constants are simply treated
as a special case in the language’s type-checking rules. Internally, the compiler
considers a constant to have one of a small number of built-in “constant types”
(int const, real const, string, null), which it then coerces to some more appropriate
type as necessary, even if coercions are not supported elsewhere in the language.
Ada formalizes this notion of “constant type” for numeric quantities: an integer
constant (one without a decimal point) is said to have type universal_integer ;
a floating-point constant (one with an embedded decimal point and/or an exponent) is said to have type universal_real . The universal_integer type is
compatible with any type derived from integer ; universal_real is compatible
with any type derived from real .
Universal Reference Types
For systems programming, or to facilitate the writing of general-purpose container
(collection) objects (lists, stacks, queues, sets, etc.) that hold references to other
objects, several languages provide a universal reference type. In C and C++, this
type is called void * . In Clu it is called any ; in Modula-2, address ; in Modula-3,
refany ; in Java, Object ; in C#, object . Arbitrary l-values can be assigned into an
object of universal reference type, with no concern about type safety: because the
type of the object referred to by a universal reference is unknown, the compiler will
not allow any operations to be performed on that object. Assignments back into
objects of a particular reference type (e.g., a pointer to a programmer-specified
record type) are a bit trickier, if type safety is to be maintained. We would not want
a universal reference to a floating-point number, for example, to be assigned into
a variable that is supposed to hold a reference to an integer, because subsequent
operations on the “integer” would interpret the bits of the object incorrectly. In
object-oriented languages, the question of how to ensure the validity of a universal
to specific assignment generalizes to the question of how to ensure the validity
of any assignment in which the type of the object on the left-hand side supports
operations that the object on the right-hand side may not.
One way to ensure the safety of universal to specific assignments (or, in general,
less specific to more specific assignments) is to make objects self-descriptive: that
is, to include in the representation of each object a tag that indicates its type. This
approach is common in object-oriented languages, which generally need it for
dynamic method binding. Type tags in objects can consume a nontrivial amount
of space, but allow the implementation to prevent the assignment of an object
of one type into a variable of another. In Java and C#, a universal to specific
assignment requires a type cast, and will generate an exception if the universal
reference does not refer to an object of the casted type. In Eiffel, the equivalent
operation uses a special assignment operator (?= instead of :=); in C++ it uses a
dynamic_cast operation.
EXAMPLE 7.30 Java container of Object
Java and C# programmers frequently create container classes that hold objects
of the universal reference class ( Object or object , respectively). When an
object is removed from a container, it must be assigned (with a type cast)
into a variable of an appropriate class before anything interesting can be done
with it:4
import java.util.*;    // library containing Stack container class
...
Stack myStack = new Stack();
String s = "Hi, Mom";
Foo f = new Foo();     // f is of user-defined class type Foo
...
myStack.push(s);
myStack.push(f);       // we can push any kind of object on a stack
...
s = (String) myStack.pop();
    // type cast is required, and will generate an exception at run
    // time if element at top-of-stack is not a string
In a language without type tags, the assignment of a universal reference into
an object of a specific reference type cannot be checked, because objects are not
self-descriptive: there is no way to identify their type at run time. The programmer
must therefore resort to an (unchecked) type conversion.
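The contrast between checked and unchecked universal-to-specific assignments can be sketched in C++, which supports both. The object, text, and foo classes and the two function names are hypothetical. The checked version relies on the type tag that C++ keeps for classes with virtual functions; the unchecked version goes through void * and trusts the caller.

```cpp
// Hypothetical class hierarchy; the virtual destructor gives each
// object a run-time type tag.
struct object { virtual ~object() {} };
struct text : object { const char* chars = "Hi, Mom"; };
struct foo : object { int id = 42; };

// Checked universal-to-specific assignment: dynamic_cast consults the
// type tag and yields a null pointer if the object is not a text.
text* checked_text(object* o) {
    return dynamic_cast<text*>(o);
}

// Unchecked version using void*: no tag is consulted, so the result is
// meaningful only if p really does point to a text. The caller is
// responsible for getting this right.
text* unchecked_text(void* p) {
    return static_cast<text*>(p);
}
```

Passing a foo to checked_text produces a null pointer that the program can test for. Passing the address of a foo to unchecked_text would compile without complaint and produce a pointer whose use has undefined behavior.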
7.2.3 Type Inference
We have seen how type checking ensures that the components of an expression
(e.g., the arguments of a binary operator) have appropriate types. But what determines the type of the overall expression? In most cases, the answer is easy. The
result of an arithmetic operator usually has the same type as the operands. The
result of a comparison is usually Boolean. The result of a function call has the type
declared in the function’s header. The result of an assignment (in languages in
which assignments are expressions) has the same type as the left-hand side. In
a few cases, however, the answer is not obvious. In particular, operations on
subranges and on composite objects do not necessarily preserve the types of the
operands. We examine these cases in the remainder of this subsection. We then
consider (on the PLP CD) a more elaborate form of type inference found in ML,
Miranda, and Haskell.
4 If the programmer knows that a container will be used to hold objects of only one type, then it may
be possible to eliminate the type cast and, ideally, its run-time cost by using generics (Section 8.4).
Subranges
EXAMPLE 7.31 Inference of subrange types
For simple arithmetic operators, the principal type system subtlety arises when
one or more operands have subrange types (what Ada calls subtypes with range
constraints). Given the following Pascal definitions, for example,
type Atype = 0..20;
Btype = 10..20;
var a : Atype;
b : Btype;
what is the type of a + b? Certainly it is neither Atype nor Btype, since the possible
values range from 10 to 40. One could imagine it being a new anonymous subrange
type with 10 and 40 as bounds. The usual answer in Pascal and its descendants is
to say that the result of any arithmetic operation on a subrange has the subrange’s
base type, in this case integer.
EXAMPLE 7.32 The var placeholder in C# 3.0
If the result of an arithmetic operation is assigned into a variable of a subrange
type, then a dynamic semantic check may be required. To avoid the expense
of some unnecessary checks, a compiler may keep track at compile time of the
largest and smallest possible values of each expression, in essence computing the
anonymous 10 . . . 40 type. More sophisticated techniques can be used to eliminate
many checks in loops; we will consider these in Section 16.5.2.
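The bookkeeping involved can be sketched as simple interval arithmetic. The range struct and the functions add_ranges and check_needed below are hypothetical names, and the sketch handles only addition; it is meant to illustrate the idea rather than any particular compiler's analysis.

```cpp
// Hypothetical compile-time record of the smallest and largest values
// an integer expression can take.
struct range {
    long lo;
    long hi;
};

// Range of a + b, given the ranges of a and b.
constexpr range add_ranges(range a, range b) {
    return range{a.lo + b.lo, a.hi + b.hi};
}

// A run-time check on assignment to a subrange variable can be omitted
// when every possible value of the expression lies within the target.
constexpr bool check_needed(range expr, range target) {
    return expr.lo < target.lo || expr.hi > target.hi;
}

constexpr range Atype{0, 20};
constexpr range Btype{10, 20};
constexpr range sum = add_ranges(Atype, Btype);   // the anonymous 10..40 type

static_assert(sum.lo == 10 && sum.hi == 40, "a + b lies in 10..40");
```

Assigning a + b into a variable of subrange 0..100 would need no check under this analysis, whereas assigning it into a variable of type Atype would, since the sum may be as large as 40.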
In languages like Ada, the type of an arithmetic expression assumes special
significance in the header of a for loop (Section 6.5.1), because it determines the
type of the index variable. For the sake of uniformity, Ada says that the index of
a for loop always has the base type of the loop bounds, whether they are builtup expressions or simple variables or constants. A similar convention appears in
C# 3.0, which allows a variable declaration to use the placeholder var instead
of a type name when an appropriate type can be inferred from the initialization
expression:
var i = 123;                              // equiv. to int i = 123;
var map = new Dictionary<int, string>();  // equiv. to
    // Dictionary<int, string> map = new Dictionary<int, string>();
Composite Types
Most built-in operators in most languages take operands of built-in types. Some
operators, however, can be applied to values of composite types, including aggregates. Type inference becomes an issue when an operation on composites yields a
result of a different type than the operands.
EXAMPLE 7.33 Type inference on string operations
Character strings provide a simple example. In Pascal, the literal string 'abc'
has type array [1..3] of char. In Ada, the analogous string (denoted "abc")
is considered to have an incompletely specified type that is compatible with any
three-element array of characters. In the Ada expression "abc" & "defg", "abc"
is a three-character array, "defg" is a four-character array, and the result is a
seven-character array formed by concatenating the two. For all three, the size of
the array is known but the bounds and the index type are not; they must be
inferred from context. The seven-character result of the concatenation could be
assigned into an array of type array (1..7) of character or into an array of
type array (weekday) of character, or into any other seven-element character
array.
EXAMPLE 7.34 Type inference for sets
Operations on composite values also occur when manipulating sets. Pascal and
Modula, for example, support union ( + ), intersection ( * ), and difference ( - ) on
sets of discrete values. Set operands are said to have compatible types if their
elements have the same base type T . The result of a set operation is then of type
set of T . As with subranges, a compiler can avoid the need for run-time bounds
checks in certain cases by keeping track of the minimum and maximum possible
members of the set expression. Because a set may have many members, some of
which may be known at compile time, it can be useful to track not only the largest
and smallest values that may be in a set, but also the values that are known to be
in the set (see Exercise 7.7).
7.2.4 The ML Type System
The most sophisticated form of type inference occurs in certain functional languages, notably ML, Miranda, and Haskell. Programmers have the option of
declaring the types of objects in these languages, in which case the compiler
behaves much like that of a more traditional statically typed language. As we
noted near the beginning of Section 7.1, however, programmers may also choose
not to declare certain types, in which case the compiler will infer them, based on
the known types of literal constants, the explicitly declared types of any objects that
have them, and the syntactic structure of the program. ML-style type inference is
the invention of the language’s creator, Robin Milner.⁵
IN MORE DEPTH
An introduction to the type system of ML and its descendants appears on the PLP
CD. The key to its inference mechanism is to unify the (partial) type information
available for two expressions whenever the rules of the type system say that their
types must be the same. Information known about each is then known about
the other as well. Any discovered inconsistencies are identified as static semantic
errors. Any expression whose type remains incompletely specified after inference is
automatically polymorphic; this is the implicit parametric polymorphism referred
to in Section 3.5.3. ML family languages also incorporate a powerful run-time
pattern-matching facility, and several unconventional structured types, including ordered tuples, (unordered) records, lists, and a datatype mechanism that
subsumes unions and recursive types.
⁵ Robin Milner (1934–), of Cambridge University’s Computer Laboratory, is responsible not only
for the development of ML and its type system, but for the Logic of Computable Functions,
which provides a formal basis for machine-assisted proof construction, and the Calculus of
Communicating Systems, which provides a general theory of concurrency. He received the ACM
Turing Award in 1991.
CHECK YOUR UNDERSTANDING
11. What is the difference between type equivalence and type compatibility?
12. Discuss the comparative advantages of structural and name equivalence for
types. Name three languages that use each approach.
13. Explain the difference between strict and loose name equivalence.
14. Explain the distinction between derived types and subtypes in Ada.
15. Explain the differences among type conversion, type coercion, and nonconverting
type casts.
16. Summarize the arguments for and against coercion.
17. Under what circumstances does a type conversion require a run-time check?
18. What purpose is served by universal reference types?
19. What is type inference? Describe three contexts in which it occurs.
7.3 Records (Structures) and Variants (Unions)
Record types allow related data of heterogeneous types to be stored and manipulated together. Some languages (notably Algol 68, C, C++, and Common Lisp)
use the term structure (declared with the keyword struct) instead of record.
Fortran 90 simply calls its records “types”: they are the only form of programmer-defined type other than arrays, which have their own special syntax. Structures
in C++ are defined as a special form of class (one in which members are globally
visible by default). Java has no distinguished notion of struct; its programmers
use classes in all cases. C# uses a reference model for variables of class types,
and a value model for variables of struct types. C# structs do not support
inheritance. For the sake of simplicity, we will use the term “record” in most of
our discussion to refer to the relevant construct in all these languages.
7.3.1 Syntax and Operations
EXAMPLE 7.35 A C struct
In C, a simple record might be defined as follows.
struct element {
    char name[2];
    int atomic_number;
    double atomic_weight;
    _Bool metallic;
};
EXAMPLE 7.36 A Pascal record
In Pascal, the corresponding declarations would be
type two_chars = packed array [1..2] of char;
    (* Packed arrays will be explained in Example 7.43.
       Packed arrays of char are compatible with quoted strings. *)
type element = record
    name : two_chars;
    atomic_number : integer;
    atomic_weight : real;
    metallic : Boolean
end;
EXAMPLE 7.37 Accessing record fields
Each of the record components is known as a field. To refer to a given field of a
record, most languages use “dot” notation. In C:
struct element copper;
const double AN = 6.022e23;          /* Avogadro's number */
...
copper.name[0] = 'C'; copper.name[1] = 'u';
double atoms = mass / copper.atomic_weight * AN;
Pascal notation is similar to that of C. In Fortran 90 one would say copper%name
and copper%atomic_weight. Cobol and Algol 68 reverse the order of the field and
record names: name of copper and atomic_weight of copper. ML’s notation is
also “reversed,” but uses a prefix #: #name copper and #atomic_weight copper.
(Fields of an ML record can also be extracted using patterns.) In Common
Lisp, one would say (element-name copper) and (element-atomic_weight
copper).
DESIGN & IMPLEMENTATION
Struct tags and typedef in C and C++
One of the peculiarities of the C type system is that struct tags are not exactly
type names. In Example 7.35, the name of the type is the two-word phrase
struct element. We used this name to declare the element_yielded field
of the second struct in Example 7.38. To obtain a one-word name, one can say
typedef struct element element_t, or even typedef struct element
element: struct tags and typedef names have separate name spaces, so the
same name can be used in each. C++ eliminates this idiosyncrasy by allowing
the struct tag to be used as a type name without the struct prefix; in effect, it
performs the typedef implicitly.
EXAMPLE 7.38 Nested records
Most languages allow record definitions to be nested. Again in C:
struct ore {
    char name[30];
    struct {
        char name[2];
        int atomic_number;
        double atomic_weight;
        _Bool metallic;
    } element_yielded;
};
Alternatively, one could say
struct ore {
    char name[30];
    struct element element_yielded;
};
In Fortran 90 and Common Lisp, only the second alternative is permitted:
record fields can have record types, but the declarations cannot be lexically
nested. Naming for nested records is straightforward:
malachite.element_yielded.atomic_number in Pascal or C;
atomic_number of element_yielded of malachite in Cobol;
#atomic_number (#element_yielded malachite) in ML (the parentheses are needed
because function application associates to the left);
(element-atomic_number (ore-element_yielded malachite)) in Common Lisp.
EXAMPLE 7.39 ML records and tuples
As noted in Example 7.14, ML differs from most languages in specifying that
the order of record fields is insignificant. The ML record value {name = "Cu",
atomic_number = 29, atomic_weight = 63.546, metallic = true} is the
same as the value {atomic_number = 29, name = "Cu", atomic_weight =
63.546, metallic = true} (they will test true for equality). ML tuples are
defined as abbreviations for records whose field names are small integers. The
values ("Cu", 29), {1 = "Cu", 2 = 29}, and {2 = 29, 1 = "Cu"} will all test
true for equality.
7.3.2 Memory Layout and Its Impact
The fields of a record are usually stored in adjacent locations in memory. In its
symbol table, the compiler keeps track of the offset of each field within each record
type. When it needs to access a field, the compiler typically generates a load or
store instruction with displacement addressing. For a local object, the base register
is the frame pointer; the displacement is the sum of the record’s offset from the
register and the field’s offset within the record. On a RISC machine, a global record
is accessed in a similar way, using a dedicated globals pointer register as base. On a
CISC machine, the compiler may access the field directly at its absolute address or,
if many fields are to be accessed in a short period of time, it may load a temporary
register with the (absolute) address of the record and then use the field’s offset as
displacement.
[Figure 7.1: rows 4 bytes (32 bits) wide, containing name, atomic_number, atomic_weight, and metallic, in that order.]
Figure 7.1 Likely layout in memory for objects of type element on a 32-bit machine. Alignment
restrictions lead to the shaded “holes.”
EXAMPLE 7.40 Memory layout for a record type
EXAMPLE 7.41 Nested records as values
A likely layout for our element type on a 32-bit machine appears in
Figure 7.1. Because the name field is only two characters long, it occupies two bytes
in memory. Since atomic_number is an integer, and must (on most machines)
be word-aligned, there is a two-byte “hole” between the end of name and the
beginning of atomic_number . Similarly, since Boolean variables (in most language implementations) occupy a single byte, there are three bytes of empty space
between the end of the metallic field and the next aligned location. In an array
of elements, most compilers would devote 20 bytes to every member of the
array.
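The holes described above can be observed from within a C program by using offsetof and sizeof. The following is a minimal sketch rather than a definitive measurement: the helper names element_hole_bytes and print_element_layout are hypothetical, and because the actual offsets depend on the compiler and target machine, the code computes them instead of assuming the 20-byte layout of Figure 7.1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct element {
    char name[2];
    int atomic_number;
    double atomic_weight;
    _Bool metallic;
};

/* Bytes of padding ("holes") in struct element: the total size minus
   the sizes of the individual fields. (Hypothetical helper.) */
size_t element_hole_bytes(void) {
    size_t used = sizeof(char[2]) + sizeof(int) + sizeof(double) + sizeof(_Bool);
    return sizeof(struct element) - used;
}

/* Print each field's offset so the layout can be compared with Figure 7.1. */
void print_element_layout(void) {
    printf("name           %zu\n", offsetof(struct element, name));
    printf("atomic_number  %zu\n", offsetof(struct element, atomic_number));
    printf("atomic_weight  %zu\n", offsetof(struct element, atomic_weight));
    printf("metallic       %zu\n", offsetof(struct element, metallic));
    printf("size %zu, holes %zu\n", sizeof(struct element), element_hole_bytes());
}
```

On a machine that aligns doubles on eight-byte boundaries the reported size may well be 24 rather than 20; the principle is the same.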
In a language with a value model of variables, nested records are naturally
embedded in the parent record, where they function as large fields with word or
double-word alignment. In a language with a reference model of variables, fields
of record type are typically references to data in another location. The difference is
a matter not only of memory layout, but also of semantics. In Pascal, the following
program prints a 0.
type
    T = record
        j : integer;
    end;
    S = record
        i : integer;
        n : T;
    end;
var s1, s2 : S;
...
s1.n.j := 0;
s2 := s1;
s2.n.j := 7;
writeln(s1.n.j);        (* prints 0 *)
The assignment of s1 into s2 copies the embedded T .
EXAMPLE 7.42 Nested records as references
By contrast, the following Java program prints a 7. (Simple classes in Java play
the role of structs.)
class T {
    public int j;
}
class S {
    public int i;
    public T n;
}
...
S s1 = new S();
s1.n = new T();                  // fields initialized to 0
S s2 = s1;
s2.n.j = 7;
System.out.println(s1.n.j);      // prints 7
EXAMPLE 7.43 Layout of packed types
Here the assignment of s1 into s2 has copied only the reference, so s2.n.j is an
alias for s1.n.j .
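Both behaviors can be reproduced in C, which makes the contrast easy to see in a single language. The sketch below is illustrative only: the type names S_value and S_ref and the two functions are hypothetical, and the pointer field merely imitates a reference model. Note also that the C assignment makes a shallow copy of the outer record, whereas the Java assignment copies a reference to the whole S object; the aliasing of n.j is the same in both cases.

```c
#include <assert.h>

struct T { int j; };

/* Value model: the T is embedded in the enclosing record. */
struct S_value { int i; struct T n; };

/* Reference model, imitated with a pointer field. */
struct S_ref { int i; struct T *n; };

/* Copy a record with an embedded T, modify the copy, and return the
   original's field. The original is unaffected. */
int embedded_after_copy(void) {
    struct S_value s1 = { 0, { 0 } };
    struct S_value s2 = s1;      /* copies the embedded T as well */
    s2.n.j = 7;
    return s1.n.j;               /* still 0 */
}

/* The same steps with a pointer field: both records share one T. */
int shared_after_copy(void) {
    struct T t = { 0 };
    struct S_ref s1 = { 0, &t };
    struct S_ref s2 = s1;        /* copies only the pointer */
    s2.n->j = 7;
    return s1.n->j;              /* 7, seen through the alias */
}
```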
A few languages—notably Pascal—allow the programmer to specify that a
record type (or an array, set, or file type) should be packed:
type element = packed record
    name : two_chars;
    atomic_number : integer;
    atomic_weight : real;
    metallic : Boolean
end;
EXAMPLE 7.44 Assignment and comparison of records
The keyword packed indicates that the compiler should optimize for space instead
of speed. In most implementations a compiler will implement a packed record
without holes, by simply “pushing the fields together.” To access a nonaligned field,
however, it will have to issue a multi-instruction sequence that retrieves the pieces
of the field from memory and then reassembles them in a register. A likely packed
layout for our element type (again for a 32-bit machine) appears in Figure 7.2.
It is 15 bytes in length. An array of packed element records would probably
devote 16 bytes to each member of the array; that is, it would align each element.
A packed array of packed records might devote only 15 bytes to each; only every
fourth element would be aligned. Ada, Modula-3, and C provide more elaborate
packing mechanisms, which allow the programmer to specify precisely how many
bits are to be devoted to each field.
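In C the bit-level mechanism mentioned above takes the form of bit-fields. The following sketch is one possible illustration: the packed_date type and make_date function are hypothetical, and the way the compiler places the fields within a storage unit is implementation defined, so the code relies only on the declared widths.

```c
#include <assert.h>

/* Hypothetical example: a date packed into bit-fields.
   5 bits hold 0..31, 4 bits hold 0..15, 12 bits hold 0..4095. */
struct packed_date {
    unsigned int day   : 5;
    unsigned int month : 4;
    unsigned int year  : 12;
};

/* Build a packed date. A value too large for an unsigned bit-field
   is reduced modulo 2 to the power of the field's width. */
struct packed_date make_date(unsigned day, unsigned month, unsigned year) {
    struct packed_date d;
    d.day = day;
    d.month = month;
    d.year = year;
    return d;
}
```

The 21 bits declared here fit in a single 32-bit unit on typical implementations, where three separate unsigned int fields would need three.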
Most languages allow a value to be assigned to an entire record in a single
operation:
my_element := copper;
Ada also allows records to be compared for equality (if my_element = copper
then ...), but most other languages (including Pascal, C, and their successors)
do not, though C++ allows the programmer to define equality tests for individual
record types.
[Figure 7.2: rows 4 bytes (32 bits) wide, containing name, atomic_number, atomic_weight, and metallic, packed together.]
Figure 7.2 Likely memory layout for packed element records. The atomic_number and
atomic_weight fields are nonaligned, and can only be read or written (on most machines) via
multi-instruction sequences.
EXAMPLE 7.45 Minimizing holes by sorting fields
For small records, both copies and comparisons can be performed in-line on
a field-by-field basis. For longer records, we can save significantly on code space
by deferring to a library routine. A block_copy routine can take source address,
destination address, and length as arguments, but the analogous block_compare
routine would fail on records with different (garbage) data in the holes. One solution is to arrange for all holes to contain some predictable value (e.g., zero), but
this requires code at every elaboration point. Another is to have the compiler
generate a customized field-by-field comparison routine for every record type.
Different routines would be called to compare records of different types. Languages like Pascal and C avoid the whole issue by simply outlawing full-record
comparisons.
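A customized field-by-field comparison of the kind just described might look as follows in C. This is a sketch under stated assumptions: element_equal and init_copper are hypothetical names, and init_copper pre-fills the storage with a chosen byte only to make it likely that any holes hold different garbage in the two records. A block comparison with memcmp over the whole record could then report a difference even though every field matches.

```c
#include <assert.h>
#include <string.h>

struct element {
    char name[2];
    int atomic_number;
    double atomic_weight;
    _Bool metallic;
};

/* Field-by-field equality: never inspects the bytes in the holes. */
int element_equal(const struct element *a, const struct element *b) {
    return memcmp(a->name, b->name, sizeof a->name) == 0
        && a->atomic_number == b->atomic_number
        && a->atomic_weight == b->atomic_weight
        && a->metallic == b->metallic;
}

/* Fill the record's storage with `fill`, then store copper's fields.
   Whatever lies in the holes afterward is not part of the value. */
void init_copper(struct element *e, unsigned char fill) {
    memset(e, fill, sizeof *e);
    e->name[0] = 'C';
    e->name[1] = 'u';
    e->atomic_number = 29;
    e->atomic_weight = 63.546;
    e->metallic = 1;
}
```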
In addition to complicating comparisons, holes in records waste space. Packing
eliminates holes, but at potentially heavy cost in access time. A compromise,
adopted by some compilers, is to sort a record’s fields according to the size of
their alignment constraints. All byte-aligned fields might come first, followed by
any half-word aligned fields, word-aligned fields, and (if the hardware requires)
double-word–aligned fields. For our element type, the resulting rearrangement
is shown in Figure 7.3.
[Figure 7.3: rows 4 bytes (32 bits) wide, containing name, metallic, atomic_number, and atomic_weight, in that order.]
Figure 7.3 Rearranging record fields to minimize holes. By sorting fields according to the size
of their alignment constraint, a compiler can minimize the space devoted to holes, while keeping
the fields aligned.
DESIGN & IMPLEMENTATION
The order of record fields
Issues of record field order are intimately tied to implementation tradeoffs:
Holes in records waste space, but alignment makes for faster access. If holes
contain garbage we can’t compare records by looping over words or bytes, but
zeroing out the holes would incur costs in time and code space. Predictable layout is important for mirroring hardware structures in “systems” languages, but
reorganization may be advantageous in large records if we can group frequently
accessed fields together, so they lie in the same cache line.
In most cases, reordering of fields is purely an implementation issue: the programmer need not be aware of it, so long as all instances of a record type are
reordered in the same way. The exception occurs in systems programs, which
sometimes “look inside” the implementation of a data type with the expectation
that it will be mapped to memory in a particular way. A kernel programmer, for
example, may count on a particular layout strategy in order to define a record
that mimics the organization of memory-mapped control registers for a particular Ethernet device. C and C++, which are designed in large part for systems
programs, guarantee that the fields of a struct will be allocated in the order
declared. The first field is guaranteed to have the coarsest alignment required by
the hardware for any type (generally a four- or eight-byte boundary). Subsequent
fields have the natural alignment for their type. Fortran 90 allows the programmer
to specify that fields must not be reordered; in the absence of such a specification
the compiler can choose its own order. To accommodate systems programs, Ada
and C++ allow the programmer to specify nonstandard alignment for the fields
of specific record types.
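Because C allocates fields in the order declared, a C programmer who wants the space savings of sorted fields must do the sorting by hand. The sketch below compares the two orderings of our element type; the names element_declared, element_sorted, and bytes_saved_by_sorting are hypothetical, and the sizes are whatever the local compiler chooses, so the code compares them rather than assuming particular values.

```c
#include <assert.h>
#include <stddef.h>

/* Fields in the order used in the text. */
struct element_declared {
    char name[2];
    int atomic_number;
    double atomic_weight;
    _Bool metallic;
};

/* The same fields, sorted by hand from the weakest alignment
   constraint to the strongest. */
struct element_sorted {
    char name[2];
    _Bool metallic;
    int atomic_number;
    double atomic_weight;
};

/* Bytes saved per record by the sorted order (0 if there is no gain). */
size_t bytes_saved_by_sorting(void) {
    size_t d = sizeof(struct element_declared);
    size_t s = sizeof(struct element_sorted);
    return d >= s ? d - s : 0;
}
```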
7.3.3 With Statements
EXAMPLE 7.46 Pascal with statement
In programs with complicated data structures, manipulating the fields of a deeply
nested record can be awkward:
ruby.chemical_composition.elements[1].name := 'Al';
ruby.chemical_composition.elements[1].atomic_number := 13;
ruby.chemical_composition.elements[1].atomic_weight := 26.98154;
ruby.chemical_composition.elements[1].metallic := true;
Pascal provides a with statement to simplify such constructions:
with ruby.chemical_composition.elements[1] do begin
    name := 'Al';
    atomic_number := 13;
    atomic_weight := 26.98154;
    metallic := true
end;
IN MORE DEPTH
Pascal with statements are examined in more detail on the PLP CD. They are
generally considered an improvement on the earlier elliptical references of Cobol
and PL/I. They still suffer from several limitations, however, most of which are
addressed in Modula-3 and Fortran 2003. Similar functionality can be achieved
with nested scopes in languages like Lisp and ML (which use a reference model
of variables), and in languages like C and C++, which allow the programmer to
create pointers or references to arbitrary objects.
7.3.4 Variant Records (Unions)
Programming languages of the 1960s and 1970s were designed in an era of severe
memory constraints. Many allowed the programmer to specify that certain variables (presumably ones that would never be used at the same time) should be
allocated “on top of” one another, sharing the same bytes in memory. C’s syntax,
heavily influenced by Algol 68, looks very much like a struct:
union {
    int i;
    double d;
    _Bool b;
};
EXAMPLE 7.47 Motivation for variant records
The overall size of this union would be that of its largest member (presumably d).
Exactly which bytes of d would be overlapped by i and b is implementation
dependent, and presumably influenced by the relative sizes of types, their alignment constraints, and the endian-ness of the hardware.
In practice, unions have been used for two main purposes. The first arises in
systems programs, where unions allow the same set of bytes to be interpreted
in different ways at different times. The canonical example occurs in memory
management, where storage may sometimes be treated as unallocated space (perhaps in need of “zeroing out”), sometimes as bookkeeping information (length
and header fields to keep track of free and allocated blocks), and sometimes as
user-allocated data of arbitrary type. While nonconverting type casts can be used
to implement heap management routines, as described on page 309, unions are a
better indication of the programmer’s intent: the bits are not being reinterpreted,
they are being used for independent purposes.⁶
⁶ By contrast, the other example on page 309 (examination of the internal structure of a floating-point number) does indeed reinterpret bits. Unions can also be used in this case (Exercise 7.30), but here a nonconverting cast is a better indication of intent.
The second common purpose for unions is to represent alternative sets of
fields within a record. A record representing an employee, for example, might
have several common fields (name, address, phone, department, ID number) and
various other fields depending on whether the person in question works on a
salaried, hourly, or consulting basis. C unions are awkward when used for this
purpose. A much cleaner syntax appears in the variant records of Pascal and
its successors, which allow the programmer to specify that certain (potentially
hierarchical) sets of fields should overlap one another in memory.
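To make the awkwardness concrete, here is how the employee record might be written with a C union. This is only a sketch: every name in it (pay_kind, employee, pay, pay_for_period, and the individual fields) is hypothetical. The programmer must maintain the tag field by hand and must name the union at every access; nothing in the language checks that the member being read is the one most recently written.

```c
#include <assert.h>

enum pay_kind { SALARIED, HOURLY, CONSULTING };

struct employee {
    char name[32];
    int id;
    enum pay_kind kind;              /* tag: which member of pay is live */
    union {
        double annual_salary;                        /* SALARIED */
        struct { double rate; double hours; } hourly; /* HOURLY */
        double contract_fee;                         /* CONSULTING */
    } pay;                           /* the members overlap in memory */
};

/* Amount owed for one pay period, selected by the tag. */
double pay_for_period(const struct employee *e, int periods_per_year) {
    switch (e->kind) {
    case SALARIED:
        return e->pay.annual_salary / periods_per_year;
    case HOURLY:
        return e->pay.hourly.rate * e->pay.hourly.hours;
    case CONSULTING:
        return e->pay.contract_fee;
    }
    return 0.0;                      /* unknown tag */
}
```

The union is as large as its largest alternative, here the two-field hourly structure.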
IN MORE DEPTH
We discuss unions and variant records in more detail on the PLP CD. Topics we
consider include syntax, safety, and memory layout issues. Safety is a particular
concern: where nonconverting type casts allow a programmer to circumvent the
language’s type system explicitly, a naive realization of unions makes it easy to
do so by accident. Algol 68 and Ada impose limits on the use of unions and
variant records that allow the compiler to verify, statically, that all programs are
type-safe. We also note that inheritance in object-oriented languages provides an
attractive alternative to type-safe variant records in most cases. This observation
largely accounts for the omission of unions and variant records from more recent
languages.
CHECK YOUR UNDERSTANDING
20. What are struct tags in C? How are they related to type names? How did they
change in C++?
21. Summarize the distinction between records and tuples in ML. How do these
compare to the records of languages like C and Ada?
22. Discuss the significance of “holes” in records. Why do they arise? What problems do they cause?
23. Why is it easier to implement assignment than comparison for records?
24. What is packing? What are its advantages and disadvantages?
25. Why might a compiler reorder the fields of a record? What problems might
this cause?
26. Briefly describe two purposes for unions/variant records.
7.4 Arrays
Arrays are the most common and important composite data types. They have been
a fundamental part of almost every high-level language, beginning with Fortran I.
Unlike records, which group related fields of disparate types, arrays are usually
homogeneous. Semantically, they can be thought of as a mapping from an index
type to a component or element type. Some languages (e.g., Fortran) require that
the index type be integer; many languages allow it to be any discrete type. Some
languages (e.g., Fortran 77) require that the element type of an array be scalar.
Most (including Fortran 90) allow any element type.
Some languages (notably scripting languages) allow nondiscrete index types.
The resulting associative arrays must generally be implemented with hash tables
or search trees; we consider them in Section 13.4.3. Associative arrays also resemble the dictionary or map types supported by the standard libraries of many
object-oriented languages. In C++, operator overloading allows these types to use
conventional array-like syntax. For the purposes of this chapter, we will assume
that array indices are discrete. This admits a (much more efficient) contiguous
allocation scheme, to be described in Section 7.4.3.
7.4.1 Syntax and Operations
Most languages refer to an element of an array by appending a subscript—
delimited by parentheses or square brackets—to the name of the array. In Fortran
and Ada, one says A(3); in Pascal and C, one says A[3]. Since parentheses are generally used to delimit the arguments to a subroutine call, square bracket subscript
notation has the advantage of distinguishing between the two. The difference in
notation makes a program easier to compile and, arguably, easier to read. Fortran’s
use of parentheses for arrays stems from the absence of square bracket characters
on IBM keypunch machines, which at one time were widely used to enter Fortran
programs. Ada’s use of parentheses represents a deliberate decision on the part of
the language designers to embrace notational ambiguity for functions and arrays.
If we think of an array as a mapping from the index type to the element type, it
makes perfectly good sense to use the same notation used for functions. In some
cases, a programmer may even choose to change from an array to a function-based
implementation of a mapping, or vice versa (Exercise 7.10).
Declarations
EXAMPLE 7.48 Array declarations
In some languages one declares an array by appending subscript notation to the
syntax that would be used to declare a scalar. In C:
char upper[26];
In Fortran:
character, dimension (1:26) :: upper
character (26) upper                     ! shorthand notation
In C, the lower bound of an index range is always zero: the indices of an n-element
array are 0 ... n − 1. In Fortran, the lower bound of the index range is one by
default. Fortran 90 allows a different lower bound to be specified if desired, using
the notation shown in the first of the two declarations above.
In other languages, arrays are declared with an array constructor. In Pascal:
var upper : array ['a'..'z'] of char;
In Ada:
upper : array (character range 'a'..'z') of character;
EXAMPLE 7.49 Multidimensional arrays
Most languages make it easy to declare multidimensional arrays:
mat : array (1..10, 1..10) of real;      -- Ada
real, dimension (10,10) :: mat           ! Fortran
DESIGN & IMPLEMENTATION
Is [ ] an operator?
Associative arrays in C++ are typically defined by overloading operator[]. C#,
like C++, provides extensive facilities for operator overloading, but it does not
use these facilities to support associative arrays. Instead, the language provides
a special indexer mechanism, with its own unique syntax:
class directory {
    Hashtable table;                          // from standard library
    ...
    public directory() {                      // constructor
        table = new Hashtable();
    }
    ...
    public string this[string name] {         // indexer method
        get {
            return (string) table[name];
        }
        set {
            table[name] = value;              // value is implicitly
        }                                     //   a parameter of set
    }
}
...
directory d = new directory();
...
d["Jane Doe"] = "234-5678";
Console.WriteLine(d["Jane Doe"]);
Why the difference? In C++, operator[] can return a reference (an explicit
l-value; see Section 8.3.1), which can be used on either side of an assignment.
C# has no comparable notion of reference, so it needs separate methods to get
and set the value of d["Jane Doe"].
In some languages (e.g., Pascal, Ada, and Modula-3), one can also declare a multidimensional array by using the array constructor more than once in the same
declaration. In Modula-3,
VAR mat : ARRAY [1..10], [1..10] OF REAL;
is syntactic sugar for
VAR mat : ARRAY [1..10] OF ARRAY [1..10] OF REAL;
and mat[3, 4] is syntactic sugar for mat[3][4]. Similar equivalences hold in
Pascal.
EXAMPLE 7.50 Multidimensional vs built-up arrays
In Ada, by contrast,
mat1 : array (1..10, 1..10) of real;
is not the same as
type vector is array (integer range <>) of real;
type matrix is array (integer range <>) of vector (1..10);
mat2 : matrix (1..10);
Variable mat1 is a two-dimensional array; mat2 is an array of one-dimensional
arrays. With the former declaration, we can access individual real numbers as
mat1(3, 4); with the latter we must say mat2(3)(4). The two-dimensional
array is arguably more elegant, but the array of arrays supports additional operations: it allows us to name the rows of mat2 individually (mat2(3) is a 10-element, single-dimensional array), and it allows us to take slices, as discussed
below (mat2(3)(2..6) is a five-element array of real numbers; mat2(3..7) is a
five-element array of ten-element arrays).
EXAMPLE 7.51 Arrays of arrays in C
In C, one must also declare an array of arrays, and use two-subscript notation,
but C’s integration of pointers and arrays (to be discussed in Section 7.7.1) means
that slices are not supported.
double mat[10][10];
Given this definition, mat[3][4] denotes an individual element of the array, but
mat[3] denotes a reference, either to the fourth row of the array (C indices start at zero) or to the first
element of that row, depending on context.
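The two readings of mat[3] can be seen in a short C sketch. The function names row_sum and row_bytes are hypothetical. Passed as an argument, mat[3] is converted to a pointer to the first element of the row; as the operand of sizeof, it denotes the row as a whole.

```c
#include <assert.h>
#include <stddef.h>

double mat[10][10];   /* an array of ten rows, each a ten-element array */

/* Sum the n elements of one row. The caller writes row_sum(mat[3], 10);
   in that context mat[3] becomes a pointer to the row's first element. */
double row_sum(const double *row, size_t n) {
    double total = 0.0;
    for (size_t k = 0; k < n; k++) {
        total += row[k];
    }
    return total;
}

/* As the operand of sizeof, mat[3] is the entire row, not a pointer. */
size_t row_bytes(void) {
    return sizeof mat[3];
}
```

Since the row is passed as a bare pointer, its length has to travel separately; this is one reason that slices are not supported.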
Slices and Array Operations
EXAMPLE 7.52 Array slice operations
A slice or section is a rectangular portion of an array. Fortran 90 and Single
Assignment C provide extensive facilities for slicing, as do many scripting languages, including Perl, Python, Ruby, and R. Figure 7.4 illustrates some of the
possibilities in Fortran 90, using the Fortran declaration of mat from Example 7.49.
[Figure 7.4: four panels showing the slices matrix(3:6, 4:7), matrix(6:, 5), matrix(:4, 2:8:2), and matrix(:, (/2, 5, 9/)).]
Figure 7.4 Array slices (sections) in Fortran 90. Much like the values in the header of an
enumeration-controlled loop (Section 6.5.1), a : b : c in a subscript indicates positions a, a + c,
a + 2c, . . . through b. If a or b is omitted, the corresponding bound of the array is assumed. If c is
omitted, 1 is assumed. It is even possible to use negative values of c in order to select positions in
reverse order. The slashes in the second subscript of the lower right example delimit an explicit
list of positions.
Ada provides more limited support: a slice is simply a contiguous range of elements in a one-dimensional array. As we saw in Example 7.50, the elements can
themselves be arrays, but there is no way to extract a slice along both dimensions
as a single operation.
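The positions selected by a Fortran 90 subscript of the form a : b : c can be enumerated with a simple loop. The following C sketch is illustrative only: the function triplet_positions is hypothetical, it requires the caller to supply explicit values for a, b, and c (it does not model omitted bounds), and it ignores integer overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Store in out[] the positions a, a + c, a + 2c, ... that do not pass b,
   stopping after max entries. Returns the number of positions stored.
   A negative stride c walks downward from a to b. */
size_t triplet_positions(int a, int b, int c, int *out, size_t max) {
    size_t n = 0;
    if (c == 0) {
        return 0;                 /* a zero stride selects nothing here */
    }
    for (int i = a; n < max && (c > 0 ? i <= b : i >= b); i += c) {
        out[n++] = i;
    }
    return n;
}
```

For example, 2:8:2 yields positions 2, 4, 6, and 8, and 8:2:-2 yields the same positions in reverse order.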
In most languages, the only operations permitted on an array are selection of
an element (which can then be used for whatever operations are valid on its type),
and assignment. A few languages (e.g., Ada and Fortran 90) allow arrays to be
compared for equality. Ada allows one-dimensional arrays whose elements are
discrete to be compared for lexicographic ordering : A < B if the first element of A
that is not equal to the corresponding element of B is less than that corresponding
element. Ada also allows the built-in logical operators (or, and, xor) to be applied
to Boolean arrays.
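In a language without built-in array comparison, the lexicographic test must be written by hand. The C sketch below uses a hypothetical function name, lex_less, and works on arrays of int. It also makes one assumption beyond the rule stated above: when one array is a proper prefix of the other, the shorter array is treated as the lesser, following the usual dictionary convention.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if array a (na elements) is lexicographically less than
   array b (nb elements), and 0 otherwise. */
int lex_less(const int *a, size_t na, const int *b, size_t nb) {
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return a[i] < b[i];   /* first differing position decides */
        }
    }
    return na < nb;               /* assumption: a proper prefix is less */
}
```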
Fortran 90 has a very rich set of array operations: built-in operations that take
entire arrays as arguments. Because Fortran uses structural type equivalence, the
operands of an array operator need only have the same element type and shape.
In particular, slices of the same shape can be intermixed in array operations, even
if the arrays from which they were sliced have very different shapes. Any of the
built-in arithmetic operators will take arrays as operands; the result is an array,
of the same shape as the operands, whose elements are the result of applying
the operator to corresponding elements. As a simple example, A + B is an array
each of whose elements is the sum of the corresponding elements of A and B.
Fortran 90 also provides a huge collection of intrinsic, or built-in functions. More
than 60 of these (including logic and bit manipulation, trigonometry, logs and
exponents, type conversion, and string manipulation) are defined on scalars, but
will also perform their operation element-wise if passed arrays as arguments. The
function tan(A), for example, returns an array consisting of the tangents of the
elements of A. Many additional intrinsic functions are defined solely on arrays.
These include searching and summarization, transposition, and reshaping and
subscript permutation.
An equally rich set of array operations can be found in Single Assignment C
(SAC), a purely functional language for high-performance computing developed
by Sven-Bodo Scholz and others in the mid to late 1990s, and currently in active
use at a variety of sites. Both SAC and Fortran 90 take significant inspiration from
APL, an array manipulation language developed by Iverson and others in the early
to mid-1960s.⁷ APL was designed primarily as a terse mathematical notation for
array manipulations. It employs an enormous character set that made it difficult
to use with traditional keyboards and textual displays. Its variables are all arrays,
and many of the special characters denote array operations. APL implementations
are designed for interpreted, interactive use. They are best suited to “quick and
dirty” solution of mathematical problems. The combination of very powerful
operators with very terse notation makes APL programs notoriously difficult to
read and understand. The J notation, a successor to APL, uses a conventional
character set.
7.4.2 Dimensions, Bounds, and Allocation
In all of the examples in the previous subsection, the shape of the array (including
bounds) was specified in the declaration. For such static shape arrays, storage can
be managed in the usual way: static allocation for arrays whose lifetime is the
entire program; stack allocation for arrays whose lifetime is an invocation of a
subroutine; heap allocation for dynamically allocated arrays with more general
lifetime.
Storage management is more complex for arrays whose shape is not known
until elaboration time, or whose shape may change during execution. For these
the compiler must arrange not only to allocate space, but also to make shape
information available at run time (without such information, indexing would
not be possible). Some dynamically typed languages allow run-time binding of
both the number and bounds of dimensions. Compiled languages may allow the
bounds to be dynamic, but typically require the number of dimensions to be static.
A local array whose shape is known at elaboration time may still be allocated in
the stack. An array whose size may change during execution must generally be
allocated in the heap.
7 Kenneth Iverson (1920–2004), a Canadian mathematician, joined the faculty at Harvard University in 1954, where he conceived APL as a notation for describing mathematical algorithms.
He moved to IBM in 1960, where he helped develop the notation into a practical programming
language. He was named an IBM Fellow in 1970, and received the ACM Turing Award in 1979.
In the first subsection below we consider the descriptors, or dope vectors,8 used
to hold shape information at run time. We then consider stack- and heap-based
allocation, respectively, for dynamic shape arrays.
Dope Vectors
During compilation, the symbol table maintains dimension and bounds information for every array in the program. For every record, it maintains the offset
of every field. When the number and bounds of array dimensions are statically
known, the compiler can look them up in the symbol table in order to compute
the address of elements of the array. When these values are not statically known,
the compiler must generate code to look them up in a dope vector at run time.
Typically, a dope vector will contain the lower bound of each dimension and
the size of each dimension other than the last (which will always be the size of
the element type, and will thus be statically known). If the language implementation performs dynamic semantic checks for out-of-bounds subscripts in array
references, then the dope vector may contain upper bounds as well. Given lower
bounds and sizes, the upper bound information is redundant, but it is usually
included anyway, to avoid computing it repeatedly at run time.
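As a rough illustration, a dope vector for a two-dimensional, row-major array might be modeled in C as follows. The names `dim_info`, `dope2`, `make_dope2`, and `elem_addr` are hypothetical, and the layout is only one plausible choice, not that of any particular compiler. For uniformity the sketch stores a stride for the last dimension too, even though that value is normally known statically.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-dimension descriptor (names are illustrative only). */
typedef struct {
    long lower;      /* lower bound of this dimension */
    long upper;      /* upper bound: redundant, but handy for bounds checks */
    size_t stride;   /* bytes between elements that differ by one in this subscript */
} dim_info;

/* Hypothetical dope vector for a two-dimensional, row-major array. */
typedef struct {
    char *data;      /* address of the first element */
    dim_info dim[2];
} dope2;

/* Fill in the descriptor, as would happen at elaboration time. */
static dope2 make_dope2(void *data, long l1, long u1, long l2, long u2,
                        size_t elem_size) {
    dope2 d;
    assert(u1 >= l1 && u2 >= l2);
    d.data = data;
    d.dim[1].lower = l2;
    d.dim[1].upper = u2;
    d.dim[1].stride = elem_size;
    d.dim[0].lower = l1;
    d.dim[0].upper = u1;
    d.dim[0].stride = (size_t)(u2 - l2 + 1) * elem_size;
    return d;
}

/* Address of element [i, j], or NULL if either subscript is out of bounds. */
static void *elem_addr(const dope2 *d, long i, long j) {
    if (i < d->dim[0].lower || i > d->dim[0].upper) return NULL;
    if (j < d->dim[1].lower || j > d->dim[1].upper) return NULL;
    return d->data + (size_t)(i - d->dim[0].lower) * d->dim[0].stride
                   + (size_t)(j - d->dim[1].lower) * d->dim[1].stride;
}
```

Keeping the upper bounds in the descriptor lets `elem_addr` perform its check with two comparisons per dimension instead of recomputing them from the sizes.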
The contents of the dope vector are initialized at elaboration time, or whenever
the number or bounds of dimensions change. In a language like Fortran 90, whose
notion of shape includes dimension sizes but not lower bounds, an assignment
statement may need to copy not only the data of an array, but dope vector contents
as well.
In a language that provides both a value model of variables and arrays of
dynamic shape, we must consider the possibility that a record will contain a field
whose size is not statically known. In this case the compiler may use dope vectors
not only for dynamic shape arrays, but also for dynamic shape records. The dope
vector for a record typically indicates the offset of each field from the beginning
of the record.
8 The name “dope vector” presumably derives from the notion of “having the dope on (something),”
a colloquial expression that originated in horse racing: advance knowledge that a horse has been
drugged (“doped”) is of significant, if unethical, use in placing bets.
Stack Allocation
EXAMPLE 7.53 Conformant array parameters in Pascal
Subroutine parameters are the simplest example of dynamic shape arrays. Early
versions of Pascal required the shape of all arrays to be specified statically. Standard
Pascal relaxes this requirement by allowing array parameters to have bounds that
are symbolic names rather than constants. It calls these parameters conformant
arrays:
function DotProduct(A, B : array [lower..upper : integer] of real) : real;
var i : integer;
    rtn : real;
begin
    rtn := 0;
    for i := lower to upper do rtn := rtn + A[i] * B[i];
    DotProduct := rtn
end;
EXAMPLE 7.54 Local arrays of dynamic shape in C99
EXAMPLE 7.55 Stack allocation of elaborated arrays
Here lower and upper are initialized at the time of call, providing DotProduct
with the information it needs to understand the shape of A and B . In effect, lower
and upper are extra parameters of DotProduct .
Conformant arrays are highly useful in scientific applications, many of which
rely on numerical libraries for linear algebra and the manipulation of systems of
equations. Since different programs use arrays of different shapes, the subroutines
in these libraries need to be able to take arguments whose size is not known at
compile time.
Pascal allows conformant arrays to be passed by reference or by value
(Section 8.3.1). In either case, the caller passes both the data of the array and
an appropriate dope vector, in which lower and upper can be found. If the array
is of dynamic shape in the caller’s context, the dope vector will already exist. If
the array is of static shape in the caller’s context, the caller will need to create an
appropriate dope vector prior to the call.
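The idea that the bounds travel with the array as extra parameters can be sketched in C by passing them explicitly. This is only an analogy: `dot_product` and its parameter names are illustrative, and because C arrays always start at zero the subscript is translated by hand.

```c
#include <assert.h>

/* Sketch: a[] and b[] logically have subscripts lower..upper,
   but are stored in ordinary zero-based C arrays. */
static double dot_product(const double *a, const double *b,
                          int lower, int upper) {
    double rtn = 0.0;
    for (int i = lower; i <= upper; i++) {
        rtn += a[i - lower] * b[i - lower];   /* translate to a zero-based index */
    }
    return rtn;
}
```

A caller whose arrays have static shape supplies the bounds as constants, for example `dot_product(x, y, 1, 3)` for two three-element arrays. This plays the role of the dope vector the caller must create before the call.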
Conformant arrays appear in several other languages. Modula-2 inherits them
from Pascal. C supports single-dimensional arrays of dynamic shape as a natural
consequence of its merger of arrays and pointers (to be discussed in Section 7.7.1).
Ada and C99 support not only conformant arrays, but also local arrays of dynamic
shape. Among other things, local arrays can be declared to match the shape of
conformant array parameters, facilitating the implementation of algorithms that
require temporary space for calculations. Figure 7.5 contains a simple example
in C99. Function square accepts an array parameter M of dynamic shape and
allocates a local variable T of the same dynamic shape.
In many languages, including Ada and C99, the shape of a local array becomes
fixed at elaboration time. For such arrays it is still possible to place the space for
the array in the stack frame of its subroutine, but an extra level of indirection is
required (see Figure 7.6). In order to ensure that every local object can be found
using a known offset from the frame pointer, we divide the stack frame into a fixed-size part and a variable-size part. An object whose size is statically known goes in
the fixed-size part. An object whose size is not known until elaboration time goes
in the variable-size part, and a pointer to it, together with a dope vector, goes in
the fixed-size part. If the elaboration of the array is buried in a nested block, the
compiler delays allocating space (i.e., changing the stack pointer) until the block
is entered. It still allocates space for the pointer and the dope vector among the
local variables when the subroutine itself is entered. Records of dynamic shape are
handled in a similar way.
void square(int n, double M[n][n]) {
    double T[n][n];
    for (int i = 0; i < n; i++) {        // compute product into T
        for (int j = 0; j < n; j++) {
            double s = 0;
            for (int k = 0; k < n; k++) {
                s += M[i][k] * M[k][j];
            }
            T[i][j] = s;
        }
    }
    for (int i = 0; i < n; i++) {        // copy back into M
        for (int j = 0; j < n; j++) {
            M[i][j] = T[i][j];
        }
    }
}
Figure 7.5 A dynamic local array in C99. Function square multiplies a matrix by itself and
replaces the original with the product. To do so it needs a scratch array of the same shape as the
parameter. Note that the declarations of M and T both rely on parameter n .
EXAMPLE 7.56 Elaborated arrays in Fortran 90
Fortran 90 allows specification of the bounds of an array to be delayed until
after elaboration, but it does not allow those bounds to change once they have
been defined:
real, dimension (:,:), allocatable :: mat
! mat is two-dimensional, but with unspecified bounds
...
allocate (mat (a:b, 0:m-1))
! first dimension has bounds a..b; second has bounds 0..m-1
...
deallocate (mat)
! implementation is now free to reclaim mat’s space
Execution of an allocate statement can be treated like the elaboration of a
dynamic shape array in a nested block. Execution of a deallocate statement can
be treated like the end of the nested block (restoring the previous stack pointer)
if there are no other arrays beyond the specified one in the stack. Alternatively,
dynamic shape arrays can be allocated in the heap, as described in the following
subsection.
Heap Allocation
Arrays that can change shape at arbitrary times are sometimes said to be fully
dynamic. Because changes in size do not in general occur in LIFO order, stack
allocation will not suffice; fully dynamic arrays must be allocated in the heap.
-- Ada:
procedure foo (size : integer) is
    M : array (1..size, 1..size) of real;
    ...
begin
    ...
end foo;

// C99:
void foo(int size) {
    double M[size][size];
    ...
}
[Stack frame diagram. Labels, from sp toward fp: M (variable-size part of the frame); temporaries, local variables (including the pointer to M and the dope vector), bookkeeping, and return address (fixed-size part of the frame); arguments and returns.]
Figure 7.6 Elaboration-time allocation of arrays in Ada or C99. Here M is a square two-dimensional array whose bounds are determined by a parameter passed to foo at run time. The
compiler arranges for a pointer to M and a dope vector to reside at static offsets from the frame
pointer. M cannot be placed among the other local variables because it would prevent those
higher in the frame from having static offsets. Additional variable-size arrays or records are easily
accommodated.
EXAMPLE 7.57 Dynamic strings in Java and C#
Several languages, including Snobol, Icon, and all the scripting languages, allow
strings—arrays of characters—to change size after elaboration time. Java and
C# provide a similar capability (with a similar implementation), but describe
the semantics differently: string variables in these languages are references to
immutable string objects:
String s = "short";        // This is Java; use lowercase 'string' in C#
...
s = s + " but sweet";      // + is the concatenation operator
Here the declaration String s introduces a string variable, which we initialize
with a reference to the constant string "short" . In the subsequent assignment, +
creates a new string containing the concatenation of the old s and the constant
" but sweet" ; s is then set to refer to this new string, rather than the old.
Java and C# strings, by the way, are not the same as arrays of characters: strings
are immutable, but elements of an array can be changed in place.
Dynamically resizable arrays (other than strings) appear in APL, Common
Lisp, and the various scripting languages. They are also supported by the vector ,
Vector , and ArrayList classes of the C++, Java, and C# libraries, respectively. In
contrast to the allocate -able arrays of Fortran 90, these arrays can change their
shape—in particular, can grow—while retaining their current content. In most
cases, increasing the size will require that the run-time system allocate a larger
block, copy any data that are to be retained from the old block to the new, and
then deallocate the old.
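The allocate, copy, and deallocate sequence can be sketched in C as follows. The type `int_vec` and function `int_vec_push` are hypothetical, and the doubling growth policy is an assumption chosen for illustration; real libraries differ in their policies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable array of ints (names are illustrative). */
typedef struct {
    int *data;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
} int_vec;

/* Append x, growing the block if needed. Returns 0 on success, -1 on failure. */
static int int_vec_push(int_vec *v, int x) {
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? 2 * v->cap : 4;         /* assumed policy */
        int *bigger = malloc(new_cap * sizeof *bigger);   /* allocate a larger block */
        if (bigger == NULL) return -1;
        if (v->len > 0) {
            memcpy(bigger, v->data, v->len * sizeof *bigger);   /* copy retained data */
        }
        free(v->data);                                    /* deallocate the old block */
        v->data = bigger;
        v->cap = new_cap;
    }
    v->data[v->len++] = x;
    return 0;
}
```

A vector starts out as `int_vec v = {NULL, 0, 0};` and the caller eventually releases it with `free(v.data)`. In practice `realloc` can sometimes extend a block in place and avoid the copy.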
If the number of dimensions of a fully dynamic array is statically known, the
dope vector can be kept, together with a pointer to the data, in the stack frame
of the subroutine in which the array was declared. If the number of dimensions
can change, the dope vector must generally be placed at the beginning of the heap
block instead.
In the absence of garbage collection, the compiler must arrange to reclaim the
space occupied by fully dynamic arrays when control returns from the subroutine
in which they were declared. Space for stack-allocated arrays is of course reclaimed
automatically by popping the stack.
7.4.3 Memory Layout
EXAMPLE 7.58 Row-major vs column-major array layout
Arrays in most language implementations are stored in contiguous locations in
memory. In a one-dimensional array, the second element of the array is stored
immediately after the first (subject to alignment constraints); the third is stored
immediately after the second, and so forth. For arrays of records, it is common
for each subsequent element to be aligned at an address appropriate for any type;
small holes between consecutive records may result.
For multidimensional arrays, it still makes sense to put the first element of
the array in the array’s first memory location. But which element comes next?
There are two reasonable answers, called row-major and column-major order. In
row-major order, consecutive locations in memory hold elements that differ by
one in the final subscript (except at the ends of rows). A[2, 4] , for example,
is followed by A[2, 5] . In column-major order, consecutive locations hold elements that differ by one in the initial subscript: A[2, 4] is followed by A[3, 4] .
The layouts for three or more dimensions are analogous. Fortran uses column-major order;
most other languages use row-major order. (Correspondence with Fran Allen9
suggests that column-major order was originally adopted in order to accommodate idiosyncrasies of the console debugger and instruction set of the IBM model
704 computer, on which the language was first implemented.) The advantage of
row-major order is that it makes it easy to define a multidimensional array as an
array of subarrays, as described in Section 7.4.1. With column-major order, the
elements of the subarray would not be contiguous in memory.
9 Fran Allen (1932–) joined IBM’s T. J. Watson Research Center in 1957, and stayed for her entire
professional career. Her seminal paper, Program Optimization [All69] helped launch the field
of code improvement. Her PTRAN (Parallel TRANslation) group, founded in the early 1980s,
developed much of the theory of automatic parallelization. In 1989 Dr. Allen became the first
woman to be named an IBM Fellow. In 2006 she became the first woman to receive the ACM Turing
Award.
[Diagram with two panels: "Row-major order" and "Column-major order".]
Figure 7.7 Row- and column-major memory layout for two-dimensional arrays. In row-major
order, the elements of a row are contiguous in memory; in column-major order, the elements
of a column are contiguous. The second cache line of each array is shaded, on the assumption
that each element is an eight-byte floating-point number, that cache lines are 32 bytes long (a
common size), and that the array begins at a cache line boundary. If the array is indexed from
A[0,0] to A[9,9] , then in the row-major case elements A[0,4] through A[0,7] share a cache line;
in the column-major case elements A[4,0] through A[7,0] share a cache line.
EXAMPLE 7.59 Array layout and cache performance
The difference between row- and column-major layout can be important
for programs that use nested loops to access all the elements of a large, multidimensional array. On modern machines the speed of such loops is often limited
by memory system performance, which depends heavily on the effectiveness of
caching (Section 5.1). Figure 7.7 shows the orientation of cache lines for row- and column-major layout of arrays. When code traverses a small array, all or most
of its elements are likely to remain in the cache through the end of the nested
loops, and the orientation of cache lines will not matter. For a large array, however, lines that are accessed early in the traversal are likely to be evicted to make
room for lines accessed later in the traversal. If array elements are accessed in
order of consecutive addresses, then each miss will bring into the cache not only
the desired element, but the next several elements as well. If elements are accessed
across cache lines instead (i.e., along the rows of a Fortran array, or the columns of
an array in most other languages), then there is a good chance that almost every
access will result in a cache miss, dramatically reducing the performance of the
code. In C, one should write
for (i = 0; i < N; i++) {        /* rows */
    for (j = 0; j < N; j++) {    /* columns */
        ... A[i][j] ...
    }
}
In Fortran:
do j = 1, N          ! columns
    do i = 1, N      ! rows
        ... A(i, j) ...
    end do
end do
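The two traversal orders can be written side by side in C to make the difference in address stride concrete. The array size and function names below are illustrative assumptions. Both functions compute the same sum and differ only in the distance between consecutive accesses, so any speed difference depends on the machine's cache.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 512 };          /* illustrative size */
static double A[N][N];

/* Inner loop varies the last subscript: consecutive addresses in C. */
static double sum_row_order(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += A[i][j];
    return s;
}

/* Inner loop varies the first subscript: each access is
   N * sizeof(double) bytes away from the previous one. */
static double sum_column_order(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += A[i][j];
    return s;
}
```

With 8-byte elements and the 32-byte cache lines assumed in Figure 7.7, `sum_row_order` touches a new line once every four accesses, while `sum_column_order` touches a different line on every access.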
Row-Pointer Layout
Some languages employ an alternative to contiguous allocation for some arrays.
Rather than require the rows of an array to be adjacent, they allow them to
lie anywhere in memory, and create an auxiliary array of pointers to the rows.
If the array has more than two dimensions, it may be allocated as an array of
pointers to arrays of pointers to. . . . This row-pointer memory layout requires more
space in most cases, but has three potential advantages. First, it sometimes allows
individual elements of the array to be accessed more quickly, especially on CISC
machines with slow multiplication instructions (see the discussion of address
calculations below). Second, it allows the rows to have different lengths, without
devoting space to holes at the ends of the rows. This representation is sometimes
called a ragged array. The lack of holes may sometimes offset the increased space
for pointers. Third, it allows a program to construct an array from preexisting
rows (possibly scattered throughout memory) without copying. C, C++, and C#
provide both contiguous and row-pointer organizations for multidimensional
arrays. Technically speaking, the contiguous layout is a true multidimensional
array, while the row-pointer layout is an array of pointers to arrays. Java uses the
row-pointer layout for all arrays.
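The second and third advantages can be seen in a small C sketch. All names here are hypothetical. The ragged array is built from rows that already exist, without copying, while the contiguous alternative pads every row to the longest length.

```c
#include <assert.h>
#include <stddef.h>

/* Preexisting rows of different lengths, anywhere in memory. */
static int row_a[] = {1, 2, 3, 4};
static int row_b[] = {5};
static int row_c[] = {6, 7};

/* Row-pointer ("ragged") array: an array of pointers to the rows; no copying. */
static int *ragged[] = {row_a, row_b, row_c};
static const size_t ragged_len[] = {4, 1, 2};   /* lengths are tracked separately */

/* Contiguous alternative: every row padded to the longest length. */
static int padded[3][4] = {{1, 2, 3, 4}, {5}, {6, 7}};

/* Sum all elements of the ragged array, using double subscripts. */
static int ragged_sum(void) {
    int s = 0;
    for (size_t i = 0; i < 3; i++)
        for (size_t j = 0; j < ragged_len[i]; j++)
            s += ragged[i][j];
    return s;
}
```

Both `ragged[2][1]` and `padded[2][1]` are written with double subscripts, but the first performs a pointer load followed by an index, while the second is a single address computation into one block.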
DESIGN & IMPLEMENTATION
Array layout
The layout of arrays in memory, like the ordering of record fields, is intimately
tied to tradeoffs in design and implementation. While column-major layout
appears to offer no advantages on modern machines, its continued use in Fortran means that programmers must be aware of the underlying implementation
in order to achieve good locality in nested loops. Row-pointer layout, likewise,
has no performance advantage on modern machines (and a likely performance
penalty, at least for numeric code), but it is a more natural fit for the “reference
to object” data organization of languages like Java. Its impacts on space consumption and locality may be positive or negative, depending on the details of
individual applications.
char days[][10] = {
    "Sunday", "Monday", "Tuesday",
    "Wednesday", "Thursday",
    "Friday", "Saturday"
};
...
days[2][3] == 's'; /* in Tuesday */

char *days[] = {
    "Sunday", "Monday", "Tuesday",
    "Wednesday", "Thursday",
    "Friday", "Saturday"
};
...
days[2][3] == 's'; /* in Tuesday */
[Diagrams: for the first declaration (left), a block of seven 10-byte rows holding the day names; for the second (right), an array of seven pointers to the day names stored as separate strings.]
Figure 7.8 Contiguous array allocation vs row pointers in C. The declaration on the left is a true two-dimensional array. The
slashed boxes are NUL bytes; the shaded areas are holes. The declaration on the right is a ragged array of pointers to arrays
of characters. In both cases, we have omitted bounds in the declaration that can be deduced from the size of the initializer
(aggregate). Both data structures permit individual characters to be accessed using double subscripts, but the memory layout
(and corresponding address arithmetic) is quite different.
EXAMPLE 7.60 Contiguous vs row-pointer array layout
By far the most common use of the row-pointer layout in C is to represent arrays
of strings. A typical example appears in Figure 7.8. In this example (representing
the days of the week), the row-pointer memory layout consumes 57 bytes for
the characters themselves (including a NUL byte at the end of each string), plus
28 bytes for pointers (assuming a 32-bit architecture), for a total of 85 bytes.
The contiguous layout alternative devotes 10 bytes to each day (room enough
for Wednesday and its NUL byte), for a total of 70 bytes. The additional space
required for the row-pointer organization comes to 21%. In other cases, row
pointers may actually save space. A Java compiler written in C, for example, would
probably use row pointers to store the character-string representations of the 51
Java keywords and word-like literals. This data structure would use 51 × 4 = 204
bytes for the pointers, plus 343 bytes for the keywords, for a total of 547 bytes
(548 when aligned). Since the longest keyword ( synchronized ) requires 13 bytes
(including space for the terminating NUL ), a contiguous two-dimensional array
would consume 51×13 = 663 bytes (664 when aligned). In this case, row pointers
save a little over 17%.
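The byte counts for the days-of-the-week example can be reproduced with a short C sketch. The function names are illustrative, and the pointer size is fixed at 4 bytes to match the 32-bit assumption in the text rather than taken from `sizeof`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { PTR_BYTES = 4 };   /* assumption from the text: 32-bit pointers */

static const char *day_names[] = {
    "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"
};
enum { NUM_DAYS = sizeof day_names / sizeof day_names[0] };

/* Bytes for the characters alone, counting each terminating NUL. */
static size_t char_bytes(void) {
    size_t total = 0;
    for (size_t i = 0; i < NUM_DAYS; i++) total += strlen(day_names[i]) + 1;
    return total;
}

/* Row-pointer layout: characters plus one pointer per row. */
static size_t row_pointer_bytes(void) {
    return char_bytes() + (size_t)NUM_DAYS * PTR_BYTES;
}

/* Contiguous layout: every row as wide as the longest string plus its NUL. */
static size_t contiguous_bytes(void) {
    size_t widest = 0;
    for (size_t i = 0; i < NUM_DAYS; i++) {
        size_t w = strlen(day_names[i]) + 1;
        if (w > widest) widest = w;
    }
    return (size_t)NUM_DAYS * widest;
}
```

The same functions applied to a different word list show when row pointers win: the saving grows as the longest string becomes much longer than the average one.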
Address Calculations
EXAMPLE 7.61 Indexing a contiguous array
For the usual contiguous layout of arrays, calculating the address of a particular
element is somewhat complicated, but straightforward. Suppose a compiler is
given the following declaration for a three-dimensional array:
A : array [L_1 .. U_1] of array [L_2 .. U_2] of array [L_3 .. U_3] of elem_type;
Let us define constants for the sizes of the three dimensions:
\[
\begin{aligned}
S_3 &= \text{size of elem\_type} \\
S_2 &= (U_3 - L_3 + 1) \times S_3 \\
S_1 &= (U_2 - L_2 + 1) \times S_2
\end{aligned}
\]
Here the size of a row ($S_2$) is the size of an individual element ($S_3$) times the
number of elements in a row (assuming row-major layout). The size of a plane
($S_1$) is the size of a row ($S_2$) times the number of rows in a plane. The address of
A[i, j, k] is then
\[
\begin{aligned}
&\text{address of } A \\
&+ (i - L_1) \times S_1 \\
&+ (j - L_2) \times S_2 \\
&+ (k - L_3) \times S_3
\end{aligned}
\]
As written, this computation involves three multiplications and six additions/subtractions. We could compute the entire expression at run time, but in most cases
a little rearrangement reveals that much of the computation can be performed
at compile time. In particular, if the bounds of the array are known at compile
time, then S1 , S2 , and S3 are compile-time constants, and the subtractions of lower
bounds can be distributed out of the parentheses:
\[
(i \times S_1) + (j \times S_2) + (k \times S_3) + \text{address of } A
- [(L_1 \times S_1) + (L_2 \times S_2) + (L_3 \times S_3)]
\]
EXAMPLE 7.62 Pseudo-assembler for contiguous array indexing
The bracketed expression in this formula is a compile-time constant (assuming
the bounds of A are statically known). If A is a global variable, then the address of
A is statically known as well, and can be incorporated in the bracketed expression.
If A is a local variable of a subroutine (with static shape), then the address of A
can be decomposed into a static offset (included in the bracketed expression) plus
the contents of the frame pointer at run time. We can think of the address of A
plus the bracketed expression as calculating the location of an imaginary array
whose [i, j, k] th element coincides with that of A , but whose lower bound in each
dimension is zero. This imaginary array is illustrated in Figure 7.9.
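The equivalence of the direct and rearranged formulas can be checked with a small C sketch. The bounds chosen below and the names `offset_naive` and `offset_folded` are illustrative assumptions. The functions return byte offsets from the address of A rather than pointers, because the virtual origin may lie outside the array and C does not allow such a pointer to be formed portably.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative static bounds for
   A : array [L1..U1] of array [L2..U2] of array [L3..U3] of int. */
enum { L1 = 2, U1 = 5, L2 = -1, U2 = 3, L3 = 10, U3 = 12 };

static int storage[U1 - L1 + 1][U2 - L2 + 1][U3 - L3 + 1];

/* Sizes in bytes, as in the text. ptrdiff_t keeps the arithmetic signed. */
#define S3 ((ptrdiff_t)sizeof(int))
#define S2 ((U3 - L3 + 1) * S3)
#define S1 ((U2 - L2 + 1) * S2)

/* Direct formula: subtract each lower bound at run time. */
static ptrdiff_t offset_naive(int i, int j, int k) {
    return (i - L1) * S1 + (j - L2) * S2 + (k - L3) * S3;
}

/* Rearranged formula: the bracketed term is a compile-time constant. */
static ptrdiff_t offset_folded(int i, int j, int k) {
    const ptrdiff_t bracket = L1 * S1 + L2 * S2 + L3 * S3;
    return i * S1 + j * S2 + k * S3 - bracket;
}
```

Adding either offset to `(char *)storage` yields the address of the element that the declaration above would call A[i, j, k].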
If A ’s elements are integers, and are allocated contiguously in memory, then
the instruction sequence to load A[i, j, k] into a register looks something like this:
    -- assume i is in r1, j is in r2, and k is in r3
    r4 := r1 × S1
    r5 := r2 × S2
    r6 := &A − L1 × S1 − L2 × S2 − L3 × 4    -- one or two instructions
    r6 := r6 + r4
    r6 := r6 + r5
    r7 := *r6[r3]                            -- load
[Diagram: a three-dimensional array A with lower bounds L1, L2, and L3 and subscripts i, j, and k, with the address of A marked.]
Figure 7.9 Virtual location of an array with nonzero lower bounds. By computing the constant
portions of an array index at compile time, we effectively index into an array whose starting
address is offset in memory, but whose lower bounds are all zero.
EXAMPLE 7.63 Static and dynamic portions of an array index
We have assumed that the hardware provides an indexed addressing mode, and
that it scales its indexing by the size of the quantity loaded (in this case a four-byte
integer).
If i , j , and/or k is known at compile time, then additional portions of the
calculation of the address of A[i, j, k] will move from the dynamic to the static
part of the formula shown above. If all of the subscripts are known, then the
entire address can be calculated statically. Conversely, if any of the bounds of
the array are not known at compile time, then portions of the calculation will
move from the static to the dynamic part of the formula. For example, if L1 is not
known until run time, but k is known to be 3 at compile time, then the calculation
becomes
\[
(i \times S_1) + (j \times S_2) - (L_1 \times S_1) + \text{address of } A
- [(L_2 \times S_2) + (L_3 \times S_3) - (3 \times S_3)]
\]
EXAMPLE 7.64 Indexing complex structures
Again, the bracketed part can be computed at compile time. If lower bounds are
always restricted to zero, as they are in C, then they never contribute to run-time
cost.
In all our examples, we have ignored the issue of dynamic semantic checks
for out-of-bound subscripts. We explore the code for these in Exercise 7.18.
In Section 16.5.2 we will consider code improvement techniques that can be
used to eliminate many checks statically, particularly in enumeration-controlled
loops.
The notion of “static part” and “dynamic part” of an address computation
generalizes to more than just arrays. Suppose, for example, that V is a messy
local array of records containing a nested, two-dimensional array in field M . The
address of V[i].M[3, j] could be calculated as
\[
\begin{array}{ll}
i \times S_1^V & -\,L_1^V \times S_1^V \\
 & +\ M\text{'s offset as a field} \\
+\ j \times S_2^M & +\,(3 - L_1^M) \times S_1^M \\
 & -\,L_2^M \times S_2^M \\
 & +\ \text{fp} \\
 & +\ \text{offset of } V \text{ in frame}
\end{array}
\]
Here the calculations on the left must be performed at run time; the calculations
on the right can be performed at compile time. (The notation for bounds and size
places the name of the variable in a superscript and the dimension in a subscript:
$L_2^M$ is the lower bound of the second dimension of M .)
EXAMPLE 7.65 Pseudo-assembler for row-pointer array indexing
Address calculation for arrays that use row pointers is comparatively straightforward. Using our three-dimensional array A as an example, the expression
A[i, j, k] is equivalent, in C notation, to (*(*A[i])[j])[k] . The instruction sequence
to load A[i, j, k] into a register looks something like this:
    -- assume i is in r1, j is in r2, and k is in r3
1.  r4 := &A             -- one or two instructions
2.  r4 := *r4[r1]
3.  r4 := *r4[r2]
4.  r7 := *r4[r3]
Assuming that the loads at lines 2 and 3 hit in the cache, this code will be comparable in cost to the instruction sequence for contiguous allocation shown above
(given load delays). If the intermediate loads miss in the cache, it will be slower.
On a 1970s CISC machine, the balance would probably tip in favor of the row-pointer code: multiplies would be slower, and memory accesses faster. In any
event (contiguous or row-pointer allocation, old or new machine), important
code improvements will often be possible when several array references use the
same subscript expression, or when array references are embedded in loops.
DESIGN & IMPLEMENTATION
Lower bounds on array indices
In C, the lower bound of every array dimension is always zero. It is often
assumed that the language designers adopted this convention in order to avoid
subtracting lower bounds from indices at run time, thereby avoiding a potential
source of inefficiency. As our discussion has shown, however, the compiler can
avoid any run-time cost by translating to a virtual starting location. (The one
exception to this statement occurs when the lower bound has a very large
absolute value: if any index (scaled by element size) exceeds the maximum
offset available with displacement mode addressing [typically $2^{15}$ bytes on RISC
machines], then subtraction may still be required at run time.)
A more likely explanation lies in the interoperability of arrays and pointers
in C (Section 7.7.1): C’s conventions allow the compiler to generate code for an
index operation on a pointer without worrying about the lower bound of the
array into which the pointer points. Interestingly, Fortran array dimensions
have a default lower bound of 1; unless the programmer explicitly specifies
a lower bound of 0, the compiler must always translate to a virtual starting
location.
CHECK YOUR UNDERSTANDING
27. What is an array slice? For what purposes are slices useful?
28. Is there any significant difference between a two-dimensional array and an
array of one-dimensional arrays?
29. What is the shape of an array?
30. What is a dope vector? What purpose does it serve?
31. Under what circumstances can an array declared within a subroutine be allocated in the stack? Under what circumstances must it be allocated in the heap?
32. What is a conformant array?
33. Discuss the comparative advantages of contiguous and row-pointer layout for
arrays.
34. Explain the difference between row-major and column-major layout for contiguously allocated arrays. Why does a programmer need to know which layout
the compiler uses? Why do most language designers consider row-major layout
to be better?
35. How much of the work of computing the address of an element of an array can
be performed at compile time? How much must be performed at run time?
7.5 Strings
In many languages, a string is simply an array of characters. In other languages,
strings have special status, with operations that are not available for arrays of
other sorts. Particularly powerful string facilities are found in Snobol, Icon, and
the various scripting languages.
As we saw in Section 6.5.4, mechanisms to search for patterns within strings
are a key part of Icon’s distinctive generator-based control flow. Icon has dozens
of built-in string operators, functions, and generators, including sophisticated
pattern-matching facilities based on regular expressions. Perl, Python, Ruby, and
other scripting languages provide similar functionality, though none includes the
full power of Icon’s backtracking search. We will consider the string and pattern-matching facilities of scripting languages in more detail in Section 13.4.2. In the
remainder of this section we focus on the role of strings in more traditional
languages.
EXAMPLE 7.66 Character escapes in C and C++
EXAMPLE 7.67 Char* assignment in C
Almost all programming languages allow literal strings to be specified as a
sequence of characters, usually enclosed in single or double quote marks. Many
languages, including C and its descendants, distinguish between literal characters
(usually delimited with single quotes) and literal strings (usually delimited with
double quotes). Other languages (e.g., Pascal) make no distinction: a character is
just a string of length one. Most languages also provide escape sequences that allow
nonprinting characters and quote marks to appear inside of strings.
C99 and C++ provide a very rich set of escape sequences. An arbitrary character
can be represented by a backslash followed by (a) 1 to 3 octal (base-8) digits,
(b) an x and one or more hexadecimal (base-16) digits, (c) a u and exactly four
hexadecimal digits, or (d) a U and exactly eight hexadecimal digits. The \U notation
is meant to capture the four-byte (32-bit) Unicode character set described in the
sidebar on page 295. The \u notation is for characters in the Basic Multilingual
Plane. Many of the most common control characters also have single-character
escape sequences, many of which have been adopted by other languages as well.
For example, \n is a line feed; \t is a tab; \r is a carriage return; \\ is a backslash.
C# omits the octal sequences of C99 and C++; Java also omits the 32-bit extended
sequences.
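A few of these escapes can be seen in a short C sketch. The variable names are illustrative, and the numeric values in the comments assume an ASCII-compatible execution character set, which the C standard does not guarantee.

```c
#include <assert.h>
#include <string.h>

/* Two spellings of the same character
   (in ASCII, 'A' is 65 = 0101 octal = 0x41). */
static const char octal_a = '\101';   /* backslash + 1 to 3 octal digits */
static const char hex_a   = '\x41';   /* backslash + x + hexadecimal digits */

/* Single-character escapes: line feed, tab, carriage return, backslash. */
static const char escapes[] = "\n\t\r\\";
```

The string `escapes` holds four characters plus the terminating NUL, even though its source spelling takes eight.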
The set of operations provided for strings is strongly tied to the implementation
envisioned by the language designer(s). Several languages that do not in general
allow arrays to change size dynamically do provide this flexibility for strings. The
rationale is twofold. First, manipulation of variable-length strings is fundamental
to a huge number of computer applications, and in some sense “deserves” special
treatment. Second, the fact that strings are one-dimensional, have one-byte elements, and never contain references to anything else makes dynamic-size strings
easier to implement than general dynamic arrays.
Some languages require that the length of a string-valued variable be bound
no later than elaboration time, allowing the variable to be implemented as a
contiguous array of characters in the current stack frame. Pascal and Ada support
a few string operations, including assignment and comparison for lexicographic
ordering. C, on the other hand, provides only the ability to create a pointer to a
string literal. Because of C’s unification of arrays and pointers, even assignment is
not supported. Given the declaration char *s , the statement s = "abc" makes
s point to the constant "abc" in static storage. If s is declared as an array, rather
than a pointer ( char s[4] ), then the statement will trigger an error message from
the compiler. To assign one array into another in C, the program must copy the
elements individually.
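The difference between the pointer and array cases can be sketched as follows. The helper copy_string is a hypothetical routine written for this example (in practice one would probably use a library function); it copies element by element and never writes past the end of the destination.

```c
#include <assert.h>
#include <string.h>

/* Copies src into a caller-supplied array of dst_size bytes, always
   leaving dst NUL-terminated.  Returns the number of characters copied. */
size_t copy_string(char *dst, size_t dst_size, const char *src) {
    size_t i = 0;
    assert(dst != NULL && src != NULL && dst_size > 0);
    /* element-by-element copy, stopping before the array is full */
    while (i + 1 < dst_size && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';
    return i;
}

void string_demo(void) {
    const char *s;      /* pointer: may be aimed at a literal */
    char t[4];          /* array: has storage of its own */

    s = "abc";          /* legal: s now points to static storage */
    /* t = "abc"; */    /* an array is not assignable: compiler error */
    copy_string(t, sizeof t, s);    /* copies 'a', 'b', 'c', and '\0' */
    assert(strcmp(t, "abc") == 0);
}
```

We declare s as const char * rather than char * because the literal it points to should not be modified.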
Other languages allow the length of a string-valued variable to change over its
lifetime, requiring that the variable be implemented as a block or chain of blocks
in the heap. ML and Lisp provide strings as a built-in type. C++, Java, and C#
provide them as predefined classes of object, in the formal, object-oriented sense.
In all these languages a string variable is a reference to a string. Assigning a new
value to such a variable makes it refer to a different object. Concatenation and
other string operators implicitly create new objects. The space used by objects that
are no longer reachable from any variable is reclaimed automatically.
7.6 Sets
EXAMPLE 7.68 Set types in Pascal
A programming language set is an unordered collection of an arbitrary number of
distinct values of a common type. Sets were introduced by Pascal, and are found
in many more recent languages as well. The type from which elements of a set are
drawn is known as the base or universe type. Pascal supports sets of any discrete
type, and provides union, intersection, and difference operations:
var A, B, C : set of char;
    D, E : set of weekday;
...
A := B + C;    (* union; A := {x | x is in B or x is in C} *)
A := B * C;    (* intersection; A := {x | x is in B and x is in C} *)
A := B - C;    (* difference; A := {x | x is in B and x is not in C} *)
Icon supports sets of characters (called cset s), but not sets of any other base
type. Python supports sets of arbitrary type; we describe these in Example 13.69
(page 708). Ada does not provide a type constructor for sets, but its generic facility
can be used to define a set package (module) with functionality comparable to the
sets of Pascal [IBFW91, pp. 242–244]. In a similar vein, sets appear in the standard
libraries of many object-oriented languages, including C++, Java, and C#.
There are many ways to implement sets, including arrays, hash tables, and
various forms of trees. For discrete base types with a modest number of elements,
a characteristic array is a particularly appealing implementation: it employs a bit
vector whose length (in bits) is the number of distinct values of the base type.
A one in the kth position in the bit vector indicates that the kth element of the
base type is a member of the set; a zero indicates that it is not. In a language that
uses ASCII, a set of characters would occupy 128 bits (16 bytes). Operations on
bit-vector sets can make use of fast logical instructions on most machines. Union
is bit-wise or; intersection is bit-wise and; difference is bit-wise not, followed by
bit-wise and.
DESIGN & IMPLEMENTATION
Representing sets
Unfortunately, bit vectors do not work well for large base types: a set of integers,
represented as a bit vector, would consume some 500 megabytes on a 32-bit
machine. With 64-bit integers, a bit-vector set would consume more memory
than is currently contained on all the computers in the world. Because of this
problem, many languages (including early versions of Pascal, but not the ISO
standard) limit sets to base types of fewer than some fixed number of members.
Both 128 and 256 are common limits; they suffice to cover ASCII characters.
A few languages (e.g., early versions of Modula-2) limit base types to the number
of elements that can be represented by a one-word bit vector, but there
is really no excuse for such a severe restriction. A language that permits sets
with very large base types must employ an alternative implementation (e.g., a
hash table). It will still be expensive to represent sets with enormous numbers
of elements, but reasonably easy to represent sets with a modest number of
elements drawn from a very large universe.
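The characteristic-array representation can be sketched in C as follows. This is a minimal illustration for a 128-element (ASCII) base type, stored in two 64-bit words; the type name char_set and the cs_ functions are hypothetical, not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* A set of ASCII characters as a 128-bit characteristic array. */
typedef struct { uint64_t w[2]; } char_set;

char_set cs_empty(void) {
    char_set s = {{0, 0}};
    return s;
}

/* Add character c (0..127) by setting bit c. */
char_set cs_add(char_set s, unsigned c) {
    assert(c < 128);
    s.w[c / 64] |= (uint64_t)1 << (c % 64);
    return s;
}

/* Membership test: is bit c a one? */
int cs_member(char_set s, unsigned c) {
    assert(c < 128);
    return (int)((s.w[c / 64] >> (c % 64)) & 1);
}

/* Union is bit-wise or. */
char_set cs_union(char_set a, char_set b) {
    a.w[0] |= b.w[0];
    a.w[1] |= b.w[1];
    return a;
}

/* Intersection is bit-wise and. */
char_set cs_intersect(char_set a, char_set b) {
    a.w[0] &= b.w[0];
    a.w[1] &= b.w[1];
    return a;
}

/* Difference is bit-wise not (of b), followed by bit-wise and. */
char_set cs_diff(char_set a, char_set b) {
    a.w[0] &= ~b.w[0];
    a.w[1] &= ~b.w[1];
    return a;
}
```

Each set operation touches only two machine words, regardless of how many elements the sets contain.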
7.7 Pointers and Recursive Types
A recursive type is one whose objects may contain one or more references to other
objects of the type. Most recursive types are records, since they need to contain
something in addition to the reference, implying the existence of heterogeneous
fields. Recursive types are used to build a wide variety of “linked” data structures,
including lists and trees.
In languages that use a reference model of variables, it is easy for a record
of type foo to include a reference to another record of type foo : every variable
(and hence every record field) is a reference anyway. In languages that use a value
model of variables, recursive types require the notion of a pointer: a variable (or
field) whose value is a reference to some object. Pointers were first introduced
in PL/I.
In some languages (e.g., Pascal, Ada 83, and Modula-3), pointers are restricted
to point only to objects in the heap. The only way to create a new pointer value
(without using variant records or casts to bypass the type system) is to call a
built-in function that allocates a new object in the heap and returns a pointer to
it. In other languages (e.g., PL/I, Algol 68, C, C++, and Ada 95), one can create a
pointer to a nonheap object by using an “address of ” operator. We will examine
pointer operations and the ramifications of the reference and value models in
more detail in the first subsection below.
DESIGN & IMPLEMENTATION
Implementation of pointers
It is common for programmers (and even textbook writers) to equate pointers
with addresses, but this is a mistake. A pointer is a high-level concept: a reference
to an object. An address is a low-level concept: the location of a word in memory.
Pointers are often implemented as addresses, but not always. On a machine with
a segmented memory architecture, a pointer may consist of a segment id and an
offset within the segment. In a language that attempts to catch uses of dangling
references, a pointer may contain both an address and an access key.
In any language that permits new objects to be allocated from the heap, the
question arises: how and when is storage reclaimed for objects that are no longer
needed? In short-lived programs it may be acceptable simply to leave the storage
unused, but in most cases unused space must be reclaimed, to make room for
other things. A program that fails to reclaim the space for objects that are no
longer needed is said to “leak memory.” If such a program runs for an extended
period of time, it may run out of space and crash.
Many languages, including C, C++, Pascal, and Modula-2, require the programmer to reclaim space explicitly. Other languages, including Modula-3, Java, C#, and
all the functional and scripting languages, require the language implementation
to reclaim unused objects automatically. Explicit storage reclamation simplifies
the language implementation, but raises the possibility that the programmer will
forget to reclaim objects that are no longer live (thereby leaking memory), or
will accidentally reclaim objects that are still in use (thereby creating dangling
references). Automatic storage reclamation (otherwise known as garbage collection) dramatically simplifies the programmer’s task, but imposes certain run-time
costs, and raises the question of how the language implementation is to distinguish garbage from active objects. We will discuss dangling references and garbage
collection further in Sections 7.7.2 and 7.7.3, respectively.
7.7.1 Syntax and Operations
Operations on pointers include allocation and deallocation of objects in the heap,
dereferencing of pointers to access the objects to which they point, and assignment
of one pointer into another. The behavior of these operations depends heavily on
whether the language is functional or imperative, and on whether it employs a
reference or value model for variables/names.
Functional languages generally employ a reference model for names (a purely
functional language has no variables or assignments). Objects in a functional
language tend to be allocated automatically as needed, with a structure determined
by the language implementation. Variables in an imperative language may use
either a value or a reference model, or some combination of the two. In C, Pascal,
or Ada, which employ a value model, the assignment A := B puts the value of B
into A . If we want B to refer to an object, and we want A := B to make A refer to
the object to which B refers, then A and B must be pointers. In Clu and Smalltalk,
which employ a reference model, the assignment A := B always makes A refer to
the same object to which B refers.
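In C, the two behaviors might be contrasted as in the following sketch. The type point and both function names are hypothetical, introduced only for illustration.

```c
struct point { int x, y; };

/* Value model: assigning one struct to another copies its contents. */
int copy_is_independent(void) {
    struct point a = {1, 2};
    struct point b = {3, 4};
    a = b;              /* a now holds a copy of the value of b */
    b.x = 99;           /* later changes to b do not affect a */
    return a.x == 3 && a.y == 4;
}

/* For two names to refer to one object, both must be pointers. */
int pointers_share_object(void) {
    struct point obj = {3, 4};
    struct point *b = &obj;
    struct point *a;
    a = b;              /* a now refers to the object to which b refers */
    b->x = 99;          /* a change made through b ... */
    return a->x == 99;  /* ... is visible through a */
}
```

Both functions return 1: the first because the copy is unaffected by the later change, the second because the change is shared.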
Java charts an intermediate course, in which the usual implementation of the
reference model is made explicit in the language semantics. Variables of built-in
Java types (integers, floating-point numbers, characters, and Booleans) employ a
value model; variables of user-defined types (strings, arrays, and other objects in
the object-oriented sense of the word) employ a reference model. The assignment
A := B in Java places the value of B into A if A and B are of built-in type; it makes A
refer to the object to which B refers if A and B are of user-defined type. C# mirrors
Java by default, but additional language features, explicitly labeled “ unsafe ,” allow
systems programmers to use pointers when desired.
[Figure 7.10: Implementation of a tree in ML, with heap blocks tagged node for R, X, Y, Z, and W, and a block tagged empty. The abstract (conceptual) tree is shown at the lower left.]
Reference Model
EXAMPLE 7.69 Tree type in ML
In ML, the datatype mechanism can be used to declare recursive types:
datatype chr_tree = empty | node of char * chr_tree * chr_tree;
Here a chr_tree is either an empty leaf or a node consisting of a character and
two child trees. (Further details can be found in Section 7.2.4.)
It is natural in ML to include a chr_tree within a chr_tree because every
variable is a reference. The tree node (#"R", node (#"X", empty, empty),
node (#"Y", node (#"Z", empty, empty), node (#"W", empty, empty)))
would most likely be represented in memory as shown in Figure 7.10. Each
individual rectangle in the right-hand portion of this figure represents a block of
storage allocated from the heap. In effect, the tree is a tuple (record) tagged
to indicate that it is a node. This tuple in turn refers to two other tuples that
are also tagged as nodes. At the fringe of the tree are tuples that are tagged as
empty; these contain no further references. Because all empty tuples are the same,
the implementation is free to use just one, and to have every reference point
to it.
EXAMPLE 7.70 Tree type in Lisp
In Lisp, which uses a reference model of variables but is not statically typed, our
tree could be specified textually as '(#\R (#\X ()()) (#\Y (#\Z ()()) (#\W ()()))).
Each level of parentheses brackets the elements of a list. In this case,
the outermost such list contains three elements: the character R and nested lists
to represent the left and right subtrees. (The prefix #\ notation serves the same
purpose as surrounding quotes in other languages.) Semantically, each list is a pair
of references: one to the head and one to the remainder of the list. As we noted in
Section 7.7.1, these semantics are almost always reflected in the implementation
by a cons cell containing two pointers. A binary tree can thus be represented as
a three-element (three cons cell) list, as shown in Figure 7.11. At the top level
of the figure, the first cons cell points to R; the second and third point to nested
lists representing the left and right subtrees. Each block of memory is tagged to
indicate whether it is a cons cell or an atom. An atom is anything other than a
cons cell; that is, an object of a built-in type (integer, real, character, string, etc.), or
a user-defined structure (record) or array. The uniformity of Lisp lists (everything
is a cons cell or an atom) makes it easy to write polymorphic functions, though
without the static type checking of ML.
[Figure 7.11: Implementation of a tree in Lisp. A diagonal slash through a box indicates a null pointer. The C and A tags serve to distinguish the two kinds of memory blocks: cons cells and blocks containing atoms.]
If one programs in a purely functional style in ML or in Lisp, the data structures
created with recursive types turn out to be acyclic. New objects refer to old ones,
but old ones never change, and thus never point to new ones. Circular structures
can be defined only by using the imperative features of the languages. In ML,
these features include an explicit notion of pointer, discussed briefly under “Value
Model” below.
EXAMPLE 7.71 Mutually recursive types in ML
Even when writing in a functional style, one often finds a need for types that
are mutually recursive. In a compiler, for example, it is likely that symbol table
records and syntax tree nodes will need to refer to each other. A syntax tree node
that represents a subroutine call will need to refer to the symbol table record that
represents the subroutine. The symbol table record, for its part, will need to refer
to the syntax tree node at the root of the subtree that represents the subroutine’s
code. If types are declared one at a time, and if names must be declared before
they can be used, then whichever mutually recursive type is declared first will be
unable to refer to the other. ML addresses this problem by allowing types to be
declared together in a group:
datatype sym_tab_rec = variable of ...
                     | type of ...
                     | ...
                     | subroutine of {code : syn_tree_node, ...}
and syn_tree_node    = expression of ...
                     | loop of ...
                     | ...
                     | subr_call of {subr : sym_tab_rec, ...};
Mutually recursive types of this sort are trivial in Lisp, since it is dynamically typed.
(Common Lisp includes a notion of structures, but field types are not declared.
In simpler Lisp dialects programmers use nested lists in which fields are merely
positional conventions.)
Value Model
EXAMPLE 7.72 Tree types in Pascal, Ada, and C
In Pascal, our tree data type would be declared as follows:
type chr_tree_ptr = ^chr_tree;
     chr_tree = record
         left, right : chr_tree_ptr;
         val : char
     end;
The Ada declaration is similar:
type chr_tree;
type chr_tree_ptr is access chr_tree;
type chr_tree is record
    left, right : chr_tree_ptr;
    val : character;
end record;
In C, the equivalent declaration is
struct chr_tree {
    struct chr_tree *left, *right;
    char val;
};
As mentioned in Section 3.3.3, Pascal permits forward references in the declaration
of pointer types, to support recursive types. Ada and C use incomplete type
declarations instead.
EXAMPLE 7.73 Allocating heap nodes
No aggregate syntax is available for linked data structures in Pascal, Ada, or C;
a tree must be constructed node by node. To allocate a new node from the heap,
the programmer calls a built-in function. In Pascal:
new(my_ptr);
In Ada:
my_ptr := new chr_tree;
In C:
my_ptr = malloc(sizeof(struct chr_tree));
C’s malloc is defined as a library function, not a built-in part of the language
(though some compilers recognize and optimize it as a special case). The
programmer must specify the size of the allocated object explicitly, and while the
return value (of type void*) can be assigned into any pointer, the assignment is
not type-safe.
EXAMPLE 7.74 Object-oriented allocation of heap nodes
C++, Java, and C# replace malloc with a built-in, type-safe new:
my_ptr = new chr_tree( arg list );
In addition to “knowing” the size of the requested type, the C++/Java/C# new will
automatically call any user-specified constructor (initialization) function, passing
the specified argument list. In a similar but less flexible vein, Ada’s new may specify
an initial value for the allocated object:
my_ptr := new chr_tree'(null, null, 'X');
EXAMPLE 7.75 Pointer-based tree
After we have allocated and linked together appropriate nodes in C, Pascal, or
Ada, our tree example is likely to be implemented as shown in Figure 7.12. As in
Lisp, a leaf is distinguished from an internal node simply by the fact that its two
pointer fields are null.
[Figure 7.12: Typical implementation of a tree in a language with explicit pointers, with nodes R, X, Y, Z, and W. As in Figure 7.11, a diagonal slash through a box indicates a null pointer.]
EXAMPLE 7.76 Pointer dereferencing
To access the object referred to by a pointer, most languages use an explicit
dereferencing operator. In Pascal and Modula this operator takes the form of a
postfix “up-arrow”:
my_ptr^.val := 'X';
In C it is a prefix star:
(*my_ptr).val = 'X';
Because pointers so often refer to records ( struct s), for which the prefix notation
is awkward, C also provides a postfix “right-arrow” operator that plays the role of
the “up-arrow dot” combination in Pascal:
my_ptr->val = 'X';
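Putting allocation and dereferencing together, the running tree example might be built node by node in C as in the sketch below. The helper names new_node, build_example, and free_tree are hypothetical, and for brevity build_example does not recover from allocation failure.

```c
#include <stdlib.h>

struct chr_tree {
    struct chr_tree *left, *right;
    char val;
};

/* Allocate one node from the heap; returns NULL if malloc fails. */
struct chr_tree *new_node(char val, struct chr_tree *left,
                          struct chr_tree *right) {
    struct chr_tree *n = malloc(sizeof *n);
    if (n != NULL) {
        n->val = val;           /* same as (*n).val = val */
        n->left = left;
        n->right = right;
    }
    return n;
}

/* Build the example tree: R has children X and Y; Y has children
   Z and W.  Leaves have two null pointer fields. */
struct chr_tree *build_example(void) {
    struct chr_tree *x = new_node('X', NULL, NULL);
    struct chr_tree *z = new_node('Z', NULL, NULL);
    struct chr_tree *w = new_node('W', NULL, NULL);
    struct chr_tree *y = new_node('Y', z, w);
    return new_node('R', x, y);
}

/* Reclaim every node explicitly, children before parent. */
void free_tree(struct chr_tree *t) {
    if (t == NULL) return;
    free_tree(t->left);
    free_tree(t->right);
    free(t);
}
```

Writing sizeof *n rather than naming the type keeps the size in step with the declaration of n, which removes one opportunity for the mismatch that malloc itself cannot detect.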
EXAMPLE 7.77 Implicit dereferencing in Ada
On the assumption that pointers almost always refer to records, Ada dispenses
with dereferencing altogether. The same dot-based syntax can be used to access
either a field of the record foo or a field of the record pointed to by foo , depending
on the type of foo :
T : chr_tree;
P : chr_tree_ptr;
...
T.val := 'X';
P.val := 'Y';
In those cases in which one actually wants to name the entire object referred to by
a pointer, Ada provides a special “pseudofield” called all :
T := P.all;
In essence, pointers in Ada are automatically dereferenced when needed.
EXAMPLE 7.78 Pointer dereferencing in ML
The imperative features of ML include an assignment statement, but this
statement requires that the left-hand side be a pointer: its effect is to make the pointer
refer to the object on the right-hand side. To access the object referred to by a
pointer, one uses an exclamation point as a prefix dereferencing operator:
val p = ref 2;          (* p is a pointer to 2 *)
...
p := 3;                 (* p now points to 3 *)
...
let val n = !p in ...   (* n is simply 3 *)
ML thus makes the distinction between l-values and r-values very explicit. Most
languages blur the distinction by implicitly dereferencing variables on the
right-hand side of every assignment statement. Ada blurs the distinction further by
dereferencing pointers automatically in certain circumstances.
The imperative features of Lisp do not include a dereferencing operator. Since
every object has a self-evident type, and assignment is performed using a small
set of built-in operators, there is never any ambiguity as to what is intended.
EXAMPLE 7.79 Assignment in Lisp
Assignment in Common Lisp employs the setf operator (Scheme uses set! ,
set-car! , and set-cdr! ), rather than the more common := . For example, if
foo refers to a list, then (cdr foo) is the right-hand (“rest of list”) pointer of the
first node in the list, and the assignment (set-cdr! foo foo) makes this pointer
refer back to foo , creating a one-node circular list:
[Figure: two box-and-pointer diagrams of the list foo, with cons cells tagged C and the atoms a and b tagged A.]
Pointers and Arrays in C
EXAMPLE 7.80 Array names and pointers in C
Pointers and arrays are closely linked in C. Consider the following declarations:
int n;
int *a;         /* pointer to integer */
int b[10];      /* array of 10 integers */
Now all of the following are valid:
1. a = b;         /* make a point to the initial element of b */
2. n = a[3];
3. n = *(a+3);    /* equivalent to previous line */
4. n = b[3];
5. n = *(b+3);    /* equivalent to previous line */
In most contexts, an unsubscripted array name in C is automatically converted
to a pointer to the array’s first element (the one with index zero), as shown here
in line 1. (Line 5 embodies the same conversion.) Lines 3 and 5 illustrate pointer
arithmetic: Given a pointer to an element of an array, the addition of an integer
k produces a pointer to the element k positions later in the array (earlier if k
is negative). The prefix * is a pointer dereference operator. Pointer arithmetic is
valid only within the bounds of a single array, but C compilers are not required to
check this.
Remarkably, the subscript operator [ ] in C is actually defined in terms of
pointer arithmetic: lines 2 and 4 are syntactic sugar for lines 3 and 5, respectively. More precisely, E1[E2] , for any expressions E1 and E2 , is defined to be
(*((E1)+(E2))) , which is of course the same as (*((E2)+(E1))) . (Extra parentheses have been used in this definition to avoid any questions of precedence
if E1 and E2 are complicated expressions.) Correctness requires only that one
operand of [ ] have an array or pointer type and the other have an integral type.
Thus A[3] is equivalent to 3[A] , something that comes as a surprise to most
programmers.
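These equivalences are easy to check directly. The following sketch (the function name is ours) compares the four spellings of "element 3" and the commuted subscript form.

```c
/* Returns 1 if all the spellings of "element 3 of b" agree. */
int subscript_forms_agree(void) {
    int b[10];
    int *a;
    int i;

    for (i = 0; i < 10; i++)
        b[i] = i * i;
    a = b;                      /* the array name converts to &b[0] */

    return a[3] == *(a + 3)     /* subscripting is pointer arithmetic */
        && b[3] == *(b + 3)
        && b[3] == 3[b]         /* legal, if surprising */
        && a + 3 == &b[3];      /* a + 3 points at element 3 of b */
}
```

The form 3[b] is shown only to illustrate the definition; it would be a poor choice in real code.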
EXAMPLE 7.81 Pointer comparison and subtraction in C
In addition to allowing an integer to be added to a pointer, C allows pointers
to be subtracted from one another or compared for ordering, provided that they
refer to elements of the same array. The comparison p < q, for example, tests to
see if p refers to an element closer to the beginning of the array than the one
referred to by q. The expression p - q returns the number of array positions that
separate the elements to which p and q refer. All arithmetic operations on pointers
“scale” their results as appropriate, based on the size of the referenced objects.
For multidimensional arrays with row-pointer layout, a[i][j] is equivalent to
(*(a+i))[j] or *(a[i]+j) or *(*(a+i)+j).
EXAMPLE 7.82 Pointer and array declarations in C
Despite the interoperability of pointers and arrays in C, programmers need
to be aware that the two are not the same, particularly in the context of variable
declarations, which need to allocate space when elaborated. The declaration of
a pointer variable allocates space to hold a pointer, while the declaration of an
array variable allocates space to hold the whole array. In the case of an array
the declaration must specify a size for each dimension. Thus int *a[n], when
elaborated, will allocate space for n row pointers; int a[n][m] will allocate space
for a two-dimensional array with contiguous layout (see footnote 10). As a
convenience, a variable declaration that includes initialization to an aggregate can
omit the size of the outermost dimension if that information can be inferred from
the contents of the aggregate:
int a[][2] = {{1, 2}, {3, 4}, {5, 6}};    // three rows
EXAMPLE 7.83 Arrays as parameters in C
When an array is included in the argument list of a function call, C passes a
pointer to the first element of the array, not the array itself. For a one-dimensional
array of integers, the corresponding formal parameter may be declared as
int a[ ] or int *a. For a two-dimensional array of integers with row-pointer
layout, the formal parameter may be declared as int *a[ ] or int **a. For a
two-dimensional array with contiguous layout, the formal parameter may be declared
as int a[ ][m] or int (*a)[m]. The size of the first dimension is irrelevant;
all that is passed is a pointer, and C performs no dynamic checks to ensure that
references are within the bounds of the array.
DESIGN & IMPLEMENTATION
Stack smashing
The lack of bounds checking on array subscripts and pointer arithmetic is a
major source of bugs and security problems in C. Many of the most infamous
Internet viruses have propagated by means of stack smashing, a particularly
nasty form of buffer overflow attack. Consider a (very naive) routine designed
to read a number from an input stream:
int get_acct_num(FILE *s) {
    char buf[100];
    char *p = buf;
    do {
        /* read from stream s: */
        *p = getc(s);
    } while (*p++ != '\n');
    *p = '\0';
    /* convert ascii to int: */
    return atoi(buf);
}
[Figure: stack layout, with labels for the direction of stack growth, buf, higher addresses, the return address, and the previous (calling) frame.]
If the stream provides more than 100 characters without a newline ('\n'),
those characters will overwrite memory beyond the confines of buf, as shown
by the large white arrow in the figure. A careful attacker may be able to invent
a string whose bits include both a sequence of valid machine instructions and
a replacement value for the subroutine’s return address. When the routine
attempts to return, it will jump into the attacker’s instructions instead.
Stack smashing can be prevented by manually checking array bounds in C,
or by configuring the hardware to prevent the execution of instructions in the
stack (see the sidebar on page 179). It would never have been a problem in
the first place, however, if C had been designed for automatic bounds checks.
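The stack smashing sidebar observes that the overflow in get_acct_num can be prevented by checking array bounds manually. One way this might be done is sketched below; it is an illustration of the idea, not a complete input-validation routine. Characters beyond the capacity of buf are read and discarded, and the result of getc is held in an int so that end-of-file can be detected.

```c
#include <stdio.h>
#include <stdlib.h>

/* Bounded variant of get_acct_num: at most sizeof buf - 1 characters
   are stored, so the terminating '\0' always fits within buf. */
int get_acct_num(FILE *s) {
    char buf[100];
    size_t n = 0;
    int c;                          /* int, so EOF is distinguishable */

    /* read from stream s up to a newline or end-of-file: */
    while ((c = getc(s)) != EOF && c != '\n') {
        if (n < sizeof buf - 1)
            buf[n++] = (char)c;     /* store only while room remains */
    }
    buf[n] = '\0';
    /* convert ascii to int (atoi, as in the original, reports no errors): */
    return atoi(buf);
}
```

However long the input line, no store ever falls outside buf, so the return address cannot be overwritten by this routine.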
DESIGN & IMPLEMENTATION
Pointers and arrays
Many C programs use pointers instead of subscripts to iterate over the elements
of arrays. Before the development of modern optimizing compilers,
pointer-based array traversal often served to eliminate redundant address calculations,
thereby leading to faster code. With modern compilers, however, the opposite
may be true: redundant address calculations can be identified as common
subexpressions, and certain other code improvements are easier for indices
than they are for pointers. In particular, as we shall see in Chapter 16, pointers
make it significantly more difficult for the code improver to determine when
two l-values may be aliases for one another.
Today the use of pointer arithmetic is mainly a matter of personal taste:
some C programmers consider pointer-based algorithms to be more elegant
than their array-based counterparts; others simply find them harder to read.
Certainly the fact that arrays are passed as pointers makes it natural to write
subroutines in the pointer style.
10 To read declarations in C, it is helpful to apply the following rule: start at the name of the variable
and work right as far as possible, subject to parentheses; then work left as far as possible; then
jump out a level of parentheses and repeat. Thus int *a[n] means that a is an n-element array
of pointers to integers, while int (*a)[n] means that a is a pointer to an n-element array of
integers.
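The declaration forms discussed above can be exercised with a short sketch. The names ROWS, COLS, sum_contiguous, sum_row_pointers, and layouts_agree are hypothetical; the same six integers are summed once through a pointer to a row (contiguous layout) and once through an array of row pointers.

```c
#include <assert.h>

enum { ROWS = 3, COLS = 2 };

/* Contiguous layout: a is a pointer to an array of COLS ints.
   The number of rows must be passed separately. */
int sum_contiguous(int (*a)[COLS], int rows) {
    int i, j, sum = 0;
    for (i = 0; i < rows; i++)
        for (j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Row-pointer layout: a points to the first of several row pointers. */
int sum_row_pointers(int **a, int rows, int cols) {
    int i, j, sum = 0;
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            sum += a[i][j];
    return sum;
}

int layouts_agree(void) {
    int m[][COLS] = {{1, 2}, {3, 4}, {5, 6}};   /* three rows, inferred */
    int *rows[ROWS];            /* array of ROWS pointers to int */
    int (*p)[COLS] = m;         /* pointer to an array of COLS ints */
    int i;

    assert((int)(sizeof m / sizeof m[0]) == ROWS);
    for (i = 0; i < ROWS; i++)
        rows[i] = m[i];         /* each row pointer refers to a row of m */

    return sum_contiguous(m, ROWS) == 21
        && sum_contiguous(p, ROWS) == 21
        && sum_row_pointers(rows, ROWS, COLS) == 21;
}
```

The subscript expression a[i][j] is written identically in both functions, but the compiler generates different address arithmetic for each, based on the declared type of a.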
In all cases, a declaration must allow the compiler (or human reader) to
determine the size of the elements of an array or, equivalently, the size of the objects
referred to by a pointer. Thus neither int a[ ][ ] nor int (*a)[ ] is a valid
variable or parameter declaration: neither provides the compiler with the size
information it needs to generate code for a + i or a[i].
EXAMPLE 7.84 Sizeof in C
The built-in sizeof operator returns the size in bytes of an object or type.
When given an array as argument it returns the size of the entire array. When
given a pointer as argument it returns the size of the pointer itself. If a is an
array, sizeof(a) / sizeof(a[0]) returns the number of elements in the array.
Similarly, if pointers occupy 4 bytes and double-precision floating-point numbers
occupy 8 bytes, then given
double *a;          /* pointer to double */
double (*b)[10];    /* pointer to array of 10 doubles */
we have sizeof(a) = sizeof(b) = 4, sizeof(*a) = sizeof(*b[0]) = 8, and
sizeof(*b) = 80. In most cases, sizeof can be evaluated at compile time. The
principal exception occurs for variable-length arrays, whose size may not be known
until elaboration time:
void f(int len) {
    int A[len];
    /* sizeof(A) == len * sizeof(int) */
}
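The relationships among these sizes can be checked without assuming particular byte counts. In the sketch below the macro ARRAY_LEN and both function names are hypothetical. Variable-length arrays are a C99 feature that later standards make optional, so vla_bytes assumes a compiler that supports them.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements in a true array (not a pointer). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Size of a variable-length array, computed at run time. */
size_t vla_bytes(int len) {
    assert(len > 0);
    int A[len];                 /* size fixed at elaboration time */
    return sizeof A;            /* len * sizeof(int) */
}

int sizes_consistent(void) {
    double x[10];
    double *a = x;              /* pointer to double */
    double (*b)[10] = &x;       /* pointer to array of 10 doubles */

    return ARRAY_LEN(x) == 10
        && sizeof(a) == sizeof(b)               /* both are pointers */
        && sizeof(*a) == sizeof(double)
        && sizeof(*b) == 10 * sizeof(double)    /* the whole array */
        && sizeof((*b)[0]) == sizeof(double);
}
```

ARRAY_LEN gives a meaningful answer only when its argument is an array; applied to a pointer parameter it would divide the size of the pointer by the size of one element.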
CHECK YOUR UNDERSTANDING
36. Name three languages that provide particularly extensive support for character
strings.
37. Why might a language permit operations on strings that it does not provide
for arrays?
38. What are the strengths and weaknesses of the bit-vector representation for
sets? How else might sets be implemented?
39. Discuss the tradeoffs between pointers and the recursive types that arise naturally in a language with a reference model of variables.
40. Summarize the ways in which one dereferences a pointer in various programming languages.
41. What is the difference between a pointer and an address?
42. Discuss the advantages and disadvantages of the interoperability of pointers
and arrays in C.
43. Under what circumstances must the bounds of a C array be specified in its
declaration?
7.7.2 Dangling References
EXAMPLE 7.85 Explicit storage reclamation
When a heap-allocated object is no longer live, a long-running program needs
to reclaim the object’s space. Stack objects are reclaimed automatically as part of
the subroutine calling sequence. How are heap objects reclaimed? There are two
alternatives. Languages like Pascal, C, and C++ require the programmer to reclaim
an object explicitly. In Pascal:
dispose(my_ptr);
In C:
free(my_ptr);
In C++:
delete my_ptr;
C++ provides additional functionality: prior to reclaiming the space, it
automatically calls any user-provided destructor function for the object. A destructor can
reclaim space for subsidiary objects, remove the object from indices or tables, print
messages, or perform any other operation appropriate at the end of the object’s
lifetime.
EXAMPLE 7.86 Dangling reference to a stack variable in C++
A dangling reference is a live pointer that no longer points to a valid object.
In languages like Algol 68 or C, which allow the programmer to create pointers
to stack objects, a dangling reference may be created when a subroutine returns
while some pointer in a wider scope still refers to a local object of that subroutine:
int i = 3;
int *p = &i;
...
void foo() { int n = 5; p = &n; }
...
cout << *p;     // prints 3
foo();
...
cout << *p;     // undefined behavior: n is no longer live
EXAMPLE 7.87 Dangling reference to a heap variable in C++
In a language with explicit reclamation of heap objects, a dangling reference is
created whenever the programmer reclaims an object to which pointers still refer:
int *p = new int;
*p = 3;
...
cout << *p;     // prints 3
delete p;
...
cout << *p;     // undefined behavior: *p has been reclaimed
Note that even if the reclamation operation were to change its argument to a null
pointer, this would not solve the problem, because other pointers might still refer
to the same object.
Because a language implementation may reuse the space of reclaimed stack
and heap objects, a program that uses a dangling reference may read or write bits
in memory that are now part of some other object. It may even modify bits that
are now part of the implementation’s bookkeeping information, corrupting the
structure of the stack or heap.
Algol 68 addresses the problem of dangling references to stack objects by forbidding a pointer from pointing to any object whose lifetime is briefer than that
of the pointer itself. Unfortunately, this rule is difficult to enforce. Among other
things, since both pointers and objects to which pointers might refer can be passed
as arguments to subroutines, dynamic semantic checks are possible only if reference parameters are accompanied by a hidden indication of lifetime. Ada 95 has a
more restrictive rule that is easier to enforce: it forbids a pointer from pointing to
any object whose lifetime is briefer than that of the pointer’s type.
IN MORE DEPTH
On the PLP CD we consider two mechanisms that are sometimes used to catch
dangling references at run time. Tombstones introduce an extra level of indirection
on every pointer access. When an object is reclaimed, the indirection word (tombstone) is marked in a way that invalidates future references to the object. Locks
and keys add a word to every pointer and to every object in the heap; these words
must match for the pointer to be valid. Tombstones can be used in languages that
permit pointers to nonheap objects, but they introduce the secondary problem of
reclaiming the tombstones themselves. Locks and keys are somewhat simpler, but
they work only for objects in the heap.
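To make the tombstone idea concrete, here is a minimal sketch in C. It is our own illustration under simplifying assumptions, not the mechanism as presented on the PLP CD: the names tombstone, ts_ptr, ts_new, ts_delete, and ts_deref are hypothetical, and a reclaimed object is marked simply by storing a null pointer in its tombstone.

```c
#include <assert.h>
#include <stdlib.h>

/* The tombstone is an extra word between every pointer and its object. */
typedef struct { void *object; } tombstone;

/* A "pointer" in this scheme refers to the tombstone, never directly
   to the object. */
typedef tombstone *ts_ptr;

/* Allocate an object of the given size, together with its tombstone. */
ts_ptr ts_new(size_t size) {
    tombstone *t = malloc(sizeof *t);
    if (t == NULL) return NULL;
    t->object = malloc(size);
    if (t->object == NULL) { free(t); return NULL; }
    return t;
}

/* Reclaim the object and mark the tombstone as dead.  The tombstone
   itself is not freed here: reclaiming tombstones is the secondary
   problem mentioned in the text. */
void ts_delete(ts_ptr p) {
    assert(p != NULL);
    free(p->object);
    p->object = NULL;
}

/* Checked dereference: yields NULL, rather than a stale address, if
   the object has been reclaimed. */
void *ts_deref(ts_ptr p) {
    assert(p != NULL);
    return p->object;
}
```

Because every copy of the pointer goes through the same tombstone, reclaiming the object through one copy is visible through all the others, which is precisely what changing a single pointer to null cannot achieve.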
7.7.3 Garbage Collection
Explicit reclamation of heap objects is a serious burden on the programmer and a
major source of bugs (memory leaks and dangling references). The code required
to keep track of object lifetimes makes programs more difficult to design, implement, and maintain. An attractive alternative is to have the language implementation notice when objects are no longer useful and reclaim them automatically.
Automatic reclamation (otherwise known as garbage collection) is more-or-less
essential for functional languages: delete is a very imperative sort of operation,
and the ability to construct and return arbitrary objects from functions means
that many objects that would be allocated on the stack in an imperative language
must be allocated from the heap in a functional language, to give them unlimited
extent.
Over time, automatic garbage collection has become popular for imperative
languages as well. It can be found in, among others, Clu, Cedar, Modula-3, Java,
C#, and all the major scripting languages. Automatic collection is difficult to
implement, but the difficulty pales in comparison to the convenience enjoyed by
programmers once the implementation exists. Automatic collection also tends to
be slower than manual reclamation, though it eliminates any need to check for
dangling references.
Reference Counts
When is an object no longer useful? One possible answer is: when no pointers to
it exist.11 The simplest garbage collection technique places a counter in
each object that keeps track of the number of pointers that refer to the object.
When the object is created, this reference count is set to 1, to represent the pointer
returned by the new operation. When one pointer is assigned into another, the
run-time system decrements the reference count of the object formerly referred to
by the assignment’s left-hand side, and increments the count of the object referred
to by the right-hand side. On subroutine return, the calling sequence epilogue
must decrement the reference count of any object referred to by a local pointer
DESIGN & IMPLEMENTATION
Garbage collection
Garbage collection presents a classic tradeoff between convenience and safety
on the one hand and performance on the other. Manual storage reclamation,
implemented correctly by the application program, is almost invariably faster
than any automatic garbage collector. It is also more predictable: automatic
collection is notorious for its tendency to introduce intermittent “hiccups” in
the execution of real-time or interactive programs.
Ada takes the unusual position of refusing to take a stand: the language
design makes automatic garbage collection possible, but implementations are
not required to provide it, and programmers can request manual reclamation
with a built-in routine called Unchecked_Deallocation. The Ada 95 version
of the language provides extensive facilities whereby programmers can implement their own storage managers (garbage collected or not), with different
types of pointers corresponding to different storage “pools.”
In a similar vein, the Real Time Specification for Java allows the programmer
to create so-called scoped memory areas that are accessible to only a subset of the
currently running threads. When all threads with access to a given area terminate, the area is reclaimed in its entirety. Objects allocated in a scoped memory
area are never examined by the garbage collector; performance anomalies due
to garbage collection can therefore be avoided by providing scoped memory to
every real-time thread.
11 Throughout the following discussion we will use the pointer-based terminology of languages
with a value model of variables. The techniques apply equally well, however, to languages with a
reference model of variables.
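[Figure 7.13 diagram. Top: the stack variable stooges points into a circular heap list of three cells, "larry" (count 2), "moe" (count 1), and "curly" (count 1). Bottom: after stooges := nil; the three cells remain in the heap, each with count 1.]
Figure 7.13 Reference counts and circular lists. The list shown here cannot be found via any
program variable, but because it is circular, every cell contains a nonzero count.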
EXAMPLE 7.88 Reference counts and circular structures
that is about to be destroyed. When a reference count reaches zero, its object can
be reclaimed. Recursively, the run-time system must decrement counts for any
objects referred to by pointers within the object being reclaimed, and reclaim
those objects if their counts reach zero. To prevent the collector from following
garbage addresses, each pointer must be initialized to null at elaboration time.
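The bookkeeping just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of any particular run-time system: the names Cell, assign, and release are hypothetical, pointer fields are modeled as dictionary entries, and the heap is modeled as a set of live objects.

```python
class Cell:
    """A heap object carrying an explicit reference count (hypothetical layout)."""
    def __init__(self, name, heap):
        self.name = name
        self.count = 1        # represents the pointer returned by "new"
        self.fields = {}      # pointer fields; a missing entry stands for null
        self.heap = heap
        heap.add(self)


def release(obj):
    """Destroy one pointer to obj; reclaim obj recursively if its count reaches zero."""
    if obj is None:
        return
    obj.count -= 1
    if obj.count == 0:
        for target in obj.fields.values():
            release(target)           # pointers inside a reclaimed object die too
        obj.heap.discard(obj)


def assign(holder, field, new_target):
    """holder.field := new_target, adjusting the counts on both sides."""
    if new_target is not None:
        new_target.count += 1         # increment first, in case old and new coincide
    release(holder.fields.get(field)) # object formerly referred to by the left-hand side
    holder.fields[field] = new_target
```

A subroutine epilogue would call release once for each local pointer about to be destroyed. Incrementing before decrementing in assign avoids reclaiming an object that is being assigned over itself.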
In order for reference counts to work, the language implementation must be
able to identify the location of every pointer. When a subroutine returns, it must
be able to tell which words in the stack frame represent pointers; when an object
in the heap is reclaimed, it must be able to tell which words within the object
represent pointers. The standard technique to track this information relies on type
descriptors generated by the compiler. There is one descriptor for every distinct
type in the program, plus one for the stack frame of each subroutine, and one
for the set of global variables. Most descriptors are simply a table that lists the
offsets within the type at which pointers can be found, together with the addresses
of descriptors for the types of the objects referred to by those pointers. For a
tagged variant record (discriminated union) type, the descriptor is a bit more
complicated: it must contain a list of values (or ranges) for the tag, together with
a table for the corresponding variant. For untagged variant records, there is no
acceptable solution: reference counts work only if the language is strongly typed
(but see the discussion of “Conservative Collection” on page 364).
The most important problem with reference counts stems from their definition of a “useful object.” While it is definitely true that an object is useless if no
references to it exist, it may also be useless when references do exist. As shown
in Figure 7.13, reference counts may fail to collect circular structures. They work
well only for structures that are guaranteed to be noncircular. Many language
implementations use reference counts for variable-length strings; strings never
contain references to anything else. Perl uses reference counts for all dynamically
allocated data; the manual warns the programmer to break cycles manually when
data aren’t needed anymore. Some purely functional languages may also be able to
use reference counts safely in all cases, if the lack of an assignment statement prevents them from introducing circularity. Finally, reference counts can be used to
reclaim tombstones. While it is certainly possible to create a circular structure with
tombstones, the fact that the programmer is responsible for explicit deallocation
of heap objects implies that reference counts will fail to reclaim tombstones only
when the programmer has failed to reclaim the objects to which they refer.
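The failure of reference counts on circular structures can be observed directly in CPython, whose memory manager combines reference counts with a separate tracing collector for cycles. The sketch below assumes CPython specifically; the Node class and variable names are ours, chosen to echo Figure 7.13. With the cycle collector switched off, the circular list survives the loss of its last external pointer.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self.next = None


gc.disable()                          # leave only reference counting active
larry, moe, curly = Node("larry"), Node("moe"), Node("curly")
larry.next, moe.next, curly.next = moe, curly, larry   # circular list
probe = weakref.ref(larry)            # observes the cell without keeping it alive
del larry, moe, curly                 # no program variable reaches the list now

alive_after_del = probe() is not None     # counts are still nonzero
gc.collect()                              # run the tracing cycle collector by hand
alive_after_collect = probe() is not None
gc.enable()
```

Under these assumptions the list is still allocated after the del, and disappears only once the tracing pass runs.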
Tracing Collection
A better definition of a “useful” object is one that can be reached by following a
chain of valid pointers starting from something that has a name (i.e., something
outside the heap). According to this definition, the blocks in the bottom half of
Figure 7.13 are useless, even though their reference counts are nonzero. Tracing
collectors work by recursively exploring the heap, starting from external pointers,
to determine what is useful.
Mark-and-Sweep The classic mechanism to identify useless blocks, under this
more accurate definition, is known as mark-and-sweep. It proceeds in three main
steps, executed by the garbage collector when the amount of free space remaining
in the heap falls below some minimum threshold.
DESIGN & IMPLEMENTATION
What exactly is garbage?
Reference counting implicitly defines a garbage object as one to which no
pointers exist. Tracing implicitly defines it as an object that is no longer reachable from outside the heap. Ideally, we’d like an even stronger definition: a
garbage object is one that the program will never use again. We settle for
nonreachability because this ideal definition is uncomputable. The difference
can matter in practice: if a program maintains a pointer to an object it will never
use again, then the garbage collector will be unable to reclaim it. If the number
of such objects grows with time, then the program has a memory leak, despite
the presence of a garbage collector. (Trivially we could imagine a program that
added every newly allocated object to a global list, but never actually perused
the list. Such a program would defeat the collector entirely.)
For the sake of space efficiency, programmers are advised to “zero out” any
pointers they no longer need. Doing this can be difficult, but not as difficult
as fully manual reclamation: in particular, we do not need to realize when
we are zeroing the last pointer to a given object. For the same reason, dangling
references can never arise: the garbage collector will refrain from reclaiming
any object that is reachable along some other path.
1. The collector walks through the heap, tentatively marking every block as “useless.”
2. Beginning with all pointers outside the heap, the collector recursively explores
all linked data structures in the program, marking each newly discovered block
as “useful.” (When it encounters a block that is already marked as “useful,” the
collector knows it has reached the block over some previous path, and returns
without recursing.)
3. The collector again walks through the heap, moving every block that is still
marked “useless” to the free list.
Several potential problems with this algorithm are immediately apparent. First,
both the initial and final walks through the heap require that the collector be able
to tell where every “in-use” block begins and ends. In a language with variable-size
heap blocks, every block must begin with an indication of its size, and of whether
it is currently free. Second, the collector must be able in Step 2 to find the pointers
contained within each block. The standard solution is to place a pointer to a type
descriptor near the beginning of each block.
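The three steps can be sketched as follows. This is a minimal model, not a real allocator: the Block class, its pointers list (standing in for what a type descriptor would reveal), and the list-based heap are all assumptions made for illustration.

```python
class Block:
    def __init__(self, name):
        self.name = name
        self.pointers = []      # outgoing pointers, as a type descriptor would locate them
        self.useful = False


def mark(block):
    """Step 2 helper: recursively mark everything reachable from block."""
    if block.useful:            # already reached over some previous path
        return
    block.useful = True
    for target in block.pointers:
        mark(target)


def collect(heap, roots):
    """Run one mark-and-sweep cycle; return the blocks moved to the free list."""
    for block in heap:                              # Step 1: tentatively useless
        block.useful = False
    for root in roots:                              # Step 2: explore from outside the heap
        mark(root)
    free_list = [b for b in heap if not b.useful]   # Step 3: sweep
    heap[:] = [b for b in heap if b.useful]
    return free_list
```

Note that the recursion in mark consumes stack space proportional to the longest chain of blocks it follows.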
Pointer Reversal The exploration step (Step 2) of mark-and-sweep collection is
naturally recursive. The obvious implementation needs a stack whose maximum
depth is proportional to the longest chain through the heap. In practice, the space
for this stack may not be available: after all, we run garbage collection when we’re
about to run out of space!12 An alternative implementation of the exploration
step uses a technique first suggested by Schorr and Waite [SW67] to embed the
equivalent of the stack in already-existing fields in heap blocks. More specifically,
as the collector explores the path to a given block, it reverses the pointers it follows,
so that each points back to the previous block instead of forward to the next. This
pointer-reversal technique is illustrated in Figure 7.14. As it explores, the collector
keeps track of the current block and the block from whence it came.
To return from block X to block U (after part (d) of the figure), the collector
will use the reversed pointer in U to restore its notion of previous block (T). It
will then flip the reversed pointer back to X and update its notion of current
block to U. If the block to which it has returned contains additional pointers, the
collector will proceed forward again; otherwise it will return across the previous
reversed pointer and try again. At most one pointer in every block will be reversed
at any given time. This pointer must be marked, probably by means of another
bookkeeping field at the beginning of each block. (We could mark the pointer by
setting one of its low-order bits, but the cost in time would probably be prohibitive:
we’d have to search the block on every visit.)
EXAMPLE 7.89 Heap tracing with pointer reversal
12 In many language implementations, the stack and heap grow toward each other from opposite
ends of memory (Section 14.4); if the heap is full, the stack can’t grow. In a system with virtual
memory the distance between the two may theoretically be enormous, but the space that backs
them up on disk is still limited, and shared between them.
[Figure 7.14 diagram, parts (a) through (d): heap blocks R, S, T, U, V, W, and X, showing the positions of the prev and curr pointers as the collector advances.]
Figure 7.14 Heap exploration via pointer reversal. The block currently under examination is indicated by the curr pointer.
The previous block is indicated by the prev pointer. As the garbage collector moves from one block to the next, it changes the
pointer it follows to refer back to the previous block. When it returns to a block it restores the pointer. Each reversed pointer
must be marked (indicated with a shaded box), to distinguish it from other, forward pointers in the same block.
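A sketch of marking by pointer reversal appears below. It is written under several assumptions of our own: each block stores its pointers in a list ptrs, and the extra bookkeeping field is an index idx that records which pointer is currently reversed (or is next to be explored). The only state outside the heap is the pair prev and curr.

```python
class Block:
    def __init__(self, name, fanout):
        self.name = name
        self.ptrs = [None] * fanout   # pointer fields, null until assigned
        self.marked = False           # the "useful" mark
        self.idx = 0                  # bookkeeping field: which pointer is reversed


def mark_by_pointer_reversal(root):
    """Mark every block reachable from root without using a recursion stack."""
    if root is None:
        return
    prev, curr = None, root
    curr.marked, curr.idx = True, 0
    while True:
        if curr.idx < len(curr.ptrs):
            child = curr.ptrs[curr.idx]
            if child is not None and not child.marked:
                curr.ptrs[curr.idx] = prev       # reverse: field now points back
                prev, curr = curr, child         # move forward
                curr.marked, curr.idx = True, 0
            else:
                curr.idx += 1                    # nothing new here; try the next field
        elif prev is None:
            return                               # back at the root with nothing left
        else:
            parent = prev
            prev = parent.ptrs[parent.idx]       # recover the block before the parent
            parent.ptrs[parent.idx] = curr       # restore the forward pointer
            parent.idx += 1
            curr = parent
```

When the routine returns, every pointer has been restored to its original value, and at no time was more than one pointer per block reversed.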
Stop-and-Copy In a language with variable-size heap blocks, the garbage collector can reduce external fragmentation by performing storage compaction. Many
garbage collectors employ a technique known as stop-and-copy that achieves compaction while simultaneously eliminating Steps 1 and 3 in the standard mark-and-sweep algorithm. Specifically, they divide the heap into two regions of equal size.
All allocation happens in the first half. When this half is (nearly) full, the collector
begins its exploration of reachable data structures. Each reachable block is copied
into the second half of the heap, with no external fragmentation. The old version
of the block, in the first half of the heap, is overwritten with a “useful” flag and a
pointer to the new location. Any other pointer that refers to the same block (and
is found later in the exploration) is set to point to the new location. When the
collector finishes its exploration, all useful objects have been moved (and compacted) into the second half of the heap, and nothing in the first half is needed
anymore. The collector can therefore swap its notion of first and second halves,
and the program can continue. Obviously, this algorithm suffers from the fact that
only half of the heap can be used at any given time, but in a system with virtual
memory it is only the virtual space that is underutilized; each “half” of the heap
can occupy most of physical memory as needed. Moreover, by eliminating Steps 1
and 3 of standard mark-and-sweep, stop-and-copy incurs overhead proportional
to the number of nongarbage blocks, rather than the total number of blocks.
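The copying step might be sketched as follows. The model is deliberately simplified and the names are hypothetical: the second half of the heap is a Python list, a block's forward field plays the role of both the “useful” flag and the pointer to the new location, and the exploration is written recursively for brevity.

```python
class Block:
    def __init__(self, name, pointers=()):
        self.name = name
        self.pointers = list(pointers)
        self.forward = None       # set once the block has been copied


def copy_block(block, to_space):
    """Return the new-space copy of block, copying it first if necessary."""
    if block is None:
        return None
    if block.forward is None:
        clone = Block(block.name)
        block.forward = clone     # record the new location before exploring further
        to_space.append(clone)    # copies are laid out contiguously
        clone.pointers = [copy_block(p, to_space) for p in block.pointers]
    return block.forward


def stop_and_copy(roots, to_space):
    """Copy everything reachable from roots; return the updated root pointers."""
    return [copy_block(r, to_space) for r in roots]
```

Because the forwarding pointer is installed before the block's own pointers are followed, cycles terminate, and every pointer to a given block ends up referring to the same copy. Garbage is never touched at all.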
Generational Collection To further reduce the cost of collection, some garbage
collectors employ a “generational” technique, exploiting the observation that most
dynamically allocated objects are short-lived. The heap is divided into multiple
regions (often two). When space runs low the collector first examines the youngest
region (the “nursery”), which it assumes is likely to have the highest proportion
of garbage. Only if it is unable to reclaim sufficient space in this region does the
collector examine the next-older region. To avoid leaking storage in long-running
systems, the collector must be prepared, if necessary, to examine the entire heap.
In most cases, however, the overhead of collection will be proportional to the size
of the youngest region only.
Any object that survives some small number of collections (often one) in its current region is promoted (moved) to the next older region, in a manner reminiscent
of stop-and-copy. Promotion requires, of course, that pointers from old objects
DESIGN & IMPLEMENTATION
Reference counts versus tracing
Reference counts require a counter field in every heap object. For small objects
such as cons cells, this space overhead may be significant. The ongoing expense
of updating reference counts when pointers are changed can also be significant
in a program with large amounts of pointer manipulation. Other garbage collection techniques, however, have similar overheads. Tracing generally requires
a reversed pointer indicator in every heap block, which reference counting
does not, and generational collectors must generally incur overhead on every
pointer assignment in order to keep track of pointers into the newest section
of the heap.
The two principal tradeoffs between reference counting and tracing are the
inability of the former to handle cycles and the tendency of the latter to “stop
the world” periodically in order to reclaim space. On the whole, implementors
tend to favor reference counting for applications in which circularity is not an
issue, and tracing collectors in the general case. The “stop the world” problem
can be addressed with incremental or concurrent collectors, which interleave
their execution with the rest of the program, but these tend to have higher total
overhead. Efficient, effective garbage collection techniques remain an active
area of research.
to new objects be updated to reflect the new locations. While such old-space-to-new-space pointers tend to be rare, a generational collector must be able to
find them all quickly. At each pointer assignment, the compiler generates code to
check whether the new value is an old-to-new pointer; if so, it adds the pointer
to a hidden list accessible to the collector. This instrumentation on assignments is
known as a write barrier.13
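The check performed at each pointer assignment might look like the following sketch. The generation tag on each object, the write routine, and the remembered list are hypothetical names of ours; a real compiler would emit the equivalent test inline.

```python
class Obj:
    def __init__(self, name, generation="young"):
        self.name = name
        self.generation = generation   # "young" (nursery) or "old"
        self.fields = {}


def write(holder, field, value, remembered):
    """holder.field := value, followed by the write-barrier check."""
    holder.fields[field] = value
    if (value is not None
            and holder.generation == "old"
            and value.generation == "young"):
        # An old-to-new pointer: record where it lives so the collector
        # can find (and later update) it without scanning the old region.
        remembered.append((holder, field))
```

When collecting the nursery, the collector would treat the slots in remembered as additional roots.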
Conservative Collection Language implementors have traditionally assumed
that automatic storage reclamation is possible only in languages that are strongly
typed: both reference counts and tracing collection require that we be able to find
the pointers within an object. If we are willing to admit the possibility that some
garbage will go unreclaimed, it turns out that we can implement mark-and-sweep
collection without being able to find pointers [BW88]. The key is to observe that
any given block in the heap spans a relatively small number of addresses. There is
only a very small probability that some word in memory that is not a pointer will
happen to contain a bit pattern that looks like one of those addresses.
If we assume, conservatively, that everything that seems to point into a heap
block is in fact a valid pointer, then we can proceed with mark-and-sweep collection. When space runs low, the collector (as usual) tentatively marks all blocks in
the heap as useless. It then scans all word-aligned quantities in the stack and in
global storage. If any of these words appears to contain the address of something
in the heap, the collector marks the block that contains that address as useful.
Recursively, the collector then scans all word-aligned quantities in the block, and
marks as useful any other blocks whose addresses are found therein. Finally (as
usual), the collector reclaims any blocks that are still marked useless.
The algorithm is completely safe (in the sense that it never reclaims useful
blocks) so long as the programmer never “hides” a pointer. In C, for example, the
collector is unlikely to function correctly if the programmer casts a pointer to int
and then xors it with a constant, with the expectation of restoring and using the
pointer at a later time. In addition to sometimes leaving garbage unclaimed, conservative collection suffers from the inability to perform compaction: the collector
can never be sure which “pointers” should be changed.
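A toy model of conservative marking is sketched below. Everything in it is an assumption made for illustration: addresses are plain integers counted in words, each block records its starting address and raw contents, and the stack and global storage are represented by a list of words. A worklist replaces the recursion.

```python
class Block:
    def __init__(self, start, words):
        self.start = start           # address of the block's first word
        self.words = list(words)     # raw contents: may or may not be pointers
        self.useful = False


def block_containing(heap, word):
    """Return the block whose address range includes word, if any."""
    for block in heap:
        if block.start <= word < block.start + len(block.words):
            return block
    return None


def conservative_collect(heap, root_words):
    """Mark from anything that looks like a heap address; return reclaimable blocks."""
    for block in heap:
        block.useful = False
    pending = list(root_words)
    while pending:
        block = block_containing(heap, pending.pop())
        if block is not None and not block.useful:
            block.useful = True
            pending.extend(block.words)   # scan every word inside the block
    return [b for b in heap if not b.useful]
```

An integer in root_words that merely happens to fall within some block's address range will keep that block alive; this is the unreclaimed garbage the text warns about.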
7.8 Lists
A list is defined recursively as either the empty list or a pair consisting of an object
(which may be either a list or an atom) and another (shorter) list. Lists are ideally
suited to programming in functional and logic languages, which do most of their
work via recursion and higher-order functions (to be described in Section 10.5).
13 Unfortunately, the word “barrier” is heavily overloaded. Garbage collection barriers are unrelated
to the synchronization barriers of Section 12.3.1, the memory barriers of Section 12.3.3, or the
RTL barriers of Section 14.2.2.
EXAMPLE 7.90 Lists in ML and Lisp
EXAMPLE 7.91 List notation
In Lisp, in fact, a program is a list, and can extend itself at run time by constructing
a list and executing it (this capability will be examined further in Section 10.3.5;
it depends heavily on the fact that Lisp delays almost all semantic checking until
run time).
Lists can also be used in imperative programs. Clu provides a built-in type
constructor for lists, and a list class is easy to write in most object-oriented languages. Most scripting languages provide extensive list support. In any language
with records and pointers, the programmer can build lists by hand. Since many of
the standard list operations tend to generate garbage, lists work best in a language
with automatic garbage collection.
We have already discussed certain aspects of lists in ML (Section 7.2.4) and
Lisp (Section 7.7.1). As we noted in those sections, lists in ML are homogeneous:
every element of the list must have the same type. Lisp lists, by contrast, are
heterogeneous: any object may be placed in a list, so long as it is never used in
an inconsistent fashion.14 The different approaches to type in ML and in Lisp
lead to different implementations. An ML list is usually a chain of blocks, each of
which contains an element and a pointer to the next block. A Lisp list is a chain
of cons cells, each of which contains two pointers, one to the element and one to
the next cons cell (see Figures 7.10 and 7.11, pages 347 and 348). For historical
reasons, the two pointers in a cons cell are known as the car and the cdr ; they
represent the head of the list and the remaining elements, respectively. In both
semantics (homogeneity vs heterogeneity) and implementation (chained blocks
vs cons cells), Clu resembles ML, while Python and Prolog (to be discussed in
Section 11.2) resemble Lisp.
Both ML and Lisp provide convenient notation for lists. An ML list is enclosed
in square brackets, with elements separated by commas: [a, b, c, d]. A Lisp
list is enclosed in parentheses, with elements separated by white space: (a b c d).
In both cases, the notation represents a proper list: one whose innermost
pair consists of the final element and the empty list. In Lisp, it is also possible
to construct an improper list, whose final pair contains two elements. (Strictly
speaking, such a list does not conform to the standard recursive definition.) Lisp
systems provide a more general but cumbersome dotted list notation that captures
both proper and improper lists. A dotted list is either an atom (possibly null)
or a pair consisting of two dotted lists separated by a period and enclosed in
parentheses. The dotted list (a . (b . (c . (d . null)))) is the same as (a b c d).
The list (a . (b . (c . d))) is improper; its final cons cell contains a pointer
to d in the second position, where a pointer to a list is normally required.
Both ML and Lisp provide a wealth of built-in polymorphic functions to
manipulate arbitrary lists. Because programs are lists in Lisp, Lisp must distinguish between lists that are to be evaluated and lists that are to be left “as is,” as
structures. To prevent a literal list from being evaluated, the Lisp programmer may
14 Recall that objects are self-descriptive in Lisp. The only type checking occurs when a function
“deliberately” inspects an argument to see whether it is a list or an atom of some particular type.
EXAMPLE 7.92 Basic list operations in Lisp
quote it: (quote (a b c d)), abbreviated '(a b c d). To evaluate an internal
list (e.g., one returned by a function), the programmer may pass it to the built-in
function eval. In ML, programs are not lists, so a literal list is always a structural
aggregate.
The most fundamental operations on lists are those that construct them from
their components or extract their components from them. In Lisp:
(cons 'a '(b))            =⇒  (a b)
(car '(a b))              =⇒  a
(car nil)                 =⇒  ??
(cdr '(a b c))            =⇒  (b c)
(cdr '(a))                =⇒  nil
(cdr nil)                 =⇒  ??
(append '(a b) '(c d))    =⇒  (a b c d)
Here we have used =⇒ to mean “evaluates to.” The car and cdr of the empty list
(nil) are defined to be nil in Common Lisp; in Scheme they result in a dynamic
semantic error.
EXAMPLE 7.93 Basic list operations in ML
In ML the equivalent operations are written as follows:
a :: [b]           =⇒  [a, b]
hd [a, b]          =⇒  a
hd [ ]             =⇒  run-time exception
tl [a, b, c]       =⇒  [b, c]
tl [a]             =⇒  nil
tl [ ]             =⇒  run-time exception
[a, b] @ [c, d]    =⇒  [a, b, c, d]
Run-time exceptions may be caught by the program if desired; further details will
appear in Section 8.5.
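For readers more familiar with imperative languages, the constructor and selector operations can be mimicked with an explicit cons cell. The sketch below is ours, not part of Lisp or ML: it uses None for the empty list, and taking the car or cdr of the empty list raises a run-time exception, as in Scheme and ML.

```python
class Cons:
    """A cell holding two pointers: one to the element, one to the rest of the list."""
    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr


def cons(head, tail):
    return Cons(head, tail)


def car(cell):
    return cell.car      # raises AttributeError on the empty list (None)


def cdr(cell):
    return cell.cdr      # likewise


def append(xs, ys):
    """Copy the cells of xs in front of ys; the cells of ys are shared, not copied."""
    return ys if xs is None else cons(car(xs), append(cdr(xs), ys))


def to_python(xs):
    """Convert a proper list to a Python list, for display."""
    out = []
    while xs is not None:
        out.append(car(xs))
        xs = cdr(xs)
    return out
```

Because append builds new cells for its first argument, the old cells become garbage if nothing else refers to them; this is one reason lists work best with automatic collection.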
DESIGN & IMPLEMENTATION
Car and cdr
The names of the functions car and cdr are historical accidents: they derive
from the original (1959) implementation of Lisp on the IBM 704 at MIT. The
machine architecture included 15-bit “address” and “decrement” fields in some
of the (36-bit) loop-control instructions, together with additional instructions
to load an index register from, or store it to, one of these fields within a 36-bit
memory word. The designers of the Lisp interpreter decided to make cons
cells mimic the internal format of instructions, so they could exploit these
special instructions. In now archaic usage, memory words were also known as
“registers.” What might appropriately have been called “first” and “rest” pointers
thus came to be known as the CAR (contents of address of register) and CDR
(contents of decrement of register). The 704, incidentally, was also the machine
on which Fortran was first developed, and the first commercial machine to
include hardware floating-point and magnetic core memory.
EXAMPLE 7.94 List comprehensions
Both ML and Lisp provide many additional list functions, including ones that
test a list to see if it is empty; return the length of a list; return the nth element
of a list, or a list consisting of all but the first n elements; reverse the order of the
elements of a list; search a list for elements matching some predicate; or apply a
function to every element of a list, returning the results as a list.
Miranda, Haskell, Python, and F# provide lists that resemble those of ML, but
with an important additional mechanism, known as list comprehensions. These are
adapted from traditional mathematical set notation. A common form comprises
an expression, an enumerator, and one or more filters. In Haskell, the following
denotes a list of the squares of all odd numbers less than 100:
[i*i | i <- [1..100], i `mod` 2 == 1]
In Python we would write
[i*i for i in range(1, 100) if i % 2 == 1]
In F# the equivalent is
[for i in 1..100 do if i % 2 = 1 then yield i*i]
All of these are the equivalent of the mathematical
{i × i | i ∈ {1, . . . , 100} ∧ i mod 2 = 1}
We could of course create an equivalent list with a series of appropriate function
calls. The brevity of the list comprehension syntax, however, can sometimes lead
to remarkably elegant programs (see, e.g., Exercise 7.26).
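For comparison, here is the Python comprehension alongside one way of building the same list with ordinary function calls. The variable names are ours; map, filter, and range are standard Python built-ins.

```python
# Comprehension form, as in the text.
squares_comprehension = [i * i for i in range(1, 100) if i % 2 == 1]

# The same list built from function calls: filter plays the role of the
# filter clause, map applies the expression to each surviving element.
squares_calls = list(map(lambda i: i * i,
                         filter(lambda i: i % 2 == 1, range(1, 100))))
```

The two lists are equal; the comprehension simply states the expression, enumerator, and filter in the order a reader of set notation would expect.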
7.9 Files and Input/Output
Input/output (I/O) facilities allow a program to communicate with the outside
world. In discussing this communication, it is customary to distinguish between
interactive I/O and I/O with files. Interactive I/O generally implies communication
with human users or physical devices, which work in parallel with the running
program, and whose input to the program may depend on earlier output from
the program (e.g., prompts). Files generally refer to off-line storage implemented
by the operating system. Files may be further categorized into those that are
temporary and those that are persistent. Temporary files exist for the duration of
a single program run; their purpose is to store information that is too large to fit
in the memory available to the program. Persistent files allow a program to read
data that existed before the program began running, and to write data that will
continue to exist after the program has ended.
I/O is one of the most difficult aspects of a language to design, and one that
displays the least commonality from one language to the next. Some languages
provide built-in file data types and special syntactic constructs for I/O. Others
relegate I/O entirely to library packages, which export a (usually opaque) file
type and a variety of input and output subroutines. The principal advantage of
language integration is the ability to employ non–subroutine-call syntax, and to
perform operations (e.g., type checking on subroutine calls with varying numbers
of parameters) that may not otherwise be available to library routines. A purely
library-based approach to I/O, on the other hand, may keep a substantial amount
of “clutter” out of the language definition.
IN MORE DEPTH
An overview of language-level I/O mechanisms can be found on the PLP CD.
After a brief introduction to interactive and file-based I/O, we focus mainly on the
common case of text files. The data in a text file are stored in character form, but
may be converted to and from internal types during read and write operations.
As examples, we consider the text I/O facilities of Fortran, Ada, C, and C++.
7.10 Equality Testing and Assignment
For simple, primitive data types such as integers, floating-point numbers, or characters, equality testing and assignment are relatively straightforward operations,
with obvious semantics and obvious implementations (bit-wise comparison or
copy). For more complicated or abstract data types, however, both semantic and
implementation subtleties arise.
Consider for example the problem of comparing two character strings. Should
the expression s = t determine whether s and t
are aliases for one another?
occupy storage that is bit-wise identical over its full length?
contain the same sequence of characters?
would appear the same if printed?
The second of these tests is probably too low-level to be of interest in most programs; it suggests the possibility that a comparison might fail because of garbage
in currently unused portions of the space reserved for a string. The other three
alternatives may all be of interest in certain circumstances, and may generate
different results.
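To make the four candidate meanings concrete, the sketch below models a string as a fixed-capacity buffer plus a length. The FixedString class and the four predicates are hypothetical, and "would appear the same if printed" is approximated, as one possible reading, by comparing Unicode NFC normal forms.

```python
import unicodedata


class FixedString:
    """A string stored in a fixed-capacity buffer; bytes past length are unused."""
    def __init__(self, text, capacity=8, fill=b"\x00"):
        data = text.encode("utf-8")
        self.length = len(data)
        self.buf = bytearray(data + fill * (capacity - len(data)))

    def chars(self):
        return bytes(self.buf[:self.length])


def is_alias(s, t):        # are s and t the same object?
    return s is t


def same_storage(s, t):    # bit-wise identical over the full reserved space?
    return s.buf == t.buf


def same_chars(s, t):      # same sequence of characters?
    return s.chars() == t.chars()


def same_printed(s, t):    # same appearance, under the NFC assumption above
    def norm(x):
        return unicodedata.normalize("NFC", x.chars().decode("utf-8"))
    return norm(s) == norm(t)
```

Two buffers holding "abc" with different padding pass same_chars but fail same_storage, while a precomposed accented letter and its base-letter-plus-combining-accent spelling fail same_chars but pass same_printed.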
In many cases the definition of equality boils down to the distinction between
l-values and r-values: in the presence of references, should expressions be considered equal only if they refer to the same object, or also if the objects to which
EXAMPLE 7.95 Equality testing in Scheme
they refer are in some sense equal? The first option (refer to the same object)
is known as a shallow comparison. The second (refer to equal objects) is called
a deep comparison. For complicated data structures (e.g., lists or graphs) a deep
comparison may require recursive traversal.
In imperative programming languages, assignment operations may also be deep
or shallow. Under a reference model of variables, a shallow assignment a := b
will make a refer to the object to which b refers. A deep assignment will create a
copy of the object to which b refers, and make a refer to the copy. Under a value
model of variables, a shallow assignment will copy the value of b into a, but if
that value is a pointer (or a record containing pointers), then the objects to which
the pointer(s) refer will not be copied.
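Python, which uses a reference model of variables, makes the distinctions easy to observe. In the sketch below the variable names are ours; copy.copy and copy.deepcopy are the standard library routines, is tests identity (a shallow comparison), and == on built-in containers compares contents recursively (a deep comparison).

```python
import copy

b = {"name": "moe", "friends": ["larry", "curly"]}

a_ref = b                     # shallow assignment: a_ref refers to b's object
a_shallow = copy.copy(b)      # new top-level object, but inner pointers are shared
a_deep = copy.deepcopy(b)     # deep assignment: everything reachable is copied

b["friends"].append("shemp")  # mutate an object reachable from b

same_object = a_ref is b                                 # shallow comparison
shares_inner = a_shallow["friends"] is b["friends"]      # the list was not copied
deep_equal_after = a_deep == b                           # deep comparison
```

After the mutation, a_ref and a_shallow both see the new element, because they share the inner list; a_deep does not, and so no longer compares equal to b.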
Most programming languages employ both shallow comparisons and shallow
assignment. A few (notably Python and the various dialects of Lisp) provide more
than one option for comparison. Scheme, for example, has three general-purpose
equality-testing functions:
(eq? a b)       ; do a and b refer to the same object?
(eqv? a b)      ; are a and b known to be semantically equivalent?
(equal? a b)    ; do a and b have the same recursive structure?
Both eq? and eqv? perform a shallow comparison. The former may be faster
for certain types in certain implementations; in particular, eqv? is required to
detect the equality of values of the same discrete type, stored in different locations;
eq? is not. The simpler eq? behaves as one would expect for Booleans, symbols
(names), and pairs (things built by cons), but can have implementation-defined
behavior on numbers, characters, and strings:
(eq? #t #t)                ⟹ #t (true)
(eq? 'foo 'foo)            ⟹ #t
(eq? '(a b) '(a b))        ⟹ #f (false); created by separate cons-es
(let ((p '(a b)))
  (eq? p p))               ⟹ #t; created by the same cons
(eq? 2 2)                  ⟹ implementation dependent
(eq? "foo" "foo")          ⟹ implementation dependent
In any particular implementation, numeric, character, and string tests will always
work the same way; if (eq? 2 2) returns true , then (eq? 37 37) will return
true also. Implementations are free to choose whichever behavior results in the
fastest code.
The exact rules that govern the situations in which eqv? is guaranteed to return
true or false are quite involved. Among other things, they specify that eqv?
should behave as one might expect for numbers, characters, and nonempty strings,
and that two objects will never test true for eqv? if there are any circumstances
under which they would behave differently. (Conversely, however, eqv? is allowed
to return false for certain objects—functions, for example—that would behave
identically in all circumstances.)15 The eqv? predicate is “less discriminating”
than eq? , in the sense that eqv? will never return false when eq? returns true .
For structures (lists), eqv? returns false if its arguments refer to different
root cons cells. In many programs this is not the desired behavior. The equal?
predicate recursively traverses two lists to see if their internal structure is the
same and their leaves are eqv? . The equal? predicate may lead to an infinite
loop if the programmer has used the imperative features of Scheme to create a
circular list.
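A recursive structural comparison of this kind can be sketched in Python on nested lists. Unlike the behavior described above for equal?, this hypothetical deep_equal function remembers which pairs of lists it has already started to compare, so it terminates even on circular lists. That cycle check is an addition made for illustration, not something the Scheme predicate is described as doing.

```python
def deep_equal(x, y, seen=None):
    """Structural equality on nested lists; leaves are compared with ==."""
    if seen is None:
        seen = set()
    if isinstance(x, list) and isinstance(y, list):
        key = (id(x), id(y))
        if key in seen:            # pair already visited (shared or cyclic structure)
            return True            # any real mismatch is reported where it is found
        seen.add(key)
        if len(x) != len(y):
            return False
        return all(deep_equal(xi, yi, seen) for xi, yi in zip(x, y))
    if isinstance(x, list) or isinstance(y, list):
        return False               # one argument is a list, the other a leaf
    return x == y                  # both are leaves


p = [1, [2, 3]]
q = [1, [2, 3]]
print(p is q, deep_equal(p, q))    # False True

# Circular lists built with imperative updates.
c1 = [1, 2]
c1.append(c1)
c2 = [1, 2]
c2.append(c2)
print(deep_equal(c1, c2))          # True, and the traversal terminates
```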
Deep assignments are relatively rare. They are used primarily in distributed
computing, and in particular for parameter passing in remote procedure call
(RPC) systems. These will be discussed in Section 12.5.4.
For user-defined abstractions, no single language-specified mechanism for
equality testing or assignment is likely to produce the desired results in all cases.
Languages with sophisticated data abstraction mechanisms usually allow the programmer to define the comparison and assignment operators for each new data
type—or to specify that equality testing and/or assignment is not allowed.
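As a sketch of programmer-defined comparison, the hypothetical Python classes below define equality for one abstraction and forbid copying for another. Python does not let a class redefine plain assignment (which always rebinds a name), so the second class intercepts only the copy and deepcopy operations of the standard copy module.

```python
import copy

class Rational:
    """Hypothetical abstraction: 1/2 and 2/4 should compare equal."""
    def __init__(self, num, den):
        self.num, self.den = num, den

    def __eq__(self, other):
        if not isinstance(other, Rational):
            return NotImplemented
        # Cross-multiply so that unreduced fractions still compare equal.
        return self.num * other.den == other.num * self.den


class Connection:
    """Hypothetical abstraction for which copying is disallowed."""
    def __copy__(self):
        raise TypeError("Connection objects cannot be copied")

    def __deepcopy__(self, memo):
        raise TypeError("Connection objects cannot be copied")


print(Rational(1, 2) == Rational(2, 4))   # True: programmer-defined equality
print(Rational(1, 2) is Rational(1, 2))   # False: two distinct objects
```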
CHECK YOUR UNDERSTANDING
44. What are dangling references? How are they created, and why are they a
problem?
45. What is garbage? How is it created, and why is it a problem? Discuss the
comparative advantages of reference counts and tracing collection as a means
of solving the problem.
46. Summarize the differences among mark-and-sweep, stop-and-copy, and generational garbage collection.
47. What is pointer reversal? What problem does it address?
48. What is “conservative” garbage collection? How does it work?
49. Do dangling references and garbage ever arise in the same programming language? Why or why not?
50. Why was automatic garbage collection so slow to be adopted by imperative
programming languages?
51. What are the advantages and disadvantages of allowing pointers to refer to
objects that do not lie in the heap?
52. Why are lists so heavily used in functional programming languages?
53. Why is equality testing more subtle than it first appears?
15 Significantly, eqv? is also allowed to return false when comparing numeric values of different
types: (eqv? 1 1.0) may evaluate to #f . For numeric code, one generally wants the separate
= function: (= val1 val2) will perform the necessary coercion and test for numeric equality
(subject to rounding errors).
7.11 Summary and Concluding Remarks
This section concludes the third of our five core chapters on language design
(names [from Part I], control flow, types, subroutines, and classes). In the first
two sections we looked at the general issues of type systems and type checking.
In the remaining sections we examined the most important composite types:
records and variants, arrays and strings, sets, pointers and recursive types, lists,
and files. We noted that types serve two principal purposes: they provide implicit
context for many operations, freeing the programmer from the need to specify
that context explicitly, and they allow the compiler to catch a wide variety of
common programming errors. A type system consists of a set of built-in types,
a mechanism to define new types, and rules for type equivalence, type compatibility, and type inference. Type equivalence determines when two names
or values have the same type. Type compatibility determines when a value of
one type may be used in a context that “expects” another type. Type inference
determines the type of an expression based on the types of its components or
(sometimes) the surrounding context. A language is said to be strongly typed
if it never allows an operation to be applied to an object that does not support it; a language is said to be statically typed if it enforces strong typing at
compile time.
In our general discussion of types we distinguished between the denotational,
constructive, and abstraction-based points of view, which regard types, respectively, in terms of their values, their substructure, and the operations they support.
We introduced terminology for the common built-in types and for enumerations, subranges, and the common type constructors. We discussed several different
approaches to type equivalence, compatibility, and inference, including (on the
PLP CD) a detailed examination of the inference rules of ML. We also examined
type conversion, coercion, and nonconverting casts. In the area of type equivalence,
we contrasted the structural and name-based approaches, noting that while name
equivalence appears to have gained in popularity, structural equivalence retains
its advocates.
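The contrast between the approaches can be sketched with a toy type checker in Python. The declarations, the tuple encoding of record types, and the function names are all hypothetical. The sketch treats field names as significant for structural equivalence (a choice on which languages differ) and does not handle recursive types.

```python
BUILT_IN = {"integer", "real", "char"}

# A declared name maps either to another name (an alias) or to a
# constructed type, encoded here as (constructor, ((field, type), ...)).
DECLS = {
    "point": ("record", (("x", "real"), ("y", "real"))),
    "vec":   ("record", (("x", "real"), ("y", "real"))),
    "pt":    "point",                      # alias declaration
}

def expand(t):
    """Replace every declared name by its definition, recursively."""
    if isinstance(t, str):
        return t if t in BUILT_IN else expand(DECLS[t])
    constructor, fields = t
    return (constructor, tuple((name, expand(ft)) for name, ft in fields))

def structurally_equivalent(a, b):
    return expand(a) == expand(b)

def strictly_name_equivalent(a, b):
    return a == b                          # every declaration is a distinct type

def resolve_alias(name):
    while isinstance(DECLS.get(name), str):
        name = DECLS[name]                 # follow alias declarations only
    return name

def loosely_name_equivalent(a, b):
    return resolve_alias(a) == resolve_alias(b)

print(structurally_equivalent("point", "vec"))    # True
print(strictly_name_equivalent("point", "pt"))    # False
print(loosely_name_equivalent("point", "pt"))     # True
print(loosely_name_equivalent("point", "vec"))    # False
```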
In our survey of composite types, we spent the most time on records, arrays,
and recursive types. Key issues for records include the syntax and semantics of
variant records, whole-record operations, type safety, and the interaction of each
of these with memory layout. Memory layout is also important for arrays, in which
it interacts with binding time for shape; static, stack, and heap-based allocation
strategies; efficient array traversal in numeric applications; the interoperability
of pointers and arrays in C; and the available set of whole-array and slice-based
operations.
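To make the interaction between records and memory layout concrete, here is a minimal Python sketch of how a compiler might assign field offsets. It assumes that each primitive must be aligned to a multiple of its own size, that fields are not reordered, and that the record is padded to a multiple of its largest alignment. The machine sizes and the record itself are hypothetical.

```python
def layout(fields, sizes):
    """Return ([(name, offset), ...], total_size) for fields laid out in order."""
    offset, placed, max_align = 0, [], 1
    for name, type_name in fields:
        size = sizes[type_name]
        offset = -(-offset // size) * size       # round up to a multiple of size
        placed.append((name, offset))
        offset += size
        max_align = max(max_align, size)
    total = -(-offset // max_align) * max_align  # pad so array elements stay aligned
    return placed, total


SIZES = {"char": 1, "short": 2, "int": 4, "double": 8}   # hypothetical machine
record = [("flag", "char"), ("count", "int"), ("tag", "char"), ("value", "double")]

placed, total = layout(record, SIZES)
print(placed)   # [('flag', 0), ('count', 4), ('tag', 8), ('value', 16)]
print(total)    # 24: 14 bytes of data plus 10 bytes of holes
```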
For recursive data types, much depends on the choice between the value and
reference models of variables/names. Recursive types are a natural fallout of the
reference model; with the value model they require the notion of a pointer: a variable whose value is a reference. The distinction between values and references is
important from an implementation point of view: it would be wasteful to implement built-in types as references, so languages with a reference model generally
implement built-in and user-defined types differently. Java reflects this distinction in the language semantics, calling for a value model of built-in types and a
reference model for objects of user-defined class types.
Recursive types are generally used to create linked data structures. In most
cases these structures must be allocated from a heap. In some languages, the programmer is responsible for deallocating heap objects that are no longer needed. In
other languages, the language run-time system identifies and reclaims such garbage
automatically. Explicit deallocation is a burden on the programmer, and leads to
the problems of memory leaks and dangling references. While language implementations almost never attempt to catch memory leaks (see Exploration 3.32
and Exercise 7.36, however, for some ideas on this subject), tombstones or locks
and keys are sometimes used to catch dangling references. Automatic garbage
collection can be expensive, but has proven increasingly popular. Most garbage-collection techniques rely either on reference counts or on some form of recursive
exploration (tracing) of currently accessible structures. Techniques in this latter
category include mark-and-sweep, stop-and-copy, and generational collection.
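The difference between the two families can be observed in CPython, which (as an assumption of this sketch) combines reference counts with a separate tracing collector for cycles. The Node class is hypothetical, and the behavior shown is specific to CPython; other Python implementations may reclaim objects at different times.

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.next = None

gc.disable()                    # turn off CPython's automatic cycle collector

# A non-cyclic object is reclaimed as soon as its reference count drops to zero.
single = Node()
single_alive = weakref.ref(single)
del single
reclaimed_by_refcount = single_alive() is None

# Two nodes that refer to each other keep each other's count above zero.
a, b = Node(), Node()
a.next, b.next = b, a
cycle_alive = weakref.ref(a)
del a, b
survives_refcount = cycle_alive() is not None   # garbage, but not yet reclaimed

gc.collect()                    # run the tracing cycle collector explicitly
reclaimed_by_tracing = cycle_alive() is None

gc.enable()
print(reclaimed_by_refcount, survives_refcount, reclaimed_by_tracing)
```

Under CPython all three flags should be true: reference counts reclaim the single node immediately, fail to reclaim the two-node cycle, and the tracing pass then finds it.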
Few areas of language design display as much variation as I/O. Our discussion
(largely on the PLP CD) distinguished between interactive I/O, which tends to be
very platform specific, and file-based I/O, which subdivides into temporary files,
used for voluminous data within a single program run, and persistent files, used for
off-line storage. Files also subdivide into those that represent their information in
a binary form that mimics layout in memory and those that convert to and from
character-based text. In comparison to binary files, text files generally incur both
time and space overhead, but they have the important advantages of portability
and human readability.
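The space overhead of text is easy to see in a short Python sketch that uses the standard struct module. The value and the choice of a 4-byte little-endian integer are assumptions made for illustration.

```python
import struct

value = 1234567890

binary = struct.pack("<i", value)        # 4-byte little-endian two's complement
text = str(value).encode("ascii")        # one byte per decimal digit

print(len(binary), len(text))            # 4 10

# Reading back: binary needs the same size and byte order; text needs parsing.
assert struct.unpack("<i", binary)[0] == value
assert int(text.decode("ascii")) == value
```

The binary form is compact and needs no conversion, but a reader must know its size and byte order; the text form is larger and must be parsed, but can be read on any machine and by a person.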
In our examination of types, we saw many examples of language innovations
that have served to improve the clarity and maintainability of programs, often
with little or no performance overhead. Examples include the original idea of user-defined types (Algol 68), enumeration and subrange types (Pascal), the integration
of records and variants (Pascal), and the distinction between subtypes and derived
types in Ada. In Chapter 9 we will examine what many consider the most important
innovation of the past 30 years, namely object orientation.
In some cases, the distinctions between languages are less a matter of evolution
than of fundamental differences in philosophy. We have already mentioned the
choice between the value and reference models of variables/names. In a similar
vein, most languages have adopted static typing, but Smalltalk, Lisp, and the
many scripting languages work well with dynamic types. Most statically typed
languages have adopted name equivalence, but ML and Modula-3 work well with
structural equivalence. Most languages have moved away from type coercions, but
C++ embraces them: together with operator overloading, they make it possible to
define terse, type-safe I/O routines outside the language proper.
As in the previous chapter, we saw several cases in which a language’s convenience, orthogonality, or type safety appears to have been compromised in order to
simplify the compiler, or to make compiled programs smaller or faster. Examples
include the lack of an equality test for records in most languages, the requirement
in Pascal and Ada that the variant portion of a record lie at the end, the limitations
in many languages on the maximum size of sets, the lack of type checking for I/O in
C, and the general lack of dynamic semantic checks in many language implementations. We also saw several examples of language features introduced at least in
part for the sake of efficient implementation. These include packed types, multilength numeric types, with statements, decimal arithmetic, and C-style pointer
arithmetic.
At the same time, one can identify a growing willingness on the part of language
designers and users to tolerate complexity and cost in language implementation in
order to improve semantics. Examples here include the type-safe variant records of
Ada; the standard-length numeric types of Java and C#; the variable-length strings
and string operators of Icon, Java, and C#; the late binding of array bounds in Ada;
and the wealth of whole-array and slice-based array operations in Fortran 90.
One might also include the polymorphic type inference of ML. Certainly one
should include the trend toward automatic garbage collection. Once considered
too expensive for production-quality imperative languages, garbage collection is
now standard not only in such experimental languages as Clu and Cedar, but in
Ada, Modula-3, Java, and C# as well. Many of these features, including variable-length strings, slices, and garbage collection, have been embraced by scripting
languages.
7.12 Exercises
7.1 Most statically typed languages developed since the 1970s (including Java,
C#, and the descendants of Pascal) use some form of name equivalence for
types. Is structural equivalence a bad idea? Why or why not?
7.2 In the following code, which of the variables will a compiler consider to have
compatible types under structural equivalence? Under strict name equivalence? Under loose name equivalence?
type T = array [1..10] of integer
     S = T
A : T
B : T
C : S
D : array [1..10] of integer
7.3 Consider the following declarations:
1. type cell                          -- a forward declaration
2. type cell_ptr = pointer to cell
3. x : cell
4. type cell = record
5.     val : integer
6.     next : cell_ptr
7. y : cell
Should the declaration at line 4 be said to introduce an alias type? Under
strict name equivalence, should x and y have the same type? Explain.
7.4 Suppose you are implementing an Ada compiler, and must support arithmetic on 32-bit fixed-point binary numbers with a programmer-specified
number of fractional bits. Describe the code you would need to generate
to add, subtract, multiply, or divide two fixed-point numbers. You should
assume that the hardware provides arithmetic instructions only for integers
and IEEE floating-point. You may assume that the integer instructions preserve full precision; in particular, integer multiplication produces a 64-bit
result. Your description should be general enough to deal with operands and
results that have different numbers of fractional bits.
7.5 When Sun Microsystems ported Berkeley Unix from the Digital VAX to the
Motorola 680x0 in the early 1980s, many C programs stopped working, and
had to be repaired. In effect, the 680x0 revealed certain classes of program
bugs that one could “get away with” on the VAX. One of these classes of bugs
occurred in programs that use more than one size of integer (e.g., short and
long), and arose from the fact that the VAX is a little-endian machine, while
the 680x0 is big-endian (Section 5.2). Another class of bugs occurred in
programs that manipulate both null and empty strings. It arose from the
fact that location zero in a Unix process’s address space on the VAX always
contained a zero, while the same location on the 680x0 is not in the address
space, and will generate a protection error if used. For both of these classes
of bugs, give examples of program fragments that would work on a VAX but
not on a 680x0.
7.6 Ada provides two “remainder” operators, rem and mod for integer types,
defined as follows [Ame83, Sec. 4.5.5]:
Integer division and remainder are defined by the relation A = (A/B)*B + (A rem
B) , where (A rem B) has the sign of A and an absolute value less than the absolute
value of B . Integer division satisfies the identity (-A)/B = -(A/B) = A/(-B) .
The result of the modulus operation is such that (A mod B) has the sign of
B and an absolute value less than the absolute value of B ; in addition, for some
integer value N , this result must satisfy the relation A = B*N + (A mod B) .
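The quoted definitions can be transcribed directly into Python and checked mechanically. In this sketch the function names are hypothetical; ada_div truncates toward zero as the quoted identity requires, and ada_mod relies on the fact that Python's % operator yields a result with the sign of its right operand.

```python
def ada_div(a, b):
    """Integer division that truncates toward zero: (-A)/B = -(A/B) = A/(-B)."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * (abs(a) // abs(b))

def ada_rem(a, b):
    """Remainder defined by A = (A/B)*B + (A rem B); takes the sign of A."""
    return a - ada_div(a, b) * b

def ada_mod(a, b):
    """Modulus with the sign of B; Python's % operator already behaves this way."""
    return a % b

def same_sign_or_zero(r, ref):
    return r == 0 or (r < 0) == (ref < 0)

# Check every property stated in the quoted definition over a small range.
for a in range(-20, 21):
    for b in [n for n in range(-6, 7) if n != 0]:
        r, m = ada_rem(a, b), ada_mod(a, b)
        assert ada_div(-a, b) == -ada_div(a, b) == ada_div(a, -b)
        assert a == ada_div(a, b) * b + r
        assert abs(r) < abs(b) and same_sign_or_zero(r, a)
        assert abs(m) < abs(b) and same_sign_or_zero(m, b)
        assert (a - m) % b == 0          # A = B*N + (A mod B) for some integer N
```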
Give values of A and B for which A rem B and A mod B differ. For what
purposes would one operation be more useful than the other? Does it make
sense to provide both, or is it overkill?
Consider also the % operator of C and the mod operator of Pascal. The
designers of these languages could have picked semantics resembling those
of either Ada’s rem or its mod. Which did they pick? Do you think they made
the right choice?
7.7 Consider the problem of performing range checks on set expressions in
Pascal. Given that a set may contain many elements, some of which may
be known at compile time, describe the information that a compiler might
maintain in order to track both the elements known to belong to the set
and the possible range of unknown elements. Then explain how to update
this information for the following set operations: union, intersection, and
difference. The goal is to determine (1) when subrange checks can be eliminated at run time and (2) when subrange errors can be reported at compile
time. Bear in mind that the compiler cannot do a perfect job: some unnecessary run-time checks will inevitably be performed, and some operations
that must always result in errors will not be caught at compile time. The goal
is to do as good a job as possible at reasonable cost.
7.8 Suppose we are compiling for a machine with 1-byte characters, 2-byte
shorts, 4-byte integers, and 8-byte reals, and with alignment rules that
require the address of every primitive data element to be an even multiple of the element’s size. Suppose further that the compiler is not permitted
to reorder fields. How much space will be consumed by the following array?
Explain.
A : array [0..9] of record
s : short
c : char
t : short
d : char
r : real
i : integer
7.9 In Example 7.45 we suggested the possibility of sorting record fields
by their alignment requirement, to minimize holes. In the example, we
sorted smallest-alignment-first. What would happen if we sorted longest-alignment-first? Do you see any advantages to this scheme? Any disadvantages? If the record as a whole must be an even multiple of the longest
alignment, do the two approaches ever differ in total space required?
7.10 Give Ada code to map from lowercase to uppercase letters, using
(a) an array
(b) a function
Note the similarity of syntax: in both cases upper('a') is 'A' .
7.11 In Section 7.4.2 we noted that in a language with dynamic arrays and a value
model of variables, records could have fields whose size is not known at
compile time. To accommodate these, we suggested using a dope vector for
the record, to track the offsets of the fields.
Suppose instead that we want to maintain a static offset for each field.
Can we devise an alternative strategy inspired by the stack frame layout of
part? What problems would we need to address? (Hint: consider nested
records.)
7.12 Explain how to extend Figure 7.6 to accommodate subroutine arguments
that are passed by value, but whose shape is not known until the subroutine
is called at run time.
7.13 Explain how to obtain the effect of Fortran 90’s allocate statement for
one-dimensional arrays using pointers in C. You will probably find that
your solution does not generalize to multidimensional arrays. Why not? If
you are familiar with C++, show how to use its class facilities to solve the
problem.
7.14 In Section 7.4.3 we discussed how to differentiate between the constant and
variable portions of an array reference, in order to efficiently access the
subparts of array and record objects. An alternative approach is to generate
naive code and count on the compiler’s code improver to find the constant
portions, group them together, and calculate them at compile time. Discuss
the advantages and disadvantages of each approach.
7.15 Consider the following C declaration, compiled on a 32-bit Pentium
machine:
struct {
int n;
char c;
} A[10][10];
If the address of A[0][0] is 1000 (decimal), what is the address of A[3][7] ?
7.16 Suppose we are generating code for a Pascal-like language on a RISC
machine with the following characteristics: 8-byte floating-point numbers,
4-byte integers, 1-byte characters, and 4-byte alignment for both integers
and floating-point numbers. Suppose further that we plan to use contiguous row-major layout for multidimensional arrays, that we do not wish to
reorder fields of records or pack either records or arrays, and that we will
assume without checking that all array subscripts are in bounds.
(a) Consider the following variable declarations.
var A : array [1..10, 10..100] of real;
i : integer;
x : real;
Show the code that our compiler should generate for the following
assignment: x := A[3,i] . Explain how you arrived at your answer.
(b) Consider the following more complex declarations.
var r : record
x : integer;
y : char;
A : array [1..10, 10..20] of record
z : real;
B : array [0..71] of char;
end;
end;
var j, k : integer;
Assume that these declarations are local to the current subroutine. Note
the lower bounds on indices in A ; the first element is A[1,10] .
Describe how r would be laid out in memory. Then show code to
load r.A[2,j].B[k] into a register. Be sure to indicate which portions
of the address calculation could be performed at compile time.
7.17 Suppose A is a 10×10 array of (4-byte) integers, indexed from [0][0] through
[9][9]. Suppose further that the address of A is currently in register r1 , the
value of integer i is currently in register r2 , and the value of integer j is
currently in register r3 .
Give pseudo-assembly language for a code sequence that will load the
value of A[i][j] into register r1 (a) assuming that A is implemented using
(row-major) contiguous allocation; (b) assuming that A is implemented
using row pointers. Each line of your pseudocode should correspond to
a single instruction on a typical modern machine. You may use as many
registers as you need. You need not preserve the values in r1 , r2 , and r3 . You
may assume that i and j are in bounds, and that addresses are 4 bytes long.
Which code sequence is likely to be faster? Why?
7.18 In Examples 7.62 and 7.63, show the code that would be required to access
A[i, j, k] if subscript bounds checking were required.
7.19 Pointers and recursive type definitions complicate the algorithm for determining structural equivalence of types. Consider, for example, the following
definitions:
type A = record
x : pointer to B
y : real
type B = record
x : pointer to A
y : real
The simple definition of structural equivalence given in Section 7.2.1
(expand the subparts recursively until all you have is a string of built-in
types and type constructors; then compare them) does not work: we get an
infinite expansion ( type A = record x : pointer to record x : pointer to record
x : pointer to record . . . ). The obvious reinterpretation is to say two types
A and B are equivalent if any sequence of field selections, array subscripts,
pointer dereferences, and other operations that takes one down into the
structure of A , and that ends at a built-in type, always ends at the same
built-in type when used to dive into the structure of B (and encounters the
same field names along the way). Under this reinterpretation, A and B above
have the same type. Give an algorithm based on this reinterpretation that
could be used in a compiler to determine structural equivalence. (Hint: the
fastest approach is due to J. Král [Krá73]. It is based on the algorithm used to
find the smallest deterministic finite automaton that accepts a given regular
language. This algorithm was outlined in Example 2.15 [page 59]; details
can be found in any automata theory textbook [e.g., [HMU01]].)
7.20 Explain the meaning of the following C declarations:
double *a[n];
double (*b)[n];
double (*c[n])();
double (*d())[n];
7.21 In Ada 83, as in Pascal, pointers ( access variables) can point only to objects
in the heap. Ada 95 allows a new kind of pointer, the access all type, to
point to other objects as well, provided that those objects have been declared
to be aliased :
type int_ptr is access all Integer;
foo : aliased Integer;
ip : int_ptr;
...
ip := foo'Access;
The 'Access attribute is roughly equivalent to C’s “address of” (&) operator. How would you implement access all types and aliased objects?
How would your implementation interact with automatic garbage collection
(assuming it exists) for objects in the heap?
7.22 As noted in Section 7.7.2, Ada 95 forbids an access all pointer from
referring to any object whose lifetime is briefer than that of the pointer’s
type. Can this rule be enforced completely at compile time? Why or
why not?
7.23 In much of the discussion of pointers in Section 7.7, we assumed implicitly
that every pointer into the heap points to the beginning of a dynamically
allocated block of storage. In some languages, including Algol 68 and C,
pointers may also point to data inside a block in the heap. If you were
trying to implement dynamic semantic checks for dangling references or,
alternatively, automatic garbage collection (precise or conservative), how
would your task be complicated by the existence of such “internal pointers”?
7.24 (a) Occasionally one encounters the suggestion that a garbage-collected language should provide a delete operation as an optimization: by explicitly delete-ing objects that will never be used again, the programmer
might save the garbage collector the trouble of finding and reclaiming
those objects automatically, thereby improving performance. What do
you think of this suggestion? Explain.
(b) Alternatively, one might allow the programmer to “tenure” an object, so
that it will never be a candidate for reclamation. Is this a good idea?
7.25 In Example 7.88 we noted that functional languages can safely use reference
counts since the lack of an assignment statement prevents them from introducing circularity. This isn’t strictly true; constructs like the Lisp letrec
can also be used to make cycles, so long as uses of circularly defined names
are hidden inside lambda expressions in each definition:
(define foo (lambda ()
(letrec ((a (lambda(f) (if f #\A b)))
(b (lambda(f) (if f #\B c)))
(c (lambda(f) (if f #\C a))))
a)))
Each of the functions a , b , and c contains a reference to the next:
((foo) #t)                     ⟹ #\A
(((foo) #f) #t)                ⟹ #\B
((((foo) #f) #f) #t)           ⟹ #\C
(((((foo) #f) #f) #f) #t)      ⟹ #\A
How might you address this circularity without giving up on reference
counts?
7.26 Here is a skeleton for the standard quicksort algorithm in Haskell:
quicksort [] = []
quicksort (a : l) = quicksort [...] ++ [a] ++ quicksort [...]
The ++ operator denotes list concatenation (similar to @ in ML). The :
operator is equivalent to ML’s :: or Lisp’s cons . Show how to express the
two elided expressions as list comprehensions.
7.27–7.39 In More Depth.
7.13 Explorations
7.40 Some language definitions specify a particular representation for data types
in memory, while others specify only the semantic behavior of those
types. For languages in the latter class, some implementations guarantee
a particular representation, while others reserve the right to choose different representations in different circumstances. Which approach do you
prefer? Why?
7.41 If you have access to a compiler that provides optional dynamic semantic
checks for out-of-bounds array subscripts, use of an inappropriate record
variant, and/or dangling or uninitialized pointers, experiment with the cost
of these checks. How much do they add to the execution time of programs
that make a significant number of checked accesses? Experiment with different levels of optimization (code improvement) to see what effect each has
on the overhead of checks.
7.42 Investigate the typestate mechanism employed by Strom et al. in the Hermes
programming language [SBG+91]. Discuss its relationship to the notion of
definite assignment in Java and C# (Section 6.1.3).
7.43 Investigate the notion of type conformance, employed by Black et al. in
the Emerald programming language [BHJL07]. Discuss how conformance
relates to the type inference of ML and to the class-based typing of object-oriented languages.
7.44 Write a library package that might be used by a language implementation to
manage sets of elements drawn from a very large base type (e.g., integer ).
You should support membership tests, union, intersection, and difference.
Does your package allocate memory from the heap? If so, what would a
compiler that assumed the use of your package need to do to make sure that
space was reclaimed when no longer needed?
7.45 Learn about SETL [SDDS86], a programming language based on sets,
designed by Jack Schwartz of New York University. List the mechanisms
provided as built-in set operations. Compare this list with the set facilities of other programming languages. What data structure(s) might a SETL
implementation use to represent sets in a program?
7.46 The HotSpot Java compiler and virtual machine implements an entire suite
of garbage collectors: a traditional generational collector, a compacting collector for the old generation, a low pause-time parallel collector for the
nursery, a high-throughput parallel collector for the old generation, and a
“mostly concurrent” collector for the old generation that runs in parallel
with the main program. Learn more about these algorithms. When is each
used, and why?
7.47 Implement your favorite garbage collection algorithm in Ada 95. Alternatively, implement a special pointer class in C++ for which storage is garbage
collected. You’ll want to use templates (generics) so that your class can be
instantiated for arbitrary pointed-to types.
7.48 Experiment with the cost of garbage collection in your favorite language
implementation. What kind of collector does it use? Can you create artificial
programs for which it performs particularly well or poorly?
7.49 Learn about weak references in Java. How do they interact with garbage
collection? Describe several scenarios in which they may be useful.
7.50–7.53 In More Depth.
7.14 Bibliographic Notes
References to general information on the various programming languages mentioned in this chapter can be found in Appendix A, and in the Bibliographic Notes
for Chapters 1 and 6. Welsh, Sneeringer, and Hoare [WSH77] provide a critique
of the original Pascal definition, with a particular emphasis on its type system.
Tanenbaum’s comparison of Pascal and Algol 68 also focuses largely on
types [Tan78]. Cleaveland [Cle86] provides a book-length study of many of the
issues in this chapter. Pierce [Pie02] provides a formal and detailed modern coverage of the subject. The ACM Special Interest Group on Programming Languages
launched a biennial workshop on Types in Language Design and Implementation
in 2003.
What we have referred to as the denotational model of types originates with
Hoare [DDH72]. Denotational formulations of the overall semantics of programming languages are discussed in the Bibliographic Notes for Chapter 4. A related
but distinct body of work uses algebraic techniques to formalize data abstraction; key references include Guttag [Gut77] and Goguen et al. [GTW78]. Milner’s
original paper [Mil78] is the seminal reference on type inference in ML. Mairson [Mai90] proves that the cost of unifying ML types is O(2^n), where n is the
length of the program. Fortunately, the cost is linear in the size of the program’s
type expressions, so the worst case arises only in programs whose semantics are
too complex for a human being to understand anyway.
Hoare [Hoa75] discusses the definition of recursive types under a reference
model of variables. Cardelli and Wegner survey issues related to polymorphism,
overloading, and abstraction [CW85]. The new Character Model standard for the
World Wide Web provides a remarkably readable introduction to the subtleties
and complexities of multilingual character sets [Wor05].
Tombstones are due to Lomet [Lom75, Lom85]. Locks and keys are due to
Fischer and LeBlanc [FL80]. The latter also discuss how to check for various other
dynamic semantic errors in Pascal, including those that arise with variant records.
Constant-space (pointer-reversing) mark-and-sweep garbage collection is due to
Schorr and Waite [SW67]. Stop-and-copy collection was developed by Fenichel
and Yochelson [FY69], based on ideas due to Minsky. Deutsch and Bobrow [DB76]
describe an incremental garbage collector that avoids the “stop-the-world” phenomenon. Wilson and Johnstone [WJ93] describe a later incremental collector.
The conservative collector described at the end of Section 7.7.3 is due to Boehm
and Weiser [BW88]. Cohen [Coh81] surveys garbage-collection techniques as
of 1981; Wilson [Wil92b] and Jones and Lins [JL96] provide somewhat more
recent views.
8 Subroutines and Control Abstraction
In the introduction to Chapter 3, we defined abstraction as a process by which
the programmer can associate a name with a potentially complicated program
fragment, which can then be thought of in terms of its purpose or function, rather
than in terms of its implementation. We sometimes distinguish between control
abstraction, in which the principal purpose of the abstraction is to perform a
well-defined operation, and data abstraction, in which the principal purpose of
the abstraction is to represent information.1 We will consider data abstraction in
more detail in Chapter 9.
Subroutines are the principal mechanism for control abstraction in most programming languages. A subroutine performs its operation on behalf of a caller,
who waits for the subroutine to finish before continuing execution. Most subroutines are parameterized: the caller passes arguments that influence the subroutine’s
behavior, or provide it with data on which to operate. Arguments are also called
actual parameters. They are mapped to the subroutine’s formal parameters at the
time a call occurs. A subroutine that returns a value is usually called a function.
A subroutine that does not return a value is usually called a procedure. Most
languages require subroutines to be declared before they are used, though a few
(including Fortran, C, and Lisp) do not. Declarations allow the compiler to verify
that every call to a subroutine is consistent with the declaration; for example, that
it passes the right number and types of arguments.
As noted in Section 3.2.2, the storage consumed by parameters and local variables can in most languages be allocated on a stack. We therefore begin this chapter,
in Section 8.1, by reviewing the layout of the stack. We then turn in Section 8.2 to
the calling sequences that serve to maintain this layout. In the process, we revisit
the use of static chains to access nonlocal variables in nested subroutines, and consider (on the PLP CD) an alternative mechanism, known as a display, that serves
a similar purpose. We also consider subroutine inlining and the representation of
closures. To illustrate some of the possible implementation alternatives, we present
(again on the PLP CD) a pair of case studies: the SGI MIPSpro C compiler for the
MIPS instruction set, and the GNU gpc Pascal compiler for the x86 instruction
set, as well as the register window mechanism of the SPARC instruction set.
1 The distinction between control and data abstraction is somewhat fuzzy, because the latter usually
encapsulates not only information, but also the operations that access and modify that information. Put another way, most data abstractions include control abstraction.
Programming Language Pragmatics. DOI: 10.1016/B978-0-12-374514-9.00018-5
Copyright © 2009 by Elsevier Inc. All rights reserved.
In Section 8.3 we look more closely at subroutine parameters. We consider
parameter-passing modes, which determine the operations that a subroutine can
apply to its formal parameters and the effects of those operations on the corresponding actual parameters. We also consider conformant arrays, named and
default parameters, variable numbers of arguments, and function return mechanisms. In Section 8.4 we turn to generic subroutines and modules (classes), which
support explicit parametric polymorphism, as defined in Section 3.5.3. Where
conventional parameters allow a subroutine to operate on many different values,
generic parameters allow it to operate on data of many different types.
In Section 8.5, we consider the handling of exceptional conditions. While exceptions can sometimes be confined to the current subroutine, in the general case they
require a mechanism to “pop out of ” a nested context without returning, so that
recovery can occur in the calling context. In Section 8.6, we consider coroutines,
which allow a program to maintain two or more execution contexts, and to switch
back and forth among them. Coroutines can be used to implement iterators (Section 6.5.3), but they have other uses as well, particularly in simulation and in server
programs. In Chapter 12 we will use them as the basis for concurrent (“quasiparallel”) threads. Finally, in Section 8.7 we consider asynchronous events—things
that happen outside a program, but to which it needs to respond.
8.1
Review of Stack Layout
EXAMPLE 8.1 Layout of run-time stack (reprise)
EXAMPLE 8.2 Offsets from frame pointer
In Section 3.2.2 we discussed the allocation of space on a subroutine call stack
(Figure 3.1, page 118). Each routine, as it is called, is given a new stack frame,
or activation record, at the top of the stack. This frame may contain arguments
and/or return values, bookkeeping information (including the return address and
saved registers), local variables, and/or temporaries. When a subroutine returns,
its frame is popped from the stack.
At any given time, the stack pointer register contains the address of either the
last used location at the top of the stack, or the first unused location, depending
on convention. The frame pointer register contains an address within the frame.
Objects in the frame are accessed via displacement addressing with respect to the
frame pointer. If the size of an object (e.g., a local array) is not known at compile
time, then the object is placed in a variable-size area at the top of the frame; its
address and dope vector (descriptor) are stored in the fixed-size portion of the
frame, at a statically known offset from the frame pointer (Figure 7.6, page 334). If
there are no variable-size objects, then every object within the frame has a statically
known offset from the stack pointer, and the implementation may dispense with
the frame pointer, freeing up a register for other use. If the size of an argument is
not known at compile time, then the argument may be placed in a variable-size
portion of the frame below the other arguments, with its address and dope vector
at known offsets from the frame pointer. Alternatively, the caller may simply pass
a temporary address and dope vector, counting on the called routine to copy the
argument into the variable-size area at the top of the frame.
[Figure 8.1: nested routines A, B, C, D, E (left); stack frames C, D, B, E, A from top to bottom, with fp at C, joined by static and dynamic links (right)]
Figure 8.1 Example of subroutine nesting, taken from Figure 3.5. Within B, C, and D, all five
routines are visible. Within A and E, routines A, B, and E are visible, but C and D are not. Given
the calling sequence A, E, B, D, C, in that order, frames will be allocated on the stack as shown at
right, with the indicated static and dynamic links.
EXAMPLE 8.3 Static and dynamic links
EXAMPLE 8.4 Visibility of nested routines
In a language with nested subroutines and static scoping (e.g., Pascal, Ada,
ML, Common Lisp, or Scheme), objects that lie in surrounding subroutines,
and that are thus neither local nor global, can be found by maintaining a static
chain (Figure 8.1). Each stack frame contains a reference to the frame of the
lexically surrounding subroutine. This reference is called the static link. By analogy, the saved value of the frame pointer, which will be restored on subroutine
return, is called the dynamic link. The static and dynamic links may or may not
be the same, depending on whether the current routine was called by its lexically surrounding routine, or by some other routine nested in that surrounding
routine.
Whether or not a subroutine is called directly by the lexically surrounding
routine, we can be sure that the surrounding routine is active; there is no other
way that the current routine could have been visible, allowing it to be called.
Consider, for example, the subroutine nesting shown in Figure 8.1. If subroutine
D is called directly from B, then clearly B’s frame will already be on the stack. How
else could D be called? It is not visible in A or E, because it is nested inside of B.
A moment’s thought makes clear that it is only when control enters B (placing B’s
frame on the stack) that D comes into view. It can therefore be called by C, or by
any other routine (not shown) that is nested inside C or D, but only because these
are also within B.
8.2
Calling Sequences
Maintenance of the subroutine call stack is the responsibility of the calling
sequence—the code executed by the caller immediately before and after a subroutine call—and of the prologue (code executed at the beginning) and epilogue
(code executed at the end) of the subroutine itself. Sometimes the term “calling
sequence” is used to refer to the combined operations of the caller, the prologue,
and the epilogue.
Tasks that must be accomplished on the way into a subroutine include passing
parameters, saving the return address, changing the program counter, changing
the stack pointer to allocate space, saving registers (including the frame pointer)
that contain important values and that may be overwritten by the callee, changing
the frame pointer to refer to the new frame, and executing initialization code for
any objects in the new frame that require it. Tasks that must be accomplished
on the way out include passing return parameters or function values, executing
finalization code for any local objects that require it, deallocating the stack frame
(restoring the stack pointer), restoring other saved registers (including the frame
pointer), and restoring the program counter. Some of these tasks (e.g., passing
parameters) must be performed by the caller, because they differ from call to
call. Most of the tasks, however, can be performed either by the caller or the
callee. In general, we will save space if the callee does as much work as possible:
tasks performed in the callee appear only once in the target program, but tasks
performed in the caller appear at every call site, and the typical subroutine is called
in more than one place.
Saving and Restoring Registers
Perhaps the trickiest division-of-labor issue pertains to saving registers. The ideal
approach (see Section 5.5.2) is to save precisely those registers that are both in
use in the caller and needed for other purposes in the callee. Because of separate
compilation, however, it is difficult (though not impossible) to determine this
intersecting set. A simpler solution is for the caller to save all registers that are in
use, or for the callee to save all registers that it will overwrite.
Calling sequence conventions for many processors, including the MIPS and
x86 described in the case studies of Section 8.2.2, strike something of a compromise: registers not reserved for special purposes are divided into two sets of
approximately equal size. One set is the caller’s responsibility, the other is the
callee’s responsibility. A callee can assume that there is nothing of value in any
of the registers in the caller-saves set; a caller can assume that no callee will destroy
the contents of any registers in the callee-saves set. In the interests of code size,
the compiler uses the callee-saves registers for local variables and other long-lived
values whenever possible. It uses the caller-saves set for transient values, which
are less likely to be needed across calls. The result of these conventions is that the
caller-saves registers are seldom saved by either party: the callee knows that they
are the caller’s responsibility, and the caller knows that they don’t contain anything
important.
Maintaining the Static Chain
In languages with nested subroutines, at least part of the work required to maintain
the static chain must be performed by the caller, rather than the callee, because this
work depends on the lexical nesting depth of the caller. The standard approach is
for the caller to compute the callee’s static link and to pass it as an extra, hidden
parameter. Two subcases arise:
1. The callee is nested (directly) inside the caller. In this case, the callee’s static
link should refer to the caller’s frame. The caller therefore passes its own frame
pointer as the callee’s static link.
2. The callee is k ≥ 0 scopes “outward”—closer to the outer level of lexical nesting. In this case, all scopes that surround the callee also surround the caller
(otherwise the callee would not be visible). The caller dereferences its own
static link k times and passes the result as the callee’s static link.
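The two subcases can be sketched in C by modeling frames as explicit structures. The Frame type and both function names are assumptions made for illustration; a real compiler keeps the static link in a register or a frame slot and emits the dereferences in-line.

```c
#include <stddef.h>

/* Hypothetical model of a stack frame: only the fields needed here. */
typedef struct Frame {
    struct Frame *static_link;  /* frame of the lexically surrounding routine */
    const char *routine;        /* routine name, for readability only */
} Frame;

/* Case 1: the callee is nested directly inside the caller, so the
   caller's own frame pointer becomes the callee's static link. */
Frame *static_link_for_nested(Frame *caller_fp) {
    return caller_fp;
}

/* Case 2: the callee is k >= 0 scopes outward from the caller. */
Frame *static_link_for_outward(Frame *caller_fp, int k) {
    Frame *link = caller_fp->static_link;   /* k == 0: same surrounding scope */
    for (int i = 0; i < k; i++) {
        link = link->static_link;           /* one dereference per extra scope */
    }
    return link;
}
```

For the calling sequence A, E, B, D, C of Figure 8.1, E and D are entered through case 1, while B and C are entered through case 2 with k = 0, so B receives E's static link (the frame of A) and C receives D's static link (the frame of B).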
A Typical Calling Sequence
EXAMPLE 8.5 A typical calling sequence
Figure 8.2 shows one plausible layout for a stack frame, consistent with Figure 3.1.
The stack pointer (sp) points to the first unused location on the stack (or the
last used location, depending on the compiler and machine). The frame pointer
(fp) points to a location near the bottom of the frame. Space for all arguments is
reserved in the stack, even if the compiler passes some of them in registers (the
callee will need a place to save them if it calls a nested routine).
To maintain this stack layout, the calling sequence might operate as follows.
The caller
1. saves any caller-saves registers whose values will be needed after the call
2. computes the values of arguments and moves them into the stack or registers
3. computes the static link (if this is a language with nested subroutines), and
passes it as an extra, hidden argument
4. uses a special subroutine call instruction to jump to the subroutine, simultaneously passing the return address on the stack or in a register
In its prologue, the callee
1. allocates a frame by subtracting an appropriate constant from the sp
2. saves the old frame pointer into the stack, and assigns it an appropriate new
value
3. saves any callee-saves registers that may be overwritten by the current routine
(including the static link and return address, if they were passed in registers)
After the subroutine has completed, the epilogue
1. moves the return value (if any) into a register or a reserved location in the stack
2. restores callee-saves registers if needed
3. restores the fp and the sp
4. jumps back to the return address
Finally, the caller
1. moves the return value to wherever it is needed
2. restores caller-saves registers if needed
[Figure 8.2: from the top of the stack (sp, lower addresses) downward: arguments to called routines, temporaries, local variables, saved registers and static link, and the saved fp (with fp near the bottom of the current frame), followed by the return address and then the arguments from the caller in the previous (calling) frame]
Figure 8.2 A typical stack frame. Though we draw it growing upward on the page, the stack
actually grows downward toward lower addresses on most machines. Arguments are accessed
at positive offsets from the fp. Local variables and temporaries are accessed at negative offsets
from the fp. Arguments to be passed to called routines are assembled at the top of the frame,
using positive offsets from the sp.
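The caller and callee steps listed above, together with the frame layout of Figure 8.2, can be mimicked on a toy word-addressed stack. Everything in this sketch is an assumption chosen for illustration: the Machine type, the function names, the convention that sp indexes the last used word, and the choice to pass all arguments and the return address on the stack. Register saving and the static link are left out.

```c
enum { STACK_WORDS = 64 };

/* Toy machine: a downward-growing stack plus sp and fp (hypothetical). */
typedef struct {
    long mem[STACK_WORDS];
    int sp;   /* index of the last used word */
    int fp;   /* index of the saved-fp slot of the current frame */
} Machine;

void machine_init(Machine *m) {
    m->sp = STACK_WORDS;   /* empty stack */
    m->fp = STACK_WORDS;
}

/* Caller: push the arguments (last one first), then the return address. */
void caller_call(Machine *m, const long *args, int nargs, long return_addr) {
    for (int i = nargs - 1; i >= 0; i--) m->mem[--m->sp] = args[i];
    m->mem[--m->sp] = return_addr;
}

/* Callee prologue; frame_words counts the saved-fp slot plus the locals. */
void callee_prologue(Machine *m, int frame_words) {
    m->sp -= frame_words;                  /* 1. allocate the frame */
    int saved_fp_slot = m->sp + frame_words - 1;
    m->mem[saved_fp_slot] = m->fp;         /* 2. save the old fp ... */
    m->fp = saved_fp_slot;                 /*    ... and set the new one */
}

/* Arguments sit at positive offsets from fp, locals at negative offsets. */
long *arg_slot(Machine *m, int i)   { return &m->mem[m->fp + 2 + i]; }
long *local_slot(Machine *m, int i) { return &m->mem[m->fp - 1 - i]; }

/* Callee epilogue: restore fp and sp, then pop the return address. */
long callee_epilogue(Machine *m) {
    int frame_base = m->fp;
    m->fp = (int)m->mem[frame_base];
    m->sp = frame_base + 1;                /* sp now at the return address */
    return m->mem[m->sp++];
}

/* Caller, after the return: discard the argument words. */
void caller_cleanup(Machine *m, int nargs) { m->sp += nargs; }
```

After a matching caller_call, callee_prologue, callee_epilogue, and caller_cleanup, both sp and fp hold the values they had before the call, which is the invariant the calling sequence exists to maintain.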
Special-Case Optimizations
Many parts of the calling sequence, prologue, and epilogue can be omitted in
common cases. If the hardware passes the return address in a register, then a leaf
routine (a subroutine that makes no additional calls before returning)2 can simply
leave it there; it does not need to save it in the stack. Likewise it need not save the
static link or any caller-saves registers.
A subroutine with no local variables and nothing to save or restore may not
even need a stack frame on a RISC machine. The simplest subroutines (e.g.,
library routines to compute the standard mathematical functions) may not touch
memory at all, except to fetch instructions: they may take their arguments in
registers, compute entirely in (caller-saves) registers, call no other routines, and
return their results in registers. As a result they may be extremely fast.
8.2.1
Displays
One disadvantage of static chains is that access to an object in a scope k levels
out requires that the static chain be dereferenced k times. If a local object can be
loaded into a register with a single (displacement mode) memory access, an object
k levels out will require k + 1 memory accesses. This number can be reduced to a
constant by use of a display.
IN MORE DEPTH
As described on the PLP CD, a display is a small array that replaces the static
chain. The jth element of the display contains a reference to the frame of the most
recently active subroutine at lexical nesting level j. If the currently active routine
is nested i > 3 levels deep, then elements i − 1, i − 2, and i − 3 of the display
contain the values that would have been the first three links of the static chain.
An object k levels out can be found at a statically known offset from the address
stored in element j = i − k of the display.
For most programs the cost of maintaining a display in the subroutine calling
sequence tends to be slightly higher than that of maintaining a static chain. At
the same time, the cost of dereferencing the static chain has been reduced by
modern compilers, which tend to do a good job of caching the links in registers
when appropriate. These observations, combined with the trend toward languages
(those descended from C in particular) in which subroutines do not nest, have made
displays less common today than they were in the 1970s.
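The difference in access cost can be seen in a small C model. The Frame type, the MAX_DEPTH bound, and both function names are assumptions for illustration, and the code that keeps the display up to date during calls and returns is omitted.

```c
#include <stddef.h>

#define MAX_DEPTH 8   /* assumed bound on lexical nesting depth */

/* Hypothetical frame: a static link plus a few local slots. */
typedef struct Frame {
    struct Frame *static_link;
    int locals[4];
} Frame;

/* display[j] refers to the most recently active frame at nesting level j. */
Frame *display[MAX_DEPTH];

/* Static chain: k dereferences to reach the frame k levels out,
   then one more access for the object itself. */
int load_via_static_chain(Frame *fp, int k, int offset) {
    Frame *f = fp;
    for (int n = 0; n < k; n++) {
        f = f->static_link;
    }
    return f->locals[offset];
}

/* Display: the routine at level i finds the frame k levels out in
   element i - k, whatever the value of k. */
int load_via_display(int i, int k, int offset) {
    return display[i - k]->locals[offset];
}
```

Both functions return the same value when the display and the static chain describe the same frames; only the number of memory accesses differs.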
2 A leaf routine is so named because it is a leaf of the subroutine call graph, a data structure mentioned
in Exercise 3.10.
8.2.2
Case Studies: C on the MIPS; Pascal on the x86
Calling sequences differ significantly from machine to machine and even compiler
to compiler (though typically a hardware manufacturer publishes a suggested set of
conventions for a given architecture, to promote interoperability among program
components produced by different compilers). Some of the most significant differences can be found in a comparison of CISC and RISC conventions.
Compilers for CISC machines tend to pass arguments on the stack; compilers
for RISC machines tend to pass arguments in registers.
Compilers for CISC machines usually dedicate a register to the frame pointer;
compilers for RISC machines often do not.
Compilers for CISC machines often rely on special-purpose instructions to
implement parts of the calling sequence; available instructions on a RISC
machine are typically much simpler.
The use of the stack to pass arguments reflects the technology of the 1970s,
when register sets were significantly smaller and memory access was significantly
faster (in comparison to processor speed) than is the case today. Most CISC
instruction sets include push and pop instructions that combine a store or load
with automatic update of the stack pointer. The push instruction, in particular,
was traditionally used to pass arguments to subroutines, effectively allocating
stack space on demand. The resulting instability in the value of the sp made it
difficult (though not impossible) to use that register as the base for access to local
variables. A separate frame pointer made code generation easier and, perhaps
more important, made it practical to locate local variables from within a simple
symbolic debugger.
IN MORE DEPTH
On the PLP CD we look in some detail at the stack layout conventions and calling
sequences of a representative pair of compilers: the SGI MIPSpro C compiler for
the 64-bit MIPS architecture, and the GNU Pascal compiler ( gpc ) for the 32-bit
x86. The MIPSpro compiler is the predecessor to the widely used Open64 research
compiler. It illustrates the heavy use of registers on modern RISC machines. The
gpc compiler, while adjusted somewhat to reflect modern implementations of the
x86, still retains vestiges of its CISC ancestry, with heavier use of the stack. It also
illustrates the use of the static chain to accommodate nested subroutines, and the
creation of closures when such routines are passed as parameters.
8.2.3
Register Windows
As an alternative to saving and restoring registers on subroutine calls and returns,
the original Berkeley RISC machines [PD80, Pat85] introduced a hardware mechanism known as register windows. The basic idea is to map the ISA’s limited set of
register names onto some subset (window) of a much larger collection of physical
registers, and to change the mapping when making subroutine calls. Old and new
mappings overlap a bit, allowing arguments to be passed (and function results
returned) in the intersection.
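The overlapping mappings can be illustrated with a toy register file in C. The sizes below (64 physical registers, 8 register names per window, an overlap of 2) and all identifiers are assumptions chosen to keep the sketch small; they do not describe any real processor, and the handling of window overflow and underflow is omitted.

```c
enum { PHYS_REGS = 64, WINDOW = 8, OVERLAP = 2 };   /* assumed sizes */

/* Hypothetical register file: many physical registers, one current window. */
typedef struct {
    long phys[PHYS_REGS];
    int base;             /* physical index of register name 0 */
} RegFile;

/* Map a register name (0 .. WINDOW-1) onto the current window. */
long *reg(RegFile *rf, int name) {
    return &rf->phys[rf->base + name];
}

/* On a call the window slides so that the caller's last OVERLAP names
   become the callee's first OVERLAP names. */
void window_call(RegFile *rf)   { rf->base += WINDOW - OVERLAP; }

/* On a return the previous mapping is restored. */
void window_return(RegFile *rf) { rf->base -= WINDOW - OVERLAP; }
```

With these sizes, a value the caller places in name 6 is visible to the callee as name 0, and a result the callee leaves in name 1 is visible to the caller as name 7 after the return, with no register saved or restored.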
IN MORE DEPTH
We consider register windows in more detail on the PLP CD. They have appeared
in several commercial processors, most notably the Sun SPARC and the Intel IA-64
(Itanium).
8.2.4
In-Line Expansion
EXAMPLE 8.6 Requesting an inline subroutine
As an alternative to stack-based calling conventions, many language implementations allow certain subroutines to be expanded in-line at the point of call. A copy
of the “called” routine becomes a part of the “caller”; no actual subroutine call
occurs. In-line expansion avoids a variety of overheads, including space allocation,
branch delays from the call and return, maintaining the static chain or display,
and (often) saving and restoring registers. It also allows the compiler to perform
code improvements such as global register allocation, instruction scheduling, and
common subexpression elimination across the boundaries between subroutines,
something that most compilers can’t do otherwise.
In many implementations, the compiler chooses which subroutines to expand
in-line and which to compile conventionally. In some languages, the programmer
can suggest that particular routines be in-lined. In C++ and C99, the keyword
inline can be prefixed to a function declaration:
inline int max(int a, int b) {return a > b ? a : b;}
In Ada, the programmer can request in-line expansion with a significant comment,
or pragma:
function max(a, b : integer) return integer is
begin
    if a > b then return a; else return b; end if;
end max;
pragma inline(max);
Like the inline of C99 and C++, this pragma is a hint; the compiler is permitted
to ignore it.
DESIGN & IMPLEMENTATION
Hints and directives
Formally, the inline keyword is a hint in C++ and C99, rather than a directive:
it suggests but does not require that the compiler actually expand the subroutine in-line. The compiler is free to use a conventional implementation when
inline has been specified, or to use an in-line implementation when inline
has not been specified, if it has reason to believe that this will result in better
code.
In effect, the inclusion of hints like inline in a programming language
represents an acknowledgment that advice from the expert programmer may
sometimes be useful with current compiler technology, but that this may change
in the future. By contrast, the use of pointer arithmetic in place of array subscripts, as discussed in the sidebar on page 354, is more of a directive than
a hint, and may complicate the generation of high-quality code from legacy
programs.
EXAMPLE 8.7 In-lining and recursion
In Section 3.7 we noted the similarity between in-line expansion and macros,
but argued that the former is semantically preferable. In fact, in-line expansion
is semantically neutral: it is purely an implementation technique, with no effect
on the meaning of the program. In comparison to real subroutine calls, in-line
expansion has the obvious disadvantage of increasing code size, since the entire
body of the subroutine appears at every call site. In-line expansion is also not an
option in the general case for recursive subroutines. For the occasional case in
which a recursive call is possible but unlikely, it may be desirable to generate a true
recursive subroutine, but to expand one level of that routine in-line at each call
site. As a simple example, consider a binary tree whose leaves contain character
strings. A routine to return the fringe of this tree (the left-to-right concatenation
of the values in its leaves) might look like this:
string fringe(bin_tree *t) {
    // assume both children are nil or neither is
    if (t->left == 0) return t->val;
    return fringe(t->left) + fringe(t->right);
}
A compiler can expand this code in-line if it makes each nested invocation a true
subroutine call. Since half the nodes in a binary tree are leaves, this expansion
will eliminate half the dynamic calls at run-time. If we expand not only the root
calls but also (one level of) the two calls within the true subroutine version, only
a quarter of the original dynamic calls will remain.
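The effect of expanding one level at a call site can be written out by hand in C. This is only a sketch of what a compiler might generate: the bin_tree layout, the output buffer, the function names, and the true_calls counter are assumptions added so that the number of real invocations can be observed.

```c
#include <stddef.h>
#include <string.h>

typedef struct bin_tree {
    struct bin_tree *left, *right;  /* both NULL or neither */
    const char *val;                /* meaningful only in leaves */
} bin_tree;

static int true_calls;   /* counts real (non-inlined) invocations */

/* The true recursive subroutine: appends the fringe of t to out. */
void fringe(const bin_tree *t, char *out) {
    true_calls++;
    if (t->left == NULL) {
        strcat(out, t->val);
        return;
    }
    fringe(t->left, out);
    fringe(t->right, out);
}

/* Stand-in for a caller in which "fringe(t, out)" has been expanded
   in-line once: the leaf test runs in the caller, and only the nested
   invocations remain true subroutine calls. */
void fringe_at_call_site(const bin_tree *t, char *out) {
    if (t->left == NULL) {
        strcat(out, t->val);
    } else {
        fringe(t->left, out);
        fringe(t->right, out);
    }
}
```

Both versions produce the same string. When the tree passed at the call site is a single leaf, the expanded version makes no true calls at all; otherwise it saves the one call that would have been made for the root. The out buffer must start as an empty string and be large enough for the result.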
DESIGN & IMPLEMENTATION
In-line and modularity
Probably the most important argument for in-line expansion is that it allows
programmers to adopt a very modular programming style, with lots of tiny subroutines, without sacrificing performance. This modular programming style is
essential for object-oriented languages, as we shall see in Chapter 9. The benefit of in-lining is undermined to some degree by the fact that changing the
definition of an in-lined function forces the recompilation of every user of the
function; changing the definition of an ordinary function (without changing
its interface) forces relinking only. The best of both worlds may be achieved in
systems with just-in-time compilation (Section 15.2.1).
CHECK YOUR UNDERSTANDING
1. What is a subroutine calling sequence? What does it do? What is meant by the
subroutine prologue and epilogue?
2. How do calling sequences typically differ in CISC and RISC compilers?
3. Describe how to maintain the static chain during a subroutine call.
4. What is a display? How does it differ from a static chain?
5. What are the purposes of the stack pointer and frame pointer registers? Why
does a subroutine often need both?
6. Why do RISC machines typically pass subroutine parameters in registers rather
than on the stack?
7. Why do subroutine calling conventions often give the caller responsibility
for saving half the registers and the callee responsibility for saving the other
half?
8. If work can be done in either the caller or the callee, why do we typically prefer
to do it in the callee?
9. Why do compilers typically allocate space for arguments in the stack, even
when they pass them in registers?
10. List the optimizations that can be made to the subroutine calling sequence in
important special cases (e.g., leaf routines).
11. How does an in-line subroutine differ from a macro?
12. Under what circumstances is it desirable to expand a subroutine in-line?
8.3
Parameter Passing
Most subroutines are parameterized: they take arguments that control certain
aspects of their behavior, or specify the data on which they are to operate.
Parameter names that appear in the declaration of a subroutine are known
as formal parameters. Variables and expressions that are passed to a subroutine in a particular call are known as actual parameters. We have been referring to actual parameters as arguments. In the following two subsections, we
discuss the most common parameter-passing modes, most of which are implemented by passing values, references, or closures. In Section 8.3.3 we will
look at additional mechanisms, including conformant array parameters, missing
and default parameters, named parameters, and variable-length argument lists.
Finally, in Section 8.3.4 we will consider mechanisms for returning values from
functions.
EXAMPLE 8.8 Infix operators
As we noted in Section 6.1, most languages use a prefix notation for calls
to user-defined subroutines, with the subroutine name followed by a parenthesized argument list. Lisp places the function name inside the parentheses, as in
(max a b) . ML allows the programmer to specify that certain names represent
infix operators, which appear between a pair of arguments:
infixr 8 tothe;                         (* exponentiation *)
fun x tothe 0 = 1.0
  | x tothe n = x * (x tothe(n-1));     (* assume n >= 0 *)
The infixr declaration indicates that tothe will be a right-associative binary
infix operator, at precedence level 8 (multiplication and division are at level 7,
addition and subtraction at level 6). Fortran 90 also allows the programmer to
define new infix operators, but it requires their names to be bracketed with periods
(e.g., A .cross. B), and it gives them all the same precedence. Smalltalk uses infix
(or “mixfix”) notation (without precedence) for all its operations.
EXAMPLE 8.9 Control abstraction in Lisp and Smalltalk
The uniformity of Lisp and Smalltalk syntax makes control abstraction particularly effective: user-defined subroutines (functions in Lisp, “messages” in Smalltalk) use the same style of syntax as built-in operations. As an example, consider
if ... then ... else:
if a > b then max := a else max := b;               (* Pascal *)
(if (> a b) (setf max a) (setf max b))              ; Lisp
(a > b) ifTrue: [max <- a] ifFalse: [max <- b].     "Smalltalk"
In Pascal or C it is clear that if ... then ... else is a built-in language construct: it does not look like a subroutine call. In Lisp and Smalltalk, on the
other hand, the analogous conditional constructs are syntactically indistinguishable from user-defined operations. They are in fact defined in terms of
simpler concepts, rather than being built in, though they require a special mechanism to evaluate their arguments in normal, rather than applicative, order
(Section 6.6.2).
8.3.1
Parameter Modes
In our discussion of subroutines so far, we have glossed over the semantic rules that
govern parameter passing, and that determine the relationship between actual and
formal parameters. Some languages—including C, Fortran, ML, and Lisp—define
a single set of rules that apply to all parameters. Other languages, including Pascal,
Modula, and Ada, provide two or more sets of rules, corresponding to different
parameter-passing modes. As in many aspects of language design, the semantic
details are heavily influenced by implementation issues.
EXAMPLE 8.10 Passing an argument to a subroutine
Suppose for the moment that x is a global variable in a language with a value
model of variables, and that we wish to pass x as a parameter to subroutine p :
p(x);
EXAMPLE 8.11 Value and reference parameters
From an implementation point of view, we have two principal alternatives: we
may provide p with a copy of x ’s value, or we may provide it with x ’s address.
The two most common parameter-passing modes, called call-by-value and call-by-reference, are designed to reflect these implementations.
With value parameters, each actual parameter is assigned into the corresponding formal parameter when a subroutine is called; from then on, the two are independent. With reference parameters, each formal parameter introduces, within the
body of the subroutine, a new name for the corresponding actual parameter. If
the actual parameter is also visible within the subroutine under its original name
(as will generally be the case if it is declared in a surrounding scope), then the
two names are aliases for the same object, and changes made through one will be
visible through the other. In most languages (Fortran is an exception; see below)
an actual parameter that is to be passed by reference must be an l-value; it cannot
be the result of an arithmetic operation, or any other value without an address.
As a simple example, consider the following pseudocode:
x : integer                 -- global
procedure foo(y : integer)
    y := 3
    print x
...
x := 2
foo(x)
print x
If y is passed to foo by value, then the assignment inside foo has no visible effect—
y is private to the subroutine—and the program prints 2 twice. If y is passed to
foo by reference, then the assignment inside foo changes x — y is just a local name
for x —and the program prints 3 twice.
DESIGN & IMPLEMENTATION
Parameter modes
While it may seem odd to introduce parameter modes (a semantic issue) in
terms of implementation, the distinction between value and reference parameters is fundamentally an implementation issue. Most languages with more than
one mode (Ada is the principal exception) might fairly be characterized as an
attempt to paste acceptable semantics onto the desired implementation, rather
than to find an acceptable implementation of the desired semantics.
Chapter 8 Subroutines and Control Abstraction
Variations on Value and Reference Parameters
EXAMPLE 8.12 Call-by-value/result
If the purpose of call-by-reference is to allow the called routine to modify the actual
parameter, we can achieve a similar effect using call-by-value/result, a mode first
introduced in Algol W. Like call-by-value, call-by-value/result copies the actual
parameter into the formal parameter at the beginning of subroutine execution.
Unlike call-by-value, it also copies the formal parameter back into the actual
parameter when the subroutine returns. In Example 8.11, value/result would copy
x into y at the beginning of foo, and y into x at the end of foo. Because foo
accesses x directly in between, the program’s visible behavior would differ
from its behavior with call-by-reference: the assignment of 3 into y would not affect x
until after the inner print statement, so the program would print 2 and then 3.
EXAMPLE 8.13 Emulating call-by-reference in C
In Pascal, parameters are passed by value by default; they are passed by reference
if preceded by the keyword var in their subroutine header’s formal parameter list.
Parameters in C are always passed by value, though the effect for arrays is unusual:
because of the interoperability of arrays and pointers in C (Section 7.7.1), what
is passed by value is a pointer; changes to array elements accessed through this
pointer are visible to the caller. To allow a called routine to modify a variable other
than an array in the caller’s scope, the C programmer must pass the address of the
variable explicitly:
void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }
...
swap(&v1, &v2);
Fortran passes all parameters by reference, but does not require that every
actual parameter be an l-value. If a built-up expression appears in an argument
list, the compiler creates a temporary variable to hold the value, and passes this
variable by reference. A Fortran subroutine that needs to modify the values of its
formal parameters without modifying its actual parameters must copy the values
into local variables, and modify those instead.
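Returning to the foo example of Example 8.11, the three outcomes discussed so far (value, reference, and value/result) can be reproduced in a small C++ sketch. C++ has no value/result mode, so the third routine emulates it by hand with a copy on entry and a copy on exit. The routine names are ours, and each routine returns the value of x that it observes where the pseudocode had its inner print.

```cpp
int x;  // global, as in the pseudocode

// Call-by-value: y is a private copy of the argument.
int foo_value(int y) {
    y = 3;
    return x;            // value the inner "print x" would show
}

// Call-by-reference: y is another name for the argument.
int foo_reference(int& y) {
    y = 3;
    return x;            // already changed if y aliases x
}

// Emulated call-by-value/result (an assumption: C++ has no such mode).
int foo_value_result(int& actual) {
    int y = actual;      // copy in at the start of the call
    y = 3;
    int observed = x;    // x has not been updated yet
    actual = y;          // copy out when the call returns
    return observed;
}
```

With x set to 2 before each call, foo_value(x) observes 2 and leaves x at 2, foo_reference(x) observes 3 and leaves x at 3, and foo_value_result(x) observes 2 but leaves x at 3. These correspond to the outputs "2 2", "3 3", and "2 3" described in the text.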
Call-by-Sharing
Call-by-value and call-by-reference make the most sense in a
language with a value model of variables: they determine whether we copy the
variable or pass an alias for it. Neither option really makes sense in a language
like Smalltalk, Lisp, ML, or Clu, in which a variable is already a reference. Here it
is most natural simply to pass the reference itself, and let the actual and formal
parameters refer to the same object. Clu calls this mode call-by-sharing. It is
different from call-by-value because, although we do copy the actual parameter
into the formal parameter, both of them are references; if we modify the object to
which the formal parameter refers, the program will be able to see those changes
through the actual parameter after the subroutine returns. Call-by-sharing is also
different from call-by-reference, because although the called routine can change
the value of the object to which the actual parameter refers, it cannot change the
identity of that object.
8.3 Parameter Passing
As we noted in Sections 6.1.2 (page 227) and 7.7.1, a reference model of variables does not necessarily require that every object be accessed indirectly by
address: the implementation can create multiple copies of immutable objects
(numbers, characters, etc.) and access them directly. Call-by-sharing is thus commonly implemented the same as call-by-value for objects of immutable type.
In keeping with its hybrid model of variables, Java uses call-by-value for variables of built-in type (all of which are values), and call-by-sharing for variables
of user-defined class types (all of which are references). An interesting consequence is that a Java subroutine cannot change the value of an actual parameter
of built-in type. A similar approach is the default in C#, but because the language allows users to create both value ( struct ) and reference ( class ) types,
both cases are considered call-by-value. That is, whether a variable is a value
or a reference, we always pass it by copying. (Some authors describe Java the
same way.)
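C++ uses a value model of variables, so it has no call-by-sharing as such, but the behavior can be approximated by passing a pointer by value. The following is a sketch under that assumption; the function names are hypothetical. The callee can modify the shared object, but rebinding its own copy of the pointer has no effect on the caller.

```cpp
#include <vector>

// The pointer is copied, so caller and callee refer to the same vector.
void append_through_shared(std::vector<int>* v) {
    v->push_back(42);    // modifies the shared object; the caller sees it
}

// Rebinding the formal parameter changes only the callee's copy.
void try_to_rebind(std::vector<int>* v, std::vector<int>* other) {
    v = other;           // the caller's pointer still designates its own vector
    v->push_back(7);     // this element goes into *other
}
```

After append_through_shared(&a), the caller's vector a has a new last element. After try_to_rebind(&a, &b), a is unchanged and the 7 appears in b, which illustrates why the callee cannot change the identity of the object to which the actual parameter refers.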
When desired, parameters in C# can be passed by reference instead, by labeling
both a formal parameter and each corresponding argument with the ref or out
keyword. Both of these modes are implemented by passing an address; they differ
in that a ref argument must be definitely assigned prior to the call, as described
in Section 6.1.3; an out argument need not. In contrast to Java, therefore, a C#
subroutine can change the value of an actual parameter of built-in type, if the
parameter is passed ref or out . Similarly, if a variable of class (reference) type
is passed as a ref or out parameter, it may end up referring to a different object
as a result of subroutine execution, something that is not possible with call-by-sharing.
The Purpose of Call-by-Reference
In a language that provides both value and
reference parameters (e.g., Pascal or Modula), there are two principal reasons why
the programmer might choose one over the other. First, if the called routine is
supposed to change the value of an actual parameter (argument), then the programmer must pass the parameter by reference. Conversely, to ensure that the
called routine cannot modify the argument, the programmer can pass the parameter by value. Second, the implementation of value parameters requires copying
actuals to formals, a potentially time-consuming operation when arguments are
large. Reference parameters can be implemented simply by passing an address. (Of
course, accessing a parameter that is passed by reference requires an extra level of
indirection. If the parameter is used often enough, the cost of this indirection may
outweigh the cost of copying the argument.)
The potential inefficiency of large value parameters sometimes prompts programmers to pass an argument by reference when passing by value would be
semantically more appropriate. Pascal programmers, for example, were commonly
taught to use var (reference) parameters both for arguments that need to be modified and for arguments that are very large. Unfortunately, the latter justification
often leads to buggy code, in which a subroutine modifies an argument that the
caller meant to leave unchanged.
Read-Only Parameters
To combine the efficiency of reference parameters and
the safety of value parameters, Modula-3 provides a READONLY parameter mode.
Any formal parameter whose declaration is preceded by READONLY cannot be
changed by the called routine: the compiler prevents the programmer from using
that formal parameter on the left-hand side of any assignment statement, reading
it from a file, or passing it by reference to any other subroutine. Small READONLY
parameters are generally implemented by passing a value; larger READONLY parameters are implemented by passing an address. As in Fortran, a Modula-3 compiler
will create a temporary variable to hold the value of any built-up expression passed
as a large READONLY parameter.
The equivalent of READONLY parameters is also available in C, which allows any
variable or parameter declaration to be preceded by the keyword const. Const
variables are “elaboration-time constants,” as described in Section 3.2. Const
parameters are particularly useful when passing addresses:
EXAMPLE 8.14 Const parameters in C
void append_to_log(const huge_record* r) { ...
...
append_to_log(&my_record);
Here the keyword const applies to the record to which r points (see footnote 3); the callee will
be unable to change the record’s contents. Note, however, that in C the caller must
take the address of the record explicitly, and the compiler does not have the option
of passing by value.
One traditional problem with parameter modes—and with the READONLY
mode in particular—is that they tend to confuse the key pragmatic issue (does
the implementation pass a value or a reference?) with two semantic issues: is
the callee allowed to change the formal parameter and, if so, will the changes be
reflected in the actual parameter? C keeps the pragmatic issue separate, by forcing
the programmer to pass references explicitly with pointers. Still, its const mode
serves double duty: is the intent of const foo* p to protect the actual parameter
from change, or to document the fact that the subroutine thinks of the formal
parameter as a constant rather than a variable, or both?
Parameter Modes in Ada
Ada provides three parameter-passing modes, called in , out , and in out . In
parameters pass information from the caller to the callee; they can be read by the
callee but not written. Out parameters pass information from the callee to the
caller. In Ada 83 they can be written by the callee but not read; in Ada 95 they
can be both read and written, but they begin their life uninitialized. In out
parameters pass information in both directions; they can be both read and written.
Changes to out or in out parameters always change the actual parameter.
3 Following the usual rules for parsing C declarations (page 354), r is a pointer to a huge_record
whose value is constant. If we wanted r to be a constant that points to a huge_record, we would
need to say huge_record* const r.
For parameters of scalar and access (pointer) types, Ada specifies that all three
modes are to be implemented by copying values. For these parameters, then, in
is call-by-value, in out is call-by-value/result, and out is simply call-by-result
(the value of the formal parameter is copied into the actual parameter when
the subroutine returns). For parameters of most constructed types, however, Ada
specifically permits an implementation to pass either values or addresses. In most
languages, these two different mechanisms would lead to different semantics:
changes made to an in out parameter that is passed as an address will affect
the actual parameter immediately; changes made to an in out parameter that is
passed as a value will not affect the actual parameter until the subroutine returns.
As noted in Example 8.12, the difference can lead to different behavior in the
presence of aliases.
One possible way to hide the distinction between reference and value/result
would be to outlaw the creation of aliases, as Euclid does. Ada takes a simpler
tack: a program that can tell the difference between value and address-based
implementations of (nonscalar, nonpointer) in out parameters is said to be erroneous—incorrect, but in a way that the language implementation is not required
to catch.
Ada’s semantics for parameter passing allow a single set of modes to be used
not only for subroutine parameters, but also for communication among concurrently executing tasks (to be discussed in Chapter 12). When tasks are executing
on separate machines, with no memory in common, passing the address of an
actual parameter is not a practical option. Most Ada compilers pass large arguments to subroutines as addresses; they pass them to the entry points of tasks by
copying.
References in C++
EXAMPLE 8.15 Reference parameters in C++
Programmers who switch to C after some experience with Pascal, Modula, or
Ada (or with call-by-sharing in Java or Lisp) are often frustrated by C’s lack
of reference parameters. As noted above, one can always arrange to modify an
object by passing its address, but then the formal parameter is a pointer, and must
be explicitly dereferenced whenever it is used. C++ addresses this problem by
introducing an explicit notion of a reference. Reference parameters are specified
by preceding their name with an ampersand in the header of the function:
void swap(int &a, int &b) { int t = a; a = b; b = t; }
In the code of this swap routine, a and b are int s, not pointers to int s; no
dereferencing is required. Moreover, the caller passes as arguments the variables
whose values are to be swapped, rather than passing their addresses.
As in C, a C++ parameter can be declared to be const to ensure that it is
not modified. For large types, const reference parameters in C++ provide the
same combination of speed and safety found in the READONLY parameters of
Modula-3: they can be passed by address, and cannot be changed by the called
routine.
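As a rough illustration of the trade-off, the sketch below passes a large record both by const reference and by value. HugeRecord and both function names are hypothetical; the point is only that the const reference version avoids the copy while still preventing modification.

```cpp
#include <array>
#include <numeric>

struct HugeRecord {                  // hypothetical large type
    std::array<int, 4096> samples{};
};

// Typically passed as an address; the callee cannot modify the record.
long total(const HugeRecord& r) {
    // r.samples[0] = 1;             // would be rejected by the compiler: r is const
    return std::accumulate(r.samples.begin(), r.samples.end(), 0L);
}

// Passed by value: the whole record is copied, and only the copy can change.
long total_by_value(HugeRecord r) {
    r.samples[0] += 1;               // affects the local copy only
    return std::accumulate(r.samples.begin(), r.samples.end(), 0L);
}
```

In both cases the caller's record is left unchanged; the difference is that total_by_value pays for copying all of the samples on every call.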
EXAMPLE 8.16 References as aliases in C++
References in C++ see their principal use as parameters, but they can appear in
other contexts as well. Any variable can be declared to be a reference:
int i;
int &j = i;
...
i = 2;
j = 3;
cout << i;    // prints 3
Here j is a reference to (an alias for) i. The initializer in the declaration is required;
it identifies the object for which j is an alias. Moreover it is not possible later to
change the object to which j refers; it will always refer to i.
Any change to i or j can be seen by reading the other. Most C++ compilers
implement references with addresses. In this example, i will be assigned a location
that contains an integer, while j will be assigned a location that contains the
address of i. Despite their different implementation, however, there is no semantic
difference between i and j; the exact same operations can be applied to either,
with precisely the same results.
EXAMPLE 8.17 Returning a reference from a function
While there is seldom any reason to create aliases on purpose in straight-line
code, references in C++ are highly useful for at least one purpose other than
parameters—namely, function returns. Some objects—file buffers, for example—
do not support a copy operation, and therefore cannot be passed or returned by
value. One can always return a pointer, but just as with subroutine parameters,
the subsequent dereferencing operations can be cumbersome.
Section 7.9 explains how references are used for I/O in C++. The overloaded
<< and >> operators return a reference to their first argument, which can in turn
be passed to subsequent << or >> operations. The syntax
cout << a << b << c;
is short for
((cout.operator<<(a)).operator<<(b)).operator<<(c);
Without references, << and >> would have to return a pointer to their stream:
((cout.operator<<(a))->operator<<(b))->operator<<(c);
or
*(*(cout.operator<<(a)).operator<<(b)).operator<<(c);
This change would spoil the cascading syntax of the operator form:
*(*(cout << a) << b) << c;
It should be noted that the ability to return references from functions is not
new in C++: Algol 68 provides the same capability. The object-oriented features
of C++, and its operator overloading, make reference returns particularly useful.
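The same cascading pattern can be sketched for a user-defined class. Log below is a hypothetical class, not part of any library; each overloaded << returns a reference to the object it was applied to, so that a chain of insertions operates on one object without copying it.

```cpp
#include <string>

class Log {                          // hypothetical class, for illustration only
    std::string text_;
public:
    // Returning *this by reference lets the next << apply to the same object.
    Log& operator<<(int n)                { text_ += std::to_string(n); return *this; }
    Log& operator<<(const std::string& s) { text_ += s;                 return *this; }
    const std::string& text() const       { return text_; }
};
```

Given Log out, the expression out << 1 << std::string(", ") << 2 accumulates the text "1, 2" in out. Had the operators returned Log by value, each step after the first would have modified a temporary copy instead.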
Closures as Parameters
EXAMPLE 8.18 Subroutines as parameters in Pascal
A closure (a reference to a subroutine, together with its referencing environment)
may be passed as a parameter for any of several reasons. The most obvious of
these arises when the parameter is declared to be a subroutine (sometimes called
a formal subroutine). In Standard Pascal one might write:
 1. procedure apply_to_A(function f(n : integer) : integer;
 2.         var A : array [low..high : integer] of integer);
 3.     var i : integer;
 4. begin
 5.     for i := low to high do A[i] := f(A[i]);
 6. end;
    ...
 7. var k : integer;    (* in nested scope *)
    ...
 8. function add_k (m : integer) : integer;
 9. begin
10.     add_k := m + k;
11. end;
    ...
12. k := 3;
13. apply_to_A(add_k, my_array);
As discussed in Section 3.6.1, a closure needs to include both a code address and a
referencing environment because, in a language with nested subroutines, we need
to make sure that the environment available to f at line 5 is the same that would
have been available to add_k if it had been called directly at line 13 (in particular,
that it includes the binding for k).
Ada 83 did not permit subroutines to be passed as parameters. Some of the same
effect could be obtained through generic subroutines, but not enough: Ada 95
added first-class pointer-to-subroutine types, with semantics and implementation similar to Pascal. Fortran has always allowed subroutines to be passed as
parameters, but only allowed them to nest beginning in Fortran 90 (and then only
one level deep).
EXAMPLE 8.19 First-class subroutines in Scheme
Subroutines are routinely passed as parameters (and returned as results) in
functional languages. A list-based version of apply_to_A would look something like this in Scheme (for the meanings of car , cdr , and cons , see
Section 7.8):
(define apply-to-L (lambda (f l)
    (if (null? l) '()
        (cons (f (car l)) (apply-to-L f (cdr l))))))
Because Scheme (like Lisp) is not statically typed, there is no need to specify
the type of f . At run time, a Scheme implementation will announce a dynamic
semantic error in (f (car l)) if f is not a function, and in (null? l) , (car l) ,
or (cdr l) if l is not a list.
EXAMPLE 8.20 First-class subroutines in ML
The code in ML is similar, but the implementation uses inference (Section 7.2.4) to determine the types of f and l at compile time:
fun apply_to_L(f, l) =
    case l of
        nil    => nil
      | h :: t => f(h) :: apply_to_L(f, t);
EXAMPLE 8.21 Subroutine pointers in C and C++
As noted in Section 3.6, C and C++ have no need of subroutine closures, because
their subroutines do not nest. Simple pointers to subroutines suffice. These are
permitted both as parameters and as variables.
void apply_to_A(int (*f)(int), int A[], int A_size) {
    int i;
    for (i = 0; i < A_size; i++) A[i] = f(A[i]);
}
The syntax f(n) is used not only when f is the name of a function, but also when
f is a pointer to a subroutine; the pointer need not be dereferenced explicitly.
In object-oriented languages, one can approximate the behavior of a subroutine closure, even without nested subroutines, by packaging a method and its
“environment” within an explicit object. We described these object closures in Section 3.6.3. Because they are ordinary objects, they require no special mechanisms
to pass them as parameters or to store them in objects.
The delegates of C# significantly extend the notion of object closures. Delegates provide type safety without the restrictions of inheritance. A delegate can
be instantiated not only with a specified object method (subsuming the object
closures of C++ and Java), but also with a static function (subsuming the subroutine pointers of C and C++) or with an anonymous nested delegate or lambda
expression (subsuming true subroutine closures). If an anonymous delegate or
lambda expression refers to objects declared in the surrounding method, then
those objects have unlimited extent. Finally, as we shall see in Section 8.7.2, a C#
delegate can actually contain a list of closures, in which case calling the delegate
has the effect of calling all the entries on the list, in turn. (This behavior generally
makes sense only when each entry has a void return type. It is used primarily
when processing events.)
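The difference between a bare subroutine pointer and a closure can be sketched in C++, where lambda expressions (available since C++11, after the dialect described above) produce object closures that capture part of their environment. The names apply_ptr, apply_closure, and add_one are ours, and std::function is used here only as one convenient way to accept any callable.

```cpp
#include <functional>
#include <vector>

// Formal subroutine as a plain pointer: code address only, no environment.
void apply_ptr(int (*f)(int), std::vector<int>& A) {
    for (int& a : A) a = f(a);
}

// Formal subroutine as a closure: code plus whatever the callable captured.
void apply_closure(const std::function<int(int)>& f, std::vector<int>& A) {
    for (int& a : A) a = f(a);
}

// An ordinary function needs no environment and can be passed by pointer.
int add_one(int m) { return m + 1; }
```

A caller with a local variable k = 3 can write apply_closure([k](int m) { return m + k; }, A), in which the lambda plays the role of add_k and carries its own copy of k. A lambda that captures variables cannot be converted to a plain function pointer, so it cannot be passed to apply_ptr; add_one can.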
8.3.2 Call-by-Name
Explicit subroutine parameters are not the only language feature that requires a
closure to be passed as a parameter. In general, a language implementation must
pass a closure whenever the eventual use of the parameter requires the restoration
of a previous referencing environment. Interesting examples occur in the call-by-name parameters of Algol 60 and Simula, the label parameters of Algol 60 and
Algol 68, and the call-by-need parameters of Miranda, Haskell, and R.
IN MORE DEPTH
We consider call-by-name in more detail on the PLP CD. When Algol 60 was
defined, most programmers programmed in assembly language (Fortran was only
a few years old, and Lisp was even newer). The assembly languages of the day
made heavy use of macros, and it was natural for the Algol designers to propose a parameter-passing mechanism that mimicked the behavior of macros,
namely normal-order argument evaluation (Section 6.6.2). It was also natural,
given common practice in assembly language, to allow a goto to jump to a label
that was passed as a parameter. Call-by-name parameters have some interesting and powerful applications, but they are more difficult to implement (and
more expensive to use) than one might at first expect: they require the passing of closures, sometimes referred to as thunks. Label parameters are typically
implemented by closures as well. Both call-by-name and label parameters tend
to lead to inscrutable code; modern languages encourage programmers to use
explicit formal subroutines (Section 8.3.1) and structured exceptions (Section 8.5)
instead.
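The role of a thunk can be approximated in C++ by passing a callable that re-evaluates the argument expression each time the formal parameter is used. The sketch below is only an emulation, since C++ has no call-by-name mode; it follows the classic summation idiom usually called Jensen's device, and the name sum and its parameter list are our own.

```cpp
#include <functional>

// term stands in for a by-name parameter: calling it re-evaluates the
// caller's expression in the caller's environment, using the current i.
double sum(int& i, int lo, int hi, const std::function<double()>& term) {
    double total = 0.0;
    for (i = lo; i <= hi; ++i) {
        total += term();     // evaluated afresh on every use
    }
    return total;
}
```

For example, with a caller variable int i, the call sum(i, 1, 4, [&i]() { return double(i) * i; }) adds 1 + 4 + 9 + 16. The expression is never evaluated at the call site; it is evaluated four times inside sum, which hints at why call-by-name is more expensive than it first appears.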
8.3.3 Special-Purpose Parameters
Figure 8.3 contains a summary of the common parameter-passing modes. In this
subsection we examine other aspects of parameter passing.
Conformant Arrays
As we saw in Section 7.4.2, the binding time for array dimensions and bounds
varies greatly from language to language, ranging from compile time (Basic
and Pascal) to elaboration time (Ada and Fortran 90) to arbitrary times during execution (APL, Perl, and Common Lisp). In several languages, the rules
for parameters are looser than they are for variables. A formal array parameter whose shape is finalized at run time (in a language that usually determines
shape at compile time) is called a conformant, or open, array parameter. Example 7.53 (page 331) illustrates the use of conformant arrays in Pascal, as does
Example 8.18. The C equivalent of the latter appeared in Example 8.21. A multidimensional example (valid only since C99) can be found in Example 7.54
(page 332).
Default (Optional) Parameters
In Section 3.3.6 we noted that the principal use of dynamic scoping is to change
the default behavior of a subroutine. We also noted that the same effect can
be achieved with default parameters. A default parameter is one that need not
necessarily be provided by the caller; if it is missing, then a preestablished default
value will be used instead.
Parameter     Representative                 Implementation         Permissible    Change to  Alias?
mode          languages                      mechanism              operations     actual?
------------  -----------------------------  ---------------------  -------------  ---------  ------
value         C/C++, Pascal,                 value                  read, write    no         no
              Java/C# (value types)
in, const     Ada, C/C++, Modula-3           value or reference     read only      no         maybe
out           Ada                            value or reference     write only     yes        maybe
value/result  Algol W                        value                  read, write    yes        no
var, ref      Fortran, Pascal, C++           reference              read, write    yes        yes
sharing       Lisp/Scheme, ML,               value or reference     read, write    yes        yes
              Java/C# (reference types)
in out        Ada                            value or reference     read, write    yes        maybe
name          Algol 60, Simula               closure (thunk)        read, write    yes        yes
need          Haskell, R                     closure (thunk) with   read, write*   yes*       yes*
                                             memoization
Figure 8.3 Parameter-passing modes. Column 1 indicates common names for modes. Column 2 indicates prominent languages
that use the modes, or that introduced them. Column 3 indicates implementation via passing of values, references, or closures.
Column 4 indicates whether the callee can read or write the formal parameter. Column 5 indicates whether changes to the
formal parameter affect the actual parameter. Column 6 indicates whether changes to the formal or actual parameter, during
the execution of the subroutine, may be visible through the other. *Changes to arguments passed by need in R will happen only
on the first use; changes in Haskell are not permitted.
EXAMPLE 8.22 Default parameters in Ada
One common use of default parameters is in I/O library routines (described
in Section 7.9.3). In Ada, for example, the put routine for integers has the
following declaration in the text_IO library package:
type field is integer range 0..integer'last;
type number_base is integer range 2..16;
default_width : field := integer'width;
default_base : number_base := 10;
procedure put(item : in integer;
              width : in field := default_width;
              base : in number_base := default_base);
Here the declaration of default_width uses the built-in type attribute width
to determine the maximum number of columns required to print an integer in
decimal on the current machine (e.g., a 32-bit integer requires no more than 11
columns, including the optional minus sign).
Any formal parameter that is “assigned” a value in its subroutine heading is
optional in Ada. In our text_IO example, the programmer can call put with one,
two, or three arguments. No matter how many are provided in a particular call, the
code for put can always assume it has all three parameters. The implementation is
straightforward: in any call in which actual parameters are missing, the compiler
pretends as if the defaults had been provided; it generates a calling sequence that
loads those defaults into registers or pushes them onto the stack, as appropriate.
On a 32-bit machine, put(37) will print the string “37” in an 11-column field
(with nine leading blanks) in base-10 notation. Put(37, 4) will print “37” in
a four-column field (two leading blanks), and put(37, 4, 8) will print “45”
(37 = 45 in base 8) in a four-column field.
Because the default_width and default_base variables are part of the
text_IO interface, the programmer can change them if desired. When using
default values in calls with missing actuals, the compiler loads the defaults from
the variables of the package. As noted in Section 7.9.3, there are overloaded
instances of put for all the built-in types. In fact, there are two overloaded instances
of put for every type, one of which has an additional first parameter that specifies
the output file to which to write a value (see footnote 4). It should be emphasized that there is
nothing special about I/O as far as default parameters are concerned: defaults
can be used in any subroutine declaration. In addition to Ada, default parameters
appear in C++, Common Lisp, Fortran 90, and Python.
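For comparison, a rough C++ analogue of the Ada put routine is sketched below. It is not the Ada library routine: the function returns the formatted text rather than writing it, the constants are assumed to match a 32-bit integer, and bases above 16 are not handled.

```cpp
#include <cstddef>
#include <string>

const int default_width = 11;   // assumed: columns needed for a 32-bit integer
const int default_base  = 10;

// Hypothetical analogue of put; width and base may be omitted by the caller.
std::string put(int item, int width = default_width, int base = default_base) {
    bool negative = item < 0;
    unsigned int n = negative ? 0u - static_cast<unsigned int>(item)
                              : static_cast<unsigned int>(item);
    std::string digits;
    do {                         // build the digits from least to most significant
        digits.insert(digits.begin(), "0123456789ABCDEF"[n % base]);
        n /= base;
    } while (n != 0);
    if (negative) digits.insert(digits.begin(), '-');
    std::size_t w = static_cast<std::size_t>(width);
    std::string pad(digits.size() < w ? w - digits.size() : 0, ' ');
    return pad + digits;         // right-justified in the requested field
}
```

Here put(37) yields "37" preceded by nine blanks, put(37, 4) yields "  37", and put(37, 4, 8) yields "  45". In C++ the defaults must be trailing parameters and are filled in by the compiler at each call site, much as described above for Ada.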
Named Parameters
EXAMPLE 8.23 Named parameters in Ada
In all of our discussions so far we have been assuming that parameters are positional: the first actual parameter corresponds to the first formal parameter, the
second actual to the second formal, and so on. In some languages, including Ada,
Common Lisp, Fortran 90, Modula-3, and Python, this need not be the case. These
languages allow parameters to be named. Named parameters (also called keyword
parameters) are particularly useful in conjunction with default parameters. Positional notation allows us to write put(37, 4) to print “37” in a four-column
field, but it does not allow us to print in octal in a field of default width: any call
(with positional notation) that specifies a base must also specify a width, explicitly, because the width parameter precedes the base in put ’s parameter list. Named
parameters provide the Ada programmer with a way around this problem:
put(item => 37, base => 8);
Because the parameters are named, their order does not matter; we can also write
put(base => 8, item => 37);
We can even mix the two approaches, using positional notation for the first few
parameters, and names for all the rest:
put(37, base => 8);
4 The real situation is actually a bit more complicated: The put routine for integers is nested
inside integer_IO , a generic package that is in turn inside of text_IO . The programmer must
instantiate a separate version of the integer_IO package for each variety (size) of integer type.
EXAMPLE 8.24 Self-documentation with named parameters
In addition to allowing parameters to be specified in arbitrary order, omitting
any intermediate default parameters for which special values are not required,
named parameter notation has the advantage of documenting the purpose of
each parameter. For a subroutine with a very large number of parameters, it can
be difficult to remember which is which. Named notation makes the meaning of
arguments explicit in the call, as in the following hypothetical example:
format_page(columns => 2,
window_height => 400, window_width => 200,
header_font => Helvetica, body_font => Times,
title_font => Times_Bold, header_point_size => 10,
body_point_size => 11, title_point_size => 13,
justification => true, hyphenation => false,
page_num => 3, paragraph_indent => 18,
background_color => white);
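C++ is not among the languages with named parameters, but a similar self-documenting effect is sometimes approximated by gathering the parameters into a record whose fields have defaults. The sketch below assumes that approach; PageFormat, its fields, and this reduced format_page are hypothetical and cover only a few of the parameters in the Ada call above.

```cpp
#include <string>

// Hypothetical options record; every field has a default value.
struct PageFormat {
    int columns = 1;
    int window_height = 400;
    int window_width = 200;
    std::string body_font = "Times";
    bool hyphenation = true;
};

// Hypothetical routine that receives all of its "parameters" in one record
// and, for the sake of the sketch, simply describes the chosen settings.
std::string format_page(const PageFormat& fmt) {
    return std::to_string(fmt.columns) + " column(s), " + fmt.body_font +
           (fmt.hyphenation ? ", hyphenated" : ", unhyphenated");
}
```

A caller declares PageFormat fmt, assigns only the fields that differ from the defaults (for example fmt.columns = 2 and fmt.hyphenation = false), and then calls format_page(fmt). The field names document the call and the order of the assignments does not matter, though unlike Ada the compiler does not check that required values were supplied.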
Variable Numbers of Arguments
EXAMPLE 8.25 Variable number of arguments in C
Lisp, Python, and C and its descendants are unusual in that they allow the user to
define subroutines that take a variable number of arguments. Examples of such
subroutines can be found in Section 7.9.3: the printf and scanf functions of
C’s stdio I/O library. In C, printf can be declared as follows:
int printf(char *format, ...)
{ ...
The ellipsis ( ... ) in the function header is a part of the language syntax. It indicates that there are additional parameters following the format, but that their
types and numbers are unspecified. Since C and C++ are statically typed, additional parameters are not type safe. They are type safe in Common Lisp and
Python, however, thanks to dynamic typing.
Within the body of a function with a variable-length argument list, the C or
C++ programmer must use a collection of standard routines to access the extra
arguments. Originally defined as macros, these routines have implementations
that vary from machine to machine, depending on how arguments are passed
to functions; today the necessary support is often built into the compiler. For
printf , variable arguments would be used as follows in C:
#include <stdarg.h>    /* macros and type definitions */
int printf(char *format, ...)
{
    va_list args;
    va_start(args, format);
    ...
    char cp = (char) va_arg(args, int);    /* char arguments are promoted to int */
    ...
    double dp = va_arg(args, double);
    ...
    va_end(args);
}
Here args is defined as an object of type va_list, a special (implementation-dependent) type used to enumerate the elided parameters. The va_start routine
takes the last declared parameter (in this case, format) as its second argument. It
initializes its first argument (in this case args) so that it can be used to enumerate
the rest of the caller’s actual parameters. At least one formal parameter must be
declared; they can’t all be elided.
Each call to va_arg returns the value of the next elided parameter. Two examples appear above. Each specifies the expected type of the parameter, and assigns
the result into a variable of the appropriate type. If the expected type is different from the type of the actual parameter, chaos can result. In printf, the
%X placeholders in the format string are used to determine the type: printf
contains a large switch statement, with one arm for each possible X. The
arm for %c contains a call to va_arg(args, int), whose result is converted to char; the arm for %f contains
a call to va_arg(args, double). C applies the default argument promotions to elided parameters: char and short values are widened to int, and float values are extended to
double precision, so there is no need inside
printf to worry about the distinction between floats and doubles. Scanf,
on the other hand, must distinguish between pointers to floats and pointers to
doubles. The call to va_end allows the implementation to perform any necessary
cleanup operations (e.g., deallocation of any heap space used for the va_list,
or repair of any changes to the stack frame that might confuse the epilogue
code).
EXAMPLE 8.26 Variable number of arguments in Java
Like C and C++, C# and recent versions of Java support variable numbers of
parameters, but unlike their parent languages they do so in a type-safe manner,
by requiring all trailing parameters to share a common type. In Java, for example,
one can write
static void print_lines(String foo, String... lines) {
    System.out.println("First argument is \"" + foo + "\".");
    System.out.println("There are " +
        lines.length + " additional arguments:");
    for (String str: lines) {
        System.out.println(str);
    }
}
...
print_lines("Hello, world", "This is a message", "from your sponsor.");
Here again the ellipsis in the method header is part of the language syntax. Method
print_lines has two arguments. The first, foo, is of type String; the second,
lines, is of type String... . Within print_lines, lines functions as if it had
type String[] (array of String). The caller, however, need not package the
second and subsequent parameters into an explicit array; the compiler does this
automatically, and the program prints
First argument is "Hello, world".
There are 2 additional arguments:
This is a message
from your sponsor.
EXAMPLE 8.27 Variable number of arguments in C#
The parameter declaration syntax is slightly different in C#:
static void print_lines(String foo, params String[] lines) {
    Console.WriteLine("First argument is \"" + foo + "\".");
    Console.WriteLine("There are " +
        lines.Length + " additional arguments:");
    for (int i = 0; i < lines.Length; i++) {
        Console.WriteLine(lines[i]);
    }
}
The calling syntax is the same.
8.3.4 Function Returns
EXAMPLE 8.28 Return statement
The syntax by which a function indicates the value to be returned varies greatly. In
languages like Lisp, ML, and Algol 68, which do not distinguish between expressions and statements, the value of a function is simply the value of its body, which
is itself an expression.
In several early imperative languages, including Algol 60, Fortran, and Pascal,
a function specifies its return value by executing an assignment statement whose
left-hand side is the name of the function. This approach has an unfortunate
interaction with the usual static scope rules (Section 3.3.1): the compiler must
forbid any immediately nested declaration that would hide the name of the
function, since the function would then be unable to return. This special case
is avoided in more recent imperative languages by introducing an explicit return
statement:
return expression
In addition to specifying a value, return causes the immediate termination of
the subroutine. A function that has figured out what to return but doesn’t want to
return yet can always assign the return value into a temporary variable, and then
return it later:
rtn := expression
...
return rtn
EXAMPLE 8.29 Incremental computation of a return value
Fortran separates early termination of a subroutine from the specification of
return values: it specifies the return value by assigning to the function name, and
has a return statement that takes no arguments.
Argument-bearing return statements and assignment to the function name
both force the programmer to employ a temporary variable in incremental computations. Here is an example in Ada:
type int_array is array (integer range <>) of integer;
    -- array of integers with unspecified integer bounds
function A_max(A : int_array) return integer is
    rtn : integer;
begin
    rtn := integer'first;
    for i in A'first .. A'last loop
        if A(i) > rtn then rtn := A(i); end if;
    end loop;
    return rtn;
end A_max;
EXAMPLE 8.30 Explicitly named return values in SR
EXAMPLE 8.31 Multivalue returns
Here rtn must be declared as a variable so that the function can read it as well
as write it. Because rtn is a local variable, most compilers will allocate it within
the stack frame of A_max. The return statement must then perform an unnecessary copy to move that variable’s value into the return location allocated by the
caller.
Some languages eliminate the need for a local variable by allowing the result of
a function to have a name in its own right. In SR one can write the following (see footnote 5):
procedure A_max(ref A[1:*]: int) returns rtn : int
    rtn := low(int)
    fa i := 1 to ub(A) ->
        if A[i] > rtn -> rtn := A[i] fi
    af
end
Here rtn can reside throughout its lifetime in the return location allocated by the
caller. A similar facility can be found in Eiffel, in which every function contains
an implicitly declared object named Result. This object can be both read and
written, and is returned to the caller when the function returns.
Many languages place restrictions on the types of objects that can be returned
from a function. In Algol 60 and Fortran 77, a function must return a scalar value.
In Pascal and early versions of Modula-2, it must return a scalar or a pointer.
Most imperative languages are more flexible: Algol 68, Ada, C, Fortran 90, and
many (nonstandard) implementations of Pascal allow functions to return values
of composite type. ML, its descendants, and several scripting languages allow a
function to return a tuple of values. In Python, for example, we might write
def foo():
    return 2, 3
...
i, j = foo()
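A slightly larger sketch may make the usefulness of tuple returns clearer. The function name min_max and the sample data are assumptions chosen for illustration; the point is only that a single traversal can produce two results, which the caller unpacks by position.

```python
def min_max(values):
    """Return the smallest and largest element of a nonempty list
    as a two-element tuple (hypothetical helper)."""
    lo = hi = values[0]
    for v in values[1:]:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi          # packaged as a single tuple value

# The caller unpacks the tuple into two separate variables.
smallest, largest = min_max([3, 1, 4, 1, 5, 9, 2, 6])
```

Without tuple returns, the same function would need either two reference parameters or a record type declared solely to carry the pair.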
5 The fa in SR stands for “for all”; ub stands for “upper bound.” The -> symbol is roughly equivalent
to do and then in other languages. All structured statements in SR are terminated by spelling
the opening keyword backwards. Semicolons between statements may be omitted if they occur at
end-of-line.
Modula-3 and Ada 95 allow a function to return a subroutine, implemented
as a closure. C has no closures, but allows a function to return a pointer to a
subroutine. In functional languages such as Lisp and ML, returning a closure is
commonplace.
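Python, like the functional languages just mentioned, lets a function return a closure. In the following minimal sketch the names make_counter and next_value are hypothetical; what matters is that the returned subroutine keeps access to the local variable count of the call that created it, even after that call has returned.

```python
def make_counter(start=0):
    """Return a subroutine that remembers its own private count."""
    count = start

    def next_value():
        nonlocal count      # refers to the count of the enclosing call
        count += 1
        return count

    return next_value       # the closure: code plus referencing environment

ticket = make_counter(10)
first = ticket()            # advances this counter's private count
second = ticket()
```

Each call to make_counter creates a fresh count, so two counters obtained from separate calls do not interfere with each other.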
CHECK YOUR UNDERSTANDING
13. What is the difference between formal and actual parameters?
14. Describe four common parameter-passing modes. How does a programmer
choose which one to use when?
15. Explain the rationale for READONLY parameters in Modula-3.
16. What parameter mode is typically used in languages with a reference model
of variables?
17. Describe the parameter modes of Ada. How do they differ from the modes of
other modern languages?
18. What does it mean for an Ada program to be erroneous?
19. Give an example in which it is useful to return a reference from a function in
C++.
20. List three reasons why a language implementation might implement a parameter as a closure.
21. What is a conformant (open) array?
22. What are default parameters? How are they implemented?
23. What are named (keyword) parameters? Why are they useful?
24. Explain the value of variable-length argument lists. What distinguishes such
lists in Java and C# from their counterparts in C and C++?
25. Describe three common mechanisms for specifying the return value of a function. What are their relative strengths and drawbacks?
8.4 Generic Subroutines and Modules
Subroutines provide a natural way to perform an operation for a variety of different object (parameter) values. In large programs, the need also often arises
to perform an operation for a variety of different object types. An operating
system, for example, tends to make heavy use of queues, to hold processes,
memory descriptors, file buffers, device control blocks, and a host of other
objects. The characteristics of the queue data structure are independent of the
characteristics of the items placed in the queue. Unfortunately, the standard
mechanisms for declaring enqueue and dequeue subroutines in most languages
require that the type of the items be declared, statically. In a language like
Pascal or Fortran, this static declaration of item type means that the programmer must create separate copies of enqueue and dequeue for every type of
item, even though the entire text of these copies (other than the type names
in the procedure headers) is the same. In some languages (C is an obvious
example) it is possible to define a queue of pointers to arbitrary objects, but
use of such a queue requires type casts that abandon compile-time checking
(Exercise 8.17).
Implicit parametric polymorphism, as suggested in Section 3.5.3, provides a
way around the problem, allowing us to declare subroutines whose parameter
types are incompletely specified, but still type-safe. This approach has its drawbacks, however. As realized in Lisp (Section 10.3) or the various scripting languages, it delays type checking until run time. As realized in ML (Section 7.2.4),
it makes the compiler substantially slower and more complicated, and it forces the
adoption of a structural view of type equivalence (Section 7.2.1). An alternative,
also mentioned in Section 3.5.3, is to provide an explicitly polymorphic generic
facility that allows a collection of similar subroutines or modules—with different
types in each—to be created from a single copy of the source code. Languages
that provide generics include Ada, C++ (which calls them templates), Clu, Eiffel,
Modula-3, Java, and C#.
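The run-time checking that comes with implicit polymorphism in scripting languages is easy to observe. The sketch below uses Python as a representative; the function name minimum is an assumption made for illustration. Nothing in the definition names a type, so the same code works for any pair of operands that support <, and a mismatch is detected only when the comparison is actually executed.

```python
def minimum(a, b):
    """Return the smaller of a and b; works for any mutually
    comparable operands (hypothetical helper)."""
    return a if a < b else b

small_int = minimum(3, 7)              # integers
small_str = minimum("pear", "apple")   # strings, compared lexicographically

# A mismatched call is accepted when the program is loaded and fails
# only when the comparison runs:
try:
    minimum(3, "apple")
    mismatch_detected = False
except TypeError:
    mismatch_detected = True
```

An explicitly generic facility of the kind described above would instead reject the mismatched call at compile time.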
Generic modules or classes are particularly valuable for creating containers—
data abstractions that hold a collection of objects, but whose operations are generally oblivious to the type of those objects. Examples of containers include stack,
queue, heap, set, and dictionary (mapping) abstractions, implemented as lists,
arrays, trees, or hash tables. Ada and C++ examples of a generic queue appear in
Figure 8.4.
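For readers without the figure at hand, the flavor of a generic container can be suggested by the following sketch. It is written in Python rather than Ada or C++ and is not a transcription of Figure 8.4; the class name Queue and its method names are assumptions. Note also that Python does not enforce the type parameter at run time: the annotations are meant for a separate static checker, so the sketch conveys the shape of the abstraction rather than the compile-time guarantees of Ada or C++.

```python
from collections import deque
from typing import Deque, Generic, TypeVar

T = TypeVar("T")                 # the generic (type) parameter

class Queue(Generic[T]):
    """A FIFO queue whose operations are oblivious to the item type."""

    def __init__(self) -> None:
        self._items: Deque[T] = deque()

    def enqueue(self, item: T) -> None:
        self._items.append(item)

    def dequeue(self) -> T:
        return self._items.popleft()   # raises IndexError if empty

    def empty(self) -> bool:
        return not self._items

# One copy of the source code serves queues of different item types.
int_queue: Queue[int] = Queue()
name_queue: Queue[str] = Queue()
int_queue.enqueue(3)
int_queue.enqueue(4)
name_queue.enqueue("idle process")
```

The single class definition plays the role that separate hand-written copies of enqueue and dequeue would play in a language without generics.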
Generic subroutines (methods) are needed in generic modules (classes), and
may also be useful in their own right. A generic “minimum” function in Ada
appears in Figure 3.13 (page 150). A sorting routine would have a similar flavor:
it needs to be able to tell when objects are smaller or larger than each other, but
does not need to know anything else about them.
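The observation about sorting can be made concrete with a short sketch. It is written in Python, and the names insertion_sort and less are assumptions made for illustration; the routine is handed the comparison as a parameter and uses nothing else about the items, much as a generic routine receives its ordering operation as a generic parameter.

```python
def insertion_sort(items, less):
    """Return a sorted copy of items, using only less(x, y) to
    compare elements (hypothetical helper)."""
    result = list(items)                 # leave the caller's list untouched
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger elements right until current's position is found.
        while j >= 0 and less(current, result[j]):
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result

by_value = insertion_sort([3, 1, 2], lambda a, b: a < b)
by_length = insertion_sort(["pear", "fig", "apple"],
                           lambda a, b: len(a) < len(b))
```

The same routine sorts integers by value and strings by length; only the comparison supplied by the caller changes.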
Exactly what can be passed as a generic parameter varies from language to
language. Java and C# pass only types. Ada and C++ are a bit more general. In
particular, both allow values of ordinary (nongeneric) types, includi