Compiler design - Kanat Bolazar

Compiler Design
1. Overview
CIS 631, CSE 691, CIS400, CSE 400
Kanat Bolazar
January 19, 2010
Compilers
• Compilers translate from a source language (typically a high
level language) to a functionally equivalent target language
(typically the machine code of a particular machine or a
machine-independent virtual machine).
• Compilers for high level programming languages are among
the larger and more complex pieces of software
– Original languages included Fortran and Cobol
• Often multi-pass compilers (to facilitate memory reuse)
– Compiler development helped in better programming language design
• Early development focused on syntactic analysis and optimization
– Commercially, compilers are developed by very large software groups
• Current focus is on optimization and smart use of resources
Why Study Compilers?
• General background information for good software engineers
– Increases understanding of language semantics
– Seeing the machine code generated for language
constructs helps understand performance issues for
languages
– Teaches good language design
– New devices may need device-specific languages
– New business fields may need domain-specific languages
Applications of Compiler Technology & Tools
• Processing XML/other to generate documents, code, etc.
• Processing domain-specific and device-specific languages
• Implementing a server that uses a protocol such as HTTP or IMAP
• Natural language processing, for example: spam filters, search, document comprehension, summary generation
• Translating from a hardware description language to the schematic of a circuit
• Automatic graph layout (Graphviz, for example)
• Extending an existing programming language
• Program analysis and improvement tools
Dynamic Structure of a Compiler
• Front end (analysis):
– Lexical analysis (scanning) turns the character stream
val = 10 * val + i
into a token stream, where each token carries a token number and a token value:
(1, ident, "val") (3, assign) (2, number, 10) (4, times) (1, ident, "val") (5, plus) (1, ident, "i")
– Syntax analysis (parsing) turns the token stream into a syntax tree
(Statement, Expression, Term, ...).
Dynamic Structure of a Compiler (cont'd)
• Front end (continued): semantic analysis (type checking, ...) operates on the
syntax tree for
ident = number * ident + ident
and produces the intermediate representation:
a syntax tree, symbol table, or three address code (TAC), ...
• Back end (synthesis): optimization, then code generation, producing
machine code (const 10, ...).
Compiler versus Interpreter
• Compiler: translates to machine code
– source code → scanner → parser → ... → code generator → machine code → loader
• Interpreter: executes source code "directly"
– source code → scanner → parser → interpretation
– statements in a loop are scanned and parsed again and again
• Variant: interpretation of intermediate code
– source code → compiler → intermediate code → VM
– source code is translated into the code of a virtual machine
– the VM interprets the code
Static Structure of a Compiler
• Parser & semantic analysis: the "main program" that directs the whole compilation
• Scanner: provides tokens from the source code
• Code generation: generates machine code
• Symbol table: maintains information about declared names and types; the
other components use it, and data flows between all of these parts
Lexical Analysis
• Stream of characters is grouped into tokens
• Examples of tokens are identifiers, reserved words, integers, doubles or
floats, delimiters, operators and special symbols
int a;
a = a + 2;
int  → reserved word
a    → identifier
;    → special symbol
a    → identifier
=    → operator
a    → identifier
+    → operator
2    → integer constant
;    → special symbol
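Concretely, a scanner usually pairs each token's type with its value in a small record. A minimal sketch in C (the type names and fields are illustrative, not from any particular compiler):

/* Illustrative token record for the example above. */
enum TokenType { TOK_RESERVED, TOK_IDENT, TOK_OPERATOR,
                 TOK_INTCONST, TOK_SPECIAL, TOK_EOF };

struct Token {
    enum TokenType type;  /* e.g. TOK_IDENT                        */
    const char *text;     /* lexeme as written, e.g. "a"           */
    int value;            /* numeric value for TOK_INTCONST        */
    int line;             /* line number, useful for error reports */
};

/* "a = a + 2;" yields: {TOK_IDENT,"a"}, {TOK_OPERATOR,"="},  */
/* {TOK_IDENT,"a"}, {TOK_OPERATOR,"+"}, {TOK_INTCONST,"2",2}, */
/* {TOK_SPECIAL,";"}                                          */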
Syntax Analysis or Parsing
• Parsing uses a context-free grammar of valid programming
language structures to find the structure of the input
• Result of parsing usually represented by a syntax tree
Example of grammar rules:
expression → expression + expression |
variable | constant
variable → identifier
constant → intconstant | doubleconstant | …
Example parse tree (for a = a + 2):
      =
     / \
    a   +
       / \
      a   2
Semantic Analysis
• Parse tree is checked for things that violate the semantic
rules of the language
– Semantic rules may be written with an attribute grammar
• Examples:
– Using undeclared variables
– Function called with improper arguments
• Number and type of arguments
– Array variables used without array syntax
– Type checking of operator arguments
– Left hand side of an assignment must be a variable (sometimes
called an L-value)
– ...
Intermediate Code Generation
• An intermediate code representation often helps contain
complexity of compiler and discover code optimizations.
• Typical choices include:
– Annotated parse trees
– Three Address Code (TAC), and abstract machine language
– Bytecode, as in Java bytecode.
Example statements:
if (a <= b)
  { a = a - c; }
c = b * c;
Resulting TAC:
    _t1 = a > b
    if _t1 goto L0
    _t2 = a - c
    a = _t2
L0: _t3 = b * c
    c = _t3
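Internally, TAC like this is often stored as a list of quadruples: an operator, two operands, and a result. A sketch in C of how the instructions above might be represented (the names and operator set are illustrative):

/* One TAC instruction: result = arg1 op arg2 (or a jump/label). */
enum TacOp { TAC_GT, TAC_SUB, TAC_MUL, TAC_ASSIGN, TAC_IFGOTO, TAC_LABEL };

struct Quad {
    enum TacOp op;
    const char *arg1, *arg2, *result;
};

/* The example above as quadruples: */
struct Quad code[] = {
    { TAC_GT,     "a",   "b", "_t1" },   /* _t1 = a > b    */
    { TAC_IFGOTO, "_t1",  0,  "L0"  },   /* if _t1 goto L0 */
    { TAC_SUB,    "a",   "c", "_t2" },   /* _t2 = a - c    */
    { TAC_ASSIGN, "_t2",  0,  "a"   },   /* a = _t2        */
    { TAC_LABEL,  "L0",   0,  0     },   /* L0:            */
    { TAC_MUL,    "b",   "c", "_t3" },   /* _t3 = b * c    */
    { TAC_ASSIGN, "_t3",  0,  "c"   },   /* c = _t3        */
};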
Intermediate Code Generation (cont'd)
Example statements:
if (a <= b)
  { a = a - c; }
c = b * c;

Java bytecode (javap -c):
55: iload_1
56: iload_2
57: if_icmpgt 64
60: iload_1
61: iload_3
62: isub
63: istore_1
64: iload_2      // c = b * c
65: iload_3
66: imul
67: istore_3

Postfix/Polish/Stack form:
v1 v2 JumpIf(>)
v1 v3 - store(v1)
v2 v3 * store(v3)
Code Optimization
• Compiler converts the intermediate representation to another
one that attempts to be smaller and faster.
• Typical optimizations (a small illustration follows the list):
– Inhibit code generation for unreachable segments
– Getting rid of unused variables
– Eliminating multiplication by 1 and addition by 0
– Loop optimization: e.g. removing statements not modified in the
loop
– Common sub-expression elimination
– ...
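A small before/after fragment in C illustrating several of these rewrites (the functions are illustrative; a real optimizer works on the intermediate representation, not on source text):

/* Before optimization: */
void f_before(int y, int b, int c, int n, int a[]) {
    int x = y * 1 + 0;      /* multiply by 1, add 0: identities */
    int u = b + c;
    int v = b + c;          /* common sub-expression            */
    for (int i = 0; i < n; i++)
        a[i] = n * 4;       /* n * 4 is loop-invariant          */
    (void)x; (void)u; (void)v;
}

/* After optimization (same observable effect): */
void f_after(int y, int b, int c, int n, int a[]) {
    int x = y;              /* identities eliminated            */
    int u = b + c;
    int v = u;              /* reuse the earlier b + c          */
    int t = n * 4;          /* hoisted out of the loop          */
    for (int i = 0; i < n; i++)
        a[i] = t;
    (void)x; (void)u; (void)v;
}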
Object Code Generation
• The target program is generated in the machine language of
the target architecture.
– Memory locations are selected for each variable
– Instructions are chosen for each operation
– Individual tree nodes or TAC is translated into a sequence of
machine language instructions that perform the same task
• Typical machine language instructions include things like
– Load register
– Add register to memory location
– Store register to memory
– ...
Object Code Optimization
• It is possible to have another code optimization phase that
transforms the object code into more efficient object code.
• These optimizations use features of the hardware itself to
make efficient use of processors and registers.
– Specialized instructions
– Pipelining
– Branch prediction and other peephole optimizations
• JIT (Just-In-Time) compilation of intermediate code (e.g.
Java bytecode) can discover more context-specific
optimizations not available earlier.
Symbol Table
• Symbol table management is a part of the compiler that
interacts with several of the phases
– Identifiers are found in lexical analysis and placed in the symbol
table
– During syntactical and semantical analysis, type and scope
information is added
– During code generation, type information is used to determine what
instructions to use
– During optimization, the “live analysis” may be kept in the symbol
table
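A minimal symbol table sketch in C, assuming one entry per identifier kept in a linked list (real compilers typically use a hash table, and later add per-scope instances):

#include <stdlib.h>
#include <string.h>

struct Symbol {
    const char    *name;   /* identifier as found by the scanner */
    const char    *type;   /* filled in during semantic analysis */
    int            line;   /* first occurrence, for messages     */
    struct Symbol *next;
};

static struct Symbol *table = NULL;

struct Symbol *lookup(const char *name) {
    for (struct Symbol *s = table; s; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;
    return NULL;
}

/* Insert on first sight; later phases add type/scope info. */
struct Symbol *insert(const char *name, int line) {
    struct Symbol *s = lookup(name);
    if (s) return s;               /* one entry per identifier */
    s = malloc(sizeof *s);
    s->name = strdup(name);        /* strdup is POSIX          */
    s->type = NULL;
    s->line = line;
    s->next = table;
    table = s;
    return s;
}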
Error Handling
• Error handling and reporting also occurs across many phases
– Lexical analyzer reports invalid character sequences
– Syntactic analyzer reports invalid token sequences
– Semantic analyzer reports type and scope errors, and the like
• The compiler may be able to continue with some errors, but
other errors may stop the process
Compiler / Translator Design Decisions
• Choose a source language
– Large enough to have many interesting language features
– Small enough to implement in a reasonable amount of time
– Examples for us: MicroJava, Decaf, MiniJava
• Choose a target language
– Either a real assembly language for a machine with an assembler
– Or a virtual machine language with an interpreter
– Examples for us: MicroJava VM (μJVM), MIPS (a popular RISC
architecture, for which there is a “SPIM” simulator)
• Choose an approach for implementation:
– Either use an existing scanner and parser / compiler generator
• lex/flex, yacc/bison/byacc,
Antlr/JavaCC/SableCC/byaccj/Coco/R
– Or write the scanner and parser by hand (e.g. recursive descent, covered later)
Example MicroJava Program
program P                          // main program; no separate compilation
  final int size = 10;
  class Table {                    // classes (without methods)
    int[] pos;
    int[] neg;
  }
  Table val;                       // global variables
{
  void main()
    int x, i;                      // local variables
  {
    //---------- initialize val ----------
    val = new Table;
    val.pos = new int[size];
    val.neg = new int[size];
    i = 0;
    while (i < size) {
      val.pos[i] = 0; val.neg[i] = 0; i = i + 1;
    }
    //---------- read values ----------
    read(x);
    while (x != 0) {
      if (x > 0) val.pos[x] = val.pos[x] + 1;
      else if (x < 0) val.neg[-x] = val.neg[-x] + 1;
      read(x);
    }
  }
}
References
• Original slides: Nancy McCracken.
• Niklaus Wirth, Compiler Construction, chapters 1 and 2.
• Course notes from H. Mössenböck, System Specification and Compiler
Construction, http://www.ssw.uni-linz.ac.at/Misc/CC/
– Also notes on MicroJava
• Course notes from Jerry Cain, Compilers,
http://www.stanford.edu/class/cs143/
• General references:
– Aho, A., Lam, M., Sethi, R., Ullman, J., Compilers: Principles,
Techniques and Tools, 2nd Edition, Addison-Wesley, 2006.
– Steven Muchnick, Advanced Compiler Design and Implementation,
Morgan Kaufmann, 1997.
– Keith Cooper and Linda Torczon, Engineering a Compiler, Morgan
Kaufmann.
Compiler Design
2. Regular Expressions
&
Finite State Automata
(FSA)
Kanat Bolazar
January 21, 2010
Contents
In these slides we will see:
1. Introduction, Concepts and Notations
2. Regular Expressions, Regular Languages
3. RegExp Examples
4. Finite-State Automata (FSA/FSM)
Introduction
• Regular expressions are equivalent to Finite State Automata in
recognizing regular languages, the first step in the Chomsky hierarchy of
formal languages
• The term regular expressions is also used to mean the extended set of
string matching expressions used in many modern languages
– Some people use the term regexp to distinguish this use
• Some parts of regexps are just syntactic extensions of regular expressions
and can be implemented as a regular expression – other parts are
significant extensions of the power of the language and are not equivalent
to finite automata
Concepts and Notations
• Set: An unordered collection of unique elements
S1 = { a, b, c }
S2 = { 0, 1, …, 19 }
empty set: ∅
membership: x ∈ S
union: S1 ∪ S2 = { a, b, c, 0, 1, …, 19 }
universe of discourse: U
subset: S1 ⊆ U
complement: if U = { a, b, …, z }, then S1' = { d, e, …, z } = U - S1
• Alphabet: A finite set of symbols
– Examples:
• Character sets: ASCII, ISO-8859-1, Unicode
• Σ1 = { a, b }
• Σ2 = { Spring, Summer, Autumn, Winter }
• String: A sequence of zero or more symbols from an alphabet
– The empty string: ε
Concepts and Notations
• Language: A set of strings over an alphabet
– Also known as a formal language; may not bear any resemblance to a natural
language, but could model a subset of one.
– The language comprising all strings over an alphabet Σ is written as: Σ*
• Graph: A set of nodes (or vertices), some or all of which may be connected by
edges.
– An example: a graph with nodes 1, 2, 3
– A directed graph example: nodes a, b, c with directed edges
1.
2.
3.
4.
Introduction, Concepts and Notations
Regular Expressions, Regular Languages
RegExp Examples
Finite-State Automata (FSA/FSM)
Regular Expressions
• A regular expression defines a regular language over an
alphabet Σ:
– ε is a regular language: //
– Any symbol from Σ is a regular language:
Σ = { a, b, c }   /a/   /b/   /c/
– The concatenation of two regular languages is a regular language:
Σ = { a, b, c }   /ab/   /bc/   /ca/
Regular Expressions
• Regular language (continued):
– The union (or disjunction) of two regular languages is a regular
language:
Σ = { a, b, c }   /ab|bc/   /ca|bb/
– The Kleene closure (denoted by the Kleene star: *) of a regular
language is a regular language:
Σ = { a, b, c }   /a*/   /(ab|ca)*/
– Parentheses group a sub-language to override operator precedence
(and, we'll see later, for “memory”).
RegExps
– The extended use of regular expressions is in many modern languages:
• Perl, PHP, Java, Python, …
– Can use regexps to specify the rules for any set of possible strings you want to
match
• Sentences, e-mail addresses, ads, dialogs, etc.
– "Does this string match the pattern?", or "Is there a match for the pattern
anywhere in this string?"
– Can also define operations to do something with the matched string, such as
extract the text or substitute for it
– Regular expression patterns are compiled into executable code within the
language, as the C sketch below illustrates
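For instance, in C the POSIX regex API compiles a pattern once and then matches it against strings; a minimal sketch:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* Compile an extended regular expression for a simple number. */
    if (regcomp(&re, "[0-9]+", REG_EXTENDED) != 0)
        return 1;
    /* Ask: is there a match for the pattern anywhere in the string? */
    if (regexec(&re, "value = 42;", 0, NULL, 0) == 0)
        printf("match\n");
    regfree(&re);
    return 0;
}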
Regular Expressions: Basics
Some examples and shortcuts:
/[abc]/ = /a|b|c/              Character class; disjunction
/[b-e]/ = /b|c|d|e/            Range in a character class
/[\012\015]/ = /\n|\r/         Octal characters; special escapes
/./ = /[\x00-\xFF]/            Wildcard; hexadecimal characters
/[^b-e]/ = /[\x00-af-\xFF]/    Complement of character class
/a*/  /[af]*/  /(abc)*/        Kleene star: zero or more
/a?/ = /a|/  /(ab|ca)?/        Zero or one
/a+/  /([a-zA-Z]1|ca)+/        Kleene plus: one or more
/a{8}/  /b{1,2}/  /c{3,}/      Counters: repetition quantification
Regular Expressions: Anchors
• Anchors constrain the position(s) at which a pattern may match.
• Think of them as “extra” alphabet symbols, though they always
consume/match ε (the zero-length string):
/^a/           Pattern must match at beginning of string
/a$/           Pattern must match at end of string
/\bword23\b/   “Word” boundary:
               /[a-zA-Z0-9_][^a-zA-Z0-9_]/ ('x ', '0%')
               or /[^a-zA-Z0-9_][a-zA-Z0-9_]/ (' x', '%0')
/\B23\B/       “Word” non-boundary
Regular Expressions: Escapes
• There are six classes of escape sequences ('\XYZ'):
1. Numeric character representation: the octal or hexadecimal position in a
character set: “\012” = “\xA”
2. Meta-characters: The characters which are syntactically meaningful to
regular expressions, and therefore must be escaped in order to represent
themselves in the alphabet of the regular expression: “[](){}|^$.?+*\”
(note the inclusion of the backslash).
3. “Special” escapes (from the “C” language):
newline:       "\n" = "\xA"
carriage ret:  "\r" = "\xD"
tab:           "\t" = "\x9"
formfeed:      "\f" = "\xC"
Regular Expressions: Escapes (cont'd)
4. Aliases: shortcuts for commonly used character classes. (Note that the capitalized
version of these aliases refer to the complement of the alias’s character class):
whitespace:      "\s" = "[ \t\r\n\f\v]"
digit:           "\d" = "[0-9]"
word:            "\w" = "[a-zA-Z0-9_]"
non-whitespace:  "\S" = "[^ \t\r\n\f]"
non-digit:       "\D" = "[^0-9]"
non-word:        "\W" = "[^a-zA-Z0-9_]"
5. Memory/registers/back-references: “\1”, “\2”, etc.
6. Self-escapes: any character other than those which have special meaning can be
escaped, but the escaping has no effect: the character still represents the regular
language of the character itself.
Regular Expressions: Back References
• Memory/Registers/Back-references
– Many regular expression languages include a memory/register/back-reference
feature, in which sub-matches may be referred to later in the regular
expression, and/or, when performing replacement, in the replacement string
• Perl: /(\w+)\s+\1\b/ matches a repeated word
• Python: re.sub("(the\s+)the(\s+|\b)", "\1", string) removes
the second of a pair of 'the's
– Note: finite automata cannot be used to implement the memory feature
Regular Expression Examples
Character classes and Kleene symbols
[A-Z] = one capital letter
[0-9] = one numerical digit
[st@!9] = s, t, @, ! or 9
[A-Z] matches G or W or E
does not match GW or FA or h or fun
[A-Z]+ = one or more consecutive capital letters
matches GW or FA or CRASH
[A-Z]? = zero or one capital letter
[A-Z]* = zero, one or more consecutive capital letters
matches on eat or EAT or I
so, [A-Z]ate
matches: Gate, Late, Pate, Fate, but not GATE or gate
and [A-Z]+ate
matches: Gate, GRate, HEate, but not Grate or grate or STATE
and [A-Z]*ate
matches: Gate, GRate, and ate, but not STATE, grate or Plate
Regular Expression Examples (cont’d)
[A-Za-z] = any single letter
so [A-Za-z]+
matches on any word composed of only letters,
but will not match on “words”: bi-weekly , yes@SU or IBM325
they will match on bi, weekly, yes, SU and IBM
a shortcut for [A-Za-z] is \w, which in Perl also includes _
so (\w)+ will match on Information, ZANY, rattskellar and jeuvbaew
\s will match whitespace
so (\w)+(\s)(\w+) will match
real estate or Gen Xers
Regular Expression Examples (cont’d)
Some longer examples:
([A-Z][a-z]+)\s([a-z0-9]+)
matches: Intel c09yt745
but not
IBM series5000
[A-Z]\w+\s\w+\s\w+[!]
matches: The dog died!
It also matches that portion of “ he said, “ The dog died! “
[A-Z]\w+\s\w+\s\w+[!]$
matches: The dog died!
But does not match “he said, “ The dog died! “ because the $ indicates end of line, and
there is a quotation mark before the end of the line
(\w+ats?\s)+
parentheses define a pattern as a unit, so the above expression will match:
Fat cats eat Bats that Splat
Finite State Automata
• Finite State Automaton
a.k.a. Finite Automaton, Finite State Machine, FSA or FSM
– An abstract machine which can be used to implement regular
expressions (etc.).
– Has a finite number of states, and a finite amount of memory (i.e., the
current state).
– Can be represented by directed graphs or transition tables
Finite-State Automata
• Representation
– An FSA may be represented as a directed graph; each node (or vertex)
represents a state, and the edges (or arcs) connecting the nodes represent
transitions.
– Each state is labelled.
– Each transition is labelled with a symbol from the alphabet over which the
regular language represented by the FSA is defined, or with , the empty
string.
– Among the FSA’s states, there is a start state and at least one final state (or
accepting state).
Finite-State Automata
Σ = { a, b, c }

start state q0; transitions q0 -a-> q1 -b-> q2 -c-> q3 -a-> q4; final state q4

• Representation (continued)
– An FSA may also be represented with a state-transition table. The table
for the above FSA:

State | a | b | c
  0   | 1 | ∅ | ∅
  1   | ∅ | 2 | ∅
  2   | ∅ | ∅ | 3
  3   | 4 | ∅ | ∅
  4   | ∅ | ∅ | ∅
Finite-State Automata
• Given an input string, an FSA will either accept or reject
the input.
– If the FSA is in a final (or accepting) state after all input
symbols have been consumed, then the string is accepted (or
recognized).
– Otherwise (including the case in which an input symbol cannot
be consumed), the string is rejected.
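The transition table above can be executed directly. A sketch in C, assuming q4 is the only final state and using -1 for the empty (∅) entries:

#include <stdio.h>

/* Rows: states 0..4; columns: a, b, c; -1 = no transition. */
static const int delta[5][3] = {
    {  1, -1, -1 },   /* q0 */
    { -1,  2, -1 },   /* q1 */
    { -1, -1,  3 },   /* q2 */
    {  4, -1, -1 },   /* q3 */
    { -1, -1, -1 },   /* q4 */
};
static const int final_state = 4;

int accepts(const char *input) {
    int state = 0;
    for (; *input; input++) {
        int col = *input - 'a';            /* 'a'..'c' -> 0..2 */
        if (col < 0 || col > 2) return 0;  /* not in alphabet  */
        state = delta[state][col];
        if (state < 0) return 0;           /* input rejected   */
    }
    return state == final_state;
}

int main(void) {
    printf("%d %d %d\n", accepts("abca"),   /* 1: accepted */
                         accepts("accba"),  /* 0: rejected */
                         accepts("abcac")); /* 0: rejected */
    return 0;
}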
Finite-State Automata
• Σ = { a, b, c }. Using the FSA and transition table above, recognition can be
traced step by step for three input strings:
– IS1: a b c a — q0 -a-> q1 -b-> q2 -c-> q3 -a-> q4; input exhausted in final
state q4: accepted
– IS2: a c c b a — after q0 -a-> q1, the table entry for (q1, c) is ∅: rejected
– IS3: a b c a c — reaches q4, but the entry for (q4, c) is ∅: rejected
Finite-State Automata
• An FSA defines a regular language over an alphabet Σ:
– ε is a regular language: a single state q0 that is both start and final state
– Any symbol from Σ is a regular language:
Σ = { a, b, c }:  q0 -b-> q1, with q1 final
– The concatenation of two regular languages is a regular language:
Σ = { a, b, c }:  chaining q0 -b-> q1 and q0 -c-> q1 gives q0 -b-> q1 -c-> q2
Finite-State Automata
• Regular language (continued):
– The union (or disjunction) of two regular languages is a regular
language:
Σ = { a, b, c }:  a new start state with ε-transitions into the machines for b and for c
– The Kleene closure (denoted by the Kleene star: *) of a regular language
is a regular language:
Σ = { a, b, c }:  ε-transitions let q0 -b-> q1 repeat and let the empty string be accepted
Finite-State Automata
• Determinism
– An FSA may be either deterministic (DFSA or DFA) or nondeterministic (NFSA or NFA).
• An FSA is deterministic if its behavior during recognition is
fully determined by the state it is in and the symbol to be
consumed.
– I.e., given an input string, only one path may be taken through
the FSA.
• Conversely, an FSA is non-deterministic if, given an input
string, more than one path may be taken through the FSA.
– One type of non-determinism is -transitions, i.e. transitions
which consume the empty string (no symbols).
NFA: Nondeterministic FSA
• An example NFA, Σ = { a, b, c }:
q0 -a-> q1;  q1 -b-> q2;  q2 -b-> q2 (loop);  q1 -c-> {q3, q4};
q2 -c-> {q3, q4};  q3 -a-> q4;  final state q4

State | a   | b   | c     | ε
  0   | {1} | ∅   | ∅     | ∅
  1   | ∅   | {2} | {3,4} | ∅
  2   | ∅   | {2} | {3,4} | ∅
  3   | {4} | ∅   | ∅     | ∅
  4   | ∅   | ∅   | ∅     | ∅

• The above NFA is equivalent to the regular
expression /ab*ca?/.
Nondeterministic FSA
• String recognition with an NFA:
– Backup (or backtracking): remember choice points and revisit
choices upon failure
– Look-ahead: choose path based on foreknowledge about the input
string and available paths
– Parallelism: examine all choices simultaneously
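The parallelism strategy is easy to sketch in C with a bit-set of active states; the transitions below encode the /ab*ca?/ NFA from the previous slide (the encoding itself is illustrative):

#include <stdio.h>

/* States as bits: bit i set = NFA state qi is active.           */
typedef unsigned StateSet;

/* step(S, c): union of transitions from every active state. */
StateSet step(StateSet s, char c) {
    StateSet out = 0;
    if ((s & 1u<<0) && c == 'a') out |= 1u<<1;          /* q0 -a-> q1      */
    if ((s & 1u<<1) && c == 'b') out |= 1u<<2;          /* q1 -b-> q2      */
    if ((s & 1u<<2) && c == 'b') out |= 1u<<2;          /* q2 -b-> q2      */
    if ((s & 1u<<1) && c == 'c') out |= 1u<<3 | 1u<<4;  /* q1 -c-> {q3,q4} */
    if ((s & 1u<<2) && c == 'c') out |= 1u<<3 | 1u<<4;  /* q2 -c-> {q3,q4} */
    if ((s & 1u<<3) && c == 'a') out |= 1u<<4;          /* q3 -a-> q4      */
    return out;
}

int accepts(const char *in) {            /* final state: q4 */
    StateSet s = 1u<<0;                  /* start in {q0}   */
    for (; *in; in++) s = step(s, *in);
    return (s & 1u<<4) != 0;
}

int main(void) {
    printf("%d %d\n", accepts("abbca"), accepts("ab")); /* 1 0 */
    return 0;
}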
Finite-State Automata
• Conversion of NFAs to DFAs
– Every NFA can be expressed as a DFA.
– Subset construction applied to the NFA for /ab*ca?/ above (Σ = { a, b, c }):
each DFA state is a set of NFA states, and a “failure” state 5 catches all
missing transitions.

New State | NFA states | a  | b  | c
  0'      | {0}        | 1' | 5  | 5
  1'      | {1}        | 5  | 2' | 3'
  2'      | {2}        | 5  | 2' | 3'
  3' (F)  | {3,4}      | 4' | 5  | 5
  4' (F)  | {4}        | 5  | 5  | 5
  5       | ∅          | 5  | 5  | 5

– Resulting DFA graph: q0' -a-> q1'; q1' -b-> q2'; q1' -c-> q3'; q2' -b-> q2';
q2' -c-> q3'; q3' -a-> q4'; every other transition goes to q5, which loops on
a, b, c.
Finite-State Automata
• DFA minimization
– Every regular language has a unique minimum-state DFA.
– The basic idea: two states s and t are equivalent if for every string w,
the transitions T(s, w) and T(t, w) are both either final or non-final.
– An algorithm:
• Begin by listing all pairs of states that are both final or both
non-final, then iteratively remove those pairs whose transition
pairs (for any symbol) are either not equal or not on the
list. The list is complete when an iteration does not remove any
pairs from the list.
• The minimum set of states is the partition resulting from the unions
of the remaining members of the list, along with any original states
not on the list.
Finite-State Automata
• The minimum-state DFA for the DFA converted from the
NFA for /ab*ca?/, without the “failure” state (labeled
“5”), and with the states relabeled to the set Q = { q0",
q1", q2", q3" }:

q0" -a-> q1";  q1" -b-> q1" (loop);  q1" -c-> q2";  q2" -a-> q3";
final states: q2", q3"
FSA Recognition As Search
• Recognition as search
– Recognition can be viewed as selection of the correct path
from all possible paths through an NFA (this set of paths is
called the state-space)
– Search strategy can affect efficiency: in what order should
the paths be searched?
• Depth-first (LIFO [last in, first out]; stack)
• Breadth-first (FIFO [first in, first out]; queue)
• Depth-first uses memory more efficiently, but may enter
into an infinite loop under some circumstances
Finite-State Automata with Output
• Finite State Automata may also have an output alphabet
and an action at every state that may output an item
from the alphabet
• Useful for lexical analyzers
– As the FSA recognizes a token, it outputs the characters
– When the FSA reaches a final state and the token is complete,
the lexical analyzer can use
• Token value – output so far
• Token type – label of the output state
Conclusion
• Both regular expressions and finite-state automata represent regular
languages.
• The basic regular expression operations are: concatenation,
union/disjunction, and Kleene closure.
• The regular expression language is a powerful pattern-matching tool.
• Any regular expression can be automatically compiled into an NFA, to a
DFA, and to a unique minimum-state DFA.
• An FSA can use any set of symbols for its alphabet, including letters and
words.
References
• Original slides:
– Steve Rowe at Center for NLP
– Nancy McCracken
Compiler Design
3. Lexical Analyzer, Flex
Kanat Bolazar
January 26, 2010
Lexical Analyzer
• The main task of the lexical analyzer is to read the input
source program, scanning the characters, and produce a
sequence of tokens that the parser can use for syntactic
analysis.
• The interface may be for the lexical analyzer to be called by the parser to
produce one token at a time
– Maintain internal state of reading the input program (with lines)
– Have a function “getNextToken” that will read some characters
at the current state of the input and return a token to the parser
• Other tasks of the lexical analyzer include
– Skipping or hiding whitespace and comments
– Keeping track of line numbers for error reporting
• Sometimes it can also produce the annotated lines for error
reports
Character Level Scanning
• The lexical analyzer needs to have a well-defined valid
character set
– Produce invalid character errors
– Delete invalid characters from token stream so as not to be
used in the parser analysis
• E.g. don’t want invisible characters in error messages
• For every end-of-line, keep track of line numbers for
error reporting
• Skip over or hide whitespace and comments
– If comments are nested (not common), must keep track of
nesting to find end of comments
– May produce hidden tokens, for convenience of scanner
structure
• Always produce an end-of-file token
Tokens, Token Types and Values
• The set of tokens is typically something like the
following table
– Or may have separate token types for different
operators or reserved words
– May want to keep line number with each token

Token Type          | Token Value        | Informal Description
Integer constant    | Numeric value      | Numbers like 3, -5, 12 without decimal pts.
Floating constant   | Numeric value      | Numbers like 3.0, -5.1, 12.2456789
Reserved word       | Word string        | Words like if, then, class, …
Identifiers         | Symbol table index | Words not reserved, starting with letter or _
                    |                    | and containing only letters, _, and digits
Relations           | Operator string    | <, <=, ==, …
Operators           | Operator string    | =, +, -, ++, …
Char constant       | Char value         | ‘A’, …
String              | String             | “this is a string”, …
Hidden: end-of-line |                    |
Hidden: comment     |                    |
Token Actions
• Each token recognized can have an action function
– Many token types produce a value
• In the case of numeric values, make sure proper numeric
errors are produced, e.g. integer overflow (see the sketch below)
– Put identifiers in the symbol table
• Note that at this time, no effort is made to distinguish scope;
there will be one symbol table entry for each identifier
– Later, separate scope instances will be produced
• Other types of actions
– End-of-line (can be treated as a token type that doesn’t output to
the parser)
• Increment line number
• Get next line of input to scan
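A sketch in C of such an integer-constant action with an overflow check (the function name and error reporting are illustrative):

#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

/* Action for an integer-constant token: convert the lexeme, */
/* reporting integer overflow as a lexical error.             */
long intconst_action(const char *lexeme, int line) {
    errno = 0;
    char *end;
    long v = strtol(lexeme, &end, 10);
    if (errno == ERANGE || v > INT_MAX || v < INT_MIN)
        fprintf(stderr, "line %d: integer constant overflows\n", line);
    return v;
}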
Testing
• Execute lexical analyzer with test cases and compare
results with expected results
• Test cases
– Exercise every part of lexical analyzer code
– Produce every error message
– Don’t have to be valid programs – just valid sequence of tokens
Lex and Yacc
• Two classical tools for compilers:
– Lex: A Lexical Analyzer Generator
– Yacc: “Yet Another Compiler Compiler”
• Lex creates programs that scan your tokens one by one.
• Yacc takes a grammar (sentence structure) and generates
a parser.
• Pipeline: Lexical Rules → Lex → yylex(); Grammar Rules → Yacc → yyparse();
Input → yylex() → yyparse() → Parsed Input
Flex: A Fast Scanner Generator
• Often, instead of the standard Lex and Yacc, Flex and
Bison are used:
– Flex: A fast lexical analyzer
– (GNU) Bison: A drop-in replacement for (backwards
compatible with) Yacc
• Resources:
– http://en.wikipedia.org/wiki/Flex_lexical_analyser
– http://en.wikipedia.org/wiki/GNU_Bison
– http://dinosaur.compilertools.net/
(the Lex & Yacc Page)
Flex Example 1: Delete This
• Shortest Flex example, “deletethis.l”:
%%
deletethis
• This scanner will match and not echo (default behavior)
the word “deletethis”.
• Compile and run it:
$ flex deletethis.l
# creates lex.yy.c
$ gcc -o scan lex.yy.c -lfl # fl: flex library
$ ./scan
This deletethis is not deletethis useful.
This is not useful.
^D
Flex Example 2: Replace This
• Another very short Flex example, “replacer.l”:
%%
replacethis printf("replaced");
• This scanner will match “replacethis” and replace it with
“replaced”.
• Compile and run it:
$ flex -o replacer.yy.c replacer.l
$ gcc -o replacer replacer.yy.c -lfl
$ ./replacer
This replacethis is not very replacethis useful.
This replaced is not very replaced useful.
Please dontreplacethisatall.
Please dontreplacedatall.
Flex Example 3: Common Errors
• Let's replace "the the" with "the" and delete "uhh":
%%
the the
printf("the");
uhh
• Unfortunately, this does not work: the second "the" is
considered part of the C code, as if we had written:
%%
the
the printf("the");
• Also, the open and close matching double quotes used in
documents will give errors, so you must always replace:
“the” →
"the"
Flex Example 3: Common Errors,
cont'd
• You discover such errors when you compile the C code,
not when you use flex:
$ flex -o errors.yy.c errors.l
$ gcc -o errors errors.yy.c -lfl
errors.l: In function ‘yylex’:
errors.l:2: error: ‘the’ undeclared
...
• The error is reported back in our errors.l file, but we can
also find it in errors.yy.c:
case 1:
YY_RULE_SETUP
#line 2 "errors.l"      <-- for error reporting
the                     <-- "the"? not C code
printf("the");
Flex Example 4: Replace Duplicate
• Let's replace “the the” with “the”:
%%
"the the" printf("the");
• This time, it works:
$ flex -o duplicate.yy.c duplicate.l
$ gcc -o duplicate duplicate.yy.c -lfl
$ ./duplicate
This is the the file.
This is the file.
This is the the the file.
This is the the file.
Lathe theory
Latheory
Flex Example 4: Replace And Delete
• Let's replace “the the” with “the” and delete “uhh”:
%%
"the the" printf("the");
uhh
• Run as before:
This uhh is the the uhhh file.
This is the h file.
• Generally, lexical rules are pattern-action pairs:
%%
pattern1
action1 (C code)
pattern2
action2
...
Flex File Structure
• In Lex and Flex, the general rule file structure is:
definitions
%%
rules
%%
user code
• Definitions:
DIGIT   [0-9]
ID      [a-z][a-z0-9]*
• can be used later in rules with {DIGIT}, {ID}, etc.:
{DIGIT}+"."{DIGIT}*
• This is the same as:
[0-9]+"."[0-9]*
Flex Example 5: Count Lines
int num_lines = 0, num_chars = 0;
%%
\n ++num_lines; ++num_chars;
. ++num_chars;
%%
main()
{
yylex();
printf( "# of lines = %d, # of chars = %d\n",
        num_lines, num_chars );
}
Some Regular Expressions for Flex
\"[^"]*\"                 string
"\t"|"\n"|" "             whitespace (most common forms)
[a-zA-Z]                  letter
[a-zA-Z_][a-zA-Z0-9_]*    identifier: allows a, aX, a45__
[0-9]*"."[0-9]+           allows .5 but not 5.
[0-9]+"."[0-9]*           allows 5. but not .5
[0-9]*"."[0-9]*           allows . by itself !!
Resources
• Aho, Lam, Sethi, and Ullman, Compilers: Principles,
Techniques, and Tools, 2nd ed. Addison-Wesley, 2006.
(The “purple dragon book”)
• Flex Manual. Available as single postscript file at the Lex
and Yacc page online:
– http://dinosaur.compilertools.net/#flex
– http://en.wikipedia.org/wiki/Flex_lexical_analyser
Compiler Design
4. Language Grammars
Kanat Bolazar
January 28, 2010
Introduction to Parsing: Language
Grammars
•
Programming language grammars are usually written as
some variation of Context Free Grammars (CFG)s
•
Notation used is often BNF (Backus-Naur form):
<block>
-> { <statementlist> }
<statementlist> -> <statement> ; <statementlist>
<statement> -> <assignment> ;
| if ( <expr> ) <block> else <block>
| while ( <expr> ) <block>
...
Example Grammar: Language 0+0
• A language that we'll call "Language 0+0":
E -> E + E | 0
• Equivalently:
E -> E + E
E -> 0
• Note that if there are multiple rules for the same left
hand side, they are alternatives.
• This language only contains sentences of the form:
0    0+0    0+0+0    0+0+0+0 ...
• Derivation for 0+0+0:
E -> E + E -> E + E + E -> 0 + 0 + 0
• Note: This language is ambiguous: in the second step, either the
first or the second E could have been expanded.
Example Grammar: Arithmetic,
Ambiguous
• Arithmetic expressions:
Exp -> num | Exp Op Exp
Op -> + | - | * | / | %
• The "num" here represents a token. What it
corresponds to is defined in the lexical analyzer with a
regular expression:
num   [0-9]+
• This language allows:
45    35 + 257 * 5 - 2    ...
• This language as defined here is ambiguous:
2+5*7
Exp * 7   or   2 + Exp ?
• Depending on the tools you use, you may be able to
declare operator precedence and associativity to resolve this.
Example Language: Arithmetic,
Factored
• Arithmetic expressions grammar, factored for operator
precedence:
Exp -> Factor | Factor Addop Exp
Factor -> num | num Multop Factor
Addop -> + | -
Multop -> * | / | %
• This language also allows the same sentences:
45    35 + 257 * 5 - 2    ...
• This language is not ambiguous; it first groups factors:
2+5*7
Factor Addop Exp
num + Exp
num + Factor
Grammar Definitions
• The grammar is a set of rules, sometimes called
productions, that construct valid sentences in the
language.
• Nonterminal symbols represent constructs in the
language. These would be the phrases in a natural
language.
• Terminal symbols are the actual words of the language.
These are the tokens produced by the lexical analyzer. In a
natural language, these would be the words, symbols, and
space.
• A sentence in the language only contains terminal
symbols.
Rules, Nonterminal and Terminal
Symbols
• Arithmetic expressions grammar, using multiplicative
factors for operator precedence:
Exp -> Factor | Factor Addop Exp
Factor -> num | num Multop Factor
Addop -> + | -
Multop -> * | / | %
• This language has four rules as written here. If we
expand each option, we would have 2 + 2 + 2 + 3 = 9
rules.
• There are four nonterminals:
Exp   Factor   Addop   Multop
• There are six terminals (tokens):
num   +   -   *   /   %
Grammar Definitions: Rules
• The production rules are rewrite rules. The basic CFG rule
form is:
X -> Y1 Y2 Y3 … Yn
where X is a nonterminal and the Y’s may be nonterminals
or terminals.
• There is a special nonterminal called the Start symbol.
• The language is defined to be all the strings that can be
generated by starting with the start symbol, repeatedly
replacing nonterminals by the rhs of one of its rules until
there are no more nonterminals.
Larger Grammar Examples
• We'll look at language grammar examples for MicroJava
and Decaf.
• Note: Decaf extends the standard notation; the very
useful { X }, to mean X | X, X | X, X, X | ... is not
standard.
Parse Trees
• Derivation of a sentence by the language rules can be
used to construct a parse tree.
• We expect parse trees to correspond to meaningful
semantic phrases of the programming language.
• Each node of the parse tree will represent some portion
that can be implemented as one section of code.
• The nonterminals expanded during the derivation are
trunk/branches in the parse tree.
• The terminals at the end of branches are the leaves of the
parse tree.
Parsing
• A parser:
– Uses the grammar to check whether a sentence (a program for
us) is in the language or not.
– Gives a syntax error if this is not a proper sentence/program.
– Constructs a parse tree from the derivation of the correct
program from the grammar rules.
• Top-down parsing:
– Starts with the start symbol and applies rules until it gets the
desired input program.
• Bottom-up parsing:
– Starts with the input program and applies rules in reverse until it
can get back to the start symbol.
– Looks at left part of input program to see if it matches the rhs of
a rule.
Parsing Issues
• Derivation Paths = Choices
– Naïve top-down and bottom-up parsing may require
backtracking to find a correct parse.
– Restrictions on the form of grammar rules to make
parsing deterministic.
• Ambiguity
– One program may have two different correct derivations
from the grammar.
– This may be a problem if it implies two different
semantic interpretations.
– Famous examples are arithmetic operators and the
dangling else problem.
Ambiguity: Dangling Else Problem
•
Which if does this else associate with?
if X
if Y
find()
else
getConfused()
•
The corresponding ambiguous grammar may be:
IfSttmt -> if Cond Action
| if Cond Action else Action
• Two derivations at the top (associated with the top "if") are:
if Cond Action
if Cond Action else Action
• Programming languages often associate else with the
nearest unmatched if.
Resources
• Aho, Lam, Sethi, and Ullman, Compilers: Principles,
Techniques, and Tools, 2nd ed. Addison-Wesley, 2006.
• Compiler Construction Course Notes at Linz:
http://www.ssw.uni-linz.ac.at/Misc/CC/
• CS 143 Compiler Course at Stanford:
http://www.stanford.edu/class/cs143/
Compiler Design
5. Top-Down Parsing
with a
Recursive Descent Parser
Kanat Bolazar
February 2, 2010
Parsing
• Lexical Analyzer has translated the source program into
a sequence of tokens
• The Parser must translate the sequence of tokens into an
intermediate representation
– Assume that the interface is that the parser can call
getNextToken to get the next token from the lexical analyzer
– And the parser can call a function called emit that will put out
intermediate representations, currently unspecified
• The parser outputs error messages if the syntax of the
source program is wrong
Parsing: Top-Down, Bottom-Up
• Given a grammar such as:
E -> 0 | E + E
• And a string to parse such as "0 + 0"
• A parser can parse top-down, from start symbol (E
above):
E ->
E+E
->
0 + E -> 0 + 0
• Or parse bottom-up, grouping terminals into RHS of
rules:
0+0
<- E + 0 <- E + E <- E
• Usually, parsing is done as tokens are read in:
– Top-down:
• After seeing 0, we don't yet know which rule to use;
we need to look ahead to decide.
Parsing: Top-Down, Bottom-Up
• Generally:
– top-down is easier to understand, implement directly
– bottom-up is more powerful, allowing more complicated
grammars
– top-down parsing may require changes to the grammar
• Top-down parsing can be done:
– programmatically (recursive descent)
– by table lookup and transitions
• Bottom-up parsing requires table-driven parsing
• If the grammar is not complicated, the simplest approach
is to implement a recursive-descent parser.
• A recursive descent parser does not require backtracking
Recursive Descent Parsing
• For every BNF rule (production) of the form
<phrase1> → E
the parser defines a function to parse phrase1 whose
body is to parse the rule E
void parsePhrase1( )
{ /* parse the rule E */
}
• Where E consists of a sequence of non-terminal and
terminal symbols
• Requires no left recursion in the grammar.
Parsing a rule
• A sequence of non-terminal and terminal symbols,
Y1 Y2 Y3 … Yn
is recognized by parsing each symbol in turn
• For each non-terminal symbol, Y, call the corresponding
parse function parseY
• For each terminal symbol, y, call a function
expect(y)
that will check if y is the next symbol in the source
program
– The terminal symbols are the token types from the lexical
analyzer
– If the variable currentsymbol always contains the next token:
expect(y):
if (currentsymbol == y) then getNextToken()
else syntaxError(...)
Simple parse function example
• Suppose that there was a grammar rule
<program> →
'class' <classname> '{' <field-decl> <method-decl>
'}'
• Then:
parseProgram():
expect('class');
parseClassname();
expect('{');
parseFieldDecl();
parseMethodDecl();
expect('}');
Look-Ahead
• In general, one non-terminal may have more than one
production, so more than one function should be written
to parse that non-terminal.
• Instead, we insist that we can decide which rule to parse
just by looking ahead one symbol in the input
<sentence> -> 'if' '(' <expr> ')' <block>
| 'while' '(' <expr> ')' <block>
...
• Then parseSentence can have the form
if (currentsymbol == "if")
... // parse first rule
elsif (currentsymbol == "while")
... // parse second rule
First and Follow Sets
• First(E) is the set of terminal symbols that may appear
at the beginning of a sentence derived from E
– And may also include ε if E can generate an empty string
• Follow(<N>), where <N> is a non-terminal symbol in
the grammar, is the set of terminal symbols that can
follow immediately after any sentence derived from any
rule of N
• In this grammar:
E -> 0 | E + E
• First(0) = {0} First(E + E) = {0} First(E) = {0}
• Follow(E) = {+, EOF}
Grammar Restriction 1
Grammar Restriction 1 (for top-down parsing):
• The First sets of alternative rules for the same LHS must
be different (so we know which path to take upon seeing
a first terminal symbol/token).
• Notice: This is not true in the grammar above. Upon
seeing 0 we don't know if we should take 0 or E + E
path.
Recognizing Possibly Empty
Sentences
• In a strict context free grammar, there may be rules in
which the rhs is ε, the empty string
• Or in an extended BNF grammar, there may be the
specification that some part of the rhs of the rule occurs 0
or 1 times
<phrase1> → … [ <phrase2> ] …
• Then we recognize the possibly empty occurrence of
phrase2 by
if (currentsymbol is in First(<phrase2>))
then parsePhrase2()
Recognizing Sequences
• In a context free grammar, you often have rules that
specify any number of a phrase can occur
<arglist> → <arg> <arglist> | ε
• In extended BNF, we replace this with the * to indicate 0
or more occurrences
<arg>*
• We can recognize these sequences by using iteration. If
there is a rule of the form
<phrase1> → … <phrase2>* …
we can recognize the phrase2 occurrences by
while (currentsymbol is in First(<phrase2>))
do parsePhrase2()
Grammar Restriction 2
• In either of the previous cases, where the grammar
symbol may generate sentences which are empty, the
grammar must be restricted
– suppose that <phrase2> is the symbol that can occur 0
times
– require that the sets First(<phrase2>) and
Follow(<phrase2>) be disjoint
Grammar Restriction 2:
• If a nonterminal may occur 0 times, its First and Follow
sets must be different (so we know whether to parse it or
skip it on seeing a terminal symbol/token).
Multiple Rules
• Suppose that there is a nonterminal symbol with multiple
rules where each rhs is nonempty
<phrase1> → E1 | E2 | E3 | . . . | En
then we can write ParsePhrase1 as follows:
if (currentsymbol is in First( E1 )) then ParseE1
elsif (currentsymbol is in First( E2 )) then ParseE2
...
elsif (currentsymbol is in First( En )) then ParseEn
else Syntax Error
• If any rhs can be empty, then don’t give the syntax error
• Remember the first grammar restriction:
– The sets First( E1 ), … , First( En ) must be disjoint
Example Expression Grammar
• Suppose that we have a grammar
<expr> → <term> { <op> <term> }*
<term> → 'const' | '(' <expr> ')'
<op> → '+' | '-'
• Parsing functions:
void parseExpr ( )
{ parseTerm();
  while (cursym in First(<op>))
  { getNextToken();
    parseTerm();
  }
}
void parseTerm ( )
{ if (cursym == 'const') then getNextToken()
  else if (cursym == '(')
  then { getNextToken();
         parseExpr();
         expect( ')' )
       }
}
First Sets
• Here we give a more formal, and more detailed,
definition of a First set, starting with any non-terminal.
– If we have a set of rules for a non-terminal, <phrase1>
<phrase1> → E1 | E2 | E3 | . . . | En
then First(<phrase1>) = First(E1) + . . . + First(En) (set union)
– For any right hand side, Y1 Y2 Y3 … Yn, we make cases on the
form of the rule
• First(a Y2 Y3 … Yn) = a, for any terminal symbol a
• First(N Y2 Y3 … Yn) = First(N), for any non-terminal N that
does not generate the empty string
• First([N]M) = First(N) + First(M)   (0 or 1 occurrences of N)
• First({N}*M) = First(N) + First(M)  (0 or more occurrences of N)
Follow Sets
• To define the set Follow(T), examine the cases of
where the non-terminal T may appear on the rhs of a
rule in the grammar.
– N → S T U  or  N → S [T] U  or  N → S {T}* U
• If U never generates an empty string, then Follow(T)
includes First(U)
• If U can generate an empty string, then Follow(T)
includes First(U) and Follow(N)
– N → S T  or  N → S [ T ]  or  N → S { T }*
• Follow(T) includes Follow(N)
– The Follow set of the start symbol should contain EOT, the
end of text marker
• Include the Follow set of all occurrences of T from the
rhs of rules to make the set Follow(T)
Simple Error Recovery
• To enable the parser to keep parsing after a syntax error,
the parser should be able to skip symbols until it finds a
“synchronizing symbol”.
– E.g. in parsing a sequence of declarations or statements,
skipping to a ‘;’ should enable the parser to start parsing the next
declaration or statement
General Error Recovery
• A more general technique allows the syntax error routine
to be given a list of symbols that it should skip to.
void syntaxError(String msg, Symbols StopSymbols)
{ give error with msg;
  while (! currentsymbol in StopSymbols)
  { getNextSymbol() }
}
– assuming that there is a type called Symbols of terminal
symbols
– we may want to pass an error code instead of a message
• Each recursive descent procedure should also take
StopSymbols as a parameter, and may modify these to
pass to any procedure that it calls
Stop Symbols
• If the parser is trying to parse the rhs E of a non-terminal
N → E
then the stop symbols are those symbols which the parser
is prepared to recognize after a sentence generated by E
– Remove anything ambiguous from Follow(N)
• The stop symbols should always also contain the end of
text symbol, EOT, so that the syntax error routine never
tries to skip over symbols past the end of the program.
Compiler Design
7. Top-Down Table-Driven
Parsing
Kanat Bolazar
February 9, 2010
Table Driven Parsers
• Both top-down and bottom-up parsers can be written
that explicitly manage a stack while scanning the input
to determine if it can be correctly generated from the
grammar productions
– In top-down parsers, the stack will have non-terminals which
can be expanded by replacing it with the right-hand-side of a
production
– In bottom-up parsers, the stack will have sequences of
terminals and non-terminals which can be reduced by
replacing it with the non-terminal for which it is the rhs of a
production
• Both techniques use a table to guide the parser in
deciding what production to apply, given the top of the
stack and the next input
Top-down and Bottom-up Parsers
• Predictive parsers are top-down, non-backtracking
– Sometimes called LL(k)
• Scan the input from Left to right
• Generates a Leftmost derivation from the grammar
• k is the number of lookahead symbols to make parsing
deterministic
– If a grammar is not in an LL(k) form, removing left recursion
and doing left-factoring may produce one
• Not all context free languages can have an LL(k) grammar
• Shift-reduce parsers are bottom-up parsers, sometimes
called LR(k)
– Scan the input from Left to Right
– Produce a Rightmost derivation from the grammar
• Not all context free languages have LR grammars
(Non-recursive) Predictive Parser
• Replaces non-terminals on the top of the stack
with the rhs of a production that can match the
next input.
• Configuration: a stack (X Y Z <eot>, with X on top), the remaining
input (a + b <eot>), the predictive parsing program, the parsing
table M, and the output.
Parsing Algorithm
• The parser starts in a configuration with S <eot> on the
stack
Repeat:
let X be the top of stack symbol, a is the next symbol
if X is a terminal symbol or <eot>
if (X = a) then pop X from the stack and getnextsym
else error
else // X is a non-terminal
if M[X, a] = X ->Y1 Y2 … Yk
{ pop X from the stack;
push Y1 Y2 … Yk on the stack with Y1 on top;
output the production
} else error
Until stack is empty
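A runnable sketch of this algorithm in C for a tiny LL(1) grammar, E → 0 Q and Q → + 0 Q | ε (an LL(1) form of the earlier "Language 0+0"; all names and the character-based encoding are illustrative):

#include <stdio.h>
#include <string.h>

/* Rules: 0: E -> 0Q   1: Q -> +0Q   2: Q -> (empty) */
static const char *rhs[] = { "0Q", "+0Q", "" };

static int M(char X, char a) {               /* parsing table M[X, a] */
    if (X == 'E') return a == '0' ? 0 : -1;
    if (X == 'Q') return a == '+' ? 1 : (a == '$' ? 2 : -1);
    return -1;
}

int parse(const char *input) {               /* input must end in '$' */
    char stack[100] = "$E";                  /* start symbol over <eot> */
    int top = 1, i = 0;
    char a = input[i];
    while (top >= 0) {
        char X = stack[top];
        if (X == '0' || X == '+' || X == '$') {   /* terminal or <eot> */
            if (X != a) return 0;                 /* error             */
            top--; a = input[++i];
        } else {                                  /* X is non-terminal */
            int r = M(X, a);
            if (r < 0) return 0;                  /* error             */
            top--;                                /* pop X             */
            for (int k = (int)strlen(rhs[r]) - 1; k >= 0; k--)
                stack[++top] = rhs[r][k];         /* push rhs reversed */
            printf("%c -> %s\n", X, rhs[r][0] ? rhs[r] : "eps");
        }
    }
    return 1;                                 /* stack emptied: accepted */
}

int main(void) {
    printf("accepted: %d\n", parse("0+0+0$"));
    return 0;
}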
Example from the dragon book (Aho et al)
• The expression grammar
E → E + T | T
T → T * F | F
F → ( E ) | id
• Can be rewritten to eliminate the left recursion
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
Parsing Table
• The table is indexed by the set of non-terminals in one direction
and the set of terminals in the other
• Any blank entries represent error states
• Non-LL(1) grammars could have more than one rule in a table entry

   | id      | +         | *         | (       | )      | eot
E  | E → TE' |           |           | E → TE' |        |
E' |         | E' → +TE' |           |         | E' → ε | E' → ε
T  | T → FT' |           |           | T → FT' |        |
T' |         | T' → ε    | T' → *FT' |         | T' → ε | T' → ε
F  | F → id  |           |           | F → (E) |        |
Stack    | Input          | Output
$E       | id + id * id $ | E → T E'
$E'T     | id + id * id $ | T → F T'
$E'T'F   | id + id * id $ | F → id
$E'T'id  | id + id * id $ | (match id)
$E'T'    | + id * id $    | T' → ε
$E'      | + id * id $    | E' → + T E'
$E'T+    | + id * id $    | (match +)
$E'T     | id * id $      | T → F T'
$E'T'F   | id * id $      | F → id
$E'T'id  | id * id $      | (match id)
$E'T'    | * id $         | T' → * F T'
$E'T'F*  | * id $         | (match *)
$E'T'F   | id $           | F → id
$E'T'id  | id $           | (match id)
$E'T'    | $              | T' → ε
$E'      | $              | E' → ε
$        | $              | (accept)
Constructing the LL parsing table
• For each production of the form N → E in the grammar
– For each terminal a in First(E),
add N → E to M[N, a]
– If ε is in First(E), for each terminal b in Follow(N),
add N → E to M[N, b]
– If ε is in First(E) and eot is in Follow(N),
add N → E to M[N, eot]
– All other entries are errors
Compiler Design
9. Table-Driven Bottom-Up
Parsing:
LR(0), SLR,
LR(1), LALR
Kanat Bolazar
February 16, 2010
Table Driven Parsers
• Both top-down and bottom-up parsers can be written
that explicitly manage a stack while scanning the input
to determine if it can be correctly generated from the
grammar productions
– In top-down parsers, the stack will have non-terminals which
can be expanded by replacing it with the right-hand-side of a
production
– In bottom-up parsers, the stack will have sequences of
terminals and non-terminals which can be reduced by
replacing it with the non-terminal for which it is the rhs of a
production
• Both techniques use a table to guide the parser in
deciding what production to apply, given the top of the
stack and the next input
Top-Down and Bottom-Up Parsers
• Predictive parsers are top-down, non-backtracking
– Sometimes called LL(k)
• Scan the input from Left to right
• Generates a Leftmost derivation from the grammar
• k is the number of lookahead symbols to make parsing
deterministic
– If a grammar is not in an LL(k) form, removing left recursion
and doing left-factoring may produce one
• Not all context free languages can have an LL(k) grammar
• Shift-reduce parsers are bottom-up parsers, sometimes
called LR(k)
– Scan the input from Left to Right
– Produce a Rightmost derivation from the grammar
• Not all context free languages have LR grammars
Bottom-Up (Shift-Reduce) Parsers
• Also called Shift-Reduce Parser because it will either
– Reduce a sequence of symbols on the stack that are the rhs
of a production by their non-terminal
– Shift an input symbol to the top of the stack
• Configuration: a stack (X Y Z <eot>, with X on top), the remaining
input (a + b <eot>), the shift-reduce parser, the parsing table M,
and the output.
Shift Reduce Parser Actions
• During the parse
– the stack has a sequence of terminal and non-terminal symbols
representing the part of the input worked on so far, and
– the input has the remaining symbols
• Parser actions
– Reduce: If the stack has a sequence FE and there is a
production N → E, we can replace E by N to get FN on the
stack.
– Shift: If there is no possible reduction, transfer the next input
symbol to the top of the stack.
– Error: Otherwise it is an error.
• If, after a reduce, we get the start symbol on the top of the
stack and there is no more input, then we have succeeded.
Handles
• During the parse, the term handle refers to a sequence of
symbols on the stack that
– Matches the rhs of a production
– Will be a step along the path of producing a correct parse tree
• Finding the handle, i.e. identifying when to reduce, is the
central problem of bottom-up parsing
• Note that ambiguous grammars do not fit (as they didn’t
for top down parsing, either) because there may not be a
unique handle at one step
– E.g. dangling else problem
LR Parsing
• A specific way to implement a shift reduce parser is an
LR parser.
• This parser represents the state of the stack by a single
state symbol on the top of the stack
• It uses two parsing tables, action and goto
– For any parsing state and input symbol, the action table tells
what action to take
• Sn, meaning shift and go to state n
• Rn, meaning reduce by rule n
• Accept
• Error
– For any parsing state and non-terminal symbol N, the goto
table gives the next state when a reduce has been performed
to the non-terminal symbol N
LR Parser
• A Shift Reduce parser that encodes the stack with a
state on the top of the stack
– The TOS state and the next input symbol are used to look
up the parser’s actions and goto function from the table
• Configuration: a stack interleaving symbols and states
(X s1 Y s2 Z s3, with state s3 on top), the remaining input
(a + b <eot>), the LR parser, the parsing table M, and the output.
Types of LR Parsers
• LR parsers can work on more general grammars than LL
parsers
– Has more history on the stack to make decisions than top-down
• LR parsers have different ways to generate the action and
goto tables
• Types of parsers listed in order of increasing power
(ability to handle grammars) and decreasing efficiency
(size of the parsing tables becomes very large)
– LR(0)    standard/general LR, with 0 lookahead
– SLR(1)   "Simple LR" (with 1 lookahead)
– LALR(1)  "Lookahead LR", with 1 lookahead
– LR(1)    standard LR, with 1 lookahead
Types of LR Parsers: Comparisons
• Here's a subjective (personal) comparison of grammars
• The LALR class of grammars is the most useful and most
complicated

Grammar (lookahead) | Name      | Power       | Table Size (+: small) | Conceptual Complexity | Utility / Popularity
LR(0)               |           | -- too weak | +                     | +                     | -- never used
SLR(1)              | simple    | - weak      | +                     | =                     | - (was popular before LALR)
LALR(1)             | lookahead | +           | = or ~= SLR           | -- complicated        | ++ balanced
LR(1)               |           | ++          | -- 10x                | -                     | - too large
LR(0) Parsing Tables
• Although not used in practice, LR(0) table construction
illustrates the key ideas
• Item or configuration is a production with a dot in the
middle, e.g. there are three items from A → XY:
A → •XY    X will be parsed next
A → X•Y    X parsed; Y will be parsed next
A → XY•    X and Y parsed, we can reduce to A
• The item represents how much of the production we have
seen so far in the parsing process.
LR(0): Closure and Goto Operations
• Closure is defined to construct a configurating set for each item. For
the starting item N → W•Y:
– N → W•Y is in the set
– If Y begins with a terminal, we are done
– If Y begins with a non-terminal N', add all N' productions with the dot at
the start of the rhs, N' → •Z
• For each configurating set and grammar symbol, the goto operation
gives another configurating set.
– If a set of items I contains N → W•xY, where W and Y are sequences
but x is a single grammar symbol, then goto(I, x) contains N → Wx•Y
• To create the family of configurating sets for a grammar, add an initial
production S' → S, and construct sets from
S' → •S
• Use the sets for parser states – states whose items end with a dot will be
reduce states (a small closure sketch follows)
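A small runnable sketch of closure in C, for the grammar used in the example that follows (E → E - 1 | 1, with added S → E); the item representation is illustrative:

#include <stdio.h>
#include <string.h>

/* Grammar: rule 0: S -> E ; rule 1: E -> E-1 ; rule 2: E -> 1 */
static const char lhs[]  = { 'S', 'E', 'E' };
static const char *rhs[] = { "E", "E-1", "1" };

struct Item { int rule, dot; };   /* lhs -> rhs[0..dot-1] . rhs[dot..] */

/* closure: if the dot is before a nonterminal N, add all N -> .w items */
int closure(struct Item *set, int n) {
    for (int i = 0; i < n; i++) {                 /* worklist style    */
        char next = rhs[set[i].rule][set[i].dot]; /* symbol after dot  */
        if (next != 'E') continue;                /* only nonterminal  */
        for (int r = 0; r < 3; r++) {
            if (lhs[r] != next) continue;
            int present = 0;
            for (int j = 0; j < n; j++)
                if (set[j].rule == r && set[j].dot == 0) present = 1;
            if (!present) { set[n].rule = r; set[n].dot = 0; n++; }
        }
    }
    return n;
}

int main(void) {
    struct Item s1[10] = { { 0, 0 } };            /* start: S -> .E    */
    int n = closure(s1, 1);
    for (int i = 0; i < n; i++)                   /* prints state s1   */
        printf("%c -> %.*s.%s\n", lhs[s1[i].rule],
               s1[i].dot, rhs[s1[i].rule], rhs[s1[i].rule] + s1[i].dot);
    return 0;
}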
LR(0) Example
• Consider a simple grammar; add an initial rule:
E → E - 1 | 1       rule1: E → E - 1    rule2: E → 1
S → E               start symbol added for LR(0)
• The states are:
s1: S → •E , E → •E - 1 , E → •1
    action(s1, '1') = shift2
    goto(s1, E) = s3
s2: E → 1•          action(s2, on any token) = reduce by rule2
s3: S → E• , E → E• - 1
    action(s3, EOT) = accept    action(s3, '-') = shift4
s4: E → E - •1      action(s4, '1') = shift5
s5: E → E - 1•      action(s5, on any token) = reduce by rule1
LR(0) Example: Table
s1: S → •E , E → •E - 1 , E → •1
    action(s1, '1') = shift2
    goto(s1, E) = s3
s2: E → 1•          action(s2, on any token) = reduce by rule2
s3: S → E• , E → E• - 1
    action(s3, EOT) = accept    action(s3, '-') = shift4
s4: E → E - •1      action(s4, '1') = shift5
s5: E → E - 1•      action(s5, on any token) = reduce by rule1

State |        Action          | Goto
      |   -     1     EOT      |  E
s1    |  err    s2    err      |  s3
s2    |  r2     r2    r2       |
s3    |  s4     err   accept   |
s4    |  err    s5    err      |
s5    |  r1     r1    r1       |
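As an aside (not from the slides), a minimal Java sketch of the
table-driven driver loop using exactly this table; the encoding of actions
as strings ("sN", "rN", "acc") is an illustrative choice:

    import java.util.*;

    class LR0Driver {
        public static void main(String[] args) {
            // action[state][symbol]; '$' plays the role of EOT here.
            Map<Integer, Map<Character, String>> action = Map.of(
                1, Map.of('1', "s2"),
                2, Map.of('1', "r2", '-', "r2", '$', "r2"),
                3, Map.of('-', "s4", '$', "acc"),
                4, Map.of('1', "s5"),
                5, Map.of('1', "r1", '-', "r1", '$', "r1"));
            Map<Integer, Integer> gotoE = Map.of(1, 3);  // goto on nonterminal E
            int[] ruleLen = {0, 3, 1};                   // rhs lengths of rules 1, 2

            String input = "1-1$";
            Deque<Integer> stack = new ArrayDeque<>(List.of(1));
            int pos = 0;
            while (true) {
                String act = action.get(stack.peek()).get(input.charAt(pos));
                if (act == null) { System.out.println("error"); return; }
                if (act.equals("acc")) { System.out.println("accept"); return; }
                if (act.charAt(0) == 's') {              // shift: push next state
                    stack.push(Integer.parseInt(act.substring(1)));
                    pos++;
                } else {                                 // reduce: pop rhs, goto on E
                    int rule = Integer.parseInt(act.substring(1));
                    for (int i = 0; i < ruleLen[rule]; i++) stack.pop();
                    stack.push(gotoE.get(stack.peek()));
                }
            }
        }
    }

On input 1-1$ this shifts, reduces by rule 2, shifts '-' and '1', reduces
by rule 1, and accepts, mirroring the trace above.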
Limitations of LR(0)
• Since there is no look-ahead, the parser must decide
whether to shift or reduce based only on the parsing stack so
far
– A configurating set can call for only shifts or only a reduce,
not both depending on the input (e.g. we can't shift for '-' but reduce
for '1')
• Problematic examples
– Epsilon rules create shift/reduce conflicts if there are other rules
– Items like these have shift/reduce conflicts:
T → id•          reduce?
T → id• [ E ]    shift?
– Items like these (two completed items in one set) have reduce/reduce
conflicts
SLR(1) Parsing
• SLR(1), simple LR, uses the same configurating sets,
table structures and parser operations.
• When assigning table actions, don't assume that every
completed item should always be reduced
– Look ahead by using the Follow set of the item
– Reduce an item N → Y • only if the next input symbol is in the
Follow set of N.
• The configurating sets may have shift and reduce items in the
same set, but the shift tokens and the Follow sets are required to be
disjoint
– This also requires that there are no reduce/reduce conflicts in the
state
SLR(1) Table: Reduce Depends on Token
s1: S → •E , E → •E - 1 , E → •1
    action(s1, '1') = shift2
    goto(s1, E) = s3
s2: E → 1•          action(s2, {-, EOT}) = reduce by rule2
s3: S → E• , E → E• - 1
    action(s3, EOT) = accept    action(s3, '-') = shift4
s4: E → E - •1      action(s4, '1') = shift5
s5: E → E - 1•      action(s5, {-, EOT}) = reduce by rule1

SLR(1)
State |        Action          | Goto
      |   -     1     EOT      |  E
s1    |         s2             |  s3
s2    |  r2           r2       |
s3    |  s4           accept   |
s4    |         s5             |
s5    |  r1           r1       |
LR(0) Table For Comparison
s1: S → •E , E → •E - 1 , E → •1
    action(s1, '1') = shift2
    goto(s1, E) = s3
s2: E → 1•          action(s2, on any token) = reduce by rule2
s3: S → E• , E → E• - 1
    action(s3, EOT) = accept    action(s3, '-') = shift4
s4: E → E - •1      action(s4, '1') = shift5
s5: E → E - 1•      action(s5, on any token) = reduce by rule1

LR(0)
State |        Action          | Goto
      |   -     1     EOT      |  E
s1    |  err    s2    err      |  s3
s2    |  r2     r2    r2       |
s3    |  s4     err   accept   |
s4    |  err    s5    err      |
s5    |  r1     r1    r1       |
LR(1) Parsing
• Although SLR(1) uses 1 lookahead symbol, it still does not
use all of the information that could be obtained in a
parsing state by keeping track of what path led to that item
• Not every token in Follow(X) is possible after every rule of X
• In LR(1) parsing tables, we keep the lookahead in the
parsing state and separate those states, so that they can
have more detailed successor states:
– A -> B C • D E F , a/b/c
– A will eventually be reduced, if the lookahead
token after F is one of {a, b, c}
– if any other token is seen, some other action may be taken
– if there is no action, it's an error
• Leads to larger numbers of states (in the thousands, instead of
hundreds)
LALR(1) Parsing
• Compromises between the simplicity of SLR and the
power of LR(1) by merging similar LR(1) states.
– Identify a core of configurating sets and merge states that differ
only by lookahead
– This is not just SLR, because LALR will have fewer reduce
actions; but merging may introduce reduce/reduce conflicts that LR(1)
did not have
• Constructing LALR(1) parsing tables
– is not usually done by brute force (constructing LR(1) and then
merging sets)
– As configurating sets are generated, each new configurating set is
examined to see if it can be merged with an existing one
More on LR Parsing
• Almost all shift-reduce parsing is done with automatically
generated parser tables
• Look at the types of parsers in available parser
generators:
– http://en.wikipedia.org/wiki/Category:Parsing_algorithms
– Note the types of parsers (but not types of trees)
• Bison (yacc)
• ANTLR
• JavaCC
• Coco/R
• Elkhound
One More Type of SR Parsing
• Operator precedence parsing
– Useful for expression grammars and other types of ambiguities
• Doesn't use a full table; just uses operator precedence rules to
resolve conflicts
• Fits in with the various types of LR parsers
– In addition to the action table, the parsing algorithm can appeal
to an operator precedence table
General Context-Free Parsers
• All of the table-driven parsers work on grammars in
particular forms and may not work for arbitrary CFGs,
including ambiguous ones
• General backtracking parsers, O(n³):
– CYK (Cocke, Younger, Kasami) algorithm
• Produces a forest of parse trees
– Earley's algorithm
• Notable for carrying along partial parses (subtrees); the first
of the chart parsers
• General parallel parser, can be O(n³):
– GLR – copies the relevant parts of the LR stack and parses in
parallel whenever there is a conflict – otherwise same as LALR
Compiler Design
11. Table-Driven Bottom-Up Parsing:
LALR
More Examples for LR(0), SLR, LR(1),
LALR
Kanat Bolazar
February 23, 2010
Bottom-Up Parsers
• We have been looking at bottom-up parsers.
• They are also called shift-reduce parsers:
– shift: Put the next token on the stack, move on
– reduce: Tokens on the stack match the RHS of a rule; reduce this
to the nonterminal on the left.
• Scan the input from Left to Right
• Produce a Rightmost derivation from the grammar
• Not all context-free languages have LR grammars
Shift-Reduce Parsing
• Example grammar:
E -> 1 | E - 1      (rules 1 and 2)
• Input:
1 - 1 $             ($ : end of file / tape)
• Steps (stack contents, then the action taken):
1          shift
E          reduce by rule 1
E -        shift
E - 1      shift
E          reduce by rule 2
accept at $
LR(0) Table: Reduce on Any Token
s1: S → •E , E → •E - 1 , E → •1
    action(s1, '1') = shift2
    goto(s1, E) = s3
s2: E → 1•          action(s2, on any token) = reduce by rule2
s3: S → E• , E → E• - 1
    action(s3, EOT) = accept    action(s3, '-') = shift4
s4: E → E - •1      action(s4, '1') = shift5
s5: E → E - 1•      action(s5, on any token) = reduce by rule1

LR(0)
State |        Action          | Goto
      |   -     1     EOT      |  E
s1    |  err    s2    err      |  s3
s2    |  r2     r2    r2       |
s3    |  s4     err   accept   |
s4    |  err    s5    err      |
s5    |  r1     r1    r1       |
SLR(1) Table: Reduce Depends on Token
s1: S → •E , E → •E - 1 , E → •1
    action(s1, '1') = shift2
    goto(s1, E) = s3
s2: E → 1•          action(s2, {-, EOT}) = reduce by rule2
s3: S → E• , E → E• - 1
    action(s3, EOT) = accept    action(s3, '-') = shift4
s4: E → E - •1      action(s4, '1') = shift5
s5: E → E - 1•      action(s5, {-, EOT}) = reduce by rule1

SLR(1)
State |        Action          | Goto
      |   -     1     EOT      |  E
s1    |         s2             |  s3
s2    |  r2           r2       |
s3    |  s4           accept   |
s4    |         s5             |
s5    |  r1           r1       |
LR(1) Parsing
• Although SLR(1) uses 1 lookahead symbol, it still does not
use all of the information that could be obtained in a
parsing state by keeping track of what path led to that item
• Not every token in Follow(X) is possible after every rule of X
• In LR(1) parsing tables, we keep the lookahead in the
parsing state and separate those states, so that they can
have more detailed successor states:
– A -> B C • D E F , a/b/c
– A will eventually be reduced, if the lookahead
token after F is one of {a, b, c}
– if any other token is seen, some other action may be taken
– if there is no action, it's an error
• Leads to larger numbers of states (in the thousands, instead of
hundreds)
LALR(1) Parsing
• Compromises between the simplicity of SLR and the
power of LR(1) by merging similar LR(1) states.
– Identify a core of configurating sets and merge states that differ
only by lookahead
– This is not just SLR, because LALR will have fewer reduce
actions; but merging may introduce reduce/reduce conflicts that LR(1)
did not have
• Constructing LALR(1) parsing tables
– is not usually done by brute force (constructing LR(1) and then
merging sets)
– As configurating sets are generated, each new configurating set is
examined to see if it can be merged with an existing one
Recap: LR(0), SLR, LR(1), LALR
• LR(0): Don't look ahead when reducing according to a
rule. When we reach the end of a RHS, we reduce.
• SLR = SLR(1): Use the Follow set of the nonterminal on the
left. If the lookahead is in our Follow set, we reduce.
• LR(1): Add the expected lookahead for which we will
eventually reduce. Produces very large tables.
• LALR = LALR(1): Use LR(1), but combine states that
differ only in lookahead.
• Note: LALR is not SLR:
S -> V = V | V = V + V
• After the first V, SLR would reduce if the next token is '+';
LR(1) and LALR wouldn't.
Example 1.0
• Let's start with a simple grammar:
1 S -> B b
2 S -> a a
3 B -> a
• What strings are allowed in this grammar?
a b    (from B b)
a a
• Consider seeing a string that starts with a:    a ...
• Should we shift a, or reduce a to B according to rule 3?
• What would LR(0) parsing do?
Shift/reduce conflict: Can't parse!
Example 1.1
• Original simple grammar allows only "a a" and "a b":
1 S -> B b
2 S -> a a
3 B -> a
• SLR: Follow(B) = {b}; look ahead: shift on a, reduce on b
• Can we make the grammar harder for SLR?
• Of course! Just add 'a' to Follow(B) somehow:
4 S -> b B a
• The grammar also allows "b a a" now.
• This should be irrelevant for "a a",
• but SLR can't decide: Follow(B) = {a, b}: conflict for "a a"!
Example 1.1 (continued)
• Modified grammar, reorganized:
1 S -> B b
2 S -> a a
3 S -> b B a
4 B -> a
• Input: "a ..."
• SLR: Follow(B) = {a, b}: shift/reduce conflict on a (reduce on b)
• LR(1): State 0 (closure items included):
S' -> . S , $
S -> . B b , $
S -> . a a , $
S -> . b B a , $
B -> . a , b
• After seeing a, transition to State 1:
S -> a . a , $
B -> a . , b
• Reduce? Shift? Which lookahead?
• Reduce for b, shift for a: no conflict.
LR(1) vs LALR
• We went through the previous example with LALR in the
class.
• The states and transitions were mostly the same as those of
LR(1).
Example 2
• Assignment statement with variables:
1 S -> V = V
2 S -> V = V + V
3 V -> id
• Use LR(0), SLR, LR(1), LALR. Shown in the class.
• SLR doesn't know that an initial V can't be followed by '+' or $
(EOF).
• LR(1) knows it; the '=' that must follow is attached to the V rule:
s0:
S -> . V = V , $
S -> . V = V + V , $
V -> . id , =        (due to previous two lines)
Example 3
• Another grammar (for the regular expression a*ba*b):
1 S -> X X
2 X -> a X
3 X -> b
• Create the LR(0), SLR and LR(1) tables for table-driven
parsing.
• Draw the states and state transitions for one of these
tables.
• Compare it to the minimal a*ba*b finite state machine below.
[Figure: FSM to recognize a*ba*b — s0 goes to s1 on b, s1 goes to s2
(accepting) on b, with 'a' self-loops on s0 and s1. Accepts bb, abb,
bab, baab, abaaab, ...]
• In LR(0), not looking ahead before reducing adds extra states.
Example 4
• A harder grammar:
1 S -> a X c
2 S -> b Y c
3 S -> a Y d
4 S -> b X d
5 X -> e
6 Y -> e
• Use LR(0), SLR, LR(1), LALR.
• We did not yet do this example in the class. We will, later,
to remember how to do table-driven parsing.
Compiler Design
13. Symbol Tables
Kanat Bolazar
March 4, 2010
Symbol Tables
• The job of the symbol table is to store all the names of the
program and information about each name
• In block-structured languages, roughly speaking, the symbol
table collects information from declarations and uses that
information whenever a name is used later in the program
– This information could be part of the syntax tree, but is put into a table
for efficient access to names
– If there are different occurrences of the same name, the symbol table
assists in name resolution
• Either the parser or the lexical analyzer can do the job of
inserting names into the symbol table (as long as scope
information is given to the lexer)
Symbol Table Entries:
Simple Variables, Basic Information
• Variables (identifiers)
– Character string (lexeme); may have limits on the number of
characters
– Data type
– Storage class (if not already implied by the data type)
– Name and lexical level of the block in which it is declared
– Other access information, if necessary, such as modifiability
constraints
– Base address and memory offset, after allocation
Symbol Table Entries:
Beyond Simple Variables
• Arrays
– Also need the number of dimensions
– Upper and lower bounds of each dimension
• Records and structures
– List of fields
– Information about each field
• Functions and procedures
– Number and types of parameters
– Type of return value
• Function pointers?
Symbol Table Representation
• The two main operations are
– insert(name): makes an entry for this name
– lookup(name): finds the relevant occurrence of the name by
searching the table
• Lookups occur a lot more often than inserts
• Hash tables are commonly used
– Because of the good average time complexity for lookup (O(1))
[Figure: hash table holding entries var1, class1, fn1, var2, fn2, var3]
Scope Analysis
• The scope of a name is tied to the idea of a block in the
programming language:
– Standard blocks (statement sequences, sometimes if statements)
– Procedures and functions
– Program (global program level)
– Universe (predefined functions, etc.)
• Names must be unique within the block in which they are
declared (no two objects with the same name in one block)
– There are some languages with exceptions for different types (a
function and a variable may have the same name)
• Name resolution: A use of a name should refer to the most
local enclosing block that has a declaration for that name.
Declaration Before Use?
• We are dealing primarily with languages in which
declarations of names are required
– Names of variables, constants, arrays, etc. must be declared before
use
– Rules for names of functions and procedures vary:
• C requires functions and procedures to also be declared before
use, or at least given a prototype
• Java does not require this for methods (can call first, define later
in the *.java file)
• Scope of a name (in a statically scoped language):
– The scope of a constant, variable, array, etc. is from the end of its
definition to the end of the block in which it is declared
– The scope of a function or procedure name
Further Structure of Symbol Table
• For nested scopes, we may use a list of hash tables, with one
element of the list for each scope
• The lookup function will first search the current lexical level's
table and then continue on up the list, using the first
occurrence of the name that it finds
• Parts of the table not currently active may be kept for future
semantic analysis
[Figure: Scope A declares "float x, y;" and has Table A with entries
x, y; nested Scope B declares "int x, z;" and has Table B with entries
x, z. B.x shadows A.x; lookup finds B.x first.]
More Symbol Table Functions
• In addition to lookup and insert, the symbol table will also
need:
– initializeScope(level): when a block is entered, create a new
hash table entry in the symbol table list
– finalizeScope(level): on block exit, put the current hash table into a
background list
• Essentially makes a tree structure (scope A may contain scopes
B1, B2, B3, ...), where one child may be distinguished as the
active block
• The symbol tables shown so far are all for the program
being compiled; also needed is a way to look up names in
the "universe"
– System-defined names (predefined types, functions, values)
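As an aside (not from the slides), a minimal Java sketch of the
list-of-hash-tables design: one table per lexical scope, searched
innermost-out. The Entry record is a simplified stand-in for the richer
entries described above:

    import java.util.*;

    class ScopedSymbolTable {
        // Hypothetical entry: just a type name here; real entries hold much more.
        record Entry(String name, String type) {}

        private final Deque<Map<String, Entry>> scopes = new ArrayDeque<>();

        void initializeScope() { scopes.push(new HashMap<>()); }  // block entered
        void finalizeScope()   { scopes.pop(); }  // block exited (a real compiler
                                                  // might keep it for later passes)

        void insert(String name, String type) {
            scopes.peek().put(name, new Entry(name, type));
        }

        // lookup searches the current scope first, then the enclosing scopes.
        Entry lookup(String name) {
            for (Map<String, Entry> scope : scopes) {
                Entry e = scope.get(name);
                if (e != null) return e;     // first (innermost) match wins
            }
            return null;                     // undeclared name
        }

        public static void main(String[] args) {
            ScopedSymbolTable st = new ScopedSymbolTable();
            st.initializeScope();                 // scope A
            st.insert("x", "float");
            st.initializeScope();                 // nested scope B
            st.insert("x", "int");
            System.out.println(st.lookup("x"));   // B.x shadows A.x -> int
            st.finalizeScope();
            System.out.println(st.lookup("x"));   // back to A.x -> float
        }
    }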
Example: Predeclared Names in MicroJava
• Predeclared in MicroJava:
Types:     int, char
Constants: null
Methods:   ord(ch), chr(i), len(arr)
• We can put these in the symbol table as well:
[Figure: symbol table entries — Type int; Type char; Const null;
Method ord with Param (Var ch); Method chr with Param (Var i);
Method len with Param (Var arr). Shown as a list here; the symbol
table is probably a hash table instead.]
Alternate Representation
• The list of hash tables can be inefficient for lookup, since the
system has to search up the list of lexical levels
– More names tend to be declared at level 0, thus making the most
common case the most expensive
• An optimization of the symbol table as a list of hash tables is to
keep one giant hash table
– Within that table, each name has a list of occurrences identified by
lexical level
• This representation keeps the (essentially) constant-time lookup
– But it makes leaving a block more expensive, as the hash table must be
searched to find all entries that need to be removed and stored elsewhere
Alternate Representation
[Figure: two designs side by side, for scope A containing scope B,
which contains scope C, with entries a1, b1, b2, c1, c2.]
• Single symbol table:
– Faster lookup.
– Slow scope close: must remove c1, c2 after scope C ends.
• Hierarchical symbol table:
– Faster scope close.
– Slow lookup (of globals, especially).
Static Scope
• The scoping system described so far assumes that the scope rules
are for static scoping
– The static program layout of enclosing blocks determines the scoping of
a name
• There are also languages with dynamic scoping
– The scoping of a name depends on the call structure of the program at
run-time
– The name resolution will be to the closest block on the call stack
with a declaration of that name – the most recently called function
or block
Object-Oriented Scoping
• Languages like Java must keep symbol tables for:
– The code being compiled
– Any external classes that are known and referenced inside the code
– The inheritance hierarchy above the class containing the code
• One method of implementation is to attach a symbol table to
each class, with two nesting hierarchies:
– One for lexical scoping inside individual methods
– One to follow the inheritance hierarchy of the class
• Resolving a name:
– First consult the lexically scoped symbol table
– If not found, search the classes in the inheritance hierarchy
– If not found, search the global name space
Testing and Error Recovery
• If a name is used, but the lookup fails to find any definition:
– Give an error, but enter the name with dummy type information so that
further uses do not also trigger errors
• If a name is defined twice:
– Give an ambiguity error; choose which type to use in later analysis,
usually the first
• Testing cases
– Include all types of correct declarations
– Incorrect cases may include:
• Definition of an ambiguous name
• Definition without a name
• Meaningless recursive definitions (in some types of structures)
References
• Nancy McCracken's original slides.
• Linz University Compiler course materials (MicroJava).
• Keith Cooper and Linda Torczon, Engineering a Compiler,
Elsevier, 2004.
• Kenneth C. Louden, Compiler Construction: Principles and
Practice, PWS Publishing, 1997.
• Per Brinch Hansen, On Pascal Compilers, Prentice-Hall, 1985.
Out of print.
• Aho, Lam, Sethi, and Ullman, Compilers: Principles,
Techniques, and Tools. Addison-Wesley, 2006. (The purple
dragon book)
Compiler Design
14. AST (Abstract Syntax Tree)
and Syntax-Directed Translation
Kanat Bolazar
March 9, 2010
Abstract Syntax Tree (AST)
• The parse tree
– contains too much detail
• e.g. unnecessary terminals such as parentheses
– depends heavily on the structure of the grammar
• e.g. intermediate non-terminals
• Idea:
– strip the unnecessary parts of the tree, simplify it.
– keep track only of important information
• AST
– Conveys the syntactic structure of the program while
providing abstraction.
– Can be easily annotated with semantic information
Abstract Syntax Tree
[Figure: a parse tree for an if-statement (IF cond THEN statement)
can become an if-statement AST node with children cond and statement;
a parse tree built through add_expr and mul_expr nonterminals over
id, +, *, and num leaves can become a '+' node over the operands,
with a '*' node as one child.]
Lexical, Parse, Semantic
• Ultimate goal: Generate machine code.
• Before we generate code, we must collect information
about the program.
• After lexical analysis and parsing, we are at semantic
analysis (recognizing meaning)
• There are issues deeper than structure. Consider:

int func (int x, int y);
int main () {
    int list[5], i, j;
    char *str;
    j = 10 + 'b';
    str = 8;
    m = func("aa", j, list[12]);
    return 0;
}
Beyond Syntax Analysis
• An identifier named x has been recognized.
– Is x a scalar, array or function?
– How big is x?
– If x is a function, how many and what type of arguments
does it take?
– Is x declared before being used?
– Where can x be stored?
– Is the expression x+y type-consistent?
• Semantic analysis is the phase where we collect
information about the types of expressions and check
for type related errors.
• The more information we can collect at compile
time, the less overhead we have at run time.
Semantic Analysis
• Collecting type information may involve "computations"
– What is the type of x+y given the types of x and y?
• Tool: attribute grammars
– CFG
– Each grammar symbol has associated attributes
– The grammar is augmented by rules (semantic actions) that specify how
the values of attributes are computed from other attributes.
– The process of using semantic actions to evaluate attributes is called
syntax-directed translation.
– Examples:
• Grammar of declarations.
• Grammar of signed binary numbers.
Attribute Grammars
Example 1: Grammar of declarations

Production      Semantic rule
D → T L         L.in = T.type
T → int         T.type = integer
T → char        T.type = character
L → L1, id      L1.in = L.in; addtype(id.index, L.in)
L → id          addtype(id.index, L.in)
Attribute Grammars
Example 2: Grammar of signed binary numbers

Production      Semantic rule
N → S L         if (S.neg) print('-'); else print('+'); print(L.val);
S → +           S.neg = 0
S → –           S.neg = 1
L → L1 B        L.val = 2*L1.val + B.val
L → B           L.val = B.val
B → 0           B.val = 0 (i.e. 0*2^0)
B → 1           B.val = 1 (i.e. 1*2^0)
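As an aside (not from the slides), a minimal Java sketch that applies
these semantic rules directly to a string like "-1011", folding
L.val = 2*L1.val + B.val from left to right:

    class SignedBinary {
        static int eval(String s) {
            boolean neg = s.charAt(0) == '-';        // S.neg from the sign
            int val = 0;                             // L.val
            for (int i = 1; i < s.length(); i++)
                val = 2 * val + (s.charAt(i) - '0'); // L.val = 2*L1.val + B.val
            return neg ? -val : val;                 // N applies the sign
        }

        public static void main(String[] args) {
            System.out.println(eval("+101"));   // 5
            System.out.println(eval("-1011"));  // -11
        }
    }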
Attributes
• Attributed parse tree = parse tree annotated with attribute
rules
• Each rule implicitly defines a set of dependences
– Each attribute's value depends on the values of other attributes.
• These dependences form an attribute-dependence graph.
• Note:
– Some dependences flow upward
• The attributes of a node depend on those of its children
• We call those synthesized attributes.
– Some dependences flow downward
• The attributes of a node depend on those of its parent or siblings.
• We call those inherited attributes.
• How do we handle non-local information?
– Use copy rules to "transfer" information to other parts of the tree.
Attribute Grammars

Production      Semantic rule
E → E1 + E2     E.val = E1.val + E2.val
E → num         E.val = num.yylval
E → (E1)        E.val = E1.val

[Figure: attribute-dependence graph over the parse tree of 2 + (7 + 3);
val attributes flow upward: num 7 and num 3 give the inner E.val = 10,
which combines with num 2 to give E.val = 12 at the root.]
Attribute Grammars
• We can use an attribute grammar to construct an AST
• The attribute for each non-terminal is a node of the tree.
• Example:

Production      Semantic rule
E → E1 + E2     E.node = new PlusNode(E1.node, E2.node)
E → num         E.node = num.yylval
E → (E1)        E.node = E1.node

• Notes:
– yylval is assumed to be a node (leaf) created during scanning.
– The production E → (E1) does not create a new node, as it is not
needed; the parentheses only group.
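As an aside (not from the slides), a minimal Java sketch of what those
node attributes could look like; the class names Node, NumNode and
PlusNode are illustrative, not from a real library:

    abstract class Node {}

    class NumNode extends Node {
        final int value;
        NumNode(int value) { this.value = value; }   // leaf created during scanning
        public String toString() { return Integer.toString(value); }
    }

    class PlusNode extends Node {
        final Node left, right;
        PlusNode(Node left, Node right) { this.left = left; this.right = right; }
        public String toString() { return "(" + left + " + " + right + ")"; }
    }

    class BuildAst {
        public static void main(String[] args) {
            // Simulating the semantic rules for 2 + (7 + 3):
            Node inner = new PlusNode(new NumNode(7), new NumNode(3)); // E -> E1+E2
            Node paren = inner;                                        // E -> (E1)
            Node root  = new PlusNode(new NumNode(2), paren);          // E -> E1+E2
            System.out.println(root);  // prints (2 + (7 + 3))
        }
    }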
Evaluating Attributes
• Evaluation Method 1: Dynamic, dependence-based
• At compile time:
– Build the dependence graph
– Topologically sort the dependence graph
– Evaluate attributes in topological order
• This can only work when attribute dependencies are not
circular.
– It is possible to test for that.
– Circular dependencies show up in data flow analysis
(optimization) or may appear due to features such as goto
Evaluating attributes
• Other evaluation methods
– Method 2: Oblivious
• Ignore rules and parse tree
• Determine an order at design time
– Method 3: Static, rule-based
• At compiler construction time
• Analyze rules
• Determine ordering based on grammatical structure (parse tree)
Attribute grammars
• We are interested in two kinds of attribute grammars:
– S-attributed grammars
• All attributes are synthesized (flow up)
– L-attributed grammars
• Attributes may be synthesized or inherited, AND
• Inherited attributes of a non-terminal only depend on the parent or
the siblings to the left of that non-terminal.
– This way it is easy to evaluate the attributes by doing a depth-first
traversal of the parse tree.
• Idea (useful for rule-based evaluation)
– Embed the semantic actions within the productions to impose an
evaluation order.
Embedding Rules in Productions
• Synthesized attributes depend on the children of a nonterminal,
so they should be evaluated after the children have been parsed.
• Inherited attributes that depend on the left siblings of a nonterminal
should be evaluated right after the siblings have been parsed.
• Inherited attributes that depend on the parent of a nonterminal
are typically passed along through copy rules (more later).
• In the grammar below, L.in is inherited and evaluated after
parsing T but before L; T.type is synthesized and evaluated
after parsing int:

D → T {L.in = T.type} L
T → int {T.type = integer}
T → char {T.type = character}
L → {L1.in = L.in} L1, id {L.action = addtype(id.index, L.in)}
L → id {L.action = addtype(id.index, L.in)}
Rule Evaluation in Top-Down Parsing
• Recall that a predictive parser is implemented as follows:
– There is a routine to recognize each rhs. It contains calls to
routines that recognize the non-terminals or match the terminals on
the rhs of a production.
– We can pass the attributes as parameters (for inherited) or return
values (for synthesized).
– Example:
D → T {L.in = T.type} L
T → int {T.type = integer}
• The routine for T will return the value T.type
• The routine for L will have a parameter L.in
• The routine for D will call T(), get its value, and pass it into
L()
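As an aside (not from the slides), a minimal Java sketch of that
parameter-passing scheme for D → T L with the left recursion in L
replaced by a loop; the token list and addtype are simplified stand-ins:

    import java.util.*;

    class DeclParser {
        private final List<String> tokens;   // e.g. ["int", "a", ",", "b"]
        private int pos = 0;
        DeclParser(List<String> tokens) { this.tokens = tokens; }

        void D() {
            String type = T();   // synthesized T.type returned by T()
            L(type);             // inherited L.in passed as a parameter
        }

        String T() {             // returns T.type
            return tokens.get(pos++);   // "int" or "char"
        }

        void L(String in) {      // parameter is L.in
            addtype(tokens.get(pos++), in);
            while (pos < tokens.size() && tokens.get(pos).equals(",")) {
                pos++;           // skip ','
                addtype(tokens.get(pos++), in);
            }
        }

        void addtype(String id, String type) {
            System.out.println(id + " : " + type);   // would update symbol table
        }

        public static void main(String[] args) {
            new DeclParser(List.of("int", "a", ",", "b")).D();  // a : int, b : int
        }
    }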
Rule Evaluation in Bottom-Up Parsing
• S-attributed grammars
– All attributes are synthesized
– Rules can be evaluated bottom-up
• Keep the values in the stack
• Whenever a reduction is made, pop the corresponding attributes,
compute the new ones, push them onto the stack
• Example: Implement a desk calculator using an LR parser.
– Grammar:

Production      Semantic rule
L → E \n        print(E.val)
E → E1 + T      E.val = E1.val + T.val
E → T           E.val = T.val
T → T1 * F      T.val = T1.val * F.val
T → F           T.val = F.val
F → (E)         F.val = E.val
F → digit       F.val = yylval
Rule Evaluation in Bottom-Up Parsing

Production      Semantic rule             Stack operation
L → E \n        print(E.val)
E → E1 + T      E.val = E1.val + T.val    val[newtop] = val[top-2] + val[top]
E → T           E.val = T.val
T → T1 * F      T.val = T1.val * F.val    val[newtop] = val[top-2] * val[top]
T → F           T.val = F.val
F → (E)         F.val = E.val             val[newtop] = val[top-1]
F → digit       F.val = yylval
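As an aside (not from the slides), a minimal Java sketch of keeping the
synthesized values in a value stack. In a real LR parser the '+' and '*'
tokens also occupy stack slots (hence the top-2 offsets above); here the
reductions for 3 + 4 * 2 are hand-simulated on values only:

    import java.util.*;

    class ValueStack {
        public static void main(String[] args) {
            Deque<Integer> val = new ArrayDeque<>();
            val.push(3);                       // F -> digit, ..., E.val = 3
            val.push(4);                       // F -> digit, T.val = 4
            val.push(2);                       // F -> digit
            int f = val.pop(), t = val.pop();  // reduce T -> T1 * F
            val.push(t * f);                   // T.val = T1.val * F.val
            int tv = val.pop(), e = val.pop(); // reduce E -> E1 + T
            val.push(e + tv);                  // E.val = E1.val + T.val
            System.out.println(val.peek());    // prints 11
        }
    }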
Attribute Grammars
• Attribute grammars have several problems
– Non-local information needs to be explicitly passed down with copy
rules, which makes the process more complex
– In practice there are large numbers of attributes, and often the
attributes themselves are large. Storage management then becomes an
important issue.
– The compiler must traverse the attribute tree whenever it needs
information (e.g. during a later pass)
• However, our discussion of rule evaluation gives us an idea for a
simplified (albeit limited) approach:
– Have actions organized around the structure of the grammar
– Constrain attribute flow to one direction.
– Allow only one attribute per grammar symbol.
In Practice: Yacc/Bison
• In Yacc/Bison, $$ is used for the lhs non-terminal; $1, $2, $3, ...
are used for the symbols on the rhs (in left-to-right order)
• Example:
– Expr : Expr TPLUS Expr
{ $$ = $1 + $3; }
• Example:
– Expr : Expr TPLUS Expr
{ $$ = new ExprNode($1, $3); }
Compiler Design
15. ANTLR, ANTLRWorks
Lexer and Parser Generator
Kanat Bolazar
March 11, 2010
ANTLR
• ANTLR is a popular lexer and parser generator in
Java.
• It allows LL(*) grammars; it does top-down parsing.
• Similarities with LL(1) grammars:
– Does top-down parsing
– Grammar has to be fixed to remove left recursion
– Uses lookahead tokens to decide which path to take
– You can think of it as recursive-descent parsing.
• Differences:
– How far we can look ahead is not constrained
– CommonTokenStream defines LA(k) and LT(k):
• Both look ahead to the k-th next token
ANTLRWorks
• ANTLRWorks is the ANTLR IDE (integrated development
environment)
• It has many nice features:
– Automatically fills in common token definitions
– Has standard IDE features like syntax highlighting
– Shows the regexp FSM (lexer machine) for tokens
– Has a very nice debugger which can show:
• input and output
• parse tree and AST (abstract syntax tree)
• call (rule) stack and events
• grammar rule that is being executed
Running ANTLR: Inputs, Steps
• You need three files before you run ANTLR:
– a grammar file, Xyz.g (Microjava.g)
– a Java test runner, Test.java
– a test input file, such as sample.mj
• There are three steps to running ANTLR:
– antlr: Generate the lexer and parser classes:
• XyzLexer.java
• XyzParser.java
– javac: Compile these two and Test.java:
• XyzLexer.class, XyzParser.class
• Test.class
– java: Run Test on a test input file
Step 1. ANTLR
• You may have an antlr executable:
antlr Xyz.g
• Make sure you save a "grammar Xyz" in file Xyz.g
• If you only have a JAR file instead, use:
java -jar antlr-3.2.jar Xyz.g
• This creates two Java class source code files:
XyzLexer.java
XyzParser.java
• By default, these files go in current directory
• You can instead state where *.java should go:
antlr -o src Xyz.g
Step 2. Compile with javac
• To the lexer and parser, you need to add your runner:
Test.java
• See ANTLR examples online for runner examples.
• Before javac, set the CLASSPATH environment variable to
include:
.                       (the current directory)
antlrworks-1.3.1.jar
• In Linux/Unix, under bash, you may do:
export CLASSPATH=.:antlrworks-1.3.1.jar
• Unlike this example, give the full path to the antlrworks
JAR file.
Step 3. Run with java
• Again, set the CLASSPATH environment variable as before
• Go under src if needed (if you used the -o option)
• Run your test, giving your input file:
java Test < input.txt
java Test < input.txt > output.txt
java Microjava < sample.mj
• A grammar with no evaluation:
– will be quiet if everything is OK
– will only give syntax errors if the input is not good
• A grammar with output will display the output.
• ANTLR doesn't allow running interactively.
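As an aside (not from the slides), a minimal hypothetical Test.java for
the ANTLR 3 runtime these slides use; XyzLexer and XyzParser are the
generated classes, and "program" is assumed to be the start rule of Xyz.g:

    import org.antlr.runtime.*;

    public class Test {
        public static void main(String[] args) throws Exception {
            CharStream input = new ANTLRInputStream(System.in); // java Test < input.txt
            XyzLexer lexer = new XyzLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            XyzParser parser = new XyzParser(tokens);
            parser.program();   // call the start rule; quiet if the input is OK
        }
    }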
ANTLRWorks, Other Java IDEs
• Instead of these steps, you can use ANTLRWorks.
• To run under ANTLRWorks, just use its debugger.
• It has ANTLR inside, and knows how to set the
CLASSPATH for compiling and running.
• *.java files produced by ANTLRWorks will be
different, as they contain debugger commands.
• To run ANTLR under a Java IDE, you may be able
to define custom build rules for *.g files.
• You should add the antlrworks JAR file to your
project, to have the ANTLR runtime libraries.
• Make sure the libraries are used during both
compilation and running.
Next Steps
• We will next see:
– A demonstration of using ANTLR (three steps)
– ANTLRWorks screenshots
• We will also look at some grammar examples:
– Calculator without evaluation
– Calculator with evaluation
– Calculator with AST
– MicroJava lexer
– Starting steps for the MicroJava parser
Compiler Design
16. Type Checking
Kanat Bolazar
March 23, 2010
Type Checking
• The general topic of type checking includes two parts:
– Type synthesis – assigning a type to each expression in the
language
– Type checking – making sure that these types are used in contexts
where they are legal, catching type-related errors
• Strongly typed languages are ones in which every
expression can be assigned an unambiguous type
– Weakly typed languages can have run-time errors due to type
incompatibility
• Statically typed languages vs. dynamically typed languages
– Capable of being checked at compile time vs. not
• Static type checking vs. dynamic type checking
– Actually checking types at compile time vs. at run time
Dynamic Typing Example: Duck Typing
• Rule: "If it walks like a duck and quacks like a duck, call it a duck"
• No need to declare ahead of time as a subtype of duck
• Just define the operations. Python example:

class Person:
    def quack(self):
        print "The person imitates a duck."

def in_the_forest(duck):
    duck.quack()

def game():
    # (slide truncated here; a plausible completion:)
    in_the_forest(Person())   # works: Person also quacks
Base Types
• Numbers
– integer
• C specifies length in relative terms, short and long; OS and
machine-dependent; makes porting programs to other OS and
machine harder.
• Java specifies specific lengths: byte (8), short (16), int (32)
and long (64) bits respectively.
– floating point numbers
• Many languages have two sizes
• Can use IEEE representation standards
• Java float and double are 32 and 64 bits
• Characters
– Single letter, digit, symbol, etc; used to be 8 bit ASCII standard
– Now can also be a 16 bit Unicode
• Booleans
Java Example: Strings are Objects
• Some languages have strings as a base type with catenation
operators
• In Java, strings are objects of class String:
– "this".length()    returns 4
• In Java, variables of an object type hold a reference only:
– String a = "this";
– String b = a; // reference to same string object, ref count = 2
– if (a == b) ... // reference comparison: returns true
• This last check is not how you want to check String comparison;
the references may differ while the values are equal:
– if (a.equals(b)) ... // true if a and b have the same string value
– if (a == b) ...      // true only if a and b point to the same object in the heap
Compound Types: Arrays and Strings
• Arrays – aggregate of values of the same type
– Arrays have a base type for elements, and may have an indexing
range for each dimension:
int a[100][25];    in C
– If the indexing range is known, then the compiler can
compute space allocation; otherwise indexing is relative and
space is allocated by a run-time allocator
– The main operation is indexing; some languages allow whole-array
operations
• Strings – sequence of characters
– Also can have bit strings
– C treats strings as arrays of characters
– Strings can have comparison operators that use lexicographic
ordering
More Compound Types
• Records or Structures – components may have different
types and may be indexed by names
struct { double r; int i; }
– Representation as ordered product of elements
• Variant records or Unions – a component may be one of
a choice of types
union { double r; int i; }
– Representation can save space for the largest element and may
have a tag to indicate which type the value is
– Take care not to have run-time (value) errors
Other Types
• Enumerated types – the programmer can create a type
name for a specific set of constant values
enum WeekDay {Sunday, Monday, ..., Saturday}
– Represented as for a small set
• Pointers – abstraction of addresses
– Can create a reference to, or dereference, an object
– Distinguish between "pointer to integer" and "pointer to
boolean", etc.
– C allows arithmetic on pointers
• Void
• Classes – may or may not create new types
– Classes can be represented by an extended type of record
Function Types, New Type Names
• Procedure and function types are sometimes called signatures
– Give number and types of parameters
• May include parameter-passing information such as by value or by
reference
– Give type of result (or indicate no result)
strlength : String -> unsigned int
• Type declarations or type definitions allow the programmer to
assign new type names to a type expression
– These names should also be stored in the symbol table
– Will have scope and other attributes
– Definitions may be recursive in some languages
• If so, size of object will be unknown
Representing Types
• Types can be represented as expressions, or quite often as
trees:
[Figure: an array type drawn as a tree, array(9) over an element type;
a function type drawn as a tree with an arglist of argument types
(arg1 type ... argn type) and a result type.]
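As an aside (not from the slides), a minimal Java sketch of type trees;
the class names BaseType, ArrayType and FunctionType are illustrative,
not from any particular compiler:

    import java.util.*;

    abstract class Type {}

    class BaseType extends Type {
        final String name;                        // "int", "char", "boolean", ...
        BaseType(String name) { this.name = name; }
        public String toString() { return name; }
    }

    class ArrayType extends Type {
        final Type element; final int size;       // e.g. array(9) of int
        ArrayType(Type element, int size) { this.element = element; this.size = size; }
        public String toString() { return "array(" + size + ") of " + element; }
    }

    class FunctionType extends Type {
        final List<Type> args; final Type result;  // signature: args -> result
        FunctionType(List<Type> args, Type result) { this.args = args; this.result = result; }
        public String toString() { return args + " -> " + result; }
    }

    class Types {
        public static void main(String[] args) {
            Type strlength = new FunctionType(
                List.of(new BaseType("String")), new BaseType("unsigned int"));
            System.out.println(strlength);  // [String] -> unsigned int
            System.out.println(new ArrayType(new BaseType("int"), 9));
        }
    }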
Type Equivalence
• Name equivalence – in this kind of equivalence, two types must
have the same name
– Assumes that the programmer will introduce new names exactly when
they want types to be different:
if you say t1 = int and t2 = int, t1 and t2 are different
• Structural equivalence – two objects are interchangeable if their
types have the same fields with equivalent types:
int x[10][10] and int y[10][10]
x and y have equivalent types, the 10 by 10 arrays
– More complex situations arise in structural equivalence of other
compound types, e.g. there may be mutually recursive type definitions
• Type checking rules extend this to a notion of type compatibility
Type Synthesis and Type Checking
• Assigning types to language expressions (type synthesis) can be
done by a traversal of the abstract syntax tree
• At each node, a type checking rule will say which types are
allowed
• Description here is for languages with type declarations
– Constants are assigned a type
• If there is not a known type, then there will be a set of possibles
– Variables are looked up in the symbol table
• Note that we are assuming an L-attributed grammar so that
declarations are processed first
– Assignment
• The type of the assignable entity on the left must be typeEqual
to the type of the expression on the right
Types for Expressions
• Arithmetic and other operators have result types defined in
terms of the types of the subnodes in the tree
• Statements have substructures that need to be checked for type
correctness
– The condition of if and while statements must have type boolean
• Array reference
– Suppose we have exp1 -> exp2[exp3]; then an ad hoc SDT:
if isArrayType(exp2.type) and typeEqual(exp3.type, integer)
then exp1.type = exp2.type.basetype // get the basetype child
else type-error(exp1)
• Function calls have similar rules to check the signature of the
function name against the parameters
Issues for typeEqual
• This is sometimes called type compatibility
• Overloading
– Arithmetic operators:
2 + 3 means integer addition
2.0 + 3.0 means floating pt addition
– The language may have an arithmetic operator table, telling which type
of operator will be used based on the types of the expressions:

a       b       a+b
int     int     int
int     float   float
int     double  double
float   float   float
float   double  double
double  double  double
Overloading Functions
• Can declare the same function (or method) name with different
numbers and types of parameters:
int max(int x, y)
double max(double x, y)
– Java and C++ allow such overloaded declarations
• Need to augment the symbol table functionality to allow the
name to have multiple signatures
– The lookup procedure is always given a name to look up – we can add a
typelist argument, and the lookup can tell us if there is a function
declared with that signature
– Or the lookup procedure is given only the name to look up – and in the
case of a method, it can return the set of allowable signatures
Conversion and Coercion
• The typeEqual comparison is commonly extended to allow
arithmetic expressions of mixed type and other cases of
types which are compatible, but not equal
– If mixed types are allowed in an arithmetic expression, then a
conversion should be inserted (into the AST or the code):
2 * 3.14
becomes code like
t1 = float(2)
t2 = t1 * 3.14
• Conversion from one type to another is said to be implicit if it is
done automatically by the compiler; this is also called coercion
• Conversion is said to be explicit if the programmer must write
the conversion
– Called casts in the C and Java languages
Widening and Narrowing
• The rules for Java type conversion distinguish between
– Widening conversions, which preserve information
– Narrowing conversions, which can lose information
• Conversions between primitive types in Java
(there is also widening and narrowing for references, i.e. objects):
[Figure: widening follows byte → short → int → long → float → double,
with char → int; narrowing (usually with a cast) goes in the reverse
direction, e.g. double → float → long → int → char/short → byte.]
Generating Type Conversions
• For an arithmetic expression e1 → e2 op e3, the
algorithm can generally use widening, possibly to a third
type that is greater than both the types of e2 and e3 in the
widening tree:
– Let the new type of e1 be the max of e2.type and e3.type
– Generate a widening conversion of e2 if necessary
– Generate a widening conversion of e3 if necessary
– Set the type (and later the code) of e1
• Type conversions and coercions also apply to assignment:
if r is a double and i is an int: allow r = i;
– C also allows i = r, with the corresponding loss of information
– For classes, we have the subtype principle: objects of subclasses
may be assigned to variables of a superclass
Continuing Type Synthesis and Checking Rules
• Expressions: the two expressions involved with boolean
operators, such as &&, must both be boolean
• Functions: the type of each actual parameter must be typeEqual
to its formal parameter
• Classes:
– if specified, the parent of the class must be a properly declared class
– If a class says that it implements an interface, then all methods of the
interface must be implemented
Polymorphic Typing
• A language is polymorphic if language constructs can have more than one
type:
procedure swap(anytype x, y)
where anytype is considered to be a type variable
• Polymorphic functions have type patterns or type schemes, instead of
actual type expressions
• The type checker must check that the types of the actual parameters fit the
pattern
– Technically, the type checker must find a substitution of actual types for type
variables that satisfies the type equivalence between the formal type pattern and
the actual type expression
– In complex cases with recursion, it may need to do unification to solve the
substitution problem
• Most notably in the language ML
References
• Original slides by Nancy McCracken.
• Keith Cooper and Linda Torczon, Engineering a Compiler,
Elsevier, 2004.
• Kenneth C. Louden, Compiler Construction: Principles and
Practice, PWS Publishing, 1997.
• Aho, Lam, Sethi, and Ullman, Compilers: Principles,
Techniques, and Tools. Addison-Wesley, 2006. (The purple
dragon book)
• Charles Fischer and Richard LeBlanc, Jr., Crafting a Compiler
with C, Benjamin Cummings, 1991.
Compiler Design
18. Object Oriented Semantic Analysis
(Symbol Tables, Type Checking)
Kanat Bolazar
March 30, 2010
Object-Oriented
Symbol Tables and Type Checking
• In object-oriented languages, scope changes can occur
at a finer-grained level than just block or procedure
definition
– Each class has its own scope
– Variables may be declared at any time
• Notation to represent symbol tables as a combination of
environments
An environment maps an identifier to its symbol table entry, which
we only give the type here for brevity:
e1 = { g → string, a → int }
– We will indicate adding to the symbol table with a +, but this
addition will carry the meaning of scope (from right to left):
e2 = e1 + { a → float }
Example Scope in Java
(environment e0 already given for predefined identifiers)

1  class C {
2      int a, b, c;                    e1 = e0 + { a → int, b → int, c → int }
3      public void m ( ) {
4          System.out.println(a + c);
5          int j = a + b;              e2 = e1 + { j → int }
6          String a = "hello";         e3 = e2 + { a → string }
7          System.out.println(a);
8          System.out.println(j);
9          System.out.println(b);
10     }                               back to e1
11 }                                   back to e0

Note: All examples in these slides are from Andrew Appel,
"Modern Compiler Implementation in Java" (available
online through SU library)
Each Class Has An Environment
• There may be several active environments at once
(multiple symbol tables)
• Class names are mapped to environments
(each one added to environment e0):

package M;
class E { static int a = 5; }           e1 = { a → int }
                                        e2 = { E → e1 }
class N { static int b = 10;
          static int a = E.a + b; }     e3 = { b → int, a → int }
                                        e4 = { N → e3 }
class D { static int d = E.a + N.a; }   e5 = { d → int }
                                        e6 = { D → e5 }
e7 = e2 + e4 + e6
Classes E, N and D are all compiled in environment e7: M → e7
Symbol Table
• Each variable and formal parameter name has a type
• Each method name has its signature
• Each class name has its variable and method declarations

class B {
    C f;
    int[] j;
    int q;
    public int start (int p, int q) {
        int ret, a;
        /* . . . */
        return ret;
    }
    public boolean stop (int p) {
        /* . . . */
        return false;
    }
}

[Figure: symbol table for class B — Fields: f : C, j : int[], q : int;
Methods: start : int with Params p : int, q : int and Locals ret : int,
a : int; stop : bool with Params p : int and no Locals.]
Typechecking Rules
• Additional rules include
– The new keyword: C e = new C ( )
• Gives type C to e (as usual)
– Method calls of the form e.m ( <paramlist>)
• Suppose e has type C
• Look up definition of m in class C
• Appel recommends the two-pass semantic analysis strategy
– First pass adds identifiers to the symbol table
– Second pass looks up identifiers and does the typechecking
Example of Inheritance
• Note the variables in scope of the await definition: passengers, pos, v, this
• In c.await(t), in the body of await, v.move will be the move method from
Truck

class Vehicle {
    int pos; // position of Vehicle
    void move (int x) { pos += x; }
}
class Car extends Vehicle {
    int passengers;
    void await(Vehicle v) {
        // if ahead, ask other to catch up
        if (v.pos < pos) v.move(pos - v.pos);
        // if behind, catch up with +10 moves
        else this.move(10);
    }
}
class Truck extends Vehicle {
    void move(int x) { // max move: +55
        if (x <= 55) pos += x;
    }
}
class Main {
    public static void main(...) {
        Truck t = new Truck();
        Car c = new Car();
        Vehicle v = c;
        c.passengers = 2;
        c.move(60);
        v.move(70);
        c.await(t);
    }
}
Single Inheritance of Data Fields (Locations)
• To generate code for v.position, the compiler must generate
code to fetch the field position from the object that v points
to
– v may actually be a Car or Truck
• Prefixing fields:
– When B extends A, the fields of B that are inherited from A are
laid out in a record at the beginning, in the order they appear.
All children of A have field a as field[0]
– Fields not inherited are laid out in order afterwards

class A { int a = 0; }
class B extends A { int b = 0; int c = 0; }
class C extends A { int d = 0; }
class D extends B { int e = 0; }

[Figure: record layouts — A: (a); B: (a, b, c); C: (a, d); D: (a, b, c, e).]
Single Inheritance for Methods (Locations)
• A method instance is compiled into code that resides at a
particular address. In the semantic analysis phase:
– Each variable's symbol table entry has a pointer to its class
descriptor
– Each class descriptor contains a pointer to its parent class and a
list of method instances
– Each method instance has a location
• Static methods: method call of the form c.f ( )
– the code for a method declared as static depends on the type of
the variable c, and not the type of the object that c holds
– Get the class of c, call it C
• in Java syntax, the method call is C.f ( ), making this clear
– Search class C for method f
– If not found, search the parent for f, then its parent, and so on
Single Inheritance for Dynamic Methods
• Dynamic method lookup needs class descriptors
– a method may be overridden in a subclass
• To execute c.f(), the compiled code must execute these instructions:
– Fetch the class descriptor d from object c, at offset 0
– Fetch the method-instance pointer p from the f offset of d
– Call p
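As an aside (not from the slides), a minimal Java sketch modeling class
descriptors for dynamic method lookup: each descriptor holds a parent
pointer and a method table, and a call searches the receiver's descriptor
chain. The names are illustrative only:

    import java.util.*;

    class Descriptor {
        final String name;
        final Descriptor parent;                      // superclass, or null
        final Map<String, Runnable> methods = new HashMap<>();
        Descriptor(String name, Descriptor parent) { this.name = name; this.parent = parent; }

        // Dynamic lookup: search this class, then its ancestors.
        Runnable lookup(String method) {
            for (Descriptor d = this; d != null; d = d.parent) {
                Runnable m = d.methods.get(method);
                if (m != null) return m;
            }
            throw new RuntimeException("no such method: " + method);
        }
    }

    class Dispatch {
        public static void main(String[] args) {
            Descriptor vehicle = new Descriptor("Vehicle", null);
            vehicle.methods.put("move", () -> System.out.println("Vehicle.move"));
            Descriptor truck = new Descriptor("Truck", vehicle);
            truck.methods.put("move", () -> System.out.println("Truck.move")); // override

            // v.move() on an object whose descriptor is Truck runs Truck.move:
            truck.lookup("move").run();     // Truck.move
            vehicle.lookup("move").run();   // Vehicle.move
        }
    }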
Instances of A, B, C and D

class A { int x = 0; int f () { ... } }
class B extends A { int g () { ... } }
class C extends B { int g () { ... } }
class D extends C { int y = 0; int f () { ... } }

[Figure: each object starts with a descriptor pointer and field x (D also
has y); the descriptors hold method instances — A: A_f; B: A_f, B_g;
C: A_f, C_g; D: D_f, C_g.]
Notation: A_f is an instance of method f declared in class A
Multiple Inheritance
• In languages that permit multiple inheritance, a class
can extend several different parent classes
– Cannot put all fields of all parents in every class
• Analyze all classes to assign one offset location for
every field name, so that it can be used in every record with
that field
– Use a graph coloring algorithm; this still leads to large
numbers of offsets, with sparse use of offset numbers

class A { int a = 0; }
class B { int b = 0; int c = 0; }
class C { int d = 0; }
class D extends A, B, C { int e = 0; }

[Figure: global offsets a=0, b=1, c=2, d=3 — A: (a); B: (-, b, c);
C: (-, -, -, d); D: (a, b, c, d, e).]
Multiple Inheritance Solutions
• After graph coloring, assign offset locations and give a sparse
representation that keeps track of which fields are in each record
– Leads to another level of indirection in accessing fields and methods:
• Fetch the class descriptor d from object c
• Fetch the field-offset value from the descriptor
• Fetch the method or data from the appropriate offset of d
• The coloring of fields is done at link time; there can still be
problems with dynamic linking, where a new class can be
loaded at run-time
– Solved with hash tables of field names and access algorithms with
additional overhead
Type Coercions
• Given a variable c of type C, it is always legal to treat c as if it
were any supertype of c
– If C extends B, and b has type B, then assignment “b = c;” is safe
• Reverse is not true. Assignment “c = b;” is safe only if b is
really (at run-time) an instance of C.
– Safe object-oriented languages (Modula-3 and Java) will add to any
coercion from a superclass to a subclass, a run-time typecheck that
raises an exception unless b really is an instance of C.
– C++ is unsafe in this respect. It allows a static cast mechanism. The
dynamic cast mechanism does add the run-time checking.
Private Fields and Methods
• In the symbol table for every class C, for all the fields and
methods, keep a flag to indicate whether that method or field is
private.
– When type checking
c.v or c.f ( )
the compiler will check the symbol table flag and must use the context
(i.e. whether the compiler is inside the declaration of the object) to
decide whether to allow the access
– This is another example of inherited attributes for the typechecking
system
• Additional flags can be kept for access at the subclass or
package level
– And additional context must be kept by the typechecking algorithm
Main Reference
• Slides were prepared by Nancy McCracken, using examples
from:
• Andrew Appel, Modern Compiler Implementation in Java,
second edition, Cambridge University Press, 2002.
– Available at SU library as an online resource, viewable when you log in
to the SU library.
– You can read the whole book, chapter by chapter, but not download it as
a PDF.
Compiler Design
20. ANTLR AST (Abstract Syntax Tree)
Kanat Bolazar
April 6, 2010
ANTLR AST (Abstract Syntax Tree)
Generation
• ANTLR allows creation and manipulation of ASTs
• You start with these options in your grammar:
options {
output = AST;
ASTLabelType = CommonTree;
}
• If you skip the second line, the generated code treats tree nodes as
plain Objects, and you may have problems later.
• Instead of CommonTree, where tokens are the nodes, you can
create and use your own tree structure.
• If you do, you also have to tell ANTLR how to convert a token
to a node in your tree structure.
Imaginary Tokens
• CommonTree only allows Tokens as tree nodes
• You may want to use nodes that never appear in your input
stream as tokens
• Declare "imaginary tokens" like this at top of your
grammar:
tokens {
// Not needed here (because tokens exist):
// CLASS -- use 'class' instead, one of our keywords (a token)
// PROGRAM -- use 'program' instead
// CONSTANT -- use 'final' instead
VAR; // variable declaration (including arguments, fields, globals)
TYP; // simple type such as int char void or class name
ARRAY; // array type
}
Default AST Output: Flat List
• By default, AST output will be a flat list of all tokens
• Parse tree will be ignored; nonterminals of the grammar
can't appear in the AST tree:
program : 'program' ID
decl*
'{' methodDecl * '}'
;
• Tokens 'program', ID, '{', '}' can be used as AST nodes
• Nonterminals decl and methodDecl expand to their tokens:
(nil 'program' 'P' 'int' 'a' ';' '{' 'void' 'main' '(' ')' '{' '}' '}')
Rewrite Rules
• Rewrite rules allow you to define your tree nodes per
grammar rule, and for each alternative
• For any occurrence of a nonterminal (such as decl below), an
implicit list of all decl nodes is created and can be used:
• Whatever is ignored in the rewrite rule is removed (ID, '{'
and '}' below):
program : 'program' ID
decl*
'{' methodDecl * '}' -> ^('program' decl+ methodDecl+)
;
• We don't have a flat list anymore:
('program' ('int' 'a') ('int' 'b') ('void' 'main'))
Inlined Rules
• Inlined rules are very useful in expressions
^ Make this token the root
! Ignore this token
• This works very well in nested expressions.
• We'll see some examples with the calculator AST grammar:
expr: multExpr '+'^ multExpr;
// looping version keeps creating new parent nodes
expr: multExpr (('+'^|'-'^) multExpr)*
;
• The second, looping version creates, for a + b - 1:
('-' ('+' 'a' 'b') '1')
• Equivalent rewrite rule (the parenthesized first alternative is needed
so that $expr refers to the tree built so far):
expr: (multExpr -> multExpr)
      ( '+' m=multExpr -> ^('+' $expr $m)
      | '-' m=multExpr -> ^('-' $expr $m)
      )*
;
Multiple Subtrees vs. Lists

Note the difference in using implicit lists here:
//
^(VAR type ID)+
// generates many VAR nodes, one for each ID:
//
(VAR int x) (VAR int y) (VAR int z)
// whereas:
//
^(VAR ID+)
// would generate one VAR node, with all IDs as children:
//
(VAR int x y z)
// Any ID seen in the rule is added to an implicit
// list of IDs, as used here.
varDecl : type ID ( ',' ID )* ';'
-> ^(VAR type ID)+;
• Similar but different (pairwise matched):
formPars: type ID ( ',' type ID )*
-> ^(VARLIST ^(VAR type ID)+)
;
Compiler Design
21. Intermediate Code Generation
Kanat Bolazar
April 8, 2010
Intermediate Code Generation
• Forms of intermediate code vary from high level ...
– Annotated abstract syntax trees
– Directed acyclic graphs (common subexpressions are
coalesced)
• ... to the low level Three Address Code
– Each instruction has, at most, one binary operation
– More abstract than machine instructions
• No explicit memory allocation
• No specific hardware architecture assumptions
– Lower level than syntax trees
• Control structures are spelled out in terms of instruction
jumps
– Suitable for many types of code optimization
Three Address Code
• Consists of a sequence of instructions, each instruction
may have up to three addresses, prototypically
t1 = t2 op t3
• Addresses may be one of:
– A name. Each name is a symbol table index. For
convenience, we write the names as the identifier.
– A constant.
– A compiler-generated temporary. Each time a temporary
address is needed, the compiler generates another name from
the stream t1, t2, t3, etc.
• Temporary names allow for code optimization to easily
move instructions
• At target-code generation time, these names will be mapped to
actual registers and memory locations
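For example, the statement d = a + b * c (reused in the bytecode slides later) could be translated, drawing temporaries from the stream t1, t2, …, as:

t1 = b * c
t2 = a + t1
d = t2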
Three Address Code Instructions
• Symbolic labels will be used as instruction addresses for
instructions that alter the flow of control. The instruction
addresses of labels will be filled in later.
L: t1 = t2 op t3
• Assignment instructions: x = y op z
– Includes binary arithmetic and logical operations
• Unary assignments:
x = op y
– Includes unary arithmetic op (-) and logical op (!) and type conversion
• Copy instructions: x = y
– These may be optimized away later.
Three Address Code Instructions
• Unconditional jump: goto L
– L is a symbolic label of an instruction
• Conditional jumps:
if x goto L and
ifFalse x goto L
– Left: If x is true, execute instruction L next
– Right: If x is false, execute instruction L next
• Conditional jumps:
if x relop y goto L
• Procedure calls. For a procedure call p(x1, …, xn):
param x1
…
param xn
call p, n
Three Address Code Instructions
• Indexed copy instructions: x = y[i] and x[i] = y
– Left: sets x to the value in the location [i memory units beyond y]
(in C)
– Right: sets the contents of the location [i memory units beyond y]
to x
• Address and pointer instructions:
– x = &y sets the value of x to be the location (address) of y.
– x = *y, presumably y is a pointer or temporary whose value is a
location. The value of x is set to the contents of that location.
– *x = y sets the value of the object pointed to by x to the value of y.
• In Java, all object variables store references (pointers), and
Strings and arrays are implicit objects:
– Object o = "some string object" sets the reference o to hold the
address of the string object
Three Address Code Representation
• Representations include quadruples (used here), triples and
indirect triples.
• In the quadruple representation, there are four fields for each
instruction: op, arg1, arg2 and result.
– Binary ops have the obvious representation
– Unary ops don’t use arg2
– Operators like param don’t use either arg2 or result
– Jumps put the target label into result
Syntax-Directed Translation of Intermediate
Code
• Incremental Translation
– Instead of using an attribute to keep the generated code, we assume that
we can generate instructions into a stream of instructions
• gen(<three address instruction>) generates an instruction
• new Temp() generates a new temporary
• lookup(top, id) returns the symbol table entry for id at the topmost
(innermost) lexical level
• newlabel() generates a new abstract label name
Translation of Expressions
• Uses the attribute addr to keep the addr of the instruction for that
nonterminal symbol.
S → id = E ;     gen(lookup(top, id.text) = E.addr)
E → E1 + E2      E.addr = new Temp()
                 gen(E.addr = E1.addr plus E2.addr)
  | - E1         E.addr = new Temp()
                 gen(E.addr = minus E1.addr)
  | ( E1 )       E.addr = E1.addr
  | id           E.addr = lookup(top, id.text)
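A minimal Java sketch of this incremental scheme (the helper names follow the slide; the string-based instruction representation is an assumption for illustration):

import java.util.ArrayList;
import java.util.List;

class TacEmitter {
    private final List<String> code = new ArrayList<>();
    private int tempCount = 0;

    String newTemp() { return "t" + (++tempCount); }   // t1, t2, t3, ...
    void gen(String instr) { code.add(instr); }        // append to the instruction stream

    // E -> E1 + E2: subexpression code is already emitted; add one instruction.
    String addExpr(String e1Addr, String e2Addr) {
        String addr = newTemp();
        gen(addr + " = " + e1Addr + " + " + e2Addr);
        return addr;
    }

    // E -> - E1
    String negExpr(String e1Addr) {
        String addr = newTemp();
        gen(addr + " = minus " + e1Addr);
        return addr;
    }
}

// Usage for a = b + (-c):
//   String t1 = e.negExpr("c");      // t1 = minus c
//   String t2 = e.addExpr("b", t1);  // t2 = b + t1
//   e.gen("a = " + t2);              // a = t2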
Boolean Expressions
• Boolean expressions have different translations
depending on their context
– Compute logical values – code can be generated in analogy to
arithmetic expressions for the logical operators
– Alter the flow of control – boolean expressions can be used as
conditional expressions in statements: if, for and while.
• Control Flow Boolean expressions have two inherited
attributes:
– B.true, the label to which control flows if B is true
– B.false, the label to which control flows if B is false
– B.false = S.next means:
if B is false, go to whatever address comes after statement
S is completed.
This would be used for the S → if (B) S1 expansion
Short-Circuit Boolean Expressions
• Some language semantics decree that boolean expressions
have so-called short-circuit semantics.
– In this case, computing boolean operations may also have flow-of-control
behavior.
Example:
Translation:
if x < 100 goto L2
ifFalse x > 200 goto L1
ifFalse x != y goto L1
L2: x = 0
L1: …
Flow-of-Control Statements
S → if ( B ) S1
  | if ( B ) S1 else S2
  | while ( B ) S1

Code layouts (B.code contains jumps to B.true and B.false):
• if:      B.code | B.true: S1.code | B.false = S.next: …
• if-else: B.code | B.true: S1.code, goto S.next | B.false: S2.code | S.next: …
• while:   begin: B.code | B.true: S1.code, goto begin | B.false = S.next: …
Flow-of-Control Translations
(|| is the code concatenation operator)

P → S                   S.next = newlabel()
                        P.code = S.code || label(S.next)

S → assign              S.code = assign.code

S → if ( B ) S1         B.true = newlabel(); B.false = S1.next = S.next
                        S.code = B.code || label(B.true) || S1.code

S → if ( B ) S1 else S2 B.true = newlabel(); B.false = newlabel();
                        S1.next = S2.next = S.next
                        S.code = B.code || label(B.true) || S1.code
                                || gen(goto S.next) || label(B.false) || S2.code

S → while ( B ) S1      begin = newlabel(); B.true = newlabel();
                        B.false = S.next; S1.next = begin
                        S.code = label(begin) || B.code || label(B.true)
                                || S1.code || gen(goto begin)

S → S1 S2               S1.next = newlabel(); S2.next = S.next;
                        S.code = S1.code || label(S1.next) || S2.code
Control-Flow Boolean Expressions
B → B1 || B2    B1.true = B.true; B1.false = newlabel();
                B2.true = B.true; B2.false = B.false;
                B.code = B1.code || label(B1.false) || B2.code

B → B1 && B2    B1.true = newlabel(); B1.false = B.false;
                B2.true = B.true; B2.false = B.false;
                B.code = B1.code || label(B1.true) || B2.code

B → ! B1        B1.true = B.false; B1.false = B.true;
                B.code = B1.code

B → E1 rel E2   B.code = E1.code || E2.code
                || gen(if E1.addr relop E2.addr goto B.true)
                || gen(goto B.false)

B → true        B.code = gen(goto B.true)

B → false       B.code = gen(goto B.false)
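As a worked example (label names are arbitrary results of newlabel()), applying these rules to S → if (B) S1 with B = x < 5 and S1 = y = 1; gives:

    if x < 5 goto L1
    goto L2
L1: y = 1
L2: …            (S.next)

Here B.true = L1 and B.false = S.next = L2, exactly as the B → E1 rel E2 rule prescribes.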
Avoiding Redundant Gotos,
Backpatching
• Use ifFalse instructions where necessary
• Also use the attribute value “fall” to mean fall through where
possible, instead of generating a goto to the very next instruction
• The abstract labels require a two-pass scheme to later fill in the
addresses
• This can be avoided by instead passing a list of addresses that
need to be filled in, and filling them as it becomes possible.
This is called backpatching.
Java Bytecode, Virtual Machine
Instructions
• Java bytecode is an intermediate representation.
• It uses a stack machine, which is generally at a lower level than
a three-address code.
• But it also has some conceptually high-level instructions that
need table lookups for method names, etc.
• The lookups are needed due to dynamic class loading in Java:
– If class A uses class B, the reference can only compile if you have
access to B.class (or if your IDE can compile B.java to its B.class).
– In runtime, A.class and B.class hold bytecode for class A and B.
– Loading A does not automatically load B. B is loaded only if it is
needed.
– Before B is loaded, its method signatures (interfaces) are known, but
its bytecode is not yet available.
Displaying Bytecode
• From command line, you can use this command to see the
bytecode:
javap -private -c MyClass
• You need to have access to MyClass.class file
• There are many options to see more information about
local variables, where they are accessed in bytecode, etc.
• Important: the stack-machine stack is empty after each full
statement.
• Example: d = a + b * c (locals: a in slot 1, b in 2, c in 3, d in 4)

instruction   stack       description
iload_1       a           push local var #1 (a) onto the stack
iload_2       a, b        push b onto the stack
iload_3       a, b, c     push c (now c is on top of stack)
imul          a, b*c      pop c and b, push their product
iadd          a+b*c       pop the product and a, push their sum
istore 4      (empty)     pop the result into local var #4 (d)
Method Call in Java Bytecode
• Method calls need symbol lookup
• Example: System.out.println(d);
18: getstatic #2; //Field java/lang/System.out:Ljava/io/PrintStream;
21: iload 4
23: invokevirtual #3; //Method java/io/PrintStream.println:(I)V
• Java internal signature Lmypkg/MyClass; means: an object of
MyClass, defined in package mypkg
• Java internal signature (I)V means: takes an integer, returns void
• We will be focusing on MicroJava virtual machine
instructions
– Few instructions compared to full Java VM instructions
– Simpler language features, less complicated
References
• Aho, Lam, Sethi, and Ullman, Compilers: Principles,
Techniques, and Tools. Addison-Wesley, 2006. (The purple
dragon book)
Compiler Design
22. ANTLR AST Traversal
(AST as Input, AST Grammars)
Kanat Bolazar
April 13, 2010
Chars → Tokens → AST → ....
Lexer → Parser → Tree Parser
ANTLR Syntax
grammar file, name.g:
/** doc comment */
kind grammar name;
options {…}
tokens {…}
scopes…
@header {…}
@members {…}
rules…

one rule:
/** doc comment */
rule[String s, int z]
returns [int x, int y]
throws E
options {…}
scopes…
@init {…}
@after {…}
: …
| …
;
catch [Exception e] {…}
finally {…}
Trees
^(root child1 … childN)
What is LL(*)?
• Natural extension to LL(k) lookahead DFA: allow
cyclic DFA that can skip ahead past common prefixes
to see what follows
• Analogy: like trying to decide which line to get in at
the movies: long line, can’t see sign ahead from the
back; run ahead to see sign
ticket_line : PEOPLE+ STAR WARS 9
            | PEOPLE+ AVATAR 2
            ;
• Predict and proceed normally with LL parse
• No need to specify k a priori
• Weakness: can’t deal with recursive left-prefixes
LL(*) Example
s : ID+ ':' 'x'
  | ID+ '.' 'y'
  ;

void s() {
  int alt=0;
  while (LA(1)==ID) consume();
  if ( LA(1)==':' ) alt=1;
  if ( LA(1)=='.' ) alt=2;
  switch (alt) {
    case 1 : …
    case 2 : …
    default : error;
  }
}
Note: 'x', 'y' not in prediction DFA
Tree Rewrite Rules
• Maps an input grammar fragment to an output tree grammar fragment:
grammar T;
options {output=AST;}
stat : 'return' expr ';' -> ^('return' expr) ;
decl : 'int' ID (',' ID)* -> ^('int' ID+) ;
decl : 'int' ID (',' ID)* -> ^('int' ID)+ ;
Template Rewrite Rules
• Reference the template name, with attribute assignments as
args:
grammar T;
options {output=template;}
s : ID '=' INT ';' -> assign(x={$ID.text},y={$INT.text}) ;
• In template group T, assign is defined like this:
group T;
assign(x,y) ::= "<x> := <y>;"
ANTLR AST (Abstract Syntax Tree)
Processing
• ANTLR allows creation and manipulation of ASTs
• 1. Generate an AST (file.mj → AST in memory):
grammar MyLanguage;
options {
output = AST;
ASTLabelType = CommonTree;
}
• 2. Traverse, process AST → AST:
tree grammar TypeChecker;
options {
tokenVocab = MyLanguage;
output = AST;
ASTLabelType = CommonTree;
}
• 3. AST → action (Java):
tree grammar Interpreter;
options {
tokenVocab = MyLanguage;
ASTLabelType = CommonTree;
}
AST Processing: Calculator 2, 3
• ANTLR expression evaluator (calculator) examples:
http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator
• We are interested in the examples that build an AST,
and evaluate (interpret) the language AST.
• These are in calculator.zip, as examples 2 and 3.

grammar Expr;
options {
output=AST;
ASTLabelType=CommonTree;
}

tree grammar Eval;
options {
tokenVocab=Expr;
ASTLabelType=CommonTree;
}

Pipeline: Expr → AST → Eval
grammar Expr;
options {
output=AST;
ASTLabelType=CommonTree;
}
prog: ( stat {System.out.println(
$stat.tree.toStringTree());}
)+ ;
stat: expr NEWLINE -> expr
| ID '=' expr NEWLINE -> ^('=' ID expr)
| NEWLINE ->
;
expr: multExpr (('+'^|'-'^) multExpr)*
;
multExpr
: atom ('*'^ atom)*
;
atom: INT
| ID
| '('! expr ')'!
;

tree grammar Eval;
options {
tokenVocab=Expr;
ASTLabelType=CommonTree;
}
@header { import java.util.HashMap; }
@members { HashMap memory = new HashMap(); }
prog: stat+ ;
stat: expr
{System.out.println($expr.value);}
| ^('=' ID expr)
{memory.put($ID.text, new Integer($expr.value));}
;
expr returns [int value]
: ^('+' a=expr b=expr) {$value = a+b;}
| ^('-' a=expr b=expr) {$value = a-b;}
| ^('*' a=expr b=expr) {$value = a*b;}
| ID
{
Integer v = (Integer)memory.get($ID.text);
if ( v!=null ) $value = v.intValue();
else System.err.println("undefined var "+$ID.text);
}
| INT
{$value = Integer.parseInt($INT.text);}
;
AST → AST, AST → Template
• The ANTLR Tree construction page has examples of
processing ASTs:
– AST → AST: Can be used for typechecking, processing
(taking the derivative of polynomials/formulas)
– AST → Java (action): Often the final step, where the AST
is needed no more.
– AST → Template: Can simplify Java/action when
output is templatized
• Please see the Calculator examples as well. They show
which files have to be shared so tree grammars can
be used.
Our Tree Grammar
• Look at sample output from our AST generator
(syntax_test_ast.txt):

 9. program X27                  (program X27
10.
11. // constants
12. final int CONST = 25;        (final (TYP int) CONST 25)
13. final char CH = '\n';        (final (TYP char) CH '\n')
14. final notype[] B3 = 35;      (final (ARRAY notype) B3 35)
15.
16. // classes (types)
17. class Helper {               (class Helper
18.   // only variable declarations...
19.   int x;                     (VARLIST (VAR (TYP int) x)
20.   char y;                     (VAR (TYP char) y)
Compiler Design
27. Runtime Environments:
Activation Records,
Heap Management
Kanat Bolazar
April 29, 2010
Run-time Environments
• The compiler creates and manages a run-time
environment in which it assumes the target program will
be executed
• Issues for run-time environment
– Layout and allocation of storage locations for named program
objects
– Mechanism for the target program to access variables
– Linkages between procedures
– Mechanisms for passing parameters
– Interfaces to the operating system for I/O and other programs
Storage Organization
• Assumes a logical address space
– Operating system will later map it to physical addresses, decide how to
use cache memory, etc.
• Memory typically divided into areas for
– Program code
– Other static data storage, including global constants and compiler
generated data
– Stack to support call/return policy for procedures
– Heap to store data that can outlive a call to a procedure
Layout: Code | Static | Heap → free memory ← Stack
Run-time stack
• Each time a procedure is called (or a block entered), space for
local variables is pushed onto the stack.
• When the procedure is terminated, the space is popped off the
stack.
• Procedure activations are nested in time
– If procedure p calls procedure q, then even in cases of exceptions and
errors, q will always terminate before p.
• Activations of procedures during the running of a program can
be represented by an activation tree
• Each procedure activation has an activation record (aka frame)
on the run-time stack.
– The run-time stack consists of the activation records, at any point in time
during the running of the program, for all procedures which have been
called but have not yet returned
Procedure Example: quicksort
int a[11];
void readArray( ) /* Reads 9 integers into a[1] through a[9] */
{ int i; …
}
int partition ( int m, int n)
{ /* picks a separator v and partitions a[m .. n] so that a[m .. p-1] are less than v,
a[p] = v, a[p+1 .. n] are equal to or greater than v. Returns p. */
}
void quicksort (int m, int n)
{ int i;
if ( n > m )
{ i = partition(m,n); quicksort(m, i-1); quicksort(i+1, n); }
}
main ( )
{ readArray ( );
a[0] = -9999; a[10] = 9999;
quicksort (1, 9);
}
Activation Records
• Elements in the activation record:
– temporary values that could not fit into
registers
– local variables of the procedure
– saved machine status for the point at which this
procedure was called; includes return address and
contents of registers to be restored
– access link to activation record of previous
block or procedure in lexical scope chain
– control link pointing to the activation record
of the caller
– space for the return value of the function, if
any
– actual parameters (or they may be placed in
registers when possible)

Layout of an activation record, top to bottom:
actual params
return values
control link
access link
saved machine state
local data
temporaries
Procedure Linkage
• The standardized code to call a procedure, the calling sequence, and
the return sequence, may be divided between the caller and the callee.
– parameters and return value should be first in the new activation record
so that the caller can easily compute the actual params and get the return
value as an extension of its own activation record
• also allows for procedures with a variable number of params
– fixed-length items are placed in the middle of the activation record
• saved machine state is standardized
– local variables and temporaries are placed at the end, especially good for
the case when the size is not known until run-time, such as with dynamic
arrays
– location of the top-of-stack pointer is commonly at the end of the fixed-length fields
• fixed length data can be accessed by local offsets, known to the
intermediate code generator, relative to the TOP-SP (negative offsets)
Activation Record Example
• Showing one way to divide responsibility between the caller and
the callee:

Caller A.R. (caller responsibility):
  actual params and return val
  control link, access link, and saved machine state
  local data and temporaries
Callee A.R.:
  actual params and return val            (set up by the caller)
  TOP-SP → control link, access link, and saved machine state
  local data and temporaries              (callee responsibility)
  … actual top of stack
Calling Sequence
• A possible calling sequence matching the previous diagram:
– caller evaluates the actual parameters
– caller stores the return address and old value of TOP-SP in
the callee’s AR. Caller then increments TOP-SP to the
callee’s AR. (Caller knows the size of the caller’s local data
and temps, and the callee’s parameters and status fields.).
Caller jumps to callee code.
– callee saves the register values and other status fields
– callee initializes local data and begins execution
Return Sequence
• Corresponding return sequence
– callee places the return value next to the parameters
– using information in the status fields, callee restores TOP-SP
and other registers. Callee jumps to the return address that
the caller placed in the status field
– Although TOP-SP has been restored to the caller AR, the
caller knows where the return value is, relative to the current
TOP-SP
Variable-length data on the stack
• It is possible to allocate objects, arrays
or other structures of unknown size on the
stack, as long as they are local to a
procedure and become inaccessible when the
procedure ends
• For example, represent a dynamic array in the
activation record by a pointer to an array
located between the fixed-length fields and
the next activation record

Stack sketch:
proc p:  actual params and return val
         control link, access link, and saved machine state
         pointer to array a
         …
         array a
proc q:  actual params and return val
         control link, access link, and saved machine state
         local data and temporaries
Access to Nonlocal Data on the Stack
• Simplest case are languages without nested procedures or
classes
– C and many C-based languages
• All variables are defined either within a single procedure
(function) or outside of any procedure at the global level
• Allocation of variables and access to variables
– Global variables are allocated static storage. Locations are fixed at
compile time.
– All other variables must be local to the activation on the top of the stack.
These variables are allocated when the procedure is called and accessed
via the TOP-SP pointer.
• Nested procedures will use a set of access links to access
nonlocal data
Nested Procedure Example Outline in ML
fun sort(inputFile, outputFile) =
let
val a = array (11, 0);
fun readArray (inputFile) =
... a ... ;
// body of readArray accesses a
fun exchange ( i, j ) =
... a ... ;
// so does exchange
fun quicksort ( m, n ) =
let
val v = . . . ;
fun partition ( y, z ) =
. . . a . . . v . . . exchange . . .
in
. . . a . . . v . . . partition . . . quicksort . . .
end
in
. . . a . . . readArray . . . quicksort . . .
end;// the function sort accesses a and calls readArray and quicksort
Access Links
• Access links allow implementation of the normal static scope
rule
– if procedure p is nested immediately within q in the source code, then
the access link of an activation of p points to the most recent activation
of q
• Access links form a chain – one link for each lexical level –
allowing access to all data and procedures accessible to the
currently executing procedure
• Look at example of access links from quicksort program in ML
(previous slide)
Defining Access Links for Direct Procedure
Calls
• Procedure q calls procedure p explicitly:
– case 1: procedure p is at a nesting depth 1 higher than q (can’t be more
than 1 to follow scope rules). Then the access link of p’s new activation
record points to the immediately preceding activation record, that of q
(example: quicksort calls partition)
– case 2: recursive call, i.e. q is p itself. The access link in the new
activation record for q is the same as in the preceding activation record for
q (example: quicksort calls quicksort)
– case 3: procedure p is at a lower nesting depth than q. Then procedure
p must be immediately nested in some procedure r (defined in r) and
there must be an activation record for r in the access chain of q. Follow
the access links of q to find the activation record of r and set the access
link of p to point to that activation record of r. (partition calls exchange,
which is defined in sort)
Defining Access Links for Parameter Procedures
• Suppose that procedure p is passed to q as a parameter. When q
calls its parameter, which may be named r, it is not actually
known which procedure to call until run-time.
• When a procedure is passed as a parameter, the caller must also
pass, along with the name of the procedure, the proper access
link for that parameter.
• When q calls the procedure parameter, it sets up that access link,
thus enabling the procedure parameter to run in the environment
of the caller procedure.
Displays
• If the nesting depth of access links gets
large, then access to nonlocal variables
will be inefficient: the whole chain of
access links must be followed.
• Solution is to keep an auxiliary array –
the display – in which each element d[l] is
the highest activation record on the stack for
the procedure at nesting depth l.
• Whenever a new activation record is
created at level l, it will save the old value of
display[l] to restore when it is done

Example (quicksort program), stack from top to bottom:
e(1,3) saved d[2]; p(1,3) saved d[3]; q(1,3) saved d[2];
q(1,9) saved d[2]; sort
Display: d[1] → sort, d[2] → e(1,3), d[3] → p(1,3)
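A sketch of the display bookkeeping in Java-like code (the Frame class and all names here are assumptions for illustration; a real compiler emits this logic in procedure prologues and epilogues):

class Frame {
    Frame savedDisplayEntry;  // hidden slot: previous display[l], restored on exit
}

class Display {
    private final Frame[] d = new Frame[16];   // d[l] = topmost frame at nesting depth l

    void enter(Frame f, int level) {   // procedure prologue, nesting depth = level
        f.savedDisplayEntry = d[level]; // save whatever was at this depth (e.g. a recursive call)
        d[level] = f;                   // this activation is now the one at depth level
    }

    void exit(Frame f, int level) {    // procedure epilogue
        d[level] = f.savedDisplayEntry; // restore the saved entry
    }

    Frame frameAt(int level) { return d[level]; }  // O(1) access to any enclosing frame
}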
Dangling pointers in the stack
• In a stack-based environment, typically used for
parameters and local variables, variables local to a
procedure are removed from the stack when the
procedure exits
• There should not be any pointers still in use to such
variables
• Example from C:
int* dangle(void)
{ int x;
return &x;
}
• An assignment “addr = dangle();” causes addr to point
to deallocated stack space, a dangling pointer
Organization of Memory for Arrays
• C/C++ arrays and Java arrays are stored very differently
in memory
• A 2x3 C int array only needs space for 6 ints in the heap:
ar[0][0] , ar[0][1] , ar[0][2] , ar[1][0] , ar[1][1] , ar[1][2]
• The same array can be accessed as an int[6] array.
• Java is type-safe; in Java, you can't access an int[2][3] as
if it is an int[6].
• Java also stores the array length and other object header
information used by the garbage collector.
• All arrays are objects in the heap. A local "array" variable is
just a pointer/reference.
• This line creates three array objects in Java:
int[][] ar = new int[2][3];
Heap Management
• Store used for data that lives indefinitely, or until the program
explicitly deletes it
• Memory manager allocates and deallocates memory in the heap
– serves as the interface between application programs, generated by the
compiler, and the operating system
– calls to free and delete can be generated by the compiler, or in some
languages, explicitly by the programmer
• Garbage Collection is an important subsystem of the memory
manager that finds spaces within the heap that are no longer
used and can be returned to free storage
– the language Java uses the garbage collector as the deallocation
operation
Memory Manager
• The memory manager has one large chunk of memory
from the operating system that it can manage for the
application program
• Allocation – when a program requests memory for a
variable or an object (anything requiring space), the
memory manager gives it the address of a chunk of
contiguous heap memory
– if there is no space big enough, can request the operating
system for virtual space
– if out of space, inform the program
• Deallocation – returns deallocated space to the pool of
free space
– typically doesn’t reduce the size of the heap to return memory to the
operating system
Properties of Memory Manager
• Space efficiency – minimize the total heap size needed by the
program
– accomplished by minimizing fragmentation
• Program efficiency – make good use of the memory subsystem
to allow programs to run faster
– locality of placement of objects
• Low overhead – important for memory allocation and
deallocation to be as efficient as possible as they are frequent
operations in many programs
Memory Hierarchy
• Registers are scarce – explicitly managed by the code
generated by the compiler
• Other memory levels are automatically handled by the
operating system
– chunks of memory are copied from lower levels to higher
levels as necessary

Level                  typical size     typical access time
registers              32 words         1 ns
1st-level cache        16 – 64 KB       5 – 10 ns
2nd-level cache        125 KB – 4 MB    40 – 60 ns
physical memory        512 MB – 4 GB    100 – 150 ns
virtual memory (disk)  > 40 GB          3 – 15 ms
Taking Advantage of Locality
• Programs often exhibit both
– temporal locality – accessed memory locations are likely to be accessed
again soon
– spatial locality – memory close to locations that have been accessed are
also likely to be accessed
• Compiler can place basic blocks (sequential instructions) on the
same cache page, or even the same cache line
• Instructions belonging to the same loop or function can also be
placed together
Placing objects in the heap
• As heap memory is allocated and deallocated, it is
broken into free spaces, the holes, and the used spaces.
– on allocation, a hole must be split into a free and used part
• Best Fit placement is deemed the best strategy – uses
the smallest available hole that is large enough
– this strategy saves larger holes for later, possibly larger,
requests
• Contrasted to the First Fit strategy that uses the first
hole on the list that is large enough
– has a shorter allocation time, but is a worse overall strategy
• Some managers use the “bin” approach to keeping track
of free space
– for many standard sizes, keep a list of free spaces of that size
– keep more bins for smaller sizes, as they are more common
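A toy best-fit allocator over a free list, as a sketch (abstract address units; coalescing and bins are omitted):

import java.util.LinkedList;

class BestFitAllocator {
    static class Hole { int start, size; Hole(int s, int z) { start = s; size = z; } }

    private final LinkedList<Hole> free = new LinkedList<>();

    BestFitAllocator(int heapSize) { free.add(new Hole(0, heapSize)); }

    /** Returns the start address of the allocated chunk, or -1 if no hole fits. */
    int allocate(int size) {
        Hole best = null;
        for (Hole h : free)                       // scan every hole...
            if (h.size >= size && (best == null || h.size < best.size))
                best = h;                         // ...keep the smallest that is large enough
        if (best == null) return -1;              // here we could ask the OS for more space
        int addr = best.start;
        best.start += size;                       // split the hole: used part + remainder
        best.size  -= size;
        if (best.size == 0) free.remove(best);
        return addr;
    }

    void deallocate(int start, int size) {
        free.add(new Hole(start, size));          // a real manager would coalesce neighbors
    }
}

First Fit would simply return the first hole with h.size >= size, which scans less of the list but fragments more.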
Coalescing Free Space
• When an object is freed, it will reduce fragmentation if we can
combine the deallocated space with any adjacent free spaces
• Data structures to support coalescing:
– boundary tags – at each end of the chunk, keep a bit indicating whether
the chunk is free, and keep its size
– doubly linked, embedded free list – pointers to the next free chunks are
kept at each end, next to the boundary tags
• When B is deallocated, it can check whether A and C are free and, if so,
coalesce the blocks and adjust the links of the free list

chunk A (0:200: … :200:0)  chunk B (0:100: … :100:0)  chunk C (0:80: … :80:0)
(the pointers doubly link free chunks, not in physical order)
Problems with Manual Deallocation
• It is a notoriously difficult task for programmers, or compilers,
to correctly decide when an object will never be referenced
again
• If you use caution in deallocation, then you may get chunks of
memory that are marked in use, but are never used again
– memory leaks
• If you deallocate incorrectly, so that at a later time, a reference
is used to an object that was deallocated, then an error occurs
– dangling pointers
Garbage Collection
• In many languages, program variables have
pointers to objects in the heap, e.g. through
the use of new
• These objects can have pointers to other
objects
• Everything reachable through a program
variable is in use, and everything else in the
heap is garbage
– in an assignment “x = y”, an object formerly
pointed to by x is now garbage if x were the last
pointer to it
• A requirement to be a garbage collectible
language is to be type safe:
– we can tell whether a data element or component of a
data element is a pointer to a chunk of heap memory
Performance Metrics
• Overall execution time – garbage collection touches a lot of data,
and it is important that it not substantially increase the total run
time of an application
• Space usage – garbage collector should not increase
fragmentation
• Pause time – garbage collectors are notorious for causing the
application to pause suddenly for a very long time, as garbage
collection kicks in
– as a special case, real-time applications must be assured that they can
achieve certain computations within a time limit
• Program locality – garbage collector also controls the placement
of data, particularly for collectors which relocate data
Reachability
• The data that can be accessed directly by the program,
without any dereferencing, is the root set, and its elements are
all reachable
– the compiler may have placed elements of the root set in registers or
on the stack
• Any object with a reference stored in the field members or
array elements of any reachable object is also a reachable
object
• The program (sometimes called the mutator) can change the
reachable set
– object allocation by the memory manager
– parameter passing and return values – objects pointed to by actual
parameters and by return results remain reachable
Reference Counting Garbage Collectors
• Keep a count of the number of references to any object, and when the count
drops to 0, the object can be returned to free storage
• Every object keeps a field for the reference count, which is maintained:
– object allocation – the count of a new object is 1
– parameter passing – the reference count of an actual parameter object is
increased by 1
– reference assignments “x = y”: reference count of the object referred to by y
goes up by 1, reference count of the old object pointed to by x is decreased by 1
– procedure returns – objects pointed to by local variables have counts
decremented
– transitive loss of reachability – whenever the count of an object goes to 0, we
must decrement by 1 each of the objects pointed to by a reference within the
object
• Simple, but imperfect: cannot reclaim circular structures
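A sketch of the assignment and transitive-decrement rules in Java (Obj and its fields are assumptions; a real collector maintains the counts inside the runtime, not in user code):

import java.util.ArrayList;
import java.util.List;

class Obj {
    int refCount = 1;                       // object allocation: count starts at 1
    List<Obj> fields = new ArrayList<>();   // outgoing references held by this object
}

class RefCounting {
    // Models "x = y" where cell[0] plays the role of x.
    static void assign(Obj[] cell, Obj y) {
        if (y != null) y.refCount++;        // object referred to by y goes up by 1
        Obj old = cell[0];
        cell[0] = y;
        release(old);                       // old object pointed to by x goes down by 1
    }

    static void release(Obj o) {
        if (o == null || --o.refCount > 0) return;
        for (Obj child : o.fields) release(child);  // transitive loss of reachability
        // here o's chunk would be returned to the free list
    }
}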
Java Array Example
• Recall that this line creates three array objects:
int[][] ar = new int[2][3];
// creates ar, ar[0] and ar[1] arrays
• Local variable ar stores the address of the int[][] object
• Its elements store addresses of the two int[] objects, one
per row.

int[] row1 = ar[0]; // the first int[] object now has ref count = 2
ar[0] = null; // the same object now has ref count = 1, from row1
// the first int[] object is not reachable from ar anymore
ar = null; // the int[][] object now has ref count = 0

• Transitive loss of reachability: the int[][] object is not reachable
anymore
• Anything reachable from it should have its refCount decremented:
ar[1]’s object drops to 0 and is reclaimed; ar[0]’s object survives
through row1
Basic Mark and Sweep Garbage Collection
• Trace based algorithms recycle memory as follows:
– program runs and makes allocation requests
– garbage collector discovers reachability by tracing
– unreachable objects are reclaimed for storage
• The Mark and Sweep algorithms use four states for
chunks of memory
– Free – ready to be allocated, at any time
– Unreached – reachability has not been established by gc;
when a chunk is allocated, it is set to be “unreached”
– Unscanned – chunks that are known to be reachable are either
scanned or unscanned – an unscanned object has itself been
reached, but its pointers have not been scanned
– Scanned – the object is reachable and all its pointers have
been scanned
Mark and Sweep Algorithm
• Stop the program and start the garbage collector
• Marking phase:
set Free list to be empty
set the reached bit of each root to 1 and add the root set to the
Unscanned list
loop over the unscanned list:
remove object o from the unscanned list
for each pointer p in object o:
if p is unreached (bit is 0),
set the bit to 1 and put p in the unscanned list
• Sweeping phase:
for each chunk of memory o in the heap:
if o is unreached, add o to the Free list
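A sketch of the marking phase in Java (Chunk and rootSet are assumptions for illustration; the sweep then walks the whole heap as above):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class Chunk {
    boolean reached;        // the "reached" bit
    List<Chunk> pointers;   // outgoing pointers stored in this chunk
}

class MarkPhase {
    static void mark(List<Chunk> rootSet) {
        Deque<Chunk> unscanned = new ArrayDeque<>();
        for (Chunk r : rootSet) {          // set the bit to 1 and add the root set
            r.reached = true;
            unscanned.push(r);
        }
        while (!unscanned.isEmpty()) {     // loop over the unscanned list
            Chunk o = unscanned.pop();
            for (Chunk p : o.pointers) {
                if (!p.reached) {          // p is unreached (bit is 0)
                    p.reached = true;
                    unscanned.push(p);
                }
            }
        }
    }
    // Sweep: for each chunk o in the heap, if !o.reached add o to the free
    // list, otherwise clear o.reached for the next collection.
}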
Baker’s Mark and Sweep algorithm
• The basic algorithm is expensive because it examines every
chunk in the heap
• Baker’s optimization keeps a list of allocated objects
• This list is used as the Unreached list in the algorithm
Scanned = empty set
Unscanned = root set
loop over Unscanned set
move object o from Unscanned to Scanned
for each pointer p in o: if p is Unreached,
move p from Unreached to Unscanned
Free = Free + Unreached
Unreached = Unscanned
Copying Collectors (Relocating)
• While identifying the Free set, the garbage collector can relocate
all reachable objects into one end of the heap
– while analyzing every reference, the gc can update them to point to a
new location, and also update the root set
• Mark and compact moves objects to one end of the heap after
the marking phase
• Copying collector moves the objects from one region of
memory to another as it marks
– extra space is reserved for relocation
– separates the tasks of finding free space and updating the new memory
locations to the objects
• gc copies objects as it traces out the reachable set
Short-Pause Garbage Collection
• Incremental garbage collection – interleaves garbage
collection with the mutator
– incremental gc is conservative during reachability tracing and
only traces out objects which were allocated at the time it begins
– not all garbage is found during the sweep (floating garbage), but
will be collected the next time
• Partial Collection – the garbage collector divides the work
by dividing the space into subsets
– Usually between 80-98% of newly allocated objects “die young”,
i.e. die within a few million instructions, and it is cost effective to
garbage collect these objects often
– Generational garbage collection separates the heap into the
“young” and the “mature” areas. If an object survives some
number of “young” collections, it is promoted to the “mature”
area.
Parallel and Concurrent Garbage Collection
• A garbage collector is parallel if it uses multiple threads, and it
is concurrent if it runs in parallel with the mutator
– based on Dijkstra’s “on-the-fly” garbage collection, coloring the
reachable nodes white, black or gray
• This version partially overlaps gc with mutation, and the
mutation helps the gc:
– Find the root set (with the mutator stopped)
– Interleave the tracing of reachable objects with the mutator(s)
• whenever the mutator writes a reference that points from a Scanned
object to an Unreached object, we remember it (these are called the
dirty objects)
– Stop the mutator(s) to rescan all the dirty objects, which will be quick
because most of the tracing has been done already
Cost of Basic Garbage Collection
• Mark phase: depth-first search takes time proportional to the
number of nodes that it marks, i.e. the number of reachable
chunks
• Sweep phase: time proportional to the size of the heap
• Amortize the collection: divide the time spent collecting by the
amount of garbage reclaimed:
– R chunks of reachable data
– H is the heap size
– c1 is the time for each marked node and c2 the time to sweep
a unit of heap
cost per reclaimed chunk = (c1·R + c2·H) / (H − R)
• If R is close to H, this cost gets high, and the collector could
increase H by asking the operating system for more memory
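For example, with c1 = c2 = 1 time unit: a heap of H = 1000 chunks with R = 500 reachable costs (1·500 + 1·1000) / (1000 − 500) = 3 units per reclaimed chunk, but with R = 900 the same heap costs 1900 / 100 = 19 units per chunk, which is why growing H pays off as R approaches H.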
References
• Dr. Nancy McCracken, Syracuse University.
• Aho, Lam, Sethi, and Ullman, Compilers: Principles,
Techniques, and Tools. Addison-Wesley, 2006. (The purple
dragon book)
• Keith Cooper and Linda Torczon, Engineering a Compiler,
Elsevier, 2004.
Compiler Design
Yacc Example
"Yet Another Compiler Compiler"
Kanat Bolazar
Lex and Yacc
• Two classical tools for compilers:
– Lex: A Lexical Analyzer Generator
– Yacc: “Yet Another Compiler Compiler” (Parser Generator)
• Lex creates programs that scan your tokens one by one.
• Yacc takes a grammar (sentence structure) and generates a
parser.
Lexical Rules → Lex → yylex()
Grammar Rules → Yacc → yyparse()
Input → yylex() → yyparse() → Parsed Input
Lex and Yacc
• Lex and Yacc generate C code for your analyzer & parser.
Lexical Rules → Lex → (C code) Lexical Analyzer / Tokenizer, yylex()
Grammar Rules → Yacc → (C code) Parser, yyparse()
Input (char stream) → Lexical Analyzer → (token stream) → Parser → Parsed Input
Flex, Yacc, Bison, Byacc
• Often, instead of the standard Lex and Yacc, Flex and
Bison are used:
– Flex: A fast lexical analyzer
– (GNU) Bison: A drop-in replacement for (backwards
compatible with) Yacc
• Byacc is the Berkeley implementation of Yacc (so it is
Yacc).
• Resources:
http://en.wikipedia.org/wiki/Flex_lexical_analyser
http://en.wikipedia.org/wiki/GNU_Bison
• The Lex & Yacc Page (manuals, links):
http://dinosaur.compilertools.net/
Yacc: A Standard Parser Generator
• Yacc is not a new tool, and yet it is still used in many projects.
• Yacc syntax is similar to Lex/Flex at the top level.
• Lex/Flex rules were regular expression – action pairs.
• Yacc rules are grammar rule – action pairs.
declarations
%%
rules
%%
programs
Yacc Examples: Calculator
• A standard Yacc example is the int-valued calculator.
• Appendix A of the Yacc manual at the Lex and Yacc Page shows
such a calculator.
• We'll examine this example in parts.
• Let's start with four operations:
E -> E + E
   | E - E
   | E * E
   | E / E
• Note that this grammar is ambiguous, because 2 + 5 * 7 could
be parsed 2 + 5 first or 5 * 7 first.
Yacc Calculator Example: Declarations
%{
# include <stdio.h>
# include <ctype.h>
int regs[26];
int base;
%}
%start list
%token DIGIT LETTER
%left '+' '-'
%left '*' '/' '%'
%left UMINUS /* precedence for unary minus */

Notes:
• %{ … %} is directly included C code
• list is our start symbol; a list of one-line
statements / expressions
• DIGIT & LETTER are tokens
(other tokens use ASCII codes, as in '+', '=', etc)
• %left lines give precedence and associativity (left) of
operators: + and - have the lowest precedence
Yacc Calculator Example: Rules
%% /* begin rules section */
list : /* empty */
| list stat '\n'
| list error '\n'
{ yyerrok; }
;
list: a list of one-line statements / expressions.
Error handling allows a statement to be corrupt,
but list continues with next statement.
statement: expression to calculate, or assignment
stat : expr
{ printf( "%d\n", $1 ); }
| LETTER '=' expr
{ regs[$1] = $3; }
;
number: made up of digits (tokenizer should handle this, but this is a simple example).
number: DIGIT
{ $$ = $1; base = ($1==0) ? 8 : 10; }
| number DIGIT
{ $$ = base * $1 + $2; }
;
Yacc Calculator Example: Rules, cont'd
expr : '(' expr ')'
         { $$ = $2; }
     | expr '+' expr
         { $$ = $1 + $3; }
     | expr '-' expr
         { $$ = $1 - $3; }
     | expr '*' expr
         { $$ = $1 * $3; }
     | expr '/' expr
         { $$ = $1 / $3; }
     | '-' expr %prec UMINUS      /* unary minus */
         { $$ = - $2; }
     | LETTER                     /* letter: register/var */
         { $$ = regs[$1]; }
     | number
     ;
Yacc Calculator Example: Programs (C Code)
%%
/* start of programs */
yylex() {
/* lexical analysis routine */
/* returns LETTER for a lower case letter, yylval = 0 through 25 */
/* return DIGIT for a digit, yylval = 0 through 9 */
/* all other characters are returned immediately */
int c;
while( (c=getchar()) == ' ' ) {/* skip blanks */ }
/* c is now nonblank */
if( islower( c ) ) {
yylval = c - 'a';
return ( LETTER );
}
if( isdigit( c ) ) {
yylval = c - '0';
return( DIGIT );
}
return( c );
}