Compiler Design - Vel Tech University

UNIT - I
DEFINITION OF COMPILER
A compiler is a program that reads a program written in one language - the source
language - and translates it into an equivalent program in another language - the target
language. It also reports to its user the presence of errors in the source program.
source program ---> [ Compiler ] ---> target program
                         |
                         v
                  error messages
The source language may be any high-level language, and the target language may
be another programming language or machine language. The first compiler, the
FORTRAN compiler, was developed in the 1950s.
THE ANALYSIS-SYNTHESIS MODEL OF COMPILATION
There are two parts to compilation:
1. Analysis
2. Synthesis
Analysis Part – It breaks up the source program into constituent pieces and creates an
intermediate representation of the source program.
Synthesis Part – It constructs the desired target program from the intermediate
representation.
During analysis, the operations implied by the source program are determined and
recorded in a hierarchical structure called a tree. A special kind of tree called a syntax
tree is used.
Syntax tree - It is a tree in which each node represents an operation and the children of a
node represent the arguments of the operation.
Example
Draw a syntax tree for the assignment statement position := initial + rate * 60

              :=
            /    \
    position      +
                /   \
         initial     *
                   /   \
               rate     60
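To make this concrete, the following is a minimal sketch in C of such a tree for
position := initial + rate * 60; the Node type and the make( ) helper are illustrative
names invented for the sketch, not something defined in these notes.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A node carries an operator (or an identifier/number spelling) and
       the subtrees for the arguments of the operation. */
    typedef struct Node {
        char label[16];
        struct Node *left, *right;   /* NULL for leaves */
    } Node;

    static Node *make(const char *label, Node *left, Node *right) {
        Node *n = malloc(sizeof *n);
        strncpy(n->label, label, sizeof n->label - 1);
        n->label[sizeof n->label - 1] = '\0';
        n->left = left;
        n->right = right;
        return n;
    }

    int main(void) {
        /* builds the syntax tree drawn above */
        Node *t = make(":=",
                       make("position", NULL, NULL),
                       make("+",
                            make("initial", NULL, NULL),
                            make("*",
                                 make("rate", NULL, NULL),
                                 make("60", NULL, NULL))));
        printf("root operator: %s\n", t->label);   /* prints := */
        return 0;
    }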
SOFTWARE TOOLS FOR ANALYSIS
Many software tools that manipulate source programs first perform some kind of
analysis. They are
1. Structure editors
 It accepts a sequence of commands as input.
 It performs text creation and manipulation, similar to a text editor.
 It produces a hierarchical structure as output.
 Its output is similar to the output of the analysis phase.
2. Pretty printers
 It analyzes the program and prints it in such a way that the structure of the
program is clearly visible.
3. Static checkers
 It reads and analyzes the program.
 It discovers bugs without running the source program.
4. Interpreters
 It translates the source program into the desired target program.
 It interprets only one line at a time.
The following tools perform an analysis similar to that of a compiler. They are
1. Text formatters
2. Silicon compilers
3. Query interpreters
1. Text formatters
 It takes as input a stream of characters describing paragraphs, figures, or
mathematical structures like subscripts and superscripts.
 It performs operations similar to a text editor.
2. Silicon compilers
 Its source language is similar to a conventional programming language, but the
variables of the language represent logical signals rather than locations in memory.
 The output is a circuit design in an appropriate language.
3. Query interpreters
 It translates a predicate into commands.
 Using the commands, it searches a database for records satisfying that
predicate.
THE CONTEXT OF A COMPILER
A compiler needs several other programs in order to produce executable machine
code. The structure of a language processing system is shown as follows:

        skeletal source program
                  |
             Preprocessor
                  |
            source program
                  |
               Compiler
                  |
        target assembly program
                  |
              Assembler
                  |
        relocatable machine code
                  |
         Loader / Link Editor  <---  library, relocatable object files
                  |
        absolute machine code

The source program may be divided into a number of modules stored in separate files.
The task of collecting the source program into a single program is handled by a distinct
program, called a preprocessor.
The preprocessor may also expand shorthands, called macros, into source
language statements.
The target program created by the compiler needs further processing before it can
be run.
The compiler creates assembly code that is translated by an assembler into
machine code.
The machine code is linked with some library routines by a loader / link editor into
the code that actually runs on the machine.
Analysis of the Source Program
The analysis part of the compiler has three phases. They are
1. Linear Analysis
2. Hierarchical Analysis
3. Semantic Analysis
Linear Analysis
Linear analysis is otherwise called as Lexical analysis or scanning. This phase
mainly used to group the characters from the input into the tokens.
Example
The tokens for the assignment statement position := initial + rate * 60 are as
follows:
1. The identifier position
2. The assignment symbol :=
3. The identifier initial
4. The + sign
5. The identifier rate
6. The * sign
7. The number 60
Hierarchical Analysis
It is otherwise called parsing or syntax analysis. It groups the tokens into
grammatical phrases that are used by the compiler to synthesize output. The grammatical
phrases of the source program are represented by a parse tree, as shown below.
The parse tree for the assignment statement position := initial + rate * 60 is as follows:
assignment statement
├── identifier: position
├── :=
└── expression
    ├── expression
    │   └── identifier: initial
    ├── +
    └── expression
        ├── expression
        │   └── identifier: rate
        ├── *
        └── expression
            └── number: 60
The hierarchical structure of a program is expressed by recursive rules. The following
recursive rules define an expression:
1. Any identifier is an expression.
2. Any number is an expression
3. If expression1 and expression2 are expressions, then so are
expression1 + expression2
expression1 * expression2
(expression1)
Rules 1 and 2 are nonrecursive basis rules, while rule 3 is recursive.
Lexical constructs do not require recursion, but syntax analysis does.
Syntax tree – It is a compressed representation of the parse tree in which the operators
appear as the interior nodes, and the operands of an operator are the children of the node
for that operator.
Semantic Analysis
This phase checks the source program for semantic errors. It uses the hierarchical
structure provided by the Syntax analysis phase to identify the operators and operands of
expressions and statements.
An important component of semantic analysis is type checking. For example, a
binary arithmetic operator may be applied to an integer and a real; in this case, the
compiler may need to convert the integer to a real.
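As a rough illustration of such a check, the sketch below inserts an inttoreal
conversion node when a binary operator mixes an integer and a real operand; the Expr
type and its field names are assumptions made for the sketch.

    #include <stdlib.h>

    typedef enum { TY_INT, TY_REAL } Type;

    typedef struct Expr {
        Type type;
        struct Expr *child;   /* operand of a conversion node; other fields omitted */
    } Expr;

    /* Wrap e in an inttoreal conversion node whose result type is real. */
    static Expr *inttoreal(Expr *e) {
        Expr *conv = malloc(sizeof *conv);
        conv->type = TY_REAL;
        conv->child = e;
        return conv;
    }

    /* Type-check a binary arithmetic operator: if exactly one operand is an
       integer, convert it to real; the result type is then real. */
    static Type check_binary(Expr **lhs, Expr **rhs) {
        if ((*lhs)->type == (*rhs)->type)
            return (*lhs)->type;
        if ((*lhs)->type == TY_INT)
            *lhs = inttoreal(*lhs);
        else
            *rhs = inttoreal(*rhs);
        return TY_REAL;
    }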
PHASES OF A COMPILER
A compiler operates in phases, each of which transforms the source program from
one representation to another. The following diagram depicts the phases of a compiler.
            source program
                  |
           Lexical Analyzer
                  |
           Syntax Analyzer
                  |
          Semantic Analyzer
                  |
     Intermediate Code Generator
                  |
            Code Optimizer
                  |
            Code Generator
                  |
            target program

The symbol-table manager and the error handler interact with all six phases of
the compiler.
Symbol Table Management
A Symbol Table is a data structure containing a record for each identifier with
fields for the attributes of the identifier. The data structure allows us to find the record
for each identifier quickly and to store or retrieve data from that record quickly.
Error Detecting and Reporting
Each phase can encounter errors. After detecting an error, a phase must deal with
that error so that compilation can proceed and find further errors.
The lexical analysis phase detects errors in which the characters remaining in the
input do not form any token of the language.
The syntax analysis phase detects errors in which the token stream violates the
structure rules of the language.
The semantic analysis phase detects errors in which the syntactic structure is
correct but the construct has no meaning for the operation involved.
Analysis Phase
The lexical analysis phase reads the characters in the source program and groups
them into a stream of tokens. A token may be an identifier, a keyword, a punctuation
character, or a multi-character operator like :=.
The character sequence forming a token is called the lexeme for the token.
Certain tokens are augmented by a "lexical value".
The Syntax Analysis phase imposes a hierarchical structure on the token stream,
called a Syntax tree in which an interior node is a record with a field for the operator and
two fields containing pointers to the records for the left and right children.
The Semantic analysis phase performs the type checking operation.
Intermediate Code Generation
We consider an intermediate code form called "three-address code", which is like
the assembly language for a machine. It consists of a sequence of instructions, each of
which has at most three operands. It has the following properties:
1. Each three address instruction has at most one operator in addition to the
assignment.
2. The compiler must generate a temporary name to hold the value computed by
each instruction.
3. Some instructions have fewer than three operands.
Code Optimization
This phase improves the intermediate code, so that faster running machine code
will result. A significant fraction of the time of the compiler is spent on this phase.
Code Generation
This is the final phase; it generates the target code, which may be either
relocatable machine code or assembly code. Memory locations are selected for each of
the variables used by the program. Then, intermediate instructions are each translated
into a sequence of machine instructions that perform the same task.
Translation of a statement
Position := initial + rate * 60
Lexical Analyzer:
    id1 := id2 + id3 * 60

Syntax Analyzer:
          :=
         /  \
      id1    +
            / \
         id2   *
              / \
           id3   60

Semantic Analyzer:
          :=
         /  \
      id1    +
            / \
         id2   *
              / \
           id3   inttoreal
                     |
                     60
Intermediate Code Generator
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code Optimizer

temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generator
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
COUSINS OF THE COMPILER
The input to a compiler may be produced by one or more preprocessors, and
further processing of the compiler's output may be needed before running machine code
is obtained.
Preprocessor
It produces inputs to the compiler. It performs the following functions
1. Macro processing – Macros are shorthand for longer constructs processed by
preprocessor.
2. File inclusion – Preprocessor includes header files into the program text.
3. “Rational preprocessors” – They augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extension – It adds capabilities to the language by what amounts to
built-in-macros.
Assemblers
Some compilers produce assembly code, which is passed to an assembler for further
processing. The assembler produces relocatable machine code that is passed directly to the
loader/link editor. Assembly code is a mnemonic version of machine code.
Loader/Link Editors
A loader is a program that performs the two functions of loading and link editing.
The process of loading consists of taking relocatable machine code, altering the addresses
and placing the altered instructions and data in memory at the proper locations. The
link-editor allows us to make a single program from several files of relocatable machine
code.
Grouping of Phases
In an implementation of a compiler, activities from more than one phase are often
grouped together.
Front and Back Ends
The different phases are collected into a front end and a back end. The front end
consists of phases such as lexical analysis, syntax analysis, semantic analysis and
intermediate code generation, along with the symbol-table manager and the error
handler. The back end consists of the code optimizer and the code generator, together
with the symbol-table manager and error handler.
Passes
Several phases of compilation are usually implemented in a single pass consisting
of reading an input file and writing an output file. Because of this grouping we directly
convert one form of representation of source program into another.
COMPILER CONSTRUCTION TOOLS
The compiler writer, like any programmer, uses software tools such as debuggers,
version managers, profilers and so on. Some other, more specialized tools are:
Parser Generators
 Input is a context-free grammar and output is a syntax analyzer.
 It is easy to implement.
Scanner Generator
 It generates a lexical analyzer from a regular-expression input.
Syntax-directed translation engines
 It produces collections of routines that walk the parse tree.
Automatic Code Generators
 It translates the intermediate language into machine language, based on a
collection of rules.
 The rules include enough detail to handle the different possible access methods for
data.
 "Template matching" techniques are used.
 Templates that represent sequences of machine instructions replace the
intermediate code statements.
Data flow Engines
 It performs good code optimization.
 It involves data-flow analysis: the gathering of information about how
values are transmitted from one part of the program to another.
SYNTAX
A programming language can be defined in part by describing what its programs
look like; the form or format of its programs is called the syntax of the language.
SEMANTICS
A programming language is also defined by describing what its programs mean;
this is called the semantics of the language.
CONTEXT FREE GRAMMAR
A grammar naturally describes the hierarchical structure of many programming
language constructs. It has four components, as follows:
1. A set of tokens, known as terminal symbols
2. A set of non terminals.
3. A set of productions where each production consists of a non terminal called
left side of the production, an arrow and a sequence of tokens and/or non
terminals called the right side of the production.
4. A designation of one of the non terminals as the start symbol.
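For instance, a small illustrative grammar for lists of digits separated by plus and
minus signs (an example chosen to show the four components, not a grammar used
elsewhere in these notes) has the tokens 0, 1, ..., 9, +, - as terminals, list and digit as
nonterminals, list as the start symbol, and the productions

    list  → list + digit
          | list - digit
          | digit
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9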
PARSE TREE
A parse tree pictorially shows how the start symbol of a grammar derives a string
in the language. If nonterminal A has a production A -> XYZ, then a parse tree may have
an interior node labeled A with three children labeled X, Y, and Z, from left to right:
      A
    / | \
   X  Y  Z
Properties of a Parse Tree
1. The root is labeled by the start symbol.
2. Each leaf is labeled by a token or by ε.
3. Each interior node is labeled by a non terminal
Definition of Yield
The leaves of a parse tree, read from left to right, form the yield of the tree, which
is the string generated or derived from the nonterminal at the root of the parse tree.
Parsing
The process of finding a parse tree for a given string of tokens is called parsing that
string.
Ambiguity
A grammar can have more than one parse tree generating a given string of tokens.
Such a grammar is said to be ambiguous.
Eg. Two parse trees for 9-5+2:

        string                            string
      /   |    \                        /   |    \
  string  +   string                string  -   string
  /  |  \        |                     |       /  |  \
string - string  2                     9   string + string
  |        |                                  |        |
  9        5                                  5        2

The first tree groups the expression as (9-5)+2; the second as 9-(5+2).
To avoid this ambiguity problem, two methods are used:
1. Associativity of operators
2. Precedence of operators
Associativity of operators
By convention, 9+5+2 is equivalent to (9+5)+2 and 9-5-2 is equivalent to (9-5)-2.
When an operand like 5 has operators to its left and right, conventions are needed for
deciding which operator takes that operand. We say that the operator + associates to the
left because an operand with plus signs on both sides of it is taken by the operator to its
left. In most programming languages the four arithmetic operators, addition, subtraction,
multiplication, and division, are left associative.
Some common operators such as exponentiation are right associative. As another
example, the assignment operator = in C is right associative; in C, the expression a = b = c
is treated in the same way as the expression a = (b = c).
Precedence of operators
Consider the expression 9+5*2. There are two possible interpretations of this
expression: (9+5)*2 or 9+(5*2). The associativity of + and * does not resolve this
ambiguity. For this reason, we need to know the relative precedence of operators when
more than one kind of operator is present.
We say that * has higher precedence than + if * takes its operands before + does.
In ordinary arithmetic, multiplication and division have higher precedence than addition
and subtraction. Therefore, 5 is taken by * in both 9+5*2 and 9*5+2; i.e., the expressions
are equivalent to 9+(5*2) and (9*5)+2, respectively.
Syntax Directed Definitions
A syntax-directed definition uses a context-free grammar to specify the
syntactic structure of the input. With each grammar symbol it associates a set of
attributes, and with each production, a set of semantic rules for computing values of the
attributes associated with the symbols appearing in that production.
ROLE OF LEXICAL ANALYZER (LA)
The lexical analyzer is the first phase of a compiler. It reads the input characters
and produces as output a sequence of tokens that the parser uses for syntax analysis. The
interaction between the lexical analyzer and the parser is shown below.

  source      +------------------+   token    +--------+
  program --> | Lexical Analyzer | ---------> | Parser |
              |                  | <--------- |        |
              +------------------+  get next  +--------+
                        |            token        |
                        +----- Symbol table ------+

Issues in Lexical Analysis
There are several reasons for separating the analysis phase of compiling into
lexical analysis and parsing:
1. Simpler design.
2. Compiler efficiency is improved.
3. Compiler portability is enhanced.
Tokens, Patterns, Lexemes
Lexeme - The character sequence forming a token is called the lexeme for the token.
A lexeme may be associated with a lexical value.
Token - A token is a terminal symbol of the grammar for the source language, such as
an identifier, a keyword, or an operator; each token stands for a set of strings.
Pattern - The set of strings belonging to a token is described by a rule called a pattern
associated with that token.
Example:

TOKEN      SAMPLE LEXEMES         INFORMAL DESCRIPTION OF PATTERN
const      const                  const
if         if                     if
relop      <, <=, >, >=, =, <>    < or <= or > or >= or = or <>
id         pi, count, D2          letter followed by letters and digits
num        3.1416, 0, 6.02E23     any numeric constant
literal    "core dumped"          any characters between " and " except "
Attributes for tokens
When more than one pattern matches a lexeme, the lexical analyzer must provide
additional information about the particular lexeme that matched to the subsequent phases
of the compiler. Identifiers and numbers have a single attribute: a pointer to the
symbol-table entry.
Example : The tokens and associated attribute-values for the Fortran statement
E = M * C ** 2 are written as follows
< id, pointer to symbol-table entry for E >
< assign_op, >
< id, pointer to symbol-table entry for M >
< mult_op, >
< id, pointer to symbol-table entry for C >
< exp_op, >
< num, integer value 2 >
Lexical Errors
Suppose the lexical analyzer is unable to proceed because none of the patterns for
tokens matches a prefix of the remaining input. In this case it uses a "panic mode" error
recovery strategy: we delete successive characters from the remaining input until the
lexical analyzer can find a well-formed token. Some other error recovery strategies are
as follows:
1. Deleting an extraneous character
2. Inserting a missing character
3. Replacing an incorrect character by a correct character
4. Transposing two adjacent characters
INPUT BUFFERING
To speed up the operation of the lexical analyzer, we use two buffering techniques.
1. Two buffer input scheme
2. Using Sentinels
Buffer Pairs
This method is used when the lexical analyzer needs to look ahead several
characters beyond the lexeme for a pattern before a match can be announced.
In this method, we divide the input buffer into two N-character halves, where N is
the number of characters in one disk block, e.g., 1024 or 4096.
 : : E : = : M : * : C : * : * : 2 : eof : : :
               ^                        ^
        lexeme_beginning             forward
We read N input characters into each half of the buffer with one system read
command, rather than invoking a read command for each input character. If fewer than
N characters remain in the input, then a special character eof is read into the buffer after
the input character. This character is different from any input character.



 Two pointers into the input buffer, forward and lexeme_beginning, are maintained.
 The string of characters between the two pointers is the current lexeme.
 Initially, both pointers point to the first character of the next lexeme to be found.
 The forward pointer scans ahead until a match for a pattern is found; after the
token is found, it points to the character at its right end.
Code to advance forward pointer is as follows
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
Sentinels
The previous method needs two tests for each advance of the forward pointer.
This can be reduced to a single test by using the sentinel concept: each buffer half holds a
sentinel character at its end. The sentinel is a special character that cannot be part of the
source program. The same buffer arrangement with sentinels is as follows.
 : : E : = : M : * : eof : C : * : * : 2 : eof : : : eof
               ^                            ^
        lexeme_beginning                 forward
Lookahead code with sentinels

forward := forward + 1;
if forward = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else  /* eof within a buffer marks the end of input */
        terminate lexical analysis
end
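A rough C rendering of the sentinel scheme is sketched below; the buffer layout,
the EOF_CH marker and the use of stdin are assumptions made for the sketch, not
details from the notes.

    #include <stdio.h>

    #define N 4096              /* characters per buffer half (one disk block) */
    #define EOF_CH '\0'         /* sentinel: a character that cannot appear in the source */

    static char buf[2 * N + 2]; /* two halves, each followed by a sentinel slot */
    static char *forward;

    /* Refill one half from stdin, writing the sentinel after the last character read. */
    static void reload(char *half) {
        size_t n = fread(half, 1, N, stdin);
        half[n] = EOF_CH;
    }

    /* Prime the first half; the scanner then calls nextchar( ) repeatedly. */
    static void init_buffer(void) { reload(buf); forward = buf; }

    /* Return the character under forward and advance it: in the common case a
       single sentinel test replaces the two end-of-half tests. */
    static int nextchar(void) {
        if (*forward != EOF_CH)
            return (unsigned char)*forward++;
        if (forward == buf + N) {                 /* sentinel ending the first half */
            reload(buf + N + 1);                  /* refill the second half */
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) {  /* sentinel ending the second half */
            reload(buf);                          /* refill the first half */
            forward = buf;
        } else {
            return EOF;                           /* sentinel inside a half: true end of input */
        }
        return nextchar();                        /* re-test: a refilled half may be empty */
    }

    int main(void) {
        init_buffer();
        int c, count = 0;
        while ((c = nextchar()) != EOF)
            count++;
        printf("%d characters\n", count);
        return 0;
    }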
SPECIFICATION OF TOKENS
Strings and Languages
 Any finite set of symbols is called an alphabet or character class. An example of a
binary alphabet is {0,1}.
 A finite sequence of symbols drawn from an alphabet is called a string.
 |s|, the number of occurrences of symbols in s, is called the length of the string s.
 The string of length zero is called the empty string, denoted by ε.
 Any set of strings over some fixed alphabet is called a language.
 A string obtained by removing zero or more trailing symbols of string s is called a
prefix of string s.
 A string formed by deleting zero or more of the leading symbols of s is called a
suffix of string s.
 A string obtained by deleting a prefix and a suffix from s is called a substring of s.
Every prefix and every suffix of s, and s itself, are also substrings.
 Any nonempty string x that is a prefix, suffix, or substring of s such that s ≠ x is
called a proper prefix, suffix, or substring of s.
 Any string formed by deleting zero or more not necessarily contiguous symbols
from s is called a subsequence of s. E.g., baaa is a subsequence of banana.
Operations on Language
There are several important operations that can be applied to languages. Some of
the important operations are as follows.
1. Union of L and M:
   L ∪ M = { s | s is in L or s is in M }
2. Concatenation of L and M:
   LM = { st | s is in L and t is in M }
3. Kleene closure of L:
   L* = L⁰ ∪ L¹ ∪ L² ∪ ...   (the union of Lⁱ for all i ≥ 0)
4. Positive closure of L:
   L⁺ = L¹ ∪ L² ∪ ...        (the union of Lⁱ for all i ≥ 1)
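For example (an illustrative choice of languages), if L = {a, b} and M = {c}, then

    L ∪ M = {a, b, c}
    LM    = {ac, bc}
    L*    = {ε, a, b, aa, ab, ba, bb, aaa, ...}   (all strings over {a, b}, including ε)
    L⁺    = L* - {ε}                              (all nonempty strings over {a, b})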
Regular Expressions
A language denoted by a regular expression is said to be a regular set.
Regular Definitions
We give names to regular expressions and define new regular expressions using
these names as if they were symbols. If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form

    d1 → r1
    d2 → r2
    .....
    dn → rn

where each di is a distinct name and each ri is a regular expression over the symbols in
Σ ∪ {d1, d2, ..., di-1}.
Notational Shorthands
 One or more instances: the unary postfix operator + means "one or more
instances of". The operator + has the same precedence and associativity as
the operator *. The two algebraic identities r* = r⁺ | ε and r⁺ = r r* relate the
Kleene closure and positive closure operators.
 Zero or one instance: the unary postfix operator ? means "zero or one
instance of". The notation ( r )? is a shorthand for r | ε.
 Character classes: the notation [abc], where a, b, c are alphabet symbols,
denotes the regular expression a | b | c. E.g., the regular expression for
identifiers can be described using this notation as
[A-Za-z][A-Za-z0-9]*
Non Regular Sets
Some languages cannot be described by any regular expression; these are called
non-regular sets. For example, repeated strings cannot be described by regular expressions:
{wcw | w is a string of a's and b's }
UNIT - II
RECOGNITION MACHINE
RECOGNITION OF TOKENS
This topic explains how tokens are recognized. For example:

stmt  → if expr then stmt
      | if expr then stmt else stmt
      | ε
expr  → term relop term
      | term
term  → id
      | num

where the terminals if, then, else, relop, id and num generate sets of strings given
by the following regular definitions:

if    → if
then  → then
else  → else
relop → < | <= | = | <> | > | >=
id    → letter ( letter | digit )*
num   → digit+ ( . digit+ )? ( E (+|-)? digit+ )?
ws    → delim+      (where delim stands for a blank, tab, or newline)
Regular expression patterns for tokens are shown in the following table:

Regular Expression    Token    Attribute value
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
Transition Diagrams
A stylized flowchart called a transition diagram which depicts the actions that
take place when a lexical analyzer is called by the parser to get the next token.
 Positions in a transition diagram are drawn as circles and are called states.
 The states are connected by arrows, called edges.
 Edges leaving state s have labels indicating the input characters that can next
appear after the transition diagram has reached state s.
 The label other refers to any character that is not indicated by any of the other
edges leaving s.
 The starting state of a transition diagram is labeled as start state.
Transition diagram for relational operators (states 0-8):

    start 0 --<--> 1    1 --=--> 2        return(relop, LE)
                        1 -->--> 3        return(relop, NE)
                        1 --other--> 4 *  return(relop, LT)
    0 --=--> 5                            return(relop, EQ)
    0 -->--> 6          6 --=--> 7        return(relop, GE)
                        6 --other--> 8 *  return(relop, GT)

(A state marked * retracts the forward pointer one character, since the lookahead
character read on the "other" edge is not part of the lexeme.)

Transition diagram for identifiers and keywords (states 9-11):

    start 9 --letter--> 10;  10 --letter or digit--> 10;
    10 --other--> 11 *       return(gettoken(), install_id())

Transition diagram for unsigned numbers with fraction and exponent (states 12-19):

    start 12 --digit--> 13;  13 --digit--> 13;  13 --.--> 14;  14 --digit--> 15;
    15 --digit--> 15;  15 --E--> 16;  16 --+ or ---> 17;  16 --digit--> 18;
    17 --digit--> 18;  18 --digit--> 18;  18 --other--> 19 *

Transition diagram for numbers with an optional fraction (states 20-24):

    start 20 --digit--> 21;  21 --digit--> 21;  21 --.--> 22;  22 --digit--> 23;
    23 --digit--> 23;  23 --other--> 24 *

Transition diagram for integers (states 25-27):

    start 25 --digit--> 26;  26 --digit--> 26;  26 --other--> 27 *

Transition diagram for white space (states 28-30):

    start 28 --delim--> 29;  29 --delim--> 29;  29 --other--> 30 *
Implementing a Transition Diagram
 A sequence of transition diagrams can be converted into a program to look for the
tokens specified by the diagrams.
 Program size is directly proportional to the no. of states and edges in the diagrams.
 Each state gets a segment of code.
 If any edges leave a state, then its code reads a character and selects an edge to
follow, if possible.
 A function nextchar( ) is used to read the next character from the input buffer,
advance the forward pointer, and return the character read.
 If a transition diagram fails, the fail( ) routine is called; it retracts the forward
pointer and selects the start state of the next transition diagram, or invokes error
recovery if none remains.
 A global variable lexical_value is assigned the pointer returned by the functions
install_id( ) and install_num( ).
 The function nexttoken( ) is used to return the next token of the lexical analyzer.
 Two variables, start and state, hold the starting state of the current transition
diagram and the present state.
Coding for finding next start state
int state = 0, start = 0;
int lexical_value;

int fail( )
{
    forward = token_beginning;
    switch (start) {
        case 0:  start = 9;  break;
        case 9:  start = 12; break;
        case 12: start = 20; break;
        case 20: start = 25; break;
        default: recover( ); break;
    }
    return start;
}
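As an illustration of how the segment of code for one diagram looks in C, the
fragment below handles the identifier diagram (states 9-11); the helper declarations
and the retract( ) routine are assumptions made for the sketch.

    #include <ctype.h>

    /* Helpers described in the text; declarations are assumed for the sketch. */
    int  nextchar(void);
    int  fail(void);
    void retract(int n);   /* hypothetical: move the forward pointer back n characters */
    int  install_id(void);
    int  gettoken(void);
    extern int state, lexical_value;

    int nexttoken(void) {
        int c;
        while (1) {
            switch (state) {
            case 9:
                c = nextchar();
                if (isalpha(c)) state = 10;   /* edge labeled letter */
                else state = fail();          /* not an identifier: try the next diagram */
                break;
            case 10:
                c = nextchar();
                if (isalnum(c)) state = 10;   /* edge labeled letter-or-digit */
                else state = 11;              /* edge labeled other */
                break;
            case 11:
                retract(1);                   /* the other character is not part of the lexeme */
                lexical_value = install_id();
                return gettoken();
            /* ... cases for the remaining diagrams (states 0-8, 12-24, 25-30) ... */
            default:
                state = fail();
                break;
            }
        }
    }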
A LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS
 A particular tool for constructing lexical analyzers is called Lex. The tool is also
called the Lex compiler, and its input is written in the Lex language.
 First, a Lex source program named lex.l is given to the Lex compiler, which
produces as output a tabular representation of a transition diagram in the file
lex.yy.c.
 lex.yy.c is given to the C compiler as input, which produces the output a.out; this
is the actual lexical analyzer.
 Finally, a.out converts the stream of input characters into a sequence of tokens, as
follows.
lex.l        --> Lex compiler --> lex.yy.c
lex.yy.c     --> C compiler   --> a.out
input stream --> a.out        --> sequence of tokens
Lex specification
%{ declarations %}
regular definitions
%%
translation rules
%%
auxiliary procedures

 declarations: include files and manifest constants (identifiers declared to represent
constants).
 regular definitions: definitions of named syntactic constructs, such as letter, using
regular expressions.
 translation rules: pattern / action pairs.
 auxiliary procedures: arbitrary C functions copied directly into the generated
lexical analyzer.
Lex Conventions


 The program generated by Lex matches the longest possible prefix of the input.
For example, if <= appears in the input, then rather than matching only the <
(which is also a legal pattern) the entire string <= is matched.
 Lex keywords are:
   o yylval: value returned by the lexical analyzer (pointer to the token's attribute)
   o yytext: pointer to the lexeme (array of characters that matched the pattern)
   o yyleng: length of the lexeme (number of characters in yytext)
 If two rules match prefixes of the same, greatest length, then the first rule
appearing (sequentially) in the translation section takes precedence.
For example, the lexeme if is matched by both if and {id}. Since the if rule comes
first, that is the match that is used.
The Lookahead Operator
If r1 and r2 are patterns, then r1/r2 means match r1 only if it is followed by r2.
For example,
DO/({letter}|{digit})*=({letter}|{digit})*,
recognizes the keyword DO in the string DO5I=1,25
Finite Automata
Recognizer - A Recognizer for a language is a program that take as input a string x and
answers “yes” if x is a sentence of the language and “no” otherwise.
Finite Automaton - A regular expression is compiled into a recognizer by constructing a
generalized transition diagram called a finite automaton.
The two types of finite automata are deterministic (DFA) and non-deterministic (NFA).
Difference between NFA & DFA
NFA                                          DFA
1. Slower to recognize a regular             Faster to recognize a regular
   expression                                expression
2. Size is small                             Size is bigger than an NFA
3. It may have states with ε-transitions     No state has an ε-transition
4. For the same input, more than one         For each state and input symbol, there is
   transition may occur from a single state  at most one edge labeled with it
Non-deterministic Finite Automata
An NFA is a mathematical model which consists of
1. a set of states S;
2. a set of input symbols Σ;
3. a transition function move that maps state-symbol pairs to sets of states;
4. a state s0 that is distinguished as the start state;
5. a set of states F distinguished as accepting states.
Transition Graph
An NFA can be represented diagrammatically by a labeled directed graph called a
transition graph, in which nodes are the states and the labeled edges represent the transition
function.
Transition Table
A table which contains a row for each state and a column for each input symbol and
ε, if necessary.
Moves
A path can be represented by a sequence of state transitions called moves.
The Language defined by an NFA is the set of input string it accepts.
Deterministic Finite Automata
It is a special case of an NFA in which
1. no state has an ε-transition, and
2. for each state s and input symbol a, there is at most one edge labeled a
leaving s.
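Concretely, a DFA can be stored as a two-dimensional transition table and simulated
in a few lines of C. The sketch below hard-codes the minimized DFA for (a|b)*abb that is
derived in the worked problem later in this unit (states A, B, D, E renumbered 0-3, with 3
accepting); the encoding is an illustration, not a general tool.

    #include <stdio.h>

    /* delta[state][symbol]: symbol 0 = 'a', 1 = 'b'. Rows are A, B, D, E. */
    static const int delta[4][2] = {
        {1, 0},   /* A: a -> B, b -> A */
        {1, 2},   /* B: a -> B, b -> D */
        {1, 3},   /* D: a -> B, b -> E */
        {1, 0},   /* E: a -> B, b -> A */
    };

    static int accepts(const char *s) {
        int state = 0;                        /* start in A */
        for (; *s; s++) {
            if (*s != 'a' && *s != 'b')
                return 0;                     /* symbol outside the alphabet */
            state = delta[state][*s - 'a'];
        }
        return state == 3;                    /* E is the accepting state */
    }

    int main(void) {
        printf("%d\n", accepts("aababb"));    /* 1: the string ends in abb */
        printf("%d\n", accepts("abba"));      /* 0 */
        return 0;
    }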
Conversion of a Regular Expression into an NFA
Given a regular expression there is an associated regular language L(r). Since
there is a finite automata for every regular language, there is a machine, M, for every
regular expression such that L(M) = L(r).
The constructive proof provides an algorithm for constructing a machine, M,
from a regular expression r. The six constructions below correspond to the cases:
1) The entire regular expression is the null string: r = epsilon, L = {epsilon}
2) The entire regular expression is empty: r = phi, L = phi
3) An element a of the input alphabet sigma is in the regular expression: r = a,
   where a is an element of sigma
4) Two regular expressions are joined by the union operator: r = r1 + r2
5) Two regular expressions are joined by concatenation (no symbol): r = r1 r2
6) A regular expression has the Kleene closure (star) applied to it: r = r1*

The construction proceeds by using 1) or 2) if either applies. It first converts all
symbols in the regular expression using construction 3). Then, working from the inside
outward, left to right at the same scope, it applies the one construction that applies from
4), 5) or 6).
The result is an NFA with epsilon moves. This NFA can then be converted to an NFA
without epsilon moves, and further conversion can be performed to get a DFA. All these
machines have the same language as the regular expression from which they were
constructed.
The construction covers all possible cases that can occur in any regular
expression. Because of the generality there are many more states generated than are
necessary. The unnecessary states are joined by epsilon transitions. Very careful
compression may be performed. For example, the fragment regular expression aba
would be
     a        e        b        e        a
q0 ----> q1 ----> q2 ----> q3 ----> q4 ----> q5

with e used for epsilon. This can be trivially reduced to

     a        b        a
q0 ----> q1 ----> q2 ----> q3
Simulating an NFA algorithm
S := ε-closure({s0});
a := nextchar( );
while a ≠ eof do begin
    S := ε-closure(move(S, a));
    a := nextchar( )
end;
if S ∩ F ≠ ∅ then
    return "yes"
else
    return "no";
Conversion of NFA to DFA
The subset construction algorithm for this conversion is as follows.
initially, ε-closure(s0) is the only state in Dstates, and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        Dtran[T, a] := U
    end
end
Computation of ε-closure is done by the following algorithm:

push all states in T onto stack;
initialize ε-closure(T) to T;
while stack is not empty do begin
    pop t, the top element, off of stack;
    for each state u with an edge from t to u labeled ε do
        if u is not in ε-closure(T) then begin
            add u to ε-closure(T);
            push u onto stack
        end
end
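The two algorithms above can be rendered in C using bit masks for state sets; the
sketch below assumes at most 64 states and the eps/move_tab representation shown,
which is an invention of the sketch.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX 64                        /* state sets are 64-bit masks */

    static uint64_t eps[MAX];             /* eps[s]: states reachable from s by one ε-move */
    static uint64_t move_tab[MAX][128];   /* move_tab[s][c]: states reachable from s on input c */

    /* ε-closure(T): keep adding ε-successors until nothing changes
       (equivalent to the stack-based algorithm above). */
    static uint64_t eps_closure(uint64_t T) {
        uint64_t closure = T, old;
        do {
            old = closure;
            for (int s = 0; s < MAX; s++)
                if (closure & (1ULL << s))
                    closure |= eps[s];
        } while (closure != old);
        return closure;
    }

    static uint64_t move_set(uint64_t T, int c) {
        uint64_t U = 0;
        for (int s = 0; s < MAX; s++)
            if (T & (1ULL << s))
                U |= move_tab[s][c];
        return U;
    }

    /* Simulate an NFA with start state s0 and accepting-state mask F. */
    static int simulate(const char *input, int s0, uint64_t F) {
        uint64_t S = eps_closure(1ULL << s0);
        for (; *input; input++)
            S = eps_closure(move_set(S, (unsigned char)*input));
        return (S & F) != 0;              /* is S ∩ F nonempty? */
    }

    int main(void) {
        /* NFA for a*b+, as in the example below, with its two states
           renumbered 0 and 1: an a-loop on 0, b from 0 to 1, a b-loop on 1. */
        move_tab[0]['a'] = 1ULL << 0;
        move_tab[0]['b'] = 1ULL << 1;
        move_tab[1]['b'] = 1ULL << 1;
        printf("%d\n", simulate("aabb", 0, 1ULL << 1));  /* 1 */
        printf("%d\n", simulate("aba",  0, 1ULL << 1));  /* 0 */
        return 0;
    }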
DESIGN OF A LEXICAL ANALYZER GENERATOR USING FA
A specification for a lexical analyzer has the form

P1   { action 1 }
P2   { action 2 }
. . . . . .
Pn   { action n }

where each pattern Pi is a regular expression and each action i is a program fragment that is
to be executed whenever a lexeme matched by pattern Pi is found in the input.
Problem: Suppose more than one pattern matches a single lexeme. We then take the
longest matching lexeme as the solution to the problem.
Example
iftext = 5
In this statement, after reading i and f the prefix if could be taken as a keyword; but
reading the remaining characters up to t gives iftext, which matches the pattern for an
identifier. So the input matches both a keyword and an identifier. For this problem we
take the longest lexeme, iftext, treat it as an identifier, and execute the corresponding
action.
A lexical analyzer is also constructed by using Finite Automata and it may be either
Deterministic or Non deterministic. The model of Lex compiler using Finite Automata is
shown in the following figure.
Lex specification --> Lex compiler --> transition table

input buffer (lexeme_beginning, forward) --> FA simulator <-- transition table

The Lex compiler compiles the Lex-language input into a tabular representation of a
transition diagram. This transition table is given as input to the FA simulator, which
maintains two pointers, forward and lexeme_beginning, that delimit the current lexeme in
the input buffer. The FA simulator may be either an NFA or a DFA.
Pattern matching based on NFA’s
One method is to construct the transition table of an NFA N for the composite pattern
p1 | p2 | ... | pn. This can be done by creating an NFA N(pi) for each pattern pi, then adding
a new start state s0, and finally linking s0 to the start state of each N(pi) with an
ε-transition, as shown below.

         ε --> N(p1)
        /
  s0 --- ε --> N(p2)
        \  ...
         ε --> N(pn)
Example
Consider a Lex program consisting of three regular expressions and no regular
definitions, as follows:
a       { }     /* actions are omitted here */
abb     { }
a*b+    { }

An NFA for each of the three regular expressions:

    start --> 1 --a--> 2                       (for a)

    start --> 3 --a--> 4 --b--> 5 --b--> 6     (for abb)

    start --> 7 --b--> 8                       (for a*b+,
        with an a-loop on state 7 and a b-loop on state 8)

Combined NFA:

        ε --> 1 --a--> 2
       /
  0 --- ε --> 3 --a--> 4 --b--> 5 --b--> 6
       \
        ε --> 7 --b--> 8       (a-loop on 7, b-loop on 8)
Sequence of sets of states entered in processing the input aaba:

              a              a            b            a
  {0,1,3,7} ----> {2,4,7} ----> {7} ----> {8} ----> (no state)
                   p1: a                   p3: a*b+

We consider the string aaba, which can match more than one pattern of the NFA; the
action executed corresponds to the pattern matched. The starting set of states is
{0,1,3,7}. When the first input symbol a is read, we reach the set of states {2,4,7};
state 2 is an accepting state for the first pattern, a. On the next input symbol a we reach
only state 7, which is not accepting for any pattern. The third input symbol is b, which
takes us to state 8; state 8 is an accepting state, so the prefix aab matches the third
pattern, a*b+. On the last input symbol a no state is reached, so the simulation stops,
and the action executed is that of the longest matched prefix, aab (pattern a*b+).
DFA for Lexical Analyzers
Here we construct the transition table of a DFA for the same example:

State    a       b       Pattern announced
0137     247     8       none
247      7       58      a
8        -       8       a*b+
7        7       8       none
58       -       68      a*b+
68       -       8       abb
Optimizing of DFA-based Pattern Matchers
Three algorithms are used to optimize a DFA-based pattern matcher:
1. Convert the regular expression directly into a DFA.
2. Minimize the number of states of the DFA.
3. Make the transition table compact.
Important states of an NFA
1. A state of an NFA is important if it has a non-ε transition leaving it.
2. The accepting state is made important by adding a unique right-end
marker # at the end of the regular expression. The regular expression
is then called an augmented regular expression.
1. From Regular Expression to a DFA
1. Convert the regular expression into an augmented regular expression ( r )#.
2. Construct a syntax tree T for ( r )#.
3. Compute four functions, nullable, firstpos, lastpos and followpos, by
making traversals over T. The first three functions are defined on the nodes of
the syntax tree, and the last one is defined on the set of positions.
4. Finally, we construct the DFA from followpos.
Example
Consider the regular expression (a|b)*abb#
Firstpos and lastpos are computed for the nodes in the syntax tree for (a|b)*abb#, and
the followpos table is as follows:

NODE    followpos
1       {1,2,3}
2       {1,2,3}
3       {4}
4       {5}
5       {6}
6       -
2. Minimizing the number of states of a DFA
Input: A DFA M with set of states S, set of inputs Σ, transitions defined for all states
and inputs, start state s0, and set of accepting states F.
Output: A DFA M' accepting the same language as M and having as few states as
possible.
Method:
1. Construct an initial partition Π of the set of states with two groups: the
accepting states F and the non-accepting states S - F.
2. Apply the following procedure to Π to construct a new partition Πnew:
    for each group G of Π do begin
        partition G into subgroups such that two states s and t of G
        are in the same subgroup if and only if, for all input symbols a,
        states s and t have transitions on a to states in the same group
        of Π;
        replace G in Πnew by the set of all subgroups formed
    end
3. If Πnew = Π, let Πfinal = Π and continue with step 4. Otherwise, repeat step
2 with Π := Πnew.
4. Choose one state in each group of the partition Πfinal as the representative
for that group. The representatives will be the states of the reduced DFA
M'.
5. If M' has a dead state, that is, a state d that is not accepting and that has
transitions to itself on all input symbols, then remove d from M'. Also
remove any states not reachable from the start state. Any transitions to d
from other states become undefined.
3. State Minimization in Lexical Analyzers
Using a table-compression method, we can make the transition table compact.
Normally the transition table is a two-dimensional array. Here, we use a data structure
consisting of four arrays indexed by state numbers. The base array is used to determine
the base location of the entries for each state, which are stored in the next and check
arrays. The default array is used to determine an alternative base location in case the
current base location is invalid.
To compute nextstate(s,a), the transition for state s on input symbol a, we first consult the
pair of arrays next and check. We find their entries for state s in location l = base[s]+a,
where a is treated as an integer. We take next[ l ] to be the next state for s on input a if
check[ l ] = s. If check[ l ] ≠ s, we determine q = default[s] and repeat the entire
procedure recursively, using q in place of s. The procedure is the following:
procedure nextstate(s, a);
    if check[base[s] + a] = s then
        return next[base[s] + a]
    else
        return nextstate(default[s], a)
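A literal C transcription of this procedure is sketched below; the arrays are renamed
next_arr, check_arr and default_arr because default is a C keyword, and filling them is
the job of the table-compression step, which is not shown.

    /* Four-array compressed transition table, assumed to be filled in elsewhere. */
    extern int base[], next_arr[], check_arr[], default_arr[];

    /* Transition for state s on input symbol a (a treated as an integer). */
    int nextstate(int s, int a) {
        int l = base[s] + a;
        if (check_arr[l] == s)
            return next_arr[l];
        return nextstate(default_arr[s], a);   /* fall back to the default state */
    }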
Problem
Convert the regular expression (a|b)*abb into an NFA and then to a DFA.
i) Regular expression into NFA:

Thompson's construction gives states S0 to S10, with S10 the final state and the
transitions:

    S0 --ε--> S1      S0 --ε--> S7
    S1 --ε--> S2      S1 --ε--> S4
    S2 --a--> S3      S4 --b--> S5
    S3 --ε--> S6      S5 --ε--> S6
    S6 --ε--> S1      S6 --ε--> S7
    S7 --a--> S8
    S8 --b--> S9
    S9 --b--> S10     (final)
ii) NFA to DFA:
a) Finding ε-closure for all states:

ε-closure(S0) = {S0, S1, S2, S4, S7} = A
MOV(A, a) = {S3, S8}
ε-closure(MOV(A, a)) = {S1, S2, S3, S4, S6, S7, S8} = B
MOV(A, b) = {S5}
ε-closure(MOV(A, b)) = {S1, S2, S4, S5, S6, S7} = C
MOV(B, a) = {S3, S8}
ε-closure(MOV(B, a)) = B
MOV(B, b) = {S5, S9}
ε-closure(MOV(B, b)) = {S1, S2, S4, S5, S6, S7, S9} = D
MOV(C, a) = {S3, S8}
ε-closure(MOV(C, a)) = B
MOV(C, b) = {S5}
ε-closure(MOV(C, b)) = C
MOV(D, a) = {S3, S8}
ε-closure(MOV(D, a)) = B
MOV(D, b) = {S5, S10}
ε-closure(MOV(D, b)) = {S1, S2, S4, S5, S6, S7, S10} = E
MOV(E, a) = {S3, S8}
ε-closure(MOV(E, a)) = B
MOV(E, b) = {S5}
ε-closure(MOV(E, b)) = C
b) Transition table

State    a    b
A        B    C
B        B    D
C        B    C
D        B    E
E        B    C
c) Minimization
The available states are A, B, C, D, E. States A and C (neither of which is a final
state) have identical transitions, so they are equivalent: remove state C and put A in
its place.
d) Minimized transition table

State    a    b
A        B    A
B        B    D
D        B    E
E        B    A
e) DFA (minimized)

    A --a--> B,  A --b--> A
    B --a--> B,  B --b--> D
    D --a--> B,  D --b--> E
    E --a--> B,  E --b--> A     (E is the final state)
Example of Converting an NFA to a DFA
This is one of the NFA examples in the lecture notes; here we convert it to a DFA.
(The regular expression it was built from is not relevant to this conversion; the
machine itself is given by DELTA below.)
This machine is M = ({1, 2, 3, 4, 5, 6, 7, 8}, {a, b, c}, DELTA, 1, {8}) where DELTA =
{
(1, b, 1),
(1, epsilon, 2),
(2, epsilon, 7),
(2, b, 3),
(2, b, 5),
(3, a, 4),
(3, c, 4),
(4, c, 2),
(4, c, 7),
(5, a, 6),
(5, b, 6),
(6, c, 2),
(6, epsilon, 2),
(6, c, 7),
(6, epsilon, 7),
(7, b, 8) }. Note here that DELTA is a relation (a set of triples).
We are computing M' = (K', sigma, delta', s', F'). Note that sigma is the same as that of
the NFA. In this case sigma = {a, b, c}.
Step 1: Compute E(q) for all states, q in K
E(q) is the set of states reachable from q using only (any number of) epsilon transitions.
q E(q)
1 {1, 2, 7}
2 {2, 7}
3 {3}
4 {4}
5 {5}
6 {2, 6, 7}
7 {7}
8 {8}
Step 2: Compute s' = E(s)
Here E(s) = E(1) = {1, 2, 7}.
Step 3: Compute delta'.
We start from E(s), where s is the start state of the original machine. We add
states as necessary.
q \ sigma           a            b                    c
{1,2,7}             {}           {1,2,3,5,7,8}        {}
{}                  {}           {}                   {}
{1,2,3,5,7,8}       {2,4,6,7}    {1,2,3,5,6,7,8}      {4}
{2,4,6,7}           {}           {3,5,8}              {2,7}
{1,2,3,5,6,7,8}     {2,4,6,7}    {1,2,3,5,6,7,8}      {2,4,7}
{4}                 {}           {}                   {2,7}
{3,5,8}             {2,4,6,7}    {2,6,7}              {4}
{2,7}               {}           {3,5,8}              {}
{2,4,7}             {}           {3,5,8}              {2,7}
{2,6,7}             {}           {3,5,8}              {2,7}
delta'(StateSet, inputSymbol) = the union of E(q) for all q such that (p, inputSymbol, q)
is in DELTA and p is in StateSet. We'll do one example in gory detail: delta'({1, 2, 7}, b) =
{1, 2, 3, 5, 7, 8} because you can reach {1, 3, 5, 8} on "b" transitions (DELTA contains
(1, b, 1), (2, b, 3), (2, b, 5), and (7, b, 8)) and E(1) = {1, 2, 7}, E(3) = {3}, E(5) = {5}, and
E(8) = {8}. If you union all of those together, you get {1, 2, 3, 5, 7, 8}.
Step 4: Enumerate K', the set of states
The states are just the entries in the left column: K' =
{{1,2,7}, {}, {1,2,3,5,7,8}, {2,4,6,7}, {1,2,3,5,6,7,8}, {4}, {3,5,8}, {2,7}, {2,4,7},
{2,6,7}}. There are 10 states, but 2^8 = 256 were possible.
Step 5: Compute F'
The final states are the states from K' that have some intersection with the final
state(s) of the original machine. In this case, since 8 was the only final state of the
original machine, our final states in the DFA are those states that have 8 in them: F' =
{{1,2,3,5,7,8}, {1,2,3,5,6,7,8}, {3,5,8}}.
Putting it all together
Our DFA, then, is M' = (K', {a, b, c}, delta', {1, 2, 7}, F') where K', delta', and F'
are as above.
Minimizing the DFA
Note that if the DFA is minimized (like in the project), then the states {2, 4, 6, 7}, {2,
4, 7}, and {2, 6, 7} coalesce, leaving an 8-state machine. (State minimization is a
separate algorithm.)
UNIT – III
THE ROLE OF THE PARSER
The parser obtains a string of tokens from the lexical analyzer and verifies that the
string can be generated by the grammar for the source language. It also reports any syntax
errors in an intelligible fashion. It should also recover from commonly occurring errors so
that it can continue processing the remainder of its input.
  source      +------------------+   token    +--------+  parse   +-----------+
  program --> | Lexical Analyzer | ---------> | Parser | -------> | Rest of   | --> intermediate
              |                  | <--------- |        |  tree    | front end |     representation
              +------------------+  get next  +--------+          +-----------+
                        |            token        |
                        +----- Symbol table ------+
There are three general types of parsers for grammars. They are
Universal parsing - It can parse any grammar, but it is too inefficient to use in
production compilers.
Top-down parsing - It constructs the parse tree from the root to the leaves.
Bottom-up parsing - It constructs the parse tree from the leaves to the root.
In both top-down and bottom-up parsing the input is scanned from left to right,
one symbol at a time. These methods work most efficiently on subclasses of grammars,
the LL and LR grammars, which describe most syntactic constructs in
programming languages.
Errors occur at different levels:
 Lexical, such as misspelling an identifier, keyword or operator.
 Syntactic, such as an arithmetic expression with unbalanced parentheses.
 Semantic, such as an operator applied to an incompatible operand.
 Logical, such as an infinitely recursive call.
Goals of the error handler in a parser
 It should report the presence of errors clearly and accurately.
 It should recover from each error quickly enough to be able to detect
subsequent errors.
 It should not significantly slow down the processing of correct programs.
ERROR RECOVERY STRATEGIES
To recover from syntactic errors, the parser has several general strategies. They
are
 Panic mode
 Phrase level
 Error productions
 Global correction
Panic mode recovery
 It can be used by most parsing methods.
 On discovering an error, it discards input symbols one at a time until one of a
designated set of synchronizing tokens is found.
 The synchronizing tokens are usually delimiters, such as semicolon or end.
 These tokens may vary depending on the programming language.
Advantages
It is the simplest method to implement.
It is guaranteed not to go into an infinite loop.
It is adequate where multiple errors in the same statement are rare.
Disadvantage
It skips a considerable amount of input without checking it for additional
errors.
Phrase level recovery
 On discovering an error, a parser may perform local correction on the
remaining input.
 It may replace a prefix of the remaining input by some string and continue the
parsing.
 Example : replace a comma by a semicolon, delete an extraneous semicolon,
or insert a missing semicolon.
 We must be careful to choose replacements that do not lead to infinite loops.
 Used in top-down parsing.
Advantages
This type of replacement can correct any input string
It has been used in several error-repairing compilers.
Disadvantage
The drawback is the difficulty it has in coping with situations in which the
actual error has occurred before the point of detection.
Error Production
If we have an idea about the common errors that might be encountered, we can
augment the grammar for the language at hand with productions that generate the
erroneous constructs. We then use the grammar augmented by these error productions to
construct a parser. If an error production is used by the parser, we can generate
appropriate error diagnostics to indicate the erroneous construct that has been recognized
in the input.
Global correction
We would like a compiler to make as few changes as possible in processing an
incorrect input string. There are algorithms for choosing a minimal sequence of changes
to obtain a globally least cost correction. Given an incorrect input string x and grammar
G, these algorithms will find a parse tree for a related string y, such that the no. of
insertions, deletions and changes of tokens required to transform x into y is as small as
possible.
Disadvantages
1. It is more expensive to implement.
2. It takes more time and occupies more space.
DERIVATIONS & REDUCTIONS
Derivations
A nonterminal can be expanded, and it can derive strings of tokens. This process
is called derivation.
Reductions
A terminal does not derive any string, but terminals can be reduced to a
nonterminal. This reverse process is called reduction.
Example:
E → E * E
E → E + E
E → id

i) Using the above grammar, derive the string "id+id*id":
E ⇒ E + E          [expansion by E → E+E]
  ⇒ id + E         [expansion by E → id]
  ⇒ id + E * E     [expansion by E → E*E]
  ⇒ id + id * E    [expansion by E → id]
  ⇒ id + id * id   [expansion by E → id]

ii) Using the above grammar, reduce the string "id+id*id" to the starting
nonterminal:
id + id * id
  ⇒ id + id * E    [reduction by E → id]
  ⇒ id + E * E     [reduction by E → id]
  ⇒ E + E * E      [reduction by E → id]
  ⇒ E + E          [reduction by E → E*E]
  ⇒ E              [reduction by E → E+E]
WRITING A GRAMMAR
The following reasons explain why regular expressions are used to define the
lexical syntax of a language.
1. The lexical rules of a language are frequently quite simple.
2. It provides more concise and easier to understand notation for tokens than
grammars.
3. More efficient lexical analyzers can be constructed automatically from regular
expressions than from arbitrary grammars.
4. Separating the syntactic structure of a language into lexical and non lexical
parts provides a convenient way of modularizing the front end of a compiler
into two manageable-sized components.
Eliminating ambiguity
Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity.
Consider, for example, the following "dangling-else" grammar:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

According to this grammar, the compound conditional statement
if E1 then if E2 then S1 else S2 has the two parse trees shown below.
Parse tree 1 (else matched with the closest then):

stmt
├── if
├── expr: E1
├── then
└── stmt
    ├── if
    ├── expr: E2
    ├── then
    ├── stmt: S1
    ├── else
    └── stmt: S2

Parse tree 2:

stmt
├── if
├── expr: E1
├── then
├── stmt
│   ├── if
│   ├── expr: E2
│   ├── then
│   └── stmt: S1
├── else
└── stmt: S2
In all the programming languages, the first parse tree is preferred. The general
rule is “Match each else with the closest previous unmatched then”. This disambiguating
rule can be incorporated directly into grammar. For example, we can rewrite grammar as
the following unambiguous grammar. The idea is that a statement appearing between a
then and an else must be matched, i.e., it must not end with an unmatched then followed
by any statement, for the else would then be forced to match this unmatched then. A
matched statement is either an if-then-else statement containing no unmatched statements
or it is any other kind of unconditional statement. Thus, we may use the grammar
stmt            → matched_stmt
                | unmatched_stmt
matched_stmt    → if expr then matched_stmt else matched_stmt
                | other
unmatched_stmt  → if expr then stmt
                | if expr then matched_stmt else unmatched_stmt
ELIMINATION OF LEFT RECURSION
Left recursion
A grammar is left recursive if it has a nonterminal A such that there is a
derivation A --> AX for some input string, where X is a sequence of grammar symbols.
(Here the nonterminal A is recursively called at the left.) Top-down parsers cannot
handle such left-recursive grammars, so the left recursion must be eliminated. This can
be done by the following method.
Left recursive grammar:        A-->AX/Y
Left recursion elimination:    A-->YA'
                               A'-->XA'/ε
where X and Y are sequences of grammar symbols and ε is the empty string.
Left recursive grammar:    E-->E+T/T
Apply the above rule. Here A is E; X is +T; Y is T.
So after left recursion elimination:
E-->TE'
E'-->+TE'/ε
Left factoring
This is a useful transformation for making a grammar suitable for predictive
parsing. If a nonterminal has two choices for expansion that begin the same way, then
there is confusion in selecting the choice for a particular input string. For example:
A-->XB/XC
Now there is confusion about which alternative is to be selected for any input string
starting with X. This problem can be solved by left factoring.
Left factoring is a transformation for factoring out the common prefixes. For the
above grammar the application of left factoring will result as
A-->XA’
A’-->B/C
Depending on how the parse tree is created, there are different parsing techniques.
These parsing techniques are categorized into two groups:
1. Top-Down Parsing
2. Bottom-Up Parsing
TOP-DOWN PARSING
 Construction of the parse tree starts at the root, and proceeds towards the leaves.
 Efficient top-down parsers can be easily constructed by hand.
 Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
Recursive descent predictive parser
Recursive descent parsers are easily created from context-free grammar productions.
Recursive descent is a top-down technique because it works by trying to match the
program text against the start symbol and successively replaces symbols by symbols
representing their constituents. This process can be regarded as constructing the parse
tree in a top-down direction. Recursive descent parsers are often also called LL parsers
because they deal with the input from left to right (the first L) and construct a leftmost
derivation (the second L).
A recursive descent parser is a collection of procedures, one for each unique
nonterminal. Each procedure is responsible for parsing the kind of construct described by
its nonterminal. Since the syntax of most programming languages is recursive, the
resulting procedures are also usually recursive, hence the name "recursive descent".
The parser maintains an invariant in that a global variable always contains the first
token in the input that has not been examined by the parser. Every time a token is
"consumed" the parser will call the lexical analyser to get another token.
Parsing using a recursive descent parser is started by calling the lexical analyser to get
the first token. Then we call the procedure corresponding to the grammar start symbol.
When this procedure returns the parse is complete.
The body of a parsing procedure for a non-terminal X is constructed by considering
the grammar productions with X on their left-hand side. A non-terminal on the
right-hand side of one of these productions turns into a call to the parsing procedure
for that non-terminal. A terminal (literal or non-literal) turns into a test to make sure
that the current token matches the required terminal, and a call to get another token.
For example, consider the following production and its associated parsing procedure.
Statement : Name ':=' Expression.
void Statement ()
{
Name ();
if (current token is not a colon equals)
report a colon equals missing;
get a token;
Expression ();
}
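As a concrete illustration, the pseudocode above might be realised in C as follows. This is only a sketch: the token codes, the global current_token, and the routines next_token(), error(), Name() and Expression() are hypothetical helpers introduced for the illustration, not part of the text.

/* A concrete C rendering of the pseudocode above - only a sketch.
   TOK_ASSIGN, current_token, next_token(), error(), Name() and
   Expression() are hypothetical names, not part of the text. */

enum { TOK_ID, TOK_ASSIGN /* ':=' */ /* , ... other token codes */ };

extern int current_token;        /* first token not yet examined      */
void next_token(void);           /* get another token from the lexer  */
void error(const char *msg);     /* report a syntax error             */
void Name(void);                 /* parsing procedure for Name        */
void Expression(void);           /* parsing procedure for Expression  */

/* Statement : Name ':=' Expression. */
void Statement(void)
{
    Name();                          /* non-terminal: call its procedure */
    if (current_token != TOK_ASSIGN) /* terminal: test the current token */
        error("colon equals missing");
    next_token();                    /* consume ':='                     */
    Expression();                    /* non-terminal: call its procedure */
}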
Of course, many non-terminals appear on the left-hand side of more than one
production. The parsing procedure must deal with all of these cases by checking at
the beginning of the parsing procedure. For example, expressions might come in a
few varieties.
Expression : Integer / Identifier / '(' Expression ')' /
...
void Expression ()
{
    if (current token is an integer or an identifier) {
        get a token;
    } else if (current token is a left parenthesis) {
        get a token;
        Expression ();
        if (current token is not a right parenthesis)
            report a right parenthesis missing;
        get a token;
    } else if ( ... other cases ... ) {
        ...
    } else
        report an illegal expression;
}
Decision making in recursive descent parsers
As in the Expression case above, for non-terminal symbols with more than one
production the parsing procedure needs to make a decision between the productions. To
obtain a deterministic parser (which we always want to have for a compiler) we must
guarantee that the appropriate choice is uniquely determined by the basic symbols.
To make a decision in a recursive descent parser we need to know which symbols
predict a particular production. Usually we try to achieve the ability to parse with one
token lookahead. This is because otherwise we need to store more than one token from
the lexical analyser which is not impossible but complicates matters. One token
lookahead is sufficient for most programming languages.
We can define the PREDICT sets for each production as follows using the auxiliary
sets FIRST and FOLLOW. We provide an informal definition. The text gives a more
mathematical definition with an algorithm for calculating these sets.
The FIRST set of a symbol A is the set of tokens that could be the first token of an A,
plus epsilon if A can derive epsilon (in other words, if an A can be empty).
FIRST can be extended to sequences of symbols by saying that the FIRST of a
sequence A1 A2 A3 ... An is FIRST(A1) union FIRST(A2) if epsilon is in FIRST(A1),
union FIRST(A3) if epsilon is in both FIRST(A1) and FIRST(A2), and so on.
The FOLLOW set of a symbol A is the set of tokens that can follow an A in a
syntactically legal program, plus epsilon if A can occur at the end of a program.
The PREDICT set of a production N : A1 A2 A3 ... An is FIRST(A1 A2 A3 ... An)
(without epsilon) plus FOLLOW(N) if A1 A2 A3 ... An can derive epsilon.
The PREDICT sets for the alternative productions of N are used when writing the
recursive descent parsing procedure for N. If the next unexamined token is in the
PREDICT set for a production then we predict that alternative. This is what we did earlier
in the parsing procedure for Expression.
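For instance, with the expression grammar used later in this unit (E' -> +TE' / ε), PREDICT(E' -> +TE') = FIRST(+TE') = {+}, while PREDICT(E' -> ε) = FOLLOW(E') = {), $}. The two sets are disjoint, so one token of lookahead is enough to choose between the alternatives for E'.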
Transforming grammars for recursive descent
Problems arise if the PREDICT sets for the productions of a non-terminal overlap. If
this is the case it is not possible to accurately predict a single production using just one
token of lookahead.
To ensure that decision making with one token lookahead is possible in a recursive
descent parser it may be necessary to transform the grammar. The intention is to change
the grammar so that it is acceptable to our parsing method, but defines the same language
as the original grammar.
Two common situations arise: left recursion and common prefixes.
Non-recursive predictive parser
It is a top-down parser. As the name implies, it is not recursive. It needs the
following components to check whether a given string is successfully parsed or not:
an input buffer, a stack, a parsing routine and a parsing table.
The input buffer keeps the input string to be parsed. The input string is
followed by the symbol '$'. This is used to indicate that the input string is terminated;
it is used as the right end marker.
The stack always keeps grammar symbols, which may be non-terminals or
terminals. Initially '$' is pushed onto the stack. After that, as parsing progresses,
grammar symbols are pushed and popped; this '$' is used to announce the completion
of parsing.
The parsing table is generally a two-dimensional array. An entry in the table is
referred to as T(A, a), where 'A' is a non-terminal, 'a' is a terminal and 'T' is the
table name.
(Figure: model of a non-recursive predictive parser - an input buffer holding a+b$, a
stack holding X Y Z $ with X on top, the parsing program, the parsing table, and the
output.)
The program considers X, the symbol on top of the stack, and a, the current
input symbol. There are three possibilities:
1. If X = a = $, the parser halts with successful completion.
2. If X = a ≠ $, the parser pops X off the stack and moves the input pointer to
the next symbol.
3. If X is a non-terminal, the program consults entry T(X, a) of the parsing
table; X is removed from the stack and replaced by the right side of the
corresponding production.
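The parsing program itself is a simple loop. The following C sketch shows one possible shape of it; the symbol encoding and the helper routines (push, pop, is_terminal, table, expand, next_input, advance) are assumptions introduced for this illustration, not part of the text.

/* Sketch of the non-recursive predictive parsing program in C.
   Grammar symbols are small integers; DOLLAR is the end marker. */

#define MAXSTACK 256
#define DOLLAR 0

int stack[MAXSTACK];
int top = -1;

void push(int s) { stack[++top] = s; }
void pop(void)   { --top; }

int  is_terminal(int s);   /* classification of grammar symbols       */
int  table(int X, int a);  /* parsing table T(X, a); -1 means error   */
void expand(int p);        /* pop X, push right side of p (reversed)  */
int  next_input(void);     /* current input symbol                    */
void advance(void);        /* move the input pointer                  */

int parse(int start_symbol)
{
    push(DOLLAR);
    push(start_symbol);
    for (;;) {
        int X = stack[top];              /* symbol on top of the stack */
        int a = next_input();            /* current input symbol       */
        if (X == DOLLAR && a == DOLLAR)
            return 1;                    /* X = a = $ : accept         */
        if (X == DOLLAR || is_terminal(X)) {
            if (X != a)
                return 0;                /* mismatch: error            */
            pop();                       /* X = a != $ : pop and       */
            advance();                   /* advance the input pointer  */
        } else {
            if (table(X, a) < 0)
                return 0;                /* error entry in the table   */
            expand(table(X, a));         /* replace X by its right side */
        }
    }
}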
Reduction by predictive parser:
Productions
E  -> TE'
E' -> +TE' / ε
T  -> FT'
T' -> *FT' / ε
F  -> (E) / id

Stack        Input          Output
$E           id+id*id$      E -> TE'
$E'T         id+id*id$      T -> FT'
$E'T'F       id+id*id$      F -> id
$E'T'id      id+id*id$
$E'T'        +id*id$        T' -> ε
$E'          +id*id$        E' -> +TE'
$E'T+        +id*id$
$E'T         id*id$         T -> FT'
$E'T'F       id*id$         F -> id
$E'T'id      id*id$
$E'T'        *id$           T' -> *FT'
$E'T'F*      *id$
$E'T'F       id$            F -> id
$E'T'id      id$
$E'T'        $              T' -> ε
$E'          $              E' -> ε
$            $
Steps involved in non-recursive predictive parsing:
1. The input buffer is filled with the input string, with $ as the right end marker.
2. The stack is initialized with $.
3. The parsing table T is constructed using FIRST( ) and FOLLOW( ).
Computation of FIRST( )
1. If X is a terminal, then FIRST(X) = {X}.
2. If X -> ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X -> aα is a production, then add a to FIRST(X).
   e.g. X -> a gives FIRST(X) = {a}.
4. If X -> Y1 Y2 ... Yn, then add FIRST(Y1) to FIRST(X); if ε is in FIRST(Y1),
   also add FIRST(Y2), and so on.
Computation of FOLLOW( )
1. $ is in FOLLOW(S), where S is the start symbol; so initially FOLLOW(S) = {$}.
2. If there is a production A -> αBβ, then everything in FIRST(β) except ε is in
   FOLLOW(B).
3. If there is a production A -> αB, or a production A -> αBβ where FIRST(β)
   contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
The FIRST( ) and FOLLOW( ) sets for the above productions are
FIRST (E) = FIRST (T) = FIRST (F) = { (, id }
FIRST (E') = { +, ε }
FIRST (T') = { *, ε }
FOLLOW (E) = { $, ) }
FOLLOW (E') = FOLLOW (E) = { $, ) }
FOLLOW (T) = { +, $, ) }
FOLLOW (T') = FOLLOW (T) = { +, $, ) }
FOLLOW (F) = { *, +, $, ) }
Parsing table

      id         +            *            (          )          $
E     E -> TE'                             E -> TE'
E'               E' -> +TE'                           E' -> ε    E' -> ε
T     T -> FT'                             T -> FT'
T'               T' -> ε      T' -> *FT'              T' -> ε    T' -> ε
F     F -> id                              F -> (E)
ERROR RECOVERY IN PREDICTIVE PARSING
We can use both panic mode and phrase level strategies for recovering from errors
in predictive parsing.
BOTTOM-UP PARSING
 Construction of the parse tree starts at the leaves, and proceeds towards the root.
 Normally efficient bottom-up parsers are created with the help of some software
tools.
 Bottom-up parsing is also known as shift-reduce parsing.
 Operator-Precedence Parsing - simple, restrictive, easy to implement
 LR Parsing - a much more general form of shift-reduce parsing: LR, SLR, LALR
Shift-reduce parsers
In contrast to a recursive descent parser that constructs the derivation "top-down"
(i.e., from the start symbol), a shift-reduce parser constructs the derivation "bottom-up"
(i.e., from the input string). Shift-reduce parsers are often used as the target of parser
generation tools. Some reasons for their popularity are the large class of grammars that
can be parsed in this way (there are more grammars in this class than in the class that can
be processed using recursive descent) and our ability to implement them efficiently.
Shift-reduce parsers are often called LR parsers because they process the input
left-to-right and construct a rightmost derivation in reverse.
In the following we will briefly describe how shift-reduce parsers work, because
knowledge of their operation is useful when using parser generators (which you might
have to do in the future). Our concentration is on the basic mechanisms used during
parsing, not on the techniques used to generate such parsers (which can be quite
complex). The text has much more detail which you can study if you are interested.
Informally, a shift-reduce parser starts out with the entire input string and looks
for a substring that matches the right-hand side of a production. If one is found, the
substring is replaced by the left-hand side symbol of the production. This step is a
reduction. The parser then looks for another substring (now possibly containing a nonterminal symbol), replaces it, and so on. Reductions occur until the string is reduced to
just the start symbol. If no reductions are possible at any stage, it might mean that the
string is not a sentence in the language defined by the grammar, or it might mean that an
earlier reduction was performed in error. The most complex parts of defining a
shift-reduce parser are locating valid substrings (called handles) and determining when and if
reductions should be performed on which handles.
S : 'a' A B 'e'.
A : A 'b' 'c' / 'b'.
B : 'd'.
abbcde => aAbcde => aAde => aABe => S
Shift-reduce parsers can be described by machines operating on a stack of
symbols and an input buffer containing the input text. Initially, the stack is empty and the
input buffer contains the entire input string. A step of the parser examines the top of the
stack to see if a handle is present. If so, then a reduction could be performed (but doesn't
have to be). If a reduction is not possible or is not desirable, the parser shifts the next
input symbol from the input buffer to the top of the stack. The process then repeats. If the
parser reaches a state where the stack contains just the start symbol and the input buffer is
empty, then the input has been correctly parsed. If the input is consumed but the stack can
not be reduced, then the input is not a sentence.
Grammar      : E : E '+' E / E '*' E / '(' E ')' / id.
Input String : id + id * id

Stack           Input           Action
                id + id * id    initial state
id              + id * id       shift id
E               + id * id       reduce by E : id
E +             id * id         shift +
E + id          * id            shift id
E + E           * id            reduce by E : id
E + E *         id              shift *
E + E * id                      shift id
E + E * E                       reduce by E : id
E + E                           reduce by E : E '*' E
E                               reduce by E : E '+' E
Note that in this example the decisions about when to reduce are crucial. When
the stack contains E + E for the first time we could reduce it to E. If this is done and the
parse is completed, we end up with a second derivation for this input string. That both
exist is no surprise, however, as we previously noted that this grammar is ambiguous.
Example : Consider the following grammar
S -> CC
C -> cC
C -> d
Consider the input string : cdcd

Input string        Reduction used
cdcd
cdcC                C -> d
cdC                 C -> cC
cCC                 C -> d
CC                  C -> cC
S                   S -> CC
Handles
A handle of a string is a substring that matches the right side of a production.
Reducing a handle helps in constructing the parse tree, i.e. the rightmost derivation
in reverse.
Ex : A -> aXb
     X -> c
Here 'c' is a handle. It reduces to X, and helps in the construction of the parse tree,
i.e. to reach the start symbol A, for the input string.
The input string acb:
acb => aXb => A
The process of obtaining the start symbol while constructing the bottom-up
parse tree, by reducing the handles to their respective non-terminals, is called handle
pruning.
Shift-reduce parsing Actions
1. Shift - shift the next input symbol onto the stack when there is no handle to
reduce.
2. Reduce - reduce by a production: the handle on top of the stack is replaced by
the corresponding non-terminal.
3. Accept - the input string is valid and parsing completed successfully.
4. Error - there is a syntax error; the parser calls an error recovery routine.
Stack implementation of Shift –reduce parsing
Ex : the input string id1+id2*id3
Stack           Input            Action
$               id1+id2*id3$     shift
$id1            +id2*id3$        reduce by E -> id
$E              +id2*id3$        shift
$E+             id2*id3$         shift
$E+id2          *id3$            reduce by E -> id
$E+E            *id3$            shift
$E+E*           id3$             shift
$E+E*id3        $                reduce by E -> id
$E+E*E          $                reduce by E -> E*E
$E+E            $                reduce by E -> E+E
$E              $                accept
Parsing conflicts
Sometimes the grammar is written in such a way that the parser generator cannot
determine what to do in every possible circumstance. (We assume a lookahead of one
symbol so only the first symbol in the input buffer can be examined at each step. More is
possible if multiple symbol lookahead is allowed, but this complicates the parsing
process.) Situations where problems can occur are called conflicts.
A shift-reduce conflict exists if the parser cannot decide in some situation whether
to shift the input symbol or to reduce using a handle on the top of the stack. A common
situation where a shift-reduce conflict occurs is the dangling-else problem. Once an
'if-then' has been seen the parser cannot choose between shifting an 'else' (so that it
becomes part of the most recently seen 'if') or reducing the 'if' (so that the 'else'
becomes part of a preceding 'if'). This problem occurs because the grammar is
ambiguous (recall the discussion for recursive descent parsers).
Stmt : 'if' Expression 'then' Stmt
| 'if' Expression 'then' Stmt 'else' Stmt
| ...
A reduce-reduce conflict exists if the parser cannot decide by which production to
make a reduction. This commonly occurs when two productions have the same
right-hand side (or one with the same structure), but the left context (i.e., what the parser has
seen) is not sufficient to distinguish between them. The following grammar defines part
of a language like FORTRAN where both procedure calls and array accesses are written
using parentheses.
stmt : id '(' param_list ')' / expr ':=' expr.
param_list : param / param_list ',' param.
param : id.
expr : id '(' expr_list ')' / id.
expr_list : expr / expr_list ',' expr.
This grammar has a reduce-reduce conflict between the two productions 'param : id'
and 'expr : id'. This can be seen by considering the input 'id(id,id)'. After the first
three symbols have been shifted, either production could be used to reduce the id on
the top of the stack. The correct decision can only be made by knowing whether the
first id is a procedure identifier or an array identifier.
Avoiding parsing conflicts
Parser generators based on the shift-reduce method often have facilities for
helping you avoid parsing conflicts. For example, YACC has the convenient rule that if
there is a shift-reduce conflict then it will prefer the shift over the reduce. For
reduce-reduce conflicts YACC will prefer to reduce by the production that was written first in
the grammar.
Relying on default behavior like YACC's can be dangerous because changes to the
way the grammar is written can affect the parser in subtle ways. For example, just
reordering the productions changes the way reduce-reduce conflicts are resolved.
Other parser generators extend the grammar notation with modifications which
provide more information to resolve conflicts. E.g., we might attach a modification to the
first production in the dangling else problem grammar that says that it should not be
reduced if the next basic symbol is an 'else'.
If the parser generator does not support modifications or its default rules are not
what you want, more needs to be done. In the case of the reduce-reduce conflict above we
might somehow obtain semantic information that tells us whether the ambiguous case is a
procedure call or an array access based on the declaration of the first identifier. This can
be done but complicates the compiler because semantic information is needed before the
program has been fully parsed. C's typedef facility creates a similar problem for parsers
of that language.
Another solution is to rewrite the grammar to remove the conflict. For
shift-reduce conflicts it is sometimes possible to rewrite the grammar so that it accepts exactly
the language that you want.
Stmt : 'if' Expression 'then' Stmt
|'if' Expression 'then' Stmt2 'else' Stmt
| ...
Stmt2 : 'if' Expression 'then' Stmt2 'else' Stmt2
| ...
For a reduce-reduce conflict a standard technique is to write the grammar so that
both possibilities are parsed to the same structure. Once the tree has been constructed the
semantic analyzer can then use all available information to decide which was actually
meant. For example, for the example in the previous section we could parse a function
call as an expression and take care of the distinction later. This approach works but can
complicate semantic analysis considerably so it's worth avoiding if possible.
Sometimes the grammar conflict is not really an indication of an ambiguity.
Rather, it might just be a property of the way the grammar is written. Transformation to
an equivalent grammar can remove the conflict. The following example has a conflict
because the parser can't decide between the rules for A and B based on a single token
look ahead because the next symbol is x in both cases.
S : Q.
Q : A x y / B x x.
A : C a b.
B : C a b.
C : c.
A simple rewrite suffices to remove the problem by effectively pushing the
decision point one token later. For example,
S : Q.
Q : A y / B x.
A : C a b x.
B : C a b x.
C : c.
Operator Precedence Parsing
An operator precedence grammar is an ε-free operator grammar in which the
precedence relations <, =, > constructed as above are disjoint. That is, for any pair of
terminals a and b, never more than one of the relations a<b, a=b, a>b is true.
LEADING( ) & TRAILING( )
LEADING (A) = { a | A =>+ γaδ, where γ is ε or a single non-terminal }
TRAILING (A) = { a | A =>+ γaδ, where δ is ε or a single non-terminal }
Algorithm for operator –precedence relations
Input: An operator grammar G.
Output: The relations <, =, and > for G.
Method:
1. Compute LEADING (A) & TRAILING (A) for each nonterminal.
2. Execute the program given below, examining each position of the right
side of the production.
3. Set $<a for all a in LEADING (S) and set b>$ for all b in TRAILING
(S), where S is the start symbol of G.
for each production A -> X1 X2 ... Xn do
    for i := 1 to n-1 do
    begin
        if Xi and Xi+1 are both terminals then set Xi = Xi+1;
        if i <= n-2 and Xi and Xi+2 are terminals and Xi+1 is a non-terminal
            then set Xi = Xi+2;
        if Xi is a terminal and Xi+1 is a non-terminal then
            for all a in LEADING(Xi+1) do set Xi < a;
        if Xi is a non-terminal and Xi+1 is a terminal then
            for all a in TRAILING(Xi) do set a > Xi+1;
    end
The operator -precedence parsing algorithm
Input
The precedence relation from some operator-precedence grammar and an input
string of terminals from grammar.
Output
Strictly speaking, there is no output. We could construct a skeletal parse tree as we
parse, with one nonterminal labeling all interior nodes and the use of single productions
not shown. Alternatively, the sequence of shift-reduce steps could be considered the
output.
Method
Let the input string be a1…an$. Initially, the stack contains $. Execute the program.
If a parse tree is desired, we must create a node for each terminal shifted onto the stack at
line (4). Then, when the loop of lines (6)-(7) reduces by some production, we create a
node whose children are the nodes corresponding to whatever is popped off the stack. After
line (7) we place on the stack a pointer to the node created. This means that some of the
"symbols" popped by line (6) will be pointers to nodes. The comparison of line (7)
continues to be made between terminals only; pointers are popped with no comparison
being made.
(1) repeat forever
(2)   if only $ is on the stack and $ is on the input then accept and break else begin
(3)     let a be the topmost terminal symbol on the stack and let b be the current
        input symbol;
(4)     if a < b or a = b then shift b onto the stack
(5)     else if a > b then /* reduce */
(6)       repeat pop the stack
(7)       until the top stack terminal is related by < to the terminal most recently popped
(8)     else call the error correcting routine
(9)   end
Precedence functions
Compilers using operator-precedence parsers need not store the table of
precedence relations. In most cases, the table can be encoded by two precedence
functions f and g, which map terminal symbols to integers. We attempt to select f and g
so that, for symbols a and b,
1. f (a)<g(b) whenever a<b,
2. f (a)=g(b) whenever a=b,
3. f (a)>g(b) whenever a>b
Thus the precedence relation between a and b can be determined by a numerical
comparison between f(a) and g(b).
Operator-precedence relations

        id      +       *       $
id              >       >       >
+       <       >       <       >
*       <       >       >       >
$       <       <       <
Graph representing precedence functions
(Figure: the precedence-function graph, with nodes fid, gid, f+, g+, f*, g*, f$ and g$.)

Precedence functions with their values

        id      +       *       $
f       4       2       4       0
g       5       1       3       0
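In code, the table of relations can then be replaced by two small arrays. The following C sketch uses the function values computed above; the symbol indexing (0..3 for id, +, *, $) and the function names are assumptions made only for this illustration.

/* Sketch: deciding shift vs. reduce with precedence functions. */

enum { S_ID, S_PLUS, S_STAR, S_DOLLAR };     /* assumed encoding */

static const int f[] = { 4, 2, 4, 0 };   /* f(id), f(+), f(*), f($) */
static const int g[] = { 5, 1, 3, 0 };   /* g(id), g(+), g(*), g($) */

/* a is the topmost terminal on the stack, b the current input symbol */
int should_shift(int a, int b)
{
    return f[a] <= g[b];     /* a < b or a = b : shift b onto the stack */
}

int should_reduce(int a, int b)
{
    return f[a] > g[b];      /* a > b : reduce                          */
}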
ERROR RECOVERY IN OPERATOR PRECEDENCE PARSING
There are two points in the parsing process at which an operator-precedence
parser can discover syntactic errors
1. If no precedence relation holds between the terminal on top of the stack and
the current input symbol.
2. If a handle has been found, but there is no production with this handle as a
right side.
The following are the error handling routines:
e1 : /* called when whole expression is missing */
Insert id onto the input
Issue diagnostic : “missing operand “
e2 : /* called when expression begins with a right parenthesis */
Delete ) from the input
Issue diagnostic : "unbalanced right parenthesis"
e3 : /* called when id or ) is followed by id or ( */
Insert + onto the input
Issue diagnostic : "missing operator"
e4 : /* called when expression ends with a left parenthesis */
Pop ( from the stack
Issue diagnostic : "missing right parenthesis"
LR PARSERS
Some of the most efficient bottom-up parsers are LR parsers. The expansion of L, R is
as follows:
L - Left-to-right scanning of the input
R - Rightmost derivation (constructed in reverse)
The name arises because these parsers scan the input from left to right and construct
a rightmost derivation (parse tree) in reverse.
The parsing algorithm and the parsing table are the two major components of LR parsers.
Three ways to construct the LR parsing table are listed below as
1. SLR (simple LR)
2. CLR (Canonical LR)
3. LALR (Look ahead LR)
SLR Parser
In order to construct the SLR parsing table the following two components are
necessary:
1. Construction of the collection of sets of LR(0) items using the CLOSURE
function and the GOTO function
2. Construction of the parsing table using the LR(0) items
Construction of the LR(0) items collection
The collection of sets of LR(0) items is called C. This must be constructed in
order to construct the SLR parsing table. The collection is called the canonical
collection of LR(0) items.
LR(0) item
Consider a production
I -> JKL
Now place a dot ('.') at some position on the right side of the production, as shown below:
I -> .JKL
I -> J.KL
I -> JK.L
I -> JKL.
These are called LR(0) items of the grammar G.
An LR(0) item of a grammar G is a production of G with a dot at some position of
the right side. An LR(0) item can simply be called an item.
E.g.
1. E -> id, then the LR(0) items are
E -> .id
E -> id.
2. E -> ε
Then the only item is
E -> .
Augmented grammar
Consider the grammar G with start symbol S. The augmented grammar of
G is G' with a new start symbol S' and an added production S' -> S.
Example :
Consider the grammar G
S -> AS
S -> b
A -> SA
A -> a
The augmented grammar G'
S' -> S
S -> AS
S -> b
A -> SA
A -> a
Closure Function
Let I be a set of items for a grammar G; then the closure of I, i.e. CLOSURE(I),
can be computed by using the following steps:
1. Initially, every item in I is added to CLOSURE(I).
2. If A -> X.BY is an item in I and B -> Z is a production, then add the item
B -> .Z to I.
Example
Consider the item in I
S -> A.S
CLOSURE(I) = CLOSURE(S -> A.S). Since the dot precedes the non-terminal S, add
S -> .AS
S -> .b
Here we have to include the items derived from A also, therefore
A -> .SA
A -> .a
The items derived from S are already there, so no new items can be added to I.
CLOSURE(I):
S -> A.S
S -> .AS
S -> .b
A -> .SA
A -> .a
GOTO Function
If A -> .XBY is an item in I, then GOTO(I, X) contains A -> X.BY together
with its closure, where X is a grammar symbol.
Example
I:
S -> A.S
S -> .AS
S -> .b
A -> .SA
A -> .a
Then GOTO(I, S):
S -> AS.
A -> S.A
A -> .SA
A -> .a
S -> .AS
S -> .b
Similarly, GOTO(I, b) = { S -> b. }
Constructing SLR parsing tables
Algorithm
Input : The canonical collection of sets of items for an augmented grammar G'.
Output : If possible, an LR parsing table consisting of a parsing action function
ACTION and a goto function GOTO.
Method :
Let C = {I0, I1, ..., In}. The states of the parser are 0, 1, ..., n, state i being
constructed from Ii. The parsing actions for state i are determined as follows:
1. If [A -> α.aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j";
here a is a terminal.
2. If [A -> α.] is in Ii, then set ACTION[i, a] to "reduce A -> α" for all a in
FOLLOW(A).
3. If [S' -> S.] is in Ii, then set ACTION[i, $] to "accept".
If any conflicting actions are generated by the above rules, we say the grammar is not
SLR(1). The algorithm fails to produce a valid parser in this case.
The goto transitions for state i are constructed using the rule:
4. If GOTO(Ii, A) = Ij, then GOTO[i, A] = j.
5. All entries not defined by rules (1) through (4) are made "error".
6. The initial state of the parser is the one constructed from the set of items
containing [S' -> .S].
The parsing table consisting of the parsing action and goto functions determined by
the above algorithm is called the SLR table for G. An LR parser using the SLR table
for G is called an SLR parser for G, and a grammar having an SLR parsing table is
said to be SLR(1).
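Whatever method is used to build the table (SLR here, CLR and LALR below), the driver that interprets it is the same. A C sketch of that driver follows; the encoding of ACTION entries and the accessor functions (action, goto_, lhs, rhs_length, next_input, advance) are assumptions made for this illustration.

/* Sketch of the LR parsing program in C.  ACTION entries are encoded
   as {SHIFT,s}, {REDUCE,p}, {ACCEPT,0} or {ERROR,0}; the accessors
   below are assumed, not given in the text. */

#define MAXSTACK 256
enum { ERROR, ACCEPT, SHIFT, REDUCE };

struct entry { int kind, arg; };               /* e.g. {SHIFT,5} is s5 */

struct entry action(int state, int terminal); /* ACTION table          */
int goto_(int state, int nonterminal);        /* GOTO table            */
int lhs(int p);                               /* A of production A -> β */
int rhs_length(int p);                        /* |β|                    */
int next_input(void);
void advance(void);

int lr_parse(void)
{
    int states[MAXSTACK], top = 0;
    states[0] = 0;                        /* initial state on the stack */
    for (;;) {
        struct entry e = action(states[top], next_input());
        switch (e.kind) {
        case SHIFT:                       /* push state, consume input  */
            states[++top] = e.arg;
            advance();
            break;
        case REDUCE:                      /* pop |β| states, then push  */
            top -= rhs_length(e.arg);     /* GOTO of the exposed state  */
            states[top + 1] = goto_(states[top], lhs(e.arg));
            ++top;
            break;
        case ACCEPT:
            return 1;
        default:
            return 0;                     /* error entry                */
        }
    }
}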
Example:
Construct SLR parsing table for the given Grammar.
E ->E+T
E ->T
T ->T*F
T->F
F ->( E )
F -> id
Step 1 : Augmented Grammar
E' -> E
E -> E+T
E -> T
T -> T*F
T -> F
F -> (E)
F -> id
Step 2 : Canonical collection of LR(0) items:
I0:
E' -> .E
E -> .E+T
E -> .T
T -> .T*F
T -> .F
F -> .(E)
F -> .id
I1:
E' -> E.
E -> E.+T
I2:
E -> T.
T -> T.*F
I3:
T -> F.
I4:
F -> (.E)
E -> .E+T
E -> .T
T -> .T*F
T -> .F
F -> .(E)
F -> .id
I5:
F -> id.
I6:
E -> E+.T
T -> .T*F
T -> .F
F -> .(E)
F -> .id
I7:
T -> T*.F
F -> .(E)
F -> .id
I8:
F -> (E.)
E -> E.+T
I9:
E -> E+T.
T -> T.*F
I10:
T -> T*F.
I11:
F -> (E).
SLR parsing table

STATE      ACTION                                         GOTO
           id     +      *      (      )      $          E    T    F
0          s5                   s4                        1    2    3
1                 s6                          acc
2                 r2     s7            r2     r2
3                 r4     r4            r4     r4
4          s5                   s4                        8    2    3
5                 r6     r6            r6     r6
6          s5                   s4                             9    3
7          s5                   s4                                  10
8                 s6                   s11
9                 r1     s7            r1     r1
10                r3     r3            r3     r3
11                r5     r5            r5     r5

(productions numbered r1: E -> E+T, r2: E -> T, r3: T -> T*F,
 r4: T -> F, r5: F -> (E), r6: F -> id)
CLR parser
It is also an LR parser. Many of the concepts are similar to the SLR parser, but
there is some difference in the construction of the parsing table: it is constructed from
LR(1) items rather than LR(0) items.
LR(1) items
The general form of an LR(1) item is
A -> X.Y, a
where 'a' is called the lookahead. This is extra information: we are looking one
token ahead. The 'a' may be a terminal or the right end marker $.
Example : S' -> .S, $
($ is the lookahead).
The collection of LR(1) items will lead to the construction of the CLR parsing
table. As in the SLR parser, here also we use the CLOSURE and GOTO functions for
constructing the LR(1) items, taking the lookahead into account.
Algorithm for construction of a canonical LR parsing table
Input : A grammar G augmented by the production S' -> S.
Output : If possible, the canonical LR parsing action function ACTION and goto
function GOTO.
Method:
1. Construct C = {I0, ..., In}, the collection of sets of LR(1) items for G.
2. State i of the parser is constructed from Ii. The parsing actions for state i are
determined as follows:
(a) If [A -> α.aβ, b] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to
"shift j".
(b) If [A -> α., a] is in Ii, then set ACTION[i, a] to "reduce A -> α".
(c) If [S' -> S., $] is in Ii, then set ACTION[i, $] to "accept".
If a conflict results from the above rules, the grammar is said not to be LR(1), and the
algorithm is said to fail.
3. The goto transitions for state i are determined as follows: if GOTO(Ii, A) = Ij,
then GOTO[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error".
5. The initial state of the parser is the one constructed from the set containing the
item [S' -> .S, $].
Construction of the sets of LR (1) items for the grammar G.
Input : A grammar G.
Output : The set of LR (1) items which are the sets of items valid for one or more viable
prefixes of G.
Method: The procedures CLOSURE and GOTO and the main routine for constructing
the sets of items are:

procedure CLOSURE (I);
begin
    repeat
        for each item [A -> α.Bβ, a] in I, each
            production B -> γ in G', and each terminal b in FIRST(βa)
            such that [B -> .γ, b] is not in I do
                add [B -> .γ, b] to I
    until no more items can be added to I;
    return I
end;

procedure GOTO (I, X);
begin
    let J be the set of items [A -> αX.β, a] such that
        [A -> α.Xβ, a] is in I;
    return CLOSURE (J)
end;

begin
    C := { CLOSURE ({[S' -> .S, $]}) };
    repeat
        for each set of items I in C and each grammar
            symbol X such that GOTO (I, X) is not empty
            and not already in C do
                add GOTO (I, X) to C
    until no more sets of items can be added to C
end;
Example
For the given grammar construct CLR parsing table.
S-> CC
C->cC
C->d
Step 1 : Augmented grammar
S’ ->S
S -> CC
C -> cC / d
Step 2 : Canonical collection of LR(1) items.
I0:
S' -> .S, $
S -> .CC, $
C -> .cC, c/d
C -> .d, c/d
I1:
S' -> S., $
I2:
S -> C.C, $
C -> .cC, $
C -> .d, $
I3:
C -> c.C, c/d
C -> .cC, c/d
C -> .d, c/d
I4:
C -> d., c/d
I5:
S -> CC., $
I6:
C -> c.C, $
C -> .cC, $
C -> .d, $
I7:
C -> d., $
I8:
C -> cC., c/d
I9:
C -> cC., $
CLR parsing table

STATE      ACTION                    GOTO
           c      d      $           S    C
0          s3     s4                 1    2
1                        acc
2          s6     s7                      5
3          s3     s4                      8
4          r3     r3
5                        r1
6          s6     s7                      9
7                        r3
8          r2     r2
9                        r2

(productions numbered r1: S -> CC, r2: C -> cC, r3: C -> d)
LALR Parser
This is much easier to construct than the CLR parsing table. The construction is
similar to the CLR parsing table with small modifications.
LALR table construction
Input : A grammar G augmented by the production S' -> S
Output : The LALR parsing tables ACTION and GOTO
Method
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items.
2. For each core present among the sets of LR(1) items, find all sets having
that core, and replace these sets by their union.
3. Let C' = {J0, J1, ..., Jm} be the resulting sets of LR(1) items. The parsing
actions for state i are constructed from Ji in the same manner as for the CLR
parser. If there is a parsing-action conflict, the algorithm fails to produce a
parser and the grammar is said not to be LALR(1).
4. The GOTO table is constructed as follows. If J is the union of one or more
sets of LR(1) items, i.e. J = I1 U I2 U ... U Ik, then the cores of GOTO(I1, X),
..., GOTO(Ik, X) are the same, since I1, ..., Ik all have the same core. Let K
be the union of all sets of items having the same core as GOTO(I1, X). Then
GOTO(J, X) = K.
Example
For the same example in CLR parser, the LALR parsing table would appear as
LALR parsing table

STATE      ACTION                    GOTO
           c      d      $           S    C
0          s36    s47                1    2
1                        acc
2          s36    s47                     5
36         s36    s47                     89
47         r3     r3     r3
5                        r1
89         r2     r2     r2

Here the union I36 replaces I3 & I6:
I36:
C -> c.C, c/d/$
C -> .cC, c/d/$
C -> .d, c/d/$
Similarly,
I47:
C -> d., c/d/$
and
I89:
C -> cC., c/d/$
Comparison
For a comparison of parser size, the SLR and LALR tables for a grammar always
have the same number of states, whereas the CLR table has more states. Thus it is
much easier and more economical to construct SLR or LALR tables than CLR tables.
PARSER GENERATOR
A parser is a program which determines if its input is syntactically valid and
determines its structure. Parsers may be hand written or may be automatically generated
by a parser generator from descriptions of valid syntactical structures. The descriptions
are in the form of a context-free grammar. Parser generators may be used to develop a
wide range of language parsers, from those used in simple desk calculators to complex
programming languages.
Yacc is a program which, given a context-free grammar, constructs a C program
which will parse input according to the grammar rules. Yacc was developed by S. C.
Johnson and others at AT&T Bell Laboratories. Yacc provides for semantic stack
manipulation and the specification of semantic routines. An input file for Yacc is of the
form:
C and parser declarations
%%
Grammar rules and actions
%%
C subroutines
The first section of the Yacc file consists of a list of tokens (other than single
characters) that are expected by the parser and the specification of the start symbol of the
grammar. This section of the Yacc file may contain specification of the precedence and
associativity of operators. This permits greater flexibility in the choice of a context-free
grammar. Addition and subtraction are declared to be left associative and of lowest
precedence while exponentiation is declared to be right associative and to have the
highest precedence.
%start program
%token LET INTEGER IN
%token SKIP IF THEN ELSE END WHILE DO READ
WRITE
%token NUMBER
%token IDENTIFIER
%left '-' '+'
%left '*' '/'
%right '^'
%%
Grammar rules and actions
%%
C subroutines
The second section of the Yacc file consists of the context-free grammar for the
language. Productions are separated by semicolons, the '::=' symbol of the BNF is
replaced with ':', the empty production is left empty, non-terminals are written in all
lower case, and the multicharacter terminal symbols in all upper case. Notice the
simplification of the expression grammar due to the separation of precedence from the
grammar.
C and parser declarations
%%
program : LET declarations IN commands END
;
declarations : /* empty */
| INTEGER id_seq IDENTIFIER '.'
;
id_seq : /* empty */
| id_seq IDENTIFIER ','
;
commands : /* empty */
| commands command ';'
;
command : SKIP
| READ IDENTIFIER
| WRITE exp
| IDENTIFIER ASSGNOP exp
| IF exp THEN commands ELSE commands FI
| WHILE exp DO commands END
;
exp : NUMBER
| IDENTIFIER
| exp '<' exp
| exp '=' exp
| exp '>' exp
| exp '+' exp
| exp '-' exp
| exp '*' exp
| exp '/' exp
| exp '^' exp
| '(' exp ')'
;
%%
C subroutines
The third section of the Yacc file consists of C code. There must be a main()
routine which calls the function yyparse(). The function yyparse() is the driver routine
for the parser. There must also be the function yyerror() which is used to report on
errors during the parse. Simple examples of the function main() and yyerror() are:
C and parser declarations
%%
Grammar rules and actions
%%
int main( int argc, char *argv[] )
{ extern FILE *yyin;
  ++argv; --argc;
  yyin = fopen( argv[0], "r" );
  yydebug = 1;
  errors = 0;
  return yyparse ();
}
int yyerror (char *s) /* Called by yyparse on error */
{ printf ("%s\n", s); return 0; }
The parser, as written, has no output; however, the parse tree is implicitly
constructed during the parse. As the parser executes, it builds an internal representation
of the structure of the program. The internal representation is based on the right-hand
side of the production rules. When a right-hand side is recognized, it is reduced to the
corresponding left-hand side. Parsing is complete when the entire program has been
reduced to the start symbol of the grammar.
Compiling the Yacc file with the command yacc -vd file.y (bison -vd file.y) causes
the generation of two files, file.tab.h and file.tab.c. The file file.tab.h contains the list of
tokens and is included in the file which defines the scanner. The file file.tab.c defines
the C function yyparse(), which is the parser.
UNIT – IV
INTERMEDIATE LANGUAGES
SYNTAX DIRECTED DEFINITION
It is a generalization of a Context Free Grammar in which each grammar symbol
has an associated set of attributes. The attributes may be a string, a number, type,
memory location or code. Two types of attributes are
1. Synthesized attribute
2. Inherited attribute
Synthesized attribute
Its values are computed from the values at the children of the node, or are
associated with the meaning of the tokens.
Inherited attribute
Its values are computed from the parent and/or siblings of the node.
Annotated parse tree
Annotate the parse tree by attaching semantic attributes to the nodes
of the parse tree. Generate code by visiting nodes in the parse tree in a given order.
Input: y := 3 * x + z
Each grammar symbol is associated with a set of attributes.
Annotating (or) Decorating parse tree
The process of computing the attribute values at the nodes.
Output Action ( Semantic rule )
A syntax directed translation scheme is a context free grammar in which a
program fragment called an output action is associated with each production.
Ex : A -> XYZ    { α }
If an input string w is derived using this production, then the action α is executed.
Syntax-Directed Translation - Definition
The compilation process is driven by the syntax. The semantic routines perform
interpretation based on the syntax structure. Attaching attributes to the grammar
symbols. Values for attributes are computed by semantic rules associated with the
grammar productions.
Types of Syntax directed translation
1. Synthesized translation
2. Inherited translation
Synthesized translation
It defines the values of the translation of the non terminal on the left side of the
production as a function of translation of non terminals on the right side.
Ex : E.val := E(1).val + E(2).val
Inherited translation
The translation of a non terminal on the right side of the production is defined in
terms of a translation of the non terminal on the left.
Ex : A -> XYZ
{ Y.val := 2*A.val }
Format for writing syntax-directed definitions.
S-attribute definition
A syntax directed definition that uses synthesized attributes
Inherited
It is one whose value at a node in a parse tree is defined in terms of attributes at
the parent of that node.
Dependency graph
The interdependencies among the inherited synthesized attributes at the nodes in a
parse tree can be depicted by a directed graph.
DECLARATIONS
In the declaration of a block or procedure, a number of local names are used. For
each such name we create a symbol table entry holding its attributes, such as its type
and the relative address of the storage for that name.
Declaration in a procedure
The procedure P contains sequence of declarations of form id : T. Before first
declaration, offset is “0”. When new name is found, it can be entered into symbol table
and current offset value is assigned to it and the offset is incremented by the width of data
object denoted by that name.
Procedure enter(name, type, offset) - creates an entry for name with its type and
relative address in the symbol table. Here the type and width attributes are used. The
type may be integer, real, pointer or array.
Example
P -> D              { offset := 0 }
D -> D ; D
D -> id : T         { enter(id.name, T.type, offset);
                      offset := offset + T.width }
T -> integer        { T.type := integer;
                      T.width := 4 }
T -> real           { T.type := real;
                      T.width := 8 }
T -> array[num] of T1
                    { T.type := array(num.val, T1.type);
                      T.width := num.val * T1.width }
T -> ^T1            { T.type := pointer(T1.type);
                      T.width := 4 }
In line 1, P -> { offset := 0 } D, the action is not at the right end of the
production. So we rewrite this statement as
P -> M D
M -> ε              { offset := 0 }
Nested Procedure
Here a separate symbol table is created for each procedure. A nested procedure
can be written as
P -> D
D -> D ; D | id : T | proc id ; D ; S
(In the last alternative, id is the procedure name, D its declarations and S its
statements.)
Example
(Figure: nested symbol tables for the sort program - an outer table with a nil header
containing a, x and the procedure entries readarray, exchange and quicksort, each
pointing to its own table; the quicksort table in turn contains i and the procedure entry
partition, which points to the partition table. Each nested table's header points back to
the enclosing table.)
Semantic rules for nested procedures are defined by using the following operations:
1. mktable ( previous ) - Creates new Symbol table and returns pointer to the
new table. The argument previous points to a previously created symbol table
and it is placed in a header for the new symbol table along with additional
information.
2. enter ( table, name, type, offset) - Creates new entry for name in the symbol
table pointed by table.
3. addwidth ( table, width ) - Records the cumulative width of all the entries in
table in the header with that table.
4. enterproc ( table, name, newtable ) - Creates new entry for procedure “name”
in a symbol table. Newtable represents symbol table for that procedure.
Field names in records
For the record data type we use the following productions:
T -> record L D end
L -> ε
ASSIGNMENT STATEMENTS
Translation of assignment for 3 address code is as follows.
S -> id := E        { p := lookup(id.name);
                      if p ≠ nil then
                          emit(p ':=' E.place)
                      else error }
E -> E1 + E2        { E.place := newtemp;
                      emit(E.place ':=' E1.place '+' E2.place) }
E -> E1 * E2        { E.place := newtemp;
                      emit(E.place ':=' E1.place '*' E2.place) }
E -> -E1            { E.place := newtemp;
                      emit(E.place ':=' 'uminus' E1.place) }
E -> (E1)           { E.place := E1.place }
E -> id             { p := lookup(id.name);
                      if p ≠ nil then
                          E.place := p
                      else error }
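For instance, assuming newtemp yields t1, t2, ..., the assignment a := b + (-c) would be translated under this scheme as:
t1 := uminus c
t2 := b + t1
a := t2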
Reusing temporary names
Temporary names can be reused by keeping a count, whose value is 0 initially.
Whenever a temporary name is used as an operand, the count is decremented;
whenever a new temporary name is generated, the count is incremented.
Example : x := a*b+c*d-e*f

Statement           Value of count
                    0
$0 := a*b           1
$1 := c*d           2
$0 := $0+$1         1
$1 := e*f           2
$0 := $0-$1         1
x := $0             0
Addressing array elements
Access to array elements is fast if they are stored in consecutive locations. The
address of an element can be computed by the following formulas:
base + (i - low) * w                            (for a one-dimensional array)
base + ((i1 - low1) * n2 + (i2 - low2)) * w     (for a two-dimensional array)
where base is the starting address, i is the index, low is the lower bound on the
subscript, n2 is the number of elements in the second dimension, and w is the width of
the data type.
For example, an array A contains 5 integer elements (w = 4) and starts at address
1000. The address of A[3], with low = 0, is
Address of A[3] = 1000 + (3-0)*4
                = 1000 + 12
                = 1012
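The two formulas can also be written out directly in code. The following C sketch merely restates them; the function names are illustrative, not from the text.

/* Sketch: the two address formulas from above as C functions. */

long addr_1d(long base, long i, long low, long w)
{
    return base + (i - low) * w;               /* base + (i-low)*w */
}

long addr_2d(long base, long i1, long low1, long i2, long low2,
             long n2, long w)
{
    /* base + ((i1-low1)*n2 + (i2-low2)) * w */
    return base + ((i1 - low1) * n2 + (i2 - low2)) * w;
}

/* e.g. addr_1d(1000, 3, 0, 4) gives 1012, as computed above */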
Type conversions with assignments
Consider only the integer and real types for conversion. The following translation
scheme represents the type conversion operation for the production E -> E1 + E2.
E.place := newtemp;
if E1.type = integer and E2.type = integer then begin
    emit(E.place ':=' E1.place 'int+' E2.place);
    E.type := integer
end
else if E1.type = real and E2.type = real then begin
    emit(E.place ':=' E1.place 'real+' E2.place);
    E.type := real
end
else if E1.type = integer and E2.type = real then begin
    u := newtemp;
    emit(u ':=' 'inttoreal' E1.place);
    emit(E.place ':=' u 'real+' E2.place);
    E.type := real
end
else
    E.type := type_error;
Example
x := y + i * j where x & y are real and i & j are integer.
t1 := i int* j
t3 := inttoreal t1
t2 := y real+ t3
x := t2
SYMBOL TABLE
A Compiler uses a symbol table to keep track of scope and binding information
about names. A symbol table mechanism must allow us to add new entries and find
existing entries efficiently. Two symbol table mechanisms are
1. Linear list
2. Hash tables
We evaluate each scheme on the basis of the time required to add n entries and
make e inquiries.
Linear list : A linear list is the simplest to implement, but its performance is poor
when e and n get large.
Hashing : It provides better performance than linear list.
Symbol table entries
Each entry in the symbol table is for the declaration of a name. The format of
each entry does not have to be uniform, because the information saved about a name
depends on the usage of the name. Each entry can be implemented as a record consisting
of a sequence of consecutive words of memory. To keep symbol table record uniform, it
may be convenient for some of the information about a name to be kept outside the table
entry, with only a pointer to this information stored in the record.
Character in a name
If there is a modest upper bound on the length of a name, then the characters in
the name can be stored in the symbol table entry as shown below
In fixed size space within a record
(Figure: the names sort, a and readarray stored in fixed-size NAME fields, each field
followed by the name's ATTRIBUTES.)
If there is no limit on the length of a name, or if the limit is rarely reached, the
indirect scheme can be used as follows.
In a separate array
The complete lexeme constituting a name must be stored to ensure that all uses of
the same name can be associated with the same symbol table record.
Storage allocation information
Information about the storage locations that will be bound to names at run time is
kept in the symbol table. In case of names whose storage is allocated on a stack or heap,
the compiler does not allocate storage at all - the compiler plans out the activation record
for each procedure.
The list data structure for symbol tables
The simplest and easiest to implement data structure for a symbol table is a linear
list of records as shown below.
(Figure: a linear list of records id1/info1, id2/info2, ..., idn/infon, with the pointer
"available" marking the end.)
We use a single array, or equivalently several arrays, to store names and their
associated information. The position of the end of the array is marked by the pointer
available, pointing to where the next symbol table entry will go. When a name is
located during a search, the associated information can be found in the words
following it. If we reach the beginning of the array without finding the name, a fault
occurs: the name is not in the table.
Hash tables
Many compilers use this technique for searching operations. The basic hashing
scheme is illustrated as shown below
There are two parts to the data structure :
1. A hash table consisting of a fixed array of m pointers to table entries.
2. Table entries organized into m separate linked lists, called buckets ( some
buckets may be empty ). Each record in the symbol table appears on exactly
one of these lists. Storage for the records may be drawn from an array of
records.
A suitable approach for computing hash functions is to proceed as follows :
1. Determine a positive integer h from the characters c1,c2,…..ck in string s.
The conversion of single characters to integers is usually supported by the
implementation language.
2. Convert the integer h determined above into the number of a list, i.e., an
integer between 0 and m-1. Simply dividing by m and taking the remainder is
a reasonable policy.
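A minimal C sketch of these two steps might look as follows; the particular way h is accumulated (the multiplier 65599) is just one common choice, not something fixed by the text.

/* Sketch: hash a name s into a bucket number 0..m-1. */

unsigned hash_name(const char *s, unsigned m)
{
    unsigned h = 0;
    while (*s)
        h = h * 65599u + (unsigned char)*s++;   /* step 1: h from the characters */
    return h % m;                               /* step 2: reduce h modulo m     */
}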
Representing scope information
The entries in the symbol table are for declarations of names. The scope rules of
the source language determine which declaration of a name applies. A simple approach is to
maintain a separate symbol table for each scope. The symbol table for a procedure or
scope is the compile time equivalent of an activation record. Information for the non
locals of a procedure is found by scanning the symbol tables for the enclosing procedures
following the scope rules of the language.
Most closely nested scope rules can be implemented in terms of the following
operations on a name:
Lookup : find the most recently created entry
Insert : make a new entry
Delete : remove the most recently created entry
A hash table consists of m lists accessed through an array. Since a name always
hashes to the same list, individual lists are maintained as shown below
For implementing the delete operation we would rather not have to scan the entire hash
table looking for lists containing entries to be deleted. The following approach can be used.
Suppose each entry has two links :
1. A hash link that chains the entry to other entries whose names hash to the
same value and
2. a scope link that chains all entries in the same scope.
UNIT – V
INTRODUCTION TO CODE OPTIMIZATION
The code optimizer optimizes the code produced by the intermediate code
generator in the terms of time and space.
Ex:
MULT id2, id3, temp1
ADD  temp1, #1, id1
To get an efficient target program, we need good code optimization. It improves the
performance of the program by applying various transformations.
Criteria for code improving transformations
The transformation provided by an optimizing compiler should have several
properties
1. It must preserve meaning of programs. i.e., Optimization must not change
output of a program for an input or cause an error such as divide by zero.
2. It must, on the average, speed up programs by a measurable amount;
reducing the size of the program also improves performance.
3. It must be worth the effort.
Getting better performance
Good optimization, applied all the way from source level to target level, can improve
running time dramatically, e.g. from a few hours to a few seconds.
Principal sources of optimization
A transformation of a program is called local if it can be performed by looking
only at the statements in a basic block. Otherwise, it is called global. Transformation
performed at both locally and globally. Local is done first.
Function preserving transformations
Improve the program without changing the function it computes. Various
transformations are
1. Common sub-expression elimination
2. Copy propagation
3. Dead code elimination
4. Constant folding
Common sub-expression elimination : An occurrence of an expression E is a common
sub-expression if E was previously computed and the values of the variables in E
have not changed since; the recomputation can then be avoided by using the
previously computed value.
Copy propagation : The copy transformation is to use g for f wherever possible after
the copy statement f := g. For example

Before transformation          After transformation
x := t3                        x := t3
a[t2] := t5                    a[t2] := t5
a[t4] := x                     a[t4] := t3

Dead-code elimination : Remove code that computes values which are never used. In
the above example the copy x := t3 becomes dead and is eliminated.

Before transformation          After transformation
x := t3                        a[t2] := t5
a[t2] := t5                    a[t4] := t3
a[t4] := x
Constant folding : Evaluating, at compile time, an expression whose value is a
constant, and using the constant instead, is known as constant folding.
Ex : 2 * 3.14 = 6.28
Loop optimization
Most of the running time of a program is spent in loop statements. To reduce this
time, we use two transformations as follows:
1. Code motion
2. Induction variables & reduction in strength
Code motion - It decreases the amount of code inside a loop.
Ex : while ( i < limit-2 )
This statement can be rewritten as
t := limit-2
while ( i < t )
The running time of a program may be improved if we decrease the length of one
of its loops, especially an inner loop, even if we increase the amount of code outside the
loops. This statement assumes that the loop in question is executed at least once on the
average. We must beware of a loop whose body is rarely executed, such as “blank
stripper”
While CHAR = ‘ ‘ do CHAR: = GETCHAR ()
Here GETCHAR ( ) is assumed to return the next character on an input file. In
many situations it might be quite normal that the condition CHAR = ` ` is false the first
time around, in which case the statement CHAR : = GETCHAR ( ) would be executed
zero times. An important source of modifications of the above type is called code motion,
where we take a computation that yields the same result independent of
the number of times through the loop (a loop-invariant computation) and place it before
the loop.
Induction variables & Reduction in strength
Induction Variable
There is another important optimization that may be applied to the flow graph,
one that will actually decrease the total number of instructions as well as speeding up
the loop.
We note that the purpose of I is to count from 1 to 20 in the loop, while the
purpose of T1 is to step through the arrays, four bytes at a time, since we are assuming
four bytes/word. The values of I and T1 remain in lock-step. That is, at the assignment
T1 := 4 * I, I takes on the values 1, 2, ..., 20 each time through the beginning of the
loop, and T1 takes the values 4, 8, ..., 80 immediately after each assignment to T1.
That is, both I and T1 form arithmetic progressions. We call such identifiers induction
variables. As the relationship T1 = 4 * I surely holds after the assignment to T1, and T1
is not changed elsewhere in the loop, it follows that just after the statement I := I + 1
the relationship T1 = 4 * I - 4 must hold. Thus, at the statement if I <= 20 goto B2, we
have I <= 20 if and only if T1 <= 76.
When there are two or more induction variables in a loop we have an
opportunity to get rid of all but one, and we call this process induction variable
elimination.
PROD: = 0
I: = 1
T2: = addr (A) – 4
T4: = addr (B) - 4
T1: = 4 * I
T3: = T2 [T1]
T5: = T4 [T1]
T6: = T3 * T5
PROD: = PROD + T6
I: = I + 1
If I <= 20 goto B1
Fig 12.3 flow graph after code motion
Reduction in strength
It is also worth noting that the multiplication step T1 := 4 * I in fig 12.3 can be
replaced by an addition step T1 := T1 + 4. This replacement will speed up the object
code if addition takes less time than multiplication, as is the case on many machines.
The replacement of an expensive operation by a cheaper one is termed reduction in
strength.
A dramatic example of reduction in strength is the replacement of the
string-concatenation operator || in the PL/I statement
L = LENGTH (S1 || S2)
By an addition
L = LENGTH (S1) + LENGTH (S2)
The length determinations and the addition are far cheaper than the string
concatenation. Another example of reduction in strength is the replacement of the
multiplication of an integer by a power of two with a shift.
If code motion is not applicable, as in the quicksort program, then we use this type
of transformation.
Ex : j := j - 1
     t := 4 * j            ( induction variables )
In the above example, when j is decremented the value of t changes with it; j and t
move in lock-step, so both identifiers are induction variables, and all but one of them
can be eliminated.
Ex : x^2 = x * x
     2.0 * x = x + x
     x / 2 = x * 0.5       ( reduction in strength )
CODE GENERATION
 Last phase of a compiler.
 Input is an intermediate representation of a source program.
 Output is an equivalent target program.
 In an optimizing compiler, the code optimization phase is optional.

(Figure: source program -> front end -> intermediate code -> code optimizer ->
intermediate code -> code generator -> target program, with all phases consulting the
symbol table.)
ISSUES IN DESIGN OF CODE GENERATOR
The various issues which are inbuilt in all code generation problems are
1. Memory management
2. Instruction selection
3. Register allocation
4. Evaluation order
Input to the Code Generator
 From the front end, the input is produced together with information in the symbol
table, used to determine the run-time addresses of the data objects denoted by
names in the intermediate representation.
 The intermediate representation may be postfix notation, three-address code such
as quadruples, a virtual machine representation, or a graphical representation such
as a syntax tree or a dag.
 In some compilers, semantic checking is done together with code generation.
Target Programs
 The output is the target program.
 It may be absolute machine code, relocatable machine code or assembly code.
Memory management
 Mapping names in the source program to addresses of data objects in run-time
memory is done cooperatively by the front end and the code generator.
 A name in a three-address statement refers to the symbol table entry for that name.
 Whenever a name is declared in a procedure, it is entered in the symbol table.
 The symbol table also stores the type of the name, its width and its relative address.
 Labels in three-address statements must also be handled, as they become addresses
of instructions.
Instruction selection
 The instruction set of the target machine determines the difficulty of instruction
selection.
 The uniformity and completeness of the instruction set are important factors.
 If the target machine does not support each data type uniformly, the exceptions
require special handling.
 Careful instruction selection is necessary to improve the efficiency of the target
program.
 Instruction speed is another important factor.
 The quality of the generated code is determined by its speed and size.
Ex : a := a + 1
Code is
MOV a , R0
ADD #1, R0
MOV R0, a
The above three statements can be replaced by a single statement INC a. So,
instruction selection is must to improve the efficiency of target program. We need a tool
to construct an instruction selector.
Register allocation
 Register operands are faster than memory operands, so good use of registers is
important for good code generation.
 The problems with registers are register allocation (selecting the set of variables
that will reside in registers) and register assignment (picking the specific register
for each such variable).
 Finding an optimal assignment of registers to variables is difficult.
Choice of evaluation order
 The order of computation affects the efficiency of the target code.
 Some computation orders require fewer registers than others.
 A practical approach is to generate code for the three-address statements in the
order in which they are produced, provided the intermediate code is in a correct
order for the code generator.
TARGET MACHINE
Our target computer is a byte-addressable machine with 4 bytes to a word and n
general-purpose registers R0, R1, ..., Rn-1. It has two-address instructions of the form
op source, destination
Mode                 Form      Address                        Added cost
Absolute             M         M                              1
Register             R         R                              0
Indexed              c(R)      c + contents(R)                1
Indirect register    *R        contents(R)                    0
Indirect indexed     *c(R)     contents(c + contents(R))      1
Literal              #c        the constant c                 1
Instruction Cost
The length of the instruction is known as the instruction cost. It can be calculated
by using the following formula
Cost ( Instruction ) = 1 + costs ( address modes ( source, destination ) )
Reduction in instruction length will minimizes the time taken to perform the
instruction. Because in some machine instructions, the fetch operation takes more time
than execution.
Example :
1. MOV R0,R1 - R0 and R1 have cost 0, and the MOV instruction itself costs 1,
so the total cost of the instruction is 1.
2. ADD #1,R3 - the total cost is 2, because the instruction occupies one word and
the constant value 1 occupies another word in memory.
Example : Write the equivalent code for the instruction a := b + c

Instruction        Cost
MOV b, R0          2
ADD c, R0          2
MOV R0, a          2

The total cost is 6. It can be reduced by the following instructions, assuming
registers R0, R1 and R2 contain the addresses of a, b and c:

Instruction        Cost
MOV *R1, *R0       1
ADD *R2, *R0       1

Now the cost is reduced to 2.
RUNTIME STORAGE MANAGEMENT
Information needed during an execution of a procedure is kept in a block of
storage called an activation record. It has fields to hold parameters, results, machine
status information (return address), local data and temporaries. Two types of
standard allocation strategies are
1. Static allocation - the position of an activation record in memory is fixed at
compile time.
2. Stack allocation - a new activation record is pushed onto the stack for each
execution of a procedure; the record is popped when the activation ends.
To handle run-time allocation and deallocation of activation records, we consider the following kinds of three-address statements:
1. call
2. return
3. halt
4. action, a placeholder for other statements
Example : three-address code for procedures c and p. The size and layout of the activation records are communicated to the code generator via the information about names in the symbol table.
Three-address code:
/* code for c */
action1
call p
action2
halt
/* code for p */
action3
return
[Figure: activation record for c (64 bytes) - return address followed by the locals arr and i; activation record for p (88 bytes) - return address followed by the locals buf and n.]
Static Allocation
A call statement in the intermediate code is implemented by a sequence of two target-machine instructions:
MOV #here+20, callee.static_area    /* saves the return address */
GOTO callee.code_area               /* transfers control to the called procedure */
The operand #here+20 is the return address: the MOV and GOTO instructions together occupy 20 bytes, so here+20 is the address of the instruction that follows the GOTO.
Stack Allocation
Static allocation can become stack allocation by using relative addresses for storage in activation records. The position of an activation record is not known until run time; in stack allocation this position is kept in a register, so words in the record can be accessed as offsets from the value in that register. The indexed address mode of our target machine is convenient for this. A register SP points to the beginning of the activation record on top of the stack. When a call occurs, the calling procedure increments SP and transfers control to the called procedure; after control returns to the caller, the caller decrements SP, deallocating the activation record of the called procedure.
The code for the first procedure initializes the stack:
MOV #stackstart, SP    /* initialize the stack */
    code for the first procedure
HALT
A call sequence increments SP, saves the return address and transfers control to the called procedure:
ADD #caller.recordsize, SP
MOV #here+16, *SP      /* save the return address */
GOTO callee.code_area
The called procedure returns control to the calling procedure via the saved return address:
GOTO *0(SP)            /* the return address is in the first word of the activation record */
The caller then decrements the stack pointer, deallocating the callee's record:
SUB #caller.recordsize, SP
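The effect of this call/return protocol can be illustrated by a toy Python simulation (entirely hypothetical: the dictionary model of memory and the value of stackstart are assumptions):

memory = {}
SP = 100                            # MOV #stackstart, SP (stackstart assumed)

def call(caller_recordsize, return_address):
    global SP
    SP += caller_recordsize         # ADD #caller.recordsize, SP
    memory[SP] = return_address     # MOV #here+16, *SP
    # GOTO callee.code_area would transfer control here.

def return_from_callee():
    return memory[SP]               # GOTO *0(SP): first word of the record

def after_return(caller_recordsize):
    global SP
    SP -= caller_recordsize         # SUB #caller.recordsize, SP

call(64, "here+16")
print(return_from_callee())         # 'here+16'
after_return(64)                    # SP is back at 100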
BASIC BLOCKS AND FLOW GRAPHS
Assumption: the input is an intermediate-code program.
Basic block
A basic block is a sequence of intermediate-code statements in which control enters only at the beginning and leaves only at the end: jump statements, if any, appear only as the last statement, and code in other basic blocks can jump only to the beginning of this sequence, never into the middle.
Example : the statements t1 := a * b, t2 := t1 + c, if t2 > 0 goto L form a basic block; the conditional jump is the final statement.
Flow graph
The graphical representation of three-address code is called a flow graph. It represents the program as a flow-chart-like graph in which the nodes are basic blocks and the edges represent the flow of control between them.
Partitioning into Basic Blocks
Algorithm
Input : a sequence of three-address statements
Output : a list of basic blocks
Method :
1. Determine the leaders, the first statements of basic blocks:
 The first statement of the program is a leader.
 The target of every conditional or unconditional goto is a leader.
 The statement immediately following a conditional or unconditional goto is a leader.
2. Each basic block consists of a leader and all statements up to, but not including, the next leader or the end of the program. (A sketch of this method is given below.)
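A Python sketch of this method (the statement strings and the "goto (n)" pattern are assumptions about how the intermediate code is written):

import re

def basic_blocks(stmts):
    # stmts: three-address statements, numbered 1..len(stmts).
    leaders = {1}                              # the first statement
    for i, s in enumerate(stmts, start=1):
        m = re.search(r"goto \((\d+)\)", s)
        if m:
            leaders.add(int(m.group(1)))       # the jump target
            if i < len(stmts):
                leaders.add(i + 1)             # the statement after the jump
    order = sorted(leaders) + [len(stmts) + 1]
    # Each block runs from one leader up to (not including) the next.
    return [stmts[order[k] - 1 : order[k + 1] - 1] for k in range(len(order) - 1)]

Applied to the dot-product code in the example below, this yields the two blocks described there: statements (1)-(2) and statements (3)-(11).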
Ideas for optimization
Two basic blocks are equivalent if they compute the same set of expressions. The transformation techniques below can be used to perform machine-independent optimization on basic blocks.
Example
Three-address code for computing the dot product of two vectors A and B:
(1)  PROD := 0
(2)  I := 1
(3)  T1 := 4 * I
(4)  T2 := addr(A) - 4
(5)  T3 := T2[T1]
(6)  T4 := addr(B) - 4
(7)  T5 := T4[T1]
(8)  T6 := T3 * T5
(9)  PROD := PROD + T6
(10) I := I + 1
(11) if I <= 20 goto (3)
There are two basic blocks in this example: B1 consists of statements (1)-(2) and B2 of statements (3)-(11). In the flow graph, B2 has an edge back to itself (the jump to statement (3)) and an edge to the block beginning with the statement following (11).
Transformation on Basic Blocks
Transformations improve the efficiency of code generation: they increase the speed of the generated code and reduce the space it occupies. The two classes of transformations are
1. Structure-preserving transformations.
2. Algebraic transformations.
Structure-preserving transformations
 Common sub-expression elimination : if two statements in a block compute the same expression and the operands are unchanged between them, the second computation can be replaced by a copy from the first. For example, if a block contains a := b + c and later d := b + c, the latter can become d := a (a sketch follows this list).
 Dead-code elimination : remove statements that compute values that are never used.
 Renaming temporary variables : renaming temporaries can give better usage of registers and avoid unneeded temporary variables.
 Interchange of two independent adjacent statements : reordering such statements can expose opportunities for the three transformations above.
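A rough sketch of the first transformation, common sub-expression elimination within a single block (the tuple representation and the ":=" copy marker are assumptions):

def eliminate_cse(block):
    available = {}                   # (op, arg1, arg2) -> name holding that value
    out = []
    for dest, op, a1, a2 in block:
        key = (op, a1, a2)
        if key in available:
            out.append((dest, ":=", available[key], None))   # reuse: dest := holder
        else:
            out.append((dest, op, a1, a2))
        # dest is redefined: drop expressions that used or lived in dest.
        available = {k: v for k, v in available.items()
                     if v != dest and dest not in (k[1], k[2])}
        if key not in available and dest not in (a1, a2):
            available[key] = dest
    return out

block = [("a", "+", "b", "c"), ("d", "+", "b", "c")]
print(eliminate_cse(block))   # [('a', '+', 'b', 'c'), ('d', ':=', 'a', None)]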
Algebraic Transformations
Algebraic identities can be used to replace expensive statements with cheaper ones: statements such as x := x + 0 or x := x * 1 can be eliminated outright, and x := y ^ 2 can be replaced by the cheaper x := y * y.
APPROACHES TO COMPILER DEVELOPMENT
There are several general approaches that a compiler writer can adopt to
implement a compiler. The simplest is to retarget or rehost an existing compiler. If there
is no suitable existing compiler, the compiler writer might adopt the organization of a
known compiler for a similar language and implement the corresponding components,
using component-generation tools or implementing them by hand.
Bootstrapping
Using the facilities offered by a language to compile itself is the essence of bootstrapping. Bootstrapping is used to create compilers and to move them from one machine to another by modifying the back end.
For bootstrapping purposes, a compiler is characterized by three languages : the source language S that it compiles, the target language T that it generates code for, and the implementation language I that it is written in. These three languages are conventionally drawn in a T-shaped diagram (hence the name T-diagram), with S and T on the arms and I at the base; such a diagram is abbreviated as SIT.
Cross Compiler
A compiler may run on one machine and produce target code for another
machine. Such a compiler is often called a cross-compiler.
Suppose we write a cross-compiler for a new language L in implementation
language S to generate code for machine N; that is, we create LSN. If an existing
compiler for S runs on machine M and generates code for M, it is characterized by SMM.
If LSN is run through SMM, we get a compiler LMN, that is, a compiler from L to N that
runs on M. This process is illustrated by putting together the T-diagrams for these
compilers as shown below.
When T-diagrams are put together in this way, note that the implementation language S of the compiler LSN must be the same as the source language of the existing compiler SMM, and that the target language M of the existing compiler must be the same as the implementation language of the translated form LMN. The trio of T-diagrams can be represented by the equation
LSN + SMM = LMN
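This composition rule can be modelled by a tiny Python function over (source, implementation, target) triples (a toy illustration, not from the text):

def compose(lsn, smm):
    # Run a compiler written in S through an S-compiler hosted on machine M.
    L, S, N = lsn
    S2, M, M2 = smm
    assert S == S2 and M == M2       # I of LSN = S of SMM; SMM targets its host
    return (L, M, N)                 # the result LMN runs on M

print(compose(("L", "S", "N"), ("S", "M", "M")))   # ('L', 'M', 'N')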