Syntax

Coverage
• Programming Language Syntax: Syntax Specifications
• Stages in Translation: Processing Programs, Syntax Analysis, Semantic Analysis, Lexical Analyzer, Code Generation
• Regular Expressions
• Finite Automata
• Grammar Types: Unrestricted, Context-Free, Context-Sensitive, Regular, BNF, EBNF
• Derivation: Parse Tree
• Grammar Issues: Ambiguous Grammars, Grammar Transformations
• Syntax Diagram
• Recursive Descent Parsing, Shift-Reduce Parsing
• Concrete and Abstract Syntax
• LL Grammar and LR Grammar: SLR, LALR
• Programming the Scanner and Parser
Syntax Analysis
1
Programming Language Syntax
• Syntax defines the structure of the language
• Syntax helps in:
  − Language design and language comprehension
  − Implementing or writing the compiler, the software specification, and the language system as a whole
  − Verifying program correctness
• Definitions
  − Constructs: Strings that belong to the language
  − Syntax: The form or structure of the expressions, statements, and the program unit as a whole
  − Semantics: Semantics considers what happens while executing a program segment. Thus, it provides the meaning of the statements, expressions, and program units
  − Pragmatics: Tools provided by the translator to help in debugging and interacting with the operating system
• Lexeme: Lowest-level syntactic unit of any language (e.g., sum, begin)
• Token: Category of lexemes (e.g., identifiers)
• Any compiler needs recognizers to recognize the syntax of the language
• Notations of Expressions
  − Infix notation: the operator symbol appears between the operands
  − Prefix or Polish notation: the operator symbol appears before the operands
  − Postfix, Suffix, or Reverse Polish notation: the operator symbol appears after the operands
  − Mixfix notation: operations that don't fit into the previous notations, like if-then-else
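To see why postfix notation is convenient for machines, here is a small stack evaluator in Python (an illustrative sketch, not from the slides; the `eval_postfix` name is an assumption):

```python
# Evaluate a postfix (Reverse Polish) expression with a stack.
# The same expression in the three notations:
#   infix:   (3 + 4) * 5
#   prefix:  * + 3 4 5
#   postfix: 3 4 + 5 *

def eval_postfix(tokens):
    stack = []
    for tok in tokens:
        if tok in ("+", "-", "*", "/"):
            right = stack.pop()   # operands appear before the operator,
            left = stack.pop()    # so both are already on the stack
            ops = {"+": left + right, "-": left - right,
                   "*": left * right, "/": left / right}
            stack.append(ops[tok])
        else:
            stack.append(float(tok))
    return stack.pop()

print(eval_postfix("3 4 + 5 *".split()))  # → 35.0
```

No parentheses or precedence rules are needed: the order of the tokens already encodes the grouping.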
• Associativity in Expressions
  − Left-associative: Expressions with the same operator, or operators of the same precedence, are grouped from left to right.
    • Example: +, -, *, and /
  − Right-associative: Expressions with the same operator, or operators of the same precedence, are grouped from right to left.
    • Example: Assignment and exponentiation
• Expression Trees and their Evaluation
  − Expressions are expressed in the form of a tree, with the root indicating the result of the expression
  − Traversing a tree can be done in many ways:
    • In-order traversal: All the nodes in the left subtree are visited first and then the root node is visited. Finally, the nodes in the right subtree are visited.
    • Post-order traversal: All the nodes in the left and right subtrees are visited before the root node is visited.
    • Pre-order traversal: The root node is visited first, and then the nodes of the left and right subtrees are visited.
    • Breadth-first traversal: Traversal proceeds level by level; finish visiting the nodes at one level before moving to the next. Also called level-order traversal.
    • Depth-first traversal: Traversal goes into the depth of one subtree before rising to the next subtree. The order produced by depth-first traversal is similar to pre-order traversal.
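The traversals above can be sketched in Python (illustrative; the `Node` class and function names are assumptions, not from the slides):

```python
from collections import deque

# A minimal expression-tree node, applied to the tree for (1 + 2) * 3.
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def inorder(n):    # left subtree, root, right subtree
    return inorder(n.left) + [n.value] + inorder(n.right) if n else []

def postorder(n):  # left and right subtrees before the root
    return postorder(n.left) + postorder(n.right) + [n.value] if n else []

def preorder(n):   # root first, then left and right subtrees
    return [n.value] + preorder(n.left) + preorder(n.right) if n else []

def levelorder(root):  # breadth-first: finish one level before the next
    out, queue = [], deque([root])
    while queue:
        n = queue.popleft()
        out.append(n.value)
        queue.extend(c for c in (n.left, n.right) if c)
    return out

tree = Node("*", Node("+", Node(1), Node(2)), Node(3))
print(inorder(tree))    # → [1, '+', 2, '*', 3]   (infix order)
print(postorder(tree))  # → [1, 2, '+', 3, '*']   (postfix order)
print(preorder(tree))   # → ['*', '+', 1, 2, 3]   (prefix order)
print(levelorder(tree)) # → ['*', '+', 3, 1, 2]
```

Note how in-order, post-order, and pre-order traversal of an expression tree yield the infix, postfix, and prefix notations of the same expression.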
• Evaluation of Expressions
  − Applicative Order Evaluation (strict or eager evaluation): Evaluation is bottom-up; processing starts from the leaves and moves toward the root
  − Normal Order Evaluation: An expression is evaluated only when it is needed in the computation of the result
    • Addition(5+2)
    • Addition(Y) {int Y; Y = Y + 2;}
    • Here, Y is replaced with 5+2 instead of doing the addition first
  − Lazy Evaluation (delayed evaluation): Evaluation is postponed until it is really needed
    • Frequently used in functional languages
  − Block Order Evaluation: Evaluation of an expression that contains a declaration
    • Example: In Pascal, a block expression in a function can include a variable declaration
  − Short-Circuit Evaluation: When evaluating Boolean (logical) expressions, we can sometimes partially evaluate the expression and still get the result
    • AND (X AND Y): If both X and Y are "1", then the result is "1". Otherwise, the result is "0".
    • OR (X OR Y): If either or both of X and Y are "1", then the result is "1". Otherwise, the result is "0".
    • XOR (X XOR Y): If exactly one of X and Y is "1", then the result is "1". Otherwise, the result is "0".
    • NOT (X): If X is "1", then the result is "0". If X is "0", then the result is "1".
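Python's `and` and `or` are short-circuit operators, which makes the idea easy to demonstrate (an illustrative sketch; the `probe` helper is an assumption):

```python
# Short-circuit evaluation: `and`/`or` skip the right operand when the
# left operand already decides the result. XOR has no short-circuit form,
# since both operands are always needed.

calls = []

def probe(name, value):
    calls.append(name)   # record that this operand was actually evaluated
    return value

# AND: left operand is false, so the right operand is never evaluated.
result = probe("x", False) and probe("y", True)
print(result, calls)     # → False ['x']

calls.clear()
# OR: left operand is true, so the right operand is never evaluated.
result = probe("x", True) or probe("y", False)
print(result, calls)     # → True ['x']

calls.clear()
# XOR: both operands must be evaluated.
result = probe("x", True) ^ probe("y", True)
print(result, calls)     # → False ['x', 'y']
```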
Compilation Process

SOURCE PROGRAM → SCANNER → TOKENS → PARSER → PARSE TREE → SEMANTIC ANALYSIS → ABSTRACT SYNTAX TREE → INTERMEDIATE CODE GENERATION → INTERMEDIATE CODE → OPTIMIZATION (OPTIONAL) → CODE GENERATION → MACHINE CODE

(The scanner and parser phases together form syntax analysis; a symbol table is maintained alongside the phases.)
• Syntax analysis has a low-level part and a high-level part
  • Low-level part (scanner or lexical analyzer):
    • Mostly done using finite automata
    • Input symbols are scanned and grouped into meaningful units called tokens
    • Tokens are formed by the principle of longest substring or maximum match, using a lookahead pointer
  • High-level part (parser or syntax analyzer):
    • Done using Backus-Naur Form (BNF) or a context-free grammar
    • Tokens are grouped into syntactic units like expressions, statements, and declarations, and checked whether they conform to the grammatical rules of the language
• Identification of reserved words: use a lookup table (symbol table)
• if statement: "if" "(" "y" "<" "5" ")" …
  • y is a variable, < is an operator, …
  • Tokens are represented as keywords, operators, identifiers, literals, etc.
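A toy scanner for the `if` fragment above can be sketched with Python's `re` module (illustrative; the token categories and the `scan` helper are assumptions, not a real compiler's lexer):

```python
# Group the characters of `if (y < 5)` into tokens. Each regex match is
# greedy, which gives the longest-substring (maximum-match) behavior:
# e.g. "ifx" scans as one IDENTIFIER, not the keyword "if" plus "x".
import re

TOKEN_SPEC = [
    ("KEYWORD",    r"\bif\b|\belse\b|\bwhile\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("LITERAL",    r"[0-9]+"),
    ("OPERATOR",   r"[<>=!]=|[<>=+\-*/]"),
    ("PUNCT",      r"[(){};]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":          # drop whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(scan("if (y < 5)"))
# → [('KEYWORD', 'if'), ('PUNCT', '('), ('IDENTIFIER', 'y'),
#    ('OPERATOR', '<'), ('LITERAL', '5'), ('PUNCT', ')')]
```

A production scanner would also track line numbers and report illegal characters instead of silently skipping them.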
• Parser
  − The parser should find all syntax errors and produce the parse tree
  − Parsing algorithms:
    • Top-down: recursive descent (a coded implementation) and the LL parser (a table-driven implementation)
    • Bottom-up: LR parsers
  − Why separate syntax analysis into a scanner and a parser?
    • Simplicity: Separating them makes the parser simpler.
    • Efficiency: Due to the separation, the lexical analyzer can be optimized on its own.
    • Portability: Even though parts of the lexical analyzer might not be portable, the parser can always be made portable.
• Semantic analysis (contextual analysis) is required to make sure that the data types match
• Semantic analysis works in synchronization with syntax analysis
• Contextual analysis is used to answer questions such as:
  − Has the variable been declared earlier?
  − Does the declaration type match the usage type of the variable?
  − Has the variable been initialized in advance?
  − Is the reference to the array within the bounds of the array?
  − …
• Code generation
  − Converts the program into executable machine code
  − Stages: intermediate code generation and code generation
Regular Expressions
• A regular expression is used to represent the information required by the lexical analyzer
• Regular Expression Definitions: The rules of a language L(E) defined over the alphabet of the language are expressed using a regular expression E.
  − Alternation: If a and b are regular expressions, then (a+b) is also a regular expression.
  − Concatenation (or sequencing): If a and b are regular expressions, then (a.b) is also a regular expression.
  − Kleene Closure: If a is a regular expression, then a* means zero or more repetitions of a.
  − Positive Closure: If a is a regular expression, then a+ means one or more repetitions of a.
  − Empty: Empty expressions are those with no strings.
  − Atom: Atoms indicate that there is only one string in the expression.
• Regular expression to match integers and floating-point numbers
  − To match a digit: [0-9]
  − To match one or more occurrences: [0-9]+
  − To support both signed and unsigned integers: -?[0-9]+
    • -? indicates the presence or absence of a minus
  − Floating-point representation: the decimal part is present before the dot
    • ([0-9]* \. [0-9]+)
  − Exponent part: the character "e", either lower- or uppercase
    • "e" is followed by an optional + or - sign, which is followed by an integer.
    • ([eE][-+]?[0-9]+)?
    • The question mark at the end indicates that the exponent part is optional.
  − Combined: -?(([0-9]+) | ([0-9]* \. [0-9]+)) ([eE][-+]?[0-9]+)?
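The combined pattern can be checked with Python's `re` module (a quick sketch; the `NUMBER` name and the `$` anchor for whole-string matching are assumptions):

```python
# The number pattern above, with the spaces removed and anchored at the
# end so that only complete numbers match.
import re

NUMBER = re.compile(r"-?(([0-9]+)|([0-9]*\.[0-9]+))([eE][-+]?[0-9]+)?$")

for text in ["42", "-7", "3.14", ".5", "6.02e23", "-1.5E-3"]:
    print(text, bool(NUMBER.match(text)))   # all → True

print("abc", bool(NUMBER.match("abc")))     # → False
print(".", bool(NUMBER.match(".")))         # → False (no digit after dot)
```

Note that `[0-9]*\.[0-9]+` makes the digits before the dot optional (so `.5` matches) but requires at least one digit after it.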
Finite Automata
• Finite automata are computing devices that accept or recognize a given regular expression that represents a language
• Finite Automata Definitions
  − Alphabet (Σ): An alphabet is a finite, non-empty set of symbols. Symbols are represented using lowercase Latin letters. Symbols are considered atoms that cannot be subdivided further. Ex. Σ = {a, b, c}
  − String or Word: A string is a sequence of symbols formed from a single alphabet.
    • Given the alphabet Σ = {a, b, c}, some of the strings that can be formed are: a, abc, aa, abcabcabc
  − Empty String (ε): The empty string is a string composed of zero symbols. The empty string can be included in a language.
  − Size of a String: The size of a string is the number of symbols present in the string.
    • Size of the string ab: |ab| = 2
    • Size of the empty string: |ε| = 0
    • Size of the string b: |b| = 1
  − Concatenation of Strings: Strings can be combined to form a new string.
    • S1 = abc and S2 = def: S1S2 = abcdef and S2S1 = defabc
    • Concatenating the empty string: S1ε = εS1 = abcε = εabc = abc = S1
    • The empty string is the identity element for string concatenation.
  − Languages (L): A language is a (possibly infinite) set of strings over a given alphabet. Σ = {a, b, c}, Language L = {aⁿbⁿcⁿ | n ≥ 0}
    • In this example, the numbers of a's, b's, and c's are the same.
  − Power of an alphabet:
    • Represented as Σⁿ for order n
    • The order n is the length of each string in the set; Σⁿ contains every combination of n symbols from the alphabet
    • For the alphabet Σ = {a, b, c}:
      − Σ⁰ = {ε}
      − Σ¹ = {a, b, c}
      − Σ² = {aa, bb, cc, ab, ba, ac, ca, bc, cb}
      − Σ³ = {aaa, bbb, ccc, aab, bba, aac, cca, …}
  − Closure of an alphabet:
    • Kleene (reflexive-transitive) closure, Σ*:
      − Zero or more symbols combined into strings.
      − Σ* = Σ⁰ ∪ Σ¹ ∪ Σ² ∪ Σ³ ∪ … = {ε, a, b, c, aa, bb, cc, ab, …}
    • Positive (transitive) closure, Σ⁺:
      − One or more symbols combined into strings.
      − Σ⁺ = Σ¹ ∪ Σ² ∪ Σ³ ∪ … = {a, b, c, aa, bb, cc, ab, …}
    • Any language defined on the given alphabet is a subset of the Kleene closure of the alphabet.
      − ∀L, L ⊆ Σ*
  − Empty Language:
    • The empty language is one that has no strings in it.
    • L = {} is an empty language.
    • L = {ε} is not an empty language, because it contains one string, the empty string.
• Finite Automata Representation
  • Circle: state; arrow: transition; double circle: final state
  • States are indicated using numbers
  • Arrows are labeled with a transition symbol or ε
  [Figure 2.2. NFA for ε] [Figure 2.3. NFA for t]
  [Figure 2.4. NFA for XY] [Figure 2.5. NFA for X|Y]
  [Figure 2.6. NFA for X*]
• DFA (Deterministic Finite Automata) vs NFA (Non-deterministic Finite Automata)
  • In a DFA, empty transitions (ε) are not allowed. Also, from any state s there should be at most one edge labeled a.
• Converting from NFA to DFA
  − Find the ε-closure of s:
    • Add s (the node itself) to its ε-closure, i.e., ε-closure(s) = {s}
  − Reachable with empty transitions: If there is a node t in ε-closure(s), and there exists an edge labeled ε from t to u, then u is also added to ε-closure(s) if it is not there already. Continue until no more nodes can be added to ε-closure(s).
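The ε-closure computation described above can be sketched as a worklist algorithm in Python (illustrative; the dictionary encoding of ε-edges is an assumption, here filled with the edges of the (m | n)*mnn example used later in this section):

```python
# ε-closure of a set of NFA states, computed to a fixpoint.

def epsilon_closure(states, eps_edges):
    """eps_edges maps a state to the states reachable by one ε-transition."""
    closure = set(states)          # rule 1: each state is in its own closure
    worklist = list(states)
    while worklist:                # rule 2: follow ε-edges until no change
        t = worklist.pop()
        for u in eps_edges.get(t, ()):
            if u not in closure:
                closure.add(u)
                worklist.append(u)
    return closure

# ε-edges of the Thompson NFA for (m | n)*mnn (states 0-10):
eps = {0: [1, 7], 1: [2, 4], 3: [6], 5: [6], 6: [1, 7]}

print(sorted(epsilon_closure({0}, eps)))     # → [0, 1, 2, 4, 7]
print(sorted(epsilon_closure({3, 8}, eps)))  # → [1, 2, 3, 4, 6, 7, 8]
```

The two printed sets are exactly the sets A and B computed by hand in the example.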
  − State transition:
    • From the initial ε-closure, find transitions on the various terminals present in the given regular expression
    • Example: If there is a node t in ε-closure(s), and there exists an edge labeled a (non-empty) from t to u, then u is added to the new state set. From u, also add all the nodes that can be reached using ε-transitions.
  − A transition table is drawn based on the states and inputs.
  − Optimization (minimization) of the transition table can be done as follows:
    • Partition the set of states into non-final and final states.
    • Within the non-final states:
      − A state whose transition goes outside the group is separated from the group.
      − If there are states with the same transitions on all inputs, keep one of those states and replace the other entries with the preserved one.
      − Check for a dead state. A dead state is one whose transitions all end up in the same state irrespective of the input, and which is not a final state.
Finite Automata - Example
• Transitions for (m | n)*mnn
  [Figure: Thompson NFA for (m | n)*mnn with states 0–10 and ε-transitions]
• Find the ε-closure: Starting from 0, using ε-transitions we can reach 0, 1, 2, 4, and 7. A = {0, 1, 2, 4, 7}.
• Transition of m on set A gives {3, 8}; take its ε-closure, starting from B = {3, 8}:
  − From node 3, we can reach 6, 7, 1, 2, and 4 using ε-transitions. From node 8, no further ε-transition is possible.
  − Finally, we get B = {1, 2, 3, 4, 6, 7, 8}.
• Transition of n on set A gives C = {1, 2, 4, 5, 6, 7}
• Transition of n on set B gives D = {1, 2, 4, 5, 6, 7, 9}
• Transition of n on set D gives E = {1, 2, 4, 5, 6, 7, 10}
• Applying the transition of m on set C gives B again. We stop here, because any further transition only repeats the already-found sets.
Finite Automata - Example
• Transition Table

  State | m | n
  A     | B | C
  B     | B | D
  C     | B | C
  D     | B | E
  E     | B | C

• Non-final states: (ABCD); final state: (E).
• Within the non-final states:
  − On input m, all of them go to B, so they stay in one group.
  − On input n, states A, B, and C move to members of group (ABCD), but D goes to E. So, split (ABCD) into (ABC) and (D).
  − In (ABC), with input n, states A and C go to C but B goes to D. So, split them as (AC) and (B).
  − In (AC), both states have the same transitions. Thus, keep only one (A) of them.
  − Check for a dead state. In this example, there is no dead state.
Grammar Types - Definitions
• Terminal Symbols: Atomic, non-divisible symbols of the language
• Non-terminal Symbols (also called variable symbols, syntactic categories, syntactic variables, or abstractions): A single non-terminal symbol can have more than one right-hand side (RHS) derivation, separated by the alternation bar (|).
• Distinguished symbol (start symbol): The basic category that is being defined
• Production or Rewriting Rules: Rules that are used to define the structure of the constructs. They define how to write any variable symbol using terminal and non-terminal symbols. A rule has a left-hand side (LHS) derived to a right-hand side (RHS) made up of terminal and non-terminal symbols.
• Grammar: A grammar is a finite, non-empty set of rules.
• Syntactic lists: Lists of a syntactic nature can be represented using recursion. <ident_list> → ident | ident, <ident_list>
• Derivation: The process of repeatedly applying the rules, starting from the start symbol, until there are no more non-terminal symbols to expand.
Grammar Types
• Unrestricted Grammar:
  − Also called Recursively Enumerable, Phrase-Structured, or Type 0 grammar
  − There is no restriction on the right-hand side of the production rule
  − At least one non-terminal symbol must be present on the left side of the production rule
  − α → β, where α ∈ (V ∪ T)⁺ contains at least one variable and β ∈ (V ∪ T)*
  − V: finite set of variable symbols
  − T: finite set of terminal symbols
  − Example: S → ACaB; Ca → aaC
• Context-Sensitive Grammar:
  − Also called Type 1 grammar
  − Requires that the right side of a production rule must not have fewer symbols than the left side
  − Called context-sensitive because any replacement of a variable depends on what surrounds it
  − α₁Aα₂ → α₁wα₂
    • where A ∈ V; α₁, α₂ ∈ (V ∪ T)*; and w ∈ (V ∪ T)⁺
  − Example: Things b → b Thing; Thing c → Other b c
• Context-Free Grammar:
  − Also called Type 2 grammar
  − Developed by Noam Chomsky during the mid-1950s
  − The left side of a production rule is a single variable symbol and the right side is a combination of terminal and variable symbols
  − A production rule takes the form A → α, where A ∈ V and α ∈ (V ∪ T)*
  − Example: Fraction → Digit; Fraction → Digit Fraction
• Regular Grammar:
  − Also called Restrictive Grammar or Type 3 grammar
  − Each production rule is restricted to have only one terminal, or one terminal and one variable, on the right side
  − Regular grammars are classified as right-linear or left-linear grammars
  − Right-linear grammar:
    • A → xB or A → x, where A, B ∈ V and x ∈ T
  − Left-linear grammar:
    • A → Bx or A → x, where A, B ∈ V and x ∈ T
  − Regular expressions vs context-free grammars:
    • To represent lexical rules, which are simple in nature, we don't need a notation as powerful as a context-free grammar
    • Regular expressions can be used to build recognizers for the lexical parts (tokens) of a language
• Backus-Naur Form (BNF):
  − Invented by John Backus to describe Algol 58
  − Described as a metalanguage because it is a language that is used to describe another language
  − Considered equivalent to context-free grammar
  − Abstractions are used to represent various classes of syntactic structures; they act like non-terminal symbols
    • To represent a while statement:
      − <while_stmt> → while ( <logic_expr> ) <stmt>
  − Reasons for using BNF to describe syntax:
    • BNF provides a clear and concise syntax description.
    • The parser can be based directly on the BNF.
    • Parsers based on BNF are easier to maintain.
• Extended BNF (EBNF):
  − BNF notation + regular expressions
  − Different notations persist:
    • Optional parts: denoted with a subscript opt or placed within square brackets
      − <proc_call> → ident ( <expr_list> )opt
      − <proc_call> → ident [ ( <expr_list> ) ]
    • Alternative parts:
      − The pipe (|) indicates an either-or choice
      − Grouping of the choices is done with square brackets or parentheses
        − <term> → <term> [+ | -] const
        − <term> → <term> (+ | -) const
    • Repetitions (0 or more) are put in braces ({ })
      − An asterisk indicates zero or more occurrences of the item.
      − Presence or absence of the asterisk means the same here, as the curly brackets themselves already indicate zero or more occurrences of the item.
      − <ident> → letter {letter | digit}*
      − <ident> → letter {letter | digit}
• Differences between BNF and EBNF notations
  − BNF:
    • <expr> → <expr> + <term> | <expr> - <term> | <term>
    • <term> → <term> * <factor> | <term> / <factor> | <factor>
  − EBNF:
    • <expr> → <term> {[+ | -] <term>}*
    • <term> → <factor> {[* | /] <factor>}*
  • EBNF uses the final replacement of <expr> by <term> and writes the right-hand side without any <expr> entry in it.
Derivation
• Apply the grammar to the start symbol <program> and continue to expand until there is no more non-terminal symbol left on the right-hand side
• Methods of Derivation
  − Leftmost derivation is a process by which the leftmost non-terminal in each sentential form is expanded
  − Parse tree or derivation tree
    • A top-down parser keeps the start symbol as the root of the tree. Then, it replaces every variable symbol with a string of terminal symbols.
    • A bottom-up parser begins with the terminal symbols. These terminal symbols are matched with the right-hand side of a production rule and are replaced with the corresponding variable symbol present in the left-hand side of the production rule.
• Parse trees can be used to attach the semantics of a construct to its syntactic structure; this is called syntax-directed semantics
Derivation - Example
• Given the regular grammar S ::= aS | bS | a | b, check whether the grammar can derive the form aⁿbⁿ.
  − Let's try a¹b¹: S ⇒ aS ⇒ ab
  − Let's try a²b²: S ⇒ aS ⇒ aaS ⇒ aabS ⇒ aabb
  − Let's try a³b³: S ⇒ aS ⇒ aaS ⇒ aaaS ⇒ aaabS ⇒ aaabbS ⇒ aaabbb
  − We are able to attain the required form using this regular grammar.
Grammar Issues
• Ambiguities in Grammar
  − A grammar is said to be ambiguous if it generates a sentential form that has two or more distinct parse trees.
  − Ex. if statement with a dangling else: in a nested "if ( Expression ) Statement else Statement", the else can attach to either the inner or the outer if, giving two distinct parse trees.
  [Figure: the two parse trees for the dangling-else sentential form]
Grammar Transformations
• Left Factorization:
  − The initial element of the options on the right side of the given rule is the same
    • N → XY | XZ  becomes  N → X (Y | Z)
• Elimination of Left Recursion:
  − The first element on the right-hand side recurses on the left-hand side of the rule
    • N → X | NY  becomes  N → XY*
  − The termination of NY is possible only if we replace N with X.
  − If N → X is used without the use of N → NY, then there will be no Y.
    • N ⇒ NY ⇒ NYY ⇒ XYY
• Substitution of Non-terminal Symbols:
  − A non-terminal symbol in the right-hand side of a rule can be replaced using its own rule.
    • N → X and M → α N γ can be changed to N → X and M → α X γ
Syntax Diagram
• Also called Syntax Charts or Railroad Diagrams
• Developed by Niklaus Wirth in 1970
• Used to visualize rules in the form of diagrams
• Used to represent EBNF notations, not BNF notations
• Variables are represented by rectangles and terminal symbols are represented by circles (sometimes ovals)
• Each production rule is represented as a directed graph whose vertices are symbols
Recursive Descent Parsing
• There is a subprogram for each non-terminal in the grammar that parses the sentences generated by that non-terminal
• To proceed with the correct grammatical rule, we match each terminal symbol in the right-hand side with the next input token.
  − If there is a match, we continue further.
  − Otherwise, an error is generated or other rules are tried.
• If a non-terminal has more than one RHS, we determine which one to parse first:
  − Choose the correct RHS based on the next token (lookahead).
  − The next token is compared with the first token that can be generated by each RHS until a match is found.
  − If there is no match, it is considered a syntax error.
• Shift-Reduce Parsing: With the given grammar and input string, we repeatedly reduce substrings of the input that match a rule's right-hand side until we attain the start symbol of the grammar
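A recursive-descent parser following this scheme can be sketched in Python for the expression grammar E → T {(+|-) T}; T → F {(*|/) F}; F → ( E ) | number (illustrative; the class and method names are assumptions):

```python
# One method per non-terminal; the next token (lookahead) selects the rule.
# The parser evaluates as it parses, so the result doubles as a check.
import re

def tokenize(src):
    return re.findall(r"[0-9]+|[-+*/()]", src) + ["$"]   # "$" marks the end

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos]

    def match(self, expected):               # consume one terminal symbol
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, got {self.peek()}")
        self.pos += 1

    def expr(self):                          # E -> T {(+|-) T}
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek(); self.match(op)
            value = value + self.term() if op == "+" else value - self.term()
        return value

    def term(self):                          # T -> F {(*|/) F}
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek(); self.match(op)
            value = value * self.factor() if op == "*" else value / self.factor()
        return value

    def factor(self):                        # F -> ( E ) | number
        if self.peek() == "(":
            self.match("("); value = self.expr(); self.match(")")
            return value
        tok = self.peek(); self.match(tok)
        return int(tok)

print(Parser(tokenize("2+3*(4-1)")).expr())  # → 11
```

The `while` loops implement the EBNF repetitions, which also gives + and - left associativity without left-recursive rules.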
Concrete and Abstract Syntax
• Concrete Syntax:
  − Defines the structure of all the parts of a program, like arithmetic expressions, assignments, loops, functions, definitions, etc.
  − Context-free grammars, BNF, EBNF, etc. are of the concrete syntax type.
    • Assignment → Identifier = Expression ;
    • Expression → Term | Expression + Term
• Abstract Syntax:
  − Generated by the parser and used to link the syntax and semantics of a program
  − Unlike concrete syntax, abstract syntax provides only the essential syntactic elements and does not describe how they are structured
    • Statement = Assignment | Loop
    • Assignment = Variable target; Expression source
• Ambiguity occurs in concrete syntax but not in abstract syntax
Symbol Table
• Identification Tables
  − Also called symbol tables
  − A dictionary-type data structure to store identifier names along with their corresponding attributes
• The organization of the identification table depends on the "block structure" used in different languages
  − Monolithic block structure: e.g., BASIC, COBOL
  − Flat block structure: e.g., Fortran
  − Nested block structure: used in the modern "block-structured" programming languages (e.g., Algol, Pascal, C, C++, Scheme, Java, …)
• Monolithic Block Structure:
  − A single block is used for the entire program
  − Every identifier is visible throughout the entire program
  − The scope of each identifier is the whole program, and an identifier cannot be declared twice
• Flat Block Structure:
  − The whole block area is divided into several disjoint blocks
  − Declarations can be local or global
  − Identifiers can be redefined in another block
  − A local declaration is given higher priority than a global declaration
• Nested Block Structure:
  − Blocks may be nested one within another
  − The scope of an identifier depends on the level of nesting
  − An identifier cannot be defined more than once at the same level within the same block
Symbol Table Structure
• Unordered list: Data can be stored in an array or a linked list.
• Ordered list:
  − Entries in the list are ordered
  − Searching is faster
  − Insertion of data into the list is an expensive process
• Binary Search Tree:
  − Using a binary search tree, searching takes O(log n) time.
• Hash Table:
  − The most commonly used option
  − Data can be accessed in (expected) constant time
  − Storage of data is not time consuming
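A hash-table symbol table for a nested block structure can be sketched in Python, where each block is a dict and lookup walks from the innermost block outward (illustrative; the class and method names are assumptions):

```python
# Nested-block symbol table: a stack of hash tables, one per open block.

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                  # global (outermost) block

    def enter_block(self):
        self.scopes.append({})

    def exit_block(self):
        self.scopes.pop()

    def declare(self, name, attrs):
        scope = self.scopes[-1]
        if name in scope:                   # same name at the same level
            raise NameError(f"{name} already declared in this block")
        scope[name] = attrs

    def lookup(self, name):                 # innermost declaration wins
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"{name} not declared")

table = SymbolTable()
table.declare("x", {"type": "int"})
table.enter_block()
table.declare("x", {"type": "float"})       # redefinition in an inner block
print(table.lookup("x"))                    # → {'type': 'float'}
table.exit_block()
print(table.lookup("x"))                    # → {'type': 'int'}
```

This gives the nested-block rules above directly: redefinition is allowed in an inner block, forbidden at the same level, and the local declaration shadows the global one.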
LL Grammar
• The first L in LL specifies that a left-to-right scan of the input is performed
• The second L specifies that a leftmost derivation is generated
• The first step towards using an LL grammar is the elimination of common prefixes. Note: α and β can match zero or more elements.
  − Form: B → αβ1 | αβ2 | … | αβm | Xm+1 | Xm+2 | … | Xm+n
  − Replace it with:
    • B → αB1 | Xm+1 | Xm+2 | … | Xm+n
    • B1 → β1 | β2 | … | βm
• Convert the grammar into an unambiguous one
  − Make sure the rules obey precedence and associativity
  − Start from the terminals and move from high precedence to low precedence
    • Consider the grammar: E → E + E | E * E | (E) | id
  − Select the terminals and name them differently:
    • Factor → (E) | id
  − The * operator has higher priority than the + operator. So, E → E * E is considered first.
• Convert the grammar into an unambiguous one
  • Consider the grammar: E → E + E | E * E | (E) | id
    − * has higher priority than +, so select E → E * E next.
      • To provide the link between E * E and Factor, use the pipe (|) operator.
      • With no link, the non-terminal would never become a terminal.
      • Give the element a new name, "Term".
      • Term → Term * Factor | Factor
    − Then, consider E → E + E and change it as well.
      • Expression → Expression + Term | Term
    − So, F → (E) | id; T → T * F | F; E → E + T | T
• Remove Left Recursion
  − If A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn,
  − where no βi begins with an A (here A is E, α is +T, and β is T),
  − replace the above with:
    • A → β1A′ | β2A′ | … | βnA′
    • A′ → α1A′ | α2A′ | … | αmA′ | ε
• Consider the grammar
  E → TE′; E′ → +TE′ | ε; T → FT′; T′ → *FT′ | ε; F → (E) | id
• FIRST & FOLLOW
  − FIRST:
    • If X is a terminal, then FIRST(X) is {X}.
    • If X is a non-terminal and X → aα is a production (with a a terminal), then add a to FIRST(X). If X → ε is a production, then add ε to FIRST(X).
    • If X → Y1Y2…Yk is a production, then for each i such that Y1, …, Yi-1 are all non-terminals and FIRST(Yj) contains ε for j = 1, 2, …, i-1, add every non-ε symbol in FIRST(Yi) to FIRST(X). If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X).
  − The third rule of FIRST applies to E → TE′, where T → FT′ and F → (E) | id. Thus, what is in FIRST(F) will be in FIRST(E) and FIRST(T).
    • FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
    • FIRST(E′) = {+, ε}
    • FIRST(T′) = {*, ε}
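The FIRST rules above can be sketched as a fixpoint computation in Python for this grammar (illustrative; `eps` stands for ε, and the dictionary encoding is an assumption):

```python
# FIRST sets for: E -> TE'; E' -> +TE'|eps; T -> FT'; T' -> *FT'|eps;
# F -> (E)|id. Iterate the three rules until nothing changes.

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["eps"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["eps"]],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMINALS = set(GRAMMAR)

def first_sets(grammar):
    first = {n: set() for n in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                if body == ["eps"]:
                    add = {"eps"}                  # rule: X -> eps
                else:
                    add, nullable_prefix = set(), True
                    for sym in body:               # rule: X -> Y1 Y2 ... Yk
                        f = first[sym] if sym in NONTERMINALS else {sym}
                        add |= f - {"eps"}
                        if "eps" not in f:         # Yi cannot vanish: stop
                            nullable_prefix = False
                            break
                    if nullable_prefix:            # every Yi can derive eps
                        add.add("eps")
                if not add <= first[head]:
                    first[head] |= add
                    changed = True
    return first

for name, f in sorted(first_sets(GRAMMAR).items()):
    print(name, sorted(f))
# E, T, F → ['(', 'id'];  E' → ['+', 'eps'];  T' → ['*', 'eps']
```

The output reproduces the FIRST sets computed by hand above.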
• FIRST & FOLLOW
  − FOLLOW: (α is any string of grammar symbols; α can also be ε.)
    • $ is in FOLLOW(S), where S is the start symbol.
    • If there is a production A → αBβ, β ≠ ε, then everything in FIRST(β) except ε is in FOLLOW(B).
    • If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
• In computing FOLLOW, first apply the first rule to the whole grammar, then apply the second rule to the whole grammar, and so on.
• Note: Refer to the notes for a verbal explanation of the FIRST & FOLLOW rules.
  [Figure: Second rule of FOLLOW — in A → αBβ, FIRST(β) except ε goes into FOLLOW(B). Third rule of FOLLOW — in A → αB, or in A → αBβ where FIRST(β) contains ε, FOLLOW(A) goes into FOLLOW(B).]
• FIRST & FOLLOW
  − FOLLOW(E) = FOLLOW(E′) = {), $}
  − FOLLOW(T) = FOLLOW(T′) = {+, ), $}
  − FOLLOW(F) = {+, *, ), $}
• Generating the parsing table
  − A grammar whose parsing table has no multiply-defined entries is said to be LL(1). (α is any string of grammar symbols; α can also be ε.)
  1. For each production A → α of the grammar, do steps 2 & 3.
  2. For each terminal a in FIRST(α), add A → α to M[A, a].
  3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
  − Note: Here, M[A, b] indicates the cell of the table whose row corresponds to the non-terminal A and whose column corresponds to the terminal b.
  4. Make each undefined entry of M an error.
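Steps 1–4 above can be sketched in Python, using the FIRST and FOLLOW sets computed for this grammar (illustrative; the encoding and helper names are assumptions, and `eps` stands for ε):

```python
# Build the LL(1) parsing table M for:
# E -> TE'; E' -> +TE'|eps; T -> FT'; T' -> *FT'|eps; F -> (E)|id

PRODUCTIONS = [
    ("E",  ["T", "E'"]),
    ("E'", ["+", "T", "E'"]), ("E'", ["eps"]),
    ("T",  ["F", "T'"]),
    ("T'", ["*", "F", "T'"]), ("T'", ["eps"]),
    ("F",  ["(", "E", ")"]),  ("F",  ["id"]),
]
FIRST = {"E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
         "E'": {"+", "eps"}, "T'": {"*", "eps"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"},
          "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
          "F": {"+", "*", ")", "$"}}

def first_of_body(body):
    if body == ["eps"]:
        return {"eps"}
    sym = body[0]                     # enough here: no body of this grammar
    return FIRST.get(sym, {sym})      # starts with a nullable non-terminal

def build_table():
    table = {}
    for head, body in PRODUCTIONS:    # step 1
        fa = first_of_body(body)
        for a in fa - {"eps"}:        # step 2: FIRST entries
            table[(head, a)] = body
        if "eps" in fa:               # step 3: FOLLOW entries
            for b in FOLLOW[head]:
                table[(head, b)] = body
    return table                      # step 4: missing entries are errors

M = build_table()
print(M[("E", "id")])     # → ['T', "E'"]
print(M[("E'", ")")])     # → ['eps']
print(("E'", "id") in M)  # → False (an error entry)
```

Since no cell receives two productions, the table confirms that this grammar is LL(1).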
LR Grammar
• LR: a Left-to-right scan of the input producing a Rightmost derivation in reverse
• The most powerful shift-reduce parsing technique
  − Non-backtracking shift-reduce parsing that can detect a syntactic error as soon as possible
• Represented as LR(k), where k indicates the look-ahead length
• LR(1) means one symbol of look-ahead: only the next element is considered, not anything that follows the next element.
• Can parse all grammars that can be parsed with predictive parsers such as LL(1) grammars
• Types of LR parsers:
  − SLR – Simple LR parser
  − LR – the most general LR parser
  − LALR – an intermediate LR parser (Look-Ahead LR parser)
• All the types use the same algorithm but with different parsing tables
• LR parser configuration: (S0 X1 S1 … Xm Sm, ai ai+1 … an $), which consists of the stack contents and the rest of the input
  − Xi is a grammar symbol
  − Si is a state
  − ai is an input symbol
  − The initial stack contains just S0
• The LR parsing algorithm is driven by two tables: the Action table, indexed by state and terminal (including $), whose entries are one of four different actions; and the Goto table, indexed by state and non-terminal, whose entries are state numbers.
  [Figure 2.11. LR Parsing: the stack (S0 X1 S1 … Xm Sm), the input buffer (a1 … ai … an $), and the LR parsing algorithm consulting the Action and Goto tables]
• The parser takes an action using Sm and ai
• shift s: shifts the next input symbol ai and the state s onto the stack
  − (S0 X1 S1 … Xm Sm, ai ai+1 … an $) ⊢ (S0 X1 S1 … Xm Sm ai s, ai+1 … an $)
• reduce A → β (or rn, where n is a production number):
  − Pop 2r items (r grammar symbols with their r states) from the stack, where r is the length of β. This is done so that we can replace the right-hand side with the left-hand side of the rule.
  − Then push A and s, where s = goto[Sm-r, A]. Here, m-r indicates that r grammar symbols have been taken off the stack.
  − (S0 X1 S1 … Xm Sm, ai ai+1 … an $) ⊢ (S0 X1 S1 … Xm-r Sm-r A s, ai … an $)
  − The output is the reducing production rule, A → β
• Accept: Parsing is successfully completed.
• Error: The parser has detected an error. This might be because there is an empty entry in the action table.
• GOTO takes a state and a grammar symbol as arguments and produces a state.
Phases of LR Grammar Processing
•
Closure: If I is a set of LR(0) items for a grammar G, then closure(I)
is the set of LR(0) items constructed from I by the two rules:
1. Initially, every LR(0) item in I is added to closure(I).
2. If A  a.B is in closure(I) and B is a production rule of G;
then B. will be in the closure(I). Here, B is a non-terminal. a
can be anything or even empty
•
The above-mentioned rule is applied until no more LR(0) item can be
added to closure(I).
E'  E
E  E+T E  T T  T*F T  F F  (E) F  id
Check for non-terminal after dot, if there is, continue the productions.
closure({E'  .E}) = {
T  .T*FT  .F
E'  .E
E  .E+T
F  .(E)
Syntax Analysis
E  .T
F  .id }
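The two closure rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the (head, body, dot) tuple representation of an LR(0) item and all names are my own, not part of any tool.

```python
# Sketch of LR(0) closure() for the augmented expression grammar.
# An item is (head, body, dot): ("E", ("E", "+", "T"), 0) stands for E -> .E+T
GRAMMAR = {
    "E'": [("E",)],
    "E":  [("E", "+", "T"), ("T",)],
    "T":  [("T", "*", "F"), ("F",)],
    "F":  [("(", "E", ")"), ("id",)],
}

def closure(items):
    """Rule 2: add B -> .gamma for every non-terminal B that follows a dot."""
    result = set(items)            # rule 1: every item of I is in closure(I)
    changed = True
    while changed:                 # repeat until nothing more can be added
        changed = False
        for (head, body, dot) in list(result):
            if dot < len(body) and body[dot] in GRAMMAR:  # non-terminal after dot
                for prod in GRAMMAR[body[dot]]:
                    item = (body[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

I0 = closure({("E'", ("E",), 0)})
# closure({E' -> .E}) contains all 7 productions with the dot at the front
```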
•
GOTO: If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
− If A → α.Xβ is in I, then every item in closure({A → αX.β}) will be in goto(I,X).
Example:
I = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
goto(I,E) = { E' → E., E → E.+T } ← move the dot one step forward over E
goto(I,T) = { E → T., T → T.*F } ← move the dot one step forward over T
goto(I,F) = { T → F. } ← move the dot one step forward over F
goto(I,() = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id } ← after moving the dot past (, a non-terminal follows it, so add that non-terminal's closure
goto(I,id) = { F → id. } ← move the dot one step forward over id
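goto can be sketched the same way: advance the dot over X in every item that allows it, then close the result. A self-contained illustrative sketch, again using a (head, body, dot) tuple per item (my own representation):

```python
# Sketch of goto(I, X) on LR(0) items: move the dot over X, then take closure.
GRAMMAR = {
    "E'": [("E",)],
    "E":  [("E", "+", "T"), ("T",)],
    "T":  [("T", "*", "F"), ("F",)],
    "F":  [("(", "E", ")"), ("id",)],
}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (head, body, dot) in list(result):
            if dot < len(body) and body[dot] in GRAMMAR:
                for prod in GRAMMAR[body[dot]]:
                    if (body[dot], prod, 0) not in result:
                        result.add((body[dot], prod, 0))
                        changed = True
    return result

def goto(items, symbol):
    # advance the dot over `symbol` in every item where it is possible ...
    moved = {(head, body, dot + 1) for (head, body, dot) in items
             if dot < len(body) and body[dot] == symbol}
    return closure(moved)          # ... then re-close the resulting set

I = closure({("E'", ("E",), 0)})
# goto(I, "(") re-closes: after F -> (.E) the dot faces E, so E, T, F items join
```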
•
Canonical LR(0) collection algorithm (needed to create the SLR parsing table):
C is { closure({S' → .S}) }
repeat until no more sets of LR(0) items can be added to C:
  for each I in C and each grammar symbol X:
    if goto(I,X) is not empty and not in C, add goto(I,X) to C
•
The goto function is a DFA on the sets in C. For I1, we look at I0 and use the symbol E. I2 and I3 are obtained using transitions on the symbols T and F.
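The worklist algorithm above can be sketched as follows (an illustrative sketch, self-contained; for the expression grammar it should discover exactly the twelve sets I0 through I11):

```python
# Sketch of the canonical LR(0) collection via a worklist of item sets.
GRAMMAR = {
    "E'": [("E",)],
    "E":  [("E", "+", "T"), ("T",)],
    "T":  [("T", "*", "F"), ("F",)],
    "F":  [("(", "E", ")"), ("id",)],
}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (head, body, dot) in list(result):
            if dot < len(body) and body[dot] in GRAMMAR:
                for prod in GRAMMAR[body[dot]]:
                    if (body[dot], prod, 0) not in result:
                        result.add((body[dot], prod, 0))
                        changed = True
    return result

def goto(items, symbol):
    moved = {(head, body, dot + 1) for (head, body, dot) in items
             if dot < len(body) and body[dot] == symbol}
    return closure(moved)

def canonical_collection():
    start = frozenset(closure({("E'", ("E",), 0)}))   # C starts as {closure({S'->.S})}
    C, work = {start}, [start]
    symbols = ["E", "T", "F", "id", "+", "*", "(", ")"]
    while work:                                       # repeat until no new sets
        I = work.pop()
        for X in symbols:
            J = frozenset(goto(I, X))
            if J and J not in C:                      # non-empty and not yet in C
                C.add(J)
                work.append(J)
    return C

C = canonical_collection()
```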
•
For I4, we have moved the dot over the open parenthesis. As the dot is now followed by E (a non-terminal), we add all the E-productions (E → .E+T and E → .T) as in I0. As some non-terminals (T and F) still follow the dot, we add their productions as well.
•
I5 is made using the transition on id from I0. Then, the transition on + from I1 gives I6.
[Figure 2.12. SLR Transitions: the DFA over the canonical sets I0–I11 for this grammar. I0 goes to I1 on E, to I2 on T, to I3 on F, to I4 on (, and to I5 on id; I1 goes to I6 on +; I2 goes to I7 on *; I4 goes to I8 on E (and to I2, I3, I4, I5 as I0 does); I6 goes to I9 on T (and to I3, I4, I5); I7 goes to I10 on F (and to I4, I5); I8 goes to I11 on ) and to I6 on +; I9 goes to I7 on *.]
LR Grammar – Create SLR Parsing Table
1. Construct the canonical collection of sets of LR(0) items for G′: C ← {I0,...,In}
2. Create the parsing action table as follows:
• If a is a terminal, A → α.aβ is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
• If A → α. is in Ii, then action[i,a] is reduce A → α for all a in FOLLOW(A), where A ≠ S'. The reduce entry for A → α is represented by its sequence number in the grammar.
• Note: there is no element after the dot; α can be anything, even empty.
• If S' → S. is in Ii, then action[i,$] is accept. In our example the start symbol is E, so E' → E. produces the accept entry.
• If any conflicting actions are generated by these rules, the grammar is not SLR(1).
3. Create the parsing goto table:
• for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser contains S' → .S.
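Rule 2 above needs FOLLOW sets. For the expression grammar they can be computed with the usual fixpoint iteration; the sketch below is simplified by the fact that this grammar has no ε-productions and every symbol that follows a non-terminal is a terminal (names and encoding are my own):

```python
# Sketch: FOLLOW sets for 1) E->E+T 2) E->T 3) T->T*F 4) T->F 5) F->(E) 6) F->id
PRODS = [("E'", ("E",)), ("E", ("E", "+", "T")), ("E", ("T",)),
         ("T", ("T", "*", "F")), ("T", ("F",)),
         ("F", ("(", "E", ")")), ("F", ("id",))]
NONTERMS = {"E'", "E", "T", "F"}

def follow_sets():
    follow = {n: set() for n in NONTERMS}
    follow["E'"].add("$")                      # end-marker follows the start symbol
    changed = True
    while changed:                             # iterate to a fixpoint
        changed = False
        for head, body in PRODS:
            for i, X in enumerate(body):
                if X not in NONTERMS:
                    continue
                if i + 1 < len(body):
                    # in this grammar every symbol after a non-terminal is a
                    # terminal, so FIRST(next) is just {next}
                    add = {body[i + 1]}
                else:
                    add = follow[head]         # A -> alpha B: FOLLOW(A) ⊆ FOLLOW(B)
                if not add <= follow[X]:
                    follow[X] |= add
                    changed = True
    return follow

FOLLOW = follow_sets()
# e.g. FOLLOW(E) collects $ from E', + from E->E+T, and ) from F->(E)
```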
•
The numbered grammar:
1) E → E+T   2) E → T   3) T → T*F   4) T → F   5) F → (E)   6) F → id
•
The first entry s5, at (row, column) = (0, id), is there because, from Figure 2.12, I0 transits to I5 on id. So, action[0, id] = shift 5.
•
s6 at (1, +) is there because, from Figure 2.12, I1 transits to I6 on +. And so on…
LR Grammar – Given an input id * id + id
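The run on this input can be traced with a small table-driven parser. The sketch below hard-codes the standard 12-state SLR table for the numbered grammar (state numbers as in Figure 2.12); it is an illustration of the algorithm, not production code:

```python
# Table-driven SLR parse of "id * id + id" for the grammar
# 1) E->E+T 2) E->T 3) T->T*F 4) T->F 5) F->(E) 6) F->id
ACTION = {
    (0, "id"): ("s", 5), (0, "("): ("s", 4),
    (1, "+"): ("s", 6), (1, "$"): ("acc", None),
    (2, "+"): ("r", 2), (2, "*"): ("s", 7), (2, ")"): ("r", 2), (2, "$"): ("r", 2),
    (3, "+"): ("r", 4), (3, "*"): ("r", 4), (3, ")"): ("r", 4), (3, "$"): ("r", 4),
    (4, "id"): ("s", 5), (4, "("): ("s", 4),
    (5, "+"): ("r", 6), (5, "*"): ("r", 6), (5, ")"): ("r", 6), (5, "$"): ("r", 6),
    (6, "id"): ("s", 5), (6, "("): ("s", 4),
    (7, "id"): ("s", 5), (7, "("): ("s", 4),
    (8, "+"): ("s", 6), (8, ")"): ("s", 11),
    (9, "+"): ("r", 1), (9, "*"): ("s", 7), (9, ")"): ("r", 1), (9, "$"): ("r", 1),
    (10, "+"): ("r", 3), (10, "*"): ("r", 3), (10, ")"): ("r", 3), (10, "$"): ("r", 3),
    (11, "+"): ("r", 5), (11, "*"): ("r", 5), (11, ")"): ("r", 5), (11, "$"): ("r", 5),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}
RULES = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

def parse(tokens):
    stack, i, output = [0], 0, []          # stack holds states only
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return "error", output         # empty entry: error
        kind, arg = act
        if kind == "s":                    # shift: push state, advance input
            stack.append(arg)
            i += 1
        elif kind == "r":                  # reduce: pop |rhs| states, push goto
            head, length = RULES[arg]
            del stack[len(stack) - length:]
            stack.append(GOTO[(stack[-1], head)])
            output.append(arg)             # output the production number
        else:
            return "accept", output

status, reductions = parse(["id", "*", "id", "+", "id", "$"])
```

The emitted production numbers trace a rightmost derivation in reverse: F → id, T → F, F → id, T → T*F, E → T, F → id, T → F, E → E+T.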
SLR(1) Grammar
•
SLR(1) grammar is called SLR grammar for short.
•
An SLR grammar is always unambiguous, but that does not mean that all unambiguous grammars are SLR grammars.
•
An SLR grammar does not possess either of these conflicts:
− Shift/Reduce conflict: the parser is in a state where it cannot decide whether to shift or to reduce on a terminal.
− Reduce/Reduce conflict: the parser is in a state where it cannot decide whether to reduce using production rule i or production rule j on a terminal.
•
Canonical SLR(1) parsing table:
− In the SLR method, state i makes a reduction by A → α when the current token is a:
• if A → α. is in Ii and a is in FOLLOW(A)
− In some situations, however, A cannot be followed by the terminal a in a right-sentential form when α and state i are on top of the stack. Making the reduction in such a case is not correct.
• LR(1) item
− In order to avoid invalid reductions, we make the states carry more information. This information is added as a terminal symbol, forming a second component of an item.
− An LR(1) item is defined as A → α.β, a, where a, the look-ahead of the LR(1) item, is a terminal or the end-marker $. When β (in the LR(1) item A → α.β, a) is not empty, the look-ahead has no effect.
− When β is empty (A → α., a), we do the reduction by A → α only if the next input symbol is a (not for every terminal in FOLLOW(A)).
− A state will contain A → α., a1 through A → α., an where {a1,...,an} ⊆ FOLLOW(A)
•
Canonical collection of LR(1) items: similar to LR(0), but with slight changes in closure and goto.
•
closure(I) is (where I is a set of LR(1) items):
− every LR(1) item in I is in closure(I)
− if A → α.Bβ, a is in closure(I) and B → γ is a production rule of G, then B → .γ, b will be in closure(I) for each terminal b in FIRST(βa).
• B is the symbol next to the dot. The productions of any non-terminal that follows the dot are included in the closure.
• Also, b indicates what follows B, as it comes from FIRST(βa). α and β can be anything, even empty.
•
If I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
− If A → α.Xβ, a is in I, then every item in closure({A → αX.β, a}) will be in goto(I,X).
− Move the dot one step forward using goto.
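The LR(1) closure can be sketched for the grammar that the following slides use (S' → S, S → L=R | R, L → *R | id, R → L). The (head, body, dot, lookahead) tuple representation and the helper names are my own illustration:

```python
# Sketch of LR(1) closure for S'->S, S->L=R | R, L->*R | id, R->L.
# An LR(1) item is (head, body, dot, lookahead).
GRAMMAR = {
    "S'": [("S",)],
    "S":  [("L", "=", "R"), ("R",)],
    "L":  [("*", "R"), ("id",)],
    "R":  [("L",)],
}

def first_of(symbol):
    """FIRST for this grammar; no epsilon productions, so only leading symbols."""
    if symbol not in GRAMMAR:
        return {symbol}                      # a terminal yields itself
    out = set()
    for body in GRAMMAR[symbol]:
        if body[0] != symbol:                # no left recursion in this grammar
            out |= first_of(body[0])
    return out

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (head, body, dot, la) in list(result):
            if dot < len(body) and body[dot] in GRAMMAR:
                beta = body[dot + 1:]
                # FIRST(beta a): first of beta if non-empty, else the lookahead a
                lookaheads = first_of(beta[0]) if beta else {la}
                for prod in GRAMMAR[body[dot]]:
                    for b in lookaheads:
                        if (body[dot], prod, 0, b) not in result:
                            result.add((body[dot], prod, 0, b))
                            changed = True
    return result

I0 = closure({("S'", ("S",), 0, "$")})
# S->.L=R,$ contributes L-items with lookahead =, while R->.L,$ contributes
# L-items with lookahead $, matching the L -> .id, $/= items on the next slides
```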
•
Numbering of the rules starts with 1; the initial production S' → S is excluded from the numbering.
•
In I0, in the item S' → .S, $: the $ is the element that follows S'. From here, as the dot is followed by S (a non-terminal), we add its rules (S → .L=R, $ and S → .R, $) as well.
− S' → .S, $ matches A → α.Bβ, a and S → .L=R, $ matches B → .γ, b. $ is added as the look-ahead because β is empty [so FIRST(β) is also empty] and FIRST(βa) = FIRST($) = {$}. Then, as the dot is followed by L and R, we add their rules too. The dot stays at the beginning of the right-hand side in the added rules.
•
In I0, the items L → .*R, $/= and L → .id, $/= arise by applying FIRST(βa): matching A → α.Bβ, a with S → .L=R, $ gives FIRST(=R $) = {=}, and matching it with R → .L, $ gives FIRST($) = {$}.
•
In I0, R → .L, $ does not contain = as a look-ahead, because A → α.Bβ, a is matched to S → .R, $, where β is empty and a is $.
•
Transitions are handled based on the movement of the dot across a terminal or non-terminal. The transition to I1 from I0 is based on S.
LR(1) Parsing Table Construction
1. Construct the canonical collection of sets of LR(1) items for G′: C ← {I0,...,In}
2. Create the parsing action table as follows:
• If a is a terminal, A → α.aβ, b is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
• If A → α., a is in Ii, then action[i,a] is reduce A → α, where A ≠ S'.
• If S' → S., $ is in Ii, then action[i,$] is accept.
• If any conflicting actions are generated by these rules, the grammar is not LR(1).
3. Create the parsing goto table:
• for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser contains S' → .S, $.
LALR Grammar
•
LALR stands for Look-Ahead LR.
•
LALR tables are smaller than LR(1) parsing tables; the number of states is the same as in the SLR parser.
•
The LALR parser is obtained by shrinking the canonical LR(1) parser. This shrinking process must not produce reduce/reduce conflicts.
•
The core of an LR(1) item is its first component, which excludes the look-ahead.
− For example, in S → L.=R, $, the core is S → L.=R
•
If there is more than one set of LR(1) items with the same core, we merge them into a single state.
•
Creating the LALR parsing table
− Create the canonical LR(1) collection of the sets of LR(1) items for the given grammar.
− Find all sets that have the same core, and replace each such group with a single set, their union.
• C = {I0,...,In} → C′ = {J1,...,Jm} where m ≤ n
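The core-merging step can be sketched as follows. This is an illustration only: the item sets named I4, I7 and I2 below are toy examples of my own, not the numbered states of a particular figure.

```python
# Sketch of the LALR merge step: group LR(1) item sets by core.
# An item is (head, body, dot, lookahead); the core drops the lookahead.
def merge_by_core(sets_of_items):
    groups = {}
    for items in sets_of_items:
        core = frozenset((h, b, d) for (h, b, d, _) in items)  # strip lookaheads
        groups.setdefault(core, set()).update(items)           # union same-core sets
    return list(groups.values())

# Two LR(1) states that differ only in lookaheads share a core ...
I4 = {("L", ("id",), 1, "=")}     # L -> id. , =
I7 = {("L", ("id",), 1, "$")}     # L -> id. , $
I2 = {("S", ("L", "=", "R"), 1, "$"), ("R", ("L",), 1, "$")}

merged = merge_by_core([I4, I7, I2])
# ... so I4 and I7 collapse into one LALR state, while I2 stays separate
```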
•
Creating the LALR parsing table (continued)
− Create the parsing tables (action and goto tables) the same way as for the LR(1) parser.
• Note that if J = I1 ∪ ... ∪ Ik, then since I1,...,Ik have the same cores, the cores of goto(I1,X),...,goto(Ik,X) must be the same.
• So goto(J,X) = K, where K is the union of all sets of items having the same core as goto(I1,X).
− If no conflict is introduced, the grammar is an LALR(1) grammar. (Reduce/reduce conflicts can be introduced by the merging, but not shift/reduce conflicts.)
•
Ambiguous grammars produce conflicts
− Consider this ambiguous grammar: E → E + E | E * E | (E) | id
− Produce the parsing table
Error Recovery in LR Grammar
•
Errors can be detected by consulting the parsing action table
− The goto table is not used to detect errors
•
A canonical LR (LR(1)) parser will not make any reduction before announcing an error, but SLR and LALR parsers might make several reductions before indicating one
•
Panic-mode error recovery in an LR parser
− When faced with an error, remove entries from the stack until reaching a state s that has a goto on a particular non-terminal A
− Discard zero or more input symbols until a symbol a is found that is in FOLLOW(A)
− The parser can now push the non-terminal A and the state goto[s,A] and proceed with parsing
•
Phrase-mode error recovery in an LR parser
− An empty entry in the action table is associated with a specific error routine that reflects the most likely error at that point
− The routine may insert symbols into, or delete symbols from, the stack
− This is useful for handling a missing operand, an unbalanced right parenthesis, etc.
Programming the Scanner and Parser
•
For the scanner:
− lex (A Lexical Analyzer Generator): generates code in the C language
− Variants of lex: flex, AT&T lex, Abraxas PCLEX, MKS Lex, POSIX lex, JFlex, …
•
For the parser:
− yacc ("Yet Another Compiler Compiler", with AT&T yacc, Berkeley yacc and GNU Bison as variants)
− Accent: checks for conflicts
•
Programming with lex/flex
− File name: filename.l
− Does not generate executable code; instead it generates a C routine called yylex()
− We need to write a program that calls yylex() to run the lexer
•
Lex programs are divided into three sections: the definitions section, the rules section and the user-subroutines section
− The start and end of the rules section are marked with "%%"
− ONLY the user-subroutines section is optional
•
In the definitions section, the part enclosed by %{ and %} is copied verbatim into the generated C program
•
C-language comments can also be added outside the %{ %} block
•
When placed outside the %{ and %} block, comments must be indented with whitespace
•
Rules section
− Maps patterns to actions
− If more than one action statement must be executed for a pattern, the statements are grouped with braces
•
User-subroutines section
− Can contain many subroutines
− The subroutine that calls yylex() is copied verbatim into the C program
•
Internal variables of lex/flex:
− yylval: contains the value of the token
− yyleng: contains the length of the string the lexer has recognized
•
Internal variables and routines of lex/flex (continued):
− yyin: indicates where the lexer reads its input from; by default yyin is set to stdin
− yylex(): the function that runs the lexer
− yywrap(): a function called by yylex to check for end of file
− input(), output() and unput(): input() and unput() are needed to read input characters and to put them back
− Start states: defined using %s in the definitions section
− ECHO: a macro that writes the token to the current output file yyout; this is equivalent to writing fprintf(yyout, "%s", yytext);
− REJECT: used as an action to put back the text matched by the pattern and search for the next-best match
•
Programming with yacc/Bison
− Performs the task of an LALR(1) parser
− Being an LALR(1) parser, yacc can look only one token ahead, so ambiguity that needs more than one token of look-ahead will generate an error
− The program structure in yacc is similar to that of lex
− Definitions section: definitions, C code and associativity rules are specified here
− yacc calls the yylex routine repeatedly to get tokens and then applies the rules specified
− As lex returns tokens to yacc, both programs need to agree on what the tokens are
• Definitions section in yacc: %token NUMBER
• In the lex program:
extern int yylval;
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
•
Programming with yacc/Bison
− In the yacc program, do the following:
• Specify the value types:
%union { int ival; double cost; }
• Connect the value types to the returned tokens:
%token <ival> INDEX
%token <cost> NUMBER
• Specify the type for the non-terminals; for example, expression is a non-terminal:
%type <cost> expression
• Associativity and precedence rules are specified in the definitions section of the yacc program:
%left '-' '+'
%left '*' '/'
%nonassoc UMINUS
•
Programming with yacc/Bison
− expression: expression '+' NUMBER { $$ = $1 + $3; }
           | expression '-' NUMBER { $$ = $1 - $3; }
           ;
• $1 represents the first number value on the right-hand side, $2 represents the operator, and $3 represents the second number value on the right-hand side. The left-hand side is represented by $$.
− Using %union and yylval, only a single value can be passed between the lexer and the parser; use a symbol table to pass multiple values
− Errors are reported using the yyerror() function
− When compiling the C programs generated by lex and yacc, we use the -ly option of the C compiler; the yacc library contains main() and yyerror()
•
Compilation and execution on the Linux platform
− Compile the lex program: lex filename.l
− Compile the yacc program: yacc -d filename.y
− Compile the C programs: gcc -o output y.tab.c lex.yy.c -ly -ll
− Run the program: ./output
•
Compilation and execution on the Windows platform
− Make sure that flex (flex.exe), bison (bison.exe) and tcc (Tiny C Compiler, or any C compiler) are installed
− Compile the lex program: flex filename.l
− Compile the yacc program:
• bison -d filename.y
• bison -d filename.y -b y
− Compile the generated C programs (using Tiny C Compiler, tcc):
• tcc -o output.exe y.tab.c lex.yy.c yyerror.c libyywrap.c yyinit.c main.c yyaccpt.c
− Run the program: output.exe
•
Programming with Accent and Amber
− After writing the lex program, we write the accent program
− Rules have a left-hand side and a right-hand side, separated by a colon
− The initial symbol of the grammar is called the start symbol, and the grammar is context-free
•
Accent
− Parameters can be specified as in (inherited attributes) and out (synthesized attributes), enclosed in "<" and ">"
− Statements written within %prelude { … } are copied verbatim into the generated C program
•
Given the grammar (R stands for root, E for expression, T for term and F for factor; id is a terminal represented by the token NUMBER):
R → E
E → E + T | T
T → T * F | F
F → (E) | id
the corresponding Accent specification is:
%token NUMBER;
root: expression<n> { printf("Final = %d\n", n); }
;
expression<n>:
expression<x> '+' term<y> { *n = x + y; }
| term<n>
;
term<n>:
term<x> '*' factor<y> { *n = x * y; }
| factor<n>
;
factor<n>:
'(' expression<n> ')'
| NUMBER<n>
;
•
Programming with Accent
− Compilation and execution on Linux:
• lex filename.l
• accent filename.acc
• gcc -o output yygrammar.c lex.yy.c entire.c
• Check for ambiguity using Amber:
− accent filename.acc
− gcc -o output -O3 yygrammar.c amber.c
− output examples 1000
− Compilation and execution on Windows:
• flex filename.l
• accent filename.acc
• tcc -o output.exe yygrammar.c lex.yy.c entire.c yyerror.c libyywrap.c main.c yyinit.c yyaccpt.c
• Check for ambiguity using Amber:
− accent filename.acc
− tcc -o output.exe yygrammar.c amber.c
− output examples 1000