Sem. VII
Examination, Dec.2014
Unit I
1.a) Explain the equivalence of NFA and DFA with suitable example.
Ans. For every non-deterministic finite automaton (NFA), there exists an equivalent deterministic
finite automaton (DFA). The equivalence between the two is defined in terms of language
acceptance. Since an NFA is nothing more than a finite automaton in which zero, one, or more
transitions on an input symbol are permitted, we can always construct a finite automaton that
simulates all the moves of the NFA on a particular input symbol in parallel. The result is a finite
automaton with exactly one transition on each input symbol; hence, it is a DFA equivalent to the
NFA.
Since the DFA parallels the moves of the NFA, every state of the DFA is a combination of states
of the NFA. Hence, every state of the DFA is represented by some subset of the set of states of
the NFA, and the transformation from NFA to DFA is therefore normally called the "subset
construction". If a given NFA has n states, then the equivalent DFA has at most 2^n states, with
the initial state corresponding to the subset {q0}. The transformation from NFA to DFA therefore
involves finding all possible subsets of the set of states of the NFA, considering each subset to
be a state of the DFA, and then finding the transition from it on every input symbol. But not all
the states of a DFA obtained in this way need be reachable from the initial state; if a state is not
reachable from the initial state on any possible input sequence, then it plays no role in deciding
what language is accepted by the DFA. Hence, the amount of work involved in transforming an
NFA to a DFA can be reduced if we generate only the reachable states of the DFA. This can be
done by proceeding as follows:
Let M = (Q, Σ, δ, q0, F) be an NFA to be transformed into a DFA.
Let Q1 be the set of states of the equivalent DFA.
Q1old = Φ
Q1new = {{q0}}
while (Q1old ≠ Q1new)
{
    Temp = Q1new - Q1old
    Q1old = Q1new
    for every subset P in Temp do
        for every a in Σ do
            if the transition from P on a goes to a new subset S of Q
            (the transition from P on a is obtained by finding the
            transitions from every member of P on a in the given NFA
            and then taking the union of all such transitions)
            then Q1new = Q1new ∪ {S}
}
Q1 = Q1new
A subset P in Q1 will be a final state of the DFA if P contains at least one member of F of the
NFA. For example, consider the following finite automaton:
M = ({q0, q1, q2, q3}, {0, 1}, δ, q0, {q3})
δ(q0, 0) = {q1}    δ(q0, 1) = Φ
δ(q1, 0) = {q1}    δ(q1, 1) = {q1, q2}
δ(q2, 0) = Φ       δ(q2, 1) = {q3}
δ(q3, 0) = {q3}    δ(q3, 1) = {q3}
The DFA equivalent of this NFA can be obtained as follows (* marks a final state):
State            0            1
{q0}             {q1}         Φ
{q1}             {q1}         {q1, q2}
{q1, q2}         {q1}         {q1, q2, q3}
*{q1, q2, q3}    {q1, q3}     {q1, q2, q3}
*{q1, q3}        {q1, q3}     {q1, q2, q3}
Φ                Φ            Φ
The transition diagram associated with this DFA is shown in Fig.
Figure. Transition diagram for M = ({q0, q1, q2, q3}, {0, 1}, δ, q0, {q3}).
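The reachable-states procedure described above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name nfa_to_dfa, the dictionary encoding of δ, and the use of a work queue in place of the Q1old/Q1new sets are all assumptions made for the sketch. The transition function is the reading of the example NFA that agrees with the DFA states listed in the answer.

```python
from collections import deque

def nfa_to_dfa(delta, start, finals, alphabet):
    """Subset construction that generates only the reachable DFA states."""
    start_state = frozenset([start])
    dfa_delta = {}
    seen = {start_state}
    work = deque([start_state])
    while work:
        P = work.popleft()
        for a in alphabet:
            # Union of the NFA moves from every member of P on symbol a.
            S = frozenset(q for p in P for q in delta.get((p, a), ()))
            dfa_delta[(P, a)] = S
            if S not in seen:
                seen.add(S)
                work.append(S)
    # A subset is final if it contains at least one final state of the NFA.
    dfa_finals = {P for P in seen if P & finals}
    return seen, dfa_delta, start_state, dfa_finals

# Example NFA (empty transition sets are simply omitted from the dictionary).
delta = {
    ("q0", "0"): {"q1"},
    ("q1", "0"): {"q1"}, ("q1", "1"): {"q1", "q2"},
    ("q2", "1"): {"q3"},
    ("q3", "0"): {"q3"}, ("q3", "1"): {"q3"},
}
states, dfa_delta, s0, dfa_finals = nfa_to_dfa(delta, "q0", {"q3"}, "01")
```

Only six of the sixteen possible subsets are generated here, which is the saving the reachable-states method is meant to give.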
1.b) Give the minimized DFA for the following expression: (a/a)*abb.
2. a) Explain Arden’s theorem.
Ans.2 a) For every regular expression there exists a deterministic finite automaton. So we can
say that regular languages, regular expressions, and finite automata are all different
representations of the same thing. Arden’s theorem is used when moving between these
representations, typically to obtain a regular expression from the state equations of a finite
automaton.
Statement: Let P and Q be two regular expressions over Σ. If P does not contain Λ, then the
equation R = Q + RP has a unique (one and only one) solution, R = QP*.
Proof: The statement of Arden's theorem can be restated point by point:
(i) P and Q are two regular expressions.
(ii) P does not contain the Λ symbol.
(iii) R = Q + RP has a solution, i.e., R = QP*.
(iv) This solution is the one and only solution of the equation.
To show that R = QP* is a solution, substitute it for R on the right-hand side of R = Q + RP:
Q + RP = Q + (QP*)P
       = Q(Λ + P*P)
       = QP*        (since Λ + P*P = P*)
       = R
Both sides are equal, so R = QP* is a solution of the equation R = Q + RP.
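The substitution step can be sanity-checked on concrete languages. The sketch below treats Q and P as finite sets of strings and compares both sides of R = Q + RP on all strings up to a length bound n. It is a bounded check, not a proof, and the helper names concat, star, and arden_holds are hypothetical.

```python
def concat(A, B, n):
    """Concatenation of two languages, keeping only strings of length <= n."""
    return {a + b for a in A for b in B if len(a + b) <= n}

def star(P, n):
    """Kleene closure of P, restricted to strings of length <= n."""
    result = {""}
    frontier = {""}
    while frontier:
        frontier = concat(frontier, P, n) - result
        result |= frontier
    return result

def arden_holds(Q, P, n):
    """Check that R = QP* satisfies R = Q + RP on strings of length <= n."""
    R = concat(Q, star(P, n), n)
    rhs = {q for q in Q if len(q) <= n} | concat(R, P, n)
    return R == rhs

print(arden_holds({"a"}, {"b", "cc"}, 6))
```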
2.b) What is a regular expression? State the rules which define a regular expression.
Ans.2 b) A regular expression is a notation to specify a regular set. Hence, for every regular
expression, there exists a finite automaton that accepts the language specified by the regular
expression. Similarly, for every finite automaton M, there exists a regular expression
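specifying L(M). Regular expressions and the regular sets they specify are shown in the
following table.
Regular expression          Regular set
Φ                           { }
ε                           {ε}
a, for every a in Σ         {a}
r1 + r2 (or r1 | r2)        R1 ∪ R2, where R1 and R2 are the regular sets of r1 and r2
r1 . r2                     R1.R2
r*                          R*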
Hence, we only have three regular-expression operators: | or + to denote union, . for
concatenation, and * for closure. The precedence of the operators in decreasing order is: *,
followed by ., followed by |.
3. What is a compiler? State the various phases of a compiler and explain them in detail.
Ans. A compiler translates or compiles a program written in a high-level programming language
that is suitable for human programmers into the low-level machine language that is required by
computers. So, a compiler is basically a translator whose source language is a high-level
programming language, i.e., a problem-oriented language, and whose target language is a machine
language or assembly language, i.e., a machine-oriented language.
Compilation refers to the compiler's process of translating a high-level language program into a
low-level language program. This process is very complex; hence, from the logical as well as an
implementation point of view, it is customary to partition the compilation process into several
phases.
The two parts of compilation are:
1. Analysis
2. Synthesis
The analysis part breaks up the source program into constituent pieces and creates an
intermediate representation of the source program.
The synthesis part constructs the desired target program from the intermediate representation.
A typical compilation, broken down into phases, is shown in Fig.
Fig. Compiler Phases
The first three phases form the bulk of the analysis portion of a compiler. Symbol table
management and error handling are shown interacting with the six phases.
The following are the various phases of a compiler:
1. Lexical Analyzer: This is the initial part of reading and analyzing the program text. The
text is read and divided into tokens, each of which corresponds to a symbol in the
programming language, e.g., a variable name, keyword or number.
2. Syntax Analyzer: This phase takes the list of tokens produced by the lexical analysis and
arranges these in a tree-structure (called the syntax tree) that reflects the structure of the
program. This phase is often called parsing.
3. Semantic Analyzer: This phase analyses the syntax tree to determine if the program
violates certain consistency requirements, e.g., if a variable is used but not declared or if
it is used in a context that doesn’t make sense given the type of the variable, such as
trying to use a boolean value as a function pointer.
4. Intermediate code generation: The program is translated to a simple machine independent
intermediate language. After syntax and semantic analysis, some compilers generate an
explicit intermediate representation of the source program. This intermediate
representation can have a variety of forms.
In three-address code, the source program might look like this:
temp1 := inttoreal(10)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
5. Code optimization: An optimizer reviews the code, looking for ways to reduce the number
of operations and the memory requirements. The code optimization phase attempts to
improve the intermediate code, so that faster running machine codes will result. Some
optimizations are trivial. There is a great variation in the amount of code optimization
different compilers perform. In those that do the most, called ‘optimizing compilers’, a
significant fraction of the time of the compiler is spent on this phase.
6. Code Generation: The final phase of the compiler is the generation of target code,
consisting normally of relocatable machine code or assembly code. Memory locations are
selected for each of the variables used by the program. Then, intermediate instructions
are each translated into a sequence of machine instructions that perform the same task. A
crucial aspect is the assignment of variables to registers. This phase produces either
machine code for a specific machine, or assembly code for a specific machine and
assembler.
Symbol table management
An essential function of a compiler is to record the identifiers used in the source program and
collect information about various attributes of each identifier. A symbol table is a data structure
containing a record for each identifier, with fields for the attributes of the identifier. The data
structure allows us to find the record for each identifier quickly and to store or retrieve data from
that record quickly. When an identifier in the source program is detected by the lexical analyzer, the
identifier is entered into the symbol table.
Error Detection and Reporting
Each phase can encounter errors. A compiler that stops when it finds the first error is not as
helpful as it could be. The syntax and semantic analysis phases usually handle a large fraction of
the errors detectable by the compiler. The lexical phase can detect errors where the characters
remaining in the input do not form any token of the language. Errors when the token stream
violates the syntax of the language are determined by the syntax analysis phase. During semantic
analysis the compiler tries to detect constructs that have the right syntactic structure but no
meaning to the operation involved.
4.a) Construct the predictive parser for the following grammar:
S→ (L) | a
L→ L,S | S
Ans. To build a predictive parser, follow the steps below:
1. Remove left recursion, if present.
2. Apply left factoring, if needed.
3. Find the FIRST and FOLLOW sets for each non-terminal of the grammar.
4. Construct the predictive parsing table.
Step 1. Remove the left recursion, which is present in L → L,S | S. The L-productions become:
L → SL’
L’ → ,SL’ | ε
Step 2. No left factoring is needed.
Step 3. Find the FIRST and FOLLOW sets for each non-terminal. The productions of the new
grammar are:
S → (L) | a
L → SL’
L’ → ,SL’ | ε
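First(S) = { ( , a }
Follow(S) = { $ , "," , ) }
First(L) = { ( , a }
Follow(L) = { ) }
First(L’) = { "," , ε }
Follow(L’) = { ) }
The comma belongs to Follow(S) because S is followed by L’ in L → SL’ and in L’ → ,SL’, and
First(L’) contains the comma.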
Step 4. Construct the predictive parsing table.
        (           a           ,             )           $
S       S → (L)     S → a
L       L → SL’     L → SL’
L’                              L’ → ,SL’     L’ → ε
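A table-driven predictive parser for this grammar can be sketched as follows. The table entries are the ones implied by the FIRST and FOLLOW sets of the transformed grammar (S → (L) | a, L → SL', L' → ,SL' | ε). The names TABLE and parse are illustrative assumptions, and each input character is treated as one terminal.

```python
# Parsing table M[non-terminal][terminal] -> right-hand side (list of symbols).
TABLE = {
    "S":  {"(": ["(", "L", ")"], "a": ["a"]},
    "L":  {"(": ["S", "L'"], "a": ["S", "L'"]},
    "L'": {",": [",", "S", "L'"], ")": []},   # [] is the ε-production
}

def parse(tokens):
    """Return True if the token sequence can be derived from S."""
    stack = ["$", "S"]
    inp = list(tokens) + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        look = inp[i]
        if top in TABLE:                      # non-terminal: expand via the table
            rhs = TABLE[top].get(look)
            if rhs is None:                   # empty table entry means an error
                return False
            stack.extend(reversed(rhs))
        elif top == look:                     # terminal (or $): match and advance
            i += 1
        else:
            return False
    return i == len(inp)

print(parse("(a,a)"), parse("(a,)"))
```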
4.b) What is FIRST AND FOLLOW? Explain in detail with an example.
Ans. Computing FIRST and FOLLOW
FIRST(α) is the set of terminals that begin the strings derived from α. FOLLOW(A) is the set of
terminals that can appear immediately to the right of the non-terminal A in some sentential form.
These are the rules used to compute the FIRST and FOLLOW sets.
Rules for FIRST:
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → Y1 Y2 … Yk is a production, then place a in FIRST(X) if, for
some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1). If ε is in FIRST(Yj)
for all j = 1, …, k, then add ε to FIRST(X).
Rules for FOLLOW:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right end marker.
2. If there is a production A → αBβ, then everything in FIRST(β) except for ε is placed in
FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then
everything in FOLLOW(A) is in FOLLOW(B).
4. ε is never included in the FOLLOW set of any non-terminal.
For example, consider the grammar:
E → TA
A → +TA | ε
T → FB
B → *FB | ε
F → (E) | id
First(E) = {(, id} Follow(E) = {$, )}
First(A) = {+, ε} Follow(A) = {$, )}
First(T) = {(, id} Follow(T) = {+, ), $}
First(B) = {*, ε} Follow(B) = {+, ), $}
First(F) = {(, id} Follow(F) = {+, ), *, $}
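The FIRST and FOLLOW rules can be applied mechanically by repeating them until no set changes. The sketch below does this for the example grammar. The grammar encoding (a dictionary from non-terminal to a list of right-hand sides, with [] standing for ε) and the function names are assumptions made for illustration.

```python
EPS = "ε"

def first_of_seq(seq, first):
    """FIRST of a sequence of grammar symbols."""
    out = set()
    for X in seq:
        out |= first[X] - {EPS}
        if EPS not in first[X]:
            return out
    out.add(EPS)          # every symbol (or the empty sequence) can derive ε
    return out

def first_follow(grammar, start):
    nonterms = set(grammar)
    terms = {X for rhss in grammar.values() for rhs in rhss for X in rhs} - nonterms
    first = {t: {t} for t in terms}
    first.update({A: set() for A in nonterms})
    follow = {A: set() for A in nonterms}
    follow[start].add("$")
    changed = True
    while changed:                       # iterate until a fixed point is reached
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                before = len(first[A])
                first[A] |= first_of_seq(rhs, first)
                changed |= len(first[A]) != before
                for i, B in enumerate(rhs):
                    if B in nonterms:
                        rest = first_of_seq(rhs[i + 1:], first)
                        before = len(follow[B])
                        follow[B] |= rest - {EPS}
                        if EPS in rest:  # B can end the production
                            follow[B] |= follow[A]
                        changed |= len(follow[B]) != before
    return first, follow

grammar = {
    "E": [["T", "A"]],
    "A": [["+", "T", "A"], []],
    "T": [["F", "B"]],
    "B": [["*", "F", "B"], []],
    "F": [["(", "E", ")"], ["id"]],
}
first, follow = first_follow(grammar, "E")
```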
4.b) Explain the role of Lexical Analyzer and issues of Lexical Analyzer.
Ans. In the lexical analysis phase, the compiler scans the characters of the source program, one
character at a time. Whenever it gets a sufficient number of characters to constitute a token of the
specified language, it outputs that token. In order to perform this task, the lexical analyzer must
know the keywords, identifiers, operators, delimiters, and punctuation symbols of the language
to be implemented. So, when it scans the source program, it will be able to return a suitable
token whenever it encounters a token lexeme. Lexeme refers to the sequence of characters in the
source program that is matched by the language's character patterns that specify identifiers,
operators, keywords, delimiters, punctuation symbols, and so forth. Therefore, the lexical
analyzer design must:
1. Specify the tokens of the language, and
2. Suitably recognize the tokens.
Lexical Analyzer Design
Since the function of the lexical analyzer is to scan the source program and produce a stream of
tokens as output, the issues involved in the design of lexical analyzer are:
1. Identifying the tokens of the language for which the lexical analyzer is to be built, and to
specify these tokens by using suitable notation, and
2. Constructing a suitable recognizer for these tokens.
A lexical analyzer uses an input buffer, a symbol table, and a DFA.
Input Buffer: Initially it is blank. Sometimes the lexical analyzer needs to look ahead some
symbols to decide which token to return. For example, in the C language, after reading -, = or <
the analyzer must look at the next character to decide what token to return. A two-buffer scheme
is used to handle large lookaheads safely.
Therefore, the first thing that is required is to identify what the keywords are, what the operators
are, and what the delimiters are. These are the tokens of the language. After identifying the
tokens of the language, we must use suitable notation to specify these tokens. This notation
should be compact, precise, and easy to understand. Regular expressions can be used to specify a
set of strings, and a set of strings that can be specified by using regular-expression notation is
called a "regular set." The tokens of a programming language constitute a regular set. Hence, this
regular set can be specified by using regular-expression notation. Therefore, we write regular
expressions for things like operators, keywords, and identifiers. For example, the regular
expressions specifying a subset of the tokens of a typical programming language are as follows:
operators = + | - | * | / | mod | div
keywords = if | while | do | then
letter = a | b | c | d | ... | z | A | B | C | ... | Z
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
identifier = letter (letter | digit)*
The advantage of using regular-expression notation for specifying tokens is that when regular
expressions are used, the recognizer for the tokens ends up being a DFA. Therefore, the next step
is the construction of a DFA from the regular expression that specifies the tokens of the
language. But the DFA is a flow-chart (graphical) representation of the lexical analyzer.
Therefore, after constructing the DFA, the next step is to write a program in suitable
programming language that will simulate the DFA. This program acts as a token recognizer or
lexical analyzer. Therefore, we find that by using regular expressions for specifying the tokens,
designing a lexical analyzer becomes a simple mechanical process that involves transforming
regular expressions into finite automata and generating the program for simulating the finite
automata.
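As a rough illustration of a token recognizer driven by the regular expressions above, the sketch below uses Python's re module. Note the assumptions: re is a backtracking matcher rather than a hand-built DFA, the NUMBER and SKIP (whitespace) token classes are additions not listed in the answer, and the names TOKEN_SPEC, MASTER, and tokenize are illustrative.

```python
import re

# Order matters: keywords and word operators are tried before identifiers.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|while|do|then)\b"),
    ("OPERATOR",   r"\b(?:mod|div)\b|[+\-*/]"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),   # letter (letter | digit)*
    ("NUMBER",     r"[0-9]+"),                 # assumed extra token class
    ("SKIP",       r"\s+"),                    # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token class, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("while x1 div 2 do y"))
```

Listing keywords before identifiers is one common way to resolve the overlap between the two patterns; a lexeme such as ifx is still returned as an identifier because of the word-boundary checks.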
5. a) Check whether the following grammar is SLR(1) or not. Explain your answer with reasons.
S → L=R
S → R
L → *R
L → id
R → L
Ans. Step 1.
First(S) = {*, id}
Follow(S) = {$ }
First(L) = {*, id}
Follow(L) = { =, $}
First(R) = {*, id}
Follow(R) = { =, $}
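Step 2. Number the productions and augment the grammar:
S' –> S
S –> L = R (1)
S –> R (2)
L –> *R (3)
L –> id (4)
R –> L (5)
Step 3. Construct the canonical collection of LR(0) items. State 2 of this collection contains the
items S –> L.=R and R –> L.
Step 4. Construct the SLR parsing table. The entry for state 2 on input = is:
S6, r5
In the above table, state 2 has multiple entries on =: the item S –> L.=R calls for a shift to
state 6, while the item R –> L. calls for a reduction by production (5), because = is in
Follow(R). This is a shift-reduce conflict; hence, the grammar is not SLR(1).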
5.b) For the operators given below, calculate the operator precedence relations and operator
precedence function: Id,+,*,$
Ans. Bottom-up parsers for a large class of context-free grammars can be easily developed using
operator grammars. Operator grammars have the property that no production right side is empty
or has two adjacent non-terminals. This property enables the implementation of efficient
operator-precedence parsers. These parsers rely on the following three precedence relations:
a <· b    a yields precedence to b
a =· b    a has the same precedence as b
a ·> b    a takes precedence over b
These operator precedence relations allow us to delimit the handles in the right sentential forms:
<· marks the left end, =· appears in the interior of the handle, and ·> marks the right end. Let us
assume that between the symbols ai and ai+1 there is exactly one precedence relation. Suppose
that $ marks the end of the string. Then for all terminals b we can write: $ <· b and b ·> $. If we
remove all non-terminals and place the correct precedence relation (<·, =·, or ·>) between the
remaining terminals, there remain strings that can be analyzed by an easily developed parser. For
example, the following operator precedence relations can be introduced for simple expressions:
        id      +       *       $
id              ·>      ·>      ·>
+       <·      ·>      <·      ·>
*       <·      ·>      ·>      ·>
$       <·      <·      <·
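The precedence functions f and g asked for in the question can be derived from the relations by the usual graph method: create nodes f_a and g_a for each terminal, add an edge f_a → g_b when a ·> b and an edge g_b → f_a when a <· b, and take the length of the longest path from each node. The sketch below hard-codes the standard relations for id, +, *, $ (with * binding tighter than + and both left-associative) as an assumption, and it presumes the graph is acyclic, which holds for these relations.

```python
from functools import lru_cache

# Relation of row terminal a to column terminal b: '<' is <·, '>' is ·>.
REL = {
    "id": {"+": ">", "*": ">", "$": ">"},
    "+":  {"id": "<", "+": ">", "*": "<", "$": ">"},
    "*":  {"id": "<", "+": ">", "*": ">", "$": ">"},
    "$":  {"id": "<", "+": "<", "*": "<"},
}

def precedence_functions(rel):
    # Edge f_a -> g_b when a ·> b; edge g_b -> f_a when a <· b.
    edges = {}
    for a, row in rel.items():
        for b, r in row.items():
            if r == ">":
                edges.setdefault(("f", a), []).append(("g", b))
            elif r == "<":
                edges.setdefault(("g", b), []).append(("f", a))

    @lru_cache(maxsize=None)
    def longest(node):
        # Length of the longest path starting at node (graph assumed acyclic).
        return max((1 + longest(n) for n in edges.get(node, [])), default=0)

    f = {a: longest(("f", a)) for a in rel}
    g = {a: longest(("g", a)) for a in rel}
    return f, g

f, g = precedence_functions(REL)
print(f, g)
```

With these functions, a <· b corresponds to f(a) < g(b) and a ·> b to f(a) > g(b), so the parser can compare two integers instead of consulting the relation table.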
6. a) Construct a canonical parsing table for the grammar given below.
S → CC
C → cC|d
Ans. Step 1.
First(S) = {c, d}
Follow(S) = {$ }
First(C) = {c, d}
Follow(C) = {c,d,$}
Step 2. Number the productions and augment the grammar:
S' –> S
S –> CC (1)
C –> cC (2)
C –> d (3)
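Step 3. Construct the canonical collection of LR(1) items (each item is followed by its lookahead):
I0:                  S’ → .S, $
                     S → .CC, $
                     C → .cC, c|d
                     C → .d, c|d
I1 = goto(I0, S):    S’ → S., $
I2 = goto(I0, C):    S → C.C, $
                     C → .cC, $
                     C → .d, $
I3 = goto(I0, c):    C → c.C, c|d
                     C → .cC, c|d
                     C → .d, c|d
I4 = goto(I0, d):    C → d., c|d
I5 = goto(I2, C):    S → CC., $
I6 = goto(I2, c):    C → c.C, $
                     C → .cC, $
                     C → .d, $
I7 = goto(I2, d):    C → d., $
I8 = goto(I3, C):    C → cC., c|d
I9 = goto(I6, C):    C → cC., $
Step 4. Construct the canonical LR parsing table (si = shift and go to state i; rj = reduce by
production j, where (1) S → CC, (2) C → cC, (3) C → d):
State    c     d     $      S    C
0        s3    s4           1    2
1                    acc
2        s6    s7                5
3        s3    s4                8
4        r3    r3
5                    r1
6        s6    s7                9
7                    r3
8        r2    r2
9                    r2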
6.b) What is three-address code? Mention its types. How would you implement the three-address
statements? Explain with suitable examples.
Ans. Three-address code is a sequence of statements of the form x = y op z. Since a statement
involves no more than three references, it is called a "three-address statement," and a sequence of
such statements is referred to as three-address code. For example, the three-address code for the
expression a + b * c + d is:
T1 = b * c
T2 = a + T1
T3 = T2 + d
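One way to see how such code is produced is to walk the expression tree bottom-up and emit one statement per operator. The sketch below borrows Python's own ast module to parse the expression, which is a convenience assumption (a compiler would use its own parser); the function name three_address_code and the temporary naming scheme T1, T2, ... are illustrative.

```python
import ast

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def three_address_code(expr):
    """Emit three-address statements for an arithmetic expression string."""
    code = []

    def visit(node):
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left = visit(node.left)       # operands are evaluated first
            right = visit(node.right)
            temp = f"T{len(code) + 1}"    # fresh temporary for this operator
            code.append(f"{temp} = {left} {OPS[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    visit(ast.parse(expr, mode="eval").body)
    return code

for line in three_address_code("a + b * c + d"):
    print(line)
```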
Representing Three-Address Statements
Records with fields for the operators and operands can be used to represent three-address
statements. It is possible to use a record structure with four fields: the first holds the operator, the
next two hold the operand1 and operand2, respectively, and the last one holds the result. This
representation of a three-address statement is called a "quadruple representation".
Quadruple Representation
Using quadruple representation, the three-address statement x = y op z is represented by placing
op in the operator field, y in the operand1 field, z in the operand2 field, and x in the result field.
The statement x = op y, where op is a unary operator, is represented by placing op in the operator
field, y in the operand1 field, and x in the result field; the operand2 field is not used. A statement
like param t1 is represented by placing param in the operator field and t1 in the operand1 field;
neither operand2 nor the result field is used. Unconditional and conditional jump statements are
represented by placing the target labels in the result field. For example, a quadruple
representation of the three-address code for the statement x = (a + b) *(-c)/d is shown in Table 1.
The numbers in parentheses represent the pointers to the triple structure.
Table 1: Quadruple Representation of x = (a + b) *(-c)/d
Triple Representation
In the quadruple representation, the contents of the operand1, operand2, and result fields are
normally pointers to the symbol table records for the names represented by these fields. Hence,
it becomes necessary to
enter temporary names into the symbol table as they are created. This can be avoided by using
the position of the statement to refer to a temporary value. If this is done, then a record structure
with three fields is enough to represent the three-address statements: the first holds the operator
value, and the next two holding values for the operand1 and operand2, respectively. Such a
representation is called a "triple representation". The contents of the operand1 and operand2
fields are either pointers to the symbol table records, or they are pointers to records within the
triple representation itself. For example, a triple representation of the three-address code for the
statement x = (a+b)*(- c)/d is shown in Table 2.
Table 2: Triple Representation of x = (a + b) * (- c)/d
Indirect Triple Representation
Another representation uses an additional array to list the pointers to the triples in the desired
order. This is called an indirect triple representation. For example, an indirect triple representation of the
three-address code for the statement x = (a+b)*(-c)/d is shown in Table 3.
Table 3: Indirect Triple Representation of x = (a + b) * (- c)/d
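The three representations can be compared on the statement x = (a + b) * (-c) / d. In the sketch below, the quadruple list is written out by hand; the temporary names t1 to t4, the operator name uminus, and the function quads_to_triples are illustrative assumptions. The conversion assumes that every result other than the final assignment target is a temporary.

```python
# Quadruples (op, operand1, operand2, result) for x = (a + b) * (-c) / d.
quads = [
    ("+",      "a",  "b",  "t1"),
    ("uminus", "c",  None, "t2"),
    ("*",      "t1", "t2", "t3"),
    ("/",      "t3", "d",  "t4"),
    ("=",      "t4", None, "x"),
]

def quads_to_triples(quads):
    """Replace temporary names by the position of the triple that computes them."""
    position = {}   # temporary name -> index of its defining triple
    triples = []
    for op, arg1, arg2, result in quads:
        a1 = position.get(arg1, arg1)
        a2 = position.get(arg2, arg2)
        if op == "=":
            triples.append((op, result, a1))
        else:
            position[result] = len(triples)
            triples.append((op, a1, a2))
    return triples

triples = quads_to_triples(quads)
# Indirect triples: a separate list gives the execution order of the triples.
order = list(range(len(triples)))
for i in order:
    print(f"({i})", triples[i])
```

In the triples, an integer operand is a pointer to another triple, while a string operand stands for a symbol table entry. Reordering statements then only requires changing the order list, not the triples themselves.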
7.a) Discuss about the run time storage management of a code generator. Describe about the
stack allocation in memory management.
Ans. One of the important tasks that a compiler must perform is to allocate the resources of the
target machine to represent the data objects that are being manipulated by the source program.
That is, a compiler must decide the run-time representation of the data objects in the source
program. Source program run-time representations of the data objects, such as integers and real
variables, usually take the form of equivalent data objects at the machine level; whereas data
structures, such as arrays and strings, are represented by several words of machine memory. The
run-time organization of the memory will be as shown in Figure.
Fig. Heap memory storage allows program-controlled data allocation.
The run-time storage is subdivided to hold the generated target code, the statically allocated
data objects, the stack, and the heap. The sizes of the stack and heap can change
as the program executes.
Each execution of a procedure is referred to as an activation of the procedure. This is different
from the procedure definition, which in its simplest form is the association of an identifier with a
statement; the identifier is the name of the procedure, and the statement is the body of the
procedure. If a procedure is non-recursive, then there exists only one activation of the procedure at
any one time. Whereas if a procedure is recursive, several activations of that procedure may be
active at the same time. The information needed by a single execution or a single activation of a
procedure is managed using a contiguous block of storage called an "activation record"
consisting of the collection of fields.
Stack Allocation
In stack allocation, storage is organized as a stack, and activation records are pushed and popped
as activations of procedures begin and end, respectively, thereby permitting recursive
procedures. The storage for the locals in each procedure call is contained in the activation record
for that call. Hence, the locals are bound to fresh storage in each activation, because a new
activation record is pushed onto stack when a call is made. The storage values of locals are
deleted when the activation ends.
1) The Call and Return Sequence: Procedure calls are implemented by generating what is
called a "call sequence and return sequence" in the target code. The job of a call sequence
is to set up an activation record. Setting up an activation record means entering the
information into the fields of the activation record if the storage for the activation record
is allocated statically. When the storage for the activation record is allocated dynamically,
storage is allocated for it on the stack, and the information is entered in its fields.
2) Access to Nonlocal Names: The way that the non-locals are accessed depends on the
scope rules of the language. There are two different types of scope rules: static scope
rules and dynamic scope rules. Static scope rules determine which declaration a name's
reference will be associated with by examining the program text, thereby
determining from where the name's value will be obtained at run time. When static scope
rules are used during compilation, the compiler knows how the declarations are bound to
the name references, and hence, from where their values will be obtained at run time.
Whereas when dynamic scope rules are used, the values of nonlocal names are retrieved
at run time by scanning down the stack, starting at the top-most activation record.
3) Setting Up the Access Link: To generate the code for setting up the access link, a
compiler makes use of the following information: the nesting depth of the caller
procedure and the nesting depth of the callee procedure.
7.b) What are the various data structures used for implementing the symbol table?
Ans. A symbol table is a data structure used by a compiler to keep track of scope and binding
information about names. This information is used in the source program to identify the various
program elements, like variables, constants, procedures, and the labels of statements. The symbol
table is searched every time a name is encountered in the source text. When a new name or new
information about an existing name is discovered, the content of the symbol table changes.
Various data structures are used for implementing the symbol table. These methods are given
below:
1. The Linear List
A linear list of records is the easiest way to implement a symbol table. The new names are added
to the table in the order that they arrive. Whenever a new name is to be added to the table, the
table is first searched linearly or sequentially to check whether or not the name is already present
in the table. If the name is not present, then the record for new name is created and added to the
list at a position specified by the available pointer, as shown in the Figure.
Fig. A new record is added to the linear list of records.
2. Search Trees
A search tree is a more efficient approach to symbol table organization. We add two links, left
and right, in each record, and these links point to the record in the search tree. Whenever a name
is to be added, first the name is searched in the tree. If it does not exist, then a record for the new
name is created and added at the proper position in the search tree.
Fig. The search tree organization approach to a symbol table.
3. Hash Tables
A hash table is a table of k pointers, numbered from zero to k-1, each of which points to a list
of records within the symbol table. To enter a name into the symbol table, we find the hash value
of the name by applying a suitable hash function. The hash function maps the name into an
integer between zero and k-1, and using this value as an index in the hash table, we search the list
of the symbol table records that is built on that hash index. If the name is not present in that list,
we create a record for name and insert it at the head of the list. When retrieving the information
associated with the name, the hash value of the name is first obtained, and then the list that was
built on this hash value is searched for information about the name (Fig.).
Fig. Hash table method of symbol table organization.
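The hash table organization described above can be sketched as a small Python class. The hash function used here (sum of character codes modulo k) is only a placeholder assumption, and the class and method names are illustrative.

```python
class SymbolTable:
    """Hash table of k buckets; each bucket is a list of symbol records."""

    def __init__(self, k=8):
        self.buckets = [[] for _ in range(k)]

    def _index(self, name):
        # Placeholder hash function: maps a name to an integer in 0..k-1.
        return sum(map(ord, name)) % len(self.buckets)

    def lookup(self, name):
        for record in self.buckets[self._index(name)]:
            if record["name"] == name:
                return record
        return None

    def insert(self, name, **attrs):
        record = self.lookup(name)
        if record is None:
            record = {"name": name}
            # New records go at the head of the list for this hash index.
            self.buckets[self._index(name)].insert(0, record)
        record.update(attrs)
        return record

table = SymbolTable(k=4)
table.insert("x", type="int")
table.insert("y", type="real")
table.insert("x", offset=4)      # new information about an existing name
print(table.lookup("x"))
```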
8.a) List the various error recovery strategies for a syntactic analysis.
Ans. Recovery From Syntactic Phase Errors
A parser detects an error when it has no legal move from its current configuration. The LL(1) and
LR(1) parsers use the valid prefix property; therefore, they are capable of announcing an error as
soon as they read an input that is not a valid continuation of the previous input's prefix. This is
the earliest time that a left-to-right parser can announce an error. But there are a variety of other
types of parsers that do not necessarily have this property.
The advantage of using a parser with the valid-prefix property is that it reports an error
as soon as possible, and it minimizes the amount of erroneous output passed to subsequent
phases of the compiler.
Error Recovery In LR Parsing
A systematic method for error recovery in LR parsing is to scan down the stack until a state S
with a goto on a particular non-terminal A is found, and then discard zero or more input symbols
until a symbol a is found that can legitimately follow A. The parser then pushes the state
goto[S, A] onto the stack and resumes normal parsing. There might be more than one choice for the
non-terminal A. Normally, these would be non-terminals representing major program pieces.
Another method of error recovery that can be implemented is called "phrase level recovery".
Each error entry in the LR parsing table is examined, and, based on language usage, an
appropriate error-recovery procedure is constructed.
Predictive Parsing Error Recovery
An error is detected during predictive parsing when the terminal on the top of the stack does not
match the next input symbol, or when non-terminal A is on top of the stack, a is the next
input symbol, and the parsing table entry M[A, a] is an error entry. Panic mode recovery can be used
to recover from an error detected by the LL parser. The effectiveness of panic mode recovery
depends on the choice of the synchronizing token. Several heuristics can be used when selecting
the synchronizing token in order to ensure quick recovery from common errors:
1. All the symbols in FOLLOW(A) must be kept in the set of synchronizing tokens,
because if we skip input until a symbol in FOLLOW(A) is read, and we pop A from the
stack, it is likely that the parsing can continue.
2. Since the syntactic structure of a language is very often hierarchical, we add the symbols
that begin higher constructs to the synchronizing set of lower constructs. For example, we
add keywords to the synchronizing sets of non-terminals that generate expressions.
3. We also add the symbols in FIRST (A) to the synchronizing set of non-terminal A. This
provides for a resumption of parsing according to A if a symbol in FIRST (A) appears in
the input.
4. A derivation by an ε-production can be used as a default. Error detection will be
postponed, but the error will still be captured. This method reduces the number of
non-terminals that must be considered during error recovery.
8.b) Explain the various limitations of using static allocation.
Ans. In static allocation, the names are bound to specific storage locations as the program is
compiled. These storage locations cannot be changed during the program's execution. Since the
binding does not change at run time, every time a procedure is called, its names are bound to the
same storage locations. However, there are some limitations to using static allocation:
1. The size of the data object and any constraints on its position in memory must be known
at compile time.
2. Recursive procedures cannot be permitted, because all activations of a procedure use the
same binding for local names.
3. Data structures cannot be created dynamically, since there is no mechanism for storage
allocation at run time.
9.a) Explain the principal sources of code optimization in detail.
Ans. Code optimization refers to the techniques used by the compiler to improve the execution
efficiency of the generated object code. It involves a complex analysis of the intermediate code
and the performance of various transformations; but every optimizing transformation must also
preserve the semantics of the program. That is, a compiler should not attempt any optimization
that would lead to a change in the program's semantics.
Optimization can be machine-independent or machine-dependent. Machine-independent
optimizations can be performed independently of the target machine for which the compiler is
generating code; that is, the optimizations are not tied to the target machine's specific platform or
language. On the other hand, machine-dependent optimization requires knowledge of the target
machine. An attempt to generate object code that will utilize the target machine's registers more
efficiently is an example of machine-dependent code optimization. Actually, code optimization
is a misnomer; even after performing various optimizing transformations, there is no guarantee
that the generated object code will be optimal. Hence, we are actually performing code
improvement. When attempting any optimizing transformation, the following criteria should be
applied:
1. The optimization should capture most of the potential improvements without an
unreasonable amount of effort.
2. The optimization should be such that the meaning of the source program is preserved.
3. The optimization should, on average, reduce the time and space expended by the object
code.
Loop Optimization
Loop optimization is the most valuable machine-independent optimization because a program's
inner loops are good candidates for improvement. The important loop optimizations are
elimination of loop invariant computations and elimination of induction variables. A loop
invariant computation is one that computes the same value every time a loop is executed.
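As a rough illustration of a loop invariant computation, the sketch below shows the same function before and after the invariant product is moved out of the loop. The function names, parameters, and the example itself are hypothetical and chosen only for illustration.

```python
def scaled_sum_naive(values, limit, scale):
    """Recomputes the loop-invariant product on every iteration."""
    total = 0
    for v in values:
        bound = limit * scale  # same value on every pass: loop invariant
        if v < bound:
            total += v
    return total


def scaled_sum_hoisted(values, limit, scale):
    """Same result, with the invariant computation moved before the loop."""
    bound = limit * scale  # computed once, outside the loop
    total = 0
    for v in values:
        if v < bound:
            total += v
    return total
```

The move is safe here because limit and scale are not modified inside the loop and the product has no side effects.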
Eliminating Loop Invariant Computations
To eliminate loop invariant computations, we first identify the invariant computations and then
move them outside the loop, provided the move does not change the program's meaning.
Identifying loop invariant computations requires detecting the loops in the program. Whether a
loop exists depends on the program's control flow, so loop detection requires control flow
analysis. For this purpose we use a graphical representation, called a "program flow graph,"
that shows how control flows through the program. To obtain such a graph, we must partition
the intermediate code into basic blocks.
This requires identifying leader statements, which are defined as follows:
1. The first statement is a leader statement.
2. The target of a conditional or unconditional goto is a leader.
3. A statement that immediately follows a conditional goto is a leader.
A basic block is a sequence of three-address statements that can be entered only at the beginning,
and control ends after the execution of the last statement, without a halt or any possibility of
branching, except at the end.
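The three leader rules and the partition into basic blocks can be sketched as follows. This is a minimal sketch under stated assumptions: the (text, kind, target) encoding of a statement and the sample program SUM_LOOP are hypothetical, not part of the answer above.

```python
# Hypothetical three-address program: sum of 1..n.
# Each statement is (text, kind, target); target is a 1-based statement number.
SUM_LOOP = [
    ("s = 0", "plain", None),              # 1
    ("i = 1", "plain", None),              # 2
    ("if i > n goto 7", "if_goto", 7),     # 3
    ("s = s + i", "plain", None),          # 4
    ("i = i + 1", "plain", None),          # 5
    ("goto 3", "goto", 3),                 # 6
    ("return s", "plain", None),           # 7
]


def find_leaders(code):
    """Return the sorted 1-based statement numbers of the leaders."""
    leaders = set()
    if code:
        leaders.add(1)  # rule 1: the first statement is a leader
    for number, (_text, kind, target) in enumerate(code, start=1):
        if kind in ("goto", "if_goto") and target is not None and target <= len(code):
            leaders.add(target)  # rule 2: the target of a goto is a leader
        if kind == "if_goto" and number < len(code):
            leaders.add(number + 1)  # rule 3: statement after a conditional goto
    return sorted(leaders)


def partition(code):
    """Split the code into basic blocks, each a list of statement numbers."""
    leaders = find_leaders(code)
    blocks = []
    for idx, start in enumerate(leaders):
        # A block runs from its leader up to, but not including, the next leader.
        end = leaders[idx + 1] if idx + 1 < len(leaders) else len(code) + 1
        blocks.append(list(range(start, end)))
    return blocks
```

For SUM_LOOP, the leaders are statements 1, 3, 4, and 7, which gives four basic blocks.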
9.b) How would you construct a DAG for a basic block? Explain with an example.
Ans. To partition three-address code into basic blocks, we must identify the leader statements in
the three-address code and then include all the statements, starting from a leader, and up to, but
not including, the next leader. The basic blocks into which the three-address code is partitioned
constitute the nodes or vertices of the program flow graph. The edges in the flow graph are
decided as follows. If B1 and B2 are the two blocks, then add an edge from B1 to B2 in the
program flow graph, if the block B2 follows B1 in an execution sequence. The block B2 follows
B1 in an execution sequence if and only if:
1. The first statement of block B2 immediately follows the last statement of block B1 in the
three-address code, and the last statement of block B1 is not an unconditional goto
statement; or
2. The last statement of block B1 is either a conditional or unconditional goto statement, and
the first statement of block B2 is the target of the last statement of block B1.
For example, consider the following program fragment:
int f = 1;
for(i = 2; i<=x; i++)
f = f*i;
The three-address-code representation for the program fragment above is:
1. f = 1
2. i = 2
3. if i > x goto(8)
4. f = f * i
5. t1 = i + 1
6. i = t1
7. goto(3)
8. goto calling program
The leader statements are:
• Statement number 1, because it is the first statement.
• Statement number 3, because it is the target of a goto.
• Statement number 4, because it immediately follows a conditional goto statement.
• Statement number 8, because it is a target of a conditional goto statement.
Therefore, the basic blocks into which the above code can be partitioned are as follows, and the
program flow graph is shown in Figure 1.
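Block B1: statements 1 and 2 (f = 1; i = 2)
Block B2: statement 3 (the conditional goto)
Block B3: statements 4 to 7 (f = f * i; t1 = i + 1; i = t1; goto(3))
Block B4: statement 8 (goto calling program)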
Figure 1. Program flow graph
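The two edge rules can be applied mechanically to the four blocks of this example. The sketch below is only an illustration: the block boundaries follow from the leader statements 1, 3, 4, and 8 listed above, while the tuple encoding and the "exit" kind used for statement 8 (goto calling program) are assumptions made for the sketch.

```python
# Each block: (name, first statement number, kind of last statement, jump target).
BLOCKS = [
    ("B1", 1, "plain", None),    # statements 1-2, ends with i = 2
    ("B2", 3, "if_goto", 8),     # statement 3, conditional goto(8)
    ("B3", 4, "goto", 3),        # statements 4-7, ends with goto(3)
    ("B4", 8, "exit", None),     # statement 8, leaves the fragment
]


def flow_edges(blocks):
    """Return the flow graph edges as (from_block, to_block) pairs."""
    first_to_name = {first: name for name, first, _kind, _target in blocks}
    edges = []
    for i, (name, _first, kind, target) in enumerate(blocks):
        # Rule 1: fall through to the next block unless the block ends
        # in an unconditional goto (or leaves the fragment).
        if kind in ("plain", "if_goto") and i + 1 < len(blocks):
            edges.append((name, blocks[i + 1][0]))
        # Rule 2: the jump target starts the successor block.
        if kind in ("goto", "if_goto") and target in first_to_name:
            edges.append((name, first_to_name[target]))
    return edges
```

Under these assumptions the edges are B1 to B2, B2 to B3, B2 to B4, and B3 to B2. There is no edge from B3 to B4 because B3 ends in an unconditional goto.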
10.a) Write about Data Flow Analysis of structured programs.
Ans. Once an intermediate-code program is partitioned into basic blocks, we represent the flow
of control between them by a flow graph. The nodes of the flow graph are the basic blocks.
There is an edge from block B to block C if and only if it is possible for the first instruction in
block C to immediately follow the last instruction in block B. There are two ways that such an
edge could be justified:
• There is a conditional or unconditional jump from the end of B to the beginning of C.
• C immediately follows B in the original order of the three-address instructions, and B does not
end in an unconditional jump.
B is a predecessor of C, and C is a successor of B. Often we add two nodes, called the entry and
exit, that do not correspond to executable intermediate instructions. There is an edge from the
entry to the first executable node of the flow graph, that is, to the basic block that comes from the
first instruction of the intermediate code. There is an edge to the exit from any basic block that
contains an instruction that could be the last executed instruction of the program. If the final
instruction of the program is not an unconditional jump, then the block containing the final
instruction of the program is one predecessor of the exit, but so is any basic block that has a jump
to code that is not part of the program.
The set of basic blocks constructed yields the flow graph of Fig. 8.9. The entry points to basic
block B1, since B1 contains the first instruction of the program. The only successor of B1 is B2,
because B1 does not end in an unconditional jump, and the leader of B2 immediately follows the
end of B1.
Block B3 has two successors. One is itself, because the leader of B3, instruction 3, is the target of
the conditional jump at the end of B3, instruction 9. The other successor is B4, because control
can fall through the conditional jump at the end of B3 and next enter the leader of B4. Only B6
points to the exit of the flow graph, since the only way to get to code that follows the program
from which we constructed the flow graph is to fall through the conditional jump that ends B6.
Flow graphs, being quite ordinary graphs, can be represented by any of the data structures
appropriate for graphs. The contents of the nodes (basic blocks) need their own representation. We
might represent the content of a node by a pointer to the leader in the array of three-address
instructions, together with a count of the number of instructions or a second pointer to the last
instruction. However,
since we may be changing the number of instructions in a basic block frequently, it is likely to be
more efficient to create a linked list of instructions for each basic block.
Many code transformations depend upon the identification of "loops" in a flow graph. We say
that a set of nodes L in a flow graph is a loop if
1. There is a node in L called the loop entry with the property that no other node in L has a
predecessor outside L. That is, every path from the entry of the entire flow graph to any node in
L goes through the loop entry.
2. Every node in L has a nonempty path, completely within L, to the entry of L.
Figure. Flow graph
The flow graph in the figure has three loops:
1. B3 by itself.
2. B6 by itself.
3. {B2, B3, B4}.
The first two are single nodes with an edge to the node itself. For instance, B3 forms a loop with
B3 as its entry. Note that the second requirement for a loop is that there be a nonempty path from
B3 to itself. Thus, a single node like B2, which does not have an edge B2 -> B2, is not a loop,
since there is no nonempty path from B2 to itself within {B2}. The third loop, L = {B2, B3, B4},
has B2 as its loop entry.
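The two loop conditions can be checked directly on a successor map. In the sketch below, the edges B1 to B2, B3 to B3, B3 to B4, and B6 to the exit are stated in the text; the remaining edges (B2 to B3, B4 to B2, B4 to B5, B5 to B6) are assumptions about the figure, which is not reproduced here. The function name is_loop is hypothetical.

```python
# Successor map for the flow graph (partly assumed, see the note above).
FLOW = {
    "ENTRY": ["B1"],
    "B1": ["B2"],
    "B2": ["B3"],
    "B3": ["B3", "B4"],
    "B4": ["B2", "B5"],
    "B5": ["B6"],
    "B6": ["B6", "EXIT"],
    "EXIT": [],
}


def is_loop(succ, nodes, entry):
    """Check the two loop conditions for the node set `nodes` with entry `entry`."""
    nodes = set(nodes)
    if entry not in nodes:
        return False
    # Condition 1: no node in L other than the entry has a predecessor outside L.
    for src, targets in succ.items():
        if src in nodes:
            continue
        for dst in targets:
            if dst in nodes and dst != entry:
                return False
    # Condition 2: every node in L has a nonempty path, within L, to the entry.
    # Walk edges backwards from the entry, never leaving L.
    reaches_entry = set()
    work = [entry]
    while work:
        target = work.pop()
        for src in nodes:
            if target in succ.get(src, ()) and src not in reaches_entry:
                reaches_entry.add(src)
                work.append(src)
    return reaches_entry == nodes
```

With this map, {B3}, {B6}, and {B2, B3, B4} with entry B2 pass both conditions, while {B2} alone fails the second condition because there is no edge B2 -> B2.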
10.b) Draw the DAG for a:=b*-c + b* -c.
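Ans. The three-address code for a := b * -c + b * -c is:
T1 = -c
T2 = b * T1
T3 = -c
T4 = b * T3
T5 = T2 + T4
a := T5
Since T1 and T3 compute the same value, as do T2 and T4, the DAG has a single node for -c
and a single node for b * -c, and both operands of + point to that node.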
Fig. DAG
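One common way to build such a DAG is to reuse an existing node whenever the same operator is applied to the same children. The sketch below follows that idea for the three-address code of this question. The tuple encoding of statements, the operator names "uminus" and "copy", and the function name build_dag are assumptions made for illustration.

```python
# Three-address code for a := b * -c + b * -c as (result, op, arg1, arg2).
STATEMENTS = [
    ("T1", "uminus", "c", None),
    ("T2", "*", "b", "T1"),
    ("T3", "uminus", "c", None),
    ("T4", "*", "b", "T3"),
    ("T5", "+", "T2", "T4"),
    ("a", "copy", "T5", None),
]


def build_dag(statements):
    """Return (nodes, current): the DAG nodes and the node id bound to each name."""
    nodes = []      # node id -> (op, left, right); leaves are ("leaf", name, None)
    node_ids = {}   # (op, left, right) -> node id, so identical nodes are shared
    current = {}    # name -> id of the node that holds its current value

    def node_for(key):
        if key not in node_ids:
            node_ids[key] = len(nodes)
            nodes.append(key)
        return node_ids[key]

    def operand(name):
        # A name seen for the first time as an operand becomes a leaf.
        if name not in current:
            current[name] = node_for(("leaf", name, None))
        return current[name]

    for result, op, arg1, arg2 in statements:
        left = operand(arg1)
        if op == "copy":
            current[result] = left  # attach the name to the existing node
            continue
        right = operand(arg2) if arg2 is not None else None
        current[result] = node_for((op, left, right))
    return nodes, current
```

For the statements above, this produces five nodes: leaves c and b, one unary minus node shared by T1 and T3, one multiplication node shared by T2 and T4, and one addition node labelled T5 and a.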