UNIT - I

DEFINITION OF COMPILER
A compiler is a program that reads a program written in one language, the source language, and translates it into an equivalent program in another language, the target language. It also reports to its user the presence of errors in the source program.

    Source Program --> Compiler --> Target Program
                          |
                          v
                    Error Messages

The source language may be any high level language, and the target language may be another programming language or machine language. The first compiler, the FORTRAN compiler, was developed in the 1950s.

THE ANALYSIS-SYNTHESIS MODEL OF COMPILATION
There are two parts to compilation: analysis and synthesis.
Analysis Part - It breaks up the source program into constituent pieces and creates an intermediate representation of the source program.
Synthesis Part - It constructs the desired target program from the intermediate representation.
During analysis, the operations implied by the source program are determined and recorded in a hierarchical structure called a tree. A special kind of tree called a syntax tree is used.
Syntax tree - It is a tree in which each node represents an operation and the children of a node represent the arguments of the operation.
Example: The syntax tree for the assignment statement position := initial + rate * 60 is

              :=
            /    \
     position     +
                /   \
         initial     *
                   /   \
               rate     60

SOFTWARE TOOLS FOR ANALYSIS
Many software tools that manipulate source programs first perform some kind of analysis. They are:
1. Structure editors - It accepts a sequence of commands as input. It performs text creation and manipulation similar to a text editor, but it also produces the hierarchical structure of the program as output. Its output is similar to the output of the analysis phase.
2. Pretty printers - It analyzes the program and prints it in such a way that the structure of the program becomes clearly visible.
3. Static checkers - It reads and analyzes the program and discovers bugs without running the source program.
4. Interpreters - Instead of producing a target program as a translation, it performs the operations implied by the source program, interpreting one line at a time.

The following kinds of processors perform an analysis phase similar to that of a compiler:
1. Text formatters - It takes as input a stream of characters, including commands that describe paragraphs, figures, and mathematical structures such as subscripts and superscripts, and produces formatted output.
2. Silicon compilers - Its source language is similar to a conventional programming language, but the variables of the language represent logical signals rather than locations in memory. The output is a circuit design in an appropriate language.
3. Query interpreters - It translates a predicate into commands; using the commands, it searches a database for records satisfying that predicate.

THE CONTEXT OF A COMPILER
A compiler needs several other programs in order to produce executable machine code. The structure of a language processing system is as follows:

    skeletal source program
            |
        Preprocessor
            |
        source program
            |
         Compiler
            |
    target assembly program
            |
         Assembler
            |
    relocatable machine code
            |            <-- library, relocatable object files
    Loader / Link Editor
            |
    absolute machine code

The source program may be divided into a number of modules stored in separate files. The task of collecting the source program into a single program is performed by a program called a preprocessor. The preprocessor may also expand shorthands, called macros, into source language statements. The target program created by the compiler may need further processing before it can be run: the compiler creates assembly code that is translated by an assembler into machine code.
The machine code is then linked with some library routines by a loader / link editor into the code that actually runs on the machine.

Analysis of the Source Program
The analysis part of the compiler has three phases:
1. Linear analysis
2. Hierarchical analysis
3. Semantic analysis

Linear Analysis
Linear analysis is otherwise called lexical analysis or scanning. This phase groups the characters of the input into tokens.
Example: The tokens for the assignment statement position := initial + rate * 60 are:
1. The identifier position
2. The assignment symbol :=
3. The identifier initial
4. The + sign
5. The identifier rate
6. The * sign
7. The number 60

Hierarchical Analysis
It is otherwise called parsing or syntax analysis. It groups the tokens into grammatical phrases that are used by the compiler to synthesize output. The grammatical phrases of the source program are represented by a parse tree. The parse tree for the assignment statement position := initial + rate * 60 is:

    assignment statement
    |-- identifier: position
    |-- :=
    '-- expression
        |-- expression
        |   '-- identifier: initial
        |-- +
        '-- expression
            |-- expression
            |   '-- identifier: rate
            |-- *
            '-- expression
                '-- number: 60

The hierarchical structure of a program is expressed by recursive rules. The following recursive rules define an expression:
1. Any identifier is an expression.
2. Any number is an expression.
3. If expression1 and expression2 are expressions, then so are
       expression1 + expression2
       expression1 * expression2
       ( expression1 )
Rules 1 and 2 are nonrecursive basis rules, and rule 3 is a recursive rule. Lexical constructs do not require recursion, but syntax analysis does.
Syntax tree - It is a compressed representation of the parse tree in which the operators appear as the interior nodes, and the operands of an operator are the children of the node for that operator.

Semantic Analysis
This phase checks the source program for semantic errors. It uses the hierarchical structure provided by the syntax analysis phase to identify the operators and operands of expressions and statements. An important component of semantic analysis is type checking. For example, when a binary arithmetic operator is applied to an integer and a real, the compiler may need to convert the integer to a real.

PHASES OF A COMPILER
A compiler operates in phases, each of which transforms the source program from one representation to another. The phases, in order, are:

    source program
        --> Lexical Analyzer
        --> Syntax Analyzer
        --> Semantic Analyzer
        --> Intermediate Code Generator
        --> Code Optimizer
        --> Code Generator
        --> target program

The symbol-table manager and the error handler interact with all six phases.

Symbol Table Management
A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly.
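As an illustration, a symbol-table record and a linear-search lookup might be sketched in C as follows. This is a minimal sketch only: the field names (lexeme, type, scope) and the fixed-size table are assumptions for illustration, not a prescribed layout, and a real compiler would typically use a hash table for fast lookup.

    #include <string.h>

    /* One record per identifier; the fields hold the identifier's attributes. */
    struct symtab_entry {
        char lexeme[64];   /* the identifier's character string          */
        int  type;         /* e.g. integer or real (encoding is assumed) */
        int  scope;        /* nesting depth at which it was declared     */
    };

    static struct symtab_entry table[1024];
    static int nentries = 0;

    /* Return the index of the record for name, or -1 if absent. */
    int lookup(const char *name) {
        for (int i = 0; i < nentries; i++)
            if (strcmp(table[i].lexeme, name) == 0)
                return i;
        return -1;
    }

    /* Insert a new identifier and return the index of its record. */
    int insert(const char *name, int type, int scope) {
        strncpy(table[nentries].lexeme, name, sizeof table[nentries].lexeme - 1);
        table[nentries].type = type;
        table[nentries].scope = scope;
        return nentries++;
    }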
Error Detecting and Reporting
Each phase can encounter errors. After detecting an error, a phase must deal with that error so that compilation can proceed and further errors can be found. The lexical analysis phase detects errors where the characters remaining in the input do not form any token of the language. The syntax analysis phase detects errors where the token stream violates the structure rules of the language. The semantic analysis phase detects errors where the syntactic structure is correct but the construct has no meaning for the operation involved.

Analysis Phase
The lexical analysis phase reads the characters in the source program and groups them into a stream of tokens. A token may be an identifier, a keyword, a punctuation character, or a multi-character operator like :=. The character sequence forming a token is called the lexeme for the token. Certain tokens are augmented by a "lexical value". The syntax analysis phase imposes a hierarchical structure on the token stream, called a syntax tree, in which an interior node is a record with a field for the operator and two fields containing pointers to the records for the left and right children. The semantic analysis phase performs type checking.

Intermediate Code Generation
We consider an intermediate code form called "three-address code", which is like the assembly language for a machine. It consists of a sequence of instructions, each of which has at most three operands. It has the following properties:
1. Each three-address instruction has at most one operator in addition to the assignment.
2. The compiler must generate a temporary name to hold the value computed by each instruction.
3. Some instructions have fewer than three operands.

Code Optimization
This phase improves the intermediate code so that faster-running machine code will result. A significant fraction of the compiler's time is spent on this phase.

Code Generation
This is the final phase, which generates the target code, either relocatable machine code or assembly code. Memory locations are selected for each of the variables used by the program; then the intermediate instructions are each translated into a sequence of machine instructions that perform the same task.

Translation of the statement position := initial + rate * 60 through the phases:

    lexical analyzer:             id1 := id2 + id3 * 60
    syntax analyzer:              tree for id1 := id2 + (id3 * 60)
    semantic analyzer:            id1 := id2 + (id3 * inttoreal(60))
    intermediate code generator:  temp1 := inttoreal(60)
                                  temp2 := id3 * temp1
                                  temp3 := id2 + temp2
                                  id1 := temp3
    code optimizer:               temp1 := id3 * 60.0
                                  id1 := id2 + temp1
    code generator:               MOVF id3, R2
                                  MULF #60.0, R2
                                  MOVF id2, R1
                                  ADDF R2, R1
                                  MOVF R1, id1
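To make the intermediate form concrete, the three-address instructions above can be stored as quadruples: an operator, up to two arguments, and a result. The C sketch below is illustrative only; the operator names and the string representation of operands are assumptions.

    /* A three-address instruction: result := arg1 op arg2.
       Each instruction has at most one operator besides the assignment,
       and temporaries (temp1, temp2, ...) hold intermediate values. */
    enum op { OP_ADD, OP_MUL, OP_ASSIGN, OP_INTTOREAL };

    struct quad {
        enum op     op;
        const char *arg1;    /* name or temporary                          */
        const char *arg2;    /* 0 for unary and simple assignment forms    */
        const char *result;
    };

    /* The intermediate code generated for id1 := id2 + id3 * 60 */
    struct quad code[] = {
        { OP_INTTOREAL, "60",    0,       "temp1" },
        { OP_MUL,       "id3",   "temp1", "temp2" },
        { OP_ADD,       "id2",   "temp2", "temp3" },
        { OP_ASSIGN,    "temp3", 0,       "id1"   },
    };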
COUSINS OF THE COMPILER
The input to a compiler may be produced by one or more preprocessors, and further processing of the compiler's output may be needed before running machine code is obtained.

Preprocessor
It produces input to the compiler. It performs the following functions:
1. Macro processing - Macros are shorthands for longer constructs; the preprocessor expands them.
2. File inclusion - The preprocessor includes header files into the program text.
3. "Rational" preprocessing - It augments an older language with more modern flow-of-control and data-structuring facilities.
4. Language extension - It adds capabilities to the language by what amounts to built-in macros.

Assemblers
Some compilers produce assembly code that is passed to an assembler for further processing. Assembly code is a mnemonic version of machine code; the assembler produces relocatable machine code that is passed directly to the loader / link editor.

Loader / Link Editors
A loader is a program that performs the two functions of loading and link editing. Loading consists of taking relocatable machine code, altering the relocatable addresses, and placing the altered instructions and data in memory at the proper locations. The link editor allows us to make a single program from several files of relocatable machine code.

Grouping of Phases
In an implementation of a compiler, activities from more than one phase are often grouped together.

Front and Back Ends
The phases are collected into a front end and a back end. The front end consists of the lexical analysis, syntax analysis, semantic analysis, and intermediate code generation phases, together with the symbol-table manager and the error handler. The back end consists of the code optimization and code generation phases, again together with the symbol-table manager and the error handler.

Passes
Several phases of compilation are usually implemented in a single pass consisting of reading an input file and writing an output file. Because of this grouping, one representation of the source program is converted directly into another.

COMPILER CONSTRUCTION TOOLS
The compiler writer, like any programmer, uses software tools such as debuggers, version managers, and profilers. In addition, some more specialized tools exist:

Parser Generators
The input is a context-free grammar and the output is a syntax analyzer. This makes the syntax analyzer easy to implement.

Scanner Generators
It generates a lexical analyzer from a regular-expression description of the tokens.

Syntax-Directed Translation Engines
It produces collections of routines that walk the parse tree and generate intermediate code.

Automatic Code Generators
It translates the intermediate language into machine language based on a collection of rules. The rules include enough detail to handle the different possible access methods for data. "Template matching" techniques are used: templates that represent sequences of machine instructions replace the intermediate code statements.

Data-Flow Engines
It supports good code optimization. This involves data-flow analysis: the gathering of information about how values are transmitted from one part of a program to another.

SYNTAX
A programming language can be defined by describing what its programs look like; the form of its programs is called the syntax of the language.

SEMANTICS
A programming language can also be defined by describing what its programs mean; the meaning of its programs is called the semantics of the language.

CONTEXT FREE GRAMMAR
A grammar naturally describes the hierarchical structure of many programming language constructs. It has four components:
1. A set of tokens, known as terminal symbols.
2. A set of non-terminals.
3. A set of productions, where each production consists of a non-terminal, called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
4. A designation of one of the non-terminals as the start symbol.
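For instance, the expressions formed from digits and the operators + and - (such as 9-5+2) are generated by the following grammar, shown here as an illustration of the four components:

    list  -> list + digit
    list  -> list - digit
    list  -> digit
    digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Here the tokens (terminals) are +, - and the digits 0 through 9; the non-terminals are list and digit; and list is designated as the start symbol.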
PARSE TREE
A parse tree pictorially shows how the start symbol of a grammar derives a string in the language. If a non-terminal A has the production A -> XYZ, then a parse tree may have an interior node labeled A with three children labeled X, Y, and Z, from left to right:

        A
      / | \
     X  Y  Z

Properties of a Parse Tree
1. The root is labeled by the start symbol.
2. Each leaf is labeled by a token or by ε.
3. Each interior node is labeled by a non-terminal.

Definition of Yield
The leaves of a parse tree, read from left to right, form the yield of the tree, which is the string generated or derived from the non-terminal at the root of the parse tree.

Parsing
The process of finding a parse tree for a given string of tokens is called parsing that string.

Ambiguity
A grammar can have more than one parse tree generating a given string of tokens. Such a grammar is said to be ambiguous. For example, the string 9 - 5 + 2 has two parse trees:

          string                        string
         /  |   \                      /  |   \
    string  +  string             string  -  string
    /  |  \       |                  |      /  |  \
   9   -   5      2                  9     5   +   2

In the first tree the subtraction is grouped first, corresponding to (9-5)+2; in the second tree the addition is grouped first, corresponding to 9-(5+2). To avoid this ambiguity, two conventions are used:
1. Associativity of operators
2. Precedence of operators

Associativity of Operators
By convention, 9+5+2 is equivalent to (9+5)+2 and 9-5-2 is equivalent to (9-5)-2. When an operand like 5 has operators to its left and to its right, conventions are needed for deciding which operator takes that operand. We say that the operator + associates to the left, because an operand with plus signs on both sides of it is taken by the operator to its left. In most programming languages the four arithmetic operators (addition, subtraction, multiplication, and division) are left associative. Some common operators, such as exponentiation, are right associative. As another example, the assignment operator = in C is right associative: in C, the expression a = b = c is treated in the same way as the expression a = (b = c).

Precedence of Operators
Consider the expression 9+5*2. There are two possible interpretations of this expression: (9+5)*2 or 9+(5*2). The associativity rules for + and * do not resolve this ambiguity. For this reason, we need to know the relative precedence of operators when more than one kind of operator is present. We say that * has higher precedence than + if * takes its operands before + does. In ordinary arithmetic, multiplication and division have higher precedence than addition and subtraction. Therefore, 5 is taken by * in both 9+5*2 and 9*5+2; i.e., the expressions are equivalent to 9+(5*2) and (9*5)+2, respectively.

Syntax Directed Definitions
A syntax-directed definition uses a context-free grammar to specify the syntactic structure of the input. With each grammar symbol it associates a set of attributes, and with each production a set of semantic rules for computing the values of the attributes associated with the symbols appearing in that production.

ROLE OF LEXICAL ANALYZER (LA)
The lexical analyzer is the first phase of a compiler. It reads the input characters and produces as output a sequence of tokens that the parser uses for syntax analysis. The interaction between the lexical analyzer and the parser is:

    source program --> Lexical Analyzer --token--> Parser
                       (the parser requests the next token;
                        both phases consult the symbol table)

Issues in Lexical Analysis
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing:
1. Simpler design.
2. Compiler efficiency is improved.
3. Compiler portability is enhanced.

Tokens, Patterns, Lexemes
Lexeme - The character sequence forming a token is called the lexeme for the token; it may have an associated lexical value.
Token - A set of strings (lexemes) treated as a single terminal symbol of the grammar, such as an identifier or a keyword.
Pattern - The rule describing the set of strings that can form a particular token is called the pattern associated with that token.

Example:

    TOKEN     SAMPLE LEXEMES          INFORMAL DESCRIPTION OF PATTERN
    const     const                   const
    if        if                      if
    relation  <, <=, =, <>, >, >=     < or <= or = or <> or > or >=
    id        pi, count, D2           letter followed by letters and digits
    num       3.1416, 0, 6.02E23      any numeric constant
    literal   "core dumped"           any characters between " and " except "
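In code, a token is commonly represented as a token kind together with an attribute. The C sketch below is illustrative; the enum names and the union layout are assumptions rather than a fixed interface:

    /* Token kinds corresponding to the table above. */
    enum token_kind { TK_CONST, TK_IF, TK_RELATION, TK_ID, TK_NUM, TK_LITERAL };

    /* Which relational operator a relation token denotes. */
    enum relop { LT, LE, EQ, NE, GT, GE };

    struct token {
        enum token_kind kind;
        union {
            int        symtab_index;  /* id: index of its symbol-table entry */
            double     value;         /* num: the numeric value              */
            enum relop relop;         /* relation: which operator            */
        } attr;                       /* keywords need no attribute          */
    };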
Attributes for Tokens
When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched, for use by the subsequent phases of the compiler. Identifiers and numbers have a single attribute: a pointer to the symbol-table entry for the lexeme.

Example: The tokens and associated attribute values for the Fortran statement E = M * C ** 2 are written as follows:

    <id, pointer to symbol-table entry for E>
    <assign_op>
    <id, pointer to symbol-table entry for M>
    <mult_op>
    <id, pointer to symbol-table entry for C>
    <exp_op>
    <num, integer value 2>

Lexical Errors
Suppose the lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input. In this case it uses the "panic mode" error recovery strategy: successive characters are deleted from the remaining input until the lexical analyzer can find a well-formed token. Some other error recovery strategies are:
1. Deleting an extraneous character.
2. Inserting a missing character.
3. Replacing an incorrect character by a correct character.
4. Transposing two adjacent characters.

INPUT BUFFERING
To speed up the operation of the lexical analyzer, two buffering techniques are used:
1. A two-buffer input scheme (buffer pairs).
2. Sentinels.

Buffer Pairs
This method is used when the lexical analyzer needs to look ahead several characters beyond the lexeme for a pattern before a match can be announced. The input buffer is divided into two N-character halves, where N is the number of characters on one disk block, e.g., 1024 or 4096:

    | E | = | M | * | C | * | * | 2 | eof |
          ^                       ^
    lexeme_beginning            forward

We read N input characters into each half of the buffer with one system read command, rather than invoking a read command for each input character. If fewer than N characters remain in the input, then a special character eof is read into the buffer after the input characters; this character is different from any input character. Two pointers into the buffer are maintained: forward and lexeme_beginning. The string of characters between the two pointers is the current lexeme. Initially, both pointers point to the first character of the next lexeme to be found. The forward pointer scans ahead until a match for a pattern is found; after the token is found, it points to the character at the lexeme's right end. The code to advance the forward pointer is:

    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else forward := forward + 1;

Sentinels
The previous method requires two tests for each advance of the forward pointer. With sentinels this is reduced to a single test. Each buffer half holds a sentinel character at its end: a special character that cannot be part of the source program (eof serves this purpose). The same buffer arrangement with sentinels is:

    | E | = | M | * | eof | C | * | * | 2 | eof |
          ^                      ^
    lexeme_beginning           forward

Lookahead code with sentinels:

    forward := forward + 1;
    if the character at forward = eof then begin
        if forward at end of first half then begin
            reload second half;
            forward := forward + 1
        end
        else if forward at end of second half then begin
            reload first half;
            move forward to beginning of first half
        end
        else /* eof within a buffer half marks the end of input */
            terminate lexical analysis
    end
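A minimal C sketch of this sentinel scheme is shown below. It reads from standard input, uses '\0' as the sentinel (an assumption that the NUL character never occurs in source text), and keeps the one-test-per-character property described above:

    #include <stdio.h>

    #define N 4096                /* characters per buffer half (one disk block) */
    #define SENTINEL '\0'         /* assumed never to appear in the source text  */

    static char buf[2 * N + 2];   /* two halves, each followed by a sentinel slot */
    static char *forward = buf;

    /* Fill one half with up to N characters and terminate it with the sentinel. */
    static void reload(char *half) {
        size_t n = fread(half, 1, N, stdin);
        half[n] = SENTINEL;       /* marks the real end of input when n < N */
    }

    /* Return the next input character, or EOF at end of input.
       The common case costs exactly one comparison (against the sentinel). */
    static int nextchar(void) {
        for (;;) {
            char c = *forward;
            if (c != SENTINEL) { forward++; return (unsigned char)c; }
            if (forward == buf + N) {                /* end of first half  */
                reload(buf + N + 1);
                forward = buf + N + 1;
            } else if (forward == buf + 2 * N + 1) { /* end of second half */
                reload(buf);
                forward = buf;
            } else {
                return EOF;                          /* true end of input  */
            }
        }
    }

    int main(void) {
        reload(buf);              /* prime the first half */
        int c;
        while ((c = nextchar()) != EOF)
            putchar(c);           /* demo: echo the input */
        return 0;
    }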
SPECIFICATION OF TOKENS

Strings and Languages
A finite set of symbols is called an alphabet or character class; an example is the binary alphabet {0, 1}. A finite sequence of symbols drawn from an alphabet is called a string. |s| denotes the number of occurrences of symbols in s, called the length of the string s. The string of length zero is called the empty string and is denoted by ε. Any set of strings over some fixed alphabet is called a language.

A string obtained by removing zero or more trailing symbols of a string s is called a prefix of s. A string formed by deleting zero or more of the leading symbols of s is called a suffix of s. A string obtained by deleting a prefix and a suffix from s is called a substring of s. Every prefix and every suffix of s is also a substring of s, and s is a prefix, suffix, and substring of itself. A nonempty string x that is a prefix, suffix, or substring of s such that s ≠ x is called a proper prefix, suffix, or substring of s. A string formed by deleting zero or more not necessarily contiguous symbols from s is called a subsequence of s; for example, baaa is a subsequence of banana.

Operations on Languages
Several important operations can be applied to languages:
1. Union:            L ∪ M = { s | s is in L or s is in M }
2. Concatenation:    LM = { st | s is in L and t is in M }
3. Kleene closure:   L* = the union of L^i for i = 0, 1, 2, ... (zero or more concatenations of L)
4. Positive closure: L+ = the union of L^i for i = 1, 2, 3, ... (one or more concatenations of L)

Regular Expressions
A language denoted by a regular expression is said to be a regular set.

Regular Definitions
We give names to regular expressions and then define further regular expressions using these names as if they were symbols. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

    d1 -> r1
    d2 -> r2
    .....
    dn -> rn

where each di is a distinct name and each ri is a regular expression over the symbols defined so far.

Notational Shorthands
One or more instances: the unary postfix operator + means "one or more instances of". The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+ | ε and r+ = r r* relate the Kleene closure and positive closure operators.
Zero or one instance: the unary postfix operator ? means "zero or one instance of". The notation (r)? is a shorthand for r | ε.
Character classes: the notation [abc], where a, b, c are alphabet symbols, denotes the regular expression a | b | c. For example, the regular expression for identifiers can be described using this notation as [A-Za-z][A-Za-z0-9]*.

Non-Regular Sets
Some languages cannot be described by any regular expression; these are called non-regular sets. For example, repeated strings cannot be described by regular expressions:

    { wcw | w is a string of a's and b's }

UNIT - II

RECOGNITION OF TOKENS
This topic explains how tokens are recognized. Consider the grammar fragment

    stmt -> if expr then stmt
          | if expr then stmt else stmt
          | ε
    expr -> term relop term
          | term
    term -> id
          | num

where the terminals if, then, else, relop, id, and num generate sets of strings given by the following regular definitions:

    if    -> if
    then  -> then
    else  -> else
    relop -> < | <= | = | <> | > | >=
    id    -> letter ( letter | digit )*
    num   -> digit+ ( . digit+ )? ( E (+|-)? digit+ )?
    delim -> blank | tab | newline
    ws    -> delim+

Regular expression patterns for tokens are shown in the following table:

    REGULAR EXPRESSION   TOKEN   ATTRIBUTE VALUE
    ws                   -       -
    if                   if      -
    then                 then    -
    else                 else    -
    id                   id      pointer to table entry
    num                  num     pointer to table entry
    <                    relop   LT
    <=                   relop   LE
    =                    relop   EQ
    <>                   relop   NE
    >                    relop   GT
    >=                   relop   GE
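As an illustration of recognizing one of these patterns in code, the C sketch below tests whether a whole string matches the num definition digit+ (. digit+)? (E (+|-)? digit+)?. It is a direct hand translation of the pattern, not generated from the table:

    #include <ctype.h>

    /* Returns 1 if s matches num -> digit+ (. digit+)? (E (+|-)? digit+)?, else 0. */
    int is_num(const char *s) {
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;      /* digit+ */
        if (*s == '.') {                             /* optional fraction */
            s++;
            if (!isdigit((unsigned char)*s)) return 0;
            while (isdigit((unsigned char)*s)) s++;
        }
        if (*s == 'E') {                             /* optional exponent */
            s++;
            if (*s == '+' || *s == '-') s++;
            if (!isdigit((unsigned char)*s)) return 0;
            while (isdigit((unsigned char)*s)) s++;
        }
        return *s == '\0';   /* the entire string must be consumed */
    }

For example, is_num("6.02E23") and is_num("0") return 1, while is_num("1.") returns 0.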
Transition Diagrams
A stylized flowchart called a transition diagram depicts the actions that take place when the lexical analyzer is called by the parser to get the next token. Positions in a transition diagram are drawn as circles and are called states. The states are connected by arrows, called edges. The edges leaving a state s have labels indicating the input characters that can appear after the transition diagram has reached state s. The label other refers to any character that is not indicated by any of the other edges leaving s. The starting state of a transition diagram is labeled start.

The transition diagram for the relational operators uses states 0 through 8: from the start state 0, the character < leads to state 1, from which = gives <= (state 2), > gives <> (state 3), and any other character gives < (state 4, retracting one character); = leads directly to state 5 (=); and > leads to state 6, from which = gives >= (state 7) and any other character gives > (state 8, retracting one character).

The transition diagram for identifiers uses states 9 through 11: from state 9, a letter leads to state 10, which loops on letter or digit; any other character leads to state 11, where the action return(gettoken(), install_id()) is executed.

The transition diagrams for unsigned numbers use states 12 through 19 (numbers with an optional fraction and exponent), states 20 through 24 (numbers with a fraction), and states 25 through 27 (integers). The transition diagram for whitespace uses states 28 through 30, looping on delimiters.

Implementing a Transition Diagram
A sequence of transition diagrams can be converted into a program that looks for the tokens specified by the diagrams. The size of the program is proportional to the number of states and edges in the diagrams. Each state gets a segment of code; if there are edges leaving a state, its code reads a character and selects an edge to follow, if possible. The function nextchar() reads the next character from the input buffer, advances the forward pointer, and returns the character read. If all the transition diagrams fail, the routine fail() is called; it retracts the forward pointer and selects the next transition diagram to try. A global variable lexical_value is assigned the pointer returned by the functions install_id() and install_num(). The function nexttoken() returns the next token of the lexical analyzer. Two variables, state and start, hold the current state and the starting state of the current transition diagram.

Code for finding the next start state:

    int state = 0, start = 0;
    int lexical_value;

    int fail(void)
    {
        forward = token_beginning;
        switch (start) {
        case 0:  start = 9;  break;
        case 9:  start = 12; break;
        case 12: start = 20; break;
        case 20: start = 25; break;
        default: recover();  break;
        }
        return start;
    }

Lexical Errors & Error Recovery Strategies
Suppose the lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input. In this case it uses the "panic mode" error recovery strategy: successive characters are deleted from the remaining input until the lexical analyzer can find a well-formed token. Some other error recovery strategies are:
1. Deleting an extraneous character.
2. Inserting a missing character.
3. Replacing an incorrect character by a correct character.
4. Transposing two adjacent characters.

A LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS
A particular tool for constructing lexical analyzers is called Lex; the tool is also referred to as the Lex compiler, and its input language is the Lex language. First, a specification written in the Lex language, lex.l, is given to the Lex compiler, which produces lex.yy.c: a tabular representation of a transition diagram together with a driver routine. lex.yy.c is then given to the C compiler, which produces a.out, the actual lexical analyzer. Finally, a.out converts the input stream into a sequence of tokens:

    lex.l        --> Lex compiler --> lex.yy.c
    lex.yy.c     --> C compiler   --> a.out
    input stream --> a.out        --> sequence of tokens

A Lex specification has the form:

    %{
    declarations
    %}
    regular definitions
    %%
    translation rules
    %%
    auxiliary procedures

declarations: includes and manifest constants (identifiers declared to represent constants).
regular definitions: definitions of named syntactic constructs, such as letter, using regular expressions.
translation rules: pattern / action pairs.
auxiliary procedures: arbitrary C functions copied directly into the generated lexical analyzer.
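A minimal example specification in this format might look as follows. This is a sketch: the token codes IF, ID, and NUM are assumed to be defined elsewhere (for example in a parser-generated header), and install_id() and install_num() are the symbol-table routines mentioned earlier, assumed here to return values assignable to yylval.

    %{
    #include "tokens.h"   /* assumed header defining the IF, ID, NUM token codes */
    %}
    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    num     {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
    %%
    {ws}    { /* no action and no return: skip whitespace */ }
    if      { return IF; }
    {id}    { yylval = install_id();  return ID; }
    {num}   { yylval = install_num(); return NUM; }
    %%
    /* auxiliary procedures install_id() and install_num() would be written here */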
Lex Conventions
The program generated by Lex matches the longest possible prefix of the input. For example, if <= appears in the input, then rather than matching only the < (which is also a legal pattern), the entire string <= is matched.

Lex names include:
- yylval: the value returned by the lexical analyzer (a pointer to the token's attribute).
- yytext: a pointer to the lexeme (the array of characters that matched the pattern).
- yyleng: the length of the lexeme (the number of characters in yytext).

If two rules match prefixes of the same, greatest length, then the first rule appearing (sequentially) in the translation section takes precedence. For example, if is matched by both the rule for if and the rule for {id}; since the if rule comes first, that is the match that is used.

The Lookahead Operator
If r1 and r2 are patterns, then r1/r2 means: match r1, but only if it is followed by r2. For example,

    DO/({letter}|{digit})*=({letter}|{digit})*,

recognizes the keyword DO in the string DO5I=1,25.

Finite Automata
Recognizer - A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise.
Finite automaton - A regular expression is compiled into a recognizer by constructing a generalized transition diagram called a finite automaton. The two kinds of finite automata are deterministic (DFA) and nondeterministic (NFA).

Differences between an NFA and a DFA:

    NFA                                        DFA
    1. Slower to recognize a regular           1. Faster to recognize a regular
       expression.                                expression.
    2. Smaller in size (fewer states).         2. Bigger than the NFA.
    3. May have states with ε-transitions.     3. No state has an ε-transition.
    4. For the same input, more than one       4. For each state and input symbol,
       transition may leave a single state.       there is at most one edge labeled
                                                  with it.

Nondeterministic Finite Automata
An NFA is a mathematical model consisting of:
1. a set of states S;
2. a set of input symbols Σ;
3. a transition function move that maps state-symbol pairs to sets of states;
4. a state s0 distinguished as the start state;
5. a set of states F distinguished as accepting states.

Transition Graph
An NFA can be represented diagrammatically by a labeled directed graph called a transition graph, in which the nodes are the states and the labeled edges represent the transition function.

Transition Table
A table with a row for each state and a column for each input symbol (and for ε, if necessary).

Moves
A path through the automaton can be represented by a sequence of state transitions called moves. The language defined by an NFA is the set of input strings it accepts.

Deterministic Finite Automata
A DFA is a special case of an NFA in which:
1. no state has an ε-transition;
2. for each state s and input symbol a, there is at most one edge labeled a leaving s.

Conversion of a Regular Expression into an NFA
Given a regular expression r, there is an associated regular language L(r). Since there is a finite automaton for every regular language, there is a machine M for every regular expression r such that L(M) = L(r). The constructive proof provides an algorithm for constructing such a machine M from r. The six constructions correspond to the cases:
1) The entire regular expression is the null string: r = ε, L = {ε}.
2) The entire regular expression is empty: r = φ, L = φ.
3) The regular expression is an element a of the input alphabet Σ: r = a.
4) Two regular expressions are joined by the union operator: r1 + r2.
5) Two regular expressions are joined by concatenation (no symbol): r1 r2.
6) A regular expression has the Kleene closure (star) applied to it: r*.
The construction begins with 1) or 2) if either applies.
Otherwise, the construction first converts all symbols in the regular expression using construction 3). Then, working from the inside outward, left to right at the same scope, it applies the one construction that applies from 4), 5), or 6). The result is an NFA with ε-moves. This NFA can then be converted to an NFA without ε-moves, and further to a DFA. All of these machines have the same language as the regular expression from which they were constructed.

The construction covers all possible cases that can occur in any regular expression. Because of this generality, many more states are generated than are necessary; the unnecessary states are joined by ε-transitions, and careful compression may be performed afterwards. For example, the fragment regular expression aba would first be built as

    q0 --a--> q1 --ε--> q2 --b--> q3 --ε--> q4 --a--> q5

which can be trivially reduced to

    q0 --a--> q1 --b--> q2 --a--> q3

Simulating an NFA
The following algorithm simulates an NFA:

    S := ε-closure({s0});
    a := nextchar();
    while a ≠ eof do begin
        S := ε-closure(move(S, a));
        a := nextchar()
    end;
    if S ∩ F ≠ Ø then return "yes" else return "no";

Conversion of an NFA to a DFA
The subset construction algorithm for this conversion is as follows:

    initially, ε-closure(s0) is the only state in Dstates, and it is unmarked;
    while there is an unmarked state T in Dstates do begin
        mark T;
        for each input symbol a do begin
            U := ε-closure(move(T, a));
            if U is not in Dstates then
                add U as an unmarked state to Dstates;
            Dtran[T, a] := U
        end
    end

The computation of ε-closure(T) is done by the following algorithm:

    push all states in T onto stack;
    initialize ε-closure(T) to T;
    while stack is not empty do begin
        pop t, the top element, off the stack;
        for each state u with an edge from t to u labeled ε do
            if u is not in ε-closure(T) then begin
                add u to ε-closure(T);
                push u onto stack
            end
    end
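A minimal C sketch of this ε-closure computation is shown below. It assumes at most 32 NFA states, so a set of states fits in an unsigned bit mask, and it stores the ε-edges as one bit mask per state; these representation choices are assumptions for illustration only:

    #define MAXSTATES 32

    /* eps[s] is a bit mask of the states reachable from s by one ε-edge. */
    unsigned eps[MAXSTATES];

    /* Return ε-closure(T) for a state set T given as a bit mask,
       using an explicit stack exactly as in the algorithm above. */
    unsigned eps_closure(unsigned T) {
        int stack[MAXSTATES], top = 0;
        unsigned closure = T;                       /* initialize the closure to T */
        for (int s = 0; s < MAXSTATES; s++)
            if (T & (1u << s))
                stack[top++] = s;                   /* push all states in T */
        while (top > 0) {
            int t = stack[--top];                   /* pop t */
            for (int u = 0; u < MAXSTATES; u++)
                if ((eps[t] & (1u << u)) && !(closure & (1u << u))) {
                    closure |= 1u << u;             /* add u to the closure */
                    stack[top++] = u;               /* push u */
                }
        }
        return closure;
    }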
DESIGN OF A LEXICAL ANALYZER GENERATOR USING FA
A specification for a lexical analyzer has the form

    p1 { action 1 }
    p2 { action 2 }
    ... ...
    pn { action n }

where each pattern pi is a regular expression and each action i is a program fragment that is to be executed whenever a lexeme matched by pattern pi is found in the input.

Problem: if more than one pattern matches a lexeme, we take the longest matching lexeme. For example, consider the input iftext = 5. After reading i and f, the analyzer could announce the keyword if; but reading on through the remaining characters shows that the longer lexeme iftext matches the pattern for an identifier. Since iftext matches both, we take the longest lexeme, classify it as an identifier, and execute the corresponding action.

A lexical analyzer is constructed using finite automata, either deterministic or nondeterministic. The model of a Lex compiler using finite automata is:

    Lex specification     --> Lex compiler --> transition table
    lexeme (input buffer) --> FA simulator (driven by the transition table)

The Lex compiler compiles the Lex-language input into a tabular representation of a transition diagram. This transition table is given as input to the FA simulator, which maintains the two pointers forward and lexeme_beginning that delimit the current lexeme in the input buffer. The FA simulator may be either an NFA or a DFA.

Pattern Matching Based on NFAs
One method is to construct the transition table of an NFA N for the composite pattern p1 | p2 | ... | pn. This is done by creating an NFA N(pi) for each pattern pi, then adding a new start state s0, and finally linking s0 to the start state of each N(pi) with an ε-transition:

    s0 --ε--> N(p1)
    s0 --ε--> N(p2)
       ...
    s0 --ε--> N(pn)

Example: consider a Lex program consisting of three regular expressions and no regular definitions:

    a     { }    /* actions are omitted here */
    abb   { }
    a*b+  { }

The NFAs for the three patterns are:

    a:     start 1 --a--> 2
    abb:   start 3 --a--> 4 --b--> 5 --b--> 6
    a*b+:  start 7 --a--> 7, 7 --b--> 8, 8 --b--> 8

The combined NFA adds a new start state 0 with ε-transitions to states 1, 3, and 7.

Consider the input aaba, which can match more than one pattern; the sequence of sets of states entered is:

    0137 --a--> 247 --a--> 7 --b--> 8 --a--> (none)

The starting state set is 0137. The first input symbol a leads to the states 247; state 2 is an accepting state for the first pattern, a. The next input symbol a leads only to state 7, which matches no pattern. The third input symbol b leads to state 8, an accepting state for the third pattern, a*b+. The last input symbol a leads to no state, so the simulation stops, and the action for the longest match found, the pattern a*b+ matching the prefix aab, is executed.

DFA for Lexical Analyzers
Here we construct the transition table of a DFA for the same example:

    STATE    a      b     PATTERN ANNOUNCED
    0137     247    8     none
    247      7      58    a
    8        -      8     a*b+
    7        7      8     none
    58       -      68    a*b+
    68       -      8     abb

Optimizing DFA-Based Pattern Matchers
Three algorithms are used to optimize the DFA:
1. Convert the regular expression directly into a DFA.
2. Minimize the number of states of the DFA.
3. Compact the transition table.

Important States of an NFA
1. A state of an NFA is important if it has a non-ε out-transition.
2. The accepting state is made important by appending a unique right-end marker # to the end of the regular expression; the resulting expression (r)# is called the augmented regular expression.

1. From a Regular Expression to a DFA
1. Convert the regular expression into an augmented regular expression (r)#.
2. Construct a syntax tree T for (r)#.
3. Compute four functions, nullable, firstpos, lastpos, and followpos, by making traversals over T. The first three functions are defined on the nodes of the syntax tree; the last is defined on the set of positions.
4. Finally, construct the DFA from followpos.

Example: consider the regular expression (a|b)*abb#. With the positions of a, b, a, b, b, # in the syntax tree numbered 1 through 6, the followpos table is:

    NODE   followpos
    1      {1, 2, 3}
    2      {1, 2, 3}
    3      {4}
    4      {5}
    5      {6}
    6      -

2. Minimizing the Number of States of a DFA
Input: a DFA M with set of states S, set of inputs Σ, transitions defined for all states and inputs, start state s0, and set of accepting states F.
Output: a DFA M' accepting the same language as M and having as few states as possible.
Method:
1. Construct an initial partition Π of the set of states with two groups: the accepting states F and the non-accepting states S - F.
2. Apply the following procedure to Π to construct a new partition Πnew:
       for each group G of Π do begin
           partition G into subgroups such that two states s and t of G
           are in the same subgroup if and only if, for all input symbols a,
           states s and t have transitions on a to states in the same group of Π;
           replace G in Πnew by the set of all subgroups formed
       end
3. If Πnew = Π, let Πfinal = Π and continue with step 4; otherwise, repeat step 2 with Π := Πnew.
4. Choose one state in each group of the partition Πfinal as the representative for that group. The representatives will be the states of the reduced DFA M'.
5. If M' has a dead state, that is, a state d that is not accepting and that has transitions to itself on all input symbols, then remove d from M'. Also remove any states not reachable from the start state. Any transitions to d from other states become undefined.

3. State Minimization in Lexical Analyzers
Using a table compression method, we can make the transition table compact. Normally the transition table is a two-dimensional array; here we use instead a data structure consisting of four arrays indexed by state numbers. The base array determines the base location of the entries for each state, which are stored in the next and check arrays. The default array determines an alternative base location to use in case the current entry is invalid.

To compute nextstate(s, a), the transition for state s on input symbol a, we first consult the pair of arrays next and check. We find the entry for state s in location l = base[s] + a, where a is treated as an integer. We take next[l] to be the next state for s on input a if check[l] = s. If check[l] ≠ s, we determine q = default[s] and repeat the entire procedure recursively, using q in place of s:

    procedure nextstate(s, a);
        if check[base[s] + a] = s then
            return next[base[s] + a]
        else
            return nextstate(default[s], a)
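In C the four-array scheme might be written as below. The array sizes are assumptions, and the default array is renamed def because default is a C keyword:

    #define NSTATES 128
    #define NSYMS   128     /* input symbols treated as small integers */

    int base[NSTATES];              /* base location of each state's entries  */
    int def[NSTATES];               /* fallback state when an entry is absent */
    int next[NSTATES * NSYMS];      /* packed next-state entries              */
    int check[NSTATES * NSYMS];     /* which state owns each packed entry     */

    /* Transition for state s on input symbol a, exactly as in the
       procedure above: consult next/check, else defer to the default. */
    int nextstate(int s, int a) {
        int l = base[s] + a;
        if (check[l] == s)
            return next[l];
        else
            return nextstate(def[s], a);
    }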
Problem: convert the regular expression (a|b)*abb into an NFA and then into a DFA.

i) Regular expression into an NFA. Thompson's construction gives an NFA with states S0 through S10 and final state S10:

    S0 --ε--> S1, S7        S1 --ε--> S2, S4
    S2 --a--> S3            S4 --b--> S5
    S3 --ε--> S6            S5 --ε--> S6
    S6 --ε--> S1, S7        S7 --a--> S8
    S8 --b--> S9            S9 --b--> S10 (final)

ii) NFA to DFA.
a) Finding the ε-closures of all reachable state sets:

    ε-closure(S0) = {S0, S1, S2, S4, S7} = A
    move(A, a) = {S3, S8};  ε-closure(move(A, a)) = {S1, S2, S3, S4, S6, S7, S8}  = B
    move(A, b) = {S5};      ε-closure(move(A, b)) = {S1, S2, S4, S5, S6, S7}      = C
    move(B, a) = {S3, S8};  ε-closure(move(B, a)) = B
    move(B, b) = {S5, S9};  ε-closure(move(B, b)) = {S1, S2, S4, S5, S6, S7, S9}  = D
    move(C, a) = {S3, S8};  ε-closure(move(C, a)) = B
    move(C, b) = {S5};      ε-closure(move(C, b)) = C
    move(D, a) = {S3, S8};  ε-closure(move(D, a)) = B
    move(D, b) = {S5, S10}; ε-closure(move(D, b)) = {S1, S2, S4, S5, S6, S7, S10} = E
    move(E, a) = {S3, S8};  ε-closure(move(E, a)) = B
    move(E, b) = {S5};      ε-closure(move(E, b)) = C

b) Transition table:

    STATE    a    b
    A        B    C
    B        B    D
    C        B    C
    D        B    E
    E        B    C

c) Minimization. The available states are A, B, C, D, E. States A and C (neither of which is final) have identical rows, so C can be removed and A put in its place.

d) Minimized transition table:

    STATE    a    b
    A        B    A
    B        B    D
    D        B    E
    E        B    A

e) The minimized DFA:

    A --a--> B, A --b--> A
    B --a--> B, B --b--> D
    D --a--> B, D --b--> E
    E --a--> B, E --b--> A     (E is the accepting state)

Example of Converting an NFA to a DFA
Here we convert one of the NFA examples in the lecture notes to a DFA. The machine is

    M = ({1, 2, 3, 4, 5, 6, 7, 8}, {a, b, c}, DELTA, 1, {8})

where

    DELTA = { (1, b, 1), (1, ε, 2), (2, ε, 7), (2, b, 3), (2, b, 5),
              (3, a, 4), (3, c, 4), (4, c, 2), (4, c, 7), (5, a, 6),
              (5, b, 6), (6, c, 2), (6, ε, 2), (6, c, 7), (6, ε, 7),
              (7, b, 8) }

Note that DELTA is a relation (a set of triples). We are computing M' = (K', Σ, delta', s', F'). Σ is the same as that of the NFA; in this case Σ = {a, b, c}.

Step 1: compute E(q) for all states q in K. E(q) is the set of states reachable from q using only (any number of) ε-transitions:

    q    E(q)
    1    {1, 2, 7}
    2    {2, 7}
    3    {3}
    4    {4}
    5    {5}
    6    {2, 6, 7}
    7    {7}
    8    {8}

Step 2: compute s' = E(s). Here E(s) = E(1) = {1, 2, 7}.

Step 3: compute delta'. We start from E(s), where s is the start state of the original machine, and add states as necessary:

    state \ input            a              b                      c
    {1, 2, 7}                {}             {1, 2, 3, 5, 7, 8}     {}
    {}                       {}             {}                     {}
    {1, 2, 3, 5, 7, 8}       {2, 4, 6, 7}   {1, 2, 3, 5, 6, 7, 8}  {4}
    {2, 4, 6, 7}             {}             {3, 5, 8}              {2, 7}
    {1, 2, 3, 5, 6, 7, 8}    {2, 4, 6, 7}   {1, 2, 3, 5, 6, 7, 8}  {2, 4, 7}
    {4}                      {}             {}                     {2, 7}
    {3, 5, 8}                {2, 4, 6, 7}   {2, 6, 7}              {4}
    {2, 7}                   {}             {3, 5, 8}              {}
    {2, 4, 7}                {}             {3, 5, 8}              {2, 7}
    {2, 6, 7}                {}             {3, 5, 8}              {2, 7}

delta'(StateSet, inputSymbol) is the union of E(q) for all q such that (p, inputSymbol, q) is in DELTA and p is in StateSet. We do one example in detail:

    delta'({1, 2, 7}, b) = {1, 2, 3, 5, 7, 8}

because you can reach {1, 3, 5, 8} on b-transitions (DELTA contains (1, b, 1), (2, b, 3), (2, b, 5), and (7, b, 8)), and E(1) = {1, 2, 7}, E(3) = {3}, E(5) = {5}, and E(8) = {8}; the union of all of those is {1, 2, 3, 5, 7, 8}.

Step 4: enumerate K', the set of states. The states are just the entries in the left column:

    K' = { {1,2,7}, {}, {1,2,3,5,7,8}, {2,4,6,7}, {1,2,3,5,6,7,8},
           {4}, {3,5,8}, {2,7}, {2,4,7}, {2,6,7} }

There are 10 states, although 2^8 = 256 were possible.

Step 5: compute F'. The final states are the states of K' that have some intersection with the final state(s) of the original machine. Since 8 was the only final state of the original machine, the final states of the DFA are those that contain 8:

    F' = { {1,2,3,5,7,8}, {1,2,3,5,6,7,8}, {3,5,8} }

Putting it all together, our DFA is M' = (K', {a, b, c}, delta', {1, 2, 7}, F'), where K', delta', and F' are as above.

Minimizing the DFA: if the DFA is minimized, the states {2, 4, 6, 7}, {2, 4, 7}, and {2, 6, 7} coalesce, leaving an 8-state machine. (State minimization is a separate algorithm.)

UNIT - III

THE ROLE OF THE PARSER
The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language. It should also report any syntax errors in an intelligible fashion and recover from commonly occurring errors so that it can continue processing the remainder of its input.

    source program --> Lexical Analyzer --token--> Parser --parse tree-->
                       rest of front end --> intermediate representation
    (the parser requests the next token; both phases consult the symbol table)

There are three general types of parsers for grammars:
Universal parsing - It can parse any grammar, but it is too inefficient to use in production compilers.
Top-down parsing - It constructs the parse tree from the root to the leaves.
Bottom-up parsing - It constructs the parse tree from the leaves to the root.
In both top-down and bottom-up parsing, the input is scanned from left to right, one symbol at a time. These methods work efficiently on subclasses of grammars, notably the LL and LR grammars, which describe most syntactic constructs in programming languages.

Errors occur at different levels:
Lexical, such as misspelling an identifier, keyword, or operator.
Syntactic, such as an arithmetic expression with unbalanced parentheses.
Semantic, such as an operator applied to an incompatible operand.
Logical, such as an infinitely recursive call.
Goals of the Error Handler in a Parser
It should report the presence of errors clearly and accurately. It should recover from each error quickly enough to be able to detect subsequent errors. It should not significantly slow down the processing of correct programs.

ERROR RECOVERY STRATEGIES
To recover from syntactic errors, a parser can use several general strategies:
1. Panic mode
2. Phrase level recovery
3. Error productions
4. Global correction

Panic Mode Recovery
It can be used by most parsing methods. On discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as a semicolon or end, and vary with the programming language.
Advantages: it is the simplest method to implement; it is guaranteed not to go into an infinite loop; it is adequate when multiple errors in the same statement are rare.
Disadvantage: it skips a considerable amount of input without checking it for additional errors.

Phrase Level Recovery
On discovering an error, the parser may perform local correction on the remaining input: it may replace a prefix of the remaining input by some string and continue parsing. Examples: replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon. We must be careful to choose replacements that do not lead to infinite loops. This strategy is used in top-down parsing.
Advantages: this type of replacement can correct any input string, and it has been used in several error-repairing compilers.
Disadvantage: it has difficulty coping with situations in which the actual error occurred before the point of detection.

Error Productions
If we have an idea of the common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. We then use the grammar augmented by these error productions to construct a parser. If an error production is used by the parser, we can generate an appropriate error diagnostic to indicate the erroneous construct that has been recognized in the input.

Global Correction
We would like a compiler to make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction: given an incorrect input string x and a grammar G, these algorithms find a parse tree for a related string y such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible.
Disadvantages: it is expensive to implement, and it takes more time and occupies more space.

CONTEXT FREE GRAMMAR
A grammar naturally describes the hierarchical structure of many programming language constructs. It has four components:
1. A set of tokens, known as terminal symbols.
2. A set of non-terminals.
3. A set of productions, where each production consists of a non-terminal, called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
4. A designation of one of the non-terminals as the start symbol.

DERIVATIONS & REDUCTIONS
Derivations - A non-terminal can be expanded, deriving strings of tokens; each such expansion step is called a derivation.
Reductions - A terminal does not derive any string, but a string matching the right side of a production can be replaced by (reduced to) the non-terminal on its left side. These reverse steps are called reductions.
Example: consider the grammar

    E -> E * E
    E -> E + E
    E -> id

i) Using the above grammar, derive the string id+id*id:

    E => E + E         [expansion by E -> E + E]
      => E + E * E     [expansion by E -> E * E]
      => id + E * E    [expansion by E -> id]
      => id + id * E   [expansion by E -> id]
      => id + id * id  [expansion by E -> id]

ii) Using the above grammar, reduce the string id+id*id to the starting non-terminal:

    id + id * id
    => E + id * id     [reduction by E -> id]
    => E + E * id      [reduction by E -> id]
    => E + E * E       [reduction by E -> id]
    => E + E           [reduction by E -> E * E]
    => E               [reduction by E -> E + E]

Parse Tree
A parse tree pictorially shows how the start symbol of a grammar derives a string in the language. If a non-terminal A has the production A -> XYZ, then a parse tree may have an interior node labeled A with three children labeled X, Y, and Z, from left to right:

        A
      / | \
     X  Y  Z

Properties of a Parse Tree
1. The root is labeled by the start symbol.
2. Each leaf is labeled by a token or by ε.
3. Each interior node is labeled by a non-terminal.

Definition of Yield
The leaves of a parse tree, read from left to right, form the yield of the tree, which is the string generated or derived from the non-terminal at the root of the parse tree.

Parsing
The process of finding a parse tree for a given string of tokens is called parsing that string.

Ambiguity
A grammar can have more than one parse tree generating a given string of tokens. Such a grammar is said to be ambiguous. For example, the string 9 - 5 + 2 has two parse trees:

          string                        string
         /  |   \                      /  |   \
    string  +  string             string  -  string
    /  |  \       |                  |      /  |  \
   9   -   5      2                  9     5   +   2

The first corresponds to (9-5)+2 and the second to 9-(5+2). To avoid this ambiguity, two conventions are used:
1. Associativity of operators
2. Precedence of operators

Associativity of Operators
By convention, 9+5+2 is equivalent to (9+5)+2 and 9-5-2 is equivalent to (9-5)-2. When an operand like 5 has operators to its left and to its right, conventions are needed for deciding which operator takes that operand. We say that the operator + associates to the left, because an operand with plus signs on both sides of it is taken by the operator to its left. In most programming languages the four arithmetic operators (addition, subtraction, multiplication, and division) are left associative. Some common operators, such as exponentiation, are right associative. As another example, the assignment operator = in C is right associative: in C, the expression a = b = c is treated in the same way as the expression a = (b = c).

Precedence of Operators
Consider the expression 9+5*2. There are two possible interpretations of this expression: (9+5)*2 or 9+(5*2). The associativity rules for + and * do not resolve this ambiguity. For this reason, we need to know the relative precedence of operators when more than one kind of operator is present. We say that * has higher precedence than + if * takes its operands before + does. In ordinary arithmetic, multiplication and division have higher precedence than addition and subtraction. Therefore, 5 is taken by * in both 9+5*2 and 9*5+2; i.e., the expressions are equivalent to 9+(5*2) and (9*5)+2, respectively.

WRITING A GRAMMAR
The following reasons explain why regular expressions, rather than grammars, are used to define the lexical syntax of a language:
1. The lexical rules of a language are frequently quite simple.
2. Regular expressions provide a more concise and easier-to-understand notation for tokens than grammars.
3. More efficient lexical analyzers can be constructed automatically from regular expressions than from arbitrary grammars.
4. Separating the syntactic structure of a language into lexical and non-lexical parts provides a convenient way of modularizing the front end of a compiler into two manageable-sized components.
Eliminating Ambiguity
Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity. As an example, consider the "dangling else" grammar:

    stmt -> if expr then stmt
          | if expr then stmt else stmt
          | other

According to this grammar, the compound conditional statement

    if E1 then if E2 then S1 else S2

has two parse trees: in the first, the else is associated with the inner if, as in if E1 then (if E2 then S1 else S2); in the second, it is associated with the outer if, as in if E1 then (if E2 then S1) else S2. In all programming languages the first parse tree is preferred. The general rule is: match each else with the closest previous unmatched then. This disambiguating rule can be incorporated directly into the grammar. The idea is that a statement appearing between a then and an else must be matched; that is, it must not end with an unmatched then followed by any statement, for the else would then be forced to match this unmatched then. A matched statement is either an if-then-else statement containing no unmatched statements, or any other kind of unconditional statement. Thus, we may use the grammar:

    stmt           -> matched_stmt | unmatched_stmt
    matched_stmt   -> if expr then matched_stmt else matched_stmt
                    | other
    unmatched_stmt -> if expr then stmt
                    | if expr then matched_stmt else unmatched_stmt

ELIMINATION OF LEFT RECURSION
A grammar is left recursive if it has a non-terminal A such that there is a derivation A => Aα for some string α of grammar symbols; here the non-terminal A is recursively called at the left. Top-down parsers cannot handle left recursive grammars, so the left recursion must be eliminated. This is done as follows:

    Left recursive grammar:        A  -> Aα | β
    After eliminating recursion:   A  -> βA'
                                   A' -> αA' | ε

where α and β are strings of grammar symbols and ε is the empty string.

Example: for the left recursive grammar E -> E+T | T, here A is E, α is +T, and β is T, so after left recursion elimination:

    E  -> TE'
    E' -> +TE' | ε

Left Factoring
This transformation makes certain grammars suitable for predictive parsing. If a non-terminal has two productions whose right sides begin with the same symbols, there is confusion about which production to select for a particular input string. For example, with

    A -> αB | αC

it is unclear which alternative to choose for any input string starting with α. This problem is solved by left factoring, a transformation that factors out the common prefix:

    A  -> αA'
    A' -> B | C

Depending on how the parse tree is created, there are different parsing techniques, categorized into two groups:
1. Top-down parsing
2. Bottom-up parsing

TOP-DOWN PARSING
Construction of the parse tree starts at the root and proceeds towards the leaves. Efficient top-down parsers can easily be constructed by hand, using recursive predictive parsing or non-recursive predictive parsing (LL parsing).

Recursive Descent Predictive Parsers
Recursive descent parsers are easily created from context-free grammar productions. Recursive descent is a top-down technique because it works by trying to match the program text against the start symbol and successively replacing symbols by the symbols representing their constituents. This process can be regarded as constructing the parse tree in a top-down direction.
Recursive descent parsers are often also called LL parsers because they deal with the input from left to right (the first L) and construct a leftmost derivation (the second L). A recursive descent parser is a collection of procedures, one for each unique non-terminal. Each procedure is responsible for parsing the kind of construct described by its non-terminal. Since the syntax of most programming languages is recursive, the resulting procedures are also usually recursive, hence the name "recursive descent".

The parser maintains the invariant that a global variable always contains the first token in the input that has not yet been examined by the parser. Every time a token is "consumed", the parser calls the lexical analyzer to get another token. Parsing with a recursive descent parser starts by calling the lexical analyzer to get the first token; then the procedure corresponding to the grammar's start symbol is called. When this procedure returns, the parse is complete.

The body of a parsing procedure for a non-terminal X is constructed by considering the grammar productions with X on their left-hand side. A non-terminal on the right-hand side of one of these productions turns into a call to the parsing procedure for that non-terminal. A terminal (literal or non-literal) turns into a test that the current token matches the required terminal, followed by a call to get another token. For example, consider the following production and its associated parsing procedure:

    Statement : Name ':=' Expression.

    void Statement () {
        Name ();
        if (current token is not a colon-equals)
            report a colon-equals missing;
        get a token;
        Expression ();
    }

Of course, many non-terminals appear on the left-hand side of more than one production. The parsing procedure must deal with all of these cases by checking at the beginning of the procedure. For example, expressions might come in a few varieties:

    Expression : Integer / Identifier / '(' Expression ')' / ...

    void Expression () {
        if (current token is an integer or an identifier) {
            get a token;
        } else if (current token is a left parenthesis) {
            get a token;
            Expression ();
            if (current token is not a right parenthesis)
                report a right parenthesis missing;
            get a token;
        } else {
            report an illegal expression;
        }
    }
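Putting these pieces together, a complete recursive descent parser for simple arithmetic expressions might look like the hedged C sketch below. It uses the left-recursion-free grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id that appears in the predictive-parsing discussion later in this unit; for brevity it treats single characters as tokens and any letter as id.

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    static int tok;                      /* one-token lookahead */
    static void next(void) { tok = getchar(); }

    static void error(const char *msg) { fprintf(stderr, "%s\n", msg); exit(1); }

    static void E(void);

    static void F(void) {                /* F -> ( E ) | id */
        if (tok == '(') {
            next(); E();
            if (tok != ')') error("missing )");
            next();
        } else if (isalpha(tok)) {       /* a single letter stands for id */
            next();
        } else error("illegal expression");
    }

    static void Tp(void) {               /* T' -> * F T' | epsilon */
        if (tok == '*') { next(); F(); Tp(); }
    }

    static void T(void) { F(); Tp(); }   /* T -> F T' */

    static void Ep(void) {               /* E' -> + T E' | epsilon */
        if (tok == '+') { next(); T(); Ep(); }
    }

    static void E(void) { T(); Ep(); }   /* E -> T E' */

    int main(void) {
        next();                          /* prime the lookahead */
        E();
        if (tok != '\n' && tok != EOF) error("trailing input");
        puts("accepted");
        return 0;
    }

Running it on input such as a+(b*c) prints accepted; each procedure corresponds to one non-terminal, exactly as described above.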
the FIRST of a sequence A1 A2 A3 ... An is FIRST(A1), together with FIRST(A2) if ε is in FIRST(A1), together with FIRST(A3) if ε is in both FIRST(A1) and FIRST(A2), and so on.
The FOLLOW set of a symbol A is the set of tokens that can follow an A in a syntactically legal program, plus ε if A can occur at the end of a program.
The PREDICT set of a production N : A1 A2 A3 ... An is FIRST(A1 A2 A3 ... An) (without ε), plus FOLLOW(N) if A1 A2 A3 ... An can derive ε.
The PREDICT sets for the alternative productions of N are used when writing the recursive descent parsing procedure for N: if the next unexamined token is in the PREDICT set for a production, then we predict that alternative. This is what we did earlier in the parsing procedure for Expression.

Transforming grammars for recursive descent
Problems arise if the PREDICT sets for the productions of a nonterminal overlap. In that case it is not possible to predict a single production using just one token of lookahead. To ensure that decision making with one token of lookahead is possible, it may be necessary to transform the grammar: change it so that it is acceptable to our parsing method but defines the same language as the original grammar. Two common situations arise: left recursion and common prefixes, handled by the transformations given earlier.

Non-recursive predictive parser
It is a top-down parser and, as the name implies, it is not recursive. It needs the following components to check whether a given string parses successfully: an input buffer, a stack, a parsing routine (driver) and a parsing table.
The input buffer holds the input string to be parsed, followed by the symbol $. This symbol indicates that the input string is terminated; it is used as the right end marker.
The stack always holds grammar symbols (terminals and nonterminals). Initially $ is pushed onto the stack; as parsing progresses, grammar symbols are pushed and popped, and this $ announces the completion of parsing.
The parsing table is a two-dimensional array. An entry in the table is referred to as T[A, a], where A is a nonterminal, a is a terminal and T is the table name.
The driver compares the symbol X on top of the stack with the current input symbol a. There are three possibilities:
1. If X = a = $, the parser halts with successful completion.
2. If X = a ≠ $, the parser pops X off the stack and moves the input pointer to the next symbol.
3. If X is a nonterminal, the parser consults the table entry T[X, a], pops X from the stack, and pushes the symbols of the corresponding production's right side (or signals an error if the entry is empty).

Reduction by a predictive parser
Productions:
E  -> TE'
E' -> +TE' | ε
T  -> FT'
T' -> *FT' | ε
F  -> (E) | id

Stack       Input        Output (production used)
$E          id+id*id$
$E'T        id+id*id$    E  -> TE'
$E'T'F      id+id*id$    T  -> FT'
$E'T'id     id+id*id$    F  -> id
$E'T'       +id*id$
$E'         +id*id$      T' -> ε
$E'T+       +id*id$      E' -> +TE'
$E'T        id*id$
$E'T'F      id*id$       T  -> FT'
$E'T'id     id*id$       F  -> id
$E'T'       *id$
$E'T'F*     *id$         T' -> *FT'
$E'T'F      id$
$E'T'id     id$          F  -> id
$E'T'       $
$E'         $            T' -> ε
$           $            E' -> ε

Steps involved in non-recursive predictive parsing:
1. The input buffer is filled with the input string, with $ as the right end marker.
2. The stack is initialized with $.
3. The parsing table T is constructed using FIRST() and FOLLOW().
A minimal C sketch of the driver loop follows.
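The sketch below is a compact, runnable encoding of the driver described above, for the expression grammar just traced. The encoding is an illustrative assumption: nonterminals E', T' are written 'e', 't', id is 'i', and each table entry is the production's right side stored reversed so it can be pushed directly.

#include <stdio.h>

/* Table-driven predictive parsing sketch for:
   E->TE'  E'->+TE'|ε  T->FT'  T'->*FT'|ε  F->(E)|id
   Returns the right side (reversed, ready to push); "" = ε; NULL = error. */
static const char *table(char X, char a) {
    switch (X) {
    case 'E': return (a=='i'||a=='(') ? "eT"  : NULL;
    case 'e': return a=='+' ? "eT+" : (a==')'||a=='$') ? "" : NULL;
    case 'T': return (a=='i'||a=='(') ? "tF"  : NULL;
    case 't': return a=='*' ? "tF*" : (a=='+'||a==')'||a=='$') ? "" : NULL;
    case 'F': return a=='i' ? "i"   : a=='(' ? ")E(" : NULL;
    default:  return NULL;
    }
}

int parse(const char *in) {                 /* in must end with '$'   */
    char stack[100] = "$E";                  /* $ below start symbol E */
    int top = 1;
    while (top >= 0) {
        char X = stack[top], a = *in;
        if (X == a) {                        /* match terminal or $    */
            if (X == '$') return 1;          /* success                */
            top--; in++;
        } else if (X=='E'||X=='e'||X=='T'||X=='t'||X=='F') {
            const char *rhs = table(X, a);
            if (!rhs) return 0;              /* empty entry: error     */
            top--;                           /* pop X, push right side */
            for (const char *p = rhs; *p; p++) stack[++top] = *p;
        } else return 0;                     /* terminal mismatch      */
    }
    return 0;
}

int main(void) {
    printf("%s\n", parse("i+i*i$") ? "accepted" : "rejected");
    return 0;
}

Running it on i+i*i$ performs exactly the sequence of expansions shown in the trace table above.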
Computation of FIRST( )
1. If X is a terminal, then FIRST(X) = {X}.
2. If X -> ε is a production, then add ε to FIRST(X).
3. If X is a nonterminal and X -> aα is a production, then add a to FIRST(X); e.g. for X -> a, FIRST(X) = {a}.
4. If X -> Y1 Y2 Y3 ... Yn, then FIRST(X) is built from FIRST(Y1), FIRST(Y2), ..., FIRST(Yn): add FIRST(Y1); if ε is in FIRST(Y1), also add FIRST(Y2); and so on.

Computation of FOLLOW( )
1. $ is in FOLLOW(S), where S is the start symbol; thus initially FOLLOW(S) = {$}.
2. If there is a production A -> αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A -> αB, or a production A -> αBβ where FIRST(β) contains ε, then FOLLOW(B) includes FOLLOW(A).

The FIRST() and FOLLOW() sets for the above productions are
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = FOLLOW(E') = { $, ) }
FOLLOW(T) = FOLLOW(T') = { +, $, ) }
FOLLOW(F) = { *, +, $, ) }

Parsing table

      id         +           *           (          )          $
E     E->TE'                             E->TE'
E'               E'->+TE'                           E'->ε      E'->ε
T     T->FT'                             T->FT'
T'               T'->ε      T'->*FT'                T'->ε      T'->ε
F     F->id                              F->(E)

ERROR RECOVERY IN PREDICTIVE PARSING
We can use both panic mode and phrase level strategies for recovering from errors in predictive parsing.

BOTTOM-UP PARSING
Construction of the parse tree starts at the leaves and proceeds towards the root. Efficient bottom-up parsers are normally created with the help of software tools. Bottom-up parsing is also known as shift-reduce parsing.
Operator-Precedence Parsing – simple, restrictive, easy to implement.
LR Parsing – a much more general form of shift-reduce parsing: LR, SLR, LALR.

Shift-reduce parsers
In contrast to a recursive descent parser, which constructs the derivation "top-down" (i.e., from the start symbol), a shift-reduce parser constructs the derivation "bottom-up" (i.e., from the input string). Shift-reduce parsers are often used as the target of parser generation tools. Some reasons for their popularity are the large class of grammars that can be parsed in this way (larger than the class that can be processed using recursive descent) and our ability to implement them efficiently. Shift-reduce parsers are often called LR parsers because they process the input left-to-right and construct a rightmost derivation in reverse.
In the following we briefly describe how shift-reduce parsers work, because knowledge of their operation is useful when using parser generators (which you might have to do in the future). Our concentration is on the basic mechanisms used during parsing, not on the techniques used to generate such parsers (which can be quite complex). The text has much more detail which you can study if you are interested.
Informally, a shift-reduce parser starts out with the entire input string and looks for a substring that matches the right-hand side of a production. If one is found, the substring is replaced by the left-hand-side symbol of the production. This step is a reduction. The parser then looks for another substring (now possibly containing a nonterminal symbol), replaces it, and so on. Reductions occur until the string is reduced to just the start symbol. If no reduction is possible at some stage, it might mean that the string is not a sentence in the language defined by the grammar, or it might mean that an earlier reduction was performed in error. The most complex parts of defining a shift-reduce parser are locating valid substrings (called handles) and determining when and if reductions should be performed on which handles.

S : 'a' A B 'e'.
A : A 'b' 'c' / 'b'.
B : 'd'.

abbcde => aAbcde => aAde => aABe => S

Shift-reduce parsers can be described by machines operating on a stack of symbols and an input buffer containing the input text. Initially, the stack is empty and the input buffer contains the entire input string. A step of the parser examines the top of the stack to see if a handle is present.
If so, a reduction could be performed (but doesn't have to be). If a reduction is not possible or is not desirable, the parser shifts the next input symbol from the input buffer onto the top of the stack. The process then repeats. If the parser reaches a state where the stack contains just the start symbol and the input buffer is empty, the input has been correctly parsed. If the input is consumed but the stack cannot be reduced, the input is not a sentence.

Grammar:       E : E '+' E / E '*' E / '(' E ')' / id.
Input string:  id + id * id

Stack         Input            Action
              id + id * id     initial state
id            + id * id        shift id
E             + id * id        reduce by E : id
E +           id * id          shift +
E + id        * id             shift id
E + E         * id             reduce by E : id
E + E *       id               shift *
E + E * id                     shift id
E + E * E                      reduce by E : id
E + E                          reduce by E : E '*' E
E                              reduce by E : E '+' E

Note that in this example the decisions about when to reduce are crucial. When the stack contains E + E for the first time we could reduce it to E. If this is done and the parse is completed, we end up with a second derivation for this input string. That both exist is no surprise, however, as we previously noted that this grammar is ambiguous.

Example: consider the following grammar
S -> CC
C -> cC
C -> d
and the input string cdcd.

Input string    Step
cdcd            reduction by C -> d
cdcC            reduction by C -> cC
cdC             reduction by C -> d
cCC             reduction by C -> cC
CC              reduction by S -> CC
S

Handles
A handle of a string is a substring that matches the right side of a production; its reduction represents one step of a rightmost derivation in reverse and helps in constructing the parse tree.
Ex:
A -> aXb
X -> c
Here c is a handle of the input string acb. It reduces to X and helps in the construction of the parse tree, i.e. in reaching the start nonterminal A:
acb => aXb => A
The process of obtaining the start nonterminal while constructing the bottom-up parse tree, by reducing handles to their respective nonterminals, is called handle pruning.

Shift-reduce parsing actions
1. Shift  - shift the next input symbol onto the stack when there is no handle to reduce.
2. Reduce - replace the handle on top of the stack by the corresponding nonterminal.
3. Accept - the input string is valid and parsing completed successfully.
4. Error  - there is a syntax error; the parser calls an error recovery routine.

Stack implementation of shift-reduce parsing
Ex: the input string id1 + id2 * id3

Stack          Input              Action
$              id1+id2*id3$       shift
$id1           +id2*id3$          reduce by E -> id
$E             +id2*id3$          shift
$E+            id2*id3$           shift
$E+id2         *id3$              reduce by E -> id
$E+E           *id3$              shift
$E+E*          id3$               shift
$E+E*id3       $                  reduce by E -> id
$E+E*E         $                  reduce by E -> E*E
$E+E           $                  reduce by E -> E+E
$E             $                  accept
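A minimal sketch in C of the shift-reduce loop, for the S -> CC, C -> cC | d grammar traced above. The greedy policy of always reducing when a handle appears on top of the stack happens to be safe for this particular grammar; that policy, and the condition that S -> CC is applied only at end of input, are illustrative assumptions rather than a general algorithm.

#include <stdio.h>

int parse(const char *in) {
    char st[100]; int top = 0;                  /* st[1..top] = stack */
    for (;;) {
        /* try to reduce a handle on top of the stack */
        if (top >= 1 && st[top] == 'd') {
            st[top] = 'C';                      /* C -> d             */
        } else if (top >= 2 && st[top-1]=='c' && st[top]=='C') {
            st[--top] = 'C';                    /* C -> cC            */
        } else if (top == 2 && *in=='\0' && st[1]=='C' && st[2]=='C') {
            top = 1; st[1] = 'S';               /* S -> CC            */
        } else if (*in) {
            st[++top] = *in++;                  /* shift              */
        } else {
            break;                              /* no move possible   */
        }
    }
    return top == 1 && st[1] == 'S';            /* accepted?          */
}

int main(void) {
    printf("%s\n", parse("cdcd") ? "accepted" : "rejected");
    return 0;
}

Tracing cdcd through this loop reproduces the reduction sequence cdcd => cdcC => cdC => cCC => CC => S shown above.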
Parsing conflicts
Sometimes the grammar is written in such a way that the parser generator cannot determine what to do in every possible circumstance. (We assume a lookahead of one symbol, so only the first symbol in the input buffer can be examined at each step. More is possible if multiple-symbol lookahead is allowed, but this complicates the parsing process.) Situations where problems can occur are called conflicts.
A shift-reduce conflict exists if the parser cannot decide in some situation whether to shift the input symbol or to reduce using a handle on the top of the stack. A common situation where a shift-reduce conflict occurs is the dangling-else problem: once an 'if-then' has been seen, the parser cannot choose between shifting an 'else' (so that it becomes part of the most recently seen 'if') or reducing the 'if' (so that the 'else' becomes part of a preceding 'if'). This problem occurs because the grammar is ambiguous (recall the discussion for recursive descent parsers).

Stmt : 'if' Expression 'then' Stmt
     | 'if' Expression 'then' Stmt 'else' Stmt
     | ...

A reduce-reduce conflict exists if the parser cannot decide by which production to make a reduction. This commonly occurs when two productions have the same right-hand side (or one with the same structure), but the left context (i.e., what the parser has seen) is not sufficient to distinguish between them. The following grammar defines part of a language like FORTRAN where both procedure calls and array accesses are written using parentheses.

stmt       : id '(' param_list ')' / expr ':=' expr.
param_list : param / param_list ',' param.
param      : id.
expr       : id '(' expr_list ')' / id.
expr_list  : expr / expr_list ',' expr.

This grammar has a reduce-reduce conflict between the two productions 'param : id' and 'expr : id'. This can be seen by considering the input 'id(id,id)'. After the first three symbols have been shifted, either production could be used to reduce the id on the top of the stack. The correct decision can only be made by knowing whether the first id is a procedure identifier or an array identifier.

Avoiding parsing conflicts
Parser generators based on the shift-reduce method often have facilities for helping you avoid parsing conflicts. For example, YACC has the convenient rule that if there is a shift-reduce conflict it prefers the shift over the reduce. For reduce-reduce conflicts YACC prefers to reduce by the production that was written first in the grammar. Relying on default behaviour like YACC's can be dangerous, because changes to the way the grammar is written can affect the parser in subtle ways; for example, just reordering the productions changes the way reduce-reduce conflicts are resolved. Other parser generators extend the grammar notation with modifications which provide more information to resolve conflicts. E.g., we might attach a modification to the first production in the dangling-else grammar saying that it should not be reduced if the next basic symbol is an 'else'.
If the parser generator does not support modifications, or its default rules are not what you want, more needs to be done. In the case of the reduce-reduce conflict above we might somehow obtain semantic information that tells us whether the ambiguous case is a procedure call or an array access, based on the declaration of the first identifier. This can be done, but it complicates the compiler because semantic information is needed before the program has been fully parsed. C's typedef facility creates a similar problem for parsers of that language.
Another solution is to rewrite the grammar to remove the conflict. For shift-reduce conflicts it is sometimes possible to rewrite the grammar so that it accepts exactly the language that you want.

Stmt  : 'if' Expression 'then' Stmt
      | 'if' Expression 'then' Stmt2 'else' Stmt
      | ...
Stmt2 : 'if' Expression 'then' Stmt2 'else' Stmt2
      | ...

For a reduce-reduce conflict a standard technique is to write the grammar so that both possibilities are parsed to the same structure. Once the tree has been constructed, the semantic analyzer can use all available information to decide which was actually meant. For example, for the example in the previous section we could parse a function call as an expression and take care of the distinction later. This approach works but can complicate semantic analysis considerably, so it is worth avoiding if possible.
Sometimes a grammar conflict is not really an indication of an ambiguity. Rather, it might just be a property of the way the grammar is written, and transformation to an equivalent grammar can remove the conflict. The following example has a conflict because the parser can't decide between the rules for A and B with a single token of lookahead: the next symbol is x in both cases.

S : Q.
Q : A x y / B x x.
A : C a b.
B : C a b.
C : c.

A simple rewrite suffices to remove the problem by effectively pushing the decision point one token later. For example,

S : Q.
Q : A y / B x.
A : C a b x.
B : C a b x.
C : c.

Operator Precedence Parsing
An operator precedence grammar is an ε-free operator grammar in which the precedence relations <, = and > constructed as described below are disjoint. That is, for any pair of terminals a and b, never more than one of the relations a < b, a = b, a > b is true.

LEADING( ) and TRAILING( )
LEADING(A)  = { a | A derives a string beginning γa, where γ is ε or a single nonterminal }
TRAILING(A) = { a | A derives a string ending aγ, where γ is ε or a single nonterminal }

Algorithm for operator-precedence relations
Input: an operator grammar G.
Output: the relations <, =, and > for G.
Method:
1. Compute LEADING(A) and TRAILING(A) for each nonterminal A.
2. Execute the program given below, examining each position of the right side of each production.
3. Set $ < a for all a in LEADING(S) and b > $ for all b in TRAILING(S), where S is the start symbol of G.

for each production A -> X1 X2 ... Xn do
    for i := 1 to n-1 do
    begin
        if Xi and Xi+1 are both terminals then set Xi = Xi+1;
        if i <= n-2 and Xi and Xi+2 are terminals and Xi+1 is a
            nonterminal then set Xi = Xi+2;
        if Xi is a terminal and Xi+1 is a nonterminal then
            for all a in LEADING(Xi+1) do set Xi < a;
        if Xi is a nonterminal and Xi+1 is a terminal then
            for all a in TRAILING(Xi) do set a > Xi+1;
    end

The operator-precedence parsing algorithm
Input: the precedence relations from some operator-precedence grammar and an input string of terminals of that grammar.
Output: strictly speaking, there is no output. We could construct a skeletal parse tree as we parse, with one nonterminal labelling all interior nodes and the use of single productions not shown. Alternatively, the sequence of shift-reduce steps could be considered the output.
Method: let the input string be a1 ... an$. Initially the stack contains $. Execute the program below. If a parse tree is desired, we create a node for each terminal shifted onto the stack at line (4). Then, when the loop of lines (6)-(7) reduces by some production, we create a node whose children are the nodes corresponding to whatever is popped off the stack; after line (7) we place on the stack a pointer to the node created. This means that some of the "symbols" popped at line (6) will be pointers to nodes. The comparison at line (7) continues to be made between terminals only; pointers are popped with no comparison being made.

(1) repeat forever
(2)     if only $ is on the stack and only $ is on the input then
            accept and break
        else begin
(3)         let a be the topmost terminal symbol on the stack
            and let b be the current input symbol;
(4)         if a < b or a = b then shift b onto the stack
(5)         else if a > b then /* reduce */
(6)             repeat pop the stack
(7)             until the top stack terminal is related by < to the
                terminal most recently popped
(8)         else call the error correcting routine
(9)     end

Precedence functions
Compilers using operator-precedence parsers need not store the table of precedence relations. In most cases the table can be encoded by two precedence functions f and g, which map terminal symbols to integers. We attempt to select f and g so that, for symbols a and b,
1. f(a) < g(b) whenever a < b,
2. f(a) = g(b) whenever a = b,
3. f(a) > g(b) whenever a > b.
Thus the precedence relation between a and b can be determined by a numerical comparison between f(a) and g(b).
For the expression grammar with terminals id, +, * and $, the precedence relations are

       id      +       *       $
id             >       >       >
+      <       >       <       >
*      <       >       >       >
$      <       <       <

(The textbook also shows a graph whose nodes are the f and g values of each terminal, linked according to the precedence relations; the function values are read off as the lengths of the longest paths from each node.) The resulting precedence functions are

       id      +       *       $
f      4       2       4       0
g      5       1       3       0
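The precedence-function encoding is easy to express directly in code. Below is a minimal sketch in C using the f/g values tabulated above, with id written as 'i'; the -1/0/+1 return convention is an illustrative choice.

#include <stdio.h>

static int f(char a) {                 /* left precedence function  */
    switch (a) { case 'i': return 4; case '+': return 2;
                 case '*': return 4; default:  return 0; /* '$' */ }
}
static int g(char b) {                 /* right precedence function */
    switch (b) { case 'i': return 5; case '+': return 1;
                 case '*': return 3; default:  return 0; /* '$' */ }
}

/* -1, 0, +1 stand for a < b, a = b, a > b respectively */
static int precedence(char a, char b) {
    return (f(a) < g(b)) ? -1 : (f(a) > g(b)) ? 1 : 0;
}

int main(void) {
    printf("%d\n", precedence('+', '*'));  /* f(+)=2 < g(*)=3: -1, shift */
    printf("%d\n", precedence('*', '+'));  /* f(*)=4 > g(+)=1: +1, reduce */
    return 0;
}

Two functions of one argument replace the whole two-dimensional relation table, at the cost of no longer being able to detect error entries by table lookup alone.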
ERROR RECOVERY IN OPERATOR PRECEDENCE PARSING
There are two points in the parsing process at which an operator-precedence parser can discover syntactic errors:
1. No precedence relation holds between the terminal on top of the stack and the current input symbol.
2. A handle has been found, but there is no production with this handle as a right side.
The following are the error handling routines:
e1: /* called when whole expression is missing */
    insert id onto the input;
    issue diagnostic: "missing operand"
e2: /* called when expression begins with a right parenthesis */
    delete ) from the input;
    issue diagnostic: "unbalanced right parenthesis"
e3: /* called when id or ) is followed by id or ( */
    insert + onto the input;
    issue diagnostic: "missing operator"
e4: /* called when expression ends with a left parenthesis */
    pop ( from the stack;
    issue diagnostic: "missing right parenthesis"

LR PARSERS
The most efficient bottom-up parsers are LR parsers. The name expands as follows: L – the input is scanned from Left to right; R – a Rightmost derivation (parse tree) is constructed in reverse. A parsing algorithm and a parsing table are the two major components of an LR parser. Three ways to construct the LR parsing table are:
1. SLR (Simple LR)
2. CLR (Canonical LR)
3. LALR (Lookahead LR)

SLR Parser
To construct the SLR parsing table, two steps are necessary:
1. Construction of the collection of sets of LR(0) items using the CLOSURE and GOTO functions.
2. Construction of the parsing table from the LR(0) items.

Construction of the LR(0) items collection
The collection of sets of LR(0) items is called C. It must be constructed in order to build the SLR parsing table, and is called the canonical collection of LR(0) items.

LR(0) item
Consider a production I -> JKL. Placing a dot ('.') at each possible position of the right side gives
I -> .JKL    I -> J.KL    I -> JK.L    I -> JKL.
These are LR(0) items of the grammar G. An LR(0) item of a grammar G is a production of G with a dot at some position of the right side. An LR(0) item may simply be called an item.
E.g.
1. For E -> id, the LR(0) items are E -> .id and E -> id.
2. For E -> ε, the only item is E -> .

Augmented grammar
Consider a grammar G with start symbol S. The augmented grammar G' of G has a new start symbol S' and an additional production S' -> S.
Example: consider the grammar G
S -> AS | b
A -> SA | a
The augmented grammar G' is
S' -> S
S -> AS | b
A -> SA | a

Closure function
Let I be a set of items for a grammar G. The closure of I, CLOSURE(I), is computed by the following steps:
1. Initially, every item in I is added to CLOSURE(I).
2. If A -> X.BY is an item in I and B -> Z is a production, then add the item B -> .Z to I (if it is not already there). Repeat until no more items can be added.
Example: consider the item S -> A.S in I.
CLOSURE(S -> A.S) contains
S -> A.S
S -> .AS
S -> .b
Here we also have to include the items derived from A; therefore:
A -> .SA
A -> .a
The items derived from S are already there, so no new items are added. Thus
CLOSURE(I):
S -> A.S
S -> .AS
S -> .b
A -> .SA
A -> .a

GOTO function
Consider an item A -> .XBY in I. Then GOTO(I, X) contains A -> X.BY together with the closure of B, where X is a grammar symbol.
Example:
I: S -> A.S
   S -> .AS
   S -> .b
   A -> .SA
   A -> .a
Then GOTO(I, S):
S -> AS.
A -> S.A
A -> .SA
A -> .a
S -> .AS
S -> .b
Similarly, GOTO(I, b) = { S -> b. }

Constructing SLR parsing tables
Algorithm.
Input: the canonical collection C of sets of items for an augmented grammar G'.
Output: if possible, an LR parsing table consisting of a parsing action function ACTION and a goto function GOTO.
Method: let C = {I0, I1, ..., In}. The states of the parser are 0, 1, ..., n, state i being constructed from Ii. The parsing actions for state i are determined as follows:
1. If [A -> α.aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j"; here a is a terminal.
2. If [A -> α.] is in Ii, then set ACTION[i, a] to "reduce A -> α" for all a in FOLLOW(A). If [S' -> S.] is in Ii, then set ACTION[i, $] to "accept".
3. If any conflicting actions are generated by the above rules, we say the grammar is not SLR(1). The algorithm fails to produce a valid parser in this case.
The goto transitions for state i are constructed using the rule:
4. If GOTO(Ii, A) = Ij, then GOTO[i, A] = j.
5. All entries not defined by rules (1) through (4) are made "error".
6. The initial state of the parser is the one constructed from the set of items containing [S' -> .S].
The table of parsing action and goto functions determined by this algorithm is called the SLR table for G. An LR parser using the SLR table for G is called an SLR parser for G, and a grammar having an SLR parsing table is said to be SLR(1).

Example: construct the SLR parsing table for the grammar
E -> E+T
E -> T
T -> T*F
T -> F
F -> (E)
F -> id

Step 1: Augmented grammar
E' -> E
E -> E+T | T
T -> T*F | F
F -> (E) | id

Step 2: Canonical collection of LR(0) items:
I0:  E' -> .E      E -> .E+T     E -> .T
     T -> .T*F     T -> .F       F -> .(E)     F -> .id
I1:  E' -> E.      E -> E.+T
I2:  E -> T.       T -> T.*F
I3:  T -> F.
I4:  F -> (.E)     E -> .E+T     E -> .T
     T -> .T*F     T -> .F       F -> .(E)     F -> .id
I5:  F -> id.
I6:  E -> E+.T     T -> .T*F     T -> .F
     F -> .(E)     F -> .id
I7:  T -> T*.F     F -> .(E)     F -> .id
I8:  F -> (E.)     E -> E.+T
I9:  E -> E+T.     T -> T.*F
I10: T -> T*F.
I11: F -> (E).

SLR parsing table (productions numbered 1: E->E+T, 2: E->T, 3: T->T*F, 4: T->F, 5: F->(E), 6: F->id)

STATE   ACTION                                        GOTO
        id     +      *      (      )      $         E    T    F
0       s5                   s4                       1    2    3
1              s6                          acc
2              r2     s7            r2     r2
3              r4     r4            r4     r4
4       s5                   s4                       8    2    3
5              r6     r6            r6     r6
6       s5                   s4                            9    3
7       s5                   s4                                 10
8              s6                   s11
9              r1     s7            r1     r1
10             r3     r3            r3     r3
11             r5     r5            r5     r5
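The table above can be driven by a short loop that manipulates a stack of states. The sketch below is a compact runnable encoding in C; the integer convention (+n = shift to state n, -n = reduce by production n, 99 = accept, 0 = error) and the single-character tokens ('i' for id) are illustrative assumptions.

#include <stdio.h>

/* ACTION and GOTO for the SLR table above.
   Terminal codes: 0=id 1='+' 2='*' 3='(' 4=')' 5='$'. */
static const int action[12][6] = {
    { 5, 0, 0, 4, 0, 0}, { 0, 6, 0, 0, 0,99}, { 0,-2, 7, 0,-2,-2},
    { 0,-4,-4, 0,-4,-4}, { 5, 0, 0, 4, 0, 0}, { 0,-6,-6, 0,-6,-6},
    { 5, 0, 0, 4, 0, 0}, { 5, 0, 0, 4, 0, 0}, { 0, 6, 0, 0,11, 0},
    { 0,-1, 7, 0,-1,-1}, { 0,-3,-3, 0,-3,-3}, { 0,-5,-5, 0,-5,-5},
};
static const int go2[12][3] = {             /* columns: E T F       */
    {1,2,3},{0,0,0},{0,0,0},{0,0,0},{8,2,3},{0,0,0},
    {0,9,3},{0,0,10},{0,0,0},{0,0,0},{0,0,0},{0,0,0},
};
static const int rhs_len[7] = {0,3,1,3,1,3,1};  /* |right side|     */
static const int lhs[7]     = {0,0,0,1,1,2,2};  /* 0=E 1=T 2=F      */

static int tok(char c) {
    switch (c) { case 'i': return 0; case '+': return 1;
                 case '*': return 2; case '(': return 3;
                 case ')': return 4; default:  return 5; /* '$' */ }
}

int parse(const char *in) {
    int states[100], top = 0; states[0] = 0;    /* state stack      */
    int a = tok(*in);
    for (;;) {
        int act = action[states[top]][a];
        if (act == 99) return 1;                /* accept           */
        if (act > 0) {                          /* shift            */
            states[++top] = act;
            a = tok(*++in);
        } else if (act < 0) {                   /* reduce by -act   */
            top -= rhs_len[-act];               /* pop |rhs| states */
            states[top+1] = go2[states[top]][lhs[-act]];
            top++;                              /* push GOTO state  */
        } else return 0;                        /* error entry      */
    }
}

int main(void) {
    printf("%s\n", parse("i+i*i$") ? "accepted" : "rejected");
    return 0;
}

The driver itself is identical for SLR, CLR and LALR; only the tables differ, which is why the three constructions are described purely in terms of table building.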
CLR parser
It is also an LR parser. Many of the concepts are similar to the SLR parser, but the parsing table is constructed differently, from LR(1) items instead of LR(0) items.

LR(1) items
The general form of an LR(1) item is [A -> X.Y, a], where a is called the lookahead. The lookahead is extra information: we are looking one symbol ahead, and a may be a terminal or the right end marker $.
Example: [S' -> .S, $]  ($ is the lookahead).
The collection of sets of LR(1) items leads to the construction of the CLR parsing table. As in the SLR parser, the CLOSURE and GOTO functions are used to construct the LR(1) items, now taking the lookahead into account.

Algorithm for construction of a canonical LR parsing table
Input: a grammar G augmented by the production S' -> S.
Output: if possible, the canonical LR parsing table with action function ACTION and goto function GOTO.
Method:
1. Construct C = {I0, ..., In}, the collection of sets of LR(1) items for G'.
2. State i of the parser is constructed from Ii. The parsing actions for state i are determined as follows:
(a) If [A -> α.aβ, b] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j".
(b) If [A -> α., a] is in Ii, then set ACTION[i, a] to "reduce A -> α".
(c) If [S' -> S., $] is in Ii, then set ACTION[i, $] to "accept".
If a conflict results from the above rules, the grammar is said not to be LR(1), and the algorithm is said to fail.
3. The goto transitions for state i are determined as follows: if GOTO(Ii, A) = Ij, then GOTO[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error".
5. The initial state of the parser is the one constructed from the set containing the item [S' -> .S, $].

Construction of the sets of LR(1) items for grammar G
Input: a grammar G.
Output: the sets of LR(1) items which are valid for one or more viable prefixes of G.
Method: the procedures CLOSURE and GOTO and the main routine for constructing the sets of items are:

procedure CLOSURE (I);
begin
    repeat
        for each item [A -> α.Bβ, a] in I, each
            production B -> γ, and each terminal b in FIRST (βa)
            such that [B -> .γ, b] is not in I do
                add [B -> .γ, b] to I;
    until no more items can be added to I;
    return I
end;

procedure GOTO (I, X);
begin
    let J be the set of items [A -> αX.β, a] such that
        [A -> α.Xβ, a] is in I;
    return CLOSURE (J)
end;

begin
    C := { CLOSURE ({ [S' -> .S, $] }) };
    repeat
        for each set of items I in C and each grammar
            symbol X such that GOTO (I, X) is not empty
            and not already in C do
                add GOTO (I, X) to C
    until no more sets of items can be added to C
end;

Example: construct the CLR parsing table for the grammar
S -> CC
C -> cC
C -> d

Step 1: Augmented grammar
S' -> S
S -> CC
C -> cC | d

Step 2: Canonical collection of LR(1) items
I0: S' -> .S, $      S -> .CC, $
    C -> .cC, c/d    C -> .d, c/d
I1: S' -> S., $
I2: S -> C.C, $      C -> .cC, $     C -> .d, $
I3: C -> c.C, c/d    C -> .cC, c/d   C -> .d, c/d
I4: C -> d., c/d
I5: S -> CC., $
I6: C -> c.C, $      C -> .cC, $     C -> .d, $
I7: C -> d., $
I8: C -> cC., c/d
I9: C -> cC., $

CLR parsing table (productions numbered 1: S->CC, 2: C->cC, 3: C->d)

STATE   ACTION                    GOTO
        c       d       $         S    C
0       s3      s4                1    2
1                       acc
2       s6      s7                     5
3       s3      s4                     8
4       r3      r3
5                       r1
6       s6      s7                     9
7                       r3
8       r2      r2
9                       r2

LALR Parser
This is much easier to construct than the CLR parsing table; the construction is similar, with small modifications.

LALR table construction
Input: a grammar G augmented by the production S' -> S.
Output: the LALR parsing tables ACTION and GOTO.
Method:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items.
2. For each core present among the sets of LR(1) items, find all sets having that core and replace these sets by their union.
3. Let C' = {J0, J1, ..., Jm} be the resulting sets of LR(1) items. The parsing actions for state i are constructed from Ji in the same manner as for the CLR table. If there is a parsing-action conflict, the algorithm fails to produce a parser and the grammar is said not to be LALR(1).
4. The GOTO table is constructed as follows. If J is the union of one or more sets of LR(1) items, i.e. J = I1 U I2 U ... U Ik, then the cores of GOTO(I1, X), ..., GOTO(Ik, X) are the same, since I1, ..., Ik all have the same core. Let K be the union of all sets of items having the same core as GOTO(I1, X); then GOTO(J, X) = K.

Example: for the same grammar as in the CLR parser, the LALR parsing table is

STATE   ACTION                    GOTO
        c       d       $         S    C
0       s36     s47               1    2
1                       acc
2       s36     s47                    5
36      s36     s47                    89
47      r3      r3      r3
5                       r1
89      r2      r2      r2

Here the union I36 replaces I3 and I6:
I36: C -> c.C, c/d/$
     C -> .cC, c/d/$
     C -> .d, c/d/$
Similarly, I47 replaces I4 and I7:
I47: C -> d., c/d/$
and I89 replaces I8 and I9:
I89: C -> cC., c/d/$

Comparison
For a comparison of parser size: the SLR and LALR tables for a grammar always have the same number of states, whereas the CLR table has more states. Thus it is much easier and more economical to construct SLR or LALR tables than CLR tables.

PARSER GENERATOR
A parser is a program which determines if its input is syntactically valid and determines its structure. Parsers may be handwritten or may be automatically generated by a parser generator from descriptions of valid syntactic structures. The descriptions are in the form of a context-free grammar. Parser generators may be used to develop a wide range of language parsers, from those used in simple desk calculators to complex programming languages.
Yacc is a program which, given a context-free grammar, constructs a C program which will parse input according to the grammar rules. Yacc was developed by S. C. Johnson and others at AT&T Bell Laboratories. Yacc provides for semantic stack manipulation and the specification of semantic routines. An input file for Yacc is of the form:

C and parser declarations
%%
Grammar rules and actions
%%
C subroutines

The first section of the Yacc file consists of a list of tokens (other than single characters) that are expected by the parser and the specification of the start symbol of the grammar. This section may also contain specifications of the precedence and associativity of operators; this permits greater flexibility in the choice of a context-free grammar. Below, addition and subtraction are declared to be left associative and of lowest precedence, while exponentiation is declared to be right associative and to have the highest precedence.

%start program
%token LET INTEGER IN
%token SKIP IF THEN ELSE END WHILE DO READ WRITE
%token NUMBER
%token IDENTIFIER
%left '-' '+'
%left '*' '/'
%right '^'
%%
Grammar rules and actions
%%
C subroutines

The second section of the Yacc file consists of the context-free grammar for the language. Productions are separated by semicolons, the '::=' symbol of BNF is replaced with ':', the empty production is left empty, nonterminals are written in all lower case, and the multicharacter terminal symbols in all upper case. Notice the simplification of the expression grammar due to the separation of precedence from the grammar.

C and parser declarations
%%
program : LET declarations IN commands END ;
declarations : /* empty */
    | INTEGER id_seq IDENTIFIER '.' ;
id_seq : /* empty */
    | id_seq IDENTIFIER ',' ;
commands : /* empty */
    | commands command ';' ;
command : SKIP
    | READ IDENTIFIER
    | WRITE exp
    | IDENTIFIER ASSGNOP exp
    | IF exp THEN commands ELSE commands FI
    | WHILE exp DO commands END ;
exp : NUMBER
    | IDENTIFIER
    | exp '<' exp
    | exp '=' exp
    | exp '>' exp
    | exp '+' exp
    | exp '-' exp
    | exp '*' exp
    | exp '/' exp
    | exp '^' exp
    | '(' exp ')' ;
%%
C subroutines

The third section of the Yacc file consists of C code. There must be a main() routine which calls the function yyparse(), the driver routine for the parser. There must also be a function yyerror(), which is used to report errors during the parse.
Simple examples of the functions main() and yyerror() are:

C and parser declarations
%%
Grammar rules and actions
%%
main( int argc, char *argv[] )
{
    extern FILE *yyin;
    ++argv; --argc;
    yyin = fopen( argv[0], "r" );
    yydebug = 1;
    errors = 0;
    yyparse ();
}
yyerror (char *s) /* Called by yyparse on error */
{
    printf ("%s\n", s);
}

The parser, as written, has no output; however, the parse tree is implicitly constructed during the parse. As the parser executes, it builds an internal representation of the structure of the program. The internal representation is based on the right-hand sides of the production rules: when a right-hand side is recognized, it is reduced to the corresponding left-hand side. Parsing is complete when the entire program has been reduced to the start symbol of the grammar.
Compiling the Yacc file with the command yacc -vd file.y (or bison -vd file.y) generates two files, file.tab.h and file.tab.c. file.tab.h contains the list of tokens and is included in the file which defines the scanner. file.tab.c defines the C function yyparse(), which is the parser.

UNIT – IV
INTERMEDIATE LANGUAGES

SYNTAX DIRECTED DEFINITION
A syntax directed definition is a generalization of a context-free grammar in which each grammar symbol has an associated set of attributes. An attribute may be a string, a number, a type, a memory location or code. The two types of attributes are
1. Synthesized attributes
2. Inherited attributes
Synthesized attribute values are computed from the values at a node's children, or are associated with the meaning of the tokens. Inherited attribute values are computed from the parent and/or siblings.

Annotated parse tree
We annotate the parse tree by attaching semantic attributes to the nodes of the parse tree, and generate code by visiting nodes of the parse tree in a given order. (Example input: y := 3 * x + z.) Each grammar symbol is associated with a set of attributes. Annotating (or decorating) the parse tree is the process of computing the attribute values at the nodes.

Output action (semantic rule)
A syntax directed translation scheme is a context-free grammar in which a program fragment called an output action is associated with each production.
Ex: A -> XYZ { α }
If an input string w is derived using this production, then the action α is executed.

Syntax-directed translation – definition
The compilation process is driven by the syntax: the semantic routines perform interpretation based on the syntactic structure. Attributes are attached to the grammar symbols, and values for the attributes are computed by semantic rules associated with the grammar productions.

Types of syntax directed translation
1. Synthesized translation
2. Inherited translation
Synthesized translation defines the value of the translation of the nonterminal on the left side of the production as a function of the translations of the nonterminals on the right side.
Ex: E.val := E(1).val + E(2).val
Inherited translation defines the translation of a nonterminal on the right side of the production in terms of a translation of the nonterminal on the left.
Ex: A -> XYZ { Y.val := 2 * A.val }

A syntax directed definition that uses only synthesized attributes is called an S-attributed definition. An inherited attribute is one whose value at a node in a parse tree is defined in terms of attributes at the parent and/or siblings of that node.

Dependency graph
The interdependencies among the inherited and synthesized attributes at the nodes of a parse tree can be depicted by a directed graph called a dependency graph.
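Synthesized attributes are exactly what Yacc's $$/$1/$3 notation implements: each action computes the value of the left-hand-side symbol from the values of the right-hand-side symbols. The fragment below is an illustrative sketch (not part of the Yacc file shown earlier) that evaluates an expression's val attribute bottom-up.

%token NUMBER
%left '+'
%left '*'
%%
expr : expr '+' expr   { $$ = $1 + $3; }   /* E.val := E1.val + E2.val */
     | expr '*' expr   { $$ = $1 * $3; }   /* E.val := E1.val * E2.val */
     | '(' expr ')'    { $$ = $2; }        /* pass the value upward    */
     | NUMBER          { $$ = $1; }        /* value comes from token   */
     ;
%%

Because every rule computes $$ only from $1, $2, $3 on its own right side, this grammar is an S-attributed definition and can be evaluated during a single bottom-up parse.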
DECLARATIONS
In the declarations of a block or procedure we use a number of local names. For each such name we create a symbol table entry holding its attributes, such as its type and the relative address of its storage.

Declarations in a procedure
A procedure P contains a sequence of declarations of the form id : T. Before the first declaration, offset is 0. When a new name is found, it is entered into the symbol table, the current offset value is assigned to it, and the offset is incremented by the width of the data object denoted by that name.
The procedure enter(name, type, offset) creates an entry for name with the given type and relative address in the symbol table. The type and width attributes are used; the type may be integer, real, pointer or array.

Example
P  -> {offset := 0} D
D  -> D ; D
D  -> id : T     { enter(id.name, T.type, offset);
                   offset := offset + T.width }
T  -> integer    { T.type := integer;  T.width := 4 }
T  -> real       { T.type := real;     T.width := 8 }
T  -> array [num] of T1
                 { T.type := array(num.val, T1.type);
                   T.width := num.val * T1.width }
T  -> ^T1        { T.type := pointer(T1.type);  T.width := 4 }

In the first line, P -> {offset := 0} D, the action is not at the right end. So we rewrite this production as
P -> M D
M -> ε   { offset := 0 }

Nested procedures
Here a separate symbol table is created for each procedure. The grammar for nested procedures can be written as
P -> D
D -> D ; D  |  id : T  |  proc id ; D ; S
where S stands for statements and "proc id" introduces a procedure name.
For example, in a quicksort program the outermost symbol table (for the main program) contains entries for the variables and for the procedures readarray, exchange and quicksort; the table for quicksort in turn has an entry pointing to the table for the nested procedure partition, and the header of each inner table holds a pointer back to the table of its enclosing procedure.

Semantic rules for nested procedures are defined using the following operations:
1. mktable(previous) – creates a new symbol table and returns a pointer to it. The argument previous points to a previously created symbol table (that of the enclosing procedure); it is placed in the header of the new table along with additional information.
2. enter(table, name, type, offset) – creates a new entry for name in the symbol table pointed to by table.
3. addwidth(table, width) – records the cumulative width of all the entries of table in the header of that table.
4. enterproc(table, name, newtable) – creates a new entry for procedure name in the symbol table pointed to by table; newtable points to the symbol table of that procedure.

Field names in records
For the record data type we use the production
T -> record L D end
where the marker nonterminal L creates a new symbol table to hold the field names of the record.

ASSIGNMENT STATEMENTS
The translation of assignments into three-address code is as follows.

S -> id := E   { p := lookup(id.name);
                 if p <> nil then emit(p ':=' E.place)
                 else error }
E -> E1 + E2   { E.place := newtemp;
                 emit(E.place ':=' E1.place '+' E2.place) }
E -> E1 * E2   { E.place := newtemp;
                 emit(E.place ':=' E1.place '*' E2.place) }
E -> -E1       { E.place := newtemp;
                 emit(E.place ':=' 'uminus' E1.place) }
E -> (E1)      { E.place := E1.place }
E -> id        { p := lookup(id.name);
                 if p <> nil then E.place := p
                 else error }

Reusing temporary names
If temporary names can be reused, this is implemented with a counter count, initially 0. Whenever a new temporary name is created, count is incremented and the name $count is used; whenever a temporary is used as an operand (and so freed), count is decremented.
Example: x := a*b + c*d - e*f

Statement         value of count
                  0
$0 := a*b         1
$1 := c*d         2
$0 := $0+$1       1
$1 := e*f         2
$0 := $0-$1       1
x := $0           0

Addressing array elements
Array elements can be accessed quickly if they are stored in consecutive locations. The address of an element is computed by the following formulas:
base + (i - low) * w                              (one-dimensional array)
base + ((i1 - low1) * n2 + (i2 - low2)) * w       (two-dimensional array)
where base is the starting address, i is the index, low is the lower bound of the subscript, n2 is the number of values the second subscript can take, and w is the width of the data type. A small C sketch of these formulas follows; after it, a worked numeric example.
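The two formulas transcribe directly into C. The function names below are illustrative assumptions; the call in main() reproduces the numeric example that follows.

#include <stdio.h>

long addr1(long base, long i, long low, long w) {
    return base + (i - low) * w;            /* 1-D: base + (i-low)*w */
}

long addr2(long base, long i1, long low1, long i2, long low2,
           long n2, long w) {               /* row-major 2-D layout  */
    return base + ((i1 - low1) * n2 + (i2 - low2)) * w;
}

int main(void) {
    /* integer array A[0..4] starting at address 1000, width 4 */
    printf("%ld\n", addr1(1000, 3, 0, 4));  /* prints 1012           */
    return 0;
}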
For example, suppose array A contains 5 integer elements (width 4) starting at address 1000. The address of A[3] is
Address of A[3] = 1000 + (3-0)*4 = 1000 + 12 = 1012

Type conversions within assignments
Consider only the integer and real types for conversion. The following translation scheme performs type conversion for the production E -> E1 + E2 (the symmetric case, E1 real and E2 integer, is analogous):

E.place := newtemp;
if E1.type = integer and E2.type = integer then begin
    emit(E.place ':=' E1.place 'int+' E2.place);
    E.type := integer
end
else if E1.type = real and E2.type = real then begin
    emit(E.place ':=' E1.place 'real+' E2.place);
    E.type := real
end
else if E1.type = integer and E2.type = real then begin
    u := newtemp;
    emit(u ':=' 'inttoreal' E1.place);
    emit(E.place ':=' u 'real+' E2.place);
    E.type := real
end
else
    E.type := type_error;

Example: x := y + i * j, where x and y are real and i and j are integer.
t1 := i int* j
t3 := inttoreal t1
t2 := y real+ t3
x := t2

SYMBOL TABLE
A compiler uses a symbol table to keep track of scope and binding information about names. The symbol table mechanism must allow us to add new entries and find existing entries efficiently. Two common symbol table mechanisms are
1. Linear lists
2. Hash tables
We evaluate each scheme on the basis of the time required to add n entries and make e inquiries. A linear list is the simplest to implement, but its performance is poor when e and n get large. Hashing provides better performance than a linear list.

Symbol table entries
Each entry in the symbol table is for the declaration of a name. The format of the entries does not have to be uniform, because the information saved about a name depends on the usage of the name. Each entry can be implemented as a record consisting of a sequence of consecutive words of memory. To keep symbol table records uniform, it may be convenient for some of the information about a name to be kept outside the table entry, with only a pointer to this information stored in the record.

Characters in a name
If there is a modest upper bound on the length of a name, the characters of the name can be stored directly in the symbol table entry, in a fixed-size NAME field alongside the ATTRIBUTES field of each record. If there is no limit on the length of a name, or if the limit is rarely reached, an indirect scheme is used: the characters of all names are kept in a separate array, and the NAME field of each entry points into that array. The complete lexeme constituting a name must be stored to ensure that all uses of the same name can be associated with the same symbol table record.

Storage allocation information
Information about the storage locations that will be bound to names at run time is kept in the symbol table. For names whose storage is allocated on a stack or heap, the compiler does not allocate storage at compile time at all; instead it plans out the activation record for each procedure.

The list data structure for symbol tables
The simplest and easiest to implement data structure for a symbol table is a linear list of records: entries (id1, info1), (id2, info2), ..., (idn, infon) are stored in a single array (or several equivalent arrays), and a pointer available marks the position where the next symbol table entry will go. When the name being sought is located during a search, the associated information is found in the words following it. If we reach the beginning of the array without finding the name, a fault occurs: the name is not in the table. A minimal C sketch of this structure follows.
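The sketch below implements the linear-list scheme just described. The field names and the fixed table size are illustrative assumptions; searching newest-first makes the most recently created entry win, matching the Lookup operation described later for scoped tables.

#include <stdio.h>
#include <string.h>

struct entry { char name[32]; int type; int offset; };

static struct entry table[100];
static int available = 0;                /* next free slot           */

int insert(const char *name, int type, int offset) {
    strcpy(table[available].name, name);
    table[available].type = type;
    table[available].offset = offset;
    return available++;                  /* index of the new entry   */
}

struct entry *lookup(const char *name) {
    /* scan from the end toward the beginning of the array */
    for (int i = available - 1; i >= 0; i--)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;                         /* fault: name not present  */
}

int main(void) {
    insert("position", 1, 0);
    insert("rate", 1, 4);
    struct entry *e = lookup("rate");
    if (e) printf("%s at offset %d\n", e->name, e->offset);
    return 0;
}

Insertion is constant time, but lookup is linear in the number of entries, which is exactly why hashing (next) performs better when n and e get large.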
Hash tables
Many compilers use hashing for symbol table search. There are two parts to the basic data structure:
1. A hash table consisting of a fixed array of m pointers to table entries.
2. Table entries organized into m separate linked lists, called buckets (some buckets may be empty). Each record in the symbol table appears on exactly one of these lists. Storage for the records may be drawn from an array of records.
A suitable approach to computing the hash function is as follows:
1. Determine a positive integer h from the characters c1, c2, ..., ck in the string s. The conversion of single characters to integers is usually supported by the implementation language.
2. Convert the integer h determined above into the number of a list, i.e., an integer between 0 and m-1. Simply dividing by m and taking the remainder is a reasonable policy.

Representing scope information
The entries in the symbol table are for declarations of names, and the scope rules of the source language determine which declaration is appropriate for each use. A simple approach is to maintain a separate symbol table for each scope; the symbol table for a procedure or scope is the compile-time equivalent of an activation record. Information for the nonlocals of a procedure is found by scanning the symbol tables of the enclosing procedures, following the scope rules of the language.
Most-closely-nested scope rules can be implemented in terms of the following operations on a name:
Lookup : find the most recently created entry.
Insert : make a new entry.
Delete : remove the most recently created entry.
A hash table consists of m lists accessed through an array. Since a name always hashes to the same list, individual lists are maintained as described above. For the delete operation we would rather not scan the entire hash table looking for lists containing entries to be deleted. The following approach can be used: give each entry two links,
1. a hash link that chains the entry to other entries whose names hash to the same value, and
2. a scope link that chains all entries in the same scope.
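The two-step hash computation above is a few lines of C. The multiplier 65599 and the table size 211 are common choices used here as illustrative assumptions, not values prescribed by the text.

#include <stdio.h>

#define M 211                            /* number of buckets        */

unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s)                           /* step 1: integer from the */
        h = h * 65599 + (unsigned char)*s++;  /* characters of s     */
    return h % M;                        /* step 2: map into 0..M-1  */
}

int main(void) {
    printf("%u\n", hash("position"));    /* bucket number for a name */
    printf("%u\n", hash("rate"));
    return 0;
}

Each computed bucket number indexes the array of list headers; lookup then walks only the one short chain instead of the whole table.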
UNIT – V
INTRODUCTION TO CODE OPTIMIZATION
The code optimizer improves the code produced by the intermediate code generator in terms of time and space.
Ex:
MULT id2, id3, temp1
ADD  temp1, #1, id1
To get an efficient target program we need good code optimization, which improves the performance of the program by applying various transformations.

Criteria for code improving transformations
The transformations provided by an optimizing compiler should have several properties:
1. They must preserve the meaning of programs; i.e., an optimization must not change the output of a program for any input, or cause an error (such as a division by zero) that was not present before.
2. They must, on average, speed up programs by a measurable amount; reducing the size of the program also helps to improve its speed.
3. They must be worth the effort.

Getting better performance
Improvements can be applied at every level, from the source program down to the target code, and can improve running time dramatically, e.g. from a few hours to a few seconds.

Principal sources of optimization
A transformation of a program is called local if it can be performed by looking only at the statements in a basic block; otherwise it is called global. Transformations are performed both locally and globally; local transformations are done first.

Function preserving transformations
These improve the program without changing the function it computes. The main transformations are:
1. Common sub-expression elimination
2. Copy propagation
3. Dead-code elimination
4. Constant folding

Common sub-expression elimination: an expression whose value has already been computed earlier is not recomputed; the previously computed value is used instead.
Copy propagation: the copy transformation is to use g for f wherever possible after the copy statement f := g. For example

Before transformation       After transformation
x := t3                     x := t3
a[t2] := t5                 a[t2] := t5
a[t4] := x                  a[t4] := t3

Dead-code elimination: remove code that can never be reached or whose result is never used. In the example above, once x is no longer used, the assignment to x can be eliminated:

Before transformation       After transformation
x := t3                     a[t2] := t5
a[t2] := t5                 a[t4] := t3
a[t4] := x

Constant folding: when the value of an expression is a constant, using the constant instead is known as constant folding.
Ex: 2 * 3.14 = 6.28

Loop optimization
Most of the running time of a program is spent in its loops. To reduce this time we use two transformations:
1. Code motion
2. Induction variables and reduction in strength

Code motion decreases the amount of code in a loop.
Ex: while (i <= limit-2)
can be rewritten as
t := limit-2
while (i <= t)
The running time of a program may be improved if we decrease the length of one of its loops, especially an inner loop, even if we increase the amount of code outside the loops. This assumes that the loop in question is executed at least once on the average. We must beware of a loop whose body is rarely executed, such as the "blank stripper"
while CHAR = ' ' do CHAR := GETCHAR()
where GETCHAR() is assumed to return the next character of an input file. In many situations it might be quite normal for the condition CHAR = ' ' to be false the first time around, in which case the statement CHAR := GETCHAR() would be executed zero times. An important source of modifications of the above type is called code motion, where we take a computation that yields the same result independent of the number of times through the loop (a loop-invariant computation) and place it before the loop.

Induction variables
There is another important optimization that may be applied to the flow graph, one that will actually decrease the total number of instructions as well as speeding up the loop. In the dot-product loop below, the purpose of I is to count from 1 to 20, while the purpose of T1 is to step through the arrays, four bytes at a time, since we are assuming four bytes per word. The values of I and T1 remain in lock-step: at the assignment T1 := 4*I, I takes the values 1, 2, ..., 20 each time through the beginning of the loop, so T1 takes the values 4, 8, ..., 80 immediately after each assignment to T1. That is, both I and T1 form arithmetic progressions; we call such identifiers induction variables. Since the relationship T1 = 4*I surely holds after the assignment to T1, and T1 is not changed elsewhere in the loop, it follows that after the statement I := I+1 the relationship T1 = 4*I - 4 must hold. Thus, at the statement "if I <= 20 goto B2", we have I <= 20 if and only if T1 <= 76. When there are two or more induction variables in a loop we have an opportunity to get rid of all but one; this process is called induction variable elimination.

Flow graph after code motion (the loop-invariant assignments to T2 and T4 have been moved out of the loop; the loop body is block B2):

PROD := 0
I := 1
T2 := addr(A) - 4
T4 := addr(B) - 4
B2: T1 := 4 * I
    T3 := T2[T1]
    T5 := T4[T1]
    T6 := T3 * T5
    PROD := PROD + T6
    I := I + 1
    if I <= 20 goto B2
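Before turning to reduction in strength, the following C sketch illustrates the lock-step induction variables just described, and previews the strength-reduced form. The byte-offset access via a char pointer mirrors the indexed addressing T2[T1] of the three-address code; it assumes a 4-byte int, as the text does, and is an illustration rather than code from the text.

#include <stdio.h>

#define N 20

int dot_before(const int a[N], const int b[N]) {
    int prod = 0;
    for (int i = 1; i <= N; i++) {
        int t1 = 4 * i;                 /* induction variable T1=4*I */
        prod += *(const int *)((const char *)a + t1 - 4)
              * *(const int *)((const char *)b + t1 - 4);
    }
    return prod;
}

int dot_after(const int a[N], const int b[N]) {
    int prod = 0, t1 = 0;               /* T1 tracks 4*I - 4         */
    for (int i = 1; i <= N; i++) {
        prod += *(const int *)((const char *)a + t1)
              * *(const int *)((const char *)b + t1);
        t1 += 4;                        /* addition replaces 4*i     */
    }
    return prod;
}

int main(void) {
    int a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i; }
    printf("%d %d\n", dot_before(a, b), dot_after(a, b)); /* equal  */
    return 0;
}

In dot_after the multiplication 4*i inside the loop has been replaced by the cheaper addition t1 += 4, which is the transformation named next.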
Reduction in strength
It is also worth noting that the multiplication step T1 := 4*I in the flow graph above can be replaced by the addition step T1 := T1 + 4. This replacement speeds up the object code if addition takes less time than multiplication, as is the case on many machines. The replacement of an expensive operation by a cheaper one is termed reduction in strength.
A dramatic example of reduction in strength is the replacement of the string-concatenation operator || in the PL/I statement
L = LENGTH (S1 || S2)
by an addition
L = LENGTH (S1) + LENGTH (S2)
Determining the extra length and performing the additions is far cheaper than the string concatenation. Another example of reduction in strength is the replacement of the multiplication of an integer by a power of two with a shift. If code motion is not applicable (for example in the quicksort program), we use this type of transformation.
Ex: j := j - 1
    t := 4 * j
In this example, if j is decremented then the value of t changes in lock-step: j and t are both induction variables, and all but one of them can be eliminated.
Ex: x^2 = x * x;  2.0 * x = x + x;  x / 2 = x * 0.5   (reduction in strength)

CODE GENERATION
Code generation is the last phase of a compiler. Its input is an intermediate representation of the source program and its output is an equivalent target program. In an optimizing compiler the code optimization phase is optional. (The structure is: source program -> front end -> intermediate code -> code optimizer -> intermediate code -> code generator -> target program, with the symbol table consulted by all of these components.)

ISSUES IN THE DESIGN OF A CODE GENERATOR
The issues inherent in all code generation problems are
1. Memory management
2. Instruction selection
3. Register allocation
4. Evaluation order

Input to the code generator
The input from the front end comes with information in the symbol table that is used to determine the run-time addresses of the data objects denoted by the names in the intermediate representation. The intermediate representation may be postfix notation, three-address code such as quadruples, a virtual machine representation, or a graphical representation such as a syntax tree or a DAG. In some compilers semantic checking is done together with code generation.

Target programs
The output is the target program. It may be absolute machine code, relocatable machine code or assembly code.

Memory management
Mapping names in the source program to addresses of data objects in run-time memory is done jointly by the front end and the code generator. A name in a three-address statement refers to the symbol table entry for that name. Whenever a name is declared in a procedure, it is entered into the symbol table, which also stores the type of the name, its width and its relative address. Labels are also handled here.

Instruction selection
The nature of the instruction set of the target machine determines the difficulty of instruction selection; the uniformity and completeness of the instruction set are important factors. If the target machine does not support a data type directly, exception handling may be needed. Instruction selection is essential to improve the efficiency of the target program. Instruction speed is another important factor: the quality of the generated code is determined by its speed and size.
Ex: a := a + 1
Code:
MOV a, R0
ADD #1, R0
MOV R0, a
These three statements can be replaced by the single statement INC a. So careful instruction selection is necessary to improve the efficiency of the target program, and a tool to construct an instruction selector is desirable.

Register allocation
Register operands are faster than memory operands, so good register usage is important for good code generation. The problems here are register allocation (deciding which variables reside in registers) and register assignment (picking the specific register for each). Finding an optimal assignment of registers to variables is difficult.

Choice of evaluation order
The order of computation affects the efficiency of the target code.
Some computation orders require fewer registers than others. Choosing a good order, and presenting the intermediate code to the code generator in that order, solves this problem.

TARGET MACHINE
Our target computer is a byte-addressable machine with 4 bytes to a word and n general-purpose registers R0, ..., Rn-1. It has two-address instructions of the form
op source, destination

Mode                Form     Address                         Added cost
Absolute            M        M                               1
Register            R        R                               0
Indexed             c(R)     c + contents(R)                 1
Indirect register   *R       contents(R)                     0
Indirect indexed    *c(R)    contents(c + contents(R))       1
Literal             #c       the constant c                  1

Instruction cost
The length of an instruction (in words) is its cost. It can be calculated using the formula
cost(instruction) = 1 + cost(address mode of source) + cost(address mode of destination)
Reducing the instruction length also tends to minimize the time taken to perform the instruction, because on some machines fetching an instruction takes more time than executing it.
Examples:
1. MOV R0, R1 — R0 and R1 have cost 0 and the MOV instruction itself costs 1, so the total cost of the instruction is 1.
2. ADD #1, R3 — the total cost is 2, because the instruction occupies one word and the constant 1 occupies another word in memory.
Example: write the code for the instruction a := b + c.

Instruction     Cost
MOV b, R0       2
ADD c, R0       2
MOV R0, a       2

The total cost is 6. It can be reduced by the following instructions (assuming the addresses of b and c are already in registers R1 and R2, and the address of a is in R0):

Instruction     Cost
MOV *R1, *R0    1
ADD *R2, *R0    1

Now the cost is reduced to 2.

RUNTIME STORAGE MANAGEMENT
Information needed during an execution of a procedure is kept in a block of storage called an activation record. It has fields to hold parameters, results, machine status information (such as the return address), local data and temporaries. The two standard allocation strategies are:
1. Static allocation – the position of an activation record in memory is fixed at compilation time.
2. Stack allocation – a new activation record is pushed onto the stack for each execution of a procedure; the record is popped when the activation ends.
The following three-address statements matter for the run-time allocation and deallocation of activation records:
1. call
2. return
3. halt
4. action, a placeholder for other statements
Example: three-address code for procedures c and p. The size and layout of the activation records are communicated to the code generator via the information about names in the symbol table: the activation record for c (64 bytes) holds the return address followed by c's local data (an array and scalars), and the activation record for p (88 bytes) holds the return address followed by p's local data (a buffer and scalars).

/* code for c */
action1
call p
action2
halt
/* code for p */
action3
return

Static allocation
A call is implemented by a sequence of two target-machine instructions:
MOV #here+20, callee.static_area    — saves the return address
GOTO callee.code_area               — transfers control to the called procedure

Stack allocation
Static allocation becomes stack allocation by using relative addresses for storage in activation records. The position of an activation record is not known until run time; it is kept in a register, so words of the record can be accessed as offsets from the value in this register. The indexed address mode of our target machine is convenient here. A register SP points to the beginning of the activation record on top of the stack. When a call occurs, the calling procedure increments SP and transfers control to the called procedure; after control returns to the caller, SP is decremented, deallocating the activation record of the called procedure.
The code for the first procedure initializes the stack:
MOV #stackstart, SP     — initialize the stack
  ... code for the first procedure ...
HALT
The call sequence increments SP, saves the return address and transfers control to the called procedure:
ADD #caller.recordsize, SP
MOV #here+16, *SP       — save the return address
GOTO callee.code_area
The called procedure returns control to the calling procedure via
GOTO *0(SP)             — the return address is saved in the first word of the activation record
and the stack pointer SP is then decremented by the instruction
SUB #caller.recordsize, SP

BASIC BLOCKS AND FLOW GRAPHS
(Assumption: the input is an intermediate code program.)

Basic block
A basic block is a sequence of intermediate code statements such that jump statements, if any, are at the end of the sequence, and code in other basic blocks can only jump to the beginning of this sequence, never into the middle.

Flow graph
The graphical representation of three-address code is called a flow graph. It represents the program as a flow-chart-like graph in which the nodes are basic blocks and the edges are the flow of control.

Partitioning into basic blocks
Algorithm.
Input: a sequence of three-address statements.
Output: the basic blocks.
Method:
1. Find the leaders, which are the first statements of basic blocks:
   - the first statement of the program is a leader;
   - the target of any conditional or unconditional goto is a leader;
   - the statement immediately following a conditional or unconditional goto is a leader.
2. Use the leaders to partition the program into basic blocks: each block consists of a leader and all statements up to, but not including, the next leader or the end of the program. (A C sketch of the leader-finding step appears at the end of this section.)

Ideas for optimization: two basic blocks are equivalent if they compute the same expressions. The transformation techniques below perform machine-independent optimization on basic blocks.

Example: three-address code for computing the dot product of two vectors a and b. There are two blocks:

B1:  (1) PROD := 0
     (2) I := 1
B2:  (3) T1 := 4*I
     (4) T2 := addr(A) - 4
     (5) T3 := T2[T1]
     (6) T4 := addr(B) - 4
     (7) T5 := T4[T1]
     (8) T6 := T3 * T5
     (9) PROD := PROD + T6
    (10) I := I + 1
    (11) if I <= 20 goto (3)

Statement (3) is a leader (it is the target of the goto), so the conditional jump returns control to the beginning of B2; otherwise control passes to the statement following (11).

Transformations on basic blocks
These transformations improve the efficiency of code generation: they increase the speed of execution and reduce the space required. The two kinds of transformations are
1. Structure preserving transformations
2. Algebraic transformations
Structure preserving transformations:
- Common sub-expression elimination (within and between blocks).
- Dead-code elimination: remove unreachable code.
- Renaming of temporary variables: allows better usage of registers and avoids unneeded temporary variables.
- Interchange of two independent adjacent statements, which might be useful in discovering the above three transformations.
Algebraic transformations replace expressions by cheaper algebraic equivalents; for example, statements such as x := x + 0 or x := x * 1 can be eliminated, and x := y ** 2 can be replaced by x := y * y.
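Here is the promised sketch of the leader-finding step, in C. The simplified instruction encoding (a struct recording whether a statement is a goto and, if so, its 0-based target index) is an illustrative assumption; real three-address code would carry the full statement.

#include <stdio.h>

struct stmt { int is_goto; int target; };

void find_leaders(const struct stmt *code, int n, int leader[]) {
    for (int i = 0; i < n; i++) leader[i] = 0;
    leader[0] = 1;                          /* rule: first statement  */
    for (int i = 0; i < n; i++) {
        if (code[i].is_goto) {
            leader[code[i].target] = 1;     /* rule: goto target      */
            if (i + 1 < n) leader[i+1] = 1; /* rule: stmt after goto  */
        }
    }
}

int main(void) {
    /* dot-product skeleton: statement 10 (0-based) is "if I<=20 goto (3)",
       whose target is statement 2 (0-based).                          */
    struct stmt code[11] = {0};
    code[10].is_goto = 1; code[10].target = 2;
    int leader[11];
    find_leaders(code, 11, leader);
    for (int i = 0; i < 11; i++)
        if (leader[i]) printf("leader at statement %d\n", i + 1);
    return 0;                               /* prints 1 and 3: B1, B2 */
}

Each printed leader starts a block; the block extends to just before the next leader, reproducing the B1/B2 partition of the dot-product example.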
APPROACHES TO COMPILER DEVELOPMENT
There are several general approaches that a compiler writer can adopt to implement a compiler. The simplest is to retarget or rehost an existing compiler. If there is no suitable existing compiler, the compiler writer might adopt the organization of a known compiler for a similar language and implement the corresponding components, using component-generation tools or implementing them by hand.

Bootstrapping
Using the facilities offered by a language to compile itself is the essence of bootstrapping. Bootstrapping is used to create compilers and to move them from one machine to another by modifying the back end. For bootstrapping purposes, a compiler is characterized by three languages: the source language S that it compiles, the target language T that it generates code for, and the implementation language I that it is written in. We represent the three languages by a diagram called a T-diagram, because of its shape; the T-diagram can be abbreviated as S I T.

Cross compiler
A compiler may run on one machine and produce target code for another machine. Such a compiler is often called a cross-compiler. Suppose we write a cross-compiler for a new language L in implementation language S to generate code for machine N; that is, we create L S N. If an existing compiler for S runs on machine M and generates code for M, it is characterized by S M M. If L S N is run through S M M, we get a compiler L M N, that is, a compiler from L to N that runs on M. This process is illustrated by putting together the T-diagrams for these compilers. When T-diagrams are put together in this way, note that the implementation language S of the compiler L S N must be the same as the source language of the existing compiler S M M, and that the target language M of the existing compiler must be the same as the implementation language of the translated form L M N. A trio of T-diagrams can be represented by the equation
L S N + S M M = L M N