CS 410/510 Compiler Design Theory

The following notes are the ones I lecture from in class. They are not to be considered complete, or a substitute for your own notes. Use these notes, together with your own, as guides for your reading and study outside of class. The textbook provides additional information on each of the subjects introduced below.

Compiler Design Introduction

Two parts to compilation
1. Analysis: Breaks up the source program into its constituent pieces and creates an intermediate representation.
2. Synthesis: Constructs the desired target program from the intermediate representation.

Analysis of the source program
1. Linear Analysis: The stream of characters of the source program is grouped into tokens, which are sequences of characters having a collective meaning. Performed by the Lexical Analyzer.
2. Hierarchical Analysis: The tokens are grouped hierarchically into nested collections with collective meaning. Performed by the Syntax Analyzer.
3. Semantic Analysis: Checks are performed to ensure that the components of a program fit together meaningfully. Example: type checking.

Linear Analysis of the following:

    sum := oldsum - value / 100

    Lexeme (collection of characters)    Token (category of lexeme)
    sum                                  identifier
    :=                                   assignment operator
    oldsum                               identifier
    -                                    subtraction operator
    value                                identifier
    /                                    division operator
    100                                  integer constant

The phases of a compiler applied to the following:

    newposition := initialposition + rate * 60

Lexical Analyzer

    id1 := id2 + id3 * int_lit

Syntax Analyzer

        :=
       /  \
    id1    +
          / \
       id2   *
            / \
         id3   int_lit

Semantic Analyzer

        :=
       /  \
    id1    +
          / \
       id2   *
            / \
         id3   int_to_real
                  \
                   int_lit

Intermediate Code Generator

    temp1 := int_to_real(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3

Code Optimizer

    temp1 := id3 * 60.0
    id1 := id2 + temp1

Code Generator

    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1

A pass of a compiler is a reading of a file followed by processing of the data from that file.
A phase of a compiler is a logical part of the compilation process.
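The step from the Semantic Analyzer's tree to the Intermediate Code Generator's output can be sketched as a post-order walk that gives every interior node of the tree its own temporary. This is only a minimal Python sketch under stated assumptions, not the method of any particular compiler: the names `Leaf`, `Node`, and `generate` are hypothetical, and the tree is built by hand instead of by a parser.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    name: str  # an identifier such as "id2" or a literal such as "60"


@dataclass
class Node:
    op: str  # "+", "*", or the conversion "int_to_real"
    args: List[Union["Node", Leaf]]


def generate(target: str, tree) -> List[str]:
    """Emit three-address code for `target := tree` by a post-order walk."""
    code: List[str] = []

    def walk(t) -> str:
        if isinstance(t, Leaf):
            return t.name
        operands = [walk(a) for a in t.args]  # visit children first
        temp = f"temp{len(code) + 1}"         # one temporary per instruction
        if t.op == "int_to_real":
            code.append(f"{temp} := int_to_real({operands[0]})")
        else:
            code.append(f"{temp} := {operands[0]} {t.op} {operands[1]}")
        return temp

    code.append(f"{target} := {walk(tree)}")
    return code


# The tree produced by the Semantic Analyzer for id1 := id2 + id3 * 60
tree = Node("+", [Leaf("id2"),
                  Node("*", [Leaf("id3"),
                             Node("int_to_real", [Leaf("60")])])])
for line in generate("id1", tree):
    print(line)
# temp1 := int_to_real(60)
# temp2 := id3 * temp1
# temp3 := id2 + temp2
# id1 := temp3
```

The output matches the Intermediate Code Generator listing; the Code Optimizer then folds the conversion into the constant 60.0 and removes the redundant temporaries.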
Lexical Analysis

A token is a category in which a lexeme can be classified. The token, or category, can be defined by a Regular Language and expressed as a Regular Expression.

Lexical Analysis is the act of breaking down source code into a sequence of lexemes. Each lexeme is found by matching sequential characters to a token.

Lexical Analysis can be performed with pattern matching through the use of Regular Expressions. Therefore, a Lexical Analyzer can be defined and represented as a DFA.

Even though the source language as a whole is likely Context Free, the Lexical Analyzer only identifies strings and sends them to a Syntax Analyzer for parsing.

The Lexical Analyzer will remove white space and comments, and identify ill-formed strings or invalid characters.

Reasons to separate Lexical Analysis from Syntax Analysis:
1. Simplicity: Techniques for Lexical Analysis can be simpler than those required for Syntax Analysis (DFA vs. PDA). Separation also simplifies the Syntax Analyzer.
2. Efficiency: Separation into different modules makes it easier to perform simplifications and optimizations unique to the different paradigms.
3. Portability: Due to input/output and character set variations, Lexical Analyzers are not always machine independent. Separation confines these machine-dependent details to one module.

Errors often detected in a Lexical Analyzer:
1. Numeric literals that are too long.
2. Identifiers that are too long (often a warning is given).
3. Ill-formed numeric literals.
4. Input characters that are not in the source language.

Input Buffering

Moving input data into local memory can increase performance.

Double buffering: Use a sentinel, "@", to identify the end of a buffer. If the end of a buffer is found, advance to the next buffer and refill the prior buffer.

With the sentinel, only one check is made for each character read; a second check, to determine which buffer needs to be refilled, is made only when the sentinel is found. Otherwise it would be necessary to check for the end of either buffer on every character read.
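The double-buffering scheme described above can be sketched as follows. This is a hedged Python illustration, not a reference implementation: the class name `DoubleBuffer`, its `size` parameter, and the `eof` bookkeeping are all assumptions made for the example. It also assumes that a sentinel found anywhere other than the end of a buffer or the end of the input is an ordinary "@" from the source text, which is simply handed to the caller.

```python
import io

SENTINEL = "@"


class DoubleBuffer:
    def __init__(self, stream, size=4):
        self.stream = stream
        self.size = size
        # Each half holds `size` characters plus one slot for its sentinel.
        self.buf = [SENTINEL] * (2 * (size + 1))
        self.eof = [None, None]  # where the input ended inside each half, if it did
        self.forward = 0
        self._fill(0)
        self._fill(1)

    def _fill(self, half):
        """Load the next block of input into one half and terminate it."""
        start = half * (self.size + 1)
        data = self.stream.read(self.size)
        self.buf[start:start + len(data)] = data
        self.buf[start + len(data)] = SENTINEL
        self.eof[half] = start + len(data) if len(data) < self.size else None

    def next_char(self):
        """Return the next input character, or None at end of input."""
        while True:
            ch = self.buf[self.forward]
            if ch != SENTINEL:           # the only test on the common path
                self.forward += 1
                return ch
            if self.forward in self.eof:  # sentinel marks the end of the input
                return None
            if self.forward == self.size:             # end of first half
                self.forward = self.size + 1
                self._fill(0)                         # refill the prior buffer
            elif self.forward == 2 * self.size + 1:   # end of second half
                self.forward = 0
                self._fill(1)
            else:                         # a real "@" inside the source text
                self.forward += 1
                return ch


buf = DoubleBuffer(io.StringIO("a<=b@c;x12"), size=4)
print("".join(iter(buf.next_char, None)))  # a<=b@c;x12
```

The tiny buffer size is chosen only so that the example crosses several buffer boundaries; a real analyzer would use a block size matched to the disk or operating system.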
Also consider the case in which the sentinel is an invalid character in the source code: the analyzer must tell a sentinel that marks the end of a buffer apart from the same character occurring in the input.

Buffering allows for easy look-ahead of characters. Example: < vs. <=. After < is read, looking ahead at the next character allows a less-than token to be identified before that character is taken from the buffer.

Example of a Lexical Analyzer

Build a Lexical Analyzer which identifies the following tokens:
1. digits
2. digits E [sign] digits
3. digits.digits [ E [sign] digits ]

"sign" refers to + or -
"E" refers to an exponent, a power of 10
[ ] refers to 0 or 1 of contents
{ } refers to 0 or more of contents
ε = λ (both denote the empty string)

The above tokens can be accepted by a DFA; its states and transitions are given in the transition table below.

Approaches to constructing a Lexical Analyzer
1. Use a table representation and a lookup function.
2. Write an if/else block of code for each state.
3. Use a software tool to generate a table-driven analyzer from regular expressions.

Transition table and token array for the DFA:

    State   digit   .     E     +,-   other   Token array
    S       q1      er    er    er    er      0
    q1      q1      q2    q3    S     S       int
    q2      q4      er    er    er    er      0
    q3      q5      er    er    q6    er      0
    q4      q4      S     q3    S     S       float
    q5      q5      S     S     S     S       float
    q6      q5      er    er    er    er      0

An entry of er is an error. An entry of S in the row of q1, q4, or q5 ends the current lexeme and returns to the start state; the token array gives the token recognized in that state (0 means none).

The String Table

The String Table is a data structure for unique lexemes. The String Table can be queried for a lexeme by its numeric code. Lexemes are generally inserted in the String Table by the Lexical Analyzer.

The String Table can be used for:
1. Error messages
2. Memory map listings
3. Intermodule linkages

    Index   String   Token
    0       fptr     [ident]
    1       number   [ident]
    2       5.1      [f_lit]
    3

It is not recommended to use the String Table for reserved words. Use a separate structure to determine whether an identifier is a reserved word, and then assign a token value accordingly. If it is not a reserved word, insert it in the String Table if necessary.

Do not confuse the String Table with the Symbol Table. The String Table is for unique spellings of lexemes, allowing index comparisons instead of string comparisons during syntax analysis and code generation. The Symbol Table is a dynamic data structure used differently than the String Table.
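The first approach (a table representation with a lookup function) and the insertion of lexemes into a String Table can be sketched together. This is a minimal Python sketch under stated assumptions: it encodes the transition table and token array from the example, treats the fifth column as "any other character or end of input", and uses the hypothetical names `column`, `scan_number`, `StringTable`, and `scan_all`, none of which come from the notes.

```python
ER = "er"  # error entry

# Rows: state -> next state for columns digit, ".", "E", sign, other.
TABLE = {
    "S":  ["q1", ER,   ER,   ER,   ER],
    "q1": ["q1", "q2", "q3", "S",  "S"],
    "q2": ["q4", ER,   ER,   ER,   ER],
    "q3": ["q5", ER,   ER,   "q6", ER],
    "q4": ["q4", "S",  "q3", "S",  "S"],
    "q5": ["q5", "S",  "S",  "S",  "S"],
    "q6": ["q5", ER,   ER,   ER,   ER],
}
TOKEN = {"S": 0, "q1": "int", "q2": 0, "q3": 0,
         "q4": "float", "q5": "float", "q6": 0}


def column(ch):
    """Map a character (or None for end of input) to a column of TABLE."""
    if ch is None:
        return 4
    if ch in "0123456789":
        return 0
    if ch == ".":
        return 1
    if ch == "E":
        return 2
    if ch in "+-":
        return 3
    return 4


def scan_number(text, pos):
    """Run the DFA from `pos`; return (token, lexeme, next_pos)."""
    state, start = "S", pos
    while True:
        ch = text[pos] if pos < len(text) else None
        nxt = TABLE[state][column(ch)]
        if nxt == ER:
            raise ValueError(f"ill-formed numeric literal at position {pos}")
        if nxt == "S":  # token complete; the look-ahead character is not consumed
            return TOKEN[state], text[start:pos], pos
        state, pos = nxt, pos + 1


class StringTable:
    """Stores each distinct lexeme once and returns a numeric code for it."""

    def __init__(self):
        self.entries = []  # code -> (lexeme, token)
        self.index = {}    # lexeme -> code

    def insert(self, lexeme, token):
        if lexeme not in self.index:
            self.index[lexeme] = len(self.entries)
            self.entries.append((lexeme, token))
        return self.index[lexeme]

    def lexeme(self, code):
        return self.entries[code][0]


def scan_all(text, table):
    """Scan whitespace-separated numeric literals into (token, code) pairs."""
    result, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        token, lexeme, pos = scan_number(text, pos)
        result.append((token, table.insert(lexeme, token)))
    return result


strings = StringTable()
print(scan_all("5.1 42 5.1 3E+2", strings))
# [('float', 0), ('int', 1), ('float', 0), ('float', 2)]
```

Both occurrences of 5.1 receive the same code, so later phases can compare the integer codes instead of comparing strings. Numeric literals are used here only because they are what the example DFA accepts; identifiers would be inserted the same way once the reserved-word check has been made.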