UNIT – I

1. Programs related to Compilers

Preprocessor: All the preprocessor commands written in a high-level language are processed by the preprocessor before the compiler takes over.
Example: "#define MAX_ROWS 10" together with "#include <stdio.h>". The preprocessor finds all the places where MAX_ROWS is used and replaces it with 10 in the files of the project.

Compiler: This software converts the code written in a high-level language into an object file. The compiler converts all the files of a given project at once. The translation process should also report the presence of errors in the source program.

Source Program → Compiler → Target Program
                     ↓
               Error Messages

There are two parts to compilation. The analysis part breaks up the source program into its constituent pieces and creates an intermediate representation of the source program. The synthesis part constructs the desired target program from the intermediate representation.

Assembler: It converts only low-level assembly language into machine code. Assembly language is extremely processor (microprocessor/platform) specific.

Linker: The linker takes the object files created by the compiler and combines them with predefined library objects to create an executable.

Loader: It is the part of an operating system that is responsible for loading programs, one of the essential stages in the process of starting a program. Loading a program involves reading the contents of the executable file (the file containing the program text) into memory, and then carrying out other required preparatory tasks to prepare the executable for running. Once loading is complete, the operating system starts the program by passing control to the loaded program code.

Interpreter: This is a software tool which interprets the user-written code line by line, unlike a compiler, which processes everything at once. In this case a single line is executed at a time, which makes interpretation time-consuming.

Rational Preprocessors: These processors augment older languages with more modern flow-of-control and data-structuring facilities. For example, such a preprocessor might provide the user with built-in macros for constructs like while-statements or if-statements, where none exist in the programming language itself.

Language extensions: These processors attempt to add capabilities to the language by what amounts to built-in macros. For example, the language Equel is a database query language embedded in C. Statements beginning with ## are taken by the preprocessor to be database-access statements unrelated to C and are translated into procedure calls on routines that perform the database access.

The behavior of the compiler with respect to extensions is declared with the #extension directive:
#extension extension_name : behavior
#extension all : behavior
extension_name is the name of an extension. The token all means that the specified behavior should apply to all extensions supported by the compiler.
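As a small illustration of the MAX_ROWS example above (the file contents are purely illustrative), the fragment below shows what the preprocessor does before the compiler ever sees the code; with gcc the preprocessed output can be inspected using gcc -E.

/* Before preprocessing: */
#include <stdio.h>
#define MAX_ROWS 10

int table[MAX_ROWS];

int main(void) {
    for (int i = 0; i < MAX_ROWS; i++)    /* every MAX_ROWS is textually replaced */
        table[i] = i;
    printf("%d\n", table[MAX_ROWS - 1]);
    return 0;
}

/* After preprocessing, each occurrence of MAX_ROWS has become 10, e.g.
   int table[10]; and for (int i = 0; i < 10; i++) ...; the #include line
   has been replaced by the contents of stdio.h. */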
2. Translation Process

Phases of Compiler
The compiler has a number of phases plus a symbol table manager and an error handler. The front end includes all the analysis phases and the intermediate code generator. The back end includes the code optimization phase and the final code generation phase. The front end analyzes the source program and produces intermediate code, while the back end synthesizes the target program from the intermediate code.

1. The lexical analyzer takes the source program as input and produces a long string of tokens.
2. The syntax analyzer takes the output of the lexical analyzer and produces a large tree.
3. The semantic analyzer takes the output of the syntax analyzer and produces another tree.
4. Similarly, the intermediate code generator takes the tree produced by the semantic analyzer as input and produces intermediate code.
5. Code optimization tries to transform the code in such a way that it runs in less time and, if possible, also occupies less space.
   Note: if code optimization is done before code generation it consists mostly of machine-independent optimizations; if done after code generation it is machine-specific optimization, tailored to the capability (instruction set) of the given machine.
6. Code generation produces the machine code.

Input Source Program
        ↓
Lexical Analyzer
        ↓ (tokens)
Syntax Analyzer                 Symbol Table Manager
        ↓ (syntax tree)         Error Handler
Semantic Analyzer
        ↓
Intermediate Code Generator
        ↓
Code Optimizer
        ↓
Code Generator
        ↓
Output Target Program

In detail, the working of each phase, with an example, is as follows.

Scanner or lexical analysis
The scanner begins the analysis of the source program by:
- reading the file character by character
- grouping characters into tokens
- eliminating unneeded information (comments and white space)
- entering preliminary information into literal or symbol tables
Tokens represent basic program entities such as identifiers, literals, reserved words, operators, delimiters, etc.
Example: a := x + y * 2.5; is scanned as
a      identifier
:=     assignment operator
x      identifier
+      plus operator
y      identifier
*      multiplication operator
2.5    real literal
;      semicolon

Parser or syntax analysis
- Receives tokens from the scanner
- Recognizes the structure of the program as a parse tree
- The parse tree is recognized according to a context-free grammar
- Syntax errors are reported if the program is syntactically incorrect
- A parse tree is an inefficient way to represent the structure of a program
- A syntax tree is a more condensed version of the parse tree
- A syntax tree is usually generated as output by the parser
Statement: a := x + y * 2.5;

Semantic Analyzer
- The semantics of a program are its meaning, as opposed to its syntax or structure
- The semantics consist of:
  - runtime semantics – the behavior of the program at runtime
  - static semantics – checked by the compiler
- Static semantics include:
  - declaration of variables and constants before use
  - calling functions that exist (predefined in a library or defined by the user)
  - passing parameters properly
  - type checking
- Static semantics are difficult to check by the parser
- The semantic analyzer therefore does the following:
  - checks the static semantics of the language
  - annotates the syntax tree with type information
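A minimal C sketch (the names Node, NodeKind and Type are illustrative, not from any particular compiler) of how the tree built by these phases might be represented: each node records an operator or operand together with the type information that the semantic analyzer fills in, here shown for the statement a := x + y * 2.5.

#include <stdio.h>

typedef enum { TYPE_UNKNOWN, TYPE_INT, TYPE_REAL } Type;
typedef enum { NODE_ID, NODE_NUM, NODE_ASSIGN, NODE_ADD, NODE_MUL } NodeKind;

typedef struct Node {
    NodeKind kind;          /* what the node represents            */
    const char *name;       /* identifier name, if kind == NODE_ID */
    double value;           /* literal value, if kind == NODE_NUM  */
    Type type;              /* filled in by the semantic analyzer  */
    struct Node *left, *right;
} Node;

int main(void) {
    /* Building the syntax tree for  a := x + y * 2.5  by hand, for illustration. */
    Node num  = { NODE_NUM, NULL, 2.5, TYPE_REAL, NULL, NULL };
    Node y    = { NODE_ID, "y", 0, TYPE_REAL, NULL, NULL };
    Node mul  = { NODE_MUL, NULL, 0, TYPE_REAL, &y, &num };
    Node x    = { NODE_ID, "x", 0, TYPE_REAL, NULL, NULL };
    Node add  = { NODE_ADD, NULL, 0, TYPE_REAL, &x, &mul };
    Node a    = { NODE_ID, "a", 0, TYPE_REAL, NULL, NULL };
    Node asgn = { NODE_ASSIGN, NULL, 0, TYPE_REAL, &a, &add };
    printf("root kind = %d, annotated type = %d\n", asgn.kind, asgn.type);
    return 0;
}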
Intermediate Code Generator
- Comes after syntax and semantic analysis
- Separates the compiler front end from its back end
- The intermediate representation should have two important properties:
  - it should be easy to produce
  - it should be easy to translate into the target program
- The intermediate representation can have a variety of forms: three-address code, P-code for an abstract machine, or a tree or DAG representation

Code Improvement or Optimization
- Code improvement techniques can be applied to:
  - intermediate code – independent of the target machine
  - target code – dependent on the target machine
- Intermediate code improvements include:
  - constant folding
  - elimination of common sub-expressions
  - identification and elimination of unreachable code (called dead code)
  - improving loops
  - improving function calls
- Target code improvements include:
  - allocation and use of registers
  - selection of better (faster) instructions and addressing modes

Code Generator
- Generates code for the target machine, typically assembly code or relocatable machine code
- Properties of the target machine become a major factor
- The code generator selects appropriate machine instructions
- Allocates memory locations for variables
- Allocates registers for intermediate computations

3. Major Data Structures in a Compiler

Token
- Represented by an integer value or an enumeration literal
- Sometimes it is necessary to preserve the string of characters that was scanned, for example the name of an identifier or the value of a literal

Syntax Tree
- Constructed as a pointer-based structure
- Dynamically allocated as parsing proceeds
- Nodes have fields containing information collected by the parser and the semantic analyzer

Symbol Table
- Keeps information associated with all kinds of identifiers: constants, variables, functions, parameters, types, fields, etc.
- Identifiers are entered by the scanner, parser, or semantic analyzer
- The semantic analyzer adds type information and other attributes
- The code generation and optimization phases use the information in the symbol table
- Insertion, deletion, and search operations need to be efficient because they are frequent
- A hash table with constant-time operations is usually the preferred choice
- More than one symbol table may be used

Literal Table
- Stores constant values and string literals in a program
- One literal table applies globally to the entire program
- Used by the code generator to assign addresses for literals and to enter data definitions in the target code file
- Avoids the replication of constants and strings
- Quick insertion and lookup are essential; deletion is not necessary

Temporary Files
- Used historically by old compilers due to memory constraints
- Hold the data of various stages

4. Tokens, Lexemes, Patterns

Token
A lexical token is a sequence of characters with a collective meaning, i.e. one that can be treated as a unit in the grammar of the programming language.
Examples of tokens: type tokens (id, num, real, ...), punctuation tokens (IF, void, return, ...), alphabetic tokens (keywords).
Examples of non-tokens: comments, preprocessor directives, macros, blanks, tabs, newlines, ...
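A minimal sketch of the token representation described in section 3 above (the names TokenKind, Token and make_token are illustrative, not from any particular compiler): the token class is an enumeration literal, and the scanned string is preserved only where it is needed, e.g. for identifiers and literals.

#include <string.h>

typedef enum { TK_ID, TK_NUM, TK_IF, TK_ASSIGN, TK_PLUS, TK_STAR, TK_SEMI } TokenKind;

typedef struct {
    TokenKind kind;    /* the token class, an enumeration literal          */
    char lexeme[64];   /* kept only when the scanned text matters, e.g. an */
                       /* identifier name or the characters of a literal   */
} Token;

/* Example: the scanner would build { TK_ID, "rate" } for the input "rate". */
Token make_token(TokenKind kind, const char *text) {
    Token t;
    t.kind = kind;
    strncpy(t.lexeme, text, sizeof t.lexeme - 1);
    t.lexeme[sizeof t.lexeme - 1] = '\0';
    return t;
}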
Patterns
There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. Regular expressions are an important notation for specifying patterns. For example, the pattern for the Pascal identifier token id is:
id → letter (letter | digit)*

Lexeme
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. For example, the pattern for the RELOP token contains six lexemes (=, <>, <, <=, >, >=), so the lexical analyzer should return a RELOP token to the parser whenever it sees any one of the six.

Specification of Tokens
An alphabet or character class is a finite set of symbols. Typical examples of symbols are letters and characters. The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer alphabets.

Strings
A string over some alphabet is a finite sequence of symbols taken from that alphabet. For example, banana is a sequence of six symbols (i.e., a string of length six) taken from the ASCII computer alphabet. The empty string, denoted by ε, is a special string with zero symbols (i.e., its length is 0).
If x and y are two strings, then the concatenation of x and y, written xy, is the string formed by appending y to x. For example, if x = dog and y = house, then xy = doghouse. For the empty string ε we have Sε = εS = S.
String exponentiation concatenates a string with itself a given number of times:
S2 = SS or S.S
S3 = SSS or S.S.S
S4 = SSSS or S.S.S.S
and so on. By definition, S0 is the empty string ε, and S1 = S. For example, if x = ba and y = na, then xy2 = banana.

Languages
A language is a set of strings over some fixed alphabet. The language may contain a finite or an infinite number of strings. Let L and M be two languages, where L = {dog, ba, na} and M = {house, ba}. Then
Union: L ∪ M = {dog, ba, na, house}
Concatenation: LM = {doghouse, dogba, bahouse, baba, nahouse, naba}
Exponentiation: L2 = LL
By definition: L0 = {ε} and L1 = L.
The Kleene closure of a language L, denoted by L*, is "zero or more concatenations of" L:
L* = L0 ∪ L1 ∪ L2 ∪ L3 ∪ ... ∪ Ln ∪ ...
For example, if L = {a, b}, then L* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, baa, ...}.
The positive closure of a language L, denoted by L+, is "one or more concatenations of" L:
L+ = L1 ∪ L2 ∪ L3 ∪ ... ∪ Ln ∪ ...
For example, if L = {a, b}, then L+ = {a, b, aa, ab, ba, bb, aaa, aba, ...}.

Lexical analysis or scanning is the process in which the stream of characters making up the source program is read from left to right and grouped into tokens. Tokens are sequences of characters with a collective meaning. There are usually only a small number of token classes for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.

The Role of the Lexical Analyzer
- Read input characters
- Group them into lexemes
- Produce as output a sequence of tokens, the input for the syntax analyzer
- Interact with the symbol table (insert identifiers)
- Strip out comments, whitespace (blank, newline, tab, ...) and other separators
- Correlate error messages generated by the compiler with the source program: keep track of the number of newlines seen, so that a line number can be associated with each error message
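A minimal C sketch of the scanning behaviour just described (next_identifier and the fixed-size buffer are illustrative, not a complete scanner): it strips whitespace, counts newlines so that errors can be reported with a line number, and groups characters matching letter (letter | digit)* into an identifier lexeme.

#include <ctype.h>
#include <stdio.h>

static int line_no = 1;   /* kept so error messages can cite a line number */

/* Reads the next identifier from fp into buf (capacity cap).
   Returns 1 if an identifier was found, 0 at end of file.     */
int next_identifier(FILE *fp, char *buf, int cap) {
    int c = fgetc(fp);
    while (c != EOF && !isalpha(c)) {            /* skip whitespace and other separators */
        if (c == '\n') line_no++;
        c = fgetc(fp);
    }
    if (c == EOF) return 0;
    int n = 0;
    while (c != EOF && (isalpha(c) || isdigit(c))) {   /* letter (letter | digit)* */
        if (n < cap - 1) buf[n++] = (char)c;
        c = fgetc(fp);
    }
    if (c != EOF) ungetc(c, fp);                 /* push back the one-character lookahead */
    buf[n] = '\0';
    return 1;
}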
Lexical Analysis vs. Parsing
Simplicity of design
- Separating lexical analysis from syntactic analysis simplifies at least one of the two tasks; e.g., a parser that had to deal with white space and comments directly would be considerably more complex.
- It also gives a cleaner overall language design.
Improved compiler efficiency
- We have the liberty to apply specialized techniques that serve only the lexical task, not the whole of parsing.
- Reading of input characters can be sped up using specialized buffering techniques.
Enhanced compiler portability
- Input-device peculiarities are restricted to the lexical analyzer.

Lexical Errors
All types of errors cannot be detected by the lexical analyzer alone.
Example: fi(a == f(x)) ... (the lexical analyzer thinks that fi may be a user-defined function, since fi is a valid function name.)
The lexical analyzer is unable to proceed if none of the patterns matches any prefix of the remaining input. In the "panic mode" recovery strategy the lexical analyzer does the following things:
- deletes one or more successive characters from the remaining input
- inserts a missing character into the remaining input
- replaces a character by another character
- transposes two adjacent characters
(Note: in all the above cases it has to report to the user the changes it has made.)

5. Input Buffering
The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice each buffering scheme adopts one convention: either a pointer is at the symbol last read or at the symbol it is ready to read.

[Figure: the two-halves input buffer, showing the token-beginning pointer and the lookahead pointer]

The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, ..., ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file.

Since the buffer shown in the figure above is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead traveled into the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.

Sentinels
Each advance of the forward (lookahead) pointer requires two tests:
- to test whether it is at the end of a buffer half
- to determine what character has been read (a multiway branch)
A sentinel
- is added at each buffer end
- cannot be part of the source program
- the character eof is a natural choice
The sentinel retains its role as the marker of the end of the entire input: when eof appears anywhere other than at the end of a buffer half, it means that the input is at an end. (A small sketch of this test is given at the end of the unit, after the LEX reference.)

6. LEX
http://dinosaur.compilertools.net/lex/
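As promised in the Sentinels discussion of section 5, the following C fragment is a minimal sketch of the forward-pointer advance (the buffer size N, the fill routine, and the use of '\0' in place of the eof character as the sentinel are all illustrative assumptions, not a real scanner's buffering code).

#include <stdio.h>

#define N 4096                     /* size of each buffer half (illustrative)  */
#define SENTINEL '\0'              /* stands in for the eof sentinel character */

static char buf[2 * N + 2];        /* two halves, each followed by a sentinel  */
static char *forward;              /* the lookahead (forward) pointer          */

/* Read up to N source characters into one half and plant the sentinel. */
static void fill_half(char *half, FILE *src) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

/* Advance the forward pointer by one character, reloading a buffer half when
   the sentinel at the end of that half is reached.  Returns SENTINEL only at
   the true end of the input.                                                  */
static char advance(FILE *src) {
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward == buf + N + 1)              /* sentinel ending first half  */
            fill_half(buf + N + 1, src);         /* reload the second half      */
        else if (forward == buf + 2 * N + 2) {   /* sentinel ending second half */
            fill_half(buf, src);                 /* reload the first half       */
            forward = buf;
        } else
            return SENTINEL;                     /* eof inside a half: all done */
        c = *forward++;                          /* re-read after the reload    */
    }
    return c;
}

int main(void) {
    fill_half(buf, stdin);                       /* prime the first half        */
    forward = buf;
    for (char c = advance(stdin); c != SENTINEL; c = advance(stdin))
        putchar(c);                              /* echo only; token grouping omitted */
    return 0;
}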