Unit-1 Language Processors and Compilers

Introduction to Language Processing

Language processing activities arise due to the differences between the manner in which a software designer (or user) describes the ideas concerning the behavior of software and the manner in which these ideas are implemented in a computer system. The designer expresses the ideas in terms related to the application domain of the software. To implement these ideas, their description has to be interpreted in terms related to the execution domain of the computer system.

We use the term semantics to represent the rules of meaning of a domain, and the term semantic gap to represent the difference between the semantics of two domains. The semantic gap has the following consequences (negative effects):
1. Large development times
2. Large development efforts
3. Poor quality of software.

These issues are tackled by the Software Engineering (SE) field through the use of methodologies and programming languages (PLs). The SE steps aimed at the use of a PL can be grouped into:
1. Specification, design and coding steps
2. PL implementation steps.

Software implementation using a PL introduces a new domain, the PL domain. The first group of steps bridges the gap between the application domain and the PL domain. The second group bridges the gap between the PL domain and the execution domain.

The advantage of introducing the PL domain is that the gap to be bridged by the software designer is now between the application domain and the PL domain, rather than between the application domain and the execution domain. The gap between the PL domain and the execution domain is bridged by a Language Processor (LPr). This reduces the severity of the consequences of the semantic gap mentioned earlier. Further, apart from bridging the gap between the PL domain and the execution domain, the language processor provides a diagnostic capability which detects and indicates errors in its input. This helps in improving the quality of the software.
We refer to the gap between the application domain and the PL domain as the specification-and-design gap, or simply the specification gap, and to the gap between the PL domain and the execution domain as the execution gap. The specification gap is bridged by the software development team. The execution gap is bridged by the designer of the programming language processor (e.g., a translator or an interpreter).

We define the terms specification gap and execution gap as follows:
Specification gap is the semantic gap between two specifications of the same task.
Execution gap is the gap between the semantics of programs (that perform the same task) written in different programming languages.

Each domain has its own specification language (SL). A specification written in an SL is a program in that SL. The specification language of the PL domain is the PL itself. The specification language of the execution domain is the machine language of the computer system.

Language Processors

A language processor is software which bridges a specification gap or an execution gap. We use the term language processing to describe the activity performed by a language processor, and we assume a diagnostic capability to be an implicit part of any form of language processing. We refer to the program that forms the input to a language processor as the Source Program (SP) and to its output as the Target Program (TP). The languages in which these programs are written are called the Source Language (SL) and the Target Language (TL), respectively. A language processor typically abandons generation of the target program if it detects errors in the source program.

A spectrum of language processors is defined to meet practical requirements:
1. A language translator bridges an execution gap to the machine language (or assembly language) of a computer system. An assembler is a language translator whose source language is assembly language. A compiler is any language translator which is not an assembler.
2. A de-translator bridges the same execution gap as the language translator, but in the reverse direction.
3. A preprocessor is a language processor which bridges an execution gap but is not a language translator.
4. A language migrator bridges the specification gap between two PLs.

Interpreters

An interpreter is a language processor which bridges an execution gap without generating a machine language program. The absence of a target program implies the absence of an output interface of the interpreter. Thus the language processing activities of an interpreter cannot be separated from its program execution activities. Hence we say that an interpreter 'executes' a program written in a PL. In essence, the execution gap vanishes totally.

Problem-Oriented and Procedure-Oriented Languages

The three consequences of the semantic gap mentioned at the start of this section are in fact the consequences of a specification gap. Software systems are poor in quality and require large amounts of time and effort to develop because of the difficulties in bridging the specification gap.

A classical solution is to develop a PL such that the PL domain is very close or identical to the application domain. PL features then directly model aspects of the application domain, which leads to a very small specification gap. Such PLs can only be used for specific applications; hence they are called problem-oriented languages. They have large execution gaps. However, this is acceptable because the gap is bridged by the translator or interpreter and does not concern the software designer.

A procedure-oriented language provides general-purpose facilities required in most application domains. Such a language is independent of specific application domains and results in a large specification gap, which has to be bridged by an application designer.
Language Processing Activities

The fundamental language processing activities can be divided into two categories: those that bridge the specification gap and those that bridge the execution gap. We name these activities as follows:
1. Program generation activities
2. Program execution activities.

A program generation activity aims at automatic generation of a program. The source language is a specification language of an application domain and the target language is typically a procedure-oriented PL.

A program execution activity organizes the execution of a program written in a PL on a computer system. Its source language could be a procedure-oriented language or a problem-oriented language. The target language is machine language (or assembly language).

1. Program Generation

The following figure shows the program generation activity. The program generator is software which accepts the specification of a program to be generated and generates a program in the target PL. In effect, the program generator introduces a new domain between the application domain and the PL domain. We call this domain the program generator domain. The specification gap is now the gap between the application domain and the program generator domain. This gap is smaller than the gap between the application domain and the target PL domain. See Figure below.

Reduction in the specification gap increases the reliability of the generated program. Since the generator domain is close to the application domain, it is easy for the designer or programmer to write the specification of the program to be generated.

Example

A screen handling program (also called a form filling program) handles screens in a data entry environment. It displays the field headings and default values for the various fields on the screen and accepts data values for the fields. The following figure shows a screen for data entry of employee information.
A data entry operator can move the cursor to a field and key in its value. The screen handling program accepts the value and stores it in a database.

A screen generator (one type of program generator) can generate screen handling programs automatically. It accepts a specification of the screen to be generated (we will call it the screen specification) and generates a program (code) that performs the desired screen handling. The specification for (some of the) fields could be as follows:

Errors in the specification (e.g., invalid start or end positions, or conflicting specifications for a field) are detected by the generator. The generated screen handling program can validate the data during data entry, if such checks are mentioned in the specification. For example, the age field must contain only digits and its value must be between 18 and 58, or the gender field must contain only M or F.

2. Program Execution

Two popular models for program execution are translation and interpretation.

Program translation

The program translation model bridges the execution gap by translating a program written in a PL, called the source program (SP), into an equivalent program in the machine language or assembly language of the computer system, called the target program (TP).

Characteristics of the program translation model are:
A program must be translated before it can be executed.
The translated program may be saved in a file. The saved program may be executed repeatedly.
A program must be retranslated after each modification of the source program.

Program interpretation

The figure given above shows a schematic of program interpretation. The interpreter reads the source program and stores it in memory. During interpretation it takes the source statements one at a time, determines the meaning of each statement and performs the actions which implement it. This includes computational and input-output actions.
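As a rough illustration of the interpretation model just described, the following Python sketch stores a source program in memory, takes one statement at a time, determines its meaning, and performs the corresponding computation or output. The tiny statement language (`name = value`, `name = a + b`, `name = a * b`, `print name`) and the function names `interpret` and `value_of` are assumptions made up for this sketch; they are not part of the notes.

```python
def value_of(word, variables):
    """Return the value of an operand: an integer constant or a known variable."""
    if word.isdigit():
        return int(word)
    if word in variables:
        return variables[word]
    raise NameError(f"undefined variable {word!r}")


def interpret(source, output=print):
    """Interpret a program in a hypothetical toy language, statement by statement."""
    # The source program is read and stored in memory in its source form.
    statements = [line.strip() for line in source.splitlines() if line.strip()]
    variables = {}
    next_index = 0  # which statement is to be interpreted next
    while next_index < len(statements):
        statement = statements[next_index]  # take the next statement
        words = statement.split()           # determine its meaning
        if words[0] == "print" and len(words) == 2:
            output(value_of(words[1], variables))  # an input-output action
        elif len(words) == 5 and words[1] == "=" and words[3] in ("+", "*"):
            left = value_of(words[2], variables)
            right = value_of(words[4], variables)
            # a computational action
            variables[words[0]] = left + right if words[3] == "+" else left * right
        elif len(words) == 3 and words[1] == "=":
            variables[words[0]] = value_of(words[2], variables)
        else:
            raise SyntaxError(f"statement {next_index + 1}: cannot interpret {statement!r}")
        next_index += 1
    return variables


if __name__ == "__main__":
    interpret("profit = 30\nscaled = profit * 100\nprint scaled")
```

Note that no target program is produced at any point: each statement is analyzed again every time it is reached, which is the main trade-off against the translation model.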
The process of program interpretation is similar to the fetch-decode-execute cycle that a CPU uses to execute machine-level instructions. The CPU uses a Program Counter (PC) to note the address of the next instruction to be executed. In the same way, an interpreter can use a PC to indicate which statement of the source program is to be interpreted next. This statement is subjected to the interpretation cycle, which could consist of the following steps:
Fetch the statement.
Analyze the statement and determine its meaning, i.e., the computation (or I/O) to be performed and the operands required.
Execute the meaning of the statement.

The following are characteristics of interpretation:
The source program is retained in the source form itself, i.e., no target program form exists.
A statement is analyzed during its interpretation.

Analysis and Synthesis Model in Language Processing OR Analysis and Synthesis Model of Compilation OR Two Parts of Compilation

Language Processing = Analysis of SP + Synthesis of TP

The two parts of compilation are analysis and synthesis.

Analysis Phase
• The analysis part breaks up the source program into pieces and creates an Intermediate Representation (IR) of the source program.
• The input is the SP, and the output is the IR.
• The source program consists of statements written in a PL.
• Analysis is done on the basis of:
– Lexical rules: these govern the formation of valid lexical units (tokens) in the Source Language (SL).
– Syntax rules: these govern the formation of valid statements in the SL.
– Semantic rules: these associate meaning with valid statements of the language.
• Thus, analysis consists of lexical, syntax and semantic analysis.

Synthesis Phase
• The synthesis part constructs the desired Target Program (TP) from the IR.
• The target program consists of statements which have the same meaning as the corresponding statements in the source program.
• The input is the IR, and the output is the TP.
• Synthesis requires the most specialized techniques.
• Synthesis involves:
– Creation of data structures in the target program (memory allocation).
– Generation of the target program (code generation).

An Example of Language Processing

Consider the following statement:

    perc_profit = (profit*100) / cost_price;

• Lexical analysis
– Identifies =, * and / as operators, 100 as a constant and the remaining strings as identifiers.
• Syntax analysis
– Identifies the statement as an assignment statement with perc_profit as the LHS and (profit*100) / cost_price as the RHS expression.
– In an assignment statement, the RHS expression is evaluated and the result is assigned to the variable on the LHS.
• Semantic analysis
– Determines the meaning of the statement to be the assignment of (profit*100) / cost_price to perc_profit.
• Synthesis part of the language processor
– Generates the following assembly language statements:

    MOVER AREG, PROFIT
    MULT AREG, 100
    DIV AREG, COST_PRICE
    MOVEM AREG, PERC_PROFIT
    …
    PERC_PROFIT DW 1
    PROFIT DW 1
    COST_PRICE DW 1

MOVER moves a value from memory to a CPU register. MOVEM moves a value from a CPU register to memory. DW reserves one word in memory (similar to creating a variable in a PL). MULT and DIV perform multiplication and division, respectively, between their two operands and store the result in the first operand (here, the register AREG).

Compilers

A compiler is a program that reads a program written in one language, the source language, and translates it into an equivalent program in another language, the target language. As an important part of the translation process, the compiler reports to its user the presence of errors in the source program.

Figure: Source Program (SP) → Compiler → Target Program (TP), with Error Messages reported by the compiler.

The first few compilers were developed in the early 1950s. Initially, it was very difficult to develop a compiler (e.g., the first FORTRAN compiler took 18 staff-years to implement). Now, with modern programming environments and compiler writing tools, it is relatively easy, and a compiler can be completed as a student project.
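To make the lexical analysis step of the earlier example concrete, the following Python sketch groups the characters of perc_profit = (profit*100) / cost_price; into tokens and classifies them. It is only a minimal sketch: the token class names, the regular expressions, and the separate PUNCTUATION class for parentheses and the semicolon are assumptions of this example, not rules taken from any particular language definition.

```python
import re

# Hypothetical lexical rules: (token class, regular expression).
TOKEN_SPEC = [
    ("CONSTANT", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("OPERATOR", r"[=*/+\-]"),
    ("PUNCTUATION", r"[();]"),
    ("SKIP", r"\s+"),  # white space separates tokens but is not a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Scan text from left to right and return a list of (class, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            # The remaining characters do not form any token: a lexical error.
            raise ValueError(f"lexical error at position {pos}: {text[pos]!r}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for kind, lexeme in tokenize("perc_profit = (profit*100) / cost_price;"):
        print(f"{kind:12} {lexeme}")
```

With these assumed rules, =, * and / come out as operators, 100 as a constant, and perc_profit, profit and cost_price as identifiers, which matches the classification given in the example.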
The Phases of a Compiler

The following are the phases of a compiler:
Analysis Phase
o Lexical Analysis
o Syntax Analysis
o Semantic Analysis
Synthesis Phase
o Memory Allocation
o Code Optimization
o Code Generation

Figure: Phases of a Compiler (SP → Front-end (Analysis) → IR → Back-end (Synthesis) → TP)

Analysis Phase

The analysis phase is also known as the front-end of the compiler. The input to the analysis phase is the SP and its output is the IR (Intermediate Representation).

An IR is a representation of an SP which reflects the effect of some, but not all, of the analysis and synthesis tasks performed during language processing. Examples are bytecode and MSIL (Microsoft Intermediate Language). An IR should be easy to produce from the SP, and it should be easy to produce the TP from the IR.

An IR consists of two components:
Tables of information
An Intermediate Code (IC), which is a description of the SP.

As translation progresses, the internal representation of the SP changes.

Analysis includes the following phases:
Lexical Analysis
Syntax Analysis
Semantic Analysis

1. Lexical Analysis

Lexical analysis is also known as Linear Analysis (LA) or scanning. LA reads the stream of characters in the source program from left to right and groups them into a stream of tokens. A token is a sequence of characters having a collective meaning. [A token represents a logically cohesive sequence of characters. The character sequence forming a token is called the lexeme for the token.]

Thus, LA identifies the lexical units, called tokens, and then classifies these units into different lexical classes such as identifier, keyword (reserved word), operator, and literal (constant).

For example, in LA the characters in the assignment statement “position = initial + rate * 60” would be grouped into the following tokens:
1. The identifier “position”
2. The assignment symbol/operator “=”
3. The identifier “initial”
4. The operator for addition “+”
5. The identifier “rate”
6. The operator for multiplication “*”
7. The literal number “60”

This information is recorded in different tables for further use. For example, the symbol table maintains information about all identifiers in the SP, and it is updated or used by the subsequent phases.

2. Syntax Analysis

Syntax analysis is also known as hierarchical analysis or parsing. It groups the tokens of the SP into grammatical phrases, which are represented by a special kind of tree called a parse tree, or by a syntax tree (a compressed representation of the parse tree). In a syntax tree, each interior node represents an operation and the children of the node represent the arguments of the operation. A syntax tree for an assignment statement is shown in the following figure.

Figure: Syntax tree for Statement: position := initial + rate * 60

The operations implied by the SP (according to its language specification) are determined and recorded in this hierarchical structure (e.g., multiplication should be performed before addition). Thus, syntax analysis processes the string of tokens built by lexical analysis to determine the statement class, such as assignment statement, IF statement, etc.

3. Semantic Analysis

Semantic analysis uses the hierarchical structure created by syntax analysis to identify the operators and operands. It gathers type information and performs type checking. That is, it checks that each operator has operands that are permitted by the source language specification. For example, many language specifications require that an array index be a positive integer (it cannot be negative or a real number).

However, a language specification may permit some compatible operands of different types. The compiler may perform type conversion automatically in such cases. For example, when an addition involves one integer and one floating-point number, the compiler may first convert the integer to a floating-point number. In the above example, semantic analysis identifies that “position”, “initial” and “rate” are floating-point numbers.
Hence, it adds code to convert the integer 60 to the floating-point value 60.0 before it is multiplied with rate.

Thus, semantic analysis updates the symbol table, adding information regarding type, length, dimensionality, etc. for each identifier. It also modifies the parse tree appropriately.

Synthesis Phase

The synthesis phase is also known as the back-end of the compiler. The input to the synthesis phase is the IR and its output is the TP.

Synthesis includes the following phases:
Memory Allocation
Code Optimization
Code Generation

4. Memory Allocation

This phase allocates memory to the different identifiers of the program. The memory requirement of an identifier is computed from its type, length, and dimensionality, and memory is allocated to it accordingly. The address of the memory area is entered in the symbol table.

5. Code Optimization

The code optimization phase attempts to improve the intermediate code so that faster-running machine code will result. Code optimization slows down the compilation process, but the more efficient machine code it produces may outweigh this cost.

6. Code Generation

The final phase of the compiler is the generation of target code, normally consisting of relocatable machine code or assembly code. It uses knowledge of the target architecture to select appropriate registers, instructions, and addressing modes. Memory locations are selected for each of the variables used by the program. Then, each intermediate code instruction is translated into a sequence of machine instructions that perform the same task. A crucial aspect is the assignment of variables to registers.

For example, using registers 1 and 2, the translation of the code might become:

    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1

The first and second operands of each instruction specify a source and a destination, respectively. The F in each instruction tells us that the instructions deal with floating-point numbers.
The first instruction moves the contents of the address id3 into register 2. The second instruction multiplies register 2 by the real constant 60.0. The # signifies that 60.0 is to be treated as a constant. The third and fourth instructions move id2 into register 1 and add to it the value previously computed in register 2. Finally, the value in register 1 is moved into the address of id1. So, the code implements the assignment statement “position = initial + rate * 60”.

Example of Processing of Compiler Phases

Intermediate code given to the Code Optimizer:

    Temp1 := inttoreal(60)
    Temp2 := id3 * Temp1
    Temp3 := id2 + Temp2
    id1 := Temp3

Output of the Code Optimizer, given to the Code Generator:

    Temp1 := id3 * 60.0
    id1 := id2 + Temp1

Output of the Code Generator:

    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1

Symbol Table Management by Compiler

An essential function of a compiler is to record the identifiers used in the source program and to collect information about the various attributes of each identifier. These attributes may provide information about the storage allocated for an identifier, its type, and its scope. In the case of procedure names, they also provide information such as the number and types of the arguments, the method of passing each argument, and the type returned, if any.

A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. This data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly.

Entries in the symbol table are created during the lexical analysis phase: when an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table. The remaining phases update or use the identifier information in the symbol table. However, the attributes of an identifier cannot normally be determined during lexical analysis.
For example, in a Pascal declaration such as

    var position, initial, rate : real;

the type real is not known when position, initial, and rate are seen by the lexical analyzer. The remaining phases enter information about identifiers into the symbol table and then use this information in various ways.

Error Detection and Reporting Phase of Compiler

This phase collects all types of errors. There are basically three types of errors:
• Lexical errors: these are detected by lexical analysis.
• Syntax errors: these are detected by syntax analysis.
• Semantic errors: these are detected by semantic analysis.

Each phase of the compiler can encounter errors. After detecting an error, a phase must somehow deal with that error so that compilation can proceed, allowing further errors in the source program to be detected. A compiler that stops when it finds the first error is not as helpful as it could be.

The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the compiler. The lexical phase can detect errors where the characters remaining in the input do not form any token of the language. Errors where the token stream violates the structure rules (syntax) of the language are detected by the syntax analysis phase. During semantic analysis the compiler tries to detect constructs that have the right syntactic structure but no meaning for the operation involved, e.g., an attempt to add two identifiers, one of which is the name of an array and the other the name of a procedure.

The Context of a Compiler OR A Language Processing System/Diagram

In addition to a compiler, several other programs may be required to create an executable target program. A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes assigned to a separate program called a preprocessor.
The preprocessor may also expand macros into source language statements.

The following figure shows a typical compilation. The target program created by the compiler may require further processing before it can be run. The compiler creates assembly code, which is translated by an assembler into machine code and then linked together with some library functions into the code that actually runs on the machine.

The Software Tools

Software tools are programs that manipulate the SP and perform some kind of analysis on it. Some examples of such tools are:
1. Structure Editors
2. Pretty Printers
3. Static Checkers
4. Interpreters

1. Structure Editors

A structure editor takes as input a sequence of commands to build a source program. The structure editor not only performs the text creation and modification functions of an ordinary text editor, but also analyzes the program text, putting an appropriate hierarchical structure on the source program. Thus, the structure editor can perform additional tasks that are useful in the preparation of programs, e.g., it can check that the input is correctly formed, can supply keywords automatically, and can jump from a BEGIN or left parenthesis to its matching END or right parenthesis. The output of such an editor is often similar to the output of the analysis phase of a compiler.

2. Pretty Printers

A pretty printer analyzes a program and prints it in such a way that the structure of the program becomes clearly visible. For example, comments may appear in a special font, and statements may appear with an amount of indentation proportional to the depth of their nesting in the hierarchical organization of the statements.

3. Static Checkers

A static checker reads a program, analyzes it, and attempts to discover potential bugs without running the program. For example, a static checker may detect that parts of the source program can never be executed, or that a certain variable might be used before being defined or assigned a value. In addition, it can catch some logical errors.

4. Interpreters

Instead of producing a target program as a translation, an interpreter performs the operations implied by the source program. For example, an interpreter might build a tree as shown in the following figure and then carry out the operations at the nodes as it “walks” the tree.

At the root it would discover that it had an assignment to perform. So it would call a routine to evaluate the expression on the right, and then store the resulting value in the location associated with the identifier position.
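The tree walk described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the node classes (`Num`, `Var`, `BinOp`, `Assign`), the `evaluate` function, and the sample values given to initial and rate are all hypothetical and chosen only for this example; the tree is built by hand rather than by a parser.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Num:
    value: float


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str
    left: "Node"
    right: "Node"


@dataclass
class Assign:
    target: str
    expr: "Node"


Node = Union[Num, Var, BinOp, Assign]


def evaluate(node, memory):
    """Walk the tree and carry out the operation found at each node."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Var):
        if node.name not in memory:
            raise NameError(f"undefined identifier {node.name!r}")
        return memory[node.name]
    if isinstance(node, BinOp):
        left = evaluate(node.left, memory)
        right = evaluate(node.right, memory)
        if node.op == "+":
            return left + right
        if node.op == "*":
            return left * right
        raise ValueError(f"unknown operator {node.op!r}")
    if isinstance(node, Assign):
        # Evaluate the right-hand side, then store the result in the
        # location associated with the target identifier.
        value = evaluate(node.expr, memory)
        memory[node.target] = value
        return value
    raise TypeError(f"unknown node {node!r}")


if __name__ == "__main__":
    # Syntax tree for: position = initial + rate * 60
    tree = Assign("position", BinOp("+", Var("initial"), BinOp("*", Var("rate"), Num(60.0))))
    memory = {"initial": 10.0, "rate": 2.5}  # assumed sample values
    evaluate(tree, memory)
    print(memory["position"])
```

Because the multiplication node is a child of the addition node, rate * 60 is evaluated first without any explicit precedence rule in the walker; the shape of the tree already encodes the order of operations.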