Compiler Construction – Unit 1

UNIT – I
1. Programs related to Compilers
 Preprocessor: All the preprocessor commands written in a high-level language program are
processed by the preprocessor before the compiler takes over.
Example:
#include <stdio.h>
#define MAX_ROWS 10
The preprocessor finds every use of MAX_ROWS and replaces it with 10 in the files of the
project.
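As a sketch (the array name table is illustrative), the expansion works like this:

/* before preprocessing */
#define MAX_ROWS 10
int table[MAX_ROWS][MAX_ROWS];

/* after preprocessing, as seen by the compiler */
int table[10][10];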
 Compiler: This software converts the code written in a high-level language into an object file.
A compiler converts all the files of a given project at once.
The translation process should also report the presence of errors in the source program.
Source Program → Compiler → Target Program
                    ↓
              Error Messages
There are two parts of compilation.
The analysis part breaks up the source program into constituent pieces and creates an
intermediate representation of the source program.
The synthesis part constructs the desired target program from the intermediate
representation.
 Assembler: It converts only low-level assembly language to machine code. Assembly
language is highly specific to a particular core (microprocessor/platform).
 Linker: The linker combines the object files created by the compiler with predefined
library objects to create an executable.
 Loader: It is the part of an operating system that is responsible for loading programs, one of
the essential stages in the process of starting a program. Loading a program involves reading
the contents of the executable file (the file containing the program text) into memory, and then
carrying out the other preparatory tasks required to make the executable ready to run. Once
loading is complete, the operating system starts the program by passing control to the loaded
program code.
 Interpreter: This is a software tool that interprets user-written code line by line, unlike a
compiler, which processes everything at once. Because a single line is executed at a time,
interpretation is comparatively slow.
 Rational Preprocessors: These processors augment older languages with more modern flow
of control and data structuring facilities. For example, such a preprocessor might provide the
user with built-in macros for constructs like while-statements or if-statements, where none
exist in the programming language itself.
 Language extensions: These processors attempt to add capabilities to the language by what
amounts to built-in macros. For example, the language Equel is a database query language
embedded in C. Statements beginning with ## are taken by the preprocessor to be database
access statements unrelated to C and are translated into procedure calls on routines that
perform the database access.
The behavior of the compiler with respect to extensions is declared with the #extension
directive:
#extension extension_name : behavior
#extension all : behavior
Here extension_name is the name of an extension, and the token all means that the specified
behavior should apply to all extensions supported by the compiler.
2. Translation Process
Phases of Compiler
The compiler has a number of phases plus symbol table manager and an error handler.
The front end includes all analysis phases and the intermediate code generator.
The back end includes the code optimization phase and final code generation phase.
The front end analyzes the source program and produces intermediate code while the back end
synthesizes the target program from the intermediate code.
1. Lexical analyzer takes the source program as an input and produces a long string of
tokens.
2. Syntax analyzer takes the output of the lexical analyzer and produces a parse tree.
3. Semantic analyzer takes the output of syntax analyzer and produces another tree.
4. Similarly, intermediate code generator takes a tree as an input produced by semantic
analyzer and produces intermediate code.
Input Source Program
        ↓
Lexical Analyzer
        ↓  (tokens)
Syntax Analyzer
        ↓  (syntax tree)
Semantic Analyzer
        ↓
Intermediate Code Generator
        ↓
Code Optimizer
        ↓
Code Generator
        ↓
Output Target Program

(The Symbol Table Manager and the Error Handler interact with all phases.)
5. Code Optimization tries to optimize the code so that it runs in less time and, if
possible, also executes in less space.
Note: If code optimization is done before code generation, it consists mostly of machine-
independent optimizations; if done after code generation, it is machine-specific
optimization, according to the capability (instruction set) of the given machine.
6. Code generation produces the machine code.
In detail the working of each phase with example is as follows.
 Scanner or lexical analysis
The scanner begins the analysis of the source program by:
_ Reading file character by character
_ Grouping characters into tokens
_ Eliminating unneeded information (comments and white space)
_ Entering preliminary information into literal or symbol tables
Tokens represent basic program entities such as:
_ Identifiers, Literals, Reserved Words, Operators, Delimiters, etc.
_ Example: a := x + y * 2.5; is scanned as
a      identifier
:=     assignment operator
x      identifier
+      plus operator
y      identifier
*      multiplication operator
2.5    real literal
;      semicolon
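A minimal scanner sketch in C makes this grouping concrete (illustrative only: it does not
recognize multi-character operators such as :=, and it treats every other non-alphanumeric
character as a one-character token):

#include <stdio.h>
#include <ctype.h>

int main(void) {
    int c = getchar();
    while (c != EOF) {
        if (isspace(c)) {                 /* eliminate white space */
            c = getchar();
        } else if (isalpha(c)) {          /* identifier: letter (letter|digit)* */
            printf("identifier: ");
            while (c != EOF && isalnum(c)) { putchar(c); c = getchar(); }
            putchar('\n');
        } else if (isdigit(c)) {          /* number: digits, optionally with '.' */
            printf("number: ");
            while (c != EOF && (isdigit(c) || c == '.')) { putchar(c); c = getchar(); }
            putchar('\n');
        } else {                          /* single-character operator/delimiter */
            printf("operator/delimiter: %c\n", c);
            c = getchar();
        }
    }
    return 0;
}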
 Parser or Syntax analysis
_ Receives tokens from the scanner
_ Recognizes the structure of the program as a parse tree
_ Parse tree is recognized according to a context-free grammar
_ Syntax errors are reported if the program is syntactically incorrect
_ A parse tree is inefficient to represent the structure of a program
_ A syntax tree is a more condensed version of the parse tree
_ A syntax tree is usually generated as output by the parser
Example statement: a := x + y * 2.5;
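For this statement, with * binding tighter than +, the syntax tree has this shape:

        :=
       /  \
      a    +
          / \
         x   *
            / \
           y   2.5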
 Semantic Analyzer
_ The semantics of a program are its meaning as opposed to syntax or structure
_ The semantics consist of:
_ Runtime semantics – behavior of program at runtime
_ Static semantics – checked by the compiler
_ Static semantics include:
_ Declarations of variables and constants before use
_ Calling functions that exist (predefined in a library or defined by the user)
_ Passing parameters properly
_ Type checking.
_ Static semantics are difficult for the parser to check
_ The semantic analyzer does the following:
_ Checks the static semantics of the language
_ Annotates the syntax tree with type information
 Intermediate Code Generator
_ Comes after syntax and semantic analysis
_ Separates the compiler front end from its backend
_ Intermediate representation should have 2 important properties:
_ Should be easy to produce
_ Should be easy to translate into the target program
_ Intermediate representation can have a variety of forms:
_ Three-address code, P-code for an abstract machine, Tree or DAG representation
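As an illustration, the statement a := x + y * 2.5 could be translated into three-address
code as follows (the temporaries t1 and t2 are compiler-generated names):

t1 := y * 2.5
t2 := x + t1
a  := t2

Each instruction has at most one operator on the right-hand side, which is what makes this
form easy to produce and easy to translate.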
 Code Improvement or Optimization
_ Code improvement techniques can be applied to:
_ Intermediate code – independent of the target machine
_ Target code – dependent on the target machine
_ Intermediate code improvements include:
_ Constant folding
_ Elimination of common sub-expressions
_ Identification and elimination of unreachable code (called dead code)
_ Improving loops
_ Improving function calls
_ Target code improvement includes:
_ Allocation and use of registers
_ Selection of better (faster) instructions and addressing modes
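A small before/after sketch of intermediate-code improvement (the values and temporaries
are illustrative):

/* before */                    /* after folding and CSE */
t1 := 2 * 3.14                  t1 := 6.28
t2 := b * t1                    t2 := b * t1
t3 := b * t1                    a  := t2 + t2
a  := t2 + t3

Here 2 * 3.14 is folded to 6.28 at compile time, and the common sub-expression b * t1 is
computed only once.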
 Code Generator
_ Generates code for the target machine, typically:
_ Assembly code, or
_ Relocatable machine code
_ Properties of the target machine become a major factor
_ Code generator selects appropriate machine instructions
_ Allocates memory locations for variables
_ Allocates registers for intermediate computations.
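For the running example a := x + y * 2.5, a code generator for a hypothetical two-address
machine might emit something like this (the mnemonics and registers are invented for
illustration):

MOVF  y, R1        ; load y into register R1
MULF  #2.5, R1     ; R1 := y * 2.5
ADDF  x, R1        ; R1 := x + y * 2.5
MOVF  R1, a        ; store the result in a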
3. Major Data and Structures in a Compiler
Token
_ Represented by an integer value or an enumeration literal
_ Sometimes, it is necessary to preserve the string of characters that was scanned
_ For example, the name of an identifier or the value of a literal
Syntax Tree
_ Constructed as a pointer-based structure
_ Dynamically allocated as parsing proceeds
_ Nodes have fields containing information collected by the parser and semantic analyzer
Symbol Table
_ Keeps information associated with all kinds of identifiers:
_ Constants, variables, functions, parameters, types, fields, etc.
_ Identifiers are entered by the scanner, parser, or semantic analyzer
_ Semantic analyzer adds type information and other attributes
_ Code generation and optimization phases use the information in the symbol table
_ Insertion, deletion, and search operations need to be efficient because they are frequent
_ Hash table with constant-time operations is usually the preferred choice
_ More than one symbol table may be used
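Since fast insertion and lookup matter most, a hash table with separate chaining is the usual
choice. A minimal sketch in C (the structure, names, and table size are illustrative, not a
prescribed design):

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                  /* a prime number of buckets */

struct symbol {
    char *name;                         /* identifier lexeme */
    char *type;                         /* filled in by the semantic analyzer */
    struct symbol *next;                /* chaining resolves collisions */
};

static struct symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {   /* simple multiplicative string hash */
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

struct symbol *lookup(const char *name) {
    for (struct symbol *p = table[hash(name)]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    return NULL;                        /* not found */
}

struct symbol *insert(const char *name) {
    unsigned h = hash(name);
    struct symbol *p = malloc(sizeof *p);
    p->name = malloc(strlen(name) + 1);
    strcpy(p->name, name);
    p->type = NULL;                     /* attributes added by later phases */
    p->next = table[h];                 /* insert at head of chain */
    table[h] = p;
    return p;
}

Both operations run in expected constant time, which is why hashing is the preferred choice
here.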
Literal Table
_ Stores constant values and string literals in a program.
_ One literal table applies globally to the entire program.
_ Used by the code generator to:
_ Assign addresses for literals.
_ Enter data definitions in the target code file.
_ Avoids the replication of constants and strings.
_ Quick insertion and lookup are essential. Deletion is not necessary.
Temporary Files
_ Used historically by old compilers due to memory constraints
_ Hold the data of various compilation stages
4. Tokens, Lexemes, Patterns
Token
A lexical token is a sequence of characters with a collective meaning, i.e., one that can be
treated as a unit in the grammar of a programming language.
Examples of tokens:
 Type tokens (id, num, real, …)
 Punctuation tokens (IF, void, return, …)
 Alphabetic tokens (keywords)
Examples of non-tokens:
 Comments, preprocessor directives, macros, blanks, tabs, newlines, …
Patterns
There is a set of strings in the input for which the same token is produced as output. This set of
strings is described by a rule called a pattern associated with the token.
Regular expressions are an important notation for specifying patterns.
For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*.
Lexeme
A lexeme is a sequence of characters in the source program that is matched by the pattern for a
token.
For example, the pattern for the RELOP token covers six lexemes (=, <>, <, <=, >, >=), so the
lexical analyzer should return a RELOP token to the parser whenever it sees any one of the six.
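In regular-expression form this pattern can be written as:
relop → < | <= | = | <> | > | >=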
Specification of Tokens
An alphabet or a character class is a finite set of symbols. Typical examples of symbols are
letters and digits.
The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer
alphabets.
Strings
A string over some alphabet is a finite sequence of symbols taken from that alphabet.
For example, banana is a sequence of six symbols (i.e., a string of length six) taken from the
ASCII computer alphabet. The empty string, denoted by ε, is a special string with zero
symbols (i.e., its length is 0).
If x and y are two strings, then the concatenation of x and y, written xy, is the string formed by
appending y to x.
For example, if x = dog and y = house, then xy = doghouse. For the empty string ε, we have
εS = Sε = S.
String exponentiation concatenates a string with itself a given number of times:
S2 = SS or S.S
S3 = SSS or S.S.S
S4 = SSSS or S.S.S.S, and so on
By definition, S0 is the empty string ε, and S1 = S. For example, if x = ba and y = na, then
xy2 = banana.
Languages
A language is a set of strings over some fixed alphabet. The language may contain a finite or an
infinite number of strings.
Let L and M be two languages, where L = {dog, ba, na} and M = {house, ba}. Then
 Union: L ∪ M = {dog, ba, na, house}
 Concatenation: LM = {doghouse, dogba, bahouse, baba, nahouse, naba}
 Exponentiation: L2 = LL
 By definition: L0 = {ε} and L1 = L
The Kleene closure of a language L, denoted by L*, is "zero or more concatenations of" L:
L* = L0 ∪ L1 ∪ L2 ∪ L3 ∪ … ∪ Ln ∪ …
For example, if L = {a, b}, then
L* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, …}
The positive closure of a language L, denoted by L+, is "one or more concatenations of" L:
L+ = L1 ∪ L2 ∪ L3 ∪ … ∪ Ln ∪ …
For example, if L = {a, b}, then
L+ = {a, b, aa, ab, ba, bb, aaa, aab, aba, …}
Lexical analysis or scanning is the process where the stream of characters making up the source
program is read from left-to-right and grouped into tokens. Tokens are sequences of characters
with a collective meaning. There are usually only a small number of tokens for a programming
language: constants (integer, double, char, string, etc.), operators (Arithmetic, relational, logical),
punctuation, and reserved words.
The Role of the Lexical Analyzer
 Read input characters
 To group them into lexemes
 Produce as output a sequence of tokens, input for the syntactical analyzer
 Interact with the symbol table
◦Insert identifiers
 To strip out
◦comments
◦whitespaces: blank, newline, tab …
◦other separators
 To correlate error messages generated by the compiler with the source program
◦to keep track of the number of newlines seen
◦to associate a line number with each error message
Lexical Analysis vs. Parsing
 Simplicity of design
◦Separation of lexical from syntactical analysis -> simplify at least one of the tasks
◦e.g. parser dealing with white spaces -> complex
◦Cleaner overall language design
 Improved compiler efficiency
◦Liberty to apply specialized techniques that serve only lexical tasks, not the whole parsing
◦Speedup reading input characters using specialized buffering techniques
 Enhanced compiler portability
◦Input device peculiarities are restricted to the lexical analyzer
Lexical Errors
 Not all types of errors can be detected by the lexical analyzer alone.
EX: fi(a == f(x) ) …
(The lexical analyzer assumes that fi may be a user-defined function, since 'fi' is a valid
identifier.)
 The lexical analyzer is unable to proceed if none of the patterns matches any prefix of the
remaining input.
In the "panic mode" recovery strategy, the lexical analyzer does the following:
- delete one or more successive characters from the remaining input
- insert a missing character into the remaining input
- replace a character
- transpose two adjacent characters
(Note: In all the above cases it has to report to the user the changes it has made.)
5. Input Buffering
The lexical analyzer scans the characters of the source program one at a time to discover
tokens. Often, however, many characters beyond the next token may have to be examined
before the next token itself can be determined. For this and other reasons, it is desirable for the
lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into
two halves of, say, 100 characters each. One pointer marks the beginning of the token being
discovered. A lookahead pointer scans ahead of the beginning point until the token is
discovered. We view the position of each pointer as being between the character last read and
the character next to be read. In practice each buffering scheme adopts one convention: either
a pointer is at the symbol last read, or at the symbol it is ready to read.
[Figure: the input buffer in two halves, with a token-beginning pointer and a lookahead pointer]
The distance the lookahead pointer may have to travel past the actual token can be large.
For example, in a PL/I program we may see:
DECLARE (ARG1, ARG2, …, ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character
that follows the right parenthesis. In either case, the token itself ends at the second E. If the
lookahead pointer travels beyond the buffer half in which it began, the other half must be
loaded with the next characters from the source file.
Since the buffer shown in the figure above is of limited size, there is an implied constraint on
how much lookahead can be used before the next token is discovered. In the above example, if
the lookahead traveled to the left half and all the way through the left half to the middle, we
could not reload the right half, because we would lose characters that had not yet been grouped
into tokens. While we can make the buffer larger if we choose, or use another buffering
scheme, we cannot ignore the fact that lookahead is limited.
Sentinels
 forward pointer
◦must be tested to see if it is at the end of the buffer
◦determines what character is read (multiway branch)
 sentinel
◦added at each buffer end
◦cannot be part of the source program
◦the character eof is a natural choice
 eof retains its role as the marker for the end of the entire input
 when eof appears anywhere other than at the end of a buffer, it means the input is at an end
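Putting this together, a sketch of the two-buffer sentinel scheme in C (the buffer size N, the
function names, and the use of '\0' as the sentinel, on the assumption that it never occurs in
source text, are all illustrative choices):

#include <stdio.h>

#define N 4096                          /* size of each buffer half */

static FILE *src;                       /* source file being scanned */
static char buf[2 * N + 2];             /* two halves plus one sentinel slot each */
static char *forward;                   /* lookahead pointer */

/* Refill one half and plant the sentinel just past the last character read. */
static void load(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';
}

void init_buffer(FILE *fp) {
    src = fp;
    load(buf);                          /* fill the first half */
    forward = buf;
}

int next_char(void) {
    for (;;) {
        char c = *forward++;
        if (c != '\0')
            return (unsigned char)c;    /* common case: one test per character */
        if (forward == buf + N + 1)     /* sentinel closing the first half */
            load(buf + N + 1);          /* refill second half; forward already
                                           points at its first character */
        else if (forward == buf + 2 * N + 2) {  /* sentinel closing the second half */
            load(buf);                  /* refill first half */
            forward = buf;
        } else
            return EOF;                 /* sentinel inside a half: true end of input */
    }
}

On the fast path only a single test (c != '\0') is needed per character; the multiway branch
runs only when a sentinel is actually hit.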
6. LEX
http://dinosaur.compilertools.net/lex/
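For a first experiment, a minimal Lex specification might look like this (a sketch; the token
actions and the file name scanner.l are illustrative):

%{
#include <stdio.h>
%}
letter  [A-Za-z]
digit   [0-9]
%%
{letter}({letter}|{digit})*    { printf("id: %s\n", yytext); }
{digit}+(\.{digit}+)?          { printf("num: %s\n", yytext); }
[ \t\n]+                       ; /* skip white space */
.                              { printf("op: %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }

Build and run with: lex scanner.l && cc lex.yy.c -o scanner && ./scanner < input.txt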