Course\SS\SS Unit-1 26-2-2015 05

Unit-1 Language Processors and Compilers
Introduction to Language Processing
Language Processing Activities arise due to the differences between the manner in
which a software designer (or user) describes the ideas concerning the behavior of
software and the manner in which these ideas are implemented in a computer
system.
The designer expresses the ideas in terms related to the application domain of the
software. To implement these ideas, their description has to be interpreted in
terms related to the execution domain of the computer system.
We use the term semantics to represent the rules of meaning of a domain, and
the term semantic gap to represent the difference between the semantics of two
domains.
The semantic gap has the following consequences (negative effects):
1. Large development times
2. Large development efforts
3. Poor quality of software.
These issues are tackled by the Software Engineering (SE) field through the use of
methodologies and Programming Languages (PLs). The SE steps aimed at the use
of a PL can be grouped into
1. Specification, Design and Coding Steps
2. PL Implementation Steps.
The first step bridges the gap between the Application domain and PL domain.
The second step bridges the gap between the PL domain and Execution domain.
Software implementation using a PL introduces a new domain, the PL domain.
The advantage of introducing the PL domain is that the gap to be bridged by the
software designer is now between the Application Domain and the PL Domain
rather than between the Application Domain and the Execution Domain.
The gap between PL Domain and Execution Domain is bridged by Language
Processor (LPr).
This reduces the severity of the consequences of semantic gap mentioned earlier.
Further, apart from bridging the gap between the PL domain and execution
domain, the Language Processor provides a Diagnostic Capability which detects
and indicates errors in its input. This helps in improving the quality of the
software.
We refer to the gap between the Application Domain and PL Domain as the
Specification-and-Design Gap or simply the Specification Gap, and the gap
between the PL Domain and the Execution Domain as the Execution Gap.
The specification gap is bridged by the software development team.
The execution gap is bridged by the designer of the programming language
processor (e.g., a translator or an interpreter).
We define the terms specification gap and execution gap as follows:
Specification gap is the semantic gap between two specifications of the same task.
Execution gap is the gap between the semantics of programs (that perform the
same task) written in different programming languages.
Each domain has a (its own) specification language (SL).
A specification written in an SL is a program in SL.
The specification language of the PL domain is the PL itself.
The specification language of the execution domain is the machine language of the
computer system.
Language Processors
A Language Processor is software which bridges a specification or execution gap.
We use the term Language Processing to describe the activity performed by a
language processor and assume a diagnostic capability as an implicit part of any
form of language processing.
We refer to the program that forms the input to a language processor as the Source
Program (SP) and to its output as the Target Program (TP).
The languages in which these programs are written are called Source Language
(SL) and Target Language (TL) respectively.
A Language Processor typically abandons generation of the target program if it
detects errors in the source program.
A spectrum of Language Processors is defined to meet practical
requirements.
1. A language translator bridges an execution gap to the machine language (or
assembly language) of a computer system. An assembler is a language
translator whose source language is assembly language. A compiler is any
language translator which is not an assembler.
2. A de-translator bridges the same execution gap as the language translator,
but in the reverse direction.
3. A preprocessor is a language processor which bridges an execution gap but
is not a language translator.
4. A language migrator bridges the specification gap between two PLs.
Interpreters
An interpreter is a language processor which bridges an execution gap without
generating a machine language program.
The absence of a target program implies the absence of an output interface of the
interpreter. Thus the language processing activities of an interpreter cannot be
separated from its program execution activities. Hence we say that an interpreter
'executes' a program written in a PL. In essence, the execution gap vanishes
totally.
Problem-Oriented and Procedure-Oriented Languages
The three consequences of the semantic gap mentioned at the start of this section
are in fact the consequences of a specification gap. Software systems are poor in
quality and require large amounts of time and effort to develop due to difficulties
in bridging the specification gap.
A classical solution is to develop a PL such that the PL domain is very close or
identical to the application domain.
PL features now directly model aspects of the application domain, which leads to a
very small specification gap. Such PLs can only be used for specific applications.
Hence they are called problem-oriented languages.
They have large execution gaps. However this is acceptable because the gap is
bridged by the translator or interpreter and does not concern the software
designer.
A procedure-oriented language provides general purpose facilities required in
most application domains. Such a language is independent of specific application
domains and results in a large specification gap which has to be bridged by an
application designer.
Language processing activities
The fundamental language processing activities can be divided into two categories:
those that bridge specification gap and those that bridge the execution gap.
We call these activities:
1. Program Generation Activities
2. Program Execution Activities
A program generation activity aims at automatic generation of a program.
The source language is a specification language of an application domain and the
target language is typically a procedure oriented PL.
A program execution activity organizes the execution of a program written in a PL
on a computer system.
Its source language could be a procedure oriented language or a problem oriented
language. The target language is machine language (or assembly language).
1. Program Generation
The following figure shows the program generation activity.
The program generator is software which accepts the specification of a program
to be generated and generates a program in the target PL.
In effect, the program generator introduces a new domain between the application
domain and PL domain. We call this domain the program generator domain.
The specification gap is now the gap between the application domain and the
program generator domain. This gap is smaller than the gap between the
application domain and the target PL domain. See Figure below.
Reduction in the specification gap increases the reliability of the generated
program. Since the generator domain is close to the application domain, it is
easy for the designer or programmer to write the specification of the program to
be generated.
Example
A screen handling program (also called a form filling program) handles screens in
a data entry environment. It displays the field headings and default values for
various fields in the screen and accepts data values for the fields.
The following figure shows a screen for data entry of employee information.
A data entry operator can move the cursor to a field and key in its value. The
screen handling program accepts the value and stores it in a database.
A screen generator (one type of program generator) can generate screen handling
programs automatically. It accepts a specification of the screen to be generated
(we will call it the screen specification) and generates a program (code) that
performs the desired screen handling. The specification for (some of the) fields
could be as follows:
Errors in the specification (e.g., invalid start or end positions, conflicting
specifications for a field), are detected by the generator. The generated screen
handling program can validate the data during data entry (if such things are
mentioned in specification). For example, the age field must only contain digits
and must be between 18 and 58, or the gender field must only contain M or F.
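The idea of a screen generator can be sketched in Python. This is only an illustration under assumed names: the specification format (SPEC), the function generate_validator, and the generated function validate are hypothetical and are not taken from any real screen generator. The sketch checks the specification for errors and then emits the source text of a validation routine for the age and gender fields.

```python
# Hypothetical screen specification: field name -> validation rule.
SPEC = {
    "age": {"type": "int", "min": 18, "max": 58},
    "gender": {"type": "choice", "values": ["M", "F"]},
}

def generate_validator(spec):
    """Return Python source text for a validate(field, text) function."""
    lines = ["def validate(field, text):"]
    for name, rule in spec.items():
        if rule["type"] == "int":
            if rule["min"] > rule["max"]:  # error in the specification itself
                raise ValueError(f"conflicting range for field {name!r}")
            lines.append(f"    if field == {name!r}:")
            lines.append(f"        return text.isdecimal() and "
                         f"{rule['min']} <= int(text) <= {rule['max']}")
        elif rule["type"] == "choice":
            lines.append(f"    if field == {name!r}:")
            lines.append(f"        return text in {rule['values']!r}")
        else:
            raise ValueError(f"unknown field type for {name!r}")
    lines.append("    return False  # unknown field")
    return "\n".join(lines)

source = generate_validator(SPEC)   # the generated program, as text
namespace = {}
exec(source, namespace)             # load the generated program
validate = namespace["validate"]
print(source)
```

The designer writes only the short specification; the longer validation program is produced automatically, which is the reduction in specification gap described above.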
2. Program Execution
Two popular models for program execution are translation and interpretation.
Program translation
The program translation model bridges the execution gap by translating a
program written in a PL, called the source program (SP), into an equivalent
program in the machine language or assembly language of the computer system,
called the target program (TP).
Characteristics of the program translation model are:
• A program must be translated before it can be executed.
• The translated program may be saved in a file. The saved program may be
executed repeatedly.
• A program must be retranslated after each modification of the source program.
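The translate-once, execute-many character of this model can be illustrated loosely with Python's built-in compile and exec. This is an analogy only: compile produces a code object for Python's virtual machine, not machine language, and the variable names below are chosen for illustration.

```python
source_program = "result = (profit * 100) / cost_price"

# Translate once: the code object plays the role of the target program.
target_program = compile(source_program, "<source>", "exec")

# Execute the saved translation repeatedly with different data.
results = []
for profit, cost_price in [(50, 200), (30, 150)]:
    env = {"profit": profit, "cost_price": cost_price}
    exec(target_program, env)
    results.append(env["result"])

print(results)
```

If source_program were changed, compile would have to be called again before the change could take effect, which mirrors the retranslation requirement.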
Program interpretation
Figure given above shows a schematic (diagram/representation) of program
interpretation.
The interpreter reads the source program and stores it in memory. During
interpretation it takes one source statement at a time, determines its meaning and
performs actions which implement it. This includes computational and input-output actions.
The process of program interpretation is similar to the Fetch-Decode-Execute cycle that a
CPU uses to execute (machine level) instructions, which is described below.
The CPU uses a Program Counter (PC) to note the address of the next instruction
to be executed. In the same way, an interpreter can use a PC to indicate which statement
of the source program is to be interpreted next. This statement would be subjected to the
interpretation cycle, which could consist of the following steps:
• Fetch the statement.
• Analyze the statement and determine its meaning, i.e., the computation (or
I/O) to be performed and the operands required.
• Execute the meaning of the statement.
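The interpretation cycle above can be sketched for a toy language. The statement forms accepted here ('name = a' and 'name = a op b', with tokens separated by spaces) are an assumption made for this example, not part of any real PL.

```python
def interpret(program):
    """Interpret statements of the form 'name = a' or 'name = a op b'."""
    memory = {}          # values of variables
    pc = 0               # index of the next statement to interpret

    def value(operand):
        # An operand is a variable already in memory or an integer constant.
        return memory[operand] if operand in memory else int(operand)

    while pc < len(program):
        statement = program[pc]                      # fetch
        target, _, expr = statement.partition("=")
        parts = expr.split()                         # analyze
        if len(parts) == 1:
            result = value(parts[0])
        else:
            left, op, right = parts
            a, b = value(left), value(right)
            result = {"+": a + b, "-": a - b, "*": a * b}[op]
        memory[target.strip()] = result              # execute
        pc += 1
    return memory

final = interpret(["x = 5", "y = x * 3", "z = y - x"])
print(final)
```

Note that no target program is produced: each statement is analyzed again every time it is reached.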
The following are characteristics of interpretation:
• The source program is retained in the source form itself, i.e., no target
program form exists.
• A statement is analyzed during its interpretation.
Analysis and Synthesis Model in Language Processing OR
Analysis and Synthesis Model of Compilation OR
Two parts of Compilation
Language Processing = Analysis of SP + Synthesis of TP
Two parts of compilation are: Analysis and Synthesis
Analysis Phase
• The analysis part breaks up the Source Program into pieces and creates an
Intermediate Representation (IR) of the Source Program.
• The input is SP, and the output is IR.
• The source program consists of statements written in a PL.
• Analysis is done on the basis of:
– Lexical rules, which govern the formation of valid lexical units (tokens) in the
Source Language (SL).
– Syntax rules, which govern the formation of valid statements in the SL.
– Semantic rules, which associate meaning with valid statements of the language.
• Thus, analysis consists of lexical, syntax and semantic analysis.
Synthesis Phase
• The synthesis part constructs the desired Target Program (TP) from the IR.
• The Target Program consists of statements which have the same meaning as the
corresponding statements in the Source Program.
• The input is IR, and the output is TP.
• Synthesis requires the most specialized techniques.
• Synthesis involves:
– Creation of data structures in the target program (Memory Allocation).
– Generation of the Target Program (Code Generation).
An Example of Language Processing
Consider the following statement: perc_profit = (profit*100) / cost_price;
• Lexical analysis
– Identifies =, * and / as operators, 100 as a constant and the remaining
strings as identifiers.
• Syntax analysis
– Identifies the statement as an assignment statement with perc_profit
as the LHS and (profit*100) / cost_price as the RHS expression.
– The RHS expression is to be evaluated and the result assigned to the variable
on the LHS.
• Semantic analysis
– Determines the meaning of the statement to be the assignment of
(profit*100) / cost_price to perc_profit.
• Synthesis part of the language processor
– Generates the following assembly language statements:
    MOVER  AREG, PROFIT
    MULT   AREG, 100
    DIV    AREG, COST_PRICE
    MOVEM  AREG, PERC_PROFIT
    …
    PERC_PROFIT  DW 1
    PROFIT       DW 1
    COST_PRICE   DW 1
MOVER moves a value from memory to a CPU register.
MOVEM moves a value from a CPU register to memory.
DW reserves one word in memory (similar to creating a variable in a PL).
MULT and DIV perform multiplication and division between two operands,
respectively, and store the result in the first operand (here the register AREG).
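To see what the generated statements compute, the following sketch simulates them on a one-register machine. The simulator is hypothetical: the tuple encoding of instructions, the function run, and the use of real-number division for DIV are assumptions made for this example, and the sample values of PROFIT and COST_PRICE are invented.

```python
def run(program, memory):
    """Execute (opcode, register, operand) triples on a one-register machine."""
    regs = {"AREG": 0}

    def fetch(operand):
        # An operand is either a memory symbol or a numeric constant.
        return memory[operand] if operand in memory else float(operand)

    for opcode, reg, operand in program:
        if opcode == "MOVER":          # memory -> register
            regs[reg] = fetch(operand)
        elif opcode == "MULT":
            regs[reg] = regs[reg] * fetch(operand)
        elif opcode == "DIV":          # assumed to be real-number division
            regs[reg] = regs[reg] / fetch(operand)
        elif opcode == "MOVEM":        # register -> memory
            memory[operand] = regs[reg]
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return memory

program = [
    ("MOVER", "AREG", "PROFIT"),
    ("MULT", "AREG", "100"),
    ("DIV", "AREG", "COST_PRICE"),
    ("MOVEM", "AREG", "PERC_PROFIT"),
]
memory = run(program, {"PROFIT": 50, "COST_PRICE": 200, "PERC_PROFIT": 0})
print(memory["PERC_PROFIT"])
```

The three dictionary entries passed to run correspond to the three words reserved by the DW statements.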
Compilers
A compiler is a program that reads a program written in one language – the
source language – and translates it into an equivalent program in another
language – the target language.
As an important part of the translation process the compiler reports to its user the
presence of errors in the source program.
Figure: Source Program (SP) → Compiler → Target Program (TP); the compiler also outputs Error Messages.
The first few compilers were developed in the early 1950s.
Initially, developing a compiler was very difficult (e.g., the first FORTRAN compiler took
18 staff-years to implement).
Now, it is much easier, and a compiler can even be completed as a student project thanks to
modern programming environments and compiler-writing tools.
The Phases of Compiler
The following are the phases of a compiler:
• Analysis Phase
o Lexical Analysis
o Syntax Analysis
o Semantic Analysis
• Synthesis Phase
o Memory Allocation
o Code Optimization
o Code Generation
SP → Front-end (Analysis) → IR → Back-end (Synthesis) → TP
Figure: Phases of a Compiler
Analysis Phase
Also known as Front-end of the compiler.
The input to analysis phase is SP and its output is IR (Intermediate
Representation).
IR is a representation of a SP which reflects the effect of some, but not all, analysis
and synthesis tasks performed during Language Processing. E.g., Byte Code and
MSIL (Microsoft Intermediate Language).
IR should be easy to produce from SP. It should be easy to produce TP from IR.
IR consists of two components:
• Tables of information
• An IC (Intermediate Code), which is a description of the SP.
As translation progresses, the internal representation of the SP changes.
Analysis includes the following phases:
• Lexical Analysis
• Syntax Analysis
• Semantic Analysis
1. Lexical Analysis
Also known as Linear Analysis (LA) or Scanning.
LA reads the (stream of) characters in the Source Program from left-to-right and
groups them into (a stream of) tokens. A token is a sequence of characters having
a collective meaning.
[Token represents a logically cohesive sequence of characters. The character
sequence forming a token is called the lexeme for the token.]
Thus, LA identifies the Lexical Units, called Tokens, and then classifies these units
into different Lexical Classes such as an Identifier, a Keyword (Reserved Word),
Operator, and Literal (Constant).
For example, in LA the characters in the assignment statement “position = initial +
rate * 60” would be grouped into the following tokens.
1. The identifier “position”
2. The assignment symbol/operator “=”
3. The identifier “initial”
4. The operator for addition “+”
5. The identifier “rate”
6. The operator for multiplication “*”
7. The literal number “60”
This information is recorded in different tables for further use. For
example, Symbol Table maintains information about all Identifiers in SP, which is
updated or used by subsequent phases.
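A minimal scanner for this example can be sketched with Python's re module. The token class names (NUMBER, IDENTIFIER, OPERATOR) and the dictionary used as a symbol table are choices made for this sketch, not a fixed convention.

```python
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r"[=+\-*/]"),
    ("SKIP", r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    """Group characters into (class, lexeme) tokens, left to right."""
    tokens, symbol_table = [], {}
    pos = 0
    while pos < len(text):
        m = PATTERN.match(text, pos)
        if m is None:
            raise SyntaxError(f"invalid character {text[pos]!r} at {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                                 # white space is not a token
        if kind == "IDENTIFIER":
            symbol_table.setdefault(lexeme, {})      # attributes filled in later
        tokens.append((kind, lexeme))
    return tokens, symbol_table

tokens, symbol_table = tokenize("position = initial + rate * 60")
for token in tokens:
    print(token)
```

Each identifier gets an empty record in the symbol table at this stage; later phases add attributes such as type and address.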
2. Syntax Analysis
Also known as Hierarchical Analysis or Parsing.
It groups the tokens of SP into grammatical phrases, which are represented with
a special kind of tree called a Parse Tree or a Syntax Tree (a compressed
representation of the Parse Tree).
In Syntax Tree, each interior node represents an Operation and the children of a
node represent the arguments of the operation. A syntax tree for an assignment
statement is shown in following figure.
Figure: Syntax tree for Statement: position := initial + rate * 60
The operations implied by the SP (according to its language specification) are
determined and recorded in a hierarchical structure called a tree (e.g.,
Multiplication should be performed before Addition).
Thus, it processes the string of tokens (built by Lexical Analysis) to determine the
Statement Class such as Assignment Statement, IF Statement, etc.
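One common way to build such a tree is recursive descent parsing, sketched below for assignment statements only. The grammar, the function names, and the representation of the tree as nested tuples (operator, left, right) are assumptions of this sketch.

```python
OPERATORS = {"=", "+", "-", "*", "/"}

def parse_assignment(tokens):
    """Build a syntax tree (nested tuples) for: identifier = expression."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        token = peek()
        if token is None or token in OPERATORS:
            raise SyntaxError(f"expected operand, found {token!r}")
        return advance()                     # identifier or number leaf

    def term():                              # '*' and '/' bind tighter
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def expression():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    target = factor()
    if peek() != "=":
        raise SyntaxError("expected '='")
    advance()
    tree = ("=", target, expression())
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

tree = parse_assignment("position = initial + rate * 60".split())
print(tree)
```

Because term is called from expression, the multiplication ends up deeper in the tree than the addition, which is how the tree records that multiplication is performed first.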
3. Semantic Analysis
Semantic Analysis uses the hierarchical structure created by Syntax Analysis to
identify the operators and operands. It gathers type information and performs type
checking. That is, it checks that each operator has operands that are permitted
by the source language specification.
For example, many language specifications require that an array index be an
integer within the declared bounds (not, for example, a real number). However, a language
specification may permit some compatible operands of different types. The
compiler may perform the type conversion automatically in such cases. For example,
when an addition involves one integer and one floating-point number, the
compiler may first convert the integer to floating point.
In the above example, it identifies that “position”, “initial” and “rate” are floating-point
numbers. Hence, it adds code to convert the integer 60 to floating point (60.0) before it is
multiplied with rate.
Thus, Semantic Analysis updates Symbol Table and adds information regarding
type, length, dimensionality, etc. for each identifier. It also modifies Parse Tree
appropriately.
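The type checking and conversion described above can be sketched as a walk over the syntax tree. The tree format (nested tuples), the table SYMBOL_TYPES, and the rule that a real value may not be assigned to an integer variable are assumptions made for this example.

```python
# Assumed declarations: all three identifiers are of type real.
SYMBOL_TYPES = {"position": "real", "initial": "real", "rate": "real"}

def check(node):
    """Return (tree, type); insert inttoreal where int meets real."""
    if isinstance(node, str):                      # leaf
        if node.isdecimal():
            return node, "int"                     # integer constant
        if node not in SYMBOL_TYPES:
            raise TypeError(f"undeclared identifier {node!r}")
        return node, SYMBOL_TYPES[node]
    op, left, right = node
    left, ltype = check(left)
    right, rtype = check(right)
    if ltype != rtype:                             # mixed int/real operands
        if op == "=" and ltype == "int":
            # Rule assumed for this sketch only.
            raise TypeError("cannot assign a real value to an int variable")
        if ltype == "int":
            left = ("inttoreal", left)
        else:
            right = ("inttoreal", right)
        ltype = "real"
    return (op, left, right), ltype

tree = ("=", "position", ("+", "initial", ("*", "rate", "60")))
typed_tree, result_type = check(tree)
print(typed_tree)
```

The only change to the tree is the new inttoreal node above the constant 60.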
Synthesis Phase
Also known as Back-end of the compiler.
The input to synthesis phase is IR and its output is TP.
Synthesis includes the following phases:
• Memory Allocation
• Code Optimization
• Code Generation
4. Memory Allocation
This phase allocates the memory to different identifiers of the program.
The memory requirement of an identifier is computed from its type, length, and
dimensionality; and memory is allocated to it.
The address of the memory area is entered in the Symbol Table.
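A minimal sketch of this computation follows. The word sizes in WORD_SIZES, the starting address, and the array named samples are invented for illustration; a real compiler takes them from the target machine and the source program.

```python
WORD_SIZES = {"int": 1, "real": 2}   # assumed sizes, in words

def allocate(symbol_table, start_address=0):
    """Assign consecutive addresses and record them in the symbol table."""
    address = start_address
    for name, attrs in symbol_table.items():
        size = WORD_SIZES[attrs["type"]] * attrs.get("length", 1)
        attrs["size"] = size
        attrs["address"] = address
        address += size
    return address                   # first free address after allocation

symbols = {
    "position": {"type": "real"},
    "initial": {"type": "real"},
    "rate": {"type": "real"},
    "samples": {"type": "int", "length": 10},   # hypothetical array
}
next_free = allocate(symbols, start_address=100)
for name, attrs in symbols.items():
    print(name, attrs["address"], attrs["size"])
```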
5. Code Optimization
The code optimization phase attempts to improve the intermediate code, so that
faster-running machine code will result.
Code optimization slows down the compilation process, but it may give
comparatively more benefit by producing efficient machine code.
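Two simple improvements can be sketched on intermediate code for position = initial + rate * 60, written as (dest, op, arg1, arg2) tuples with id1, id2 and id3 standing for position, initial and rate. The tuple format and the function optimize are assumptions of this sketch; it performs the integer-to-real conversion of a constant at compile time and removes a final copy from a temporary, and it does not renumber the remaining temporaries.

```python
def optimize(code):
    """code: list of (dest, op, arg1, arg2); returns an improved list."""
    constants = {}       # temporaries known to hold a constant
    out = []
    for dest, op, a, b in code:
        a = constants.get(a, a)          # substitute known constants
        b = constants.get(b, b)
        if op == "inttoreal" and a.isdecimal():
            constants[dest] = a + ".0"   # conversion done at compile time
            continue
        # Assumes the temporary being copied is not used again later.
        if op == "copy" and out and out[-1][0] == a and a.startswith("Temp"):
            prev = out.pop()             # compute directly into the target
            out.append((dest,) + prev[1:])
            continue
        out.append((dest, op, a, b))
    return out

code = [
    ("Temp1", "inttoreal", "60", None),
    ("Temp2", "*", "id3", "Temp1"),
    ("Temp3", "+", "id2", "Temp2"),
    ("id1", "copy", "Temp3", None),
]
optimized = optimize(code)
for instruction in optimized:
    print(instruction)
```

Four intermediate instructions are reduced to two, at the cost of extra work during compilation.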
6. Code Generation
The final phase of the compiler is the generation of target code, consisting
normally of relocatable machine code or assembly code.
It uses knowledge of the target architecture to select appropriate registers,
instructions, and addressing modes.
Memory locations are selected for each of the variables used by the program.
Then, each Intermediate Code instruction is translated into a sequence of machine
instructions that perform the same task.
A crucial aspect is the assignment of variables to registers.
For example, using registers 1 and 2, the translation of the code might become:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The first and second operands of each instruction specify a source and
destination, respectively. The F in each instruction tells us that instructions deal
with floating-point numbers.
The first instruction moves the contents of the address id3 into register 2.
The second instruction multiplies register 2 with the real-constant 60.0. The #
signifies that 60.0 is to be treated as a constant.
The third and fourth instructions move id2 into register 1 and add to it the value
previously computed in register 2.
Finally, the value in register 1 is moved into the address of id1.
So, the code implements the assignment statement “position = initial + rate * 60”.
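A very small code generator that produces this instruction sequence can be sketched as follows. The input format (tuples of dest, op, arg1, arg2), the function names, and the register allocation scheme are assumptions of this sketch; only the two operators used in the example are supported, and registers holding temporaries are never released.

```python
OPCODES = {"+": "ADDF", "*": "MULF"}   # only the operators of the example

def operand(x, location):
    if x in location:                  # temporary held in a register
        return location[x]
    if x[0].isdigit():                 # constant
        return "#" + x
    return x                           # memory address of an identifier

def generate(code):
    free = ["R1", "R2"]                # registers still available
    location = {}                      # temporary -> register
    out = []
    for dest, op, a, b in code:
        reg = free.pop()               # fails if no register is left
        out.append(f"MOVF {operand(a, location)}, {reg}")
        out.append(f"{OPCODES[op]} {operand(b, location)}, {reg}")
        if dest.startswith("Temp"):
            location[dest] = reg       # keep the result in the register
        else:
            out.append(f"MOVF {reg}, {dest}")
            free.append(reg)           # value stored, register reusable
    return out

code = [
    ("Temp1", "*", "id3", "60.0"),
    ("id1", "+", "id2", "Temp1"),
]
assembly = generate(code)
print("\n".join(assembly))
```

Taking registers from the end of the free list is an arbitrary choice that happens to reproduce the register numbers used in the text.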
Example of Processing of Compiler Phases
Intermediate code for the statement position := initial + rate * 60
(id1, id2 and id3 denote position, initial and rate):
    Temp1 := inttoreal(60)
    Temp2 := id3 * Temp1
    Temp3 := id2 + Temp2
    id1 := Temp3
After the Code Optimizer:
    Temp1 := id3 * 60.0
    id1 := id2 + Temp1
After the Code Generator:
    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1
Symbol Table Management by Compiler
An essential function of a compiler is to record the identifiers used in the source
program and collect information about various attributes of each identifier.
These attributes may provide information about the storage allocated for an
identifier, its type, and its scope.
In the case of procedure names, they also provide information such as the number
and types of arguments, the method of passing each argument, and the type
returned, if any.
A symbol table is a data structure containing a record for each identifier, with
fields for the attributes of the identifier.
This data structure allows us to find the record for each identifier quickly and to
store or to retrieve data quickly.
The symbol table is created by entering an entry for each identifier during the Lexical Analysis
phase. The remaining phases update or use identifier information from the Symbol
Table.
When an identifier in the source program is detected by the lexical analyzer, the
identifier is entered into the symbol table. However, the attributes of an identifier
cannot normally be determined during lexical analysis.
For example, in a Pascal declaration such as
Var position, initial, rate : real;
the type real is not known when position, initial, and rate are seen by the lexical
analyzer.
The remaining phases enter information about identifiers into the symbol table
and then use this information in various ways.
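A symbol table along these lines can be sketched as a small class built on a dictionary, which gives fast lookup by name. The class name, method names, and the attribute fields chosen here are illustrative assumptions.

```python
class SymbolTable:
    """One record (a dict of attributes) per identifier."""

    def __init__(self):
        self._records = {}

    def enter(self, name):
        """Called by the lexical analyzer; attributes are not yet known."""
        return self._records.setdefault(name, {"type": None, "address": None})

    def set_attribute(self, name, key, value):
        """Called by later phases as information becomes available."""
        self._records[name][key] = value

    def lookup(self, name):
        """Return the record for name, or None if it was never entered."""
        return self._records.get(name)

table = SymbolTable()
for name in ["position", "initial", "rate"]:   # seen by the lexical analyzer
    table.enter(name)
for name in ["position", "initial", "rate"]:   # declaration processed later
    table.set_attribute(name, "type", "real")
print(table.lookup("rate"))
```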
Error Detection and Reporting Phase of Compiler
It collects all types of errors. There are basically three types of errors:
• Lexical Errors: These errors can be detected by Lexical Analysis.
• Syntax Errors: These errors can be detected by Syntax Analysis.
• Semantic Errors: These errors can be detected by Semantic Analysis.
Each phase of the compiler can encounter errors. However,
after detecting error, a phase must somehow deal with that error, so that
compilation can proceed, allowing further errors in the source program to be
detected.
A compiler that stops when it finds the first error is not as helpful as it could be.
The syntax and semantic analysis phases usually handle a large fraction of the
errors detectable by the compiler.
The lexical phase can detect errors where the characters remaining in the input do
not form any token of the language.
Errors where the token stream violates the structure rules (syntax) of the language
are determined by the syntax analysis phase.
During semantic analysis the compiler tries to detect constructs that have the
right syntactic structure but no meaning to the operation involved, e.g., an attempt to
add two identifiers, one of which is the name of an array and the other the
name of a procedure.
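The point about continuing after an error can be illustrated for the lexical phase. In this sketch the set of valid tokens and the recovery strategy (report the offending character, skip it, and carry on) are assumptions chosen for simplicity.

```python
import re

# Assumed token forms: identifiers, integers, and a few operator symbols.
VALID = re.compile(r"[A-Za-z_]\w*|\d+|[=+\-*/();]")

def scan(text):
    """Return (tokens, errors); scanning continues after each error."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = VALID.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            errors.append(f"position {pos}: invalid character {text[pos]!r}")
            pos += 1                     # skip the character and carry on
    return tokens, errors

tokens, errors = scan("rate = 60 @ 2 $ cost")
for message in errors:
    print(message)
```

Both invalid characters are reported in a single run, instead of the scan stopping at the first one.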
The Context of a Compiler OR
A Language Processing System/Diagram
In addition to a compiler, several other programs may be required to create an
executable target program.
A source program may be divided into modules stored in separate files. The task of
collecting the source program is sometimes assigned to a separate program called
a preprocessor. The preprocessor may expand macros into source language
statements.
The following figure shows a typical compilation. The target program created by
the compiler may require further processing before it can be run.
The compiler creates assembly code that is translated by an assembler into the
machine code and then linked together with some library functions into the code
that actually runs on the machine.
The Software Tools
Software tools are programs that manipulate SP and perform some kind of
analysis. Some examples of such tools include:
1. Structure Editors
2. Pretty Printers
3. Static Checkers
4. Interpreters
1. Structure Editors
A structure editor takes as input a sequence of commands to build a source
program.
The structure editor not only performs the text creation and modification functions
of an ordinary text editor, but it also analyzes the program text, putting an
appropriate hierarchical structure on the source program.
Thus, the structure editor can perform additional tasks that are useful in the
preparation of programs, e.g., it can check that the input is correctly formed, can
supply keywords automatically, and can jump from a BEGIN or left parenthesis
to its matching END or right parenthesis.
The output of such an editor is often similar to the output of the analysis phase of
a compiler.
2. Pretty Printers
A pretty printer analyses a program and prints it in such a way that the structure
of the program becomes clearly visible.
For example, comments may appear in a special font, and statements may appear with
an amount of indentation proportional to the depth of their nesting in the
hierarchical organization of the statements.
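Indentation by nesting depth can be sketched as follows. The sketch assumes the program has already been split into statements and that blocks are delimited by the keywords BEGIN and END; both are simplifications made for this example.

```python
def pretty_print(statements, indent="    "):
    """Indent each statement by its nesting depth inside BEGIN ... END."""
    depth = 0
    lines = []
    for statement in statements:
        word = statement.strip()
        if word.upper() == "END":
            depth = max(depth - 1, 0)        # END closes the current block
        lines.append(indent * depth + word)
        if word.upper() == "BEGIN":
            depth += 1                       # following statements nest deeper
    return "\n".join(lines)

text = pretty_print(["BEGIN", "x := 1;", "BEGIN", "y := x;", "END", "END"])
print(text)
```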
3. Static Checkers
A static checker reads a program, analyzes it, and attempts to discover potential
bugs without running the program, e.g., a static checker may detect that parts of
the source program can never be executed, or that a certain variable might be
used before being defined or assigned a value.
In addition, it can catch some logical errors, such as an attempt to use a real variable as a pointer.
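A use-before-assignment check can be sketched with Python's standard ast module, applied here to Python source text. The sketch handles only a straight-line sequence of simple assignments; a real static checker must also follow branches, loops, and procedure calls.

```python
import ast

def used_before_assignment(source):
    """Report names read before any assignment in straight-line code."""
    assigned, problems = set(), []
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign):
            continue                              # only assignments handled
        for node in ast.walk(stmt.value):         # names read on the RHS
            if isinstance(node, ast.Name) and node.id not in assigned:
                problems.append((stmt.lineno, node.id))
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                assigned.add(target.id)
    return problems

program = "rate = 2\nposition = initial + rate * 60\ninitial = 5\n"
problems = used_before_assignment(program)
print(problems)
```

The program is never executed; the problem is found purely by analyzing its text.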
4. Interpreters
Instead of producing a target program as a translation, an interpreter performs the
operations implied by the source program. For example, an interpreter might build a tree as
shown in the following figure and then carry out the operations at the nodes as it
“walks” the tree.
At the root it would discover it had an assignment to perform. So it would call a
routine to evaluate the expression on the right, and then store the resulting value
in the location associated with the identifier position.
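The tree walk described above can be sketched as a recursive function. The representation of the tree as nested tuples, the dictionary used as memory, and the sample values of initial and rate are assumptions made for this example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(node, memory):
    """Walk the tree: leaves give values, interior nodes apply operations."""
    if isinstance(node, (int, float)):
        return node                              # constant leaf
    if isinstance(node, str):
        return memory[node]                      # value of an identifier
    op, left, right = node
    if op == "=":
        memory[left] = evaluate(right, memory)   # store into the target
        return memory[left]
    return OPS[op](evaluate(left, memory), evaluate(right, memory))

tree = ("=", "position", ("+", "initial", ("*", "rate", 60)))
memory = {"initial": 10.0, "rate": 2.5}
evaluate(tree, memory)
print(memory["position"])
```

At the root the function finds the assignment, evaluates the right subtree recursively, and stores the result under the name position.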