Compilers


ACSC373 – Compiler Writing

Chapter 4 – Compilers

4 - Syntax, Semantics and Translation

5 - Lexical Analysis

6 - Syntax and Semantic Analysis

7 - Code generation

The design of compilers

3 phases of the compilation process:

(1) lexical analysis: reads high-level source program and divides it into a stream of basic lexical ‘tokens’.

(2) syntax and semantic analysis phase: then combines these tokens into data structures reflecting the form of the source program in terms of the syntactic structures of the language.

(3) code generator: converts these data structures into code for the target machine.

SYNTAX, SEMANTICS AND TRANSLATION

Language Translation: The Compilation Process

Compilers: large and complex programs

Task of a compiler:

1. the analysis of the source program (lexical, syntax and semantic analysis)

2. the synthesis of the object program (a single code-generation phase)

The phases of the compilation process:

source program → lexical analysis → syntax analysis → semantic analysis → code generation → object program
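The pipeline just described can be sketched as a chain of functions. The phase names follow the text; the toy representations passed between phases (a token list, nested tuples, one instruction per token) are illustrative assumptions, not a real compiler design.

```python
# A toy illustration of the compilation pipeline described above.
# The phase names follow the text; the data passed between phases
# (token lists, nested tuples) is a simplifying assumption.

def lexical_analysis(source):
    # Divide the source program into a stream of basic tokens.
    return source.split()

def syntax_analysis(tokens):
    # Group tokens into a structure reflecting the program's form.
    return ("program", tokens)

def semantic_analysis(tree):
    # Attach meaning (here: just tag the tree as checked).
    return ("checked", tree)

def code_generation(tree):
    # Emit code for the target machine (here: one line per token).
    _, (_, tokens) = tree
    return [f"PUSH {t}" for t in tokens]

object_program = code_generation(
    semantic_analysis(syntax_analysis(lexical_analysis("a + b"))))
print(object_program)  # ['PUSH a', 'PUSH +', 'PUSH b']
```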

ACSC373 – Compiler Writing – Chapter 4 – Dr. Stephania Loizidou Himona

Lexical Analysis

Reads the characters of the source program and recognizes the tokens or basic syntactic components that they represent. It is able to distinguish and pass on, as single units, objects such as

Numbers,

Punctuation symbols,

Operators,

Reserved keywords,

Identifiers and so on.

Effect: simplifies the syntax analyser, effectively reducing the size of the grammar the syntax analyser has to handle.

In a free-format language, the lexical analyser ignores spaces, newlines, tabs and other layout characters, as well as comments. e.g.

for I := 1 to 10 do sum := sum + term[i]; (*sum array*)

will be transformed by the lexical analyser into the sequence of tokens:

for  i  :=  1  to  10  do  sum  :=  sum  +  term  [  i  ]  ;

Pascal maintains a list of reserved words so that they can be distinguished from identifiers and passed to the next phase of the compiler in the form of, for example, a short integer code (or, use a symbol table).
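A lexical analyser along these lines can be sketched in a few lines of Python. The token categories, the (partial) reserved-word list and the regular expressions are illustrative assumptions chosen to handle just the fragment above:

```python
import re

# A minimal sketch of a lexical analyser for the Pascal fragment above.
# Token categories and regular expressions are illustrative assumptions.
TOKEN_RE = re.compile(r"""
    (?P<comment>\(\*.*?\*\))   # comments are discarded
  | (?P<number>\d+)
  | (?P<assign>:=)
  | (?P<ident>[A-Za-z]\w*)     # identifiers and reserved words
  | (?P<punct>[;\[\]+])
  | (?P<space>\s+)             # layout is discarded (free format)
""", re.VERBOSE | re.DOTALL)

RESERVED = {"for", "to", "do"}  # a (partial) reserved-word list

def tokenize(src):
    tokens = []
    for m in TOKEN_RE.finditer(src):
        kind = m.lastgroup
        if kind in ("space", "comment"):
            continue  # layout and comments never reach the syntax analyser
        text = m.group().lower()
        if kind == "ident" and text in RESERVED:
            kind = "reserved"  # distinguish reserved words from identifiers
        tokens.append((kind, text))
    return tokens

src = "for I := 1 to 10 do sum := sum + term[i]; (*sum array*)"
print([t for _, t in tokenize(src)])
# ['for', 'i', ':=', '1', 'to', '10', 'do', 'sum', ':=', 'sum', '+',
#  'term', '[', 'i', ']', ';']
```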

Syntax Analysis

The syntax analyser or parser has to determine how the tokens returned by the lexical analyser should be grouped and structured according to the syntax rules of the language.

Output: a representation of the syntactic structure, often expressed in the form of a tree (the ‘parse tree’).

Usually, the lexical analyser is responsible for all the simple syntactic constructs, such as identifiers, reserved words and numbers, while the syntax analyser deals with all the other structures.

Semantic Analysis

The task of this phase is to determine the semantics or meaning of the source program (the translation phase); it may cope with tasks involving declarations and scopes of identifiers, storage allocation, type checking, selection of appropriate polymorphic operators, the addition of automatic type transfers, etc.

This phase is often followed by a process that takes the parse tree from the syntax analyser and produces a linear sequence of instructions equivalent to the original source program (instructions for a virtual machine).


Code Generation

The final phase takes the output from the semantic analyser or translator and outputs machine code or assembly language for the target hardware (knowledge of the machine’s architecture is required to write a good code generator).

+ code improvement or code optimisation.

Syntax Specification

Role: to define the set of valid programs.

Sets: e.g. a digit in Pascal could be defined as {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Or, {xy^n | n > 0} defines the set of strings xy, xyy, xyyy, …, i.e. a string that starts with a single x followed by any number (greater than zero) of y's.

Backus-Naur Form (BNF)

A formal metalanguage that is frequently used in the definition of the syntax of programming languages (introduced in the definition of ALGOL 60).

A technique for representing rules that can be used to derive sentences of the language.

(If a finite set of these rules can be used to derive all sentences of a language, then this set of rules constitutes a formal definition of the syntax of the language).

Example A

<sentence> → <subject> <predicate>
<subject> → <noun> | <pronoun>
<predicate> → <transitive verb> <object> | <intransitive verb>
<noun> → cats | dogs | sheep
<pronoun> → I | we | you | they
<transitive verb> → like | hate | eat
<object> → biscuits | grass | sunshine
<intransitive verb> → sleep | talk | run

Using the complete set of rules, sentences can be generated by making random choices. e.g.

<sentence>
⇒ <subject> <predicate>
⇒ <noun> <predicate>
⇒ sheep <transitive verb> <object>
⇒ sheep eat <object>
⇒ sheep eat biscuits


Other, possible sentences in this language are:

I sleep,

Dogs hate grass,

We like sunshine

Note: the meaning of these sentences is not considered; e.g. “cats eat sunshine” is syntactically correct but makes no sense, which is of no concern to the BNF rules.
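The rules of Example A can be run directly as a sentence generator that makes random choices, as described above. The dictionary encoding of the productions is an illustrative choice:

```python
import random

# The BNF rules of Example A as a production table: each non-terminal
# maps to a list of alternatives, each alternative a list of symbols.
RULES = {
    "<sentence>": [["<subject>", "<predicate>"]],
    "<subject>": [["<noun>"], ["<pronoun>"]],
    "<predicate>": [["<transitive verb>", "<object>"], ["<intransitive verb>"]],
    "<noun>": [["cats"], ["dogs"], ["sheep"]],
    "<pronoun>": [["I"], ["we"], ["you"], ["they"]],
    "<transitive verb>": [["like"], ["hate"], ["eat"]],
    "<object>": [["biscuits"], ["grass"], ["sunshine"]],
    "<intransitive verb>": [["sleep"], ["talk"], ["run"]],
}

def generate(symbol):
    if symbol not in RULES:          # terminal: emit it unchanged
        return [symbol]
    choice = random.choice(RULES[symbol])   # pick one alternative at random
    return [word for part in choice for word in generate(part)]

print(" ".join(generate("<sentence>")))  # e.g. "sheep eat biscuits"
```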

Two distinct symbol types in BNF:

1. Symbols such as <sentence>, <pronoun> and <intransitive verb> are called non-terminal symbols (i.e. they appear on the left-hand side of a BNF production).

2. Symbols such as cats, dogs, I and grass are called terminal symbols, since they cannot be expanded further (they form the set of symbols of the language being defined).

Above, non-terminals were enclosed in angle brackets.

Other conventions:

- Non-terminals are either enclosed in angle brackets or are single upper-case letters.

- Terminals are represented as single lower-case letters, digits, special symbols (such as +, * or =), punctuation symbols or strings in bold type (such as begin).

Another language

Set of rules:

S → S + T | T
T → a | b

S and T are non-terminals, whereas a, b and + are terminals (S is being defined recursively).

S
⇒ S + T (using S → S + T)
⇒ S + T + T (using S → S + T)
⇒ T + T + T (using S → T)
⇒ b + T + T (using T → b)
⇒ b + a + T (using T → a)
⇒ b + a + a (using T → a)


Example

<expression> → <term> | <expression> + <term>
<term> → <primary> | <term> * <primary>
<primary> → a | b | c

a + b * c is generated as follows:

<expression>
⇒ <expression> + <term>
⇒ <term> + <term>
⇒ <primary> + <term>
⇒ a + <term>
⇒ a + <term> * <primary>
⇒ a + <primary> * <primary>
⇒ a + b * <primary>
⇒ a + b * c

or, a + b + c is generated as follows:

<expression>
⇒ <expression> + <term>
⇒ <expression> + <term> + <term>
⇒ <term> + <term> + <term>
⇒ <primary> + <term> + <term>
⇒ a + <term> + <term>
⇒ a + <primary> + <term>
⇒ a + b + <term>
⇒ a + b + <primary>
⇒ a + b + c


Syntax diagrams

Pictorial notation: a set of syntax diagrams, each defining a specific language construct. e.g. the definition of a constant (in Pascal):

[Syntax diagram: a Constant is either a Constant Identifier or an Unsigned Number, optionally preceded by ‘+’ or ‘-’, or else a Character String.]

i.e. a rectangular box denotes a non-terminal symbol; terminal symbols, such as ‘+’ and ‘-’, are enclosed in circles. (Any path through a diagram yields a syntactically correct structure.)

Compact specification (e.g. Pascal in only a couple of sheets of paper).

EBNF (Extended Backus-Naur Form)

Used in the ISO Pascal Standard; differs from BNF in several ways.

Principal changes: the set of metasymbols (symbols used for special purposes in the metalanguage), i.e.:

- terminal symbols are enclosed in double quotation marks
- a full stop is used to end each production
- an equals sign is used to separate the non-terminal from its definition
- parentheses are used for grouping

e.g.

AssignmentStatement = (Variable | FunctionIdentifier) ":=" Expression .

For option and repetition:

[x] implies zero or one instance of x (i.e. x is optional)
{x} implies zero or more instances of x

e.g. Identifier = Letter {Letter | Digit} .

i.e. a simpler definition of syntax. Much clearer.

The same in BNF could be:

<identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>
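The EBNF rule for Identifier maps directly onto a recogniser; the regular-expression encoding below is an illustrative sketch (ASCII letters only):

```python
import re

# The EBNF rule  Identifier = Letter {Letter | Digit} .  maps directly
# onto a regular expression: one letter, then zero or more letters or
# digits. (ASCII letters only, as a simplifying assumption.)
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*$")

def is_identifier(s):
    return IDENTIFIER.match(s) is not None

print(is_identifier("sum2"))   # True
print(is_identifier("2sum"))   # False: must start with a letter
```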

Grammars

The study of grammars started long before programming languages and focused primarily on the study of natural languages.

Then, found direct relevance in the formal study of programming languages.

Noam Chomsky's work has had a great influence here. (A set of BNF rules forms only part of the definition of the grammar of a language, and BNF is suitable only for a very restricted class of languages.)

The grammar is formally defined as a 4-tuple

G = (N, T, S, P)

N – the finite set of non-terminal symbols
T – the finite set of terminal symbols
S – the starting symbol (must be a member of N)
P – the set of productions (general form: α → β)

i.e. any occurrence of the string α in the string to be transformed can be replaced by the string β. e.g. the set of BNF productions presented in Example A above forms part of the definition of the grammar of a language. The remainder of the definition is:

N = {sentence, subject, predicate, noun, pronoun, intransitive verb, transitive verb, object}

T = {cats, dogs, sheep, I, we, you, they, like, hate, eat, biscuits, grass, sunshine, sleep, talk, run}

S = sentence

Still, the definition of the grammar is not quite complete; the strings α and β must have some relationship to the sets N and T.

Suppose U is the set of all terminal and non-terminal symbols of the language; that is, U = N ∪ T.

U+ – the positive closure of U – the set of non-empty strings over U
U* – the closure of U, i.e. U+ ∪ {ε}, where ε is the empty string

Sentential form: any string that can be derived from the starting symbol


Sentence: a sentential form that does not contain any non-terminal symbols (just terminals, no expansion).
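The 4-tuple G = (N, T, S, P) for Example A, together with the test for a sentence (a sentential form containing no non-terminals), can be written down concretely; the Python representation is an assumption made for illustration:

```python
# A sketch of the 4-tuple G = (N, T, S, P) for the grammar of Example A,
# using Python sets and a production table.
N = {"sentence", "subject", "predicate", "noun", "pronoun",
     "intransitive verb", "transitive verb", "object"}
T = {"cats", "dogs", "sheep", "I", "we", "you", "they", "like", "hate",
     "eat", "biscuits", "grass", "sunshine", "sleep", "talk", "run"}
S = "sentence"
P = {
    "sentence": [["subject", "predicate"]],
    "subject": [["noun"], ["pronoun"]],
    "predicate": [["transitive verb", "object"], ["intransitive verb"]],
    "noun": [["cats"], ["dogs"], ["sheep"]],
    "pronoun": [["I"], ["we"], ["you"], ["they"]],
    "transitive verb": [["like"], ["hate"], ["eat"]],
    "object": [["biscuits"], ["grass"], ["sunshine"]],
    "intransitive verb": [["sleep"], ["talk"], ["run"]],
}

def is_sentence(form):
    # A sentence is a sentential form containing no non-terminal symbols.
    return all(sym in T for sym in form)

print(is_sentence(["sheep", "eat", "biscuits"]))             # True
print(is_sentence(["sheep", "transitive verb", "object"]))   # False
```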

We have seen how sentences can be generated very simply using a set of BNF productions.

The reverse, determining how the BNF rules were applied to generate a given sentence, is much harder. This process of determining the syntactic structure of a sentence is called parsing, or syntax analysis (a major part of the compilation of high-level language programs).

Chomsky Classification

A grammar with productions of the general form

α → β (no restrictions on the strings α and β)

is called a Chomsky type 0 or free grammar (too general).

Restricted form:

αAβ → αγβ

where α, β and γ are members of U*, γ is not null, and A is a single non-terminal symbol; the resulting grammar is of type 1, the context-sensitive grammars.

(A is transformed to γ only when it occurs in the context of being preceded by α and followed by β.)

Equivalently, if every production α → β satisfies |α| <= |β|, where |α| denotes the length of the string α, then the grammar is context-sensitive.

Type 2 or context-free grammars: all productions are of the form

A → γ

where A is a single non-terminal (A can always be transformed into γ without any concern for its context).

(Of immense importance in programming language design – this form corresponds directly to the BNF notation, where each production has a single non-terminal symbol on its left-hand side, and so any grammar that is expressed in BNF must be a context-free grammar.) e.g. Pascal and ALGOL 60 – context-free or type 2.

Type 3 or finite, finite-state or regular grammars: all productions are of the form

A → α or A → αB

where A and B are non-terminals and α is a terminal symbol.


(too restricted) Suitable only for defining some of the structures used as components of most programming languages, e.g. identifiers, numbers, etc.

That is, hierarchically:

type 3 ⊂ type 2 ⊂ type 1 ⊂ type 0
(regular ⊂ context-free ⊂ context-sensitive ⊂ free)

Complexity increases from type 3 to type 0; e.g. all type 3 languages are also type 1 languages.

In progressing from type 0 to type 3 grammars, language complexity, and hence the complexity of recognisers, parsers or compilers, decreases.

Type 3 is easier than type 2, because type 3 languages can be recognised by finite-state automata.
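A type 3 (regular) grammar can be executed directly as a finite-state recogniser, which is why these grammars lead to such simple parsers. The sketch below recognises the language {xy^n | n > 0} from the earlier 'Sets' example, using the illustrative grammar S → xY, Y → yY | y:

```python
# Type 3 (regular) productions run directly as a finite-state machine.
# This sketch recognises {x y^n | n > 0} via the illustrative grammar
#   S -> xY,  Y -> yY | y
TRANSITIONS = {
    ("S", "x"): "Y",   # S -> x Y
    ("Y", "y"): "Y",   # Y -> y Y  (stay in Y on further y's)
}
ACCEPTING_ON = {("Y", "y")}  # the production Y -> y may end the string

def accepts(string):
    state = "S"
    for ch in string:
        if (state, ch) not in TRANSITIONS:
            return False       # no production applies: reject
        last = (state, ch)
        state = TRANSITIONS[(state, ch)]
    # accept only if the final step used a string-ending production
    return len(string) > 0 and last in ACCEPTING_ON

print(accepts("xyyy"))  # True
print(accepts("x"))     # False: at least one y is required
```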

Protonotions (used in two-level grammars): sequences of “small syntactic marks”, composed essentially of lower-case letters (with spaces allowed for readability), used to represent terminal symbols, e.g. ‘letter a symbol’ – a symbol with ‘a’ representation. Each rule is followed by a colon; alternatives are separated by semicolons (;), rules end with a full stop (.), and members are separated by commas (,).


Semantics

Semantic rules specify the meanings or actions of all valid programs. (Techniques for semantic specification are much more difficult than those for syntax specification, and are not yet so well developed.)

A compiler is concerned with two processes:

1. the analysis of the source program (the concern of syntax)

2. the synthesis of the object program.

Syntax is largely concerned with the analysis phase.

Semantics is largely concerned with the synthesis phase (+ sometimes syntax).

Specification of semantics:

- operational approach
- denotational semantics
- axiomatic approach

Parsing

e.g. the productions

<expression> → <term> | <expression> + <term>
<term> → <primary> | <term> * <primary>
<primary> → a | b | c

can be used to generate expressions such as a * b + c.

However, the compiler has to reverse this process, that is, perform a syntax analysis of the string, to determine whether a string such as a * b + c is a valid expression and, if so, how it is structured in terms of the units <term> and <primary>.

The parse tree


Example (how the string a * b + c may be reduced)

Given the three productions defining <expression>, <term> and <primary>, the string a * b + c can be reduced as:

a * b + c
⇒ <primary> * b + c (<primary> → a)
⇒ <primary> * <primary> + c (<primary> → b)
⇒ <primary> * <primary> + <primary> (<primary> → c)
⇒ <term> * <primary> + <primary> (<term> → <primary>)
⇒ <term> * <primary> + <term> (<term> → <primary>)
⇒ <term> + <term> (<term> → <term> * <primary>)
⇒ <expression> + <term> (<expression> → <term>)
⇒ <expression> (<expression> → <expression> + <term>)

However, if the productions are used differently:

a * b + c
⇒ <primary> * b + c (<primary> → a)
⇒ <primary> * <primary> + c (<primary> → b)
⇒ <primary> * <primary> + <primary> (<primary> → c)
⇒ <primary> * <term> + <primary> (<term> → <primary>)
⇒ <primary> * <expression> + <primary> (<expression> → <term>)
⇒ <primary> * <expression> + <term> (<term> → <primary>)
⇒ <primary> * <expression> (<expression> → <expression> + <term>)
⇒ <term> * <expression> (<term> → <primary>)
⇒ <expression> * <expression> (<expression> → <term>)

and the reduction then becomes stuck, implying the false deduction that a * b + c is not a sentence.

The parsing process is not a trivial matter!

Syntactic structure of the string a * b + c:

<expression>
├── <expression>
│   └── <term>
│       ├── <term>
│       │   └── <primary>
│       │       └── a
│       ├── *
│       └── <primary>
│           └── b
├── +
└── <term>
    └── <primary>
        └── c


The tree combines the relevant information contained in the set of productions, together with the content of the original sentence, in a structure that is self-contained and which can be used by subsequent phases of the compiler.

Parsing strategies

Naive approach to the problem of parsing: take the starting symbol of the language and generate all possible sentences from it. If the input matches one of these sentences, then the input string is a valid sentence of the language (an impractical approach, since there are infinitely many possible sentences).

Two categories:

1. Top-down parsers: start at the root (the starting symbol) and proceed to the leaves.

2. Bottom-up parsers: start at the leaves and move up towards the root.

Top-down parsers are easy to write, the actual code being capable of derivation directly from the production rules, but the approach cannot always be applied. Bottom-up parsers can handle a larger set of grammars.

Top-down parsing

The parsing process starts at the root of the parse tree; it first considers the starting symbol of the grammar.

The goal: to produce, from this starting symbol, the sequence of terminal symbols that have been presented as input to the parser.
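A minimal top-down (recursive-descent) sketch for the expression grammar used in this chapter is shown below. Note one adjustment that is assumed here: the left-recursive productions are rewritten as iteration, since direct left recursion defeats a top-down parser.

```python
# A minimal recursive-descent (top-down) parser for the grammar
#   <expression> -> <term> | <expression> + <term>
#   <term>       -> <primary> | <term> * <primary>
#   <primary>    -> a | b | c
# Left recursion is rewritten as iteration (while loops), a standard
# adjustment needed before top-down parsing can be applied.

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected):
        nonlocal pos
        assert peek() == expected, f"expected {expected}, got {peek()}"
        pos += 1

    def primary():
        tok = peek()
        assert tok in ("a", "b", "c"), f"unexpected token {tok}"
        eat(tok)
        return tok

    def term():
        node = primary()
        while peek() == "*":        # <term> * <primary>, iteratively
            eat("*")
            node = ("*", node, primary())
        return node

    def expression():
        node = term()
        while peek() == "+":        # <expression> + <term>, iteratively
            eat("+")
            node = ("+", node, term())
        return node

    tree = expression()
    assert pos == len(tokens), "trailing input"
    return tree

print(parse(list("a*b+c")))  # ('+', ('*', 'a', 'b'), 'c')
```

The nested tuples mirror the parse tree shown earlier: the * binds more tightly than the +, exactly as the grammar dictates.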

Bottom-up parsing

Starts with the input string and repeatedly replaces strings on the right-hand sides of productions by the corresponding strings on the left-hand sides of productions, until, hopefully, just the starting symbol remains.

Necessary to determine which strings should be replaced and in what order the replacement should occur.

Process of derivation (leftmost, rightmost).

Leftmost derivation of a * b + c:

<expression>
⇒ <expression> + <term>
⇒ <term> + <term>
⇒ <term> * <primary> + <term>
⇒ <primary> * <primary> + <term>
⇒ a * <primary> + <term>
⇒ a * b + <term>
⇒ a * b + <primary>
⇒ a * b + c

Rightmost derivation:

<expression>
⇒ <expression> + <term>
⇒ <expression> + <primary>
⇒ <expression> + c
⇒ <term> + c
⇒ <term> * <primary> + c
⇒ <term> * b + c
⇒ <primary> * b + c
⇒ a * b + c


Handle: the substring that is reduced, replaced by the left-hand side of the corresponding production.

Example

Parse of a * b + c:

Sentential form             Handle                  Production used                         Reduced sentential form
a * b + c                   a                       <primary> → a                           <primary> * b + c
<primary> * b + c           <primary>               <term> → <primary>                      <term> * b + c
<term> * b + c              b                       <primary> → b                           <term> * <primary> + c
<term> * <primary> + c      <term> * <primary>      <term> → <term> * <primary>             <term> + c
<term> + c                  <term>                  <expression> → <term>                   <expression> + c
<expression> + c            c                       <primary> → c                           <expression> + <primary>
<expression> + <primary>    <primary>               <term> → <primary>                      <expression> + <term>
<expression> + <term>       <expression> + <term>   <expression> → <expression> + <term>    <expression>

i.e. the canonical parse (canonical derivation).
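The canonical parse above can be reproduced with a toy shift-reduce loop. The greedy strategy below (longest right-hand side first, plus one ad-hoc check to avoid reducing <term> to <expression> when ‘*’ follows) happens to locate the handle correctly for this grammar and input; a real bottom-up parser needs a systematic method.

```python
# A toy shift-reduce loop for the expression grammar. Reductions are
# applied greedily, longest right-hand side first, with one ad-hoc rule
# (do not reduce <term> to <expression> while '*' is the next input
# symbol). This works here, but is not a general parsing method.
PRODUCTIONS = [  # ordered: longer right-hand sides first
    ("<expression>", ["<expression>", "+", "<term>"]),
    ("<term>", ["<term>", "*", "<primary>"]),
    ("<expression>", ["<term>"]),
    ("<term>", ["<primary>"]),
    ("<primary>", ["a"]), ("<primary>", ["b"]), ("<primary>", ["c"]),
]

def parse(tokens):
    stack, rest = [], list(tokens)
    while True:
        reduced = True
        while reduced:                      # reduce as long as possible
            reduced = False
            for lhs, rhs in PRODUCTIONS:
                blocked = rhs == ["<term>"] and rest[:1] == ["*"]
                if not blocked and stack[-len(rhs):] == rhs:
                    stack[-len(rhs):] = [lhs]   # replace handle by LHS
                    reduced = True
                    break
        if not rest:
            break
        stack.append(rest.pop(0))           # shift the next input symbol
    return stack

print(parse(list("a*b+c")))  # ['<expression>']
```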


Notes

Compilers can often be conveniently structured into four phases: lexical analysis, syntax analysis, semantic analysis and code generation. The first three of these phases are concerned with the analysis of the source program whereas code generation is concerned with the synthesis of the object program.

There are several widely used techniques for the specification of the syntax of programming languages. BNF is a particularly popular metalanguage used for this purpose.

The major part of the formal specification of the grammar of a language is the set of productions. The Chomsky classification groups grammars according to the form of their productions.

It may be possible to make use of two-level grammars to express language features, such as context sensitivity. There are several other approaches to grammar specification which are sometimes used.

Formal techniques are also available for the specification of semantics.

Parsers can be broadly classified into two groups: top-down parsers and bottom-up parsers. Top-down parsers try to achieve the goal of recognising the starting symbol by repeatedly subdividing the goal until terminal symbols from the input string can be matched. Bottom-up parsers work directly on the input strings, repeatedly replacing symbols matching the right-hand sides of productions by the corresponding symbols on the left-hand sides of productions until just the starting symbol remains.
