Lexical Analysis

ACSC373 – Compiler Writing
Chapter 5 – Lexical Analysis
Introduction
The lexical analyser (or scanner) has to recognise the basic tokens making up the source
program input to the compiler.
Tokens are syntactically simple in structure, and can be recognised by a comparatively
simple algorithm.
The lexical analyser relieves the burden on the syntax analyser: if the syntax of the tokens
can be expressed in terms of a regular grammar, the construction of a lexical analyser is
straightforward.

The lexical analyser breaks up the program input into a sequence of tokens – the
basic syntactic component of the language.
The lexical analyser is also responsible for decoding the lexical representation of
symbols in the source program.
(e.g. possible to modify a Pascal compiler to accept the character @ instead of ^
by simply making a minor modification to the lexical analyser).

The spelling of reserved words should be a concern only of the lexical analyser (e.g.
all Pascal reserved words could be replaced by their Welsh translations simply by
changing the lexical analyser's table).
Tokens
The lexical analyser recognises the basic syntactic components of a programming
language. For example, a lexical analyser for Pascal would recognise identifiers, reserved
words, numerical constants, strings, punctuation symbols and special symbols. These
tokens would be passed to the syntax analyser as single values (nature, value or
identification).
So, the lexical analyser has to perform the dual task of recognising the token and
sometimes evaluating it.
Significant Problem: the need for “look-ahead”
It may be necessary to read several characters of a token before the
type of that token can be determined. The degree of lookahead
depends on the particular tokens of the language.
PASCAL – a good language in this respect (most tokens can be distinguished using just
one- or two-character lookahead).
ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona
e.g. on reading a + character, a Pascal lexical analyser can immediately return the
symbol to the syntax analyser;
but several characters may have to be read to determine that the token 12345.67 is
going to be a real rather than an integer.
FORTRAN – DO 1 I = 1
Blanks are not significant in FORTRAN, so the lexical analyser has to reach the end
of the statement before it can determine that it should return the identifier DO1I
rather than the DO that starts a DO loop.
Furthermore, in some languages the role of the lexical analyser becomes merged
with that of the syntax analyser.
e.g. IF (GT.GT.DO) IF = DO + GT
Are IF, DO and GT special keywords or identifiers? This can be partially resolved
by the syntax analyser offering context information to the lexical analyser when
required.
Regular Grammars
Chomsky type 3 (finite-state or regular) grammars have special significance for lexical
analysis.
Recall that type 3 grammars have productions of the form:
A → a
or
A → aB
i.e. syntactically simple structures.
Alternation and repetition can be specified by these grammars, but more complex
structures such as balanced parentheses cannot be handled.
The strength of these grammars is that it is possible to construct simple and efficient parsers for them.
If the lexical tokens of a programming language can be defined in terms of a type 3
grammar, then an efficient lexical analyser for a compiler for that language can be
constructed simply, derived directly from the set of production rules defining the tokens.
Notation:
Regular Expressions
A regular expression consists of symbols (in the alphabet of the language that is being
defined) and a set of operators that allow repetition, concatenation and alternation to
be expressed.
In terms of precedence (highest first):
(1) – Repetition (*) – (zero or more)
(2) – Concatenation (adjacent units)
(3) – Alternation (units separated by |)
Example
- a b denotes the set of strings {ab} – this set contains just one member
- a | b denotes {a, b}
- a* denotes {ε, a, aa, aaa, …}
- a b* denotes {a, ab, abb, abbb, …}
- (a | b)* denotes the set of strings made up of zero or more instances of an ‘a’ or a ‘b’
- (a b | c)* d denotes {d, abd, cd, abcd, ababcd, …}
Equivalent regular expressions: regular expressions are equivalent when they denote the
same set of strings, that is, the same language.
e.g. a (b | c) and a b | a c are equivalent.
Regular grammars and regular expressions are equivalent notations.
i.e. given a set of productions defining a regular grammar, it is possible to convert
them to an equivalent regular expression by an algorithmic process. Or, similarly,
a regular expression can be converted into an equivalent set of production rules.
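This equivalence can be checked mechanically by brute force; a small Python sketch using the standard re module (it compares the two languages over all short strings rather than proving equivalence in general):

```python
import re
from itertools import product

# Two regular expressions claimed to be equivalent.
r1 = re.compile(r"a(b|c)")
r2 = re.compile(r"ab|ac")

def language(r, alphabet="abc", max_len=4):
    """The set of strings over the alphabet, up to max_len, that r matches."""
    strings = (
        "".join(p)
        for n in range(max_len + 1)
        for p in product(alphabet, repeat=n)
    )
    return {s for s in strings if r.fullmatch(s)}

print(language(r1) == language(r2))  # True: both denote {ab, ac}
```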
Regular expressions can be used to specify the syntax of the lexical tokens of a
programming language in a clear and concise way.
e.g. the syntax of an integer constant:
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Sign → + | - | ε
Integerconstant → sign digit digit*
i.e. an optional sign followed by one or more digits.
An identifier:
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Letter → a | b | c | … | z | A | B | C | … | Z
Identifier → letter (letter | digit)*
i.e. an initial letter followed by a string of letters or digits
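Both definitions translate directly into the pattern notation used by practical tools; a Python sketch (the patterns mirror the productions, with the sign optional because Sign → + | - | ε):

```python
import re

# Integerconstant -> sign digit digit*   (sign -> + | - | epsilon)
integer_constant = re.compile(r"[+-]?[0-9][0-9]*")

# Identifier -> letter (letter | digit)*
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

print(bool(integer_constant.fullmatch("-123")))  # True
print(bool(identifier.fullmatch("x25")))         # True
print(bool(identifier.fullmatch("25x")))         # False: must start with a letter
```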
Regular expressions can be represented diagrammatically,
e.g. (a b | c) * d
[Syntax diagram for (a b | c)* d, with branches labelled a, b, c and d]
Finite-State Automata
It is possible to represent a regular expression as a transition diagram, that is, a directed
graph having labelled branches
e.g. (a b | c ) * d
The nodes (states) are enclosed in circles labelled with a state number. Double circles
mark accepting states (i.e. a state reached if the expression to be parsed has been
successfully recognised). Arrowed lines are the edges or transitions.
[Transition diagram for (a b | c)* d: start in state 1; edge a from state 1 to state 2;
edge b from state 2 back to state 1; edge c from state 1 to itself; edge d from state 1
to the accepting state 3]
Parser Action (of abcd)
1. Start in state 1
2. Input a, transition to state 2
3. Input b, transition to state 1
4. Input c, transition to state 1
5. Input d, transition to state 3, an accepting state → the parse succeeds
Another input: ac
1. Start in state 1
2. Input a, transition to state 2
3. Input c: there are no edges labelled c from state 2, and hence this parse fails.
(Since the parser knows that it is in state 2 when the parse fails, it can output the
informative message that it was expecting the input of a b – the only edge emerging
from state 2.)
Transition table (of (a b | c)* d):

State |  a  |  b  |  c  |  d
  1   |  2  |  –  |  1  |  3
  2   |  –  |  1  |  –  |  –
  3   |  finished
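A transition table like this can be executed directly: keep a current state, look up (state, input) pairs, and accept if the input ends in an accepting state. A minimal Python sketch of the DFA for (a b | c)* d, using the state numbering above:

```python
# Transition table for (a b | c)* d, as in the diagram:
# state 1: a -> 2, c -> 1, d -> 3 (accepting); state 2: b -> 1.
TABLE = {
    (1, "a"): 2, (1, "c"): 1, (1, "d"): 3,
    (2, "b"): 1,
}
ACCEPTING = {3}

def accepts(s):
    state = 1
    for ch in s:
        if (state, ch) not in TABLE:
            return False          # no edge for this input: the parse fails
        state = TABLE[(state, ch)]
    return state in ACCEPTING

print(accepts("abcd"))  # True  (the trace 1 -a-> 2 -b-> 1 -c-> 1 -d-> 3)
print(accepts("ac"))    # False (no edge labelled c from state 2)
```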
The transition table, like the transition diagram, describes a finite-state automaton.
Deterministic finite-state automaton (DFA): for any state, there can be only one possible
next state for any given input symbol (i.e. no two edges emerging from a state can be
labelled by the same symbol). No transitions can be labelled with ε (the empty string) –
all the edges have to be labelled with non-null symbols, as in the diagram above.
Non-deterministic finite-state automaton (NFA): an input symbol can cause more than
one next state. A state may have two or more edges emerging from it, labelled with the
same input symbol.
Example
[Transition diagram for a* b b* b a: state 1 loops on a and moves to state 2 on b;
state 2 loops on b and also moves to state 3 on b; state 3 moves on a to the
accepting state 4]
Two edges labelled by b emerge from state 2 → NFA.
The regular expression recognised is a* b b* b a.
Its transition table:

State |   a   |    b
  1   |  {1}  |  {2}
  2   |   –   | {2, 3}
  3   |  {4}  |   –
  4   |  finished
Sets of states have to be used (no longer single state numbers).
An NFA is somewhat more difficult to implement on a conventional computer than a DFA.
However, NFAs have the advantage that they generally require fewer states than an
equivalent DFA (one recognising the strings described by the same regular expression).
How to convert an NFA into an equivalent DFA is covered elsewhere – an exercise
for you!
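The set-valued transition table can be simulated directly by tracking a set of current states rather than a single state; a minimal Python sketch of this NFA (state numbering as in the table, for a* b b* b a):

```python
# NFA for a* b b* b a:
# state 1: a -> {1}, b -> {2}; state 2: b -> {2, 3};
# state 3: a -> {4}; state 4 is accepting.
NFA = {
    (1, "a"): {1}, (1, "b"): {2},
    (2, "b"): {2, 3},
    (3, "a"): {4},
}
ACCEPTING = {4}

def accepts(s):
    states = {1}                  # a set of states, not a single state number
    for ch in s:
        states = set().union(*(NFA.get((q, ch), set()) for q in states))
        if not states:
            return False          # no state can consume this input
    return bool(states & ACCEPTING)

print(accepts("bba"))   # True  ({1} -b-> {2} -b-> {2,3} -a-> {4})
print(accepts("ba"))    # False
```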
Implementing a Lexical Analyser
Writing a lexical analyser for Pascal (the techniques are relevant to a wide range of other
languages).
The task of the lexical analyser is to recognise the basic tokens in its input and to return
some encoded representation of these tokens to the next part of the compiler.
First Steps:
- To decide on the precise set of tokens the lexical analyser should recognise.
- The notation to be used to specify the syntax of these tokens, preferably in some
formal notation.
Tokens are identified by an integer value or equivalent,
e.g. an enumerated type in Pascal.
A convenient way is to implement the lexical analyser as a procedure or function called by
the syntax analyser.
e.g. (Pascal's lexical analyser):
type lextoken = (beginsym, endsym, ifsym, dosym, whilesym, periodsym,
commasym, semicolon …);
var token: lextoken; (* the lexical analyser updates this variable each time it is called *)
procedure NextToken;
…
NextToken has to read characters from the compiler's input and return the identity of a
single lexical token in the global variable token each time it is called.
First step – decide on the set of tokens. In Pascal there are:
- 21 special symbols
- 35 word symbols or reserved words
- one directive and
- a few alternative representations
- plus identifiers and numerical and character constants.
The lexical analyser also has:
- to skip comments and non-significant spaces, tabs and newlines
- to handle erroneous input and issue informative error messages.
Ex. Think of the corresponding tokens of either the C or C++ language.
The body of NextToken starts as follows:
while (ch = spacechar) or (ch = tabchar) do NextCh; (* skip white space *)
case ch of
  '+': begin token := plussym; NextCh end;
  '-': begin token := minussym; NextCh end;
  (* and so on for all the other single-character tokens *)
(Possibly use an array or table lookup to return the corresponding token value, as there
is a fairly large number of single-character tokens in Pascal.)
Other special symbols have to be handled more carefully, e.g. <, <= or <>:

'<':
begin
  NextCh;
  if ch = '=' then
  begin
    token := lesseqsym;
    NextCh
  end
  else if ch = '>' then
  begin
    token := noteqsym;
    NextCh
  end
  else
    token := lesssym
end;
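The same single-character-lookahead logic can be sketched outside Pascal; a hypothetical Python helper (the function name is illustrative, the token names follow the Pascal fragment):

```python
def scan_less(src, i):
    """Scan a token starting with '<' at src[i]; return (token, next index).
    One character of lookahead decides between <, <= and <>."""
    assert src[i] == "<"
    if i + 1 < len(src) and src[i + 1] == "=":
        return "lesseqsym", i + 2
    if i + 1 < len(src) and src[i + 1] == ">":
        return "noteqsym", i + 2
    return "lesssym", i + 1

print(scan_less("<=", 0))  # ('lesseqsym', 2)
print(scan_less("<a", 0))  # ('lesssym', 1)
```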
Comments
{ … } or (* … *)
When NextToken encounters a { or a (* symbol, it must call NextCh repeatedly until } or
*) is found.
One way: remove comments within NextCh. But consider
Count{this is a comment}er := 0;
If comments were removed completely by NextCh, this would be read as
Counter := 0;
Better: return a single space character to the caller on encountering a comment construct
in the input text.
Numerical constants
Integer and floating-point constants.

'0', '1', '2', '3', '4',
'5', '6', '7', '8', '9':
begin
  intval := 0;
  while ch in ['0' .. '9'] do
  begin
    (* accumulate decimal value *)
    intval := intval * 10 + ord(ch) - ord('0');
    NextCh
  end;
  token := integersym
  (* value of constant returned in intval *)
end;
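The accumulation loop is the same in any language; a Python sketch of the digit-scanning step (function name is illustrative):

```python
def scan_integer(src, i):
    """Accumulate the decimal value of the digit string starting at src[i],
    as the Pascal loop above does; returns (value, next index)."""
    intval = 0
    while i < len(src) and src[i].isdigit():
        intval = intval * 10 + (ord(src[i]) - ord("0"))
        i += 1
    return intval, i

print(scan_integer("12345;", 0))  # (12345, 5)
```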
Character and string constants
Enclosed by apostrophes, e.g. 'a'. A doubled apostrophe '' within a string denotes a
single apostrophe.
Identifiers and reserved words
while ch in validentifierchars do …
where validentifierchars is a set of characters initialised to contain the characters that
may be found within an identifier.
Example
Consider the character string abc<def input to NextToken. The letter a guides NextToken
into the routine to read an identifier; thus, the characters are read one by one until the <
character is encountered. This character is retained as the first character to be analysed
when NextToken is next called – there is no need to backspace on the input stream. This
situation where a single character causes the termination of one token and starts another
is very common and is handled naturally by the single-character lookahead.
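The behaviour described above can be sketched as a tiny scanner; a Python illustration handling only identifiers and the < token (everything else is skipped in this sketch):

```python
def tokens(src):
    """A minimal scanner sketch. The character that terminates an identifier
    is simply left as the next character to scan; no backspacing over the
    input is needed."""
    i, out = 0, []
    while i < len(src):
        if src[i].isalpha():
            start = i
            while i < len(src) and (src[i].isalpha() or src[i].isdigit()):
                i += 1                # the terminating character stays unread
            out.append(("identifier", src[start:i]))
        elif src[i] == "<":
            out.append(("lesssym", "<"))
            i += 1
        else:
            i += 1                    # skip anything else in this sketch
    return out

print(tokens("abc<def"))
# [('identifier', 'abc'), ('lesssym', '<'), ('identifier', 'def')]
```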
Errors
Errors detected by the lexical analyser include: a letter not allowed to terminate a number,
numerical overflow, end of line before end of string, and a null string.
Also, NextToken should report an error if it encounters a character that is not a member of
the character set of Pascal.
e.g. an exclamation mark outside a character string or comment.
Automating the Production of Lexical Analysers
Writing a lexical analyser from scratch is a fairly demanding task; it is easier to modify an
existing lexical analyser for a similar language.
Given the syntax specification of the lexical tokens, the coding of the lexical analyser
should be mechanical.
Software tools can automate the production of lexical analysers.
e.g. Lex (one of the many utilities of UNIX, and the most famous)
- Requires the syntax of each lexical token to be defined in terms of a regular expression.
- Lex generates a recogniser program – the lexical analyser.
- Lex transforms the regular expressions into an equivalent deterministic finite-state
automaton.
- Rapidly produces reliable and efficient lexical analysers, while the control file for Lex
provides clear and accurate documentation.
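Lex itself reads a control file of regular expressions and generates a recogniser in C; the same table-driven idea can be sketched in Python with the standard re module (the token names and patterns here are illustrative, not Lex syntax):

```python
import re

# Each token class is a (name, regular expression) pair; the order
# resolves conflicts, so reserved words precede identifiers.
SPEC = [
    ("whitespace", r"[ \t\n]+"),
    ("beginsym",   r"begin\b"),
    ("endsym",     r"end\b"),
    ("integersym", r"[0-9]+"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("plussym",    r"\+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

def tokens(src):
    """Yield (token name, matched text) pairs, skipping white space."""
    for m in MASTER.finditer(src):
        if m.lastgroup != "whitespace":   # skip non-significant spaces
            yield m.lastgroup, m.group()

print(list(tokens("begin x + 12 end")))
```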