Scanner
Chung Cheng Institute of Technology
Associate Professor, Computer Center
許良全
Overview of Scanning



The purpose of a scanner is to group input characters into tokens.
A scanner is sometimes called a lexical analyzer.
A precise definition of tokens is necessary to ensure that lexical rules are properly enforced.
All scanners perform much the same function.
The point of using a scanner generator is to limit the effort of building a scanner from scratch.
Scanners normally seek to make a token as long as possible. E.g., ABC is scanned as one identifier rather than three.
Compiler Design
Copyright © 1998 by LCH
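The longest-token rule from the overview can be illustrated with a small hand-written scanner. This is only a sketch: the token classes ID, NUM, and OP and the function name scan are assumptions made for the example, not part of the slides.

```python
def scan(text):
    """Group characters into tokens, always taking the longest possible token."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1  # whitespace separates tokens and is discarded
        elif ch.isalpha():
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1  # keep extending the identifier while it can grow
            tokens.append(("ID", text[i:j]))
            i = j
        elif ch.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1  # same rule for numbers: take every digit available
            tokens.append(("NUM", text[i:j]))
            i = j
        else:
            tokens.append(("OP", ch))  # any other character is a one-character token
            i += 1
    return tokens


print(scan("ABC+12"))  # [('ID', 'ABC'), ('OP', '+'), ('NUM', '12')]
```

ABC comes out as one identifier because the inner loop only stops when the next character can no longer extend the token.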
Finite State Systems

The finite state automaton is a mathematical model
of a system with discrete inputs and outputs.
Examples of Finite State Systems

Elevators: they do not remember all previous requests for service, but only the current floor, the direction of motion, and the collection of not-yet-satisfied requests for service.
Vending machines: insert enough coins and you'll get a Pepsi eventually.
Computers: the state of the CPU, main memory, and auxiliary storage at any time is one of a very large but finite number of states.
Human brains: 2^35 cells or neurons at most.
Definition of Finite Automata

A finite automaton (FA) is an idealized computer that recognizes strings belonging to regular sets. It is a 5-tuple (Q, Σ, δ, q0, F):
- A finite set of states, Q.
- A finite input alphabet, Σ, or vocabulary, V.
- A special start, or initial, state q0, with q0 ∈ Q.
- A set of final, or accepting, states F, with F ⊆ Q.
- A transition function, δ, that maps Q × Σ to Q.
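The five components of a finite automaton map directly onto a small data structure. The sketch below is one possible encoding, not the slides' own: the class name FiniteAutomaton and the example machine are hypothetical, and a missing entry in the transition table is treated as "reject", which stands in for an explicit error state.

```python
from dataclasses import dataclass


@dataclass
class FiniteAutomaton:
    """A deterministic FA stored as its five components."""
    states: set      # Q
    alphabet: set    # the input alphabet (vocabulary V)
    delta: dict      # transition function, stored as {(state, symbol): next_state}
    start: str       # q0, must be a member of Q
    finals: set      # F, must be a subset of Q

    def __post_init__(self):
        assert self.start in self.states
        assert self.finals <= self.states

    def accepts(self, string):
        state = self.start
        for symbol in string:
            # Assumption: a missing table entry means no legal move, so reject.
            if (state, symbol) not in self.delta:
                return False
            state = self.delta[(state, symbol)]
        return state in self.finals


# Hypothetical example machine: one or more a's followed by a single b.
fa = FiniteAutomaton(
    states={"q0", "q1", "q2"},
    alphabet={"a", "b"},
    delta={("q0", "a"): "q1", ("q1", "a"): "q1", ("q1", "b"): "q2"},
    start="q0",
    finals={"q2"},
)

print(fa.accepts("aab"))  # True
print(fa.accepts("b"))    # False
```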
FA and Transition Diagrams
[Figure: a transition diagram with edges labeled a, a, b, and c. The legend identifies a state, a transition, the start state, and a final state.]
FA and Transition Tables
[Table: a transition table with one row per state (q0, ...) and one column per input (a, b, c); each entry is the next state, drawn from q1, q2, and q3.]
Regular Expressions


The languages accepted by finite automata are
easily described by simple expressions called
regular expressions.
Strings are built from characters in V via catenation, e.g., !=, for, while.
An empty or null string, denoted by λ, is allowed.
The characters (, ), ', *, +, and | are called meta-characters. They must be quoted when used, in order to avoid ambiguity. E.g.,
Delim = ('('|')'|:=|;|,|'+'|-|'*'|/|=|$$$)
Definition of Regular Expression

A regular expression denotes a set of strings:
∅ is a regular expression denoting the empty set (the set containing no strings).
λ is a regular expression denoting the set that contains only the empty string. Note that this set contains one element.
A string s is a regular expression denoting a set containing only s. If s contains meta-characters, s can be quoted to avoid ambiguity.
If A and B are regular expressions, then A|B, AB, and A* are also regular expressions, corresponding to alternation, catenation, and Kleene closure respectively.
Properties of Regular Expressions

Let P and Q be sets of strings.
The string s ∈ (P|Q) iff s ∈ P or s ∈ Q.
The string s ∈ P* iff s can be broken into zero or more pieces: s = s1 s2 s3 … sn such that each si ∈ P.
P+ denotes all strings consisting of one or more strings in P catenated together: P* = (P+|λ) and P+ = PP* = P*P.
If A is a set of characters, Not(A) denotes (V − A): all characters in V not included in A.
If k is a constant, the set A^k represents all strings formed by catenating k strings from A, i.e., A^k = (AAA…) (k copies).
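These properties can be checked on small finite sets. The sketch below treats languages as Python sets of strings; because P* is infinite in general, closure_up_to only builds the strings made of at most n pieces. All function names are assumptions chosen for this example.

```python
from itertools import product


def alternation(P, Q):
    """P|Q: a string is in the result iff it is in P or in Q."""
    return P | Q


def catenation(P, Q):
    """PQ: every string of P followed by every string of Q."""
    return {p + q for p, q in product(P, Q)}


def power(A, k):
    """A^k: all strings formed by catenating k strings from A."""
    result = {""}  # A^0 contains only the empty string
    for _ in range(k):
        result = catenation(result, A)
    return result


def closure_up_to(P, n):
    """Finite slice of P*: strings built from zero to n pieces of P."""
    result = set()
    for k in range(n + 1):
        result |= power(P, k)
    return result


def positive_closure_up_to(P, n):
    """Finite slice of P+: strings built from one to n pieces of P."""
    result = set()
    for k in range(1, n + 1):
        result |= power(P, k)
    return result


def not_in(V, A):
    """Not(A): all characters of the vocabulary V that are not in A."""
    return V - A


P = {"a", "b"}
print(sorted(power(P, 2)))  # ['aa', 'ab', 'ba', 'bb']
```

On these finite slices the identities from the slide still hold, for example closure_up_to(P, 2) equals positive_closure_up_to(P, 2) with the empty string added.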
Examples of Regular Expressions


Let D = (0|…|9), L = (A|…|Z)
A comment that begins with -- and ends with Eol:
Comment = --Not(Eol)*Eol
A fixed decimal literal:
Lit = D+.D+
An identifier, composed of letters, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscores:
ID = L(L|D)*(_(L|D)+)*
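For experimentation, the three expressions can be transcribed into Python's re module. This is a sketch under two assumptions that are not in the slides: Eol is taken to be the newline character, and L is limited to uppercase letters exactly as defined above.

```python
import re

# Comment = --Not(Eol)*Eol, with Eol assumed to be "\n"
COMMENT = re.compile(r"--[^\n]*\n")

# Lit = D+.D+  (the dot is quoted because it is a meta-character in re)
LIT = re.compile(r"[0-9]+\.[0-9]+")

# ID = L(L|D)*(_(L|D)+)*
ID = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")

print(bool(ID.fullmatch("A_B1")))   # True
print(bool(ID.fullmatch("A__B")))   # False: consecutive underscores
print(bool(ID.fullmatch("A_")))     # False: ends with an underscore
```

The shape of the ID expression is what enforces the rules: each underscore must be followed by at least one letter or digit, so an identifier can neither end in an underscore nor contain two in a row.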
Using a Scanner Generator: Lex




Lex is a lexical analyzer generator developed by
Lesk and Schmidt of AT&T Bell Labs, written in C,
running under UNIX.
Lex produces an entire scanner module that can be
compiled and linked with other compiler modules.
Lex associates regular expressions with arbitrary
code fragments. When an expression is matched,
the code segment is executed.
A typical lex program contains three sections
separated by %% delimiters.
First Section of Lex

The first section defines character classes and auxiliary regular expressions. (Fig. 3.5 on p. 67)
[] delimits character classes.
- denotes ranges: [xyz] == [x-z].
\ denotes the escape character, as in C.
^ complements a character class (Not): [^xy] denotes all characters except x and y.
|, *, and + (alternation, Kleene closure, and positive closure) are provided.
() can be used to control grouping of subexpressions.
(expr)? == (expr)|λ, i.e., it matches expr zero times or once.
{} signals the macro expansion of a symbol defined in the first section.
First Section of Lex, cont.

Catenation is specified by the juxtaposition of two expressions; no explicit operator is used.
[ab][cd] will match any of ac, ad, bc, and bd.
begin == "begin" == [b][e][g][i][n]
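Most of this notation can be tried out without Lex, because Python's re module uses the same syntax for character classes, ranges, complement, grouping, and the ? operator. The sketch below is an illustration in Python, not Lex itself; the helper matches is a name made up for the example, and Lex's {} macro expansion and double-quoted strings have no direct equivalent here.

```python
import re


def matches(pattern, text):
    """True when the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


# A range and the explicit class it abbreviates accept the same characters.
print(matches("[x-z]", "y"), matches("[xyz]", "y"))      # True True

# ^ complements a class.
print(matches("[^xy]", "x"), matches("[^xy]", "z"))      # False True

# Catenation by juxtaposition: [ab][cd] accepts exactly four strings.
print([s for s in ("ac", "ad", "bc", "bd", "ab") if matches("[ab][cd]", s)])

# (expr)? matches expr zero times or once.
print(matches("(ab)?", ""), matches("(ab)?", "ab"))      # True True

# A plain string and a sequence of one-character classes are equivalent.
print(matches("begin", "begin"), matches("[b][e][g][i][n]", "begin"))  # True True
```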


Second Section of Lex

The second section of lex defines a table of regular expressions and corresponding commands.
When an expression is matched, its associated command is executed.
Auxiliary functions may be defined in the third section.
Input that is matched is stored in the string variable yytext, whose length is yyleng.
Lex creates an integer function yylex() that may be called from the parser.
The value returned is usually the token code of the token scanned by Lex.
When yylex() encounters end of file, it calls a user-supplied integer function named yywrap() to wrap up input processing.
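The table-of-rules idea can be imitated in a few lines of Python. This is a sketch of the behaviour described above, not of the code Lex generates: the token codes, the RULES table, and the generator form of yylex are assumptions for the example. Like Lex, it prefers the longest match and lets the earlier rule win a tie.

```python
import re

# Hypothetical token codes; a real parser would supply its own.
ID, NUM, PLUS = 1, 2, 3

# Each rule pairs a regular expression with a command (here a function of yytext).
# A command of None means "match and discard", as for whitespace.
RULES = [
    (re.compile(r"[ \t\n]+"), None),
    (re.compile(r"[A-Za-z][A-Za-z0-9]*"), lambda yytext: ID),
    (re.compile(r"[0-9]+"), lambda yytext: NUM),
    (re.compile(r"\+"), lambda yytext: PLUS),
]


def yywrap():
    """User-supplied end-of-input hook; returning 1 means there is no more input."""
    return 1


def yylex(text):
    """Yield (token_code, yytext, yyleng) for each token found in text."""
    pos = 0
    while pos < len(text):
        best, best_action = None, None
        for pattern, action in RULES:
            m = pattern.match(text, pos)
            # Strictly longer matches replace the current best, so ties go
            # to the rule that appears first in the table.
            if m is not None and (best is None or m.end() > best.end()):
                best, best_action = m, action
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        yytext = best.group()
        pos = best.end()
        if best_action is not None:
            yield best_action(yytext), yytext, len(yytext)
    yywrap()  # end of input reached: let the user wrap up


print(list(yylex("abc + 42")))  # [(1, 'abc', 3), (3, '+', 1), (2, '42', 2)]
```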
Dealing with Multiple Input Files

yylex() uses three user-defined functions to
handle character I/O:



input(): retrieve a single character, 0 on EOF
output(c): write a single character to the output
unput(c): put a single character back on the input to be
re-read
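The three routines amount to a character stream with a pushback buffer. The Python sketch below models that contract only; the class name CharIO and the use of in-memory streams are assumptions for the example, and real Lex implementations define these routines in C.

```python
import io


class CharIO:
    """Character I/O with pushback, mirroring input(), output(c), and unput(c)."""

    def __init__(self, source, sink):
        self.source = source   # any object with read(n), e.g. an open file
        self.sink = sink       # any object with write(s)
        self.pushback = []     # characters waiting to be re-read, newest last

    def input(self):
        """Retrieve a single character, or 0 on EOF."""
        if self.pushback:
            return self.pushback.pop()
        ch = self.source.read(1)
        return ch if ch else 0

    def output(self, c):
        """Write a single character to the output."""
        self.sink.write(c)

    def unput(self, c):
        """Put a single character back on the input to be re-read."""
        self.pushback.append(c)


cio = CharIO(io.StringIO("ab"), io.StringIO())
first = cio.input()   # 'a'
cio.unput(first)      # the scanner decided it read one character too many
print(cio.input())    # a
```

Because the scanner only ever talks to these three routines, swapping the source object is enough to continue reading from a different input file.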
Translating Regular Expressions into Finite Automata




Remember the relationship between RE and FA.
The main job of a scanner generator program is to
transform a regular expression definition into an
equivalent (D)FA.
A regular expression is first translated into a
nondeterministic finite automaton (NFA), then
translated from NFA into DFA. (2 steps)
An NFA, when reading a particular input, is not
required to make a unique (deterministic) choice of
which state to visit.
Translating RE into NFA

Any regular expression can be transformed into an
NFA with the following properties:




There is a unique final state
The final state has no successors
Every other state has either one or two successors
Regular expressions are built out of the atomic
regular expressions a (where a is a character in V)
and λ by using the three operations AB, A|B, and
A*.
NFA for a and λ
[Figure: the NFA for a single character a and the NFA for λ.]
An NFA for A|B
[Figure: the NFA for A|B, built from the finite automaton for A and the finite automaton for B.]
An NFA for AB
[Figure: the NFA for AB, built from the finite automaton for A followed by the finite automaton for B.]
An NFA for A*


[Figure: the NFA for A*, built from the finite automaton for A.]
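The four constructions (a single character or λ, A|B, AB, and A*) can be sketched in code. Since the original figures did not survive, the exact wiring below is an assumption: it follows the usual Thompson-style construction, with a label of None standing for a λ-transition. All class and function names are made up for this example.

```python
import itertools

_ids = itertools.count()  # source of fresh state numbers


class NFA:
    def __init__(self, start, final, edges):
        self.start, self.final = start, final
        self.edges = edges  # {state: [(label, target), ...]}; label None means λ


def _merge(*parts):
    """Copy the edge tables of sub-automata into one fresh table."""
    return {s: list(out) for part in parts for s, out in part.edges.items()}


def _new_state(edges):
    state = next(_ids)
    edges[state] = []
    return state


def atom(label):
    """NFA for a single character, or for λ when label is None."""
    edges = {}
    start, final = _new_state(edges), _new_state(edges)
    edges[start].append((label, final))
    return NFA(start, final, edges)


def alternate(a, b):
    """A|B: a new start branches to A and B; both lead to a new final state."""
    edges = _merge(a, b)
    start, final = _new_state(edges), _new_state(edges)
    edges[start] += [(None, a.start), (None, b.start)]
    edges[a.final].append((None, final))
    edges[b.final].append((None, final))
    return NFA(start, final, edges)


def catenate(a, b):
    """AB: the final state of A is linked to the start state of B."""
    edges = _merge(a, b)
    edges[a.final].append((None, b.start))
    return NFA(a.start, b.final, edges)


def star(a):
    """A*: A may be skipped entirely or repeated any number of times."""
    edges = _merge(a)
    start, final = _new_state(edges), _new_state(edges)
    edges[start] += [(None, a.start), (None, final)]
    edges[a.final] += [(None, a.start), (None, final)]
    return NFA(start, final, edges)


def _closure(nfa, states):
    """All states reachable from the given states by λ-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        for label, target in nfa.edges[stack.pop()]:
            if label is None and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def accepts(nfa, string):
    current = _closure(nfa, {nfa.start})
    for ch in string:
        moved = {t for s in current for label, t in nfa.edges[s] if label == ch}
        current = _closure(nfa, moved)
    return nfa.final in current


# (a|b)*c
n = catenate(star(alternate(atom("a"), atom("b"))), atom("c"))
print(accepts(n, "abac"), accepts(n, "ab"))  # True False
```

Each builder keeps the three properties listed earlier: there is a single final state, it has no successors, and every other state has one or two successors.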


Translating NFA into DFA

Each state of the DFA (M) corresponds to a set of states of the NFA (N).
Transforming N to M is done by subset construction.
M will be in state {x,y,z} after reading a given input string if and only if N could be in any of the states x, y, or z, depending on the transitions it chooses.
M keeps track of all the possible routes N might take and runs them in parallel.
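A minimal sketch of the subset construction follows. The NFA encoding is an assumption for the example: transitions are a dictionary from (state, symbol) to a set of states, with the symbol None standing for a λ-transition, and the example NFA with states x, y, and z is hypothetical.

```python
def lambda_closure(states, delta):
    """All NFA states reachable from the given states by λ-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        for target in delta.get((stack.pop(), None), ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return frozenset(seen)


def subset_construction(start, finals, delta, alphabet):
    """Build DFA M from NFA N; each state of M is a frozenset of states of N."""
    d_start = lambda_closure({start}, delta)
    d_delta, seen, worklist = {}, {d_start}, [d_start]
    while worklist:
        current = worklist.pop()
        for symbol in sorted(alphabet):
            # Follow every route N could take on this symbol, in parallel.
            moved = {t for s in current for t in delta.get((s, symbol), ())}
            target = lambda_closure(moved, delta)
            if not target:
                continue  # N has no move at all, so M has none either
            d_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    # A state of M is accepting if it contains any accepting state of N.
    d_finals = {s for s in seen if s & finals}
    return d_start, d_finals, d_delta


def dfa_accepts(d_start, d_finals, d_delta, string):
    state = d_start
    for ch in string:
        if (state, ch) not in d_delta:
            return False
        state = d_delta[(state, ch)]
    return state in d_finals


# Hypothetical NFA N for strings over {a, b} that end in "ab".
delta = {("x", "a"): {"x", "y"}, ("x", "b"): {"x"}, ("y", "b"): {"z"}}
d_start, d_finals, d_delta = subset_construction("x", {"z"}, delta, {"a", "b"})
print(dfa_accepts(d_start, d_finals, d_delta, "aab"))  # True
```

For this example, M ends up with three states, {x}, {x,y}, and {x,z}, and only {x,z} is accepting because it contains z.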