Lexing

COS 320
Compilers
David Walker
Outline
• Last Week
– Introduction to ML
• Today:
– Lexical Analysis
– Reading: Chapter 2 of Appel
The Front End
stream of characters → Lexer → stream of tokens → Parser → abstract syntax → Type Checker

• Lexical Analysis: Create sequence of tokens from characters
• Syntax Analysis: Create abstract syntax tree from sequence of tokens
• Type Checking: Check program for well-formedness constraints
Lexical Analysis
• Lexical Analysis: Breaks stream of ASCII
characters (source) into tokens
• Token: An atomic unit of program syntax
– i.e., a word as opposed to a sentence
• Tokens and their types:

  Characters Recognized        Type      Token
  foo, x, listcount            ID        ID(foo), ID(x), ...
  10.45, 3.14, -2.1            REAL      REAL(10.45), REAL(3.14), ...
  ;                            SEMI      SEMI
  (                            LPAREN    LPAREN
  50, 100                      NUM       NUM(50), NUM(100)
  if                           IF        IF
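
In ML, the token types above might be declared as a datatype (a hedged sketch; the constructor names simply follow the table):

  (* tokens carry their "value" where one exists *)
  datatype token =
      ID of string
    | REAL of real
    | NUM of int
    | SEMI
    | LPAREN
    | IF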
Lexical Analysis Example

• Input stream of characters:

  x = ( y + 4.0 ) ;

• Stream of tokens produced by lexical analysis:

  ID(x) ASSIGN LPAREN ID(y) PLUS REAL(4.0) RPAREN SEMI
Lexer Implementation

• Implementation Options:
  1. Write a Lexer from scratch
     – Boring, error-prone, and too much work
  2. Use a Lexer Generator
     – Quick and easy. Good for lazy compiler writers.

  Lexer Specification → lexer generator → Lexer
  stream of characters → Lexer → stream of tokens

• How do we specify the lexer?
  – Develop another language
  – We’ll use a language involving regular expressions to specify tokens
• What is a lexer generator?
  – Another compiler ....
Some Definitions

• We will want to define the language of legal tokens our lexer can recognize
  – Alphabet – a collection of symbols (ASCII is an alphabet)
  – String – a finite sequence of symbols taken from our alphabet
  – Language of legal tokens – a set of strings
    • Language of ML keywords – set of all strings which are ML keywords (FINITE)
    • Language of ML tokens – set of all strings which map to ML tokens (INFINITE)
• Some people use the word “language” to mean more general sets:
  – e.g., ML Language – set of all strings representing correct ML programs (INFINITE)
Regular Expressions: Construction

• Base Cases:
  – For each symbol a in the alphabet, a is an RE denoting the set {a}
  – Epsilon (ε) is an RE denoting {“”}, the set containing only the empty string
• Inductive Cases (M and N are REs):
  – Alternation (M | N) denotes strings in M or N
    • (a | b) == {a, b}
  – Concatenation (M N) denotes strings in M concatenated with strings in N
    • (a | b) (a | c) == {aa, ac, ba, bc}
  – Kleene closure (M*) denotes strings formed by any number of repetitions of strings in M
    • (a | b)* == {ε, a, b, aa, ab, ba, bb, ...}
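
These constructors translate directly into an ML datatype (a minimal sketch; the type re and its constructors are ours for illustration, not part of any library):

  (* regular expressions over characters, mirroring the cases above *)
  datatype re =
      Sym of char      (* a single alphabet symbol *)
    | Eps              (* epsilon: the empty string *)
    | Alt of re * re   (* alternation    M | N *)
    | Seq of re * re   (* concatenation  M N   *)
    | Star of re       (* Kleene closure M*    *)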
Regular Expressions

• Integers begin with an optional minus sign, continue with a sequence of digits
• Regular Expression:

  (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*

• So writing (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9), and even worse (a | b | c | ...), gets tedious...
Regular Expressions

• Common abbreviations:
  – [a-c]  ==  (a | b | c)
  – .      ==  any character except \n
  – \n     ==  newline character
  – a+     ==  one or more repetitions of a
  – a?     ==  zero or one occurrence of a
• All abbreviations can be defined in terms of the “standard” regular expressions
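
For instance, the last two abbreviations desugar into the core constructors sketched above (the helper names plus and opt are ours):

  fun plus m = Seq (m, Star m)   (* a+  ==  a a*     *)
  fun opt  m = Alt (m, Eps)      (* a?  ==  (a | ε)  *)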
Ambiguous Token Rule Sets
• A single expression is a completely
unambiguous specification of a token.
• Sometimes, when we put together a set of
regular expressions to specify all of the
tokens in a language, ambiguities arise:
– i.e., two regular expressions match the same string
Ambiguous Token Rule Sets

• Example:
  – Identifier tokens: a-z (a-z | 0-9)*
  – Sample keyword tokens: if, then, ...
• How do we tokenize:
  – foobar ==> ID(foobar) or ID(foo) ID(bar)?
  – if     ==> ID(if) or IF?
Ambiguous Token Rule Sets
• We resolve ambiguities using two rules:
– Longest match: The regular expression that
matches the longest string takes precedence.
– Rule Priority: The regular expressions
identifying tokens are written down in
sequence. If two regular expressions match
the same (longest) string, the first regular
expression in the sequence takes
precedence.
Ambiguous Token Rule Sets

• Example:
  – Identifier tokens: a-z (a-z | 0-9)*
  – Sample keyword tokens: if, then, ...
• Applying the rules:
  – foobar ==> ID(foobar)   (longest match rules out ID(foo) ID(bar))
  – if     ==> IF           (rule priority: the keyword rule is listed before the identifier rule)
Lexer Implementation

• Implementation Options:
  1. Write Lexer from scratch
     – Boring and error-prone
  2. Use Lexical Analyzer Generator
     – Quick and easy

ml-lex is a lexical analyzer generator for ML.
lex and flex are lexical analyzer generators for C.
ML-Lex Specification
• Lexical specification consists of 3 parts:
User Declarations
%%
ML-LEX Definitions
%%
Rules
User Declarations
• User Declarations:
– User can define various values that are
available to the action fragments.
– Two values must be defined in this section:
• type lexresult
– type of the value returned by each rule action.
• fun eof ()
– called by lexer when end of input stream is reached.
ML-LEX Definitions
• ML-LEX Definitions:
– User can define regular expression
abbreviations:
DIGITS = [0-9]+;
LETTER = [a-zA-Z];
– Define multiple lexers to work together. Each
is given a unique name.
%s LEX1 LEX2 LEX3;
Rules

• Rules:

  <lexer_list> regular_expression => (action.code) ;

• A rule consists of a pattern and an action:
  – Pattern is a regular expression.
  – Action is a fragment of ordinary ML code.
  – Longest match & rule priority are used for disambiguation.
• Rules may be prefixed with the list of lexers that are allowed to use this rule.
Rules
• Rule actions can use any value defined in the
User Declarations section, including
– type lexresult
• type of value returned by each rule action
– val eof : unit -> lexresult
• called by lexer when end of input stream reached
• Special variables:
  – yytext: input substring matched by the regular expression
  – yypos: file position of the beginning of the matched string
  – continue (): used to recursively call the lexer
A Simple Lexer

type lexresult = token      (* mandatory *)
fun eof () = EOF            (* mandatory *)
datatype token = Num of int | Id of string | IF | THEN | ELSE | EOF
fun itos s = case Int.fromString s of SOME x => x | NONE => raise Fail "not an int"
%%
NUM = [1-9][0-9]*;
ID  = [a-zA-Z] ([a-zA-Z] | [0-9])*;
%%
if    => (IF);
then  => (THEN);
else  => (ELSE);
{NUM} => (Num (itos yytext));
{ID}  => (Id yytext);
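
To run this specification (a hedged sketch: assuming it is saved as lexer.lex, ml-lex generates lexer.lex.sml, whose structure is named Mlex by default):

  (* makeLexer takes a reader of type int -> string; each call to the
     resulting function returns one token *)
  val lexer = Mlex.makeLexer (fn n => TextIO.inputN (TextIO.stdIn, n))
  fun tokens () =
    case lexer () of
        EOF => [EOF]
      | t   => t :: tokens ()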
Using Multiple Lexers
• Rules prefixed with a lexer name are matched
only when that lexer is executing
• Enter a new lexer using the command YYBEGIN
• The initial lexer is called INITIAL
Using Multiple Lexers

type lexresult = unit   (* mandatory *)
fun eof () = ()         (* mandatory *)
%%
%s COMMENT;
%%
<INITIAL> if          => ();
<INITIAL> [a-z]+      => ();
<INITIAL> "(*"        => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)"        => (YYBEGIN INITIAL; continue ());
<COMMENT> "\n" | .    => (continue ());
A (Marginally) More Exciting Lexer

type lexresult = string                        (* mandatory *)
fun eof () = (print "End of file\n"; "EOF")    (* mandatory *)
%%
%s COMMENT;
INT = [1-9][0-9]*;
%%
<INITIAL> if          => ("IF");
<INITIAL> then        => ("THEN");
<INITIAL> {INT}       => ("INT(" ^ yytext ^ ")");
<INITIAL> "(*"        => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)"        => (YYBEGIN INITIAL; continue ());
<COMMENT> "\n" | .    => (continue ());
Implementing Lexers
• By compiling, of course:
– convert REs into non-deterministic finite
automata
– convert non-deterministic finite automata into
deterministic finite automata
– convert deterministic finite automata into a
blazingly fast table-driven algorithm
• you did everything but possibly the last
step in your favorite algorithms class
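
As a taste of the first step, here is a hedged sketch of Thompson’s construction over the re datatype from earlier: each case builds an NFA as (start, final, edges), where an edge labeled NONE is an epsilon transition (all names are ours, for illustration):

  (* generate fresh NFA state names *)
  local val counter = ref 0
  in fun fresh () = (counter := !counter + 1; !counter) end

  (* compile : re -> int * int * (int * char option * int) list *)
  fun compile (Sym c) =
        let val s = fresh () val f = fresh ()
        in (s, f, [(s, SOME c, f)]) end
    | compile Eps =
        let val s = fresh () val f = fresh ()
        in (s, f, [(s, NONE, f)]) end
    | compile (Alt (m, n)) =
        let val (ms, mf, me) = compile m
            val (ns, nf, ne) = compile n
            val s = fresh () val f = fresh ()
        in (s, f, (s, NONE, ms) :: (s, NONE, ns) ::
                  (mf, NONE, f) :: (nf, NONE, f) :: me @ ne) end
    | compile (Seq (m, n)) =
        let val (ms, mf, me) = compile m
            val (ns, nf, ne) = compile n
        in (ms, nf, (mf, NONE, ns) :: me @ ne) end
    | compile (Star m) =
        let val (ms, mf, me) = compile m
            val s = fresh ()
        in (s, s, (s, NONE, ms) :: (mf, NONE, s) :: me) end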
Table-driven algorithm
• DFA: [figure omitted: an example DFA with states 1–4 and its
  state-transition table]
• Remember the start position in the character stream
• Keep reading characters and moving from state to state until no
  transitions apply
• An auxiliary table maps final states to the token type identified;
  the matched text (yytext) = input from start to current position
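
A hedged sketch of this loop in ML (edges stands in for the transition table and is assumed to have type int * char -> int option; state 1 is the start state):

  (* run the DFA until no transition applies; return the state where we
     stopped and the position one past the last consumed character *)
  fun run edges (input, pos, state) =
    if pos >= String.size input then (state, pos)
    else
      case edges (state, String.sub (input, pos)) of
          NONE        => (state, pos)
        | SOME state' => run edges (input, pos + 1, state')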
Table-driven algorithm
• DFA: [figure omitted: a two-state DFA for identifiers, with a-z edges
  leading from state 1 to state 2 and looping on state 2]
• Detail: how to deal with longest match?
– when reading “iffy” should recognize “iffy” as
ID, not “if” as keyword and then “fy” as ID
Table-driven algorithm
• Detail: how to deal with longest match?
– save most recent final state seen and
position in character string
– when no more transition can be made, revert
to last saved legal final state
– see Appel 2.4 for more details
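
Extending the run loop above with exactly this bookkeeping (a sketch; isFinal is an assumed predicate telling us whether a state is accepting):

  (* track the last accepting (state, position) seen; when the DFA gets
     stuck, revert to it (NONE means no token matched at all) *)
  fun longestMatch edges isFinal (input, start) =
    let
      fun loop (state, pos, best) =
        let val best' = if isFinal state then SOME (state, pos) else best
        in
          if pos >= String.size input then best'
          else
            case edges (state, String.sub (input, pos)) of
                NONE        => best'
              | SOME state' => loop (state', pos + 1, best')
        end
    in
      loop (1, start, NONE)
    end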
Summary
• A Lexer:
– input: stream of characters
– output: stream of tokens
• Writing lexers by hand is boring, so we
use a lexer generator: ml-lex
– lexer generators work by converting REs through automata theory
  to efficient table-driven algorithms
– theory wins again.