Lecture 5-6 — Lexing
• Compilers always start by doing two things with your program:
lexing and parsing. These are really useful things to know
how to do, not just for writing compilers, but for writing any
kind of program where the input is highly structured. In many
cases, parsing is not necessary, but lexing almost always is.
There are two ways to implement a lexer: (1) write a program
(i.e. do it manually), or (2) prepare a lexer specification and
submit it to a lexer generator. We will do both.
Here are the topics for these two lectures:
• Overview of compilers
• Lexing by implementing a DFA
• Lexing with a generator (ocamllex)
CS 421 — Class 5–6, 1/29–31/13 — 1
Structure of a compiler
Detail of compiler front-end:
Lexing and tokens
• Job of the lexer (aka lexical analyzer, scanner, tokenizer) is to
divide the input into tokens, and to discard whitespace and
comments.
• Token is smallest unit useful for parsing.
• Think of the lexer as a function from a list of characters (the
source file) to a list of tokens.
Types of tokens
• Tokens for programming languages are of three types:
• Special characters and character sequences: =, ==, >=,
{, }, ;, etc.
• Keywords: class, if, int, etc.
• Token categories: For parsing purposes, all integers can
be treated the same, and similarly for floats, doubles,
characters, strings, and identifiers.
• Comments and whitespace are also handled by the lexer, but
these are not tokens because they are ignored by the parser;
the lexer deletes them.
• In Java, three identifiers are reserved (like keywords) but
are treated as identifiers by the parser: true, false, null.
Tokens of Java
• Here is a (partial) declaration of the tokens of Java:
type token = EOF | EQ | PLUS | SEMICOLON
| CLASS | INT | LPAREN | RPAREN | ...
| INTLIT of int | FLOATLIT of float
| IDENT of string | STRINGLIT of string
• E.g. Lexer would transform “class C { int process ( ...” to
[CLASS; IDENT "C"; LBRACE; INT; IDENT "process"; LPAREN; ... ]
• Now you try it: “x1 = (3 + y);” becomes:
[ IDENT "x1"; EQ; LPAREN; INTLIT 3; PLUS; IDENT "y"; RPAREN; SEMICOLON ]
Lexing vs. parsing
• Lexing is a simple, efficient pre-processing step
• Tokens can be recognized by finite automata, which are
easily implemented.
• Significant reduction in size from source file to token list
• Parsing is more complicated
• Program cannot be parsed using DFAs. (Why?)
• Many input problems can be solved by lexing alone — which
is why regular expressions are so widely used. (From 373:
finite automata ≡ regular expressions.)
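As a tiny illustration of solving an input problem with lexing alone — a sketch using OCaml's Str regular-expression library (link with str.cma; the delimiter pattern and function name are just examples, not part of the lecture):

```ocaml
(* Split a line into whitespace-separated pieces with a regular
   expression -- no parser needed.  Requires the str library. *)
let tokens (line : string) : string list =
  Str.split (Str.regexp "[ \t]+") line

let () =
  (* Str.split drops leading/trailing delimiters *)
  assert (tokens "x1 = (3 + y);" = ["x1"; "="; "(3"; "+"; "y);"])
```

Of course this only splits on whitespace; the DFA-based lexers below do real tokenization.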
Two methods of writing a lexer
• Manual:
• Design a DFA for the tokens, and implement it
• Automatic:
• Write regular expressions for the tokens. Run a program
(“lexer-generator”) that converts regular expressions to a
DFA, and then simulates the DFA.
• Automatic methods are usually easier, but sometimes not available
Building a lexer by hand
• DFA (non-standard definition):
A directed graph with vertices labelled from the set Tokens ∪ {Error, Discard,
Id or KW}, and edges labelled with sets of characters. The
outgoing edges from any one vertex are labelled with disjoint
sets. One vertex is specified as the start state.
• Note differences from standard definition:
• DFA is not “complete”: You can get to a state where
there are input characters left, but no edge that contains
the next character. This is deliberate.
• No “accept” states: Or rather, states labelled with Error
are reject states and every other state is an accept state.
Building a lexer by hand
• Given a DFA as above:
• Open source file, then:
1. From start state, run the DFA as long as possible.
2. When no more transitions are possible, do whatever the
label on the ending state says:
• Token: add to token list
• Error: print error message
• Discard: do nothing at all
• Id or KW: check for keyword; add keyword or ident
to token list.
3. Go to (1).
Lexing DFA
• The lexing DFA looks something like this:
• To build this, start by building separate DFAs for each token...
Ex: DFA for operators
• ;
• }
• <, <=, <<, <<=, <<<
• +, ++, +=
Ex: DFA for integer constants
Ex: DFA for integer or floating-point
constants
Putting DFAs together
• Combine individual DFAs by putting together their start
states (then making adjustments as necessary to ensure determinism).
• Following slides show how to implement the DFA.
• Number the nodes with positive numbers (0 for the start state)
• Define transition : int * char → int. transition(s, c)
gives the transition function of the DFA; it returns −1 if
there is no transition from state s on character c.
Lexing functions (in pseudocode)
(state * string) get_next_token() {
  s = 0; thistoken = "";
  while (true) {
    c = peek at next char
    if (transition(s,c) == -1)
      return (s, thistoken)
    move c from input to thistoken
    s = transition(s,c)
  }
}
token list get_all_tokens() {
tokenlis = []
while (next char not eof) {
(s, thistoken) = get_next_token()
tokenlis = lexing_action(s, thistoken, tokenlis)
}
return tokenlis;
}
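The pseudocode above can be rendered directly in OCaml. Here is a minimal sketch for a toy token set (integer literals, +, and discarded whitespace); the state numbering, token type, and string-indexed input are illustrative assumptions, not the lecture's code:

```ocaml
(* Hand-written DFA lexer for a toy language.
   States: 0 = start, 1 = in integer, 2 = saw '+', 3 = in whitespace. *)
type token = INTLIT of int | PLUS

let transition (s : int) (c : char) : int =
  match s, c with
  | 0, '0'..'9' | 1, '0'..'9' -> 1
  | 0, '+' -> 2
  | 0, (' ' | '\t' | '\n') | 3, (' ' | '\t' | '\n') -> 3
  | _ -> -1                                   (* no transition *)

(* Run the DFA as long as possible from position i; return the ending
   state, the characters consumed, and the new position. *)
let get_next_token (input : string) (i : int) : int * string * int =
  let rec go s j =
    if j < String.length input && transition s input.[j] <> -1
    then go (transition s input.[j]) (j + 1)
    else (s, String.sub input i (j - i), j)
  in go 0 i

(* Do whatever the label on the ending state says. *)
let lexing_action s thistoken tokenlis =
  match s with
  | 1 -> INTLIT (int_of_string thistoken) :: tokenlis
  | 2 -> PLUS :: tokenlis
  | 3 -> tokenlis                             (* whitespace: discard *)
  | _ -> failwith "lex error"

let get_all_tokens (input : string) : token list =
  let rec loop i tokenlis =
    if i >= String.length input then List.rev tokenlis
    else let (s, tok, j) = get_next_token input i in
         loop j (lexing_action s tok tokenlis)
  in loop 0 []
```

E.g. get_all_tokens "12 + 3" evaluates to [INTLIT 12; PLUS; INTLIT 3].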
Lexing actions
lexing_action (state, thistoken, tokenlis) {
  switch (label(state)) {
    case ERROR: raise lex error
    case DISCARD: return tokenlis  (* discard: add nothing *)
    case ID_OR_KW: Check if thistoken matches a keyword. If so,
                   return keyword::tokenlis, else return (Id thistoken)::tokenlis
    default: return label(state)::tokenlis
  }
}
DFAs for comments
• C++ comments
• C comments
• OCaml comments: Not possible! (nested comments are not regular)
ocamllex
• A lexer generator accepts a list of regular expressions, each
with an associated token; creates a DFA like the one we built
above; and outputs a program that implements the DFA.
• ocamllex is a lexer generator whose output is an OCaml
program.
• In the following slides, we describe the input specifications
read by ocamllex, and how to solve various lexing problems
in ocamllex.
• For full details, consult ocamllex manual.
Regular expressions in ocamllex
’c’              character c
"c1c2 . . . cn"  string (n ≥ 0)
r1 r2 . . .      sequence of regular expressions
r1 | r2 | . . .  alternation of regular expressions
r*               Kleene star
r+               same as r r*
r?               same as r | ""
_                any character
eof              special eof character
[’c1’-’c2’]      range of characters
[^’c1’-’c2’]     complement of a range of characters
r as id          same as r, but id is bound to the characters matched by r
Regular expression examples
• Identifiers: [’A’-’Z’ ’a’-’z’] [’A’-’Z’ ’a’-’z’ ’0’-’9’ ’_’]*
• Integer literals: [’0’-’9’]+
• The keyword if: "if"
• The left-shift operator: "<<"
• Floating-point literals:
  [’0’-’9’]+ ’.’ [’0’-’9’]+ ([’E’ ’e’] [’+’ ’-’]? [’0’-’9’]+)?
Ocamllex specifications
{ header }              (auxiliary ocaml defs)
let ident = regexp      (regexp abbreviations)
...
rule main = parse
  regexp { action }     (action is value returned when regexp is matched)
| ...
| regexp { action }
{ trailer }             (auxiliary ocaml defs)
• header can define names used in actions
• In addition to ordinary regular expression operators, regexps
can include the names given as abbreviations; these are just
replaced by their definitions.
• actions are ocaml expressions; all must be of the same type
Ocamllex specifications (cont.)
• ocamllex transforms this specification into a file that defines
the function main : Lexing.lexbuf → type-of-actions. The
name of main’s argument is lexbuf. lexbuf is not a list, but
a “stream” that delivers characters one at a time; the Lexing
module contains a number of functions on lexbufs, but you
won’t need to use them.
• trailer can define functions that call main.
• actions can call main, and can refer to variable lexbuf (the
input stream).
Ocamllex specifications (cont.)
• ocamllex puts all the regular expressions together using alternation, then transforms the result to an NFA and then a DFA (as in
373). The DFA looks like the one we designed by hand: Each
state is associated with one of the regular expressions (or it’s
an error state). It runs the DFA as long as possible (just as
we did), and then takes the action associated with the state
it ended in.
• If two regexps match:
• If one matches a longer part of the input, that one wins.
• If they both match the exact same characters, the one
that occurs first in the specification wins. (E.g. keywords
should precede regexp for identifiers.)
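As a concrete illustration of these two rules, a sketch of a specification fragment (the token names IF and IDENT are assumed to be declared elsewhere):

```ocaml
rule main = parse
  "if"              { IF }        (* on input "if": both rules match the same
                                     characters, so the earlier rule wins *)
| ['a'-'z']+ as id  { IDENT id }  (* on input "iffy": this rule matches more
                                     characters, so longest-match wins *)
```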
Lexing functions
• ocamllex will define function main, but this is not a convenient function to use. Instead, in the trailer define more useful
functions that call main, e.g.
(* Remember: main: Lexing.lexbuf -> type-of-actions *)
(* Get first token in string s *)
let get_token s = let b = Lexing.from_string (s)
in main b
(* Get list of all tokens in buffer b *)
let rec get_tokens b =
match main b with
EOF -> []
| t -> t :: get_tokens b
(* Get list of all tokens in string s *)
let get_all_tokens s =
get_tokens (Lexing.from_string s)
Ocamllex example 1
• In this and following examples, the trailer is omitted; it will
contain definitions like the ones on the previous slide.
• What does this do?
rule main = parse
  [’0’-’9’]+                  { "Int" }
| [’0’-’9’]+’.’[’0’-’9’]+     { "Float" }
| [’a’-’z’]+                  { "String" }

get_all_tokens "94.7e5" returns ["Float"; "String"; "Int"]
Ocamllex examples 2 and 3
let digit = [’0’-’9’]
rule main = parse
  digit+                      { "Int" }
| digit+’.’digit+             { "Float" }
| [’a’-’z’]+                  { "String" }

get_all_tokens "94.7e5" returns ["Float"; "String"; "Int"]

{ type token = Int | Float | Ident }
rule main = parse
  [’0’-’9’]+                  { Int }
| [’0’-’9’]+’.’[’0’-’9’]+     { Float }
| [’a’-’z’]+                  { Ident }

get_all_tokens "94.7e5" returns [Float; Ident; Int]
Ocamllex examples 4 and 5
{ type token = Int | Float | Ident }
rule main = parse
  [’0’-’9’]+                    { Int }
| [’0’-’9’]+’.’[’0’-’9’]+       { Float }
| [’a’-’z’]+                    { Ident }
| _                             { main lexbuf }

get_all_tokens "94.7e5" returns [Float; Ident; Int],
as does get_all_tokens "94.7 e 5" (which examples 1-3 couldn’t handle)

{ type token = Int of int | Float of float | Ident of string }
rule main = parse
  [’0’-’9’]+ as x               { Int (int_of_string x) }
| [’0’-’9’]+’.’[’0’-’9’]+ as x  { Float (float_of_string x) }
| [’a’-’z’]+ as id              { Ident id }
| _                             { main lexbuf }

get_all_tokens "94.7e5" returns [Float 94.7; Ident "e"; Int 5]
Ocamllex example 6
{ type token = Int of int | Float of float | Ident of string | EOF }
rule main = parse
  [’0’-’9’]+ as x               { Int (int_of_string x) }
| [’0’-’9’]+’.’[’0’-’9’]+ as x  { Float (float_of_string x) }
| [’a’-’z’]+ as id              { Ident id }
| _                             { main lexbuf }
| eof                           { EOF }

get_all_tokens "94.7e5" returns [Float 94.7; Ident "e"; Int 5; EOF]
Difficult cases...
• Write regular expressions for various kinds of comments:
• C++-style (//...): "//" [^’\n’]*
• C-style (/*... */, no nesting):
See ostermiller.org/findcomment.html
• OCaml-style ((*... *), with nesting): Impossible!
Simulating DFAs using lexing
functions in ocamllex
• Ocamllex specifications can have more than one function:
rule main = parse
... regexp’s and actions ...
and string = parse
... regexp’s and actions ...
and comment = parse
... regexp’s and actions ...
• Defines functions main, string, and comment.
• Each function has type Lexing.lexbuf → type-of-its-actions.
• Each function can call the other functions. E.g. main might
include:
"/*"
{ comment lexbuf }
Handling C-style comments
let open_comment = "/*"
let close_comment = "*/"
rule main = parse
  digits ’.’ digits as f   { Float (float_of_string f) }
| digits as n              { Int (int_of_string n) }
| letters as s             { Ident s }
| open_comment             { comment lexbuf }
| eof                      { EOF }
| _                        { main lexbuf }
and comment = parse
  close_comment            { main lexbuf }
| _                        { comment lexbuf }
Handling OCaml-style comments
• Every lexing function has an argument named lexbuf of type
Lexing.lexbuf, but you can add additional arguments. This
allows you to handle nested comments (even though they are
not finite-state).
rule main = parse
  open_comment      { comment 1 lexbuf }
| eof               { [] }
| _                 { main lexbuf }
and comment depth = parse
  open_comment      { comment (depth+1) lexbuf }
| close_comment     { if depth = 1
                      then main lexbuf
                      else comment (depth - 1) lexbuf }
| _                 { comment depth lexbuf }
Wrap-up
• On Tuesday and today, we discussed:
  • What a lexer does
  • How to write a lexer by hand
  • How to write a lexer using ocamllex
• We discussed it because:
  • This is the first step in a compiler
  • Knowing how to write a lexer is a useful skill
• In the next class, we will:
  • Talk about parsing
• What to do now:
  • MP3
  • Important: Review context-free grammars from CS 373