Lexical Analysis
Dr. Nguyen Hua Phung
HCMC University of Technology, Viet Nam
08, 2016
Outline
1. Introduction
2. Roles
3. Implementation
4. Use ANTLR to generate Lexer
Compilation Phases
source program
Front end: lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator
Back end: code optimizer, code generator
target program
Lexical Analysis
Like a word extractor
in ⇒ i n ⇒ in
Like a spell checker
I og to sochol (the misspelled words "og" and "sochol" are flagged)
Like a classifier
I (pronoun) am (verb) a (article) student (noun)
Lexical Analysis Roles
Identify lexemes: substrings of the source program that belong to a grammar unit
Return tokens: the lexical categories of lexemes
Skip whitespace such as blanks, newlines, and tabs
Record the position of each token for use in later phases
Example on Lexeme and Token
Input: result = oldsum - value / 100;
Lexeme    Kind of token
result    IDENT
=         ASSIGN_OP
oldsum    IDENT
-         SUBTRACT_OP
value     IDENT
/         DIV_OP
100       INT_LIT
;         SEMICOLON
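The mapping from lexemes to token kinds can be reproduced with a handful of regular-expression rules. The following is a minimal Python sketch, not the implementation used in the course: the patterns and the `tokenize` helper are assumptions made for illustration, and the token names are modeled on the kinds listed in the example. It also covers the other roles of a lexer by skipping whitespace and recording each token's position.

```python
import re

# One (token kind, pattern) pair per rule; the patterns are illustrative assumptions.
RULES = [
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INT_LIT", r"[0-9]+"),
    ("ASSIGN_OP", r"="),
    ("SUBTRACT_OP", r"-"),
    ("DIV_OP", r"/"),
    ("SEMICOLON", r";"),
    ("WS", r"[ \t\r\n]+"),
]
# Combine the rules into one pattern with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in RULES))


def tokenize(source):
    """Return (kind, lexeme, position) triples, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group(), pos))
        pos = match.end()
    return tokens


tokens = tokenize("result = oldsum - value / 100;")
for kind, lexeme, position in tokens:
    print(f"{position:2d}  {lexeme:8s} {kind}")
```

The position stored with each token is the character offset of its lexeme in the input, which later phases can use for error messages.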
How to build a lexical analyzer?
How to build a lexical analyzer for English?
65000 words
Simply build a dictionary:
{(I,pronoun);(We,pronoun);(am,verb);...}
Extract, search, compare
But for a programming language?
How many words?
Identifiers: abc, cab, Abc, aBc, cAb, ...
Integers: 1, 10, 120, 20, 210, ...
...
Too many words to build a dictionary, so how?
Apply rules for each kind of word (token)
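The contrast between a dictionary and rules can be sketched in a few lines of Python. The dictionary entries are the ones from the slide; the two rule patterns and the names `ENGLISH`, `TOKEN_RULES`, and `classify` are assumptions chosen for illustration, not a definition of any particular language.

```python
import re

# Dictionary approach: every word must be listed explicitly.
ENGLISH = {"I": "pronoun", "We": "pronoun", "am": "verb"}

# Rule approach: each kind of word is described by a pattern (assumed here),
# so infinitely many identifiers and integers are covered by two entries.
TOKEN_RULES = [
    ("IDENT", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("INT_LIT", re.compile(r"[0-9]+")),
]


def classify(word):
    """Return the first token kind whose rule matches the whole word, else None."""
    for kind, pattern in TOKEN_RULES:
        if pattern.fullmatch(word):
            return kind
    return None


print(ENGLISH.get("am"))
for word in ["abc", "cab", "Abc", "aBc", "cAb", "1", "10", "120", "20", "210"]:
    print(word, classify(word))
```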
Rule Representations
Finite Automata
Deterministic Finite Automata
Nondeterministic Finite Automata
Regular Expressions
Finite Automata
[Figure: a finite control with states q0, q1, q2, q3, ..., qn reads an input tape of symbols (a, b, ...) through a read-only head]
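State Diagram
[Figure: state diagram with start state q0 and a second state q1; as the trace below shows, q0 stays in q0 on a and moves to q1 on b, while q1 stays in q1 on a and moves to q0 on b]
Input: abaabb
Current state   Read   New state
q0              a      q0
q0              b      q1
q1              a      q1
q1              a      q1
q1              b      q0
q0              b      q1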
Deterministic Finite Automata
Definition
A deterministic finite automaton (DFA) is a 5-tuple M = (K, Σ, δ, s, F) where
K = a finite set of states
Σ = an alphabet
s ∈ K = the initial state
F ⊆ K = the set of final states
δ = a transition function from K × Σ to K
Example
M = (K, Σ, δ, s, F) where K = {q0, q1}, Σ = {a, b}, s = q0, F = {q1}, and δ is given by
K     Σ    δ(K, Σ)
q0    a    q0
q0    b    q1
q1    a    q1
q1    b    q0
[Figure: state diagram of M with start state q0, states q0 and q1, and edges labeled a and b]
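The 5-tuple of this example translates almost directly into data. The sketch below is one possible Python encoding, with the names `DELTA`, `run`, and `accepts` chosen here for illustration; it replays the input abaabb used on the State Diagram slide.

```python
# The DFA M = (K, Σ, δ, s, F) from the example, written as plain Python data.
STATES = {"q0", "q1"}
ALPHABET = {"a", "b"}
DELTA = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "a"): "q1",
    ("q1", "b"): "q0",
}
START = "q0"
FINAL = {"q1"}


def run(word):
    """Return the list of states visited, beginning with the initial state."""
    state = START
    trace = [state]
    for symbol in word:
        if symbol not in ALPHABET:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        state = DELTA[(state, symbol)]  # exactly one next state: deterministic
        trace.append(state)
    return trace


def accepts(word):
    """A word is accepted when the run ends in a final state."""
    return run(word)[-1] in FINAL


print(run("abaabb"))
print(accepts("abaabb"))
```

Because δ is a function, the dictionary lookup never has to choose between alternatives, which is what distinguishes this automaton from a nondeterministic one.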
Nondeterministic Finite Automata
Permit several possible “next states” for a given combination of current state and input symbol
Accept the empty string ε as a transition label in the state diagram
Help simplify the description of automata
Every NFA is equivalent to some DFA
Example
Language L = ({ab} ∪ {aba})*
[Figure: NFA state diagram with start state q0 and states q1, q2; the edges are labeled a, b, a, b]
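Example
Language L = ({ab} ∪ {aba})*
[Figure: NFA state diagram for the same language with start state q0 and states q1, q2; the edges are labeled a, b, a, and ε]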
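An NFA can be simulated by tracking the set of all states it could be in. The sketch below uses one possible NFA with an ε-transition for L = ({ab} ∪ {aba})*. The exact edges of the diagram are not recoverable from this handout, so the transition relation `DELTA` and the choice of q0 as the only final state are assumptions that merely agree with the edge labels a, b, a, ε.

```python
EPSILON = ""  # label used for ε-transitions

# Assumed transition relation: (state, symbol) maps to a set of next states.
DELTA = {
    ("q0", "a"): {"q1"},
    ("q1", "b"): {"q2"},
    ("q2", "a"): {"q0"},
    ("q2", EPSILON): {"q0"},
}
START = "q0"
FINAL = {"q0"}


def epsilon_closure(states):
    """Add every state reachable from `states` through ε-transitions alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in DELTA.get((state, EPSILON), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def accepts(word):
    """Follow all possible runs at once; accept if any run ends in a final state."""
    current = epsilon_closure({START})
    for symbol in word:
        moved = set()
        for state in current:
            moved |= DELTA.get((state, symbol), set())
        current = epsilon_closure(moved)
    return bool(current & FINAL)


for word in ["", "ab", "aba", "abaab", "a", "abb"]:
    print(repr(word), accepts(word))
```

Tracking sets of states in this way is also the idea behind converting an NFA into an equivalent DFA.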
Regular Expression (regex)
Describe regular sets of strings
Symbols other than ( ) | * stand for themselves
Use ε for the empty string
Concatenation α β = the first part matches α, the second part matches β
Union α | β = match α or β
Kleene star α* = 0 or more matches of α
Use ( ) for grouping
Example
RE            Language
0             { 0 }
01            { 01 }
0|1           { 0, 1 }
0(0|1)        { 00, 01 }
(0|1)(0|1)    { 00, 01, 10, 11 }
0*            { ε, 0, 00, 000, 0000, ... }
(0|1)*        { ε, 0, 1, 00, 01, 10, 11, 000, 001, ... }
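The languages listed for these expressions can be checked mechanically by enumerating short strings. The sketch below relies on Python's `re` module, whose syntax for concatenation, `|`, `*`, and parentheses agrees with the notation used here; the helper `language` and the length bound of 3 are assumptions made for illustration.

```python
import re
from itertools import product


def language(regex, alphabet="01", max_len=3):
    """List every string over `alphabet` of length <= max_len that `regex` matches in full."""
    pattern = re.compile(regex)
    words = []
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if pattern.fullmatch(word):
                words.append(word)
    return words


for regex in ["0", "01", "0|1", "0(0|1)", "(0|1)(0|1)", "0*", "(0|1)*"]:
    print(f"{regex:12s} {language(regex)}")
```

In the printed lists the empty string `''` plays the role of ε, and the infinite languages of `0*` and `(0|1)*` are cut off at the chosen length bound.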
Example
(i|I)(f|F)
The keyword if in Pascal, in any letter case: if, IF, If, iF
E(0|1|2|3|4|5|6|7|8|9)*
An E followed by a (possibly empty) sequence of digits: E123, E9, E
Regular Expression and Finite Automata
[Figure: NFA fragments, each with a start state, for the regular expressions a, ab, a|b, and a*; the fragments for ab, a|b, and a* are joined with ε-transitions]
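Building an NFA from a regular expression by gluing small fragments together with ε-transitions can be sketched as follows. This is a minimal Python version in the style of Thompson's construction; the exact wiring in the figure may differ, and the names `Fragment`, `symbol`, `concat`, `union`, `star`, and `accepts` are chosen here for illustration.

```python
from itertools import count

EPSILON = None      # label used for ε-edges
_fresh = count()    # source of fresh state numbers


class Fragment:
    """An NFA fragment with one start state, one accepting state, and labeled edges."""

    def __init__(self, start, accept, edges):
        self.start = start
        self.accept = accept
        self.edges = edges  # list of (source, label, target)


def symbol(ch):
    s, f = next(_fresh), next(_fresh)
    return Fragment(s, f, [(s, ch, f)])


def concat(x, y):
    # x followed by y: link the accepting state of x to the start of y.
    return Fragment(x.start, y.accept, x.edges + y.edges + [(x.accept, EPSILON, y.start)])


def union(x, y):
    # New start branches into x and y; both accepting states lead to a new accept.
    s, f = next(_fresh), next(_fresh)
    extra = [(s, EPSILON, x.start), (s, EPSILON, y.start),
             (x.accept, EPSILON, f), (y.accept, EPSILON, f)]
    return Fragment(s, f, x.edges + y.edges + extra)


def star(x):
    # Zero or more repetitions: allow skipping x and looping back into x.
    s, f = next(_fresh), next(_fresh)
    extra = [(s, EPSILON, x.start), (s, EPSILON, f),
             (x.accept, EPSILON, x.start), (x.accept, EPSILON, f)]
    return Fragment(s, f, x.edges + extra)


def accepts(fragment, word):
    """Simulate the fragment as an NFA by tracking the set of reachable states."""
    delta = {}
    for source, label, target in fragment.edges:
        delta.setdefault((source, label), set()).add(target)

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            state = stack.pop()
            for nxt in delta.get((state, EPSILON), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    current = closure({fragment.start})
    for ch in word:
        current = closure({t for q in current for t in delta.get((q, ch), ())})
    return fragment.accept in current


ab = concat(symbol("a"), symbol("b"))
a_or_b = union(symbol("a"), symbol("b"))
a_star = star(symbol("a"))
print(accepts(ab, "ab"), accepts(a_or_b, "b"), accepts(a_star, "aaa"), accepts(a_star, ""))
```

Each call to `symbol` creates fresh states, so a fragment should be built once per occurrence in the expression rather than reused in two places.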
Convenience Notation
α+ = one or more (i.e. αα*)
α? = 0 or 1 (i.e. (α|ε))
[xyz] = x|y|z
[x-y] = all characters from x to y, e.g. [0-9] = all ASCII digits
[^x-y] = all characters other than [x-y]
. matches any character
Example
(0|1|2|3|4) => [0-4]
(a|g|h|m) => [aghm]
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)* => [0-9]+
(E|e)(+|-|ε)(0|1|2|3|4|5|6|7|8|9)+ => [Ee][+-]?[0-9]+
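Each shorthand in this example should describe the same language as its long form. The Python sketch below compares the two forms on every string up to a small length over a sample alphabet. It is a bounded check under assumed parameters (`ALPHABET`, `max_len`), not a proof of equivalence; in Python syntax the ε alternative is written as an empty alternative and the literal + is escaped.

```python
import re
from itertools import product

# (long form, shorthand) pairs from the example, transcribed into Python regex syntax.
PAIRS = [
    (r"(0|1|2|3|4)", r"[0-4]"),
    (r"(a|g|h|m)", r"[aghm]"),
    (r"(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*", r"[0-9]+"),
    (r"(E|e)(\+|-|)(0|1|2|3|4|5|6|7|8|9)+", r"[Ee][+-]?[0-9]+"),
]

# Sample alphabet (an assumption): a few characters inside and outside each class.
ALPHABET = "0459aghEe+-"


def same_language(long_form, shorthand, alphabet=ALPHABET, max_len=4):
    """Return True if both patterns agree on every string up to max_len."""
    first, second = re.compile(long_form), re.compile(shorthand)
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if bool(first.fullmatch(word)) != bool(second.fullmatch(word)):
                return False
    return True


for long_form, shorthand in PAIRS:
    print(f"{shorthand:18s} {same_language(long_form, shorthand)}")
```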
ANTLR [1]
ANother Tool for Language Recognition
Terence Parr, Professor of CS at the University of San Francisco
A powerful parser/lexer generator
Lexer
/**
 * Filename: Hello.g4
 */
lexer grammar Hello;
// match one or more digits
INT: [0-9]+;
// hexadecimal number
HEX: '0' [Xx] [0-9A-Fa-f]+;
// match lower-case identifiers
ID: [a-z]+;
// skip spaces, tabs, newlines
WS: [ \t\r\n]+ -> skip;
Lexical Analyzer
[Figure: the lexical analyzer scans the input tape with a look-ahead head, tries the rules r0, r1, r2, r3, ..., rn, and returns the token with the longest prefix match]
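The longest-prefix-match policy can be imitated in a few lines of Python. The sketch below is not code generated by ANTLR; it is a hand-written approximation in which the rule patterns mirror the Hello lexer grammar and the helpers `next_token` and `tokenize` are assumptions made for illustration. Every rule is tried at the current position, the longest match wins, and the earlier rule wins a tie.

```python
import re

# Rules mirroring the Hello lexer grammar; order matters only for breaking ties.
RULES = [
    ("INT", re.compile(r"[0-9]+")),
    ("HEX", re.compile(r"0[Xx][0-9A-Fa-f]+")),
    ("ID", re.compile(r"[a-z]+")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]
SKIPPED = {"WS"}


def next_token(text, pos):
    """Try every rule at `pos` and keep the longest match (earliest rule wins ties)."""
    best_kind, best_end = None, pos
    for kind, pattern in RULES:
        match = pattern.match(text, pos)
        if match and match.end() > best_end:
            best_kind, best_end = kind, match.end()
    if best_kind is None:
        raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
    return best_kind, best_end


def tokenize(text):
    """Return (kind, lexeme) pairs for the whole input, dropping skipped tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        kind, end = next_token(text, pos)
        if kind not in SKIPPED:
            tokens.append((kind, text[pos:end]))
        pos = end
    return tokens


print(tokenize("0x1F abc 42"))
```

On the input 0x1F the INT rule matches only the leading 0, while HEX matches all four characters, so the longer HEX match is the one returned.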
Summary
A lexical analyzer is a pattern matcher that isolates small-scale parts of a program
Lexical rules are represented by regular expressions or finite automata
How to write a lexical analyzer (lexer) in ANTLR
References
[1] ANTLR, http://antlr.org, 19 08 2016.