Lexical Analysis

Lexical Analysis:
Regular Expressions
CS 671
January 22, 2008
Last Time …
[Figure: High-Level Programming Languages → Compiler → Machine Code, with Error Messages as a second output of the compiler]
A program that translates a program in one language to another language
• the essential interface between applications & architectures
Typically lowers the level of abstraction
• analyzes and reasons about the program & architecture
We expect the program to be optimized, i.e., better than the original
• ideally exploiting architectural strengths and hiding weaknesses
CS 671 – Spring 2008
Phases of a Compiler
Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program
Lexical Analyzer
• Groups sequences of characters into lexemes, the smallest meaningful entities in a language (keywords, identifiers, constants)
• Characters read from a file are buffered, which helps decrease latency due to I/O; the lexical analyzer manages the buffer
• Makes use of the theory of regular languages and finite state machines
• Lex and Flex are tools that construct lexical analyzers from regular expression specifications
Phases of a Compiler
Parser (syntax analyzer)
• Converts a linear structure (a sequence of tokens) into a hierarchical, tree-like structure: an AST
• The parser imposes the syntax rules of the language
• Work should be linear in the size of the input (else unusable), so type consistency cannot be checked in this phase
• Deterministic context-free languages and pushdown automata form the basis
• Bison and yacc allow a user to construct parsers from CFG specifications
Phases of a Compiler
Semantic Analysis
• Calculates the program’s “meaning”
• Rules of the language are checked (variable declaration, type checking)
• Type checking is also needed for code generation (code gen for a + b depends on the types of a and b)
Phases of a Compiler
Intermediate Code Generation
• Makes it easy to port the compiler to other architectures (e.g., Pentium to MIPS)
• Can also be the basis for interpreters (such as in Java)
• Enables optimizations that are not machine specific
Phases of a Compiler
Intermediate Code Optimization
• Constant propagation, dead code elimination, common sub-expression elimination, strength reduction, etc.
• Based on dataflow analysis: properties that are independent of execution paths
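As a rough illustration of two of the optimizations named on this slide, the sketch below runs constant propagation (with folding) and dead code elimination over straight-line three-address code. The tuple format `(dest, op, left, right)` and both function names are assumptions made for this example; because there is no control flow, no real dataflow analysis is needed here.

```python
import operator

# Hypothetical three-address code: (dest, op, left, right).
# Operands are ints (constants) or strs (variable names).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def propagate_and_fold(code):
    """Constant propagation plus constant folding on straight-line code."""
    known = {}                       # variable -> constant value found so far
    out = []
    for dest, op, a, b in code:
        a = known.get(a, a)          # substitute known constants for variables
        b = known.get(b, b)
        if op == "copy" and isinstance(a, int):
            known[dest] = a
        elif op in OPS and isinstance(a, int) and isinstance(b, int):
            known[dest] = OPS[op](a, b)
            op, a, b = "copy", known[dest], None   # fold into a constant copy
        else:
            known.pop(dest, None)    # dest no longer holds a known constant
        out.append((dest, op, a, b))
    return out

def eliminate_dead(code, live_out):
    """Backward scan that drops assignments whose result is never used."""
    live = set(live_out)
    kept = []
    for dest, op, a, b in reversed(code):
        if dest in live:
            live.discard(dest)
            live.update(x for x in (a, b) if isinstance(x, str))
            kept.append((dest, op, a, b))
    return kept[::-1]

code = [
    ("x", "copy", 4, None),
    ("y", "*", "x", 2),
    ("z", "+", "y", "n"),
    ("t", "-", "x", "x"),
]
optimized = eliminate_dead(propagate_and_fold(code), live_out={"z"})
print(optimized)   # [('z', '+', 8, 'n')]
```

Only `z` is live at the end, so once `y` has been folded to 8 the assignments to `x`, `y`, and `t` are all dead.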
Phases of a Compiler
Native Code Generation
• Intermediate code is translated into native code
• Register allocation, instruction selection
Native Code Optimization
• Peephole optimizations: a small window is optimized at a time
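To make the "small window" idea concrete, here is a minimal peephole pass over an invented instruction list. The instruction tuples (`mov`, `add`, `store`, `load`) and their operand order are assumptions for illustration and do not correspond to any real instruction set.

```python
def peephole(instrs):
    """One pass over a hypothetical instruction list with a window of two.

    Assumed formats: ("mov", dst, src), ("add", reg, operand),
    ("store", reg, addr), ("load", reg, addr).
    """
    out = []
    for ins in instrs:
        op = ins[0]
        # "mov r, r" and "add r, 0" have no effect in this toy machine
        if (op == "mov" and ins[1] == ins[2]) or (op == "add" and ins[2] == 0):
            continue
        # store r -> addr immediately followed by load addr -> r: load is redundant
        if out and op == "load" and out[-1] == ("store", ins[1], ins[2]):
            continue
        out.append(ins)
    return out

program = [
    ("mov", "r1", "r1"),
    ("store", "r1", "x"),
    ("load", "r1", "x"),
    ("add", "r2", 0),
    ("add", "r2", "r1"),
]
print(peephole(program))   # [('store', 'r1', 'x'), ('add', 'r2', 'r1')]
```

Each rule only looks at the current instruction and the one just emitted, which is what keeps the pass cheap.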
Administration
1. Compiling to assembly
2. HW1 on website: Fun with Lex/Yacc
3. Questionnaire Results…
Useful Tools!
•tar – archiving program
•gzip/bzip2 – compression
•svn – version control
•Make/Scons – build/run utility
•Other useful tools:
– man
– which
– locate
– diff (or sdiff)
Makefiles
target: dependent source file(s)
<tab>command
[Figure: dependency graph for proj1, with object files data.o, main.o, io.o and source files data.c, data.h, main.c, io.h, io.c]
First Step: Lexical Analysis (Tokenizing)
•Breaking the program down into words or “tokens”
•Input: stream of characters
•Output: stream of names, keywords, punctuation marks
•Side effect: Discards white space, comments
Source code: if (b==0) a = "Hi";
[Figure: source code → Lexical Analysis → token stream → Parsing]
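A minimal sketch of this first step, applied to the source line above. The token names (`ID`, `NUM`, `STRING`), the small keyword set, and the choice of Python's `re` module are assumptions for illustration; the slide does not prescribe a token format.

```python
import re

KEYWORDS = {"if", "else", "while", "break"}

TOKEN_RE = re.compile(r"""
    (?P<WS>      \s+ | /\*.*?\*/ )      # white space and comments: discarded
  | (?P<ID>      [A-Za-z_][A-Za-z0-9_]* )
  | (?P<NUM>     [0-9]+ )
  | (?P<STRING>  "(?:\\.|[^"\\])*" )
  | (?P<SYM>     == | [=;(){}] )        # '==' listed before '=' on purpose
""", re.VERBOSE | re.DOTALL)

def tokenize(text):
    """Turn a character stream into a list of token strings."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "WS":
            continue                        # side effect: white space is dropped
        if kind == "ID" and lexeme in KEYWORDS:
            tokens.append(lexeme.upper())   # keywords become their own token kind
        elif kind == "SYM":
            tokens.append(lexeme)
        else:
            tokens.append(f"{kind}({lexeme})")
    return tokens

print(tokenize('if (b==0) a = "Hi";'))
# ['IF', '(', 'ID(b)', '==', 'NUM(0)', ')', 'ID(a)', '=', 'STRING("Hi")', ';']
```

Python's alternation picks the first alternative that matches rather than the longest one, which is why `==` has to be listed ahead of `=` in this sketch.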
Lexical Tokens
• Identifiers: x y11 elsex _i00
• Keywords: if else while break
• Integers: 2 1000 -500 5L
• Floating point: 2.0 0.00020 .02 1.1e5 0.e-10
• Symbols: + * { } ++ < << [ ] >=
• Strings: "x" "He said, \"Are you?\""
• Comments: /** ignore me **/
Lexical Tokens
float match0(char *s) /* find a zero */
{
    if (!strncmp(s, "0.0", 3))
        return 0.;
}
FLOAT ID(match0) _______ CHAR STAR ID(s)
RPAREN LBRACE IF LPAREN BANG _______
LPAREN ID(s) COMMA STRING(0.0) ______
NUM(3) RPAREN RPAREN RETURN REAL(0.0)
______ RBRACE EOF
Ad-hoc Lexer
• Hand-write code to generate tokens
• How to read identifier tokens?
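Token readIdentifier( ) {
    String id = "";
    while (true) {
        char c = input.read();
        // Note: c is consumed here, so the character that ends the
        // identifier is lost unless the lexer peeks or pushes it back
        if (!identifierChar(c))
            return new Token(ID, id, lineNumber);
        id = id + c;
    }
}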
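The same loop can be sketched in Python with the delimiter pushed back so the next token can still see it. The helper names and the use of a seekable text stream are assumptions for this example, not part of the slide's code.

```python
import io

def is_identifier_char(c):
    """Assumed definition: letters, digits, and underscore."""
    return c.isalnum() or c == "_"

def read_identifier(stream):
    """Read the longest run of identifier characters from a seekable text stream."""
    chars = []
    while True:
        pos = stream.tell()
        c = stream.read(1)
        if not c:                      # end of input
            break
        if not is_identifier_char(c):
            stream.seek(pos)           # put the delimiter back for the next token
            break
        chars.append(c)
    return "".join(chars)

source = io.StringIO("elsex=0;")
print(read_identifier(source))   # elsex
print(source.read(1))            # =
```

Remembering the position before each read is one simple way to get one character of lookahead; a buffered lexer would typically keep a peek pointer instead.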
Problems
• Don’t know what kind of token we are going
to read from seeing first character
– if token begins with “i” is it an identifier?
– if token begins with “2” is it an integer? constant?
– interleaved tokenizer code is hard to write correctly,
harder to maintain
• More principled approach: lexer generator
that generates efficient tokenizer
automatically (e.g., lex, flex)
Issues
• How to describe tokens unambiguously
2.e0 20.e-01 2.0000
"" "x" "\\" "\"\'"
• How to break text down into tokens
if (x == 0) a = x<<1;
if (x == 0) a = x<1;
• How to tokenize efficiently
– tokens may have similar prefixes
– want to look at each character ~1 time
How To Describe Tokens
• Programming language tokens can be
described using regular expressions
• A regular expression R describes some set of
strings L(R)
• L(R) is the language defined by R
– L(abc) = { abc }
– L(hello|goodbye) = {hello, goodbye}
– L([1-9][0-9]*) = _______________
• Idea: define each kind of token using RE
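To see "a regular expression describes a set of strings" concretely, the sketch below enumerates the members of L(R) up to a length bound over a small alphabet. The function name and the brute-force approach are assumptions chosen for clarity, not how a lexer generator works.

```python
import re
from itertools import product

def language_up_to(regex, alphabet, max_len):
    """Strings over `alphabet` of length <= max_len that belong to L(regex)."""
    pattern = re.compile(regex)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if pattern.fullmatch("".join(chars))   # the whole string must match
    }

print(language_up_to(r"abc", "abc", 3))     # {'abc'}
print(sorted(language_up_to(r"a|bc", "abc", 3)))   # ['a', 'bc']
print(sorted(language_up_to(r"(ab)*", "ab", 4)))   # ['', 'ab', 'abab']
```

The last language is infinite; the length bound only truncates the listing.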
Regular expressions
Language – set of strings
String – finite sequence of symbols
Symbols – taken from a finite alphabet
Specify languages using regular expressions
Symbol         a      one instance of a
Epsilon        ε      empty string
Alternation    R|S    string from either L(R) or L(S)
Concatenation  R∙S    string from L(R) followed by a string from L(S)
Repetition     R*     zero or more strings from L(R), concatenated
Convenient Shorthand
[abcd]       one of the listed characters (a | b | c | d)
[b-g]        [bcdefg]
[b-gM-Qkr]   ____________
[^ab]        anything but one of the listed chars
[^a-f]       ____________
M?           zero or one M
M+           one or more M
M*           ____________
"a.+*"       literally a.+*
.            any single character (except \n)
Examples
Regular Expression              Strings in L(R)
digit = [0-9]                   “0” “1” “2” “3” …
posint = digit+                 “8” “412” …
int = -? posint                 “-42” “1024” …
real = int (ε | (. posint))     “-1.56” “12” “1.0”
[a-zA-Z_][a-zA-Z0-9_]*          C identifiers
• Lexer generators support abbreviations
– But they cannot be recursive
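The abbreviations above can be mimicked with plain string substitution, which also shows why they cannot be recursive: each name must be fully expanded before it is used. This is a sketch in Python's `re` syntax; the variable names mirror the slide, and `int_` is renamed only to avoid shadowing the built-in.

```python
import re

digit  = r"[0-9]"
posint = rf"{digit}+"
int_   = rf"-?{posint}"
real   = rf"{int_}(?:\.{posint})?"      # (ε | (. posint)) as an optional group
ident  = r"[a-zA-Z_][a-zA-Z0-9_]*"      # C identifiers

for s in ["-1.56", "12", "1.0"]:
    assert re.fullmatch(real, s)        # the example strings from the table

assert re.fullmatch(real, "1.") is None          # a fraction needs digits
assert re.fullmatch(ident, "_i00")
assert re.fullmatch(ident, "9lives") is None     # cannot start with a digit
print(real)   # -?[0-9]+(?:\.[0-9]+)?
```

Note that the literal dot has to be escaped, since an unescaped `.` means "any single character".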
More Examples
Whitespace:
Integers:
Hex numbers:
Valid UVa User Ids:
Loop keywords in C:
Breaking up Text
elsex=0;
Two candidate tokenizations:
else x = 0 ;
elsex = 0 ;
•REs alone are not enough: need rules for choosing among candidate tokenizations
•Most languages: longest matching token wins
– even if taking a shorter token is the only way to tokenize the input
•Ties in length are resolved by prioritizing tokens
•REs + priorities + longest-matching token rule = lexer definition
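The two rules can be sketched directly: try every rule at the current position, keep the longest match, and let the earlier rule win a tie. The rule list and token names below are assumptions for this example, and each pattern is simple enough that Python's greedy matching returns its longest match.

```python
import re

# Rules in priority order: earlier rules win ties in length.
RULES = [
    ("ELSE", re.compile(r"else")),
    ("ID",   re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
    ("NUM",  re.compile(r"[0-9]+")),
    ("SYM",  re.compile(r"[=;]")),
    ("WS",   re.compile(r"\s+")),
]

def next_token(text, pos):
    """Return (kind, lexeme) for the longest match at pos; ties go to the earliest rule."""
    best = None
    for kind, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    if best is None:
        raise SyntaxError(f"unexpected character {text[pos]!r}")
    return best

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        kind, lexeme = next_token(text, pos)
        pos += len(lexeme)
        if kind != "WS":
            out.append((kind, lexeme))
    return out

print(tokenize("elsex=0;"))
# [('ID', 'elsex'), ('SYM', '='), ('NUM', '0'), ('SYM', ';')]
print(tokenize("else x = 0 ;"))
# [('ELSE', 'else'), ('ID', 'x'), ('SYM', '='), ('NUM', '0'), ('SYM', ';')]
```

In the first input `ID` wins because `elsex` is longer than `else`; in the second both rules match four characters, so the higher-priority `ELSE` rule wins.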
Lexer Generator Specification
• Input to lexer generator:
– list of regular expressions in priority order
– associated action for each RE (generates
appropriate kind of token, other bookkeeping)
• Output:
– program that reads an input stream and breaks it
up into tokens according to the REs. (Or reports
lexical error -- “Unexpected character” )
Lex: A Lexical Analyzer Generator
Lex produces a C program from a lexical
specification
http://www.epaperpress.com/lexandyacc/
DIGITS      [0-9]+
ALPHA       [A-Za-z]
CHARACTER   {ALPHA}|_
IDENTIFIER  {ALPHA}({CHARACTER}|{DIGITS})*
%%
if                                      {return IF; }
{IDENTIFIER}                            {return ID; }
{DIGITS}                                {return NUM; }
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)     {return ____; }
.                                       {error(); }
Lexer Generator
• Reads in list of regular expressions R1,…Rn, one per
token, with attached actions
-?[1-9][0-9]* { return new Token(Tokens.IntConst,
                                 Integer.parseInt(yytext())); }
• Generates scanning code that decides:
1. whether the input is lexically well-formed
2. corresponding token sequence
• Problem 1 is equivalent to deciding whether the input is
in the language of the regular expression
• How can we efficiently test membership in L(R) for
arbitrary R?
Regular Expression Matching
• Sketch of an efficient implementation:
– start in some initial state
– look at each input character in sequence, update
scanner state accordingly
– if state at end of input is an accepting state, the input
string matches the RE
• For tokenizing, only need a finite amount of state:
(deterministic) finite automaton (DFA) or finite state
machine
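The sketch of an efficient implementation above can be written as a table-driven loop. The example below hand-builds a DFA for the integer pattern `-?[1-9][0-9]*` from the previous slide; the state numbering, the character classes, and the function names are assumptions for illustration, whereas a lexer generator would derive such a table automatically.

```python
def char_class(c):
    """Map a character to the input class used by the transition table."""
    if c == "-":
        return "minus"
    if c == "0":
        return "zero"
    if c in "123456789":
        return "nonzero"
    return "other"

# States: 0 = start, 1 = seen '-', 2 = inside the digits (accepting).
DELTA = {
    (0, "minus"): 1,
    (0, "nonzero"): 2,
    (1, "nonzero"): 2,
    (2, "zero"): 2,
    (2, "nonzero"): 2,
}
ACCEPTING = {2}

def accepts(s):
    """Run the DFA over s, looking at each character exactly once."""
    state = 0
    for c in s:
        state = DELTA.get((state, char_class(c)))
        if state is None:              # no edge for this character: reject
            return False
    return state in ACCEPTING

for s in ["-42", "1024", "007", "-", "", "4a"]:
    print(repr(s), accepts(s))
```

The only state carried between characters is a single integer, which is the "finite amount of state" the slide refers to.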
High Level View
Compile time: source code → Scanner → tokens
Design time: specification → Scanner Generator → Scanner
Regular expressions = specification
Finite automata = implementation
Every regex has an FSA that recognizes its language
Finite Automata
Takes an input string and determines
whether it’s a valid sentence of a language
– A finite automaton has a finite set of states
– Edges lead from one state to another
– Edges are labeled with a symbol
– One state is the start state
– One or more states are final states
[Figure: two automata. IF: 0 →(i) 1 →(f) 2 (final, IF). ID: 0 →(a-z) 1 (final, ID), with a loop on state 1 labeled a-z and 0-9; the label a-z abbreviates 26 edges.]
Language
Each string is accepted or rejected
1. Starting in the start state
2. Automaton follows one edge for every character
(edge must match character)
3. After n transitions for an n-character string, accept if the automaton is in a final state
Language: set of strings that the FSA accepts
[Figure: combined automaton. 0 →(i) 1 (final, ID) →(f) 2 (final, IF); 0 →([a-hj-z]) 3 (final, ID); edges labeled [a-z0-9] lead into state 3.]
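The three steps above can be traced on an automaton that recognizes both IF and ID, with each final state labeled by the token it yields. The transitions out of states 1 and 2 into state 3 are an assumption reconstructed from the edge labels; treat this as a sketch rather than a copy of the slide's diagram.

```python
import string

LOWER = string.ascii_lowercase
ALNUM = LOWER + string.digits

def step(state, c):
    """Transition function; returns None when no edge matches c."""
    if state == 0:
        if c == "i":
            return 1
        return 3 if c in LOWER else None      # [a-hj-z]
    if state == 1:
        if c == "f":
            return 2
        return 3 if c in ALNUM else None      # assumed: any other letter or digit
    if state in (2, 3):
        return 3 if c in ALNUM else None      # [a-z0-9]
    return None

FINAL = {1: "ID", 2: "IF", 3: "ID"}           # final state -> token kind

def classify(s):
    """Follow one edge per character; the label of the last state decides."""
    state = 0
    for c in s:
        state = step(state, c)
        if state is None:
            return None                       # rejected
    return FINAL.get(state)

for s in ["if", "i", "ifx", "x9", "9x", ""]:
    print(repr(s), classify(s))
```

Because the start state is not final, the empty string is rejected, and `ifx` ends in state 3 so it is classified as an identifier rather than the keyword.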