Lexical Analysis: Regular Expressions
CS 671 – January 22, 2008

Last Time
A compiler is a program that translates a program in one language into another language: high-level programming language → Compiler → machine code (plus error messages).
• The essential interface between applications and architectures
• Typically lowers the level of abstraction
  – analyzes and reasons about the program and the architecture
• We expect the output program to be optimized, i.e., better than the original
  – ideally exploiting architectural strengths and hiding weaknesses
CS 671 – Spring 2008

Phases of a Compiler
Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program

Lexical Analyzer
• Groups sequences of characters into lexemes – the smallest meaningful entities in a language (keywords, identifiers, constants)
• Characters read from a file are buffered – buffering helps decrease latency due to I/O; the lexical analyzer manages the buffer
• Makes use of the theory of regular languages and finite state machines
• Lex and Flex are tools that construct lexical analyzers from regular expression specifications

Syntax Analyzer (Parser)
• Converts a linear structure – a sequence of tokens – into a hierarchical, tree-like structure – an AST
• The parser imposes the syntax rules of the language
• Work should be linear in the size of the input (else unusable)
• Type consistency cannot be checked in this phase
• Deterministic context-free languages and pushdown automata form the basis
• Bison and yacc allow a user to construct parsers from CFG specifications

Semantic Analysis
• Calculates the program's "meaning"
• Rules of the language are checked (variable declaration, type checking)
• Type checking is also needed for code generation (the code generated for a + b depends on the types of a and b)

Intermediate Code Generation
• Makes it easy to port the compiler to other architectures (e.g., Pentium to MIPS)
• Can also be the basis for interpreters (such as in Java)
• Enables optimizations that are not machine-specific

Intermediate Code Optimization
• Constant propagation, dead code elimination, common sub-expression elimination, strength reduction, etc.
• Based on dataflow analysis – properties that are independent of execution paths

Native Code Generation
• Intermediate code is translated into native code
• Register allocation, instruction selection

Native Code Optimization
• Peephole optimizations – a small window of instructions is optimized at a time

Administration
1. Compiling to assembly
   1. HW1 on website: Fun with Lex/Yacc
   2. Questionnaire Results…

Useful Tools!
• tar – archiving program
• gzip/bzip2 – compression
• svn – version control
• make/SCons – build/run utility
• Other useful tools:
  – man
  – which
  – locate
  – diff (or sdiff)

Makefiles
target: dependent source file(s)
<tab>command
(Figure: dependency graph – proj1 depends on data.o, main.o, and io.o; data.o is built from data.c and data.h; main.o from main.c; io.o from io.h and io.c.)

First Step: Lexical Analysis (Tokenizing)
• Breaking the program down into words or "tokens"
• Input: stream of characters
• Output: stream of names, keywords, punctuation marks
• Side effect: discards white space, comments
Source code: if (b==0) a = "Hi"; → Lexical Analysis → Token Stream → Parsing

Lexical Tokens
• Identifiers: x y11 elsex _i00
• Keywords: if else while break
• Integers: 2 1000 -500 5L
• Floating point: 2.0 0.00020 .02 1.1e5 0.e-10
• Symbols: + * { } ++ < << [ ] >=
• Strings: "x" "He said, \"Are you?\""
• Comments: /** ignore me **/

Lexical Tokens
float match0(char *s) /* find a zero */
{
  if (!strncmp(s, "0.0", 3))
    return 0.;
}

FLOAT ID(match0) _______ CHAR STAR ID(s) RPAREN LBRACE
IF LPAREN BANG _______ LPAREN ID(s) COMMA STRING(0.0) ______
NUM(3) RPAREN RPAREN RETURN REAL(0.0) ______ RBRACE EOF

Ad-hoc Lexer
• Hand-write code to generate tokens
• How to read identifier tokens?

Token readIdentifier( ) {
  String id = "";
  while (true) {
    char c = input.read();
    if (!identifierChar(c))
      // note: this consumes the terminating character;
      // a real lexer must push it back onto the input
      return new Token(ID, id, lineNumber);
    id = id + c;
  }
}

Problems
• Don't know what kind of token we are going to read from seeing the first character
  – if a token begins with "i", is it an identifier?
  – if a token begins with "2", is it an integer? a constant?
  – interleaved tokenizer code is hard to write correctly, and harder to maintain
• More principled approach: a lexer generator that generates an efficient tokenizer automatically (e.g., lex, flex)

Issues
• How to describe tokens unambiguously
  2.e0 20.e-01 2.0000
  "" "x" "\\" "\"\'"
• How to break text down into tokens
  if (x == 0) a = x<<1;
  if (x == 0) a = x<1;
• How to tokenize efficiently
  – tokens may have similar prefixes
  – want to look at each character ~1 time

How To Describe Tokens
• Programming language tokens can be described using regular expressions
• A regular expression R describes some set of strings L(R)
• L(R) is the language defined by R
  – L(abc) = { abc }
  – L(hello|goodbye) = { hello, goodbye }
  – L([1-9][0-9]*) = _______________
• Idea: define each kind of token using an RE

Regular Expressions
• Language – a set of strings
• String – a finite sequence of symbols
• Symbols – taken from a finite alphabet
• Specify languages using regular expressions:
  Symbol a            one instance of a
  Epsilon ε           the empty string
  Alternation R|S     a string from either L(R) or L(S)
  Concatenation R∙S   a string from L(R) followed by a string from L(S)
  Repetition R*       zero or more strings from L(R), concatenated

Convenient Shorthand
  [abcd]      one of the listed characters (a | b | c | d)
  [b-g]       [bcdefg]
  [b-gM-Qkr]  ____________
  [^ab]       anything but one of the listed chars
  [^a-f]      ____________
  M?          zero or one M
  M+          one or more M
  M*          ____________
  "a.+*"      literally a.+*
  .           any single character (except \n)

Examples
  Regular Expression            Strings in L(R)
  digit = [0-9]                 "0" "1" "2" "3" …
  posint = digit+               "8" "412" …
  int = -? posint               "-42" "1024" …
  real = int (ε | (. posint))   "-1.56" "12" "1.0"
  [a-zA-Z_][a-zA-Z0-9_]*        C identifiers
• Lexer generators support abbreviations
  – but they cannot be recursive

More Examples
  Whitespace:
  Integers:
  Hex numbers:
  Valid UVa User Ids:
  Loop keywords in C:

Breaking up Text
  elsex=0;  – could be tokenized as "else x = 0 ;" or as "elsex = 0 ;"
• REs alone are not enough: we need rules for choosing among matches
• Most languages: the longest matching token wins
  – even if a shorter token is the only way to tokenize the rest
• Ties in length are resolved by prioritizing tokens
• REs + priorities + longest-matching-token rule = lexer definition

Lexer Generator Specification
• Input to lexer generator:
  – a list of regular expressions in priority order
  – an associated action for each RE (generates the appropriate kind of token, other bookkeeping)
• Output:
  – a program that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error – "Unexpected character")

Lex: A Lexical Analyzer Generator
Lex produces a C program from a lexical specification (http://www.epaperpress.com/lexandyacc/)

DIGITS     [0-9]+
ALPHA      [A-Za-z]
CHARACTER  {ALPHA}|_
IDENTIFIER {ALPHA}({CHARACTER}|{DIGITS})*
%%
if            {return IF; }
{IDENTIFIER}  {return ID; }
{DIGITS}      {return NUM; }
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)  {return ____; }
.             {error(); }

Lexer Generator
• Reads in a list of regular expressions R1,…,Rn, one per token, with attached actions
  -?[1-9][0-9]* { return new Token(Tokens.IntConst,
                                   Integer.parseInt(yytext())); }
• Generates scanning code that decides:
  1. whether the input is lexically well-formed
  2. the corresponding token sequence
• Problem 1 is equivalent to deciding whether the input is in the language of the regular expression
• How can we efficiently test membership in L(R) for arbitrary R?
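The rule-list-plus-priorities-plus-longest-match behavior described above can be sketched in a few lines of Python. This is a toy for illustration, not what lex generates; the rule set and token names are assumptions:

```python
import re

# Token rules in priority order: (name, regex). The keyword "if" is
# listed before the identifier rule, mirroring a lex specification.
RULES = [
    ("IF", r"if"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUM", r"[0-9]+"),
    ("WS", r"[ \t\n]+"),  # matched, then discarded
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        # Try every rule at the current position and keep the longest
        # match; ties go to the earlier (higher-priority) rule.
        best = None
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise SyntaxError("Unexpected character: %r" % text[pos])
        name, lexeme = best
        if name != "WS":
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens
```

On "if ifx" this yields IF then ID: "if" ties between the IF and ID rules and priority breaks the tie, while "ifx" is a strictly longer ID match. A generated lexer does not re-scan with every regex like this; it compiles the whole rule set into a single automaton, which is where the next slides pick up.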
Regular Expression Matching
• Sketch of an efficient implementation:
  – start in some initial state
  – look at each input character in sequence, updating the scanner state accordingly
  – if the state at the end of the input is an accepting state, the input string matches the RE
• For tokenizing, only a finite amount of state is needed: a (deterministic) finite automaton (DFA), or finite state machine

High Level View
Design time: specification → Scanner Generator → Scanner
Compile time: source code → Scanner → tokens
• Regular expressions = specification
• Finite automata = implementation
• Every regex has an FSA that recognizes its language

Finite Automata
A finite automaton takes an input string and determines whether it is a valid sentence of a language.
– A finite automaton has a finite set of states
– Edges lead from one state to another
– Edges are labeled with a symbol
– One state is the start state
– One or more states are final states
(Figure: an automaton accepting IF via edges 0 -i-> 1 -f-> 2, and an identifier automaton with 26 [a-z] edges from the start state and an [a-z0-9] self-loop, accepting ID.)

Language
Each string is accepted or rejected:
1. Start in the start state
2. The automaton follows one edge for every character (the edge must match the character)
3. After n transitions for an n-character string, accept if the automaton is in a final state
Language: the set of strings that the FSA accepts
(Figure: a combined DFA – 0 -i-> 1 and 1 -f-> 2, with state 2 accepting IF; 0 -[a-hj-z]-> 3, 1 -[a-z0-9] other than f-> 3, 2 -[a-z0-9]-> 3, and 3 -[a-z0-9]-> 3, with states 1 and 3 accepting ID.)
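The combined IF/ID automaton above can be simulated directly with a small transition function. This is a sketch; the 0–3 state numbering is my reading of the slide's figure:

```python
# DFA distinguishing the keyword "if" from identifiers [a-z][a-z0-9]*.
# States: 0 = start, 1 = saw "i" (accepts ID), 2 = saw "if" (accepts IF),
#         3 = any other identifier prefix (accepts ID).

def step(state, ch):
    """Return the next state, or None if no edge matches."""
    if state == 0:
        if ch == "i":
            return 1
        if ch.isalpha() and ch.islower():
            return 3                      # the [a-hj-z] edge
        return None
    if state == 1:
        if ch == "f":
            return 2
        if ch.islower() or ch.isdigit():
            return 3                      # [a-z0-9] other than f
        return None
    if state in (2, 3):
        if ch.islower() or ch.isdigit():
            return 3                      # [a-z0-9] self-loop via state 3
        return None
    return None

ACCEPT = {1: "ID", 2: "IF", 3: "ID"}

def classify(s):
    """Run the DFA over s; return the token kind, or None on rejection."""
    state = 0
    for ch in s:
        state = step(state, ch)
        if state is None:
            return None
    return ACCEPT.get(state)
```

Note that the machine decides between IF and ID without ever backing up: "if" ends in state 2 (IF), while one more identifier character falls through to state 3 (ID). This one-pass property is what makes DFA-based scanning look at each character about once.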