
R.Zviel-Girshin Theory of compilation and translation

A Lexical Analyzer

The lexical analyzer is the first phase of the compiler.

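Modus operandi:

[Figure: source program → (character stream) → lexical analyzer → (token) → parser; the parser sends "get next token" requests back to the lexical analyzer; a symbol table is shown next to both.]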

A lexical analyzer scans its input (a character stream) and produces a stream of tokens. Basically, each word of the source program is a token.

Each token is passed to the parser, which checks it and asks for the next token.

In some cases, when a token is new (not seen before), it is added to a data structure called a symbol table.

A lexical structure of tokens is specified/defined by using regular expressions (a well defined formal language specification).

A lexical analyzer has 2 purposes:

 it scans the input

 it analyzes its lexical correctness

Lexical analyzer also keeps track of the source-coordinates of each token:

 in which file it appears,

 token’s position – line and column number.

This is useful for debugging purposes.


Another task of the lexical analyzer is to get rid of white spaces, comments and other unimportant details.

Example of tokens

Token name    Token value - Lexeme
ID            last  my  x  sum
NUM           567  67.89  82.001  4.5e-10
IF            if
COMMA         ,
NOTEQ         !=

Example of non-tokens:

Type                      Example
comment                   //this is one line comment
                          /* this is multi-line comment*/
white spaces              \t \n \b
Preprocessor directive    #include <iostream.h>

A token is a logical unit defined by programmer.

Example

Given a C code:

void match(char *mystring) /* checking if the string is HELLO */
{
    if(!strncmp(mystring, "HELLO"))
        return 0;
}

Lexical analyzer will return the following tokens:

VOID ID(match) LPAREN CHAR STAR ID(mystring) RPAREN

LBRACE

IF LPAREN BANG ID(strncmp) LPAREN ID(mystring) COMMA

STRING(HELLO) RPAREN RPAREN

RETURN NUM(0) SEMI

RBRACE

EOF


Basic tokens are numbers, keywords, types, operators and identifiers.

A “value” of a token is called a lexeme .

Example

Token Lexeme

INT 35 1223

ID X1 SUM

REAL 25.5

Actually lexeme holds an original name or value of the token.

Tokens can have attributes .

For example, if we have a token called NUMBER it can have an attribute that tells us if the number is integer or real, positive or negative.


A symbol table

A symbol table is a data structure used throughout the compiling process to build up information about identifiers used in source program.

Usually it holds:

 variable name (or lexeme),

 its type

 additional attributes - is it a function, a class or a variable, what line it was written in, from which line it was called.

Basic symbol table data structure is:

 a linked list

 an array

 a dictionary tree.

A basic linked list symbol table example:

struct st_entry
{
    char *lexeme;
    int type;               //or can be char * type;
    char *attributes;
    struct st_entry *next;
};

The lexical analyzer only adds new lexemes to the symbol table. First it checks whether the lexeme is already in the table:

If yes, it does nothing; otherwise it adds the lexeme to the appropriate symbol table cell.
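The check-then-add behavior can be sketched for the linked list variant as follows. This is a minimal sketch, not the implementation of these notes: the function names st_lookup and st_insert are assumptions, the attributes field is left out for brevity, and new entries are simply pushed on the front of the list.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One entry of a linked-list symbol table. */
struct st_entry {
    char *lexeme;
    int type;
    struct st_entry *next;
};

/* Return the entry holding lexeme w, or NULL if w is not in the table. */
struct st_entry *st_lookup(struct st_entry *table, const char *w)
{
    for (; table != NULL; table = table->next)
        if (strcmp(table->lexeme, w) == 0)
            return table;
    return NULL;
}

/* Add w only if it is new; return the existing or the new entry. */
struct st_entry *st_insert(struct st_entry **table, const char *w, int type)
{
    struct st_entry *e;
    assert(table != NULL && w != NULL);
    e = st_lookup(*table, w);
    if (e != NULL)
        return e;                      /* already present: do nothing */
    e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    e->lexeme = malloc(strlen(w) + 1);
    if (e->lexeme == NULL) {
        free(e);
        return NULL;
    }
    strcpy(e->lexeme, w);
    e->type = type;
    e->next = *table;                  /* push on the front of the list */
    *table = e;
    return e;
}
```

With a list, every lookup is linear in the number of entries, which is why larger compilers usually prefer a hash table.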


Basic symbol table operations

 insert(w,t) – returns the index of a new entry for token t and its lexeme w

 lookup(w) – returns the entry for the lexeme w, or 0 if it is not found

where

 w is the token's lexeme and

 t is a table name (pointer to the symbol table)

The implementation of the lookup() and insert() functions depends on the data structures you choose to use. Usually those functions are implemented with a hashing mechanism.

Usage of the symbol table

A symbol table is used mainly in the front end of the compiler.

It is created at the beginning of the compilation process.

During the lexical analysis phase all new lexemes are added to the table.

Later in a parser phase additional attributes can be entered (such as an id type or a file type or current value of the id or a number of arguments in case that id type was a function).

In a semantic analysis phase a symbol table is used for the type checking and types validation.

During intermediate code generation phase an exact place of the lexeme is important and functional attributes of the lexeme are used.


A lexical analyzer or a scanner uses regular expressions and finite automata to recognize appropriate tokens.

Some basic definitions

Language is a set of words or strings.

String is a finite sequence of letters over finite alphabet.

Alphabet is a finite set of letters. The alphabet is denoted by Σ. In programming languages such an alphabet is the ASCII character set, but it could be the UNICODE set or a subset of ASCII (only the English alphabet and separators).

Empty string is a string with 0 letters in it. It has a sign - ε.

Regular expressions

Regular expression definitions

Basic regular expressions:

a         An ordinary character a stands for itself and is a regular expression.
ε         The empty word is a regular expression.
Ø         The empty language (the empty regular expression) is a regular expression.

Some operations over basic regular expressions:

M | N     Alternation: M or N. Also a regular expression. Can be written as M+N.
M . N     Concatenation: M followed by N. Also a regular expression. Can be written as MN.
M *       Repetition: zero or more times of M. Also a regular expression.
M +       Repetition: one or more times of M. Also a regular expression.
M ?       Repetition: zero or one occurrence of M. Also a regular expression.

More abbreviations:

[a-zA-Z]  Character set alternation: a single character from the set. Also a regular expression.
.         Any (single) character but a new-line. Also a regular expression.
"s"       Quotation: stands for the string s itself, literally.

Examples

Regular expression for all words of length 2 over the English alphabet: r = [a-zA-Z][a-zA-Z]

Regular expression for all possible words over the digits alphabet: r = ([0-9])*

Regular expression for all words that start with aB and end with BC: r = aB[a-zA-Z]*BC

Regular expression for real numbers: r = [0-9]+"."[0-9]* | [0-9]*"."[0-9]+ or r = [0-9]+"."[0-9]* + [0-9]*"."[0-9]+ (here the + standing alone between the two halves is alternation, used instead of |)
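The real-number expression can be tried out with the POSIX regular expression functions regcomp and regexec. The sketch below assumes a POSIX system that provides <regex.h>; the function name is_real is hypothetical, and the anchors ^ and $ are added so that the whole word has to match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the whole word w matches the real-number expression, 0 otherwise. */
int is_real(const char *w)
{
    regex_t re;
    int matched;
    assert(w != NULL);
    /* [0-9]+"."[0-9]* | [0-9]*"."[0-9]+ written in POSIX extended syntax */
    if (regcomp(&re, "^([0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    matched = (regexec(&re, w, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

Note that a lone "." is rejected, because each alternative requires at least one digit on one side of the point.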


Usually the number of parentheses in a regular expression should be small. To avoid using too many parentheses, we assume that the operations have the following priority hierarchy:

operator              priority
* (repetition)        highest (do it first)
° (concatenation)
+ (alternation)       lowest (do it last)

Regular expression operators

| means "or"

* means zero or more instances of

+ means one or more instances of

? means zero or one instance of

() priority or grouping

More examples over Σ={a,b}:

ε, a, a+b, b*, (a+b)b, ab*, a*+b*

For each regular expression r what is a language it describes - L(r)?


Translation technique

1. Once all tokens are defined (using regular expressions), an automaton can be built for each regular expression.

2. Unite all the automata together.

3. Translate this automaton into a transition table.

4. Find the longest matching token according to this table.

regular expression→NFA→DFA→transition table→lexical analyzer

Finite Automaton

(Definition for students who did not study an automata course.)

A finite automaton is a directed graph.

Nodes in this graph represent a finite set of states, one of which is the start state; several states can be final states.

Edges are transitions that tell, for each state and for each symbol of the alphabet, which state to go to next.

[Figure: automaton with states q0 and q1 and edges labeled c and a,b]

L = { w | w over {a,b,c}* and the last letter of w is c }


Graphical representation

A circle represents each state. The name of the state (for example qi) is written inside the circle.

The transition function (called delta, δ) is represented by directed and labeled edges between states. Each edge carries the letters of the alphabet on which the transition is taken, for example an edge from qi to qk labeled a,b,c.

A double circle represents final state.

The start state has an incoming arrow.

Modus operandi of automaton

 Reading of the string starts at the start state.

 In each step only one letter of the string is read.

 After reading a letter of the string a “transitional decision” has to be made: which state to go to. This decision is made according to the edges of the automaton.

 Every string that ends in a final state is accepted by the automaton and belongs to the automaton's language.

 If the automaton gets “stuck” (there is no transition for a given input symbol, that is, no edge with an appropriate input symbol), then the word is illegal.


Example

Automaton A:

[Figure: automaton A with states q0 and q1 and edges labeled c and a,b]

Which state is a final state?

Which state is a start state?

Where does the automaton go from state q1 on input a? On input c?

What happens if the input ends with a? With b? With c?

Let’s see if A accepts w=cabc .

We start in the start state q0. The first symbol of the input is c.

q0 -c-> q1 -a-> q0 -b-> q0 -c-> q1

q1 is an accepting (final) state, which means that w belongs to L(A).

Let’s see if A accepts w=acbb .

We start in the start state q0. The first symbol of the input is a.

q0 -a-> q0 -c-> q1 -b-> q0 -b-> q0

q0 is a regular, non-accepting state, which means that w does not belong to L(A).
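The two runs above can be replayed with a small simulation of automaton A. This is a sketch under the assumptions that q0 is the start state, q1 is the only final state, every c leads to q1 and every a or b leads to q0 (which is what the traces and the language L show); the function name accepts_A is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate automaton A on w; returns 1 if w is accepted, 0 otherwise. */
int accepts_A(const char *w)
{
    int state = 0;                       /* 0 stands for q0, 1 for q1 */
    assert(w != NULL);
    for (; *w != '\0'; w++) {
        switch (*w) {
        case 'a':
        case 'b':
            state = 0;                   /* edges labeled a,b lead to q0 */
            break;
        case 'c':
            state = 1;                   /* edges labeled c lead to q1 */
            break;
        default:
            return 0;                    /* stuck: symbol outside {a,b,c} */
        }
    }
    return state == 1;                   /* accepted only when the run ends in q1 */
}
```

Calling accepts_A("cabc") follows the first trace and accepts_A("acbb") follows the second one.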

Given several strings: abc, acacb, a, ε, cccaac.

Do these strings belong to L(A)?


DFA vs. NFA

DFA
name: deterministic finite automaton
transitions: 1. no ε transitions; 2. for each input character there exists only one transition
word paths: single path

NFA
name: non-deterministic finite automaton
transitions: 1. ε transitions are allowed; 2. for each input character several transitions are allowed
word paths: multiple paths

Example

L = { w | w over {0,1}* and w starts and ends with 0 and after each 1 should be 0 }

[Figures: state diagrams of an NFA and of a DFA for L, with edges labeled 0 and 1]


Translation of regular expression into automaton

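Regular expression a is translated into the following automaton:

[Figure: start state i, an edge labeled a, final state f]

Regular expression M|N is translated into the following automaton:

[Figure: start state i, the automata for M and for N side by side, final state f]

Regular expression M . N is translated into the following automaton:

[Figure: start state i, the automaton for M, intermediate state k, the automaton for N, final state f]

Regular expression M* is translated into the following automaton:

[Figure: start state i, the automaton for M with ε-transitions, final state f]

Regular expression M+ is translated into the following automaton:

[Figure: start state i, two copies of the automaton for M, final state f]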


Union of all automata is easy. It is done in the following way:

1. Add a new start state q0.

2. Connect q0 to the start states of all regular expression automata with ε-transitions.

[Figure: new start state q0 with ε-transitions to q01 and q02, the start states of the individual automata]

regular expression→NFA→DFA→transition table→lexical analyzer


Basic definitions

closure(q) – is the set of all states that can be reached from q with ε-transitions only.

[Figure: example NFA with states q0, q1, q2 and ε-transitions]

closure(q0)={q0,q1,q2}

closure(q1)={q1,q2}

move(T,a) – is the set of all states that can be reached on input a from some state in the set T of NFA states.

[Figure: example NFA with states q0, q1 and edges labeled 0 and 0,1]

move(q0,0)={q0,q1}

move(q1,1)=Ø

move({q0,q1},1)={q1}

Simulating an NFA on input string

Input: NFA N and a string x

Output: “yes” if N accepts x, “no” otherwise

Algorithm

state = closure(q0);
a = nextchar;
while a != end_of_word
{
    state = closure(move(state,a));
    a = nextchar;
}
if (state ∩ F == Ø) then return “no” else return “yes”

where F is the set of final states.
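The algorithm can be sketched in C by storing a set of states as a bit mask. Everything concrete in this sketch is an assumption made for illustration: the three-state NFA over {0,1} (q0 -ε-> q1, q0 -1-> q0, q1 -0-> q2, q2 -0-> q2, final state q2) and the names nfa_closure, nfa_move and nfa_accepts.

```c
#include <assert.h>
#include <stddef.h>

#define NSTATES 3
typedef unsigned set_t;                /* bit i is set <=> state qi is in the set */

/* Hypothetical NFA: q0 -eps-> q1, q0 -1-> q0, q1 -0-> q2, q2 -0-> q2 */
static const set_t eps[NSTATES]      = { 1u << 1, 0, 0 };
static const set_t delta[NSTATES][2] = {
    { 0,       1u << 0 },              /* q0: on 0 nothing, on 1 to q0 */
    { 1u << 2, 0       },              /* q1: on 0 to q2               */
    { 1u << 2, 0       },              /* q2: on 0 to q2               */
};
static const set_t final_states = 1u << 2;

/* All states reachable from s using epsilon-transitions only. */
set_t nfa_closure(set_t s)
{
    set_t prev;
    do {
        prev = s;
        for (int q = 0; q < NSTATES; q++)
            if (s & (1u << q))
                s |= eps[q];
    } while (s != prev);               /* repeat until nothing new is added */
    return s;
}

/* All states reachable on input symbol a (0 or 1) from some state in s. */
set_t nfa_move(set_t s, int a)
{
    set_t r = 0;
    for (int q = 0; q < NSTATES; q++)
        if (s & (1u << q))
            r |= delta[q][a];
    return r;
}

/* Returns 1 ("yes") if the NFA accepts x, 0 ("no") otherwise. */
int nfa_accepts(const char *x)
{
    set_t state;
    assert(x != NULL);
    state = nfa_closure(1u << 0);
    for (; *x != '\0'; x++) {
        if (*x != '0' && *x != '1')
            return 0;                  /* symbol outside the alphabet */
        state = nfa_closure(nfa_move(state, *x - '0'));
    }
    return (state & final_states) != 0;
}
```

A bit mask keeps the set operations cheap, but limits the NFA to as many states as the integer type has bits.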


Example

[Figure: NFA with states 0, 1, 2, 3 and edges labeled 0, 1 and ε]

Does w=110 belong to the automaton language?

state           input              ns=move(state,input)   closure(ns)
closure(0)=0    1                  1                      1
1               1                  2                      2,0
2,0             0                  Ø U 3 = 3              3,0
3,0             end_of_the_word

0 is a final state.

Result: w=110 belongs to the automaton language.

Does w=10 belong to the automaton language?

state           input              ns=move(state,input)   closure(ns)
closure(0)=0    1                  1                      1
1               0                  not defined = Ø

End of the word is not reached.

Result: w=10 does not belong to the automaton language.

Does w=1 belong to the automaton language?

state           input              ns=move(state,input)   closure(ns)
closure(0)=0    1                  1                      1
1               end_of_the_word

1 is a regular state.

Result: w=1 does not belong to the automaton language.


Convert an NFA to an equivalent DFA

Each state of the converted DFA is a set of states of the NFA.

The DFA simulates “in parallel” all possible moves of the NFA on a given input.

Algorithm for converting NFA into DFA

1. DFAstates = {closure(q0)}; and it is unmarked

2. while there is an unmarked state T in DFAstates
{
    mark T;
    for each input symbol a
    {
        U = closure(move(T,a));
        if U is not in DFAstates
            then add U to DFAstates as unmarked state
        transition[T][a] = U;
    }
}

3. States that contain a final state of the NFA are final states in the resulting DFA.

4. Start state is closure(q0)

The array named transition holds the transition function δ of the resulting DFA. transition has a number of rows equal to the number of states in the resulting DFA and a number of columns equal to the number of alphabet symbols of the language.

regular expression→NFA→DFA→transition table→lexical analyzer
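The conversion can be sketched in C with bit masks for sets of NFA states. The NFA used here is an assumption: it is read off the moves computed in the dry run of these notes (q0 -ε-> q1, q0 -1-> q1, q1 -0-> q2, q2 -0-> q2, q2 -1-> q1). The names find_or_add, move_on and nfa_to_dfa are hypothetical, and "marked" is represented implicitly: every DFA state with an index below the loop counter has already been processed.

```c
#include <assert.h>

#define NSTATES 3                      /* NFA states q0..q2 */
#define NSYMS   2                      /* alphabet {0,1} */
#define MAXDFA  (1 << NSTATES)         /* at most 2^NSTATES subsets */
typedef unsigned set_t;                /* bit i is set <=> qi is in the set */

static const set_t eps[NSTATES] = { 1u << 1, 0, 0 };
static const set_t delta[NSTATES][NSYMS] = {
    { 0,       1u << 1 },              /* q0 */
    { 1u << 2, 0       },              /* q1 */
    { 1u << 2, 1u << 1 },              /* q2 */
};

static set_t closure(set_t s)
{
    set_t prev;
    do {
        prev = s;
        for (int q = 0; q < NSTATES; q++)
            if (s & (1u << q))
                s |= eps[q];
    } while (s != prev);
    return s;
}

static set_t move_on(set_t s, int a)
{
    set_t r = 0;
    for (int q = 0; q < NSTATES; q++)
        if (s & (1u << q))
            r |= delta[q][a];
    return r;
}

set_t dfa_states[MAXDFA];              /* each DFA state is a set of NFA states */
int   transition[MAXDFA][NSYMS];       /* entries are indices into dfa_states */
int   ndfa;

/* Return the index of set u in dfa_states, adding it as unmarked if it is new. */
static int find_or_add(set_t u)
{
    for (int i = 0; i < ndfa; i++)
        if (dfa_states[i] == u)
            return i;
    assert(ndfa < MAXDFA);
    dfa_states[ndfa] = u;
    return ndfa++;
}

void nfa_to_dfa(void)
{
    ndfa = 0;
    find_or_add(closure(1u << 0));     /* start state closure(q0) */
    for (int t = 0; t < ndfa; t++)     /* states with index < t are marked */
        for (int a = 0; a < NSYMS; a++)
            transition[t][a] = find_or_add(closure(move_on(dfa_states[t], a)));
}
```

After nfa_to_dfa() the array dfa_states holds the sets in the order in which they were discovered, and the empty set Ø appears as an ordinary (trap) state with the value 0.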


Dry run of the algorithm

A: [Figure: NFA A with states q0, q1, q2; an ε-transition from q0 to q1 and edges labeled 0 and 1]

DFAstates={closure(q0)}={{q0,q1}}

Unmarked state {q0,q1}:

Mark {q0,q1}

For 0 U=closure(move({q0,q1},0))=closure({q2})={q2};

Add {q2} to DFAstates: DFAstates={ {q0,q1}’,{q2}}

transition[{q0,q1}][0]={q2}

For 1 U=closure(move({q0,q1},1))=closure({q1})={q1};

Add {q1} to DFAstates: DFAstates={ {q0,q1}’,{q2},{q1}}

transition[{q0,q1}][1]={q1}

Next unmarked state {q1}:

Mark {q1}

For 0 U=closure(move({q1},0))=closure({q2})={q2};

{q2}is in DFAstates no addition transition[{q1}][0]={q2}

For 1 U=closure(move({q1},1))=closure(Ø)= Ø;

Add Ø to DFAstates: DFAstates={ {q0,q1}’,{q2},{q1}’, Ø } transition[{q1}][1]= Ø

Next unmarked state {q2}:

Mark {q2}

For 0 U=closure(move({q2},0))=closure({q2})={q2};

{q2}is in DFAstates no addition:

DFAstates={ {q0,q1}’,{q2}’,{q1}’, Ø } transition[{q2}][0]={q2}

For 1 U=closure(move({q2},1))=closure({q1})= {q1};

{q1}is in DFAstates no addition transition[{q2}][1]= {q1}

Next unmarked state Ø :

Mark Ø

For 0 U=closure(move(Ø,0))=closure(Ø)= Ø;

Ø is in DFAstates no addition transition[Ø][0]= Ø

For 1 U=closure(move(Ø,1))=closure(Ø)= Ø;

Ø is in DFAstates no addition transition[Ø][1]= Ø


Resulting transition table:

state      0       1
{q0,q1}    {q2}    {q1}
{q1}       {q2}    Ø
{q2}       {q2}    {q1}
Ø          Ø       Ø

Resulting DFA automaton is:

[Figure: DFA with states {q0,q1}, {q1}, {q2} and Ø; edges labeled 0 and 1, with a 0,1 loop on Ø]

Transition table

A transition table represents the transition function of an automaton. It is a matrix, or a two dimensional array.

Transition table rows are the states of the DFA and its columns are the letters of the alphabet (in our case the ASCII code).

A transition table implementation:

int transition[States][256]=
{ /* ... 0,1,...a,b,...*/
/*state 0*/ {0,0,0,..,0,0,...0,0,0...},
/*state 1*/ {0,0,0,..,3,5,...0,0,0...},
}

where transition[1]['a']= 3 means that from state 1 we move to state 3 when 'a' is the input letter.

Each entry transition[state][char] has exactly one place to go to.

Why?

We used a DFA to build it.
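A complete, if tiny, table of this shape can be filled and used as follows. The sketch assumes a hypothetical two-state DFA for [0-9]+ (state 0 is the start state, state 1 is accepting) and marks cells without a transition with -1; the names build_table and is_integer are not from these notes.

```c
#include <assert.h>
#include <stddef.h>

#define STATES 2
#define ERR    (-1)                    /* cell with no transition */

static int transition[STATES][256];

/* Fill the table: digits lead to state 1, everything else is an error. */
void build_table(void)
{
    for (int s = 0; s < STATES; s++)
        for (int c = 0; c < 256; c++)
            transition[s][c] = ERR;
    for (int c = '0'; c <= '9'; c++) {
        transition[0][c] = 1;
        transition[1][c] = 1;
    }
}

/* Run the DFA over the whole word; 1 if it ends in the accepting state. */
int is_integer(const char *w)
{
    int state = 0;
    assert(w != NULL);
    for (; *w != '\0'; w++) {
        state = transition[state][(unsigned char)*w];  /* one lookup per letter */
        if (state == ERR)
            return 0;
    }
    return state == 1;
}
```

The cast to unsigned char keeps the column index inside 0..255 even on platforms where char is signed.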


Example

Building a transition table for a language L, the language of arithmetical expressions over Σ={0,1,2,3,4,5,6,7,8,9,.,+,-,*,/,=,(,)}.

First let’s recognize our tokens:

integer number: 0 | [1-9][0-9]*
real number: 0.[0-9]+ | [1-9][0-9]*.[0-9]+
plus: +
minus: -
multiplication: *
division: /
equality: =
left parenthesis: (
right parenthesis: )

Now let’s build an automaton for L.

[Figure: automaton for L with states 0 to 11; edges labeled +, -, *, /, =, (, ), ., 0, 1-9 and 0-9]

The resulting automaton has 12 states. The alphabet Σ has 18 letters.

The resulting transition table size is 12×18.

transition[12][18]=

state   +   -   *   /   =   (   )   .    0    1    2    3    4    5    6    7    8    9
0       1   2   3   4   5   6   7        9    8    8    8    8    8    8    8    8    8
1..7    no move
8                                   10   8    8    8    8    8    8    8    8    8    8
9                                   10
10                                       11   11   11   11   11   11   11   11   11   11
11                                       11   11   11   11   11   11   11   11   11   11

All empty cells are lexical errors for the current state.

Usually negative values are written in lexical error cells.

We will put –1.

Sometimes error recovery routines can be proposed. An error recovery routine can fix a given source program by adding, deleting or updating symbols. If we can recover from the error, the number of the specific error recovery routine is put into the table.

For example: transition[k][j]=e5 or err5 where e5 or err5 is an error recovery routine.

regular expression→NFA→DFA→transition table→lexical analyzer


Token recognition policy

1. If at some point two tokens can be recognized, the longest one should be returned.

2. If a lexeme matches two tokens, then the first one in the order of the rules will be returned.

Example: while can match the reserved word token and can match the identifier token.

The lexical analyzer looks several symbols ahead before it matches the next token.

For example, if the lexical analyzer scans 3<>5, then after the 3 it can recognize several tokens:

< or

<= or

<>

That is why, even when the DFA is in an accepting state (for “<”), the scanner does not return the token LTN but looks ahead and waits for the longest match.
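The longest-match policy for the 3<>5 example can be sketched as follows. The token names (T_NUM, T_LT, T_LE, T_NE), the state numbering and the functions step, token_of and next_token are assumptions made for this illustration; the scanner remembers the last accepting position and keeps reading until the DFA has no move.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { T_ERROR, T_NUM, T_LT, T_LE, T_NE };

/* One DFA step; -1 means that there is no transition. */
static int step(int state, char c)
{
    switch (state) {
    case 0:                                  /* start state */
        if (isdigit((unsigned char)c)) return 1;
        if (c == '<') return 2;
        return -1;
    case 1:                                  /* inside a number */
        return isdigit((unsigned char)c) ? 1 : -1;
    case 2:                                  /* "<" seen: may still grow */
        if (c == '=') return 3;
        if (c == '>') return 4;
        return -1;
    default:                                 /* "<=" and "<>" cannot grow */
        return -1;
    }
}

/* Token recognized in an accepting state, T_ERROR for other states. */
static int token_of(int state)
{
    switch (state) {
    case 1: return T_NUM;
    case 2: return T_LT;
    case 3: return T_LE;
    case 4: return T_NE;
    default: return T_ERROR;
    }
}

/* Scan one token starting at *p and advance *p past the longest lexeme. */
int next_token(const char **p)
{
    const char *s, *last_end = NULL;
    int state = 0, last_token = T_ERROR;
    assert(p != NULL && *p != NULL);
    s = *p;
    while (*s != '\0' && (state = step(state, *s)) != -1) {
        s++;
        if (token_of(state) != T_ERROR) {    /* remember the longest match so far */
            last_token = token_of(state);
            last_end = s;
        }
    }
    if (last_end != NULL)
        *p = last_end;                       /* give back everything read after it */
    return last_token;
}
```

On the input 3<>5 three calls return T_NUM, T_NE and T_NUM: after reading "<" the scanner is already in an accepting state, but it reads on and prefers the longer "<>".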


Creating Lexical analyzer by hand

Example for partial implementation of lexical analyzer for a small calculator language:

/* common.h header file*/

#ifndef COMMON_H

#define COMMON_H

#include<stdio.h>

#include<ctype.h>

typedef enum token_types{

NONE,PLUS,MINUS,MUL,DIV,

LPAR,RPAR,

EQUAL,

INTEGER,REAL,

SCANEOF} token;

// main functions of lexical analyzer
token scanner();
void read_in_buffer();
int reach_eof();
char read_char();
void unread_chars(int how_many);

// auxiliary functions of lexical analyzer
int accepting(int state);
token appropriate_token(int state);
int move(int state,char letter);

#endif /* COMMON_H */


/* SCANNER.C file*/
#include <stdlib.h>
#include "common.h"

int main()
{
    token tok;
    read_in_buffer(); //reads input file to the buffer
    while((tok=scanner())!=SCANEOF)
        switch(tok)
        {
        case PLUS: puts("PLUS +"); break;
        case MINUS: puts("MINUS -"); break;
        case MUL: puts("MUL *"); break;
        case DIV: puts("DIV /"); break;
        case LPAR: puts("LPAR ("); break;
        case RPAR: puts("RPAR )"); break;
        case EQUAL: puts("EQUAL ="); break;
        case INTEGER: puts("NUM lexeme INTEGER,…"); break;
        case REAL: puts("NUM lexeme REAL,…"); break;
        default: break;
        }
    return 0;
}

token scanner()
{
    int last_accepting=-1, since_last_accepting=0, state=0;
    // last_accepting==-1 means that no accepting state was reached yet
    // since_last_accepting counts the letters read since the last accepting state
    // state 0 is the beginning state
    char letter;
    if(reach_eof())
        return SCANEOF;
    // letter is next input letter, state is next state
    // get next char until letter=='\0' or state==0 – "trap" state
    while((letter=read_char()) && (state=move(state,letter)))
    {
        // if current state is accepting state
        if(accepting(state))
        {
            last_accepting=state;
            since_last_accepting=0;
        }
        // if current state is not accepting state
        else
            since_last_accepting++;
    }
    if(last_accepting==-1)
    {
        fputs("Error, exiting…\n", stderr);
        exit(1);
    }
    // how many letters to unread
    // (if letter=='\0' then go back since_last_accepting letters)
    unread_chars(letter ? since_last_accepting+1 : since_last_accepting);
    // to return appropriate token
    return appropriate_token(last_accepting);
}

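#include <string.h>

#define BUFLEN 256

static char buf[BUFLEN], *input_pointer;

void read_in_buffer()
{
    // fgets replaces gets, which is unsafe and was removed from the C standard
    if(fgets(buf, BUFLEN, stdin)==NULL)
        buf[0]='\0';
    buf[strcspn(buf, "\n")]='\0'; // drop the new line that fgets keeps
    input_pointer=buf;
}

int reach_eof()
{ return *input_pointer == '\0';}

char read_char()
{ return *input_pointer ? *input_pointer++ : '\0';}

void unread_chars(int how_many)
{ input_pointer -= how_many; }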

Additional functions:

/* scan_table.c file */

// functions for transition table

// move(state,letter)

// accepting(state)

// and etc..


Creating lexical analyzer automatically

Lexical analyzer generator - FLEX

DFA construction is a mechanical task and can be performed by computer (Why? We have well defined algorithms for DFA construction).

FLEX is an automatic lexical analyzer generator. It translates regular expressions into DFA.

FLEX is an abbreviation of the Fast LEXical analyzer generator.

FLEX is a free reimplementation of LEX. LEX was written in C for UNIX operating systems in the 1970s.

FLEX input file basically looks like (RE {action})* where RE is some regular expression.

A Flex input file has a .l or .lex ending.

[Figure: filename.l → flex → lex.yy.c]

File lex.yy.c is the lexical analyzer constructed by the automated utility FLEX. It can be renamed later to some other name, for example myscanner.c. The resulting file contains a yylex() function.

A FLEX output file can be compiled later with any standard C compiler, and an .exe file will be produced.

[Figure: lex.yy.c or myscanner.c → C compiler → .exe file]

When the .exe file runs, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding actions written in C language code.


A basic FLEX file structure

A Flex file is divided into 3 basic parts, called

1. declarations and definitions,

2. translation rules,

3. user subroutines.

The parts of the FLEX input file are separated by lines beginning with %%.

%{
C declarations
%}
definitions
%%
translation rules
%%
user subroutines

[Figure: flex turns the translation rules into lex.yy.c with the yylex() function in it; the user subroutines are written to lex.yy.c verbatim]

The declarations and definitions part and the user subroutines part are optional.

Translation rules are written using the following structure: pattern action, where a pattern is a regular expression for some token and an action is a set of operations that should be performed by the lexical analyzer. An action can be a single statement or a set of statements. If a set of statements is used, then curly brackets group those statements into one block statement. Actions are written in the C language.

The first part of the FLEX input file, the so-called declarations and definitions part, contains definitions of text replacements that are internal to the flex program (example: ID [a-z][a-z0-9]* means that {ID} in a rule is replaced with [a-z][a-z0-9]*). It can also contain global C code enclosed between a line beginning with %{ and a line beginning with %}

(example:

%{

#include<math.h>

%})

This C section is added to the flex output file as it is (“%{” and “%}” are removed in the output file).

It can also contain C declarations (example: int line=0;), written on a line that starts with a white space.

!!!Any line that starts with a white space is copied to the output file as it is (therefore be careful where you put a white space).

An example of the first part of the Flex file:

%{
#include<math.h>
%}
ID [a-z][a-z0-9]*
 int line=0;
%%

The third part, the so-called user subroutines part, contains C code that is written to lex.yy.c as it is (verbatim). It usually contains functions that the syntax analyzer (parser), the second phase of the compiler, uses.

If you want to run just the lexical analyzer, then a main function should be added to your program. The main function is placed in this user subroutines part.

int main(int argc, char* argv[])
{
    yylex();
    return 0;
}

The second and most important part of the FLEX input file is a translation rules part. It is constructed from regular expression definitions and actions to be performed when a regular expression is matched.

That is why the additional definition of the FLEX program is – FLEX is a tool for generating programs that perform pattern-matching on text.


Example for pattern-action syntax:

%%

\n printf("\n");

where \n is a pattern and printf("\n") is an action

[1-9][0-9]* { printf("An integer: %s",yytext);
return INT;}

where [1-9][0-9]* is a pattern and { printf("An integer: %s",yytext); return INT;} is an action and yytext holds the lexeme of the token (a string, hence %s).

Multiline actions should be enclosed in curly brackets { action }.

The first separator (%%) between the first and the second part of the flex input file is essential, the second separator (%%) between the second and the third part is not needed if the third part is empty.

Some extended regular expressions

[abcd] a “character class” - means a|b|c|d

[a-d] a “character class” with a range in it - means [abcd]

[^A-Z] a “negated character class” - means any character but those in the class (^ should be inside [..])

. - means any character except a newline

a? - means zero or one a

r* - means repetition of r 0 or more times

r+ - means repetition of r 1 or more times

a{1,3} - means from one to three a’s

r{3} - means r exactly three times, equal to rrr

r{3,} - means three or more r’s

rs - means r concatenated with s

r/s - means r but only if it is followed by s

^r - means r but only at the beginning of an input line

r$ - means r but only at the end of an input line

r|s - means r or s (any one of them)

\ is used as the escape character - example: \” means ”. Some of the special symbols are: +-*?()[]{}|/\^$.<>

<<EOF>> an end-of-file sign

Examples

pattern                      meaning
"for"                        exact string - a reserved word
"--"                         exact string - the decrement operator
[A-Za-z_][A-Za-z0-9_]*       C identifiers
"/*".*"*/"                   single line comments
"//".*                       C++ comments
[1-9][0-9]*                  integer constants
[^0-9\n]                     any character except a digit or a new-line
a?"[x]\"*r"                  a string that starts with zero or one a followed by the exact string [x]"*r, where \ before " means that it is not the closing " but a literal character

Ambiguity

FLEX always chooses the pattern that matches the longest possible input string. If two patterns match the same string, the first pattern in the list is chosen.

For example: if an input word is int and rules are written in the following order:

[a-z]+ {return ID;}
int {return INT;}

then a token ID will be returned for the word int.

Therefore be careful and choose the correct order of regular expressions.


Global variables

Flex has several global variables. All start with the prefix yy (therefore it is not advisable to start your own variable names with yy). After a pattern is matched, the text (lexeme) corresponding to it is stored in yytext.

 yytext – holds the lexeme of the token, by default defined as a character pointer (but can be redefined as an array of chars)

 yyleng – holds the token's length

 yyin – input stream (default is keyboard)

 yyout – output stream (default is screen)

Some flex actions and functions

 yylex() – actions can include return statements, returned value goes to the routine yylex. Each time yylex is called it continues processing tokens from where it last left off until it either reaches end of file or return statement.

 yyless(n) – returns all but the first n characters of current token back to the input stream

ECHO – copies yytext to scanner’s output stream

BEGIN() – starts start/begin condition

REJECT - directs the scanner to proceed on to the "second best" rule that matched the input (or a prefix of the input).


Some FLEX examples

Example 1

A FLEX input program that copies an input file to the screen:

%{
%}
%%
. ECHO;
%%
int main(int argc, char* argv[])
{
    yyin=fopen(argv[1],"r");
    yylex();
    return 0;
}

where yyin is a global flex variable that holds the input stream; here the stream is opened from the file named by the command line argument argv[1].

Example 2

A program that adds line numbers to the input file:

%{
int line=1;
%}
%%
\n {line++; printf("\n%d:",line);}
. ECHO;
%%
int main(int argc, char* argv[])
{
    yyin=fopen(argv[1],"r");
    printf("1:");
    yylex();
    return 0;
}

Given an input file stam.txt an output will be


%{

%}

%%

. ECHO;

%%

void main(int argc, char* argv[])

{ yyin=fopen(argv[1],"r");

yylex();

}

Example 3

Cleaning comments from an input file and printing the resulting file to the output file (using start conditions):

/* Deleting comments */
%x comment1 comment2
%%
"/*" BEGIN(comment1);
<comment1>[^*]* /*do nothing till '*' */
<comment1>"*" /*a '*' that does not close the comment*/
<comment1>\n /*delete new line in multi-line comment*/
<comment1>"*/" BEGIN(0);
"//" BEGIN(comment2);
<comment2>[^\n] /*do nothing till new line */
<comment2>\n {ECHO; BEGIN(0);}
.|\n ECHO;
%%
int main(int argc, char* argv[])
{
    yyin=fopen(argv[1],"r"); /*input file name*/
    yyout=fopen(argv[2],"w"); /*output file name*/
    yylex();
    return 0;
}

where


%x condition_name - an exclusive start condition (only rules with start condition <condition_name> are active)

%s condition_name - an inclusive start condition (rules with and without start condition <condition_name> are active)

To start a condition write BEGIN(condition_name).

To remove it write BEGIN(0) or BEGIN(INITIAL).

Start condition allows you to write a mini-scanner inside your scanner.

The result of running FLEX’s output file:

Input:

%{
/*this part can be replaced with nothing no c declarations written*/
%}
%%
. ECHO;
%%
void main(int argc, char* argv[])
{
yyin=fopen(argv[1],"r");
yylex();
//here is a one line comment
}

Output:

%{
%}
%%
. ECHO;
%%
void main(int argc, char* argv[])
{
yyin=fopen(argv[1],"r");
yylex();
}

Input file: hello.txt

/*hello word*/
#include<iostream.h> //include file
void main()
{
/*another comment - multi-line comment*/
cout<<"hello word";
//till end of line comment
getch();
}
/*end of file */

Output file: out.txt

# include<iostream.h>
void main()
{
cout<<"hello word";
getch();
}

In our translation we printed new line after one line comment and after multi-line comment.


Some Flex Hints:

The start condition, '^', and "<<EOF>>" patterns can only occur at the beginning of a pattern.

 a '$' can only occur at the end of a pattern.

A '^' which does not occur at the beginning of a rule or a '$' which does not occur at the end of a rule loses its special properties and is treated as a normal character.

Example 4

Printing all tokens of an input file, written in arithmetic expression language, to an output file:

DIGIT [0-9]
%%
0|[1-9]{DIGIT}* { fprintf(yyout,"\nInteger:%s",yytext);}
0"."{DIGIT}+|[1-9]{DIGIT}*"."{DIGIT}+ { fprintf(yyout,"\nReal:%s",yytext);}
"*"|"-"|"="|"/"|"+" {fprintf(yyout,"\nAn operator:%s",yytext);}
\n /* skip new lines */
. fprintf(yyout,"\nUnrecognized character: %s",yytext);
%%
int main(int argc, char* argv[])
{
    yyin=fopen(argv[1],"r");
    if(argv[2]) yyout=fopen(argv[2],"w");
    else yyout=stdout;
    yylex();
    return 0;
}

where DIGIT is a lex definition for the character class of digits.

Input file: input.txt

4+3.22*45=67

2/3=1.5

Output file: out.tok

Integer:4

An operator:+

Real:3.22

An operator:*

Integer:45

An operator:=

Integer:67

Integer:2

An operator:/

Integer:3

An operator:=

Real:1.5


When the scanner receives an end-of-file indication from YY_INPUT, it then checks the yywrap() function.

If `yywrap()' returns false (zero 0), then it is assumed that the function has gone ahead and set up yyin to point to another input file, and scanning continues.

If it returns true (non-zero), then the scanner terminates, returning 0 to its caller.

Therefore in some environments before you run your lex.yy.c program add int yywrap(){return 1;} to your lex.yy.c file.
