Ninghui Li
Topic 4: Regular Expressions and Lexical Analysis
A lexical analyzer (scanner) converts a sequence of characters into a sequence of tokens.
For example, (inc 13) becomes 4 tokens: (, inc, 13, and ).
A parser (syntactic analyzer) analyzes a sequence of tokens to determine its grammatical structure.
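The scanning step can be sketched by hand in C. This is only an illustration, not the project's scanner (which is generated by lex); the names tokenize, struct token, and the TOK_* constants are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token kinds, named after the tokens used in these slides. */
enum token_kind { TOK_OPENPAR, TOK_CLOSEPAR, TOK_INC, TOK_NUMBER, TOK_ERROR };

struct token {
    enum token_kind kind;
    int number_val;              /* meaningful only when kind == TOK_NUMBER */
};

/* Split input into at most max tokens; returns how many were produced. */
int tokenize(const char *input, struct token *out, int max)
{
    int n = 0;
    const char *p = input;

    assert(input != NULL && out != NULL);
    while (*p != '\0' && n < max) {
        if (isspace((unsigned char)*p)) {
            p++;                               /* discard whitespace */
        } else if (*p == '(') {
            out[n++].kind = TOK_OPENPAR;
            p++;
        } else if (*p == ')') {
            out[n++].kind = TOK_CLOSEPAR;
            p++;
        } else if (isdigit((unsigned char)*p)) {
            char *end;
            out[n].kind = TOK_NUMBER;
            out[n].number_val = (int)strtol(p, &end, 10);
            n++;
            p = end;                           /* continue after the digits */
        } else if (strncmp(p, "inc", 3) == 0) {
            out[n++].kind = TOK_INC;
            p += 3;
        } else {
            out[n++].kind = TOK_ERROR;         /* unrecognized character */
            p++;
        }
    }
    return n;
}
```

Calling tokenize("(inc 13)", toks, 8) produces 4 tokens: TOK_OPENPAR, TOK_INC, TOK_NUMBER (with number_val 13), and TOK_CLOSEPAR.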
Part 1: Implement FIZ without user-defined functions (50%), due Feb 9
Part 2: Implement user-defined functions (50%), due Feb 16
Part 2 is significantly harder than Part 1. Do not wait until the last week.
"inc" { return INC;}
"(" { return OPENPAR;}
")" { return CLOSEPAR;}
0|[1-9][0-9]* { yylval.number_val = atoi(yytext); return NUMBER; }
[ \t\n] { /* Discard spaces, tabs, and new lines */ }
. { printf("Syntax error. Did not recognize %s\n", yytext); }
/*******************************************************
* Section 1: Definition of tokens and non-terminals *
*****************************************************/
%token <number_val> NUMBER      /* The NUMBER token carries a number_val. */
%token INC OPENPAR CLOSEPAR     /* These three tokens have no value. */
%type <node_val> expr           /* A parsed expr carries a pointer to a node in an abstract syntax tree. */

/* This defines the union associated with each token or non-terminal when parsing. */
%union {
    char *string_val;
    int number_val;
    struct TREE_NODE *node_val;
}
/**************************************************
* Section 3: Grammar production rules *
**************************************************/
goal: statements;
statements: statement | statement statements;
statement: expr {
        err_value = 0;
        resolve($1, NULL);
        if (err_value == 0) {
            printf("%d\n", eval($1, NULL));
        }
        prompt();
    };
The code involving err_value and resolve (red on the slide) is currently unnecessary. It is needed when user-defined functions are implemented.
The printf line (green on the slide) evaluates the expression and prints the result.
$1 refers to the AST node associated with the 1st element in the grammar rule, namely expr.
An abstract syntax tree (AST) is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in that it does not represent every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by a single node with three branches.
IFZ_NODE
    ARG_NAME strValue = "y"
    ARG_NAME strValue = "x"
    FUNC_CALL name = "add"
        INC_NODE
            ARG_NAME strValue = "x"
        DEC_NODE
            ARG_NAME strValue = "y"
The above is an AST for (ifz y x (add (inc x) (dec y))), the body of the function (add x y).
Consider how evaluating (add 4 1) would work.
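One way to trace the evaluation of (add 4 1) is the C sketch below. It is not the project's eval: it assumes arguments are looked up by position (arg_index) rather than by name, it hard-codes add as the only function, and eval_sketch and apply_add are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Node kinds mirroring the labels in the tree above. */
enum node_type { NUMBER_NODE, ARG_NAME, INC_NODE, DEC_NODE, IFZ_NODE, FUNC_CALL };

struct TREE_NODE {
    enum node_type type;
    int intValue;                 /* NUMBER_NODE: the literal value          */
    int arg_index;                /* ARG_NAME: 0 for x, 1 for y (assumption) */
    struct TREE_NODE *args[3];    /* child expressions                       */
};

/* The body of add: (ifz y x (add (inc x) (dec y))). */
static struct TREE_NODE arg_x = { ARG_NAME, 0, 0, { NULL, NULL, NULL } };
static struct TREE_NODE arg_y = { ARG_NAME, 0, 1, { NULL, NULL, NULL } };
static struct TREE_NODE inc_x = { INC_NODE, 0, 0, { &arg_x, NULL, NULL } };
static struct TREE_NODE dec_y = { DEC_NODE, 0, 0, { &arg_y, NULL, NULL } };
static struct TREE_NODE recursive_call = { FUNC_CALL, 0, 0, { &inc_x, &dec_y, NULL } };
static struct TREE_NODE add_body = { IFZ_NODE, 0, 0, { &arg_y, &arg_x, &recursive_call } };

/* Evaluate node; env[0] and env[1] hold the current values of x and y. */
int eval_sketch(const struct TREE_NODE *node, const int *env)
{
    assert(node != NULL);
    switch (node->type) {
    case NUMBER_NODE: return node->intValue;
    case ARG_NAME:    return env[node->arg_index];
    case INC_NODE:    return eval_sketch(node->args[0], env) + 1;
    case DEC_NODE:    return eval_sketch(node->args[0], env) - 1;
    case IFZ_NODE:    /* (ifz c a b): a if c is zero, otherwise b */
        return eval_sketch(node->args[0], env) == 0
                   ? eval_sketch(node->args[1], env)
                   : eval_sketch(node->args[2], env);
    case FUNC_CALL: {
        /* Evaluate the actual arguments in the caller's environment,
           then evaluate the function body in a fresh environment. */
        int new_env[2];
        new_env[0] = eval_sketch(node->args[0], env);
        new_env[1] = eval_sketch(node->args[1], env);
        return eval_sketch(&add_body, new_env);
    }
    }
    return 0;
}

/* Evaluate (add x y) for concrete argument values. */
int apply_add(int x, int y)
{
    int env[2];
    env[0] = x;
    env[1] = y;
    return eval_sketch(&add_body, env);
}
```

For apply_add(4, 1): y is 1 (nonzero), so the call becomes (add 5 0); now y is 0, so the result is x, which is 5.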
expr: OPENPAR INC expr CLOSEPAR {
        struct TREE_NODE *node = (struct TREE_NODE *) malloc(sizeof(struct TREE_NODE));
        node->type = INC_NODE;
        node->first_arg = $3;
        $$ = node;
    }
The above production rule (grammar rule) parses (inc <expr>).
It creates a node in the abstract syntax tree, sets its type to INC_NODE, and stores the tree node for <expr> in first_arg, since this is the first (and only) argument of (inc <expr>).
$3 refers to the value associated with the 3rd element in the body of the rule, i.e., expr.
$$ refers to the value associated with expr on the left-hand side.
    | NUMBER {
        struct TREE_NODE *node = (struct TREE_NODE *) malloc(sizeof(struct TREE_NODE));
        node->type = NUMBER_NODE;
        node->intValue = $1;
        $$ = node;
    };
The above production rule (grammar rule) parses a number into an expr.
It creates a node in the abstract syntax tree, sets its type to NUMBER_NODE, and stores the integer value in the intValue field.
$1 refers to the value associated with the 1st element in the body of the rule, i.e., NUMBER.
$$ refers to the value associated with expr on the left-hand side.
Input: (inc (inc 1))
Becomes tokens: OPENPAR INC OPENPAR INC NUMBER CLOSEPAR CLOSEPAR
This is parsed into a statement in the following steps:
    statement: expr
    expr: OPENPAR INC expr CLOSEPAR
    expr: OPENPAR INC NUMBER CLOSEPAR    (the inner expr, where NUMBER is itself reduced to an expr)
The resulting AST:
    INC_NODE
        INC_NODE
            NUMBER_NODE intValue = 1
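The tree for (inc (inc 1)) can also be built by hand, innermost node first, which is the order in which the grammar actions run. The sketch below assumes a reduced TREE_NODE with only the fields used on these slides; make_number_node, make_inc_node, eval_sketch, and free_tree are hypothetical helpers, not part of the project code.

```c
#include <assert.h>
#include <stdlib.h>

enum node_type { NUMBER_NODE, INC_NODE };

struct TREE_NODE {
    enum node_type type;
    int intValue;                  /* used by NUMBER_NODE */
    struct TREE_NODE *first_arg;   /* used by INC_NODE    */
};

/* Mirrors the action of the rule expr: NUMBER. */
struct TREE_NODE *make_number_node(int value)
{
    struct TREE_NODE *node = malloc(sizeof *node);
    assert(node != NULL);
    node->type = NUMBER_NODE;
    node->intValue = value;
    node->first_arg = NULL;
    return node;
}

/* Mirrors the action of the rule expr: OPENPAR INC expr CLOSEPAR. */
struct TREE_NODE *make_inc_node(struct TREE_NODE *arg)
{
    struct TREE_NODE *node = malloc(sizeof *node);
    assert(node != NULL && arg != NULL);
    node->type = INC_NODE;
    node->intValue = 0;
    node->first_arg = arg;
    return node;
}

/* Evaluate a tree made only of NUMBER_NODE and INC_NODE. */
int eval_sketch(const struct TREE_NODE *node)
{
    assert(node != NULL);
    if (node->type == NUMBER_NODE)
        return node->intValue;
    return eval_sketch(node->first_arg) + 1;
}

/* Release a tree, children first. */
void free_tree(struct TREE_NODE *node)
{
    if (node == NULL)
        return;
    free_tree(node->first_arg);
    free(node);
}
```

With these helpers, make_inc_node(make_inc_node(make_number_node(1))) builds the tree shown above, and eval_sketch on it returns 3.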
Regular expression: a notation to specify a pattern that matches a set of strings.
A regular expression can be:
a           a single character
R1|R2       matches anything that matches either R1 or R2
(R)         matches the same thing as R
[abcde]     any of the five letters listed there, i.e., a|b|c|d|e
[0-9]       any digit
R1R2        matches a string s if s is the concatenation of s1 and s2, where s1 matches R1 and s2 matches R2
            E.g., [abcde][0-9] matches any two-character string made of one of a through e followed by a digit, such as c7
R*          repeats the regular expression R zero or more times
            E.g., [0-9]* matches the empty string and any digit sequence
R+          repeats R one or more times
            Equivalent to the regular expression RR*
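These operators can be tried out with the POSIX regex functions regcomp, regexec, and regfree. The sketch below assumes a POSIX system that provides <regex.h>; full_match is a hypothetical helper that wraps the pattern as ^(...)$ so that it must match the whole string, which is how the definitions above are stated.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the whole string s matches the extended regular
   expression re, and 0 otherwise. */
int full_match(const char *re, const char *s)
{
    char anchored[256];
    regex_t compiled;
    int written;
    int matched;

    assert(re != NULL && s != NULL);
    /* Wrap as ^(re)$ so the pattern must cover the entire string. */
    written = snprintf(anchored, sizeof anchored, "^(%s)$", re);
    if (written < 0 || (size_t)written >= sizeof anchored)
        return 0;                 /* pattern too long for this sketch */
    if (regcomp(&compiled, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                 /* treat a malformed pattern as no match */
    matched = (regexec(&compiled, s, 0, NULL, 0) == 0);
    regfree(&compiled);
    return matched;
}
```

For example, full_match("[abcde][0-9]", "c7") returns 1 and full_match("[abcde][0-9]", "f7") returns 0; full_match("[0-9]*", "") returns 1, while full_match("[0-9]+", "") returns 0.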
http://flex.sourceforge.net/manual/Patterns.html
‘x’         match the character 'x'
‘.’         any character (byte) except newline
‘[xyz]’     a "character class"; in this case, the pattern matches either an 'x', a 'y', or a 'z'
‘[abj-oZ]’  a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
‘[^A-Z]’    a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.
‘[^A-Z\n]’          any character EXCEPT an uppercase letter or a newline
‘[a-z]{-}[aeiou]’   the lowercase consonants
‘r*’                zero or more r's, where r is any regular expression
‘r+’                one or more r's
‘r?’                zero or one r's (that is, “an optional r”)
‘r{2,5}’            anywhere from two to five r's
‘r{2,}’             two or more r's
‘r{4}’              exactly 4 r's
‘{name}’            the expansion of the ‘name’ definition
‘"[xyz]\"foo"’      the literal string: ‘[xyz]"foo’
‘(r)’               match an ‘r’; parentheses are used to override precedence
‘rs’                the regular expression ‘r’ followed by the regular expression ‘s’; called concatenation
‘r|s’               either an ‘r’ or an ‘s’
‘^r’                an ‘r’, but only at the beginning of a line
‘r$’                an ‘r’, but only at the end of a line
Regular expression for a non-negative integer:
Is [0-9]* correct?
Not quite: it also matches the empty string. [0-9]+ works if allowing 00123 is okay.
0|[1-9][0-9]* is better.
Regular expression for an identifier:
Rule 1: The name of an identifier consists of letters and digits.
Rule 2: The first character of any identifier must be a letter.
How to write the regular expression?
[a-zA-Z][a-zA-Z0-9]*
How would you write a regular expression that matches comments, assuming that a comment is defined as anything between ; and the end of the line?
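One candidate answer is ;.* since . matches any character except newline; ;[^\n]* says the same thing with a negated character class. The C sketch below spells out what such a pattern consumes; comment_length is a hypothetical helper, not part of the project code.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of characters in the comment that starts at s,
   or 0 if s does not start with ';'. The terminating newline is not
   part of the comment, matching the pattern ;[^\n]* */
size_t comment_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    if (s[0] != ';')
        return 0;
    while (s[n] != '\0' && s[n] != '\n')
        n++;
    return n;
}
```

For the input "; hi" followed by a newline and (inc 1), comment_length returns 4, so scanning would resume at the newline.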
Be able to write simple regular expressions to match strings.
Given a regular expression, be able to tell which strings are matched and which are not.