Lexical analysis

From Wikipedia, the free encyclopedia
Lexical analysis is the processing of an input sequence of characters (such as the source code of a
computer program) to produce, as output, a sequence of symbols called "lexical tokens", or just
"tokens". For example, lexers for many programming languages convert the character sequence 123
abc into two tokens: 123 and abc (whitespace is not a token in most languages). The purpose of
producing these tokens is usually to forward them as input to another program, such as a parser.
Implementation details
For many languages, lexical analysis can be performed in a single pass (i.e., with no backtracking) by reading one character at a time from the input. This makes it relatively straightforward to automate the generation of programs that perform it, and a number of such generators have been written (e.g., flex). However, most commercial compilers use hand-written lexers because much better error handling can be integrated into them.
A lexical analyzer, or lexer for short, can be thought of as having two stages, namely a scanner and an evaluator. (These are often integrated, for efficiency reasons, so that they operate in parallel.)
The first stage, the scanner, is usually based on a finite state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases the first non-whitespace character can be used to deduce the kind of token that follows; the input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule). In some languages the lexeme creation rules are more complicated and may involve backtracking over previously read characters.
A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a
sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the
evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type
combined with its value is what properly constitutes a token, which can be given to a parser. (Some
tokens such as parentheses do not really have values, and so the evaluator function for these can return
nothing. The evaluators for integers, identifiers, and strings can be considerably more complex.
Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful
for whitespace and comments.)
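The scanner/evaluator split can be sketched in C. The `Token` struct and `eval_number` function below are illustrative, not part of the PL/0 scanner shown later; the evaluator turns a digit lexeme, already delimited by the scanner, into an internal integer value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token type: the kind comes from the scanner,
   the value from the evaluator. */
typedef struct {
    enum { TOK_NUMBER, TOK_NAME } kind;
    long value;               /* filled in for TOK_NUMBER only */
} Token;

/* Evaluator for a number lexeme: converts the matched digit
   characters into an integer value (no overflow checking). */
Token eval_number(const char *lexeme, size_t len) {
    Token t = { TOK_NUMBER, 0 };
    for (size_t i = 0; i < len; i++)
        t.value = 10 * t.value + (lexeme[i] - '0');
    return t;
}
```

For parentheses and other value-less tokens, the evaluator stage would simply return the token kind with no value.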
For example, in the source code of a computer program the string
net_worth_future = (assets - liabilities);
might be converted (with whitespace suppressed) into the lexical token stream:
NAME "net_worth_future"
EQUALS
OPEN_PARENTHESIS
NAME "assets"
MINUS
NAME "liabilities"
CLOSE_PARENTHESIS
SEMICOLON
Though it is possible and sometimes necessary to write a lexer by hand, lexers are often generated by
automated tools. These tools generally accept regular expressions that describe the tokens allowed in
the input stream. Each regular expression is associated with a production in the lexical grammar of the
programming language that evaluates the lexemes matching the regular expression. These tools may
generate source code that can be compiled and executed or construct a state table for a finite state
machine (which is plugged into template code for compilation and execution).
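A toy version of such a table-driven recognizer, hard-coded here in C for the integer pattern [0-9]+ (the state numbering and table layout are illustrative, not what any particular generator emits):

```c
#include <assert.h>
#include <stdbool.h>

/* States: 0 = start, 1 = in number, 2 = reject (dead state).
   Character classes: 0 = digit, 1 = anything else. */
enum { N_STATES = 3, N_CLASSES = 2 };

static const int next_state[N_STATES][N_CLASSES] = {
    /* digit  other */
    {  1,     2 },   /* start */
    {  1,     2 },   /* in number */
    {  2,     2 },   /* reject */
};

/* Accepts s iff the whole string matches [0-9]+. */
bool is_number(const char *s) {
    int state = 0;
    for (; *s; s++) {
        int cls = (*s >= '0' && *s <= '9') ? 0 : 1;
        state = next_state[state][cls];
    }
    return state == 1;       /* only "in number" is accepting */
}
```

A real generated scanner drives one large table for all token kinds at once, but the stepping loop looks much the same.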
Regular expressions compactly represent patterns that the characters in lexemes might follow. For
example, for an English-based language, a NAME token might be any English alphabetical character
or an underscore, followed by any number of instances of any ASCII alphanumeric character or an
underscore. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means
"any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".
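As a rough illustration, the same pattern can be matched by hand using the standard character-classification functions; this `match_name` helper is hypothetical, not part of the scanner below:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the longest prefix of s matching
   [a-zA-Z_][a-zA-Z_0-9]*, or 0 if s does not begin a NAME. */
size_t match_name(const char *s) {
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    size_t i = 1;
    while (isalnum((unsigned char)s[i]) || s[i] == '_')
        i++;                 /* maximal munch: keep consuming */
    return i;
}
```

Called on the earlier example input, `match_name("net_worth_future = ...")` consumes exactly the 16 characters of the identifier and stops at the space.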
Regular expressions and the finite state machines they generate are not powerful enough to handle
recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing
parentheses." They are not capable of keeping count, and verifying that n is the same on both sides —
unless you have a finite set of permissible values for n. It takes a full-fledged parser to recognize such
patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off
and see if the stack is empty at the end.
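A minimal sketch of that stack discipline in C: for a single bracket kind a counter suffices as the stack, incremented on push and decremented on pop (the function name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Checks that parentheses in s are balanced: every ')' must
   match an earlier '('. The depth counter stands in for the
   parser's stack of open parentheses. */
bool parens_balanced(const char *s) {
    int depth = 0;
    for (; *s; s++) {
        if (*s == '(')
            depth++;                 /* push */
        else if (*s == ')' && --depth < 0)
            return false;            /* pop from empty stack */
    }
    return depth == 0;               /* stack empty at the end */
}
```

No finite state machine can do this for unbounded n, since it would need a distinct state for every possible depth.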
The Lex programming tool is designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. It is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection uses hand-written lexers.
Example lexical analyzer
This is an example of a scanner (written in the C programming language) for the instructional
programming language PL/0.
The symbols recognized are:
'+', '-', '*', '/', '=', '(', ')', ',', ';', '.', ':=', '<', '<=', '<>', '>', '>='
numbers: 0-9 {0-9}
identifiers: a-zA-Z {a-zA-Z0-9}
keywords:
"begin", "call", "const", "do", "end", "if", "odd", "procedure", "then", "var", "while"
External variables used:

    FILE *source -- the source file
    int cur_line, cur_col, err_line, err_col -- for error reporting
    int num -- last number read stored here, for the parser
    char id[] -- last identifier read stored here, for the parser
    Hashtab *keywords -- list of keywords
External routines called:

    error(const char msg[]) -- report an error
    Hashtab *create_htab(int estimate) -- create a lookup table
    int enter_htab(Hashtab *ht, char name[], void *data) -- add an entry to a lookup table
    Entry *find_htab(Hashtab *ht, char *s) -- find an entry in a lookup table
    void *get_htab_data(Entry *entry) -- returns data from a lookup table
    FILE *fopen(char fn[], char mode[]) -- opens a file for reading
    int fgetc(FILE *stream) -- read the next character from a stream
    int ungetc(int ch, FILE *stream) -- put back a character onto a stream
    isdigit(int ch), isalpha(int ch), isalnum(int ch) -- character classification
External types:

    Symbol -- an enumerated type of all the symbols in the PL/0 language
    Hashtab -- represents a lookup table
    Entry -- represents an entry in the lookup table
Scanning is started by calling init_scan, passing the name of the source file. If the source file is
successfully opened, the parser calls getsym repeatedly to return successive symbols from the source
file.
The heart of the scanner, getsym, should be straightforward. First, whitespace is skipped. Then the
retrieved character is classified. If the character represents a multiple-character symbol, additional
processing must be done. Numbers are converted to internal form, and identifiers are checked to see if
they represent a keyword.
int read_ch(void) {
    int ch = fgetc(source);
    cur_col++;
    if (ch == '\n') {
        cur_line++;
        cur_col = 0;
    }
    return ch;
}

void put_back(int ch) {
    ungetc(ch, source);
    cur_col--;
    if (ch == '\n') cur_line--;
}
Symbol getsym(void) {
    int ch;

    /* skip whitespace */
    while ((ch = read_ch()) != EOF && ch <= ' ')
        ;
    err_line = cur_line;
    err_col = cur_col;
    switch (ch) {
    case EOF: return eof;
    case '+': return plus;
    case '-': return minus;
    case '*': return times;
    case '/': return slash;
    case '=': return eql;
    case '(': return lparen;
    case ')': return rparen;
    case ',': return comma;
    case ';': return semicolon;
    case '.': return period;
    case ':':
        ch = read_ch();
        return (ch == '=') ? becomes : nul;
    case '<':
        ch = read_ch();
        if (ch == '>') return neq;
        if (ch == '=') return leq;
        put_back(ch);
        return lss;
    case '>':
        ch = read_ch();
        if (ch == '=') return geq;
        put_back(ch);
        return gtr;
    default:
        if (isdigit(ch)) {
            num = 0;
            do { /* no checking for overflow! */
                num = 10 * num + ch - '0';
                ch = read_ch();
            } while (ch != EOF && isdigit(ch));
            put_back(ch);
            return number;
        }
        if (isalpha(ch)) {
            Entry *entry;
            id_len = 0;
            do {
                if (id_len < MAX_ID) {
                    id[id_len] = (char)ch;
                    id_len++;
                }
                ch = read_ch();
            } while (ch != EOF && isalnum(ch));
            id[id_len] = '\0';
            put_back(ch);
            entry = find_htab(keywords, id);
            return entry ? (Symbol)get_htab_data(entry) : ident;
        }
        error("getsym: invalid character");
        return nul;
    }
}
int init_scan(const char fn[]) {
    if ((source = fopen(fn, "r")) == NULL) return 0;
    cur_line = 1;
    cur_col = 0;
    keywords = create_htab(11);
    /* keyword symbols are stored as the table's data pointers */
    enter_htab(keywords, "begin", (void *)beginsym);
    enter_htab(keywords, "call", (void *)callsym);
    enter_htab(keywords, "const", (void *)constsym);
    enter_htab(keywords, "do", (void *)dosym);
    enter_htab(keywords, "end", (void *)endsym);
    enter_htab(keywords, "if", (void *)ifsym);
    enter_htab(keywords, "odd", (void *)oddsym);
    enter_htab(keywords, "procedure", (void *)procsym);
    enter_htab(keywords, "then", (void *)thensym);
    enter_htab(keywords, "var", (void *)varsym);
    enter_htab(keywords, "while", (void *)whilesym);
    return 1;
}
Now, contrast the above code with the code needed for a flex-generated scanner for the same language:
%{
#include "y.tab.h"
%}

digit             [0-9]
letter            [a-zA-Z]

%%

"+"               { return PLUS; }
"-"               { return MINUS; }
"*"               { return TIMES; }
"/"               { return SLASH; }
"("               { return LPAREN; }
")"               { return RPAREN; }
";"               { return SEMICOLON; }
","               { return COMMA; }
"."               { return PERIOD; }
":="              { return BECOMES; }
"="               { return EQL; }
"<>"              { return NEQ; }
"<"               { return LSS; }
">"               { return GTR; }
"<="              { return LEQ; }
">="              { return GEQ; }
"begin"           { return BEGINSYM; }
"call"            { return CALLSYM; }
"const"           { return CONSTSYM; }
"do"              { return DOSYM; }
"end"             { return ENDSYM; }
"if"              { return IFSYM; }
"odd"             { return ODDSYM; }
"procedure"       { return PROCSYM; }
"then"            { return THENSYM; }
"var"             { return VARSYM; }
"while"           { return WHILESYM; }
{letter}({letter}|{digit})* { yylval.id = (char *)strdup(yytext); return IDENT; }
{digit}+          { yylval.num = atoi(yytext); return NUMBER; }
[ \t\n\r]         /* skip whitespace */
.                 { printf("Unknown character [%c]\n", yytext[0]); return UNKNOWN; }

%%

int yywrap(void) { return 1; }
About 50 lines of code for flex versus about 100 lines of hand-written code.
Generally, scanners are not that hard to write. If done correctly, a hand-written scanner can be faster and more flexible than one produced by a scanner generator. But the simple utility of using a scanner generator should not be discounted, especially in the developmental phase, when a language specification might change daily; in that case, much time may be saved by using a scanner generator.