tlex.doc
tlex - lexical analyzer for ATK text
gentlex - create tables for tlex
ATK's 'parser' object requires as its stream of tokens a pointer to a
function and a rock value to be passed to that function. Tlex provides
such a function called LexFunc, which reads tokens from an ATK text
object. Tables for creating instantiations of the tlex object are
generated by gentlex (pronounced gen-t-lex). Unlike the Unix 'lex'
package, gentlex/tlex does not implement arbitrary regular expressions.
However, because it is designed specifically for tokenizing streams for
parser input, tlex does considerably more than 'lex' and even supports
the most general recognition scheme: C code.
The essence of the tlex approach is to determine what sort of token is
coming by assembling characters until they constitute a prefix for a
recognizable token. Then a recognizer function is called to determine
the extent of the token. Builtin recognizers are provided for the sorts
of tokens required for popular modern programming languages.
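The prefix-then-recognizer idea can be sketched in a few lines of standalone C++. This is only an illustration and does not use tlex itself: the names Kind, Token, ScanWhile, and NextToken are hypothetical, and the three token kinds stand in for the parser's token numbers.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical token kinds; real tlex reports the parser's token numbers.
enum Kind { KIND_ID, KIND_NUMBER, KIND_OTHER, KIND_END };

struct Token {
    Kind kind;
    std::string text;
};

// Character tests wrapped so they can be passed as plain function pointers.
static int IsSpace(int c) { return std::isspace(c); }
static int IsAlnum(int c) { return std::isalnum(c); }
static int IsDigit(int c) { return std::isdigit(c); }

// A "recognizer": extends the token from pos while the test holds.
static std::size_t ScanWhile(const std::string &s, std::size_t pos,
                             int (*test)(int)) {
    while (pos < s.size() && test(static_cast<unsigned char>(s[pos])))
        pos++;
    return pos;
}

// Looks at the first character to choose a recognizer, then lets that
// recognizer determine the extent of the token.
Token NextToken(const std::string &s, std::size_t &pos) {
    pos = ScanWhile(s, pos, IsSpace);            // whitespace is ignored
    Token t;
    if (pos >= s.size()) { t.kind = KIND_END; return t; }
    std::size_t start = pos;
    unsigned char c = static_cast<unsigned char>(s[pos]);
    if (std::isalpha(c)) {
        t.kind = KIND_ID;
        pos = ScanWhile(s, pos, IsAlnum);
    } else if (std::isdigit(c)) {
        t.kind = KIND_NUMBER;
        pos = ScanWhile(s, pos, IsDigit);
    } else {
        t.kind = KIND_OTHER;                     // single-character token
        pos++;
    }
    t.text = s.substr(start, pos - start);
    return t;
}
```

Calling NextToken repeatedly on the same string and position variable yields the tokens in order, ending with KIND_END.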
______________________________
Using tlex
For each tlex application, say 'sample', the programmer writes a .tlx
file describing the tokens to be found. This 'sample.tlx' file is then
processed with gentlex to create a sample.tlc file. This file is then
included in the application:
    #include <sample.tlc>
Later in the application, a tlex object is declared and created as in
    class tlex *tl;
    . . .
    tl = tlex::Create(&sample_tlex_tables, this,
            text, 0, (text)->GetLength());
where sample_tlex_tables is declared within sample.tlc, 'this' is passed
as a rock that is later available to token recognizers, and text is an
ATK text object containing the text to be parsed. The last two parameters
can delimit the parse to a subsequence of the text. After this
preparation, the tlex object, tl, can be passed to parser::Parse with
tlex::LexFunc as the function. That is, if gggg is a parser object,
parsing can be initiated with
    (gggg)->Parse(tlex::LexFunc, tl);
See the appendix for a complete program example.
Gentlex takes two inputs--a file containing token class descriptions and
another file containing declarations for YYNTOKENS and yytname as
produced by running bison with the -k switch. {At present the -k switch
is implemented only in the AUIS version of Bison.} The output file is
typically named with the same prefix as the first input file and the
extension .tlc:
    gentlex sample.tlx sample.tab.c
produces file sample.tlc.
If mkparserclass is used to process the Bison output, the resulting
sample.C file is adequate as a replacement for sample.tab.c.
-p switch
Declarations output by gentlex are static variables whose names begin
with a prefix value followed by an underscore, as in sample_tlex_tables.
If the first input file has the extension .tlx, the prefix will be the
filename prior to the extension. A different prefix value may be
specified by giving the -p switch on the command line. The output file
is always named with the prefix value and the extension .tlc. For example
    gentlex -p sample a b
will read the .tlx information from file a and the .tab.c information
from file b. The output will be generated to a file sample.tlc and
the variables declared in that file will begin with `sample_'. If the
-p switch is given the file names need not be specified; they will be
generated from the prefix value and the extensions .tlx and .tab.c.
If the .tlx file is named, the .tab.c file need not be, as long as its
name begins with the same prefix as the .tlx file.
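The naming rules above can be restated as a small standalone function. This is a sketch of the rules as described here, not gentlex's actual code; GentlexFiles, StripExt, and ResolveFiles are hypothetical names, and the treatment of a first file name without the .tlx extension (the whole name becomes the prefix) is an assumption.

```cpp
#include <string>

struct GentlexFiles {
    std::string prefix;  // stem of the generated names
    std::string tlx;     // token class descriptions
    std::string tab;     // bison output holding YYNTOKENS and yytname
    std::string out;     // generated tables
};

// Removes ext from the end of name when present.
static std::string StripExt(const std::string &name, const std::string &ext) {
    if (name.size() > ext.size() &&
        name.compare(name.size() - ext.size(), ext.size(), ext) == 0)
        return name.substr(0, name.size() - ext.size());
    return name;
}

// Empty strings stand for arguments omitted from the command line.
GentlexFiles ResolveFiles(const std::string &pswitch,
                          const std::string &tlxArg,
                          const std::string &tabArg) {
    GentlexFiles f;
    std::string stem = StripExt(tlxArg, ".tlx");
    f.prefix = pswitch.empty() ? stem : pswitch;
    f.tlx = tlxArg.empty() ? f.prefix + ".tlx" : tlxArg;
    if (!tabArg.empty())
        f.tab = tabArg;
    else
        f.tab = (stem.empty() ? f.prefix : stem) + ".tab.c";
    f.out = f.prefix + ".tlc";           // output always uses the prefix
    return f;
}
```

For instance, ResolveFiles("sample", "a", "b") corresponds to the command line gentlex -p sample a b.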
-l switch
The .tlc output file will ordinarily contain #line lines to relate
compile error messages back to the .tlx file. The -l switch will
eliminate these #line lines.
______________________________
Overview of the structure of .tlx Files
The purpose of tlex is to examine the input stream and find tokens,
reporting each to the parser by its token class number. Gentlex
determines the token class numbers from the yytname list in the
.tab.c file, which will have been generated by bison with the -n switch
(and some other switch combinations). {At present, -n is only
implemented in the AUIS version of Bison.}
A .tlx file is a sequence of tokenclass blocks, each responsible for
describing how tlex can identify the tokens of one class (or a group of
classes). The syntax is line oriented: the various allowed operations
each occupy a single line. Comments begin with two dashes (--) and
extend to the end of the line.
Here is a typical tokenclass block containing a description of identifier
tokens--ones that start with an alphabetic and continue with an
alphabetic, an underline, or a digit.
    tokenclass setID
        set [a-zA-Z]
        recognizer ScanID
        charset continueset [a-zA-Z_0-9]
The tokenclass line says that this block describes tokens the parser is
to think of as satisfying the class setID (a name used in the grammar).
The 'set' line says that this tokenclass block is triggered by any
alphabetic character. The 'recognizer' line says to use the builtin
recognizer called ScanID. One of the parameters to ScanID is a charset
value called continueset. It is declared here to override the default
value; the declaration says that an identifier continues with an
alphabetic, a digit, or an underline.
Each tokenclass block begins with a header line containing 'tokenclass'
and a representation of the class--either a name, or a quoted string.
Following the header are four types of lines, each of which is described
in more detail later.
    'set' or 'seq' - these lines describe the prefix of tokens
        that satisfy the tokenclass. seq specifies an initial
        sequence of characters, while set lists a set of characters.
    'recognizer' - this line names a builtin recognizer which will be
        called to determine the remainder of the token.
    struct element declaration - declares the type, name, and initial
        value for an element of a struct. For some builtin
        recognizers, fields of this struct provide information
        that more precisely controls the recognizer. The struct is
        also passed to the function created from the following:
    handler function body - a function created from this body is
        called after the recognizer has found the token. The
        function may extend the token, modify its text, set the
        value for yylval, or even change the token number.
Suppose that identifiers beginning with x_ are to be treated as tokens of
class setHEXCONSTANT. The above description could be augmented to
describe this as follows:
    tokenclass setID
        set [a-zA-Z]            -- 'set' or 'seq' line
        recognizer ScanID       -- recognizer line
        -- struct element declarations follow
        charset continueset [a-zA-Z_0-9]
        tokennumber hextok setHEXCONSTANT
        -- function body follows
        {
            char *tok = tlex_GetTokenText(tlex);
            if (*tok == 'x' && *(tok+1) == '_')
                tlex_SetTokenNumber(tlex, parm->hextok);
            return tlex_ACCEPT;
        }
As earlier, this tokenclass block describes tokens of the class setID.
They begin with a letter and continue as determined by the ScanID
recognizer. As its last step, ScanID will call the function whose body
is the bracketed lines. Two parameters will be passed to this function:
    tlex--a pointer to the current tlex object, and
    parm--a pointer to the struct constructed from the struct
        element declarations earlier in the block.
The code in the function gets the token text from the tlex and checks for
an initial 'x_'. If it is found, the token number in the tlex is changed
to the token class for setHEXCONSTANT, utilizing the hextok value
inserted in the struct by the struct element declaration.
Special recognition of 'x_' would have been easier, however, by writing
two tokenclass rules:
    tokenclass setID
        set [a-zA-Z]
        recognizer ScanID
        charset continueset [a-zA-Z_0-9]
    tokenclass setHEXCONSTANT
        seq "x_"
        recognizer ScanID
        charset continueset [a-zA-Z_0-9]
The setID class would recognize all tokens, even those beginning with x
but followed by a character other than underline. The setHEXCONSTANT
class would recognize tokens beginning with x_. When a token starts with
x, the lexical analyzer looks ahead to determine whether it is a setID or
a setHEXCONSTANT. In practice, the continueset for hex constants might be
[0-9a-f] instead of the value given and a function body might be provided
to compute the actual hexadecimal value.
______________________________
Tokenclass line: Details
The value following the 'tokenclass' keyword is one of the token
identifier values used in the bison description of the grammar. Typical
examples are
    ELSE
    setID
    tokNULL
    '+'
    "<="
Token names beginning with "set" are assumed to describe classes and are
not treated as reserved words. Token names beginning with "tok" are
assumed to be reserved words consisting of the characters following the
initial three. Multicharacter tokens delimited with double quotes may
not be acceptable in all versions of Bison.
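The naming convention can be expressed as a small standalone function. The sketch below is not gentlex code; ReservedWordFor is a hypothetical name, and treating any other plain name (such as ELSE) as a reserved word spelled exactly like the name is an assumption that goes beyond what is stated above.

```cpp
#include <string>

// Returns the reserved-word spelling implied by a token name,
// or "" when the name does not denote a reserved word.
std::string ReservedWordFor(const std::string &name) {
    if (name.empty() || name[0] == '\'' || name[0] == '"' || name[0] == '-')
        return "";                      // quoted token or special -name-
    if (name.compare(0, 3, "set") == 0)
        return "";                      // a class such as setID
    if (name.compare(0, 3, "tok") == 0)
        return name.substr(3);          // tokNULL stands for NULL
    return name;                        // assumption: ELSE is spelled ELSE
}
```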
It is possible, but not necessary, to write tokenclass blocks for quoted
characters and strings like '+' and "<=". Gentlex automatically
generates tokenclass blocks for these sorts of tokens.
A recognizer for whitespace is also generated automatically and suffices
if the desired whitespace set is the set of characters satisfying
isspace(). To override this automatic set, include a block for
tokenclass -none- and specify for it the recognizer ScanWhitespace, as in
    tokenclass -none-
        set [ \n\t\f\v\r]
        recognizer ScanWhitespace
        charset continueset [ \n\t\f\v\r]
For the `action' type declaration described below, a disambiguating
letter may be appended in parentheses after a tokenclass representation
or the special tokenclass -none-; for example:
    tokenclass setNUMBER (b)
There are several reserved token class names: -none-, -global-,
-reservedwords-, and -errorhandler-, as described in the following.
tokenclass -none-
The tokenclass -none- is used for whitespace and comments. It is
assumed that the function body, if any, in the tokenclass block returns
tlex_IGNORE, as is done by ScanWhitespace and ScanComment. If it instead
returns tlex_ACCEPT, it should have reset the token number, because the
default token number established by -none- terminates the input stream.
tokenclass -global-
The block for this tokenclass has no set, seq, or recognizer lines. Its
sole function is to generate and initialize a struct, called
PREFIX_global, where the PREFIX value is the stem of the .tlx file
name or the value of the -p switch. Fields of the PREFIX_global
struct can be accessed from the handler function body associated with
any of the tokenclasses. This can be used to create a single charset,
tokennumber, or action value that can be referenced from multiple
function bodies. The handler function for the -global- block, if any, is
executed the first time tlex::Create is called for the PREFIX_tlex_tables
value created by this file; thus it can further initialize any variables
in the global struct. It should not return a value.
tokenclass -reservedwords-
The default treatment of reserved words in gentlex is to treat them in
the same token class as identifiers. The handler for that class is
expected to set their token number by looking up the identifier in the
symbol table. (Entries are put in the table with
parse_EnumerateReservedWords; see parse.doc.) However, reserved words can
be recognized directly by specifying the tokenclass name
'-reservedwords-'.
tokenclass -errorhandler-
Tlex has a method tlex_Error which can be called by recognizers or
handler function bodies to indicate errors, such as an 8 in an octal
value. The default action of tlex_Error is to print the error message
which is the argument; however, a .tlx file may specify a different
action by providing an -errorhandler- block. This block must not include
a recognizer line, but must include a handler function body. The function
generated from the body is invoked for any error. The function body can
access the proposed error message as parm->msg. It should not return a
value.
______________________________
Set or seq line, details
The set or seq line must appear for most token classes. It determines
when this token class is initiated. The argument on a seq line is a
double-quote delimited string:
    seq "--"
The token class is activated whenever that sequence is found at the
beginning of a token.
The argument on a set line is a sequence of characters within square
brackets; e.g. [ \t\n\r]. Any character in the set will initiate the
given tokenclass when it appears as the first character in a token.
Backslash, dash, and right square bracket may appear in the sequence if
preceded with backslash as an escape character. A consecutive subset of
the collating sequence can be included by writing the two characters at
the ends of the sequence separated with a dash; for instance,
[a-zA-Z_#$%0-9] would be the set of all alphabetic characters, the
digits, underline, hash, percent, and dollar sign.
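One way to read such a bracketed set into a membership table is sketched below in standalone C++. It is not the gentlex parser: Charset, ParseCharset, and InCharset are hypothetical names, std::bitset stands in for the generated charset value, and decoding \n, \t, \r, \f, and \v as in C is an assumption based on the examples in this document.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

typedef std::bitset<256> Charset;   // one bit per character code

// Reads one possibly escaped character from s[i..end); false if none.
static bool ReadChar(const std::string &s, std::size_t &i, std::size_t end,
                     unsigned char &c) {
    if (i >= end) return false;
    char ch = s[i++];
    if (ch == '\\') {
        if (i >= end) return false;
        ch = s[i++];
        switch (ch) {                   // C-style escapes are an assumption
        case 'n': ch = '\n'; break;
        case 't': ch = '\t'; break;
        case 'r': ch = '\r'; break;
        case 'f': ch = '\f'; break;
        case 'v': ch = '\v'; break;
        default: break;                 // \\  \-  \]  stand for themselves
        }
    }
    c = static_cast<unsigned char>(ch);
    return true;
}

// Parses "[...]" into out; returns false when the text is malformed.
bool ParseCharset(const std::string &spec, Charset &out) {
    out.reset();
    if (spec.size() < 2 || spec[0] != '[' || spec[spec.size() - 1] != ']')
        return false;
    std::size_t i = 1, end = spec.size() - 1;
    while (i < end) {
        unsigned char lo, hi;
        if (!ReadChar(spec, i, end, lo)) return false;
        hi = lo;
        if (i < end && spec[i] == '-') {    // unescaped dash: a range
            i++;
            if (!ReadChar(spec, i, end, hi) || hi < lo) return false;
        }
        for (unsigned c = lo; c <= hi; c++) out.set(c);
    }
    return true;
}

// Membership test, playing the role tlex_BITISSET plays for real charsets.
bool InCharset(const Charset &cs, char c) {
    return cs.test(static_cast<unsigned char>(c));
}
```

Casting to unsigned char before indexing avoids negative subscripts for characters above 0x7F.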
______________________________
Recognizer line, details
The operand following the keyword 'recognizer' must be the name of one of
the builtin recognizers. Each recognizer takes one or more operands
which further describe the tokens the recognizer will accept. Most
recognizers return tlex_ACCEPT to the central token recognizer to
indicate that they have found a token. ScanWhitespace, ScanComment, and
ScanError normally return tlex_IGNORE to indicate that further scanning
is required to actually find a token. The handler function bodies for
-errorhandler- and -global- should not return values.
The individual builtin recognizers are described in the following
paragraphs. Some of them define or require certain items among the struct
element declarations. From these is built the 'parm struct', so called
because it can be accessed by the identifier 'parm' in the handler
function body.
ScanNumber
The first character of the token must be a dot, a digit, or a single
quote character. Subsequent characters are scanned as long as they
correspond to a C numeric constant:
    decimal integer - sequence of [0-9]
    octal integer - 0 followed by sequence of [0-7]
    hexadecimal integer - 0x followed by sequence of [0-9a-fA-F]
    quoted character - two single quotes surrounding a character
        or an escape sequence
    real value - an appropriate sequence of characters from [0-9\-+.eE]
The parm struct will have at least these fields (they need not be
declared):
    boolean IsInt   -- TRUE unless value contains '.' or exponent
    int intval      -- value as an integer
    double realval  -- value as a real
The handler function body can modify these values.
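How the three fields relate to the token text can be illustrated with the standard C conversion functions. This sketch is not the ScanNumber implementation: NumberParm and ClassifyNumber are hypothetical names, quoted characters are not handled, and filling in the "other" field by conversion is an assumption.

```cpp
#include <cstdlib>
#include <cstring>

struct NumberParm {
    bool IsInt;       // true unless the text has '.' or an exponent
    int intval;       // value as an integer
    double realval;   // value as a real
};

NumberParm ClassifyNumber(const char *text) {
    NumberParm p;
    bool hex = text[0] == '0' && (text[1] == 'x' || text[1] == 'X');
    // In a hexadecimal constant 'e' is a digit, not an exponent marker.
    p.IsInt = hex || std::strpbrk(text, ".eE") == NULL;
    if (p.IsInt) {
        // Base 0 lets strtol pick decimal, octal (leading 0), or hex (0x).
        p.intval = static_cast<int>(std::strtol(text, NULL, 0));
        p.realval = p.intval;
    } else {
        p.realval = std::strtod(text, NULL);
        p.intval = static_cast<int>(p.realval);
    }
    return p;
}
```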
ScanID
This recognizer scans a token which continues as long as subsequent
characters are in a given set. The default set is all characters
satisfying isalnum(). There may be a struct element declaration for a
charset called continueset; it is the set of characters allowed after
the first.
ScanString
This recognizer accumulates the token until a single terminating
character is encountered. An escape character and an illegal character
may also be specified. The default values are those for C strings:
quote, backslash, and newline. Alternate values may be given
by providing one or more of these struct element declarations:
    char *endseq ...      -- terminating character
    char *escapechar ...  -- escape character
    char *badchar ...     -- illegal character
All of these will be in the parm struct, whether specified or not. They
should not be modified.
The illegal character can only appear in the string if preceded by the
escape character, in which case both are ignored.
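The interplay of the three characters can be sketched as a standalone scanning loop. This is not the ScanString source: ScanStringBody is a hypothetical name, and keeping other escape pairs verbatim in the collected contents is an assumption.

```cpp
#include <cstddef>
#include <string>

// Scans a string body starting just after the opening delimiter.
// Returns the index just past the terminator, or std::string::npos
// when an unescaped illegal character or the end of input is met.
std::size_t ScanStringBody(const std::string &s, std::size_t pos,
                           char endch, char esc, char bad,
                           std::string &contents) {
    contents.clear();
    while (pos < s.size()) {
        char c = s[pos];
        if (c == esc && pos + 1 < s.size()) {
            char next = s[pos + 1];
            if (next != bad) {          // keep the escape pair verbatim
                contents.push_back(c);
                contents.push_back(next);
            }                           // escape + illegal char: drop both
            pos += 2;
        } else if (c == bad) {
            return std::string::npos;   // e.g. newline inside a C string
        } else if (c == endch) {
            return pos + 1;
        } else {
            contents.push_back(c);
            pos++;
        }
    }
    return std::string::npos;           // unterminated
}
```

With the C defaults, the call would pass '"', '\\', and '\n' for the terminating, escape, and illegal characters.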
ScanToken
The initial character or sequence is treated as the entire token. The
handler function body can extend the token to include succeeding
characters. ScanToken can be used to specify the same tokenclass for
two different characters. For instance, to map left braces to left
parentheses we could write
    tokenclass '('
        seq "{"
        recognizer ScanToken
ScanToken is the default recognizer if none is specified.
ScanComment
Typically, this recognizer is employed to scan a sequence-terminated
token. By default the recognizer returns tlex_IGNORE so another token
will be scanned for after recognizing the comment. The struct element
declarations may include
    char *endseq ...  -- termination of the comment
The first character of the endseq value must not appear anywhere else in
that value. The default value is a single newline.
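The restriction on endseq is what one would expect from a matcher that never backs up in the input. The standalone sketch below shows such a matcher; it is a guess at the reason for the rule, not the ScanComment source, and SkipComment is a hypothetical name. On a mismatch it restarts the match at the same input character, which finds every terminator only when the first character of endseq does not recur later in endseq.

```cpp
#include <cstddef>
#include <string>

// Returns the index just past the first occurrence of endseq at or
// after pos, or s.size() if the terminator never appears.
std::size_t SkipComment(const std::string &s, std::size_t pos,
                        const std::string &endseq) {
    std::size_t k = 0;                  // characters of endseq matched
    while (pos < s.size() && k < endseq.size()) {
        if (s[pos] == endseq[k]) {
            k++;
            pos++;
        } else if (k == 0) {
            pos++;                      // ordinary comment character
        } else {
            k = 0;                      // retry this character from the start
        }
    }
    return pos;
}
```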
ScanWhitespace
Typically, this recognizer spans a set of characters. By default, the
recognizer returns tlex_IGNORE so another token will be scanned for
after skipping the whitespace. The set of whitespace must be specified
both in a 'set' line and by writing a struct element declaration for
    charset continueset ...  -- whitespace characters
which will appear in the parm struct and must not be modified in the
handler function body.
ScanError
The character or sequence that initiates this token class is treated as
an error. A struct element declaration for 'msg' should be specified:
    char *msg ...  -- error message
If no handler function body is specified, the message is passed to
tlex_Error (which calls the -errorhandler- function or, if none, prints
the message).
______________________________
Struct element declaration lines, details
A struct element declaration line has three elements--type, identifier,
and value. The first two are single words; the form of the value
depends on the type. The described field becomes one field of the struct
named 'parm' passed to the recognizer or handler function for this token
class. The value is used to initialize the field.
For example, if the line is
    int basis 3
the generated struct declaration will have the form:
    struct tlex_Sym000 {
        ...
        int basis;
        ...
    } tlex_Sym001 = {
        ...
        3,
        ...
    };
A limited set of types is allowed in a struct element declaration.
This set includes these standard C types
    int long float double char*
and the semi-standard type
    boolean
for which the value constants are TRUE and FALSE. Other types are
charset, tokennumber, and action, as described in the following:
charset
The value portion is a character sequence in square brackets, just as for
'set' lines. Charset variables can appear as the first argument to
tlex_BITISSET. If v is a charset identifier and c is a character, the
expression
tlex_BITISSET(parm->v, c)
is TRUE if the value of c is one of the characters in the value of v.
tokennumber
The variable declared as a tokennumber is declared as int in C++. The
initialization expression is a token name, exactly as may appear as the
operand to tokenclass. The C++ int is initialized to the appropriate
token number for the given token name. This value is appropriate as
the second argument to tlex::SetTokenNumber.
action
The return value from a C++ code portion must be one of two constants or
must be a value created by an action type field element. The
initialization for a variable of type action is the operand of a
tokenclass line; that is, a token representation, possibly followed by a
parenthesized letter.
______________________________
Handler function bodies, details
The handler function body is a sequence of C++ code preceded by a line
containing a left brace and followed by a line containing a right brace:
    {
        return tlex_IGNORE;
    }
Between the braces, comments are written as in C++, rather than "--".
For compilation and execution, the code is implanted as the body of a
function that has two arguments: tlex and parm, where tlex is the current
tlex object and parm points to a struct containing at least the fields
described in the struct element declaration lines. The function is called
as a handler after the recognizer has assembled the token.
The code must return a value telling what to do with the assembled token.
This value may be tlex_IGNORE, tlex_ACCEPT, or a variable from a
struct element declaration having type 'action'. tlex_IGNORE causes tlex
to ignore the assembled token and begin looking for another at the
current position in the text. tlex_ACCEPT says to return the current
tokennumber and tokenvalue to the parser. An action type indicates that
the token so far is to be treated as if it were the operand of 'seq' for
the tokenclass named in defining the action value; that is, the token is
treated as a prefix for the token named in the action declaration.
For example, a Fortran lexer could treat "do" specially. If it were
followed by '5 i = 1, 10' the lexer would return the reserved word DO;
otherwise the lexer would return some variable, say parm->idact, where
the variable is defined as in
    tokenclass DO
        seq "do"
        action idact setID
        {
            /* the condition below is pseudocode */
            if ("do" is not the start of a DO stmt)
                return parm->idact;
            return tlex_ACCEPT;
        }
The function has available the many operations defined in tlex.H and
described below. The function can affect all aspects of the token,
as described here:
- current position in the input text
    On entry to the handler, CurrPos refers to the character after
    the token as recognized so far. The handler can adjust the
    position by calling NextChar, BackUp, and Advance.
- token position
    The values of GetTokPos and GetTokEnd indicate where
    the token is in the input stream. These values can be reset
    by calling StartToken and EndToken, both of which also
    affect the token text. To change the start, it is possible to
    use the sequence
        long delta = 1;  /* amount to increase start position; 1 is only an example */
        long pos = tlex->CurrPos();
        long newstart = tlex->GetTokPos() + delta;
        tlex->BackUp(pos - newstart);
        tlex->StartToken();
        for ( ; newstart < pos; newstart++)
            tlex->Advance();
- token text
    The token text is a buffer of characters which is a textual
    representation of the token for use by the compiler. Its value is
    accessed via GetTokenText; the text of the previous token
    is available via PrevTokenText. On entry to a handler where
    parm->SaveText is TRUE, the token text will be a null terminated
    string of all the text so far. The token text can be modified by
    calling StartToken, EndToken, ClearTokenText, AppendToTokenText,
    TruncateTokenText, and Advance.
- token number
    The token number is the number by which the parser refers to
    the token. Its value is initialized to the number of the token
    given on the tokenclass line. The current value is GetTokenNumber
    and a new value can be set with SetTokenNumber. Operands to
    SetTokenNumber are the values of variables declared in
    struct element declaration lines to have type tokennumber.
- token value
    The token value is any (void *) value the application wishes to
    have stored in the value stack associated with this token. (In
    yacc and bison, this value is stored in yylval.) It can be
    accessed and stored with GetTokenValue and SetTokenValue.
- treatment
    As its final step, the handler must return a value indicating the
    disposition of this token. It can return tlex_ACCEPT, tlex_IGNORE,
    or a struct element value of type action.
______________________________
Sample input: The ness.tlx file
Tokens in ness are much like those in C and C++. However, comments
begin with -- and extend to newline; --$ begins a pragmat, which is a
special comment processed by a pragmat parser. There is a long form
of string constants which can include newlines; and brackets and braces
are treated as parentheses. Note that the C++ code for numeric tokens
converts the tokennumber from setINTCON to setREALCON when
ScanNumber has detected a real value.
-- comment:  -- ... \n
tokenclass -none-
    recognizer ScanComment
    seq "--"
    char *endseq "\n"
-- pragmat:  --$ ... \n
tokenclass -none-
    recognizer ScanComment
    seq "--$"
    char *endseq "\n"
    {
        printf("pragmat: %s", tlex_GetTokenText(tlex)+3);
        return tlex_IGNORE;
    }
-- identifier:  [a-zA-Z_] [a-zA-Z0-9_]*
tokenclass setID
    set [a-zA-Z_]
    recognizer ScanID
    charset continueset [a-zA-Z0-9_]
    {
        struct toksym *s;
        s = toksym_TFind(tlex_GetTokenText(tlex), grammarscope);
        if (s != NULL)
            tlex_SetTokenNumber(tlex, s->toknum);
        return tlex_ACCEPT;
    }
-- string:  " ... "   escape is \
tokenclass setSTRINGCON
    seq "\""
    recognizer ScanString
-- string:  // ... \n\\\\   no escape
tokenclass setSTRINGCON
    seq "//"
    {
        register int c;
        static char delim[4] = "\n\\\\";
        char *dx;
        /* assumed: begin with the current character so c is initialized */
        c = tlex_CurrChar(tlex);
        dx = delim;
        while (*dx && c != EOF) {
            if (*dx == c)
                dx++, c = tlex_NextChar(tlex);
            else if (dx == delim)
                c = tlex_NextChar(tlex);
            else dx = delim;
        }
        if (c != EOF)
            tlex_NextChar(tlex);
        tlex_EndToken(tlex);
        return tlex_ACCEPT;
    }
-- integers and real values
tokenclass setINTCON
    set [0-9'.]
    recognizer ScanNumber
    tokennumber realtok setREALCON
    {
        if ( ! parm->IsInt)
            tlex_SetTokenNumber(tlex, parm->realtok);
        /* add value to symbol table */
        return tlex_ACCEPT;
    }
-- [ and { map to (
tokenclass '('
    set [{\[]
-- ] and } map to )
tokenclass ')'
    set [}\]]
______________________________
Sample input: The ness.tab.c file
A full .tab.c file as generated by bison is quite long, but gentlex only
looks for certain features. First it must find somewhere a line defining
YYNTOKENS with the form
    #define YYNTOKENS 56
(where the # is immediately after a newline). Subsequently it must find
the token names as in the example:
    static const char * const yytname[] =
    {"$","error","$illegal.","OR","AND",
    "NOT","'='","\"/=\"","'<'","'>'","\">=\"","\"<=\"","'+'","'-'","'*'","'/'",
    "'%'","'~'","UNARYOP","setID","setSTRINGCON","setINTCON",
    "setREALCON","MARKER","BOOLEAN","INTEGER","REAL","OBJECT",
    "VOID","FUNCTION","END","ON","EXTEND","FORWARD","MOUSE",
    "MENU","KEYS","EVENT","RETURN","WHILE","DO","IF","THEN",
    "ELSE","ELIF","EXIT","GOTOELSE","tokTRUE","tokFALSE",
    "tokNULL","';'","\":=\"","'('","')'","','","\"~:=\"","script","attributes",
    "type","functype","eventstart","endtag","attrDecl","parmList"};
The key identifying string is "yytname[] = {"; thereafter the tokens may
be separated by arbitrary white space and one comma. Note that scanning
terminates after reading YYNTOKENS token names, so the token list need
not continue to a correct C declaration.
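The scan described above can be approximated in standalone C++ as follows. This is a sketch, not the gentlex source: ReadTokenNames is a hypothetical name, the search accepts any white space between "yytname[]" and the brace, and a backslash is taken to quote the next character literally.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Extracts the first YYNTOKENS token names from the text of a .tab.c file.
bool ReadTokenNames(const std::string &tab, std::vector<std::string> &names) {
    names.clear();
    const std::string def = "#define YYNTOKENS";
    std::size_t d = tab.find(def);
    // The '#' must be the first character of a line.
    while (d != std::string::npos && d != 0 && tab[d - 1] != '\n')
        d = tab.find(def, d + 1);
    if (d == std::string::npos) return false;
    int n = std::atoi(tab.c_str() + d + def.size());
    if (n <= 0) return false;

    std::size_t p = tab.find("yytname[]", d);
    if (p == std::string::npos) return false;
    p = tab.find('{', p);
    if (p == std::string::npos) return false;

    while (static_cast<int>(names.size()) < n) {
        p = tab.find('"', p);                   // opening quote of a name
        if (p == std::string::npos) return false;
        std::string name;
        for (p++; p < tab.size() && tab[p] != '"'; p++) {
            if (tab[p] == '\\' && p + 1 < tab.size())
                p++;                            // take escaped char literally
            name.push_back(tab[p]);
        }
        if (p >= tab.size()) return false;      // unterminated name
        p++;                                    // step past closing quote
        names.push_back(name);
    }
    return true;   // stops after n names; the rest of the list is ignored
}
```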
______________________________
Sample compiler using both tlex and the parse object
The ness.tlc resulting from the above ness.tlx is utilized as the token
stream in ness/objects/compile.c, which uses the grammar in nessgra.y.
The nessgra.y file is processed with
    bison -n nessgra.y
to produce the nessgra.C and nessgra.H files.
Then gentlex is invoked
    gentlex ness.tlx nessgra.tab.C
to generate the ness.tlc file #included in compile.c, which follows:
    #include <text.H>
    #include <toksym.H>
    #include <nessgra.H>
    #include <tlex.H>

    static toksym_scopeType grammarscope;
    static struct toksym *proto;

    #include <ness.tlc>

    /* callback: record each reserved word with its token number */
    static void
    EnterReservedWord(void *rock, char *w, int i)
    {
        struct toksym *s;
        boolean isnew;  /* renamed from 'new', a C++ keyword */
        s = toksym_TLocate(w, proto, grammarscope, &isnew);
        s->toknum = i;
    }

    int
    parsetext(struct text *input)
    {
        class nessgra *nessparser = new class nessgra;
        struct tlex *lexalyzer;
        proto = toksym_New();
        grammarscope = toksym_TNewScope(toksym_GLOBAL);
        lexalyzer = tlex::Create(&ness_tlex_tables, NULL,
                input, 0, text_GetLength(input));
        (nessparser)->EnumerateReservedWords(EnterReservedWord, NULL);
        /* do all the work */
        return (nessparser)->Parse(tlex::LexFunc, lexalyzer);
    }
______________________________
Tools available in tlex
struct tlex *
tlex::Create(struct tlex_tables *description, void *rock,
struct text *text, long pos, long len)
Creates, initializes, and returns a new tlex object corresponding
to the tables given by 'description'.
The rock is available to any client of this object.
The text, pos, and len specify a portion of a text to be scanned.
int
tlex::LexFunc(void *lexrock, void *yylval)
Gets the next token from the associated text stream, sets
*(void **)yylval to a value determined by a Handler routine,
and returns the token number.
void
SetText(struct text *text, long pos, long len)
Resets the source text for the lexeme stream.
long
RecentPosition(int index, long *len)
For the 'index'th token relative to the current token,
set *len to its length and return its position.
Index = 0 is the most recent token;
its predecessors are indexed with negative numbers:
-1 -2 ... -tlex_RECENTSIZE+1
The value of tlex_RECENTSIZE is 10.
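The indexing scheme (0 for the most recent token, negative values for its predecessors) is what a fixed-size circular buffer provides. The standalone sketch below illustrates that idea only; RecentTokens and Record are hypothetical names, and returning -1 for an index that is out of range is an assumption, not documented tlex behavior.

```cpp
const int RECENTSIZE = 10;   // mirrors the documented tlex_RECENTSIZE

class RecentTokens {
    long pos[RECENTSIZE];
    long len[RECENTSIZE];
    int head;                // slot of the most recent token, -1 if none
    int count;               // number of valid slots
public:
    RecentTokens() : head(-1), count(0) {}

    // Records a newly found token, overwriting the oldest entry when full.
    void Record(long p, long l) {
        head = (head + 1) % RECENTSIZE;
        pos[head] = p;
        len[head] = l;
        if (count < RECENTSIZE) count++;
    }

    // index 0 is the most recent token, -1 its predecessor, and so on
    // down to -RECENTSIZE+1. Returns -1 when no such token is stored.
    long RecentPosition(int index, long *l) const {
        if (index > 0 || -index >= count) {
            *l = 0;
            return -1;
        }
        int slot = (head + index + RECENTSIZE) % RECENTSIZE;
        *l = len[slot];
        return pos[slot];
    }
};
```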
long
RecentIndent(int index)
Returns the indentation of the 'index'th most recent token,
where index is as for RecentPosition .
A token preceded by anything other than white space
is reported as having indentation 999.
void
Repeat(int index);
Back up and repeat tokens starting with the 'index'th
most recent token, where index is as for RecentPosition.
void
Error(char *msg)
A recognizer or handler may call this function to indicate an
error in the lexeme stream. The function sets up a dummy
token and calls the errorhandler function associated with
the grammar. The msg is expected to be in static storage.
GetRock() - returns the current rock value
SetRock(void *r) - sets a new rock value
The "rock" is an argument to tlex::Create. It is an arbitrary value
that is accessible via the tlex object.
struct tlex_Recparm *
Global()
The global struct is established in the xxx.tlc file in reaction
to the -global- tokenclass in the xxx.tlx file. Function Global()
returns a pointer to this global struct.
C++ code in tokenclass blocks can modify the values that will be returned
to the parser by calling macro methods to adjust these attributes:
Token number
Token value (yylval)
Current character and scan position in the source text
Position and length of the source for the current token
Token text generated to represent the token
Tlex methods to perform these operations are described in what follows.
SetTokenNumber(int n)
GetTokenNumber()
TokenNumber is the number to be reported to the parser.
This is usually set by default based on the argument to the
tokenclass line in the xxx.tlx file. It may be reset to a
value created by a tokennumber line within a tokenclass block.
SetTokenValue(void *v) - sets value for yylval
GetTokenValue() - gets current value destined for yylval
The TokenValue is the value for yylval. These values serve
as the initial values in the value stack maintained
by the parser in parallel with the state stack.
In tlex, the token value must be a (void *) value.
CurrPos() - position in input
CurrChar() - character at CurrPos
NextChar() - advance CurrPos and reset CurrChar
BackUp(int n) - decrement CurrPos and reset CurrChar
The current position in the input is CurrPos where the
character is as given by CurrChar. By convention each
lexical analysis routine leaves CurrPos/CurrChar referring
to the first character to be considered for the next token.
NextChar moves CurrPos ahead one character, fetches the
next character, and returns it.
BackUp moves backward by n characters, resetting CurrPos/CurrChar.
(A negative n value is acceptable and moves the position forward.)
See also Advance, below, which combines NextChar with storing
the prior character in the tokentext.
GetTokPos() - the position of the first character
GetTokEnd() - the position of the last character
StartToken() - records CurrPos as the position where the token begins
EndToken() - sets the token end to be just before CurrPos/CurrChar
The position of the token text in the input source is
recorded and is available via these methods. There is no
harm in calling StartToken or EndToken more than once,
although these functions also affect the token text buffer,
as noted below.
GetTokenText() - returns a pointer to a buffer with a copy
of the token's character string
PrevTokenText() - the text of the previous token (index == -1)
ClearTokenText() - the token text becomes ""
AppendToTokenText(int c) - c is added to the end of the token text
TruncateTokenText(int n) - text is reduced to its first n characters
Advance() == AppendToTokenText(NextChar())
Some tokens are recorded by the lexer as
a character string which can be retrieved by GetTokenText.
In particular, when function body C code is called as a recognizer
or handler, the text is the sequence of characters from the source
that caused this tokenclass to be activated.
Saving of the token text can be controlled by setting the
SaveText parameter (parm->SaveText=TRUE or
parm->SaveText=FALSE). Its default value is TRUE for ScanID,
and FALSE for ScanWhitespace, ScanComment and ScanString.
The text is always stored for ScanToken.
A canonical form of the number is always stored for ScanNumber.
If the text is stored for a comment or string, only the contents
are stored--not the delimiters--and the TokPos/TokEnd are set to the
contents only. (Normally TokPos/TokEnd includes the delimiters.)
StartToken does a ClearTokenText as a side effect.
EndToken has a side effect of AppendToTokenText('\0'), which makes
the token buffer a valid C string.
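How the position and text operations fit together can be seen in a small standalone model. MiniLex and ScanLetters are hypothetical names and nothing here is the tlex implementation. In this model Advance stores the current character and then moves ahead, following the earlier prose description of Advance, and std::string supplies the terminating '\0' that EndToken is said to append.

```cpp
#include <cctype>
#include <cstdio>
#include <string>

class MiniLex {
    std::string src;    // text being scanned
    long pos;           // CurrPos
    long tokpos;        // position of the first character of the token
    long tokend;        // position of the last character of the token
    std::string text;   // token text
public:
    explicit MiniLex(const std::string &s)
        : src(s), pos(0), tokpos(0), tokend(-1) {}
    long CurrPos() const { return pos; }
    int CurrChar() const {
        return pos < static_cast<long>(src.size())
                   ? static_cast<unsigned char>(src[pos]) : EOF;
    }
    int NextChar() { pos++; return CurrChar(); }
    void StartToken() { tokpos = pos; text.clear(); }   // also clears text
    void EndToken() { tokend = pos - 1; }                // just before CurrPos
    // Stores the current character in the token text, then moves ahead.
    int Advance() {
        if (CurrChar() == EOF) return EOF;
        text.push_back(static_cast<char>(CurrChar()));
        return NextChar();
    }
    long GetTokPos() const { return tokpos; }
    long GetTokEnd() const { return tokend; }
    const char *GetTokenText() const { return text.c_str(); }
};

// A recognizer-like helper: collects a run of letters as one token.
void ScanLetters(MiniLex &lx) {
    lx.StartToken();
    while (std::isalpha(lx.CurrChar()))
        lx.Advance();
    lx.EndToken();
}
```

After ScanLetters runs on "ab+c", the token text is "ab" and the current character is '+', matching the convention that a scan leaves CurrPos at the first character of the next token.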
The following are declared in gentlex.h.
boolean
If not defined elsewhere, the type 'boolean' is defined.
Values are TRUE and FALSE.
UNSIGN(c)
When subscripting by a char, use UNSIGN since some systems
will use a negative value for characters above 0x7F.
tlex_BITISSET(bs, c)
Macro to determine if character c is in charset value bs.
Assuming parm->b is a charset value, we can ask
whether the bit corresponding to character c is set by saying
tlex_BITISSET(parm->b, c)
tlex_ACCEPT
tlex_IGNORE
These values can be returned by handler function bodies to indicate
what to do with the token. The value of a struct element declaration
of type 'action' can also be returned.
Copyright 1992, 1994 Carnegie Mellon University. All Rights Reserved.
$Disclaimer:
# Permission to use, copy, modify, and distribute this software and its
# documentation for any purpose and without fee is hereby granted,
provided
# that the above copyright notice appear in all copies and that both that
# copyright notice and this permission notice appear in supporting
# documentation, and that the name of IBM not be used in advertising or
# publicity pertaining to distribution of the software without specific,
# written prior permission.
#
# THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD
# TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ANY COPYRIGHT
# HOLDER BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL
# DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE,
# DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
# OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
# WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
#
# $