tlex.doc

tlex - lexical analyzer for ATK text
gentlex - create tables for tlex

ATK's 'parser' object requires as its stream of tokens a pointer to a function and a rock value to be passed to that function. Tlex provides such a function called LexFunc, which reads tokens from an ATK text object. Tables for creating instantiations of the tlex object are generated by gentlex (pronounced gen-t-lex).

Unlike the Unix 'lex' package, gentlex/tlex does not implement arbitrary regular expressions. However, because it is designed specifically for tokenizing streams for parser input, tlex does considerably more than 'lex' and even supports the most general recognition scheme: C code.

The essence of the tlex approach is to determine what sort of token is coming by assembling characters until they constitute a prefix for a recognizable token. Then a recognizer function is called to determine the extent of the token. Builtin recognizers are provided for the sorts of tokens required for popular modern programming languages.

______________________________
Using tlex

For each tlex application, say 'sample', the programmer writes a .tlx file describing the tokens to be found. This 'sample.tlx' file is then processed with gentlex to create a sample.tlc file. This file is then included in the application:

    #include <sample.tlc>

Later in the application, a tlex object is declared and created as in

    class tlex *tl;
    . . .
    tl = tlex_Create(&sample_tlex_tables, this, text, 0, (text)->GetLength());

where sample_tlex_tables is declared within sample.tlc, 'this' is passed as a rock that is later available to token recognizers, and text is an ATK text object containing the text to be parsed. The last two parameters can delimit the parse to a subsequence of the text.

After this preparation, the tlex object, tl, can be passed to parser::Parse with tlex::LexFunc as the function.
That is, if gggg is a parser object, parsing can be initiated with

    (gggg)->Parse(tlex::LexFunc, tl);

See the appendix for a complete program example.

Gentlex takes two inputs--a file containing token class descriptions and another file containing declarations for YYNTOKENS and yytname as produced by running bison with the -k switch. {At present the -k switch is implemented only in the AUIS version of Bison.} The output file is typically named with the same prefix as the first input file and the extension .tlc:

    gentlex sample.tlx sample.tab.c

produces file sample.tlc. If mkparserclass is used to process the Bison output, the resulting sample.C file is adequate as a replacement for sample.tab.c.

-p switch

Declarations output by gentlex are static variables whose name begins with a prefix value followed by an underscore, as in sample_tlex_tables. If the first input file has the extension .tlx, the prefix will be the filename prior to the extension. A different prefix value may be specified by giving the -p switch on the command line. The output file is always named with the prefix value and the extension .tlc. For example

    gentlex -p sample a b

will read the .tlx information from file a and the .tab.c information from file b. The output will be generated to a file sample.tlc and the variables declared in that file will begin with `sample_'. If the -p switch is given the file names need not be specified; they will be generated from the prefix value and the extensions .tlx and .tab.c. If the .tlx file is named, the .tab.c file need not be, as long as its name begins with the same prefix as the .tlx file.

-l switch

The .tlc output file will ordinarily contain #line lines to relate compile error messages back to the .tlx file. The -l switch will eliminate these #line lines.

______________________________
Overview of the structure of .tlx Files

The purpose of tlex is to examine the input stream and find tokens, reporting each to the parser by its token class number.
Gentlex determines the token class numbers from the yytname list in the .tab.c file, which will have been generated by bison with the -n switch (and some other switch combinations). {At present, -n is only implemented in the AUIS version of Bison.}

A .tlx file is a sequence of tokenclass blocks, each responsible for describing how tlex can identify the tokens of one class (or a group of classes). The syntax is line oriented: the various allowed operations each occupy a single line. Comments begin with two dashes (--) and extend to the end of the line.

Here is a typical tokenclass block containing a description of identifier tokens--ones that start with an alphabetic and continue with an alphabetic, an underline, or a digit.

    tokenclass setID
        set [a-zA-Z]
        recognizer ScanID
        charset continueset [a-zA-Z_0-9]

The tokenclass line says that this block describes tokens the parser is to think of as satisfying the class setID (a name used in the grammar). The 'set' line says that this tokenclass block is triggered by any alphabetic character. The 'recognizer' line says to use the builtin recognizer called ScanID. One of the parameters to ScanID is a charset value called continueset. It is declared here to override the default value; the declaration says that an identifier continues with an alphabetic, a digit, or an underline.

Each tokenclass block begins with a header line containing 'tokenclass' and a representation of the class--either a name, or a quoted string. Following the header are four types of lines, each of which is described in more detail later.

    'set' or 'seq' - these lines describe the prefix of tokens that satisfy the tokenclass. seq specifies an initial sequence of characters, while set lists a set of characters.

    'recognizer' - this line names a builtin recognizer which will be called to determine the remainder of the token.

    struct element declaration - declares the type, name, and initial value for an element of a struct.
    For some builtin recognizers, fields of this struct provide information that more precisely controls the recognizer. The struct is also passed to the function created from the following:

    handler function body - A function created from this body is called after the recognizer has found the token. The function may extend the token, modify its text, set the value for yylval, or even change the token number.

Suppose that identifiers beginning with x_ are to be treated as tokens of class setHEXCONSTANT. The above description could be augmented to describe this as follows:

    tokenclass setID
        set [a-zA-Z]                    -- 'set' or 'seq' line
        recognizer ScanID               -- recognizer line
        -- struct element declarations follow
        charset continueset [a-zA-Z_0-9]
        tokennumber hextok setHEXCONSTANT
        -- function body follows
        {
            char *tok = tlex_GetTokenText(tlex);
            if (*tok == 'x' && *(tok+1) == '_')
                tlex_SetTokenNumber(tlex, parm->hextok);
            return tlex_ACCEPT;
        }

As earlier, this tokenclass block describes tokens of the class setID. They begin with a letter and continue as determined by the ScanID recognizer. As its last step, ScanID will call the function whose body is the bracketed lines. Two parameters will be passed to this function: tlex--a pointer to the current tlex object, and parm--a pointer to the struct constructed from the struct element declarations earlier in the block.

The code in the function gets the token text from the tlex and checks for an initial 'x_'. If it is found, the token number in the tlex is changed to the token class for setHEXCONSTANT, utilizing the hextok value inserted in the struct by the struct element declaration.
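The flow just described (a prefix triggers the block, the recognizer extends the token, the handler may change the token number) can be imitated in a few lines of stand-alone C++. This is only a sketch for illustration and does not use the tlex library: MiniLex, IdParm, ScanIdLike, IdHandler, and the token numbers are hypothetical stand-ins, and the real ScanID may differ in detail.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical token numbers; with gentlex they would come from yytname.
enum { setID = 1, setHEXCONSTANT = 2 };
// Stand-ins for the tlex_ACCEPT and tlex_IGNORE return values.
enum Action { ACCEPT, IGNORE };

// Toy stand-in for a tlex object: input, scan position, token text, number.
struct MiniLex {
    std::string input;
    std::size_t pos = 0;
    std::string tokenText;
    int tokenNumber = 0;
};

// Mirrors the struct element declaration "tokennumber hextok setHEXCONSTANT".
struct IdParm {
    int hextok;
};

// Same logic as the handler body in the setID block above.
Action IdHandler(MiniLex &lex, const IdParm &parm) {
    const std::string &tok = lex.tokenText;
    if (tok.size() >= 2 && tok[0] == 'x' && tok[1] == '_')
        lex.tokenNumber = parm.hextok;   // retag the token as a hex constant
    return ACCEPT;
}

// Recognizer in the spirit of ScanID: triggered by a letter, it continues
// over letters, digits, and underline, then calls the handler as its last step.
bool ScanIdLike(MiniLex &lex, const IdParm &parm) {
    if (lex.pos >= lex.input.size() ||
        !std::isalpha(static_cast<unsigned char>(lex.input[lex.pos])))
        return false;                    // this token class is not triggered
    lex.tokenText.clear();
    lex.tokenNumber = setID;             // default from the tokenclass line
    while (lex.pos < lex.input.size()) {
        unsigned char c = static_cast<unsigned char>(lex.input[lex.pos]);
        if (!std::isalnum(c) && c != '_')
            break;
        lex.tokenText += static_cast<char>(c);
        ++lex.pos;
    }
    return IdHandler(lex, parm) == ACCEPT;
}
```

Scanning "x_1f" with this sketch leaves the token number at setHEXCONSTANT, while "xyz" stays a setID; the scan position ends just past the token in both cases.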
Special recognition of 'x_' would have been easier, however, by writing two tokenclass rules:

    tokenclass setID
        set [a-zA-Z]
        recognizer ScanID
        charset continueset [a-zA-Z_0-9]

    tokenclass setHEXCONSTANT
        seq "x_"
        recognizer ScanID
        charset continueset [a-zA-Z_0-9]

The setID class would recognize all tokens, even those beginning with x but followed by a character other than underline. The setHEXCONSTANT class would recognize tokens beginning with x_. When a token starts with x, the lexical analyzer looks ahead to determine whether it is a setID or a setHEXCONSTANT. In practice, the continueset for hex constants might be [0-9a-f] instead of the value given and a function body might be provided to compute the actual hexadecimal value.

______________________________
Tokenclass line: Details

The value following the 'tokenclass' keyword is one of the token identifier values used in the bison description of the grammar. Typical examples are

    ELSE   setID   tokNULL   '+'   "<="

Token names beginning with "set" are assumed to describe classes and are not treated as reserved words. Token names beginning with "tok" are assumed to be reserved words consisting of the characters following the initial three. Multicharacter tokens delimited with double quotes may not be acceptable in all versions of Bison.

It is possible, but not necessary, to write tokenclass blocks for quoted characters and strings like '+' and "<=". Gentlex automatically generates tokenclass blocks for these sorts of tokens.

A recognizer for whitespace is also generated automatically and suffices if the desired whitespace set is the set of characters satisfying isspace().
To override this automatic set, include a block for tokenclass -none- and specify for it the recognizer ScanWhitespace, as in

    tokenclass -none-
        set [ \n\t\f\v\r]
        recognizer ScanWhitespace
        charset continueset [ \n\t\f\v\r]

For the `action' type declaration described below, a disambiguating letter may be appended in parentheses after a tokenclass representation or the special tokenclass -none-; for example:

    tokenclass setNUMBER (b)

There are several reserved token class names: -none-, -global-, -reservedwords-, and -errorhandler-, as described in the following.

tokenclass -none-

The tokenclass -none- is used for whitespace and comments. It is assumed that the function body, if any, in the tokenclass block returns tlex_IGNORE, as is done by ScanWhitespace and ScanComment. If it instead returns tlex_ACCEPT, it should have reset the token number, because the default token number established by -none- terminates the input stream.

tokenclass -global-

The block for this tokenclass has no set, seq, or recognizer lines. Its sole function is to generate and initialize a struct, called PREFIX_global, where the PREFIX value is the stem of the .tlx file name or the value of the -p switch. Fields of the PREFIX_global struct can be accessed from the handler function body associated with any of the tokenclasses. This can be used to create a single charset, tokennumber, or action value that can be referenced from multiple function bodies. The handler function for the -global- block, if any, is executed the first time tlex_Create is called for the PREFIX_tlex_tables value created by this file; thus it can initialize further any variables in the global struct. It should not return a value.

tokenclass -reservedwords-

The default treatment of reserved words in gentlex is to treat them in the same token class as identifiers. The handler for that class is expected to set their token number by looking up the identifier in the symbol table.
(Entries are put in the table with parse_EnumerateReservedWords; see parse.doc.) However, reserved words can be recognized directly by specifying the tokenclass name '-reservedwords-'.

tokenclass -errorhandler-

Tlex has a method tlex_Error which can be called by recognizers or handler function bodies to indicate errors, such as an 8 in an octal value. The default action of tlex_Error is to print the error message which is the argument; however, a .tlx file may specify a different action by providing an errorhandler block. This block must not include a recognizer line, but must include a handler function body. The function generated from the body is invoked for any error. The function body can access the proposed error message as parm->msg. It should not return a value.

______________________________
Set or seq line, details

The set or seq line must appear for most token classes. It determines when this token class is initiated.

The argument on a seq line is a double-quote delimited string:

    seq "--"

The token class is activated whenever that sequence is found at the beginning of a token.

The argument on a set line is a sequence of characters within square brackets; e.g. [ \t\n\r]. Any character in the set will initiate the given tokenclass when it appears as the first character in a token. Backslash, dash, and right square bracket may appear in the sequence if preceded with backslash as an escape character. A consecutive subset of the collating sequence can be included by writing the two characters at the ends of the sequence separated with a dash; for instance, [a-zA-Z_#$%0-9] would be the set of all alphabetic characters, the digits, underline, hash, percent, and dollar sign.

______________________________
Recognizer line, details

The operand following the keyword 'recognizer' must be the name of one of the builtin recognizers. Each recognizer takes one or more operands which further describe the tokens the recognizer will accept.
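As a side note on the bracketed notation used by 'set' lines (and by charset values), the following stand-alone C++ sketch expands the text between the brackets into a membership table. It is not gentlex's own code: ExpandSet and Unescape are hypothetical names, and the treatment of \n, \t, \r, \f, and \v as C escapes is an assumption based on the examples in this document.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Translate the character after a backslash. The C-style letters are an
// assumption; any other character (\\, \-, \]) stands for itself.
static unsigned char Unescape(char c) {
    switch (c) {
    case 'n': return '\n';
    case 't': return '\t';
    case 'r': return '\r';
    case 'f': return '\f';
    case 'v': return '\v';
    default:  return static_cast<unsigned char>(c);
    }
}

// Hypothetical helper: expand the text between the square brackets of a
// 'set' line, e.g. "a-zA-Z_#$%0-9", into a 256-entry membership table.
std::bitset<256> ExpandSet(const std::string &spec) {
    std::bitset<256> members;
    std::size_t i = 0;
    // Read one character, honoring backslash as the escape character.
    auto readChar = [&](unsigned char &out) -> bool {
        if (i >= spec.size())
            return false;
        if (spec[i] == '\\' && i + 1 < spec.size()) {
            out = Unescape(spec[i + 1]);
            i += 2;
        } else {
            out = static_cast<unsigned char>(spec[i++]);
        }
        return true;
    };
    unsigned char lo;
    while (readChar(lo)) {
        unsigned char hi = lo;
        // An unescaped dash followed by another character denotes a range.
        if (i + 1 < spec.size() && spec[i] == '-') {
            ++i;                         // skip the dash
            readChar(hi);
        }
        for (unsigned c = lo; c <= hi; ++c)
            members.set(c);
    }
    return members;
}
```

With such a table, a test like members.test(c) plays the role that tlex_BITISSET(parm->v, c) plays for a charset field in a handler function body.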
Most recognizers return tlex_ACCEPT to the central token recognizer to indicate that they have found a token. ScanWhitespace, ScanComment, and ScanError normally return tlex_IGNORE to indicate that further scanning is required to actually find a token. The handler function bodies for -errorhandler- and -global- should not return values.

The individual builtin recognizers are described in the following paragraphs. Some of them define or require certain items among the struct element declarations. From these is built the 'parm struct'; so-called because it can be accessed by the identifier 'parm' in the handler function body.

ScanNumber

The first character of the token must be a dot, a digit, or a single quote character. Subsequent characters are scanned as long as they correspond to a C numeric constant:

    decimal integer - sequence of [0-9]
    octal integer - 0 followed by sequence of [0-7]
    hexadecimal integer - 0x followed by sequence of [0-9a-fA-F]
    quoted character - two single quotes surrounding a character or an escape sequence
    real value - an appropriate sequence of characters from [0-9\-+.eE]

The parm struct will have at least these fields (they need not be declared):

    boolean IsInt     -- TRUE unless value contains '.' or exponent
    int intval        -- value as an integer
    double realval    -- value as a real

The handler function body can modify these values.

ScanID

This recognizer scans a token which continues as long as subsequent characters are in a given set. The default set is all characters satisfying isalnum(). There may be a struct element declaration for a charset called continueset; it is the set of characters allowed after the first.

ScanString

This recognizer accumulates the token until a single terminating character is encountered. An escape character and an illegal character may also be specified. The default values are those for C strings: quote, backslash, and newline.
Alternate values may be given by providing one or more of these struct element declarations:

    char *endseq ...        -- terminating character
    char *escapechar ...    -- escape char
    char *badchar ...       -- illegal character

All of these will be in the parm struct, whether specified or not. They should not be modified. The illegal character can only appear in the string if preceded by the escape character, in which case, both are ignored.

ScanToken

The initial character or sequence is treated as the entire token. The handler function body can extend the token to include succeeding characters. ScanToken can be used to specify the same tokenclass for two different characters. For instance, to map left braces to left parentheses we could write

    tokenclass '('
        seq "{"
        recognizer ScanToken

ScanToken is the default recognizer if none is specified.

ScanComment

Typically, this recognizer is employed to scan a sequence-terminated token. By default the recognizer returns tlex_IGNORE so another token will be scanned for after recognizing the comment. The struct element declarations may include

    char *endseq ...        -- termination of the comment

The first character of the endseq value must not appear anywhere else in that value. The default value is a single newline.

ScanWhitespace

Typically, this recognizer spans a set of characters. By default, the recognizer returns tlex_IGNORE so another token will be scanned for after skipping the whitespace. The set of whitespace must be specified both in a 'set' line and by writing a struct element declaration for

    charset continueset ...    -- whitespace characters

which will appear in the parm struct and must not be modified in the function handler body.

ScanError

The character or sequence that initiates this token class is treated as an error. A struct element declaration for 'msg' should be specified:

    char *msg ...
        -- error message

If no function handler body is specified, the message is passed to tlex_Error (which calls the -errorhandler- function or, if none, prints the message).

______________________________
Struct element declaration lines, details

A struct element declaration line has three elements--type, identifier, and value. The first two are single words; the form of the value depends on the type. The described field becomes one field of the struct named 'parm' passed to the recognizer or handler function for this token class. The value is used to initialize the field. For example, if the line is

    int basis 3

the generated struct declaration will have the form:

    struct tlex_Sym000 {
        ...
        int basis;
        ...
    } tlex_Sym001 = {
        ...
        3,
        ...
    };

A limited set of types is allowed in a struct element declaration. This set includes these standard C types

    int   long   float   double   char*

and the semi-standard type boolean for which the value constants are TRUE and FALSE. Other types are charset, tokennumber, and action, as described in the following:

charset

The value portion is a character sequence in square brackets, just as for 'set' lines. Charset variables can appear as the first argument to tlex_BITISSET. If v is a charset identifier and c is a character, the expression tlex_BITISSET(parm->v, c) is TRUE if the value of c is one of the characters in the value of v.

tokennumber

The variable declared as a tokennumber is declared as int in C++. The initialization expression is a token name, exactly as may appear as the operand to tokenclass. The C++ int is initialized to the appropriate token number for the given token name. This value is appropriate as the second argument to tlex::SetTokenNumber.

action

The return value from a C++ code portion must be one of two constants or must be a value created by an action type field element.
The initialization for a variable of type action is the operand of a tokenclass line; that is, a token representation, possibly followed by a parenthesized letter.

______________________________
Handler function bodies, details

The handler function body is a sequence of C++ code preceded by a line containing a left brace and followed by a line containing a right brace:

    {
        return tlex_IGNORE;
    }

Between the braces, comments are written as in C++, rather than "--". For compilation and execution, the code is implanted as the body of a function that has two arguments: tlex and parm, where tlex is the current tlex object and parm points to a struct containing at least the fields described in the field description lines.

The function is called as a handler after the recognizer has assembled the token. The code must return a value telling what to do with the assembled token. This value may be tlex_IGNORE, tlex_ACCEPT, or a variable from a struct element declaration having type 'action'.

tlex_IGNORE causes tlex to ignore the assembled token and begin looking for another at the current position in the text.

tlex_ACCEPT says to return the current tokennumber and tokenvalue to the parser.

An action type indicates that the token so far is to be treated as if it were the operand of 'seq' for the tokenclass named in defining the action value; that is, the token is treated as a prefix for the token named in the action declaration. For example, a Fortran lexer could treat "do" specially. If it were followed by '5 i = 1, 10' the lexer would return the reserved word DO; but otherwise the lexer would return some variable, say parm->idact, where the variable is defined as in

    tokenclass DO
        seq "do"
        action idact setID
        {
            if ("do" is not the start of a DO stmt)
                return parm->idact;
            return tlex_ACCEPT;
        }

The function has available the many operations defined in tlex.H and described below.
The function can affect all aspects of the token, as described here:

- current position in the input text

    On entry to the handler, CurrPos refers to the character after the token as recognized so far. The handler can adjust the position by calling NextChar, BackUp, and Advance.

- token position

    The values of GetTokPos and GetTokEnd indicate where the token is in the input stream. These values can be reset by calling StartToken and EndToken, both of which also affect the token text. To change the start, it is possible to use the sequence

        long delta = amount to increase start position;
        long pos = tlex->CurrPos();
        long newstart = tlex->GetTokPos() + delta;
        tlex->BackUp(pos - newstart);
        tlex->StartToken();
        for ( ; newstart < pos; newstart++)
            tlex->Advance();

- token text

    The token text is a buffer of characters which is a textual representation of the token for use by the compiler. Its value is accessed via GetTokenText; the text of the previous token is available via PrevTokenText. On entry to a handler where parm->SaveText is TRUE, the token text will be a null terminated string of all the text so far. The token text can be modified by calling StartToken, EndToken, ClearTokenText, AppendToTokenText, TruncateTokenText, and Advance.

- token number

    The token number is the number by which the parser refers to the token. Its value is initialized to the number of the token given on the tokenclass line. The current value is returned by GetTokenNumber and a new value can be set with SetTokenNumber. Operands to SetTokenNumber are the values of variables declared in struct element declaration lines to have type tokennumber.

- token value

    The token value is any (void *) value the application wishes to have stored in the value stack associated with this token. (In yacc and bison, this value is stored in yylval.) It can be accessed and stored with GetTokenValue and SetTokenValue.

- treatment

    As its final step, the handler must return a value indicating the disposition of this token.
    It can return tlex_ACCEPT, tlex_IGNORE, or a struct element value of type action.

______________________________
Sample input: The ness.tlx file

Tokens in ness are much like those in C and C++. However, comments begin with -- and extend to newline; --$ begins a pragmat, which is a special comment processed by a pragmat parser. There is a long form of string constants which can include newlines; and brackets and braces are treated as parentheses. Note that the C++ code for numeric tokens converts the tokennumber from setINTCON to setREALCON when ScanNumber has detected a real value.

    ------------------------
    -- comment:  -- ... \n
    tokenclass -none-
        recognizer ScanComment
        seq "--"
        char *endseq "\n"

    -- pragmat:  --$ ... \n
    tokenclass -none-
        recognizer ScanComment
        seq "--$"
        char *endseq "\n"
        {
            printf("pragmat: %s", tlex_GetTokenText(tlex)+3);
            return tlex_IGNORE;
        }

    -- identifier:  [a-zA-Z_] [a-zA-Z0-9_]*
    tokenclass setID
        set [a-zA-Z_]
        recognizer ScanID
        charset continueset [a-zA-Z0-9_]
        {
            struct toksym *s;
            s = toksym_TFind(tlex_GetTokenText(tlex), grammarscope);
            if (s != NULL)
                tlex_SetTokenNumber(tlex, s->toknum);
            return tlex_ACCEPT;
        }

    -- string:  " ... "   escape is \
    tokenclass setSTRINGCON
        seq "\""
        recognizer ScanString

    -- string:  // ... \n\\\\   no escape
    tokenclass setSTRINGCON
        seq "//"
        {
            register int c;
            static char delim[4] = "\n\\\\";
            char *dx;
            dx = delim;
            /* assumed initialization: start from the character after "//" */
            c = tlex_CurrChar(tlex);
            while (*dx && c != EOF) {
                if (*dx == c)
                    dx++, c = tlex_NextChar(tlex);
                else if (dx == delim)
                    c = tlex_NextChar(tlex);
                else
                    dx = delim;
            }
            if (c != EOF)
                tlex_NextChar(tlex);
            tlex_EndToken(tlex);
            return tlex_ACCEPT;
        }

    -- integers and real values
    tokenclass setINTCON
        set [0-9'.]
        recognizer ScanNumber
        tokennumber realtok setREALCON
        {
            if ( ! parm->IsInt)
                tlex_SetTokenNumber(tlex, parm->realtok);
            /* add value to symbol table */
            return tlex_ACCEPT;
        }

    -- [ and { map to (
    tokenclass '('
        set [{\[]

    -- ] and } map to )
    tokenclass ')'
        set [}\]]

______________________________
Sample input: The ness.tab.c file

A full .tab.c file as generated by bison is quite long, but gentlex only looks for certain features. First it must find somewhere a line defining YYNTOKENS with the form

    #define YYNTOKENS 56

(where the # is immediately after a newline). Subsequently it must find the token names as in the example:

    static const char * const yytname[] = {"$","error","$illegal.","OR","AND",
    "NOT","'='","\"/=\"","'<'","'>'","\">=\"","\"<=\"","'+'","'-'","'*'","'/'",
    "'%'","'~'","UNARYOP","setID","setSTRINGCON","setINTCON",
    "setREALCON","MARKER","BOOLEAN","INTEGER","REAL","OBJECT",
    "VOID","FUNCTION","END","ON","EXTEND","FORWARD","MOUSE",
    "MENU","KEYS","EVENT","RETURN","WHILE","DO","IF","THEN",
    "ELSE","ELIF","EXIT","GOTOELSE","tokTRUE","tokFALSE",
    "tokNULL","';'","\":=\"","'('","')'","','","\"~:=\"","script","attributes",
    "type","functype","eventstart","endtag","attrDecl","parmList"};

The key identifying string is "yytname[] = {"; thereafter the tokens may be separated by arbitrary white space and one comma. Note that scanning terminates after reading YYNTOKENS token names, so the token list need not continue to a correct C declaration.

______________________________
Sample compiler using both tlex and the parse object

The ness.tlc resulting from the above ness.tlx is utilized as the token stream in ness/objects/compile.c, which uses the grammar in nessgra.y. The nessgra.y file is processed with

    bison -n nessgra.y

to produce the nessgra.C and nessgra.H files.
Then gentlex is invoked

    gentlex ness.tlx nessgra.tab.C

to generate the ness.tlc file #included in compile.c, which follows:

    #include <text.H>
    #include <toksym.H>
    #include <nessgra.H>
    #include <tlex.H>

    static toksym_scopeType grammarscope;
    static struct toksym *proto;

    #include <ness.tlc>

    static void
    EnterReservedWord(void *rock, char *w, int i)
    {
        struct toksym *s;
        boolean isnew;      /* renamed: 'new' is a reserved word in C++ */
        s = toksym_TLocate(w, proto, grammarscope, &isnew);
        s->toknum = i;
    }

    int
    parsetext(struct text *input)
    {
        class nessgra *nessparser = new class nessgra;
        struct tlex *lexalyzer;

        proto = toksym_New();
        grammarscope = toksym_TNewScope(toksym_GLOBAL);
        lexalyzer = tlex::Create(&ness_tlex_tables, NULL, input,
                0, text_GetLength(input));
        (nessparser)->EnumerateReservedWords(EnterReservedWord, NULL);
        return (nessparser)->Parse(tlex::LexFunc, lexalyzer);   /* do all the work */
    }

______________________________
Tools available in tlex

struct tlex * tlex::Create(struct tlex_tables *description, void *rock, struct text *text, long pos, long len)

    Creates, initializes, and returns a new tlex object corresponding to the tables given by 'description'. The rock is available to any client of this object. The text, pos, and len specify a portion of a text to be scanned.

int tlex::LexFunc(void *lexrock, void *yylval)

    Gets the next token from the associated text stream, sets *(void **)yylval to a value determined by a Handler routine, and returns the token number.

void SetText(struct text *text, long pos, long len)

    Resets the source text for the lexeme stream.

long RecentPosition(int index, long *len)

    For the 'index'th token relative to the current token, set *len to its length and return its position. Index = 0 is the most recent token; its predecessors are indexed with negative numbers: -1 -2 ... -tlex_RECENTSIZE+1. The value of tlex_RECENTSIZE is 10.

long RecentIndent(int index)

    Returns the indentation of the 'index'th most recent token, where index is as for RecentPosition.
    A token preceded by anything other than white space is reported as having indentation 999.

void Repeat(int index);

    Back up and repeat tokens starting with the 'index'th most recent token, where index is as for RecentPosition.

void Error(char *msg)

    A recognizer or handler may call this function to indicate an error in the lexeme stream. The function sets up a dummy token and calls the errorhandler function associated with the grammar. The msg is expected to be in static storage.

GetRock() - returns the current rock value
SetRock(void *r) - sets a new rock value

    The "rock" is an argument to tlex::Create. It is an arbitrary value that is accessible via the tlex object.

struct tlex_Recparm * Global()

    The global struct is established in the xxx.tlc file in reaction to the -global- tokenclass in the xxx.tlx file. Function Global() returns a pointer to this global struct.

C++ code in tokenclass blocks can modify the values that will be returned to the parser by calling macro methods to adjust these attributes:

    Token number
    Token value (yylval)
    Current character and scan position in the source text
    Position and length of the source for the current token
    Token text generated to represent the token

Tlex methods to perform these operations are described in what follows.

SetTokenNumber(int n)
GetTokenNumber()

    TokenNumber is the number to be reported to the parser. This is usually set by default based on the argument to the tokenclass line in the xxx.tlx file. It may be reset to a value created by a tokennumber line within a tokenclass block.

SetTokenValue(void *v) - sets value for yylval
GetTokenValue() - get current value destined for yylval

    The TokenValue is the value for yylval. These values serve as the initial values in the value stack maintained by the parser in parallel with the state stack. In tlex, the token value must be a (void *) value.
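Because the token value is restricted to a (void *), a handler that wants to hand the parser something richer than a pointer-sized item has to store it elsewhere and pass its address. The stand-alone C++ sketch below shows one way this could be done; NumberValue, BoxNumber, and UnboxNumber are hypothetical names, and the decision that the receiver frees the record is an assumption, not something tlex prescribes.

```cpp
// Hypothetical value record; its fields echo the ScanNumber parm fields.
struct NumberValue {
    bool isInt;
    long intval;
    double realval;
};

// What a handler might build and then pass to SetTokenValue:
// a heap record viewed as an untyped pointer.
void *BoxNumber(bool isInt, long intval, double realval) {
    return new NumberValue{isInt, intval, realval};
}

// What a grammar action might do with the value it receives as yylval:
// cast back to the record type agreed on by lexer and parser.
NumberValue *UnboxNumber(void *tokenValue) {
    return static_cast<NumberValue *>(tokenValue);
}
```

Both sides must agree on the record type for each token class, since nothing in a (void *) identifies what it points to; whoever consumes the value in this sketch is also expected to delete it.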
CurrPos() - position in input
CurrChar() - character at CurrPos
NextChar() - advance CurrPos and reset CurrChar
BackUp(int n) - decrement CurrPos and reset CurrChar

    The current position in the input is CurrPos where the character is as given by CurrChar. By convention each lexical analysis routine leaves CurrPos/CurrChar referring to the first character to be considered for the next token. NextChar moves CurrPos ahead one character, fetches the next character, and returns it. BackUp moves backward by n characters, resetting CurrPos/CurrChar. (A negative n value is acceptable and moves the position forward.) See also Advance, below, which combines NextChar with storing the prior character in the tokentext.

GetTokPos() - the position of the first character
GetTokEnd() - the position of the last character
StartToken() - records CurrPos as the position where the token begins
EndToken() - sets the token end to be just before CurrPos/CurrChar

    The position of the token text in the input source is recorded and is available via these methods. There is no harm in calling StartToken or EndToken more than once, although these functions also affect the token text buffer, as noted below.

GetTokenText() - returns a pointer to a buffer with a copy of the token's character string
PrevTokenText() - the text of the previous token (index == -1)
ClearTokenText() - the token text becomes ""
AppendToTokenText(int c) - c is added to the end of the token text
TruncateTokenText(int n) - text is reduced to its first n characters
Advance() == AppendToTokenText(NextChar())

    Some tokens are recorded by the lexer as a character string which can be retrieved by GetTokenText. In particular, when a function body C code is called as a recognizer or handler, the text is the sequence of characters from the source that caused this tokenclass to be activated. Saving of the token text can be controlled by setting the SaveText parameter (parm->SaveText=TRUE or parm->SaveText=FALSE).
    Its default value is TRUE for ScanID, and FALSE for ScanWhitespace, ScanComment and ScanString. The text is always stored for ScanToken. A canonical form of the number is always stored for ScanNumber. If the text is stored for a comment or string, only the contents are stored--not the delimiters--and the TokPos/TokEnd are set to the contents only. (Normally TokPos/End includes the delimiters.)

    StartToken does a ClearTokenText as a side effect. EndToken has a side effect of AppendToTokenText('\0'), which makes the token buffer a valid C string.

The following are declared in gentlex.h.

boolean

    If not defined elsewhere, the type 'boolean' is defined. Values are TRUE and FALSE.

UNSIGN(c)

    When subscripting by a char, use UNSIGN since some systems will use a negative value for characters above 0x7F.

tlex_BITISSET(bs, c)

    Macro to determine if character c is in charset value bs. Assuming parm->b is a charset value, we can ask whether the bit corresponding to character c is set by saying tlex_BITISSET(parm->b, c)

tlex_ACCEPT
tlex_IGNORE

    These values can be returned by handler function bodies to indicate what to do with the token. The value of a struct element declaration of type 'action' can also be returned.

Copyright 1992, 1994 Carnegie Mellon University. All Rights Reserved.

$Disclaimer: # Permission to use, copy, modify, and distribute this software and its # documentation for any purpose and without fee is hereby granted, provided # that the above copyright notice appear in all copies and that both that # copyright notice and this permission notice appear in supporting # documentation, and that the name of IBM not be used in advertising or # publicity pertaining to distribution of the software without specific, # written prior permission. # # THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD # TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF # MERCHANTABILITY AND FITNESS.
IN NO EVENT SHALL ANY COPYRIGHT # HOLDER BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL # DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, # DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE # OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION # WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. # # $