tlex.doc

tlex - lexical analyzer for ATK text
gentlex - create tables for tlex

ATK's 'parse' object requires as its stream of tokens an object which is a subclass of the 'lexan' class and thus provides a NextToken method. 'tlex' is one such subclass of lexan; its input stream is an ATK text object. Tables for creating instantiations of the tlex object are generated by gentlex (pronounced gen-t-lex).

Unlike the Unix 'lex' package, gentlex/tlex does not implement arbitrary regular expressions. However, because it is designed specifically for tokenizing streams for parser input, tlex does considerably more than 'lex' and even supports the most general recognition scheme: C code. The essence of the tlex approach is to determine what sort of token is coming by assembling characters until they constitute a prefix for a recognizable token. Then a recognizer function is called to determine the extent of the token. Builtin recognizers are provided for the sorts of tokens required by popular modern programming languages.

______________________________

Using tlex

For each tlex application, say 'sample', the programmer writes a .tlx file describing the tokens to be found. This 'sample.tlx' file is then processed by gentlex to create a sample.tlc file. This file is then included in the application:

	#include <sample.tlc>

Later in the application, a tlex object is declared and created as in

	struct tlex *tl;
	. . .
	tl = tlex_Create(&sample_tlex_tables, self, text, 0, text_GetLength(text));

where sample_tlex_tables is declared within sample.tlc, self is passed as a rock that is later available to recognizers, and text is an ATK text object containing the text to be parsed. The last two parameters can delimit the parse to a subsequence of the text. After this preparation, the tlex object, tl, can be passed as the lexan argument to parse_Create. See the appendix for a complete program example.

Gentlex takes two inputs--a file containing token class descriptions and another file containing declarations for YYNTOKENS and yytname as produced by running bison with the -k switch. The output file is typically named with the same prefix as the first input file and the extension .tlc:

	gentlex sample.tlx sample.tab.c

produces file sample.tlc.

-p -prefix

The declarations output by gentlex are static variables whose names begin with a prefix value followed by an underscore, as in sample_tlex_tables. If the first input file has the extension .tlx, the prefix will be the filename prior to the extension. A different prefix value may be specified by giving the -p switch on the command line. The output file is always named with the prefix value and the extension .tlc. For example

	gentlex -p sample a b

will read the .tlx information from file a and the .tab.c information from file b. The output will be generated to a file sample.tlc and the variables declared in that file will begin with `sample_'. If the -p switch is given, the file names need not be specified; they will be generated from the prefix value and the extensions .tlx and .tab.c. If the .tlx file is named, the .tab.c file need not be, as long as its name begins with the same stem name as the .tlx file.

-l

The .tlc output file will ordinarily contain #line lines to relate compile error messages back to the .tlx file. The -l switch will eliminate these #line lines.
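Putting the pieces of this section together, the skeleton of an application looks roughly like the following sketch, condensed from the complete example in the appendix. Here parse_description and reduceActions come from the bison/parse side of the application, the driver function name is hypothetical, and reserved-word setup and error handling are omitted; only the tlex-related lines are the point.

	#include <sample.tlc>		/* tables produced by gentlex */

	int parsesample(input)		/* hypothetical driver */
		struct text *input;
	{
		struct tlex *tl;
		struct parse *p;

		/* create the lexer over the whole text, with no rock */
		tl = tlex_Create(&sample_tlex_tables, NULL, input, 0, text_GetLength(input));
		/* hand it to the parser as the lexan argument */
		p = parse_Create(&parse_description, tl, reduceActions, NULL, NULL);
		return parse_Run(p);
	}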
______________________________

Overview of the structure of .tlx Files

The purpose of tlex is to examine the input stream and find tokens, reporting each to the parser by its token class number. Gentlex determines the token class numbers from the yytname list in the .tab.c file, which will have been generated by bison with the -n switch (and some other switch combinations).

A .tlx file is a sequence of tokenclass blocks, each responsible for describing how tlex can identify the tokens of one class (or a group of classes). The syntax is line oriented: the various allowed operations each occupy a single line. Comments begin with two dashes (--) and extend to the end of the line.

Here is a typical tokenclass block containing a description of identifier tokens--ones that start with an alphabetic and continue with an alphabetic, an underline, or a digit.

	tokenclass setID
		set [a-zA-Z]
		recognizer ScanID
		charset continueset [a-zA-Z_0-9]

The tokenclass line says that this block describes tokens the parser is to think of as satisfying the class setID (a name used in the grammar). The set line says that this recognizer is triggered by any alphabetic character. The recognizer line says to use the builtin recognizer called ScanID. One of the parameters to ScanID is a charset value called continueset. It is declared here to override the default value; the declaration says that an identifier continues with an alphabetic, a digit, or an underline.

Each tokenclass block begins with a header line containing 'tokenclass' and a representation of the class--either a name or a quoted string. Following the header are four types of lines, each of which is described in more detail later.

'set' or 'seq' - these lines describe the prefix of tokens that satisfy the tokenclass. seq specifies an initial sequence of characters, while set lists a set of characters.

'recognizer' - this line names a builtin recognizer which will be called to determine the remainder of the token.

struct element declaration - declares the type, name, and initial value for an element of a struct. For some builtin recognizers, fields of this struct provide information that more precisely control the recognizer. The struct is also passed to the function created from the following body.

function body - If a recognizer is named, the function created from this body is called after the recognizer has found the token. Otherwise, the function is called as soon as the prefix is recognized and the function finds the remainder of the token itself.

Suppose that identifiers beginning with x_ are to be treated as tokens of class setHEXCONSTANT. The above description could be augmented to describe this as follows:

	tokenclass setID
		set [a-zA-Z]
		recognizer ScanID
		charset continueset [a-zA-Z_0-9]
		tokennumber hextok setHEXCONSTANT
		{
			char *tok = tlex_GetTokenText(tlex);
			if (*tok == 'x' && *(tok+1) == '_')
				tlex_SetTokenNumber(tlex, parm->hextok);
			return tlex_ACCEPT;
		}

As earlier, this tokenclass block describes by default tokens of the class setID. They begin with a letter and continue as determined by the ScanID recognizer. As its last step, ScanID will call the function whose body is the bracketed lines. Two parameters will be passed to this function: tlex--a pointer to the current tlex object, and parm--a pointer to the struct constructed from the struct element declarations earlier in the block. The code in the function gets the token text from the tlex and checks for an initial 'x_'. If it is found, the token number in the tlex is changed to the token class for setHEXCONSTANT, utilizing the hextok value installed in the struct by the earlier struct element declaration for hextok.
Special recognition of 'x_' would have been easier, however, by writing two tokenclass rules:

	tokenclass setID
		set [a-zA-Z]
		recognizer ScanID
		charset continueset [a-zA-Z_0-9]

	tokenclass setHEXCONSTANT
		seq "x_"
		recognizer ScanID
		charset continueset [a-zA-Z_0-9]

The setID class would recognize all identifier tokens, even those beginning with x but followed by a character other than underline. The setHEXCONSTANT class would recognize tokens beginning with x_. In practice, of course, the continueset for hex constants might be [0-9a-f] instead of the value given, and a function body might be provided to compute the actual hexadecimal value.

______________________________

Tokenclass lines: Details

The value following the 'tokenclass' keyword is one of the token identifier values used in the bison description of the grammar. Typical examples are

	ELSE	setID	tokNULL	'+'	"<="

Token names beginning with "set" are assumed to describe classes and are not treated as reserved words. Token names beginning with "tok" are assumed to be reserved words consisting of the characters following the initial three. Multicharacter tokens delimited with double quotes may not be acceptable in all versions of Bison.

It is possible, but not necessary, to write tokenclass blocks for quoted characters and strings like '+' and "<=". Gentlex automatically generates tokenclass blocks for these sorts of tokens. A recognizer for whitespace is also generated automatically and suffices if the desired whitespace set is the set of characters satisfying isspace(). To override this automatic set, include a block for tokenclass -none- and specify for it the recognizer ScanWhitespace, as in

	tokenclass -none-
		set [ \n\t\f\v\r]
		recognizer ScanWhitespace
		charset continueset [ \n\t\f\v\r]

For the `action' type declaration described below, a disambiguating letter may be appended in parentheses after a tokenclass representation or the special tokenclass -none-; for example:

	tokenclass setNUMBER (b)

There are several reserved token class names: -none-, -global-, -reservedwords-, and -errorhandler-, as described in the following.

tokenclass -none-

The tokenclass -none- is used for whitespace and comments. It is assumed that the function body, if any, in the tokenclass block returns tlex_IGNORE, as is done by ScanWhitespace and ScanComment. If it instead returns tlex_ACCEPT, it should have reset the token number, because the default token number established by -none- terminates the input stream.

tokenclass -global-

The block for this tokenclass has no set, seq, or recognizer. Its sole function is to generate and initialize a struct, called PREFIX_global, where the PREFIX value is the stem of the .tlx file name or the value of the -p switch. Fields of the PREFIX_global struct can be accessed from the C code fragment associated with any of the tokenclasses. This can be used to create a single charset, tokennumber, or action value that can be referenced from multiple function bodies. If C code is specified for the -global- block, it is executed when tlex_Create is called for the PREFIX_tlex_tables value created by this file; thus it can further initialize any variables in the global struct.

tokenclass -reservedwords-

The default treatment of reserved words in gentlex is to ignore them. The id recognizer is expected to identify them by looking up the identifier in the symbol table. (Entries are put in the table with parse_EnumerateReservedWords; see parse.doc.) However, the reserved words can be recognized directly by specifying a block for the tokenclass -reservedwords-. When a reserved word is recognized, the function body in the tokenclass block is invoked.
tokenclass -errorhandler-

Tlex has a method tlex_Error which can be called by recognizers to indicate various problems, such as an 8 in an octal value. The default action of tlex_Error is to print the error message given as its argument; however, a .tlx file may specify a different action by providing an -errorhandler- block. This block must not include a recognizer line, but must include a function body. The function generated from the body is invoked for any error. The function body can access the proposed error message as parm->msg.
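For instance, a .tlx file might route all lexical errors through a block like the following sketch. The text above does not say what such a body should return or where its output should go; tlex_IGNORE and fprintf to stderr are used here only as plausible assumptions, tlex_GetTokPos is assumed to return a long as other positions do, and the position printed is that of the dummy token described under tlex_Error in the tools section below.

	tokenclass -errorhandler-
		{
			fprintf(stderr, "lexical error at position %ld: %s\n",
				tlex_GetTokPos(tlex), parm->msg);
			return tlex_IGNORE;	/* assumption: scanning simply continues */
		}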
______________________________

Set or seq lines, details

The set or seq line must appear for most token classes. It determines when this token class recognizer is initiated.

The argument on a seq line is a double-quote delimited string:

	seq "--"

The token class is activated whenever that sequence is found at the beginning of a token.

The argument on a set line is a sequence of characters within square brackets; e.g. [ \t\n\r]. Any character in the set will initiate the given tokenclass when it appears as the first character in a token. Backslash, dash, and right square bracket may appear in the sequence if preceded with backslash as an escape character. A consecutive subset of the collating sequence can be included by writing the two characters at the ends of the sequence separated with a dash: [a-zA-Z_#$%0-9] would be the set of all alphabetic characters, the digits, and underline, hash, percent, and dollar sign.

______________________________

Recognizer lines, details

The operand following the keyword 'recognizer' must be the name of one of the builtin recognizers. Each recognizer takes one or more operands which further describe the tokens the recognizer will accept. Most recognizers return TRUE to the central token recognizer to indicate that they have found a token. ScanWhitespace, ScanComment, and ScanError normally return FALSE to indicate that further scanning is required to actually find a token. The individual builtin recognizers are described in the following paragraphs.

ScanNumber

The first character of the token must be a dot, a digit, or a single quote character. Subsequent characters are scanned as long as they correspond to a C numeric constant:

	decimal integer - sequence of [0-9]
	octal integer - 0 followed by sequence of [0-7]
	hexadecimal integer - 0x followed by sequence of [0-9a-fA-F]
	quoted character - two single quotes surrounding a character or an escape sequence
	real value - an appropriate sequence of characters from [0-9\-+.eE]

ScanNumber communicates with the function body through these struct fields (the ness.tlx example below tests IsInt to see whether the token was an integer):

	boolean IsInt
	int intval
	double realval

ScanID

The continueset parameter may be specified to indicate what characters are allowed in an identifier after the first.

	charset continueset

ScanString

Parameters indicate the terminating character of the string, an escape character, and an illegal character. For C strings these would be ", \, and newline, respectively, and these are the default values.

	char *endseq
	char *escapechar
	char *badchar

ScanToken

The initial character is treated as the entire token. ScanToken can be used to specify the same tokenclass for two different characters. For instance, to map left braces to left parentheses we could write

	tokenclass '('
		seq "{"
		recognizer ScanToken

Since ScanToken is the default, it can be omitted.

ScanComment

Parameters indicate the terminating sequence for the comment. The recognizer returns FALSE so another token will be scanned for after recognizing the comment.

	char *endseq

The first character of the endseq value must not appear anywhere else in that value.

ScanWhitespace

The set of whitespace must be specified both in the set line and by writing a continueset parameter. The recognizer returns FALSE so another token will be scanned for after skipping the whitespace.

	charset continueset

ScanError

The set character or seq that initiates this token class is treated as an error. A msg parameter should be specified; if no body is specified, the message is passed to the errorhandler function (or printed if there is no error handler).

	char *msg

______________________________

Field lines, details

A field line has three elements--type, identifier, and value. The first two are single words; the form of the value depends on the type. The described field becomes one field of the struct passed to the function for this token class. The value is used to initialize the field in the instance of the struct created for this token class. For example, if the line is

	int basis 3

the generated struct declaration will have the form:

	struct tlex_Sym000 {
		...
		int basis;
		...
	} tlex_Sym001 = { ... 3, ... };

A limited set of types are allowed in a field line. This set includes these standard C types:

	int   long   float   double   char*

and the semi-standard type boolean, for which the value constants are TRUE and FALSE. Other types are charset, tokennumber, and action.

charset

The value portion is a character sequence in square brackets, just as for 'set' lines. Charset variables are used as the first argument to tlex_BITISSET. If v is a charset identifier and c is a character, the expression

	tlex_BITISSET(parm->v, c)

is TRUE if the value of c is one of the characters in the value of v.

tokennumber

The variable declared as a tokennumber is declared as int in C. The initialization expression is a token name, exactly as may appear as the operand to tokenclass. The C int is initialized to the appropriate token number for the given token name. This value is appropriate as the second argument to tlex_SetTokenNumber.

action

The return value from a C code portion must be one of two constants or must be a value created by an action type field element. The initialization for a variable of type action is the operand of a tokenclass line; that is, a token representation, possibly followed by a parenthesized letter.
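As an illustration of how these three types work together, here is a sketch of a function body that might accompany the hypothetical declarations 'charset octalset [0-7]', 'tokennumber octtok setOCTALCON', and 'action idact setID' (all of these names and token classes are made up for the example):

	{
		char *tok = tlex_GetTokenText(tlex);
		char *p;

		/* if every character is in the declared charset, report the token
		   under the class recorded in the tokennumber field; otherwise
		   return the action value so the text is rescanned as a setID token */
		for (p = tok; *p != '\0'; p++)
			if (! tlex_BITISSET(parm->octalset, *p))
				return parm->idact;
		tlex_SetTokenNumber(tlex, parm->octtok);
		return tlex_ACCEPT;
	}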
______________________________

Function bodies, details

If the { C-code } section is present, the code is called as in a function with two arguments: tlex and parm, where tlex is the current tlex object and parm points to a struct containing at least the fields described in the field description lines. The function is called as the recognizer if no recognizer is specified; otherwise it is called as a handler after the recognizer has assembled the token.

The function must return a value telling tlex what to do with the token assembled. This value may be tlex_IGNORE, tlex_ACCEPT, or a variable from a field element declared to have type action. tlex_IGNORE causes the tlex to ignore the assembled token and begin looking for another at the current position in the text. tlex_ACCEPT says to return the current tokennumber and tokenvalue to the parser. Any action type indicates that the token so far is to be treated as if it were the operand of 'seq' for the tokenclass named in defining the action value.

For example, a Fortran lexer could treat "do" specially. If it were followed by '5 i = 1, 10' the lexer would return the reserved word DO; but otherwise the lexer would return some variable, say parm->idact, where the variable is defined as in

	tokenclass DO
		seq "do"
		action idact setID
		{
			if ("do" is not the start of a DO stmt)
				return parm->idact;
		}

The tasks of a builtin recognizer

	initial condition:
		tokpos is first char
		first and prefix chars in tokenbuffer (with \0)
		currchar and currpos are char after the initial char
	set tokend
	may reset tokpos
	usually store chars in tokenbuffer
	may reset default value for tokennumber
	leave currchar at the character after the token
	set tokenvalue to NULL
	call appropriate handler (defined in .tlx file)
	return handler value as scanning value

The tasks of a handler

	initial condition:
		token is at tokpos...tokend
		characters may be in tokenbuffer (with \0)
		currchar and currpos are char after the final char
	set tokenvalue
	return new value for scanning (usually FALSE)
	may also do all the tasks of a recognizer

The tasks of a user defined recognizer

	initial condition:
		tokpos is first char
		first and prefix chars are in tokenbuffer (with trailing \0)
		currchar and currpos are char after the initial char or seq
	set tokpos and tokend
	may store chars in tokenbuffer
	may reset default value for tokennumber
	leave currchar at the character after the token
	may set tokenvalue
	return value for scanning (usually FALSE)

______________________________

Sample input: The ness.tlx file

Tokens in ness are much like those in C. Comments begin with -- and extend to newline; --$ begins a pragmat, i.e., a special comment processed by a pragmat parser; there is a long form of string constants which can include newlines; and brackets and braces are treated as parentheses. Note that the C code for numeric tokens converts the tokennumber from setINTCON to setREALCON when ScanNumber has detected a real value.

------------------------
-- comment:   -- ... \n
tokenclass -none-
	seq "--"
	recognizer ScanComment
	char *endseq "\n"

-- pragmat:   --$ ... \n
tokenclass -none-
	seq "--$"
	recognizer ScanComment
	char *endseq "\n"
	{
		printf("pragmat: %s", tlex_GetTokenText(tlex)+3);
		return tlex_IGNORE;
	}

-- identifier:   [a-zA-Z_] [a-zA-Z0-9_]*
tokenclass setID
	set [a-zA-Z_]
	recognizer ScanID
	charset continueset [a-zA-Z0-9_]
	{
		struct toksym *s;
		s = toksym_TFind(tlex_GetTokenText(tlex), grammarscope);
		if (s != NULL)
			tlex_SetTokenNumber(tlex, s->toknum);
		return tlex_ACCEPT;
	}

-- string:   " ... "   escape is \
tokenclass setSTRINGCON
	seq "\""
	recognizer ScanString

-- string:   // ... \n\\\\   no escape
tokenclass setSTRINGCON
	seq "//"
	{
		register int c;
		static char delim[4] = "\n\\\\";
		char *dx;

		c = tlex_CurrChar(tlex);	/* first char after the "//" */
		dx = delim;
		while (*dx && c != EOF) {
			if (*dx == c) dx++, c = tlex_NextChar(tlex);
			else if (dx == delim) c = tlex_NextChar(tlex);
			else dx = delim;
		}
		if (c != EOF) tlex_NextChar(tlex);
		tlex_EndToken(tlex);
		return tlex_ACCEPT;
	}

-- integers and real values
tokenclass setINTCON
	set [0-9'.]
	recognizer ScanNumber
	tokennumber realtok setREALCON
	{
		if ( ! parm->IsInt)
			tlex_SetTokenNumber(tlex, parm->realtok);
		/* add value to symbol table */
		return tlex_ACCEPT;
	}

-- [ and { map to (		] and } map to )
tokenclass '('
	set [{\[]
tokenclass ')'
	set [}\]]

______________________________

Sample input: The ness.tab.c file

A full .tab.c file as generated by bison is quite long, but gentlex only looks for certain features. First it must find somewhere a line defining YYNTOKENS with the form

#define YYNTOKENS 56

(where the # is immediately after a newline).
Subsequently it must find the token names, as in the example:

static const char * const yytname[] = { "$","error","$illegal.","OR","AND",
"NOT","'='","\"/=\"","'<'","'>'","\">=\"","\"<=\"","'+'","'-'","'*'","'/'","'%'",
"'~'","UNARYOP","setID","setSTRINGCON","setINTCON","setREALCON","MARKER","BOOLEAN",
"INTEGER","REAL","OBJECT","VOID","FUNCTION","END","ON","EXTEND","FORWARD","MOUSE",
"MENU","KEYS","EVENT","RETURN","WHILE","DO","IF","THEN","ELSE","ELIF","EXIT",
"GOTOELSE","tokTRUE","tokFALSE","tokNULL","';'","\":=\"","'('","')'","','","\"~:=\"",
"script","attributes","type","functype","eventstart","endtag","attrDecl","parmList",

The key identifying string is "yytname[] = {"; thereafter the tokens may be separated by arbitrary white space and one comma. Note that scanning terminates after reading YYNTOKENS token names, so the token list need not continue to a correct C declaration.

______________________________

Sample compiler using both tlex and the parse object

This object module, nessparse.c, depends on ness.tlx as given above and ness.y, a grammar for the ness language. The ness.y file is processed with

	bison -n ness.y

to produce the ness.act and ness.tab.c files. Then gentlex is invoked

	gentlex ness.tlx ness.tab.c

to generate the ness.tlc file #included in this module.

#include <text.ih>
#include <toksym.ih>
#include <parse.ih>
#include <lexan.ih>
#include <tlex.ih>

#include <ness.tab.c>		/* parse tables */
#include <parsedesc.h>		/* declare parse_description */
#include <parsepre.h>		/* begin function 'action' */
#include <ness.act>		/* body of function 'action' */
#include <parsepost.h>		/* end of function 'action' */

static toksym_scopeType grammarscope;
static struct toksym *proto;

#include <ness.tlc>

static void EnterReservedWord(rock, w, i)
	void *rock;
	char *w;
	int i;
{
	struct toksym *s;
	boolean new;
	s = toksym_TLocate(w, proto, grammarscope, &new);
	s->toknum = i;
}

int parsetext(input)
	struct text *input;
{
	struct parse *p;
	struct tlex *lexalyzer;

	proto = toksym_New();
	grammarscope = toksym_TNewScope(toksym_GLOBAL);
	lexalyzer = tlex_Create(&ness_tlex_tables, NULL, input, 0, text_GetLength(input));
	p = parse_Create(&parse_description, lexalyzer, reduceActions, NULL, NULL);
	parse_EnumerateReservedWords(p, EnterReservedWord, NULL);
	return parse_Run(p);	/* do all the work */
}

______________________________

Tools available in tlex

tlex_Create(struct tlex_tables *description, void *rock, struct text *text, long pos, long len) returns struct tlex *;
	/* the rock is available to any function passed this tlex. The text, pos, and len specify a portion of a text to be processed */

tlex_SetText(/* struct tlex *self, */ struct text *text, long pos, long len);
	/* sets the source text for the lexeme stream */

tlex_RecentPosition(/* struct tlex *self, */ int index, long *len) returns long;
	/* for token 'index', set len to length and return position. index = 0 is the most recent token; its predecessors are indexed with negative numbers: -1 -2 ... -tlex_RECENTSIZE+1 */

tlex_RecentIndent(/* struct tlex *self, */ int index) returns long;
	/* report the indentation of the 'index'th most recent token, where index is as for RecentPosition. A token preceded by anything other than white space is reported as having indentation 999. */
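As an illustration of these two calls, an application's error reporting might locate the most recent token with a helper like the following sketch (the helper itself, the use of fprintf/stderr, and the message format are hypothetical; <stdio.h> and <tlex.ih> are assumed to be included):

	static void ReportTokenPlace(tl)
		struct tlex *tl;
	{
		long len;
		long pos = tlex_RecentPosition(tl, 0, &len);	/* 0 = the most recent token */

		fprintf(stderr, "trouble near characters %ld..%ld (indentation %ld)\n",
			pos, pos + len - 1, tlex_RecentIndent(tl, 0));
	}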
tlex_Repeat(/* struct tlex *self, */ int index);
	/* backup and repeat tokens starting with the index'th most recent token, where index is as for RecentPosition */

tlex_Error(/* struct tlex *self, */ char *msg);
	/* a lexical error is reported by calling the error handler after setting up a dummy token for the error text. The msg is expected to be in static storage. */

The "rock" is an argument to tlex_Create. It is an arbitrary value that is accessible via the tlex object.

	tlex_GetRock()		returns the current rock value
	tlex_SetRock(void *r)	sets a new rock value

C code in tokenclass blocks can modify the values that will be returned to the parser by calling macro methods to adjust these attributes:

	Token number
	Token value (yylval)
	Current character and scan position in the source text
	Position and length of the source for the current token
	Token text generated to represent the token

The tlex_ operations to perform these operations are described in what follows.

/* the TokenNumber is the number to be reported to the parser. This is usually set by default based on the argument to the tokenclass line in the xxx.tlx file. It may be reset to a value created by a tokennumber line within a tokenclass block. */
tlex_SetTokenNumber(int n)
tlex_GetTokenNumber()

/* the TokenValue is the value for yylval. These values serve as the initial values in the value stack maintained by the parser in parallel with the state stack */
tlex_SetTokenValue(void *v)
tlex_GetTokenValue()

/* the current position in the input is CurrPos, where the character is as given by CurrChar. By convention each lexical analysis routine leaves CurrPos/CurrChar referring to the first character to be considered for the next token. NextChar moves CurrPos ahead one character, fetches the next character, and returns it. BackUp moves backward by n characters, resetting CurrPos/CurrChar (a negative n value is acceptable and moves the position forward). See also Advance, below, which combines NextChar with storing the prior character in the token text. */
tlex_CurrPos()
tlex_CurrChar()
tlex_NextChar()
tlex_BackUp(int n)

/* The position of the token text in the input source is recorded and is available via
	GetTokPos - the position of the first character
	GetTokEnd - the position of the last character
   StartToken records CurrPos as the position at which the token begins. EndToken records the token as ending one character before CurrPos. There is no harm in calling StartToken or EndToken more than once, although these functions also affect the token text buffer, as noted below. */
tlex_GetTokPos()
tlex_GetTokEnd()
tlex_StartToken()
tlex_EndToken()
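As an example of these character-level and position calls, a function body used without a recognizer line (see "Function bodies, details" above) could gather the rest of the current line into the token. This is only a sketch for a hypothetical tokenclass; it assumes that CurrChar, like NextChar, reports EOF once the end of the text is reached, and it relies on Advance, described just below.

	{
		int c = tlex_CurrChar(tlex);	/* first character after the set/seq prefix */

		while (c != '\n' && c != EOF) {
			tlex_Advance(tlex);	/* append c to the token text and move ahead */
			c = tlex_CurrChar(tlex);
		}
		tlex_EndToken(tlex);		/* token ends just before the newline (or at end of text) */
		return tlex_ACCEPT;
	}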
/* Some tokens are recorded by the lexer as a character string which can be retrieved by GetTokenText. In particular, when C code is called from a tokenclass block, the text is the sequence of characters from the source that caused this tokenclass to be activated. Saving of the token text can be controlled by setting the SaveText parameter. Its default value is TRUE for ScanID, and FALSE for ScanWhitespace, ScanComment, and ScanString. The text is always stored for ScanToken. A canonical form of the number is always stored for ScanNumber. If the text is stored for a comment or string, only the contents are stored--not the delimiters--and the TokPos/TokEnd are set to the contents only. (Normally TokPos/TokEnd includes the delimiters.) StartToken and EndToken (above) have the additional functionality, respectively, of clearing the token buffer and finishing it with a null character.

   GetTokenText returns a pointer to the token text string. PrevTokenText returns a pointer to the text of the previous token. ClearTokenText clears the text to an empty string. AppendToTokenText appends a character to the text. TruncateTokenText removes n characters from its end. Advance appends the current character to the token text and then calls NextChar. */
tlex_GetTokenText()
tlex_PrevTokenText()
tlex_ClearTokenText()
tlex_AppendToTokenText(int c)
tlex_TruncateTokenText(int n)
tlex_Advance()

Copyright 1992 Carnegie Mellon University. All Rights Reserved.

$Disclaimer:
# Permission to use, copy, modify, and distribute this software and its
# documentation for any purpose is hereby granted without fee,
# provided that the above copyright notice appear in all copies and that
# both that copyright notice, this permission notice, and the following
# disclaimer appear in supporting documentation, and that the names of
# IBM, Carnegie Mellon University, and other copyright holders, not be
# used in advertising or publicity pertaining to distribution of the software
# without specific, written prior permission.
#
# IBM, CARNEGIE MELLON UNIVERSITY, AND THE OTHER COPYRIGHT HOLDERS
# DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
# ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT
# SHALL IBM, CARNEGIE MELLON UNIVERSITY, OR ANY OTHER COPYRIGHT HOLDER
# BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY
# DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
# WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
# ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
# OF THIS SOFTWARE.
# $