parser.doc - C parser for Bison tables {This file is similar to
parserclass.doc in the AUIS C++ sources.}
May, 1994
parser - object for parsing
WJHansen, Andrew Consortium
A parser object represents a grammar and the state of a parse according to
that grammar. After parsing one text, the object can be reused to parse
another. Unlike yacc and other systems, the grammar is represented by
uniquely named tables, so there are no name conflicts and multiple parsers
are possible. Grammar tables are generated with a version of the Bison
package from the Free Software Foundation. An awk script removes the
Bison parser, so the resulting parser and application are not tainted
with FSF's General Public License.
In the descriptions below, it is assumed that the grammar is gggg and is
described in file gggg.y. (The grammar description is upward compatible
from that of yacc, but there are differences in the treatment of errors.)
Overview
The gggg.y file is processed through Bison to produce the gggg.tab.c
file, which is then processed through the 'mkparser' script to produce
gggg.c. This file defines the function
gggg_New()
which the application code calls to allocate a parser object for gggg.
This is then passed as an argument to parser_Parse to parse a lexeme
stream.
The .y files for parser are not completely compatible with yacc and Bison.
The AUIS version of Bison supports an upward-compatible extension to the
grammar language. Grammars would be compatible if they avoided this
extension. However, error handling is different in how errors are
reported and how error conditions are signaled from action routines.
These differences require some conversion. Another difference is that
the AUIS Bison supports the -k switch which is needed to support token
translation.
Grammar (.y file)
The AUIS version of Bison supports one additional token type:
multi-character tokens. These are written in the grammar surrounded
by quotation marks, as in "<=". Thus one rule of the grammar might be
expression : expression "<=" factor ;
In other words it is not necessary to define LE as a token and then teach
the token analyzer that a less-than followed by an equal-sign is the
token LE. (AUIS's 'tlex' token analyzer determines the token list from
the tables generated by Bison.)
Semantic action routines {specified in braces in the grammar} may refer
to value stack locations with $$, $1, $2, and so on, as in yacc/Bison. In
addition, the variable 'parser' points to the parser object for the parse
in progress. One use for this is to access the associated 'rock' value:
struct whatever *info = (struct whatever *)parser_GetRock(parser);
The result of the compilation can be passed back to the application program
by storing it into a component of the rock. If the value is in $1 and
the target field of the rock is 'value', then the code is
info->value = $1;
Note that no rock value will exist unless one has been stored. For instance,
struct whatever parserinfo;
...
parser_SetRock(gparser, &parserinfo);
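The rock round trip described above can be sketched as follows. This is an illustration only: the struct parser layout and the two rock functions are minimal stand-ins written for the sketch (the real ones come from cparser.h), and the fields of struct whatever, store_result, and demo_rock are hypothetical names.

```c
#include <stddef.h>

/* Stand-ins for illustration only; the real definitions come from
   cparser.h and are not reproduced here. */
struct parser { void *rock; };
void parser_SetRock(struct parser *self, void *r) { self->rock = r; }
void *parser_GetRock(struct parser *self) { return self->rock; }

/* Hypothetical application data passed through the rock. */
struct whatever {
    int value;      /* result of the compilation */
    int have_value; /* set once an action has stored a result */
};

/* What a semantic action for the top-level rule might do with $1. */
void store_result(struct parser *parser, int dollar1)
{
    struct whatever *info = (struct whatever *)parser_GetRock(parser);
    if (info == NULL)
        return;             /* no rock was stored; nothing to fill in */
    info->value = dollar1;
    info->have_value = 1;
}

/* Application side: attach the rock, run the "action", read the result. */
int demo_rock(void)
{
    struct parser p = { NULL };
    struct whatever parserinfo = { 0, 0 };

    parser_SetRock(&p, &parserinfo);
    store_result(&p, 42);
    return parserinfo.have_value ? parserinfo.value : -1;
}
```

The NULL check reflects the note above: an action that runs before any rock has been stored has nothing to write into.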
Compilation errors are reported by calling parser_Error. Its default
action is to call parser_ErrorGuts, which prints an error message. If
an application wishes some other error action, it should override
ErrorGuts. To do so, it must include at least a declaration for
ErrorGuts in the .y file; this is the signal for mkparser to include
the appropriate declaration in gggg.h. The declaration in the .y file
should be
void gggg_ErrorGuts(int severity, char *severityname, char *msg);
This function will be called by parser_Error whenever an error is detected.
At that time, (severity&~parser_FREEMSG) will be one of the values
parser_WARNING, parser_SERIOUS, parser_SYNTAX, or parser_FATAL, as
defined in cparser.h and severityname will be the corresponding
string. The parameter msg will be a character string; if the severity
value is or'ed with parser_FREEMSG, ErrorGuts must free the character
string. Applications should call Error instead of ErrorGuts because the
former computes the maximum severity and counts the number of errors.
In a context where there is no pointer to the parser object for the current
compilation, it can be retrieved via parser_GetCurrentparser().
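As a rough sketch of an ErrorGuts override, the function below appends each message to an in-memory log instead of printing it, and honors the parser_FREEMSG ownership flag. The numeric values given to the parser_ constants are placeholders chosen for the sketch (the real values are in cparser.h), and errlog and nfatal are hypothetical application variables.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder values for illustration; the real constants are defined
   in cparser.h and may differ. */
#define parser_WARNING 1
#define parser_SERIOUS 2
#define parser_SYNTAX  3
#define parser_FATAL   4
#define parser_FREEMSG 0x100

/* Hypothetical log kept by the application instead of printing. */
char errlog[1024];
int  nfatal;

void gggg_ErrorGuts(int severity, char *severityname, char *msg)
{
    int level = severity & ~parser_FREEMSG;   /* strip the ownership flag */
    size_t used = strlen(errlog);

    /* Append "severityname: msg" to the log, truncating if it is full. */
    snprintf(errlog + used, sizeof errlog - used, "%s: %s\n",
             severityname, msg);
    if (level == parser_FATAL)
        nfatal++;

    /* The caller hands over ownership of msg when FREEMSG is set. */
    if (severity & parser_FREEMSG)
        free(msg);
}
```

Application code would still report errors through parser_Error, so that the error count and maximum severity stay correct.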
In yacc and Bison, special action macros are available to control
parse termination and error processing: yyclearin, yyerrok, YYACCEPT,
and YYERROR. However, since the semantic actions are in a function
rather than embedded in the parser itself, these macros are no longer
appropriate. Instead, a semantic action routine can terminate with one
of the following macros:
    none             - parsing continues normally
    parser_ACCEPT    - parsing terminates and succeeds
    parser_ABORT     - parsing terminates and fails
    parser_ERROR     - syntax error; parser enters error state
    parser_CLEARIN   - pending input token is discarded
    parser_CLINERR   - pending input token is discarded and
                       parser enters error state
    parser_ERROROK   - parser leaves error state and continues
    parser_CLINEROK  - pending input token is discarded and
                       parser leaves error state
In yacc/Bison, YYSTYPE defaults to int. Mkparser removes this default;
the .y file must define a type named YYSTYPE in one of these ways:
    include a %union section in the grammar header in gggg.y, or
    #define YYSTYPE in gggg.y or a file it #includes, or
    declare YYSTYPE with a typedef in gggg.y or a file it #includes.
Application Code
In general, an application creates a parser object for a given grammar
by calling gggg_New:
struct parser *gparser = gggg_New();
The parse itself is done by calling parser_Parse with this object as the
first argument and a lexeme stream as two more arguments:
parser_Parse(gparser, lexer, lexrock);
A complete program might look like this:
    . . .
    #include <gggg.h>    /* include header file created by Bison
                            and mkparser from gggg.y */
    . . .
    struct parser *gparser = gggg_New();
    . . .    /* modify gparser object. For instance: */
    . . .    parser_SetRock(gparser, xxxxx);
    /* now do the parse */
    if (parser_Parse(gparser, lexer, lexer_rock) == parser_OK) {
        /* action for successful parse */
    }
    else {
        /* action for failed parse */
    }
The file gggg.h is generated by mkparser and declares gggg_New().
Tokens are acquired by the parser by calling the lexer provided as the
second argument. The third argument, lexer_rock, is supplied as one of
the arguments to the lexer. The full type expected of the lexer function
is
int lexer(void *lexrock, void *yylval);
Lexer routines can copy semantic values into *yylval, which will have
space for a value of type YYSTYPE. Note that, if the value is a pointer
to an object, the pointer should be stored in *yylval and not in yylval
(which will disappear as the function returns).
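A lexer with the signature given above might be sketched like this. The YYSTYPE union, the token numbers, and struct lexstate are all hypothetical; in a real build YYSTYPE comes from gggg.y and the token numbers from the Bison tables. Treating 0 as the end-of-input token is likewise an assumption of the sketch.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical value type and token numbers, for illustration only. */
typedef union { long number; char *name; } YYSTYPE;
enum { TOK_EOF = 0, TOK_NUMBER = 300, TOK_OTHER = 301 };

/* Hypothetical lexer rock: the text being scanned and a cursor. */
struct lexstate { const char *text; size_t pos; };

int lexer(void *lexrock, void *yylval)
{
    struct lexstate *ls = (struct lexstate *)lexrock;
    YYSTYPE *val = (YYSTYPE *)yylval;

    /* Skip white space before the next token. */
    while (isspace((unsigned char)ls->text[ls->pos]))
        ls->pos++;

    if (isdigit((unsigned char)ls->text[ls->pos])) {
        char *end;
        /* Store the value through the pointer: *yylval, not yylval. */
        val->number = strtol(ls->text + ls->pos, &end, 10);
        ls->pos = (size_t)(end - ls->text);
        return TOK_NUMBER;
    }
    if (ls->text[ls->pos] == '\0')
        return TOK_EOF;

    /* Anything else is consumed as a single catch-all token. */
    ls->pos++;
    return TOK_OTHER;
}
```

All scanning state lives in the rock, so the same function can serve several parses at once.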
The lexer must return Bison token numbers rather than yacc numbers;
yacc uses the first 256 values to indicate distinct ASCII characters,
but Bison does not. In 'tlex', the Bison token numbers are acquired
from the gggg.tab.c file generated by Bison; other lexers can generate
yacc token numbers and translate them with parser_TranslateTokenNumber:
if the yacc token number is t, the Bison token number is
parser_TranslateTokenNumber(gparser, t)
Between the gggg_New() and the call to parser_Parse, the application
can apply other functions to the object such as
parser_EnumerateReservedWords to enter reserved words in a symbol table,
parser_SetKillVal to handle error cleanup, or parser_SetRock to store a
pointer for use by semantic action routines.
The parser returns the maximum severity from among the severity values
passed to parser_Error. These values are
    parser_OK        no error
    parser_WARNING   there was some minor problem
    parser_SERIOUS   compilation aborted, but scan continued
    parser_SYNTAX    same as SERIOUS, but due to a syntax error
    parser_FATAL     compilation could not continue
When a syntax error occurs, the value stack is popped without calling
semantic routines. This can remove pointers to allocated memory which
ought to be freed. To allow the application to deal with this, the
application can specify a 'killval' function which will be called for each
value that is discarded from the stack without calling a semantic routine.
See the function parser_SetKillVal.
Parse-time stacks
A parser object has two stacks which are initially allocated at 500
elements, but grow as needed. Use left recursion in grammars to
avoid requiring great stack depth. Note that stack depth reflects the
amount of information a reader of the program must retain to interpret
it, so a grammar requiring a large stack is too complex.
The value stack contains copies of objects as returned by the lexer and set
in the action routines. If these objects contain pointers to "pointee"
objects, the client is responsible for the memory occupied by the
pointees. If the parser terminates early for a syntax error or ABORT,
these values can be deleted by supplying a KillVal function.
    parser_SetKillVal(gparser, f)
will establish f as the killval function. After a syntax error and before
discarding the stack, this function is called for each value on the
stack. The killval function is also called as states are popped for
error recovery. The call is
    (killvalfunction)(parseobject, value-pointer-from-stack)
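A killval function in the shape of the call shown above might look like the sketch below. The YYSTYPE layout, the void return type, and the parameter types are assumptions of the sketch, not taken from cparser.h; discard_stack is a stand-in that mimics the parser popping values, so the example can be exercised without the library.

```c
#include <stdlib.h>

struct parser;                      /* opaque here; defined by cparser.h */

/* Hypothetical value type: either a number or a heap-allocated name. */
typedef struct { int is_name; long number; char *name; } YYSTYPE;

int names_freed;                    /* bookkeeping for the demonstration */

/* Killval function: release whatever the discarded value owns. */
void free_value(struct parser *self, void *value)
{
    YYSTYPE *v = (YYSTYPE *)value;
    (void)self;                     /* parser object not needed here */
    if (v->is_name && v->name != NULL) {
        free(v->name);
        v->name = NULL;
        names_freed++;
    }
}

/* Stand-in for what the parser does when it discards stack entries. */
int discard_stack(YYSTYPE *stack, int depth)
{
    int i;
    for (i = depth - 1; i >= 0; i--)
        free_value(NULL, &stack[i]);
    return names_freed;
}
```

With the real library, the function would be registered once with parser_SetKillVal(gparser, free_value) before calling parser_Parse.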
Mkparser
The mkparser script is invoked to produce gggg.c and gggg.h
from the gggg.tab.c file generated by Bison. It is possible to have
Bison generate the additional file gggg.tab.h by specifying the -d
switch. (For tlex, the Andrew version of Bison must be used and the
-r switch specified in addition to -k.)
At minimum, mkparser has one argument, the prefix of the file names:
mkparser gggg
where the input files are gggg.tab.c and gggg.tab.h (if any). The output
will be the two files gggg.c and gggg.h.
Mkparser may have one or two additional arguments, the name of
the Bison output .c file and the name of the .h file. If specified,
these files are used instead of the files named by concatenating the
given prefix to .tab.c and .tab.h. In any case, the prefix is used to
generate the name gggg_New().
Compilation
In the Imakefile, the grammar is processed with a rule like:
gggg.c gggg.h: gggg.y
ExecuteFromDESTDIR(bison -k gggg.y)
ExecuteFromDESTDIR(mkparser gggg)
(The -k switch in the AUIS version of bison causes output
of a few additional declarations.) The .c file resulting from mkparser
is compiled as a normal .c file and linked together with other source
files for the application.
Bison's -l switch should NOT be used; mkparser depends on the
#line directives in the file. If necessary, these can be removed from
gggg.c with sed:
sed '/^#line/d' gggg.c > ,gggg.c; mv ,gggg.c gggg.c
Linking
The application must be linked with cparser.o or a library containing it.
cparser.o is the result of compiling cparser.c, from the same source
directory as mkparser. In AUIS, cparser.o is installed in
$ANDREWDIR/lib/atk/libcparser.a.
Functions provided by parser object:
int
parser_Parse(struct parser *self, parser_lexerfptr, void *lexerrock)
Causes the parser to run to completion using
the lexeme stream supplied as the second and third arguments.
Returns one of the severity values, indicating
the highest severity error encountered.
int
parser_ParseNumber(char *buf, long *plen, long *intval, double *dblval)
Parses a number from buf and sets *plen to its length in buf.
If intval is non-null, *intval is set to the number's integer value.
Similarly, if dblval is non-null, *dblval is set to the number's
value as a double.
Returns 1 if syntactically an integer, 2 for a double,
and 0 for a syntax error.
An integer is
a zero followed by a string of octal digits,
a non-zero digit followed by decimal digits,
0x followed by a string of hexadecimal digits, or
a character within apostrophes, possibly \-escaped.
A real is of the form
[ddd][.][ddd][Epddd]
where
[...] indicates an optional part except that the
complete number must have either . or Epddd
ddd is a digit sequence (one or more digits)
p (sign) may be empty or + or -
E (exponent indicator) may be 'e' or 'E'
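The number syntax above can be illustrated with a small classifier. This is a sketch written for this document, not the cparser implementation: classify_number is a hypothetical name, it does not compute the value, it does not handle character constants, it does not check that digits after a leading zero are octal, and it assumes that '-' is accepted as an exponent sign.

```c
#include <ctype.h>

/* Hypothetical helper: return 1 if the text at buf is an integer,
   2 if it is a real, 0 otherwise; *plen receives the number of
   characters that belong to the number. */
int classify_number(const char *buf, long *plen)
{
    const char *p = buf;
    int digits = 0, is_real = 0;

    if (p[0] == '0' && (p[1] == 'x' || p[1] == 'X')) {
        /* 0x followed by hexadecimal digits */
        p += 2;
        while (isxdigit((unsigned char)*p)) { p++; digits++; }
        *plen = (long)(p - buf);
        return digits ? 1 : 0;
    }
    while (isdigit((unsigned char)*p)) { p++; digits++; }   /* [ddd] */
    if (*p == '.') {                                        /* [.][ddd] */
        is_real = 1;
        p++;
        while (isdigit((unsigned char)*p)) { p++; digits++; }
    }
    if (digits && (*p == 'e' || *p == 'E')) {               /* [Epddd] */
        const char *q = p + 1;
        if (*q == '+' || *q == '-')
            q++;
        if (isdigit((unsigned char)*q)) {
            /* Accept the exponent only if at least one digit follows. */
            while (isdigit((unsigned char)*q))
                q++;
            p = q;
            is_real = 1;
        }
    }
    *plen = (long)(p - buf);
    if (!digits)
        return 0;
    return is_real ? 2 : 1;
}
```

An incomplete exponent such as the "e" in "1e" is left unconsumed, so that input is classified as the integer "1" with length 1.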
int
parser_TransEscape(char *buf, int *plen)
buf holds a character sequence (at least three chars)
that occurred after a backslash in a string.
The translation is returned as an int.
The number of characters used is returned in *plen.
(plen may be NULL)
The translations are a superset of C:
    escape seq       : translation
    ---------------  : -----------
    \\ \' \" \b \t   : as in C
    \n \v \f \r      : as in C
    \ddd             : octal digits, as in C
    \?               : \177 (DEL)
    \e               : \033 (ESC, ctl-[)
    \^@              : \000 (NUL)
    \^a ... \^z      : \001 ... \032 (ctl-a ... ctl-z)
    \^[ \^\ \^]      : \033 \034 \035
    \^^ \^_          : \036 \037
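The caret rows of the table follow a regular pattern, which the sketch below makes explicit. control_code is a hypothetical helper, not part of cparser; it assumes an ASCII character set, and it rejects uppercase letters because the table lists only lowercase ones.

```c
/* Hypothetical helper: translate the character that follows "\^" into
   the control code listed in the table; returns -1 if the character
   is not in the table.  Assumes ASCII. */
int control_code(int c)
{
    if (c == '@')
        return 0;                   /* \^@ : NUL */
    if (c >= 'a' && c <= 'z')
        return c - 'a' + 1;         /* \^a ... \^z : 001 ... 032 */
    if (c >= '[' && c <= '_')
        return c - '[' + 033;       /* \^[ \^\ \^] \^^ \^_ : 033 ... 037 */
    return -1;
}
```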
void
parser_Error(struct parser *self, int severity, char *msg)
Call this function to report an error. It counts the number of
errors, records the maximum severity, and then calls
ErrorGuts for disposition.
void
parser_EnumerateReservedWords(struct parser *self,
parser_enumresfptr handler, void *rock)
The handler is called for each alphabetic reserved word:
handler(rock, char *word, int tokennumber)
It is not called for names beginning with "set"; for names beginning
with "tok", only the rest of the name is passed.
Uppercase letters are converted to lower, and vice versa.
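A handler of the shape handler(rock, word, tokennumber) could enter the reserved words into a small table, as sketched below. struct wordtable, enter_word, and lookup_word are hypothetical names, the void return type of the handler is an assumption, and the sketch assumes the word strings remain valid for as long as the table is used.

```c
#include <string.h>

#define MAXWORDS 64

/* Hypothetical symbol table filled by the handler. */
struct wordtable {
    int count;
    const char *word[MAXWORDS];
    int token[MAXWORDS];
};

/* Handler in the shape handler(rock, word, tokennumber). */
void enter_word(void *rock, char *word, int tokennumber)
{
    struct wordtable *t = (struct wordtable *)rock;
    if (t->count < MAXWORDS) {
        t->word[t->count] = word;   /* assumes word outlives the table */
        t->token[t->count] = tokennumber;
        t->count++;
    }
}

/* Look up a scanned identifier; returns its token number or 0. */
int lookup_word(const struct wordtable *t, const char *name)
{
    int i;
    for (i = 0; i < t->count; i++)
        if (strcmp(t->word[i], name) == 0)
            return t->token[i];
    return 0;
}
```

With the real library, the table would be passed as the rock argument of parser_EnumerateReservedWords, and a lexer could call lookup_word to tell reserved words from ordinary identifiers.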
int
parser_TokenNumberFromName(struct parser *self, char *name)
Returns the token number corresponding to the string.
Typical strings:
"function", "setID", "tokNULL", "'a'", "\":=\""
(Note the quotes around special character tokens.)
If the name is not found, returns 0.
char
parser_TranslateTokenNumber(struct parser *self, int x)
Bison numbers tokens differently than yacc; in particular, the
first 256 do not correspond to the ASCII characters. This function
converts a yacc token number, x, into the token number required
by Bison.
void
parser_SetRock(struct parser *self, void *r)
Sets the 'rock' value associated with the parser. This value is
then available in any context--lexical analysis, semantic action
routine, or other--which has a pointer to the parser object.
void *
parser_GetRock(struct parser *self)
Returns the 'rock' value.
void
parser_SetKillVal(struct parser *self, parser_killfptr kv)
This function sets the killval function to kv. The latter
is called when value stack items are popped for errors. See above.
parser_killfptr
parser_GetKillVal(struct parser *self)
Returns the killval function.
struct parser *parser_GetCurrentparser()
During any call to parser_Parse, this function returns the
current parser object. This can be supplied as the object for
parser_Error.
int parser_SetDebug(int value)
Sets the debug flag to the given value; value must be 0 or 1.
Returns the prior value.
int parser_GetErrorState(struct parser *self)
The error-state is an integer indicating how many tokens must be
successfully parsed before resuming correct parsing. Usually
this value is zero; when a syntax error is detected, the value
is set to three. To change errorstate, an action concludes with
parser_ERROR to indicate an error or parser_ERROROK to reset the
error state to zero.
void parser_SetMaxSeverity(struct parser *self, int s)
Sets the value remembered as the maximum severity encountered.
It is preferable to do so by calling parser_Error.
int parser_GetMaxSeverity(struct parser *self)
Returns the current maximum severity value.
void parser_SetNErrors(struct parser *self, int n)
Allows the application to set the number of errors encountered.
It is usually incorrect to call this function.
int parser_GetNErrors(struct parser *self)
Returns the number of errors that have been encountered in the
current compilation.
char **parser_GetTokenNames(struct parser *self)
Returns a pointer to an array of all token names in order by
token number.
short parser_GetNTokens(struct parser *self)
Returns the number of tokens in the grammar.
Copyright 1992, 1994 Carnegie Mellon University. All rights Reserved.
$Disclaimer:
# Permission to use, copy, modify, and distribute this software and its
# documentation for any purpose is hereby granted without fee,
# provided that the above copyright notice appear in all copies and that
# both that copyright notice, this permission notice, and the following
# disclaimer appear in supporting documentation, and that the names of
# IBM, Carnegie Mellon University, and other copyright holders, not be
# used in advertising or publicity pertaining to distribution of the software
# without specific, written prior permission.
#
# IBM, CARNEGIE MELLON UNIVERSITY, AND THE OTHER COPYRIGHT HOLDERS
# DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
# ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT
# SHALL IBM, CARNEGIE MELLON UNIVERSITY, OR ANY OTHER COPYRIGHT HOLDER
# BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY
# DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
# WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
# ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
# OF THIS SOFTWARE.
# $