\begindata{text,8588536}
\textdsversion{12}
\template{default}
\flushright{22 June, 1995}
\chapter{cparser / mkparser
C parse objects}
\leftindent{\leftindent{Wilfred J. Hansen, Andrew Consortium}}
(The C++ version of this document is described in ./prsclass.doc.)
A parser object represents a grammar and the state of a parse according
to
that grammar. After parsing one text, the object can be reused to parse
another. Unlike yacc and other systems, the grammar is represented by
uniquely named tables, so there are no name conflicts and multiple
parsers
are possible. Grammar tables are generated with a version of the Bison
package from the Free Software Foundation. An awk script removes the
Bison parser, so the resulting parser and application are not tainted with
FSF's General Public License. (Nor is Bison's output tainted any longer; as
of June, 1995, FSF removed the GPL restriction from the Bison parser.) \
Andrew provides an enhanced version of Bison which is upward compatible
from Bison and yacc. It supports "multi-character tokens" and a few new
switches, including -k, which is required for mkparser.
In the descriptions below, it is assumed that the grammar is gggg and is
described in file gggg.y. Do not use the string 'bison' as part of the
grammar's filename; for instance, do not use gggg.bison as a file name.
\
For an overview of the parse and compilation tools of Andrew, see \
help parse
\section{Overview}
The gggg.y file is processed through Bison to produce the gggg.tab.c
file,
which is then processed with the 'mkparser' script to produce gggg.c.
This
file defines the function
gggg_New()
which the application code calls to allocate a parser object for gggg.
This is then passed as an argument to parser_Parse to parse a lexeme
stream. \
\section{Grammar (.y file)}
The AUIS version of Bison supports one additional token type:
multi-character tokens. These are written in the grammar surrounded by
quotation marks, as in "<=". Thus one rule of the grammar might be
expression : expression "<=" factor ; \
In other words it is not necessary to define LE as a token and then teach
the token analyzer that a less-than followed by an equal-sign is the
token
LE. (AUIS's 'tlex' token analyzer determines the token list from the
tables generated by Bison.) \
Semantic action routines \{specified in braces in the grammar\} may refer
to value stack locations with $$, $1, $2, and so on, as in yacc/Bison.
In
addition, the variable 'parser' points to the parser object for the parse
in progress. One use for this is to access the associated 'rock' value:
\
struct whatever *info \
= (struct whatever *)parser_GetRock(parser); \
The parser reports compilation errors by calling parser_Error. Grammar
routines and other application functions can also call this function to
report errors. By default, parser_Error calls a default ErrorGuts routine,
which prints an error message. If an application wishes some other error
action, it should override ErrorGuts. To do so, it must include at least a
declaration for ErrorGuts in the .y file; this is the signal for
mkparser to include the appropriate declaration in gggg.h. The
declaration in the gggg.y file should be
void gggg_ErrorGuts(/* struct parser *self;
char *severityname;
int severity;
char *msg */);
When ErrorGuts is called, (severity&~parser_FREEMSG) will have one of the
values parser_WARNING, parser_SERIOUS, parser_SYNTAX, or parser_FATAL, as
defined in cparser.h; severityname will be the corresponding string.
The parameter msg will be a character string; if (severity&parser_FREEMSG)
is non-zero, ErrorGuts must free the character string. Applications should
call parser_Error instead of ErrorGuts because the former computes the
maximum severity and counts the number of errors. \
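The ownership rule for msg can be sketched as follows. This is only an illustration: the numeric values of the parser_ constants are stand-ins (the real ones are defined in cparser.h), and the last_report buffer is a hypothetical addition that lets the formatted message be inspected.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in values for this sketch; the real constants live in cparser.h. */
#define parser_WARNING  1
#define parser_SERIOUS  2
#define parser_SYNTAX   3
#define parser_FATAL    4
#define parser_FREEMSG  0x100

struct parser;                 /* opaque here; defined by cparser.h */

/* Hypothetical buffer holding the most recent formatted report. */
static char last_report[256];

void gggg_ErrorGuts(struct parser *self, char *severityname,
                    int severity, char *msg)
{
    /* Strip the ownership flag to recover the plain severity level. */
    int level = severity & ~parser_FREEMSG;

    (void)self;                /* unused in this sketch */
    snprintf(last_report, sizeof last_report, "[%s/%d] %s",
             severityname, level, msg);
    fprintf(stderr, "%s\n", last_report);

    /* The message belongs to ErrorGuts only when FREEMSG is set. */
    if (severity & parser_FREEMSG)
        free(msg);
}
```

Masking before comparing keeps the severity test independent of whether the caller allocated the message.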
In a context where there is no pointer to the parser object for the
current
compilation, it can be retrieved via parser_GetCurrentparser(). For
instance, here is a possible definition of yyerror: \
static void
yyerror(char *msg) \{
parser_Error(parser_GetCurrentparser(),
parser_WARNING, msg); \
\} \
In yacc and Bison, grammar rule actions may contain special macros to
control parse termination and error processing: yyclearin, yyerrok,
YYACCEPT, YYERROR, YYABORT. These are also supported by cparser. \
\leftindent{yyerrok - When the current action is completed, the error
state
will be cleared and normal parsing will resume.
yyclearin - When the current action is completed, the pending input token
is discarded, so a new one will be fetched before parsing proceeds.
YYACCEPT - The current action terminates immediately and the entire parse
also terminates, indicating success to the caller.
YYABORT - The current action terminates immediately and the entire parse
also terminates, indicating failure to the caller.
YYERROR - The current action terminates immediately and the current
reduction is treated as an error. The parser enters the error state and
continues scanning input until yyerrok is called by a rule or a rule
containing 'error' as a token is reduced. The parser ignores any yyerror
calls in an action if the action terminates with YYERROR.}
In yacc/Bison, YYSTYPE defaults to int. Mkparser removes this default;
the .y file must have a type named YYSTYPE established by one of these
means: \
Include a %union section in the grammar header in gggg.y
#define YYSTYPE in gggg.y or a file it #includes
Typedef YYSTYPE in gggg.y or a file it #includes
If a %union \{ ... \} appears in the grammar, an appropriate declaration
for YYSTYPE will appear in gggg.c. If the -d switch is given to Bison,
the
same declaration will also appear in gggg.h. \
\section{Application Code}
In general, an application creates a parser object for a given grammar by
calling gggg_New: \
struct parser *gparser = gggg_New(); \
The parse itself is done by calling parser_Parse with this object as the
first argument and a lexeme stream as two more arguments: \
parser_Parse(gparser, lexer, lexrock); \
A complete program might look like this:
. . . \
#include <gggg.h> /* include header file created by Bison \
and mkparser from gggg.y */
. . . \
struct parser *gparser = gggg_New(); \
. . . /* modify gparser object.
For instance: */
. . . parser_SetRock(gparser, xxxxx); \
/* now do the parse */
if (parser_Parse(gparser, lexer, lexer_rock) == parser_OK) \{
/* action for successful parse */
\}
else \{
/* action for failed parse */
\}
The file gggg.h is generated by mkparser and declares gggg_New().
If the
-d switch was specified to Bison, the definitions it produces are
incorporated into gggg.h; typically these are #defines for the token
numbers of the various terminal symbols. (For tlex, the -r switch must
also be used. This switch is currently only in the Andrew version of
Bison.) \
While parsing, the parser fetches each successive token by calling the
lexer provided as the second argument. The third argument, lexer_rock,
is
supplied as one of the arguments to the lexer. The full type expected of
the lexer function is
int lexer(void *lexrock, void *yylval)
A lexer routine can copy the semantic value of a token into *yylval,
which will have space for a value of type YYSTYPE. Note that, if the value
is a pointer to an object, the pointer should be stored in *yylval and not
in yylval (which will disappear as the function returns). Suppose YYSTYPE
is specified with %union:
%union \{int i; struct hunk *hunkptr; struct hunk v\}
Then a token with an integer semantic would store it with
yylval->i = integer_value;
and a pointer currently in ((struct hunk *) h) would be stored as
yylval->hunkptr = h;
An actual hunk value, hv, could be copied into yylval with
yylval->v = hv;
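A minimal lexer with the expected signature might look like the sketch below. Because yylval arrives as a void pointer, it is cast to YYSTYPE * before use. The token numbers (TOK_EOF, TOK_OTHER, TOK_INTEGER) and the strrock structure are hypothetical; real token numbers come from gggg.h, and the rock can be whatever the application chooses.

```c
#include <ctype.h>
#include <stdlib.h>

struct hunk { int size; };

/* Same shape as the %union above; in a real build this comes from gggg.h. */
typedef union { int i; struct hunk *hunkptr; struct hunk v; } YYSTYPE;

/* Hypothetical token numbers, for illustration only. */
enum { TOK_EOF = 0, TOK_OTHER = 1, TOK_INTEGER = 3 };

/* Hypothetical rock: a cursor over a NUL-terminated input string. */
struct strrock { const char *cursor; };

int lexer(void *lexrock, void *yylval)
{
    struct strrock *in = (struct strrock *)lexrock;
    YYSTYPE *val = (YYSTYPE *)yylval;   /* cast before storing a value */
    char *end;

    while (isspace((unsigned char)*in->cursor))
        in->cursor++;
    if (*in->cursor == '\0')
        return TOK_EOF;
    if (isdigit((unsigned char)*in->cursor)) {
        val->i = (int)strtol(in->cursor, &end, 10);
        in->cursor = end;
        return TOK_INTEGER;
    }
    in->cursor++;                       /* consume one unrecognized character */
    return TOK_OTHER;
}
```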
The lexer must return Bison token numbers rather than yacc numbers; yacc
uses the first 256 values to indicate distinct ASCII characters, but
Bison
does not. In 'tlex', the Bison token numbers are acquired from the
gggg.tab.c file generated by Bison; other lexers can generate yacc token
numbers and translate them: if the yacc token number is t, the Bison
token
number is
parser_TranslateTokenNumber(gparser, t)
Between the call to gggg_New() and the call to parser_Parse, the application
can apply other functions to the object such as parser_EnumerateReservedWords
to enter reserved words in a symbol table, parser_SetKillVal to handle error
cleanup, parser_SetRock to store a pointer for use by semantic action
routines, and so on. \
The parser returns the maximum severity from among the severity values
passed to parser_Error. These values are, in increasing order,
\leftindent{parser_OK - no error
parser_WARNING - there was some minor problem
parser_SERIOUS - compilation aborted, but scan continued
parser_SYNTAX - same as SERIOUS, but due to a syntax error
parser_FATAL - compilation could not continue}
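The bookkeeping behind the returned value can be modeled in a few lines. This is a model only, not the cparser source: the sev_ names and the tally structure are hypothetical, and it assumes that every report, warnings included, is counted as an error.

```c
/* Stand-in severities in the documented increasing order. */
enum { sev_OK, sev_WARNING, sev_SERIOUS, sev_SYNTAX, sev_FATAL };

/* Hypothetical per-parse tally of what has been reported so far. */
struct tally {
    int max_severity;   /* highest severity seen; sev_OK if none */
    int nerrors;        /* number of reports received */
};

/* Record one report: remember the maximum and bump the count. */
void tally_report(struct tally *t, int severity)
{
    if (severity > t->max_severity)
        t->max_severity = severity;
    t->nerrors++;
}
```

Because the values are ordered, a caller can test for any serious problem with a single comparison against the returned maximum.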
\section{Parse-time stacks}
A parser object has two stacks which are initially allocated at 500
elements, but grow as needed. Use left recursion in grammars to avoid
requiring great stack depth. Note that stack depth reflects the recall
complexity of the program to a person reading it; consequently, a
grammar
requiring a large stack is unlikely to describe a language that people
can
feel comfortable with. \
The value stack contains copies of objects as returned by the lexer and
set
in the action routines. If these objects contain pointers to "pointee"
objects, the client is responsible for the memory occupied by the
pointees.
If the parser terminates early for a syntax error or YYABORT, the pointee
values can be deleted by supplying a KillVal function; to use f as the
function, write
parser_SetKillVal(gparser, f)
After a syntax error and before discarding the stack, this function is
called for each value on the stack. The killval function is also called
as
states are popped for error recovery. The call is
(killvalfunction)(parseobject, value-pointer-from-stack)
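A killval function might be written as in the sketch below. It assumes a grammar in which YYSTYPE is simply a pointer to a heap-allocated struct hunk, so every stack slot can be treated the same way; with a %union the function would need some way to tell which member is live. The parameter types are inferred from the call shown above, and the hunks_freed counter is a hypothetical addition for checking the result.

```c
#include <stdlib.h>

struct parser;                      /* opaque here; defined by cparser.h */
struct hunk { int size; };

/* Assumption for this sketch: every semantic value is an owned pointer. */
typedef struct hunk *YYSTYPE;

static int hunks_freed;             /* hypothetical counter for testing */

/* Called with a pointer to a value stack slot that is being discarded. */
void free_hunk_value(struct parser *p, void *stackval)
{
    YYSTYPE *slot = (YYSTYPE *)stackval;

    (void)p;                        /* unused in this sketch */
    if (*slot != NULL) {
        free(*slot);
        *slot = NULL;               /* guard against a second call */
        hunks_freed++;
    }
}
```

It would be installed with parser_SetKillVal(gparser, free_hunk_value), possibly with a cast to parser_killfptr.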
\section{Mkparser}
The mkparser script is invoked to produce gggg.c and gggg.h from the
gggg.tab.c file generated by Bison. The process can also use the
gggg.tab.h file generated by Bison in response to the -d switch. (For
use
with tlex, Andrew Bison must be used and must also be given the switches
-r
and -k.)
At minimum, mkparser has one argument, the prefix of the file names: \
mkparser gggg
where the input files are gggg.tab.c and possibly gggg.tab.h. The output
will be the two files gggg.c and gggg.h. \
Mkparser may have one or two additional arguments, the name of the Bison
output .c file and the name of the .h file. If specified, these files
are
used instead of the .tab. files. In any case, the prefix is used to
generate the name gggg_New(). \
\section{Compilation}
In a Makefile, a .y file is converted to .c and .h files via rules like:
\
gggg.c gggg.h: gggg.y
rm -f gggg.c gggg.h gggg.tab.c gggg.tab.h
bison -b gggg -k gggg.y
mkparser gggg
gggg.o: gggg.c
The .c file resulting from mkparser is compiled as a normal .c file and
linked together with other .o files for the application. The .h file is
included by the client application.
Andrew Bison can be given additional flags, among which are
-d
defines - generates gggg.tab.h
-r
raw - token numbers in gggg.tab.h are bison numbers \
-v
verbose - generates gggg.output, useful for debugging
The -k and -r switches are implemented in the AUIS version of Bison. -k
causes output of a few additional declarations. (See bison.texinfo in the
Andrew distribution of Bison.) \
Bison's -l switch should NOT be used; mkparser depends on the #line
directives in the file. If necessary, these can be removed from gggg.c
with sed: \
sed '/^#line/d' gggg.c > ,gggg.c; mv ,gggg.c gggg.c
In an Andrew Imakefile, the grammar is processed with a rule like: \
CParser(gggg, flags)
where 'flags' is normally empty but can include the -d, -r, and -v flags
described above. The result is to process gggg.y to produce gggg.c and
gggg.h. \
\section{Linking}
The application must be linked with cparser.o or a library containing it.
File cparser.o is the result of compiling cparser.c, from the same
source
directory as mkparser. In Andrew, cparser.o is installed in
$ANDREWDIR/lib/libcparser.a. \
\section{Functions provided for parser objects}
\subsection{int parser_Parse(struct parser *self, parser_lexerfptr lexer,
void *lexerrock)}
\leftindent{Causes the parser to run to completion using the lexeme
stream
supplied as the second and third arguments. Returns one of the severity
values, indicating the highest severity error encountered. }\
\subsection{int parser_ParseNumber(char *buf, long *plen, long *intval,
double *dblval)}
\leftindent{Parses a number from buf and sets *plen to the number of
characters recognized. If intval is non-null, *intval is set to the
number's integer value. Similarly, if dblval is non-null, *dblval is set
to the number's value as a double. Returns 1 if syntactically an
integer,
2 for a double, and 0 for a syntax error. }\
\leftindent{An integer is
a zero followed by a string of octal digits,
a non-zero digit followed by decimal digits,
0x followed by a string of hexadecimal digits, or
a character within apostrophes, possibly \\-escaped. \
A real is of the form
[ddd][.][ddd][Epddd]
where
[...] indicates an optional part except that the
complete number must have either . or Epddd
ddd is a digit sequence
(one or more digits)
p (sign) may be empty or '+' or '-'
E (exponent indicator) may be 'e' or 'E'}
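The three numeric integer forms above follow C's conventions, so they can be approximated with the standard strtol function in base 0. The sketch below is not parser_ParseNumber itself: sketch_parse_int is a hypothetical name, and it handles neither character constants in apostrophes nor reals.

```c
#include <stdlib.h>

/* Returns 1 and stores the value and length when buf starts with an
   integer literal in one of the three numeric forms; returns 0 otherwise. */
int sketch_parse_int(const char *buf, long *plen, long *intval)
{
    char *end;
    long v;

    /* strtol would also accept leading blanks and a sign; reject them. */
    if (buf[0] < '0' || buf[0] > '9')
        return 0;
    v = strtol(buf, &end, 0);   /* base 0: 0x.. is hex, 0.. octal, else decimal */
    if (plen != NULL)
        *plen = (long)(end - buf);
    if (intval != NULL)
        *intval = v;
    return 1;
}
```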
\subsection{int parser_TransEscape(char *buf, int *plen)}
\leftindent{Buf holds a character sequence (at least three chars) that
occurred after a backslash in a string. The translation is returned as
an
int. The number of characters used is returned in *plen. (plen may be
NULL.)
}
\leftindent{The translations are a superset of C: \
escape seq : translation
----------------------
\\\\ \\' \\" \\b \\t : as in C
\\n \\v \\f \\r : as in C
\\ddd : octal digits, as in C
\\? : \\177 (DEL)
\\e : \\033 (ESC, ctl-[)
\\^\formatnote{@ } : \\000 (NUL)
\\^a ... \\^z : \\001 ... \\032 (ctl-a ... ctl-z)
\\^[ : \\033
\\^\\ : \\034
\\^] : \\035
\\^^ : \\036
\\^_ : \\037}
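In ASCII, every control-character row of the table amounts to keeping the low five bits of the character that follows the caret. The sketch below shows that relationship; ctl_escape is a hypothetical helper, not part of cparser, and it assumes an ASCII character set.

```c
/* Translate the character that follows a backslash-caret pair into the
   corresponding control code, or return -1 if it is not in the table.
   Assumes ASCII. */
int ctl_escape(int c)
{
    if ((c >= 'a' && c <= 'z') || c == '@' || (c >= '[' && c <= '_'))
        return c & 037;     /* keep the low five bits */
    return -1;
}
```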
\subsection{void parser_Error(struct parser *self, int severity, char
*msg)}
\leftindent{Call this function to report an error. It counts the number
of
errors, records the maximum severity, and then calls ErrorGuts for
disposition. }\
\subsection{void parser_EnumerateReservedWords(struct parser *self,
parser_enumresfptr handler, void *rock)}
\leftindent{The handler is called for each alphabetic reserved word: \
handler(rock, char *word, int tokennumber)
It is not called for names beginning with "set"; for names beginning
with
"tok", only the rest of the name is passed. Uppercase letters in token
names are converted to lower, and vice versa. }\
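A handler that enters each reserved word in a small table could look like the following sketch. The wordtable structure, its fixed sizes, and lookup_word are hypothetical, and the void return type is an assumption, since only the handler's arguments are specified above.

```c
#include <string.h>

#define MAX_WORDS 64
#define MAX_NAME  32

/* Hypothetical symbol table passed to the handler as its rock. */
struct wordtable {
    int count;
    struct { char name[MAX_NAME]; int token; } entry[MAX_WORDS];
};

/* Handler: remember one reserved word and its token number. */
void remember_word(void *rock, char *word, int tokennumber)
{
    struct wordtable *t = (struct wordtable *)rock;

    if (t->count >= MAX_WORDS)
        return;                               /* table full: drop the word */
    strncpy(t->entry[t->count].name, word, MAX_NAME - 1);
    t->entry[t->count].name[MAX_NAME - 1] = '\0';
    t->entry[t->count].token = tokennumber;
    t->count++;
}

/* Return the token number for word, or 0 if it is not reserved. */
int lookup_word(const struct wordtable *t, const char *word)
{
    int i;

    for (i = 0; i < t->count; i++)
        if (strcmp(t->entry[i].name, word) == 0)
            return t->entry[i].token;
    return 0;
}
```

The table would be filled before parsing with a call such as parser_EnumerateReservedWords(gparser, remember_word, &table), after which the lexer can consult lookup_word for each identifier.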
\subsection{int parser_TokenNumberFromName(struct parser *self, char
*name)}
\leftindent{Returns the token number corresponding to the string.
Typical
strings:
function
setID
tokNULL
'a'
":="
(Note the different quotes around the two kinds of character tokens.)
If the name is not found, returns 0. }\
\subsection{char parser_TranslateTokenNumber(struct parser *self, int x)}
\leftindent{Bison assigns different token numbers than yacc; in
particular, the first 256 do not correspond to the ASCII characters. This
function converts a yacc token number, x, into the token number required by
Bison.}
\subsection{void parser_SetRock(struct parser *self, void *r)} \
\leftindent{Sets the 'rock' value associated with the parser. This value
is then available in any context--lexical analysis, semantic action
routine, or other--which has a pointer to the parser object.}
\subsection{void * parser_GetRock(struct parser *self)}
\leftindent{Returns the 'rock' value. }\
\subsection{void parser_SetKillVal(struct parser *self, parser_killfptr
kv)}
\leftindent{This function sets the killval function to kv. The latter is
called when value stack items are popped for errors. See above. }\
\subsection{parser_killfptr parser_GetKillVal(struct parser *self)}
\leftindent{Returns the killval function. }\
\subsection{struct parser *parser_GetCurrentparser()}
\leftindent{During any call to parser_Parse, this function returns the
current parser object. This can be supplied as the object for
parser_Error. \
}
\subsection{int parser_SetDebug(int value)}
\leftindent{Sets the debug flag to the given value; value must be 0 or
1. Returns the prior value. }\
\subsection{int parser_GetErrorState(struct parser *self)}
\leftindent{The error-state is an integer indicating how many tokens must
be successfully parsed before resuming correct parsing. Usually this
value
is zero; when a syntax error is detected, the value is set to three. To
clear the errorstate, an action invokes yyerrok and to enter the
errorstate, an action terminates with YYERROR.}
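The counter can be pictured with the small model below. It is a model under assumptions rather than the cparser implementation: the names are hypothetical, and the rule that each successfully shifted token decrements the counter is borrowed from the usual yacc behavior.

```c
/* Hypothetical model of the error-state counter. */
struct errmodel { int errorstate; };

/* A syntax error is detected, or an action terminates with YYERROR. */
void model_enter_error(struct errmodel *m)
{
    m->errorstate = 3;
}

/* One token was parsed successfully. */
void model_token_shifted(struct errmodel *m)
{
    if (m->errorstate > 0)
        m->errorstate--;
}

/* An action invoked yyerrok. */
void model_yyerrok(struct errmodel *m)
{
    m->errorstate = 0;
}
```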
\subsection{void parser_SetMaxSeverity(struct parser *self, int s)}
\leftindent{Sets the value remembered as the maximum severity
encountered.
It is preferable to do so by calling parser_Error. }\
\subsection{int parser_GetMaxSeverity(struct parser *self)}
\leftindent{Returns the current maximum severity value.
}\
\subsection{void parser_SetNErrors(struct parser *self, int n)}
\leftindent{Allows the application to set the number of errors
encountered.
It is usually incorrect to call this function. }\
\subsection{int parser_GetNErrors(struct parser *self)}
\leftindent{Returns the number of errors that have been encountered in
the
current compilation. }\
\subsection{char **parser_GetTokenNames(struct parser *self)}
\leftindent{Returns a pointer to an array of all token names in order by
token number. }\
\subsection{short parser_GetNTokens(struct parser *self)}
\leftindent{Returns the number of tokens in the grammar. }\
Copyright 1992, 1995 Carnegie Mellon University. All rights Reserved.
$Disclaimer:
# Permission to use, copy, modify, and distribute this software and its
# documentation for any purpose and without fee is hereby granted,
provided
# that the above copyright notice appear in all copies and that both that
# copyright notice and this permission notice appear in supporting
# documentation, and that the name of IBM not be used in advertising or
# publicity pertaining to distribution of the software without specific,
# written prior permission.
#
# THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD
# TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ANY COPYRIGHT
# HOLDER BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL
# DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE,
# DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
# OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
# WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
#
# $
\enddata{text,8588536}