LR(1) Parser Generator Hyacc

advertisement
LR(1) Parser Generator Hyacc
X. Chen1, D. Pager1
1
Department of Information and Computer Science, University of Hawaii at Manoa, Honolulu, HI, USA
Abstract - The space and time cost of LR parser generation
is high. Robust and effective LR(1) parser generators are
rare to find. This work employed the Knuth canonical
algorithm, Pager’s practical general method, lane-tracing
algorithm, and other relevant algorithms, implemented an
efficient, practical and open-source parser generator Hyacc
in ANSI C, which supports full LR(0)/LALR(1)/LR(1) and
partial LR(k), and is compatible with Yacc and Bison in
input format and command line user interface. In this paper
we introduce Hyacc, and give a brief overview on its
architecture, parse engine, storage table, precedence and
associativity handling, error handling, data structures,
performance and usage.
Keywords: Hyacc, LR(1), Parser Generator, Compiler,
Software tool
1
Introduction
The canonical LR(k) algorithm [1] proposed by Knuth
in 1965 is a powerful parser generation algorithm for
context-free grammars. It was potentially exponential in time
and space to be of practical use. Alternatives to LR(k)
include the LALR(1) algorithm used in parser generators
such as Yacc and later Bison, and the LL algorithm used by
parser generators such as ANTLR. However, LALR and LL
are not as powerful as LR. Good LR(k) parser generator
remains scarce, even for the case k = 1.
This work has developed Hyacc, an efficient and
practical open source full LR(0)/LALR(1)/LR(1) and partial
LR(k) parser generation tool in ANSI C. It is compatible
with Yacc and Bison. The LR(1) algorithms employed are
based on 1) the canonical algorithm of Knuth [1], 2) the
lane-tracing algorithm of Pager [2][3], which reduces
parsing machine size by splitting from a LALR(1) parsing
machine that contains reduce-reduce conflicts, and 3) the
practical general method of Pager [4], which reduces parsing
machine size by merging compatible states from a parsing
machine obtained by Knuth’s method. The LR(0) algorithm
used in Hyacc is the traditional one. The LALR(1) algorithm
used in Hyacc is based on the first phase of the lane-tracing
algorithm. LR(0) and LALR(1) are implemented because
Pager’s lane-tracing algorithm depends on these as the first
step. The LR(k) algorithm is called the edge-pushing
algorithm [5] based on recursively applying the lane-tracing
process, and works for a subclass of LR(k) grammars. As a
side optimization, Hyacc also implemented the unit
production elimination algorithm of Pager [6] and its
extension [5].
2
2.1
The Hyacc Parser Generator
Overview
Hyacc is pronounced as “HiYacc”. It is an efficient and
practical parser generator written from scratch in ANSI C,
and is easy to port to other platforms. Hyacc is open source.
Version 0.9 was released in January 2008 [7]. Version 0.95
was released in April 2009. Version 0.97 was released in
January 2011.
Hyacc is released under the GPL license. But the
LR(1) parse engine file hyaccpar and LR(k) parse engine file
hyaccpark are under the BSD license so that the parser
generators created by Hyacc can be used in both open source
and proprietary software. This addresses the copyright
problem that Richard Stallman discussed in “Conditions for
Using Bison” of his Bison manuals [8][9].
The algorithms employed by Hyacc are listed in the
introduction.
Hyacc is compatible to Yacc and Bison in its input file
format, ambiguous grammar handling and error handling.
These directives from Yacc and Bison are implemented in
Hyacc: %token, %left, %right, %expect, %start, %prec.
Hyacc can be used together with the lexical analyser Lex. It
can generate rich debug information in the parser generation
process, and store these in a log file for review.
If specified, the generated parser can record the parsing
steps in a file, which makes it easy for debugging and
testing. It can also generate a Graphviz input file for the
parsing machine. With this input, Graphviz can draw an
image of the parsing machine.
2.2
Architecture
Hyacc first gets command line switch options, then
reads from the grammar input file. Next, it creates the
parsing machine according to different algorithms as
specified by the command line switches. Then it writes the
generated parser to y.tab.c, and optionally, y.output and
y.gviz. y.tab.c is the parser with the parsing machine stored
in arrays. y.output contains all kinds of information needed
by the compiler developer to understand the parser
generation process and the parsing machine. y.gviz can be
used as the input file to Graphviz to generate a graph of the
parsing machine.
Knuth LR(1)
LR(0)
Knuth LR(1)
LT LALR(1)
LT LR(1)
w/ PGM
LT LR(1)
w/ LTT
LR(k)
UPE
PGM LR(1)
Data Flow
PGM
LR(1)
LR(k)
UPE Ext
Fig. 1. Relationship of algorithms from the point of view of
1
grammar processing
Fig. 1 shows how the algorithms used in Hyacc are
structured from the point of view of grammar processing.
Input grammars can be processed by the merging path on the
left, first by the Knuth canonical algorithm and stop here, or
be further processed by Pager’s PGM algorithm.
LT LR(1) w/ LTT
LT LALR(1)
LR(0)
Fig. 2. Relationship of algorithms from the point of view of
implementation
2.3
The LR(1) Parse Engine
Similar to the yaccpar file of Yacc, the hyaccpar file is
the parse engine of Hyacc. The parser generation process
embeds the parsing table into hyaccpar. How the hyaccpar
LR(1) parse engine works is shown in Algorithm 1.
Algorithm 1: The hyaccpar LR(1) parse engine algorithm.2
1 Initialization:
2 push state 0 onto state_stack;
3 while next token is not EOF do {
4
S current state;
5
L next token/lookahead;
6
A action of(S, L) in parsing table;
7
if A is shift then {
8
push target state on state_stack,
9
pop lookahead symbol;
10
update S and L;
11
} else if A is reduce then {
12
output code for this reduction;
13
r1 LHS symbol of reduction A;
14
r2 RHS symbol count of A;
15
pop r2 states from state_stack,
16
update current state S;
17
Atmp action for (S, r1);
18
push target goto state Atmp to
state_stack;
19
} else if A is accept then {
20
if next token is EOF then {
21
is valid accept, exit;
22
} else {
23
is error, error recovery or exit;
24
}
25
} else {
26
is error, do error recovery;
27
}
28 }
Input grammars can also be processed by the splitting
path on the right. First the LR(0) parsing machine is
generated. Next the LALR(1) parsing machine is generated
by the first phase of the lane-tracing algorithm. If reducereduce conflicts exist, this is not a LALR(1) grammar, and
the second phase of lane-tracing is applied to generate the
LR(1) parsing machine. There are two methods for the
second phase of lane-tracing. One is based on the PGM
method [4], the other is based on the lane table method [10].
If LR(1) cannot resolve all the conflicts, this may be a LR(k)
grammar and the LR(k) process is applied.
The generated LR(1) parsing machine may contain unit
productions that can be eliminated by applying the UPE
algorithm and its extension.
Fig. 2 shows the relationship of the algorithms from the
point of view of implementation, i.e., how one algorithm is
based on the other.
1
Knuth LR(1) – Knuth canonical algorithm, PGM LR(1) – Pager’s
practical general method, LT LALR(1) – LALR(1) based on lanetracing phase 1, LT LR(1) w/ PGM – lane-tracing LR(1) algorithm
based on Pager’s practical general method, LT LR(1) w/ LTT –
lane-tracing LR(1) algorithm based on Pager’s lane table method,
UPE – Pager’s unit production elimination algorithm, UPT Ext –
Extension algorithm to Pager’s unit production elimination
algorithm.
LT LR(1) w/ PGM
In Algorithm 1, a state stack is used to keep track of
the current status of traversing the state machine. The
parameter ‘S’ or current state is the state on the top of the
2
LHS – Left Hand Side, RHS – Right Hand Side.
state stack. The parameter ‘L’ or lookahead is the symbol
used to decide the next action from the current state. The
parameter ‘A’ or action is the action to take, and is found by
checking the parsing table entry (S, L). ‘’ denotes
assignment operation. This parse engine is similar to the
one used in Yacc, but there are variation in the details, such
as the storage parsing table, as discussed in the next section.
TABLE 1. STORAGE ARRAYS FOR THE PARSING MACHINE IN HYACC
PARSE ENGINE.
Array name
yyfs[]
Explanation
List the default reduction for each state. If a state
has no default reduction, its entry is 0.
Array size = n.
yyrowoffset[] The offset of parsing table rows in arrays
yyptblact[] and yyptbltok[]. Array size = n.
yyptblact[]
Destination state of an action (shift/goto/reduce/
accept).
If yyptblact[i] is positive, action is ‘shift/goto’,
If yyptblact[i] is negative, action is ‘reduce’,
If yyptblact[i] is 0, action is ‘accept’.
If yyptblact[i] is -10000000, labels array end.
Array size = p.
yyptbltok[]
The token for an action.
If yyptbltok[i] is positive, token is terminal,
If yyptbltok[i] is negative, token is non-terminal.
If yyptbltok[i] is -10000001, is place holder for
a row.
If yyptbltok[i] is -10000000, labels array end.
Array size = p.
yyr1[]
If the LHS symbol of rule i is a non-terminal,
and its index among non-terminals (in the order
of appearance in the grammar rules) is x, then
yyr1[i] = -x. If the LHS symbol of rule i is a
terminal and its token value is t, then yyr1[i] = t.
Note yyr1[0] is a placeholder and not used.
Note this is different from yyr1[] of Yacc or
Bison, which only have non-terminals on the
LHS of its rules, so the LHS symbol is always a
non-terminal, and yyr1[i] = x, where x is defined
the same as above.
Array size = r.
yyr2[]
Same as Yacc yyr2[]. Let x[i] be the number of
RHS symbols of rule i, then yyr2[i] = x[i] << 1
+ y[i], where y[i] = 1 if production i has
associated semantic code, y[i] = 0 otherwise.
Note yyr2[0] is a placeholder and not used.
This array is used to generate semantic actions.
Array size = r.
yynts[]
List of non-terminals.
This is used only in debug mode.
Array size = number of non-terminals + 1.
yytoks[]
List of tokens (terminals).
This is used only in debug mode.
Array size = number of terminals + 1.
yyreds[]
List of the reductions.
Note this does not include the augmented rule.
This is used only in debug mode.
Array size = r.
2.4
Storing the Parsing Table
2.4.1
Storage tables
The following describes the arrays that are used in
hyaccpar to store the parsing table.
Let the parsing table have n rows (states) and m
columns (number of terminals and non-terminals). Assuming
there are r rules (including the augmented rule), and the
number of non-empty entries in the parsing table is p. Table
1 lists all the storage arrays and explains their usage.
2.4.2
Complexity
Suppose in state i there is a token j, we can find if an
action exists by looking at the yyptbltok table from
yyptbltok[yyrowoffset[i]] to yyptbltok[yyrowoffset[i+1]-1]:
i) If yyptbltok[k] == j, yyptblact[k] is the associated action;
ii) If yyptblact[k] > 0, this is a ‘shift/goto’ action;
iii) If yyptblact[k] < 0, is a reduction, then use yyr1 and yyr2
to find number of states to pop and the next state to goto;
iv) If yyptblact[k] == 0 then it is an ‘accept’ action, which is
valid when j is the end of an input string.
The space used by the storage is: 2n + 2p + 3r + (m +
2). In most cases the parsing table is a sparse matrix. In
general, 2n + 2p + 3r + (m + 2) < n*m.
For the time used, the main factor is when searching
through the yyptbltok array from yyptbltok[yyrowoffset[i]]
to yyptbltok[yyrowoffset[i+1]-1]. Now it is linear search and
takes O(n) time. This can be made faster by binary search,
which is possible if terminals and non-terminals are sorted
alphabetically. Then time complexity will be O(ln(n)). It can
be made such that time complexity is O(1), by using the
double displacement method which stores the entire row of
each state. That would require more space though.
2.4.3
Example
An example is given to demonstrate the use of these
arrays to represent the parsing table. Given grammar G1:
EE+T|T
TT*a|a
The parsing table is shown in Table 2. Here the parsing
table has n = 8 rows, m = 6 columns, and r = 5 rules
(including the augmented rule). The actual storage arrays in
the hyaccpar parse engine are shown in Table 3.
Array yyfs[] lists the default reduction for each state:
state 3 has default reduction on rule 4, and state 7 has
default reduction on rule 3.
Array yyrowoffset[] defines the offset of parsing table
rows in arrays yyptblact[] and yyptbltok[]. E.g., row 1 starts
at offset 0, row 2 starts at offset 3.
Array yyptblact[] is the destination state of an action.
The first entry is 97, which can be seen in the yytoks[] array.
The second entry is 1, which stands for non-terminal E. And
as we see in the parsing table, entry (0, a) has action s3,
entry (0, E) has action g1, thus in yyptblact[] we see
correspondingly the first entry is 3, and the second entry is
1. Entry -10000000 in both yyptblact[] and yyptbltok[]
labels the end of the array. Entry 0 in yyptblact[] labels the
accept action. Entry 0 in yyptbltok[] stands for the token end
marker $. Entry -10000001 in yyptbltok[] labels that this
state (row in parsing table) has no other actions but the
default reduction. -10000001 is just a dummy value that is
never used, and servers as a place holder so yyrowoffset[]
can have a corresponding value for this row.
TABLE 2. PARSING TABLE FOR GRAMMAR G1
State
0
1
2
3
4
5
6
7
$
0
a0
r2
r4
0
0
r1
r3
+
0
s4
r2
r4
0
0
r1
r3
*
0
0
s5
r4
0
0
s5
r3
a
s3
0
0
0
s3
s7
0
0
E
g1
0
0
0
0
0
0
0
T
g2
0
0
0
g6
0
0
0
TABLE 3. STORAGE TABLES IN HYACC LR(1) PARSE ENGINE
FOR GRAMMAR G1
#define YYCONST const
typedef int yytabelem;
static YYCONST yytabelem yyfs[] = {0, 0,
0, -4, 0, 0, 0, -3};
Entries of array yyr1[] and array yyr2[] are defined as
in Table 1, and it is easy to see the correspondence of the
values.
static YYCONST yytabelem yyptbltok[] = {
97, -1, -2, 0, 43, 0, 43, 42, -10000001, 97,
-2, 97, 0, 43, 42, -10000001, -10000000};
2.5
static YYCONST yytabelem yyptblact[] = {
3, 1, 2, 0, 4, -2, -2, 5, -4, 3,
6, 7, -1, -1, 5, -3, -10000000};
Handling Precedence and Associativity
The way that Hyacc handles precedence and
associativity is the same as Yacc and Bison. By default, in a
shift/reduce conflict, shift is chosen; in a reduce/reduce
conflict, the reduction whose rule appears first in the
grammar is chosen. But this may not be what the user wants.
So %left, %right and %nonassoc are used to declare tokens
and specify customized precedence and associativity.
2.6
Error Handling
Error handling is the same as in Yacc. There have been
abundant complaints about the error recovery scheme of
Yacc. We are concentrating on LR(1) algorithms instead of
better error recovery. Also we want to keep compatible with
Yacc and Bison. For these reasons we keep the way that
Yacc handles errors.
2.7
Data Structures
A symbol table is implemented by hash table, and uses
open-chaining to store elements in a linked list at each
bucket. The symbol table is used to achieve O(1)
performance for many operations. All the symbols used in
the grammar are stored as a node in this symbol table. Each
node also contains other information about each symbol.
Such information are calculated at the time of parsing the
grammar file and stored for later use.
static YYCONST yytabelem yyrowoffset[] = {
0, 3, 5, 8, 9, 11, 12, 15, 16};
static YYCONST yytabelem yyr1[] = {
0,
-1,
-1,
-2,
-2};
static YYCONST yytabelem yyr2[] = {
0,
6,
2,
6,
2};
#ifdef YYDEBUG
typedef struct {char *t_name; int t_val;}
yytoktype;
yytoktype yynts[] = {
"E", -1,
"T", -2,
"-unknown-", 1 /* ends search */
};
yytoktype yytoks[] = {
"a", 97,
"+", 43,
"*", 42,
"-unknown-", -1 /* ends search */
};
char * yyreds[] = {
"-no such reduction-"
"E : 'E' '+' 'T'",
"E : 'T'",
"T : 'T' '*' 'a'",
"T : 'a'",
};
#endif /* YYDEBUG */
Linked list, static arrays, dynamic arrays and hash
tables are used where appropriate. Sometimes multiple data
structures are used for the same object, and which one to use
depends on the particular circumstance.
In the parsing table, the rows index the states (e.g., row
1 represents actions of state 1), and the columns stand for
the lookahead symbols (both terminals and non-terminals)
upon which shift/goto/reduce/accept actions take place. The
parsing table is implemented as a one dimensional integer
array. Each entry [row, col] is accessed as entry [row *
column size + col]. In the parsing table positive numbers are
for ‘shift’, negative numbers are for ‘reduce’, -10000000 is
for ‘accept’ and 0 is for ‘error’.
There is no size limit for any data structures. They can
grow until they consume all the memory. However, Hyacc
artificially sets an upper limit of 512 characters for the
maximal length of a symbol.
2.8
Performance
The performance of Hyacc is compared to other LR(1)
parser generators. Menhir [11] and MSTA [12] are both
very efficiently and robustly implemented. Table 4 and
Table 5 show the running time comparison of the three
parser generators on C++ and C grammars. The speeds are
similar. MSTA, implemented in C++, is a handy choice for
industry users. It does not use reduced-space LR(1)
algorithms though, thus always results in larger parsing
machines. Menhir uses Pager’s PGM algorithm, but is
implemented in Caml, which is not so popular in industry.
Therefore Hyacc should be a favorable choice.
TABLE 4. RUNNING TIME (SEC) COMPARISON OF MENHIR, MSTA AND
HYACC ON C++ GRAMMAR.
Menhir
MSTA
Hyacc
Knuth LR(1)
1.97
5.32
3.53
PG MLR(1)
1.48
N/A
1.78
LALR(1)
N/A
1.17
1.10
TABLE 5. RUNNING TIME (SEC) COMPARISON OF MENHIR, MSTA AND
HYACC ON C GRAMMAR.
Menhir
MSTA
Hyacc
2.9
Knuth LR(1)
1.64
0.92
1.05
PG MLR(1)
0.56
N/A
0.42
LALR(1)
N/A
0.13
0.19
Usage
From the sourceforge.net homepage of Hyacc [7] a
user can download the source packages for unix/linux and
windows, and the binary package for windows. All the
instructions on installation and usage are available in the
included readme file. It is very easy to use, especially for
users familiar with Yacc and/or Bison.
Hyacc is a command line utility. To start hyacc, use:
“hyacc input_file.y [-bcCdDghKlmnoOPQRStvV]”. The
input grammar file input_file.y has the same format as those
used by Yacc/Bison.
The meanings of some of the command line switches
are briefly introduced here. ‘-b’ specifies the prefix to use
for all hyacc output file names. The default is y.tab.c as in
Yacc. If ‘-c’ is specified, no parser files (y.tab.c and y.tab.h)
will be generated. This is used when the user only wants to
use the -v and -D options to parse the grammar and check
the y.output log file. ‘-D’ is used with a number from 0 to 15
(e.g., -D7) to specify the details to be included into the
y.output log file during the parser generation process. ‘-g’
says that a Graphviz input file should be generated. ‘-S’
means to apply LR(0) algorithm. ‘-R’ applies LALR(1)
algorithm. ‘-Oi’ where i = 0 to 3 applies the Knuth canonical
algorithm and the practical general method with different
optimizations. ‘-P’ applies the lane-tracing algorithm based
on the practical general method. ‘-Q’ applies the lane-tracing
algorithm based on the lane table method. ‘-K’ applies the
LR(k) algorithm. ‘-m’ shows man page.
For more usage of the Hyacc parser generator,
interested users can refer to the Hyacc usage manual.
3
Related Work
Pager’s practical general method has been
implemented in some other parser generators. Some
examples are LR (1979, in Fortran 66, at Lawrence
Livermore National Laboratory) [13], LRSYS (1985, in
Pascal, at Lawrence Livermore National Laboratory) [14],
LALR (1988, in MACRO-11 under RSX-11) [15], Menhir
(2004, in Caml) [11] and the Python Parsing module (2007,
in Python) [16].
The lane-tracing algorithm was implemented by Pager
(1970s, in Assembly under OS 360) [3]. But no other
available implementation is known.
For other LR(1) parser generators, the Muskox parser
generator (1994) [17] implemented Spector’s LR(1)
algorithm [18][19]. MSTA (2002) [12] took a splitting
approach but the detail is unknown. Commercial products
include Yacc++ (LR(1) was added around 1990, using
splitting approach loosely based on Spector’s algorithm)
[20][21] and Dr. Parse (detail unknown) [22]. Most recently
an IELR(1) algorithm [23][24] was proposed to provide
LR(1) solution to non-LR(1) grammars with specifications
to solve conflicts, and the authors implemented this as an
extension of Bison.
4
Conclusions
[12] Vladimir Makarov. Toolset COCOM & scripting
language DINO, 2002.
http://sourceforge.net/projects/cocom
In this work we investigated LR(1) parser generation
algorithms and implemented a parser generator Hyacc,
which supports LR(0)/LALR(1)/LR(1) and partial LR(k). It
has been released to the open source community. The usage
of Hyacc is highly similar to the widely used LALR(1)
parser generators Yacc and Bison, which makes it easy to
learn and use. Hyacc is unique in its wide span of algorithms
coverage, efficiency, portability, usability and availability.
[13] Charles Wetherell and A. Shannon. LR automatic
parser generator and LR(1) parser. Technical Report UCRL82926 Preprint, July 1979.
5
[15] Algirdas Pakstas, 1992.
http://compilers.iecc.com/comparch/article/92-08-109
References
[1] Donald E. Knuth. On the translation of languages from
left to right. Information and Control, 8(6):607 – 639, 1965.
[2] David Pager. The lane tracing algorithm for
constructing LR(k) parsers. In Proceedings of the fifth
annual ACM symposium on Theory of computing, pages
172 – 181, Austin, Texas, United States, 1973.
[3] David Pager. The lane-tracing algorithm for
constructing LR(k) parsers and ways of enhancing its
efficiency. Information Sciences, 12:19 – 42,
[4] David Pager. A practical general method for
constructing LR(k) parsers. Acta Informatica, 7:249 – 268,
1977.
[5] Xin Chen. Measuring and Extending LR(1) Parser
Generation. PhD thesis, University of Hawaii, August 2009.
[6] David Pager. Eliminating unit productions from LR
parsers. Acta Informatics, 9:31 – 59, 1977.
[7] Xin Chen. LR(1) Parser Generator Hyacc, January
2008. http://hyacc.sourceforge.net.
[8] Charles Donnelly, Richard Stallman. Bison, The
YACC-compatible Parser generator (for Bison Version
1.23), 1993.
[9] Charles Donnelly, Richard Stallman. Bison, The
YACC-compatible Parser generator (for Bison Version
1.24), 1995.
[10] David Pager. The Lane Table Method Of Constructing
LR(1) Parsers. Technical Report No. ICS2009-06-02,
University of Hawaii, Information and Computer Sciences
Department, 2008. http://www.ics.hawaii.edu/research/techreports/LaneTableMethod.pdf/view
[11] Francois Pottier and Yann Regis-Gianas. Parser
Generator Menhir, 2004.
http://cristal.inria.fr/~fpottier/menhir/
[14] LRSYS, 1991.
http://www.nea.fr/abs/html/nesc9721.html
[16] Parser Generator Parsing.py: An LR(1) parser
generator with CFSM/GLR drivers, 2007.
http://compilers.iecc.com/comparch/article/07-03-076
[17] Boris Burshteyn. MUSKOX Algorithms, 1994.
http://compilers.iecc.com/comparch/article/94-03-067
[18] David Spector. Full LR(1) parser generation. ACM
SIGPLAN Notices, p.58 – 66, 1981.
[19] David Spector. Efficient full LR(1) parser generation.
ACM SIGPLAN Notices, 23(12):143 – 150, 1988.
[20] Yacc++ and the Language Objects Library, 1997 –
2004. http://world.std.com/~compres
[21] Chris Clark, 2005.
http://compilers.iecc.com/comparch/article/05-06-124
[22] Parser Generator Dr. Parse.
http://www.downloadatoz.com/softwaredevelopment_directory/dr-parse
[23] Joel E. Denny, Brian A. Malloy. IELR(1): practical
LR(1) parser tables for non-LR(1) grammars with conflict
resolution. Proceedings of the 2008 ACM symposium on
Applied computing, p.240 – 245, 2008.
[24] Joel E. Denny, Brian A. Malloy, The IELR(1)
algorithm for generating minimal LR(1) parser tables for
non-LR(1) grammars with conflict resolution, Science of
Computer Programming, v.75 n.11, p.943 – 979, November,
2010.
Download