A PARSER PROJECT IN A PROGRAMMING LANGUAGES
COURSE
Michael Werner
Department of Computer Science and Systems
Wentworth Institute of Technology
Boston, MA 02115
(617) 734-6780
wernerm@wit.edu
ABSTRACT
This experience report describes a programming exercise designed to
reinforce concepts of parsing, regular and context-free grammars, first sets,
and abstract syntax trees. Students build a parser for a language with real
applications. To make this feasible, tools including a compiler-compiler
and automated support for the Visitor design pattern are used. The exercise
is extensible.
INTRODUCTION
The Programming Languages course at Wentworth Institute of Technology is
unusual in that it requires a supervised laboratory component, which is an essential part of
Wentworth’s model of hands-on education. The challenge in the programming
languages course is to come up with a series of laboratory projects that are directly
relevant to the concepts discussed in the lectures. This paper describes one such
project, the building of a parser for the Dot graphics language. The project was first
used in the summer of 2002.
A PARSER FOR THE DOT LANGUAGE
The Dot Language1 was developed at AT&T Bell Laboratories as a means of
describing node and link graphs so that they can be automatically laid out and
displayed. The idea for using the Dot language in a laboratory project came from
Saumya Debray2. The reasons for using it in Wentworth's Programming Languages
course are as follows:
1. The Programming Languages course spends several weeks on lexing and parsing,
so a laboratory exercise of the same duration is an effective means of reinforcing
the concepts.
2. Building a parser for a real programming language, such as Java, is too big a job.
On the other hand, using a toy language is contrived and trivializes the student's
effort.
3. Only a subset of the Dot language is sufficient to describe very simple graphs.
The full language is used to describe fancier graphs that allow for colors, line
thicknesses, subgraphs, etc. This allows for a basic assignment of building a
parser for simple graphs, and an extra credit assignment of building one for fancy
graphs.
4. The article "Drawing graphs with dot"1 provides a full context-free grammar for
fancy graphs. Subtracting from this grammar to derive a minimal grammar for
simple graphs is a useful exercise in working with context-free grammars.
Figure 1 shows examples of simple and fancy graphs drawn from specifications
written in the Dot language. A grammar for that language is shown in Figure 2.
Building the Parser
This section describes the tools used to build a parser for the Dot Language. In
choosing tools, the goal was to reinforce the primary approaches stressed in the
lectures, as well as to introduce students to programming techniques that will prove
useful to them in the future.
JavaCC4 was chosen as the compiler-compiler tool. The reasons were:
1. JavaCC generates a recursive-descent parser. Recursive-descent parsing is the first
approach used in our textbook (Sebesta3); it is also the most thoroughly discussed
in the lecture and is the easiest for students to understand. For example, it doesn't
require the sometimes-baffling parsing tables associated with shift-reduce parsers.
2. JavaCC is free software, first developed at Sun Microsystems, and now available
from WebGain4.
3. An excellent introduction to using JavaCC (formerly known as Jack, to rhyme with
Yacc) by Chuck McManis appeared in JavaWorld5 and is also available online.
4. JavaCC also includes a tool, JJTree, useful for constructing parse trees6.
5. As a Java tool, JavaCC and the parsers it generates run on all platforms.
6. A large number of sample programs, as well as complete grammars, are available
for JavaCC.
ClassBuilder7 was used to generate boilerplate code for needed classes.
The internal nodes of the parse tree are constructed as objects representing
nonterminals in the grammar. They have fields referencing their children to support
tree traversals. So, classes are needed for each production in the grammar. A number
of other classes are also needed in the extensions to the project. Each class needs the
usual run of constructors, getters, setters, etc. Some fields are multi-valued.
ClassBuilder is a free tool that generates the necessary boilerplate code starting
from a very simple description of the classes. Multi-valued fields can be represented
as either vectors or hashtables, and ClassBuilder generates the code to support these
fields. Optionally, ClassBuilder can also generate an equals method, which does a
deep comparison of two objects, and a clone method. It can also generate code to
support the Visitor Design Pattern8.
digraph G {
    main -> parse -> execute;
    main -> init;
    main -> cleanup;
    execute -> make_string;
    execute -> printf;
    init -> make_string;
    main -> printf;
    execute -> compare;
}

digraph G {
    size = "4.4";
    main [shape=box]; /*comment*/
    main -> parse [weight=8];
    parse -> execute;
    main -> init [style=dotted];
    main -> cleanup;
    execute -> {make_string; printf};
    init -> make_string;
    edge [color=red];
    main -> printf [style=bold,label="100 times"];
    make_string [label="make a\nstring"];
    node [shape=box,style=filled,color=".7 .3 1.0"];
    execute -> compare;
}
Figure 1: On the left is a simple Dot graph, on the right a fancy graph. The graphs are
described below them using the Dot language. The pictures are taken from the Dot
User’s Manual.
graph -> [strict] ( digraph | graph ) id {stmt-list}
stmt-list -> [stmt [;] [stmt-list ] ]
stmt -> attr-stmt | node-stmt | edge-stmt | subgraph | id = id
attr-stmt -> (graph | node | edge) [ [ attr-list ] ]
attr-list -> id=id [attr-list ]
node-stmt -> node-id [ opt-attrs ]
node-id -> id [: id ]
opt-attrs -> [attr-list]
edge-stmt -> (node-id | subgraph) edgeRHS [opt-attrs ]
edgeRHS -> edgeop ( node-id | subgraph ) [edgeRHS ]
subgraph -> [subgraph id] { stmt-list } | subgraph id
An edgeop is -> in directed graphs and -- in undirected graphs.
Figure 2: A grammar for the Dot language
The Visitor Design Pattern8 was used to traverse the parse tree and
extract graph information from it.
Once the parse tree representing a Dot graph description is constructed, students
need a way to traverse it to retrieve the nodes, edges, attributes, etc. of the Dot graph.
The information about edges is not stored in a single parse tree node. The source of an
edge is discovered first either in an edge-stmt node or an edgeRHS node and must be
stored. Further on in an in-order traversal, the target of the edge is found in an
edgeRHS node. The source is retrieved and the completed edge object is added to the
collection of edges, which comprise the Dot graph. The traversal needs to be written
so that objects can be temporarily stored and various types of processing be done at
internal nodes, depending on which class they belong to.
The Visitor Design Pattern8 is ideal for such a task. A Visitor object, containing
fields for the items that need to be remembered, is sent on a depth-first traversal of the
parse tree. The visitor is equipped with numerous before and after methods, each taking a
single parameter that is an object of a class used in the parse tree, e.g. before(stmt),
before(edgeRHS), etc. Dispatch is on both the name of the method and the type of the
parameter. The parse tree objects are equipped with visit methods taking a visitor
object as a parameter. The visit method calls the visitor’s before method, passing itself
as the parameter, iterates through its children calling their visit methods, and finally calls
the visitor’s after method. This way the visitor can collect information at the parse
tree nodes.
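The edge-collecting traversal described above might be sketched as follows. This is only an illustration under simplifying assumptions: the grammar is reduced to edge_stmt ::= node_id edgeRHS and edgeRHS ::= edgeop node_id [edgeRHS], and the class names DotVisitor, NodeId, EdgeRHS, EdgeStmt, and EdgeCollector are hypothetical, not the classes generated in the course.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstract visitor with empty before/after hooks.
abstract class DotVisitor {
    public void before(EdgeStmt host) {}
    public void after(EdgeStmt host) {}
    public void before(EdgeRHS host) {}
    public void after(EdgeRHS host) {}
}

class NodeId {
    public final String name;
    public NodeId(String name) { this.name = name; }
}

// edgeRHS ::= edgeop node_id [edgeRHS]
class EdgeRHS {
    public final NodeId target;
    public final EdgeRHS rest;   // null when the chain of edges ends
    public EdgeRHS(NodeId target, EdgeRHS rest) { this.target = target; this.rest = rest; }
    public void visit(DotVisitor v) {
        v.before(this);
        if (rest != null) rest.visit(v);
        v.after(this);
    }
}

// edge_stmt ::= node_id edgeRHS
class EdgeStmt {
    public final NodeId source;
    public final EdgeRHS rhs;
    public EdgeStmt(NodeId source, EdgeRHS rhs) { this.source = source; this.rhs = rhs; }
    public void visit(DotVisitor v) {
        v.before(this);
        rhs.visit(v);
        v.after(this);
    }
}

// Remembers the most recent source node until the matching target is reached.
class EdgeCollector extends DotVisitor {
    public final List<String> edges = new ArrayList<>();
    private NodeId pendingSource;

    @Override public void before(EdgeStmt host) { pendingSource = host.source; }

    @Override public void before(EdgeRHS host) {
        edges.add(pendingSource.name + " -> " + host.target.name);
        pendingSource = host.target;   // the target is the source of the next link
    }
}
```

Sending an EdgeCollector through the tree for main -> parse -> execute leaves two entries in edges, one per link in the chain.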
Furthermore, the code for creating abstract visitors and providing support at each
parse tree node for the needed traversals can be generated automatically using either
JJTree or ClassBuilder. To show what can be done, imagine a train consisting of an
engine followed by 0 or more cars. Using ClassBuilder notation, this is written:
    Train = <engine> Engine <cars>* Car.
That is, a Train object has a field named engine of type Engine
and a field named cars, which is a collection of Car objects.
ClassBuilder generates an abstract Visitor class:
abstract class Visitor {
    public Visitor() {}
    public void before(Train host) {}
    public void after(Train host) {}
    public void before(Engine host) {}
    public void after(Engine host) {}
    public void before(Car host) {}
    public void after(Car host) {}
}
And a visit method for the Train class:
public void visit(Visitor v) {
    v.before(this);
    if (engine != null) engine.visit(v);
    Enumeration enumCars = getCars().elements();
    while (enumCars.hasMoreElements()) {
        Car it = (Car) enumCars.nextElement();
        it.visit(v);
    }
    v.after(this);
}
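To show how such generated code might be used, the following sketch defines a concrete visitor that counts the cars of a train. The Train, Engine, and Car classes here are hand-written stand-ins for what ClassBuilder would generate, and the names CarCounter and addCar are assumptions made for the example.

```java
import java.util.Enumeration;
import java.util.Vector;

// Stand-in for the generated abstract visitor: every hook is empty by default.
abstract class Visitor {
    public void before(Train host) {}
    public void after(Train host) {}
    public void before(Engine host) {}
    public void after(Engine host) {}
    public void before(Car host) {}
    public void after(Car host) {}
}

class Engine {
    public void visit(Visitor v) { v.before(this); v.after(this); }
}

class Car {
    public void visit(Visitor v) { v.before(this); v.after(this); }
}

class Train {
    private final Engine engine;
    private final Vector<Car> cars = new Vector<>();

    public Train(Engine engine) { this.engine = engine; }
    public void addCar(Car car) { cars.add(car); }      // hypothetical helper
    public Vector<Car> getCars() { return cars; }

    public void visit(Visitor v) {
        v.before(this);
        if (engine != null) engine.visit(v);
        Enumeration<Car> enumCars = getCars().elements();
        while (enumCars.hasMoreElements()) {
            enumCars.nextElement().visit(v);
        }
        v.after(this);
    }
}

// A concrete visitor overrides only the hooks it cares about.
class CarCounter extends Visitor {
    public int count = 0;
    @Override public void before(Car host) { count++; }
}
```

A CarCounter sent through a train with three cars ends with count equal to 3; the traversal logic stays in the tree classes, so a new analysis needs only a new Visitor subclass.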
INTEGRATION OF THE PROJECT INTO THE LECTURE
When taught in 2002, the parser project was begun before the lectures on parsing
and lexing were complete. This was done deliberately, since it was inevitable that
some problems would arise from the project, which would stimulate further
investigation of the theory involved. In fact, students reported the following messages
received from JavaCC, when they first tried to use it with a simplified Dot grammar.
1. Warning: "graph" cannot be matched as a string literal token at line 116,
column 3. It will be matched as <IDENTIFIER>.
The solution was simple. The token for "graph" needed to be entered prior
to that for <IDENTIFIER>. In the next day's lecture, this problem was used to
lead into a discussion of the lexing process. The lexer was pictured as a finite
state automaton. When the automaton recognized an identifier-like string, it
called a lookup function to match the string with a token. In JavaCC, the
lookup function scans the list of tokens top to bottom, which requires that the
<IDENTIFIER> token be declared after the other string literal tokens.
2. Warning: Choice conflict involving two expansions at line 140, column 10 and
line 140, column 24 respectively. A common prefix is: <IDENTIFIER>.
Consider using a lookahead of 2 for earlier expansion.
The problem came from the following productions (from a simplified
grammar):
    stmt      ::= node_stmt | edge_stmt
    node_stmt ::= node_id [opt_attrs]
    edge_stmt ::= node_id edgeRHS [opt_attrs]
Clearly in parsing a stmt the parser cannot decide between node_stmt and
edge_stmt based on the next token. The next day the concept of first sets was
introduced in the lecture. The first sets for node_stmt and edge_stmt failed the
pairwise disjointness test, hence the warning from JavaCC. The warning
suggested a lookahead to the second token. In the Dot grammar, the second
token for edge_stmt is either "--" or "->", neither of which can be the second
token for node_stmt. So one way to solve this using JavaCC is:
void stmt() :
{}
{
    // edge_stmt is chosen when the second token is "--" or "->"
    ( LOOKAHEAD(2) edge_stmt() | node_stmt() )
}
Note that the order of the choices for stmt was switched, and JavaCC was
instructed to look ahead to the second input token. If that token is an arrow or a dash,
edge_stmt is chosen; otherwise node_stmt is chosen. This led into a discussion of the
LL numbering system. The Dot grammar was LL(2), since all choices could
be resolved with a lookahead of 2.
However, LL(1) grammars produce faster parsers. Was there a way of
modifying the Dot grammar to make it LL(1)? This brought up left factoring.
Since both choices started with a node_id, this could be factored out as
follows:
    stmt              ::= node_id node_or_edge_stmt
    node_or_edge_stmt ::= node_stmt | edge_stmt
    node_stmt         ::= [opt_attrs]
    edge_stmt         ::= edgeRHS [opt_attrs]
This made the Dot grammar LL(1), a fact the students could confirm when
it passed the JavaCC test without any warnings.
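The effect of left factoring can be illustrated with a hand-written recursive-descent recognizer for the factored productions, which never looks beyond the next token. This is a sketch under stated assumptions rather than the code JavaCC generates: the input is already split into tokens, opt_attrs is simplified to a single "[" id "=" id "]" group, and the class name StmtRecognizer is hypothetical.

```java
import java.util.List;

class StmtRecognizer {
    private final List<String> tokens;
    private int pos = 0;

    public StmtRecognizer(List<String> tokens) { this.tokens = tokens; }

    // The single token of lookahead that LL(1) parsing is allowed to inspect.
    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<EOF>"; }

    private void expect(String expected) {
        if (!peek().equals(expected))
            throw new IllegalStateException("expected " + expected + " but found " + peek());
        pos++;
    }

    private boolean atEdgeOp() { return peek().equals("->") || peek().equals("--"); }

    private void id() {
        if (!peek().matches("\\w+"))
            throw new IllegalStateException("expected identifier but found " + peek());
        pos++;
    }

    // stmt ::= node_id node_or_edge_stmt ; returns "node" or "edge"
    public String stmt() {
        id();
        String kind = nodeOrEdgeStmt();
        if (pos != tokens.size())
            throw new IllegalStateException("unexpected token " + peek());
        return kind;
    }

    // node_or_edge_stmt ::= node_stmt | edge_stmt, decided by the next token alone
    private String nodeOrEdgeStmt() {
        if (atEdgeOp()) {
            edgeRHS();
            optAttrs();
            return "edge";
        }
        optAttrs();
        return "node";
    }

    // edgeRHS ::= edgeop node_id [edgeRHS]
    private void edgeRHS() {
        pos++;                       // consume the edgeop the caller already saw
        id();
        if (atEdgeOp()) edgeRHS();
    }

    // Simplified opt_attrs: at most one "[" id "=" id "]" group
    private void optAttrs() {
        if (peek().equals("[")) {
            expect("[");
            id();
            expect("=");
            id();
            expect("]");
        }
    }
}
```

For the token sequence main -> parse -> execute, stmt() returns "edge"; for main [ shape = box ] it returns "node". In both cases the choice is made after node_id has been consumed, which is exactly what the factoring buys.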
EVALUATION
The Dot project was used at Wentworth for the first time in the summer of 2002
with a class of 25 students. No lecture time was used for the project, except when
referring to it to explain features of recursive-descent parsing, including the concept of
first sets. The students met with the professor for a two-hour lab each week. During
the first week, half of the lab was taken up with an introduction to JavaCC. In the
second week, it was shown how to insert actions into the parser to build the parse tree.
There was a brief introduction to ClassBuilder, although students were not required to
use it. Some preferred to use JJTree6. Part of the third week's lab was used to explain
how to do a depth-first traversal of the parse tree to pick out needed information about
the graph's nodes and edges. It was shown how to use the Visitor Design Pattern8 for
this purpose.
Basic Requirements:
All students were required to complete the first five deliverables, namely:
1. Modify the fancy graph grammar by stripping away extra features, so that it is
adequate to describe simple graphs. Call this the SimpleDots grammar.
2. Use JavaCC to build a parser to recognize the SimpleDots language.
3. Add parser actions to build a parse tree from an input sentence.
4. Build Java classes for Graph, Node and Edge. The Graph class should have
collections for subgraphs, for nodes and for edges. Populate the collections by
traversing the parse tree and picking out the nodes and edges described in the
sentence.
5. Verify that Part 4 is working properly, by writing a function to print out the nodes
and edges in the parsed graph.
Challenging the Better Students:
After building the basic parser, better students could extend the project to further
reinforce their parsing skills and to provide a connection to other courses such as
Windows Programming and Computer Graphics. The extra credit parts were:
6. Write a program to draw a Graph object. The first job is to lay out the nodes
on a canvas. The simplest way is to use a square grid without regard to the
overlapping of edges. So, for example, if there are 10 nodes, the next highest
perfect square is 16. Hence, use a 4 by 4 grid with 4 nodes in row 1, 4 in row
2, and 2 in row 3. Draw the nodes as ellipses, draw the labels on top, and draw
the edges below the nodes from center to center. Draw the arrowheads in the
middle of the edges.
7. Improve on Part 6. Place the arrows where the edge strikes the target node.
Make the ellipses the right size to hold their labels. Improve the layout.
8. Do some optimization on the graph that is drawn. For example, make sure
edges are not partially hidden under nodes, and place the nodes so as to
minimize edge lengths.
9. Make the graph interactive. The user should be able to drag any node, with the
links following it.
10. Provide a facility for users to add nodes and edges by clicking and dragging.
Then generate a dot language description for the new or modified graph.
11. Build a parser and graph drawer for the FancyGraph language, including many
of the features described in the article.
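The grid arithmetic in Part 6 can be sketched as follows. The class and method names (SquareGrid, side, cell) are hypothetical, and rows and columns are numbered from 0 here, whereas the text numbers rows from 1.

```java
class SquareGrid {
    // Smallest side length whose square holds all the nodes,
    // e.g. 10 nodes need the perfect square 16, so the side is 4.
    public static int side(int nodeCount) {
        int side = 0;
        while (side * side < nodeCount) {
            side++;
        }
        return side;
    }

    // Row and column (both 0-based) of the node with the given 0-based index,
    // filling the grid row by row.
    public static int[] cell(int index, int side) {
        return new int[] { index / side, index % side };
    }
}
```

With 10 nodes, side(10) is 4, and the last node (index 9) lands in cell(9, 4), which is row 2, column 1: the second and final node of the third row, matching the 4, 4, 2 split described above.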
Of the 25 students, 23 successfully completed the required parts 1 - 5, and 7 students
went on to submit extra credit parts.
But how well did the students learn the lesson? The best way to measure this
was to test them. The following question was included on their next test.
Question 2.
Given the grammar:
<cmd>   ::= <tc> | <m> | <pu> | <pd>
<tc>    ::= <tr> | <tl>
<tr>    ::= "turn" [ "half" ] "right"
<tl>    ::= "turn" [ "half" ] "left"
<m>     ::= "move" <steps>
<pu>    ::= "pen" "up"
<pd>    ::= "pen" "down"
<steps> ::= 1 | 2 | 3
1. Write down the first sets for each of the nonterminals.
2. This grammar is not LL(1). Show that at least one production requires a look
ahead beyond the next token.
3. Rewrite the grammar using left factoring to make it LL(1).
The average score on this question was 80%.
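Parts 1 and 2 of the question can also be checked mechanically. The sketch below computes first sets for the grammar in Question 2 and reports which nonterminals fail the pairwise disjointness test. It assumes that no alternative begins with a symbol that can derive the empty string, which holds for this grammar; the optional "half" is left out of the encoding because it never appears first. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FirstSets {
    private static List<String> alt(String... symbols) { return Arrays.asList(symbols); }

    // Nonterminals are the map keys; every other symbol is a terminal.
    public static Map<String, List<List<String>>> turtleGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("<cmd>", Arrays.asList(alt("<tc>"), alt("<m>"), alt("<pu>"), alt("<pd>")));
        g.put("<tc>", Arrays.asList(alt("<tr>"), alt("<tl>")));
        g.put("<tr>", Arrays.asList(alt("turn", "right")));
        g.put("<tl>", Arrays.asList(alt("turn", "left")));
        g.put("<m>", Arrays.asList(alt("move", "<steps>")));
        g.put("<pu>", Arrays.asList(alt("pen", "up")));
        g.put("<pd>", Arrays.asList(alt("pen", "down")));
        g.put("<steps>", Arrays.asList(alt("1"), alt("2"), alt("3")));
        return g;
    }

    // First set of a symbol: the terminals that can begin a string derived from it.
    public static Set<String> first(Map<String, List<List<String>>> g, String symbol) {
        Set<String> result = new TreeSet<>();
        if (!g.containsKey(symbol)) {          // terminal
            result.add(symbol);
            return result;
        }
        for (List<String> alternative : g.get(symbol)) {
            result.addAll(first(g, alternative.get(0)));
        }
        return result;
    }

    // Nonterminals with two alternatives whose first sets overlap.
    public static List<String> conflicts(Map<String, List<List<String>>> g) {
        List<String> failing = new ArrayList<>();
        for (Map.Entry<String, List<List<String>>> rule : g.entrySet()) {
            List<List<String>> alts = rule.getValue();
            boolean clash = false;
            for (int i = 0; i < alts.size(); i++) {
                for (int j = i + 1; j < alts.size(); j++) {
                    Set<String> common = first(g, alts.get(i).get(0));
                    common.retainAll(first(g, alts.get(j).get(0)));
                    if (!common.isEmpty()) clash = true;
                }
            }
            if (clash) failing.add(rule.getKey());
        }
        return failing;
    }
}
```

Under these assumptions, conflicts(turtleGrammar()) reports <cmd> (because <pu> and <pd> both begin with "pen") and <tc> (because <tr> and <tl> both begin with "turn"), which are the productions that left factoring has to repair.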
CONCLUSION
Students benefit from robust programming exercises designed to support theory
and concepts discussed in lecture. Wentworth is special in that professors directly
support students in the laboratory environment. However, the programming project
described here should be useful in programming language courses at any school.
Laboratory exercises should be timed to coordinate with lectures. In the best
situation, students become aware of the need to resolve issues that come up in their
programming. This happened when parse conflicts occurred with the Dot language.
This motivated the next lecture’s discussion of first sets, choice conflicts, LL(n)
languages and left factoring. I emphasize that this was a discussion, since some
students had come up with answers on their own, and were eager to share them with
the rest of the class.
Tool support is an essential part of these exercises. Using the right tools allows
students to focus on the programming language issues, rather than on the minutiae of
building complex programs such as parsers from scratch.
In an informal survey, students indicated that building a Dot language parser was
one of the most informative parts of the programming languages course. We plan to
offer this exercise again, possibly with a different special-purpose language. The
exact nature of the language is unimportant. It should have a sufficiently compelling
purpose, so as not to appear contrived, and should be small enough so that students are
not overwhelmed by repetitious detail. Students are enlivened by the programming
experience, and practical questions inevitably come up that can best be resolved by
studying more of the theory.
REFERENCES
1. Eleftherios Koutsofios & Stephen C. North, Drawing Graphs with dot, AT&T Bell Laboratories,
Murray Hill, NJ, 1993. Available from http://citeseer.nj.nec.com/331854.html .
2. Saumya Debray, Making Compiler Design Relevant for Students who will (most likely) Never Design
a Compiler, SIGCSE Bulletin, Volume 34, Number 1, March 2002.
3. Robert W. Sebesta, Concepts of Programming Languages, Fifth Edition, Addison Wesley, 2002.
4. WebGain, JavaCC, http://www.webgain.com/products/java_cc/ .
5. Chuck McManis, Looking for lex and yacc for Java? You don't know Jack, Java
World, December 1996, http://www.javaworld.com/javaworld/jw-12-1996/jw-12jack.html.
6. Jens Palsberg, Java Tree Builder, available from http://www.cs.purdue.edu/jtb/ .
7. Michael Werner, ClassBuilder, available from http://ProfessorWerner.com/ClassBuilder.html .
8. Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides, Design Patterns, Elements of Reusable
Object-Oriented Software, Addison Wesley, 1995.