A PARSER PROJECT IN A PROGRAMMING LANGUAGES COURSE

Michael Werner
Department of Computer Science and Systems
Wentworth Institute of Technology
Boston, MA 02115
(617) 734-6780
wernerm@wit.edu

ABSTRACT

This experience report describes a programming exercise designed to reinforce the concepts of parsing, regular and context-free grammars, first sets, and abstract syntax trees. Students build a parser for a language with real applications. To make this feasible, tools are used, including a compiler-compiler and automated support for the Visitor design pattern. The exercise is extensible.

INTRODUCTION

The Programming Languages course at Wentworth Institute of Technology is unusual in that it requires a supervised laboratory component. The laboratory is an essential part of Wentworth's model of hands-on education. The challenge in the programming languages course is to come up with a series of laboratory projects that are directly relevant to the concepts discussed in the lectures. This paper describes one such project, the building of a parser for the Dot graphics language. The project was first used in the summer of 2002.

A PARSER FOR THE DOT LANGUAGE

The Dot language [1] was developed at AT&T Bell Laboratories as a means of describing node and link graphs so that they can be automatically laid out and displayed. The idea of using the Dot language in a laboratory project came from Saumya Debray [2]. The attractions of using it in Wentworth's Programming Languages course are as follows:

1. The Programming Languages course spends several weeks on lexing and parsing, so a laboratory exercise of the same duration is an effective means of reinforcing the concepts.
2. Building a parser for a real programming language, such as Java, is too big a job. On the other hand, using a toy language is contrived and trivializes the student's effort.
3. A small subset of the Dot language is sufficient to describe very simple graphs.
The full language is used to describe fancier graphs that allow for colors, line thicknesses, subgraphs, etc. This allows for a basic assignment of building a parser for simple graphs, and an extra credit assignment of building one for fancy graphs.
4. The article "Drawing graphs with dot" [1] provides a full context-free grammar for fancy graphs. Subtracting from this grammar to derive a minimal grammar for simple graphs is a useful exercise in context-free grammars.

Figure 1 shows examples of simple and fancy graphs drawn from specifications written in the Dot language. A grammar for that language is shown in Figure 2.

Building the Parser

This section describes the tools used to build a parser for the Dot language. In choosing tools, the goal was to reinforce the primary approaches stressed in the lectures, as well as to introduce students to programming techniques that will prove useful to them in the future.

JavaCC [4] was chosen as the compiler-compiler tool. The reasons were:

1. JavaCC generates a recursive-descent parser. Recursive-descent parsing is the first approach used in our textbook (Sebesta [3]). It is also the approach most thoroughly discussed in the lectures, and the easiest for students to understand. For example, it does not require the sometimes-baffling parsing tables associated with shift-reduce parsers.
2. JavaCC is free software, first developed at Sun Microsystems and now available from WebGain [4].
3. An excellent introduction to using JavaCC (formerly known as Jack, to rhyme with Yacc) by Chuck McManis appeared in Java World [5] and is also available online.
4. JavaCC also includes a tool, JJTree, useful for constructing parse trees [6].
5. As a Java tool, JavaCC and the parsers it generates run on all platforms.
6. A large number of sample programs, as well as complete grammars, are available for JavaCC.

ClassBuilder [7] was used to generate boilerplate code for needed classes.
The internal nodes of the parse tree are constructed as objects representing nonterminals in the grammar. They have fields referencing their children to support tree traversals. Classes are therefore needed for each production in the grammar. A number of other classes are also needed in the extensions to the project. Each class needs the usual run of constructors, getters, setters, etc. Some fields are multi-valued. ClassBuilder is a free tool that generates the necessary boilerplate code starting from a very simple description of the classes. Multi-valued fields can be represented as either vectors or hashtables, and ClassBuilder generates the code to support these fields. Optionally, ClassBuilder can also generate an equals method, which does a deep comparison of two objects, and a clone method. It can also generate code to support the Visitor Design Pattern [8].

Simple graph:

    digraph G {
        main -> parse -> execute;
        main -> init;
        main -> cleanup;
        execute -> make_string;
        execute -> printf;
        init -> make_string;
        main -> printf;
        execute -> compare;
    }

Fancy graph:

    digraph G {
        size = "4.4";
        main [shape=box]; /*comment*/
        main -> parse [weight=8];
        parse -> execute;
        main -> init [style=dotted];
        main -> cleanup;
        execute -> {make_string; printf};
        init -> make_string;
        edge [color=red];
        main -> printf [style=bold,label="100 times"];
        make_string [label="make a\nstring"];
        node [shape=box, style=filled, color=".7 .3 1.0"];
        execute -> compare;
    }

Figure 1: On the left is a simple Dot graph, on the right a fancy graph. The graphs are described below them using the Dot language. The pictures are taken from the Dot User's Manual.
    graph     -> [strict] ( digraph | graph ) id "{" stmt-list "}"
    stmt-list -> [ stmt [";"] [stmt-list] ]
    stmt      -> attr-stmt | node-stmt | edge-stmt | subgraph | id "=" id
    attr-stmt -> ( graph | node | edge ) "[" [attr-list] "]"
    attr-list -> id "=" id [attr-list]
    node-stmt -> node-id [opt-attrs]
    node-id   -> id [ ":" id ]
    opt-attrs -> "[" attr-list "]"
    edge-stmt -> ( node-id | subgraph ) edgeRHS [opt-attrs]
    edgeRHS   -> edgeop ( node-id | subgraph ) [edgeRHS]
    subgraph  -> [subgraph id] "{" stmt-list "}" | subgraph id

Literal punctuation is quoted; unquoted square brackets enclose optional items and parentheses group alternatives. An edgeop is -> in directed graphs and -- in undirected graphs.

Figure 2: A grammar for the Dot language

The Visitor Design Pattern [8] was used to traverse the parse tree and extract graph information from it. Once the parse tree representing a Dot graph description is constructed, students need a way to traverse it to retrieve the nodes, edges, attributes, etc. of the Dot graph. The information about an edge is not stored in a single parse tree node. The source of an edge is discovered first, in either an edge-stmt node or an edgeRHS node, and must be stored. Further on in an in-order traversal, the target of the edge is found in an edgeRHS node. The source is then retrieved, and the completed edge object is added to the collection of edges that make up the Dot graph. The traversal needs to be written so that objects can be temporarily stored and various types of processing can be done at internal nodes, depending on which class they belong to.

The Visitor Design Pattern [8] is ideal for such a task. A Visitor object, containing fields for the items that need to be remembered, is sent on a depth-first traversal of the parse tree. The visitor is equipped with numerous before and after methods, each taking a single parameter that is an object of a class used in the parse tree, e.g. before(stmt), before(edgeRHS), etc. Dispatch is on both the name of the method and the type of the parameter. The parse tree objects are equipped with visit methods taking a visitor object as parameter.
The visit method calls the visitor's before method, passing itself as parameter, iterates through its children calling their visit methods, and finally calls the visitor's after method. This way the visitor can collect information at the parse tree nodes. Furthermore, the code for creating abstract visitors and providing support at each parse tree node for the needed traversals can be generated automatically using either JJTree or ClassBuilder.

To show what can be done, imagine a train consisting of an engine followed by 0 or more cars. Using ClassBuilder notation, this is written:

    Train = <engine> Engine <cars>* Car.

That is, a Train object has a field named engine of type Engine and a field named cars, which is a collection of Car objects. ClassBuilder generates an abstract Visitor class:

    abstract class Visitor {
        public Visitor() {}
        public void before(Train host) {}
        public void after(Train host) {}
        public void before(Engine host) {}
        public void after(Engine host) {}
        public void before(Car host) {}
        public void after(Car host) {}
    }

And a visit method for the Train class:

    public void visit(Visitor v) {
        v.before(this);                      // pre-visit hook for this Train
        if (engine != null) engine.visit(v);
        Enumeration enumCars = getCars().elements();
        while (enumCars.hasMoreElements()) {
            Car it = (Car) enumCars.nextElement();
            it.visit(v);
        }
        v.after(this);                       // post-visit hook for this Train
    }

INTEGRATION OF THE PROJECT INTO THE LECTURE

When taught in 2002, the parser project was begun before the lectures on parsing and lexing were complete. This was done deliberately, since it was inevitable that some problems would arise from the project that would stimulate further investigation of the theory involved. In fact, students reported the following messages received from JavaCC when they first tried to use it with a simplified Dot grammar.

1. Warning: "graph" cannot be matched as a string literal token at line 116, column 3. It will be matched as <IDENTIFIER>.

The solution was simple. The token for "graph" needed to be entered prior to that for <IDENTIFIER>.
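The effect of declaration order can be illustrated with a small hand-written classifier. This is only a sketch and not JavaCC's implementation: the class TokenOrderDemo, its methods, and the token names GRAPH and DIGRAPH are hypothetical. It assumes a lexer that, for a given lexeme, picks the first rule in declaration order that matches.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of "first declared rule wins"; not generated by JavaCC.
class TokenOrderDemo {

    // Returns the name of the first rule, in declaration order, matching the whole lexeme.
    static String classify(Map<String, Pattern> rules, String lexeme) {
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (rule.getValue().matcher(lexeme).matches()) {
                return rule.getKey();
            }
        }
        return "ERROR";
    }

    // Builds a rule list with the keyword rules declared either before or after IDENTIFIER.
    static Map<String, Pattern> rules(boolean keywordsFirst) {
        Pattern identifier = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");
        Map<String, Pattern> rules = new LinkedHashMap<>(); // keeps declaration order
        if (!keywordsFirst) rules.put("IDENTIFIER", identifier);
        rules.put("GRAPH", Pattern.compile("graph"));
        rules.put("DIGRAPH", Pattern.compile("digraph"));
        if (keywordsFirst) rules.put("IDENTIFIER", identifier);
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(classify(rules(false), "graph")); // IDENTIFIER shadows the keyword
        System.out.println(classify(rules(true), "graph"));  // keyword rule is reached first
        System.out.println(classify(rules(true), "main"));   // ordinary identifier
    }
}
```

With IDENTIFIER declared first, the keyword rule can never be reached, which is the situation the warning describes.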
In the next day's lecture, this problem was used to lead into a discussion of the lexing process. The lexer was pictured as a finite state automaton. When the automaton recognized an identifier-like string, it called a lookup function to match the string with a token. In JavaCC, the lookup function scans the list of tokens top to bottom, which requires that the <IDENTIFIER> token be declared after the other string literal tokens.

2. Warning: Choice conflict involving two expansions at line 140, column 10 and line 140, column 24 respectively. A common prefix is: <IDENTIFIER>. Consider using a lookahead of 2 for earlier expansion.

The problem came from the following productions (from a simplified grammar):

    stmt      ::= node_stmt | edge_stmt
    node_stmt ::= node_id [opt_attrs]
    edge_stmt ::= node_id edgeRHS [opt_attrs]

Clearly, in parsing a stmt the parser cannot decide between node_stmt and edge_stmt based on the next token alone. The next day the concept of first sets was introduced in the lecture. The first sets for node_stmt and edge_stmt failed the pairwise disjointness test, hence the warning from JavaCC. The warning suggested a lookahead to the second token. In the Dot grammar, the second token for edge_stmt is either "--" or "->", neither of which can be the second token for node_stmt. So one way to solve this using JavaCC is:

    void stmt() : {}
    {
        ( LOOKAHEAD(2) ("--" | "->") edge_stmt()
        | node_stmt() )
    }

Note that the order of the choices for stmt was switched, and JavaCC was instructed to look ahead to the second input token: if it is an arrow or a dash, edge_stmt is chosen; otherwise node_stmt is chosen.

This led into a discussion of the LL numbering system. The Dot grammar was LL(2), since all choices could be resolved with a lookahead of 2. However, LL(1) grammars produce faster parsers. Was there a way of modifying the Dot grammar to make it LL(1)? This brought up left factoring.
Since both choices started with a node_id, this could be factored out as follows:

    stmt              ::= node_id node_or_edge_stmt
    node_or_edge_stmt ::= node_stmt | edge_stmt
    node_stmt         ::= [opt_attrs]
    edge_stmt         ::= edgeRHS [opt_attrs]

This made the Dot grammar LL(1), a fact the students could confirm when it passed the JavaCC test without any warnings.

EVALUATION

The Dot project was used at Wentworth for the first time in the summer of 2002 with a class of 25 students. No lecture time was used for the project, except when referring to it to explain features of recursive-descent parsing, including the concept of first sets. The students met with the professor for a two-hour lab each week. During the first week, half of the lab was taken up with an introduction to JavaCC. In the second week, it was shown how to insert actions into the parser to build the parse tree. There was a brief introduction to ClassBuilder, although students were not required to use it. Some preferred to use JJTree [6]. Part of the third week's lab was used to explain how to do a depth-first traversal of the parse tree to pick out needed information about the graph's nodes and edges. It was shown how to use the Visitor Design Pattern [8] for this purpose.

Basic Requirements:

All students were required to complete the first five deliverables, namely:

1. Modify the fancy graph grammar by stripping away extra features, so that it is adequate to describe simple graphs. Call this the SimpleDots grammar.
2. Use JavaCC to build a parser to recognize the SimpleDots language.
3. Add parser actions to build a parse tree from an input sentence.
4. Build Java classes for Graph, Node and Edge. The Graph class should have collections for subgraphs, for nodes and for edges. Populate the collections by traversing the parse tree and picking out the nodes and edges described in the sentence.
5. Verify that Part 4 is working properly by writing a function to print out the nodes and edges in the parsed graph.
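Deliverables 4 and 5 might be approached along the following lines. This is a minimal sketch, not the solution used in the course: the field names and the methods node, addEdge and describe are assumptions, and the parse tree traversal that would normally call addEdge is replaced here by direct calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class shapes; the assignment names the classes but not their members.
class Node {
    final String id;
    Node(String id) { this.id = id; }
}

class Edge {
    final Node source;
    final Node target;
    Edge(Node source, Node target) { this.source = source; this.target = target; }
}

class Graph {
    final String name;
    final boolean directed;
    // Insertion-ordered so that printed output follows the order of first mention.
    final Map<String, Node> nodes = new LinkedHashMap<>();
    final List<Edge> edges = new ArrayList<>();
    final List<Graph> subgraphs = new ArrayList<>();

    Graph(String name, boolean directed) {
        this.name = name;
        this.directed = directed;
    }

    // Returns the node with this id, creating it on first mention.
    Node node(String id) {
        return nodes.computeIfAbsent(id, Node::new);
    }

    // Records an edge; both endpoints are added to the node collection if new.
    void addEdge(String from, String to) {
        edges.add(new Edge(node(from), node(to)));
    }

    // Deliverable 5: a printable listing of the nodes and edges.
    String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append(directed ? "digraph " : "graph ").append(name).append('\n');
        for (Node n : nodes.values()) {
            sb.append("  node ").append(n.id).append('\n');
        }
        String op = directed ? " -> " : " -- ";
        for (Edge e : edges) {
            sb.append("  edge ").append(e.source.id).append(op).append(e.target.id).append('\n');
        }
        return sb.toString();
    }
}

class SimpleDotsDemo {
    public static void main(String[] args) {
        Graph g = new Graph("G", true);
        g.addEdge("main", "parse");   // in the project, a visitor would make these calls
        g.addEdge("parse", "execute");
        g.node("cleanup");            // a node that appears in no edge
        System.out.print(g.describe());
    }
}
```

Keying the node collection by id means a node mentioned in several edges is stored once, which matters because Dot allows nodes to be introduced implicitly by the edges that use them.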
Challenging the Better Students:

After building the basic parser, better students could extend the project to further reinforce their parsing skills and to provide a connection to other courses such as Windows Programming and Computer Graphics. The extra credit parts were:

6. Write a program to draw a Graph object. The first job is to lay out the nodes on a canvas. The simplest way is to use a square grid without regard to the overlapping of edges. So, for example, if there are 10 nodes, the next highest perfect square is 16. Hence, use a 4 by 4 grid with 4 nodes in row 1, 4 in row 2, and 2 in row 3. Draw the nodes as ellipses, draw the labels on top, and draw the edges below the nodes from center to center. Draw the arrowheads in the middle of the edges.
7. Improve on Part 6. Place the arrows where the edge strikes the target node. Make the ellipses the right size to hold their labels.
8. Improve the layout. Do some optimization on the graph that is drawn. For example, make sure edges are not partially hidden under nodes, and place the nodes so as to minimize edge lengths.
9. Make the graph interactive. The user should be able to drag any node, with the links following it.
10. Provide a facility for users to add nodes and edges by clicking and dragging. Then generate a Dot language description for the new or modified graph.
11. Build a parser and graph drawer for the FancyGraph language, including many of the features described in the article.

Of the 25 students, 23 successfully completed the required parts 1 - 5, and 7 went on to submit extra credit parts.

But how well did the students learn the lesson? The best way to measure this was to test them. The following question was included on their next test.

Question 2. Given the grammar:

    <cmd>   ::= <tc> | <m> | <pu> | <pd>
    <tc>    ::= <tr> | <tl>
    <tr>    ::= "turn" [ "half" ] "right"
    <tl>    ::= "turn" [ "half" ] "left"
    <m>     ::= "move" <steps>
    <pu>    ::= "pen" "up"
    <pd>    ::= "pen" "down"
    <steps> ::= 1 | 2 | 3

1. Write down the first sets for each of the nonterminals.
2. This grammar is not LL(1). Show that at least one production requires a lookahead beyond the next token.
3. Rewrite the grammar using left factoring to make it LL(1).

The average score on this question was 80%.

CONCLUSION

Students benefit from robust programming exercises designed to support the theory and concepts discussed in lecture. Wentworth is special in that professors directly support students in the laboratory environment. However, the programming project described here should be useful in programming language courses at any school.

Laboratory exercises should be timed to coordinate with lectures. In the best situation, students become aware of the need to resolve issues that come up in their programming. This happened when parse conflicts occurred with the Dot language, which motivated the next lecture's discussion of first sets, choice conflicts, LL(n) languages and left factoring. I emphasize that this was a discussion, since some students had come up with answers on their own and were eager to share them with the rest of the class.

Tool support is an essential part of these exercises. Using the right tools allows students to focus on the programming language issues, rather than on the minutiae of building complex programs such as parsers from scratch.

In an informal survey, students indicated that building a Dot language parser was one of the most informative parts of the programming languages course. We plan to offer this exercise again, possibly with a different special-purpose language. The exact nature of the language is unimportant. It should have a sufficiently compelling purpose, so as not to appear contrived, and should be small enough that students are not overwhelmed by repetitious detail. Students are enlivened by the programming experience, and practical questions will inevitably come up that can best be resolved by studying more of the theory.
REFERENCES

[1] Eleftherios Koutsofios & Stephen C. North, Drawing Graphs with dot, AT&T Bell Laboratories, Murray Hill, NJ, 1993. Available from http://citeseer.nj.nec.com/331854.html .
[2] Saumya Debray, Making Compiler Design Relevant for Students who will (most likely) Never Design a Compiler, SIGCSE Bulletin, Volume 34, Number 1, March 2002.
[3] Robert W. Sebesta, Concepts of Programming Languages, Fifth Edition, Addison Wesley, 2002.
[4] WebGain, JavaCC, http://www.webgain.com/products/java_cc/ .
[5] Chuck McManis, Looking for lex and yacc for Java? You don't know Jack, Java World, December 1996, http://www.javaworld.com/javaworld/jw-12-1996/jw-12jack.html .
[6] Jens Palsberg, Java Tree Builder, available from http://www.cs.purdue.edu/jtb/ .
[7] Michael Werner, ClassBuilder, available from http://ProfessorWerner.com/ClassBuilder.html .
[8] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison Wesley, 1995.