tom -- A Demo for the Tomita Parsing Algorithm

(0) Introduction

This is a free demo of the Tomita parsing algorithm.  It is released to
the public domain, free of any restrictions.  You may use it for any
purpose you see fit.

(1) Command line arguments

The command line has the following format:

   tom -c -s grammar

The listed file, "grammar", should contain a grammar written in a form
suitable for processing by the parser generator.  The options -c and -s,
if you use them, have the following effects:

   -c   Writes the LR(0) parsing tables for the grammar file to
        standard output.
   -s   Writes the parsing stack of each parse to standard output.

Inputs to be parsed are read from standard input, and the results are
written to standard output.

Example:

   tom -s gram0 <in >out

will process the grammar listed in "gram0", read input from "in", and
write a listing of the parse results and parse stacks to the file "out".

(2) Grammars

The file gram0 contains a small demo grammar.  In it, you will see rules
of the following types:

   (1) Productions, e.g.,                    NP = d n | NP PP.
   (2) Lexical Classification Rules, e.g.,   p: in with.
   (3) Start Symbol Declaration, e.g.,       * S.

Also, as a special case of (2), one can list a rule such as:

   d.

which is equivalent to the lexical rule:

   d: "d".

Items may be enclosed in double quotes, and MUST be if they contain any
characters other than alphanumerics, or if they start with a digit.

A Start Symbol can only be declared once.  If it is not declared, the
symbol on the left-hand side of the first rule in the grammar file will
be treated as the start symbol.

The entries on the right-hand side of lexical classification rules are
treated as literals.  These are the actual items that the parser will
expect to see when parsing an input.  They are distinct from the other
parsing symbols, so that, for instance, if one has a rule such as:

   with = with1 | with2.

this "with" is treated as an item distinct from the literal "with".
To avoid confusion, avoid using the same names for lexical entries and
parsing symbols.

(3) Input Format

Each input is assumed to consist of a list of items separated by spaces,
all on the same line.  Each line is treated as a separate entry for the
parser.  If you are using the parser interactively, typing an empty line
will end the program.  Sample files are included with this program.

(4) Output Format

(a) Parsing Tables

With the -c option, the LR(0) parsing table is written to standard
output before the input is parsed.  The reason for this option is that
the parsing stack makes reference to parsing states, so this listing is
provided as a cross-reference.  The parsing table for the file gram0 is
listed in gram0.tab.

The listing consists of entries of the following form:

   State: list of items ...

For example, the start state (state 0) in gram0.tab is listed as:

   0:   S => 1   NP => 2   d => 3

Items can take any of the following forms, exemplified by excerpts from
gram0.tab:

(i) ACCEPT

   1: accept

This indicates that state 1 is the accepting state of the LR(0) parser.
The accepting state will always be state 1.

(ii) SHIFTS and GOTOS

   3: n => 8

This example indicates a shift (or goto) on item "n" from state 3 to
state 8.  The difference between shifts and gotos in LR(k) parsing has
no bearing on the software here, so they are listed in a common format.

(iii) REDUCTIONS

   4: [NP -> NP PP]

This indicates that in state 4, a reduction action may be applied
whereby the last two items on the parsing stack (which will always be
nodes of types PP and NP) are combined into a new node of type NP.  A
goto on the item NP is then executed from the state located two levels
back in the stack to a new state.

(b) Parsing Stack

The output corresponding to the file "in" (processed using grammar
gram0) is listed in "out".
Included in this listing is the following listing of the parsing stack:

   Parse Stack:
      v_0_0
      v_1_3  <= [ d_0_1 ]  <= v_0_0
      v_2_8  <= [ n_1_2 ]  <= v_1_3
      v_2_2  <= [ NP_0_2 ] <= v_0_0
      v_3_7  <= [ v_2_3 ]  <= v_2_2
      v_4_3  <= [ d_3_4 ]  <= v_3_7
      v_5_8  <= [ n_4_5 ]  <= v_4_3
      v_5_11 <= [ NP_3_5 ] <= v_3_7
      v_5_5  <= [ VP_2_5 ] <= v_2_2
      v_5_1  <= [ S_0_5 ]  <= v_0_0

The stack is actually a graph consisting of nodes labeled v_N_S, with
edges labeled by nodes of the form [ X_M_N ].  All connections are of
the form:

   v_N_T <= [ X_M_N ] <= v_M_S1 v_M_S2 ...

and the following properties will always hold:

   (a) SHIFT(S1, X) = SHIFT(S2, X) = ... = T
   (b) M <= N.

The meanings of the labels are as follows:

   v_N_S:   N = the parsing position in the input.
            S = the parsing state.
   X_M_N:   X = the parsing item.
            M, N = the start and end positions of the item in the input.

If the input is labeled as follows:

   0     1     2     3     4      5
     the   boy   saw   the   girl

then the edge:

   v_5_11 <= [ NP_3_5 ] <= v_3_7

indicates that the parser went from state 7 at position 3 to state 11 at
position 5 by shifting an item of type NP parsed from the items
"the girl".

More than one edge may be labeled with the same node.

(c) Parse Forest

Also in the file "out" is the parse forest corresponding to the input:

   the boy saw the girl

processed with grammar gram0.  This forest is listed in equational form
as a grammar, which can itself be given to the parser as a grammar file.
This grammar is characterized as the largest sub-grammar of the input
grammar that generates the one-element set { the input }.

What this illustrates, among other things, is that this algorithm is
actually a special case of a more general algorithm that computes the
intersection of a context-free grammar with a regular grammar.
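
As an illustration of the stack-label conventions described under
"Parsing Stack" in section (4), the following is a minimal Python sketch
that decodes one edge of the stack listing and recovers the input words
it covers.  It is not part of tom.  The names Vertex, Node, parse_edge
and covered_words are hypothetical, and the sketch assumes each edge is
written on a single line in the form
"v_N_T <= [ X_M_N ] <= v_M_S1 v_M_S2 ...".  Property (a), the SHIFT
relation, is not checked, since that would require the parsing table.

```python
import re
from typing import List, NamedTuple, Tuple


class Vertex(NamedTuple):
    position: int  # N in v_N_S: parsing position in the input
    state: int     # S in v_N_S: parsing state


class Node(NamedTuple):
    symbol: str  # X in X_M_N: the parsing item
    start: int   # M in X_M_N: start position in the input
    end: int     # N in X_M_N: end position in the input


VERTEX_RE = re.compile(r"v_(\d+)_(\d+)$")
NODE_RE = re.compile(r"(.+)_(\d+)_(\d+)$")


def parse_vertex(label: str) -> Vertex:
    m = VERTEX_RE.match(label)
    if m is None:
        raise ValueError(f"not a vertex label: {label!r}")
    return Vertex(int(m.group(1)), int(m.group(2)))


def parse_node(label: str) -> Node:
    m = NODE_RE.match(label)
    if m is None:
        raise ValueError(f"not a node label: {label!r}")
    return Node(m.group(1), int(m.group(2)), int(m.group(3)))


def parse_edge(text: str) -> Tuple[Vertex, Node, List[Vertex]]:
    """Decode 'v_N_T <= [ X_M_N ] <= v_M_S1 v_M_S2 ...'."""
    parts = [p.strip() for p in text.split("<=")]
    if len(parts) != 3:
        raise ValueError(f"not an edge: {text!r}")
    target = parse_vertex(parts[0])
    node = parse_node(parts[1].strip("[] "))
    sources = [parse_vertex(s) for s in parts[2].split()]
    # Positions must line up as in the connection form, and M <= N.
    if node.start > node.end:
        raise ValueError("start position exceeds end position")
    if target.position != node.end:
        raise ValueError("target vertex is not at the node's end position")
    if any(s.position != node.start for s in sources):
        raise ValueError("source vertex is not at the node's start position")
    return target, node, sources


def covered_words(node: Node, words: List[str]) -> List[str]:
    """Return the input items spanned by the node."""
    return words[node.start:node.end]


if __name__ == "__main__":
    target, node, sources = parse_edge("v_5_11 <= [ NP_3_5 ] <= v_3_7")
    words = "the boy saw the girl".split()
    print(" ".join(covered_words(node, words)))  # the girl
```

Because input positions fall between words, the slice from M to N picks
out exactly the items that the node was parsed from.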