Reference Extraction from Environmental Regulation Texts
Shawn Kerrigan
Spring 2001

1 Introduction

1.1 Problem to be investigated

Environmental regulation provisions contain a large number of casual English references to other regulation provisions. Extracting complete references that make the referenced provisions explicit poses an interesting NLP problem. I developed three research questions related to this information extraction task that I sought to answer with this project. Can an effective parser be built to recognize and transform regulation references into a standard format? Can an n-gram model be built to help the parser "skim" through a document very quickly without missing many references? Would the n-gram reference-prediction model be a useful tool for context-free grammar development?

1.2 Approach

The references I attempted to make explicit range from simple ("as stated in 40 CFR section 262.14(a)(2)") to more complex ("the requirements in subparts G through I of this part"). These references can be converted into lists of complete references such as 40.cfr.262.14.a.2, and 40.cfr.265.G, 40.cfr.265.H, 40.cfr.265.I, respectively. Parsing these references required the development of a context-free grammar (CFG) and a semantic representation/interpretation system.

Statistical models were used to investigate the ability to predict that a reference exists even when the reference could not be parsed using the grammar specification. N-gram models were learned from the text preceding references that the parser identified. These n-gram models were used to return sections of text that appeared to have unidentified references within them, so as to facilitate development of a more robust reference parser. This method provided automatic feedback on missed references so that the most difficult examples from the corpus could be used as targets for further parser development.

The statistical model can also be used to accelerate the parsing process by only parsing parts of the corpus that appear to contain references. Experimenting with the statistical model dramatically improved the parsing speed. When the predictions are accurate, the CFG parser only needs to process text in areas of predicted references, allowing it to skim over significant portions of the corpus.

1.3 Running the program

The source code, makefile, corpus files and sample results files are included with this report. All coding was done in Java. There are three basic ways to run the program from the command line:

> java RefFinder [-fsi] [grammar file] [lexicon file] [train corpus]
> java RefFinder [-fsi] [grammar file] [lexicon file] [train corpus] [test corpus]
> java RefFinder [-fsi] [grammar file] [lexicon file] [train corpus] [test corpus] [unigram lambda] [bigram lambda] [trigram lambda]

The [-fsi] options are:
f = save the references that are found in "found.refs"
s = save the suspected references in "suspect.refs"
i = insert the references found in the training file back into the XML document

If the unigram, bigram and trigram lambdas are not specified, default values are used.
An example run of the parser to extract a few references is:

> java RefFinder -f parser.gram parser.lex sample.xml

A larger example run of the program with training is:

> java RefFinder -fs parser.gram parser.lex 40P0262.xml 40P0261.xml

2 Overall System Architecture

The problem of extracting good references from the text source was divided into a number of tasks, and specialized components were designed for each particular task. Figure 1 shows the five core modules for the reference extraction system. In addition to these five modules, seven other modules were developed to perform subtasks and serve as data structures.

[Figure 1. Basic System Architecture: RefFinder.java, RegXMLReader.java, NGramModel.java, Parser.java, ParseTreeParser.java]

The RefFinder.java module controls the overall operation of the program. The basic operation of this module is to initialize a new parser, open the regulation XML file (through RegXMLReader.java), and build an n-gram model (using NGramModel.java) by parsing through the regulation file. In this basic mode of operation the program can write all the identified references to a "found.refs" file. The RefFinder module includes extended functionality for a variety of tasks, such as running accelerated parses (using the n-gram models) through a second regulation XML file, evaluating probabilistic versus complete document parses, running brute-force searches through possible values for the probabilistic parameters, and producing output files containing input lines that could not be parsed but were predicted to contain references.

The RegXMLReader.java module is used to read the XML regulations. It implements some standard file reading operations like readline(), close(), etc., but specializes these operations to the parsing task. The readline() method returns the next line of text that is contained within a <regulation_text> </regulation_text> element. The object provides information on where within the regulation the current text is coming from (what subpart, section, etc.). It also allows the RefFinder to provide references that will be added to the XML file as additional tags immediately following the closing </regulation_text> tag.

The NGramModel.java module records all the string tokens preceding identified references during the training phase, and then counts the number of times these n-grams appear in the source file to build probability estimates. The n-gram model differs from a standard approach in that not all possible n-grams are constructed from the corpus vocabulary to estimate their probabilities of occurrence. Only token sequences that might predict a reference following them are important, so no smoothing is done and probabilities are only calculated for token sequences that actually precede a reference during training. The probability is calculated as (# of times preceding a reference / # of times occurring in the corpus file). Requesting a probability value for any other token sequence returns zero.
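To make the estimate concrete, here is a minimal sketch of the counting scheme just described. It is an illustration only, not the actual NGramModel.java code; the class and method names are assumptions.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of the reference-prediction estimate described above.
    // Not the actual NGramModel.java implementation; names are assumptions.
    class NGramSketch {
        private final Map<String, Integer> precedesRefCount = new HashMap<>(); // times the n-gram preceded a reference
        private final Map<String, Integer> corpusCount = new HashMap<>();      // times the n-gram occurred anywhere

        // first pass: record n-grams seen immediately before an identified reference
        void sawBeforeReference(String ngram) {
            precedesRefCount.merge(ngram, 1, Integer::sum);
        }

        // second pass: count total corpus occurrences, but only for n-grams that ever preceded a reference
        void sawInCorpus(String ngram) {
            if (precedesRefCount.containsKey(ngram)) {
                corpusCount.merge(ngram, 1, Integer::sum);
            }
        }

        // P(reference follows | n-gram) = (# times preceding a reference) / (# times in corpus);
        // any n-gram never seen before a reference gets probability zero (no smoothing).
        double probability(String ngram) {
            Integer refs = precedesRefCount.get(ngram);
            Integer total = corpusCount.get(ngram);
            if (refs == null || total == null || total == 0) return 0.0;
            return (double) refs / total;
        }
    }

The key point is that nothing is smoothed: a token sequence that never preceded a reference during training simply returns probability zero.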
The Parser.java module is a modified simple tabular parser. The parser takes a string of tokens as input and attempts to build the longest parse it can starting with the first token. For example, in the text "40 CFR 262.12(a) is used to…", the reference should be parsed as "40 CFR 262.12(a)" and not as "40 CFR". Some special features developed to accommodate these types of parses were to implement matching functions for all the special tokens (bracketed letters, numbers, section symbols, etc.) and to modify the termination conditions for the tabular parsing algorithm. Parses are considered complete if the category stack is empty, even if the input stack still contains tokens. The complete correct parse is selected from the set of all possible parses by identifying the parse that consumed the largest number of input stack tokens. The parsing system is described in more detail below.

The ParseTreeParser.java module does the semantic interpretation by parsing through the parse tree constructed by Parser.java and returning a list of well-formed references. This module is also at its core a simple tabular parser that processes parse trees in a modified depth-first mode, treating them like input stacks. The grammar for this module is restricted to rules that start with "REF", essentially reducing the grammar to a list of legal well-formed reference templates. The lexicon for this module specifies how to treat the various categories developed for the Parser.java module when this semantics processor comes across them, and there are special processing rules for all of the categories. The use of these grammar and lexicon specification files builds a great deal of flexibility into the parsing system. These specification files should also allow the system to be extended to work well for parsing any type of reference if some simple tagging conventions are followed. The ParseTreeParser is described in more detail below.

A sixth major component developed for the RefFinder was the WordQueue.java object. This utility class was used by several of the components discussed above to tokenize and buffer the text input. Breaking up the input into small enough chunks for processing, and properly splitting these input chunks into tokens, was one of the more difficult tasks in this project. This component is discussed in more detail below.

3 Parser System Development

3.1 Grammar Categories

As mentioned above, the parsing system is based on a simple tabular parser. The termination conditions were changed so that the parse is considered complete if the category stack is empty. The parser was also modified to recognize a number of special category tokens in addition to the lexicon. Grammar specifications can use special categories like "txt(abc)" to match "abc" input, which also makes some patterns more transparent by bypassing the lexicon. Grammar specifications can use the categories below, in addition to a vocabulary specified in the lexicon:

Category   Matches
INT        Integers
DEC        Decimal numbers
NUM        Integers or decimal numbers
UL         Uppercase letters
LL         Lowercase letters
ROM        Roman numerals
BRAC_INT   Integers enclosed in ()
BRAC_UL    Uppercase letters enclosed in ()
BRAC_LL    Lowercase letters enclosed in ()
BRAC_ROM   Roman numerals enclosed in ()

A third type of grammar category was of the form "ASSUME_LEV1", where LEV1 could be any level within the reference hierarchy. When the parser encounters a category that starts with "ASSUME_" during a parse attempt, it requests whatever follows the underscore from the RegXMLReader (in this case "LEV1") and the returned value is added to the parse as a match. (This is similar to the treatment of "empty" rules, except that an assumed value is matched.) The RegXMLReader returns the appropriate reference level for the current part of text it is reading as the assumed reference.
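A minimal sketch of how the matching functions for these special categories might look is shown below. It is illustrative only; the project's actual matching code is not reproduced in this report, and the class and method names are assumptions.

    import java.util.regex.Pattern;

    // Illustrative matchers for a few of the special grammar categories above.
    // These are assumptions about the general technique, not the project's actual code.
    class CategoryMatchers {
        private static final Pattern INT      = Pattern.compile("\\d+");
        private static final Pattern DEC      = Pattern.compile("\\d+\\.\\d+");
        private static final Pattern LL       = Pattern.compile("[a-z]+");
        private static final Pattern UL       = Pattern.compile("[A-Z]+");
        private static final Pattern ROM      = Pattern.compile("[ivxlcdm]+");
        private static final Pattern BRAC_LL  = Pattern.compile("\\([a-z]+\\)");
        private static final Pattern BRAC_INT = Pattern.compile("\\(\\d+\\)");

        static boolean matches(String category, String token) {
            switch (category) {
                case "INT":      return INT.matcher(token).matches();
                case "DEC":      return DEC.matcher(token).matches();
                case "NUM":      return INT.matcher(token).matches() || DEC.matcher(token).matches();
                case "LL":       return LL.matcher(token).matches();
                case "UL":       return UL.matcher(token).matches();
                case "ROM":      return ROM.matcher(token).matches();
                case "BRAC_LL":  return BRAC_LL.matcher(token).matches();
                case "BRAC_INT": return BRAC_INT.matcher(token).matches();
                default:
                    // txt(abc) matches the literal token "abc", bypassing the lexicon
                    if (category.startsWith("txt(") && category.endsWith(")")) {
                        return token.equals(category.substring(4, category.length() - 1));
                    }
                    return false; // everything else is looked up in the lexicon
            }
        }
    }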
3.2 Tokenizing the input

Preparing the input for the parser was a difficult problem, requiring several design iterations to develop a good system. The WordQueue is used to tokenize and buffer the input. Lines are read from the input file and fed into the WordQueue until it contains more than some specified number of tokens. The input is then passed to the parser to look for a reference. If a reference is found, the tokens that constitute the reference are removed from the queue. Otherwise, the first token in the queue is removed and the input is passed back to the parser. If the WordQueue's length drops below a threshold, more lines are read from the input file and added to the queue.

The difficult part was developing an algorithm for the WordQueue to use for tokenizing the input. The first attempt was to split on white space and then split off any trailing punctuation. Upon testing, this proved woefully inadequate. Some of the text included lines like "oil leaks (as in §§279.14(d)(1)). Materials…". The tokenized version of this line (using a space delimiter) should look like "oil leaks ( as in § § 279.14 (d) (1) ) . Materials". The algorithm cannot split on every "." because some occur as part of a number. It cannot split on every "(" or ")" because some are part of a "(d)" marker that should be preserved. The final solution was to first split the input on white space, and then do a second pass on each individual token. This second pass splits each token into smaller tokens until no more splits are possible. It looks to split off leading "§" symbols, trailing punctuation, unbalanced opening or closing parentheses (that is, unequal numbers of "(" and ")" on a token; for example, "(d))" is split into "(d)" and ")"), and groups of characters enclosed in parentheses.
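Below is a minimal sketch of this two-pass tokenization, following the splitting rules just described. It is an illustration under those stated rules, not the actual WordQueue.java code, and the class and helper names are assumptions.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative two-pass tokenizer following the splitting rules described above.
    // Not the project's actual WordQueue.java code; names are assumptions.
    class TokenizerSketch {
        static List<String> tokenize(String line) {
            List<String> tokens = new ArrayList<>();
            for (String word : line.trim().split("\\s+")) {   // first pass: whitespace
                splitToken(word, tokens);                      // second pass: per-token splits
            }
            return tokens;
        }

        private static void splitToken(String t, List<String> out) {
            if (t.isEmpty()) return;
            // split off a leading section symbol
            if (t.startsWith("§")) {
                out.add("§");
                splitToken(t.substring(1), out);
                return;
            }
            // split off an unbalanced leading "(" such as "(as" -> "(" + "as"
            if (t.startsWith("(") && count(t, '(') > count(t, ')')) {
                out.add("(");
                splitToken(t.substring(1), out);
                return;
            }
            // split off an unbalanced trailing ")" such as "(d))" -> "(d)" + ")"
            if (t.endsWith(")") && count(t, '(') < count(t, ')')) {
                splitToken(t.substring(0, t.length() - 1), out);
                out.add(")");
                return;
            }
            // split off trailing punctuation (a "." inside a number like 279.14 is untouched)
            char last = t.charAt(t.length() - 1);
            if (last == '.' || last == ',' || last == ';') {
                splitToken(t.substring(0, t.length() - 1), out);
                out.add(String.valueOf(last));
                return;
            }
            // split a trailing "(d)"-style group off the rest of the token
            int open = t.lastIndexOf('(');
            if (open > 0 && t.endsWith(")")) {
                splitToken(t.substring(0, open), out);
                out.add(t.substring(open));
                return;
            }
            out.add(t);
        }

        private static int count(String s, char c) {
            int n = 0;
            for (char x : s.toCharArray()) if (x == c) n++;
            return n;
        }
    }

Applied to the example line above, this sketch produces the target sequence "oil leaks ( as in § § 279.14 (d) (1) ) . Materials".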
3.3 Grammar Development

Grammar and lexicon development was started by reading through some regulations and manually identifying patterns. After patterns were encoded in the grammar and lexicon, the RefFinder system was run to identify references and to produce an output file containing failed parse attempts for which the n-gram model predicted a reference would be found ("suspected" references).

It is difficult to quantify the value of these "suspected" references, but they were extremely useful. The first iteration on building the grammar and lexicon files found only 28 references in a particular regulation file. The "suspect.refs" output file consisted almost entirely of missed references, with only a few false predictions. This reduced the difficulty of identifying weak areas where the grammar needed extension. As the quality of the grammar increased, so did the number of identified references and the number of false predictions in the "suspect.refs" file. The final system found 331 references in the regulation file, and the "suspect.refs" file consisted entirely of false predictions. Although it is difficult to quantify the value of the n-gram reference-prediction model for grammar development, the answer to the question "Is it valuable?" is clearly "yes".

4 Statistical Parsing Evaluation

Some experiments were done to test how valuable an n-gram model would be for speeding up the parsing process by skimming over text that was not predicted to contain a reference. To develop a good n-gram model, a regulation corpus of about 650,000 words was assembled. The parser found 8,503 references training on this corpus, which also demonstrates the high concentration of references in regulation documents – a reference occurs on average once every 76 words. These 8,503 references were preceded by 184 unique unigrams, 1,136 unique bigrams, and 2,276 unique trigrams.

For these n-grams to be good predictors of a reference, they should occur frequently enough to be useful predictors, and they should not occur so frequently in the general corpus that their reference-prediction value is washed out. For the unigrams, it was interesting to note that 18 of the most "certain" predictors were highly "certain" only because they were seen just once in the entire corpus. Other unigrams, such as "in", which one might intuitively expect to be a good predictor for references, had a surprisingly low 5% prediction rate. This is because the 2,626 references that were preceded by "in" were heavily outweighed by the 49,325 total occurrences of "in" in the corpus. These two factors made the unigram model a weak one, since words with high certainty tended to be those that were rarely seen, and words that preceded many references tended to be common words that also appeared often throughout the corpus. One exception to this was the word "under", which preceded 1,135 references and appeared only 2,403 times in the corpus.

The bigram model was a good predictor of references. While over 200 (18%) of the bigrams occurred only once in the corpus, the bigrams that preceded a large number of references were not washed out by an even larger number of occurrences in the corpus (as "in" was for unigrams). For example, "requirements of" preceded 1,059 references and was seen 1,585 times total. Other examples are "specified in", which occurred 874/1,147 times, "defined in", which occurred 228/256 times, and "required by", which occurred 208/333 times.

The trigram model helped refine some of the bigram predictors. For example, "described in", with a 61% prediction rate, was refined into 35 trigrams with prediction probabilities ranging from 11% to 100%. In general, however, the trigram model appeared to split things too far, since about 1/3 of the trigrams appeared only a single time in the entire corpus.

The three n-gram models were used together by calculating a weighted sum of the unigram (U), bigram (B) and trigram (T) predictions before attempting a parse of the input. A threshold of 1.0 was used to determine if the parse should be carried out. Changing the weightings for the parameters changes which parses are attempted.

λ1·U + λ2·B + λ3·T ≥ 1

Equation 1. Threshold function for the n-gram model
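A sketch of how this gate might look in code is shown below. It is illustrative only: it reuses the NGramSketch estimator sketched in Section 2, and the class, method and parameter names are assumptions rather than the project's actual RefFinder code.

    // Illustrative gate for deciding whether to attempt a parse, per Equation 1.
    // The NGramSketch estimators are the ones sketched in Section 2 above;
    // lambda1..lambda3 correspond to the command-line unigram/bigram/trigram lambdas.
    class ParseGateSketch {
        static boolean shouldAttemptParse(NGramSketch unigrams, NGramSketch bigrams, NGramSketch trigrams,
                                          String w1, String w2, String w3,   // the three tokens preceding the input
                                          double lambda1, double lambda2, double lambda3) {
            double u = unigrams.probability(w3);                        // U: last token before the input
            double b = bigrams.probability(w2 + " " + w3);              // B: last two tokens
            double t = trigrams.probability(w1 + " " + w2 + " " + w3);  // T: last three tokens
            return lambda1 * u + lambda2 * b + lambda3 * t >= 1.0;      // skim past the text when false
        }
    }

When the weighted sum falls below 1.0 the parser skips ahead instead of attempting a parse, which is what produces the speed/recall trade-off studied next.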
The n-gram model is effective for speeding up parsing, but there is a trade-off between parsing speed and recall. To study this trade-off, the n-gram model was trained on the 650,000-word corpus and then tested on a 36,600-word corpus. To test the system, a brute-force search was done through a range of λ's (λ1 = 1–20,000, λ2 = 1–10,000, λ3 = 1–640). Over 10,000 passes through the test file were completed, and the number of reference parse attempts and successful reference parses were recorded. Examples with the lowest number of parse attempts for a given level of recall were selected from the test runs. This provided an efficient frontier that shows the best efficiency (successful parses / total parse attempts) for a given level of recall. These results are shown in Figure 2.

[Figure 2. Trade-off between recall and required number of parse attempts. The x-axis is the percentage of references found; the y-axis is the number of parse attempts as a multiple of the total number of references in the document.]

The x-axis of the graph shows the level of recall for the pass through the test file. To provide an indicator of how much extra work the parser was doing, the y-axis shows the total number of parse attempts divided by the total number of references in the document. This means an ideal system would create a plot from (0,0) to (1,1) with a slope of 1, since every parse attempt would result in an identified reference. There were a total of 569 references in the test corpus.

As can be seen from the graph, the system reasonably approximates an ideal parse-prediction system for recall levels below 90%. There is clearly a change in the difficulty of predicting references as the desired recall level goes above 90%. For recall levels between 0 and 90%, the slope of the graph is about 1.7. This means the price of increasing the recall is relatively low, and the parser is not doing that much extra work. For recall levels above 90%, the slope of the graph becomes very large (greater than 400), which means any additional increase in recall comes at a very significant increase in the number of parse attempts.

It was surprising to see that the prediction system was able to achieve 100% recall on the test set, which was previously unseen data. It was expected that the system would peak out in the upper 90% range. In general, 100% recall should not be achievable, because there can always be a word preceding a reference that has not been seen before in training. Despite the steep curve to achieve recall above 90%, the total number of parses to achieve 100% recall was only 14,310. This compares quite well to the 37,132 parse attempts required to check the document for references without using the n-gram model (by attempting all possible parses).

To further investigate the shape of the curve in Figure 2, plots of the λ coefficients were examined for the n-gram models. A rough but distinct trend in this data was observable. For the points on the "optimal frontier" between 0 and 60% recall, the three n-gram models appear fairly balanced in terms of importance. For the points between 60% and 90%, the trigram and bigram models dominated the unigram model. Above 90%, it appears the unigram model dominates the bigram and trigram models. It is not clear why there is a difference between the 0–60% and 60–90% ranges, but these findings do explain why there is a sharp change in the slope of Figure 2 around 90% recall. It appears the usefulness of the bigrams and trigrams is exhausted around this range, most likely due to a sparseness problem in the training data. The only way to increase the number of reference predictions beyond this point is to shift the focus to the unigram model – which was noted above to have much lower accuracy than the bigram or trigram models. This results in a significantly steeper recall vs. parse attempts trade-off curve.

5 Semantic Parsing System

5.1 Overall Architecture

The semantic parsing system was built on top of a simple tabular parser that does a type of depth-first processing of the parse tree and treats each node as an input token. The processing deviates from strict depth-first processing when special control categories are encountered. Grammar and lexicon files provide control information to the semantic interpreter. The parsing algorithm differs from a simple tabular parser in that when a category label is found, it is not removed from the category search stack. Instead, the found category is marked "found" and remains on top of the stack. The next matching category can be the "found" category or the second category in the stack. If the second category in the stack is matched, it is marked found and the top category is removed.
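As an illustration of this modified matching step, the sketch below marks a matched category as "found" and leaves it on the stack, so the same search category can match several nodes before the category beneath it takes over. This is an assumed implementation for illustration only, not the project's actual code; the class and method names are invented.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative version of the modified category-matching step described above.
    // A matched category is marked "found" and left on the stack; matching the
    // category underneath it pops the finished one. Not the project's actual code.
    class CategoryStackSketch {
        private final Deque<String> stack = new ArrayDeque<>();
        private final Set<String> found = new HashSet<>();

        CategoryStackSketch(String... searchCategories) {
            for (int i = searchCategories.length - 1; i >= 0; i--) stack.push(searchCategories[i]);
        }

        boolean match(String nodeCategory) {
            String top = stack.peek();
            if (nodeCategory.equals(top)) {        // match the (possibly already found) top category
                found.add(top);
                return true;
            }
            if (found.contains(top) && stack.size() > 1) {
                stack.pop();                        // the found top is finished
                String second = stack.peek();
                if (nodeCategory.equals(second)) {  // match the next category down
                    found.add(second);
                    return true;
                }
                stack.push(top);                    // no match after all; restore the stack
            }
            return false;
        }
    }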
5.1.1 Grammar

The grammar file is essentially a list of templates that specify what types of reference are well formed. All parse tree parser grammar rules must start with "REF --> ". The grammar used for the environmental regulations was:

REF --> LEV0 LEV1 LEV2
REF --> LEV0 LEV3 LEV4 LEV5 LEV6 LEV7

These two grammar rules correspond to the two types of references that were being searched for:

40.cfr.262.F (chapter, part, subpart)
40.cfr.262.12.a.13.iv (chapter, part, section, subsection, etc.)

5.1.2 Lexicon

The lexicon file specifies how to treat different parsing categories. There are five semantic interpretation categories that can be used in the lexicon. These categories are used to classify the categories used earlier for the parser. The five semantic interpretation categories are:

Category      Meaning
PTERM         Indicates the node is a printing terminal string (to be added to the reference string currently being built)
NPTERM        Indicates the node is a non-printing terminal string (the node is ignored)
SKIPNEXT      Indicates the next child node of the parent should be ignored and not processed
REFBREAK      Indicates the current reference string is complete, and a new reference string should be started
INTERPOLATE   Indicates that a list of references should be generated to make a continuous list between the previous child node and the next child node (if the child node sequence was "262, INTERPOLATE, 265", this would generate the list "263, 264")
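As one concrete illustration of the last category, the following sketch shows how the interpolation between two numeric components could be generated. It is an assumption about the approach, not the actual ParseTreeParser.java code, and the names are invented.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative INTERPOLATE expansion: for the child sequence "262, INTERPOLATE, 265"
    // under the prefix "40.cfr", this produces 40.cfr.263 and 40.cfr.264.
    // Not the project's actual code; names are assumptions.
    class InterpolateSketch {
        static List<String> interpolate(String prefix, int before, int after) {
            List<String> refs = new ArrayList<>();
            for (int i = before + 1; i < after; i++) {
                refs.add(prefix + "." + i);
            }
            return refs;
        }
    }

Since INTERP matches words such as "through" (see Appendix B), a phrase like "parts 262 through 265" ends up contributing the intermediate references 40.cfr.263 and 40.cfr.264 in addition to the endpoints.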
5.2 Algorithms

5.2.1 Simple Semantic Interpretation

The semantic parser works by attempting to match the category stack to the nodes in the tree. The parser maintains a "current reference" string that is updated as nodes in the parse tree are encountered. References are added to a list of complete references when the parser encounters REFBREAK or INTERPOLATE nodes, or completes a full parse of the tree. Two examples follow that explain this process in detail.

[Figure 3. Example Simple Parse Tree: the parse of "40 CFR parts 264 and 265", with internal nodes including REF, LEV0', LEV0, LEV1a', LEV1a, LEV1p, LEV1s, CONN', PART, INT and CONL2 above the leaves "40", "CFR", "parts", "264", "and", "265".]

The original reference for the parse tree in Figure 3 was "40 CFR parts 264 and 265". The semantic interpretation parser transforms this into two complete references: 40.cfr.264 and 40.cfr.265. Figure 3 is an example of a simple parse tree that can be interpreted. The parser starts by expanding the REF category in its search list to "LEV0 LEV1 LEV2". It then starts a depth-first parse down the tree, starting at REF. The LEV0' node matches LEV0, so this category is marked as found. The LEV0 node also matches the LEV0 search category. Next the children of LEV0 are processed from left to right. Looking INT up in the lexicon shows it is a PTERM, so the current reference string is now "40". Looking CFR up in the lexicon shows it is also a PTERM, so this leaf's value is appended to the current reference string to form "40.cfr". Next, LEV1a' is processed, and a note is made that the incoming current reference string was "40.cfr". LEV1a' matches LEV1, so the top LEV0 search category is discarded and the LEV1 category is marked as found. Processing continues down the LEV1a branch of the tree to the LEV1p node. The PART child node is found to be an NPTERM in the lexicon, so it is not appended to the current reference string. INT is found to be a PTERM, so it is concatenated to the current reference string. Since CONL2 is also an NPTERM, the algorithm returns back up to LEV1a'. The next child node to be processed is CONN', which is found to be a REFBREAK in the lexicon. This means the current reference is complete, so "40.cfr.264" is added to the list of references and the current reference is reset to "40.cfr", the value it had when the LEV1a' parent node was first reached. Processing then continues down from LEV1a' to the right-most leaf of the tree. At this point the current reference is updated to "40.cfr.265", and since the entire tree has now been traversed, "40.cfr.265" is added to the list of identified references. Next the parser would try the other expansion of REF, "LEV0 LEV3 LEV4 LEV5 LEV6 LEV7", but since it would be unable to match LEV3 this attempt would fail. The final list of parsed references would then contain 40.cfr.264 and 40.cfr.265.
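To make the walk-through concrete, here is a deliberately simplified sketch that applies the PTERM / NPTERM / REFBREAK rules to the leaf sequence of Figure 3. The parent-node bookkeeping is approximated by dropping the last reference component at a break; it is an illustration only, not the ParseTreeParser.java algorithm, and the names are assumptions.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified illustration of the PTERM / NPTERM / REFBREAK behaviour described above.
    // Each element is {lexicon category, leaf value}; the real system walks the parse tree
    // and remembers the reference prefix at each parent node instead of dropping components.
    class SimpleInterpretationSketch {
        static List<String> interpret(String[][] leaves) {
            List<String> refs = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (String[] leaf : leaves) {
                String category = leaf[0], value = leaf[1];
                switch (category) {
                    case "PTERM":      // printing terminal: append to the current reference
                        if (current.length() > 0) current.append('.');
                        current.append(value.toLowerCase());
                        break;
                    case "NPTERM":     // non-printing terminal: ignored
                        break;
                    case "REFBREAK":   // current reference complete; restart from its prefix
                        refs.add(current.toString());
                        int lastDot = current.lastIndexOf(".");
                        current.setLength(lastDot < 0 ? 0 : lastDot);
                        break;
                }
            }
            refs.add(current.toString());
            return refs;
        }

        public static void main(String[] args) {
            // "40 CFR parts 264 and 265" -> [40.cfr.264, 40.cfr.265]
            System.out.println(interpret(new String[][] {
                    {"PTERM", "40"}, {"PTERM", "CFR"}, {"NPTERM", "parts"},
                    {"PTERM", "264"}, {"REFBREAK", "and"}, {"PTERM", "265"}
            }));
        }
    }

The printed result is [40.cfr.264, 40.cfr.265], matching the interpretation described above.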
5.2.2 Complex Semantic Interpretation

The basic approach described in 5.2.1 was extended to handle references where the components of the reference do not appear in order. For example, the parser might encounter the reference "paragraph (d) of section 262.14". A proper ordering of this reference would be "section 262.14, paragraph (d)". To handle these cases, if the top of the category search stack cannot be matched to a node in the tree, the remainder of the parse tree is scanned to see if the missing category appears elsewhere in the tree (a "back-reference"). If the category is found, it is processed and appended to the current reference before control returns to the original part of the parse tree. If multiple references are found during the back-reference call, their order needs to be reversed to maintain correctness. This allows parsing an interpretation from trees like the following:

[Figure 4. Example Parse Tree: the parse of "Subpart O of part 264 or 265", with an ASSUME_LEV0 leaf for "40.cfr", a LEV2' subtree for "Subpart O", a BACKREFKEY node for "of", and a LEV1r' subtree covering "part 264 or 265".]

The original reference for the parse tree in Figure 4 was "Subpart O of part 264 or 265". The semantic interpretation parser transforms this into two complete references: 40.cfr.264.O and 40.cfr.265. In cases of ambiguous meaning, the parser maximizes the scope of ambiguous references. For example, the parse in Figure 4 could also be read as 40.cfr.264.O and 40.cfr.265.O, but this might be too narrow a reference if 40.cfr.264.O and 40.cfr.265 were actually intended.

Figure 4 is an example of a complex parse tree that can be interpreted. A brief explanation of this parse follows. The semantic parser first expands the starting REF category to "LEV0 LEV1 LEV2". LEV0 matches the ASSUME_LEV0 leaf, and the current reference string is updated to "40.cfr". Next, the parser encounters LEV2', which does not match LEV0 or LEV1. The parser then searches for a possible "back-reference" (a level of the reference that is out of order, referring back to a lower level), which it finds as LEV1r'. The parser processes this part of the tree, concatenating the INT under LEV1p to the reference string. It also notes the reference string is complete upon encountering the CONN' (a REFBREAK), so a new reference string is started with "265" and a notation is made that the rightmost leaf of the tree has been found. Upon returning from the back-reference function call, it is noted that multiple references were encountered, so a reconciliation procedure is run to swap "40.cfr.264" with "40.cfr.265" in the complete reference list and to set "40.cfr.264" as the current reference string. Now the parser can match the LEV2' category and update the current reference string to "40.cfr.264.O". Next the parser encounters the BACKREFKEY category, which the lexicon identifies as type SKIPNEXT, so the parser skips the next child node. Skipping the next child node brings the parser to the end of the tree. Since the parser noted earlier that it had processed the right-most leaf, it knows that this was a successful semantic parsing attempt, and it adds "40.cfr.264.O" to its reference list. The subsequent attempt to parse the tree using "LEV0 LEV3 LEV4 …" will fail to reach the rightmost leaf, so no more parses will be recorded. Thus, the final reference list is 40.cfr.264.O and 40.cfr.265.

5.2.3 General Applicability of the Semantic Parsing Algorithm

The parsing system that was developed, along with the semantic interpreter for the parse trees, should be simple to reconfigure to parse and interpret a variety of different referencing systems or desired text patterns. Using a grammar and lexicon to specify how to treat categories from a parsed reference provides a great deal of flexibility. New grammar and lexicon files can be introduced to change the system so that it parses for new types of references. The main limitation of the system is that rules cannot be left-recursive.

6 Conclusions

6.1 Possible Further Extensions Not Pursued

For the purposes of this project I worked only with an n-gram model for predictions. Another option that could perform the same tasks as my n-gram model is a probabilistic context model. When I started this project I chose to use an n-gram model for two reasons. First, it seemed intuitively clear that there are some common words and phrases that precede references in the text, so I thought it would be interesting to see how valuable these words and phrases are for reference prediction. Second, I had already written a context model for programming project 2, so I thought it would be interesting to try working with n-grams.

Although the n-gram model worked well for predicting references, I believe that the n-gram model faces limitations. As noted in Section 4, there are always new words that could precede a reference, so attaining high reference recall with an n-gram model is problematic. If a context model were built to predict references by looking at the first three or four input tokens, this model might be quite successful. A key requirement for the context model to operate effectively would be for it to group tokens in a manner similar to the reference parser. For example, "262.12" and "261.43" should both be counted as belonging to the "numbers" group. The same idea would apply to Roman numerals, uppercase letters in parentheses, etc. Using these groupings, the context model could learn the common words and components of a reference, rather than relying on the words preceding a reference.
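A small sketch of the token grouping such a context model would need is shown below. It is purely illustrative: this context model was not built for the project, and the class name and group labels are assumptions.

    // Illustrative token grouping for a hypothetical context model.
    // Groups mirror the parser's special categories so that "262.12" and "261.43"
    // are both seen as the same kind of token. Not part of the implemented system.
    class TokenGroupSketch {
        static String group(String token) {
            if (token.matches("\\d+\\.\\d+")) return "NUMBER";                 // 262.12, 261.43
            if (token.matches("\\d+")) return "INTEGER";                       // 40, 264
            if (token.matches("\\([ivxlcdm]+\\)")) return "BRACKETED_ROMAN";   // (iv); overlaps with plain letters like (d), which a real model would need to disambiguate
            if (token.matches("\\([a-z]+\\)")) return "BRACKETED_LETTER";      // (d)
            if (token.matches("\\(\\d+\\)")) return "BRACKETED_INTEGER";       // (2)
            if (token.equals("§")) return "SECTION_SYMBOL";
            return token.toLowerCase();                                        // ordinary words stay as themselves
        }
    }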
6.2 Summary

This project answered three questions surrounding the issue of regulation reference extraction. It was shown that an effective parser can be built to recognize and transform environmental regulation references into a standard format. It was also shown that an n-gram model can be used to help the parser "skim" through a document quickly without missing many references, although there is a significant speed/recall trade-off, as explored in Figure 2. Finally, it was found, qualitatively at least, that an n-gram reference-prediction model is a useful tool for grammar development when attempting to build a reference parser for sections of text.

Appendix A – Reference Parser Grammar

REF --> LEV0'
REF --> ASSUME_LEV0 LEV1r'
REF --> LEV2' BackRefKey LEV0
REF --> ASSUME_LEV0 LEV2' BackRefKey LEV1r'
REF --> ASSUME_LEV0 SEC LEV3'
REF --> ASSUME_LEV0 SecSymb LEV3'
REF --> ASSUME_LEV0 SecSymb SecSymb LEV3'
REF --> ASSUME_LEV0 PARA LEV4' BackRefKey LEV3
#REF --> ASSUME_LEV0 ASSUME_LEV1 PARA LEV4' BackRefKey LEV3
REF --> ASSUME_LEV0 ASSUME_LEV1 LEV2'
REF --> ASSUME_LEV0 LEV2' BackRefKey LEV1a
LEV0' --> LEV0
LEV0' --> LEV0 CONN' LEV0'
LEV0 --> INT CFR LEV1a'
LEV0 --> INT CFR LEV3'
LEV1a' --> LEV1a
LEV1a' --> LEV1a CONN' LEV1a'
LEV1a' --> LEV1a INTERP LEV1a'
LEV1r' --> LEV1r
LEV1r' --> LEV1r CONN' LEV1a'
LEV1r' --> LEV1r INTERP LEV1a'
LEV1a' --> LEV1s
LEV1a' --> LEV1p
LEV1r' --> LEV1p
LEV1a --> LEV1s
LEV1a --> LEV1p
LEV1r --> LEV1p
LEV1s --> INT
LEV1s --> LEV1_SELFREF
LEV1p --> PART INT CONL2
LEV1_SELFREF --> txt(this) txt(part) ASSUME_LEV1
CONL2 --> txt(,) LEV2'
CONL2 --> e
LEV2' --> LEV2_SELFREF
LEV2' --> SUBPART UL'
LEV2_SELFREF --> txt(this) txt(subpart) ASSUME_LEV2
LEV2_SELFREF --> txt(this) txt(Subpart) ASSUME_LEV2
UL' --> UL
UL' --> UL CONN' UL'
UL' --> UL INTERP UL'
LEV3' --> LEV3
LEV3' --> LEV3 CONN' LEV3'
LEV3' --> LEV3 INTERP LEV3
LEV3 --> LEV3_SELFREF
LEV3 --> DEC CONL4
LEV3 --> PART DEC CONL4
LEV3_SELFREF --> txt(this) txt(section) ASSUME_LEV3
CONL4 --> e
CONL4 --> LEV4'
LEV4' --> LEV4
LEV4' --> LEV4 CONN' LEV4'
LEV4' --> LEV4 INTERP LEV4'
LEV4 --> BRAC_LL CONL5
CONL5 --> e
CONL5 --> LEV5'
LEV5' --> LEV5
LEV5' --> LEV5 CONN' LEV5'
LEV5' --> LEV5 INTERP LEV5'
LEV5 --> BRAC_INT CONL6
CONL6 --> e
CONL6 --> LEV6'
CONN' --> CONN
CONN' --> SEP CONN
LEV6' --> LEV6
LEV6' --> LEV6 CONN' LEV6'
LEV6' --> LEV6 INTERP LEV6'
LEV6 --> BRAC_ROM CONL7
CONL7 --> e
CONL7 --> LEV7'
LEV7' --> LEV7
LEV7 --> BRAC_UL CONL8
LEV6' --> LEV6 CONN' LEV6'
LEV6' --> LEV6 INTERP LEV6'
CONL8 --> e

Appendix B – Reference Parser Lexicon

CONN --> and
CONN --> or
CONN --> ,
SEP --> ,
SEP --> ;
INTERP --> through
INTERP --> between
INTERP --> to
PART --> part
PART --> parts
PART --> Part
PART --> Parts
SUBPART --> subpart
SUBPART --> subparts
SUBPART --> Subpart
SUBPART --> Subparts
SEC --> section
SEC --> sections
SEC --> Section
SEC --> Sections
PARA --> paragraph
PARA --> paragraphs
PARA --> Paragraph
PARA --> Paragraphs
BackRefKey --> of
BackRefKey --> in
CFR --> CFR
CFR --> cfr

Appendix C – Semantic Parser Grammar

REF --> LEV0 LEV1 LEV2
REF --> LEV0 LEV3 LEV4 LEV5 LEV6 LEV7

Appendix D – Semantic Parser Lexicon

PTERM --> INT
PTERM --> CFR
PTERM --> DEC
PTERM --> UL
PTERM --> BRAC_INT
PTERM --> BRAC_LL
PTERM --> BRAC_UL
PTERM --> BRAC_ROM
NPTERM --> PARA
NPTERM --> PART
NPTERM --> SUBPART
NPTERM --> SEC
NPTERM --> SecSymb
NPTERM --> txt
NPTERM --> e
SKIPNEXT --> BackRefKey
REFBREAK --> CONN
REFBREAK --> SEP
REFBREAK --> CONN'
INTERPOLATE --> INTERP
Appendix E – Simple Example Run

tree2:~> java RefFinder -f parser.gram parser.lex sample.xml
Saving found references in found.refs
no test file provided
using default n-gram lambdas.
Initializing new parser
reading grammar file...
reading lexicon file...
Initializing new parse tree parser.
reading grammar file...
reading lexicon file...
training...

40.cfr parts 262 E through 265 , 268 , and parts 270 E , 271 , and 124
----------- Retrieved Refs ---------
ref.40.cfr.262
ref.40.cfr.263
ref.40.cfr.264
ref.40.cfr.265
ref.40.cfr.268
ref.40.cfr.270
ref.40.cfr.271
ref.40.cfr.124
***** found parse: 40.cfr parts 262 E through 265 , 268 , and parts 270 E , 271 , and 124
parts 262 through 265 , 268 , and parts 270 , 271 , and 124 of this chapter and which are ject to the notification requirements of section 3010 of RCRA . In this part : (1) Subpart A defines the terms regulation as hazardous wastes under

40.cfr 261 Subpart A
----------- Retrieved Refs ---------
ref.40.cfr.261.A
***** found parse: 40.cfr 261 Subpart A
Subpart A defines the terms ``solid waste'' and ``hazardous waste'' , identifies those wastes which are special management requirements for hazardous waste produced by conditionally exempt small quantity tors and hazardous waste which is recycled . (2) sets forth the criteria In this part : (1)

40.cfr § § 261.2 E and 261.6 E
----------- Retrieved Refs ---------
ref.40.cfr.261.2
ref.40.cfr.261.6
***** found parse: 40.cfr § § 261.2 E and 261.6 E
§ § 261.2 and 261.6 : (1) A ``spent material'' is any material that has been used and as a result of contamination can no longer serve the purpose for which it was produced without processing ; (2) ``Sludge'' has the same meaning (c) For the purposes of

40.cfr § 261.4 (a) (13) E
----------- Retrieved Refs ---------
ref.40.cfr.261.4.a.13
***** found parse: 40.cfr § 261.4 (a) (13) E
§ 261.4 (a) (13) ) . (11) ``Home scrap metal'' is scrap metal as generated by steel mills , foundries , and refineries such as turnings , cuttings , punchings , and borings . (12) ``Prompt scrap metal'' is scrap circuit boards being recycled (

40.cfr paragraph (b) E of this section 261.2
----------- Retrieved Refs ---------
ref.40.cfr.261.2.b
***** found parse: 40.cfr paragraph (b) E of this section 261.2
paragraph (b) of this section ; or (ii) Recycled , as explained in (iii) Considered inherently waste-like , section ; or (iv) A military munition identified as a (b) Materials are solid waste if they are abandoned by being : Abandoned , as explained in

40.cfr paragraphs (c) (1) E through (4) E of this section 261.2
----------- Retrieved Refs ---------
ref.40.cfr.261.2.c.1
ref.40.cfr.261.2.c.2
ref.40.cfr.261.2.c.3
ref.40.cfr.261.2.c.4
***** found parse: 40.cfr paragraphs (c) (1) E through (4) E of this section 261.2
paragraphs (c) (1) through (4) of this section . (1) Used in a manner constituting disposal . (i) Materials noted with a `` Column 1 of Table I are solid wastes when they are : (A) Applied to or placed on the land recycled by being : in

40 CFR 261.4 (a) (17) E
----------- Retrieved Refs ---------
ref.40.cfr.261.4.a.17
***** found parse: 40 CFR 261.4 (a) (17) E
40 CFR 261.4 (a) (17) ) . (4) Accumulated speculatively . Materials noted with a `` Table 1 are solid wastes when accumulated speculatively .
(i) Used or reused as ingredients in an industrial process to make a product , ( except as provided under

40.cfr paragraphs (e) (1) (i) E through (iii) E of this section 261.2
----------- Retrieved Refs ---------
ref.40.cfr.261.2.e.1.i
ref.40.cfr.261.2.e.1.ii
ref.40.cfr.261.2.e.1.iii
***** found parse: 40.cfr paragraphs (e) (1) (i) E through (iii) E of this section 261.2
paragraphs (e) (1) (i) through (iii) of this section ) : (i) Materials used in a manner constituting disposal , or used to produce products that are applied to the land ; or (ii) Materials burned for energy recovery , used to produce a fuel , or original process ( described in

40.cfr paragraphs (d) (1) E and (d) (2) E of this section 261.2
----------- Retrieved Refs ---------
ref.40.cfr.261.2.d.1
ref.40.cfr.261.2.d.2
***** found parse: 40.cfr paragraphs (d) (1) E and (d) (2) E of this section 261.2
paragraphs (d) (1) and (d) (2) of this section . (f) Documentation of claims that materials are not solid wastes or are conditionally exempt from regulation . Respondents in actions to enforce regulations implementing who raise a claim that a certain or (iv) Materials listed in

Found 9 references in corpus
tree2:~>