Chapter 5
Context Parsers

This chapter explains context parsers and the different methods of parsing. It also contains the syntax for defining context parsers.

What is a Context Parser?

A context parser may be used in addition to user-defined markup to determine the end of a context unit by parsing punctuational markup. Before a document is converted to the BASIS internal storage format, a context parser reads its textual content to determine the location of context units.

To use a context parser, specify its name on the CONTEXT_PARSER parameter of the FIELD definition for the field that you want to parse.

Methods of Context Parsing

There are several methods of context parsing. The method that you choose determines how the text is parsed. The methods of context parsing are:

LINE
    Each input line is a context unit.

PARAGRAPH
    One or more blank lines mark the end of a paragraph. A blank line contains no data other than white space (for example, tabs and spaces). For XML files, the parser instead searches for one or more empty lines as the end of a paragraph. An empty line contains no data or white space.

SENTENCE_FAST
    The end of a sentence occurs whenever a period, question mark, or exclamation point is encountered.

SENTENCE
    The rules of English punctuation are used to determine the end of a sentence.

SENTENCE_WITH_ABBREVIATION
    This method looks up the token prior to a period in a list of valid abbreviations to determine whether the period ends a sentence or is part of an abbreviation.

NONE
    No context parsing is done. However, during import, each context delimiter (e.g., <#UNIT> for a BGML converter) is honored. In addition, a new context unit is started once the context unit being imported reaches its maximum size of approximately 16,000 characters.
    NONE may be specified for the CONTEXT_PARSER parameter of the FIELD DDL statement, but NONE is not valid as a name or METHOD of DEFINE/CONTEXT_PARSER.

    Note: For records with fields that use CONTEXT_PARSER=NONE, Open Text recommends that the RECORD_STORAGE specification in the SDM be specified with a capacity of C4. This helps ensure the maximum number of words per context unit is not exceeded during import.

Customizing Context Parsing Methods

To customize the methods of context parsing, you define punctuation sets and abbreviation sets. Punctuation sets allow you to specify which characters signal the beginning and the ending of sentences and which characters do not. The kinds of punctuation sets that you may define are:

PUNCTUATION_SET_ABBREVIATION
    Contains the characters that are used to end abbreviations.

PUNCTUATION_SET_BEGIN
    Contains the characters that begin a sentence.

PUNCTUATION_SET_END
    Contains the characters that end a sentence.

PUNCTUATION_SET_SKIP
    Contains the characters that should be skipped because they appear between the end of one sentence and the beginning of the next.

PUNCTUATION_SET_WORD
    Contains the characters that separate words in a sentence.

An abbreviation set contains a list of abbreviations which are used to determine if a word that ends with a period is an abbreviation or the end of a sentence.

Defining Context Parsers, Punctuation Sets, and Abbreviation Sets

To define a context parser, use the DEFINE/CONTEXT_PARSER statement. The method of context parsing that you choose determines if you need to define punctuation and abbreviation sets. Use the DEFINE/PUNCTUATION_SET and the DEFINE/ABBREVIATION_SET statements to define punctuation and abbreviation sets. These statements and their parameters are described below.

Definition of Context Parsers

Purpose:

To determine the end of a context unit by parsing punctuational markup.
The name of a context parser may be specified for each field in a database by using the CONTEXT_PARSER parameter on the FIELD statement. The specified context parser is invoked each time a new record is added to the database or when a record is edited. The system supplies several context parsers that are used by default.

Syntax:

    DEFine/CONTEXT_PARSER context_parser_name METHOD=LINE | PARAGRAPH | SENTENCE_FAST | SENTENCE | SENTENCE_WITH_ABBREVIATION

Parameters:

context_parser_name
    (Required) Specifies the name of the context parser. A valid name is a long_id (a character string from 1 to 32 characters in length). This name must be unique within a markup and style guide. For more details about long_id, see “Common Syntax.”

METHOD=LINE | PARAGRAPH | SENTENCE_FAST | SENTENCE | SENTENCE_WITH_ABBREVIATION
    (Required) Determines how the text is parsed.

    LINE
        Each input line is a context unit.

    PARAGRAPH
        Looks for one or more blank lines as the end of a paragraph.

    SENTENCE_FAST
        Checks for a single punctuation character to signal the end of a sentence. If you choose this method, you must also specify the following parameter:

            PUNCTUATION_SET_END=punc_set_name

        For an explanation of this parameter, see the “Key Points” section below.

    SENTENCE
        Uses the rules of English punctuation to determine whether the end of a sentence exists. If you choose this method, you must also specify the following parameters:

            PUNCTUATION_SET_END=punc_set_name,
            PUNCTUATION_SET_SKIP=punc_set_name,
            PUNCTUATION_SET_BEGIN=punc_set_name

        For an explanation of these parameters, see the “Key Points” section below.

    SENTENCE_WITH_ABBREVIATION
        Because a period may end a sentence or an abbreviation, context parsing is ambiguous. This method tries to resolve this ambiguity by looking up the token prior to the period in a list of valid abbreviations. To be efficient, this method first uses the SENTENCE method to determine if an end of sentence exists.
        If so, and the sentence ends in a period, the prior token is then checked against the abbreviation list. If you choose this method, you must also specify the following parameters:

            PUNCTUATION_SET_END=punc_set_name,
            PUNCTUATION_SET_SKIP=punc_set_name,
            PUNCTUATION_SET_BEGIN=punc_set_name,
            PUNCTUATION_SET_WORD=punc_set_name,
            PUNCTUATION_SET_ABBREVIATION=punc_set_name,
            ABBREVIATION_SET=abbrev_set_name

        For an explanation of these parameters, see the “Key Points” section below.

Key Points:

For more information about common syntax (e.g., long_id), see “Common Syntax.”

The default context parsers are PARAGRAPH, SENTENCE, SENTENCE_FAST, LINE, and SENTENCE_ABBR. To change a default context parser, include in your markup and style guide a new context parser definition that uses the name of the default context parser.

The following parameters are specified with some methods to allow customization of the context parsing method:

ABBREVIATION_SET=abbrev_set_name
    Contains a list of abbreviations for the SENTENCE_WITH_ABBREVIATION method. The abbrev_set_name specifies the name of an abbreviation set that is defined by a DEFINE/ABBREVIATION_SET statement in your markup and style guide.

PUNCTUATION_SET_ABBREVIATION=punc_set_name
    Contains the characters that are used to end an abbreviation. Make sure that this set contains the appropriate characters that also appear in the PUNCTUATION_SET_END set. The punc_set_name specifies the name of a punctuation set that is defined by a DEFINE/PUNCTUATION_SET statement in your markup and style guide.

PUNCTUATION_SET_BEGIN=punc_set_name
    Contains the characters that start a sentence. In English, this may be all capital letters and the left parenthesis. For Spanish, this set may also include the inverted exclamation mark (¡) and the inverted question mark (¿). The punc_set_name specifies the name of a punctuation set that is defined by a DEFINE/PUNCTUATION_SET statement in your markup and style guide.
PUNCTUATION_SET_END=punc_set_name
    Contains the characters that typically end a sentence. In English, this may be the period, exclamation point, and question mark. The punc_set_name specifies the name of a punctuation set that is defined by a DEFINE/PUNCTUATION_SET statement in your markup and style guide.

PUNCTUATION_SET_SKIP=punc_set_name
    Contains the characters that appear between the ending sentence punctuation and the beginning of the next sentence. In English, this may be a blank, newline, closing quote, and right parenthesis. The punc_set_name specifies the name of a punctuation set that is defined by a DEFINE/PUNCTUATION_SET statement in your markup and style guide.

PUNCTUATION_SET_WORD=punc_set_name
    Contains the characters that separate words in a sentence. In English, this may be a blank, opening quote, period, and left parenthesis. For Spanish, this set may also include the inverted exclamation mark (¡) and the inverted question mark (¿). The punc_set_name specifies the name of a punctuation set that is defined by a DEFINE/PUNCTUATION_SET statement in your markup and style guide.

Example:

1. The context parsers defined below are system-supplied context parsers that you will find in your default markup and style guide.
    DEFINE/CONTEXT_PARSER PARAGRAPH METHOD=PARAGRAPH

    DEFINE/CONTEXT_PARSER SENTENCE METHOD=SENTENCE, +
        PUNCTUATION_SET_END=SYS_PUNC_SET_END, +
        PUNCTUATION_SET_SKIP=SYS_PUNC_SET_SKIP, +
        PUNCTUATION_SET_BEGIN=SYS_PUNC_SET_BEGIN

    DEFINE/CONTEXT_PARSER SENTENCE_FAST +
        METHOD=SENTENCE_FAST, +
        PUNCTUATION_SET_END=SYS_PUNC_SET_END

    DEFINE/CONTEXT_PARSER LINE METHOD=LINE

    DEFINE/CONTEXT_PARSER SENTENCE_ABBR +
        METHOD=SENTENCE_WITH_ABBREVIATION, +
        PUNCTUATION_SET_END=SYS_PUNC_SET_END, +
        PUNCTUATION_SET_SKIP=SYS_PUNC_SET_SKIP, +
        PUNCTUATION_SET_BEGIN=SYS_PUNC_SET_BEGIN, +
        PUNCTUATION_SET_WORD=SYS_PUNC_SET_WORD, +
        PUNCTUATION_SET_ABBREVIATION=SYS_PUNC_SET_ABBR, +
        ABBREVIATION_SET=SYS_ABBR_SET

Definition of Punctuation Sets

Purpose:

To define a punctuation set which is used by a context parsing method. Markup and style guide punctuation sets allow some flexibility in specifying which characters represent the beginning and the ending of sentences for different languages.

Syntax:

    DEFine/PUNCtuation_set punc_set_name (punctuation_list)

Parameters:

punc_set_name
    (Required) Specifies the name of the punctuation set. A valid name is a long_id (a character string from 1 to 32 characters in length). This name must be unique within a markup and style guide. For more details about long_id, see “Common Syntax.”

    The names of the system-supplied punctuation sets all start with “SYS_”. The Style Guide Compiler (DMSGC) assigns a unique internal number to each punctuation set name.

punctuation_list
    (Required) Lists the punctuation (separated by a comma) to be included in the set. This list contains from 0 to 255 entries. Valid entries include 'char' (e.g., '.'), LINE (which means newline), and charcode.

Key Point:

For more information about common syntax (e.g., long_id), see “Common Syntax.”

Example:

1. The punctuation sets defined below are system-supplied punctuation sets that you will find in your default markup and style guide.
    DEFINE/PUNCTUATION_SET SYS_PUNC_SET_END +
        ( '.', '?', '!' )

    DEFINE/PUNCTUATION_SET SYS_PUNC_SET_SKIP +
        ( ' ', LINE, '"', '''', ')' )

    DEFINE/PUNCTUATION_SET SYS_PUNC_SET_BEGIN +
        ('(', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', +
        'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', +
        'T', 'U', 'V', 'W', 'X', 'Y', 'Z' )

    DEFINE/PUNCTUATION_SET SYS_PUNC_SET_WORD +
        (' ', '!', '"', '#', '$', '%', '&', '''', '(', +
        ')', '*', '+', ',', '-', '.', '/', ':', ';', +
        '<', '=', '>', '?', '@', '[', '\', ']', '^', +
        '_', '`', '{', '|', '}', '~' )

    DEFINE/PUNCTUATION_SET SYS_PUNC_SET_ABBR ( '.')

Definition of Abbreviation Sets

Purpose:

To define an abbreviation set which is used by the context parsing method SENTENCE_WITH_ABBREVIATION. An abbreviation set is used to determine if a word that ends with a period is an abbreviation or the end of a sentence.

Syntax:

    DEFine/ABBREViation_set abbrev_set_name ( abbrev_list )

Parameters:

abbrev_set_name
    (Required) Specifies the name of the abbreviation set. A valid name is a long_id (a character string from 1 to 32 characters in length). This name must be unique within a style guide. For more details about long_id, see “Common Syntax.”

    The names of the system-supplied abbreviation sets all start with “SYS_”. DMSGC assigns a unique internal number to each abbreviation set name.

abbrev_list
    (Required) Lists the abbreviations (separated by a comma) to be included in the set. This list contains from 0 to 4000 entries. Do not include periods in your abbreviation list. For example, you should enter ‘Tue’, not ‘Tue.’, for the abbreviation of Tuesday.

Key Points:

For more information about common syntax (e.g., long_id), see “Common Syntax.”

When you are defining an abbreviation set, enter the abbreviations in alphabetical order. This will save time when you compile your style guide.

Case is significant in abbreviations.
For example, in the abbreviation set SYS_ABBR_SET, shown below, both the abbreviations ‘corp’ and ‘Corp’ are included. If ‘Corp’ were not included, Corp. would be treated as the end of a context unit rather than as an abbreviation.

Example:

The abbreviation set defined below is a system-supplied abbreviation set that you will find in your default markup and style guide.

    DEFINE/ABBREVIATION_SET SYS_ABBR_SET ('a', 'A', +
        'abbr', 'abbrev', 'abs', 'abstr', 'acad', 'acct', +
        'ack', 'act', 'addn', 'addnl', 'adj', 'adm', +
        'admin', 'adv', 'advt', 'agcy', 'aka', 'alg', +
        'alt', 'am', 'AM', 'amt', 'ans', 'app', 'approx', +
        'appt', 'Apr', 'apt', 'assn', 'assoc', 'atty', +
        'Aug', 'ave', 'avg', 'b', 'B', 'bal', 'bar', 'bbl', +
        'bd', 'bdl', 'bdle', 'bdrm', 'bef', 'bf', 'bg', +
        'biog', 'bk', 'bkg', 'bkgd', 'bl', 'bld', 'bldg', +
        'bldr', 'blk', 'blvd', 'br', 'bro', 'bros', 'bu', +
        'bur', 'c', 'C', 'ca', 'cal', 'calc', 'canc', +
        'Capt', 'ch', 'chan', 'chap', 'chem', 'chg', 'chm', +
        'Chmn', 'circ', 'cit', 'civ', 'ck', 'cl', 'cm', +
        'cmd', 'cmdg', 'cmdr', 'co', 'Co', 'col', 'coll', +
        'comm', 'conf', 'cons', 'const', 'constr', 'cont', +
        'contd', 'conv', 'corp', 'Corp', 'corr', 'cu', +
        'cvt', 'cyc', 'cycl', 'cyl', 'd', 'D', 'db', 'dbl', +
        'dec', 'Dec', 'decd', 'dept', 'diam', 'dict', +
        'dif', 'diff', 'disp', 'dist', 'distr', 'div', 'dk', +
        'dol', 'doz', 'dpt', 'Dr', 'dup', 'dz', 'e', 'E', +
        'ea', 'ed', 'educ', 'elev', 'enc', 'encl', 'eq', +
        'equip', 'equiv', 'esp', 'est', 'et', 'etc', 'ex', +
        'exch', 'exec', 'exp', 'f', 'F', 'Feb', 'fed', +
        'fig', 'fl', 'fn', 'fr', 'Fr', 'Fri', 'ft', 'fwd', +
        'g', 'G', 'ga', 'gal', 'gov', 'govt', 'gr', 'grad', +
        'h', 'H', 'hgt', 'hgwy', 'hr', 'hosp', 'ht', 'i', +
        'I', 'ibid', 'illus', 'illustr', 'imp', 'inc', +
        'incl', 'incr', 'ins', 'inst', 'instr', 'int', +
        'intl', 'intnl', 'ital', 'j', 'J', 'Jan', 'jct', +
        'jr', 'Jr', 'jnr', 'Jun', 'Jul', 'k', 'K', 'kg', +
        'kt', 'kl', 'l', 'L', 'lat', 'lb', 'lbs', 'ln', +
        'ltd', 'Lt', 'm', 'M', 'mag', +
        'Maj', 'manuf', 'Mar', +
        'max', 'mdse', 'mfd', 'mfg', 'mfr', 'mg', 'mgr', +
        'mgt', 'mgmt', 'mi', 'mil', 'min', 'misc', 'Miss', +
        'mktg', 'ml', 'mm', 'mo', 'Mon', 'mpg', 'mph', 'Mr', +
        'Mrs', 'Ms', 'msec', 'msg', 'mt', 'mtg', 'mtn', 'n', +
        'N', 'natl', 'naut', 'no', 'nos', 'Nov', 'o', 'O', +
        'Oct', 'ord', 'org', 'oz', 'p', 'P', 'pct', 'pd', +
        'pfd', 'pg', 'pkg', 'pkt', 'pl', 'pmt', 'pop', 'pp', +
        'ppd', 'Prof', 'publ', 'pvt', 'q', 'Q', 'qt', 'qty', +
        'r', 'R', 'rd', 'recd', 'ref', 'retd', 'Rev', 'rte', +
        's', 'S', 'sec', 'Sen', 'Sep', 'Sept', 'Sgt', 'sp', +
        'Sr', 'St', 'std', 'Sun', 't', 'T', 'tb', 'tech', +
        'Thur', 'Thurs', 'tit', 'tr', 'tsp', 'Tu', 'Tue', +
        'Tues', 'twp', 'u', 'U', 'univ', 'v', 'V', 'vs', +
        'vv', 'w', 'W', 'wd', 'Wed', 'Weds', 'whse', +
        'whsle', 'wk', 'wkly', 'wt', 'x', 'X', 'y', 'Y', +
        'yd', 'yr', 'z', 'Z')
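
The abbreviation lookup described for the SENTENCE_WITH_ABBREVIATION method can be sketched in Python as follows. This is an illustrative sketch only, not the BASIS implementation. The function names (is_sentence_end, prior_token, split_sentences) are hypothetical, the abbreviation set is a small subset of SYS_ABBR_SET, and the reading of the SENTENCE rule used here (an end character, then at least one skip character, then a begin character or the end of the text) is an assumption inferred from the punctuation set descriptions in this chapter.

```python
# Illustrative sketch only; NOT the BASIS implementation.
# The sets mirror the system-supplied punctuation sets shown in this chapter.
END = set(".?!")                                   # SYS_PUNC_SET_END
SKIP = set(" \n\"')")                              # SYS_PUNC_SET_SKIP
BEGIN = set("(ABCDEFGHIJKLMNOPQRSTUVWXYZ")         # SYS_PUNC_SET_BEGIN
WORD = set(" !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")  # SYS_PUNC_SET_WORD
ABBR_END = {"."}                                   # SYS_PUNC_SET_ABBR
# Small subset of SYS_ABBR_SET; entries carry no trailing period.
ABBREVIATIONS = {"corp", "Corp", "Dr", "etc", "Mr"}


def is_sentence_end(text, i):
    """Assumed SENTENCE rule: END char, then SKIP chars, then a BEGIN char."""
    if text[i] not in END:
        return False
    j = i + 1
    while j < len(text) and text[j] in SKIP:
        j += 1
    if j == len(text):
        return True                    # end of text always closes a unit
    return j > i + 1 and text[j] in BEGIN


def prior_token(text, i):
    """Return the run of non-WORD characters immediately before index i."""
    start = i
    while start > 0 and text[start - 1] not in WORD:
        start -= 1
    return text[start:i]


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into context units, skipping periods that end abbreviations."""
    units, start = [], 0
    for i, ch in enumerate(text):
        if not is_sentence_end(text, i):
            continue
        # The lookup is case-sensitive, as the Key Points require.
        if ch in ABBR_END and prior_token(text, i) in abbreviations:
            continue
        units.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        units.append(tail)
    return units
```

With the subset above, split_sentences("Acme Corp. Reports gains. Dr. Smith agrees.") yields two context units, because the periods after Corp and Dr are treated as abbreviation periods. Passing an abbreviation set that contains only 'corp' splits the text after "Corp.", which mirrors the key point that case is significant.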