MSG Chapter 5: Context Parsers

This chapter explains context parsers and the different methods of parsing. It also
contains the syntax for defining context parsers.
What is a Context Parser?
A context parser may be used in addition to user-defined markup to determine the end of
a context unit by parsing punctuational markup.
Before a document is converted to the BASIS internal storage format, a context parser is
used to read its textual content to determine the location of context units.
To use a context parser, simply specify its name on the CONTEXT_PARSER parameter
on the FIELD definition of the field that you want to parse.
Methods of Context Parsing
There are several methods of context parsing. The method that you choose determines
how the text is parsed. The methods of context parsing are:
LINE
Each input line is a context unit.
PARAGRAPH
This method looks for one or more blank lines as the end
of a paragraph. A blank line contains nothing but white
space (for example, tabs and spaces). For XML files, the
parser searches for one or more empty lines as the end of
a paragraph. An empty line contains no data and no white
space.
SENTENCE_FAST
The end of a sentence occurs whenever a period, question
mark, or exclamation point is encountered.
SENTENCE
The rules of English punctuation are used to determine the
end of a sentence.
SENTENCE_WITH_ABBREVIATION
By looking up the token prior to the period in a list of
valid abbreviations, this method determines whether the
period is used to end a sentence or if it is a part of an
abbreviation.
NONE
No context parsing whatsoever is done. However, during
import, each context delimiter (e.g., <#UNIT> for a
BGML converter) is honored. Furthermore, a new
successive context unit is created once the current context
unit in the process of being imported reaches its maximum
size of approximately 16,000 characters. NONE may be
specified for the CONTEXT_PARSER parameter of the
FIELD DDL statement, but NONE is not valid as a name
or METHOD of DEFINE/CONTEXT_PARSER.
Note: For records with fields that use
CONTEXT_PARSER=NONE, Open Text recommends
that the RECORD_STORAGE specification in the SDM
be specified with a capacity of C4. This helps ensure the
maximum number of words per context unit is not
exceeded during import.
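The simpler methods above can be illustrated with a short Python sketch. This is not BASIS code, and every function name in it is hypothetical; it only mirrors the rules as they are stated in this chapter. For NONE, the 16,000-character cap follows the approximate figure given above, and the way the split point is chosen is an assumption.

```python
import re


def split_lines(text):
    """LINE: each input line is one context unit."""
    return text.splitlines()


def split_paragraphs(text, xml=False):
    """PARAGRAPH: one or more blank lines end a paragraph.

    Non-XML input: a separating line may hold spaces or tabs.
    XML input: only a completely empty line separates paragraphs.
    """
    separator = r"\n(?:\n)+" if xml else r"\n(?:[ \t]*\n)+"
    return [p for p in re.split(separator, text) if p.strip()]


def split_sentences_fast(text, end_chars=".?!"):
    """SENTENCE_FAST: every end character closes a context unit."""
    units, start = [], 0
    for i, ch in enumerate(text):
        if ch in end_chars:
            units.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        units.append(tail)
    return [u for u in units if u]


def split_none(text, delimiter="<#UNIT>", max_chars=16000):
    """NONE: honor explicit delimiters, then cap the size of each unit.

    Assumption: the cap is applied by cutting at exactly max_chars.
    """
    units = []
    for piece in text.split(delimiter):
        for start in range(0, len(piece), max_chars):
            units.append(piece[start:start + max_chars])
    return units
```

Note that split_sentences_fast("Dr. Smith left. Why?") produces three units, "Dr.", "Smith left.", and "Why?", because SENTENCE_FAST treats every period as the end of a sentence. The SENTENCE and SENTENCE_WITH_ABBREVIATION methods exist to avoid breaks of this kind.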
Customizing Context Parsing Methods
To customize the methods of context parsing, you define punctuation sets and
abbreviation sets.
Punctuation sets allow you to specify which characters signal the beginning and the
ending of sentences and which characters do not. The kinds of punctuation sets that you
may define are:
PUNCTUATION_SET_ABBREVIATION
Contains the characters that are used to end abbreviations.
PUNCTUATION_SET_BEGIN
Contains the characters that begin a sentence.
PUNCTUATION_SET_END
Contains the characters that end a sentence.
PUNCTUATION_SET_SKIP
Contains the characters that should be skipped because
they appear between the end of one sentence and the
beginning of the next.
PUNCTUATION_SET_WORD
Contains the characters that separate words in a sentence.
An abbreviation set contains a list of abbreviations which are used to determine if a word
that ends with a period is an abbreviation or the end of a sentence.
Defining Context Parsers, Punctuation Sets, and Abbreviation Sets
To define a context parser, use the DEFINE/CONTEXT_PARSER statement. The
method of context parsing that you choose determines if you need to define punctuation
and abbreviation sets.
Use the DEFINE/PUNCTUATION_SET and the DEFINE/ABBREVIATION_SET
statements to define punctuation and abbreviation sets.
These statements and their parameters are described below.
Definition of Context Parsers
Purpose:
To determine the end of a context unit by parsing punctuational markup.
The name of a context parser may be specified for each field in a database by using the
CONTEXT_PARSER parameter on the FIELD statement.
The specified context parser is invoked each time a new record is added to the database or
when a record is edited. The system supplies several context parsers that are used by
default.
Syntax:
DEFine/CONTEXT_PARSER context_parser_name
METHOD=LINE | PARAGRAPH | SENTENCE_FAST | SENTENCE |
SENTENCE_WITH_ABBREVIATION
Parameters:
context_parser_name
(Required)
Specifies the name of the context parser. A valid name is a long_id (a character string
from 1 to 32 characters in length). This name must be unique within a markup and style
guide. For more details about long_id, see “Common Syntax.”
METHOD=LINE | PARAGRAPH | SENTENCE_FAST | SENTENCE |
SENTENCE_WITH_ABBREVIATION
(Required)
Determines how the text is parsed.
LINE
Each input line is a context unit.
PARAGRAPH
Looks for one or more blank lines as the end of a paragraph.
SENTENCE_FAST
Checks for a single punctuation character to signal the end of a
sentence.
If you choose this method, you must also specify the following
parameter:
PUNCTUATION_SET_END=punc_set_name
For an explanation of this parameter, see the “Key Points”
section below.
SENTENCE
Uses the rules of English punctuation to determine whether the
end of a sentence exists.
If you choose this method, you must also specify the following
parameters:
PUNCTUATION_SET_END=punc_set_name,
PUNCTUATION_SET_SKIP=punc_set_name,
PUNCTUATION_SET_BEGIN=punc_set_name
For an explanation of these parameters, see the “Key Points”
section below.
SENTENCE_WITH_ABBREVIATION
Because a period may end either a sentence or an abbreviation,
a period on its own is ambiguous. This method tries to resolve this
ambiguity by looking up the token prior to the period in a list
of valid abbreviations. To be efficient, this method first uses
the SENTENCE method to determine if an end of sentence
exists. If so, and the sentence ends in a period, the prior token
is then checked against the abbreviation list.
If you choose this method, you must also specify the following
parameters:
PUNCTUATION_SET_END=punc_set_name,
PUNCTUATION_SET_SKIP=punc_set_name,
PUNCTUATION_SET_BEGIN=punc_set_name,
PUNCTUATION_SET_WORD=punc_set_name,
PUNCTUATION_SET_ABBREVIATION=punc_set_name,
ABBREVIATION_SET=abbrev_set_name
For an explanation of these parameters, see the “Key Points”
section below.
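The parameter requirements listed above can be summarized as a lookup table. The following Python sketch is only an illustration for checking a definition before it is compiled; the names REQUIRED_PARAMETERS and missing_parameters are hypothetical and are not part of BASIS or DMSGC.

```python
# Required extra parameters per METHOD, transcribed from the descriptions above.
REQUIRED_PARAMETERS = {
    "LINE": [],
    "PARAGRAPH": [],
    "SENTENCE_FAST": ["PUNCTUATION_SET_END"],
    "SENTENCE": [
        "PUNCTUATION_SET_END",
        "PUNCTUATION_SET_SKIP",
        "PUNCTUATION_SET_BEGIN",
    ],
    "SENTENCE_WITH_ABBREVIATION": [
        "PUNCTUATION_SET_END",
        "PUNCTUATION_SET_SKIP",
        "PUNCTUATION_SET_BEGIN",
        "PUNCTUATION_SET_WORD",
        "PUNCTUATION_SET_ABBREVIATION",
        "ABBREVIATION_SET",
    ],
}


def missing_parameters(method, supplied):
    """Return the required parameter names that are absent from `supplied`."""
    if method not in REQUIRED_PARAMETERS:
        # NONE lands here too: it is not a valid METHOD of DEFINE/CONTEXT_PARSER.
        raise ValueError(f"unknown METHOD: {method}")
    return [name for name in REQUIRED_PARAMETERS[method] if name not in supplied]
```

For example, a SENTENCE definition that supplies only PUNCTUATION_SET_END would be reported as missing PUNCTUATION_SET_SKIP and PUNCTUATION_SET_BEGIN.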
Key Points:

For more information about common syntax (e.g., long_id), see “Common Syntax.”

The default context parsers are PARAGRAPH, SENTENCE, SENTENCE_FAST,
LINE, and SENTENCE_ABBR.
A default context parser can be changed by including a new definition of a context
parser using the name of the default context parser in your markup and style guide.

The following parameters are specified with some methods to allow customization of
the context parsing method:
ABBREVIATION_SET=abbrev_set_name
Contains a list of abbreviations for the
SENTENCE_WITH_ABBREVIATION method. The
abbrev_set_name specifies the name of an abbreviation set that is
defined by a DEFINE/ABBREVIATION_SET statement in your
markup and style guide.
PUNCTUATION_SET_ABBREVIATION=punc_set_name
Contains the characters that are used to end an abbreviation. Make
sure that this set contains the appropriate characters that also appear
in the PUNCTUATION_SET_END set.
The punc_set_name specifies the name of a punctuation set that is
defined by a DEFINE/PUNCTUATION_SET statement in your
markup and style guide.
PUNCTUATION_SET_BEGIN=punc_set_name
Contains the characters that start a sentence. In English, this may be
all capital letters and the left parenthesis. For Spanish, this set may also
include the opening exclamation mark (¡) and the opening question mark (¿).
The punc_set_name specifies the name of a punctuation set that is
defined by a DEFINE/PUNCTUATION_SET statement in your
markup and style guide.
PUNCTUATION_SET_END=punc_set_name
Contains the characters that typically end a sentence. In English, this
may be period, exclamation point, and question mark.
The punc_set_name specifies the name of a punctuation set that is
defined by a DEFINE/PUNCTUATION_SET statement in your
markup and style guide.
PUNCTUATION_SET_SKIP=punc_set_name
Contains the characters that appear between the sentence-ending
punctuation and the beginning of the next sentence. In English, this
may be a blank, newline, closing quote, and right parenthesis.
The punc_set_name specifies the name of a punctuation set that is
defined by a DEFINE/PUNCTUATION_SET statement in your
markup and style guide.
PUNCTUATION_SET_WORD=punc_set_name
Contains the characters that separate words in a sentence. In
English, this may be blank, open quotes, period, and left parenthesis.
For Spanish, this set may also include the opening exclamation mark (¡)
and the opening question mark (¿).
The punc_set_name specifies the name of a punctuation set that is
defined by a DEFINE/PUNCTUATION_SET statement in your
markup and style guide.
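How the END, SKIP, and BEGIN sets might work together in the SENTENCE method can be sketched in Python. This chapter does not spell out the exact rules BASIS applies, so the rule used here is an assumption: a sentence ends at an END character that is followed by at least one SKIP character and then a BEGIN character, or by the end of the text. The function name is hypothetical, and the three sets copy the system-supplied English defaults.

```python
import string

END = set(".?!")
SKIP = set(" \n\"')")  # blank, newline, double quote, single quote, right parenthesis
BEGIN = set("(" + string.ascii_uppercase)


def split_sentences(text, end=END, skip=SKIP, begin=BEGIN):
    """Assumed rule: END char, then one or more SKIP chars, then a BEGIN char
    (or the end of the text) closes a context unit."""
    units, start, i = [], 0, 0
    while i < len(text):
        if text[i] in end:
            j = i + 1
            while j < len(text) and text[j] in skip:
                j += 1
            at_end = j == len(text)
            if at_end or (j > i + 1 and text[j] in begin):
                # Assumption: skipped characters stay with the unit they follow.
                units.append(text[start:j].strip())
                start = j
                i = j
                continue
        i += 1
    if text[start:].strip():
        units.append(text[start:].strip())
    return units
```

Under this rule the period in "3.50" does not end a sentence, because no SKIP character follows it. "Dr. Smith arrived." is still split after "Dr.", which is the ambiguity that the SENTENCE_WITH_ABBREVIATION method and its two additional sets are meant to resolve.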
Example:
1.
The context parsers defined below are system-supplied context parsers that you will
find in your default markup and style guide.
DEFINE/CONTEXT_PARSER PARAGRAPH METHOD=PARAGRAPH
DEFINE/CONTEXT_PARSER SENTENCE METHOD=SENTENCE, +
PUNCTUATION_SET_END=SYS_PUNC_SET_END, +
PUNCTUATION_SET_SKIP=SYS_PUNC_SET_SKIP, +
PUNCTUATION_SET_BEGIN=SYS_PUNC_SET_BEGIN
DEFINE/CONTEXT_PARSER SENTENCE_FAST +
METHOD=SENTENCE_FAST, +
PUNCTUATION_SET_END=SYS_PUNC_SET_END
DEFINE/CONTEXT_PARSER LINE METHOD=LINE
DEFINE/CONTEXT_PARSER SENTENCE_ABBR +
METHOD=SENTENCE_WITH_ABBREVIATION, +
PUNCTUATION_SET_END=SYS_PUNC_SET_END, +
PUNCTUATION_SET_SKIP=SYS_PUNC_SET_SKIP, +
PUNCTUATION_SET_BEGIN=SYS_PUNC_SET_BEGIN, +
PUNCTUATION_SET_WORD=SYS_PUNC_SET_WORD, +
PUNCTUATION_SET_ABBREVIATION=SYS_PUNC_SET_ABBR, +
ABBREVIATION_SET=SYS_ABBR_SET
Definition of Punctuation Sets
Purpose:
To define a punctuation set which is used by a context parsing method.
Markup and style guide punctuation sets allow some flexibility in specifying which
characters represent the beginning and the ending of sentences for different languages.
Syntax:
DEFine/PUNCtuation_set punc_set_name (punctuation_list)
Parameters:
punc_set_name
(Required)
Specifies the name of the punctuation set. A valid name is a long_id (a character string
from 1 to 32 characters in length). This name must be unique within a markup and style
guide. For more details about long_id, see “Common Syntax.”
The names of the system-supplied punctuation sets all start with “SYS_”.
The Style Guide Compiler (DMSGC) assigns a unique internal number to each
punctuation set name.
punctuation_list
(Required)
Lists the punctuation entries, separated by commas, to be included in the set. This list contains
from 0 to 255 entries. Valid entries include 'char' (e.g., '.'), LINE (which means newline),
and charcode.
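The three kinds of entry can be modeled in Python as follows. This is only a sketch of how such a list could be resolved into a set of characters. The LINE object and the function name are hypothetical stand-ins, and treating a charcode as a numeric character code is an assumption, since its format is not described here.

```python
LINE = object()  # stands in for the LINE keyword, which means newline


def resolve_punctuation_list(entries):
    """Turn a list of entries into a set of single characters."""
    if len(entries) > 255:
        raise ValueError("a punctuation list holds at most 255 entries")
    chars = set()
    for entry in entries:
        if entry is LINE:
            chars.add("\n")
        elif isinstance(entry, int):
            # Assumption: a charcode is the numeric code of one character.
            chars.add(chr(entry))
        elif isinstance(entry, str) and len(entry) == 1:
            chars.add(entry)
        else:
            raise ValueError(f"not a valid punctuation entry: {entry!r}")
    return chars
```

For instance, the entries of a skip set such as ( ' ', LINE, '"', '''', ')' ) would resolve to the five characters blank, newline, double quote, single quote, and right parenthesis.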
Key Point:

For more information about common syntax (e.g., long_id), see “Common Syntax.”
Example:
1.
The punctuation sets defined below are system-supplied punctuation sets that you
will find in your default markup and style guide.
DEFINE/PUNCTUATION_SET SYS_PUNC_SET_END +
( '.', '?', '!' )
DEFINE/PUNCTUATION_SET SYS_PUNC_SET_SKIP +
( ' ', LINE, '"', '''', ')' )
DEFINE/PUNCTUATION_SET SYS_PUNC_SET_BEGIN +
('(', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', +
'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', +
'T', 'U', 'V', 'W', 'X', 'Y', 'Z' )
DEFINE/PUNCTUATION_SET SYS_PUNC_SET_WORD +
(' ', '!', '"', '#', '$', '%', '&', '''', '(', +
')', '*', '+', ',', '-', '.', '/', ':', ';', +
'<', '=', '>', '?', '@', '[', '\', ']', '^', +
'_', '`', '{', '|', '}', '~' )
DEFINE/PUNCTUATION_SET SYS_PUNC_SET_ABBR ( '.' )
Definition of Abbreviation Sets
Purpose:
To define an abbreviation set which is used by the context parsing method
SENTENCE_WITH_ABBREVIATION.
An abbreviation set is used to determine if a word that ends with a period is an
abbreviation or the end of a sentence.
Syntax:
DEFine/ABBREViation_set abbrev_set_name ( abbrev_list )
Parameters:
abbrev_set_name
(Required)
Specifies the name of the abbreviation set. A valid name is a long_id (a character string
from 1 to 32 characters in length). This name must be unique within a style guide. For
more details about long_id, see “Common Syntax.”
The names of the system-supplied abbreviation sets all start with “SYS_”.
DMSGC assigns a unique internal number to each abbreviation set name.
abbrev_list
(Required)
Lists the abbreviations, separated by commas, to be included in the set. This list
contains from 0 to 4000 entries.
Do not include periods in your abbreviation list. For example, you should enter ‘Tue’,
not ‘Tue.’, for the abbreviation of Tuesday.
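A list of abbreviations can be cleaned up before it is pasted into a definition. The Python sketch below is a hypothetical helper, not a BASIS utility: it strips trailing periods, drops duplicates, enforces the 4000-entry limit, and sorts the result. The sort order (case-insensitive, lowercase first on ties) is an assumption based on the order that the system-supplied set largely follows.

```python
def normalize_abbreviations(abbreviations, limit=4000):
    """Strip trailing periods, drop duplicates, and sort the entries."""
    cleaned = {a.rstrip(".") for a in abbreviations if a.rstrip(".")}
    if len(cleaned) > limit:
        raise ValueError(f"an abbreviation set holds at most {limit} entries")
    # Assumed order: case-insensitive, lowercase before uppercase on ties,
    # as in 'a', 'A', ..., 'corp', 'Corp'.
    return sorted(cleaned, key=lambda a: (a.lower(), a.swapcase()))
```

Given the entries 'Tue.', 'corp', 'Corp', 'a', 'A', and 'Tue', the helper returns 'a', 'A', 'corp', 'Corp', 'Tue'.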
Key Points:

For more information about common syntax (e.g., long_id), see “Common Syntax.”

When you are defining an abbreviation set, enter the abbreviations in alphabetical
order. This will save time when you compile your style guide.

Case is significant in abbreviations. For example, in the abbreviation set,
SYS_ABBR_SET, shown below, both the abbreviations ‘corp’ and ‘Corp’ are
included. If ‘Corp’ is not included, then Corp. is not recognized as an abbreviation,
but as an end of a context unit.
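The lookup described in this section, including its case sensitivity, can be sketched in Python. The function names are hypothetical and the abbreviation set is a small excerpt chosen for illustration. The word separators copy the system-supplied SYS_PUNC_SET_WORD set, and the way BASIS actually isolates the prior token is an assumption.

```python
# Word separators, copied from the system-supplied SYS_PUNC_SET_WORD set.
WORD_SEPARATORS = set(" !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

# A small excerpt for illustration; entries carry no trailing period.
ABBREVIATIONS = {"corp", "Corp", "Dr", "etc", "Mr"}


def prior_token(text, period_index, separators=WORD_SEPARATORS):
    """Return the token that ends just before text[period_index]."""
    start = period_index
    while start > 0 and text[start - 1] not in separators:
        start -= 1
    return text[start:period_index]


def period_ends_abbreviation(text, period_index, abbreviations=ABBREVIATIONS):
    """True when the token before the period is a listed abbreviation.

    The comparison is case-sensitive, as described above.
    """
    return prior_token(text, period_index, WORD_SEPARATORS) in abbreviations
```

With this excerpt, the period in "Acme Corp. ships today." is treated as part of an abbreviation, while the period in "Acme CORP. ships today." is not, because 'CORP' is not in the set.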
Example:
The abbreviation set defined below is a system-supplied abbreviation set that you will
find in your default markup and style guide.
DEFINE/ABBREVIATION_SET SYS_ABBR_SET ('a', 'A', +
'abbr', 'abbrev', 'abs', 'abstr', 'acad', 'acct', +
'ack', 'act', 'addn', 'addnl', 'adj', 'adm', +
'admin', 'adv', 'advt', 'agcy', 'aka', 'alg', +
'alt', 'am', 'AM', 'amt', 'ans', 'app', 'approx', +
'appt', 'Apr', 'apt', 'assn', 'assoc', 'atty', +
'Aug', 'ave', 'avg', 'b', 'B', 'bal', 'bar', 'bbl', +
'bd', 'bdl', 'bdle', 'bdrm', 'bef', 'bf', 'bg', +
'biog', 'bk', 'bkg', 'bkgd', 'bl', 'bld', 'bldg', +
'bldr', 'blk', 'blvd', 'br', 'bro', 'bros', 'bu', +
'bur', 'c', 'C', 'ca', 'cal', 'calc', 'canc', +
'Capt', 'ch', 'chan', 'chap', 'chem', 'chg', 'chm', +
'Chmn', 'circ', 'cit', 'civ', 'ck', 'cl', 'cm', +
'cmd', 'cmdg', 'cmdr', 'co', 'Co', 'col', 'coll', +
'comm', 'conf', 'cons', 'const', 'constr', 'cont', +
'contd', 'conv', 'corp', 'Corp', 'corr', 'cu', +
'cvt', 'cyc', 'cycl', 'cyl', 'd', 'D', 'db', 'dbl', +
'dec', 'Dec', 'decd', 'dept', 'diam', 'dict', +
'dif', 'diff', 'disp', 'dist', 'distr', 'div', 'dk',+
'dol', 'doz', 'dpt', 'Dr', 'dup', 'dz', 'e', 'E', +
'ea', 'ed', 'educ', 'elev', 'enc', 'encl', 'eq', +
'equip', 'equiv', 'esp', 'est', 'et', 'etc', 'ex', +
'exch', 'exec', 'exp', 'f', 'F', 'Feb', 'fed', +
'fig', 'fl', 'fn', 'fr', 'Fr', 'Fri', 'ft', 'fwd', +
'g', 'G', 'ga', 'gal', 'gov', 'govt', 'gr', 'grad', +
'h', 'H', 'hgt', 'hgwy', 'hr', 'hosp', 'ht', 'i', +
'I', 'ibid', 'illus', 'illustr', 'imp', 'inc', +
'incl', 'incr', 'ins', 'inst', 'instr', 'int', +
'intl', 'intnl', 'ital', 'j', 'J', 'Jan', 'jct', +
'jr', 'Jr', 'jnr', 'Jun', 'Jul', 'k', 'K', 'kg', +
'kt', 'kl', 'l', 'L', 'lat', 'lb', 'lbs', 'ln', +
'ltd', 'Lt', 'm', 'M', 'mag', 'Maj', 'manuf', 'Mar',+
'max', 'mdse', 'mfd', 'mfg', 'mfr', 'mg', 'mgr', +
'mgt', 'mgmt', 'mi', 'mil', 'min', 'misc', 'Miss', +
'mktg', 'ml', 'mm', 'mo', 'Mon', 'mpg', 'mph', 'Mr',+
'Mrs', 'Ms', 'msec', 'msg', 'mt', 'mtg', 'mtn', 'n',+
'N', 'natl', 'naut', 'no', 'nos', 'Nov', 'o', 'O', +
'Oct', 'ord', 'org', 'oz', 'p', 'P', 'pct', 'pd', +
'pfd', 'pg', 'pkg', 'pkt', 'pl', 'pmt', 'pop', 'pp',+
'ppd', 'Prof', 'publ', 'pvt', 'q', 'Q', 'qt', 'qty',+
'r', 'R', 'rd', 'recd', 'ref', 'retd', 'Rev', 'rte',+
's', 'S', 'sec', 'Sen', 'Sep', 'Sept', 'Sgt', 'sp', +
'Sr', 'St', 'std', 'Sun', 't', 'T', 'tb', 'tech', +
'Thur', 'Thurs', 'tit', 'tr', 'tsp', 'Tu', 'Tue', +
'Tues', 'twp', 'u', 'U', 'univ', 'v', 'V', 'vs', +
'vv', 'w', 'W', 'wd', 'Wed', 'Weds', 'whse', +
'whsle', 'wk', 'wkly', 'wt', 'x', 'X', 'y', 'Y', +
'yd', 'yr', 'z', 'Z')