Report on using the English Resource Grammar to extend fact extraction capabilities

David Mott (IBM UK); Stephen Poteet, Ping Xue, Anne Kao (Boeing Research & Technology)
1 Introduction
This report is a deliverable for Project 4, Task 2, Quarter 1, originally named "Initial principles for mapping external resources into CE framework". It is an extension of the long paper submitted to the Fall Meeting 2013 [22].
Fact extraction from unstructured sources is a key component in the supply of information to
human users such as analysts, but the extraction of the complete set of facts is a complex and
challenging task. In addition, it is necessary to express these facts using a conceptual model of the domain understood by the users, and to present the rationale for their extraction, so that users can apply and assess the facts in their analysis tasks.
In the BPP11 [1,2] we demonstrated an approach that used Controlled English (CE) [3,4] both for the facts extracted by Natural Language (NL) processing and for the configuration of the NL processing itself, so that extracted facts could be used for inference of high value information, and so that linguistic processing was made more accessible to the analyst user. Two key aspects were the development of a common linguistic model and the mapping of the linguistic structures into the domain semantics of the user. However, the linguistic capabilities of the BPP11 parsing system were limited, and we proposed to address this in the BPP13 research by seeking to integrate more sophisticated linguistic systems developed by the DELPH-IN consortium [6].
This report describes initial work on the use of the English Resource Grammar (ERG) [11] from the DELPH-IN community, which is capable of generating high quality and detailed representations of the syntax and semantics of English sentences. It outlines how transformations might be made between knowledge in the ERG and CE-based representations, so that the semantic output can be extracted into higher quality CE facts, that domain semantics contained in a CE model can be applied to assist parsing, and that linguistic reasoning can be made more available to the non-technical human user when configuring the extraction process for a particular domain, in support of information access and knowledge sharing in coalition operations.
2 An Example
We continue to use the SYNCOIN dataset [13] as an example of NL text to be interpreted. This provides reports from a military-relevant scenario with consistent story threads for different aspects of military and civilian operations against a background of counterinsurgency. One thread concerns the operation of the Human Terrain Team (HTT), responsible for maintaining a good relationship with the local population. One sentence from a report notes that:
HTT are conducting surveys in Adhamiya to judge the level of support for Bath’est return.
Such a sentence states a complex set of relationships, including entities (HTT, survey), situations (to conduct), relationships (for Bath'est return) and motivational links (to judge). The
ERG system is able to parse this sentence and represent the semantic relationships, and our task is
to convert these relationships into domain specific CE facts. Steps to produce some of the possible
CE facts are described below.
3 The English Resource Grammar (ERG)
The ERG is one of the linguistic resources developed over the last two decades by the DELPH-IN consortium, a "collaborative effort aimed at deep linguistic processing of human language", with ERG development undertaken principally at CSLI, Stanford, by Flickinger [11,21]. The ERG defines rules and structures to model a significant portion of the linguistic phenomena of English, and it is capable of analysing sentences more accurately, and to a greater level of detail, than the Stanford statistical parser [14] used in BPP11. Although DELPH-IN is developing other
grammars, e.g. the Grammar Matrix [9], we have chosen to use the ERG, due to the higher coverage of English that it affords.
The theoretical foundation of the ERG language model is Head-Driven Phrase Structure
Grammar [10] which provides an account of language as the composition of substructures (such as
phrases of different types) into higher level structures, where substructures are characterized by
key “head” subcomponents (e.g. nouns, verbs) and where the linguistic constraints on composition
are based on the nature of the heads and are highly lexicalized, i.e. the majority of information is
contained in lexical types on which the lexicon is built. Such linguistic information is represented
in a formal constraint-based language called Typed Feature Structures (TFS) [5], which defines a
hierarchy of types containing attributes, values, variables and equalities between variables. TFS is
used to represent (nearly) all of the model of linguistic phenomena, with TFS types defining
compositional grammar rules and lexical types, and TFS instances of these types defining the
lexicon of words. Thus a TFS type in the ERG defines linguistic notions such as “count noun”, and
“head-initial phrase”. TFS instances in the ERG lexicon define such entries as “cat” being a “count
noun”; such an entry also includes the orthography of the word (i.e. the way the word is written on
the page).
In order to parse a sentence, it is necessary to provide a logical definition of the nature of a
“well formed” sentence. The TFS definition is that all words in the sentence must be matched to a
lexical entry, via the orthography; that all such lexical entries must be composed by grammatical
rules into higher level phrases; that all such phrases must be further composed into higher level
phrases or root phrases; that there must be a root phrase that covers all the words and phrases, and
that this root phrase must be defined as one of an acceptable set of roots. By allowing different
types of roots, it is possible to define fragments of sentences as being acceptable, if this is
warranted by the nature of the texts being parsed. Here, “matching” between lexical items and
phrases, and between phrases and phrases, is defined as unification of structures, including
creation of structure when unification of variables is attempted.
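To give an informal flavour of matching-as-unification (and no more than that; the real TFS machinery operates over typed graphs and consults the type hierarchy), feature structures can be approximated as Prolog terms whose variables unify, creating structure as a side effect. The phrase/2 representation below is our own illustrative sketch, not a DELPH-IN format:

% Sketch: approximating feature-structure matching with Prolog unification.
% A hypothetical structure phrase(Head, Agreement).
lex(surveys, phrase(noun, agr(third, plural))).

% The Head Feature Principle as unification: the mother shares its
% Head and Agr variables with its head daughter.
headed_phrase(phrase(Head, Agr), phrase(Head, Agr)).

% ?- lex(surveys, Daughter), headed_phrase(Mother, Daughter).
% Mother = phrase(noun, agr(third, plural)).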
The ERG defines the grammar rules and lexical items for English, which are to be interpreted
as defined above. However, in order to actually parse a sentence it is necessary to run a parser
against the sentence, which applies the TFS structures to the sentence. The DELPH-IN consortium
has developed several parsers, and we have chosen to use the PET parsing system [15], as this is
an efficient implementation of the unification and parsing algorithm written in C++, which may
potentially be integrated into other systems, for example the CE Store.
As a result of parsing a sentence, the ERG provides a definition of the semantics of the
sentence in the Minimal Recursion Semantics (MRS) formalism [8]. This specifies a set of
elementary predications, each being a logical predicate together with arguments. The predicate is
derived from the lexicon or the grammar rules, and may indicate the existence of an individual
(e.g. that individual x7 is the HTT or x9 is a survey) or the occurrence of an event together with
the individuals involved (e.g. that the event e3 is a “conduct” event with the individual x7 being
the first (subject) argument), or the presence of a more abstract piece of information (such as that
the HTT is a definite object). As far as the basic ERG/PET system is concerned, the output of the
MRS completes the parsing process. However, for our use in fact extraction it is necessary to turn the MRS into domain semantics, and thereby generate CE facts representing the meaning of the sentence. The transformation of the MRS into domain facts is a key research item in the next stages of the task. Nevertheless, even the output of the semantics in the form of MRS is a
significant addition compared to the output from the Stanford parser, which did not provide
semantics. Further information is provided to assist the handling of scope relations between
quantifiers in sentences such as “all dogs chase a cat”, and further work is required to handle this
information.
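For illustration, the elementary predications for the example sentence can be written down directly as Prolog facts; the predicate and variable names follow the MRS output shown later, but the ep/2 representation itself is a sketch of our own, not a DELPH-IN format:

% Sketch: elementary predications for "HTT are conducting surveys".
ep('named_rel',        [arg0-x6, carg-'HTT']).
ep('_conduct_v_1_rel', [arg0-e3, arg1-x6, arg2-x10]).
ep('udef_q_rel',       [arg0-x10]).
ep('_survey_n_1_rel',  [arg0-x10]).

% ?- ep(P, Args), member(_-x10, Args).  % which predications mention x10?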
The use of the ERG for fact extraction requires the combination of several theories and technologies: the ERG, the PET parser, TFS and MRS. We will refer to this combination as the "ERG system".
4 Integrating CE and the ERG system
A key step in using the ERG system for fact extraction in a given analyst’s domain is to be able
to transform in both directions between the various language structures (represented as TFS and
MRS) and the CE representing the analyst’s CE conceptual model, facts and rationale. We aim to
reuse the BPP11 linguistic model [16] as the basis of this mapping, in order that existing
applications can make use of the new CE outputs. Such transformations are to be performed:

- between the ERG lexicon of words (in TFS and associated MRS relations) and the CE domain concepts, as this provides the starting point for the extraction of facts from the sentence. The lexicon itself may also have to be augmented with specialist words that are likely to occur in the text sources for the domain;
- between the parse tree output by the PET parser and a CE representation of the parse tree, in order that existing BPP11 applications that work with a parse tree can be supported;
- between the ERG grammar rules (in TFS) and CE structures for representing parsing rules, in order that users may better understand the nature of the processing. This will facilitate the development of new domain-specific parsing rules if different linguistic phenomena occur in the texts for the domain. This is more advanced than changing the lexicon and may require some degree of linguistic skill to reengineer the grammar. The mapping will also facilitate the presentation of rationale for the linking of extracted facts to the original sentences;
- between the semantics of the sentence expressed in the MRS output by the PET parser and CE facts. This transformation may also be useful for using domain semantics to guide the parsing itself.
To address these transformations we propose the following architecture, revised from the BPP11 version [2], in which the ERG system provides the phrase structures and general semantics, and in which we seek to extend the use of MRS to represent domain semantics as well as potentially to provide guidance to the parsing.
4.1 The Lexicon
Whilst the ERG lexicon is comprehensive, there is still a need to construct CE sentences to define new words, perhaps present only in the domain, to be added to the lexicon. For example, consider the ERG lexical entry for "survey":
survey_n1 := n_-_c_le &
  [ ORTH < "survey" >,
    SYNSEM [ LKEYS.KEYREL.PRED "_survey_n_1_rel",
             PHON.ONSET con ] ].
“survey_n1” is the name of the instance of the entry, which acts only as an identifier and does
not provide any type information. “n_-_c_le” defines the lexical type of this entry, here a count
noun. The string “survey” defines the orthography of the word. “_survey_n_1_rel” defines the
MRS relation, providing the “meaning” of the word.
To translate between CE and the TFS needed to express words in the ERG lexicon, we propose
some translation principles.
The MRS relation ("_survey_n_1_rel") must be mapped to the equivalent entity concept in the
CE conceptual model (survey), since the MRS relation is the semantic output of the parser. We
propose that the MRS relation is equivalent to the word sense in the CE lexical model, so it is
possible to state “the noun sense _survey_n_1_rel”. This may then be linked to the conceptual
model with an “expresses” relation, e.g. “the noun sense _survey_n_1_rel expresses the entity
concept survey”. As in the BPP11, this link must be defined by the creator of the conceptual
model, assisted by an “Analyst’s Helper”.
The rest of the ERG lexical entry defines a "lexeme". In the CE lexical model, the lexeme is not explicit and is represented by a grammatical form, with an associated orthographic form, and (optionally) with a Penn tag as postfix on the identifier, for example "the grammatical form |surveys_NNS| is written as the word |surveys|". This may then be associated with the word sense using the "is a form of" relation, e.g. "the grammatical form |surveys_NNS| is a form of the word sense _survey_n_1_rel". In the CE lexical model there is only one grammatical form entity for a combination of word (survey) and Penn tag (NN), thus it does not make sense to say "the grammatical form |surveys_NNS_1|". This leads to potential ambiguities, where a single grammatical form G may have several meanings; this would be expressed by a set of relations "the grammatical form G is a form of the word sense Y" with multiple values of Y. Whereas this is logically equivalent to the TFS structures, there may nevertheless be merit in giving the different form-wordsense relations unique names (equivalent to the lexicon entry names) for tracing purposes; in which case the need for a lexeme in the CE model may have to be re-evaluated.
The ERG lexical type (n_-_c_le) also serves to define a subtype of the grammatical form, for
example “the count noun”. Such lexical types could form a hierarchy of CE concepts, in an
equivalent manner to the hierarchy of ERG types (count noun is a type of noun). Orthography is
represented by an attribute of the grammatical form (e.g. the plural noun |surveys_NNS| is written
as the word |surveys|), with multiple word orthographies being mapped to compound nouns.
Using these principles we may define an equivalent CE definition for the TFS lexical entry:

there is a count noun named |surveys_NNS| that
  is a plural noun and
  is written as the word |surveys| and
  is a form of the noun sense _survey_n_1_rel.

the noun sense _survey_n_1_rel
  expresses the entity concept survey.
Note that in the parse tree for the example, shown below, the lexical entry for "surveys" is implicit, being computed by a lexical rule (n_pl_olr) as required. However, here we take such implicit entries to be already contained in the lexicon; hence the difference between the orthography of the count noun above (surveys) and that of the ERG lexical entry (survey).
A Prolog program has been constructed to demonstrate the mapping of CE into TFS, thus
allowing the user to construct new word definitions in CE to be added to the ERG lexicon. Only
the first sentence pattern (there is a count noun …) is required for this purpose, since the second
sentence pattern does not provide information that is added to the TFS lexical entry. The
alternative direction, TFS to CE, is yet to be investigated.
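To indicate the nature of this mapping, the sketch below generates the TFS text for a count-noun entry from the orthography and MRS relation taken from the CE sentences; it is illustrative only, and is not the actual program:

% Sketch: emitting a TFS count-noun lexical entry from CE-derived values.
lexical_entry(Name, Orth, Pred, Entry) :-
    format(atom(Entry),
           '~w := n_-_c_le &~n  [ ORTH < "~w" >,~n    SYNSEM [ LKEYS.KEYREL.PRED "~w",~n             PHON.ONSET con ] ].',
           [Name, Orth, Pred]).

% ?- lexical_entry(survey_n1, survey, '_survey_n_1_rel', E), write(E).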
4.2 The Parse Tree and Grammar Rules
The ERG defines a set of grammar rules that constrain how the words in the input sentences
can be combined and turned into a tree of phrase types, and the PET parser uses these rules to
generate valid parse trees from the input sentences. In the BPP11 work, a similar role was played
by the Stanford parser, although the mechanism by which this occurred was markedly different.
It is necessary to translate between the parse tree and CE, if the ERG system is to be used by
NL processing systems that require use of the parse tree; such systems may be used by other
projects in the BPP13 research, such as the conversational interface. However, it is important that the CE version of the PET parse tree is based on the same linguistic model as the previous BPP11 work, so that the Stanford and ERG parsers can be used together.
It is also necessary to translate between ERG grammar rules and CE, because users may need
to modify the grammar rules in order to handle domain specific language structure and because the
user may wish to understand the nature of the linguistic processing in order to provide rationale for
the inference of high value information.
Our research is focusing mainly on the semantic processing, so the details of the parse tree are
of lesser priority. Furthermore, the linguistic processing performed by the ERG system is complex, due to the complex nature of the linguistic phenomena that occur in English. For these reasons
only a limited amount of detail will be given about the parse trees and grammar rules in this paper.
For the example sentence, the raw parse tree output from the ERG parser is too long to show in its
complete form. A fragment for the phrase “conducting surveys” is shown below:
(563 hd-cmp_u_c 0 2 4
  (289 v_prp_olr 0 2 3
    (25 conduct_v1/v_np*_le 0 2 3 [v_prp_olr]
      (3 "conducting" 0 2 3)))
  (328 hdn_bnp_c 0 3 4
    (260 n_pl_olr 0 3 4
      (26 survey_n1/n_-_c_le 0 3 4 [n_pl_olr]
        (4 "surveys" 0 3 4)))))
Although this phrase is taken out of context (it is an adjunct to the auxiliary verb "are", and also includes the prepositional phrase "in Adhamiya"), it can be seen that there are two subcomponents, a verb
phrase based on "conduct" and a noun phrase based on "survey". Harder to see is that the verb "conduct" has been derived by a lexical rule (v_prp_olr) from the word "conducting"; that the noun "survey" has been derived by a lexical rule (n_pl_olr) from the word "surveys"; that the lexical entry (v_np*_le) for conduct_v1 indicates it is a verb taking a noun phrase as a complement; and that the lexical entry survey_n1 (n_-_c_le) indicates it is a count noun. The top part
of this tree is constructed from the rule whose type is “hd-cmp_u_c” which takes a head and a
complement (the head is the verb phrase and the complement is the noun phrase) and constructs a
single sub-component, which becomes the adjunct structure for the auxiliary verb “are”.
This parse tree fragment may be turned into CE following the BPP11 lexical model, in order that other parsing systems may use it. The model is based upon phrases that have heads and dependencies, but such information is not directly present in the ERG parse trees; therefore the current translation uses specific information about the lexical types to determine which subcomponents are heads and which are dependents (a sketch of this determination is given after the CE listings below). Such a bespoke process is fragile to fundamental changes in the lexical types, so further investigation is required to provide a more robust solution. Further work is also required to generate all of the tree structure that was provided by the Stanford parser.
Following this process, the resulting CE sentences are:
the verb phrase #p_563 has the verb phrase #p_289 as head
and has the noun phrase #p_328 as dependent.
the verb phrase #p_289 has the verb |conducting_VBG| as head.
the noun phrase #p_328 has the noun phrase #p_260 as head.
the noun phrase #p_260 has the noun |surveys_NNS| as head.
The ERG parse tree provides types of information not available from the Stanford parser, such as the lexical type (n_-_c_le), the nature of the phrase structure (being a head complement phrase) and additional features (n_pl_olr indicating that a noun is a plural form). For completeness this information is provided in CE about the phrases and grammatical forms:
the verb phrase #p_563 is a head complement phrase
and has 'hd-cmp_u_c' as erg type.
the verb phrase #p_289 is a head phrase
and has 'v_prp_olr' as erg type.
the verb |conducting_VBG| is a present verb
and has 'v_np*_le' as erg type
and has the thing v_prp_olr as feature.
the noun phrase #p_328 is a head phrase
and has 'hdn_bnp_c' as erg type.
the noun phrase #p_260 is a head phrase
and has 'n_pl_olr' as erg type.
the noun |surveys_NNS| is a plural noun
and has 'n_-_c_le' as erg type
and has the thing n_pl_olr as feature.
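A minimal sketch of the head/dependent determination described above is given below; the head_dependent/4 table is hypothetical and covers only the head-complement case, whereas the actual translation consults much more information about the rule and lexical types:

% Sketch: deciding head and dependent daughters from the ERG rule type.
% For a head-complement phrase the head daughter comes first.
head_dependent('hd-cmp_u_c', [H, D], H, D).

node_to_ce(node(Id, Rule, Daughters), CE) :-
    head_dependent(Rule, Daughters, node(HId, _, _), node(DId, _, _)),
    format(atom(CE),
           'the phrase #p_~w has the phrase #p_~w as head and has the phrase #p_~w as dependent.',
           [Id, HId, DId]).

% ?- node_to_ce(node(563, 'hd-cmp_u_c',
%       [node(289, 'v_prp_olr', []), node(328, 'hdn_bnp_c', [])]), CE).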
Phrase structures may also be diagrammed as a tabular version of the CE, where the two sub-phrases (verb and noun) can be more easily seen:
the head complement phrase #p_563 that
  has as head
    the head phrase #p_289 that
      has as head the present verb |conducting_VBG|
      and has as erg type v_prp_olr
  and has as dependent
    the head phrase #p_328 that
      has as head
        the head phrase #p_260 that
          has as head the plural noun |surveys_NNS|
          and has as erg type n_pl_olr
      and has as erg type hdn_bnp_c
  and has as erg type hd-cmp_u_c
The topmost grammar rule for this construction is “hd-cmp_u_c”, which combines a head and a
complement. This rule is defined in TFS, but it is difficult to explain this rule in any detail here,
due to lack of space and due to the complexity of the linguistic theory that it follows. The rule is
composed of structures at different levels of the type hierarchy, and to give a flavour, two of its
(relatively simple) supertypes are shown below.
Firstly it is a “headed” phrase:
headed_phrase := phrase &
  [ SYNSEM.LOCAL [ CAT [ HEAD head & #head, HC-LEX #hclex ],
                   AGR #agr, CONJ #conj ],
    HD-DTR.SYNSEM.LOCAL local & [ CAT [ HEAD #head, HC-LEX #hclex ],
                                  AGR #agr, CONJ #conj ] ].
Such phrases obey the Head Feature Principle [10], which states that a phrase must share its key properties with the head of the phrase, where the head is defined as being one of the words in the phrase whose type defines its essential nature (for example a noun phrase will have a noun as its head). In this type, the HD-DTR ("head daughter") holds the head of the phrase and this is passed up to the head of the phrase itself, along with agreement information (e.g. its person, number and gender).
Secondly it is a “head initial” phrase:
basic_head_initial := basic_binary_headed_phrase &
  [ HD-DTR #head,
    NH-DTR #non-head,
    ARGS < #head, #non-head > ].
This states that the phrase is composed of an ordered sequence of two subphrases, the first being the head daughter (HD-DTR) and the second being the non-head daughter (NH-DTR); in effect the "head" is the initial subphrase.

It should be noted that the full definition of the hd-cmp_u_c rule is far more complex than this, and comprises information from 23 different types, each providing a set of constraints on the structure of the phrase. This includes information that it is a binary phrase, and that it operates left to right.
To get a better visualisation of these TFS definitions, we are exploring the use of CE in two ways. The first is as a graph, with CE entities and CE relations, where the uppercase names are attributes (e.g. HEAD), and the pathways from the central phrase can be followed to the entities that are the values of these attributes. The diagram below shows just the information from the "basic_head_initial" definition. The dotted lines indicate matching (unification) between the entities that is forced by the type definitions. For example, the matching of the ARGS (0th and 1st) of the subphrases below the arrow shape to the HD-DTR and NH-DTR is caused by the "basic_head_initial" type definition.

Another example is given below, where the two rules are combined in the same diagram:
A second way to visualize the rule definition is via a “linguistic frame” [17] that defines the
constraints between the phrase and its subcomponents via CE statements. For example the
linguistic frame for the basic_head_initial type is:
there is a linguistic frame named f1 that
  defines the basic-head-initial PH and
  has the sequence
    ( the sign A0 , and the sign A1 )
  as subcomponents and
  has the statement that
    ( the basic-head-initial PH has the sign A0 as HD-DTR and
      has the sign A1 as NH-DTR )
  as semantics.
This may be read as defining a phrase of type "basic-head-initial" which composes two subphrases (or signs) on the parse tree, called A0 and A1, and has these subphrases as head daughter and non-head daughter respectively.
The type “headed-phrase” may be defined as a further linguistic frame:
there is a linguistic frame named f2 that
  defines the headed-phrase PH and
  has the statement that
    ( the headed-phrase PH
      has the HEAD of the LOCAL CAT of the HD-DTR as
        the HEAD of the LOCAL CAT and
      has the HC-LEX of the LOCAL CAT of the HD-DTR as
        the HC-LEX of the LOCAL CAT and
      has the AGR of the LOCAL SYNSEM of the HD-DTR as
        the AGR of the LOCAL SYNSEM and
      has the CONJ of the LOCAL SYNSEM of the HD-DTR as
        the CONJ of the LOCAL SYNSEM
    )
  as semantics.
This defines constraints on any phrase of this type such that the stated attributes of the head
daughter match the same attributes of the headed-phrase.
Several extensions to CE are being explored in order to simplify these linguistic frames: firstly, an ability to have a path of attributes, following the graph of relations, for example "the HEAD of the LOCAL CAT of the HD-DTR"; secondly, the definition of attribute names as representing common subpaths, for example "LOCAL CAT" as a shorthand for "the CAT of the LOCAL of the SYNSEM". These extensions are experimental, and are particularly useful when defining
TFS structures, which are sets of pathways across the graph of entity attributes. It is necessary to
be able to convert between TFS structures and CE linguistic frames, in order that users can create
new rules and more easily understand existing rules. More work is required to design the
mechanisms for such translations.
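As a small illustration of such path following, the sketch below walks a path of attributes across a graph held as attr/3 facts; the facts are invented for the example, whereas the real graph would come from the parsed TFS:

% Sketch: following "the HEAD of the LOCAL CAT of the HD-DTR" as a path.
attr(p563, 'HD-DTR', p289).
attr(p289, 'SYNSEM', s289).
attr(s289, 'LOCAL',  l289).
attr(l289, 'CAT',    c289).
attr(c289, 'HEAD',   verb_head).

follow_path(Entity, [], Entity).
follow_path(Entity, [Attr|Rest], Value) :-
    attr(Entity, Attr, Next),
    follow_path(Next, Rest, Value).

% ?- follow_path(p563, ['HD-DTR', 'SYNSEM', 'LOCAL', 'CAT', 'HEAD'], V).
% V = verb_head.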
5 Semantics and Rationale
A key aspect is the mapping of the semantics extracted by the ERG system in the MRS formalism to the domain semantics of the user's conceptual model, as this allows the output of facts in CE. We propose that the MRS output be translated into domain specific CE facts in three
stages: a raw CE form containing only the same information as the output MRS; an intermediate CE form containing useful abstractions of the semantics; and the domain specific CE form. To illustrate the process, we show a fragment of the MRS output for the phrase "HTT are conducting surveys":
[ LTOP: h1
  INDEX: e3 [ e SF: PROP TENSE: PRES MOOD: INDICATIVE PROG: + PERF: - ]
  RELS: <
    [ udef_q_rel<-1:-1>
      LBL: h4
      ARG0: x6 [ x PERS: 3 NUM: PL IND: + ]
      RSTR: h7
      BODY: h5 ]
    [ named_rel<-1:-1>
      LBL: h8
      ARG0: x6
      CARG: "HTT" ]
    [ "_conduct_v_1_rel"<-1:-1>
      LBL: h9
      ARG0: e3
      ARG1: x6
      ARG2: x10 [ x PERS: 3 NUM: PL IND: + ] ]
    [ udef_q_rel<-1:-1>
      LBL: h11
      ARG0: x10
      RSTR: h13
      BODY: h12 ]
    [ "_survey_n_1_rel"<-1:-1>
      LBL: h14
      ARG0: x10 ]
    ... >
  HCONS: < h1 qeq h2 h7 qeq h8 h13 qeq h14 ... > ]
There is insufficient space here to describe all of the information in this MRS output, which itself has been shortened from the original. However, two examples may be picked out. Firstly, the MRS relation "_survey_n_1_rel" corresponds to the "surveys" being conducted. Here the relation contains a single argument (ARG0) whose value (x10) "stands for" the survey. This is equivalent to the BPP11 general semantic intuition of noun phrases "standing for" things in the real world. There is further information about x10 in the ARG2 argument of the MRS relation "_conduct_v_1_rel", which has the "NUM: PL" feature set, indicating that this is plural, and hence that x10 is actually a group of surveys (with unknown cardinality). There is a further relation in the MRS, udef_q_rel, indicating the indefinite nature of the phrase "surveys", and this will be taken up below.
Secondly, the act of "conducting" the surveys is shown by the MRS relation "_conduct_v_1_rel", which has three arguments: the "event" of conducting (e3), the thing doing
the conducting (x6) and the thing being conducted (x10). This is equivalent to the BPP11 general
semantic intuition that verb phrases “stand for” situations (of which event is a subtype) and there
are things fulfilling roles in this situation (although the roles of agent and patient are not here
specified, and it is a matter of research as to whether the roles should be extracted, especially as
more complex sentences can generate pragmatic “passive markers” in the MRS relations).
5.1 The Raw MRS form
The first stage in turning the MRS output into CE is to translate it into "elementary predications" with their arguments:
the mrs elementary predication #ep1_0 is an instance of the mrs predicate 'udef_q_rel'
  and has the thing x6 as zeroth argument.
the mrs elementary predication #ep1_1 is an instance of the mrs predicate 'named_rel'
  and has the thing x6 as zeroth argument
  and has 'HTT' as c argument.
the mrs elementary predication #ep1_2 is an instance of the mrs predicate '_conduct_v_1_rel'
  and has the situation e3 as zeroth argument
  and has the thing x6 as first argument
  and has the thing x10 as second argument.
the mrs elementary predication #ep1_3 is an instance of the mrs predicate 'udef_q_rel'
  and has the thing x10 as zeroth argument.
the mrs elementary predication #ep1_4 is an instance of the mrs predicate '_survey_n_1_rel'
  and has the thing x10 as zeroth argument.
Further information may be provided about the features of the entities, for example, that the
“surveys” are plural:
there is a thing named x10 that
  has the person category third as feature and
  has the number category plural as feature and
  has the category 'IND:+' as feature.
The remaining information about scope quantification (held in the HCONS section) is also
captured as CE, in the form of an “equals modulo quantifiers” relation between mrs elementary
predications.
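To indicate the flavour of this first stage, the sketch below renders one elementary predication as a raw CE sentence; the argument-position names follow the CE above, but the code is illustrative rather than the actual translator:

% Sketch: rendering an elementary predication as a raw CE sentence.
position_name(0, 'zeroth argument').
position_name(1, 'first argument').
position_name(2, 'second argument').

ep_to_ce(Id, Pred, Args, CE) :-
    findall(Part,
            ( nth0(N, Args, Value),
              position_name(N, Pos),
              format(atom(Part), 'has the thing ~w as ~w', [Value, Pos]) ),
            Parts),
    atomic_list_concat(Parts, ' and ', Body),
    format(atom(CE),
           'the mrs elementary predication #~w is an instance of the mrs predicate \'~w\' and ~w.',
           [Id, Pred, Body]).

% ?- ep_to_ce(ep1_2, '_conduct_v_1_rel', [e3, x6, x10], CE).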
5.2 Intermediate MRS
It would be possible to translate this "raw CE" directly into domain specific CE, but it is proposed to generate some intermediate, more abstract, representations of certain aspects of the raw MRS, as such representations may prove useful for understanding what is being represented. For example, the quantification of things (such as the plural nature of "surveys") is spread across several MRS relations, and it may be useful to build an intermediate "quantification" specification. In this example we might state:
there is a group quantification named q1 that
  is on the thing x10 and
  has the mrs predicate '_survey_n_1_rel' as characteristic type.
If the cardinality of the quantification were known (as in “three surveys”) then this information
could be added to q1. Further quantification types could be created to express definite
quantification of individuals, indefinite quantification, etc. Such information could be inferred by
CE rules that match against patterns of MRS relations (including, in this example, the “udef_q_rel”
and the NUM: PL feature).
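The sketch below shows the shape of such an inference, written as an illustrative Prolog rule over facts assumed to mirror the raw CE; the equivalent would be written as a CE rule in our system:

% Sketch: inferring a group quantification from a pattern of MRS relations.
instance_of(x10, '_survey_n_1_rel').
quantified_by(x10, 'udef_q_rel').
feature(x10, number(plural)).

group_quantification(Thing, CharacteristicType) :-
    quantified_by(Thing, 'udef_q_rel'),
    feature(Thing, number(plural)),
    instance_of(Thing, CharacteristicType).

% ?- group_quantification(x10, T).
% T = '_survey_n_1_rel'.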
5.3 Domain Semantics
Using similar intuitions to those of BPP11 [2], the MRS in raw and intermediate CE form can be turned into domain CE:

- Nouns and verbs may be turned into types of domain concepts via the expresses relation (see the sketch after this list). A noun expresses an entity concept, for example "the mrs predicate '_survey_n_1_rel' expresses the entity concept survey", and a verb expresses a relation concept, for example "the mrs predicate '_conduct_v_1_rel' expresses the relation concept conducts".
- Things "stood for" by noun and verb phrases are inferred to have the type of CE concept defined by the head (e.g. survey or conducts); noun phrases stand for things and verb phrases stand for situations. It is not yet clear if it is necessary to infer the roles (e.g. patient and agent) played by things associated with the situation, as was done in BPP11, since sufficient information may be present in the MRS.
- However, there is a major extension to the BPP11 translation of noun phrases and verb phrases noted above if the intermediate abstract MRS information is to be utilized. For example, a group quantification on a thing indicates that it must be turned into the domain representation of a group (of surveys) rather than an individual (survey). This requires the construction of a model of "groups" that was missing from the BPP11 work. A tentative proposal is that there be a "group of XXXs" where XXX is the name of a concept, and that this group has an optional cardinality.
- Other linguistic types, such as proper names, adjectives and prepositions, may be turned into domain semantics in a similar way to the BPP11 processing.
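A sketch of the first of these steps, looking up the "expresses" links of the conceptual model, is given below; the facts restate the CE examples above, and the predicate names are our own illustration:

% Sketch: applying the "expresses" links from the conceptual model.
expresses('_survey_n_1_rel',  entity_concept(survey)).
expresses('_conduct_v_1_rel', relation_concept(conducts)).

instance_of(x10, '_survey_n_1_rel').

domain_type(Thing, Concept) :-
    instance_of(Thing, Pred),
    expresses(Pred, Concept).

% ?- domain_type(x10, C).
% C = entity_concept(survey).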
However, some of the MRS relations for certain linguistic types contain additional information; for example, adjectives contain an "event" argument, and it is not yet clear how this should be handled. It is necessary to understand the generic semantic principles that were used to determine the additional ERG information; discussion with the DELPH-IN community suggests that some of this information is available in the semantic interface definition (the SEM-I [18]), but that a specification of the semantic design is not explicitly available. This seems a fruitful area for further study during this BPP13 task.
The extraction of high quality facts is not the focus of the first three months of the research
proposal, which, rather, is focused on the mechanisms for representing ERG linguistic information
in CE. However, for completeness, the result of applying the initial linguistic processes to the
output of the MRS is shown below:
the organisation x6 known as HTT conducts the group of surveys x10.
the situation e3 is contained in the container x15 known as Adhamiya.
This is not yet as readable as the BPP11 results. The first sentence captures the conducting of the surveys, but the second sentence, stating that the situation occurs in Adhamiya, has missed the link between the situation e3 and the conducting situation. Furthermore, no processing has yet been done for the motivational link "in order to judge …". Nevertheless, the fact that there is a group of surveys is new information in comparison to the BPP11 work, and significantly fewer semantic processing rules had to be written and executed in CE to generate the above sentences, due to the increased semantic output of the ERG system. We anticipate that the facts will become more readable and that more detailed information will be extracted as the work continues.
5.4 Rationale
It is necessary to extract and display the rationale for the extracted facts. The reasoning occurs in two parts: the ERG grammatical rules applied by PET, and the CE-based rules applied to the raw MRS. The reasoning derived from the elementary predications may be displayed in the manner defined in BPP11 [2]. For example, given the sentence:
the group of things x10 has the entity concept survey as categorisation.
the rationale is similar to the following:
However, for the ERG grammatical reasoning, the steps that generate these elementary
predications are not directly available in the PET parser. The parse tree does provide some form of
“template” for the reasoning, and the detailed TFS for each type in the parse tree could in theory
be extracted from the ERG. However it is not currently possible to map between the parse tree and
the MRS in the PET output, and without such mapping it would not be possible to determine
which parts of the parse tree were involved. Further research is necessary in this area.
6 Integrating to the CE processing chain
The PET parsing system and the Prolog program that turns the MRS into CE must be provided as a web service, in order that they may be called from the CE Store [19] or from other programs such as the CE-embedded Word documents [20]. An architecture is being constructed for this purpose, running under Debian Linux, as diagrammed below:

The web service is provided by a Prolog program that takes a sentence to be parsed, calls the PET parser with the ERG loaded, receives the output of the parser, and turns this output into CE.
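A sketch of such a service, assuming SWI-Prolog's HTTP libraries, is shown below; the handler and the two stub predicates standing in for the PET call and the MRS translation are purely illustrative:

% Sketch: a web service wrapping the parser, assuming SWI-Prolog.
:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).
:- use_module(library(http/http_parameters)).

:- http_handler(root(parse), handle_parse, []).

server(Port) :- http_server(http_dispatch, [port(Port)]).

handle_parse(Request) :-
    http_parameters(Request, [sentence(Sentence, [])]),
    parse_with_pet(Sentence, MRS),   % stub for the call to PET
    mrs_to_ce(MRS, CE),              % stub for the MRS-to-CE translation
    format('Content-type: text/plain~n~n~w~n', [CE]).

% Stubs standing in for the real components:
parse_with_pet(_Sentence, mrs_placeholder).
mrs_to_ce(MRS, MRS).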
We aim to support the relationships, diagrammed below, between information in the CE system, the ERG system and the Stanford parser that have been described in this paper:
Additional Prolog code has been constructed that parses the TFS for the ERG and provides
some utility functions, such as showing all paths for a given TFS type. It was this information that
was used to help generate the “palm tree” diagram (although this was drawn by hand). Such
facilities may be useful in converting between linguistic frames and TFS.
7 Integration of Domain reasoning
A key research topic is to integrate the domain reasoning with the ERG/PET system,
potentially allowing domain semantics to guide parsing of the text. The simplest possibility is for
the domain model to source new lexical entries or grammar rules, following some of the suggestions on translation between CE and TFS noted above. Such integration can be done when the grammar is compiled, creating new grammar information which is then added into the ERG, as diagrammed below:
A more interesting, but complex, integration might occur at parse time, where domain
reasoning is called upon by the parser to affect the current state of the parse, for example by ruling
out inconsistent parses (in effect providing selectional restrictions to rule out alternatives) or by feeding into the ranking of the parses. Such an architecture is diagrammed below:
The development of such integration is complex and will require significant research, as
originally proposed under the “deeper semantics” aspect of the task.
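As a purely illustrative sketch of the simplest form of parse-time filtering, a selectional restriction might be checked against the elementary predications of a candidate parse; no such hook currently exists in our system, and all names below are invented:

% Sketch: rejecting a candidate parse that violates a domain restriction.
% Domain model: only organisations conduct things.
has_type(x6, organisation).

domain_ok('_conduct_v_1_rel', Agent) :- has_type(Agent, organisation).

consistent_parse(EPs) :-
    forall(member(ep('_conduct_v_1_rel', [_, Agent, _]), EPs),
           domain_ok('_conduct_v_1_rel', Agent)).

% ?- consistent_parse([ep('_conduct_v_1_rel', [e3, x6, x10])]).
% true.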
8 Discussion and Conclusion
This paper has presented preliminary work on the use and integration of the ERG system into
the BPP11 CE fact extraction process. Application of the ERG to SYNCOIN sentences suggests
informally that more accurate linguistic detail can be extracted in comparison to the Stanford
parser, but that the ERG system is slower and in some cases does not generate any parse where the Stanford parser would. This is in keeping with the nature of the deep linguistic approach of the ERG as opposed to the statistical approach of the Stanford parser, where only relatively coarse-grained linguistic information is available. We conclude that the ERG system is of potential
benefit in generating higher quality parses, together with greater semantic detail, but that we may
still need to use the Stanford parser as a “backup” when the ERG system fails to generate a parse.
This makes it particularly important to ensure that both systems provide information in the same
CE linguistic model.
Our preliminary techniques described in this report suggest that it is possible to transform
linguistic information between the ERG system and the CE-based representation, allowing the
integration of the ERG system into our fact extraction approach, and potentially providing a
greater degree of involvement by non-linguist users in the understanding and modification of
linguistic processing for specific domains, although the transformations involving grammar rules
will still require linguistic skill.
Informal inspection of the MRS output by the PET parser suggests that the structures are quite
closely linked to the linguistic processing, and that it may be useful to abstract out some of the
underlying concepts, as suggested in the section on MRS representation. This was described by
[12] as being the result of tension between the need for a generic representation versus the fact that
MRS relations are ultimately sourced from grammar rules, which by definition must follow
linguistic phenomena. Such abstraction may facilitate understanding by non-linguists and the
further mapping of the MRS into domain semantics; this is consistent with the approach taken in
the BPP11 processing that separated generic from domain specific semantics. We therefore
propose that this is a worthwhile area for further research.
One of the authors attended the DELPH-IN summit, where this work was presented [7], and was made aware that the use of domain semantics to assist the down-stream use of the MRS output would be of interest to the community; such use of domain semantics is a key component of our work in building CE based analyst's models, and therefore this is an area where we could potentially make a contribution to the DELPH-IN work.
This task has just started, and much work needs to be done; further research should include a better integration of the PET software with the CE-based systems, including the CE Store; refinement and testing of the CE-based representational structures; extension of the concept of intermediate MRS representations to cover more general semantic phenomena; the extraction of rationale from the ERG system; and the use of domain semantics to assist the parsing.
9 Acknowledgement
This research was sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence and
was accomplished under Agreement Number W911NF-06-3-0001. The views and conclusions contained in
this document are those of the author(s) and should not be interpreted as representing the official policies,
either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K.
Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce
and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
10 References
[1] Xue, P., Mott, D., Braines, D., Poteet, S., Kao, A., Giammanco, C., Pham, T., McGowan, R., Information Extraction using Controlled English to support Knowledge-Sharing and Decision-Making, in 17th ICCRTS "Operationalizing C2 Agility", Fairfax VA, USA, June 2012.
[2] Mott, D., Braines, D., Poteet, S., Kao, A., Controlled Natural Language to facilitate information extraction: Fact extraction using Controlled English, ACITA 2012.
[3] Sowa, J., Common Logic Controlled English, http://www.jfsowa.com/clce/clce07.htm
[4] Mott, D., Summary of Controlled English, ITACS, https://www.usukita.org/papers/5658/details.html, 2010.
[5] Copestake, A., Implementing Typed Feature Structure Grammars, CSLI Publications, 2002.
[6] The DELPH-IN consortium, http://www.delph-in.net/
[7] Mott, D., Poteet, S., Xue, P., Kao, A., Copestake, A., Fact Extraction using Controlled English and the English Resource Grammar, DELPH-IN Summit, July 2013, http://www.delph-in.net/2013/david.pdf
[8] Copestake, A., Flickinger, D., Sag, I. A., Pollard, C., Minimal Recursion Semantics: an introduction, Research on Language and Computation, 3(2-3):281-332, 2005.
[9] Bender, E. M., Flickinger, D., Oepen, S., The Grammar Matrix: an open-source starter-kit for the rapid development of crosslinguistically consistent broad-coverage precision grammars, in Proc. Workshop on Grammar Engineering and Evaluation, COLING 2002, pages 8-14, Taipei, Taiwan.
[10] Sag, I. A., Wasow, T., Bender, E. M., Syntactic Theory: A Formal Introduction, Second Edition, Stanford: CSLI Publications [distributed by University of Chicago Press], 2003.
[11] Copestake, A., Flickinger, D., An open-source grammar development environment and broad-coverage English grammar using HPSG, in Proceedings of the Second Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece, 2000.
[12] Bender, E. M., personal communication, August 2013.
[13] Rimland, G., Hall, A., A COIN-inspired Synthetic Dataset for Qualitative Evaluation of Hard and Soft Fusion Systems, Information Fusion, Chicago, Illinois, USA, 2011.
[14] The Stanford Parser: a statistical parser, http://nlp.stanford.edu/software/lex-parser.shtml
[15] The PET parser, http://moin.delph-in.net/PetTop
[16] Mott, D., Poteet, S., Xue, P., A New CE-based Lexical Model, https://www.usukitacs.com/node/2271
[17] Mott, D., Braines, D., Laws, S., Xue, P., Exploring Controlled English for representing knowledge in the Linguistic Knowledge Builder, Sept 2012, https://www.usukitacs.com/node/2231
[18] Flickinger, D., Lønning, J. T., Dyvik, H., Oepen, S., Bond, F., SEM-I Rational MT: Enriching Deep Grammars with a Semantic Interface for Scalable Machine Translation, 2005, http://web.mysites.ntu.edu.sg/fcbond/open/pubs/2005-summit-semi.pdf
[19] CE Store - Alpha Version 2, https://www.usukitacs.com/node/1670
[20] 4th Battalion Communications Report, https://www.usukitacs.com/node/2341
[21] Flickinger, D., The English Resource Grammar, LOGON technical report #2007-7, www.emmtee.net/reports/7.pdf
[22] Mott, D., Poteet, S., Xue, P., Kao, A., Copestake, A., Using the English Resource Grammar to extend fact extraction capabilities, Fall Meeting 2013, https://www.usukitacs.com/node/2498