
Challenges solved and unsolved in Fact Extraction from Natural Language

David Mott
Emerging Technology Services, IBM United Kingdom Ltd
Hursley Park, Winchester, UK

Stephen Poteet, Ping Xue, Anne Kao
Boeing Research & Technology
Seattle, WA, US

Ann Copestake
University of Cambridge, UK

Cheryl Giammanco
Human Research & Engineering, US Army Research Laboratory
Aberdeen Proving Ground, MD, USA

Abstract— Information from unstructured sources is key to human-machine cognitive tasks, but requires Natural Language fact extraction, together with a reasoning capability that allows the user to express assumptions and rules to infer high value information, all based upon a domain conceptual model. We describe research in the use of Controlled English (CE) for fact extraction and reasoning, applied to a complex problem-solving task, the ELICIT identification task, which has posed many challenges.

Our research has integrated the DELPH-IN English Resource Grammar (ERG) to extract detailed linguistic information based upon a deep parsing of sentences, used a CE domain model to guide transformation of the linguistic information to domain facts in a principled way, extended CE syntax to allow meta-reasoning for dynamic rule extraction from sentences, applied assumptions to handle NL ambiguities and to track sources of uncertainty such as sentence interpretations and linguistic expressions of uncertainty, and solved a component of the ELICIT task.

However, significant challenges remain in the search for a general solution to the application of domain knowledge to fact extraction. Even ELICIT sentences show considerable uncertainty and ambiguity that require a detailed understanding of the domain to overcome, and we have therefore started to analyse the specific relationships between domain knowledge and the resolution of ambiguities, in order to make further progress.

I. INTRODUCTION

This paper reports work under the International Technology Alliance (ITA) [1] on supporting collaborations of human and machine in the execution of cognitive problem-solving tasks, such as that faced by analysts when providing high-value information from a variety of sources, including unstructured textual reports, as well as more structured sources such as databases and spreadsheets. These are complex cognitive tasks that require the making of assumptions and reasoning based upon a "conceptual model" of the domain in which the analysis is taking place. The performance of these tasks requires support for extracting facts from sentences, querying, inference, handling of uncertainty, making hypotheses and understanding the rationale for conclusions reached. We are researching ways to provide such support to users, based on NL processing, use of the human-readable language ITA Controlled English (CE) and a reasoning system. A more detailed version of parts of this paper is given in [2].

CE [3] is a Controlled Natural Language, a subset of English, that is both human-readable and machine-parseable, suitable for the expression of domain knowledge, concepts and reasoning. It is relatively easy for human analysts to use, but it also has a formal interpretation that is sufficiently unambiguous that a computer can interpret the input of the domain analysts and use it to perform inferencing. Central to the use of CE is a conceptual domain model, a structure that holds all of the users' knowledge (concepts, relationships, logical inferences, constraints, and assumptions) of the domain in which the reasoning and problem solving is to be undertaken.

For example, the domain model for an intelligence analyst might include the concepts of mission, enemy, terrain and weather, troops and support available, time available, and civil considerations (METT-TC). The analyst's problem solving strategies may also be represented in CE as ways of reasoning and of making assumptions, such assumptions varying with the level of expertise of the analyst and the domain of analysis. The resulting reasoning can be tracked through the rationale, showing how conclusions are dependent upon givens and assumptions.

Our fact extraction combines a deep linguistic parsing system (the ERG [8]) that generates detailed linguistic information together with CE modelling and reasoning to map the linguistic information onto domain facts expressed in CE.

Deep parsing can provide context to support more precise application of the concepts and rules in the domain model to identify ambiguities and uncertainties in sentence interpretation.

For example, a domain model that includes the concepts of "human intelligence, counter intelligence, signals intelligence, and imagery intelligence" expressed in CE provides clues as to the state of the world of an intelligence analyst, in which "intelligence" is an ambiguous term for multiple concepts with various meanings.

However, there are many challenges in fact extraction from NL sentences, and some remain unsolved within the scientific community. Accurate extraction of the detail contained in a NL sentence requires a deep analysis of the often complex syntactic structure of the sentence, and a construction of the meaning (or semantics) of the sentence from the meaning of its syntactic component parts. There can be ambiguities in the structure or meaning of sentences, or their components, including individual words, which must be resolved in order to make sense of the whole sentence. Disambiguation may be impossible without background knowledge of the domain in which the sentences occur, and this background knowledge may have to be extensive. Even the relatively "simple" sentences used in the task described below exhibit such problems. To turn sentences into CE facts, it is necessary to transform the sentence meaning into the particular concepts used in the user's CE domain model. This paper describes our research into these challenges, how some have been solved, and some of the work needed in order to solve others.

II. THE ELICIT TASK

The Experimental Laboratory for Investigating Collaboration, Information-sharing, and Trust (ELICIT) [4] has devised the ELICIT framework for researching how organisational structures and communication patterns affect human collaborative solving of problems requiring reasoning and interpretation of facts. This framework contains the "ELICIT task", which involves the identification of key aspects of a (simulated) planned terrorist attack by interpretation of a set of simple English sentences (or factoids). The key aspects are "who" is going to perform the attack, "what" will be attacked, and "when" and "where" the attack will take place. The ELICIT task requires a domain model and reasoning steps similar to the domain knowledge and cognitive processes expressed by intelligence analysts when reasoning about attacks. It provides a suitable problem on which to apply our research for the extraction of facts from sentences and the subsequent reasoning about the facts in order to perform the identification.

However, analysis of the ELICIT sentences [5] discovered that there were significant ambiguities as to their interpretation (and no specific contextual background is given to the participants which might help disambiguation). An example is the word "in" in "Dignitaries in Epsilonland employ private guards", which could be interpreted as "belonging to" or "located", with the decision affecting subsequent reasoning.

Significantly, some disambiguation could not occur without general "commonsense" reasoning about the world, something that is notoriously difficult to achieve by computer, so it was decided at this stage to simplify the sentences by removing ambiguities that would require common sense reasoning. Even though this requires human intervention, it is of benefit as the problem solving (which is itself complex) from the extracted facts can be achieved automatically. It was also decided to focus on identifying "who" was performing the attack.

To use CE to extract facts and perform reasoning we developed a conceptual model [6] of the domain, extending previous ITA models and including:

• agents, operatives, groups of operatives
• financial institutions, visiting dignitaries, embassies
• time intervals, daytime, nighttime
• attack situations, participants, non-participants, targets
• working relationships (works with, cannot work with)

It was also necessary to define a problem solving strategy that would guide the CE reasoning system in performing the "who" identification. This strategy was determined to be a process of elimination: by using reasoning and facts to eliminate possible participants in the attack, the one remaining had to be the actual participant. By using the CE domain model and problem solving strategy the system is able to infer the participants from the simplified NL sentences, as summarised in the figure below:

The figure shows the sentences (black text) and the flow of reasoning through the rules, shown as round rectangles, to the conclusion (bottom right). One participant is the Lion, based upon the sentence "the Lion is involved". Another is the Violet group, inferred by the process of elimination; a rule detects that all other possible participants have been eliminated, leaving the Violet group as the only one remaining. This rule is generated dynamically (by rule-writing rules) from the ELICIT sentences that indicate the set of possible (group) participants, together with a user assumption that the world of possible (group) participants is "closed", in that there are no more possibilities. This assumption must be a judgement of the user, since it is not explicitly stated in the sentences.

The figure shows how each possible group, apart from the Violet group, was eliminated. The reasoning pathways include:

• because the group is not one of those in "the Lion only works with the Azuregroup and the Browngroup and the Violetgroup"; this sentence is turned into a rule dynamically and applied to all the groups.
• because the group is directly stated as being operational.
• because the group operates at a different time of day to a known participant (the Lion), as specified in a sentence such as "the Azuregroup operates in the nighttime".
• because the group is recruiting locals and "the Lion does not work with locals".
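The elimination strategy can be sketched in ordinary code. The following Python fragment is an illustration only, not the CE reasoning system; the candidate set and elimination reasons are our own stand-ins for the extracted facts. It shows how a closed-world assumption turns a set of eliminations into a positive identification:

```python
# Minimal sketch of the ELICIT "who" elimination strategy (illustrative
# data, not the CE system itself).

def identify_participants(candidates, known_participants, eliminated,
                          closed_world=True):
    """Return inferred participants: the known ones plus, under a
    closed-world assumption, the single candidate left after elimination."""
    remaining = [g for g in candidates if g not in eliminated]
    participants = set(known_participants)
    if closed_world and len(remaining) == 1:
        # The world of possible participants is assumed closed, so the
        # only group left un-eliminated must be a participant.
        participants.add(remaining[0])
    return participants

candidates = {"Azuregroup", "Browngroup", "Bluegroup", "Violetgroup"}
eliminated = {
    "Bluegroup",   # not in the set the Lion only works with (per that sentence)
    "Azuregroup",  # operates at a different time of day to the Lion
    "Browngroup",  # recruits locals, and "the Lion does not work with locals"
}
print(identify_participants(candidates, {"Lion"}, eliminated))
```

Without the closed-world assumption the last step does not fire, mirroring the point that the assumption is a judgement of the user rather than a stated fact.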

The following section describes how the facts were extracted, and used to generate these reasoning pathways.

III. NL PROCESSING AND FACT EXTRACTION

As described in [7], our NL processing research utilises DELPH-IN linguistic resources: the English Resource Grammar (ERG) [8] to perform a deep parse of the sentences, and Minimal Recursion Semantics (MRS) [9] to represent the extracted linguistic semantics. A key task of our research is to

transform the linguistic semantics into CE facts, expressed in the domain semantics of the CE conceptual model, as shown in the figure and the steps below.

Step 1: English text is sent to the ERG for parsing, resulting in the output of MRS "predicates" representing the linguistic semantic information in the text. These predicates are converted into CE; for example, the sentence "John chases the cat" is turned into "the mrs elementary predication #ep2 is an instance of the mrs predicate '_chase_v_1_rel' and has the situation e3 as zeroth argument and has the thing x7 as first argument and has the thing x9 as second argument." This CE requires an understanding of the linguistic processing to be readable, and it is intended that it be shown only to linguistic specialists rather than end users. It is also in a form that is available for further processing by the CE reasoning components. A tabular form of the CE can also be derived, where there is a column for the thing x7 that is named "John", the first argument of the predicate "_chase_v_1_rel", expressing the situation (e3) of "chasing". There is also a column for x9 that is the first argument of the predicate "_cat_n_1_rel", expressing that x9 is a "cat".

Step 2: Some MRS is "linguistically nuanced", containing information about how the sentences were constructed rather than about what real-world situations were being described, and it is therefore desirable to abstract the MRS into a more generic form. One such form is "intermediate MRS", which abstracts away such details as the "quantifications" on things (indicating whether a thing is definite, indefinite, or a group). Another key abstraction is the concept of a situation, or state of affairs, that may have attributes (time and place), relationships (causal, temporal), and roles that entities play in these situations. For example, "John chases the cat" could be expressed as the CE "there is a situation e3 that has the thing x7 as first role and has the thing x9 as second role". This generic semantic representation provides an easier starting point for the transformation into full domain semantics. Other linguistic phenomena, such as adjectives, modal verbs ("may be") and negations, require different processing leading to differing conceptualizations, as exemplified below.

Step 3: The generic semantics is transformed into CE facts that conform to the specific domain semantics of the user's conceptual model, which might include such concepts as people, places, attacks, targets etc. Various sources of knowledge (as CE background facts) may be used to guide this transformation: CE facts may express how words (more precisely, MRS predicates) are mapped to concepts in the domain model, and can list known entities with their types (such as the person John1). In this way the situation and entities involved (such as x7) can be given names and types. In addition, situations involving one or two entities can be mapped to more readable expressions, such as the domain-specific and more readable CE "the person John1 chases the cat x9".

Step 4: The CE facts may then be used to perform domain-specific reasoning, leading to inference of high-value information of use to the analysts, based upon inference rules written by the users, as exemplified by the inference of the ELICIT participants in the previous section.

The processing of an ELICIT simplified sentence "The Azuregroup may be a participant" into the CE fact "the group Azuregroup is a possible participant" is shown below. This involves the analysis of a modal sentence ("may be"), following a similar line to that already described, but requires further information about the modal CE concept to be used ("possible participant"). The figure shows the processing from the MRS predicates (top left) through several rules (rectangles) to the conclusion (bottom right). There are two pathways, one detecting a "modal situation", and one matching the linguistic information about "the Azuregroup" to a set of (known) reference entities. These pathways converge on the final rule that matches the situation to the correct modal CE concept. The CE rule dealing with the recognition of a reference entity is:

[ nn_lookup ]
if ( the thing T has the value W as common name ) and
   ( it is false that the thing T is a reference entity ) and
   ( there is a reference entity named REF that has the value W as common name )
then
   ( the thing REF is the same as the thing T ).

This rule matches the common name of the thing x7 (from the string "Azuregroup" in the MRS) with an existing reference entity defined in CE (the group Azuregroup), and concludes that they are the same thing, thus inferring that the "group Azuregroup" is the role-player in the situation.
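The effect of the nn_lookup rule can be illustrated in Python. This is a sketch under an assumed data layout; the dictionaries and field names are ours, not the CE engine's:

```python
# Sketch of the nn_lookup step: unify a parsed thing (e.g. x7, which the
# MRS names "Azuregroup") with a known reference entity that shares its
# common name. Data structures are illustrative only.

reference_entities = {"Azuregroup": ("group", "Azuregroup")}  # name -> (type, id)

def nn_lookup(thing):
    """If the thing is not itself a reference entity but shares a common
    name with one, conclude they are the same entity."""
    if thing.get("is_reference"):
        return thing
    ref = reference_entities.get(thing["common_name"])
    if ref is not None:
        etype, ident = ref
        return {"id": ident, "type": etype,
                "common_name": thing["common_name"], "is_reference": True}
    return thing  # no match: leave the thing as an unresolved variable

x7 = {"id": "x7", "type": "thing",
      "common_name": "Azuregroup", "is_reference": False}
resolved = nn_lookup(x7)
print(resolved["type"], resolved["id"])  # group Azuregroup
```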

The details of the CE processing inside the boxes are not of importance to the end user, but there are three types of CE fact that must be provided by the user for this reasoning to work, shown as the CE facts in blue boxes. These are the reference entities ("the group Azuregroup has 'Azuregroup' as common name and is a reference entity"), the mapping of MRS predicate to entity concept ("the mrs predicate '_participant_n_1_rel' expresses the entity concept 'participant'"), and the semantic relation between the basic and modal versions of the concept ("the entity concept 'participant' is modal to the entity concept 'possible participant'"). Thus this is a "black box" that handles sentences of a certain linguistic type, configured by a set of CE facts provided by the user.
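These configuring facts can be pictured as simple lookup tables. The Python sketch below uses illustrative names and layout, not actual CE, to show how the three fact types combine to produce the extracted fact:

```python
# Sketch of the three user-supplied fact types that configure the
# "black box" (illustrative dictionaries, not CE syntax).

predicate_to_concept = {"_participant_n_1_rel": "participant"}  # word -> concept
modal_of = {"participant": "possible participant"}              # basic -> modal
reference_entities = {"Azuregroup": "group"}                    # known entities

def extract_fact(entity_name, predicate, modal=True):
    """Combine the three fact types into a readable domain fact."""
    concept = predicate_to_concept[predicate]
    if modal:
        concept = modal_of[concept]       # "may be" selects the modal concept
    etype = reference_entities[entity_name]
    return f"the {etype} {entity_name} is a {concept}"

print(extract_fact("Azuregroup", "_participant_n_1_rel"))
# -> the group Azuregroup is a possible participant
```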

In trying to analyse the ELICIT sentences it was discovered that, even though they are simplified, there were linguistic "puzzles" to be solved, including the handling of negated and modal situations (noted above), and the handling of generic entities such as "daylight" and "locals" which could not be simply mapped into specific entities. For example, in "the Browngroup is recruiting locals" and "the Lion does not work with locals", the word "locals" indicates some unspecified group of people, but there is no indication that the group of locals is the same in the two sentences. However, in the context of the ELICIT task, it seems plausible that the "locals" are intended in some sense to be the same thing, and that the two sentences operate to rule out the Lion working with the Browngroup. Thus there is a conflict between the views of "different instances" and "the same thing" in the interpretation of "locals", but a specific interpretation must be made in order that the facts can be extracted and used. Such challenges correspond to known linguistic phenomena, some of which are still subject to research, but an advantage of using CE is that there is a computational but readable explicit specification of the linguistic theory which can be reviewed and debated.

IV. META LOGIC AND RULE EXTRACTION

In these transformations, significant use was made of the ability of CE to undertake meta-reasoning, where CE facts define explicit information about concepts themselves, and where rules can reason with this information. Simple examples have already been given: in "the mrs predicate '_participant_n_1_rel' expresses the entity concept 'participant'" a link is defined between a predicate (in effect, a word) and a concept; in "the entity concept 'participant' is modal to the entity concept 'possible participant'" a relationship between two concepts is given. However, there are more complex examples of the need for meta-logic, in the creation of rules that write rules and in the attempt to generalise sets of rules.

One significant challenge in interpreting the ELICIT sentences was that some sentences express rules of habitual behaviour rather than facts (e.g. "the Lion only works with the Bluegroup and the Azuregroup and the Violetgroup"). Our interpretation of this rule, in the context of the reasoning-by-elimination strategy, is that if a group is not a member of the set (Blue, Azure, Violet) then the Lion cannot work with that group. This can be stated more formally as the CE rule:

if ( there is an agent named Lion ) and
   ( the agent A is different to the agent Bluegroup and
     is different to the agent Azuregroup and
     is different to the agent Violetgroup )
then
   ( the agent Lion cannot work with the agent A ).

Since all of the linguistic analysis is performed by CE, it was necessary to extend the CE syntax to allow meta-reasoning so that rules could themselves construct other rules. This was achieved by creating a new type of CE object, a "logical inference", that represented a rule and had complete CE statements (with variables) as premises and conclusions. Such logical inferences could be contained in the premises and conclusions of "rule-writing rules"; see [2] for more details. Once these rules have been constructed automatically, they may be added to the domain model rules in order to be involved in the domain reasoning.
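A rule-writing rule can be sketched as a function that returns a new rule. The Python below illustrates the idea only; the CE system represents generated rules as "logical inference" objects that join the domain model, not as closures:

```python
# Sketch of a "rule-writing rule": from a habitual-behaviour sentence such
# as "the Lion only works with the Bluegroup and the Azuregroup and the
# Violetgroup", generate a new rule that can be applied to every group.
# Representation is illustrative, not the extended CE syntax.

def make_only_works_with_rule(agent, allowed):
    """Return a rule concluding 'agent cannot work with X' for any
    group X outside the allowed set; None means no conclusion."""
    allowed = frozenset(allowed)
    def rule(group):
        if group not in allowed:
            return f"the agent {agent} cannot work with the agent {group}"
        return None
    return rule

lion_rule = make_only_works_with_rule(
    "Lion", {"Bluegroup", "Azuregroup", "Violetgroup"})
print(lion_rule("Browngroup"))
# -> the agent Lion cannot work with the agent Browngroup
```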

V. SOLVING THE ELICIT "WHO" IDENTIFICATION TASK

The summary reasoning figure above showed how facts and rules extracted from sentences, together with the domain model and problem solving strategy, were used to automatically identify the participants in the attack. The figure below shows some of this reasoning in more detail, following the general layout of the summary figure above, but replacing some of the rule rectangles by "proof tables" that show detailed rule applications, with rows for premises and a row for the conclusion at the bottom.

A full description is given in [2]; here we focus on a few examples. In the top middle, reasoning about working at incompatible times of day is executed by the domain rule:

[ no_time_overlap ]
if ( the operative A operates in the time interval TA ) and
   ( the group B operates in the time interval TB ) and
   ( the time interval TA does not overlap the time interval TB )
then
   ( the operative A cannot work with the group B ).

The top proof table shows this rule being applied between the operative Lion and the Azuregroup.
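The overlap test in this rule can be illustrated with a simple interval check. The Python below is a sketch under our own simplification: clock-hour pairs standing in for the CE model's named time intervals such as "daytime" and "nighttime":

```python
# Sketch of the no_time_overlap rule with half-open (start, end) hour
# intervals; the CE model instead uses named time intervals.

def overlaps(a, b):
    """True if half-open intervals a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]

def no_time_overlap(operative_interval, group_interval):
    """Conclude 'cannot work with' when the two intervals are disjoint."""
    return not overlaps(operative_interval, group_interval)

daytime, nighttime = (6, 18), (18, 30)       # 30 represents 6am next day
print(no_time_overlap(daytime, nighttime))   # the Lion (daytime) cannot
                                             # work with the Azuregroup
```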

At the right side of the figure is the identification of "Violetgroup" as a participant, effected in several steps. At the bottom right, an "elimination rule" is used to infer that the Violetgroup is a participant on the basis that all other candidates are ruled out. The proof table at the bottom shows this reasoning, and also indicates that the conclusion is derived from an assumption (ass321) about the closed nature of the set of participants. This rule is itself generated automatically by a set of "rule-writing rules" driven by the making of the closed world assumption shown in the top right. In [2] we show the effects of making this assumption prematurely, before receipt of all the relevant sentences, which leads to an inconsistency as all possible candidates are ruled out; the rationale shows the inconsistency being caused by the closed world assumption, indicating that it would need to be unmade.

The result of the fact extraction and problem solving undertaken by the CE system is the inference of the following CE sentence:

the attack situation Elicitattack involves the operative Lion and involves the group Violetgroup.

This constitutes a solution to the question about "who" is participating in the attack, and the rationale for both participants is available for visualization and examination by the user. This solution is based upon the assumption that the world is closed in respect of the entity concept "possible participant". If further processing with this information (such as an attempt to solve "what" the target is) were to encounter an inconsistency, then this assumption would have to be considered suspect and be revoked.

VI. FURTHER CHALLENGES

The approach described has made progress in solving some of the challenges of NL processing. However, other significant challenges remain, and this section outlines some of these and how we approach their solution.

A. Generalisation

It is important that CE processing is not tailored completely to the set of sentences used for analysis. It is too easy to construct "one rule per sentence", resulting in a system that is non-generic, and of little use. We attempt to avoid this problem by generalisation.

Firstly, the CE reasoning that leads from MRS output to extracted facts uses concepts that have been generalised from specific cases. For example, "intermediate mrs" provides a more generic view of the MRS output, and "situations" provide an abstract view of the semantics expressed in sentences. Such generalisations allow reuse in a wider range of sentences.

Secondly, meta-logic is used extensively in the reasoning, where rules operate on statements about the concepts and their relationships themselves. Meta-logic is used for mapping between the "world of words" (as expressed in MRS predicates) and the "world of concepts", where it is necessary to reason about how the words express the existence of the concepts. Meta-logic is also used to allow generalisation of rules covering specific concepts into rules that can handle all concepts of a particular type. For example, a specific rule that infers that B is married to A given that A is married to B can be generalised by meta-logic to state that all relations defined as "symmetric" can be handled in this way. This requires a design of the generalisation (i.e. that there is a concept of "symmetric"), a rewrite of the rule using meta-statements about relations and their (pair-wise) instances, and the writing of CE statements indicating which relations are symmetric.

This is shown in the figure below, where the specific "is married to" rule on the left is generalised to the rule on the right that operates on all relations that are defined as "symmetric". These definitions are provided by the user as CE meta-facts of the form "the relation concept EC is a symmetric concept". Each specific statement about the "is married to" relation is generalised to a CE meta-fact of the form "there is a relation named R that has the sequence ( the thing X, and the thing Y ) as realisation.", which states that the ordered pair of things X and Y are instances of (or "realise") the relation R. Since R is a variable here, it is possible to state facts in general terms about any relation R. In the generalised rule the premise and conclusion match against all pairs of things P1 and P2 that realise a relation RC (as long as it is symmetric) and essentially infer that the relation holds when the order of the pair is reversed to P2 and P1. As the rule stands, it operates independently of the types of the things P1 and P2, but it is possible (and sometimes necessary) to take account of the range and domain of the relationships, and there are CE meta-statements that can express this information.
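The generalisation can be mimicked in a few lines. In the Python sketch below, which uses our own data layout rather than CE meta-facts, one generic function replaces a per-relation rule for every relation declared symmetric:

```python
# Sketch of generalising per-relation rules via meta-facts: instead of a
# dedicated "is married to" rule, one generic rule serves every relation
# declared symmetric. Data layout is illustrative.

symmetric_relations = {"is married to"}              # user-supplied meta-facts
realisations = {("is married to", "John", "Mary")}   # (relation, X, Y) instances

def apply_symmetry(realisations):
    """Generic rule: for every realised pair of a symmetric relation,
    infer the reversed pair."""
    inferred = set()
    for rel, x, y in realisations:
        if rel in symmetric_relations and (rel, y, x) not in realisations:
            inferred.add((rel, y, x))
    return inferred

print(apply_symmetry(realisations))
# -> {('is married to', 'Mary', 'John')}
```

A range/domain check could be added to the generic rule where the pair ordering carries type information, matching the paper's note about CE meta-statements for range and domain.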

A further approach to generalisation, planned to be undertaken, is the use of the "MRS test suite" defined by DELPH-IN. This is a set of example sentences that cover all of the main linguistic phenomena handled by the ERG, and our aim is to ensure that each linguistic phenomenon is covered by generic CE reasoning, providing coverage of most sentences in a principled way.

B. Linking words and predicates to concepts

Our approach relies upon the mapping between words (or more specifically MRS predicates) and concepts in the CE domain model. Currently these mappings need to be constructed by hand, which is a potential bottleneck. Previous ITA work started to explore the use of linguistic resources, such as WordNet [10], to suggest suitable concepts for unknown words to the user. It is now proposed [11] that these links be made by distributional semantics, which uses a database of MRS parses of a large corpus of text, and seeks matches between unknown words and this corpus, in the linguistic context defined by the sentence parse.

C. Handling Uncertainty and Ambiguity

A significant issue is the handling of ambiguities and uncertainties, which may arise from a number of different sources, including ambiguous interpretations of words (such as "tank") and sentences (such as "dignitaries in Epsilonland") or incomplete knowledge about the context (such as the complete set of possible participants). As reported in [12], we use assumptions and numerical certainty values to represent CE facts that are uncertain, and apply assumption-based reasoning to infer new information, detect inconsistencies and label different sources of uncertainty for the user in the rationale. An example already described above is the assumption that the set of possible participants is known (i.e. the world is closed in respect of other possibilities), leading to the inference of the only possible participant, or to the detection of an inconsistency if the assumption is incorrect.

Ambiguities arise in the interpretation of words (such as "tank"), and we have reported an assumption-based approach in [12]; this is taken up in more detail in the next section on the use of domain knowledge. Ambiguities also arise from the interpretation of sentences, and we represent such ambiguities as types of assumption that represent a specific "sentence interpretation" and that form a premise of the rules that extract the facts on the basis of this interpretation; see [2]. For example, consider the sentence "Dignitaries in Epsilonland are protected", which we might wish to turn into a rule inferring that certain dignitaries are a "protected thing". However, as noted above, there is an ambiguity as to the meaning of "in". If we take "in" to mean "working for" then a sentence interpretation is being made, which can be formulated as the assumed proposition "it is assumed that there is a sentence interpretation named 'si_in' that has "we take 'in' to mean working for" as description". This assumed proposition may then be placed in the premise of the rule that is the expression of the sentence, for example:

if ( the dignitary D is an official of the country Epsilonland ) and
   ( there is a sentence interpretation named 'si_in' )
then
   ( the dignitary D is a protected thing ).

When this rule is applied to a dignitary D, the conclusion that D is protected will be dependent upon the assumed sentence interpretation 'si_in'.
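The bookkeeping behind assumption tracking can be sketched simply: each fact carries the set of assumptions it depends on, and a rule's conclusion inherits the union of its premises' sets. The Python below is an illustration only; the names and layout are ours, not the CE rationale machinery:

```python
# Sketch of assumption-based rationale tracking: facts are (statement,
# assumption-set) pairs, and conclusions inherit the union of their
# premises' assumption sets. Illustrative, not the CE implementation.

def conclude(statement, premises):
    """Build a conclusion fact whose assumption set is the union of the
    assumption sets of its premises."""
    deps = set()
    for _, assumptions in premises:
        deps |= assumptions
    return (statement, deps)

fact = ("the dignitary D1 is an official of the country Epsilonland", set())
interp = ("there is a sentence interpretation named 'si_in'", {"si_in"})
conclusion = conclude("the dignitary D1 is a protected thing", [fact, interp])
print(conclusion[1])  # the conclusion depends on the assumption 'si_in'
```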

In some NL sentences, there are expressions that explicitly indicate uncertainty in propositions stated in that sentence, for example "it is thought that", "possibly", and "John claimed that", and it is useful to extract such sources of uncertainty and present them to the user as part of the rationale for a fact. We aim to achieve this by representing such information as specific types of CE assumptions, with associated information (such as the actual terms used); see [13].

All of these techniques rely on the fact that these different types of assumptions can be tracked through the rationale and presented to the users so they may know the sources of ambiguity or uncertainty in any conclusion of high value information. This approach based on assumptions is related to the use of argumentation for the elucidation of sources of uncertainty in reasoning, and the resolution of conflicts between different arguments for conclusions; work is being undertaken to link the ITA work on NL fact extraction with that on argumentation [14].

D. Why Domain Knowledge Is Necessary for Disambiguation

A significant challenge to NL processing is that words and sentences are potentially ambiguous, although human listeners are usually able to disambiguate utterances without conscious effort. This was exemplified in the ELICIT sentences above.

It is generally stated that disambiguation requires knowledge of the world, and it is useful to understand why this is the case. A typical simple algorithm for disambiguating a word with multiple senses (e.g. tank as military vehicle or liquid container) is the Lesk algorithm [15]. This determines the context of the word, as being the other words surrounding it in a sentence, and attempts to match this context with the text definitions of the possible senses of the word from a predefined dictionary, such as WordNet [10]. The definition that matches the greatest number of words with the context gives the "best" sense of the word. This algorithm demonstrates two key aspects of disambiguation, determining the context of a term, and finding the "best" match between the context and the predefined senses of the term.
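A toy version of the Lesk algorithm makes these two aspects concrete. In the Python sketch below the two-sense mini-dictionary stands in for WordNet glosses:

```python
# Toy Lesk algorithm: pick the sense whose dictionary definition shares
# the most words with the sentence context. The two-sense mini-dictionary
# is illustrative, not WordNet.

def lesk(word, sentence, senses):
    """Return the sense whose definition overlaps the context most."""
    context = set(sentence.lower().split()) - {word}
    best, best_overlap = None, -1
    for sense, definition in senses.items():
        overlap = len(context & set(definition.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

senses = {
    "military vehicle": "an armoured military vehicle that a soldier may drive",
    "liquid container": "a large container for storing liquid or gas",
}
print(lesk("tank", "the engineer drove the military tank", senses))
# -> military vehicle
```

Note how fragile the match is: only the word "military" links the context to the correct definition, while "drove" fails to match "drive", illustrating why the textual-identity assumption is prone to error.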

The intuition is that words (or terms) in the same context will share the same general topic, since they are (likely to be) referring to the same situation in the world; there is a relationship between the word W and its context C by virtue of a common third factor, the topic or situation on which word and context are dependent. Thus in the real world, a situation may be about someone driving a tank, and this will be expressed by words about drivers, driving and vehicles. Words about liquid containers will not appear, since that is not what the situation is actually about. In the Lesk algorithm, it is assumed that words textually close to a word are in the same context ("referring to the same topic") and that "being based on the same topic" can be measured by textual identity of words in the word definition and words in the context. However, these are assumptions, and are prone to error. There are ways to improve the definition of context and similarity of sense: some involve the use of syntactic knowledge, and some the use of semantic knowledge.

For syntactic knowledge, consider the sentence "the dog didn't bark by the tree", where the word "bark" could have the sense of "making a dog-like noise" or "outer covering of a tree". Simple textual matching of the context would not rule out either of these senses. However, an understanding of the parse structure of the sentence (or possibly just the part-of-speech analysis) would indicate that "bark" is the head of a verb phrase (or simply has "verb" as its part of speech), and hence the second sense, a noun, would be ruled out. In more complex sentences such as "the tank in the house was leaking so the engineer drove the tank to the house as quickly as possible", parsing could determine the context (in the sense of the situation being referred to) more accurately from the phrase structure than simple textual proximity alone. Given that the ERG system generates output closely related to the syntactic parse, it might be possible to partition the predicates into "similar contexts" on the basis of the arguments alone, which would not require any semantic analysis. However there are cases where syntax alone cannot disambiguate (such as "the engineer drove the tank"), and it is then necessary to use semantic knowledge to match sense and topic context.

People have knowledge of the world which they use to understand utterances. This semantic knowledge may be seen as the situations and entities that may exist in the world, expressed through their connections, relationships, subparts, causal links, times, places and attributes. Such knowledge may also be seen as a set of constraints on what is and is not possible in the world, whether physical or psychological. In this view, "being in the same context" can be stated more precisely as "being in the same situation", and therefore all NL expressions about that situation should be semantically consistent and not break any of the constraints¹. For example, role-players in a situation must be of the right type to play the roles of the situation; this constraint is the basis of the disambiguation of "tank" as an entity that had to be "driven" (and hence could not be a liquid container). It seems reasonable that the use of semantic knowledge could be the most powerful way to perform disambiguation, since it most closely mirrors real-world constraints. However, there are difficulties in the use of such knowledge.
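A minimal sketch of this constraint-based view, in Python rather than CE: each candidate sense is treated as an assumption, and senses whose type violates the verb's selectional constraint are ruled out as inconsistent. The type hierarchy and constraint table here are illustrative inventions, not part of any actual domain model:

```python
# Illustrative type hierarchy: each concept maps to its parent concept.
IS_A = {
    "military_vehicle": "vehicle",
    "liquid_container": "container",
    "vehicle": "entity",
    "container": "entity",
}

# Illustrative selectional constraints: the type a verb requires of its object.
OBJECT_CONSTRAINT = {"drive": "vehicle", "fill": "container"}

def subsumes(ancestor, concept):
    """True if concept is ancestor, or a descendant of it in IS_A."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False

def consistent_senses(verb, candidate_senses):
    """Rule out the sense assumptions that violate the verb's constraint."""
    required = OBJECT_CONSTRAINT[verb]
    return [s for s in candidate_senses if subsumes(required, s)]

# "the engineer drove the tank": only the vehicle sense is consistent.
print(consistent_senses("drive", ["military_vehicle", "liquid_container"]))
# → ['military_vehicle']
```

This is, of course, a direct consistency check; as discussed below, in practice an inconsistency may only emerge after a chain of inference steps over the domain model.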

It is useful to split "semantics" into some loose categories: generic, commonsense and specialist. Generic semantics may be considered the fundamental structuring of the conceptual space, and may include such concepts as "situation", "role", "container" and "causality". This level of abstraction is perhaps not particularly useful in disambiguation, since it does not provide many detailed constraints. Commonsense knowledge is more detailed knowledge about many situations in the world, without being specialist, and might include an understanding of how things grow and change, how people interact, and how machinery and natural forces "work". The world is almost infinite in its possibilities, so commonsense knowledge is almost infinite, although it can be applied easily and quickly by humans. Specialist knowledge is the specific and detailed knowledge of a particular area, such as cooking, medicine, or the behaviour of tribal influences. In this paper we will use the term "domain knowledge" to refer to a formal and limited combination of all of these, mostly specialist knowledge, together with a (relatively small) degree of commonsense and generic knowledge.

¹ We ignore the case where speakers are deliberately or inadvertently reporting the situations incorrectly.

Research into Artificial Intelligence has found that computers can be made to represent detailed domain knowledge in a limited area, but not the totality of commonsense knowledge applicable to all areas (such tasks are "AI-complete", mirroring the term "NP-complete" for tasks that are computationally intractable). Machine assistance to problem-solving tasks is most successful in specific domains, and this probably applies to NL processing as well.

Thus domain knowledge can assist disambiguation by providing constraints based upon representations of situations in the real world, and by using these to rule out impossible situations as a result of inferring inconsistencies. However, inconsistencies may not be determined directly from the information in a sentence, but rather by indirect inference over a chain of reasoning steps based upon the application of semantic knowledge. It is therefore necessary to ensure that the conceptual domain model includes considerable detail about the regularities of situations, and that all possible inferences have been made from the data in the sentence, before it is certain that disambiguation can be undertaken.

The work that we have done so far in disambiguation [12] is in keeping with this approach, based upon the use of assumptions to represent possible senses and domain knowledge to infer inconsistencies, in effect using domain constraints to rule out impossible situations. We therefore propose to continue with this approach whilst seeking to represent as much commonsense knowledge as possible.

However, this will still not permit disambiguation in all cases, particularly those where commonsense knowledge is required, and it may still be necessary to involve the human in disambiguation.

The use of the ERG/MRS is also an advantage, in that higher precision of parsing and greater detail of semantic output are more likely to make the processed representation an accurate reflection of the situation in the real world, and hence provide a better context for disambiguation.

In order to extend this approach to a wider range of sentences, it is useful to consider the sources of semantic knowledge (and constraints) available for disambiguation. The user's domain model is of course available for this purpose, but other generic sources of knowledge may also be available.

The ERG itself contains a source of knowledge, both in the lexicon and in the grammar. The grammar mostly encodes syntactic knowledge, but the lexicon does provide a (relatively small) degree of semantic knowledge in the form of the types to which entries may belong. Thus some of the constraints may already have been applied in the course of ERG processing, leading to less requirement for disambiguation. However, the ERG deliberately does not attempt to perform word sense disambiguation (the word tank is output in the MRS in the same way for both senses).

External sources of generic semantic knowledge include WordNet [10], VerbNet [16], and FrameNet [17], which have been used in the literature for disambiguation. The first two have been converted into CE, and so could be (and have been) used in CE reasoning. We approach disambiguation by making assumptions for the different senses of a term, and it may be possible to use, say, WordNet to assess and rank the likelihood of each sense assumption. Such a ranking could then prioritise the processing of alternative senses, or contribute to the overall degree of uncertainty of a conclusion.
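One possible shape for such a ranking, sketched in Python: each sense assumption carries a plausibility score (here a hand-made stand-in for, say, a WordNet-derived similarity measure), and alternative senses are then processed in descending order of score. The scores and sense names are illustrative:

```python
# A sketch of ranking sense assumptions by plausibility. The scores are
# hand-made stand-ins for a measure such as WordNet-based similarity.
def rank_assumptions(assumptions):
    """Order sense assumptions from most to least plausible."""
    return sorted(assumptions, key=lambda a: a["score"], reverse=True)

# Two competing assumptions for "tank" in "the engineer drove the tank".
assumptions = [
    {"sense": "liquid_container", "score": 0.2},
    {"sense": "military_vehicle", "score": 0.7},
]

for a in rank_assumptions(assumptions):
    print(a["sense"], a["score"])  # most plausible sense is processed first
```

The scores need not be decisive: as the text notes, they could simply order the exploration of alternatives, or be folded into the uncertainty attached to any conclusion drawn from a sense assumption.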

An alternative use of WordNet might be as a means of extending the domain model, either automatically or manually via an "analyst's helper" that could suggest extensions to the user. The extended domain model could then be used directly in the disambiguation process.

The largest resource of semantic knowledge is Wikipedia, and it is useful to consider how this knowledge could be used. In theory, Wikipedia could be converted into CE (even if only by converting the DELPH-IN MRS version of Wikipedia into CE expressing the MRS). However this is very low level, and it may be necessary to consider more structured sources such as DBpedia. The information most likely to be of use in disambiguation would be general semantic relations between concepts, rather than just specific instances of those concepts, and further work is necessary to determine whether such information is extractable from Wikipedia or DBpedia.

VII. CONCLUSIONS

The challenge of obtaining a deep representation of the syntactic structure and general meaning for a sentence has been addressed by the use of the ERG parser and its output in MRS.

The challenge of bridging the gap between this linguistic semantics and the domain semantics of a CE model has in part been solved by the use of ITA CE to apply domain knowledge, transforming linguistic expressions into domain concepts and addressing ambiguity in sentence interpretation through assumptions, ruling out those that are inconsistent with the domain model.

CE has also been used to externalize human reasoning and problem solving strategies in a complex problem-solving task, and by making the reasoning visible to the human, CE can facilitate the human understanding of the solution to the problem. This has been applied to a realistic task that is used elsewhere to experiment with collaborative problem-solving, and discussions with the ELICIT team suggest that our formulation of the problem may be useful to these experiments.

However there are still challenges that remain unresolved.

The range of sentences handled by our system must be extended, by applying the techniques to the MRS test suite to ensure coverage of the main linguistic phenomena found in the output of the ERG system. The most significant challenge is the need for considerable domain-specific and commonsense knowledge for the disambiguation of sentences such as the ELICIT sentences. The work reported here has sidestepped this problem by human simplification of sentences, although more recent work is handling some of the ELICIT sentences in their original form, in order to contribute facts in CE to the collaborative sensemaking undertaken by the ITA ACT-R research [18]. In this paper we have started to explore in more detail why domain knowledge is so important to disambiguation, with a view to understanding how our work should be extended. By using more information within the domain model, potentially extended from other sources of domain knowledge, we aim to apply domain knowledge to influence more powerfully the processing and disambiguation of more complex types of sentence.

Our research has the potential to provide support for the users and analysts in the types of cognitive problem solving task described in the introduction, and we are exploring with the ELICIT team, and others, how these capabilities compare to the human reasoning that occurs in these types of task.

ACKNOWLEDGMENT

This research was sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence and was accomplished under Agreement Number W911NF-06-3-0001. The views and conclusions contained in this document are those of the author(s) and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES

[1] International Technology Alliance, https://www.usukita.org/

[2] Mott, D., Poteet, S., Xue, P., & Copestake, A. (2014) Natural Language Fact Extraction and Domain Reasoning using Controlled English, DELPH-IN 2014, Portugal. http://www.delph-in.net/2014/Mot:Pot:Xue:14.pdf

[3] Mott, D. (2010) Summary of CE, https://www.usukita.org/papers/5658/details.html

[4] http://www.dodccrp.org/html4/elicit.html

[5] Mott, D. (2014) On Interpreting ELICIT sentences, https://www.usukitacs.com/node/2603

[6] Mott, D. (2014) Conceptualising ELICIT sentences, https://www.usukitacs.com/node/2604

[7] Mott, D., Poteet, S., Xue, P., Kao, A., & Copestake, A. (2013) Using the English Resource Grammar to extend fact extraction capabilities, ITA Annual Fall Meeting, 1st-3rd October 2013, https://www.usukitacs.com/node/2498

[8] Flickinger, D. (2007) The English Resource Grammar, LOGON technical report #2007-7, www.emmtee.net/reports/7.pdf

[9] Copestake, A., Flickinger, D., Sag, I. A., & Pollard, C. (2005) Minimal Recursion Semantics: an introduction. Research on Language and Computation, 3(2-3):281-332.

[10] Miller, G. A. (1995) WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11: 39-41.

[11] O Seaghdha, D., Copestake, A. & Mott, D. (2014) Investigating the use of distributional semantics to expand domain vocabulary, ITAAFM, https://www.usukitacs.com/node/2754

[12] Mott, D. (2014) CE-based mechanisms for handling ambiguity in Natural Language, Feb 2014, https://www.usukitacs.com/node/261

[13] Xue, P., Poteet, S., Kao, A., Mott, D. & Giammanco, C. (2014) Representing Natural Language Expressions of Uncertainty in CE, ITAAFM, https://www.usukitacs.com/node/2753

[14] Cerutti, F., Braines, D., Mott, D., Norman, T. J., Oren, N. & Pipes, S. (2014) Reasoning under Uncertainty in Controlled English: an Argumentation-based Perspective, ITAAFM, https://www.usukitacs.com/node/2756

[15] Lesk, M. (1986) Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th annual international conference on Systems documentation, pages 24-26, New York, NY, USA. ACM.

[16] Palmer, M., VerbNet, http://verbs.colorado.edu/~mpalmer/projects/verbnet.html

[17] Fillmore, C. J., FrameNet, https://framenet.icsi.berkeley.edu/fndrupal/about

[18] Smart, P., Tang, Y., Stone, P., et al. (2014) Socially-Distributed Cognition and Cognitive Architectures: Towards an ACT-R-Based Cognitive Social Simulation Capability, ITAAFM, https://www.usukitacs.com/node/2746
