Open-Source Implementation of Document Structuring Algorithm for NLTK
Nicholas FitzGerald
Natural Language Generation

Generate coherent text outputs to express
information


Express the right information
Express information in the right order
NLG Tasks
1. Document Structuring - most important and relevant
information selected from knowledge base (Content
Determination), then ordered and structured in such a way as
to maximize coherence and informativeness (Text Planning)
2. Micro-Planning – specifics of word selection, referring
expressions, and the finalization of ordering are determined
3. Realization – internal representations of the above decisions
are realized in actual text output
Document Structuring


Given a set of information to be expressed,
determine the order and grouping of this
information
Texts cannot be simply a random bag of sentences

Order of message presentation has significant effect on
meaning [Hovy 1993]:

One way:




1 - “Maria was diagnosed with cancer some months ago.”
2 - “Maria and Zurab had a fight last night.”
3 - “She was found dead this morning.”
Vs.



1 - “Maria was diagnosed with cancer some months ago.”
2 - “She was found dead this morning.”
3 - “Maria and Zurab had a fight last night.”
Document Structuring

Ordering also affects coherence:


“John was hungry. John went to the store. He bought
some bread to make a sandwich.”
“John bought some bread to make a sandwich. He went
to the store. John was hungry.”
Discourse relations

A relationship that holds between messages or groups of messages

Elaboration(m1,m2)


I love jazz music(m1). My favourite album is Oscar
Peterson's “Night Train” (m2).
Contrast(m1, m3)


I love jazz music (m1). However, my favourite album is The
Beatles' “White Album” (m3).
Cue word - However
Rhetorical Structure Theory



Mann and Thompson 1988
A text is coherent by virtue of relationships that
hold between messages in the text
A small number of relations (~25) can explain
relationships between messages in a wide range of
text
Project Proposal


Implement these general algorithms for inclusion
in NLTK
Provide a sample Data Set and DR schema for
testing and illustration


based on hypothetical WeatherExplainer from [Reiter
and Dale 2000]
Experiment utilizing these new tools as part of
Abstractive Summarization System for Evaluative
Statement Summarization (ASSESS)
Implementation 1: Schemas

Top-Down Approach



Output document structure is predictable and
stereotyped
Schemas are patterns of expansion, similar to CFG
E.g.:
CompareAndContrast → DescribeRelationship CompareProperties.
CompareProperties → CompareProperty CompareProperties.
CompareProperties → .
“John is much bigger than Kate (DR). He is five inches
taller (CP) and weighs almost twice as much (CP).”
Specify rules for choosing if multiple expansions exist
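A minimal sketch of how such a schema might be expanded top-down, treating it as a small grammar. The `SCHEMAS` table, the `expand` function, and the choice rule (keep recursing while properties remain) are illustrative assumptions, not the implementation described in these slides.

```python
# Hypothetical schema table mirroring the CompareAndContrast example.
# Each schema maps to a list of alternative expansions; [] is the empty one.
SCHEMAS = {
    "CompareAndContrast": [["DescribeRelationship", "CompareProperties"]],
    "CompareProperties": [["CompareProperty", "CompareProperties"], []],
}


def expand(symbol, properties):
    """Expand a schema top-down into a flat list of (step, argument) pairs.

    `properties` is the list of properties still to be compared; each
    CompareProperty step consumes one item from the front of the list.
    """
    if symbol not in SCHEMAS:
        # Leaf step: CompareProperty takes one property, other leaves take none.
        if symbol == "CompareProperty":
            return [(symbol, properties.pop(0))]
        return [(symbol, None)]
    alternatives = SCHEMAS[symbol]
    # Assumed choice rule: use the first expansion while properties remain
    # (or when it is the only one), otherwise the last (empty) expansion.
    if properties or len(alternatives) == 1:
        chosen = alternatives[0]
    else:
        chosen = alternatives[-1]
    steps = []
    for child in chosen:
        steps.extend(expand(child, properties))
    return steps


plan = expand("CompareAndContrast", ["height", "weight"])
print(plan)
```

The result is one DescribeRelationship step followed by one CompareProperty step per property, which matches the shape of the "John is much bigger than Kate" example.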
Top-Down Problems


Hypothesis-Driven
Content selection done “on-line”


Not easily pipelined
Therefore, Bottom-Up used
Implementation 2: Bottom-Up

Output document structure is not predictable
POOL = messages to be expressed
while size(POOL) > 1:
    find all pairs of elements in POOL which can be joined by a DR
    assign a desirability score to each potential DR
    find the pair Ei and Ej with the highest score and combine them into Ek
    remove Ei and Ej from POOL, replace with Ek
end while
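The loop above can be sketched in Python as a greedy merge. This is a simplified illustration, not the NLTK-oriented implementation itself: messages are plain strings, a rule is a hypothetical `(name, condition, score)` triple, and the function stops early if no rule applies (a case the pseudocode does not address).

```python
from itertools import permutations


def nucleus(element):
    """Return the message name at the nucleus of a message or merged tuple."""
    return element if isinstance(element, str) else nucleus(element[1])


def bottom_up_plan_sketch(pool, rules):
    """Greedily merge the best-scoring pair until one element remains."""
    pool = list(pool)
    while len(pool) > 1:
        # Every (score, relation, Ei, Ej) that some rule licenses.
        candidates = [
            (score, name, a, b)
            for a, b in permutations(pool, 2)
            for name, condition, score in rules
            if condition(a, b)
        ]
        if not candidates:
            break  # nothing can be joined; stop rather than loop forever
        score, name, a, b = max(candidates, key=lambda c: c[0])
        pool.remove(a)
        pool.remove(b)
        pool.append((name, a, b))  # Ek: the combined element
    return pool


# Hypothetical rules loosely modelled on the WeatherExplainer example.
rules = [
    ("Elaboration",
     lambda a, b: a == "MonthlyRainfallMsg" and b == "TotalRainfallMsg", 3),
    ("Sequence",
     lambda a, b: nucleus(a) == "MonthlyTemperatureMsg"
     and nucleus(b) == "MonthlyRainfallMsg", 1),
]
messages = ["TotalRainfallMsg", "MonthlyRainfallMsg", "MonthlyTemperatureMsg"]
plan = bottom_up_plan_sketch(messages, rules)
print(plan)
```

The higher-scoring Elaboration is applied first, and the Sequence rule then joins the remaining message to the merged element.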
Implementation

Used nltk.featstruct for Messages and DocPlans

A mapping from feature identifiers to feature values, where each
feature value is either a basic value (such as a string or an integer), or
a nested feature structure.
TotalRainfallMsg
    period
        year 1996
        month 06
    attribute
        type 'RelativeVariation'
        magnitude
            unit 'inches'
            number 4
        direction '+'
[ *msgType* = 'TotalRainfallMsg'                            ]
[                                                            ]
[             [ direction = '+'                     ]        ]
[             [                                     ]        ]
[ attribute = [ magnitude = [ number = 4        ]   ]        ]
[             [             [ unit   = 'inches' ]   ]        ]
[             [                                     ]        ]
[             [ type      = 'RelativeVariation'     ]        ]
[                                                            ]
[ period    = [ month = 6    ]                               ]
[             [ year  = 1996 ]                               ]
Implementation

nltk.featstruct.FeatStruct

unify(other):

Unify fstruct1 with fstruct2, and return the resulting feature
structure. This unified feature structure is the minimal
feature structure that:



contains all feature value assignments from both fstruct1 and
fstruct2.
preserves all reentrance properties of fstruct1 and fstruct2.
If no such feature structure exists (because fstruct1 and
fstruct2 specify incompatible values for some feature), then
unification fails, and unify returns None.
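The definition above can be illustrated on plain nested dicts. This is a simplified stand-in and not NLTK's implementation: it ignores reentrance and variables, and `unify_dicts` is a hypothetical helper name.

```python
def unify_dicts(a, b):
    """Return the smallest dict containing every assignment in a and b.

    Returns None if a and b give incompatible values for some feature.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val  # feature only in b: copy it over
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            merged = unify_dicts(a_val, b_val)  # unify nested structures
            if merged is None:
                return None
            result[key] = merged
        elif a_val != b_val:
            return None  # conflicting basic values: unification fails
    return result


left = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {"type": "RelativeVariation", "direction": "+"},
}
right = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {
        "type": "RelativeVariation",
        "magnitude": {"unit": "inches", "number": 4},
    },
}
unified = unify_dicts(left, right)
conflict = unify_dicts(left, {"period": {"year": 1997}})
print(unified)
print(conflict)
```

The first call merges the `direction` and `magnitude` features into one structure; the second fails because the two structures disagree on `year`.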
Unification
TotalRainfallMsg
    period
        year 1996
        month 06
    attribute
        type 'RelativeVariation'
        direction '+'
+
TotalRainfallMsg
    period
        year 1996
        month 06
    attribute
        type 'RelativeVariation'
        magnitude
            unit 'inches'
            number 4
=
TotalRainfallMsg
    period
        year 1996
        month 06
    attribute
        type 'RelativeVariation'
        magnitude
            unit 'inches'
            number 4
        direction '+'
Implementation

nltk.featstruct.FeatStruct

subsumes(other):

True if self subsumes other. I.e., return true if
unifying self with other would result in a feature
structure equal to other.
Subsumes
TotalRainfallMsg
    period
        year 1996
        month 06
subsumes
TotalRainfallMsg
    period
        year 1996
        month 06
    attribute
        type 'RelativeVariation'
        magnitude
            unit 'inches'
            number 4
        direction '+'

TotalRainfallMsg
    period
        year 1996
        month 06
    attribute
        type 'RelativeVariation'
        magnitude
            unit 'inches'
            number 4
does not subsume
TotalRainfallMsg
    period
        year 1996
        month 06
Using Subsumes
"Select from messages all DocPlans with a relType of
Contrast and a nucleus which is a message of
msgType 'TotalRainfallMsg'"
# Build a partial DocPlan to act as the query pattern.
d = DocPlan(relType = 'Contrast', nucleus = Message('TotalRainfallMsg'))
# Keep only the items the pattern subsumes ('return' is a reserved word in Python).
result = [msg for msg in messages if d.subsumes(msg)]
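The subsumption check and the selection query above can be sketched on plain nested dicts. This is a simplified stand-in for `FeatStruct.subsumes`, ignoring reentrance and variables; `subsumes_dicts` and the sample data are hypothetical.

```python
def subsumes_dicts(general, specific):
    """True if every feature in `general` appears in `specific` with a
    matching value, so that unifying the two would just yield `specific`."""
    for key, g_val in general.items():
        if key not in specific:
            return False
        s_val = specific[key]
        if isinstance(g_val, dict):
            if not isinstance(s_val, dict) or not subsumes_dicts(g_val, s_val):
                return False
        elif g_val != s_val:
            return False
    return True


pattern = {"msgType": "TotalRainfallMsg", "period": {"year": 1996, "month": 6}}
full = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {"type": "RelativeVariation", "direction": "+"},
}
other = {"msgType": "MonthlyTemperatureMsg",
         "period": {"year": 1996, "month": 6}}

# Use the partial structure as a query over a pool of messages.
selected = [m for m in [full, other] if subsumes_dicts(pattern, m)]
print(subsumes_dicts(pattern, full), subsumes_dicts(full, pattern))
print(selected)
```

A partial structure therefore works as a query: it matches every item that carries at least the features it specifies.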
Implementation: Input Formats

Messages:
TotalRainfallMsg
    period
        year 1996
        month 06
    attribute
        type 'RelativeVariation'
        magnitude
            unit 'inches'
            number 4
        direction '+'
Input Formats

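Rules:
Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
(M1.attribute.direction == M2.attribute.direction) : ConstituentSet('Elaboration', M1, M2) : 3
Parts of a rule, in order:
inputs - Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2
conditions - (M1.attribute.direction == M2.attribute.direction)
return - ConstituentSet('Elaboration', M1, M2)
heuristic - 3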
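One way to picture a parsed rule is as an object holding those four parts. The sketch below is an assumption for illustration only: the `Rule` class, `try_apply`, and the dict-based messages are hypothetical, the text format's parser is not shown, and treating M1 as the nucleus and M2 as the auxiliary is inferred rather than stated.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Rule:
    rel_type: str                            # e.g. 'Elaboration'
    input_types: Tuple[str, str]             # required msgType of M1 and M2
    condition: Callable[[dict, dict], bool]  # the parenthesized test
    heuristic: int                           # desirability score

    def try_apply(self, m1: dict, m2: dict) -> Optional[dict]:
        """Return a constituent-set-like dict if the rule licenses (m1, m2)."""
        if (m1["msgType"], m2["msgType"]) != self.input_types:
            return None
        if not self.condition(m1, m2):
            return None
        # Assumption: M1 becomes the nucleus, M2 the auxiliary.
        return {"relType": self.rel_type, "nucleus": m1, "aux": m2,
                "score": self.heuristic}


elaboration = Rule(
    rel_type="Elaboration",
    input_types=("MonthlyRainfallMsg", "TotalRainfallMsg"),
    condition=lambda m1, m2: (m1["attribute"]["direction"]
                              == m2["attribute"]["direction"]),
    heuristic=3,
)
monthly = {"msgType": "MonthlyRainfallMsg", "attribute": {"direction": "+"}}
total = {"msgType": "TotalRainfallMsg", "attribute": {"direction": "+"}}
result = elaboration.try_apply(monthly, total)
print(result["relType"], result["score"])
```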
Example Usage
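# Read the message and rule definitions from their input files.
with open('msg_file', 'r') as f:
    msg_string = f.read()
with open('rule_file', 'r') as f:
    rule_string = f.read()
# Parse both inputs, then build the document plan bottom-up.
messages = read_messages(msg_string)
rules = read_rules(rule_string)
plan = bottom_up_plan(messages, rules)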
Data Set - WeatherExplainer


Simple example provided in [Reiter and Dale
2000]
Created 3 messages and 3 rules in the input format
WeatherExplainer Messages
TotalRainfallMsg
    period
        year 1996
        month 06
    attribute
        type 'RelativeVariation'
        magnitude
            unit 'inches'
            number 4
        direction '+'
MonthlyRainfallMsg
    period
        year 1996
        month 06
    attribute
        type 'RelativeVariation'
        magnitude
            unit 'inches'
MonthlyTemperatureMsg
    period
        year 1996
        month 06
    temperature
        category 'hot'
WeatherExplainer Rules
Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
(M1.attribute.direction == M2.attribute.direction) : ConstituentSet('Elaboration', M1, M2) : 3
Contrast(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
(M1.attribute.direction != M2.attribute.direction) : ConstituentSet('Contrast', M1, M2) : 2
Sequence(Message('MonthlyTemperatureMsg')|ConstituentSet(nucleus=Message('MonthlyTemperatureMsg')) M1,
Message('MonthlyRainfallMsg')|ConstituentSet(nucleus=Message('MonthlyRainfallMsg')) M2)
() : ConstituentSet(Sequence, M1, M2) : 1
WeatherExplainer Result
[ *type* = 'DPDocument'
  children = [
    *aux* = [
      *aux* = [
        *msgType* = 'TotalRainfallMsg'
        attribute = [
          direction = '+'
          magnitude = [ number = 4, unit = 'inches' ]
          type = 'RelativeVariation' ]
        period = [ month = 6, year = 1996 ] ]
      *nucleus* = [
        *msgType* = 'MonthlyRainfallMsg'
        attribute = [
          direction = '+'
          magnitude = [ number = 2, unit = 'inches' ]
          type = 'RelativeVariation' ]
        period = [ month = 6, year = 1996 ] ]
      *relType* = "'Elaboration'" ]
    *nucleus* = [
      *msgType* = 'MonthlyTemperatureMsg'
      period = [ month = 6, year = 1996 ]
      temperature = [ category = 'hot' ] ]
    *relType* = 'Sequence' ]
  title = [ text = None, type = None ] ]
WeatherExplainer Result
Roughly:
”This has been a hot month. Average rainfall this month is
greater than usual. So far, rainfall is four inches above average.”
ASSESS

Summarization of Evaluative Opinions
An Abstractive Summarization Pipeline:
Input Reviews → Extract all information from input corpus → Data → Determine most relevant information and generate summary → Summary
ASSESS Testing

Input:



Review sentences tagged with crude-feature
evaluations
Crude-Feature to User-Defined-Feature mapping
Simple content selection



Group evaluations by UDF
Calculate average evaluation
Also include info on UDF-parent in hierarchy, number
of evaluations
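The content selection steps listed above (group by UDF, average, attach parent and count) might look roughly like this. The function name, the input shapes, the `average` field, and all sample data are assumptions for illustration; the source does not show how ASSESS computes its message fields.

```python
from collections import defaultdict
from statistics import mean


def average_opinions(evaluations, crude_to_udf, udf_parent):
    """Group (crude_feature, score) pairs by UDF and summarize each group."""
    grouped = defaultdict(list)
    for crude_feature, score in evaluations:
        grouped[crude_to_udf[crude_feature]].append(score)
    return [
        {
            "msgType": "AverageOpinionMessage",
            "udf": udf,
            "udf_parent": udf_parent.get(udf),  # parent in the UDF hierarchy
            "numOpinions": len(scores),
            "average": mean(scores),
        }
        for udf, scores in grouped.items()
    ]


# Hypothetical tagged review sentences and feature mappings.
evaluations = [("remote", -2), ("remote control", -1), ("picture", 3)]
crude_to_udf = {
    "remote": "Universal Remote Control",
    "remote control": "Universal Remote Control",
    "picture": "Picture Quality",
}
udf_parent = {
    "Universal Remote Control": "Extra Features",
    "Picture Quality": "Video",
}
messages = average_opinions(evaluations, crude_to_udf, udf_parent)
for message in messages:
    print(message)
```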
Example Message
[ *msgType*   = 'AverageOpinionMessage'    ]
[ numOpinions = 17                         ]
[ polarity    = '-'                        ]
[ udf         = 'Universal Remote Control' ]
[ udf_parent  = 'Extra Features'           ]
[ valence     = 1.1764705882352942         ]
12 messages generated
Rules
Conjunction(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
(M1.udf_parent == M2.udf_parent and M1.polarity == M2.polarity) : ConstituentSet(Conjunction, M1, M2) : (2, M1.numOpinions+M2.numOpinions)
Contrast(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
(M1.udf_parent == M2.udf_parent and M1.polarity != M2.polarity) : ConstituentSet(Contrast, M1, M2) : (3, M1.numOpinions+M2.numOpinions)
Explanation(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
(M1.udf == M2.udf_parent and M1.polarity == M2.polarity) : ConstituentSet(Explanation, M1, M2) : (5, 0)
Explanation(Message('AverageOpinionMessage') M1, ConstituentSet(relType = 'Conjunction', nucleus=Message('AverageOpinionMessage')) M2)
(M1.udf == M2.nucleus.udf_parent and M1.polarity == M2.nucleus.polarity) : ConstituentSet(DExplanation, M1, M2) : (10, 0)
Sequence(Message('AverageOpinionMessage')|ConstituentSet() M1, Message('AverageOpinionMessage')|ConstituentSet() M2)
() : ConstituentSet(Sequence, M1, M2) : (1, 0)
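The ASSESS rules use pair-valued heuristics rather than a single number. One natural reading, assumed here and not confirmed by the source, is lexicographic comparison: the first component ranks the relation and the second (the combined number of opinions) breaks ties. Python tuples already compare this way, as the sketch below shows with hypothetical candidate scores.

```python
# Hypothetical candidate joins, each with a (priority, tie_breaker) score.
candidates = [
    ("Conjunction", (2, 17 + 5)),
    ("Conjunction", (2, 17 + 9)),
    ("Contrast", (3, 4 + 3)),
    ("Sequence", (1, 0)),
]

# Tuples compare element by element, so max() prefers the higher first
# component and only looks at the second when the first components tie.
best = max(candidates, key=lambda c: c[1])
ranked = sorted(candidates, key=lambda c: c[1], reverse=True)

print(best)
print(ranked)
```

Under this reading, a Contrast always outranks a Conjunction, and between two Conjunctions the one covering more opinions wins.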
ASSESS Result

It works!




Evaluation of resulting DocPlan would say more about
Rules and Content Selection than Document
Structuring Algorithm
Was able to handle larger number of messages and
rules
4 of 5 rules used
Still, only one message type used
Future Improvements

Investigate whether this simple framework can be
used to develop more “intelligent” rules for more
sophisticated domain models




[Carenini 2008] – SEA
May require changes to implementation
Complete comprehensive documentation and user manual
Submit to NLTK
References
Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media Inc. Print and online.
Carenini, G. and Moore, J.D. (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11): 925-952.
Carenini, G., Ng, R., and Pauls, A. (2006). Multi-Document Summarization of Evaluative Text. Proc. of the Conf. of the European Chapter of the Association for Computational Linguistics.
FitzGerald, N. (2009). A Complete Pipeline for Semantic Evaluation Summarization. Unpublished Project Report.
Lester, J. and Porter, B. (1997). Developing and empirically testing robust explanation generators: the KNIGHT experiments. Computational Linguistics, 23(1): 65-101.
Mann, W. and Thompson, S. (1988). Rhetorical structure theory: toward a functional theory of text organization. Text 8(3): 243-281.
Marcu, D. (1997). From local to global coherence: A bottom-up approach to text planning. Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-1997), 629-635.
Pitler, E. et al. (2008). Easily Identifiable Discourse Relations. University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-08-24.
Reiter, E. and Dale, R. (1997). Building applied natural language generation systems. Natural Language Engineering 3(1): 57-87.
Reiter, E. and Dale, R. (2000). Building Natural Language Generation Systems (Studies in Natural Language Processing). New York: Cambridge UP. Print.
Young, R.M. and Moore, J.D. (1994). DPOCL: A principled approach to discourse planning. Proceedings of the 7th International Workshop on Natural Language Generation, Kennebunkport, ME, June 17-21, 1994, pp. 13-20.