
SEMANTIC GRAPHS DERIVED FROM TRIPLETS WITH
APPLICATION IN DOCUMENT SUMMARIZATION
Delia Rusu*, Blaž Fortuna°, Marko Grobelnik°, Dunja Mladenić°
* Technical University of Cluj-Napoca, Faculty of Automation and Computer Science
G. Bariţiu 26-28, 400027 Cluj-Napoca, Romania
° Department of Knowledge Technologies, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Tel: +386 1 477 3127; fax: +386 1 477 3315
E-mail: delia.rusu@gmail.com
blaz.fortuna@ijs.si, marko.grobelnik@ijs.si, dunja.mladenic@ijs.si
ABSTRACT
Information has become increasingly accessible, so much so that it has given rise to an information overload problem. Yet important decisions still have to be made based on the available information. As it is impossible to read all the relevant content needed to stay informed, a possible solution is to condense the data and obtain the kernel of a text by automatically summarizing it.
We present an approach to analyzing text and
retrieving valuable information in the form of a
semantic graph based on subject-verb-object triplets
extracted from sentences. Once triplets have been
generated, we apply several techniques in order to
obtain the semantic graph of the document: co-reference and anaphora resolution of named entities
and semantic normalization of triplets. Finally, we
describe the automatic document summarization
process starting from the semantic representation of
the text.
The experimental evaluation carried out step by step
on several Reuters newswire articles shows a
comparable performance of the proposed approach
with other existing methodologies. For the assessment
of the document summaries we utilize an automatic
summarization evaluation package, so as to show a
ranking of various summarizers.
1 INTRODUCTION
The accessibility of information arises mostly from the
rapid development of the World Wide Web and online
information services. One has to read a considerable
amount of relevant content in order to stay updated, but it
is impossible to read everything related to a certain topic.
A feasible solution to this problem is condensing this vast amount of data and extracting only the essence of
the message, in the form of an automatically generated
summary.
In this paper we describe a method of text analysis whose stated purpose is to extract valuable information from documents. We attach a graphical representation, called a semantic graph, to the initial document. The graph is based on triplets retrieved from the document's sentences. Moreover, we describe an application of semantic graph generation, text summarization, as a method for reducing the quantity of information while preserving one important characteristic: its quality.
The paper is organized as follows. Firstly, the triplet based
semantic graphs generation algorithm is presented. Two
steps are detailed in this phase: triplet extraction from
sentences, followed by the procedure of yielding the
semantic graph of the document. In order to obtain the
graph, named entity co-reference and anaphora resolution
as well as the semantic normalization of triplets are
employed. Secondly, the summarization process is
explained, followed by an evaluation of the system
components. The paper concludes with several remarks.
2 TRIPLET BASED SEMANTIC GRAPHS
In English, the declarative sentence has the basic form
subject – verb – object. Starting from this observation, one
can think of the “core” of a sentence as a triplet (consisting of the three aforementioned elements). We assume
that it contains enough information to describe the
message of a sentence. The usefulness of triplets resides in
the fact that it is much easier to process them instead of
dealing with very complex sentences as a whole.
For triplet extraction, we apply the algorithm for obtaining
triplets from a treebank parser output described in [1], and
employ the Stanford Parser [2].
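As an illustration of this idea (a simplification, not the exact algorithm of [1], which works on full treebank parser output), the sketch below extracts a subject-verb-object triplet from a parse tree: the subject is taken from the first NP under S, the predicate is the deepest verb in the VP, and the object comes from the first NP, PP or ADJP inside the VP. The tuple-based tree encoding and the example sentence are assumptions for this example.

```python
# Illustrative sketch of subject-verb-object extraction from a
# treebank-style parse (a simplification of the algorithm in [1]).
# Trees are encoded as nested tuples: (label, child, child, ...),
# with bare strings as leaves -- an assumption for this example.

def label(node):
    return node[0] if isinstance(node, tuple) else None

def leaves(node):
    """All leaf words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in leaves(child)]

def find_first(node, labels):
    """Breadth-first search for the first subtree carrying one of `labels`."""
    queue = [node]
    while queue:
        current = queue.pop(0)
        if label(current) in labels:
            return current
        if isinstance(current, tuple):
            queue.extend(c for c in current[1:] if isinstance(c, tuple))
    return None

def deepest_verb(vp):
    """Head word of the deepest verb node inside the verb phrase."""
    verb, best_depth = None, -1
    stack = [(vp, 0)]
    while stack:
        node, depth = stack.pop()
        if label(node) in {"VB", "VBD", "VBZ", "VBP", "VBN", "VBG"} and depth > best_depth:
            verb, best_depth = leaves(node)[0], depth
        if isinstance(node, tuple):
            stack.extend((c, depth + 1) for c in node[1:] if isinstance(c, tuple))
    return verb

def extract_triplet(sentence_tree):
    np = find_first(sentence_tree, {"NP"})          # subject phrase
    vp = find_first(sentence_tree, {"VP"})          # predicate phrase
    subject = " ".join(leaves(np)) if np else None
    verb = deepest_verb(vp) if vp else None
    obj = find_first(vp, {"NP", "PP", "ADJP"}) if vp else None
    if obj is not None and label(obj) == "PP":      # unwrap prepositional object
        obj = find_first(obj, {"NP"})
    return subject, verb, " ".join(leaves(obj)) if obj else None

# Simplified parse of "Rudolph Agnew retired as chairman".
tree = ("S",
        ("NP", ("NNP", "Rudolph"), ("NNP", "Agnew")),
        ("VP", ("VBD", "retired"),
               ("PP", ("IN", "as"), ("NP", ("NN", "chairman")))))
print(extract_triplet(tree))  # ('Rudolph Agnew', 'retired', 'chairman')
```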
The extraction is performed based on pure syntactic
analysis of sentences. For obtaining semantic information,
we first annotate the document with named entities.
Throughout this paper, the term “named entities” refers to
names of people, locations and organizations. For named
entity extraction we use GATE (General Architecture for Text Engineering) [3], a toolkit for natural language processing.
The semantic graph corresponds to a visual representation
of a document’s semantic structure. The starting point for
deriving semantic graphs was [4].
The procedure of semantic graph generation consists of a
series of sequential operations composing a pipeline:
• Co-reference resolution by employing text analysis and matching methods, thus consolidating named entities.
• Pronominal anaphora resolution based on named entities.
• Semantic normalization using WordNet synsets.
• Semantic graph generation by merging triplet elements with respect to the synset they belong to.
The following sub-sections will further detail these
pipeline components.
2.1 Co-reference Resolution
Co-reference is defined as the identification of surface
terms (words within the document) that refer to the same
entity [4]. For simplification, we consider co-reference resolution for named entities only. The set of
operations we have to perform is threefold. Firstly we have
to determine the named entity gender, so as to reduce the
search space for candidates. Secondly, in the case of
named entities composed of more than one word, we
eliminate the set of English stop words (for example Ms.,
Inc., and so on). Thirdly, we apply the heuristics proposed
in [4]: two different surface forms represent the same
named entity if one surface form is completely included in
the other. For example, “Clarence”, “Clarence Thomas”
and “Mr. Thomas” refer to the same named entity, that is,
“Clarence Thomas”. Moreover, abbreviations are also co-referenced, for example “U.S.”, “U.S.A.”, “United
States” and “United States of America” all refer to the
same named entity – “United States America” (“of” will
be eliminated, as it is a stop word).
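The inclusion heuristic above can be sketched as follows; the stop-word list and the subset test over content words are simplifying assumptions (abbreviation cases such as “U.S.” vs. “United States” would need extra expansion rules beyond this sketch):

```python
# A minimal sketch of the surface-form inclusion heuristic for named-entity
# co-reference described above. The stop-word list is illustrative.

STOP_WORDS = {"mr.", "ms.", "mrs.", "inc.", "of", "the"}

def content_words(surface_form):
    """Lower-cased tokens of a named entity, with stop words removed."""
    return {w for w in surface_form.lower().split() if w not in STOP_WORDS}

def same_entity(form_a, form_b):
    """Two surface forms co-refer if one's content words are completely
    included in the other's."""
    a, b = content_words(form_a), content_words(form_b)
    return a <= b or b <= a

print(same_entity("Mr. Thomas", "Clarence Thomas"))   # True
print(same_entity("Clarence", "Clarence Thomas"))     # True
print(same_entity("Clarence", "Mr. Gates"))           # False
```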
2.2 Anaphora Resolution
In linguistics, anaphora defines an instance of an
expression that refers to another expression; pronouns are
often regarded as anaphors. The pronoun subset we considered for anaphora resolution consists of {I, he, she, it, they}, their objective, reflexive and possessive forms, as well as the relative pronoun who.
We perform a sequential search, first backward and then
forward, with the purpose of finding good replacement
candidates for a given pronoun, among the named entities.
Firstly, we search backwards inside the sentence where we
found the pronoun. We select candidates that agree in
gender with the pronominal anaphor, as suggested in [5,
6]. Next, we look for possible candidates in the sentences
preceding the one where the pronoun is located. If we have
found no candidates so far, we search forward within the
pronoun sentence, and then forward in the next sentences,
as in [4]. Once the candidates have been selected, we apply
antecedent indicators to each of them, and assign scores (0,
1, and 2). The antecedent indicators we have taken into
account are a subset of the ones mentioned in [5]:
givenness, lexical reiteration, referential distance,
indicating verbs and collocation pattern preference. After
assigning scores to the candidates found, we select the
candidate with the highest overall score as the best
replacement for the pronoun. If two candidates have the
same overall score, we prefer the one with a higher
collocation pattern score. If we cannot make a decision
based on this score, we choose the candidate with a greater
indicating verbs score. In case of a tie, we select the most
recent candidate (the one closest to the pronoun).
We summarize the anaphora resolution procedure in the
algorithm in Figure 2.1.
function ANAPHORA-RESOLUTION (pronoun, number_of_sentences)
    returns a solution, or failure
  candidates ← BACKWARD-SEARCH-INSIDE-SENTENCE (pronoun) ∪
               BACKWARD-SEARCH (pronoun, number_of_sentences)
  if candidates ≠ ∅ then
    APPLY-ANTECEDENT-INDICATORS (candidates)
  else
    candidates ← FORWARD-SEARCH-INSIDE-SENTENCE (pronoun) ∪
                 FORWARD-SEARCH (pronoun, number_of_sentences)
    if candidates ≠ ∅ then
      APPLY-ANTECEDENT-INDICATORS (candidates)
  result ← MAX-SCORE-CANDIDATE (candidates)
  if result ≠ failure then return result
  else return failure

function APPLY-ANTECEDENT-INDICATORS (candidates)
    returns a solution, or failure
  result ← APPLY-GIVENNESS (candidates) ∪
           APPLY-LEXICAL-REITERATION (candidates) ∪
           APPLY-REFERENTIAL-DISTANCE (candidates) ∪
           APPLY-INDICATING-VERBS (candidates) ∪
           APPLY-COLLOCATION-PATTERN-PREFERENCE (candidates)
  if result ≠ failure then return result
  else return failure
Figure 2.1 The anaphora resolution algorithm.
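A runnable sketch of the scoring and tie-breaking step follows. The per-indicator scores are supplied as precomputed inputs; computing them as defined in [5] is outside the scope of this sketch, so the field names and values are illustrative assumptions.

```python
# Sketch of candidate scoring for anaphora resolution: each antecedent
# indicator awards 0, 1, or 2 points; ties are broken by the collocation
# pattern score, then the indicating-verbs score, then by recency.
# Indicator values here are placeholders, not computed as in [5].

def pick_antecedent(candidates):
    """Choose the best antecedent: highest total indicator score, ties
    broken by collocation, then indicating verbs, then smallest
    distance (tokens back) to the pronoun."""
    if not candidates:
        return None
    def key(c):
        total = (c["givenness"] + c["reiteration"] + c["distance_score"]
                 + c["indicating_verbs"] + c["collocation"])
        return (total, c["collocation"], c["indicating_verbs"], -c["distance"])
    return max(candidates, key=key)["entity"]

candidates = [
    {"entity": "Clarence Thomas", "givenness": 1, "reiteration": 2,
     "distance_score": 2, "indicating_verbs": 1, "collocation": 0, "distance": 4},
    {"entity": "George Bush", "givenness": 0, "reiteration": 1,
     "distance_score": 1, "indicating_verbs": 0, "collocation": 0, "distance": 12},
]
print(pick_antecedent(candidates))  # Clarence Thomas
```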
2.3 Semantic Normalization
Once co-reference and anaphora resolution have been
performed, the next step is semantic normalization. We
compact the triplets obtained so far, in order to generate a more coherent semantic graphical representation. For this
task, we rely on the synonymy relationships between
words. More precisely, we attach to each triplet element
the synsets found with WordNet. If the triplet element is
composed of two or more words, then for each of these
words we determine the corresponding synsets. This
procedure will help in the next phase, when we merge the
triplet elements that belong to the same synset.
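The following sketch illustrates the normalization step with a toy synonym table standing in for WordNet (the actual pipeline queries WordNet synsets); the words and synset identifiers are illustrative.

```python
# Sketch of semantic normalization with a toy synonym table standing in
# for WordNet. Each word of a triplet element is mapped to a canonical
# synset identifier, so that synonymous elements can later be merged.

SYNSETS = {
    "car": "synset:auto", "automobile": "synset:auto", "auto": "synset:auto",
    "buy": "synset:purchase", "purchase": "synset:purchase",
}

def normalize(element):
    """Map every word of a triplet element to its synset id when known."""
    return tuple(SYNSETS.get(w.lower(), w.lower()) for w in element.split())

t1 = ("John", "buy", "car")
t2 = ("John", "purchase", "automobile")
# After normalization the two triplets coincide and can be merged.
print([normalize(e) for e in t1] == [normalize(e) for e in t2])  # True
```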
2.4 Semantic Graph Generation
Based on the semantic normalization procedure, we can
merge the subject and object elements that belong to the
same normalized semantic class. Therefore, we generate a
directed semantic graph, having as nodes the subject and
the object elements and as edges the verbs. Verbs label the
relationship between the subject and the object nodes in
the graph. An example of a semantic graph obtained from
a news article is shown in Figure 2.2.
Figure 2.2 A semantic graph obtained from a news article.
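A minimal sketch of the graph-building step: subjects and objects become nodes, merged whenever they normalize to the same semantic class, and verbs label the directed edges. The adjacency-list representation and the toy normalizer are assumptions for this example.

```python
# Sketch of building the directed semantic graph from triplets: subject
# and object elements become nodes (merged when they normalize to the
# same class), and verbs label the edges between them.

def build_graph(triplets, normalize):
    """Return an adjacency list {subject_node: [(verb, object_node), ...]}."""
    graph = {}
    for subj, verb, obj in triplets:
        s, o = normalize(subj), normalize(obj)
        graph.setdefault(s, []).append((verb, o))
        graph.setdefault(o, [])          # objects are nodes too
    return graph

# Toy normalizer standing in for the WordNet-based step.
CLASSES = {"automobile": "car", "car": "car"}
normalize = lambda w: CLASSES.get(w.lower(), w.lower())

triplets = [("John", "bought", "car"),
            ("Mary", "sold", "automobile")]
graph = build_graph(triplets, normalize)
print(graph)
# {'john': [('bought', 'car')], 'car': [], 'mary': [('sold', 'car')]}
```

Note how "car" and "automobile" collapse into a single node, so both triplets attach to the same object.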
3 DOCUMENT SUMMARIZATION
The purpose of summarization based on semantic graphs is
to obtain the most important sentences from the original
document by first generating the document semantic graph
and then using the document and graph features to obtain
the document summary [4]. We created a semantic
representation of the document, based on the logical form
triplets we have retrieved from the text. For each generated
triplet we assign a set of features comprising linguistic,
document and graph attributes. We then train the linear
SVM classifier to determine those triplets that are useful
for extracting sentences which will later compose the
summary. As features for learning, we select the logical
form triplets characterized by three types of attributes:
linguistic attributes, document attributes, graph attributes.
The linguistic attributes include, among others, the triplet
type – subject, verb or object – the treebank tag and the
depth of the linguistic node extracted from the treebank
parse tree and the part of speech tag. The document
attributes include the location of the sentence within the
document, the triplet location within the sentence, the
frequency of the triplet element, the number of named
entities in the sentence, the similarity of the sentence with
the centroid (the central words of the document), and so
on. Finally, the graph attributes consist of hub and
authority weights [7], page rank [8], node degree, the size
of the weakly connected component the triplet element
belongs to, and others.
Features are ranked, based on information gain, and the
order is as follows (starting with the most important
feature): object, subject, verb (all of these are words),
location of the sentence in the document, similarity with
the centroid, number of locations in the sentence, number
of named entities in the sentence, authority weight for the
object, hub weight for the subject, size of the weakly
connected component for the object.
The summarization process starts with the original
document and its semantic graph. The three types of features mentioned above are then retrieved. Further, the
sentences are classified with the linear SVM and the
document summary is obtained. Its sentences are labelled
with SVM scores and ordered based on these scores in a
decreasing manner. The motivation for doing this is
presented in the next section of the paper.
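Since a trained linear SVM reduces to a weight vector and a bias, scoring a feature vector is just a dot product; the sketch below illustrates the ranking step with made-up weights and features (the real system learns them from labeled triplets).

```python
# Sketch of the ranking step: a trained linear SVM classifies a feature
# vector via a dot product plus bias. The weight values, feature order
# and sentence vectors below are illustrative assumptions.

def svm_score(features, weights, bias):
    return sum(f * w for f, w in zip(features, weights)) + bias

# Illustrative feature order: [sentence_location, centroid_similarity,
# named_entities_in_sentence, authority_weight, hub_weight]
weights = [0.9, 1.2, 0.3, 0.7, 0.5]
bias = -1.0

sentences = {
    "s1": [1.0, 0.8, 2, 0.6, 0.4],   # early sentence, close to the centroid
    "s2": [0.2, 0.1, 0, 0.1, 0.0],   # late, off-topic sentence
}
# Order sentences by decreasing SVM score, as in the summarizer.
ranked = sorted(sentences,
                key=lambda s: svm_score(sentences[s], weights, bias),
                reverse=True)
print(ranked)  # ['s1', 's2']
```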
4 SYSTEM EVALUATION
The experiments that were carried out involve gender
information retrieval, co-reference and anaphora resolution
and finally summarization. In the following, each of these experiments is presented, highlighting the data set used,
the systems selected for result comparison and the
outcome.
4.1 Gender Information Retrieval
Gender-related information was extracted from two GATE
resource files: person_male and person_female gazetteers.
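The lookup itself is straightforward; the sketch below illustrates it with small name lists standing in for GATE's gazetteers, alongside the masculine-default baseline used in the comparison.

```python
# Sketch of gazetteer-based gender lookup and the masculine-default
# baseline. The name lists are illustrative stand-ins for GATE's
# person_male / person_female gazetteer files.

MALE = {"john", "clarence", "george"}
FEMALE = {"mary", "ruth", "anita"}

def gender(person_name):
    """Look up the first name of a person entity in the gazetteers."""
    first = person_name.split()[0].lower()
    if first in MALE:
        return "masculine"
    if first in FEMALE:
        return "feminine"
    return "unknown"

def baseline_gender(person_name):
    """Baseline: every person entity is labeled masculine."""
    return "masculine"

print(gender("Mary Smith"), baseline_gender("Mary Smith"))
# feminine masculine
```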
For evaluation we manually annotated 15 random
documents taken from the Reuters RCV1 [9] data set. The
two systems that were compared with the manually
obtained results are:
• Our system, henceforward referred to as System;
• A Baseline system, which assigns the masculine gender to all named entities labeled as persons.
The results are presented in Table 4.1.
           Masculine        Feminine      Total
System     170/206 (83%)    7/14 (50%)    177/220 (80%)
Baseline   206/206 (100%)   0/14 (0%)     206/220 (94%)
Table 4.1 Gender evaluation results.
The fact that System correctly labeled a significant percentage of both masculine and feminine persons suggests it will carry out gender retrieval better than the baseline system when the number of persons of either gender is more balanced.
4.2 Co-reference Resolution
For the evaluation of co-reference resolution the same set
of 15 articles mentioned in section 4.1 was used. Named entities were extracted with GATE, and the co-reference resolution performed by System was compared with that of GATE. The results are shown in Table 4.2.
GATE extracted 783 named entities. System's performance is better than GATE's: 750 entities co-referenced, compared to GATE's 646.
          Co-References
System    750/783 (96%)
GATE      646/783 (83%)
Table 4.2 Co-reference evaluation results.
4.3 Anaphora Resolution
In the case of anaphora resolution, the System was
compared with two baseline systems. Both of them
consider the closest named entity as a pronoun
replacement, but one takes gender information into
account, whereas the other does not.
Pronouns   System          Baseline-gender   Baseline-no gender
He         35/42 (83%)     18/42 (43%)       18/42 (43%)
They       7/20 (35%)      8/20 (40%)        2/20 (10%)
I          4/15 (27%)      0/15 (0%)         2/15 (13%)
She        0/0             0/0               0/0
Who        0/0             0/0               0/0
It         11/35 (31%)     11/35 (31%)       11/35 (31%)
Other      2/4 (50%)       2/6 (33%)         3/6 (50%)
Total      59/116 (51%)    39/118 (33%)      36/118 (31%)
Table 4.3 Anaphora evaluation results.
The results are listed in Table 4.3, highlighting System's strength on the pronoun “he”.
4.4 Summary Generation
For summarization evaluation, two tests were carried out.
The first one involved the usage of the DUC (Document
Understanding Conferences) [10] 2002 data set, for which
the results obtained were similar to those listed in [4].
For the second one the DUC 2007 update task data set was
used for testing purposes. The data consisted of 10 topics
(A-J), each divided into 3 clusters (A-C), each cluster with
7-10 articles. For this assessment, we focused on the first
part of the task – producing a summary of documents in
cluster A – 100 words in length, without taking into
consideration the topic information. In order to obtain the
100-word summary, we first retrieved all sentences having
triplets belonging to instances with the class attribute value
equal to +1, and ordered them in a decreasing manner,
based on the value returned by the SVM classifier. Out of
these sentences, we considered the top 15%, and used them
to generate a summary. That is because most sentences
that were manually labeled as belonging to the summary
were among the first 15% top sentences.
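The selection step described above can be sketched as follows; the function name, scores and sentences are illustrative, not the system's actual data.

```python
# Sketch of assembling the 100-word summary: keep the top 15% of
# positively classified sentences by SVM score, then emit sentences
# until the word budget is spent. Scores and sentences are made up.

def build_summary(scored_sentences, top_fraction=0.15, word_budget=100):
    """scored_sentences: list of (svm_score, sentence) pairs."""
    ranked = sorted(scored_sentences, key=lambda p: p[0], reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    summary, used = [], 0
    for _, sentence in ranked[:keep]:
        words = sentence.split()
        if used + len(words) > word_budget:
            break
        summary.append(sentence)
        used += len(words)
    return " ".join(summary)

scored = [(2.1, "The court nominee faced new allegations."),
          (0.4, "Reporters gathered outside the building."),
          (1.3, "Senators scheduled a second round of hearings.")]
# With a larger fraction for this tiny example, the two highest-scoring
# sentences form the summary.
print(build_summary(scored, top_fraction=0.67, word_budget=100))
```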
In order to compare the performance of various systems,
we employed ROUGE (Recall-Oriented Understudy for
Gisting Evaluation) [11], an automatic summarization
evaluation package. Our system ranked 17th out of 25 according to the ROUGE-2 evaluation method, and 18th out of 25 according to ROUGE-SU4.
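For intuition, the core of the ROUGE-2 measure is bigram recall against a reference summary; the sketch below computes this simplest form (the ROUGE package itself adds options such as stemming and handling of multiple references).

```python
# Sketch of the idea behind ROUGE-2: recall of reference bigrams that
# also appear in the candidate summary. This is the simplest form of
# the measure, not the full ROUGE package.

def bigrams(text):
    words = text.lower().split()
    return [tuple(words[i:i + 2]) for i in range(len(words) - 1)]

def rouge_2_recall(candidate, reference):
    """Fraction of reference bigrams found in the candidate."""
    ref = bigrams(reference)
    if not ref:
        return 0.0
    cand = set(bigrams(candidate))
    matched = sum(1 for b in ref if b in cand)
    return matched / len(ref)

reference = "the cat sat on the mat"
candidate = "the cat sat on a mat"
print(round(rouge_2_recall(candidate, reference), 2))  # 0.6
```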
5 CONCLUSION
The stated purpose of the paper was to present a
methodology for generating semantic graphs derived from
logical form triplets and, furthermore, to use these
semantic graphs to construct document summaries. The
evaluation that was carried out showed the system in
comparison to other similar applications, demonstrating its
feasibility as a semantic graph generator and document
summarizer.
As far as future improvements are concerned, one
possibility would be to combine the document summarizer
with an online newswire crawling system that processes
news on the fly, as they are posted, and then uses the
summarizer to obtain a compressed version of the story.
6 ACKNOWLEDGMENTS
This work was supported by the Slovenian Research
Agency and the IST Programme of the EC under SMART (IST-033917) and PASCAL2 (IST-NoE-216886).
7 REFERENCES
[1] D. Rusu, L. Dali, B. Fortuna, M. Grobelnik, D.
Mladenić. Triplet Extraction from Sentences.
Ljubljana: 2007. Proceedings of the 10th International Multiconference “Information Society - IS 2007”. Vol. A, pp. 218-222.
[2] Stanford Parser: http://nlp.stanford.edu/software/lex-parser.shtml
[3] GATE (General Architecture for Text Engineering):
http://gate.ac.uk/
[4] J. Lescovec, M. Grobelnik, N. Milic-Frayling.
Learning Sub-structures of Document Semantic
Graphs for Document Summarization, 2004. KDD
Workshop on Link Analysis and Group Detection.
[5] R. Mitkov. Robust pronoun resolution with limited knowledge. Montreal: 1998. Proceedings of the 18th International Conference on Computational Linguistics COLING'98/ACL'98. pp. 869-875.
[6] R. Mitkov. Anaphora Resolution: The State of the
Art. Wolverhampton: 1999. Working Paper (Based
on the COLING'98/ACL'98 tutorial on anaphora
resolution).
[7] J. M. Kleinberg. Authoritative Sources in a
Hyperlinked Environment. 1998. Proceedings of the
9th ACM-SIAM Symposium on Discrete Algorithms.
[8] L. Page, S. Brin, R. Motwani, T. Winograd. The
PageRank Citation Ranking: Bringing Order to the
Web. 1998.
[9] D. D. Lewis, Y. Yang, T. G. Rose, F. Li. RCV1: A
New Benchmark Collection for Text Categorization
Research. 2004, Journal of Machine Learning
Research, Vol. 5.
[10] DUC (Document Understanding Conferences):
http://duc.nist.gov/
[11] ROUGE (Recall-Oriented Understudy for Gisting
Evaluation): http://haydn.isi.edu/ROUGE/