
USING PRIOR KNOWLEDGE TO ASSESS RELEVANCE IN SPEECH SUMMARIZATION
Ricardo Ribeiro¹ and David Martins de Matos²
¹ ISCTE/IST/L2F - INESC-ID Lisboa
² IST/L2F - INESC-ID Lisboa
Rua Alves Redol, 9, 1000-029 Lisboa, Portugal
ABSTRACT
We explore the use of topic-based, automatically acquired prior knowledge in speech summarization, assessing its influence across several term weighting schemes. All information is combined
using latent semantic analysis as a core procedure to compute the
relevance of the sentence-like units of the given input source. Evaluation is performed using the self-information measure, which tries
to capture the informativeness of the summary in relation to the
summarized input source. The similarity of the output summaries of
the several approaches is also analyzed.
Index Terms— Speech processing, Text processing, Natural
languages, Information retrieval, Singular value decomposition
1. INTRODUCTION
Automatic summarization aims to present to the user in a concise and
comprehensible manner the most relevant content given one or more
information sources [1]. Several difficulties arise when addressing this problem, but one of the most important is how to assess the relevant content. While in text summarization up-to-date systems make use of complex information, such as syntactic [2], semantic [3], and discourse [4, 5] information, either to assess relevance or to reduce the length of the output, speech summarization systems rely on lexical, acoustic/prosodic, structural, and discourse features to extract the most relevant sentence-like units from the input (although, concerning discourse features, only simple approaches are used, such as the given/new information score presented by Maskey and Hirschberg [6] or the detection of words like decide or conclude described by Murray et al. [7]). In fact, spoken language summarization is often considered more difficult than text summarization [8, 9, 10]: speech recognition errors, disfluencies, and problems in the accurate identification of sentence boundaries affect the input source to be summarized. However, although current work strives to explore features beyond the transcriptions of the spoken documents, such as structural, acoustic/prosodic, and spoken language features [6, 11], shallow text summarization approaches like Latent Semantic Analysis (LSA) [12] and Maximal Marginal Relevance (MMR) [13] achieve comparable performance [11].
In the current work, we explore the use of prior knowledge to
assess relevance in the context of the summarization of broadcast
news. SSNT [14] is a system for selective dissemination of multimedia contents, based on an automatic speech recognition (ASR)
module that generates the transcriptions later used by topic segmentation, topic indexing, and title&summarization modules. Our proposal consists in the use of news stories previously indexed by topic
(by the topic indexing module) to identify the most relevant passages of upcoming news stories. This approach models relevance within topics and its evolution through time by continuously updating the corresponding set of background news stories (which can be seen as topic-directed semantic spaces), trying to approximate the way humans perform summarization tasks [15, 16, 17]. This is
in accordance with Endres-Niggemeyer [15, 16, 17], who identified
the following aspects as the most important ones when observing the
summarization process from a human perspective: knowledge-based
text analysis; combined online processing; and task orientation and
selective interpretation.
We address these issues using a statistical approach: the topic-directed semantic spaces establish the knowledge base used by the
summarization process; term weighting strategies are used to combine global and local weights (which may also include information
like the recognition confidence coefficient to address spoken language summarization specificities); and, finally, the use of LSA [18]
provides the selective interpretation step, where terms (keywords),
as described by Endres-Niggemeyer, are the most obvious cues for
attracting attention.
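As a small worked example of how a local and a global weight combine (the numbers are ours, purely illustrative; the actual weight definitions are given in section 3.2.3): suppose term t_i occurs twice in sentence s_j with recognition confidences 0.9 and 0.7, and P(t_i) = 0.05 over the topic-directed semantic space. Then, using the natural logarithm,

$$
L_{ij} = 0.9 + 0.7 = 1.6, \qquad
G_{ij} = -\log(0.05) \approx 3.0, \qquad
a_{ij} = L_{ij} \times G_{ij} \approx 4.8 .
$$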
This document is organized as follows: the next section briefly
describes the broadcast news processing system; section 3 presents
our approach to speech summarization, detailing how prior knowledge is included in the summarization process; section 4 presents the
obtained results; before concluding, we address similar work.
2. SELECTIVE DISSEMINATION OF MULTIMEDIA
CONTENTS
SSNT [14] is a system for selective dissemination of multimedia contents, working primarily with Portuguese broadcast news services. The system is based on an ASR module that generates the transcriptions used by the topic segmentation, topic indexing, and title&summarization modules. Preceding the speech recognition module, an audio preprocessing module classifies the audio according to several criteria: speech/non-speech, speaker segmentation and clustering, gender, and background conditions. The ASR module, with an average word error rate of 24% [14], greatly influences the performance of the subsequent modules. The topic
segmentation module [19], reported as based on clustering, groups
transcribed segments into stories. The algorithm relies on a heuristic
derived from the structure of the news services: each story starts
with a segment spoken by the anchor. This module achieved an
F-measure of 68% [14]. The main problem identified by the authors
was boundary deletion: a problem which impacts the summarization
task. Topic indexing [19] is based on a hierarchically organized
thematic thesaurus provided by the broadcasting company. The hierarchy has 22 thematic areas on the first level, for which the module
achieved a correctness of 91.4% [20, 14]. Batista et al. [21] inserted
a module for recovering punctuation marks, based on maximum
entropy models, after the ASR module. The punctuation marks
addressed were the “full stop” and “comma”, which provide the
sentence units necessary for use in the title&summarization module.
This module achieved an F-measure of 56% and a SER (Slot Error
Rate) of 0.74.
3. SPEECH SUMMARIZATION USING PRIOR
KNOWLEDGE TO ASSESS RELEVANCE
As introduced above, human summarization is a knowledge-based
task: Endres-Niggemeyer [16] observes that both generic and specific knowledge is used to understand information and assess its relevance.
3.1. Selecting Background Knowledge
The first step of our summarization process consists in the selection
of the background knowledge that will improve the detection of the
most relevant passages of the news stories to be summarized. This is
done by identifying and selecting the news stories from the previous
n days that match the topics of the news story to be summarized. The
procedure, as currently implemented, relies on the results produced
by the topic indexing module, but a clustering approach could also
be followed.
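A minimal sketch of this selection step (the data structures are hypothetical; the deployed system relies on the output of the topic indexing module, and, as noted above, a clustering approach could be used instead):

```python
from datetime import timedelta

def select_background(story, archive, n_days):
    """Select previously indexed stories from the last n_days whose topic
    labels intersect those of the story to be summarized.
    story and archive entries are dicts with a "date" (datetime.date) and a
    "topics" (set of topic labels) field -- an illustrative structure only."""
    window_start = story["date"] - timedelta(days=n_days)
    return [
        past for past in archive
        if window_start <= past["date"] < story["date"]
        and past["topics"] & story["topics"]   # at least one topic in common
    ]
```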
3.2. Using LSA to Combine the Available Information
The summarization process we implemented is characterized by the
use of LSA as a core procedure to compute the relevance of the extracts (sentence-like units) of the given input source.
3.2.1. LSA

LSA is based on the singular value decomposition (SVD) of the m × n term-by-sentence frequency matrix M: M = U Σ V^T, where U is an m × n matrix of left singular vectors, Σ is the n × n diagonal matrix of singular values, and V is the n × n matrix of right singular vectors (only possible if m ≥ n).

The idea behind the method is that the decomposition captures the underlying topics of the document (identifying the most salient ones) by means of the co-occurrence of terms (the latent semantic analysis), and identifies the sentence-like units that best represent each topic. Summary creation can then be done by picking the best representatives of the most relevant topics according to a defined strategy.

Our implementation follows the original ideas of Gong and Liu [12] and those of Murray, Renals, and Carletta [9] for solving dimensionality problems, and uses the GNU Scientific Library (http://www.gnu.org/software/gsl/) for matrix operations.

3.2.2. Including the Background Knowledge

Prior to the application of the SVD, it is necessary to create the term-by-sentence matrix that represents the input source. It is in this step that the previously selected news stories are taken into consideration: instead of using only the news story to be summarized to create the term-by-sentence matrix, both the selected stories and the news story are used. The matrix is defined as

$$
\begin{bmatrix}
a_{1d^1_1} & \cdots & a_{1d^1_s} & \cdots & a_{1d^D_1} & \cdots & a_{1d^D_s} & a_{1n_1} & \cdots & a_{1n_s} \\
\vdots & & & & & & & & & \vdots \\
a_{Td^1_1} & \cdots & a_{Td^1_s} & \cdots & a_{Td^D_1} & \cdots & a_{Td^D_s} & a_{Tn_1} & \cdots & a_{Tn_s}
\end{bmatrix}
\qquad (1)
$$

where $a_{id^k_j}$ represents the weight of term $t_i$, $1 \le i \le T$ ($T$ is the number of terms), in sentence $s_{d^k_j}$, with $d^k_1 \le d^k_j \le d^k_s$, of document $d^k$, $1 \le k \le D$ ($D$ is the number of selected documents); and $a_{in_l}$, $1 \le l \le s$, are the elements associated with the news story to be summarized. The elements of the matrix are defined as $a_{ij} = L_{ij} \times G_{ij}$, where $L_{ij}$ is a local weight and $G_{ij}$ is a global weight. To form the resulting summary, only sentences corresponding to the last columns of the matrix (which correspond to the sentences of the news story to be summarized) are selected.

3.2.3. Term Weighting Schemes

In order to better assess the relevance of the prior knowledge, we explored different weighting schemes, all of them both using and not using prior knowledge. The following local weights $L_{ij}$ were used:

• frequency of term $i$ in sentence $j$;
• summation of the recognition confidence scores of all occurrences of term $i$ in sentence $j$;
• summation of the smoothed recognition confidence scores (using the trigrams $t_{k-2}t_{k-1}t_k$, $t_{k-1}t_kt_{k+1}$, and $t_kt_{k+1}t_{k+2}$, where $t_k$ is an occurrence of term $i$ in a sentence) of all occurrences of term $i$ in sentence $j$.

Global weights are entropy-based (more specifically, they are based on the self-information measure), considering the probability of a term $t_i$ characterizing a sentence $s$:

$$
P(t_i) = \frac{\sum_j C(t_i, s_j)}{\sum_i \sum_j C(t_i, s_j)}
\qquad (2)
$$

where

$$
C(t_i, s_j) =
\begin{cases}
1 & t_i \in s_j \\
0 & t_i \notin s_j
\end{cases}
\qquad (3)
$$

The global weights $G_{ij}$ were defined as follows:

• 1 (not using a global weight);
• $-\log(P(t_i))$;
• the same as the previous, but calculating $P(t_i)$ using $CR(t_i, s_j)$ (eq. 4) instead of $C(t_i, s_j)$; used only when the recognition confidence scores are used in the local weight;
• the same as the previous, but using the smoothed recognition confidence scores as described for the local weights; used only when the smoothed recognition confidence scores are used in the local weight.

$$
CR(t_i, s_j) =
\begin{cases}
\dfrac{\sum \text{recognition confidence scores of the occurrences of } t_i \text{ in } s_j}{\text{frequency of } t_i \text{ in } s_j} & t_i \in s_j \\
0 & t_i \notin s_j
\end{cases}
\qquad (4)
$$
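To make the procedure concrete, here is a minimal NumPy sketch of the whole chain (our own illustration, not the authors' GSL-based implementation): it builds the term-by-sentence matrix of eq. 1 with term frequency as the local weight and the self-information global weight of eqs. 2-3, applies the SVD, and selects the story's own sentences following Gong and Liu's strategy [12] of taking the best-scoring sentence for each of the most salient topics. Tokenization and the particular choice of weights are assumptions made for the sake of the example.

```python
import numpy as np

def lsa_summary(background_sents, story_sents, k=3):
    """Rank the sentences of the story to be summarized using LSA over a
    term-by-sentence matrix that also contains the background sentences.
    background_sents, story_sents: lists of token lists. Returns k indices
    into story_sents."""
    sents = background_sents + story_sents
    vocab = sorted({t for s in sents for t in s})
    index = {t: i for i, t in enumerate(vocab)}

    # Local weight: term frequency in each sentence.
    A = np.zeros((len(vocab), len(sents)))
    for j, sent in enumerate(sents):
        for tok in sent:
            A[index[tok], j] += 1

    # Global weight: self-information of the term (eq. 2), computed over
    # background and story sentences alike.
    C = (A > 0).astype(float)                  # C(t_i, s_j) of eq. 3
    P = C.sum(axis=1) / C.sum()                # P(t_i) of eq. 2
    G = -np.log(P)
    A = A * G[:, np.newaxis]                   # a_ij = L_ij * G_ij

    # SVD; only the columns corresponding to the story's own sentences
    # (the last columns of the matrix) are candidates for the summary.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    first_story_col = len(background_sents)
    candidates = Vt[:, first_story_col:]       # topics x story sentences

    # Gong & Liu-style selection: for each of the most salient topics, pick
    # the story sentence with the highest index value, skipping repeats.
    chosen = []
    for topic in range(candidates.shape[0]):
        order = np.argsort(-np.abs(candidates[topic]))
        for j in order:
            if j not in chosen:
                chosen.append(int(j))
                break
        if len(chosen) == k:
            break
    return chosen
```

For the experiments described in section 4, k = 3 would correspond to the 3-sentence summaries generated for each news story.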
4. EXPERIMENTAL SETUP
Our goal was to understand whether the inclusion of prior knowledge
increases the relevant content of the resulting summaries.
4.1. Data
In this experiment, we tried to summarize the news stories of 2
episodes, 2008/06/09 and 2008/06/29, of a Portuguese news program
(their properties are detailed in the top part of table 1). As source
of prior knowledge, we used for the 2008/06/09 show the episodes
from 2008/06/01 to 2008/06/08, and for the 2008/06/29 show, the
episodes from 2008/06/20 to 2008/06/28.
                                     2008/06/09    2008/06/29
News shows to be summarized
  # of stories                               28            24
  # of sentences                            443           476
  # of transcribed tokens                  7733          6632
  # of different topics                      45            47
  Average # of topics per story            3.54          4.92
News shows used as prior knowledge
  # of stories                              205           268
  # of sentences                           4127          4553
  # of transcribed tokens                 62025         71470
  # of different topics                     158           195
  Average # of topics per story            3.42          4.10

Table 1. Properties of the news shows to be summarized (top) and of the news shows used as prior knowledge (bottom).
4.2. Evaluation

As previously mentioned, our goal is to understand whether the content of the summaries produced using prior knowledge is more informative than that of the summaries produced by the baseline LSA methods. To do so, we use self-information (I(t_i) = -log(P(t_i))) as the measure of the informativeness of a summary: this meets our objectives and facilitates an automatic assessment of the improvements. The probabilities of the terms t_i are estimated using their distribution in the document to be summarized.

In this experiment, we generated 3-sentence summaries (which corresponds to a compression rate of about 5% to 10%) of each news story, similar to what is done, for example, on the CNN website (http://edition.cnn.com).
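A small sketch of how such an informativeness score can be computed automatically (the details are our own assumptions: the text does not fix whether the self-information of the summary's terms is summed or averaged, so the sketch simply sums it, and tokenization is assumed):

```python
import math
from collections import Counter

def informativeness(summary_tokens, story_tokens):
    """Informativeness of a summary as the summed self-information,
    I(t) = -log P(t), of its terms, with P(t) estimated from the term
    distribution of the story to be summarized."""
    counts = Counter(story_tokens)
    total = sum(counts.values())
    return sum(-math.log(counts[t] / total)
               for t in summary_tokens if t in counts)
```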
4.3. Results

The obtained results are presented in figure 1, which shows the overlap between the output summaries, and figure 2, which shows the global performance of all approaches.

As can be seen in figure 1, the use of the global weight increases the differences between using and not using prior knowledge. This is due to the fact that the global weight of each term is computed using not only the news story to be summarized, but also the prior knowledge, which also influences the elements of the term-by-sentence matrix related to the news story; when no global weight is used, the prior knowledge does not affect those elements. This can be seen in the top graphic of figure 1, where all summaries overlap more.

Figure 2 shows the number of times each method produced the most informative summary. In general, the methods using the global weight achieved the worst results. This can be explained in part by the noise present both in the transcriptions and in the topic indexing phase, since when no global weight is used, the use of prior knowledge performs considerably better.
Fig. 1. Summary overlap: with (B, C, D, E)/without (A, F) RC (C, D: smoothed), with (D, E, F)/without (A, B, C) PK, with (bottom)/without (top) GW. RC: recognition confidence; PK: prior knowledge; GW: global weight.

Fig. 2. Comparative results using all variations of the evaluation measure. A: plain LSA; B: plain LSA + GW; C: plain LSA + GW + RC; D: plain LSA + GW + RC smoothed; E: plain + RC; F: plain LSA + RC smoothed; G through L: the same, but with PK.
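The overlap plotted in figure 1 can be measured in several ways; since the text does not spell out the formula, the sketch below uses one plausible formulation (our assumption), counting, for each pair of methods, how many selected sentences coincide for a given story.

```python
from itertools import combinations

def summary_overlap(selections):
    """selections: dict mapping a method label (e.g. 'A'..'F') to the set of
    sentence indices it selected for a given story. Returns pairwise overlap
    counts; a hypothetical formulation of the overlap shown in figure 1."""
    return {
        (a, b): len(selections[a] & selections[b])
        for a, b in combinations(sorted(selections), 2)
    }

# Example: summary_overlap({"A": {0, 3, 7}, "B": {0, 3, 9}})  # {('A', 'B'): 2}
```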
5. RELATED WORK

Furui [10] identifies three main approaches to speech summarization: sentence extraction-based methods, sentence compaction-based methods, and combinations of both. Sentence extractive methods comprise, essentially, methods like LSA [9, 22], MMR [23, 9, 22], and feature-based methods [9, 6, 7, 22]. Sentence compaction methods are based on word removal from the transcription, with recognition confidence scores playing a major role [24]. A combination of these two types of methods was developed by Kikuchi et al. [25], where summarization is performed in two steps: first, sentence extraction is done through feature combination; second, compaction is done by scoring the words in each sentence and then applying a dynamic programming technique to select the words that will remain in the sentence to be included in the summary.

Closer to our work is the way topic-adapted summarization is explored by Chatain et al. [26]: one of the components of the sentence scoring function is an n-gram linguistic model computed from given data. However, of the two experiments performed, one using talks and the other using broadcast news, only the one using talks used a topic-adapted linguistic model, and the data used for the adaptation consisted of the papers in the conference proceedings of the talk to be summarized. CollabSum [27] also explores document proximity in single-document text summarization: by means of a clustering algorithm, documents related to the document to be summarized are grouped in a cluster and used to build a graph that reflects the relationships between all sentences in the cluster; informativeness is then computed for each sentence and redundancy is removed to generate the summary.

6. FUTURE WORK

Future work should address the influence of the modules that precede summarization on its performance, and assess the correlation between human preferences in summarization and the informativeness measure used in the current work.
7. REFERENCES
[1] I. Mani, Automatic Summarization, John Benjamins Publishing Company, 2001.
[2] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova, "Beyond SumBasic: Task-focused summarization and lexical expansion," Information Processing and Management, vol. 43, pp. 1606–1618, 2007.
[3] R. I. Tucker and K. Spärck Jones, "Between shallow and deep: an experiment in automatic summarising," Tech. Rep. 632, University of Cambridge Computer Laboratory, 2005.
[4] H. Daumé III and D. Marcu, "A Noisy-Channel Model for Document Compression," in Proceedings of the 40th Annual Meeting of the ACL. 2002, pp. 449–456, ACL.
[5] S. Harabagiu and F. Lacatusu, "Topic Themes for Multi-Document Summarization," in SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2005, pp. 202–209, ACM.
[6] S. Maskey and J. Hirschberg, "Comparing Lexical, Acoustic/Prosodic, Structural and Discourse Features for Speech Summarization," in Proceedings of the 9th EUROSPEECH - INTERSPEECH 2005. 2005, pp. 621–624, ISCA.
[7] G. Murray, S. Renals, J. Carletta, and J. Moore, "Incorporating Speaker and Discourse Features into Speech Summarization," in Proceedings of the HLT/NAACL. 2006, pp. 367–374, ACL.
[8] K. R. McKeown, J. Hirschberg, M. Galley, and S. Maskey, "From Text to Speech Summarization," in 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. 2005, vol. V, pp. 997–1000, IEEE.
[9] G. Murray, S. Renals, and J. Carletta, "Extractive Summarization of Meeting Records," in Proceedings of the 9th EUROSPEECH - INTERSPEECH 2005. 2005, pp. 593–596, ISCA.
[10] S. Furui, "Recent Advances in Automatic Speech Summarization," in Proceedings of the 8th Conference on Recherche d'Information Assistée par Ordinateur. 2007, Centre des Hautes Études Internationales d'Informatique Documentaire.
[11] G. Penn and X. Zhu, "A critical reassessment of evaluation baselines for speech summarization," in Proceedings of ACL-08: HLT. 2008, pp. 470–478, ACL.
[12] Y. Gong and X. Liu, "Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis," in SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2001, pp. 19–25, ACM.
[13] J. Carbonell and J. Goldstein, "The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries," in SIGIR 1998: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1998, pp. 335–336, ACM.
[14] R. Amaral, H. Meinedo, D. Caseiro, I. Trancoso, and J. P. Neto, "A Prototype System for Selective Dissemination of Broadcast News in European Portuguese," EURASIP Journal on Advances in Signal Processing, vol. 2007, 2007.
[15] B. Endres-Niggemeyer, Summarizing Information, Springer, 1998.
[16] B. Endres-Niggemeyer, "Human-style WWW summarization," Tech. Rep., University for Applied Sciences, Department of Information and Communication, 2000.
[17] B. Endres-Niggemeyer, "SimSum: an empirically founded simulation of summarizing," Information Processing and Management, vol. 36, no. 4, pp. 659–682, 2000.
[18] T. K. Landauer, P. W. Foltz, and D. Laham, "An Introduction to Latent Semantic Analysis," Discourse Processes, vol. 25, 1998.
[19] R. Amaral and I. Trancoso, "Improving the Topic Indexation and Segmentation Modules of a Media Watch System," in Proceedings of INTERSPEECH 2004 - ICSLP, 2004, pp. 1609–1612.
[20] R. Amaral, H. Meinedo, D. Caseiro, I. Trancoso, and J. P. Neto, "Automatic vs. Manual Topic Segmentation and Indexation in Broadcast News," in Proceedings of the IV Jornadas en Tecnologia del Habla, 2006.
[21] F. Batista, D. Caseiro, N. J. Mamede, and I. Trancoso, "Recovering Punctuation Marks for Automatic Speech Recognition," in Proceedings of INTERSPEECH 2007. 2007, pp. 2153–2156, ISCA.
[22] R. Ribeiro and D. M. de Matos, "Extractive Summarization of Broadcast News: Comparing Strategies for European Portuguese," in Text, Speech and Dialogue – 10th International Conference. Proceedings. 2007, vol. 4629 of Lecture Notes in Computer Science (Subseries LNAI), pp. 115–122, Springer.
[23] K. Zechner and A. Waibel, "Minimizing Word Error Rate in Textual Summaries of Spoken Language," in Proceedings of the 1st Conference of the North American Chapter of the ACL. 2000, pp. 186–193, Morgan Kaufmann.
[24] T. Hori, C. Hori, and Y. Minami, "Speech Summarization using Weighted Finite-State Transducers," in Proceedings of the 8th EUROSPEECH - INTERSPEECH 2003. 2003, pp. 2817–2820, ISCA.
[25] T. Kikuchi, S. Furui, and C. Hori, "Two-stage Automatic Speech Summarization by Sentence Extraction and Compaction," in Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR2003). 2003, pp. 207–210, ISCA.
[26] P. Chatain, E. W. D. Whittaker, J. A. Mrozinski, and S. Furui, "Topic and Stylistic Adaptation for Speech Summarisation," in Proceedings of ICASSP 2006. 2006, vol. I, pp. 977–980, IEEE.
[27] X. Wan, J. Yang, and J. Xiao, "CollabSum: Exploiting Multiple Document Clustering for Collaborative Single Document Summarizations," in SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007, pp. 143–150, ACM.