Speech Summarization
Julia Hirschberg (thanks to Sameer Maskey for some slides)
CS4706
Summarization Distillation
• ‘…the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks)’ [Mani and Maybury, 1999]
• Why summarize? Too much data!
Types of Summarization
• Indicative
– Describes the document and its contents
• Informative
– ‘Replaces’ the document
• Extractive
– Concatenates pieces of the existing document
• Generative
– Creates a new document
• Document compression
SOME SUMMARIZATION TECHNIQUES BASED ON TEXT (LEXICAL FEATURES)
[Overview figure: sentence extraction with similarity measures [Salton, et al., 1995; McKeown, et al., 2001]; extraction training with manual summaries; concept-level extraction [Hovy & Lin, 1999]; generation of words/phrases [Witbrock & Mittal, 1999]; use of structured data [Maybury, 1995]]
Sentence Extraction/Similarity Measures [Salton, et al., 1995]
• Extract sentences by their similarity to a topic sentence and their dissimilarity to sentences already in the summary (Maximal Marginal Relevance)
• Similarity measures
– Cosine Measure
– Vocabulary Overlap
– Topic word overlap
– Content Signatures Overlap
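Illustration (not from the original slides): a minimal Python sketch of two of the similarity measures listed above, cosine over term-frequency vectors and vocabulary overlap; tokenization and function names are my own.

# Minimal sketch of two sentence-similarity measures used in extractive
# summarization. Names and tokenization are illustrative, not from Salton et al.
from collections import Counter
from math import sqrt

def cosine_similarity(sent_a, sent_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    va, vb = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    shared = set(va) & set(vb)
    dot = sum(va[w] * vb[w] for w in shared)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def vocabulary_overlap(sent_a, sent_b):
    """Fraction of word types shared by the two texts (Jaccard overlap)."""
    ta, tb = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0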
Concept/content level extraction [Hovy & Lin, 1999]
• Present keywords as summary
• Build concept signatures by finding relevant words in 30,000 WSJ documents, each categorized into different topics
• Phrase concatenation of relevant concepts/content
• Sentence planning for generation
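Illustration (a simplification, not Hovy & Lin's actual method): one way to build a topic signature is to rank words by how much more frequent they are in the topic's documents than in the corpus as a whole.

# Rough sketch of building a topic (concept) signature: rank words that are
# relatively more frequent in one topic's documents than in the whole corpus.
# Assumes every topic document also appears in all_docs, so counts never hit zero.
from collections import Counter

def topic_signature(topic_docs, all_docs, top_n=20):
    """topic_docs / all_docs: lists of token lists. Returns the top_n words
    whose relative frequency in the topic most exceeds their corpus frequency."""
    topic_counts, corpus_counts = Counter(), Counter()
    for doc in topic_docs:
        topic_counts.update(doc)
    for doc in all_docs:
        corpus_counts.update(doc)
    t_total, c_total = sum(topic_counts.values()), sum(corpus_counts.values())
    score = {w: (topic_counts[w] / t_total) / (corpus_counts[w] / c_total)
             for w in topic_counts}
    return sorted(score, key=score.get, reverse=True)[:top_n]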
Feature-based statistical models
[Kupiec, et al., 1995]
• Create manual summaries
• Extract features
• Train statistical model using various ML techniques
• Use the trained model to score each sentence in the test data
• Extract N highest-scoring sentences
$$P(s \in S \mid F_1, F_2, \dots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$
• Where S is the summary, given the k features F_j, and P(F_j) and P(F_j | s ∈ S) can be computed by counting occurrences
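Illustration (a hedged sketch of this kind of naive-Bayes scorer, not Kupiec et al.'s exact system): estimate the probabilities above by counting over training sentences labeled as in/out of the manual summary, then score test sentences.

# Sketch of a Kupiec-style naive Bayes sentence scorer. Feature extraction,
# smoothing (eps), and data layout are illustrative assumptions.
from collections import defaultdict

def train(train_sents):
    """train_sents: list of (feature_tuple, in_summary_bool) pairs."""
    n_total = len(train_sents)
    n_summary = sum(1 for _, y in train_sents if y)
    f_given_s = defaultdict(lambda: defaultdict(int))  # counts of Fj among summary sentences
    f_any = defaultdict(lambda: defaultdict(int))      # counts of Fj over all sentences
    for feats, y in train_sents:
        for j, f in enumerate(feats):
            f_any[j][f] += 1
            if y:
                f_given_s[j][f] += 1
    return n_total, n_summary, f_given_s, f_any

def score(feats, model, eps=1e-6):
    """Approximates P(s in S | F1..Fk) under the feature-independence assumption."""
    n_total, n_summary, f_given_s, f_any = model
    p = n_summary / n_total                               # P(s in S)
    for j, f in enumerate(feats):
        p *= (f_given_s[j][f] + eps) / (n_summary + eps)  # P(Fj | s in S)
        p /= (f_any[j][f] + eps) / (n_total + eps)        # P(Fj)
    return p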
Structured Database [Maybury, 1995]
• Summarize text represented in structured form: database, templates
– E.g., generation of a medical history from a database of medical ‘events’
$$\text{Relative frequency of } E = \frac{\#\text{ of occurrences of event } E}{\text{Total } \#\text{ of all events}}$$
• Link analysis (semantic relations within the structure)
• Domain-dependent importance of events
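Illustration: a tiny sketch of the relative-frequency statistic above, computed over a structured record; the event names are hypothetical.

# Relative frequency of an event type in a structured record
# (e.g. a medical-history database). Event names are made up.
from collections import Counter

events = ["admission", "x-ray", "x-ray", "surgery", "discharge"]
counts = Counter(events)
rel_freq = {e: n / len(events) for e, n in counts.items()}
print(rel_freq["x-ray"])  # 2 / 5 = 0.4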
Comparing Speech and Text Summarization
• Alike
– Identifying important information
– Some lexical, discourse features
– Extraction or generation or compression
• Different
– Speech signal
– Prosodic features
– NLP tools?
– Segments – sentences?
– Generation?
– Errors
– Data size
Text vs. Speech Summarization (NEWS)
[Comparison figure: text (news) summarization starts from error-free text with full lexical features, sentence segmentation, a consistent story presentation style, and NLP tools; speech (broadcast news) summarization starts from the speech signal over varied channels (phone, remote satellite, station), with manual, ASR, or closed-captioned transcripts, many speakers and speaking styles, only some lexical features, prosodic features (pitch, energy, duration), structure such as anchor/reporter interaction, and commercials and weather reports.]
Speech Summarization Today
• Mostly extractive:
– Words, sentences, content units
• Some compression methods
• Generation-based summarization difficult
– Text or synthesized speech?
Generation or Extraction?
SENT27 a trial that pits the cattle industry against tv talk show host oprah winfrey is under way in amarillo , texas.
SENT28 jury selection began in the defamation lawsuit began this morning .
SENT29 winfrey and a vegetarian activist are being sued over an exchange on her April 16, 1996 show .
SENT30 texas cattle producers claim the activists suggested americans could get mad cow disease from eating beef .
SENT31 and winfrey quipped , this has stopped me cold from eating another burger
SENT32 the plaintiffs say that hurt beef prices and they sued under a law banning false and disparaging statements about agricultural products
SENT33 what oprah has done is extremely smart and there's nothing wrong with it she has moved her show to amarillo texas , for a while
SENT34 people are lined up , trying to get tickets to her show so i'm not sure this hurts oprah .
SENT35 incidentally oprah tried to move it out of amarillo . she's failed and now she has brought her show to amarillo .
SENT36 the key is , can the jurors be fair
SENT37 when they're questioned by both sides, by the judge , they will be asked, can you be fair to both sides
SENT38 if they say , there's your jury panel
SENT39 oprah winfrey's lawyers had tried to move the case from amarillo , saying they couldn't get an impartial jury
SENT40 however, the judge moved against them in that matter …
[Christensen et al., 2004]
SPEECH SUMMARIZATION TECHNIQUES
[Overview figure: sentence extraction with similarity measures [Christensen et al., 2004]; word scoring with dependency structure [Hori C. et al., 1999, 2002]; weighted finite state transducers [Hori T. et al., 2003]; classification [Koumpis & Renals, 2004]; user access information [He et al., 1999]; removing disfluencies [Zechner, 2001]]
Content/Context sentence level extraction for
speech summary [Christensen et al., 2004]
• Find sentences similar to the lead topic sentences
• Use position features to find the relevant nearby sentences after detecting a topic sentence
$$\hat{s}_k = \arg\max_{s_i \in D \setminus E} \{\mathrm{Sim}(s_1, s_i)\}$$
$$\hat{s}_k = \arg\max_{s_i \in D \setminus E} \{\mathrm{Sim}(D, s_i)\}$$
– where Sim is a similarity measure between two sentences, or between a sentence and a document (D), and E is the set of sentences already in the summary
• Choose a new sentence which is most like D and most different from E
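Illustration (a sketch of the greedy selection loop implied by the argmax criteria above, not Christensen et al.'s exact procedure): repeatedly pick the sentence in D \ E that maximizes Sim, using something like the cosine measure sketched earlier.

# Greedy sentence selection: pick, from the sentences not yet in the summary,
# the one most similar to the whole document under a supplied sim() function.
def greedy_extract(sentences, sim, length):
    """sentences: list of sentence strings (the document D).
    sim(a, b): similarity between two texts. Returns `length` sentences."""
    document = " ".join(sentences)
    selected = []                   # E: sentences already in the summary
    remaining = list(sentences)     # D \ E
    while remaining and len(selected) < length:
        best = max(remaining, key=lambda s: sim(document, s))
        selected.append(best)
        remaining.remove(best)
    return selected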
Weighted finite state transducers for speech summarization [Hori T. et al., 2003]
• Summarization: speech recognition, paraphrasing, and sentence compaction integrated into a single Weighted Finite State Transducer
• Decoder can use all knowledge sources in a one-pass strategy
• Speech recognition using WFSTs:
$$R = H \circ C \circ L \circ G$$
– where H is the state network of triphone HMMs, C is the triphone connection rules, L is the pronunciation lexicon, and G is the trigram language model
• Paraphrasing can be looked at as a kind of machine translation with translation probability P(W|T), where W is the source language and T is the target language
• If S is the WFST representing the translation rules and D is the language model of the target language, speech summarization can be looked at as the following composition:
$$Z = H \circ C \circ L \circ G \circ S \circ D$$
[Figure: speech translation as a cascade of WFSTs – the speech recognizer H ∘ C ∘ L ∘ G composed with the translator S ∘ D]
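Illustration (a toy example of the ∘ composition operation, not the Hori et al. system and not an FST toolkit): composing two epsilon-free weighted transducers represented as arc lists; the arcs and symbols are invented.

# Toy composition of two epsilon-free weighted transducers, each given as a
# list of arcs (src_state, input_sym, output_sym, weight, dst_state).
# Weights are negative log probabilities, so they add under composition.
def compose(t1, t2):
    arcs = []
    for (p1, a, b, w1, q1) in t1:
        for (p2, b2, c, w2, q2) in t2:
            if b == b2:  # output of the first arc must match input of the second
                arcs.append(((p1, p2), a, c, w1 + w2, (q1, q2)))
    return arcs

# Invented example: a one-arc "recognizer" that outputs the word "uh" composed
# with a "compaction" transducer that deletes filled pauses.
recognizer = [(0, "frame", "uh", 1.0, 1)]
compaction = [(0, "uh", "<del>", 0.5, 1)]
print(compose(recognizer, compaction))
# [((0, 0), 'frame', '<del>', 1.5, (1, 1))]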
User Access Identifies What to Include
[He et al., 1999]
• Summarize lectures or shows by extracting parts that have been viewed the longest
• Needs multiple users of the same show, meeting, or lecture for training
• E.g., to summarize lectures, compute the time spent on each slide
• Summarizer based on user access logs did as well as summarizers that used linguistic and acoustic features
– Average score of 4.5 on a scale of 1 to 8 for the summarizer (subjective evaluation)
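Illustration (a minimal sketch of the user-access idea, not He et al.'s system; the log format is hypothetical): rank slides or segments by the total time viewers spent on them.

# Rank lecture slides (or show segments) by total viewing time across access
# logs. Hypothetical log entries: (user_id, slide_id, seconds_viewed).
from collections import defaultdict

def top_segments(access_log, n=5):
    time_on = defaultdict(float)
    for _user, slide, seconds in access_log:
        time_on[slide] += seconds
    return sorted(time_on, key=time_on.get, reverse=True)[:n]

log = [("u1", 3, 120.0), ("u2", 3, 90.0), ("u1", 7, 40.0), ("u2", 1, 15.0)]
print(top_segments(log, n=2))  # [3, 7]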
Word level extraction by scoring/classifying words
[Hori C. et al., 1999, 2002]
• Score each word in the sentence and extract a set of words to form a sentence whose total score is the product/sum of the scores of each word
• Example:
– Word significance score (topic words)
– Linguistic score (bigram probability)
– Confidence score (from ASR)
– Word concatenation score (dependency structure grammar)
$$S(V) = \sum_{m=1}^{M} \left\{ L(v_m \mid \dots v_{m-1}) + \lambda_I I(v_m) + \lambda_C C(v_m) + \lambda_T Tr(v_{m-1}, v_m) \right\}$$
where M is the number of words to be extracted, and λ_I, λ_C, λ_T are weighting factors for balancing among L, I, C, and Tr
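Illustration (in the spirit of the formula above, not Hori et al.'s implementation): score a candidate word sequence as a weighted sum of linguistic, significance, confidence, and concatenation scores; all four component functions are stand-ins supplied by the caller.

# Word-level sequence scoring: a weighted sum over the words kept in the
# compacted sentence. ling, sig, conf, concat are caller-supplied scorers.
def sequence_score(words, ling, sig, conf, concat,
                   w_sig=1.0, w_conf=1.0, w_concat=1.0):
    total = 0.0
    prev = "<s>"
    for w in words:
        total += (ling(prev, w)                    # linguistic score, e.g. log bigram prob
                  + w_sig * sig(w)                 # word significance (topic) score
                  + w_conf * conf(w)               # ASR confidence score
                  + w_concat * concat(prev, w))    # word concatenation score
        prev = w
    return total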
Segmentation Using Discourse Cues
[Maybury, 1998]
• Discourse Cue-Based Story Segmentation
• Discourse Cues in CNN
– Start and end of broadcast
– Anchor/Reporter handoff, Reporter/Anchor handoff
– Cataphoric Segment (“still ahead …”)
• Time-Enhanced Finite State Machine representing discourse states such as anchor segment, reporter segment, advertisement
• Other features: named entities, part of speech, discourse shifts (“>>” speaker change, “>>>” subject change)
Source           Precision (%)  Recall (%)
ABC              90             94
CNN              95             75
Jim Lehrer Show  77             52
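Illustration (a minimal sketch of cue-based segmentation, not Maybury's time-enhanced finite state machine): split a closed-caption transcript at the discourse cues listed above; the cue list here is illustrative.

# Segment a transcript at speaker-change (">>") and subject-change (">>>")
# markers and at a cataphoric cue phrase. Cue patterns are illustrative.
import re

CUE_PATTERNS = [r"^>>>", r"^>>", r"\bstill ahead\b"]

def segment(lines):
    segments, current = [], []
    for line in lines:
        if current and any(re.search(p, line.lower()) for p in CUE_PATTERNS):
            segments.append(current)
            current = []
        current.append(line)
    if current:
        segments.append(current)
    return segments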
CU: Summarization without Words: Does the importance of ‘what’ is said correlate with ‘how’ it is said?
• Hypothesis: “Speakers change their amplitude, pitch, and speaking rate to signify the importance of words, phrases, and sentences.”
– If so, then the labels for sentences predicted using acoustic features (A) should correlate with the labels predicted using lexical features (L)
– In fact, this seems to be true (correlation of .74 between the predictions of A and L)
Is It Possible to Build ‘good’ Automatic Speech Summarization Without Any Transcripts?
Feature Set   F-Measure   ROUGE-avg
L+S+A+D       0.54        0.80
L             0.49        0.70
S+A           0.49        0.68
A             0.47        0.63
Baseline      0.43        0.50
• Just using A+S without any lexical features, we get a 6% higher F-measure and an 18% higher ROUGE-avg than the baseline
Evaluation using ROUGE
• F-measure too strict
– Predicted summary sentences must match
summary sentences exactly
– What if content is similar but not identical?
• ROUGE(s)…
ROUGE metric
• Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
• ROUGE-N (where N = 1, 2, 3, 4 grams)
• ROUGE-L (longest common subsequence)
• ROUGE-S (skip bigram)
• ROUGE-SU (skip bigram, counting unigrams as well)
• Does ROUGE solve the problem?
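Illustration (a simplified sketch of ROUGE-N recall, not the official scorer): the fraction of reference-summary n-grams that also appear in the candidate summary, with clipped counts; tokenization here is naive.

# Simplified ROUGE-N recall over whitespace tokens.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())  # clipped matches
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n("oprah winfrey is being sued in texas",
              "talk show host oprah winfrey is sued in amarillo texas", n=1))  # 0.6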
Next Class
• Emotional speech
• HW 4 assigned