Recognizing Structure: Sentence and
Topic Segmentation
Julia Hirschberg
CS 4706
7/15/2016
Types of Discourse Structure in Spoken
Corpora
• Domain independent
– Sentence/utterance boundaries
– Speaker turn segmentation
– Topic/story segmentation
• Domain dependent
– Broadcast news
– Meetings
– Telephone conversations
Today
• Theoretical studies of discourse structure
• Text-based approaches and cues
• Speech-based approaches and cues
• Practical goals for speech segmentation and state-of-the-art
Hierarchical Structure in Discourse?
Welcome to word processing.
That’s using a computer to type letters and reports.
Make a typo?
No problem.
Just back up, type over the mistake, and it’s
gone.
And, it eliminates retyping.
Structures of Discourse Structure (Grosz & Sidner ’86)
• Leading theory of discourse structure
• Three components:
– Linguistic structure
• What is said
• Assumption: divided into appropriate units of analysis
(Discourse Segments)
– Intentional structure
• How are segments structured into topics, subtopics
• Relations: satisfaction-precedence, dominance
– Attentional structure
How are these relations recognized in a
discourse?
• Linguistic markers:
– tense and aspect
– cue phrases: now, well
• Inference of speaker (S) intentions
• Inference from task structure
• Intonational Information
Structural Information in Text
• Example: reviews of iPod nano
• Cue phrases: now, well, first
• Pronominal reference
• Orthography and formatting (in text)
• Lexical information (Hearst ’94, Reynar ’98, Beeferman et al ’99):
– Domain dependent
– Domain independent
Methods of Text Segmentation
• Lexical cohesion methods vs. multiple-source methods
– Vocabulary similarity indicates topic cohesion
• Intuition from Halliday & Hasan ’76
• Features to capture:
– Stem repetition
– Entity repetition
– Word frequency
– Context vectors
– Semantic similarity
– Word distance
• Methods:
– Sliding window
– Lexical chains
– Clustering
– Combine lexical cohesion with other cues
• Features
– Cue phrases
– Reference (e.g. pronouns)
– Syntactic features
• Methods
– Machine Learning from labeled corpora
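The sliding-window idea can be sketched as follows. This is a simplified illustration, not Hearst's TextTiling: it uses raw whitespace tokens instead of stems, a fixed window of sentences, and no depth scoring or smoothing. The function names (`gap_scores`, `lowest_gap`) and the toy sentences are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_scores(sentences, window=2):
    """Score each gap between sentences by the similarity of the windows on either side."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores

def lowest_gap(sentences, window=2):
    """Index of the sentence that follows the least cohesive gap."""
    scores = gap_scores(sentences, window)
    return 1 + min(range(len(scores)), key=scores.__getitem__)

sentences = [
    "the budget surplus is large",
    "candidates debate the budget surplus",
    "the surplus and the budget matter",
    "milosevic makes a power play",
    "a new power play by milosevic",
]
print(lowest_gap(sentences))  # sentence index where a new topic most plausibly starts
```

Low similarity across a gap suggests a vocabulary shift, which is the lexical cohesion intuition; a fuller system would add the other cues listed above.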
Choi 2000: An Example
• Implements leading methods and compares new
algorithm to them on corpus of 700
concatenated documents
• Comparison algorithms:
– Baselines:
• No boundaries
• All boundaries
• Regular partition
• Random # of random partitions
• Actual # of random partitions
– TextTiling algorithm (Hearst ’94)
– DotPlot algorithms (Reynar ’98)
– Segmenter (Kan et al ’98)
– Choi ’00 proposal
• Cosine similarity measure on stems
$$\mathrm{sim}(x, y) = \frac{\sum_j f_{x,j}\, f_{y,j}}{\sqrt{\sum_j f_{x,j}^2 \times \sum_j f_{y,j}^2}}$$
(f_{x,j}: frequency of stem j in sentence x)
• Same sentence: 1; no overlap 0
• Similarity matrix → rank matrix
– Replace each cell with # of lower-valued neighbor cells,
normalized by # of neighboring cells
» How likely is this sentence to be a boundary, compared to
other sentences?
» Minimize effect of outliers
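A minimal sketch of the two steps above, assuming whitespace tokens in place of stems and a small square neighbourhood (`radius=1`, i.e. a 3x3 mask) for the ranking. The mask size and the function names are assumptions for illustration, not details taken from the slides.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_matrix(sentences):
    """Pairwise sentence similarity; identical sentences score 1, no overlap scores 0."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [[cosine(x, y) for y in bags] for x in bags]

def rank_matrix(sim, radius=1):
    """Replace each cell by the fraction of neighbouring cells with a lower value."""
    n = len(sim)
    ranks = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lower = total = 0
            for a in range(max(0, i - radius), min(n, i + radius + 1)):
                for b in range(max(0, j - radius), min(n, j + radius + 1)):
                    if (a, b) == (i, j):
                        continue  # a cell is not its own neighbour
                    total += 1
                    lower += sim[a][b] < sim[i][j]
            ranks[i][j] = lower / total if total else 0.0
    return ranks

sim = similarity_matrix(["a b", "a c", "d e"])
ranks = rank_matrix(sim)
```

Because each cell is compared only with its neighbours, the rank values depend on relative order rather than on absolute similarity, which is what limits the effect of outliers.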
• Divisive clustering based on inside density
– D(n) = sum of rank values s_{i,j} inside the segments / inside area a_{i,j} = (j-i+1)^2 of the segments, for a segmentation into n segments, with i, j the sentences at the beginning and end of a segment
– Homogeneous segments have large rank values within a small area of the matrix
• Keep dividing the corpus
– until δD(n) = D(n) - D(n-1) shows little change
• Choi’s algorithm performs best (9-12% error)
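The divisive step might look like the sketch below: a greedy loop that adds the split giving the highest inside density D and stops when the gain δD falls under a fixed threshold. The fixed `min_gain` cutoff is an assumption standing in for "shows little change", and the helper names are hypothetical.

```python
def prefix_sums(rank):
    """2-D prefix sums so any square block of the rank matrix sums in O(1)."""
    n = len(rank)
    p = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            p[i + 1][j + 1] = rank[i][j] + p[i][j + 1] + p[i + 1][j] - p[i][j]
    return p

def block_sum(p, i, j):
    """Sum of rank values in the square block covering sentences i..j inclusive."""
    return p[j + 1][j + 1] - p[i][j + 1] - p[j + 1][i] + p[i][i]

def density(p, starts, n):
    """Inside density D: rank mass inside all segments over their total area."""
    ends = starts[1:] + [n]
    mass = sum(block_sum(p, i, j - 1) for i, j in zip(starts, ends))
    area = sum((j - i) ** 2 for i, j in zip(starts, ends))
    return mass / area

def divisive_segment(rank, min_gain=0.01):
    """Greedily add boundaries while D(n) - D(n-1) is at least min_gain."""
    n = len(rank)
    p = prefix_sums(rank)
    starts = [0]  # each entry is the first sentence of a segment
    current = density(p, starts, n)
    while len(starts) < n:
        candidates = [sorted(starts + [c]) for c in range(1, n) if c not in starts]
        best = max(candidates, key=lambda s: density(p, s, n))
        gain = density(p, best, n) - current
        if gain < min_gain:
            break
        starts, current = best, current + gain
    return starts

# Toy rank matrix with two clearly homogeneous 2x2 blocks.
toy = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
print(divisive_segment(toy))
```

On the toy matrix the first split raises D from 0.5 to 1.0 and no further split helps, so the loop stops with two segments.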
Acoustic and Prosodic Cues to
Segmentation
• Intuition:
– Speakers vary acoustic and prosodic cues to convey
variation in discourse structure
– Systematic? In read or spontaneous speech?
• Evidence:
– Observations from recorded corpora
– Laboratory experiments
– Machine learning of discourse structure from
acoustic/prosodic features
Spoken Cues to Discourse/Topic Structure
• Pitch range
Lehiste ’75, Brown et al ’83, Silverman ’86, Avesani &
Vayra ’88, Ayers ’92, Swerts et al ’92, Grosz &
Hirschberg ’92, Swerts & Ostendorf ’95, Hirschberg &
Nakatani ’96
• Preceding pause
Lehiste ’79, Chafe ’80, Brown et al ’83, Silverman ’86,
Woodbury ’87, Avesani & Vayra ’88, Grosz &
Hirschberg ’92, Passonneau & Litman ’93, Hirschberg &
Nakatani ’96
• Rate
Butterworth ’75, Lehiste ’80, Grosz & Hirschberg ’92,
Hirschberg & Nakatani ’96
• Amplitude
Brown et al ’83, Grosz & Hirschberg ’92, Hirschberg &
Nakatani ’96
• Contour
Brown et al ’83, Woodbury ’87, Swerts et al ’92
A Practical Problem: Finding Sentence and
Topic/Story Boundaries in ASR Transcripts
• Motivation:
– Finding sentences critical to further syntactic/semantic analysis
– Topic/story identification is important for identifying common regions for Q/A and extraction
• Features:
– Lexical cues
• Domain dependent
• Sensitive to ASR performance
– Acoustic/prosodic cues
• Domain independent
• Sensitive to speaker identity
• Statistical, Machine Learning approaches with large
segmented corpora (e.g. Broadcast News)
ASR Transcription
• aides tonight in boston in depth the truth squad for special series until
election day tonight the truth about the budget surplus of the candidates are
promising the two international flash points getting worse while the middle
east and a new power play by milosevic and a lifelong a family tries to say
one child life by having another amazing breakthrough the u s was was told
local own boss good evening uh from the university of massachusetts in
boston the site of the widely anticipated first of eight between vice president
al gore and governor george w bush with the election now just five weeks
away this is the beginning of a sprint to the finish and a strong start here
tonight is important this is the stage for the two candidates will appear
before a national television audience taking questions from jim lehrer of p b
s n b c’s david gregory is here with governor bush claire shipman is
covering the vice president claire you begin tonight please
Speaker segmentation (Diarization)
• Speaker: 0 - aides tonight in boston in depth the truth squad for special series until
election day tonight the truth about the budget surplus of the candidates are
promising the two international flash points getting worse while the middle east and a
new power play by milosevic and a lifelong a family tries to say one child life by
having another amazing breakthrough the u s was was told local own boss good
evening uh from the university of massachusetts in boston
• Speaker: 1 - the site of the widely anticipated first of eight between vice president al
gore and governor george w bush with the election now just five weeks away this is
the beginning of a sprint to the finish and a strong start here tonight is important this
is the stage for the two candidates will appear before a national television audience
taking questions from jim lehrer of p b s n b c’s david gregory is here with governor
bush claire shipman is covering the vice president claire you begin tonight please
Sentence detection, punctuation
• Speaker: Anchor - Aides tonight in boston. In depth the truth squad for
special series until election day. Tonight the truth about the budget surplus
of the candidates are promising. The two international flash points getting
worse. While the middle east. And a new power play by milosevic and a
lifelong a family tries to say one child life by having another amazing
breakthrough the u. s. was was told local own boss. Good evening uh from
the university of massachusetts in boston.
• Speaker: Reporter - The site of the widely anticipated first of eight between
vice president al gore and governor george w. bush. With the election now
just five weeks away. This is the beginning of a sprint to the finish. And a
strong start here tonight is important. This is the stage for the two
candidates will appear before a national television audience taking
questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with
governor bush. Claire shipman is covering the vice president claire you
begin tonight please.
Story boundary detection
• Speaker: Anchor - Aides tonight in boston. In depth the truth squad for
special series until election day. Tonight the truth about the budget surplus
of the candidates are promising. The two international flash points getting
worse. While the middle east. And a new power play by milosevic and a
lifelong a family tries to say one child life by having another amazing
breakthrough the u. s. was was told local own boss. Good evening uh from
the university of massachusetts in boston.
• Speaker: Reporter - The site of the widely anticipated first of eight between
vice president al gore and governor george w. bush. With the election now
just five weeks away. This is the beginning of a sprint to the finish. And a
strong start here tonight is important. This is the stage for the two
candidates will appear before a national television audience taking
questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with
governor bush. Claire shipman is covering the vice president claire you
begin tonight please.
Prosodic Cues (Shriberg et al ’00)
• Text-based segmentation is fine…if you have
reliable text
• Could prosodic cues perform as well or better at
sentence and topic segmentation in ASR
transcripts? – more robust? – more general?
• Goal: identify sentence and topic boundaries at
ASR-defined word boundaries
– CART decision trees and LM
– HMM combined prosodic and LM results
• Features, for each potential boundary location:
• Pause at boundary (raw and normalized by speaker)
• Pause at word before boundary (is this a new ‘turn’ or part of
continuous speech segment?)
• Phone and rhyme duration (normalized by inherent duration)
(phrase-final lengthening?)
• F0 (smoothed and stylized): reset, range (topline, baseline),
slope and continuity
• Voice quality (halving/doubling estimates as correlates of
creak or glottalization)
• Speaker change, time from start of turn, # turns in
conversation and gender
– Trained/tested on Switchboard and Broadcast News
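As a rough illustration of the pause and turn features, the sketch below derives them from word-level ASR timings. The `Word` record, the feature names, and normalizing a pause by the speaker's mean pause are all assumptions; the slides do not say how the speaker normalization was done.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds
    speaker: str

def pause_features(words):
    """One feature dict per candidate boundary (the gap after each non-final word)."""
    raw = [max(0.0, nxt.start - cur.end) for cur, nxt in zip(words, words[1:])]
    by_speaker = defaultdict(list)
    for w, p in zip(words, raw):
        by_speaker[w.speaker].append(p)
    mean = {s: sum(ps) / len(ps) for s, ps in by_speaker.items()}
    feats = []
    for k, (w, p) in enumerate(zip(words, raw)):
        feats.append({
            "after_word": w.text,
            "pause": p,  # raw pause at the boundary
            # Assumed normalization: ratio to this speaker's mean pause.
            "pause_norm": p / mean[w.speaker] if mean[w.speaker] else 0.0,
            "prev_pause": raw[k - 1] if k else 0.0,  # pause at the preceding word
            "speaker_change": w.speaker != words[k + 1].speaker,
        })
    return feats

words = [
    Word("good", 0.0, 0.3, "A"),
    Word("evening", 0.3, 0.8, "A"),
    Word("the", 1.6, 1.7, "B"),
    Word("site", 1.7, 2.0, "B"),
]
for f in pause_features(words):
    print(f)
```

Feature dicts like these could then be fed to a decision tree; duration, F0, and voice-quality features would need the audio signal and are left out here.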
Sentence segmentation results
– Prosody-only model
• Better than LM for Broadcast News (BN)
• On Switchboard (SB): worse than LM on hand transcriptions, same on ASR transcripts
• Slightly improves LM on SB
– Useful features for BN
• Pause at boundary, turn change/no turn change, f0 diff
across boundary, rhyme duration
– Useful features for SB
• Phone/rhyme duration before boundary, pause at boundary,
turn/no turn, pause at preceding word boundary, time in turn
Topic segmentation results (BN only):
– Useful features
• Pause at boundary, f0 range, turn/no turn, gender, time in
turn
– Prosody alone better than LM
– Combined model improves significantly
Recent Work on Story Segmentation (Rosenberg et al ’07)
• Story Segmentation goal: Divide each show into
homogeneous regions, each about a single topic
– Task: focused Q/A
– Issue: what unit of analysis should we use in
assessing potential boundaries?
TDT-4 Corpus
• English: 312.5 hours, 250 broadcasts, 6 shows
• Arabic: 88.5 hours, 109 broadcasts, 2 shows
• Mandarin: 109 hours, 134 broadcasts, 3 shows
• Manually annotated story boundaries
• ASR Hypotheses
• Speaker Diarization Hypotheses
Approach
• Identify set of segments which define:
– Unit of analysis
– Candidate boundaries
• Classify each candidate boundary based on
features extracted from segments
– C4.5 Decision Tree
– Model each show-type separately
• E.g. CNN “Headline News” and ABC “World News Tonight”
have distinct models
– Evaluate using WindowDiff with k=100
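For reference, WindowDiff slides a window of k units over the reference and hypothesized segmentations and counts the windows in which the two disagree on the number of boundaries. The sketch below assumes each segmentation is a list of 0/1 boundary indicators, one per unit, and normalizes by N - k; published implementations differ slightly in these conventions.

```python
def window_diff(reference, hypothesis, k):
    """Fraction of length-k windows whose boundary counts differ (0 is perfect)."""
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have the same length")
    n = len(reference)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(reference)")
    errors = 0
    for i in range(n - k):
        ref_count = sum(reference[i:i + k])
        hyp_count = sum(hypothesis[i:i + k])
        errors += ref_count != hyp_count
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 0, 1, 0]
hyp = [0, 0, 0, 1, 0, 0, 1, 0]  # first boundary placed one unit late
print(window_diff(ref, hyp, k=2))
```

With k=100 as above, a boundary that is off by a few words is penalized only in the few windows where the counts differ, so near misses cost much less than missed or spurious boundaries.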
Segment Boundary Modeling Features
• Acoustic
– Pitch & Intensity
• speaker normalized
• min, mean, max, stdev, slope
– Speaking Rate
• vowels/sec, voiced frames/sec
– Final Vowel, Rhyme Length
– Pause Length
• Lexical
– TextTiling scores
– LCSeg scores
– Story beginning and ending keywords
• Structural
– Position in show
– Speaker participation
– First or last speaker turn?
Input Segmentations
• ASR Word boundaries
– No segmentation baseline
• Hypothesized Sentence Units
– Boundaries with 0.5, 0.3 and 0.1 confidence
thresholds
• Pause-based Segmentation
– Boundaries at pauses over 500ms and 250ms
• Hypothesized Intonational Phrases
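The pause-based input segmentation is the simplest of these to illustrate. The sketch assumes ASR output as `(word, start_sec, end_sec)` tuples, a hypothetical format chosen for the example, and places a candidate boundary wherever the silence between consecutive words exceeds the threshold.

```python
def pause_boundaries(words, threshold):
    """Indices of words that begin a new segment after a pause longer than threshold."""
    return [
        k + 1
        for k, (cur, nxt) in enumerate(zip(words, words[1:]))
        if nxt[1] - cur[2] > threshold
    ]

words = [
    ("good", 0.0, 0.3),
    ("evening", 0.35, 0.8),
    ("from", 1.1, 1.3),
    ("boston", 1.9, 2.4),
]
print(pause_boundaries(words, 0.5))   # 500 ms threshold
print(pause_boundaries(words, 0.25))  # 250 ms threshold
```

Lowering the threshold yields more, shorter segments, and therefore more candidate boundaries for the classifier to consider.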
Hypothesizing Intonational Phrases
• ~30 minutes of manually annotated ASR BN from a
reserved TDT-4 CNN show.
– Phrase was marked if a phrase boundary occurred
since the previous word boundary.
• C4.5 Decision Tree
• Pitch, Energy and Duration Features
– Normalized by hypothesized speaker id and
surrounding context
• 66.5% F-Measure (p=.683, r=.647)
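Assuming the reported F-measure is the balanced harmonic mean of precision and recall (F1), the figures above are consistent:

```latex
F = \frac{2pr}{p + r}
  = \frac{2 \times 0.683 \times 0.647}{0.683 + 0.647}
  = \frac{0.8838}{1.330}
  \approx 0.665
```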
Story Segmentation Results
Input Segmentation Statistics
[Table columns: Exact Story Boundary Coverage (pct.) | Mean Distance to Nearest True Boundary (words) | Segment to True Boundary Ratio]
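One plausible reading of these three statistics is sketched below, with boundaries given as word offsets. The definitions are assumptions inferred from the column names: coverage is the share of true boundaries that coincide exactly with a candidate, distance is measured from each true boundary to its nearest candidate, and the ratio is candidates per true boundary.

```python
import bisect

def nearest_distance(sorted_cands, t):
    """Distance in words from true boundary t to the closest candidate boundary."""
    pos = bisect.bisect_left(sorted_cands, t)
    neighbours = sorted_cands[max(0, pos - 1):pos + 1]
    return min(abs(t - c) for c in neighbours)

def segmentation_stats(candidates, true_boundaries):
    """Summarize how well an input segmentation covers the true story boundaries."""
    cands = sorted(set(candidates))
    cand_set = set(cands)
    exact = sum(t in cand_set for t in true_boundaries)
    distances = [nearest_distance(cands, t) for t in true_boundaries]
    return {
        "exact_coverage_pct": 100.0 * exact / len(true_boundaries),
        "mean_distance_words": sum(distances) / len(distances),
        "segment_to_true_ratio": len(cands) / len(true_boundaries),
    }

stats = segmentation_stats(candidates=[10, 25, 40, 70], true_boundaries=[25, 68])
print(stats)
```

A finer input segmentation raises coverage and lowers the mean distance, but it also raises the ratio, leaving the classifier with many more non-boundary candidates to reject.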
Conclusions and Future Work
• Best Performance:
– Low threshold (0.1) sentences
– Short pause (250ms) segmentation
– Hyp. IPs perform better than sentences.
– Would increased SU, IP accuracy improve story
segmentation?
• External evaluation: impact on IR and MT
performance.
Next Class
• Spoken Dialogue Systems