Cross-Language Access to Recorded Speech in the MALACH Project

Multilingual Access to
Large Spoken Archives
Douglas W. Oard
University of Maryland, College Park, MD, USA
MALACH Project’s Goal
Dramatically improve access to
large multilingual spoken word
collections
… by capitalizing on the unique
characteristics of the Survivors
of the Shoah Visual History
Foundation's collection of
videotaped oral history
interviews.
Spoken Word Collections
• Broadcast programming
– News, interview, talk radio, sports, entertainment
• Scripted stories
– Books on tape, poetry reading, theater
• Spontaneous storytelling
– Oral history, folklore
• Incidental recording
– Speeches, oral arguments, meetings, phone calls
Some Statistics
• 2,000 U.S. radio stations webcasting
• 250,000 hours of oral history in British Library
• 35 million audio streams indexed by SingingFish
– Over 1 million searches per day
• ~100 billion hours of phone calls each year
Economics of the Web in 1995
• Affordable storage
– 300,000 words/$
• Adequate backbone capacity
– 25,000 simultaneous transfers
• Adequate “last mile” bandwidth
– 1 second/screen
• Display capability
– 10% of US population
• Effective search capabilities
– Lycos, Yahoo
Spoken Word Collections Today
• Affordable storage
– 300,000 words/$
• Adequate backbone capacity
– 25,000 simultaneous transfers
• Adequate “last mile” bandwidth
– 1 second/screen
• Display capability
– 10% of US population
• Effective search capabilities
– Lycos, Yahoo
Updated values overlaid on the slide: 1.5 million words/$; 30 million; 20% of capacity; 38% recent use
Research Issues
• Acquisition
• Segmentation
• Description
• Synchronization
• Rights management
• Preservation
MALACH
Description Strategies
• Transcription
– Manual transcription (with optional post-editing)
• Annotation
– Manually assign descriptors to points in a recording
– Recommender systems (ratings, link analysis, …)
• Associated materials
– Interviewer’s notes, speech scripts, producer’s logs
• Automatic
– Create access points with automatic speech processing
Key Results from TREC/TDT
• Recognition and retrieval can be decomposed
– Word recognition/retrieval works well in English
• Retrieval is robust with recognition errors
– Up to 40% word error rate is tolerable
• Retrieval is robust with segmentation errors
– Vocabulary shift/pauses provide strong cues
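The 40% figure refers to word error rate (WER). As a reminder of how that metric is usually computed, here is a minimal sketch: WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name and the sample sentences are illustrative and do not come from the TREC/TDT evaluations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion in seven words.
wer = word_error_rate(
    "we lived in berlin until the war",
    "we live in berlin until war",
)
```

Because insertions count as errors, WER can exceed 100% when the recognizer outputs many more words than the reference contains.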
Supporting Information Access
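[Diagram: information access process with the stages Source Selection, Query Formulation, Search (performed by the Search System), Selection, Examination, and Delivery; the items passed between stages are labeled Query, Ranked List, and Recording; feedback loops are labeled Query Reformulation and Relevance Feedback, and Source Reselection]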
Broadcast News Retrieval Study
• NPR Online
– Manually prepared transcripts
– Human cataloging
• SpeechBot
– Automatic Speech Recognition
– Automatic indexing
NPR Online
SpeechBot
Study Design
• Seminar on visual and sound materials
– Recruited 5 students
• After training, we provided 2 topics
– 3 searched NPR Online, 2 searched SpeechBot
• All then tried both systems with a 3rd topic
– Each choosing their own topic
• Rich data collection
– Observation, think aloud, semi-structured interview
• Model-guided inductive analysis
– Coded to the model with QSR NVivo
Criterion-Attribute Framework
Relevance criteria and their associated attributes:
• Topicality
– NPR Online: story title, brief summary, audio, detailed summary, speaker name
– SpeechBot: detailed summary, brief summary, audio, highlighted terms
• Story Type
– NPR Online: audio, detailed summary, short summary, story title, program title
– SpeechBot: audio, program title
• Authority
– NPR Online: speaker name, speaker’s affiliation
– SpeechBot: none listed
Some Useful Insights
• Recognition errors may not bother the
system, but they do bother the user!
• Segment-level indexing can be useful
Shoah Foundation’s Collection
• Enormous scale
– 116,000 hours; 52,000 interviews; 180 TB
• Grand challenges
– 32 languages, accents, elderly, emotional, …
• Accessible
– $100 million collection and digitization investment
• Annotated
– 10,000 hours (~200,000 segments) fully described
• Users
– A department working full time on dissemination
Example Video
Existing Annotations
• 72 million untranscribed words
– From ~4,000 speakers
• Interview-level ground truth
– Pre-interview questionnaire (names, locations, …)
– Free-text summary
• Segment-level ground truth
– Topic boundaries: average ~3 min/segment
– Labels: Names, topic, locations, year(s)
– Descriptions: summary + cataloguer’s scratchpad
Annotated Data Example
Rows in order of interview time:
Location-Time   Subject                           Person
Berlin-1939     Employment                        Josef Stein
Berlin-1939     Family life                       Gretchen Stein, Anna Stein
Dresden-1939    Relocation, Transportation-rail
Dresden-1939    Schooling                         Gunter Wendt, Maria
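To make the annotation scheme concrete, the example above could be held as a list of segment records, each carrying a location-time, subjects, and persons. This is only a sketch: the class and field names are hypothetical, and the grouping of labels into four segments is an assumption based on the order in which they appear on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One annotated stretch of an interview (hypothetical structure)."""
    location_time: str
    subjects: list = field(default_factory=list)
    persons: list = field(default_factory=list)


# Segments listed in interview-time order; grouping assumed from the slide.
segments = [
    Segment("Berlin-1939", ["Employment"], ["Josef Stein"]),
    Segment("Berlin-1939", ["Family life"], ["Gretchen Stein", "Anna Stein"]),
    Segment("Dresden-1939", ["Relocation", "Transportation-rail"]),
    Segment("Dresden-1939", ["Schooling"], ["Gunter Wendt", "Maria"]),
]


def segments_mentioning(person: str) -> list:
    """Return the indexes of segments whose Person labels include `person`."""
    return [i for i, seg in enumerate(segments) if person in seg.persons]


anna_segments = segments_mentioning("Anna Stein")
```

Keeping the three label types as separate fields mirrors the three tracks on the slide and lets a search interface filter on any one of them.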
MALACH Overview
[Diagram: processing stages Speech Recognition, Boundary Detection, Content Tagging, Query Formulation, Automatic Search, and Interactive Selection, surrounded by four research areas:
ASR: spontaneous, accented, language switching
NLP Components: multi-scale segmentation, multilingual classification, entity normalization
User Needs: observational studies, formative evaluation, summative evaluation
Prototype: evidence integration, translingual search, spatial/temporal]
MALACH Overview
[Pipeline diagram, highlighting ASR: spontaneous, accented, and language-switching speech]
ASR Research Focus
• Accuracy
– Spontaneous speech
– Accented/multilingual/emotional/elderly
– Application-specific loss functions
• Affordability
– Minimal transcription
– Replicable process
Application-Tuned ASR
• Acoustic model
– Transcribe short segments from many speakers
– Unsupervised adaptation
• Language model
– Transcribed segments
– Interpolation
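Interpolation here presumably means mixing a language model estimated from the transcribed interview segments with a model trained on other text. A minimal sketch of linear interpolation follows; the unigram models, the weight `lam`, and the sample probabilities are assumptions for illustration, not the project's actual models.

```python
def interpolate(p_in_domain: dict, p_background: dict, lam: float) -> dict:
    """Linear interpolation: lam * in-domain + (1 - lam) * background."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    vocab = set(p_in_domain) | set(p_background)
    return {
        word: lam * p_in_domain.get(word, 0.0)
        + (1.0 - lam) * p_background.get(word, 0.0)
        for word in vocab
    }


# Hypothetical unigram probabilities; each model sums to 1.
in_domain = {"ghetto": 0.5, "train": 0.5}
background = {"train": 0.25, "market": 0.75}
mixed = interpolate(in_domain, background, lam=0.5)
```

Since both inputs are probability distributions and the weights sum to one, the mixture is also a distribution; in practice the weight would be tuned on held-out transcripts.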
ASR Game Plan
Language   Hours Transcribed   Word Error Rate
English    200                 39.6%
Czech      84                  39.4%
Russian    20 (of 100)         66.6%
Polish
Slovak
As of May 2003
English Transcription Time
~2,000 hours to manually transcribe 200 hours from 800 speakers
[Histogram: x-axis, hours to transcribe 15 minutes of speech; y-axis, instances (N=830)]
English ASR Error Rate
[Chart: word error rate (0 to 100) by date, Feb 2002 through Feb 2003]
Training: 65 hours (acoustic model)/200 hours (language model)
MALACH Overview
[Pipeline diagram, highlighting User Needs: observational studies, formative evaluation, summative evaluation]
Who Uses the Collection?
Discipline
• History
• Linguistics
• Journalism
• Material culture
• Education
• Psychology
• Political science
• Law enforcement
Products
• Book
• Documentary film
• Research paper
• CDROM
• Study guide
• Obituary
• Evidence
• Personal use
Based on analysis of 280 access requests
Question Types
• Content
– Person, organization
– Place, type of place (e.g., camp, ghetto)
– Time, time period
– Event, subject
• Mode of expression
– Language
– Displayed artifacts (photographs, objects, …)
– Affective reaction (e.g., vivid, moving, …)
• Age appropriateness
Observational Studies
Workshop 1 (June)
• Four searchers
– History/Political Science
– Holocaust studies
– Holocaust studies
– Documentary filmmaker
• Sequential observation
• Rich data collection
– Intermediary interaction
– Semi-structured interviews
– Observational notes
– Think-aloud
– Screen capture
Workshop 2 (August)
• Four searchers
– Ethnography
– German Studies
– Sociology
– High school teacher
• Simultaneous observation
• Opportunistic data collection
– Intermediary interaction
– Semi-structured interviews
– Observational notes
– Focus group discussions
Segment Viewer
Observed Selection Criteria
• Topicality (57%)
Judged based on: Person, place, …
• Accessibility (23%)
Judged based on: Time to load video
• Comprehensibility (14%)
Judged based on: Language, speaking style
References to Named Entities
Attributes                 Mentions
                           Selection   Reformulation
Person (N=138)
  Gender                   1           22
  Country of birth         1           15
  Nationality              0           13
  Date of birth            1           11
  Status, interviewee      0           12
  Status, parents          1           11
Place (N=116)
  Camp                     10          45
  Country                  8           16
  Ghetto                   7           12
Functionality
Needed Function
• Boolean Search and Ranked Retrieval (13)
• Testimony summary (12)
• Pre-Interview Questionnaire search/viewer (9)
• Rapid access (7)
• Related/Alternative search terms (3)
• Adding multiple search terms at once (2)
• Keywords linked to segment number for easy access (1)
• Multi-tasking (1)
• Searching testimonies by places under ‘Experience Search’ (1)
• Extensive editing within ‘My Project’ (1)
Desired Function
• Temporary saving of selected testimonies (4)
• Remote access (3)
• Integrated user tools for note taking (3)
• Map presentation (2)
• Reference tool (1)
• More repositories (1)
• Introductory video of system tutorial (1)
• Help (1)
MALACH Overview
[Pipeline diagram, highlighting NLP Components: multi-scale segmentation, multilingual classification, entity normalization]
Topic Segmentation
“True” segmentation: transcripts aligned with scratchpad-based boundaries
[Diagram labels: scratchpad, cataloguer, transcript]
           Hours    Words       Sentences   Segments
Training   177.5    1,555,914   210,497     2,856
Test       7.5      58,913      7,427       168
Effect of ASR Errors
[Figure: system output boundaries compared with true boundaries, marking misses and false alarms]
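One simple way to score the comparison shown on this slide is to count true boundaries with no nearby system boundary (misses) and system boundaries with no nearby true boundary (false alarms). The sketch below assumes boundaries are given as times in seconds and uses a hypothetical tolerance window; it is not the project's evaluation code.

```python
def boundary_errors(true_bounds, system_bounds, tolerance=15.0):
    """Return (misses, false_alarms) as counts.

    A true boundary is missed if no system boundary lies within `tolerance`
    seconds of it; a system boundary is a false alarm if no true boundary
    lies within `tolerance` seconds of it.
    """
    misses = sum(
        1
        for t in true_bounds
        if not any(abs(t - s) <= tolerance for s in system_bounds)
    )
    false_alarms = sum(
        1
        for s in system_bounds
        if not any(abs(s - t) <= tolerance for t in true_bounds)
    )
    return misses, false_alarms


# Hypothetical boundary times (seconds from the start of the interview).
true_bounds = [180.0, 360.0, 540.0]
system_bounds = [175.0, 300.0, 545.0]
misses, false_alarms = boundary_errors(true_bounds, system_bounds)
```

Here the boundary at 360 s has no system boundary within 15 s, and the system boundary at 300 s has no true boundary within 15 s, so the example yields one miss and one false alarm.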
Rethinking the Problem
• Segment-then-label models planned speech well
– Producers assemble stories to create programs
– Stories typically have a dominant theme
• The structure of natural speech is different
– Creation: digressions, asides, clarification, …
– Use: intended use may affect desired granularity
• Documentary film: brief snippet to illustrate a point
• Classroom teacher: longer self-contextualizing story
OntoLog: Labeling Unplanned Speech
• Manually assigned labels; start and end at any time
– Ontology-based aggregation helps manage complexity
Goal
Use available data to estimate the
temporal extent of labels in a way
that optimizes the utility of the
resulting estimates for interactive
searching and browsing
Multi-Scale Segmentation
[Figure: labels plotted against time]
Characteristics of the Problem
• Clear sequential dependencies
– Living in Dresden negates living in Berlin
• Heuristic basis for class models
– Persons, based on type of relationship
– Date/Time, based on part-whole relationship
– Topics, based on a defined hierarchy
• Heuristic basis for guessing without training
– Text similarity between labels and spoken words
• Heuristic basis for smoothing
– Sub-sentence retrieval granularity is unlikely
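The "guessing without training" heuristic could be sketched as follows: score each sentence against a label by word overlap, and take the best-scoring sentence as a rough guess for where the label applies. Jaccard overlap, the tokenizer, and the sample sentences are assumptions chosen for brevity; the slides do not say which similarity measure was used.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_sentence(label: str, sentences: list) -> int:
    """Return the index of the sentence most similar to the label."""
    label_tokens = tokens(label)
    scores = [jaccard(label_tokens, tokens(s)) for s in sentences]
    return max(range(len(sentences)), key=scores.__getitem__)


# Hypothetical transcript sentences, not taken from the collection.
sentences = [
    "My father worked in a shop in Berlin.",
    "In 1939 we took the train to Dresden.",
    "I went to school there with Gunter.",
]
guess = best_sentence("Dresden-1939", sentences)
```

Working at the sentence level is consistent with the smoothing heuristic above, which treats sub-sentence granularity as unlikely to be useful.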
Manually Assigned Onset Marks
[Figure: onset marks along interview time in three tracks
Location-Time: Berlin-1939, Dresden-1939
Subject: Employment, Family Life, Relocation, Transportation-rail, Schooling
Person: Josef Stein, Gretchen Stein, Anna Stein, Gunter Wendt, Maria]
Some Additional Results
• Named entity recognition
– F > 0.8 (on manual transcripts)
• Cross-language ranked retrieval (on news)
– Czech/English similar to other language pairs
Looking Forward: 2003
• Component development
– ASR, segmentation, classification, retrieval
• Ranked retrieval test collection
– 1,000 hours of English recognition
– 25 judged topics in English and Czech
• Interactive retrieval
– Integrating free text and thesaurus-based search
Relevance Categories
• Overall relevance
Assessment is informed by the assessments for the
individual reasons for relevance (categories of relevance),
but the relationship is not straightforward
• Provides direct evidence
• Provides indirect / circumstantial evidence
• Provides context
(e.g., causes for the phenomenon of interest)
• Provides comparison (similarity or contrast, same
phenomenon in different environment, similar phenomenon)
• Provides pointer to source of information
Scale for overall relevance
Strictly from the point of view of finding out about the topic,
how useful is this segment for the requester?
This judgment is made independently of whether another
segment (or 25 other segments) give the same information.
4 Makes an important contribution to the topic, right on target
3 Makes an important contribution to the topic
2 Should be looked at for an exhaustive treatment of the topic
1 Should be looked at if the user wants to leave no stone
unturned
0 No need to look at this at all
Direct relevance
Direct evidence for what the user asks for
Directly on topic, direct aboutness. The information describes the events
or circumstances asked for or otherwise speaks directly to what the user is
looking for. First-hand accounts are preferred, e.g., the testimony contains
a report on the interviewee's own experience, or an eye-witness account
on what happened, or self-report on how a survivor felt. Second-hand
accounts (hearsay) are acceptable, such as a report on what an
eyewitness told the interviewee or a report on how somebody else felt.
Direct Evidence: Evidence that stands on its own to prove an alleged
fact, such as testimony of a witness who says she saw a defendant
pointing a gun at a victim during a robbery. Direct proof of a fact, such as
testimony by a witness about what that witness personally saw or heard or
did. ('Lectric Law Library's Lexicon)
Indirect relevance
Provides indirect evidence on the topic, indirect aboutness (data from
which one could infer, with some probability, something about the topic,
what in law is known as circumstantial evidence). Such evidence often
deals with events or circumstances that could not have happened or would
not normally have happened unless the event or circumstance of interest
(to be proven) has happened. It may also deal with events or
circumstances that precede the events or circumstances of interest, either
enabling them (establishing their possibility) or establishing their
impossibility. This category takes precedence over context. One could
say that whatever provides indirect evidence also provides context (but the
reverse is not true).
Circumstances, Circumstantial Evidence: Circumstantial evidence is
best explained by saying what it is not - it is not direct evidence from a
witness who saw or heard something. Circumstantial evidence is a fact
that can be used to infer another fact.
Context
Provides background / context for topic,
sheds additional light on a topic,
facilitates understanding that some piece of information is directly on topic.
So this category covers a variety of things. Things that influence, set the
stage, or provide the environment for what the user asks for. (To take the
law analogy again: anything in the history of a person who has committed
a crime that might explain why he committed it.)
Includes
support for or hindrance of an activity that is the topic of the query and
activities or circumstances that immediately follow on the activity or
circumstance of interest.
In a way, this category is broader than indirect. If a context element can
serve as indirect evidence, indirect takes precedence.
Comparison
Provides information on similar / parallel situations or on a
contrasting situation for comparison
The basic theme of what the user is interested in, but played
out in a different place or time or type of situation.
Comparable segments will be those segments that provide
information either on similar/parallel topics, or on contrasting
topics. This type of relevance relationship identifies items that
can aid understanding of the larger framework, perhaps
contributing to identification of query terms or revision of
search strategies. An example would be a segment in which
an interviewee describes activities like activities described in a
topic description, but which occurred at a different place or
time than the topic description.
Pointer
Provides pointers to a source of more information. This could
be a person, group, another segment, etc.
• Pointers will be segments that provide suggestions or explicit
evidence of where to find more relevant information. An
example of a pointer segment would be one in which an
interviewee identifies another interviewee who had personal
experiences directly associated with the topic. The value of
these segments is in identifying other relevant segments,
particularly but not limited to segments about a topic.
Quality Assurance
• 20 topics were redone, 10 were reviewed.
• Redo: A second assessor did a topic from scratch.
• Review: A second assessor reviewed the first
assessor's work and did additional searches when
needed.
• Assessors would then get together, discuss their
interpretation of the topic, and resolve differences in
relevance judgments.
• Assessors kept notes on the process.
Looking Forward: 2006
• Working systems in five languages
– Real users searching real data
• Rich experience beyond broadcast news
– Frameworks, components, systems
• Affordable application-tuned systems
– Oral history, lectures, speeches, meetings, …
For More Information
• The MALACH project
– http://www.clsp.jhu.edu/research/malach/
• NSF/EU Spoken Word Access Group
– http://www.dcs.shef.ac.uk/spandh/projects/swag/
• Speech-based retrieval
– http://www.glue.umd.edu/~dlrg/speech/