Searching Speech:
A Research Agenda
Douglas W. Oard
College of Information Studies and
Institute for Advanced Computer Studies
University of Maryland, College Park
July 14, 2005
National E-Science Centre
Some Grid Use at Maryland
• Global Land Cover Facility
– 13 TB of raw and derived data from 5 satellites
• Digital archives
– Preserving the meaning of metadata structure
• Access grid
– No-operator information studies classroom
Expanding the Search Space
[Diagram: extending the search space from scanned documents to indexable speech, with example access points such as speaker identity ("Harriet") and spoken content ("… Later, I learned that John had not heard …")]
Indexable Speech
• What if we could collect “everything”?
– 1 billion users of speech-enabled devices
– Each producing >10K words per day
– Much of it not worth finding
• Comparison case: Web search
– Google indexes ~10 billion Web pages
– Perhaps averaging ~1K words each
– Much of it not worth finding
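As a quick sanity check on those two comparisons (a minimal sketch using the slide's own round numbers, not measured values):

# Back-of-the-envelope comparison of "everything spoken" vs. the indexed Web.
speech_words_per_day = 1_000_000_000 * 10_000   # 1B users x >10K words/day = 1e13 words/day
web_words_total      = 10_000_000_000 * 1_000   # ~10B pages x ~1K words each = 1e13 words

print(f"Speech produced per day: {speech_words_per_day:.1e} words")
print(f"Web text indexed, total: {web_words_total:.1e} words")
# Roughly one day of worldwide speech would match the size of the entire indexed Web.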
A Web of Speech?
[Table comparing the Web in 1995 with speech in 2004 along five dimensions: storage (words per $), "last mile" download time (about 1 second, no graphics, vs. streaming), Internet backbone (simultaneous users), display capability (computers per US population, roughly 10% then vs. 100% now), and search systems (Lycos, Yahoo); other figures on the slide: 300K, 1.5M, 250K, 30M]
The Need for Scalable Solutions
[Log-scale chart, millions of hours of speech (0.0001 to 10,000): existing collections and systems (TDT, SpeechBot, SingingFish, Shoah Foundation, British Library) compared with the webcasts produced in a year and the speech spoken worldwide in a day]
Some Spoken Word Collections
• Broadcast programming
– News, interview, talk radio, sports, entertainment
• Storytelling
– Books on tape, oral history, folklore
• Incidental recording
– Speeches, courtrooms, meetings, phone calls
Indexing Options
• Transcript-based (e.g., NASA)
– Manual transcription, editing by interviewee
• Thesaurus-based (e.g., Shoah Foundation)
– Manually assign descriptors to points in an interview
• Catalog-based (e.g., British Library)
– Catalog record created from interviewer’s notes
• Speech-based (MALACH)
– Create access points with speech processing
Supporting “Intellectual Access”
[Diagram: the interactive search cycle: source selection, query formulation, query, search (within the search system), ranked list, selection, recording examination, and recording delivery, with query reformulation, relevance feedback, and source reselection looping back to earlier stages]
Contributing research areas:
• Speech Processing
• Computational Linguistics
• Information Retrieval
• Information Seeking
• Human-Computer Interaction
• Digital Libraries
Some Technical Challenges
• “Fast” ASR systems are way too slow
– 6 orders of magnitude slower than tokenization
• Situational sublanguage induces variability
– Impedes interactive vocabulary acquisition
• Knee in the WER/MAP curve comes early
– 30-40% for broadcast news
– Somewhere below 30% for conversations
• Skimmable summaries from imperfect ASR
– Particularly important for linear media
• Classic IR measures focus on “documents”
– Conversational boundaries are ambiguous
Start Time Error Cost
[Chart: cost (0.0 to 1.0) as a function of replay start-time error, from -5 to +5]
Shoah Foundation Collection
• Substantial scale
– 116,000 hours; 52,000 interviews; 32 languages
• Spontaneous conversational speech
– Accents, elderly, emotional, …
• Accessible
– $100 million collection and digitization investment
• Manually indexed (10,000 hours)
– Segmented, thesaurus terms, people, summaries
• Users
– A department working full time on dissemination
Interview Excerpt
• Audio characteristics
– Accented (this one is unusually clear)
– Separate channels for interviewer / interviewee
• Dialog structure
– Interviewers have different styles
• Content characteristics
– Domain-specific terms
– Named entity mentions and relationships
MALACH Languages
Testimonies (average 2.25 hours each), as of January 31, 2004:

              English   Czech   Russian   Slovak   Polish
Collected      24,874     573     7,080      573    1,400
Cataloged      22,820     531     7,016      464      989
Indexed        22,820      22       701        0        0
Digitized      13,735     374     3,052      427      835
Completed      11,464      22       287        0        0
Observational Studies
• 8 independent searchers
– Holocaust studies (2)
– German studies
– History/Political science
– Ethnography
– Sociology
– Documentary producer
– High school teacher
• 8 teamed searchers
– All high school teachers
• Thesaurus-based search
• Rich data collection
– Intermediary interaction
– Semi-structured interviews
– Observational notes
– Think-aloud
– Screen capture
• Qualitative analysis
– Theory-guided coding
– Abductive reasoning
Relevance Criteria
Number of think-aloud mentions (6 scholars, 1 teacher, 1 film producer, working individually):

Criterion                  All (N=703)   Relevance Judgment (N=300)   Query Formulation (N=248)
Topicality                 535 (76%)     219                          234
Richness                    39 (5.5%)     14                            0
Emotion                     24 (3.4%)      7                            0
Audio/Visual Expression     16 (2.3%)      5                            0
Comprehensibility           14 (2%)        1                           10
Duration                    11 (1.6%)      9                            0
Novelty                     10 (1.4%)      4                            2
Topicality
[Bar chart: total mentions (0 to 140) by facet: Person, Place, Event/Experience, Subject, Organization/Group, Time Frame, Object]
6 scholars, 1 teacher, 1 film producer, working individually
Test Collection Design
[Diagram: query formulation and speech recognition feed boundary detection and content tagging; automatic search over the resulting segments supports interactive selection]
Test Collection Design
• Interviews processed by:
– Speech recognition (automatic): 35% word error rate interview-tuned, 40% domain-tuned
– Boundary detection: manual and automatic topic boundaries
– Content tagging
• Manual: ~5 thesaurus labels, 3-sentence summaries
• Automatic: thesaurus labels
• Topic statements for query formulation
– Training: 38 existing
– Evaluation: 25 new
• Automatic search produces ranked lists
• Evaluation: mean average precision against relevance judgments
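The design above is scored with mean average precision over the ranked lists; a minimal sketch of that metric (the run and qrels representations below are the usual conventions, assumed here rather than taken from the slides):

def average_precision(ranked_segments, relevant_segments):
    """AP for one topic: mean of precision@k at each rank k where a relevant
    segment is retrieved; 0 if the topic has no relevant segments."""
    if not relevant_segments:
        return 0.0
    hits, precision_sum = 0, 0.0
    for k, seg in enumerate(ranked_segments, start=1):
        if seg in relevant_segments:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant_segments)

def mean_average_precision(runs, qrels):
    """MAP: average of per-topic AP.
    runs:  {topic_id: [segment_id, ...]} ranked lists
    qrels: {topic_id: {segment_id, ...}} relevant segments"""
    return sum(average_precision(runs[t], qrels.get(t, set())) for t in runs) / len(runs)

# Example: one topic, three retrieved segments, two of them relevant.
print(mean_average_precision({"1148": ["s1", "s2", "s3"]}, {"1148": {"s1", "s3"}}))  # 0.8333...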
CLEF-2005 CL-SR Track
• Test collection distributed by ELDA
– ~7,800 segments from ~300 English interviews
• Hand segmented / known boundaries
– 63 topics (title/description/narrative)
• 38 for training, 25 for blind evaluation
• 5 languages (EN, SP, CZ, DE, FR)
– Relevance judgments
• Search-guided + post-hoc judgment pools
• 5 participating teams
– DCU, Maryland, Pitt, Toronto/Waterloo, UNED
• One required cross-site baseline run
– ASR segments / English TD topics
Additional Resources
• Thesaurus
– ~3,000 core concepts
• Plus alternate vocabulary + standard combinations
– ~30,000 location-time pairs, with lat/long
– Both “is-a” and “part-whole” relationships
• In-domain expansion collection
– 186,000 3-sentence summaries
• Indexer’s scratchpad notes
• Digitized speech
– .mp2 or .mp3
English ASR
[Chart: English word error rate (%) from Jan-02 through Jan-05 for the ASR2003A and ASR2004A systems]
Training: 200 hours from 800 speakers
<DOCNO>VHF00017-062567.005</DOCNO>
<KEYWORD> Warsaw (Poland), Poland 1935 (May 13) - 1939 (August 31),
awareness of political or military events, schools </KEYWORD>
<PERSON> Sophie P[…], Henry H[…] </PERSON>
<SUMMARY> AH talks about the college she attended before the war. She
mentions meeting her husband. She discusses young peoples' awareness of
the political events that preceded the outbreak of war. </SUMMARY>
<SCRATCHPAD> graduated HS, went to college 1 year, professional college hotel
management; met future husband, knew that they'd end up together; sister also in
college, nice social life, lots of company, not too serious; already got news from
Czechoslovakia, Sudeten, knew that Poland would be next but what could they do about
it, very passive; just heard info from radio and press </SCRATCHPAD>
<ASRTEXT> no no no they did no not not uh i know there was no place to go we
didn't have family in a in other countries so we were not financially at the at extremely
went so that was never at plano of my family it is so and so that was the atmosphere in
the in the country prior to the to the war i graduate take the high school i had one year of college which
was a profession and that because that was already did the practical trends f so that was a study for whatever management that eh
eh education and this i i had only one that here all that at that time i met my future husband and that to me about any we knew it
that way we were in and out together so and i was quite county there was so whatever i did that and this so that was the person that
lived my sister was it here is first year of of colleagues and and also she had a very strongly this antisemitic trend and our parents
there was a nice social life young students that we had open house always pleasant we had a lot of that company here and and we
were not too serious about that she we got there we were getting the they already did knew he knew so from czechoslovakia from
they saw that from other part and we knew the in that that he is uhhuh the hitler spicy we go into this year this direction that eh
poland will be the next country but there was nothing that we would do it at that time so he was a very very he says belong to any
any organizations especially that the so we just take information from the radio and from the dress </ASRTEXT>
Segment duration (s)
   Min.       1st Qu.   Median   Mean     3rd Qu.   Max.        NA's
  -2044.00    54.01     224.90   391.70   326.00    287400.00   75031.00 (44.5%)
Keywords vs. Segment duration
[Plot: keyword counts vs. segment duration, for nodes descending from parents of leaves]
Years spoken in ASR
Spoken dates per segment in the released ASR transcripts:
   Min.     1st Qu.   Median   Mean     3rd Qu.   Max.
   0.0000   0.0000    0.0000   0.6575   1.0000    13.0000
Current classifier performance:
   46,601 (1,175)   3,610 (169)   1,437 (168)   613 (47)
MAP: 0.2374, even post-mixing of scratchpad/summary evidence from 20-NN, remixed with time-label densities estimated with a Gaussian kernel at 5x default bandwidth
An Example English Topic
Number: 1148
Title: Jewish resistance in Europe
Description:
Provide testimonies or describe actions of Jewish resistance in Europe
before and during the war.
Narrative:
The relevant material should describe actions of only- or mostly Jewish
resistance in Europe. Both individual and group-based actions are relevant.
Type of actions may include survival (fleeing, hiding, saving children),
testifying (alerting the outside world, writing, hiding testimonies), fighting
(partisans, uprising, political security). Information about undifferentiated
resistance groups is not relevant.
5-level Relevance Judgments (mapped to binary qrels)
• “Classic” relevance (to “food in Auschwitz”)
– Direct: knew food was sometimes withheld
– Indirect: saw undernourished people
• Additional relevance types
– Context: intensity of manual labor
– Comparison: food situation in a different camp
– Pointer: mention of a study on the subject
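A minimal sketch of how graded, multi-type judgments like these might be collapsed into binary qrels for scoring. The type names follow the slide; the rule that only direct and indirect matches count as relevant is an assumption for illustration, not the track's actual policy, and all but the first segment ID are hypothetical:

# Hypothetical graded judgment records: (topic_id, segment_id, relevance_type).
judgments = [
    ("1148", "VHF00017-062567.005", "direct"),
    ("1148", "VHF99999-000001.002", "context"),    # hypothetical segment ID
    ("1148", "VHF99999-000002.007", "indirect"),   # hypothetical segment ID
]

# Assumed collapsing rule: direct and indirect matches are relevant (1), others not (0).
RELEVANT_TYPES = {"direct", "indirect"}

with open("qrels.txt", "w") as out:
    for topic_id, segment_id, rel_type in judgments:
        binary = 1 if rel_type in RELEVANT_TYPES else 0
        # Standard TREC qrels format: topic  iteration  doc_id  judgment
        out.write(f"{topic_id} 0 {segment_id} {binary}\n")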
Comparing Index Terms
[Bar chart: mean average precision (0 to 0.5) for indexing with ASR, scratchpad notes, thesaurus terms, summaries, all metadata, and metadata plus person names]
Title queries, adjudicated judgments
Searching Manual Transcripts
[Chart: average precision (0.0 to 1.0) for ASR, ASR+Rel+Top10, and Metadata indexing on queries such as “jewish kapo(s)” and “fort ontario refugee camp”]
Title queries, adjudicated judgments
Category Expansion
[Diagram: 3,199 training segments with hand-transcribed spoken words and thesaurus terms train a kNN categorizer (F=0.19, microaveraged), which assigns thesaurus terms to test segments from their ASR transcripts]
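A minimal sketch of that kNN categorization step using scikit-learn; the tf-idf representation, the value of k, the top-5 cutoff, and the example field contents are illustrative assumptions, since the slide does not specify them:

# Sketch: assign thesaurus terms to ASR segments by k-nearest-neighbor categorization.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

train_texts  = ["...hand-transcribed spoken words of a training segment..."]
train_labels = [{"Warsaw (Poland)", "schools"}]            # thesaurus terms per training segment
test_texts   = ["...asr transcript of a test segment..."]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
X_test  = vectorizer.transform(test_texts)

knn = NearestNeighbors(n_neighbors=min(20, len(train_texts)), metric="cosine").fit(X_train)
_, neighbor_ids = knn.kneighbors(X_test)

for ids in neighbor_ids:
    # Vote thesaurus terms from the nearest training segments; keep the top 5.
    votes = Counter(term for i in ids for term in train_labels[i])
    predicted = [term for term, _ in votes.most_common(5)]
    print(predicted)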
[Chart: mean average precision (peaking at 0.0941) as a function of the mixing weight between the thesaurus-term index (0.0) and the ASR-word index (1.0)]
Title queries, linear score combination, adjudicated judgments
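A minimal sketch of the linear score combination behind that curve; the min-max normalization and the weight grid are assumptions, since the slide only states that the two ranked lists are combined linearly:

# Sketch: linearly interpolate retrieval scores from two indexes of the same segments,
# here a thesaurus-term index and an ASR-word index, sweeping the mixing weight.
def normalize(scores):
    """Min-max normalize a {segment_id: score} dict so scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {seg: (s - lo) / span for seg, s in scores.items()}

def combine(thesaurus_scores, asr_scores, weight_asr):
    """score = (1 - w) * thesaurus_score + w * asr_score, over all segments seen."""
    t, a = normalize(thesaurus_scores), normalize(asr_scores)
    return {seg: (1 - weight_asr) * t.get(seg, 0.0) + weight_asr * a.get(seg, 0.0)
            for seg in set(t) | set(a)}

# Sweep the weight from all-thesaurus (0.0) to all-ASR (1.0), as in the plotted curve.
for w in [i / 10 for i in range(11)]:
    mixed = combine({"seg1": 2.1, "seg2": 0.4}, {"seg1": 1.3, "seg3": 0.9}, w)
    ranking = sorted(mixed, key=mixed.get, reverse=True)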
ASR-Based Search
[Bar chart: mean average precision (0 to 0.10) for Inquery, character n-grams, Okapi, Okapi + query expansion, Okapi + category expansion, and Okapi + QE+CE; annotations: +27% improvement, and an average of 3.4 relevant segments in the top 20]
Title queries, adjudicated judgments
Rethinking the Problem
• Segment-then-label models planned speech well
– Producers assemble stories to create programs
– Stories typically have a dominant theme
• The structure of natural speech is different
– Creation: digressions, asides, clarification, …
– Use: intended use may affect desired granularity
• Documentary film: brief snippet to illustrate a point
• Classroom teacher: longer self-contextualizing story
Activation Matrix
[Diagram: a labels-by-time matrix showing which descriptors are active at each point in interview time]
Training Data: 196,000 Segments

Interview time →   Location-Time   Subject                            Person
                   Berlin-1939     Employment                         Josef Stein
                   Berlin-1939     Family life                        Gretchen Stein, Anna Stein
                   Dresden-1939    Relocation, Transportation-rail
                   Dresden-1939    Schooling                          Gunter Wendt, Maria

+ Segment summaries + Indexer’s notes
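A minimal sketch of building such an activation matrix from labeled training segments; the minute-level time resolution and the segment boundaries are illustrative assumptions, and only the label names come from the slide's example:

import numpy as np

# Hypothetical training segments: (start_minute, end_minute, set of active labels).
segments = [
    (0,  12, {"Berlin-1939", "Employment", "Josef Stein"}),
    (12, 25, {"Berlin-1939", "Family life", "Gretchen Stein", "Anna Stein"}),
    (25, 40, {"Dresden-1939", "Relocation", "Transportation-rail"}),
    (40, 55, {"Dresden-1939", "Schooling", "Gunter Wendt", "Maria"}),
]

labels = sorted({lab for _, _, labs in segments for lab in labs})
label_index = {lab: i for i, lab in enumerate(labels)}
duration = max(end for _, end, _ in segments)

# One row per label, one column per minute; 1.0 where the label is active.
activation = np.zeros((len(labels), duration), dtype=np.float32)
for start, end, labs in segments:
    for lab in labs:
        activation[label_index[lab], start:end] = 1.0

print(activation.shape)  # (num_labels, num_minutes)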
Preprocessing Training Data
• Normalize labeled categories?
– Food in hiding -> food AND hiding
• Develop class models
– Existing hierarchy, types of personal relationships
• Determine the extent for each label and class
– Merge the extent of repeated labels
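A minimal sketch of the normalization and extent-merging steps just listed; the compound-label splitting heuristic and the (start, end) extent representation are assumptions for illustration:

# Sketch: split compound thesaurus labels into conjunctions of simpler concepts,
# then merge the time extents of repeated labels within an interview.

def normalize_label(label):
    """E.g., "food in hiding" -> {"food", "hiding"}; single-concept labels pass through."""
    parts = [p.strip() for p in label.split(" in ")]   # assumed splitting heuristic
    return set(parts) if len(parts) > 1 else {label}

def merge_extents(extents):
    """Merge overlapping or touching (start, end) extents of the same label."""
    merged = []
    for start, end in sorted(extents):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(normalize_label("food in hiding"))              # {'food', 'hiding'}
print(merge_extents([(30, 45), (40, 60), (90, 95)]))  # [(30, 60), (90, 95)]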
Characteristics of the Problem
• Clear dependencies
– Correlated label assignments
– Living in Dresden negates living in Berlin
• Heuristic basis for class models
– Persons, based on type of relationship
– Date/Time, based on part-whole relationship
– Topics, based on a defined hierarchy
• Heuristic basis for guessing without training
– Text similarity between labels and spoken words
• Heuristic basis for smoothing
– Sub-sentence retrieval granularity is unlikely
Modeling Location
[Diagram: location states Berlin and Dresden within Germany]
• Presence in a new location negates presence in the prior location
• Location granularity varies (inclusion relationships are known)
A Class Model for People
[Diagram: relationship states such as father, mother, sister, friend, and nobody]
• Several people may be discussed simultaneously
• Small inventory of relationship types
• Relationship type is known for most people that are mentioned
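A minimal sketch of how those two class models could drive activation updates; the state inventories and the negation rule follow the slides, but the specific update mechanics are assumptions:

# Sketch: update label activations when a new mention arrives.
# Locations are mutually exclusive at a given granularity; people are not.

LOCATION_PARENTS = {"Berlin": "Germany", "Dresden": "Germany"}   # known part-whole relations
RELATIONSHIP_TYPES = {"father", "mother", "sister", "friend", "nobody"}

def update_location(active, new_location):
    """A new location negates previously active locations, except a known
    containing region of the new location."""
    active = {loc for loc in active if loc == LOCATION_PARENTS.get(new_location)}
    active.add(new_location)
    if new_location in LOCATION_PARENTS:
        active.add(LOCATION_PARENTS[new_location])
    return active

def update_people(active, person, relationship):
    """Several people can be active at once; relationship type is tracked when known."""
    active[person] = relationship if relationship in RELATIONSHIP_TYPES else "unknown"
    return active

locations = update_location({"Berlin", "Germany"}, "Dresden")            # {'Dresden', 'Germany'}
people = update_people({"Josef Stein": "father"}, "Anna Stein", "sister")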
Search
• Compute a score at each time based on:
– How likely is each descriptor? (~TF)
– How selective is each descriptor? (~IDF)
– What related descriptors are active? (~expansion)
• Determine passage start time based on:
– Score trajectory (sequence of scores)
– Additional heuristics (e.g., pause, speaker turn)
• Rank passages based on score trajectory
– e.g., by peak score within the passage
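A minimal sketch of that trajectory scoring and peak-ranked passage selection, reusing an activation matrix like the one sketched earlier; the TF/IDF analogues, the threshold cut for passage starts, and the omission of descriptor expansion are simplifying assumptions:

import numpy as np

def score_trajectory(activation, label_index, query_labels, doc_freq, total_interviews):
    """Score each time step: sum the activations of the query's descriptors (~TF),
    each weighted by how selective that descriptor is across interviews (~IDF)."""
    scores = np.zeros(activation.shape[1])
    for lab in query_labels:
        if lab not in label_index:
            continue  # expansion with related descriptors would be handled here
        idf = np.log(total_interviews / max(1, doc_freq.get(lab, 1)))
        scores += activation[label_index[lab]] * idf
    return scores

def rank_passages(scores, threshold):
    """Cut candidate passages where the trajectory crosses a threshold, rank by peak score.
    A real system would also snap start times to pauses or speaker turns."""
    passages, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            passages.append((start, t, float(scores[start:t].max())))
            start = None
    if start is not None:
        passages.append((start, len(scores), float(scores[start:].max())))
    return sorted(passages, key=lambda p: p[2], reverse=True)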
[Figure: score timelines for the whole interview text]
Some Open Issues
• Is the expressive power of a lattice needed?
– An activation matrix is an unrolled lattice
• What states do we need to represent?
– Balance fidelity, accuracy, and complexity
• How to integrate manual onset marks?
• How much training data do we need?
– Annotating new data costs ~$100/hour
• How will people use the system we build?
Non-English ASR Systems
[Chart: word error rate (%) from 10/01 through 10/06 for Czech, Russian, Polish, Slovak, and Hungarian systems. Reported WERs range from 66.07% (20h + LMTr) down to 34.49%, with intermediate points (57.92%, 50.82%, 45.91%, 45.75%, 41.15%, 40.69%, 38.57%, 35.51%) as acoustic training data grows (20h, 45h, 84h, 100h) and language model training (LMTr), text conditioning (TC), standardization, and adaptation are added]
Planning for the Future
• Tentative CLEF-2006 CL-SR Plans:
– Adding a Czech collection
– Larger English collection (~900 hours)
• Adding word lattice as standard data
– No-boundary evaluation design
– ASR training data (by special arrangement)
• Transcripts, pronunciation lexicon, language model
• Possible CLEF-2007 CL-SR Options:
– Add a Russian or Slovak collection?
– Much larger English collection (~5,000 hours)?
The CLEF CL-SR Team
USA
• Shoah Foundation
– Sam Gustman
• IBM TJ Watson
– Bhuvana Ramabhadran
– Martin Franz
• U. Maryland
– Doug Oard
– Dagobert Soergel
• Johns Hopkins
– Zak Schefrin
Europe
• U. Cambridge (UK)
– Bill Byrne
• Charles University (CZ)
– Jan Hajic
– Pavel Pecina
• U. West Bohemia (CZ)
– Josef Psutka
– Pavel Ircing
• UNED (ES)
– Fernando López-Ostenero
More Things to Think About
• Privacy protection
– Working with real data has real consequences
• Are fixed segments the right retrieval unit?
– Or is it good enough to know where to start?
• What will it cost to tailor an ASR system?
– $100K to $1 million per application?
• Do we need to change what we collect?
– Speaker enrollment, metadata standards, …
Final Thoughts
• The moving hand, having writ, moves on
– Ephemeral webcasting
– Forgone acquisition opportunities
For More Information
• The MALACH project
– http://www.clsp.jhu.edu/research/malach
• CLEF-2005 evaluation
– http://www.clef-campaign.org
• NSF/DELOS Spoken Word Access Group
– http://www.dcs.shef.ac.uk/spandh/projects/swag