NineOneOne:
Recognizing and Classifying Speech
for Handling Minority Language
Emergency Calls
Udhay Nallasamy, Alan W Black,
Tanja Schultz, Robert Frederking,
[Jerry Weltman, Julio Schrodel]
May 2008
Outline
• Overview
– System design
– ASR design
– MT design
• Current results
– ASR results
– Classification-for-MT results
• Future plans
Project overview
• Problem: Spanish 9-1-1 calls handled in
slow, unreliable fashion
• Tech base: SR/MT far from perfect, but
usable in limited domains
• Science goal: Speech MT that really gets
used
– 9-1-1 as likeliest route:
• Naturally limited, important, civilian domain
• Interested user partner who will really try it
– (vs. Diplomat experience…)
Domain Challenges/Opportunities
• Challenges:
– Real-time required
– Random phones
– Background noise
– Stressed speech
– Multiple dialects
– Cascading errors
• Opportunities:
– Speech data source
– Strong task constraints
– One-sided speech
– Human-in-the-loop
– Perfection not required
System flow
• Spanish caller: "Necesito una ambulancia" ("I need an ambulance")
• Caller's Spanish speech → Spanish ASR → DA Classifier → Spanish-to-English MT → English text ("I need ambulance") displayed on the Dispatch Board for the English dispatcher
• Return path: dispatcher's response → Spanish TTS → Spanish speech to the caller: "Nueve uno uno, ¿cuál es su emergencia?" ("Nine one one, what is your emergency?")
Overall system design
• Spanish to English: [no TTS!]
– Spanish speech recognized
– Spanish text classified (context-dependent?) into
DomainActs, arguments spotted and translated
– Resulting text displayed to dispatcher
• English to Spanish: [no ASR!]
– Dispatcher selects output from a tree, typing/editing arguments
• Very simple “Phraselator”-style MT
– System synthesizes Spanish output
• Very simple limited-domain synthesizer
• HCI work: keeping human in the loop!
– role-playing & shadow use
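A minimal sketch of the caller-to-dispatcher direction under this design (all functions are illustrative stubs, not the real JRTk/classifier/MT modules):

```python
def spanish_asr(audio: bytes) -> str:
    """Stub for the Spanish recognizer (JRTk in the real system)."""
    return "necesito una ambulancia"

def classify_domain_act(text: str) -> str:
    """Stub for the DomainAct classifier."""
    return "Request-ambulance"

def caller_to_dispatcher(audio: bytes) -> str:
    """Spanish speech in, classified text for the dispatcher out (no TTS this way)."""
    text = spanish_asr(audio)
    return f"[{classify_domain_act(text)}] {text}"

print(caller_to_dispatcher(b"<audio bytes>"))
# -> [Request-ambulance] necesito una ambulancia
```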
“Ayudame” system mock-up
• We plan to interface with call-takers via
web browser
• Initial planned user scenario follows
• The first version will certainly be wrong
– One of the axioms of HCI
• But iterating through user tests is the best
way to get to the right design
Technical plans: ASR
• ASR challenges:
– Disfluencies
– Noisy emotional speech
– Multiple dialects, some English
• Planned approaches:
– Noisy-channel model [Honal et al,
Eurospeech03]
– Articulatory features
– Multilingual grammars, multilingual front end
Technical plans: MT
• MT challenges:
– Disfluencies in speech
– ASR errors
– Accuracy/transparency vs. development costs
• Planned approaches: adapt and extend
– Domain Act classification from Nespole!
• Shallow interlingua, speaker intent (not literal)
• Report-fire, Request-ambulance, Don’t-know, …
– Transfer rule system from Avenue
– (Both NSF-funded.)
Nespole! Parsing and Analysis Approach
• Goal: A portable and robust analyzer for task-oriented human-to-human speech, parsing
utterances into interlingua representations
• Our earlier systems used full semantic grammars to
parse complete DAs
– Useful for parsing spoken language in restricted domains
– Difficult to port to new domains
• Nespole! focus was on improving portability to new
domains (and new languages)
• Approach: Continue to use semantic grammars to
parse domain-independent phrase-level arguments
and train classifiers to identify DAs
Example Nespole! representation
• Hello. I would like to take a vacation in Val
di Fiemme.
• c:greeting (greeting=hello)
c:give-information+disposition+trip
(disposition=(who=i, desire),
visit-spec=(identifiability=no, vacation),
location=(place-name=val_di_fiemme))
MT differences from Nespole!
• Hypothesis: Simpler domain can allow
simpler (less expensive) MT approach
• DA classification done without prior
parsing
– We may add argument-recognizers as
features, but still cheaper than parsing
• After DA classification, identify, parse, and
translate simple arguments (addresses,
phone numbers, etc.)
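A rough sketch of this classify-then-extract ordering (the keyword rule and the regex are toy stand-ins for the trained classifier and the engineered argument recognizers):

```python
import re

def classify_da(utterance: str) -> str:
    """Toy stand-in for the trained DA classifier."""
    return "Request-ambulance" if "ambulancia" in utterance else "Others"

def extract_phone_number(utterance: str):
    """Toy argument recognizer: only runs after the DA is known."""
    m = re.search(r"\b\d{3}[- ]?\d{4}\b", utterance)
    return m.group(0) if m else None

utt = "necesito una ambulancia mi numero es 555-1234"
print(classify_da(utt), extract_phone_number(utt))
# -> Request-ambulance 555-1234
```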
Currently-funded
NineOneOne work
• Full proposal was not funded
– But SGER was funded
• Build targeted ASR from 9-1-1 call data
• Build classification part of MT system
• Evaluate on unseen data, hopefully
demonstrating sufficient ASR and
classification quality to get follow-on
• 18 months, began May 2006
– No-Cost Extension to 24 months
Spanish ASR Details
• Janus Recognition Toolkit (JRTk)
• CI models initialized from GlobalPhone data (39 phones)
• CD models are 3-state, semi-continuous models with 32 Gaussians per state
• LM trained on GlobalPhone text corpus (Spanish news, 1.5 million words)
• LM is interpolated with the training-data transcriptions
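Presumably this is standard linear interpolation of the two n-gram models, with the mixing weight $\lambda$ tuned on held-out data (the dev-set calls on the next slide):

$$P(w \mid h) = \lambda\, P_{\text{news}}(w \mid h) + (1 - \lambda)\, P_{\text{calls}}(w \mid h), \qquad 0 \le \lambda \le 1$$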
ASR Evaluation
• Training data – 50 calls (4 hours of
speech)
• Dev set – 10 calls (for LM interpolation)
• Test set – 15 calls (1 hour of speech)
• Vocabulary size – 65K words
• Test set perplexity – 96.7
• Accuracy of ASR on test set – 76.5%
– Good for spontaneous, multi-speaker
telephone speech
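For reference, test-set perplexity is the usual per-word measure

$$\mathrm{PPL} = \exp\Bigl(-\frac{1}{N}\sum_{i=1}^{N}\ln P(w_i \mid w_1,\dots,w_{i-1})\Bigr),$$

so 96.7 means the LM is, on average, about as uncertain as a uniform choice among roughly 97 words at each position.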
Utterance Classification/Eval
• Can we automatically classify utterances
into DAs?
• Manually classified turns into DAs
– 10 labels, 845 labelled turns
• WEKA toolkit SVM with simple bag-of-words binary features
• Evaluated using 10-fold cross-validation
• Overall accuracy 60.1%
– But increases to 68.8% ignoring “Others”
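A small scikit-learn sketch of this setup (the project used WEKA; the turns below are invented, two per class so that cross-validation can split):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the 845 manually labelled turns.
turns = ["necesito una ambulancia", "manden una ambulancia",
         "mi nombre es maria lopez", "me llamo juan garcia",
         "si", "si senor",
         "no se", "este"]
labels = ["Req. Ambulance", "Req. Ambulance", "Giving Name", "Giving Name",
          "Yes", "Yes", "Others", "Others"]

# Binary (presence/absence) bag-of-words features + linear SVM,
# scored by cross-validation as described above.
clf = make_pipeline(CountVectorizer(binary=True), LinearSVC())
scores = cross_val_score(clf, turns, labels, cv=2)  # cv=10 on the real data
print(f"mean accuracy: {scores.mean():.3f}")
```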
Initial DA Classification

Tag                     | Frequency | Accuracy (%) | Acc. w/o "Others" (%)
Giving Name             | 80        | 57.50        | 67.6
Giving Address          | 118       | 38.98        | 63.0
Giving Phone Number     | 29        | 48.28        | 63.6
Req. Ambulance          | 8         | 62.50        | 83.3
Req. Fire Service       | 11        | 54.55        | 75.0
Req. Police             | 24        | 41.67        | 62.5
Report Injury/Urgency   | 61        | 39.34        | 72.7
Yes                     | 119       | 52.94        | 71.6
No                      | 24        | 54.17        | 81.2
Others                  | 371       | 75.74        | ----
DA classification caveat
• But DA classification was done on human
transcriptions (also human utterance
segmentation)
• Classifier accuracy on current ASR
transcriptions is 40% (49% w/o “Others”)
• Probably needs to be better than that
Future work
• Improving ASR
• Improving classification on real output:
– More labelled training data
– Use discourse context in classification
– “Query expansion” via synsets from Spanish
EuroWordNet
– Engineered phone-number recognizer, etc. (see the sketch after this list)
• Partial (simpler) return to Nespole! approach
– Better ASR/classifier matching
• Building and user-testing full pilot system
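For the engineered recognizers, a hypothetical sketch of a phone-number spotter: in ASR output the digits arrive as Spanish digit words, so the recognizer maps the longest run of digit words to a number:

```python
# Hypothetical phone-number recognizer over Spanish ASR output,
# where digits appear as words ("cinco cinco cinco ...").
DIGITS = {"cero": "0", "uno": "1", "dos": "2", "tres": "3", "cuatro": "4",
          "cinco": "5", "seis": "6", "siete": "7", "ocho": "8", "nueve": "9"}

def find_phone_number(asr_text: str, min_len: int = 7):
    best, run = [], []
    for tok in asr_text.split():
        if tok in DIGITS:
            run.append(DIGITS[tok])
        else:
            best, run = max(best, run, key=len), []
    best = max(best, run, key=len)  # keep the longest run of digit words
    return "".join(best) if len(best) >= min_len else None

print(find_phone_number("mi numero es cinco cinco cinco uno dos tres cuatro"))
# -> 5551234
```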
Questions?
http://www.cs.cmu.edu/~911/
Class confusion matrix
(GN = Giving Name, GA = Giving Address, GP = Giving Phone Number, RA = Req. Ambulance, RF = Req. Fire Service, RP = Req. Police, IU = Report Injury/Urgency, Y = Yes, N = No, O = Others)

Tag | Correct | Confused with "O" | Total
GN  | 46      | 12                | 80
GA  | 46      | 45                | 118
GP  | 14      | 7                 | 29
RA  | 5       | 2                 | 8
RF  | 6       | 3                 | 11
RP  | 10      | 8                 | 24
IU  | 24      | 28                | 61
Y   | 63      | 31                | 119
N   | 13      | 8                 | 24
O   | 281     | ----              | 371

The remaining confusions are scattered in small counts across the other tags.
Argument Parsing
• Parse utterances using phrase-level
grammars
• Nespole! used SOUP Parser (Gavaldà,
2000): Stochastic, chart-based, top-down
robust parser designed for real-time
analysis of spoken language
• Separate grammars, organized by the type of phrases each grammar is intended to cover
Domain Action Classification
• Identify the DA for each SDU using
trainable classifiers
• Nespole! used two TiMBL (k-NN)
classifiers:
– Speech act
– Concept sequence
• Binary features indicate presence or
absence of arguments and pseudo-arguments
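A sketch of this setup with scikit-learn's k-NN standing in for TiMBL, collapsing the two classifiers into one and using an invented feature/label inventory:

```python
from sklearn.neighbors import KNeighborsClassifier

# Each SDU becomes a binary vector marking presence/absence of parsed
# (pseudo-)arguments; feature and DA names here are illustrative.
features = ["arg:greeting", "arg:disposition", "arg:location", "pseudo:arrival"]
X = [[1, 0, 0, 0],   # "hello"
     [0, 1, 1, 0],   # "i would like ... in val di fiemme"
     [0, 0, 1, 1]]   # an arrival-flavoured SDU
y = ["greeting", "give-information+disposition+trip", "give-information+arrival"]

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn.predict([[0, 1, 1, 0]]))
# -> ['give-information+disposition+trip']
```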
Current status: March 2008 (1)
(At end of extended SGER…)
• Local Spanish transcribers transcribing
HIPAA-sanitized 9-1-1 recordings
• CMU grad student (Udhay)
– managing transcribers via internal website,
– built and evaluated ASR and utt. classifier,
– building labelling webpage, prototype, etc.
• Volunteer grad student (Weltman, LSU)
analyzing, refining, and using classifier
labels
Current status: March 2008 (2)
• “SGER worked.”
• Paper on ASR and classification accepted to
LREC-2008
• Two additional 9-1-1 centers sending us data
• Submitted follow-on small NSF proposal in
December 07: really build and user-test pilot
– Letters of Support from three 9-1-1 centers
• Will submit to COLING workshop on safety-critical MT systems
Additional Police Partners
• Julio Schrodel (CCPD) successes:
– Mesa PD, Arizona
– Charlotte-Mecklenburg PD, NC
• Much larger cities than Cape Coral
– (Each is now bigger than Pittsburgh!)
• Uncompressed recordings!
• Much larger, more automated 9-1-1
operations
– Call-taker vs. dispatcher
– User-defined call types logged
Acquired data, as of 3/08
Call Center | Calls
CCPD        | 140
CMPD        | 392
MesaPD      | 50
• Miami-Dade County: 5 audio cassettes!
• St. Petersburg: 1 call!!
Transcription Status
• VerbMobil transcription conventions
• TransEdit software (developed by Susi
Burger and Uwe Meier)
• Transcribed calls:
– 97 calls from Cape Coral PD
– 13 calls from Charlotte
• Transcribed calls playback time: 9.7 hours
LSU work: Better DA tags
• Manually analyzed 30 calls to find DAs
with widest coverage
• Current proposal adds 25 new DAs
• Created guidelines for tagging. E.g.:
– If the caller answers an open-ended question
with multiple pieces of information, tag each
piece of information
• Currently underway: Use web-based
tagging tool to manually tag the calls
• Determine inter-tagger agreement
Sample of Proposed Additional DAs
Tag                     | Example utterances
Describing Residence    | "The left side of a duplex"
Giving Location         | "I'm outside of the house", "The corner of Vine and Hollywood"
Describing Vehicle      | "A white Ford Ranger", "License Plate ALV-325"
Giving Age              | "She is 3-years-old", "Born on April 18, 1973"
Describing Clothing     | "He was wearing a white sweater and black shorts"
Giving Quantity         | "Only two", "There are 3 of them right now"
Describing Conflict     | "He came back and started threatening me", "My ex-husband won't leave me alone"
Giving Medical Symptoms | "He's having chest pains"
Asking For Instructions | "Should I pull over?"
Asking About Response   | "How long will it take someone to get here?"
Project origin
• Contacted by Julio Schrodel of Cape Coral
PD (CCPD) in late 2003
– Looking for technological solution to shortage
of Spanish translation for 9-1-1 calls
• Visited CCPD in December 2003
– CCPD very interested in cooperating
– Promised us access to 9-1-1 recordings
• Designed system, wrote proposal
– CCPD letter in support of proposal
– Funded starting May 2006
• (SGER, only for ASR and preparatory work)
Articulatory features
• Model phone as a bundle of articulatory
features such as voiced or bilabial
• Less fragmentation of training data
• More robust in handling hyper-articulation
– Error-rate reduction of 25% [Metze et al,
ICSLP02]
• Multilingual/crosslingual articulatory
features for multilingual settings
– Error-rate reduction of 12.3% [Stuecker et al,
ICASSP03]
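For instance, modelling each phone as a feature bundle lets phones share training data for the features they agree on (a sketch; the inventory is illustrative, not the actual JRTk feature set):

```python
# Phones as bundles of articulatory features instead of atomic units.
PHONE_FEATURES = {
    "b": {"voiced", "bilabial", "stop"},
    "p": {"bilabial", "stop"},
    "m": {"voiced", "bilabial", "nasal"},
    "s": {"alveolar", "fricative"},
}

def shared_features(p1: str, p2: str) -> set:
    """Feature detectors that can be trained on data from both phones."""
    return PHONE_FEATURES[p1] & PHONE_FEATURES[p2]

print(shared_features("b", "p"))  # e.g. {'bilabial', 'stop'}
```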
Grammars plus N-grams
• Grammar-based concept recognition
• Multilingual grammars plus n-grams for
efficient multi-lingual decoding [Fuegen et
al, ASRU03]
• Multilingual acoustic models
Interchange Format
• Interchange Format (IF) is a shallow semantic
interlingua for task-oriented domains
• Utterances represented as sequences of
semantic dialog units (SDUs)
• IF representation consists of four parts:
– Speaker
– Speech Act (with Concepts, forms the Domain Action)
– Concepts
– Arguments
• General form: speaker : speech act +concept* +arguments*
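Rendered as a data structure, an IF unit might look like this (a sketch following the form above; field names are ours):

```python
from dataclasses import dataclass, field

@dataclass
class InterchangeFormat:
    speaker: str                 # e.g. "c" for client/caller
    speech_act: str              # e.g. "give-information"
    concepts: list = field(default_factory=list)
    arguments: dict = field(default_factory=dict)

    @property
    def domain_action(self) -> str:
        """Speech act plus concept sequence, joined with '+'."""
        return "+".join([self.speech_act, *self.concepts])

sdu = InterchangeFormat("c", "give-information", ["disposition", "trip"],
                        {"location": {"place-name": "val_di_fiemme"}})
print(sdu.domain_action)  # -> give-information+disposition+trip
```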
Hybrid Analysis Approach
Hello. I would like to take a vacation in Val di Fiemme.
c:greeting (greeting=hello)
c:give-information+disposition+trip
(disposition=(who=i, desire),
visit-spec=(identifiability=no, vacation),
location=(place-name=val_di_fiemme))
The utterance is segmented into two SDUs, each mapped to a DA with its phrase-level arguments:

SDU1: "hello" → greeting
  greeting = "hello"
SDU2: "i would like to take a vacation in val di fiemme" → give-information+disposition+trip
  disposition = "i would like"
  visit-spec = "to take a vacation"
  location = "in val di fiemme"
Hybrid Analysis Approach
Use a combination of grammar-based phrase-level parsing and machine learning to produce interlingua (IF) representations:

Text → Argument Parser → (Text + Arguments) → SDU Segmenter → (Text + Arguments + SDUs) → DA Classifier → IF
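The same pipeline as a sketch, with each stage stubbed out (the real stages are the SOUP grammars and the trained segmenter/classifier):

```python
def argument_parser(text: str):
    """Phrase-level grammar parsing: label argument spans (stubbed)."""
    return [("greeting", "hello"), ("location", "in val di fiemme")]

def sdu_segmenter(text: str, args):
    """Split the turn into semantic dialog units (stubbed)."""
    return [("hello", args[:1]),
            ("i would like to take a vacation in val di fiemme", args[1:])]

def da_classifier(sdu_text: str, sdu_args):
    """Choose a Domain Action for one SDU (stubbed)."""
    return "greeting" if ("greeting", "hello") in sdu_args \
        else "give-information+disposition+trip"

def text_to_if(text: str):
    args = argument_parser(text)
    return [(da_classifier(t, a), a) for t, a in sdu_segmenter(text, args)]

print(text_to_if("hello i would like to take a vacation in val di fiemme"))
```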
Grammars (1)
• Argument grammar
– Identifies arguments defined in the IF
  s[arg:activity-spec=] →
    (*[object-ref=any] *[modifier=good] [biking])
– Covers "any good biking", "any biking", "good biking", "biking", plus synonyms for all three words
• Pseudo-argument grammar
– Groups common phrases with similar meanings into classes
  s[=arrival=] →
    (*is *usually arriving)
– Covers "arriving", "is arriving", "usually arriving", "is usually arriving", plus synonyms
Grammars (2)
• Cross-domain grammar
– Identifies simple domain-independent DAs
  s[greeting] →
    ([greeting=first_meeting] *[greet:to-whom=])
– Covers "nice to meet you", "nice to meet you donna", "nice to meet you sir", plus synonyms
• Shared grammar
– Contains low-level rules accessible by all
other grammars
Using the IF Specification
• Use knowledge of the IF specification
during DA classification
– Ensure that only legal DAs are produced
– Guarantee that the DA and arguments
combine to form a valid IF representation
• Strategy: Find the best DA that licenses
the most arguments
– Trust parser to reliably label arguments
– Retaining detailed argument information is
important for translation
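A sketch of that strategy: among the classifier's ranked candidate DAs, take the one licensing the most parsed arguments, breaking ties by rank (the licensing table is illustrative, not the real IF specification):

```python
# Which arguments each DA may legally carry (illustrative).
LICENSED_ARGS = {
    "give-information+disposition+trip": {"disposition", "visit-spec", "location"},
    "give-information+price": {"price", "location"},
}

def best_da(ranked_das, parsed_args):
    # max() keeps the earlier (higher-ranked) DA on ties.
    return max(ranked_das,
               key=lambda da: len(LICENSED_ARGS.get(da, set()) & parsed_args))

print(best_da(["give-information+price", "give-information+disposition+trip"],
              {"disposition", "location"}))
# -> give-information+disposition+trip
```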
Avenue Transfer Rule Formalism (I)
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]   ; type information; part-of-speech/constituent information
(
 (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)    ; alignments
 ((X1 AGR) = *3-SING)                   ; x-side constraints
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)                      ; y-side constraints
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER))
)
; xy-constraints relate the two sides, e.g. ((Y1 AGR) = (X1 AGR))
Avenue Transfer Rule Formalism (II)
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
 (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
 ((X1 AGR) = *3-SING)                   ; value constraints (e.g. = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER))            ; agreement constraint
)