Frank Wessel et al., “Using Word Probabilities as Confidence Measures”
http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
Timothy Hazen et al., “Recognition Confidence Scoring for Use in Speech Understanding Systems”
http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
Dan Bohus and Alex Rudnicky, “A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems”
http://www.cs.cmu.edu/~dbohus/docs/dbohus_interspeech05.pdf
Gives the system a basis for deciding whether the ASR output can be trusted.
Possible response strategies:
Reject the sentence altogether.
Confirm with the user again.
Both, e.g. a bi-threshold system.
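A bi-threshold system can be sketched as below; the two threshold values are illustrative only, not from any of the papers:

```python
def respond(confidence, low=0.3, high=0.7):
    """Bi-threshold response strategy (threshold values are illustrative)."""
    if confidence < low:
        return "reject"   # discard the hypothesis altogether
    elif confidence < high:
        return "confirm"  # ask the user to confirm
    else:
        return "accept"   # trust the ASR output
```

Utterances between the two thresholds trigger a confirmation rather than an outright rejection.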
Detection of OOV words, e.g.:
if the OOV word is not included in the ASR vocabulary,
Spoken: “What is the forecast for Paramus Park, New Jersey”
Recognized: “What is the forecast for Paris Park, New Jersey”
“Paramus” is OOV, so the system should have low confidence in the phoneme transcription.
Improves speech recognition performance.
Why? In general, the posterior should be used instead of the likelihood.
Does it help? At the 2%-5% relative level.
For each idea, 3 papers were studied.
Only the most representative became the suggested reading.
Results will be quoted from the different papers.
Mathematical Foundation:
Neyman-Pearson Theorem (NPT)
Consequence of NPT:
in general, the likelihood ratio test is the most powerful test for deciding which of two distributions is in force.
H1: Distribution A is in force.
H2: Distribution B is in force.
Compute
P(X | H1) / P(X | H2) ≷ T
and accept H1 if the ratio exceeds the threshold T.
In speech recognition,
H1 could be the speech model, H2 could be the non-speech
(garbage) model.
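A minimal sketch of the likelihood ratio test, with one-dimensional Gaussian stand-ins for the speech and garbage models (all parameters here are invented for illustration):

```python
import math

def gaussian_loglik(x, mean, var):
    """Log likelihood of a scalar observation under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def lrt(x, speech=(1.0, 1.0), garbage=(0.0, 1.0), log_threshold=0.0):
    """Accept H1 (speech) when log P(x|H1) - log P(x|H2) > log T."""
    llr = gaussian_loglik(x, *speech) - gaussian_loglik(x, *garbage)
    return llr > log_threshold
```

Observations closer to the speech model's mean pass the test; the threshold T trades false accepts against false rejects.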
3 studied papers:
(Suggested) Frank Wessel et al., “Using Word Probabilities as Confidence Measures”
http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
Stephen Cox and Richard Rose, “Confidence Measures for the Switchboard Database”
http://www.ece.mcgill.ca/~rose/papers/cox_rose_icassp96.pdf
Thomas Kemp and Thomas Schaaf, “Estimating Confidence using Word Lattices”
http://overcite.lcs.mit.edu/cache/papers/cs/1116/http:zSzzSzwww.is.cs.cmu.eduzSz~wwwadmzSzpaperszSzspeechzSzEUROSPEECH97zSzEUROSPEECH97-thomas.pdf/kemp97estimating.pdf
Paper chosen because:
it has the clearest math in minute detail, though it is less motivating than Cox’s paper.
Formulation of speech recognition:
P(W | A) = P(A | W) P(W) / P(A)
In decoding, P(A) is ignored because it is a common term:
W* = argmax_W P(A | W) P(W)
Problem:
P(A, W) is only a relative measure;
P(W | A) is the true measure of how probable a word is given the features.
P(A) can only be approximated.
By the law of total probability,
P(A) = sum over all W of P(A, W)
N-best lists and word lattices are therefore used to approximate this sum.
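Approximating P(A) by the sum of joint scores over an N-best list turns relative scores into posteriors. A minimal sketch, assuming the recognizer supplies one joint log score log P(A, W) per hypothesis:

```python
import math

def sentence_posteriors(nbest_log_scores):
    """Approximate P(W|A) = P(A,W) / sum_W' P(A,W'), with the N-best
    sum standing in for the intractable P(A)."""
    m = max(nbest_log_scores)                      # log-sum-exp for stability
    exps = [math.exp(s - m) for s in nbest_log_scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The posteriors sum to one over the list, so the top hypothesis's posterior is directly usable as a confidence score.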
Other ideas: filler/garbage/general speech models -> keyword-spotting tricks.
A threshold on the ratio needs to be found.
The ROC curve always needs to be interpreted manually.
Confidence has been tried at all sorts of levels:
Frame:
frame likelihood ratio.
Phone:
phone likelihood ratio.
Word:
posterior probability -> a kind of likelihood ratio too;
word likelihood ratio.
Sentence:
likelihood.
Word-level confidence performs the best (in terms of CER).
The word-lattice method is slightly more general.
This part of the presentation will focus on the word-lattice-based method.
W_a: word hypothesis preceding w
W_e: word hypothesis succeeding w
Only the hypotheses included in the lattice need to be computed.
An alpha-beta type of computation can be used,
similar to the forward-backward algorithm.
For an end time t:
read “the total posterior probability of paths ending at t whose final word is identical to h”.
A recursive formula applies.
For a begin time t:
one LM score is missing in the definition and is added back later in the computation.
The recursion runs symmetrically.
According to the authors,
posteriors found using the above formulas have poorer discriminative capability,
because word timings from the recognizer are fuzzy.
Segments of 30% overlap are then used instead.
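A simplified sketch of the alpha-beta (forward-backward) computation of word posteriors over a lattice. The edge representation, the integer node numbering, and the omission of the LM term and the time-overlap merging are all my simplifications, not the paper's exact formulation:

```python
import math
from collections import defaultdict

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def word_posteriors(edges, start, end):
    """edges: (word, from_node, to_node, log_score) tuples; nodes are
    assumed to be topologically ordered integers.  The posterior of an
    edge is exp(alpha[from] + score + beta[to] - total)."""
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for e in edges:
        incoming[e[2]].append(e)
        outgoing[e[1]].append(e)
    nodes = sorted({n for e in edges for n in (e[1], e[2])})

    alpha = {start: 0.0}                      # forward pass
    for n in nodes:
        if n != start:
            alpha[n] = logsumexp([alpha[e[1]] + e[3] for e in incoming[n]])
    beta = {end: 0.0}                         # backward pass
    for n in reversed(nodes):
        if n != end:
            beta[n] = logsumexp([e[3] + beta[e[2]] for e in outgoing[n]])

    total = alpha[end]                        # sum over all paths, i.e. ~P(A)
    return {e: math.exp(alpha[e[1]] + e[3] + beta[e[2]] - total) for e in edges}
```

On a toy lattice where two words compete for the same span, their posteriors sum to one, which is exactly the normalization the N-best approximation only achieves approximately.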
Acoustic and language model scores:
both are scaled.
The AM score scale is kept equal to 1;
the LM score is scaled by a factor larger than 1.
The confidence error rate (CER) is computed.
Definition of CER:
# incorrectly assigned tags / # tags
The threshold is optimized on a cross-validation set.
Compared to the baseline:
(Insertions + Substitutions) / Number of recognized words
Results: a 14-18% relative improvement.
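With CER defined as the fraction of wrongly assigned confidence tags, the metric is a one-liner; the tag values here are illustrative:

```python
def confidence_error_rate(predicted_tags, true_tags):
    """CER = # incorrectly assigned confidence tags / # tags.
    Tags are e.g. 'correct'/'incorrect' labels, one per recognized word."""
    wrong = sum(p != t for p, t in zip(predicted_tags, true_tags))
    return wrong / len(true_tags)
```

The baseline tags every recognized word as correct, so its CER equals the fraction of wrongly recognized words.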
Word-based posterior probability is one effective way to compute confidence.
In practice, AM and LM scores need to be scaled appropriately.
Further reading.
Frank Soong et al., “Generalized Word Posterior Probability (GWPP) for Measuring Reliability of Recognized Words”
Background:
A single ASR feature is not the best.
Multiple features can be combined to improve results.
The combination can be done by a machine-learning algorithm.
(Suggested) Timothy Hazen et al., “Recognition Confidence Scoring for Use in Speech Understanding Systems”
http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
Zhang et al.,
http://www.cs.cmu.edu/~rongz/eurospeech_2001_1.pdf
A survey: http://fife.speech.cs.cmu.edu/Courses/11716/2000/Word_Confidence_Annotation.ps
Chase et al., “Word and Acoustic Confidence Annotation for Large Vocabulary Speech Recognition”
http://www.cs.cmu.edu/afs/cs/user/lindaq/mosaic/ca.ps
Paper chosen because:
it is more recent,
and its combination method is motivated by speech recognition.
10-30 features from the acoustic model are listed.
A combination scheme is chosen.
Usually it is based on a machine-learning method, e.g.:
decision tree,
neural network,
support vector machine,
Fisher linear separator,
or any other ML method.
Motivation of the paper:
Decide whether OOV words exist.
Mark potentially mis-recognized words.
What the authors try to do:
Decide whether an utterance should be accepted.
3 different levels of features:
Phonetic-level scoring
Never used directly.
Utterance-level scoring
15 features
Word-level scoring
10 features
From the authors:
Several past works have already shown that phone and frame scores are unlikely to help.
However, phone scores are used to generate the word-level and sentence-level scores.
Scores are normalized by a “catch-all model”.
In other words, a garbage model is used to approximate P(A).
Normalized scores are always used.
1. 1st-best hypothesis total score (AM + LM + PM)
2. 1st-best hypothesis average (word) score: the avg. score per word
3. 1st-best hypothesis total LM score
4. 1st-best hypothesis avg. LM score
5. 1st-best hypothesis total AM score
6. 1st-best hypothesis avg. AM score
7. Difference in total score between the 1st-best and 2nd-best hypotheses
8. Difference in LM score between the 1st-best and 2nd-best hypotheses
9. Difference in AM score between the 1st-best and 2nd-best hypotheses
14. Number of N-best hypotheses
15. Number of words in the 1st-best hypothesis
N-best purity:
the N-best purity for a hypothesized word is the fraction of N-best hypotheses in which that particular word appears in the same location in the sentence,
i.e. # agreements / total.
Similar to ROVER voting on the N-best list.
10. 1st-best hypothesis avg. N-best purity
11. 1st-best hypothesis high N-best purity:
the fraction of words in the top-choice hypothesis with N-best purity greater than one half.
12. Average N-best purity
13. High N-best purity
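The definition above can be computed directly; hypotheses are word lists, and the helper names are mine, not the paper's:

```python
def nbest_purity(hyps, word, position):
    """Fraction of N-best hypotheses containing `word` at `position`."""
    agree = sum(1 for h in hyps if position < len(h) and h[position] == word)
    return agree / len(hyps)

def avg_purity_of_top(hyps):
    """Feature 10: average N-best purity over words of the 1st-best hypothesis."""
    top = hyps[0]
    return sum(nbest_purity(hyps, w, i) for i, w in enumerate(top)) / len(top)
```

Words on which the N-best list agrees get purity near 1; words the recognizer is unsure about churn across hypotheses and score low.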
1. Mean acoustic score -> the mean of the log likelihood
2. Mean acoustic likelihood score -> the mean of the likelihood (not the log likelihood)
3. Minimum acoustic score
4. Standard deviation of the acoustic score
5. Mean difference from the max score:
the average log-likelihood ratio between the acoustic scores of the best path and those from phoneme recognition.
6. Mean catch-all score
7. Number of acoustic observations
8. N-best purity
9. Number of N-best hypotheses
10. Utterance score
Linear separator:
Input: features
Output: a (correct, incorrect) decision
Training process:
1. Fisher linear discriminant analysis is used to produce the first version of the separator.
2. A hill-climbing algorithm is used to minimize the classification error.
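A sketch of the two-step training, assuming numpy; a simple 1-D search over candidate thresholds stands in for the paper's hill-climbing step:

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Step 1: Fisher LDA direction w = Sw^-1 (m+ - m-)."""
    m1, m2 = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # pooled within-class scatter matrix (cov * N recovers the scatter)
    Sw = (np.cov(X_pos.T, bias=True) * len(X_pos)
          + np.cov(X_neg.T, bias=True) * len(X_neg))
    return np.linalg.solve(Sw + 1e-6 * np.eye(len(m1)), m1 - m2)

def best_threshold(scores, labels, steps=200):
    """Step 2: search projected scores for the threshold minimizing
    classification error (a stand-in for the paper's hill climbing)."""
    best_b, best_err = 0.0, float("inf")
    for b in np.linspace(scores.min(), scores.max(), steps):
        err = np.mean((scores > b) != labels)
        if err < best_err:
            best_b, best_err = b, err
    return best_b, best_err
```

Features are projected onto w, and the second step tunes only the bias of the separator, which keeps the error surface one-dimensional.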
IMO, yes,
provided that the breakdown of each feature’s contribution to the CER reduction is given.
E.g., the goodies in the other papers:
in Hazen et al., N-best purity is the most useful feature;
in Lin, LM jitter is the feature that provides the most gain;
in Rong, back-off mode and parsing score provide significant improvement.
Also, Hazen et al. is special because the optimization of the combination is also MCE-trained.
So how things are combined matters too.
25 features were used in this paper:
15 at the utterance level,
10 at the word level.
N-best purity was found to be the most helpful.
Both simple linear-separator training and minimum classification error training were used.
That explains the large relative reduction in error.
ASR output has certain limitations.
When applied in different applications,
ASR confidence needs to be modified or combined with application-specific information.
Dialogue systems:
(Suggested) Dan Bohus, “A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems”
http://www.cs.cmu.edu/~dbohus/docs/dbohus_interspeech05.pdf
Sameer Pradhan and Wayne Ward, “Estimating Semantic Confidence for Spoken Dialogue Systems”
http://oak.colorado.edu/~spradhan/publications/semanticconfidence.pdf
CALL:
Simon Ho and Brian Mak, “Joint Estimation of Thresholds in a Bi-threshold Verification Problem”
http://www.cs.ust.hk/~mak/PDF/eurospeech2003-bithreshold.pdf
Paper chosen because:
it is the most recent,
and representative from a dialogue-system standpoint.
It uses features external to the ASR as confidence features:
dialogue context.
It uses a cost external to the ASR error rate as the optimization criterion:
the cost of misunderstanding,
10% FA/FR.
As most commented,
this usually makes more sense than relying only on ASR features.
The quality of the features also depends on the ASR scores.
Motivation:
“Recognition errors significantly affect the quality and success of the interaction (for the dialogue system).”
The rejection threshold introduces a trade-off between
the number of misunderstandings and
the number of false rejections.
An alternative formulation by the authors:
the user tries to convey concepts to the system.
If the confidence is below the threshold,
the system rejects the utterance and no concept is transferred.
If the confidence is above the threshold,
the system accepts some correct concepts,
but also accepts some incorrect concepts.
“Given the existence of this tradeoff, what is the optimal value for the rejection threshold?”
“this tradeoff” =
the trade-off between correctly and incorrectly transferred concepts.
A generalized linear model, in which
g is the link function; it could be log, logit, identity, or reciprocal.
http://userwww.sfsu.edu/~efc/classes/biol710/Glz/Generalized%20Linear%20Models.htm
Usually used when
the dependent variable is categorical or non-continuous,
or the relationship itself is not linear.
Also used for combining features in ASR;
see Siu, “Improved Estimation, Evaluation and Applications of Confidence Measures for Speech Recognition”,
and BBN systems in general.
Logit(TS) = 0.21 + 2.14 CTC – 4.12 ITC
(TS = task success; CTC = correctly transferred concepts; ITC = incorrectly transferred concepts)
The effect of ITC on the odds is nearly twice that of CTC.
Identify the set of variables A, B, ... involved in the rejection trade-off (e.g. CTC and ITC).
Choose a global dialogue performance metric P to optimize for (e.g. task success, TS).
Fit a model m which relates the trade-off variables to the chosen global dialogue performance metric: P ≈ m(A, B).
Find the threshold which maximizes the performance:
th* = argmax_th P = argmax_th m(A(th), B(th))
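The steps above can be sketched with the slide's fitted model; the CTC(th)/ITC(th) curves passed in below are invented illustrative functions, not the paper's measurements:

```python
import math

def task_success_prob(ctc, itc):
    """Predicted P(task success) from the slide's fitted model:
    logit(TS) = 0.21 + 2.14*CTC - 4.12*ITC."""
    z = 0.21 + 2.14 * ctc - 4.12 * itc
    return 1.0 / (1.0 + math.exp(-z))

def optimal_threshold(thresholds, ctc_of, itc_of):
    """th* = argmax_th m(CTC(th), ITC(th)); ctc_of and itc_of map a
    rejection threshold to measured concept-transfer rates."""
    return max(thresholds,
               key=lambda th: task_success_prob(ctc_of(th), itc_of(th)))
```

With hypothetical curves where raising the threshold trades correctly transferred concepts against incorrectly transferred ones, the sweep lands on an interior optimum rather than either extreme.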
RoomLine system.
Baseline: fixed rejection threshold = 0.3.
Each participant attempted
a maximum of 10 scenario-based interactions.
71 states in the dialogue system.
In general,
the 71 states are manually clustered into 3 types:
Open-request: the system asks an open question,
“How may I help you?”
Request (bool): the system asks a yes/no question,
“Do you want a reservation for this room?”
Request (non-bool): the system requests an answer with more than 2 possible values,
“Starting at what time do you need the room?”
The cost is then optimized for each individual state.
A principled idea for dialogue systems:
logistic regression is used to optimize the rejection threshold.
A neat paper.
Several clever points:
logistic regression;
the use of a metric external to the ASR.
3 different types of ideas in confidence annotation.
Questions:
Which idea should we use?
Could the ideas be combined?
Word posterior probability and LM jitter were found to be very useful in different papers.
Word posterior probability is a generalization of many techniques in the field.
LM jitter could be generalized to other decoder parameters as well.
Utterance scores help word scores.
Combination always helps.
Combination in the ML sense and in the DT sense each gives a chunk of the gain.
Combination methods:
A generalized linear model is easy to interpret and principled.
A linear separator can easily be trained in both the ML and DT senses.
Neural networks and SVMs come with a standard goodie: general non-linear modeling.
Every type of application has its own concerns, which are more important than WER.
Researchers should take the liberty to optimize for them instead of relying on ASR alone.
For an ASR-based system,
Ideas 1 + 2 are wins.
For an application built on top of an ASR-based system,
Ideas 1 + 2 + 3 would be the most helpful.