

Ideas in Confidence Annotation

Arthur Chan

Three papers for today

Frank Wessel et al,

“Using Word Probabilities as Confidence Measures”


Timothy Hazen et al,

“Recognition Confidence Scoring for Use in Speech Understanding Systems”


Dan Bohus and Alex Rudnicky

“A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue System”


Application of Confidence


Gives the system a basis for deciding whether the ASR output can be trusted.

- Possible response strategies
  Reject the sentence altogether.
  Confirm with the user again.
  Both, e.g. a bi-threshold system.

Detection of OOV words, e.g.
  If the ASR vocabulary doesn’t include the word:
  “What is the focus for paramus park new jersey”
  “What is the forecast for paris park new jersey”
- Paramus is OOV, so the system should not be confident about the phoneme transcription.

Improve speech recognition performance

- Why? In general, the posterior should be used instead of the likelihood.
- Does it help? Gains are at the 2%-5% relative level.

How this seminar proceeds

- For each idea, 3 papers are studied.
- Only the most representative one becomes the suggested reading.
- Results will be quoted from different papers.


- Mathematical Foundation: Neyman-Pearson Theorem (NPT)

Consequence of NPT

In general, the likelihood ratio test is the most powerful test for deciding which one of two distributions is in force.
H1: Distribution A is in force.
H2: Distribution B is in force.
Decision rule: compare F(H1) / F(H2) with a threshold T; decide H1 if the ratio is above T, otherwise H2.

In speech recognition, H1 could be the speech model and H2 the non-speech (garbage) model.
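A minimal sketch of a likelihood ratio test, only to make the decision rule concrete. The two 1-D Gaussians standing in for the "speech" and "garbage" models, their parameters, and the function names are all assumptions for illustration, not anything taken from the papers.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for F(x | H)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_ratio_test(x, threshold=1.0):
    """Return 'H1' if F(x|H1) / F(x|H2) exceeds the threshold, else 'H2'."""
    # Hypothetical models: H1 = "speech", H2 = "garbage".
    f_h1 = gaussian_pdf(x, mean=2.0, std=1.0)
    f_h2 = gaussian_pdf(x, mean=0.0, std=1.0)
    return "H1" if f_h1 / f_h2 > threshold else "H2"

print(likelihood_ratio_test(1.8))  # observation closer to the H1 model
print(likelihood_ratio_test(0.3))  # observation closer to the H2 model
```

Raising the threshold T makes the test more reluctant to accept H1, which is the knob a rejection threshold turns.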

Idea 1: Belief of a single ASR feature

3 studied papers:

(Suggested) Frank Wessel et al, “Using Word Probabilities as Confidence Measures”



Stephen Cox and Richard Rose, “Confidence Measures for the Switchboard



Thomas Kemp and Thomas Schaaf, “Estimating Confidence using Word





Paper chosen because

It has the clearest math, in minute detail, though it is less motivating than Cox’s paper.

Origins of Confidence Measure in speech recognition

- Formulation of speech recognition
P(W|A) = P(A|W) P(W) / P(A)
In decoding, P(A) is ignored because it is common to all hypotheses:
- W* = argmax_W P(A|W) P(W)

Problem:
P(A,W) is just a relative measure.
P(W|A) is the true measure of how probable W is given the acoustic features.

In reality……

- P(A) can only be approximated.
By the law of total probability,
P(A) = sum of P(A,W) over all W
N-best lists and word lattices are therefore used.
Other ideas: filler/garbage/general speech models -> keyword-spotter tricks
- A threshold on the ratio needs to be found.
- The ROC curve always needs to be interpreted manually.
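A rough sketch of the N-best approximation of P(A): the sum over all W is replaced by a sum over the hypotheses in the list. The function name and the log joint scores are made up for illustration.

```python
import math

def nbest_posteriors(log_joint_scores):
    """Approximate P(W_i | A) for each hypothesis in an N-best list.

    log_joint_scores[i] is log P(A, W_i). P(A) is approximated by the
    sum of P(A, W_i) over the list only, not over all possible W.
    """
    m = max(log_joint_scores)
    # log-sum-exp keeps the sum stable for very negative log scores
    log_p_a = m + math.log(sum(math.exp(s - m) for s in log_joint_scores))
    return [math.exp(s - log_p_a) for s in log_joint_scores]

scores = [-1000.0, -1001.0, -1003.0]  # made-up log P(A, W_i)
post = nbest_posteriors(scores)
print(post)
```

The raw joint scores are only comparable within one utterance; after normalization they sum to one and can be thresholded across utterances.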

Things that people are not confident about

- All sorts of things
- Frame
- Frame likelihood ratio
- Phone likelihood ratio
Posterior probability -> a kind of likelihood ratio too
Word likelihood ratio.
- Likelihood

General Observation from


- Word-level confidence performs the best.
- The word lattice method is slightly more general.
- This part of the presentation will focus on the word-lattice-based method.

Word posterior probability.

Author’s Definition

- W_a: word hypothesis preceding w
- W_e: word hypothesis succeeding w

Computation with lattice

- Only the hypotheses included in the lattice need to be computed.
- An alpha-beta type of computation can be used.
- Similar to the forward-backward algorithm

Forward probability

- For an end time t,
Read: “the total posterior probability of hypotheses ending at t which are identical to h”
- Recursive formula

Backward probability

- For a begin time t
- One LM score is missing from the definition; it is added back later in the computation.
- Recursion

Posterior Computation
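A generic forward-backward sketch on a toy word lattice, to show how forward and backward sums combine into a posterior per lattice edge. This is not the authors' exact formulation (which is indexed by time and word history): here each edge is assumed to carry one log score that already combines the scaled AM and LM scores, node ids are assumed to be in topological order, and every node is assumed to lie on some path from start to final.

```python
import math
from collections import defaultdict

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def edge_posteriors(edges, start, final):
    """edges: list of (src, dst, word, log_score) with src < dst."""
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for idx, (src, dst, _, _) in enumerate(edges):
        outgoing[src].append(idx)
        incoming[dst].append(idx)
    nodes = sorted({n for e in edges for n in e[:2]})

    # Forward pass: log of the summed score of all partial paths start -> n
    alpha = {start: 0.0}
    for n in nodes:
        if n != start:
            alpha[n] = logsumexp([alpha[edges[i][0]] + edges[i][3] for i in incoming[n]])
    # Backward pass: log of the summed score of all partial paths n -> final
    beta = {final: 0.0}
    for n in reversed(nodes):
        if n != final:
            beta[n] = logsumexp([edges[i][3] + beta[edges[i][1]] for i in outgoing[n]])

    total = alpha[final]  # plays the role of log P(A) over the lattice
    return [math.exp(alpha[s] + sc + beta[d] - total) for s, d, _, sc in edges]

lattice = [
    (0, 1, "what", -1.0), (0, 1, "watt", -3.0),
    (1, 2, "is", -0.5),
    (2, 3, "paris", -2.0), (2, 3, "paramus", -2.5),
]
post = edge_posteriors(lattice, start=0, final=3)
for (_, _, word, _), p in zip(lattice, post):
    print(word, round(p, 3))
```

Competing edges over the same span share the probability mass, while an edge that every path passes through ("is" here) gets a posterior of one.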

Practical Implementation

- According to the author
  - Posteriors found using the above formula have poorer discriminative capability.
  - Word timing from the recognizer is fuzzy.
  - Segments with 30% overlap are therefore used.
- Acoustic score and language score
  - Both are scaled.
  - AM score scaled by a factor equal to 1
  - LM score scaled by a factor larger than 1

Experimental Results

- The confidence error rate (CER) is computed.
- Definition of CER
  - # incorrectly assigned tags / # tags
- The threshold is optimized on a cross-validation set.
- Compared to the baseline
  - (Insertions + Substitutions) / number of recognized words
- Results: 14-18% relative improvement
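A small sketch of the confidence error rate, assuming CER counts the wrongly assigned correct/incorrect tags over all recognized words, and that the baseline tags every recognized word as correct. The function names and toy data are illustrative only.

```python
def confidence_error_rate(confidences, is_correct, threshold):
    """Fraction of recognized words whose correct/incorrect tag is wrong."""
    wrong = 0
    for conf, correct in zip(confidences, is_correct):
        tagged_correct = conf >= threshold
        if tagged_correct != correct:
            wrong += 1
    return wrong / len(confidences)

def baseline_cer(is_correct):
    """CER when every recognized word is simply tagged as correct."""
    return sum(1 for c in is_correct if not c) / len(is_correct)

conf = [0.9, 0.8, 0.3, 0.6, 0.2]            # made-up word confidences
truth = [True, True, False, True, False]    # whether each word was right

print(baseline_cer(truth))
print(confidence_error_rate(conf, truth, threshold=0.5))
print(confidence_error_rate(conf, truth, threshold=0.7))
```

A confidence measure is only useful if some threshold brings the CER below the baseline.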


- Word-based posterior probability is one effective way to compute confidence.
- In practice, AM and LM scores need to be scaled appropriately.
- Further reading:
Frank Soong et al, “Generalized Word Posterior Probability (GWPP) For Measuring Reliability of Recognized Words”

Idea 2: Belief in multiple ASR features

- Background
A single ASR feature is not the best.
Multiple features can be combined to improve results.
The combination can be done by a machine-learning algorithm.

Reviewed papers

(Suggested) Timothy Hazen et al, “Recognition Confidence Scoring for Use in Speech Understanding Systems”


Zhang et al,


- A survey:
Chase et al, “Word and Acoustic Confidence Annotation for Large Vocabulary Speech Recognition”


Paper chosen because

- It is more recent.
The combination method is motivated by speech recognition.

General structure of papers in Idea 2

- 10-30 features from the acoustic model are listed.
- A combination scheme is chosen.
- Usually it is based on a machine learning method, e.g.
Decision tree
Neural network
Support vector machine
Fisher linear separator
Or any super-duper ML method.


Motivation of the paper

Decide whether an OOV word exists.
Mark potentially misrecognized words.

What the author tries to do

- Decide whether an utterance should be accepted

3 different levels of feature

Phonetic Level Scoring

Never used

Utterance Level Scoring

15 features

Word Level Scoring

10 features

Phone-Level Scoring

- From the author
- Several past works have already shown that phone and frame scores are unlikely to help.
However, phone scores are used to generate word-level and sentence-level scores.
Scores are normalized by a “catch-all model”.
In other words, a garbage model is used to approximate p(A).
Normalized scores are always used.

Utterance Level Features (the boring group)

1, 1st best hypothesis total score
- (AM + LM + PM)
2, 1st best hypothesis average (word) score
- The avg. score per word.
3, 1st best hypothesis total LM score
4, 1st best hypothesis avg. LM score
5, 1st best hypothesis total AM score
6, 1st best hypothesis avg. AM score
7, Difference of total score between 1st best hyp and 2nd best hyp
8, Difference of LM score between 1st best hyp and 2nd best hyp
9, Difference of AM score between 1st best hyp and 2nd best hyp
14, Number of N-best
15, Number of words in the 1st best hyp.

Utterance Level Features (the interesting group)

- N-best Purity
The N-best purity for a hypothesized word is the fraction of N-best hypotheses in which that particular hypothesized word appears in the same location in the sentence.
- # agreements / total
- Similar to ROVER voting on the N-best list.

Utterance Level Features (the interesting group) (cont.)

- 10, 1st best hypothesis avg. N-best purity
- 11, 1st best hypothesis high N-best purity
- The fraction of words in the top choice hypothesis which have N-best purity greater than one half.
- 12, Average N-best purity
- 13, High N-best purity
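A small sketch of N-best purity and the two features derived from it for the top hypothesis. "Same location" is interpreted here as the same word index in the hypothesis, which is an assumption; a real system might align by time instead. Function names and the toy N-best list are illustrative.

```python
def nbest_purity(nbest, position, word):
    """Fraction of N-best hypotheses that have `word` at index `position`."""
    hits = sum(1 for hyp in nbest if position < len(hyp) and hyp[position] == word)
    return hits / len(nbest)

def top_hypothesis_purity_features(nbest):
    """Return (avg. purity, fraction of words with purity > 0.5) for nbest[0]."""
    top = nbest[0]
    purities = [nbest_purity(nbest, i, w) for i, w in enumerate(top)]
    avg = sum(purities) / len(purities)
    high = sum(1 for p in purities if p > 0.5) / len(purities)
    return avg, high

nbest = [
    "what is the forecast for paris".split(),
    "what is the forecast for paramus".split(),
    "what is the focus for paris".split(),
    "watt is the forecast for paris".split(),
]
avg_purity, high_purity = top_hypothesis_purity_features(nbest)
print(nbest_purity(nbest, 0, "what"), avg_purity, high_purity)
```

Words on which the N-best entries disagree get a lower purity, which is what makes the feature informative.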

Word Level Feature

1, Mean acoustic score -> the mean of the log likelihood
2, Mean acoustic likelihood score -> the mean of the likelihood (not log likelihood)
3, Minimum acoustic score
4, Standard deviation of acoustic score
5, Mean difference from max score
- The average log likelihood ratio between the acoustic scores of the best path and those from phoneme recognition.
6, Mean catch-all score
7, Number of acoustic observations
8, N-best purity
9, Number of N-best
10, Utterance score
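A sketch of how a few of the word-level features could be computed from per-observation scores, assuming each acoustic score is a log likelihood normalized by the catch-all model score. The function name, the dictionary keys, and the choice of population standard deviation are assumptions for illustration.

```python
import math
import statistics

def word_level_features(model_logp, catch_all_logp):
    """Features for one word from per-observation log likelihoods.

    model_logp[i]     : log p(x_i | hypothesized phone model)
    catch_all_logp[i] : log p(x_i | catch-all model)
    """
    # Normalized score = log likelihood ratio against the catch-all model
    scores = [m - c for m, c in zip(model_logp, catch_all_logp)]
    return {
        "mean_acoustic_score": statistics.mean(scores),
        "mean_acoustic_likelihood": statistics.mean([math.exp(s) for s in scores]),
        "min_acoustic_score": min(scores),
        "std_acoustic_score": statistics.pstdev(scores),
        "mean_catch_all_score": statistics.mean(catch_all_logp),
        "num_observations": len(scores),
    }

feats = word_level_features([-1.0, -2.0, -3.0], [-2.0, -2.0, -2.0])
print(feats)
```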

Classifier Training

- Linear Separator
Input: features
Output: (correct, incorrect) pair
- Training process
1, Fisher linear discriminant analysis is used to produce the first version of the separator.
2, A hill climbing algorithm is used to minimize the classification error.
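A toy sketch of the two-step training on 2-D features: a Fisher linear discriminant gives the initial separator, then a greedy hill climb adjusts the weights to lower the count of classification errors. The details of the paper's hill climbing are not given on the slide, so the coordinate search below, the step size, the function names and the data are all assumptions.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(2)]

def fisher_direction(pos, neg):
    """Return [w0, w1, b] with w = Sw^-1 (mu_pos - mu_neg) for 2-D features."""
    mp, mn = mean(pos), mean(neg)
    s = [[0.0, 0.0], [0.0, 0.0]]  # within-class scatter matrix
    for data, mu in ((pos, mp), (neg, mn)):
        for v in data:
            d = [v[0] - mu[0], v[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [mp[0] - mn[0], mp[1] - mn[1]]
    w = [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
         (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det]
    # Bias puts the boundary midway between the projected class means
    b = -0.5 * (w[0] * (mp[0] + mn[0]) + w[1] * (mp[1] + mn[1]))
    return w + [b]

def error_count(params, pos, neg):
    def score(v):
        return params[0] * v[0] + params[1] * v[1] + params[2]
    return sum(score(v) <= 0 for v in pos) + sum(score(v) > 0 for v in neg)

def hill_climb(params, pos, neg, step=0.1, rounds=50):
    """Greedy coordinate search: keep a change only if errors go down."""
    best = list(params)
    best_err = error_count(best, pos, neg)
    for _ in range(rounds):
        improved = False
        for i in range(3):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                err = error_count(trial, pos, neg)
                if err < best_err:
                    best, best_err, improved = trial, err, True
        if not improved:
            break
    return best, best_err

correct = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.6), (0.6, 0.9)]    # made-up features
incorrect = [(0.2, 0.3), (0.3, 0.1), (0.4, 0.4), (0.1, 0.5)]
init = fisher_direction(correct, incorrect)
tuned, errors = hill_climb(init, correct, incorrect)
print(init, errors)
```

The Fisher step optimizes class separation, not the error count itself, which is why a second, error-driven step can still help on harder data.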

Results (Word-Level)

Discussion: Is there any meaning in combination method?

- IMO, yes,
provided that the breakdown of each feature's contribution to the reduction of CER is given.
E.g. the goodies in other papers:
In Hazen et al, N-best purity is the most useful.
In Lin, LM-jitter is the first question that provides the most gain.
In Rong, back-off mode and parsing score provide significant improvement.
Also, Hazen et al is special because the combination is also MCE trained.
- So how things are combined matters too.


- 25 features were used in this paper
15 for utterance level
10 for word level
- N-best purity was found to be the most helpful.
- Both simple linear separator training and minimum classification error training were used.
- That explains the huge relative reduction in error.

Idea 3: Belief in information other than ASR

- ASR output has certain limitations.
- When applied in different applications,
- ASR confidence needs to be modified or combined with application-specific information.

Reviewed Papers

Dialogue System

(Suggested) Dan Bohus, “A principled approach for rejection threshold optimization in Spoken Dialogue System”


Sameer Pradhan, Wayne Ward, “Estimating Semantic Confidence for Spoken Dialogue Systems”



Simon Ho, Brian Mak, “Joint Estimation of Thresholds in a Bithreshold Verification Problem”


Paper chosen because

It is the most recent.
It is representative from a dialogue system standpoint.

Big picture of this type of paper

- Use features external to ASR as confidence features.
  - Dialogue context
- Use costs external to the ASR error rate as the optimization criterion.
Cost of misunderstanding
10% FA/FR
- As most people comment,
  - it usually makes more sense than relying on ASR features alone.
  - The quality of the features also depends on the ASR scores.

Overview of the paper

- Motivation
“Recognition errors significantly affect the quality and success of interaction (for the dialogue system)”
- The rejection threshold introduces a trade-off between
  - the number of misunderstandings
  - the number of false rejections.

Incorrect and Correct Transfer of Concepts

- An alternative formulation by the authors
- The user tries to convey concepts to the system.
- If the confidence is below the threshold,
  - the system rejects the utterance and no concept is transferred.
- If the confidence is above the threshold,
  the system accepts some correct concepts,
  but also accepts some wrong concepts.

Questions the authors want to answer

“Given the existence of this tradeoff, what is the optimal value for rejection threshold?”
“this tradeoff”
- The trade-off between correctly and incorrectly transferred concepts.

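Logistic regression

A generalized linear model, which has the form g(E[y]) = b0 + b1 x1 + ... + bn xn
- g is the link function; it could be log, logit, identity or reciprocal.
- Logistic regression is the case where g is the logit.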



Logistic regression (cont.)

- Usually used when
the dependent variable is categorical or non-continuous,
or the relationship itself is actually not linear.
- Also used to combine features in ASR
See Siu, “Improved Estimation, Evaluation and Applications of Confidence Measures for Speech
- And generally BBN systems

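Impact of Incorrect and Correct Concepts on Task Success

- Logit(TS) = 0.21 + 2.14 CTC - 4.12 ITC
  (TS: task success, CTC: correctly transferred concepts, ITC: incorrectly transferred concepts)
- The weight of an ITC is nearly 2 times that of a CTC (4.12 vs 2.14): one incorrectly transferred concept hurts the odds of task success about as much as two correctly transferred concepts help.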
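A quick sketch that turns the fitted logit into a probability of task success. The coefficients are the ones quoted on the slide; treating CTC and ITC as average counts per turn is an assumption, as is the function name.

```python
import math

def task_success_probability(ctc, itc):
    """P(task success) from Logit(TS) = 0.21 + 2.14 CTC - 4.12 ITC."""
    logit = 0.21 + 2.14 * ctc - 4.12 * itc
    return 1.0 / (1.0 + math.exp(-logit))  # inverse of the logit link

for ctc, itc in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 0.5)]:
    print(ctc, itc, round(task_success_probability(ctc, itc), 3))
```

Because the ITC weight is about twice the CTC weight, adding one CTC and half an ITC leaves the probability nearly unchanged.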

The procedure

- Identify a set of variables A, B, ... involved in the rejection trade-off (e.g. CTC and ITC).
- Choose a global dialogue performance metric P to optimize for (e.g. TS).
- Fit a model m which relates the trade-off variables to the chosen global dialogue performance metric: P <- m(A, B)
- Find the threshold which maximizes the performance:
- th* = arg max_th P = arg max_th m(A(th), B(th))
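The threshold search can be sketched as a sweep over candidate thresholds. This is a toy illustration under assumptions: the metric model reuses the logistic fit quoted in this deck, A(th) and B(th) are taken as average correctly and incorrectly transferred concepts per turn, and the turn data and candidate grid are made up.

```python
import math

def model(ctc, itc):
    """Example m(A, B): predicted probability of task success."""
    return 1.0 / (1.0 + math.exp(-(0.21 + 2.14 * ctc - 4.12 * itc)))

def best_threshold(turns, candidates):
    """turns: list of (confidence, n_correct_concepts, n_incorrect_concepts)."""
    best_th, best_p = None, -1.0
    for th in candidates:
        accepted = [t for t in turns if t[0] >= th]
        # A(th), B(th): concepts transferred per turn at this threshold;
        # rejected turns transfer nothing.
        ctc = sum(t[1] for t in accepted) / len(turns)
        itc = sum(t[2] for t in accepted) / len(turns)
        p = model(ctc, itc)
        if p > best_p:
            best_th, best_p = th, p
    return best_th, best_p

turns = [(0.9, 1, 0), (0.8, 1, 0), (0.6, 1, 0), (0.4, 0, 1), (0.2, 0, 1)]
candidates = [0.0, 0.1, 0.3, 0.5, 0.7]
th_star, p_star = best_threshold(turns, candidates)
print(th_star, round(p_star, 3))
```

Too low a threshold lets incorrect concepts through, too high a threshold throws away correct ones; the sweep picks the point where the modeled metric peaks.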


- RoomLine system
- Baseline: fixed rejection threshold = 0.3
- Each participant attempted
  - a maximum of 10 scenario-based interactions
- 71 states in the dialogue system
- In general

Rejection Optimization

- The 71 states are manually clustered into 3 types:
Open-request: the system asks an open question
“How may I help you?”
Request (bool): the system asks a yes/no question
“Do you want a reservation for this room?”
Request (non-bool): the system requests an answer with more than 2 possible values
“Starting at what time do you need the room?”
- Costs are then optimized for individual states.


Summary of the paper

- A principled idea for dialogue systems.
- Logistic regression is used to optimize the rejection threshold.
- A neat paper.
- Several clever points
  - Logistic regression
  - Using an external metric in the paper


- 3 different types of ideas in confidence annotation
- Questions:
Which idea should we use?
Could the ideas be combined?

Goodies in Idea 1

- Word posterior probability and LM jitter were found to be very useful in different papers.
- Word posterior probability is a generalization of many techniques in the field.
- LM jitter could be generalized to other parameters in the decoder as well.
- Utterance scores help word scores.

Goodies in Idea 2

- Combination always helps.
- Combination in the ML sense and in the DT sense each give a chunk of gain.
- Combination methods:
- A generalized linear model is easy to interpret and principled.
- A linear separator can easily be trained in both the ML and the DT sense.
Neural networks and SVMs come with the standard goodie: general non-linear modeling.

Goodies in Idea 3

- Every type of application has its own concerns, which are more important than WER.
- Researchers should take the liberty to optimize for them instead of relying on ASR alone.


- For an ASR-based system
  - Ideas 1 + 2 are wins.
- For an application built on an ASR-based system
  - Ideas 1 + 2 + 3 would be the most helpful.
