Frank Wessel et al., “Using Word Probabilities as Confidence Measures”
http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
Timothy Hazen et al., “Recognition Confidence Scoring for Use in Speech Understanding Systems”
http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
Dan Bohus and Alex Rudnicky, “A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems”
http://www.cs.cmu.edu/~dbohus/docs/dbohus_interspeech05.pdf
Gives the system a basis for deciding whether the ASR output can be trusted.
Possible response strategies:
Reject the sentence altogether.
Confirm with the user again.
Both, e.g. a bi-threshold system.
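A bi-threshold system can be sketched as below; the two threshold values are illustrative only, not from any of the papers:

```python
def respond(confidence, low=0.3, high=0.7):
    """Bi-threshold response strategy (threshold values are illustrative)."""
    if confidence < low:
        return "reject"   # discard the hypothesis altogether
    elif confidence < high:
        return "confirm"  # ask the user to confirm
    else:
        return "accept"   # trust the ASR output
```

Utterances between the two thresholds trigger a confirmation rather than an outright rejection.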
Detection of OOV words, e.g.:
if the OOV word is not included in the ASR vocabulary,
Spoken: “What is the forecast for Paramus Park, New Jersey”
Recognized: “What is the forecast for Paris Park, New Jersey”
“Paramus” is OOV, so the system should have low confidence in the phoneme transcription.
Improves speech recognition performance.
Why? In general, the posterior should be used instead of the likelihood.
Does it help? At the 2%-5% relative level.
For each idea, 3 papers were studied.
Only the most representative became the suggested reading.
Results will be quoted from the different papers.
Mathematical Foundation:
Neyman-Pearson Theorem (NPT)
Consequence of NPT:
in general, the likelihood ratio test is the most powerful test for deciding which of two distributions is in force.
H1: Distribution A is in force.
H2: Distribution B is in force.
Compute
P(X | H1) / P(X | H2) ≷ T
and accept H1 if the ratio exceeds the threshold T.
In speech recognition,
H1 could be the speech model, H2 could be the non-speech
(garbage) model.
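A minimal sketch of the likelihood ratio test, with one-dimensional Gaussian stand-ins for the speech and garbage models (all parameters here are invented for illustration):

```python
import math

def gaussian_loglik(x, mean, var):
    """Log likelihood of a scalar observation under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def lrt(x, speech=(1.0, 1.0), garbage=(0.0, 1.0), log_threshold=0.0):
    """Accept H1 (speech) when log P(x|H1) - log P(x|H2) > log T."""
    llr = gaussian_loglik(x, *speech) - gaussian_loglik(x, *garbage)
    return llr > log_threshold
```

Observations closer to the speech model's mean pass the test; the threshold T trades false accepts against false rejects.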
3 studied papers:
(Suggested) Frank Wessel et al., “Using Word Probabilities as Confidence Measures”
http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
Stephen Cox and Richard Rose, “Confidence Measures for the Switchboard Database”
http://www.ece.mcgill.ca/~rose/papers/cox_rose_icassp96.pdf
Thomas Kemp and Thomas Schaaf, “Estimating Confidence using Word Lattices”
http://overcite.lcs.mit.edu/cache/papers/cs/1116/http:zSzzSzwww.is.cs.cmu.eduzSz~wwwadmzSzpaperszSzspeechzSzEUROSPEECH97zSzEUROSPEECH97-thomas.pdf/kemp97estimating.pdf
Paper chosen because:
it has the clearest math in minute detail, though it is less motivating than Cox’s paper.
Formulation of speech recognition:
P(W | A) = P(A | W) P(W) / P(A)
In decoding, P(A) is ignored because it is a common term:
W* = argmax_W P(A | W) P(W)
Problem:
P(A, W) is only a relative measure;
P(W | A) is the true measure of how probable a word is given the features.
P(A) can only be approximated.
By the law of total probability,
P(A) = sum over all W of P(A, W)
N-best lists and word lattices are therefore used to approximate this sum.
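Approximating P(A) by the sum of joint scores over an N-best list turns relative scores into posteriors. A minimal sketch, assuming the recognizer supplies one joint log score log P(A, W) per hypothesis:

```python
import math

def sentence_posteriors(nbest_log_scores):
    """Approximate P(W|A) = P(A,W) / sum_W' P(A,W'), with the N-best
    sum standing in for the intractable P(A)."""
    m = max(nbest_log_scores)                      # log-sum-exp for stability
    exps = [math.exp(s - m) for s in nbest_log_scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The posteriors sum to one over the list, so the top hypothesis's posterior is directly usable as a confidence score.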
Other ideas: filler/garbage/general speech models -> keyword-spotting tricks.
A threshold on the ratio needs to be found.
The ROC curve always needs to be interpreted manually.
Confidence has been tried at all sorts of levels:
Frame:
frame likelihood ratio.
Phone:
phone likelihood ratio.
Word:
posterior probability -> a kind of likelihood ratio too;
word likelihood ratio.
Sentence:
likelihood.
Word-level confidence performs the best (in terms of CER).
The word-lattice method is slightly more general.
This part of the presentation will focus on the word-lattice-based method.
W_a: word hypothesis preceding w
W_e: word hypothesis succeeding w
Only the hypotheses included in the lattice need to be computed.
An alpha-beta type of computation can be used,
similar to the forward-backward algorithm.
For an end time t:
read “the total posterior probability of paths ending at t whose final word is identical to h”.
A recursive formula applies.
For a begin time t:
one LM score is missing in the definition and is added back later in the computation.
The recursion runs symmetrically.
According to the authors,
posteriors found using the above formulas have poorer discriminative capability,
because word timings from the recognizer are fuzzy.
Segments of 30% overlap are then used instead.
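A simplified sketch of the alpha-beta (forward-backward) computation of word posteriors over a lattice. The edge representation, the integer node numbering, and the omission of the LM term and the time-overlap merging are all my simplifications, not the paper's exact formulation:

```python
import math
from collections import defaultdict

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def word_posteriors(edges, start, end):
    """edges: (word, from_node, to_node, log_score) tuples; nodes are
    assumed to be topologically ordered integers.  The posterior of an
    edge is exp(alpha[from] + score + beta[to] - total)."""
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for e in edges:
        incoming[e[2]].append(e)
        outgoing[e[1]].append(e)
    nodes = sorted({n for e in edges for n in (e[1], e[2])})

    alpha = {start: 0.0}                      # forward pass
    for n in nodes:
        if n != start:
            alpha[n] = logsumexp([alpha[e[1]] + e[3] for e in incoming[n]])
    beta = {end: 0.0}                         # backward pass
    for n in reversed(nodes):
        if n != end:
            beta[n] = logsumexp([e[3] + beta[e[2]] for e in outgoing[n]])

    total = alpha[end]                        # sum over all paths, i.e. ~P(A)
    return {e: math.exp(alpha[e[1]] + e[3] + beta[e[2]] - total) for e in edges}
```

On a toy lattice where two words compete for the same span, their posteriors sum to one, which is exactly the normalization the N-best approximation only achieves approximately.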
Acoustic and language model scores:
both are scaled.
The AM score scale is kept equal to 1;
the LM score is scaled by a factor larger than 1.
The confidence error rate (CER) is computed.
Definition of CER:
# incorrectly assigned tags / # tags
The threshold is optimized on a cross-validation set.
Compared to the baseline:
(Insertions + Substitutions) / Number of recognized words
Results: a 14-18% relative improvement.
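With CER defined as the fraction of wrongly assigned confidence tags, the metric is a one-liner; the tag values here are illustrative:

```python
def confidence_error_rate(predicted_tags, true_tags):
    """CER = # incorrectly assigned confidence tags / # tags.
    Tags are e.g. 'correct'/'incorrect' labels, one per recognized word."""
    wrong = sum(p != t for p, t in zip(predicted_tags, true_tags))
    return wrong / len(true_tags)
```

The baseline tags every recognized word as correct, so its CER equals the fraction of wrongly recognized words.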
Word-based posterior probability is one effective way to compute confidence.
In practice, AM and LM scores need to be scaled appropriately.
Further reading.
Frank Soong et al., “Generalized Word Posterior Probability (GWPP) for Measuring Reliability of Recognized Words”
Background:
A single ASR feature is not the best.
Multiple features can be combined to improve results.
The combination can be done by a machine-learning algorithm.
(Suggested) Timothy Hazen et al., “Recognition Confidence Scoring for Use in Speech Understanding Systems”
http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
Zhang et al.,
http://www.cs.cmu.edu/~rongz/eurospeech_2001_1.pdf
A survey: http://fife.speech.cs.cmu.edu/Courses/11716/2000/Word_Confidence_Annotation.ps
Chase et al., “Word and Acoustic Confidence Annotation for Large Vocabulary Speech Recognition”
http://www.cs.cmu.edu/afs/cs/user/lindaq/mosaic/ca.ps
Paper chosen because:
it is more recent,
and its combination method is motivated by speech recognition.
10-30 features from the acoustic model are listed.
A combination scheme is chosen.
Usually it is based on a machine-learning method, e.g.:
decision tree,
neural network,
support vector machine,
Fisher linear separator,
or any other ML method.
Motivation of the paper:
Decide whether OOV words exist.
Mark potentially mis-recognized words.
What the authors try to do:
Decide whether an utterance should be accepted.
3 different levels of features:
Phonetic-level scoring
Never used directly.
Utterance-level scoring
15 features
Word-level scoring
10 features
From the authors:
Several past works have already shown that phone and frame scores are unlikely to help.
However, phone scores are used to generate the word-level and sentence-level scores.
Scores are normalized by a “catch-all model”.
In other words, a garbage model is used to approximate P(A).
Normalized scores are always used.
1. 1st-best hypothesis total score (AM + LM + PM)
2. 1st-best hypothesis average (word) score: the avg. score per word
3. 1st-best hypothesis total LM score
4. 1st-best hypothesis avg. LM score
5. 1st-best hypothesis total AM score
6. 1st-best hypothesis avg. AM score
7. Difference in total score between the 1st-best and 2nd-best hypotheses
8. Difference in LM score between the 1st-best and 2nd-best hypotheses
9. Difference in AM score between the 1st-best and 2nd-best hypotheses
14. Number of N-best hypotheses
15. Number of words in the 1st-best hypothesis
N-best purity:
the N-best purity for a hypothesized word is the fraction of N-best hypotheses in which that particular word appears in the same location in the sentence,
i.e. # agreements / total.
Similar to ROVER voting on the N-best list.
10. 1st-best hypothesis avg. N-best purity
11. 1st-best hypothesis high N-best purity:
the fraction of words in the top-choice hypothesis with N-best purity greater than one half.
12. Average N-best purity
13. High N-best purity
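The definition above can be computed directly; hypotheses are word lists, and the helper names are mine, not the paper's:

```python
def nbest_purity(hyps, word, position):
    """Fraction of N-best hypotheses containing `word` at `position`."""
    agree = sum(1 for h in hyps if position < len(h) and h[position] == word)
    return agree / len(hyps)

def avg_purity_of_top(hyps):
    """Feature 10: average N-best purity over words of the 1st-best hypothesis."""
    top = hyps[0]
    return sum(nbest_purity(hyps, w, i) for i, w in enumerate(top)) / len(top)
```

Words on which the N-best list agrees get purity near 1; words the recognizer is unsure about churn across hypotheses and score low.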
1. Mean acoustic score -> the mean of the log likelihood
2. Mean acoustic likelihood score -> the mean of the likelihood (not the log likelihood)
3. Minimum acoustic score
4. Standard deviation of the acoustic score
5. Mean difference from the max score:
the average log-likelihood ratio between the acoustic scores of the best path and those from phoneme recognition.
6. Mean catch-all score
7. Number of acoustic observations
8. N-best purity
9. Number of N-best hypotheses
10. Utterance score
Linear separator:
Input: features
Output: a (correct, incorrect) decision
Training process:
1. Fisher linear discriminant analysis is used to produce the first version of the separator.
2. A hill-climbing algorithm is used to minimize the classification error.
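A sketch of the two-step training, assuming numpy; a simple 1-D search over candidate thresholds stands in for the paper's hill-climbing step:

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Step 1: Fisher LDA direction w = Sw^-1 (m+ - m-)."""
    m1, m2 = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # pooled within-class scatter matrix (cov * N recovers the scatter)
    Sw = (np.cov(X_pos.T, bias=True) * len(X_pos)
          + np.cov(X_neg.T, bias=True) * len(X_neg))
    return np.linalg.solve(Sw + 1e-6 * np.eye(len(m1)), m1 - m2)

def best_threshold(scores, labels, steps=200):
    """Step 2: search projected scores for the threshold minimizing
    classification error (a stand-in for the paper's hill climbing)."""
    best_b, best_err = 0.0, float("inf")
    for b in np.linspace(scores.min(), scores.max(), steps):
        err = np.mean((scores > b) != labels)
        if err < best_err:
            best_b, best_err = b, err
    return best_b, best_err
```

Features are projected onto w, and the second step tunes only the bias of the separator, which keeps the error surface one-dimensional.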
IMO, yes,
provided that the breakdown of each feature’s contribution to the CER reduction is given.
E.g., the goodies in the other papers:
in Hazen et al., N-best purity is the most useful feature;
in Lin, LM jitter is the feature that provides the most gain;
in Rong, back-off mode and parsing score provide significant improvement.
Also, Hazen et al. is special because the optimization of the combination is also MCE-trained.
So how things are combined matters too.
25 features were used in this paper:
15 at the utterance level,
10 at the word level.
N-best purity was found to be the most helpful.
Both simple linear-separator training and minimum classification error training were used.
That explains the large relative reduction in error.
ASR output has certain limitations.
When applied in different applications,
ASR confidence needs to be modified or combined with application-specific information.
Dialogue systems:
(Suggested) Dan Bohus, “A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems”
http://www.cs.cmu.edu/~dbohus/docs/dbohus_interspeech05.pdf
Sameer Pradhan and Wayne Ward, “Estimating Semantic Confidence for Spoken Dialogue Systems”
http://oak.colorado.edu/~spradhan/publications/semanticconfidence.pdf
CALL:
Simon Ho and Brian Mak, “Joint Estimation of Thresholds in a Bi-threshold Verification Problem”
http://www.cs.ust.hk/~mak/PDF/eurospeech2003-bithreshold.pdf
Paper chosen because:
it is the most recent,
and representative from a dialogue-system standpoint.
It uses features external to the ASR as confidence features:
dialogue context.
It uses a cost external to the ASR error rate as the optimization criterion:
the cost of misunderstanding,
10% FA/FR.
As most commented,
this usually makes more sense than relying only on ASR features.
The quality of the features also depends on the ASR scores.
Motivation:
“Recognition errors significantly affect the quality and success of the interaction (for the dialogue system).”
The rejection threshold introduces a trade-off between
the number of misunderstandings and
the number of false rejections.
An alternative formulation by the authors:
the user tries to convey concepts to the system.
If the confidence is below the threshold,
the system rejects the utterance and no concept is transferred.
If the confidence is above the threshold,
the system accepts some correct concepts,
but also accepts some incorrect concepts.
“Given the existence of this tradeoff, what is the optimal value for the rejection threshold?”
“this tradeoff” =
the trade-off between correctly and incorrectly transferred concepts.
A generalized linear model, in which
g is the link function; it could be log, logit, identity, or reciprocal.
http://userwww.sfsu.edu/~efc/classes/biol710/Glz/Generalized%20Linear%20Models.htm
Usually used when
the dependent variable is categorical or non-continuous,
or the relationship itself is not linear.
Also used for combining features in ASR;
see Siu, “Improved Estimation, Evaluation and Applications of Confidence Measures for Speech Recognition”,
and BBN systems in general.
Logit(TS) = 0.21 + 2.14 CTC – 4.12 ITC
(TS = task success; CTC = correctly transferred concepts; ITC = incorrectly transferred concepts)
The effect of ITC on the odds is nearly twice that of CTC.
Identify the set of variables A, B, ... involved in the rejection trade-off (e.g. CTC and ITC).
Choose a global dialogue performance metric P to optimize for (e.g. task success, TS).
Fit a model m which relates the trade-off variables to the chosen global dialogue performance metric: P ≈ m(A, B).
Find the threshold which maximizes the performance:
th* = argmax_th P = argmax_th m(A(th), B(th))
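The steps above can be sketched with the slide's fitted model; the CTC(th)/ITC(th) curves passed in below are invented illustrative functions, not the paper's measurements:

```python
import math

def task_success_prob(ctc, itc):
    """Predicted P(task success) from the slide's fitted model:
    logit(TS) = 0.21 + 2.14*CTC - 4.12*ITC."""
    z = 0.21 + 2.14 * ctc - 4.12 * itc
    return 1.0 / (1.0 + math.exp(-z))

def optimal_threshold(thresholds, ctc_of, itc_of):
    """th* = argmax_th m(CTC(th), ITC(th)); ctc_of and itc_of map a
    rejection threshold to measured concept-transfer rates."""
    return max(thresholds,
               key=lambda th: task_success_prob(ctc_of(th), itc_of(th)))
```

With hypothetical curves where raising the threshold trades correctly transferred concepts against incorrectly transferred ones, the sweep lands on an interior optimum rather than either extreme.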
RoomLine system.
Baseline: fixed rejection threshold = 0.3.
Each participant attempted
a maximum of 10 scenario-based interactions.
71 states in the dialogue system.
In general,
the 71 states are manually clustered into 3 types:
Open-request: the system asks an open question,
“How may I help you?”
Request (bool): the system asks a yes/no question,
“Do you want a reservation for this room?”
Request (non-bool): the system requests an answer with more than 2 possible values,
“Starting at what time do you need the room?”
The cost is then optimized for each individual state.
A principled idea for dialogue systems:
logistic regression is used to optimize the rejection threshold.
A neat paper.
Several clever points:
logistic regression;
the use of a metric external to the ASR.
3 different types of ideas in confidence annotation.
Questions:
Which idea should we use?
Could the ideas be combined?
Word posterior probability and LM jitter were found to be very useful in different papers.
Word posterior probability is a generalization of many techniques in the field.
LM jitter could be generalized to other decoder parameters as well.
Utterance scores help word scores.
Combination always helps.
Combination in the ML sense and in the DT sense each gives a chunk of the gain.
Combination methods:
A generalized linear model is easy to interpret and principled.
A linear separator can easily be trained in both the ML and DT senses.
Neural networks and SVMs come with a standard goodie: general non-linear modeling.
Every type of application has its own concerns, which are more important than WER.
Researchers should take the liberty to optimize for them instead of relying on ASR alone.
For an ASR-based system,
Ideas 1 + 2 are wins.
For an application built on top of an ASR-based system,
Ideas 1 + 2 + 3 would be the most helpful.