1 Model the Whole Speech Signal
2 Handle Mixtures
3 Respect Diversity
4 Other Remarks
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York http://labrosa.ee.columbia.edu/
Dan Ellis Ideas for Next-Generation ASR (1 of 21) 2003-10-07
Outline
1 Model the Whole Speech Signal
- Channel, accent, style
- Timing/rate variation
- Coarticulation
2 Handle Mixtures
3 Respect Diversity
4 Other Remarks
1 Model the Whole Speech Signal
• HMM is a relatively weak model for speech
[Figure: spectrogram-like output sampled from an HMM, 0-4 kHz over 0-5 s]
- generates something speechlike, but missing the detail of real speech (...)
- exponential segment durations
• Only meant for inference of p(X | M)
- to choose between a few Ms
• What would it take to model entire signal?
- capture perceptually sufficient information
- e.g. speech coding quality
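The "exponential segment durations" point can be seen directly: a single HMM state with self-loop probability a dwells for a geometrically distributed number of frames, with mean 1/(1-a) and a monotonically decaying distribution, unlike measured phone-duration histograms. A minimal sketch (the self-loop value 0.9 is illustrative):

```python
import random

def sample_duration(self_loop_p, rng):
    """Frames spent in one HMM state with self-loop probability self_loop_p."""
    d = 1
    while rng.random() < self_loop_p:
        d += 1
    return d

rng = random.Random(0)
durs = [sample_duration(0.9, rng) for _ in range(100000)]
mean_dur = sum(durs) / len(durs)
# Geometric duration model: mean = 1 / (1 - a) = 10 frames for a = 0.9,
# and P(d) = a**(d-1) * (1-a) decays monotonically from d = 1 -- no mode
# at a typical phone length, unlike real speech.
print(round(mean_dur, 1))
```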
• Factors affecting spectral distributions
- just absorbed into model variance?
- or: adapted generically e.g. MLLR
• Channel
- typically fixed per session
• Accent
- typically fixed per speaker
• Style
- particular to application?
• Modeling these factors explicitly
- improves generalization: reduces the variance of models, hence..
- allows better discrimination of voice from other random stuff
• Timing is weakly modeled with current HMMs
- duration models have little influence on WER
• Some baseline variation
- small improvements with rate-dependent models
• Greatest variation is within-phrase?
[Figure: spectrogram (0-4 kHz, 0-3 s) with time-aligned phone labels (s ow ay th aa dx ax b aw ... l p aa s b ax l) and word labels: SO, I, THOUGHT, THAT, ABOUT, AND, I, THINK, IT'S, STILL, POSSIBLE]
- models for within-phrase rate-contours
- use as constraints on recognition
(contrast likelihoods of alternate hypotheses e.g. in rescoring)
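A sketch of how a within-phrase rate constraint could be applied in rescoring. The penalty function and its weight here are hypothetical stand-ins for a learned rate-contour model; they only illustrate the idea of contrasting the timing plausibility of alternate hypotheses:

```python
import math

def rate_contour_score(durations, smooth=3):
    """Penalty for jagged within-phrase timing: compare each phone's
    log-duration to a moving average over its neighbours (a crude
    stand-in for a trained rate-contour model)."""
    logd = [math.log(d) for d in durations]
    penalty = 0.0
    for i, ld in enumerate(logd):
        lo, hi = max(0, i - smooth), min(len(logd), i + smooth + 1)
        local = sum(logd[lo:hi]) / (hi - lo)
        penalty += (ld - local) ** 2
    return -penalty  # higher (closer to zero) is better

def rescore(hypotheses):
    """hypotheses: list of (acoustic_loglik, [phone durations in frames])."""
    lam = 0.5  # weight on the timing term (would be tuned)
    return max(hypotheses,
               key=lambda h: h[0] + lam * rate_contour_score(h[1]))

# Two hypotheses with equal acoustic score: the one with smoother
# phone timing wins under the rate-contour constraint.
best = rescore([(-100.0, [3, 30, 2, 25, 3]),      # erratic timing
                (-100.0, [10, 12, 11, 13, 12])])  # smooth timing
print(best[1])
```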
• HMMs are piecewise-constant feature models
- ... at least with Viterbi decoding
- delta features can map trajectories to more constant values, but that's just a 'patch'
• More states, more context-dependence reduces mismatch
- but model is too general: adjacent states are in fact strongly related
- consequence: insatiable hunger for 1000s of hours of training data
• Generative models of coarticulation are not particularly hard
- e.g. HDM, SSM (Deng, Bridle, ...)
- inference is hard...
- ... but many new techniques from the Machine Learning community (MCMC, variational, ...)
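The spirit of such models can be sketched with a first-order target-directed dynamic, a toy stand-in for HDM/SSM (the targets, rate constant, and frame counts below are illustrative, not from any cited system):

```python
def hidden_dynamic_trajectory(targets, frames_per_phone=10, alpha=0.7):
    """First-order target-directed dynamics:
    x[t+1] = alpha * x[t] + (1 - alpha) * target.
    The hidden variable never quite reaches each target, so each phone's
    realization depends on its neighbours -- coarticulation by construction."""
    x, traj = targets[0], []
    for tgt in targets:
        for _ in range(frames_per_phone):
            x = alpha * x + (1 - alpha) * tgt
            traj.append(x)
    return traj

traj = hidden_dynamic_trajectory([1.0, 3.0, 1.0], frames_per_phone=5)
# The second phone's peak is pulled below its target 3.0 by the
# neighbouring phones' target 1.0 -- an undershoot an HMM cannot generate.
print(max(traj) < 3.0)
```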
Outline
1 Model the Whole Speech Signal
2 Handle Mixtures
- Frontier applications
- Auditory Scene Analysis
- Multisource models
3 Respect Diversity
4 Other Remarks
2 Handle Mixtures
• Historically, speech recognition was made tractable by limiting domain to pure speech
- limited problem still hard enough
- as a consequence: systems discard information that distinguishes speech/nonspeech
• Many (most?) "frontier applications" involve nonspeech and mixtures
- meeting recordings
- multimedia indexing
- “Lifelog” audio diary
• Separate signals , then recognize
- Computational Auditory Scene Analysis (CASA),
Independent Component Analysis
- nice, if you can make it work
• Recognize combined signal
- ‘multicondition training’
- combinatorics seem daunting
• Recognize with parallel models
- optimal inference from the full joint state-space
- or: skip obscured fragments, infer from higher-level context
- or do both: missing-data recognition
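The daunting combinatorics of joint inference are easy to quantify: decoding N sources in parallel exponentiates the state space. A sketch (the model size is illustrative, not from the talk):

```python
def joint_states(states_per_source, n_sources):
    """Size of the combined state space when n sources are decoded jointly:
    every combination of per-source states is a distinct joint state."""
    return states_per_source ** n_sources

# A modest 3000-state recognizer applied jointly to 1, 2, 3 sources:
for n in (1, 2, 3):
    print(n, joint_states(3000, n))
```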
• Can evaluate speech models p(x | M) over a subset of k dimensions x_k, marginalizing the unobserved dimensions x_u:
p(x_k | M) = ∫ p(x_k, x_u | M) dx_u
- if the mixture bounds the obscured values (x_u < y), use the bounded marginal p(x_k | x_u < y) rather than the plain marginal p(x_k)
• Hence, missing data recognition:
given a present-data mask over the time-frequency plane,
P(x | q) = P(x_1 | q) · P(x_2 | q) · ... · P(x_6 | q)
evaluated over only the present dimensions
.. but need the segregation mask
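A minimal sketch of the masked likelihood, assuming a diagonal-covariance Gaussian state model (so marginalizing a missing dimension just drops its factor from the product):

```python
import math

def missing_data_loglik(x, mask, means, variances):
    """log P(x | q) under a diagonal Gaussian, taken over only the
    dimensions marked present in the mask; missing dimensions are
    marginalized away, which for a diagonal model just drops them."""
    ll = 0.0
    for xi, present, m, v in zip(x, mask, means, variances):
        if present:
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return ll

x         = [0.1, 9.9, 0.2]       # dimension 2 is dominated by an interferer
mask      = [True, False, True]   # present-data mask from segregation
means     = [0.0, 0.0, 0.0]
variances = [1.0, 1.0, 1.0]
clean  = missing_data_loglik(x, [True] * 3, means, variances)
masked = missing_data_loglik(x, mask, means, variances)
print(masked > clean)  # masking the corrupt dimension rescues the match
```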
• Fit model and segregation jointly given the observations:
P(M, S | Y) ∝ P(M) · ∫ [P(X | M) / P(X)] P(X | Y, S) dX · P(S | Y)
(integrating over the possible values X of the obscured source)
• Search for more than one source
[Figure: observed mixture Y(t) explained by two sources S_1(t), S_2(t) with model states q_1(t), q_2(t)]
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
• Lots of issues in models, representations, matching, inference...
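One way to turn "locally coherent regions" into candidate masks is connected-component labeling of a binary time-frequency map. This is a crude stand-in for real CASA grouping cues (harmonicity, onset, continuity), but it shows the fragment-proposal step:

```python
def coherent_regions(mask):
    """Label 4-connected regions of a binary time-frequency mask --
    each region becomes one candidate fragment to assign to a source."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    n = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not labels[r][c]:
                n += 1
                stack = [(r, c)]          # iterative flood fill
                while stack:
                    i, j = stack.pop()
                    if 0 <= i < rows and 0 <= j < cols \
                            and mask[i][j] and not labels[i][j]:
                        labels[i][j] = n
                        stack += [(i + 1, j), (i - 1, j),
                                  (i, j + 1), (i, j - 1)]
    return n, labels

# Two disjoint energy blobs -> two candidate fragments.
tf = [[1, 1, 0, 0],
      [1, 0, 0, 1],
      [0, 0, 1, 1]]
n, labels = coherent_regions(tf)
print(n)
```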
Outline
1 Model the Whole Speech Signal
2 Handle Mixtures
3 Respect Diversity
- Class-specific classifiers
- Using different information differently
4 Other Remarks
3 Respect Diversity
• Speech signal is very diverse
- different kinds of phonemes (vowels, stops...)
- different kinds of information (lexical, affective...)
- different timescales (phones, words, phrases...)
• Information needs are diverse
- phoneme classification
- syllable detection
- phrase detection
• Technical approaches are diverse
- the more different they are, the bigger the gain from combination
- ‘Rover effect’
(Scanlon & Ellis, Eurospeech ’03)
• Mutual Information in time-frequency:
[Figure: MI (bits) between time-frequency cells and labels, mapped over ±200 ms of context: all phones, stops, vowels, speaker identity (vowels)]
• Use MI to select classifier input features:
[Figure: selected feature masks (IRREG, IRREG+CHKBD; 9-133 features) and phone accuracy (~45-70%) vs. number of features for IRREG+CHKBD, RECT+CHKBD, IRREG, RECT, and RECT with all frames]
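The MI criterion used for feature selection can be sketched directly: estimate I(X;Y) between a (quantized) time-frequency feature and the class label, and keep the highest-MI cells. A minimal discrete estimator (the toy data is illustrative):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits for two discrete sequences, e.g. a quantized
    time-frequency cell value vs. a phone label."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Sanity check: a feature that copies the label carries maximal MI;
# an independent feature carries none.
phone_labels = [0, 1, 0, 1, 0, 1, 0, 1]
informative  = phone_labels
unrelated    = [0, 0, 1, 1, 0, 0, 1, 1]
print(round(mutual_information(informative, phone_labels), 2))  # 1.0 bit
print(round(mutual_information(unrelated, phone_labels), 2))    # 0.0 bits
```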
• Integrating paradigms have lots of power
- e.g. the HMM does it all: time warp ... LM
- ... but can we gain by breaking up the tasks?
- separate vowel-center detection & consonant "adornment" classification → "event-based" recognition
- separate specialized detectors
- acoustic-phonetic features .. or data-derived near-equivalents from e.g. Independent Component Analysis
[Figure: detector outputs for TIMIT utterance test/dr1/faks0/sa2, with vowel labels (ow ae iy eh r oy ay dh); ICA basis vectors plotted against frequency / Bark]
Outline
1 Model the Whole Speech Signal
2 Handle Mixtures
3 Respect Diversity
4 Other Remarks
- Combinations & infrastructure
- How much data?
- Blackboards
4 Other Remarks
• Combine after each stage of the recognizer:
[Figure: three architectures for combining two feature streams]
- feature combination: combine the feature streams before a single acoustic classifier → 10% RER improvement (2 streams)
- posterior combination: separate acoustic classifiers, combine phone probabilities before one HMM decoder → 20% RER improvement (2 streams)
- hypothesis combination: separate classifiers and decoders, combine word hypotheses → 25% RER improvement (5-stream ROVER)
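A minimal sketch of the middle option, frame-level posterior combination. The geometric-mean (log-domain averaging) rule shown here is one common choice; the cited systems may use a different combination rule:

```python
import math

def combine_posteriors(p1, p2):
    """Frame-level posterior combination: geometric mean of two streams'
    phone posteriors, renormalized (equivalent to averaging log posteriors)."""
    g = [math.sqrt(a * b) for a, b in zip(p1, p2)]
    z = sum(g)
    return [v / z for v in g]

# Stream 1 is confused between phones 0 and 1; stream 2 disfavours phone 1,
# so the combined posterior comes down on phone 0.
stream1 = [0.45, 0.45, 0.10]
stream2 = [0.60, 0.10, 0.30]
combined = combine_posteriors(stream1, stream2)
print(combined.index(max(combined)))  # phone 0 wins
```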
(Hermansky, Ellis, Sharma, ICASSP’00)
• To combine Neural Net models with HMMs:
[Figure: Tandem pipeline: input sound → feature calculation → speech features → neural net model → pre-nonlinearity outputs → PCA orthogonalization → orthogonal features → HTK GM model → subword likelihoods → HTK decoder → words]
• Result: better performance than either alone!
- Tandem alone: 20% RER improvement
- Posterior combination + Tandem: 50% RER improvement
• Excellent infrastructure for feature experiments
- nets are tolerant of feature eccentricities
- e.g. MSG features → HTK has double the WER of the Tandem version, MSG → net → HTK
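The PCA-orthogonalization stage of the pipeline can be sketched with NumPy. The function name and data below are hypothetical; the point is that decorrelating the net's pre-nonlinearity outputs makes them better suited to the diagonal-covariance Gaussian-mixture stage:

```python
import numpy as np

def tandem_features(pre_nonlin, n_keep):
    """Decorrelate a net's pre-nonlinearity outputs with PCA (the
    'PCA orthogonalization' stage feeding the HTK GM model)."""
    X = pre_nonlin - pre_nonlin.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_keep]   # keep the top components
    return X @ vecs[:, order]

rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 8))
raw[:, 1] = raw[:, 0] + 0.01 * rng.normal(size=500)  # two correlated dims
feats = tandem_features(raw, n_keep=4)
# Off-diagonal correlation is (numerically) gone after orthogonalization.
print(np.abs(np.corrcoef(feats, rowvar=False) - np.eye(4)).max() < 1e-6)
```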
• Near-unanimous calls for more data
- sure-fire way to improve accuracy .. a little
- labeled data is expensive, hence limited
• How much data do we need?
- see examples of ‘all’ speech variants?
- infant example: 6 hr/day = 2000 hr/yr
= 10,000 hr by age 5 (Moore graph)
- brute-force recognition-by-matching: every possible syllable? word? phrase? × voices
- 100 voices × 2k syllables × 4/sec × 100 contexts = 1,600 hr (6k syl/hr)
(but: distribution of examples)
• What about generalization ???
- the goal should be abstraction of patterns from the example corpus
- i.e. marginals, not full volume of examples
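The infant-exposure estimate above checks out as an order-of-magnitude figure:

```python
# Back-of-envelope check of the exposure estimate:
# 6 hr/day of heard speech, every day.
hours_per_year = 6 * 365         # 2190, which the slide rounds to 2,000 hr/yr
by_age_five = 5 * hours_per_year # 10,950, i.e. order 10,000 hr by age 5
print(hours_per_year, by_age_five)
```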
[Figure: blackboard architecture: a scheduler choosing the next action from actions & plans, a problem-solving model, and hypotheses posted on a blackboard]
• Events, tiers, hypothesis generation & verification = the Hearsay blackboard (1973)
Blackboard
• What went wrong last time?
- bad knowledge, blame allocation
- inefficient decoding
- how to incorporate training?
• So, this time around?
- new mechanisms for blame?
- inefficiencies don't matter so much?
- induction of rules from data?
• Accept that sound is often/usually a mixture
- combine models and/or carve up features
• Use more detailed models of speech
- so we can still recognize after carving up
• Tandem models as enabling infrastructure
- able to glean value from wacky features
• Find novel approaches for recognition
- vowel nuclei + adornments?