
Ideas for Next-Generation ASR

1 Model the Whole Speech Signal

2 Handle Mixtures

3 Respect Diversity

4 Other Remarks

Dan Ellis <dpwe@ee.columbia.edu>

Laboratory for Recognition and Organization of Speech and Audio

(LabROSA)

Columbia University, New York http://labrosa.ee.columbia.edu/

Dan Ellis Ideas for Next-Generation ASR (1 of 21) 2003-10-07

Outline

1 Model the Whole Speech Signal

- Channel, accent, style

- Timing/rate variation

- Coarticulation

2 Handle Mixtures

3 Respect Diversity

4 Other Remarks


1 Model the Whole Speech Signal

• HMM is a relatively weak model for speech

[Figure: feature trajectory sampled from an HMM, 0–5 sec]

- generates something speechlike, but

- missing detail of real speech (...)

- exponential segment durations

• Only meant for inference of p(X | M)

- to choose between a few Ms

• What would it take to model entire signal?

- capture perceptually sufficient information

- e.g. speech coding quality
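The weak duration modeling is easy to see: a state's self-loop makes dwell times geometric (the discrete exponential), so the most probable segment length is a single frame. A minimal sketch (the self-loop value and frame rate are illustrative assumptions, not from the slide):

```python
import random

def sample_state_duration(self_loop_prob, rng):
    """Sample how many frames an HMM state emits before exiting.
    With self-loop probability a, P(duration = d) = a**(d-1) * (1 - a):
    a geometric (discrete exponential) distribution."""
    d = 1
    while rng.random() < self_loop_prob:
        d += 1
    return d

rng = random.Random(0)
a = 0.9  # assumed self-loop probability at a 10 ms frame rate
durations = [sample_state_duration(a, rng) for _ in range(10000)]
mean_dur = sum(durations) / len(durations)
# Mean duration is 1/(1-a) = 10 frames, but the mode is 1 frame:
# very short segments are the most probable, unlike real phone
# durations, whose distribution peaks well away from zero.
```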


Channel, Accent, Style

• Factors affecting spectral distributions

- just absorbed into model variance?

- or: adapted generically e.g. MLLR

• Channel

- typically fixed per session

• Accent

- typically fixed per speaker

• Style

- particular to application?

• Modeling these factors explicitly

- improves generalization: reduces the variance of the models, hence..

- allows better discrimination of voice from other random stuff
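One concrete example of modeling such a factor explicitly: a fixed per-session channel is a constant offset in the log/cepstral domain, so subtracting the per-utterance mean removes it. This sketch illustrates the channel case only (it is not the MLLR adaptation mentioned above); the data are invented:

```python
import numpy as np

def cepstral_mean_norm(cepstra):
    """Remove a fixed per-session channel by subtracting the mean.
    A fixed linear channel multiplies the spectrum, so it adds a
    constant offset in the cepstral domain; subtracting the mean
    over frames cancels that offset exactly."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))        # 200 frames of 13-dim cepstra
channel = rng.normal(size=(1, 13)) * 3.0  # fixed per-session offset
observed = clean + channel

restored = cepstral_mean_norm(observed)
# The channel is gone: restored equals mean-normalized clean speech
residual = np.abs(restored - cepstral_mean_norm(clean)).max()
```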


Timing/Rate variation

• Timing is weakly modeled with current HMMs

- duration models have little influence on WER

• Some baseline variation

- small improvements with rate-dependent models

• Greatest variation is within the phrase?

[Figure: spectrogram (0–3 sec, 0–4000 Hz) with aligned phone labels (s ow ay th aa dx ax b aw ...) and word labels: SO I THOUGHT ABOUT THAT AND I THINK IT'S STILL POSSIBLE]

- models for within-phrase rate-contours

- use as constraints on recognition

(contrast likelihoods of alternate hypotheses e.g. in rescoring)
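The rescoring idea can be made concrete: under the HMM's implicit geometric duration model, two alignments of the same region that use the same total number of frames receive identical duration scores, while an explicit duration model separates a steady rate contour from an implausible one. All durations and model parameters below are invented for illustration:

```python
import math

def geometric_logprob(dur, mean_dur):
    """log P(dur) under an HMM self-loop (geometric) duration model."""
    a = 1.0 - 1.0 / mean_dur  # self-loop probability giving this mean
    return (dur - 1) * math.log(a) + math.log(1.0 - a)

def gaussian_logprob(dur, mean_dur, std_dur):
    """log P(dur) under an explicit (Gaussian) duration model."""
    z = (dur - mean_dur) / std_dur
    return -0.5 * z * z - math.log(std_dur * math.sqrt(2 * math.pi))

def rescore(durs, model):
    return sum(model(d) for d in durs)

# Two hypothesized alignments of the same 30-frame region:
hyp_even = [10, 10, 10]    # steady, plausible speaking rate
hyp_skewed = [1, 1, 28]    # implausible within-phrase rate contour

even_geo = rescore(hyp_even, lambda d: geometric_logprob(d, 10))
skew_geo = rescore(hyp_skewed, lambda d: geometric_logprob(d, 10))
even_gau = rescore(hyp_even, lambda d: gaussian_logprob(d, 10, 3))
skew_gau = rescore(hyp_skewed, lambda d: gaussian_logprob(d, 10, 3))
# The geometric score depends only on the total frame count, so it
# cannot separate the hypotheses; the explicit model strongly
# prefers the steady alignment.
```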


Coarticulation

• HMMs are piecewise-constant feature models

- ... at least with Viterbi decoding
- delta features can map trajectories to more constant values, but just a ‘patch’

• More states, more context-dependence reduces mismatch

- but model is too general: adjacent states are in fact strongly related

- consequence: insatiable hunger for 1000s of hours of training data

• Generative models of coarticulation not particularly hard

- e.g. HDM, SSM (Deng, Bridle, ...)
- inference is hard...
- ... but many new techniques from the Machine Learning community (MCMC, variational, ...)
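The generative idea can be sketched in a few lines: a first-order dynamic that low-pass filters toward phone targets produces the smooth, undershooting trajectories that piecewise-constant HMM means cannot. The targets and time constant here are invented, and this is only a caricature of HDM/SSM-style models:

```python
import numpy as np

def hmm_trajectory(targets, dur):
    """Piecewise-constant HMM mean trajectory: jumps between targets."""
    return np.repeat(targets, dur)

def dynamic_trajectory(targets, dur, alpha=0.3):
    """First-order hidden-dynamic sketch: the state is low-pass
    filtered toward each phone target, so transitions are smooth
    and short segments undershoot their targets -- coarticulation."""
    x, out = targets[0], []
    for t in np.repeat(targets, dur):
        x = x + alpha * (t - x)  # drift a fraction of the way to target
        out.append(x)
    return np.array(out)

targets = np.array([0.0, 1.0, 0.0])  # e.g. a formant target per phone
hmm = hmm_trajectory(targets, 5)
dyn = dynamic_trajectory(targets, 5)
# hmm jumps instantaneously at boundaries; dyn moves smoothly and
# never quite reaches the middle target in only 5 frames.
peak = dyn.max()
```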


Outline

1 Model the Whole Speech Signal

2 Handle Mixtures

- Frontier applications

- Auditory Scene Analysis

- Multisource models

3 Respect Diversity

4 Other Remarks


2 Handle Mixtures

• Historically, speech recognition was made tractable by limiting domain to pure speech

- limited problem still hard enough

- as a consequence: systems discard information that distinguishes speech/nonspeech

• Many (most?) “frontier applications” involve nonspeech and mixtures

- meeting recordings

- multimedia indexing

- “Lifelog” audio diary


Approaches to handling sound mixtures

• Separate signals , then recognize

- Computational Auditory Scene Analysis (CASA),

Independent Component Analysis

- nice, if you can make it work

• Recognize combined signal

- ‘multicondition training’

- combinatorics seem daunting

• Recognize with parallel models

- optimal inference from full joint state-space


- or: skip obscured fragments, infer from higher-level context

- or do both: missing-data recognition


Missing Data Recognition

(Barker, Cooke & Ellis ’03)

• Can evaluate speech models p(x | m) over a subset of k present dimensions x_k, marginalizing the unobserved dimensions x_u:

p(x_k | m) = ∫ p(x_k, x_u | m) dx_u

- when the observation y gives an upper bound on the masked dimensions (x_u < y), integrate only up to that bound:

p(x_k, x_u < y | m) = ∫_{-∞}^{y} p(x_k, x_u | m) dx_u

• Hence, missing data recognition: given a present-data mask over the time-frequency plane, each state likelihood is a product over only the present dimensions, e.g.

P(x | q) = P(x_1 | q) · P(x_2 | q) · P(x_3 | q) · P(x_4 | q) · P(x_5 | q) · P(x_6 | q)

- .. but need the segregation mask

• Fit model and segregation given observation:

P(M, S | Y) = P(M) ∫ P(X | Y, S) dX · P(S | Y)


Multi-source decoding

• Search for more than one source

[Figure: observed mixture Y(t) decomposed into sources S1(t), S2(t) with state sequences q1(t), q2(t)]

• Mutually-dependent data masks

• Use e.g. CASA features to propose masks

- locally coherent regions

• Lots of issues in models, representations, matching, inference...
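A toy illustration of inference over the full joint state-space: decoding searches the product of the sources' state sets. All spectra below are invented, and the log-max rule (observed log-energy ≈ the larger source's log-energy per channel) is a common mixture approximation assumed here, not necessarily the one the slide intends:

```python
import itertools

# Two toy sources, each with a few spectral states over 4 channels;
# values are invented log-energies, not trained models.
speech_states = {"s0": [5, 1, 1, 1], "s1": [1, 5, 1, 1], "s2": [1, 1, 5, 1]}
noise_states = {"n0": [1, 1, 1, 5], "n1": [3, 3, 1, 1]}

def joint_fit(obs, spec_a, spec_b):
    """Score a state pair under the log-max mixture approximation:
    per channel, the observed log-energy is roughly the larger of
    the two sources' log-energies."""
    return -sum((o - max(a, b)) ** 2 for o, a, b in zip(obs, spec_a, spec_b))

obs = [5, 3, 3, 5]  # one mixture frame
pairs = list(itertools.product(speech_states.items(), noise_states.items()))
best = max(pairs, key=lambda p: joint_fit(obs, p[0][1], p[1][1]))
best_pair = (best[0][0], best[1][0])  # the pair that best explains the frame
n_joint = len(pairs)  # 3 x 2 joint states: the search space is the
                      # product of the sources' state sets, which is
                      # what makes exact joint inference costly
```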


Outline

1 Model the Whole Speech Signal

2 Handle Mixtures

3 Respect Diversity

- Class-specific classifiers

- Using different information differently

4 Other Remarks


3 Respect Diversity

• Speech signal is very diverse

- different kinds of phonemes (vowels, stops...)

- different kinds of information (lexical, affective...)

- different timescales (phones, words, phrases...)

• Information needs are diverse

- phoneme classification

- syllable detection

- phrase detection

• Technical approaches are diverse

- the more different they are, the bigger the gain from combination

- ‘Rover effect’
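The ‘Rover effect’ can be sketched with word-level voting. Real ROVER first aligns the hypotheses with dynamic programming; this toy version assumes the alignment has already happened, and the hypotheses are invented:

```python
from collections import Counter

def rover_vote(hypotheses):
    """Word-level majority voting across aligned system outputs:
    the more the systems' errors differ, the more often the
    majority is right."""
    out = []
    for words in zip(*hypotheses):
        out.append(Counter(words).most_common(1)[0][0])
    return out

# Five already-aligned hypotheses; each system makes a different
# single error, so voting recovers the correct string:
hyps = [
    "the cat sat on the mat".split(),
    "the cap sat on the mat".split(),
    "the cat sad on the mat".split(),
    "the cat sat in the mat".split(),
    "the cat sat on the hat".split(),
]
consensus = rover_vote(hyps)
```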


Finding the Information in Speech

(Scanlon & Ellis, Eurospeech ’03)

• Mutual Information in time-frequency:

[Figure: mutual information in the time-frequency plane (MI / bits vs. time / ms): panels for All phones, Stops, Vowels, and Speaker (vowels)]

• Use MI to select classifier input features

[Figure: frame accuracy vs. number of features (9 to 133 ftrs) for IRREG, IRREG+CHKBD, RECT, RECT+CHKBD, and RECT (all frames) feature masks]
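A minimal sketch of MI-based feature selection: estimate each feature's mutual information with the phone label and keep the top-ranked ones. The frames below are invented toy data (the real maps are estimated over time-frequency cells from labeled speech):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from paired discrete observations."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy frames: feat0 tracks the phone label exactly, feat1 is
# statistically independent of it -- invented for illustration.
labels = ["v", "v", "s", "s", "v", "s", "v", "s"]
feat0 = [1, 1, 0, 0, 1, 0, 1, 0]   # informative
feat1 = [1, 0, 1, 0, 1, 0, 0, 1]   # uninformative

mi0 = mutual_information(feat0, labels)
mi1 = mutual_information(feat1, labels)
# Rank features by MI with the label and keep the top ones:
ranked = sorted([("feat0", mi0), ("feat1", mi1)], key=lambda t: -t[1])
```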


Using different information differently

• Integrating paradigms have lots of power

- e.g. HMM does it all: time warp ... LM
- ... but can we gain by breaking up the tasks?
- separate vowel center detection & consonant “adornment” classification → “Event-based” recognition
- separate specialized detectors

- acoustic-phonetic features .. or data-derived near-equivalents from e.g. Independent Component Analysis

[Figure: specialized detector outputs for TIMIT utterance test/dr1/faks0/sa2 with aligned vowel labels (ow ae iy eh r oy ay dh); ICA basis vectors plotted against frequency / Bark]

Outline

1 Model the Whole Speech Signal

2 Handle Mixtures

3 Respect Diversity

4 Other Remarks

- Combinations & infrastructure

- How much data?

- Blackboards


4 Other Remarks: Different ways to combine systems

• Combine after each stage of the recognizer:

- Feature combination: input sound → Feature 1 & Feature 2 calculation → combined speech features → acoustic classifier → phone probabilities → HMM decoder → word hypotheses
(10% RER improvement, 2 streams)

- Posterior combination: each feature stream gets its own acoustic classifier; the phone probabilities are combined before a single HMM decoder
(20% RER improvement, 2 streams)

- Hypothesis combination: each stream runs its own acoustic classifier and HMM decoder; the word hypotheses are combined into final hypotheses
(25% RER improvement, 5-stream ROVER)
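The posterior-combination stage can be sketched as a frame-level merge of two streams' phone posteriors. Averaging in the log domain is one simple rule, assumed here for illustration (the actual rule behind the quoted RER figures is not specified on this slide):

```python
import numpy as np

def combine_posteriors(post_a, post_b):
    """Combine two streams' phone posteriors for one frame by
    averaging in the log domain (a product-of-experts-style rule),
    then renormalizing to a proper distribution."""
    log_avg = 0.5 * (np.log(post_a) + np.log(post_b))
    p = np.exp(log_avg)
    return p / p.sum(axis=-1, keepdims=True)

# One frame, 3 phone classes; each stream is confused differently
# (illustrative numbers):
stream_a = np.array([0.6, 0.3, 0.1])
stream_b = np.array([0.5, 0.1, 0.4])
combined = combine_posteriors(stream_a, stream_b)
best = int(np.argmax(combined))  # both streams lean toward class 0,
                                 # so the combination reinforces it
```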


Combining modeling techniques:

‘Tandem’ acoustic modeling

(Hermansky, Ellis, Sharma, ICASSP’00)

• To combine Neural Net models with HMMs:

- hybrid path: input sound → feature calculation → speech features → neural net model → phone probabilities → posterior decoder → words
- Tandem path: the net's pre-nonlinearity outputs → PCA orthogonalization → orthogonal features → HTK GM model → subword likelihoods → HTK decoder → words

• Result: better performance than either alone!

- Tandem alone: 20% RER improvement

- Posterior combination + Tandem: 50% RER improvement

• Excellent infrastructure for feature experiments

- nets are tolerant of feature eccentricities

- e.g. MSG features → HTK has double the WER of the Tandem version (MSG → net → HTK)
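The Tandem front end can be sketched in a few lines. The net and the HTK back end are stubbed out here, leaving just the pre-nonlinearity → PCA step; random data stands in for net outputs, and the dimensions are illustrative:

```python
import numpy as np

def tandem_features(net_pre_nonlinearity, n_components):
    """Turn a net's pre-nonlinearity phone outputs into decorrelated
    features for a Gaussian-mixture HTK system: center the frames,
    then project onto the top PCA directions (computed via SVD)."""
    x = net_pre_nonlinearity - net_pre_nonlinearity.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 40))   # 500 frames of 40 phone outputs
feats = tandem_features(frames, 24)   # keep 24 orthogonal dimensions
cov = feats.T @ feats / len(feats)
off_diag = np.abs(cov - np.diag(np.diag(cov))).max()
# PCA outputs are decorrelated, which suits the diagonal-covariance
# Gaussians typically used in the HTK back end.
```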


How Much Data?

• Near-unanimous calls for more data

- sure-fire way to improve accuracy .. a little

- labeled data is expensive, hence limited

• How much data do we need ?

- see examples of ‘all’ speech variants?
- infant example: 6 hr/day = 2,000 hr/yr = 10,000 hr by age 5 (Moore graph)
- brute-force recognition-by-matching: every possible syllable? word? phrase? × voices
100 voices × 2k syllables × 4/sec × 100 contexts = 1,600 hr (6k syl/hr)
(but: distribution of examples)

• What about generalization?

- goal should be abstraction of patterns from the example corpus
- i.e. marginals, not the full volume of examples


Blackboards?

[Figure: blackboard architecture: a scheduler chooses the next action from actions & plans, guided by a problem-solving model, operating on hypotheses on the blackboard]

• Events, tiers, hypothesis generation & verification = the Hearsay blackboard (1973)

• What went wrong last time?

- bad knowledge, blame allocation
- inefficient decoding
- how to incorporate training?

• So, this time around?

- new mechanisms for blame?
- inefficiencies don't matter so much?
- induction of rules from data?


Summary & Conclusions

• Accept that sound is often/usually a mixture

- combine models and/or carve up features

• Use more detailed models of speech

- so we can still recognize after carving up

• Tandem models as enabling infrastructure

- able to glean value from wacky features

• Find novel approaches for recognition

- vowel nuclei + adornments?
