Sound, Mixtures & Learning

Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York http://labrosa.ee.columbia.edu/
[Figure: spectrogram analysis (0-4000 Hz, 0-12 s, level in dB) of a sound mixture, with events labeled: Voice (evil), Stab, Rumble, Voice (pleasant), Choir, Strings]
• Sound understanding: the key challenge
- what listeners do
- understanding = abstraction
• Applications
- indexing/retrieval
- robots
- prostheses
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects natural scene properties = constraints
- subjective, not absolute
Frequency analysis (Bregman 1990)
• How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram: onset map, harmonicity map, and position map feeding a grouping mechanism (after Darwin, 1996)]
Source properties
• Elements + attributes
[Figure: spectrogram, 0-8000 Hz over 0-9 s, showing sound elements and their attributes]
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
• But: Context ...
Outline
1 Sound Content Analysis
2 Recognizing sounds
- Speech recognition
- Nonspeech
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
• Standard speech recognition structure:
  sound → Feature calculation → feature vectors → Acoustic classifier → phone probabilities → HMM decoder → phone/word sequence → Understanding/application...
  - the classifier uses acoustic model parameters; the decoder uses word models (e.g. s-ah-t) and a language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")
• How to handle additive noise?
- just train on noisy data: ‘multicondition training’
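A minimal sketch of what multicondition training preparation can look like: mix each clean training utterance with noise at several SNRs so the acoustic model sees the target conditions. The function names and SNR list are illustrative, not from the talk.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio is snr_db, then add."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def multicondition_set(speech, noise, snrs=(20, 10, 5, 0)):
    """Return the clean utterance plus noisy copies at each SNR."""
    return [speech] + [mix_at_snr(speech, noise, s) for s in snrs]
```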
• Markov model structure: states + transitions
[Figure: example Markov model M' with states S, A, T, E, K, O; per-state models (means); and state transition probabilities (e.g. self-loops 0.8-0.9, transitions 0.02-0.2)]
• A generative model
- but not a good speech generator!
- only meant for inference of p(X | M)
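To make the p(X | M) inference concrete, here is a minimal log-domain forward algorithm for a discrete-state HMM; a generic sketch (array names and shapes are illustrative), not code from the talk.

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(log_obs, log_trans, log_init):
    """log p(X | M) for an HMM.

    log_obs[t, j]  : log p(x_t | state j), e.g. from an acoustic classifier
    log_trans[i, j]: log p(next state j | state i)
    log_init[i]    : log p(initial state i)
    """
    T, N = log_obs.shape
    alpha = log_init + log_obs[0]        # forward log-probabilities at t=0
    for t in range(1, T):
        # sum over predecessor states, staying in log domain for stability
        alpha = log_obs[t] + logsumexp(alpha[:, None] + log_trans, axis=0)
    return logsumexp(alpha)              # marginalize over final states
```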
(with Marios Athineos)
• Common sound models use 10ms frames
- but: sub-10ms envelope is perceptible
[Figure: 'Fireplace' sound: original spectrogram, TDLP resynthesis, CTFLP resynthesis, and CTFLP resynthesis with 4x time-scale modification (0-4 kHz, 0-4 s)]
• Use a parametric (LPC) model on the spectrum
[Figure: waveform and its fine temporal envelope in subbands 0-0.5, 0.5-1, 1-2, and 2-4 kHz]
• Convert to features for ASR
- improvements esp. for stops
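A rough sketch of the underlying idea, linear prediction applied in a transform domain so that fine temporal envelope structure is modeled: here LPC is fit to the DCT of a signal segment, whose poles then track peaks of the temporal envelope. This shows only the general technique; the exact TDLP/CTFLP formulations from the slide are not reproduced.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """Autocorrelation-method linear prediction coefficients."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:]
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

def temporal_envelope_lpc(segment, order=40):
    # LPC on the DCT of the waveform models the temporal (Hilbert)
    # envelope, dual to ordinary LPC modeling the spectral envelope.
    c = dct(segment, type=2, norm='ortho')
    return lpc(c, order)
```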
(with Patricia Scanlon)
• Mutual Information in time-frequency:
[Figure: mutual information (MI, in bits) between time-frequency cells and target labels vs. time offset (-200 to +200 ms): panels for all phones, stops, vowels, and speaker identity (vowels)]
• Use to select classifier input features
[Figure: MI-selected feature masks (IRREG, IRREG+CHKBD; 9-133 features) and classification accuracy (45-70%) vs. number of features for IRREG+CHKBD, RECT+CHKBD, IRREG, RECT, and RECT (all frames)]
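A histogram-based mutual-information estimate of the kind that could drive such feature selection, sketched under assumptions (equal-occupancy binning of one scalar feature against integer phone labels); this is not necessarily the paper's estimator.

```python
import numpy as np

def mutual_info_bits(feature, labels, n_bins=32):
    """I(feature; label) in bits from a joint histogram."""
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    f = np.clip(np.searchsorted(edges, feature) - 1, 0, n_bins - 1)
    joint = np.zeros((n_bins, labels.max() + 1))
    np.add.at(joint, (f, labels), 1)          # joint counts
    joint /= joint.sum()
    pf = joint.sum(axis=1, keepdims=True)     # feature marginal
    pl = joint.sum(axis=0, keepdims=True)     # label marginal
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pf @ pl)[nz])))

# Rank each (time offset, frequency) cell by MI and keep the top cells
# as classifier inputs.
```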
(Ellis 2001)
• Alarm sounds have particular structure
- people ‘know them when they hear them’
- clear even at low SNRs
[Figure: spectrograms of alarm sounds hrn01, bfr02, buz01 (0-25 s, level in dB)]
• Why investigate alarm sounds?
- they’re supposed to be easy
- potential applications...
• Contrast two systems:
- standard, global features, P(X | M)
- sinusoidal model, fragments, P(M, S | Y)
[Figure: 'Restaurant + alarms' mixture (snr 0, ns 6, al 8): MLP classifier output vs. sound-object classifier output, 25-50 s]
• Both systems commit many insertions at 0 dB SNR, but in different circumstances:
Noise      Neural net system        Sinusoid model system
           Del     Ins   Tot        Del     Ins   Tot
1 (amb)    7/25    2     36%        14/25   1     60%
2 (bab)    5/25    63    272%       15/25   2     68%
3 (spe)    2/25    68    280%       12/25   9     84%
4 (mus)    8/25    37    180%       9/25    135   576%
Overall    22/100  170   192%       50/100  147   197%
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
- Auditory Scene Analysis
- Missing data recognition
- Parallel model inference
4 Accessing large datasets
5 Music Information Retrieval
• Separate signals , then recognize
- e.g. CASA, ICA
- nice, if you can do it
• Recognize combined signal
- ‘multicondition training’
- combinatorics..
• Recognize with parallel models
- full joint-state space?
- or: divide signal into fragments, then use missing-data recognition
(Cooke & Brown 1993)
• Direct implementation of psych. theory:
  input mixture → Front end (signal feature maps: frequency, onset time, period, freq. modulation) → Object formation (discrete objects) → Grouping rules → Source groups
  - 'bottom-up' processing
  - uses common onset & periodicity cues
[Figure: spectrogram of the input mixture brn1h.aif, 100-3000 Hz, 0-1.0 s]
• Able to extract voiced speech:
[Figure: spectrogram of the extracted voiced speech brn1h.fi.aif, 100-3000 Hz, 0-1.0 s]
Perception is not direct but a search for plausible hypotheses
• Data-driven (bottom-up):
  input mixture → Front end (signal features) → Object formation (discrete objects) → Grouping rules → Source groups
  - objects irresistibly appear
vs. prediction-driven (top-down):
  input mixture → Front end (signal features) → Compare & reconcile (prediction errors) → Hypothesis management → hypotheses (noise components, periodic components) → Predict & combine → predicted features, fed back to the comparison
- match observations with parameters of a world-model
- need world-model constraints...
[Figure: prediction-driven analysis of a city-street recording (50-4000 Hz, 0-9 s, level -70 to -40 dB): Wefts 1-12 accounting for Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10); Noise2 + Click1 as Crash (10/10); Noise1 as Squeal (6/10) and Truck (7/10)]
• Source separation requires attribute separation
- sources are characterized by attributes
(pitch, loudness, timbre + finer details)
- need to identify & gather different attributes for different sources ...
• Need representation that segregates attributes
- spectral decomposition
- periodicity decomposition
• Sometimes values can’t be separated
- e.g. unvoiced speech
- maybe infer factors from probabilistic model?
- or: just skip those values, infer from higher-level context
- do both: missing-data recognition
• Speech models p(x|m) are multidimensional...
- i.e. means, variances for every freq. channel
- need values for all dimensions to get p(•)
• But: can evaluate over a subset x_k of dimensions:
  p(x_k | m) = ∫ p(x_k, x_u | m) dx_u
• Hence, missing data recognition:
  p(x_k | x_u < y) — unreliable dimensions enter only as bounds from the observation y
Present-data mask:
  P(x | q) = P(x_1 | q) · P(x_2 | q) · P(x_3 | q) · P(x_4 | q) · P(x_5 | q) · P(x_6 | q)
[Figure: time-frequency mask selecting which dimensions enter the product at each time]
- hard part is finding the mask (segregation)
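A minimal sketch of the missing-data likelihood for one diagonal-Gaussian state: present dimensions contribute full Gaussian terms, and (optionally) masked dimensions contribute the bounded term P(x_u < y | q), using the observed mixture level as an upper bound. Names and the diagonal-Gaussian assumption are illustrative.

```python
import numpy as np
from scipy.stats import norm

def missing_data_loglik(x, mask, mean, var, bound=None):
    """log P(x | q) over the 'present' dimensions of the mask."""
    k = np.asarray(mask, bool)
    d = x[k] - mean[k]
    ll = -0.5 * np.sum(d * d / var[k] + np.log(2 * np.pi * var[k]))
    if bound is not None:
        # masked dims: the source is only known to lie below the observation y
        u = ~k
        ll += np.sum(np.log(norm.cdf(bound[u], mean[u], np.sqrt(var[u]))
                            + 1e-300))
    return ll
```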
• Standard classification chooses between models M to match source features X:
  M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M)
• Mixtures → observed features Y, segregation S, all related by P(X | Y, S)
[Diagram: observation Y(f), source X(f), and segregation mask S, plotted against frequency]
  - spectral features allow a clean relationship
• Joint classification of model and segregation:
  P(M, S | Y) = P(M) · ∫ P(X | M) · P(X | Y, S) / P(X) dX · P(S | Y)
  - probabilistic relation of models & segregation
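A toy illustration of the joint search: enumerate candidate (model, segregation) pairs, score each model on just the spectral cells its mask claims, and add the model and mask priors. This collapses the integral over X (the masked observation stands in directly for the source features), so it is a cartoon of the equation above, not an implementation of it.

```python
import numpy as np
from itertools import product

def masked_loglik(y, mask, mean, var):
    """Diagonal-Gaussian log-likelihood over masked-in cells only."""
    k = np.asarray(mask, bool)
    d = y[k] - mean[k]
    return -0.5 * np.sum(d * d / var[k] + np.log(2 * np.pi * var[k]))

def joint_decode(y, models, masks, log_p_m, log_p_s):
    """argmax over (M, S) of log P(M) + log P(y_S | M) + log P(S | y)."""
    best = (-np.inf, None, None)
    for (i, (mean, var)), (j, s) in product(enumerate(models),
                                            enumerate(masks)):
        score = masked_loglik(y, s, mean, var) + log_p_m[i] + log_p_s[j]
        if score > best[0]:
            best = (score, i, j)
    return best   # (score, model index, mask index)
```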
• Search for more than one source
[Diagram: observation Y(t) jointly explained by masks S1(t), S2(t) tied to state sequences q1(t), q2(t)]
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
• Lots of issues in models, representations, matching, inference...
(Roweis 2000, Manuel Reyes)
• State sequences → t-f estimates → mask
[Figure: Speaker 1 and Speaker 2 (0-4000 Hz, 0-6 s): state-sequence-derived t-f estimates and the resulting masks]
- 1000 states/model (→ 10^6 transition probs.)
- simplify by modeling subbands (coupled HMM)?
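A sketch of the refiltering step itself (after the state sequences have produced per-speaker spectrogram estimates): whichever model predicts more energy claims each time-frequency cell, and the resulting binary masks applied to the mixture STFT resynthesize the two voices. Parameter values are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

def refilter(mixture, est1, est2, fs=16000, nperseg=512):
    """est1, est2: model-based magnitude estimates on the mixture's t-f grid."""
    _, _, Y = stft(mixture, fs, nperseg=nperseg)
    mask = est1 > est2                      # cells where speaker 1 dominates
    _, x1 = istft(Y * mask, fs, nperseg=nperseg)
    _, x2 = istft(Y * ~mask, fs, nperseg=nperseg)
    return x1, x2
```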
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
- Meeting Recordings
- The Listening Machine
5 Music Information Retrieval
(with ICSI, UW, IDIAP, SRI, Sheffield)
• Microphones in conventional meetings
- for summarization / retrieval / behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, IDIAP, NIST):
- ~100 hours collected & transcribed
• NSF ‘Mapping Meetings’ project
(Huan Wei Hee, Jerry Liu)
• Acoustic: triangulate tabletop mic timing differences
- use normalized peak value for confidence
[Figure: example cross-coupling response, chan3 to chan0; PZM cross-correlation lags (±3 ms) over 0-300 s, meeting mr-2000-11-02-1440]
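A minimal version of the cross-correlation timing estimate between two tabletop mics, returning the delay and a normalized peak value as the confidence score described above; window length and lag range are illustrative.

```python
import numpy as np

def tdoa(sig_a, sig_b, fs, max_lag_ms=3.0):
    """Time-difference-of-arrival estimate with normalized-peak confidence."""
    max_lag = int(fs * max_lag_ms / 1000.0)
    xc = np.correlate(sig_a, sig_b, mode='full')
    mid = len(sig_b) - 1                     # index of zero lag
    window = xc[mid - max_lag: mid + max_lag + 1]
    lag = np.argmax(np.abs(window)) - max_lag
    conf = np.max(np.abs(window)) / (np.linalg.norm(sig_a)
                                     * np.linalg.norm(sig_b))
    return lag / fs, conf                    # delay in seconds, confidence
```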
• Behavioral: look for patterns of speaker turns
[Figure: mr04, hand-marked speaker turns vs. time (0-60 min) with automatic/manual boundaries]
• Smart PDA records everything
• Only useful if we have index, summaries
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
• Meeting data, ambulatory audio
• LifeLog / MyLifeBits /
Remembrance Agent:
Easy to record everything you hear
• Then what?
- prohibitively time-consuming to search
- but ... applications if access is easier
• Automatic content analysis / indexing...
[Figure: automatically derived index features for long personal-audio recordings, plotted over ~250 minutes and against clock time 14:30-18:30]
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
- Anchor space
- Playola browser
• Transfer search concepts to music?
- “musical Google”
- finding something specific / vague / browsing
- is anything more useful than human annotation?
• Most interesting area: finding new music
- is there anything on mp3.com that I would like?
- audio is the only information source for new bands
• Basic idea:
Project music into a space where neighbors are “similar”
• Also need models of personal preference
- where in the space is the stuff I like
- relative sensitivity to different dimensions
• Evaluation problems
- requires large, shareable music corpus!
• Recognizing work from each artist is all very well...
• But: what is similarity between artists?
- pattern recognition systems give a number...
[Figure: artist-similarity display with pop artists, e.g. toni_braxton, jessica_simpson, mariah_carey, janet_jackson, eiffel_65, pet_shop_boys, lauryn_hill, roxette, lara_fabian, sade, spice_girls]
• Need subjective ground truth:
Collected via web site www.musicseer.com
• Results:
- 1,000 users, 22,300 judgments collected over 6 months
• A classifier trained for one artist (or genre) will respond partially to a similar artist
• Each artist evokes a particular pattern of responses over a set of classifiers
• We can treat these classifier outputs as a new feature space in which to estimate similarity
[Diagram: audio input (class i, class j) → conversion to anchor space via anchor classifiers p(a_1|x), p(a_2|x), ..., p(a_n|x) → n-dimensional vectors in "Anchor Space" → GMM modeling → similarity computation (KL-d, EMD, etc.)]
• “Anchor space” reflects subjective qualities?
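A sketch of the anchor-space conversion using scikit-learn GMMs as the anchor models: each frame's per-anchor likelihoods are normalized into a posterior-like vector, and artist models are then fit in that space. The anchors, features, and normalization here are assumptions for illustration, not the system's exact recipe.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def to_anchor_space(frames, anchor_gmms):
    """frames: (T, D) features -> (T, n_anchors) normalized anchor scores."""
    logliks = np.stack([g.score_samples(frames) for g in anchor_gmms],
                       axis=1)                     # log p(x | anchor)
    logliks -= logliks.max(axis=1, keepdims=True)  # stabilize exp()
    p = np.exp(logliks)
    return p / p.sum(axis=1, keepdims=True)        # rows sum to 1
```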
• Comparing 2D projections of per-frame feature points in cepstral and anchor spaces:
Cepstral Features Anchor Space Features
[Figure: per-frame feature points for madonna vs. bowie projected onto two cepstral coefficients (left) and two anchor dimensions, e.g. Country (right)]
- each artist represented by a 5-component GMM
- greater separation under MFCCs!
- but: relevant information?
• Browser finds closest matches to single tracks or entire artists in anchor space
• Direct manipulation of anchor space axes
• Are recommendations good or bad?
• Subjective evaluation is the ground truth
- .. but subjects aren’t familiar with the bands being recommended
- can take a long time to decide if a recommendation is good
• Measure match to other similarity judgments
- e.g. musicseer data:
[Figure: top-rank agreement (%) with musicseer similarity judgments (SrvKnw 4789x3.58, SrvAll 6178x8.93, GamKnw 7410x3.96, GamAll 7421x8.92) for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK]
• Sound
- ... contains much valuable information at many levels
- intelligent systems need to use this information
• Mixtures
- .. are an unavoidable complication when using sound
- looking in the right time-frequency place to find points of dominance
• Learning
- need to acquire constraints from the environment
- recognition/classification as the real task
• Broadcast
• Movies
• Lectures
• Meetings
• Personal recordings
• Location monitoring
LabROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
• Structuring
• Search
• Summarization
• Awareness
• Understanding
• Music as a complex, information-rich sound
• Applications of separation & recognition:
- note/chord detection & classification
[Figure: 'DYWMB' alignments to MIDI note 57 mapped to the original audio, with true-voice regions marked; artist set for singing detection: Michael Penn, The Roots, The Moles, Eric Matthews, Arto Lindsay, Oval, Jason Falkner, Built to Spill, Beck, XTC, Wilco, Aimee Mann, The Flaming Lips, Mouse on Mars, Dj Shadow, Richard Davies, Cornelius, Mercury Rev, Belle & Sebastian, Sugarplastic, Boards of Canada]
- singing detection (→ genre identification ...)
[Figure: singing detection for Track 117, Aimee Mann (dynvox=Aimee, unseg=Aimee), over 0-200 s]
(Bell & Sejnowski 1995 et seq.)
• Drive a parameterized separation algorithm to maximize independence of outputs
[Diagram: mixtures m1, m2 passed through unmixing parameters a11, a12, a21, a22 to give outputs s1, s2; mutual information between outputs fed back to adapt a]
• Advantages :
- mathematically rigorous, minimal assumptions
- does not rely on prior information from models
• Disadvantages :
- may converge to local optima...
- separation, not recognition
- does not exploit prior information from models
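A minimal demonstration of the setup with scikit-learn's FastICA: two sources mixed by an unknown 2x2 matrix (a11..a22 in the diagram) are recovered, up to permutation and scaling, by maximizing independence. The sources here are synthetic placeholders.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s = np.c_[np.sin(2 * np.pi * 440 * t),          # source s1: tone
          np.sign(np.sin(2 * np.pi * 3 * t))]   # source s2: square wave
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                      # mixing matrix a11..a22
x = s @ A.T                                     # observed mixtures m1, m2
s_hat = FastICA(n_components=2, random_state=0).fit_transform(x)
# s_hat approximates s up to permutation and scaling; no model priors used
```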