Lecture 1: Introduction & DSP

EE E6820: Speech & Audio Processing & Recognition
Lecture 1:
Introduction & DSP
Dan Ellis <dpwe@ee.columbia.edu>
Mike Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
January 22, 2009
Outline:
1. Sound and information
2. Course Structure
3. DSP review: Timescale modification
Dan Ellis (Ellis & Mandel)
Intro & DSP
January 22, 2009
1 / 33
Sound and information
Sound is air pressure variation
Mechanical vibration → pressure waves in air → motion of a sensor → time-varying voltage v(t)

[Figure: transduction chain from mechanical vibration, through air pressure waves, to a time-varying voltage waveform v(t)]
Transducers convert air pressure ↔ voltage
What use is sound?
Footsteps examples:

[Figure: two footsteps waveforms, amplitude ±0.5, over 0–5 s]
Hearing confers an evolutionary advantage:
- useful information that complements vision: at a distance, in the dark, around corners
- listeners are highly adapted to ‘natural sounds’ (including speech)
The scope of audio processing
The acoustic communication chain
[Figure: communication chain: message → signal → channel → receiver → decoder, with synthesis, audio processing, and recognition acting at successive stages]

Sound is an information bearer: received sound reflects the source(s) plus the effect of the environment (the channel).
Levels of abstraction
Much processing concerns shifting between levels of abstraction.

[Figure: vertical axis from concrete (sound p(t)) through intermediate representations (e.g. time-frequency energy) up to abstract ‘information’; analysis moves upward, synthesis moves downward]

Different representations serve different tasks: separating aspects, making things explicit, . . .
Course structure
Goals
- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies

Course structure
- weekly assignments (25%)
- midterm event (25%)
- final project (50%)

Text: Speech and Audio Signal Processing, Ben Gold & Nelson Morgan. Wiley, 2000. ISBN 0-471-35154-7.
Web-based
Course website:
http://www.ee.columbia.edu/~dpwe/e6820/
for lecture notes, problem sets, examples, . . .
+ student web pages for homework, etc.
Course outline
Fundamentals
- L1: DSP
- L2: Acoustics
- L3: Pattern recognition
- L4: Auditory perception

Audio processing
- L5: Signal models
- L6: Music analysis/synthesis
- L7: Audio compression
- L8: Spatial sound & rendering

Applications
- L9: Speech recognition
- L10: Music retrieval
- L11: Signal separation
- L12: Multimedia indexing
Weekly assignments
Research papers
- journal & conference publications
- summarize & discuss in class
- written summaries on web page + Courseworks discussion

Practical experiments
- Matlab-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for project

Book sections
Final project
Most significant part of the course (50% of grade)

Oral proposals mid-semester; presentations in the final class + website

Scope
- practical (Matlab recommended)
- identify a problem; try some solutions
- evaluation

Topic
- few restrictions within the world of audio
- investigate other resources
- develop in discussion with me

Citation & plagiarism
Examples of past projects
- Automatic prosody classification
- Model-based note transcription
DSP review: digital signals
Discrete-time sampling limits bandwidth; discrete-level quantization limits dynamic range:

  x_d[n] = Q( x_c(nT) )

[Figure: continuous waveform sampled at interval T and quantized to levels spaced ε apart]

- sampling interval T
- sampling frequency Ω_T = 2π/T
- quantizer Q(y): rounds y to the nearest quantization level
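The sampling-and-quantization step above can be sketched in Python (the course practicals use Matlab; the function name and the round-to-nearest-level quantizer here are illustrative assumptions):

```python
import numpy as np

def sample_and_quantize(xc, T, eps, n):
    """Sample a continuous-time signal xc at interval T (x_c(nT)),
    then quantize to levels spaced eps apart (Q rounds to nearest level)."""
    samples = xc(n * T)                  # x_c(nT)
    return eps * np.round(samples / eps) # Q(.)

# 100 Hz sinusoid sampled at 8 kHz with quantizer step 1/128
n = np.arange(16)
xd = sample_and_quantize(lambda t: np.sin(2 * np.pi * 100 * t),
                         T=1 / 8000, eps=1 / 128, n=n)
```

Each output value is an exact multiple of ε, and lies within ε/2 of the true sample.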
The speech signal: time domain
Speech is a sequence of different sound types:
- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: smooth transition (“watch”)
- Stop burst: transient (“dime”)

[Figure: waveform of “has a watch thin as a dime” (1.4–2.6 s), with zoomed panels showing each of the four sound types]
Timescale modification (TSM)
Can we modify a sound to make it ‘slower’?
- i.e. speech pronounced more slowly, e.g. to help comprehension, analysis
- or more quickly, for ‘speed listening’?

Why not just slow it down?

  x_s(t) = x_o(t/r),   r = slowdown factor (> 1 → slower)

- equivalent to playback at a different sampling rate
- but this also scales all frequencies down by r, so the pitch drops

[Figure: original waveform (2.35–2.6 s) above its 2× slower (r = 2) resampled version]
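A sketch of the ‘just slow it down’ resampling approach (illustrative Python; `naive_slowdown` is a made-up name, and linear interpolation stands in for proper band-limited resampling):

```python
import numpy as np

def naive_slowdown(x, r):
    """Slow playback by factor r via resampling, y(t) = x(t/r):
    evaluate x at fractional indices n/r by linear interpolation.
    Note this scales all frequencies by 1/r as well (pitch drops)."""
    t = np.arange(int(len(x) * r)) / r          # fractional source indices
    return np.interp(t, np.arange(len(x)), x)

y = naive_slowdown(np.arange(10.0), r=2)        # ~twice as long
```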
Time-domain TSM
Problem: want to preserve local time structure but alter global time structure.

Repeat segments
- but: artifacts from abrupt edges

Cross-fade & overlap:

  y^m[mL + n] = y^{m−1}[mL + n] + w[n] · x[⌊m/r⌋L + n]

[Figure: input waveform cut into numbered windowed segments that are repeated, cross-faded, and overlap-added to form the stretched output]
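The cross-fade & overlap scheme above can be sketched as follows (illustrative Python; the Hann window, 50% overlap, and frame count are assumptions not fixed by the slide):

```python
import numpy as np

def ola_stretch(x, r, L=256):
    """Time-stretch x by factor r (> 1 = slower) by windowed overlap-add:
    output frame m takes input samples starting at floor(m/r)*L,
    windowed (Hann, length 2L) and cross-faded into place."""
    N = 2 * L                               # window length, hop L = 50% overlap
    w = np.hanning(N)
    n_frames = int((len(x) - N) // L * r)
    y = np.zeros(n_frames * L + N)
    for m in range(n_frames):
        src = int(m / r) * L                # floor(m/r) * L
        y[m * L : m * L + N] += w * x[src : src + N]
    return y

y = ola_stretch(np.random.randn(4096), r=2.0)
```

The output is roughly r times the input length (up to edge effects of one window).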
Synchronous overlap-add (SOLA)
Idea: allow some leeway in placing each window to optimize the alignment of the waveforms.

[Figure: two overlapping segments; the shift K_m maximizes their alignment]

Hence,

  y^m[mL + n] = y^{m−1}[mL + n] + w[n] · x[⌊m/r⌋L + n + K_m]

where K_m is chosen by normalized cross-correlation:

  K_m = argmax_{0 ≤ K ≤ K_u}  ( Σ_{n=0}^{N_ov} y^{m−1}[mL + n] · x[⌊m/r⌋L + n + K] ) / √( Σ_n (y^{m−1}[mL + n])² · Σ_n (x[⌊m/r⌋L + n + K])² )
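The K_m search can be sketched as a brute-force normalized cross-correlation (illustrative Python; `best_lag` and its argument names are hypothetical):

```python
import numpy as np

def best_lag(y_tail, x, src, n_ov, k_max):
    """SOLA alignment: choose K in [0, k_max] maximizing the normalized
    cross-correlation between the existing output tail y_tail (length n_ov)
    and the candidate input segment x[src+K : src+K+n_ov]."""
    best_k, best_score = 0, -np.inf
    for k in range(k_max + 1):
        seg = x[src + k : src + k + n_ov]
        denom = np.sqrt(np.sum(y_tail ** 2) * np.sum(seg ** 2))
        score = np.dot(y_tail, seg) / denom if denom > 0 else 0.0
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

For a sinusoid, the search recovers the shift that brings the two segments back into phase.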
The Fourier domain
Fourier Series (periodic continuous x, period T, Ω_0 = 2π/T):

  x(t) = Σ_k c_k e^{jkΩ_0 t}

  c_k = (1/T) ∫_{−T/2}^{T/2} x(t) e^{−jkΩ_0 t} dt

[Figure: periodic waveform x(t) and its line spectrum |c_k|, k = 1, 2, 3, . . .]

Fourier Transform (aperiodic continuous x):

  x(t) = (1/2π) ∫ X(jΩ) e^{jΩt} dΩ

  X(jΩ) = ∫ x(t) e^{−jΩt} dt

[Figure: aperiodic waveform x(t) and its continuous spectrum level |X(jΩ)| in dB vs. frequency in Hz]
Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x):

  x[n] = (1/2π) ∫_{−π}^{π} X(e^{jω}) e^{jωn} dω

  X(e^{jω}) = Σ_n x[n] e^{−jωn}

[Figure: sampled signal x[n] and its spectrum |X(e^{jω})|, periodic in ω with period 2π]

Discrete Fourier Transform (N-point x):

  x[n] = (1/N) Σ_k X[k] e^{j2πkn/N}

  X[k] = Σ_n x[n] e^{−j2πkn/N}

[Figure: N-point signal x[n] and its discrete spectrum |X[k]|: samples of |X(e^{jω})|]
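The DFT pair above, written directly (illustrative Python; a real implementation would use the FFT, against which this sketch is checked):

```python
import numpy as np

def dft(x):
    """N-point DFT: X[k] = sum_n x[n] e^{-j 2π k n / N}."""
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # DFT matrix
    return W @ x

def idft(X):
    """Inverse DFT: x[n] = (1/N) sum_k X[k] e^{+j 2π k n / N}."""
    N = len(X)
    n = np.arange(N)
    return np.exp(2j * np.pi * np.outer(n, n) / N) @ X / N
```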
Sampling and aliasing
Discrete-time signals equal the continuous-time signal at discrete sampling instants:

  x_d[n] = x_c(nT)

Sampling cannot represent rapid fluctuations:

  sin( (Ω_M + 2π/T) Tn ) = sin(Ω_M Tn)   ∀ n ∈ ℤ

Nyquist limit (Ω_T/2) from the periodic spectrum:

[Figure: baseband spectrum G_a(jΩ) centered at ±Ω_M and its periodic replica G_p(jΩ); the copy at Ω_T − Ω_M is the “alias” of the “baseband” signal]
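The aliasing identity sin((Ω_M + 2π/T)Tn) = sin(Ω_M Tn) can be verified numerically (illustrative Python; the 440 Hz tone and 8 kHz rate are arbitrary choices):

```python
import numpy as np

# Sampling at interval T = 1/fs cannot distinguish a tone at f
# from one at f + fs: they agree at every sampling instant nT.
fs = 8000.0
f = 440.0
n = np.arange(32)
x_base = np.sin(2 * np.pi * f * n / fs)
x_alias = np.sin(2 * np.pi * (f + fs) * n / fs)   # differs by sin(... + 2πn)
```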
Speech sounds in the Fourier domain
[Figure: time-domain waveforms (left) and frequency-domain log spectra (right, energy in dB vs. frequency, 0–4000 Hz) for four sound types: vowel, periodic (“has”); fricative, aperiodic (“watch”); glide, transition (“watch”); stop, transient (“dime”)]

  dB = 20 log_10(amplitude) = 10 log_10(power)

Voiced spectrum has pitch + formants.
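The dB conversion above can be stated as a one-liner (illustrative Python; the function name is an assumption):

```python
import numpy as np

def amp_to_db(a):
    """dB = 20 log10(amplitude) = 10 log10(power), since power = amplitude^2."""
    return 20 * np.log10(a)
```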
Short-time Fourier Transform
Want to localize energy in time and frequency:
- break the sound into short-time pieces
- calculate the DFT of each one

[Figure: waveform cut by short-time windows hopped by L; the DFT of each windowed piece forms one column (m = 0, 1, 2, 3) of a time-frequency array, frequency 0–4000 Hz]

Mathematically,

  X[k, m] = Σ_{n=0}^{N−1} x[n] w[n − mL] exp( −j 2πk(n − mL)/N )
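The STFT definition above maps directly to code (illustrative Python; the Hann window and hop size are assumptions):

```python
import numpy as np

def stft(x, N=512, L=128):
    """Short-time Fourier transform: X[k, m] is the N-point DFT of the
    m-th windowed piece x[n] w[n - mL], Hann window of length N, hop L."""
    w = np.hanning(N)
    n_frames = (len(x) - N) // L + 1
    frames = [np.fft.fft(w * x[m * L : m * L + N]) for m in range(n_frames)]
    return np.stack(frames, axis=1)   # shape (N, n_frames): frequency k by frame m

n = np.arange(2048)
X = stft(np.cos(2 * np.pi * 32 * n / 512))   # tone at DFT bin 32
```

A pure tone at a bin frequency shows up as a peak in the right row of every column.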
The Spectrogram
Plot the STFT X[k, m] as a gray-scale image.

[Figure: waveform (2.35–2.6 s) above its spectrogram, and the full-utterance spectrogram (0–2.5 s); frequency 0–4000 Hz, intensity in dB]
Time-frequency tradeoff
A longer window w[n] gains frequency resolution at the cost of time resolution.

[Figure: wideband spectrogram (48-point window) vs. narrowband spectrogram (256-point window) of the same utterance; frequency 0–4000 Hz, level in dB]
Speech sounds on the Spectrogram
Most popular speech visualization.

[Figure: spectrogram of “has a watch thin as a dime” (1.4–2.6 s, 0–4000 Hz) with labeled regions: vowel, periodic (“has”); fricative, aperiodic (“watch”); glide, transition (“watch”); stop, transient (“dime”)]

Wideband (short window) is better than narrowband (long window) for seeing formants.
TSM with the Spectrogram
Just stretch out the spectrogram?

[Figure: a spectrogram (0–1.4 s, 0–4000 Hz) and a time-stretched version of it]

But how to resynthesize? The spectrogram is only |Y[k, m]|.
The Phase Vocoder
Timescale modification in the STFT domain.

Magnitude from the ‘stretched’ spectrogram:

  |Y[k, m]| = |X[k, m/r]|

- e.g. by linear interpolation

But preserve the phase increment between slices:

  θ̇_Y[k, m] = θ̇_X[k, m/r]

- e.g. by a discrete differentiator

Does the right thing for a single sinusoid:
- keeps overlapped parts of the sinusoid aligned

[Figure: a sinusoid sampled at frame spacing ΔT advances in phase by Δθ = θ̇·ΔT per frame; at doubled spacing the increment must become Δθ′ = θ̇·2ΔT]
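A minimal sketch of the phase-vocoder stretch in the STFT domain (illustrative Python; `pvoc_stretch` is a made-up name, and accumulating the raw inter-frame phase difference is a simplification of the ‘discrete differentiator’ with no frequency unwrapping):

```python
import numpy as np

def pvoc_stretch(X, r):
    """Stretch an STFT X (n_bins x n_frames) by factor r: magnitude is
    linearly interpolated at fractional frame m/r, while each bin's phase
    advances by the analysis phase increment so sinusoids stay aligned."""
    n_bins, n_frames = X.shape
    out_frames = int((n_frames - 1) * r)
    phase = np.angle(X[:, 0])
    Y = np.zeros((n_bins, out_frames), dtype=complex)
    for m in range(out_frames):
        t = m / r
        i = int(t)
        frac = t - i
        # magnitude: linear interpolation between analysis frames i and i+1
        mag = (1 - frac) * np.abs(X[:, i]) + frac * np.abs(X[:, i + 1])
        Y[:, m] = mag * np.exp(1j * phase)
        # phase: accumulate the per-bin increment between analysis frames
        phase = phase + (np.angle(X[:, i + 1]) - np.angle(X[:, i]))
    return Y

Y = pvoc_stretch(np.ones((5, 11), dtype=complex), r=2.0)
```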
General issues in TSM
Time window
- stretching a narrowband spectrogram

Malleability of different sounds
- vowels stretch well; stops lose their nature

Not a well-formed problem?
- want to alter time without altering frequency . . . but time and frequency are not separate!
- a ‘satisfying’ result is a subjective judgment ⇒ the solution depends on auditory perception . . .
Summary
Information in sound
- lots of it, at multiple levels of abstraction

Course overview
- survey of audio processing topics
- practicals, readings, project

DSP review
- digital signals, time domain
- Fourier domain, STFT

Timescale modification
- properties of the speech signal
- time-domain methods
- phase vocoder
References
J. L. Flanagan and R. M. Golden. Phase vocoder. Bell System Technical Journal, pages 1493–1509, 1966.

M. Dolson. The phase vocoder: A tutorial. Computer Music Journal, 10(4):14–27, 1986.

M. Puckette. Phase-locked vocoder. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 222–225, 1995.

A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In 13th European Signal Processing Conference, Antalya, Turkey, 2005.