Analyzing the Speech Signal Julia Hirschberg CS 6998 7/15/2016

Basic Acoustics
What is sound?
Pressure fluctuations in the air caused by a
musical instrument, a car horn, a voice
Cause eardrum to move
Auditory system translates into neural
impulses
Brain interprets as sound
How does it travel?
Via a sound wave that ‘travels’ through the air
from molecule to molecule
Molecules don’t travel but pressure
fluctuations do
But sound waves lose energy as they travel: it takes energy to move those molecules
And molecules also move for reasons other
than e.g. the sound of my voice: noise
Ratio of speech-generated molecular motion
to other motion: signal-to-noise ratio
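As a rough illustration of the ratio just described, the sketch below computes a signal-to-noise ratio from two short sample lists. Treating "motion" as mean squared amplitude (power) is an assumption, and the sample values and function names are made up for the example.

```python
def mean_power(samples):
    """Mean squared value of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def snr(signal, noise):
    """Signal-to-noise ratio as a plain power ratio (assumed definition)."""
    return mean_power(signal) / mean_power(noise)

# Hypothetical samples: speech ten times larger in amplitude than the noise.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]
ratio = snr(speech, noise)  # amplitude ratio 10 gives a power ratio near 100
```

Because power goes with the square of amplitude, a tenfold amplitude difference shows up as roughly a hundredfold power ratio.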
Types of Sound: Periodic Waves
Simple Periodic Waves (sine waves) defined by
Frequency: how often does pattern repeat
per time unit
Cycle: one repetition
Period: duration of cycle
Frequency = # cycles per time unit, e.g.
• Frequency in Hz = 1 / period in seconds
• Time is the horizontal axis of the waveform
Amplitude: peak deviation of pressure from
normal atmospheric pressure
Phase: timing of waveform relative to a
reference point
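The three parameters above (frequency, amplitude, phase) are enough to generate a sine wave. The following is a minimal sketch; the function name, the 8000 samples-per-second rate, and the 100 Hz tone are illustrative assumptions rather than values from the slides.

```python
import math

def sine_wave(freq_hz, amplitude, phase_rad, duration_s, sample_rate):
    """Return samples of amplitude * sin(2*pi*f*t + phase)."""
    n = int(round(duration_s * sample_rate))
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate + phase_rad)
        for i in range(n)
    ]

freq_hz = 100.0
period_s = 1.0 / freq_hz                       # period is the inverse of frequency
wave = sine_wave(freq_hz, 1.0, 0.0, 2 * period_s, 8000)  # two full cycles
samples_per_cycle = int(round(period_s * 8000))          # 80 samples per cycle
```

With zero phase the wave starts at zero, reaches its peak amplitude a quarter of a cycle later, and repeats every `samples_per_cycle` samples.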
Complex periodic waves (eg)
Cyclic but composed of two or more sine waves
Fundamental frequency (F0): rate at which
largest pattern repeats (also GCD of component
freqs)
Components not always easily identifiable: power
spectrum graphs amplitude vs. frequency
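The GCD remark can be checked directly. This sketch assumes the component frequencies are whole numbers of Hz; the function name and the example frequencies are hypothetical.

```python
from functools import reduce
from math import gcd

def fundamental_from_components(freqs_hz):
    """F0 as the greatest common divisor of integer component frequencies."""
    return reduce(gcd, freqs_hz)

# Components at 200, 300 and 500 Hz: the combined pattern repeats 100 times per second.
f0 = fundamental_from_components([200, 300, 500])
```

Note that the F0 found this way (100 Hz) need not be present as a component itself.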
Fourier’s Theorem
Any complex waveform can be analyzed into
a set of sine waves with their own
frequencies, amplitudes, and phases
Fourier analysis produces power spectrum
from complex periodic wave
Potential problems:
Assumes infinite waveform when we have only a
small window for analysis
Waveform itself may be inaccurately represented
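To make the analysis step concrete, here is a small sketch that computes a power spectrum with a direct discrete Fourier transform. It is a slow textbook form, not what a tool like Wavesurfer necessarily does, and the sampling rate, window length, and component frequencies are assumptions chosen so that the window holds a whole number of cycles, which sidesteps the finite-window problem noted above.

```python
import cmath
import math

def power_spectrum(samples):
    """|X[k]|^2 for k = 0 .. N/2, using a direct O(N^2) DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x_k) ** 2)
    return spectrum

sample_rate = 800   # samples per second (assumed)
n = 80              # 0.1 s window: 10 cycles of 100 Hz, 20 cycles of 200 Hz
wave = [
    math.sin(2 * math.pi * 100 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 200 * t / sample_rate)
    for t in range(n)
]
spec = power_spectrum(wave)
bin_hz = sample_rate / n  # frequency spacing between spectrum bins
top_two = sorted(range(len(spec)), key=lambda k: spec[k], reverse=True)[:2]
peak_freqs = sorted(k * bin_hz for k in top_two)  # the two strongest components
```

The two largest bins fall at the frequencies of the two sine waves that were mixed, and the stronger component has the larger peak.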
Types of Sound: Aperiodic Waves
Waveforms with random or non-repeating
patterns (eg)
Random aperiodic waveforms: white noise
Flat spectrum: equal amplitude for all frequency
components
Transients: sudden bursts of pressure (clicks,
pops, door slams)
Waveform shows a single impulse
Fourier analysis shows a flat spectrum
Sample Analyses
Wavesurfer
Download from
http://www.speech.kth.se/wavesurfer/download.html
Filters
Acoustic filters block out certain frequencies of
sounds
Low-pass filter blocks high frequency
components of a waveform
High-pass filter blocks low frequencies
Reject band (what to block) vs. pass band
(what to let through)
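A moving average is one of the simplest low-pass filters, so it serves here as a sketch of the idea. It is a swapped-in digital example, not an acoustic filter from the slides; the window width, sampling rate, and test tones are assumptions.

```python
import math

def moving_average(samples, width):
    """Low-pass filter: each output is the mean of the last `width` inputs."""
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)

rate = 8000
low = [math.sin(2 * math.pi * 50 * t / rate) for t in range(rate)]     # 50 Hz tone
high = [math.sin(2 * math.pi * 2000 * t / rate) for t in range(rate)]  # 2000 Hz tone

# Skip the first few outputs, where the averaging window is still filling up.
low_peak = peak(moving_average(low, 8)[8:])
high_peak = peak(moving_average(high, 8)[8:])
```

The 50 Hz tone lies in the pass band and comes through almost unchanged, while the 2000 Hz tone, whose period is exactly 4 samples, averages out to nearly zero over an 8-sample window.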
Production of Speech
Voiced and voiceless sounds
Vocal fold vibration produces complex periodic waveform
Cycles per sec of lowest frequency component of
signal = fundamental frequency (F0)
Fourier analysis yields power spectrum with
component frequencies and amplitudes
F0 is first (lowest frequency) peak
Harmonics: components produced by vocal fold vibration at integer multiples of F0
Vocal tract filters simple voicing waveform to create
complex wave
Digital Signal Processing
Analog devices store and analyze continuous air
pressure variations (speech) as a continuous
signal
Digital devices (e.g. computers) first convert
continuous signals into discrete signals (A-to-D
conversion)
Sampling: how many time points in the
signal to consider?
Quantization: how accurately do we want to
measure amplitude at sampling points?
Sampling
Sampling rate: how often do we need to
sample?
At least 2 samples per cycle to capture
periodicity of a waveform component at a
given frequency
100 Hz waveform needs 200 samples per sec
Nyquist frequency: highest-frequency component
captured with a given sampling rate (half the
sampling rate)
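The two-samples-per-cycle rule amounts to a factor of two in either direction. A minimal sketch, with hypothetical function names:

```python
def min_sample_rate(max_freq_hz):
    """Lowest sampling rate giving 2 samples per cycle at max_freq_hz."""
    return 2 * max_freq_hz

def nyquist_frequency(sample_rate_hz):
    """Highest frequency component captured at a given sampling rate."""
    return sample_rate_hz / 2

rate_for_100hz = min_sample_rate(100)       # the 100 Hz example from the slide
nyquist_at_8k = nyquist_frequency(8000)     # telephone-style 8000 samples/sec
```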
Sampling/storage tradeoff
Human hearing: about 20 kHz top frequency
But do we really need to store 40K samples
per second of speech?
Telephone speech: 300 Hz to 4 kHz (8 kHz sampling)
But fricatives have energy above 4 kHz
16-22 kHz sampling usually good enough
Sampling Errors
Aliasing:
A component whose frequency is higher than half the
sampling rate shows up as a spurious lower frequency
Solutions:
Increase the sampling rate
Filter out frequencies above half the sampling
rate (anti-aliasing filter)
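To see where an under-sampled component ends up, the sketch below folds a frequency back into the range from 0 to half the sampling rate. The function name and the example values are assumptions for illustration.

```python
def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a sine at freq_hz after sampling at sample_rate_hz."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

# At 8000 samples/sec the limit is 4000 Hz.
below_limit = alias_frequency(3000, 8000)   # unchanged
above_limit = alias_frequency(5000, 8000)   # folds back below 4000 Hz
```

A 5000 Hz component sampled at 8000 samples per second is indistinguishable from a 3000 Hz one, which is why the anti-aliasing filter has to be applied before sampling.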
Quantization
Measuring the amplitude at sampling points:
what resolution to choose?
Integer representation
8, 12 or 16 bits per sample
Noise due to quantization steps is reduced by
higher resolution, but that requires more storage
Choice depends on what kind of analysis to
be done
But clipping occurs when input volume is
greater than range representable in digitized
waveform, which introduces transients
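The resolution tradeoff and clipping can both be seen in a few lines. This sketch assumes input amplitudes scaled to the range -1.0 to 1.0 and a signed integer representation; the function names and the test value 0.3 are hypothetical.

```python
def quantize(x, bits):
    """Map x (nominally in [-1.0, 1.0)) to a signed integer of `bits` bits.

    Values outside the representable range are clipped to the nearest limit.
    """
    levels = 2 ** (bits - 1)
    q = int(round(x * levels))
    return max(-levels, min(levels - 1, q))

def dequantize(q, bits):
    """Map a quantized integer back to the -1.0 .. 1.0 scale."""
    return q / 2 ** (bits - 1)

x = 0.3
err8 = abs(dequantize(quantize(x, 8), 8) - x)     # error with 8 bits per sample
err16 = abs(dequantize(quantize(x, 16), 16) - x)  # error with 16 bits per sample
clipped = quantize(1.5, 8)                        # input louder than the range
```

The 16-bit error is far smaller than the 8-bit error, at twice the storage per sample, and the too-loud input is flattened to the largest representable value.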
Perception of Pitch
Auditory system’s perception of pitch is nonlinear
Sounds at lower frequencies with same
difference in absolute frequency sound more
different than those at higher frequencies
Bark scale (Zwicker) models perceived
difference
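One way to see the nonlinearity is to convert frequencies to Bark and compare equal steps in Hz. The formula below is a commonly cited analytic approximation (Zwicker and Terhardt, 1980); whether the slides had this exact form in mind is an assumption.

```python
import math

def hz_to_bark(f_hz):
    """Approximate Bark value for a frequency in Hz (Zwicker & Terhardt form)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

# The same 100 Hz difference at a low and at a high frequency.
low_step = hz_to_bark(200) - hz_to_bark(100)
high_step = hz_to_bark(5100) - hz_to_bark(5000)
```

The 100 Hz step near the bottom of the range covers several times more of the Bark scale than the same step near 5000 Hz, matching the claim that low-frequency differences sound larger.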
Pitch-Tracking
Autocorrelation techniques
Goal: estimate F0 over time as a function of vocal fold
vibration
A periodic waveform is correlated with itself
One period looks much like another (eg)
Find the period by finding the ‘lag’ (offset)
between two windows on the signal for which
the correlation of the windows is highest
Lag duration (T) is 1 period of waveform
Inverse is F0 (1/T)
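The lag search described above can be sketched as follows. This is a bare-bones illustration on a synthetic frame, not the algorithm of any particular pitch tracker; the F0 search range, the tolerance, and the test signal are assumptions.

```python
import math

def autocorr_f0(samples, sample_rate, f0_min=50.0, f0_max=500.0, tol=1e-6):
    """Estimate F0 as sample_rate / lag for the best-correlated lag.

    Among lags whose normalized correlation is within `tol` of the best,
    the shortest is chosen, so an exact multiple of the period does not win.
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    scores = {}
    for lag in range(min_lag, max_lag + 1):
        a, b = samples[:-lag], samples[lag:]   # two windows offset by `lag`
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        scores[lag] = num / den if den > 0 else 0.0
    best = max(scores.values())
    lag = min(l for l, s in scores.items() if s >= best - tol)
    return sample_rate / lag, scores[lag]

rate = 8000
# Synthetic "voiced" frame: 125 Hz fundamental plus its second harmonic.
frame = [
    math.sin(2 * math.pi * 125 * t / rate) + 0.5 * math.sin(2 * math.pi * 250 * t / rate)
    for t in range(800)
]
f0, score = autocorr_f0(frame, rate)
```

For this frame the best lag is 64 samples, one period of the 125 Hz fundamental. Lag 128 correlates equally well, which is exactly how a lag that is too long arises, so the shortest-lag rule matters.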
Errors:
Halving: shortest lag calculated is too long
(underestimate pitch)
Doubling: shortest lag too short
(overestimate pitch)
Pitch Track Headers
version 1
type_code 4
frequency 12000.000000
samples 160768
start_time 0.000000
end_time 13.397333
bandwidth 6000.000000
dimensions 1
maximum 9660.000000
minimum -17384.000000
time Sat Nov 2 15:55:50 1991
operation record: padding xxxxxxxxxxxx
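The header fields are consistent with one another if `frequency` is read as the sampling rate in samples per second and `bandwidth` as the Nyquist frequency. That reading is an assumption; the sketch below only checks the arithmetic.

```python
# Values copied from the header above; the field interpretation is assumed.
header = {
    "frequency": 12000.0,   # assumed: sampling rate in samples per second
    "samples": 160768,
    "start_time": 0.0,
    "end_time": 13.397333,
    "bandwidth": 6000.0,    # assumed: Nyquist frequency
}

duration_s = header["samples"] / header["frequency"]   # samples / rate
nyquist_hz = header["frequency"] / 2                   # half the sampling rate
```

Under that reading, the duration computed from the sample count matches `end_time` to the printed precision, and half the sampling rate matches `bandwidth`.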
Pitch Track Data
F0        Pvoicing   Energy    A/C Score
147.896   1          2154.07   0.902643
140.894   1          1544.93   0.967008
138.05    1          1080.55   0.92588
130.399   1          745.262   0.595265
0         0          567.153   0.504029
0         0          638.037   0.222939
0         0          670.936   0.370024
0         0          790.751   0.357141
141.215   1          1281.1    0.904345
RMS Amplitude
Energy closely correlated experimentally with
perceived loudness
For each window, square the amplitude values
of the samples, take their mean, and take the
root of that mean
What size window?
Longer windows produce smoother
amplitude traces but miss sudden acoustic
events
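The square, mean, root recipe and the window-size tradeoff can be sketched together. The non-overlapping windows and the toy signal with a two-sample click are assumptions made for the example.

```python
import math

def rms(window):
    """Root of the mean of the squared sample values."""
    return math.sqrt(sum(s * s for s in window) / len(window))

def rms_trace(samples, window_size):
    """RMS amplitude for consecutive non-overlapping windows."""
    return [
        rms(samples[i:i + window_size])
        for i in range(0, len(samples) - window_size + 1, window_size)
    ]

# Silence with a brief click at samples 8 and 9.
signal = [0.0] * 8 + [1.0, -1.0] + [0.0] * 6
short_trace = rms_trace(signal, 2)    # eight 2-sample windows
long_trace = rms_trace(signal, 16)    # one 16-sample window
```

The short windows isolate the click as a single full-amplitude value, while the one long window smears it into a much lower average.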
Perception of Loudness
Non-linear: described in sones or decibels (dB)
Differences in soft sounds more salient than in loud ones
Intensity proportional to square of amplitude
so… intensity of sound with pressure $x$ vs. reference
sound with pressure $r$ = $x^2/r^2$
bel: base 10 log of ratio
decibel: one tenth of a bel
$\mathrm{dB} = 10 \log_{10}(x^2/r^2) = 20 \log_{10}(x/r)$
Reference: absolute (20 µPa, lowest audible pressure fluctuation of
1000 Hz tone) or typical threshold level for tone at
the given frequency
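The decibel formula can be applied directly to pressures measured against the absolute reference. A minimal sketch, assuming pressures in micropascals and a hypothetical function name:

```python
import math

REF_PRESSURE_UPA = 20.0  # absolute reference pressure, in micropascals

def db_spl(pressure_upa, ref_upa=REF_PRESSURE_UPA):
    """Level in dB of a pressure relative to the reference: 10*log10(x^2 / r^2)."""
    return 10.0 * math.log10(pressure_upa ** 2 / ref_upa ** 2)

at_reference = db_spl(20.0)      # same pressure as the reference
ten_times = db_spl(200.0)        # ten times the reference pressure
thousand_times = db_spl(20000.0) # a thousand times the reference pressure
```

Each tenfold increase in pressure adds 20 dB, since the intensity ratio grows a hundredfold.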
Pressure of Common Sounds
Event          Pressure (µPa)   dB
Absolute       20               0
Whisper        200              20
Quiet office   2K               40
Conversation   20K              60
Bus            200K             80
Subway         2M               100
Thunder        20M              120
*DAMAGE*       200M             140
Speech Analysis Gives us Information
About variation in
Loudness
Pitch (contours, accent, phrasing, range)
Timing (rate, pauses)
Style (articulation, disfluencies)
This can be correlated with other features
Syntax, semantics, discourse context, words
Now and Next Week
Now: turn in discussion questions and project
ideas
Read HLT96 (Ch. 5)
Try out some TTS systems; exercises
Bring 3 discussion questions to class
Decide which week you would like to help with
class
Vocal fold vibration
[UCLA Phonetics Lab demo]
Places of articulation
dental
labial
alveolar
post-alveolar/palatal
velar
uvular
pharyngeal
laryngeal/glottal
http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html