Acoustics of Speech Julia Hirschberg CS 4706 7/15/2016

Claim: How things are said can be critical to
understanding
• I.e., Varying phrasing, prominence, pitch range,
speaking rate, pitch contour, voice
quality…conveys meaning
• What is our evidence? How do we prove?
– Observation
– Hypotheses
– Experimentation (perception, production)
– Speech analysis (independent variables)
– Correlation with dependent variable
• What does our data look like?
• What tools do we have for analysis?
What is sound?
• Pressure fluctuations in the air caused by a
musical instrument, a car horn, a voice
– Cause eardrum to move
– Auditory system translates into neural
impulses
– Brain interprets as sound
• Can we tell one sound from another?
• Can we distinguish one particular sound in
‘noise’?
– From a speech-centric point of view, when
sound is not produced by the human voice,
we may term it noise
– Ratio of speech-generated sound to other
simultaneous sound: signal-to-noise ratio
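A minimal sketch of the signal-to-noise ratio, assuming it is expressed in decibels as ten times the base-10 log of the power ratio, with power taken as the mean squared sample value. The sample lists are made-up toy data, not course recordings.

```python
import math

def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))

# Hypothetical toy samples: speech amplitude is 10x the noise amplitude.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]
print(round(snr_db(speech, noise), 1))  # 20.0
```

A tenfold amplitude advantage is a hundredfold power advantage, hence 20 dB.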
How ‘Loud’ are Common Sounds?
Event          Pressure (µPa)   dB
Absolute       20               0
Whisper        200              20
Quiet office   2K               40
Conversation   20K              60
Bus            200K             80
Subway         2M               100
Thunder        20M              120
*DAMAGE*       200M             140
Some Sounds are Periodic
• Simple Periodic Waves (sine waves) defined by
– Frequency: how often does pattern repeat per
time unit
• Cycle: one repetition
• Period: duration of cycle
• Frequency = number of cycles per time unit, e.g.
– Frequency in Hz = 1 / (period in seconds)
– Horizontal axis of waveform
– Amplitude: peak deviation of pressure from
normal atmospheric pressure
– Phase: timing of waveform relative to a
reference point
• Complex periodic waves
• Cyclic but composed of two or more sine waves
• Fundamental frequency (F0): rate at which largest
pattern repeats (also GCD of component freqs)
• Components not always easily identifiable: power
spectrum graphs amplitude vs. frequency
• Any complex waveform can be analyzed into a set
of sine waves with their own frequencies,
amplitudes, and phases (Fourier’s theorem)
– E.g. some speech sounds (mostly vowels)
cat.wav
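A small sketch of the ideas above: build a complex periodic wave from two sine waves, recover the component frequencies and amplitudes with a (naive) discrete Fourier transform, and take F0 as the GCD of the component frequencies. The sampling rate, signal length, and component values are assumptions chosen so that each component falls exactly on a DFT bin.

```python
import cmath
import math
from functools import reduce

SAMPLE_RATE = 1000   # samples per second (assumed for this example)
N = 100              # 0.1 s of signal, so each DFT bin is 10 Hz wide

def sine(freq, amp, phase, n):
    """Sample n of a sine wave with the given frequency, amplitude, and phase."""
    return amp * math.sin(2 * math.pi * freq * n / SAMPLE_RATE + phase)

# Complex periodic wave: (frequency in Hz, amplitude, phase) for each component.
components = [(200, 1.0, 0.0), (300, 0.5, math.pi / 4)]
wave = [sum(sine(f, a, p, n) for f, a, p in components) for n in range(N)]

def dft_magnitudes(x):
    """Naive DFT; returns the amplitude at each bin below the Nyquist frequency."""
    size = len(x)
    mags = []
    for k in range(size // 2):
        coeff = sum(x[n] * cmath.exp(-2j * math.pi * k * n / size)
                    for n in range(size))
        mags.append(2 * abs(coeff) / size)
    return mags

bin_hz = SAMPLE_RATE / N
found = [(int(k * bin_hz), round(m, 2))
         for k, m in enumerate(dft_magnitudes(wave)) if m > 0.01]
f0 = reduce(math.gcd, [f for f, _ in found])

print(found)  # [(200, 1.0), (300, 0.5)]
print(f0)     # 100
```

Note that F0 here is 100 Hz even though no 100 Hz sine is present: it is the rate at which the combined pattern repeats.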
Some Sounds are Aperiodic
• Waveforms with random or non-repeating
patterns
– Random aperiodic waveforms: white noise
• Flat spectrum: equal amplitude for all frequency
components
– Transients: sudden bursts of pressure (clicks,
pops, door slams)
• Waveform shows a single impulse (click.wav)
• Fourier analysis shows a flat spectrum
• Some speech sounds, e.g. many consonants
(e.g. cat.wav)
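To illustrate the flat spectrum of a transient, the sketch below takes the discrete Fourier transform of a single impulse and prints the magnitude at every frequency bin. The 16-sample length and the impulse position are arbitrary choices for the example.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitude of each bin of a naive discrete Fourier transform."""
    size = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / size)
                    for n in range(size)))
            for k in range(size)]

click = [0.0] * 16
click[3] = 1.0  # a single impulse; its position is arbitrary

spectrum = dft_magnitudes(click)
print([round(m, 6) for m in spectrum])  # every bin has magnitude 1.0
```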
Speech Production
• Voiced and voiceless sounds
• Vocal fold vibration filtered by the vocal tract
produces a complex periodic waveform
– Cycles per sec of lowest frequency
component of signal = fundamental frequency
(F0)
– Fourier analysis yields power spectrum with
component frequencies and amplitudes
• F0 is first (lowest frequency) peak
• Harmonics are integer multiples of F0; resonances of
the vocal tract (formants) amplify some of them
Vocal fold vibration
[UCLA Phonetics Lab demo]
Places of articulation
[Figure: places of articulation, labeled labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]
http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html
How do we capture speech for analysis?
• Recording conditions
– A quiet office, a sound booth, an anechoic
chamber
• Microphones
• Analog devices (e.g. tape recorders) store and
analyze continuous air pressure variations
(speech) as a continuous signal
• Digital devices (e.g. computers, DAT) first
convert continuous signals into discrete signals
(A-to-D conversion)
• File format:
– .wav, .aiff, .ds, .au, .sph,…
– Conversion programs, e.g. sox
• Storage
– Function of how much information we store
about speech in digitization
• Higher quality, closer to original
• More space (1000s of hours of speech take up a
lot of space)
Sampling
• Sampling rate: how often do we need to
sample?
– At least 2 samples per cycle to capture
periodicity of a waveform component at a
given frequency
• 100 Hz waveform needs 200 samples per sec
• Nyquist frequency: highest-frequency component
captured with a given sampling rate (half the
sampling rate)
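The two-samples-per-cycle rule can be written as a pair of one-line helpers. The function names are illustrative only.

```python
def min_sampling_rate(max_frequency_hz):
    """Lowest sampling rate giving 2 samples per cycle at max_frequency_hz."""
    return 2 * max_frequency_hz

def nyquist_frequency(sampling_rate_hz):
    """Highest frequency component captured at the given sampling rate."""
    return sampling_rate_hz / 2

print(min_sampling_rate(100))    # 200
print(nyquist_frequency(8000))   # 4000.0
```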
Sampling/storage tradeoff
• Human hearing: ~20 kHz top frequency
– Do we really need to store 40K samples per
second of speech?
• Telephone speech: 300-4000 Hz (8 kHz sampling)
– But some speech sounds (e.g. the fricatives /f/ and
/s/ and the bursts of the stops /p/, /t/, /d/) have
energy above 4 kHz!
– Peter/teeter/Dieter
• 44.1 kHz (CD-quality audio) vs. 16-22 kHz (usually good
enough to study pitch, amplitude, duration, …)
Sampling Errors
• Aliasing:
– Signal’s frequency higher than half the
sampling rate
– Solutions:
• Increase the sampling rate
• Filter out frequencies above half the sampling rate
(anti-aliasing filter)
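A sketch of why aliasing is a problem: a sine above half the sampling rate produces exactly the same samples as a lower-frequency sine (up to sign), so the two cannot be told apart after digitization. The 8 kHz rate and the 7 kHz test tone are assumed example values.

```python
import math

def alias_frequency(freq_hz, sampling_rate_hz):
    """Frequency at which a sampled sine of freq_hz appears (folded to 0..Nyquist)."""
    return abs(freq_hz - sampling_rate_hz * round(freq_hz / sampling_rate_hz))

fs = 8000
print(alias_frequency(7000, fs))  # 1000: above Nyquist, folds down
print(alias_frequency(3000, fs))  # 3000: below Nyquist, unchanged

# The samples of a 7 kHz sine match those of an inverted 1 kHz sine.
high = [math.sin(2 * math.pi * 7000 * n / fs) for n in range(16)]
low = [-math.sin(2 * math.pi * 1000 * n / fs) for n in range(16)]
print(all(abs(h - l) < 1e-9 for h, l in zip(high, low)))  # True
```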
Quantization
• Measuring the amplitude at sampling points:
what resolution to choose?
– Integer representation
– 8, 12 or 16 bits per sample
• Noise due to quantization steps avoided by
higher resolution -- but requires more storage
– How many different amplitude levels do we
need to distinguish?
– Choice depends on data and application (44.1 kHz
16-bit stereo requires ~10 MB of storage per minute)
– But clipping occurs when input volume is
greater than range representable in digitized
waveform
• Increase the resolution
• Decrease the amplitude
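The sketch below illustrates quantization, clipping, and the storage cost of a given format. It assumes amplitudes are normalized to the range [-1.0, 1.0) before conversion to signed integers; the helper names are hypothetical.

```python
def quantize(x, bits):
    """Map an amplitude in [-1.0, 1.0) to a signed integer with `bits` bits.

    Input outside the representable range is clipped to the nearest limit.
    """
    levels = 2 ** (bits - 1)
    value = round(x * levels)
    return max(-levels, min(levels - 1, value))

def storage_bytes(sampling_rate_hz, bits, channels, seconds):
    """Uncompressed size in bytes of a recording in the given format."""
    return sampling_rate_hz * bits * channels * seconds // 8

print(quantize(0.5, 8), quantize(0.5, 16))  # 64 16384
print(quantize(1.7, 16))                    # 32767 (clipped)
print(storage_bytes(44100, 16, 2, 60))      # 10584000 bytes for one minute
```

The same amplitude is resolved into 256 steps at 8 bits and 65,536 steps at 16 bits, which is where the reduction in quantization noise comes from.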
What can we do if our data is ‘noisy’?
• Acoustic filters block out certain frequencies of
sounds
– Low-pass filter blocks high frequency
components of a waveform
– High-pass filter blocks low frequencies
– Reject band (what to block) vs. pass band
(what to let through)
• But if the frequencies of two sounds overlap,
filtering cannot separate them: source separation
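As a rough illustration of a low-pass filter, the sketch below uses a moving average, which is one simple kind of low-pass filter and not necessarily what any particular analysis tool uses. The 1000 Hz sampling rate, the 5 Hz and 400 Hz test tones, and the window width of 5 are assumed values, chosen so that the high tone falls on a frequency the filter removes completely.

```python
import math

def moving_average(x, width):
    """Simple low-pass filter: each output is the mean of `width` input samples."""
    return [sum(x[i:i + width]) / width for i in range(len(x) - width + 1)]

def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

fs = 1000
low = [math.sin(2 * math.pi * 5 * n / fs) for n in range(1000)]
high = [math.sin(2 * math.pi * 400 * n / fs) for n in range(1000)]

# The 5 Hz tone passes almost unchanged; the 400 Hz tone is blocked.
print(round(rms(moving_average(low, 5)), 3))
print(round(rms(moving_average(high, 5)), 3))
```

Subtracting the low-pass output from the (suitably aligned) input would give a crude high-pass filter.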
How can we capture pitch contours, pitch
range?
• What is the pitch contour of this utterance? Is
the pitch range of X greater than that of Y?
• Pitch tracking: Estimate F0 over time as a function of
vocal fold vibration
• A periodic waveform is correlated with itself
– One period looks much like another (cat.wav)
– Find the period by finding the ‘lag’ (offset)
between two windows on the signal for which
the correlation of the windows is highest
– Lag duration (T) is 1 period of waveform
– Inverse is F0 (1/T)
• Errors to watch for:
– Halving: shortest lag calculated is too long
(underestimate pitch)
– Doubling: shortest lag too short (overestimate
pitch)
– Microprosody errors (e.g. /v/)
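The autocorrelation idea can be sketched for a single analysis window as follows. This is a simplified illustration, not the algorithm of any particular pitch tracker: the search range (`f0_min`, `f0_max`) is an assumed parameter, and restricting lags to that range is one common guard against halving and doubling errors. The test signal is synthetic.

```python
import math

def estimate_f0(samples, sample_rate, f0_min=75, f0_max=500):
    """Estimate F0 as the inverse of the lag with the highest correlation.

    For each candidate lag, the window is correlated with a copy of itself
    offset by that lag; the earliest best-scoring lag is taken as the period.
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag, best_score = None, -1.0
    for lag in range(min_lag, max_lag + 1):
        a, b = samples[:-lag], samples[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den else 0.0
        if score > best_score + 1e-9:  # prefer the shortest lag on ties
            best_lag, best_score = lag, score
    return sample_rate / best_lag, best_score

# Synthetic voiced sound: 125 Hz fundamental plus two harmonics, 8 kHz sampling.
fs = 8000
voiced = [sum(math.sin(2 * math.pi * 125 * h * n / fs) / h for h in (1, 2, 3))
          for n in range(512)]
f0, score = estimate_f0(voiced, fs)
print(f0)  # 125.0
```

With `f0_max` raised far enough, a lag of half the true period could win on a signal with a strong second harmonic, which is the doubling error described above.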
Sample Analysis File: Pitch Track Header
• version 1
• type_code 4
• frequency 12000.000000
• samples 160768
• start_time 0.000000
• end_time 13.397333
• bandwidth 6000.000000
• dimensions 1
• maximum 9660.000000
• minimum -17384.000000
• time Sat Nov 2 15:55:50 1991
• operation record: padding xxxxxxxxxxxx
Sample Analysis File: Pitch Track Data
F0        Pvoicing   Energy    A/C Score
147.896   1          2154.07   0.902643
140.894   1          1544.93   0.967008
138.05    1          1080.55   0.92588
130.399   1          745.262   0.595265
0         0          567.153   0.504029
0         0          638.037   0.222939
0         0          670.936   0.370024
0         0          790.751   0.357141
141.215   1          1281.1    0.904345
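A short sketch of how frames like these might be summarized, assuming each row is whitespace-separated in the order F0, probability of voicing, energy, autocorrelation score. Only the rows shown above are used; unvoiced frames (voicing 0) are excluded before computing pitch statistics.

```python
rows = """\
147.896 1 2154.07 0.902643
140.894 1 1544.93 0.967008
138.05 1 1080.55 0.92588
130.399 1 745.262 0.595265
0 0 567.153 0.504029
0 0 638.037 0.222939
0 0 670.936 0.370024
0 0 790.751 0.357141
141.215 1 1281.1 0.904345
"""

# One tuple per frame: (f0, pvoicing, energy, ac_score).
frames = [tuple(float(v) for v in line.split()) for line in rows.splitlines()]
voiced_f0 = [f0 for f0, pvoicing, energy, ac_score in frames if pvoicing == 1]

print(len(frames), len(voiced_f0))                 # 9 5
print(min(voiced_f0), max(voiced_f0))              # 130.399 147.896
print(round(sum(voiced_f0) / len(voiced_f0), 1))   # 139.7
```

Including the zero-valued unvoiced frames in the mean would drag the estimate far below the speaker's actual pitch.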
Pitch Perception
• But do pitch trackers capture what humans perceive?
• Auditory system’s perception of pitch is non-linear
– Sounds at lower frequencies with same difference in
absolute frequency sound more different than those at
higher frequencies (male vs. female speech)
– Bark scale (Zwicker) and other models of perceived
difference
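To make the non-linearity concrete, the sketch below converts Hz to Bark using the Zwicker and Terhardt (1980) approximation, which is one of several published formulas for the Bark scale and is an assumption here, not necessarily the one used in the course. It compares the same 100 Hz difference at a low and a high frequency.

```python
import math

def hz_to_bark(f_hz):
    """Zwicker & Terhardt (1980) approximation of the Bark scale."""
    return 13 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500) ** 2)

low_step = hz_to_bark(200) - hz_to_bark(100)     # 100 Hz apart, low frequencies
high_step = hz_to_bark(5100) - hz_to_bark(5000)  # 100 Hz apart, high frequencies

# The same absolute difference spans far more of the scale at low frequencies.
print(round(low_step, 2), round(high_step, 2))
```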
How do we capture loudness/intensity?
• Is one utterance louder than another?
• Energy closely correlated experimentally with
perceived loudness
• For each window, square the amplitude values
of the samples, take their mean, and take the
root of that mean (RMS energy)
– What size window?
– Longer windows produce smoother amplitude
traces but miss sudden acoustic events
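The windowed RMS computation described above can be sketched as follows. The window and hop sizes are parameters of this illustration, and the signal is a toy example with a short burst between two stretches of silence.

```python
import math

def rms_energy(samples, window_size, hop_size):
    """RMS of each window: square the samples, take the mean, then the root."""
    values = []
    for start in range(0, len(samples) - window_size + 1, hop_size):
        window = samples[start:start + window_size]
        values.append(math.sqrt(sum(s * s for s in window) / window_size))
    return values

# Toy signal: silence, a short loud burst, then silence again.
signal = [0.0] * 8 + [1.0, -1.0, 1.0, -1.0] + [0.0] * 8

print(rms_energy(signal, 4, 4))  # [0.0, 0.0, 1.0, 0.0, 0.0]
print([round(v, 3) for v in rms_energy(signal, 20, 20)])  # [0.447]
```

The short windows localize the burst; the single long window reports only a moderate average level, which is the trade-off noted above.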
Perception of Loudness
• But the relation is non-linear: sones or decibels (dB)
– Differences between soft sounds are more salient than
differences between loud sounds
– Intensity is proportional to the square of amplitude,
so the intensity of a sound with pressure x relative to a
reference sound with pressure r is $x^2/r^2$
– bel: base-10 log of the ratio
– decibel: one tenth of a bel
– $\mathrm{dB} = 10 \log_{10}(x^2/r^2)$
– Reference pressure r: absolute (20 µPa, the lowest audible
pressure fluctuation of a 1000 Hz tone) or the typical
threshold level for a tone at the given frequency
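A worked version of the decibel formula, taking the reference pressure as 20 micropascals (the standard 0 dB reference for sound pressure level). Pressures are given in micropascals; the function name is illustrative.

```python
import math

REFERENCE_PRESSURE_UPA = 20.0  # absolute threshold, in micropascals

def pressure_to_db(pressure_upa, reference_upa=REFERENCE_PRESSURE_UPA):
    """dB = 10 * log10(x^2 / r^2), which is the same as 20 * log10(x / r)."""
    return 10 * math.log10(pressure_upa ** 2 / reference_upa ** 2)

for pressure in (20, 200, 2_000, 20_000_000):
    print(pressure, round(pressure_to_db(pressure)))
# 20 0
# 200 20
# 2000 40
# 20000000 120
```

Each tenfold increase in pressure adds 20 dB, because intensity grows with the square of pressure.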
How do we capture….
• For utterances X and Y
• Pitch contour: Same or different?
• Pitch range: Is X larger than Y?
• Duration: Is utterance X longer than utterance
Y?
• Speaker rate: Is the speaker of X speaking
faster than the speaker of Y?
• Voice quality….
Next Class
• Tools for the Masses: Read the Praat tutorial
• Download Praat from the course syllabus page
and play with a speech file (e.g.
http://www.cs.columbia.edu/~julia/cs4706/cc_001_sadness_1669.04_August-second-.wav
or record your own)