4/11/2020
Julia Hirschberg
CS 4706
1
• ASR: Did the caller say ‘ I want to fly to Newark ’ or ‘ I want to fly to New York ’?
• Forensic Linguistics: Did the accused say ‘ Kill him’ or ‘ Bill him’ ?
• What evidence is there in the speech signal?
– How accurately and reliably can we extract it?
4/11/2020 2
How
• Forensic Linguistics: ‘Kill him!’ or ‘Kill him?’
• Call Center: ‘That amount is incorrect.’
• What information do we need to extract from the speech signal?
• What tools do we have to do this?
4/11/2020 3
• Acoustic features to extract
– Fundamental frequency (pitch)
– Amplitude/energy (loudness)
– Spectrum
– Timing (pauses, rate)
• Tools for extraction
– Praat
– Wavesurfer
– Xwaves
4/11/2020 4
• Pressure fluctuations in the air caused by a musical instrument, a car horn, a voice
– Cause eardrum to move
– Auditory system translates into neural impulses
– Brain interprets as sound
– Plot sound as change in air pressure over time
• From a speech-centric point of view, when sound is not produced by the human voice, we may term it noise
– Ratio of speech-generated sound to other simultaneous sound: signal-to-noise ratio
• Higher SNRs are better
4/11/2020 5
Event
Absolute
Whisper
Quiet office
Conversation
Bus
Subway
Thunder
*DAMAGE*
4/11/2020
Pressure (Pa)
20
200
2K
20K
200K
2M
20M
200M
Db
0
20
40
60
80
100
120
140
6
• Simple Periodic Waves ( sine waves) defined by
– Frequency : how often does pattern repeat per time unit
• Cycle : one repetition
• Period : duration of cycle
• Frequency =# cycles per time unit, e.g.
– Frequency in Hz = cycles per second or 1/period
– E.g. 400Hz pitch = 1/.0025 (1 cycle has a period of
.0025; 400 cycles complete in 1 sec)
– Amplitude : peak deviation of pressure from normal atmospheric pressure
4/11/2020 7
– Phase : timing of waveform relative to a reference point
4/11/2020 8
4/11/2020 9
• Cyclic but composed of multiple sine waves
• Fundamental frequency (F0): rate at which largest pattern repeats (also GCD of component freqs)
• Components not always easily identifiable: power spectrum graphs amplitude vs. frequency
• Any complex waveform can be analyzed into a set of sine waves with their own frequencies, amplitudes, and phases (Fourier’s theorem)
4/11/2020 10
4/11/2020 11
4/11/2020 12
• Waveforms with random or non-repeating patterns
– Random aperiodic waveforms: white noise
• Flat spectrum: equal amplitude for all frequency components
– Transients : sudden bursts of pressure (clicks, pops, door slams)
• Waveform shows a single impulse ( click.wav
)
• Fourier analysis shows a flat spectrum
• Some speech sounds, e.g. many consonants
(e.g. cat.wav
)
4/11/2020 13
• Voiced and voiceless sounds
produces complex periodic waveform
– Cycles per sec of lowest frequency component of signal = fundamental frequency
(F0)
– Fourier analysis yields power spectrum with component frequencies and amplitudes
• F0 is first (lowest frequency) peak
• Harmonics are resonances of component frequencies amplified by vocal track
4/11/2020 14
4/11/2020
[UCLA Phonetics Lab demo]
15
dental labial alveolar post-alveolar/palatal velar uvular pharyngeal laryngeal/glottal
16
• Recording conditions
– A quiet office, a sound booth, an anachoic chamber
• Microphones
• Analog devices (e.g. tape recorders) store and analyze continuous air pressure variations
(speech) as a continuous signal
• Digital devices (e.g. computers,DAT) first convert continuous signals into discrete signals
(A-to-D conversion)
4/11/2020 17
• File format:
– .wav, .aiff, .ds, .au, .sph,…
– Conversion programs, e.g. sox
• Storage
– Function of how much information we store about speech in digitization
• Higher quality, closer to original
• More space (1000s of hours of speech take up a lot of space)
4/11/2020 18
• Sampling rate : how often do we need to sample?
– At least 2 samples per cycle to capture periodicity of a waveform component at a given frequency
• 100 Hz waveform needs 200 samples per sec
• Nyquist frequency : highest-frequency component captured with a given sampling rate (half the sampling rate)
4/11/2020 19
• Human hearing: ~20K top frequency
– Do we really need to store 40K samples per second of speech?
• Telephone speech: 300-4K Hz (8K sampling)
– But some speech sounds (e.g. fricatives, /f/,
/s/, /p/, /t/, /d/) have energy above 4K!
– Peter/teeter/Dieter
• 44k (CD quality audio) vs.16-22K (usually good enough to study pitch, amplitude, duration, …)
4/11/2020 20
• Aliasing :
– Signal’s frequency higher than half the sampling rate
– Solutions:
• Increase the sampling rate
• Filter out frequencies above half the sampling rate
( anti-aliasing filter )
4/11/2020 21
• Measuring the amplitude at sampling points: what resolution to choose?
– Integer representation
– 8, 12 or 16 bits per sample
• Noise due to quantization steps avoided by higher resolution -- but requires more storage
– How many different amplitude levels do we need to distinguish?
– Choice depends on data and application (44K
4/11/2020
16bit stereo requires ~10Mb storage)
22
– But clipping occurs when input volume is greater than range representable in digitized waveform
• Increase the resolution
• Decrease the amplitude
4/11/2020 23
• Acoustic filters block out certain frequencies of sounds
– Low-pass filter blocks high frequency components of a waveform
– High-pass filter blocks low frequencies
– Reject band (what to block) vs. pass band
(what to let through)
• But if frequencies of two sounds overlap….
source separation
4/11/2020 24
• What is the pitch contour of this utterance? Is the pitch range of X greater than that of Y?
• Pitch tracking: Estimate F0 over time as fn of
• A periodic waveform is correlated with itself
– One period looks much like another (cat.wav)
– Find the period by finding the ‘ lag’ (offset) between two windows on the signal for which the correlation of the windows is highest
– Lag duration (T) is 1 period of waveform
– Inverse is F0 (1/T)
4/11/2020 25
• Errors to watch for:
– Halving : shortest lag calculated is too long
(underestimate pitch)
– Doubling : shortest lag too short (overestimate pitch)
– Microprosody effects (e.g. /v/)
4/11/2020 26
• version 1
• type_code 4
• frequency 12000.000000
• samples 160768
• start_time 0.000000
• end_time 13.397333
• bandwidth 6000.000000
• dimensions 1
• maximum 9660.000000
• minimum -17384.000000
• time Sat Nov 2 15:55:50 1991
27
(F0 Pvoicing Energy A/C Score)
• 147.896 1 2154.07 0.902643
• 140.894 1 1544.93 0.967008
• 138.05 1 1080.55 0.92588
• 130.399 1 745.262 0.595265
• 0 0 567.153 0.504029
• 0 0 638.037 0.222939
• 0 0 670.936 0.370024
• 0 0 790.751 0.357141
• 141.215 1 1281.1 0.904345
4/11/2020 28
• But do pitch trackers capture what humans perceive?
• Auditory system’s perception of pitch is non-linear
– Sounds at lower frequencies with same difference in absolute frequency sound more different than those at higher frequencies (male vs. female speech)
– Bark scale (Zwicker) and other models of perceived difference
4/11/2020 29
• Is one utterance louder than another?
• Energy closely correlated experimentally with perceived loudness
• For each window, square the amplitude values of the samples, take their mean, and take the root of that mean (RMS energy)
– What size window?
– Longer windows produce smoother amplitude traces but miss sudden acoustic events
4/11/2020 30
• But the relation is non-linear: sones or decibels (dB)
– Differences in soft sounds more salient than loud
– Intensity proportional to square of amplitude so…intensity of sound with pressure x vs. reference sound with pressure r = x 2 /r 2
– bel : base 10 log of ratio
– decibel : 10 bels
– dB = 10log
10
(x 2 /r 2 )
– Absolute (20
Pa, lowest audible pressure fluctuation of
1000 Hz tone), typical threshold level for tone at frequency
4/11/2020 31
• For utterances X and Y
• Pitch contour: Same or different?
• Pitch range: Is X larger than Y?
• Duration: Is utterance X longer than utterance
Y?
• Speaker rate: Is the speaker of X speaking faster than the speaker of Y?
• Voice quality….
4/11/2020 32
• Tools for the Masses: Read the Praat tutorial
• Download Praat from the course syllabus page and play with a speech file (e.g. http://www.cs.columbia.edu/~julia/cs4706/cc_00
1_sadness_1669.04_August-second-.wav
or record your own)
• Bring a laptop and headphones to class if you have them
4/11/2020 33