Acoustics of Speech Julia Hirschberg CS 4706 7/15/2016

advertisement

4/11/2020

Acoustics of Speech

Julia Hirschberg

CS 4706

1

Goal 1: Distinguishing One Phoneme from

Another, Automatically

• ASR: Did the caller say ‘ I want to fly to Newark ’ or ‘ I want to fly to New York ’?

• Forensic Linguistics: Did the accused say ‘ Kill him’ or ‘ Bill him’ ?

• What evidence is there in the speech signal?

– How accurately and reliably can we extract it?

4/11/2020 2

Goal 2: Determining

How

things are said is sometimes critical to understanding

• Forensic Linguistics: ‘Kill him!’ or ‘Kill him?’

• Call Center: ‘That amount is incorrect.’

• What information do we need to extract from the speech signal?

• What tools do we have to do this?

4/11/2020 3

Today and Next Class

• Acoustic features to extract

– Fundamental frequency (pitch)

– Amplitude/energy (loudness)

– Spectrum

– Timing (pauses, rate)

• Tools for extraction

– Praat

– Wavesurfer

– Xwaves

4/11/2020 4

Sound Production

• Pressure fluctuations in the air caused by a musical instrument, a car horn, a voice

– Cause eardrum to move

– Auditory system translates into neural impulses

– Brain interprets as sound

– Plot sound as change in air pressure over time

• From a speech-centric point of view, when sound is not produced by the human voice, we may term it noise

– Ratio of speech-generated sound to other simultaneous sound: signal-to-noise ratio

• Higher SNRs are better

4/11/2020 5

How ‘Loud’ are Common Sounds – How

Much Pressure Generated?

Event

Absolute

Whisper

Quiet office

Conversation

Bus

Subway

Thunder

*DAMAGE*

4/11/2020

Pressure (Pa)

20

200

2K

20K

200K

2M

20M

200M

Db

0

20

40

60

80

100

120

140

6

Some Sounds are Periodic

• Simple Periodic Waves ( sine waves) defined by

– Frequency : how often does pattern repeat per time unit

• Cycle : one repetition

• Period : duration of cycle

• Frequency =# cycles per time unit, e.g.

– Frequency in Hz = cycles per second or 1/period

– E.g. 400Hz pitch = 1/.0025 (1 cycle has a period of

.0025; 400 cycles complete in 1 sec)

– Amplitude : peak deviation of pressure from normal atmospheric pressure

4/11/2020 7

– Phase : timing of waveform relative to a reference point

4/11/2020 8

4/11/2020 9

Complex Periodic Waves

• Cyclic but composed of multiple sine waves

• Fundamental frequency (F0): rate at which largest pattern repeats (also GCD of component freqs)

• Components not always easily identifiable: power spectrum graphs amplitude vs. frequency

• Any complex waveform can be analyzed into a set of sine waves with their own frequencies, amplitudes, and phases (Fourier’s theorem)

4/11/2020 10

4/11/2020 11

4/11/2020 12

Some Sounds are Aperiodic

• Waveforms with random or non-repeating patterns

– Random aperiodic waveforms: white noise

• Flat spectrum: equal amplitude for all frequency components

– Transients : sudden bursts of pressure (clicks, pops, door slams)

• Waveform shows a single impulse ( click.wav

)

• Fourier analysis shows a flat spectrum

• Some speech sounds, e.g. many consonants

(e.g. cat.wav

)

4/11/2020 13

Speech Production

• Voiced and voiceless sounds

• Vocal fold vibration

filtered by the Vocal tract

produces complex periodic waveform

– Cycles per sec of lowest frequency component of signal = fundamental frequency

(F0)

– Fourier analysis yields power spectrum with component frequencies and amplitudes

• F0 is first (lowest frequency) peak

• Harmonics are resonances of component frequencies amplified by vocal track

4/11/2020 14

Vocal fold vibration

4/11/2020

[UCLA Phonetics Lab demo]

15

Places of articulation

dental labial alveolar post-alveolar/palatal velar uvular pharyngeal laryngeal/glottal

16

How do we capture speech for analysis?

• Recording conditions

– A quiet office, a sound booth, an anachoic chamber

• Microphones

• Analog devices (e.g. tape recorders) store and analyze continuous air pressure variations

(speech) as a continuous signal

• Digital devices (e.g. computers,DAT) first convert continuous signals into discrete signals

(A-to-D conversion)

4/11/2020 17

• File format:

– .wav, .aiff, .ds, .au, .sph,…

– Conversion programs, e.g. sox

• Storage

– Function of how much information we store about speech in digitization

• Higher quality, closer to original

• More space (1000s of hours of speech take up a lot of space)

4/11/2020 18

Sampling

• Sampling rate : how often do we need to sample?

– At least 2 samples per cycle to capture periodicity of a waveform component at a given frequency

• 100 Hz waveform needs 200 samples per sec

• Nyquist frequency : highest-frequency component captured with a given sampling rate (half the sampling rate)

4/11/2020 19

Sampling/storage tradeoff

• Human hearing: ~20K top frequency

– Do we really need to store 40K samples per second of speech?

• Telephone speech: 300-4K Hz (8K sampling)

– But some speech sounds (e.g. fricatives, /f/,

/s/, /p/, /t/, /d/) have energy above 4K!

– Peter/teeter/Dieter

• 44k (CD quality audio) vs.16-22K (usually good enough to study pitch, amplitude, duration, …)

4/11/2020 20

Sampling Errors

• Aliasing :

– Signal’s frequency higher than half the sampling rate

– Solutions:

• Increase the sampling rate

• Filter out frequencies above half the sampling rate

( anti-aliasing filter )

4/11/2020 21

Quantization

• Measuring the amplitude at sampling points: what resolution to choose?

– Integer representation

– 8, 12 or 16 bits per sample

• Noise due to quantization steps avoided by higher resolution -- but requires more storage

– How many different amplitude levels do we need to distinguish?

– Choice depends on data and application (44K

4/11/2020

16bit stereo requires ~10Mb storage)

22

– But clipping occurs when input volume is greater than range representable in digitized waveform

• Increase the resolution

• Decrease the amplitude

4/11/2020 23

What can we do if our data is ‘noisy’?

• Acoustic filters block out certain frequencies of sounds

– Low-pass filter blocks high frequency components of a waveform

– High-pass filter blocks low frequencies

– Reject band (what to block) vs. pass band

(what to let through)

• But if frequencies of two sounds overlap….

source separation

4/11/2020 24

How can we capture pitch contours, pitch range?

• What is the pitch contour of this utterance? Is the pitch range of X greater than that of Y?

• Pitch tracking: Estimate F0 over time as fn of

vocal fold vibration

• A periodic waveform is correlated with itself

– One period looks much like another (cat.wav)

– Find the period by finding the ‘ lag’ (offset) between two windows on the signal for which the correlation of the windows is highest

– Lag duration (T) is 1 period of waveform

– Inverse is F0 (1/T)

4/11/2020 25

• Errors to watch for:

– Halving : shortest lag calculated is too long

(underestimate pitch)

– Doubling : shortest lag too short (overestimate pitch)

– Microprosody effects (e.g. /v/)

4/11/2020 26

Sample Analysis File: Pitch Track Header

• version 1

• type_code 4

• frequency 12000.000000

• samples 160768

• start_time 0.000000

• end_time 13.397333

• bandwidth 6000.000000

• dimensions 1

• maximum 9660.000000

• minimum -17384.000000

• time Sat Nov 2 15:55:50 1991

27

Sample Analysis File: Pitch Track Data

(F0 Pvoicing Energy A/C Score)

• 147.896 1 2154.07 0.902643

• 140.894 1 1544.93 0.967008

• 138.05 1 1080.55 0.92588

• 130.399 1 745.262 0.595265

• 0 0 567.153 0.504029

• 0 0 638.037 0.222939

• 0 0 670.936 0.370024

• 0 0 790.751 0.357141

• 141.215 1 1281.1 0.904345

4/11/2020 28

Pitch Perception

• But do pitch trackers capture what humans perceive?

• Auditory system’s perception of pitch is non-linear

– Sounds at lower frequencies with same difference in absolute frequency sound more different than those at higher frequencies (male vs. female speech)

– Bark scale (Zwicker) and other models of perceived difference

4/11/2020 29

How do we capture loudness/intensity?

• Is one utterance louder than another?

• Energy closely correlated experimentally with perceived loudness

• For each window, square the amplitude values of the samples, take their mean, and take the root of that mean (RMS energy)

– What size window?

– Longer windows produce smoother amplitude traces but miss sudden acoustic events

4/11/2020 30

Perception of Loudness

• But the relation is non-linear: sones or decibels (dB)

– Differences in soft sounds more salient than loud

– Intensity proportional to square of amplitude so…intensity of sound with pressure x vs. reference sound with pressure r = x 2 /r 2

– bel : base 10 log of ratio

– decibel : 10 bels

– dB = 10log

10

(x 2 /r 2 )

– Absolute (20 

Pa, lowest audible pressure fluctuation of

1000 Hz tone), typical threshold level for tone at frequency

4/11/2020 31

How do we capture….

• For utterances X and Y

• Pitch contour: Same or different?

• Pitch range: Is X larger than Y?

• Duration: Is utterance X longer than utterance

Y?

• Speaker rate: Is the speaker of X speaking faster than the speaker of Y?

• Voice quality….

4/11/2020 32

Next Class

• Tools for the Masses: Read the Praat tutorial

• Download Praat from the course syllabus page and play with a speech file (e.g. http://www.cs.columbia.edu/~julia/cs4706/cc_00

1_sadness_1669.04_August-second-.wav

or record your own)

• Bring a laptop and headphones to class if you have them

4/11/2020 33

Download