
Speech Acoustics Project
OCE 471 Underwater Acoustics
Jesse Hansen
Abstract:
In this paper, basic methods for analyzing recorded speech are presented. The
spectrogram is introduced and subsequently utilized in a Matlab environment to reveal patterns
in recorded voice data. Several examples of speech are recorded, analyzed, and compared. A
model for voice production is introduced in order to explain the variety of time-frequency
patterns in the waveforms. Specifically, a single tube and then a multi-tube model for the vocal
tract are considered and related to resonances in the speech spectrum. It is shown that a series of
connected acoustic tubes results in resonances similar to those that occur in speech.
Introduction
Motivation:
Consider the problem of speech recognition. When two different people speak the same phrase
(or if one person utters the same phrase twice), a human listener will generally have no trouble
understanding each instance of that phrase. This leads us to believe that even though the two
speakers may have different vocal qualities (different pitch, different accents, etc.) there must be
some sort of invariant quality between the two instances of the spoken phrase.
Thinking about the problem a bit further, we realize that when two different people articulate the
same phrase, they perform essentially the same mechanical motions. In other words, they move
their mouths, tongue, lips, etc., in roughly the same way. We hypothesize that as a result of
the similarities in speech mechanics from person to person there should be some features in the
recorded speech waveform that are similar for multiple instances of a spoken phrase.
One such set of speech features is called formants, which are resonances in the vocal tract. The
frequencies at which these resonances occur are a direct result of the particular configuration of
the vocal tract. As words are spoken, the speaker moves his or her tongue, mouth, and lips,
changing the resonant frequencies with time. Analysis of these time-varying frequency patterns
forms the basis for all modern speech recognition systems.
Organization:
This paper is broadly divided into two sections. Part 1 is concerned with analysis of voice
waveforms. In Part 2, we will delve into models for voice production and relate them to the data
presented in Part 1.
Part 1 is organized as follows. In Section 1.1 we briefly describe the spectrogram, a widely used
tool for time-frequency analysis of acoustic data, and illustrate its benefits with an example. A
Matlab program for recording sounds and viewing their spectrograms is presented in Section 1.2.
In Section 1.3 we divide speech sounds into two broad categories, voiced and unvoiced speech,
restricting our analysis to voiced speech. Finally, in Section 1.4, several speech waveforms are
presented and analyzed.
Part 2 is organized as follows. Section 2.1 briefly describes the vocal tract, and then Section 2.2
presents a single acoustic tube model for the vocal tract. Section 2.3 presents a multi-tube model
and discusses various ways that the model can be analyzed. Closing remarks are made in
Section 2.4.
PART I: Data Analysis
1.1 The Spectrogram
The spectrogram of a waveform shows the frequency content of that signal as a function of time.
The spectrogram is computed as follows:
1. The original waveform is first broken into smaller blocks of equal size. The
choice of the block size depends on the frequency content of the underlying data. For speech, a
width of 20 to 30 ms is often used. Blocks are allowed to overlap. An overlap of 50% is
typical.
2. Each block is multiplied by a window function. Most window functions have a value of
1 in the middle and taper down towards 0 at the edges. ‘Windowing’ a block of data has
the effect of diminishing the magnitude of the samples at the edges, while maintaining
the magnitude of the samples in the middle.
3. The Discrete Fourier Transform (DFT) of each windowed block is computed. Only the
magnitude of the DFT is retained. The result is several vectors of frequency data (the
magnitude of the DFT), one vector for each block of the original waveform. The
frequency information is localized in time depending on the location of the time block
that it was computed from.
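These three steps can be sketched in a few lines of Matlab. The sketch below is illustrative only: it assumes a column vector y sampled at Fs, and the block length, overlap, and DFT size are typical choices rather than fixed requirements.

    % Illustrative spectrogram computation: block, window, DFT.
    Fs     = 8000;                     % assumed sampling rate (Hz)
    blk    = round(0.025*Fs);          % 25 ms blocks
    step   = round(blk/2);             % 50% overlap
    win    = hamming(blk);             % tapered window
    nfft   = 2048;                     % DFT length
    starts = 1:step:(length(y) - blk + 1);
    S = zeros(nfft/2 + 1, length(starts));
    for m = 1:length(starts)
        seg     = y(starts(m):starts(m)+blk-1) .* win;  % window one block
        X       = abs(fft(seg, nfft));                  % keep only the magnitude
        S(:, m) = X(1:nfft/2+1);                        % frequencies 0..Fs/2
    end
    t = (starts + blk/2)/Fs;                            % time of each block center
    f = (0:nfft/2)*Fs/nfft;                             % frequency axis
    imagesc(t, f, 20*log10(S + eps)); axis xy           % display in dB
    xlabel('Time (s)'); ylabel('Frequency (Hz)')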
A simple example will help to illustrate the point. Below we have the waveform and
spectrogram of a bird chirping. This sound was borrowed from a Matlab demonstration.
The waveform and spectrogram of a chirping bird
The upper plot shows the time domain waveform of a bird chirping. Below this is the
spectrogram, which shows frequency content as a function of time. Frequency is on the vertical
axis and time is on the horizontal. Blue indicates larger magnitude while red indicates smaller
magnitude.
The beauty of the spectrogram is that it clearly illustrates how the frequency of a signal varies
with time. In this example we can see that each chirp starts at a high frequency, usually between
3 and 4 kHz, and over the course of about 0.1 seconds, decreases in frequency to about 2 kHz.
This type of detail would be lost if we chose to take the DFT of the entire waveform.
Technical details:
- The sampling rate is 8 kHz.
- The block size is 25 ms, or 200 samples.
- There is 87.5% overlap between blocks.
- The blocks are multiplied by a Hamming window.
- There are 2^11 = 2048 points in the DFT of each block.
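For reference, Matlab's specgram function (Signal Processing Toolbox) would be called with roughly these settings; y is assumed to hold the chirp samples.

    % 8 kHz sampling, 200-sample (25 ms) Hamming-windowed blocks,
    % 87.5% overlap (175 samples), 2048-point DFT.
    Fs = 8000;
    specgram(y, 2048, Fs, hamming(200), 175);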
1.2 Matlab Recording & Analysis Program
Here we will present a Matlab program, Record, written by the author in the summer of 2001.
The program is intended to simplify the recording and basic editing of speech waveforms as well
as to present the spectrogram and the time waveform in a side-by-side format for ease of
analysis.
The remainder of this section will describe the program—its inner workings and functionality.
Running the program:
The program can be run by typing record at the Matlab prompt or by opening the program in the
Matlab editor and selecting Run from the Debug menu.
Recording:
Sound recording is initiated through the Matlab graphical user interface (GUI) by clicking on the
record button. The duration of the recording can be adjusted to be anywhere from 1 to 6
seconds. (These are the GUI defaults, but the code can be modified to record for longer
durations if desired.)
Upon being clicked, the record button executes a function that reads in mono data from the
microphone jack on the sound card and stores it in a Matlab vector.
Most of the important information in a typical voice waveform is found below a frequency of
about 4 kHz. Accordingly, we should sample at no less than twice this frequency, or 8 kHz. (Note
that sound cards have a built-in pre-filter to limit the effects of aliasing.) Since there is at
least some valuable information above 4 kHz, the Record GUI has a default sampling rate of
11.025 kHz—this can be modified in code. A sampling rate of 16 kHz had been used in the past,
but the data acquisition toolbox in Matlab 6.0 does not support this rate.
Once recorded, the time data is normalized to a maximum amplitude of 0.99 and displayed on
the upper plot in the GUI window. In addition to the time domain waveform, a spectrogram is
computed using Matlab’s built-in specgram function (part of the Signal Processing Toolbox).
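A stripped-down sketch of what the record button does is shown below. It uses the legacy wavrecord and specgram functions (newer Matlab releases use audiorecorder and spectrogram instead), and the recording length is an arbitrary example.

    Fs  = 11025;                          % default GUI sampling rate
    dur = 2;                              % recording length in seconds
    y   = wavrecord(dur*Fs, Fs);          % mono recording from the sound card
    y   = 0.99*y/max(abs(y));             % normalize to a maximum amplitude of 0.99

    t = (0:length(y)-1)/Fs;
    subplot(2,1,1); plot(t, y); xlabel('Time (s)'); ylabel('Amplitude')
    subplot(2,1,2)
    specgram(y, 2048, Fs, hamming(round(0.025*Fs)), round(0.875*0.025*Fs))
    xlabel('Time (s)'); ylabel('Frequency (Hz)')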
An example recording of the sentence, “We were away a year ago” is shown below.
“We were away a year ago”
Zooming in on the Waveform:
One can examine a region of interest in the waveform using the Zoom in button. When Zoom in
is clicked, the cursor will change to a cross hair. Clicking the left mouse button and dragging a
rectangle around the region of interest in the time domain waveform will select a sub-section of
data. In the example below we have zoomed in on the region from about 1 to 1.2 seconds.
‘Zoomed in’ on the waveform
Zooming out:
The Zoom out button will change the axis back to what it was before Zoom in was used. If you
zoom in multiple times, zooming out will return you to the previous axis limits.
Listening to the Waveform:
The Play button uses Matlab’s sound function to play back (send to the speakers) the waveform
that appears in the GUI. If you have zoomed in on a particular section of the waveform, only
that portion of the waveform will be sent to the speakers.
Save is used to write the waveform to a wave file. If you have zoomed in on a segment of data,
only that portion of the waveform will be saved.
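The playback and save operations amount to something like the following, where y is the recorded vector and the start/stop times of the zoomed region are illustrative (wavwrite is the Matlab 6-era file writer; current releases use audiowrite).

    Fs = 11025; t1 = 1.0; t2 = 1.2;            % example zoom limits (s)
    idx = round(t1*Fs)+1 : round(t2*Fs);
    sound(y(idx), Fs);                          % send the segment to the speakers
    wavwrite(y(idx), Fs, 'segment.wav');        % write the segment to a wave file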
Click Load to import any mono wave file into the Record GUI for analysis.
1.3 Voiced and Unvoiced Speech
There are two broad categories into which speech is segmented, voiced and unvoiced speech.
We will differentiate the two classes by their method of production and by the time and
frequency patterns that we observe in the recorded data. This project is primarily concerned with
voiced speech for reasons we’ll explain in a moment.
Voiced Speech:
All voiced speech originates as vibrations of the vocal cords. Its primary characteristic is its
periodic nature.
Voiced speech is created by pushing air from the lungs up the trachea to the vocal folds (cords),
where pressure builds until the folds part, releasing a puff of air. The folds then return to their
original position as pressure on each side is equalized. Muscles controlling the tension and
elasticity of the folds determine the rate at which they vibrate. See [3].
The puffs of air from the vocal cords are subsequently passed through the vocal tract and then
through the air to our ears. The periodicity of the vocal cord vibrations is directly related to the
perceived pitch of the sound. We will examine the effects of the vocal tract in more detail later
on.
Vowel sounds are one example of voiced speech. Consider the /aa/ sound in father, or the /o/
sound in boat. In the segment of voiced speech below, note the periodicity of the waveform.
Segment of voiced speech, /aa/ in father
Unvoiced speech:
Unvoiced speech does not have the periodicity associated with voiced speech. In many kinds of
unvoiced speech, noise-like sound is produced at the front of the mouth using the tongue, lips,
and/or teeth. The vocal folds are held open for these sounds.
Consider the sounds /f/ as in fish, /s/ as in sound. The /f/ sound is created by forcing air between
the lower lip and teeth, while /s/ is created by forcing air through a constriction between the
tongue and the roof of the mouth or the teeth.
The waveform below shows a small segment of unvoiced speech. Note its distinguishing
characteristics. It is low amplitude, noise-like, and it changes more rapidly than voiced speech.
Segment of unvoiced speech, /sh/ in she
Let’s further examine the waveform and spectrogram of a word containing both voiced and
unvoiced speech. A recording of the word sky was made with the Matlab program. The /s/
sound, we know now, is unvoiced, while the /eye/ sound is voiced. (The /k/ is also unvoiced, but
not noise-like.) The plots are shown below.
Looking at the spectrogram we note that /s/ contains a broad range of frequencies, but is
concentrated at higher frequencies. The resonances, or formants, in the speech waveform can be
seen as blue, horizontal stripes in the spectrogram. These formants, mentioned in the
introduction, aren’t particularly clear in the unvoiced /s/, but are quite obvious in the voiced
/eye/. It is for this reason that we restrict our analysis to voiced speech.
1.4 Data Analysis
Several speech waveforms will be analyzed here, and various features in the waveforms and
spectrograms will be noted. The most prominent features in the spectrograms are the dark (blue)
horizontal bands, called formants, corresponding to frequencies of greater energy. It will be
shown in Part II that these formants result from resonances in the vocal tract.
In addition to the waveforms and spectrograms, we will analyze the spectrum of small segments
of the waveforms. These spectra will help us compare formant frequencies from sound to
sound.
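Spectra of the kind shown in this section can be produced along the following lines. The segment boundaries below are only an example, and the window and DFT length are conventional choices, not the exact values used for the figures.

    Fs  = 11025;
    seg = y(round(0.09*Fs)+1 : round(0.16*Fs)); % isolate a short voiced segment
    seg = seg .* hamming(length(seg));          % taper the edges
    nfft = 4096;
    X = 20*log10(abs(fft(seg, nfft)) + eps);    % magnitude spectrum in dB
    f = (0:nfft/2)*Fs/nfft;
    plot(f, X(1:nfft/2+1))                      % formants appear as broad peaks
    xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)')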
Waveform 1: “Already”
The word already contains several different sounds, all of them voiced. Notice how the formants
change with time. One significant feature that is also easy to identify in the spectrogram is the
/d/ sound. This sound is called a stop, for obvious reasons. When the /d/ is pronounced, the
tongue temporarily stops air from leaving the oral cavity. This action leads to a small amplitude
in the waveform and a ‘hole’ in the spectrogram. Can you spot it?
Waveform and spectrogram of the word already
We’ll now take a look at a small sub-section of the waveform. We’ve isolated about 0.07
seconds of sound towards the beginning of the word. By itself, this portion of the waveform
sounds a bit like the /au/ in the word caught. The plot below shows the waveform and spectrum
of the sub-signal.
/au/ in already from about 0.09 to 0.16 sec
The primary item of interest is the bottom-most plot, which shows the spectrum of the signal in
decibels (dB). The approximate locations of the formants are indicated by the vertical dotted
lines. Note how this plot matches up with the spectrogram. Both indicate a wide first formant
and 4 more formants between 2 and 4 kHz (although the 4th isn’t as well defined as the other 3).
Moving on, we isolate a different portion of the same waveform, the /ee/ sound at the end of the
word already. By looking at either the waveform or the spectrum, we can see that /ee/ contains
more high frequency information than /au/. /ee/ has a narrow first formant and 3 additional
formants at higher frequencies.
/ee/ in already from about 0.418 to 0.457 sec
Waveform 2: “Cool”
The word cool contains three distinct sounds. The /c/ at the beginning is an unvoiced sound. The
/oo/ and the /L/ are both voiced sounds, but each has different properties. For the sake of time
we’ll stick to the /L/ sound at the end of the word. An /L/ is created by arching the tongue in the
mouth. This leads to a sound that differs greatly from most of the vowel sounds.
Waveform and spectrogram of the word cool
Below, you’ll see a segment of the /L/ sound. It is different from the last two sounds we’ve
analyzed in that it contains virtually no high frequency components. There are two high
frequency formants at about 3000 and 3500 Hz, but they are a great deal smaller in magnitude
than the first formant.
/L/ in cool from about 0.3 to 0.33 sec
Waveform 3: “Cot”
We chose the word cot to get a look at the /ah/ sound in the middle. You’ll notice from the
spectrogram that this sound is quite frequency rich, containing several large formants between 0
and 5 kHz.
Waveform and spectrogram of the word cot
Compare the /ah/ waveform (below) to the previous /L/ waveform. There is a great deal more
activity in this waveform, which explains the variety of frequencies in the spectrogram. The
spectrum of the signal reveals 5 (possibly 6) significant formants, each one having a sizable
bandwidth.
/ah/ in cot from about 0.17 to 0.22 sec
Conclusion from Data Analysis:
It is clear from the data that different vocal sounds have widely varying spectral content.
However, each sound contains similar features, like formants, that arise from the methods of
speech production. We’ll now begin to talk a bit about the vocal tract and the attempts that have
been made to model its functionality. The predictions of these models will then be related to the
empirical results from this section.
PART II: Speech Production Models
2.1 The Vocal Tract
Here, we examine a bit of the physiology behind the production of speech. Voiced speech is
created by simultaneously exciting our vocal cords into vibration and configuring our vocal tract
to be a particular shape. The vocal tract consists of everything between the vocal cords and the
lips: the pharyngeal cavity, tongue, oral cavity, etc. Pressure waves travel from the vocal cords
along the non-uniform path to the mouth, and eventually, to our ears.
Figure from [5] which is in turn from [1]
The figure above shows a schematic diagram of the vocal tract on the left and, on the right, a plot
of the area of the vocal tract as a function of distance (in centimeters) from the vocal cords. The
area/distance function plotted here is for the sound /i/, as in bit. The configuration of the vocal
tract, and hence the plot, is different for different sounds.
Looking at the area vs. distance function in the plot above, you’ll notice that there are two major
resonant chambers in the vocal tract, the first, the pharyngeal cavity, is from about 1 to 8 cm and
the second, the oral cavity, is from about 14 to 16 cm. This manner of thinking about the vocal
tract—identifying resonant chambers—leads us to model it as an acoustic tube, or a concatenation
of acoustic tubes, an idea we’ll examine further in subsequent sections.
Aside: Some sounds, called nasals, also use the nasal cavity as an additional resonant chamber
and path to the outside world. However, these sounds are a small subset of the set of voiced
sounds, and we will ignore them in this presentation.
More detail about the vocal cords can be found in [3], and an in depth analysis of the vocal tract
can be found in [4].
Here come the models
We’ll now examine some models of speech production. All of these models make a couple of
simplifying assumptions:
1. The vocal cords are independent of the vocal tract.
2. The vocal tract is a linear system.
Neither of these assumptions is entirely true, but they help to simplify the analysis. The
acoustic tube models that we will examine can also be viewed as source-filter models. Either
way you look at it, there is an excitation sent through a channel, and that channel alters the
spectral content of the excitation.
2.2 A Single Tube Model
Suppose for a moment that we model the vocal tract as a single acoustic tube of uniform area.
Suppose further that the tube is excited at one end by a periodic input, and open at the other. We
will now analyze such a configuration, and attempt to determine the resonant frequencies that
this model predicts.
Figure: resonance in a single acoustic tube of length L, driven by an excitation at one end
(x = 0) and open at the other (x = L). The volume velocities at the two ends are written as
$u(0,t) = \mathrm{Re}\{U(0)\,e^{j\omega t}\}$ and $u(L,t) = \mathrm{Re}\{U(L)\,e^{j\omega t}\}$.
The wave equation is
$$\frac{\partial^2 p}{\partial x^2} = \frac{1}{c_0^2}\,\frac{\partial^2 p}{\partial t^2}.$$
Assume a solution of the form $p(x,t) = \mathrm{Re}\{P(x)\,e^{j\omega t}\}$. We can now use complex notation to
write the reduced wave equation,
$$\frac{d^2 P}{dx^2} = \frac{1}{c_0^2}(j\omega)^2 P,$$
or
$$\frac{d^2 P}{dx^2} + k^2 P = 0, \qquad k = \frac{\omega}{c_0}. \qquad (1)$$
Equation (1) has a solution of the form
$$P(x) = A\cos(kx) + B\sin(kx). \qquad (2)$$
Pressure and velocity can be related by the conservation of linear momentum:
$$\rho_0\,\frac{\partial u}{\partial t} = -\frac{\partial p}{\partial x}, \qquad \rho_0\, j\omega\, U = -\frac{dP}{dx}. \qquad (3)$$
Plugging (2) into (3), we have
$$U(x) = \frac{j}{\rho_0 c_0}\bigl[B\cos(kx) - A\sin(kx)\bigr]. \qquad (4)$$
Apply boundary conditions:
BC 1: The velocity at the closed end, x = 0, is specified as U(0):
$$U(0) = \frac{j}{\rho_0 c_0}\,B, \qquad B = -j\rho_0 c_0\, U(0). \qquad (5)$$
BC 2: The pressure at the open end of the tube is zero:
$$P(L) = 0, \qquad 0 = A\cos(kL) + B\sin(kL), \qquad A = j\rho_0 c_0\, U(0)\tan(kL). \qquad (6)$$
We are interested in the relation between the input and the output—the transfer function—of the
system. Combining (4), (5), and (6), we have
$$\frac{U(L)}{U(0)} = \tan(kL)\sin(kL) + \cos(kL) = \frac{\sin^2(kL)}{\cos(kL)} + \frac{\cos^2(kL)}{\cos(kL)} = \frac{1}{\cos(kL)}. \qquad (7)$$
Resonance occurs where the denominator vanishes, i.e. at
$$kL = \frac{\pi}{2}(2n+1), \qquad n = 0, 1, 2, \ldots$$
or
$$\omega = \frac{\pi c_0}{2L}(2n+1), \qquad \text{equivalently} \qquad f = \frac{c_0}{4L}(2n+1). \qquad (8)$$
The results show that resonance occurs at odd multiples of the fundamental resonance, $c_0/4L$. For
a vocal tract of length 17 cm (L = 17 cm) resonance will occur at approximately 500, 1500, 2500
Hz, etc.
This model seems to be a decent first approximation, yielding results similar to those observed
in the data. Clearly, however, real vocal tract resonances need not be so evenly spaced: the
actual data shows formants at all sorts of frequencies.
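As a quick numerical check of equation (8), the first few predicted resonances can be computed directly (assuming a sound speed of roughly 343 m/s).

    c0 = 343;                    % speed of sound (m/s)
    L  = 0.17;                   % vocal tract length (m)
    n  = 0:3;
    f  = c0*(2*n + 1)/(4*L)      % approx. 504, 1513, 2522, 3531 Hz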
2.3 Multi-tube Model
The results from the single tube model of Section 2.2 motivate us to build a new and better
model. This time we’ll use multiple tubes, each possibly having a different length and
cross-sectional area, to approximate the vocal tract. What we are doing, in effect, is quantizing
the area function shown in Section 2.1.
Most derivations of multi-tube models, [1], [2], [4], [5], rely on either electrical analogs of
acoustic circuits, which translate into transmission line problems, or discretized versions of the
pressure and volume velocity equations. In either case the solution to the wave equation can be
written as the sum of coming and going waves. For the ith tube, the volume velocity $u_i(x,t)$ is
$$u_i(x,t) = u_i^{+}\!\left(t - \tfrac{x}{c}\right) - u_i^{-}\!\left(t + \tfrac{x}{c}\right),$$
and the pressure $p_i(x,t)$ is
$$p_i(x,t) = \frac{\rho c}{A_i}\left[u_i^{+}\!\left(t - \tfrac{x}{c}\right) + u_i^{-}\!\left(t + \tfrac{x}{c}\right)\right],$$
where $u_i^{+}$ and $u_i^{-}$ are the forward- and backward-traveling waves and $A_i$ is the
cross-sectional area of the tube.
Since we are primarily interested in what happens at the boundary of the tubes, many authors,
[5], [7], [4], will discretize the velocity and pressure equations in time and space, by evaluating
the functions only at these boundaries. See the plot below.
2-Tube model of vocal tract (here, v is the volume velocity), from [7]
This sort of discrete interpretation of the multi-tube model gives rise to a useful signal flow
model. The model can further be changed to a digital waveguide equivalent if the length of each
tube is the same.
Signal flow diagram for two acoustic tubes, from [7]
The important parameters in the signal flow diagram are the reflection coefficients. These
coefficients arise in many areas of acoustics. They determine what percentage of a wave will
pass the boundary and what percentage will reflect. Simulations of digital waveguides built after
this model confirm that they do indeed lead to multiple resonances of the form witnessed in the
analysis of our data.
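A rough flavor of this claim, without building the waveguide explicitly, is sketched below: a made-up area function is mapped to reflection coefficients, which are then converted to an equivalent all-pole filter whose spectral peaks play the role of formants. The area values are invented, rc2poly is assumed to be available from the Signal Processing Toolbox, and sign conventions and end terminations differ between references, so this is only an illustration.

    c    = 343;                                % speed of sound (m/s)
    Ltot = 0.17;                               % total tract length (m)
    A    = [2.6 1.8 1.0 0.8 1.3 2.4 3.2 4.0];  % hypothetical section areas (cm^2)
    N    = length(A);
    Fs   = N*c/(2*Ltot);                       % sampling rate implied by the section length

    r = (A(2:end) - A(1:end-1)) ./ (A(2:end) + A(1:end-1));  % reflection coefficients
    a = rc2poly(r);                            % all-pole denominator from reflections
    [H, f] = freqz(1, a, 512, Fs);             % frequency response of the tube model

    plot(f, 20*log10(abs(H)))                  % peaks approximate the formants
    xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)')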
Actual simulation of these models is well beyond the scope of this paper. However, Gold and
Morgan assure us that “an acoustic configuration [of this form] can always be found to match
the measured steady-state spectrum of any speech signal” ([5], page 152).
2.4 Ideas for Further Work
Digital Waveguides:
The next step in the investigation of this topic would be to test some of the digital waveguide
models for speech production described by Gold and Morgan. Unfortunately, we never reached the
point in this presentation where we could show in detail exactly how the multi-tube model leads
to resonances similar to those in the recorded data. This was the goal at the outset, but the
task proved formidable.
Extraction of Reflection Coefficients:
A data analysis technique called Linear Predictive Coding (LPC) can be used to estimate a set of
reflection coefficients corresponding to an acoustic tube model for speech production. In LPC,
an all-pole model is fit to the spectral envelope of a segment of the speech waveform. The
corresponding reflection coefficients could be used to determine the relative areas of the tube
segments. An animation could be used to show how the shape of the vocal tract changes during
the articulation of a phrase.
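A hedged sketch of this idea is given below, using Matlab's lpc and poly2rc functions (Signal Processing Toolbox). The segment boundaries, model order, and the sign convention relating reflection coefficients to area ratios are all illustrative assumptions.

    Fs  = 11025;
    seg = y(round(0.09*Fs)+1 : round(0.16*Fs));   % a short voiced segment
    seg = seg .* hamming(length(seg));
    p   = 10;                                     % all-pole model order
    a   = lpc(seg, p);                            % prediction polynomial
    k   = poly2rc(a);                             % reflection coefficients
    % Relative section areas from the step-up relation A(i+1)/A(i) = (1+k)/(1-k);
    % the overall scale is arbitrary and the sign of k depends on convention.
    A = cumprod([1; (1 + k)./(1 - k)]);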
Speech synthesis:
Given sets of reflection coefficients, it would be possible to generate an artificial segment of
speech. The reflection coefficients could be extracted from actual speech as described above and
then used as parameters for generating speech. To generate the speech, an excitation would need
to be passed through a digital waveguide with time-varying reflection coefficients, or
equivalently, through a filter with time-varying poles.
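In its simplest form, the synthesis step could look like the sketch below: a periodic pulse train standing in for the glottal excitation is passed through an all-pole filter whose coefficients (a, as estimated above) would in practice be updated every 10-20 ms. The pitch value is illustrative.

    Fs = 11025; f0 = 120; dur = 0.5;              % illustrative pitch and duration
    excitation = zeros(round(dur*Fs), 1);
    excitation(1:round(Fs/f0):end) = 1;           % glottal pulse train at ~120 Hz
    s = filter(1, a, excitation);                 % shape the spectrum with the tube filter
    sound(s/max(abs(s)), Fs);                     % listen to the synthetic segment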
Appendix
Method of data collection
List of equipment
1. Microphone: Fender P-51 cardioid pattern dynamic microphone
2. Amplifier: Fender Passport P-250 sound system
3. Computer: Dell Dimension 4100, Pentium III, 800 MHz, 512 MB RAM
4. Sound card: Creative Sound Blaster Live
5. Software: Matlab 6.0
6. Analysis: Matlab GUI (http://www.ele.uri.edu/~hansenj/projects/record/)
The Microphone and Amplifier
All sounds shown in this paper were recorded using the equipment in the Experiential Signal
Processing Laboratory (ESPLab) in the department of Electrical and Computer Engineering at
the University of Rhode Island. Laboratory funding is provided by the National Science
Foundation and the Champlin Foundation.
Two of the major components used for recording are the
microphone and amplifier. The amplifier is a Fender Passport
P-250 sound system. It has, according to the manual, a “six
channel, 250-watt stereo powered mixer with digital reverb
and two custom full-range speaker cabinets.” Most of these
features are entirely unnecessary for this project. We simply
wish to supply power to a microphone and amplify its signal
before it is sent to the sound card.
The microphone is a Passport P-51 with cardioid pick-up pattern “designed to reject as much of
the sound coming from the side and rear of microphone as possible.” In other words, the P-51
has a certain directionality, with the main lobe on the vertical axis of the microphone. The
precise technical specifications for the P-51 are not readily available, but we can assume that
because of the microphone’s intended use—faithful representation of voice and acoustic music—
it ought to have a relatively flat response between 20 Hz and 20 kHz. Thus the microphone is
more than adequate for recording voice waveforms.
System Description
The microphone output is sent to the first channel of the amplifier. The gain for channel one is
adjusted appropriately, and all other channel levels are set to zero. The amplifier’s output, tape
out, is sent to the microphone jack on the computer’s soundcard. Recording is initialized from
the Matlab GUI, Record, as described in the program’s documentation.
Data flow: Sound goes from the microphone to the amplifier and then to the computer.
References
[1] G. Fant, Acoustic Theory of Speech Production, Mouton & Co., The Hague, 1970.
[2] K.N. Stevens and A.S. House, “An Acoustic Theory of Vowel Production and Some of its Implications,” Journal of Speech and Hearing Research, 4:303-320, 1961.
[3] D.G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, Inc., New York, 2000.
[4] D. O’Shaughnessy, Speech Communications: Human and Machine, The Institute of Electrical and Electronics Engineers, Inc., New York, 2000.
[5] B. Gold and N. Morgan, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, Inc., New York, 2000.
[6] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[7] J.H.L. Hansen, Slides for ECEN-5022 Speech Processing & Recognition, University of Colorado Boulder, 2000. (http://cslr.colorado.edu/classes/ECEN5022/)