Speech Acoustics Project
OCE 471 Underwater Acoustics
Jesse Hansen

Abstract:
In this paper, basic methods for analyzing recorded speech are presented. The spectrogram is introduced and subsequently utilized in a Matlab environment to reveal patterns in recorded voice data. Several examples of speech are recorded, analyzed, and compared. A model for voice production is introduced in order to explain the variety of time-frequency patterns in the waveforms. Specifically, a single tube and then a multi-tube model for the vocal tract are considered and related to resonances in the speech spectrum. It is shown that a series of connected acoustic tubes produces resonances similar to those that occur in speech.

Introduction

Motivation:
Consider the problem of speech recognition. When two different people speak the same phrase (or one person utters the same phrase twice), a human listener will generally have no trouble understanding each instance of that phrase. This leads us to believe that even though the two speakers may have different vocal qualities (different pitch, different accents, etc.), there must be some sort of invariant quality shared by the two instances of the spoken phrase.

Thinking about the problem a bit further, we realize that when two different people articulate the same phrase, they perform essentially the same mechanical motions. In other words, they move their mouths, tongues, lips, etc., in roughly the same way. We hypothesize that, as a result of these similarities in speech mechanics from person to person, there should be features in the recorded speech waveform that are similar across multiple instances of a spoken phrase.

One such set of speech features is the formants, which are resonances of the vocal tract. The frequencies at which these resonances occur are a direct result of the particular configuration of the vocal tract. As words are spoken, the speaker moves his or her tongue, mouth, and lips, changing the resonant frequencies with time. Analysis of these time-varying frequency patterns forms the basis for all modern speech recognition systems.

Organization:
This paper is broadly divided into two parts. Part 1 is concerned with analysis of voice waveforms. In Part 2, we delve into models for voice production and relate them to the data presented in Part 1.

Part 1 is organized as follows. In Section 1.1 we briefly describe the spectrogram, a widely used tool for time-frequency analysis of acoustic data, and illustrate its benefits with an example. A Matlab program for recording sounds and viewing their spectrograms is presented in Section 1.2. In Section 1.3 we divide speech sounds into two broad categories, voiced and unvoiced speech, restricting our analysis to voiced speech. Finally, in Section 1.4, several speech waveforms are presented and analyzed.

Part 2 is organized as follows. Section 2.1 briefly describes the vocal tract, and then Section 2.2 presents a single acoustic tube model for the vocal tract. Section 2.3 presents a multi-tube model and discusses various ways that the model can be analyzed. Closing remarks are made in Section 2.4.

PART I: Data Analysis

1.1 The Spectrogram

The spectrogram of a waveform shows the signal as a function of both frequency and time. It is computed as follows:

1. The original waveform is first broken into smaller blocks of equal size. The choice of block size depends on the frequency content of the underlying data.
   For speech, a width of 20 to 30 ms is often used. Blocks are allowed to overlap; an overlap of 50% is typical.

2. Each block is multiplied by a window function. Most window functions have a value of 1 in the middle and taper down towards 0 at the edges. 'Windowing' a block of data diminishes the magnitude of the samples at the edges while maintaining the magnitude of the samples in the middle.

3. The Discrete Fourier Transform (DFT) of each windowed block is computed, and only the magnitude of the DFT is retained.

The result is a set of frequency vectors (the magnitudes of the DFTs), one for each block of the original waveform. Each frequency vector is localized in time according to the location of the time block from which it was computed.

A simple example will help to illustrate the point. Below we have the waveform and spectrogram of a bird chirping. This sound was borrowed from a Matlab demonstration.

The waveform and spectrogram of a chirping bird

The upper plot shows the time domain waveform of a bird chirping. Below it is the spectrogram, which shows frequency content as a function of time. Frequency is on the vertical axis and time is on the horizontal. Blue indicates larger magnitude while red indicates smaller magnitude. The beauty of the spectrogram is that it clearly illustrates how the frequency of a signal varies with time. In this example we can see that each chirp starts at a high frequency, usually between 3 and 4 kHz, and over the course of about 0.1 seconds decreases in frequency to about 2 kHz. This type of detail would be lost if we chose to take the DFT of the entire waveform.

Technical details:
- The sampling rate is 8 kHz.
- The block size is 25 ms, or 200 samples.
- There is 87.5% overlap between blocks.
- The blocks are multiplied by a Hamming window.
- There are 2^11 = 2048 points in the DFT of each block.

1.2 Matlab Recording & Analysis Program

Here we present a Matlab program, Record, written by the author in the summer of 2001. The program is intended to simplify the recording and basic editing of speech waveforms, as well as to present the spectrogram and the time waveform in a side-by-side format for ease of analysis. The remainder of this section describes the program: its inner workings and functionality.

Running the program: The program can be run by typing record at the Matlab prompt or by opening the program in the Matlab editor and selecting Run from the Debug menu.

Recording: Sound recording is initiated through the Matlab graphical user interface (GUI) by clicking on the Record button. The duration of the recording can be adjusted to be anywhere from 1 to 6 seconds. (These are the GUI defaults, but the code can be modified to record for longer durations if desired.) Upon being clicked, the Record button executes a function that reads in mono data from the microphone jack on the sound card and stores it in a Matlab vector.

Most of the important information in a typical voice waveform is found below a frequency of about 4 kHz. Accordingly, we should sample at least twice this frequency, or 8 kHz. (Note that all sound cards have a built-in pre-filter to limit the effects of aliasing.) Since there is at least some valuable information above 4 kHz, the Record GUI has a default sampling rate of 11.025 kHz; this can be modified in the code. A sampling rate of 16 kHz had been used in the past, but the data acquisition toolbox in Matlab 6.0 does not support this rate.
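As a concrete illustration of the three-step spectrogram computation described in Section 1.1, the Matlab fragment below builds a spectrogram 'by hand' and notes the equivalent call to the toolbox routine. It is only a minimal sketch: the variable x stands for any mono recording, and the 25 ms block, 50% overlap, and 2048-point DFT are illustrative values, not the exact settings hard-coded in Record.

    % x: any mono recording, fs: its sampling rate in Hz
    x    = x(:);                            % force a column vector
    fs   = 11025;                           % Record GUI default sampling rate
    N    = round(0.025*fs);                 % 25 ms blocks
    hop  = round(N/2);                      % 50% overlap between blocks
    nfft = 2048;                            % DFT length
    w    = hamming(N);                      % window function

    nblocks = floor((length(x) - N)/hop) + 1;
    S = zeros(nfft/2 + 1, nblocks);
    for m = 1:nblocks
        blk     = x((m-1)*hop + (1:N)) .* w;   % steps 1 and 2: block, then window
        X       = abs(fft(blk, nfft));         % step 3: magnitude of the DFT
        S(:, m) = X(1:nfft/2 + 1);             % keep 0 to fs/2 only
    end
    imagesc((0:nblocks-1)*hop/fs, (0:nfft/2)*fs/nfft, 20*log10(S + eps));
    axis xy; xlabel('Time (s)'); ylabel('Frequency (Hz)');

    % the Signal Processing Toolbox routine used by Record does the same job:
    % B = specgram(x, nfft, fs, w, N - hop);

Changing the block size trades time resolution against frequency resolution, which is why the speech-appropriate value of 20 to 30 ms mentioned above is a sensible default.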
Once recorded, the time data is normalized to a maximum amplitude of 0.99 and displayed in the upper plot of the GUI window. In addition to the time domain waveform, a spectrogram is computed using Matlab's built-in specgram function (part of the Signal Processing Toolbox). An example recording of the sentence "We were away a year ago" is shown below.

"We were away a year ago"

Zooming in on the waveform: One can examine a region of interest in the waveform using the Zoom in button. When Zoom in is clicked, the cursor changes to a cross hair. Clicking the left mouse button and dragging a rectangle around the region of interest in the time domain waveform selects a sub-section of data. In the example below we have zoomed in on the region from about 1 to 1.2 seconds.

'Zoomed in' on the waveform

Zooming out: The Zoom out button changes the axes back to what they were before Zoom in was used. If you zoom in multiple times, zooming out returns you to the previous axis limits.

Listening to the waveform: The Play button uses Matlab's sound function to play back (send to the speakers) the waveform that appears in the GUI. If you have zoomed in on a particular section of the waveform, only that portion will be sent to the speakers.

Saving and loading: Save is used to write the waveform to a wave file. If you have zoomed in on a segment of data, only that portion of the waveform will be saved. Click Load to import any mono wave file into the Record GUI for analysis.

1.3 Voiced and Unvoiced Speech

Speech is commonly segmented into two broad categories, voiced and unvoiced. We will differentiate the two classes by their method of production and by the time and frequency patterns that we observe in the recorded data. This project is primarily concerned with voiced speech, for reasons explained in a moment.

Voiced speech: All voiced speech originates as vibrations of the vocal cords, and its primary characteristic is its periodic nature. Voiced speech is created by pushing air from the lungs up the trachea to the vocal folds (cords), where pressure builds until the folds part, releasing a puff of air. The folds then return to their original position as the pressure on each side is equalized. Muscles controlling the tension and elasticity of the folds determine the rate at which they vibrate. See [3]. The puffs of air from the vocal cords subsequently pass through the vocal tract and then through the air to our ears. The periodicity of the vocal cord vibrations is directly related to the perceived pitch of the sound. We will examine the effects of the vocal tract in more detail later on.

Vowel sounds are one example of voiced speech. Consider the /aa/ sound in father, or the /o/ sound in boat. In the segment of voiced speech below, note the periodicity of the waveform.

Segment of voiced speech, /aa/ in father

Unvoiced speech: Unvoiced speech does not have the periodicity associated with voiced speech. In many kinds of unvoiced speech, a noise-like sound is produced at the front of the mouth using the tongue, lips, and/or teeth; the vocal folds are held open for these sounds. Consider the sounds /f/ as in fish and /s/ as in sound. The /f/ sound is created by forcing air between the lower lip and teeth, while /s/ is created by forcing air through a constriction between the tongue and the roof of the mouth or the teeth. The waveform below shows a small segment of unvoiced speech. Note its distinguishing characteristics: it is low in amplitude, noise-like, and it changes more rapidly than voiced speech.

Segment of unvoiced speech, /sh/ in she
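The voiced/unvoiced distinction can also be made quantitative. A common rough heuristic, which is not part of the Record program and is much simpler than what real recognizers use, compares short-time energy with zero-crossing rate: voiced frames tend to have high energy and few zero crossings, while unvoiced frames show the opposite. A minimal Matlab sketch, with x a recorded waveform and fs its sampling rate (the frame length and thresholds are guesses that would need tuning):

    x        = x(:);
    frameLen = round(0.025*fs);                    % 25 ms analysis frames
    nFrames  = floor(length(x)/frameLen);
    energy   = zeros(nFrames, 1);
    zcr      = zeros(nFrames, 1);
    for m = 1:nFrames
        frame     = x((m-1)*frameLen + (1:frameLen));
        energy(m) = sum(frame.^2);                     % short-time energy
        zcr(m)    = sum(abs(diff(sign(frame)))) / 2;   % rough zero-crossing count
    end
    % crude labels: energetic, slowly varying frames are likely voiced
    voiced = (energy > 0.1*max(energy)) & (zcr < 0.5*max(zcr));

Such a flag, plotted under the spectrogram of a word like sky (analyzed next), should roughly separate the noise-like /s/ from the periodic /eye/.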
Let's further examine the waveform and spectrogram of a word containing both voiced and unvoiced speech. A recording of the word sky was made with the Matlab program. The /s/ sound, we now know, is unvoiced, while the /eye/ sound is voiced. (The /k/ is also unvoiced, but not noise-like.) The plots are shown below. Looking at the spectrogram, we note that /s/ contains a broad range of frequencies but is concentrated at the higher frequencies. The resonances, or formants, in the speech waveform can be seen as blue horizontal stripes in the spectrogram. These formants, mentioned in the introduction, are not particularly clear in the unvoiced /s/, but are quite obvious in the voiced /eye/. It is for this reason that we restrict our analysis to voiced speech.

1.4 Data Analysis

Several speech waveforms will be analyzed here, and various features in the waveforms and spectrograms will be noted. The most prominent features in the spectrograms are the dark (blue) horizontal bands, called formants, corresponding to frequencies of greater energy. It will be shown in Part II that these formants result from resonances in the vocal tract. In addition to the waveforms and spectrograms, we will analyze the spectra of small segments of the waveforms. These 2-D spectra will help us compare formant frequencies from sound to sound. (A short Matlab recipe for producing such plots is given in an aside later in this section.)

Waveform 1: "Already"

The word already contains several different sounds, all of them voiced. Notice how the formants change with time. One significant feature that is easy to identify in the spectrogram is the /d/ sound. This sound is called a stop, for obvious reasons: when the /d/ is pronounced, the tongue temporarily stops air from leaving the oral cavity. This action leads to a small amplitude in the waveform and a 'hole' in the spectrogram. Can you spot it?

Waveform and spectrogram of the word already

We'll now take a look at a small sub-section of the waveform. We've isolated about 0.07 seconds of sound towards the beginning of the word. By itself, this portion of the waveform sounds a bit like the /au/ in the word caught. The plot below shows the waveform and spectrum of the sub-signal.

/au/ in already from about 0.09 to 0.16 sec

The primary item of interest is the bottom-most plot, which shows the spectrum of the signal in decibels (dB). The approximate locations of the formants are indicated by the vertical dotted lines. Note how this plot matches up with the spectrogram. Both indicate a wide first formant and four more formants between 2 and 4 kHz (although the 4th is not as well defined as the other three).

Moving on, we isolate a different portion of the same waveform, the /ee/ sound at the end of the word already. By looking at either the waveform or the spectrum, we can see that /ee/ contains more high frequency information than /au/: /ee/ has a narrow first formant and three additional formants at higher frequencies.

/ee/ in already from about 0.418 to 0.457 sec

Waveform 2: "Cool"

The word cool contains three distinct sounds. The /c/ at the beginning is an unvoiced sound. The /oo/ and the /L/ are both voiced sounds, but each has different properties. For the sake of time we'll stick to the /L/ sound at the end of the word. An /L/ is created by arching the tongue in the mouth, which leads to a sound that differs greatly from most of the vowel sounds.

Waveform and spectrogram of the word cool
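As an aside, the segment spectra plotted throughout this section (the /au/, /ee/, and /L/ plots) can be reproduced with a few lines of Matlab. The sketch below is illustrative only: the segment times are read off the time axis by hand, and the window and DFT length are arbitrary but reasonable choices.

    % isolate a segment by its start and stop times (read off the time axis)
    t1 = 0.30;  t2 = 0.33;                           % e.g., the /L/ segment of "cool"
    seg = x(round(t1*fs) : round(t2*fs));
    seg = seg(:) .* hamming(length(seg));            % window to reduce edge effects

    nfft = 2048;
    X = fft(seg, nfft);
    f = (0:nfft/2)*fs/nfft;                          % frequency axis, 0 to fs/2
    plot(f, 20*log10(abs(X(1:nfft/2 + 1)) + eps));   % spectrum in dB
    xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');

Peaks in the resulting plot correspond to the formants marked with dotted lines in the figures.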
Below, you'll see a segment of the /L/ sound. It is different from the last two sounds we've analyzed in that it contains virtually no high frequency components. There are two high frequency formants, at about 3000 and 3500 Hz, but they are a great deal smaller in magnitude than the first formant.

/L/ in cool from about 0.3 to 0.33 sec

Waveform 3: "Cot"

We chose the word cot to get a look at the /ah/ sound in the middle. You'll notice from the spectrogram that this sound is quite frequency rich, containing several large formants between 0 and 5 kHz.

Waveform and spectrogram of the word cot

Compare the /ah/ waveform (below) to the previous /L/ waveform. There is a great deal more activity in this waveform, which explains the variety of frequencies in the spectrogram. The spectrum of the signal reveals 5 (possibly 6) significant formants, each one having a sizable bandwidth.

/ah/ in cot from about 0.17 to 0.22 sec

Conclusion from data analysis: It is clear from the data that different vocal sounds have widely varying spectral content. However, each sound contains similar features, like formants, that arise from the method of speech production. We'll now begin to talk a bit about the vocal tract and the attempts that have been made to model its functionality. The predictions of these models will then be related to the empirical results from this section.

PART II: Speech Production Models

2.1 The Vocal Tract

Here we examine a bit of the physiology behind the production of speech. Voiced speech is created by simultaneously exciting the vocal cords into vibration and configuring the vocal tract into a particular shape. The vocal tract consists of everything between the vocal cords and the lips: the pharyngeal cavity, tongue, oral cavity, etc. Pressure waves travel from the vocal cords along this non-uniform path to the mouth and, eventually, to our ears.

Figure from [5], which is in turn from [1]

The figure above shows a schematic diagram of the vocal tract on the left and, on the right, a plot of the area of the vocal tract as a function of distance (in centimeters) from the vocal cords. The area/distance function plotted here is for the sound /i/, as in bit. The configuration of the vocal tract, and hence the plot, is different for different sounds. Looking at the area-versus-distance function, you'll notice that there are two major resonant chambers in the vocal tract: the first, the pharyngeal cavity, extends from about 1 to 8 cm, and the second, the oral cavity, from about 14 to 16 cm. This manner of thinking about the vocal tract, identifying resonant chambers, leads us to model it as an acoustic tube, or a concatenation of acoustic tubes, an idea we'll examine further in subsequent sections.

Aside: Some sounds, called nasals, also use the nasal cavity as an additional resonant chamber and path to the outside world. However, these sounds are a small subset of the set of voiced sounds, and we will ignore them in this presentation. More detail about the vocal cords can be found in [3], and an in-depth analysis of the vocal tract can be found in [4].

Here come the models

We'll now examine some models of speech production. All of these models make a couple of simplifying assumptions:
1. The vocal cords are independent of the vocal tract.
2. The vocal tract is a linear system.
Neither of these assumptions is entirely true, but they help to simplify the analysis. The acoustic tube models that we will examine can also be viewed as source-filter models. Either way you look at it, there is an excitation sent through a channel, and that channel alters the spectral content of the excitation.
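To make the source-filter picture concrete, the Matlab sketch below passes a periodic pulse train (a crude stand-in for the glottal excitation) through a filter with two resonances and plots the resulting spectrum; harmonic peaks shaped by the resonances appear, much like formants. The pitch, resonance frequencies, and bandwidths are arbitrary illustrative values, not quantities taken from the models that follow.

    fs  = 11025;                                    % sampling rate, Hz
    f0  = 120;                                      % pitch of the excitation, Hz
    t   = 0:1/fs:0.5;
    src = zeros(size(t));
    src(1:round(fs/f0):end) = 1;                    % periodic pulse train (the "source")

    % resonator (the "filter"): poles placed near 500 Hz and 1500 Hz
    form = [500 1500];  bw = [80 100];              % center frequencies and bandwidths, Hz
    p = exp(-pi*bw/fs) .* exp(1j*2*pi*form/fs);
    a = real(poly([p conj(p)]));                    % all-pole denominator coefficients
    y = filter(1, a, src);

    nfft = 4096;
    f = (0:nfft/2)*fs/nfft;
    Y = fft(y, nfft);
    plot(f, 20*log10(abs(Y(1:nfft/2 + 1)) + eps));  % peaks appear near 500 and 1500 Hz
    xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');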
2.2 A Single Tube Model

Suppose for a moment that we model the vocal tract as a single acoustic tube of uniform area. Suppose further that the tube is excited at one end by a periodic input and open at the other. We will now analyze such a configuration and attempt to determine the resonant frequencies that this model predicts.

Resonance in a single acoustic tube: excitation at x = 0 (the closed end), open end at x = L, with u(0,t) = Re{U(0)e^{jωt}} at the input and u(L,t) = Re{U(L)e^{jωt}} at the output.

The wave equation is

    ∂²p/∂x² = (1/c0²) ∂²p/∂t²

Assume a solution of the form p(x,t) = Re{P(x)e^{jωt}}. We can then use complex notation to write the reduced wave equation:

    d²P/dx² = -(ω²/c0²) P,  or  d²P/dx² + k²P = 0,  with k = ω/c0     (1)

Equation (1) has a solution of the form

    P(x) = A cos(kx) + B sin(kx)     (2)

Pressure and velocity are related by the conservation of linear momentum:

    ρ0 ∂u/∂t = -∂p/∂x,  or, in complex notation,  jωρ0 U = -dP/dx     (3)

Plugging (2) into (3), we have

    U(x) = (j/(ρ0 c0)) [B cos(kx) - A sin(kx)]     (4)

Apply boundary conditions:

BC 1: The velocity at the closed (excited) end, x = 0, is specified as U(0):

    U(0) = (j/(ρ0 c0)) B,  so  B = -j ρ0 c0 U(0)     (5)

BC 2: The pressure at the open end of the tube is zero, P(L) = 0:

    0 = A cos(kL) + B sin(kL),  so  A = -B tan(kL) = j ρ0 c0 U(0) tan(kL)     (6)

We are interested in the relation between the input and the output (the transfer function) of the system. Combining (4), (5), and (6), we have

    U(L)/U(0) = tan(kL) sin(kL) + cos(kL) = (sin²(kL) + cos²(kL))/cos(kL) = 1/cos(kL)     (7)

Resonance occurs when cos(kL) = 0, that is, at

    kL = (2n+1) π/2,  n = 0, 1, 2, ...,  or  ω = (2n+1) π c0/(2L),  or  f = (2n+1) c0/(4L)     (8)

The results show that resonance occurs at odd multiples of the fundamental resonance c0/(4L). For a vocal tract of length 17 cm (L = 17 cm), resonance will occur at approximately 500, 1500, 2500 Hz, etc. This model seems to be a decent first approximation, yielding results similar to those observed in the data. Clearly, however, resonances should not occur at such nicely spaced intervals; the actual data shows resonances at all sorts of frequencies.

2.3 Multi-tube Model

The results from the single tube model of Section 2.2 motivate us to build a new and better model. This time we'll use multiple tubes, each possibly having a different length and cross-sectional area, to approximate the vocal tract. What we are doing, in effect, is quantizing the area function shown in Section 2.1. Most derivations of multi-tube models [1], [2], [4], [5] rely either on electrical analogs of acoustic circuits, which translate into transmission line problems, or on discretized versions of the pressure and volume velocity equations. In either case the solution to the wave equation can be written as the sum of forward- and backward-traveling waves. For the ith tube, the volume velocity u_i(x,t) is

    u_i(x,t) = u_i+(t - x/c) - u_i-(t + x/c)

and the pressure p_i(x,t) is

    p_i(x,t) = (ρc/A_i) [u_i+(t - x/c) + u_i-(t + x/c)]

where u_i+ and u_i- are the forward- and backward-traveling volume velocity waves and A_i is the cross-sectional area of the ith tube. Since we are primarily interested in what happens at the boundaries of the tubes, many authors [4], [5], [7] discretize the velocity and pressure equations in time and space by evaluating the functions only at these boundaries. See the plot below.

2-tube model of the vocal tract (here, v is the volume velocity), from [7]

This sort of discrete interpretation of the multi-tube model gives rise to a useful signal flow model. The model can further be changed to a digital waveguide equivalent if the length of each tube is the same.

Signal flow diagram for two acoustic tubes, from [7]

The important parameters in the signal flow diagram are the reflection coefficients. These coefficients arise in many areas of acoustics. They determine what percentage of a wave will pass each boundary and what percentage will reflect.
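As a small illustration, the sketch below computes reflection coefficients for a crude quantization of the area function from Section 2.1, using the usual lossless-tube definition r_k = (A_{k+1} - A_k)/(A_{k+1} + A_k) found in the multi-tube derivations cited above (the formula is not stated explicitly in this paper, so treat its use here as an assumption of the sketch). The area values are rough guesses, not measurements, and the final lines simply repeat the single-tube resonance calculation of Section 2.2.

    % cross-sectional areas of a chain of uniform tube sections, in cm^2,
    % ordered from the glottis to the lips; these are illustrative guesses only
    A = [2.6 8.0 6.0 1.0 3.5];

    % reflection coefficient at the junction between section k and section k+1
    r = (A(2:end) - A(1:end-1)) ./ (A(2:end) + A(1:end-1))

    % single-tube check from Section 2.2: resonances at odd multiples of c0/(4L)
    c0 = 34000;                      % speed of sound, cm/s
    L  = 17;                         % vocal tract length, cm
    n  = 0:3;
    f  = (2*n + 1)*c0/(4*L)          % approximately 500, 1500, 2500, 3500 Hz

In the digital waveguide picture, each r value sets the gains in the signal flow diagram above: junctions where the area changes abruptly give coefficients near plus or minus 1 (strong reflection), while nearly equal areas give coefficients near 0.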
Simulations of digital waveguides built after this model confirm that they do indeed lead to multiple resonances of the form witnessed in the analysis of our data. Actual simulation of these models is beyond the scope of this paper. However, Gold and Morgan assure us that "an acoustic configuration [of this form] can always be found to match the measured steady-state spectrum of any speech signal" ([5], page 152).

2.4 Ideas for Further Work

Digital waveguides: The next step in the investigation of this topic would be to test some of the digital waveguide models for speech production described by Gold and Morgan. Unfortunately, this presentation never reached the point where we could show in detail exactly how the multi-tube model leads to resonances similar to those in the recorded data. This was the goal at the outset, but the task proved formidable.

Extraction of reflection coefficients: A data analysis technique called Linear Predictive Coding (LPC) can be used to estimate a set of reflection coefficients corresponding to an acoustic tube model for speech production. In LPC, an all-pole model is fit to the spectral envelope of a segment of the speech waveform. The corresponding reflection coefficients could be used to determine the relative areas of the tube segments. An animation could then show how the shape of the vocal tract changes during the articulation of a phrase.

Speech synthesis: Given sets of reflection coefficients, it would be possible to generate an artificial segment of speech. The reflection coefficients could be extracted from actual speech as described above and then used as parameters for generating speech. To generate the speech, an excitation would need to be passed through a digital waveguide with time-varying reflection coefficients, or equivalently, through a filter with time-varying poles.

Appendix

Method of data collection

List of equipment:
1. Microphone: Fender P-51 cardioid pattern dynamic microphone
2. Amplifier: Fender Passport P-250 sound system
3. Computer: Dell Dimension 4100, Pentium III, 800 MHz, 512 MB RAM
4. Sound card: Creative Sound Blaster Live
5. Software: Matlab 6.0
6. Analysis: Matlab GUI (http://www.ele.uri.edu/~hansenj/projects/record/)

The Microphone and Amplifier

All sounds shown in this paper were recorded using the equipment in the Experiential Signal Processing Laboratory (ESPLab) in the Department of Electrical and Computer Engineering at the University of Rhode Island. Laboratory funding is provided by the National Science Foundation and the Champlin Foundation. Two of the major components used for recording are the microphone and the amplifier. The amplifier is a Fender Passport P-250 sound system. It is, according to the manual, a "six channel, 250-watt stereo powered mixer with digital reverb and two custom full-range speaker cabinets." Most of these features are unnecessary for this project; we simply wish to supply power to a microphone and amplify its signal before it is sent to the sound card. The microphone is a Passport P-51 with a cardioid pick-up pattern, "designed to reject as much of the sound coming from the side and rear of microphone as possible." In other words, the P-51 has a certain directionality, with the main lobe on the vertical axis of the microphone.
The precise technical specifications for the P-51 are not readily available, but because the microphone is intended for faithful reproduction of voice and acoustic music, we can assume it has a relatively flat response between 20 Hz and 20 kHz. Thus the microphone is more than adequate for recording voice waveforms.

System Description

The microphone output is sent to the first channel of the amplifier. The gain for channel one is adjusted appropriately, and all other channel levels are set to zero. The amplifier's output, tape out, is sent to the microphone jack on the computer's sound card. Recording is initiated from the Matlab GUI, Record, as described in the program's documentation.

Data flow: sound goes from the microphone to the amplifier and then to the computer.

References

[1] G. Fant, Acoustic Theory of Speech Production, Mouton & Co., The Hague, 1970.
[2] K.N. Stevens and A.S. House, "An Acoustic Theory of Vowel Production and Some of its Implications", Journal of Speech and Hearing Research, 4:303-320, 1961.
[3] D.G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, Inc., New York, 2000.
[4] D. O'Shaughnessy, Speech Communications: Human and Machine, The Institute of Electrical and Electronics Engineers, Inc., New York, 2000.
[5] B. Gold and N. Morgan, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, Inc., New York, 2000.
[6] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[7] J.H.L. Hansen, Slides for ECEN-5022 Speech Processing & Recognition, University of Colorado Boulder, 2000 (http://cslr.colorado.edu/classes/ECEN5022/).