MOVING-TALKER, SPEAKER-INDEPENDENT FEATURE STUDY AND BASELINE RESULTS USING THE CUAVE MULTIMODAL SPEECH CORPUS

E.K. Patterson, S. Gurbuz, Z. Tufekci, and J.N. Gowdy
Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA
{epatter, sabrig, ztufekci, jgowdy}@eng.clemson.edu
http://ece.clemson.edu/speech

ABSTRACT

Strides in computer technology and the search for deeper, more powerful techniques in signal processing have brought multimodal research to the forefront in recent years. Audio-visual speech processing has become an important part of this research because it holds great potential for overcoming certain problems of traditional audio-only methods. Difficulties due to background noise and multiple speakers in an application environment are significantly reduced by the additional information provided by visual features. This paper presents information on a new audio-visual database, a feature study on moving speakers, and baseline results for the whole speaker group. Although a few databases have been collected in this area, none has emerged as a standard for comparison. Also, efforts to date have often been limited, focusing on cropped video or stationary speakers. This paper seeks to introduce a challenging audio-visual database that is flexible and fairly comprehensive, yet easily available to researchers on one DVD. The CUAVE database is a speaker-independent corpus of both connected and continuous digit strings totaling over 7,000 utterances. It contains a wide variety of speakers and is designed to meet several goals discussed in this paper. One of these goals is to allow testing of adverse conditions such as moving talkers and speaker pairs. For information on obtaining CUAVE, please visit our webpage (http://ece.clemson.edu/speech). A feature study of connected digit strings is also discussed that compares stationary and moving talkers in a speaker-independent grouping. An image-processing-based contour technique, an image-transform method, and a deformable-template scheme are used in this comparison to obtain visual features. This paper also presents methods and results in an attempt to make these techniques more robust to speaker movement. Finally, initial baseline speaker-independent results are included using all speakers, and conclusions as well as suggested areas of research are given.

Keywords: audio-visual speech recognition, speech-reading, multimodal database.

1. INTRODUCTION

Over the past decade, multimodal signal processing has been an increasing area of interest for researchers. Over recent years the potential of multimodal signal processing has grown as computing power has increased. Faster processing allows the consideration of methods which use separate audio and video modalities for improved results in many applications. Audio-visual speech processing has shown great potential, particularly in speech recognition. The addition of information from lipreading or other features helps make up for information lost due to corrupting influences in the audio. Because of this, audio-visual speech recognition can outperform audio-only recognizers, particularly in environments where there are background noises or other speakers. Researchers have demonstrated the relationship between lipreading and human understanding [1, 2] and have produced performance increases with multimodal systems [3, 4, 5, 6, 7].
Because of the young age of this area of research as well as difficulties associated with the high volumes of data necessary for simultaneous video and audio, the creation and distribution of audio-visual databases have been somewhat limited to date. Most researchers have been forced to record their own data. These recordings have often been limited to cropped video, stationary speakers, or video with aids for feature segmentation. In order to allow for better comparison and for researchers to enter the area of study more easily, available databases are necessary that meet certain criteria. The Clemson University Audio Visual Experiments (CUAVE) corpus is a new audio-visual database that has been designed to meet some of these criteria. It is a speaker-independent database of connected (or isolated, if desired) and continuous digit strings of high-quality video and audio of a representative group of speakers and is easily available on one DVD.

Section 2 of this paper presents a discussion of the database and associated goals. Subsection 2.1 discusses a brief survey of audio-visual databases and presents motivation for a new database. Next, Subsection 2.2 includes design goals of the CUAVE database and specifics of the data collection and corpus format. One of the goals of the CUAVE database is to include realistic conditions for testing of robust audio-visual schemes. One of these considerations is the movement of speakers. Methods are necessary that do not require a fixed talker. Recordings from the database are grouped into tasks where speakers are mostly stationary and tasks where speakers intentionally move while speaking. Movements include nodding the head in different directions, moving back-and-forth and side-to-side in the field of view, and in some cases rotation of the head. This makes for more difficult testing conditions. A feature study, discussion, and results are included that compare stationary and moving talkers in a speaker-independent grouping. Features included are an image-processing-based contour method that is naturally affine-invariant, an image-transform method that is also modified to improve rotation-robustness, and a new variation of the deformable template that is also inherently affine-invariant. Section 3 includes a discussion of the different feature extraction methods, the overall recognition system, training and testing from the database, and results of the different techniques.

Finally, one of the main purposes of all speech corpora is to allow the comparison of methods and results in order to stimulate research and fuel advances in speech processing. This is a main consideration of the CUAVE database, easily distributable on one DVD. It is a fully-labeled, medium-sized task that facilitates more rapid development and testing of novel audio-visual techniques. To this end, baseline results are included using straightforward methods for a speaker-independent grouping of all the speakers in a connected-digit string task. Section 4 presents these results. Finally, Section 5 closes with some observations and suggestions about possible areas of research in audio-visual speech processing.

2. THE CUAVE SPEECH CORPUS

This section discusses the motivation and design of the CUAVE corpus. It includes a brief review of known audio-visual databases. Information on the design goals of this database is included along with the corpus format. Finally, a single-speaker case is discussed that was performed in preparation before collecting this database.
This case presents isolated-word recognition in various noisy conditions, comparing an SNR-based fusion with a noise-and-SNR-based fusion method. Basing fusion on the noise type is shown to improve results.

2.1. Audio-Visual Databases and Research Criteria

There have been a few efforts in creating databases for the audio-visual research community. Tulips1 is a twelve-subject database of the first four English digits recorded in 8-bit grayscale at 100x75 resolution [8]. AVLetters includes the English alphabet recorded three times by ten talkers in 25 fps grayscale [9]. DAVID is a larger database including various recordings of thirty-one speakers over five sessions including digits, alphabets, vowel-consonant-vowel syllable utterances, and some video conference commands, distributed on multiple SVHS tapes [10]. It is recorded in color and has some lip highlighting. The M2VTS database and the expanded XM2VTSDB are geared more toward person authentication and include 37 and 295 speakers, respectively, including head rotation shots, two sequences of digits, and one sentence for each speaker [11]. The M2VTS database is available on HI-8 tapes, and XM2VTSDB is available on 20 1-hour MiniDV cassettes. There have also been some proprietary efforts by AT&T and by IBM [12, 6]. The AT&T database was to include connected digits and isolated consonant-vowel-consonant words. There were also plans for large-vocabulary continuous sentences. The database was never completed nor released, however. The IBM database is the most comprehensive task to date, and fortunately, front-end features from the IBM database are becoming available, but there are currently no plans to release the large number of videos [12].

Large-vocabulary, continuous speech recognition is an ultimate goal of research. Because audio-visual speech processing is a fairly young discipline, though, there are many important techniques open to research that can be performed more easily and rapidly on a medium-sized task. To meet the need for a more widespread testbed for audio-visual development, CUAVE was produced as a speaker-independent database consisting of connected and continuous digits spoken in different situations. It is flexible, representative, and easily available. Subsection 2.2 discusses the design goals, collection methods, and format of the corpus.

Fig. 2.1. Sample Speakers from the Database.

2.2. Design Goals and Corpus Format

The major design criteria were to create a flexible, realistic, easily distributable database that allows for representative and fairly comprehensive testing. Because DVD readers for computers have become very economical recently, the choice was made to design CUAVE to fit on one DVD-data disc that could be made available through contact information listed on our website. Aside from being a speaker-independent collection of isolated and connected digits (zero through nine), CUAVE is designed to enhance research in two important areas: audio-visual speech recognition that is robust to speaker movement and also recognition that is capable of distinguishing multiple simultaneous speakers. The database is also fully, manually labeled to improve training and testing possibilities. This section discusses the collection and formatting of the database performed to meet the aforementioned goals. The database consists of two major sections, one of individual speakers and one of speaker pairs. The first major part with individual speakers consists of 36 speakers.
The selection of individuals was not tightly controlled, but was chosen so that there is an even representation of male and female speakers, unfortunately a rarity in research databases, and also so that different skin tones and accents are present, as well as other visual features such as glasses, facial hair, and hats. Speakers also have a wide variety of skin and lip tones as well as face and lip shapes. See Figure 2.1 for a sample of images from some of the CUAVE speakers.

For the individual speaker part of the database, each speaker was framed to include the shoulders and head and recorded speaking connected and continuous digit strings in several different styles. Initially, 50 connected digits are spoken while standing still. As this was not forced, there is some small, natural movement among these speakers. Also, at a few points some speakers actually lapse into continuous digits. Aside from these lapses, the connected-digit tasks may be treated as isolated digits, if desired, since label data is included for all digits. This section is for general training and testing without more difficult conditions. Video is full color with no aids given for face/lip segmentation.

Secondly, each individual speaker was asked to intentionally move around while talking. This includes side-to-side, back-and-forth, and/or tilting movement of the head while speaking 30 connected digits. There is occasionally slight rotation in some of the speakers as well. This group of moving talkers is an important part of the database to allow for testing of affine-invariant visual features. In addition to this connected-digit section, there is also a continuous-digit section with movement as well. So far, much research has been limited to low-resolution, pre-segmented video of only the lip region. Including the speaker's upper body and segments with movement will allow for more realistic testing.

The third and fourth parts spoken by each individual include profile views and continuous digit strings facing the camera. Both profiles are recorded while speaking 10 connected digits for each side. The final continuous digits were listed as telephone numbers on the prompting to elicit a more natural response among readers. There are 6 strings. The first 3 are stationary, and the final 3 are moving, for a total of 60 continuous digits for each speaker. See Table 2.1 for a summary of these tasks.

Table 2.1. Summary of CUAVE Tasks: Individuals and Pairs, Stationary and Moving, Connected and Continuous Digits

Part            Task   Movement   Number of Digits       Mode
1. Individual   1      Still      50 x 36 speakers       Connected
1. Individual   2      Moving     30 x 36 speakers       Connected
1. Individual   3      Profile    20 x 36 speakers       Connected
1. Individual   4      Still      30 x 36 speakers       Continuous
1. Individual   5      Moving     30 x 36 speakers       Continuous
2. Pairs        6      Still      (30 x 2) x 20 pairs    Continuous

The second major section of the database includes 20 pairs of speakers. Again, see Table 2.1. The goal is to allow for testing of multispeaker solutions. These include distinguishing a single speaker from others as well as the ability to simultaneously recognize speech from two talkers. This is obviously a difficult task, particularly with audio information only. Video features correlated with speech features should facilitate solutions to this problem. In addition, techniques can be tested here that will help with the difficult problem of speech babble. (One such application could be a shopping-mall kiosk that distinguishes a user from other shoppers nearby while giving voice-guided information.)
The two speakers in the group section are labeled persons A and B. There are three sequences per person. Person A speaks a continuous-digit sequence, followed by speaker B, and vice-versa. For the third sequence, both speaker A and B overlap each other while speaking each person's separate digit sequence. See Figure 2.2 for an image from one of the speaker-pair videos.

Fig. 2.2. Sample Speaker Pair from the Database.

Fig. 2.3. Examples of Varied Video Backgrounds.

The database was recorded in an isolated sound booth at a resolution of 720x480 with the NTSC standard of 29.97 fps, using a 1-megapixel-CCD, MiniDV camera. (Frames are interpolated from separate fields.) Several microphones were tested. An on-camera microphone produced the best results: audio that was free of clicking or popping due to speaker movement, and video where the microphone did not block the view of the speaker. The video is full color with no visual aids for lip or facial-feature segmentation. Lighting was controlled, and a green background was used to allow chroma-keying of different backgrounds. This serves two purposes. If desired, the green background can be used as an aid in segmenting the face region, but more importantly, it can be used to add video backgrounds from different scenes, such as a mall or a moving car. See Figure 2.3. In this manner, not only can audio noise be added for testing, but video "noise" such as other speakers' faces or moving objects may be added as well, which will allow for testing of robust feature segmentation and tracking algorithms. We plan to include standard video backgrounds (such as recorded in a shopping mall, crowded room, or moving automobile) in an upcoming release.

Nearly three hours of data from about 50 speakers were recorded onto MiniDV tapes and transferred into the computer using the IEEE 1394 interface (FireWire). The recordings were edited into the selection of 36 representative speakers (17 female and 19 male) and 20 pairs. Disruptive mistakes were removed, but occasional vocalized pauses and mistakes in speech were kept for realistic test purposes. The data was then compressed into individual MPEG-2 files for each speaker and group. It has been shown that this compression does not significantly affect the collection of visual features for lipreading [13].

Table 2.2. Recognition Rates (Percentage Correct) Averaged Over Noises for Single-Speaker Case

                      clean    18 dB    12 dB    6 dB     0 dB     -6 dB
Audio only            100 %    96.5 %   80.9 %   52.6 %   29.1 %   21.0 %
Video only            87.0 %   87.0 %   87.0 %   87.0 %   87.0 %   87.0 %
SNR-based AV          100 %    99.1 %   96.4 %   91.5 %   87.6 %   87.0 %
Noise&SNR-based AV    100 %    99.3 %   96.9 %   92.9 %   89.9 %   88.4 %

Table 2.3. Performance Comparison of (SNR-Based — Noise-Based) Late Integration Using Noises from NOISEX-92 (values are percentage correct)

Noise    Speech      Lynx        Op Room     Machine Gun  STITEL      F16         Factory     Car
clean    100 — 100   100 — 100   100 — 100   100 — 100    100 — 100   100 — 100   100 — 100   100 — 100
18 dB    99 — 99     99 — 99     100 — 100   100 — 100    98 — 98     98 — 99     99 — 99     100 — 100
12 dB    94 — 95     94 — 95     97 — 97     99 — 99      94 — 95     97 — 97     97 — 98     99 — 99
6 dB     91 — 91     90 — 91     92 — 94     96 — 97      85 — 89     91 — 92     90 — 92     97 — 97
0 dB     85 — 87     85 — 87     90 — 90     92 — 94      81 — 87     88 — 89     88 — 91     92 — 93
-6 dB    87 — 87     87 — 87     87 — 88     87 — 91      87 — 87     87 — 89     87 — 88     87 — 90
The MPEG-2 files are encoded at a data-rate of 5,000 kbps with multiplexed 16-bit, stereo audio at a 44 kHz sampling rate. An accompanying set of downsampled "wav" format audio files, checked for synchronization, is included at a 16-bit, mono rate of 16 kHz. The data is fully labeled manually at the millisecond level. HTK-compatible label files are included with the database [14]. The data-rate and final selection of speakers and groups were chosen so that a medium-sized database of high-quality, digital video and audio, as well as label data (and possibly some database tools), could be released on one 4.7 GB DVD. This helps with the difficulty of distributing and working with the unruly volume of data associated with audio-visual databases. The next section discusses a single-speaker test case conducted before collecting the database.

2.3. Single-Speaker Test Case Using Noise-Based Late Integration

Before recording the database, we conducted a test study in the area of visual features and data fusion related to audio-visual speech recognition using a single speaker for a 10-word isolated speech task [15, 16]. The audio-visual recognizer used was a late-integration system with separate audio and visual hidden-Markov-model (HMM) based speech recognizers. The audio recognizer used sixteen Mel-frequency discrete wavelet coefficients extracted from speech sampled at 16 kHz [17]. The video recognizer used red-green plane division and thresholding before edge-detection of the lip boundaries [18]. Geometric features based on the oral cavity and affine-invariant Fourier descriptors were passed to the visual recognizer [15]. All the noises from the NOISEX database [19] were added at various SNRs, and the system was trained on clean speech and tested under noisy conditions.

Initial results for the single speaker were good and led us to choose some of the database design goals mentioned in Subsection 2.2, such as the inclusion of recordings with moving speakers and continuous digit strings. See Table 2.2 for a summary of these results. (Here, the field-of-view was cropped to lips-only to ease preliminary work.) The purpose of this study was to investigate the effects of noise on decision fusion. Late integration was used, and recognition rates were found over all noises and multiple SNRs. See Table 2.3. For each noise type and level, optimal values of data fusion, using a fuzzy-and rule, were determined [1, 16]. Results obtained using the fusion values based on noise-type and SNR are shown to give slightly superior results to those obtained with fusion based solely on SNR.
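As a rough illustration of the kind of late integration just described, the Python sketch below combines separate audio and video word scores with a weight chosen from the current noise type and SNR. The lookup table, the default weight, and the weighted log-likelihood combination are illustrative assumptions only; they are not the exact fuzzy-and rule or the tuned values of [16].

```python
# Hedged sketch of noise- and SNR-dependent late integration (decision fusion).
# The weight table and the weighted log-likelihood combination below are
# illustrative assumptions, not the exact fuzzy-and rule of [16].

# Hypothetical fusion weights lambda in [0, 1], tuned per (noise type, SNR):
# lambda = 1 trusts the audio recognizer alone, lambda = 0 the video recognizer.
FUSION_WEIGHTS = {
    ("car", 18): 0.9,
    ("car", 0): 0.4,
    ("babble", 0): 0.3,
    # ... one entry per noise/SNR condition seen during tuning
}

def fuse_scores(audio_loglik, video_loglik, noise_type, snr_db, default_weight=0.5):
    """Combine per-word log-likelihoods from the audio and video HMM recognizers.

    audio_loglik, video_loglik: dicts mapping each vocabulary word to the
    log-likelihood produced by the corresponding single-modality recognizer.
    """
    lam = FUSION_WEIGHTS.get((noise_type, snr_db), default_weight)
    fused = {
        word: lam * audio_loglik[word] + (1.0 - lam) * video_loglik[word]
        for word in audio_loglik
    }
    # Decision: the word whose fused score is largest.
    return max(fused, key=fused.get)

# Example with made-up scores for a two-word vocabulary:
audio = {"one": -120.0, "two": -118.5}   # noisy audio slightly prefers "two"
video = {"one": -45.0, "two": -60.0}     # video strongly prefers "one"
print(fuse_scores(audio, video, noise_type="car", snr_db=0))  # -> "one"
```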
3. FEATURE STUDY USING STATIONARY AND MOVING SPEAKERS

This section discusses a moving-talker, speaker-independent feature study over several features. Image-processing, image-transform, and template-matching methods are employed over Part 1, Tasks 1-2 of the CUAVE database. The task includes individuals speaking connected-digit strings while either stationary or moving. The visual features used are affine-invariant Fourier descriptors (AIFDs) [15], the discrete cosine transform (DCT), an improved rotation-corrected application of the DCT, and a naturally affine-invariant B-Spline template (BST) scheme. (See Table 2.1 for a summary of the tasks.) Details of the overall system are presented in Subsection 3.1. The techniques for segmenting and tracking the face and lips are discussed in Subsection 3.2. Visual-feature extraction is discussed in Subsections 3.3-3.5. Lastly, Subsection 3.6 presents a comparison of moving and stationary talker results using the various feature methods.

3.1. System Details

This subsection describes the setup for the stationary versus moving talker comparison over various visual features. The image-processing method for AIFDs detailed below is the most sensitive to the weakness of a single-mixture, uniform color model. Because of this, an attempt was made to generate a fairer comparison of results. A group of 14 speakers was chosen that yielded more robust lip-contour extraction. The group was arranged arbitrarily into 7 training speakers and 7 testing speakers for a completely speaker-independent study. Part 1 of the database, Tasks 1 and 2, was used for this study. (See Table 2.1.) For each set of visual features, HMMs were trained using HTK with the same speakers. The models were simple single-mixture models with eight states per model.

Before each of the following visual features was tested, a brief initial test was performed to demonstrate the importance of dynamic features, as supported by other researchers [18, 13]. Table 3.4 contains results from a difficult, two-speaker test case. Recognition models were trained on DCT features from one female speaker and tested on one male speaker with facial hair. Static and difference features, as well as their combination, were compared.

Table 3.4. Word Accuracy for Preliminary Difficult Test Case Demonstrates Usefulness of Difference Features

DCT Features   35 Static   35 Difference   15 Difference   15 Stat./20 Diff.
Accuracy       6 %         18 %            10 %            10 %

As shown in previous work and in this case, difference coefficients seem to provide more useful information, particularly when dealing with a speaker-independent case. Shape information along with information on lip movement is more applicable in a speaker-dependent case. Methods for speaker adaptation, image warping, or codebook schemes might improve the use of shape information. For dynamic coefficients, some work also demonstrates that better results can be obtained by averaging over several adjacent frames for longer temporal windows [20]. For this study, coefficients are only differenced over one frame, as some previous work has also shown no apparent improvement with longer temporal windows [13].

All visual feature schemes here begin with the same face and lip tracking algorithms. Smaller and larger regions-of-interest (ROIs) are passed to each feature routine. The smaller region tracks tightly around the lips and can be searched to locate lip features, such as the lip corners, angle, or width. The larger region includes the jaw, specifically to give a larger region for application of the DCT, shown in [20] to yield better results than a smaller region. All features are passed as difference coefficients at a rate of 29.97 Hz, to match the NTSC standard. As these are speech-reading results only, no interpolation to an audio frame rate is performed. Techniques for locating the face and lip regions are described next.

3.2. Detecting the Face and Lips

Dynamic thresholding of red/green division, used to initially segment the lips in our earlier work, did not prove to be good for feature segmentation over the database of many speakers. For this reason, we manually segmented the face and lip regions for all speakers, as shown in Figure 3.1, to obtain the mean and variance of the non-face, face, and lip color distributions. Since R, G, B values are used, a measure of intensity is also included. This is relatively consistent in this test case, although it could be difficult in practical applications, as the intensity may vary.
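A minimal sketch of this color-model training step is given below: per-class Gaussian estimation (mean and covariance of RGB values) from manually segmented frames, with class priors taken from pixel counts as described in the text. The function names, the integer mask format, and the random stand-in data are hypothetical.

```python
# Hedged sketch of the color-model training step described above: per-class
# Gaussian estimation (mean and covariance of RGB values) from manually
# segmented training images. File handling and the mask format are hypothetical.

import numpy as np

def fit_class_gaussian(pixels):
    """pixels: (N, 3) array of RGB values belonging to one class."""
    mean = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)      # 3x3 covariance of R, G, B
    return mean, cov

def collect_class_pixels(frames, masks, class_label):
    """Gather RGB pixels of one class over all labeled training frames.

    frames: list of (H, W, 3) uint8 images.
    masks:  list of (H, W) arrays with integer labels
            (e.g., 0 = non-face, 1 = face, 2 = lips).
    """
    chunks = [frame[mask == class_label] for frame, mask in zip(frames, masks)]
    return np.concatenate(chunks, axis=0).astype(np.float64)

# Example with random stand-in data (real use would load labeled video frames):
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(480, 720, 3), dtype=np.uint8)]
masks = [rng.integers(0, 3, size=(480, 720))]

models = {}
priors = {}
total = sum(m.size for m in masks)
for label, name in [(0, "non-face"), (1, "face"), (2, "lips")]:
    pix = collect_class_pixels(frames, masks, label)
    models[name] = fit_class_gaussian(pix)
    priors[name] = len(pix) / total        # prior from pixel counts, as in the text
```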
Fig. 3.1. Manually Segmented Face and Lips for Distribution Training.

Fig. 3.2. Gaussian Estimation of Non-Face, Face, and Lip Color Distributions.

The estimated Gaussian distributions for the non-face, face, and lip classes are shown in Figure 3.2. These are used to construct a classifier based on the Bayes classification rule [21]: decide class $\omega_i$ over $\omega_j$ if

$p(\mathbf{x} \mid \omega_i)\,P(\omega_i) > p(\mathbf{x} \mid \omega_j)\,P(\omega_j)$,   (1)

where $\mathbf{x}$ is the feature vector (RGB levels), $\omega_i$ is each class (non-face versus face, or face versus lips), $p(\mathbf{x} \mid \omega_i)$ is estimated by the respective estimated Gaussian density, and $P(\omega_i)$ is estimated by the ratio of class pixels present in the training data. This classification is performed on image blocks or pixels, and the class with the larger value is chosen. Currently, only a single Gaussian mixture is used. This results in good class separation over all but a few speakers whose facial tones are ruddy and very similar to their lip tones. For speed in later routines, the image is broken into 10x10 pixel blocks. Each block is averaged and classified as non-face or face, then as face or lips.

Determining the location of the speaker's lips in the video frame is central to determining all visual features presented here, although some are more sensitive to exact location. In order to identify the lip region, the face is first located in each frame. This is based on a straightforward template search of the classified blocks, locating the largest region classified as "face". The segmentation from this step is shown in Figure 3.3. The current search template is rectangular to improve the speed over using an oval template that might have to be rotated and scaled to match speaker movement. This performs well for the current setup that does not have "environment" backgrounds substituted for the green screen via chroma-keying. Figure 3.6 displays a "located" face. There are some extreme cases of speaker movement, such as large tilting of the head, that can cause poor face location. This would likely be solved by the slower scaled/rotated oval template.

Fig. 3.6. Locating the Face Region.

After locating the probable face region, this area is searched for the best template match on lip-classified pixels. Pixels are used in locating the lip region for more accuracy. A "center-of-mass" check is also employed to help center the tracker on the lips. Again, a rectangular template is used for speed, rather than an oval template. Heuristics, such as checking block classifications near the current template, help prevent misclassifying red shirt-collars as lips. See Figure 3.4. A final overlaid lip segmentation is shown in Figure 3.5.

Fig. 3.3. Segmentation of Face from Non-Face and Lips Within Face Region.

Fig. 3.4. Sensitivity of Classification Scheme to Red Clothing, Improved by Searching within the Face Region and Using Heuristics.

Fig. 3.5. Final Lip Segmentation.

After the lip region is located, it is passed in two forms to the chosen visual-feature-extraction routine. A smaller box around the location of the lips is passed to be searched for lip corners and angle, and a larger box is passed as well. It has been shown that performing the image transform on a slightly larger region, including the jaw as well as the lips, yields improved results in speech-reading [20]. Figure 3.7 demonstrates the regions passed on to the feature routines. The routines have been designed so that they may be expanded to track multiple speakers, as in the CUAVE speaker-pair cases. The following subsections discuss each of the feature methods compared in this work.

Fig. 3.7. Locating the Smaller and Larger Lip Region.
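The block-wise classification described in this subsection can be sketched as follows, reusing the per-class Gaussians and priors fitted in the previous sketch. The block size and the two-stage non-face/face then face/lips comparison follow the text; the helper names and everything else are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch of the block-wise Bayes classification of Eq. (1), using the class
# Gaussians and priors ("models", "priors") fitted in the previous sketch.

import numpy as np
from scipy.stats import multivariate_normal

BLOCK = 10  # 10x10 pixel blocks, averaged before classification

def block_means(frame, block=BLOCK):
    """Average each block x block tile of an (H, W, 3) frame into one RGB vector."""
    h, w, _ = frame.shape
    h, w = h - h % block, w - w % block            # drop partial edge blocks
    tiles = frame[:h, :w].reshape(h // block, block, w // block, block, 3)
    return tiles.mean(axis=(1, 3))                  # shape (h/block, w/block, 3)

def classify(rgb, class_a, class_b, models, priors):
    """Return the more likely of two classes for one RGB vector (Bayes rule)."""
    def score(name):
        mean, cov = models[name]
        return multivariate_normal.pdf(rgb, mean=mean, cov=cov) * priors[name]
    return class_a if score(class_a) >= score(class_b) else class_b

def label_blocks(frame, models, priors):
    """Label every averaged block as 'non-face', 'face', or 'lips'."""
    means = block_means(frame.astype(np.float64))
    labels = np.empty(means.shape[:2], dtype=object)
    for i in range(means.shape[0]):
        for j in range(means.shape[1]):
            lbl = classify(means[i, j], "non-face", "face", models, priors)
            if lbl == "face":
                lbl = classify(means[i, j], "face", "lips", models, priors)
            labels[i, j] = lbl
    return labels
```

A face template search would then look for the largest connected group of "face" blocks in this label grid, and the lip search would operate on "lips"-classified pixels within that region, as described above.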
3.3. Image-Processing Contour Method

The feature-extraction method discussed here involves the application of affine-invariant techniques to Fourier coefficients of lip contour points estimated by image-processing techniques. More information about the development and use of affine-invariant Fourier descriptors (AIFDs) is given in [15, 22]. A binary image of the ROI passed from the lip tracker is created using the classification rule in Equation 1. This is similar to the overlaid white pixels shown in Figure 3.5. The Roberts cross operator is used to estimate lip edges, as shown in Figure 3.8. The outer lip edge is traversed to assemble pixel coordinates to which the affine-invariant Fourier technique will be applied.

A difficulty of lip-pixel segmentation is that in certain cases, lip pixels are occasionally lost or gained by improper classification. In this case, the lip contour may include extra pixels or not enough, such as in Figure 3.9 where there is separation between the upper and lower lips. This separation would cause a final mouth contour to be generated that may only include the lower lip, such as in Figure 3.11, rather than the good contour detected and shown in Figure 3.10. This affects all the feature-extraction techniques to some degree, since the lip tracker is based on similar principles; however, contour-estimation methods are affected the most. (The DCT method presented in the next subsection is affected the least by this, though, helping account somewhat for its better performance.) These effects could possibly be minimized by adding more mixtures to the color models, using an adaptive C-means-type scheme for color segmentation within regions, or including additional information such as texture. Additional mixtures would be more important for extracting the inner-lip contour around teeth and shadow pixels. For the comparison in this paper, the single-mixture model suffices since all feature schemes are based on the same lip-tracker routine. In fact, using personalized mixtures for each person versus a single, overall density did not yield improved recognition results in our tests even though segmentation appeared somewhat more accurate.

Fig. 3.8. Mouth Contour after Edge Detection.

Fig. 3.9. Poor Mouth Contour after Edge Detection.

Fig. 3.10. Final Mouth Contour for AIFD Calculation.

Fig. 3.11. Poor Final Mouth Contour for AIFD Calculation.

Once the lip contour coordinates are determined, the Discrete Fourier Transform (DFT) is applied. The redundant coefficients are removed. The zeroth coefficient is then discarded to leave shift-invariant information. Remaining coefficients are divided by an additional arbitrary coefficient to eliminate possible rotation or scaling effects. Finally, the absolute value of the coefficients is taken to remove phase information and, thus, eliminate differences in the starting point of the contour-coordinate listing. This leaves a set of coefficients (AIFDs) that are invariant to shift, scale, rotation, and point-ordering. (Although we employed a parametric-contour estimation of lip contours before AIFD calculation in our earlier work for improved results, this is not yet implemented in this system [15]. Results here are for directly estimated lip contours with no curve estimation or smoothing.)
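The simplified descriptor steps listed above can be sketched directly in Python: take the DFT of the ordered contour coordinates, drop the zeroth (shift) coefficient, divide by a reference coefficient to cancel scale and rotation, and keep magnitudes to discard the starting point. The full AIFD construction of [15, 22] is more involved than this; the number of coefficients kept and the synthetic contour are arbitrary illustrative choices.

```python
# Hedged sketch of the simplified Fourier-descriptor steps described above,
# not the full AIFD construction of [15, 22].

import numpy as np

def simple_fourier_descriptors(contour_xy, num_coeffs=16):
    """contour_xy: (N, 2) array of ordered outer-lip contour pixel coordinates."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]   # complex representation x + jy
    Z = np.fft.fft(z)
    Z = Z[1:]                  # discard the zeroth coefficient -> shift invariance
    Z = Z / Z[0]               # divide by a reference coefficient -> scale/rotation
    mags = np.abs(Z)           # magnitudes -> independence of contour start point
    return mags[:num_coeffs]

# Example with a synthetic elliptical "lip" contour:
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
contour = np.stack([60 * np.cos(t) + 200, 25 * np.sin(t) + 120], axis=1)
print(simple_fourier_descriptors(contour)[:5])
```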
3.4. Image Transform Method

The larger ROI passed from the lip tracker is downsampled into a 16x16 grayscale intensity image matrix, as shown in Figure 3.12. The DCT was chosen as the image transform instead of other transform methods because of its information-packing properties and the strong results presented by other research [13, 20].

Fig. 3.12. Downsampling for DCT Application.

The separable 2-D DCT was used in this work:

$C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} I(x,y) \cos\!\left[\frac{(2x+1)u\pi}{2N}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right]$,   (2)

where $I$ is the downsampled image matrix (here $N = 16$), $C$ is the resulting transformed matrix, and $\alpha(\cdot)$ is the usual DCT normalization factor. The upper-left (6x6) block of $C$ is used for feature coefficients, with the exception of the zeroth element, which is dropped to perform feature-mean subtraction for better speaker-independent results.

As is shown in Subsection 3.6, the DCT was not robust to speaker movement. To try to improve this, a rotation-corrected version of the image block was passed to the DCT (rc-DCT). The angle for rotation correction is determined by searching for the lip corners and estimating the vertical angle of the lips/head from the two lip corners. This was chosen for speed, versus estimating the tilt of the whole head with an elliptical template search. All image pixels that would then form a box parallel to the angled lips are chosen for the matrix to which the DCT is applied. Figure 3.13 demonstrates estimation of the rotation-correction factor, and Figure 3.14 shows a downsampled image matrix based on the estimated angle. The head is slightly tilted, but the lips are "normalized" back to horizontal.

Fig. 3.13. Rotation Correction Before Applying DCT.

Fig. 3.14. Rotation-Corrected, Downsampled Image Region for DCT Application.

This correction is an attempt to remove one possible affine transformation, rotation. Translation variance has already been minimized, hopefully, by the lip tracker and "center-of-mass" algorithm that should center the downsampled image matrix around the lips. Scale is not currently taken into account, although a good possibility would be to scale the downsampled image block by the width of the speaker's face. The approximate measurement can be estimated slightly above a speaker's mouth, where the width of the face should not change due to jaw movement. This should roughly scale each person's lips and improve robustness to back-and-forth movement and also improve speaker-independent performance.

Although the rotation correction improves the DCT's robustness to speaker movement, stationary performance actually drops due to the dependence on estimating the lip corners. Sensitivity to lip segmentation has been introduced as in the contour methods, and performance drops to about their level. In order to improve this, a smoothed rc-DCT was tested that "smooths" the angle estimate between frames to minimize improper estimates or erratic changes. The DCT, rc-DCT, and smoothed rc-DCT results are compared against the image-processing and template-based contour methods in Subsection 3.6.
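The DCT feature path above can be sketched as follows: downsample the larger lip/jaw ROI to a 16x16 grayscale matrix, apply the separable 2-D DCT, keep the upper-left 6x6 block minus the zeroth element (35 static coefficients), and take one-frame differences as the dynamic features used in this study. The grayscale conversion and the nearest-neighbor style downsampling are illustrative assumptions, since the paper does not specify those details.

```python
# Hedged sketch of the DCT feature extraction described above. Downsampling and
# grayscale conversion choices are assumptions, not the paper's exact method.

import numpy as np
from scipy.fft import dctn

def dct_features(roi_rgb, grid=16, block=6):
    """roi_rgb: (H, W, 3) uint8 ROI around the lips and jaw."""
    gray = roi_rgb.astype(np.float64).mean(axis=2)           # simple grayscale
    h, w = gray.shape
    # Nearest-neighbor style downsampling onto a grid x grid matrix (assumption).
    rows = (np.arange(grid) * h) // grid
    cols = (np.arange(grid) * w) // grid
    small = gray[np.ix_(rows, cols)]
    C = dctn(small, norm="ortho")                             # separable 2-D DCT
    feats = C[:block, :block].flatten()[1:]                   # 6x6 block minus DC
    return feats                                              # 35 static coefficients

def difference_features(static_frames):
    """One-frame differences of per-frame static features (list of 1-D arrays)."""
    stacked = np.stack(static_frames)
    return stacked[1:] - stacked[:-1]

# Example with two random stand-in ROI frames:
rng = np.random.default_rng(1)
rois = [rng.integers(0, 256, size=(120, 180, 3), dtype=np.uint8) for _ in range(2)]
deltas = difference_features([dct_features(r) for r in rois])
print(deltas.shape)   # (1, 35)
```

For the rc-DCT, the same routine would be applied to a pixel box re-sampled parallel to the estimated lip angle rather than to the axis-aligned ROI.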
3.5. Deformable-Template Method

A somewhat different approach is taken in this deformable-template method than in other approaches based on parametric curves (B-Spline, Bézier, elliptical), snakes, or active contour models. The main goal is to capture information from lip movement in a reference-coordinate system, thus directly producing affine-invariant results. The method of determining the lip movement is simple, direct, and slightly less sensitive to improper lip segmentation. The deformable-template principle is the same: lip contours are guided by minimizing a cost function $c$ over the area enclosed by the curves, region $R$ [23]:

$c = \sum_{(x,y) \in R} \phi(x,y)$,   (3)

where $\phi(x,y)$ is an image potential derived here from the lip-color classification. The method presented here may be implemented with B-Spline curves. For the tables of results, this technique will be referred to as the B-Spline template (BST). Currently, we use splines with a uniform knot vector and the Bernstein basis, also known as Bézier curves [24]. Eight control vertices (CVs) are used, four for the top lip and four for the bottom. (Because the top and bottom points for the lip corners are the same, this actually reduces to six distinct points.) CVs are demonstrated in Figure 3.15. A Bézier curve generated by CVs is shown in Figure 3.16. CVs are also known as points $P_i = (x_i, y_i)$, $i = 0, \ldots, 3$, that control the parametric curves by the following formulas:

$x(t) = \sum_{i=0}^{3} x_i \, B_{i,3}(t)$,   (4)

$y(t) = \sum_{i=0}^{3} y_i \, B_{i,3}(t)$,   (5)

where $B_{i,3}(t) = \binom{3}{i} (1-t)^{3-i} t^{i}$ is the Bernstein basis. There are two sets of 4 points: one for the top curve and one for the bottom curve. The range of values for $t$ is usually normalized to the 0 to 1 range in parametric curve formulas. Here, the width of the lips is used to normalize this.

Fig. 3.15. Control Vertices for Estimated Lip Curves.

Fig. 3.16. Estimated Lip Contour Using Control Vertices for Bezier Curves.

The corners and angle of the lips are estimated as discussed in the previous section. $P_0$ and $P_3$ for the top and bottom are set to the corners of the lips in the image coordinate system. Based on the angle estimation, these points are rotated into a relative, reference coordinate system, with $P_0$ as the origin. (Affine transforms may be applied to CVs without affecting the underlying curves.) New CV values for the searches are generated based on changing the width, height, location, etc. of the template. These are then transformed back to the image-coordinate system to generate the resulting curves. Several searches are performed while minimizing $c$ within these curves in the image-coordinate system. First, the location is searched by moving all of the CVs throughout the larger box to confirm that the lips are accurately centered within the template. Next, the CVs are changed to minimize $c$ with respect to the width of the template. After the width, the height of the top lip and bottom lip are compared against the cost function. Finally, the shape can be changed by searching for the spacing of the inner CVs ($P_1$ and $P_2$) based on $c$, for both the top and bottom independently. The horizontal spacing of $P_1$ and $P_2$ from the center of the lips was assumed to be symmetric, as no useful information is seemingly gained from asymmetric lip positions. Once the CVs that minimize $c$ are determined, the actual reference-coordinate-system CV values are differenced from the previous frame and passed on as visual features that capture the moving shape information of the lips. In this respect, angle and translation variance are eliminated. Currently, scaling is not implemented, but it could be handled in a similar manner as discussed in the previous subsection.

The advantages of this technique are simplicity, slightly less sensitivity to lip segmentation, and inherent affine-invariance. Another possibility is that the automatically generated CVs could be passed on to control curves for a facial animation or lip synchronization task based on a recorded speaker. Example curves that minimized $c$ are shown in Figures 3.17-3.18. The former represents a good match of the cost minimization to the actual lip curve. The latter is slightly off, although the moving-shape information may still be useful. During experimentation, more exact contour estimation did not always seem to translate directly into improved recognition performance.

Fig. 3.17. Well Estimated Lip Contour from Bezier Curves.

Fig. 3.18. Poorly Estimated Lip Contour from Bezier Curves.
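The curve machinery behind Equations (4)-(5) can be sketched as follows: evaluating a cubic Bézier lip curve from four control vertices and mapping CVs between the image frame and a rotated, corner-anchored reference frame. The cost search itself is omitted, and the function names and example coordinates are illustrative rather than taken from the paper.

```python
# Hedged sketch of cubic Bezier evaluation (Eqs. 4-5) and of mapping control
# vertices between image and reference coordinates; the BST cost search is omitted.

import numpy as np
from math import comb

def bezier_curve(cvs, num=50):
    """cvs: (4, 2) control vertices; returns (num, 2) points on the cubic curve."""
    t = np.linspace(0.0, 1.0, num)[:, None]
    basis = [comb(3, i) * (1 - t) ** (3 - i) * t ** i for i in range(4)]  # Bernstein
    return sum(b * cv for b, cv in zip(basis, cvs))           # Eqs. (4) and (5)

def to_reference(cvs, origin, angle):
    """Rotate image-coordinate CVs into a reference frame anchored at one lip corner."""
    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, -s], [s, c]])
    return (np.asarray(cvs, dtype=float) - origin) @ R.T

def to_image(ref_cvs, origin, angle):
    """Inverse mapping: reference-frame CVs back to image coordinates."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    return np.asarray(ref_cvs, dtype=float) @ R.T + origin

# Example: a top-lip template between hypothetical lip corners at (100, 80) and (160, 86).
corners = np.array([[100.0, 80.0], [160.0, 86.0]])
angle = np.arctan2(corners[1, 1] - corners[0, 1], corners[1, 0] - corners[0, 0])
top_cvs = np.array([[100, 80], [120, 70], [140, 71], [160, 86]], dtype=float)
ref = to_reference(top_cvs, origin=corners[0], angle=angle)   # angle and shift removed
curve = bezier_curve(to_image(ref, origin=corners[0], angle=angle))
print(ref.round(1), curve.shape)
```

Frame-to-frame differences of the reference-frame CV values (`ref` here) would then serve as the BST visual features described above.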
3.6. Stationary and Moving Results

Each of the visual feature methods was used on the test setup described in Subsection 3.1. The grouping is a completely speaker-independent set, training on one group and testing on a completely separate group. Task 1 (see Table 2.1) is used for stationary testing and training. Task 2 is used for moving-speaker testing and training. The results are presented in Table 3.5 over a stationary set, a moving set with models trained on the stationary set, and a moving set with models trained on a moving set. It is interesting to note that training on the stationary set produces better performance for the moving cases. It is likely that features are corrupted more during speaker movement, thus producing poor training data. Although results may not strictly be compared to other results, the range of these is on the order of results from other medium-sized, connected/continuous, speaker-independent speech-reading tasks [13, 20]. Results presented are obtained using single-mixture, eight-state HMMs.

Table 3.5. Comparison of Performance (Word Accuracy) on Stationary and Moving Speakers, Digit Strings

Features              AIFD      DCT       rc-DCT    smoothed rc-DCT   BST
Stationary            22.40 %   27.71 %   22.57 %   27.43 %           22.86 %
Moving (trn. stat.)   21.89 %   19.62 %   21.53 %   23.92 %           24.46 %
Moving (trn. mov.)    14.79 %   17.70 %   21.05 %   21.53 %           21.20 %

The AIFD features were shown to perform nearly equally well on stationary and moving speakers, as hoped. Confirming the conclusion in [13] that an image-transform approach yields better performance than lip-contour methods, the DCT features outperform the AIFD-contour features in this test system. A large part of this is likely due to sensitivity to the lip segmentation and tracking algorithms. The DCT is much less sensitive, as the larger block can be more easily located than precise lip contours. DCT performance drops substantially, though, under moving-speaker conditions. Implementing the rotation correction did improve the performance of the rc-DCT on the moving-speaker case; however, stationary performance dropped significantly. This is due to the dependence on the lip segmentation introduced by the lip-angle estimation. Implementing the smoothing factor discussed in Subsection 3.4 both improved results further on the moving case and nearly regained stationary performance.

The BST features performed on par with the AIFDs. Another interesting note, though, is that they actually achieved the best performance on the moving group, surpassing the smoothed rc-DCT and even their own stationary performance. They seem to be fairly robust to speaker movement, still capturing the lip motions important for speech-reading. Unfortunately, though, they are still quite sensitive to the lip segmentation scheme through the cost function, as shown in Section 4. One possibility that should significantly improve this dependence is to increase the number of mixtures in the Gaussian density estimation for the lips to include tongue, teeth, and shadows. This should reduce "patchiness" in lip-pixel classification, thus improving the accuracy of minimizing the cost function.
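The multi-mixture color model suggested above could be realized in many ways; one possible sketch, using scikit-learn's Gaussian mixture model as a stand-in, is shown below. This is not part of the paper's system, and the number of components, the library choice, and the stand-in data are assumptions.

```python
# Hedged sketch of the suggested improvement: a small Gaussian mixture for the lip
# color class, so tongue, teeth, and shadow pixels get their own components.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_lip_mixture(lip_pixels, n_components=3, seed=0):
    """lip_pixels: (N, 3) RGB samples from manually segmented lip regions."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed)
    gmm.fit(lip_pixels)
    return gmm

def lip_log_likelihood(gmm, rgb_block_means):
    """Per-block log p(x | lips), to plug into the Bayes comparison of Eq. (1)."""
    return gmm.score_samples(np.asarray(rgb_block_means, dtype=float))

# Example with random stand-in lip pixels:
rng = np.random.default_rng(2)
lip_pixels = rng.normal(loc=[170, 90, 100], scale=15, size=(5000, 3))
gmm = fit_lip_mixture(lip_pixels)
print(lip_log_likelihood(gmm, [[168, 92, 101], [30, 200, 40]]))
```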
4. SPEAKER-INDEPENDENT RESULTS OVER ALL SPEAKERS

In this section we include baseline results over the whole database for Tasks 1 and 2, the stationary and moving tasks, respectively. The 36 individual speakers were divided arbitrarily into a set of 18 training speakers and 18 different test talkers for a completely speaker-independent grouping. With a simple, MFCC-based HMM recognizer implemented in HTK using 8-state models and only one mixture per state, we obtained 92.25% correct recognition with a word accuracy of 87.25% (slightly lower due to some insertions between actual digits). With some tuning and the addition of more mixtures, recognition near 100% should be attainable.

Here, we also include visual speech-reading results over all 36 speakers, using the same training and testing groups as for the audio. Results are included for DCT, rc-DCT, smoothed rc-DCT, and BST features as in the previous test. AIFDs are not currently included because the current implementation is very sensitive to differences that a single-mixture color model does not represent well. In fact, the BST features, which are somewhat less sensitive, also show a performance drop over the whole group. This is particularly true for the moving results, where the BST features performed well in the prior test. The DCT gained the highest score on the stationary task with 29% word accuracy. Performance drops to the level of the BST features, though, on the moving task. Again, the rc-DCT performs better on the moving task, but loses stationary performance. The smoothed rc-DCT performs the best here on moving speakers but does not quite restore the full performance of the DCT on stationary speakers. This is also probably affected by the additional speakers who do not fit the color model as well as the prior test group. Estimating the lip angle to correct the DCT suffers when lip segmentation is poor.

Table 4.6. Baseline Speech-Reading Results (Word Accuracy) over All Speakers

Features     DCT       rc-DCT    smoothed rc-DCT   BST
Stationary   29.00 %   25.95 %   26.47 %           23.85 %
Moving       21.12 %   22.39 %   24.73 %           21.48 %

Overall, results seem to indicate that contour methods might perform as well as transform methods if made robust enough. The difficulty is creating speaker-independent models that perform accurate lip segmentation under the many varying conditions.

5. CONCLUSIONS AND RESEARCH DIRECTIONS

This paper presents a flexible, speaker-independent audio-visual speech corpus that is easily available on one DVD-data disc. The goal of the CUAVE database is to facilitate multimodal research and to provide a basis for comparison, as well as to provide test data for affine-invariant feature methods and multiple-speaker testing. Our website, listed in the title section, contains sample data and includes current contact information for obtaining the database. As it is a representative, medium-sized digits task, the database may also be used for testing in other areas apart from speech recognition. These may include speaker recognition, lip synchronization, visual-speaker synthesis, etc.

Also, results are given that suggest that data fusion can benefit by considering noise type as well as level. An example application could be in an automobile where a variety of noises and levels are encountered. A camera could be easily mounted to keep the driver in the field-of-view for an audio-visual speech recognizer. Dynamic estimations classifying the background noise could be used to improve recognition results.

A feature study of image-processing-based contour, image-transform, and deformable-template methods has also been detailed. It included results on stationary and moving talkers and also attempts to lessen the effect of speaker movement on the various visual feature sets. Contour methods are more severely limited by models used for feature segmentation and tracking.
Simple changes may yield improved results in the template method, however. Future work may show whether these can compare with image-transform results. Scale correction will also be included for the various feature sets to determine its effect. Finally, baseline results have been included for two of the database tasks to encourage comparison of techniques among researchers.

There are several areas that are wide open for audio-visual speech recognition research that may be tested on this database. A very important area is visual-feature extraction. A variety of speakers and speaker movement has been included to this end. Methods are needed that either significantly strengthen feature-tracking under practical conditions or that create new features for speech-reading. Better features may include some other measures of the mouth, teeth, tongue, jaw, psychoacoustic considerations, or eye-direction, to name a few possibilities. Techniques for speaker-independent speech-reading are also needed. These may include some form of speaker adaptation, model adaptation, image warping, use of feature codebooks, or some of the many other methods employed for audio speaker adaptation. Data fusion is also an important area of research. Improved techniques for multimodal, multistream HMMs could provide important strides, particularly in continuous audio-visual speech recognition. Other methods, such as hybrid fusion systems, may be considered. Dynamic considerations will be important for improving data fusion for practical environments. Finally, the ability to distinguish and separate speakers is important for powerful interfaces that may be desired where multiple speakers are present, such as in public areas or automobiles with passengers. Hopefully, the CUAVE database will facilitate more widespread research in these areas.

6. REFERENCES

[1] D. W. Massaro, Perceiving Talking Faces: From Speech Perception to a Behavioral Principle, Cambridge, MA: MIT Press, 1997.

[2] Q. Summerfield, "Lipreading and audio-visual speech perception," Phil. Trans. R. Soc., vol. 335, 1992.

[3] E. Petajan, B. Bischoff, D. Bodoff, and N. Brooke, "An improved automatic lipreading system to enhance speech recognition," in ACM SIGCHI, pp. 19-25, 1988.

[4] P. L. Silsbee and A. C. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, September 1996.

[5] P. Teissier, J. Robert-Ribes, J. Schwartz, and A. Guérin-Dugué, "Comparing models for audiovisual fusion in a noisy-vowel recognition task," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 6, November 1999.

[6] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, "Audio-visual speech recognition final workshop 2000 report," Center for Language and Speech Processing, Johns Hopkins University, Baltimore, 2000.

[7] G. Potamianos, C. Neti, G. Iyengar, and E. Helmuth, "Large-vocabulary audio-visual speech recognition by machines and humans," in Eurospeech, Denmark, 2001.

[8] J. R. Movellan, "Visual speech recognition with stochastic networks," in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds., vol. 7, MIT Press, Cambridge, 1995.

[9] I. Matthews, Features for Audio-Visual Speech Recognition, Ph.D. thesis, School of Information Systems, University of East Anglia, UK, 1998.

[10] C. C. Chibelushi, S. Gandon, J. S. D. Mason, F. Deravi, and R. D. Johnston, "Design issues for a digital audiovisual integrated database," in IEE Colloquium on Integrated Audio-Visual Processing for Recognition, Synthesis and Communication, Savoy Place, London, Nov. 1996, number 1996/213, pp. 7/1-7/7.
[11] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The extended M2VTS database," in Second International Conference on Audio and Video-Based Biometric Person Authentication, Washington, D.C., 1999.

[12] G. Potamianos, E. Cosatto, H. P. Graf, and D. B. Roe, "Speaker independent audio-visual database for bimodal ASR," in Proceedings of the European Tutorial and Research Workshop on Audio-Visual Speech Processing, Rhodes, Greece, 1997, pp. 65-68.

[13] G. Potamianos, H. P. Graf, and E. Cosatto, "An image transform approach for HMM based automatic lipreading," in Proceedings of the ICIP, Chicago, 1998.

[14] S. Young, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book, Entropic Cambridge Research Laboratory Ltd., version 2.1, 1997.

[15] S. Gurbuz, Z. Tufekci, E. K. Patterson, and J. N. Gowdy, "Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition," in Proceedings of ICASSP, 2001.

[16] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "Noise-based audio-visual fusion for robust speech recognition," in International Conference on Auditory-Visual Speech Processing, Denmark, 2001.

[17] Z. Tufekci and J. N. Gowdy, "Mel-scaled discrete wavelet coefficients for speech recognition," in Proceedings of ICASSP, 2000.

[18] G. I. Chiou and J. Hwang, "Lipreading from color video," IEEE Transactions on Image Processing, vol. 6, no. 8, August 1997.

[19] A. P. Varga, H. J. M. Steeneken, M. Tomlinson, and D. Jones, "The NOISEX-92 study on the effect of additive noise on automatic speech recognition," Tech. Rep., DRA Speech Research Unit, 1992.

[20] G. Potamianos and C. Neti, "Improved ROI and within frame discriminant features for lipreading," in ICIP, Thessaloniki, Greece, 2001.

[21] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.

[22] K. Arbter, W. E. Snyder, H. Burkhardt, and G. Hirzinger, "Application of affine-invariant Fourier descriptors to recognition of 3-D objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, 1990.

[23] T. Chen, "Audiovisual speech processing: Lip reading and lip synchronization," IEEE Signal Processing Magazine, vol. 18, no. 1, January 2001.

[24] D. Rogers, An Introduction to NURBS, Morgan Kaufmann Publishers, 2001.