Oreja: A MATLAB environment for the design of psychoacoustic stimuli

ELVIRA PEREZ
Department of Psychology, University of Liverpool, U.K.

RAUL RODRIGUEZ-ESTEBAN
Department of Electrical Engineering, Columbia University, New York, U.S.A.

Correspondence should be addressed to E. Perez, School of Psychology, University of Liverpool, Eleanor Rathbone Building, Bedford St. South, Liverpool L69 7ZA, England, U.K. Phone: +44 (0) 151 794 2177. E-mail: e.perez@liverpool.ac.uk

The Oreja software has been designed specifically for the study of speech intelligibility. It is an engineering tool informed by psychology: it allows researchers interested in speech perception to manipulate speech signals and to study how these manipulations affect human perception. A feature of this package is that it uses a high-level interpreted scripting environment (MATLAB), allowing the user to load one or more signals, decompose them into different channels, analyze them, and select the parts to be labeled and/or manipulated (e.g., attenuating the amplitude of the selected channels, adding noise, etc.). The paper closes with an application illustrating how Oreja can be used. Oreja can be downloaded at http://www.liv.ac.uk/psychology/Downloads/Oreja.htm

The development of this software has been partially supported by HOARSE grant HPRN-CT-2002-00276. This project could not have been developed at LabROSA (Department of Electrical Engineering, Columbia University, New York) without the supervision of Dan Ellis and a Fulbright scholarship granted to the first author. A version of this paper was presented at the XIVth Conference of the European Society for Cognitive Psychology, Leiden, The Netherlands, August 31 - September 3, 2005. We thank John Worley, Dimosthenis Karatzas, and Martin Cooke for helpful comments and suggestions on earlier versions of this paper.

Psychologists have long sought to understand how human listeners comprehend language, and specifically how our auditory system processes speech efficiently in environments where silence is normally the exception. Listeners possess strategies for handling many types of distortion, and uncovering these strategies matters both for understanding speech intelligibility in noisy conditions and for designing robust systems for computational hearing. Psychologists are usually not experts in signal processing, frequency analysis, acoustics, or computer programming, and engineers often lack in-depth knowledge of statistics or cognitive neuroscience. Over the last decade, many interdisciplinary research groups have emerged to cope with this 'interdisciplinary problem'. The aim of our interdisciplinary approach to speech is to bring together different sources of knowledge, attitudes, and skills in order to better understand our sophisticated auditory and cognitive system.

This software has been inspired by several sources. The first source of motivation was the need for an interactive and exploratory tool: specifically, an intuitive interface that would let users interact dynamically with, and virtually demonstrate, the phenomena found in speech and hearing. Several collections of auditory demonstrations exist, such as the CD Demonstrations of Auditory Scene Analysis: The Perceptual Organization of Sound by Bregman and Ahad (see Bregman, 1990).
More recently, there are the auditory demonstrations by Yoshitaka Nakajima (www.design.kyushu-u.ac.jp/~ynhome/ENG/index.html). However, these demonstrations do not allow users to explore and manipulate the variables themselves; they can only listen passively. It is well known that dynamic interaction with an environment can improve learning in any field, especially when it involves transformations and manipulations of several different parameters. Direct manipulation supports novices by decreasing learning time, expert users by speeding up their actions, and intermediate users by enabling operational concepts to be retained. User anxiety is reduced, and confidence is enhanced, by the fact that the user initiates actions and can predict responses (Shneiderman, 1983). Oreja encourages the user to explore and replicate the range of relevant variables underlying auditory effects such as the perceptual grouping of tones (van Noorden, 1977) or duplex perception (Rand, 1974). With respect to masking, a user can filter speech into different bands, select and apply maskers from a menu of noises, and hear the result of these manipulations.

Another important aspect of Oreja is its simplicity: it concentrates on basic features such as the visual display of signals and the parameters most used in speech intelligibility research (e.g., maskers, filters, or frequency bands). This simplicity was motivated by the fact that too much sophistication can overwhelm intermediate or novice users. Moreover, the different visual representations of the signals reinforce complementary views of the data and a deeper understanding of the auditory phenomena.

A second source of motivation for Oreja was the paper by Kasturi and Loizou (2002), which assessed speech intelligibility in normal-hearing listeners as a function of the filtering out of certain frequency bands, which Kasturi and Loizou termed "holes". The speech signals they presented had either a single "hole" in one of various bands or two "holes" in disjoint or adjacent bands of the spectrum. After reading this article we wanted to design a similar experiment, but besides adding holes in various bands, we wanted to test the intelligibility of speech signals in which some of these frequency bands were replaced by sine-wave speech replicas (SWS; a synthetic analogue of natural speech represented by a small number of time-varying sinusoids; see Remez, Rubin, Pisoni, & Carrell, 1981). We imagined an intuitive interface that would allow us to filter the speech into different channels, select different distortions from a menu, and finally hear the output by simply pressing an imaginary 'play' key. No such software existed. However, a third source of inspiration was found: 'MAD' ('MATLAB Auditory Demonstrations'; Cooke & Brown, 1999a). These demonstrations form a computer-assisted learning application that opens up the field to interactive investigation, providing the user with a rich environment in which to explore dynamically the many phenomena and processes associated with speech and hearing. After exploring the versatility of MAD, investing time in a limited program for a specific issue felt inefficient. Instead, we preferred to build a psychoacoustic tool with a wide range of possibilities and menus, accessible to a large and heterogeneous group of language and speech researchers.
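Sine-wave speech, mentioned above, reduces an utterance to a few sinusoids that follow its formant tracks. As a toy sketch of the principle in MATLAB, with made-up formant tracks and parameter values of our own choosing (this is not Oreja's implementation):

    % Toy sine-wave speech synthesis from made-up formant tracks.
    fs  = 16000; hop = 160;                       % 10-ms frames (assumed)
    n   = 80;                                     % number of frames
    F   = [linspace(500,700,n)' ...               % invented "F1" track, Hz
           linspace(1500,1100,n)' ...             % invented "F2" track, Hz
           linspace(2500,2300,n)'];               % invented "F3" track, Hz
    tF  = (0:n-1)'*hop/fs;                        % frame times, s
    t   = (0:n*hop-1)'/fs;                        % sample times, s
    Fi  = interp1(tF, F, t, 'linear', 'extrap');  % per-sample frequency tracks
    sws = sum(sin(2*pi*cumsum(Fi)/fs), 2);        % phase = running sum of frequency
    soundsc(sws, fs)

Real SWS replicas are of course built from formant frequencies and amplitudes estimated from natural utterances (Remez et al., 1981).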
Oreja's main objective is to be a useful tool for researchers and students, supporting and motivating the design of psychoacoustic experiments. A wide variety of professional software already exists that does much more on the audio-processing side than Oreja, but Oreja's advantage and uniqueness lie in bringing together the functions important to psychologists under a simple and clear interface.

Description of Oreja

Oreja has been implemented in MATLAB, a high-level scripting environment that provides functions for numerical computation, user-interface creation, and data visualization. Oreja was updated in 2004 for MATLAB 6.5 on Windows XP. Its design is specially oriented toward the study of speech intelligibility, supporting the design of acoustic stimuli for psychoacoustic experiments. It has two main windows, or interfaces. The first window guides the selection of the signal (in the time and frequency domains); the second guides the changes performed on the different parts of the original signal, or the creation of background noises to mask it. The first window allows loading a signal, decomposing it into different channels, analyzing it, and selecting its parts; it also allows labeling a signal and its constituent parts. Moreover, Oreja can load multiple signals and concatenate them. The second window has been designed to manipulate the loaded signals, or the selected parts, in many different ways (e.g., attenuating the amplitude of the selected channels, or adding noise), and to save the result in an audio file format.

Using Oreja.m

Oreja can be downloaded at http://www.liv.ac.uk/psychology/Downloads/Oreja.htm and, at present, is available only for the Windows operating system. No installation is needed beyond adding the Oreja folder (approximately 1 MB) to MATLAB's current directory path; alternatively, for Oreja to be always available, the folder can be added to MATLAB's collection of toolboxes. Help is provided via the 'Users Manual', which also contains a glossary defining the more technical concepts. Once Oreja is in MATLAB's current directory, typing at the MATLAB prompt a command in the form

» oreja

makes the first window appear on the monitor.

First window: Load Signal

The main goal of this first window is to allow the user to load a signal and precisely select portions of it. Panels one and two (see Fig. 1) represent the frequency composition of the signal; panel four represents the amplitude of the signal across time. Cursors and zoom options have been added to make selection more accurate. A speech signal can be labeled, or phonologically transcribed, using the transcription menu and panel three.

(Figure 1 around here)

Once the user has chosen and loaded a sound file, three different representations of it appear in three panels. The first panel shows the spectrogram of the selected signal. In the second panel the signal appears filtered, by default, into ten frequency channels. The spectrogram display has a linear-in-Hz y-axis, whereas the filter center frequencies (CFs) are arrayed on an ERB (equivalent rectangular bandwidth) rate scale, which is approximately logarithmic. The info menu contains a popup menu with a detailed user manual in HTML format that also includes a glossary.
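To make the ERB-rate scale just mentioned concrete, the following sketch (our illustration, not Oreja's own code) computes center frequencies equally spaced on that scale using the commonly cited Glasberg and Moore formulas; the analysis range is an assumption:

    % Sketch: N center frequencies equally spaced on the ERB-rate scale.
    N   = 10;                                  % Oreja's default channel count
    fLo = 100; fHi = 7200;                     % assumed analysis range, Hz
    eLo = 21.4*log10(0.00437*fLo + 1);         % Hz -> ERB-rate ("cams")
    eHi = 21.4*log10(0.00437*fHi + 1);
    e   = linspace(eLo, eHi, N);               % equal steps on the ERB-rate scale
    cf  = (10.^(e/21.4) - 1)/0.00437;          % back to Hz: the channel CFs
    bw  = 24.7*(0.00437*cf + 1);               % 1-ERB bandwidth at each CF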
Channels and filters

By default, all channels are selected when a signal is loaded. Individual channels can be selected or deselected by clicking on their waveforms or on the small circles in the second panel (see number 5 in Fig. 1). Either the whole original signal or just the selected channels can be played back; in the latter case, unselected channels do not contribute to the overall output. By selecting and deselecting channels, the user can explore various forms of spectral filtering and begin designing stimuli. Alternatively, the select all/unselect all buttons can be used to speed up the selection process. The play original button plays back the original signal, so that it can be compared with the current one if the latter has been distorted.

The number of channels into which the signal is divided can be modified to explore lowpass, highpass, bandpass, and bandstop filtering. The information contained in each band depends on the filterbank applied. The filters used to divide the signal form a bank of second-order gammatone auditory bandpass filters (Patterson & Holdsworth, 1990). The center frequencies of the filters are shown on the left side of the display; the distance between center frequencies is based on the ERB fit to the human cochlea (Moore, Glasberg, & Peters, 1985). The default filterbank covers the whole signal uniformly, with minimal gaps between the bands. The default bandwidths can be changed, which forces all filters to have a bandwidth of 1 ERB regardless of their spacing. This option leaves larger gaps between the filtered signal bands for banks of fewer than ten bands, but the distances between the center frequencies are not changed. Note that when filtering the signal into a small number of channels, the default filterbank carries more information than the '1 ERB' filters. The suitability of each filterbank depends on the purpose of the experiment.

This window has been designed for studying the ability of the auditory system to handle spectral filtering, reduction, and missing-data speech recognition. The output of this window can be manipulated in a second window called 'Manipulate Signal'.

Second window: Manipulate Signal

Once the selection and filtering are done, the Signal menu takes you to the Manipulate selection option, and a second window appears (Fig. 2) with the time- and frequency-domain selection represented in two panels.

(Figure 2 around here)

As in the first window, the signal can again be selected in the frequency and/or the time domain. The exact positions of the cursors appear in the time and frequency displays. The time portion the user wishes to manipulate can be selected precisely by entering specific values in the from/to option of the global settings menu.

The distortion menu has been designed to explore the effects on recognition of altering spectro-temporal regions of speech or of adding maskers to it. It is organized into two subgroups. The first subgroup comprises three types of maskers: speech, tone, and noises. These maskers affect all channels (selected or not), but mask only the time period selected with the from/to function of the global settings menu. The speech masker is empty by default; it has been designed to load speech previously recorded and saved by the user. The tone and noise maskers can be generated by entering the appropriate values in the global settings menu, or by selecting a specific type of noise from the popup menu.
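As an illustration, a tone masker of the kind the Tone option generates can be sketched as follows; the parameter names and values are our assumptions, mirroring the global settings (frequency, duration, ramp):

    % Sketch of a ramped sine-tone masker (not Oreja's code).
    fs     = 16000;                        % assumed sampling rate
    freq   = 1000; dur = 0.4; rampMs = 10; % Hz, s, ms
    t      = (0:round(dur*fs)-1)/fs;
    tone   = sin(2*pi*freq*t);
    nR     = round(rampMs*fs/1000);        % raised-cosine onset/offset ramps
    ramp   = 0.5*(1 - cos(pi*(0:nR-1)/nR));
    env    = [ramp, ones(1, length(t)-2*nR), fliplr(ramp)];
    soundsc(tone.*env, fs)

A noise burst can be sketched analogously by replacing the sinusoid with a noise source and setting a short duration.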
The second subgroup comprises three transformations: gain, time reverse, and sinewave replicas. These change properties of the selected channels, such as their amplitude or time direction, or replace them with time-varying sine waves positioned at the centers of the formant frequencies. None of the three maskers can be restricted to particular frequency bands. In the second panel, the spectrum of the selected portions can be visualized in combination with the manipulations added. Again, the user can choose to play back the original signal or the selected portion of the signal plus the added distortions. The distortion menu displays the stimuli that can be added, subtracted, or applied. Note that only the global settings relevant to each specific stimulus are enabled.

Distortions

As specified above, the loaded signal can be masked with speech, tones, or noise:

(1) Speech: The purpose of this option is to mix the selected signal with streams of speech or other signals. The specific inputs depend on the kind of stimuli the user wants to design. The advantage of this option is that the user can select and manipulate a speech signal, mix it later with another signal to create, for example, a cocktail-party effect, or save it for use at a later date.

(2) Tone: Generates a sine-wave tone. The user can set a specific frequency, duration, starting and ending points, ramp, and, if the tone stops and starts at a regular rate, the repetition rate.

(3) Noises: This menu contains five stimuli: white noise, brown noise, pink noise, chirps up, and chirps down. Some parameters of these stimuli can be changed within the code that generates them (the .m files in the noises folder) or within the global settings menu. Note that stimuli such as bursts can be generated by selecting a type of noise (e.g., pink noise) and setting its duration.

Transformations

These are the transformations the user can apply to the selected channels:

(1) Gain: Adjusts the amplitude level, in dB, of the selected channels.

(2) Time reverse: Inverts the selected channels in time.

(3) Sinewave replicas: Sine-wave speech is a synthetic analogue of natural speech represented by a small number of time-varying sinusoids.

The global settings that can modify parameters of the signal, channels, or maskers are: (1) frequency, (2) amplitude, (3) duration, (4) from/to, (5) ramp, and (6) repetition period.
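Numerically, the first two transformations amount to simple waveform operations; a minimal sketch on a stand-in channel (ours, not Oreja's actual code):

    % Sketch of the Gain and Time reverse transformations.
    fs = 16000;
    x  = randn(fs, 1);            % stand-in for one selected channel (1 s of noise)
    gainDb = -6;                  % Gain: level change in dB
    y  = x * 10^(gainDb/20);      % scale amplitude by the equivalent linear factor
    z  = flipud(y);               % Time reverse: invert the channel in time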
Finally, Oreja allows the user to save the manipulations made, undo the last manipulation, undo all manipulations, or annotate the signal.

Conclusion

The Oreja software provides fertile ground for interactive demonstrations and a quick and easy way to design psychoacoustic experiments and stimuli. Dynamic interaction with acoustic stimuli has been shown to be a style of interaction that aids learning and a better understanding of auditory phenomena (Cooke, Parker, Brown, & Wrigley, 1999b). Thanks to its user-friendly interface and modest processor requirements, we hope that Oreja will be useful for a broad range of research applications. We look forward to developing further functions that will improve this piece of software.

Availability

Oreja may be used freely for teaching or research. It may not be used for commercial gain without the permission of the authors.

Application: A psychoacoustic experiment

The study by Kasturi and Loizou (2002) assessed the intelligibility of speech having either a single "hole" in one of various bands or two "holes" in disjoint or adjacent bands of the spectrum. With Oreja, frequency-band "holes" can instead be replaced with sine-wave replicas to assess speech intelligibility and to explore whether, under similar conditions, consonants are easier to identify than vowels. To initiate signal manipulation, we first load the desired speech signal into Oreja. Once the speech signal is loaded, it can be band-pass filtered and different distortions can be applied to each designated band; the output is heard by pressing the play key. The step-by-step tutorial below illustrates this process of interaction, which is helpful for understanding and replicating many auditory phenomena.

To use oreja.m, place the function in the MATLAB folder and type at the MATLAB prompt of the command window:

» oreja

The first window (see Fig. 1) will appear. At the top left, click on the Signal menu and choose load signal (see Fig. 3).

(Figure 3 around here)

The load signal option opens another window called 'new signal'. This window allows searching for the signals we want to load, in this case a sample of the Shannon et al. (1999) database (/ABA/). The file format of these signals must be .au, .wav, or .snd; no other format is supported. Once the sound file is loaded, three different representations appear in three panels. The top panel displays the spectrogram of the selected signal; the intensity of the signal is represented in the spectrogram in green tones (light green represents greater intensity, or energy). Panel 2 shows the signal filtered into ten frequency channels (by default), unless another number of frequency channels is specified; selected channels appear in green, and unselected channels appear in ivory. Panel 4 represents the waveform of the loaded sound file. In all three panels, time is represented on the x-axis; in the first two, the y-axis represents frequency, which corresponds to our impression of the highness of a sound. The name given to the loaded signal appears at the top of the window, in this case 'ABA'.

As we move the mouse, or drag the cursors (shown as vertical lines), over the first panel, the time and frequency at the current location are shown in the displays at the top right (see number 6 in Fig. 1). The cursors of the first and second panels are connected, because both panels share the same time scale. However, the cursors of the bottom panel are linked to those of the first and second panels only in the initial, just-loaded stage, when the whole signal is represented; it is only in this first stage that all the panels represent the whole signal. Another difference between the panels is that the portion of the signal between the cursors is played back when the user clicks on the waveform of the bottom panel. This property helps the user select accurately the segment of speech to be manipulated. The spectrogram display has a linear-in-Hz y-axis, whereas the filter center frequencies are arrayed on an ERB-rate scale.

Clicking the green circles of the second panel deselects specific channels. Here, the third and fourth channels, with center frequencies of 570 Hz and 1017 Hz, have been deselected; this means that a "hole" has been introduced in the spectrogram. Moving the vertical lines, or cursors, facilitates the selection of a specific segment of the signal. Using the zoom in and zoom out buttons together with the slide bar improves the accuracy of the selection. The Zoom in and Zoom out buttons work only on the first and second panels.
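Outside the GUI, playing back the portion between the cursors corresponds to slicing the waveform by the from/to times. A sketch with a hypothetical file name and cursor values (not Oreja code):

    % Play the between-cursor portion of a stimulus by hand.
    [x, fs] = wavread('ABA.wav');   % hypothetical file; .au files can be read with auread
    tFrom = 0.25; tTo = 0.60;       % stand-ins for the cursor (from/to) positions, s
    seg = x(round(tFrom*fs)+1 : round(tTo*fs), 1);
    soundsc(seg, fs)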
The bottom panel always represents the loaded signal in its entirety, offering a perspective of the whole signal at all times. Play selection plays back the result; moving the cursors to the edges of the panels selects the entire signal.

From the same Signal menu we can select manipulate selection, and the second window (see Fig. 2) appears with the selection made in the first window. Again, we can deselect more channels to introduce more "holes", and transform specific channels into sine-wave replicas of speech by selecting one or more channels and clicking the sinewave replicas radio button. Click add/apply to perform the transformation. At this point the user can play back the result and save it as an audio file. Any manipulation can be undone one by one, or all at once. It is always possible to go back to the first window and incorporate new changes, or simply to close the current window and start again.

REFERENCES

BREGMAN, A.S. (1990). Auditory scene analysis. Cambridge, MA: MIT Press.

COOKE, M.P., & BROWN, G.J. (1999a). Interactive explorations in speech and hearing. Journal of the Acoustical Society of Japan (E), 20(2), 89-97.

COOKE, M.P., PARKER, H.E.D., BROWN, G.J., & WRIGLEY, S.N. (1999b). The interactive auditory demonstrations project. ESCA Eurospeech Proceedings, September 5-9, 1999, Budapest, Hungary.

KASTURI, K., & LOIZOU, P.C. (2002). The intelligibility of speech with "holes" in the spectrum. Journal of the Acoustical Society of America, 112(3), 1102-1111.

MATLAB: The language of technical computing. (1997). Natick, MA: The MathWorks Inc.

MOORE, B.C.J., GLASBERG, B.R., & PETERS, R.W. (1985). Relative dominance of individual partials in determining the pitch of complex tones. Journal of the Acoustical Society of America, 77, 1853-1860.

PATTERSON, R.D., & HOLDSWORTH, J. (1990). In Advances in speech, hearing and language processing, Vol. 3 (Ed. Ainsworth). JAI Press.

RAND, T.C. (1974). Dichotic release from masking for speech. Journal of the Acoustical Society of America, 55, 678-680.

REMEZ, R.E., RUBIN, P.E., PISONI, D.B., & CARRELL, T.D. (1981). Speech perception without traditional speech cues. Science, 212, 947-950.

REMEZ, R.E., RUBIN, P.E., BERNS, S.M., PARDO, J.S., & LANG, J.M. (1994). On the perceptual organization of speech. Psychological Review, 101, 129-156.

SHANNON, R.V., JENSVOLD, A., PADILLA, M., ROBERT, M.E., & WANG, X. (1999). Consonant recordings for speech testing. Journal of the Acoustical Society of America, 106, L71-L74.

SHNEIDERMAN, B. (1983). Direct manipulation: A step beyond programming languages. IEEE Computer, 16(8), 57-69.

VAN NOORDEN, L.P.A.S. (1977). Minimum differences of level and frequency for perceptual fission of tone sequences ABAB. Journal of the Acoustical Society of America, 61, 1041-1045.

Figure 1. The first window allows the user to select the signal in the time and frequency domains. Three different representations of the signal are available: panels 1 and 2 represent the frequency composition of the signal, and panel 4 the amplitude of the signal across time. Panel 3 allows labeling of portions of the signal.

Figure 2. Second window: Manipulations. The portion of the signal selected in the previous window can be altered in this window. From the distortions menu on the right side, different types of distortions can be selected and applied to the signal. From the global settings menu, parameters of the distortions can be modified or generated.
Figure 3. To load a signal, choose load signal from the Signal menu.