Oreja... a MATLAB environment for the design of psychoacoustic stimuli
ELVIRA PEREZ
Department of Psychology, University of Liverpool, U.K.
&
RAUL RODRIGUEZ-ESTEBAN
Department of Electrical Engineering, Columbia University, New York, U.S.A.
Correspondence should be addressed to E. Perez,
School of Psychology, University of Liverpool
Eleanor Rathbone Building, Bedford St. South
Liverpool L69 7ZA, England, U.K.
Phone: +44 (0) 151 794 2177
e.perez@liverpool.ac.uk
The Oreja software has been designed specifically to study speech intelligibility. It is an engineering tool
informed by psychology, allowing researchers interested in speech perception to manipulate
speech signals and study how these manipulations affect human perception. A feature of this package
is that it uses a high-level interpreted scripting environment (MATLAB), allowing the user to load
signals, decompose them into different channels, analyze them, and select the parts they want to label
and/or manipulate (e.g. attenuating the amplitude of the selected channels, adding noise, etc.). This
paper includes an application to illustrate how Oreja can be used. Oreja can be downloaded at
http://www.liv.ac.uk/psychology/Downloads/Oreja.htm
The development of this software has been partially supported by HOARSE grant HPRN-CT-2002-00276. This project could not have been developed at LabROSA (Dept. of Electrical Engineering,
Columbia University, NYC, USA) without the supervision of Dan Ellis and a Fulbright scholarship granted to the
first author. A version of this paper was presented at the XIV European Society of Cognitive
Psychology conference, Leiden, The Netherlands, 31st August – 3rd September 2005. We thank John Worley,
Dimosthenis Karatzas and Martin Cooke for helpful comments and suggestions on earlier versions of
this paper. Correspondence should be addressed to Elvira Perez, School of Psychology, University of
Liverpool, Eleanor Rathbone Building, Bedford St. South, Liverpool L69 7ZA, England, U.K.
(e.perez@liverpool.ac.uk).
Psychologists have long sought to understand how human listeners comprehend language, and specifically
how our auditory system efficiently processes speech in an environment where silence is normally
the exception. Listeners possess strategies for handling many types of distortion, and uncovering these
strategies is important both for understanding speech intelligibility in noisy conditions and for designing
robust systems for computational hearing.
Psychologists are not usually experts in signal processing, frequency analysis, acoustics or computer
programming, and engineers often lack an in-depth knowledge of statistics or cognitive
neuroscience. Over the last decade many interdisciplinary research groups have emerged to cope
with this ‘interdisciplinary problem’. The aim of our interdisciplinary approach to speech is to bring
together different sources of knowledge, attitudes, and skills in order to better understand our
sophisticated auditory and cognitive system.
This software has been inspired by several sources. The first source of motivation was the need for
an interactive and exploratory tool: specifically, an intuitive interface that would allow users to
interact dynamically with, and virtually demonstrate, the phenomena found in speech and hearing. There are
various examples of auditory demonstrations, such as the CD ‘Demonstrations of Auditory Scene
Analysis: The Perceptual Organization of Sound’ by Bregman and Ahad (included in Bregman, 1999),
and, more recently, the auditory demonstrations by Yoshitaka Nakajima (www.design.kyushu-u.ac.jp/~ynhome/ENG/index.html). However, these demonstrations do not allow users to explore
and manipulate the variables themselves; they can only listen passively.
It is well known that dynamic interaction with an environment can improve learning in any field,
especially when it involves transformations and manipulations of several different parameters. Direct
manipulation supports novices by decreasing learning times, expert users through speed of action,
and intermediate users by enabling operational concepts to be retained. User anxiety is reduced, and
confidence is enhanced, because the user initiates actions and can predict responses
(Shneiderman, 1983). Oreja encourages the user to explore and replicate the range of relevant
variables underlying auditory effects such as the perceptual grouping of tones (van Noorden, 1977) or
duplex perception (Rand, 1974). With respect to masking, a user can filter speech into different bands,
select and apply maskers from a menu of noises, and hear the result of their manipulations. Another
important aspect of Oreja is its simplicity: it emphasizes basic features such as the visual display of signals
and the parameters most used in speech intelligibility research (e.g. maskers, filters, or frequency
bands). The simplicity of Oreja was motivated by the fact that too much sophistication can overwhelm
intermediate or novice users. Moreover, the different visual representations of the signals help to
reinforce complementary views of the data and a deeper understanding of the auditory phenomena.
A second source of motivation for Oreja was the paper by Kasturi and Loizou (2002), which
assessed the intelligibility of speech with normal-hearing listeners. Speech intelligibility was assessed
as a function of filtering out certain frequency bands, which Kasturi and Loizou termed
“holes”. The speech signals they presented had either a single “hole” in various bands or two
“holes” in disjoint or adjacent bands of the spectrum. After reading this article we wanted to design a
similar experiment, but besides adding holes in various bands, we wanted to test the intelligibility of
speech signals when some of these frequency bands were replaced by sine-wave speech replicas
(SWS; a synthetic analogue of natural speech represented by a small number of time-varying
sinusoids; see Remez, Rubin, Pisoni, & Carrell, 1981). We imagined an intuitive interface that would
allow us to filter speech into different channels, select different distortions from a menu, and
finally hear the output by simply pressing an imaginary ‘play’ key. No such software existed.
However, a third source of inspiration was found: the work by Cooke and Brown
(1999a) called ‘MAD’ (‘MATLAB Auditory Demonstrations’). These demonstrations exist within a
computer-assisted learning application, which opens up the field to interactive investigation,
providing the user with a rich environment in which to dynamically explore the many phenomena and
processes associated with speech and hearing.
After exploring the versatility of MAD, investing time in a limited program for a specific issue
felt inefficient. Instead, we preferred to build a psychoacoustic tool with a wide range of
possibilities and menus, accessible to a large and heterogeneous group of language and speech
researchers.
Oreja’s main objective is to become a useful tool for researchers and students, supporting and
motivating the design of psychoacoustic experiments. There exists a wide variety of professional
software that does much more on the audio-processing side than Oreja, but Oreja’s advantage and
uniqueness is that it brings together the functions important to psychologists under a
simple and clear interface.
Description of Oreja
Oreja has been implemented using MATLAB, a high-level scripting environment
that provides functions for numerical computation, user interface creation and data visualization.
Oreja was updated in 2004 for MATLAB 6.5 on Windows XP.
The design of Oreja is specifically oriented toward the study of speech intelligibility, supporting the
design of acoustic stimuli for psychoacoustic experiments. It has two main windows or
interfaces. The first window guides the selection of the signal (in the time and frequency domains), and the
second guides the changes performed on different parts of the original signal, or the creation of
background noises to mask it. The first window allows loading a signal, decomposing it into different channels,
analyzing it, and selecting parts of it; it also allows labeling of a signal and its
constituent parts. Moreover, Oreja can load multiple signals and concatenate them.
The second window has been designed to manipulate the loaded signals, or the selected parts, in
many different ways (e.g., attenuating the amplitude of the selected channels, or adding noise), and
to save them in an audio file format.
Using Oreja.m
Oreja can be downloaded at http://www.liv.ac.uk/psychology/Downloads/Oreja.htm, and at present it
is only available for the Windows operating system. Installation consists of adding the Oreja folder to the
MATLAB current directory path. Alternatively, for Oreja to be always available, its folder
(approximately 1 MB) can be added to MATLAB’s collection of toolboxes. Help is provided via the
‘Users Manual’, which also contains a glossary defining the more technical concepts. No further
installation is needed. Once the Oreja folder is in MATLAB’s Current Directory, typing at the
MATLAB prompt:
» oreja
will bring up the first window.
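If the Oreja folder is kept somewhere other than the current directory, it can be put on the search path before launching. A minimal sketch, assuming a hypothetical install location (the path here is an example, not part of Oreja):

% Make Oreja reachable from any directory, then launch it.
addpath('C:\toolboxes\Oreja');  % hypothetical location of the Oreja folder
oreja                           % opens the first (Load Signal) window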
First window: Load Signal
The main goal of this first window is to allow the user to load a signal and precisely select portions of
it. Panels one and two (see Fig. 1) represent the frequency composition of the
signal, and panel four represents the amplitude of the signal across time. Cursors and zoom
options have been added to facilitate accurate selection. A speech signal can be labeled or
phonologically transcribed using the transcription menu and panel three.
(Figure 1 around here)
Once the user has chosen and loaded a sound file, three different representations of the sound
will appear in three different panels. The first panel shows the spectrogram of the selected signal. In
the second panel the signal appears filtered, by default, into ten frequency channels. The spectrogram
display has a linear-in-Hz y-axis, whereas the filter center frequencies (CF) are arrayed on an ERB
(Equivalent Rectangular Bandwidth) rate scale, which is approximately logarithmic.
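As an illustration of what this spacing means in practice, the following sketch (not Oreja's actual code) places ten center frequencies uniformly on an ERB-rate scale, using the common Glasberg and Moore formulation of the ERB-rate function; the frequency range is an assumed example:

% Space N center frequencies uniformly on the ERB-rate scale.
N = 10; fmin = 100; fmax = 8000;                       % assumed range (Hz)
hz2erb = @(f) 21.4 * log10(4.37 * f / 1000 + 1);       % Hz -> ERB number
erb2hz = @(e) (10.^(e / 21.4) - 1) * 1000 / 4.37;      % ERB number -> Hz
cf = erb2hz(linspace(hz2erb(fmin), hz2erb(fmax), N));  % ERB-spaced CFs (Hz)

The resulting center frequencies crowd together at low frequencies and spread out at high frequencies, which is why the scale is described as approximately logarithmic.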
The info menu contains a popup menu with a detailed user manual in HTML format that also includes a
glossary.
Channels and filters
By default, all channels are selected after loading the signal. Different channels can be selected
or deselected by clicking on individual waveforms or on the small circles in the second panel (see
number five in Fig. 1). Either the whole original signal or just the selected channels can be played back;
in the latter case, unselected channels do not contribute to the overall output. By selecting and deselecting
channels the user can explore various forms of spectral filtering and start designing stimuli. Alternatively, the
select all/unselect all buttons can be used to speed up the selection process. The play original button
plays back the original signal so that it can be compared with the current one if distortions have
already been applied. The number of channels into which the signal is divided can be modified to explore lowpass,
highpass, bandpass and bandstop filtering. The information contained in each band depends on the
filterbank applied. The filters used to divide the signal are a bank of second-order auditory gammatone
bandpass filters (Patterson & Holdsworth, 1990). The center frequencies of the filters are shown on the
left side of the display. The spacing between center frequencies is based on the ERB (Moore,
Glasberg & Peters, 1985) fit to the human cochlea. The default filterbank covers the whole
signal uniformly, with minimal gaps between the bands. The default bandwidths can be changed, forcing all
filters to have a bandwidth of 1 ERB regardless of their spacing. This option leaves larger gaps
between the filtered signal bands for banks of fewer than ten bands, but the distances between the
center frequencies are not changed. Notice that when filtering the signal into a small number of channels
the default filterbank passes more of the signal than the '1 ERB' filters. The suitability of each
filterbank depends on the purpose of the experiment.
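For readers unfamiliar with gammatone filters, the sketch below (an illustration under assumed parameters, not Oreja's implementation) generates and plots the impulse response of a single gammatone filter, t^(n-1) e^(-2*pi*b*t) cos(2*pi*cf*t), where the order n, bandwidth b and center frequency cf are example values:

% Impulse response of one gammatone filter with assumed parameters.
fs = 16000; t = (0:1/fs:0.05)';            % 50 ms at 16 kHz
n = 4; cf = 1000;                          % assumed order and CF (Hz)
b = 1.019 * (24.7 + 0.108 * cf);           % ~1 ERB bandwidth at cf (common fit)
g = t.^(n-1) .* exp(-2*pi*b*t) .* cos(2*pi*cf*t);
plot(t, g / max(abs(g)));                  % normalized impulse response
xlabel('Time (s)'); ylabel('Amplitude');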
This window has been designed to study the ability of the auditory system to handle spectral filtering,
reduction, and missing-data speech recognition. The output of this window can be manipulated in a
second window called ‘Manipulate Signal’.
Second window: Manipulate Signal
After the selection and filtering are done, the Signal menu takes you to the Manipulate selection option,
and a second window appears (Fig. 2) with the time- and frequency-domain selection represented in
two panels.
(Figure 2 around here)
As in the first window, the signal can again be selected in the frequency and/or the time
domain. The exact position of the cursors appears in the time and frequency displays. The time portion
the user wishes to manipulate can be selected accurately by inserting specific values in the
from/to option of the global settings menu.
The distortion menu has been designed to explore the effect on recognition of altering spectro-temporal
regions of speech or adding maskers to it. It is organized into two subgroups. The first subgroup
comprises three different types of maskers: speech, tone and noises. These maskers affect all the
channels (selected or not), but mask only the time period selected with the from/to function of
the general settings menu (a sketch of this kind of time-restricted masking follows below). The speech masker is empty by default and has been designed to load speech
previously recorded and saved by the user. The tone and noise maskers can be generated by inserting
the appropriate values in the global settings menu, or by selecting a specific type of noise from the
popup menu. The second subgroup of distortions comprises gain, time reverse, and sinewave replicas;
these change properties of the selected channels such as amplitude or time direction, or replace them
with time-varying sine waves positioned at the centers of the formant frequencies. None of the three
maskers can be frequency-band restricted.
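The following sketch (not Oreja's code; all values are assumed) illustrates time-restricted masking, adding white noise only within a from/to selection:

% Mask only the selected time segment of a signal with white noise.
fs = 16000;
x = 0.1 * randn(fs, 1);                 % stand-in for a loaded 1-s signal
t1 = 0.2; t2 = 0.5;                     % from/to selection (s)
idx = round(t1*fs) + 1 : round(t2*fs);  % sample indices of the selection
y = x;
y(idx) = y(idx) + 0.05 * randn(numel(idx), 1);  % masker added to the selection only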
In the second panel the spectrum of the selected portions can be visualized in combination with the
manipulations added. Again, the user can choose to play back the original signal, or the selected
portion of the signal plus the added distortions.
The distortion menu displays the stimuli that can be added, subtracted or applied. Notice that only
the global settings relevant to each specific stimulus will be enabled.
Distortions
As specified above, the loaded signal can be masked with speech, tones or noise. (1) Speech: The
purpose of this option is to mix the selected signal with streams of speech or other signals. The
specific inputs depend upon the kind of stimuli the user wants to design. The advantage of this
option is that the user can select and manipulate a speech signal, mix it later with another signal to
create, for example, a cocktail party effect, or save it to be used at a later date. (2) Tone: Generates a
sine-wave tone. The user can select a specific frequency, duration, starting and ending point, ramp, and
the repetition rate if the tone stops and starts at a regular rate (a sketch of such a ramped tone follows
below). (3) Noises: This menu contains five
stimuli: white noise, brown noise, pink noise, chirps up, and chirps down. Some of the parameters of
these stimuli can be changed within the code that generates them (.m files in the noises folder), or
within the general settings menu. Notice that stimuli like bursts can be generated by selecting a type of
noise (e.g., pink noise) and setting its duration.
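As an example of the tone stimulus just described, the sketch below (not Oreja's code; all values are assumed) generates a sine-tone burst with linear onset and offset ramps:

% A ramped sine-tone burst with assumed parameters.
fs = 16000; f0 = 1000; dur = 0.2; ramp = 0.01;   % Hz, Hz, s, s
t = (0:1/fs:dur-1/fs)';
tone = sin(2*pi*f0*t);                           % pure tone
nr = round(ramp*fs);                             % samples per ramp
env = [linspace(0,1,nr)'; ones(numel(t)-2*nr,1); linspace(1,0,nr)'];
sound(tone .* env, fs);                          % audition the ramped burst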
Transformations
These are the transformations the user can apply to the selected channels: (1) Gain: Adjusts the
amplitude level, in dB, applied to the selected channels. (2) Time reverse: Inverts the selected
channels in time. (3) Sinewave replicas: Sine-wave speech is a synthetic analogue of natural speech represented
by a small number of time-varying sinusoids.
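The first two transformations are simple to express in code; a minimal sketch (not Oreja's implementation, with a random placeholder standing in for a selected channel):

% Gain and time reversal applied to one filtered channel.
x = randn(1000, 1);        % placeholder for a selected channel
gain_dB = -6;              % assumed attenuation value
y = x * 10^(gain_dB/20);   % (1) gain: scale amplitude by a dB factor
y = flipud(y);             % (2) time reverse: invert the channel in time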
The general settings that can modify some of the parameters of the signal, channels or maskers are:
(1) frequency, (2) amplitude, (3) duration, (4) from/to, (5) ramp, and (6) repetition period.
Finally, Oreja allows the user to save the manipulations done, undo the last manipulation, undo all,
or annotate the signal.
Conclusion
The Oreja software can provide fertile ground for interactive demonstrations and a quick and easy
way to design psychoacoustic experiments and stimuli. Dynamic interaction with acoustic stimuli has
been shown to aid learning and a better understanding of auditory
phenomena (Cooke, Parker, Brown, & Wrigley, 1999b). Due to its user-friendly interface and
modest processor requirements, we hope it will be useful for a broad range of research applications. We look
forward to developing further functions that will improve this piece of software.
Availability
Oreja may be used freely for teaching or research. It may not be used for commercial gain without
permission of the authors.
Application: A psychoacoustic experiment
The study by Kasturi and Loizou (2002) assessed the intelligibility of speech having either a
single “hole” in various bands or two “holes” in disjoint or adjacent bands of the spectrum. With
Oreja, frequency-band “holes” can be replaced with sine-wave replicas to assess speech intelligibility
and to explore whether, under similar conditions, consonants are easier to identify than vowels. To initiate signal
manipulation we first load the desired speech signal into Oreja. Once the speech signal is loaded it
can be band-pass filtered and different distortions can be applied to each designated band. The
output is heard by pressing the play key. A step-by-step tutorial, crucial for the understanding and
replication of many auditory phenomena, is introduced below to illustrate the process.
To use oreja.m, place the function in the MATLAB folder and, at the MATLAB prompt of the
command window, type:
» oreja
The first window (see Fig. 1) will appear. On the top left, click on the Signal menu and choose load
signal (see Fig. 3). (Figure 3 around here). The load signal option opens another window called
‘new signal’. This window allows searching for the signals we want to load, in this case a sample
of the Shannon et al. (1999) database (/ABA/). The file format of these signals should be .au, .wav, or
.snd (other formats are not supported). Once the sound file is loaded, three different representations
appear in three different panels. The top panel displays the spectrogram of the selected signal. The
intensity of the signal is represented in the spectrogram with green tones (light green represents
greater intensity or energy). Panel 2 shows the signal filtered into ten frequency channels
(by default), unless another number of frequency channels is specified. Selected channels
appear in green, and unselected channels appear in ivory. Panel 4 represents the
waveform of the loaded sound file. In all three panels time is represented on the x-axis; in the first
two panels the y-axis represents frequency, which corresponds to our impression of the highness of a sound. The name
given to the loaded signal appears at the top of the window, in this case ‘ABA’.

As we move the mouse or drag the cursors (shown as vertical lines) over the first panel, the time and
frequency under the current location are shown in the top-right displays (see number 6 in Fig. 1).
Cursors in the first and second panels are connected because both share the same time scale of the
signal. However, the cursors in the bottom panel are only linked to those of the first and
second panels when the whole signal is represented, in the initial loaded state; it is only in this first stage
that all the panels represent the whole signal. Another difference between the panels is that the
portion of the signal between the cursors is played back when we click on top of the waveform in the
bottom panel. This property helps the user accurately select the segment of speech they wish to
manipulate. The spectrogram display has a linear-in-Hz y-axis, whereas the filter center frequencies are
arrayed on an ERB rate scale.

Clicking the green circles in the second panel unselects specific channels. Here, the third and fourth
channels, with center frequencies of 570 Hz and 1017 Hz, have been unselected; this means that a
“hole” has been introduced in the spectrum. Moving the vertical lines or cursors facilitates the
selection of a specific segment of the signal, and using the zoom in and
zoom out buttons together with the slide bar improves the accuracy of the selection. The
zoom in and zoom out buttons only work for the first and second panels; the bottom panel always
represents the complete loaded signal, offering a perspective of the whole signal at all
times. Play selection will play back the result. Moving the cursors to the edges of the panels
selects the entire signal.

From the same Signal menu we can select manipulate selection, and the second
window (see Fig. 2) appears with the selection made in the first window. Again we can unselect
more channels to introduce more “holes”, and transform specific channels into sine-wave replicas of
speech by selecting one or more channels and clicking the sinewave replicas radio button. Click
add/apply to perform the transformation. At this point the user can play back the result and save it as an
audio file. Any manipulation can be undone one by one or all at once. It is always possible to go back
to the first window and incorporate new changes, or just close the current window and start again.
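For readers who prefer a script, the spirit of this tutorial can be approximated outside the GUI. The sketch below (independent of Oreja, assuming current MATLAB and the Signal Processing Toolbox; the file name and band edges are example values) creates two spectral “holes” by band-stop filtering and saves the result:

% Approximate two spectral "holes" around the unselected channels' CFs.
[x, fs] = audioread('ABA.wav');                   % assumed speech sample
[b1, a1] = butter(2, [500 700]/(fs/2), 'stop');   % notch around 570 Hz
[b2, a2] = butter(2, [900 1150]/(fs/2), 'stop');  % notch around 1017 Hz
y = filter(b2, a2, filter(b1, a1, x));            % speech with two "holes"
sound(y, fs);                                     % audition the result
audiowrite('ABA_holes.wav', y, fs);               % save the stimulus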
REFERENCES
BREGMAN, A.S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press.
COOKE, M.P. & BROWN, G.J. (1999a). Interactive explorations in speech and hearing. Journal of
the Acoustical Society of Japan (E), 20(2), 89-97.
COOKE, M.P., PARKER, H.E.D., BROWN, G.J. & WRIGLEY, S.N. (1999b). The interactive
auditory demonstrations project. ESCA, Eurospeech Proceedings, September 5-9, 1999, Budapest,
Hungary.
KASTURI, K. & LOIZOU, P.C. (2002). The intelligibility of speech with “holes” in the spectrum.
Journal of the Acoustical Society of America, 112(3), 1102-1111.
MATLAB: The Language of Technical Computing. The MathWorks Inc., Natick, MA, 1997.
MOORE, B.C.J., GLASBERG, B.R. & PETERS, R.W. (1985). Relative dominance of individual
partials in determining the pitch of complex tones. Journal of the Acoustical Society of America, 77,
1853-1860.
PATTERSON, R.D. & HOLDSWORTH, J. (1990). In Advances in Speech, Hearing and Language
Processing, Vol. 3 (Ed. Ainsworth). JAI Press.
RAND, T.C. (1974). Dichotic release from masking for speech. Journal of the Acoustical Society of
America, 55, 678-680.
REMEZ, R.E., RUBIN, P.E., PISONI, D.B. & CARRELL, T.D. (1981). Speech perception without
traditional speech cues. Science, 212, 947-950.
REMEZ, R.E., RUBIN, P.E., BERNS, S.M., PARDO, J.S. & LANG, J.M. (1994). On the perceptual
organization of speech. Psychological Review, 101, 129-156.
SHANNON, R.V., JENSVOLD, A., PADILLA, M., ROBERT, M.E. & WANG, X. (1999). Consonant
recordings for speech testing. Journal of the Acoustical Society of America, 106, L71-L74.
SHNEIDERMAN, B. (1983). Direct manipulation: a step beyond programming languages. IEEE
Computer, 16(8), 57-69.
VAN NOORDEN, L.P.A.S. (1977). Minimum differences of level and frequency for perceptual
fission of tone sequences ABAB. Journal of the Acoustical Society of America, 61, 1041-1045.
Figure 1. The first window allows the user to select the signal in the time and frequency domains. Three
different representations of the signal are available: panels 1 and 2 represent the frequency
composition of the signal, and panel 4 the amplitude of the signal across time. Panel 3 allows labeling
portions of the signal.
Figure 2. Second window: Manipulations. The portion of the signal selected in the previous window can
be altered in this window. From the distortions menu on the right side, different types of distortions
can be selected and applied to the signal. From the general settings menu, different parameters of
the distortions can be modified.
Figure 3. To load a signal, choose load signal from the Signal menu.