A Virtual Vocabulary Speech Recognizer
by
Peter D. Pathe
B.S. Engineering and Applied Science
California Institute of Technology
Pasadena, California
1977
Submitted to the Department of Architecture in Partial Fulfillment of the Requirements of the Degree of
Master of Science in Visual Studies
at the
Massachusetts Institute of Technology
April 1983
(c) Massachusetts Institute of Technology 1983
Signature of Author
Department of Architecture
1 April, 1983
Certified by
Andrew Lippman
Assistant Professor of Media Technology
Thesis Supervisor
Accepted by
Nicholas Negroponte
Professor of Computer Graphics
Chairman, Departmental Committee for Graduate Students
A Virtual Vocabulary Speech Recognizer
by
Peter D. Pathe
Submitted to the Department of Architecture on 1 April, 1983
in partial fulfillment of the requirements for the Degree of
Master of Science in Visual Studies
ABSTRACT
A system for the automatic recognition of human speech is described. A commercially available speech recognizer sees its recognition vocabulary increased through the use of virtual memory management techniques. Central to the design are issues concerning the nature of speech, its effectiveness as an isolated mode of communication with computers, and its role as a part of a multi-modal communication interface. A highly interactive information retrieval system serves as a sample application. This system is detailed, and features which make it an appropriate application for the speech system are identified.
Sponsored in full by the Office of Naval Research,
Contract No. N00014-80-K-0921
Thesis Supervisor: Andrew Lippman
Title: Assistant Professor of Media Technology
Acknowledgements
I wish to thank everyone at the Architecture Machine Group for all the help I received while working there. Andy Lippman provided financial support and resources for this project and others, and gave direction to my research. Chris Schmandt was a great source of information and advice on speech recognition. Thanks to Chris Lombardi, Neil Galarneau, and Adam Rose for their work on the NEC personal computer. Lenox Brassell was always ready to share his knowledge of Magic6 when I needed it. The lab just wouldn't have been the same without Smokehouse and the Hardware Team.
I especially wish to thank Walter Bender for his constant help and encouragement, both in and out of the lab. I hope we have the opportunity to work together again in the future.
Table Of Contents
Abstract
Acknowledgements
1.0 Introduction
2.0 Communication Modes
    2.1 Speech as a Mode of Interaction
3.0 Automated Speech Recognition
    3.1 The Intelligent Listener
4.0 Contextual Environment
    4.1 What Is NewsPeek?
    4.2 NewsPeek Operation
        4.2.1 Off-line (Editorial) Functions
        4.2.2 On-line Functions (Reader Aids)
    4.3 User Interface
5.0 The Speech Recognition Unit
6.0 The Vocabulary Predictor
7.0 Applying the Virtual Vocabulary Speech Recognizer
8.0 Conclusion
References
1.0 Introduction
Human communication is a well developed art. People speak, write, gesture, sing, dance, paint, and the list goes on. Each of these modes provides a powerful way to get a message from one person to another. Our lives are made richer through the exchange of thoughts, feelings, and experiences.
Unfortunately, the art of communicating with machines is not so well developed. The problem isn't that there is nothing to say; people converse with computers as a matter of course. The real issue is how it must be said. Most notably, computers cannot yet understand normal human speech.
Automatic speech recognition devices do exist. Commercially available recognizers typically are capable of identifying only a small vocabulary of isolated words uttered by a single speaker. This is nothing like the recognition of fluent speech, yet still is adequate performance in some applications.
The Virtual Vocabulary Speech Recognizer is an enhancement of one such device. At the heart of the system is a small personal computer which controls the speech recognizer, disc memory, and communication with the host. A large recognition vocabulary can be stored in the computer's memory, and sections can be selectively loaded into the recognizer's smaller active vocabulary space. Although the speech recognizer can identify one of only a small number of words at any given moment, its effective vocabulary seems quite large.
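The paging idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the thesis: the class name, the section names, and the exact-match stand-in for the recognition hardware are all hypothetical, and the capacity of sixty words is taken from the hardware limit quoted later in the text.

```python
ACTIVE_CAPACITY = 60  # assumed size of the recognizer's active vocabulary


class VirtualVocabulary:
    """Holds a large word store and pages subsets into a small active set."""

    def __init__(self, sections, capacity=ACTIVE_CAPACITY):
        # sections: mapping of section name -> list of words kept in host memory
        self.sections = sections
        self.capacity = capacity
        self.active = []  # words currently loaded into the recognizer

    def load(self, section_names):
        """Replace the active set with the words of the named sections."""
        words = []
        for name in section_names:
            for word in self.sections[name]:
                if word not in words:
                    words.append(word)
        # Words beyond the hardware limit are simply not loaded.
        self.active = words[: self.capacity]
        return self.active

    def recognize(self, utterance):
        """Stand-in for the hardware match: succeeds only on active words."""
        return utterance if utterance in self.active else None


vocab = VirtualVocabulary({
    "commands": ["next page", "previous page", "stop"],
    "sports": ["baseball", "score", "league"],
    "finance": ["market", "stocks", "bond"],
})
vocab.load(["commands", "sports"])
```

With the "sports" section loaded, the word "market" cannot be recognized until the "finance" section is paged in, which is the sense in which the effective vocabulary is larger than the active one.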
The general usefulness of this system hinges on the assumption that, for a given application, there exists some method for predicting a subset vocabulary with a reasonable likelihood of including the speaker's next utterance. This is obviously not a property of all systems to which automatic speech recognition might be applied, but it is practical in certain instances.
The host application in this case is an interactive news retrieval and analysis system. At any given moment, the user is typically involved in reading and searching for groups of stories. Exactly what constitutes a group is largely in his hands, as the system provides the user with the ability to associate the stories by features like content and age. Just as subsets of the global database are constructed by the user during system operation, associated subset recognition vocabularies can be created as well. In addition, other information about the task, such as the history of the user's activity, can be used to supplement the vocabulary subsetting process.
This speech system provides the recognizer with additional memory, processing power, knowledge about the user, and knowledge of the task being performed. The speech system, when integrated with the host application in this manner, performs with a greater degree of flexibility and utility than it does in isolation.
2.0 Communication Modes
Research in communication between humans has demonstrated the important role the mode of interaction plays in the transfer of information. In experiments performed by Chapanis and Ochsman [4,11] two subjects were involved in assembling a machine. One subject was designated the source of information, the other the seeker. Each time the experiment was performed, the subjects were allowed to interact in a different prescribed manner. The modes made available were communication-rich (subjects together in room), voice, video, handwriting, typing, and various combinations of the above. The quality of the exchange was measured in each case by the time the seeker required to assemble his project.
The communication-rich environment was far and away the best. Of the single modes, voice was found to be the most useful. Without exception, the multi-modal environments including verbal contact proved more valuable than those without. Also of interest, the typewriter fared poorly among experienced typists and novices alike.
Although the studies were performed with pairs of human subjects, the three results cited above are of particular relevance to the design of computer-user interfaces. People and computers usually converse by typing at each other. In its defense, the keyboard-driven computer terminal is a precise, reliable I/O device, is inexpensively manufactured, and requires minimal computational support. Its utility in accepting and quickly displaying large amounts of text is undeniable. Unfortunately, it is not always well suited to the task of highly interactive communication between humans and computers.
Means of graphical input and output are becoming more common, as are devices enabling computers to speak and hear. Making these and other alternatives useful in computer interaction is an active area of research. As advances in computer design produce faster and more powerful machines, memory and processing "left over" from performing tasks immediately at hand may be afforded alternative I/O devices and devoted to improving the user interface.
From a computer interface designer's standpoint, the most significant result of the communication experiment is not the relative rating of speech and typing as useful modes of interaction, but instead the very high rating of the multi-modal, communication-rich environment. The quality interface provides multiple channels for communication between seeker and source. That some of the channels overlap in their ability to transfer information is not an inefficiency to be optimized away. The redundancy inherent in multi-modal communication is a part of its usefulness. The user of the quality interface is freed from the mental constraints of single-moded thinking. He devotes his mind to applying the computer to the use for which it is intended, undistracted by the need to route his thoughts through an inconvenient channel.
The communication-rich environment is more than a collection of isolated modes. Each mode may show strengths and weaknesses depending on the kind of information to be transferred and idiosyncratic preferences of the user. Each mode supplements the others; in a thoughtfully designed system, the modes should not function independently, but with knowledge of each other whenever possible.
Although it is not yet the state of the art, this seems especially applicable in the case of the human-computer interface. Due to the nature of verbal and graphical I/O devices, some of the "richness" of the interface is lost in their operation. For example, a speech recognizer may function by reducing a sentence to a string of words, which are passed to the main computer in a form suitable for processing. Other elements of the spoken line, such as emphasis and inflection, are completely ignored. Similarly, a gesture on the touch-screen may be represented by its endpoints. These operations serve to extract some useful part of the input signal, but in so doing they effectively band-limit that signal. The hope is that the redundancy inherent in the multi-modal interface may be used to reconstruct some of the communication bandwidth lost in the feature extraction process.
2.1 Speech as a Mode of Interaction
Speech is a powerful method of communication as a single or supplemental mode. Most people learn to speak at an early age and practice this social skill throughout their lives. It is such a simple and natural process it is all but taken for granted.
What makes speech such a powerful tool in human communication?
It is fast. Thought may be quickly converted to speech, the speech itself is rapid, and the speech is easily converted back to thought at the receiving end.
It is dense. Subtle inflections in voice may effectively transmit more information than the text of what is spoken. The way something is said conveys meaning of its own.
It is automatic. One mentally composes a message in the language he speaks, so the act of speaking that message is automatic. It may operate in parallel to other thought processes. A speaker may talk and drive a car at the same time, for instance.
It requires no special equipment.
There are practical considerations for the use of speech as computer input. Since even untrained people know how to speak, special operator instruction is unnecessary. A well designed computer-speech environment places few physical constraints on the user. In a sample application, an inspector may talk to his computer from many locations in an industrial site. Perhaps a programmer likes to pace the floor while devising code. Equip the computer with a telephone and operate it from anywhere in the world.
Since the operator's hands are not required for speech, they are free for other tasks. The inspector examines a machine part; the programmer scribbles with a pencil. Of course, the "hands free" nature of speech makes it a good candidate for inclusion in the multi-modal communication environment. Most other input devices require some mechanical manipulation by the user.
People already speak to machines every day. The act is so natural and routine that most would not give it a second thought, but speaking on a telephone is just that. Why don't people feel uncomfortable talking to a machine like the telephone? Because it responds intelligently, as if it were a person. Speech can be a useful method for communicating with machines, provided that the machines hold up their end of the conversation.
Imagine calling the switchboard operator, only to be connected to a (hypothetical) talking computer. Could you tell the difference? Probably. Would you care? Probably not. What if the computer kept saying, "I don't understand," or didn't respond at all? What if the human operator did the same? Perhaps you would start talking to your telephone the same way you talk to your stalled automobile and broken toaster, expecting, and getting, the same intelligent reply from all three.
3.0 Automated Speech Recognition
Although people find voice a fast and effortless mode of communication, machine recognition of conversational speech remains an elusive goal. No device yet exists which can be simply 'plugged into' a computer as if it were a terminal to handle spoken I/O. Commercially available speech recognizers with limited vocabularies (usually under 100 words) perform reasonably well identifying isolated utterances spoken by a co-operative user. These devices do find applications, but due to their limitations lack the general utility of other input devices.
Most speech recognizers must be 'trained' by the individual whose voice is to be recognized. As the user says each word in the vocabulary, a 'template' is created and saved. The template is used as a representative sample of the audio signal produced by the speaker uttering that word.
When recognizing, the device performs spectral analysis of its audio input. The amplitude spectrum is determined by one of several means, including direct filtering, digital filtering by Fast Fourier Transform, and linear predictive analysis. Various pattern matching techniques are applied to the input signal and each of the stored templates. One of the best of these, called dynamic programming, operates on a signal, or pattern, which may be characterized as a function of several variables. The effects of non-linear variations along a common axis of two such functions can be minimized through this technique. Dynamic programming, as applied to speech recognition, compensates for temporal variations between utterances of a single word by squeezing and stretching the template along its time axis. Recognition improves dramatically [12,14,16]. After the input signal has been correlated with the entire vocabulary of templates, the best match is reported. Usually this is accompanied by some kind of 'confidence' metric.
There are complications in addition to those imposed by normal variations in the audio signal produced by a single speaker uttering a single word. Variations between speakers can be quite large. Background noise and room acoustics further serve to distort recognizable audio features. Position in a sentence or phrase affects the listener's perception of the word. Coarticulation between words in fluent (continuous) speech also causes severe recognition problems.
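The dynamic programming match described above is commonly known as dynamic time warping. The following is a minimal sketch of that general technique, not of any particular commercial recognizer. As a simplifying assumption, each utterance is reduced to a sequence of single numbers; a real device would compare spectral frames. The function names and sample templates are hypothetical.

```python
def dtw_distance(template, utterance):
    """Dynamic time warping distance between two sequences of numbers."""
    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of the first i template samples
    # with the first j utterance samples.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - utterance[j - 1])
            # Stretch the template, squeeze it, or advance both together.
            cost[i][j] = d + min(cost[i][j - 1], cost[i - 1][j], cost[i - 1][j - 1])
    return cost[n][m]


def best_match(templates, utterance):
    """Correlate the input with every template and report the closest one."""
    scores = {word: dtw_distance(t, utterance) for word, t in templates.items()}
    word = min(scores, key=scores.get)
    return word, scores[word]


templates = {"stop": [1, 3, 4, 3, 1], "go": [5, 5, 2, 1]}
slow_stop = [1, 1, 3, 3, 4, 4, 3, 1]  # same shape as "stop", spoken more slowly
match, distance = best_match(templates, slow_stop)
```

The returned distance plays the role of a crude confidence metric: the smaller it is, the closer the warped template lies to the input.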
Recognition reliability decreases rapidly as user vocabulary size increases. Allowable variation in a word's audio spectrum may be quite large relative to the differences required to distinguish it from another. Inter-word variations in small vocabularies are more likely sufficient to yield good identification than those of a large vocabulary, but this feature cannot be pushed very far. Identifying a spoken word by its amplitude spectrum is loosely analogous to labelling a typed word by its hashing function value; as the number of table entries (words) increases relative to the hash size, so does the ambiguity of the labels. The more words there are in the recognition space, the greater the likelihood that two (or more) will be similar.
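The hashing analogy can be made concrete with a toy example. The hashing function below (the sum of character codes modulo the table size) is an arbitrary choice made for illustration only; it is not drawn from the thesis or from any recognizer.

```python
def label(word, table_size):
    """Toy hashing function: sum of character codes modulo the table size."""
    return sum(ord(c) for c in word) % table_size


def ambiguous_labels(words, table_size):
    """Count the labels that are shared by two or more words."""
    buckets = {}
    for w in words:
        buckets.setdefault(label(w, table_size), []).append(w)
    return sum(1 for members in buckets.values() if len(members) > 1)


words = ["a", "b", "c", "d", "e"]
roomy = ambiguous_labels(words, 8)    # five words, eight possible labels
crowded = ambiguous_labels(words, 4)  # five words, only four possible labels
```

Once the words outnumber the available labels, at least one label must be shared, just as a crowded recognition space must contain words that resemble one another.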
One might think that these machines are attempting to recognize speech at too high a level. After all, there are somewhere on the order of 300,000 words in the English language; they may all be successfully communicated to a computer via terminal keyboard. By reducing English text to a manageable number of easily identified primitives (the alphabet), the problem of word transmission is solved. Why not apply the same reasoning to the spoken-text problem? There are about 20,000 syllables in English, but these are constructed from roughly forty phonemes [8]. Unfortunately, current analysis techniques do not reliably isolate and identify phonemes in the input audio signal. Since a word is composed of many of these primitives, errors in the composite multiply, resulting in poor confidence in word recognition. Improvements in linguistic analysis may make this a method of choice in the future.
Recognition error can also be reduced through non-acoustical methods of analysis. By providing the recognizer with knowledge about the system and some intelligence with which to apply that knowledge, some of the burden of recognition can be shifted from totally analytical processes.
3.1 The Intelligent Listener
One method of applying knowledge and intelligence to a speech system involves the notion of recognizing a word in context. Usually this context is a semantic one, and requires a formal specification of the language's grammar along with production rules relating the vocabulary. This method may be extended to recognize phonemes in a specified linguistic grammar, thus aiding in the identification of words.
Specifying formal grammars for natural languages such as English is a formidable task. However, small subset grammars can be defined. Knowledge of the grammar is used to restrict the number of possible sentences which can be constructed from the language's vocabulary. With the aid of semantic analysis, words are recognized so as to produce a correct sentence with the least total likelihood of error, even though the word chosen in a single instance may not be the most likely candidate.
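The selection rule just described can be sketched as an exhaustive search over candidate sentences. This is an illustrative assumption about how such a rule might be coded, not a description of a particular system: the per-word scores, the tiny set of allowed sentences standing in for a grammar, and the function name are all hypothetical.

```python
from itertools import product


def best_sentence(slot_scores, is_grammatical):
    """Pick the grammatical word sequence with the highest joint score.

    slot_scores holds one dict per spoken word, mapping each candidate
    word to a probability-like acoustic score.
    """
    best, best_score = None, 0.0
    for candidate in product(*(slot.keys() for slot in slot_scores)):
        if not is_grammatical(candidate):
            continue
        score = 1.0
        for slot, word in zip(slot_scores, candidate):
            score *= slot[word]
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# A small subset "grammar", given here simply as the set of legal sentences.
ALLOWED = {("show", "sports"), ("show", "finance"), ("skip", "sports")}

slots = [
    {"show": 0.45, "slow": 0.55},    # "slow" is the better acoustic match
    {"sports": 0.7, "shorts": 0.3},
]
sentence, score = best_sentence(slots, lambda s: s in ALLOWED)
```

Here "show" is chosen for the first word even though "slow" scored higher in isolation, because no legal sentence begins with "slow". Enumerating every combination grows quickly with vocabulary size, which reflects the complexity problem raised in the next paragraph.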
There
are some problems with semantic analysis.
Complexity
increases quickly with the size of the vocabulary and the number
of
productions in the grammar.
And, of
course, there may be a
large number of mis-recognitions which are semantically correct.
Unfortunately for the semantic analyzer, people often deviate from the formally correct natural grammars when conversing. This may happen between familiar partners, or when commands and other information are being exchanged tersely. Informal speech is apparently less bound to the rules of production than writing is. If a specified grammar for the automated speech recognizer does not include the kinds of idiosyncratic deviances which may crop up during its operation, its utility to the user is impacted.
Some applications do not have the luxury of a reasonably restricting syntax. They may be accessed quite naturally by single-word commands; an action to be taken on a single-word spoken datum may be implicit, or may depend on the state of the system. Another sort of context may be employed to improve speech recognition in these situations. Knowledge of the task being performed can be applied to choose likely subsets of the vocabulary to be analyzed in given instances. A word from the smaller vocabulary can then be recognized with a much greater reliability.
Another, perhaps less obvious, context can be used to the recognizer's advantage. That is the context of the multi-modal user interface. Some applications are highly interactive by nature, and may involve the user in continuous or repetitive operations. If, through the course of these operations, the computer can detect patterns in the user's activity, then knowledge of the user's input on all channels may be employed to choose a likely spoken vocabulary subset for recognition. In a sense, by becoming familiar with its partner in conversation, the computer becomes a better listener.
It should be noted that not all speech recognition applications provide an environment conducive to computer-user familiarity. For instance, a system for automatic transcription from dictation might be no more interactive than a tape recorder, the interaction itself yielding no clue to aid the speech recognizer. On the other hand, its recognition may be improved by applying knowledge of the subject matter and semantic analysis. These techniques are not mutually exclusive. By taking the nature of the application, and, to some extent, the nature of the user, into account, all of the techniques described in this section can be combined to produce useful strategies for speech recognition.
The Virtual Vocabulary Speech Recognizer is the strategy for speech recognition explored in the "NewsPeek" project. NewsPeek requires a speech recognizer able to cope with a large vocabulary of single-word utterances. Its speech recognition hardware is capable of handling a vocabulary of just sixty words. However, over 1600 words can be stored in the recognizer's virtual memory. This recognizer employs knowledge of both the NewsPeek application and the user's interactions with the system to choose sixty-word subsets of the virtual vocabulary for recognition. This system and the NewsPeek project are described in the sections that follow.
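As a rough illustration of how knowledge of the application and of the user's activity might select a sixty-word subset, consider the sketch below. It is not the predictor developed in this thesis; the scoring weights, the function name, and the idea of counting words on the visible page are assumptions made purely for the example.

```python
from collections import Counter


def predict_subset(virtual_vocabulary, visible_words, recent_selections, size=60):
    """Rank vocabulary words by simple context evidence and keep the top few."""
    vocab = set(virtual_vocabulary)
    scores = Counter()
    for w in visible_words:        # evidence from the application: words on display
        if w in vocab:
            scores[w] += 1
    for w in recent_selections:    # evidence from the user: recent touches or commands
        if w in vocab:
            scores[w] += 2
    # sorted() is stable, so words with equal scores keep their stored order.
    ranked = sorted(virtual_vocabulary, key=lambda w: -scores[w])
    return ranked[:size]


vocabulary = ["nextpage", "senate", "budget", "baseball", "weather"]
subset = predict_subset(
    vocabulary,
    visible_words=["budget", "senate", "budget"],
    recent_selections=["baseball"],
    size=3,
)
```

In this toy case the subset is drawn entirely from words the context has made likely; words with no supporting evidence fall to the end of the ranking and are left in virtual memory.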
4.0 Contextual Environment
The specific environment which provides the 'context' upon which the speech recognition package will rely is "NewsPeek," a personalized, dynamic information analysis system. Before discussing the details of spoken input, it will be useful to describe the NewsPeek system which will be referenced as an example throughout this paper.
4.1 What Is NewsPeek?
NewsPeek is one component of a larger research program. The purpose of the overall effort is to develop interactive systems for data analysis. Specifically, NewsPeek is a personal, interactive, electronic newspaper. Mead Data Central's "Nexis" data retrieval system serves as a model global database and the source of global search mechanisms used by NewsPeek. Subscribers to the Nexis service are normally provided with a local access station consisting of a computer terminal (video text display), hard copy unit, keyboard including special Nexis function keys, and telephone data-transmission interface. The Nexis station performs two basic functions, these being (1) access and control of centrally located database search tools, and (2) local display and printing of the news items found by those tools.
NewsPeek's purpose is to modify the Nexis system, the large computerized database and manager described above, and transform it into a medium for "electronic publishing". In other words, NewsPeek alters the role of the Nexis system as archive and repository for news stories already disseminated by conventional print media, and has it instead act as the source of current news intended for first-time distribution electronically. This alteration is controlled entirely at the subscriber's end of the system, leaving the central Nexis system as is. NewsPeek replaces Mead Data Central's Nexis station with a local personal computer. The computer is equipped with a touch-sensitive color video graphics display, optical videodisc player, and voice recognition system. This computer duplicates the original station's functions of access and retrieval, but replaces the interface to these functions.
Unlike the normal configuration of the Nexis system, NewsPeek resembles a conventional publication. Where the Nexis subscriber retrieves remote data on demand, the NewsPeek subscriber receives his own copy of a locally available, electronically delivered newspaper.
The Nexis search routines, under the direction of NewsPeek, provide the editorial function associated with the publication of the electronic newspaper. Stories for inclusion in the newspaper are located and transmitted to the recipient's computer. Since the resulting publication is intended for a single reader, its contents will reflect his personal tastes and interests. An individualized, user-directed news search is effected, replacing the general, mass-targeted one currently provided by the media; no two subscribers get the same "Time" magazine.
Upon receipt of this publication, the subscriber's personal computer acts as an aid to the examination of the newspaper's contents. Just as one's daily paper is not necessarily read in order or in one sitting, the electronic newspaper is available for perusal. The local computer provides the environment for interactive analysis of the news.
One key feature of this system is the manner in which perusal occurs. In use, the personal computer becomes far more than a potentially better or easier to use data base search translator. Rather than simply providing verbal and touch-sensitive replacements for the control keys provided by the Nexis system, it allows for the search to assume a new structure and importance.
Nexis library searches without the local processor, due to the nature of the tools provided, tend to take a short (linear) path through the global database. The operation basically consists of iteratively reducing a group of candidate stories, starting with the entire library, until the group is small enough to allow individual examination of its contents (i.e., until the user finds a set of target stories). While this is fine, perhaps even optimal, for the user seeking a single article on a particular subject, it may be restrictive to the reader whose goal is less well defined; the system was not designed for browsing. In the NewsPeek system, however, browsing is exploited as a method of discovery. If, during the reading of the newspaper, the reader's interest is diverted, he has the option of pursuing that diversion without the expense of initiating a new global search from the top.
To sum up, NewsPeek is an attempt at producing a personal, electronic publication. In form and content it is modelled after a hypothetical newspaper. The reader is aided by his local, personal computer, both in producing the newspaper and in reading it. The activity of reading the publication itself encourages digression and variation, as the reader is free either to browse, or to make associations and follow through on them.
4.2 NewsPeek Operation
The NewsPeek system can be separated into two sets of functions, those which aid the reader in perusing the stories, and those which prepare for a reading session. The following sub-sections provide details.
4.2.1 Off-line (Editorial) Functions
In order to spare the subscriber to the electronic newspaper from annoying delays during an otherwise productive moment with the news of the day, much of the processing associated with the creation of the publication occurs prior to the time of perusal. The analogous functions of editing, composing, and publishing a morning newspaper are performed the night before the paper hits the stands, and so, presumably, would the electronic newspaper's pre-processing. This is a period of otherwise low demand for both the computer and the telephone, which is used to access the Nexis library.
A process is initiated to connect the subscriber's computer with the Nexis system. This program is a top level filter, and directs the Nexis access tools, which may be considered NewsPeek's remote editor. Just as the newspaper's editor acts as a filter between the volume of news events in the world and what eventually ends up printed in the morning edition, the automated editor decides what information is worth copying from the enormous Nexis database into the user's personal newspaper. Because the program is privileged with knowledge of the reader's requirements and preferences, the resulting publication can be tailored to please the entire circulation (one).
After the desired news stories have been retrieved, their contents are correlated with the user's archive, his local picture library (on optical videodisc), and each other. The point of such extensive cross-referencing is to provide the reader with a potentially huge number of paths through the news, relating the information in the manner of his choosing.
4.2.2 On-line Functions (Reader Aids)
The reader wishes to examine his newspaper, which consists of individually selected stories now available in (fast) local memory. Text is displayed on his color television. In the current version of NewsPeek the front page is an expanded table of contents, combining some of the features of headlines (summarization), and some of normal tables of contents (location). As the system evolves, this table of contents may be preceded by an "intelligently" chosen set of "cover stories."
The user may want to take advantage of the local text processing which has occurred, and may do so in a deceptively simple fashion: he indicates a word. NewsPeek responds by overlaying a set of cross-references. Each line in this set contains the selected word surrounded by the words adjacent to it in another story, library entry, or the picture index, thus indicating the context of its use. At this point the reader may elect to return to studying the underlying page, or, his curiosity piqued, decide to press on. By choosing a line in the correlation display, he is transported to the story from which the line was taken, and a new page or picture is displayed. There is virtually no limit to the number of jumps that can be made in this manner. Of course, the user is always free to back up along the path he has created, or to abandon it completely. Along the way, the user may initiate new Nexis searches, make notes, or file stories in his personal library.
The important features here are that the associations made among the articles being read are the user's own, not those imposed on him by the publisher, and that these associations are used both as the links for choosing the order in which the stories are presented, and as keys for requesting new information.
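The cross-reference overlay described in this section resembles what is generally called a keyword-in-context listing. The sketch below shows one way such lines could be produced. It is an assumption for illustration only: the function name, the document identifiers, and the window of neighbouring words are hypothetical and say nothing about how NewsPeek itself stores or correlates its text.

```python
def cross_references(word, documents, window=3):
    """Return (source, context) pairs showing `word` among its neighbours."""
    target = word.lower()
    lines = []
    for source, text in documents.items():
        tokens = text.split()
        for i, token in enumerate(tokens):
            # Ignore surrounding punctuation and case when comparing.
            if token.strip('.,;:"()').lower() == target:
                start = max(0, i - window)
                context = " ".join(tokens[start:i + window + 1])
                lines.append((source, context))
    return lines


documents = {
    "story-12": "The senate approved the budget late on Tuesday night.",
    "library-3": "Notes on last year's budget debate.",
}
refs = cross_references("budget", documents, window=2)
```

Each returned pair names its source, so choosing a line is enough to identify the story or library entry to display next.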
4.3 User Interface
The interface must be supportive and intuitively clear to the user. In addition, it should be transparent, as it will otherwise color the process of browsing and analyzing the publication. For example, if stories from the newspaper's sports section are displayed more quickly than others, the perusal path is likely to be biased.
However, the NewsPeek designers' intention has never been to simply create a so-called "user-friendly", or enhanced computer front end to the existing Nexis system. Basic to the design philosophy is the desire to make the news search seem as much like reading an ordinary newspaper as possible, or, at the very least, like obtaining information from a well-informed, co-operative expert. If this cannot be achieved, the system will be used only by those already versed in the use of computers, or as the last resort of those who are not.
To this end, a bimodal input method is employed. Commands may be issued and words selected either through touch (pointing and gesturing) or verbally. Input is acknowledged on the television display primarily by the creative use of color.
The two input modes are highly redundant. For instance, the reader may turn pages by stroking the TV screen as if he were flipping a page in a book; alternatively, he may simply say, "Next page, please" (even without the "please," if desired). A word is selected by saying it or by pointing to it on the page.
Just as many commands may be made with equal ease via voice and gesture, some requests will favor one channel over another. Touching a picture of a 'soft button' on the display screen may say more than fifty words could. An easily voiced command such as, "Show me this," may not translate well into the world of physical gestures. Similarly, the advantage of speech in choosing a word not on the screen is obvious. These points dictated some characteristics of the speech recognition system, and are covered more fully later.
5.0 The Speech Recognition Unit
The device used for speech recognition in the NewsPeek system is
an NEC model PC-8001A personal computer. This device comes
configured with CPU, dual mini-floppy disc drives, I/O interface
unit, and voice recognition hardware. To this has been added a
Shure SM-10 noise-cancelling microphone connected through a
pre-amplifier/mixer.
The voice recognizer is capable of identifying sixty words
uttered by a single speaker. This particular model is not able
to recognize connected speech; that is, if two words from the
recognizer's vocabulary are spoken one after another in a single
phrase, there is no guarantee that either will be recognized.
Note that what is meant here by "word" is actually any connected
utterance of less than 1.5 seconds duration. For example, the
NewsPeek command "NextPage" satisfies this criterion, and is
considered a single word by the system.

The microphone and audio preamp-mixer are used to help stabilize
the acoustic environment and so improve the reliability of the
recognition system. The microphone is of good audio quality and
is mounted on a headset worn by the user. This microphone
arrangement offers several advantages for speech recognition.
Since it is worn by the speaker, he is free to move around
without having to worry about his being heard by the computer.
The microphone is always a constant distance from the speaker's
mouth, thus eliminating one variable in room acoustics which may
otherwise cause problems in word recognition. Because the
microphone is close to the speaker's mouth and is aimed in that
direction, effects of outside noise are minimized. The audio
preamp-mixer is used to maximize the signal-to-noise ratio on
the microphone's line, and to assure that the speech
recognizer's input is always at the same audio level. This also
makes the recognizer's job easier, and improves its reliability.

The I/O interface unit enables the CPU to communicate with the
recognizer hardware and the disc drives. The device also
supports an RS-232 serial data interface through which the CPU
can communicate with the NewsPeek host computer.
The NEC personal computer runs a program to extend the
capabilities of the speech recognizer. The mini-floppy discs
have been formatted to allow the storage of 1677 digital voice
templates on each of two mounted disc drives. The CPU accepts
commands from its own keyboard, or from the NewsPeek computer
via serial interface. These commands are:

LOADBLOCK. The NEC computer transfers the specified number of
voice templates from disc storage to the speech recognizer's
active vocabulary. The block of templates may start anywhere on
the disc, and can be copied to any starting slot in the
recognizer's vocabulary, but the templates must be contiguous in
both instances. For example, ten templates starting with number
fifty on the disc (#50-#59) may be copied to the active
vocabulary starting at slot fifteen (#15-#24).

SAVEBLOCK. This is similar to the LOADBLOCK command, but
transfers a block of templates from the active vocabulary to
disc storage.

TRAIN_BLOCK. This command is used to create digital voice
templates. The NEC computer creates a block of templates in the
speech recognizer's active vocabulary for the specified number
of user utterances which follow. This block may start anywhere
in the active vocabulary, but again, it must be contiguous.

START_LISTEN. Upon receipt of this command the speech recognizer
is activated. From this point on, whenever the user says a word,
the NEC computer sends the host computer an interrupt followed
by the slot number of the recognized word in the active
vocabulary and the confidence metric of that recognition. If the
speaker's utterance is not recognized, an interrupt followed by
a null slot number and confidence is sent.

STOP_LISTEN. This command turns speech recognition off.
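
The block-transfer commands can be illustrated with a small model. What follows is only a sketch under stated assumptions: the class RecognizerUnit, its method names, and the use of Python lists for disc and active storage are hypothetical and are not the program that runs on the NEC computer. The sketch shows the one constraint the text places on LOADBLOCK and SAVEBLOCK, namely that a block is contiguous both on disc and in the active vocabulary.

```python
# Hypothetical model of template storage; names and layout are assumptions.
ACTIVE_SLOTS = 60        # active vocabulary size given in the text
DISC_TEMPLATES = 1677    # templates per disc, as given in the text


class RecognizerUnit:
    def __init__(self):
        self.disc = [None] * DISC_TEMPLATES
        self.active = [None] * ACTIVE_SLOTS

    @staticmethod
    def _check(start, count, size):
        # A block is a contiguous run that must fit inside the store.
        if count < 0 or start < 0 or start + count > size:
            raise ValueError("block must be contiguous and in range")

    def load_block(self, disc_start, slot_start, count):
        """LOADBLOCK: copy a contiguous run from disc to active slots."""
        self._check(disc_start, count, DISC_TEMPLATES)
        self._check(slot_start, count, ACTIVE_SLOTS)
        self.active[slot_start:slot_start + count] = \
            self.disc[disc_start:disc_start + count]

    def save_block(self, slot_start, disc_start, count):
        """SAVEBLOCK: copy a contiguous run of active slots back to disc."""
        self._check(slot_start, count, ACTIVE_SLOTS)
        self._check(disc_start, count, DISC_TEMPLATES)
        self.disc[disc_start:disc_start + count] = \
            self.active[slot_start:slot_start + count]


# The example from the text: disc templates #50-#59 go to slots #15-#24.
unit = RecognizerUnit()
unit.disc[50:60] = [f"template-{n}" for n in range(50, 60)]
unit.load_block(disc_start=50, slot_start=15, count=10)
```

After the call, slots 15 through 24 hold the ten templates and the neighbouring slots are untouched.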
Through the use of this simple command set, the utility of the
NEC speech recognition unit is enhanced enormously. Without the
processing power provided by the NEC personal computer, the
speech recognizer is capable of identifying only sixty words.
With it, the recognizer draws from a vocabulary of over fifty
times that size.
Since the host computer can interrupt the recognition process
and issue commands, new words can be trained during application
run-time. This feature can be used to eliminate the modality
present in some speech recognition systems; the recognizer need
not be put in 'train-mode' to learn a large number of words
prior to being put in 'recognize-mode' for the duration of the
application. For example, the large NewsPeek vocabulary is
"grown" dynamically. Except for the initial training of the
command vocabulary (about twenty words), all recognizer training
occurs one word at a time during NewsPeek's operation. Even for
a vocabulary that is determined before run-time, the prospect of
a 600-word recognizer training session should scare away any
sane user.
In spite of the additional capabilities provided the recognizer
by the NEC processor, it still cannot distinguish an individual
word from the many in its virtual memory until that word is
loaded into the small active vocabulary. The responsibility for
maintaining the state of the recognizer's active vocabulary lies
with the host computer. In the NewsPeek system, a part of the
user interface handler manages the virtual vocabulary, and is
described in the next section.
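
Because the recognizer reports only a slot number and a confidence value, the host must remember which word it placed in each slot. The following is a minimal sketch of such bookkeeping; the class ActiveVocabularyMirror and its methods are hypothetical names chosen for illustration, and the use of None for the null slot number is an assumption.

```python
class ActiveVocabularyMirror:
    """Host-side record of which word occupies each active slot.

    Hypothetical helper: the text says only that the host is
    responsible for maintaining the state of the active vocabulary.
    """

    def __init__(self, slots=60):
        self.slot_to_word = [None] * slots

    def note_load(self, slot_start, words):
        # Call after each block load so the mirror matches the unit.
        for offset, word in enumerate(words):
            self.slot_to_word[slot_start + offset] = word

    def decode(self, slot, confidence):
        """Turn a recognition report into (word, confidence).

        A slot of None stands in for the null slot number sent when
        the utterance was not recognized.
        """
        if slot is None:
            return None
        return self.slot_to_word[slot], confidence


mirror = ActiveVocabularyMirror()
mirror.note_load(15, ["Boston", "Sox"])
report = mirror.decode(16, 0.8)
```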
6.0 The Vocabulary Predictor
The vocabulary predictor is a crucial component in the speech
recognition system. When the predictor fails, input goes
unrecognized, regardless of the performance of the recognition
hardware. Of course, if a system could determine a speaker's
next utterance with 100% reliability, it would require no other
recognition equipment, and would, in fact, render the speaker
totally superfluous. NewsPeek's vocabulary predictor has a
somewhat easier job, as it is allowed sixty guesses at the
user's next input. It can, therefore, be imperfect.

It should be noted that a program for predicting the future does
not necessarily require supernatural abilities. NewsPeek, by the
nature of its application and user interface, co-operates quite
well with the vocabulary predictor. It is an appropriate system
for utilizing this sort of speech recognizer. This is elaborated
upon later, but a few points here will help clarify the
description.
First, NewsPeek's local library of stories is organized by topic,
in sections similar to the grouping of stories in an ordinary
news magazine. Although the user need not be aware of this
internal structure, it does exist, and it provides a convenient
method for associating groups of words with groups of news
stories.
Second, most NewsPeek commands are words from the news stories
being examined by the user,
and serve to make the user aware of
other stories available for perusal.
As a result of this
property of the overall design, command words are frequently
present on the output display. This group of words is available
to the vocabulary predictor.
Third, also due to the nature of the perusal method, there is a
good probability that a command will be used more than once in a
short input sequence.
User input history is accessible by the
predictor.
The above three points are mentioned to demonstrate how an
application and its interface can suggest natural lines for
dividing a large vocabulary into smaller, more manageable blocks.
When this is the case, as it is in NewsPeek, the vocabulary
predictor's task may be reduced to that of finding a likely
subset vocabulary from the set of words contained in a group of
small blocks.
The decision rules it employs for this operation
may also be determined largely by the application it serves.
In general, the vocabulary predictor works in the following
manner.
Small blocks of words are derived from the total
vocabulary.
The number of words in this group should be much
smaller than the main vocabulary, but should exceed the capacity
of the speech recognition unit so that it will be fully utilized.
The blocks are assigned relative priorities to aid in the
assembly of the final subset vocabulary. Words within the blocks
will have already been assigned weighting values. A decision
rule, which may itself be dependent on the state of the system,
uses the priorities and weights to operate on the blocks and
picks a subset vocabulary.

In a sample NewsPeek situation, a vocabulary might be
constructed like this: First, several blocks of words are
isolated. These are
(a) the basic NewsPeek command set,
(b) the last five words input by the user over either channel
(touch or voice),
(c) trained words appearing on the output story page,
(d) trained words appearing on the correlation page, and
(e) words associated with stories about the Boston Red Sox
baseball team.
Second, priorities are assigned to the blocks. Block (a) gets
the highest priority; it is not only likely that the user will
say a basic command word, it is essential that these commands
always be available to him. Block (b) gets the next highest
priority. Blocks (c) and (d) get equal priorities. Block (e)
gets the lowest priority.

Third, the active subset vocabulary is calculated. Block (a) is
included in full, as is block (b). Thirty-five slots are left in
the subset, and there are forty words shared by blocks (c) and
(d). Two of these are members of block (b). Of the remaining
thirty-eight words, the thirty-five with the highest weighting
factors (most recently used) are included. Because blocks of
higher priority have filled the active space, block (e) is not
included.

Underlying this implementation of the vocabulary predictor are
two basic assumptions. The first is that there exists some
method for sorting the entire vocabulary that is of some general
value, given a specific application program. This is the
weighting factor assignment, and is a kind of global vocabulary
operator.

The second underlying assumption is that the application can
itself suggest logical vocabulary blocks from which to construct
the active subset vocabulary. This does more than just allow
more complex assignment of weighting factors. Since the speech
recognition unit operates by transferring blocks of words, its
function can be optimized by identifying these blocks and
organizing virtual memory to exploit them.

In the NewsPeek system, the weighting factor assigned to each
word in the vocabulary is derived from the time that word was
last accessed by the user. The least recently accessed word in
the vocabulary has the lowest weighting factor; the most
recently accessed has the highest. Other indicators may be used
to augment or replace this simple weighting strategy. Several
methods for calculating vocabulary weighting factors are still
being investigated for NewsPeek's predictor, and are discussed
in more detail later.
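
The sample construction in this section can be sketched in a few lines. This is an illustration under stated assumptions, not NewsPeek's decision rule: the function build_subset, the numeric priorities (a lower number means a higher priority), and the made-up word lists are all hypothetical. Blocks with equal priority are pooled, and within a pool the words with the highest weighting factors are taken first until the sixty slots are full.

```python
def build_subset(blocks, weights, capacity=60):
    """Pick an active subset from prioritized blocks of words.

    blocks:  list of (priority, words); lower number = higher priority.
    weights: word -> weighting factor (higher = more recently used).
    """
    chosen, seen = [], set()
    for priority in sorted({p for p, _ in blocks}):
        pool = []
        for p, words in blocks:
            if p != priority:
                continue
            for w in words:
                # Skip words already placed by a higher-priority block.
                if w not in seen and w not in pool:
                    pool.append(w)
        pool.sort(key=lambda w: weights.get(w, 0), reverse=True)
        take = pool[:capacity - len(chosen)]
        chosen.extend(take)
        seen.update(take)
    return chosen


# Made-up data shaped like the example: 20 commands, 5 recent words
# (2 of them also on screen), 40 on-screen words, and a topic block.
page = [f"page{i}" for i in range(40)]
blocks = [
    (0, [f"cmd{i}" for i in range(20)]),               # block (a)
    (1, ["page0", "page1", "old0", "old1", "old2"]),   # block (b)
    (2, page[:25]),                                    # block (c)
    (2, page[25:]),                                    # block (d)
    (3, [f"sox{i}" for i in range(10)]),               # block (e)
]
weights = {w: i for i, w in enumerate(page)}
subset = build_subset(blocks, weights)
```

With these inputs the subset is exactly full, thirty-five of the thirty-eight remaining on-screen words are included, and nothing from block (e) is loaded.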
7.0 Applying the Virtual Vocabulary Speech Recognizer
The previous sections have extolled the virtues of speech in a
multi-modal communication environment, discussed practical
problems in the implementation of an automated speech
recognition system, and described a sample application for such
a system. In this section, the application of NewsPeek's speech
recognition system will be detailed. Consideration will be given
to issues involving the NewsPeek application directly and how
the system in general is affected.

Briefly, then, a list of features making NewsPeek an interesting
application for speech recognition:

* personalized application
* most user input is single word
* bi-modal interface (touch-screen and speech)
* recognition vocabulary is run-time dynamic
* two phase (on- and off-line) processing
* objective is manipulation of English text

A point by point discussion follows.
Since NewsPeek is a personalized electronic publication, there
is only one user of the system. Thus, the problem of
inter-speaker vocabulary variances is neatly skirted. (Not to
make light of this, an important problem, but one left for
another day.) Each NewsPeek subscriber has his own mini-floppy
disc, on which is encoded his personally created vocabulary.
The second point in the above list may be considered both a
hindrance and an asset as far as the speech system is concerned.
Due to NewsPeek's lack of command syntax, a semantic processor
cannot be employed to aid the speech recognizer. Imposing a
formal structure on the input grammar does little to rectify the
situation. Most commands would be of the form, "List the stories
about ___," or "Do we have any pictures of ___?" The recognizer
is still faced with the task of distinguishing "___" from the
same large group of words in the vocabulary. Furthermore, the
command to be executed is usually implicit, determined by the
state of the system and history of user activity. Requiring the
user to restate the obvious undermines the utility inherent in
this mode of communication, and is a step backward.

On the plus side, semantic analysis is not necessarily cheap,
and freeing the user's personal computer from this job may be
looked upon as a minor advantage. The real advantage, however,
is this: single utterances do not suffer from the problem of
co-articulation variances so prevalent in the amalgam of fluent
speech. As previously mentioned, small vocabularies of isolated
words can be identified reasonably well by currently available
hardware. The problem presented here is that of recognizing
single words from a large spoken vocabulary. By predicting a
likely subset of a large vocabulary to be recognized, the
virtual vocabulary speech recognizer attempts to reduce the
problem to one which has already been "solved."
The context of the bi-modal interface is used in the vocabulary
subsetting process. The speech recognition system "knows" what
input is being processed over the other input line, and what
output is being displayed on the color monitor. Of course, the
system is aware of input over the speech channel, as well. When
the user indicates a word via either input mode, that word is
found in a local data structure. If the word is one which has
been trained by the user for speech recognition, it is loaded
into the active subset vocabulary. The assumption here is that
since the user is using this word to key his path through the
news, there is a good possibility he will be using it again in
the immediate future.
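
The rule just described, in which any word indicated over either channel is looked up and, if trained, made active, can be sketched as follows. The class UsageTracker and its fields are hypothetical; a simple counter stands in for the time of last access, which the text names as the source of NewsPeek's weighting factor.

```python
import itertools


class UsageTracker:
    """Records the last access of trained words (hypothetical sketch)."""

    def __init__(self, trained_words):
        self.trained = set(trained_words)
        self._clock = itertools.count(1)   # stand-in for wall-clock time
        self.last_access = {}              # word -> tick of last access

    def indicate(self, word):
        """Call when the user selects a word by touch or by voice.

        Returns True when the word has a trained template and should
        therefore be loaded into the active subset vocabulary.
        """
        if word not in self.trained:
            return False
        self.last_access[word] = next(self._clock)
        return True


tracker = UsageTracker({"Reagan", "budget"})
load_reagan = tracker.indicate("Reagan")    # trained: load it
load_tariff = tracker.indicate("tariff")    # untrained: nothing to load
tracker.indicate("budget")
```

The most recently indicated word carries the largest tick, so sorting by last_access orders words from most to least recently used.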
This technique produced a bit of unexpected fallout; namely, the
user can guarantee the presence of a word in the active speech
vocabulary by first touching it on the display screen. Although
the user was never intended to guide the prediction algorithm
through conscious effort, this has become an option to a small
extent. Because it can increase the speaker's confidence in the
speech channel, and presents little or no nuisance to him, it is
of some value.
Knowledge of the output display is also available to the speech
system.
Any trained word on the current output page may be
included in the active subset vocabulary.
In normal operation
all of these words are made active, as they are all potential
NewsPeek commands.
This is the usual way to maintain the environment of redundant
input modes.
However,
the active
vocabulary is small, and the subset prediction algorithm may
have some likely off-screen candidates for recognition.
Speech is the best mode of access for these words.
What to do if the
user wishes to select one of these words?
One method involves basing the decision upon user activity in
the two input modes.
If the history of this activity shows the
user consistently using touch as the initial mode for selecting
on-screen words, then the block of on-screen words is assigned a
low priority relative to the off-screen block.
As the on-screen words
are touched they are loaded into the active subset
vocabulary and may then be selected via either input mode.
On
the other hand, if user history shows repeated subsequent
failures in recognizing spoken on-screen words, the priority
assignments are reversed.
This operation is one example of how
the modes in a rich communication environment can supplement one
another.
The speech recognizer's function here is partly
dependent on touch-screen input and output.
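
One way to express this history-based choice is sketched below. The event names, the window of five events, and both thresholds are assumptions made for illustration; the text states only that consistent touch selection lowers the priority of the on-screen block and that repeated recognition failures on spoken on-screen words reverse the assignment.

```python
def rank_blocks(events, window=5, max_misses=2):
    """Order the on-screen and off-screen blocks, highest priority first.

    events: chronological list of strings, using the hypothetical labels
    "touch_on_screen" (user touched an on-screen word) and
    "voice_miss_on_screen" (a spoken on-screen word went unrecognized).
    """
    recent = events[-window:]
    touches = recent.count("touch_on_screen")
    misses = recent.count("voice_miss_on_screen")
    if misses >= max_misses:
        # Spoken on-screen words keep failing: make them active again.
        return ["on_screen", "off_screen"]
    if touches > len(recent) / 2:
        # Touch is the habitual first choice: favor off-screen words.
        return ["off_screen", "on_screen"]
    return ["on_screen", "off_screen"]


habitual_touch = rank_blocks(["touch_on_screen"] * 5)
after_misses = rank_blocks(["touch_on_screen"] * 3
                           + ["voice_miss_on_screen"] * 2)
```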
The NewsPeek vocabulary is run-time dynamic. Among other things,
this means that new words are added and old words deleted as
part of the normal operation of the system. Many applications
that employ speech recognition use a static vocabulary created
in advance of its use during a special training session.
Training a large vocabulary in a single session can be tedious
for the speaker, and in the case of NewsPeek, the vocabulary is
not even precisely known before it is to be used. The system
itself provides the group of words from which the user selects
his vocabulary. Knowledge of the user's changing vocabulary can
be used to improve the recognizer's performance, and the
performance of NewsPeek in general.
Since NewsPeek's normal function includes a pre-processing phase
with the user off-line, a logical place exists for any off-line
processing required by the speech system. The operations for
reformatting the speech recognizer's virtual memory come under
this category. Although it is not practical to format an entire
floppy disc after each of the user's verbal commands, it is
desirable to have the organization of the system's vocabulary in
memory reflect the sorting by blocks and weighting factors used
in the construction of the subset vocabularies. After the user
has gone off-line, the disc can be restored to a configuration
dependent on the state of the vocabulary at the end of the
previous session. This periodic cleanup assures optimal memory
organization for the start of the user's next session.

The last item on the list of NewsPeek features notes that the
system is used for manipulating English language text. These
data are real words that the user can say. The NewsPeek user can
easily associate verbal names with the objects he manipulates on
the output display and throughout the database, just by
correlating a spoken word with its printed counterpart. He does
this automatically for himself, but the speech recognizer must
have the operation performed for it: the user says, "Learn
this," touches the word on the screen, then says it. Since the
Nexis database provides the text, the NewsPeek reader is spared
the use of the terminal keyboard.

As a result of the association operation, the computer acquires
the ability to respond verbally to the user. Peripheral devices
such as the Votrax Type 'N' Talk or the Prose Model 2000 can be
used to convert text strings directly into audible English.
This, in turn, may provide useful user feedback from the speech
recognizer.
For example, the speaker says a word. The recognizer gets a
match, but of marginal confidence. Rather than ignore the match
and ask the speaker to try again, the computer responds with,
"Did you say ___?" The speech recognizer now has only to contend
with a simple "yes" or "no" answer. (This technique is adapted
from Schmandt [15].)

Actually, what happens in NewsPeek is a little subtler and more
interesting than the mere naming of objects to facilitate the
operations that manipulate them, for, in a sense, these objects
and their names are one and the same. They are words. The words
that appear in the NewsPeek stories are the words to be
manipulated by the user. They are also the words he says, the
words that must be recognized by the speech system.
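
The handling of a marginal match described above, in which the computer asks "Did you say ___?" instead of discarding the result, amounts to a three-way decision on the confidence metric. The sketch below assumes a confidence scale from 0 to 1 and two illustrative thresholds; neither the scale nor the values come from the text, and the function name respond is hypothetical.

```python
def respond(word, confidence, accept=0.8, reject=0.4):
    """Decide what to do with a recognition result.

    Returns a pair (action, payload), where action is one of
    "accept", "confirm", or "retry". The thresholds are assumed
    values chosen only to illustrate the three cases.
    """
    if word is None or confidence < reject:
        return ("retry", None)               # no usable match
    if confidence >= accept:
        return ("accept", word)              # act on the word directly
    # Marginal match: speak a yes/no question back to the user.
    return ("confirm", f"Did you say {word}?")


sure = respond("Boston", 0.9)
marginal = respond("Boston", 0.6)
missed = respond(None, None)
```

The "confirm" text would be handed to the speech synthesizer, after which the recognizer need only distinguish "yes" from "no".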
Owing to the presence of this feature, the vocabulary predictor
has a wealth of information about the user's task available to
it. Although the current version of the virtual vocabulary does
not exploit them, many possibilities for the enhancement of the
vocabulary sorting algorithm exist. A partial list follows.

Temporal conditions -- When did the word first appear in the
NewsPeek database? When did it last appear? When did it
first/last appear in the current section? a story actually read
by the user?

Statistical conditions -- Does the word appear frequently
throughout the NewsPeek database? throughout the current story
or section? Does the word occur nowhere else within the current
story/section? outside the current story/section? Does the word
appear periodically (eg., only in Monday's edition)?

Positional conditions -- Does the word appear in a headline? a
story's lead paragraph? concluding paragraph? first or last line
within a paragraph? Does the word occur in a phrase with a Nexis
keyword? a phrase containing a frequently accessed word? a word
appearing frequently throughout the current story or section? a
word that appears nowhere else within the current story or
section?

The positional conditions may even be computed recursively. For
instance, does the word occur in a phrase containing a word
discovered by the previous list of conditions? a phrase
containing one of those words? and so on. It is, however, not
likely that an implementation of this function could justify
itself on the basis of either performance or computational cost.

Words satisfying one, some, or all of the above conditions are
identifiably different from the rest of the NewsPeek text, and
are thus potentially more (or less) likely to be accessed than
the others. As this information is made available to the
vocabulary predictor, improved speech recognition should result.
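
A few of the listed conditions can be evaluated directly from story text. The sketch below is illustrative only: the function names, the crude tokenizer, and the sample story are assumptions, and only a handful of the positional and statistical questions are covered.

```python
import re


def tokens(text):
    # Crude tokenizer: lower-case alphabetic runs (an assumption).
    return re.findall(r"[a-z']+", text.lower())


def word_conditions(word, headline, paragraphs, other_stories):
    """Evaluate some positional and statistical conditions for a word.

    paragraphs:    the current story as a list of paragraph strings.
    other_stories: text of the other stories in the same section.
    """
    w = word.lower()
    body = [tokens(p) for p in paragraphs]
    return {
        "in_headline": w in tokens(headline),
        "in_lead_paragraph": bool(body) and w in body[0],
        "in_concluding_paragraph": bool(body) and w in body[-1],
        "count_in_story": sum(p.count(w) for p in body),
        "only_in_this_story": not any(w in tokens(s)
                                      for s in other_stories),
    }


conditions = word_conditions(
    "Sox",
    headline="Red Sox win opener",
    paragraphs=["Boston beat New York on Monday.",
                "The Sox play again Tuesday."],
    other_stories=["Budget talks resume in Washington."],
)
```

A predictor could fold such flags into each word's weighting factor; how they should be combined is left open here, as it is in the text.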
Similar analysis can be performed on the vocabulary using
information about the user interface. For instance, one might
ask about the time a word was last accessed by voice. by touch.
When was it first accessed? Is it accessed frequently?
periodically? and so forth.
In general, the speech system should be adaptable to various
host applications. They need not display all of the NewsPeek
features singled out in this chapter, but those applications
similar to NewsPeek should have the least difficulty employing
the speech system, and, for the most part, see the best results.
8.0 Conclusion
The small recognition vocabulary of an inexpensive, commercially
available speech recognizer can be extended by supplying
additional memory and processing capabilities. By integrating
the speech recognition system with an appropriate host
application, the recognizer is privileged with useful
information concerning both the user and his purpose; better
speech recognizer performance is the result.

Automatic recognition of natural, fluent, conversational speech
is still a long way off. However, as strides are made in the
field, speech recognition is destined to become a practical and
popular method for computer input.
REFERENCES

[1] Barnett, J., "A Vocal Data Management System," IEEE Trans.
Audio Electroacoust., vol. AU-21, June 1973, pp. 185 - 188.

[2] Bates, M., "The Use of Syntax in a Speech Understanding
System," IEEE Trans. Acoust., Speech, and Signal Processing,
vol. ASSP-23, Feb. 1975, pp. 112 - 117.

[3] Bolt, R. A., "Voice and Gesture at the Graphics Interface,"
Computer Graphics, Proceedings of ACM SIGGRAPH '80, vol. 14,
no. 3, 1980, pp. 262 - 270.

[4] Chapanis, A., "Interactive Human Communication," Scientific
American, vol. 232, no. 3, 1975, pp. 36 - 42.

[5] David, E. E. and Denes, P. B., Human Communication: A
Unified View, New York, McGraw Hill, 1972.

[6] Hill, D. R., "Man-Machine Interaction Using Speech,"
Advances in Computers, Alt, F. L., Rubinoff, M., and Yovits,
M. C., Eds., vol. 11, New York, Academic Press, 1971,
pp. 165 - 230.

[7] Lea, W. A., "Establishing the Value of Voice Communication
With Computers," IEEE Trans. Audio Electroacoust., vol. AU-16,
June 1968, pp. 184 - 197.

[8] Levinson, S. E. and Liberman, M. Y., "Speech Recognition by
Computer," Scientific American, vol. 244, no. 4, 1981,
pp. 64 - 76.

[9] Licklider, J. C. R., "Man Computer Symbiosis," Perspectives
on the Computer Revolution, Pylyshyn, Z. W., Ed., Englewood
Cliffs, NJ, Prentice-Hall, 1970, pp. 306 - 318.

[10] Nash-Webber, B., "Semantic Support for a Speech
Understanding System," IEEE Trans. Acoust., Speech, and Signal
Processing, vol. ASSP-23, Feb. 1975, pp. 124 - 128.

[11] Ochsman, R. B. and Chapanis, A., "The Effects of Ten
Communication Modes on the Behavior of Teams During Co-operative
Problem Solving," Int. J. Man-Machine Studies, vol. 6, 1974,
pp. 579 - 619.

[12] Rabiner, L. R., "Considerations in Dynamic Time Warping
Algorithms for Discrete Word Recognition," IEEE Trans. Acoust.,
Speech, and Signal Processing, vol. ASSP-26, Dec. 1978,
pp. 575 - 582.

[13] Reddy, D. R., "Speech Recognition by Machine: A Review,"
Proceedings of the IEEE, vol. 64, no. 4, April 1976,
pp. 501 - 531.

[14] Sakoe, H. and Chiba, S., "Dynamic Programming Algorithm
Optimizations for Spoken Word Recognition," IEEE Trans. Acoust.,
Speech, and Signal Processing, vol. ASSP-26, Feb. 1978,
pp. 43 - 49.

[15] Schmandt, C. and Hulteen, E. A., "The Intelligent Voice
Interactive Interface," Proceedings Human Factors in Computer
Systems, Gaithersburg, MD., 1982, National Bureau of Standards /
ACM, pp. 363 - 366.

[16] Tappert, C. C. and Subrata, D. K., "Memory and Time
Improvements in a Dynamic Programming Algorithm for Matching
Speech Patterns," IEEE Trans. Acoust., Speech, and Signal
Processing, vol. ASSP-26, Feb. 1978, pp. 583 - 586.

[17] Walker, D. E., "The SRI Speech Understanding System," IEEE
Trans. Acoust., Speech, and Signal Processing, vol. ASSP-23,
no. 5, Oct. 1975, pp. 397 - 416.