Speech Recognition

Human/Computer Communications
Using Speech
Ellis K. ‘Skip’ Cave
InterVoice-Brite Inc.
skip@intervoice.com
Famous Human/Computer
Communication - 1968
InterVoice-Brite
• Twenty years building speech applications
• Largest provider of VUI applications and
systems in the world
• Turnkey Systems
– Hardware, software, application design, managed
services
• 1000’s of installations worldwide
• Banking, Travel, Stock Brokerage, Help
Desk, etc.
– Bank of America
– American Express
– E-Trade
– Microsoft help-desk
Growth of Speech-Enabled
Applications
• Analysts estimate that 15% of IVR ports sold in 2000 were speech-enabled
• By 2004, 48.5% of IVR ports sold will be speech-enabled
– Source: Frost & Sullivan - U.S. IVR Systems Market, 2001
• IVB estimates that in 2002, 50% of IVB ports sold will be speech-enabled.
Overview
• Brief History of Speech Recognition
• How ASR works
• Directed Dialog & Applications
• Standards & Trends
• Natural Language & Applications
History
• Natural Language Processing
– Computational Linguistics
– Computer Science
– Text understanding
• Auto translation
• Question/Answer
• Web search
• Speech Recognition
– Electrical Engineering
– Speech-to-text
• Dictation
• Control
Turing Test
• Alan M. Turing
• Paper - “Computing Machinery and Intelligence”
(Mind, 1950 - Vol. 59, No. 236, pp. 433-460)
• First two sentences of the article:
– I propose to consider the question, "Can machines
think?” This should begin with definitions of the
meaning of the terms "machine" and "think."
• To answer this question, Turing proposed the
“Imitation Game” later named the “Turing Test”
– Requires an Interrogator & 2 subjects
Turing Test
[Diagram: an Observer communicates with Subject #1 and Subject #2 and must decide: which subject is a machine?]
Turing Test
• Turing assumed communications would be
written (typed)
• Assumed communications would be
unrestricted as to subject
• Predicted that test would be “passed” in 50
years (2000)
• The ability to communicate is equated to
“Thinking” and “intelligence”
Turing Test - 50 Years Later
• Today - NL systems still unable to fool
interrogator on unrestricted subjects
• Speech Input & Output possible
• Transactional dialogs in restricted subject
areas possible
• Question/Answer queries feasible on large text
databases
• May not fool the interrogator, but can provide
useful functions
– Travel Reservations, Stock Brokerages, Banking, etc.
Speech
Recognition
Voice Input - The New Paradigm
• Automatic Speech Recognition (ASR)
• Tremendous technical advances in the last
few years
• From small to large vocabularies
– 5,000 - 10,000 word vocabulary
• Stock brokerage - E-Trade - Ameritrade
• Travel - Travelocity, Delta Airlines
• From isolated word to connected words
– Modern ASR recognizes connected words
• From speaker dependent to speaker
independent
– Modern ASR is fully speaker independent
• Natural Language
Signal Processing Front-End
Feature Extraction
[Diagram: overlapping 25 ms sample windows with 15 ms overlap (100 windows/sec), each window yielding 13 feature parameters]
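A minimal framing sketch in Python (the 25 ms window and 15 ms overlap come from the slide; the sample rate and Hamming window are illustrative assumptions):

import numpy as np

def frame_signal(signal, sample_rate=8000, win_ms=25, shift_ms=10):
    """Slice a waveform into overlapping analysis windows.

    25 ms windows with a 10 ms shift give a 15 ms overlap and
    100 windows per second, matching the figures on the slide.
    """
    win = int(sample_rate * win_ms / 1000)      # samples per window
    shift = int(sample_rate * shift_ms / 1000)  # samples per shift
    n_frames = 1 + (len(signal) - win) // shift
    frames = np.stack([signal[i * shift : i * shift + win]
                       for i in range(n_frames)])
    # A taper (here Hamming) is typically applied before the DFT
    return frames * np.hamming(win)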
Cepstrum
• Cepstrum is the inverse Fourier transform of the log
spectrum
$$c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\left|S(e^{j\omega})\right| \, e^{j\omega n} \, d\omega, \qquad n = 0, 1, \ldots, L-1$$
Mel Cepstral Coefficients
• Construct the mel-frequency domain using a triangularly-shaped weighting function applied to mel-transformed log-magnitude spectral samples
• Mel-Filtered Cepstral Coefficients (MFCCs) are the most common feature set for recognizers
• Motivated by human auditory response characteristics
Mel Cepstrum
• After computing the DFT and the log magnitude spectrum (to obtain the real cepstrum), we compute the filterbank outputs, and then use a discrete cosine transform to compute the mel-frequency cepstrum coefficients:
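In its standard textbook form (assuming $E_m$ denotes the output energy of the $m$-th of $M$ mel filters), the transform is:

$$c(n) = \sum_{m=1}^{M} \log(E_m)\,\cos\!\left(\frac{\pi n \,(m - \tfrac{1}{2})}{M}\right), \qquad n = 0, 1, \ldots, L-1$$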
Mel Cepstrum
– A 39-element feature vector (cepstral coefficients plus their first and second derivatives) represents each 25 ms voice sample
Cepstrum as Vector Space Features
Feature Ambiguity
• After the signal processing front-end
• How to resolve overlap or ambiguity in Mel-Cepstrum features?
• Need to use context information
– What precedes? What follows?
• N-phones and N-grams
• All computations are probabilistic
The Speech Recognition Problem
Find the most likely word sequence Ŵ among all
possible sequences given acoustic evidence A
A tractable reformulation of the problem, via Bayes' rule (since $P(A)$ does not depend on the word sequence), is:

$$\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W \underbrace{P(A \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}$$

The resulting search over all word sequences is a daunting task.
ASR Resolution
• Need
– Mel Cepstrum features into probabilities
– Acoustic Model (tri-phone probabilities)
• Phonetic probabilities
– Language Model (bi-gram probabilities)
• Word probabilities
• Apply Dynamic Programming techniques
– Find most-likely sequence of phonemes & words
– Viterbi Search
Acoustic Models
• Acoustic states represented by Hidden Markov
Models (HMMs)
– Probabilistic State Machines - state sequence
unknown, only feature vector outputs observed
– Each state has output symbol distribution
– Each state has transition probability distribution
[Diagram: three-state left-to-right HMM. States s0, s1, s2; initial probability p(s0); self-loop transitions t(s0|s0), t(s1|s1), t(s2|s2); forward transitions t(s1|s0), t(s2|s1); each state si emits output symbols with distribution q(i|si)]
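A minimal sketch of the three-state model above in Python (discrete output symbols; all probability values are illustrative):

import numpy as np

# Three-state left-to-right HMM, as in the diagram above.
# t[i][j] = t(sj | si): probability of moving from state i to j.
t = np.array([[0.6, 0.4, 0.0],   # s0 -> s0 or s1
              [0.0, 0.7, 0.3],   # s1 -> s1 or s2
              [0.0, 0.0, 1.0]])  # s2 absorbs
# q[i][k] = q(k | si): probability that state i emits symbol k.
q = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
p0 = np.array([1.0, 0.0, 0.0])   # p(s0) = 1: always start in s0

def likelihood(observations):
    """Forward algorithm: P(observation sequence | model).

    Only the emitted symbols are seen; the state sequence stays
    hidden, which is what makes the model 'hidden'.
    """
    alpha = p0 * q[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ t) * q[:, o]
    return alpha.sum()

print(likelihood([0, 1, 2]))  # e.g. symbols 0, 1, 2 in sequence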
Subword Models
• Objective: Create a set of HMMs representing the
basic sounds (phones) of a language
– English has about 40 distinct phonemes
– Need “lexicon” for pronunciations
– Letter to sound rules for unusual words
– Problem - co-articulation effects must be modeled
• “barter” vs “bartender”
• Solution - “tri-phones” - each phone modified by
onset and trailing context phones
Language Models
• What is a language model?
– Quantitative ordering of the likelihood of word
sequences
• Why use language models?
– Not all word sequences equally likely
– Search space optimization
– Improved accuracy
• Bridges the gap between acoustic ambiguities and
ontology
Finite State Grammars
Allowable word sequences are explicitly specified using
a structured syntax
• Creates a word network
• Word sequences not enabled do not exist!
• Application developer must construct grammar
• Excellent for directed dialog and closed prompting
Finite-State Language Model
• Narrow range of responses allowed
– Only word sequences coded in the grammar are recognized
• Straightforward ASR engine - follows grammar rules exactly
– Easy to add words to the grammar
• Allows name lists
– “I want to fly to $CITY”
– “I want to buy $STOCK”
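As a toy illustration in Python (grammar and city list invented for the example), a finite-state grammar is just an explicit word network:

import itertools

# A tiny finite-state grammar: each slot lists its allowed words.
# $CITY expands to an explicit name list, as on the slide.
CITY = ["Boston", "Chicago", "Dallas"]
grammar = [["I"], ["want", "need"], ["to"], ["fly"], ["to"], CITY]

# Every path through the network is a legal utterance; anything
# off the network simply cannot be recognized.
legal = {" ".join(path) for path in itertools.product(*grammar)}

print("I want to fly to Dallas" in legal)  # True
print("Fly me to Dallas" in legal)         # False: not in the network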
Statistical Language Models
Stochastic Context-Free Grammars
• Only specifies word transition probabilities
• N-gram language model
• Required for open-ended prompts: “How may I direct
your inquiry?”
• Much more difficult to analyze possible results
– Not for every interaction
• Data, Data, Data: 10,000+ transcribed responses for
each input task
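As a toy illustration, a minimal unsmoothed bigram model in Python (the transcripts and counts are invented; production SLMs use smoothing and vastly more data):

from collections import Counter

transcripts = [
    "i want my account balance",
    "i want to transfer funds",
    "i need my account balance",
]  # in practice: 10,000+ transcribed responses per task

bigrams = Counter()
unigrams = Counter()
for line in transcripts:
    words = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(words[:-1])           # count of w1 starting a bigram
    bigrams.update(zip(words[:-1], words[1:]))

def p(w2, w1):
    """P(w2 | w1): maximum-likelihood bigram probability."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p("want", "i"))  # 2/3 in this toy corpus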
Statistical State Machines
Mixed Language Models
• SLM statistics are unstable (useless) unless
examples of each word in each context are presented
• Consider a flight reservation tri-gram language
model:
I’d like to fly from Boston to Chicago on Monday
Training sentences required for 100 cities: (100*100
+ 100*7) = 10,700
• A better way is to consider classes of words:
I’d like to fly from $(CITY) to $(CITY) on $(DATE)
Only one transcription is needed to represent all 70,000
variations (100 origins × 100 destinations × 7 dates)
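A sketch of that class-substitution step in Python (class lists and token syntax are illustrative):

# Replace class members with class tokens before counting n-grams,
# so one training sentence covers every city/date combination.
CLASSES = {
    "$(CITY)": {"boston", "chicago", "dallas"},
    "$(DATE)": {"monday", "tuesday", "today"},
}

def tag_classes(sentence):
    """Map each known class member to its class token."""
    out = []
    for word in sentence.split():
        for token, members in CLASSES.items():
            if word in members:
                word = token
                break
        out.append(word)
    return " ".join(out)

print(tag_classes("i'd like to fly from boston to chicago on monday"))
# -> "i'd like to fly from $(CITY) to $(CITY) on $(DATE)"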
Viterbi
• How do you determine the most probable
utterance?
• The Viterbi Search returns the n-best paths
through the Acoustic model and the
Language Model
Dynamic Programming (Viterbi)
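A compact Viterbi sketch in Python, reusing the illustrative t, q, p0 arrays from the HMM sketch above (single best path for brevity; n-best variants keep several back-pointers per state):

import numpy as np

def viterbi(observations, t, q, p0):
    """Return the most likely state path and its probability.

    Dynamic programming: the best path to state j at time n extends
    the best path to some state i at time n-1, so we never enumerate
    all paths explicitly.
    """
    delta = p0 * q[:, observations[0]]  # best score ending in each state
    back = []                           # back-pointers per time step
    for o in observations[1:]:
        scores = delta[:, None] * t     # scores[i, j]: come from i, go to j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * q[:, o]
    # Trace the back-pointers from the best final state.
    state = int(delta.argmax())
    path = [state]
    for ptr in reversed(back):
        state = int(ptr[state])
        path.append(state)
    return path[::-1], float(delta.max())

# Example with the 3-state model sketched earlier:
t = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
q = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
p0 = np.array([1.0, 0.0, 0.0])
print(viterbi([0, 1, 2], t, q, p0))  # most likely path: [0, 1, 2]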
N-Best Speech Results
[Diagram: a speech waveform enters the ASR, guided by a grammar, and an N-best result list comes out:
N=1 “Get me two movie tickets…”
N=2 “I want to movie trips…”
N=3 “My car’s too groovy”]
• ASR converts speech to text
• Use “grammar” to guide recognition
• Focus on “speaker-independent” ASRs
• Must allow for open context
What does it all Mean?
Text output is nice, but how do we represent meaning ?
• Finite state grammars - constructs can be tagged
with semantics
<item> get me the operator <tag>OPERATOR</tag>
</item>
• SLM uses concept spotting
Itinerary:slm “flightinfo.pfsg” = FlightConcepts
FlightConcepts [
  (from City:c) {<origin $c>}
  (to City:c) {<dest $c>}
  (on Date:d) {<date $d>}
]
• Concepts may also be trained statistically
– but that requires even more data!
Directed
Dialogs
Directed Dialog
• Finite-State Grammars - Currently most common
method to implement speech-enabled applications
• More flexible & user-friendly than key (TouchTone) input
• Allows Spoken List selection
– System: “What City are you leaving from?”
– User: “Birmingham”
• Keywords easier to remember than numeric codes
– “Account balance” instead of “two”
• Easy to skip ahead through menus
– Tellme - “Sports, Basketball, Mavericks”
Issues With Directed Dialogue
• Computer asks all the questions
– Usually presented as a menu
– “Do you want your account balance, cleared
checks, or deposits?”
• Computer always has the initiative
– User just answers questions, never gets to ask
any questions
• All possible answers must be pre-defined by
the application developer (grammars)
• Will eventually get the job done, but can be
tedious
• Still much better than Touch-tone menus
Issues With Directed Dialogue
• Application developer must design scripts
that never have the machine ask open-ended questions
– “What can I do for you?”
• Application Developer’s job - design
questions where answers can be explicitly
predicted.
– “Do you want to buy or sell stocks?”
• Developer must explicitly define all possible
responses
– Buy, purchase, get some, acquire
– Sell, dump, get rid of it
Examples of Directed Dialog
Southwest Airlines
Pizza Inn
Brokerage
Standards &
Trends
VoiceXML
• VoiceXML - A web-oriented voice-application
programming language
– W3C Standard - www.w3.org
– Version 1.0 released March 2000
– Version 2.0 ready to be approved
• http://www.w3.org/TR/voicexml20/
– Voice dialogues scripted using XML structures
• Other VoiceXML support
– www.voicexml.org
– voicexmlreview.org
VoiceXML
• Assume telephone as user device
• Voice or key input
• Pre-recorded or Text-to-Speech output
Why VoiceXML?
• Provides environment similar to web for web
developers to build speech applications
• Applications are distributed on document
servers similar to web
• Leverages the investment companies have
made in the development of a web
presence.
• Data from Web databases can be used in
the call automation system.
• Designed for distributed and/or hosted
(ASP) environment.
VoiceXML Architecture
[Diagram: a telephone (via the telephone network) or mobile device speaks to a VoiceXML browser/gateway; the gateway requests documents over the Internet from a web server, which serves the VoiceXML document that drives the VUI - the same way a desktop browser fetches HTML from a web server]
VoiceXML Example
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Example 1 for VoiceXML Review -->
  <form>
    <block>
      Hello, World!
    </block>
  </form>
</vxml>
VoiceXML Applications
• Voice Portals
– TellMe,
• 1-800-555-8355 (TELL)
• http://www.tellme.com
– BeVocal
• 1-408-850-2255 (BVOCAL)
• www.bevocal.com
The VoiceXML Plan
– Third party developers write VoiceXML scripts
that they will publish on the web
– Callers to the Voice Portals will access these
voice applications like browsing the web
– VoiceXML will use VUI with directed dialog
• Voice output
• Voice or key input
– hands/eyes free or privacy
Speech Application
Language Tags (SALT)
• Microsoft, Cisco Systems, Comverse Inc.,
Intel, Philips Speech Processing, and
SpeechWorks
• www.saltforum.org
• Extension of existing Web standards such as
HTML, xHTML and XML
• Support multi-modal and telephone access to
information, applications, and Web services,
independently or concurrently.
SALT - “Multi-modal”
• Input might come from speech recognition, a
keyboard or keypad, and/or a stylus or
mouse
• Output to screen or speaker (speech)
• Embedded in HTML documents
• Will require SALT-enabled browsers
• Working Draft V1.9
• Public Release - March 2002
• Submit to IETF - midyear 2002
SALT Code
<!-- Speech Application Language Tags -->
<salt:prompt id="askOriginCity">
  Where would you like to leave from?
</salt:prompt>
<salt:prompt id="askDestCity">
  Where would you like to go to?
</salt:prompt>
<salt:prompt id="sayDidntUnderstand" onComplete="runAsk()">
  Sorry, I didn't understand.
</salt:prompt>
<salt:listen id="recoOriginCity"
    onReco="procOriginCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>
<salt:listen id="recoDestCity"
    onReco="procDestCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>
Evolution of the Speech Interface
• Touch-Tone Input
• Directed Dialogue
• Natural Language
– Word spotting
– Phrase spotting
– Deep parsing
Natural Language
Understanding
What is Natural Language
• User can take the initiative
– Computer says “How can I help you?”
• User can state request in a single interaction
– “What is the price of IBM?”
– “I want to sell all my IBM stock today at the
market”
• User can change initiatives midstream
– I want to buy some stock - How much do I have in
my account?
• Natural Language is closely related to
Artificial Intelligence
• Feasible today, if scope of discussion is
limited
Natural Language
• NLP, NLU, NLG
– All of these mean very specific things to the
computational linguistics community
– NLP - Machine processing of human text or speech
– NLU - Machine understanding of unconstrained human-originated text (or speech)
– NLG - Machine generation of human-understandable text (or speech)
• To build a true NL dialog engine you must deal with both NLU and NLG
Natural Language Engine
[Diagram: the Natural Language Engine takes text in and returns text out; internally it comprises Natural Language Understanding, Database Manipulation, and Natural Language Generation]
Spoken Natural Language
Engine
[Diagram: speech in → Automatic Speech Recognition → text in → Natural Language Understanding (the engine above) → text out → Text-to-Speech → speech out]
Text-to-Speech
• Modern Text-to-Speech
– More natural-sounding
– Better prosody
– Improved proper noun handling
• People Names
• Street Names
• Cities
– Abbreviation handling
• Dr. = Doctor or Drive
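A toy sketch of context-driven abbreviation expansion in Python (the rules are illustrative, not how any particular TTS engine works):

import re

def expand_dr(text):
    """Expand 'Dr.' by context: before a capitalized name it is a
    title; after a street-like name it is more likely 'Drive'."""
    # "Dr. Smith" -> "Doctor Smith"
    text = re.sub(r"\bDr\.\s+(?=[A-Z])", "Doctor ", text)
    # "Mockingbird Dr." -> "Mockingbird Drive"
    text = re.sub(r"(?<=[a-z] )Dr\.", "Drive", text)
    return text

print(expand_dr("Dr. Smith lives on Mockingbird Dr."))
# -> "Doctor Smith lives on Mockingbird Drive"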
Overview of Dialog Systems
[Diagram: voice in → ASR → NL Parser → Dialog Manager → Prompt Generator → TTS → voice out; the Dialog Manager also connects to back-office (B/O) systems]
• Openness of dialog
• Can ASR hear everything?
• Can NLP understand everything heard?
• Can DM cope with multiple strands / directions?
• Does Prompt Generator sound natural?
Dialog Sophistication
• NL Parser (less to more sophisticated)
– Word spotting
– Morphology → Concept Spotting
– Ontology
– Ontology + syntax
• Dialog Management (less to more sophisticated)
– Directed dialog
– Query Answering (Q & A)
– Dialog context
– Complex Mixed Initiatives
Natural Language Technologies
• Improve Existing Applications
– Scheduling - Airlines, Hotels
– Financial - Banks, Brokerages
• Enabling New Applications
– Catalogue Order
– Complex Travel Planning
– Information Mining - Voice Web browsing
– Direct Telemarketing AI
• Many applications require Text-to-Speech
Shallow-Parse NLU
• Nuance, Speechworks, Philips,…
• W3C Semantic Attachments for Speech Recognition
Grammars (Working Draft 11 May 2001)
• Driven by an Interpretation Grammar
– “Full” - Finite State Grammar (whole sentence must match)
– “Robust” - Statistical Grammar (word or phrase spotting - partial parses of phrase fragments)
– Associates semantic tags (ABNF { }, XML <tag/>) with the Recognition Grammar rules
• Constrained by the Recognition Grammar
– Need to write a partial Recognition Grammar even when using an SLM
• No in-context interpretation
Statistical Language Model
[Diagram: a speech waveform passes through the front end into an SLM-driven ASR (probabilistic word transitions), producing an N-best phrase list with confidence scores:
1- “I want to buy stock…” (90%)
2- “I want to go by the lock…” (60%)
3- “High doesn’t mock….” (20%)
Phrase-spotting rules from the interpretive (recognition) grammar then map the phrases to concepts: Buy stock, Sell stock, Get stock quote, Get operator]
Shallow-Parse NLU
• Latest thing from ASR vendors
– Philips
• Speech Pearl - SLM
• Speech Mania - FSG
– Nuance
• Say Anything - SLM
• Requires two new constructs
– Statistical Language Model
– Interpretation Grammar
• Allows more open responses
Shallow-Parse NLU
• Developer must define and code
“interpretation grammar” that lists keywords
and phrases that can occur
– Need programming skills to define interpretive
grammars. More complex coding than fixed grammars
– wild cards, concept spotting, phrase matching
• SLM
– Need large number of example utterances to get
reliable word-sequence statistics
– Usually requires recording and transcription of live
conversations in application topic
– More difficult to add new words to grammar
• Must also provide usage and sequence statistics with
each word added
– SLM outputs n-best utterance list
Shallow-Parse NLU
• Interpretive Grammars
– Allows “wild card” descriptions: “(to $CITY)”
• “I want to fly to $CITY” - “I want to go to
$CITY”
• “I need to fly to $CITY”
– Allows out-of-sequence phrases
• I want to go from Chicago to Dallas today
• Today I want to go to Dallas from Chicago
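A toy phrase-spotting sketch in Python (the patterns and city list are invented for illustration):

import re

CITY = r"(Boston|Chicago|Dallas)"

# Wild-card style patterns: order-independent concept spotting.
PATTERNS = {
    "origin": re.compile(r"\bfrom " + CITY, re.I),
    "dest":   re.compile(r"\bto " + CITY, re.I),
    "date":   re.compile(r"\b(today|tomorrow|monday)\b", re.I),
}

def spot_concepts(utterance):
    """Extract concept slots regardless of phrase order."""
    return {slot: m.group(1)
            for slot, rx in PATTERNS.items()
            if (m := rx.search(utterance))}

print(spot_concepts("I want to go from Chicago to Dallas today"))
print(spot_concepts("Today I want to go to Dallas from Chicago"))
# Both yield origin=Chicago, dest=Dallas, plus the date word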
Shallow-Parse NLU
• Allows more open dialog than Finite-State
grammars
• Requires more development effort than
Finite-State Grammars
• Will allow a smaller technology step than full
NLU for less-demanding applications
Deep Parsing NLU
• Full linguistic analysis of Speech
– Syntactic and Semantic analysis
• Core Language Model contains data
structures required for language
understanding
– Lexicon
– Ontology
– Parsing Rules
• Eliminate
– Scripting
– Manually-defined semantic tags
• developer doesn’t have to define concepts
Deep Parsing NLU
• Lexicon - A List of Words and their Syntactical
and Semantic Attributes
• Root or stem word form
– fox, run, Boston
• Optional forms - plurals, tenses
– fox, foxes
– run, ran, running
• part of speech
– fox - noun
– run - verb
– Boston - proper noun
• Link to Ontology
– fox - animal, brown, furry
– run - action, move fast
– Boston - city
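A minimal lexicon sketch in Python, mirroring the slide's examples (the field names are illustrative):

# Each entry: syntactic and semantic attributes plus an ontology link.
LEXICON = {
    "fox": {
        "pos": "noun",
        "forms": ["fox", "foxes"],
        "ontology": ["animal", "brown", "furry"],
    },
    "run": {
        "pos": "verb",
        "forms": ["run", "ran", "running"],
        "ontology": ["action", "move fast"],
    },
    "Boston": {
        "pos": "proper noun",
        "forms": ["Boston"],
        "ontology": ["city"],
    },
}

def lookup(word):
    """Find the stem entry whose surface forms include the word."""
    for stem, entry in LEXICON.items():
        if word in entry["forms"]:
            return stem, entry
    return None

print(lookup("ran"))  # -> ('run', {...the 'run' entry...})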
Deep Parsing NLU
• Ontology - Classes & Relationships
[Diagram: ontology class hierarchy. Top-level classes: PERSON, LOCATION, DATE, TIME, PRODUCT, NUMERICAL VALUE, MONEY, ORGANIZATION, MANNER. Example subclasses: TIME - clock time, time of day, prime time, midnight; ORGANIZATION - institution/establishment (financial institution, educational institution) and team/squad (hockey team); NUMERICAL VALUE - DEGREE, DIMENSION (distance/length, altitude, width/breadth, thickness, wingspan), RATE, DURATION, PERCENTAGE, COUNT (integer/whole number, numerosity/multiplicity, population, denominator)]
Deep Parsing NLU
[Diagram: deep parse of the question “Who was the first Russian astronaut to walk in space?” Parsing rules first build a syntactic tree (part-of-speech tags WP VBD DT JJ NNP TO VB IN NN grouped into NP, VP, and PP constituents under S), then map it to a semantic representation: a PERSON node (astronaut, modified by Russian and first) performing the action walk with the location space]
Natural Language Understanding
NL Interpretation is split between two components
[Diagram: voice in → ASR → (NLSML) → NL Parser → (NLSML) → NL Dialog Server, backed by a knowledge base (KB)]
• NL Parser
– N-Best input in NLSML format
– Context-free parsing based on Ontology, Lexicon, and Rules
– Filtering out redundant interpretations
– Extended W3C NLSML output
• Dialog Server
– Extended W3C NLSML input
– In-context interpretation
– Dialog Directives output
Natural Language
• Stock Transaction
• 30 seconds
• Airline Reservation
• TTS
Human Language Technology
Institute at UTD
• Opening March 2002 (open house March 7-8)
• Foster research in Human Language
Technology
• Establish ties with local industry
• 6 new faculty positions
• Currently funded at about $1 million per year by
state and government agencies
Speech Technology
Professionals Forum
• Monthly meetings in Telecom Corridor
– Third Tuesday - March 19
• Rick Tett
– rick.tett@candora.com
– www.candora.com/stpf
Conclusions
• Consumer applications in Stock Brokerages,
Travel Agencies, etc. are raising the
standards for directed-dialogue production
quality and usability
• VoiceXML and SALT will open VUI Directed
Dialog application development
• Artificial Intelligence and Natural Language
technology are making rapid advances, enabling highly conversational applications
The Speech User Interface
• Speech will emerge as the preferred mobile
interface
• Selectable voice or key input
– Hands/eyes free or privacy
• Selectable voice or screen output
– When small screens make sense
• Hands-free Wireless Web Access
– WAP phones not required
– 3G phones not required
The Speech User Interface
• Natural Language will make VUI highly
conversational
– no menus
– no memorizing keywords or keystrokes
• Many so-called graphical applications can be
made more efficient with a pure speech interface
– Driving Maps, Bank Statements
The Speech User Interface
The computing terminal
of tomorrow is already here!
Speech - the optimal computing interface
Thank You!
References
• Speech & Language Processing
– Jurafsky & Martin - Prentice Hall - 2000
• Statistical Methods for Speech Recognition
– Jelinek - MIT Press - 1999
• Foundations of Statistical Natural Language
Processing
– Manning & Schütze - MIT Press - 1999
• Dr. J. Picone - Speech Website
– www.isip.msstate.edu