SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS
Intelligent Software Lab., POSTECH
Prof. Gary Geunbae Lee

THIS TUTORIAL
An introduction to the Spoken Dialog System (SDS) for Human-Robot Interaction (HRI).
Language-processing oriented: a brief introduction to SDS, not signal-processing oriented.
Mainly based on papers from ACL, NAACL, HLT, ICASSP, INTERSPEECH, ASRU, SLT, SIGDIAL, CSL, SPECOM, and IEEE TASLP.

OUTLINE
- INTRODUCTION
- AUTOMATIC SPEECH RECOGNITION
- SPOKEN LANGUAGE UNDERSTANDING
- DIALOG MANAGEMENT
- CHALLENGES & ISSUES: MULTI-MODAL DIALOG SYSTEM
- CHALLENGES & ISSUES: DIALOG SIMULATOR
- DEMOS
- REFERENCES

INTRODUCTION
Human-Robot Interaction (in the movies)
Human-Robot Interaction (in the real world)

WHAT IS HRI?
Wikipedia (http://en.wikipedia.org/wiki/Human_robot_interaction):
Human-robot interaction (HRI) is the study of interactions between people and robots. HRI is multidisciplinary, with contributions from the fields of human-computer interaction, artificial intelligence, robotics, natural language understanding, and social science. The basic goal of HRI is to develop principles and algorithms that allow more natural and effective communication and interaction between humans and robots.

AREAS OF HRI
Vision, Learning, Emotion, Haptics, and Speech, which covers:
- Signal Processing
- Speech Recognition
- Speech Understanding
- Dialog Management
- Speech Synthesis

SPOKEN DIALOG SYSTEM (SDS)

SDS APPLICATIONS
Car navigation, tele-service, home networking, robot interface: talk, listen, and interact.

AUTOMATIC SPEECH RECOGNITION

SCIENCE FICTION
Eagle Eye (2008, D.J. Caruso)

AUTOMATIC SPEECH RECOGNITION
Input x: speech; output y: words; a learning algorithm is trained from (x, y) training examples.
A process by which an acoustic speech signal is converted into a set of words [Rabiner et al., 1993].

NOISY CHANNEL MODEL
GOAL: find the most likely sequence W of "words" in language L given the sequence of acoustic observation vectors O.
Treat the acoustic input as a sequence of individual observations, O = o1, o2, o3, ..., ot.
Define a sentence as a sequence of words, W = w1, w2, w3, ..., wn.
Ŵ = argmax_{W ∈ L} P(W|O)
  = argmax_{W ∈ L} P(O|W) P(W) / P(O)   (Bayes rule)
  = argmax_{W ∈ L} P(O|W) P(W)          (the "golden rule")

TRADITIONAL ARCHITECTURE
Example input: 버스 정류장이 어디에 있나요? ("Where is the bus stop?")
Speech signals O → feature extraction → decoding, Ŵ = argmax_{W ∈ L} P(O|W) P(W) → word sequence W.
Decoding searches a network constructed from the acoustic model (HMM estimation from a speech DB), the pronunciation model (G2P), and the language model (LM estimation from text corpora).

TRADITIONAL PROCESSES

FEATURE EXTRACTION
Mel-Frequency Cepstral Coefficients (MFCC) are a popular choice [Paliwal, 1992].
Pipeline: x(n) → pre-emphasis / Hamming window → FFT (Fast Fourier Transform) → mel-scale filter bank → log|.| → DCT (Discrete Cosine Transform) → MFCC (12 dimensions).
Frame size: 25 ms; frame rate: 10 ms.
39 features per 10 ms frame:
- Absolute: log frame energy (1) and MFCCs (12)
- Delta: first-order derivatives of the 13 absolute coefficients
- Delta-Delta: second-order derivatives of the 13 absolute coefficients
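Below is a minimal sketch of the MFCC pipeline just described (pre-emphasis, Hamming window, FFT, mel filter bank, log, DCT), using only NumPy and SciPy. The 16 kHz sample rate, 26-filter mel bank, 512-point FFT, and the toy sine-wave input are illustrative assumptions rather than values from the tutorial; a full front end would also append delta and delta-delta features to reach the 39-dimensional vector.

```python
# Minimal MFCC sketch (assumes 16 kHz audio; 25 ms frames, 10 ms shift as on the slide).
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=0.025, frame_shift=0.010,
         n_fft=512, n_filters=26, n_ceps=12):
    # Pre-emphasis
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing + Hamming window
    flen, fshift = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + max(0, (len(sig) - flen) // fshift)
    frames = np.stack([sig[i * fshift: i * fshift + flen] for i in range(n_frames)])
    frames *= np.hamming(flen)
    # Power spectrum (FFT)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel-scale triangular filter bank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # log|.| then DCT; keep 12 cepstral coefficients (c1..c12)
    feats = np.log(power @ fbank.T + 1e-10)
    ceps = dct(feats, type=2, axis=1, norm="ortho")[:, 1:n_ceps + 1]
    # Append log frame energy -> 13 absolute features per frame
    log_e = np.log(np.sum(power, axis=1) + 1e-10)
    return np.hstack([ceps, log_e[:, None]])

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
print(feats.shape)   # (n_frames, 13); deltas and delta-deltas would complete the 39
```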
ACOUSTIC MODEL
Provides P(O|Q) = P(features|phone).
Modeling units [Bahl et al., 1986]:
- Context-independent: phoneme
- Context-dependent: diphone, triphone, quinphone (pL-p+pR denotes a left-right context triphone)
Typical acoustic model [Juang et al., 1986]: continuous-density Hidden Markov Model, λ = (A, B, π).
Output distribution: Gaussian mixture, b_j(x_t) = Σ_{k=1}^{K} c_{jk} N(x_t; μ_{jk}, Σ_{jk}).
HMM topology: a 3-state left-to-right model for each phone, and a 1-state model for silence or pause.

PRONUNCIATION MODEL
Provides P(Q|W) = P(phone|word).
Word lexicon [Hazen et al., 2002]: maps legal phone sequences into words according to phonotactic rules.
G2P (grapheme-to-phoneme): generates a word lexicon automatically.
A word may have multiple pronunciations. Example: "tomato" as a pronunciation network
[t] → [ow] (0.2) or [ah] (0.8) → [m] → [ey] (0.5) or [aa] (0.5) → [t] → [ow] (1.0), so that
P([t ow m ey t ow]|tomato) = P([t ow m aa t ow]|tomato) = 0.1
P([t ah m ey t ow]|tomato) = P([t ah m aa t ow]|tomato) = 0.4

LANGUAGE MODEL
Provides P(W), the probability of the sentence [Beaujard et al., 1999].
This is also used in the decoding process as the probability of transitioning from one word to another.
For a word sequence W = w1, w2, w3, ..., wn:
P(w1 ... wn) = Π_{i=1}^{n} P(wi | w1 ... w(i-1))
The problem is that we cannot reliably estimate the conditional word probabilities P(wi | w1 ... w(i-1)) for all words and all sequence lengths in a given language.
n-gram language model: use the previous n-1 words to represent the history,
P(wi | w1 ... w(i-1)) ≈ P(wi | w(i-n+1) ... w(i-1))
Bigrams are easily incorporated into a Viterbi search.

LANGUAGE MODEL: EXAMPLE
Finite State Network (FSN): a word graph over 서울 (Seoul), 부산 (Busan), 대구 (Daegu), 대전 (Daejeon), 에서 (from), 세시 (3 o'clock), 네시 (4 o'clock), 출발 (departing), 도착 (arriving), 하는, 기차 (train), 버스 (bus).
Context Free Grammar (CFG):
$time = 세시|네시;
$city = 서울|부산|대구|대전;
$trans = 기차|버스;
$sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans
Bigram:
P(에서|서울)=0.2, P(세시|에서)=0.5, P(출발|세시)=1.0, P(하는|출발)=0.5, P(출발|서울)=0.5, P(도착|대구)=0.9, ...
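To make the bigram probabilities above concrete, here is a minimal sketch of maximum-likelihood bigram estimation from a toy corpus. The three example sentences and the add-one smoothing are illustrative assumptions, not material from the tutorial.

```python
# Minimal bigram language model: P(w_i | w_{i-1}) estimated from counts,
# with add-one smoothing so unseen transitions do not get zero probability.
import math
from collections import Counter

corpus = [
    "서울 에서 세시 출발 하는 기차",
    "서울 출발 대구 도착 하는 버스",
    "부산 에서 네시 출발 하는 버스",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words[:-1])                 # count each context word
    bigrams.update(zip(words[:-1], words[1:]))  # count each word pair

vocab = {w for s in corpus for w in s.split()} | {"</s>"}

def p_bigram(prev, word):
    # add-one (Laplace) smoothed conditional probability
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def sentence_logprob(sent):
    words = ["<s>"] + sent.split() + ["</s>"]
    return sum(math.log(p_bigram(p, w)) for p, w in zip(words[:-1], words[1:]))

print(p_bigram("서울", "에서"))                        # smoothed P(에서 | 서울)
print(sentence_logprob("부산 에서 세시 출발 하는 기차"))  # log-probability of a new sentence
```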
NETWORK CONSTRUCTION
Expanding every word to the state level, we get a search network [Demuynck et al., 1997].
Figure: the acoustic model, pronunciation model, and language model are composed into one search network for the words 일 (phones I-L), 이 (I), 삼 (S-A-M), and 사 (S-A). Each word is expanded into its phone HMM states; intra-word transitions connect states within a word, and between-word transitions connect word ends back to word starts, where the language model probabilities P(이|x), P(일|x), P(사|x), P(삼|x) are applied.

DECODING
Find Ŵ = argmax_{W ∈ L} P(W|O).
Viterbi search: dynamic programming.
Token Passing Algorithm [Young et al., 1989]:
- Initialize all states with a token carrying a null history and the likelihood that it is a start state.
- For each frame a_k:
  - For each token t in state s, with probability P(t) and history H:
    - For each state r: add a new token to r with probability P(t) · P_{s,r} · P_r(a_k) and history s.H.
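A minimal log-domain Viterbi sketch over a small hand-built HMM. The two-state model and its probabilities are invented for illustration, but the recursion is the standard dynamic program the decoding slide refers to (token passing is an equivalent formulation of the same search).

```python
# Viterbi decoding: find the most likely state sequence for an observation sequence.
import numpy as np

# Toy HMM: 2 hidden states, 3 discrete observation symbols (all values assumed).
log_pi = np.log([0.6, 0.4])                     # initial state probabilities
log_A  = np.log([[0.7, 0.3],                    # transition probabilities P(s_t | s_{t-1})
                 [0.4, 0.6]])
log_B  = np.log([[0.5, 0.4, 0.1],               # emission probabilities P(o_t | s_t)
                 [0.1, 0.3, 0.6]])

def viterbi(obs):
    T, N = len(obs), len(log_pi)
    delta = np.full((T, N), -np.inf)             # best log score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)            # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_B[j, obs[t]]
    # Backtrace the best path
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))

print(viterbi([0, 1, 2, 2]))   # best state path and its log-likelihood
```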
HTK
The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models [Young et al., 1996].
Modules: HShell (user I/O and interaction with the OS), HLabel (label files), HLM (language model), HNet (networks and lattices), HDic (dictionaries), HVQ (VQ codebooks), HModel (HMM definitions), HMem (memory management), HGrf (graphics), HAdapt (adaptation), HRec (main recognition processing functions).

SUMMARY (ASR)
Speech x is decoded into words y by searching a network constructed from the acoustic model, the pronunciation model, and the language model, each estimated by a learning algorithm from (x, y) training examples.

Speech Understanding = Spoken Language Understanding (SLU)

SPEECH UNDERSTANDING (in general)
A speech segment is mapped to a meaning representation that a computer program can act on, for example:
- Speaker ID / Language ID: Dave / English
- Named Entity / Relation: LOC = pod bay, OBJ = door
- Syntactic / Semantic Role: Open = Verb, the = Det., ...
- Topic / Intent: control the spaceship
- Summary: "Open the doors."
- Sentiment / Opinion: nervous
- SQL: select * from DOORS where ...

SPEECH UNDERSTANDING (in SDS)
Input x: speech or words; output y: intentions; a learning algorithm is trained from (x, y) training examples.
A process by which natural language speech is mapped to a frame structure encoding of its meaning [De Mori et al., 2008].

LANGUAGE UNDERSTANDING
What is the difference between NLU and SLU?
- Robustness: noisy and ungrammatical spoken language.
- Domain dependence: deeper domain-level semantics (e.g., Person vs. Cast).
- Dialog: analysis is dialog-history dependent and proceeds utterance by utterance.
Traditional approach: natural language to SQL conversion.
Speech → ASR → text → SLU → semantic frame → SQL generation → SQL → database → response: a typical ATIS system (from [Wang et al., 2005]).

REPRESENTATION
Semantic frame (slot/value structure) [Gildea and Jurafsky, 2002]: an intermediate semantic representation that serves as the interface between the user and the dialog system. Each frame contains several typed components called slots; the type of a slot specifies what kind of fillers it expects.
"Show me flights from Seattle to Boston"
<frame name='ShowFlight' type='void'>
  <slot type='Subject'>FLIGHT</slot>
  <slot type='Flight'>
    <slot type='DCity'>SEA</slot>
    <slot type='ACity'>BOS</slot>
  </slot>
</frame>
Semantic representation on the ATIS task: XML format (above) and the equivalent hierarchical representation, ShowFlight → Subject: FLIGHT; Flight → Departure_City: SEA, Arrival_City: BOS [Wang et al., 2005].

SEMANTIC FRAME
Meaning representations for a spoken dialog system:
- Slot type 1: Intent, Subject Goal, Dialog Act (DA): the meaning (intention) of an utterance at the discourse level.
- Slot type 2: Component Slot, Named Entity (NE): the identifier of an entity such as a person, location, organization, or time; in SLU it represents the domain-specific meaning of a word (or word group).
Example: "Find Korean restaurants in Daeyidong, Pohang"
<frame domain='RestaurantGuide'>
  <slot type='DA' name='SEARCH_RESTAURANT'/>
  <slot type='NE' name='CITY'>Pohang</slot>
  <slot type='NE' name='ADDRESS'>Daeyidong</slot>
  <slot type='NE' name='FOOD_TYPE'>Korean</slot>
</frame>

HOW TO SOLVE
Two classification problems:
- Dialog act identification. Input: "Find Korean restaurants in Daeyidong, Pohang"; output: SEARCH_RESTAURANT.
- Named entity recognition. Input: "Find Korean restaurants in Daeyidong, Pohang"; output: FOOD_TYPE (Korean), ADDRESS (Daeyidong), CITY (Pohang).

PROBLEM FORMALIZATION
Encoding:
x: Find | Korean      | restaurants | in | Daeyidong | , | Pohang | .
y: O    | FOOD_TYPE-B | O           | O  | ADDRESS-B | O | CITY-B | O
z: SEARCH_RESTAURANT
x is the input (words), y is an output (NE tags), and z is another output (DA):
vector x = {x1, x2, x3, ..., xT}, vector y = {y1, y2, y3, ..., yT}, scalar z.
Goal: model the functions y = f(x) and z = g(x).

CASCADE APPROACH I
Automatic speech recognition → x → sequential labeling of named entities / frame slots with a sequential labeling model (e.g., HMM, CRFs) → (x, y) → classification of the dialog act / intent with a classification model (e.g., MaxEnt, SVM) → (x, y, z) → dialog management.

CASCADE APPROACH II
Automatic speech recognition → x → classification of the dialog act / intent with a classification model (e.g., MaxEnt, SVM) → (x, z) → sequential labeling of named entities / frame slots with multiple sequential models (e.g., intent-dependent) → (x, y, z) → dialog management.
This improves NE, but not DA.

JOINT APPROACH
Automatic speech recognition → x → joint inference over named entity (sequential labeling) and dialog act (classification) → (x, y, z) → dialog management.
Joint model: e.g., TriCRFs [Jeong and Lee, 2006].

MACHINE LEARNING FOR SLU
Relational Learning (RL), or Structured Prediction (SP) [Dietterich, 2002; Lafferty et al., 2004; Sutton and McCallum, 2006].
Structured or relational patterns are important because they can be exploited to improve the prediction accuracy of our classifier.
Inference is an argmax search (e.g., sum-max, belief propagation, Viterbi, etc.).
Basically, RL for language processing uses a left-to-right structure (a.k.a. a linear-chain or sequence structure).
Algorithms: CRFs, Max-Margin Markov Networks (M3N), SVM for Interdependent and Structured Outputs (SVM-ISO), the structured perceptron, etc.

MACHINE LEARNING FOR SLU (continued)
Background: Maximum Entropy (a.k.a. logistic regression).
- Conditional and discriminative; unstructured (no dependencies among the outputs).
- Used for the dialog act classification problem: P(z|x) ∝ exp(Σ_k λ_k h_k(z, x)).
Conditional Random Fields [Lafferty et al., 2001].
- The structured version of MaxEnt (argmax search at inference time); undirected graphical models, popular in language and text processing.
- A linear-chain structure is used for practical implementation, with transition features f_k(y_{t-1}, y_t) and observation features g_k(y_t, x_t).
- Used for the named entity recognition problem.
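A minimal sketch of Cascade Approach I under stated assumptions: a linear-chain CRF (via the sklearn-crfsuite package) tags named entities, and a MaxEnt (logistic regression) classifier from scikit-learn then predicts the dialog act from the words plus the predicted tags. The toy utterances, tag sets, and features are invented for illustration and are not from the tutorial.

```python
# Cascade Approach I sketch: CRF tags named entities, then MaxEnt predicts the dialog act.
# Assumes: pip install sklearn-crfsuite scikit-learn ; the training data below is invented.
import sklearn_crfsuite
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train = [
    ("find korean restaurants in daeyidong pohang",
     ["O", "FOOD_TYPE-B", "O", "O", "ADDRESS-B", "CITY-B"], "SEARCH_RESTAURANT"),
    ("find japanese restaurants in gangnam seoul",
     ["O", "FOOD_TYPE-B", "O", "O", "ADDRESS-B", "CITY-B"], "SEARCH_RESTAURANT"),
    ("how far is the restaurant",
     ["O", "O", "O", "O", "O"], "ASK_DISTANCE"),
]

def word_feats(sent, i):
    # simple per-token features: the word itself plus its neighbours
    return {"w": sent[i],
            "prev": sent[i - 1] if i else "<s>",
            "next": sent[i + 1] if i + 1 < len(sent) else "</s>"}

X_ne = [[word_feats(s.split(), i) for i in range(len(s.split()))] for s, _, _ in train]
y_ne = [tags for _, tags, _ in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)   # sequential labeling
crf.fit(X_ne, y_ne)

# Dialog-act classifier uses the words plus the predicted NE tags as features: (x, y) -> z.
vec = CountVectorizer()
X_da = vec.fit_transform([s + " " + " ".join(t) for s, t, _ in train])
maxent = LogisticRegression(max_iter=1000).fit(X_da, [z for _, _, z in train])

utt = "find korean restaurants in hyoja pohang"
tags = crf.predict([[word_feats(utt.split(), i) for i in range(len(utt.split()))]])[0]
da = maxent.predict(vec.transform([utt + " " + " ".join(tags)]))[0]
print(tags, da)
```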
SUMMARY (SLU)
- Dialog act identification is solved with an isolated (independent) classifier such as Naïve Bayes or MaxEnt. Input: "Find Korean restaurants in Daeyidong, Pohang"; output: SEARCH_RESTAURANT.
- Named entity recognition is solved with a structured (relational) classifier such as an HMM or CRFs. Input: "Find Korean restaurants in Daeyidong, Pohang"; output: FOOD_TYPE, ADDRESS, CITY.

Coffee Break

DIALOG MANAGEMENT

DIALOG MANAGEMENT
Input x: words or intentions; output y: system response; a learning algorithm is trained from (x, y) training examples.
The central component of a dialog system, which produces system responses using external knowledge sources [McTear, 2004].

DIALOG MANAGEMENT: GOAL
Answer the user's query (e.g., a question or an order) in the given task domain. This includes:
- Providing query results
- Asking for further slot information
- Confirming user utterances
- Notifying the user of an invalid query
- Suggesting alternatives
This is related to dialog complexity and task complexity. In practice: find the best system action a given the dialog state s.

DESIGN ISSUES: TASK COMPLEXITY
How hard is the task? How much domain knowledge does the system have?
From simple to complex: call routing, weather information, automatic banking, conversational English tutoring, collaborative planning.

DESIGN ISSUES: DIALOG COMPLEXITY
What dialog phenomena are allowed?
- Initiative strategies: e.g., system-initiative vs. user-initiative vs. mixed-initiative
- Meta-dialogs (dialog about the dialog itself): e.g., "Could you hold on for a minute?"
- Subdialogs (clarification/confirmation): e.g., "You selected KE airlines, is that right?"
- Multiple dialog threads: e.g., domain switching

DIALOG EXAMPLES: EXAMPLE 1
S: Where are you going?
U: Gangnam, Seoul.
S: What kind of food would you like in Gangnam, Seoul?
U: African food, please.
S: Sorry, there are no African restaurants. Please select another kind of food.
System-initiative; implicit confirmation; merely informs the user of the failed query; rigid dialog flow.

DIALOG EXAMPLES: EXAMPLE 2
S: May I help you?
U: I'd like to have African food in Gangnam, Seoul.
S: Sorry, there are no African restaurants.
S: There are Korean, Japanese, Indian, and American restaurants in Gangnam.
U: Korean restaurants, please.
S: Would you like to find Korean restaurants?
U: Yes, please.
Mixed-initiative; implicit/explicit confirmation; suggests alternatives when the query fails; a more natural dialog flow.

DIALOG EXAMPLES: EXAMPLE 3
U: I'd like to have African food in Gangnam, Seoul.
S: Sorry, there are no African restaurants.
S: How about American restaurants in Gangnam, Seoul?
U: No, I don't like American food.
S: What is your favorite food?
U: I like grilled and seasoned beef.
S: Then how about Korean restaurants?
U: Good.
Mixed-initiative; implicit/explicit confirmation; recommends alternatives when the query fails; the most natural dialog flow.

DIALOG CONTROL
Finite-state based approach
- Input: a single word or phrase.
- A state transition network (or graph): it allows only the legal dialog flows that are pre-defined in the state diagram.
Frame-based approach
- Input: natural language with concept spotting.
- Form-filling tasks that access an information source, but the questions do not have to be asked in a predetermined sequence.
Plan-based approach
- Input: unrestricted natural language.
- Models dialog as collaboration between intelligent agents to solve some problem or task; suited to more complex tasks such as negotiation and problem solving.

KNOWLEDGE-BASED DM (KBDM)
Rule-based approaches: early KBDMs were developed with handcrafted rules (e.g., information state update); a simple example is given by [Larsson and Traum, 2003].
Agenda-based approaches: recent KBDMs are built from domain-specific knowledge plus a domain-independent dialog engine.

VoiceXML
What is VoiceXML? The HTML (XML) of the voice web: the open standard markup language for voice applications.
VoiceXML resources: http://www.voicexml.org/
It enables rapid implementation and management, integration with the World Wide Web, mixed-initiative dialog, and a simple dialog implementation solution.

VoiceXML EXAMPLE
S: Say one of: Sports scores; Weather information; Log in.
U: Sports scores.
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu>
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="http://www.example.com/sports.vxml"> Sports scores </choice>
    <choice next="http://www.example.com/weather.vxml"> Weather information </choice>
    <choice next="#login"> Log in </choice>
  </menu>
</vxml>

AGENDA-BASED DM
RavenClaw DM (CMU): uses hierarchical task decomposition, i.e., a tree of dialog agents that covers the set of all possible dialogs in the domain, where each agent handles the corresponding part of the dialog task [Bohus and Rudnicky, 2003].

EXAMPLE-BASED DM (EBDM)
Example-based approaches: a dialog corpus is indexed by semantic and discourse features, and the system retrieves the example whose state is most similar to the current dialog state.
Turn #1 (Domain = Building_Guidance)
USER: 회의실이 어디지? ("Where is the meeting room?")
  [Dialog Act = WH-QUESTION] [Main Goal = SEARCH-LOC] [ROOM-TYPE = 회의실]
SYSTEM: 3층에 교수회의실, 2층에 대회의실, 소회의실이 있습니다. ("The faculty meeting room is on the 3rd floor, and the large and small meeting rooms are on the 2nd floor.")
  [System Action = inform(Floor)]
The dialog example is indexed by semantic and discourse features:
  Domain = Building_Guidance
  Dialog Act = WH-QUESTION, Main Goal = SEARCH-LOC
  ROOM-TYPE = 1 (filled), ROOM-NAME = 0 (unfilled), LOC-FLOOR = 0, PER-NAME = 0, PER-TITLE = 0
  Previous Dialog Act = <s>, Previous Main Goal = <s>
  Discourse History Vector = [1,0,0,0,0]
  Lexico-semantic Pattern = ROOM_TYPE 이 어디 지 ?
  System Action = inform(Floor)
The example with the most similar state in the dialog state space is selected:
e* = argmax_{e_i ∈ E} S(e_i, h)
[Lee et al., 2009]
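A minimal sketch of the example-based idea: dialog examples indexed by the semantic and discourse features above, with the closest example to the current state chosen by e* = argmax S(e_i, h). The feature-overlap similarity and the toy example records are illustrative assumptions; [Lee et al., 2009] define their own similarity measure and indexing scheme.

```python
# Example-based DM sketch: pick the indexed example most similar to the current dialog state.
from dataclasses import dataclass

@dataclass
class DialogExample:
    features: dict          # semantic & discourse index (dialog act, goal, filled slots, ...)
    system_action: str      # action to replay when this example is retrieved

corpus = [
    DialogExample({"dialog_act": "WH-QUESTION", "main_goal": "SEARCH-LOC",
                   "ROOM-TYPE": 1, "ROOM-NAME": 0}, "inform(Floor)"),
    DialogExample({"dialog_act": "WH-QUESTION", "main_goal": "SEARCH-PERSON",
                   "PER-NAME": 1}, "inform(PhoneNumber)"),
]

def similarity(example, state):
    # crude stand-in for S(e_i, h): number of index features that match the current state
    return sum(1 for k, v in example.features.items() if state.get(k) == v)

def select_action(state):
    best = max(corpus, key=lambda e: similarity(e, state))
    return best.system_action

current_state = {"dialog_act": "WH-QUESTION", "main_goal": "SEARCH-LOC",
                 "ROOM-TYPE": 1, "ROOM-NAME": 0}
print(select_action(current_state))   # -> inform(Floor)
```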
STOCHASTIC DM
Supervised approaches: find the best system action that maximizes the conditional probability P(a|s) given the dialog state s; based on supervised learning algorithms [Griol et al., 2008].
MDP/POMDP-based approaches: find the optimal system action that maximizes the expected reward given the (belief) state; based on reinforcement learning algorithms [Williams and Young, 2007].
In general the dialog state space is too large, so generalizing the current dialog state is important.
Dialog as a Markov Decision Process [Williams and Young, 2007]: the user, with goal s_u and dialog history s_d, produces a user dialog act a_u; speech understanding yields a noisy estimate ã_u; the state estimator combines these into the machine state s̃_m = (ã_u, s̃_u, s̃_d); the dialog policy maps s̃_m to a machine dialog act a_m, which is rendered by speech generation; at each turn k a reward r_k = r(s_m, a_m) is received, and reinforcement learning optimizes the MDP policy to maximize the accumulated reward R = Σ_k r_k.
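A minimal sketch of the reinforcement-learning view: tabular Q-learning of a dialog policy over a toy two-slot MDP. The states, actions, rewards, and the deterministic simulated environment are all illustrative assumptions, and are far simpler than the POMDP machinery of [Williams and Young, 2007].

```python
# Toy MDP dialog policy learned with tabular Q-learning.
# State = which of two slots (city, food) are filled; actions ask for a slot or close the dialog.
import random
from collections import defaultdict

ACTIONS = ["ask_city", "ask_food", "inform_result"]

def step(state, action):
    """Simulated environment: returns (next_state, reward, done)."""
    city, food = state
    if action == "ask_city":
        return (True, food), -1, False            # small cost for each question
    if action == "ask_food":
        return (city, True), -1, False
    # inform_result: large reward only if both slots are known
    return state, (20 if city and food else -10), True

Q = defaultdict(float)
alpha, gamma, epsilon = 0.5, 0.95, 0.2

for _ in range(2000):
    state, done = (False, False), False
    while not done:
        if random.random() < epsilon:              # epsilon-greedy exploration
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        best_next = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)])
          for s in [(False, False), (True, False), (False, True), (True, True)]}
print(policy)   # learned action for each dialog state
```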
SUMMARY (DM)
Input x: words or intentions; output y: system response. The dialog model (agenda-based, stochastic, or example-based approach) is built from a dialog corpus and an external DB.

DEMO
Building guidance dialog; TV program guide dialog; multi-domain dialog with chatting.

CHALLENGES & ISSUES: MULTI-MODAL DIALOG SYSTEM

MULTI-MODAL DIALOG SYSTEM
Inputs x: speech, gesture, face; output y: system response; a learning algorithm is trained from (x, y) training examples.

MULTIMODAL DIALOG SYSTEM
A system that supports human-computer interaction over multiple different input and/or output modes.
- Input: voice, pen, gesture, facial expression, etc.
- Output: voice, graphical output, etc.
Applications: GPS, information guide systems, smart home control, etc.
Example (voice + pen): 여기에서 여기로 가는 제일 빠른 길 좀 알려 줘. ("Tell me the fastest way from here to here.")

MOTIVATION
Speech: the ultimate interface?
(+) Natural interaction style (free speech).
(+) A richer channel: the speaker's disposition and emotional state (if systems knew how to deal with that).
(-) Inconsistent input (high error rates) and errors that are hard to correct; the natural repair process for error recovery may give a different result each time we speak the same words.
(-) Slow (sequential) output style when using TTS (text-to-speech).
How can we overcome these weak points? With a multimodal interface.

ADVANTAGES
- Task performance and user preference
- Migration of human-computer interaction away from the desktop
- Adaptation to the environment
- Error recovery and handling
- Special situations where mode choice helps

TASK PERFORMANCE AND USER PREFERENCE
Task performance and user preference for multimodal over speech-only interfaces [Oviatt et al., 1997]:
- 10% faster task completion
- 23% fewer words (shorter and simpler linguistic constructions)
- 36% fewer task errors
- 35% fewer spoken disfluencies
- 90-100% user preference for interacting this way
Speech-only dialog system: "Bring the drink on the table to the side of the bed."
Multimodal dialog system: "Bring this to here" plus pen gestures: easy, with a simplified user utterance.

MIGRATION OF HCI AWAY FROM THE DESKTOP
Small portable computing devices such as PDAs, organizers, and smart phones:
- Limited screen real estate for graphical output
- Limited input: no keyboard or mouse (arrow keys, thumbwheel)
- Complex GUIs are not feasible
- Augment the limited GUI with natural modalities such as speech and pen
Other devices: kiosks, car navigation systems, ...
- Use less space; rapid navigation over the menu hierarchy; no mouse or keyboard; speech plus pen gesture

ADAPTATION TO THE ENVIRONMENT
Multimodal interfaces enable rapid adaptation to changes in the environment by allowing the user to switch modes; mobile devices are used in multiple environments. Environmental conditions can be either physical or social.
Physical:
- Noise: increases in ambient noise can degrade speech performance; switch to GUI or stylus/pen input.
- Brightness: bright light in an outdoor environment can limit the usefulness of a graphical display.
Social:
- Speech may be easiest for a password, account number, etc., but in public places users may be uncomfortable being overheard; switch to GUI or keypad input.

ERROR RECOVERY AND HANDLING
Advantages for the recovery from and reduction of errors:
- Users intuitively pick the mode that is less error-prone.
- Language is often simplified.
- Users intuitively switch modes after an error, so the same problem is not repeated.
- Multimodal error correction.
- Cross-mode compensation (complementarity): by combining inputs from multiple modalities, a multimodal interface can potentially reduce the overall error rate.

SPECIAL SITUATIONS WHERE MODE CHOICE HELPS
- Users with disabilities
- People with a strong accent or a cold
- People with RSI
- Young children or non-literate users
- Other users who have problems handling the standard devices: mouse and keyboard
Multimodal interfaces let people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.

DEMO
Multimodal dialog in the smart home domain; English teaching dialog.

CHALLENGES & ISSUES: DIALOG SIMULATOR

SYSTEM EVALUATION
Real user evaluation: real interaction with the spoken dialog system.
1. It reflects the real world (+), but at a high cost (-).
2. Human factors: it loses objectivity (-).

SYSTEM EVALUATION
Simulated user evaluation: simulated interaction between the spoken dialog system and a simulated user in a virtual environment.
1. Low cost (+), but it is not the real world (-).
2. Consistent evaluation: it guarantees objectivity (+).

SYSTEM DEVELOPMENT
Exposing the system to diverse environments: different users, noises, unexpected focus shifts.

USER SIMULATION
Simulated users take the system output and produce simulated user input for the spoken dialog system.

PROBLEMS
User simulation for spoken dialog systems involves the following essential problems [Jung et al., 2009]:
- User intention simulation
- User utterance simulation
- ASR channel simulation

USER INTENTION SIMULATION
Goal: generate appropriate user intentions given the current dialog state, P(user_intention | dialog_state).
Example:
U1: 근처에 중국집 가자 ("Let's go to a Chinese restaurant nearby.")
S1: 행당동에 북경, 아서원, 시온반점이 있고 홍익동에 정궁중화요리, 도선동에 양자강이 있습니다. (Lists Chinese restaurants in Haengdang-dong, Hongik-dong, and Doseon-dong.)
U2: 삼성동에는 뭐가 있지? ("What is there in Samseong-dong?")
Semantic frame (intention): Dialog Act = WH-QUESTION, Main Goal = SEARCH-LOC, Named Entity = LOC_ADDRESS.

USER UTTERANCE SIMULATION
Goal: generate natural language given the user intention, P(user_utterance | user_intention).
From the semantic frame (Dialog Act = WH-QUESTION, Main Goal = SEARCH-LOC, Named Entity = LOC_ADDRESS):
- 삼성동에는 뭐가 있지? ("What is there in Samseong-dong?")
- 삼성동 쪽에 뭐가 있지? ("What is there around Samseong-dong?")
- 삼성동에 있는 것은 뭐니? ("What are the things in Samseong-dong?")
- ...

ASR CHANNEL SIMULATION
Goal: generate noisy utterances from a clean utterance at a certain error rate, P(utter_noisy | utter_clean, error_rate).
Clean utterance: 삼성동에는 뭐가 있지?
Noisy utterances:
- 삼성동에 뭐 있니?
- 삼정동에는 뭐가 있지?
- 상성동 뭐 가니?
- 삼성동에는 무엇이 있니?
- ...
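A minimal sketch of ASR channel simulation: corrupting a clean word sequence with substitutions, deletions, and insertions at a target error rate. The confusion sets, filler words, and the simple uniform error model are illustrative assumptions; [Jung et al., 2009] learn a data-driven error model instead.

```python
# ASR channel simulation sketch: inject word errors into a clean utterance at a given rate.
import random

# Toy confusion sets: plausible misrecognitions for a few words (invented for illustration).
CONFUSIONS = {
    "삼성동에는": ["삼정동에는", "상성동에는"],
    "뭐가": ["뭐", "무엇이"],
    "있지": ["있니", "가니"],
}
FILLERS = ["어", "음", "그"]          # words the recognizer might spuriously insert

def simulate_asr(clean_utterance, error_rate=0.2):
    noisy = []
    for word in clean_utterance.split():
        r = random.random()
        if r < error_rate * 0.6:                     # substitution error
            noisy.append(random.choice(CONFUSIONS.get(word, [word])))
        elif r < error_rate * 0.8:                   # deletion error
            continue
        elif r < error_rate:                         # insertion error before the word
            noisy.extend([random.choice(FILLERS), word])
        else:                                        # word passes through unchanged
            noisy.append(word)
    return " ".join(noisy)

clean = "삼성동에는 뭐가 있지"
for _ in range(3):
    print(simulate_asr(clean, error_rate=0.3))
```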
AUTOMATED DIALOG SYSTEM EVALUATION
The simulated user (intention, utterance, and ASR channel simulation) interacts with the spoken dialog system so that the system can be evaluated automatically [Jung et al., 2009].

DEMO
Self-learned dialog system; translating dialog system.

REFERENCES

REFERENCES: ASR (1/2)
L.R. Rabiner and B.H. Juang, 1993. Fundamentals of Speech Recognition. Prentice-Hall.
L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. Proceedings of the 1986 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 49-52.
K.K. Paliwal, 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer. Digital Signal Processing, vol. 2, pp. 157-173.
B.H. Juang, S.E. Levinson, and M.M. Sondhi, 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains. IEEE Transactions on Information Theory, vol. 32, no. 2, pp. 307-309.
T.J. Hazen, I.L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation. Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp. 99-104.

REFERENCES: ASR (2/2)
K. Demuynck, J. Duchateau, and D.V. Compernolle, 1997. A static lexicon network representation for cross-word context dependent phones. Proceedings of the 5th European Conference on Speech Communication and Technology, pp. 143-146.
S.J. Young, N.H. Russell, and J.H.S. Thornton, 1989. Token passing: a simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, 1996. The HTK Book. Entropic Cambridge Research Lab., Cambridge, UK. HTK website: http://htk.eng.cam.ac.uk/

REFERENCES: SLU
R. De Mori et al., 2008. Spoken language understanding for conversational systems. IEEE Signal Processing Magazine, 25(3):50-58.
Y. Wang, L. Deng, and A. Acero, September 2005. Spoken language understanding: an introduction to the statistical framework. IEEE Signal Processing Magazine, 22(5):16-31.
D. Gildea and D. Jurafsky, 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.
M. Jeong and G.G. Lee, 2006. Jointly predicting dialog act and named entity for spoken language understanding. IEEE/ACL Workshop on Spoken Language Technology (SLT).
T.G. Dietterich, 2002. Machine learning for sequential data: a review. In T. Caelli (Ed.), Structural, Syntactic, and Statistical Pattern Recognition.
J. Lafferty, A. McCallum, and F. Pereira, 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of ICML.
C. Sutton and A. McCallum, 2006. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar (Eds.), Introduction to Statistical Relational Learning. MIT Press.

REFERENCES: DM
M.F. McTear, 2004. Spoken Dialogue Technology: Toward the Conversational User Interface. Springer Verlag, London.
S. Larsson and D.R. Traum, 2000. Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering, vol. 6, pp. 323-340.
D. Bohus and A. Rudnicky, 2003. RavenClaw: dialog management using hierarchical task decomposition and an expectation agenda. Proceedings of the European Conference on Speech Communication and Technology, pp. 597-600.
D. Griol, L.F. Hurtado, E. Segarra, et al., 2008. A statistical approach to spoken dialog systems design and evaluation. Speech Communication, vol. 50, no. 8-9, pp. 666-682.
J.D. Williams and S. Young, 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, vol. 21, pp. 393-422.
C. Lee, S. Jung, S. Kim, et al., 2009. Example-based dialog modeling for practical multi-domain dialog system. Speech Communication, vol. 51, no. 5, pp. 466-484.

REFERENCES: MULTI-MODAL DIALOG SYSTEM & DIALOG SIMULATOR
S.L. Oviatt, A. DeAngeli, and K. Kuhn, 1997. Integration and synchronization of input modes during multimodal human-computer interaction. Proceedings of the Conference on Human Factors in Computing Systems (CHI '97).
R. Lopez-Cozar, A.D. la Torre, J.C. Segura, et al., 2003. Assessment of dialogue systems by means of a new simulation technique. Speech Communication, vol. 40, no. 3, pp. 387-407.
J. Schatzmann, B. Thomson, K. Weilhammer, et al., 2007. Agenda-based user simulation for bootstrapping a POMDP dialogue system. Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT/NAACL), pp. 149-152.
S. Jung, C. Lee, K. Kim, et al., 2009. Data-driven user simulation for automated evaluation of spoken dialog systems. Computer Speech and Language.

Thank You & QA