Speech and Language Modeling
Shaz Husain, Albert Kalim, Kevin Leung, Nathan Liang

Voice Recognition
The field of computer science that deals with designing computer systems that can recognize spoken words. Voice recognition implies only that the computer can take dictation, not that it understands what is being said.

Voice Recognition (continued)
A number of voice recognition systems are available on the market. The most powerful can recognize thousands of words. However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent. Such systems are said to be speaker dependent.

Voice Recognition (continued)
Many systems also require that the speaker speak slowly and distinctly and separate each word with a short pause. These systems are called discrete speech systems. Recently, great strides have been made in continuous speech systems -- voice recognition systems that allow you to speak naturally. There are now several continuous-speech systems available for personal computers.

Voice Recognition (continued)
Because of their limitations and high cost, voice recognition systems have traditionally been used only in a few specialized situations. For example, such systems are useful when the user is unable to use a keyboard to enter data because his or her hands are occupied or disabled. Instead of typing commands, the user can simply speak into a headset. Increasingly, however, as cost decreases and performance improves, speech recognition systems are entering the mainstream and are being used as an alternative to keyboards.

Natural Language Processing
Comprehending human languages falls under a different field of computer science called natural language processing. A natural language is a human language: English, French, and Mandarin are natural languages; computer languages, such as FORTRAN and C, are not. Probably the single most challenging problem in computer science is to develop computers that can understand natural languages. So far, a complete solution to this problem has proved elusive, although a great deal of progress has been made.

Proteus Project
At New York University, members of the Proteus Project have been doing natural language processing (NLP) research since the 1960s. Basic research areas: grammars and parsers, translation models, domain-specific language, bitext maps and alignment, evaluation methodologies, paraphrasing, and predicate-argument structure.

Proteus Project: Grammars and Parsers
Grammars are models of linguistic structure. Parsers are algorithms that infer linguistic structure, given a grammar and a linguistic expression. Given a grammar, we can design a parser to infer structure from linguistic data. Also, given some parsed data, we can learn a grammar.
Example of a research application: the Apple Pie Parser for English. For example, "I love an apple pie" will be parsed as
(S (NP (PRP I)) (VP (VBP love) (NP (DT an) (NN apple) (NN pie))) (. -PERIOD-))
(a small sketch of loading this parse appears below, after the Translation Models slide).
Web-based application: http://complingone.georgetown.edu/~linguist/applepie.html

Proteus Project: Translation Models
Translation models describe the abstract/mathematical relationship between two or more languages. They are also called models of translational equivalence because the main thing they aim to predict is whether expressions in different languages have equivalent meanings. A good translation model is the key to many trans-lingual applications, the most famous of which is machine translation.
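As an illustration (not part of the original slides), the bracketed parse shown in the Apple Pie Parser slide above can be loaded and inspected with standard tools. The sketch below assumes the NLTK library is installed; it simply reads the bracketed string rather than running the Apple Pie Parser itself.

# Minimal sketch: load the bracketed parse shown above (assumes NLTK is available).
from nltk.tree import Tree

parse = "(S (NP (PRP I)) (VP (VBP love) (NP (DT an) (NN apple) (NN pie))) (. -PERIOD-))"
tree = Tree.fromstring(parse)

tree.pretty_print()                           # draw the constituent structure as ASCII art
print(tree.leaves())                          # ['I', 'love', 'an', 'apple', 'pie', '-PERIOD-']
print([t.label() for t in tree.subtrees()])   # S, NP, VP, ... constituent labels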
Proteus Project: Domain-specific Language
Sentences in different domains of discourse are structurally different. For example, imperative sentences are common in computer manuals but not in annual company reports. It would be useful to characterize these differences in a systematic way.

Proteus Project: Bitext Maps and Alignment
A "bitext" consists of two texts that are mutual translations. A bitext map is a description of the correspondence relation between elements of the two halves of a bitext. Finding such a map is the first step to building translation models. It is also the first step in applications like automatic detection of omissions in translations.

Proteus Project: Evaluation Methodologies
There are many correct ways to say almost anything, and many shades of meaning. This "ambiguity" of natural languages makes the evaluation of NLP systems difficult enough to be a research topic in itself. The Proteus Project has invented new evaluation methods in two areas of NLP where evaluation is notoriously difficult: translation modeling and word sense disambiguation.
An example of a research application: the General Text Matcher (GTM), which measures the similarity between texts. Simple applet for GTM: http://nlp.cs.nyu.edu/call_gtm.html

Proteus Project: Paraphrasing
A paraphrase relation exists between two phrases that convey the same information. The recognition of paraphrases is an essential part of many natural language applications: if we want to process text reporting fact "X", we need to know all the alternative ways in which "X" can be expressed. Capturing paraphrases by hand is an almost overwhelming task because they are so common and many are domain specific. Therefore, the Proteus Project has begun to develop procedures that learn paraphrases from text. The basic idea is to look for news stories from the same day that report the same event, and then examine the different ways in which the same fact gets reported.

Proteus Project: Predicate-Argument Structure
An analysis of sentences in terms of predicates and arguments. It is a "deeper" level of linguistic analysis than constituent structure or simple dependency structure, in particular one that regularizes over nearly equivalent surface strings.

Language Modeling
"A bad language model" (illustrative example slides).

Language Modeling: Introduction
Language modeling is one of the basic tasks in building a speech recognition system. A language model helps a speech recognizer figure out how likely a word sequence is, independent of the acoustics, and lets the recognizer make the right guess when two different sentences sound the same.

Basics of Language Modeling
Language modeling has been studied from two different points of view.
– First, as a problem of grammar inference: the model has to discriminate the sentences that belong to the language from those that do not.
– Second, as a problem of probability estimation: if the model is used for recognition, the decision is usually based on the maximum a posteriori (MAP) rule. The best sentence L is chosen so that the probability of the sentence, given the observations O, is maximized.
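Spelled out (a standard Bayes decomposition; this formula is not written on the original slide), the MAP rule is:

L* = argmax over L of P(L | O) = argmax over L of P(O | L) P(L) / P(O) = argmax over L of P(O | L) P(L)

where P(O | L) is supplied by the acoustic model and P(L) by the language model discussed in the following slides.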
What is a Language Model?
A language model is a probability distribution over word sequences, for example:
– P("And nothing but the truth") ≈ 0.001
– P("And nuts sing on the roof") ≈ 0

How Language Models Work
It is hard to compute P("And nothing but the truth") directly, so we decompose the probability with the chain rule:
P("And nothing but the truth") = P("And") × P("nothing" | "And") × P("but" | "And nothing") × P("the" | "And nothing but") × P("truth" | "And nothing but the")

Types of Language Modeling
– Statistical language modeling
– N-gram / trigram language modeling
– Structured language modeling

Statistical Language Model
A statistical language model (SLM) is a probability distribution P(s) over strings s that attempts to reflect how frequently a string s occurs as a sentence.

The Trigram / N-gram LM
Assume each word depends only on the previous two (or n-1) words -- three words total for a trigram ("tri" means three, "gram" means writing):
– P("the" | "… whole truth and nothing but") ≈ P("the" | "nothing but")
– P("truth" | "… whole truth and nothing but the") ≈ P("truth" | "but the")

Structured Language Models
Language has structure -- noun phrases, verb phrases, etc. Structured language models use the structure of language to capture long-distance information. Results are promising, but the approach is time consuming, and language is right branching.

Evaluation
Perplexity is the geometric average inverse probability. It measures language model difficulty, not acoustic difficulty: the lower the perplexity, the closer we are to the true model.

Language Modeling Techniques
Smoothing addresses the problem of data sparsity: there is rarely enough data to accurately estimate the parameters of a language model. It gives a way to combine less specific, more accurate information with more specific but noisier data. Examples: deleted interpolation, Katz (or Good-Turing) smoothing, and modified Kneser-Ney smoothing.
Caching is a widely used technique based on the observation that recently observed words are likely to occur again. Models built from recently observed data can be combined with more general models to improve performance.

LM Techniques (continued)
Skipping models use the observation that even words that are not directly adjacent to the target word contain useful information.
Sentence-mixture models use the observation that there are many different kinds of sentences; by modeling each sentence type separately, performance is improved.
Clustering: words can be grouped into clusters through various automatic techniques; then the probability of a cluster can be predicted instead of the probability of the word. Clustering can be used to make smaller models or better-performing ones.

Smoothing: Finding Parameter Values
Split the data into training, "heldout", and test sets. Try lots of different values for the interpolation weights (λ) on the heldout data and pick the best, then evaluate on the test data. Sometimes tricks like EM (expectation maximization) can be used to find the values. Heldout data should have (at least) 100-1000 words per parameter; test data should be large enough to be statistically significant (thousands of words, perhaps).

Caching: Real Life
Someone says "I swear to tell the truth". The system hears "I swerve to smell the soup". The cache remembers! The person then says "The whole truth", and, with the cache, the system hears "The whole soup" -- errors are locked in. Caching works well when the user corrects as they go; it works poorly, or even hurts, without correction.

Caching
If you say something, you are likely to say it again later. Interpolate the trigram model with a cache model built from the recent history.
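To make these techniques concrete, here is a minimal sketch (not part of the original slides) of a trigram model with deleted-interpolation smoothing and a perplexity computation. The toy corpus, the fixed λ weights, and all function names are illustrative assumptions; real systems tune the weights on heldout data (for example with EM) and smooth the unigram distribution as well.

# Minimal sketch of an interpolated trigram model and perplexity.
# The tiny corpus and the fixed lambda weights are illustrative assumptions;
# in practice the weights would be tuned on heldout data.
import math
from collections import Counter

corpus = "and nothing but the truth and nothing but the whole truth".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def p_interp(w, u, v, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u v) as a weighted mix of trigram, bigram and unigram estimates."""
    l3, l2, l1 = lambdas
    p3 = trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0
    p2 = bigrams[(v, w)] / unigrams[v]         if unigrams[v]      else 0.0
    p1 = unigrams[w] / total
    return l3 * p3 + l2 * p2 + l1 * p1

def perplexity(words):
    """Geometric average inverse probability over the trigram contexts in a sequence."""
    log_prob = 0.0
    for u, v, w in zip(words, words[1:], words[2:]):
        log_prob += math.log(p_interp(w, u, v))
    return math.exp(-log_prob / (len(words) - 2))

print(p_interp("truth", "but", "the"))   # trigram seen in training: relatively high
print(p_interp("soup", "but", "the"))    # unseen word: 0.0 here, which is why real
                                         # models smooth the unigram estimate too
print(perplexity("and nothing but the truth".split()))

A cache model would simply add one more term to the same interpolation, estimated from the user's recently dictated words.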
Skipping
P(z | …rstuvwxy) ≈ P(z | vwxy). Why not P(z | v_xy)? A "skipping" n-gram skips the value of the 3-back word.
Example: P(time | show John a good) → P(time | show ____ a good).
Skipping models can be interpolated with ordinary n-grams, for example (with weights λ and μ):
P(z | …rstuvwxy) ≈ λ P(z | vwxy) + μ P(z | vw_y) + (1 - λ - μ) P(z | v_xy)

Clustering
Clustering = classes (same thing). What is P("Tuesday" | "party on")? It is similar to P("Monday" | "party on") and to P("Tuesday" | "celebration on"). Put words in clusters:
– WEEKDAY = Sunday, Monday, Tuesday, …
– EVENT = party, celebration, birthday, …

Predictive Clustering Example
Find P(Tuesday | party on) ≈ Psmooth(WEEKDAY | party on) × Psmooth(Tuesday | party on WEEKDAY). Suppose the counts are:
– C(party on Tuesday) = 0
– C(party on Wednesday) = 10
– C(arriving on Tuesday) = 10
– C(on Tuesday) = 100
Then Psmooth(WEEKDAY | party on) is high, and Psmooth(Tuesday | party on WEEKDAY) backs off to Psmooth(Tuesday | on WEEKDAY). (A small sketch of this decomposition appears after the Microsoft research slides below.)

Microsoft Language Modeling Research
Microsoft language modeling research falls into several categories.

Microsoft Language Modeling Research: Language Model Adaptation
Natural language technology in general, and language models in particular, are very brittle when moving from one domain to another. Current statistical language models are built from text specific to newspapers and TV/radio broadcasts, which has little to do with the everyday use of language by a particular individual. We are investigating means of adapting a general-domain statistical language model to a new domain/user when we have access to limited amounts of sample data from the new domain/user.

Microsoft Language Modeling Research: Can Syntactic Structure Help?
Current language models make no use of the syntactic properties of natural language but rather use very simple statistics such as word co-occurrences. Recent results show that incorporating syntactic constraints in a statistical language model reduces the word error rate on a conventional dictation task by 10%. We are working on finding the best way of "putting language into language models" as well as exploring the new possibilities opened by such structured language models for other tasks such as speech and language understanding.

Microsoft Language Modeling Research: Speech Utterance Classification
A simple first step to more natural user interfaces in interactive voice response systems is automated call routing. Instead of listening to prompts like "If you are trying to reach department X say Yes, otherwise say No" or punching keys on your telephone keypad, one could simply state in a sentence what the problem is, for example "There is a fraudulent transaction on my last statement", and get connected to the right customer service representative. We are developing technology that aims at classifying speech utterances into a limited set of classes, enhancing the role of the traditional language model so that it also assigns a category to a given utterance.

Microsoft Language Modeling Research: Building the best language models we can
In general, the better the language model, the lower the error rate of the speech recognizer. By putting together the best results available on language modeling, we have created a language model that outperforms a standard baseline by 45%, leading to a 10% reduction in error rate for our speech recognizer. The system has the best reported results of any language model.

Microsoft Language Modeling Research: Language modeling for other applications
Speech recognition is not the only use for language models. They are also useful in fields like handwriting recognition, spelling correction, even typing Chinese! Like speech recognition, all of these are areas where the input is ambiguous in some way, and a language model can help us guess the most likely input. We're also working on finding new uses for language models, in other areas.
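The following minimal sketch (not from the original slides) works through the predictive-clustering decomposition above using the toy counts given there. The cluster set, the counts, and the crude add-one smoothing are illustrative assumptions standing in for a real Psmooth.

# Minimal sketch of predictive clustering: P(Tuesday | party on) is estimated as
# P(WEEKDAY | party on) * P(Tuesday | party on WEEKDAY), with backoff when the
# trigram is unseen. Counts are the toy values from the slide above.
WEEKDAY = {"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"}

# trigram counts C(w1 w2 w3) from the slide (everything not listed is 0)
counts = {
    ("party", "on", "Wednesday"): 10,
    ("arriving", "on", "Tuesday"): 10,
}
# bigram counts C(w2 w3) used for the backoff estimate
bigram_counts = {("on", "Tuesday"): 100, ("on", "Wednesday"): 10}

def count_cluster(w1, w2):
    """C(w1 w2 WEEKDAY): how often the context is followed by any weekday."""
    return sum(c for (a, b, w), c in counts.items() if (a, b) == (w1, w2) and w in WEEKDAY)

def p_cluster_given_context(w1, w2):
    """P(WEEKDAY | w1 w2), crudely smoothed with add-one."""
    context_total = sum(c for (a, b, _), c in counts.items() if (a, b) == (w1, w2))
    return (count_cluster(w1, w2) + 1) / (context_total + 2)

def p_word_given_cluster(word, w1, w2):
    """P(word | w1 w2 WEEKDAY); back off to P(word | w2 WEEKDAY) when the trigram is unseen."""
    c = counts.get((w1, w2, word), 0)
    if c:
        return c / count_cluster(w1, w2)
    weekday_total = sum(n for (a, w), n in bigram_counts.items() if a == w2 and w in WEEKDAY)
    return bigram_counts.get((w2, word), 0) / weekday_total

p = p_cluster_given_context("party", "on") * p_word_given_cluster("Tuesday", "party", "on")
print(p)  # nonzero even though C(party on Tuesday) = 0, thanks to the cluster and backoff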
Microsoft Speech Software Development Kit
Enables developers to create, debug, and deploy speech-enabled ASP.NET Web applications intended for deployment to a Microsoft Speech Server. Applications are designed for devices ranging from telephones to Windows Mobile™-based devices and desktop PCs.

Speech Application Language Tags (SALT)
SALT is an XML-based API that brings speech interactions to the Web. SALT is an extension of HTML and other markup languages (cHTML, XHTML, WML) that adds a powerful speech interface to Web pages, while maintaining and leveraging all the advantages of the Web application model. These tags are designed to be used for both voice-only browsers (for example, a browser accessed over the telephone) and multimodal browsers.
SALT is a small set of XML elements, with associated attributes and DOM object properties, events, and methods, which may be used in conjunction with a source markup document to apply a speech interface to the source page. The SALT formalism and semantics are independent of the nature of the source document, so SALT can be used equally effectively within HTML and all its flavors, or with WML, or with any other SGML-derived markup.

What kind of applications can we build with SALT?
SALT can be used to add speech recognition, speech synthesis, and telephony capabilities to HTML- or XHTML-based applications, making them accessible from telephones as well as from GUI-based devices such as PCs, tablet PCs, and wireless personal digital assistants (PDAs).

XML (Extensible Markup Language)
XML is a collection of protocols for representing structured data in a text format that makes it straightforward to interchange XML documents on different computer systems. XML allows new markups. XML documents contain sets of data structures, which can be transformed into appropriate formats using XSL or XSLT.

The main top-level elements
<prompt …> – for speech synthesis configuration and prompt playing
<listen …> – for speech recognizer configuration, recognition execution and post-processing, and recording
<dtmf …> – for configuration and control of DTMF collection
<smex …> – for general-purpose communication with platform components

The input elements
<listen> and <dtmf> also contain grammars and binding controls:
<grammar …> – for specifying input grammar resources
<bind …> – for processing of recognition results
<record …> – for recording audio input

Speech Library Example
(Example screens shown on the original slides.)

Example
<input name="Date" type="Dates" />
<input name="PersonToMeet" type="text" />
<input name="Duration" type="time" />
…
<prompt …>
  Schedule a meeting
  <value targetElement="Date"/> Date
  <value targetElement="Duration"/> Duration
  <value targetElement="PersonToMeet"/> Person
</prompt>
<listen …>
  <grammar …/>
  <bind test="/@confidence $lt$ 50" targetElement="prompt_confirm" targetMethod="start"/>
  <bind test="/@confidence $lt$ 50" targetElement="listen_confirm" targetMethod="start"/>
  <bind test="/@confidence $ge$ 50" targetElement="Date" value="//Meeting/Date"/>
  <bind test="/@confidence $ge$ 50" targetElement="Duration" value="//Meeting/Duration"/>
  <bind test="/@confidence $ge$ 50" targetElement="PersonToMeet" value="//Meeting/Person"/>
  …
</listen>

Example (continued)
<rule name="MeetingProperties">
  <l>
    <ruleref name="Date"/>
    <ruleref name="Duration"/>
    <ruleref name="Time"/>
    <ruleref name="Person"/>
    <ruleref name="Subject"/>
    ..
    ..
  </l>
  <o>
    <ruleref name="Meeting"/>
  </o>
  <output>
    <Calendar:meeting>
      <DateTime>
        <xsl:apply-templates name="DayOfWeek"/>
        <xsl:apply-templates name="Time"/>
        <xsl:apply-templates name="Duration"/>
      </DateTime>
      <PersonToMeet>
        <xsl:apply-templates name="Person"/>
      </PersonToMeet>
    </Calendar:meeting>
  </output>
</rule>

<l propname="DayOfWeek">
  <p valstr="Sun"> Sunday </p>
  <p valstr="Mon"> Monday </p>
  <p valstr="Mon"> first day </p>
  ..
  ..
  <p valstr="Sat"> Saturday </p>
</l>
Voice: "first day" generates an XML element:
<DayOfWeek text="first day">Mon</DayOfWeek>
(Saying "Monday" would generate the same value, Mon.)

<l propname="Person">
  <p valstr="Nathan">CEO</p>
  <p valstr="Nathan">Nathan</p>
  <p valstr="Nathan">boss</p>
  <p valstr="Albert">programmer</p>
  ……
</l>
Voice: "CEO" generates:
<Person text="CEO">Nathan</Person>

XML Result
<calendar:meeting text="…">
  <DateTime text="…">
    <DayOfWeek text="…">Monday</DayOfWeek>
    <Time text="…">2:00</Time>
    <Duration text="…">3600</Duration>
  </DateTime>
  <Person>Nathan</Person>
</calendar:meeting>

How SALT Works
Multimodal – For multimodal applications, SALT can be added to a visual page to support speech input and/or output. This is a way to speech-enable individual controls for 'push-to-talk' form-filling scenarios, or to add more complex mixed-initiative capabilities if necessary. A SALT recognition may be started by a browser event such as pen-down on a textbox, for example, which activates a grammar relevant to the textbox and binds the recognition result into the textbox.
Telephony – For applications without a visual display, SALT manages the interactional flow of the dialog and the extent of user initiative by using the HTML eventing and scripting model. In this way, the full programmatic control of client-side (or server-side) code is available to application authors for the management of prompt playing and grammar activation.

Sample Implementation Architecture
A Web server. This Web server generates Web pages containing HTML, SALT, and embedded script. The script controls the dialog flow for voice-only interactions. For example, the script defines the order for playing the audio prompts to the caller, assuming there are several prompts on a page.
A telephony server. This telephony server connects to the telephone network. The server incorporates a voice browser interpreting the HTML, SALT, and script. The browser can run in a separate process or thread for each caller. Of course, the voice browser interprets only a subset of HTML, since much of HTML refers to GUIs and is not relevant to a voice browser.
A speech server. This speech server recognizes speech, plays audio prompts, and plays responses back to the user.
The client device. Clients include, for example, a Pocket PC or desktop PC running a version of Internet Explorer capable of interpreting HTML and SALT.

SALT Architecture
(Architecture diagram shown on the original slides.)

Multimodal Interactive Notepad (MiPad)
MiPad's speech input addresses the defects of the handheld, such as the struggle to wrap your hands around a small pen and hit the tiny target known as an on-screen keyboard. Some of the current limitations of speech recognition -- background noise, multiple users, accents, and idioms -- can be helped with pen input.

MiPad: What does it do?
MiPad cleverly sidesteps some of the problems of speech technology by letting the user touch the pen to a field on the screen, directing the speech recognition engine to expect certain types of input. The Speech group calls this technology "Tap and Talk."
If you're sending an e-mail and you tap the "To" field with the pen before you speak, the system knows to expect a name. It won't try to translate "Helena Bayer" into "Hello there." The semantic information related to this field is limited, leading to a reduced error rate. On the other hand, if you're filling in the subject field and using free-text dictation, the engine behind MiPad knows to expect anything.
This is where the "Tap and Talk" technology comes in handy again. If the speech recognition engine has translated your spoken "I saw a bear" into the text "I saw a hair," you can use the stylus to tap on the word "hair" and repeat "bear" to correct the input. This focused correction, an evolution of the mouse pointer, is easy and painless compared to having to re-type or repeat the complete sentence.
The "Tap and Talk" interface is always available on your MiPad device. The user can give spontaneous commands by tapping the Command button and talking to the handheld. You might tell your MiPad device, "I want to make an appointment," and the MiPad will obediently bring up an appointment form for you to fill in with speech, pen, or both.

Some Projects on Their Way

Projects for Speech Recognition
– Robust techniques for speech recognition in noisy environments (funded by EPSRC and Bluechip Technologies Ltd, Belfast)
– Improved large-vocabulary speech recognition using syllables
– Multi-modal techniques for improved speech recognition (e.g., combining audio and visual information)

Projects for Speech Recognition (continued)
– Decision-tree unified multi-resolution models for speech communication on mobile devices in noisy environments (funded by EPSRC, in collaboration)
– Modeling Voice, Accent and Emotion in Text-to-Speech Synthesis (funded by EPSRC, in collaboration)
– TCS Programme No 3191 (in collaboration with Bluechip Technologies Ltd, Belfast)

Projects for Language Modeling
– Development and Integration of Statistical Speech and Language Models (funded by EPSRC)
– Comparison of Human and Statistical Language Model Performance (funded by EPSRC)
– Improved statistical language modeling through the use of domains
– Modeling individual words as a means of increasing the predictive power of a language model

Robust techniques for speech recognition in noisy environments (funded by EPSRC)
Speech recognition degrades dramatically when a mismatch occurs between training and operating conditions, for example a mismatch due to ambient or communications-channel noise. The project focuses on robust signal pre-processing and assumes knowledge about the noise or the environment.

Robust techniques for speech recognition in noisy environments (funded by EPSRC)
Types of corruption considered:
– frequency-band corruption
– partial-time duration corruption
– partial feature-stream corruption (some components are more sensitive than others)
– inaccurate noise-reduction processing
– combinations of the above

Improved large-vocabulary speech recognition using syllables
– Fast bootstrapping of initial phone models of a new language: requires less training data.
– Generating baseforms (phonetic spellings) for phonetic languages: requires deep linguistic knowledge.

Improved large-vocabulary speech recognition using syllables
Bootstrapping: an existing acoustic model is used to obtain initial phone models, either through alignment of target-language speech or through alignment of base-language speech data.
Statistical baseform generation: based on context-dependent decision trees; a tree is built for each letter.
Multi-modal techniques for improved speech recognition (e.g., combining audio and visual information)
Focus on the problem of combining visual cues with audio signals for the purpose of improved automatic machine recognition. Large-vocabulary continuous speech recognition (LVCSR) has made significant progress, yet only under controlled conditions. Recognition of speech utterances with visual cues is still limited to small vocabularies, speaker-dependent training, and isolated-word speech.

Decision-tree unified multi-resolution models for speech communication on mobile devices in noisy environments
– Re-configurable multi-resolution decision-tree modeling
– Prediction of the time-varying spectrum of non-stationary noise sources
– Developing a unified model for speech integrating features for recognition and synthesis, including speaker adaptation
– Dynamic multi-resolution models to mitigate the impact of distortion of low-amplitude, short-duration speech

Modeling Voice, Accent and Emotion in Text-to-Speech Synthesis
(Audio samples on the original slides demonstrate neutral, bored, angry, happy, sad, and frightened emotions.)

Basic Principles of ASR
All ASR systems work in two phases:
– Training phase: the system learns reference patterns.
– Recognizing phase: an unknown input pattern is identified by considering the set of references.
Three major modules:
– Signal processing front-end: transforms speech signals into a sequence of feature vectors.
– Acoustic modeling: the recognizer matches the sequence of observations with subword models.
– Language modeling: the recognized words are used to construct a sentence.

Overview
Given the identities of all previous words, a language model is a conditional distribution on the identity of the i-th word in a sequence. A trigram model models language as a second-order Markov process. The assumption that a word depends on only the two previous words is clearly false, but it is computationally convenient.

Speech Recognition
Speech recognition is all about understanding human speech: the ability to convert speech into a sequence of words or meaning, and then into action. The challenge is how to achieve this in the real world, where unknown, time-varying noise is a factor.

Language Modeling
The goal is to provide the probabilities of phrases occurring within a given context, and thereby improve the performance of speech recognition systems and Internet search engines.

References
http://www.research.microsoft.com/~joshuago
http://homepages.inf.ed.ac.uk/s0450736/slm.html
http://www.speech.sri.com/people/stolcke/papers/icassp96/paper.html
http://www.asel.udel.edu/icslp/cdrom/vol1/812/a812.pdf
http://www2.cs.cmu.edu/afs/cs.cmu.edu/user/aberger/www/lm.html
http://www.cs.qub.ac.uk/~J.Ming/Html/Robust.htm
http://www.cs.qub.ac.uk/Research/NLSPOverview.html
http://www.research.ibm.com/people/l/lvsubram/publications/conferences/mmsp99.html
http://dea.brunel.ac.uk/cmsp/Proj_noise2003/obj.htm
http://database.syntheticspeech.de/index.html#samples
http://www.research.ibm.com/journal/rd/485/kumar.html
http://murray.newcastle.edu.au/users/staff/speech/home_pages/tutorial_sr.html