Overview of Speech and Grammar

Kevin Lin
Richard Fateman
University of Calif. Berkeley

Background

There has been much research into speech and handwriting recognition. However, recognition algorithms have limitations, and the computer will inherently have trouble with similar-looking letters and similar-sounding words. For example, it is extremely difficult to differentiate the spoken characters ‘b’, ‘d’, ‘e’, ‘c’, ‘p’, and ‘v’. Likewise, it is extremely difficult to differentiate the written characters ‘s’ and ‘5’ (especially when the 5 is written with one stroke). Handwriting recognition and voice recognition can be successfully combined in a way that dramatically increases accuracy.

Speech Recognition

What is needed to implement speech recognition of mathematics using Microsoft Speech SDK 5.1? To improve accuracy, the speech can be limited to a subset of the English language by defining an XML grammar. The grammar should be restrictive enough that invalid words and phrases can be ruled out, but versatile enough to incorporate a variety of mathematical symbols. Also, the grammar should not impede the user’s natural dictation. For example, a user can read “(x + y)/(x – y)” as “quantity x plus y over quantity x minus y”. The software should be able to determine the proper placement of the parentheses based on the user’s dictation of “quantity” and on other context or mathematical convention. As an example, if the user speaks “one over two pi”, or “1/2π”, it is unreasonable to interpret that as ½*π, as would be conventional in programming languages. A user who wanted that would have said “π/2”; the user must have meant 1/(2π).

The current version of the grammar, math.xml, allows numbers from 0 to 99, capital and lower-case English characters, capital and lower-case Greek characters, common symbols such as the exclamation mark, and several mathematical functions.
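The grouping conventions discussed above (the word “quantity” and the reading of “one over two pi”) can be illustrated with a small sketch. This is not the logic of math.xml or speech.js; it is a hypothetical Python illustration that assumes two rules: a segment introduced by “quantity” is parenthesized up to the next “over”, and operands spoken side by side (such as “two pi”) form a product that binds more tightly than “over”. The names SYMBOLS, translate_segment, and spoken_to_expression, and the tiny vocabulary, are invented for the example.

```python
# Hypothetical vocabulary; the real grammar (math.xml) is much larger.
SYMBOLS = {"plus": "+", "minus": "-", "times": "*", "one": "1", "two": "2"}
OPERATORS = {"+", "-", "*"}


def translate_segment(words):
    """Translate the words between two 'over's into an expression string."""
    grouped = bool(words) and words[0] == "quantity"
    if grouped:
        words = words[1:]
    out, run = [], []  # run collects operands spoken side by side

    def flush():
        # Assumed rule: juxtaposed operands become a parenthesized product.
        if len(run) > 1:
            out.append("(" + "*".join(run) + ")")
        elif run:
            out.append(run[0])
        run.clear()

    for word in words:
        token = SYMBOLS.get(word, word)
        if token in OPERATORS:
            flush()
            out.append(token)
        else:
            run.append(token)
    flush()
    text = " ".join(out)
    # Assumed rule: 'quantity' parenthesizes the whole segment.
    return "(" + text + ")" if grouped else text


def spoken_to_expression(phrase):
    """Split the phrase on 'over' and join the translated segments with '/'."""
    segments, current = [], []
    for word in phrase.lower().split():
        if word == "over":
            segments.append(current)
            current = []
        else:
            current.append(word)
    segments.append(current)
    return "/".join(translate_segment(segment) for segment in segments)


print(spoken_to_expression("quantity x plus y over quantity x minus y"))
print(spoken_to_expression("one over two pi"))
```

Under these assumptions the first phrase yields a fraction with both parts parenthesized, and the second puts the product of 2 and pi in the denominator rather than multiplying one half by pi.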
The user can also dictate bold, italics, or underline, or any combination of these, as a prefix to any phrase. The modifiers “Upper”, “Uppercase”, “Capital”, and “Big” designate capital Greek and English characters, and the modifiers “Lower”, “Lower Case”, and “Small” designate lower-case Greek and English characters. Lower-case modifiers are optional when dictating (e.g., saying “aye” will output “a”, saying “small a” will output “a”, and saying “capital a” will output “A”).

Numbers are implemented using three lists: Digits, Teens, and Decades. Digits and Teens are used individually to implement the numbers 0 to 19, and Decades followed by Digits are used to implement the numbers from 21 to 99. The diagram below outlines the details of the speech grammar developed to facilitate speech of mathematical symbols.

A JavaScript file, speech.js, has been developed to use math.xml. This script also writes to an output file, C:\temp\testfile.txt, which can be used to interact with SKEME through file input and output. The user can quit the program by speaking “quit”.

Currently, Microsoft Speech SDK 5.1 does not support alternates with a custom grammar. Our hope is that alternates will become available in future releases. Alternates are useful because they allow the user to choose a different word if the recognition program does not recognize the speech correctly. Alternates also help with the multimodal input of speech and handwriting by providing additional possibilities for matching handwriting and speech.

There are several ways to sidestep the lack of alternates. First, empirical data on common mistakes and similar sounds can be compiled, and a Lisp function can manually assign alternates to the output of speech.
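As a rough illustration of manually assigned alternates, the sketch below uses Python rather than the Lisp function proposed above, purely for readability. The confusion table is an assumption: it contains only the spoken letters named as hard to distinguish in the Background section, whereas a real table would be compiled from empirical data. The names CONFUSABLE_SETS, alternates_for, and with_alternates are hypothetical.

```python
# Assumed confusion table; only the letter group from the Background section.
CONFUSABLE_SETS = [
    {"b", "c", "d", "e", "p", "v"},
]


def alternates_for(token):
    """Return the other members of every confusion set containing token."""
    result = set()
    for group in CONFUSABLE_SETS:
        if token in group:
            result |= group - {token}
    return sorted(result)


def with_alternates(tokens):
    """Pair each recognized token with its manually assigned alternates."""
    return [(token, alternates_for(token)) for token in tokens]


for token, alternates in with_alternates(["x", "plus", "d"]):
    print(token, alternates)
```

A token that appears in no confusion set simply gets an empty list of alternates, so the rest of the pipeline can treat every token uniformly.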
Assigning alternates manually would provide the functionality of alternates described above; however, the manually assigned alternates would be less accurate, so recognition accuracy would not improve as much as if the alternates were passed directly from the speech recognizer. Second, a custom grammar may not be necessary. An alternative approach would be to use the built-in grammar and to dynamically alter or limit its rules. However, accuracy may be lower with this method because it does not take advantage of a limited vocabulary, so we do not recommend it.

Handwriting Recognition

Handwriting recognition can be implemented with the Microsoft Tablet PC Platform SDK 1.0. However, there may be problems because of the lack of software for non-tablet-edition PCs.

Problems in Implementing Hand and Speech Recognition

Examples of typical mathematical equations that users may encounter during everyday use are given in the attached file. These examples show that there is a timing challenge: the speech and handwriting inputs may not occur at exactly the same time. Also, each speech input may correspond to more than one handwriting input and vice versa. For example, the user may say “x” after writing the two strokes of the letter, or the user may say “sin of theta” while writing “sin θ”.

One solution may be to wait until the user is finished by adding a timeout that detects both the handwriting and speech modules being idle for a set amount of time (say, 1 second). The program would then process the data, attempting to match phrases loosely based on the time each input occurred. However, this poses a problem because the user cannot see the result of the input for several seconds unless he or she speaks very slowly.

There may also be occasions where the redundancy in input fails and the voice recognition does not agree with the handwriting recognition. In this case, the program must determine whether to use the speech input, to use the handwriting input, or to ignore the input altogether.
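One way to picture such a three-way decision is sketched below in Python. It is only an illustration under a strong assumption: that each recognizer's output is available as a list of word tokens and that a simple plausibility test exists. The test used here (an expression beginning with “integral” should end in a differential such as “dt”) is a hypothetical rule invented for the example, as are the names looks_plausible and resolve.

```python
def looks_plausible(tokens):
    """Hypothetical check: an 'integral ...' phrase must end in d<letter>."""
    if not tokens:
        return False
    if tokens[0] == "integral":
        last = tokens[-1]
        return len(last) == 2 and last[0] == "d" and last[1].isalpha()
    return True


def resolve(speech, handwriting):
    """Return the tokens to keep, or None to ignore the input altogether."""
    if speech == handwriting:
        return speech
    speech_ok = looks_plausible(speech)
    handwriting_ok = looks_plausible(handwriting)
    if handwriting_ok and not speech_ok:
        return handwriting
    if speech_ok and not handwriting_ok:
        return speech
    # Both plausible or both implausible: no basis for choosing.
    return None


print(resolve("integral t squared delta".split(), "integral t dt".split()))
```

When exactly one input passes the check, that input wins; otherwise the sketch returns None, which stands in for ignoring the input or asking the user to repeat it.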
This decision may be based on logic statements that determine which input most likely makes sense. For example, if the voice recognition returns “integral t squared delta” and the handwriting recognition returns “integral t dt”, the output of the handwriting recognition should override that of the voice recognition.

References

Microsoft Speech SDK 5.1
Microsoft Tablet PC Platform SDK
…(I did not complete the sources yet)…