SPEECH-ENABLED WEBSITES
Project Work, IP Lahti, March 2010

0. Abstract

This project work aims to develop a series of speech-enabled websites for different purposes. The requirements for the project are the UZ distributed recognition and synthesis applet (provided by the teachers) and some knowledge of HTML, Javascript and the creation of JSGF grammar files. Students may apply any other web programming knowledge they have (PHP, ...).

1. Task Description

Three tasks are proposed to the students. They are expected to complete as many of them as possible, according to the difficulties they may find during the process.

1. Development of a web-based language learning activity for the training of pronunciation skills.
2. Development of a web-based oral dialog game.
3. Development of a speech-enabled web site with accessibility for visually impaired individuals.

Currently, the UZ Java Applet only works in Spanish, so the Spanish-speaking students will have to translate all the group's ideas into Spanish synthesis and grammars.

2. Common Elements

The common elements that the students will have to face in the three proposed tasks are: inserting the applet, the speech synthesis functions, the speech recognition functions and the creation of JSGF grammars.

2.1. Inserting the applet

The code for inserting the applet is as follows; the jar file "sRecoFrontEnd_simple.jar" will be provided to the students.

<applet name="vivoreco" id="vivoreco" code="RecoFrontEndsimpleJApplet.class"
        width="90" height="40" archive="sRecoFrontEnd_simple.jar">
  <param name="host" value="gtc3pc23.cps.unizar.es">
  <param name="port" value="22229">
  <param name="sinte_speaker" value="Jorge">
  <param name="sinte_service" value="http://gtc3pc23.cps.unizar.es:8080/tts_servlet_unizar_cache_codec/sinte">
  <param name="sinte_codec" value="3">
  <param name="sinte_INI" value="on">
</applet>

The Javascript functions which control the applet, allowing for speech recognition and synthesis, are described below:

void document.vivoreco.UZStopReco();
  // Deactivates the speech recognition.

void document.vivoreco.UZStartReco();
  // Activates the ASR and launches the first recognition. Internally it calls
  // recopushini(). This function has to be called only once.

void document.vivoreco.UZSinte(String sentence, String speaker);
  // Synthesizes "sentence" with the voice of "speaker". Available speakers
  // are "Jorge" and "Carmen".

void document.vivoreco.UZSinteStop();
  // Stops the current synthesis.

void document.vivoreco.UZSetBeep(int value);
  // If value==1, activates beeps at the beginning and end of the ASR.

void document.vivoreco.UZSetIniColor(int value);
  // Sets the color of the ASR graph in the applet (2: green, 1: yellow, 0: red).

void document.vivoreco.UZStartRecoGrammar(String url_grammar);
  // Launches an ASR interaction with the grammar provided.

void recopushini();
  // Internally invoked by UZStartReco().

void recoend();
  // Internally invoked when the ASR finishes.

void recoerror();
  // Internally invoked when the ASR crashes.

The public variables of the applet are as follows:

public int document.vivoreco.UZerrorcode;    // Error code in case of crash.
public String document.vivoreco.UZerrorstr;  // Error string in case of crash.
public String document.vivoreco.UZresults;   // Orthographic output of the ASR.
public String document.vivoreco.UZtags;      // Tags output of the ASR.
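As a minimal illustration of how these calls fit together (a sketch of our own, not part of the provided material; the button labels and the Spanish sentence are arbitrary), the applet can be driven from ordinary HTML controls once the <applet> tag above is in the page:

<!-- Hypothetical example: buttons driving the applet directly. -->
<input type="button" value="Speak"
       onclick="document.vivoreco.UZSinte('Hola, bienvenido', 'Jorge');">
<input type="button" value="Stop speaking"
       onclick="document.vivoreco.UZSinteStop();">
<input type="button" value="Start listening"
       onclick="document.vivoreco.UZStartReco();">
<input type="button" value="Stop listening"
       onclick="document.vivoreco.UZStopReco();">

Remember that UZStartReco() has to be called only once; it triggers recopushini(), which is described in section 2.3.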
2.2. Speech synthesis

The TTS works by simply invoking the UZSinte(sentence, speaker) function. In any case, it is strongly recommended to wrap it in a higher-level function which removes problematic characters:

function synthesize(sentence, speaker) {
  // Replace problematic characters with spaces (global regular expressions,
  // so that every occurrence is handled).
  sentence = sentence.replace(/#/g, ' ');
  sentence = sentence.replace(/%/g, ' ');
  sentence = sentence.replace(/&/g, ' ');
  sentence = sentence.replace(/-/g, ' - ');
  sentence = sentence.replace(/\n/g, ' ');
  sentence = sentence.replace(/\r/g, ' ');
  sentence = sentence.replace(/\t/g, ' ');
  // Drop the ellipsis (8230), double quote (34) and '»' (187) characters.
  var tmp = '';
  for (var i = 0; i < sentence.length; i++) {
    var code = sentence.charCodeAt(i);
    if (code == 8230 || code == 34 || code == 187) {
      continue;
    }
    tmp = tmp + sentence.charAt(i);
  }
  document.vivoreco.UZSinte(tmp, speaker);
}

2.3. Speech recognition

Enabling the speech recognition requires the development of three functions: recopushini(), recoend() and recoerror(). The function recopushini() is triggered by UZStartReco() and needs to be called only once, unless UZStopReco() is used at some point.

function recopushini() {
  var URL_grammar = "grammar URL";     // URL of the JSGF grammar to use
  document.vivoreco.UZSetBeep(1);      // beep at the beginning and end of ASR
  document.vivoreco.UZSetIniColor(2);  // green ASR graph
  document.vivoreco.UZStartRecoGrammar(URL_grammar);
}

The function recoend() is automatically called when a recognition phase finishes and is the place to code all the desired actions according to the recognition output.

function recoend() {
  // Evaluate the strings document.vivoreco.UZresults or
  // document.vivoreco.UZtags and decide the actions to take.
  // The string "<>" in UZtags means the recognizer did not decode
  // any valuable output.
}

The function recoerror() is called when the ASR could not work due to an error. It stops the recognizer and outputs the error message.

function recoerror() {
  document.vivoreco.UZStopReco(); // mandatory line
  alert("Error: " + document.vivoreco.UZerrorstr);
}

Once UZStartReco() has been used, the recognizer is invoked directly via document.vivoreco.UZStartRecoGrammar(String url_grammar); its only parameter is the URL of the grammar to use for recognition.

2.4. JSGF grammars

A grammar tells the ASR system which sequences of words can be recognized. In this case, JSGF grammars will be used. A grammar is a simple text file with the following syntax:

#JSGF V1.0 ISO8859-1 es;
grammar fsg.example;

public <example> = <fon>* [<sentence>] <fon>*;

<fon> = /1.0/ B | /1.0/ D | /1.0/ G | /1.0/ J | /1.0/ L | /1.0/ T |
        /1.0/ f | /1.0/ j | /1.0/ k | /1.0/ l | /1.0/ m | /1.0/ n |
        /1.0/ p | /1.0/ r | /1.0/ rr | /1.0/ s | /1.0/ t | /1.0/ tS |
        /1.0/ w | /1.0/ x | /1.0/ aa | /1.0/ ee | /1.0/ ii | /1.0/ oo |
        /1.0/ uu;

<sentence> = [<element1>] <element2>* <element3>+ (<element4> | <element5>);

<element1> = (word1) {tag1};
<element2> = (word2) {tag2};
<element3> = (word3) {tag3};
<element4> = (word4) {tag4};
<element5> = (word5) {tag5};

The constructs inside a grammar are:

<element1>: defines a rule that can be substituted into a later definition.
(word1): defines a word or chain of words as they have to be pronounced by the speaker.
{tag1}: the tag which will be given as the output of the system.

The syntax also uses the following operators:

[]: whatever is kept in brackets may be said or not by the speaker.
*: whatever is followed by an asterisk may appear 0 or more times.
+: whatever is followed by a plus sign may appear 1 or more times.
|: the vertical bar works as an OR operator.

The <fon> part of the grammar allows the speaker to wrap the expected sentence into a bigger expression. For instance, if the grammar expects the word "help", the speaker could utter "give me help" or "help, thank you". Students have to decide for each task whether they allow this or not.
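To make this concrete, here is a small sketch of our own (not part of the provided material): a hypothetical grammar for a dialog game that accepts Spanish commands such as "abre la puerta" ("open the door") or "coge la llave" ("take the key"). The words and tags are invented for the illustration, and the <fon> list is shortened to the vowels for readability:

#JSGF V1.0 ISO8859-1 es;
grammar fsg.game;

// An optional phoneme filler around the command, shortened to the vowels here.
public <command> = <fon>* <action> <object> <fon>*;

<action> = (abre) {OPEN} | (coge) {TAKE} | (mira) {LOOK};
<object> = (la puerta) {DOOR} | (la llave) {KEY} | (la mesa) {TABLE};

<fon> = /1.0/ aa | /1.0/ ee | /1.0/ ii | /1.0/ oo | /1.0/ uu;

A recoend() for this grammar could then dispatch on the tags. Assuming the tags arrive space-separated in UZtags (something to verify experimentally), it could look like:

function recoend() {
  var tags = document.vivoreco.UZtags;
  if (tags == "<>") {
    // Nothing valuable decoded: ask the user to repeat.
    synthesize('No te he entendido', 'Jorge'); // "I did not understand you"
    return;
  }
  // Hypothetical dispatch on the action-object pair.
  if (tags.indexOf('OPEN') != -1 && tags.indexOf('DOOR') != -1) {
    synthesize('La puerta se abre', 'Jorge'); // "The door opens"
    // ...change to the next scene here...
  }
}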
3. Task 1: Language Learning Activity

The language learning activity should follow the structure presented by the teachers for language learning in a web environment with the tools proposed. A prompt has to be presented to the user (synthesized audio and an image, optionally with text) and the recognition system has to capture the user's utterance. If the recognizer indicates that the utterance is correct (a very simple evaluation), some positive feedback has to be given and the next prompt has to appear; a sketch of this loop is given at the end of this document. The students can obtain images for the prompts from http://www.catedu.es/arasaac (click "Welcome" for the English version).

4. Task 2: Dialog Game

The dialog game activity has to present the user with a set of connected scenes in which he/she can perform actions via speech. Once a certain command is decoded by the ASR, the students have to act accordingly: synthesizing some audio, proposing the next scene, etc.

The basis of the game is the scenes. Each scene has to be defined by the following set of elements:

- A text describing the scene.
- An image of the scene.
- A set of actions and objects to utter.
- The output for each recognized action-object pair (synthesizing speech, changing scene, finishing the game, etc.).

5. Task 3: Web accessibility for blind people

The web accessibility activity asks students to modify a real web site to provide speech synthesis and, optionally, speech recognition. Students have to store a real site locally, with the "Save as" option of the browser. Then, they have to insert the applet code and include the Javascript files provided. With that, they just have to insert the HTML labels defined in class to provide the speech accessibility.
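Since the exact HTML labels are the ones defined in class, the following is only a hypothetical sketch of the idea behind Task 3: we invent a convention where any element marked with class="speakable" is read aloud when the user clicks on it, reusing the synthesize() wrapper from section 2.2.

// Hypothetical sketch: the real HTML labels are the ones defined in class.
function enableSpeakables() {
  var nodes = document.getElementsByTagName('*');
  for (var i = 0; i < nodes.length; i++) {
    if (nodes[i].className == 'speakable') {
      nodes[i].onclick = function () {
        // textContent is used as a fallback for browsers without innerText.
        var text = this.innerText || this.textContent;
        synthesize(text, 'Jorge'); // wrapper from section 2.2
      };
    }
  }
}
window.onload = enableSpeakables;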
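Finally, here is the Task 1 prompt loop mentioned in section 3, again as a hypothetical sketch of our own: the prompts, the image element id, the tags and the feedback sentences are all invented, and the evaluation is deliberately very simple, as the task requires.

// Hypothetical Task 1 sketch: show a prompt, listen, give feedback.
var prompts = [
  { image: 'casa.png',  tag: 'CASA'  },  // "house"
  { image: 'perro.png', tag: 'PERRO' }   // "dog"
];
var current = 0;

function showPrompt() {
  // Assumes an <img id="promptimage"> element in the page.
  document.getElementById('promptimage').src = prompts[current].image;
  synthesize('Di la palabra', 'Carmen'); // "Say the word"
}

function recoend() {
  // Very simple evaluation: did the expected tag come back from the ASR?
  if (document.vivoreco.UZtags.indexOf(prompts[current].tag) != -1) {
    synthesize('Muy bien', 'Carmen'); // positive feedback: "Very good"
    current = (current + 1) % prompts.length; // move to the next prompt
  } else {
    synthesize('Inténtalo otra vez', 'Carmen'); // "Try again"
  }
  showPrompt();
  // Depending on the applet behaviour, a new recognition may need to be
  // relaunched here via recopushini().
}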