SPEECH-ENABLED WEBSITES
Project Work, IP Lahti, March 2010
0. Abstract
This project work aims to develop a series of speech-enabled websites for different
purposes. The requirements for the project are the UZ distributed recognition and
synthesis applet (provided by the teachers) and some knowledge of HTML,
Javascript and the creation of JSGF grammar files. Students can apply any other
web programming knowledge they have (PHP, …).
1. Task Description
Three tasks are proposed to the students. They are expected to complete as many of
them as possible, depending on the difficulties they encounter along the way.
1. Development of a web-based language learning activity for the training of
pronunciation capabilities.
2. Development of a web-based oral dialog game.
3. Development of a speech enabled web site with accessibility for visually
impaired individuals.
Currently, the UZ Java Applet only works in Spanish, so the Spanish-speaking students
will have to translate all of the group's ideas into Spanish synthesis and grammars.
2. Common Elements
The common elements that the students will face in all three proposed tasks are:
inserting the applet, the speech synthesis functions, the speech recognition functions
and the creation of JSGF grammars.
2.1. Inserting the applet
The code for inserting the applet is shown below; the jar file “sRecoFrontEnd_simple.jar”
will be provided to the students.
<applet name="vivoreco" id="vivoreco" code="RecoFrontEndsimpleJApplet.class"
width="90" height="40" archive="sRecoFrontEnd_simple.jar">
<param name="host" value="gtc3pc23.cps.unizar.es">
<param name="port" value="22229">
<param name="sinte_speaker" value="Jorge" >
<param name="sinte_service"
value="http://gtc3pc23.cps.unizar.es:8080/tts_servlet_unizar_cache_codec/sinte">
<param name="sinte_codec" value="3">
<param name="sinte_INI" value="on">
</applet>
The Javascript functions which control the applet, allowing for speech recognition
and synthesis, are described below:
void document.vivoreco.UZStopReco();
//Deactivates the speech recognition.
void document.vivoreco.UZStartReco();
//Activates the ASR and launches the first recognition by internally calling
//recopushini(). This function has to be called only once.
void document.vivoreco.UZSinte(String sentence, String speaker);
//Synthesizes the “sentence” with the voice of “speaker”. Available speakers
//are “Jorge” and “Carmen”.
void document.vivoreco.UZSinteStop();
//Stops the current synthesis.
void document.vivoreco.UZSetBeep(int value);
//If value==1, activates beeps at the beginning and end of the ASR.
void document.vivoreco.UZSetIniColor(int value);
//Sets the color of the ASR graph in the applet (2: green, 1: yellow, 0: red).
void document.vivoreco.UZStartRecoGrammar(String url_grammar);
//Launches an ASR interaction with the grammar provided.
void recopushini();
//Internally invoked by UZStartReco().
void recoend();
//Internally invoked when the ASR finishes.
void recoerror();
//Internally invoked when the ASR crashes.
The public variables of the applet control are as follows:
public int document.vivoreco.UZerrorcode; //Error code in case of crash.
public String document.vivoreco.UZerrorstr; //Error string in case of crash.
public String document.vivoreco.UZresults; //Orthographic output of the ASR.
public String document.vivoreco.UZtags; //Tags output of the ASR.
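As a sketch of how these calls fit together, a page can drive the applet from a single
toggle button. The helper name below is our own, and the decision logic is kept in a
pure function so it can be tested outside the browser:

```javascript
// Pure helper (hypothetical name): which applet method a microphone
// toggle button should invoke, given the current listening state.
function nextRecoMethod(isListening) {
  return isListening ? 'UZStopReco' : 'UZStartReco';
}

// In the page (sketch, assumes the applet from section 2.1 is loaded):
// var listening = false;
// function onMicButton() {
//   document.vivoreco[nextRecoMethod(listening)]();
//   listening = !listening;
// }
```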
2.2. Speech synthesis
The TTS works by simply invoking the UZSinte(sentence, speaker) function. In any case,
it is strongly recommended to wrap it in a higher-level function which removes
problematic characters.
function synthesize(sentence, speaker)
{
    //Replace problematic characters with spaces
    sentence = sentence.replace(/#/g, ' ');
    sentence = sentence.replace(/%/g, ' ');
    sentence = sentence.replace(/&/g, ' ');
    sentence = sentence.replace(/-/g, ' - ');
    sentence = sentence.replace(/[\n\r\t]/g, ' ');
    //Drop characters the synthesizer cannot handle: … (8230), " (34), » (187)
    var tmp = '';
    for (var i = 0; i < sentence.length; i++)
    {
        var code = sentence.charCodeAt(i);
        if (code == 8230 || code == 34 || code == 187)
        {
            continue;
        }
        tmp = tmp + sentence.charAt(i);
    }
    document.vivoreco.UZSinte(tmp, speaker);
}
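The character-skipping loop above can also be factored into a pure helper (the name is
our own), which makes the cleanup testable outside the browser before wiring it to the
applet:

```javascript
// Remove the characters the synthesizer cannot handle, as listed in the
// wrapper above: ellipsis (8230), double quote (34), right guillemet (187).
function stripForbiddenChars(sentence) {
  var tmp = '';
  for (var i = 0; i < sentence.length; i++) {
    var code = sentence.charCodeAt(i);
    if (code === 8230 || code === 34 || code === 187) {
      continue; // skip the problematic character
    }
    tmp += sentence.charAt(i);
  }
  return tmp;
}
```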
2.3. Speech recognition
Enabling the speech recognition requires the development of three functions:
recopushini(), recoend() and recoerror(). The function recopushini() is called
by UZStartReco() and needs to be called only once, unless UZStopReco() is
used at some point.
function recopushini()
{
    var URL_grammar = "grammar URL"; //URL of the JSGF grammar file
    document.vivoreco.UZSetBeep(1);
    document.vivoreco.UZSetIniColor(2);
    document.vivoreco.UZStartRecoGrammar(URL_grammar);
}
The function recoend() is automatically called when a recognition phase finishes
and allows coding all the desired actions according to the recognition output.
function recoend()
{
    //Evaluate the strings document.vivoreco.UZresults or
    //document.vivoreco.UZtags and decide the actions to take.
    //The string "<>" in UZtags means the recognizer did not decode any
    //valuable output.
}
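As an illustration, recoend() can route on the tags through a small pure function. The
function name and the example tags below are our own, not part of the applet API:

```javascript
// Decide what to do with the tags output of the ASR (sketch).
// "<>" means the recognizer did not decode any valuable output.
function actionForTags(tags) {
  if (tags.indexOf('<>') !== -1) return 'retry';
  if (tags.indexOf('help') !== -1) return 'help';
  return 'continue';
}

// In recoend() (sketch):
// var action = actionForTags(document.vivoreco.UZtags);
// if (action === 'retry') { /* re-prompt the user */ }
```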
The function recoerror() is called when the ASR could not work due to an error. It stops
the recognizer and outputs the error message.
function recoerror()
{
document.vivoreco.UZStopReco(); //mandatory line
alert("Error: " + document.vivoreco.UZerrorstr);
}
The recognizer is invoked directly via document.vivoreco.UZStartRecoGrammar(String
url_grammar) once UZStartReco() has been called. The only parameter is the URL of the
grammar to use for recognition.
2.4. JSGF grammars
A grammar indicates to the ASR system the possible sequences of words for recognition.
In this case, JSGF grammars will be used. A grammar is a simple text file with the
following syntax:
#JSGF V1.0 ISO8859-1 es;
grammar fsg.example;
public <example> = <fon>* [<sentence>] <fon>* ;
<fon> = /1.0/ B | /1.0/ D | /1.0/ G | /1.0/ J | /1.0/ L | /1.0/ T | /1.0/ f | /1.0/ j | /1.0/ k
| /1.0/ l | /1.0/ m | /1.0/ n | /1.0/ p | /1.0/ r | /1.0/ rr | /1.0/ s |/1.0/ t | /1.0/ tS |
/1.0/ w | /1.0/ x | /1.0/ aa | /1.0/ ee |/1.0/ ii | /1.0/ oo | /1.0/ uu;
<sentence> = [<element1>] <element2>* <element3>+ ( <element4> | <element5> );
<element1> = (word1) {tag1} ;
<element2> = (word2) {tag2} ;
<element3> = (word3) {tag3} ;
<element4> = (word4) {tag4} ;
<element5> = (word5) {tag5} ;
The possibilities inside a grammar are:
• <element>: Defines an element to be substituted in a later definition
• (word1): Defines a word or chain of words as they have to be pronounced by
the speaker
• {tag1}: The tag which will be given as the output of the system
The syntax also uses the following elements:
• []: Whatever is kept in brackets may be said or not by the speaker
• *: Whatever is followed by an asterisk may appear 0 or more times
• +: Whatever is followed by a plus sign may appear 1 or more times
• |: The vertical bar works as an OR operator.
The <fon> part of the grammar allows the speaker to wrap the expected sentence in a
bigger expression. For instance, if the grammar expects the word “help”, the
speaker could utter “give me help” or “help, thank you”. Students have to decide for
each task whether to allow this or not.
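As a concrete illustration of this syntax, a minimal grammar listening for a help
command could look as follows. The words and tags are our own example, and the <fon>
rule stands for the phone list shown above:

```jsgf
#JSGF V1.0 ISO8859-1 es;
grammar fsg.help;

public <help> = <fon>* [<sentence>] <fon>* ;
<sentence> = [<polite>] <request> ;
<polite> = (por favor) {polite} ;
<request> = (ayuda) {help} | (socorro) {help} ;
// <fon> would repeat the phone alternatives of the example above
```

With this grammar the speaker could say “ayuda” or “por favor ayuda”, optionally
wrapped in other words thanks to <fon>, and the recognizer would output the tag help.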
3. Task 1: Language Learning Activity
The language learning activity should follow the structure presented by the teachers
for language learning in a web environment with the tools proposed.
A prompt has to be presented to the user (synthesized audio and an image, with the
possibility of text), and the recognition system has to capture the user’s utterance. If
the recognizer indicates it is correct (a very simple evaluation), some positive
feedback has to be given and the next prompt has to appear.
The students can obtain images for the prompt from http://www.catedu.es/arasaac
(Click Welcome for English version).
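The “very simple evaluation” can be as little as checking that the expected tag appears
in UZtags. A sketch, with names of our own choosing:

```javascript
// True when the expected tag was decoded and the output is not the
// empty "<>" result described in section 2.3.
function isCorrect(expectedTag, tags) {
  return tags.indexOf('<>') === -1 && tags.indexOf(expectedTag) !== -1;
}

// In recoend() (sketch, 'casa' is a hypothetical prompt tag):
// if (isCorrect('casa', document.vivoreco.UZtags)) {
//   synthesize('Muy bien', 'Carmen'); // positive feedback, then next prompt
// }
```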
4. Task 2: Dialog game
The dialog game activity has to present the user with some connected scenes in which
he/she can perform actions via speech. Once a certain command is decoded by the
ASR, the students have to act accordingly: synthesizing some audio, proposing the
next scene, etc.
The basis of the game is the scenes. Each scene has to be defined by the following
set of elements: a text describing the scene, an image of the scene, a set of actions
and objects to utter, and the output for each recognized action-object pair
(synthesizing speech, changing scene, finishing the game, etc.).
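One way to organise the scenes described above is a table of scene objects plus a
transition function. All names, paths and tags below are hypothetical:

```javascript
// Hypothetical scene table: each scene bundles the descriptive text, the
// image, the grammar to recognize with, and the reaction to each tag pair.
var scenes = {
  door: {
    text: 'You are in front of a closed door.',
    image: 'img/door.png',          // assumed path
    grammar: 'grammars/door.gram',  // assumed path
    actions: { 'open door': 'room', 'look door': 'door' }
  },
  room: {
    text: 'The door opens into a small room.',
    image: 'img/room.png',
    grammar: 'grammars/room.gram',
    actions: {}
  }
};

// Resolve the next scene for a recognized action-object tag pair;
// stay in the current scene when the pair is unknown.
function nextScene(current, tagPair) {
  var s = scenes[current];
  return (s && s.actions[tagPair]) || current;
}
```

In recoend(), the game would call nextScene() with the decoded tags, then synthesize
the new scene's text and launch recognition with its grammar.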
5. Task 3: Web accessibility for blind people
The web accessibility activity requires students to modify a real website to provide
speech synthesis and, optionally, speech recognition. Students have to store a real site
locally, using the “Save as” option of the browser. Then they have to include the applet
code and the Javascript files provided. After that, they just have to insert the HTML
labels defined in class to provide the speech accessibility.