>> Mike Seltzer: So good morning, everybody. It's my great pleasure to welcome Professor Alan Black from Carnegie Mellon University from the
Language Technologies Institute here. I guess I've known Alan since, I don't know, '98, something, basically when he moved to CMU from University of
Edinburgh where he initially created the Festival open source speech synthesis engine, which is probably the most widely used open source synthesis engine out there. And he's created lots of other amazing technologies with speech, doing speech-to-speech translation on small devices and small footprint synthesizers and lots of other really amazing stuff.
So it's my privilege to have him here today to talk about recent work in rapid development of support for low resource languages.
>> Alan Black: Thanks, Mike. It's nice to be here. I know some of you from before, so that's good. And although I've got this reputation in speech synthesis, as most of you know, I'm not primarily talking about speech synthesis here today. What I'm actually talking about is recognition and synthesis, and particularly what we're interested in is being able to build support in all of those languages you've never heard of before. Okay? So how can you actually do that?
Okay?
This is joint work with Tanja Schultz. Tanja is at CMU and she's also a professor at [INAUDIBLE]. Some of you probably know who she is. And this is joint work that we've been doing for the past, I don't know, five years maybe. And we're trying to take both of our technologies, recognition and synthesis, and put them much closer together so we can actually get support for this.
So normally, when you're trying to get support in some new language, think of a language that you're not very sure about, you want to get speech recognition, so acoustic recordings and transcriptions of text, I'm sure you all know this, and speech synthesis, converting text into natural sounding speech. For the requirements you need some examples of text, you need some acoustic recordings, you need a pronunciation lexicon and you need a lot of skilled construction. Okay. Many of us know how to do this, we've done this for years, and, you know, once you've done your 40th language you begin to think, can I automate this? Because, you know, I know what to do, there's nothing new coming out of it, so how can we do it?
But if you give it to external people to do, you suddenly discover that there's an awful lot of skilled knowledge in how you would go about doing this to get it going well. Why do we care about it? You know, everybody can speak English, so why do we care, everybody I speak to speaks English, you know, so why do we care? Well, that's probably not the best way of looking at it.
And so, right, there are two people who come in late. (laughter) And so, you know, we can't really expect everybody to speak English. And even if they can speak English it might not be appropriate for them to speak English, from our point of view.
In India you have a billion people, many of whom speak English, or at least some form of English. Actually theirs is the most common form of English; the rest of us are just speaking some other dialect of it, Indian English is the most common form. So how can we actually get it so that people can communicate in their own language?
And there's an awful lot of languages in order to be able to do that. There's something in the region of 6,000 languages in the world. And realistically, okay, there's probably only about 50 that are going to be commercially viable in the normal sense of commercially viable. Okay?
And Europe has 20 plus official languages, currently 27 official languages, which vary from German with lots of speakers down to much smaller languages with only a few thousand speakers. But Europe also has lots of other minority, unofficial languages, you know, Catalan, Gaelic, Basque, et cetera, and so it would be good to be able to support them. If you want to sell things to people, it's nice if you talk to them in their own language.
And South Africa has 11 official languages, and India has 17 or 21 official languages, depending who you speak to, but many, many, many more, okay. Why would you want to be able to do this? Well, you know, speech processing in multiple languages, cross-cultural human-human interaction. It would be nice to have people who could speak to each other in the translation sense, it would be nice to be able to deliver things to each other by some human interface. If you want to be able to give information, health information, product information, bus, flight, train information, it's nice to be able to communicate in their language rather than dictating somebody else's language for them to use.
Okay.
The challenge is, actually most of the things that we do in speech are already pretty much language independent. There are language specific things. You get tones in some languages, you get stress in others, you get no spaces between words in Chinese or Japanese or Thai, so there are things which are language specific, but in general most of the techniques that we've been using have been shown to work fairly well across languages.
However, you know, we've only really addressed a small number of languages so far. There are probably less than 40 languages that have significant recordings and transcriptions in the world. Okay? Remember, making transcriptions takes about 40 times real time. For large vocabulary pronunciation lexicons, there are probably less than 20 languages. You know, this is the 200,000 word vocabulary. There's probably less than 20. You can name all the languages. Okay?
For small corpora there are probably less than 100 languages that really have, you know, real coverage from that point of view. And for speech-to-speech translation, bilingual corpora are very rare and they're almost always into and out of English, okay. We were desperately looking for Catalan to Vietnamese translation data and there isn't much of that about, okay. We probably know of the largest database of that. It's got 50 sentences in it. But anyway.
And so if you're doing this cross-lingual work, or you're just trying to get this going -- you know, even if you look at English, the amount of work that's been done in English: yeah, we can do speaker specific, quiet office recognition, and -- thank you. And you've got speaker specific type things, and then you start to worry about broadcast news, you start to worry about real conversational speech, you care about multi speaker speech. With all of these things, even just having recognition for one part of it is still something you're working on being able to do.
Also, be aware that of these 6,000 languages we're talking about, some of them don't have any writing system, okay? People who speak them might be completely fluent in writing some other language, but they just don't write that language, even though it's the one that they communicate in every day. Okay. And the solution to this: collect lots of audio data and text, do lexicon construction. We have been distributing toolkits to be able to do this for years.
There's lots of toolkits for how you do it. People know how to do it and many of you guys have done it.
And we actually give tools that allow people to build things, the FestVox tools for building synthesis voices. I have a list of 47 different languages that I know have been done with FestVox. I think it might be more than that because it's open source.
But you know, that's quite a lot. That may even be the most of any set of tools.
But, you know, that's nothing compared to the 6,000 that's going to be out there.
To do this, you're required to be an expert in speech technology. You have to know what it means to build something in speech technology. You've got to be an expert in the target language. You have to know the language. It's really hard to build synthesizers in languages you don't know.
I've done it, and you can't tell when, you know, the writing system is all backwards because it's Arabic and you didn't know. I've been there.
And these two groups of people have to be able to communicate with each other.
Which is also hard, actually, even if they're speaking the same language. And so the idea that Tanja and I had was to try to make it easier for people to build recognition and synthesis in a new language. So we really want a system that's actually going to learn the language from the user. So we're taking the speech expert out of it and having the person who cares about the Hindi support or the Basque support be the person who's actually going to drive the system. Well, the system's going to drive them, but we don't tell them that.
So the idea is interactive learning: the user is actually going to be involved, and we're going to use them to find out whether things are correct or not. We ought to be able to make it much more efficient to collect the data, so instead of the normal way of doing it, where you go and collect a hundred hours of speech from, I don't know, 100 or 200 different people, you want to make it easier than that. We'd like to speed up the cycles such that you can end up doing this in days rather than months. If you go to a commercial company to try to get support, they're probably going to quote you six months, or they'll charge you twice as much for three months.
And so what we're also interested in is some form of language adaptation from universal models. So language-universal models that we can adapt to the target rather than collecting everything needed from the start.
So we're looking at bridging the gap between the language expert and the technology expert, such that the people who care about using it are the people who are driving it. Because as much as we like to train all of our students to learn how to do these things, we'd like to be in a position where anybody can use our technology without having to go through a masters at CMU.
But if you want to do one, that's okay, some of you already have one. So we came up with SPICE, and SPICE stands for something -- as with all research systems with names it is an acronym, and I have no idea what it is -- it is the Speech Processing Interactive Creation and Evaluation toolkit, and we came up with that first and then discovered it spelled SPICE. And it was funded originally by the National Science Foundation; the funding on this finished two years ago and we've been using other funding to build on top of it.
And it's basically bridging the gap between technology experts and language experts in speech recognition, machine translation and text-to-speech, okay. The primary work that we've been doing is in ASR and TTS. And it's a web based tool, because we don't want people to download and try to install things that don't actually work; we don't want to deal with those problems. We want them just to go to the website, it's going to work on all platforms, and they interact with that.
And it's going to lead them through the stages they actually have to do.
You can go and try it now, cmuspice.org, and so you can go there and you can log in and you can see what our publications are. When you log in, you get something like this, okay. It might be slightly out of date because we update it.
And basically these are the stages that you're expected to be able to go through.
And you're going to collect some text and it's going to automatically analyze that text and find the right type of prompts to do, and I'll talk about that in a minute.
We're going to collect the audio across the web so we get access to everybody around the world, rather than having to bring them in to Pittsburgh to do that. And we have an automatic way of bootstrapping lexicons, optimizing the time that the human is actually involved. We build acoustic models, language models and speech synthesis, as one might expect.
>>: Is English to [inaudible] --
>> Alan Black: Yes. They will have to understand English to be able to do that, yes. And that's exactly true. And we discovered that that could be a limitation, okay. But one of the other things is that maybe somebody can set it up, and then the people who are doing the recording may not have to understand English to be able to do that.
But there is an underlying assumption that they're going to -- for example, in India almost everybody with a computer does understand English, even though they're working in another language, okay. So technology people often understand English. But yes, you're -- that's absolutely right.
So in speech processing we're trying to share information. Traditionally the speech synthesis people and the speech recognition people don't talk to each other, and so they develop their own things, their own phoneme sets. And to give you an example of when these things actually break down: back in the days of Communicator, the CMU Communicator, a DARPA funded project for giving flight information, when you were speaking to the system you would give one pronunciation, and when the system spoke back to you, it would give another.
So if I said to the system, I'd like to fly to Atlanta, using the proper British pronunciation of the city, the system would say it couldn't understand me, okay. So I actually had to say, I'd like to fly to Atlanta, okay, the American pronunciation of it.
But the system would say back to me, okay, flying to Atlanta -- so it actually gave the British pronunciation of it. And that's very confusing to the user. So you really want to have these things map together from that point of view. So we're sharing text data to build language models, and the natural language processing, the text frontend stuff that you might be doing in the TTS; we're sharing the pronunciations between the two and the phoneme set, so that these things are really the same.
Sometimes the TTS actually requires more information in the lexicon. So for example, tones or lexical stress, depending on the language. You may need to add that. And we also want to be able to share the acoustic models between the two, because we can actually do that to some extent between the two systems.
I'm going to skip that part. And we'd like to point out that the easiest way to bootstrap a system is to record prompted speech rather than collecting in the wild and getting somebody to transcribe it, okay. This gives you a system that typically allows you to do simple dialogue systems and speech-to-speech translation systems pretty well.
So we're trying to find nice prompts in a language we do not know, okay. And after a number of experiments in speech synthesis over a number of years, we decided to do this for speech recognition as well, because it's sort of the same thing. You want the minimum number of prompts that you can get somebody to say that has the best coverage. So with every new prompt you get, you want to get more information out of it than what you've got already. And so what we do is we take some large text corpus, okay, some size, okay, whatever it's going to be, and we try and find nice sentences -- nice technical word there, nice. Okay.
What we actually look for, given the distribution, is high frequency words, okay, because we want to know what the pronunciations of these words are and we don't want unusual pronunciations, so as much as possible we're trying to get high frequency words that are in the domain. Everybody likes to tell you that their text corpus is not domain specific; it always is, okay.
And we also want sentences of between about five and 15 words. We want people to have their articulators running properly, and if you give them isolated words, they say things in a different way than they would for longer sentences. And if you give somebody a prompt longer than about 15 words, they make errors, and you don't want errors.
Actually what you'd really like is to make the prompts as easy to read as possible, okay. But we don't quite have that. At this stage we don't actually have phonetic information, because this is the initial stage. We're actually doing it text based, okay.
And then from all of these nice ones we want to optimize the coverage.
So what we basically look at is trigraph information, so you're trying to find the sentences with as much trigraph coverage as possible, okay. So you basically do greedy selection to do that. If you had a phoneme base, which in this case we don't, you could do it on a triphone base.
Now, even for languages like English or Chinese, where the relationship between the writing system and the pronunciation is not very direct, okay, this still works well. We've done it for English and we've done it for Chinese and you get it. It's not as optimal as what you would get with phonemes. For example, in English there's almost no letter which directly goes to the rare phoneme ZH, okay. There's no standard letter for it. But there are words like vision and casual which actually have it in there, and those are not unusual words, okay.
And so you've got to get enough coverage to be able to do it. And we're typically looking for something in the region of 500 to 1,000 sentences. And that gives you a base set that you're then going to pass out to people to record.
Okay?
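A minimal sketch, in Python, of the greedy trigraph-coverage selection just described. The function names, the 5 to 15 word filter and the stopping rule are illustrative assumptions, not the actual SPICE selection code.

```python
# Toy greedy prompt selection by letter-trigraph coverage (illustrative only).

def trigraphs(sentence):
    """Return the set of letter trigraphs in a whitespace-normalized sentence."""
    text = " " + " ".join(sentence.split()) + " "
    return {text[i:i + 3] for i in range(len(text) - 2)}

def select_prompts(candidates, n_prompts=500, min_words=5, max_words=15):
    """Greedily pick readable sentences that add the most unseen trigraphs."""
    pool = [s for s in candidates if min_words <= len(s.split()) <= max_words]
    covered, selected = set(), []
    for _ in range(n_prompts):
        best, best_gain = None, 0
        for s in pool:
            gain = len(trigraphs(s) - covered)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:            # nothing adds new coverage, so stop early
            break
        selected.append(best)
        covered |= trigraphs(best)
        pool.remove(best)
    return selected
```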
So we're collecting this for ASR and TTS acoustic modelling. And we need good text, well written text, okay. You want it to be nice. And so we basically are getting stuff from the web; you strip the HTML [inaudible], which can be hard because people put accents in different encodings.
Preferably well written. And you'd like to minimize the misspellings. So rather than taking blog data it would be nice to take newspaper text if you can do that.
But by looking for the high frequency things you actually end up with at least a common way of spelling, even if it isn't correct. Okay? And we're currently not dealing with word segmentation automatically in the system, so for the people that are doing Chinese, Japanese and Thai we ask them to provide word-separated text.
But these are major languages rather than minor languages, and many of the minor languages are actually easier. And natural text is often mixed language, so when we do this for Hindi, one of the real nice phrases that comes out is something like copyright New Delhi Times 2008, because it appears on every single page, it's very high frequency and it's written in English, okay. We don't really want that. So you often get mixed things.
So what you get out of this is often not ideal, but it's usually pretty good.
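One way to filter out those mixed-language lines is a simple script check; a rough sketch is below. The use of Unicode character names, the Devanagari example and the thresholds are assumptions for illustration, not what SPICE actually does.

```python
# Rough script-based filter for harvested text, e.g. to drop English lines
# like "copyright New Delhi Times 2008" from Hindi data (illustrative only).
import unicodedata

def script_fraction(line, script_name="DEVANAGARI"):
    """Fraction of alphabetic characters whose Unicode name starts with the target script."""
    letters = [c for c in line if c.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(1 for c in letters
                    if unicodedata.name(c, "").startswith(script_name))
    return in_script / len(letters)

def keep_line(line, script_name="DEVANAGARI", threshold=0.9):
    """Keep readable-length lines that are overwhelmingly in the target script."""
    return 5 <= len(line.split()) <= 15 and script_fraction(line, script_name) >= threshold
```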
And then the first person reading it can go yes, yes, yes, no, no, no, no, yes, yes, to make decisions about it. We've done this for the so-called ARCTIC speech databases using this technique, and a number of people have used it within the FestVox stuff for building voices directly. And we've found, having done this for some 23 different languages, that it's pretty good for finding text.
>>: [Inaudible] writing system?
>> Alan Black: Hang on. Yes. About 500 to 1,000 of the 6,000 have actually got a regular writing system, but for the others it's very hard to define in some cases, because some people do write them. Okay? But it's about 500 to 1,000, so you may be talking about somewhere in the region of a sixth that have got a regular writing system. I think that's from the British [inaudible] book.
Tanja Schultz for her Ph.D. last century, just to make her feel old, did this project called GlobalPhone where she basically had a standardized way to collect data from 15 different languages, okay, and since then we've continued to collect more languages.
The point here was to get as wide a range of languages as possible, both from a linguistic point of view and a geographical point of view, okay. And native speakers -- the basic setup was that a random student at Karlsruhe University in Germany was given a free laptop to go to their home country and record data, okay, and that built a lexicon and collected text for language models. Okay?
And this gives us a big database which has got very good coverage of phonemes, almost all types of phonemes throughout the world. And this is important because this is going to be the underlying model that we're going to use to bootstrap our system. Okay?
And the speech recognition rates -- these are of definitely different qualities. Japanese we're getting relatively easily. German, well, it was done in Germany. And English. And as you can see, Russian is probably worse, and that's not because I think Russian is harder; I think it's not as good a database as some of the other ones. Most recently we've done Afrikaans, Chinese, Arabic and Iraqi. And a point to note: I usually think that if you get below 20 percent word error rate you can usually do pretty reasonable spoken dialogue systems, okay?
And getting below 10 percent is cool, but this is definitely not for dictation -- we're not talking about dictation tasks from that point of view.
And so the idea of rapid portability of acoustic models is that we initialize the models from similar phonemes in other languages that have got the same phonetic properties, okay, and then do adaptation for the target language. So you make use of the GlobalPhone information to get started and then adapt, and so you need much less data to be able to do that.
And so we actually do have to depend on a universal phone set, and so we're asking people to define their phoneme set in the language. I'll come back to that issue later, because that can be quite hard; you have to have the knowledge about it, okay. Not all native speakers really know about the phonemes in their language. And we discover, when you actually look over this, there's lots of overlap between languages, a genuine overlap, okay.
It really is the case that certain pronunciations of T in one language are really the same as T in another, across the whole planet, okay. So there's lots -- vowels can be very different, but it's a good place to start. So we can look at this, and if we can map some IPA based universal sound [inaudible] onto what the target phoneme set is, we can end up with something that allows us to do this. And then the adaptation and training allows us to end up with something which is quite reasonable. Okay?
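The mapping idea can be sketched as a nearest-phoneme lookup over shared articulatory features. The tiny feature table and the scoring below are stand-ins for illustration; the real system works from the IPA-based GlobalPhone inventory.

```python
# Illustrative bootstrap: seed each target phoneme with the source-language
# phoneme that shares the most articulatory features (feature table is a toy).

SOURCE_FEATURES = {           # phonemes we already have acoustic models for
    "t": {"stop", "alveolar", "voiceless"},
    "d": {"stop", "alveolar", "voiced"},
    "s": {"fricative", "alveolar", "voiceless"},
}

def closest_source_phoneme(target_features):
    """Pick the source phoneme sharing the most features with the target."""
    return max(SOURCE_FEATURES,
               key=lambda p: len(SOURCE_FEATURES[p] & target_features))

# A hypothetical target phoneme: a voiceless retroflex stop.
seed = closest_source_phoneme({"stop", "retroflex", "voiceless"})
print(seed)   # "t": its model seeds the new phoneme before adaptation
```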
And so here's an example with Portuguese, where we basically collect increasing amounts of Portuguese data. So here, if we use all the GlobalPhone data and no data from Portuguese, okay, we end up with 69.1 percent word error rate, okay, for our recognizer. That's not particularly good. Okay?
And if we give it 50 minutes, okay, we end up with 57, okay, which is still not good enough, okay. If we basically tag the phonemes with which language they actually come from, you can get it down to 49, okay. And if you give it a little bit more data you can get it down to 40, okay.
If we use more information about the language it comes from -- further work that basically does better tree tying on it -- you can actually get it to around 19.6 with an hour and a half of data. If we use all of the data from Portuguese, which is 16 and a half hours, we get 19.
So the difference between these two, okay, will not make much difference at run time, but the difference between collecting an hour and a half of speech as opposed to collecting 16 hours of speech is quite significant, okay. So the point is that by initializing our models with appropriate context dependent training, you can get something that's pretty similar to what you would get by just collecting lots of raw data in the target language.
And in SPICE, to get back to the system, we actually ask people to type in their phonemes and we give them the standard [inaudible] chart to do it. Rather importantly, if people click on the symbols, they get to hear what each one sounds like, okay. So you have some way of actually getting people to do that.
So you don't just get the chart and pick your phonemes from it; we've actually got recordings of people saying all of these different phonemes, which is hard material to get. And I'll come back to that in a minute.
>>: [Inaudible].
>> Alan Black: And so one of the things about the IPA is that everybody claims it's universal, okay. But there's a whole bunch of things that are actually missing from it, and very quickly you find there's no [inaudible] information which matches up a whole bunch of languages. And actually in the standard forms you don't even get all of the different stops that are in standard Indian languages. So they are missing from it, so we've added them, okay.
So actually the answer is no, it doesn't cover everything. But it's probably good enough, okay, because in all forms of acoustic processing what you typically end up with is some form of phoneme which you then map into a different form. Take for example the rolled R as in Scottish English, okay, which is not what the Americans say for R; they typically still label it as the phoneme R, although the R phoneme is somewhat different between the different languages, okay, different dialects. But you're still dealing with the same form. And that's typically good enough. Okay?
And pronunciation. I'm actually going to -- oh, great. Okay. I have this one, so I'll talk about this. Can you do all of this with graphemes rather than phonemes, okay? So you're in a new language that you don't know anything about. You just take the writing system and you try to map it, and you pretend each letter, each orthographic form in the language, is a phoneme, okay? And this works surprisingly well, okay.
Everybody thinks, oh, yeah, it works for Spanish but it's not going to work for anything else. Even in Spanish, C can be pronounced lots of different ways. It can be pronounced S, it can be pronounced TH if you're from Continental Spain, it can be pronounced CH depending on the context, and maybe even K for borrowed words.
But here's examples of grapheme versus phoneme recognition, and the improvements from flexible tree tying, so again worrying more about context dependency. So if with phonemes we end up at 11.5 and we end up at 19.2 just by using graphemes for English, that's still sort of okay, and if it saves you having to get phonemes it can actually help. Spanish is actually surprisingly worse, but again I think that's to do with the database. German is actually better. I can say that because Tanja isn't here -- I think the lexicon for German isn't very good. But you didn't hear me say that. Apart from the fact that this is being recorded and going out on the web, so she'll call me later.
And Russian again is not very good. But Thai is sort of reasonable as well and you can do quite well. And this may be an easier way to address the issue, because if you can get almost as good a result without knowledge of the phonemes, you know, it would be a good way to do it, okay.
There are issues with doing this. We've done this with synthesis as well and we find very similar results. The issue is that when it's wrong, how do you fix it, okay?
So if you suddenly have a new name, okay, and you want to get the pronunciation of, I don't know, Colin Powell's name correct, and his name is Colin with an "oh", not Colin with an "ah", okay, yet it's written in the same form -- how do you specify that it's got a different pronunciation if you don't have phonemes? You have to do some form of rewriting it with other letters which are similar, which may not be easy in some languages.
We have a bootstrapping system for when we're dealing with phonemes, and we've done quite a lot of work on this: the cost of asking somebody to fix the lexicon versus how good it actually gets, and what the best way to do this is. What we do is we set up some word list, okay, which is probably best frequency based, okay. And we select some word from it, okay. We generate the pronunciation -- in the first case from nothing, okay, but if we've got any form of mapping from letters to pronunciations at all, we'll use it, okay.
And using the grapheme-to-phoneme rules, we then run a text-to-speech synthesizer on it, which is only phoneme based -- we actually use the IPA recordings to do that -- so you can hear whether the pronunciation is right. With a string of phonemes, people find it quite hard to know whether it's right.
Is it okay? Yes. Okay. Then it gets put in the lexicon. If it's not, you actually say, well, correct it. Okay. And then we put that in the lexicon. And we update our letter-to-sound rules each time. What you discover very quickly is that you end up with a much better lexicon much faster, because you're dealing with the things which actually are improving the pronunciations. And we've done this for a number of languages and shown that this is definitely the case.
And there's also been work done by Marelie Davel [phonetic] in her Ph.D. at the University of Pretoria, which was similar work, and we share some of the things she discovered -- that actually playing the TTS makes a big difference to accuracy, okay, even if you're an expert. Which is quite interesting.
The other thing that we discovered is that you can skip a word. If it's really wrong, it's better to skip it, because it just takes too much time to fix, okay. And it's better just to fix the things that have got single, simple things wrong with them.
So we have a lex learner, and basically we're three percent through and we're at the word "at"; it gives a pronunciation, say AX T, and there's a list of phonemes which are actually in your language -- you can only type phonemes that are in your language -- and you can listen to the pronunciation, you can accept it, you can skip it, or you can remove the word if it's not a valid word, maybe it's the wrong language or something's just wrong with it, okay. And so we can use this system to do incremental learning.
And sometimes it gets it wrong. Here we have a word -- gene, I don't know -- and its pronunciation is wrong because it's never seen that G yet, and you can hand correct it or you can skip it.
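The loop itself is simple; a schematic sketch is below. The predict, play, ask_user and retrain callables are hypothetical placeholders for the letter-to-sound model, the phoneme-based TTS and the web interface described above.

```python
# Schematic lex-learner loop: synthesize the predicted pronunciation, let the
# user accept, correct, skip or remove it, and retrain the letter-to-sound
# rules as the lexicon grows.  All four callables are placeholders.

def lex_learner(word_list, predict, play, ask_user, retrain):
    lexicon = {}
    for word in word_list:                  # frequency-ordered word list
        phones = predict(word)              # current best guess from the rules
        play(phones)                        # let the user *hear* the guess
        answer = ask_user(word, phones)     # "ok", "skip", "remove", or corrected phones
        if answer in ("skip", "remove"):
            continue                        # badly wrong or invalid words are cheaper to skip
        lexicon[word] = phones if answer == "ok" else answer
        retrain(lexicon)                    # update letter-to-sound rules incrementally
    return lexicon
```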
And what we're trying to do is make the best use of the human.
You're trying to optimize the human's time. You've got a human who understands what the language is, and you're trying to find the best way for them to answer questions which will give you the most information as quickly as possible. Okay?
We still have the issue of when do you stop, okay. We don't have perfect lexical pronunciation coverage even for English, because there are always new words coming into the language which aren't covered, and anybody in the commercial field is probably adding to the lexicon regularly when new names come in, or new famous people that you want to make sure you're covering. But, you know, from the initial stage how far do you go, okay? Well, you can look at the distribution and find out how much of the tokens you're covering. You can also estimate as you go how well you're guessing, okay. So you can actually find out.
And you can compare that to other languages and find out how well you're doing. So for English, for example, using cmudict, probably about four percent of tokens in the Wall Street Journal don't appear in cmudict, okay. And of those, for lack of [inaudible], how many are we getting right? Well, it's probably about 60 percent given our experiments, okay.
That's of tokens, okay, tokens, not types, okay. So actually you're doing pretty well, okay. So if you can get something which is up in the region of 85, 90 percent token coverage, which is relatively easy to do, you'll probably get a lexicon that's reasonable.
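Token coverage itself is a one-line calculation; a small sketch with made-up example data is below.

```python
# What fraction of running words (tokens) in a corpus have a pronunciation?
from collections import Counter

def token_coverage(corpus_tokens, lexicon_words):
    counts = Counter(corpus_tokens)
    covered = sum(n for w, n in counts.items() if w in lexicon_words)
    return covered / sum(counts.values())

tokens = "the cat sat on the mat and the dog sat too".split()
print(token_coverage(tokens, {"the", "sat", "on"}))   # a few frequent words already cover ~55%
```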
You've also got to deal with the case where people, when asked is this right, is this right, is this right, just go yeah, yeah, yeah, yeah, yeah, okay, so you have to deal with issues like that and sanity check it, and so it would be good to have multiple people doing it.
And you've got to keep them motivated. One of our big errors when we started off was that we used to show the number of types that were covered rather than the number of tokens that were covered, so it used to start off saying, you know, you've done three and you've got 0.000378 percent -- you know, you have got a millennium to go before you're going to finish. And we found that that wasn't a good way to get people motivated.
So we changed that. The task is still as hard, but you just change the way of incentivizing them. And rapid language portability when it comes to language modelling, okay -- you might think that there's nothing you can do cross-language, but what you discover is that lots of languages have actually got a certain amount of shared information, especially the large languages.
We did some experiments with Bulgarian and we found that we could actually improve things by using Russian, okay. And we also found that we could do this for improving Iraqi using modern standard Arabic, which might be a little more believable because there are going to be modern standard Arabic phrases in Iraqi Arabic. All the Arabic dialects are different; they're not the same language. And when people speak to each other, they basically drift up into modern standard Arabic so that they can make themselves understood, but when they're chatting to somebody else from their own country, you know, the Moroccan people and the Iraqi people are actually speaking languages which are really quite different from that point of view. So we can actually share information like that, too.
Let's get on to TTS. We can do the same thing in TTS, where we can use information from other languages to initialize our models and then do some form of adaptation to get closer to the target language, okay.
So it's following the same ideas as GlobalPhone. And we have a text-to-speech page -- this is an example of it. Basically we create or recreate the voice from the imported waves that have been recorded, we import the prompts and the lexicon, we label everything with the HMM labeler and we basically build the thing. The next slide tells me about the synthesizer. Yes.
Okay. So the form we're actually using is parametric synthesis, often called HMM synthesis. It was made famous and pioneered by [inaudible] from the [inaudible] Institute of Technology. Those of you that know my background may be a little surprised that I'm in that domain, given my association with unit selection, which was the technology of the last 10 years -- but parametric synthesis is the speech synthesis technology for the next ten years, okay, take it from me.
And the reason that we want to do this is because statistical parametric synthesis is actually much more robust to errors and much better with small amounts of data. Okay? I also believe it's how we're all going to do synthesis in the long term. Basically, in unit selection what you're doing is taking an instance of some piece of speech, probably phoneme sized or bigger, okay, whereas with statistical parametric synthesis you really have a genuine model and you're basically taking the average of pieces of speech.
Now, as all of you know, if you've got one or two things that are wrong and you're taking an average of a hundred, okay, the one or two things which are wrong are irrelevant, or almost irrelevant, to the average. But in unit selection what happens is you're picking instances, okay, and the chance of picking one of these bad ones, okay, which will make the synthesizer sound bad, increases.
So what you want to do is make sure that you don't include those, and therefore we've been using a parametric synthesizer.
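A toy numerical illustration of that argument, with made-up numbers:

```python
# Two bad examples barely move an average, but instance selection sometimes
# picks one of them (toy numbers, purely for illustration).
import random

good = [1.0] * 98            # stand-ins for well-labelled parameter values
bad = [10.0] * 2             # two badly labelled outliers
units = good + bad

average = sum(units) / len(units)    # parametric-style estimate: about 1.18
picked = random.choice(units)        # unit-selection-style choice: 2% chance of 10.0
print(average, picked)
```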
We use Clustergen, which is a synthesizer that I wrote and which is part of the Festival distribution. It follows a lot of HTS. I don't call it HMM synthesis because it doesn't contain any HMMs. But actually if you look at HTS, which is called HMM synthesis, it doesn't really have any HMMs at synthesis time either -- at labeling time it does, which is the same as what we do here.
And we can make this synthesizer work with as little as 10 minutes of speech, okay. And it's robust to errors. Unfortunately the quality of the speech that comes out of it, okay, is a little buzzy. It's not as bright as what you would get out of a unit selection synthesizer, okay. But it's still perfectly understandable.
One thing that is a problem is that with 10 minutes of speech it's hard to get good prosody, because prosody is a bigger-scale thing and you need more examples of it, and that's still a research issue. But with the quality of synthesis you're getting out of this, you can transcribe everything and understand what's being said.
>>: Short or long term [inaudible] when do you see it [inaudible].
>> Alan Black: Okay. So first we have to solve natural language prosody, which requires us to do full understanding of the things that actually have to be said. We're not going to do that. Okay? But how soon are you going to get prosody coverage that's going to be as good as your best unit selection? Probably one person's Ph.D. with a bunch of follow-on stuff, okay. So that's probably in the region of three to five years.
And we've already started studies to see whether, in the same way we take acoustic models across multiple languages, you can take prosodic information from multiple languages and share it. Prosody is very different in different languages, okay. The choices that you make -- Dan's in the audience so I can use him as an example. Dan is a Romanian speaker and he has a particular intonation pattern in English which is based on Romanian, and it's the way I fundamentally recognize Romanian English speakers, okay? Because it's a syllable-timed language rather than a stress-timed language. And that's really the underlying reason that Romanian speakers end up with a different intonation.
But if you can find and share intonation, say, between Danish, German, Dutch and -- not Swedish -- Norwegian, okay, you can actually use that information for some other Germanic language, and it will be pretty good, okay, because the intonation patterns in these languages are similar, okay. So that's one of the things that we can do.
And here's a much more detailed breakdown of what we actually did with this. We had a student a number of years ago, Engelbrecht, Herman Engelbrecht, who is not German, he is from South Africa, and he basically built an Afrikaans-English speech-to-speech translation system, not actually using the website itself but all the underlying tools, so we could get a breakdown of the time and effort for all the things that he did, okay.
So the ASR was standard Janus stuff, using the Janus toolkit, HMMs; the MT was statistical MT; the TTS was unit selection rather than HMM synthesis, because we did this about three years ago; and the dictionary was using decision tree stuff, which is typically quite good but is not incremental, okay, because it has to be retrained after each change.
He had text from transcriptions of the South African parliament, so he had a bilingual aligned corpus. Six hours of read speech, about 10,000 utterances that he was actually working with, okay.
And what was the time that he spent on the different parts? Okay. Data preparation type stuff, okay, was by far the biggest part, okay, which is actually quite interesting, because we've been addressing this issue of trying to use a website to make sure that all the data we get is in the right format. If it's not, you spend all of your time dealing with data, and the actual core training -- the number of hours you spend actually doing that, okay -- is smaller, okay.
And tuning: okay, clearly the TTS and the MT worked perfectly, because only the ASR was actually tuned. That's probably not the case, okay, but he spent all the tuning time on the ASR. And then there's evaluation and the prototype, where he actually put the system together, took the models out of it and put them into one of our frameworks. Okay?
And he was doing this over a period of about six weeks, okay, and that's the breakdown. Data preparation is by far the biggest thing, and that's one of the things that led us to focus on the website in order to deal with that.
>>: [Inaudible] is that one that you can [inaudible].
>> Alan Black: And so we actually use the same questions, okay, but we retrain it for every language.
>>: There's a question about.
>> Alan Black: Huh?
>>: The question about [inaudible].
>> Alan Black: The questions are the same, yes. Yes. Yes. And this is something which always surprises me in speech recognition: you care a lot about these questions, but in speech synthesis we use sort of the same things, and we use techniques to automatically pick the best questions, so I think it's less of an issue.
Getting the right features in there is the main thing, and we've basically got phonetic information, because all our phonemes map back to the IPA, so we actually have phonetic features for all of them, and then you can basically use reasonable, you know, CART building techniques to do that.
This was spring this year, and we're running it again this semester. We had 11 students in the class. And this was the second time we really ran the class.
The first time almost everything broke. And we had two Hindi speakers, two Vietnamese speakers, one French, two German, Bulgarian, Telugu, Cantonese and Mandarin. Those were the languages that we actually did. And Turkish. Turkish must have been the year before.
And nobody was a native English speaker -- but, hey, it's CMU, almost nobody is a native English speaker, including me. And their task was to build a complete speech-to-speech translation system, so not just building the speech part, they're building some form of translation part. And so they were in teams on random other language pairs that we chose. The translation was simple phrase based, so it was basically given a bunch of phrases and trained through the [inaudible] system, but that really wasn't the point.
The point that we were trying to make here is that many of our students get expertise in speech recognition or speech synthesis or information retrieval or translation, but they never get experience of building the whole system.
Okay? So we were actually making sure that they were getting the full experience of building everything. And we were also doing this to debug our system, because, you know, it's a good way to debug your system. All of our ex-students have graduated now, so I can admit that to them.
And in evaluating the results, basically we wanted to collect much more data to evaluate future improvements to the system, to make sure that we're improving it when we've done the bunch of things to improve it that I'll talk about in a minute.
Does it work? Yes. Yes. It really does work, okay. There's no question about that. Does it work for everybody? No. Okay. And sometimes things fail miserably in ways that are not obvious whatsoever to the student. Let me give the example of Vietnamese. Are you a Vietnamese speaker? You are a Vietnamese speaker. I knew that. But the Vietnamese one failed miserably, and we took some time to look at what it was. And it was basically that the website encoding was lying: it was sometimes UTF-8 and sometimes it wasn't, but it told us all the time it was UTF-8. And so we treated it as UTF-8. And that doesn't work, okay, because you just get random bytes coming out of it that don't really occur. So that's what happened from that point of view.
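A check of the kind that would have caught this is easy to write; a minimal sketch is below, assuming you have the raw page bytes in hand.

```python
# Does a page that claims to be UTF-8 actually decode as UTF-8?

def claims_utf8_but_is_not(raw_bytes):
    try:
        raw_bytes.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True

print(claims_utf8_but_is_not(b"xin ch\xc3\xa0o"))   # valid UTF-8 bytes: False
print(claims_utf8_but_is_not(b"xin ch\xe0o"))       # Latin-1-style bytes: True
```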
How can you tell it worked? Okay. So here's a little story about doing speech synthesis in Chinese the first time I did it. I built a Chinese synthesizer, and I typed in some Chinese and Chinese came out. And I thought, that's just so cool. So I went and got one of my Chinese colleagues over to listen to it, and they said: that's so cool. What language is it? Oh, right. Okay.
The problem is, if you're doing work in another language, you can't tell whether it works. Actually it's much worse than that. The point is that you'll think it works because you spent a lot of time on it, okay. So it sounds Chinese to you, even if you don't know Chinese, and that's not good enough, okay. So you have to find a way to check that it's good; the system has to be able to do that. And so we have to find objective and subjective ways to ensure that the thing we've generated is actually reasonable, okay. And that's the important thing that we have to do. And I'll address that in a minute.
How can you make it better? Okay. So it doesn't matter how good you ever make anything, okay, what your boss is going to say immediately afterwards is: that's amazingly cool, now can we make it better, okay. So you'd better keep something back.
So if you build a model for Hindi, for Urdu, how can you then make it better, okay? And so that's one of the things we want to do. And I'll also talk about some of the problems that we have identified that we don't have answers for yet, which basically are research issues.
So automatic evaluation. How can you tell that a synthesizer is good automatically, okay?
>>: [Inaudible].
>> Alan Black: Huh?
>>: [Inaudible] you think random language is an [inaudible] it actually works.
>> Alan Black: So, you know, to be fair, that was not true the first year that we actually ran it. And it was not 80 percent success. It was a nice round number that wasn't 100, okay, of success. It didn't work, okay. And the result was somebody like me or Tanja going in the background and changing something to make it work. But that was the first year we did it. The second year that we did it, it did actually work -- and when I say work, I mean that at the end the student spoke into it, okay, it recognized what they said, it put that into the synthesizer and you could recognize what was actually coming out of it. Okay? So it does work at that level. Okay? So, yeah.
But some of them -- I mean, you know, the Vietnamese example, that was an identifiable one, okay, of why it didn't work. But one of the Hindi ones didn't work and one of the Hindi ones did. So I don't really know the answer to that. I don't know, okay. So there are still issues on that.
And in speech synthesis we've been centering on spectral distance, where you take held-out data and synthesized data and you compute the spectral distance between them. Basically it's a distance in the [inaudible] cepstral domain called MCD, mel cepstral distortion. It's typically used for speech synthesis optimization and voice conversion optimization. So we've been using that, and because we now have 40 nice clean languages that we've been working on and testing, you can find out what the expected value is given the amount of data you have and, importantly, the amount of time you spend on the lexicons. We actually look at what the distribution is -- this is what [inaudible], one of my students, has done, and he has a paper at a conference in Vietnam earlier this year that describes all of that.
And so basically we have the information about what the expected number would be given the amount of data that you actually have in the language, okay?
So we can apply that to the system, and we can work out whether we think the number that we're getting out is good or bad, okay. Because this is the thing you have to tell the person: is this reasonable?
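For reference, a sketch of the MCD computation over time-aligned mel-cepstral frames; frame alignment and feature extraction are assumed to have been done already, and the "below about 6" reading comes up later in the talk.

```python
# Mel Cepstral Distortion between held-out natural speech and the
# corresponding synthesis, averaged over aligned frames (c0 usually excluded).
import math

def mcd(natural_frames, synth_frames, skip_c0=True):
    const = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    start = 1 if skip_c0 else 0
    total = 0.0
    for nat, syn in zip(natural_frames, synth_frames):
        dist = sum((a - b) ** 2 for a, b in zip(nat[start:], syn[start:]))
        total += const * math.sqrt(dist)
    return total / len(natural_frames)

# e.g. mcd(held_out_mceps, synthesized_mceps) below roughly 6 reads as "probably okay"
```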
You can also do automatic listening tests, and this is something that we've done as well. So basically you synthesize it and ask people to write down what they hear, okay. And with most reasonable text-to-speech systems, people are pretty near 90, 95 percent correct if you ask them to write it down, okay? They may not like the synthesizer, it may sound like a robot, but it's still good enough that people can actually understand it.
For English synthesis we actually typically don't do that, because the synthesis is too good, and we instead do what's called semantically unpredictable sentences, where we put essentially random words together in a way that is at least syntactically correct and then ask people to write them down, okay. And those are really hard, even when you have a human read them. Isn't that right, Stephanie? She had to read them out. Yes?
>>: [Inaudible].
>> Alan Black: Yes.
>>: [Inaudible] is the SPICE package generating [inaudible]?
>> Alan Black: Yes.
>>: As part of --
>> Alan Black: As part of its evaluation it comes out and says: here is held-out data, synthesize these and tell me what you get. So yes, we do that. And this has been an offline batch thing, and we get a number out of it, and we're trying to say things like: if this is less than 6 it's probably okay. Okay? And if it isn't, there's something wrong. But you can't necessarily tell what's wrong.
Another area of speech recognition we're working on is bootstrapping language modelling. Basically, given the text that we're given by the human who wants to do this, we do information retrieval analysis, TF-IDF type stuff, on it to try to find the most discriminative bigrams, with respect to a background model, that are likely within that language. And then given those, we go off and do further searches on the web for more appropriate data. Okay?
And so this means that you have some data, and it goes and finds more appropriate data that has got the right type of language background. We had a paper in Lisbon about this, and the tests were all done in Hindi, and there were two particular domains we were looking at: one was recipes, okay, and the other one was the most important thing to Hindi speakers, which is cricket, okay.
And it basically did the right thing. It went away and got the right information and found the right pages in Wikipedia, the Hindi Wikipedia, that were on cricket and related aspects of the game and famous cricketers, and it also went and found the recipe material. So we were doing that, and we get a reduction in language model perplexity.
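A rough sketch of the bigram-selection step is below; the scoring is a TF-IDF-like ratio against a background corpus, and both the scoring and the idea of issuing the winners as search queries are simplified assumptions here.

```python
# Pick bigrams that are frequent in the user's seed text but rare in a
# background corpus; the winners become web search queries (search not shown).
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def discriminative_bigrams(seed_tokens, background_tokens, top_n=10):
    seed_counts = Counter(bigrams(seed_tokens))
    bg_counts = Counter(bigrams(background_tokens))

    def score(bg):
        return seed_counts[bg] / (1 + bg_counts[bg])   # high in seed, low in background

    ranked = sorted(seed_counts, key=score, reverse=True)
    return ranked[:top_n]
```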
And this is actually one of the things that we're very much aware of: we now feel we've got to the stage where we can build this system that will give you models, but we all know that when you're doing real systems -- if you're doing a telephone dialogue system -- you're going to take models from somewhere else, and then the hard thing is tuning them for your particular domain. And so this is something that we're trying to get a better handle on doing automatically.
The things we've not done, okay, which are still major issues in trying to do this completely automatically for new languages: one is text normalization. You can sort of think of it as a noisy channel model. You've got a bunch of words that are coming in and you're going to get them converted -- think about numbers, you get a string of digits, how do you actually pronounce them? Most of these are somewhat programmatically defined, okay, but actually in text you get this all the time, much more than just numbers. You get abbreviations, you get letter sequences, you get punctuation choices which are obvious to the reader and non-obvious to the machine, okay.
If you look at text-to-speech systems, basically people typically come up with a bunch of [inaudible] rules, or maybe they come up with a bunch of nice, highly trained automatic models from data that they then augment with a bunch of hand-written rules. And nobody really has a good solution for this. And we're wondering if you could actually learn to do this in new languages.
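To make the problem concrete, here is a tiny English-only sketch of just one corner of it, reading out digit strings; the word lists and the two readings are deliberately simplistic, and the real difficulty, as discussed above, is choosing the right reading in a language you do not know.

```python
# Two possible readings of a digit string (English-only toy example).
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]

def read_digits(token):
    """Digit-by-digit reading, e.g. part of a phone number."""
    return " ".join(ONES[int(d)] for d in token)

def read_cardinal_under_100(n):
    """Cardinal reading for a small count."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

print(read_digits("2008"), "|", read_cardinal_under_100(42))
# The hard part is deciding from context which reading applies.
```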
One of the issues that we notice in a lot of our systems is that people are often in bilingual or multilingual environments. If you go to India and try to deal with Hindi, you discover that almost nobody speaks Hindi in full sentences. They'll typically include other words in it, often English, okay. And that's just not a problem for them, okay.
In high tech fields you're even more likely to find it. You'll find English words or phrases actually coming in from that point of view, and so we have to make sure that we can properly identify and handle [inaudible] when we're dealing with an English word as opposed to a true Hindi word.
The other thing we're interested in doing is correcting [inaudible], so instead of the developer having to worry about it, he or she has a system such that if you deploy it, there's a way for it to improve with actual usage. One of the things that we know in speech-to-speech translation and dialogue systems is that you build the system, you think it's cool, you deploy it, it fails miserably, but you collect data from its first deployment, you label that data and add it to the system, and it gets much better, okay, when you actually put it in the domain.
And the question is, can you do this automatically? Okay. Oh, this is a good example to give here, at Microsoft. So for example, in the Microsoft Live thing on my phone, when I speak to it and say something like Chinese restaurant, okay, it will give me a bunch of hypotheses about what it thinks I said, okay.
If I then click on the Chinese restaurant rather than the Indian restaurant or the Taiwanese restaurant or something, it knows that I really meant to say Chinese restaurant, and that actually gives data back to Microsoft. I hope you guys are actually using this. It's hard to use for retraining. Or you're probably getting it and not doing anything with it. But that's your problem.
>>: [Inaudible].
>> Alan Black: Okay. Right. Yeah. I mean, it's hard to get, but, you know, you're getting direct feedback about whether it actually works or not. And that's a really good thing to be able to do. We've actually been doing this in our speech-to-speech translation system, trying to take information from its first usage and make things better automatically.
Okay?
And there's a non-trivial amount of engineering, and it's good that students at CMU are very good at programming, because there are lots of programming tasks you have to do to make a system that's robust enough to do this. You wouldn't believe -- well, you probably would believe -- what people actually do when you give them a website. They will upload things in weird formats that you know nothing about, despite being told that they're not supposed to do that, and they'll find yet another archiver for the text that they want to upload that we have no access to and no idea how to unpack, and so you'll suddenly get these files that are just binary to us because we don't know that they're compressed. So there's a constant effort to make the data that's coming in standard and detectable. If people follow the standard route, it's probably going to work, but they'll always find ways to do things differently. So being able to automatically detect that is still something that we're fundamentally worried about. That's, you know, a different type of research, but we still actually worry about it.
And finally, rapid language adaptation: speech models built by language experts, not by speech experts. That's really what we're fundamentally trying to do. We're trying to open this up to the outside, okay. It's very nice of DARPA to give us that money every year to do some random other language, but ultimately we think that -- not that it's solved, but that we can push it out of the research lab and into the hands of the end user. We want to share models between languages, okay, so we've got pronunciation models and acoustic models being shared between TTS and recognition, but we're also trying to share models between the languages themselves. So if you can find similar languages, either from a linguistic point of view, or through a shared vocabulary, which often happens, or through the acoustics, you want ways to do that sharing automatically. And acoustics doesn't necessarily mean that you're sharing languages that are geographically or genetically close. The phoneme set in Italian is very similar to the phoneme set in Japanese, but I don't think anybody is going to claim that these languages are in any way related, okay? That doesn't matter if it improves the model and there's a reasonable way to do it.
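One simple way to quantify the kind of phoneme-inventory overlap mentioned above is a set-similarity score such as Jaccard. The sketch below uses abbreviated, approximate inventories for Italian and Japanese purely for illustration; it is not necessarily the measure used in the system itself.

```python
# Jaccard similarity over two phoneme inventories: high for Italian/Japanese
# even though the languages are unrelated. Inventories are abbreviated and
# approximate, for illustration only.

def phoneme_overlap(inv_a: set, inv_b: set) -> float:
    """Jaccard similarity of two phoneme inventories (1.0 = identical)."""
    return len(inv_a & inv_b) / len(inv_a | inv_b)

italian  = {"a", "e", "i", "o", "u", "p", "b", "t", "d", "k", "g",
            "m", "n", "s", "z", "r", "l", "f", "v", "tS", "dZ", "j", "w"}
japanese = {"a", "e", "i", "o", "u", "p", "b", "t", "d", "k", "g",
            "m", "n", "s", "z", "r", "h", "j", "w", "tS", "dZ", "N"}

print(f"Italian/Japanese overlap: {phoneme_overlap(italian, japanese):.2f}")
# A score like this could be used to rank candidate donor languages when
# bootstrapping acoustic models for a new language.
```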
Evaluation techniques are still something we're working on: you want to be able to tell the user whether it works or not, and you also want to tell the user what's wrong. Okay?
Hey, as a researcher I'd like it to be able to tell me that as well. But that's something that we're still working on and improving, and the more data we get, the more information we have about it.
And continuing on: application-dependent adaptation. Once you take your models in Hindi and put them into your pizza ordering system -- well, probably a chicken and tea ordering system -- you want to be able to improve it and have a way to do that, and all that fine tuning that we do as speech researchers, we'd like to move out so that other people can benefit from it and we don't have to do it. And full text normalization discovery is something which I would really like to start getting into, because it's becoming more and more of an issue in things like language modelling and text-to-speech, okay. It's something that we're going to have to actually deal with, especially when you're using found data on the web, because it's not going to be as clean as you actually want it to be. And better use of the end users in improving the system itself.
So you're getting the users to improve the system, okay. They're the ones who care the most and they're the ones that you want to be able to help the most in improving the system and they know the language best from that point of view.
Okay. Do we have any questions?
>>: Are you taking advantage of any users to -- user [inaudible] same language?
>> Alan Black: So actually we're not at the moment; there's no actual way to share data between different people who log in. But this is something that we do want to do, so that if you say "I want to do Hindi," we can say, oh, here we are, or if you say something like "I want to do Gujarati," we can say it's probably written in the [inaudible] writing system, and therefore we'll import all the things and the basic pronunciations that we're likely to get from that, and probably the phoneme set as well, which you can then edit and change around.
So this is something that we'd like to be able to do, but we've not actually done it yet. If you were doing an extension to a language that's already covered, or if you've got related information from another language -- the writing system is a good example -- you want to be able to get that information passed [inaudible].
Actually the number of writing systems is relatively small, and we started putting them in as defaults. So you identify your language as using a Latin-based system, say Latin-1 or ISO 8859, and you want to be able to do that, but you find there are lots of languages on the fringes that don't quite fit. German, for instance, has the sharp S, the double S, and it isn't handled the way you'd expect, so you have to take a little bit bigger view of it to be able to do that.
But that all worked until we discovered people were using Latin alphabets for transcribing languages where, for example, upper and lower case meant different phonemes. We had thought that if you used a Latin alphabet we could do case folding, but you can't in some languages, because they actually make a distinction there.
So some of the Indian languages, if they're not normally written, people may write them in a Latin-based script, and that may be an issue from that point of view. So, yeah, we want to be able to share more things, because there's a lot of information already in there that we could use.
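The case-folding problem is easy to illustrate: in a case-sensitive romanization (Harvard-Kyoto-style schemes for Indic languages, for example), lowercasing silently merges distinct phonemes. The mapping and the example word below are hypothetical fragments, not a full scheme.

```python
# Why naive case folding breaks for case-sensitive romanizations: in
# Harvard-Kyoto-style transliteration, "t" is a dental stop and "T" a
# retroflex stop, so lowercasing merges two different phonemes.

CASE_SENSITIVE_MAP = {
    "t": "dental t",
    "T": "retroflex t",
    "d": "dental d",
    "D": "retroflex d",
    "n": "dental n",
    "N": "retroflex n",
}

word = "paTana"   # hypothetical romanized form containing a retroflex T

as_typed    = [CASE_SENSITIVE_MAP.get(c, c) for c in word]
case_folded = [CASE_SENSITIVE_MAP.get(c, c) for c in word.lower()]

print(as_typed)      # keeps the retroflex/dental distinction
print(case_folded)   # 'T' has silently become dental 't' -- information lost
assert as_typed != case_folded
```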
>>: So [inaudible] the system that you're [inaudible] can work [inaudible] if you deploy the system which will [inaudible] you get more data, you get more people, how do you think it will work for [inaudible] systems?
>> Alan Black: Well, clearly, the longer you have it, if you have this feedback and correction and sharing of data, you collect more data and you get people to correct things that are wrong. Like, it's perfectly reasonable, for the key words in your domain where the pronunciation is wrong, for you to correct them. Everybody thinks that's a reasonable thing to do.
And so if every time somebody uses it you get another 20 words out of it, you're actually going to be able to improve the system from that point of view.
>>: [Inaudible] to get to a certain level quickly, like [inaudible].
>> Alan Black: Yes.
>>: [Inaudible] synthesis [inaudible] and that you could [inaudible] somehow in
[inaudible]. And it didn't -- you know, the data that you collected include the
[inaudible] of that assumption.
>> Alan Black: That's correct. And we're already in that state. For example, on the synthesis side, the synthesizer we first put in there was our first Clustergen synthesizer, and there's a bunch of work that's been done by me and John over the last year which substantially improves the quality of the synthesizer. So we just put that new build process in, and if you go back when there's a new version, you just say build again and it will rebuild. So you actually get better synthesis out of it as the underlying technology improves as well.
>>: Do you think that the data you collect basically could be [inaudible].
>> Alan Black: Yes. Oh, yes. So, you know, something that we don't admit to all of our users is that we get all the data on our server, okay. And this is just so cool, because we actually get all of this data from multiple languages, the sort of data that people actually put into the system. That's also exactly the type of data that we want to improve our algorithms on. Can we write things that automatically detect when the person didn't say what we expected them to say?
Can we automatically write things to try and find out when they're making errors in the pronunciation? Okay?
And so can we tune our systems on the types of errors that we're actually going to get in real life? All of these things are perfectly reasonable, and we're just pleased that we get all this data. For example, when I did my ICSLP paper on this new synthesis technique, I could say we tried it in 14 different languages and they all improved, apart from one actually. So that's a nice position to be in, okay.
>>: So how many -- you covered a bunch of different classes of language.
>> Alan Black: Yes.
>>: But if you wanted to cover enough languages so that, you know, 99 percent of the people on earth go to the system and say I'm fluent in one of your languages.
>> Alan Black: Right. Right.
>>: What's that number? Is that more like a hundred or more like a thousand?
>> Alan Black: No, it's more like a hundred. I actually know that number.
Supposing you want to be able to cover I don't [inaudible] 90 percent of the people on the planet. Once you get the top 40 languages you're pretty close to being able to do that.
>>: Right.
>> Alan Black: Okay. And this is the difference: supposing you're a big company like, I don't know, Microsoft, okay, and the cost of supporting a new language is a million dollars.
You're only going to do that for the ones that are actually going to be financially worthwhile, or where you're legally required to do it. Okay? So I worked with a group doing Spanish speech stuff for Spain, and they had to do Spanish, Catalan, Galician and Basque, and, you know, the last two are not worthwhile from a commercial point of view, but they had to do them anyway.
But that's not what we're aiming at. That's fundamentally not what we're aiming at. What we're aiming at is the person who wants to do Ojibwa, a Native American language, okay? And, you know, Nuance and Microsoft and Google are never going to care about those languages. They're just too small. But the people who speak that language do, and they may want the support to be able to do that. And so we're giving it to them as developers.
And look at it this way. Supposing the only people who could develop websites, okay, were, you know, the major big companies -- IBM, Nuance, Google, Apple -- and nobody else could do it, and you had to ask them to develop a website for you. That would be a pain, because you'd only get the big websites and you wouldn't get the little websites, okay. But if you give people the technology so that they can build a website themselves, then you get many, many more, okay.
So it's this long tail kind of thing: it's the end user, the end developer, who is going to care about supporting the language, and it's taking it away from million-dollar coverage and into the hands of the end user who cares about doing it. If you only care about the languages that people can actually speak to the system, the cell phone manufacturers typically quote 46 different languages that they want to support. It's a different 46 from different manufacturers, but it's always 46. I do find that interesting.
And those are clearly the commercially viable ones. Okay? But there are still languages with 50 million speakers that are not in that 46. Okay. Mostly in India. And that's where we're moving. So we're not saying companies are going to support this; what we're saying is the developer is going to support this.
>>: Right. Right. Yeah. That's not what I was getting at. But I agree with you.
>> Alan Black: Yeah. Sorry. What were you getting at?
>>: Well, it's the [inaudible].
[brief talking over].
>>: If I'm motivated to use a speech-to-speech system like [inaudible] have developed, the goal is communication.
>> Alan Black: Yes.
>>: And so you want to go from a language Mike is fluent in to a language I'm fluent in.
>> Alan Black: Yes.
>>: And in that case, if someone speaks fluent North American English, you don't have to cover, you know, the Indian dialect.
>> Alan Black: Right. But Mike is a native speaker, yes.
>>: Yes, Mike is a native speaker. If you want to produce -- I mean, I suppose there are other cases where you want to dictate into the computer and have a speech-to-text system work in your native language so you could produce a document that someone else could read.
>> Alan Black: Okay. So [inaudible].
>>: I'm trying to figure out, the cases are where you would care about --
>> Alan Black: So I'm going to give you [inaudible] applications which we've been explicitly asked about which I think this technology really solves. Okay?
Native American languages are dying, and most of the people who speak them natively are over 60 or 70, okay? But there are a bunch of young people who would really love to be able to speak them. But young people, as we know, can't actually speak to people; they have to use SMS to communicate, they have to type things online with instant messages, okay.
There's no support for Ojibwa in any form of SMS type things. If you could provide them with the technology in Ojibwa such that these younger people can hear the data, okay, and speak the data when they don't know how to write it, okay, you're actually going to improve their language capability. And as far as they're concerned, that is part of the language.
And if you don't do that, they're going to communicate in English. So this is one way to help preserve languages, because if the technology only supports English, then people end up using English, and you're only going to chat to your immediate friends in the language, when actually we now communicate worldwide, and it's the technology that keeps these communities together.
And there's a project that we're involved with, and Microsoft's involved with, and, I don't know, some of you may know him -- one of Roni Rosenfeld's [phonetic] students who is working in Pakistan is trying to provide speech-based support for health workers in Pakistan.
>>: [Inaudible].
>> Alan Black: Yes. Okay. Johansab [phonetic]. And he's in a position where the people he's dealing with have some training, but they're really not literate -- in Urdu it would be, in that case, okay -- but they do speak other languages.
And he wants to support some language that isn't Urdu -- it's the second language in Pakistan, I can't remember the name of it, okay -- and there isn't any support anywhere for that. He would like to be able to use a system like this to build up that support, because there isn't any.
And, yeah, you know, it's only tens of millions of people, but it's a way to be able to do that. So there's another case where Jay is capable of doing this, but he needs the software and the techniques to be able to do it.
And the third one that came up recently was some local people in Pittsburgh who run a refugee aid association for people coming into Pittsburgh. They have suddenly got a whole bunch of people -- I mean many hundreds of people coming in -- who are from southern Burma. They speak a language called Karen, which I'd never heard of before. And they want to be able to communicate with them, and they've got one translator who speaks English, okay.
So how can the organization tell everybody that there's going to be a meeting next week when they can't actually say it? What they wanted to do, which we did, was set up basically a Karen synthesizer so that they could call each of the people and give them information about when the next meeting was, okay?
Now, that's somewhere a big company is not going to support Karen in a hurry, but these people were interested, they did have one translator, and you could use that translator to help bootstrap a system which could then communicate with lots of people.
So there are lots of examples like this where, you know -- think about the website analogy -- it's not the big players, it's the little people who actually want to do this, and they want access to the technology to be able to do it. Okay? It's a different thing from doing the dictation type of task; it's low-level communication in the language. It could be translation, and it could just be a dialogue system that gives you information about something. Okay?
>>: [Inaudible].
>> Alan Black: Yes?
>>: So how many -- like do you have any languages that people have used
[inaudible] to create like a significant number of different [inaudible] language?
Like 15 different people [inaudible] create --
>> Alan Black: So, no, actually, we haven't, but we really want to do that, and [inaudible], my student who is on the faculty at IIIT in Hyderabad, is going to use it in his class and get 15 people to do Hindi. So we can then actually look at the cross-comparisons from that.
>>: I'm curious about how [inaudible] the consistency --
>> Alan Black: I agree. I agree.
>>: [Inaudible].
>> Alan Black: That's a very good question, because ultimately, when you make any decision in the speech-building process, you don't really know whether you're making the right decision. Being able to look at that across multiple people is something we'd really like to do, and that was the direction we were going with that -- Hindi or Telugu, or some language which most of the students can speak -- get them all to do it and then do some cross-comparison. It's a good idea.
>> Mike Seltzer: Anyone else? Thank our speaker.
[applause].
>> Alan Black: Thank you.