
Speech & Language Technologies Lab @ Bhrigus Inc
Version: 1.0
Date: 15 Sept 05
Business Proposal for Speech and Language Technologies (SLT) at Bhrigus Inc.
Commercial Products using Speech & Language Technologies
Collaboration with universities in India & abroad
Open source initiatives to promote SLT in India
Authored by: Nixon Patel, Founder & CEO, Bhrigus Software Pvt Ltd.
Copyright Bhrigus Inc
TABLE OF CONTENTS
1. Role of Speech and Language technologies
2. Speech & Language Technologies (SLT) Lab @ Bhrigus Inc
2.1 Practical Way: Where to Start with
2.2 How to Start
3. Building TTS System Using Festival & Festvox
4. Building ASR System Using Sphinx
5. Open Source Initiative
6. Deliverables and End Product
7. University Collaboration
1. Role of Speech and Language technologies
Speech and language technologies have become an increasingly central component of
Computer Science in the last decade. The goal of these technologies is to impart speech
and language capabilities to computers so that human beings can interact with computers
much as they communicate among themselves. The impact of these technologies on the
general public can be seen in the daily use of automated voice response systems and
search engines, which are built using speech technology, information retrieval and
machine translation.
With the advent of the Internet and virtually unlimited storage, information is digitally
stored, processed and communicated. However, most of the information in the digital
world is accessible only to the few who can read or understand a particular language.
Speech and language technologies can provide solutions in the form of natural interfaces
so that digital content can reach the masses and facilitate the exchange of information
among people speaking different languages. These technologies play a crucial role in
multilingual societies such as India, where many people who cannot read or write
nevertheless speak and understand more than one language.
Hands-free, natural communication with computers and universal access to information
(anywhere, in any language) are the key features of the coming era of pervasive
computing. The theory, algorithms and implementation aspects of speech and language
processing are understood well enough that practical applications are deployed in
day-to-day life.
Worldwide Scenario
Most of the top universities in the US and Europe have specialized departments that
teach and conduct research in the areas of speech and language. Carnegie Mellon
University has the Language Technologies Institute (LTI), which provides specialization
in speech and language technologies at both the undergraduate and graduate levels.
Apart from academia, companies such as Microsoft, IBM, Google and Nokia have ventured
into these technologies.
Speech recognition systems such as Dragon NaturallySpeaking, IBM ViaVoice and Sphinx,
speech synthesis systems from AT&T, Cepstral and Rhetorical, and machine translation
systems such as Systran are the outcomes of sustained effort by academia and industry.
The expertise available in both academia and industry is now such that high quality
speech-to-speech translation systems (involving speech to text, machine translation and
text to speech) for new languages are built in a short span of 26 months.
Scenario in India
Academic labs in Indian Universities have been teaching fundamental courses in speech
technology and speech signal processing. However, the academic labs or the industrial
labs (apart from IBM-India in Hindi) have not demonstrated any large vocabulary
continuous speech recognition systems in multiple Indian languages. As of today, there
is no commercial product that uses text to speech, speech to text or machine translation
in Indian languages. No stream in the academic institutions provides hands-on experience
in building large vocabulary continuous speech recognition systems, text to speech
systems or full-fledged machine translation systems. However, there has been a recent
surge of work on automatic speech recognition (ASR) and text to speech (TTS) systems
for Indian languages; prototype ASR and TTS systems have come from HP Labs India and
IIIT Hyderabad. Given the technology and the advancements in speech and language,
development and deployment of such systems is feasible. It is therefore a unique
opportunity, and the right time, to start an industrial lab for speech and language
technologies, to collaborate heavily with academic institutions both in India and
abroad, and to draw on the expertise available across multiple institutions. IIIT
Hyderabad is uniquely positioned in this respect: it is a budding center with a lot of
speech and language activity at two of its divisions, the Language Technologies Research
Center and the MSIT Division.
2. Speech & Language Technologies (SLT) Lab @ Bhrigus Inc
One of the major goals of the SLT lab at Bhrigus Inc. is to develop commercial products
using speech and language technologies, with a specific focus on Indian languages.
The role of the SLT lab can be split into the following tracks:
1. Development of speech & language resources for Indian languages
2. Commercial Products using Speech & Language Technologies
3. Collaboration with universities in India & abroad in the area of SLT
4. Open source initiatives to promote SLT in India and to build a large base of users
and application developers in India
5. Innovation & Research for future products and applications
2.1 Practical Way: Where to Start with
Given the broader objectives and the goal of building products in speech & language
technologies, the business model should be to develop products using proven technologies
in unexplored domains. Development of speech recognition (speech to text) and speech
synthesis (text to speech) systems for Telugu, Marathi and Gujarati is one possible
starting point for venturing into this area.
2.2 How to Start
There are three possible starting points: (1) develop from scratch, (2) collaborate with
existing companies and use their platform, and (3) use open source software platforms.
Approach (1) is a naïve way to start a company and is better suited to an academic lab.
Approach (2) has the advantage of extensive support at both the technical and product
level, but one has to take into account the upfront cost and the long-term cost of
third-party software and its license. Another concern is the license attached to
derivatives (such as acoustic models) obtained using third-party software. Approach (3)
is more suitable for risk-taking companies with a geeky bent of mind that can invest
time and money to make products out of open source systems such as Sphinx and
Festival & Festvox. Given sufficient background in speech technologies, these are
approachable and versatile software packages that encapsulate state-of-the-art
algorithms for building ASR and TTS systems. Commercial companies such as LumenVox, Sun,
AT&T and Cepstral use these open source tools to build and fine-tune their speech
recognition and synthesis products.
3. Building TTS System Using Festival & Festvox
The goal of a TTS system is to convert text to speech. Festival is a multilingual speech
synthesizer, while Festvox is a set of scripts built around the Festival engine for
building a new voice in a new or existing language. Festival is widely used by both
academic and industrial labs, such as AT&T, to build high quality voices.
To build a voice in a new language, the steps involved are as follows:
1. Defining the phone set, letter-to-sound rules and syllabification rules of the
language
2. Selection of text to be recorded
3. Recording of speech database
4. Labeling the speech database
5. Building the unit database using a clustering algorithm
6. Fine-tuning parameters such as pitch markers, and clustering the units by tagging
them with more phonemic context, etc.
A high quality TTS voice requires a large text corpus from which a set of sentences can
be selected that gives good coverage of high frequency words and diphones. The speech
database required for this purpose is a single speaker’s voice recorded in a studio
environment, typically 5-10 hours of speech.
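The sentence selection step can be illustrated with a small sketch. The proposal does not
specify a selection algorithm; the code below assumes a simple greedy strategy that
repeatedly picks the sentence adding the most not-yet-covered diphones. The function
names, the sentence ids and the phone symbols are hypothetical, and the sketch covers
diphones only, not word frequency.

```python
from typing import Dict, List, Set, Tuple

Diphone = Tuple[str, str]


def diphones(phones: List[str]) -> Set[Diphone]:
    """Return the set of adjacent phone pairs (diphones) in one sentence."""
    return set(zip(phones, phones[1:]))


def select_sentences(corpus: Dict[str, List[str]], max_sentences: int) -> List[str]:
    """Greedily pick sentence ids that each add the most uncovered diphones.

    `corpus` maps a sentence id to its phone sequence (assumed to be already
    produced by the phone set and letter-to-sound rules of the language).
    """
    covered: Set[Diphone] = set()
    remaining = dict(corpus)
    selected: List[str] = []
    while remaining and len(selected) < max_sentences:
        # Iterate ids in sorted order so that ties go to the smallest id.
        best_id = max(
            sorted(remaining),
            key=lambda sid: len(diphones(remaining[sid]) - covered),
        )
        gain = diphones(remaining[best_id]) - covered
        if not gain:
            break  # no remaining sentence adds new coverage
        selected.append(best_id)
        covered |= gain
        del remaining[best_id]
    return selected


# Toy corpus with placeholder phone symbols (not a real phone set).
toy_corpus = {
    "s1": ["a", "b", "c"],
    "s2": ["a", "b"],
    "s3": ["c", "d", "a", "b"],
}
chosen = select_sentences(toy_corpus, max_sentences=10)
```

In this toy corpus, "s3" is chosen first because it contributes three diphones, "s1" next
because it adds one more, and "s2" is skipped because it adds nothing new. A production
selection would likely also weight units by frequency, which this sketch leaves out.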
4. Building ASR System Using Sphinx
The goal of an ASR system is to convert speech to text. Sphinx is an open source ASR
system built at Carnegie Mellon University, USA. This software has two main
components: SphinxTrain and Sphinx Decoder. SphinxTrain generates the acoustic
models, while Sphinx Decoder does the decoding, which is also referred to as
recognition. Because of the complexity involved in building decoders, multiple versions
of Sphinx Decoder are available: Sphinx-II, Sphinx 3.2, Sphinx 3.5 and Sphinx 4 are some
of the variants. Sphinx-II uses semi-continuous models, whereas Sphinx 3.x and 4.x use
continuous models. For our purposes, we will stick to Sphinx 3.5, which is written in
C/C++ and is widely supported internally at Carnegie Mellon.
The typical steps involved in building a speech recognition system are as follows:
1. Selection of the text to be recorded. This requires a large text corpus, which is
transliterated and cleaned. The corpus is given to a text selection algorithm
to derive the required number of sentences.
2. Recording of these sentences by multiple speakers in different dialects and under
different recording conditions.
3. Building the recognition engine using SphinxTrain and Sphinx Decoder.
4. Building a language model.
5. Fine-tuning the parameters of the language model and the acoustic models.
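To make the language model step concrete, here is a minimal sketch of what such a model
estimates: the probability of a word given the previous word. It is a plain bigram model
with add-one smoothing, written in standalone Python. It does not use Sphinx or its
tooling and is not the format the Sphinx decoder expects; the function names and the toy
tokens are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter, int]:
    """Count histories and bigrams over sentences padded with <s> and </s>."""
    histories: Counter = Counter()  # how often each token precedes another token
    bigrams: Counter = Counter()    # how often each (previous, word) pair occurs
    vocab = set()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(tokens)
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams, len(vocab)


def bigram_prob(prev: str, word: str, histories: Counter,
                bigrams: Counter, vocab_size: int) -> float:
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size)


# Toy training text with placeholder tokens (stand-ins for transliterated words).
training = [["a", "b"], ["a", "c"]]
histories, bigrams, vocab_size = train_bigram(training)
p_seen = bigram_prob("a", "b", histories, bigrams, vocab_size)
p_unseen = bigram_prob("b", "c", histories, bigrams, vocab_size)
```

Smoothing gives unseen word pairs a small non-zero probability, which matters when the
training text is small. The smoothing method and the n-gram order are examples of the
language model parameters that the fine-tuning step would adjust.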
5. Open Source Initiative
While the development of products and resources proceeds in a parallel stream, open
source initiatives are essential for visibility and for building a user base. Releasing
an open source Hindi voice could be a good strategy. Such releases could also be
initiated by collaborating with universities on projects whose deliverables include the
release of the voice as open source.
6. Deliverables and End Product
The following is the list of deliverables at the end of this effort:
1. Text and Speech Resources for Telugu, Marathi, Gujarati & Hindi
2. ASR for the 3 languages: Telugu, Marathi & Gujarati
3. TTS for the 4 languages (Telugu, Marathi, Gujarati & Hindi); the Hindi TTS could be
made open source
4. End Product: Railway reservation system for Telugu, Marathi & Gujarati
7. University Collaboration
Collaboration with the universities working in the SLT stream is important. This
collaboration can take the form of industrial membership (see below for the
definition) or of sponsored projects.
Industrial member: This concept applies to MSIT-IIIT. A company can become an
industrial member by paying a membership fee, upon which the member plays a role in
the following aspects of the SLT stream:
1. Refining the curriculum based on industrial requirements
2. Industrial practicum & internships for MSIT – SLT stream students
3. Job opportunities for the MSIT – SLT stream students
While the industrial member plays a contributing role in the SLT stream, it also
receives the following benefits:
1. Employees of the industrial member can take SLT courses at MSIT under the
industrial member’s course fee structure.
2. The member can draw on the expertise of the faculty members in the form of
consultancy.
3. The member can work on collaborative projects with faculty members and students of
the MSIT-SLT stream.