
THE ACQUISITION OF A SPEECH CORPUS FOR LIMITED DOMAIN TRANSLATION
Demetrio Aiello, Loredana Cerrato, Cristina Delogu, Andrea Di Carlo
{demetrio, loredana, cristina, adicarlo}@fub.it
Fondazione Ugo Bordoni - Via B. Castiglione, 59 - 00142 Rome, Italy
Abstract
In this paper we report on the ongoing collection of a
speech corpus for the ESPRIT LTR project
n. 30268, EuTrans. The corpus is intended to provide
training material for speaker independent continuous
speech recognition over the telephone line, based on a
vocabulary of a few thousand words for recognition and
for translation training. Given its intended application, the
corpus is structured to contain speech material for acoustic
modelling, and textual material for language modelling
and translation modelling. The speech material being
collected, which we describe in this paper, was uttered
in a natural way.
The corpus is described with the aid of some statistical
results that illustrate the characteristics of the
acquired material.
Finally, we present our plans for collecting the other
parts of the corpus and, in particular, introduce a new
"dialogue oriented" collection paradigm.