
Searching through a Speech Memory
for Efficient Coding, Recognition and Synthesis
or
for Unified Speech Processing Without Transcriptions
Gérard CHOLLET (chollet@tsi.enst.fr), Dijana PETROVSKA (dijana.petrovski@unifr.ch)
ENST, CNRS-LTCI, 46 rue Barrault, 75634 PARIS cedex 13, France
ABSTRACT
The common denominator of different speech processing methods is the speech unit being used. The majority of current systems use phonemes (or derived units) as an atomic representation of speech. Phonetic speech units lead to efficient representations and implementations for many speech processing systems. The major problem that arises when phoneme-based systems are developed is the lack of transcribed databases. This is the main bottleneck for developing new applications and porting existing ones to new languages: for each application, new transcribed databases are needed. In order to bypass this tedious, error-prone and expensive transcription step, we need to develop systems that minimize the use of transcribed databases. This paper presents a unified approach to speech processing based on automatically derived speech units that do not require transcribed databases. These automatically derived units are denoted here as Automatic Language Independent Speech (ALIS) units. Results for very low bit-rate (<600 bps) speaker-independent coding and preliminary results for speaker recognition using these units are reported. Further perspectives are also described.
1. Introduction
The principles of pattern recognition are widely used in automatic speech processing: a similarity measure is evaluated between an unlabeled input and labeled patterns in memory. For example, in order to achieve bit rates of the order of 400 bps in speech coding, it is necessary to use recognition and synthesis techniques. The coder and the decoder share a dictionary of speech segments. The information transmitted by the coder consists of the indices of the recognized units and some prosodic parameters. At the decoder side, speech synthesis is used to reconstruct the output speech from the sequence of transmitted symbols.
When we deal only with speech coming from a single speaker (the speaker-dependent case), after the recognition phase we use a similarity measure to evaluate which among the units available in the memory of the coder (and also present at the decoder side) best matches the incoming speech unit. By transmitting only these indices (and some prosodic information), the bit rate is drastically reduced.
When we want to deal with speech coming from an unknown speaker, we have to use a speaker-independent system. It is built by training a speech recognizer with as much speech, from as many speakers, as possible. This speech recognizer is used to determine the symbolic representation of the incoming unknown speech to be coded. The next step is to determine the best candidates for coding among the units shared by the coder and the decoder. If, during this matching phase, the similarity measure conveys speaker-specific information, we could exploit it for speaker recognition and/or speaker adaptation.
Current coders working at bit rates lower than 400 bps are based on segmental units related to phones (phones being the physical realizations of the corresponding phonemes). The efficiency of these phonetic vocoders [5], [6] is due to the fact that the main information transmitted by this type of coder is the index of the recognized unit, together with some prosodic parameters. An alternative approach, using automatically derived speech units based on ALISP tools, was developed at ENST, ESIEE and VUT-BRNO [1]-[4]. These speech units are derived from a statistical analysis of the speech corpus, requiring neither phonetic nor orthographic transcriptions of the speech data. However, the experiments conducted so far have only involved the speaker-dependent case. In this paper we extend this technique to the speaker-independent case and show preliminary results illustrating the potential of this new method for speaker recognition. The outline of the paper is the following: Section 2 explains the selection procedure of the speech segments. The database underlying the experiments and the details of the experimental setup are given in Section 3. Section 4 compares the ALISP speech units with phone units. The results of very low bit-rate speech coding in the speaker-independent case are given in Section 5. The experiments showing that the metric we use to evaluate the distortion between two speech segments conveys speaker-specific information are given in Section 6. Section 7 gives some perspectives and concludes our work.
To be moved elsewhere:
Some of the problems in automatic speech recognition: variability of acoustic patterns; the acoustic realization of phonemes depends on many factors (speaker, context, stress, environment, ...)
How to achieve very low bit-rate coding?
Compression of computer files, multigrams, perplexity
Organizing a speech memory
Related work
2. Selection of Speech Segments
In order to solve the problem of searching in a database representing our speech memory, and to avoid an exhaustive full search, it is important to represent this memory efficiently. The selection of speech units is done in two steps.
2.1 Acoustic modeling of data driven (ALIS) units
The first stage consists of a recognition phase: we use Hidden Markov Models for their well-known acoustic modeling capabilities. However, in place of the widely used phonetic labels, we use a set of data-driven speech symbols automatically derived from the training speech data. This set of symbolic units is built through temporal decomposition, vector quantization and segmentation. These symbols are used to train the Hidden Markov Models of the Automatic Language Independent Speech (ALIS) units, which are then used during the recognition phase of the incoming speech to be coded. This stage is schematically represented in Figure 1. Here is a more detailed description of the methods used:
2.1.1 Temporal Decomposition
After a classical pre-processing step, temporal decomposition [8] is used for the initial segmentation of the speech into quasi-stationary parts. The matrix of spectral parameters is decomposed into a limited number of events, each represented by a target and an interpolation function. The td95 package (we thank Frederic Bimbot, IRISA Rennes, France, for the permission to use it) is used for the temporal decomposition. A short-time singular value decomposition with adaptive windowing is applied during the initial search of the interpolation functions. The following step is the iterative refinement of targets and interpolation functions. At this point, the speech is segmented into spectrally stable segments. For each segment, we also compute its gravity-center frame.
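Schematically, temporal decomposition approximates the matrix of spectral parameters as a weighted combination of a small number of event targets (the notation below is ours, introduced only for illustration): $\mathbf{Y} \approx \mathbf{A}\,\boldsymbol{\Phi}$, i.e. $y(n) \approx \sum_{k=1}^{m} a_k\,\phi_k(n)$, where $\mathbf{Y}$ ($p \times N$) contains the $N$ spectral vectors, the columns $a_k$ of $\mathbf{A}$ ($p \times m$) are the event targets, and the rows $\phi_k(n)$ of $\boldsymbol{\Phi}$ ($m \times N$) are the compactly supported interpolation functions, with $m \ll N$.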
2.1.2. VQ clustering and segmentation
Vector quantization (VQ) is used for the clustering of the temporal decomposition segments. The training phase of the codebook is based on the K-means algorithm, and the size of the coding dictionary is determined by the number of centroids in the codebook. The codebook is trained using only the gravity-center frame of each temporal decomposition segment. The result of this step is the labeling of each temporal decomposition segment with a label from the VQ codebook. The segmentation is then refined using the cumulated distance of all the vectors of a segment.
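As an illustration, here is a minimal sketch of this clustering step (the function and variable names are ours, and scikit-learn's KMeans merely stands in for whatever K-means implementation was actually used):

```python
import numpy as np
from sklearn.cluster import KMeans

def label_segments(gravity_center_frames, n_units=64, seed=0):
    """Cluster the gravity-center frames of the temporal-decomposition
    segments and return one symbolic (ALIS) label per segment.

    gravity_center_frames: (n_segments, n_cepstra) array, one cepstral
    vector taken at the gravity center of each quasi-stationary segment.
    """
    kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(gravity_center_frames)
    return labels, kmeans.cluster_centers_

# Example with random data standing in for real LPCC vectors.
frames = np.random.randn(5000, 12)         # 5000 segments, 12 cepstral coefficients
labels, codebook = label_segments(frames)   # labels[i] is the symbol of segment i
```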
2.1.3. Hidden Markov Modeling
Hidden Markov Models (HMMs) are widely used in speech recognition because of their acoustic modeling capabilities. In these experiments, the HMMs are used to model and refine the segments obtained from the initial VQ segmentation of the training data. At this step we have trained the HMM models that will be used for the recognition of the coding units.
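A minimal sketch of this per-unit modeling, using hmmlearn's GaussianHMM as a stand-in for the HMM toolkit actually used (the three-state configuration matches the setup described in Section 3.2; everything else, including all names, is assumed):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_unit_hmms(segments, labels, n_states=3):
    """Train one HMM per ALIS label from the VQ-labeled segments.

    segments: list of (n_frames_i, n_cepstra) arrays of cepstral vectors
    labels:   one symbolic label per segment
    """
    models = {}
    for unit in sorted(set(labels)):
        seqs = [seg for seg, lab in zip(segments, labels) if lab == unit]
        X = np.concatenate(seqs)             # stacked frames of this unit's examples
        lengths = [len(s) for s in seqs]     # frame count of each example
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=10)
        hmm.fit(X, lengths)
        models[unit] = hmm
    return models
```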
2.2 DTW matching
The results of this first search are refined with a second search. Once the most probable sequence of symbols is estimated, for each of these symbols we look for the examples (carrying the same label in the speech memory) that best match the segments to be coded. The distortion measure used in our experiments is the Dynamic Time Warping (DTW) distance. When dealing with speech examples coming from multiple speakers, we preserve the information about the identity of these speakers. In this way, we can choose the subset of available examples so as to represent all the training speakers in a uniform manner; by uniform we mean that each speaker is represented by the same number of examples.
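For illustration, a straightforward dynamic-programming sketch of the DTW distance between two sequences of cepstral vectors (not the exact implementation used in the experiments):

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two sequences of cepstral vectors.

    a: (n, d) array, b: (m, d) array. The local cost is the Euclidean
    distance between frames; classical symmetric step pattern.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m] / (n + m)   # length-normalized distance
```

The best-matching example for a given symbol is then simply the memory segment with minimal DTW distance to the segment being coded.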
3. Database and experimental setup
3.1 Database
For the experiments we used the BREF database [7], a large-vocabulary read-speech corpus for French. The texts were selected from 5 million words of the French newspaper ``Le Monde''. In total 11,000 texts were chosen in order to maximize the number of distinct triphones. Separate text materials were selected for the training and test corpora. 120 speakers of different French dialects have been recorded, each providing between 5,000 and 10,000 words (approximately 40-70 min of speech). They are subdivided into 80 speakers for training, 40 for development and 20 for evaluation purposes.
3.2 Experimental setup
In this paper we address the issue of extending a speaker-dependent very low bit-rate coder, based on speech units automatically derived with ALISP, to a speaker-independent situation. As a first step, a gender-dependent, speaker-independent coder is experimented with. For the speaker-independent experiments, we took 33 male speakers from the training partition of the BREF database to train our speaker-independent (and gender-dependent) coder. For the evaluation, 3 male speakers were taken from the evaluation corpus. For a baseline comparison, we generated equivalent speaker-dependent experiments for the three male speakers from the evaluation set; their speech data was divided into a training and a test part. For ease of comparison, the test sentences are common to both experiments. The BREF database is sampled at 16 kHz. The speech parameterization was done by classical Linear Predictive Coding (LPC) cepstral analysis. The Linear Prediction Cepstral Coefficients (LPCC) are calculated every 10 ms on a 20 ms window. For consistency with the previous speaker-dependent experiments on the Boston University corpus [2], [3], cepstral mean subtraction is applied in order to reduce the effect of the channel. The temporal decomposition was set up to produce 16 events per second on average. A codebook with 64 centroids is trained on the vectors from the gravity centers of the interpolation functions, while the segmentation was performed using cumulated distances over the entire segments. With the speech segments clustered in each class, we trained a corresponding HMM model with three states. The initial HMM models were refined through one re-segmentation and re-estimation step. With those models, the 8 longest segments per model were chosen from the training corpus to build the set of synthesis units, denoted as synthesis representatives. The original pitch and energy contours, as well as the optimal DTW time-warps between the original segment and the coded one, were used. The index of the best-matching DTW representative is also transmitted. The unit rate is evaluated assuming uniform encoding of the indices. The encoding of the representatives increases the rate by 3 additional bits per unit.
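As a rough illustrative calculation based on the parameters above, the segmental information alone costs $\log_2 64 = 6$ bits for the unit index plus 3 bits for the synthesis representative, i.e. 9 bits per unit; at an average of 16 units per second this amounts to roughly 144 bps, the remainder of the (<600 bps) budget being available for pitch, energy and timing information.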
4. Comparison of ALISP units with phone units
The automatically derived speech units are the central part of the proposed method. With the experimental parameters reported in the previous section, the mean length of the temporal decomposition segments (prior to HMM modeling and refinement) is 61 ms. We have studied the correspondence of the automatically derived units with an acoustic-phonetic segmentation. The phonetic units were obtained from forced Viterbi alignments of automatically generated phonetic transcriptions (originating from an automatic phonetizer).
Table 1: Phone unit set used in the comparison with ALISP units.
The following procedure was used to quantify their correspondence. We first measured the overlap of the $n_a$ ALISP units with the $n_p$ phone units. Then the confusion matrix $X$, of dimensions $n_p \times n_a$, is calculated from the relative overlap values as

$$x_{i,j} = \frac{1}{c(p_i)} \sum_{k=1}^{c(p_i)} r(p_i^k, a_j),$$

where $c(p_i)$ is the number of occurrences of the phone $p_i$ in the corpus, and $r(p_i^k, a_j)$ is the relative overlap of the $k$-th occurrence of the phone unit $p_i$ with the ALISP unit $a_j$. The relative overlap $r$ is calculated from the absolute overlap $R$ by normalizing with the length of the considered phone occurrence: $r(p_i^k, a_j) = R(p_i^k, a_j) / L(p_i^k)$. The value $x_{i,j}$ gives the correspondence of the phone unit $i$ with the ALISP unit $j$ over the whole training set. The set of phone units used in this experiment is given in Table 1.
The resulting confusion matrix is shown in Figure 1, where the 35 phone units are arranged according to phone classes and the 64 ALISP units are ordered so as to give a pseudo-diagonal shape. We can conclude that some units are similar. For example, the ALISP units He and H8 correspond to the closures b, d, g, while Hf and Hi correspond to the other subset of closures k, p, t. The fricative f has its corresponding unit HI. The unit HM corresponds to the nasal set m, n. Among the most confusable units we can point to the unit H2, which corresponds to a large set of phonemes.
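A minimal numpy sketch of the confusion-matrix computation described above (the overlap measurement itself is assumed to be available as a list of tuples; all names are ours):

```python
import numpy as np

def confusion_matrix(overlaps, n_phones, n_alisp):
    """Build the (n_phones x n_alisp) correspondence matrix x.

    overlaps: iterable of (p, k, a, r) tuples, where r is the relative
    overlap of the k-th occurrence of phone p with ALISP unit a.
    x[p, a] is the mean relative overlap over all occurrences of p.
    """
    x = np.zeros((n_phones, n_alisp))
    occurrences = [set() for _ in range(n_phones)]
    for p, k, a, r in overlaps:
        x[p, a] += r
        occurrences[p].add(k)                 # count each occurrence of p once
    counts = np.array([max(len(o), 1) for o in occurrences], dtype=float)
    return x / counts[:, None]                # normalize each row by c(p_i)
```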
5. Speaker independent coding
Some results and comments (with LPC synthesis).
6. Searching for the nearest speaker
As explained in the introduction, when coding speech coming from an unknown speaker we use a speaker-independent recognizer to obtain the symbolic representation of the incoming speech, and we then determine the best candidates for coding among the units shared by the coder and the decoder. If the similarity measure used during this matching phase conveys speaker-specific information, we could exploit it for speaker recognition and/or speaker adaptation. In order to verify these ideas we have carried out some preliminary experiments, considering two different cases. First, if the speaker to be coded is present in the training database, does our similarity measure discover this (we denote this case as the known training speakers case)? Second, how does the system behave when we want to code speech from a speaker that is unknown to the system (we denote this case as the impostor speakers case)?
6.1 Coding speech coming from (known) train speakers
In order to study if with the DTW distance measure it is possible to determine the speaker identity, we have first considered
the case of coding speech that we know that is originating from one of the training speakers. Given a speech data to be
coded, we have segmented it first with our ALISP recognizer. For each of the recognized segments, we then proceed to the
determination of the best matching examples from our subset of examples having the same label. As each of these examples
are also identified by their original identity, we can report a histogram on the coding sentence, giving us the relative
frequencies of the number of times the various identities present in the training database were chosen as the best matching
units. A maximal value of this relative frequency is an indication that the identity of the given speaker was preffered among
the others identities.
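A minimal sketch of this identity histogram (the names are ours; best_match_speakers is assumed to hold the speaker identity of the memory segment selected for each coded unit):

```python
from collections import Counter

def speaker_histogram(best_match_speakers):
    """Relative frequency with which each training speaker's segments
    were selected as best DTW matches while coding one utterance."""
    counts = Counter(best_match_speakers)
    total = sum(counts.values())
    return {spk: n / total for spk, n in counts.items()}

# Example: the most frequent identity is taken as the closest training speaker.
hist = speaker_histogram(["spk01", "spk07", "spk01", "spk12", "spk01"])
closest = max(hist, key=hist.get)   # -> "spk01"
```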
We report the results concerning the choice of the speaker identity when coding short and long speech segments. The order of magnitude of the shorter tests is roughly 30 s, while for the longer ones it varies from 5 to 7 min. The relative frequency histograms of the preferred speaker identity, when coding speech segments originating from speaker number one, are reported in Figures 6.1 and 6.2; the duration of the coded speech is 30 s and 5 min, respectively. Speech coming from training speaker number 2 was also coded in a similar manner, with short and long speech data; these results are reported in Figures 6.3 and 6.4. From these experiments we can conclude that, at least for these two speakers, our algorithm chooses the real identity of the speaker as the closest speaker. The fact that the histograms show the same tendency for the short and the long speech data is a confirmation of the statistical significance of our test.
6.2 Coding speech coming from (unknown) impostor speakers
In order to evaluate the behaviour of our system when coding speech data uttered by an unknown speaker, we have run tests with speech coming from two speakers that are not present in the training set, denoted here as X and Y. Their respective results are presented in Figures 6.3 and 6.4. Even though these results need to be confirmed with a greater number of training and test speakers, we can conjecture that when coding speech data coming from known speakers, the proposed algorithm can identify the speaker, by preferentially choosing (10% to 15% of the time) segments belonging to that speaker. When an impostor is present, the system has more difficulty choosing a preferred speaker (the maximal values in these cases being 6 to 8%).
Figure 6.3
Figure 6.4
7. Conclusion and Perspectives
8. Acknowledgments
Several colleagues have contributed to this paper. They are, in alphabetical order: René Carré, Maurice Charbit, Guillaume Gravier, Nicolas Moreau, Dijana Petrovska, Jean-Pierre Tubach, François Yvon. Thanks to all of them.
9. References
1. Barbier, L., Techniques neuronales pour l'adaptation à l'environnement d'un système de reconnaissance automatique
de la parole, Doctoral thesis, ENST, Paris, 1992.