matejka_interspeech2012_ICSLP..

advertisement
Patrol Team Language
Identification System
for
DARPA RATS P1 Evaluation
Pavel Matejka1, Oldrich Plchot1, Mehdi Soufifar1, Ondrej Glembek1, Luis Fernando D’Haro1,
Karel Vesely1, Frantisek Grezl1, Jeff Ma2, Spyros Matsoukas2, and Najim Dehak3
1Brno
University of Technology, Speech@FIT and IT4I Center of Excellence, Czech
2Raytheon
3MIT
BBN Technologies, Cambridge, MA, USA
Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
matejkap@fit.vutbr.cz, smatsouk@bbn.com, najim@csail.mit.edu
1
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Outline
 About DARPA RATS program
 Datasets and task description
 Subsystems with analysis
 Fusion and Results
 Conclusion
2
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
DAPRA RATS Program
 RATS = Robust Automatic Trascription of Speech
 Goal : create algorithms and software for performing the following
tasks on speech-containing signals received over communication
channels that are extremely noisy and/or highly distorted.
 Tasks :
–
–
–
–
Speech Activity Detection
Keyword Spotting
Language Identification
Speaker Identification
 Data collector : LDC
 Evaluation by SAIC
 Performer: PATROL Team led by BBN
3
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Data Specification
 Languages:
– Dari, Levantine Arabic, Urdu, Pashtu, Farsi
– >10 out of set languages
 Durations: 120s, 30s, 10s, 3s
 Telephone conversations retransmitted over 8 noisy radio
communication channels [marked as A-H]
 Available: collections of 2-min audio samples
– LDC2011E95 – split to train and dev by SAIC
– LDC2011E111 – split to train and dev by Patrol team
– LDC2012E03 – supplemental training for non-target languages
 The amount of audio data for different languages heavily
unbalanced
 Added shorter duration samples
– Derived from 2-min samples, based on our SAD output
4
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Datasets

Train
– Main
• Only files where VAD detects >60s of speech
• 30774 files together
• Unbalanced = 668 files for Dari, 12778 for Leventine Arabic
– Balanced
• Balanced over files for each language and channel
• 7150 files for each duration
• 673 files for Dari, otherwise ~1300
– Extended
• Main + all 30sec cuts from Main set
• + entire LDC2012e03 (only nontarget languages)
• ~170k segments

Development Set
– Corpus was driven by Dari - only 679 source files, other languages limited to 1000
files, 2432 files for non target languages
– ~7120 files for each duration

Evaluation Data
– 2527 files for each duration
5
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
LID Patrol System Architecture
JFA LID
iVector LID
BUT SAD
Calibration
& Fusion
CZ Phoneme
Recognition
Combined
Score
Phonotactic
iVector LID
Audio
BBN SAD
6
iVector LID
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Speech Activity Detection
 One of the most important blocks since the data really difficult
– See separate paper about SAD development on Wednesday 16:00 in
Pavilon West
 Used both GMM-based (BBN) and MLP-based (BUT) detectors.
7
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Speech Activity Detection
 Comparison of different SAD systems
– Robust SAD tuned for noisy telephone speech
– Robust SAD tuned for RATS
 Results (Cavg) are on DEV set (but scored with SRC
channel)
 iVector system (600dim) used for this experiment
SAD type/ Cavg[%]
120s
Telephone
2,2
RATS
1,6
 25% relative gain
8
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
iVector LID System (BUT)
 Acoustic Features
– Dithering, bandwidth 300-3200Hz for 25 Mel-filters, 6 MFCC+C0
– CMN/CVN (based on SAD), RASTA normalization
– Shifted Delta Cepstra (SDC) 7-1-3-1
 UBM
– Language independent, diagonal-covariance, 2048 Gaussians
– Trained on balanced train set
 iVector
– 600 dimensions
– Trained on main set
 Neural network classifier
–
–
–
–
9
iVector input, 6 outputs (1 nontarget + 5 target languages)
Hidden layer with 200 nodes
Stochastic Gradient Descent training with L2 regularization
Trained on extended set (all data + all 30 sec splits)
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Comparison of Logistic Regression
and Neural Network as final classifier
 BUT iVector system (600dim)
 Results on Development set
 Logistic Regression
 trained by: Trusted Region Conjugate-GD
 Results on Development set




10
Neural Net:
one hidden layer 200
trained by: Stochastic-GD with L2 regularization
also experiments with Conjugate-GD, but no improvement
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
JFA LID System (BUT)
 Acoustic Features
– Same as for iVector system + Wiener filtering
 Universal Background Model (UBM)
– Language independent,
– Diagonal-covariance, 2048 Gaussians
– Trained on balanced train set
 JFA
– Trained on main train set
– μ = m + Dz + Ux
– Models of languages D are MAP adapted from UBM with tau =10
– Channel matrix U with 200 dimensions
– Linear scoring
11
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Importance of Wiener Filter
 400dim i-vector + logistic regression experimental
system
 Results on Development set
12
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
iVector LID System (BBN)
 Acoustic Features
– RASTA-PLP
– Block of 11-frame PLPs, projected to 60 dimensions via HLDA
 UBM
– Language dependent (5 target, 1 “non-target”), 1024 Gaussians
 iVector
– 400 dimensions
– Group adjacent speech segments into 20s chunks, estimate one
iVector per chunk
• improves performance on short duration conditions by 28%
– Estimate 6 iVectors (one per UBM)
– Apply neural network (NN) to each iVector - 6 outputs (1
nontarget + 5 target languages)
– Combine NN outputs to form 6-dimensional score vector
• 26% relative improvement compared to using language
independent i-vectors
13
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Analysis of iVector LID System (BBN)
• Analysis of the BBN iVector extractor training and UBM:
1. Whole audio segments, single UBM
2. Audio split to 20s segments, single UBM
3. Audio split to 20s segments, language dependent background
models (LDBM)
14
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Phonotactic iVector LID System (BUT)
 Phoneme recognizer
– Czech CTS recognizer trained on artificially noised data
• Added noise with varying SNR (lowest 10dB) to 30% of the corpus
– 38 phonemes
– 3-gram counts: sum of posterior probabilities of 3-grams from
phone lattices
 iVector – Multinomial subspace modeling
– 600 dimensions, trained on main train set
– Training a low-dimensional subspace in the framework of total
variability model using multinomial distribution
– Using point-estimate of the model’s latent variable for each
utterance as our new features
 Logistic regression as final classifier
– Trained on main train set
15
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Fusion and Calibration
 Regularized logistic regression
– Objective: minimize cross-entropy on development set
– Duration-independent – trained on files from 10s, 30s, and 120s
conditions
 Procedure
– Calibrate (tune) each system individually
– Combine calibrated system outputs into a single output vector
• Fusion parameters estimated on the same development set
 Performance evaluation
– Primarily Cavg score
– Also computed PMISS and PFA at Phase 1 target operating points
16
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Overall Results
EVL
DEV
System\Duration\Cavg[%]
17
JFA
iVector NN BUT
iVector NN BBN
CZ3-iVec600
Fusion
JFA
iVector NN BUT
iVector NN BBN
CZ3-iVec600
Fusion
120s
1.61
1.60
2.58
2.60
0.83
7.05
7.83
9.03
9.06
6.56
30s
6.14
4.94
5.92
8.95
2.92
12.92
9.97
11.96
15.56
8.33
Patrol LID System for DARPA RATS P1 Evaluation
10s
12.52
10.36
12.06
16.84
6.85
17.36
14.52
17.83
21.90
11.40
3s
23.53
21.73
28.21
30.53
18.08
22.68
21.46
27.10
29.12
17.45
Pavel Matejka
Robustness
 There is channel B completely removed from the training of the
contrastive system (noB) (channel B is unseen channel)
 Results on Development set with BUT iVector system (600dim)
 Over all results
System/Cavg[%]
iVector NN
iVector NN noB
120s
1.9
2.5
30s
6.6
7.2
10s
13.0
13.4
3s
23.6
23.8
120s
3.4
8.1
30s
9.7
14.7
10s
16.8
19.8
3s
25.8
29.0
 Results only for channel B
System/Cavg[%]
iVector NN
iVector NN noB
18
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Conclusion
 SAD is crucial
 De-noising helps
 Benefit from using Language dependent UBM
 Benefit from using NN as final classifier for LID
19
Patrol LID System for DARPA RATS P1 Evaluation
Pavel Matejka
Download