Introduction

National Taipei University of Technology
Professor: Yuan-Fu Liao
Overview
• Introduction
• Microphone Array and ASR Integration
• Noise - Phase Error Filtering
  – Maximum Likelihood-based Integration
  – Minimum Classification Error-like Integration
• Reverberation - Subband Filtering-and-Sum
  – Maximum Likelihood-based Integration
  – Minimum Classification Error-based Integration
• Summary
Traditional Beamforming + ASR
• Pipeline: first enhance the speech with a beamformer, then feed it into the recogniser
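As a rough illustration of the enhancement stage in this pipeline, the sketch below shows a delay-and-sum beamformer with integer sample delays. The function name `delay_and_sum`, the toy signals, and the assumption that the delays are already known and non-negative are illustrative choices, not taken from the slides.

```python
def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its delay (in samples).

    channels: list of equally sampled signals, one list of floats per microphone.
    delays:   non-negative integer delay of each channel, assumed known here.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    # Only keep the samples for which every aligned channel has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


# Toy example: the same pulse reaches the second microphone one sample later.
mic1 = [0.0, 1.0, 0.0, 0.0]
mic2 = [0.0, 0.0, 1.0, 0.0]
enhanced = delay_and_sum([mic1, mic2], delays=[0, 1])
```

In the traditional pipeline, `enhanced` would then be handed to the recogniser without any feedback from it.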
Bridge the Gap between Array and Speech Recognizer
• Take advantage of available a priori knowledge, i.e., the underlying recognition model
• Feed the output of the recognizer directly back to the microphone array
References
• Noise - dual-microphone phase error filtering
  – G. Shi, P. Aarabi, and H. Jiang, “Phase-based dual-microphone speech enhancement using a prior speech model,” IEEE Trans. Audio Speech Lang. Process., vol. 15, pp. 109–118, 2007.
  – C. Kim, K. Kumar, B. Raj, and R. M. Stern, “Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain,” in INTERSPEECH 2009, pp. 2495–2498, 2009.
  – H.-C. Liao, Y.-F. Liao, and C.-H. Lee, “Maximum confidence measure based interaural phase difference estimation for noise masking in dual-microphone robust speech recognition,” in INTERSPEECH 2011.

• Reverberation - subband filtering-and-sum
  – M. L. Seltzer, B. Raj, and R. M. Stern, “Likelihood-maximizing beamforming for robust hands-free speech recognition,” IEEE Trans. Speech and Audio Processing, vol. 12, no. 5, pp. 489–498, Sep. 2004.
  – M. L. Seltzer and R. M. Stern, “Subband likelihood-maximizing beamforming for speech recognition in reverberant environments,” IEEE Trans. Speech and Audio Processing, vol. 14, no. 6, pp. 2109–2121, Nov. 2006.
  – Y.-F. Liao and I.-Y. Xu, “Subband minimum classification error beamforming for speech recognition in reverberant environments,” in ICASSP 2010.

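Signal Modeling (ITD)
• Sampling rate: 8000 Hz
[Figure: a sound source at angle Φ to two microphones (mic1, mic2) spaced 0.05 m apart. The path-length difference between the microphones is 0.05 × cos(Φ) m, which gives the interaural time delay Δt.]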
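To make the geometry concrete, the sketch below converts a source angle Φ into an interaural time delay using the 0.05 m spacing and 8000 Hz sampling rate given above. The speed of sound (343 m/s) and the function names are assumptions added for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value (not stated on the slide)
MIC_SPACING = 0.05      # m, from the slide
SAMPLE_RATE = 8000      # Hz, from the slide


def itd_seconds(phi_deg, spacing=MIC_SPACING, c=SPEED_OF_SOUND):
    """Delay between the microphones: path difference spacing*cos(phi) over c."""
    return spacing * math.cos(math.radians(phi_deg)) / c


def itd_samples(phi_deg, fs=SAMPLE_RATE):
    """The same delay expressed in (fractional) samples at rate fs."""
    return itd_seconds(phi_deg) * fs


for angle in (0, 30, 60, 90):
    print(angle, itd_seconds(angle), itd_samples(angle))
```

Under these assumptions the largest possible delay (Φ = 0°) is only a little over one sample at 8000 Hz, which is one reason the delay is usually estimated from phase differences in the frequency domain rather than from sample shifts.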
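Binary Masking
[Figure: the micL and micR signals pass through an FFT and a masking stage. Each time-frequency component is either kept (speaker) or removed (interference), depending on whether its ITD lies above or below a threshold.]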
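A minimal sketch of this kind of ITD-based binary mask for a single frame is given below. It estimates the ITD of each frequency bin from the inter-channel phase difference and keeps the bin when the absolute ITD does not exceed a threshold `tau`. Which side of the threshold is kept, the function name `binary_mask`, and the toy spectra are all assumptions; the slide does not spell them out.

```python
import cmath
import math


def binary_mask(left_spec, right_spec, fs, n_fft, tau):
    """Return a 0/1 mask per frequency bin for one frame.

    left_spec, right_spec: complex spectra (bins 0 .. n_fft // 2).
    tau: ITD threshold in seconds; bins with |ITD| <= tau are kept (assumed rule).
    """
    mask = []
    for k, (l, r) in enumerate(zip(left_spec, right_spec)):
        if k == 0:
            mask.append(1)  # DC carries no phase information; keep it
            continue
        omega = 2.0 * math.pi * k * fs / n_fft       # angular frequency of bin k
        phase_diff = cmath.phase(l * r.conjugate())  # wrapped to [-pi, pi]
        itd = phase_diff / omega                     # seconds
        mask.append(1 if abs(itd) <= tau else 0)
    return mask


# Toy frame: the right channel lags the left by 0.1 ms in every bin.
fs, n_fft, delay = 8000, 8, 1e-4
left = [1 + 0j] * 5
right = [cmath.exp(-1j * 2 * math.pi * k * fs / n_fft * delay) for k in range(5)]
mask = binary_mask(left, right, fs, n_fft, tau=5e-5)
```

The masked spectrum is then obtained by multiplying each bin by its mask value before feature extraction.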
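Optimal τ estimation
[Flowchart, translated block labels: left microphone signal and right microphone signal → short-time Fourier transform → interaural time difference (ITD) computation module (with threshold input) → feature vector computation module → at least one single-stage speech recogniser using the voice command models and model N+1 → X-score computation module → X-score output. Decision "maximum X-score?": if yes, output the recognition result and the threshold; if no, the automatic threshold adjustment module updates the threshold.]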
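The idea of tuning the threshold against a recogniser score can be sketched as a simple search over candidate thresholds. This is a plain exhaustive search, not necessarily the adjustment rule used in the flowchart, and `recognise` is a hypothetical callable that stands in for masking, feature extraction, recognition, and score computation.

```python
def search_threshold(candidates, recognise):
    """Try each candidate threshold and keep the one with the highest score.

    recognise(tau) is a hypothetical callable returning (result, score).
    Returns the tuple (best_tau, best_result, best_score).
    """
    best = None
    for tau in candidates:
        result, score = recognise(tau)
        if best is None or score > best[2]:
            best = (tau, result, score)
    if best is None:
        raise ValueError("no candidate thresholds given")
    return best


# Toy stand-in for the recogniser: the score peaks at tau = 0.3.
def toy_recognise(tau):
    return "forward", -(tau - 0.3) ** 2


best_tau, best_result, best_score = search_threshold(
    [0.1, 0.2, 0.3, 0.4], toy_recognise
)
```

With a real recogniser the score would come from the decoder itself, so the threshold is chosen by what helps recognition rather than by a signal-level measure.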
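Testing Database
• Recording environment for the re-recorded dual-microphone audio
  – Anechoic chamber: 5 × 4 × 3 m³
  – Microphone position: centre of the anechoic chamber
  – Distance between the two microphones: 5 cm
  – Microphone height: 1 m
  – Distance from the target source to the centre of the microphone pair: 30 cm
  – Babble noise source angles: 30° and 60°
• Test corpus
  – 50 commands (e.g. 向前 "forward", 後退 "backward" …)
  – 11 speakers (6 males and 5 females)
  – 547 utterances in total
  – Noise added artificially
  – SNR: 0, 6, 12, 18 dB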
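Adding noise at a fixed SNR is usually done by scaling the noise so that the speech-to-noise power ratio hits the target. The sketch below shows that calculation under the assumption that SNR is measured over the whole utterance; the function names and toy signals are illustrative.

```python
import math


def mean_power(signal):
    """Average power of a signal given as a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [gain * n for n in noise]


def mix(speech, noise, snr_db):
    """Add the rescaled noise to the speech, sample by sample."""
    scaled = scale_noise_to_snr(speech, noise, snr_db)
    return [s + n for s, n in zip(speech, scaled)]


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy_by_snr = {snr: mix(speech, noise, snr) for snr in (0, 6, 12, 18)}
```

The noise must have non-zero power for the gain to be defined.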
Recognition Model
• Training Data
  – MAT2000 DB4
• Feature
  – 25 ms per frame, without overlap
  – 13 dimensions (8 cepstra, 4 delta cepstra, delta C0)
• Recognition Model
  – 100 two-state RCD Initials + 38 two-state CI Finals
  – 2 Gaussian mixtures per state
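The framing described above (25 ms frames without overlap) can be sketched as follows. The 8000 Hz sampling rate is taken from the signal-modeling slide and is assumed to apply to feature extraction as well; the function name is illustrative.

```python
def split_frames(samples, fs=8000, frame_ms=25):
    """Cut a signal into consecutive, non-overlapping frames.

    At 8000 Hz a 25 ms frame holds 200 samples; a trailing partial frame
    is dropped.
    """
    frame_len = fs * frame_ms // 1000
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


frames = split_frames(list(range(450)))
```

Each frame would then be turned into one 13-dimensional feature vector.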
Performance of online τ estimation
[Figure: two panels, for noise source angles of 30° and 60°; axis in dB.]
Reverberation - Subband Filtering-and-Sum
• Introduction
• Maximum Likelihood-based Integration
• Minimum Classification Error-based Integration
Introduction
Reverberant Model
Noise Free Model in Time Domain
Speech Reverberation - Time Domain
Speech Reverberation - Frequency Domain
[Figure panels: Clean Speech, Noisy Speech]
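As a rough illustration of the time-domain view, a common textbook model treats the microphone signal as the clean speech convolved with a room impulse response. The sketch below assumes that model; the impulse response values are hypothetical and chosen only to show a direct path followed by decaying reflections.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


clean = [1.0, 0.0, 0.0, 0.0]
rir = [1.0, 0.0, 0.5, 0.25]  # hypothetical: direct path plus two reflections
reverberant = convolve(clean, rir)
```

When the impulse response is much longer than one analysis frame, its effect spreads across several frames, which is the situation a multi-tap subband filter is meant to handle.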
Basic idea of LiMaBeam
Iterative procedure, utterance-based:
• Do beamforming
• Decode the utterance
• Given the most likely HMM state sequence, optimise the beamformer parameters for this sequence
• Stop when the likelihood has converged
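The control flow of the iteration listed above can be sketched as below. The three callables are hypothetical stand-ins for a real beamformer, decoder, and parameter optimiser, and the toy problem at the end only exercises the loop; it is not a model of speech.

```python
def limabeam_loop(beamform, decode, optimise, params, tol=1e-4, max_iter=20):
    """Alternate decoding and beamformer optimisation until the likelihood settles.

    Hypothetical callables:
      beamform(params)         -> enhanced features
      decode(features)         -> (state_sequence, log_likelihood)
      optimise(params, states) -> updated params
    """
    log_lik = float("-inf")
    for _ in range(max_iter):
        features = beamform(params)
        states, new_log_lik = decode(features)
        converged = new_log_lik - log_lik < tol
        log_lik = new_log_lik
        if converged:
            break
        params = optimise(params, states)
    return params, log_lik


# Toy stand-ins: one scalar parameter whose best value is 1.0.
final_params, final_log_lik = limabeam_loop(
    beamform=lambda w: w,
    decode=lambda feat: ("state", -(feat - 1.0) ** 2),
    optimise=lambda w, states: w + 0.5 * (1.0 - w),
    params=0.0,
)
```

The loop also stops if the likelihood decreases, since the improvement then falls below `tol`.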
Subband Likelihood-Maximizing Beamforming
Formulation
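One common way to write subband filter-and-sum is Y[t][k] = Σ_m Σ_p conj(H[m][k][p]) · X[m][t-p][k], where each microphone m has a short filter over past frames in every subband k. The sketch below assumes that form; the index layout and names are illustrative and may differ from the formulation on the original slide.

```python
def subband_filter_and_sum(X, H):
    """Apply per-subband, per-microphone filters and sum over microphones.

    X[m][t][k]: STFT of microphone m at frame t, subband k (complex).
    H[m][k][p]: tap p of the filter for microphone m in subband k (complex).
    Returns Y[t][k].
    """
    n_mics, n_frames, n_bands = len(X), len(X[0]), len(X[0][0])
    Y = [[0j] * n_bands for _ in range(n_frames)]
    for m in range(n_mics):
        for k in range(n_bands):
            for p, tap in enumerate(H[m][k]):
                for t in range(p, n_frames):
                    Y[t][k] += tap.conjugate() * X[m][t - p][k]
    return Y


# Two microphones, two frames, one subband.
X = [[[1 + 0j], [2 + 0j]], [[1 + 0j], [2 + 0j]]]
H_single_tap = [[[0.5 + 0j]], [[0.5 + 0j]]]        # plain averaging
H_two_taps = [[[0.5 + 0j, 0.5 + 0j]], [[0j, 0j]]]  # first mic only, two taps
Y_avg = subband_filter_and_sum(X, H_single_tap)
Y_taps = subband_filter_and_sum(X, H_two_taps)
```

In a likelihood-maximizing scheme the taps in `H` are the parameters that get optimised against the recogniser's likelihood.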
Subband Minimum Classification Error Beamforming
MCE Criterion
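For reference, a widely used form of the MCE criterion combines a misclassification measure d (correct-class score against a soft maximum of the competitors) with a sigmoid loss. The sketch below follows that standard form; the parameter names `eta`, `gamma`, and `theta` and their default values are assumptions, and the slide's exact definition may differ.

```python
import math


def misclassification_measure(scores, correct, eta=1.0):
    """d = -g_correct + (1/eta) * log(mean over competitors of exp(eta * g_j)).

    d > 0 means the competitors outscore the correct class.
    Needs at least two classes.
    """
    competitors = [g for i, g in enumerate(scores) if i != correct]
    top = max(competitors)  # factor out the max for numerical stability
    soft_max = top + math.log(
        sum(math.exp(eta * (g - top)) for g in competitors) / len(competitors)
    ) / eta
    return -scores[correct] + soft_max


def mce_loss(scores, correct, gamma=1.0, theta=0.0, eta=1.0):
    """Sigmoid loss: a smooth, differentiable stand-in for the 0/1 error count."""
    d = misclassification_measure(scores, correct, eta)
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))


loss_when_right = mce_loss([2.0, 0.0], correct=0)
loss_when_wrong = mce_loss([0.0, 2.0], correct=0)
```

Because the loss is smooth, beamformer parameters can be adjusted by gradient descent to reduce an approximation of the error rate directly.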
TCC300 Reverberation Experiment
• Experimental Setting
  – Microphone array with 7 microphones, 5.66 cm between adjacent microphones
  – Speaker 2 m away from the array
  – Room reverberation time T60 = 0.3 to 1.3 s
  – TCC300 database, 29 speakers, each with 5 calibration and 10 test utterances
  – Evaluation with free-syllable decoding / syllable error rate (no language model)
• Experimental Results
Typical Spectrum Examples
[Figure panels: Clean Speech, Noisy Speech, Delay-and-Sum, MCE beamformer]
Summary
• Take advantage of available a priori knowledge, i.e., the underlying recognition model
• Feed the output of the recognizer directly back to the microphone array
• An error-rate criterion (MCE) works better than a likelihood criterion (ML)