National Taipei University of Technology
Professor: Yuan-Fu Liao

Overview
• Introduction
  – Microphone Array and ASR Integration
• Noise - Phase Error Filtering
  – Maximum Likelihood-based Integration
  – Minimum Classification Error-like Integration
• Reverberation - Subband Filtering-and-Sum
  – Maximum Likelihood-based Integration
  – Minimum Classification Error-based Integration
• Summary

Traditional Beamforming + ASR
• Pipeline: first enhance the speech with a beamformer, then feed it into the recogniser

Bridge the Gap between Array and Speech Recognizer
• Take advantage of the available a priori knowledge, i.e., the underlying recognition model
• Directly feed the output of the recognizer back to the microphone array

References
Noise - dual-microphone phase error filtering
• G. Shi, P. Aarabi, and H. Jiang, “Phase-based dual-microphone speech enhancement using a prior speech model,” IEEE Trans. Audio Speech Lang. Process., vol. 15, pp. 109–118, 2007.
• C. Kim, K. Kumar, B. Raj, and R. M. Stern, “Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain,” in INTERSPEECH 2009, pp. 2495–2498, 2009.
• H.-C. Liao, Y.-F. Liao, and C.-H. Lee, “Maximum confidence measure based interaural phase difference estimation for noise masking in dual-microphone robust speech recognition,” in INTERSPEECH 2011.
Reverberation - subband filtering-and-sum
• M. L. Seltzer, B. Raj, and R. M. Stern, “Likelihood-maximizing beamforming for robust hands-free speech recognition,” IEEE Trans. Speech and Audio Processing, vol. 12, no. 5, pp. 489–498, Sep. 2004.
• M. L. Seltzer and R. M. Stern, “Subband likelihood-maximizing beamforming for speech recognition in reverberant environments,” IEEE Trans. Speech and Audio Processing, vol. 14, no. 6, pp. 2109–2121, Nov. 2006.
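• Y.-F. Liao and I.-Y. Xu, “Subband minimum classification error beamforming for speech recognition in reverberant environments,” in ICASSP 2010.

Signal Modeling (ITD)
• Sampling rate: 8000 Hz
• Two microphones (mic1, mic2) spaced 0.05 m apart
• Sound source at angle Φ to the array
• Path difference between the microphones: 0.05 × cos(Φ) m, which gives the interaural time delay Δt

Binary Masking
• Processing chain: micL, micR → FFT → masking
• Keep: speaker; remove: interference
• Each time-frequency bin is kept or removed by comparing its ITD with a threshold (the two branches are labelled “ITD >” and “ITD <” in the diagram)

Optimal τ estimation
[Block diagram; labels translated from the original]
• Inputs: left microphone signal, right microphone signal
• Short-time Fourier transform
• Interaural time difference computation module
• Feature vector computation module
• One-stage speech recognition with at least one voice command model (up to model N+1)
• Threshold input
• X-score computation module → X-score output
• Automatic threshold adjustment module (decision branches: no / yes)
• Maximum X-score output: recognition result / threshold

Testing Database
Recording environment for the re-recorded dual-microphone audio
• Anechoic chamber: 5 × 4 × 3 m³
• Microphone position: center of the anechoic chamber
• Distance between the two microphones: 5 cm
• Microphone height: 1 m
• Distance from the target source to the center of the microphone pair: 30 cm
• Babble noise source angle: 30° and 60°
Test corpus
• 50 commands (e.g. “forward”, “backward”, …)
• 11 speakers (6 males and 5 females)
• 547 utterances in total
• Noise added artificially
• SNR: 0, 6, 12, 18 dB

Recognition Model
• Training data: MAT2000 DB4
• Feature: 25 ms/frame without overlap, 13 dimensions (8 ceps, 4 delta ceps, dC0)
• Recognition model: 100 2-state RCD Initials + 38 2-state CI Finals, 2 Gaussian mixtures per state

Performance of online τ estimation
[Figure: two panels, babble noise at 30° and at 60°; axis in dB]

Reverberation - Subband Filtering-and-Sum
• Introduction
• Maximum Likelihood-based Integration
• Minimum Classification Error-based Integration

Introduction
Reverberant Model
• Noise-free model in the time domain

Speech Reverberation - Time Domain

Speech Reverberation - Frequency Domain
[Figure: clean speech vs. noisy speech]

Basic idea of LiMaBeam
• Iterative procedure, utterance-based:
  – Do beamforming
  – Decode the utterance
  – Given the most likely HMM state sequence, optimise the beamformer parameters for this sequence
  – Stop when the likelihood has converged

Subband Likelihood-Maximizing Beamforming
• Formulation

Subband Minimum Classification Error Beamforming
• MCE criterion

TCC300 Reverberation Experiment
Experimental Setting
• Microphone array with 7 microphones, 5.66 cm spacing between microphones
• Speaker 2 m away from the array
• Room reverberation time T60 = 0.3 to 1.3 s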
• TCC300 database, 29 speakers, each with 5 calibration and 10 test utterances
• Evaluation with free-syllable decoding, measured as syllable error rate (no language model)

Experimental Results

Typical Spectrum Examples
[Figure panels: Clean Speech, Noisy Speech, Delay-and-Sum, MCE beamformer]

Summary
• Take advantage of the available a priori knowledge, i.e., the underlying recognition model
• Directly feed the output of the recognizer back to the microphone array
• The error rate criterion is better than the likelihood criterion
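
Example sketch: ITD-based binary masking
The Signal Modeling (ITD) and Binary Masking slides can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It uses the 0.05 m microphone spacing, the 8000 Hz sampling rate and the 25 ms frame from the slides. The speed of sound (343 m/s), the far-field formula Δt = d·cos(Φ)/c, the Hann window, the rule "keep a bin when |ITD| ≤ τ" (target assumed at broadside) and the function names `expected_itd` and `itd_binary_mask` are all assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed; not given in the slides)
MIC_DISTANCE = 0.05     # m (from the slides)
SAMPLE_RATE = 8000      # Hz (from the slides)


def expected_itd(angle_deg, d=MIC_DISTANCE, c=SPEED_OF_SOUND):
    """Far-field time delay (s) between the two microphones for a source
    at angle_deg to the array axis: path difference d*cos(angle) over c."""
    return d * np.cos(np.deg2rad(angle_deg)) / c


def itd_binary_mask(frame_left, frame_right, tau, fs=SAMPLE_RATE):
    """Estimate a per-bin ITD from the inter-channel phase difference and
    build a binary mask (hypothetical helper, assumed decision rule).

    Returns (mask, itd): mask[k] is True where |itd[k]| <= tau, i.e. the
    bin is attributed to a target assumed to sit at broadside (ITD near 0).
    """
    n = len(frame_left)
    window = np.hanning(n)
    spec_l = np.fft.rfft(frame_left * window)
    spec_r = np.fft.rfft(frame_right * window)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    # Phase of X_L * conj(X_R) equals omega * delay when the right channel lags.
    phase_diff = np.angle(spec_l * np.conj(spec_r))

    itd = np.zeros_like(freqs)
    nonzero = freqs > 0  # the DC bin carries no delay information
    itd[nonzero] = phase_diff[nonzero] / (2.0 * np.pi * freqs[nonzero])

    mask = np.abs(itd) <= tau
    return mask, itd


if __name__ == "__main__":
    # One 25 ms frame (200 samples at 8 kHz) of a 1 kHz tone arriving from 30 degrees.
    t = np.arange(200) / SAMPLE_RATE
    delay = expected_itd(30.0)
    left = np.sin(2 * np.pi * 1000.0 * t)
    right = np.sin(2 * np.pi * 1000.0 * (t - delay))
    mask, itd = itd_binary_mask(left, right, tau=50e-6)
    print("expected ITD (s):", delay)
    print("estimated ITD at 1 kHz (s):", itd[25])
    print("bin kept:", bool(mask[25]))
```

To enhance a frame under these assumptions, multiply one channel's spectrum by `mask` and invert it with `np.fft.irfft`. The threshold `tau` is the quantity that the Optimal τ estimation slide tunes from the recognizer output; here it is simply passed in by hand. Because the phase difference wraps at ±π, the per-bin ITD is only unambiguous while 2πf·Δt stays below π.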