Introduction

National Taipei University of Technology
Professor: Yuan-Fu Liao

Overview
• Introduction
• Microphone Array and ASR Integration
  - Noise: Phase Error Filtering
      Maximum Likelihood-based Integration
      Minimum Classification Error-like Integration
  - Reverberation: Subband Filtering-and-Sum
      Maximum Likelihood-based Integration
      Minimum Classification Error-based Integration
• Summary

Traditional Beamforming + ASR
• Pipeline: first enhance the speech with a beamformer, then feed it into the recognizer

Bridge the Gap between Array and Speech Recognizer
• Take advantage of available a priori knowledge, i.e., the underlying recognition model
• Feed the output of the recognizer directly back to the microphone array

References
• Noise: dual-microphone phase error filtering
  - G. Shi, P. Aarabi, and H. Jiang, "Phase-Based Dual-Microphone Speech Enhancement Using A Prior Speech Model," IEEE Trans. Audio Speech Lang. Process., 15:109-118, 2007.
  - C. Kim, K. Kumar, B. Raj, and R. M. Stern, "Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain," in INTERSPEECH 2009, pp. 2495-2498, 2009.
  - Hsien-Cheng Liao, Yuan-Fu Liao, and Chin-Hui Lee, "Maximum Confidence Measure Based Interaural Phase Difference Estimation for Noise Masking in Dual-Microphone Robust Speech Recognition," INTERSPEECH 2011.
• Reverberation: subband filtering-and-sum
  - M. L. Seltzer, B. Raj, and R. M. Stern, "Likelihood-maximizing beamforming for robust hands-free speech recognition," IEEE Trans. Speech and Audio Processing, vol. 12, no. 5, pp. 489-498, Sep. 2004.
  - M. L. Seltzer and R. M. Stern, "Subband likelihood-maximizing beamforming for speech recognition in reverberant environments," IEEE Trans. Speech and Audio Processing, vol. 14, no. 6, pp. 2109-2121, Nov. 2006.
  - Yuan-Fu Liao and I-Yun Xu, "Subband minimum classification error beamforming for speech recognition in reverberant environments," ICASSP 2010.

Signal Modeling (ITD)
• Sampling rate: 8000 Hz
• Microphone spacing (mic1 to mic2): 0.05 m
• Interaural time delay (ITD): Δt
[Figure: a sound source at angle Φ to two microphones (mic1, mic2) spaced 0.05 m apart; the path-length difference 0.05 × cos(Φ) gives the interaural time delay Δt]

Binary Masking
[Figure: micL and micR signals pass through an FFT and a masking stage; each time-frequency bin is classified by comparing its ITD with the threshold τ (ITD > τ or ITD < τ); speaker bins are kept and interference bins are removed]

Optimal τ estimation
[Flowchart, block labels translated from Chinese:
  left microphone signal, right microphone signal
  → short-time Fourier transform
  → interaural time difference (ITD) computation module (threshold input)
  → feature vector computation module
  → speech recognition (at least one single-stage pass) using the voice command models (model N+1)
  → X-score computation module → X-score output
  → maximum X-score?
      no: automatic threshold adjustment module
      yes: output recognition result / threshold]

Testing Database
• Re-recorded dual-microphone audio; recording environment:
  - Anechoic chamber: 5 × 4 × 3 m³
  - Microphone position: center of the anechoic chamber
  - Distance between the two microphones: 5 cm
  - Microphone height: 1 m
  - Distance from the target source to the center of the microphone pair: 30 cm
  - Babble noise source angles: 30° and 60°
• Test corpus:
  - 50 commands (e.g. "forward" (向前), "backward" (後退) …)
  - 11 speakers (6 males and 5 females)
  - 547 utterances in total
  - Noise added artificially
  - SNR: 0, 6, 12, 18 dB

Recognition Model
• Training data
  - MAT2000 DB4
• Features
  - 25 ms/frame without overlap
  - 13 dimensions (8 ceps, 4 delta ceps, dC0)
• Recognition model
  - 100 2-state RCD Initials + 38 2-state CI Finals
  - 2 Gaussian mixture components per state

Performance of online τ estimation
[Figure: two result panels, one for the noise source at 30° and one at 60°; axis in dB]

Reverberation - Subband Filtering-and-Sum
• Introduction
• Maximum Likelihood-based Integration
• Minimum Classification Error-based Integration

Introduction

Reverberant Model
• Noise-Free Model in Time Domain

Speech Reverberation - Time Domain

Speech Reverberation - Frequency Domain
[Figure panels: Clean Speech, Noisy Speech]

Basic idea of LiMaBeam
Iterative, utterance-based procedure:
• Do beamforming
• Decode the utterance
• Given the most likely HMM state sequence, optimize the beamformer parameters for this sequence
• Stop when the likelihood has converged

Subband Likelihood-Maximizing Beamforming
• Formulation

Subband Minimum Classification Error Beamforming
• MCE criterion

TCC300 Reverberation Experiment

Experimental Setting
• Microphone array with 7 microphones, 5.66 cm spacing between microphones
• Speaker 2 m away from the array
• Room reverberation time T60 = 0.3 to 1.3 s
• TCC300 database, 29 speakers, each with 5 calibration and 10 test utterances
• Evaluation with free-syllable decoding, scored by syllable error rate (no language model)

Experimental Results

Typical Spectrum Examples
• Clean Speech
• Noisy Speech
• Delay-and-Sum
• MCE beamformer

Summary
• Take advantage of available a priori knowledge, i.e., the underlying recognition model
• Feed the output of the recognizer directly back to the microphone array
• An error-rate (MCE) criterion works better than a likelihood criterion
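
Appendix sketch: ITD-based binary masking

The Signal Modeling and Binary Masking slides can be illustrated with a minimal Python sketch. It is not the authors' implementation. The 0.05 m spacing and 8000 Hz sampling rate come from the slides; everything else is an assumption made for illustration: a far-field source, a speed of sound of 343 m/s, a target at broadside (ITD near 0) so that bins with |ITD| <= τ are kept, a Hann window, and the hypothetical function names `expected_itd` and `itd_binary_mask`.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed; not stated in the slides)
MIC_SPACING = 0.05      # m (from the slides)
SAMPLE_RATE = 8000      # Hz (from the slides)


def expected_itd(angle_deg, spacing=MIC_SPACING, c=SPEED_OF_SOUND):
    """Far-field ITD in seconds for a source at angle_deg from the array axis.

    The path-length difference is spacing * cos(angle), as in the slide figure.
    """
    return spacing * np.cos(np.deg2rad(angle_deg)) / c


def itd_binary_mask(frame_left, frame_right, tau, sample_rate=SAMPLE_RATE):
    """Estimate a per-bin ITD for one frame and build a binary mask.

    Returns (mask, itd). mask[k] is True where |itd[k]| <= tau, i.e. the bin
    is attributed to the (assumed broadside) target and kept.
    """
    n = len(frame_left)
    window = np.hanning(n)
    spec_l = np.fft.rfft(frame_left * window)
    spec_r = np.fft.rfft(frame_right * window)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)

    # Phase difference between channels, wrapped to (-pi, pi].
    phase_diff = np.angle(spec_l * np.conj(spec_r))

    # Convert phase difference to time delay; the DC bin carries no delay
    # information, so its ITD is left at zero.
    itd = np.zeros_like(freqs)
    itd[1:] = phase_diff[1:] / (2.0 * np.pi * freqs[1:])

    mask = np.abs(itd) <= tau
    return mask, itd


if __name__ == "__main__":
    # A 1 kHz tone arriving from 30 degrees: the right channel lags the left.
    t = np.arange(256) / SAMPLE_RATE
    delay = expected_itd(30.0)
    left = np.sin(2 * np.pi * 1000.0 * t)
    right = np.sin(2 * np.pi * 1000.0 * (t - delay))
    mask, itd = itd_binary_mask(left, right, tau=50e-6)
    print("expected ITD (s):", delay)
    print("estimated ITD at 1 kHz bin (s):", itd[32])
    print("1 kHz bin kept:", bool(mask[32]))
```

With these assumptions, a source at 30° has an ITD of about 126 µs, so a 50 µs threshold rejects its bins while a source at broadside is kept. The fixed τ here is only a placeholder; the slides describe estimating τ online by adjusting the threshold until the recognizer's X-score is maximized.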
