TURKIC LANGUAGES WORKSHOP
May 2012, İstanbul

BUILDING A TURKISH ASR SYSTEM WITH MINIMAL RESOURCES
Arianna Bisazza & Roberto Gretter (FBK, Italy)

Typical ASR recipe to build a news transcription system
[Diagram; components shown:]
• N-gram Language Model
• Pronunciation Lexicon
• Manual transcriptions
• Pre-processing: tokenization, case, digits etc.
• Speech recordings
• Written text collection

…for Turkish, with minimal resources
[Same diagram, modified over successive builds; final version:]
• N-gram Language Model
• Pronunciation Lexicon
• Automatic transcriptions (in place of manual transcriptions)
• Existing ASR system
• Speech recordings
• Pre-processing: tokenization, case, digits etc. + morphology
• Data-driven segmentation & suffix lexicalization
• Written text collection

Outline
• Data Collection
• Unsupervised Acoustic Modeling
• Language Modeling for Turkish
• Word Segmentation
• Data-driven Morphophonemics

Data Collection
• International satellite TV channel broadcasting:
  • one video stream,
  • parallel audio streams in many languages, including Turkish, English, Italian etc.
  • untranscribed audio → AMTrain: 108h
  • transcribed audio → TurTest: 12 min
• Written news collected daily from the same channel and other newspaper websites:
  • LMTrain: 130M words
  • TurDev: 3.2M words

Unsupervised Acoustic Modeling
• Scenario: we have a good ASR system for language X, and audio & text data but no transcriptions for language Y
• Method:
  • transcribe Turkish audio with Italian AMs
  • use these transcriptions to retrain the AM
  • repeat until convergence
• No manual transcription required
• Language-independent in principle; works better if the phonemes are similar

Turkish Language Modeling
Relevant features of Turkish:
• Agglutination → fast vocabulary growth
• Rich suffix allomorphy due to vowel harmony & other phonological phenomena
We address both with data-driven methods.

1) Word segmentation
• Unsupervised morphological segmentation: Morfessor tool (based on Minimum Description Length)
• Parameter PPthreshold controls the level of segmentation
• Stem+ending representation to avoid units that are too small: trade-off between coverage and recognition accuracy (cf. Erdoğan&al.05, Arısoy&al.09)
• 1 word → max 2 segments

2) Data-driven suffix normalization
• Goal: factorize together word-ending allomorphs
• Procedure:
  1. define letter equivalence classes:
     A={a,e} H={ı,i,ü,u} D={d,t} K={k,ğ} C={c,ç}
  2. normalize = map letters to their class:
     kural+lar → kural+lAr
     santral+ler → santral+lAr
  3. train LM, build ASR system, transcribe
  4. recover surface forms with simple statistical models

2) Data-driven suffix normalization
• Intrinsic evaluation on clean data:
  • 27% of tokens in TurDev are ambiguous lexicalized endings
  • 99.7% of them are assigned the correct surface form
  • some "errors" are due to misspellings in web-crawled data:
    *yatirimci+lArIn → *yatirimci+lerin
• Impact on language model PP: [figure not preserved in the extracted text]

Results
ASR results [results table not preserved in the extracted text]
• Scores: WA (word accuracy) | HWA (half-word accuracy)
• Performance of recent works on related tasks (WA): [Erdoğan&al.05] 53%, [Kurimo&al.06] 67%, [Arısoy&al.09] 76%

Conclusions
• We built a Turkish ASR system with almost no language-specific resources, achieving reasonably good results:
  • unsupervised AM, bootstrapped from an unrelated language
  • unsupervised segmentation with an off-the-shelf tool
• Level of segmentation (PPth) affects ASR accuracy; it should be tuned for the specific task
• We proposed a highly accurate data-driven method for suffix normalization + surface prediction
  • No gain in ASR quality so far, more analysis needed
  • Replicate experiments on a larger test set
• Similar methods may be applied to ASR (and MT) of under-resourced agglutinative languages

İlginiz için teşekkürler (Thank you for your attention)

Example sentences (each line in surface form is paired with its segmented, suffix-normalized form):

Bakan Çağlayan Çin Ticaret Bakanı Deming ile görüştü
Bakan Çağlayan Çin Ticaret Bakan+ +H Deming ile görüştü

Devlet Bakanı Zafer Çağlayan her iki ülkenin kendi parasıyla karşılıklı
Devlet Bakan+ +H Zafer Çağlayan her iki ülke+ +nHn kendi parası+ +ylA karşılık+ +lH
ticaret yapma konusundaki çalışmaların sürdüğünü bildirdi
ticaret yapma konusu+ +nDAKH çalış+ +mAlArHn sürdüğünü bildird+ +H

Çağlayan Çin Ticaret Bakanı Chen Deming ve beraberinde özel ve kamu
Çağlayan Çin Ticaret Bakan+ +H Chen Deming ve beraber+ +HnDA özel ve kamu
sektörü temsilcilerinden oluşan heyetle görüştü
sektörü temsilci+ +lArHnDAn oluşan heyet+ +lA görüştü

Ümit ediyorum ki yarım sabah itibariyle Sayın Bakan ile beraber _bir_ milyar
Ümit ediyor+ +Hm ki yarım sabah itibar+ +HylA Sayın Bakan ile beraber _bir_ milyar
doların üzerinde bir Çin 'den Türkiye 'den alım gerçekleştirilmiş olacak dedi
dolar+ +Hn üzerin+ +DA bir Çin 'den Türk+ +HyA 'den alım gerç+ +AKlAşDHrHlmHş olacak dedi
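
A minimal Python sketch of the suffix normalization procedure described in the slides. The letter classes and the kural+lar / santral+ler examples come from the slides; all function and class names are illustrative. The slides say only that surface forms are recovered "with simple statistical models", so the SurfaceModel below (most frequent surface ending seen in training, first per stem and then overall) is an assumption, not the authors' model. It also assumes the "stem+ending" notation of the procedure slide rather than the "stem+ +ending" notation of the example sentences.

```python
from collections import Counter, defaultdict

# Letter equivalence classes as listed on the slide.
CLASSES = {"A": "ae", "H": "ıiüu", "D": "dt", "K": "kğ", "C": "cç"}
LETTER_TO_CLASS = {ch: cls for cls, letters in CLASSES.items() for ch in letters}


def normalize_ending(ending):
    """Map every letter of an ending to its class symbol, if it has one."""
    return "".join(LETTER_TO_CLASS.get(ch, ch) for ch in ending)


def normalize(word):
    """Normalize the ending of a 'stem+ending' string; the stem is untouched."""
    stem, plus, ending = word.partition("+")
    return stem + plus + normalize_ending(ending)


class SurfaceModel:
    """Hypothetical recovery model (an assumption, not taken from the slides):
    pick the surface ending most often seen in training for this stem,
    falling back to the most frequent one for the normalized ending overall."""

    def __init__(self):
        self.by_stem = defaultdict(Counter)    # (stem, normalized ending) -> surface counts
        self.by_ending = defaultdict(Counter)  # normalized ending -> surface counts

    def train(self, segmented_words):
        for word in segmented_words:
            stem, plus, ending = word.partition("+")
            if not plus:
                continue  # unsegmented word: nothing to recover
            norm = normalize_ending(ending)
            self.by_stem[(stem, norm)][ending] += 1
            self.by_ending[norm][ending] += 1

    def recover(self, stem, norm_ending):
        candidates = (
            self.by_stem.get((stem, norm_ending)),
            self.by_ending.get(norm_ending),
        )
        for counts in candidates:
            if counts:
                return counts.most_common(1)[0][0]
        return norm_ending  # never seen in training: leave the ending as-is


model = SurfaceModel()
model.train(["kural+lar", "kural+lar", "okul+lar", "santral+ler"])
print(normalize("kural+lar"), normalize("santral+ler"))
print(model.recover("santral", "lAr"), model.recover("masa", "lAr"))
```

Keeping the per-stem counts separate from the overall counts lets a stem seen in training keep its own allomorph, while unseen stems fall back to the globally most frequent one. A model conditioned on the last vowel of the stem would likely capture vowel harmony better, but that goes beyond what the slides state.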
