TURKIC LANGUAGES WORKSHOP
May 2012, İstanbul

BUILDING A TURKISH ASR SYSTEM WITH MINIMAL RESOURCES
Arianna Bisazza & Roberto Gretter (FBK, Italy)

Typical ASR recipe to build a news transcription system
[Diagram components]
• Speech recordings
• Manual transcriptions
• Written text collection
• Pre-processing: tokenization, case, digits etc.
• Pronunciation Lexicon
• N-gram Language Model

…for Turkish, with minimal resources
[Diagram components]
• Speech recordings
• Existing ASR system
• Automatic transcriptions (in place of manual transcriptions)
• Written text collection
• Pre-processing: tokenization, case, digits etc. + morphology
• Data-driven segmentation & suffix lexicalization
• Pronunciation Lexicon
• N-gram Language Model

Outline
• Data Collection
• Unsupervised Acoustic Modeling
• Language Modeling for Turkish
  • Word Segmentation
  • Data-driven Morphophonemics

Data Collection
• International satellite TV channel broadcasting: one video stream, parallel audio streams in many languages including Turkish, English, Italian etc.
  AMTrain: 108h, untranscribed audio
  TurTest: 12 min, transcribed audio
• Written news collected daily from the same channel and other newspaper websites.
  LMTrain: 130M words
  TurDev: 3.2M words

Unsupervised Acoustic Modeling
Scenario: we have a good ASR system for language X, and audio & text data but no transcriptions for language Y.
Method:
• transcribe Turkish audio with Italian AMs
• use these transcriptions to retrain the AM
• repeat until convergence
No manual transcription required.
Language independent in principle; works better if phonemes are similar.

Turkish Language Modeling
Relevant features of Turkish:
• Agglutination → fast vocabulary growth
• Rich suffix allomorphy due to vowel harmony & other phonological phenomena
We address both with data-driven methods.

1) Word segmentation
• Unsupervised morphological segmentation: Morfessor tool (based on Minimum Description Length)
• Parameter PPthreshold controls the level of segmentation
• Stem+ending representation to avoid units that are too small: trade-off between coverage and recognition accuracy (cf. Erdoğan&al.05, Arısoy&al.09)
• 1 word → max 2 segments

2) Data-driven suffix normalization
Goal: factorize together word ending allomorphs
Procedure:
1. define letter equivalence classes: A={a,e} H={ı,i,ü,u} D={d,t} K={k,ğ} C={c,ç}
2. normalize = map letters to their class:
   kural+lar → kural+lAr
   santral+ler → santral+lAr
3. train LM, build ASR system, transcribe
4. recover surface forms with simple statistical models

Intrinsic evaluation on clean data:
• 27% of tokens in TurDev are ambiguous lexicalized endings
• 99.7% of them are assigned the correct surface form
• some "errors" are due to misspellings in web-crawled data: *yatirimci+lArIn → *yatirimci+lerin
Impact on language model PP: [figure not preserved]

Results
ASR results [table not preserved]
Scores: WA (word accuracy) | HWA (half-word accuracy)
Performance of recent works on related tasks (WA): [Erdogan&al.05] 53%, [Kurimo&al.06] 67%, [Arisoy&al.09] 76%

Conclusions
• We built a Turkish ASR system with almost no language-specific resources, achieving reasonably good results:
  • unsupervised AM, bootstrapped from an unrelated language
  • unsupervised segmentation with an off-the-shelf tool
• The level of segmentation (PPth) affects ASR accuracy; it should be tuned for the specific task
• We proposed a highly accurate data-driven method for suffix normalization + surface prediction
  • No gain in ASR quality so far; more analysis needed
  • Replicate experiments on a larger test set
• Similar methods may be applied to ASR (and MT) of under-resourced agglutinative languages

Thank you for your attention

Examples (each pair: word-level text, then the stem+ending segmentation with normalized endings)

Bakan Çağlayan Çin Ticaret Bakanı Deming ile görüştü
Bakan Çağlayan Çin Ticaret Bakan+ +H Deming ile görüştü

Devlet Bakanı Zafer Çağlayan her iki ülkenin kendi parasıyla karşılıklı ticaret yapma konusundaki çalışmaların sürdüğünü bildirdi
Devlet Bakan+ +H Zafer Çağlayan her iki ülke+ +nHn kendi parası+ +ylA karşılık+ +lH ticaret yapma konusu+ +nDAKH çalış+ +mAlArHn sürdüğünü bildird+ +H

Çağlayan Çin Ticaret Bakanı Chen Deming ve beraberinde özel ve kamu sektörü temsilcilerinden oluşan heyetle görüştü
Çağlayan Çin Ticaret Bakan+ +H Chen Deming ve beraber+ +HnDA özel ve kamu sektörü temsilci+ +lArHnDAn oluşan heyet+ +lA görüştü

Ümit ediyorum ki yarım sabah itibariyle Sayın Bakan ile beraber _bir_ milyar doların üzerinde bir Çin 'den Türkiye 'den alım gerçekleştirilmiş olacak dedi
Ümit ediyor+ +Hm ki yarım sabah itibar+ +HylA Sayın Bakan ile beraber _bir_ milyar dolar+ +Hn üzerin+ +DA bir Çin 'den Türk+ +HyA 'den alım gerç+ +AKlAşDHrHlmHş olacak dedi
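
Sketch: unsupervised acoustic model bootstrapping
The transcribe / retrain / repeat loop from the Unsupervised Acoustic Modeling slides can be outlined as below. This is only a minimal sketch: `transcribe` and `retrain` are hypothetical callables standing in for a real decoder and AM trainer, and the stopping rule (fraction of utterances whose transcription changed, `tol`) is an assumption, since the slides do not say how convergence is measured.

```python
def bootstrap_am(audio, seed_am, transcribe, retrain, max_iters=10, tol=0.01):
    """Iteratively retrain an acoustic model on its own transcriptions.

    audio      -- list of untranscribed utterances
    seed_am    -- initial model from another language (Italian AMs in the talk)
    transcribe -- hypothetical callable: (am, utterance) -> hypothesis string
    retrain    -- hypothetical callable: (audio, hypotheses) -> new am
    Returns (final_am, number_of_decoding_passes).
    """
    am, previous = seed_am, None
    for iteration in range(1, max_iters + 1):
        hypotheses = [transcribe(am, utt) for utt in audio]
        if previous is not None:
            # Assumed convergence test: share of utterances that changed.
            changed = sum(h != p for h, p in zip(hypotheses, previous))
            if changed / max(len(hypotheses), 1) <= tol:
                return am, iteration
        am = retrain(audio, hypotheses)
        previous = hypotheses
    return am, max_iters


# Toy stand-ins so the loop can be run end to end: the "model" is an integer
# and transcriptions stop changing once it reaches 2.
def toy_transcribe(am, utt):
    return f"{utt}@{min(am, 2)}"


def toy_retrain(audio, hypotheses):
    return int(hypotheses[0].split("@")[1]) + 1


result = bootstrap_am(["utt1", "utt2"], 0, toy_transcribe, toy_retrain)
```

With a real system, the language model and lexicon used for decoding would come from the written text collection, which is why no manual transcription is needed.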
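
Sketch: stem+ending representation
The "1 word → max 2 segments" idea can be illustrated as follows. The morph lists here are hand-written stand-ins for the output of an unsupervised segmenter such as Morfessor (no Morfessor API is called), and treating the first morph as the stem is an assumption of this sketch. The `stem+ +ending` marker convention follows the segmented examples above; `join_units` is a hypothetical helper for turning recognized units back into words, e.g. before computing word accuracy.

```python
def to_stem_ending(morphs):
    """Collapse a morph sequence into at most two units: 'stem+' and '+ending'."""
    if not morphs:
        return []
    stem, ending = morphs[0], "".join(morphs[1:])  # assumption: first morph is the stem
    return [stem + "+", "+" + ending] if ending else [stem]


def segment_sentence(words, segmentation):
    """Apply to_stem_ending to every word; unknown words stay unsegmented."""
    units = []
    for word in words:
        units.extend(to_stem_ending(segmentation.get(word, [word])))
    return units


def join_units(units):
    """Glue 'stem+' '+ending' pairs back into whole words."""
    words = []
    for unit in units:
        if unit.startswith("+") and words:
            words[-1] = words[-1].rstrip("+") + unit[1:]
        else:
            words.append(unit)
    return [w.rstrip("+") for w in words]


# Hypothetical segmenter output for two words.
toy_segmentation = {
    "temsilcilerinden": ["temsilci", "ler", "in", "den"],
    "heyetle": ["heyet", "le"],
}
units = segment_sentence("temsilcilerinden oluşan heyetle görüştü".split(), toy_segmentation)
```

Capping each word at two units keeps the recognition vocabulary smaller than full words while avoiding very short units, which is the coverage/accuracy trade-off named on the word segmentation slide.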
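
Sketch: suffix normalization and surface recovery
Steps 1, 2 and 4 of the suffix normalization procedure can be sketched as below. The letter classes are the ones on the slide. The recovery model is an assumption: the slides only say "simple statistical models", so this sketch picks the most frequent surface ending seen with the same stem, backing off to the stem's last vowel and then to the normalized ending alone. `TOY_PAIRS` is invented illustration data, not the LMTrain corpus.

```python
from collections import Counter, defaultdict

# Letter equivalence classes as listed in step 1 of the procedure.
CLASSES = {"A": "ae", "H": "ıiüu", "D": "dt", "K": "kğ", "C": "cç"}
LETTER_TO_CLASS = {ch: cls for cls, letters in CLASSES.items() for ch in letters}
VOWELS = "aeıioöuü"


def normalize_ending(ending):
    """Step 2: map each letter of an ending to its class symbol, if it has one."""
    return "".join(LETTER_TO_CLASS.get(ch, ch) for ch in ending)


def last_vowel(stem):
    return next((ch for ch in reversed(stem) if ch in VOWELS), "")


def train_surface_model(pairs):
    """Count surface endings at three levels of context (assumed backoff scheme)."""
    by_stem, by_vowel, by_ending = (defaultdict(Counter) for _ in range(3))
    for stem, ending in pairs:
        norm = normalize_ending(ending)
        by_stem[(stem, norm)][ending] += 1
        by_vowel[(last_vowel(stem), norm)][ending] += 1
        by_ending[norm][ending] += 1
    return by_stem, by_vowel, by_ending


def recover_surface(stem, norm, model):
    """Step 4: most frequent surface form, backing off stem -> last vowel -> ending."""
    by_stem, by_vowel, by_ending = model
    counts = (by_stem.get((stem, norm))
              or by_vowel.get((last_vowel(stem), norm))
              or by_ending.get(norm))
    return counts.most_common(1)[0][0] if counts else norm  # unseen: leave as is


# Invented (stem, surface ending) pairs for illustration only.
TOY_PAIRS = [("kural", "lar"), ("santral", "ler"), ("okul", "lar"),
             ("masa", "lar"), ("ev", "ler"), ("ülke", "ler"), ("ülke", "nin")]
MODEL = train_surface_model(TOY_PAIRS)
```

The stem-level counts matter for words like santral, which takes +ler even though its last vowel is a; a purely rule-based vowel harmony table would get this case wrong, which is one motivation for a data-driven model.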