TURKIC LANGUAGES WORKSHOP
May 2012, İstanbul
BUILDING A TURKISH ASR SYSTEM WITH MINIMAL RESOURCES
Arianna Bisazza & Roberto Gretter – FBK, Italy
Typical ASR recipe to build a news transcription system
• Speech recordings
• Manual transcriptions
• Written text collection
• Pre-processing: tokenization, case, digits etc.
• Pronunciation Lexicon
• N-gram Language Model
…for Turkish, with minimal resources
• Speech recordings
• Existing ASR system
• Automatic transcriptions
• Written text collection
• Pre-processing: tokenization, case, digits etc. + morphology
• Data-driven segmentation & suffix lexicalization
• Pronunciation Lexicon
• N-gram Language Model
Outline
• Data Collection
• Unsupervised Acoustic Modeling
• Language Modeling for Turkish
  • Word Segmentation
  • Data-driven Morphophonemics
Data Collection
International satellite TV channel broadcasting:
• one video stream,
• parallel audio streams in many languages including Turkish, English, Italian etc.
  AMTrain: 108h untranscribed audio
  TurTest: 12’ transcribed audio
Written news collected daily from the same channel and other newspaper websites.
  LMTrain: 130M words
  TurDev: 3.2M words
Unsupervised Acoustic Modeling
Scenario: we have a good ASR system for language X, plus audio & text data but no transcriptions for language Y
Method:
• transcribe Turkish audio with Italian AMs
• use these transcriptions to retrain the AM
• repeat until convergence
No manual transcription required
Language-independent in principle; works better if phonemes are similar
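The transcribe/retrain/repeat loop above can be sketched as follows. This is only a minimal illustration under stated assumptions: `decode` and `retrain` are hypothetical callables standing in for a real recognizer and AM trainer, and "convergence" is taken here to mean that the fraction of utterances whose automatic transcription changed drops below `tol`. The slides do not say which stopping criterion was actually used.

```python
from typing import Callable, Dict, List

Transcripts = Dict[str, List[str]]  # utterance id -> recognized words


def changed_fraction(old: Transcripts, new: Transcripts) -> float:
    """Fraction of utterances whose automatic transcription changed."""
    if not new:
        return 0.0
    changed = sum(1 for utt in new if old.get(utt) != new[utt])
    return changed / len(new)


def self_train(
    initial_am,
    utterances: List[str],
    decode: Callable[[object, str], List[str]],   # hypothetical recognizer
    retrain: Callable[[Transcripts], object],     # hypothetical AM trainer
    tol: float = 0.01,                            # assumed stopping threshold
    max_iters: int = 10,
):
    """Iterate decode -> retrain until the transcriptions stop changing."""
    am = initial_am  # e.g. acoustic models borrowed from another language
    previous: Transcripts = {}
    for iteration in range(1, max_iters + 1):
        # Transcribe all untranscribed audio with the current models.
        current = {utt: decode(am, utt) for utt in utterances}
        if previous and changed_fraction(previous, current) <= tol:
            return am, current, iteration
        # Treat the automatic transcriptions as training labels.
        am = retrain(current)
        previous = current
    return am, previous, max_iters
```

Passing the recognizer and trainer in as callables keeps the sketch independent of any particular ASR toolkit.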
Turkish Language Modeling
Relevant features of Turkish
• Agglutination → fast vocabulary growth
• Rich suffix allomorphy due to vowel harmony & other phonological phenomena
We address both with data-driven methods
1) Word segmentation
• Unsupervised morphological segmentation: Morfessor tool (based on Minimum Description Length)
• Parameter PPthreshold controls level of segmentation
• Stem+ending representation to avoid too small units: trade-off coverage/recognition accuracy (cf. Erdoğan&al.05, Arısoy&al.09)
1 word → max 2 segments
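The "1 word → max 2 segments" constraint can be illustrated with a short sketch. It assumes that the first morph returned by the segmenter is the stem and that all remaining morphs are glued into one ending. The `stem+ +ending` marking follows the example sentences at the end of this deck. `segment` is a hypothetical callable standing in for a trained segmenter, not a real Morfessor API.

```python
from typing import Callable, List


def to_stem_ending(morphs: List[str]) -> List[str]:
    """Collapse a morph sequence into at most two units: stem and ending."""
    if not morphs:
        return []
    stem, suffixes = morphs[0], morphs[1:]
    if not suffixes:
        return [stem]  # unsegmented word stays a single unit
    # Mark both halves so the word boundary can be restored later.
    return [stem + "+", "+" + "".join(suffixes)]


def segment_sentence(
    words: List[str],
    segment: Callable[[str], List[str]],  # hypothetical: word -> list of morphs
) -> List[str]:
    """Apply the stem+ending representation to every word of a sentence."""
    units: List[str] = []
    for word in words:
        units.extend(to_stem_ending(segment(word)))
    return units
```

For instance, a hypothetical morph sequence `["heyet", "le"]` becomes `["heyet+", "+le"]`, while a one-morph word such as `["ve"]` is left untouched.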
2) Data-driven suffix normalization
Goal: factorize together word ending allomorphs
Procedure:
1. define letter equivalence classes:
   A={a,e} H={ı,i,ü,u} D={d,t} K={k,ğ} C={c,ç}
2. normalize = map letters to their class:
   kural+lar → kural+lAr
   santral+ler → santral+lAr
3. train LM, build ASR system, transcribe
4. recover surface forms with simple statistical models
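Steps 1 and 2 of the procedure amount to a character mapping applied to the ending only. The sketch below uses the five equivalence classes listed above. The function names and the `stem+ending` input format are assumptions made for illustration.

```python
# Letter equivalence classes as given on the slide.
CLASSES = {
    "A": "ae",
    "H": "ıiüu",
    "D": "dt",
    "K": "kğ",
    "C": "cç",
}

# Invert to a letter -> class lookup table.
LETTER_TO_CLASS = {
    letter: cls for cls, letters in CLASSES.items() for letter in letters
}


def normalize_ending(ending: str) -> str:
    """Map each letter of an ending to its class; other letters are kept."""
    return "".join(LETTER_TO_CLASS.get(ch, ch) for ch in ending)


def normalize_word(segmented: str) -> str:
    """Normalize only the ending of a 'stem+ending' string; the stem is kept."""
    stem, sep, ending = segmented.partition("+")
    return stem + sep + normalize_ending(ending)


print(normalize_word("kural+lar"))    # kural+lAr
print(normalize_word("santral+ler"))  # santral+lAr
```

Both allomorphs of the plural suffix collapse to the same unit `lAr`, which is what lets the language model share statistics across them.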
2) Data-driven suffix normalization
Intrinsic evaluation on clean data:
• 27% of tokens in TurDev are ambiguous lexicalized endings
• 99.7% of them are assigned the correct surface form
• some "errors" are due to misspellings in web-crawled data:
  *yatirimci+lArIn → *yatirimci+lerin
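The slides describe surface-form recovery only as "simple statistical models". One possible such model, shown purely as an assumption and not as the authors' method, is a most-frequent-form lookup. It picks the surface ending seen most often in training text with the same stem and normalized ending, and backs off to the most frequent surface form of that normalized ending alone.

```python
from collections import Counter, defaultdict
from typing import Optional


class SurfacePredictor:
    """Hypothetical frequency-based predictor of surface endings."""

    def __init__(self) -> None:
        # (stem, normalized ending) -> counts of surface endings
        self.by_stem = defaultdict(Counter)
        # normalized ending -> counts of surface endings (back-off table)
        self.by_ending = defaultdict(Counter)

    def observe(self, stem: str, norm_ending: str, surface: str, count: int = 1) -> None:
        """Record one (or `count`) training occurrence of a segmented word."""
        self.by_stem[(stem, norm_ending)][surface] += count
        self.by_ending[norm_ending][surface] += count

    def predict(self, stem: str, norm_ending: str) -> Optional[str]:
        """Return the most frequent surface ending, or None if never seen."""
        key = (stem, norm_ending)
        if key in self.by_stem:
            return self.by_stem[key].most_common(1)[0][0]
        if norm_ending in self.by_ending:
            return self.by_ending[norm_ending].most_common(1)[0][0]
        return None


predictor = SurfacePredictor()
predictor.observe("kural", "lAr", "lar")
predictor.observe("okul", "lAr", "lar")
predictor.observe("santral", "lAr", "ler")
print(predictor.predict("santral", "lAr"))  # ler (seen with this stem)
print(predictor.predict("kitap", "lAr"))    # lar (back-off to ending counts)
```

A lookup of this kind would reproduce misspellings present in the training text, which is consistent with the kind of "error" shown above.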

Impact on language model PP:
Results
ASR results
• Scores: WA (word accuracy) | HWA (half-word accuracy)
• Performance of recent works on related tasks (WA):
  [Erdoğan&al.05] 53%, [Kurimo&al.06] 67%, [Arısoy&al.09] 76%
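For reference, word accuracy is commonly computed as one minus the word error rate, that is, WA = 1 − (S + D + I) / N, with the error counts taken from a minimum edit distance alignment against a reference of N words. The sketch below follows that common definition. Treating HWA as the same score computed over stem/ending units is an assumption, since the slides do not spell out how it is scored.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def accuracy(ref: List[str], hyp: List[str]) -> float:
    """Accuracy = 1 - (S + D + I) / N over whatever units are passed in."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Word-level score: one deleted word out of four reference words.
ref_words = "bakan çağlayan ile görüştü".split()
hyp_words = "bakan ile görüştü".split()
print(accuracy(ref_words, hyp_words))  # 0.75
```

Calling `accuracy` on sequences of `stem+` / `+ending` units instead of whole words would give the half-word variant under the assumption stated above.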
Conclusions
• We built a Turkish ASR system with almost no language-specific resources, achieving reasonably good results:
  • unsupervised AM, bootstrapped from an unrelated language
  • unsupervised segmentation with an off-the-shelf tool
• Level of segmentation (PPth) affects ASR accuracy; it should be tuned for the specific task
• We proposed a highly accurate data-driven method for suffix normalization + surface prediction
• No gain in ASR quality so far, more analysis needed
• Replicate experiments on a larger test set
• Similar methods may be applied to ASR (and MT) of under-resourced agglutinative languages
Thank you for your attention
Bakan Çağlayan Çin Ticaret Bakanı Deming ile görüştü
Bakan Çağlayan Çin Ticaret Bakan+ +H Deming ile görüştü
Devlet Bakanı Zafer Çağlayan her iki ülkenin kendi parasıyla karşılıklı
Devlet Bakan+ +H Zafer Çağlayan her iki ülke+ +nHn kendi parası+ +ylA karşılık+ +lH
ticaret yapma konusundaki çalışmaların sürdüğünü bildirdi
ticaret yapma konusu+ +nDAKH çalış+ +mAlArHn sürdüğünü bildird+ +H
Çağlayan Çin Ticaret Bakanı Chen Deming ve beraberinde özel ve kamu
Çağlayan Çin Ticaret Bakan+ +H Chen Deming ve beraber+ +HnDA özel ve kamu
sektörü temsilcilerinden oluşan heyetle görüştü
sektörü temsilci+ +lArHnDAn oluşan heyet+ +lA görüştü
Ümit ediyorum ki yarım sabah itibariyle Sayın Bakan ile beraber _bir_ milyar
Ümit ediyor+ +Hm ki yarım sabah itibar+ +HylA Sayın Bakan ile beraber _bir_ milyar
doların üzerinde bir Çin 'den Türkiye 'den alım gerçekleştirilmiş olacak dedi
dolar+ +Hn üzerin+ +DA bir Çin 'den Türk+ +HyA 'den alım gerç+ +AKlAşDHrHlmHş olacak dedi