Claire Brierley
13.06.2010
Prosodic-syntactic phrasing is a language universal: natural language speakers process speech
in chunks or meaningful stand-alone word clusters. The process of automated phrase break
prediction seeks to emulate human psycholinguistic performance by identifying intelligible
and naturalistic chunk boundaries in text which correspond to native speaker speech and
language processing. Building on recent work on phrase break correlates for English, which
uncovers new text-based and real-world linguistic cues for this classification task (Brierley
and Atwell, 2009; 2010), we will use data mining software such as WEKA (Hall et al., 2009)
to trial and compare the performance of machine learning feature sets derived from scholarly
prosodic markup in Tajweed* editions of the Qur’an to model the prosody-syntax interface
for Modern Standard Arabic, and for language engineering applications such as Arabic TTS.
Initial studies will look at statistically significant patterns of association between
parts-of-speech and/or discourse connectors and different boundary types at verse endings
and within verses. Subsequent studies will explore phonetic-graphemic cues to boundary
strength encoded in Qur’anic annotations. A prerequisite for both sets of studies is an additional
prosodic annotation tier for the Qur’anic Arabic Corpus. As well as prescribing traditional
recitation, these ‘gold standard’ annotations may be reinterpreted for learners of Arabic
(whether native or non-native speakers) as parsing and phrasing strategies for contemporary
Modern Standard Arabic texts, and may also constitute linguistic knowledge about the process
of prosodic-syntactic chunking that is transferable to other languages.
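As a minimal sketch of the kind of association test the initial studies describe, the Python below computes a Pearson chi-square statistic over (part-of-speech, boundary type) pairs. The tag names, boundary labels and counts are invented purely for illustration and are not drawn from the Qur’anic Arabic Corpus; the function name is hypothetical, and the proposal itself names WEKA, not Python, as the toolkit.

```python
from collections import Counter


def chi_square(pairs):
    """Pearson chi-square statistic for a list of (pos_tag, boundary_type) pairs.

    Returns the statistic and its degrees of freedom.
    """
    observed = Counter(pairs)                        # joint counts per cell
    row_totals = Counter(pos for pos, _ in pairs)    # counts per POS tag
    col_totals = Counter(b for _, b in pairs)        # counts per boundary type
    n = len(pairs)

    stat = 0.0
    for pos, row in row_totals.items():
        for boundary, col in col_totals.items():
            expected = row * col / n                 # count expected under independence
            stat += (observed[(pos, boundary)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Toy data (assumption): hypothetical POS tags paired with a hypothetical
# binary boundary label. These counts are made up for the example.
toy = ([("CONJ", "break")] * 30 + [("CONJ", "no_break")] * 10
       + [("NOUN", "break")] * 10 + [("NOUN", "no_break")] * 30)

stat, dof = chi_square(toy)
print(f"chi-square = {stat:.2f}, dof = {dof}")  # chi-square = 20.00, dof = 1
```

With these toy counts the statistic is 20.00 on 1 degree of freedom, which exceeds the 3.84 critical value at p = 0.05, so independence between tag and boundary would be rejected. Real boundary types in Tajweed markup are more numerous than the two labels assumed here.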
Publications:
Brierley, C. and Atwell, E. 2010. Complex Vowels as Boundary Correlates in a Multi-Speaker Corpus
of Spontaneous English Speech. In Proceedings of Speech Prosody 2010, Chicago.
Brierley, C. and Atwell, E. 2009. Exploring Phrase Break Correlates in a Corpus of English Speech
with ProPOSEL, a Prosody and POS English Lexicon. In Proceedings of Interspeech 2009, Brighton.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H. 2009. The WEKA
Data Mining Software: An Update. SIGKDD Explorations. Volume 11, Issue 1.
* Tajweed is the science of properly reciting the Qur’an. Certain editions of the Qur’an are marked up
with guidance, for example, on the articulation of Arabic phonemes as they appear in different
positions within a word. One such edition is: Mushaf at-Tajweed. 1999. Dar-Al-Marefah. Damascus,
Syria. 4th edition.
Comments:
I take it that prosodic markup in the Qur’an is ‘human’ and represents human wisdom and learning
about language; this is an important point in terms of what this research sets out to do. The
annotations represent Arabic linguists’ best efforts at parsing a divine communication, made possible
by the nature of the communication itself, since it was transmitted through the natural language
medium of Arabic.