Claire Brierley
13.06.2010

Prosodic-syntactic phrasing is a language universal: speakers of natural languages process speech in chunks, that is, in meaningful stand-alone word clusters. Automated phrase break prediction seeks to emulate this psycholinguistic behaviour by identifying intelligible, naturalistic chunk boundaries in text that correspond to the way native speakers process speech and language.

Recent work on phrase break correlates for English has uncovered new text-based and real-world linguistic cues for this classification task (Brierley and Atwell, 2009; 2010). Building on that work, we will use data mining software such as WEKA (Hall et al., 2009) to trial and compare the performance of machine learning feature sets derived from scholarly prosodic markup in Tajweed* editions of the Qur’an. The aim is to model the prosody-syntax interface for Modern Standard Arabic and to support language engineering applications such as Arabic text-to-speech (TTS).

Initial studies will look at statistically significant patterns of association between parts-of-speech and/or discourse connectors and different boundary types, both at verse endings and within verses. Subsequent studies will explore phonetic-graphemic cues to boundary strength encoded in Qur’anic annotations. A prerequisite for both sets of studies is an additional prosodic annotation tier for the Qur’anic Arabic Corpus.

As well as prescribing traditional recitation, these ‘gold standard’ annotations may be reinterpreted for learners of Arabic (whether native or non-native speakers) as parsing and phrasing strategies for contemporary Modern Standard Arabic texts. They may also constitute linguistic knowledge about the process of prosodic-syntactic chunking that is transferable to other languages.

Publications:

Brierley, C. and Atwell, E. 2010. Complex Vowels as Boundary Correlates in a Multi-Speaker Corpus of Spontaneous English Speech. In Proceedings of Speech Prosody 2010, Chicago.

Brierley, C. and Atwell, E. 2009. Exploring Phrase Break Correlates in a Corpus of English Speech with ProPOSEL, a Prosody and POS English Lexicon. In Proceedings of Interspeech 2009, Brighton.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.

* Tajweed is the science of properly reciting the Qur’an. Certain editions of the Qur’an are marked up with guidance, for example, on the articulation of Arabic phonemes as they appear in different positions within a word. One such edition is: Mushaf at-Tajweed. 1999. Dar-Al-Marefah. Damascus, Syria. 4th edition.

Comments:

I take it that prosodic markup in the Qur’an is ‘human’ and represents human wisdom and learning about language; this point matters for what the research sets out to do. The annotations represent Arabic linguists’ best efforts at parsing a divine communication, an effort made possible by the nature of the communication itself, since it was transmitted through the natural language medium of Arabic.
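
Illustration: the initial studies outlined in the abstract look for statistically significant association between parts-of-speech and boundary types. One conventional way to test such an association is a Pearson chi-squared test on a contingency table of (part-of-speech, boundary type) counts. The sketch below is a minimal plain-Python version of that test, not the WEKA workflow the proposal names. The tag labels (CONJ, NOUN), the boundary labels (break, no_break), and all counts are invented purely for illustration and are not drawn from any Qur’anic annotation.

```python
from collections import Counter


def chi_squared(pairs):
    """Pearson chi-squared statistic for a POS-by-boundary contingency table.

    `pairs` is an iterable of (pos_tag, boundary_type) observations.
    Returns (statistic, degrees_of_freedom).
    """
    pairs = list(pairs)
    observed = Counter(pairs)                       # cell counts
    row_totals = Counter(pos for pos, _ in pairs)   # counts per POS tag
    col_totals = Counter(b for _, b in pairs)       # counts per boundary type
    n = len(pairs)

    statistic = 0.0
    for pos, row_total in row_totals.items():
        for boundary, col_total in col_totals.items():
            # Expected count if POS and boundary type were independent.
            expected = row_total * col_total / n
            statistic += (observed[(pos, boundary)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented toy observations (hypothetical labels and counts, not real data).
toy_pairs = (
    [("CONJ", "break")] * 30 + [("CONJ", "no_break")] * 10
    + [("NOUN", "break")] * 10 + [("NOUN", "no_break")] * 30
)

statistic, dof = chi_squared(toy_pairs)
print(f"chi-squared = {statistic:.2f}, dof = {dof}")
```

With one degree of freedom, a statistic above 3.841 would be significant at the 0.05 level. In a real study the observations would come from the proposed prosodic annotation tier, with one row per token and its following boundary label.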