The BUiD Arab Learner Corpus: Explaining Second Language Writing Systems within a Markedness Framework

Yasemin Yildiz
The British University in Dubai
yasemin.yildiz@buid.ac.ae

1 Introduction

This paper has a two-fold goal. The first goal is to contribute to the literature on Second Language Writing Systems (L2WS) by focusing on the British University in Dubai Arab Learner Corpus (BALC). The second goal is to demonstrate the close relationship between phonology and orthography in L2WS and to critically address the issue of script reform. Whereas previous studies provide a holistic and descriptive analysis of all possible spelling errors of Arabic-speaking learners of English (e.g. Randall and Groom, 2009; Haggan, 1991; Hassan, 2010), this study differs in two respects: 1) as a first attempt, BALC will be interpreted within a markedness framework; 2) particular emphasis will be given to erroneous spellings of lexical items that have complex onset and coda clusters at the phonological level only (e.g. stamped [stæmpt]).

The existing theories explaining L2WS range from the Contrastive Analysis Hypothesis (Lado 1957; hereafter CAH), which compares the areas where the L2 differs from the L1 to determine what will be difficult for the learner, to Error Analysis (Corder 1967; hereafter EA), which advocates looking only at the developing grammar of the learner to ascertain where difficulties exist. Although both theories may be able to foresee or account for learners’ linguistic difficulties, they exhibit two shortcomings. First, CAH relies only on native language transfer. Second, EA misses the relationship between L1 transfer and universal processes. As an alternative, this study attempts to show how the markedness framework can be a useful tool for modelling first language and universal constraints in L2WS.
In fact, according to Spolsky (1989), the markedness condition is necessary as a linguistic ground for language learning.

2 Theoretical framework

Trubetzkoy and Jakobson were the first linguists to introduce the idea of ‘markedness’, in the 1930s, when it was treated as a language-particular phenomenon. Trubetzkoy approached markedness within a descriptive framework, and the term was initially confined to phonetics. Jakobson (1968), however, approached markedness from the perspective of language acquisition. The underlying principle of Jakobson’s theory is that there is a universal order of acquisition, largely based on phonological oppositions and phonetic properties of segments. Based on the structural contrasts in his theory, Jakobson suggested that unmarked forms would be the earliest acquired and would also occur in all the world’s languages.

3 The study

The BUiD Arab Learner Corpus (BALC) consists of 1,865 texts written by either first year university students or secondary school students (year/grade 12, the last year of schooling). It comprises 287,227 word tokens and 20,275 word types. The texts fall into three types: texts collected by MEd students in secondary schools, retired first year university test essays, and texts sourced from the Common Educational Proficiency Assessment (CEPA) examinations (all school students in the United Arab Emirates must take CEPA as a university entrance exam). The scripts were all handwritten and then converted into text files for incorporation into the corpus.

4 Instrumentation and procedure

Misspellings that exhibit consonant clusters will be identified and categorized using the Wmatrix3 program (Rayson 2003, Rayson 2005). Wmatrix3 is an online integrated corpus linguistic software environment in which texts can be loaded, analyzed for word frequency profiles and concordances, and annotated for part of speech (using the well-known CLAWS tagger, see Garside et al.
1997) and word sense (using a semantic content and word sense tagger). The semantic content component, named the UCREL Semantic Analysis System (USAS), has a multi-tier structure with 21 major discourse categories, each of which is further subdivided. One subdivision of the 'Z' category, labelled 'Z99', collects the unmatched items (those not recognized by the system). The data will be drawn from the Z99 category, since it collects the forms the tagger does not recognize, which include the misspellings, and provides their frequency distribution. The quantitative analysis will be based on the findings from the Z99 category. A further qualitative analysis will be conducted within the markedness framework.

5 Research questions

This study takes up the following three questions for investigation:
1) What modification strategies do the learners use in the production of consonant clusters?
2) To what extent are L2 syllables constrained by allowable L1 syllable structure, and to what extent do universal principles apply or even prevail?
3) What is the role of markedness in the production of consonant clusters?

References

Corder, S. P. 1967. “The Significance of Learners’ Errors”. International Review of Applied Linguistics 5: 161-169.
Garside, R., Leech, G. and McEnery, T. (eds.) 1997. The Computational Analysis of English. London: Longman.
Gnanadesikan, A. E. 2004. “Markedness and faithfulness constraints in child phonology”. In R. Kager, J. Pater and W. Zonneveld (eds.) Constraints in Phonological Acquisition. Cambridge: Cambridge University Press. pp. 73–108. [ROA-76]
Haggan, M. 1991. “Spelling errors in native Arabic-speaking English majors: A comparison between remedial students and fourth year students”. System 19(1): 45-61.
Lado, R. 1957. Linguistics across cultures: Applied linguistics for language teachers. Ann Arbor: University of Michigan Press.
Spolsky, B. 1989.
Conditions for Second Language Learning: Introduction to a General Theory. Oxford: Oxford University Press.
Trubetzkoy, N. 1939. Grundzüge der Phonologie (Principles of Phonology). Travaux du cercle linguistique de Prague 7.
Randall, M. 2007. Memory, psychology and second language learning. Philadelphia: Benjamins Publishing Company.
Randall, M. and Groom, N. 2009. “Introducing the BUiD Arab Learner Corpus: a resource for studying the acquisition of L2 English spelling”. In M. Mahlberg, V. González-Díaz and C. Smith (eds.) Proceedings of the Corpus Linguistics Conference CL2009, University of Liverpool, UK, 20-23 July 2009.
Rayson, P. 2003. Matrix: A Statistical Method and Software Tool for Linguistic Analysis through Corpus Comparison. Ph.D. thesis, Lancaster University. Available online at http://ucrel.lancs.ac.uk/people/paul/publications/phd2003.pdf
Rayson, P. 2005. Wmatrix: A Web-based Corpus Processing Environment. Computing Department, Lancaster University. Available online at http://www.comp.lancs.ac.uk/ucrel/wmatrix/
Yildiz, Y. and Ozek, Y. 2009. The Role of Markedness in Vocabulary Learning. In Proceedings of the International Conference of Technology, Education and Development (ICERI 2009).
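
As a note on the procedure in Section 4, the Z99-based elicitation step could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Wmatrix3 implementation: the input format (a list of token and semantic-tag pairs), the function names, and the sample tokens are all hypothetical, and the cluster check is a crude orthographic proxy that does not capture clusters present only at the phonological level (such as the coda of stamped [stæmpt]).

```python
from collections import Counter

VOWEL_LETTERS = set("aeiou")


def z99_frequencies(tagged_tokens):
    """Frequency distribution of tokens tagged Z99 (unmatched by the tagger).

    `tagged_tokens` is an iterable of (token, semantic_tag) pairs. This input
    format is an assumption; the paper does not specify the export format.
    """
    return Counter(tok.lower() for tok, tag in tagged_tokens if tag == "Z99")


def has_consonant_cluster(word, min_len=2):
    """True if `word` contains a run of at least `min_len` consonant letters.

    Orthographic approximation only: letters stand in for segments, so
    phonological clusters hidden by the spelling are missed.
    """
    run = 0
    for ch in word.lower():
        if ch.isalpha() and ch not in VOWEL_LETTERS:
            run += 1
            if run >= min_len:
                return True
        else:
            run = 0
    return False


# Invented sample for illustration only; these are not BALC data.
sample = [
    ("the", "Z5"),
    ("stampid", "Z99"),
    ("Stampid", "Z99"),
    ("frend", "Z99"),
    ("school", "P1"),
]

freqs = z99_frequencies(sample)
# Keep only the unmatched forms that contain a consonant-letter cluster.
cluster_candidates = {w: n for w, n in freqs.items() if has_consonant_cluster(w)}
```

The resulting candidates would still need manual checking, since Z99 also collects unmatched items that are not misspellings, and a phonological transcription of the target words would be required before any markedness-based analysis.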