BilingBank Database Guide

This guide provides documentation regarding the data on bilingualism and second language acquisition (SLA) in the TalkBank database. All of these data are available from http://talkbank.org/data/BilingBank. TalkBank is an international system for the exchange of data on spoken language interactions. The majority of the corpora in TalkBank have either audio or video media linked to transcripts. All transcripts are formatted in the CHAT system and can be automatically converted to XML using the CHAT2XML convertor. TalkBank data dealing with first language acquisition are available from the CHILDES site at http://childes.psy.cmu.edu.

1. Anadolu
2. Bangor-Pilot (Welsh-English)
3. Bangor (Welsh-English) Siarad
4. Bangor (Welsh-Spanish) Patagonia
5. Bangor (Spanish-English) Miami
6. Welsh Transcription Conventions
7. BlumSnow (Hebrew-English)
   Family Backgrounds
      Group 1: Native Israeli Families
      Group 2: American Israeli Families
      Group 3: Jewish-American Families
8. Eppler (German-English)
9. Gardner-Chloros (Greek-English)
10. Hatzidaki (Greek-French)
11. Køge (Turkish-Danish)

1. Anadolu

Fatma Hülya Özcan
Anadolu University, Eskisehir
fozcan@anadolu.edu.tr

This corpus is included in BilingBank to provide a comparison with the Køge corpus, because both involve immigrant Turkish children. The project was a collaboration of Jens Normann Jørgensen of the University of Copenhagen and Fatma Hülya Özcan and Ilknur Kecik of Anadolu University in Eskisehir, a town of 600,000 in central Turkey. The children are from the second generation of working class immigrants to the city, thereby allowing comparisons with the Køge corpus.

The students, in groups of four, were first studied in the first grade in 1997 and then the same students were followed in grades 3, 5, 7, and 8. In the first and third grades, the groups were asked to furnish a house on a white cardboard, using stickers, paper, glue sticks, and marking pens. In the 5th grade, the task was to build a city. In the 7th and 8th grade the task was to prepare a collage on a topic of their choice. Each session lasted 45 minutes.
Transcription was done in CHAT using CLAN, and translations into Danish and, later, English were prepared. The files included here have the Danish translations.

2. Bangor-Pilot (Welsh-English)

Margaret Deuchar
Department of Linguistics and English Language
University of Wales, Bangor
Gwynedd LL57 2DG
UK
m.deuchar@bangor.ac.uk

The corpus was transcribed in 2004/05 as part of a small research project funded by the British Academy, entitled "Structural aspects of Welsh-English code-switching". The main theoretical aim of the project was to test Myers-Scotton's (2002) Matrix Language Frame (MLF) model of code switching with Welsh-English data. The data consist of recordings of informal conversations involving groups or pairs of speakers in North-West Wales and excerpts from BBC Radio Cymru programs.

We would like to express our thanks to the Bangor students and researchers (named in the transcript headers) who recorded the informal conversations within their social networks and kindly gave us permission to use the recordings. We are also grateful to all the speakers involved. Details regarding the context of each conversation and the speakers involved are given in the header of each transcript. However, the real names of speakers and other persons mentioned (other than professional radio presenters, actors, politicians etc.), as well as house names, have been replaced in the transcripts by pseudonyms taken from the words conventionally used to refer to letters, e.g. Alpha, Bravo, Charlie (see e.g. http://www.dynamoo.com/technical/phonetic.htm). Finally, we would like to thank the BBC for their permission to use the BBC radio programs.

Publications using these data should cite the following article:

Deuchar, M. (forthcoming). Welsh-English code-switching and the Matrix Language Frame model. Lingua.

3. Bangor (Welsh-English) Siarad

Margaret Deuchar
ESRC Centre for Research on Bilingualism
Bangor University
Bangor
Gwynedd LL57 2DG
United Kingdom
m.deuchar@bangor.ac.uk

A INTRODUCTION

The Siarad corpus of Welsh-English bilingual speech was recorded and transcribed between 2005 and 2008 as part of a research project funded by the Arts and Humanities Research Council (AHRC), entitled ‘Code switching and convergence in Welsh: a universal versus a typological approach’. The main theoretical aim of the project was to test alternative models of code switching with Welsh-English data. The title of the corpus, Siarad, is the Welsh word for ‘speaking’.

When using these data, please refer to the corpus as the Bangor Siarad corpus, and provide a link to the website from which you accessed the corpus, either http://www.talkbank.org or http://www.bilingualism.bangor.ac.uk.

B THE DATA

The corpus consists of 69 audio recordings and their corresponding transcripts of informal conversation between two or more speakers, involving a total of 153 speakers from across Wales. Participants were recruited via a variety of methods, including advertising, approaching visitors at a Welsh-language cultural event, and using the research team’s extended social network. In total, the corpus consists of 452,116 words of text from 40 hours of recorded conversation. The transcriptions (in CHAT format) are linked to the digitized recordings through sound links at the end of each main tier.

Most recordings were in stereo, and made using radio microphones and a Marantz hard disk recorder. A minidisk recorder was also occasionally used, meaning that some recordings are in mono. The recordings were made at a place convenient for the speakers, e.g. at their homes, workplaces or at the university. After setting up the equipment the researcher would leave the speakers to talk freely with one another.
The first five minutes of all recordings after the point when the researcher left the room have been deleted. In some cases the researcher re-entered briefly during the recording. These sections have not been transcribed, but notes have been made in the relevant parts of the transcripts. At the end of each recording all participants were asked to fill in questionnaires providing background information regarding their age, gender, location of places lived, etc, in order to provide information for sociolinguistic analysis. They were also asked to sign consent forms giving permission for their recording and its transcript to be used for research purposes and to be submitted to online linguistic archives. The consent form included the provision that the names of speakers and other people named in the recording would be replaced by pseudonyms in the transcript. In the case of children of 16 years or younger, a parent or guardian also signed the consent form. Sound and transcription files in the corpus are named after the researcher (primarily) responsible for recording them, namely Marika Fusser, Peredur Davies, Elen Robert, Jonathan Stammers, Nesta Roberts, Gary Smith and Margaret Deuchar. Each file name is made up of the surname followed by a number (ordered chronologically). The sound and transcription files for each conversation share the filename, but have different file extensions (‘*.wav’ and ‘*.cha’ respectively). For example, Davies3.cha is the transcription of Peredur Davies’ third recording (sound file Davies3.wav). In a few cases numbers are discontinuous. The ‘Fusser’ files begin with Fusser3, for example. Also, five recordings collected (including Fusser20, Fusser24 and Davies8) were left out of the corpus. 
In three cases this was due to the lack of consent from speakers in the recording, in one case due to the researcher taking an extensive part in the conversation, and in one case due to a participant being a Welsh speaker from Patagonia who was not a Welsh-English bilingual.

A list of the files in the corpus can be found in the Appendix. This list includes some basic information for each file. Details regarding the context of each conversation and the speakers involved are given in the transcript headers. Some additional information about the speakers and recordings is available to researchers on request.

All recordings have been transcribed in the CHAT transcription and coding format (MacWhinney 2000), in accordance with the 2007 version of the online manual available on the TalkBank website (www.talkbank.org). All further references to CHAT in this document are to this online version. The transcripts were all produced by trained transcribers working on the project: Peredur Davies, Marika Fusser, Siân Wynn Lloyd, Elen Robert and Jonathan Stammers. For 22% of the transcripts an independent transcription was done, in which a member of the transcription team transcribed one (randomly selected) minute of the recording independently from the original transcriber of that particular transcript. Transcripts were then compared and a rate of similarity was calculated. The average reliability score for independent transcriptions was 75%. Furthermore, the transcripts that were completed before March 2007 by each original transcriber were proofread by another member of the transcription team and corrections made accordingly.

All transcripts contain at least three different tiers. In addition to the main tier, required by CHAT, we use a gloss tier (%gls) for the closest English equivalent for each word (including morphological information where relevant), and a translation tier (%eng), which contains a free translation of the main tier.
All main tiers include a sound link to the corresponding section of the recording.

We request that a copy of any publications that make use of this corpus be sent to us at the above address. For introductory information about the Welsh-speaking community see Deuchar (2005).

Publications using these data should cite:

Deuchar, M. and Davies, P. Code-switching and the future of the Welsh language. International Journal of the Sociology of Language 195: 15-38.

Additional References:

Canolfan Bedwyr (2008). Cysgliad. Prifysgol Bangor.
Deuchar, M. (2005). Minority Language Survival in Northwest Wales: an Introduction. In Cohen, J., McAlister, K., Rolstad, K. and MacSwan, J. (eds) Proceedings of the 4th International Symposium on Bilingualism. Somerville, MA: Cascadilla Press, 621-624.
Griffiths, B. and Jones, D.G. (eds.) (1995). The Welsh Academy English-Welsh Dictionary / Geiriadur yr Academi. Cardiff: University of Wales Press.
King, G. (2003). Modern Welsh: a comprehensive grammar (2nd ed.). London: Routledge.
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Oxford English Dictionary (2008). Oxford University Press. (www.oed.com)
Thomas, R.J. (ed.) (1950-2004). Geiriadur Prifysgol Cymru: a dictionary of the Welsh language. Cardiff: University of Wales Press. (http://www.cymru.ac.uk/geiriadur/gpc_pdfs.htm)
Thomas, P.W. (1996). Gramadeg y Gymraeg. Cardiff: University of Wales Press.

4. Bangor (Welsh-Spanish) Patagonia

Margaret Deuchar
ESRC Centre for Research on Bilingualism
Bangor University
Bangor
Gwynedd LL57 2DG
United Kingdom
m.deuchar@bangor.ac.uk

A INTRODUCTION

The Patagonia corpus of Welsh-Spanish bilingual speech was recorded in late 2009 and transcribed from 2010 to 2011 as part of a research project funded by the Economic and Social Research Council (ESRC). The main theoretical aim of the project was to test alternative models of code-switching with Welsh-Spanish data.
Conditions of use

The corpus is being made available under the GNU General Public License version 3 or later (http://gnu.org/copyleft/gpl.html). Researchers who use it are requested to subscribe to the TalkBank Code of Ethics (http://talkbank.org/share/ethics.html) and acknowledge the corpus as set out below.

Acknowledgments

Please refer to the corpus as the Bangor Patagonia corpus, and provide a link to the website by which you accessed the corpus, either http://www.talkbank.org or http://bangortalk.org.uk. We request that a copy of any publications that make use of this corpus be sent to us at the above address.

Canonical version of the data

The most up-to-date version of the data as well as more detailed documentation is available on http://bangortalk.org.uk.

B THE DATA

The corpus consists of 43 audio recordings and their corresponding transcripts of informal conversation between two or more speakers, involving a total of 94 speakers from Patagonia, Argentina. Participants were recruited via a social network approach: as only a very small percentage of the inhabitants of Patagonia are fluent in both Spanish and Welsh, names of bilingual speakers were sought from local contacts in advance of the fieldworkers’ visit. In total, the corpus consists of 195,190 words of text from just under 21 hours of recorded conversation. The transcriptions (in CHAT format) are linked to the digitized recordings through sound links at the end of each main tier.

Most recordings were in stereo, and were made using Marantz, Zoom or Microtrack digital audio recorders. The recordings were made at a place convenient for the speakers, e.g. at their homes or workplaces. After setting up the equipment the researcher would leave the speakers to talk freely with one another. In some cases the researcher re-entered briefly during the recording. This is noted in the transcripts and speech by the researcher is usually not transcribed.
The first five minutes of all recordings after the point when the researcher left the room have been deleted, in case the participants’ speech was initially affected by the presence of the recorder. At the end of each recording all participants were asked to fill in questionnaires providing background information regarding their age, gender, location of places lived, etc, in order to provide information for sociolinguistic analysis. They were also asked to sign consent forms giving permission for their recording and its transcript to be used for research purposes and to be submitted to online linguistic archives. The consent form included the provision that the names of speakers and other people named in the recording would be replaced by pseudonyms in the transcript. In the case of children of 16 years or younger, a consent form was also signed by a parent or guardian. There are a few instances where speakers who have not given consent feature in recordings, e.g. a neighbour walking in briefly. In these cases the utterances have been transcribed as “www” and replaced by silence in the audio file. This can sometimes mean that parts of the consenting participants’ speech are lost as well where there is overlap with the non-consenting speaker. In addition, beeps have been placed over the names of people about whom sensitive information is given. The recordings in the corpus are named after the Patagonia region of Argentina where the recordings took place and are numbered in order of the sequence of recording. The sound and transcription files for each conversation share the filename, but have different file extensions (‘*.wav’/‘*.mp3’ for the sound file and ‘*.cha’ for the transcription). For example, Patagonia3.cha is the transcription of the third recording (sound files Patagonia3.wav or Patagonia3.mp3). Basic details regarding the context of each conversation and the speakers involved are given in the transcript headers. 
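
The file-naming convention described above lends itself to a small helper for pairing transcripts with their recordings. The following Python sketch is an illustration only: the function names and the regular expression are assumptions made for this example, not part of CLAN or any TalkBank tool.

```python
import re
from pathlib import PurePath

# Assumed pattern: a region or surname followed by a sequence number,
# e.g. "Patagonia3" (hypothetical helper, not a TalkBank utility).
NAME_PATTERN = re.compile(r"^([A-Za-z]+)(\d+)$")


def split_file_name(file_name):
    """Split a name such as 'Patagonia3.cha' into ('Patagonia', 3)."""
    stem = PurePath(file_name).stem
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return match.group(1), int(match.group(2))


def candidate_sound_names(transcript_name, sound_exts=(".wav", ".mp3")):
    """List the sound file names that would share the transcript's base name."""
    stem = PurePath(transcript_name).stem
    return [stem + ext for ext in sound_exts]


print(split_file_name("Patagonia3.cha"))
print(candidate_sound_names("Patagonia3.cha"))
```

Whether a given recording is actually distributed as ‘*.wav’, ‘*.mp3’ or both has to be checked against the downloaded data; the sketch only generates the names the convention predicts.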
Some additional information about the speakers and recordings is available to researchers on request.

All recordings have been transcribed in the CHAT transcription and coding format (MacWhinney 2000), in accordance with the 2012 version of the online manual available at http://childes.psy.cmu.edu/manuals/chat.pdf. All references to the CHAT manual in this document are to this online version. All transcripts were produced by trained transcribers working on the project: Fraibet Aveledo, Diana Carter, Marika Fusser, Lowri Jones, M. Carmen Parafita Couto, Myfyr Prys and Jonathan Stammers. For 10% of the transcripts an independent transcription was done, in which a member of the transcription team transcribed one (randomly selected) minute of the recording independently from the original transcriber of that particular transcript. Transcripts were then compared and a rate of similarity was calculated. The average reliability score¹ for independent transcriptions was 88%. Furthermore, all the transcripts were checked by another member of the transcription team and corrections made accordingly. The team of checkers included the following researchers in addition to the original transcription team: Margaret Deuchar, Lara Gil Vallejo, Jon Herring, Guillermo Montero Melis, and Susana SabinFernández.

All transcripts contain at least three different tiers. In addition to the main tier, required by CHAT, we use an automatically generated gloss tier (%xaut) for the closest English equivalent for each word (including morphological information where relevant), and a translation tier (%eng), which contains a free translation into English of the main tier. A comments tier (%com) has also been used occasionally for comments by the transcriber that are specific to the utterance in the corresponding main tier. All main tiers include a sound link to the corresponding section of the recording.

The remainder of this document outlines the conventions used in the main tier and the gloss tier.
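
To make the tier structure concrete, the sketch below groups each main tier with the dependent tiers that follow it. It is a minimal illustration in Python, assuming the usual CHAT layout in which main tiers begin with ‘*’, dependent tiers with ‘%’, headers with ‘@’, and continuation lines with a tab. The function name and the sample utterance are invented for this example (the sample is not taken from the corpus and omits the sound link), and the sketch is not a substitute for CLAN.

```python
def group_tiers(chat_text):
    """Group each main tier with the dependent tiers that follow it."""
    utterances = []
    target = None  # (container, key) that receives tab-indented continuation lines
    for line in chat_text.splitlines():
        if line.startswith("*"):
            # Main tier: "*SPK:<tab>words ..."
            speaker, _, text = line[1:].partition(":")
            utterance = {"speaker": speaker, "main": text.strip(), "tiers": {}}
            utterances.append(utterance)
            target = (utterance, "main")
        elif line.startswith("%") and utterances:
            # Dependent tier: "%name:<tab>content"
            name, _, text = line[1:].partition(":")
            utterances[-1]["tiers"][name] = text.strip()
            target = (utterances[-1]["tiers"], name)
        elif line.startswith("\t") and target is not None:
            # Continuation of the previous tier
            container, key = target
            container[key] += " " + line.strip()
        else:
            # Header line (@...) or blank line
            target = None
    return utterances


# Invented sample for illustration only.
SAMPLE = (
    "@Begin\n"
    "*ABC:\tmae hotel@s:spa .\n"
    "%xaut:\tbe.V.3S.PRES hotel.N.M.SG\n"
    "%eng:\tthere is a hotel .\n"
    "@End\n"
)

for utterance in group_tiers(SAMPLE):
    print(utterance["speaker"], utterance["main"], utterance["tiers"])
```

Keeping the dependent tiers in a dictionary keyed by tier name means the same function works for transcripts that use %gls rather than %xaut, or that add an occasional %com tier.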
C MAIN TIER

1. Layout of transcription

1.1. Since the theoretical aims of the project include clause-based analysis, the transcribed data are divided into clauses where possible. Where an utterance contains two main clauses, each clause in that utterance is written on a separate main tier. Complex clauses are treated as one clause and therefore subordinate clauses are included in the same tier as their main clauses. Adverbial clauses are also written on the same main tier as their related main clause.

1.2. Each main tier is divided into units which we call ‘words’ for the purposes of these conventions. With some exceptions (see C.1.3) a word is considered to be a continuous sequence of characters containing no spaces, as found in Geiriadur Prifysgol Cymru (Thomas 1950-2004) (GPC), Geiriadur yr Academi (Griffiths & Jones 1995) (GyrA), Cysgeir (Canolfan Bedwyr 2008), the Diccionario de la Lengua Española online from the Real Academia Española (DLE) and the Diccionario de Americanismos (2010) (DA). These are referred to as GPC, GyrA, Cysgeir, DLE and DA respectively throughout this document. Where items are entered as two hyphenated words in these reference dictionaries, they are connected by an underscore in the transcripts, e.g. ‘cyd_ddigwyddiad’ (= ‘coincidence’). When one of the reference dictionaries offers more than one alternative (e.g. ‘minibus’, ‘mini-bus’ or ‘mini bus’), or when the reference dictionaries differ from each other, the most compact alternative is chosen (‘minibus’ in this case).

¹ An innovative method was used, based on Turnitin plagiarism detection software (http://www.turnitin.com). For further details see Deuchar et al. (in press).

1.3. Other items which are treated as words are:

i. interjections and interactional markers, e.g. ‘ajá’ (= ‘aha’), ‘ay’ (= ‘oh’), ‘hym’ (= ‘hmm’), etc.
ii. proper names (including names of books, films, organisations etc.), a sequence of words being connected by underscores, e.g. ‘Butch_Cassidy’, ‘Buenos_Aires’.
iii. abbreviations (connected by underscore), e.g. ‘B_B_C’.
iv. Welsh numbers consisting of two words involving ten which translate into one English word, e.g. ‘un_ar_ddeg’ (= ‘eleven’), ‘pedwar_deg’ (= ‘forty’). Note that other numbers such as those containing ‘hundred’, ‘thousand’ etc. are transcribed as separate words, e.g. ‘cant saith deg tri’, ‘ciento setenta y tres’ (‘173’).
v. Welsh phrasal prepositions, formed using two morphemes, where separation of the two elements of the word would make any gloss of those individual elements unhelpful. These were transcribed with an underscore between the two morphemes, e.g. ‘oddi_wrth’, which means ‘from’, but whose individual morphemes translate respectively as ‘out of’ and ‘next to’.

Examples of the phrasal prepositions described in C.1.3.v are listed below, along with some other phrases which are similarly transcribed because they normally translate into just a single English word.

(a) Welsh

Our transcription   Conventional form   English equivalent
cyd_ddigwyddiad     cyd-ddigwyddiad     coincidence
dim_byd             dim byd             nothing
o_hyd               o hyd               still
o_k                 OK                  OK
ta_beth             ‘ta beth            anyway
tu_ôl_i             tu ôl i             behind
un_ai               un ai               either
yn_ôl               yn ôl               back

(b) Spanish

Our transcription   Conventional form   English equivalent
ni_fu_ni_fa         ni fu ni fa         neither nor
no_más              no más              only
o_k                 OK                  OK
o_la_la             olalá               ooh la la
copo_de_nieve       copo de nieve       guelder rose

1.4. Contractions that do not have entries in the reference dictionaries listed above or, in the case of Welsh, in King (2003), are transcribed in full, but the unpronounced parts are bracketed. For example, the pronunciation of ‘fel yna’ (= ‘like that’) as [vɛla] in speech is represented in the transcripts as ‘fel (yn)a’.

1.5. There are some continuous sequences of characters in the main tier which are not treated as words.
These include “simple events” such as ‘&=laugh’ (see CHAT 7.6.1), ‘xxx’ for unintelligible sounds, or the use of an ampersand (‘&’) plus phonetic characters for intelligible sounds without clear meaning, as in e.g. ‘&pfe’ where the speaker produces the non-word [pfe] (see CHAT 6.4 for both).

1.6. Please note that pause markings are not used consistently in the transcripts. Additionally, pauses between utterances are generally not marked. We have used the ‘lazy overlap’ markings (+<) for overlapping speech.

2. Language marking

2.1. A default language is assigned to each transcription based on the language contributing the greater number of words. The default language is the first language listed in the @Language tier in the file header, and is indicated by the ISO-639-3 abbreviation for the language: cym = Welsh, eng = English, spa = Spanish. Words without any language markers in the transcription are in the default language unless they are part of an utterance preceded by a precode indicating that it is in a non-default language (see next paragraph for details).

2.2. Individual utterances in the second or third most frequent language are marked with precodes at the beginning of the main tier, e.g. [- cym] for Welsh, [- spa] for Spanish, and these utterances contain no language tags. In mixed utterances each word in the non-default language is marked by a tag consisting of @s: followed by the relevant ISO-639-3 abbreviation:

@s:cym       Welsh
@s:spa       Spanish
@s:eng       English
@s:cym&spa   undetermined (see below, 2.4)
@s:spa+cym   word with first morpheme(s) Spanish, final morpheme(s) Welsh
@s:cym+spa   word with first morpheme(s) Welsh, final morpheme(s) Spanish

2.3. A word or morpheme is considered to be Welsh if it can be found in any of the Welsh-language reference dictionaries or in King (2003). A word or morpheme is considered to be Spanish if it or all its elements are found in either of the Spanish reference dictionaries (e.g. ‘principito’ is considered to be a Spanish word because ‘príncipe’ and ‘-ito’ are both listed in DLE). However, we have considered some words not listed in the dictionaries to be either Welsh or Spanish, as indicated in the list below.

Transcribed form   Language   English equivalent
chuker             Spanish    sweetener
coranto            Spanish    courante
corintos           Spanish    currants
ddi                Welsh      she/her
ddo                Welsh      he/him
estiletos          Spanish    stiletto heels
lactal             Spanish    sliced
mm                 Welsh      mm (interactional marker)
mmhm               Welsh      mmhm (interactional marker)
nebulización       Spanish    nebulisation
uhuh               Welsh      uhuh (interactional marker)
valvuloplastía     Spanish    valvuloplasty
wchi               Welsh      you know (polite/plural)
wsti               Welsh      you know
ychi               Welsh      you know (polite/plural)
yfe                Welsh      isn’t it
yli                Welsh      you see
yliwch             Welsh      you see (polite/plural)
yndyfe             Welsh      doesn't it
yn_basai           Welsh      wouldn’t it
yn_de              Welsh      isn’t it
yn_do              Welsh      didn’t it
yn_doedd           Welsh      wasn’t it
yn_doeddech        Welsh      weren’t you (polite/plural)
yn_doedden         Welsh      weren’t they
yn_doeddwn         Welsh      wasn’t I
yn_does            Welsh      isn’t there
yn_dydach          Welsh      aren’t you (polite/plural)
yn_dydan           Welsh      aren’t they/we
yn_dydy            Welsh      isn’t it
yn_dydyn           Welsh      aren’t they/we

2.4. The language marker @s:cym&spa is used with words where the language source is undetermined. It marks words that occur in the lexicon of both languages (as determined by the respective reference dictionaries) and that are pronounced in a way that is possible both in Welsh and in Spanish, e.g. [foto] (‘ffoto’ in Welsh or ‘foto’ in Spanish) or [pjano] (‘piano’ in both languages).

2.5. @s:cym&spa also marks interjections and interactional markers that may be interpreted as ambiguous, e.g. ‘ah’, ‘oh’. Other interjections and interactional markers are assigned language markers according to their inclusion (or not) in the reference dictionaries. For example, ‘ych’ (a marker of disgust) is marked @s:cym as it is only found in the Welsh-language reference dictionaries.
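
The language-marking scheme set out in 2.1 and 2.2 (default language, precodes, and @s: tags) can be read off a main tier mechanically. The Python sketch below is a rough illustration under stated assumptions: the function name and regular expressions are invented for this example, the default language is passed in by the caller rather than read from the @Language header, and the filter for non-words is deliberately crude (it only drops final punctuation, ‘xxx’, and tokens beginning with ‘&’ or ‘+’).

```python
import re

# Precode such as "[- spa]" at the start of a main tier (see 2.2).
PRECODE = re.compile(r"^\[- ([a-z]{3})\]\s*")
# Word-level tag such as "hotel@s:spa" or "Buenos_Aires@s:cym&spa".
TAG = re.compile(r"^(?P<word>.+?)@s:(?P<langs>[a-z]{3}(?:[&+][a-z]{3})*)$")


def tag_languages(main_tier, default_lang):
    """Return (word, language) pairs for one main-tier utterance."""
    precode = PRECODE.match(main_tier)
    if precode:
        # The whole utterance is in a non-default language.
        default_lang = precode.group(1)
        main_tier = main_tier[precode.end():]
    pairs = []
    for token in main_tier.split():
        # Crude filter for punctuation and non-words (an assumption of this sketch).
        if token in {".", "?", "!", "xxx"} or token[0] in "&+":
            continue
        tagged = TAG.match(token)
        if tagged:
            pairs.append((tagged.group("word"), tagged.group("langs")))
        else:
            pairs.append((token, default_lang))
    return pairs


print(tag_languages("mae hotel@s:spa yn Buenos_Aires@s:cym&spa .", "cym"))
print(tag_languages("[- spa] no_más .", "cym"))
```

Combined tags such as ‘cym&spa’ or ‘spa+cym’ are returned as they appear, so that undetermined and mixed-morpheme words can be counted separately from unambiguous ones.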
There are also some instances where we assigned a language to an interactional marker that was not listed in any of the dictionaries (see 2.3).

Transcription   Language(s)         English equivalent
ah              Welsh & Spanish     ah
ajá             (Welsh &) Spanish   aha
argian          Welsh               good lord
ay              Spanish             oh
bah             Spanish             bah
bechod          Welsh               how sad
diar            Welsh               dear
eh              Welsh & Spanish     eh
ew              Welsh               oh
hym             Welsh               hmm
mm              Welsh               mm
mmhm            Welsh               mmhm
nefi            Welsh               heavens
oh              Welsh & Spanish     oh
oi              Spanish             oh
ta              Welsh               then
ta_ra           Welsh               goodbye
ta_ta           Welsh               goodbye
te              Welsh               be
w               Welsh               ooh
wel             Welsh               well
ý               Welsh               er
ych             Welsh               yuck
ym              Welsh               um

2.6. Where a lexeme could belong to both languages, but its pronunciation in a specific occurrence belongs unambiguously to one language only, it will be marked @s:cym or @s:spa (and written in the orthography of that language) according to its pronunciation. For example, if ‘hotel’ is pronounced with initial [h], it will be marked @s:cym; without initial [h] it will be marked @s:spa.

2.7. Proper names and titles are marked ‘@s:cym&spa’ (undetermined) unless there are alternatives in each language in general use, e.g. ‘Butch_Cassidy@s:cym&spa’, ‘Buenos_Aires@s:cym&spa’, ‘Arglwydd_Dyma_Fi@s:cym&spa’ (a Welsh hymn), but ‘Argentina@s:spa’, ‘Ariannin@s:cym’ (the Welsh word for ‘Argentina’).

3. Orthography

3.1. We have used a Unicode font (http://en.wikipedia.org/wiki/Unicode) for the transcription. Occasional non-lexical phonological fragments are spelt out following an ampersand using IPA symbols (http://www.langsci.ucl.ac.uk/ipa/ipachart.html) (e.g. &ʧʊ), and these may not show up correctly if a Unicode font is not used.

3.2. Words marked as ‘@s:spa’ (Spanish) are transcribed in conventional Spanish orthography.

3.3. Words marked as Welsh are transcribed in conventional Welsh orthography. We have not represented regional variation in the transcripts, except in cases which have orthographic representation in the Welsh-language reference dictionaries or in King (2003).
There are some cases where we differ in usage from conventional orthography:

- Colloquial second person singular verb and preposition endings not usually represented in writing are transcribed as ‘-a’ where they are followed by the pronoun ‘chdi’, e.g. ‘oedda chdi’ (= ‘you were’), ‘amdana chdi’ (= ‘about you’).
- We do not represent morpheme-final [v] when it is not pronounced. For example, [pɛntre] (village) is written ‘pentre’ in the transcripts rather than ‘pentref’ (as the word is represented in the Welsh-language reference dictionaries).
- Morpheme-initial /r/ is only transcribed as ‘rh’ where it is clearly heard by the transcriber to be voiceless ([r̥]). Otherwise it is transcribed as ‘r’, even when the standard orthography prescribes ‘rh’. Some speakers do not have [r̥] as part of their phonological system in any case.
- Morphemes in Welsh which are usually written with an initial apostrophe, such as the marking of the ellipsis of a possessive pronoun in e.g. ‘’nhad’ (= ‘my father’), are transcribed without this initial apostrophe (e.g. ‘nhad’) owing to the constraints of CHAT.
- We have represented mutation (sound change to initial consonants) or its absence without following prescriptive rules as to where mutation might or might not be expected. Thus the Welsh form of ‘in Cardiff’ may be transcribed ‘yn Caerdydd’ (with initial [k]) and ‘yn Gaerdydd’ (with initial [g]), as well as the standard form ‘yng Nghaerdydd’ (with initial [ŋ̥]), according to what is heard. We have also transcribed the aspirate mutation of /m/ and /n/ after the 3rd singular feminine possessive adjective common in regional varieties, e.g. ‘ei mham’ (= ‘her mother’, with initial [m̥]), rather than standard ‘ei mam’ (with initial [m]).
- There are also quite a few instances in the corpus where speakers who are learners of Welsh use ungrammatical or unconventional forms. These include ‘hypermutation’, where an already mutated initial consonant undergoes another round of mutation, e.g.
‘tipyn’ > ‘dipyn’ . ‘ddipyn’ (meaning ‘a bit’). 3.4. Words whose language source is undetermined are transcribed in Spanish rather than in Welsh orthography, e.g. ‘avocado@s:cym&spa’ rather than ‘afocado@s:cym&spa’. 3.5. When words marked as Spanish or undetermined are mutated (where the sound of an initial consonant is changed depending on the grammatical context, see for example King 1993:14-20), the initial (mutated) sound is written in Welsh orthography and the rest in Spanish, e.g. ‘rhyw ddoctora’ (= ‘some (female) doctor’) 3.6. There is some variation in the way initial consonants in Welsh have been transcribed. In some instances the transcriber interpreted a word to have a soft mutation where the speaker may simply have used the Spanish variant of a consonant rather than the Welsh one. This is especially true for stops, where Spanish /p/, /t/, /k/ are more similar to Welsh /b/, /d/, /g/ than to Welsh /p/, /t/, /k. For example, the transcription may record ‘dâl’ (= ‘payment’, with soft mutation), where the speaker was intending to say ‘tâl’ (without mutation). D GLOSS TIER 1. Principles Each word (see C1.2 and C.1.3) in the main tier is given a gloss in the gloss tier (%aut). The gloss tier has been produced automatically using the Bangor Autoglosser (http://bangortalk.org.uk/autoglosser.php), free (GPL) software developed at the Centre – for further details see Donnelly and Deuchar 2011. The transcripts were manually corrected after autoglossing to deal with the small number (less than 2%) of incorrectly-attributed glosses. 1.1. Non-words (see C.1.5) are not glossed. 1.2. All words are glossed with the closest English-language equivalent (in lower case) and, where appropriate, information about parts of speech. English equivalents of proper names are used where they exist (for example, ‘Caerdydd@s:cym’ is glossed as ‘Cardiff’). If there is no English-language equivalent to a name, it is glossed ‘name’. 1.3. 
The underscore is used in the gloss tier to connect more than one lexical item in a gloss, where the English translation of a single Welsh or Spanish word involves more than one word. For example, ‘neithiwr’ is glossed as ‘last_night’.
1.4. The English lexeme in a gloss is followed by information about parts of speech, separated by dots. Some examples:
Spanish ‘hijos’ is glossed ‘son.N.M.PL’, which means ‘plural of the masculine noun “hijo”’;
Welsh ‘mae’ is glossed ‘be.V.3S.PRES’, which means ‘third person singular present of the verb “be”’;
Spanish ‘me’ is glossed ‘me.PRON.OBL.MF.1S’, meaning ‘oblique pronoun, 1st person singular, masculine or feminine’;
Welsh ‘fan’ is glossed ‘place.N.MF.SG+SM’, which means ‘singular of the noun “man” (meaning “place”), which can be either masculine or feminine, with a soft mutation’.
2. Parts of speech abbreviations
Abbreviation  Representing
0             impersonal
123S          1st, 2nd, 3rd person singular
13S           1st, 3rd person singular
1P            1st person plural
1S            1st person singular
23P           2nd, 3rd person plural
23S           2nd, 3rd person singular
23SP          2nd, 3rd person singular or plural
2P            2nd person plural
2S            2nd person singular
2SP           2nd person singular or plural
3P            3rd person plural
3S            3rd person singular
3SP           3rd person singular or plural
ADJ           adjective
ADV           adverb
AM            aspirate mutation
ASV           adjective, singular noun, or verb
AUG           augmentative
COMP          comparative
COND          conditional
CONJ          conjunction
DEF           definite
DEM           demonstrative
DET           determiner
DIM           diminutive
E             exclamation
EMPH          emphatic
F             feminine
FAR           far (demonstrative)
FOCUS         item with focus
FUT           future
GER           gerund
H             pre-vocalic h after 3S.F, 1P and 3P possessives
HYP           hypothetical
IM            interactional marker
IMPER         imperative
IMPERF        imperfect
INDEF         indefinite
INFIN         infinitive
INT           interrogative
INTENS        intensive
M             masculine
MF            masculine or feminine
N             noun
NEAR          near (demonstrative)
NEG           negative
NM            nasal mutation
NT            neuter
NUM           numeral
OBJ           object
OBL           oblique
ORD           ordinal
PAST          past
PASTPART      past participle
PL            plural
PLUPERF       pluperfect
POSS          possessive
PRECLITIC     accented form before clitics
PREP          preposition
PREQ          pre-qualifier
PRES          present
PRESPART      present participle
PRON          pronoun
PRT           particle
QUAN          quantifier
REFL          reflexive
REL           relative
SG            singular
SM            soft mutation
SP            singular or plural
SUB           subject
SUBJ          subjunctive
SUP           superlative
SV            singular noun or verb
TAG           tag question
V             verb
REFERENCES
Canolfan Bedwyr (2008). Cysgliad. Prifysgol Bangor. (www.cysgliad.com)
Diccionario de Americanismos. Asociación de Academias de la Lengua Española (2010).
Diccionario de la Lengua Española. Real Academia Española. (www.rae.es)
Griffiths, B. and Jones, D.G. (eds.) (1995). The Welsh Academy English-Welsh Dictionary / Geiriadur yr Academi. Cardiff: University of Wales Press. (http://techiaith.bangor.ac.uk/GeiriadurAcademi)
Deuchar, M. et al. (in press) Building bilingual corpora: Welsh-English, Spanish-English and Spanish-Welsh. In I. Mennen and E. Thomas (eds) Unravelling Bilingualism. Multilingual Matters.
Donnelly, K. and Deuchar, M. (2011) Using constraint grammar in the Bangor Autoglosser to disambiguate multilingual spoken text. In: Constraint Grammar Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia. Tartu: NEALT Proceedings Series. (http://dspace.utlib.ee/dspace/handle/10062/19298)
King, G. (2003). Modern Welsh: a comprehensive grammar (2nd ed.). London: Routledge.
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Thomas, R.J. (ed.) (1950-2004). Geiriadur Prifysgol Cymru: a dictionary of the Welsh language. Cardiff: University of Wales Press. (http://www.cymru.ac.uk/geiriadur/gpc_pdfs.htm)
Thomas, P.W. (1996). Gramadeg y Gymraeg. Cardiff: University of Wales Press.
APPENDIX
File summary:
File name     Length (mm:ss)  No. of main participants  Age (years)         Sex
Patagonia1    10:02           3                         22, 21, 28          F, F, M
Patagonia2    29:18           2                         66, 82              F, F
Patagonia3    28:44           2                         82, 78              F, F
Patagonia4    23:33           3                         48, 47, ?           F, M, M
Patagonia5    22:16           5                         90, 82, 67, 72, 61  F, F, M, F, F
Patagonia6    27:17           2                         54, 96              F, F
Patagonia7    44:31           2                         66, 68              F, F
Patagonia8    31:37           2                         84, 83              F, F
Patagonia9    37:05           2                         65, 69              F, F
Patagonia10   25:48           2                         35, 9               M, F
Patagonia11   43:59           3                         81, 74, 86          F, F, F
Patagonia12   29:55           2                         78, 25              F, F
Patagonia13   28:31           2                         61, 42              F, F
Patagonia14   31:28           2                         63, 74              F, F
Patagonia15   29:45           2                         81, 42              F, F
Patagonia16   32:10           3                         54, 54, 46          M, F, F
Patagonia17   30:15           2                         21, 18              F, F
Patagonia18   30:46           3                         73, 46, ?           M, M, F
Patagonia19   43:10           2                         38, 8               M, F
Patagonia20   34:29           2                         67, 58              F, F
Patagonia21   33:30           2                         60, 60              F, F
Patagonia22   26:18           2                         84, 56              F, F
Patagonia23   29:54           2                         63, 64              F, F
Patagonia24   30:57           2                         53, 88              F, F
Patagonia25   29:47           4                         48, 44, 13, 8       M, F, F, M
Patagonia26   37:54           2                         55, 27              F, M
Patagonia27   27:45           2                         69, 68              F, M
Patagonia28   35:07           2                         22, 28              F, F
Patagonia29   14:26           2                         18, 18              M, F
Patagonia30   33:10           2                         74, 71              F, F
Patagonia31   39:20           2                         81, 70              F, F
Patagonia32   30:02           2                         71, 75              F, M
Patagonia33   30:26           2                         72, 29              F, M
Patagonia34   28:42           2                         58, 34              F, F
Patagonia35   32:55           2                         75, 71              M, F
Patagonia36   33:04           1                         70                  F
Patagonia37   29:28           2                         46, 44              F, F
Patagonia38   35:21           2                         30, 37              F, F
Patagonia39   29:30           2                         71, 76              F, F
Patagonia40   24:09           2                         90, 92              F, F
Patagonia41   22:05           4                         71, 55, 76, ?       M, F, F, M
Patagonia42   33:35           2                         70, 87              F, F
Patagonia43   11:29           2                         56, ?               F, M
20:55:46 942 Total: 42

5. Bangor (Spanish-English) Miami
The Miami Corpus: Documentation File
Margaret Deuchar
ESRC Centre for Research on Bilingualism
Bangor University
Bangor
Gwynedd LL57 2DG
United Kingdom
m.deuchar@bangor.ac.uk
A INTRODUCTION
The Miami corpus of Spanish-English bilingual speech was recorded and transcribed between 2008 and 2011 as part of a research project funded by the Economic and Social Research Council (ESRC). The main theoretical aim of the project was to test alternative models of code-switching with Spanish-English data.
Conditions of use
The corpus is being made available under the GNU General Public License version 3 or later (http://gnu.org/copyleft/gpl.html).
Researchers who use it are requested to subscribe to the TalkBank Code of Ethics (http://talkbank.org/share/ethics.html) and acknowledge the corpus as set out below.
Acknowledgments
Please refer to the corpus as the Bangor Miami corpus, and provide a link to the website by which you accessed the corpus, either http://www.talkbank.org or http://bangortalk.org.uk. We request that a copy of any publications that make use of this corpus be sent to us at the above address.
Canonical version of the data
The most up-to-date version of the data as well as more detailed documentation is available on http://bangortalk.org.uk.
B THE DATA
The corpus consists of 56 audio recordings and their corresponding transcripts of informal conversation between two or more speakers, involving a total of 84 speakers living in Miami, Florida (USA). Participants were recruited via a variety of methods, including advertising and using the research team’s extended social network. Of the 56 audio recordings, 15 are of conversations involving one individual, recorded over a longer period of time in conversation with more than one speaker. The participant (‘María’) was already known by the research team to be a balanced bilingual who frequently and consistently code-switched in daily conversation, and so she was invited to make recordings of her interactions with colleagues, family and friends. María decided when and with whom to make recordings, by means of a small digital recorder worn on her belt with a moderately concealed lapel microphone. She recorded 42 conversations, 15 of which have been selected for transcription on the basis of their acoustic quality. The research team had no control over when or where the recordings were made, nor over technical aspects such as checking audio levels, environmental noise and changing batteries in the recorder.
María’s interlocutors did not sign consent forms or fill in questionnaires and so the transcripts of the 15 recordings only represent María’s speech, while utterances from other speakers are transcribed as “www”. In total, the corpus consists of 242,475 words of text from 35 hours of recorded conversation. The transcriptions (in CHAT format) are linked to the digitized recordings through sound links at the end of each main tier. Most recordings were in stereo, and were made using Marantz, Zoom or Microtrack digital audio recorders. The recordings were made at a place convenient for the speakers, e.g. at their homes or workplaces. After setting up the equipment the researcher would leave the speakers to talk freely with one another. In some cases the researcher re-entered briefly during the recording. This is noted in the transcripts and speech by the researcher is usually not transcribed. The first five minutes of all recordings after the point when the researcher left the room have been deleted, in case the participants’ speech was initially affected by the presence of the recorder. At the end of each recording all participants were asked to fill in questionnaires providing background information regarding their age, gender, location of places lived, etc., in order to provide information for sociolinguistic analysis. They were also asked to sign consent forms giving permission for their recording and its transcript to be used for research purposes and to be submitted to online linguistic archives. The consent form included the provision that the names of speakers and other people named in the recording would be replaced by pseudonyms in the transcript. In the case of children of 16 years or younger, a consent form was also signed by a parent or guardian. There are a few instances where speakers who have not given consent feature in recordings, e.g. a neighbour walking in briefly.
In these cases the utterances have been transcribed as “www” and replaced by silence in the audio file. This can sometimes mean that parts of the consenting participants’ speech are lost as well where there is overlap with the non-consenting speaker. In addition, beeps have been placed over the names of people about whom sensitive information is given. Sound and transcription files in the corpus are named after the researcher who did the recording and are numbered in order of the sequence of recording. The sound and transcription files for each conversation share the filename, but have different file extensions (‘*.wav’/‘*.mp3’ for the sound file and ‘*.cha’ for the transcription). For example, Sastre2.cha is the transcription of the second recording made by Sastre (sound file Sastre2.wav). Basic details regarding the context of each conversation and the speakers involved are given in the transcript headers. Some additional information about the speakers and recordings is available to researchers on request. All recordings have been transcribed in the CHAT transcription and coding format (MacWhinney 2000), in accordance with the 2012 version of the online manual available on http://childes.psy.cmu.edu/manuals/chat.pdf. All references to the CHAT manual in this document are to this online version. All transcripts have been done by trained transcribers working on the project: Fraibet Aveledo, Diana Carter, Marika Fusser, Lowri Jones, M. Carmen Parafita Couto, Myfyr Prys and Jonathan Stammers. Additionally, teams from Penn State University (Amelia Dietrich, Giuli Dussias, Chip Gerfen, Rosa Guzzardo, and Jorge Valdes Kroff) and Australian National University (Bronwyn Wrigley, Manuel Delicado, and Jennifer Plaistowe) collaborated in the transcription process.
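The file-naming convention described above (researcher name plus a sequence number, with a transcript and its recording sharing a base name) lends itself to a simple pairing step. The following is a minimal Python sketch, not part of the corpus tooling: the function names and the example file list are hypothetical and chosen for illustration only, and only the extensions named in the text are assumed.

```python
import re
from pathlib import PurePath

# Extensions named in the documentation above.
SOUND_EXTENSIONS = {".wav", ".mp3"}
TRANSCRIPT_EXTENSION = ".cha"


def split_base_name(filename):
    """Split a name such as 'Sastre2.cha' into ('Sastre', 2).

    Returns None when the base name is not 'letters followed by digits'.
    """
    stem = PurePath(filename).stem
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", stem)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def pair_transcripts_with_sound(filenames):
    """Map each transcript file to the sound files that share its base name."""
    sounds_by_stem = {}
    for name in filenames:
        path = PurePath(name)
        if path.suffix.lower() in SOUND_EXTENSIONS:
            sounds_by_stem.setdefault(path.stem, []).append(path.name)

    pairs = {}
    for name in filenames:
        path = PurePath(name)
        if path.suffix.lower() == TRANSCRIPT_EXTENSION:
            # A transcript with no matching recording maps to an empty list.
            pairs[path.name] = sorted(sounds_by_stem.get(path.stem, []))
    return pairs


# Hypothetical directory listing, for illustration only.
files = ["Sastre2.cha", "Sastre2.wav", "Sastre3.cha", "notes.txt"]
pairs = pair_transcripts_with_sound(files)
```

A transcript that maps to an empty list is a quick way to spot a recording that is missing from a local copy of the corpus.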
For 10% of the transcripts an independent transcription was done, in which a member of the transcription team transcribed one (randomly selected) minute of the recording independently from the original transcriber of that particular transcript. Transcripts were then compared and a rate of similarity was calculated. The average reliability score for independent transcriptions was 83% (an innovative method was used, based on Turnitin plagiarism detection software (http://www.turnitin.com); see Deuchar, M., Davies, P., Herring, J.R., Parafita Couto, M. & Carter, D. (in press) Building bilingual corpora: Welsh-English, Spanish-English and Spanish-Welsh. In I. Mennen and E. Thomas (eds) Unravelling Bilingualism. Multilingual Matters). Furthermore, all the transcripts were proofread by another member of the transcription team and corrections made accordingly. An additional team of transcribers and checkers included the following researchers in addition to the original transcription team: Margaret Deuchar, Sarah Fairchild, Marika Fusser, Lara Gil Vallejo, Guillermo Montero Melis, Esther Nuñez, Susana Sabin-Fernández, and Jonathan Stammers. All transcripts contain at least three different tiers. In addition to the main tier, required by CHAT, we use an automatically generated gloss tier (%xaut) for the closest English equivalent for each word (including morphological information where relevant), and a translation tier (%eng), which contains a free translation of the main tier. A comments tier (%com) has also been used occasionally for comments by the transcriber that are specific to the utterance in the corresponding main tier. All main tiers include a sound link to the corresponding section of the recording. The following contributed to the translation tier: Adriana Acevedo, Olga Bolaños, Vanesa Bonavota, Rubén Chapela, Magdalena Gazda, Ana Muerza, Renata Kendall, Mary Silva, Sara Viñas, and Renée Zeichen.
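The tier layout described above (a main tier followed by %xaut, %eng and occasionally %com) can be read with a few lines of code. The sketch below is illustrative only and is not the project's own software. It assumes the usual CHAT line prefixes (‘@’ for headers, ‘*’ for main tiers, ‘%’ for dependent tiers); the speaker code ‘SPK’ and the sample lines are invented, apart from the gloss ‘son.N.M.PL’, which is one of the examples given later in this guide. Sound links and continuation lines are not handled.

```python
import re


def group_tiers(lines):
    """Group each main tier with the dependent tiers that follow it."""
    utterances = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and file headers
        if line.startswith("*"):
            speaker, _, text = line[1:].partition(":")
            utterances.append({"speaker": speaker, "main": text.strip(), "tiers": {}})
        elif line.startswith("%") and utterances:
            name, _, text = line[1:].partition(":")
            utterances[-1]["tiers"][name] = text.strip()
    return utterances


def split_gloss(gloss):
    """Split a gloss into its English lexeme and its list of tags.

    Tags are separated by dots; a plus sign attaches a further tag
    such as a mutation or negation marker.
    """
    lexeme, _, rest = gloss.partition(".")
    tags = [tag for tag in re.split(r"[.+]", rest) if tag]
    return lexeme, tags


# Invented sample lines, for illustration only.
sample = [
    "@Begin",
    "*SPK:\thijos .",
    "%xaut:\tson.N.M.PL",
    "%eng:\tsons .",
    "@End",
]

utterances = group_tiers(sample)
lexeme, tags = split_gloss(utterances[0]["tiers"]["xaut"])
```

Keeping the dependent tiers in a dictionary keyed by tier name means that an occasional %com tier is picked up without any special handling.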
The remainder of this document outlines the conventions used in the main tier and the gloss tier.
C MAIN TIER
1. Layout of transcription
1.1. Since the theoretical aims of the project included clause-based analysis, the transcribed data are divided into clauses where possible. Where an utterance contains two main clauses, each clause in that utterance is written on a separate main tier. Complex clauses are treated as one clause and therefore subordinate clauses are included in the same tier as their main clauses. Adverbial clauses are also written on the same main tier as their related main clause.
1.2. Each main tier is divided into units which we call ‘words’ for the purposes of these conventions. With some exceptions (see C.1.3) a word is considered to be a continuous sequence of characters containing no spaces as found in the Webster’s Dictionary for English, and in the Diccionario de la Lengua Española online from the Real Academia Española and the Diccionario de Americanismos (2010) for Spanish. These are referred to as DLE and DA respectively throughout this document. Where items are entered as two hyphenated words in these reference dictionaries, they are connected by an underscore in the transcripts. When one of the reference dictionaries offers more than one alternative (e.g. ‘minibus’, ‘mini-bus’ or ‘mini bus’), or when the reference dictionaries differ from each other, the most compact alternative is chosen (‘minibus’ in this case).
1.3. Other items which are treated as words are:
(a) interjections and interactional markers, e.g. ‘ajá’ (= ‘aha’), ‘ay’ (= ‘oh’), ‘mmhm’ (= ‘mhm’), etc.
(b) proper names (including names of books, films, organisations etc.), a sequence of words being connected by underscores, e.g. ‘Nueva_York’.
(c) abbreviations (connected by underscore), e.g. ‘B_B_C’.
(d) examples of phrases that are not found in the DLE and DA are listed below.
Transcribed form  Conventional form  English
ni_fu_ni_fa       ni fu ni fa        neither nor
no_más            no más             only
o_k               OK                 OK
vale_turca        vale turca         it doesn't matter
o_la_la           olalá              ooh la la
copo_de_nieve     copo de nieve      guelder rose
1.4. There are some continuous sequences of characters in the main tier which are not treated as words. These include simple events such as ‘&=laugh’ (see CHAT 7.6.1), ‘xxx’ for unintelligible sounds, or the use of an ampersand (‘&’) plus phonetic characters for intelligible sounds without clear meaning (see CHAT 6.4 for both).
1.5. Please note that pause markings are not used consistently in the transcripts. Additionally, pauses between utterances are usually not marked. We have used the ‘lazy overlap’ markings (‘+>’) for overlapping speech.
2. Language marking
2.1. A default language is assigned to each transcription based on the language contributing the greater number of words. The default language is the first language listed in the @Language tier in the file header, and is indicated by the ISO-639-3 abbreviation for the language: spa = Spanish, eng = English. Words without any language markers in the transcription are in the default language unless they are part of an utterance preceded by a precode indicating that it is in a non-default language; see the next paragraph for details.
2.2. Individual utterances in the second or third most frequent language are marked with precodes at the beginning of the main tier, e.g. [- eng] for English, [- spa] for Spanish, and these utterances contain no language tags. In mixed utterances each word in the non-default language is marked by a tag consisting of @s: followed by the relevant ISO-639-3 abbreviation: @s:spa = Spanish, @s:eng = English, @s:eng&spa = undetermined (see below, 2.4), @s:spa+eng = word with first morpheme(s) Spanish, final morpheme(s) English, @s:eng+spa = word with first morpheme(s) English, final morpheme(s) Spanish.
2.3.
A word or morpheme is considered to be English if it can be found in any of the English-language reference dictionaries. A word or morpheme is considered to be Spanish if it or all its elements are found in either of the Spanish reference dictionaries (e.g. ‘principito’ is considered to be a Spanish word because ‘príncipe’ and ‘-ito’ are both listed in DLE). However, we have considered some words not listed in the dictionaries to be either English or Spanish, as indicated in the list below.
Transcribed form  Language  English equivalent
cucu              Spanish   bottom
estrech           Spanish   stretch (jeans)
2.4. The language marker @s:eng&spa is used with words where the language source is undetermined. It marks words that occur in the lexicon of both languages (as determined by the respective reference dictionaries) and that are pronounced in a way that is possible both in English and in Spanish, e.g. [pjano] (‘piano’ in both languages).
2.5. @s:eng&spa also marks interjections and interactional markers that may be interpreted as ambiguous, e.g. ‘ah’, ‘oh’. Other interjections and interactional markers are assigned language markers according to their inclusion (or not) in the reference dictionaries. For example, ‘ay’ (= ‘oh’) is marked @s:spa as it is only found in the Spanish-language reference dictionaries.
2.6. Where a lexeme could belong to both languages, but its pronunciation in a specific occurrence belongs unambiguously to one language only, it will be marked @s:eng or @s:spa (and written in the orthography of that language) according to its pronunciation. For example, if ‘hotel’ is pronounced with initial [h], it will be marked @s:eng; without initial [h] it will be marked @s:spa.
2.7. Proper names and titles are marked ‘@s:eng&spa’ (undetermined) unless there are alternatives in each language in general use, e.g. ‘Caracas@s:eng&spa’, ‘Sears@s:eng&spa’ but ‘New_York@s:eng’, ‘Nueva_York@s:spa’ (the Spanish word for ‘New York’).
3. Orthography
3.1.
We have used a Unicode font (http://en.wikipedia.org/wiki/Unicode) for the transcription. Occasional non-lexical phonological fragments are spelt out following an ampersand using IPA symbols (http://www.langsci.ucl.ac.uk/ipa/ipachart.html) (e.g. &ʧʊ), and these may not show up correctly if a Unicode font is not used.
3.2. Words marked as ‘@s:spa’ (Spanish) are transcribed in conventional Spanish orthography.
3.3. Words considered to be Spanish are transcribed in Spanish orthography. We have not represented regional variation in the transcripts, except in cases which have orthographic representation in the Spanish-language reference dictionaries.
3.4. Words whose language source is undetermined are transcribed in English rather than in Spanish orthography, e.g. football, internet, lunch, etc.
D. GLOSS TIER
2. Principles
Each word (see C.1.2 and C.1.3) in the main tier is given a gloss in the gloss tier (%aut). The gloss tier has been produced automatically using the Bangor Autoglosser (http://bangortalk.org.uk/autoglosser.php), free (GPL) software developed at the Centre; for further details see Donnelly and Deuchar 2011. The transcripts were manually corrected after autoglossing to deal with the small number (less than 2%) of incorrectly-attributed glosses.
2.1. Non-words are not glossed.
2.2. All words are glossed with the closest English-language equivalent (in lower case) and, where appropriate, information about parts of speech. English equivalents of proper names are used where they exist (for example, ‘Nueva_York@s:spa’ is glossed as ‘New_York’). If there is no English-language equivalent to a name, it is glossed ‘name’.
2.3. The underscore is used in the gloss tier to connect more than one lexical item in a gloss, where the English translation of a single Spanish word involves more than one word. For example, ‘veinticinco’ is glossed as ‘twenty_five’.
2.4. The English lexeme in a gloss is followed by information about parts of speech, separated by dots.
Some examples:
Spanish ‘hijos’ is glossed ‘son.N.M.PL’, which means ‘plural of the masculine noun “hijo”’;
Spanish ‘me’ is glossed ‘me.PRON.OBL.MF.1S’, meaning ‘oblique pronoun, 1st person singular, masculine or feminine’;
English ‘wouldn't’ is glossed ‘be.V.1S.COND+NEG’, meaning ‘the first person singular conditional tense of the verb “be”, with an attached negative marker’.
3. Parts of speech abbreviations
Abbreviation  Representing
0             impersonal
123S          1st, 2nd, 3rd person singular
13S           1st, 3rd person singular
1P            1st person plural
1S            1st person singular
23P           2nd, 3rd person plural
23S           2nd, 3rd person singular
23SP          2nd, 3rd person singular or plural
2P            2nd person plural
2S            2nd person singular
2SP           2nd person singular or plural
3P            3rd person plural
3S            3rd person singular
3SP           3rd person singular or plural
ADJ           adjective
ADV           adverb
AM            aspirate mutation
ASV           adjective, singular noun, or verb
AUG           augmentative
COMP          comparative
COND          conditional
CONJ          conjunction
DEF           definite
DEM           demonstrative
DET           determiner
DIM           diminutive
E             exclamation
EMPH          emphatic
F             feminine
FAR           far (demonstrative)
FOCUS         item with focus
FUT           future
GER           gerund
H             pre-vocalic h after 3S.F, 1P and 3P possessives
HYP           hypothetical
IM            interactional marker
IMPER         imperative
IMPERF        imperfect
INDEF         indefinite
INFIN         infinitive
INT           interrogative
INTENS        intensive
M             masculine
MF            masculine or feminine
N             noun
NEAR          near (demonstrative)
NEG           negative
NM            nasal mutation
NT            neuter
NUM           numeral
OBJ           object
OBL           oblique
ORD           ordinal
PAST          past
PASTPART      past participle
PL            plural
PLUPERF       pluperfect
POSS          possessive
PRECLITIC     accented form before clitics
PREP          preposition
PREQ          pre-qualifier
PRES          present
PRESPART      present participle
PRON          pronoun
PRT           particle
QUAN          quantifier
REFL          reflexive
REL           relative
SG            singular
SM            soft mutation
SP            singular or plural
SUB           subject
SUBJ          subjunctive
SUP           superlative
SV            singular noun or verb
TAG           tag question
V             verb
REFERENCES
Diccionario de Americanismos. Asociación de Academias de la Lengua Española (2010).
Diccionario de la Lengua Española. Real Academia Española. (www.rae.es)
Deuchar, M., Davies, P., Herring, J.R., Parafita, M.C., and Carter, D. (in press) Building bilingual corpora: Welsh-English, Spanish-English and Spanish-Welsh. In I. Mennen and E. Thomas (eds) Unravelling Bilingualism. Multilingual Matters.
Donnelly, K. and Deuchar, M. (2011) Using constraint grammar in the Bangor Autoglosser to disambiguate multilingual spoken text. In: Constraint Grammar Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia. Tartu: NEALT Proceedings Series. (http://dspace.utlib.ee/dspace/handle/10062/19298)
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
APPENDIX
File summary:
File name   Length (h:mm:ss)  No. of main participants  Age (years)  Sex
HERRING1    0:32:18           2                         24, 27       F, F
HERRING2    0:30:42           2                         21, 19       M, M
HERRING3    0:31:37           2                         37, 41       F, M
HERRING5    0:27:10           2                         41, 40       F, M
HERRING6    0:28:14           2                         43, ?        F, M
HERRING7    0:24:53           2                         22, ?        M, M
HERRING8    0:29:43           2                         39, 42       F, M
HERRING9    0:32:39           2                         21, 20       F, M
HERRING10   0:33:52           2                         33, 34       F, F
HERRING11   0:31:00           2                         64, 63       M, F
HERRING12   0:33:06           2                         22, 20       M, M
HERRING13   0:29:53           2                         ?, 32        F, F
HERRING14   0:30:04           2                         20, 23       M, F
HERRING15   0:29:53           2                         ?, 21        M, M
HERRING16   0:30:51           2                         24, 30       M, M
HERRING17   0:29:58           2                         ?, 25        M, F
SASTRE1     0:33:52           2                         57, 44       M, F
SASTRE2     0:41:00           2                         78, 55       F, M
SASTRE3     0:43:02           3                         37, 43, 52   M, M, F
SASTRE4     0:31:26           2                         29, 22       F, F
SASTRE5     0:29:03           2                         36, 66       F, F
SASTRE6     0:30:20           2                         43, 42       M, F
SASTRE7     0:29:58           2                         19, 15       F, F
SASTRE8     0:33:20           2                         63, 13       F, F
SASTRE9     0:40:02           2                         48, 60       F, F
SASTRE10    0:39:40           2                         35, 35       F, F
SASTRE11    0:40:25           2                         30, 60       M, F
SASTRE12    0:30:59           2                         48, 41       F, F
SASTRE13    0:29:43           2                         25, 19       M, F
ZELEDON1    0:29:38           2                         26, 21       F, F
ZELEDON2    0:26:53           2                         22, 21       M, F
ZELEDON3    0:30:25           2                         19, 11       F, M
ZELEDON4    0:21:48           2                         40, ?        M, M
ZELEDON5    0:23:41           2                         35, 37       F, F
ZELEDON6    0:30:25           2                         21, 19       F, F
ZELEDON7    0:30:20           2                         19, 21       F, M
ZELEDON8    0:37:53           2                         42, 45       F, F
ZELEDON9    0:30:51           2                         12, 09       F, F
ZELEDON11   0:30:40           2                         21, 25       M, M
ZELEDON13   0:34:42           2                         18, 19       F, F
ZELEDON14   0:33:01           2                         22, 19       F, F
MAR1        0:15:02           1                         45           F
MAR2        0:01:42           1                         45           F
MAR4        0:17:22           1                         45           F
MAR7        0:04:34           1                         45           F
MAR10       0:17:32           1                         45           F
MAR16       2:41:36           1                         45           F
MAR18       1:38:40           1                         45           F
MAR19       0:53:58           1                         45           F
MAR20       0:31:50           1                         45           F
MAR21       0:05:29           1                         45           F
MAR24       0:41:40           1                         45           F
MAR27       1:22:55           1                         45           F
MAR30       0:59:58           1                         45           F
MAR31       1:45:40           1                         45           F
MAR40       2:25:47           1                         45           F
Total       35:11:04          84

6. Welsh Transcription Conventions
This section documents the transcription conventions specific to the Bangor Siarad, Pilot, and Bangor 2 corpora. The three sections are: main tier, gloss tier, and tags.
A MAIN TIER
1. Layout of transcription
1.1. Since we are primarily interested in clauses, the data is divided into clauses as far as possible. Where an utterance contains two main clauses, each clause in that utterance is written on a separate main tier. Complex clauses are treated as one clause and therefore subordinate clauses are included in the same tier as their main clauses. Adverbial clauses are also written on the same main tier as their related main clause.
1.2. Each main tier is divided into units which we call, for the purposes of these conventions, ‘words’. With some exceptions (see C.1.3) a word is considered to be a continuous sequence of characters containing no spaces, as found in Geiriadur Prifysgol Cymru (Thomas 1950-2004), Geiriadur yr Academi (Griffiths & Jones 1995), Cysgeir (2004) or the Oxford English Dictionary online (2008). These are referred to as GPC, GyrA, Cysgeir and OED respectively throughout this document. Where items are treated as hyphenated by these reference dictionaries, they are connected by underscore in the transcripts. When one of the reference dictionaries offers more than one alternative (e.g.
‘minibus’, ‘mini-bus’ or ‘mini bus’), or when the reference dictionaries differ from each other, the most compact alternative is chosen (‘minibus’ in this case).
1.3. Other items which are treated as words are:
i. interjections and interactional markers, e.g. ah, er, um etc.
ii. proper names (including names of books, films, organisations etc.), a sequence of words being connected by underscores, e.g. Elton_John, Hong_Kong, One_Flew_Over_the_Cuckoo’s_Nest
iii. abbreviations (connected by underscore), e.g. N_S_P_C_C
iv. numbers between eleven and ninety-nine in Welsh and between twenty-one and ninety-nine in English, e.g. pedwar_deg_pump, forty_five. Note that other numbers such as those containing ‘hundred’, ‘thousand’ etc. are transcribed as separate words, e.g. one hundred and seventy_three, cant saith_deg_tri
v. some prepositions and adverbs, usually represented as two words, whose individual parts are meaningless or difficult to translate in isolation, e.g. oddi_wrth. See a full list below in C.3.4 vii.
1.4. Contractions that do not have entries in one of the Welsh-language reference dictionaries (namely GPC, GyrA or Cysgeir) or in King (2003) are transcribed in full, but the unpronounced parts are bracketed. For example, the pronunciation of ‘fel yna’ (like that) as [vɛla] in speech is represented in the transcripts as ‘fel (yn)a’.
1.5. There are some continuous sequences of characters in the main tier which are not treated as words. These include simple events such as ‘&=laugh’ (see CHAT 7.6.1), ‘xx’ or ‘xxx’ for unintelligible sounds, or the use of an ampersand plus phonetic characters for intelligible sounds without clear meaning (see CHAT 6.4 for both).
1.6. Note that we do not follow the guidelines for collocations, compounds and linkages given in the CHAT manual (see CHAT 6.6.2). We consider collocations and compounds to be single items only if they are considered words according to the definition given in C.1.2 and C.1.3.
For example, CHAT gives the option of writing ‘peanut butter’ or ‘peanut+butter’, and ‘Star Wars’ or ‘Star+Wars’. According to our conventions, the first phrase should be transcribed as two separate words (‘peanut butter’) as this is how the phrase appears in the OED. The second case should be transcribed as ‘Star_Wars’ as it is a film title. Note that we do not use the plus sign within words under any circumstances.
2. Language marking
2.1. Each word in the main tier is assigned a language marker, which consists of @s: plus one or two other letters which denote its language: @s:w = Welsh, @s:e = English, @s:u = undetermined, @s:ew = word with first morpheme(s) English, second morpheme(s) Welsh, @s:we = word with first morpheme(s) Welsh, second morpheme(s) English. Other languages have been coded as they have arisen and have been included in the depfile, e.g. @s:f = French.
2.2. A word or morpheme is considered to be Welsh if it can be found in any of the Welsh-language reference dictionaries or in King (2003).
2.3. Words which contain two or more morphemes from different languages are marked as mixed-language words, e.g. ‘concentrate_io@s:ew’ (to concentrate). However, where a word containing at least one English morpheme and at least one Welsh morpheme is included in one or more of the Welsh-language reference dictionaries, it is marked as a Welsh word. For example, the English word ‘use’ forms the basis of the Welsh word ‘iwsio’ (to use) but we mark the entire word as Welsh (‘iwsio@s:w’) because it is included in one of the Welsh-language reference dictionaries.
2.4. The language marker @s:u marks words that occur in the lexicon of both languages (as determined by the Welsh-language reference dictionaries for Welsh or by the OED for English) and that are pronounced in a way that is possible both in Welsh and in English, e.g. [ˈʌŋkl] / [ˈəŋkl] (‘uncle’ in English or ‘yncl’ in Welsh) or [mat] (‘mat’ in both languages).
@s:u also marks a specified list of interjections and interactional markers, e.g. ah, ahhah, aw, er, ey, hmm, ho, mmm, mmhm, oh, ooh, ow, ugh, um. Other interjections and interactional markers are assigned language markers according to their inclusion or not in the reference dictionaries. For example, ‘ych’ (a marker of disgust equivalent to ‘yuk’ in English) is marked @s:w as it is only found in the Welsh-language reference dictionaries.
2.5. Where a lexeme could belong to both languages, but its pronunciation in a specific occurrence belongs unambiguously to one language only, it will be marked @s:w or @s:e (and written in the respective orthography) according to the pronunciation, e.g. ‘problem@s:w’ for the specifically (north-west) Welsh pronunciation of the second vowel as [a], but ‘problem@s:u’ where the second vowel is [ə] or [ɛ], which is possible in both English and Welsh; ‘toast@s:e’ where the word is pronounced with [əʊ] / [oʊ] as in English only, and ‘tost@s:w’ where it is pronounced with [ɒ] as in southern Welsh, but ‘toast@s:u’ where the word is pronounced with [o:] as in northern Welsh or some varieties of Welsh English.
2.6. Proper names and titles are marked as undetermined unless there are alternatives in each language in general use, e.g. Elton_John@s:u, One_Flew_Over_the_Cuckoo’s_Nest@s:u, Hong_Kong@s:u, Tebot_Piws@s:u (a Welsh-language pop group, literally meaning ‘purple teapot’) but Cardiff@s:e, Caerdydd@s:w (the Welsh word for ‘Cardiff’).
2.7. According to GPC, the -s plural ending is an established loan in the Welsh lexicon. Any plural formed with the -s ending is assigned the language marker of the previous morpheme. For example, ‘pregethwrs@s:w’ from ‘pregethwr@s:w’ (preacher), ‘dolphins@s:u’ from ‘dolphin@s:u’ and ‘dogs@s:e’ from ‘dog@s:e’.
2.8. In multi-word phrases, each word is tagged separately, regardless of the phrase’s internal syntax.
For example, in ‘traffic@s:u lights@s:e’ ‘traffic’ is coded as undetermined, although the syntax of the whole phrase comes from English.

3. Orthography
3.1. Words marked as English are transcribed in standard English orthography, including contractions, such as ‘isn’t’. Some non-standard spellings for colloquial forms such as ‘gonna’ are used.
3.2. Words whose language is undetermined are transcribed in English rather than in Welsh orthography, e.g. ‘acid@s:u’ rather than ‘asid@s:u’. This is in order to make the corpus more accessible to non-Welsh-speakers who might use the data.
3.3. When words marked as English or undetermined are mutated (where the sound of an initial consonant is changed depending on the grammatical context, see for example King 1993:14-20), the initial (mutated) sound is written in Welsh orthography and the rest in English, e.g. ei@s:w firthday@s:e = his birthday; ei@s:w goat@s:u = his coat. In the case of words that begin with ‘qu’ in English but that are mutated in the data, the mutated sound and the following [w] are written in Welsh orthography, e.g. question (unmutated), gwestion (soft mutation), chwestion (aspirate mutation), nghwestion (nasal mutation).
3.4. Words marked as Welsh are transcribed in Welsh orthography. We have not represented regional variation in the transcripts, except in cases which have orthographic representation in the Welsh-language reference dictionaries or in King (2003). There are some cases where we differ from the standard orthography:
i. We transcribe some non-standard verb-noun suffixes, e.g. ‘-ian’ in ‘swnian’ (to grumble) rather than ‘-io’ in the standard form ‘swnio’.
ii. We represent non-standard usage of inflected prepositions. Agreement markers for person and number show considerable variation in the spoken language. Thus one may, for example, find several forms for ‘to you’ (plural/respect form), such as ‘wrthoch chi’ (the variant found in King 2003), ‘wrthych (chi)’ (more formal variant, e.g. prescribed in Thomas 1996) as well as ‘wrthach chi’ (more colloquial, northern variant). The orthography used in transcripts is based on pronunciation (note that the Welsh orthographic system has a fairly regular relationship between sound and spelling, so representing sound variation is possible in Welsh in a way that it is not in English).
iii. Northern second person singular verb and preposition endings not usually represented in writing are transcribed as ‘-a’ where they are followed by the pronoun ‘chdi’, e.g. oedda chdi (you were), arna chdi (on you). Where they occur in isolation, they are transcribed as ‘-achd’, e.g. oeddachd (you were/weren’t you), arnachd (on you).
iv. We do not represent morpheme-final [v] when it is not pronounced. For example, [pɛntrɛ] (village) is written ‘pentre’ in the transcripts rather than ‘pentref’ (as the word is represented in the Welsh-language reference dictionaries).
v. Morpheme-initial /r/ is transcribed as ‘r’ even when the standard orthography prescribes ‘rh’ (pronounced [r̥]) as [r̥] is often absent from speakers’ phonological systems. ‘rh’ is only transcribed where [r̥] is clearly discernible.
vi. We have represented mutation (sound change to initial consonants) or its absence without following prescriptive rules. Thus ‘in Cardiff’ may be transcribed ‘yn Caerdydd’ or ‘yn Gaerdydd’ as well as the standard form ‘yng Nghaerdydd’, according to what is heard. We have also transcribed the aspirate mutation of /m/ and /n/ after the 3rd singular feminine possessive adjective common in northern varieties, e.g. ‘ei mham’ (her mother), rather than standard ‘ei mam’.
vii. We list below the phrases described in C.1.3 v which we transcribe using underscore to link the individual words.
Our transcription | Standard | English translation
ar_draws | ar draws | across
ar_goll | ar goll | lost
ar_gyfer | ar gyfer | for
ar_ôl | ar ôl | after
cyn_belled | cyn belled | as far
dim_byd | dim byd | nothing
ein_gilydd, eich_gilydd, ei_gilydd | ein gilydd, eich gilydd, ei gilydd | each other (1st, 2nd and 3rd person plural)
er_mwyn | er mwyn | for
ers_talwm | ers talwm | in the past, long ago
i_fewn | i fewn | in(to)
i_ffwrdd | i ffwrdd | away
i_fyny | i fyny | up
i_gyd | i gyd | all
i_lawr | i lawr | down
i_mewn | i mewn | in(to)
naill_ai | naill ai | either
o_gwbl | o gwbl | at all
o_gwmpas | o gwmpas | around
oddi_ar | oddi ar | off
oddi_wrth | oddi wrth | from
oni_bai | oni bai | unless
pob_dim | pob dim | everything
ta_waeth | ’ta waeth | anyway
un_ai | un ai | either
wrth_gwrs | wrth gwrs | of course
yn_erbyn | yn erbyn | against
yn_ôl | yn ôl | back
yn_ystod | yn ystod | during

viii. We list below some colloquial forms which are not represented in the Welsh-language reference dictionaries but which we have transcribed as indicated:
Our spelling | Standard | Meaning | Comments
(r)hein, (r)hain, (r)heiny etc. | rhein, rhain, rheiny etc. | these, those etc. | pronounced with initial [h]
byswn i, bysa chdi etc. | baswn i, baset ti etc. | I would, you would etc. | very common in northern varieties
cynna fi, cynna chdi etc. | | before me, before you etc. | preposition inflected in northern varieties
dylen i, bydden i etc. | dylwn i, byddwn i etc. | I should, I would etc. | common in southern varieties
gosa | | unless | heard in northwestern varieties
m | ’m | my | usually connected by apostrophe to a preceding vowel
mag | mae | | 3rd singular present form of ‘bod’ (to be) heard in southwestern varieties
molchi | ymolchi | wash oneself | mutates to ‘folchi’
mynedd | amynedd | patience | mutates to ‘fynedd’
na i etc. | a i etc. | I will go etc. | heard in the Caernarfon area
nunman | unman | nowhere | widespread
oedd nhw, wneith nhw etc. | oedden nhw, wnan nhw etc. | they were, they will etc. | 3rd person singular verb forms used with plural pronouns
penwsnos | penwythnos | weekend | GPC has an entry for ‘wsnos’
tes i (ddi)m | es i’m | I didn’t go | some northern varieties
w | ’w | his/her/their | usually appears with apostrophe, e.g. ‘i’w’, but we transcribe as ‘i w’ (to his/her/its)
wannwyl | | dear Lord | a contraction of ‘Duw annwyl’
whi | hwyaid | ducks | heard in some southern varieties
y fi | rwy i | I am | southern Welsh

B GLOSS TIER
1. Principles
1.1. Each word (see C.1.2 and C.1.3) in the main tier is given a gloss in the gloss tier (%gls). Non-words (see C.1.5) are not glossed, with the exception of ‘xx’ and ‘xxx’, which are represented by the same characters in the gloss.
1.2. With the exception of proper names (see below), all words are glossed with the closest English-language equivalent (in lower case). In Welsh or mixed-language words, certain morphological information is included in the gloss (in upper case, see D.2.1). For example:
wasn’t@s:e : wasn’t
soup@s:u : soup
hefyd@s:w : also
recharge_io@s:ew : recharge.NONFIN
Some words marked as Welsh are glossed only with morphological information, such as ‘POSS.2S’ for the 2nd singular possessive adjective ‘dy’. Proper names (including names of books, films, organisations etc.) marked as English or undetermined are glossed as they appear in the main tier. For example, ‘Hong_Kong@s:u’ is glossed as ‘Hong_Kong’, ‘Cardiff@s:e’ is glossed as ‘Cardiff’ and ‘Tebot_Piws@s:u’ is glossed as ‘Tebot_Piws’. However, proper names marked as Welsh are glossed with their English-language equivalents. For example, ‘Caerdydd@s:w’ is glossed as ‘Cardiff’.
1.3. Lexical information always precedes morphological information in the gloss. A full stop ‘.’ is used to separate morphological information from lexical information (e.g. go.NONFIN) and also to separate morphological information (e.g. PRON.3S).
1.4. The underscore is used on the gloss tier to connect more than one lexical item in a gloss, where the English translation of a single Welsh word involves more than one word.
For example, ‘neithiwr’ is glossed as ‘last_night’.

2. Specific glosses
2.1. The following glosses are used for morphological information:
Gloss | Use
1,2,3 | 1st, 2nd, 3rd person
CONDIT | conditional/habitual past
DET | determiner
F | feminine
FUT | future/habitual present (verb ‘bod’ (to be) only)
IM | interactional marker/exclamation, e.g. ‘um’, ‘oh’
IMP | imperfect (verb ‘bod’ (to be) only)
IMPER | imperative
IMPERSONAL | impersonal
INT | interrogative
M | masculine
NEG | negative
NONFIN | nonfinite
NONPAST | nonpast tense (used for present/habitual/future)
PL | plural
PAST | past tense
POSS | possessive
POSSD | possessed
PRES | present tense (verb ‘bod’ (to be) only)
PRON | pronoun
PRT | particle
REL | relative
S | singular
SUBJ | subjunctive
2.2. Gender-specific adjectives in Welsh are not marked for gender in the gloss. For example, ‘gwyn’ (used to modify masculine nouns) and ‘wen’ (used to modify feminine nouns) are both glossed as ‘white’.
2.3. Numerals are glossed for gender where appropriate. For example, ‘dau’ and ‘dwy’ are glossed as ‘two.M’ and ‘two.F’ respectively.
2.4. Welsh collective nouns are glossed by the English plural. For example, ‘moch’ (singular collective noun indicating ‘pigs’) will have the gloss ‘pigs’.
2.5. In third person singular possessive constructions, the gender of the possessor is marked only where there is positive evidence of that gender (i.e. either when the possessed noun is mutated, or when a gender-specific pronoun follows the possessed noun, specifically referring to the possessor). The gender is marked on the possessive adjective. For example:
‘her mother’
ei mam : POSS.3S mother
ei mham : POSS.3SF mother
ei mam hi : POSS.3SF mother PRON.3SF
‘his mother’
ei fam : POSS.3SM mother
ei fam e : POSS.3SM mother PRON.3SM
ei mam e : POSS.3SM mother PRON.3SM
The above applies also to possessive constructions involving non-finite verbs preceded by ‘ei’.
For example:
‘he was born’ gaeth (e) ei eni : get.3S.PAST (PRON.3SM) POSS.3SM bear.NONFIN
‘he/she was shot’ gaeth ei saethu : get.3S.PAST POSS.3S shoot.NONFIN
2.6. When a possessive construction in the first person singular is marked only by mutation of the noun, the possessed noun, in the gloss, is followed by ‘.POSSD.1S’. For example, ‘nhad’ (my father) is glossed as ‘father.POSSD.1S’. Note that this gloss is used only if there is no possessive adjective preceding or pronoun following the possessed noun (‘fy nhad’ or ‘nhad i’ are glossed ‘POSS.1S father’ and ‘father PRON.1S’ respectively, and ‘fy nhad i’ is glossed ‘POSS.1S father PRON.1S’).

C TAGS
1. There are certain phrases used in Welsh, usually at the end of an utterance, but also possible mid-utterance, which are used discursively to engage with the listener, to gauge whether he/she agrees, understands etc. (although the listener is seldom required to reply). We term these ‘tags’. Tags can be agreeing (i.e. they include a verb form that agrees in person, number and tense with the finite verb in the main clause) or they can be non-agreeing. Both kinds are particularly problematic in transcription, as they are seldom seen in the written language and therefore there are no fixed conventions for their spelling. They are also often highly contracted in speech and can be problematic for glossing.
2. The following is an incomplete list of agreeing tags that may occur, which serves as a pattern for other agreeing tags (with different verbs, tenses and persons). The table gives the tag as it is represented by us in the main tier, and its gloss.
Main tier | Gloss
byddaf | be.1S.FUT
na fyddaf | NEG be.1S.FUT
yn_byddaf | be.1S.FUT.NEG
medri | can.2S.NONPAST
na fedri | NEG can.2S.NONPAST
yn_medri | can.2S.NONPAST.NEG
dylai | should.3S.CONDIT
na ddylai | NEG should.3S.CONDIT
yn_dylai | should.3S.CONDIT.NEG
ydy, yndy | be.3S.PRES
nag ydy, nac (y)dy, na(g) (y)dy etc. | NEG be.3S.PRES
yn_dydy, yn_tydy, dydy, tydy | be.3S.PRES.NEG
oes e | be.3S.PRES there
nag oes e | NEG be.3S.PRES there
yn_does e, does e | be.3S.PRES.NEG there
3. Here is also a list of common non-agreeing tags with their spellings and their glosses. Note that not all occurrences of these words or phrases in the transcripts are tags.
Main tier | Gloss
felly, (fe)lly | thus
wsti, sti | know.2S
wchi, (w)chi | know.2PL
yli, (y)li | see.2S.IMPER
ylwch, (y)lwch | see.2PL.IMPER
yn_de, de | TAG
yn_do, do | yes
yn_dyfe, dyfe | PRT.INT.NEG
chimod | know.2PL
chwel | see.2PL
deud | say.2S.IMPER
deuda | say.2S.IMPER
deudwch, (deu)dwch | say.2PL.IMPER
dywedwch | say.2PL.IMPER
dofe | yes
dywed, dywad, dŵad | say.2S.IMPER
fel | like
gwed | say.2S.IMPER
iawn | right
na | no
naci | no
naddo | no
nag yfe | NEG PRT.INT
nage | no
ti gweld, ti weld | PRON.2S see.NONFIN
ti (y)n gweld | PRON.2S PRT see.NONFIN
timod | know.2S
twel | see.2S
ie, ia | yes
yfe | PRT.INT

7. BlumSnow (Hebrew-English)
Shoshana Blum-Kulka
Department of Communications
Hebrew University
91905 Jerusalem, Israel
Catherine Snow
Harvard Graduate School of Education
Larsen Hall, Appian Way
Cambridge, MA 02138 USA
catherine_snow@harvard.edu
This corpus includes data from the Family Discourse Project, carried out in two stages, from 1985 to 1988 and from 1989 to 1992. The research was funded by two grants from the Israeli-American Binational Science Foundation, grant No. 82-3422 to Shoshana Blum-Kulka, David Gordon, Susan Ervin-Tripp, and Catherine Snow as consultant, and grant 87-00167/1 to Shoshana Blum-Kulka and Catherine Snow. Three groups of families were involved in the project: native-born Israeli families from Jerusalem, American-born Israeli families living in Israel, and American-born Jewish families living in Boston. Stage one included 34 families and stage two included 24 families.
A monograph by Blum-Kulka (1997) is devoted to the analysis of these data. The book demonstrates the ways talk at dinner constructs, reflects, and invokes familial, social and cultural identities and provides social support for children to become members of their parents’ culture. The groups studied are shown to differ in the ways they negotiate issues of power, independence and involvement through speech activities such as the choice and initiation of topics, conversational story-telling, naming practices, metapragmatic discourse, politeness, language choice, and code-switching.
The transcripts in the CHILDES database include two types of files from stage two. The first type includes transcripts of one dinner table conversation per family from eight native Israeli and eight American Israeli families. The families were taped in their homes in Jerusalem (Blum-Kulka). The second type includes transcripts of one dinner table conversation per family from eight Jewish American families. The families were taped in their homes in Boston (Blum-Kulka and Snow). Families are identified by group and number, and participants are identified by role for adults and by name for children. The names of the children in the corpus are pseudonyms.

Family Backgrounds
The families in the project were middle-class and upper-middle-class, white-collar professional, nonobservant Jewish families from a European background from Israel and the United States. All parents were at least college educated and were occupied professionally outside the home. Most parents were at the time of data collection in their late 30s or early 40s (mean age 41, range 34 to 54). Families had two, three or four children; the ages of children ranged from 3;1 to 17;2. By design, most children are of school age (6;1 to 13;5). Further information about the ages of the children is given below. A participant observer taped three family dinners over a period of 2 to 3 months.
Recording started when the family began to gather around the table and stopped when they left the table. Meals lasted on average from 1 to 1.5 hours. One meal per family was transcribed in CHAT.

Group 1: Native Israeli Families
The parents in this group are all Israeli born. The language spoken at dinner is Hebrew.
Table 1: Native Israeli Children
Family # | Children’s Age and Sex
1 | 12;0 m, 10;5 m
2 | 13;2 m, 11;4 m, 5;2 f
4 | 16;1 m, 12;2 f, 8;6 m
5 | 13;1 m, 10;8 m, 4;0 f
6 | 6;2 f, 6;2 f
8 | 10;5 f, 8;7 m
9 | 8;8 m, 5;6 m
10 | 11;5 f, 8;3 f, 3;2 m

Group 2: American Israeli Families
The adults in the American-Israeli families were born in the United States and had lived in Israel for more than 9 years at the time of the study. Twenty-five of the children were born in Israel and four in the United States. All members of the family are competent bilinguals. Both English and Hebrew are used; the rate of English varies by family from 30% to 96%.
Table 2: American Israeli Children
Family # | Children’s Age and Sex
1 | 11;4 m, 7;2 f
2 | 8;0 m, 6;1 m
3 | 9;0 m, 6;3 m
4 | 17;2 m, 13;4 f, 9;4 f, 7;5 f
6 | 15;10 m, 13;11 f, 5;5 f
7 | 13;11 f, 12;4 f, 9;0 f
8 | 12;9 f, 9;5 m, 5;8 m
12 | 12;2 m, 8;4 f

Group 3: Jewish-American Families
This set includes dinner conversations in English from eight middle-class Jewish American families from Boston. The families were taped in their homes.
Table 3: Jewish-American Children
Family # | Children’s Age and Sex
1 | 15;5 f, 13;5 f
2 | 8;5 m, 6;1 m, 4;4 m
3 | 10;0 m, 5;11 m
4 | 7;5 m, 4;3 m
9 | 9;5 m, 7;3 f
10 | 10;4 m, 8;2 f, 3;1 m
11 | 11;7 m, 9;6 f
12 | 13;4 f, 10;1 f, 4;1 m

The coding schemes developed for the analysis of family discourse include:
1. The Topical Actions Code (analyzes conversational topical actions such as the introduction, change, and shift of topics);
2. The Request Code (analyzes the speech act of directives);
3. The Narrative-Event Code (analyzes narrative segments from both the interactive and structural perspectives);
4.
The Metapragmatic Comments Code (analyzes metapragmatic comments made with regard to turn-taking, conversational norms, and language).
Publications using these data should cite:
Blum-Kulka, S. (1997). Dinner-talk: Cultural patterns of sociability and socialization in family discourse. Mahwah, NJ: Lawrence Erlbaum Associates.
Additional relevant references include:
Blum-Kulka, S. (1990). “You don’t touch lettuce with your fingers”: Parental politeness in family discourse. Journal of Pragmatics, 14, 259–289.
Blum-Kulka, S. (1993). “You gotta know how to tell a story”: Telling, tales and tellers in American and Israeli narrative events at dinner. Language in Society, 22, 361–402.
Blum-Kulka, S. (1994). The dynamics of family dinner-talk: Cultural contexts for children’s passages to adult discourse. Research on Language and Social Interaction, 27, 1–51.
Blum-Kulka, S. (1996). Cultural patterns in dinner talk. In W. Senn (Ed.), SPELL, Swiss Papers in English Language and Literature. Vol. 9: Families (pp. 77–107). Tübingen, Germany: Gunter Narr.
Blum-Kulka, S., & Katriel, T. (1991). Nicknaming practices in families: A cross-cultural perspective. In S. Ting-Toomey & F. Korzenny (Eds.), Cross Cultural Interpersonal Communication: International and Intercultural Communication Annual Vol. 15, 58–77. London: Sage Publications.
Blum-Kulka, S., & Snow, C. (1992). Developing autonomy for tellers, tales and telling in family narrative-events. Journal of Narrative and Life History, 2, 187–217.
Olshtain, E., & Blum-Kulka, S. (1989). Happy Hebrish: Mixing and switching in American-Israeli family interaction. In S. Gass, C. Madden, D. Preston, & L. Selinker (Eds.), Variation in Second Language Acquisition Volume 1: Discourse and Pragmatics (pp. 59–84). Philadelphia: Multilingual Matters.

8.
Eppler (German-English)
Eva Eppler
School of English and Modern Languages
University of Surrey Roehampton
Roehampton Lane, London SW15 5PH, UK
e.eppler@roehampton.ac.uk
The data were collected from a community of Austrian Jewish refugees from Nazi-occupied Austria (approx. 30,000 Austrians fled to the UK) who settled in Northwest London in the late 1930s. We are therefore dealing with a community in which German and English have been in close contact for over sixty years. The L1 of the informants is close to Standard German, although occasionally interspersed with Yiddish lexical items and phonetically influenced by the Viennese variety. A peculiarity of the linguistic profile of this community is that they do NOT speak Yiddish. The age of onset of L2 (English) was during the late teens and early twenties for most speakers. At the time the audio-recordings were made (1993) all informants were in their late sixties or early seventies.
Patterns of language use in this bilingual community changed over the last half century: up to the 1970s mainly English was used in both public and private domains. Once the second generation had left the parents’ household, and especially after retirement, both languages started being used in the private domain. A close-knit network between a subset of the community facilitated the development of a bilingual mode of interaction, sometimes called 'Emigranto'. This mode of interaction is only used in ingroup situations, is regarded as the 'we-code' (Gumperz 1982) and has covert prestige. Linguistically it is characterised by intra-sentential code-switching, and frequent switching at speaker turn boundaries.
Biographical (age, gender, schooling, social class of informants etc.) and situational information, where available, is provided under the relevant headers in the .cha files.
Pseudonyms are used for all participants. The goal of the project was to provide a linguistic profile of the Jewish refugee community in London and to study patterns of code-mixing.

Sampling and Data Collection
A random sample of 70 members of the target community was selected from a list of clients of an Austrian solicitor specializing in pension claims for refugees. 27 of them were audio-recorded for approx. 90 minutes in one-to-one or one-to-two sociolinguistic interviews/oral history collections. To this body of subjects other informants were added by referral (snowball sampling). All audio-recordings were collected in the informants’ homes. Informants were encouraged to choose as a language of interaction the one they normally use in their home. An additional 400 minutes of group recordings with three informants and the researcher were collected through participant observation during informal gatherings. Another 540 minutes of audio-data collected in the Day-Centre of a Refugee Organisation are almost impossible to transcribe due to the low quality of the recordings and the amount of overlap.

Data Transcription
Full transcripts were made of sound files using the CHAT/LIDES transcription systems. LIDES (Language Interaction Data Exchange System) is based on CHAT but was extended to deal with code-mixed data. For this purpose language tags (@2 English and @4 German) are added to each word/morpheme to indicate its language. In cases where it was impossible to determine the language in which words were being produced, @u was attached, e.g. in@u preceding English or German place-names. Morphologically mixed words only display the full language tag on the suffix as CHECK does not pass sequences like e.g. ge@4#bother@2-t@. The comma was used to indicate syntactic juncture, as one of the research aims is the study of co- and subordination. The CHAT symbol for tag questions was also used to delimit discourse markers (Schiffrin 1987).
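The language tags described above (@2, @4 and @u) lend themselves to simple automatic counting. The following Python sketch is an illustration only and is not part of the CHAT, CHECK or LIDES tooling: the names LIDES_TAGS, split_tag and language_counts are hypothetical, the sample utterance is invented, and the sketch assumes plain whitespace-separated tokens that each carry at most one final tag. Morphologically mixed words such as the one cited above are deliberately left unparsed.

```python
import re

# Tag-to-language mapping taken from the description above.
LIDES_TAGS = {"2": "English", "4": "German", "u": "undetermined"}

# Assumption: a tagged token is a word form, then "@", then a one-character tag.
TOKEN_RE = re.compile(r"^(?P<form>[^@\s]+)@(?P<tag>[24u])$")


def split_tag(token):
    """Return (form, language); language is None if the token is not simply tagged."""
    match = TOKEN_RE.match(token)
    if match is None:
        return token, None
    return match.group("form"), LIDES_TAGS[match.group("tag")]


def language_counts(utterance):
    """Count simply tagged tokens per language in a whitespace-separated utterance."""
    counts = {}
    for token in utterance.split():
        _, language = split_tag(token)
        if language is not None:
            counts[language] = counts.get(language, 0) + 1
    return counts


if __name__ == "__main__":
    # Invented example utterance, not taken from the corpus.
    print(language_counts("ich@4 bin@4 happy@2 in@u London@u"))
```

Counts of this kind give only a rough per-utterance language profile; any serious analysis should work from the full CHAT files and their headers.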
Due to the nature of some of the data (group recordings), overlaps are only indicated when the beginning and end point of the overlap were clearly recognisable. Eva Eppler and Maggie Brueckner of the Language Centre of the University of Rostock, Germany, both transcribed and checked each transcript. Project-specific codes are not included in the files on the web.
1. Ibron.cha: 46 minutes of the first meeting between the researcher and the central informant DOR; 36 minutes with DOR, her daughter (2nd generation) and her grandson (3rd generation).
2. Jen1 - 3.cha: group recordings of DOR and three of her friends from the same generation (TRU, MEL and LIL) and the researcher.
3. Alfred.cha: a one-to-two sociolinguistic interview/oral history with a male and a female informant; Alfred1.mp3 corresponds to side A, alfred2.mp3 to side B.
4. Hogan.cha: a one-to-two sociolinguistic interview/oral history with a married couple; Hogan1.mp3 corresponds to side A, hogan2.mp3 to side B of the original tape recordings.
The collection and transcription of the data were funded by various research grants from the University of Vienna and the University of Surrey Roehampton. The Austrian Ministry of Science funded this research. Many thanks for the technical support from the media team at Roehampton, to LIPPS and to TalkBank.
Publications that use these data should cite:
Eppler, Eva. 1999. ‘Word order in German-English mixed discourse’, UCL Working Papers in Linguistics 11, 285-309.
Eva Duran Eppler. 2010. Emigranto. The syntax of a German/English mixed code. Vienna: Braumueller. ISBN 978-3-7003-1739-5

9.
Gardner-Chloros (Greek-English)
Dr. P. H. Gardner-Chloros
Department of Applied Linguistics
Birkbeck College, 43 Gordon Square, London WC1H 0PD
email: p.gardner-chloros@bbk.ac.uk
This research was designed to identify linguistic and sociolinguistic developments in the London Greek Cypriot community, differentiating between the patterns found in different generations; to relate the linguistic patterns to sociolinguistic factors; and to analyze the spontaneous productions of London Greek-Cypriots, in particular those born in Britain, from the point of view of language change (borrowing, calques), language shift (abandonment of the Greek Cypriot dialect or GCD), and code-switching (linguistic and pragmatic aspects).
The informants were 30 subjects of Greek Cypriot origin living in London, coming from the working/lower middle class, with no higher education: 5 men and 5 women over age 60; 5 men and 5 women aged 35-60; 5 girls and 5 boys aged 14-18. Each subject was recorded once for 30-60 minutes. Coders were Olga Pillakouri and Mary Kastamoula.
The Greek Cypriot community in London consists of about 180,000 people, who came over in various waves from the 1960s onwards, many as economic refugees. Those ousted by the Turkish invasion in 1974 are of a more varied social and educational background. Their children and grandchildren have been educated in English schools, though many have attended the classes in (Standard) Greek organised on Saturdays by the Church and Parents’ Organization. The community is on the whole close-knit and preserves religious and social/family values distinct from the surrounding community. The younger generation is, however, creating a new identity for itself, distinct from the rural and traditional ethos of the older generations yet also different from British teenagers as a whole and indeed from those growing up in Cyprus.
Gardner-Chloros, P. 1992. “The sociolinguistics of the Greek Cypriot community of London”.
In Plurilinguismes No. 4, June 1992, Sociolinguistique du grec et de la Grece, ed. M. Karyolemou, pp. 112-135.

10. Hatzidaki (Greek-French)
Aspa Hatzidaki
Tripoleos 11
Kalamaria
Thessaloniki, 55131 Greece
The data contained in this corpus were used in Hatzidaki (1994). The purpose of the investigation, which took place among the second-generation Greeks living in Brussels, was to examine their linguistic behavior with a view to discovering to what degree they maintain the use of the ethnic language, and how they alternate between French and Greek in their daily interaction (to the extent that they do use Greek in spontaneous conversation). The data collection took place between January 1991 and October 1992 and consisted of three complementary techniques: the tape-recording of speech events such as interviews, participant observation, and the compilation of network lists. Thirty-four second-generation informants (19 male, 15 female) took part in the study. The following table groups participants together and provides information on their sex, age, and occupation at the time of the study.
Table 4: Hatzidaki Participants
Participant | Sex | Age | Occupation
Stefanos | M | 14 | High school student
Dimitris | M | 16 | High school student
Lazaros | M | 16 | Studying hotel management
Tassos | M | 18 | High school student
Pavlos | M | 18 | Studying hotel management
Kostas | M | 20 | Studying car mechanics
Nikos | M | 20 | Studying chemistry
Fotis | M | 21 | Technician
Andreas | M | 21 | Running bookshop, studying PoliSci
Spiros | M | 22 | Car mechanic, cook
Yannis | M | 22 | Studying computers, waiter
Orestis | M | 23 | Studying car mechanics
Yorgos | M | 23 | Physiotherapist
Vassilis | M | 24 | Physiotherapist
Ilias | M | 24 | Cook
Michalis | M | 24 | Studying Economics
Miltos | M | 24 | Telecommunications engineer
Christos | M | 29 | Degree in Economics
Petros | M | 29 | Mechanical Engineering
Thalia | F | 14 | High school student
Zoe | F | 15 | High school student
Roula | F | 16 | High school student
Natasa | F | 18 | Studying linguistics
Vera | F | 19 | Studying linguistics
Sofia | F | 20 | Studying linguistics
Katerina | F | 20 | Studying linguistics
Maria | F | 21 | Studying accounting
Voula | F | 21 | Going to secretarial school
Olga | F | 21 | Studying linguistics
Fani | F | 23 | Secretary
Alexandra | F | 25 | Translator
Elissavet | F | 25 | Ergonomics, unemployed
Irene | F | 26 | Studying pharmaceutics
Despina | F | 28 | Beautician

All our informants belonged to the category of “early bilinguals” (although two of them, Petros and his sister Irene, were not born in Belgium but acquired the French language in their early school years). It is difficult to be more precise and to place the informants in the category of “consecutive” or “successive” bilinguals, because they were not always able to provide reliable answers to the question of how they learned their two languages and when they started using one or the other for the first time. Differences in their learning experiences and in the timing and type of their language exposure together made it difficult to say with certainty what their first language was.
On the whole, most of our informants seemed to have experienced a positive, additive form of bilingualism, even though the Greek spoken by the majority is not comparable to Standard Greek in many ways; the speech of second-generation Greeks is markedly different from the norm for Modern Greek, as their variety of the ethnic language manifests certain distinctive features on all linguistic levels. Some of these features even appear with some systematicity. Irrespective of the structural deviations from the norm, the participants’ overall competence in Greek was sufficient for communication purposes. The active involvement of Greek authorities and the Greek Orthodox Church, frequent visits to Greece, and the availability of Greek-language press and media provided ample opportunity to develop oral and literacy skills in the ethnic language. If our informants’ competence in Greek varied from poor to very good, their competence in French was higher, by their own admission. They could be safely considered French-dominant bilinguals, something that is true for the totality of second-generation Greeks in Brussels (apart from those few who have been educated in Dutch, of course). This means that French was the language that served most functions in their everyday life, the language they felt more comfortable in, and the language they mastered best. The dominance of the French language was due to the nature of the children’s socialization and the functions fulfilled by the two codes in question. For those who still attended Mother Tongue Classes, Greek was the language of instruction for a few hours twice a week. Apart from that, they used it with family and friends to varying degrees. All other linguistic activity, be it receptive or productive, took place in French. This functional separation of codes, which they experienced since their infancy, firmly established the dominance of French. They definitely did not speak Greek as well as they spoke French. 
Their French is as good as that of any native speaker of their background.
Informants were asked to rate their Greek proficiency on two aspects, and the mean of the score for oral proficiency and literacy skills gave the informant’s proficiency score. It was decided to consider as “more proficient speakers” those informants who gave themselves between 2.5 and 4 and “less proficient speakers” those who rated themselves between 1 and 2.5. When the mean turned out to be exactly 2.5, the final placement of the informant was left to the researcher’s discretion. The criteria on which this judgment was based were the following: A “more proficient” speaker of Greek did not manifest disfluency phenomena indicating incompetence, made very few or no grammatical mistakes, used the appropriate words most of the time, and did not switch frequently out of incompetence. On the other hand, the speech of “less proficient” speakers of Greek manifested more clearly the dominance of French. In contrast to that of “more proficient” speakers, it was fraught with pauses, hesitations, grammatical mistakes, poor word selection, and competence-related code switching. The more proficient speakers were Miltos, Lazaros, Orestis, Katerina, Elissavet, Ilias, Petros, Alexandra, Yannis, Andreas, Vassilis, Yorgos, Fotis, and Nikos. The other speakers can be classed as less proficient.
The participants came from several social groups. These included the Sphynx Café group, the Hellenic Community group, the Association group, the foursome group, the students group, and Orestis and Alexandra. Full details regarding the social structure and language usage in these different groups can be found in Hatzidaki (1994).
The results of the quantitative study of language choice in our data led to the conclusion that more proficient speakers used significantly more Greek during monitored situations (mean: 91%) than their less proficient counterparts (mean: 60%).
This discrepancy can be attributed to the former group’s higher competence and greater practice in the ethnic language, which permitted them to conduct a long conversation with almost no French elements. Less proficient speakers in our sample, on the other hand, rarely found themselves in situations where the use of Greek was called for. However, the number of speakers on whom data are available is too small to draw any significant conclusions. Again, more proficient speakers manifest a more homogeneous behavior, which is natural in view of their consistency in using Greek.
Publications using these data should cite:
Hatzidaki, A. (1994). Ethnic language use among second-generation Greeks in Brussels. Unpublished doctoral dissertation. Vrije Universiteit, Brussels.

11. Køge (Turkish-Danish)
Jens Normann Jørgensen
University of Copenhagen
Copenhagen, DK
normann@hum.ku.dk
These data were collected from adolescent Turkish-Danish bilinguals in the town of Køge near Copenhagen. The data include interviews in Danish and Turkish and group discussions in both Danish and Turkish. There are audio files, but they are not available to TalkBank.