BilingBank

BilingBank Database Guide
This guide provides documentation regarding the data on bilingualism and second language
acquisition (SLA) in the TalkBank database. All of these data are available from
http://talkbank.org/data/BilingBank . TalkBank is an international system for the exchange
of data on spoken language interactions. The majority of the corpora in TalkBank have
either audio or video media linked to transcripts. All transcripts are formatted in the
CHAT system and can be automatically converted to XML using the CHAT2XML
converter. TalkBank data dealing with first language acquisition are available from the
CHILDES site at http://childes.psy.cmu.edu.
1. Anadolu
2. Bangor-Pilot (Welsh-English)
3. Bangor (Welsh-English) Siarad
4. Bangor (Welsh-Spanish) Patagonia
5. Bangor (Spanish-English) Miami
6. Welsh Transcription Conventions
7. BlumSnow (Hebrew-English)
   Family Backgrounds
      Group 1: Native Israeli Families
      Group 2: American Israeli Families
      Group 3: Jewish-American Families
8. Eppler (German-English)
9. Gardner-Chloros (Greek-English)
10. Hatzidaki (Greek-French)
11. Køge (Turkish-Danish)
1. Anadolu
Fatma Hülya Özcan
Anadolu University, Eskisehir
fozcan@anadolu.edu.tr
This corpus is included in BilingBank to provide a comparison with the Køge corpus,
because both involve immigrant Turkish children. The project was a collaboration of Jens
Normann Jørgensen of the University of Copenhagen and Fatma Hülya Özcan and Ilknur
Kecik of Anadolu University in Eskisehir, a town of 600,000 in central Turkey. The
children are from the second generation of working class immigrants to the city, thereby
allowing comparisons with the Køge corpus. The students, in groups of four, were first
studied in the first grade in 1997 and then the same students were followed in grades 3, 5,
7, and 8. In the first and third grades, the groups were asked to furnish a house on
white cardboard, using stickers, paper, glue sticks, and marking pens. In the 5th grade,
the task was to build a city. In the 7th and 8th grades the task was to prepare a collage on a
topic of their choice. Each session lasted 45 minutes. Transcription was done in CHAT
using CLAN, and translations into Danish and later English were prepared. The files
included here contain the Danish translations.
2. Bangor-Pilot (Welsh-English)
Margaret Deuchar
Department of Linguistics and English Language
University of Wales, Bangor
Gwynedd LL57 2DG UK
m.deuchar@bangor.ac.uk
The corpus was transcribed in 2004/05 as part of a small research project funded by the
British Academy, entitled "Structural aspects of Welsh-English code-switching". The
main theoretical aim of the project was to test Myers-Scotton's (2002) Matrix Language
Frame (MLF) model of code switching with Welsh-English data.
The data consist of recordings of informal conversations involving groups or pairs of
speakers in North-West Wales and excerpts from BBC Radio Cymru programs. We
would like to express our thanks to the Bangor students and researchers (named in the
transcript headers) who recorded the informal conversations within their social networks
and kindly gave us permission to use the recordings. We are also grateful to all the
speakers involved. Details regarding the context of each conversation and the speakers
involved are given in the header of each transcript. However, the real names of speakers
and other persons mentioned (other than professional radio presenters, actors, politicians
etc.) as well as house names have been replaced in the transcripts by pseudonyms drawn
from the words conventionally used to refer to letters, e.g. Alpha, Bravo, Charlie (see
e.g. http://www.dynamoo.com/technical/phonetic.htm). Finally, we
would like to thank the BBC for their permission to use the BBC radio programs.
Publications using these data should cite the following article:
Deuchar, M. (forthcoming). Welsh-English code-switching and the Matrix Language
Frame model. Lingua.
3. Bangor (Welsh-English) Siarad
Margaret Deuchar
ESRC Centre for Research on Bilingualism
Bangor University
Bangor
Gwynedd LL57 2DG
United Kingdom
m.deuchar@bangor.ac.uk
A INTRODUCTION
The Siarad corpus of Welsh-English bilingual speech was recorded and transcribed
between 2005 and 2008 as part of a research project funded by the Arts and Humanities
Research Council (AHRC), entitled ‘Code switching and convergence in Welsh: a
universal versus a typological approach’. The main theoretical aim of the project was to
test alternative models of code switching with Welsh-English data. The title of the
corpus, Siarad, is the Welsh word for ‘speaking’.
When using these data, please refer to the corpus as the Bangor Siarad corpus, and
provide a link to the website from which you accessed the corpus, either
http://www.talkbank.org or http://www.bilingualism.bangor.ac.uk.
B THE DATA
The corpus consists of 69 audio recordings and their corresponding transcripts of
informal conversation between two or more speakers, involving a total of 153 speakers
from across Wales. Participants were recruited via a variety of methods, including
advertising, approaching visitors at a Welsh-language cultural event, and using the
research team’s extended social network. In total, the corpus consists of 452,116 words
of text from 40 hours of recorded conversation. The transcriptions (in CHAT format) are
linked to the digitized recordings through sound links at the end of each main tier. Most
recordings were in stereo, and made using radio microphones and a Marantz hard disk
recorder. A minidisk recorder was also occasionally used, meaning that some recordings
are in mono mode.
The recordings were made at a place convenient for the speakers, e.g. at their homes,
workplaces or at the university. After setting up the equipment the researcher would leave
the speakers to talk freely with one another. The first five minutes of all recordings after
the point when the researcher left the room have been deleted. In some cases the
researcher re-entered briefly during the recording. These sections have not been
transcribed, but notes have been made in the relevant parts of the transcripts.
At the end of each recording all participants were asked to fill in questionnaires providing
background information regarding their age, gender, location of places lived, etc., in order
to provide information for sociolinguistic analysis. They were also asked to sign consent
forms giving permission for their recording and its transcript to be used for research
purposes and to be submitted to online linguistic archives. The consent form included the
provision that the names of speakers and other people named in the recording would be
replaced by pseudonyms in the transcript. In the case of children of 16 years or younger,
a parent or guardian also signed the consent form.
Sound and transcription files in the corpus are named after the researcher (primarily)
responsible for recording them, namely Marika Fusser, Peredur Davies, Elen Robert,
Jonathan Stammers, Nesta Roberts, Gary Smith and Margaret Deuchar. Each file name is
made up of the surname followed by a number (ordered chronologically). The sound and
transcription files for each conversation share the filename, but have different file
extensions (‘*.wav’ and ‘*.cha’ respectively). For example, Davies3.cha is the
transcription of Peredur Davies’ third recording (sound file Davies3.wav). In a few cases
numbers are discontinuous. The ‘Fusser’ files begin with Fusser3, for example. Also,
five recordings collected (including Fusser20, Fusser24 and Davies8) were left out of the
corpus. In three cases this was due to the lack of consent from speakers in the recording,
in one case due to the researcher taking an extensive part in the conversation, and in one
case due to a participant being a Welsh speaker from Patagonia who was not a Welsh-English bilingual.
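The naming convention above (researcher surname, then a number, then the extension) can be checked mechanically. The following is a minimal sketch, not part of the corpus tooling; the function and class names are hypothetical, and the pattern assumes surnames contain only unaccented letters, as in the examples given.

```python
import re
from typing import NamedTuple


class RecordingName(NamedTuple):
    researcher: str  # surname of the researcher who made the recording
    number: int      # chronological recording number for that researcher
    extension: str   # "cha" for a transcript, "wav" for a sound file


# Surname (letters only) + number + extension, e.g. "Davies3.cha".
# Assumption: no other extensions or separators occur in Siarad file names.
_NAME_RE = re.compile(r"^([A-Za-z]+)(\d+)\.(cha|wav)$")


def parse_recording_name(filename: str) -> RecordingName:
    """Split a Siarad-style file name into surname, number and extension."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a Siarad-style file name: {filename!r}")
    surname, number, extension = match.groups()
    return RecordingName(surname, int(number), extension)


def sound_file_for(transcript: str) -> str:
    """Return the sound file name that shares the transcript's base name."""
    name = parse_recording_name(transcript)
    if name.extension != "cha":
        raise ValueError(f"expected a .cha transcript: {transcript!r}")
    return f"{name.researcher}{name.number}.wav"
```

For example, sound_file_for("Davies3.cha") gives "Davies3.wav". Because numbering is discontinuous in places, a missing number should not be read as a missing file.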
A list of the files in the corpus can be found in the Appendix. This list includes some
basic information for each file. Details regarding the context of each conversation and the
speakers involved are given in the transcript headers. Some additional information about
the speakers and recordings is available to researchers on request.
All recordings have been transcribed in the CHAT transcription and coding format
(MacWhinney 2000), in accordance with the 2007 version of the online manual available
on the Talkbank website (www.talkbank.org). All further references to CHAT in this
document are taken from this online version.
The transcripts were all produced by trained transcribers working on the project: Peredur
Davies, Marika Fusser, Siân Wynn Lloyd, Elen Robert and Jonathan Stammers. For 22%
of the transcripts an independent transcription was done, in which a member of the
transcription team transcribed one (randomly selected) minute of the recording
independently from the original transcriber of that particular transcript. Transcripts were
then compared and a rate of similarity was calculated. The average reliability score for
independent transcriptions was 75%. Furthermore, the transcripts that were completed
before March 2007 by each original transcriber were proofread by another member of the
transcription team and corrections made accordingly.
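This section does not say how the rate of similarity was computed. Purely as an illustration of what a word-level similarity rate between two transcriptions of the same minute could look like, the sketch below uses Python's standard difflib module as a stand-in; it is not the project's method, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def transcript_similarity(first: str, second: str) -> float:
    """Word-level similarity of two transcriptions, from 0.0 to 1.0.

    Stand-in measure: twice the number of matching words divided by the
    total number of words in both transcriptions (difflib's ratio).
    """
    words_a = first.split()
    words_b = second.split()
    # autojunk=False stops difflib from ignoring very frequent words
    # in long inputs.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()
```

Two four-word transcriptions that differ in one word share three words, so the rate is 2 × 3 / 8 = 0.75. An average reliability score would then be the mean of such rates over the sampled minutes.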
All transcripts contain at least three different tiers. In addition to the main tier, required
by CHAT, we use a gloss tier (%gls) for the closest English equivalent for each word
(including morphological information where relevant), and a translation tier (%eng),
which contains a free translation of the main tier. All main tiers include a sound link to
the corresponding section of the recording.
We request that a copy of any publications that make use of this corpus be sent to us at
the above address. For introductory information about the Welsh-speaking community
see Deuchar (2005). Publications using these data should cite:
Deuchar, M. and Davies, P. Code-switching and the future of the Welsh language,
International Journal of the Sociology of Language 195: 15-38.
Additional References:
Canolfan Bedwyr (2008), Cysgliad. Prifysgol Bangor.
Deuchar, M. (2005). Minority Language Survival in Northwest Wales: an Introduction.
In Cohen, J, McAlister, K., Rolstad, K. and MacSwan, J. (eds) Proceedings of the
4th International Symposium on Bilingualism. Somerville, MA: Cascadilla Press,
621-624.
Griffiths, B. and Jones, D.G. (eds.) (1995). The Welsh Academy English-Welsh
Dictionary / Geiriadur yr Academi. Cardiff: University of Wales Press.
King, G. (2003). Modern Welsh: a comprehensive grammar (2nd ed.). London:
Routledge.
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.).
Mahwah, NJ: Lawrence Erlbaum Associates.
Oxford English Dictionary. Oxford University Press. (2008). (www.oed.com)
Thomas, R.J. (ed.) (1950-2004). Geiriadur Prifysgol Cymru: a dictionary of the Welsh
language. Cardiff: (http://www.cymru.ac.uk/geiriadur/gpc_pdfs.htm)
Thomas, P.W. (1996). Gramadeg y Gymraeg. Cardiff: University of Wales Press.
4. Bangor (Welsh-Spanish) Patagonia
Margaret Deuchar
ESRC Centre for Research on Bilingualism
Bangor University
Bangor
Gwynedd LL57 2DG
United Kingdom
m.deuchar@bangor.ac.uk
A INTRODUCTION
The Patagonia corpus of Welsh-Spanish bilingual speech was recorded in late 2009
and transcribed from 2010 to 2011 as part of a research project funded by the
Economic and Social Research Council (ESRC). The main theoretical aim of the
project was to test alternative models of code-switching with Welsh-Spanish data.
Conditions of use
The corpus is being made available under the GNU General Public License version 3
or later (http://gnu.org/copyleft/gpl.html). Researchers who use it are requested to
subscribe to the TalkBank Code of Ethics (http://talkbank.org/share/ethics.html) and
acknowledge the corpus as set out below.
Acknowledgments
Please refer to the corpus as the Bangor Patagonia corpus, and provide a link to the
website by which you accessed the corpus, either http://www.talkbank.org or
http://bangortalk.org.uk. We request that a copy of any publications that make use of
this corpus be sent to us at the above address.
Canonical version of the data
The most up-to-date version of the data as well as more detailed documentation is
available on http://bangortalk.org.uk.
B THE DATA
The corpus consists of 43 audio recordings and their corresponding transcripts of
informal conversation between two or more speakers, involving a total of 94 speakers
from Patagonia, Argentina. Participants were recruited via a social network approach:
as only a very small percentage of the inhabitants of Patagonia are fluent in both
Spanish and Welsh, names of bilingual speakers were sought from local contacts in
advance of the fieldworkers’ visit. In total, the corpus consists of 195,190 words of
text from just under 21 hours of recorded conversation. The transcriptions (in CHAT
format) are linked to the digitized recordings through sound links at the end of each
main tier. Most recordings were in stereo, and were made using Marantz, Zoom or
Microtrack digital audio recorders.
The recordings were made at a place convenient for the speakers, e.g. at their homes
or workplaces. After setting up the equipment the researcher would leave the speakers
to talk freely with one another. In some cases the researcher re-entered briefly during
the recording. This is noted in the transcripts and speech by the researcher is usually
not transcribed. The first five minutes of all recordings after the point when the
researcher left the room have been deleted, in case the participants’ speech was
initially affected by the presence of the recorder.
At the end of each recording all participants were asked to fill in questionnaires
providing background information regarding their age, gender, location of places
lived, etc., in order to provide information for sociolinguistic analysis. They were also
asked to sign consent forms giving permission for their recording and its transcript to
be used for research purposes and to be submitted to online linguistic archives. The
consent form included the provision that the names of speakers and other people
named in the recording would be replaced by pseudonyms in the transcript. In the
case of children of 16 years or younger, a consent form was also signed by a parent or
guardian.
There are a few instances where speakers who have not given consent feature in
recordings, e.g. a neighbour walking in briefly. In these cases the utterances have
been transcribed as “www” and replaced by silence in the audio file. This can
sometimes mean that parts of the consenting participants’ speech are lost as well
where there is overlap with the non-consenting speaker. In addition, beeps have been
placed over the names of people about whom sensitive information is given.
The recordings in the corpus are named after the Patagonia region of Argentina where
the recordings took place and are numbered in order of the sequence of recording.
The sound and transcription files for each conversation share the filename, but have
different file extensions (‘*.wav’/‘*.mp3’ for the sound file and ‘*.cha’ for the
transcription). For example, Patagonia3.cha is the transcription of the third recording
(sound files Patagonia3.wav or Patagonia3.mp3). Basic details regarding the context
of each conversation and the speakers involved are given in the transcript headers.
Some additional information about the speakers and recordings is available to
researchers on request.
All recordings have been transcribed in the CHAT transcription and coding format
(MacWhinney 2000), in accordance with the 2012 version of the online manual available
at http://childes.psy.cmu.edu/manuals/chat.pdf. All references to the CHAT manual
in this document are to this online version.
All transcripts have been done by trained transcribers working on the project: Fraibet
Aveledo, Diana Carter, Marika Fusser, Lowri Jones, M. Carmen Parafita Couto,
Myfyr Prys and Jonathan Stammers. For 10% of the transcripts an independent
transcription was done, in which a member of the transcription team transcribed one
(randomly selected) minute of the recording independently from the original
transcriber of that particular transcript. Transcripts were then compared and a rate of
similarity was calculated. The average reliability score [1] for independent transcriptions
was 88%. Furthermore, all the transcripts were checked by another member of the
transcription team and corrections made accordingly. The team of checkers included
the following researchers in addition to the original transcription team: Margaret
Deuchar, Lara Gil Vallejo, Jon Herring, Guillermo Montero Melis, and Susana Sabin-Fernández.
All transcripts contain at least three different tiers. In addition to the main tier,
required by CHAT, we use an automatically generated gloss tier (%xaut) for the
closest English equivalent for each word (including morphological information where
relevant), and a translation tier (%eng), which contains a free translation into English
of the main tier. A comments tier (%com) has also been used occasionally for
comments by the transcriber that are specific to the utterance in the corresponding
main tier. All main tiers include a sound link to the corresponding section of the
recording.
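The tier layout just described can be read back from a transcript with a few lines of code. The sketch below is illustrative only: it assumes each tier sits on a single line of the form "*SPK:<tab>text" or "%tier:<tab>text", ignores header lines and continuation lines, and does not interpret sound links. The sample lines and the speaker code ABC are invented for the example and are not taken from the corpus.

```python
from typing import Dict, List


def group_tiers(lines: List[str]) -> List[Dict[str, str]]:
    """Attach each dependent tier (%...) to the main tier (*...) above it."""
    utterances: List[Dict[str, str]] = []
    for line in lines:
        if line.startswith("*"):
            # A main tier opens a new utterance: "*SPK:\ttext".
            speaker, _, text = line[1:].partition(":")
            utterances.append({"speaker": speaker, "main": text.strip()})
        elif line.startswith("%") and utterances:
            # A dependent tier belongs to the most recent main tier.
            name, _, text = line[1:].partition(":")
            utterances[-1][name] = text.strip()
        # Header lines (starting with "@") and anything else are skipped.
    return utterances


# Invented sample, for illustration only.
SAMPLE = [
    "@Begin",
    "*ABC:\tmae Argentina@s:spa yn braf .",
    "%xaut:\tgloss text would appear here",
    "%eng:\tArgentina is fine .",
]
```

Calling group_tiers(SAMPLE) yields one utterance with the keys speaker, main, xaut and eng. Any tier name, including %com, is picked up the same way.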
The remainder of this document outlines the conventions used in the main tier and the
gloss tier.
C MAIN TIER
1. Layout of transcription
1.1. Since the theoretical aims of the project include clause-based analysis, the
transcribed data are divided into clauses where possible. Where an utterance
contains two main clauses, each clause in that utterance is written on a separate
main tier. Complex clauses are treated as one clause and therefore subordinate
clauses are included in the same tier as their main clauses. Adverbial clauses are
also written on the same main tier as their related main clause.
1.2. Each main tier is divided into units which we call ‘words’ for the purposes of
these conventions. With some exceptions (see C.1.3) a word is considered to be
a continuous sequence of characters containing no spaces, as found in Geiriadur
Prifysgol Cymru (Thomas 1950-2004) (GPC), Geiriadur yr Academi (Griffiths
& Jones 1995) (GyrA), Cysgeir (Canolfan Bedwyr 2008), the Diccionario de la
Lengua Española online from the Real Academia Española (DLE) and the
Diccionario de Americanismos (2010) (DA). These are referred to as GPC,
GyrA, Cysgeir, DLE and DA respectively throughout this document. Where
items are entered as two hyphenated words in these reference dictionaries, they
are connected by an underscore in the transcripts, e.g. ‘cyd_ddigwyddiad’ (=
‘coincidence’). When one of the reference dictionaries offers more than one
alternative (e.g. ‘minibus’, ‘mini-bus’ or ‘mini bus’), or when the reference
dictionaries differ from each other, the most compact alternative is chosen
(‘minibus’ in this case).
[1] An innovative method was used based on Turnitin plagiarism detection software
(http://www.turnitin.com). For further details see Deuchar et al. (in press).
1.3. Other items which are treated as words are:
• interjections and interactional markers, e.g. ‘ajá’ (= ‘aha’), ‘ay’ (= ‘oh’),
‘hym’ (= ‘hmm’), etc.
• proper names (including names of books, films, organisations etc.), a
sequence of words being connected by underscores, e.g. ‘Butch_Cassidy’,
‘Buenos_Aires’.
• abbreviations (connected by underscore), e.g. ‘B_B_C’.
• Welsh numbers consisting of two words involving ten which translate into
one English word, e.g. ‘un_ar_ddeg’ (= ‘eleven’), ‘pedwar_deg’ (=
‘forty’). Note that other numbers such as those containing ‘hundred’,
‘thousand’ etc. are transcribed as separate words, e.g. ‘cant saith deg tri’,
‘ciento setenta y tres’ (= ‘173’).
• Welsh phrasal prepositions formed using two morphemes, where
separation of the two elements would make any gloss of the individual
elements unhelpful, are transcribed with an underscore between the two
morphemes, e.g. ‘oddi_wrth’, which means ‘from’, but whose individual
morphemes translate respectively as ‘out of’ and ‘next to’.
• Examples of the phrasal prepositions described in the previous item are
listed below, along with some other phrases which are similarly transcribed
because they normally translate into just a single English word.
(a) Welsh
Our transcription    Conventional form    English equivalent
cyd_ddigwyddiad      cyd-ddigwyddiad      coincidence
dim_byd              dim byd              nothing
o_hyd                o hyd                still
o_k                  OK                   OK
ta_beth              ‘ta beth             anyway
tu_ôl_i              tu ôl i              behind
un_ai                un ai                either
yn_ôl                yn ôl                back
(b) Spanish
Our transcription    Conventional form    English equivalent
ni_fu_ni_fa          ni fu ni fa          neither nor
no_más               no más               only
o_k                  OK                   OK
o_la_la              olalá                ooh la la
copo_de_nieve        copo de nieve        guelder rose
1.4. Contractions that do not have entries in the reference dictionaries listed above
or, in the case of Welsh, in King (2003), are transcribed in full, but the
unpronounced parts are bracketed. For example, the pronunciation of ‘fel yna’
(= ‘like that’) as [vɛla] in speech is represented in the transcripts as ‘fel (yn)a’.
1.5. There are some continuous sequences of characters in the main tier which are
not treated as words. These include “simple events” such as ‘&=laugh’ (see
CHAT 7.6.1), ‘xxx’ for unintelligible sounds, and the use of an ampersand (‘&’)
plus phonetic characters for intelligible sounds without clear meaning, e.g.
‘&pfe’ where the speaker produces the non-word [pfe] (see CHAT 6.4 for
both).
1.6. Please note that pause markings are not used consistently in the transcripts.
Additionally, pauses between utterances are generally not marked. We have used
the ‘lazy overlap’ markings (+<) for overlapping speech.
2. Language marking
2.1. A default language is assigned to each transcription based on the language
contributing the greater number of words. The default language is the first
language listed in the @Language tier in the file header, and is indicated by the
ISO-639-3 abbreviation for the language: cym = Welsh, eng = English, spa =
Spanish. Words without any language markers in the transcription are in the
default language unless they are part of an utterance preceded by a precode
indicating that it is in a non-default language – see next paragraph for details.
2.2. Individual utterances in the second or third most frequent language are marked
with precodes at the beginning of the main tier, e.g. [- cym] for Welsh, [- spa]
for Spanish. These utterances contain no language tags. In mixed utterances
each word in the non-default language is marked by a tag consisting of @s:
followed by the relevant ISO-639-3 abbreviation: @s:cym = Welsh, @s:spa =
Spanish, @s:eng = English, @s:cym&spa = undetermined (see below, 2.4),
@s:spa+cym = word with first morpheme(s) Spanish, final morpheme(s) Welsh,
@s:cym+spa = word with first morpheme(s) Welsh, final morpheme(s) Spanish.
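The rules in 2.1 and 2.2 amount to a simple procedure: a precode sets the language for the whole utterance, an @s: tag overrides it for a single word, and everything else falls back to the default language of the file. A minimal sketch under those assumptions follows; the function name is hypothetical, and tokens such as ‘xxx’ are not given any special treatment.

```python
import re
from typing import List, Tuple

# Utterance-level precode such as "[- spa]" at the start of a main tier.
PRECODE_RE = re.compile(r"^\[- ([a-z]{3})\]\s*")


def word_languages(main_tier: str, default_language: str) -> List[Tuple[str, str]]:
    """Pair each word of a main tier with its language marking."""
    utterance_language = default_language
    precode = PRECODE_RE.match(main_tier)
    if precode:
        utterance_language = precode.group(1)
        main_tier = main_tier[precode.end():]
    result: List[Tuple[str, str]] = []
    for token in main_tier.split():
        # Skip punctuation and "&..." items, which are not words.
        if token.startswith("&") or not any(ch.isalpha() for ch in token):
            continue
        word, marker, tag = token.partition("@s:")
        # Composite tags such as "cym&spa" or "spa+cym" are kept verbatim.
        result.append((word, tag if marker else utterance_language))
    return result
```

With default language cym, the tier "mae Buenos_Aires@s:cym&spa yn Argentina@s:spa ." gives cym for ‘mae’ and ‘yn’, cym&spa for ‘Buenos_Aires’ and spa for ‘Argentina’; the tier "[- spa] no_más ." gives spa for ‘no_más’.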
2.3. A word or morpheme is considered to be Welsh if it can be found in any of the
Welsh-language reference dictionaries or in King (2003). A word or morpheme
is considered to be Spanish if it or all its elements are found in either of the
Spanish reference dictionaries (e.g. ‘principito’ is considered to be a Spanish
word because ‘príncipe’ and ‘-ito’ are both listed in DLE). However, we have
considered some words not listed in the dictionaries to be either Welsh or
Spanish, as indicated in the list below.
Transcribed form    Language    English equivalent
chuker              Spanish     sweetener
coranto             Spanish     courante
corintos            Spanish     currants
ddi                 Welsh       she/her
ddo                 Welsh       he/him
estiletos           Spanish     stiletto heels
lactal              Spanish     sliced
mm                  Welsh       mm (interactional marker)
mmhm                Welsh       mmhm (interactional marker)
nebulización        Spanish     nebulisation
uhuh                Welsh       uhuh (interactional marker)
valvuloplastía      Spanish     valvuloplasty
wchi                Welsh       you know (polite/plural)
wsti                Welsh       you know
ychi                Welsh       you know (polite/plural)
yfe                 Welsh       isn’t it
yli                 Welsh       you see
yliwch              Welsh       you see (polite/plural)
yndyfe              Welsh       doesn’t it
yn_basai            Welsh       wouldn’t it
yn_de               Welsh       isn’t it
yn_do               Welsh       didn’t it
yn_doedd            Welsh       wasn’t it
yn_doeddech         Welsh       weren’t you (polite/plural)
yn_doedden          Welsh       weren’t they
yn_doeddwn          Welsh       wasn’t I
yn_does             Welsh       isn’t there
yn_dydach           Welsh       aren’t you (polite/plural)
yn_dydan            Welsh       aren’t they/we
yn_dydy             Welsh       isn’t it
yn_dydyn            Welsh       aren’t they/we
2.4. The language marker @s:cym&spa is used with words where the language
source is undetermined. It marks words that occur in the lexicon of both
languages (as determined by the respective reference dictionaries), that are
pronounced in a way that is possible both in Welsh and in Spanish, e.g. [foto]
(‘ffoto’ in Welsh or ‘foto’ in Spanish) or [pjano] (‘piano’ in both languages).
2.5. @s:cym&spa also marks interjections and interactional markers that may be
interpreted as ambiguous, e.g. ‘ah’, ‘oh’. Other interjections and interactional
markers are assigned language markers according to their inclusion (or not) in
the reference dictionaries. For example, ‘ych’ (a marker of disgust) is marked
@s:cym as it is only found in the Welsh-language reference dictionaries. There
are also some instances where we assigned a language to an interactional marker
that was not listed in any of the dictionaries – see 2.3.
Transcription    Language(s)          English equivalent
ah               Welsh & Spanish      ah
ajá              (Welsh &) Spanish    aha
argian           Welsh                good lord
ay               Spanish              oh
bah              Spanish              bah
bechod           Welsh                how sad
diar             Welsh                dear
eh               Welsh & Spanish      eh
ew               Welsh                oh
hym              Welsh                hmm
mm               Welsh                mm
mmhm             Welsh                mmhm
nefi             Welsh                heavens
oh               Welsh & Spanish      oh
oi               Spanish              oh
ta               Welsh                then
ta_ra            Welsh                goodbye
ta_ta            Welsh                goodbye
te               Welsh                be
w                Welsh                ooh
wel              Welsh                well
ý                Welsh                er
ych              Welsh                yuck
ym               Welsh                um
2.6. Where a lexeme could belong to both languages, but its pronunciation in a
specific occurrence belongs unambiguously to one language only, it will be
marked @s:cym or @s:spa (and written in the orthography of that language)
according to its pronunciation. For example, if ‘hotel’ is pronounced with initial
[h], it will be marked @s:cym, without initial [h] it will be marked @s:spa.
2.7. Proper names and titles are marked ‘@s:cym&spa’ (undetermined) unless there
are alternatives in each language in general use, e.g.
‘Butch_Cassidy@s:cym&spa’, ‘Buenos_Aires@s:cym&spa’,
‘Arglwydd_Dyma_Fi@s:cym&spa’ (a Welsh hymn), but ‘Argentina@s:spa’,
‘Ariannin@s:cym’ (the Welsh word for ‘Argentina’).
3. Orthography
3.1. We have used a Unicode font (http://en.wikipedia.org/wiki/Unicode) for the
transcription. Occasional non-lexical phonological fragments are spelt out
following an ampersand using IPA symbols
(http://www.langsci.ucl.ac.uk/ipa/ipachart.html) (e.g. &ʧʊ), and these may not
show up correctly if a Unicode font is not used.
3.2. Words marked as ‘@s:spa’ (Spanish) are transcribed in conventional Spanish
orthography.
3.3. Words marked as Welsh are transcribed in conventional Welsh orthography. We
have not represented regional variation in the transcripts, except in cases which
have orthographic representation in the Welsh-language reference dictionaries or
in King (2003).
There are some cases where we differ in usage from conventional orthography:
• Colloquial second person singular verb and preposition endings not
usually represented in writing are transcribed as ‘-a’ where they are
followed by the pronoun ‘chdi’, e.g. ‘oedda chdi’ (= ‘you were’), ‘amdana
chdi’ (= ‘about you’).
• We do not represent morpheme-final [v] when it is not pronounced. For
example, [pɛntre] (= ‘village’) is written ‘pentre’ in the transcripts rather
than ‘pentref’ (as the word is represented in the Welsh-language reference
dictionaries).
• Morpheme-initial /r/ is only transcribed as ‘rh’ where it is clearly heard by
the transcriber to be voiceless ([r̥]). Otherwise it is transcribed as ‘r’, even
when the standard orthography prescribes ‘rh’. Some speakers do not have
[r̥] as part of their phonological system in any case.
• Morphemes in Welsh which are usually written with an initial apostrophe,
such as the marking of the ellipsis of a possessive pronoun in e.g. ‘’nhad’
(= ‘my father’), are transcribed without this initial apostrophe (e.g. ‘nhad’)
owing to the constraints of CHAT.
• We have represented mutation (sound change to initial consonants) or its
absence without following prescriptive rules as to where mutation might
or might not be expected. Thus the Welsh form of ‘in Cardiff’ may be
transcribed ‘yn Caerdydd’ (with initial [k]) or ‘yn Gaerdydd’ (with initial
[g]), as well as the standard form ‘yng Nghaerdydd’ (with initial [ŋ̥]),
according to what is heard. We have also transcribed the aspirate mutation
of /m/ and /n/ after the 3rd singular feminine possessive adjective common
in regional varieties, e.g. ‘ei mham’ (= ‘her mother’, with initial [m̥]),
rather than standard ‘ei mam’ (with initial [m]).
• There are also quite a few instances in the corpus where speakers who are
learners of Welsh use ungrammatical or unconventional forms. These
include ‘hypermutation’, where an already mutated initial consonant
undergoes another round of mutation, e.g. ‘tipyn’ > ‘dipyn’ > ‘ddipyn’
(meaning ‘a bit’).
3.4. Words whose language source is undetermined are transcribed in Spanish rather
than in Welsh orthography, e.g. ‘avocado@s:cym&spa’ rather than
‘afocado@s:cym&spa’.
3.5. When words marked as Spanish or undetermined are mutated (where the sound
of an initial consonant is changed depending on the grammatical context, see for
example King 1993:14-20), the initial (mutated) sound is written in Welsh
orthography and the rest in Spanish, e.g. ‘rhyw ddoctora’ (= ‘some (female)
doctor’).
3.6. There is some variation in the way initial consonants in Welsh have been
transcribed. In some instances the transcriber interpreted a word to have a soft
mutation where the speaker may simply have used the Spanish variant of a
consonant rather than the Welsh one. This is especially true for stops, where
Spanish /p/, /t/, /k/ are more similar to Welsh /b/, /d/, /g/ than to Welsh /p/, /t/, /k/.
For example, the transcription may record ‘dâl’ (= ‘payment’, with soft
mutation), where the speaker was intending to say ‘tâl’ (without mutation).
D GLOSS TIER
1. Principles
Each word (see C1.2 and C.1.3) in the main tier is given a gloss in the gloss tier
(%aut). The gloss tier has been produced automatically using the Bangor Autoglosser
(http://bangortalk.org.uk/autoglosser.php), free (GPL) software developed at the
Centre – for further details see Donnelly and Deuchar 2011. The transcripts were
manually corrected after autoglossing to deal with the small number (less than 2%) of
incorrectly-attributed glosses.
1.1. Non-words (see C.1.5) are not glossed.
1.2. All words are glossed with the closest English-language equivalent (in lower
case) and, where appropriate, information about parts of speech. English
equivalents of proper names are used where they exist (for example,
‘Caerdydd@s:cym’ is glossed as ‘Cardiff’). If there is no English-language
equivalent to a name, it is glossed ‘name’.
1.3. The underscore is used in the gloss tier to connect more than one lexical item in
a gloss, where the English translation of a single Welsh or Spanish word
involves more than one word. For example, ‘neithiwr’ is glossed as ‘last_night’.
1.4. The English lexeme in a gloss is followed by information about parts of speech,
separated by dots. Some examples:
 Spanish ‘hijos’ is glossed ‘son.N.M.PL’, which means ‘plural of the
masculine noun “hijo”’;
 Welsh ‘mae’ is glossed ‘be.V.3S.PRES’, which means ‘third person
singular present of the verb “be”’;
 Spanish ‘me’ is glossed ‘me.PRON.OBL.MF.1S’, meaning ‘oblique
pronoun, 1st person singular, masculine or feminine’;
 Welsh ‘fan’ is glossed ‘place.N.MF.SG+SM’, which means ‘singular of
the noun “man” (meaning “place”), which can be either masculine or
feminine, with a soft mutation’.
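The gloss layout described in 1.3 and 1.4 can be illustrated with a short Python sketch. It is not part of the Bangor Autoglosser; the names `Gloss` and `parse_gloss` are hypothetical, and the sketch assumes that a gloss contains at most one `+` suffix (such as `+SM`) after the dot-separated tags.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gloss:
    lexeme: str            # English lexeme, e.g. "place" or "last_night"
    tags: List[str]        # dot-separated part-of-speech tags, e.g. ["N", "MF", "SG"]
    suffix: Optional[str]  # item attached with "+", e.g. "SM" (assumed to occur at most once)


def parse_gloss(gloss: str) -> Gloss:
    """Split one gloss-tier item into lexeme, tags and optional '+' suffix."""
    body, _, suffix = gloss.partition("+")
    lexeme, *tags = body.split(".")
    return Gloss(lexeme, tags, suffix or None)


print(parse_gloss("place.N.MF.SG+SM"))
print(parse_gloss("last_night"))
```

Underscores inside the lexeme are left untouched, so a multi-word English equivalent such as ‘last_night’ stays a single lexeme.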
2. Parts of speech abbreviations
Abbreviation  Representing
0             impersonal
123S          1st, 2nd, 3rd person singular
13S           1st, 3rd person singular
1P            1st person plural
1S            1st person singular
23P           2nd, 3rd person plural
23S           2nd, 3rd person singular
23SP          2nd, 3rd person singular or plural
2P            2nd person plural
2S            2nd person singular
2SP           2nd person singular or plural
3P            3rd person plural
3S            3rd person singular
3SP           3rd person singular or plural
ADJ           adjective
ADV           adverb
AM            aspirate mutation
ASV           adjective, singular noun, or verb
AUG           augmentative
COMP          comparative
COND          conditional
CONJ          conjunction
DEF           definite
DEM           demonstrative
DET           determiner
DIM           diminutive
E             exclamation
EMPH          emphatic
F             feminine
FAR           far (demonstrative)
FOCUS         item with focus
FUT           future
GER           gerund
H             pre-vocalic h after 3S.F, 1P and 3P possessives
HYP           hypothetical
IM            interactional marker
IMPER         imperative
IMPERF        imperfect
INDEF         indefinite
INFIN         infinitive
INT           interrogative
INTENS        intensive
M             masculine
MF            masculine or feminine
N             noun
NEAR          near (demonstrative)
NEG           negative
NM            nasal mutation
NT            neuter
NUM           numeral
OBJ           object
OBL           oblique
ORD           ordinal
PAST          past
PASTPART      past participle
PL            plural
PLUPERF       pluperfect
POSS          possessive
PRECLITIC     accented form before clitics
PREP          preposition
PREQ          pre-qualifier
PRES          present
PRESPART      present participle
PRON          pronoun
PRT           particle
QUAN          quantifier
REFL          reflexive
REL           relative
SG            singular
SM            soft mutation
SP            singular or plural
SUB           subject
SUBJ          subjunctive
SUP           superlative
SV            singular noun or verb
TAG           tag question
V             verb
REFERENCES
Canolfan Bedwyr (2008). Cysgliad. Prifysgol Bangor. (www.cysgliad.com)
Diccionario de Americanismos. Asociación de Academias de la Lengua Española (2010).
Diccionario de la Lengua Española. Real Academia Española. (www.rae.es)
Griffiths, B. and Jones, D.G. (eds.) (1995). The Welsh Academy English-Welsh
Dictionary / Geiriadur yr Academi. Cardiff: University of Wales Press.
(http://techiaith.bangor.ac.uk/GeiriadurAcademi)
Deuchar, M. et al. (in press) Building bilingual corpora: Welsh-English, Spanish-English
and Spanish-Welsh. In I. Mennen and E. Thomas (eds) Unravelling Bilingualism.
Multilingual Matters.
Donnelly, K. and Deuchar, M. (2011) Using constraint grammar in the Bangor
Autoglosser to disambiguate multilingual spoken text. In: Constraint Grammar
Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia. Tartu:
NEALT Proceedings Series. (http://dspace.utlib.ee/dspace/handle/10062/19298)
King, G. (2003). Modern Welsh : a comprehensive grammar (2nd ed.). London:
Routledge.
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.).
Mahwah, NJ: Lawrence Erlbaum Associates.
Thomas, R.J. (ed.) (1950-2004). Geiriadur Prifysgol Cymru : a dictionary of the Welsh
language. Cardiff: University of Wales Press.
(http://www.cymru.ac.uk/geiriadur/gpc_pdfs.htm)
Thomas, P.W. (1996). Gramadeg y Gymraeg. Cardiff: University of Wales Press.
APPENDIX
File summary:
File name    Length (mm:ss)  No. of main participants  Age (years)         Sex
Patagonia1   10:02           3                         22, 21, 28          F, F, M
Patagonia2   29:18           2                         66, 82              F, F
Patagonia3   28:44           2                         82, 78              F, F
Patagonia4   23:33           3                         48, 47, ?           F, M, M
Patagonia5   22:16           5                         90, 82, 67, 72, 61  F, F, M, F, F
Patagonia6   27:17           2                         54, 96              F, F
Patagonia7   44:31           2                         66, 68              F, F
Patagonia8   31:37           2                         84, 83              F, F
Patagonia9   37:05           2                         65, 69              F, F
Patagonia10  25:48           2                         35, 9               M, F
Patagonia11  43:59           3                         81, 74, 86          F, F, F
Patagonia12  29:55           2                         78, 25              F, F
Patagonia13  28:31           2                         61, 42              F, F
Patagonia14  31:28           2                         63, 74              F, F
Patagonia15  29:45           2                         81, 42              F, F
Patagonia16  32:10           3                         54, 54, 46          M, F, F
Patagonia17  30:15           2                         21, 18              F, F
Patagonia18  30:46           3                         73, 46, ?           M, M, F
Patagonia19  43:10           2                         38, 8               M, F
Patagonia20  34:29           2                         67, 58              F, F
Patagonia21  33:30           2                         60, 60              F, F
Patagonia22  26:18           2                         84, 56              F, F
Patagonia23  29:54           2                         63, 64              F, F
Patagonia24  30:57           2                         53, 88              F, F
Patagonia25  29:47           4                         48, 44, 13, 8       M, F, F, M
Patagonia26  37:54           2                         55, 27              F, M
Patagonia27  27:45           2                         69, 68              F, M
Patagonia28  35:07           2                         22, 28              F, F
Patagonia29  14:26           2                         18, 18              M, F
Patagonia30  33:10           2                         74, 71              F, F
Patagonia31  39:20           2                         81, 70              F, F
Patagonia32  30:02           2                         71, 75              F, M
Patagonia33  30:26           2                         72, 29              F, M
Patagonia34  28:42           2                         58, 34              F, F
Patagonia35  32:55           2                         75, 71              M, F
Patagonia36  33:04           1                         70                  F
Patagonia37  29:28           2                         46, 44              F, F
Patagonia38  35:21           2                         30, 37              F, F
Patagonia39  29:30           2                         71, 76              F, F
Patagonia40  24:09           2                         90, 92              F, F
Patagonia41  22:05           4                         71, 55, 76, ?       M, F, F, M
Patagonia42  33:35           2                         70, 87              F, F
Patagonia43  11:29           2                         56, ?               F, M
20:55:46
942
Total: 42
5. Bangor (Spanish-English) Miami
The Miami Corpus: Documentation File
Margaret Deuchar
ESRC Centre for Research on Bilingualism
Bangor University
Bangor
Gwynedd LL57 2DG
United Kingdom
m.deuchar@bangor.ac.uk
A
INTRODUCTION
The Miami corpus of Spanish-English bilingual speech was recorded and
transcribed between 2008 and 2011 as part of a research project funded by
the Economic and Social Research Council (ESRC). The main theoretical aim
of the project was to test alternative models of code-switching with Spanish-English data.
Conditions of use
The corpus is being made available under the GNU General Public License
version 3 or later (http://gnu.org/copyleft/gpl.html). Researchers who use it
are requested to subscribe to the TalkBank Code of Ethics
(http://talkbank.org/share/ethics.html) and acknowledge the corpus as set out
below.
Acknowledgments
Please refer to the corpus as the Bangor Miami corpus, and provide a link to
the website by which you accessed the corpus, either http://www.talkbank.org
or http://bangortalk.org.uk. We request that a copy of any publications that
make use of this corpus be sent to us at the above address.
Canonical version of the data
The most up-to-date version of the data as well as more detailed
documentation is available on http://bangortalk.org.uk.
B THE DATA
The corpus consists of 56 audio recordings and their corresponding
transcripts of informal conversation between two or more speakers, involving
a total of 84 speakers living in Miami, Florida (USA). Participants were
recruited via a variety of methods, including advertising and using the
research team’s extended social network.
Of the 56 audio recordings, 15 capture the conversations of one individual,
recorded over a longer period of time with more than one interlocutor.
The participant (‘María’) was already known by the
research team to be a balanced bilingual who frequently and consistently
code-switched in daily conversation, and so she was invited to make
recordings of her interactions with colleagues, family and friends. María
decided when and with whom to make recordings, by means of a small digital
recorder worn on her belt with a moderately concealed lapel microphone. She
recorded 42 conversations, 15 of which have been selected for transcription
on the basis of their acoustic quality. The research team had no control over
when or where the recordings were made, nor over technical aspects such as
checking audio levels, environmental noise and changing batteries in the
recorder. María’s interlocutors did not sign consent forms or fill in
questionnaires, and so the transcripts of the 15 recordings only
represent María’s speech, while utterances from other speakers are
transcribed as “www”.
In total, the corpus consists of 242,475 words of text from 35 hours of
recorded conversation. The transcriptions (in CHAT format) are linked to the
digitized recordings through sound links at the end of each main tier. Most
recordings were in stereo, and were made using Marantz, Zoom or
Microtrack digital audio recorders.
The recordings were made at a place convenient for the speakers, e.g. at
their homes or workplaces. After setting up the equipment the researcher
would leave the speakers to talk freely with one another. In some cases the
researcher re-entered briefly during the recording. This is noted in the
transcripts and speech by the researcher is usually not transcribed. The first
five minutes of all recordings after the point when the researcher left the room
have been deleted, in case the participants’ speech was initially affected by
the presence of the recorder.
At the end of each recording all participants were asked to fill in
questionnaires providing background information regarding their age, gender,
location of places lived, etc, in order to provide information for sociolinguistic
analysis. They were also asked to sign consent forms giving permission for
their recording and its transcript to be used for research purposes and to be
submitted to online linguistic archives. The consent form included the
provision that the names of speakers and other people named in the
recording would be replaced by pseudonyms in the transcript. In the case of
children of 16 years or younger, a consent form was also signed by a parent
or guardian.
There are a few instances where speakers who have not given consent
feature in recordings, e.g. a neighbour walking in briefly. In these cases the
utterances have been transcribed as “www” and replaced by silence in the
audio file. This can sometimes mean that parts of the consenting participants’
speech are lost as well where there is overlap with the non-consenting
speaker. In addition, beeps have been placed over the names of people
about whom sensitive information is given.
Sound and transcription files in the corpus are named after the researcher
who did the recording and are numbered in order of the sequence of
recording. The sound and transcription files for each conversation share the
filename, but have different file extensions (‘*.wav’/‘*.mp3’ for the sound file
and ‘*.cha’ for the transcription).
For example, Sastre2.cha is the
transcription of the second recording made by Sastre (sound file
Sastre2.wav). Basic details regarding the context of each conversation and
the speakers involved are given in the transcript headers. Some additional
information about the speakers and recordings is available to researchers on
request.
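As an illustration of the naming convention just described, the following Python sketch derives the researcher name, the recording number and the candidate sound-file names from a transcript filename. The function names are hypothetical, and the sketch assumes that every filename ends in a recording number, as in ‘Sastre2.cha’.

```python
from pathlib import PurePosixPath

# Sound-file extensions mentioned in the text above.
AUDIO_EXTENSIONS = (".wav", ".mp3")


def split_recording_name(filename: str):
    """Return (researcher, recording number) for a name such as 'Sastre2.cha'."""
    stem = PurePosixPath(filename).stem
    researcher = stem.rstrip("0123456789")
    number = int(stem[len(researcher):])  # assumes a trailing number is present
    return researcher, number


def audio_candidates(transcript: str):
    """List the sound-file names that share the transcript's filename."""
    stem = PurePosixPath(transcript).stem
    return [stem + ext for ext in AUDIO_EXTENSIONS]


print(split_recording_name("Sastre2.cha"))
print(audio_candidates("Sastre2.cha"))
```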
All recordings have been transcribed in the CHAT transcription and coding
format (MacWhinney 2000), in accordance with the 2012 version of the online
manual available at http://childes.psy.cmu.edu/manuals/chat.pdf. All
references to the CHAT manual in this document are to this online version.
All transcripts have been done by trained transcribers working on the project:
Fraibet Aveledo, Diana Carter, Marika Fusser, Lowri Jones, M. Carmen
Parafita Couto, Myfyr Prys and Jonathan Stammers. Additionally, teams from
Penn State University (Amelia Dietrich, Giuli Dussias, Chip Gerfen, Rosa
Guzzardo, and Jorge Valdes Kroff) and the Australian National University
(Bronwyn Wrigley, Manuel Delicado, and Jennifer Plaistowe) collaborated in
the transcription process.
For 10% of the transcripts an independent transcription was done, in which a
member of the transcription team transcribed one (randomly selected) minute
of the recording independently from the original transcriber of that particular
transcript. Transcripts were then compared and a rate of similarity was
calculated. The average reliability score for independent transcriptions was
83% (see footnote 3). Furthermore, all the transcripts were proofread by
another member of the transcription team and corrections made accordingly.
An additional team of transcribers and checkers included the following
researchers in addition to the original transcription team: Margaret Deuchar,
Sarah Fairchild, Marika Fusser, Lara Gil Vallejo, Guillermo Montero Melis,
Esther Nuñez, Susana Sabin-Fernández, and Jonathan Stammers.
Footnote 3: An innovative method was used based on Turnitin plagiarism
detection software (http://www.turnitin.com). See Deuchar, M., Davies, P.,
Herring, J.R., Parafita Couto, M. & Carter, D. (in press) Building bilingual
corpora: Welsh-English, Spanish-English and Spanish-Welsh. In I. Mennen and
E. Thomas (eds) Unravelling Bilingualism. Multilingual Matters.
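To give a rough idea of what a rate of similarity between two independent transcriptions of the same minute can look like, here is a minimal Python sketch. It is not the method used by the project, which relied on Turnitin; it substitutes the standard-library `difflib.SequenceMatcher` applied to whitespace-separated words, and the function name `transcript_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def transcript_similarity(first: str, second: str) -> float:
    """Word-level similarity between two transcriptions, from 0.0 to 1.0.

    Stand-in measure: 2 * (number of matching words) / (total number of words).
    """
    words_first = first.split()
    words_second = second.split()
    matcher = SequenceMatcher(None, words_first, words_second, autojunk=False)
    return matcher.ratio()


# Two invented four-word transcriptions that differ in the last word.
print(transcript_similarity("a b c d", "a b c e"))
```

Averaging such a score over the sampled minutes would give a figure comparable in kind, though not in detail, to the reliability score reported above.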
All transcripts contain at least three different tiers. In addition to the main tier,
required by CHAT, we use an automatically generated gloss tier (%xaut) for
the closest English equivalent for each word (including morphological
information where relevant), and a translation tier (%eng), which contains a
free translation of the main tier. A comments tier (%com) has also been used
occasionally for comments by the transcriber that are specific to the utterance
in the corresponding main tier. All main tiers include a sound link to the
corresponding section of the recording.
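The relationship between a main tier and its dependent tiers can be sketched in Python as follows. The sketch assumes the usual CHAT layout in which main tiers start with `*SPEAKER:` and dependent tiers with `%name:`; the sample lines are invented for illustration and are not taken from the corpus, sound links are omitted, and continuation lines are ignored. The function name `group_tiers` is hypothetical.

```python
def group_tiers(lines):
    """Attach each dependent tier (%...) to the main tier (*...) that precedes it."""
    utterances = []
    for line in lines:
        if line.startswith("*"):
            speaker, _, text = line[1:].partition(":")
            utterances.append({"speaker": speaker, "main": text.strip(), "tiers": {}})
        elif line.startswith("%") and utterances:
            name, _, text = line[1:].partition(":")
            utterances[-1]["tiers"][name] = text.strip()
        # header lines (starting with "@") and anything else are skipped
    return utterances


# Invented sample, not taken from the corpus.
sample = [
    "@Begin",
    "*MAR:\tquiero lunch@s:eng .",
    "%xaut:\twant.V.1S.PRES lunch.N.SG",
    "%eng:\tI want lunch .",
]
for utterance in group_tiers(sample):
    print(utterance)
```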
The following contributed to the translation tier: Adriana Acevedo, Olga
Bolaños, Vanesa Bonavota, Rubén Chapela, Magdalena Gazda, Ana
Muerza, Renata Kendall, Mary Silva, Sara Viñas, and Renée Zeichen.
The remainder of this document outlines the conventions used in the main tier
and the gloss tier.
C MAIN TIER
1. Layout of transcription
1.1. Since the theoretical aims of the project included clause-based analysis,
the transcribed data are divided into clauses where possible. Where an
utterance contains two main clauses, each clause in that utterance is
written on a separate main tier. Complex clauses are treated as one
clause and therefore subordinate clauses are included in the same tier
as their main clauses. Adverbial clauses are also written on the same
main tier as their related main clause.
1.2. Each main tier is divided into units which we call ‘words’ for the purposes
of these conventions. With some exceptions (see C.1.3) a word is
considered to be a continuous sequence of characters containing no
spaces as found in the Webster’s Dictionary for English, and in the
Diccionario de la Lengua Española online from the Real Academia
Española and the Diccionario de Americanismos (2010) for Spanish.
These are referred to as DLE and DA respectively throughout this
document. Where items are entered as two hyphenated words in these
reference dictionaries, they are connected by an underscore in the
transcripts. When one of the reference dictionaries offers more than one
alternative (e.g. ‘minibus’ ‘mini-bus’ or ‘mini bus’), or when the reference
dictionaries differ from each other, the most compact alternative is
chosen (‘minibus’ in this case).
1.3. Other items which are treated as words are:
(a) interjections and interactional markers, e.g. ‘ajá’ (= ‘aha’), ‘ay’ (=
‘oh’), ‘mmhm’ (= ‘mhm’), etc.
(b) proper names (including names of books, films, organisations
etc.), a sequence of words being connected by underscores,
e.g., ‘Nueva_York’.
(c) abbreviations (connected by underscore), e.g. ‘B_B_C’
(d) examples of phrases that are not found in the DLE and DA are
listed below.
Transcribed form  Conventional form  English
ni_fu_ni_fa       ni fu ni fa        neither nor
no_más            no más             only
o_k               OK                 OK
vale_turca        vale turca         it doesn't matter
o_la_la           olalá              ooh la la
copo_de_nieve     copo de nieve      guelder rose
1.4. There are some continuous sequences of characters in the main tier
which are not treated as words. These include simple events such as
‘&=laugh’ (see CHAT 7.6.1), ‘xxx’ for unintelligible sounds, or the use of
an ampersand (‘&’) plus phonetic characters for intelligible sounds
without clear meaning (see CHAT 6.4 for both).
1.5. Please note that pause markings are not used consistently in the
transcripts. Additionally, pauses between utterances are usually not
marked. We have used the ‘lazy overlap’ markings (‘+>’) for overlapping
speech.
2. Language marking
2.1. A default language is assigned to each transcription based on the
language contributing the greater number of words. The default language
is the first language listed in the @Language tier in the file header, and is
indicated by the ISO-639-3 abbreviation for the language: spa = Spanish,
eng = English. Words without any language markers in the transcription
are in the default language unless they are part of an utterance preceded
by a precode indicating that it is in a non-default language – see next
paragraph for details.
2.2. Individual utterances in the second or third most frequent language are
marked with precodes at the beginning of the main tier: e.g. [- eng] for
English, [- spa] for Spanish and these utterances contain no language
tags. In mixed utterances each word in the non-default language is
marked by a tag consisting of @s: followed by the relevant ISO-639-3
abbreviation: @s:spa = Spanish, @s:eng = English, @s:eng&spa =
undetermined (see below, 2.4), @s:spa+eng = word with first
morpheme(s) Spanish, final morpheme(s) English , @s:eng+spa = word
with first morpheme(s) English, final morpheme(s) Spanish.
2.3. A word or morpheme is considered to be English if it can be found in any
of the English-language reference dictionaries. A word or morpheme is
considered to be Spanish if it or all its elements are found in either of the
Spanish reference dictionaries (e.g. ‘principito’ is considered to be a
Spanish word because ‘príncipe’ and ‘-ito’ are both listed in DLE).
However, we have considered some words not listed in the dictionaries
to be either English or Spanish, as indicated in the list below.
Transcribed form  Language  English equivalent
cucu              Spanish   bottom
estrech           Spanish   stretch (jeans)
2.4. The language marker @s:eng&spa is used with words where the
language source is undetermined. It marks words that occur in the
lexicon of both languages, (as determined by the respective reference
dictionaries), that are pronounced in a way that is possible both in
English and in Spanish, e.g. [pjano] (‘piano’ in both languages).
2.5. @s:eng&spa also marks interjections and interactional markers that may
be interpreted as ambiguous, e.g. ‘ah’, ‘oh’. Other interjections and
interactional markers are assigned language markers according to their
inclusion (or not) in the reference dictionaries. For example, ‘ay’ (= ‘oh’) is
marked @s:spa as it is only found in the Spanish-language reference
dictionaries.
2.6. Where a lexeme could belong to both languages, but its pronunciation in
a specific occurrence belongs unambiguously to one language only, it
will be marked @s:eng or @s:spa (and written in the orthography of that
language) according to its pronunciation. For example, if ‘hotel’ is
pronounced with initial [h], it will be marked @s:eng, without initial [h] it
will be marked @s:spa.
2.7. Proper names and titles are marked ‘@s:eng&spa’ (undetermined)
unless there are alternatives in each language in general use, e.g.
‘Caracas@s:eng&spa’, ‘Sears@s:eng&spa’ but ‘New_York@s:eng’ and
‘Nueva_York@s:spa’ (the Spanish name for ‘New York’).
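The language-marking rules in 2.1 and 2.2 can be summarised in a short Python sketch that assigns a language to every word of an utterance. The function name `word_languages` is hypothetical, the default language is passed in rather than read from the @Language header, and punctuation and non-words are not filtered out.

```python
import re

# Precode at the start of a main tier, e.g. "[- eng]".
PRECODE = re.compile(r"^\[- (\w+)\]\s*")


def word_languages(utterance: str, default: str = "spa"):
    """Return (word, language) pairs for one utterance.

    A precode overrides the default language for the whole utterance;
    otherwise an "@s:" tag on a word overrides it for that word only.
    """
    match = PRECODE.match(utterance)
    if match:
        default = match.group(1)
        utterance = utterance[match.end():]
    pairs = []
    for token in utterance.split():
        word, separator, tag = token.partition("@s:")
        pairs.append((word, tag if separator else default))
    return pairs


print(word_languages("quiero lunch@s:eng"))
print(word_languages("[- eng] I want lunch"))
print(word_languages("piano@s:eng&spa"))
```

Tags such as `eng&spa` (undetermined) and `spa+eng` (mixed word) are returned unchanged, so a caller can decide how to treat them.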
3. Orthography
3.1. We have used a Unicode font (http://en.wikipedia.org/wiki/Unicode) for
the transcription. Occasional non-lexical phonological fragments are spelt
out following an ampersand using IPA symbols
(http://www.langsci.ucl.ac.uk/ipa/ipachart.html) (e.g. &ʧʊ), and these may
not show up correctly if a Unicode font is not used.
3.2. Words marked as ‘@s:spa’ (Spanish) are transcribed in conventional
Spanish orthography.
3.3. Words considered to be Spanish are transcribed in Spanish orthography.
We have not represented regional variation in the transcripts, except in
cases which have orthographic representation in the Spanish-language
reference dictionaries.
3.4. Words whose language source is undetermined are transcribed in
English rather than in Spanish orthography, e.g. football, internet, lunch,
etc.
D. GLOSS TIER
2. Principles
Each word (see C.1.2 and C.1.3) in the main tier is given a gloss in the gloss
tier (%aut). The gloss tier has been produced automatically using the Bangor
Autoglosser (http://bangortalk.org.uk/autoglosser.php), free (GPL) software
developed at the Centre – for further details see Donnelly and Deuchar 2011.
The transcripts were manually corrected after autoglossing to deal with the
small number (less than 2%) of incorrectly-attributed glosses.
2.1. Non-words are not glossed.
2.2. All words are glossed with the closest English-language equivalent (in
lower case) and, where appropriate, information about parts of speech.
English equivalents of proper names are used where they exist (for
example, ‘Nueva_York@s:spa’ is glossed as ‘New_York’). If there is no
English-language equivalent to a name, it is glossed ‘name’.
2.3. The underscore is used in the gloss tier to connect more than one lexical
item in a gloss, where the English translation of a single Spanish word
involves more than one word. For example, ‘veinticinco’ is glossed as
‘twenty_five’.
2.4. The English lexeme in a gloss is followed by information about parts of
speech, separated by dots. Some examples:
 Spanish ‘hijos’ is glossed ‘son.N.M.PL’, which means ‘plural of the
masculine noun “hijo”’;
 Spanish ‘me’ is glossed ‘me.PRON.OBL.MF.1S’, meaning ‘oblique
pronoun, 1st person singular, masculine or feminine’;
 English "wouldn't" is glossed "be.V.1S.COND+NEG", meaning "the
first person singular conditional tense of the verb 'be', with an
attached negative marker".
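The way the abbreviations combine in a gloss can be made concrete with a small Python sketch that expands each tag into its description. Only a handful of entries from the abbreviation table in this document are included, and the names `ABBREVIATIONS` and `describe_gloss` are hypothetical; unknown tags are returned unchanged.

```python
# A small subset of the parts-of-speech abbreviations listed in this document.
ABBREVIATIONS = {
    "N": "noun",
    "V": "verb",
    "PRON": "pronoun",
    "M": "masculine",
    "MF": "masculine or feminine",
    "PL": "plural",
    "OBL": "oblique",
    "1S": "1st person singular",
    "COND": "conditional",
    "NEG": "negative",
}


def describe_gloss(gloss: str):
    """Return the lexeme and a readable description of each tag.

    Tags attached with "+" (e.g. "+NEG") are treated like dot-separated tags.
    """
    lexeme, *tags = gloss.replace("+", ".").split(".")
    return lexeme, [ABBREVIATIONS.get(tag, tag) for tag in tags]


print(describe_gloss("son.N.M.PL"))
print(describe_gloss("be.V.1S.COND+NEG"))
```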
3. Parts of speech abbreviations.
Abbreviation  Representing
0             impersonal
123S          1st, 2nd, 3rd person singular
13S           1st, 3rd person singular
1P            1st person plural
1S            1st person singular
23P           2nd, 3rd person plural
23S           2nd, 3rd person singular
23SP          2nd, 3rd person singular or plural
2P            2nd person plural
2S            2nd person singular
2SP           2nd person singular or plural
3P            3rd person plural
3S            3rd person singular
3SP           3rd person singular or plural
ADJ           adjective
ADV           adverb
AM            aspirate mutation
ASV           adjective, singular noun, or verb
AUG           augmentative
COMP          comparative
COND          conditional
CONJ          conjunction
DEF           definite
DEM           demonstrative
DET           determiner
DIM           diminutive
E             exclamation
EMPH          emphatic
F             feminine
FAR           far (demonstrative)
FOCUS         item with focus
FUT           future
GER           gerund
H             pre-vocalic h after 3S.F, 1P and 3P possessives
HYP           hypothetical
IM            interactional marker
IMPER         imperative
IMPERF        imperfect
INDEF         indefinite
INFIN         infinitive
INT           interrogative
INTENS        intensive
M             masculine
MF            masculine or feminine
N             noun
NEAR          near (demonstrative)
NEG           negative
NM            nasal mutation
NT            neuter
NUM           numeral
OBJ           object
OBL           oblique
ORD           ordinal
PAST          past
PASTPART      past participle
PL            plural
PLUPERF       pluperfect
POSS          possessive
PRECLITIC     accented form before clitics
PREP          preposition
PREQ          pre-qualifier
PRES          present
PRESPART      present participle
PRON          pronoun
PRT           particle
QUAN          quantifier
REFL          reflexive
REL           relative
SG            singular
SM            soft mutation
SP            singular or plural
SUB           subject
SUBJ          subjunctive
SUP           superlative
SV            singular noun or verb
TAG           tag question
V             verb
****************************************************************************************
REFERENCES
Diccionario de Americanismos. Asociación de Academias de la Lengua Española
(2010).
Diccionario de la Lengua Española. Real Academia Española. (www.rae.es)
Deuchar, M., Davies, P., Herring J.R., Parafita, M.C., and Carter, D. (in press)
Building bilingual corpora: Welsh-English, Spanish-English and Spanish-Welsh.
In I. Mennen and E. Thomas (eds) Unravelling Bilingualism. Multilingual Matters.
Donnelly, K. and Deuchar, M. (2011) Using constraint grammar in the Bangor
Autoglosser to disambiguate multilingual spoken text. In: Constraint Grammar
Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia. Tartu:
NEALT Proceedings Series.
(http://dspace.utlib.ee/dspace/handle/10062/19298)
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.).
Mahwah, NJ: Lawrence Erlbaum Associates.
APPENDIX
File summary:
File name   Length (h:mm:ss)  No. of main participants  Age (years)  Sex
HERRING1    0:32:18           2                         24, 27       F, F
HERRING2    0:30:42           2                         21, 19       M, M
HERRING3    0:31:37           2                         37, 41       F, M
HERRING5    0:27:10           2                         41, 40       F, M
HERRING6    0:28:14           2                         43, ?        F, M
HERRING7    0:24:53           2                         22, ?        M, M
HERRING8    0:29:43           2                         39, 42       F, M
HERRING9    0:32:39           2                         21, 20       F, M
HERRING10   0:33:52           2                         33, 34       F, F
HERRING11   0:31:00           2                         64, 63       M, F
HERRING12   0:33:06           2                         22, 20       M, M
HERRING13   0:29:53           2                         ?, 32        F, F
HERRING14   0:30:04           2                         20, 23       M, F
HERRING15   0:29:53           2                         ?, 21        M, M
HERRING16   0:30:51           2                         24, 30       M, M
HERRING17   0:29:58           2                         ?, 25        M, F
SASTRE1     0:33:52           2                         57, 44       M, F
SASTRE2     0:41:00           2                         78, 55       F, M
SASTRE3     0:43:02           3                         37, 43, 52   M, M, F
SASTRE4     0:31:26           2                         29, 22       F, F
SASTRE5     0:29:03           2                         36, 66       F, F
SASTRE6     0:30:20           2                         43, 42       M, F
SASTRE7     0:29:58           2                         19, 15       F, F
SASTRE8     0:33:20           2                         63, 13       F, F
SASTRE9     0:40:02           2                         48, 60       F, F
SASTRE10    0:39:40           2                         35, 35       F, F
SASTRE11    0:40:25           2                         30, 60       M, F
SASTRE12    0:30:59           2                         48, 41       F, F
SASTRE13    0:29:43           2                         25, 19       M, F
ZELEDON1    0:29:38           2                         26, 21       F, F
ZELEDON2    0:26:53           2                         22, 21       M, F
ZELEDON3    0:30:25           2                         19, 11       F, M
ZELEDON4    0:21:48           2                         40, ?        M, M
ZELEDON5    0:23:41           2                         35, 37       F, F
ZELEDON6    0:30:25           2                         21, 19       F, F
ZELEDON7    0:30:20           2                         19, 21       F, M
ZELEDON8    0:37:53           2                         42, 45       F, F
ZELEDON9    0:30:51           2                         12, 09       F, F
ZELEDON11   0:30:40           2                         21, 25       M, M
ZELEDON13   0:34:42           2                         18, 19       F, F
ZELEDON14   0:33:01           2                         22, 19       F, F
MAR1        0:15:02           1                         45           F
MAR2        0:01:42           1                         45           F
MAR4        0:17:22           1                         45           F
MAR7        0:04:34           1                         45           F
MAR10       0:17:32           1                         45           F
MAR16       2:41:36           1                         45           F
MAR18       1:38:40           1                         45           F
MAR19       0:53:58           1                         45           F
MAR20       0:31:50           1                         45           F
MAR21       0:05:29           1                         45           F
MAR24       0:41:40           1                         45           F
MAR27       1:22:55           1                         45           F
MAR30       0:59:58           1                         45           F
MAR31       1:45:40           1                         45           F
MAR40       2:25:47           1                         45           F
Total       35:11:04          84
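For readers who want to work with the recording lengths in the file summary, the following Python sketch adds up a list of lengths given as h:mm:ss (or mm:ss) strings. The helper names are hypothetical, and the usage example covers only the first three ‘MAR’ recordings rather than the whole corpus.

```python
def to_seconds(length: str) -> int:
    """Convert 'h:mm:ss' or 'mm:ss' to a number of seconds."""
    seconds = 0
    for part in length.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def format_length(seconds: int) -> str:
    """Format a number of seconds as 'h:mm:ss'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def total_length(lengths) -> str:
    """Sum a list of recording lengths."""
    return format_length(sum(to_seconds(length) for length in lengths))


# Lengths of MAR1, MAR2 and MAR4 from the file summary.
print(total_length(["0:15:02", "0:01:42", "0:17:22"]))  # prints 0:34:06
```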
6. Welsh Transcription Conventions
This section documents the transcription conventions specific to the Bangor Siarad, Pilot,
and Bangor 2 corpora. The three sections are: main tier, gloss tier, and tags.
A MAIN TIER
1. Layout of transcription
1.1. Since we are primarily interested in clauses, the data is divided into clauses as
far as possible. Where an utterance contains two main clauses, each clause in
that utterance is written on a separate main tier. Complex clauses are treated as
one clause and therefore subordinate clauses are included in the same tier as
their main clauses. Adverbial clauses are also written on the same main tier as
their related main clause.
1.2. Each main tier is divided into units which we call, for the purposes of these
conventions, ‘words’. With some exceptions (see C.1.3) a word is considered to
be a continuous sequence of characters containing no spaces, as found in
Geiriadur Prifysgol Cymru (Thomas 1950-2004), Geiriadur yr Academi
(Griffiths & Jones 1995), Cysgeir (2004) or the Oxford English Dictionary online
(2008). These are referred to as GPC, GyrA, Cysgeir and OED respectively
throughout this document. Where items are treated as hyphenated by these
reference dictionaries, they are connected by underscore in the transcripts. When
one of the reference dictionaries offers more than one alternative (e.g. ‘minibus’
‘mini-bus’ or ‘mini bus’), or when the reference dictionaries differ from each
other, the most compact alternative is chosen (‘minibus’ in this case).
1.3. Other items which are treated as words are:
i. interjections and interactional markers, e.g. ah, er, um etc.
ii. proper names (including names of books, films, organisations etc.), a
sequence of words being connected by underscores, e.g. Elton_John,
Hong_Kong, One_Flew_Over_the_Cuckoo’s_Nest
iii. abbreviations (connected by underscore), e.g. N_S_P_C_C
iv. numbers between eleven and ninety-nine in Welsh and between twenty-one and ninety-nine in English, e.g. pedwar_deg_pump, forty_five. Note
that other numbers such as those containing ‘hundred’, ‘thousand’ etc. are
transcribed as separate words, e.g. one hundred and seventy_three, cant
saith_deg_tri
v. some prepositions and adverbs, usually represented as two words, whose
individual parts are meaningless or difficult to translate in isolation, e.g.
oddi_wrth. See a full list below in C.3.4 vii.
1.4. Contractions that do not have entries in one of the Welsh-language reference
dictionaries (namely GPC, GyrA or Cysgeir) or in King (2003), are transcribed
in full, but the unpronounced parts are bracketed. For example, the
pronunciation of ‘fel yna’ (like that) as [vɛla] in speech is represented in the
transcripts as ‘fel (yn)a’.
1.5. There are some continuous sequences of characters in the main tier which are
not treated as words. These include simple events such as ‘&=laugh’ (see CHAT
7.6.1), ‘xx’ or ‘xxx’ for unintelligible sounds, or the use of an ampersand plus
phonetic characters for intelligible sounds without clear meaning (see CHAT 6.4
for both).
1.6. Note that we do not follow the guidelines for collocations, compounds and
linkages given in the CHAT manual (see CHAT 6.6.2). We consider
collocations and compounds to be single items only if they are considered words
according to the definition given in C.1.2 and C.1.3. For example, CHAT gives
the option of writing ‘peanut butter’ or ‘peanut+butter’, and ‘Star Wars’ or
‘Star+Wars’. According to our conventions, the first phrase should be
transcribed as two separate words (‘peanut butter’) as this is how the phrase
appears in the OED. The second case should be transcribed as ‘Star_Wars’ as it
is a film title. Note that we do not use the plus sign within words under any
circumstances.
2. Language marking
2.1. Each word in the main tier is assigned a language marker, which consists of @s:
plus one or two other letters which denote its language: @s:w = Welsh, @s:e =
English, @s:u = undetermined, @s:ew = word with first morpheme(s) English,
second morpheme(s) Welsh, @s:we = word with first morpheme(s) Welsh,
second morpheme(s) English. Other languages have been coded as they have
arisen and have been included in the depfile, e.g. @s:f = French
2.2. A word or morpheme is considered to be Welsh if it can be found in any of the
Welsh-language reference dictionaries or in King (2003).
2.3. Words which contain two or more morphemes from different languages are
marked as mixed-language words, e.g. ‘concentrate_io@s:ew’ (to concentrate).
However, where a word containing at least one English morpheme and at least
one Welsh morpheme is included in one or more of the Welsh-language
reference dictionaries, it is marked as a Welsh word. For example, the English
word ‘use’ forms the basis of the Welsh word ‘iwsio’ (to use) but we mark the
entire word as Welsh (‘iwsio@s:w’) because it is included in one of the Welsh-language reference dictionaries.
2.4. The language marker @s:u marks words that occur in the lexicon of both
languages, (as determined by the Welsh-language reference dictionaries for
Welsh or by the OED for English), that are pronounced in a way that is possible
both in Welsh and in English, e.g. [ˈʌŋkl] / [ˈəŋkl] (‘uncle’ in English or ‘yncl’
in Welsh) or [mat] (‘mat’ in both languages). @s:u also marks a specified list of
interjections and interactional markers, e.g. ah, ahhah, aw, er, ey, hmm, ho,
mmm, mmhm, oh, ooh, ow, ugh, um. Other interjections and interactional
markers are assigned language markers according to their inclusion or not in the
reference dictionaries. For example, ‘ych’ (a marker of disgust equivalent to
‘yuk’ in English) is marked @s:w as it is only found in the Welsh-language
reference dictionaries.
2.5. Where a lexeme could belong to both languages, but its pronunciation in a
specific occurrence belongs unambiguously to one language only, it will be
marked @s:w or @s:e (and written in the respective orthography) according to
the pronunciation, e.g. ‘problem@s:w’ for the specifically (north-west) Welsh
pronunciation of the second vowel as [a], but ‘problem@s:u’ where the second
vowel is [ə] or [ɛ], which is possible in both English and Welsh; ‘toast@s:e’
where the word is pronounced with [əʊ] / [oʊ] as in English only, and
‘tost@s:w’ where it is pronounced with [ɒ] as in southern Welsh, but
‘toast@s:u’ where the word is pronounced with [o:] as in northern Welsh or
some varieties of Welsh English.
2.6. Proper names and titles are marked as undetermined unless there are alternatives
in each language in general use, e.g. Elton_John@s:u,
One_Flew_Over_the_Cuckoo’s_Nest@s:u, Hong_Kong@s:u, Tebot_Piws@s:u
(a Welsh-language pop group, literally meaning ‘purple teapot’) but
Cardiff@s:e, Caerdydd@s:w (the Welsh word for ‘Cardiff’).
2.7. According to GPC, the -s plural ending is an established loan in the Welsh
lexicon. Any plural formed with the -s ending is assigned the language marker of
the previous morpheme. For example, ‘pregethwrs@s:w’ from ‘pregethwr@s:w’
(preacher), ‘dolphins@s:u’ from ‘dolphin@s:u’ and ‘dogs@s:e’ from
‘dog@s:e’.
2.8. In multi-word phrases, each word is tagged separately, regardless of the phrase’s
internal syntax. For example, in ‘traffic@s:u lights@s:e’ ‘traffic’ is coded as
undetermined, although the syntax of the whole phrase comes from English.
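As an illustration of how the markers in 2.1 and the plural rule in 2.7 might be processed, the following Python sketch splits a main-tier token into its word and its language code. It is not part of the TalkBank or CLAN tools: the function names and the marker table are assumptions made for this example, and the table covers only the codes named in 2.1.

```python
import re

# Codes named in section 2.1. Other codes from the depfile are not listed
# here, so unknown codes are passed through unchanged (an assumption).
LANGUAGE_NAMES = {
    "w": "Welsh",
    "e": "English",
    "u": "undetermined",
    "ew": "English+Welsh",
    "we": "Welsh+English",
    "f": "French",
}

# A token is a word followed by '@s:' and one or more lower-case letters.
TOKEN_RE = re.compile(r"^(?P<word>.+)@s:(?P<code>[a-z]+)$")


def split_marker(token):
    """Return (word, code) for a token such as 'iwsio@s:w'."""
    match = TOKEN_RE.match(token)
    if match is None:
        raise ValueError(f"no language marker on {token!r}")
    return match.group("word"), match.group("code")


def describe(token):
    """Return (word, language name) for a marked token."""
    word, code = split_marker(token)
    return word, LANGUAGE_NAMES.get(code, code)


def plural_s(token):
    """Rule 2.7: an -s plural keeps the marker of the preceding morpheme."""
    word, code = split_marker(token)
    return f"{word}s@s:{code}"


print(describe("concentrate_io@s:ew"))
print(plural_s("pregethwr@s:w"))
```

Because each word carries its own marker (2.8), a phrase such as ‘traffic@s:u lights@s:e’ would simply be processed token by token.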
3. Orthography
3.1. Words marked as English are transcribed in standard English orthography,
including contractions, such as ‘isn’t’. Some non-standard spellings for
colloquial forms such as ‘gonna’ are used.
3.2. Words whose language is undetermined are transcribed in English rather than in
Welsh orthography, e.g. ‘acid@s:u’ rather than ‘asid@s:u’. This is in order to
make the corpus more accessible to non-Welsh-speakers who might use the data.
3.3. When words marked as English or undetermined are mutated (where the sound
of an initial consonant is changed depending on the grammatical context, see for
example King 1993:14-20), the initial (mutated) sound is written in Welsh
orthography and the rest in English, e.g. ei@s:w firthday@s:e = his birthday;
ei@s:w goat@s:u = his coat. In the case of words that begin with ‘qu’ in
English but that are mutated in the data, the mutated sound and the following
[w] are written in Welsh orthography, e.g. question (unmutated), gwestion (soft
mutation), chwestion (aspirate mutation), nghwestion (nasal mutation).
3.4. Words marked as Welsh are transcribed in Welsh orthography. We have not
represented regional variation in the transcripts, except in cases which have
orthographic representation in the Welsh-language reference dictionaries or in
King (2003).
There are some cases where we differ from the standard orthography:
i. We transcribe some non-standard verb-noun suffixes, e.g. ‘-ian’ in
‘swnian’ (to grumble) rather than ‘-io’ in the standard form ‘swnio’.
ii. We represent non-standard usage of inflected prepositions. Agreement
markers for person and number show considerable variation in the spoken
language. Thus one may, for example, find several forms for ‘to you’
(plural/respect form), such as ‘wrthoch chi’ (the variant found in King
2003), ‘wrthych (chi)’ (more formal variant, e.g. prescribed in Thomas
1996) as well as ‘wrthach chi’ (more colloquial, northern variant). The
orthography used in transcripts is based on pronunciation (note that the
Welsh orthographic system has a fairly regular relationship between sound
and spelling, so representing sound variation is possible in Welsh in a way
that is not in English).
iii. Northern second person singular verb and preposition endings not usually
represented in writing are transcribed as ‘-a’ where they are followed by
the pronoun ‘chdi’, e.g. oedda chdi (you were), arna chdi (on you). Where
they occur in isolation, they are transcribed as ‘-achd’, e.g. oeddachd (you
were/weren’t you), arnachd (on you).
iv. We do not represent morpheme-final [v] when it is not pronounced. For
example, [pɛntrɛ] (village) is written ‘pentre’ in the transcripts rather than
‘pentref’ (as the word is represented in the Welsh-language reference
dictionaries).
v. Morpheme-initial /r/ is transcribed as ‘r’ even when the standard
orthography prescribes ‘rh’ (pronounced [r̥]) as [r̥] is often absent from
speakers’ phonological systems. ‘rh’ is only transcribed where [r̥] is
clearly discernible.
vi. We have represented mutation (sound change to initial consonants) or its
absence without following prescriptive rules. Thus ‘in Cardiff’ may be
transcribed ‘yn Caerdydd’ and ‘yn Gaerdydd’ as well as the standard form
‘yng Nghaerdydd’, according to what is heard. We have also transcribed
the aspirate mutation of /m/ and /n/ after the 3rd singular feminine
possessive adjective common in northern varieties, e.g. ‘ei mham’ (her
mother), rather than standard ‘ei mam’.
vii. We list below the phrases described in C.1.3 v which we transcribe using
underscore to link the individual words.
Our transcription                   Standard                            English translation
ar_draws                            ar draws                            across
ar_goll                             ar goll                             lost
ar_gyfer                            ar gyfer                            for
ar_ôl                               ar ôl                               after
cyn_belled                          cyn belled                          as far
dim_byd                             dim byd                             nothing
ein_gilydd, eich_gilydd, ei_gilydd  ein gilydd, eich gilydd, ei gilydd  each other (1st, 2nd and 3rd person plural)
er_mwyn                             er mwyn                             for
ers_talwm                           ers talwm                           in the past, long ago
i_fewn                              i fewn                              in(to)
i_ffwrdd                            i ffwrdd                            away
i_fyny                              i fyny                              up
i_gyd                               i gyd                               all
i_lawr                              i lawr                              down
i_mewn                              i mewn                              in(to)
naill_ai                            naill ai                            either
o_gwbl                              o gwbl                              at all
o_gwmpas                            o gwmpas                            around
oddi_ar                             oddi ar                             off
oddi_wrth                           oddi wrth                           from
oni_bai                             oni bai                             unless
pob_dim                             pob dim                             everything
ta_waeth                            ’ta waeth                           anyway
un_ai                               un ai                               either
wrth_gwrs                           wrth gwrs                           of course
yn_erbyn                            yn erbyn                            against
yn_ôl                               yn ôl                               back
yn_ystod                            yn ystod                            during
viii. We list below some colloquial forms which are not represented in
the Welsh-language reference dictionaries but which we have transcribed
as indicated:
Our spelling                     Standard                   Meaning                                     Comments
(r)hein, (r)hain, (r)heiny etc.  rhein, rhain, rheiny etc.  these, those etc.                           pronounced with initial [h]
byswn i, bysa chdi etc.          baswn i, baset ti etc.     I would, you would etc.                     very common in northern varieties
cynna fi, cynna chdi etc.        -                          before me, before you etc.                  preposition inflected in northern varieties
dylen i, bydden i etc.           dylwn i, byddwn i etc.     I should, I would etc.                      common in southern varieties
gosa                             -                          unless                                      heard in northwestern varieties
m                                ’m                         my                                          usually connected by apostrophe to a preceding vowel
mag                              mae                        3rd singular present form of ‘bod’ (to be)  heard in southwestern varieties
molchi                           ymolchi                    wash oneself                                mutates to ‘folchi’
mynedd                           amynedd                    patience                                    mutates to ‘fynedd’
na i etc.                        a i etc.                   I will go etc.                              heard in the Caernarfon area
nunman                           unman                      nowhere                                     widespread
oedd nhw, wneith nhw etc.        oedden nhw, wnan nhw etc.  they were, they will etc.                   3rd person singular verb forms used with plural pronouns
penwsnos                         penwythnos                 weekend                                     GPC has an entry for ‘wsnos’
tes i (ddi)m                     es i’m                     I didn’t go                                 some northern varieties
w                                ’w                         his/her/their                               usually appears with apostrophe, e.g. ‘i’w’, but we transcribe as ‘i w’ (to his/her/its)
wannwyl                          -                          dear Lord                                   a contraction of ‘Duw annwyl’
whi                              hwyaid                     ducks                                       heard in some southern varieties
y fi                             rwy i                      I am                                        southern Welsh
B GLOSS TIER
1. Principles
1.1. Each word (see C.1.2 and C.1.3) in the main tier is given a gloss in the gloss tier
(%gls). Non-words (see C.1.5) are not glossed, with the exception of ‘xx’ and
‘xxx’, which are represented by the same characters in the gloss.
1.2. With the exception of proper names (see below), all words are glossed with the
closest English-language equivalent (in lower case). In Welsh or mixed-language words, certain morphological information is included in the gloss (in
upper case, see D.2.1). For example:
wasn’t@s:e : wasn’t
soup@s:u : soup
hefyd@s:w : also
recharge_io@s:ew : recharge.NONFIN
Some words marked as Welsh are glossed only with morphological information,
such as ‘POSS.2S’ for the 2nd singular possessive adjective ‘dy’.
Proper names (including names of books, films, organisations etc.) marked as
English or undetermined are glossed as they appear in the main tier. For
example, ‘Hong_Kong@s:u’ is glossed as ‘Hong_Kong’, ‘Cardiff@s:e’ is
glossed as ‘Cardiff’ and ‘Tebot_Piws@s:u’ is glossed as ‘Tebot_Piws’.
However, proper names marked as Welsh are glossed with their English-language equivalents. For example, ‘Caerdydd@s:w’ is glossed as ‘Cardiff’.
1.3. Lexical information always precedes morphological information in the gloss. A
full stop ‘.’ is used to separate morphological information from lexical
information (e.g. go.NONFIN) and also to separate morphological information
(e.g. PRON.3S).
1.4. The underscore is used on the gloss tier to connect more than one lexical item in
a gloss, where the English translation of a single Welsh word involves more than
one word. For example, ‘neithiwr’ is glossed as ‘last_night’.
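The conventions in 1.2 to 1.4 can be sketched in Python as follows. This is only an illustration, not a TalkBank tool: the function name is hypothetical, and the sketch assumes that any dot-separated part written in upper case (such as NONFIN or 3S) is morphological while everything else is lexical.

```python
def split_gloss(gloss):
    """Split one %gls item into (lexical part, list of morphological parts).

    Assumption for this sketch: a dot-separated part in upper case is
    morphological information; any other part is lexical. Underscores in
    the lexical part join the words of a multi-word English gloss (1.4).
    """
    lexical, morphological = [], []
    for part in gloss.split("."):
        if part.isupper():
            morphological.append(part)
        else:
            lexical.append(part)
    lexical_text = " ".join(lexical).replace("_", " ") or None
    return lexical_text, morphological


for item in ["go.NONFIN", "PRON.3S", "last_night", "recharge.NONFIN"]:
    print(item, "->", split_gloss(item))
```

A gloss with no lexical part, such as ‘PRON.3S’, yields `None` for the lexical text, which mirrors the statement in 1.2 that some Welsh words are glossed only with morphological information.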
2. Specific glosses
2.1. The following glosses are used for morphological information:
Gloss       Use
1,2,3       1st, 2nd, 3rd person
CONDIT      conditional/habitual past
DET         determiner
F           feminine
FUT         future/habitual present (verb ‘bod’ (to be) only)
IM          interactional marker/exclamation, e.g. ‘um’, ‘oh’
IMP         imperfect (verb ‘bod’ (to be) only)
IMPER       imperative
IMPERSONAL  impersonal
INT         interrogative
M           masculine
NEG         negative
NONFIN      nonfinite
NONPAST     nonpast tense (used for present/habitual/future)
PL          plural
PAST        past tense
POSS        possessive
POSSD       possessed
PRES        present tense (verb ‘bod’ (to be) only)
PRON        pronoun
PRT         particle
REL         relative
S           singular
SUBJ        subjunctive
2.2. Gender-specific adjectives in Welsh are not marked for gender in the gloss. For
example, ‘gwyn’ (used to modify masculine nouns) and ‘wen’ (used to modify
feminine nouns) are both glossed as ‘white’.
2.3. Numerals are glossed for gender where appropriate. For example, ‘dau’ and
‘dwy’ are glossed as ‘two.M’ and ‘two.F’ respectively.
2.4. Welsh collective nouns are glossed by the English plural. For example, ‘moch’
(singular collective noun indicating ‘pigs’) will have the gloss ‘pigs’.
2.5. In third person singular possessive constructions, the gender of the possessor is
marked only where there is positive evidence of that gender (i.e. either when the
possessed noun is mutated, or when a gender-specific pronoun follows the
possessed noun, specifically referring to the possessor). The gender is marked on
the possessive adjective. For example:
‘her mother’
ei mam : POSS.3S mother
ei mham: POSS.3SF mother
ei mam hi : POSS.3SF mother PRON.3SF
‘his mother’
ei fam : POSS.3SM mother
ei fam e : POSS.3SM mother PRON.3SM
ei mam e : POSS.3SM mother PRON.3SM
The above applies also to possessive constructions involving non-finite verbs
preceded by ‘ei’. For example:
‘he was born’
gaeth (e) ei eni: get.3S.PAST (PRON.3SM) POSS.3SM bear.NONFIN
‘he/she was shot’
gaeth ei saethu: get.3S.PAST POSS.3S shoot.NONFIN
2.6. When a possessive construction in the first person singular is marked only by
mutation of the noun, the possessed noun, in the gloss, is followed by
‘.POSSD.1S’. For example, ‘nhad’ (my father) is glossed as ‘father.POSSD.1S’.
Note that this gloss is used only if there is no possessive adjective preceding or
pronoun following the possessed noun (‘fy nhad’ or ‘nhad i’ are glossed
‘POSS.1S father’ and ‘father PRON.1S’ respectively, and ‘fy nhad i’ is glossed
‘POSS.1S father PRON.1S’).
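Since each word in the main tier receives one gloss (B.1.1), the two tiers can be paired word by word. The Python sketch below shows one way this might be done, together with a check for the possessor gender described in 2.5. The function names are hypothetical, and the sketch assumes that the main-tier string contains only glossed words (non-words would have to be filtered out first).

```python
def align(main_tier, gloss_tier):
    """Pair each main-tier word with its gloss, in order."""
    words = main_tier.split()
    glosses = gloss_tier.split()
    if len(words) != len(glosses):
        raise ValueError("main tier and gloss tier differ in length")
    return list(zip(words, glosses))


def possessor_gender(gloss):
    """Return 'M' or 'F' if a POSS.3S gloss is marked for gender, else None."""
    prefix = "POSS.3S"
    if not gloss.startswith(prefix):
        return None
    suffix = gloss[len(prefix):]
    return suffix if suffix in ("M", "F") else None


pairs = align("ei mam hi", "POSS.3SF mother PRON.3SF")
for word, gloss in pairs:
    print(word, gloss, possessor_gender(gloss))
```

With the examples from 2.5, ‘ei mam’ (POSS.3S mother) gives no gender, while ‘ei mham’ (POSS.3SF mother) and ‘ei fam’ (POSS.3SM mother) give ‘F’ and ‘M’ respectively.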
C TAGS
1. There are certain phrases used in Welsh, usually at the end of an utterance, but also
possible mid-utterance, which are used discursively to engage with the listener, to
gauge whether he/she agrees, understands etc. (although the listener is seldom
required to reply). We term these ‘tags’. Tags can be agreeing (i.e. they include a verb
form that agrees in person, number and tense with the finite verb in the main clause)
or they can be non-agreeing. Both kinds are particularly problematic in transcription,
as they are seldom seen in the written language and therefore there are no fixed
conventions for their spelling. They are also often highly contracted in speech and can
be problematic for glossing.
2. The following is an incomplete list of agreeing tags that may occur, which serves as a
pattern for other agreeing tags (with different verbs, tenses and persons). The table
gives each tag as we represent it in the main tier, together with its gloss.
Main tier                             Gloss
byddaf                                be.1S.FUT
na fyddaf                             NEG be.1S.FUT
yn_byddaf                             be.1S.FUT.NEG
medri                                 can.2S.NONPAST
na fedri                              NEG can.2S.NONPAST
yn_medri                              can.2S.NONPAST.NEG
dylai                                 should.3S.CONDIT
na ddylai                             NEG should.3S.CONDIT
yn_dylai                              should.3S.CONDIT.NEG
ydy, yndy                             be.3S.PRES
nag ydy, nac (y)dy, na(g) (y)dy etc.  NEG be.3S.PRES
yn_dydy, yn_tydy, dydy, tydy          be.3S.PRES.NEG
oes e                                 be.3S.PRES there
nag oes e                             NEG be.3S.PRES there
yn_does e, does e                     be.3S.PRES.NEG there
3. The following is a list of common non-agreeing tags with their spellings and their glosses.
Note that not all occurrences of these words or phrases in the transcripts are tags.
Main tier             Gloss
felly, (fe)lly        thus
wsti, sti             know.2S
wchi, (w)chi          know.2PL
yli, (y)li            see.2S.IMPER
ylwch, (y)lwch        see.2PL.IMPER
yn_de, de             TAG
yn_do, do             yes
yn_dyfe, dyfe         PRT.INT.NEG
chimod                know.2PL
chwel                 see.2PL
deud                  say.2S.IMPER
deuda                 say.2S.IMPER
deudwch, (deu)dwch    say.2PL.IMPER
dywedwch              say.2PL.IMPER
dofe                  yes
dywed, dywad, dŵad    say.2S.IMPER
fel                   like
gwed                  say.2S.IMPER
iawn                  right
na                    no
naci                  no
naddo                 no
nag yfe               NEG PRT.INT
nage                  no
ti gweld, ti weld     PRON.2S see.NONFIN
ti (y)n gweld         PRON.2S PRT see.NONFIN
timod                 know.2S
twel                  see.2S
ie, ia                yes
yfe                   PRT.INT
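The agreeing tags in item 2 above follow a regular glossing pattern: the plain form takes the base gloss, the form preceded by ‘na’/‘nag’ takes a separate NEG, and the ‘yn_’ form takes ‘.NEG’ on the verb gloss. The Python sketch below generates the three glosses from a base gloss. It is only an illustration of that pattern: the function name and dictionary keys are hypothetical, and the Welsh main-tier forms (including mutation) are not generated.

```python
def agreeing_tag_glosses(base_gloss):
    """Return the three glosses for an agreeing tag built on base_gloss.

    The first token of base_gloss is taken to be the verb gloss; any
    following tokens (e.g. 'there') are kept unchanged.
    """
    verb, *rest = base_gloss.split()
    return {
        "plain": base_gloss,
        "na_form": "NEG " + base_gloss,
        "yn_form": " ".join([verb + ".NEG", *rest]),
    }


for base in ["be.1S.FUT", "should.3S.CONDIT", "be.3S.PRES there"]:
    print(agreeing_tag_glosses(base))
```

For the base gloss ‘be.3S.PRES there’ this reproduces the three glosses given for ‘oes e’, ‘nag oes e’ and ‘yn_does e’.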
7. BlumSnow (Hebrew-English)
Shoshana Blum-Kulka
Department of Communications
Hebrew University
91905 Jerusalem, Israel
Catherine Snow
Harvard Graduate School of Education
Larsen Hall, Appian Way
Cambridge, MA 02138 USA
catherine_snow@harvard.edu
This corpus includes data from the Family Discourse Project, carried out in two stages,
from 1985 to 1988 and from 1989 to 1992. The research was funded by two grants from the
Israeli-American Binational Science Foundation: grant No. 82-3422 to Shoshana Blum-Kulka, David Gordon, Susan Ervin-Tripp, and Catherine Snow as consultant, and grant
87-00167/1 to Shoshana Blum-Kulka and Catherine Snow. Three groups of families were
involved in the project: native-born Israeli families from Jerusalem, American-born
Israeli families living in Israel, and American-born Jewish families living in Boston.
Stage one included 34 families and stage two included 24 families.
A monograph by Blum-Kulka (1997) is devoted to the analysis of these data. The book
demonstrates the ways talk at dinner constructs, reflects, and invokes familial, social and
cultural identities and provides social support for children to become members of their
parents’ culture. The groups studied are shown to differ in the ways they negotiate issues
of power, independence and involvement through speech activities such as the choice and
initiation of topics, conversational story-telling, naming practices, metapragmatic
discourse, politeness, language choice, and code-switching. The transcripts in the
CHILDES database include two types of files from stage two. The first type includes
transcripts of one dinner table conversation per family from eight native Israeli and eight
American Israeli families. The families were taped in their homes in Jerusalem (Blum-Kulka). The second type includes transcripts of one dinner table conversation per family
from eight Jewish American families. The families were taped in their homes in Boston
(Blum-Kulka and Snow).
Families are identified by group and number, and participants are identified by role for
adults and by name for children. The names of the children in the corpus are
pseudonyms.
Family Backgrounds
The families in the project were middle-class and upper-middle-class, white-collar professional, nonobservant Jewish families from a European background from Israel and the
United States. All parents were at least college educated and were occupied
professionally outside the home. Most parents were at the time of data collection in their
late 30s or early 40s (mean age 41, range 34 to 54). Families had two, three or four
children; the ages of children ranged from 3;1 to 17;2. By design, most children are of
school age (6;1 to 13;5). Further information about the ages of the children is given
below.
A participant observer taped three family dinners over a period of 2 to 3 months.
Recording started when the family began to gather around the table and stopped when
they left the table. Meals lasted 1 to 1.5 hours on average. One meal per family
was transcribed in CHAT.
Group 1: Native Israeli Families
The parents in this group are all Israeli born. The language spoken at dinner is
Hebrew.
Table 1: Native Israeli Children
Family #  Children’s Age and Sex
1         12;0 m, 10;5 m
2         13;2 m, 11;4 m, 5;2 f
4         16;1 m, 12;2 f, 8;6 m
5         13;1 m, 10;8 m, 4;0 f
6         6;2 f, 6;2 f
8         10;5 f, 8;7 m
9         8;8 m, 5;6 m
10        11;5 f, 8;3 f, 3;2 m
Group 2: American Israeli Families
The adults in the American-Israeli families were born in the United States and had lived in
Israel for more than 9 years at the time of the study. Twenty-five of the children were
born in Israel and four in the United States. All members of the family are competent
bilinguals. Both English and Hebrew are used; the rate of English varies by family from
30% to 96%.
Table 2: American Israeli Children
Family #  Children’s Age and Sex
1         11;4 m, 7;2 f
2         8;0 m, 6;1 m
3         9;0 m, 6;3 m
4         17;2 m, 13;4 f, 9;4 f, 7;5 f
6         15;10 m, 13;11 f, 5;5 f
7         13;11 f, 12;4 f, 9;0 f
8         12;9 f, 9;5 m, 5;8 m
12        12;2 m, 8;4 f
Group 3: Jewish-American Families
This set includes dinner conversations in English from eight middle-class Jewish American families from Boston. The families were taped in their homes.
Table 3: Jewish-American Children
Family #  Children’s Age and Sex
1         15;5 f, 13;5 f
2         8;5 m, 6;1 m, 4;4 m
3         10;0 m, 5;11 m
4         7;5 m, 4;3 m
9         9;5 m, 7;3 f
10        10;4 m, 8;2 f, 3;1 m
11        11;7 m, 9;6 f
12        13;4 f, 10;1 f, 4;1 m
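Ages in Tables 1 to 3 are given as years;months followed by the child's sex (m or f). For readers who want to work with these values numerically, the following Python sketch converts one table cell into a list of (age in months, sex) pairs. The function name and the regular expression are assumptions made for this example and are not part of the corpus tools.

```python
import re

# One child entry: years;months, optional space, then 'm' or 'f'.
AGE_RE = re.compile(r"(\d+);(\d+)\s*([mf])")


def parse_children(cell):
    """Parse a cell such as '13;1 m, 10;8 m, 4;0 f' into (months, sex) pairs."""
    return [
        (int(years) * 12 + int(months), sex)
        for years, months, sex in AGE_RE.findall(cell)
    ]


print(parse_children("10;4 m, 8;2 f, 3;1 m"))
```

An age of 10;4 is 10 × 12 + 4 = 124 months, so the cell above yields 124, 98 and 37 months for the three children.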
The coding schemes developed for the analysis of family discourse include:
1. The Topical Actions Code (analyzes conversational topical actions such as
the introduction, change, and shift of topics);
2. The Request Code (analyzes the speech act of directives);
3. The Narrative-Event Code (analyzes narrative segments from both the
interactive and structural perspectives);
4. The Metapragmatic Comments Code (analyzes metapragmatic comments
made with regard to turn-taking, conversational norms, and language).
Publications using these data should cite:
Blum-Kulka, S. (1997). Dinner-talk: Cultural patterns of sociability and socialization in
family discourse. Mahwah, NJ: Lawrence Erlbaum Associates.
Additional relevant references include:
Blum-Kulka, S. (1990). “You don’t touch lettuce with your fingers”: Parental politeness
in family discourse. Journal of Pragmatics, 14, 259–289.
Blum-Kulka, S. (1993). “You gotta know how to tell a story”: Telling, tales and tellers in
American and Israeli narrative events at dinner. Language in Society, 22, 361–402.
Blum-Kulka, S. (1994). The dynamics of family dinner-talk: Cultural contexts for children’s passages to adult discourse. Research on Language and Social Interaction, 27,
1–51.
Blum-Kulka, S. (1996). Cultural patterns in dinner talk. In W. Senn (Ed.), SPELL, Swiss
Papers in English Language and Literature. Vol.9: Families (pp. 77–107). Tübingen,
Germany: Gunter Narr.
Blum-Kulka, S., & Katriel, T. (1991). Nicknaming practices in families: A cross-cultural
perspective. In S. Ting-Toomey & F. Korzenny (Eds.), Cross-Cultural Interpersonal
Communication: International and Intercultural Communication Annual, Vol. 15 (pp. 58–77).
London: Sage Publications.
Blum-Kulka, S., & Snow, C. (1992). Developing autonomy for tellers, tales and telling in
family narrative-events. Journal of Narrative and Life History, 2, 187–217.
Olshtain, E., & Blum-Kulka, S. (1989). Happy Hebrish: Mixing and switching in American-Israeli family interaction. In S. Gass, C. Madden, D. Preston, & L. Selinker (Eds.),
Variation in Second Language Acquisition, Volume 1: Discourse and Pragmatics (pp.
59–84). Philadelphia: Multilingual Matters.
8. Eppler (German-English)
Eva Eppler
School of English and Modern Languages
University of Surrey Roehampton
Roehampton Lane, London SW15 5PH, UK
e.eppler@roehampton.ac.uk
The data was collected from a community of Austrian Jewish refugees from Nazi-occupied
Austria (approx. 30,000 Austrians fled to the UK) who settled in Northwest
London in the late 1930s. We are therefore dealing with a community in which German
and English have been in close contact for over sixty years. The L1 of the informants is
close to Standard German, although occasionally interspersed with Yiddish lexical items
and phonetically influenced by the Viennese variety. A peculiarity of the linguistic
profile of this community is that they do NOT speak Yiddish. The age of onset of L2
(English) was during the late teens and early twenties for most speakers. At the time the
audio-recordings were made (1993), all informants were in their late sixties or early
seventies. Patterns of language use in this bilingual community changed over the
last half-century: up to the 1970s, mainly English was used in both public and private
domains. Once the second generation had left the parents’ household, and especially after
retirement, both languages started being used in the private domain. A close-knit network
among a subset of the community facilitated the development of a bilingual mode of
interaction, sometimes called 'Emigranto'. This mode of interaction is only used in in-group situations, is regarded as the 'we-code' (Gumperz 1982) and has covert prestige.
Linguistically it is characterised by intra-sentential code-switching and frequent
switching at speaker turn boundaries. Biographical (age, gender, schooling, social class
of informants etc.) and situational information, where available, is provided under the
relevant headers in the .cha files. Pseudonyms are used for all participants. The goal of the
project was to provide a linguistic profile of the Jewish refugee community in London
and to study patterns of code-mixing.
Sampling and Data Collection
A random sample of 70 members of the target community was selected from a list of
clients of an Austrian solicitor specializing in pension claims for refugees. 27 of them
were audio-recorded for approx. 90 minutes in one-to-one or one-to-two sociolinguistic
interviews/oral history collections. To this body of subjects other informants were added
by referral (snowball sampling). All audio-recordings were collected in the informants’
homes. Informants were encouraged to choose as a language of interaction the one they
normally use in their home. An additional 400 minutes of group recordings with three
informants and the researcher were collected through participant observation during
informal gatherings. Another 540 minutes of audio-data collected in the Day-Centre of a
Refugee Organisation are almost impossible to transcribe due to the low quality of the
recordings and the amount of overlap.
Data Transcription
Full transcripts were made of sound files using the CHAT/LIDES transcription systems.
LIDES (Language Interaction Data Exchange System) is based on CHAT but was
extended to deal with code-mixed data. For this purpose language tags (@2 English and
@4 German) are added to each word/morpheme to indicate its language. In cases where it
was impossible to determine the language in which words were being produced, @u was
attached, e.g. in@u preceding English or German place-names. Morphologically mixed
words only display the full language tag on the suffix, as CHECK does not pass sequences
such as ge@4#bother@2-t@. The comma was used to indicate syntactic juncture, as one
of the research aims concerns co- and subordination. The CHAT symbol for tag questions was
also used to delimit discourse markers (Schiffrin 1987). Due to the nature of some of the
data (group recordings) overlaps are only indicated when the beginning and end point of
the overlap was clearly recognisable. Eva Eppler and Maggie Brueckner of the Language
Centre of the University of Rostock, Germany both transcribed and checked each
transcript. Project-specific codes are not included in the files on the web.
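To illustrate how the LIDES language tags described above (@2 English, @4 German, @u undetermined) might be tallied, here is a short Python sketch. It is not part of CLAN or LIDES: the function names are hypothetical, the example tokens are invented for illustration rather than taken from the corpus, and morphologically mixed words are not handled.

```python
from collections import Counter

# Language tags as described in the Data Transcription paragraph.
LIDES_LANGUAGES = {"2": "English", "4": "German", "u": "undetermined"}


def language_of(token):
    """Return the language named by a token's final @tag, or None."""
    _word, separator, tag = token.rpartition("@")
    if not separator:
        return None
    return LIDES_LANGUAGES.get(tag)


def language_counts(tokens):
    """Count tokens per language, ignoring untagged or unknown tokens."""
    languages = (language_of(token) for token in tokens)
    return Counter(lang for lang in languages if lang is not None)


# Invented example utterance (not from the corpus).
utterance = ["das@4", "ist@4", "in@u", "very@2", "nice@2"]
print(language_counts(utterance))
```

Counts of this kind give a rough measure of how much of each language an utterance or file contains; locating switch points would additionally require comparing the tags of adjacent words.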
1. Ibron.cha: 46 minutes of the first meeting between the researcher and the central
informant DOR; 36 minutes with DOR, her daughter (2nd generation) and her
grandson (3rd generation).
2. Jen1 - 3.cha: group recordings of DOR and three of her friends of the same
generation (TRU, MEL and LIL) and the researcher.
3. Alfred.cha: a one-to-two sociolinguistic interview/oral history with a male and a
female informant; Alfred1.mp3 corresponds to side A, alfred2.mp3 to side B.
4. Hogan.cha: a one-to-two sociolinguistic interview/oral history with a married
couple; Hogan1.mp3 corresponds to side A, hogan2.mp3 to side B of the original tape
recordings.
The collection and transcription of the data was funded by various research grants from
the University of Vienna and the University of Surrey Roehampton. The Austrian
Ministry of Science funded this research. Many thanks to the media team at Roehampton
for technical support, and to LIPPS and TalkBank.
Publications that use these data should cite:
Eppler, Eva. 1999. ‘Word order in German-English mixed discourse’, UCL Working
Papers in Linguistics 11, 285-309.
Eva Duran Eppler. 2010. Emigranto. The syntax of a German/English mixed code.
Vienna: Braumueller. ISBN 978-3-7003-1739-5
9. Gardner-Chloros (Greek-English)
Dr. P. H. Gardner-Chloros
Department of Applied Linguistics
Birkbeck College
43 Gordon Square
London WC1H 0PD
email: p.gardner-chloros@bbk.ac.uk
This research was designed to identify linguistic and sociolinguistic developments in
the London Greek Cypriot community, differentiating between the patterns found in
different generations; to relate the linguistic patterns to sociolinguistic factors; and to
analyze the spontaneous productions of London Greek-Cypriots, in particular those born
in Britain, from the point of view of language change (borrowing, calques), language
shift (abandonment of the Greek Cypriot dialect or GCD), and code-switching (linguistic
and pragmatic aspects).
The informants were 30 subjects of Greek Cypriot origin living in London, coming from
the working/lower middle class, with no higher education: 5 men and 5 women over age
60; 5 men and 5 women aged 35-60; 5 girls and 5 boys aged 14-18. Each subject was
recorded once for 30-60 minutes. Coders were Olga Pillakouri and Mary Kastamoula.
The Greek Cypriot community in London consists of about 180,000 people, who
came over in various waves from the 1960s onwards, many as economic refugees. Those
ousted by the Turkish invasion in 1974 are of a more varied social and educational
background. Their children and grandchildren have been educated in English schools,
though many have attended the classes in (Standard) Greek organised on Saturdays by
the Church and Parents’ Organization. The community is on the whole close-knit and
preserves religious and social/family values distinct from the surrounding community.
The younger generation is, however, creating a new identity for itself, distinct from the
rural and traditional ethos of the older generations yet also different from British
teenagers as a whole and indeed from those growing up in Cyprus.
Gardner-Chloros, P. 1992. “The sociolinguistics of the Greek Cypriot community of
London”. In Plurilinguismes No. 4, June 1992: Sociolinguistique du grec et de la
Grèce, ed. M. Karyolemou, pp. 112-135.
10. Hatzidaki (Greek-French)
Aspa Hatzidaki
Tripoleos 11
Kalamaria
Thessaloniki, 55131 Greece
The data contained in this corpus were used in Hatzidaki (1994). The purpose of the
investigation which took place among the second-generation Greeks living in Brussels
was to examine their linguistic behavior with a view to discovering to what degree they
maintain the use of the ethnic language, and how they alternate between French and
Greek in their daily interaction (to the extent that they do use Greek in spontaneous
conversation). The data collection took place between January 1991 and October 1992
and consisted of three complementary techniques: the tape-recording of speech events
such as interviews, participant observation, and the compilation of network lists.
Thirty-four second-generation informants (19 male, 15 female) took part in the study.
The following table groups participants together and provides information on their sex,
age, and occupation at the time of the study.
Table 4: Hatzidaki Participants
Participant  Sex  Age  Occupation
Stefanos     M    14   High school student
Dimitris     M    16   High school student
Lazaros      M    16   Studying hotel management
Tassos       M    18   High school student
Pavlos       M    18   Studying hotel management
Kostas       M    20   Studying car mechanics
Nikos        M    20   Studying chemistry
Fotis        M    21   Technician
Andreas      M    21   Running bookshop, studying PoliSci
Spiros       M    22   Car mechanic, cook
Yannis       M    22   Studying computers, waiter
Orestis      M    23   Studying car mechanics
Yorgos       M    23   Physiotherapist
Vassilis     M    24   Physiotherapist
Ilias        M    24   Cook
Michalis     M    24   Studying Economics
Miltos       M    24   Telecommunications engineer
Christos     M    29   Degree in Economics
Petros       M    29   Mechanical Engineering
Thalia       F    14   High school student
Zoe          F    15   High school student
Roula        F    16   High school student
Natasa       F    18   Studying linguistics
Vera         F    19   Studying linguistics
Sofia        F    20   Studying linguistics
Katerina     F    20   Studying linguistics
Maria        F    21   Studying accounting
Voula        F    21   Going to secretarial school
Olga         F    21   Studying linguistics
Fani         F    23   Secretary
Alexandra    F    25   Translator
Elissavet    F    25   Ergonomics, unemployed
Irene        F    26   Studying pharmaceutics
Despina      F    28   Beautician
All our informants belonged to the category of “early bilinguals” (although two of
them, Petros and his sister Irene, were not born in Belgium but acquired the French language in their early school years). It is difficult to be more precise and to place the informants in the category of “consecutive” or “successive” bilinguals, because they were not
always able to provide reliable answers to the question of how they learned their two languages and when they started using one or the other for the first time. Differences in their
learning experiences and in the timing and type of their language exposure together
made it difficult to say with certainty what their first language was. On the whole, most
of our informants seemed to have experienced a positive, additive form of bilingualism,
even though the Greek spoken by the majority is not comparable to Standard Greek in
many ways; the speech of second-generation Greeks is markedly different from the norm
for Modern Greek, as their variety of the ethnic language manifests certain distinctive
features on all linguistic levels. Some of these features even appear with some
systematicity.
Irrespective of the structural deviations from the norm, the participants’ overall
competence in Greek was sufficient for communication purposes. The active involvement
of Greek authorities and the Greek Orthodox Church, frequent visits to Greece, and the
availability of Greek-language press and media provided ample opportunity to develop
oral and literacy skills in the ethnic language. If our informants’ competence in Greek
varied from poor to very good, their competence in French was higher, by their own
admission. They could be safely considered French-dominant bilinguals, something that
is true for the totality of second-generation Greeks in Brussels (apart from those few who
have been educated in Dutch, of course). This means that French was the language that
served most functions in their everyday life, the language they felt more comfortable in,
and the language they mastered best. The dominance of the French language was due to
the nature of the children’s socialization and the functions fulfilled by the two codes in
question. For those who still attended Mother Tongue Classes, Greek was the language of
instruction for a few hours twice a week. Apart from that, they used it with family and
friends to varying degrees. All other linguistic activity, be it receptive or productive, took
place in French. This functional separation of codes, which they had experienced since
infancy, firmly established the dominance of French. They definitely did not speak Greek
as well as they spoke French. Their French was as good as that of any native speaker of
their background.
Informants were asked to rate their Greek proficiency on two aspects, oral proficiency
and literacy skills, and the mean of the two scores gave the informant’s proficiency score.
It was decided to consider as “more proficient speakers” those informants who gave
themselves between 2.5 and 4 and “less proficient speakers” those who rated themselves
between 1 and 2.5. When the mean turned out to be exactly 2.5, the final placement of the
informant was left to the researcher’s discretion. The criteria on which this judgment was
based were the following: A “more proficient” speaker of Greek did not manifest
disfluency phenomena indicating incompetence, made very few or no grammatical
mistakes, used the appropriate words most of the time, and did not switch frequently out
of incompetence. On the other hand, the speech of “less proficient” speakers of Greek
manifested more clearly the dominance of French. In contrast to that of “more proficient”
speakers, their speech was fraught with pauses, hesitations, grammatical mistakes, poor word
selection, and competence-related code switching. The more proficient speakers were
Miltos, Lazaros, Orestis, Katerina, Elissavet, Ilias, Petros, Alexandra, Yannia, Andreas,
Vassilis, Yorgos, Fotis, and Nikos. The other speakers can be classed as less proficient.
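The grouping rule described above can be sketched in a few lines of Python. This is only an illustration, not code from the original study: the function names are hypothetical, and the 1 to 4 rating scale is an assumption inferred from the ranges quoted in the text. A mean of exactly 2.5 is flagged rather than assigned, since those cases were decided by the researcher.

```python
from statistics import mean

# Boundary between the two groups, as stated in the text.
THRESHOLD = 2.5
# Assumed rating scale, inferred from the ranges "1 to 2.5" and "2.5 to 4".
MIN_RATING, MAX_RATING = 1, 4


def proficiency_score(oral, literacy):
    """Return the mean of the two self-ratings (hypothetical helper)."""
    for rating in (oral, literacy):
        if not MIN_RATING <= rating <= MAX_RATING:
            raise ValueError(f"rating {rating} is outside the assumed 1-4 scale")
    return mean([oral, literacy])


def classify(oral, literacy):
    """Assign a group label from the mean self-rating."""
    score = proficiency_score(oral, literacy)
    if score > THRESHOLD:
        return "more proficient"
    if score < THRESHOLD:
        return "less proficient"
    # A mean of exactly 2.5 was left to the researcher's judgment.
    return "researcher's discretion"


if __name__ == "__main__":
    print(classify(3, 4))
    print(classify(2, 2))
    print(classify(2, 3))
```

The borderline case is kept as a separate label because the text gives qualitative criteria for it (fluency, grammatical accuracy, word choice, switching) that cannot be derived from the self-ratings alone.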
The participants came from several social groups. These included the Sphynx Café
group, the Hellenic Community group, the Association group, the foursome group, the
students group, and Orestis and Alexandra. Full details regarding the social structure and
language usage in these different groups can be found in Hatzidaki (1994).
The results of the quantitative study of language choice in our data led to the
conclusion that more proficient speakers used significantly more Greek during monitored
situations (mean: 91%) than their less proficient counterparts (mean: 60%). This
discrepancy can be attributed to the former group’s higher competence and greater
practice in the ethnic language, which permitted them to conduct a long conversation
with almost no French elements. Less proficient speakers in our sample, on the other
hand, rarely found themselves in situations where the use of Greek was called for.
However, the number of speakers on whom data are available is too small to draw any
significant conclusions. Again, more proficient speakers manifest a more homogeneous
behavior, which is natural in view of their consistency in using Greek.
Publications using these data should cite:
Hatzidaki, A. (1994). Ethnic language use among second-generation Greeks in Brussels.
Unpublished doctoral dissertation. Vrije Universiteit, Brussels.
11. Køge (Turkish-Danish)
Jens Normann Jørgensen
University of Copenhagen
Copenhagen, DK
normann@hum.ku.dk
These data were collected from adolescent Turkish-Danish bilinguals in the town of Køge
near Copenhagen. The data include interviews in Danish and Turkish and group
discussions in both Danish and Turkish. There are audio files, but they are not available
to TalkBank.
Download