A Quantitative Approach to Defining Code-switching Pattern
Natalia Bakaeva
University of Toronto
In this presentation I show some methodological improvements to Muysken's (2000) typological
approach to patterns in bilingual speech. These improvements establish a more precise quantitative
profile for identifying the two main code-switching patterns that Muysken distinguishes: insertion
and alternation. Insertion (see 1) is characterized by the insertion of material (a lexical item or
an entire constituent) from one language into a morphosyntactic structure from the other language.
(1) Ya ego kazhdyi den videla
I him every day see3S.Pas
'I saw him every day in the subway'
Alternation (see 2) is "alternation between structures from [different] languages" (Muysken, 2000:3).
(2) They NEG
'They do not notice that, I don't know if I will feel the same'
The grammatical constraints proposed for intrasentential code-switching are numerous (Myers-Scotton,
1993, 2004; Treffers-Daller, 1994; Mahootian and Santorini, 1996; Poplack, 1980, 1990, 2001;
MacSwan, 1999, 2006), but only partially convergent. These studies involve a variety of language pairs,
social settings, and speaker types. Although data sets of bilingual speech share many similar features,
they also show a large variation in types, forms and frequency of switches. These differences may be
caused by numerous factors, both linguistic and social, which may explain why a consensus has yet to
be reached on how code-switching patterns are grammatically constrained. Muysken's typology
makes it possible to reconcile the two main models of code-switching (Poplack, 1980 and Myers-Scotton, 1993),
which have so far shown little convergence. The linear equivalence model of Poplack
(1980) better reflects the alternation pattern, while the Matrix Language Frame Model of
Myers-Scotton (1993) better captures the insertion pattern.
The method I developed allows for systematic and objective comparison across corpora, making it
easier to distinguish these types and thus improving our understanding of the phenomena involved. I
illustrate my method with an analysis of a corpus of Russian-English code-switching data. 1,123 switches
were extracted from my corpus of 11 hours of spontaneous speech recorded from six speakers in
Toronto. All speakers are bilingual, Russian-dominant, first-generation immigrants who moved to
Canada at the age of 17 or older and have lived there for at least 7 years.
This study aims to 1) refine the linguistic criteria used by Muysken to define code-switching
patterns (insertion and alternation), and 2) organize the linguistic diagnostic features for insertion into
a hierarchy according to their relative predictive power. I present my objective definitions of the
diagnostic criteria and my method of quantifying their contribution and then apply the criteria to the
data set. The statistical weight of each diagnostic feature (an element of the linguistic context of a
switch), which shows how strongly the feature predicts insertion, is determined using GoldVarb’s
multivariate function.
Muysken (2000: 230-231) suggests that these “specific diagnostic features [of the two code-switching
patterns] can be applied to a corpus, and that the set of values for each feature will match one
pattern more closely than the other”. These two strategies are structurally defined and a number of
their diagnostic criteria are listed. The 27 diagnostic criteria are gathered in four groups: constituency,
element switched, switch site, and properties of switch. The value of each feature points toward one
pattern or the other. For example, the feature single constituent has a positive value for the insertion
pattern: if a particular switch is a single constituent (see 1), it is more likely to match the insertion
pattern than the alternation pattern. If the switch is not a single constituent (see 2) and so has a
negative value for this feature, this is indicative of the alternation pattern. (All examples are from
my Russian-English corpus.) However, the criteria cannot in fact be considered in
isolation because they are related – there are implicational hierarchies and interactions among these
factors. A primary goal of my analysis is to simplify the set of features by untangling these interactions.
My analysis builds on Deuchar, Muysken and Wang’s (2007) initial attempt to empirically test
the hypotheses. Their study, a quantitative analysis of three corpora of spontaneous conversations
(Welsh-English, Tsou-Mandarin Chinese, and Taiwanese-Mandarin Chinese), reveals some
methodological and conceptual issues.
First, all of the criteria used to identify the code-switching patterns were treated as if
they had equal weight, although there are reasonable grounds to expect that some criteria (e.g. linear
equivalence, function word) are more important than others (e.g. homophonous diamorphs,
triggering). Second, there is some redundancy in the system that was not taken into account: cases
where the value of one criterion determines the value of another. For example, a negative value
for the criterion nested entails a positive value for the criterion non-nested. Third, not all
criteria apply to all switches. Finally, the same criteria are often offered for both types of switches. As
a result, the scores for each switch are not directly comparable. The criteria need to be more precisely
defined to allow testing (Deuchar, Muysken and Wang, 2007: 336).
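The redundancy problem described above can be illustrated with a minimal Python sketch. It assumes each switch is coded as a dictionary of binary criterion values. The criterion names, the toy data, and the helper `redundant_pairs` are hypothetical illustrations, not part of the original study's tooling.

```python
from itertools import combinations

# Hypothetical toy coding: each switch maps a criterion name to
# 1 (positive value) or 0 (negative value). Not real corpus data.
switches = [
    {"nested": 1, "non_nested": 0, "single_constituent": 1},
    {"nested": 0, "non_nested": 1, "single_constituent": 1},
    {"nested": 1, "non_nested": 0, "single_constituent": 0},
    {"nested": 0, "non_nested": 1, "single_constituent": 0},
]


def redundant_pairs(rows):
    """Return pairs of criteria whose values fully determine each other."""
    names = sorted(rows[0])
    pairs = []
    for a, b in combinations(names, 2):
        identical = all(r[a] == r[b] for r in rows)
        complementary = all(r[a] != r[b] for r in rows)
        if identical or complementary:
            kind = "identical" if identical else "complementary"
            pairs.append((a, b, kind))
    return pairs


print(redundant_pairs(switches))
```

In this toy data only nested and non-nested are flagged, because one is always the inverse of the other. A pair flagged this way carries the same information twice, so one member can be dropped before scoring.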
For this purpose, all of the diagnostic features were first divided into criteria of absolute value
(e.g. single constituent, content word) and criteria of relative value (e.g. long constituent, complex
constituent) according to their applicability. The next step of analysis consisted of the categorization
(binning) of relative value criteria as well as the elimination of redundant criteria (e.g., non-nested,
self-correction) and irrelevant criteria in the context provided (e.g., homophonous diamorphs,
embedding in discourse). Once the criteria had been better defined and pruned, a quantitative analysis of the
binary values of the remaining 19 criteria was applied to the 1,123 switches. This produced two scores
for each switch: an insertion score and an alternation score. The higher the score, the better the match
to the pattern of predictive features. Adding up the scores for all switches in the corpus allows me to
determine, on the basis of the highest score, the pattern that is matched best in the corpus as a whole.
The dominant pattern for this set of data is insertion: insertion score 3526 versus alternation score 4520.
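The binning and scoring steps can be sketched as follows. This is a simplified illustration under stated assumptions: one point is awarded whenever a switch's value matches the value a pattern expects, the expected values in `PROFILE` cover only four criteria and are illustrative, the three-word cut-off for long constituent is invented, and the three switches are toy data rather than corpus data.

```python
# Assumed expected values per criterion: (value expected under insertion,
# value expected under alternation). Illustrative subset only.
PROFILE = {
    "single_constituent": (1, 0),
    "content_word": (1, 0),
    "long_constituent": (0, 1),
    "complex_constituent": (0, 1),
}

LONG_THRESHOLD = 3  # assumed cut-off (in words) for binning "long constituent"


def code_switch(raw):
    """Turn a raw annotation into binary values, binning the relative criterion."""
    coded = {k: v for k, v in raw.items() if k != "n_words"}
    coded["long_constituent"] = 1 if raw["n_words"] >= LONG_THRESHOLD else 0
    return coded


def score_switch(coded):
    """Count matches with each pattern's expected values: (insertion, alternation)."""
    ins = sum(coded[c] == exp_ins for c, (exp_ins, _) in PROFILE.items())
    alt = sum(coded[c] == exp_alt for c, (_, exp_alt) in PROFILE.items())
    return ins, alt


def corpus_scores(raw_switches):
    """Add up the per-switch scores over the whole corpus."""
    scores = [score_switch(code_switch(r)) for r in raw_switches]
    return sum(i for i, _ in scores), sum(a for _, a in scores)


# Toy annotations (hypothetical).
raw_switches = [
    {"single_constituent": 1, "content_word": 1, "n_words": 1, "complex_constituent": 0},
    {"single_constituent": 0, "content_word": 0, "n_words": 6, "complex_constituent": 1},
    {"single_constituent": 1, "content_word": 1, "n_words": 4, "complex_constituent": 0},
]

ins_total, alt_total = corpus_scores(raw_switches)
dominant = "insertion" if ins_total > alt_total else "alternation"
print(ins_total, alt_total, dominant)
```

The pattern with the higher corpus total is taken as the dominant one, mirroring the comparison of summed scores described above.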
Next, the statistical weight of each diagnostic feature indicative of insertion was assessed using
GoldVarb (Sankoff, Tagliamonte & Smith, 2005). Insertion vs. alternation is the dependent variable
and all 7 diagnostic features indicative of insertion are independent linguistic variables, binarily coded
according to their applicable value.
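The kind of multivariate analysis involved can be approximated with a plain logistic regression. The sketch below is not GoldVarb and does not reproduce its procedure. It assumes two hypothetical features coded +1 (present) or -1 (absent), fits the model by gradient descent on invented data, and converts each coefficient to a 0-1 weight with the inverse logit, so that values above 0.5 favour insertion.

```python
import math


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a logistic regression by batch gradient descent; returns (intercept, coefs)."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted minus observed
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return b, w


def factor_weight(coef):
    """Map a log-odds coefficient to a 0-1 weight (0.5 means no effect)."""
    return 1.0 / (1.0 + math.exp(-coef))


# Invented data: (content_word, single_constituent) -> outcomes,
# where 1 = insertion and 0 = alternation. Features are coded +1 / -1.
cells = {
    (1, 1): [1, 1, 1, 1],
    (1, -1): [1, 1, 1, 0],
    (-1, 1): [1, 0, 0, 0],
    (-1, -1): [1, 0, 0, 0],
}
X = [list(feats) for feats, ys in cells.items() for _ in ys]
y = [yi for ys in cells.values() for yi in ys]

intercept, coefs = fit_logistic(X, y)
weights = {
    "content_word": factor_weight(coefs[0]),
    "single_constituent": factor_weight(coefs[1]),
}
ranking = sorted(weights, key=weights.get, reverse=True)
print(weights, ranking)
```

Sorting the features by weight gives a hierarchy of the kind reported in this study. On this invented data the first feature ranks above the second because it separates the two outcomes more sharply.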
The diagnostic criteria indicative of insertion fall into the following hierarchy. Those ranked at
the top are most powerful in predicting insertion in this corpus. Future work will determine how
universal this ranking is.
1) content word
2) morphological integration
3) dummy word insertion
4) single constituent
5) selected element
6) telegraphic mixing
[Table columns: Factor Weight (values not preserved); significance (p < 0.05 for three entries, not significant for one)]
The validation and refinement of the diagnostic criteria allow me to define the relative
value features (long constituent, complex constituent) more clearly and to reduce the number of
criteria considered from 27 to 19. The results of this analysis indicate that only six of these play
a critical role in defining insertion, suggesting that future analysis could be considerably simplified.
