Journal of Memory and Language 76 (2014) 61–79

Variables and similarity in linguistic generalization: Evidence from inflectional classes in Portuguese

João Veríssimo a,⇑, Harald Clahsen b

a Faculty of Psychology, University of Lisbon, Alameda da Universidade, 1649-013 Lisboa, Portugal
b Potsdam Research Institute for Multilingualism, University of Potsdam, Haus 2, Campus Golm, Karl-Liebknecht-Straße 24–25, 14476 Potsdam, Germany

Article info
Article history: Received 11 November 2013; Received in revised form 19 May 2014
Keywords: Variables; Similarity; Rules; Morphological generalization; Productivity; Computational modeling

Abstract
Two opposing viewpoints have been advanced to account for morphological productivity, one according to which some knowledge is couched in the form of operations over variables, and another in which morphological generalization is primarily determined by similarity. We investigated this controversy by examining the generalization of Portuguese verb stems, which fall into one of three conjugation classes. In Study 1, an elicited production task revealed that the generalization of 2nd and 3rd conjugation stems is influenced by the degree of phonological similarity between novel roots and existing verbs, whereas the 1st conjugation generalizes beyond similarity. In Study 2, we directly contrasted two distinct computational implementations of conjugation class assignment in how well they matched the human data: a similarity-driven model that captures phonological similarities, and a dual-mechanism model that implements an explicit distinction between context-free and similarity-based generalizations.
The similarity-driven model consistently underestimated 1st conjugation responses and overestimated proportions of 2nd and 3rd conjugation responses, especially for novel verbs that are highly similar to existing verbs of those classes. In contrast, the expected proportions produced by the dual-mechanism model were statistically indistinguishable from human responses. We conclude that both context-free and context-sensitive processes determine the generalization of conjugations in Portuguese, and that similarity-based algorithms of morphological acquisition are insufficient to exhibit default-like generalization.
© 2014 Elsevier Inc. All rights reserved.

⇑ Corresponding author. Present address: Potsdam Research Institute for Multilingualism, University of Potsdam, Haus 2, Campus Golm, Karl-Liebknecht-Straße 24–25, 14476 Potsdam, Germany. Fax: +49 (0) 331/977 2687. E-mail address: joao.verissimo@uni-potsdam.de (J. Veríssimo).
http://dx.doi.org/10.1016/j.jml.2014.06.001

Introduction

One of the striking features of human language is its productivity, that is, the fact that speakers are able to produce and comprehend linguistic expressions that they have not encountered before (Chomsky, 1965). At the heart of this ability is the generalization of linguistic patterns and constraints to novel items. For example, if a novel verb such as to ploamph were to enter the English language, speakers would be readily able to form its different variants (e.g., ploamphed, ploamphing; Prasada & Pinker, 1993), as well as incorporate them into acceptable sentences (e.g., Why do you think I should have ploamphed it?). Given that knowledge of language is finite, but the number of complex forms and sentences that can be produced and understood is infinite, one of the central goals of the language sciences is to characterize the representational substrate that accounts for linguistic generalizations.
Broadly speaking, two opposing viewpoints have been advanced. On one side, proponents of symbol-manipulation approaches hold that linguistic knowledge is primarily couched in the form of operations over variables, that is, placeholders that stand for every instance of a category (e.g., Chomsky, 1980; Fodor & Pylyshyn, 1988; Marcus, 2001). Variables are insensitive to the idiosyncratic properties of the tokens they instantiate, and as such, allow free and unbounded generalization to novel instances. For example, if the rules or constraints of sentence formation make reference to a variable ‘(V)erb’, then every lexical item that satisfies this condition can be used in a well-formed sentence. Likewise, if producing a progressive form involves concatenating an instance of a variable with the appropriate suffix (i.e., V + -ing), then this operation can be productively extended to any novel verb. A radically different approach to linguistic generalization is espoused by proponents of similarity-based approaches, which here we will take to encompass a broad class of models, including (amongst others) connectionist and exemplar-based architectures (e.g., Daelemans, 2002; Elman, 1993; Rumelhart & McClelland, 1986; Skousen, Lonsdale, & Parkinson, 2002). A distinctive feature of such models is the notion that generalization is primarily determined by similarity. More specifically, the higher the representational overlap between a novel item and a set of learned instances, the higher the probability that it will be responded to in the same way (e.g., Hahn & Nakisa, 2000). This aspect stands in sharp contrast to how generalization is treated in variable-based approaches.
Rather than the same operation being applied equally to all members of a category, analogical models typically produce graded and probabilistic outcomes which reflect overlap at different levels of representation (e.g., phonological, semantic) and are influenced by the statistical properties of previously learned input–output pairs. The study of morphological generalization and processing has played an important role in this debate, particularly in what concerns the contrast between regular (e.g., walked) and irregular (e.g., sang) past tense forms in English: whereas the regular -ed pattern is productively extended to new roots, in a way that appears to be insensitive to their phonological characteristics (e.g., ploamphed; see, e.g., Berko, 1958; Prasada & Pinker, 1993; Ullman, 1999; but see Albright & Hayes, 2003), irregular patterns are seldom generalized and are applied only to novel items that phonologically resemble clusters of irregular verbs (e.g., spling, which in analogy with verbs such as sing, can be inflected as splang; Bybee & Moder, 1983). The case of the English past tense clearly illustrates a tension that is also visible in the inflectional and derivational systems of many other languages: that between context-independence and context-sensitivity (Keuleers et al., 2007). More specifically, because many inflectional contrasts and word-formation processes are not applied in the same way for each and every member of a grammatical class (such as all verbs), then morphological operations must, at least for some items, be conditioned by lexical information. At the same time, because some patterns can be productively extended in an unbounded fashion (i.e., even to novel items that are very dissimilar to existing forms in the language), then it would appear that at least some morphological operations can behave as a default, applying when “all else fails” (Bybee, 1995, p.
452) and in a way that is independent of the idiosyncratic properties of individual tokens (see, e.g., Berent, Pinker, & Shimron, 1999, for Hebrew; Marcus, Brinkmann, Clahsen, Wiese, & Pinker, 1995, for German; Prasada & Pinker, 1993, for English). The balance between lexically conditioned and productive generalizations is most explicitly incorporated in a class of dual-mechanism models of morphology (e.g., Clahsen, 1999; Marslen-Wilson & Tyler, 1997, 2007; Pinker, 1999; Pinker & Ullman, 2002). According to such models, morphological operations can either be instantiated: (1) by the application of a grammatical rule, which operates over a variable and generates a structured representation (e.g., adding the English regular -ed affix to any verbal root); or (2) through direct retrieval of an exceptional form (e.g., an irregular past tense, such as brought), and in the case of generalization to novel words, via analogy from the associations between lexically specified representations (e.g., splought as a possible inflection for spling). In contrast, according to the class of similarity-based models mentioned above, a single context-sensitive mechanism based on the overlap between lexical representations is purported to be sufficient to capture both the generalizations that are similarity-based and those that are made outside of restricted areas of the similarity space. In such models, approximation to default-like behavior is thought to emerge naturally for those morphological patterns that are most frequent or that display significant heterogeneity in their phonological distributions (e.g., Hahn & Nakisa, 2000; Hare & Elman, 1995; Hare, Elman, & Daugherty, 1995). In other words, in the single-mechanism view, what appears to be the result of an operation over a variable is in fact the product of the same frequency- and similarity-based mechanisms that are responsible for restricted and lexically conditioned generalizations.
In the present paper, we set out to investigate the generalization properties of conjugation classes in Portuguese, an example of pure morphology, which we believe provides a better test case for assessing the mechanisms involved in morphological generalization than the familiar contrast between regular and irregular inflection. In order to assess the role of phonological similarity, we have used a metric derived from a computational implementation of a similarity-based model, the Minimal Generalization Learner (MGL) proposed by Albright (Albright, 2002a; Albright & Hayes, 2003). In addition, by minimally changing the MGL model to embed a more explicit dual-mechanism distinction, we were able to directly compare two specific computational implementations in how well they matched elicited production data obtained from native speakers of Portuguese.

Linguistic background

In linguistic treatments of Portuguese morphology, the structure of Portuguese verbs has been proposed to display three hierarchical levels: a root constituent, a stem constituent, and a word constituent (Villalva, 2000, 2003). The root (e.g., cant-, in the infinitive form cantar, ‘to sing’) is taken to be the locus of all semantic, syntactic and morphological information, and transmits this information to the stem node above it. The stem is constituted by a root and, in the case of verbs, a theme vowel, which is the affix that realizes the verb’s morphological class or conjugation. In Portuguese, verbs can belong to one of three conjugation classes, which are commonly identified by the theme vowels that they display in the stem of infinitive forms: these are -a-, for 1st conjugation verbs; -e-, for 2nd conjugation verbs; and -i-, for 3rd conjugation verbs.
In most cases, the stem is the linguistic unit that functions as the base for further morphological processes, that is, both inflectional and derivational affixes generally attach to the stem (e.g., cant-a-r ‘to sing’, transfer-í-vel ‘transferable’). One of the few exceptions is the 1sg present indicative form which displays an affix that attaches directly to the root (e.g., cant-o ‘I sing’). For that reason, novel 1sg present indicative forms are ambiguous regarding class membership and can therefore be used to gauge the generalization properties of the different conjugations. A full form such as cantássemos, for example, will therefore be composed of four morphological constituents: the root cant-, the theme vowel -a- for the 1st conjugation, the inflectional affix -sse for the imperfect subjunctive, and the affix -mos expressing person and number information. Of these four constituents, only three can be conceived of as morphemes: the root, bearing all idiosyncratic information associated with the lexical entry; -sse, encoding tense, mood, and aspect; and -mos, an agreement morpheme. However, the theme vowel -a- is not a morpheme under any sensible definition, because it does not have any meaning or function. Beyond being class markers, theme vowels specify no further information. Such constituents are semantically and syntactically empty, and only convey purely morphological information, namely, which conjugation the verb belongs to. Another important aspect of the verbal conjugation system of Portuguese is that the formal spell-out of an inflectional morpheme may differ depending on what class the verb belongs to. Most, but not all, inflectional affixes are the same across conjugations; however, some forms display inflectional affixes that are sensitive to conjugation membership. For example, the past imperfect form shows two different affixes: -va in the case of 1st conjugation verbs, and -a, in the case of 2nd and 3rd conjugation verbs. 
Summing up, conjugational stems in Portuguese display three advantageous properties. Firstly, most stems are combinatorial, that is, they consist of a root and a theme vowel; therefore, all stems are ‘regular’ in the sense that they do not need to show any phonological changes and can be potentially segmented into their constituents. Secondly, stems contain an empty ‘morph’, a theme vowel that expresses nothing except morphological information. Thirdly, stems define inflectional classes, that is, they select particular affixes. Therefore, rather than being a mapping between a phonological form and a meaning (or morphosyntactic feature), theme vowels and stems determine the mappings of sound to meaning (for example, that the past imperfect is expressed by the affixes -va or -a). Because of these properties, stems and inflectional classes constitute purely morphological concepts (see Aronoff, 1994), and therefore, they can be used as a very clear-cut and unconfounded test case of morphological generalization. In addition, conjugations in Portuguese display a striking discrepancy between the 1st conjugation, on the one hand, and the 2nd and 3rd conjugations, on the other. In the Portuguese verb lexicon, the 1st conjugation is the most productive class. For example, a count of type frequencies in a lexical database of Portuguese, created from a corpus of 16,210,438 words (Bacelar do Nascimento et al., 2000), showed a predominance of 1st conjugation verbs in both the whole corpus (3,396 1st conjugation, 380 2nd conjugation, and 348 3rd conjugation verbs) and amongst the verbs with the lowest lemma frequency (0.37 per million; 123 1st conjugation verbs, but only 10 2nd or 3rd conjugation verbs). More importantly, the formation of 1st conjugation stems in Portuguese (as well as in other Romance languages) qualifies as a default process according to the criteria laid out in Marcus et al.
(1995), that is, the 1st conjugation exhibits unrestricted productivity in that it can apply to foreign borrowings, onomatopoeias, denominal verbs, etc. Consequently, novel words that enter the language are always assigned to the 1st conjugation (e.g., blogar ‘to blog’).

Psycholinguistic models of stem formation

From a psycholinguistic perspective, the above considerations raise the question of how stems and inflectional classes are mentally represented and generalized to novel forms. We can think of three general possibilities, which differ in the role ascribed to context-sensitive mechanisms or to variables referring to grammatical categories. Following linguistic treatments of Portuguese morphology (Villalva, 2000, 2003), according to which verbal stems of all three inflectional classes constitute the output of morphological rules that join a root and a class marker, one would expect all conjugations to be generalized in the same manner, that is, in a way that is insensitive to phonological characteristics of the root. Therefore, whilst the differences in the productivity of the three conjugation classes are indeed acknowledged in traditional linguistic treatments of Portuguese, no explanation for that discrepancy is provided. A second possibility is a dual-mechanism account of stem formation, as was proposed by Say and Clahsen (2002) for Italian verbal conjugations. According to this account, 1st conjugation stems are generated by a stem-formation rule that applies to any verbal root, accounting for its unbounded productivity, whilst 2nd and 3rd conjugation stems have to be stored in the lexicon and block the application of the general stem-formation rule.
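The division of labor in this account can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the authors' model: the two-entry lexicon, the function names, and the use of orthographic rather than phonological forms are all hypothetical. It covers only the blocking of a default rule by stored stems, not the similarity-based extension of 2nd and 3rd conjugation stems to novel roots.

```python
# Hypothetical mini-lexicon of stored (non-default) stems: root -> theme vowel.
# beber 'to drink' is 2nd conjugation; resistir 'to resist' is 3rd conjugation.
STORED_THEME_VOWELS = {"beb": "e", "resist": "i"}

DEFAULT_THEME_VOWEL = "a"  # 1st conjugation, applied to any verbal root


def theme_vowel(root: str) -> str:
    """A stored stem blocks the default rule; otherwise the rule applies."""
    return STORED_THEME_VOWELS.get(root, DEFAULT_THEME_VOWEL)


def infinitive(root: str) -> str:
    """Build root + theme vowel + infinitival -r (orthographic sketch)."""
    return root + theme_vowel(root) + "r"


for root in ("cant", "resist", "blog"):
    print(root, "->", infinitive(root))
```

In this sketch a novel root such as blog- has no stored entry, so it falls through to the default, which is one way of reading the claim that the 1st conjugation applies regardless of a root's form.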
This account has been tested for Portuguese by Veríssimo and Clahsen (2009) using a morphological cross-modal priming experiment, in which infinitive forms belonging to the 1st (e.g., limit-a-r ‘to limit’) or the 3rd conjugation (e.g., resist-i-r ‘to resist’) primed root-based present tense indicative forms (e.g., limit-o, resist-o). The results showed that only 1st conjugation infinitives produced as much facilitation as identity primes, which was interpreted as support for a dual-mechanism account along the lines of Say and Clahsen (2002). With respect to generalization properties, a dual-mechanism account would predict differences between 1st and 2nd/3rd conjugation stems: 1st conjugation forms should be more widely generalized and unaffected by analogies to existing words, because they are the result of a stem-formation operation over a variable, whilst 2nd or 3rd conjugation forms should reveal graded effects of phonological similarity. A third possibility is that the conjugational stems of Romance languages do not have any internal morphological structure, and that generalization of all stems is performed by a frequency- and similarity-driven mechanism that is sensitive to phonological overlap. For Italian, there are several different computational implementations of these assumptions: Eddington’s (2002) exemplar-based model, Colombo, Stoianov, Pasini, and Zorzi’s (2006) connectionist network of massively interconnected units, and Albright’s (2002a) rule-based MGL. Eddington’s (2002) exemplar model is an implementation of the Analogical Modelling of Language (AML) framework developed by Skousen (1992) and Skousen et al. (2002). In AML, the probability of a particular output is obtained by first searching lexical memory for the entries that are most similar to the given context (i.e., to an input form).
The members of the database are then grouped into subcontexts containing forms that share phonological constituents with the input (e.g., the last consonant or the nucleus vowels). Those lexical items that share more features with the elicitation form (i.e., that belong to more subcontexts) and that display cohesive change patterns have a larger influence on the probability that a particular output is chosen. Therefore, such properties make Eddington’s model highly sensitive to analogical effects of phonological similarity, especially those arising from particularly dense neighborhoods. Eddington argued that this model accurately simulates the human data presented by Say and Clahsen (2002) (see below for discussion). Another computational implementation of analogical principles has been proposed by Colombo et al. (2006), who trained two connectionist networks (both with input, hidden, and output layers) to produce Italian participles from different inflectional variants. In these networks, inflected forms are represented in a distributed fashion across a set of phonological units and learned connection weights mediate the activation of phonological outputs. Because of these properties, the networks generalize conjugation membership through a graded and analogical mechanism, that is, novel roots are assigned to a conjugation on the basis of their phonological overlap with similar existing verbs. Colombo et al. (2006) performed an elicited production task with adult native speakers of Italian, the results of which they argued could be accurately simulated by their model. Finally, an influential account of conjugation assignment has been proposed by Albright (2002a), also for Italian, in the form of a computational implementation: the MGL model. Unlike analogical models of morphology, the MGL model generalizes on the basis of phonological similarity, but it does so by extracting rules that incorporate both variables and restricted phonological contexts. 
The algorithm proposed in this model takes pairs of morphologically related words as input and, for each pair, posits a morphophonological rule that describes the mapping. For example, conjugation assignment in Portuguese can be represented as a mapping between a 1sg present indicative form, which does not contain a theme vowel (and displays the same inflectional affix -o in all conjugations) and an infinitive form ending in -ar, -er and -ir, in which the theme vowels distinguish the three verb classes. The rules that the model extracts for each input–output pair have the form “change A into B, in the presence of C and D”, where C and D are the phonological environments on the left- and right-hand side of the change, respectively. The first rules that are learned are word-specific, that is, they can apply only to a single input form. For example, the rule that relates the 1sg form [fiku] (fico ‘(I) stay’) to its 1st conjugation infinitive [fikar] (ficar ‘to stay’) could be described as “change [u] into [ar], in the environment [fik]”. These word-specific rules cannot be used for generalization and are indistinguishable from a direct lexical association. However, by comparing different rules with the same structural change, the model then posits ‘generalized’ rules that can encompass the phonological contexts in both rules. In order to do that, the algorithm first preserves the common phonemes in both contexts; then, when different phonemes are found, it maintains whatever phonological features these phonemes have in common; and finally, if there is additional unmatched material, it is substituted by a variable.
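The three steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the feature table is a hypothetical toy with four made-up features, the function name `generalize_contexts` is invented here, and only left-hand contexts of a word-final change are handled.

```python
# Toy feature table (hypothetical values, for illustration only).
FEATURES = {
    "f": {"cons": "+", "son": "-", "voice": "-", "cont": "+"},
    "t": {"cons": "+", "son": "-", "voice": "-", "cont": "-"},
    "i": {"cons": "-", "son": "+", "voice": "+", "cont": "+"},
    "k": {"cons": "+", "son": "-", "voice": "-", "cont": "-"},
}


def generalize_contexts(ctx_a, ctx_b):
    """Compare two left-hand contexts (lists of phonemes), working
    leftwards from the site of the change.

    Returns (needs_variable, shared_features, common_phonemes).
    """
    i, j = len(ctx_a) - 1, len(ctx_b) - 1

    # Step 1: keep the phonemes that both contexts share next to the change.
    common = []
    while i >= 0 and j >= 0 and ctx_a[i] == ctx_b[j]:
        common.insert(0, ctx_a[i])
        i -= 1
        j -= 1

    # Step 2: for the first pair of differing phonemes, keep shared features.
    shared = None
    if i >= 0 and j >= 0:
        feats_a, feats_b = FEATURES[ctx_a[i]], FEATURES[ctx_b[j]]
        shared = {name: value for name, value in feats_a.items()
                  if feats_b.get(name) == value}
        i -= 1
        j -= 1

    # Step 3: any remaining unmatched material is replaced by a variable.
    needs_variable = i >= 0 or j >= 0
    return needs_variable, shared, common


# Contexts of two rules that share the change [u] -> [ar].
print(generalize_contexts(["f", "i", "k"], ["ʃ", "t", "i", "k"]))
```

With these toy features, comparing the contexts [fik] and [ʃtik] keeps the shared phonemes [ik], describes the differing pair [f]/[t] by the features they have in common, and flags the leftover [ʃ] as material for a variable.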
For example, comparing the word-specific rule that relates [fiku] and [fikar] to the rule that relates [ʃtiku] (estico ‘(I) stretch’) and [ʃtikar] (esticar ‘to stretch’) would lead the algorithm to extract a ‘generalized’ rule that changes [u] into [ar], in the presence of certain left-side phonological material: the common phonemes [ik], preceded by a featural description that encompasses the two different phonemes, [f] and [t] (i.e., non-sonorant unvoiced consonants), together with a variable that can match any extra material (in this case, because the remaining phoneme [ʃ] in [ʃtiku] was not already covered by the phonological description). By iterating the process (i.e., by comparing all word-specific rules both to each other and to all ‘generalized’ rules), additional rules are extracted that match increasingly wider phonological environments. In the case of morphological transformations that apply to sets of forms with some degree of phonological heterogeneity (rather than clustering into relatively constrained phonological environments), even a context-free rule can be discovered (i.e., “change [u] into [ar], in the presence of X”). The resulting grammar therefore contains many redundant rules, which cover the phonological space in both general and specific ways. For example, in Portuguese, the MGL model derives the following 1st conjugation rules, with the first of these necessarily subsumed under the second (because the phoneme [i] in the first rule matches the featural description in the second rule):

(1) u → ar / [X fik___]
(2) u → ar / [X [+cont, –nas, –lat] k___]

In the MGL model, competition between rules is resolved not by their specificity (Kiparsky, 1973), but by evaluating their reliability, a measure of a rule’s ‘success’ (e.g., Albright, 2002b).
In particular, the reliability of a rule is the number of forms a rule can derive correctly (the rule’s ‘hits’) divided by the number of forms that the rule can be applied to (its ‘scope’; reliabilities are then adjusted using a lower confidence limit, which penalizes rules with smaller scopes; see Mikheev, 1997). For example, rule (2) above can apply to 155 verbs in a lexicon of Portuguese, and derives the right infinitive (i.e., correctly assigns to the 1st conjugation) for 101 of them, which leads to an adjusted reliability score of .625 (slightly less than the raw 65.2%, due to the adjustment). In contrast, rule (1) derives 48 out of 48 verbs correctly (e.g., ficar ‘to stay’; verificar ‘to verify’); because every Portuguese verb root that ends in [fik] forms an infinitive in -ar (i.e., it is a 1st conjugation verb), this rule has a very high adjusted reliability of .979 and takes precedence over the more general rule in (2). The MGL model therefore incorporates both variables and similarity as part of its mechanisms for morphological generalization. However, Albright’s (2002a) proposal can be crucially distinguished from the ‘standard’ dual-mechanism model in two important ways. Firstly, the MGL model is similarity-driven in that it preserves all phonologically specific rules, instead of a single context-free one. That is, the generalizations learned by the MGL algorithm are phonologically restricted for all conjugation classes (even when the extracted rules incorporate variables, as in examples 1 and 2 above) and many of these phonologically-sensitive rules will take precedence over a context-free rule that applies unboundedly to a grammatical category. 
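The reliability computation described earlier in this section can be illustrated with a short sketch. The exact adjustment is not spelled out in the text, so the formula below is an assumption: it smooths the raw proportion as (hits + 0.5) / (scope + 1) and subtracts a one-sided 75% normal quantile times the standard error, which is one common way of computing a lower confidence limit. The function names and rule labels are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def raw_reliability(hits: int, scope: int) -> float:
    """Forms derived correctly divided by forms the rule can apply to."""
    return hits / scope


def adjusted_reliability(hits: int, scope: int, confidence: float = 0.75) -> float:
    """Lower confidence limit on reliability (assumed formula, see lead-in).

    Rules with a small scope have a larger standard error and are
    therefore penalized more heavily.
    """
    p = (hits + 0.5) / (scope + 1.0)           # smoothed proportion
    std_err = sqrt(p * (1.0 - p) / scope)      # standard error of p
    z = NormalDist().inv_cdf(confidence)       # normal approximation
    return p - z * std_err


# (hits, scope) for the two example rules given in the text.
rules = {
    "(1) u -> ar / X fik___": (48, 48),
    "(2) u -> ar / X [+cont, -nas, -lat] k___": (101, 155),
}

# Rule competition: the rule with the highest adjusted reliability wins.
ranked = sorted(rules, key=lambda r: adjusted_reliability(*rules[r]), reverse=True)
for name in ranked:
    hits, scope = rules[name]
    print(name, round(raw_reliability(hits, scope), 3),
          round(adjusted_reliability(hits, scope), 3))
```

Under these assumptions the sketch gives values close to the adjusted scores reported for rules (1) and (2), and it ranks the more specific rule (1) above the more general rule (2), as in the text.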
In contrast, a dual-mechanism model of stem formation, such as the one proposed by Say and Clahsen (2002) and Veríssimo and Clahsen (2009), predicts that context-free and phonologically restricted generalizations align with conjugation membership: only the generalization of 2nd and 3rd (but not 1st) conjugation stems should display sensitivity to phonological properties. Secondly, in the MGL model, the ordering of rules by a reliability metric is inherently input-driven, that is, it reflects the predictive power of each phonological environment as indicative of conjugation membership. This is the case even for any context-free rule extracted by the model, which will still be ranked according to its success. In contrast, in a dual-mechanism approach, the 1st conjugation in Romance languages is the default class: rather than having its rank determined by statistical and distributional properties, the default has unlimited applicability. These contrasts between Albright’s (2002a) MGL and the dual-mechanism proposals of Say and Clahsen (2002) and Veríssimo and Clahsen (2009), together with the fact that specific quantitative predictions can be derived from the MGL computational implementation, make stem formation in Romance languages an interesting test case for the wider theoretical questions regarding the role of context-free and context-sensitive generalizations in morphology. As support for his proposal, Albright (2002a) conducted an acceptability judgement study in Italian and found that mean acceptabilities of novel infinitives belonging to the 1st, 2nd or 3rd conjugations were positively correlated with MGL rule reliabilities, suggesting that, in Italian at least, speakers are sensitive to the phonological shape of a nonce word when generating stems belonging to all three conjugations.
The specific question we will be addressing in Study 1 is whether the generalization of 1st, 2nd and 3rd conjugation stems in Portuguese displays sensitivity to the phonological properties of novel roots. In order to assess this, we conducted an elicited production experiment in which participants were presented with novel roots that were constructed such that they fell into a range of reliability values and associated phonological environments. If generalization of all conjugational stems can be appropriately described by a context-sensitive mechanism, then proportions of infinitives in -ar, -er and -ir should be influenced by this variation in phonological properties. In Study 2, we will directly compare the proportions of participant responses belonging to the different conjugations to the predicted proportions of the MGL implementation. Furthermore, the MGL model will be contrasted with a minimally different computational model that implements an explicit distinction between context-free and similarity-based generalizations for the different verbal conjugations of Portuguese.

Study 1: Similarity effects in the generalization of inflectional classes

To examine similarity effects in Portuguese, we followed the same steps as Albright did for Italian. We first applied the MGL algorithm to a large lexical database of European Portuguese. When the MGL implementation was run over the verbs in this database, the model extracted morphophonological rules and corresponding reliability scores reflecting similarity clusters or – in Albright’s terms – ‘islands of reliability’ within the three inflectional classes. Using these reliability scores, we then constructed a set of novel verbs and tested their generalization properties in an elicited production task with a group of native speakers of Portuguese. Finally, in order to determine similarity effects, we tested whether participant performance was predicted by the model’s reliability scores.
Participants were presented with root-based 1sg present indicative forms of novel verbs, and asked to produce a stem-based infinitive form. Because the 1sg present indicative is constituted by a verbal root coupled with an inflectional affix that is the same for all conjugations (-o), the presented form was ambiguous regarding conjugation membership. However, infinitive forms are constituted by a verbal root and a theme vowel (i.e., the stem), to which the infinitival affix (-r) attaches. Therefore, the elicitation of an infinitive form requires assigning a conjugation class to the novel verbal roots. By manipulating phonological similarity in a graded, continuous manner, it is possible to ascertain whether the generalization of the three conjugation classes is sensitive to contextual properties of novel roots, or instead, based on an operation over an unbounded class, that is, a variable such as ‘(V)erb root’. Therefore, the experiment allows us to contrast the similarity-based and dual-mechanism models that have been proposed for the generalization of stems in Romance languages. If the generalization of Portuguese verbal stems can be captured by Albright’s (2002a) similarity-driven model, as has been proposed for Italian, we would expect to find that the model’s reliability scores predict proportions of responses belonging to all three conjugations. In other words, we should find evidence that the generalization of 1st, 2nd and 3rd conjugations is sensitive to the phonological properties of novel roots and displays ‘gang’ effects, even when similarity to the competing classes is factored out. In contrast, if conjugational stems of Romance languages are represented in a dual architecture, we should find a dissociation between the generalization of the three conjugation classes.
If 1st conjugation stems constitute structured representations, then they should be generalized on the basis of an operation over a variable, which potentially encompasses any novel verbal root. Therefore, we expect the proportions of novel 1st conjugation responses to be insensitive to variation in phonological properties, once similarity to the competing classes is factored out. At the same time, if 2nd and 3rd conjugation stems are represented in a format that supports a similarity-based relation to their roots, we might expect them to be susceptible to graded similarity-based generalizations, as captured by the reliability scores of the MGL algorithm.

Method

Participants

Fifty-four adult native speakers of European Portuguese (mean age: 25.3, 30 males) participated in the elicited production experiment. All of the participants were from mainland Portugal, had completed at least 12 years of schooling, had normal or corrected-to-normal vision and were naive with respect to the purpose of the experiment. None of them had ever experienced language or literacy-related difficulties. Participants were randomly assigned to one of two experimental versions, 32 to one version, and 22 to the other (see Procedure below).

Simulation

The first step in constructing the materials was to build an implementation of the MGL learning algorithm described by Albright and Hayes (2002), which was programmed in Visual Basic.¹ The model’s input was selected from a large lexical database, the Léxico Multifuncional Computorizado do Português Contemporâneo (Bacelar do Nascimento et al., 2000), a frequency lexicon created from a corpus of 16,210,438 words of modern Portuguese. Starting with the 4,124 lemmatized verbs that exist in Bacelar do Nascimento et al.’s frequency lexicon, we selected those verbs that had a lemma frequency of 1 per million or higher (3,543 verbs).
In addition, because the target form for both the model and the participants was the infinitive, verbs whose infinitive form did not appear in the corpus were excluded from the model’s input. This resulted in a set of 3,117 verbs, each of these represented by a pair of forms: 1sg present indicative (which does not contain a theme vowel) and infinitive. For all verbs in the resulting set, both forms were encoded in phonetic transcription, following standard European Portuguese pronunciation, and in particular, reflecting the pronunciation variety that is more common in Lisbon (the region where most of the participants in this study lived). The inventory of Portuguese consonants and vowels was taken from Mateus and d’Andrade (2000, pp. 29–30), albeit with several modifications: we only employed phonemes that occur in European Portuguese, excluded glides (which have the same featural specification as their corresponding vowels), and excluded alternative realizations. For the resulting set of phonemes, we constructed a matrix of distinctive features, also from the work of Mateus and d’Andrade, excluding only the phonological features that contained redundant information in terms of the MGL generalization algorithm. This resulted in a matrix of 33 phonemes by 13 features, with feature values encoded as +, −, or 0 (i.e., unspecified).2
1 We have also programmed an equivalent (but more efficient) implementation in R, which is available for download at http://software.jverissimo.net.
When the MGL implementation was run over the 3,117 pairs of 1sg present indicative and corresponding infinitive forms in the database, it extracted many morphophonological rules, first a set of word-specific rules for each of the input pairs, and then a set of 6,389 ‘generalized’ rules obtained from the iterative comparison of the word-specific rules.
For each of the ‘generalized’ rules extracted by the model, the type reliability ratio was computed, that is, the number of verbs that undergo the particular morphological transformation divided by the number of all verbs that contain the relevant phonological context.3 Reliability scores were then adjusted using a lower confidence limit of 75%, which was the value used in Albright’s (2002a) simulation for Italian. Rules were then sorted in descending order by their adjusted reliabilities.
2 Phonemes excluded from Mateus and d’Andrade’s (2000, pp. 29–30) descriptive table were [ʧ], [ʤ], [v], [ɫ], [R], [j] and [w]. Features excluded were [laryngeal], [height] and [±round].
3 This calculation of reliabilities takes only type frequencies into account. In previous comparisons of type- and token-based reliability measures in Italian and English, Albright (2002a, 2002b) showed that type measures display greater correlations with human judgements, especially for the 1st conjugation in Italian.
Materials
We constructed 78 novel verbs on the basis of the MGL rules and reliability values (listed in the Appendix). Novel verbs contained specific phonological environments that the MGL model identified as constituting particularly reliable contexts for one or more inflectional classes. In order to achieve this, we first selected rules that encompass ‘islands of reliability’ for the different conjugations. Following Albright (2002a), we wanted to use correlation statistics to test how well the adjusted reliabilities of the rules discovered by the model predict the conjugation class of the forms produced by our participants. Therefore, we selected not only rules with high and low adjusted reliabilities for a given class, but also rules that span a wide range of intermediate reliability values for all three conjugations. The next step was to create novel verbs constituted by a root together with the 1sg present indicative suffix -o (pronounced [u]), such that the novel roots contained phonological material that matched the conditions of application of selected rules. One problem that arises when creating novel forms is that there are several rules that can potentially apply to any given form. Following Albright (2002a), we assumed that, for each given novel 1sg form, only the most reliable rule for each conjugation is relevant for the acceptability of the resulting infinitive form. Therefore, if we wanted a novel form to tap into a lower reliability ‘island’, care was taken to ensure that the form could not be covered by any other rule that was associated with a higher reliability value. In other words, the rules that were used to create the novel words were the most reliable ones that could apply to them. Note that because novel forms match different rules for each conjugation, each of the items used in the present study is associated with three reliability values, which can vary independently. A novel form’s reliability for a conjugation corresponds to the value of the most reliable rule that assigns it to the 1st, 2nd or 3rd conjugation. In other words, the most reliable -ar, -er, and -ir rules that can apply to a novel form define its similarity to a conjugation class. Further considerations that were taken into account when building the novel forms were that they should sound as natural as possible in Portuguese and that they should not contain existing verbs within them.4 Since we were most interested in testing for graded similarity effects in the 1st conjugation, we made sure that the 78 items fell into an ‘island’ of reliability for this class.
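The assignment of three reliability values to each novel form can be illustrated with a minimal Python sketch. It rests on simplifying assumptions that are not part of the study: rule contexts are reduced to root-final strings (the actual MGL contexts are stated over phonological features), and the `Rule` class, the function name, and all rule values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified rule: a root-final context and a theme vowel."""
    context: str        # root-final string required by the rule ("" = context-free)
    theme_vowel: str    # "a" (1st conj.), "e" (2nd conj.) or "i" (3rd conj.)
    reliability: float  # adjusted reliability of the rule


def item_reliabilities(root, rules):
    """Return, per conjugation, the score of the most reliable applicable rule."""
    best = {"a": 0.0, "e": 0.0, "i": 0.0}
    for rule in rules:
        # A rule applies when the novel root ends in the rule's context;
        # the empty context matches every root (a context-free rule).
        if root.endswith(rule.context):
            best[rule.theme_vowel] = max(best[rule.theme_vowel], rule.reliability)
    return best


# Invented rules and values, for illustration only.
example_rules = [
    Rule("", "a", 0.60),     # context-free -ar rule
    Rule("uz", "i", 0.90),   # broader -ir 'island'
    Rule("duz", "i", 0.95),  # narrower, more reliable -ir 'island'
    Rule("eb", "e", 0.80),   # -er 'island'
]
print(item_reliabilities("kaduz", example_rules))
```

In this toy case the root matches two -ir rules, and only the more reliable one defines the item's similarity to the 3rd conjugation, mirroring the assumption adopted from Albright (2002a).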
However, because the MGL model also discovers a context-free rule for 1st conjugation infinitives (albeit one that is ranked lower than the more reliable phonologically restricted rules), there is necessarily a lower boundary to the reliability values for the 1st conjugation, limiting their variance and range in the experimental materials. Another important constraint in the reliabilities of the nonce forms is that, due to the phonological distribution of 2nd and 3rd conjugation verbs into relatively tighter ‘islands’ than in the case of 1st conjugation verbs, it is difficult to create materials with medium reliability values for the non-default conjugations (see descriptive statistics in the Appendix). In order to minimize these problems, we attempted to create forms that covered the whole reliability space as much as possible, for all classes, but especially for the 1st conjugation. The resulting set of items displays a greater range and standard deviation (SD) in their reliabilities than the items used by Albright (2002a) in Italian (1st conj. range: .534 vs. .230; SD: .178 vs. .085), thereby providing enough power to detect continuous and graded effects of similarity. In addition, besides testing whether reliability scores predict probabilities of production, we also wanted to assess the MGL model’s success in predicting each response’s relative acceptability. To this end, we included items for which the most reliable rule outputs a 2nd or 3rd conjugation form (12 items for 2nd and 12 for 3rd conjugation). These 24 items also fall into phonological ‘islands’ for the 1st conjugation, but because they contain phonological properties that are highly predictive of membership in the 2nd or 3rd conjugations, assigning them to one of the non-default classes is rated by the MGL model as the most acceptable alternative. Unlike in Albright’s (2002a) experiment in Italian, we made sure that such items were also included so that we could specifically examine the proportions of 1st conjugation responses in cases of high similarity to the other classes.
4 This last consideration was followed to the extent possible, but the features and segments specified by some of the MGL rules, together with the phonological well-formedness constraints of Portuguese, necessarily yield existing verbs (e.g., many verb roots ending in [duz] belong to the 3rd conjugation, but any novel 1sg form that falls into this ‘island’ will necessarily contain the 1st conjugation form uso [uzu] ‘I use’). In the very few cases in which this was unavoidable (5 items), the possibility of a direct association to a memorized verb was reduced by ensuring that the existing form did not occur as a separate syllable (e.g., by creating a diphthong).
Procedure
Novel verbs were presented in written form in a paper booklet, and participants were asked to perform an open response sentence completion task. The first page of the booklet detailed the instructions for the task. The experiment was introduced as a study about how new words enter everyday language, and participants were informed that they would be asked to transform verbs they had never heard before. Each novel form was presented only once and was embedded in a frame ‘conversation’ consisting of two well-formed sentences. There were 78 ‘conversations’, one for each of the experimental items. The first sentence presented the novel verb inflected for the 1sg present indicative, in bold type. The second sentence created a syntactic context that required an infinitive form, but contained a blank space in place of the main verb, as in the following example:
(3) Quase sempre acuo sozinho. Mas amanhã vou _______ acompanhado.
‘I almost always acuo alone.
But tomorrow I will ________ with someone.’
Before the experimental sentences were introduced, participants were presented with four example ‘conversations’ that were similar in every respect to the experimental frames except that they contained existing verbs and that the underlined space was already filled with an infinitive form in bold font. The verbs used in these introductory frames were the most frequent verbs in Portuguese that display each of the three possible infinitive theme vowels (and the verb pôr ‘to put’, which is considered not to have a theme vowel in the infinitive). In order to reduce the influence of metalinguistic knowledge and elicit more natural responses, no mention was made of conjugation classes, nor that responses should be given in the infinitive. Participants were instructed to read the first sentence in each of the experimental frames and to complete the second sentence by filling in the blank spaces with a form that they considered appropriate. Participants were encouraged to read every frame ‘conversation’ and carefully consider their responses. However, it was also emphasized that there were no right or wrong answers, and that they should rely on their intuitions by completing the spaces using the form that “sounded best” to them. Experimental sentences were constructed so that the contexts for the novel verbs did not elicit any obvious semantic associations with existing verbs. This reduced the possibility that effects from similarity to real verbs were based on any semantic properties, rather than exclusively based on phonological similarity, which was the crucial independent variable. Two versions of the task were constructed. The sentences were presented in the same order in both versions, but the order of the novel verbs was pseudo-randomised in each version.
Therefore, because not all participants saw the same novel verb in the same sentence, the possibility of conjugation assignments reflecting associations created by particular sentences was further reduced. The 24 items for which the MGL model prefers a 2nd or 3rd conjugation form were scattered throughout the task, with 14 items appearing in the first half of the booklet in one of the versions, and in the second half of the booklet in the other. In addition, there were no more than 2 of these items presented sequentially in any of the versions. This was done to ensure that participant responses were diverse enough during the task, without several items eliciting the same response in a sequence. Data scoring and analysis Participant responses were coded into a nominal scale reflecting the conjugation that the supplied infinitive form belonged to. Two items were excluded from the analysis due to a spelling inconsistency in the two versions of the task. A total of 45 blank, illegible, or non-infinitive responses were discarded (accounting for 1.10% of the data). Another 8 responses (0.20% of the remaining data) were considered valid but were not further analyzed, because they were infinitive forms with the ending -ôr (by analogy to the verb pôr and its compounds, which belong to the 2nd conjugation, but do not display a theme vowel). Instead of directly analyzing proportions of each type of response, several adjustments needed to be made to the raw counts. Firstly, when analyzing nominal data (such as conjugation membership), there are potential problems associated with using methods such as linear regression, because proportions are inherently bounded between 0 and 1 and their error variance is not independent from the mean (Barr, 2008). These violations can lead to biases in the associated p-values, and give rise to spurious significances or null results (Jaeger, 2008). 
One way to avoid the statistical problems involved in the analysis of categorical scales is to convert proportions to relative odds, which are then subjected to a logarithmic transformation (Woolf, 1954). When this conversion is applied, the scale of the resulting log-odds or logits has the advantageous properties of being unbounded and symmetric around zero (with a logit of zero corresponding to a proportion of 50%). In the present paper, all analyses were performed on log-odds, rather than on proportions. This allows the human responses to be analyzed using linear methods (Study 1), but also, to statistically compare the human data to expected proportions predicted by the computational implementations (Study 2). In order to avoid infinite values for proportions of 0 and 1, all results were analyzed by applying the method recommended by Agresti (2002), in which raw counts of responses are converted to empirical logits, using an estimator originally proposed by Haldane (1955) and Anscombe (1956).5 In addition, when performing linear regression with empirical logits, McCullagh and Nelder (1989) and Jaeger (2008) argued for weighting cases by the inverse of their variance, due to the fact that log-odds with lower variance (i.e., those closer to zero and based on a higher number of valid responses) should be more informative to the estimated model than those with higher variance. In the present analysis, cases were weighted using an estimator recommended by McCullagh and Nelder and originally proposed by Gart (1966).6 Log-odds of responses belonging to the 1st, 2nd and 3rd conjugations were each submitted to a weighted linear regression (three regressions in total). In each regression, the reliability scores for each of the three conjugations were entered as simultaneous predictor variables, that is, the contribution of each predictor is estimated by controlling for the reliabilities for the other conjugations.
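The empirical logit and its inverse-variance weight can be sketched in a few lines of Python. The formulas are the standard Haldane/Anscombe estimator and the Gart variance estimator named above; the function names and the example counts are illustrative assumptions, not taken from the study's analysis scripts.

```python
import math


def empirical_logit(y, n):
    """Empirical logit of y target responses out of n valid responses.

    Adding 0.5 to both counts keeps the value finite when y == 0 or y == n.
    """
    return math.log((y + 0.5) / (n - y + 0.5))


def logit_weight(y, n):
    """Inverse of the estimated variance of the empirical logit."""
    variance = 1.0 / (y + 0.5) + 1.0 / (n - y + 0.5)
    return 1.0 / variance


# Hypothetical counts for three items, each with 54 valid responses.
for y in (27, 50, 54):
    print(y, round(empirical_logit(y, 54), 3), round(logit_weight(y, 54), 3))
```

With these definitions an item answered identically by every participant still receives a finite log-odds value, but a much smaller weight than an item with evenly split responses.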
Similarity to the competing conjugations can also have an effect (likely a negative one) in a particular response type; this inhibition could arise either through a linguistic principle, morphological blocking (Aronoff, 1976), or simply by virtue of the mutual exclusivity of the different possible responses. Therefore, the multiple regression method we employ here allows estimating the independent contribution of phonological similarity to each conjugation and answers the question of whether proportions of responses for any one conjugation are predicted by similarity to that class. Results Table 1 displays the results of a set of regressions, in which the reliability scores for each conjugation were simultaneously entered as predictors. The table shows the (unstandardized) coefficients of each predictor (i.e., the reliability score for each class) in the estimation of the log-odds of responses belonging to each of the three conjugations. As can be seen in Table 1, a clear effect of phonological similarity was obtained for the 2nd and 3rd conjugations, but not for 1st conjugation responses. For 2nd conjugation responses, the only significant predictor was the MGL model’s reliabilities for the 2nd conjugation (t(72) = 6.35, p < .001). Neither the reliabilities for the 1st conjugation (t(72) = 0.39, p = .696), nor those for the 3rd conjugation (t(72) = 1.83, p = .071) were significant predictors. The total model with the three reliability values as predictors had an associated r2 of .583 (F(3,72) = 33.58, p < .001). The same pattern was found for 3rd conjugation responses. The only significant predictor was the reliability scores for the 3rd conjugation (t(72) = 5.45, p < .001). 
Again, there was no effect of similarity to competing classes, as both the reliabilities for the 1st (t(72) = .09, p = .928) and the 2nd conjugations (t(72) = .91, p = .368) were not significant predictors. The model with the three reliability scores as independent variables explained 38.2% of the variance in the log-odds of 3rd conjugation responses (F(3,72) = 14.81, p < .001). A very different picture emerged for the log-odds of producing 1st conjugation infinitives. Both the reliabilities for the 2nd conjugation (t(72) = 5.25, p < .001) and the 3rd conjugation (t(72) = 3.73, p < .001) were highly significant negative predictors. In contrast, the MGL reliabilities for the 1st conjugation did not play a statistically significant role in the prediction of the corresponding log-odds (t(72) = .24, p = .809). The total model r2 for 1st conjugation responses was .395 (F(3,72) = 15.63, p < .001).

Table 1
Estimated coefficients for the MGL reliabilities for the 1st, 2nd, and 3rd conjugations in the three conducted regressions (with log-odds of 1st, 2nd and 3rd conjugations as dependent variables).

Predictors                     Dependent variables (log-odds)
                               1st conj. (-ar)   2nd conj. (-er)   3rd conj. (-ir)
Reliabilities for 1st conj.     0.14              0.28              0.06
Reliabilities for 2nd conj.    −1.69*             2.19*             0.45
Reliabilities for 3rd conj.    −1.37*             1.03              2.11*
* Indicates statistical significance at α = .05.

5 Empirical logits $g'$ are given by $g' = \ln\frac{y + .5}{n - y + .5}$, in which y is the number of participants that gave a certain response to a particular item, and n is the number of participants that gave a valid response to that item. Gart and Zweifel (1967) have shown in an empirical comparison that this estimator fares very well against the alternatives.
6 Cases were weighted by 1/v, where $v = \frac{1}{y + .5} + \frac{1}{n - y + .5}$. In the empirical comparison conducted by Gart and Zweifel (1967), this variance estimator showed no substantial biases.
These results stand in sharp contrast to those for 2nd and 3rd conjugation responses, in which the coefficients of the reliabilities of the corresponding classes were highly significant and of moderate magnitude. The three regression models were also combined into a single multivariate analysis, so that the general effect of each predictor across all three response types could be assessed. Whilst each of the regressions above examined the role of reliability scores in the production of one particular conjugation (a one-vs.-rest analysis of multinomial data; Agresti, 2002), a multivariate regression model allows simultaneously taking into account all three dependent variables, as well as their interdependence. Consistent with the previous analyses, the results showed an effect of the reliabilities for the 2nd (F(3,70) = 24.32, p < .001) and the 3rd conjugations (F(3,70) = 13.00, p < .001), but no effect of the reliabilities for the 1st conjugation (F(3,70) = 0.32, p = .814) in the assignment of novel words to a conjugation. To further test the robustness of these results, we investigated the influence of different types of reliability adjustment, which is a computational parameter of Albright and Hayes’s (2002) MGL model. Recall that raw reliabilities, which are obtained by dividing the number of verbs correctly derived by a phonological rule by the number of verbs to which the rule applies, are adjusted using a lower confidence limit. This is the only free parameter in the MGL model, varying from 50% to 100%, and its effect is to penalize the reliability of rules that apply to fewer verbs. Therefore, different confidence adjustments can dramatically change the reliability values of rules with narrower scopes, in turn leading to a substantial reordering of the rules.
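The way a lower confidence limit penalizes rules with small scopes can be sketched as follows. This is an illustration under stated assumptions rather than the adjustment actually implemented in the MGL model: it uses a smoothed proportion and a one-sided normal approximation, chosen so that a 50% confidence level imposes no penalty beyond the smoothing, and the function names and counts are hypothetical.

```python
import math
from statistics import NormalDist


def raw_reliability(hits, scope):
    """Verbs correctly derived by a rule divided by verbs the rule applies to."""
    return hits / scope


def adjusted_reliability(hits, scope, confidence=0.75):
    """Lower confidence limit on a rule's reliability (assumed formula)."""
    if scope <= 0 or not 0.5 <= confidence < 1.0:
        raise ValueError("need scope > 0 and 0.5 <= confidence < 1.0")
    p = (hits + 0.5) / (scope + 1.0)           # smoothed proportion
    z = NormalDist().inv_cdf(confidence)       # one-sided critical value
    std_error = math.sqrt(p * (1.0 - p) / scope)
    return max(0.0, p - z * std_error)


# Two hypothetical rules with the same raw reliability but different scopes.
for hits, scope in ((5, 5), (500, 500)):
    print(scope, raw_reliability(hits, scope),
          round(adjusted_reliability(hits, scope), 3))
```

Both rules are exceptionless, yet the one supported by only five verbs ends up with a clearly lower adjusted score, which is the kind of reordering that stricter confidence levels can produce.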
Furthermore, because the different conjugations in Portuguese display different phonological distributions, the type of adjustment that is used might differentially impact the similarity effects for the three conjugation classes. The potential effect of the strength of the confidence adjustment was investigated by using four additional ways of calculating the reliability function, all of them previously employed in comparisons conducted by Albright (Albright, 2002b; Albright & Hayes, 2003).7 For each of these four variants of the reliability function, the rules extracted by the MGL model were sorted in descending order of their reliabilities, and the novel words employed in the elicitation task were fed into the model. As before, each item’s reliability for the 1st, 2nd, and 3rd conjugations corresponded to the scores of the most reliable rules that output -ar, -er, and -ir infinitives, and the new reliability scores were then used as simultaneous predictors in weighted linear regressions on the production log-odds of 1st, 2nd, and 3rd conjugation responses. The different versions of the reliability function had a very small effect on the overall fit of the regression models (r2 coefficients ranged from .36 to .42 for 1st conj., .57 to .59 for 2nd conj., and .34 to .40 for log-odds of 3rd conj. responses). More importantly, all versions of the reliability function yielded a dissociation between conjugations: regardless of the type of adjustment, reliabilities for the 1st conjugation were never a significant predictor of 1st (all ps > .723), 2nd (all ps > .271), or 3rd conjugation responses (all ps > .534). In contrast, higher reliability values for both the 2nd and 3rd conjugations were systematically associated with higher proportions of responses belonging to these classes (all ps < .001) and with lower proportions of 1st conjugation responses (all ps < .001).
Therefore, this set of regressions replicated the results above and showed that the obtained dissociation between conjugations is robust to changes in this implementational parameter of the MGL model. In sum, the results showed that the log-odds of 2nd and 3rd conjugation responses were solely determined by their corresponding reliability values, with no significant effects from the reliabilities of competing classes. In contrast, the MGL reliabilities for the 1st conjugation had no effect in any of the conducted analyses, that is, they did not reliably predict 1st conjugation responses or responses belonging to the other classes. Instead, the log-odds of producing a 1st conjugation response were predicted by phonological similarity to the 2nd and 3rd conjugations, such that the higher the reliabilities for these classes, the lower the proportion of participants producing 1st conjugation infinitives.
7 The four additional types of adjustment, besides the 75% lower confidence limit used in the main analysis, were: raw reliabilities with no adjustment; a 90% lower confidence limit; a 55% lower confidence limit; and an adjustment that multiplied raw reliability values by 1.2^n, where n is the number of full segments specified in a rule’s phonological context.
Additional analyses
In order to determine whether the obtained dissociation between conjugations, and in particular, the absence of a similarity effect for the 1st conjugation, could be explained by potentially confounding factors, a range of subsequent analyses was conducted on the log-odds of 1st conjugation responses, scrutinizing the possible influence of a number of statistical and methodological aspects of our study.
Statistical factors. Concerning statistical factors, we first asked whether a similarity effect for the 1st conjugation could have been rendered undetectable because of limited statistical power.
An inspection of the estimated coefficients of the reliabilities for the 1st conjugation in all regression models indicates that this is not the case. In five regression models (on log-odds of 1st conj. responses, each employing a different way of calculating the reliability function; see above), coefficients for the reliabilities for the 1st conjugation were extremely small, ranging from 0.00 to 0.14. Even for the largest of these coefficients (Table 1, with a model intercept of 1.56), a back-transformation from log-odds to proportions shows that the effect of the whole reliability scale corresponds to an increase of only 2% of 1st conjugation responses, from 83% (at a reliability score of .00) to 85% (at a reliability of 1.00). Therefore, it does not appear to be the case that an effect of similarity to the 1st conjugation was not detected due to limitations in statistical power. A related possibility, however, is that differences in the distribution of the predictors, that is, in the distribution of reliabilities for the three conjugations, could have made the dissociation in their effect sizes more extreme. As can be seen in the Appendix, reliabilities for the 1st conjugation across the items in Study 1 had both a smaller range and SD than the reliabilities for the 2nd and 3rd conjugations. As noted before (see Materials subsection), this is a consequence of the phonological distribution of the three classes in interaction with properties of the MGL model. Despite the impossibility of generating items with a wider range of reliabilities, this limited variation could potentially reduce the contribution of similarity to the 1st conjugation in determining proportions of responses–an instance of the well-known problem of ‘‘restriction of range” (e.g., Gulliksen, 1950), which can sometimes reduce correlation coefficients (though by no means always; see Wiseman, 1967; Zimmerman & Williams, 2000). 
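The back-transformation from log-odds to proportions reported above can be checked with a short Python sketch. The intercept (1.56) and the coefficient (0.14) are the values cited in the text; the function name is an illustrative choice.

```python
import math


def inv_logit(log_odds):
    """Convert log-odds back to a proportion."""
    return 1.0 / (1.0 + math.exp(-log_odds))


intercept = 1.56  # model intercept cited for the regression in Table 1
slope = 0.14      # coefficient of the reliabilities for the 1st conjugation

low = inv_logit(intercept)           # reliability score of .00
high = inv_logit(intercept + slope)  # reliability score of 1.00
print(f"{low:.0%} {high:.0%}")       # prints: 83% 85%
```

Moving across the entire reliability scale thus shifts the expected proportion of 1st conjugation responses by roughly two percentage points.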
In order to investigate this potential confound, we estimated how much larger the ‘‘unrestricted” effect of the reliabilities for the 1st conjugation would be, by applying a common correction for restriction of range in correlations, Thorndike’s Case 2 correction (Hunter & Schmidt, 1990; Thorndike, 1949). We first performed a (weighted) correlation between the reliabilities for the 1st conjugation and the residuals of a (weighted) regression, in which the effect of the reliabilities for the 2nd and the 3rd conjugations had been removed from the log-odds of 1st conjugation responses. This correlation coefficient was then corrected, by assuming that the SD of the reliabilities for the 1st conjugation was as large as that of the reliabilities for the 2nd conjugation (SD = .177 to .293). The results showed that the corrected correlation was inflated from only .02 to .04, a minute increase of an extremely small coefficient, suggesting that restriction of range played no role in reducing the effect of similarity to the 1st conjugation. Finally, one last statistical factor that could have reduced an effect of the reliabilities for the 1st conjugation is the presence of multicollinearity (see, e.g., Baayen, 2008). In our case, because multiple regression coefficients assess only the unique contribution of the predictors, if the reliabilities for the 1st conjugation are negatively correlated with those for the 2nd and 3rd conjugations, then their independent contribution could be underestimated. We assessed this possibility in two different ways: by calculating the unique variance of the reliability predictors and by conducting a stepwise regression. We first calculated each predictor’s tolerance, a measure of its variance that cannot be accounted for by the other independent variables. Tolerance values were similar for all three reliabilities, .65 for 1st, .55 for 2nd and .70 for 3rd conjugation. 
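The correction for restriction of range can be sketched with the standard form of Thorndike's Case 2 formula. The function and argument names are illustrative, and the inputs below are the rounded values given in the text, so the output need not reproduce the published figure to the last decimal.

```python
import math


def thorndike_case2(r, sd_restricted, sd_unrestricted):
    """Correct a correlation for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r * u) / math.sqrt(1.0 + r * r * (u * u - 1.0))


# Rounded inputs from the text: observed r, and the SDs of the reliabilities
# for the 1st conjugation (restricted) and the 2nd conjugation (reference).
corrected = thorndike_case2(0.02, 0.177, 0.293)
print(round(corrected, 3))
```

Even after assuming a considerably wider spread of 1st conjugation reliabilities, a near-zero correlation remains near zero, because the correction scales the observed coefficient rather than creating an effect where none was observed.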
That is, all variables suffer from slight multicollinearity, with the reliabilities for the 2nd, not 1st, conjugation being the most affected. Therefore, if a null effect of the reliabilities for 1st conjugation was due to multicollinearity, one would expect the same pattern of results to emerge for 2nd and 3rd conjugation responses, instead of a clear dissociation between conjugations. In addition, the same three-predictor regression on the log-odds of 1st conjugation responses was conducted in a stepwise fashion, that is, the MGL reliabilities for -ar, -er, and -ir were allowed to enter the model one at a time according to how much variance they explained (see Albright, 2002a, for a similar analysis). If similarity to the 1st conjugation determines log-odds of 1st conjugation responses, then it should be considered a better predictor for inclusion. Importantly, the contribution of the first predictor to enter the model is assessed without controlling for the other predictors, and therefore, in a way that is immune to effects of multicollinearity. The results showed that the first predictor to be included was the reliabilities for the 2nd conjugation (B = 1.29, t(74) = 4.91, p < .001) and the second and last predictor was the reliabilities for the 3rd conjugation (B = 1.41, t(73) = 4.23, p < .001), but at no point were the reliabilities for the 1st conjugation considered to enter the model. Therefore, the stepwise regression produced an identical result to the previous set of regressions (in which all predictors were simultaneously considered), indicating that multicollinearity cannot account for the observed dissociation between conjugations. Methodological factors. 
Concerning potential methodological confounds, a plausible concern in a nonce word elicitation task, in which none of the possible answers is the single ‘correct’ one, is that participants might converge on a pattern of responding in which they give the same type of answer for every item (e.g., by realizing at some point in the task that most of their answers involved the same transformation and then repeating the same type of answer). If this was the case in the present task, then participants could have repeatedly assigned novel forms to the 1st conjugation, which was the more common response, without actually reading or considering the different novel forms. Crucially, this could reduce or altogether eliminate a similarity effect for the 1st conjugation, because even items with low reliability for the 1st conjugation would still get a very large proportion of 1st conjugation responses, should they occur closer to the end of the task. In the present study, various steps were taken to eliminate such task habituation effects. First, because each item was embedded in a different frame ‘conversation’, participants were led to read each individual sentence and, presumably, to give more natural responses than if they only needed to focus on the novel verbs. Second, because there were two versions of the task containing items presented in different orders, the same items should not be affected by the same list composition effects for all participants. Third, the 24 items that are highly similar to the 2nd and 3rd conjugations were scattered throughout the task in order to maximize the probability that different types of response were given at all points in the task.
More importantly, it is exactly because items that are very similar to the 2nd and 3rd conjugations were randomly distributed, that potential effects of previous answers on subsequent ones cannot, by themselves, account for the observed dissociation between conjugations. If participants settled on a pattern of 1st conjugation responding, then most items would elicit very high proportions of -ar responses, regardless of their phonological similarity to the different conjugations, in which case one would expect all conjugations to show weak or non-significant similarity effects, contrary to what was obtained. In addition to the care taken to minimize such task-related artifacts, we have attempted to estimate a possible influence of task habituation on participants’ responses. This was done by repeating the regressions on the production log-odds of 1st conjugation responses, but including as an additional predictor a (centered) variable encoding item position, thereby estimating and, more importantly, controlling for any potential influence of habituation throughout the task. Because two different versions of the task were administered, each with different item orders, log-odds and case weightings were recalculated and these four-predictor regressions were conducted separately for each version. The results showed no effect of item position in either version (version 1: t(71) = 1.16, p = .251; version 2: t(71) = 0.78, p = .441). In addition, even when item position was controlled for, there were no significant effects of the reliabilities for the 1st conjugation on the corresponding production log-odds (version 1: t(71) = 0.13, p = .900; version 2: t(71) = 0.18, p = .858).
Instead, and as before, production log-odds for the 1st conjugation were, in both versions, significantly predicted by the reliabilities for the 2nd (version 1: t(71) = 5.28, p < .001; version 2: t(71) = 4.29, p < .001) and 3rd conjugations (version 1: t(71) = 3.59, p < .001; version 2: t(71) = 2.96, p = .004), such that higher reliabilities for these classes are associated with smaller log-odds of 1st conjugation responses. These results show that the observed dissociation between conjugations was immune to habituation effects and also that it is robust enough to hold for two distinct subgroups of participants. Discussion The results of the elicited production task showed a clear dissociation between the 1st conjugation, on the one hand, and the 2nd and 3rd conjugations on the other, which can be summarized as follows. Firstly, a substantial proportion of the variance in the probabilities of producing 2nd and 3rd conjugation forms was accounted for by their respective reliability scores, indicating that generalizations of stems belonging to these classes were determined by the degree of phonological similarity (as defined by Albright’s (2002a) MGL model) between novel roots and the roots of existing 2nd and 3rd conjugation verbs. Secondly, the reliabilities for the 1st conjugation did not predict the probabilities of producing 1st conjugation forms, once the reliabilities for competing classes were factored out. Thirdly, the majority of variance in producing 1st conjugation forms was explainable by the reliabilities for the two other classes, such that the higher the reliabilities for the 2nd or 3rd conjugations, the lower the likelihood of a 1st conjugation response. These results were found to be unaffected by a number of computational, statistical and methodological factors.
In all analyses, generalization of 1st conjugation forms was not influenced by the degree of phonological similarity, which is precisely what is expected if 1st conjugation stem formation is based on a context-free operation that is insensitive to the idiosyncratic properties of individual tokens. On the other hand, the results suggest that the generalization of 2nd and 3rd conjugation stems is based on a context-sensitive and graded process that produces similar responses to similar inputs, that is, on an operation of similarity-based generalization. Study 2: A comparison of two competing computational implementations In the present study, we contrasted two different computational implementations in how well they match the elicited production data obtained in Study 1. The first of these was Albright’s (2002a) MGL, a model in which multiple morphophonological rules are extracted from input–output pairs and then evaluated by their reliability, that is, by the adjusted proportion of verbs in the lexicon that they correctly derive. The second implementation is our own model, the Default Generalization Learner (DGL), in which the evaluation metric makes a principled distinction between a context-free default rule and context-sensitive rules, which contain phonological material as part of their conditions of application. As explained before, a number of specific and general rules are extracted by the MGL algorithm during the learning stage, possibly including a context-free rule that applies a given morphological transformation to any possible input (i.e., it contains only a variable as part of its conditions of application). Importantly, all MGL rules, regardless of their specificity, are ordered by their reliability values, that is, by how successful they are in deriving existing forms in the language.
The DGL model we propose here is based on the same algorithm for the discovery of phonological environments and rule learning, but differs from the MGL model in the ranking of rules, and in particular, in how the context-free rule is evaluated. More specifically, the DGL model is endowed with a built-in principle according to which a maximal reliability score is attributed to the first context-free rule that is derived during the rule-learning process, that is, to the rule that forms a 1st conjugation infinitive from any 1sg form. The assignment of a maximal score follows from the very notion of a linguistic default. By definition, a default can be applied to an unbounded number of cases, so that in terms of a reliability metric of rule evaluation, a default should have maximal reliability or confidence. Therefore, the crucial distinction between Albright’s (2002a) MGL model and the DGL model we are proposing is that our model is innately guided to distinguish between a context-free rule and all other phonologically restricted rules: only the context-free rule is ascribed maximal reliability, or in other words, the morphological transformation that it performs is ascribed maximal well-formedness. As such, the DGL model becomes an implementation of a dual-mechanism proposal, similar to the ones by Say and Clahsen (2002) and Veríssimo and Clahsen (2009), in two important ways. First, by making a principled distinction between the two types of rule, the model invokes separate mechanisms (context-free and context-sensitive) for the generalization of the default class (i.e., the 1st conjugation) and of the two other phonologically restricted classes (i.e., the 2nd and the 3rd conjugations).
Second, by biasing rule evaluation towards the maximal well-formedness of a default, our model postulates an innate principle that goes beyond the input-driven statistics that guide rule evaluation in Albright’s MGL model. In the present study, these two competing implementations were contrasted by comparing their predicted proportions of 1st, 2nd and 3rd conjugation responses for the items employed in Study 1, with the actual proportions of responses given by participants. Whilst the analyses in Study 1 involved correlating the MGL reliability scores to log-odds of responses (i.e., whether responses increase as the corresponding similarity to a class increases), here we ask how the MGL model and our dual-mechanism implementation fare in predicting actual proportions of responses, that is, whether they overestimate or underestimate responses belonging to the different conjugations. In the present analysis, the 76 items used in Study 1 were factorized into one of three groups, following the MGL model preference for the 1st, 2nd, or 3rd conjugations. That is, items were grouped according to the transformation performed by the most reliable MGL rule that they match. Under the assumptions that the well-formedness of a novel stem is derived from the reliability of its best matching rule, and that most participants choose the most well-formed transformation when producing an output, the MGL prediction is that the 24 items that are highly similar to the 2nd or 3rd conjugations should elicit a majority of responses belonging to these classes. In contrast, the DGL implementation specifically favors the application of the context-free default regardless of the phonological properties of novel roots. Therefore, the DGL model predicts that even items that are similar to the 2nd and 3rd conjugations receive a majority of 1st conjugation responses. For the remaining 52 items, both models would predict a majority of 1st conjugation responses. 
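The item grouping and the contrasting majority predictions of the two models can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `preferred_class` and the dictionary layout are hypothetical, the maximal reliability of the default rule is assumed to be 1, and the three reliabilities are the ones reported for the novel form sarrolvo in the Method section.

```python
def preferred_class(reliabilities, default=None, max_reliability=1.0):
    """Return the conjugation whose best-matching rule is most reliable.

    reliabilities: dict mapping a class label ("ar", "er", "ir") to the
        reliability of the best rule of that class matching the item.
    default: if given, that class is treated as a context-free default and
        its score is replaced by max_reliability (DGL-style assumption).
    """
    scores = dict(reliabilities)  # copy, so the caller's data is untouched
    if default is not None:
        scores[default] = max_reliability
    return max(scores, key=scores.get)


# Reliabilities reported in the Method section for the novel form sarrolvo.
sarrolvo = {"ar": 0.482, "er": 0.489, "ir": 0.171}

mgl_choice = preferred_class(sarrolvo)                # purely reliability-driven
dgl_choice = preferred_class(sarrolvo, default="ar")  # default rule gets score 1
```

Under these assumptions, a purely reliability-driven ranking places sarrolvo with the 2nd conjugation items, whereas a ranking that fixes the default at maximal reliability selects the 1st conjugation for every item whose other reliabilities are below 1.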
Method The predictions of the two computational models were compared to human data by deriving expected proportions from each model for each of the items that was part of the elicited production task. Expected proportions of responses were derived from the MGL rule evaluations by applying Luce’s (1959) choice axiom to each of the well-formedness scores (i.e., reliabilities) for all the novel items that were used in Study 1. In other words, it was assumed that the probability of selecting an output from a set of candidates with different well-formedness scores is given by dividing each corresponding score by the sum of well-formedness scores for all possible outputs. More specifically, in order to calculate the MGL model’s expected proportions of 1st, 2nd and 3rd conjugation responses for a given item, we first obtained every item’s reliabilities for -ar, -er and -ir, that is, the same values that were used as predictors in Study 1. Secondly, each of the scores was divided by the sum of all three reliability values to yield expected proportions of -ar, -er and -ir responses. For example, the best rules for each conjugation that match the novel 1sg present indicative form sarrolvo [sɐʁolvu] have reliabilities of .482 (for [sɐʁolar], in the 1st conjugation), .489 (for [sɐʁoler], in the 2nd conjugation) and .171 (for [sɐʁolvir], in the 3rd conjugation). In this example, the sum of all three reliability scores for sarrolvo is 1.142, and dividing each of the three scores by this sum yields predicted proportions of 42.21%, 42.82%, and 14.97% for the 1st, 2nd, and 3rd conjugations, respectively (which necessarily add up to 100%). Thirdly, as in Study 1, statistical analyses were conducted on log-odds, rather than expected proportions. 
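The normalization step described above can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the study: the names `luce_proportions` and `dgl_scores` are hypothetical, and the DGL variant assumes that the maximal reliability assigned to the context-free 1st conjugation rule equals 1.

```python
def luce_proportions(scores):
    """Luce's choice rule: each score divided by the sum of all scores."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return {label: value / total for label, value in scores.items()}


def dgl_scores(reliabilities, default="ar", max_reliability=1.0):
    """Copy of the reliabilities with the default class set to the maximum.

    Assumption: the maximal well-formedness score is 1.
    """
    scores = dict(reliabilities)
    scores[default] = max_reliability
    return scores


# Best-rule reliabilities for the novel 1sg form sarrolvo (from the text).
sarrolvo = {"ar": 0.482, "er": 0.489, "ir": 0.171}

mgl = luce_proportions(sarrolvo)              # similarity-driven expectation
dgl = luce_proportions(dgl_scores(sarrolvo))  # default fixed at reliability 1

for label in ("ar", "er", "ir"):
    print(f"-{label}: MGL {mgl[label]:.2%}  DGL {dgl[label]:.2%}")
```

For the MGL scores this reproduces the worked example in the text; for the DGL scores the 1st conjugation proportion reduces to 1 / (1 + w2 + w3), where w2 and w3 are the 2nd and 3rd conjugation reliabilities.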
In order to make the log-odds calculation exactly parallel to the one for the human data, expected counts for each particular item were obtained by multiplying expected proportions by the number of valid responses for that same item in Study 1; these predicted counts were then subjected to the same empirical logit transformation (as per the equation in Footnote 3). The same procedure was employed to derive expected proportions and log-odds from our DGL model, which differs from Albright’s (2002a) MGL model in that the 1st conjugation context-free rule is assigned a maximal well-formedness score (i.e., a reliability score of 1). Therefore, expected proportions of 1st conjugation responses for an item with reliabilities w2 and w3 (for the 2nd and 3rd conjugations) were given by the inverse of 1 + w2 + w3, whilst expected proportions of 2nd and 3rd conjugation responses were given by dividing the reliabilities for each of these classes by 1 + w2 + w3. As above, expected counts were calculated and then converted to log-odds (see Footnote 7). Finally, items were grouped in terms of their highest MGL reliability value, that is, separated into items more similar to the 1st, 2nd and 3rd conjugations (52 items for a 1st conjugation group, and 12 items for each of the 2nd and 3rd conjugation groups). The data were submitted to analyses of variance (ANOVAs) and t-tests, directly comparing model expected log-odds to human data (see Footnote 8).

8 The case weighting that was employed in Study 1 cannot be used in repeated measures analyses of multinomial data, such as the ones in Study 2, because the comparisons involve two different sets of weights.

9 All paired t-tests performed on subsets of 12 items were also conducted using a non-parametric method, Wilcoxon signed rank tests, which showed exactly the same pattern of results.

Fig. 1. Mean proportions (and associated 95% confidence intervals) of 1st, 2nd, and 3rd conj. responses obtained with human participants and predicted by the MGL and DGL models, for the group of items that highly resemble (a) 1st conj., (b) 2nd conj. and (c) 3rd conj. verbs.

Results The mean proportion of 1st conjugation responses across all 76 analyzed items was 73.2%, demonstrating that participants displayed a clear preference for producing 1st conjugation infinitives. Mean proportions of 2nd and 3rd conjugation infinitives were substantially smaller, corresponding to 12.3% and 14.3%, respectively. Fig. 1 displays proportions of 1st, 2nd and 3rd conjugation responses obtained in the elicited production task with human participants, as well as expected proportions derived from the MGL and DGL computational models, for each group of items: novel items highly similar to the 1st (panel a), 2nd (panel b) and 3rd conjugations (panel c). In what follows, we will contrast the different types of responses given by participants with the predictions of both the MGL and the DGL models. Finally, the two models are directly compared with respect to the absolute error of their predictions across all items. Human responses vs. MGL model predictions For the group of items that are highly similar to the 1st conjugation (see Fig. 1, panel a), the MGL model predicts higher proportions of 1st conjugation infinitives (73.7%) than of those belonging to the 2nd (6.9%; t(51) = 20.78, p < .001) or 3rd conjugations (19.4%; t(51) = 13.17, p < .001). This same pattern was obtained in the human data, that is, 1st conjugation responses were overwhelmingly prevalent (80.1%) and more common than 2nd (7.0%; t(51) = 18.96, p < .001) or 3rd conjugation responses (12.6%; t(51) = 16.12, p < .001). However, a direct comparison between the log-odds of 1st conjugation responses produced by participants (80.1%) and those predicted by the MGL model (73.7%) shows that the model significantly underestimates proportions of infinitives in -ar (t(51) = 3.73, p < .001).
As for responses to items highly similar to the 2nd conjugation (see Fig. 1, panel b), we directly compared the relative proportion of 1st and 2nd conjugation responses given by participants with the predictions of the MGL model, by conducting a repeated measures Analysis of Variance (ANOVA) with two factors: Output Source (MGL vs. participants) and Response Type (1st conj. vs. 2nd conj.). The results revealed no main effects (Output Source: F(1,11) = 1.41, p = .260; Response Type: F(1,11) = 1.90, p = .196), but a significant Output Source × Response Type interaction (F(1,11) = 10.09, p = .009). That is, the difference between log-odds of 1st and 2nd conjugation responses was significantly larger in the human data than in the model’s predictions. Even though, for these items, participants produced a majority of 1st conjugation responses (50.9%) and a slightly smaller proportion of 2nd conjugation responses (44.2%), with the log-odds of these responses being statistically indistinguishable (t(11) = 0.71, p = .491), the MGL model predicted higher proportions of responses belonging to the 2nd conjugation (57.2%) than to the 1st conjugation (34.5%), and this difference is statistically significant (t(11) = 9.07, p < .001). In addition, direct contrasts between the human data and the MGL model showed that the model overestimated log-odds of 2nd conjugation responses (44.2% vs. 57.2%; t(11) = 2.64, p = .023) and underestimated log-odds of 1st conjugation responses (50.9% vs. 34.5%; t(11) = 3.52, p = .005). When considering the group of items that are highly similar to the 3rd conjugation (see Fig. 1, panel c), participant and MGL model data were submitted to the same repeated measures ANOVA, but including 1st and 3rd conjugation log-odds as the two Response Types. A very similar pattern emerged.
There was no main effect of Output Source (F(1,11) = 1.53, p = .242), and a marginally significant effect of Response Type (F(1,11) = 4.25, p = .064). More importantly, the ANOVA also revealed a significant Output Source × Response Type interaction (F(1,11) = 12.94, p = .004), demonstrating a larger difference between the log-odds of 1st and 3rd conjugation responses in the participant data than in the MGL model’s predictions. Participants produced a majority of 1st conjugation responses (65.4%), that is, proportions of infinitives in -ar were higher than of 3rd conjugation infinitives (t(11) = 2.85, p = .016), with the latter being associated with a mean proportion of only 31.2%. In contrast, the model predicts a significantly higher proportion of 3rd conjugation (53.9%), rather than 1st conjugation (42.1%) responses (t(11) = 7.22, p < .001). Furthermore, relative to the responses produced by human participants, the MGL model overestimates proportions of 3rd conjugation responses (31.2% vs. 53.9%; t(11) = 3.58, p = .004) and underestimates proportions of 1st conjugation responses (65.4% vs. 42.1%; t(11) = 3.56, p = .004), as was the case for items highly similar to the 2nd conjugation. Human responses vs. DGL model predictions In order to investigate how the predictions of the DGL model match the human data, we conducted parallel analyses to those conducted for the MGL model reported above. First, considering only the group of items that are more similar to the 1st conjugation than to the other classes (see Fig. 1, panel a), recall that both the participant data and the MGL predictions showed a large majority of 1st conjugation responses. The same pattern is obtained in the analyses of the predictions of the DGL model, that is, 1st conjugation infinitives (79.0%) are predicted to be a more common response than both 2nd conjugation (5.6%; t(51) = 26.01, p < .001) and 3rd conjugation infinitives (15.4%; t(51) = 18.08, p < .001).
Crucially, however, whilst the MGL model significantly underestimated proportions of 1st conjugation responses (see above), the DGL predictions approximated the proportions of 1st conjugation responses remarkably well (80.1% vs. 79.0%; t(51) = 1.24, p = .220). Likewise, for the group of items more similar to the 2nd conjugation (see Fig. 1, panel b), the same analysis that was employed in the comparison of the predictions of the MGL model with the human data was now conducted for the DGL predictions. More specifically, we conducted the same repeated measures ANOVA with two factors, Output Source (DGL vs. participants) and Response Type (1st conj. vs. 2nd conj.). The results revealed no main effects (Output Source: F(1,11) = 0.09, p = .775; Response Type: F(1,11) = 1.71, p = .217), and, crucially, no interaction between the two factors (F(1,11) = 0.003, p = .955). That is, the difference between mean log-odds of 1st and 2nd conjugation responses for these items was similar in the human data and the DGL predictions. Even though this interaction was not significant, we further tested the DGL predictions by comparing log-odds of 1st and 2nd conjugation responses, within and across each Output Source. The DGL model predicted larger mean log-odds of 1st (50.8%), rather than 2nd (43.0%) conjugation responses (t(11) = 3.20, p = .008). This same contrast was not statistically significant in the analysis of the human data (see above), but the discrepancy can be attributed to the larger variability in participant responses. In fact, as can be seen in Fig. 1 (panel b), mean proportions predicted by the DGL model are almost identical to those in the participant data, and they do not differ for either 1st conjugation responses (50.9% vs. 50.8%; t(11) = 0.02, p = .982), or 2nd conjugation responses (44.2% vs. 43.0%; t(11) = 0.13, p = .901). Considering now the predictions of the DGL model for the items that are highly similar to the 3rd conjugation (see Fig.
1, panel c), the same repeated measures ANOVA was conducted, albeit on 1st and 3rd conjugation responses. The ANOVA revealed no main effect of Output Source (F(1,11) = 2.64, p = .133), but a main effect of Response Type (F(1,11) = 11.72, p = .006), reflecting more 1st, rather than 3rd, conjugation responses. The interaction between these two factors approached significance (F(1,11) = 4.44, p = .059), which suggests a larger difference between 1st and 3rd conjugation responses in the participant data than in the DGL predictions. Paired contrasts showed that the DGL model predicted larger proportions of 1st (56.0%) rather than 3rd (41.0%) conjugation responses (t(11) = 7.72, p < .001), which is the same pattern that was obtained in the human data (see above). However, marginally significant comparisons across Output Sources suggest that the model slightly underestimates 1st conjugation responses (65.4% vs. 56.0%; t(11) = 1.96, p = .076) and slightly overestimates 3rd conjugation responses (31.2% vs. 41.0%; t(11) = 2.20, p = .051). MGL vs. DGL in absolute error In the last analysis contrasting the two computational implementations, the MGL and DGL models were directly compared in how well they fit the data across all items employed in the present study. In order to assess this, absolute error was calculated for each item and each response type, for each of the two models. That is, we calculated the absolute difference between each model’s predicted proportions and the obtained proportions of responses given by participants (after conversion to log-odds), regardless of whether the human data were being under- or overestimated. Average absolute error produced by the MGL and DGL models was then compared for all three response types, by conducting t-tests.
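The two accuracy measures used in this comparison can be written out as a short Python sketch. It only illustrates the definitions: the function names are hypothetical, the inputs are assumed to be per-item log-odds for one response type, and the toy values in the usage lines are invented for illustration and are not data from the study.

```python
import math


def mean_absolute_error(predicted, observed):
    """Average of |predicted - observed| over items (sign is ignored)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def root_mean_squared_error(predicted, observed):
    """Square root of the average squared difference over items."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical log-odds for two items (illustration only, not study data).
toy_predicted = [1.0, -0.5]
toy_observed = [0.5, 0.5]

toy_mae = mean_absolute_error(toy_predicted, toy_observed)
toy_rmse = root_mean_squared_error(toy_predicted, toy_observed)
```

Because squaring weights large deviations more heavily, the root mean squared error is never smaller than the mean absolute error for the same set of items.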
The results revealed that average absolute error for 1st conjugation responses was significantly lower for the DGL (0.68) than for the MGL model (0.85; t(75) = 3.81, p < .001). The same advantage was obtained for responses belonging to the non-default classes: the DGL model outperformed the MGL model in its predictions of the proportions of 2nd conjugation responses (0.97 vs. 1.05; t(75) = 2.42, p = .018) and of 3rd conjugation responses (0.80 vs. 0.99; t(75) = 6.22, p < .001). Finally, root mean squared error (a commonly used measure of model accuracy) was also calculated and was uniformly lower for the DGL model than for the MGL model, for 1st (MGL vs. DGL: 1.00 vs. 0.80), 2nd (1.29 vs. 1.23), and 3rd conjugation responses (1.21 vs. 1.02). Summary The results of Study 2 showed that the preference for the 1st conjugation is sufficiently strong to override high levels of similarity to the other conjugations. For items very similar to the 2nd conjugation, proportions of 1st and 2nd conjugation infinitives were statistically indistinguishable (although numerically larger for the 1st conjugation). Similarly, for items that match highly reliable 3rd conjugation rules, the mean proportion of 1st conjugation infinitives was even larger than that of responses belonging to the 3rd conjugation. When comparing Albright’s (2002a) MGL model of stem formation with the participant data, it was clear that the model failed to account for the data in a very specific way: it consistently overestimated the role of similarity in the generalization of inflectional classes in Portuguese. This led to inflated predictions of the proportions of 2nd and 3rd conjugation responses for verbs that fall into specific phonological contexts that are characteristic of these classes. Conversely, the MGL model underestimated the proportion of 1st conjugation responses, which were found to be more numerous than what could be predicted by the reliability metric that determines rule priority in the model.
In contrast, the predicted proportions of responses extracted from our DGL model showed a remarkably close fit to the human data and were, in fact, statistically indistinguishable from the obtained proportions. General discussion The combined results of Study 1, that is, the default-like behavior of 1st conjugation and the similarity effects of 2nd and 3rd conjugation stems, are straightforwardly explained by the dual-mechanism account put forward by Say and Clahsen (2002), for Italian, and Veríssimo and Clahsen (2009) for Portuguese. This account distinguishes between a general rule for 1st conjugation stem formation that may apply to any verbal root and a restrictive set of associatively represented 2nd and 3rd conjugation stems and roots. Alternatively, it has been proposed that conjugational stems do not have any internal morphological structure and that the default-like behavior of the 1st conjugation is a consequence of its higher frequency or of its more heterogeneous phonological distribution. Consider firstly Eddington’s (2002) attempt to simulate the results from Say and Clahsen (2002). It is true that this model correctly simulated the finding that novel forms of verbs that bear no similarity to existing words are preferably assigned to the 1st conjugation. Closer inspection, however, reveals that the simulation output was inaccurate in several other ways. For novel verbs resembling existing 2nd and 3rd conjugation verbs with irregular stems, for example, speakers of Italian preferred 1st conjugation forms (59% and 71%, respectively), but in Eddington’s model the majority of responses belonged to the 2nd (55%) or 3rd conjugations (56%). For novel verbs that rhymed with existing 2nd conjugation verbs, Say and Clahsen’s participants produced 1st and 2nd conjugation forms with similar percentages (43% for the former, 45% for the latter), but Eddington’s model dispreferred 1st over 2nd conjugation forms (27% vs. 54%).
These discrepancies between the model’s output and the performance of human participants are a result of the model’s overreliance on phonologically similar patterns as the basis for generalization. Similar problems arise in Colombo et al.’s (2006) connectionist network model of stem formation in Italian. Again, the proportions of 1st conjugation responses (i.e., participles in -ato) to 2nd or 3rd conjugation pseudoverbs were much smaller than for human participants, indicating an oversensitivity of the network to phonological similarity. Furthermore, unlike human participants, Colombo et al.’s model produced a large proportion of incorrect (‘unclassifiable’) responses, especially for 2nd and 3rd conjugation pseudoverbs. This relatively poor performance is likely to be a consequence of the high proportion of 1st conjugation forms in the training set, which enabled the network to approximate the default-like behavior of the 1st conjugation. However, Colombo et al.’s results suggest that this can only be achieved at the expense of an unacceptable level of performance for the other classes. Thus, purely frequency and similarity-based models such as the ones proposed by Eddington (2002) and Colombo et al. fail to accurately simulate the different generalization properties of the inflectional classes. Finally, Albright’s (2002a) MGL model is not able to capture the generalization properties of 1st conjugation stems in Portuguese. If, as postulated by the MGL model, all conjugations were generalized on the basis of a context-sensitive mechanism, we would expect our manipulation of phonological similarity to have an effect on the proportion of responses belonging to all classes. However, our results showed that the MGL reliability scores only predicted proportions of 2nd and 3rd conjugation responses. 
Furthermore, the MGL model consistently underestimated the proportion of 1st conjugation responses given by participants, that is, the 1st conjugation is generalized to novel roots beyond what would be expected given their phonological similarity to existing roots. In contrast to the present findings, however, Albright’s (2002a) acceptability judgement study in Italian produced a different pattern of results, in which acceptability ratings of novel infinitival forms belonging to all conjugations were found to be predicted by their respective reliability scores. Although ratings given to 1st conjugation forms were (negatively) predicted by ratings given to other classes, and were influenced by a phonological well-formedness metric, there was still a significant effect of the reliabilities for the 1st conjugation. Therefore, whilst our results for the 2nd and 3rd conjugations are in accordance with those reported by Albright in Italian, there is an important inconsistency regarding the role of phonological similarity in the generalization of 1st conjugation stems. How can these discrepancies be explained? First, the possibility that the conflicting results are attributable to language-specific differences cannot be discarded. Although the morphological characteristics of Portuguese and Italian are very similar, the distribution of the different conjugations in terms of phonological properties and their type and token frequencies are different, and such differences could perhaps bias the morphological system towards a larger reliance on context-sensitive or variable-based generalization. However, given the scarcity of psycholinguistic evidence pertaining to this issue, there is no principled reason for assuming that the generalization of conjugation classes in different Romance languages should be characterized by distinct representational mechanisms.
Setting aside this basic divergence, the only obvious differences between Albright’s study and the present experiment are of a methodological nature. Firstly, regarding the MGL input and the construction of stimuli, our elicited production experiment featured a number of methodological improvements over Albright’s (2002a) experiment. One of these is the size of the frequency lexicon that served as input to the MGL simulation. We used a considerably larger set of verbs than Albright (n = 3,117 vs. n = 2,022). A smaller frequency-ordered lexicon will differ from a larger one primarily in the number of 1st conjugation verbs it contains, which in turn, might lead to dramatic changes in the corresponding reliabilities, possibly leading to their underestimation. It is difficult to assess the effects of this difference without running MGL simulations comparing lexicons of different sizes, but it should be clear that the reliability values we obtained are more realistic than the values used by Albright. In addition, our materials are also more representative of the language than Albright’s in that they covered a wide range of reliability scores, whereas Albright’s items were confined to a subset of verbs with very high reliabilities for the 1st conjugation (see Method of Study 1). Thirdly, and more importantly, the two studies crucially differ in the experimental tasks that were used. The present experiment used elicited production as a means to assess generalization, whilst Albright’s (2002a) study employed an acceptability judgement task. In Albright’s experiment, participants were presented with a 1sg present indicative form of a given verb followed by the corresponding infinitive forms (belonging to different conjugations) and they had to rate how ‘typical’ each form sounded to them. It is conceivable that this particular presentation format recruits processes that are very different from the ones normally involved in language use. 
For example, the emphasis on the ‘typicality’ of a form might trigger (perhaps conscious) processes of lexical search for similar existing forms. Furthermore, the presentation of all possible infinitival stems for a given root may encourage participants to perform artificial comparisons as to how typical each form is as a member of a particular class. In this way, Albright’s task may have produced inflated similarity effects. Compare this with an elicited production experiment, in which participants are only given one ambiguous form and asked to complete a sentence blank in the way they consider most appropriate. We consider this a more natural way of examining the mechanisms involved in the generalization of stems than Albright’s acceptability judgement task. In Study 2, we have also directly contrasted the predictions of the MGL model with a revised model of stem formation. In order to produce a novel implementation we changed the original MGL model in a single specific way, which we argued would bring the model closer to a dual-mechanism account (Clahsen, 1999; Say & Clahsen, 2002; Veríssimo & Clahsen, 2009). Instead of confidence in a 1st conjugation output being measured by Albright and Hayes’s (2002) reliability function (which measures the ‘success’ of a given rule), we proposed that a default context-free rule is treated by the linguistic system in a special way, in that it is associated with a maximum confidence value. In other words, the default morphological transformation is associated with a maximum well-formedness score. The results of this study showed that the DGL model outperforms the MGL model both in overall error, and in accounting for the patterns of participant responses for different types of items. 
In particular, whilst Albright’s (2002a) MGL underestimated the proportions of 1st conjugation responses that were given to items that are highly similar to the 2nd and 3rd conjugations, the DGL model predicts average proportions that are remarkably similar to the ones found in the human data. Furthermore, whilst the MGL model underestimated 1st conjugation responses even for novel forms that it preferentially assigns to that class, the DGL model does not.

Importantly, this implementation also provided proof of principle in two respects. Firstly, it shows that a model embedding a dual-mechanism architecture does not overgeneralize the default transformation. That is, even though the 1st conjugation has a quantitative advantage relative to the other classes, mean proportions of 1st conjugation responses were not overestimated. Secondly, it demonstrates that even substantial proportions of non-default responses (in our study, 20% of 2nd and 3rd conjugation responses for items that are relatively more similar to the 1st conjugation) can be explained by a dual-mechanism account. Indeed, these responses follow naturally from the fact that such items bear some resemblance to existing 2nd and 3rd conjugation verbs, and the strength of this process relative to the preference for the 1st conjugation appears to have been almost perfectly approximated by the dual-mechanism model.

Finally, another crucial respect in which the DGL model outperforms Albright’s (2002a) MGL is by predicting that the generalization of the 1st conjugation to novel forms is insensitive to phonological similarity. Given that the context-free rule is ascribed maximal reliability, it will always serve as the best 1st conjugation rule; any minor 1st conjugation rules that apply in restricted parts of the phonological space have lower reliability and do not make any contribution to the output.
As such, our DGL model treats the generalization of the 1st conjugation as always based on a context-free mechanism in which phonological similarity plays no role, which accounts for the results of our regression analyses of the elicited production task.

Even though we demonstrated that the kind of structured phonological similarity that is captured by the MGL model is not sufficient to explain the generalization of 1st conjugation stems, we have not explored the potential role of other sources of similarity for morphological processes. For example, Ramscar (2002) has argued that semantic similarity plays a role in English past tense inflection. Likewise, Keuleers et al. (2007) have used a plural inflection task in Dutch to show that an analogical memory-based model that makes use of both phonological and orthographic information produces generalization patterns similar to those of human participants. One direction for future work is investigating whether such effects can be found in stem formation, for example, by applying the same algorithm of minimal generalization to inputs and outputs that incorporate non-phonological representations. Note, however, that such accounts also predict effects of phonological similarity for all types of morphological generalization (besides effects of other types of similarity). Therefore, the challenge for any similarity-driven account is showing that effects of phonological similarity in Portuguese dissociate along conjugation classes: they are absent in the generalization of the 1st conjugation, but present in the case of the non-default classes.
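The single change that separates the two models discussed in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dgl_scores`, `choice_proportions`) are hypothetical, and converting scores into expected proportions by dividing each score by the sum of all three is an assumption made here for illustration. The input values are the three MGL reliabilities listed for the novel item bendo in the list of experimental stimuli; the resulting proportions are outputs of this sketch only, not results reported in the study.

```python
from typing import Dict

# Scores are keyed by conjugation class: 1 (-ar), 2 (-er), 3 (-ir).
Scores = Dict[int, float]


def dgl_scores(mgl: Scores, default_class: int = 1,
               max_confidence: float = 1.0) -> Scores:
    """Return a copy of the MGL scores in which the default class is
    assigned the maximum confidence value (the dual-mechanism change).
    The non-default classes keep their similarity-based scores."""
    out = dict(mgl)
    out[default_class] = max_confidence
    return out


def choice_proportions(scores: Scores) -> Scores:
    """Turn per-class scores into proportions that sum to 1.
    ASSUMPTION: simple normalization by the total score."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one class needs a positive score")
    return {cls: s / total for cls, s in scores.items()}


# MGL reliabilities for 'bendo' as listed in the stimulus appendix.
bendo_mgl: Scores = {1: 0.610, 2: 0.833, 3: 0.150}

mgl_props = choice_proportions(bendo_mgl)
dgl_props = choice_proportions(dgl_scores(bendo_mgl))

for name, props in (("MGL", mgl_props), ("DGL", dgl_props)):
    print(name, {cls: round(p, 3) for cls, p in props.items()})
```

Under these assumptions, the similarity-driven scores favour the 2nd conjugation for this item, whereas fixing the default at maximum confidence raises the 1st conjugation share while still leaving a sizeable share for the 2nd conjugation, which is the qualitative pattern described above.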
Conclusions

In sum, when the three Portuguese conjugations were explicitly distinguished such that the default conjugation is ‘maximally reliable’, and the non-default classes are still generalized in a restricted manner on the basis of similarity, we arrived at a much better account, one that produced less overall error and explained both the qualitative and quantitative patterns in the results of our elicited production study. In contrast, when the three conjugations were generalized by similarity and their strength was derived purely from input-driven statistics, as is the case in the MGL model, the role ascribed to similarity was larger than what participants reveal. This specific discrepancy from the human data is evident not only in the behavior of Albright’s (2002a) MGL model, but also in that of other similarity-based models of conjugation assignment in Romance languages (e.g., Colombo et al., 2006; Eddington, 2002; see above). We believe this to be a consequence of these models’ single-mechanism architectures and, in particular, that the pattern of errors they display suggests that purely similarity-driven algorithms of morphological acquisition are insufficient to exhibit default-like generalizations.

By taking Romance conjugations as a case of pure morphology and contrasting two minimally differing computational models, we have provided evidence for a distinction between two different mechanisms of morphological generalization: context-free, unbounded operations and context-sensitive restricted generalizations. More generally, our results demonstrate the need to postulate variable-based symbolic operations to account for linguistic productivity and support approaches that argue for a dual architecture of the language faculty (e.g., Clahsen, 2006; Pinker & Ullman, 2002).
Acknowledgments

Supported by doctoral (SFRH/BD/13195/2003) and postdoctoral fellowships (SFRH/BPD/65164/2009) awarded to João Veríssimo by the Fundação para a Ciência e a Tecnologia, Portugal, and by an Alexander von Humboldt Professorship awarded to Harald Clahsen. We thank Adam Albright for advice on implementing the computational simulation, Constança Carvalho for recruiting many of the participants, Patrícia Vidigal for help in designing the materials, and three JML reviewers (T.M. Bailey, E. Keuleers, one anonymous) for detailed and helpful comments.

List of experimental stimuli

Items used in the current studies (in the 1st present indicative), their corresponding MGL reliabilities for the 1st, 2nd and 3rd conjugations, and descriptive statistics for their distribution.

                  MGL reliabilities
                  1st conj. (-ar)   2nd conj. (-er)   3rd conj. (-ir)
Descriptives
  Mean            .698              .192              .280
  SD              .177              .293              .248
  Min             .459              .000              .067
  Max             .994              .983              .923
  Range           .534              .983              .856

Item
  prizo           .994              .084              .120
  lico            .990              .040              .081
  rento           .983              .040              .069
  bito            .970              .040              .080
  apreio          .970              .104              .078
  matreio         .970              .104              .078
  alfego          .967              .102              .123
  buro            .951              .034              .089
  faugo           .938              .102              .123
  livo            .933              .082              .171
  zalo            .933              .000              .074
  fanso           .929              .036              .067
  sulho           .916              .000              .074
  lauso           .911              .084              .487
  feduzo          .911              .084              .923
  frigo           .908              .102              .123
  jasto           .908              .040              .069
  beço            .899              .036              .570
  saurro          .872              .061              .123
  faivo           .852              .082              .171
  pretuo          .852              .043              .570
  launo           .821              .000              .719
  perfenso        .812              .215              .067
  labenso         .812              .215              .067
  astenso         .812              .215              .067
  renso           .812              .215              .067
  dechenço        .812              .215              .067
  beilo           .805              .000              .074
  acuo            .787              .043              .512
  sôcho           .785              .036              .067
  fraimo          .785              .000              .292
  lauvo           .775              .082              .171
  micho           .739              .036              .067
  sempo           .733              .058              .145
  taucho          .723              .036              .067
  laido           .719              .102              .297
  empido          .719              .102              .297
  estumo          .719              .000              .292
  defido          .706              .102              .355
  tromo           .646              .000              .238
  vundo           .636              .102              .354
  manstituo       .612              .043              .872
  lustituo        .612              .043              .872
  bendo           .610              .833              .150
  assido          .607              .102              .433
  ajido           .601              .102              .355
  espoudo         .601              .102              .251
  chinzo          .584              .084              .120
  quendo          .584              .959              .150
  inuo            .575              .043              .742
  ambuo           .575              .043              .685
  anhuo           .575              .043              .629
  prijo           .574              .075              .523
  azijo           .574              .075              .523
  lajo            .574              .075              .300
  mecisto         .574              .040              .719
  conzuo          .570              .043              .804
  fubo            .544              .102              .355
  saubo           .544              .102              .355
  tureo           .503              .983              .067
  anheo           .503              .983              .067
  treso           .503              .872              .067
  azeso           .503              .872              .067
  apreso          .503              .872              .067
  dezêbo          .492              .852              .171
  feibo           .492              .102              .607
  taibo           .492              .102              .607
  denfo           .486              .058              .145
  pinvo           .482              .064              .171
  estenvo         .482              .064              .171
  arro            .482              .061              .123
  dorro           .482              .916              .123
  murêvo          .482              .908              .171
  perzolvo        .482              .719              .171
  sarrolvo        .482              .489              .171
  faio            .480              .034              .694
  esgaio          .480              .034              .694
  solho           .459              .000              .074

References

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York, NY: John Wiley & Sons.
Albright, A. (2002a). Islands of reliability for regular morphology: Evidence from Italian. Language, 78, 684–709.
Albright, A. (2002b). The lexical bases of morphological well-formedness. In S. Bendjaballah, W. U. Dressler, O. E. Pfeiffer, & M. D. Voeikova (Eds.), Morphology 2000 (pp. 5–15). Amsterdam: John Benjamins.
Albright, A., & Hayes, B. (2002). Modeling English past tense intuitions with minimal generalization. In M. Maxwell (Ed.), Proceedings of the sixth meeting of the ACL special interest group in computational phonology (pp. 58–69). Philadelphia: Association for Computational Linguistics.
Albright, A., & Hayes, B. (2003). Rules vs.
analogy in English past tenses: A computational/experimental study. Cognition, 90, 119–161.
Anscombe, F. J. (1956). On estimating binomial response relations. Biometrika, 43, 461–464.
Aronoff, M. (1976). Word formation in generative grammar. Cambridge, MA: MIT Press.
Aronoff, M. (1994). Morphology by itself: Stems and inflectional classes. Cambridge, MA: MIT Press.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
Bacelar do Nascimento, M. F., Casteleiro, J. M., Marques, M. L. G., Barreto, F., Amaro, R., & Veloso, R. (2000). Léxico multifuncional computorizado do português contemporâneo [Computerized multifunctional lexicon of contemporary Portuguese]. Lisboa: Centro de Linguística da Universidade de Lisboa.
Barr, D. J. (2008). Analyzing ‘visual world’ eyetracking data using multilevel logistic regression. Journal of Memory and Language, 59, 457–474.
Berent, I., Pinker, S., & Shimron, J. (1999). Default nominal inflection in Hebrew: Evidence for mental variables. Cognition, 72, 1–44.
Berko, J. G. (1958). The child’s learning of English morphology. Word, 14, 150–177.
Bybee, J. L. (1995). Regular morphology and the lexicon. Language and Cognitive Processes, 10, 425–455.
Bybee, J. L., & Moder, C. L. (1983). Morphological classes as natural categories. Language, 59, 251–270.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Chomsky, N. (1980). Rules and representations. New York, NY: Columbia University Press.
Clahsen, H. (1999). Lexical entries and rules of language: A multidisciplinary study of German inflection. Behavioral and Brain Sciences, 22, 991–1013.
Clahsen, H. (2006). Dual-mechanism morphology. In K. Brown (Ed.), Encyclopedia of language and linguistics (Vol. 4, pp. 1–5). Oxford: Elsevier.
Colombo, L., Stoianov, I., Pasini, M., & Zorzi, M. (2006). The role of phonology in the inflection of Italian verbs: A connectionist investigation.
The Mental Lexicon, 1, 147–181.
Daelemans, W. (2002). A comparison of analogical modeling of language to memory-based language processing. In R. Skousen, D. Lonsdale, & D. B. Parkinson (Eds.), Analogical modeling: An exemplar-based approach to language (pp. 157–179). Amsterdam: John Benjamins.
Eddington, D. (2002). Dissociation in Italian conjugations: A single-route account. Brain and Language, 81, 291–302.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71–99.
Fodor, J., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. In S. Pinker & J. Mehler (Eds.), Connections and symbols (pp. 3–71). Cambridge, MA: MIT Press.
Gart, J. J. (1966). Alternative analyses of contingency tables. Journal of the Royal Statistical Society, B28, 164–179.
Gart, J. J., & Zweifel, J. R. (1967). On the bias of various estimators of the logit and its variance with application to quantal bioassay. Biometrika, 54, 181–187.
Gulliksen, H. (1950). Theory of mental tests. New York, NY: John Wiley & Sons.
Hahn, U., & Nakisa, R. C. (2000). German inflection: Single route or dual route? Cognitive Psychology, 41, 313–360.
Haldane, J. B. S. (1955). The estimation and significance of the logarithm of a ratio of frequencies. Annals of Human Genetics, 20, 309–311.
Hare, M. L., & Elman, J. L. (1995). Learning and morphological change. Cognition, 56, 61–98.
Hare, M. L., Elman, J. L., & Daugherty, K. G. (1995). Default generalization in connectionist networks. Language and Cognitive Processes, 10, 601–630.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. London: Sage Publications.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434–446.
Keuleers, E., Sandra, D., Daelemans, W., Gillis, S., Durieux, G., & Martens, E. (2007). Dutch plural inflection: The exception that proves the analogy. Cognitive Psychology, 54, 283–318.
Kiparsky, P. (1973). Elsewhere in phonology. In S. Anderson & P. Kiparsky (Eds.), A Festschrift for Morris Halle (pp. 93–106). New York, NY: Holt, Rinehart and Winston.
Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York, NY: Wiley.
Marcus, G. F. (2001). The algebraic mind: Integrating connectionism and cognitive science. Cambridge, MA: MIT Press.
Marcus, G. F., Brinkmann, U., Clahsen, H., Wiese, R., & Pinker, S. (1995). German inflection: The exception that proves the rule. Cognitive Psychology, 29, 189–256.
Marslen-Wilson, W. D., & Tyler, L. K. (1997). Dissociating types of mental computation. Nature, 387, 592–594.
Marslen-Wilson, W. D., & Tyler, L. K. (2007). Morphology, language and the brain: The decompositional substrate for language comprehension. Philosophical Transactions of the Royal Society B, 362, 823–836.
Mateus, M. H. M., & d’Andrade, E. (2000). The phonology of Portuguese. New York, NY: Oxford University Press.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. New York, NY: Chapman and Hall.
Mikheev, A. (1997). Automatic rule induction for unknown-word guessing. Computational Linguistics, 23, 405–423.
Pinker, S. (1999). Words and rules: The ingredients of language. New York, NY: Basic Books.
Pinker, S., & Ullman, M. T. (2002). The past and future of the past tense. Trends in Cognitive Sciences, 6, 456–463.
Prasada, S., & Pinker, S. (1993). Generalization of regular and irregular morphological patterns. Language and Cognitive Processes, 8, 1–56.
Ramscar, M. (2002). The role of meaning in inflection: Why the past tense does not require a rule. Cognitive Psychology, 45, 45–94.
Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tenses of English verbs. In J. L. McClelland, D. E.
Rumelhart, & The PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 2, pp. 216–271). Cambridge, MA: MIT Press.
Say, T., & Clahsen, H. (2002). Words, rules and stems in the Italian mental lexicon. In S. Nooteboom, F. Weerman, & F. Wijnen (Eds.), Storage and computation in the language faculty (pp. 93–129). Dordrecht: Kluwer.
Skousen, R. (1992). Analogy and structure. Dordrecht: Kluwer Academic.
Skousen, R., Lonsdale, D., & Parkinson, D. B. (2002). Analogical modeling: An exemplar-based approach to language. Amsterdam: John Benjamins.
Thorndike, R. L. (1949). Personnel selection: Test and measurement techniques. New York, NY: John Wiley & Sons.
Ullman, M. T. (1999). Acceptability ratings of regular and irregular past-tense forms: Evidence for a dual-system model of language from word frequency and phonological neighbourhood effects. Language and Cognitive Processes, 14, 47–67.
Veríssimo, J., & Clahsen, H. (2009). Morphological priming by itself: A study of Portuguese conjugations. Cognition, 112, 187–194.
Villalva, A. (2000). Estruturas morfológicas: Unidades e hierarquias nas palavras do português [Morphological structures: Units and hierarchies in Portuguese words]. Lisboa: Fundação Calouste Gulbenkian, Fundação para a Ciência e a Tecnologia.
Villalva, A. (2003). Estrutura morfológica básica [Basic morphological structure]. In M. H. M. Mateus, A. M. Brito, I. Duarte, & I. H. Faria (Eds.), Gramática da língua portuguesa (6th ed., pp. 917–938). Lisboa: Editorial Caminho.
Wiseman, S. (1967). The effect of restriction of range upon correlation coefficients. British Journal of Educational Psychology, 37, 248–252.
Woolf, B. (1954). On estimating the relation between blood group and disease. Annals of Human Genetics, 19, 251–253.
Zimmerman, D. W., & Williams, R. H. (2000). Restriction of range and correlation in outlier-prone distributions. Applied Psychological Measurement, 24, 267–280.