Does English Need its Pronouns? Simulating the Effect of Pro-drop on SVO Languages

Ezra Van Everbroeck
UC San Diego Linguistics 0108, 9500 Gilman Drive, La Jolla, CA 92093
ezra@ucsd.edu

Maria Polinsky
UC San Diego Linguistics 0108, 9500 Gilman Drive, La Jolla, CA 92093
mpolinsky@ucsd.edu

Garrison W. Cottrell
UC San Diego Computer Science & Engineering 0114, 9500 Gilman Drive, La Jolla, CA 92093
gary@ucsd.edu

Abstract

We present the results from a large set of connectionist simulations exploring the effect of subject omission (pro-drop) on languages with an SVO word order. We show that pro-drop only affects the learnability of the languages if there are no cues available to tell nouns and verbs apart. We also argue that neither creoles nor Mandarin Chinese instantiate the language type which was unlearnable in the simulations.

Introduction

It is well documented that many of the world’s languages allow sentential subjects to remain unexpressed (i.e. pro-drop; Gilligan 1987): e.g. in Spanish, Ha comido la sopa ‘(S/He) has eaten the soup’ is a perfectly fine sentence even though it lacks an overt subject. Linguists have been studying this phenomenon for several decades, but it is still unclear to what extent pro-drop is correlated with other linguistic properties. It was originally assumed that languages could exhibit pro-drop only if they also featured rich subject-verb agreement, as in Italian and Spanish, where the agreement affixes on the verb essentially replace the information which is lost by omitting the subject (Jaeggli and Safir 1989). It was soon pointed out, however, that some languages which lack agreement systems altogether, like Mandarin Chinese and Thai, also exhibit frequent pro-drop (Huang 1984, 1989).

To gain insight into the interactions between pro-drop and other linguistic parameters, we have run 12,000 neural network simulations to systematically investigate how languages with different basic word orders and varying degrees of overt morphological marking are affected by the introduction of pro-drop. By using computational models and artificial languages, we can control each parameter separately and also look at combinations of properties (i.e. types of languages) which are unattested in the real world; neither would be possible if we limited ourselves to more traditional, grammar-based typological research.

In order to keep the results and discussion sections manageable, we will restrict ourselves here to the network models which had to learn languages with a basic Subject-Verb-Object (SVO) word order. Cross-linguistically, the SOV word order is more frequent (Tomlin 1986; Dryer 1989), but SVO languages are more interesting for our purposes because they display more morphological variation than SOV languages (Siewierska 1996, 1998) and because a number of well-studied languages like English, Spanish and Mandarin Chinese are SVO.

Experiment

The use of neural networks for typological studies is a recent development in cognitive science, but there have already been several successful models (Christiansen and Devlin 1997; Lupyan and Christiansen 2002; Van Everbroeck 1999, 2003). For our simulations, we created simple context-free grammars capable of generating many different artificial languages, each of which represented a specific language type. The lexicon used to generate the training corpora consisted of 300 nouns (half animate, half inanimate), 100 verbs (half transitive, half intransitive) as well as several pronouns and morphological markers. The test lexicon contained new nouns and verbs but the same pronouns and markers. For each language type, the grammars were used to generate training and test corpora of 3,000 simple sentences. The possible sentences were SV and SVO, as well as V and VO for the language types in which subjects could remain unexpressed.
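The corpus generation just described can be illustrated with a small sketch. It is not the grammar used in the simulations: the word names, the uniform sampling, the 50% share of transitive sentences and the 50% subject-drop rate (`drop_rate`) are all assumptions made for illustration, and pronouns, morphological markers and animacy are left out.

```python
import random

def make_lexicon(n_nouns=300, n_verbs=100):
    """Build a toy lexicon: 300 nouns, 100 verbs (half transitive)."""
    nouns = [f"noun{i}" for i in range(n_nouns)]
    transitive = [f"tverb{i}" for i in range(n_verbs // 2)]
    intransitive = [f"iverb{i}" for i in range(n_verbs // 2)]
    return nouns, transitive, intransitive

def generate_sentence(lexicon, pro_drop, rng, drop_rate=0.5):
    """Return one sentence as a list of (role, word) pairs in SVO order."""
    nouns, transitive, intransitive = lexicon
    is_transitive = rng.random() < 0.5  # assumed 50/50 split
    verb = rng.choice(transitive if is_transitive else intransitive)
    words = []
    # The subject is omitted only in pro-drop languages (assumed rate).
    if not (pro_drop and rng.random() < drop_rate):
        words.append(("S", rng.choice(nouns)))
    words.append(("V", verb))
    if is_transitive:
        words.append(("O", rng.choice(nouns)))
    return words

def generate_corpus(pro_drop, size=3000, seed=0):
    """Generate a corpus of `size` sentences for one language type."""
    rng = random.Random(seed)
    lexicon = make_lexicon()
    return [generate_sentence(lexicon, pro_drop, rng) for _ in range(size)]

def frame(sentence):
    """Collapse a sentence to its frame, e.g. 'SVO' or 'VO'."""
    return "".join(role for role, _ in sentence)

if __name__ == "__main__":
    for pro_drop in (False, True):
        corpus = generate_corpus(pro_drop)
        print(pro_drop, sorted({frame(s) for s in corpus}))
```

Without pro-drop only the frames SV and SVO are produced; with pro-drop the frames V and VO appear as well, which is the source of the ambiguity examined in the Results.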
In addition to word order and the presence/absence of pro-drop, the two other crucial parameters were the presence/absence of Subject/Object case-marking on the nouns, and the presence/absence of head-marking on the verbs. The latter could take the form of simple Tense/Aspect/Modality (T/A/M) markers which essentially only help identify the verbs (Bybee 1985), valency markers indicating the number of arguments in the clause (e.g. McWhorter 1998), or rich subject-verb agreement.

The architecture of the model used in our simulations consisted of a simple recurrent network (Elman 1990) augmented with a recurrent layer for the output units (see Figure 1). The latter functioned as a short-term memory and led to much faster training.

Figure 1. Model architecture: Activation flows from the input layer at the bottom to three banks of output units (S, V, O).

At the input layer, the networks were shown one word of a sentence at a time. At the end of each sentence, a special ‘period’ pattern was presented to signal to the network that a new sentence was about to start. At the output layer, the networks were trained to build a representation of the entire sentence: there were separate banks of units for the subject, object and verb of the sentence, and each bank had to be filled with the appropriate word from the sentence, or left empty if the sentence didn’t contain the element; for example, the object bank was to remain empty in all intransitive sentences. The networks were trained using back-propagation for 10 epochs, and then tested on the sentences from the appropriate test corpus. The main performance measure was the percentage of sentences for which the network got all three output banks correct; i.e. the pattern of activation over each bank of output units had to be closer to the correct word than to any other one.

Results

An ANOVA test reveals the importance of each linguistic parameter used for generating the artificial languages: pro-drop, case-marking on the nouns, and head-markers on the verbs are all statistically significant (see Table 1).

Effect      SS     DF     MS       F       p
Case      3177.     1   3177.   34.31   .000*
Head      104E2     3   3460.   37.37   .000*
Pro       5403.     1   5403.   58.35   .000*

Table 1. ANOVA analysis results.

If we look at individual language types and how well the trained networks were able to generalize to the new sentences in the test corpora, we find that test performance was excellent (> 90%) for all but one combination of linguistic features (see Table 2; the percentages in each cell are averages over 20 networks with different initial weights). The problematic language type combines an SVO word order with pro-drop, no case-marking on the nouns and no marking on the verbs.

SVO                  No pro-drop            Pro-drop
Head-marking      [-case]   [+case]     [-case]   [+case]
None               96.2%     97.0%       45.8%     92.8%
T/A/M              96.8%     97.4%       95.8%     96.4%
Valency            96.8%     97.1%       96.1%     96.3%
Agreement          96.5%     96.8%       92.6%     93.2%

Table 2. Percentages of novel sentences of each language type which are parsed correctly.

An analysis of the errors made by the various networks shows that telling the nouns apart from the verbs is the main problem caused by pro-drop. When all subjects are expressed, SVO languages are easy to parse because the first word is always the subject, the second word always the verb, and the third (if present) the object. (Note that these networks still make some mistakes because they are being forced to analyze sentences with nouns and verbs which they have never been trained on.) When there is pro-drop, however, this simple parsing strategy no longer works because the first word can be either the subject (in SV, SVO) or the verb (in V, VO); similarly, the second word can be either the verb or the object. Because the nouns and verbs are novel in the test corpora, some additional information is needed to identify which is which.
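The failure of the positional parsing strategy can be made concrete with a short sketch. It only enumerates the four sentence frames named above and is not a model of the networks; the helper names (`roles_by_position`, `ambiguous_positions`, `category_string`) are hypothetical and introduced here for illustration.

```python
FRAMES_NO_PRODROP = ["SV", "SVO"]
FRAMES_PRODROP = ["SV", "SVO", "V", "VO"]

def roles_by_position(frames):
    """Map each word position (1-based) to the roles that can occur there."""
    positions = {}
    for frame in frames:
        for index, role in enumerate(frame, start=1):
            positions.setdefault(index, set()).add(role)
    return positions

def ambiguous_positions(frames):
    """Positions at which word order alone does not determine the role."""
    table = roles_by_position(frames)
    return sorted(i for i, roles in table.items() if len(roles) > 1)

def category_string(frame):
    """What a learner sees if nouns and verbs are marked: N or V per word."""
    return "".join("V" if role == "V" else "N" for role in frame)

if __name__ == "__main__":
    print(ambiguous_positions(FRAMES_NO_PRODROP))
    print(ambiguous_positions(FRAMES_PRODROP))
    print({frame: category_string(frame) for frame in FRAMES_PRODROP})
```

With all subjects expressed, every position maps to exactly one role. With pro-drop, positions 1 and 2 each admit two roles, so a two-word sentence could be SV or VO. Once each word is identifiable as a noun or a verb, the four frames yield four distinct category strings (NV, NVN, V, VN), which is why any single source of marking is enough to restore a unique parse.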
The results in Table 2 demonstrate that marking the nouns (through case-marking) is as successful for the disambiguation task as marking the verbs (through any of its three types of marking). Interestingly, marking both simultaneously has very limited benefit. But if no marking is available to tell the nouns from the verbs, the networks fail to generalize to the novel sentences in the test corpus.

Discussion

A comparison of the network results in Table 2 with typological data on SVO languages suggests that some possible language types may be unattested (see Table 3). These ‘gaps’ in the space of possible languages could well be due to random historical events (Diamond 1997). On the other hand, the potential absence of natural languages which correspond to the type the networks found problematic suggests that some gaps could be motivated by cognitive factors such as learnability.

Table 3. Natural languages which correspond to the ones learned by the networks. The table has the same layout as Table 2 (head-marking: None, T/A/M, Valency, Agreement; no pro-drop vs. pro-drop; [-case] vs. [+case]). Languages named in its cells: English, Russian, Hebrew, Spanish, Polish, Passamaquoddy. Cells marked ?? have no known corresponding language; the cell marked (45.8%) is the pro-drop, [-case] type without head-marking.

There are two groups of natural languages which may correspond to the supposedly unattested type in Table 3. The first group consists of creole languages, i.e. languages which developed over the last 400 years or so in contact situations where speakers of mutually unintelligible languages were forced to communicate (Bickerton 1981; Thomason and Kaufman 1988; McWhorter 1998). All creoles share several linguistic properties, including an SVO word order and very little morphological marking on nouns and verbs. Many also derive partially from Spanish and Portuguese, both of which have frequent pro-drop. Nonetheless, if we look at the pro-drop phenomena observed in most creole languages, we find that they hardly ever go beyond the kinds of constructions which are also possible in English, a prominent non-pro-drop language.
For example, it is possible to say the creole equivalent of ‘Seems like a good idea’ in Portuguese-based Capeverdean Creole (Baptista 1995), Spanish-based Papiamento (Kouwenberg 1990) and French-based Haitian Creole (DeGraff 1993). But the equivalent of Spanish Está comiendo la sopa ‘(S/He) is eating the soup’ is not acceptable, even in the creoles which derive from Spanish (Muysken and Law 2001). Moreover, in the few cases where creoles do allow pro-drop, they have also developed subject-verb agreement, e.g. Bislama (Meyerhoff 2000) and São Tomé Creole (Gilligan 1987). So, there is no evidence that creoles represent the language type which the networks have problems learning.

The second group of potential counter-examples can be found among the languages of South-East Asia, including Thai, Lao, Vietnamese, and the varieties of Chinese. Although not all of them are historically related, they share a large number of linguistic features, including an SVO word order and minimal noun and verb morphology, due to their prolonged geographical proximity (Cooke 1968; Bisang 1996). Crucially, some of these languages also exhibit very frequent pro-drop. In Thai, for example, unexpressed subjects occurred in about every second sentence in a large corpus (Aroonmanakun 1999). Similar numbers are often mentioned for Mandarin Chinese (Huang 1984, 1989; Tao 1996; Tardif, Shatz, and Naigles 1997). We limit ourselves to a discussion of Mandarin here because it is the best documented of these languages.

A closer analysis of Mandarin shows that there are a number of discourse and structural constraints on the usage of pro-drop. In general, pro-drop is only allowed when the unexpressed element can be readily recovered from the discourse or the situational context (Huang 1995).
The structural constraints involve co-verbs and pivotal nouns in serial verb constructions (Li and Thompson 1981); in both cases, they prevent pro-drop from causing a verb to appear where a noun would normally be expected. In addition, Mandarin actually has some cues which help identify verbs (e.g. aspect markers, auxiliaries and co-verbs; Li, Bates, and MacWhinney 1993) and nouns (e.g. numerals, classifiers, prepositions, and the ba and bei particles; Chang 1992). Finally, there is considerable evidence from acquisition studies that children learning Mandarin Chinese prefer a fixed SVO word order, both in perception and production, and also don’t make mistakes in using nouns as verbs or vice versa (Erbaugh 1983, 1986; Miao and Zhu 1992). The combination of all these factors suggests to us that Mandarin, although it may come closer than most SVO languages, does not instantiate the type which the models could not learn either.

Conclusion

Our connectionist simulations of the effect of pro-drop on SVO languages show that the presence of unexpressed arguments need not create serious problems for a language learner, at least if some morphological marking is available to distinguish the nouns in the language from the verbs. If no such source of information is available, the networks fail to generalize to sentences containing novel words. This finding meshes well with what is known about human language processing because people, and especially young children, do poorly when faced with structural ambiguities (Trueswell et al. 1999; Hawkins 2002). From a typological perspective, our results are relevant because they help determine which SVO language types are unattested for cognitive reasons as opposed to the result of historical accident.
The fact that no real language appears to instantiate the type which the networks found unlearnable, even when one could have been expected to do so, as in the case of Spanish-based creoles, is a strong indication for us that computer simulations can make a valuable contribution to linguistic typology.

On the other hand, we also want to stress that language simulations like ours currently have limited explanatory value because they do not include semantic or pragmatic information. Adding lexical semantics of the type described by Li, Burgess, and Lund (2000) is definitely possible, however, so this deficiency can presumably be addressed. Other future work includes the analysis of the verb-final (mostly SOV) and verb-initial (VSO, VOS) language types, as well as a closer look at Riau Indonesian and Singapore English, two SVO languages about which controversial claims have been made with respect to their linguistic features (Gil 1994, 2003).

References

Aroonmanakun, Wiroote (1999). Extending focusing for zero pronoun resolution in Thai. Doctoral dissertation, Georgetown University
Baptista, Marlyse (1995). On the nature of Pro-drop in Capeverdean Creole. Harvard Working Papers in Linguistics 5: 3-17
Bickerton, Derek (1981). Roots of Language. Ann Arbor: Karoma
Bisang, Walter (1996). Areal typology and grammaticalization: processes of grammaticalization based on nouns and verbs in East and Mainland South East Asian languages. Studies in Language 20.3: 519-597
Bybee, Joan L. (1985). Morphology. A Study of the Relation between Meaning and Form. Amsterdam: John Benjamins
Chang, Hsing-Wu (1992). The acquisition of Chinese syntax. In Hsuan-Chih Chen & Ovid J.L. Tzeng (eds.), Language Processing in Chinese, 277-311. Amsterdam: North-Holland
Christiansen, Morten H. and Joseph Devlin (1997). Recursive Inconsistencies Are Hard to Learn: A Connectionist Perspective on Universal Word Order Correlations.
In Proceedings of the 19th Annual Cognitive Science Society Conference, 113-118. Mahwah, NJ: Lawrence Erlbaum
Cooke, Joseph R. (1968). Pronominal reference in Thai, Burmese, and Vietnamese. Berkeley: University of California Press
DeGraff, Michel Frederic (1993). Is Haitian Creole a Pro-drop language? In Francis Byrne & John Holm (eds.), Atlantic meets Pacific. A global view of pidginization and creolization, 71-90. Amsterdam: John Benjamins
Diamond, Jared (1997). Guns, Germs, and Steel: The Fates of Human Societies. New York, NY: W.W. Norton
Dryer, Matthew S. (1989). Discourse-governed word order and word order typology. Belgian Journal of Linguistics 4: 69-90
Elman, Jeffrey L. (1990). Finding structure in time. Cognitive Science 14: 179-211
Erbaugh, Mary S. (1983). Why Chinese children's acquisition of Mandarin predicates should be "just like English". Papers and Reports on Child Language Development 22: 49-57
Erbaugh, Mary S. (1986). Taking stock: The development of Chinese noun classifiers historically and in young children. In Colette Grinevald Craig (ed.), Noun Classes and Categorization, 399-436. Amsterdam: John Benjamins
Gil, David (1994). The structure of Riau Indonesian. Nordic Journal of Linguistics 17: 179-200
Gil, David (2003). English goes Asian: Number and (in)definiteness in the Singlish noun phrase. In Frans Plank (ed.), Noun phrase structure in the languages of Europe, 467-514. Berlin: Mouton de Gruyter
Gilligan, Gary Martin (1987). A Cross-Linguistic Approach to the Pro-drop Parameter. Doctoral dissertation, University of Southern California
Hawkins, John A. (2002). Symmetries and asymmetries: their grammar, typology and parsing. Theoretical Linguistics 28: 95-150
Huang, C.-T. James (1984). On the distribution and reference of empty pronouns. Linguistic Inquiry 15.4: 531-574
Huang, C.-T. James (1989). Pro-drop in Chinese: A generalized control theory. In Osvaldo Jaeggli & Kenneth J. Safir (eds.), The Null Subject Parameter, 185-214. Dordrecht: Kluwer
Huang, Yan (1995).
On null subjects and null objects in generative grammar. Linguistics 33: 1081-1123
Jaeggli, Osvaldo and Kenneth J. Safir (1989). The Null Subject parameter and parametric theory. In Osvaldo Jaeggli & Kenneth J. Safir (eds.), The Null Subject Parameter, 1-44. Dordrecht: Kluwer
Kouwenberg, Silvia (1990). Complementizer pa, the finiteness of its complements, and some remarks on empty categories in Papiamento. Journal of Pidgin and Creole Languages 5.1: 39-51
Li, Charles N. and Sandra A. Thompson (1981). Mandarin Chinese. A Functional Reference Grammar. Berkeley, CA: University of California Press
Li, Ping, Elizabeth Bates and Brian MacWhinney (1993). Processing a language without inflections: A reaction time study of sentence interpretation in Chinese. Journal of Memory and Language 32: 169-192
Li, Ping, Curt Burgess and Kevin Lund (2000). The acquisition of word meaning through global lexical co-occurrences. In Eve V. Clark (ed.), Proceedings of the Thirtieth Stanford Child Language Research Forum, 167-178. Stanford, CA: Center for the Study of Language and Information
Lupyan, Gary and Morten H. Christiansen (2002). Case, word order, and language learnability: Insights from connectionist modeling. In Wayne Gray & Christian Schunn (eds.), Proceedings of the 24th Annual Conference of the Cognitive Science Society, 596-601.
McWhorter, John H. (1998). Identifying the creole prototype. Vindicating a typological class. Language 74.4: 788-818
Meyerhoff, Miriam (2000). The emergence of creole subject-verb agreement and the licensing of null subjects. Language Variation and Change 12: 203-230
Miao, Xiaochun and Manshu Zhu (1992). Language development in Chinese children. In Hsuan-Chih Chen & Ovid J.L. Tzeng (eds.), Language Processing in Chinese, 237-276. Amsterdam: North-Holland
Muysken, Pieter and Paul Law (2001). Creole studies. A theoretical linguist's field guide. Glot International 5.2: 47-57
Siewierska, Anna (1996). Word order type and alignment type.
Zeitschrift für Sprachtypologie und Universalienforschung 49.2: 149-176
Siewierska, Anna (1998). Variation in major constituent order: a global and a European perspective. In Anna Siewierska (ed.), Constituent Order in the Languages of Europe, 475-551. Berlin: Mouton de Gruyter
Tao, Hongyin (1996). Units in Mandarin conversation. Prosody, discourse, and grammar. Amsterdam: John Benjamins
Tardif, Twila, Marilyn Shatz and Letitia Naigles (1997). Caregiver speech and children's use of nouns versus verbs: A comparison of English, Italian, and Mandarin. Journal of Child Language 24: 535-565
Thomason, Sarah Grey and Terrence Kaufman (1988). Language Contact, Creolization, and Genetic Linguistics. Berkeley: University of California Press
Tomlin, Russell S. (1986). Basic Word Order. Functional Principles. London: Croom Helm
Trueswell, John C., Irina Sekerina, Nicole M. Hill and Marian L. Logrip (1999). The kindergarten-path effect: studying on-line sentence processing in young children. Cognition 73: 89-134
Van Everbroeck, Ezra (1999). Language type frequency and learnability: A connectionist approach. In Proceedings of the 21st Annual Conference of the Cognitive Science Society, 755-760. Mahwah, NJ: Lawrence Erlbaum
Van Everbroeck, Ezra (2003). Language type frequency and learnability from a connectionist perspective. Linguistic Typology 7.1: 1-50