Language Learning ISSN 0023-8333

A Corpus-Based Analysis of the Discourse Functions of Ser/Estar + Adjective in Three Levels of Spanish as FL Learners

Joe Collentine
Northern Arizona University

Yuly Asención-Delaney
Northern Arizona University

Research on the acquisition of Spanish’s two copulas, ser and estar, provides an understanding of the interaction among syntax, semantics, pragmatics, morphology, and vocabulary during development (e.g., Geeslin, 2003a, 2003b; Guntermann, 1992; Ryan & Lafford, 1992). Recent research suggests that linguistic features in the surrounding discourse influence learners’ copula choice. We present a corpus-based analysis of the lexico-grammatical features co-occurring with copula + adjective usage among foreign-language learners of Spanish at three levels of instruction. Findings revealed the following: (a) both ser + adjective and estar + adjective occur at all levels where little linguistic complexity typically occurs; (b) ser + adjective appears in descriptive and evaluative discourse; and (c) estar + adjective is present in narrations, descriptions, and hypothetical discourse.

Keywords second language acquisition; Spanish interlanguage; learner corpus; corpus linguistics; grammatical development; ser and estar; copula choice

We wish to thank Dr. Roy St. Laurent of the Northern Arizona University Statistical Consulting Lab for his valuable assistance in the design of the statistical analyses of this project. Any errors reside solely with us. Our thanks also go to Dr. Vincent and Dr. Ojeda for their financial support for transcribing the texts written by the learners.

Correspondence concerning this article should be addressed to Joe Collentine, Northern Arizona University, Modern Languages, Box 6004, Flagstaff, AZ 86011. Internet: Joseph.Collentine@nau.edu

Language Learning 60:2, June 2010, pp. 409–445 © 2010 Language Learning Research Club, University of Michigan DOI: 10.1111/j.1467-9922.2010.00563.x

Introduction

Studying the acquisition of Spanish copulas, ser and estar, interests second language acquisition (SLA) researchers because it requires studying syntax, semantics, pragmatics, morphology, and vocabulary during development (Leonetti, 1994). Although this might seem particular to Spanish as a second language (L2), the acquisition of these verbs shows how learners acquire one of the two basic Indo-European sentence types (Halliday, 1970): predicative (e.g., Juan corre rápidamente “John runs quickly”) and attributive sentences (e.g., Juan es rápido “John is quick”), with the ser/estar (S/E) distinction forming the central verbal element of the latter. Pragmatically speaking, the S/E distinction requires knowing when the relationship between the subject and adjective involves characterization (María es capaz “Mary is capable”) or identification (María es la encargada “Mary is the one in charge”; Fernández Leborans, 1999). Semantically, S/E can differ aspectually, with estar often connoting the perfective aspect (e.g., that an event’s time frame is short and limited in duration) and ser connoting the imperfective (e.g., the event is habitual) (Luján, 1981). Morphologically, Spanish adjectives inflect for gender and number, which is especially difficult for learners whose first language (L1) has few inflections, like English. Finally, the number of adjectives that learners must associate with either ser or estar presents lexical challenges.

Geeslin (2003a) and Silva-Corvalán (1986, 1994) reminded us that even native speakers of Spanish show much variation in S/E usage with adjectives as a function of pragmatic considerations. Traditionally (and in current learner textbooks), ser + adjective segments describe a subject’s permanent, seemingly unchanging characteristics.
By contrast, estar + adjective segments describe temporary, dynamic characteristics of a subject. It is for this reason that an adjective like aburrido “boring/bored” produces the meaning “I am boring” in soy aburrido, which uses ser, whereas in estoy aburrido, which uses estar, it yields roughly “I am bored”; with ser, the boredom is constant, whereas with estar, the state (and its effect on others) should pass. Nonetheless, this traditional view has come under much empirical scrutiny, with the works of Geeslin (2003a) and Silva-Corvalán (1986, 1994) showing that this explanation only scratches the surface of the pragmatic nuances that native speakers consider when choosing their copula.

Studying the acquisition of S/E provides a means to address various SLA questions (e.g., orders of acquisition, the role of study abroad), and researchers have used various methodologies (e.g., error analysis of open-ended conversations, raters judging the semantic intent of learner utterances). Recent S/E research suggests that learner copula selection is sensitive to lexical and grammatical features (often referred to together as “lexico-grammatical” features) in the surrounding discourse. Corpus-linguistics methods are particularly suited to study the interaction between a construct and its lexical and grammatical context. After reviewing S/E research and the potential contribution of a corpus analysis, we present a large-scale corpus-based analysis of learners’ use of S/E + adjective at different instructional levels.

Ser/Estar SLA Research to Date

Initial S/E research identified developmental stages in instructed contexts, focusing on accuracy and omission rates. Estar emerges in later stages, especially in estar + adjective segments.
VanPatten (1985, 1987) studied oral interviews, grammaticality judgments, and informal class observations to propose five stages: (a) copula absence, (b) ser as the default copula, (c) estar with progressive, (d) estar with locatives, and (e) estar with adjectives of condition. Simplification, communicative value, frequency in input, and L1 transfer influence these stages (VanPatten, 1987). Researchers have studied whether VanPatten’s stages generalize to study-abroad and Peace Corps experiences (Guntermann, 1992; Ryan & Lafford, 1992). Oral proficiency interviews in both Guntermann’s study and Ryan and Lafford’s study confirmed most stages, although estar + adjectives of condition appeared before estar with locatives.

Although accuracy studies reveal that these two copulas develop in a predictable fashion, they do not explain the variability in S/E usage. Additionally, these studies appeared when SLA research was highly concerned with the role of input in acquisition, and explanations focused on issues such as the copula’s individual frequency and communicative value/saliency (Ryan & Lafford, 1992; VanPatten, 1985, 1987) in the input. Ryan and Lafford (1992) attributed the late emergence of estar + adjective to access to naturalistic input. Nonetheless, we know almost nothing about the input (e.g., the types of discourse) that learners process in naturalistic settings or over the course of a semester in at-home or study-abroad settings (Collentine, 2008). SLA theory posits that output (be it from instructional interventions or naturalistic experiences) plays as strong a role as input at later stages of acquisition (Shehadeh, 2002; Swain, 1985), which is when estar + adjective emerges. What type of communication, then, do learners generate that coincides with estar + adjective emergence?

Some evidence suggests that copula + adjective production improves as learners grow in the complexity of the discourse types they generate.
Copula + adjective segments help beginning learners to relate simple messages, containing a subject and a verb without elaboration (e.g., accompanied by adverbs). Guntermann (1992) noted that when communication became difficult, learners resorted to ser + adjective segments. “Because the questions typically elicited descriptions, explanations, and definitions, the [Peace Corps volunteers] were able to build a great number of their answers around ser” (Guntermann, 1992, p. 1297). Descriptive discourse is structurally and semantically basic, depicting a situation’s important nouns and their states (e.g., via adjectives); descriptions lack dynamic details about events and changes of states. Estar + adjective appears in Guntermann’s data, where learners went beyond descriptions to communicate narrative discourse, which entails both a situation’s states and its events (often chronologically). Lafford (2004) attributed copula + adjective gains after a single semester of study abroad to “the pragmatic constraints inherent in real-world discourse . . . and perhaps to improved overall narrative and discursive abilities, proficiency, and fluency” (p. 216; emphasis added).

Subsequent S/E research intimated that copula + adjective growth occurs as lexical and grammatical choices become sensitive to what appears in the surrounding discourse. In the copula + adjective segment, natives demonstrate variation in copula selection because each copula affects different pragmatic and discursive interpretations (Geeslin, 2002; Silva-Corvalán, 1986), and so the copula + adjective context is ideal for studying how learners encode pragmatic and discursive information.
Geeslin (2002, 2003a, 2003b) focused on different instructional levels while considering findings from sociolinguistic studies of copula + adjective language change in bilingual and monolingual communities (e.g., Silva-Corvalán, 1986) in which semantic, pragmatic, and sociolinguistic variables explained the overuse of estar. These variables included frame of reference (i.e., comparison with a group norm, as in Juan es alto “John is tall,” or with the referent, as in Juan está alto “John’s gotten tall”), susceptibility of change (i.e., inherent, as in Juan es inteligente “John is intelligent,” vs. changing, as in Juan está viejo “John’s gotten old”), lexical class of the adjective (e.g., age, nationality), and semantic transparency (El mango es verde/El mango está verde “The mango is green/The mango is unripe” vs. Juan es casado/Juan está casado “John is married/John is just married”). Geeslin (2002) collected data from high school students with a guided interview, a picture-description task, and a contextualized questionnaire, concluding that learners acquire the restriction of susceptibility of change earlier than the frame of reference restriction. Geeslin (2003a, 2003b) later examined copula choice with advanced learners using contextualized questionnaires, finding that semantic and pragmatic features interact to predict estar usage. She found that whereas advanced learners seem to overgeneralize pragmatic constraints such as frame of reference and experience with the referent, native speakers favor lexical and semantic constraints (e.g., predicate type) to decide when to use ser or estar.

Recently, Geeslin (2003b) and Geeslin and Guijarro-Fuentes (2006) suggested that we need to understand the context that surrounds L2 copula choice:

    In the case of copula choice, advanced learners apply pragmatic constraints, even in contexts in which native speakers do not. In contrast, native speakers choose not to apply pragmatic constraints in favor of lexical and semantic constraints. (Geeslin, 2003a, p. 751)

In copula selection, L2 Spanish learners may even be more sensitive to contextual factors than native speakers, who appear to depend on “local” factors within the attributive copula + adjective segment (i.e., lexical and semantic constraints related to the interaction of the copula and the adjective alone); learners are sensitive to a wider context, apparently attending to speaker intent and implicatures (as “pragmatic” considerations would imply) as well as to lexico-grammatical features in the surrounding discourse. Geeslin (2003a, p. 748) noted that words/phrases that imply change near a copula + adjective segment apparently cause advanced learners to select estar (the copula associated with changing states) even when the relationship between the adjective and the copula necessitates the use of ser (the copula associated with permanent states). Thus, to better understand the factors surrounding learners’ S/E usage, we might well ask the following:

• What are the contextual features that co-occur with each copula + adjective segment at different levels of instruction?
• What types of discourse (e.g., narratives, descriptions) are usually associated with each segment?

Corpus Techniques and the Study of Context

Geeslin’s (2002, 2003a, 2003b) research shows by way of rater judgments that the pragmatic intent of copula + adjective segments influences whether learners use ser or estar. It is also reasonable to suspect that discourse type influences copula selection in important ways. Recall that Guntermann (1992) argued that ser + adjective and estar + adjective segments are distributed within different discourse types. Additionally, Lafford (2004) related learners’ copula selection gains to the expansion of the types of discourse they can produce.
Myles and Mitchell (2004) argued that SLA researchers should take note that corpus research examining large collections of digitized documents has had a considerable role in furthering the field of discourse analysis. Accordingly, the present study employs a variety of corpus-based techniques to understand the contextual features that co-occur with ser + adjective and estar + adjective use in addition to the discursive functions that learners at different levels assign to ser and estar. As these techniques are not widely utilized in SLA research, in the following section we not only briefly delimit what corpus-based research can reveal about SLA, but we also describe important corpus assumptions and techniques. Because our analysis compares learner data to corpus-based native-speaker models, we also describe relevant perspectives that recent corpus research has uncovered about the nature of Spanish discourse.

Not only does a corpus-based approach lend itself to questions of L2 discourse development, but the techniques also permit empirical comparisons between learner behaviors and native-speaker models. For instance, using an English learner corpus and two British native-speaker corpora, Siyanova and Schmitt (2007) found that, in informal speech, learners are less likely to use two-word verb constructs (e.g., run into, put off) than are native English speakers. One advantage of comparing learner performance to native-speaker models is that the SLA researcher can make empirically defensible and testable assumptions about the end state of the acquisition process, an approach we adopt in the present study.

Myles (2005) and Myles and Mitchell (2004) lamented that SLA research has not been quick to embrace new technologies for collecting and analyzing data, especially as it relates to corpus linguistics.
They argued that corpus linguistics complements the current research by examining large amounts of data with relative ease, thus increasing the generalizability of findings (Rutherford & Thomas, 2001). Still, some notable corpus-based SLA research has contributed to our understanding of the effect of context on language development (Belz, 2004; Collentine, 2004; Granger, Hung, & Petch-Tyson, 2002; Klein & Purdue, 1997). Some corpus research exists on ser and estar.

Corpus-Based S/E Findings

Corpus-based S/E research provides some evidence that learners’ copula choice is sensitive to contextual factors and that there is reason to suspect that Spanish copula + adjective segments are distributed to different discourse types. Cheng, Lu, and Giannakouros (2008) examined a corpus of Mandarin Chinese L1 learners of Spanish. They showed how advanced learners’ copula choice varies according to the pragmatic intent of the surrounding discourse they themselves produce. They reported that “exploratory writing” evoked greater estar + adjective usage and that estar + adjective is compatible with the semantic and pragmatic goals of narratives or descriptions. Collentine (2008), in an invited commentary article on Cheng et al. (2008), conducted a study on whether copula + adjective segments might serve discernable discourse functions in native Spanish discourse. His analysis uncovered a significant interaction between copula and text type. Ser + adjective was relatively frequent in almost all types of discourse, whereas estar + adjective was most frequent in dramas, which entail much evaluative language and monologues containing descriptions, and narratives.
These two studies suggest that copula + adjective use by learners and native speakers is not influenced by local features alone (which range from within the copula + adjective phrase structure to the lexico-grammatical characteristics of the discourse) but also by communicative goals such as the type of discourse being produced.

Techniques, Tools, and Utility of Corpus-Based Research

Corpus linguistics ranges in complexity. Minimally, it utilizes searchable digitized texts sampled in a representative fashion, depending on the study’s focus. Textual information is critical for statistical procedures (just as it is for individual learners), and so files are tagged with header information, such as topic, source type, biographical information about the author, and purpose (argumentative essay, narrative). Concordance applications and scripting languages allow researchers to search for specific segments and tabulate their frequencies by text. When investigators need to search for morphosyntactic information (e.g., all adjectives, all verbs whose infinitive is either ser or estar), they often use a part-of-speech tagger: a series of software modules that annotates every word with information about its major word classes (e.g., adjective, noun, verb, determiner, preposition), basic morphological information (e.g., plural, preterit), as well as its lemma (i.e., its unmarked, dictionary root, such as a verb’s infinitive or a noun’s masculine, singular form). Part-of-speech tagging requires a dictionary with lexical and grammatical information about the possible words in a language (some words have more than one entry because many word forms are homonyms). For the present project we compiled our own dictionary and we utilized a training set (which assists tagging ambiguous forms) from samples from the Corpus del español (Biber, Davies, Jones, & Tracey-Ventura, 2006) as well as software routines from the Natural Language Tool Kit (NLTK; http://www.nltk.org/).
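As a rough illustration of the dictionary-based tagging described above, the following minimal Python sketch annotates each token with a word class, basic morphology, and a lemma. It does not use NLTK or the authors' dictionary: the lexicon entries, the `tag` and `copula_adjective_pairs` functions, and the one-rule disambiguation step (a crude stand-in for a trained tagger) are all hypothetical and serve only to show the kind of data structure involved.

```python
# Toy lexicon: surface form -> candidate readings as (word class, morphology, lemma).
# All entries are hypothetical and only illustrate the data structure.
LEXICON = {
    "la": [("determiner", "feminine singular", "el"),
           ("pronoun", "clitic third person", "lo")],
    "casa": [("verb", "present third singular", "casar"),
             ("noun", "feminine singular", "casa")],
    "es": [("verb", "present third singular", "ser")],
    "está": [("verb", "present third singular", "estar")],
    "grande": [("adjective", "singular", "grande")],
}


def tag(tokens):
    """Annotate each token with one (word class, morphology, lemma) reading."""
    tagged = []
    previous_class = None
    for token in tokens:
        key = token.lower()
        readings = LEXICON.get(key, [("unknown", "", key)])
        chosen = readings[0]  # default: first listed reading
        # Crude stand-in for a trained disambiguator (an assumption):
        # after a determiner, prefer a noun reading if one exists.
        if previous_class == "determiner":
            for reading in readings:
                if reading[0] == "noun":
                    chosen = reading
                    break
        tagged.append((token, *chosen))
        previous_class = chosen[0]
    return tagged


def copula_adjective_pairs(tagged):
    """Return (copula lemma, adjective) pairs for adjacent copula + adjective tokens."""
    pairs = []
    for first, second in zip(tagged, tagged[1:]):
        if first[1] == "verb" and first[3] in ("ser", "estar") and second[1] == "adjective":
            pairs.append((first[3], second[0]))
    return pairs


tagged_sentence = tag("La casa es grande".split())
pairs = copula_adjective_pairs(tagged_sentence)
```

Because every token carries its lemma, a search for "any verb whose infinitive is ser or estar" reduces to a comparison on the lemma field, regardless of the inflected form that appears in the text.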
After the corpus is tagged in this way, the investigator must verify the accuracy of the tagging and fix errors (individual and/or systematic) through further programming. An increasingly popular technology to create search patterns (regardless of the tagging software) utilizes regular expressions, a sophisticated wildcard- and variable-based text-search system (e.g., \w{3,} symbolizes words of three letters or more; \w+ing symbolizes words of any length ending in ing). Having a tagged corpus along with the flexibility of regular expressions provided us with a powerful means of studying a number of lexical and/or grammatical phenomena. For instance, the pattern

\w+\^v[^`]*`[^`]*`[^`]*`(?:ser|estar)` (?:obvio|evidente)\^j\w+ que\^\w+

is one way to search for every verb whose lemma is either ser or estar followed by the adjectives obvio or evidente followed by the conjunction que.

It is important to make mention of two common corpus-statistical techniques. The process of norming is a numerical transformation of counts to account for the fact that individual texts vary in length and that longer texts can have a greater influence on the numerical distribution of any phenomenon. Investigators often norm frequency counts to an arbitrary number, such as per 1,000 or per 10,000 words: The count of some phenomenon in a text is divided by the text’s total word count, the quotient of which is multiplied by 1,000 (a higher norming multiplier like 100,000 affords greater precision). The technique known as normalizing involves converting the count of some phenomenon to its z-score value vis-à-vis its count in each document in the corpus (i.e., the difference of the phenomenon’s frequency and its mean occurrence in the corpus divided by its standard deviation).
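The search, norming, and normalizing steps discussed in this section can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the `word/class/lemma` tag format, the three toy texts, and the names `normed` and `z_scores` are hypothetical and do not reproduce the authors' tagging scheme or scripts, and the use of the population standard deviation is likewise an assumption, since the text does not say which estimator was used.

```python
import re
from statistics import mean, pstdev

# Hypothetical tag format (NOT the authors' scheme): word/class/lemma,
# with v = verb, j = adjective, r = adverb, c = conjunction.
TEXTS = {
    "t1": "es/v/ser obvio/j/obvio que/c/que estudia/v/estudiar mucho/r/mucho",
    "t2": "está/v/estar cansado/j/cansado hoy/r/hoy y/c/y es/v/ser alto/j/alto",
    "t3": "corre/v/correr rápidamente/r/rápidamente siempre/r/siempre",
}

# A verb whose lemma is ser or estar, immediately followed by an adjective.
COPULA_ADJ = re.compile(r"\w+/v/(?:ser|estar) \w+/j/\w+")
ADVERB = re.compile(r"\w+/r/\w+")


def normed(count, n_words, per=1000):
    """Norm a raw count to a rate per `per` words of text."""
    return count / n_words * per


def z_scores(values):
    """Convert values to z-scores (population SD; an assumption)."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]


names = list(TEXTS)
lengths = [len(TEXTS[n].split()) for n in names]
copula_counts = [len(COPULA_ADJ.findall(TEXTS[n])) for n in names]
adverb_counts = [len(ADVERB.findall(TEXTS[n])) for n in names]

copula_rates = [normed(c, k) for c, k in zip(copula_counts, lengths)]
adverb_rates = [normed(c, k) for c, k in zip(adverb_counts, lengths)]

# Sum the two features' z-scores per text to find where both are concentrated.
combined = {
    n: zc + za
    for n, zc, za in zip(names, z_scores(copula_rates), z_scores(adverb_rates))
}
densest = max(combined, key=combined.get)
```

Norming removes the effect of text length before any comparison is made, and the z-score step then puts a naturally rare feature and a naturally common one on the same scale, so that their sum is not dominated by the more frequent of the two.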
Normalizing is convenient for measuring the relative presence of two or more linguistic features within any given text, as one can easily sum two or more z-scores to calculate how concentrated those features are in any texts or group of documents while taking into account the fact that some linguistic phenomena are naturally scarce in a document (e.g., the subjunctive), whereas others are naturally common (e.g., articles) (cf. Biber & Conrad, 2001). For instance, the frequencies of adverbs of time and copula + adjective segments are likely to vary in different ways across the texts of a corpus (e.g., adverbs of time may be generally more frequent). By summing the two features’ z-scores per document, we can find which texts have the highest concentration of the two.

Corpus-Based Native-Speaker Models of Discourse

According to Myles and Mitchell (2004), we now have the ability to define structurally and statistically different discourse types. Thus, the present study not only compares learners’ copula selection behaviors between different levels of instruction, but it also attempts to identify the types of discourse learners produce when using S/E + adjective, based on a native-speaker model. Corpus linguistics has shown through factor analyses how lexico-grammatical structures bundle together to produce different types of discourse (Biber & Conrad, 2001). Biber et al. (2006) provided the first comprehensive analysis of Spanish, analyzing a 20 million-word Spanish corpus with written and oral data from a variety of registers. Biber et al. (2006) identified four types of discourse that learners might well produce in written texts; the features with which each is associated are presented in Table 1.

Table 1 Discourse dimensions and features targeted in the learner-native speaker comparison (cf. Biber et al., 2006)

Discourse type: Informationally rich
Lexico-grammatical features:
• Singular and plural nouns
• Postnominal descriptive adjectives
• Prenominal descriptive adjectives
• Definite articles
• Prepositions
• Derived nouns
• Type-token ratio
• Long words(a)
• Se passives (i.e., ergative se use)

Discourse type: Hypothetical
Lexico-grammatical features:
• Subjunctive use
• Conditional use
• Future use
• Verbs of obligation and causation (e.g., dejar, permitir, hacer + infinitive)
• Infinitives not preceded by a verb or article
• Verbs followed by an infinitive
• Progressive aspect (imperfect use or present participle)
• Dependent que clauses

Discourse type: Narrative
Lexico-grammatical features:
• Clitic usage
• Imperfect tense/aspect
• Preterit tense/aspect
• Possessives
• Third-person pronouns
• Reflexive se and changes of states
• Infinitives not preceded by a verb or article
• Verbs followed by an infinitive

Discourse type: Descriptive
Lexico-grammatical features:
• Postnominal descriptive adjectives
• Derived nouns
• Absence of all narrative variables

(a) Defined as those that have an average number of characters in the dataset, plus that calculation’s standard deviation, plus one character; thus, six or more characters.

Informationally rich discourse is one that conveys large amounts of information densely. Derived nouns, adjectives, multisyllabic words, and passives convey information in a decidedly encyclopedic fashion. Another important type of discourse in Spanish is hypothetical discourse, which communicates possibilities and counterfactual information. It is not found in English analyses (cf. Biber & Conrad, 2001), perhaps because Spanish has a neatly defined mood system (with readily discernable inflections). It is characterized by features such as verbs in the subjunctive and the conditional. The other two discourse types identified by Biber et al. (2006) are well known to most (viz., narratives and descriptions).
Research Questions

The present study adds to our understanding of how contextual variables interact with learners’ use of attributive sentences during acquisition. Although the field has a good idea of the communicative factors that motivate copula choice, we do not know how each copula + adjective segment works with other lexical and grammatical structures to communicate coherent discourse. To address this gap in the literature and to understand the discursive function that ser + adjective and estar + adjective segments serve over time, we provide a corpus-based analysis of the lexico-grammatical features that predict the use of these two segments with foreign-language (i.e., at-home) learners in the first, second, and third years of the university level. More specifically, we address the following research questions:

1. What are the lexico-grammatical features that co-occur with ser + adjective usage? What are the discursive functions that these co-occurring features serve?
2. What are the lexico-grammatical features that co-occur with estar + adjective usage? What are the discursive functions that these co-occurring features serve?

To address these questions, we present the results of a series of regression analyses predicting the occurrence of each copula + adjective segment from a variety of lexico-grammatical features (see the Corpus Description section). We predict that ser + adjective and estar + adjective segments will have distinct lexico-grammatical associations that change over time. Specifically, we posit that ser + adjective segments appear in simple discourses (e.g., highly descriptive and listlike discourse) and estar + adjective segments become increasingly associated with discursive complexity.
However, we posit that the association of estar + adjective with a particular discourse type will be more difficult to identify because previous research suggests that even advanced learners are more sensitive to contextual (i.e., pragmatic) constraints than are native speakers with this construct.

Method

Corpus Description

This study used a 432,511-word learner corpus of written Spanish, comprising edited and nonedited compositions collected from English-speaking Spanish learners at three levels of instruction: first year (230,270 words), second year (109,224 words), and third year (93,017 words). The compositions were not specific tasks designed to collect the data for this study but rather writing samples used for assessment purposes. Students wrote letters, narratives, descriptions, summaries, and argumentative essays both in and out of class as well as on exams. Topics related to the textbook themes (e.g., family, childhood) and the cultural readings assigned in class. Each text was tagged for numerous lexical and grammatical features (see above).

To determine what lexico-grammatical features co-occur with ser + adjective and estar + adjective usage, we considered a total of 75 potential predictor variables, each operationalized in the form of a regular expression. In corpus studies, variables refer to the linguistic features in the texts being analyzed.
This study’s predictor variables included various lexical features, such as adjectives other than the ones in the copula + adjective frame (e.g., derived adjectives, adjective in postnominal position), nouns (e.g., derived nouns, feminine nouns, masculine nouns), adverbs (e.g., adverbs of place, adverbs of time), and verb classes (e.g., verb in imperfect aspect, verb in past participle), as well as morphosyntactic features such as dependent clauses, noun phrase configurations (e.g., article plus noun), pronoun usage (e.g., clitic—third person), and a variety of verb phrases (e.g., verbs of communication, verbs of knowledge). The set of variables considered involved all parts of speech, common morphosyntactic constructs studied by learners, as well as additional constructs studied in Biber et al. (2006).

Data Analyses

Learner Models Analysis

To identify the types of lexico-grammatical features that learners use with ser + adjective and estar + adjective segments and to identify which variables distinguish among the three levels of learners, we constructed regression models of lexical-grammatical regressors predicting copula + adjective usage: a ser + adjective learner model and an estar + adjective learner model. We constructed regression models for each copula + adjective segment (rather than, for instance, a single regression model for which the choice between the two is the dependent variable) because the previous research suggests that the factors motivating ser + adjective usage are not the same as those motivating estar + adjective usage (cf. Guntermann, 1992).
The process involves screening a set of potential predictor variables for standard assumptions of linear regression, submitting the reduced set to a best-subsets analysis— rather than a stepwise procedure—to identify the so-called “best subset,” and, finally, comparing the predictor variables’ ability to distinguish among the three levels of learners in terms of copula + adjective usage. We employed a standard data-screening process, identifying which of the potential predictor variables had honest correlations with the criterion variables, thus discarding the following: (a) variables that had no correlation with a criterion variable (by examining correlation coefficients and scatter plots between a potential predictor variable and the criterion); (b) variables that represented inflated correlations (i.e., where two features correlated highly with each other and constituted too high an overlap in semantic or structural properties, so as to avoid colinearity problems in the final model selection phase);1 and (c) variables that constituted deflated correlations, eliminating predictor variables that had a highly reduced range of responses to the criterion variable (e.g., those variables whose frequency was very small, such as n = 2, regardless of the level of the participant or the genre). This screening of the data yielded a list of 58 potential linguistic variables (37 for ser + adjective and 21 for estar + adjective) that could be meaningful for the regression analyses to be performed. Table 2 shows the preliminary list of variables. We used best-subsets analyses to derive the two regression models for ser + adjective and estar + adjective. Social scientists frequently employ stepwise procedures for building regression models. 
Although these procedures for variable selection work adequately for reducing a small set of potential predictor variables to a small, more meaningful set (e.g., a subset that does not have a high degree of overlap), statisticians do not favor stepwise analyses when the initial pool of predictor variables is extremely large (Miller, 2002), as in the present case. Following Rencher (2002), we employed instead a best-subsets analysis for building the two models for predicting ser + adjective and estar + adjective. The principal advantage that a best-subsets approach has over statistical/stepwise regression (with a large number of predictor variables) is that best-subsets approaches attempt to reduce the number of predictor variables by comparing various “combinations” of variables, whereas the stepwise procedure attempts the reduction process by considering each and every potential predictor variable “individually.” The best-subsets approach has been shown to produce less spurious results than stepwise procedures when reducing a large set of potential predictor variables.

Table 2 Linguistic variables used in the study after initial data screening

| Variable class | Ser + adjective | Estar + adjective |
|---|---|---|
| Noun | noun - derived; noun - feminine; noun - masculine | noun - singular |
| Adjective∗ | adjective - derived; adjective - feminine; adjective - masculine; adjective - plural; adjective - postnominal (Una casa grande “a large house”); adjective - prenominal (Una bella mansión “A beautiful mansion”); adjective - singular; adjective - type 1 (descriptive adjective with four inflections: masculine, feminine, singular, and plural; Blanco/a(s) “white”); adjective - type 2 (descriptive adjective with two inflections: singular and plural; Interesante(s), liberal(es) “interesting, liberal”) | adjective - singular; adjective - type 1; adjective - type 2 |
| Pronoun | clitic - third person; pronoun - subject; que subordinator | clitic - preverbal; pronoun - third person; que subordinator |
| Other noun phrase elements | article noun segment (El libro “The book”); definite article; possessive adjective | article noun segment; possessive adjective |
| Verbs | SE plus third-singular verb; verb - “Gustar”-like; verb - third person; verb - communication (Decir “say/tell,” anunciar “announce,” explicar “explain,” etc.); verb - imperfect; verb - infinitive; verb - infinitive 2 (not preceded by verb or article); verb - knowledge (Saber “know,” recordar “recall,” entender “understand,” etc.); verb - observation (Ver “see,” escuchar “listen,” etc.); verb - past participle; verb - past subjunctive; verb - periphrastic future; verb - present participle; verb - preterit; verb - suasive (Querer “want,” mandar “order”) | SE plus third-singular verb; verb - “Gustar”-like; verb - third person; verb - knowledge; verb - past participle; verb - present participle; verb aspect - progressive; verb - suasive (Querer “want,” mandar “order,” etc.); verb - probability (Creer “believe,” negar “deny,” dudar “doubt,” etc.) |
| Adverbs or adverbial clauses | adverb - place; adverb - time; adverbial clauses - contingency; adverbial clauses - time | adverb - time; adverbial clauses - contingency; adverbial clauses - time |
| Total | 37 | 21 |

Note. ∗The adjectives counted in this list are those that did not follow one of the two copulas.

With large pools of potential predictor variables that have an almost infinite number of combinations, stepwise regression
may never consider combinations of predictor variables that are equally good at predicting the occurrence of the response variable (i.e., the dependent variable) in question.2 Because this analysis is computationally intensive and not available in many commercial software packages for the social sciences, we used the statistical package R and its best-subsets regression package to perform the analysis (see Dalgaard, 2008).3

We employed what is termed a subgroup regression analysis to determine which of the variables in the two models predicting ser + adjective and estar + adjective usage distinguished among the three levels (Hardy, 1993). The process employs indicator variables (sometimes called dummy variables) to add categorical predictor variables (into the model described earlier) called differential intercept coefficients. This reveals the effect for each group for each predictor variable (i.e., the unique contribution of each level in our study to each coefficient calculated for the predictor variables), producing k − 1 difference (predictor) variable models, where k represents the number of groups.4 Because this group-level coefficient effect process is derived from two regression models, we adjusted the alpha for significant coefficient differences via a Bonferroni adjustment to 0.025 (i.e., $1 - (1 - .05)^{1/2}$).

Native-Speaker Model Comparison

To objectively identify the types of discourse that the lexico-grammatical structures (dis)associated with each copula + adjective segment represent (derived from the best-subsets analysis), we compare the two copula + adjective learner models with the native-speaker discourse model described in Table 1. Our analysis measured the extent to which the learners’ discourse possessed indicators of informational richness, hypothetical discourse, narrative discourse, and descriptive discourse.
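The document-level scores used in this comparison (normed frequencies, then signed z-scores summed per document) can be illustrated with a minimal Python sketch. This is not the authors' R code. The function names, the sample frequencies, the reading of the 10,000 scale as "per 10,000 words," and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def normed_frequency(count, text_length, scale=10_000):
    """Rescale a raw count to a common text length (assumed: per 10,000 words)."""
    return count * scale / text_length


def z_scores(values):
    """Standardize one variable across all documents."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD; the source does not specify
    if sd == 0:
        return [0.0 for _ in values]  # a constant variable carries no information
    return [(v - mu) / sd for v in values]


def signed_z_totals(freqs_by_variable, signs):
    """Sum signed z-scores per document.

    freqs_by_variable maps a variable name to its normed frequencies, one per
    document; signs maps the same names to +1 or -1 (the coefficient sign).
    """
    n_docs = len(next(iter(freqs_by_variable.values())))
    totals = [0.0] * n_docs
    for name, freqs in freqs_by_variable.items():
        for i, z in enumerate(z_scores(freqs)):
            totals[i] += signs[name] * z
    return totals


# Hypothetical normed frequencies for three documents and two variables.
freqs = {
    "variable A": [10.0, 20.0, 30.0],
    "variable B": [30.0, 20.0, 10.0],
}
signs = {"variable A": +1, "variable B": -1}
totals = signed_z_totals(freqs, signs)
```

In this toy input the third document has the highest total because it is rich in the positively signed variable and poor in the negatively signed one; a document at the corpus mean on every variable scores zero.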
As described earlier, we calculated the normed frequency of the occurrence of each of these variables in the learner corpus to a scale of 10,000 per text. Subsequently, we calculated the extent to which documents representing high concentrations of each copula + adjective model correlated with high concentrations of each of the four native-speaker discourse types in three steps: (1) For each document we calculated z-score totals for both the ser + adjective and the estar + adjective models; (2) for each document we calculated a z-score total for each of the four discourse types in Table 2; (3) we regressed the four discourse type z-score totals against each of the copula model z-score totals along with subregression analyses to assess differences between the three levels. A z-score value for any document on a given variable (be it a criterion variable as in step 1 or a regressor as in step 2) represents the extent to which that variable is represented in that document vis-à-vis all other documents. Summing a set of z-scores produces a value representing to what extent any document had a concentration of that set of variables (see Biber et al., 2006, as well as Biber and Conrad, 2001, for in-depth discussions of this technique). Thus, summing the z-scores for each document for variables representing, say, narrative discourse indicated how narrative each document is. Likewise, z-score totals for the set of regressors representing the ser + adjective model and for the set representing the estar + adjective model for each document yield values indicating how closely each document represented each model. (Of course, all z-scores here must be weighted according to their +/− sign in the model.) The regression and subregression analyses answer the following question: When documents reflect the ser + adjective model and the estar + adjective model, are they more or less encyclopedic, hypothetical, narrative, or descriptive in nature? Again, because we employ two regression analyses, we adjusted the alpha for significant coefficient differences via a Bonferroni adjustment to 0.025 (i.e., $1 - (1 - .05)^{1/2}$).

Finally, to identify documents for the qualitative analysis of the discursive nature of copula + adjective usage, we chose to concentrate on those documents for each learner level that most represented each regression model derived from the best-subsets analysis. This simply entailed identifying those documents that had high z-score totals for the ser + adjective model and those with high z-score totals for the estar + adjective model, as described earlier in step 2.

Results

Learner Usage: Ser + Adjective

The best-subsets analysis identified 21 regressors predicting ser + adjective usage across the three levels, with 16 constituting significant regressors (p ≤ .05). This model included twice as many predictor variables as the estar + adjective model did. Additionally, the amount of variation that the ser + adjective model accounted for was 41% in the use of the criterion variable, whereas the estar + adjective model only accounted for 5% of its criterion variable (see below). The ser + adjective model accounted significantly for ser + adjective usage, F(21, 1576) = 54.9; p = .000. Furthermore, the subgroup regression analysis revealed that 5 of these 21 regressors significantly distinguished among the three levels of learners: pronoun - subject, adverbs of place, verb - “gustar”-like, verb - observation, and verb - past subjunctive (see Table 3). In the following we discuss these 21 regressors by grouping them into six lexico-grammatical regressor categories: adjectives, nouns, pronouns, adverbial constructions, grammatical verb variables, and lexical verb variables.
Within the relevant lexico-grammatical regressor categories, we discuss the five variables distinguishing among the levels.

Table 3 Best-subsets regression model for ser + adjective

| Coefficient | Estimate sign | Estimate | Std. error | t test | p |
|---|---|---|---|---|---|
| (Constant) | − | 81.371 | 9.011 | −9.030 | .000 |
| adjective - feminine | + | .050 | .020 | 2.470 | .010 |
| adjective - masculine | + | .040 | .020 | 2.180 | .030 |
| adjective - plural | + | .100 | .020 | 5.420 | .000 |
| adjective - postnominal | − | −.170 | .020 | −10.010 | .000 |
| adjective - prenominal | − | −.200 | .020 | −9.370 | .000 |
| adjective - singular | + | .150 | .020 | 9.680 | .000 |
| adjective - type 2 | + | .070 | .030 | 2.580 | .010 |
| noun - derived | + | .020 | .010 | 2.030 | .040 |
| noun - feminine | + | .020 | .010 | 2.640 | .010 |
| pronoun - subjectᵃ | + | .050 | .010 | 5.800 | .000 |
| adverbs of placeᵃ | − | −.060 | .040 | −1.560 | .120 |
| adverbial clauses - cause | + | .040 | .030 | 1.500 | .130 |
| verb - third person | + | .060 | .010 | 12.380 | .000 |
| verb - infinitive | + | .040 | .010 | 3.730 | .000 |
| verb - periphrastic future | − | −.070 | .050 | −1.570 | .120 |
| verb - past participleᵃ | − | −.040 | .020 | −1.530 | .130 |
| verb - past subjunctive | − | −.110 | .060 | −1.860 | .060 |
| verb - “Gustar”-likeᵃ | − | −.040 | .020 | −2.460 | .010 |
| verb - communication | − | −.070 | .030 | −2.300 | .020 |
| verb - knowledge | − | −.090 | .040 | −2.340 | .020 |
| verb - observationᵃ | − | −.080 | .040 | −1.980 | .050 |

ᵃ Variable distinguishing between the levels of instruction.

Seven of the regressors represented various features of descriptive adjectives, although none distinguished among the three levels of learners. Table 3 indicates that each variable contributed significantly to the model. For the most part, adjectives predicted ser + adjective usage, with five associating positively (i.e., their coefficient sign was positive) and two being disassociated with the construction (i.e., their coefficient sign was negative). The positive adjectival regressors reveal that, perhaps not surprisingly, a variety of adjectives representing particular inflectional properties co-occur with ser + adjective, suggesting that at all levels in contexts/discourses where ser + adjective segments appear, learners use adjectives in general in a variety of inflections. Interestingly, however, the positive correlation with type-2 adjectives (i.e., adjectives with only two inflections: singular and plural) tempers this conclusion because they are also significantly associated with the criterion. Finally, although various morphological properties of adjectives associate with ser + adjective, this construction is not associated with more complex uses of adjectives because ser + adjective is disassociated with adjectives that appear in either prenominal (e.g., bella casa “beautiful house”) or postnominal position (e.g., casa grande “large house”).

An analysis of the two nominal regressors indicates that a certain degree of morphological nominal complexity occurs where ser + adjective segments predominate, as both had a significant positive association with the criterion variable. The association with feminine nouns links gender-inflectional processes to the criterion variable, whereas the association with derived nouns links it to semantically dense forms (derived nouns package semantic information in a dense fashion, as they have a base/root morpheme and an additional derivational morpheme; e.g., constitu-ción, sereni-dad, procesa-miento). It is important to note, however, that this is the only indication of ser + adjective association with semantically dense forms. As with the adjectival regressors, neither of these two nominal regressors distinguished among the three levels, suggesting that the association of ser + adjective with a certain degree of morphological complexity occurs from the beginning to more advanced levels of instruction.

Subject pronouns for the most part also appeared where there was a preponderance of ser + adjective segments, although the subregression analysis revealed that this regressor significantly distinguished among the three levels of learners. The subregression analysis revealed that for the first-year learners subject pronouns were positively associated with ser + adjective (beta = 0.06; std error = 0.001), that for the second-year learners there was no association at all (beta = 0.001; std error = 0.017), and that for the third-year learners there was a disassociation with the criterion variable (beta = −0.06; std error = 0.043); the analysis also revealed that the significant difference came from the first-year learners rather than the other two (t = 3.00; p = .003), meaning that the association of ser + adjective with subject pronoun use was primarily due to the first-year-learner data.

The best-subsets analysis identified two adverbial constructions as important contributing predictors of overall ser + adjective usage: adverbs of place and adverbial clauses of cause.
Although neither of the two contributed significantly on an individual basis, adverbs of place significantly distinguished among the three levels of learners in terms of predicting when ser + adjective would occur. The subregression analysis indicated that for the first-year learners, adverbs of place were disassociated with ser + adjective (beta = −0.12; std error = 0.05), whereas these adverbs were (positively) associated with the criterion at the second (beta = 0.07; std error = 0.06) and third years (beta = 0.06; std error = 0.09), with the significant difference being attributed to the difference between the first-year and second-year individual contributions to the model (t = 2.45; p = .015). There were six grammatical features of verbs that predicted ser + adjective usage at the three levels. For the most part, verbal variables were disassociated with ser + adjective. Similar to the adverbial regressors, three were important enough to be included in the ser + adjective model but did not individually contribute significantly: past subjunctive usage, periphrastic future usage, and past participles (an adjectival/verbal feature) were disassociated with the use of the criterion variable. Past participles significantly distinguished among the three levels, as the second-year coefficients (beta = −0.08; std error = 0.03) were significantly lower (t = 2.35; p = .02) than those of the first (beta = 0.04; std error = 0.13) and third year (beta = 0.09; std error = 0.06). Gustar-like verbs were also disassociated with ser + adjective usage at a significant level; however, it is important to note that this was a regressor that the subregression analysis identified as one that distinguished among the three levels.
For both the first (beta = −0.04; std error = 0.02) and the second year (beta = −0.19; std error = 0.08), its coefficient was negative, meaning that it was disassociated with ser + adjective. For the third-year learners, this subregression coefficient was positive (beta = 0.25; std error = 0.10), and the difference between the third- and second-year coefficients was significant (t = 1.97; p = .05). Because Gustar-like verbs are syntactically complex for Spanish learners, these data strongly suggest that ser + adjective appears, in general, where complex verbal morphology does not, but that more complex verbal syntax begins to become associated with the criterion at more advanced stages of development. Finally, simpler verbal grammatical properties were significantly and positively associated with ser + adjective usage, as verbs with third-person morphology and infinitives reliably associated with the presence of ser + adjective segments. The final group of variables involves various lexical classes of verbs, all of which were significantly disassociated with ser + adjective usage. These included verbs of communication and knowledge, indicating that ser + adjective usage is not associated with discourse where epistemic stance is manifested (i.e., where one qualifies what is commented on by reporting that some assertion was [only] heard or is known to be true). Verbs of observation were also disassociated with ser + adjective usage, although this regressor significantly distinguished among the three levels. The second-year (beta = −0.16; std error = 0.06) and third-year subregression coefficients (beta = −0.21; std error = 0.10) were negative, whereas the first-year subregression coefficients were positive (beta = 0.07; std error = 0.07), with the difference between the first-year and the second-year subregression coefficients (and so the third-year coefficients as well) being significant (t = 2.61; p = .009).
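The level-specific coefficients reported in this section come from the subgroup regression with indicator variables described in the method. As a rough illustration of that kind of coding (not the authors' R code), the Python sketch below builds one design-matrix row for a single predictor and recovers a level's slope from fitted coefficients. The function names, the choice of first year as the reference group, the inclusion of level-by-predictor interaction columns, and the coefficient values are all assumptions made for illustration.

```python
# Indicator (dummy) coding for three instructional levels.
# Everything here is hypothetical and only illustrates the coding scheme.

LEVELS = (1, 2, 3)   # k = 3 groups
REFERENCE = 1        # assumed reference level; it gets no indicator of its own
NON_REFERENCE = [lv for lv in LEVELS if lv != REFERENCE]


def design_row(level, x):
    """Return one design-matrix row for a single predictor x.

    Columns: intercept, k - 1 level indicators, x, and k - 1 level-by-x
    interaction terms (the differential slopes).
    """
    indicators = [1.0 if level == lv else 0.0 for lv in NON_REFERENCE]
    interactions = [d * x for d in indicators]
    return [1.0] + indicators + [x] + interactions


def level_slope(coefs, level):
    """Slope of x for one level, given coefficients in design_row column order."""
    n_ind = len(NON_REFERENCE)
    base = coefs[1 + n_ind]  # slope for the reference level
    if level == REFERENCE:
        return base
    # Add the differential slope that belongs to this non-reference level.
    return base + coefs[2 + n_ind + NON_REFERENCE.index(level)]


# Hypothetical fitted coefficients: [intercept, d2, d3, x, d2*x, d3*x].
example_coefs = [0.0, 0.0, 0.0, 0.5, -0.25, 1.0]
slopes = {lv: level_slope(example_coefs, lv) for lv in LEVELS}
```

With three levels the scheme needs only k − 1 = 2 indicators, and a significance test on an interaction coefficient is a test of whether that level's slope differs from the reference level's.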
Learner Usage: Estar + Adjective

The best-subsets analysis identified 10 regressors predicting estar + adjective usage across the three levels, with 8 constituting significant regressors (p ≤ .05). The value of the coefficient of determination ($R^2$) of the model, however, indicates that only 5% of the variance in the Spanish learners’ use of estar + adjective could be explained by this regression model. This indicates that the association of estar + adjective with other lexical-grammatical features is weak within the interlanguage for all levels of learners. The model did account for a significant amount of the overall variation in estar + adjective usage, F(10, 1590) = 8.42; p < .0001. As observed in Table 4, most of these 10 variables contributed significantly to the model, with the subgroup regression analysis revealing that four regressors significantly distinguished among the three levels of learners: type-2 adjectives (i.e., adjectives with singular and plural inflection), article noun segments, preverbal clitics, and possessive adjectives. It is interesting to note that this group of variables is entirely different from the group of significant regressors for ser + adjective. At any rate, these differences are considered below in the interpretation of the variables, where we discuss all 10 variables by grouping them into three lexico-grammatical regressor categories: nominal (noun and adjectival), verbal, and syntactic variables. In contrast to ser + adjective segments, estar + adjective is associated with decidedly basic grammatical properties.
For example, noun phrases in discourse where estar + adjective occurs usually comprise nouns preceded by articles or possessive determiners (e.g., mi mamá “my mother,” la universidad “the university”) and adjectives that have only two inflections (e.g., inteligente “intelligent”) or adjectives in their singular form (alta “tall” [feminine]).

Table 4 Best-subsets regression model for estar + adjective

| Coefficient | Estimate sign | Estimate | Std. error | t test | p |
|---|---|---|---|---|---|
| (Constant) | − | −4.460 | 3.518 | −1.267 | .205 |
| adjective - singular | + | .010 | .003 | 2.459 | .014 |
| adjective - type 2ᵃ | + | .020 | .009 | 2.616 | .009 |
| noun - singular | | .000 | .002 | −1.716 | .086 |
| article noun segmentᵃ | + | .010 | .002 | 3.544 | .000 |
| possessive adjectiveᵃ | + | .010 | .003 | 3.669 | .000 |
| verb - “Gustar”-like | − | −.020 | .008 | −2.419 | .016 |
| verb - present participle | + | .030 | .011 | 2.364 | .018 |
| verb - probability | + | .020 | .011 | 1.871 | .062 |
| clitics - preverbalᵃ | + | .020 | .005 | 3.617 | .000 |
| adverbial clauses - cause | + | .020 | .009 | 2.326 | .020 |

ᵃ Variable distinguishing between the levels of instruction.

Three of the four level-distinguishing regressors identified in the subregression analysis were nominal in nature. Type-2 adjectives were found to distinguish significantly between first- and third-year learners (t = 2.73; p = .006), indicating that the trend to associate inflectionally simple adjectives with estar + adjective appears to become stronger as learners progress in their acquisition of Spanish. This predictor variable was disassociated with the criterion variable (beta = −0.01, std error = 0.009) for the first-year students and was positively associated with estar + adjective for the second (beta = 0.01, std error = 0.021) and third year (beta = 0.13, std error = 0.046). The article noun segment significantly distinguished only between first- and second-year learners (t = 3.30; p = .001).
This regressor was weakly associated with the criterion variable for first-year (beta = 0.002; std error = 0.003) and third-year students (beta = 0.003; std error = 0.011) and only slightly more associated with estar + adjective for second-year students (beta = 0.018; std error = 0.005). Finally, possessive adjectives significantly distinguished between second-year and third-year learners (t = 2.78; p = .005). This regressor was found to be weakly associated with the criterion level for the first year (beta = 0.008; std error = 0.003) and the third year (beta = 0.003; std error = 0.011) and only slightly more associated with estar + adjective for the second-year writing (beta = 0.041; std error = 0.010). Among verbal regressors, the significant predictor variables also showed no evidence that complexity is associated with the criterion. Although Gustar-like verbs are usually associated with complex syntax, in the learners’ writing this variable is negatively associated with the occurrence of estar + adjective. The other grammatical verb form, the present participle, is expected to co-occur with estar + adjective because it is mostly associated with estar to form the progressive aspect. Indeed, its beta coefficient was the highest of those regressors included in the best-subsets analysis (0.030). Two syntactic features were positively associated with estar + adjective. Preverbal clitics positively associated with estar + adjectives at all levels, perhaps the only indication of complexity associated with this phrase structure. The other syntactic regressor, causal adverbial clauses (which usually started with the conjunction porque), also predicted criterion usage. Preverbal clitics was the only syntactic regressor variable that distinguished significantly between learners’ use of estar + adjective at different levels.
This variable was weakly associated with the criterion for first-year learners (beta = 0.006; std error = 0.007), an association that increases modestly yet significantly (t = 2.31; p = .021) into the second and third years, with the association being greater for second- (beta = 0.033; std error = 0.011) and third-year (beta = 0.056; std error = 0.019) learners.

Native-Speaker Model Comparison

As explained earlier (see the Corpus Description section), our analysis also included a measurement (via regression analysis) of the extent to which the learners’ texts with high concentrations of each copula + adjective model were related to high concentrations of each of four native-speaker discourse types: informational richness, hypothetical discourse, narrative discourse, and descriptive discourse. The native-speaker model comparison indicated that three native-speaker discourse types combined significantly and individually to predict where the ser + adjective learner model occurred: hypothetical, narrative, and descriptive (see Table 5). As observed in Table 6, three also combined to predict where the estar + adjective learner model held: information rich, hypothetical, and narrative.

Table 5 Native-speaker discourse-type predictions of documents matching ser + adjective model

| Coefficient | Estimate sign | Estimate | Std. error | t test | p |
|---|---|---|---|---|---|
| (Constant) | + | 0.001 | 0.131 | 9.007 | .995 |
| information rich | − | −0.079 | 0.045 | −1.776 | .076 |
| hypothetical | − | −0.303 | 0.039 | −7.766 | .000 |
| narrative | + | 0.376 | 0.124 | 3.030 | .002 |
| descriptive | + | 0.350 | 0.114 | 3.060 | .002 |

Note. F(4, 1596) = 24.38; p = .000; multiple $R^2$: 0.06; adjusted $R^2$: 0.06.

Table 6 Native-speaker discourse-type predictions of documents matching estar + adjective model

| Coefficient | Estimate sign | Estimate | Std. error | t test | p |
|---|---|---|---|---|---|
| (Constant) | + | 81.371 | 9.011 | −9.030 | .000 |
| information rich | + | 0.129 | 0.026 | 4.954 | .000 |
| hypothetical | + | 0.059 | 0.023 | 2.574 | .010 |
| narrative | + | 0.550 | 0.072 | 7.592 | .000 |
| descriptive | + | 0.135 | 0.067 | 2.025 | .043 |

Note. F(4, 1596) = 149.3; p = .000; multiple $R^2$: 0.06; adjusted $R^2$: 0.06.

The information-rich discourse regressor indicates the extent to which documents reflecting a copula + adjective model are accompanied by semantically dense discourse. Considering the sign of the coefficients (specifically, the encyclopedic regressor in the ser + adjective model was significantly negative), ser + adjective usage is not semantically dense. Interestingly, documents reflecting the estar + adjective model appear to be semantically dense. Furthermore, because the subregression analysis showed no interlevel coefficient difference, we must surmise that this association is constant for all three levels of instruction.

The hypothetical regressor implies how much copula + adjective usage occurs when learners conjecture and present possible scenarios. Given their signs and significance levels, ser + adjective discourse appears to represent the antithesis of hypothetical discourse and estar + adjective usage contains hypothetical elements. The disassociation with ser + adjective discourse may be partially explained by the observation made earlier that epistemic verbs (representing stance) are entirely disassociated with ser + adjective usage as well as the model’s exclusion of verbal entities like the subjunctive and periphrastic future.
The subregression analysis indicates that ser + adjective is wholly unhypothetical at the first year and that at the second and third years this disassociation “raises” to the level of “no association.” The hypothetical regressor was disassociated with the first-year learner data (beta = −0.638; std error = 0.071), a coefficient significantly below those for second-year (beta = −0.167; std error = 0.098; t = 7.652; p = .000) and third-year learner ser + adjective usage (beta = 0.024; std error = 0.052; t = 3.280; p = .001). The estar + adjective association with hypothetical discourse is supported in the above analysis because this model was associated with verbs of probability. Additionally, the learner estar + adjective regression analysis included causal adverbial clauses in the estar + adjective model, and cause-effect relationships are an important tool for hypothesizing. This hypothetical regressor was not associated with the first-year coefficients (beta = −0.638; std error = 0.071), which were significantly below the second-year (beta = 0.055; std error = 0.040; t = 4.042; p = .000) and third-year (beta = −0.029; std error = 0.001; t = 3.280; p = .001) coefficients.

The narrative regressors generally indicate where learners used a copula + adjective model accompanied by story-telling elements, although not necessarily whole narrations. Both copula + adjective segments appear to be significantly associated with the presence of narrative features. The subregression analysis indicates that both the second- and third-year learners generate more narrative features where ser + adjective occurs than first-year learners: Although the coefficients for the second-year (beta = 1.031; std error = 0.192) and third-year (beta = 1.015; std error = 0.379) data were not significantly different (t = 0.030; p = .976), the difference between the second- and first-year coefficients (beta = 0.133; std error = 0.177) was significant (t = −3.450; p = .001). The subregression analysis indicates that the association of estar + adjective segments with narrative features remains constant through the three levels, as there were no significant interlevel coefficient differences. This is consistent with the learner regression analysis, which showed that present participles, which denote durative aspect (an important element of stories), were associated with estar + adjective.

Both copula + adjective learner models were associated with descriptive features, although the ser + adjective association was significant. This might seem surprising given the operationalization of Spanish descriptive discourse offered by Biber et al. (2006), which is almost entirely devoid of narrative features. The implication here is that both copula + adjective segments operate in both narrative and descriptive contexts beyond the first year of instruction. We see a significant transition toward greater association of ser + adjective segments with descriptive features from first (beta = −0.154; std error = 0.168), to second (beta = 0.989; std error = 0.165), to third year (beta = 1.417; std error = 0.350), with the second-year coefficients being greater than the first (t = 4.880; p = .000) as well as the third-year coefficients being greater than the first (t = 3.271; p = .001). Finally, it is important to note that the strength of association of estar + adjective segments with narrative features (beta = 0.550; std error = 0.072) is almost four times as much as with descriptive features (beta = 0.135; std error = 0.067).

Qualitative Analysis

We contextualize the following qualitative analysis in consideration of the learner models presented above and of their association with the preceding native-speaker discourse models. Ser + adjective discourse serves first-year learners in highly descriptive discourse.
The first-year documents reveal that ser + adjective segments are employed to relate descriptions containing multiple chained adjectives where ser tends to be the most frequently inflected verb. The following are segments from midterm-exam letters students in a first-year course wrote to a Mexican friend to describe their girlfriend/boyfriend and his/her family.

(1) yo estoy bien porque yo tengo novia. se llama jessica. ella [es] bonita, inteligente y elegante. ella tiene veinte años. ella es de oregon. y [es] moreno, bajo y muy bonita. ella lleva camiseta verde y jeans azules. sus ropas es mucho dolares. ella gusta bailar y cantar para mi. ella gusta tenís . . . la madre de jessica [es] bonita, inteligente y bajo. se llama velerie. nosotros jugamos tenis mucho. ella [es] bueno. nosotros aprendamos la universidad. ella lleva camisa verde y los jeans azul en la universidad . . . (I am well because I have a girlfriend. Her name is Jessica. She is beautiful, intelligent and elegant. She is 20 years old. She is from Oregon and she is a brunette, short and very beautiful. She wears a green t-shirt and blue jeans. Her clothes cost a lot of dollars. She likes to dance and sing for me. She likes tennis. Jessica’s mother is beautiful, intelligent and short. Her name is Valerie. We play tennis a lot. She is good. We learn it at the University. She wears a green shirt and blue jeans at the university . . .)

(2) yo soy bien porque yo soy amo con novia, selena. ella [es] bonita y simpática. ella [es] soltera y practicar. ella es alta y la ropa es mocha colores. mi muchacha lleva rojo gora, blanco jacqueta, azul jeans, y negro sandalias. ella es mi amora. selena (stays) con madre en casa grande. la familia [es] baja. la madre [es] rica y lista y soltera . . . (I am well because I am in love with my girlfriend, Selena. She is beautiful and nice.
She is single and practical. She is tall and she wears clothes in a lot of colors. My girl wears a red cap, white jacket, blue jeans, and black sandals. She is my love. Selena stays with her mother in Casa Grande. Her family is small. Her mother is rich, smart and single . . .)

In both of these samples we see simple discourse, grammar, and lexicon, with few verbs except for the copula and an overuse of subject pronouns. Additionally, although there are numerous adjectives in both segments, it is apparent that noun + adjective segments are scarce. These first-year samples are nonnarrative and possess almost no conjecturing.

Among second-year learners, ser + adjective segments appear in list fashion in discourse with few conjunctions expressing interpropositional relationships (e.g., ser + adjective + que “copula + adjective + that”). Such loosely connected discourse not only describes people, places and concepts, but it also describes evaluations and reactions to events and states. As the learner model suggests, there is a marked absence of epistemic verbs to demonstrate stance (verbs of knowledge, pienso que “I think that”; verbs of perception, vemos que “we see that”; verbs of communication, se dice que “it is said that”). Instead, copula + adjective segments present (seemingly) indisputable assertions. Structurally speaking, we see subject pronouns omitted to mark continuity; still, there are various referents and allusions to the things they do frequently. This probably accounts for why ser + adjective segments are associated with a mix of descriptive and narrative features. Finally, the derivational sophistication (and thus semantic density) of the nouns employed is slightly greater at this level, although these are mostly cognates. The following is an argumentative essay a second-year student wrote using short stories as the topic.

(3) . . . este cuento es un ejemplo que muchos padres están usando la televisión como niñera.
pienso que esto es un problema porque los jóvenes no saben si es realidad o no. los niños no reciben la atención que necesitan para crecer. también pienso que los jóvenes necesitan atención y amor en los primeros años más que de cuando [son] maduros porque cuando son jóvenes ellos no saben que [es] malo o que [es] bueno. también, la televisión [es] mala para los padres. para los adultos la puede ser un escape tan ellos no tienen hacer trabajo, o cosas diferentes que necesitan hacer durante el día. pero, también pienso que hay diferentes programas que [son] buenas. hay programas que enseña como cocinar, leer (para los niños), y que dice que esta haciendo en el mundo hoy. no todos de los programas de televisión [son] mala. pero yo pienso que [es] malo usar la mas de necesario. (. . . this story is an example that many parents are using the television as babysitters. I think this is a problem because young people don’t know whether it is real life or not. Children do not receive attention enough to grow up. I also think that young people need attention and love in their first years of life more than when they are mature because when they are young they don’t know what is good or what is bad. Also, television is bad for parents. For adults it can be an escape because they don’t have to do their work or the different things they need to do during the day. But, I also think that there are different programs that are good. There are programs that teach you how to cook, to read (for children) and that tell you what is being done in the world today. Not all the TV programs are bad, but I think it is bad to use it more than necessary.)
With the third-year learners, ser + adjective is less frequent, reflected by a lower overall average z-score of ser + adjective. It is now mixed among other verbs in the third person and adjectives modifying nouns.
The discourse is descriptive and evaluative in nature, with references to relevant events, producing a mix of descriptive and narrative elements. The following texts are expository essays students wrote in a third-year course about different occupations.
(4) al principio de su vida, el bebé atleta es una hija diferente de sus hermanas. el grito del bebé [es] más fuerte, el apetito más famélico y el cuerpo pequeño más musculoso que los otros bebés . . . de repente, en la escuela primaria, es la estrella de su partido de fútbol y la parte necesaria entre su equipo de básquetbol. al fin, no se puede negar todos los hechos, ella es atleta. [es] seguro que hay cualidades particulares para las atletas; factores que definen las mujeres que aman los deportes . . . mientras que la atleta está entrenándose, se come un dietético rico con una variedad de las frutas y las verduras. sin las vitaminas y minerales de estas comidas, el cuerpo no funciona mejor . . . se come mucho pescado y tofu, [es] justo porque los dos son comidas saludables sin mucha grasa . . . en el concepto de la diversión, el cuerpo de la atleta es su templo. por eso, no pasan los viernes bebiendo cerveza y fumando cigarrillas. todas las actividades giran de la salud y se mantienen la buena salud. [es] necesario que las atletas pasen sus noches jugando los juegos activas como escondite y jugar al corre que te pillo. (At the beginning of her life, the baby athlete is a different daughter from her sisters. Her crying is stronger, her appetite is more ravenous. And her small frame more muscular than the one of the other babies . . . Suddenly, in grade school, she is the star in her football game and the main player in her basketball team. At the end, you cannot deny all the facts, she is an athlete.
It is sure that there are particular qualities to athletes, factors that define women that love sports . . . While the athlete is training, she has a rich diet with a variety of fruit or vegetables. Without the vitamins and minerals in this food, her body couldn’t work better . . . a lot of fish and tofu is eaten. It is so because both are healthy foods without much fat. On the entertainment side, the body of the athlete is her temple. That’s why she doesn’t spend her Fridays drinking beer and smoking cigarettes. All her activities go around her health in order to keep her healthy. It is necessary that athletes spend their nights playing active games such as hide and seek or run and catch.) (5) los músicos es distingue por no estar religioso. muchos de ellos no creen que haya un dios. actualmente, [es] irónico, porque los músicos viven como no creen en Dios, pero tan pronto como ganen un premio, lo agradecen . . . la dietética no [es] similar entre músicos. unos músicos se distinguen por su dietética de alcohol y drogas. ellos también fumar cigarrillos, o otras sustancias, y asistir a fiestas todas las noches, entonces casi nunca duermen. unos músicos están muy saludable, y están vegetarianos estrictos . . . músicos a veces tienen su propia familia. tienen esposos y a veces hijos. tener una familia es muy difı́cil cuando los músicos siempre están viajando. (Musicians are known for not being religious. A lot of them don’t believe there is a God. Actually, it is ironic because musicians live as they don’t believe in God, but as soon as they are awarded a prize, they thank God . . . The diet is not similar among musicians. Some musicians distinguish themselves for having a diet with alcohol and drugs. They also smoke cigarettes or other substances, and they attend parties every night. So they almost never sleep. Some musicians are very healthy and they are strict vegetarians. Musicians sometimes have their own families. They have spouses and sometimes kids. 
Having a family is very difficult when the musicians are always traveling.)
For the most part, however, important information is packaged into nominal lexemes (adjectives and nouns) with a derivational morpheme (e.g., salud-able “healthy,” muscul-oso “muscular,” cuali-dad “quality”). Still, cognates prevail and there is creative derivation (e.g., the neologism diet-ético “diet”). Finally, subject pronouns are scarce, perhaps due to topic continuity. As with the first-year learners, we see little expression of stance via epistemic verbs, and statements are given as unqualified facts.
Regarding the estar + adjective segments, their principal discourse functions appear to be narration and description within narration. In the first year, estar + adjective mostly appears in a fixed expression such as estoy feliz “I am happy” or is used in descriptive contexts where ser was required, with adjectives such as bonita “pretty” and grande “large.” The following examples come from in-class letters that learners wrote to a friend. The examples relate life events as well as describe familiar people and places.
(6) querida maria, ¡hola! [estoy] muy feliz porque yo tengo un novio nueva. su nombre es Pete. Pete tiene veinte años. mi novio es de indiana. Pete es moreno y alto. mi novio es muy inteligente y optimista. (Dear Mary, Hello! I am happy because I have a new boyfriend. His name is Pete. Pete is twenty years old. My boyfriend is from Indiana. Pete has dark hair and is tall. My boyfriend is very intelligent and optimistic.)
(7) ¡hola aubrey! fue a costa rica para un semana. fue a un hotel en la playa dominical de costa rica. ¡la playa dominical [estuvó] más bonita! viajó con mis padres y mi hermano. fue en un avión y lo [estuvó] más grande. dormí en un hotel en la playa. el mar [estuvó] muy largo y yo pesqué mucho. ¡me gustaron las comidas mucho!
(Hello Aubrey! I went to Costa Rica for a week. I went to a hotel in the Dominical beach in Costa Rica. The Dominical beach was very beautiful. I traveled with my parents and my brother. I went by plane and it was very big. I slept in a hotel by the beach. The ocean was very big and I fished a lot. I liked the meals very much!)
Learners couple their assessments of people’s states with causes embedded in porque “because” adverbial clauses. The semantically dense nature of this discourse is attributable to the use of various cognates that are “long” words, which describe places, disciplines, actions, or events.
In second-year writing, estar + adjective is used in narrative and descriptive discourse that is detached from the writer. Writing is elicited from tasks in which students must summarize events and describe characters in readings or audiovisual material. The description of events favors the use of the present participle. The summarizing task also allows students to speculate about characters’ motives or actions by using verbs of probability such as creer “to believe” and causal adverbial clauses that begin with the conjunction porque “because.” These are the types of behaviors that account for the hypothetical nature identified for estar + adjective.
(8) la madre regresa de la cocina, ella piensa que el muchacho [está] dormido. sin embargo, el muchacho [está] despierto todavía y está mirando la televisión. la pantalla [está] oscura, así la madre le pregunta a su hijo que él hace. el hijo responde que él está esperando la muchacha en el televisor. esto es muy triste, porque obviamente la madre no es una madre muy bien . . . el hijo cree que la muchacha en el televisor es su amiga. él piensa esto porque cree que la muchacha está hablando a él . . . (The mother returns from the kitchen, she thinks that the boy is asleep.
However, the boy is still awake and he is watching the television. The screen is dark so she asks him what he is doing. The son responds that he is waiting for the girl in the television. This is very sad because it is obvious that the mother is not a good mother . . . The son believes that the girl in the television is his friend. He thinks so because he thinks the girl is speaking to him . . .) (9) la personalidad del protagonista, juan, era tı́mido y tranquilo. le gustaba [estar] solo con sus pensamientos. él soñaba antes de acostarse de, todas las peripecias de un viaje a francia, pero no pudo costearlo. no creo que juan [estuviera] satisfecho con su vida y su trabajo porque soñaba con ir a francia. él querı́a experimentar nuevas cosas. su vida era muy rutinario y querı́a cambiarla. él no [estaba] satisfecho con su trabajo porque no pudo costear el viaje . . . el narrador nos sugirió que juan escribiera las cartas porque la letra tuvo los mismos rasgos esenciales. creo que el narrador nos dijo eso porque él es anónimo y nos quiere hacer creer que juan [estuviera] loco y se muriera. (The personality of the main character, Juan, is shy and quiet. He liked to be alone with his thoughts. He used to dream before going to bed about his adventures in a trip to France, but he couldn’t afford it. I don’t think that Juan was satisfied about his life and his work because he dreamed about going to France. He wanted to experience new things. His life was a routine and wanted to change it. He was not satisfied about his work because he could not afford his trip . . . . The narrator suggested to us that Juan wrote the letters because his handwriting had the same main features. I think the narrator told us so because he is anonymous and he didn’t want us to believe that Juan was crazy and he died.) Third-year students combine different discourse patterns, using estar + adjective in argumentative texts where they describe two opposing sides of an issue, as in (10). 
The writer is comparing American and Hispanic culture or ways of life, which gives the text a hypothetical reading. They also use estar + adjective to produce personal narratives, such as an account of what happened on a birthday, as in (11). It is noteworthy that preverbal clitic forms are used not only with some Gustar-like constructions but also with verbs in the middle voice (e.g., infiltrarse “to be infiltrated”) and passive constructions (e.g., enseñarse “to be taught”), making the discourse more encyclopedic-sounding.
(10) . . . . el sueño americano se infiltra desde juventud, es evidente por televisión, escuela, la cultura y ejemplos del gobierno y políticos. esta influencia es subconsciente pero fuerte y se enseña el americano que la única cosa que se necesita hacer es trabaja fielmente y comprar las cosas correctas y eventualmente se recibirá la vida perfecta. / . . . además el americano siempre [está] preocupado, consumiendo y trabajando pero viva muy poco. (The American dream is instilled from youth, it is evident in television, school, the culture and the examples from the government and the politicians. This influence is subconscious but strong and it is taught to the American that the only thing that he needs to do is to work loyally and to buy the right things and eventually he will receive a perfect life. / Moreover, the American is always busy, consuming and working but he lives very little.)
(11) mi familia recordaron mi cumpleaños! pero, en este momento mis padres [estaban] enojados conmigo. yo no quería pelear, entonces, termino el papel y nos sentamos a comer. mis hermanas mi desean un buen cumpleaños y me dieron un regalo bellísimo. mi padre me hablaba de un film que le había gustado. yo pide a mi madre si a ella le a gustado y ella me respondió, no, en un tono agresivo.
mientras la entera cena ella solo me dijo, no, y, si, y fue muy molestosa. no me habló durante mi fiesta de cumpleaños! [estaba] muy triste este noche y la próxima día. (My family remembered my birthday! But, at that moment my parents were angry with me. I didn’t want to argue so I finished my paper and we sat to eat. My sisters wished me a good birthday and gave me a very beautiful present. My father was talking to me about a film he had liked. I asked my mother if she had liked it and she answered no, in an aggressive tone. During the whole dinner she told me no and yes and I was very angry. She didn’t talk to me during my birthday party! I was very sad that night and the next day.)
Discussion and Conclusions
Studying the acquisition of the Spanish copula provides insights into the interaction among syntax, semantics, pragmatics, morphology, and vocabulary during development in one of the most basic of syntactic structures, namely, attributive sentences (Leonetti, 1994). Spanish requires learners to choose between two copulas in attributive sentences in accordance with a variety of contextual considerations and across several levels of representation. Whereas relevant L2 research has examined (a) copula choice and (b) the function of attributive sentences in terms of orders of acquisition in different learning contexts (Gunterman, 1992; Ryan & Lafford, 1992; VanPatten, 1985, 1987), as well as (c) the contextual and semantic factors that predict learner usage of this construct as compared to native speakers (Geeslin, 2003a, 2005), the present study is the first to provide a corpus-based analysis of the lexico-grammatical features that co-occur with Spanish copula (i.e., ser and estar) + adjective usage and thus of the different discursive functions that ser + adjective and estar + adjective segments play at three learner levels and in comparison to native-speaker models. The study delves into important learner issues identified in the S/E research but not fully developed to date, for example, the discourse types learners associate with copula usage (Gunterman, 1992) and the strong influence of contextual cues on copula choice (Geeslin, 2003a, 2003b).
The results overall revealed the following: (a) both ser + adjective and estar + adjective were associated with simple discourse at all levels; (b) ser + adjective appears in descriptive and evaluative discourse where much linguistic complexity typically occurs; and (c) estar + adjective is present in narrations, descriptions, and hypothetical discourse where, nonetheless, little linguistic complexity typically occurs. Specifically, findings showed that the model predicting ser + adjective usage identified more variables (n = 21) and accounted for more variation (41%) than the estar + adjective model, which identified only 10 predictors and accounted for 5% of the variation. It seems that at beginning levels of instruction, learners find ser + adjective more communicatively productive and thus more easily associated with a large array of features within their interlanguage, although these features are basic grammatical and lexical items.
Ser + adjective is one of the first copula segments taught and is recycled across various semesters, whereas estar + adjective is primarily used at beginning levels in routines and formulaic expressions like estar + bien, mal, ocupado, enfermo. In this sense, the input provided by the teacher, materials, and other students in the class through task completion emphasizes the use of ser + adjective over estar + adjective constructions and, therefore, encourages more ser + adjective usage. These findings are in line with early SLA studies on ser/estar acquisition, which found that ser + adjective was acquired well before estar + adjective (Gunterman, 1992; Ryan & Lafford, 1992; VanPatten, 1987), presumably because of the higher frequency and saliency of ser + adjective in instructional and naturalistic input.
It was also found that many of the lexico-grammatical predictor variables in both models were characteristics of simple discourse and did not differentiate learners’ copula + adjective usage among the three levels of instruction. All levels seem to use copula + adjective as a discourse tool, for instance to communicate evaluatives like es importante, lástima “it’s important, it’s a shame,” and so forth. However, when the discourse becomes more syntactically and grammatically complex, ser + adjective segments are absent and estar + adjective segments become more prevalent. On the one hand, these observations contrast with native speakers, who use ser + adjective for evaluative purposes in a wide variety of discourses, simple or complex; on the other hand, they are consistent with natives’ propensity to use estar + adjective in more complex discourse (Collentine, 2008).
The ser + adjective model was mostly associated with adjective and grammatical/lexical verb variables.
Various morphological properties of adjectives (e.g., feminine, plural) were associated with ser + adjective, whereas more complex adjectival syntactic processes (e.g., prenominal or postnominal adjectives) emerged as disassociated. Most of the verbal variables reflecting complex syntax (e.g., periphrastic future, past subjunctive, Gustar-like verbs) were disassociated from the copula construction and only started to emerge as associated with ser + adjective at advanced levels of instruction. Other features such as null subjects also indicated some grammatical sophistication at advanced levels, where ser + adjective became less frequently used. As for the discursive functions served by the co-occurrence of the variables in the predictive model for ser + adjective, the disassociation of verbs of observation and communication from the construction indicated a discourse that was nonepistemic/nonhypothetical in nature. Comparisons with native speakers’ discourse showed that learners used ser + adjective in discourse that is highly descriptive in nature and accompanied by story-telling elements, especially at advanced levels of instruction. These findings corroborate those of Gunterman (1992), who examined learners in study-abroad contexts where ser + adjective was indicative of descriptive discourse. Spanish learners, regardless of their level, associate an evaluative stance with ser + adjective.
The estar + adjective regression analysis revealed a weak association with other lexico-grammatical features. This indicates that throughout the early to middle stages of acquisition, this phrase structure is weakly integrated into the interlanguage in terms of being a productive, necessary tool for the types of communication in which learners engage. In other words, the use of estar + adjective segments is not obviated (or evoked, cognitively speaking) when learners use their standard repertoire of lexico-grammatical tools.
All told, the story is complicated for estar + adjective, which ultimately might account for its late acquisition. On the one hand, it appears where there is little associated inflectional sophistication (recall that Spanish is a highly inflectional language, both in verbal and nominal constructs). The few variables associated with estar + adjective suggest that it appears in discourse lacking in overall inflectional sophistication (e.g., type-2 adjectives or adjectives with singular and plural inflections, singular nouns, negative association with Gustar-like verbs). Its use places significant processing demands on learners, as shown by Geeslin (2003a, 2003b), who noted that learners use estar + adjective segments according to pragmatic factors (which require a consideration of a multitude of contextual variables) rather than according to semantic/lexical constraints (which are local to the copula + adjective phrase structure). With this in mind, it is not unreasonable to conjecture that learners are more likely to have the cognitive resources to employ it when other structural demands are not overwhelming. On the other hand, estar + adjective segments usually occurred in discourse that was semantically dense (probably because it was based on sources), with hypothetical elements (e.g., verbs of probability, causal adverbial clauses), narrative features (e.g., present participles), and descriptive features (e.g., type-2 adjectives). Like study-abroad learners (see Gunterman, 1992), learners in an instructional context also use estar + adjective when they need to fulfill communicative functions that go beyond description.
Learners’ awareness of and experience with different kinds of discourse (e.g., narration, arguments) at advanced levels of instruction might explain these associations rather than learners’ acquisition of discrete grammatical and lexical items, given the simplicity of their interlanguage, as Lafford (2004) asserted for the gains observed for learners studying abroad. Cheng et al. (2008) concluded that more abstract registers can evoke greater estar + adjective usage. In this study, learners at advanced levels of instruction were asked to complete written tasks in which they summarized a story or argued in favor of a position. We have no way of knowing whether the task demands affected learners’ estar + adjective usage. However, the results indicate that the weighty processing demands of discourse in which referents and events were detached from the writer, and which in some cases was based on readings with grammar and vocabulary beyond their linguistic knowledge, could have led to more estar + adjective use. The semantic and pragmatic goals of narratives as well as hypothetical discourse seem to entail more consideration of the states of affairs of referents and changes in the background of a story or situation, thus being more compatible with estar + adjective.
The findings in our study provide evidence of the influence of lexico-grammatical and discourse predictors on learners’ copula + adjective usage, as attested in previous studies (Cheng et al., 2008; Geeslin, 2003a, 2003b, 2005). Under classroom conditions, learners’ written discourse in response to tasks designed to advance or test their communicative abilities revealed that copula + adjective usually co-occurred with simple linguistic features.
Developmentally, interesting and seemingly contradictory observations emerged about the use of the two copula + adjective segments studied here. In general, the phrase structure (ser + adjective) associated with greater linguistic complexity is typically associated with simple types of discourse, whereas the phrase structure (estar + adjective) associated with less linguistic complexity is typically associated with more complex discourse types. The implication of these observations is that processing demands interact with the types of discourse that learners can produce as they develop and the types of lexico-grammatical structures they produce in those types of discourse, as Cheng et al. (2008) argued. The relatively simple linguistic features associated with estar + adjective may be an indication that learners hit a wall when attempting to communicate messages within complex discourse structures. Conversely, the learner models reported here may indicate that the simple nature of discourse structures like descriptions may afford learners processing resources for calling up relatively complex structures. All in all, the present analysis suggests that copula usage and choice depend on the amount of processing resources available and the discourse structure a learner produces. Consequently, we have a better understanding of the complex ways that pragmatic and discursive features influence how learners make copula choices (in ways that are not how native speakers make copula choices), as Geeslin (2003a, 2003b, 2005) has argued.
The results of our study have pedagogical implications for teaching S/E + adjective. It makes sense to teach ser + adjective to beginners first because of its frequency and communicative value; however, estar + adjective probably deserves more attention and practice in different kinds of discourse (e.g., narration and hypothetical situations).
Given that there is some evidence that estar + adjective segments are more prominently distributed in certain kinds of discourse (cf. Collentine, 2008), exposure to input with estar + adjective-relevant types of discourse should have a positive effect. What Spanish educators might examine is whether estar emerges reliably in learner production as a result of having to write estar + adjective-relevant types of discourse, such as exploratory writing (as Cheng et al.’s, 2008, study suggests) or narratives. An interesting corollary would be whether such discourse types emerge as a result of asking learners to produce estar + adjective segments.
Revised version accepted 5 March 2009
Notes
1 Screening data for inflated correlations is difficult in the linguistic sciences. Whereas two words might represent the same part of speech (e.g., adjectives), the inflectional morphology of adjectives can represent important distinctions for learners, such as those that are singular and those that are plural. Additionally, natural language is extremely redundant as communication systems go, and so our initial screening process sought to balance semantic and structural collinearity considerations.
2 The analysis identifies the optimal combination of regressors for a criterion by comparing all possible predictor-variable combinations on Mallows’ Cp, which simultaneously represents any given model’s bias (i.e., how well it predicts the referent variable) and the variation associated with that bias. The best-subsets analysis compares numerous combinations of variables by identifying (a) the number of and (b) which predictor variables balance bias and variance where the mean-square error of a combination is small.
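As a rough illustration of the selection procedure this note describes, the following Python sketch scores every non-empty subset of predictors with Mallows’ Cp, taken here in its common textbook form Cp = SSE_p / s² − n + 2p, where SSE_p is the residual sum of squares of a candidate model with p parameters (intercept included), s² is the error variance estimated from the full model, and n is the number of observations. It is not the authors’ R analysis: the function names (ols_sse, best_subsets_cp) and the toy data are hypothetical, and the exhaustive loop is only practical for a handful of predictors.

```python
from itertools import combinations


def ols_sse(rows, y):
    """Residual sum of squares from a least-squares fit of y on rows.

    An intercept column is added; the normal equations (X'X)b = X'y are
    solved by Gauss-Jordan elimination with partial pivoting.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
               for x, yi in zip(X, y))


def best_subsets_cp(data, y):
    """Return (Cp, subset) pairs for all non-empty subsets, lowest Cp first."""
    n, k = len(y), len(data[0])
    # Error variance estimated from the full model (all k predictors)
    sigma2 = ols_sse(data, y) / (n - (k + 1))
    scored = []
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            rows = [[row[j] for j in subset] for row in data]
            p = size + 1  # parameters, counting the intercept
            scored.append((ols_sse(rows, y) / sigma2 - n + 2 * p, subset))
    return sorted(scored)


# Hypothetical toy data: three predictors, of which only the first drives y.
toy_x = [[1, 0, 3], [2, 1, 1], [3, 0, 4], [4, 1, 1],
         [5, 0, 5], [6, 1, 9], [7, 0, 2], [8, 1, 6]]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

for cp, subset in best_subsets_cp(toy_x, toy_y)[:3]:
    print(subset, round(cp, 2))
```

Under this definition the full model always scores a Cp equal to its own parameter count, so the informative comparisons are among the smaller subsets; subsets whose Cp is small and close to p are the usual candidates.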
The resulting model has a small bias with the fewest predictor variables, such that it contains a highly reduced number of predictor variables whose combination predicts values for the criterion variable that are closest to the observed values. Statisticians recommend best-subsets analysis when the potential number of predictor variables is large because stepwise methods tend to miss models that balance bias and variance as well as the model they produce.
3 R is an open-source statistical package based on Bell Labs’ (proprietary) S programming language, a standard among statisticians for statistical programming (see http://www.r-project.org/). R is gaining increasing popularity in academic circles because of its reliability, statistical accuracy, and flexibility (it contains numerous [tested] add-on modules) and because it is freely available.
4 As a simplified example, because this process extrapolates the true coefficient for each level, we can extrapolate individual level effects in the following fashion. For instance, if the X1 coefficient were 8.0 for level 1, 5.0 for level 2, and 3.0 for level 3 and if the process calculates the difference coefficient for X1 between levels 1 and 2 to be 3.0 (i.e., 8.0 − 5.0 = 3.0) and between levels 2 and 3 to be 2.0 (i.e., 5.0 − 3.0 = 2.0), we infer the difference between levels 1 and 3 by summing these two difference coefficients, or 5.0 (i.e., (8.0 − 5.0) + (5.0 − 3.0) = 5.0). See Hardy (1993) for details.
References
Belz, J. (2004). Learner corpus analysis and the development of foreign language proficiency. System, 32, 577–597.
Biber, D., & Conrad, S. (2001). Introduction: Multi-dimensional analysis and the study of register variation. In S. Conrad & D.
Biber (Eds.), Variation in English: Multi-dimensional studies (pp. 3–13). London: Longman.
Biber, D., Davies, M., Jones, J., & Tracy-Ventura, N. (2006). Spoken and written register variation in Spanish: A multi-dimensional analysis. Corpora, 1, 1–37.
Cheng, C., Lu, H., & Giannakouros, P. (2008). The uses of Spanish copulas by Chinese-speaking learners in a free writing task. Bilingualism: Language and Cognition, 11, 301–317.
Collentine, J. G. (2004). The effects of learning contexts on morphosyntactic and lexical development. Studies in Second Language Acquisition, 26, 227–248.
Collentine, J. G. (2008). The role of discursive features in SLA modeling and grammatical frequency: A response to Cheng, Lu and Giannakouros. Bilingualism: Language and Cognition, 11, 319–321.
Dalgaard, P. (2008). Introductory statistics with R. New York: Springer.
Fernández Leborans, M. J. (1999). La predicación: las oraciones copulativas. In I. Bosque & V. Demonte (Eds.), Gramática descriptiva de la lengua española (pp. 2354–2460). Madrid: Espasa.
Geeslin, K. (2002). The second language acquisition of copula choice and its relationship to language change. Studies in Second Language Acquisition, 24, 419–451.
Geeslin, K. (2003a). A comparison of copula choice in advanced and native Spanish. Language Learning, 53, 703–764.
Geeslin, K. (2003b). The role of adjectival features in the second language acquisition of copula choice. In P. Kempchinsky & C. Piñeros (Eds.), Theory, practice and acquisition: Papers from the 6th Hispanic Linguistics Symposium and the 5th Conference on the Acquisition of Spanish and Portuguese (pp. 332–351). Medford, MA: Cascadilla Press.
Geeslin, K. (2005). Crossing disciplinary boundaries to improve the analysis of second language data: A study of copula choice with adjectives in Spanish. Munich: LINCOM Europa Publishers.
Geeslin, K., & Guijarro-Fuentes, P. (2006). Second language acquisition of variable structures in Spanish and Portuguese speakers.
Language Learning, 56, 53–107. Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer learner corpora, second language acquisition and foreign language teaching. Amsterdam: Benjamins. Gunterman, G. (1992). An analysis of interlanguage development over time: Part II, ser and estar. Hispania, 75, 1294–1303. Halliday, M. A. K. (1970). Language structure and language function. In J. Lyons (Ed.), New horizons in linguistics (pp. 140–165). Harmondsworth, UK: Penguin Books. Hardy, M. A. (1993). Regression with dummy variables. Sage University Papers, QASS # 07-093. Newbury Park, CA: Sage. Klein, W., & Perdue, C. (1997). The basic variety (Or: Couldn’t natural languages be much simpler?). Second Language Research, 13, 301–347. Lafford, B. A. (2004). The effect of the context of learning on the use of communication strategies by learners of Spanish as a second language. Studies in Second Language Acquisition, 26, 201–225. Leonetti, M. (1994). Ser y estar: estado de la cuestión. Barataria, 1, 182–205. Luján, M. (1981). The Spanish copulas as aspectual indicators. Lingua, 54, 165–210. Miller, A. (2002). Subset selection in regression. Boca Raton, FL: Chapman & Hall/CRC. Myles, F. (2005). Interlanguage corpora and second language acquisition research. Second Language Research, 21, 373–391. Myles, F., & Mitchell, R. (2004). Using information technology to support empirical SLA research. Journal of Applied Linguistics, 1, 169–196. Rencher, A. (2002). Methods of multivariate analysis. New York: Wiley-Interscience. Ryan, J., & Lafford, B. (1992). The acquisition of lexical meaning in a study abroad environment: Ser + estar and the Granada experience. Hispania, 75, 714–722. Rutherford, W., & Thomas, M. (2001). The Child Language Data Exchange System in research on second language acquisition. Second Language Research, 17, 195–212. Shehadeh, A.
(2002). Comprehensible output, from occurrence to acquisition: An agenda for acquisitional research. Language Learning, 52, 597–649. Silva-Corvalán, C. (1986). Bilingualism and language change: The extension of estar in Los Angeles Spanish. Language, 62, 587–608. Silva-Corvalán, C. (1994). Language contact and change: Spanish in Los Angeles. Oxford: Clarendon Press. Siyanova, A., & Schmitt, N. (2007). Native and nonnative use of multi-word versus one-word verbs. IRAL, 45, 119–139. Swain, M. (1985). Communicative competence: Some roles of comprehensible input and comprehensible output in its development. In S. Gass & C. Madden (Eds.), Input in second language acquisition (pp. 235–253). Rowley, MA: Newbury House. VanPatten, B. (1985). The acquisition of ser and estar in adult second language learners: A preliminary investigation of transitional stages of competence. Hispania, 68, 399–406. VanPatten, B. (1987). The acquisition of ser and estar: Accounting for developmental patterns. In B. VanPatten, T. Dvorak, & J. Lee (Eds.), Foreign language learning: A research perspective (pp. 61–75). New York: Newbury House.