12. Society and culture

12.1. Corpora and culture

Corpora can also be used to study culture. Differences in the words, and word senses, used in British and American English can point to cultural differences (Leech and Fallon, 1992). For instance, travel words are more frequent in American English, which may reflect the fact that the US is larger than Britain and that people there travel more. Words in the domains of crime and the military were also more common in the American data, which may reflect an American “gun culture”. More research is needed in the area of cultural studies.

12.2. Psycholinguistic variation

Psycholinguistics is largely a laboratory discipline. However, corpora can be a source of data for the materials used in laboratory experiments. Word frequency data, for example, can guide the selection and ordering of test items in experiments on cognitive processes such as word recognition. Garnham et al. (1981) used a corpus of natural speech to study speech errors, classifying and counting the different error types. The analysis of language pathologies in individuals can likewise draw on data on atypical language collected earlier. Fletcher and Garman (1988) collected a corpus of impaired and normal language development, which may help to identify language abnormalities.

12.3. Social psychology

In social psychology qualitative data and quantitative analyses are equally important. For example, various written and spoken texts are used in analysing explanations. Antaki and Naji (1987) investigated the phrases that followed "because". The results showed that explanations of general states of affairs were most common (33.8%), followed by actions of the speaker and the speaker's group (28.8%) and then actions of others (17.7%).

12.4. Corpora and sociolinguistics

Sociolinguistics relies on the collection of research-specific data. Either a small corpus can be collected, or a representative corpus can be sampled according to the research questions.

Examples of studies: lexical studies in the area of language and gender. Kjellmer (1986) studied masculine bias in American and British English.
He looked at masculine and feminine pronouns and at the occurrence of the lexical items man/men and woman/women. He found that female items were much less frequent than male items in both corpora, but that female forms were more frequent in British English than in American English.

Some corpora, but not all, encode sociolinguistic variables. Variables such as the sex of the writer, social class and educational background have been encoded in historical corpora.

13. Language acquisition and teaching

13.1. Corpora in first language acquisition

Corpora of children’s language, e.g. the CHILDES database, are used to study the stages of linguistic development in normal and impaired children. The first descriptions of the process of development were based on small corpora that were not machine-readable at that time (see the history of corpus linguistics in chapter 1).

Corpus data may be used to evaluate linguistic theories on the basis of empirical evidence. An “order of acquisition” was proposed by Brown in 1973, and research on corpora of children’s language confirmed the order he proposed. On the one hand, this shows that corpus linguistics and other ways of gathering data, including intuition, elicitation and experimentation, lead to the same results. On the other hand, by retracing the steps that led to well-known results, young researchers learn the methods and techniques. There is still a need for corpora that contain the language children are exposed to.

13.2. Corpora in second language acquisition and teaching

Indirectly, all learners and teachers use corpora on a daily basis, because most dictionaries and grammar reference books are based on corpora. Sinclair (1990) stated that: “ELT methodology has paid little attention to the state of language description, behaving as if the facts of English structure were no longer in dispute.
In practical terms this has led to the growth and maintenance of a mythology about English, which teachers take for granted, but much of which has been challenged by corpus evidence”.

Chris Tribble (1990) suggested using concordances as teaching materials. If they are based on monitor corpora, they can be used to teach advanced learners vocabulary and the grammatical features of words in sentence context. The main advantage of using corpora is that the language is authentic; the main disadvantage is that the sentences are cut off from their wider context and may contain words that are difficult even for advanced learners.

Examples of activities:
- A gap-filling exercise with two words that are easily confused can help students to infer differences of meaning.
- Wordlists can be used as lead-in activities.
- Literary texts are also useful for comparative studies of style.
- Deducing the meaning of a keyword from its context.
- Matching exercises, in which the left or right contexts are jumbled and have to be restored.
- Study of homonyms and synonyms.
- Using wildcards for studying prefixes and suffixes.

Using corpora in class is time-consuming. However, it introduces autonomy into language learning and helps students to solve linguistic problems on their own. The Council of Europe has recommended that language pedagogy should “develop explicit objectives and practices to teach methods of discovery and analysis” (1994:10). Learners need to test any rule against as many examples as possible before they fully internalise it. Concordance programs help to build language awareness. Words and syntax should not be taught separately: the interrelationship of lexis and syntax can be made visible on screen in a few seconds.

Tim Johns introduced and developed DDL, i.e. Data Driven Learning. Monolingual and parallel corpora of English and French (as well as German, Spanish and Italian) are used to learn and teach the target language.
Materials and ideas are available from Tim Johns’s website: http://web.bham.ac.uk/johnstf.

Concordances are also used in reciprocal learning. Two learners, for example a native speaker of English and a native speaker of French, each of whom wants to learn the other's language, work together and help each other.

Susan Hunston (2002: 193) enumerates the challenges to the use of corpora in language teaching:
- Lack of context.
- The need for a critical approach to corpus evidence. Learners should be creative, not restricted to utterances collected earlier.
- Corpora comprise the language of native speakers. Native norms are not always the aim of learning the language.
- Eclectic and diverse methods of teaching and learning should be adapted to learners’ needs and learning styles.

Further reading: Susan Hunston (2002), Corpora in Applied Linguistics

Teaching ESP and EAP - English for specific purposes and English for academic purposes

In teaching ESP and EAP the content is as important as the language. Legal, technical, medical and business ‘sublanguages’ should be taught using texts from these disciplines. EAP teaching should also include academic papers from individual disciplines.

Further reading: Susan Hunston (2002), Corpora in Applied Linguistics

Teaching linguistics

Corpora may be used in teaching linguistics. Kirk (1994) presented a project in which students were asked to analyse some corpus data in the light of a given theoretical model. The theoretical models selected for the analysis were Brown and Levinson’s politeness theory, Grice’s cooperative principle and Biber’s multidimensional approach to linguistic variation.1 In other projects at Nijmegen and Lancaster, students of linguistics are asked to annotate a text using corpus-based software. The student is given four chances to get the annotations right.

1 Find more about theories in pragmatics on Andrew Moore’s website: http://www.universalteacher.org.uk/lang/pragmatics.htm

Teaching translation

Translation is a matter of style rather than of right and wrong.
A multilingual corpus may provide side-by-side examples of style and idiom in more than one language.

13.3. Learner corpora

All the corpora discussed in the previous sections are collections of language used by native speakers, either adults or children. They are authentic because they have been collected in natural contexts. In this section we deal with learner corpora, which contain language used by FL/SL learners. As with all corpora, the criteria for collecting a learner corpus should be clearly established. Sinclair (1996) defined learner corpora as follows: “Computer learner corpora are electronic collections of authentic FL2/SL3 textual data assembled according to explicit design criteria for a particular SLA4/FLT5 purpose. They are encoded in a standardized and homogeneous way and documented as to their origin and provenance.”

Granger (2002) comments on the definition in the following way:

- Authenticity
Authenticity is more problematic for learner language than for native speaker data. Classroom teaching involves a lot of “artificiality”. Even texts produced as “free writing” are rarely natural, because the topic and the time limit are imposed on the learner. Thus, there are different levels of authenticity, ranging from genuine communication between people going about their business to the results of classroom activities. For example, essays written in the classroom can be considered authentic written data, and texts read aloud can be considered authentic spoken data.
- FL and SL varieties
Non-native varieties of English comprise:
  - English as an Official Language (EOL), such as Nigerian English or Indian English;
  - English as a Second Language (ESL), acquired in an English-speaking environment such as Britain or the US;
  - English as a Foreign Language (EFL), learned primarily in a classroom setting (in most countries).
Learner corpora cover EFL and ESL.

- Textual data
A learner corpus must contain continuous stretches of discourse, not isolated sentences or words. It must contain both correct and erroneous use of the language.

- Design criteria
A random collection of texts does not make a corpus. Some criteria are the same as for native corpora; however, some may be specific to learner corpora and relate to both the learner and the task, for example the learning context, the time limit, the mother tongue and the use of reference tools.

- SLA/FLT purpose
The purpose of collecting a learner corpus should relate to SLA theory or FLT methodology. The researcher may need to confirm or falsify theories about transfer from L1 or L3, or about the order of acquisition of specific elements of language. A learner corpus may also help to evaluate ELT methods or tools.

- Standardization and documentation
The corpus can be collected as:
  - a raw corpus of plain texts;
  - an annotated corpus enriched with linguistic or textual information.

2 FL: Foreign Language
3 SL: Second Language
4 SLA: Second Language Acquisition
5 FLT: Foreign Language Teaching

Examples of learner corpora:

Cambridge Learner Corpus http://uk.cambridge.org/elt/corpus/clc.htm
It consists of texts written by those who have taken Cambridge exams. It contains over 15 million words and is constantly growing.

The Longman Learners' Corpus http://www.longman.com/dictionaries/corpus/lclearn.html

International Corpus of Learner English (ICLE) http://juppiter.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/Cecl-Projects/Icle/icle.htm
Instructions on how to join the project are available on the website.
The Louvain International Database of Spoken English Interlanguage (LINDSEI) http://juppiter.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/Cecl-Projects/Lindsei/lindsei.htm

The standard used for annotating a learner corpus should be the same as the one used for native corpora, to ensure that the two are comparable. Annotating learner corpora is problematic, however, because tools built for native corpora are less reliable on learner language. Dedicated error-tagging software has therefore been developed for learner corpora. Documentation of the learner and task variables should also be provided, either in the corpus or separately.

Research Methodology

Linguistic analysis of learner corpora involves either Contrastive Interlanguage Analysis or Computer-aided Error Analysis. The first method is contrastive and consists of comparisons between native and non-native data or between varieties of non-native data. The second method focuses on the identification and analysis of errors in the non-native data.

For Contrastive Interlanguage Analysis a learner corpus needs to be collected, and a native control corpus needs to be selected from a monitor corpus so that it is appropriate for comparisons that highlight a range of features of non-nativeness. The interlanguage of the learners can be investigated in order to understand its underlying system. However, if the aim of the learner corpus research is to improve learners’ proficiency, the interlanguage needs to be related to native norms.

Comparisons of learner data from different mother tongue backgrounds help to:
- identify features that are shared by several learner populations;
- find peculiarities of one national group;
- describe developmental features.

Having selected a feature for study and established its underrepresentation or overrepresentation in the learner data in comparison with native data, the researcher formulates hypotheses. In order to interpret the results, a bilingual corpus of the learners’ mother tongue and English is necessary.
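The comparison of a feature's frequency in learner and native data can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names (tokenize, per_million, compare_feature), the simple regular-expression tokenizer and the two sample sentences are hypothetical and do not come from any of the corpora or tools mentioned above. The sketch compares normalised frequencies only. A real study would normally add a significance test before speaking of over- or underrepresentation.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(count, total):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / total if total else 0.0


def compare_feature(feature, learner_text, native_text):
    """Return (learner rate, native rate, verdict) for one word form.

    The verdict only says which normalised frequency is higher;
    it is not a statistical test.
    """
    learner_tokens = tokenize(learner_text)
    native_tokens = tokenize(native_text)
    learner_rate = per_million(Counter(learner_tokens)[feature], len(learner_tokens))
    native_rate = per_million(Counter(native_tokens)[feature], len(native_tokens))
    if learner_rate > native_rate:
        verdict = "overrepresented"
    elif learner_rate < native_rate:
        verdict = "underrepresented"
    else:
        verdict = "equal"
    return learner_rate, native_rate, verdict


# Hypothetical toy data standing in for a learner corpus and a native control corpus.
learner_sample = "it is very very good"
native_sample = "it is very good indeed"
print(compare_feature("very", learner_sample, native_sample))
```

Normalising to a common base such as one million tokens is what makes corpora of different sizes comparable; raw counts alone would favour the larger corpus.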
Classical Contrastive Analysis and corpus-based Contrastive Interlanguage Analysis are complementary.

Error analysis

Computer-aided error analysis involves either selecting an error-prone linguistic item and scanning a corpus to retrieve it, or discovering learner difficulties that the researcher was not aware of. The first method is effective and fast. However, it requires a predefined set of items that seem to be problematic. The second method is time-consuming: it requires tagging all errors, or at least all errors in a particular category. On the other hand, a fully tagged corpus offers a huge range of possible applications.

Errors appear in both native and non-native utterances. In language learning and teaching the approach to errors varies. In the audio-lingual method errors were considered an entirely negative aspect of learner language. Nowadays error analysis can be seen as a key to understanding interlanguage development. It also provides data for teachers and materials designers. Corpus data help to evaluate what learners can be expected to have acquired.

Learner corpus analysis

On the one hand, learner corpora can be analysed with the same tools that were developed for the analysis of native corpora. The type/token ratio,

T/t ratio = (number of word types * 100) / (number of word tokens),

is used to draw conclusions about lexical richness in texts. In learner corpora the T/t value may be influenced by the high rate of non-standard forms, i.e. various errors.

On the other hand, these “special corpora” require specific tools and methods of analysis. Traditional types of annotation need to be supplemented by error tagging. There are many systems of error tagging. The system developed and implemented in Louvain (Granger 2002) is hierarchical: it attaches to each error a series of codes that go from the more general to the more specific.
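As a brief aside on the T/t ratio given earlier in this section, the calculation can be sketched in Python. This is a minimal sketch, not the method of any particular tool: the function name type_token_ratio and the regular-expression tokenizer are assumptions made for illustration, and the sample sentences are invented. The second pair of calls illustrates the point made above, that a misspelt form counts as an extra word type and so raises the ratio.

```python
import re


def type_token_ratio(text):
    """T/t ratio = (number of word types * 100) / (number of word tokens).

    Tokenization here is an assumption: lowercase runs of letters and apostrophes.
    Returns 0.0 for a text with no tokens.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    types = set(tokens)  # distinct word forms
    return len(types) * 100 / len(tokens)


# 5 types over 6 tokens ("the" occurs twice)
print(round(type_token_ratio("the cat sat on the mat"), 2))

# The misspelling "recieve" is counted as a separate type,
# so the uncorrected learner sentence scores higher than the corrected one.
print(type_token_ratio("I recieve and you receive"))
print(type_token_ratio("I receive and you receive"))
```

Because the ratio falls as texts get longer, comparisons are usually only meaningful between samples of similar length.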
The first letter refers to the error domain:
G - grammatical
L - lexical
X - lexico-grammatical
F - formal
R - register
W - syntax
S - style

The following letters give more precision:
GV - grammatical errors affecting verbs
GVAUX - auxiliary errors
GVM - morphological errors
GVT - tense errors

For example (Granger 2002: 20):

the fact that we could (XVPR) argue on $argue about$ the definition of
want to be parents, do not (XVPR) care of $care about$ the sex
is rising. These people who (XVPR) come in $come to$ Belgium
Family planning (XVPR) consists on $consists of$
have the possibility to (XVPR) discuss about $discuss$ their problems
which the purchaser cannot (XVPR) dispense of $dispense with$
the health. Nobody (XVPR) doubts about $doubts$ that.
harvest they get is often (XVPR) exported in $exported to$ countries of
advice on (XNUC) a $0$ better health care
for years. Undoubtedly (XNUC) a $0$ big progress has been made
characteristic (XNUC) behaviours $behaviour$
It provides (XNUC) employments $employment$
combining study life and (XNUC) entertainments $entertainment$
are many other (XNUC) leisures $leisure facilities$
a balance between work and (XNUC) spare times $spare time$
need to do some (XNUC) works $work$ or simply for your personal

Figure 9. Error tag search: verb dependent prepositions and count/uncount nouns

This system is flexible and allows the analyst to add or delete codes. The error-tagging system was implemented in an error editor, a tool that helps to insert error tags. By clicking on the relevant tag in the menu the analyst can insert the tag. Using the correction box s/he can insert the corrected form, enclosed between two dollar signs that serve as formatting symbols.

Native and learner corpora and language pedagogy

Native corpus data have changed dictionaries and grammar reference books because they have provided an enriched description of the language.
Learner corpora and native corpora are expected to change language pedagogy in the following areas:
- curriculum design;
- materials design;
- classroom methodology.

When learner corpora are investigated for the purposes of language pedagogy, data from four corpora should be analysed. Granger (2002: 22) presents the ideal environment for the analysis of French-speaking learners’ interlanguage and the design of materials for them.

[Figure: Learner corpus environment (Granger 2002). Labels recovered from the figure: Native French Corpus, Native English Corpus, Learner English Corpus, Basilang, Mesolang, Acrolang, L1, L2, French-English bilingual corpus.]

Basilang - the earliest form of target language development
Mesolang - the intermediate stage of language development
Acrolang - the final stage of target language development

Learner corpus data may change the teaching content.

Curriculum design

In the field of vocabulary teaching, frequency information is useful: it may support the intuitions that teachers and researchers have about areas of difficulty for learners. However, frequency should not be the dominant factor in selecting vocabulary for learners.

In grammar teaching the selection and sequencing of grammatical phenomena should be verified or modified. Biber (1994) showed that prepositional phrases (the man in the corner) are much more frequent than relative clauses (the man who is in the corner) and more frequent than participial clauses (the man standing in the corner). In EFL grammars, however, relative clauses receive much more discussion than prepositional phrases. A study by Meunier (2000) showed that French learners do not use prepositional phrases and participial clauses and significantly overuse relative clauses. This may be partly teaching-induced and partly due to cross-linguistic reasons: prepositional phrases and participial clauses are less common in French than in English. What is more, corpus data analysis showed that French learners have persistent difficulty with relative pronoun selection.
Thus, French learners need more practice with prepositional phrases, participial clauses and relative clauses. The conclusions drawn from these findings should be incorporated into syllabus design.

Materials design

In monolingual learners’ dictionaries learner corpus data are used to enrich usage notes, which draw learners’ attention to common mistakes. The Longman Essential Activator Dictionary is the first dictionary based on learner corpora.

The integration of CALL software with corpus data seems promising. WordPilot by Milton is designed to help Hong Kong EFL learners (see: http://www.compulang.com/index.htm). Telenex (http://www.telenex.hku.hk/telec/pmain/opening.htm) is a project designed to provide support for secondary-level English teachers in Hong Kong. It is available only to registered Hong Kong teachers. A large learner corpus, the TELEC Student Corpus, has been used to compile the problem page in TeleGram. For each problematic area a series of tools for teachers has been developed. Every ‘student problem’ is matched with ‘teaching implications’, which suggest teaching methods designed to help students avoid such mistakes.

Classroom methodology

The use of learner corpora in the classroom is highly controversial because learners are exposed to erroneous data. Using corpora is more suitable for advanced learners. Granger (1999) suggests a method in which the teacher selects a problematic item, e.g. a word, asks students to find examples of it in the native corpus, and then asks them to find examples in the learner corpus. The patterns used in the learner corpus are similar to the patterns that the learners in this particular class tend to use. The aim of the activity is to get learners to notice the gap between their own forms and those of the target language. Learners become aware of their misuse and overuse of the word.

Seidlhofer (2002) suggested a procedure called learning-driven data.
The teacher asked students to read a text and then write a summary of it together with an account of their personal reaction to the article. She then compiled a corpus of the students' own short texts and asked them to prepare a list of questions that could be answered with computer tools. The students also compared the language they used with the language of the input text. She mentioned the possibility of using native corpora if there is a need for them.

Read: Barbara Seidlhofer (2002), Pedagogy and local learner corpora: Working with learning-driven data

Challenges to learner corpora involve:

- Corpus compilation
Many learner corpora have been compiled but very few are available. The ICLE corpus is free for contributors. The Longman and Cambridge learner corpora are collected for internal use only. It is easy for teachers to collect their own learner corpora for evaluating and implementing new methods.

- Corpus analysis
More research is needed on the success rate of linguistic annotation. There is a need for longitudinal studies. Quantitative, product-oriented studies should be supplemented with more qualitative, process-oriented studies.

- Interdisciplinarity
There is a need for cooperation between SLA, ELT and NLP researchers. Applications of learner corpus research need to relate to current SLA theories. Classroom practice needs to be a focus for researchers. User-friendly tools need to be developed for learners, teachers and researchers.

“Learner corpus research has the potential to radically improve knowledge about learner language and language learning.” (Granger 2002: 28)