Corpora and language pedagogy

How can corpora help in language pedagogy?
Richard Xiao
Abstract
Corpus linguistics as a methodology of linguistic research has gained such
prominence over time that corpora have been used extensively in nearly all branches
of linguistics. This chapter explores the potential uses of corpus data in one of these
areas – language teaching and learning. We will first discuss a wide range of issues
related to using corpora in language pedagogy, including reference publishing,
syllabus design and materials development, language testing, teacher development,
data-driven learning (DDL), teaching language for specific purposes, as well as learner
corpus and interlanguage analysis. We will then demonstrate, via a case study of
passive constructions in Chinese learner English, how contrastive corpus linguistics
can inform second language acquisition research. The chapter concludes by
discussing the debate over the relevance of authenticity and frequency of corpora in
language education as well as the future of corpus-based language pedagogy.
Key words: corpora, language pedagogy, data-driven learning, learner corpus,
contrastive corpus linguistics, interlanguage, second language acquisition
1. Introduction
The corpus-based approach to linguistics and language education has gained
prominence over the past four decades, particularly since the mid-1980s. This is
because corpus analysis can be illuminating ‘in virtually all branches of linguistics or
language learning’ (Leech 1997: 9; cf. also Biber, Conrad and Reppen 1998: 11). One
of the strengths of corpus data lies in its empirical nature, which pools together the
intuitions of a great number of speakers and makes linguistic analysis more objective
(McEnery and Wilson 2001: 103). Unsurprisingly, corpora have been used
extensively in nearly all branches of linguistics including, for example, lexicographic
and lexical studies, grammatical studies, language variation studies, contrastive and
translation studies, diachronic studies, semantics, pragmatics, stylistics,
sociolinguistics, discourse analysis, forensic linguistics, and language pedagogy.
Corpora have won widespread popularity over time in spite of the fact that they still
occasionally attract hostile criticism (e.g. Widdowson 1990, 2000).
In this chapter, we will not be concerned with the debate over the use of corpus data
in linguistic analysis and language education. In our view, such a debate is over a
non-issue. Readers interested in the pros and cons of using corpus data should refer to
Sinclair (1991), Widdowson (1991, 2000), de Beaugrande (2001) and Stubbs (2001).
Robert de Beaugrande’s unpublished paper, ‘Large corpora and applied linguistics: H.
G. Widdowson versus J. McH. Sinclair’ (available online at
http://www.beaugrande.com/WiddowSincS.htm), provides an excellent summary of
the debate between Sinclair and Widdowson, at the Georgetown University Round
Table on Languages and Linguistics in 1991, over the use of corpora in language
teaching. While Widdowson, Sinclair and de Beaugrande characterize two extreme
attitudes towards corpora, there are many milder (positive or negative) reactions to
corpus data between the two extremes. Readers can refer to Nelson (2000: section
5.3.3) for a good review. Nor will we discuss the use of corpora in a wide range of
language studies. Readers can refer to Hunston (2002) and McEnery, Xiao and Tono
(2006) for a further discussion of using corpora in applied linguistics. Instead, this
chapter focuses only on using corpora in language pedagogy.
The early 1990s saw an increasing interest in applying the findings of corpus-based
research to language pedagogy. The upsurge of interest is evidenced by the eight well-received biennial international conferences on Teaching and Language Corpora
(TaLC) held in Lancaster (1994, 1996), Oxford (1998), Graz (2000), Bertinoro
(2002), Granada (2004), Paris (2006), and Lisbon (2008). This is also apparent when
one looks at the published literature. In addition to a large number of journal articles,
well over twenty authored or edited volumes have recently been produced on the topic
of teaching and language corpora: Wichmann et al (1997), Partington (1998),
Bernardini (2000), Burnard and McEnery (2000), Kettemann and Marko (2002,
2006), Aston (2001), Ghadessy, Henry, and Roseberry (2001), Hunston (2002),
Granger et al (2002), Connor and Upton (2002), Tan (2002), Sinclair (2003, 2004),
Aston et al (2004), Mishan (2005), Nesselhauf (2005), Römer (2005), Braun, Kohn
and Mukherjee (2006), Gavioli (2006), Scott and Tribble (2006), Hidalgo, Quereda
and Santana (2007), O’Keeffe, McCarthy and Carter (2007), Aijmer (2009), and
Campoy, Gea-Valor and Belles-Fortuno (2010). These works cover a wide range of
issues related to using corpora in language pedagogy, e.g. corpus-based language
description, corpus analysis in the classroom, and learner corpora (cf. Keck 2004).
In the opening chapter of Teaching and Language Corpora (Wichmann et al 1997),
Geoffrey Leech observed that a convergence between teaching and language corpora
was apparent. That convergence has three focuses, as noted by Leech (1997): the
direct use of corpora in teaching (teaching about, teaching to exploit, and exploiting to
teach), the indirect use of corpora in teaching (reference publishing, materials
development, and language testing), and further teaching-oriented corpus
development (LSP corpora, L1 developmental corpora and L2 learner corpora).
In the remainder of this chapter, we will explore the potential uses of corpora in
language pedagogy in line with Leech’s three focuses of convergence (sections 2-4),
which is followed by a case study demonstrating how contrastive corpus linguistics
can inform second language acquisition research (section 5). The chapter concludes
by discussing the debate over the relevance of authenticity and frequency of corpora
in language education as well as the future of corpus-based language pedagogy.
2. Indirect use of corpora
The use of corpora in language teaching and learning has been more indirect than
direct. This is perhaps because direct use of corpora in language pedagogy is
restricted by a number of factors including, for example, the level and experience of
learners, time constraints, curricular requirements, knowledge and skills required of
teachers for corpus analysis and result interpretation, and the access to resources such
as computers, and appropriate software tools and corpora, or a combination of these
(see section 6 for further discussion). This section explores how corpora have
impacted on language pedagogy indirectly.
2.1. Reference publishing
Corpora have revolutionized reference publishing (at least for English), be it a
dictionary or reference grammar, in such a way that it is now nearly unheard of for
new dictionaries and new editions of old dictionaries published from the 1990s
onwards not to be based on corpus data, and ‘even people who have never heard of a
corpus are using the product of corpus-based investigation’ (Hunston 2002: 96).
Corpora are useful in several ways for lexicographers. The greatest advantage of
using corpora in lexicography lies in their machine-readable nature, which allows
dictionary makers to extract all authentic, typical examples of the usage of a lexical
item from a large body of text in a few seconds. The second advantage of the corpus-based approach, which is not available when using citation slips, is the frequency
information and quantification of collocation which a corpus can readily provide.
Some dictionaries, e.g. COBUILD 1995 and Longman 1995, include such frequency
information. Frequency data plays an even more important role in the so-called
frequency dictionaries, which define core vocabulary to help learners of different
modern languages, e.g. Davies (2005) for Spanish, Jones and Tschirner (2005) for
German, Davies and de Oliveira Preto-Bay (2007) for Portuguese, Lonsdale and Bras
(2009) for French, and Xiao, Rayson and McEnery (2009) for Chinese. Information
of this sort is particularly useful for materials writers and language learners alike. A
further benefit of using corpora is related to corpus markup and annotation. Many
available corpora (e.g. the BNC) are encoded with textual (e.g. register, genre and
domain) and sociolinguistic (e.g. user gender and age) metadata which allows
lexicographers to give a more accurate description of the usage of a lexical item.
Corpus annotations such as part-of-speech tagging and word sense disambiguation
also enable a more sensible grouping of words which are polysemous and
homographs. Furthermore, a monitor corpus allows lexicographers to track subtle
change in the meaning and usage of a lexical item so as to keep their dictionaries up-to-date. Last but not least, corpus evidence can complement or refute the intuitions of
individual lexicographers, which are not always reliable (cf. Sinclair 1991a: 112;
Atkins and Levin 1995; Meijs 1996; Murison-Bowie 1996: 184) so that dictionary
entries are more accurate. The observations above are in line with Hunston (2002:
96), who summarizes the changes brought about by corpora to dictionaries and other
reference books in terms of five ‘emphases’: an emphasis on frequency, an emphasis
on collocation and phraseology, an emphasis on variation, an emphasis on lexis in
grammar and an emphasis on authenticity.
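The frequency information and quantification of collocation mentioned above can be illustrated with a minimal sketch. It assumes a toy three-sentence text in place of a real corpus, a very simple tokenizer, and pointwise mutual information (PMI) over adjacent word pairs as the collocation measure; PMI is only one of several association measures in use, and the function names here are illustrative rather than taken from any particular corpus tool.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # ignore pairs too rare to be informative
        p_pair = count / (n - 1)  # n - 1 adjacent pairs in n tokens
        p_w1 = unigrams[w1] / n
        p_w2 = unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented sample text standing in for a corpus.
sample = (
    "She made a strong cup of strong tea. "
    "He drank strong tea and made a decision. "
    "They made a decision about the tea."
)
tokens = tokenize(sample)
print(frequency_list(tokens)[:4])
print(bigram_pmi(tokens))
```

On a corpus of realistic size, the same two steps would give a lexicographer a ranked word list and a ranked list of candidate collocations to inspect; the minimum count threshold matters because PMI tends to overrate rare pairs.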
It has been noted that non-corpus-based grammars can contain biases while corpora
can help to improve grammatical descriptions (McEnery and Xiao 2005). The
Longman Grammar of Spoken and Written English (LGSWE, Biber et al 1999) can be
considered a milestone in reference publishing. Based entirely on the 40-million-word Longman Spoken and Written English Corpus, the grammar gives ‘a thorough
description of English grammar, which is illustrated throughout with real corpus
examples, and which gives equal attention to the ways speakers and writers actually
use these linguistic resources’ (Biber et al 1999: 45). The new corpus-based grammar
is unique in many different ways, for example, by taking register variations into
account and exploring the differences between written and spoken grammars.
While lexical information forms, to some extent, an integral part of the grammatical
description in Biber et al (1999), it is the Collins COBUILD series (Sinclair 1990,
1992; Francis et al 1996, 1997, 1998) that focuses on lexis in grammatical descriptions
(the so-called ‘pattern grammar’, Hunston and Francis 2002). In fact, Sinclair et al
(1990) flatly reject the distinction between lexis and grammar. While pattern
grammars focusing on the connection between pattern and meaning challenge the
traditional distinction between lexis and grammar, they are undoubtedly useful in
language learning as they provide ‘a resource for vocabulary building in which the
word is treated as part of a phrase rather than in isolation’ (Hunston 2002: 106).
In the dictionary family, perhaps the most important member as far as language
pedagogy is concerned is the learner dictionary. Yet corpus-based learner dictionaries
have quite a short history. It was only in 1987 that the Collins COBUILD English
Dictionary was published as the first ‘fully corpus-based’ dictionary. Yet the impact
of this corpus-based dictionary was such that most other publishers in the ELT market
followed Collins’ lead. By 1995, the new editions of major learner’s dictionaries such
as the Longman Dictionary of Contemporary English (LDOCE, 3rd edition), the
Oxford Advanced Learner’s Dictionary (OALD, 5th edition), and a newcomer, the
Cambridge International Dictionary of English (CIDE, 1st edition) all claimed to be
based on corpus evidence in one way or another.
One of the important features of corpus-based learner dictionaries is their
inclusion of quantitative data extracted from a corpus. Another important feature,
which is also related to frequency information, is that such dictionaries typically
select the vocabulary used from a controlled set when defining the entry for a word.
Producing definitions in an L2 that language learners can understand is a problem;
language learners may not have a very well developed L2 vocabulary. This makes it
necessary and desirable for dictionary makers to limit the vocabulary they use when
defining words in a dictionary. Nowadays, most learner dictionary makers prepare a
list of defining words, usually ranging from 2,000 to 2,500 words, based on the
frequency information extracted from corpora as well as on the lexicographers’
experience of defining words.
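The idea of a frequency-based defining vocabulary can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any publisher's procedure: the corpus is an invented toy text, the list size is 7 rather than 2,000 to 2,500, and the selection relies on frequency alone, whereas real lists also draw on lexicographers' experience. The function names are hypothetical.

```python
import re
from collections import Counter


def build_defining_vocabulary(corpus_tokens, size):
    """Keep the `size` most frequent word forms as the defining vocabulary."""
    return {word for word, _ in Counter(corpus_tokens).most_common(size)}


def words_outside_vocabulary(definition, vocabulary):
    """List words in a draft definition that fall outside the controlled set."""
    words = re.findall(r"[a-z']+", definition.lower())
    return sorted(set(words) - vocabulary)


# Invented toy corpus; a real list would be drawn from millions of words.
corpus_tokens = (
    "a small animal that people keep at home "
    "a small animal that people keep at work "
    "a rare bird"
).split()
vocabulary = build_defining_vocabulary(corpus_tokens, size=7)

draft = "A small feline animal that people keep at home"
print(words_outside_vocabulary(draft, vocabulary))
```

A check of this kind flags the words a definition writer would need to replace or explain, which is the practical purpose a controlled defining vocabulary serves.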
As noted earlier, an important use of corpus data for lexicography is in the area of
example selection so that nowadays most dictionaries of English use corpora as the
source of their examples. In the case of learner dictionaries, however, there was a
tradition of using examples invented by lexicographers, rather than authentic
materials, in dictionary production, because they believed that foreign language
learners have difficulty understanding authentic materials and therefore have to be
presented with simple, rewritten examples in which the use of a given word is
highlighted to show its syntactic and semantic properties. It was corpus-based learner
dictionary work which challenged this received wisdom. The COBUILD project broke
with tradition and used authentic data extracted from corpora to produce illustrative
examples for a learner dictionary. The use of authentic examples in learner
dictionaries is an area where corpus-based learner dictionaries have innovated.
2.2. Syllabus design and materials development
While corpora have been used extensively to provide more accurate descriptions of
language use, a number of scholars have also used corpus data directly to look
critically at existing TEFL (Teaching English as a Foreign Language) syllabuses and
teaching materials. Mindt (1996), for example, finds that the use of grammatical
structures in textbooks for teaching English differs considerably from the use of these
structures in L1 English. He observes that one common failure of English textbooks is
that they teach ‘a kind of school English which does not seem to exist outside the
foreign language classroom’ (Mindt 1996: 232). As such, learners often find it
difficult to communicate successfully with native speakers. A simple yet important
role of corpora in language education is to provide more realistic examples of
language usage. In addition, however, corpora may provide data, especially frequency
data, which may further alter what is taught. For example, on the basis of a
comparison of the frequencies of modal verbs, future time expressions and conditional
clauses in corpora and their grading in textbooks used widely in Germany, Mindt
(ibid) concludes that one problem with non-corpus-based syllabuses is that the order
in which those items are taught in syllabuses ‘very often does not correspond to what
one might reasonably expect from corpus data of spoken and written English’,
arguing that teaching syllabuses should be based on empirical evidence rather than
tradition and intuition with frequency of usage as a guide to priority for teaching
(Mindt 1996: 245-246). While frequency is certainly not the only determinant of what
to teach and in what order (see section 6 for further discussion), it can indeed help to
make learning more effective. For example, McCarthy, McCarten and Sandiford’s
(2005-2006) innovative Touchstone book series, which is based on the Cambridge
International Corpus, aims to present the vocabulary, grammar, and functions students
encounter most often in real life.
Hunston (2002: 189) echoes Mindt, suggesting that ‘the experience of using corpora
should lead to rather different views of syllabus design.’ The type of syllabus she
discusses extensively is a ‘lexical syllabus’, originally proposed by Sinclair and
Renouf (1988) and outlined fully by Willis (1990) and embodied in Willis, Willis and
Davids’ (1988-1989) three-part Collins COBUILD English Course. According to
Sinclair and Renouf (1988: 148), a lexical syllabus would focus on ‘(a) the
commonest word forms in a language; (b) the central patterns of usage; (c) the
combinations which they usually form.’ While the term may occasionally be
misinterpreted to indicate a syllabus consisting solely of vocabulary items, a lexical
syllabus actually covers ‘all aspects of language, differing from a conventional
syllabus only in that the central concept of organization is lexis’ (Hunston 2002: 189).
Sinclair (2000: 191) would say that the grammar covered in a lexical syllabus is
‘lexical grammar’, not ‘lexico-grammar’, which attempts to ‘build a grammar and
lexis on an equal basis.’ Indeed, as Murison-Bowie (1996: 185) observes, ‘in using
corpora in a teaching context, it is frequently difficult to distinguish what is a lexical
investigation and what is a syntactic one. One leads to the other, and this can be used
to advantage in a teaching/learning context.’ Sinclair and his colleagues’ proposal for
a lexical syllabus is echoed by Lewis (1993, 1997a, 1997b, 2000) who provides strong
support for the lexical approach to language teaching.
A focus of the lexical approach to language pedagogy is teaching collocations and the
related concept of prefabricated units. There is a consensus that collocational
knowledge is important for developing L1/L2 language skills (e.g. Bahns 1993;
Zhang 1993; Cowie 1994; Herbst 1996: 389-391; Kita and Ogata 1997: 230-231;
Partington 1998: 23-25; Hoey 2000, 2004; Shei and Pain 2000: 167-170; Sripicharn
2000: 169-170; Altenberg and Granger 2001; McEnery and Wilson 2001; McAlpine
and Myles 2003: 71-75; Nesselhauf 2003). Hoey (2004), for example, posits that
‘learning a lexical item entails learning what it occurs with and what grammar it tends
to have.’ Cowie (1994: 3168) observes that ‘native-like proficiency of a language
depends crucially on knowledge of a stock of prefabricated units.’ Aston (1995) also
notes that the use of prefabs can speed language processing in both comprehension
and production, thus creating native-like fluency. A powerful reason for the
employment of collocations, as Partington (1998: 20) suggests, ‘lies in the way it
facilitates communication processing on the part of hearer’, because ‘language
consisting of a relatively high number of fixed phrases is generally more predictable
than that which is not’ while ‘in real time language decoding, hearers need all the help
they can get.’ As such, competence in a language undoubtedly seems to involve
collocational knowledge (cf. Herbst 1996: 389). Collocational knowledge indicates
which lexical items co-occur frequently with others and how they combine within a
sentence. Such knowledge is evidently more important than individual words
themselves (cf. Kita and Ogata 1997: 230) and is needed for effective sentence
generation (cf. Smadja and McKeown 1990). Zhang (1993), for example, finds that
more proficient L2 writers use significantly more collocations, more accurately and in
more variety than less proficient learners. Collocational error is a common type of
error for learners (cf. McAlpine and Myles 2003: 75). Gui and Yang (2002: 48)
observe, on the basis of the Chinese Learner English corpus, that collocation error is
one of the major error types for Chinese learners of English. Altenberg and Granger
(2001) and Nesselhauf (2003) find that even advanced learners of English have
considerable difficulties with collocation. One possible explanation is that learners are
deficient in ‘automation of collocations’ (Kjellmer 1991). ‘As a result, learners need
detailed information about common collocational patterns and idioms; fixed and semi-fixed lexical expressions and different degrees of variability; relative frequency and
currency of particular patterns; and formality level’ (McAlpine and Myles 2003: 75).
Corpora are useful in this respect, not only because collocations can only reliably be
measured quantitatively, but also because the KWIC (key word in context) view of
corpus data exposes learners to a great deal of authentic data in a structured way. Our
view is in line with Kennedy (2003), who discusses the relationship between corpus data
and the nature of language learning, focusing on the teaching of collocations. The
author argues that second or foreign language learning is a process of learning
‘explicit knowledge’ with awareness, which requires a great deal of exposure to
language data.
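To make the KWIC view concrete, the following minimal sketch prints each occurrence of a search word with a fixed amount of context on either side, so that the node word lines up in a central column. The sample text is invented, the function name `kwic` and the `width` parameter are illustrative, and the matching is plain whole-word search on running text; dedicated concordancers offer sorting, tagging and much more.

```python
import re


def kwic(text, keyword, width=30):
    """Return concordance lines with the keyword aligned in a central column."""
    pattern = rf"\b{re.escape(keyword)}\b"
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so every hit starts in the same column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Invented sample standing in for corpus text.
sample = (
    "He made a decision quickly. She made a mistake, "
    "and then made an effort to fix it."
)
for line in kwic(sample, "made", width=20):
    print(line)
```

Reading down the aligned column, a learner can see at a glance which words tend to follow the node word, which is the kind of structured exposure to authentic data discussed above.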
In addition to the lexical focus, corpus-based teaching materials try to demonstrate
how the target language is actually used in different contexts, as exemplified in Biber
et al’s (2002) Longman Student Grammar of Spoken and Written English, which pays
special attention to how English is used differently in various spoken and written
registers.
2.3. Language testing
Another emerging area of language pedagogy which has started to use the corpus-based approach is language testing. Alderson (1996) envisaged the possible uses of
corpora in this area: test construction, compilation and selection, test presentation,
response capture, test scoring, and calculation and delivery of results. He concludes
that ‘[t]he potential advantages of basing our tests on real language data, of making
data-based judgments about candidates’ abilities, knowledge and performance are
clear enough. A crucial question is whether the possible advantages are borne out in
practice’ (Alderson 1996: 258-259). The concern raised in Alderson’s conclusion
appears to have been addressed satisfactorily. Choi, Kim and Boo (2003), for
example, find that computer-based tests are comparable to paper-based tests. A
number of corpus-based studies of language testing have been reported. For example,
Coniam (1997) demonstrated how to use word frequency data extracted from corpora
to generate cloze tests automatically. Kaszubski and Wojnowska (2003) presented a
corpus-driven program for building sentence-based ELT exercises – TestBuilder. The
program can process raw and part-of-speech tagged corpora, tagged on the fly by a
built-in part-of-speech tagger, and uses this as input for test material selection. Indeed,
corpora have recently been used by major providers of test services for a number of
purposes: 1) as an archive of examination scripts; 2) to develop test materials; 3) to
optimize test procedures; 4) to improve the quality of test marking; 5) to validate
tests; and 6) to standardize tests (cf. Ball 2001; Hunston 2002: 205). For example, the
University of Cambridge Local Examinations Syndicate (UCLES) is active in both
corpus development (e.g. Cambridge Learner Corpus, Cambridge Corpus of Spoken
English, Business English Text Corpus and Corpus YLE Speaking Tests) and the
analysis of native English corpora and learner corpora. At UCLES, native English
corpora such as the British National Corpus (BNC) are used ‘to investigate
collocations, authentic stems and appropriate distractors which enable item writers to
base their examination tasks on real texts’ (Ball 2001: 7); the corpus-based approach
is used to explore ‘the distinguishing features in the writing performance of EFL/ESL
learners or users taking the Cambridge English examinations’ and how to incorporate
these into ‘a single scale of bands, that is, a common scale, describing different levels
of L2 writing proficiency’ (Hawkey 2001: 9); corpora are also used for the purpose of
speaking assessment (Ball and Wilson 2002; Taylor 2003) and to develop domain-specific (e.g. business English) wordlists for use in test materials (Ball 2002; Horner
and Strutt 2004).
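As a rough illustration of how word frequency data might drive automatic cloze generation, the sketch below blanks out words whose frequency in a reference list falls within a chosen band, while keeping a minimum distance between gaps. It is not a reconstruction of Coniam's (1997) procedure or of TestBuilder: the frequency figures are invented, and the function name, the frequency band and the `min_gap` parameter are assumptions made for the example.

```python
import re


def make_cloze(passage, frequency, min_freq, max_freq, min_gap=3):
    """Blank out words whose reference frequency lies in [min_freq, max_freq]."""
    tokens = re.findall(r"\w+|[^\w\s]", passage)  # words and punctuation marks
    output, answers = [], []
    words_since_gap = min_gap  # allow a gap at the first eligible word
    for token in tokens:
        freq = frequency.get(token.lower())
        eligible = freq is not None and min_freq <= freq <= max_freq
        if eligible and words_since_gap >= min_gap:
            answers.append(token)
            output.append(f"({len(answers)}) ____")
            words_since_gap = 0
        else:
            output.append(token)
            if token[0].isalnum():
                words_since_gap += 1  # punctuation does not count as spacing
    text = " ".join(output)
    text = re.sub(r"\s+([^\w\s(])", r"\1", text)  # re-attach punctuation
    return text, answers


# Invented per-million frequencies, for illustration only.
frequency = {
    "the": 60000, "a": 25000, "after": 1200, "long": 900,
    "committee": 150, "reached": 120, "decision": 180, "debate": 60,
}
passage = "The committee reached a decision after a long debate."
cloze_text, answer_key = make_cloze(passage, frequency, 100, 1000, min_gap=2)
print(cloze_text)
print(answer_key)
```

Choosing the frequency band is the pedagogically interesting decision: a band of mid-frequency words targets vocabulary learners are likely to have met but may not control, while very frequent function words and very rare words are left in place.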
2.4. Teacher development
For learners to benefit from the use of corpora, language teachers must first of all be
equipped with a sound knowledge of the corpus-based approach. It is unsurprising to
discover then that corpora have been used in training language teachers (e.g. Allan
1999, 2002; Conrad 1999; Seidlhofer 2000, 2002; O’Keeffe and Farr 2003). Allan
(1999), for example, demonstrates how to use corpus data to raise the language
awareness of English teachers in Hong Kong secondary schools. Conrad (1999)
presents a corpus-based study of linking adverbials (e.g. therefore and in other
words), on the basis of which she suggests that it is important that a language teacher
do more than use classroom concordancing and lexical or lexico-grammatical
analyses if language teaching is to take full advantage of the corpus-based approach.
Conrad’s concern with teacher education is echoed by O’Keeffe and Farr (2003), who
argue that corpus linguistics should be included in initial language teacher education
so as to enhance teachers’ research skills and language awareness.
3. Direct use of corpora
While indirect uses such as syllabus design and materials development are closely
associated with what to teach, corpora have also provided valuable insights into how
to teach. Of Leech’s (1997) three focuses, direct uses of corpora include ‘teaching
about’, ‘teaching to exploit’, and ‘exploiting to teach’, with the latter two relating to how
corpora are used. Given the restricting factors noted in section 2, direct uses have so
far been confined largely to learning at more advanced levels, for example, in tertiary
education, whereas in general English language teaching (not to mention other
foreign languages), especially in secondary education (see Braun 2007 for a rare example
of an empirical study of using corpora in secondary education), the direct use of
corpora is ‘still conspicuously absent’ (Kaltenböck and Mehlmauer-Larcher 2005).
‘Teaching about’ means teaching corpus linguistics as an academic subject like other
sub-disciplines of linguistics such as syntax and pragmatics. Corpus linguistics has
now found its way into the curricula for linguistic and language related degree
programmes at both postgraduate and undergraduate levels. ‘Teaching to exploit’
means providing students with ‘hands-on’ know-how, as emphasized in McEnery,
Xiao and Tono (2006), so that they can exploit corpora for their own purposes. Once
the student has acquired the necessary knowledge and techniques of corpus-based
language study, learning activity may become student centred. ‘Exploiting to teach’
means using a corpus-based approach to teaching language and linguistics courses
(e.g. sociolinguistics and discourse analysis), which would otherwise be taught using
non-corpus-based methods.
If the focuses of ‘teaching about’ and ‘exploiting to teach’ are viewed as being
associated typically with students of linguistics and language programmes, ‘teaching
to exploit’ relates to students of all subjects which involve language study and
learning, who are expected to benefit from the so-called data-driven learning (DDL)
or ‘discovery learning’.
The issue of how to use corpora in the language classroom has been discussed
extensively in the literature. With the corpus-based approach to language pedagogy,
the traditional ‘three P’s’ (Presentation – Practice – Production) approach to teaching
may not be entirely suitable. Instead, the more exploratory approach of ‘three I’s’
(Illustration – Interaction – Induction) may be more appropriate, where ‘illustration’
means looking at real data, ‘interaction’ means discussing and sharing opinions and
observations, and ‘induction’ means making one’s own rule for a particular feature,
which ‘will be refined and honed as more and more data is encountered’ (see Carter
and McCarthy 1995: 155). This progressive induction approach is what Murison-Bowie (1996: 191) would call the interlanguage approach: namely, partial and
incomplete generalizations are drawn from limited data as a stage on the way towards
a fully satisfactory rule. While the ‘three I’s’ approach was originally proposed by
Carter and McCarthy (1995) to teach spoken grammar, it may also apply to language
education as a whole, in our view.
It is clear that the teaching approach focusing on ‘three I’s’ is in line with Johns’
(1991) concept of ‘data-driven learning (DDL)’. Johns was perhaps among the first to
realize the potential of corpora for language learners (e.g. Higgins and Johns 1984). In
his opinion, ‘research is too serious to be left to the researchers’ (Johns 1991: 2). As
such, he argues that the language learner should be encouraged to become ‘a research
worker whose learning needs to be driven by access to linguistic data’ (ibid). Johns’
web-based Kibbitzer (www.eisu2.bham.ac.uk/johnstf/timeap3.htm) gives some very
good examples of data-driven learning.
Data-driven learning can be either teacher-directed or learner-led (i.e. discovery
learning) to suit the needs of learners at different levels, but it is basically learner-centred. This autonomous learning process ‘gives the student the realistic expectation
of breaking new ground as a “researcher”, doing something which is a unique and
individual contribution’ (Leech 1997: 10). It is important to note, however, that the
key to successful data-driven learning, even if it is student-centred, is the appropriate
level of teacher guidance or mediation depending on the learners’ age, experience,
and proficiency level, because ‘a corpus is not a simple object, and it is just as easy to
derive nonsensical conclusions from the evidence as insightful ones’ (Sinclair 2004:
2). In this sense, it is even more important for language teachers to be equipped with
the necessary training in corpus analysis (cf. section 6).
Johns (1991) identifies three stages of inductive reasoning with corpora in the DDL
approach: observation (of concordanced evidence), classification (of salient features)
and generalization (of rules). The three stages roughly correspond to Carter and
McCarthy’s (1995) ‘three I’s’. The DDL approach is fundamentally different from the
‘three P’s’ approach in that the former is bottom-up induction whereas the latter is
top-down deduction. The direct use of corpora and concordancing in the language
classroom has been discussed extensively in the literature (e.g. Tribble 1991, 1997a,
1997b, 2000, 2003; Tribble and Jones 1990, 1997; Flowerdew 1993; Karpati 1995;
Kettemann 1995, 1996; Wichmann 1995; Woolls 1998; Aston 2001; Osborne 2001;
Braun 2007), covering a wide range of issues including, for example, underlying
theories, methods and techniques, and problems and solutions.
4. Teaching-oriented corpora
Teaching-oriented corpora are particularly useful in teaching languages for specific
purposes (LSP corpora) and in research on L1 (developmental corpora) and L2
(learner corpora) language acquisition. Such corpora can be used directly or indirectly
in language pedagogy as discussed in previous sections.
4.1. Languages for specific purposes and professional communication
In addition to teaching English as a second or foreign language in general, a great deal
of attention has been paid to domain-specific language use and professional
communication (e.g. English for specific purposes and English for academic purposes).
For example, Thurstun and Candlin (1997, 1998) explore the use of concordancing in
teaching writing and vocabulary in academic English. Hyland (1999) compares the
features of the specific genres of metadiscourse in introductory course books and
research articles on the basis of a corpus consisting of extracts from 21 university
textbooks for different disciplines and a similar corpus of research articles. Upton and
Connor (2001) undertake a ‘moves analysis’ in business English using a business
learner corpus. The authors approach the cultural aspect of professional
communication by comparing the ‘politeness strategies’ used by learners from
different cultural backgrounds. Thompson and Tribble (2001) examine citation
practices in academic text. Koester (2002) argues, on the basis of an analysis of the
performance of speech acts in workshop conversations, for a discourse approach to
teaching communicative functions in spoken English. Yang and Allison (2003) study
the organizational structure in research articles in applied linguistics. Carter and
McCarthy (2004) explore, on the basis of the CANCODE corpus, a range of social
contexts in which creative uses of language are manifested. Hinkel (2004) compares
the use of tense, aspect and the passive in L1 and L2 academic texts. Xiao (2003)
reviews a number of case studies using domain specialized multilingual corpora to
teach domain specific translation. Studies such as these demonstrate that LSP corpora
are particularly useful in teaching language for specific purposes and professional
communication.
4.2. Learner corpora and interlanguage analysis
Two kinds of corpora that emerged in the 1990s have not only greatly contributed to
the vitality of corpus linguistics but have also revived contrastive analysis and
interlanguage research. They are learner corpora and multilingual corpora. This
section discusses learner corpora while the topic of multilingual corpora will be taken
up for further discussion in section 5.1.
The creation and use of learner corpora in language pedagogy and interlanguage
research has been welcomed as one of the most exciting recent developments in
corpus-based language studies. If native speaker corpora of the target language
provide a top-down approach to using corpora in language pedagogy, learner corpora
provide a bottom-up approach to language teaching (Osborne 2002).
A learner corpus, as opposed to a “developmental corpus” composed of data produced
by children acquiring their mother tongue (L1), comprises written or spoken data
produced by language learners who are acquiring a second or foreign language. Data
of this type has been particularly useful in language pedagogy and second language
acquisition (SLA) research, as demonstrated by the fruitful learner corpus studies
published over the past decade (see Pravec 2002; Keck 2004; and Myles 2005 for
recent reviews). SLA research is primarily concerned with ‘the mental representations
and developmental processes which shape and constrain second language (L2)
productions’ (Myles 2005: 374). Language acquisition occurs in the mind of the
learner, which cannot be observed directly and must be studied from a psychological
perspective. Nevertheless, if learner performance data is shaped and constrained by
such a mental process, it at least provides indirect, observable, and empirical evidence
for the language acquisition process. Note that using product as evidence for process
is not necessarily less reliable; sometimes this is the only practical way of finding out about
process. Stubbs (2001) draws a parallel between corpora in corpus linguistics and
rocks in geology, ‘which both assume a relation between process and product. By and
large, the processes are invisible, and must be inferred from the products.’ Like
geologists who study rocks because they are interested in geological processes to
which they do not have direct access, SLA researchers can analyze learner
performance data to infer the inaccessible mental process of second language
acquisition. Learner corpora can also be used as an empirical basis that tests
hypotheses generated using the psycholinguistic approach, and to enable the findings
previously made on the basis of limited data of a small number of informants to be
generalised. Additionally, learner corpora have widened the scope of SLA research so
that, for example, interlanguage research nowadays treats learner performance data in
its own right rather than as decontextualised errors in traditional error analysis (cf.
Granger 1998: 6).
At the pre-conference workshop on learner corpora affiliated to the International
Symposium of Corpus Linguistics 2003 held at the University of Lancaster, the
workshop organizers Yukio Tono and Fanny Meunier observed that learner corpora
are no longer in their infancy but are going through their nominal teenage years – they
are full of promise but not yet fully developed. In language pedagogy, the
implications of learner corpora have been explored for curriculum design, materials
development and teaching methodology (cf. Keck 2004: 99). The interface between
L1 and L2 materials has been explored. Meunier (2002), for example, argues that
frequency information obtained from native speaker corpora alone is not sufficient to
inform curriculum and materials design. Rather, ‘it is important to strike a balance
between frequency, difficulty and pedagogical relevance. That is exactly where
learner corpus research comes into play to help weigh the importance of each of
these’ (Meunier 2002: 123). Meunier also advocates the use of learner data in the
classroom, suggesting that exercises such as comparing learner and native speaker
data and analyzing errors in learner language will help students to notice gaps
between their interlanguage and the language they are learning. Interlanguage studies
based on learner corpora which have been undertaken so far focus on what Granger
(2002) calls ‘Contrastive Interlanguage Analysis (CIA)’, which compares learner data
and native speaker data, or language produced by learners from different L1
backgrounds. The first type of comparison typically aims to identify under- or overuse
of particular linguistic features in learner language while the second type aims to
uncover L1 interference or transfer. In addition to CIA, learner corpora have also been
used to investigate the order of acquisition of particular morphemes. Readers can refer
to Granger et al (2002) for recent work in the use of learner corpora, and read Granger
(2003) for a more general discussion of the applications of learner corpora such as the
International Corpus of Learner English (ICLE).
In addition to SLA research, learner corpora can also be used directly in classroom
teaching. For example, Seidlhofer (2002) and Mukherjee and Rohrbach (2006)
demonstrate how a ‘local learner corpus’ containing students’ own writings can be
used directly for learning by coping with students’ questions about their own or
classmates’ writings, or analyzing and correcting errors in such familiar writings.
We have so far discussed how corpora, including teaching-oriented corpora such as
LSP corpora and learner corpora, can be used directly or indirectly in language
pedagogy. The section that follows seeks to demonstrate the predictive and diagnostic
power of the integrated approach that combines contrastive corpus linguistics with
interlanguage analysis in second language acquisition research as advocated in Römer
(2008), via a case study of passive constructions in Chinese learner English.
5. Using contrastive corpus linguistics to inform SLA research
In this section, we will first clarify the type of corpora used in contrastive corpus
linguistics, which will be followed by a summary of the findings from a published
contrastive study of passive constructions in English and Chinese based on
comparable corpora of the two languages (Xiao, McEnery and Qian 2006). These
findings will in turn be used to predict and diagnose the performance of Chinese
learners of English in their use of English passives as mirrored in a sizeable Chinese
learner English corpus in comparison with a comparable native English corpus.
5.1. Contrastive corpus linguistics
As noted in section 4.2, multilingual corpora have been an important development in
corpus research since the 1990s. A multilingual corpus involves two or more
languages. Data contained in this kind of corpora can be either source texts in one
language plus their translations in another language or other languages, or texts
collected from different native languages using comparable sampling techniques to
achieve similar coverage and balance. The two types of multilingual corpora are
usually referred to as parallel corpora and comparable corpora respectively and used
in translation and contrastive studies.
Contrastive studies can be theoretically oriented or geared towards applied research.
Theoretic contrastive studies are language independent and primarily concerned with
how a universal category is realised in two or more different languages, whilst applied
contrastive studies are preoccupied with how a common category in one language is
realised in another language. In its early stage, contrastive linguistics was
predominantly theoretic, though the applied aspect was not totally neglected.
Theoretically oriented contrastive studies were continued from the late 1920s all the
way into the 1960s by the Prague School. On the other hand, WWII aroused great
interest in foreign language teaching in the United States, and contrastive studies were
recognised as an important part of foreign language teaching methodology (cf. Fries
1945; Lado 1957). As a means of ‘predicting and/or explaining difficulties of second
language learners with a particular mother tongue in learning a particular target
language’ (Johansson 2003), applied contrastive studies were dominant throughout
the 1960s. However, it was soon realised that language learning could not be
accounted for by cross-linguistic contrast alone (see Sajavaara 1996 for a discussion
of some problems with contrastive linguistics), and as a result contrastive studies lost
ground to more learner-oriented approaches such as error analysis, performance
analysis and interlanguage analysis (cf. Johansson 2003). The revival of contrastive
studies in the 1990s has largely been attributed to the corpus methodology and the
availability of multilingual corpora (cf. Granger 1996: 37; Salkie 1999; Johansson
2003).
What kind of corpora can be used in contrastive analysis? To answer this question, we
will first need to have a general idea of purposes of multilingual corpora of various
kinds.
While multilingual corpora, and especially comparable corpora, are designed and
created with the explicit aim of cross-linguistic contrast, all corpora have ‘always
been pre-eminently suited for comparative studies’ (Aarts 1998: i). For example, the
four English corpora of the Brown family (i.e. Brown, LOB, Frown and FLOB; see Xiao
2008: 395-397 for a comparison of these corpora) were created for synchronic and
diachronic comparisons of English as used in Britain and the US in the early 1960s
and the early 1990s, while the Lancaster Corpus of Mandarin Chinese (LCMC) was
designed as a Chinese match for FLOB and Frown to facilitate cross-linguistic
contrasts of English and Chinese (McEnery, Xiao and Mo 2003). The International
Corpus of English (ICE) project has used a common corpus design and the same
sampling criteria for each of its components to ensure their comparability (Nelson
1996); similarly, the International Corpus of Learner English (ICLE) is designed in
such a way that the subcorpora for learners of different L1 backgrounds are
comparable (Granger 1998). Even a corpus like the British National Corpus (BNC),
which was designed to be representative of modern British English (Aston and
Burnard 1998), also provides a useful basis for various intra-lingual comparisons (e.g.
genre-based variations and variations caused by sociolinguistic variables). Clearly,
corpora are intrinsically comparative, and so is the corpus linguistics methodology.
For example, collocations are extracted using statistical measures that compare the
probabilities of co-occurring words within a specified window span of the node word;
keywords are identified by comparing the target corpus with a reference corpus; what
Granger (1998: 12) referred to as Contrastive Interlanguage Analysis (CIA) is also
mainly concerned with comparison, e.g. comparing interlanguage with target native
language, and comparing different interlanguages (in terms of L1 background, age,
proficiency level, task type, learning setting, and medium etc). In short, it can be said
that the whole corpus research enterprise is based on comparison, for example, by
comparing the same linguistic feature in different corpora, comparing different
linguistic features in the same corpus, and comparing what is observed and what is
expected.
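As a rough illustration of the comparative logic behind collocation extraction, the sketch below scores a node word and a candidate collocate by comparing the observed number of co-occurrences within a window span with the number expected by chance. The mutual information formula used here is only one common variant and is an assumption, as are the function name, the toy token list and the window size; none of them is taken from the studies cited above.

```python
import math

def mutual_information(tokens, node, collocate, window=1):
    """Return log2(observed / expected) co-occurrences, or None if never observed."""
    n = len(tokens)
    f_node = tokens.count(node)
    f_coll = tokens.count(collocate)
    observed = 0
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Look at up to `window` tokens on each side of the node word
        lo, hi = max(0, i - window), min(n, i + window + 1)
        observed += sum(1 for j in range(lo, hi) if j != i and tokens[j] == collocate)
    if observed == 0:
        return None
    # Chance expectation: the collocate spread evenly over all window slots
    expected = f_node * f_coll * 2 * window / n
    return math.log2(observed / expected)

# Toy data, invented for illustration only
tokens = ("strong tea and strong coffee but powerful computers "
          "and powerful engines with strong tea again").split()

print(mutual_information(tokens, "strong", "tea"))
print(mutual_information(tokens, "strong", "and"))
print(mutual_information(tokens, "powerful", "tea"))
```

In this toy sample the pair strong + tea scores higher than strong + and, while powerful + tea never co-occurs and so receives no score. Real collocation work would of course use a full corpus and usually a frequency threshold as well.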
While corpus linguistics is clearly comparative in nature, the technical terms for
corpora used in linguistic comparison are somewhat confusing, with the controversy
revolving around the issue of whether a parallel corpus should be a corpus composed
of source texts plus translations, or a corpus containing native language data collected
using comparable sampling criteria. As we have argued elsewhere (McEnery et al
2006: 47), a parallel corpus is composed of source texts and their translations, whilst a
comparable corpus contains L1 texts sampled from different languages which are
comparable in sampling criteria. A translation corpus, instead of referring to what is
actually a parallel corpus as suggested in the literature, comprises translated texts for
use in studies of translational language (e.g. the Translational English Corpus).
Corpora which are designed primarily for intra-lingual comparison or for comparing
different varieties of the same language (e.g. the ICE) are comparative corpora.
Having clarified the terminology, we can now discuss what types of corpora
are to be used in cross-linguistic contrasts. This is in fact an issue which is as
debatable as the terminological issue. It has been argued that parallel corpora provide
a sound basis for contrastive analysis, as demonstrated in the claims that ‘translation
equivalence is the best available basis of comparison’ (James 1980: 178), and that
‘studies based on real translations are the only sound method for contrastive analysis’
(Santos 1996: i). However, as has been widely observed (Baker 1993: 243-5;
Hartmann 1995; Gellerstam 1996; Teubert 1996: 247; Laviosa 1997: 315; McEnery
and Wilson 2001: 71-72; McEnery and Xiao 2002; Xiao and Yue 2009; Xiao, He and
Yue forthcoming), translational language is ‘an unrepresentative special variant of the
target language’ which is perceptibly influenced by the source language (McEnery,
Xiao and Tono 2006: 93). The source texts and translations in a parallel corpus are
certainly comparable in terms of sampling criteria such as genre (in fact, sampling
applies only to the selection of source texts and is not repeated for translations), but this
comparability is immediately undermined by so-called ‘translationese’ in translated
texts. For example, Laviosa (1998) finds that translational language has four core
patterns of lexical use: a relatively lower proportion of lexical words over function
words, a relatively higher proportion of high-frequency words over low-frequency
words, a relatively greater repetition of the most frequent words, and less variety in
the words that are most frequently used. Beyond the lexical level, translational
language is characterised by normalization, simplification (Baker 1993), explicitation
(i.e. increased cohesion, Øverås 1998), and sanitization (i.e. reduced connotational
meanings, Kenny 1998). In addition to these common features of translational
language, Granger (1996) has noted some similarity between translationese and what
she calls ‘learnerese’: ‘Both are situated somewhere between L1 and L2 and are likely
to contain examples of transfer’, and both ‘give evidence of what Gellerstam (1986:
94) calls “syntactic fingerprints”’ (Granger 1996: 48).
As observations resulting from parallel corpus analysis usually invite ‘further research
with monolingual corpora in both languages’ (Mauranen 2002: 182), parallel corpora
can be a useful starting point of contrastive analysis. Nevertheless, it is also clear from
the discussion above that while they are ideal resources for translation studies (see
McEnery and Xiao 2007 for further discussion), parallel corpora provide a poor basis
for cross-linguistic contrasts if relied upon alone.
In the section that follows, we will present the findings of a contrastive study of
passive constructions in English and Chinese on the basis of comparable written and
spoken corpora of the two languages, which will be used to predict and diagnose what
is observed in Chinese learner English.
5.2. Passive constructions in English and Chinese
This section summarises the results of a contrastive corpus analysis of passive
constructions on the basis of comparable corpora of English and Chinese, which was
published in Xiao, McEnery and Qian (2006). The primary corpus resources used in
that study included FLOB for written English and LCMC for written Chinese,
together with spoken corpora composed of transcripts for casual conversations in the
two languages. In addition, two spoken corpora of sampling period similar to FLOB
and LCMC were used to compare speech and writing. For English we used the
demographically sampled component of the British National Corpus, amounting to
approximately four million words of conversational data sampled during 1985-1994.
For Chinese we used the Callhome Mandarin Chinese Transcript corpus, which
contains 120 transcripts of telephone conversations amounting to roughly 300,000
words (see McEnery and Xiao 2008).
Our corpus-based contrastive study yields a number of interesting findings. Below we
will only give a summary of the results that are most relevant to our discussion of the
performance of Chinese learners of English in the following section.
Firstly, passive constructions are nearly ten times as frequent in English as in Chinese,
with normalised frequencies of 1,026 and 110 instances per million words for the two
languages respectively. There are a number of reasons for this contrast. First,
be-passives can be used for both stative and dynamic situations whereas Chinese passives
can only occur in dynamic events; second, Chinese passives usually have a negative
pragmatic meaning while English passives (especially be-passives) do not; third,
English has a tendency to overuse passives, especially in formal writing, whereas
Chinese tends to avoid syntactic passives wherever possible; fourth, Chinese has a number of
linguistic devices other than the syntactically marked passive constructions to express
a passive meaning, e.g. notional passives, lexical passives, topic sentences, subjectless
sentences, sentences with vague subjects (e.g. youren ‘someone’, renmen ‘people’,
dajia ‘all’), and special structures such as the disposal ba construction and the
predicative shi…de structure. Finally, syntactically unmarked notional passives are
more common in Chinese than in English because English is a subject-oriented
language whereas Chinese is topic oriented. Given that Chinese passives are much
more restricted in scope of use, their low frequency in relation to their English
counterparts is unsurprising. It can be predicted from this sharp contrast in frequency
of use that Chinese learners of English are very likely to underuse passives in their
interlanguage.
Secondly, passives are formed by an auxiliary (be, get) followed by a past participial
verb in English whilst in Chinese they can be marked syntactically by passive markers
such as bei, indicated lexically by verbs with an inherent passive meaning (e.g. zao
‘suffer’), or simply expressed by unmarked notional passives or special sentence
structures. Unlike English, which inflects the passivised verb morphologically,
Chinese is non-inflectional, which means that the same verb form is used for both
active and passive voices in Chinese. Also because of the non-inflectional Chinese
morphology, the concept of auxiliary is less salient or useful in Chinese. These
cross-linguistic differences suggest that the choice of correct auxiliaries as well as
proper inflectional forms for passivised verbs can be a difficult area for
Chinese learners acquiring English passives.
Thirdly, short passives (i.e. passives without a by-phrase introducing an agent) are
typical of English, accounting for over 90% of total occurrences in both speech and
writing. Short passives are predominant in English simply because passives are often
used in English as a strategy that allows one to avoid mentioning the agent when it
cannot or must not be mentioned, while they are also used for stylistic and coherence
purposes (see Granger 1976 and 1983 for further discussion of uses of passives). In
contrast, three out of five syntactic passive markers in Chinese (wei…suo, jiao and
rang) only occur in long passives (i.e. passives with an explicit agent). For the two
remaining passive markers bei and gei, which allow both long and short passives, the
proportions of short passives (60.7% and 57.5% respectively) are significantly lower
than that for English passives. Early Chinese grammarians (e.g. Wang 1984; Lü and
Zhu 1979) noted that an agent must normally be spelt out in passive constructions,
though this constraint has become more relaxed nowadays. When it is difficult to spell
out the agent, passives are used in English, but an alternative device mentioned in the
preceding paragraph is often used in Chinese instead of using passives. This finding
can lead one to expect more long passives in the interlanguage of Chinese learners of
English.
[Figure 1: stacked bar chart, percentage by language. English be passives: positive 4.7%, neutral 80.3%, negative 15.0%; Chinese bei passives: positive 10.7%, neutral 37.8%, negative 51.5%]
Figure 1. Pragmatic meanings of be and bei passives
Finally, a major distinction between passives in English and Chinese is that Chinese
passives are more frequently used with an inflictive meaning than their English
counterparts. With the exception of the archaic passive form wei…suo, over half of
syntactically marked passives in Chinese occur in adversative situations, a proportion
considerably higher than that for English passives (see Figure 1). As the prototypical
passive marker bei was derived from a verb with an inflictive meaning (i.e. ‘suffer’),
Chinese passives were used at early stages primarily for unpleasant or undesirable
events. While this semantic constraint on the use of passives has become more
relaxed, especially in written Chinese, under the influence of western languages,
disyllabic words made up of bei and a single character verb as used in modern
Chinese typically refer to something undesirable, as in beibu ‘be arrested’, beifu ‘be
captured’, beigao ‘the accused’, beihai ‘be a victim’ and beipo ‘be forced’. In
contrast, marking negative pragmatic meanings is not a basic feature of English
passives, though get-passives often refer to undesirable events. An essential difference
between English and Chinese passives lies in how much negativity is coded in them,
which predicts that Chinese learners of English will use passives more frequently for
undesirable situations.
In the next section, we will analyze the use of passives in a Chinese learner English
corpus to ascertain how reliably the findings of our contrastive study as summarized
in this section can predict and diagnose learner behaviour in interlanguage.
5.3. Passive constructions in Chinese learner English
This section examines be passives in Chinese learner English. The corpus used is the
Chinese Learner English Corpus (CLEC), which contains one million words of essays
written by Chinese learners at five proficiency levels: high school students (ST2),
junior and senior non-English majors (ST3 and ST4), and junior and senior English
majors (ST5 and ST6). The five types of learners are equally represented in the
corpus. The corpus is fully annotated with learner errors using an error tagset that
consists of 61 error types clustered in 11 categories (see Gui and Yang 2002). In order
to compare Chinese learners’ interlanguage with native English, the Louvain Corpus
of Native English Essays (LOCNESS) is used as the control data, which is composed
of argumentative essays written by native British and American students on a great
variety of topics, totalling approximately 300,000 words (cf. Granger and Tyson
1996).
Table 1. Passives in CLEC and LOCNESS
Corpus     Words       Passives   Per 100,000 words   LL score
CLEC       1,070,602   9,711      907                 1235.6 (p<0.001)
LOCNESS    324,304     5,465      1,685
A comparison of CLEC and LOCNESS shows that in relation to native English
writing, Chinese learners of English significantly underuse passives in their
interlanguage. Table 1 gives the raw frequencies of passive constructions in the two
corpora as well as the frequencies normalised to a common base of 100,000 words.
As can be seen, passives are nearly twice as frequent in native English as in Chinese
learner English. The log-likelihood test (LL) indicates that this difference is
statistically significant (LL=1235.6 for 1 degree of freedom, p<0.001). The significant
underuse of passives in Chinese learner English is hardly surprising in light of the
marked contrast in frequencies for passives in English and Chinese as noted in section
5.2. Granger (1996: 46) also expected French learners of English to underuse passives
in their writing as it was noted that passives were twice as frequent in English as in
French (see Granger 1976), but she did not verify this prediction against French
learner English data. While Chinese learners’ underuse of passives as mirrored in the
CLEC corpus is very likely to be caused by the influence of their native language,
more cross-linguistic contrasts and interlanguage studies involving learners from
other L1 backgrounds are required before we can be more confident that underuse of
passives is the result of L1 transfer rather than a common feature of interlanguages,
irrespective of the learner’s mother tongue, which would mean that learners underuse
passives for developmental reasons. As Granger (2007) observes, while native
English speakers mainly use the verb discuss in the passive, ‘learners show a
predilection for active structures with first person subjects.’
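The comparison in Table 1 can be sketched in a few lines of Python. The example below normalises the raw counts and computes a log-likelihood (G2) statistic over a two-by-two contingency table. Dividing the raw counts by the corpus sizes gives 907 and 1,685 per 100,000 words, which is the base used in the code. The function names and the choice of the contingency-table form of the statistic are assumptions; the exact formula and token counts behind the reported LL score are not stated here, so this sketch is not guaranteed to reproduce the figure of 1235.6 exactly.

```python
import math

def per_base(count, corpus_size, base=100_000):
    """Frequency normalised to a common base (here 100,000 words)."""
    return count / corpus_size * base

def log_likelihood(table):
    """G2 statistic for an r x c contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if observed > 0:  # a zero cell contributes nothing
                g2 += observed * math.log(observed / expected)
    return 2 * g2

# Raw figures from Table 1
clec_words, clec_passives = 1_070_602, 9_711
locness_words, locness_passives = 324_304, 5_465

clec_rate = per_base(clec_passives, clec_words)
locness_rate = per_base(locness_passives, locness_words)

# Rows: corpora; columns: passive tokens vs. all remaining words.
# Treating every non-passive word as the second column is an assumption.
table = [
    [clec_passives, clec_words - clec_passives],
    [locness_passives, locness_words - locness_passives],
]
ll = log_likelihood(table)

CRITICAL_1DF_P001 = 10.83  # chi-square critical value, 1 d.f., p = 0.001
print(f"CLEC: {clec_rate:.0f}, LOCNESS: {locness_rate:.0f} per 100,000 words")
print(f"LL = {ll:.1f}, significant at p<0.001: {ll > CRITICAL_1DF_P001}")
```

Because the function accepts any table of observed counts, the same sketch could be applied to other comparisons of this kind, such as long versus short passives across two corpora, provided the raw counts are available.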
[Figure 2: stacked bar chart showing the percentages of short and long passives in CLEC and LOCNESS]
Figure 2. Long and short passives in CLEC and LOCNESS
The results of the contrastive analysis in section 5.2 predicted that Chinese learners
would use long passives more frequently than native English speakers. Figure 2
shows the proportions of long and short passives in CLEC and LOCNESS. It can be
seen that in comparison with native English writings, long passives are indeed slightly
more frequent in Chinese learner English (9.14% and 8.44% for CLEC and
LOCNESS respectively), though this difference is marginal and not statistically
significant (LL=2.18 for 1 degree of freedom, p=0.139).
[Figure 3: stacked bar chart, percentage by corpus. CLEC: positive 5.9%, neutral 68.4%, negative 25.7%; LOCNESS: positive 4.4%, neutral 78.8%, negative 16.8%]
Figure 3. Pragmatic meanings of passives in CLEC and LOCNESS
It was noted earlier that over 50% of passives in Chinese express an inflictive
meaning whereas the corresponding figure for be passives in English is merely 15%.
Such a contrast would reasonably lead one to expect more negative cases in Chinese
learner English than in native English. This expectation is in fact supported by
evidence from CLEC and LOCNESS. Figure 3 shows that 25.7% of passives in the
Chinese learner English data are negative whilst negative cases account for 16.8% in
native English writings. The log-likelihood test indicates that the differences between
CLEC and LOCNESS in the three meaning categories are statistically significant
(LL=7.4 for 2 degrees of freedom, p=0.025). A comparison of Figures 1 and 3
suggests that the proportions for the three meaning categories for the two types of
native English data (i.e. general English and students’ essays) are very close to each
other. In contrast, the proportions in Chinese learner English shift away from those for
L1 Chinese and move closer to the proportions for L2 English. Given that
interlanguage is ‘situated somewhere between L1 and L2’ (Granger 1996: 48), this
movement is only reasonable and as expected.
An inspection of the specific errors related to the use of passive constructions in
CLEC also demonstrates the value of contrastive corpus linguistics in SLA research.
There are mainly four types of passive-related learner errors: underuse, misuse,
misformation, and auxiliary errors. It can be considered as an advantage of the
corpus-based approach to be able to view underuse or overuse of a linguistic feature
in interlanguage as a type of learner error, as this was not possible in traditional error
analysis without corpus data. Misuse of passives means that learners use passive
constructions where they are not supposed to use them. Misformation errors are
associated with morphological inflections, while auxiliary errors relate to omission
and misuse of auxiliaries in passive constructions.
[Figure 4: line chart. x-axis: learner level (ST2 to ST6); y-axis: frequency per 200,000 words (0 to 250); series: auxiliary errors, misformation, misuse, underuse, all error types]
Figure 4. Passive-related errors in Chinese learner English
Figure 4 charts the distribution of four types of errors, as well as all error types as a
whole, across learner proficiency levels. Unsurprisingly, when all error types are
taken together, learners at higher levels generally make fewer errors related to
passives. Of the four types of learner errors, underuse is the most important type,
followed by misuse and misformation errors. Auxiliary errors are uncommon for
learner groups other than the lowest level ST2 (i.e. high school students). It is also
clear from the figure that learning curves are not straight lines. There can be relapses
in the language acquisition process, especially for difficult items.
It is of interest to note that while error types are associated with learner levels when
the dataset is taken as a whole (LL=51.77 for 12 degrees of freedom, p<0.001),
similar learner groups show similar error types. This means that the differences
between the two non-English-major learner groups (i.e. ST3 and ST4), and between
the two English-major learner groups (i.e. ST5 and ST6) are not statistically
significant, as indicated in Table 2. The table gives the log-likelihood test scores and
probability values (3 degrees of freedom for all pairs of data), with significant
differences highlighted. Hence, Chinese learners can be divided into three broad
groups in terms of their acquisition of English passives: ST2 – ST3/ST4 – ST5/ST6.
Table 2. Association between error types and learner levels
From   To     LL score (3 d.f.)   P value
ST2    ST3    27.303              <0.001 *
ST3    ST4    6.955               0.073
ST4    ST5    18.563              <0.001 *
ST5    ST6    6.987               0.072
* significant difference
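The step from LL scores to probability values, and from there to the three-way grouping of learner levels, can be sketched as follows. The example treats the LL score as approximately chi-square distributed and uses a closed-form upper-tail probability for integer degrees of freedom, implemented with the Python standard library only. The function name chi2_sf and the 0.05 significance threshold are assumptions made for illustration; they are not taken from the study itself.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution (integer df)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i < df/2
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer gamma terms
    total = 0.0
    for i in range(1, (df - 1) // 2 + 1):
        total += half ** (i - 0.5) / math.gamma(i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

# LL scores for adjacent learner levels, from Table 2 (3 d.f. each)
pairs = [("ST2", "ST3", 27.303), ("ST3", "ST4", 6.955),
         ("ST4", "ST5", 18.563), ("ST5", "ST6", 6.987)]

ALPHA = 0.05  # assumed significance threshold
groups = [["ST2"]]
for lower, upper, ll in pairs:
    p = chi2_sf(ll, 3)
    print(f"{lower} vs {upper}: LL={ll}, p={p:.3f}")
    if p < ALPHA:
        groups.append([upper])    # significant difference: start a new group
    else:
        groups[-1].append(upper)  # no significant difference: same group

print(groups)
```

Under these assumptions, adjacent levels that do not differ significantly are merged, which yields the grouping ST2, ST3/ST4 and ST5/ST6 described above.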
While we cannot say conclusively whether the underuse of passives by Chinese
learners of English is a result of L1 transfer or a stage of the developmental path,
errors of this type in our learner data typically occur with verbs whose Chinese
equivalents are not normally used in passive constructions, as shown in (1).
(1) a. A birthday party will hold in Lily’s house. (ST2)
    b. The woman in white called Anne Catherick. (ST5)
(2) a. The supper had done. (ST2)
    b. wanfan zuo-hao le
       supper cook-ready ASP
       ‘The supper is ready.’
Underuse errors also occur under the influence of topic sentences in Chinese, as
exemplified in (2a), which is expressed in Chinese as (2b). The Chinese example in
(2b) is an instance of topic sentence, which is very common in this language. Here
wanfan ‘supper’ in the subject position is the topic and zuo-hao le ‘cook-ready ASP’
is the comment. Sentences like this cannot be used in the passive felicitously (e.g.
*wanfan bei zuo-hao le).
Misuse errors are mostly found in three contexts. Firstly, they occur when intransitive
verbs are passivised (e.g. 3); secondly, errors of this type are related to the misuse of
ergative verbs (e.g. 4); and finally, misuse errors can be a result of training transfer,
i.e. excessive passive training in classroom instruction, as shown in (5). In sentences
like these, the passivised verb is followed by an object, yet Chinese learners have
been taught that passive transformation involves moving the object to the subject
position. This can be taken as a symptom of the overdone passive training in English
classrooms in China.
(3) a. A very unhappy thing was happened in this week. (ST2)
    b. I was graduated from Zhongshan University. (ST5)
(4) a. the secince <sic science> is developed quickly (ST4)
    b. infant mortality was declined (ST4)
(5) a. Because they have been mastered everything of this job (ST4)
    b. many machine and appliance are used electricity as power (ST5)
Misformation errors are a result of L1 interference. As noted in section 5.2, passivised
verbs do not inflect in Chinese. Consequently, Chinese learners of English tend to use
uninflected verbs or misspelled past participles in passive constructions, as
exemplified in (6).
(6) a. His relatives can not stop him, because his choice is protect by the
       laws. (ST6)
    b. Since the People’s Republic of china <sic China> was found on
       October 1949, great changes <…> (ST2)
(7) a. In China, since the new China established, people’s life has goten <sic
       gotten> better and better. (ST3)
    b. I am not a smoker, but why do we forced to be a second-hand smoker?
       (ST5)
Auxiliary errors, the final type of passive errors in our annotation scheme, are also the
result of L1 interference. We noted earlier that while passives in Chinese can be
marked syntactically, lexical passives, unmarked notional passives and topic
sentences that express a passive meaning are abundant. As such, it is hardly surprising
that Chinese learners of English tend to omit or misuse auxiliaries, as shown in (7).
The discussion in this section suggests that the performance of Chinese learners of
English in their use of English passives is closely linked to their native language; and
most of the passive-related errors in their interlanguage can be accounted for from the
perspective of contrastive corpus linguistics. In the following section, we will discuss
the implications of this study for SLA research.
5.4. Modelling contrastive interlanguage analysis
We hope that the case study has demonstrated the predictive and explanatory power
of contrastive corpus linguistics in SLA research. Combining contrastive analysis
(CA) and contrastive interlanguage analysis (CIA) is undoubtedly a fruitful direction
to pursue in SLA research. This is not a new idea. As early as a decade ago, Granger
(1996: 46) proposed an ‘integrated contrastive model’:
The model involves constant to-ing and fro-ing between CA and CIA. CA
data helps analysts to formulate predictions about interlanguage which can
be checked against CIA data. […] Conversely, CIA results can only be
reliably interpreted as being evidence of transfer if supported by clear CA
descriptions.
Just as CIA has contributed significantly to SLA research by enabling and
foregrounding many areas of investigation which have traditionally been impossible
or marginalized (e.g. quantitatively distinctive features of interlanguage such as
overuse and underuse, the potential effects of learner parameters on interlanguage),
the integrated approach that combines CA and CIA will be an indispensable tool in
SLA research, because ‘if we want to be able to make firm pronouncements about
transfer-related phenomena, it is essential to combine CA and CIA approaches’
(Granger 1998: 14).
This emerging and promising area of research has recently become popular. For
example, Gilquin (2001) demonstrates, on the basis of a case study of causative
constructions in English and French, how the integrated contrastive model can help
explain some of the characteristics of learners’ interlanguage and thus throw new light
on the key notion of transfer, which turns out to be a more complex phenomenon than
has traditionally been assumed. Similarly, Borin and Prütz (2004) use the integrated
contrastive approach to explore L1 syntactic interference in advanced Swedish learner
English by investigating part-of-speech sequences. The increasing interest in the
integrated approach is also demonstrated by the specialised workshop ‘Linking up
Contrastive and Learner Corpus Research’, which was affiliated to the 4th
International Contrastive Linguistics Conference.
We entirely agree with Granger (1996, 1998) that a combination of corpus-based
contrastive study and interlanguage analysis can provide insights into language
acquisition research, but we take a different view of the role of parallel corpora (or
‘translation corpora’ in her words) in cross-linguistic contrasts, for the reasons
outlined earlier in section 5.1. While Granger (1996: 38, 48) is fully aware of the
drawback of using translated texts in contrastive analysis, her examples are largely
based on data of this kind. In our revised CIA model, therefore, contrastive corpus
linguistics interacts with interlanguage analysis on the basis of comparable native
language corpora as illustrated in Figure 5.
Figure 5. A revised model of contrastive interlanguage analysis
It is true that using a bidirectional parallel corpus can average out, to some extent at
least, the undesirable effects of translationese on contrastive analysis. To achieve this
aim, however, the same sampling criteria must apply to the selection of source texts in
both languages, because any mismatch of proportion, genre, or domain, for example,
may invalidate the findings derived from such a corpus (McEnery, Xiao and Tono
2006: 93). A well-matched bidirectional parallel corpus is in fact a mixture of parallel
corpus and comparable corpus, which can become a bridge that brings translation and
contrastive studies together. Yet the ideal bidirectional parallel-comparable corpus
will often not be easy, or even possible, to build because of the heterogeneous pattern
of translation between languages and genres. This is especially true if the corpus aims
to achieve sufficient coverage and balance to produce convincing findings (McEnery
and Xiao 2007). Hence, in our approach, comparable native language data is preferred
in contrastive corpus linguistics. Other kinds of corpora for comparative studies such
as parallel corpora, translational corpora, and comparative corpora are best suited for
their own different purposes. Nevertheless, in spite of some differences in the type of
data used, there has been increasing consensus that contrastive corpus linguistics has
something to deliver in second language acquisition research.
6. Conclusions
Before we close the discussion of using corpora in language pedagogy, it is
appropriate to address some objections to the use of corpora in language learning and
teaching. While frequency and authenticity are often considered two of the most
important advantages of using corpora, they are also the locus of criticism from
language pedagogy researchers. For example, Cook (1998: 61) argues that corpus data
impoverishes language learning by giving undue prominence to what is simply
frequent at the expense of rarer but more effective or salient expressions. Widdowson
(1990, 2000) argues that corpus data is authentic only in a very limited sense in that it
is de-contextualized (i.e. traces of texts rather than discourse) and must be recontextualized in language teaching. It can also be argued that:
on the contrary, using corpus data not only increases the chances of learners
being confronted with relatively infrequent instances of language use, but
also of their being able to see in what way such uses are atypical, in what
contexts they do appear, and how they fit in with the pattern of more
prototypical uses. (Osborne 2001: 486)
This view is echoed by Goethals (2003: 424), who argues that ‘frequency ranking will
be a parameter for sequencing and grading learning materials’ because ‘frequency is a
measure of probability of usefulness’ and ‘high-frequency words constitute a core
vocabulary that is useful above the incidental choice of text of one teacher or textbook
author.’ Hunston (2002: 194-195) observes that ‘items which are important though
infrequent seem to be those that echo texts which have a high cultural value’, though
in many cases ‘cultural salience is not clearly at odds with frequency.’ While
frequency information is readily available from corpora, no corpus linguist has ever
argued that the most frequent is most important. On the contrary, Kennedy (1998:
290) argues that frequency ‘should be only one of the criteria used to influence
instruction’ and that ‘the facts about language and language use which emerge from
corpus analyses should never be allowed to become a burden for pedagogy’. As such,
raw frequency data is often adjusted for use in a syllabus, as reported in Renouf
(1987: 168). It would be inappropriate, therefore, for language teachers, syllabus
designers, and materials writers to ignore ‘compelling frequency evidence already
available’, as pointed out by Leech (1997: 16), who argues that:
Whatever the imperfections of the simple equation ‘most frequent’ = ‘most
important to learn’, it is difficult to deny that frequency information
becoming available from corpora has an important empirical input to
language learning materials.
Kaltenböck and Mehlmauer-Larcher (2005: 78) downplay the role of frequency in
language learning, arguing that ‘what is frequent in language will be picked up by
learners automatically, precisely because it is frequent, and therefore does not have to
be consciously learned.’ This is not true, however. Determiners such as a and the are
certainly very frequent in English, yet they are difficult for Chinese learners of
English because their mother tongue does not have such grammatical morphemes and
does not maintain a count-mass noun distinction.
Clearly, frequency is not ‘automatically pedagogically useful’ (Kaltenböck and
Mehlmauer-Larcher 2005: 78); decisions relating to teaching must also take account
of overall teaching objectives, learners’ concrete situations, cognitive salience,
learnability, generative value and of course teachers’ intuitions (cf. Kaltenböck and
Mehlmauer-Larcher 2005: 78). However, frequency can at least help syllabus
designers, materials writers and teachers alike to make better-informed and more
carefully motivated decisions (cf. Gavioli and Aston 2001: 239).
If we leave objections to frequency data to one side, Widdowson (1990, 2000) also
questions the use of authentic texts in language teaching. In his opinion, authenticity
of language in the classroom is ‘an illusion’ (1990: 44) because even though corpus
data may be authentic in one sense, its authenticity of purpose is destroyed by its use
with an unintended audience of language learners (see Murison-Bowie 1996: 189).
Widdowson (2003: 93) makes a distinction between ‘genuineness’ and ‘authenticity’,
which are claimed to be the features of text as a product and discourse as a process
respectively: corpora are genuine in that they comprise attested language use, but they
are not authentic for language teaching because they have been deprived of their
contexts (as opposed to co-texts). We will not engage in the debate here, but would like to
draw readers’ attention to Stubbs’ (2001) metaphor of product versus process as cited
in section 4.2. The implication of Widdowson’s argument is that only language
produced for imaginary situations in the classroom is ‘authentic’. Even if we do
follow Widdowson’s genuineness-authenticity distinction, it is not clear why such
imaginary situations are authentic because authenticity, as opposed to genuineness,
would mean a real communicative context. Situations conjured up for classroom
teaching obviously do not take place in real communicative contexts, so how can they
be authentic if we choose to keep this distinction? When students learn and practise a
shopping ‘discourse’, they are by no means actually doing any shopping! Furthermore, as
argued by Fox (1987), invented examples often do not reflect nuances of usage. That
is perhaps why, as Mindt (1996: 232) observes, students who have been taught
‘school English’ cannot readily cope with English used by native speakers in real life.
As such, Wichmann (1997: xvi) argues that in language teaching, ‘the preference for
“authentic” texts requires both learners and teachers to cope with language which the
textbooks do not predict.’
The discussions in sections 2-4 suggest that corpora appear to have played a more
important role in helping to decide what to teach (indirect uses) than how to teach
(direct uses). While indirect uses of corpora seem to be well established, direct uses of
corpora in teaching are largely confined to advanced levels like higher education.
Corpus-based learning activities are nearly absent from general TEFL classes at lower
levels like secondary education. Of the various causes for this absence mentioned
earlier, perhaps the most important are the access to appropriate corpus resources and
the necessary training of teachers, which we view as priority tasks for corpus
linguists if corpora are to be popularised in general language teaching contexts.
While there are a wide range of existing corpora that are publicly available (see Xiao
2008 for a recent survey), the majority of those resources have been developed ‘as
tools for linguistic research and not with pedagogical goals in mind’ (Braun 2007). As
Cook (1998: 57) suggests, ‘the leap from linguistics to pedagogy is […] far from
straightforward.’ To bridge the gap between corpora and language pedagogy, the first
step would involve creating corpora that are pedagogically motivated, in both design
and contents, to meet pedagogical needs and curricular requirements so that corpus-based learning activities become an integral part, rather than an additional option, of
the overall language curriculum. Such pedagogically motivated corpora ‘should not
only be more coherent than traditional corpora; they should, as far as possible, also be
complementary to school curricula, to facilitate both the contextualisation process and
the practical problems of integration’ (Braun 2007: 310). The design of such corpus-based learning activities must also take account of learners’ age, experience and level
as well as their integration into the overall curriculum.
Given the situation of learners (e.g. their age, level of language competence, level of
expert knowledge, and attitude towards learning autonomy) in general language
education in relation to advanced learners in tertiary education, even such
pedagogically motivated corpus study activities must be mediated by teachers. This in
turn raises the issue of the current state of teachers’ knowledge and skills of corpus
analysis and data interpretation, which is another practical problem that has prevented
direct use of corpora in language pedagogy. As Kaltenböck and Mehlmauer-Larcher
(2005: 81) argue, ‘mediation by the teacher is a necessary prerequisite for successful
application of computer corpora in language teaching and should therefore be given
sufficient attention in teacher education courses’ (cf. also O’Keeffe and Farr 2003).
However, as the integration of corpus studies into language teacher training is only a quite
recent phenomenon (cf. Chambers 2007), ‘it will therefore at least take more time,
and perhaps a new generation of teachers, for corpora to find their way into the
language classroom’ (Braun 2007: 308).
In conclusion, it is our view that corpora will not only revolutionize the teaching of
subjects such as grammar in the 21st century (see Conrad 2000), they will also
fundamentally change the ways we approach language education, including both what
is taught and how it is taught. As Gavioli and Aston (2001) argue, corpora should not
only be viewed as resources which help teachers to decide what to teach, they should
also be viewed as resources from which learners may learn directly.
References:
Aarts, J. (1998) ‘Introduction’. In S. Johansson and S. Oksefjell (eds.) Corpora and
Cross-linguistic Research. Amsterdam: Rodopi. ix-xiv.
Aijmer, K. (2009) Corpora and Language Teaching. Amsterdam: John Benjamins.
Alderson, C. (1996) ‘Do corpora have a role in language assessment?’ in J. Thomas
and M. Short (eds.) Using Corpora for Language Research, pp. 248-259. London:
Longman.
Allan, Q. (1999) ‘Enhancing the language awareness of Hong Kong teachers through
corpus data’. Journal of Technology and Teacher Education 7/1: 57-74.
Allan, Q. (2002) ‘The TELEC secondary learner corpus: a resource for teacher
development’ in S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner
Corpora, Second Language Acquisition and Foreign Language Teaching, pp.
195–212. Philadelphia: John Benjamins.
Altenberg, B. and Granger, S. (2001) ‘The grammatical and lexical patterning of
MAKE in native and non-native student writing.’ Applied Linguistics 22/2: 173-95.
Aston, G. (1995) ‘Corpora in language pedagogy: matching theory and practice’ in G.
Cook and B. Seidlhofer (eds.) Principle and Practice in Applied Linguistics:
Studies in Honour of H. G. Widdowson. Oxford: Oxford University Press.
Aston, G. (ed.) (2001) Learning with Corpora. Houston, TX: Athelstan.
Aston, G., Bernardini, S. and Stewart, D. (eds.) (2004) Corpora and Language
Learners. Amsterdam: John Benjamins.
Aston, G. and Burnard, L. (1998) The BNC Handbook: Exploring the British National
Corpus with SARA. Edinburgh: Edinburgh University Press.
Bahns, J. (1993) ‘Lexical collocations: a contrastive view’. ELT Journal 47/1: 56-63.
Baker, M. (1993) ‘Corpus linguistics and translation studies: implications and
applications’. In M. Baker, G. Francis and E. Tognini-Bonelli (eds.) Text and
Technology: in Honour of John Sinclair. Amsterdam: Benjamins. 233-352.
Ball, F. (2001) ‘Using corpora in language testing’. Research Notes 6: 6-8.
Ball, F. (2002) ‘Developing wordlists for BEC’. Research Notes 8: 10-13.
Ball, F. and Wilson, J. (2002) ‘Research projects relating to YLE Speaking Tests’.
Research Notes 7: 8-10.
Bernardini, S. (2000) Competence, Capacity, Corpora: A Study in Corpus-aided
Language Learning. Bologna: CLUEB.
Biber, D., Conrad, S. and Reppen, R. (1998) Corpus Linguistics: Investigating
Language Structure and Use. Cambridge: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman
Grammar of Spoken and Written English. London: Longman.
Biber, D., Leech, G. and Conrad, S. (2002) Longman Student Grammar of Spoken and
Written English. London: Longman.
Borin, L. and Prütz, K. (2004) ‘New wine in old skins? A corpus investigation of L1
syntactic transfer in learner language’. In G. Aston, S. Bernardini and D. Stewart
(eds.) Corpora and Language Learners. Amsterdam: John Benjamins. 67–87.
Braun, S. (2007) ‘Integrating corpus work into secondary education: From data-driven
learning to needs-driven corpora’. ReCALL 19(3): 307-328.
Braun, S., Kohn, K. and Mukherjee, J. (eds.) (2006) Corpus Technology and
Language Pedagogy. Frankfurt: Peter Lang.
Burnard, L. and McEnery, A. (eds.) (2000) Rethinking Language Pedagogy from a
Corpus Perspective. New York: Peter Lang.
Campoy, M., Gea-Valor, M. and Belles-Fortuno, B. (2010) Corpus-based
Approaches to English Language Teaching. London: Continuum.
Carter, R. and McCarthy, M. (1995) ‘Grammar and the spoken language’. Applied
Linguistics 16/2: 141-158.
Carter, R. and McCarthy, M. (2004) ‘Talking, creating: interactional language,
creativity, and context’. Applied Linguistics 25/1: 62-88.
Chambers, A. (2007) ‘Popularising corpus consultation by language learners and
teachers’. In E. Hidalgo, L. Quereda, and J. Santana (eds) Corpora in the Foreign
Language Classroom: Selected Papers from the Sixth International Conference on
Teaching and Language Corpora (TaLC 6), pp. 3–16. Amsterdam: Rodopi.
Choi, I., Kim, K. and Boo, J. (2003) ‘Comparability of a paper-based language test
and a computer-based language test’. Language Testing 20/3: 295–320.
Coniam, D. (1997) ‘A preliminary inquiry into using corpus word frequency data in
the automatic generation of English language cloze tests’. CALICO Journal 16/2-4: 15-33.
Connor, U. and Upton, T. (eds) (2002) Applied Corpus Linguistics: A
Multidimensional Perspective. Amsterdam: Rodopi.
Conrad, S. (1999) ‘The importance of corpus-based research for language teachers’.
System 27: 1-18.
Conrad, S. (2000) ‘Will corpus linguistics revolutionize grammar teaching in the 21st
century?’. TESOL Quarterly 34: 548–60.
Cook, G. (1998) ‘The uses of reality: a reply to Ronald Carter.’ ELT Journal 52/1: 57-64.
Cowie, A. (1994) ‘Phraseology’ in R. Asher (ed.) The Encyclopaedia of Language
and Linguistics Vol. 6, pp. 3168-3171. Oxford: Pergamon Press Ltd.
de Beaugrande, R. (2001) ‘Interpreting the discourse of H. G. Widdowson: a corpus-based critical discourse analysis’. Applied Linguistics 22/1: 104-121.
Flowerdew, J. (1993) ‘Concordancing as a tool in course design’. System 21/3: 231-243.
Fox, G. (1987) ‘The case for examples’ in J. Sinclair (ed.) Looking Up: An Account of
the COBUILD Project, pp. 137-149. London: HarperCollins.
Francis, G., Hunston, S. and Manning, E. (1996) Collins COBUILD Grammar Patterns
1: Verbs. London: HarperCollins.
Francis, G., Hunston, S. and Manning, E. (1998) Collins COBUILD Grammar Patterns
2: Nouns and Adjectives. London: HarperCollins.
Fries, C. (1945) Teaching and Learning English as a Foreign Language. Ann Arbor:
University of Michigan Press.
Gavioli, L. (2006) Exploring Corpora for ESP Learning. Amsterdam: John
Benjamins.
Gavioli, L. and Aston, G. (2001) ‘Enriching reality: language corpora in language
pedagogy’. ELT Journal 55/3: 238-246.
Gellerstam, M. (1986) ‘Translationese in Swedish novels translated from English’. In
L. Wollin and H. Lindquist (eds.) Translation Studies in Scandinavia. Lund:
CWK Gleerup. 88-95.
Gellerstam, M. (1996) ‘Translations as a source for cross-linguistic studies’. In K.
Aijmer, B. Altenberg and M. Johansson (eds.) Language in Contrast. Lund: Lund
University Press. 53-62.
Ghadessy, M., Henry, A. and Roseberry, R. (eds.) (2001) Small Corpus Studies and
ELT: Theory and Practice. Amsterdam: John Benjamins.
Gilquin, G. (2001) ‘The integrated contrastive model. Spicing up your data’.
Languages in Contrast 3(1): 95–123.
Goethals, M. (2003) ‘E.E.T.: the European English Teaching vocabulary-list’ in B.
Lewandowska-Tomaszczyk (ed.) Practical Applications in Language and
Computers, pp. 417-427. Frankfurt: Peter Lang.
Granger, S. (1976) ‘Why the passive?’. In J. Van Roey (ed.) English-French
Contrastive Analyses. Leuven: Acco. 23-57.
Granger, S. (1983) The Be + Past Participle Construction in Spoken English with
Special Emphasis on the Passive. Amsterdam: North-Holland.
Granger, S. (1996) ‘From CA to CIA and back: An integrated approach to
computerised bilingual and learner corpora’. In K. Aijmer, B. Altenberg and M.
Johansson (eds.) Language in Contrast. Lund: Lund University Press. 37-51.
Granger, S. (1998) ‘The computer learner corpus: a versatile new source of data for
SLA research’. In S. Granger (ed.) Learner English on Computer. London:
Longman. 3-18.
Granger, S. (2002) ‘A bird’s-eye view of learner corpus research’ in S. Granger, J.
Hung and S. Petch-Tyson (eds.) Computer Learner Corpora, Second Language
Acquisition and Foreign Language Teaching, pp. 3–33. Philadelphia: John
Benjamins.
Granger, S. (2003) ‘Practical applications of learner corpora’ in B. Lewandowska-Tomaszczyk (ed.) Practical Applications in Language and Computers, pp. 291-302. Frankfurt: Peter Lang.
Granger, S. (2007) ‘Sylviane Granger: Interview’. Mindbite 1.
Granger, S., Hung, J. and Petch-Tyson, S. (eds.) (2002) Computer Learner Corpora,
Second Language Acquisition, and Foreign Language Teaching. Philadelphia:
John Benjamins.
Granger, S. and Tyson, S. (1996) ‘Connector usage in the English essay writing of
native and non-native speakers of English’. World Englishes 15: 19-29.
Gui, S. and Yang, H. (2002) Zhongguo Xuexizhe Yingyu Yuliaoku (Chinese Learner
English Corpus). Shanghai: Shanghai Foreign Language Education Press.
Hartmann, R. (1995) ‘Contrastive textology’. Language and Communication 5: 25-37.
Herbst, T. (1996) ‘What are collocations: sandy beaches or false teeth?’. English
Studies 04/1996: 379-393.
Hidalgo, E., Quereda, L. and Santana, J. (2007) Corpora in the Foreign Language
Classroom: Selected Papers from the Sixth International Conference on Teaching
and Language Corpora (TaLC 6). Amsterdam: Rodopi.
Higgins, J. and Johns, T. (1984) Computers in Language Learning. Oxford: Oxford
University Press.
Hinkel, E. (2004) ‘Tense, aspect and the passive voice in L1 and L2 academic texts’.
Language Teaching Research 8/1: 5-29.
Hoey, M. (2000) ‘A world beyond collocation: new perspectives on vocabulary
teaching’ in M. Lewis (ed.) Teaching Collocations, pp. 224-245. Hove: Language
Teaching Publications.
Hoey, M. (2004) ‘Lexical priming and the properties of text’. In A. Partington, J.
Morley and L. Haarman (eds.) Corpora and Discourse, pp. 385-412. Bern: Peter
Lang.
Horner, D. and Strutt, P. (2004) ‘Analyzing domain-specific lexical categories:
evidence from the BEC written corpus’. Research Notes 15: 6-8.
Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge
University Press.
Hyland, K. (1999) ‘Talking to students: metadiscourse in introductory coursebooks’.
English for Specific Purposes 18/1: 3-26.
James, C. (1980) Contrastive Analysis. London: Longman.
Johansson, S. (2003) ‘Contrastive linguistics and corpora’. In S. Granger, J. Lerot and
S. Petch-Tyson (eds.) Corpus-Based Approaches to Contrastive Linguistics and
Translation Studies. Amsterdam: Rodopi. 31-44.
Johns, T. (1991) ‘“Should you be persuaded”: two samples of data-driven learning
materials’ in T. Johns and P. King (eds.) Classroom Concordancing ELR Journal
4. University of Birmingham.
Kaltenböck, G. and Mehlmauer-Larcher, B. (2005) ‘Computer corpora and the
language classroom: On the potential and limitations of computer corpora in
language teaching’. ReCALL 17: 65-84.
Karpati, I. (1995) Concordance in Language Learning and Teaching. Pecs:
University of Pecs.
Kaszubski, P. and Wojnowska, A. (2003) ‘Corpus-informed exercises for learners of
English: the TestBuilder program’ in E. Oleksy and B. Lewandowska-Tomaszczyk (eds.) Research and Scholarship in Integration Processes: Poland – USA – EU, pp. 337-354. Łódź: Łódź University Press.
Keck, C. (2004) ‘Corpus linguistics and language teaching research: bridging the
gap’. Language Teaching Research 8(1): 83-109.
Kennedy, G. (1998) An Introduction to Corpus Linguistics. London: Longman.
Kennedy, G. (2003) ‘Amplifier collocations in the British National Corpus:
implications for English language teaching’. TESOL Quarterly 37/3: 467-487.
Kenny, D. (1998) ‘Creatures of habit? What translators usually do with words?’. Meta
43(4).
Kettemann, B. (1995) ‘On the use of concordancing in ELT’. TELL&CALL 4: 4-15.
Kettemann, B. (1996) ‘Concordancing in English Language Teaching’ in S. Botley, J.
Glass, A. McEnery and A. Wilson (eds.) Proceedings of Teaching and Language
Corpora, pp. 4-16. Lancaster University.
Kettemann, B. and Marko, G. (2002) Teaching and Learning by Doing Corpus
Analysis. Amsterdam: Rodopi.
Kettemann, B. and Marko, G. (eds) (2006) Planning, Gluing and Painting Corpora:
Inside the Applied Corpus Linguist’s Workshop. Frankfurt: Peter Lang.
Kita, K. and Ogata, H. (1997) ‘Collocations in language learning: corpus-based
automatic compilation of collocations and bilingual collocation concordancer’.
Computer Assisted Language Learning 10/3: 229-238.
Kjellmer, G. (1991) ‘A mint of phrases’ in K. Aijmer and B. Altenberg (eds.) English
Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman.
Koester, A. (2002) ‘The performance of speech acts in workplace conversations and
the teaching of communicative functions’. System 30: 167-184.
Lado, R. (1957) Linguistics across Cultures: Applied Linguistics for Language
Teachers. Ann Arbor: University of Michigan Press.
Laviosa, S. (1997) ‘How comparable can “comparable corpora” be?’. Target 9: 289-319.
Laviosa, S. (1998) ‘Core patterns of lexical use in a comparable corpus of English
narrative prose’. Meta 43(4).
Leech, G. (1997) ‘Teaching and language corpora: a convergence’ in A. Wichmann,
S. Fligelstone, A. McEnery and G. Knowles (eds.) Teaching and Language
Corpora, pp. 1-23. London: Longman.
Lewis, M. (1993) The Lexical Approach: The State of ELT and the Way Forward.
Hove: Language Teaching Publications.
Lewis, M. (1997a) Implementing the Lexical Approach: Putting Theory into Practice.
Hove: Language Teaching Publications.
Lewis, M. (1997b) ‘Pedagogical implications of the lexical approach’ in J. Coady and
T. Huckin (eds.) Second Language Vocabulary Acquisition: A Rationale for
Pedagogy, pp. 255-270. Cambridge: Cambridge University Press.
Lewis, M. (ed.) (2000) Teaching Collocation: Further Developments in the Lexical
Approach. Hove: Language Teaching Publications.
Lü, S. and Zhu, D. (1979) Yufa Xiuci Jianghua (Talks on Grammar and Rhetoric).
Beijing: Chinese Youth Press.
Mauranen, A. (2002) ‘Will “translationese” ruin a contrastive study?’. Languages in
Contrast 2(2): 161-186.
McAlpine, J. and Myles, J. (2003) ‘Capturing phraseology in an online dictionary for
advanced users of English as a second language: a response to user needs’. System
31: 71-84.
McCarthy, M., McCarten, J. and Sandiford, H. (2005-2006) Touchstone (Books 1-4).
Cambridge: Cambridge University Press.
McEnery, A. and Wilson, A. (2001) Corpus Linguistics (1st ed. 1996). Edinburgh:
Edinburgh University Press.
McEnery, A. and Xiao, R. (2002) ‘Domains, text types, aspect marking and English-Chinese translation’. Languages in Contrast 2(2): 211-229.
McEnery, A. and Xiao, R. (2005) ‘Help or help to: What do corpora have to say?’
English Studies 86(2): 161-187.
McEnery, A. and Xiao, R. (2007) ‘Parallel and comparable corpora: What is
happening?’. In M. Rogers and G. Anderman (eds.) Incorporating Corpora: The
Linguist and the Translator, pp. 18-31. Clevedon: Multilingual Matters.
McEnery, A. and Xiao, R. (2008) CALLHOME Mandarin Chinese Transcripts - XML
version. Pennsylvania: Linguistic Data Consortium.
McEnery, A., Xiao, R. and Mo, L. (2003) ‘Aspect marking in English and Chinese:
using the Lancaster Corpus of Mandarin Chinese for contrastive language study’.
Literary and Linguistic Computing 18(4): 361-378.
McEnery, T., Xiao, R. and Tono, Y. (2006) Corpus-Based Language Studies: An
Advanced Resource Book. London: Routledge.
Meunier, F. (2002) ‘The pedagogical value of native and learner corpora in EFL
grammar teaching’ in S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer
Learner Corpora, Second Language Acquisition and Foreign Language Teaching,
pp. 119–142. Philadelphia: John Benjamins.
Mindt, D. (1996) ‘English corpus linguistics and the foreign language teaching
syllabus’ in J. Thomas and M. Short (eds.) Using Corpora for Language
Research, pp. 232-247. London: Longman.
Mishan, F. (2005) Designing Authenticity into Language Learning Materials.
Chicago: Chicago University Press.
Mukherjee, J. and Rohrbach, J. (2006) ‘Rethinking applied corpus linguistics from a
language-pedagogical perspective: New departures in learner corpus research’. In
B. Kettemann and G. Marko (eds) Planning, Gluing and Painting Corpora: Inside
the Applied Corpus Linguist’s Workshop, pp. 205-232. Frankfurt: Peter Lang.
Murison-Bowie, S. (1996) ‘Linguistic corpora and language teaching’. Annual Review
of Applied Linguistics 16: 182-199.
Myles, F. (2005) ‘Interlanguage corpora and second language acquisition research’.
Second Language Research 21(4): 373-391.
Nelson, G. (1996) ‘The design of the corpus’. In S. Greenbaum (ed.) Comparing
English Worldwide: The International Corpus of English, pp. 27-35. Oxford:
Clarendon Press.
Nelson, M. (2000) A Corpus-Based Study of Business English and Business English
Teaching Materials. PhD thesis, the University of Manchester, Manchester.
Available at http://users.utu.fi/micnel/thesis.html.
Nesselhauf, N. (2003) ‘The use of collocations by advanced learners of English and
some implications for teaching.’ Applied Linguistics 24/2: 223-42.
Nesselhauf, N. (2005) Collocations in a Learner Corpus. Amsterdam: John
Benjamins.
O’Keeffe, A. and Farr, F. (2003) ‘Using language corpora in initial teacher education:
pedagogic issues and practical applications’. TESOL Quarterly 37/3: 389-418.
O’Keeffe, A., McCarthy, M. and Carter, R. (2007) From Corpus to Classroom:
Language Use and Language Teaching. Cambridge: Cambridge University Press.
Osborne, J. (2001) ‘Integrating corpora into a language-learning syllabus’ in B.
Lewandowska-Tomaszczyk (ed.) PALC 2001: Practical Applications in Language
Corpora, pp. 479-492. Frankfurt: Peter Lang.
Osborne, J. (2002) ‘Top-down and bottom-up approaches to corpora in language
teaching’. In U. Connor and T. Upton (eds) Applied Corpus Linguistics: A
Multidimensional Perspective, pp. 251-265. Amsterdam: Rodopi.
Øverås, S. (1998) ‘In search of the third code: an investigation of norms in literary
translation’. Meta 43(4).
Partington, A. (1998) Patterns and Meanings: Using Corpora for English Language
Research and Teaching. Amsterdam: John Benjamins.
Pravec, N. (2002) ‘Survey of learner corpora’. ICAME Journal 26: 81-114.
Renouf, A. (1987) ‘Moving on’ in J. Sinclair (ed.) Looking Up: An Account of the
COBUILD Project. London: HarperCollins.
Römer, U. (2005) Progressives, Patterns, Pedagogy: A Corpus-driven Approach to
English Progressive Forms, Functions, Contexts and Didactics. Amsterdam: John
Benjamins.
Römer, U. (2008) ‘Corpora and language teaching’. In A. Lüdeling and M. Kytö
(eds.) Corpus Linguistics: An International Handbook, pp. 112-131. Berlin:
Mouton de Gruyter.
Sajavaara, K. (1996) ‘New challenges for contrastive linguistics’. In K. Aijmer, B.
Altenberg and M. Johansson (eds.) Language in Contrast, pp. 17-36. Lund: Lund
University Press.
Salkie, R. (1999) ‘How can linguists profit from parallel corpora?’. Paper given at the
Symposium on Parallel Corpora. 22-23 April 1999, University of Uppsala.
Santos, D. (1996) Tense and Aspect in English and Portuguese: A Contrastive
Semantical Study. PhD thesis. Universidade Tecnica de Lisboa.
Scott, M. and Tribble, C. (2006) Textual Patterns: Key Words and Corpus Analysis
in Language Education. Amsterdam: John Benjamins.
Seidlhofer, B. (2000) ‘Operationalizing intertextuality: using learner corpora for
learning’ in L. Burnard and A. McEnery (eds.) Rethinking Language Pedagogy
from a Corpus Perspective, pp. 207–24. New York: Peter Lang.
Seidlhofer, B. (2002) ‘Pedagogy and local learner corpora: working with learning
driven data’ in S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner
Corpora, Second Language Acquisition and Foreign Language Teaching, pp.
213–234. Philadelphia: John Benjamins.
Shei, C. and Pain, H. (2000) ‘An ESL writer’s collocational aid’. Computer Assisted
Language Learning 13/2: 167-182.
Sinclair, J. (1990) Collins COBUILD English Grammar. London: HarperCollins.
Sinclair, J. (1992) Collins COBUILD English Usage. London: HarperCollins.
Sinclair, J. (2000) ‘Lexical grammar’. Naujoji Metodologija 24: 191-203.
Sinclair, J. (2003) Reading Concordances. London: Longman.
Sinclair, J. (ed.) (2004) How to Use Corpora in Language Teaching. Amsterdam:
John Benjamins.
Sinclair, J., Bullon, S., Krishnamurthy, R., Manning, E. and Todd, J. (1990) Collins
COBUILD English Grammar. London: HarperCollins.
Sinclair, J. and Renouf, A. (1988) ‘A lexical syllabus for language learning’ in R.
Carter and M. McCarthy (eds.) Vocabulary and Language Teaching. London:
Longman.
Smadja, F. and McKeown, K. (1990) ‘Automatically extracting and representing
collocations for language generation’ in Proceedings of the 28th Annual Meeting
of Association for Computational Linguistics, pp. 252-259.
Sripicharn, P. (2000) ‘Data-driven learning materials as a way to teach lexis in
context’ in C. Heffer, H. Sauntson and G. Fox (eds.) Words in Context: A tribute
to John Sinclair on his Retirement. Birmingham: University of Birmingham.
Stubbs, M. (2001) ‘Texts, corpora, and problems of interpretation: a response to
Widdowson’. Applied Linguistics 22(2): 149-172.
Tan, M. (2002) Corpus Studies in Language Education. Bangkok: IELE Press.
Taylor, L. (2003) ‘The Cambridge approach to speaking assessment’. Research Notes
13: 2-4.
Teubert, W. (1996) ‘Comparable or parallel corpora?’. International Journal of
Lexicography 9(3): 238-264.
Thompson, P. and Tribble, C. (2001) ‘Looking at citations: using corpora in English
for academic purposes’. Language Learning & Technology 5/3: 91-105.
Thurstun, J. and Candlin, C. (1997) Exploring Academic English: A Workbook for
Student Essay Writing. Sydney: NCELTR.
Thurstun, J. and Candlin, C. (1998) ‘Concordancing and the teaching of the
vocabulary of academic English’. English for Specific Purposes 17: 267-280.
Tribble, C. (1991) ‘Concordancing and an EAP writing program’. CAELL Journal 1/2:
10-15.
Tribble, C. (1997a) ‘Corpora, concordances and ELT’ in T. Boswood (ed.) New Ways
of Using Computers in Language Teaching. Alexandria VA: TESOL.
Tribble, C. (1997b) ‘Improving corpora for ELT: quick and dirty ways of developing
corpora for language teaching’ in B. Lewandowska-Tomaszczyk and P. Melia (eds.)
Practical Applications in Language Corpora – Proceedings of PALC ’97, pp.
107-117. Łódź: Łódź University Press.
Tribble, C. (2000) ‘Practical uses for language corpora in ELT’ in P. Brett and G.
Motteram (eds.) A Special Interest in Computers: Learning and Teaching with
Information and Communications Technologies, pp. 31-41. Kent: IATEFL.
Tribble, C. (2003) ‘The text, the whole text…or why large published corpora aren’t
much use to language learners and teachers’ in B. Lewandowska-Tomaszczyk
(ed.) Practical Applications in Language and Computers, pp. 303-318. Frankfurt:
Peter Lang.
Tribble, C. and Jones, G. (1990) Concordances in the Classroom: A Resource Book
for Teachers. London: Longman.
Tribble, C. and Jones, G. (1997) Concordances in the Classroom: Using Corpora in
Language Education. Houston TX: Athelstan.
Upton, T. and Connor, U. (2001) ‘Using computerized corpus analysis to investigate
the textlinguistic discourse move of a genre’. English for Specific Purposes 20:
313-329.
Wang, L. (1984) Zhongguo Jufa Lilun (Syntactic Theories in China). Qingdao:
Shandong Education Press.
Wichmann, A. (1995) ‘Using concordances for the teaching of modern languages in
higher education’. Language Learning Journal 11: 61-63.
Wichmann, A. (1997) ‘General introduction’ in A. Wichmann, S. Fligelstone, A.
McEnery and G. Knowles (eds.) Teaching and Language Corpora, pp. xvi-xvii.
London: Longman.
Wichmann, A., Fligelstone, S., McEnery, A. and Knowles, G. (eds.) (1997) Teaching
and Language Corpora. London: Longman.
Widdowson, H. (1990) Aspects of Language Teaching. Oxford: Oxford University
Press.
Widdowson, H. (1991) ‘The description and prescription of language’ in J. Alatis
(ed.) Georgetown University Round Table on Languages and Linguistics 1991,
pp. 11-24. Washington, D.C.: Georgetown University Press.
Widdowson, H. (2000) ‘The limitations of linguistics applied’. Applied Linguistics
21/1: 3-25.
Widdowson, H. (2003) Defining Issues in English Language Teaching. Oxford:
Oxford University Press.
Willis, D. (1990) The Lexical Syllabus: A New Approach to Language Teaching.
London: HarperCollins.
Willis, J., Willis, D. and Davids, J. (1988-1989) Collins COBUILD English Course
(Parts 1-3). London: HarperCollins.
Woolls, D. (1998) ‘Multilingual parallel concordancing for pedagogical use’ in
Teaching and Language Corpora, pp. 222-227. Keble College, Oxford, 24-27
July 1998.
Xiao, R. (2003) ‘Use of parallel and comparable corpora in language study’. English
Education in China 2003(1).
Xiao, R. (2008) ‘Well-known and influential corpora’. In A. Lüdeling and M. Kytö
(eds) Corpus Linguistics: An International Handbook, pp. 383-457. Berlin:
Mouton de Gruyter.
Xiao, R., He, L. and Yue, M. (forthcoming) ‘In pursuit of the third code: Using the
ZJU Corpus of Translational Chinese in Translation Studies’. In R. Xiao (ed.)
Using Corpora in Contrastive and Translation Studies. Newcastle upon Tyne:
Cambridge Scholars Publishing.
Xiao, R., McEnery, T. and Qian, Y. (2006) ‘Passive constructions in English and
Chinese: A corpus-based contrastive study’. Languages in Contrast 6(1): 109-149.
Xiao, R. and Yue, M. (2009) ‘Using corpora in Translation Studies: The state of the
art’. In P. Baker (ed.) Contemporary Corpus Linguistics. London: Continuum.
Yang, Y. and Allison, D. (2003) ‘Research articles in applied linguistics: moving
from results to conclusions’. English for Specific Purposes 22: 365-385.
Zhang, X. (1993) English Collocations and Their Effect on the Writing of Native and
Non-native College Freshmen. PhD thesis. Indiana University of Pennsylvania.