12. Society and culture

12.1. Corpora and culture
Corpora can also be used to study culture. Differences in the senses of words used in British and
American English reflect cultural differences (Leech and Fallon, 1992). For instance, travel
words are more frequent in American English, which may reflect the fact that the US is larger than
Britain and that people there travel more. Words in the domains of crime and the military were
more common in the American data, which may reflect an American “gun culture”. More research is
needed in the area of cultural studies.
12.2. Psycholinguistic variation
Psycholinguistics is primarily a laboratory subject. However, corpora may be sources of data for
the materials used in laboratory experiments. Word frequency, for instance, may guide the order in
which items are tested in studies of cognitive processes such as word recognition.
Garnham et al. (1981) used a natural spoken corpus to look at speech errors. They classified
and counted the different error types.
The analysis of language pathologies in individuals may be based on data on impaired language
collected earlier.
Fletcher and Garman (1988) collected a corpus of impaired and normal language
development. This may help to identify language abnormalities.
12.3. Social psychology
In social psychology qualitative data and quantitative analyses are equally important.
For example, various written and spoken texts are used in analysing explanations.
Antaki and Naji (1987) investigated phrases that followed because. The results showed that
explanations of general states of affairs were most common (33.8%), followed by actions of the
speaker and the speaker’s group (28.8%) and then actions of others (17.7%).
12.4. Corpora and sociolinguistics
Sociolinguistics relies on the collection of research-specific data. Either a small corpus can be
collected, or a representative corpus can be sampled according to the research questions.
Example of a study:
Lexical studies in the area of language and gender. Kjellmer (1986) studied masculine bias in
American and British English. He looked at masculine and feminine pronouns and at the
occurrence of the lexical items man/men and woman/women. He found that the frequency of female
items was much lower than that of male items in both corpora, but female forms were more frequent
in British English than in American English. Some corpora, but not all, have sociolinguistic
variables encoded. Variables such as the sex of the writer, social class and educational
background have been encoded in historical corpora.
13. Language acquisition and teaching
13.1. Corpora in first language acquisition
Corpora of children’s language, e.g. the CHILDES database, are used to study the stages of
linguistic development in normal and impaired children. The first descriptions of the process of
development were based on small corpora that were not machine-readable at that time (see
the history of corpus linguistics in chapter 1).
Corpus data may be used to evaluate linguistic theories on the basis of empirical data.
The “order of acquisition” was presented by Brown in 1973. Research on corpora of children’s
language confirmed the order proposed by Brown. On the one hand, this shows that corpus linguistics
and other ways of gathering data, including intuition, elicitation and experimentation,
lead to the same results. On the other hand, by retracing the process that led to well-known
results, young researchers learn the methods and techniques.
There is still a need for corpora that contain the language children are exposed to.
13.2. Corpora in second language acquisition and teaching
Indirectly, all learners and teachers use corpora on a daily basis because most
dictionaries and grammar reference books are based on corpora.
Sinclair (1990) stated that: “ELT methodology has paid little attention to the state of language
description, behaving as if the facts of English structure were no longer in dispute. In practical
terms this has led to the growth and maintenance of a mythology about English, which
teachers take for granted, but much of which has been challenged by corpus evidence”.
Chris Tribble (1990) suggested using concordances as teaching materials. They can be used to
teach vocabulary and the grammatical features of words in sentence context to advanced
learners if they are based on monitor corpora. The main advantage of using corpora is that the
language is authentic; the main disadvantage is that sentences are cut off from their wider
context and may contain words that are difficult even for advanced learners.
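As a rough illustration of what a concordance looks like, the following Python sketch prints keyword-in-context (KWIC) lines for a search word. The function name `kwic`, the `width` parameter and the sample sentences are invented for this illustration; real concordancers offer sorting, lemmatisation and much more.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every occurrence of `keyword`."""
    lines = []
    # \b keeps 'make' from matching inside 'makes' or 'remake'
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the keywords form a column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, not taken from any corpus
sample = (
    "She decided to make a cake. They make a decision every week. "
    "Please make sure the door is closed."
)
for line in kwic(sample, "make", width=20):
    print(line)
```

Because each context is cut at a fixed number of characters, the output also shows the disadvantage mentioned above: the sentences appear without their wider context.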
Examples of activities:
- A gapfilling exercise with two words that are easily confused can help students to
infer differences of meaning.
- Wordlists can be used as lead-in activities.
- Literary texts are also useful for comparative studies of style.
- Deducing the meaning of a keyword from context.
- Matching exercises, in which the left or right contexts are jumbled and have to be
restored.
- Study of homonyms and synonyms.
- Using wildcards for studying prefixes and suffixes.
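The wildcard activity in the list above can be sketched in a few lines of Python. This is only an illustration: the function `wildcard_search` and the sample text are invented here, and the standard-library `fnmatch` module stands in for the wildcard syntax of a real concordance program, which may differ.

```python
import re
from fnmatch import fnmatchcase

def wildcard_search(text, pattern):
    """Return the sorted set of word forms matching a shell-style wildcard."""
    # Lowercase the text and split it into simple (optionally hyphenated) words
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return sorted({t for t in tokens if fnmatchcase(t, pattern)})

# Invented sample text, not taken from any corpus
sample = (
    "The unhappy traveller was unable to undo the damage. "
    "Her kindness and his happiness were unexpected."
)
print(wildcard_search(sample, "un*"))    # candidate words with the prefix un-
print(wildcard_search(sample, "*ness"))  # candidate words with the suffix -ness
```

A pattern such as `un*` matches spelling only, so in a larger text it would also return words like "uncle" or "under"; sorting out such cases can itself be part of the classroom task.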
Using corpora in class is time-consuming. However, it introduces autonomy in language
learning and helps students to solve linguistic problems on their own.
The Council of Europe has recommended that language pedagogy should “develop explicit
objectives and practices to teach methods of discovery and analysis” (1994:10). Learners need
to test any rule against as many examples as possible before they fully internalise it.
Concordance programs facilitate building language awareness. Words and syntax should not
be taught separately. The interrelationship of lexis and syntax can be visible on screen in a
few seconds.
Tim Johns introduced and developed DDL, i.e. Data Driven Learning. Monolingual and
parallel corpora of English and French (German, Spanish, Italian) are used to learn and teach
the target language. Materials and ideas are available from Tim Johns’s website:
http://web.bham.ac.uk/johnstf.
Concordances are also used in reciprocal learning. Two learners, one a native speaker of English
and the other of French, each wanting to learn the partner’s language, work together and help
each other.
Susan Hunston (2002: 193) enumerates the challenges to the use of corpora in language
teaching:
- Lack of context.
- Critical approach to corpus evidence. Learners should be creative, not restricted to
utterances collected earlier.
- Corpora comprise the language of native speakers. Native norms are not always
the aim of learning the language.
- Eclectic and diverse methods of teaching and learning should be adapted to
learners’ needs and learning styles.
Further reading: Susan Hunston, Corpora in Applied Linguistics
Teaching ESP and EAP - English for specific purposes and English for academic purposes
In teaching ESP and EAP the content is as important as the language. Legal, technical,
medical and business ‘sublanguages’ should be taught using texts from these disciplines. EAP
teaching should also include academic papers from individual disciplines.
Further reading: Susan Hunston, Corpora in Applied Linguistics
Teaching linguistics
Corpora may be used in teaching linguistics. Kirk (1994) presented a project in which students
were asked to analyse some corpus data in the light of a given theoretical model. The theoretical
models selected for the analysis were Brown and Levinson’s politeness theory, Grice’s
cooperative principle and Biber’s multidimensional approach to linguistic variation [1].
In other projects at Nijmegen and Lancaster, students of linguistics are asked to annotate a
text using corpus-based software. The student is given four chances to get the annotations
right.
Teaching translation
Translation is a matter of style rather than of right and wrong. A multilingual corpus may
provide side-by-side examples of style and idiom in more than one language.
[1] Find more about theories in pragmatics on Andrew Moore’s website:
http://www.universalteacher.org.uk/lang/pragmatics.htm
13.3. Learner corpora
All corpora discussed in the previous sections are collections of language used by native
speakers, either adults or children. They are authentic because they have been collected in
natural contexts. In this section we will deal with learner corpora, which contain language used
by FL/SL learners. As with all corpora, the criteria for collecting the corpus should be
clearly established.
Sinclair (1996) defined learner corpora as follows:
“Computer learner corpora are electronic collections of authentic FL [2]/SL [3] textual data
assembled according to explicit design criteria for a particular SLA [4]/FLT [5] purpose. They are
encoded in a standardized and homogenous way and documented as to their origin and
provenance.”
Granger (2002) comments on the definition in the following way:
- Authenticity
Authenticity is problematic in the case of learner language in comparison with authenticity of
native speaker data. Classroom teaching involves a lot of “artificiality”. Even texts created as
“free writing” are rarely natural because the topic and the time limit are imposed on the
learner. Thus, there are different levels of authenticity ranging from genuine communication
of people about their businesses to results of classroom activities. For example, essays written
in the classroom can be considered authentic written data, and texts read aloud can be
considered authentic spoken data.
- FL and SL varieties
Non-native varieties of English comprise:
- English as an Official Language (EOL), such as Nigerian English or Indian English
- English as a Second Language (ESL), acquired in an English-speaking
environment, as in Britain or the US
- English as a Foreign Language (EFL), learned primarily in the classroom setting
(in most countries)
Learner corpora cover EFL and ESL.
[2] FL: Foreign Language
[3] SL: Second Language
[4] SLA: Second Language Acquisition
[5] FLT: Foreign Language Teaching
- Textual data
A learner corpus must contain continuous stretches of discourse, not isolated sentences or
words. It must contain both correct and erroneous use of the language.
- Design criteria
A random collection of texts does not constitute a corpus. Some criteria are the same as for
native corpora; however, other criteria may be specific to learner corpora, relating to both the
learner and the task, for example the learning context, time limit, mother tongue and use of
reference tools.
- SLA/FLT purpose
The purpose of collecting a learner corpus should relate to SLA theory or FLT methodology.
The researcher may need to confirm or falsify theories about transfer from L1 or L3 or about the
order of acquisition of specific elements of language. A learner corpus may also help to evaluate
ELT methods or tools.
- Standardization and documentation
The corpus can be collected as:
- a raw corpus of plain texts;
- an annotated corpus enriched with linguistic or textual information.
Examples of learner corpora:
Cambridge learner corpus http://uk.cambridge.org/elt/corpus/clc.htm consists of texts written
by those who have taken Cambridge exams. It contains over 15 million words and it is
constantly growing.
The Longman Learners' Corpus http://www.longman.com/dictionaries/corpus/lclearn.html
International Corpus of Learner English – ICLE
http://juppiter.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/Cecl-Projects/Icle/icle.htm
Instructions on how to join the project are available on the website.
The Louvain International Database of Spoken English Interlanguage – LINDSEI
http://juppiter.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/Cecl-Projects/Lindsei/lindsei.htm
The standard used for annotating a learner corpus should be the same as the one used for
native corpora to ensure their comparability. Annotating learner corpora is problematic, however,
because tools developed for native corpora are less reliable when applied to learner language.
Error-tagging software has been developed specifically for learner corpora. Documentation of
learner and task variables should also be provided, either in the corpus or separately.
Research Methodology
Linguistic analysis of learner corpora involves either Contrastive Interlanguage Analysis or
Computer-aided Error Analysis. The first method is contrastive, and consists of comparisons
between native and non-native data or between varieties of non-native data. The second
method focuses on identification and analysis of errors in the non-native data.
For Contrastive Interlanguage Analysis a learner corpus needs to be collected, and an appropriate
native control corpus needs to be selected from a monitor corpus for any comparison
that is to highlight a range of features of non-nativeness.
The interlanguage of the learners can be investigated in order to understand its underlying
system. However, if the aim of the learner corpus research is to improve learners’ proficiency,
it needs to be related to native norms.
Comparisons of learner data for different mother tongue backgrounds help to:
- identify features that are shared by several learner populations;
- find peculiarities of one national group;
- describe developmental features.
Having selected a feature for study and established its underrepresentation or overrepresentation
in the learner data in comparison with native data, the researcher formulates hypotheses to
explain it. In order to interpret the results, a bilingual corpus of the learners’ mother tongue
and English is necessary. Classical Contrastive Analysis and corpus-based Contrastive
Interlanguage Analysis are complementary.
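The idea of over- and underrepresentation can be sketched as a comparison of relative frequencies. The Python example below is a minimal illustration only: the function names and the two short texts are invented toy data, not real corpus data, and a plain frequency ratio is used where a real study would normally apply a statistical significance test to much larger samples.

```python
import re
from collections import Counter

def relative_freq(text, per=1000):
    """Map each word form to its frequency per `per` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {w: c * per / len(tokens) for w, c in counts.items()}

def usage_ratio(word, learner_text, native_text):
    """Ratio > 1 suggests overrepresentation in the learner data, < 1 underrepresentation."""
    learner = relative_freq(learner_text).get(word, 0.0)
    native = relative_freq(native_text).get(word, 0.0)
    if native == 0.0:
        return None  # ratio is undefined when the word is absent from the native data
    return learner / native

# Invented toy strings standing in for a learner corpus and a native control corpus
learner_text = "the man which is in the corner is the man which I know"
native_text = "the man in the corner is the one which I know well"

print(usage_ratio("which", learner_text, native_text))
```

Normalising by corpus size matters because learner and control corpora are rarely the same length; comparing raw counts would be misleading.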
Error analysis
Computer-aided error analysis involves either selecting an error-prone linguistic item and
scanning a corpus to retrieve it, or discovering learner difficulties that the researcher was not
aware of. The first method is effective and fast. However, it requires a predefined set of items
that are expected to be problematic. The second method is time-consuming: it requires tagging all
errors, or at least all errors in a particular category. In return, a fully tagged corpus offers
a huge range of possible applications.
Errors appear both in native and non-native utterances. In language learning and teaching the
approach to errors varies. In the audio-lingual method errors were considered an entirely negative
aspect of learner language. Nowadays error analysis can be seen as the key aspect of the
process of understanding interlanguage development. It also provides data for teachers and
materials designers. Corpus data help to evaluate what learners can be expected to have
acquired.
Learner corpus analysis
On the one hand, learner corpora can be analysed with the same tools developed for the
analysis of native corpora. For example, the type/token ratio,
T/t = (number of word types * 100) / (number of word tokens),
is used to draw conclusions about lexical richness in texts. In learner corpora the T/t
value may be influenced by the high rate of non-standard forms, i.e. various errors.
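A minimal Python sketch of the type/token ratio as defined above is given here. The tokenisation (lowercasing and splitting on letters and apostrophes) is an assumption made for this illustration; real tools tokenise more carefully, and the ratio is also sensitive to text length.

```python
import re

def type_token_ratio(text):
    """Return (number of word types * 100) / (number of word tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) * 100 / len(tokens)

# 'the' occurs twice: 5 types, 6 tokens
print(type_token_ratio("the cat sat on the mat"))
# The misspelling 'teh' counts as a new type: 6 types, 6 tokens
print(type_token_ratio("the cat sat on teh mat"))
```

The second call illustrates the point made above: a non-standard form is counted as an additional type, so errors can raise the ratio without any real gain in lexical richness.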
On the other hand, these “special corpora” require specific tools and methods of analysis.
Traditional types of annotation need to be supplemented by error tagging. There are many
systems of error tagging. The system developed and implemented in Louvain (Granger 2002)
is hierarchical: it attaches to each error a series of codes which go from the more general to
the more specific.
The first letter refers to the error domain:
G - grammatical
L - lexical
X - lexico-grammatical
F - formal
R - register
W - syntax
S - style
The following letters give more precision:
GV - grammatical errors affecting verbs
GVAUX - auxiliary errors
GVM - morphological errors
GVT - tense errors
For example (Granger 2002: 20):
the fact that we could (XVPR) argue on $argue about$ the definition of
want to be parents, do not (XVPR) care of $care about$ the sex
is rising. These people who (XVPR) come in $come to$ Belgium
Family planning (XVPR) consists on $consists of$
have the possibility to (XVPR) discuss about $discuss$ their problems
which the purchaser cannot (XVPR) dispense of $dispense with$
the health. Nobody (XVPR) doubts about $doubts$ that.
harvest they get is often (XVPR) exported in $exported to$ countries
of advice on (XNUC) a $0$ better health care
for years. Undoubtedly (XNUC) a $0$ big progress has been made
characteristic (XNUC) behaviours $behaviour$
It provides (XNUC) employments $employment$
combining study life and (XNUC) entertainments $entertainment$
are many other (XNUC) leisures $leisure facilities$
a balance between work and (XNUC) spare times $spare time$
need to do some (XNUC) works $work$ or simply for your personal
Figure 9. Error tag search: verb dependent prepositions and count/uncount nouns
This system is flexible and allows the analyst to add or delete codes.
The error-tagging system was implemented in an error editor, a tool that helps to insert
error tags. By clicking on the relevant tag in the menu the analyst inserts the tag. Using the
correction box s/he may insert the corrected form, enclosed in two dollar signs as formatting
symbols.
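Because the tag is written in brackets and the correction between dollar signs, error-tagged text of the kind shown in Figure 9 can be searched automatically. The Python sketch below is an illustration only and is not the Louvain software: the regular expression and the function `parse_errors` are assumptions based solely on the layout of the Figure 9 lines, and the domain table repeats the first-letter codes listed above.

```python
import re

# First-letter error domains as listed in this section
DOMAINS = {
    "G": "grammatical", "L": "lexical", "X": "lexico-grammatical",
    "F": "formal", "R": "register", "W": "syntax", "S": "style",
}

# (TAG) erroneous form $corrected form$
TAG_RE = re.compile(
    r"\((?P<tag>[A-Z]+)\)\s*(?P<error>[^$]*?)\s*\$(?P<correction>[^$]*)\$"
)

def parse_errors(line):
    """Extract tag, error domain, erroneous form and correction from a tagged line."""
    results = []
    for m in TAG_RE.finditer(line):
        tag = m.group("tag")
        results.append({
            "tag": tag,
            "domain": DOMAINS.get(tag[0], "unknown"),  # the first letter gives the domain
            "error": m.group("error"),
            "correction": m.group("correction"),
        })
    return results

# Two lines taken from Figure 9
print(parse_errors("Family planning (XVPR) consists on $consists of$"))
print(parse_errors("of advice on (XNUC) a $0$ better health care"))
```

Since the codes are hierarchical, all errors of a broader category can be retrieved by testing the beginning of the tag, e.g. `tag.startswith("GV")` for every grammatical error affecting verbs. In the Figure 9 examples a correction of `0` appears to mark a form that should be deleted.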
Native and learner corpora and language pedagogy
Native corpus data have changed dictionaries and grammar reference books because they
have provided an enriched description of the language. Learner corpora and native corpora are
expected to change language pedagogy in the following areas:
- curriculum design;
- materials design;
- classroom methodology.
When investigating learner corpora for language pedagogy, data from four corpora should be
analysed. Granger (2002: 22) presents the ideal environment for the analysis of
French-speaking learners’ interlanguage and the design of materials for them.
Fig. Learner corpus environment (Granger 2002). Components shown: Native French Corpus;
Native English Corpus; Learner English Corpus (Basilang, Mesolang, Acrolang);
French-English bilingual corpus. Labels: L1, L2.
Basilang - the earliest form of target language development
Mesolang - the intermediate stage of language development
Acrolang - the final stage of target language development
Learner corpus data may change the teaching content.
Curriculum design
In the field of vocabulary teaching frequency information is useful: it may support intuitions
that teachers and researchers have about the areas of difficulty for learners. However,
frequency should not be the dominant factor in selecting vocabulary for learners.
In grammar teaching the selection and sequencing of grammatical phenomena should
be verified or modified in the light of corpus evidence.
Biber (1994) showed that prepositional phrases (the man in the corner) are much more
frequent than relative clauses (the man who is in the corner) and more frequent than
participial clauses (the man standing in the corner). In EFL grammars relative clauses receive
much more discussion than prepositional phrases.
A study by Meunier (2000) showed that French learners underuse prepositional phrases and
participial clauses and significantly overuse relative clauses. This may be partly
teaching-induced and partly due to cross-linguistic reasons: prepositional phrases and participial
clauses are less common in French than in English. What is more, corpus data analysis showed
that French learners have persistent difficulty with relative pronoun selection. Thus, French
learners need more practice with prepositional phrases, participial clauses and relative clauses.
The conclusions drawn from these findings should be incorporated into syllabus design.
Materials design
In monolingual learners’ dictionaries learner corpus data are used to enrich usage notes,
which draw learners’ attention to common mistakes. The Longman Essential Activator
Dictionary is the first dictionary based on learner corpora. The integration of CALL software
with corpus data seems to be promising. WordPilot by Milton is designed to help Hong Kong
EFL learners (see: http://www.compulang.com/index.htm).
Telenex (http://www.telenex.hku.hk/telec/pmain/opening.htm) is a project designed to
provide support for secondary-level English teachers in Hong Kong. It is available only to
registered Hong Kong teachers. A large learner corpus, the TELEC Student Corpus, has been used
to compile the problem page in TeleGram. For each problematic area a series of tools for
teachers has been developed. Every ‘student problem’ is matched with ‘teaching
implications’, which suggest teaching methods designed to help students avoid such mistakes.
Classroom methodology
The use of learner corpora in the classroom is highly controversial because learners are
exposed to erroneous data. Using corpora is more suitable for advanced learners.
Granger (1999) suggests a method in which the teacher selects a problematic item, e.g. a
word, and asks students to find examples of it first in the native corpus and then in the
learner corpus. The patterns used in the learner corpus are similar to the
patterns the learners of this particular class tend to use. The aim of this activity is to get
learners to notice the gap between their own and target language forms. Learners become aware of
their misuse and overuse of the word.
Seidlhofer (2002) suggested a procedure called learning-driven data. The teacher asked
students to read a text and then write a summary of it together with their own account, a personal
reaction to the article. She then compiled a corpus of the students’ own short texts and asked
the students to prepare a list of questions that could be answered with computer tools.
They also compared the language they used with the language of the input text. She mentioned
the possibility of using native corpora if there is a need for them.
Read: Barbara Seidlhofer (2002), Pedagogy and local learner corpora: Working with
learning-driven data
Challenges to learner corpora involve:
- Corpus compilation. Many learner corpora have been compiled but very few are
available. The ICLE corpus is free for contributors. The Longman and Cambridge learner
corpora are collected for internal use only. It is easy for teachers to collect their
own learner corpora for evaluating and implementing new methods.
- Corpus analysis. More research is needed on the success rate of linguistic annotation.
There is a need for longitudinal studies. Quantitative product-oriented studies should be
supplemented with more qualitative process-oriented studies.
- Interdisciplinarity. There is a need for cooperation between SLA, ELT and NLP researchers.
Applications of learner corpus research need to relate to current SLA theories.
Classroom practice needs to be in the focus of researchers.
User-friendly tools need to be developed for learners, teachers and researchers.
“Learner corpus research has the potential to radically improve knowledge about learner
language and language learning.” (Granger 2002: 28)