Chapter 4: Methods in corpus linguistics: beyond the concordance line
(Hunston, Susan. Corpora in Applied Linguistics. CUP, 2002)
Concordance lines are a useful tool for investigating corpora, but their use is limited by the
ability of the human observer to process information. This chapter looks at methods of
investigating corpora that go beyond concordance lines, including statistical calculations of
collocation and corpus annotation.
4.1 Frequency and key-word lists
A frequency list = a list of all the types in a corpus together with the number of occurrences
of each type.
Comparing the frequency lists for two corpora can give interesting information
about the differences between them.
e.g.) Kennedy (1998): a comparison between a corpus of Economics texts and one of general
academic English
→ the words price, cost, demand, curve, firm… are frequently found in the Economics corpus.
↓
keywords
・ a useful starting point in investigating a specialized corpus
・ can be lexical items which reflect the topic of a particular text but also
grammatical words which convey more subtle information
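The mechanics of both steps can be sketched in a few lines of code. The following Python sketch is illustrative only; the file names and the log-likelihood keyness measure are my assumptions, not Hunston's procedure. It builds a frequency list for each corpus and ranks the words that are over-represented in the study corpus.

```python
# Minimal sketch (not from Hunston): a frequency list and a simple keyword comparison.
# File names and the log-likelihood keyness measure are illustrative assumptions.
import math
import re
from collections import Counter

def frequency_list(text):
    """List every type in a text with its number of occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Dunning-style log-likelihood: how surprising is the frequency difference?"""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

study = frequency_list(open("economics_corpus.txt", encoding="utf-8").read())
reference = frequency_list(open("general_academic.txt", encoding="utf-8").read())
n_study, n_ref = sum(study.values()), sum(reference.values())

# Keywords: words whose relative frequency is unusually high in the study corpus.
keywords = [w for w in study if study[w] / n_study > reference[w] / n_ref]
keywords.sort(key=lambda w: log_likelihood(study[w], n_study, reference[w], n_ref),
              reverse=True)
print(keywords[:20])
```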
4.2 Collocation
=the tendency of words to be biased in the way they co-occur.
Statistical measurements of collocation are more reliable than informal observation, and for
this reason a corpus is essential.
4.2.1 Measurements of collocation
Computer programs that calculate collocation take a node word and count the instances
of all words occurring within a particular span.
(note) the count:
・ignores punctuation marks
・counts ‘s as a separate word
・ignores sentence boundaries
・To calculate collocation, large quantities of data are needed; otherwise chance
co-occurrences may appear as prominent as the meaningful ones.
・The problem with a list of raw frequencies is that it is impossible to attach a precise
degree of importance to any of the figures in it.
↓
To calculate the significance of each co-occurrence, there are three measures: the MI score,
the t-score, and the z-score.
MI score ・・・a measure of how strongly two words seem to associate in a corpus,
based on the independent relative frequencies of the two words.
1) not dependent on the size of the corpus
2) can be compared across corpora, even if the corpora are of different sizes
3) gives information about a word’s lexical behaviour, particularly about its more
idiomatic co-occurrences
4) the highest MI scores tend to go to less frequent words with restricted collocations.
☆ The strength of the collocation is not always a reliable indication of meaningful
association.
↓
To know how much evidence there is for a collocation, the t-score is used.
t-score ・・・a measure of how certain we can be that the collocation is the result of more
than the vagaries of a particular corpus
1) Corpus size is important.
2) cannot be compared across corpora
3) gives information about the grammatical behaviour of a word
4) the highest t-scores tend to be frequently used words ( whether or not they are
grammatical words) that collocate with a variety of other words.
☆ In some instances such collocations may require a wider span than is commonly used
(‘clause collocation’).
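The exact formulas differ between corpus tools; the sketch below assumes one common formulation (observed vs. expected co-occurrence within a span) and is not quoted from Hunston.

```python
# Minimal sketch of MI- and t-score calculations for a node/collocate pair.
# One common formulation is assumed; exact formulas vary between corpus tools.
import math

def mi_score(f_node, f_coll, f_pair, corpus_size, span=4):
    """MI = log2(observed / expected): strength of association, not amount of evidence."""
    expected = f_node * f_coll * span / corpus_size   # span = word positions counted
    return math.log2(f_pair / expected)

def t_score(f_node, f_coll, f_pair, corpus_size, span=4):
    """t = (observed - expected) / sqrt(observed): confidence that the pairing is not chance."""
    expected = f_node * f_coll * span / corpus_size
    return (f_pair - expected) / math.sqrt(f_pair)

# Illustrative (invented) figures: a node word and a collocate in a 10-million-word corpus.
print(mi_score(f_node=2000, f_coll=500, f_pair=60, corpus_size=10_000_000))
print(t_score(f_node=2000, f_coll=500, f_pair=60, corpus_size=10_000_000))
```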
4.2.2 The use of collocational information
Collocational statistics can be helpful in summarising some of the information to be found
in concordance lines, allowing more instances of a word to be considered than is feasible
with concordance lines.
<Uses of collocational information>
1) to highlight the different meanings that a word has.
e.g.) the collocates of verb LEAK
・some of the words are associated with the physical meaning of LEAK: oil, water, gas, roof
・others are associated with the metaphoric sense: documents, information, report,
memo….
・prepositions and adverbs of direction are important to the behaviour of LEAK:
out, from, to, into
↓
The list of collocates gives a kind of semantic profile of the word involved.
2) With a different method of displaying collocational information, we can obtain clues as
to the dominant phraseology of a word.
e.g.) the words that occur one, two and three places to the left of the word shoulder, with
each column ranked by t-score (a code sketch of producing such a display appears at the end
of this section):

3 to the left   2 to the left   1 to the left   node
hand            over            his             shoulder
on              on              her
looking         shoulder        my
3) To obtain a profile of the semantic field of a word.
e.g.) a list of the collocates of bribe and bribery can be grouped into semantic areas (Orpin
1997):
・Words connected with wrong-doing
・Words connected with money
・Words connected with officialdom
・Words connected with sport
・Words connected with the legal process
→ gives information not only about the meaning of the words, but also an insight
into some of the cultural ramifications of the concept of ‘bribery’
(Warning)
Calculations of collocation will always prioritise uses of a word that tend to be
lexically fixed or restricted. The prominence this gives to compounds and phrases may
not be justified by their overall frequency.
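A possible way to produce the position-by-position display shown under use 2) above (my sketch; ranking by raw frequency rather than t-score, and the corpus file name, are simplifying assumptions):

```python
# Minimal sketch: count collocates separately at 1, 2 and 3 places to the left of a node word.
# Tokenisation and ranking by raw frequency (rather than t-score) are simplifying assumptions.
import re
from collections import Counter

def left_collocates(text, node, max_left=3):
    tokens = re.findall(r"[a-z']+", text.lower())
    slots = {i: Counter() for i in range(1, max_left + 1)}
    for pos, token in enumerate(tokens):
        if token == node:
            for i in range(1, max_left + 1):
                if pos - i >= 0:
                    slots[i][tokens[pos - i]] += 1
    return slots

slots = left_collocates(open("corpus.txt", encoding="utf-8").read(), "shoulder")
for i in (3, 2, 1):
    print(f"{i} to the left:", [w for w, _ in slots[i].most_common(3)])
```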
4.3 Tagging and parsing
4.3.1 Categories and annotation
Corpus annotation・・・the process of adding information (designed to interpret the
corpus linguistically) to a corpus
・makes it easier to retrieve information and increases the range of investigations that can
be done on the corpus.
・‘category-based’ methodology
the parts of a corpus – the words, phonological units, clauses etc. – are placed into
categories and those categories are used as the basis for corpus searches and
statistical manipulations
4.3.2 Tagging
= allocating a part of speech (POS) label to each word in a corpus
e.g.) the word light・・・tagged as a verb, a noun or an adjective each time it occurs in the
corpus
< various uses of a tagged corpus>
1) Looking at the concordance lines for a word with several senses can be made much
simpler.
2) The relative frequencies of different parts of speech for a specific word can be
compared.
e.g.) Biber et al (1998) deal・・・more frequently a noun than a verb
deals・・・more frequently a verb than a noun
(more sophisticated uses)
3) Total occurrences of word-classes in a particular corpus can be counted.
e.g.) Biber et al (1999) compare word-class occurrence in a number of corpora (see Table 4.1).
4) The frequency of sequences of tags can be calculated and corpora can be compared in
this respect.
e.g.) Aarts and Granger (1998)
Non-native speakers writing in English differ from equivalent native-speaker writers in
the sequences of tags that recur.
→the non-native writers are using fewer of the lengthy noun phrases that are essential to
formal, particularly academic, writing in English.
・Corpus tagging needs to be done automatically with computer programs
called taggers, which work on a mixture of two principles.
1. rules governing word-classes
2. probability
Automatic taggers are usually claimed to have an accuracy rate of over 90 %. It is
important, however, to remember that the tagger may be wrong, and the human user’s
judgement is more reliable in individual cases.
Inaccuracies produced by a tagger are not usually spread evenly throughout a corpus: the
mistakes tend to cluster around words which have several possible tags.
A tagger can be instructed to suggest more than one tag, in cases of ambiguity, and the
human researcher can simply pick out the ambiguous cases and make a judgement as to
the best tag.
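As an illustration only, the sketch below uses NLTK's off-the-shelf tagger (not one of the taggers Hunston discusses) to tag a text automatically and then compares the word classes assigned to deal and deals, as in use 2) above; the corpus file name is an assumption.

```python
# Minimal sketch using NLTK's off-the-shelf tagger (not a tagger discussed by Hunston)
# to compare the word classes assigned to "deal" and "deals".
# Requires the NLTK data packages "punkt" and "averaged_perceptron_tagger".
from collections import Counter
import nltk

text = open("corpus.txt", encoding="utf-8").read()   # illustrative file name
tagged = nltk.pos_tag(nltk.word_tokenize(text))      # Penn Treebank POS tags; accuracy is high but not perfect

counts = {"deal": Counter(), "deals": Counter()}
for word, tag in tagged:
    w = word.lower()
    if w in counts:
        counts[w][tag] += 1   # e.g. NN (noun) vs VB/VBZ (verb)

for w, c in counts.items():
    print(w, c.most_common())
```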
4.3.3 Parsing
= analyzing the sentences in a corpus into their constituent parts, that is, doing
a grammatical analysis.
e.g.) Leech and Eyes (1997)
The victim’s friends told police that Krueger drove into the quarry.

[S [NP [GEN The victim’s] [N friends]]
   [VP [V told]
       [NP [N police]]
       [NP(nominal clause) that
           [NP [N Krueger]]
           [VP [V drove]
               [PP [P into] [NP [DT the] [N quarry]]]]]]]
Computer programs (parsers) cannot do this work completely accurately, and parsed
corpora are often edited by hand to achieve a greater degree of accuracy.
↓
basis for much of the statistical work that has been done on different registers
e.g.) Biber et al (1998: 98-99)
intransitive use of START > BEGIN (in academic prose)
→ used in sentences indicating the start of a process
BEGIN + to-clause > BEGIN + ‘-ing’ (in fiction)
→ describe the start of an action or a reaction to events
・・・frequent in narrative
・verbs indicating thought and feeling・・・BEGIN + to-clause, only
・verbs indicating movement・・・BEGIN + ’-ing’ , BEGIN + to-clause
This study demonstrates a useful synergy between word-based methods and category-based methods:
1. To search for intransitive verbs ・・・ category-based method(parsed corpus)
2. To search for a verb followed by a to-clause or an ‘-ing’ clause・・・word-based
method
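The two styles of search can be sketched as follows; the bracketing format, the NLTK-based category search and the toy sentences are my assumptions, not the procedure of Biber et al.

```python
# Minimal sketch of the two search styles: a category-based search over a small
# bracketed parse (using NLTK's Tree) and a word-based regular-expression search
# over plain text. The bracketing format and toy sentences are illustrative assumptions.
import re
import nltk

# 1) Category-based (needs a parsed corpus): an intransitive use is a VP that
#    contains a verb but no NP object.
parse = nltk.Tree.fromstring("(S (NP (DT The) (NN process)) (VP (VBD started)))")
for vp in parse.subtrees(lambda t: t.label() == "VP"):
    has_object = any(isinstance(child, nltk.Tree) and child.label() == "NP"
                     for child in vp)
    if not has_object:
        print("intransitive:", " ".join(vp.leaves()))

# 2) Word-based (plain text is enough): BEGIN + to-clause vs BEGIN + '-ing' clause.
text = "She began to wonder. The engine began shaking violently."
print(re.findall(r"\bbegan to \w+", text))     # BEGIN + to-clause
print(re.findall(r"\bbegan \w+ing\b", text))   # BEGIN + '-ing' clause
```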
4.4 Other kinds of corpus annotation
annotation = a general term for tagging and parsing, and also used to describe other
kinds of categorisation that may be performed on a corpus.
e.g.) the annotation of a spoken corpus for prosodic features
the annotation of a corpus of learner English for types of error
annotation of anaphora and semantic annotation
4.4.1 Annotation of anaphora
anaphora = a term used in schemes which annotate the cohesion in texts, with the
term anaphor being used for the cohesive item (whether the direction of connection
is forwards or backwards).
e.g.) (Garside et al 1997:72)
(1 A man carrying a blue sports bag 1) … was arrested when <REF=1 he …
(antecedent: A man carrying a blue sports bag; anaphor: he)
“<” = the direction of connection (backwards)
REF = “reference”
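Once such links have been inserted, they can be read off mechanically. The sketch below assumes the simplified notation shown above (Garside et al.'s actual scheme is richer) and simply pairs each anaphor with its antecedent.

```python
# Minimal sketch: pair each anaphor with its antecedent in text annotated with the
# simplified (n ... n) / <REF=n notation shown above. Garside et al.'s actual scheme
# is richer than this; the regular expressions are illustrative assumptions.
import re

annotated = ("(1 A man carrying a blue sports bag 1) ... "
             "was arrested when <REF=1 he tried to leave.")

# Antecedents: spans bracketed as "(n ... n)", keyed by their index number.
antecedents = {m.group(1): m.group(2)
               for m in re.finditer(r"\((\d+)\s+(.*?)\s+\1\)", annotated)}
# Anaphors: the word immediately following "<REF=n".
anaphors = re.findall(r"<REF=(\d+)\s+(\w+)", annotated)

for index, anaphor in anaphors:
    print(f"{anaphor!r} refers back to {antecedents.get(index)!r}")
```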
・It is possible to track the development of a text, showing which people or things
are referred to most frequently and how the text is progressively chunked.
(Disadvantage)
It cannot be done automatically and the amount of text that can be coded in this way is
therefore limited.
(Possibilities)
・To see what types of anaphor and antecedent most frequently occur in different
registers
・To see how anaphora typically change as a text progresses
・To provide an anaphoric profile of sets of texts from different registers
4.4.2 Semantic annotation
= the categorisation of words and phrases in a corpus in terms of a set of semantic fields
e.g.) Wilson and Thomas (1997:61)
Joanna stubbed out her cigarette with unnecessary fierceness.
Joanna・・・Personal Name
stubbed out・・・’Object-Oriented Physical Activity’ and ‘Temperature’
cigarette・・・’Luxury Item’
unnecessary・・・’Causality/Chance’
fierceness・・・’Anger’
her and with・・・‘Low Content Words’ and are not assigned to a semantic category.
e.g.) an application of this method: Thomas and Wilson (1996)
an analysis of interactions between doctors and patients in two clinics
Doctor A – 1) discourse particles, first and second person pronouns, boosters and
downtoners・・・more interactive and friendly
2) the categories ‘Cause’, ‘Change’, ‘Power’ and ‘Treatment’・・・
explaining what the patients’ treatment would be and what they could expect from it.
Doctor B – technical terms・・・explaining how their disease was progressing
・Doctor A was perceived as more supportive than Doctor B.
It is possible to criticize the semantic categories as lacking in finesse. On the other hand, the
automatic annotation plays an important role in dealing with large quantities of data that
would be unreasonably time-consuming and difficult to annotate consistently by hand.
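A sketch of the kind of counting this makes possible, using a tiny invented lexicon (real systems such as Wilson and Thomas's use far larger lexicons and disambiguation rules): each word is mapped to a semantic field and the fields are tallied, along the lines of the Doctor A / Doctor B comparison above.

```python
# Minimal sketch of semantic-field counting: words are looked up in a tiny, invented
# semantic lexicon and the fields are tallied for each speaker.
from collections import Counter

SEMANTIC_LEXICON = {            # illustrative entries only
    "treatment": "Treatment",
    "change": "Change",
    "because": "Cause",
    "well": "Discourse particle",
    "you": "Pronoun (2nd person)",
    "i": "Pronoun (1st person)",
}

def field_profile(utterances):
    profile = Counter()
    for utterance in utterances:
        for word in utterance.lower().split():
            field = SEMANTIC_LEXICON.get(word.strip(".,?"))
            if field:
                profile[field] += 1
    return profile

doctor_a = ["Well, I think you will feel a change because the treatment works quickly."]
print(field_profile(doctor_a).most_common())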
4.4.3 How a meaning is made
a variation on semantic annotation
by Biber and Finegan (1989), Conrad and Biber (2000)
e.g.) a partial annotation, in that only certain categories are selected,
for example ‘stance’.
1. Adverbs, clauses and prepositional phrases are selected by the computer from a
tagged and parsed corpus.
2. The human researcher can accept or modify the categorisation.
3. Calculations can be made in terms of how frequently a meaning is made in a
number of registers, and how the meaning is most frequently made in each
register.
↓
・Adverbs are the most frequently used way of expressing stance grammatically in all
three registers (conversation, news reporting and academic prose), and most
predominant in conversation.
・Clauses such as I think and I guess・・・frequent in conversation
・Prepositional phrases such as on the whole and in most cases・・・frequent
in academic prose and news reporting
Corpus annotation of this kind provides a basis for approaching a corpus from the point of
view of meaning first and can be linked with a notional approach to language teaching.
A ‘local grammar’ attempts to describe the resources for only one set of meanings in a
language, rather than for the language as a whole.
e.g.) Hunston (1999b)
The words that are used to indicate sameness and difference are identified, together with
the patterns they occur in (A-E below). Because meaning and pattern are linked, in each case
there are several verbs used with the same pattern.
A: ‘verb + plural noun group’・・・compare, conflate, connect, contrast, …
There are people who equate those two terrible video tapes.
(comparer: people; comparison: equate; items 1 and 2: those two terrible video tapes)
B: ‘verb + between + plural noun group’・・・discriminate, distinguish
It’s difficult to differentiate between chemical weapons and chemicals for peaceful use.
(comparison: differentiate; item 1: chemical weapons; item 2: chemicals for peaceful use)
C: ‘verb + from + noun group’・・・differ, diverge, grow apart,…
Make your advertisement stand out from all the others by having it printed in bold type.
(item 1: your advertisement; comparison: stand out; item 2: all the others)
D: ‘verb + to + plural noun group’・・・answer, approximate, conform,…
How does your job measure up to your ideal?
(item 1: your job; comparison: measure up; item 2: your ideal)
E: ‘verb + a noun phrase + to + a noun phrase’ ・・・compare, connect, …
The Cuban musicians themselves often liken their music to the works of Bob Dylan…
(comparer: The Cuban musicians; comparison: liken; item 1: their music; item 2: the works of Bob Dylan)
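As an illustration, pattern B can be mapped onto its meaning elements with a simple regular expression; the verb list and the flat pattern below are simplifying assumptions, not Hunston's implementation.

```python
# Minimal sketch of a local-grammar style mapping for pattern B:
# 'verb + between + X and Y' -> comparison, item 1, item 2.
# The verb list and the flat regular expression are simplifying assumptions.
import re

PATTERN_B = re.compile(
    r"\b(differentiate|distinguish|discriminate)\s+between\s+(.+?)\s+and\s+([^.,]+)",
    re.IGNORECASE)

sentence = ("It's difficult to differentiate between chemical weapons "
            "and chemicals for peaceful use.")
match = PATTERN_B.search(sentence)
if match:
    print({"comparison": match.group(1),
           "item1": match.group(2),
           "item2": match.group(3)})
```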
4.4.4 Issues in annotation
There are three basic methods of annotating a corpus – manual,
computer-assisted and automatic – but only the second two are suitable for use
on anything but the smallest corpora.
・Automatic annotation・・・there are likely to be errors, as discussed above in the
section on tagging.
・Computer-assisted annotation・・・slower than fully automatic annotation, but more
accurate.
The amount of parsed corpus data publicly available is limited.
The work involved in annotation acts as a constraint against updating or
enlarging a corpus.
e.g.) the 1961, one-million-word LOB Corpus is still used as a source of data,
small and outdated though it is, precisely because it is parsed.
4.5 Competing methods
Word-based methods vs. category-based methods
The preference for prioritising words ・・・ a preference for a plain text corpus.
The preference for prioritising categories・・・a preference for an annotated corpus
・It is important to recognise that no method of working is neutral with regard to
theory.・・・What the data says will depend to a large extent on how it is able to be
accessed.
・Category-based methods and word-based methods each answer different sets of
questions:
“What kind of anaphora is most frequently used in academic prose?”
⇒a category-based method
“How is the word way used?” ⇒ a word-based method
(Word-based methods)
・A plain text corpus has obvious disadvantages in that certain categories cannot easily be
counted, so certain questions are difficult to answer.
e.g.) Rissanen (1991) traces the history of ‘object clauses’ beginning with that, and those
beginning with ‘zero’, in unparsed corpora, and comments: “… The problem of tracing
zeros must, in this case, be solved in an indirect way, by checking all occurrences of
all verbs which take an object clause with that, …”
↓
only possible at all because a relatively small corpus was being used
(Category-based methods)
・The ‘added value’ that annotations give a corpus can be a double-edged sword:
more useful, but less readily updated, expanded or discarded
・The categories used to annotate a corpus are typically determined before any corpus
analysis is carried out, which tends to limit, not the kind of question that can be asked, but
the kind of question that usually is asked.
☆ The most interesting and useful annotations seem to be those which are either added ad
hoc to a corpus to enable the researcher to answer a particular question, or those which
can be added to a corpus quickly and automatically.
What we can perhaps hope for is a synergy between word-based and category-based
methods of corpus analysis, in which the one can inform the other, much as qualitative and
quantitative methods of research complement each other.
The interpretation of information found by looking beyond the concordance line frequently
involves returning to those same concordance lines.
<Points with regard to corpus annotation>
・ However much annotation is added to a text, it is important for the researcher to be able
to see the plain text.
・ It is important to be able to use unconventional, ad hoc annotations as necessary.
・ Unless a very small-scale study is being carried out, it is important to make the process
of annotation as automatic as possible.
☆ It is important when reading about corpus work to be critically aware of the methods
being used, and of the theories that lie behind them.
*** New Terms ***
P.67 type (cf. token); a frequency list
P.68 keywords; collocation
P.70 MI score; t-score
P.79 annotation
P.80 tagging
P.82 tagger
P.84 parsing; parser
P.87 anaphora; anaphor
P.90 local grammar