Chapter 4: Methods in corpus linguistics: Beyond the concordance line
(Hunston, Susan. Corpora in Applied Linguistics. CUP, 2002)

Concordance lines are a useful tool for investigating corpora, but their use is limited by the ability of the human observer to process the information they present. This chapter looks at methods of investigating corpora that go beyond concordance lines, including statistical calculations of collocation and corpus annotation.

4.1 Frequency and key-word lists
a frequency list = a list of all the types in a corpus together with the number of occurrences of each type
Comparing the frequency lists for two corpora can give interesting information about the differences between them.
e.g.) Kennedy (1998): a comparison between a corpus of Economics texts and one of general academic English
→ the words price, cost, demand, curve, firm… are frequently found in the Economics corpus
↓ keywords
・a useful starting point in investigating a specialized corpus
・can be lexical items which reflect the topic of a particular text, but also grammatical words which convey more subtle information

4.2 Collocation
= the tendency of words to be biased in the way they co-occur
Statistical measurements of collocation are more reliable than unaided observation, and for this reason a corpus is essential.

4.2.1 Measurements of collocation
Computer programs that calculate collocation take a node word and count the instances of all words occurring within a particular span.
(note) the count ignores punctuation marks, counts 's as a separate word, and ignores sentence boundaries
・To calculate collocation, large quantities of data are needed, because in small amounts of data chance co-occurrences cannot be distinguished from meaningful ones.
・The problem with a list of raw frequencies is that it is impossible to attach a precise degree of importance to any of the figures in it.
↓
To calculate the significance of each co-occurrence, there are three measures: the MI score, the t-score, and the z-score.
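The calculations just described (a frequency list, raw collocate counts within a span, and the MI and t-scores defined in the next section) can be sketched in Python. This is a minimal illustration, not the software Hunston describes: the tokeniser, the span of 4, the sample text and the corpus figures are all invented, and the formulas follow one common textbook formulation (expected co-occurrence = node frequency × collocate frequency × span size ÷ corpus size), which varies slightly between concordance packages.

```python
import math
import re
from collections import Counter

def tokenise(text):
    # Ignore punctuation and sentence boundaries; count 's as a
    # separate word, as the chapter notes collocation software does.
    return re.findall(r"'s|[a-z]+", text.lower())

def collocates(tokens, node, span=4):
    """Raw counts of every word within `span` places of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - span), min(len(tokens), i + span + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def mi_score(f_node, f_coll, f_joint, corpus_size, span=4):
    # Expected joint frequency if node and collocate were independent,
    # allowing for the 2*span positions around each node occurrence.
    expected = f_node * f_coll * (2 * span) / corpus_size
    return math.log2(f_joint / expected)      # strength of association

def t_score(f_node, f_coll, f_joint, corpus_size, span=4):
    expected = f_node * f_coll * (2 * span) / corpus_size
    return (f_joint - expected) / math.sqrt(f_joint)  # confidence in it

text = ("The roof leaked. Water leaked from the pipe, "
        "and the report was leaked to the press.")
tokens = tokenise(text)
freq = Counter(tokens)                        # the frequency list
print(freq.most_common(3))
print(collocates(tokens, "leaked").most_common(5))

# Invented corpus figures: a node word occurring 1,000 times and a
# collocate occurring 500 times co-occur 40 times in a 1-million-word
# corpus, with a span of 4 either side.
print(round(mi_score(1000, 500, 40, 1_000_000), 2))
print(round(t_score(1000, 500, 40, 1_000_000), 2))
```

A high MI score with a low t-score signals a strong but thinly evidenced association; frequent grammatical collocates typically show the opposite pattern, which is why the chapter treats the two measures as complementary.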
MI score・・・a measure of how strongly two words seem to associate in a corpus, based on the independent relative frequencies of the two words
1) not dependent on the size of the corpus
2) can be compared across corpora, even if the corpora are of different sizes
3) gives information about a word's lexical behaviour, particularly about its more idiomatic co-occurrences
4) the highest MI scores tend to go to less frequent words with restricted collocations
☆ The strength of a collocation is not always a reliable indication of meaningful association.
↓ To know how much evidence there is for it, the t-score is used.
t-score・・・a measure of how certain we can be that the collocation is the result of more than the vagaries of a particular corpus
1) corpus size is important
2) cannot be compared across corpora
3) gives information about the grammatical behaviour of a word
4) the highest t-scores tend to go to frequently used words (whether or not they are grammatical words) that collocate with a variety of other words
☆ In some instances a wider span than is commonly used may be required, as in 'clause collocation'.

4.2.2 The use of collocational information
Collocational statistics can be helpful in summarising some of the information to be found in concordance lines, allowing more instances of a word to be considered than is feasible with concordance lines alone.
<uses of collocational information>
1) To highlight the different meanings that a word has.
e.g.) the collocates of the verb LEAK
・some of the words are associated with the physical meaning of LEAK: oil, water, gas, roof
・others are associated with the metaphoric sense: documents, information, report, memo…
・prepositions and adverbs of direction are important to the behaviour of LEAK: out, from, to, into
↓ The list of collocates gives a kind of semantic profile of the word involved.
2) With a different method of displaying collocational information, we can obtain clues as to the dominant phraseology of a word.
e.g.)
here is some information about the words that occur one, two and three places to the left of the word shoulder; each column is ranked by t-score:

3-left      2-left      1-left      node
hand        over        his         shoulder
on          on          her         shoulder
looking     shoulder    my          shoulder

3) To obtain a profile of the semantic field of a word.
e.g.) a list of the collocates of bribe and bribery can be grouped into semantic areas (Orpin 1997):
・words connected with wrong-doing
・words connected with money
・words connected with officialdom
・words connected with sport
・words connected with the legal process
→ this gives information not only about the meaning of the words, but also an insight into some of the cultural ramifications of the concept of 'bribery'
(Warning) Calculations of collocation will always give priority to uses of a word that tend to be lexically fixed or restricted, such as compounds and phrases; this prominence may not be justified by their overall frequency.

4.3 Tagging and parsing
4.3.1 Categories and annotation
corpus annotation・・・the process of adding information (designed to interpret the corpus linguistically) to a corpus
・makes it easier to retrieve information and increases the range of investigations that can be done on the corpus
・a 'category-based' methodology: the parts of a corpus – the words, or phonological units, or clauses etc. – are placed into categories, and those categories are used as the basis for corpus searches and statistical manipulations

4.3.2 Tagging
= allocating a part-of-speech (POS) label to each word in a corpus
e.g.) the word light・・・tagged as a verb, a noun or an adjective each time it occurs in the corpus
<various uses of a tagged corpus>
1) Looking at the concordance lines for a word with several senses can be made much simpler.
2) The relative frequencies of different parts of speech for a specific word can be compared.
e.g.) Biber et al. (1998): deal・・・more frequently a noun than a verb; deals・・・more frequently a verb than a noun
(more sophisticated uses)
3) Total occurrences of word-classes in a particular corpus can be counted.
e.g.) Biber et al. (1999) compare word-class occurrence in a number of corpora. (see Table 4.1)
4) The frequency of sequences of tags can be calculated, and corpora can be compared in this respect.
e.g.) Aarts and Granger (1998): non-native speakers writing in English differ from equivalent native-speaker writers in the sequences of tags that recur
→ the non-native writers use fewer of the lengthy noun phrases that are essential to formal, particularly academic, writing in English
・Corpus tagging needs to be done automatically, with computer programs called taggers, which work on a mixture of two principles:
1. rules governing word-classes
2. probability
Automatic taggers are usually claimed to have an accuracy rate of over 90%. It is important, however, to remember that the tagger may be wrong; the human user's judgement is more reliable in individual cases. Inaccuracies produced by a tagger are not usually spread evenly throughout a corpus: the mistakes tend to cluster around words which have several possible tags. A tagger can be instructed to suggest more than one tag in cases of ambiguity, so that the human researcher can simply pick out the ambiguous cases and make a judgement as to the best tag.

4.3.3 Parsing
= analysing the sentences in a corpus into their constituent parts, that is, doing a grammatical analysis
e.g.) Leech and Eyes (1997): 'The victim's friend told police that Krueger drove into the quarry.' is analysed along the lines of
[S [NP [GEN The victim's] [N friend]] [VP [V told] [NP [N police]] [NP(nominal clause) that [NP [N Krueger]] [VP [V drove] [PP [P into] [NP [DT the] [N quarry]]]]]]]
Computer programs (parsers) cannot do this work completely accurately, and parsed corpora are often edited by hand to achieve a greater degree of accuracy.
↓ parsed corpora form the basis for much of the statistical work that has been done on different registers
e.g.)
Biber et al. (1998: 98–99)
・the intransitive use of START > BEGIN (in academic prose) → used in sentences indicating the start of a process
・BEGIN + to-clause > BEGIN + '-ing' clause (in fiction) → describes the start of an action or a reaction to events・・・frequent in narrative
・verbs indicating thought and feeling・・・BEGIN + to-clause only
・verbs indicating movement・・・both BEGIN + '-ing' clause and BEGIN + to-clause
This study demonstrates a useful synergy between word-based methods and category-based methods:
1. to search for intransitive verbs・・・a category-based method (parsed corpus)
2. to search for a verb followed by a to-clause or an '-ing' clause・・・a word-based method

4.4 Other kinds of corpus annotation
annotation = a general term for tagging and parsing, also used to describe other kinds of categorisation that may be performed on a corpus
e.g.) the annotation of a spoken corpus for prosodic features; the annotation of a corpus of learner English for types of error; annotation of anaphora; semantic annotation

4.4.1 Annotation of anaphora
anaphora = a term used in schemes which annotate the cohesion in texts, with the term anaphor being used for the cohesive item (whether the direction of connection is forwards or backwards)
e.g.) (Garside et al. 1997: 72)
(1 A man carrying a blue sports bag 1) …was arrested when <REF=1 he…
antecedent: 'A man carrying a blue sports bag'; anaphor: 'he'
'<' = the direction of connection (backwards)
REF = 'reference'
・It is possible to track the development of a text, showing which people or things are referred to most frequently and how the text is progressively chunked.
(Disadvantage) This annotation cannot be done automatically, and the amount of text that can be coded in this way is therefore limited.
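Once a text has been hand-coded in this way, even simple scripts can exploit the annotation. The sketch below uses a simplified version of the notation shown above (the actual Garside et al. markup is richer than this) to count how often each antecedent is referred to, which is the kind of query that referent tracking relies on.

```python
import re
from collections import Counter

# Simplified annotation scheme: "(1 ... 1)" marks an antecedent,
# "<REF=1 word" marks a backwards anaphor pointing at antecedent 1.
# The sample text is invented, extending the example in the notes.
annotated = ("(1 A man carrying a blue sports bag 1) was arrested "
             "when <REF=1 he attempted to leave. Police said "
             "<REF=1 the man will be charged.")

# Map each antecedent index to its text.
antecedents = {m.group(1): m.group(2)
               for m in re.finditer(r"\((\d+) (.+?) \1\)", annotated)}

# Count how many anaphors point at each antecedent.
links = Counter(m.group(1) for m in re.finditer(r"<REF=(\d+)", annotated))

for idx, count in links.items():
    print(f"{antecedents[idx]!r}: referred to {count} time(s)")
```

Counts like these are what make it possible to say which people or things a text refers to most often, and hence to build the register profiles discussed next.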
(Possibilities)
・to see what types of anaphor and antecedent occur most frequently in different registers
・to see how anaphora typically change as a text progresses
・to provide an anaphoric profile of sets of texts from different registers

4.4.2 Semantic annotation
= the categorisation of words and phrases in a corpus in terms of a set of semantic fields
e.g.) Wilson and Thomas (1997: 61): Joanna stubbed out her cigarette with unnecessary fierceness.
Joanna・・・'Personal Name'
stubbed out・・・'Object-Oriented Physical Activity' and 'Temperature'
cigarette・・・'Luxury Item'
unnecessary・・・'Causality/Chance'
fierceness・・・'Anger'
her and with・・・'Low Content Words', and are not assigned to a semantic category
e.g.) an application of this method: Thomas and Wilson (1996), an analysis of interactions between doctors and patients in two clinics
Doctor A – 1) discourse particles, first and second person pronouns, boosters and downtoners・・・more interactive and friendly
2) the categories 'Cause', 'Change', 'Power' and 'Treatment'・・・explaining what the patients' treatment would be and what they could expect from it
Doctor B – technical terms・・・explaining how the patients' disease was progressing
・Doctor A was perceived as more supportive than Doctor B.
It is possible to criticise the semantic categories as lacking in finesse. On the other hand, automatic annotation plays an important role in dealing with large quantities of data that would be unreasonably time-consuming and difficult to annotate consistently by hand.

4.4.3 How a meaning is made
a variation on semantic annotation, by Biber and Finegan (1989) and Conrad and Biber (2000)
e.g.) a partial annotation, in that only certain categories are selected, for example 'stance':
1. adverbs, clauses and prepositional phrases are selected by the computer from a tagged and parsed corpus
2. the human researcher can accept or modify the categorisation
3.
calculations can be made of how frequently a meaning is made in a number of registers, and of how the meaning is most frequently made in each register
↓
・Adverbs are the most frequently used way of expressing stance grammatically in all three registers (conversation, news reporting and academic prose), and most predominant in conversation.
・clauses such as I think and I guess・・・frequent in conversation
・prepositional phrases such as on the whole and in most cases・・・frequent in academic prose and news reporting
Corpus annotation of this kind provides a basis for approaching a corpus from the point of view of meaning first, and can be linked with a notional approach to language teaching.
A 'local grammar' attempts to describe the resources for only one set of meanings in a language, rather than for the language as a whole.
e.g.) Hunston (1999b): the words that are used to indicate sameness and difference are identified, together with the patterns that follow them; because meaning and pattern are linked, in each case there are several verbs used with the same pattern.
A: 'verb + plural noun group'・・・compare, conflate, connect, contrast, …
There are people [comparer] who equate [comparison] those two terrible video tapes [items 1 and 2].
B: 'verb + between + plural noun group'・・・discriminate, distinguish
It's difficult to differentiate [comparison] between chemical weapons [item 1] and chemicals for peaceful use [item 2].
C: 'verb + from + noun group'・・・differ, diverge, grow apart, …
Make your advertisement [item 1] stand out [comparison] from all the others [item 2] by having it printed in bold type.
D: 'verb + to + plural noun group'・・・answer, approximate, conform, …
How does your job [item 1] measure up [comparison] to your ideal [item 2]?
E: 'verb + a noun phrase + to + a noun phrase'・・・compare, connect, …
The Cuban musicians themselves [comparer] often liken [comparison] their music [item 1] to the works of Bob Dylan [item 2]…

4.4.4 Issues in annotation
There are three basic methods of annotating a corpus – manual, computer-assisted and automatic – but only the last two are suitable for use on anything but the smallest corpora.
・automatic annotation・・・likely to contain errors, as discussed above in the section on tagging
・computer-assisted annotation・・・slower than automatic annotation, but more accurate
The amount of parsed corpus data publicly available is limited. The work involved in annotation acts as a constraint against updating or enlarging a corpus.
e.g.) the 1961, 1-million-word LOB Corpus is still used as a source of data, small and outdated though it is, precisely because it is parsed.

4.5 Competing methods
word-based methods vs. category-based methods
the preference for prioritising words・・・a preference for a plain text corpus
the preference for prioritising categories・・・a preference for an annotated corpus
・It is important to recognise that no method of working is neutral with regard to theory: what the data can say depends to a large extent on how it can be accessed.
・Category-based methods and word-based methods each answer different sets of questions:
"What kind of anaphora is most frequently used in academic prose?" ⇒ a category-based method
"How is the word way used?" ⇒ a word-based method
(Word-based methods)
・A plain text corpus has obvious disadvantages in that certain categories cannot easily be counted, so certain questions are difficult to answer.
e.g.)
Rissanen (1991) traces the history of 'object clauses' beginning with that, and those beginning with 'zero', in unparsed corpora, and comments: "…The problem of tracing zeros must, in this case, be solved in an indirect way, by checking all occurrences of all verbs which take an object clause with that,…"
↓ this was only possible at all because a relatively small corpus was being used
(Category-based methods)
・The 'added value' that annotations give a corpus can be a double-edged sword: the corpus becomes more useful, but less readily updated, expanded or discarded.
・The categories used to annotate a corpus are typically determined before any corpus analysis is carried out, which tends to limit, not the kind of question that can be asked, but the kind of question that usually is asked.
☆ The most interesting and useful annotations seem to be those which are either added ad hoc to a corpus to enable the researcher to answer a particular question, or those which can be added to a corpus quickly and automatically.
What we can perhaps hope for is a synergy between word-based and category-based methods of corpus analysis, in which the one can inform the other, much as qualitative and quantitative methods of research complement each other. The interpretation of information found by looking beyond the concordance line frequently involves returning to those same concordance lines.
<Points with regard to corpus annotation>
・However much annotation is added to a text, it is important for the researcher to be able to see the plain text.
・It is important to be able to use unconventional, ad hoc annotations as necessary.
・Unless a very small-scale study is being carried out, it is important to make the process of annotation as automatic as possible.
☆ It is important when reading about corpus work to be critically aware of the methods being used, and of the theories that lie behind them.

*** New Terms ***
P.67 type (cf. token), a frequency list
P.68 keywords, collocation
P.70 MI score, t-score
P.79 annotation
P.80 tagging
P.82 tagger
P.84 parsing, parser
P.87 anaphora, anaphor
P.90 local grammar
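As a closing illustration of the word-based methods discussed in 4.5: the word-based half of the Biber et al. BEGIN study can be approximated with plain-text pattern matching. The sentences and regular expressions below are invented for illustration; on real corpus data such patterns both over- and under-match, which is one reason interpreting the results means returning to the concordance lines.

```python
import re

# Invented examples standing in for corpus sentences.
sentences = [
    "She began to wonder what had happened.",
    "He began running towards the gate.",
    "They begin to understand the problem.",
    "The engine started without a sound.",
]

# BEGIN (in any of its forms) followed by a to-clause vs. an '-ing' clause.
to_clause = re.compile(r"\bbeg(?:in|an|ins|un|inning)\s+to\s+\w+", re.I)
ing_clause = re.compile(r"\bbeg(?:in|an|ins|un|inning)\s+\w+ing\b", re.I)

to_hits = sum(bool(to_clause.search(s)) for s in sentences)
ing_hits = sum(bool(ing_clause.search(s)) for s in sentences)
print(to_hits, ing_hits)
```

A search like this needs only a plain text corpus, whereas the first step of the study (finding intransitive uses) needs the category-based information of a parsed corpus.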