Corpus analysis (1): concordance and wordlist

Corpus analysis (1)
Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com
Outline of the session
• Lecture
– Concordance
– Patterning
– Semantic prosody
– Wordlist
– Cluster (lexical bundle, MWU, n-gram)
• Lab
– WST Concord and Wordlist
– AntConc
– Online concordancers
Who reads a corpus?
• A corpus is usually too large for anyone to read, e.g. the
BNC is very, very large…
– It took 4 years to build
– It contains over 100 million (100,106,008) words of modern
English
– It comprises 4,124 texts
– There are six and a quarter million sentence units in the whole
corpus
– Each word is automatically assigned a part-of-speech code; 65 parts of speech are identified
– It occupies 1.5 gigabytes of disk space - the equivalent of more
than 1,000 high capacity floppy disks
– The whole corpus printed in small type on thin paper would take
up 10 metres of shelf space
– Reading the whole corpus aloud at a rate of 150 words a minute,
eight hours a day, 365 days a year, would take nearly 4 years
• A computer can scan in a few seconds more text than
you can read in your whole life…
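The reading-time figure above can be checked with a few lines of arithmetic. This is only a sketch that reuses the numbers given on the slide (word count, 150 words a minute, eight hours a day, 365 days a year); the variable names are illustrative.

```python
# Figures taken from the slide; names are illustrative only.
BNC_WORDS = 100_106_008
WORDS_PER_MINUTE = 150
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365

minutes = BNC_WORDS / WORDS_PER_MINUTE
days = minutes / 60 / HOURS_PER_DAY
years = days / DAYS_PER_YEAR

# Rounded to one decimal place for display.
print(f"Reading the BNC aloud would take about {years:.1f} years")
```

The result is a little under four years, which agrees with the "nearly 4 years" on the slide.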
Concordance
• A comprehensive index of the words used in a
text or a corpus
• A set of concordance lines
• The most common concordance format is the
KWIC concordance - Key Word in Context
– In a KWIC concordance, your search word, i.e. the
node word, is in a central position with all lines
vertically aligned around the node
• Can be sorted to reveal patterns of usage
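To make the idea of a sorted KWIC display concrete, here is a minimal sketch in Python. It is not how WordSmith or AntConc work internally; the function names (`kwic`, `sort_right`, `show`), the context width, and the sample text are all assumptions made for illustration.

```python
import re

# Made-up sample text, loosely modelled on "on the stroke of X" examples.
SAMPLE = (
    "Promptly on the stroke of six the gong sounded. "
    "They scored on the stroke of half-time. "
    "The bird sang on the stroke of midday."
)

def kwic(text, node, width=30):
    """Return (left context, node, right context) for each hit of `node`."""
    text = " ".join(text.split())  # normalise whitespace so hits stay on one line
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()].strip()
        right = text[m.end():m.end() + width].strip()
        hits.append((left, m.group(), right))
    return hits

def sort_right(hits):
    """Sort concordance lines alphabetically by their right context."""
    return sorted(hits, key=lambda h: h[2].lower())

def show(hits, width=30):
    """Print lines with the node vertically aligned."""
    for left, node, right in hits:
        print(f"{left:>{width}}  {node}  {right}")

show(sort_right(kwic(SAMPLE, "stroke")))
```

Sorting on the right context groups lines such as "of half-time" and "of midday" together, which is what makes recurrent patterns visible. Sorting on the left context would only need a different sort key.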
Concordancer
• A concordancer is the software that displays
concordances (Unicode compliant)
– Concord WordSmith Tools (GBP50)
• www.lexically.net/wordsmith/
– MonoConc (USD85)
• www.athel.com/mono.html
– AntConc (free)
• www.antlab.sci.waseda.ac.jp/software/antconc3.2.4w.exe
– Xaira (free)
• www.oucs.ox.ac.uk/rts/xaira/
– Multilingual Corpus Tool (MLCT) (free)
• www.lancs.ac.uk/fass/projects/corpus/cbls/resources.asp
KWIC concordance (WST)
KWIC concordance (MonoConc)
KWIC concordance (AntConc)
KWIC concordance (Xaira)
Online concordancers
• English (free)
– http://corpus.byu.edu/bnc/
– http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
– http://www.americancorpus.org/ (COCA)
• Chinese (free)
– www.lancs.ac.uk/fass/projects/corpus/LCMC/
– www.lancs.ac.uk/fass/projects/corpus/UCLA/
– www.lancs.ac.uk/fass/projects/corpus/babel/babel.htm
• Sketch Engine: a corpus query system for multilingual data,
incorporating word sketches, grammatical relations, and a
distributional thesaurus (30-day free trial)
– http://www.sketchengine.co.uk/
Syntagmatic vs. paradigmatic
Collocation is syntagmatic
• Paradigmatic axis (vertical): langue (language system)
• Syntagmatic axis (horizontal): parole (utterance)
     famous boots. On  the stroke of  full time the
    Stoke the lead on  the stroke of  half-time with a goal
  Smith sin-binned on  the stroke of  half-time, added a
clinched their win on  the stroke of  lunch after resuming
chase by declaring on  the stroke of  lunch. <p> With a lead
  expectant crowd, on  the stroke of  midday. The bird
  hour began not upon  the stroke of  midnight but upon the
 of midnight but upon  the stroke of  noon. There was,
booked in advance. On  the stroke of  seven, a gong summons
          Promptly on  the stroke of  six 'clock, the chooks
    from Edinburgh on  the stroke of  the Millennium.
Example of pattern meaning
• “on the stroke of X”
– X = a temporal point
• “It is/was adj. that…” (construction grammar?)
– certain, likely, possible, probable, etc.
– apparent, clear, evident, obvious, plain, etc.
– fantastic, marvellous, appropriate, logical,
encouraging, exciting, reassuring, etc.
– appalling, unjust, annoying, etc.
– critical, important, necessary, vital, etc.
– amazing, funny, interesting, intriguing, etc.
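Slot fillers for patterns like these can be pulled out of raw text with regular expressions. The sketch below is a rough approximation under stated assumptions: without part-of-speech tagging the "adj." slot matches any single word, and the pattern names, the helper `slot_fillers`, and the sample sentences are made up for illustration.

```python
import re
from collections import Counter

# "It is/was ___ that": the slot matches any word, not only adjectives.
IT_ADJ_THAT = re.compile(r"\bit\s+(?:is|was)\s+([a-z-]+)\s+that\b", re.IGNORECASE)

# "on/upon the stroke of ___"
STROKE_OF = re.compile(r"\b(?:up)?on\s+the\s+stroke\s+of\s+([\w'-]+)", re.IGNORECASE)

def slot_fillers(pattern, text):
    """Count the words that fill the captured slot of `pattern` in `text`."""
    return Counter(m.group(1).lower() for m in pattern.finditer(text))

# Made-up sample sentences.
SAMPLE_IT = (
    "It is scandalous that the rich can buy the drugs privately. "
    "It was possible that the fossils had reacted with the rock. "
    "It is clear that more data are needed. "
    "It was possible that he had already left."
)
SAMPLE_STROKE = (
    "Promptly on the stroke of six the gong sounded. "
    "The hour began not upon the stroke of midnight but upon the stroke of noon."
)

print(slot_fillers(IT_ADJ_THAT, SAMPLE_IT).most_common())
print(slot_fillers(STROKE_OF, SAMPLE_STROKE).most_common())
```

On a real corpus the resulting frequency list would still need manual grouping into meaning classes (probability, evaluation, and so on); the regex only collects candidates.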
Pattern meaning
• A large number of different adjectives occur in the
pattern between is/was and that
– Probability
• “It was important to establish this because it was possible
that strontium and calcium in fossils might have reacted
chemically with the rock in which the fossils were buried.”
(New Scientist)
– Evaluation - used to evaluate propositions
(statements) rather than things or people
• “But a lot of health authorities say they will not allow these
drugs on NHS prescription as they cannot afford them at
around £90 a month. It is scandalous that the rich can buy
the drugs privately, but tough luck if you are poor.” (The Sun)
Meaning arising from collocation
• “There are always semantic relations
between node and collocates, and among
the collocates themselves.” (Stubbs 2002:
225)
– Collocational meaning arising from the
semantic relations between node and
collocates: semantic prosody (also called
“discourse prosody”)
– Collocational meaning arising from the
semantic relations among collocates of a
node: semantic preference
What is semantic prosody?
• “consistent aura of meaning with which a form is imbued
by its collocates” (Louw 1993: 157)
• “a form of meaning which is established through the
proximity of a consistent series of collocates.” (Louw
2000: 57)
• “the spreading of connotational colouring beyond single
word boundaries” (Partington 1998: 68)
• “When the usage of a word gives an impression of an
attitudinal or pragmatic meaning, this is called a
semantic prosody” (Sinclair 1999)
• This kind of meaning is “prosody” in the sense that it
stretches over more than one unit (word)
Semantic prosody
• The primary function of SP is to express
speaker/writer attitude or evaluation (Louw
2000: 58)
– Attitudinal, affective, evaluative and pragmatic
meaning
• Semantic prosodies are typically negative; relatively
few bear an affectively positive meaning
– Unsurprising: contented human beings utter much
less than discontented ones
– It is unrequited love, not requited love, that forms
most of the subject matter for the greatest love poetry
in English!
Semantic prosody
• SET IN: occurs primarily with subjects which refer to
unpleasant states of affairs
– …before bad weather sets in…
– …the fact that misery can set in…
– …desperation can set in…
– …stagnation seemed to have set in…
– …before rigor mortis sets in…
• BREAK OUT: it is bad things that break out
– …violence broke out…
– …riots broke out…
– …war broke out…
– …real disagreements have broken out…
– …a storm of protest broke out…
Semantic prosody
• Collocates of CAUSE
– damage, problems, pain, disease, distress, trouble,
concern, degradation, harm, pollution, suffering,
anxiety, death, fear, stress, symptoms
– These examples of ‘bad company’ collocate with
cause so frequently that the central and typical
use of cause shows a negative affective meaning
(cf. the Chinese proverb “one who stays near ink gets stained black”?)
• Collocates of consequences
– In the sense of result
• serious, disastrous, adverse, dire, damaging, negative,
unintended, unfortunate, tragic, fatal, severe
– In the sense of importance
• important, significant, far-reaching, profound
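Collocate lists like those above are usually obtained by counting the words that fall within a fixed span of the node. The following is a minimal sketch of that counting step, assuming a simple lowercase tokenizer, a span of three words either side, and a made-up sample text. It ignores sentence boundaries and does no significance testing, both of which a real collocation analysis would normally address.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic strings only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def collocates(tokens, node_forms, span=3):
    """Count words occurring within `span` tokens of any form of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Made-up sample text; the word forms of the lemma CAUSE are listed by hand.
SAMPLE = (
    "The storm caused serious damage. Smoking causes disease and can cause "
    "death. The delay caused problems and caused damage to the crops."
)
CAUSE_FORMS = {"cause", "causes", "caused", "causing"}

counts = collocates(tokenize(SAMPLE), CAUSE_FORMS)
print(counts.most_common(5))
```

Function words such as "the" and "and" will rank highly in raw counts, which is one reason concordancers offer statistical measures or stop lists on top of plain frequency.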
Semantic prosody
• PROVIDE: a positive semantic prosody
– facilities, information, services; aid, assistance, help,
support; care, food, money, nourishment, protection,
security
• CREATE: “prosodically mixed or incomplete”
– [Negative] illusion, problems
– [Neutral] atmosphere, conditions, environment,
image, impression, situation, space
– [Positive] jobs, opportunities, order, wealth
Semantic prosody
• The negative (or less frequently positive)
prosody that belongs to a lexical item is the
result of the interplay between the item and its
typical collocates
– The item does not appear to have an affective meaning until it
appears in the context of its typical collocates
– If a word has typical collocates with an affective meaning, it may
take on that affective meaning even when it is used with other
atypical collocates
• The consequence of a word frequently keeping
‘bad company’ is that the use of the word alone
may become enough to indicate something
unfavourable (cf. Partington 1998: 67)
Semantic prosody
• Is semantic prosody a type of connotative
meaning?
• “Semantic prosodies are not merely
connotational” as the force behind semantic
prosodies is “more strongly collocational than
the schematic aspects of connotation.” (Louw
2000: 49-50)
• In my view, connotation can be collocational or
non-collocational; semantic prosody can only be
collocational
Semantic prosody
• Semantic prosody is strongly collocational
in that it operates beyond the meanings
of individual words
• Both personal and price are quite neutral,
but when they co-occur, a negative
prosody may result: personal price most
frequently refers to something undesirable
– In the Bank of English (BoE), with over 550 million
words of written and spoken texts, all 20 instances of
“personal price” are evaluatively negative
“Personal price”
typically negative and high
something undesirable
Barclays’ slogan to promote their personal financial services in 2003
“The personal loan with the personal price”
Semantic preference
• ‘a lexical set of frequently occurring
collocates [sharing] some semantic
feature’ (Stubbs 2002: 449)
– large typically collocates with items from the
same semantic set indicating ‘quantities and
sizes’
• number(s), scale, part, quantities, amount(s)
– ‘absence/change of state’ is a common
feature of the collocates of maximizers such
as utterly, totally, completely and entirely
Semantic preference
• Semantic preference and semantic prosody are
two distinct yet interdependent collocational
meanings
– Semantic prosody is a further level of abstraction of
the relationship between lexical units (Sinclair 1996,
1998; Stubbs 2001)
• Collocation (the relationship between a node and individual
words)
• Colligation (the relationship between a node and grammatical
categories, e.g. “very” tends to collocate with adjectives and
adverbs)
• Semantic preference (semantic sets/fields of collocates)
• Semantic prosody (affective meanings of a given node with
its typical collocates)
Semantic preference
• Semantic preference and semantic prosody
have different operating scopes (Partington
2004:151)
– Semantic preference can be viewed as a feature of
the collocates while semantic prosody is a feature of
the node word
• The two also interact (Partington 2004: 151)
– Semantic prosody ‘dictates the general environment
which constrains the preferential choices of the node
item’
– Semantic preference ‘contributes powerfully’ to
building semantic prosody
End of concordance versus patterning, collocation and collocational meaning
Wordlist
• A list of words in a corpus and their frequency
– Can become very meaningful when compared with other lists: “keyword
analysis”
• “A type is not a token.”
– Token: an occurrence of any given word form (6 tokens)
– Type: a (unique) word form (5 types - “a” is repeated)
• Type-token ratio (TTR): the number of types divided by the number
of tokens, multiplied by 100
– lexical density: a low TTR indicates a text is not very lexically rich
– useful when comparing samples of roughly equal length
• Standardized type-token ratio (STTR)
– It is difficult to compare the TTR of a smaller corpus against a larger one
• As a corpus gets bigger, new word types turn up less and less often,
so TTR tends to fall as corpus size increases
– In order to remedy the issue of comparing TTRs of corpora of different
sizes, WordSmith can calculate TTR based on every 1,000 words (the
default setting can be adjusted) and produce an average TTR
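The two measures can be sketched in a few lines of Python. This is an illustration, not WordSmith's implementation: the tokenizer is a simple assumption, the function names are made up, and dropping the final incomplete chunk in `sttr` is one possible way to handle the remainder.

```python
import re

def tokenize(text):
    """Lowercase and split into word forms (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def ttr(tokens):
    """Type-token ratio: types / tokens * 100."""
    return 100.0 * len(set(tokens)) / len(tokens) if tokens else 0.0

def sttr(tokens, chunk=1000):
    """Mean TTR over consecutive chunks of `chunk` tokens.

    The final incomplete chunk is dropped (an assumption of this sketch).
    Returns None if the text is shorter than one chunk.
    """
    chunks = [tokens[i:i + chunk] for i in range(0, len(tokens) - chunk + 1, chunk)]
    if not chunks:
        return None
    return sum(ttr(c) for c in chunks) / len(chunks)

tokens = tokenize("A type is not a token.")
print(len(tokens), "tokens,", len(set(tokens)), "types, TTR =", round(ttr(tokens), 2))
```

For the slide's example sentence this gives 6 tokens and 5 types, so TTR = 5 / 6 × 100 ≈ 83.33. Because every chunk has the same length, the averaged figure stays comparable across corpora of different sizes.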
Wordlist
AntConc wordlist
Practice
• Make a wordlist of the following text using
the wordlist function in WST or AntConc
– The Stephen text (local copy available)
• http://www.cch.kcl.ac.uk/legacy/teaching/av1000/textanalysis/gaskin/stephen.txt
• A book written by the hippie guru Stephen Gaskin
• Browse through the frequency list. Can
you see any pattern in the list?
Cluster
• Also called lexical bundle, n-gram, multi-word unit
(MWU)
• Groups of N words which appear in sequence in the text
• Presented using frequency lists
• Good way to identify recurrent/specific expressions for a
corpus
• Tools
– WordSmith
• Concord
• Wordlist (Index)
– AntConc
• N-gram
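The counting behind a cluster list can be sketched as follows. The function names, the whitespace tokenization, and the sample text are assumptions for illustration; `ngrams` roughly corresponds to clusters computed over a whole text, and `clusters_with` to clusters restricted to those containing a search word.

```python
from collections import Counter

def ngrams(tokens, n=3):
    """Frequency list of all contiguous n-word sequences."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clusters_with(tokens, word, n=3):
    """Only those n-grams that contain `word` as one of their tokens."""
    return Counter({g: c for g, c in ngrams(tokens, n).items() if word in g.split()})

# Made-up sample, already lowercased and free of punctuation.
SAMPLE = "you know what i mean you know what i mean and i know that"
tokens = SAMPLE.split()

print(ngrams(tokens).most_common(3))
print(clusters_with(tokens, "know").most_common(3))
```

A concordancer typically adds a minimum-frequency threshold on top of this, so that one-off sequences do not clutter the list.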
Cluster/lexical bundle/n-gram
Concord
(3-gram)
Wordlist
Clusters in WordSmith
• The Stephen text
• Clusters with WST Concord
– The search term
• Clusters with WST Wordlist (Index)
– The whole corpus
• Questions
– What are the most frequent 3-word clusters with
“know” in the Stephen text?
– What are the most frequent 3-word clusters in
the whole text? Are they all “expected” phrases?
Clusters in WordSmith
• Make adjustments here
Concord: “know”
3-word clusters of “know”
recompute n-word clusters
Clusters in Wordlist (Index)
An error may occur if you specify a folder
for which you do not have write permission
Clusters in Wordlist (Index)
The index is created and saved in the specified file location
Warning: Your file location may be different!
Resulting index
Clusters in Wordlist (Index)
• OR: Wordlist – File – Open
N-gram in AntConc
Difference from WST: Can a word contain the apostrophe?
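The apostrophe question matters because the definition of a word changes both the tokens and the n-grams built from them. The sketch below contrasts two hypothetical token definitions; it makes no claim about which one WST or AntConc uses by default, and the function name and sample sentence are made up.

```python
import re

def tokenize(text, apostrophe_in_word=True):
    """Tokenize with or without allowing word-internal apostrophes."""
    pattern = r"[a-z]+(?:'[a-z]+)*" if apostrophe_in_word else r"[a-z]+"
    return re.findall(pattern, text.lower())

SAMPLE = "I don't know what he's saying"

print(tokenize(SAMPLE, apostrophe_in_word=True))
print(tokenize(SAMPLE, apostrophe_in_word=False))
```

With apostrophes allowed, "don't" and "he's" are single tokens; without them, they split into "don" + "t" and "he" + "s", so the same text yields more tokens and different clusters.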