Corpora in lexicographic and lexical studies

Corpora in lexical studies
Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com
Aims of this session
• Lecture
– Corpus-based lexicography
– Collocation and colligation
• Lab session
– Collocation using WST
– Collocation using AntConc
– Collocation and colligation in Xaira
– Using the BNCweb to study collocation
Corpus revolution in
lexicographic and lexical studies
• Lexicographic and lexical studies are the
greatest beneficiaries of corpora
• Corpora have “revolutionised” dictionary
making and reference publishing
– It is now nearly unheard of for new dictionaries
and new editions of old dictionaries published
from the 1990s onwards not to claim to be based
on corpus data
Why use corpora in dictionary making?
• Machine-readable corpora allow dictionary makers
to extract all authentic, typical examples of the usage
of a lexical item from a large body of text in a few
seconds
• Corpora allow dictionary makers to select entries
based on frequency information
• Corpora can readily provide frequency information
and collocation information for readers
• Textual (e.g. register, genre and domain) and
sociolinguistic (e.g. user gender and age) information
encoded in corpora allows lexicographers to give a
more accurate description of the usage of a lexical
item
• Corpus annotations such as part-of-speech tagging
and word sense disambiguation also enable a more
sensible grouping of polysemous words and
homographs
• A “monitor corpus” allows lexicographers to track
subtle change in the meaning and usage of a lexical
item so as to keep their dictionaries up-to-date
• Corpus evidence can complement or refute the
intuitions of individual lexicographers, which are not
always reliable because of potential biases
Five emphases
• Changes brought about by corpora to
dictionaries and other reference books - five
“emphases” (Hunston 2002)
– an emphasis on frequency
– an emphasis on collocation and phraseology
– an emphasis on variation
– an emphasis on lexis in grammar
– an emphasis on authenticity
Top 1000 written / spoken words
Authentic examples
Corpus-based learner dictionaries
• First ‘fully corpus-based’ dictionary
– Collins Cobuild English Dictionary (1987)
• Some corpus-based learner dictionaries
– Longman Dictionary of Contemporary English (3rd
edition)
– Oxford Advanced Learner’s Dictionary (OALD, 5th
edition)
– Cambridge International Dictionary of English (1st
edition)
Frequency dictionaries
Collocation
• Collocation is among the linguistic concepts which have
benefited most from advances in corpus linguistics
• What is collocation?
– strong tea, powerful car (Halliday 1976)
– “collocations of a given word are statements of the habitual or
customary places of that word…the company that words keep” (Firth
1968:181-2)
• “One of the meanings of night is its collocability with dark” (Firth
1957:196)
– “a frequent co-occurrence of two lexical items in the language”
(Greenbaum 1974:82)
• expel a school child vs. cashier an army officer
• “I propose to bring forward as a technical term, meaning by
collocation, and apply the test of collocability” (Firth 1957:
194)
Meaning by collocation
• “There is frequently so high a degree of
interdependence between lexemes which tend to
occur in texts in collocation with one another that
their potentiality for collocation is reasonably
described as being part of their meaning” (Lyons
1977: 613)
• Complete description of the meaning of a word
would have to include the other word or words that
collocate with it
• “You shall know a word by the company it keeps!”
(Firth 1968:179)
• Collocation is part of the word meaning
Two types of collocation
• Coherence collocation vs. neighbourhood
(horizontal) collocation (Scott 1998)
– Coherence collocation
• Collocates associated with a word (e.g. letter – stamp,
post office)
– Neighbourhood collocation
• Words which do actually co-occur with the word (e.g.
letter – my, this, a, etc.)
Coherence collocation
• “A cover term for the cohesion that results from the
co-occurrence of lexical items that are in some way
or other typically associated with one another,
because they tend to occur in similar environments.”
(Halliday & Hasan 1976:287)
– candle – flame – flicker
– hair – comb – curl – wave
– sky – sunshine – cloud – rain
• Difficult to measure using a statistical formula
Neighbourhood collocation
• Collocation in corpus linguistics
• Structure of collocation – collocation window
– “We may use the term node to refer to an item whose
collocations we are studying, and we may then define a
span as the number of lexical items on each side of a node
that we consider relevant to that node. Items in the
environment set by the span we will call collocates.”
(Sinclair 1966:415)
• Casual vs. significant collocation
– Significant collocation: collocation that occurs more
frequently than would be expected (in a statistical sense)
on the basis of the individual items
• n.b. Neighbourhood (horizontal) collocations can
include some coherence collocations
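The node / span / collocate terminology above can be sketched in a few lines of Python. This is a minimal illustration, not how WST, AntConc, Xaira or BNCweb implement it: the function name `collocates`, the default span of 4 and the toy sentence are all assumptions made for the example.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens on each side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` items before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` items after the node
        counts.update(left + right)
    return counts

# Invented toy text, for illustration only
text = "she posted the letter and he read the letter at the post office"
tokens = text.split()
result = collocates(tokens, "letter", span=2)
print(result.most_common())
```

With a span of 2, "office" falls outside the window even though it is associated with "letter", which is one way to see the difference between neighbourhood and coherence collocation.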
Intuition vs. collocation
• Greenbaum (1974): “people disagree on collocations” in
introspection-based elicitation experiments
• Although “collocation can be observed informally” on the
basis of intuitions, “it is more reliable to measure it
statistically, and for this a corpus is essential” (Hunston 2002:
68)
• Intuition is often a poor guide to collocation
– “because each of us has only a partial knowledge of the language, we
have prejudices and preferences, our memory is weak, our
imagination is powerful (so we can conceive of possible contexts for
the most implausible utterances), and we tend to notice unusual
words or structures but often overlook ordinary ones” (Krishnamurthy
2000: 32-33)
• Collocation can be measured on the basis of co-occurrence
statistics (MI, z, t, LL etc) – more discussion to follow
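As a rough sketch of two of the measures named above, MI and the t-score can be computed from four counts: the co-occurrence frequency, the node frequency, the collocate frequency and the corpus size. The formulas below are the commonly cited simplified versions (expected frequency = node frequency × collocate frequency ÷ corpus size, ignoring any adjustment for span); individual tools may differ in detail, and all counts are invented.

```python
import math

def expected(f_node, f_coll, n_tokens):
    """Co-occurrence frequency expected if node and collocate were independent."""
    return f_node * f_coll / n_tokens

def mutual_information(f_pair, f_node, f_coll, n_tokens):
    """MI = log2(observed / expected)."""
    return math.log2(f_pair / expected(f_node, f_coll, n_tokens))

def t_score(f_pair, f_node, f_coll, n_tokens):
    """t = (observed - expected) / sqrt(observed)."""
    return (f_pair - expected(f_node, f_coll, n_tokens)) / math.sqrt(f_pair)

# Invented counts: a rare collocate and a frequent one for the same node
n, f_node = 1_000_000, 1000
rare = dict(f_pair=4, f_coll=4)          # occurs only next to the node
frequent = dict(f_pair=100, f_coll=10_000)

for name, c in (("rare", rare), ("frequent", frequent)):
    mi = mutual_information(c["f_pair"], f_node, c["f_coll"], n)
    t = t_score(c["f_pair"], f_node, c["f_coll"], n)
    print(f"{name:9s} MI={mi:.2f} t={t:.2f}")
```

On these invented figures MI ranks the rare collocate higher, while the t-score ranks the frequent collocate higher.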
Collocation is syntagmatic
Langue (Language system): paradigmatic
Parole (Utterance): syntagmatic
Concordance of "the stroke of":
     famous boots. On  the stroke of  full time the
    Stoke the lead on  the stroke of  half-time with a goal
  Smith sin-binned on  the stroke of  half-time, added a
clinched their win on  the stroke of  lunch after resuming
chase by declaring on  the stroke of  lunch. <p> With a lead
  expectant crowd, on  the stroke of  midday. The bird
  hour began not upon  the stroke of  midnight but upon the
 of midnight but upon  the stroke of  noon. There was,
booked in advance. On  the stroke of  seven, a gong summons
          Promptly on  the stroke of  six 'clock, the chooks
    from Edinburgh on  the stroke of  the Millennium.
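Concordance lines like those for "the stroke of" can be produced with a simple key-word-in-context (KWIC) routine. The sketch below is an assumption-laden illustration (the function name `kwic`, the `width` parameter and the toy sentence are invented), not the algorithm of any particular concordancer.

```python
def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented toy text, for illustration only
tokens = "Promptly on the stroke of six the gong sounded on the stroke of noon".split()
lines = kwic(tokens, "stroke", width=2)
for left, node, right in lines:
    print(f"{left:>10}  {node}  {right}")
```

Reading across one line follows the syntagmatic axis; reading down a column of aligned contexts shows the items that can fill the same slot.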
Collocation vs. colligation
• Collocation
– Relationship between a lexical item and other
lexical items
• Relationship between words at the lexical level
• E.g. very collocates with good
• Colligation
– Relationship between a lexical item and a
grammatical category
• Relationship between words at the grammatical level
• E.g. very colligates with ADJ
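The contrast can be sketched with a part-of-speech tagged token list: counting the words next to a node gives collocates, while counting the tags next to it gives colligates. The tagged sample, the tag labels and the function name `colligates` below are hypothetical and hand-made for illustration; a real study would use a tagged corpus such as the BNC.

```python
from collections import Counter

def colligates(tagged_tokens, node, offset=1):
    """Count the grammatical categories found `offset` positions from `node`."""
    counts = Counter()
    for i, (word, _tag) in enumerate(tagged_tokens):
        j = i + offset
        if word.lower() == node and 0 <= j < len(tagged_tokens):
            counts[tagged_tokens[j][1]] += 1
    return counts

# Hand-tagged toy data, for illustration only
tagged = [("a", "DET"), ("very", "ADV"), ("good", "ADJ"), ("idea", "NOUN"),
          ("and", "CONJ"), ("very", "ADV"), ("sweet", "ADJ"), ("tea", "NOUN")]
result = colligates(tagged, "very")
print(result)
```

Here "good" and "sweet" each occur once as a collocate of "very", but both count towards the same colligate, ADJ.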
WST Collocate settings
Concord tab
WST collocates
Strength of relationship is displayed as 0.000 if it hasn't yet been computed
Strength of collocation relationship
A wordlist is required
Highlight and double click…
…to see the selected collocate
Collocates in AntConc
Collocation in Xaira
Colligation in Xaira
Exploring collocation with BNCweb
http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
Search for “sweet”
Concordances of “sweet”
KWIC view
Dropdown menu: collocations
Collocation setting
Collocation database (default settings)
Adjusting settings
Noun collocates of “sweet”
Click on a word to see its collocation info
Collocation info of “sweet” + “smell”
Click on a number to see concordances of collocates at that position
Concordances of “smell” at R2
Collocation statistics
Rank by frequency
Frequent words crowd into the top of the collocate list:
Are they genuine collocates?
Rank by the t test
• Also focusing on frequent words?
Rank by MI
Infrequent words at the top of the list
How useful are they (especially to English learners)?
Rank by the z score
Like MI, the z score also over-estimates infrequent
items (e.g. nothings, afton, marjoram)
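A small sketch of why this happens, using a simplified z-score, (observed − expected) ÷ √expected, which is an approximation and may differ from the exact formula in BNCweb. The counts are invented: a rare word that occurs only next to the node gets a tiny expected frequency and therefore a very large score.

```python
import math

def z_score(f_pair, f_node, f_coll, n_tokens):
    """Simplified z = (observed - expected) / sqrt(expected)."""
    expected = f_node * f_coll / n_tokens
    return (f_pair - expected) / math.sqrt(expected)

# Invented counts for one node in a 1-million-word corpus
n, f_node = 1_000_000, 1000
z_rare = z_score(f_pair=4, f_node=f_node, f_coll=4, n_tokens=n)
z_frequent = z_score(f_pair=100, f_node=f_node, f_coll=10_000, n_tokens=n)
print(f"rare collocate:     z = {z_rare:.1f}")
print(f"frequent collocate: z = {z_frequent:.1f}")
```

The rare collocate outranks the frequent one even though it co-occurs with the node only four times.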
Log-likelihood test
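As a hedged sketch, the log-likelihood statistic (G²) can be computed from a 2×2 contingency table of node and collocate counts. This version treats every token as one observation and ignores the span, which is a simplification; the function name and the counts in the usage lines are invented.

```python
import math

def log_likelihood(f_pair, f_node, f_coll, n_tokens):
    """G2 = 2 * sum(O * ln(O / E)) over the four cells of the 2x2 table."""
    observed = (
        (f_pair, f_node - f_pair),                                # node present
        (f_coll - f_pair, n_tokens - f_node - f_coll + f_pair),   # node absent
    )
    rows = [sum(r) for r in observed]
    cols = [observed[0][j] + observed[1][j] for j in range(2)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                e = rows[i] * cols[j] / n_tokens
                g2 += o * math.log(o / e)
    return 2 * g2

independent = log_likelihood(f_pair=1, f_node=100, f_coll=100, n_tokens=10_000)
associated = log_likelihood(f_pair=16, f_node=50, f_coll=20, n_tokens=1000)
print(independent, associated)
```

When the observed co-occurrence equals the expected one, the score is zero; values above 3.84 are conventionally taken as significant at p < 0.05 for one degree of freedom.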
Rank by MI3
Rank by Dice coefficient
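For comparison, MI3 and the Dice coefficient can be sketched in the same way. The formulas are the commonly cited ones (MI3 cubes the observed frequency before taking the logarithm; Dice is twice the co-occurrence frequency divided by the sum of the two word frequencies); the counts are invented and span adjustment is ignored.

```python
import math

def mi3(f_pair, f_node, f_coll, n_tokens):
    """MI3 = log2(observed**3 / expected); cubing favours frequent pairs."""
    expected = f_node * f_coll / n_tokens
    return math.log2(f_pair ** 3 / expected)

def dice(f_pair, f_node, f_coll):
    """Dice = 2 * f(node, collocate) / (f(node) + f(collocate))."""
    return 2 * f_pair / (f_node + f_coll)

# Invented counts: the same rare and frequent collocates of one node
n, f_node = 1_000_000, 1000
mi3_rare = mi3(4, f_node, 4, n)
mi3_frequent = mi3(100, f_node, 10_000, n)
print(f"MI3  rare={mi3_rare:.2f} frequent={mi3_frequent:.2f}")
print(f"Dice rare={dice(4, f_node, 4):.4f} frequent={dice(100, f_node, 10_000):.4f}")
```

On these figures MI3 ranks the frequent collocate above the rare one, the reverse of plain MI.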