Word Association Thesaurus as a Resource for Building Wordnet

Anna Sinopalnikova
Masaryk University, Brno, Czech Republic
Saint-Petersburg State University, Russia
anna@fi.muni.cz
Overview
• Types of LRs used
• What is Word Association?
• Information to be extracted from WAT
• WAT vs. Corpus
• Conclusions
• Future plans
What kind of language resources
are used to build wordnets?
• Primary resources
– e.g. text corpora
– present (more or less) ‘raw’ data on the language in use
– information is given implicitly
• Derived resources
– e.g. explanatory dictionaries, Roget-type thesauri
– present explications of internal knowledge of language
– based on primary resources + intuition
– information is given explicitly
What is better?
• To build an adequate and reliable lexical
database (e.g. a wordnet), it is not enough to rely
upon information produced by ‘experts’ (i.e.
linguists, lexicographers).
• One should rather explore the raw data, and
extract information from language in its actual
and its potential use.
• Corpora reign!
Word Association
• Association: “connection or relation between 2 entities
(perceptions, ideas or words), that manifests in a following
way: an appearance of one entity entails the appearance of
the other in the mind”
• Word Association: an appearance of one word entails the
appearance of the other in the mind
Association: examples (1)
• Kill → Bill
Association: examples (2)
• [image] → Nike
Association: examples (3)
• [image] → Kill Bill
Word Association Test
• Generally, a list of words (stimuli) is given
to subjects (either in writing or in oral
form). The subjects are asked to respond
with the first word that comes into their
mind (responses).
• Other methods: controlled association
test, priming etc.
‘Cat’ stimulates
• Dog 49, mouse 8, black 4, animal 2, eyes,
gut, kitten, tom 2, bit, Cheshire, claw,
claws, enigma, feline, furry, hearth, house,
kin, kittens, milk, pet, pussy, todd 1
(of 100 people asked)
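A minimal sketch of how such counts can be turned into relative association strengths (the share of subjects giving each response). Only the three most frequent responses from the slide are used; the function name `association_strength` is an illustrative assumption, not part of any existing tool.

```python
from collections import Counter

# Response counts for the stimulus 'cat', copied from the slide above
# (only the three most frequent responses are included here).
responses = Counter({"dog": 49, "mouse": 8, "black": 4})
n_subjects = 100  # number of people asked, as stated on the slide


def association_strength(counts, n_subjects):
    """Return the share of subjects who gave each response (hypothetical helper)."""
    return {word: count / n_subjects for word, count in counts.items()}


strength = association_strength(responses, n_subjects)
for word, value in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(f"cat -> {word}: {value:.2f}")
```

Relative rather than absolute frequencies make entries comparable across norms collected from different numbers of subjects.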
Word Association Norms (WAN)
• WAN represents the data collected through a series of WA
tests carried out according to the standard technique.
• The body of WAN: list of responses and their absolute
frequencies for each stimulus word
E.g. Kent & Rosanoff (1910): 100 stimuli, 1000 subjects
Palermo & Jenkins (1964): 200 stimuli, 1000 subjects
Word Association Thesaurus (WAT)
• WAT is a kind of WAN
• WAN and WAT differ not only in volume but also in the
procedure of data collection. For WAT the collection is cyclic: a small set of
stimuli is used as the starting point of the experiment, the responses
obtained for them are used as stimuli in the next stage, and the
cycle is repeated at least 3 times.
• Being a thesaurus, WAT is expected to cover ‘all’ the
vocabulary (all the words relevant for the language) and to
reflect the basic structure of a particular language (all the
relations between words relevant for this particular language
system).
• E.g. Kiss et al. (1972): about 54,000 words; Nelson et al. (1973-1990):
about 75,000 words; Karaulov et al. (1994-1998): 23,000 words
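The cyclic collection procedure described on this slide can be sketched as follows. This is only an illustration under stated assumptions: `ask_subjects` is a hypothetical callback standing in for a real association test, and the toy answers are invented.

```python
def collect_wat(seed_stimuli, ask_subjects, cycles=3):
    """Run the cyclic procedure: responses of one stage become stimuli of the next.

    `ask_subjects` is a hypothetical callback standing in for a real
    association test; it maps a stimulus to a {response: frequency} dict.
    """
    thesaurus = {}
    stimuli = list(seed_stimuli)
    for _ in range(cycles):
        next_stimuli = []
        for stimulus in stimuli:
            if stimulus in thesaurus:
                continue  # already tested in an earlier stage
            responses = ask_subjects(stimulus)
            thesaurus[stimulus] = responses
            for response in responses:
                if response not in thesaurus and response not in next_stimuli:
                    next_stimuli.append(response)
        stimuli = next_stimuli
    return thesaurus


# Toy stand-in for real subjects (invented data, for illustration only).
toy_answers = {
    "cat": {"dog": 49, "mouse": 8},
    "dog": {"cat": 40, "bone": 5},
    "mouse": {"cat": 30, "cheese": 12},
    "bone": {"dog": 20},
    "cheese": {"mouse": 15},
}

wat = collect_wat(["cat"], lambda s: toy_answers.get(s, {}), cycles=3)
print(sorted(wat))
```

Starting from the single seed "cat", three cycles are enough for the toy vocabulary to close on itself, which mirrors the expectation that the stimulus set eventually stabilises.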
What kind of linguistic information
could be extracted from WAT?
1) The core concepts of the language
2) Syntagmatic & paradigmatic relations between words,
presented explicitly (as opposed to text corpora)
3) Relevance of word senses for native speakers
4) Relevance of relations for native speakers
5) Domain information that is shown (as opposed to
dictionaries)
6) Semantic classification of words obtained by using
formal criteria
The core concepts of the language
• In every language there is a finite number of words that appear
as responses more frequently than other words. This set is quite
stable:
– it does not change much over time;
– it does not depend on the starting circumstances, e.g. on the words that
were chosen as stimulus words
Russian: ‘man’, ‘house’, ‘love’, ‘life’, ‘be/eat’, ‘think’, ‘live’, ‘go’,
‘big/large’, ‘good’, ‘bad’, ‘no/not’...
295 words with more than 100 relations
English: man, sex, no (not), love, house; work, eat, think, go, live;
good, old, small…
586 words with more than 100 relations
• Cf. EuroWordNet Basic Concepts
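A minimal sketch of how such a core set could be extracted from a WAT. It assumes that the "number of relations" of a word means the number of distinct stimuli that elicit it as a response; the slides do not define the measure, so this is an assumption, as are the function name and the invented toy data.

```python
from collections import defaultdict


def core_concepts(wat, min_relations):
    """Return words that occur as a response to more than `min_relations` stimuli."""
    incoming = defaultdict(set)
    for stimulus, responses in wat.items():
        for response in responses:
            incoming[response].add(stimulus)
    return {w: len(s) for w, s in incoming.items() if len(s) > min_relations}


# Invented miniature thesaurus: stimulus -> {response: frequency}.
toy_wat = {
    "father": {"man": 12, "mother": 30},
    "woman": {"man": 41, "love": 3},
    "boy": {"man": 9, "girl": 50},
    "heart": {"love": 20, "blood": 7},
}

print(core_concepts(toy_wat, min_relations=1))
```

On a full thesaurus the threshold would be 100, as on the slide; the toy example uses 1 only because the data set is tiny.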
Syntagmatic relations
E.g. Cat -> black, Cheshire, pussy;
Cat -> mat, nip, purr
• Law of contiguity: through life we learn “what goes together” and
reproduce it together
• Right and left contexts of a word
• Help to acquire:
– Selectional preferences, valency frames
– Semantic relations between words (e.g. ROLE/INVOLVED)
– Distinguishing different senses of a word
– Establishing relations of synonymy, hyponymy, and antonymy
Cf. text corpora
Paradigmatic relations
E.g. Cat-> dog, mouse, animal, pet;
Cat-> eyes, claw
• Synonyms, hyponyms/hyperonyms/co-hyponyms,
meronyms/holonyms, or antonyms
• Law of contiguity???
• Help us to acquire:
– This information may be included directly as
semantic relations between wordnet entries
– It also helps us to enrich and to verify the set of
relations encoded earlier
Classifying verbs according to the
number of their syntagmatic
associations
[Figure: histogram. x-axis: number of syntagmatic associations (0.33 to 0.83); y-axis: number of verbs]
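A rough sketch of how a distribution like the one on this slide could be computed. It assumes that each verb's associations have already been labelled as syntagmatic or paradigmatic and that the x-axis shows the share of syntagmatic associations; neither the labelling method nor the bin width is given in the slides, and the counts below are invented.

```python
from collections import Counter


def syntagmatic_share(n_syntagmatic, n_paradigmatic):
    """Share of a verb's associations that are syntagmatic."""
    return n_syntagmatic / (n_syntagmatic + n_paradigmatic)


def histogram(shares, width_percent=5):
    """Count verbs per bin; bins are labelled by their lower bound in percent.

    Shares are rounded to whole percent before binning.
    """
    bins = Counter()
    for share in shares:
        percent = round(share * 100)
        bins[percent - percent % width_percent] += 1
    return dict(sorted(bins.items()))


# Invented counts: verb -> (syntagmatic, paradigmatic) associations.
toy_verbs = {"buy": (60, 40), "cry": (45, 55), "go": (62, 38), "think": (80, 20)}
shares = [syntagmatic_share(s, p) for s, p in toy_verbs.values()]
print(histogram(shares))
```

Verbs falling into the same bin would form one class under this purely formal criterion.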
Domain information
• E.g., hospital –> nurse, doctor, pain, ill,
injury, load…
• This type of data is not so easily
extracted from corpora; explanatory
dictionaries present it only partly
• It is crucial when wordnets are used
in information retrieval (IR).
Relevance of word senses for
native speakers
• WAT: for each word, 80% of associations are related to 1-3 of its
senses.
• Cf. corpus: 90% of occurrences of a word
• That allows us:
– to measure the relevance of a particular word sense for native
speakers;
– to find an appropriate place for it in the hierarchy of senses;
– to define the necessary level of sense granularity: to include in a
wordnet no more and no fewer senses of each word than native
speakers actually differentiate.
• Problem: emotionally coloured senses are thus overestimated.
E.g. Russian дать (‘to give’) → в рожу (‘in the face’, i.e. ‘to punch someone in the face’)
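One way to operationalise sense relevance is to rank the senses of a word by the number of associations attached to each and to keep the smallest set that reaches the 80% mark mentioned above. The sketch below assumes the associations have already been sense-tagged by hand; the sense labels, counts, and function name are invented for illustration.

```python
def senses_covering(sense_counts, threshold=0.8):
    """Return the most frequent senses that together reach `threshold` of all associations."""
    total = sum(sense_counts.values())
    covered = 0
    selected = []
    for sense, count in sorted(sense_counts.items(), key=lambda kv: -kv[1]):
        if covered >= threshold * total:
            break
        selected.append(sense)
        covered += count
    return selected


# Invented, manually sense-tagged association counts for one word.
toy_senses = {"bank/finance": 55, "bank/river": 30, "bank/row": 10, "bank/tilt": 5}
print(senses_covering(toy_senses))
```

Senses left out of the returned list are candidates for merging or omission when deciding on sense granularity, subject to the caveat about emotionally coloured senses.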
Relevance of relations for
native speakers
• It is clear that in a wordnet a word must have at least
a hyperonym and, preferably, a synonym.
• Other relations???
• Relations differ not only across PoS
but also among different
words within the same PoS.
E.g. buy CONVERSIVE sell, while cry INVOLVED_AGENT baby.
WAT vs. Corpus
• Compare a corpus to WAT:
Wetter & Rapp (1996), Willners (2001): correlation
between the co-occurrence frequency of word X and word Y in a corpus
and the strength of the association word X-word Y in WAT.
• Compare WAT to a corpus?
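A minimal sketch of the corpus-to-WAT comparison: a plain Pearson correlation between corpus co-occurrence counts and WAT response counts for the same word pairs. The cited studies may have used different measures; the choice of Pearson's r and all numbers below are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented numbers: word pair -> (corpus co-occurrence count, WAT response count).
toy_pairs = {
    ("cat", "dog"): (120, 49),
    ("cat", "mouse"): (35, 8),
    ("cat", "black"): (20, 4),
    ("cat", "milk"): (6, 1),
}
cooccurrence = [c for c, _ in toy_pairs.values()]
association = [a for _, a in toy_pairs.values()]
r = pearson(cooccurrence, association)
print(f"Pearson r = {r:.3f}")
```

Because both kinds of counts are heavily skewed, a rank-based coefficient might be a more robust choice on real data.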
WAT vs. Corpus (2)
Coverage: 64% of word associations do not occur in the corpus
[Figure: number of word pairs (0 to 4000) plotted against absolute frequency in the corpus (0 to 10)]
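A coverage figure of this kind can be obtained by checking, for every stimulus-response pair in the WAT, whether the two words ever co-occur in the corpus. The sketch below assumes a simple token window as the co-occurrence criterion; the slides do not state the window size or the tokenisation used, and the toy corpus and pairs are invented.

```python
def cooccurrence_count(tokens, a, b, window=5):
    """Count positions of `a` that have `b` within `window` tokens on either side."""
    count = 0
    for i, token in enumerate(tokens):
        if token == a:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if b in context:
                count += 1
    return count


def missing_share(pairs, tokens, window=5):
    """Share of association pairs that never co-occur in the corpus."""
    missing = [p for p in pairs if cooccurrence_count(tokens, *p, window=window) == 0]
    return len(missing) / len(pairs)


# Invented toy corpus and association pairs, for illustration only.
toy_corpus = "the black cat sat on the mat while the dog slept".split()
toy_assoc_pairs = [("cat", "black"), ("cat", "mat"), ("cat", "mouse"), ("cat", "milk")]
print(missing_share(toy_assoc_pairs, toy_corpus))
```

Keeping the per-pair counts rather than only the share would give the frequency distribution shown in the figure.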
WAT vs. Corpus (3)
N of occurrences in the corpus | N of occurrences in RWAT | % of all word associations missing
0                              | 2                        | 48
0                              | 3                        | 22
0                              | 4                        | 14
0                              | 5                        | 8
0                              | 6-10                     | 5
0                              | 11-15                    | <1
0                              | 15-20                    | <1
0                              | > 20                     | 0
Table 1. Distribution of word associations that do not occur in the corpus.
NB: it is mostly syntagmatic word associations that are missing, not paradigmatic ones.
Conclusions
• The advantages of using WAT in wordnet construction:
– Great variety of linguistic information extracted.
WAT equals or excels other LRs in several respects.
– ‘Raw’ data (as opposed to theoretical data, cf. conventional
dictionaries, which involve the researcher’s introspection and
intuition and hence lead to over- and underestimation of language phenomena).
WAT is comparable to a balanced text corpus and could supply
all the necessary empirical information when no such corpus is
available.
– Probabilistic nature of the data presented (the data reflect the relative
rather than absolute relevance of language phenomena).
• Parallel use of WAT and other LRs is an effective way of:
– continuously checking wordnet construction,
– refining the wordnet and
– expanding the wordnet
Future plans
• WAT vs. Corpus vs. Wordnet
– Czech: small – large – middle
– English: large – large – large
– Russian: large – middle – small
Thank you…