
Information Access I
Information Representation
and Text Searching
GSLT, Göteborg, September 2003
Barbara Gawronska, Högskolan i Skövde
Requirements on Information Representation:

- Discriminating power
- Descriptive power
- Similarity identification
- Ambiguity minimization
- Conciseness

Those requirements may collide...
Traditional descriptors

- Classification codes (e.g. Universal Decimal Classification)
- Subject headings
- Key words

Problems:
- Standardized lists of subject headings needed
- Different spelling conventions
- Morphology: inflectional and derivational, compounding
- Semantic relations

Strategies for linking related words and phrases

Different spelling conventions: spelling checkers; in proper names: counting the number of identical letters or identical bigrams (letter pairs). Could be improved by adding some phonological knowledge (metathesis etc.)

  BARBARA GAWRONSKA
  BARBRO GRAVONSKA
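For illustration, a minimal sketch (not part of the original slides) of the bigram-overlap idea applied to the two spellings above; the Dice-style scoring and the function names are choices made here, not the lecture's:

```python
def bigrams(name: str) -> list[str]:
    """Letter pairs (bigrams) of a name, spaces ignored."""
    s = name.replace(" ", "")
    return [s[i:i + 2] for i in range(len(s) - 1)]

def bigram_similarity(a: str, b: str) -> float:
    """Share of bigrams the two names have in common (2 * common / total)."""
    ba, bb = bigrams(a.upper()), bigrams(b.upper())
    remaining = bb.copy()
    common = 0
    for bg in ba:
        if bg in remaining:
            common += 1
            remaining.remove(bg)
    return 2 * common / (len(ba) + len(bb))

# A fairly high score suggests the two strings are variants of the same name.
print(bigram_similarity("BARBARA GAWRONSKA", "BARBRO GRAVONSKA"))
```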

Relations on the morphological level:

Truncation: finding the common initial part of a string; no language-specific morphological knowledge needed. Problem: too many unrelated words may pass through.

  ren#:  renen, renar, rena, rent, renad...
  ren$$: renen, renar, renad...

  (# matches any number of additional characters, $ matches exactly one)
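A minimal sketch of truncation matching, reading "#" as any number of additional characters and "$" as exactly one additional character, as in the example above; the operator semantics are inferred from the example, not stated in the slides:

```python
def truncation_match(pattern: str, word: str) -> bool:
    """Match a truncation pattern against a word: a trailing '#' allows any number
    of extra characters, each trailing '$' stands for exactly one extra character."""
    if pattern.endswith("#"):
        return word.startswith(pattern.rstrip("#"))
    stem = pattern.rstrip("$")
    n_wild = len(pattern) - len(stem)
    return word.startswith(stem) and len(word) == len(stem) + n_wild

words = ["renen", "renar", "rena", "rent", "renad", "renovering"]
print([w for w in words if truncation_match("ren#", w)])
print([w for w in words if truncation_match("ren$$", w)])
# "renovering" (renovation) is unrelated to "ren" but still passes ren# -
# the overgeneration problem mentioned above.
```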
Strategies for linking related words and phrases (2)

Lemmatization: identifying the lexical form

Stemming: a strategy between truncation and lemmatization.
The general principle for English (Lovins 1968, Paice 1990): remove the ending, and transform the ending of the remaining string, if needed.

Stemming algorithms are language-dependent; consider e.g. Indonesian:

  infinitive   active
  tawar        menawar   "bargain"
  pikir        memikir   "think"
  beri         memberi   "give"
  sewa         menyewa   "rent"
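A minimal sketch of the English suffix-stripping principle stated above (remove the ending, then transform the remaining string if needed); the rule lists below are invented for illustration and are not Lovins' or Paice's actual rules:

```python
# Invented, heavily simplified rule lists - for illustration only.
SUFFIXES = ["ations", "ation", "ness", "ying", "ied", "ies", "ing", "ed", "es", "s"]
REPAIRS = {"stud": "study", "appl": "applic"}  # transform the remaining string if needed

def stem(word: str) -> str:
    """Strip the first (longest) matching suffix, then apply a repair rule if one exists."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            word = word[: -len(suf)]
            break
    return REPAIRS.get(word, word)

for w in ["applications", "applied", "studies", "studying", "darkness"]:
    print(w, "->", stem(w))
# "applications" and "applied" conflate to "applic"; "studies" and "studying" to "study".
```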
Strategies for linking related words and phrases (3)

Multi-word entries: context operators (sketched in code after the examples), e.g.

- exact distance between words
  retrieval$information: matches "retrieval of information", "retrieval with information loss"
- maximal distance between words
  text##retrieval: matches "text retrieval", "text and data retrieval"
- unspecified word order
  information#,retrieval: matches "information retrieval", "retrieval of information"

+ word pair co-occurrence rate
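A minimal sketch of the context operators above, under the interpretation read off the examples (exact distance, maximal distance, unspecified word order); the function names and the token-based distance measure are choices made here:

```python
def positions(word: str, text: str) -> list[int]:
    """Token positions at which `word` occurs in `text`."""
    return [i for i, t in enumerate(text.lower().split()) if t == word]

def co_occur(w1: str, w2: str, text: str, max_gap: int, ordered: bool = True) -> bool:
    """True if w1 and w2 occur with at most `max_gap` intervening words;
    with ordered=False the word order is left unspecified."""
    for p1 in positions(w1, text):
        for p2 in positions(w2, text):
            if ordered and p2 <= p1:
                continue
            gap = abs(p2 - p1) - 1
            if 0 <= gap <= max_gap:
                return True
    return False

doc = "retrieval of information and text and data retrieval"
print(co_occur("retrieval", "information", doc, max_gap=1))                 # retrieval$information
print(co_occur("text", "retrieval", doc, max_gap=2))                        # text##retrieval
print(co_occur("information", "retrieval", doc, max_gap=3, ordered=False))  # information#,retrieval
```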
Strategies for linking related words and phrases (4)

Semantic relations: thesauri, lexicons, semantic nets as tools for term expansion; some examples:

- ERIC Thesaurus of Descriptors (the Dialog Corporation)
- Roget's Thesaurus
- KL-ONE
- WordNet...

Normally used relations: broader/narrower term, related term, synonym, "used for"/"use" (identifies a preferred synonym); even entailment (WordNet), role (KL-ONE). A small term-expansion sketch follows.
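A minimal sketch of term expansion over a hand-built toy thesaurus using the relations just listed; the entries and the relation names used as dictionary keys are invented for illustration (a real system would consult ERIC, Roget's, WordNet, etc.):

```python
# Hand-built toy thesaurus - invented entries for illustration only.
THESAURUS = {
    "cat": {"narrower": ["domestic cat", "wildcat"], "synonym": ["felid"]},
    "domestic cat": {"broader": ["cat"], "narrower": ["Siamese cat", "Angora"]},
    "moggy": {"use": ["domestic cat"]},   # "use" points to the preferred synonym
}

def expand(term: str, relations=("synonym", "narrower", "use")) -> set[str]:
    """Expand a query term with the terms reachable via the selected relations."""
    expanded = {term}
    for rel in relations:
        expanded.update(THESAURUS.get(term, {}).get(rel, []))
    return expanded

print(expand("cat"))     # {'cat', 'felid', 'domestic cat', 'wildcat'}
print(expand("moggy"))   # {'moggy', 'domestic cat'}
```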
Thesauri: Top-down classification (monohierarchy)

  cat
    wildcat
      Panther
      Ocelot
    domestic cat
      Siamese cat
      Angora
Thesauri: Polyhierarchy

  domestic cat     mouser
          \          /
           \        /
            Angora
Thesauri: Polydimensional hierarchy

  domestic cat
    mouser
    stray
    Burmese cat
    Siamese cat
    Angora
Thesauri: Polydimensional hierarchy

  domestic cat
    Function/Life style:
      mouser
      stray
    Breed:
      Angora
      Siamese cat
      Burmese cat
Thesauri: WordNet, some problems

cat: feline mammal usually having thick soft fur and being unable to roar; domestic cats; wildcats
  - domestic cat: any domesticated member of the genus Felis
  - wildcat: any small or medium-sized cat resembling the domestic cat and living in the wild
Thesauri: WordNet, some problems

cat: feline mammal usually having thick soft fur and being unable to roar; domestic cats; wildcats
  - domestic cat: any domesticated member of the genus Felis
      - female cat
      - a long-haired breed similar to the Persian cat
      - a slender short-haired blue-eyed breed of cat having a pale coat with dark ears, paws, face and tail tip
      - Siamese cat having a bluish cream body and dark gray points
      - a cat proficient at mousing
      - a long-haired breed of cat
      - a short-haired breed with body similar to the Siamese cat but having a solid dark brown or gray coat
      - a short-haired bluish-gray cat breed
      - a small slender short-haired breed of African origin having brownish fur with a reddish undercoat
      - homeless cat
      - ...
Thesauri: WordNet, some problems

cat: feline mammal usually having thick soft fur and being unable to roar...
  - wildcat: any small or medium-sized cat resembling the domestic cat and living in the wild
      - widely distributed wildcat of Africa and Asia Minor
      - long-bodied long-tailed tropical American wildcat
      - small spotted wildcat found from Texas to Brazil
      - bushy-tailed European wildcat resembling the domestic tabby and regarded as the ancestor of the domestic cat
      - medium-sized wildcat of Central and South America having a dark-striped coat
      - small Asiatic wildcat
      - a desert-dwelling wildcat
      - ...
      - short-tailed wildcats with usually tufted ears; valued for their fur
          - of northern Eurasia
          - of southern Europe
          - small lynx of North America
          - of deserts of northern Africa and southern Asia
          - of northern North America
Thesauri: Bottom-up classification

  Attribute A: size         A1: middle   A2: small   A3: big
  Attribute B: fur          B1: short    B2: long
  Attribute C: colour       C1: pale     C2: dark    C3: striped
  Attribute D: eye colour   D1: blue     D2: green
Finding significant words

Significance as a function of rank (Luhn 1958)

  [Figure: word significance plotted against word frequency rank, log-log scale]
A simple frequency-based indexing method: frequent words
– stop list + truncation/conflating
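A minimal sketch of the frequency-based indexing method just described (frequent words, minus stop-list words, conflated by truncation); the stop list and the truncation length are arbitrary choices for illustration:

```python
from collections import Counter

STOP_LIST = {"the", "of", "and", "a", "to", "in", "is", "for", "with", "that", "their"}

def index_terms(text: str, top_n: int = 5, stem_len: int = 6) -> list[tuple[str, int]]:
    """Most frequent truncated word forms, with stop-list words removed."""
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    stems = [t[:stem_len] for t in tokens if t and t not in STOP_LIST]
    return Counter(stems).most_common(top_n)

sample = ("Many brewers start to plan their Oktoberfest brewing. Brewing traditional "
          "Maerzen beers, and brewing Oktoberfest-like ales.")
print(index_terms(sample))
```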
Finding significant words (2)

Term weighting (Salton & McGill 1983)
The "Tf x idf" method (term frequency weighted by inverse document frequency):

  Weight_{ij} = Frequency_{ij} \cdot \log_2\left(\frac{n}{DocFreq_j}\right)

where n is the total number of documents and DocFreq_j is the number of documents in which term j occurs.

"Tf x idf" can be combined with similarity measures, e.g. the vector space model.
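A minimal sketch of the Tf x idf weighting defined above, with log base 2 and document frequency as in the formula; the toy corpus is invented:

```python
import math
from collections import Counter

def tf_idf(corpus: list[str]) -> list[dict[str, float]]:
    """Weight_ij = Frequency_ij * log2(n / DocFreq_j) for every term j in every document i."""
    tokenized = [doc.lower().split() for doc in corpus]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log2(n / doc_freq[t]) for t in tf})
    return weights

corpus = [
    "text retrieval and information retrieval",
    "information access and text searching",
    "brewing traditional beers",
]
for doc_weights in tf_idf(corpus):
    print(doc_weights)   # terms concentrated in few documents get higher weights
```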
Similarity measures

- Models for comparing texts normally make use of the words the texts have in common
- Some models also utilize the size of the documents and/or the number of words the texts do not have in common
Similarity measures (2)
DocSim1D1, D 2    t D1 j t D 2 j 
T
i 1
t D1 j = THE WEIGHT OF AN OCCURENCE OF TERM j IN DOCUMENT i
T = THE MAXIMUM NUMBER OF TERMS IN BOTH DOCUMENTS
COMBINED
No attention is paid to the size of a document
Similarity measures (3)
Dice's coefficient:

  DocSim_2(D_1, D_2) = \frac{2 \sum_j t_{D_1 j} \, t_{D_2 j}}{\sum_j t_{D_1 j} + \sum_j t_{D_2 j}}

Jaccard's coefficient:

  DocSim_3(D_1, D_2) = \frac{\sum_j t_{D_1 j} \, t_{D_2 j}}{\sum_j t_{D_1 j} + \sum_j t_{D_2 j} - \sum_j t_{D_1 j} \, t_{D_2 j}}
Similarity measures (4)
The cosine coefficient
(the cosine of the angle between two vectors)
DocSim4D1, D 2  
 t
 t
2
D1 j
D1 j
t D 2 j 
 t
2

1
D2 j
2
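A minimal sketch of the inner product, Dice, Jaccard, and cosine coefficients over term-weight dictionaries; the binary example weights are invented:

```python
import math

def dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Inner product of two term-weight vectors (DocSim_1)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def dice(a, b):
    return 2 * dot(a, b) / (sum(a.values()) + sum(b.values()))

def jaccard(a, b):
    return dot(a, b) / (sum(a.values()) + sum(b.values()) - dot(a, b))

def cosine(a, b):
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot(a, b) / norm

# Binary term weights (1.0 = the term occurs in the document)
d1 = {"text": 1.0, "retrieval": 1.0, "information": 1.0}
d2 = {"information": 1.0, "retrieval": 1.0, "access": 1.0, "searching": 1.0}
print(dot(d1, d2), dice(d1, d2), jaccard(d1, d2), cosine(d1, d2))
```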
Similarity measures (5)

- Clustering by similarity matrices (Jaccard's coefficient applied to attribute/value matrices)
- Document signature matching (documents coded into very compact binary representations, so-called signatures)
- Discriminator words (Williams 1963): the discrimination coefficient ascribes high values to words that occur with a probability much different from the mean probability
- Latest advances in document clustering – wait for Hercules Dalianis' lecture!
Which words should count as
common to both documents?
As summer turns to fall, many brewers start to plan their Oktoberfest brewing. This
installment of "Brewing in Styles" looks at the materials and techniques used for brewing
traditional and modern Maerzen beers and offers some radical tips for brewing
Oktoberfest-like ales. Ein prosit!
Several people called in response to the last installment of "Brewing in Styles"
("American Wheat," BrewingTechniques 1 [1], May/June 1993) to say that they were
confused because many pubs and micros in the Midwest brew wheat beers in the
traditional German manner, complete with the 4-vinylguaiacol clovelike character. Many
fine German-style Weizenbiers are brewed in America.
*****************************************************************************************************
Republished from BrewingTechniques' July/August 1993.
What to do with that unfortunate mistake of a recipe? Design another beer that is out of
balance in an opposite and complementary way.
It invariably happens, even to the best of us. The beer that should have been so good
ends up out of balance and undrinkable. Not being the type to accept less-than-perfect
products graciously, I decided to take a page from the Belgian book of brewing.
Belgian brewers have long used the practice of blending to even out inconsistent, wild
fermentations
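A minimal sketch of one way to decide which words count as common to two texts such as the excerpts above: lower-case, drop stop-list words, and conflate by crude truncation; the stop list and truncation length are arbitrary choices for illustration:

```python
import re

STOP_LIST = {"the", "of", "and", "a", "an", "to", "in", "is", "for", "that",
             "this", "with", "have", "their", "out", "even"}

def content_stems(text: str, stem_len: int = 6) -> set[str]:
    """Lower-cased, stop-listed, crudely truncated word forms of a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t[:stem_len] for t in tokens if t not in STOP_LIST}

doc1 = "Many brewers start to plan their Oktoberfest brewing of traditional Maerzen beers."
doc2 = "Belgian brewers have long used the practice of blending to even out wild fermentations."
print(content_stems(doc1) & content_stems(doc2))   # the shared content-word stems
```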
Relevance estimation

- The Retrieval Status Value (rsv): the measure of closeness between the query and the document
- In strictly Boolean systems: 0 or 1
- Fuzzy (weighted) Boolean retrieval: values between 0 and 1; however, "false drops" are very probable because of the definition of the retrieval functions:

  QR_3 = QR_1 \cap QR_2  \Rightarrow  rsv_{i,3} = \min(rsv_{i,1}, rsv_{i,2})
  QR_3 = QR_1 \cup QR_2  \Rightarrow  rsv_{i,3} = \max(rsv_{i,1}, rsv_{i,2})
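A minimal sketch of the min/max evaluation of fuzzy Boolean queries defined above; the documents and rsv values are invented, and the example shows how the min rule can rank weak-but-balanced evidence above strong evidence on a single term:

```python
def rsv_and(rsv1: float, rsv2: float) -> float:
    """Fuzzy Boolean AND: the rsv of the intersection is the minimum of the two values."""
    return min(rsv1, rsv2)

def rsv_or(rsv1: float, rsv2: float) -> float:
    """Fuzzy Boolean OR: the rsv of the union is the maximum of the two values."""
    return max(rsv1, rsv2)

# Per-term rsv values for two documents (invented for illustration)
doc_a = {"text": 0.9, "retrieval": 0.1}
doc_b = {"text": 0.4, "retrieval": 0.4}

# Query: text AND retrieval
print(rsv_and(doc_a["text"], doc_a["retrieval"]))   # 0.1
print(rsv_and(doc_b["text"], doc_b["retrieval"]))   # 0.4 - doc_b outranks doc_a
```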
Relevance estimation (2)

The vector space model: the closeness of the query and the document vectors, computed using some of the previously mentioned similarity measures (Dice, Jaccard, or cosine)

  rsv_{i,k} = \frac{\sum_{j=1}^{MT} dtw_{i,j} \cdot qtw_{j,k}}{\text{normalization factor}}

  dtw_{i,j}  the weight of term j in document i
  qtw_{j,k}  the weight of term j in query k
  MT         the total number of terms
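A minimal sketch of the vector-space rsv above; the slide leaves the normalization factor unspecified, so cosine normalization is assumed here, and the weights are invented:

```python
import math

def rsv_vector_space(dtw: dict[str, float], qtw: dict[str, float]) -> float:
    """rsv_ik = sum over terms of dtw_ij * qtw_jk, divided by a (cosine) normalization factor."""
    inner = sum(w * dtw.get(t, 0.0) for t, w in qtw.items())
    norm = math.sqrt(sum(v * v for v in dtw.values()) * sum(v * v for v in qtw.values()))
    return inner / norm if norm else 0.0

doc_weights = {"text": 1.2, "retrieval": 2.0, "brewing": 0.7}
query_weights = {"information": 1.0, "retrieval": 1.5}
print(rsv_vector_space(doc_weights, query_weights))
```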
Relevance estimation (3)

The probabilistic model (a feedback model):

  qtw_{j,k} = \frac{r_j \,(MD - n_j - R_k + r_j)}{(n_j - r_j)(R_k - r_j)}

  qtw_{j,k}  query term weight
  r_j        the number of relevant documents in which the term occurs
  MD         the total number of documents
  n_j        the number of documents in which the term occurs
  R_k        the total number of documents that are relevant for query q_k (a non-trivial problem!)

  rsv_{i,k} = \sum_{j \in Q_k \cap D_i} qtw_{j,k}
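A minimal sketch of the probabilistic query term weight and rsv above; the feedback statistics are invented, and note that the formula as given can divide by zero (real systems add smoothing constants):

```python
def query_term_weight(r_j: int, n_j: int, R_k: int, MD: int) -> float:
    """qtw_jk = r_j * (MD - n_j - R_k + r_j) / ((n_j - r_j) * (R_k - r_j))."""
    return r_j * (MD - n_j - R_k + r_j) / ((n_j - r_j) * (R_k - r_j))

def rsv(doc_terms: set[str], query_terms: set[str], qtw: dict[str, float]) -> float:
    """rsv_ik = the sum of query term weights over the terms shared by query and document."""
    return sum(qtw[t] for t in query_terms & doc_terms)

# Invented feedback statistics: 100 documents in total, 10 judged relevant for the query
MD, R_k = 100, 10
qtw = {
    "retrieval": query_term_weight(r_j=8, n_j=20, R_k=R_k, MD=MD),
    "brewing":   query_term_weight(r_j=1, n_j=30, R_k=R_k, MD=MD),
}
print(rsv({"retrieval", "text"}, {"retrieval", "brewing"}, qtw))   # only "retrieval" contributes
```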