Presentation (PPT file) - Svetlin Nakov

Automatic Acquisition of Synonyms Using the Web as a Corpus
3rd Annual South East European Doctoral Student Conference (DSC2008): Infusing Knowledge and Research in South East Europe
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
nakov@fmi-uni-sofia.bg
DSC 2008 – 26-27 June 2008, Thessaloniki, Greece
Introduction
• We want to automatically extract all pairs of synonyms in a given text
• Our goal:
  - Design an algorithm that can distinguish between synonyms and non-synonyms
• Our approach:
  - Measure semantic similarity using the Web as a corpus
  - Synonyms are expected to have higher semantic similarity than non-synonyms

The Paper in One Slide
• Measuring semantic similarity
  - Analyze the words' local contexts
  - Use the Web as a corpus
  - Similar contexts → similar words
  - TF.IDF weighting & reverse context lookup
• Evaluation
  - 94 words (Russian fine arts terminology)
  - 50 synonym pairs to be found
  - 11pt average precision: 63.16%

Contextual Web Similarity
• What is local context?
  - A few words before and after the target word
      Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers.
  - The words in the local context of a given word are semantically related to it
  - Stop words need to be excluded: prepositions, pronouns, conjunctions, etc.
    - Stop words appear in all contexts
  - A sufficiently large corpus is needed

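The local-context idea above can be sketched in a few lines of Python. This is only an illustration, not the implementation behind the talk: the function name `local_context`, the tiny English `STOP_WORDS` set and the choice to drop stop words after taking the window are all assumptions (the talk itself used a list of 507 Russian stop words and lemmatization, neither of which is reproduced here).

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"of", "and", "from", "our", "by", "for", "the", "a", "in"}

def local_context(text, target, window=3, stop_words=STOP_WORDS):
    """Count the words found within `window` positions of each occurrence of `target`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            word = tokens[j]
            # Skip the target occurrence itself and any stop word.
            if j != i and word not in stop_words:
                counts[word] += 1
    return counts

snippet = ("Same day delivery of fresh flowers, roses, and unique gift baskets "
           "from our online boutique. Flower delivery online by local florists "
           "for birthday flowers.")
context = local_context(snippet, "flowers")
print(context)
```

Because no lemmatization is applied in this sketch, the singular form "Flower" in the snippet is not treated as an occurrence of "flowers".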
Contextual Web Similarity
• Web as a corpus
  - The Web can be used as a corpus to extract the local context of a given word
    - The Web is the largest possible corpus
    - Contains large corpora in any language
  - Searching for a word in Google can return up to 1 000 text snippets
    - The target word is given along with its local context: a few words before and after it
    - The target language can be specified

Contextual Web Similarity
• Web as a corpus
  - Example: Google query for "flower"
      Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...
        Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.
      Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...
        Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.
      Flowers, plants, roses, & gifts. Flowers delivery with fewer ...
        Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.

Contextual Web Similarity
• Measuring semantic similarity
  - For two given words, their local contexts are extracted from the Web
    - Each context is a set of words and their frequencies
  - Semantic similarity is measured as the similarity between these local contexts
  - Local contexts are represented as frequency vectors over a given set of words
  - The cosine between the frequency vectors in Euclidean space is calculated

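A minimal sketch of the cosine measure described above, assuming the frequency vectors are stored sparsely as Python dictionaries (word to count). The function name `cosine` and the toy counts are invented for illustration; they are not data from the experiments.

```python
import math

def cosine(v1, v2):
    """Cosine of the angle between two sparse frequency vectors (dict: word -> count)."""
    dot = sum(count * v2.get(word, 0) for word, count in v1.items())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty context is treated as similar to nothing
    return dot / (norm1 * norm2)

# Toy counts, invented for illustration only.
flower = {"fresh": 3, "rose": 4}
bouquet = {"fresh": 6, "rose": 8}
computer = {"pc": 5, "web": 2}

print(cosine(flower, bouquet))   # same direction, so the cosine is at its maximum
print(cosine(flower, computer))  # no shared context words
```

Words that occur in only one of the two contexts contribute nothing to the dot product but still lengthen the norms, so contexts with little overlap score close to zero.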
Contextual Web Similarity
• Example of context word frequencies
    word: flower            word: computer
    word       count        word         count
    fresh      217          Internet     291
    order      204          PC           286
    rose       183          technology   252
    delivery   165          order        185
    gift       124          new          174
    welcome     98          Web          159
    red         87          site         146
    ...        ...          ...          ...

Contextual Web Similarity
• Example of frequency vectors
       #   word        v1: flower (freq.)   v2: computer (freq.)
       0   alias       3                    7
       1   alligator   2                    0
       2   amateur     0                    8
       3   apple       5                    133
     ...   ...         ...                  ...
    4999   zap         0                    3
    5000   zoo         6                    0
• Similarity = cosine(v1, v2)

TF.IDF Weighting
• TF.IDF (term frequency times inverse document frequency)
  - A statistical measure used in information retrieval
  - Shows how important a certain word is for a given document in a set of documents
  - Increases proportionally to the number of occurrences of the word in the document
  - Decreases as the number of documents containing the word grows

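The slide does not say which TF.IDF variant was used, so the following is a sketch under an assumption: the common form weight = tf × log(N / df), where each target word's local context plays the role of a "document". The function name `tf_idf` and the toy counts are hypothetical.

```python
import math

def tf_idf(contexts):
    """Reweight raw context counts with tf * log(N / df).

    contexts maps each target word to a dict {context word: raw count}.
    """
    n_docs = len(contexts)
    # Document frequency: in how many contexts each word appears.
    df = {}
    for counts in contexts.values():
        for word in counts:
            df[word] = df.get(word, 0) + 1
    return {
        target: {w: tf * math.log(n_docs / df[w]) for w, tf in counts.items()}
        for target, counts in contexts.items()
    }

# Toy counts, invented for illustration only.
contexts = {
    "flower": {"rose": 4, "order": 2},
    "computer": {"pc": 5, "order": 3},
}
weights = tf_idf(contexts)
print(weights["flower"])
```

In this toy data "order" occurs in every context, so its weight drops to zero, while "rose" keeps a positive weight because it is specific to one context.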
Reverse Context Lookup
• Local context extracted from the Web can contain arbitrary parasite words such as "online", "home", "search", "click", etc.
  - Internet terms appear on almost any Web page
  - Such words are not likely to be associated with the target word
• Example (for the word "flowers")
  - "send flowers online", "flowers here", "order flowers here"
  - Will the word "flowers" appear in the local context of "send", "online" and "here"?

Reverse Context Lookup
• If two words are semantically related, then
  - Both of them should appear in the local contexts of each other
• Let #(x, y) = number of occurrences of x in the local context of y
• For any word w and a word wc from its local context, we define their strength of semantic association p(w, wc) as follows:
  - p(w, wc) = min{ #(w, wc), #(wc, w) }
• We use p(w, wc) as vector coordinates
• We introduce a minimal occurrence threshold (e.g. 5) to filter out words appearing just by chance

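The definition of p(w, wc) above can be sketched as follows. The function name `reverse_context_vector`, the nested-dictionary layout and the toy counts are assumptions made for illustration. Applying the threshold to p is also an assumption, although it is equivalent to requiring both counts to reach the threshold, since p is their minimum.

```python
def reverse_context_vector(word, occurrences, threshold=5):
    """Build the coordinates p(w, wc) = min(#(w, wc), #(wc, w)) for one word.

    occurrences[y][x] holds #(x, y): how often x occurs in the local context of y.
    Coordinates below `threshold` are dropped.
    """
    vector = {}
    for wc, forward in occurrences.get(word, {}).items():
        # forward is #(wc, w); backward is #(w, wc), the reverse lookup.
        backward = occurrences.get(wc, {}).get(word, 0)
        p = min(forward, backward)
        if p >= threshold:
            vector[wc] = p
    return vector

# Toy counts, invented for illustration only.
occurrences = {
    "flowers": {"roses": 40, "online": 25, "here": 12},
    "roses": {"flowers": 30},
    "online": {"flowers": 2},
    "here": {},
}
vector = reverse_context_vector("flowers", occurrences)
print(vector)
```

Here "online" and "here" occur often around "flowers", but "flowers" rarely or never occurs around them, so both are filtered out and only "roses" survives.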
Data Set
• We use a list of 94 Russian words:
  - Terms extracted from texts on the subject of fine arts
  - Limited to nouns only
• The data set:
    абрис, адгезия, алмаз, алтарь, амулет, асфальт, беломорит, битум, бородки, ваятель, вермильон, ..., шлифовка, штихель, экспрессивность, экспрессия, эстетизм, эстетство
• There are 50 synonym pairs among these words
  - We expect our algorithms to find them

Experiments
• We tested several modifications of our contextual Web similarity algorithm
  - Basic algorithm (without modifications)
  - TF.IDF weighting
  - Reverse context lookup with different frequency thresholds

Experiments
• RAND – random ordering of all the pairs
• SIM – the basic algorithm for extracting semantic similarity from the Web
  - Context size of 3 words
  - Without analyzing the reverse context
  - With lemmatization
• SIM+TFIDF – modification of the SIM algorithm with TF.IDF weighting
• REV2, REV3, REV4, REV5, REV6, REV7 – the SIM algorithm + "reverse context lookup" with frequency thresholds of 2, 3, 4, 5, 6 and 7

Resources Used
• We used the following resources:
  - Google Web search engine: we extracted the first 1 000 results for 82 645 Russian words
  - Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata
  - A list of 507 Russian stop words

Evaluation
• Our algorithms rank all pairs of words by their semantic similarity
  - We expect the 50 synonym pairs to be at the top of the result list
• We count how many synonyms are found in the top N results (e.g. top 5, top 10, etc.)
  - We measure precision and recall
• We measure 11pt average precision to evaluate the results

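The slides do not define 11pt average precision, so the sketch below assumes the standard information-retrieval definition: interpolated precision is taken at the recall levels 0.0, 0.1, ..., 1.0 and the eleven values are averaged. The function name and the toy ranking are hypothetical.

```python
def eleven_point_average_precision(relevant_flags, total_relevant):
    """11-point interpolated average precision of a ranked list.

    relevant_flags[i] is True when the pair at rank i + 1 is a synonym pair;
    total_relevant is the number of synonym pairs that exist in total.
    """
    points = []  # (hits so far, rank) after each position in the ranking
    hits = 0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
        points.append((hits, rank))

    total = 0.0
    for k in range(11):  # recall levels k / 10
        # Interpolated precision: the best precision at any rank whose
        # recall (hits / total_relevant) is at least k / 10.
        reachable = [h / r for h, r in points if h * 10 >= k * total_relevant]
        total += max(reachable, default=0.0)
    return total / 11

# Toy ranking, invented for illustration: 2 synonym pairs exist, found at ranks 1 and 3.
score = eleven_point_average_precision([True, False, True, False], total_relevant=2)
print(f"{score:.4f}")
```

The recall comparison is done with integers (`h * 10 >= k * total_relevant`) to avoid floating-point edge cases at the recall levels.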
SIM Algorithm – Results
    n   Word 1           Word 2        Semantic similarity   Synonyms   Precision @n   Recall @n
    1   выжигание        пирография    0.433805              yes        100.00%        2%
    2   тонирование      тонировка     0.382357              yes        100.00%        4%
    3   гематит          кровавик      0.325138              yes        100.00%        6%
    4   подрамок         подрамник     0.271659              yes        100.00%        8%
    5   оливин           перидот       0.252256              yes        100.00%        10%
    6   полирование      шлифование    0.220559              no         83.33%         10%
    7   полировка        шлифовка      0.216347              no         71.43%         10%
    8   амулет           талисман      0.200595              yes        75.00%         12%
    9   пластификаторы   мягчители     0.170770              yes        77.78%         14%
    ... ...              ...           ...                   ...        ...            ...
Precision and recall obtained by the SIM algorithm

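The precision and recall columns of the SIM results follow directly from the yes/no judgements and the total of 50 synonym pairs. The sketch below recomputes them for the nine ranks listed in the results; the function name `precision_recall_at_n` is an assumption.

```python
def precision_recall_at_n(relevant_flags, total_relevant):
    """Return a list of (n, precision@n, recall@n) for each prefix of a ranked list."""
    rows = []
    hits = 0
    for n, is_synonym in enumerate(relevant_flags, start=1):
        hits += is_synonym
        rows.append((n, hits / n, hits / total_relevant))
    return rows

# Synonym judgements for ranks 1-9 as listed in the SIM results.
sim_top9 = [True, True, True, True, True, False, False, True, True]
rows = precision_recall_at_n(sim_top9, total_relevant=50)
for n, precision, recall in rows:
    print(f"{n}  {precision:.2%}  {recall:.0%}")
```

For example, at n = 6 five of the six top pairs are synonyms, giving a precision of 5/6 and a recall of 5/50.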
Comparison of the Algorithms
    Algorithm      1     5    10    20    30    40    50   100   200   Max
    RAND           0   0.1   0.1   0.2   0.3   0.4   0.6   1.1   2.3    50
    SIM            1     5     8    15    18    23    25    39    48    50
    SIM+TFIDF      1     4     8    16    22    27    29    43    48    50
    REV2           1     4     8    16    21    27    32    42    43    46
    REV3           1     4     8    16    20    28    32    41    42    46
    REV4           1     4     8    15    20    28    33    41    42    45
    REV5           1     4     8    15    20    28    33    40    41    42
    REV6           1     4     8    15    22    28    32    39    40    42
    REV7           1     4     8    15    21    27    30    37    39    40
Comparison of the algorithms (number of synonyms in the top results)

Comparison of the Algorithms (11pt Average Precision)
[Figure: bar chart of 11pt average precision (vertical axis from 0.00% to 70.00%) for RAND, SIM, SIM+TFIDF and REV2 … REV7. Data labels recovered from the chart: 63.16%, 58.98% and 1.15%; REV2 … REV7 are marked n/a.]
Comparing RAND, SIM, SIM+TFIDF and REV2 … REV7

Results (Precision-Recall Graph)
[Figure: recall-precision graphs of the evaluated algorithms]
Comparing the recall-precision graphs of the evaluated algorithms

Discussion
• Our approach is original because it:
  - Measures semantic similarity automatically
  - Uses the Web as a corpus
  - Does not rely on any preexisting corpora
  - Does not require semantic resources like WordNet and EuroWordNet
  - Works for any language
    - Tested for Bulgarian and Russian
  - Uses reverse context lookup and TF.IDF
    - Significant improvement in quality

Discussion
• Good accuracy, but still far from 100%
• Known problems of the proposed algorithms:
  - Semantically related words are not always synonyms
    - red – blue
    - wood – pine
    - apple – computer
  - Similar contexts do not always mean similar words (distributional hypothesis)
  - The Web as a corpus introduces noise
    - Google returns the first 1 000 results only

Discussion
• Known problems of the proposed algorithms:
  - Google ranks news portals, travel agencies and retail sites higher than books, articles and forum messages
  - Local context always contains noise
  - The algorithms work with single words and do not capture phrases

Conclusion and Future Work
• Conclusion
  - Our algorithms can distinguish between synonyms and non-synonyms
  - Accuracy should be improved
• Future Work
  - Additional techniques to distinguish between synonyms and semantically related words
  - Improve the algorithm for measuring semantic similarity

Automatic Acquisition of Synonyms Using the Web as a Corpus
Questions?