Identifying Sets of Related Words from the World Wide Web
Thesis Defense 06/09/2005
Pratheepan (Prath) Raveendranathan
Advisor: Ted Pedersen
Outline
• Introduction & Objective
• Methodology
• Experimental Results
• Conclusion
• Future Work
• Demo
Introduction
• The goal of my thesis research is to use the World Wide Web as a source of information to identify sets of words that are related in meaning.
– Example: given two words {gun, pistol},
a possible set of related words would be
{handgun, holster, shotgun, machine-gun, weapon, ammunition, bullet, magazine}
– Example: given three words {toyota, nissan, ford},
a possible set of related words would be
{honda, gmc, chevy, mitsubishi}
Examples Cont…
– Example: given two words {red, yellow},
a possible set of related words would be
{white, black, blue, colors, green}
– Example: given two words {George Bush, Bill Clinton},
a possible set of related words would be
{Ronald Reagan, Jimmy Carter, White House, Presidents, USA, etc.}
Application
• Use sets of related words to classify the Semantic Orientation of reviews.
(Peter Turney)
• Use sets of related words to find the sentiment associated with a particular product.
(Rajiv Vaidyanathan and Praveen Agarwal)
Pros and Cons of Using the Web
• Pros
– Huge amounts of text
– Diverse text
• Encyclopedias, publications, commercial web pages
– Dynamic (ever-changing state)
• Cons
– The Web creates a unique set of challenges:
– Dynamic (ever-changing state)
• News websites, blogs
– Presence of repetitive, noisy, or low-quality data
• HTML tags, web lingo (home page, information, etc.)
Contributions
• Developed an algorithm that predicts sets of related words by using pattern matching techniques and frequency counts.
• Developed an algorithm that predicts sets of related words by using a relatedness measure.
• Developed an algorithm that predicts sets of related words by using a relatedness measure and an extension of the Log Likelihood score.
• Applied sets of related words to the problem of Sentiment Classification.
Outline
• Introduction & Objective
• Methodology
• Experimental Results
• Conclusion
• Future Work
• Demo
Interface to Web - Google
– Reasons for using Google
• This research is very much dependent on both the quantity and quality of the Web content.
• Google has a very effective ranking algorithm called PageRank, which attempts to give more important or higher quality web pages a higher ranking.
• Google API - an interface which allows programmers to query more than 8 billion web pages using the Google search engine. (http://www.google.com/apis/)
Problems with Google API
• Restricted to 1000 queries a day
• 10 results for each query
• No "near" operator (proximity based search)
• Maximum 1000 results
• Alternative
– Yahoo API - 5000 queries a day (released very recently)
• No "near" operator either
• Cannot retrieve number of hits
Note: Google was used only as a means of retrieving information from the Web.
Key Idea behind Algorithms
• Words that are related in meaning often tend to occur together.
– Example,
"A Springfield, MA, Chevrolet, Ford, Honda, Lexus, Mazda, Nissan, Saturn, Toyota automotive dealer with new and pre-owned vehicle sales and leasing"
Algorithm 1
• Features
• Based on frequency
• Takes only single words as input
• Initial set of 2 words
• Frequency cutoff
• Ranked by frequency
• Smart stop list - the, if, me, why, you, etc. (non-content words)
• Web stop list
– web page, WWW, home, page, personal, url, information, link, text, decoration, verdana, script, javascript
Algorithm 1 - High Level Description
1. Create queries to Google based on the input terms.
2. Retrieve the top N web pages for each query.
a. Parse the retrieved web page content for each query.
3. Tokenize the web page content into lists of words and frequencies.
a. Discard words that occur fewer than C times.
4. Find the words common to at least two of the sets of words. This set of intersecting words is the set of words related to the input terms (see the sketch after this list).
5. Repeat the process for I iterations, using the set of related words from the previous iteration as input.
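A minimal sketch of the frequency cutoff and intersection steps in Perl (the language Google-Hack is written in). This is an illustration only, assuming the per-query word counts have already been built; the hash layout and names are hypothetical, not the actual Google-Hack internals.

  use strict;
  use warnings;

  # %$counts maps each query to a hash of word => frequency built from the
  # fetched pages (fetching, link traversal, and HTML stripping omitted).
  sub related_set {
      my ($counts, $cutoff) = @_;
      my (%num_sets, %total);
      for my $query (keys %$counts) {
          while (my ($word, $freq) = each %{ $counts->{$query} }) {
              next if $freq < $cutoff;   # Step 3a: frequency cutoff
              $num_sets{$word}++;        # how many query sets contain the word
              $total{$word} += $freq;
          }
      }
      # Step 4: keep words common to at least two of the per-query sets
      return { map { $_ => $total{$_} } grep { $num_sets{$_} >= 2 } keys %num_sets };
  }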
Algorithm 1 Trace 1
• Search Terms: S1 = {pistol, gun}
• Frequency Cutoff - 15
• Num Results (Web Pages) - 10
• Iterations - 2
Algorithm 1 - Step 1
1. Create queries to Google based on permutations of the input terms (see the sketch after this list):
– gun
– gun AND pistol
– pistol
– pistol AND gun
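Generating the queries is a simple permutation of the input terms; a small Perl sketch, assuming single-word inputs:

  my @terms = ('gun', 'pistol');
  my @queries = @terms;                              # each term on its own
  for my $a (@terms) {
      for my $b (@terms) {
          push @queries, "$a AND $b" if $a ne $b;    # ordered pairs: "gun AND pistol", ...
      }
  }
  print join("\n", @queries), "\n";   # 2 + 2 = 4 queries, as on this slide

With 11 input terms this gives 11 + 110 = 121 queries, matching the count on the Iteration 2 slide.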
Algorithm 1 - Step 2
2. Issue each query to Google:
a. Retrieve the top 10 URLs for the query.
b. For each URL, retrieve the web page content, and parse the web page for more links.
c. Traverse these links and retrieve the content of those web pages as well.
Repeat this process for each query.
Trace 1 Cont…
• Web pages for the query gun
http://www.thesmokinggun.com/
http://www.gunbroker.com/
http://www.gunowners.org/
http://www.ithacagun.com/
http://www.doublegun.com/
http://www.imdb.com/title/tt0092099/
http://www.imdb.com/Title?0092099
http://www.gunandgame.com/
http://www.gunaccessories.com/
http://www.guncite.com/
Trace 1 Cont…
• Web pages for the query pistol
http://www.idpa.com/
http://www.bullseyepistol.com/
http://www.crpa.org/
http://www.zvis.com/dep/dep.shtml
http://www.nysrpa.org/
http://www.auspistol.com.au/
http://hubblesite.org/newscenter/newsdesk/archive/releases/1997/33/
http://en.wikipedia.org/wiki/Pistol
http://www.imdb.com/title/tt0285906/
http://www.fas.org/man/dod-101/sys/land/m9.htm
Trace 1 Cont…
• Web pages for the query gun AND pistol
http://www.usgalco.com/
http://www.minirifle.co.uk/
http://www.dypic.com/gunsafepistol.html
http://www.datacity.com/handgun-pistol-case.html
http://www.camping-hunting.com/
http://www.pelican-case.com/pelguncaspis.html
http://www.cafepress.com/4funnystuff/566642
http://www.nimmocustomarms.com/
http://www.bullseyegunaccessories.com/
http://www.airsoftshogun.com/P_224.htm
Trace 1 Cont…
• Web pages for the query pistol AND gun
http://www.safetysafeguards.com/
http://www.safetysafeguards.com/site/402168/page/57955
http://www.safetysafeguards.com/site/402168/page/57959
http://www.airguns-online.co.uk/
http://www.dypic.com/gunsafepistol.html
http://www.airgundepot.com/eaa-drozd.html
http://www.docs.state.ny.us/DOCSOlympics/Combat.htm
http://www.datacity.com/handgun-pistol-case.html
http://www.sail.qc.ca/catalog/detail.jsp?id=2880
http://portfolio-pro.com/pistolhandgun.html
Algorithm 1 - Step 3
3. Next, for the total web page content retrieved for each query:
a. Remove HTML tags etc. and retrieve the text.
b. Remove stop words.
c. Tokenize the web page content into lists of words and frequencies (see the sketch below).
Note: This results in 4 sets of words, each set representing the words retrieved for one query.
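Steps 3b-3c amount to a word-count pass; a minimal Perl sketch, assuming the page text has already been stripped of HTML and the stop lists (abbreviated here) are in %stop:

  use strict;
  use warnings;

  my %stop = map { $_ => 1 } qw(the if me why you www url);   # abbreviated stop lists

  sub word_counts {
      my ($text) = @_;
      my %freq;
      for my $word (split /\W+/, lc $text) {
          next if $word eq '' || $stop{$word};   # Step 3b: remove stop words
          $freq{$word}++;                        # Step 3c: word => frequency
      }
      return \%freq;
  }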
Words from Web Pages after Removing Stop Words (word, frequency)

gun: shotgun 15, mounts 21, daily 33, holsters 27, parts 15, systems 24, control 31, cases 33, bullets 17, reloading 16, military 19, rifle 21, care 20, grips 31, knives 44, tactical 24, stocks 23, optics 29, shooting 19, scope 16, accessories 53

pistol: shooting 25, dep 16, eagle 20, desert 19, crpa 17

gun AND pistol: hobbies 18, chelmsford 15, rifle 120, pelican 56, pistols 35, auto 18, practical 69, club 56, shotgun 79, holster 15, trigger 24, foam 27, ipsc 18, cases 56, case 82, shooting 123, essex 30, target 22, hobby 18, bullets 22, ruger 38, airsoft 28, ukpsa 22, sport 28, clubs 19, safe 29, semi 18, range 19, guns 72, mini 25, bullet 42, shoot 31, forum 18, advertise 16, pictures 17, dealers 17, riffles 22, firearms 22, ammo 23

pistol AND gun: electronic 24, option 60, biometric 20, hspace 24, menus 15, ddd 21, guns 38, middle 740, cases 35, shoes 16, safes 62, airsoft 50, vspace 18, soft 22, null 1051, travel 15, diversion 21, air 70, rifle 29, family 59, shopping 16, case 37, silver 17, hand 66, normal 48, technical 16, imgcounter 27, security 20, small 17, members 19, catalog 371, category 370, order 17, auto 20, addtab 30, paintball 20, pro 36, safety 53, boots 24, false 30, safe 70, money 15, uploaded 17, fingerprint 27, accessories 59
Algorithm 1 - Step 4
4. Find the words that are common to at least 2 sets.
Let,
A. gun AND pistol
B. pistol AND gun
C. gun
D. pistol
Related Set = (A ∩ B) ∪ (A ∩ C) ∪ (A ∩ D) ∪ (B ∩ C) ∪ (B ∩ D) ∪ (C ∩ D)
Related Set 1 – Iteration 1
Result Set 1
rifle , 177
shooting , 169
case , 127
accessories , 126
cases , 124
guns , 123
safe , 100
shotgun , 97
airsoft , 78
auto , 41
bullets , 40
Trace 1 Cont… Iteration 2
• 11 input terms
– Search terms created:
• Rifle
• Shooting
• Guns
• Cases
• Airsoft
• Shooting AND Guns
• Guns AND Shooting
• Guns AND Cases
etc.
Results in 11² = 121 queries to Google!
Note: As you can see, the number of queries to Google increases drastically.
Result Set 2 – {gun, pistol}
pistols,227
firearms,205
accessories,204
free, 192
holsters,172
club,170
target,164
tactical,161
air,158
practical,152
range,150
court,149
uk,147
sports,145
law,143
price,142
full,140
control,140
soft,124
military,121
custom,120
holster,118
fits,118
shoot,117
sport,115
hours,109
usa,109
ammo,107
electric,107
ships,106
spring,103
articles,96
carry,95
ruger,93
force,92
mp,90
remote,90
car,89
harlow,88
magazines,87
belt,86
mini,82
tac,79
radio,77
paintball,75
assault,71
teflon,70
pouch,69
number,69
shoulder,69
leg,64
core,62
essex,60
nylon,57
flash,55
bullets,53
trigger,50
straps,46
helicopter,45
riffles,44
coat,44
ukpsa,44
Algorithm 1 - {red, yellow}
Number of Results - 10
Frequency Cutoff - 15
Iterations - 1
enterprise , 411
software , 257
solutions , 151
management , 142
technology , 141
system , 96
services , 89
netherlands , 84
fellow , 76
applications , 71
snake , 70
performance , 64
scarlet , 62
project , 34
organizations , 33
organization , 29
coral , 28
black , 28
blue , 27
Related Words: scarlet, coral, black, blue
Problems with Algorithm 1
• Frequency-based ranking
• Number of input terms restricted to 2
• Input and output restricted to single words
Algorithm 2
• Features
• Based on frequency & relatedness score
• Takes single words or 2-word collocations as input
• Relatedness measure based on Jiang and Conrath
• Frequency cutoff and relatedness score cutoff
• Ranked by score
• Initial set can be more than 2 words
• Bigrams as output
• Smart stop list
– the, if, me, why, you, etc.
• Web stop words + phrases
– web page, WWW, home page, personal, url, information, link, text, decoration, verdana, script, javascript
Algorithm 2 - High Level Description
1. Repeat the same steps as in Algorithm 1 to retrieve an initial set of related words (add bigrams to the results as well).
2. For each word returned by Algorithm 1 as a related word,
a. Calculate the relatedness of the word to the input terms.
b. Discard any word or bigram with a relatedness score greater than the score cutoff.
c. Sort the remaining terms from most related to least related (see the sketch after this list).
3. Repeat Steps 1-2 for each iteration, using the set of words from the previous iteration as input.
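A minimal Perl sketch of steps 2b-2c. The shotgun, rifle, and guns scores are taken from the {gun, pistol} slide later in this deck; the control score is made up here purely to show the cutoff (lower scores mean more related):

  my %score = ( shotgun => 16.40, rifle => 18.01, guns => 28.42, control => 31.20 );
  my $score_cutoff = 30;
  my @related = sort { $score{$a} <=> $score{$b} }    # step 2c: most related first
                grep { $score{$_} <= $score_cutoff }  # step 2b: drop scores above the cutoff
                keys %score;
  print "@related\n";   # shotgun rifle guns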
Relatedness Measure (Distance Measure)
• Relatedness(Word1, Word2) =
log(hits(Word1)) + log(hits(Word2)) - 2 × log(hits(Word1 AND Word2))
(Based on the measure by Jiang and Conrath)
• Example 1,
hits(toyota) = 12,500,000
hits(ford) = 22,900,000
hits(toyota AND ford) = 50,000
Relatedness(toyota, ford) = 32.41
• Example 2,
hits(toyota) = 12,500,000
hits(ford) = 22,900,000
hits(toyota AND ford) = 150,000
Relatedness(toyota, ford) = 30.82
Relatedness Measure Cont…
• Example 3,
hits(toyota) = 1000
hits(ford) = 1000
hits(toyota AND ford) = 1000
Relatedness(toyota, ford) = 0
As the measure approaches zero, the relatedness between the two terms increases.
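A minimal Perl sketch of the measure, assuming the three hit counts have already been retrieved; base-2 logarithms are an assumption, since the formula on the previous slide only says "log":

  use strict;
  use warnings;

  sub relatedness {
      my ($hits1, $hits2, $hits_both) = @_;
      return undef if $hits1 <= 0 || $hits2 <= 0 || $hits_both <= 0;  # avoid log(0)
      my $log2 = sub { log($_[0]) / log(2) };   # Perl's log() is the natural log
      return $log2->($hits1) + $log2->($hits2) - 2 * $log2->($hits_both);
  }

  print relatedness(1000, 1000, 1000), "\n";   # Example 3: identical counts give 0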
Input Set - {gun, pistol}
Algorithm 1
shooting , 169
guns , 124
rifle , 113
case , 81
accessories , 74
cases , 74
airsoft , 72
products , 68
bullet , 53
air , 50
shotgun , 46
holsters , 46
ammo , 37
bullets , 34
Algorithm 2
shotgun , 16.40
rifle , 18.01
holster , 19.31
ammo , 19.61
shooting , 22.21
bullets , 22.80
air , 24.88
holsters , 25.04
airsoft , 25.79
gun cases , 26.02
accessories , 26.99
guns , 28.42
equipment , 29.32
remington , 29.37
Algorithm 2 - {red, yellow}
Number of Results - 10
Frequency Cutoff - 10
Score Cutoff - 30
Iterations - 1
blue , 16.77
black , 17.07
scarlet , 24.91
coral , 28.97
Problems with Algorithm 2
• Certain bigrams are not good collocations.
– For example,
{sunny, cloudy}
Number of Results - 10
Frequency Cutoff - 15
Bigram Cutoff - 4
Score Cutoff - 30
clear , 24.35
partly cloudy , 25.85
forecast text , 26.66
partly sunny , 26.92
light , 27.33
bulletin fpcn , 28.33
wind , 28.84
winds , 29.22
Algorithm 3 - High Level Description
1. Repeat the same steps as in Algorithm 1 to retrieve an initial set of related words (add bigrams to the results as well).
2. For each term returned by Algorithm 1 as a related word,
a. If the term is a bigram, check whether it is a valid collocation:
– if the bigram is a valid collocation, continue with step 2b;
– else remove the term from the set of related words.
b. Calculate the relatedness of the word to the input terms.
c. Discard any word or collocation with a relatedness score greater than the score cutoff.
d. Sort the remaining terms from most related to least related.
Verifying Bigrams
• Adapt the Log Likelihood (G2) score to web hit counts
– Example, "New York"
– 4 queries to Google: "New York", "New *", "* York", "of the"

            York   Not York   Total
New          607       2953    3560
Not New       14       2096    2110
Total        621       5049    5670
Expected Values

                         York                           Not York
New      (621 × 3560) / 5670 = 389.9048   (5049 × 3560) / 5670 = 3170.0952
Not New  (621 × 2110) / 5670 = 231.0952   (5049 × 2110) / 5670 = 1878.9048
Identifying a "Bad" Collocation
• A bigram is discarded if,
– the observed value for the bigram is 0 (e.g., "New York"), or
– the observed value for the bigram is less than the expected value (see the sketch below).
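A Perl sketch of the verification, assuming the four web counts are arranged as on the "New York" slides: n11 is the bigram's own count, n1p and np1 the marginal counts from the wildcard queries, and npp the sample-size estimate from "of the". The G2 formula is the standard one; the slides themselves only show the expected values and the discard rule.

  use strict;
  use warnings;

  # Standard G2 over the 2x2 table reconstructed from the marginals.
  sub log_likelihood {
      my ($n11, $n1p, $np1, $npp) = @_;
      my @obs = ($n11, $n1p - $n11, $np1 - $n11, $npp - $n1p - $np1 + $n11);
      my @exp = map { $_->[0] * $_->[1] / $npp }
                ([$n1p, $np1], [$n1p, $npp - $np1],
                 [$npp - $n1p, $np1], [$npp - $n1p, $npp - $np1]);
      my $g2 = 0;
      for my $i (0 .. 3) {
          $g2 += 2 * $obs[$i] * log($obs[$i] / $exp[$i]) if $obs[$i] > 0;
      }
      return $g2;
  }

  # Discard rule from this slide: observed count of 0, or below expectation.
  sub keep_bigram {
      my ($n11, $n1p, $np1, $npp) = @_;
      my $e11 = $n1p * $np1 / $npp;            # e.g. (3560 * 621) / 5670 = 389.90
      return $n11 > 0 && $n11 >= $e11;
  }

  print keep_bigram(607, 3560, 621, 5670) ? "keep\n" : "discard\n";   # keep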
Example Bigrams
(table of example bigrams; see the Algorithm 3 - Bigrams table below)
Outline
• Introduction & Objective
• Methodology
• Experimental Results & Evaluation
• Conclusion
• Future Work
• Demo
Evaluating Results
• Compare with Google Sets
– http://labs.google.com/sets
• Human subject experiments
– Around 20 people expanded 2-word sets into what they felt were sets of related words
F-measure, Precision and Recall
Precision = |Returned ∩ Gold| / |Returned|
Recall = |Returned ∩ Gold| / |Gold|
F-measure = (2 × Precision × Recall) / (Precision + Recall)
(see the sketch below)
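A small Perl sketch of the evaluation, assuming the algorithm's output is scored against Google Sets as the gold standard; the case-insensitive matching and helper name are illustrative:

  use strict;
  use warnings;

  sub precision_recall_f {
      my ($returned, $gold) = @_;
      my %gold = map { lc($_) => 1 } @$gold;
      my $overlap = grep { $gold{ lc $_ } } @$returned;   # |Returned ∩ Gold|
      my $p = @$returned ? $overlap / @$returned : 0;
      my $r = @$gold     ? $overlap / @$gold     : 0;
      my $f = ($p + $r) ? 2 * $p * $r / ($p + $r) : 0;
      return ($p, $r, $f);
  }

  # {toyota, ford, nissan} slide: 6 of 11 returned words appear in the 11-word gold set
  my @hack = qw(mazda honda chevrolet bmw dodge lexus mitsubishi pontiac mercedes gmc vehicles);
  my @sets = qw(honda mazda subaru mitsubishi dodge chevrolet jeep volvo buick pontiac suzuki);
  printf "P=%.2f R=%.2f F=%.2f\n", precision_recall_f(\@hack, \@sets);   # 6/11 each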
Comparison of Algorithm 1 & 2
Algorithm 1: {toyota, ford}
Frequency Cutoff - 5
truck , 66
car , 61
sales , 59
parts , 46
vehicles , 45
year , 43
cars , 35
auto , 32
motors , 30
general , 27
company , 24
honda , 20
service , 20
automotive , 18
nissan , 18
trucks , 17
consumer , 17
detroit , 13
marketing , 13
volvo , 12
media , 12
buyers , 12
focus , 11
Algorithm 2: {toyota, ford}
Frequency Cutoff - 5, Score Cutoff - 30
gm , 19.09
nissan , 20.15
car , 29.77
Algorithm 2: {toyota, ford, nissan}
Frequency Cutoff - 5, Score Cutoff - 30
mazda , 19.59
honda , 19.92
chevrolet , 21.37
bmw , 22.47
dodge , 22.83
lexus , 23.05
mitsubishi , 23.17
pontiac , 23.89
mercedes , 24.56
gmc , 25.14
vehicles , 27.77
Algorithm 1 - {jordan, chicago}
Number of Results - 10
Frequency Cutoff - 15
Iterations - 1
Google Hack:
michael , 174
bulls , 148
nba , 97
game , 56
jersey , 43
Google Sets:
Chicago
Jordan
Israel
JOHNSON
Jackson
Kuwait
JANESVILLE
Iraq
Japan
Lebanon
Egypt
Springfield
Precision = 0, Recall = 0, F-measure = 0
Algorithm 2 - {toyota, ford, nissan}
Number of Results - 10
Frequency Cutoff - 10
Score Cutoff - 30
Iterations - 1
Google Hack:
mazda , 19.59
honda , 19.92
chevrolet , 21.37
bmw , 22.47
dodge , 22.83
lexus , 23.05
mitsubishi , 23.17
pontiac , 23.89
mercedes , 24.56
gmc , 25.14
vehicles , 27.77
Google Sets:
HONDA
MAZDA
SUBARU
MITSUBISHI
DODGE
CHEVROLET
Jeep
Volvo
Buick
Pontiac
Suzuki
Human Subject:
benz
buick
subaru
mitsubishi
dodge
chevrolet
jeep
volvo
buick
pontiac
suzuki
holden
mitsubishi
Precision = 6/11 = 0.54, Recall = 6/11 = 0.54, F-measure = 0.54
Algorithm 2 - {january, february, may}
Number of Results - 10
Frequency Cutoff - 10
Score Cutoff - 30
Iterations - 1
Google Hack:
june , 22.90
july , 24.39
august , 25.33
september , 25.50
march , 25.71
october , 26.21
november , 27.09
april , 27.49
december , 27.61
Google Sets:
March
April
June
October
November
December
September
July
August
Precision = 9/9 = 1, Recall = 9/9 = 1, F-measure = 1
Algorithm 2 - {armani, versace}
Number of Results - 10
Frequency Cutoff - 10
Bigram Cutoff - 4
Score Cutoff - 30
Iterations - 1
Google Hack:
prada , 18.17
moschino , 18.45
gucci , 18.60
dkny , 19.00
valentino , 19.72
chanel , 19.93
gianni , 20.12
hugo boss , 20.17
calvin klein , 20.29
gianni versace , 20.46
dolce gabbana , 21.76
calvin , 21.97
yves saint , 22.10
dior , 22.37
yves , 22.62
giorgio armani , 23.04
hugo , 23.06
fendi , 24.12
giorgio , 24.64
christian dior , 24.86
Google Sets (not the entire set):
Gucci
Chanel
Calvin Klein
Prada
Dolce Gabbana
Fendi
Hugo Boss
Christian Dior
Hermes
Moschino
Donna Karan
Ralph Lauren
Valentino
Louis Vuitton
Giorgio Armani
DKNY
Escada
Tommy Hilfiger
Tiffany
Givenchy
Precision = 11/20 = 0.55, Recall = 11/43 = 0.25, F-measure = 0.35
Algorithm 2 - {artificial intelligence, machine learning}
Number of Results - 10
Frequency Cutoff - 10
Bigram Cutoff - 4
Score Cutoff - 32
Iterations - 1
Google Hack:
neural networks , 20.88
robotics , 21.14
neural , 21.60
data mining , 22.84
expert systems , 22.90
expert , 24.24
genetic algorithms , 24.30
reasoning , 24.40
logic programming , 24.40
natural language , 24.87
intelligent , 25.68
knowledge , 25.89
logic , 26.18
data , 26.21
natural , 26.23
genetic , 26.33
applications , 26.60
computer , 27.91
knowledge discovery , 28.91
ai , 29.16
case based , 29.83
computer science , 30.21
reinforcement learning , 31.17
Google Sets:
Neural Networks
Robotics
Knowledge Representation
Natural Language Processing
Pattern Recognition
Machine Vision
Programming Languages
Data Mining
Genetic Programming
Vision
Natural Language
Intelligent Agents
People
Publications
Philosophy
Qualitative Physics
Speech Processing
Expert Systems
Genetic Algorithms
Computer Vision
Computational Linguistics
Cognitive Science
Logic Programming
Precision = 9/23 = 0.39, Recall = 9/48 = 0.1875, F-measure = 0.25
Comparison of Algorithm 2 & 3 - {sunny, cloudy}
Number of Results - 10
Frequency Cutoff - 10
Bigram Cutoff - 4
Score Cutoff - 30
Iterations - 1
Algorithm 2:
clear , 24.35
partly cloudy , 25.85
forecast text , 26.66
partly sunny , 26.92
light , 27.33
bulletin fpcn , 28.33
wind , 28.84
winds , 29.22
Algorithm 3:
clear , 24.35
partly cloudy , 25.85
partly sunny , 26.92
light , 27.33
wind , 28.84
winds , 29.22
Algorithm 3 - Bigrams
{artificial intelligence, machine learning}

Bigram                  Observed Value   Expected Value   Log Likelihood Score
neural networks                 617000        144620.81              954551.64
morgan kaufmann                 138000          5067.61              692428.92
pattern recognition             419000        248456.79              102193.35
genetic algorithms              129000         75014.81               32818.00
grammatical inference             4590          1474.92                4214.81
based learning                  861000      13804761.90                      0
computer science              12700000      27947089.94                      0
ai magazine                      99500        340178.13                      0
ai programming                    8050        197317.46                      0
based reasoning                  46300        676825.39                      0
case based                      150000      12690476.19                      0
data mining                    1160000       1424162.25                      0
expert systems                  165000       3705114.63                      0
intelligence machine              3650        587160.49                      0
Performance of Algorithms
• F-measure increases from Algorithm 1 to 3

             Algorithm 1   Algorithm 2   Algorithm 3
F-measure           0.06          0.26          0.29
Sentiment Classification
• Pointwise Mutual Information - Information Retrieval Algorithm (PMI-IR) - Peter Turney
– Used to classify reviews as being positive or negative in orientation
• Part-of-speech tag the review
• Extract 2-word phrases from the text
– Adjective followed by a noun
– Noun followed by a noun, etc.
• Use a positive connotation such as "excellent" and a negative connotation such as "poor", and calculate the Semantic Orientation (SO) for each 2-word phrase
Example,
• Let the phrase be "incredible cast"
SO("incredible cast") =
log2 [ (hits("incredible cast" NEAR "excellent") × hits("poor")) / (hits("incredible cast" NEAR "poor") × hits("excellent")) ]
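A minimal Perl sketch of the SO computation, assuming the four hit counts have already been fetched into a hash; the query-string keys and the counts in the example are illustrative, not the actual Google-Hack API:

  use strict;
  use warnings;

  # $hits: hash ref of query string => web hit count (already retrieved)
  sub semantic_orientation {
      my ($phrase, $pos, $neg, $hits) = @_;
      my $num = $hits->{"$phrase NEAR $pos"} * $hits->{$neg};
      my $den = $hits->{"$phrase NEAR $neg"} * $hits->{$pos};
      return undef unless $num && $den;   # avoid log(0) and division by zero
      return log($num / $den) / log(2);   # log base 2, as on this slide
  }

  # Made-up counts, just to exercise the formula:
  my %hits = (
      'incredible cast NEAR excellent' => 80,
      'incredible cast NEAR poor'      => 20,
      'excellent'                      => 1_000_000,
      'poor'                           => 2_000_000,
  );
  printf "SO = %.2f\n", semantic_orientation('incredible cast', 'excellent', 'poor', \%hits);
  # SO = log2((80 * 2e6) / (20 * 1e6)) = 3.00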
Problem with Current Algorithm
• Words such as “poor” have at least two senses
– “poor” as in poverty
– “poor” as in not good
Extended PMI-IR
• Used Google instead of AltaVista
• Used AND instead of NEAR
• Extended the SO formula
– Use multiple pairs of positive and negative connotations (see the sketch below)
• {excellent, poor}, {good, bad}, {great, mediocre}
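A sketch of the extension in Perl, reusing the semantic_orientation() helper from the previous slide. Combining the pairs with a simple average is my assumption here; the slide only says multiple pairs are used:

  my @pairs = ( [ 'excellent', 'poor' ], [ 'good', 'bad' ], [ 'great', 'mediocre' ] );

  sub extended_so {
      my ($phrase, $hits) = @_;
      my ($sum, $n) = (0, 0);
      for my $pair (@pairs) {
          my $so = semantic_orientation($phrase, @$pair, $hits);
          next unless defined $so;    # skip pairs with zero hit counts
          $sum += $so;
          $n++;
      }
      return $n ? $sum / $n : 0;      # positive orientation if greater than 0
  }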
A Negative Review for the Movie "Planet of the Apes"
(review screenshot) Classified by our Algorithm as being Negative

Positive Review for an Audi
(review screenshot) Classified by our Algorithm as being Positive

Negative Movie Review
(review screenshot) Classified by our Algorithm as being Negative
Performance of Extended PMI-IR
• Algorithm run on 20 reviews (movies and automobiles)

                   Classified as Positive   Classified as Negative   Total
Positive Reviews                        5                        5      10
Negative Reviews                        0                       10      10
Total                                   5                       15      20

• Overall Accuracy - 75%
End Result: Google-Hack
• All of this is available freely on CPAN and Sourceforge
Conclusions & Contribution
• Developed 3 Algorithms that try to predict sets of
related words
– Algorithm 1 was based on frequency
– Algorithm 2 was based on a relatedness measure
– Algorithm 3 was based on a relatedness measure and
the Log Likelihood score
• Applied sets of related words to Sentiment
Classification
Conclusions & Contribution
• Released the free Perl package Google-Hack on CPAN and Sourceforge.
• Developed a web interface.
Future Work
• Addition of a proximity operator
• Restrict the # of web pages traversed
• Find the intersection of words through different search engines - Yahoo API
• Use anchor text
Related URLs
• Research Page
– http://www.d.umn.edu/~rave0029/research
• Google-Hack
– http://google-hack.sf.net
• CPAN Release
– http://search.cpan.org/~prath/WebService-GoogleHack0.15/GoogleHack/GoogleHack.pm
• Web Interface
– http://marimba.d.umn.edu/cgi-bin/googlehack/index.cgi