Identifying Sets of Related Words from the World Wide Web

Thesis Defense, 06/09/2005
Pratheepan (Prath) Raveendranathan
Advisor: Ted Pedersen

Outline
• Introduction & Objective
• Methodology
• Experimental Results
• Conclusion
• Future Work
• Demo

Introduction
• The goal of my thesis research is to use the World Wide Web as a source of information to identify sets of words that are related in meaning.
  – Example: given the two words {gun, pistol}, a possible set of related words would be {handgun, holster, shotgun, machine-gun, weapon, ammunition, bullet, magazine}.
  – Example: given the three words {toyota, nissan, ford}, a possible set of related words would be {honda, gmc, chevy, mitsubishi}.

Examples Cont…
  – Example: given the two words {red, yellow}, a possible set of related words would be {white, black, blue, colors, green}.
  – Example: given the two words {George Bush, Bill Clinton}, a possible set of related words would be {Ronald Reagan, Jimmy Carter, White House, Presidents, USA, etc.}.

Applications
• Use sets of related words to classify the semantic orientation of reviews (Peter Turney).
• Use sets of related words to find the sentiment associated with a particular product (Rajiv Vaidyanathan and Praveen Agarwal).

Pros and Cons of Using the Web
• Pros
  – Huge amounts of text
  – Diverse text
    • Encyclopedias, publications, commercial web pages
  – Dynamic (ever-changing state)
• Cons – the Web creates a unique set of challenges:
  – Dynamic (ever-changing state)
    • News websites, blogs
  – Presence of repetitive, noisy, or low-quality data
    • HTML tags, web lingo ("home page", "information", etc.)

Contributions
• Developed an algorithm that predicts sets of related words using pattern-matching techniques and frequency counts.
• Developed an algorithm that predicts sets of related words using a relatedness measure.
• Developed an algorithm that predicts sets of related words using a relatedness measure and an extension of the Log Likelihood score.
• Applied sets of related words to the problem of sentiment classification.

Outline
• Introduction & Objective
• Methodology
• Experimental Results
• Conclusion
• Future Work
• Demo

Interface to the Web – Google
• Reasons for using Google:
  – This research is very much dependent on both the quantity and quality of the Web content retrieved.
  – Google has a very effective ranking algorithm, PageRank, which attempts to give more important or higher-quality web pages a higher ranking.
• Google API
  – An interface that allows programmers to query more than 8 billion web pages through the Google search engine (http://www.google.com/apis/).

Problems with the Google API
• Restricted to 1,000 queries a day
• 10 results for each query
• No "near" operator (proximity-based search)
• Maximum of 1,000 results
• Alternative – Yahoo API
  – 5,000 queries a day (released very recently)
  – No "near" operator either
  – Cannot retrieve the number of hits

Note: Google was used only as a means of retrieving information.
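To make the retrieval interface concrete, here is a minimal sketch of the kind of thin search wrapper the algorithms assume: one call returns a hit count and at most 10 URLs, and answers are cached so repeated runs stay under the daily query limit. The backend stub and the cache file name are hypothetical illustrations, not the actual Google-Hack code (which was written in Perl against the Google SOAP API).

import json, os

CACHE_FILE = "query_cache.json"   # hypothetical on-disk cache

def backend_search(query):
    # Stub standing in for the real (now retired) Google SOAP API call;
    # it would return (estimated_hit_count, top_10_result_urls).
    raise NotImplementedError("plug in a real search API here")

def search(query, cache_path=CACHE_FILE):
    """Return [hits, urls] for a query, caching answers so that repeated
    runs do not burn through the 1,000-queries-per-day allowance."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if query not in cache:
        hits, urls = backend_search(query)
        cache[query] = [hits, urls[:10]]   # the API caps results at 10 per query
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return cache[query]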
Key Idea behind the Algorithms
• Words that are related in meaning often tend to occur together.
  – Example: "A Springfield, MA Chevrolet, Ford, Honda, Lexus, Mazda, Nissan, Saturn, Toyota automotive dealer with new and pre-owned vehicle sales and leasing."

Algorithm 1 – Features
• Based on frequency
• Takes only single words as input
• Initial set of 2 words
• Frequency cutoff
• Results ranked by frequency
• SMART stop list
  – "the", "if", "me", "why", "you", etc. (non-content words)
• Web stop list
  – web page, www, home, page, personal, url, information, link, text, decoration, verdana, script, javascript

Algorithm 1 – High-Level Description
1. Create queries to Google based on the input terms.
2. Retrieve the top N web pages for each query.
   1. Parse the retrieved web page content for each query.
3. Tokenize the web page content into a list of words and frequencies.
   1. Discard words that occur fewer than C times.
4. Find the words common to at least two of the sets of words. This set of intersecting words is the set of words related to the input terms.
5. Repeat the process for I iterations, using the set of related words from the previous iteration as input.
(A sketch of steps 1–2 follows the trace below.)

Algorithm 1 – Trace 1
• Search terms: S1 = {pistol, gun}
• Frequency cutoff – 15
• Number of results (web pages) – 10
• Iterations – 2

Algorithm 1 – Step 1
1. Create queries to Google based on permutations of the input terms:
   – gun
   – gun AND pistol
   – pistol
   – pistol AND gun

Algorithm 1 – Step 2
2. Issue each query to Google:
   1. Retrieve the top 10 URLs for the query.
      1. For each URL, retrieve the web page content, and parse the web page for more links.
      2. Traverse these links and retrieve the content of those web pages as well.
   Repeat this process for each query.

Trace 1 Cont…
• Web pages for the query gun:
  http://www.thesmokinggun.com/
  http://www.gunbroker.com/
  http://www.gunowners.org/
  http://www.ithacagun.com/
  http://www.doublegun.com/
  http://www.imdb.com/title/tt0092099/
  http://www.imdb.com/Title?0092099
  http://www.gunandgame.com/
  http://www.gunaccessories.com/
  http://www.guncite.com/

Trace 1 Cont…
• Web pages for pistol:
  http://www.idpa.com/
  http://www.bullseyepistol.com/
  http://www.crpa.org/
  http://www.zvis.com/dep/dep.shtml
  http://www.nysrpa.org/
  http://www.auspistol.com.au/
  http://hubblesite.org/newscenter/newsdesk/archive/releases/1997/33/
  http://en.wikipedia.org/wiki/Pistol
  http://www.imdb.com/title/tt0285906/
  http://www.fas.org/man/dod-101/sys/land/m9.htm

Trace 1 Cont…
• Web pages for gun AND pistol:
  http://www.usgalco.com/
  http://www.minirifle.co.uk/
  http://www.dypic.com/gunsafepistol.html
  http://www.datacity.com/handgun-pistol-case.html
  http://www.camping-hunting.com/
  http://www.pelican-case.com/pelguncaspis.html
  http://www.cafepress.com/4funnystuff/566642
  http://www.nimmocustomarms.com/
  http://www.bullseyegunaccessories.com/
  http://www.airsoftshogun.com/P_224.htm

Trace 1 Cont…
• Web pages for pistol AND gun:
  http://www.safetysafeguards.com/
  http://www.safetysafeguards.com/site/402168/page/57955
  http://www.safetysafeguards.com/site/402168/page/57959
  http://www.airguns-online.co.uk/
  http://www.dypic.com/gunsafepistol.html
  http://www.airgundepot.com/eaa-drozd.html
  http://www.docs.state.ny.us/DOCSOlympics/Combat.htm
  http://www.datacity.com/handgun-pistol-case.html
  http://www.sail.qc.ca/catalog/detail.jsp?id=2880
  http://portfolio-pro.com/pistolhandgun.html
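A sketch of steps 1–2 of Algorithm 1, as referenced above: build the permutation queries from the input terms, n single-term queries plus n·(n−1) ordered pairs, n² queries in total, and fetch the top pages for each with the hypothetical search wrapper sketched earlier.

from itertools import permutations

def build_queries(terms):
    """n single-term queries plus n*(n-1) ordered pairs: n**2 queries total."""
    queries = list(terms)
    queries += [f"{a} AND {b}" for a, b in permutations(terms, 2)]
    return queries

print(build_queries(["gun", "pistol"]))
# -> ['gun', 'pistol', 'gun AND pistol', 'pistol AND gun']

# With the 11 terms of iteration 2 this is 11**2 = 121 queries, which is
# why the number of Google queries grows so quickly across iterations.
assert len(build_queries([f"term{i}" for i in range(11)])) == 121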
Algorithm 1 – Step 3
3. Next, for the total web page content retrieved for each query:
   1. Remove HTML tags etc. and retrieve the text.
   2. Remove stop words.
   3. Tokenize the web page content into lists of words and frequencies.

Note: This results in the following 4 sets of words, each set representing the words retrieved for one query.

Words from the Web Pages after Removing Stop Words

gun:
shotgun, 15; mounts, 21; daily, 33; holsters, 27; parts, 15; systems, 24; control, 31; cases, 33; bullets, 17; reloading, 16; military, 19; rifle, 21; care, 20; grips, 31; knives, 44; tactical, 24; stocks, 23; optics, 29; shooting, 19; scope, 16; accessories, 53

pistol:
shooting, 25; dep, 16; eagle, 20; desert, 19; crpa, 17

gun AND pistol:
hobbies, 18; chelmsford, 15; rifle, 120; pelican, 56; pistols, 35; auto, 18; practical, 69; club, 56; shotgun, 79; holster, 15; trigger, 24; foam, 27; ipsc, 18; cases, 56; case, 82; shooting, 123; essex, 30; target, 22; hobby, 18; bullets, 22; ruger, 38; airsoft, 28; ukpsa, 22; sport, 28; clubs, 19; safe, 29; semi, 18; range, 19; guns, 72; mini, 25; bullet, 42; shoot, 31; forum, 18; advertise, 16; pictures, 17; dealers, 17; riffles, 22; firearms, 22; ammo, 23

pistol AND gun:
electronic, 24; option, 60; biometric, 20; hspace, 24; menus, 15; ddd, 21; guns, 38; middle, 740; cases, 35; shoes, 16; safes, 62; airsoft, 50; vspace, 18; soft, 22; null, 1051; travel, 15; diversion, 21; air, 70; rifle, 29; family, 59; shopping, 16; case, 37; silver, 17; hand, 66; normal, 48; technical, 16; imgcounter, 27; security, 20; small, 17; members, 19; catalog, 371; category, 370; order, 17; auto, 20; addtab, 30; paintball, 20; pro, 36; safety, 53; boots, 24; false, 30; safe, 70; money, 15; uploaded, 17; fingerprint, 27; accessories, 59

Algorithm 1 – Step 4
4. Find the words that are common to at least 2 sets. Let:
   A. gun AND pistol
   B. pistol AND gun
   C. gun
   D. pistol
   Related Set = { w : w occurs in at least two of A, B, C, D }

Related Set 1 – Iteration 1

Result Set 1:
rifle, 177; shooting, 169; case, 127; accessories, 126; cases, 124; guns, 123; safe, 100; shotgun, 97; airsoft, 78; auto, 41; bullets, 40

Trace 1 Cont… Iteration 2
• 11 input terms
• Search terms created:
  – rifle
  – shooting
  – guns
  – cases
  – airsoft
  – shooting AND guns
  – guns AND shooting
  – guns AND cases, etc.
• Results in 11² = 121 queries to Google!

Note: As you can see, the number of queries to Google increases drastically. (A sketch of steps 3–4 follows.)
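The sketch below condenses steps 3–4: tokenize each query's page text, apply the frequency cutoff, and keep words that occur in at least two of the per-query sets, ranked by total frequency. The stop lists here are truncated stand-ins for the full SMART and web stop lists, and the tokenizer is deliberately simple; this is illustrative, not the Google-Hack code.

import re
from collections import Counter

STOP_WORDS = {"the", "if", "me", "why", "you",          # SMART-style stop words
              "web", "page", "www", "home", "url"}      # web stop words

def tokenize(text, cutoff=15):
    """Word/frequency list for one query's page text, cutoff applied."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return {w: c for w, c in counts.items() if c >= cutoff}

def related_set(per_query_counts):
    """Words occurring in at least two query sets, ranked by total frequency."""
    totals = Counter()
    seen_in = Counter()
    for counts in per_query_counts:
        for w, c in counts.items():
            totals[w] += c
            seen_in[w] += 1
    related = {w: totals[w] for w in totals if seen_in[w] >= 2}
    return sorted(related.items(), key=lambda kv: -kv[1])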
Result Set 2 – {gun, pistol}
pistols, 227; firearms, 205; accessories, 204; free, 192; holsters, 172; club, 170; target, 164; tactical, 161; air, 158; practical, 152; range, 150; court, 149; uk, 147; sports, 145; law, 143; price, 142; full, 140; control, 140; soft, 124; military, 121; custom, 120; holster, 118; fits, 118; shoot, 117; sport, 115; hours, 109; usa, 109; ammo, 107; electric, 107; ships, 106; spring, 103; articles, 96; carry, 95; ruger, 93; force, 92; mp, 90; remote, 90; car, 89; harlow, 88; magazines, 87; belt, 86; mini, 82; tac, 79; radio, 77; paintball, 75; assault, 71; teflon, 70; pouch, 69; number, 69; shoulder, 69; leg, 64; core, 62; essex, 60; nylon, 57; flash, 55; bullets, 53; trigger, 50; straps, 46; helicopter, 45; riffles, 44; coat, 44; ukpsa, 44

Algorithm 1 – {red, yellow}
Number of results – 10
Frequency cutoff – 15
Iterations – 1

Related words:
enterprise, 411; software, 257; solutions, 151; management, 142; technology, 141; system, 96; services, 89; netherlands, 84; fellow, 76; applications, 71; snake, 70; performance, 64; scarlet, 62; project, 34; organizations, 33; organization, 29; coral, 28; black, 28; blue, 27

Problems with Algorithm 1
• Frequency-based ranking
• Number of input terms restricted to 2
• Input and output restricted to single words

Algorithm 2 – Features
• Based on frequency and a relatedness score
• Can take single words or 2-word collocations as input
• Relatedness measure based on Jiang and Conrath
• Frequency cutoff and relatedness score cutoff
• Results ranked by score
• Initial set can contain more than 2 words
• Bigrams as output
• SMART stop list
  – "the", "if", "me", "why", "you", etc.
• Web stop words and phrases
  – web page, www, home page, personal, url, information, link, text, decoration, verdana, script, javascript

Algorithm 2 – High-Level Description
1. Repeat the same steps as in Algorithm 1 to retrieve the initial set of related words (adding bigrams to the results as well).
2. For each word returned by Algorithm 1 as a related word:
   1. Calculate the relatedness of the word to the input terms.
   2. Discard any word or bigram with a relatedness score greater than the score cutoff.
   3. Sort the remaining terms from most related to least related.
3. Repeat steps 1–2 for each iteration, using the set of words from the previous iteration as input.

Relatedness Measure (Distance Measure)
• Relatedness(Word1, Word2) =
    log(hits(Word1)) + log(hits(Word2)) – 2 · log(hits(Word1 AND Word2))
  (based on the measure by Jiang and Conrath)
• Example 1:
  hits(toyota) = 12,500,000
  hits(ford) = 22,900,000
  hits(toyota AND ford) = 50,000
  Relatedness(toyota, ford) = 32.41
• Example 2:
  hits(toyota) = 12,500,000
  hits(ford) = 22,900,000
  hits(toyota AND ford) = 150,000
  Relatedness(toyota, ford) = 30.82

Relatedness Measure Cont…
• Example 3:
  hits(toyota) = 1,000
  hits(ford) = 1,000
  hits(toyota AND ford) = 1,000
  Relatedness(toyota, ford) = 0
• As the measure approaches zero, the relatedness between the two terms increases.
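A direct transcription of the relatedness measure, as a sketch. The slides do not state the logarithm base; base 2 is assumed here. Example 3, equal hit counts giving a score of 0, holds in any base.

from math import log2

def relatedness(hits_w1, hits_w2, hits_both):
    """Jiang & Conrath-style distance over web hit counts: lower = more related."""
    return log2(hits_w1) + log2(hits_w2) - 2 * log2(hits_both)

# Example 3 from the slides: identical hit counts give a score of 0.
print(relatedness(1000, 1000, 1000))   # -> 0.0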
Input Set – {gun, pistol}

Algorithm 1:
shooting, 169; guns, 124; rifle, 113; case, 81; accessories, 74; cases, 74; airsoft, 72; products, 68; bullet, 53; air, 50; shotgun, 46; holsters, 46; ammo, 37; bullets, 34

Algorithm 2:
shotgun, 16.40; rifle, 18.01; holster, 19.31; ammo, 19.61; shooting, 22.21; bullets, 22.80; air, 24.88; holsters, 25.04; airsoft, 25.79; gun cases, 26.02; accessories, 26.99; guns, 28.42; equipment, 29.32; remington, 29.37

Algorithm 2 – {red, yellow}
Number of results – 10
Frequency cutoff – 10
Score cutoff – 30
Iterations – 1

blue, 16.77; black, 17.07; scarlet, 24.91; coral, 28.97

Problems with Algorithm 2
• Certain bigrams are not good collocations.
  – For example, {sunny, cloudy}:
    Number of results – 10
    Frequency cutoff – 15
    Bigram cutoff – 4
    Score cutoff – 30

    clear, 24.35; partly cloudy, 25.85; forecast text, 26.66; partly sunny, 26.92; light, 27.33; bulletin fpcn, 28.33; wind, 28.84; winds, 29.22

    ("forecast text" and "bulletin fpcn" are not good collocations.)

Algorithm 3 – High-Level Description
1. Repeat the same steps as in Algorithm 1 to retrieve the initial set of related words (adding bigrams to the results as well).
2. For each term returned by Algorithm 1 as a related word:
   1. If the term is a bigram:
      1. Check whether the bigram is a valid collocation.
         1. If the bigram is a valid collocation, continue with step 2.2;
         2. otherwise, remove the term from the set of related words.
   2. Calculate the relatedness of the word to the input terms.
   3. Discard any word or collocation with a relatedness score greater than the score cutoff.
   4. Sort the remaining terms from most related to least related.

Verifying Bigrams
• Adapt the Log Likelihood (G²) score to web hit counts.
• Example: "New York" – 4 queries to Google: "New York", "New *", "* York", and "of the" (an estimate of the total number of bigrams).

Observed values:
              York      Not York   Total
  New         607       2953       3560   ("New York", "New *")
  Not New     14        2096       2110
  Total       621       5049       5670   ("* York", "of the")

Expected Values
  E(New, York)         = (621 × 3560) / 5670 = 389.9048
  E(New, Not York)     = (5049 × 3560) / 5670 = 3170.0952
  E(Not New, York)     = (621 × 2110) / 5670 = 231.0952
  E(Not New, Not York) = (5049 × 2110) / 5670 = 1878.9048

Identifying a "Bad" Collocation
• A bigram is discarded if:
  – the observed value for the bigram (the "New York" cell above) is 0, or
  – the observed value for the bigram is less than its expected value.
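A sketch of the bigram check just described: four hit counts (for "New York", "New *", "* York", and "of the") fill a 2×2 contingency table, the expected values come from the marginals, and the bigram is scored with G² or discarded by the two rules above. This is an illustrative reconstruction under those assumptions, not the thesis code.

from math import log

def g2_score(n_bigram, n_first, n_second, n_total):
    """G2 for a bigram, e.g. g2_score(hits('"New York"'), hits('"New *"'),
    hits('"* York"'), hits('"of the"')). Returns 0.0 for a bad collocation."""
    observed = [n_bigram,
                n_first - n_bigram,
                n_second - n_bigram,
                n_total - n_first - n_second + n_bigram]
    expected = [n_first * n_second / n_total,
                n_first * (n_total - n_second) / n_total,
                (n_total - n_first) * n_second / n_total,
                (n_total - n_first) * (n_total - n_second) / n_total]
    # Discard rules from the slides: an observed bigram count of 0, or an
    # observed count below its expected value, marks a bad collocation.
    if n_bigram == 0 or n_bigram < expected[0]:
        return 0.0
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# "New York" numbers from the slides (expected[0] works out to 389.9048):
print(g2_score(607, 3560, 621, 5670))   # large positive score -> keep the bigram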
Example Bigrams
(See the observed/expected/log-likelihood table under Algorithm 3 below.)

Outline
• Introduction & Objective
• Methodology
• Experimental Results & Evaluation
• Conclusion
• Future Work
• Demo

Evaluating Results
• Compare with Google Sets
  – http://labs.google.com/sets
• Human-subject experiments
  – Around 20 people expanded 2-word sets to what they felt was a set of related words.

F-measure, Precision and Recall
• Precision = |A ∩ G| / |A|, where A is the set returned by the algorithm and G is the gold-standard set (Google Sets or the human-subject sets).
• Recall = |A ∩ G| / |G|
• F-measure = (2 × Precision × Recall) / (Precision + Recall)

Comparison of Algorithms 1 & 2

{toyota, ford}, frequency cutoff – 5 (Algorithm 1):
truck, 66; car, 61; sales, 59; parts, 46; vehicles, 45; year, 43; cars, 35; auto, 32; motors, 30; general, 27; company, 24; honda, 20; service, 20; automotive, 18; nissan, 18; trucks, 17; consumer, 17; detroit, 13; marketing, 13; volvo, 12; media, 12; buyers, 12; focus, 11

{toyota, ford}, frequency cutoff – 5, score cutoff – 30 (Algorithm 2):
gm, 19.09; nissan, 20.15; car, 29.77

{toyota, ford, nissan}, frequency cutoff – 5, score cutoff – 30 (Algorithm 2):
mazda, 19.59; honda, 19.92; chevrolet, 21.37; bmw, 22.47; dodge, 22.83; lexus, 23.05; mitsubishi, 23.17; pontiac, 23.89; mercedes, 24.56; gmc, 25.14; vehicles, 27.77

Algorithm 1 – {jordan, chicago}
Number of results – 10
Frequency cutoff – 15
Iterations – 1

Google Hack: michael, 174; bulls, 148; nba, 97; game, 56; jersey, 43
Google Sets: Chicago, Jordan, Israel, JOHNSON, Jackson, Kuwait, JANESVILLE, Iraq, Japan, Lebanon, Egypt, Springfield

Precision = 0, Recall = 0, F-measure = 0

Algorithm 2 – {toyota, ford, nissan}
Number of results – 10
Frequency cutoff – 10
Score cutoff – 30
Iterations – 1

Google Hack: mazda, 19.59; honda, 19.92; chevrolet, 21.37; bmw, 22.47; dodge, 22.83; lexus, 23.05; mitsubishi, 23.17; pontiac, 23.89; mercedes, 24.56; gmc, 25.14; vehicles, 27.77
Google Sets: HONDA, MAZDA, SUBARU, MITSUBISHI, DODGE, CHEVROLET, Jeep, Volvo, Buick, Pontiac, Suzuki
Human subjects: benz, buick, subaru, mitsubishi, dodge, chevrolet, jeep, volvo, buick, pontiac, suzuki, holden, mitsubishi

Precision = 6/11 = 0.54, Recall = 6/11 = 0.54, F-measure = 0.54

Algorithm 2 – {january, february, may}
Number of results – 10
Frequency cutoff – 10
Score cutoff – 30
Iterations – 1

Google Hack: june, 22.90; july, 24.39; august, 25.33; september, 25.50; march, 25.71; october, 26.21; november, 27.09; april, 27.49; december, 27.61
Google Sets: March, April, June, October, November, December, September, July, August

Precision = 9/9 = 1, Recall = 9/9 = 1, F-measure = 1

Algorithm 2 – {armani, versace}
Number of results – 10
Frequency cutoff – 10
Bigram cutoff – 4
Score cutoff – 30
Iterations – 1

Google Hack: prada, 18.17; moschino, 18.45; gucci, 18.60; dkny, 19.00; valentino, 19.72; chanel, 19.93; gianni, 20.12; hugo boss, 20.17; calvin klein, 20.29; gianni versace, 20.46; dolce gabbana, 21.76; calvin, 21.97; yves saint, 22.10; dior, 22.37; yves, 22.62; giorgio armani, 23.04; hugo, 23.06; fendi, 24.12; giorgio, 24.64; christian dior, 24.86
Google Sets (not the entire set): Gucci, Chanel, Calvin Klein, Prada, Dolce Gabbana, Fendi, Hugo Boss, Christian Dior, Hermes, Moschino, Donna Karan, Ralph Lauren, Valentino, Louis Vuitton, Giorgio Armani, DKNY, Escada, Tommy Hilfiger, Tiffany, Givenchy

Precision = 11/20 = 0.55, Recall = 11/43 = 0.26, F-measure = 0.35
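A sketch of the scoring itself: precision, recall and F-measure of the algorithm's output A against a gold set G (Google Sets or a human-subject set), matching case-insensitively. The month example reproduces the perfect score above.

def evaluate(algorithm_set, gold_set):
    """Precision, recall and F-measure of set A against gold set G."""
    a = {w.lower() for w in algorithm_set}
    g = {w.lower() for w in gold_set}
    overlap = len(a & g)
    precision = overlap / len(a) if a else 0.0
    recall = overlap / len(g) if g else 0.0
    f = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return precision, recall, f

# {january, february, may} example from the slides: P = R = F = 1.
months = ["june", "july", "august", "september", "march",
          "october", "november", "april", "december"]
gold = ["March", "April", "June", "October", "November",
        "December", "September", "July", "August"]
print(evaluate(months, gold))   # -> (1.0, 1.0, 1.0)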
Algorithm 2 – {artificial intelligence, machine learning}
Number of results – 10
Frequency cutoff – 10
Bigram cutoff – 4
Score cutoff – 32
Iterations – 1

Google Hack: neural networks, 20.88; robotics, 21.14; neural, 21.60; data mining, 22.84; expert systems, 22.90; expert, 24.24; genetic algorithms, 24.30; reasoning, 24.40; logic programming, 24.40; natural language, 24.87; intelligent, 25.68; knowledge, 25.89; logic, 26.18; data, 26.21; natural, 26.23; genetic, 26.33; applications, 26.60; computer, 27.91; knowledge discovery, 28.91; ai, 29.16; case based, 29.83; computer science, 30.21; reinforcement learning, 31.17
Google Sets: Neural Networks, Robotics, Knowledge Representation, Natural Language Processing, Pattern Recognition, Machine Vision, Programming Languages, Data Mining, Genetic Programming, Vision, Natural Language, Intelligent Agents, People, Publications, Philosophy, Qualitative Physics, Speech Processing, Expert Systems, Genetic Algorithms, Computer Vision, Computational Linguistics, Cognitive Science, Logic Programming

Precision = 9/23 = 0.39, Recall = 9/48 = 0.1875, F-measure = 0.25

Comparison of Algorithms 2 & 3 – {sunny, cloudy}
Number of results – 10
Frequency cutoff – 10
Bigram cutoff – 4
Score cutoff – 30
Iterations – 1

Algorithm 2: clear, 24.35; partly cloudy, 25.85; forecast text, 26.66; partly sunny, 26.92; light, 27.33; bulletin fpcn, 28.33; wind, 28.84; winds, 29.22
Algorithm 3: clear, 24.35; partly cloudy, 25.85; partly sunny, 26.92; light, 27.33; wind, 28.84; winds, 29.22

Algorithm 3 – Bigrams, {artificial intelligence, machine learning}

Bigram                  Observed Value   Expected Value   Log Likelihood Score
neural networks         617000           144620.81        954551.64
morgan kaufmann         138000           5067.61          692428.92
pattern recognition     419000           248456.79        102193.35
genetic algorithms      129000           75014.81         32818.00
grammatical inference   4590             1474.92          4214.81
based learning          861000           13804761.90      0
computer science        12700000         27947089.94      0
ai magazine             99500            340178.13        0
ai programming          8050             197317.46        0
based reasoning         46300            676825.39        0
case based              150000           12690476.19      0
data mining             1160000          1424162.25       0
expert systems          165000           3705114.63       0
intelligence machine    3650             587160.49        0

Performance of the Algorithms
• F-measure increases from Algorithm 1 to Algorithm 3:

  Algorithm 1: 0.06
  Algorithm 2: 0.26
  Algorithm 3: 0.29

Sentiment Classification
• Pointwise Mutual Information – Information Retrieval algorithm (PMI-IR), Peter Turney
  – Used to classify reviews as being positive or negative in orientation
• Part-of-speech tag the review
• Extract 2-word phrases from the text
  – an adjective followed by a noun
  – a noun followed by a noun, etc.
• Use a positive connotation such as "excellent" and a negative connotation such as "poor", and calculate the Semantic Orientation (SO) for each 2-word phrase.

Example
• Let the phrase be "incredible cast":

  SO("incredible cast") = log2 [ (hits("incredible cast" NEAR "excellent") × hits("poor")) /
                                 (hits("incredible cast" NEAR "poor") × hits("excellent")) ]

Problem with the Current Algorithm
• Words such as "poor" have at least two senses:
  – "poor" as in poverty
  – "poor" as in not good

Extended PMI-IR
• Used Google instead of AltaVista
• Used AND instead of NEAR
• Extended the SO formula
  – Use multiple pairs of positive and negative connotations:
    {excellent, poor}, {good, bad}, {great, mediocre}
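A sketch of the extended Semantic Orientation score: Turney's PMI-IR formula with AND-queries instead of NEAR, averaged over the three connotation pairs. The averaging, the 0.01 smoothing constant, and the toy hit counts below are assumptions made for illustration, not values from the thesis.

from math import log2

PAIRS = [("excellent", "poor"), ("good", "bad"), ("great", "mediocre")]

def semantic_orientation(phrase, hits):
    """Average SO of a phrase over the connotation pairs; hits() is a
    hit-count lookup such as the hypothetical wrapper sketched earlier."""
    so = 0.0
    for pos, neg in PAIRS:
        so += log2(((hits(f'"{phrase}" AND "{pos}"') + 0.01) * hits(f'"{neg}"')) /
                   ((hits(f'"{phrase}" AND "{neg}"') + 0.01) * hits(f'"{pos}"')))
    return so / len(PAIRS)

def classify_review(phrases, hits):
    """Positive if the average SO of the extracted 2-word phrases is > 0."""
    avg = sum(semantic_orientation(p, hits) for p in phrases) / len(phrases)
    return "positive" if avg > 0 else "negative"

# Tiny fake hit counts, for illustration only:
fake = {'"excellent"': 1000, '"poor"': 1000, '"good"': 1000, '"bad"': 1000,
        '"great"': 1000, '"mediocre"': 1000,
        '"incredible cast" AND "excellent"': 90, '"incredible cast" AND "poor"': 10,
        '"incredible cast" AND "good"': 80, '"incredible cast" AND "bad"': 20,
        '"incredible cast" AND "great"': 70, '"incredible cast" AND "mediocre"': 30}
print(classify_review(["incredible cast"], lambda q: fake[q]))   # -> positive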
A negative review of the movie "Planet of the Apes" – classified by our algorithm as negative.

A positive review of an Audi – classified by our algorithm as positive.

A negative movie review – classified by our algorithm as negative.

Performance of the Extended PMI-IR
• The algorithm was run on 20 reviews (movies and automobiles):

                      Classified as Positive   Classified as Negative   Total
  Positive reviews    5                        5                        10
  Negative reviews    0                        10                       10
  Total               5                        15                       20

• Overall accuracy – 75%

End Result
• All of this is freely available on CPAN and SourceForge as Google-Hack.

Conclusions & Contributions
• Developed 3 algorithms that try to predict sets of related words:
  – Algorithm 1 was based on frequency
  – Algorithm 2 was based on a relatedness measure
  – Algorithm 3 was based on a relatedness measure and the Log Likelihood score
• Applied sets of related words to sentiment classification

Conclusions & Contributions
• Released the free Perl package Google-Hack on CPAN and SourceForge.
• Developed a web interface.

Future Work
• Addition of a proximity operator
• Restricting the number of web pages traversed
• Finding the intersection of words across different search engines (Yahoo API)
• Using anchor text

Related URLs
• Research page – http://www.d.umn.edu/~rave0029/research
• Google-Hack – http://google-hack.sf.net
• CPAN release – http://search.cpan.org/~prath/WebService-GoogleHack-0.15/GoogleHack/GoogleHack.pm
• Web interface – http://marimba.d.umn.edu/cgi-bin/googlehack/index.cgi