Course G22.2580 - Web Search Engines
3/9/2011
Wei Xu
xuwei@cs.nyu.edu

WordNet®
▪ a large lexical database of English
▪ a combination of dictionary and thesaurus
▪ created and maintained by the Cognitive Science Lab of Princeton University
▪ designed to establish the connections between words
http://wordnet.princeton.edu/

WORDnet
4 types of Parts of Speech (POS)
▪ Noun, Verb, Adjective, Adverb
Synset
▪ the smallest unit in WordNet
▪ a synonym set
▪ represents a specific meaning of a word

wordNET
Synsets are connected to one another through semantic and lexical relations
Types of relations (based on POS)
▪ hypernym (kind-of): ‘vehicle’ is a hypernym of ‘car’
▪ hyponym (kind-of): ‘car’ is a hyponym of ‘vehicle’
▪ holonym (part-of): ‘building’ is a holonym of ‘window’
▪ meronym (part-of): ‘window’ is a meronym of ‘building’
▪ similar to: ‘smart’ is similar to ‘intelligent’
▪ antonym: ‘smart’ is an antonym of ‘unintelligent’
[Diagram labels: hypernym, hyponym]

WordNet documentation and interfaces
▪ Unix-style manual
▪ Web Interfaces
▪ Local Interfaces/APIs: Java, Perl, C#
http://wordnet.princeton.edu/wordnet/relatedprojects/#web

Definition (stemming): the process of removing suffixes from words to get their base or root form
Example: ‘fishing’, ‘fished’, ‘fish’, ‘fisher’ → ‘fish’

Stemmers
▪ Porter Stemmer
  http://tartarus.org/~martin/PorterStemmer/
▪ Krovetz Stemmer (in the Lemur package)
  http://www.lemurproject.org/phorum/read.php?11,1394
▪ WordNet Stemmer
  http://tipsandtricks.runicsoft.com/Other/JavaStemmer.html

Tokenization
The process of breaking a stream of text up into “words” and punctuation marks.

Sentence Splitting

Part of Speech Tagging
Example: He/PRP 's/VBZ at/IN peace/NN with/IN the/DT house/NN and/CC could/MD stay/VB there/RB indefinitely/RB ./.

Named Entity Recognition
The process of labeling sequences of words that are names of things, such as person, company, and location names.
Example: Jim bought 300 shares of Acme Corp. in 2006.
<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought 300 shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.

Tools
▪ Stanford POS tagger
  http://nlp.stanford.edu/software/tagger.shtml
▪ Stanford NER
  http://nlp.stanford.edu/software/CRF-NER.shtml
▪ GATE
  http://gate.ac.uk/
▪ JET
  http://cs.nyu.edu/grishman/jet/license.html
  http://www.cs.nyu.edu/courses/spring10/G22.2590-001/schedule.html
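
The tokenization and stemming steps described in these notes can be illustrated with a minimal Python sketch. It is not the Porter, Krovetz, or WordNet stemmer, and it is not the tokenizer used by any of the tools listed. The names `tokenize`, `toy_stem`, `SUFFIXES`, and `min_stem_len` are hypothetical and chosen only for illustration; the suffix list is an assumption that happens to cover the ‘fish’ example.

```python
import re

# Assumption: a token is a clitic such as 's, a run of word characters,
# or a single punctuation mark. Real tokenizers handle many more cases.
TOKEN_PATTERN = re.compile(r"'\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


# Assumption: a tiny hand-picked suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters.

    This is naive suffix stripping for illustration only; it will
    over-stem or under-stem many English words.
    """
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem_len:
            return lower[: -len(suffix)]
    return lower


if __name__ == "__main__":
    sentence = "He's at peace with the house and could stay there indefinitely."
    print(tokenize(sentence))
    print([toy_stem(w) for w in ["fishing", "fished", "fish", "fisher"]])
```

The minimum stem length is a crude guard against stripping short words such as ‘is’ down to a single letter; a real stemmer such as Porter's uses a more careful set of conditions for each rule.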