WordNet and other NLP tools

Course G22.2580 - Web Search Engines
3/9/2011
Wei Xu xuwei@cs.nyu.edu

WordNet®
 a large lexical database of English
 a combination of dictionary and thesaurus
 created and maintained by the Cognitive Science Lab of Princeton University
 designed to establish the connections between words

http://wordnet.princeton.edu/

WORDnet
 4 types of Parts of Speech (POS)
▪ Noun, Verb, Adjective, Adverb
 Synset
▪ the smallest unit in WordNet
▪ a synonym set
▪ represents a specific meaning of a word
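
To make the idea of a synset concrete, here is a minimal sketch that models synsets as plain Python sets. The identifiers (`car.n.01`, `car.n.02`), the word lists, and the helper names `synsets_for` and `synonyms` are typed in by hand for illustration; they are assumptions, not data loaded from WordNet or calls from any WordNet API.

```python
# Hand-written toy data for illustration only; not loaded from WordNet.
# Each synset groups the words that share one specific meaning.
SYNSETS = {
    "car.n.01": {"car", "auto", "automobile", "machine", "motorcar"},
    "car.n.02": {"car", "railcar", "railway car", "railroad car"},
}

def synsets_for(word):
    """Return the ids of all synsets that contain the word, sorted."""
    return sorted(sid for sid, words in SYNSETS.items() if word in words)

def synonyms(word):
    """Collect every other word that shares at least one synset with the word."""
    found = set()
    for words in SYNSETS.values():
        if word in words:
            found |= words
    found.discard(word)
    return found

print(synsets_for("car"))          # a word with two meanings sits in two synsets
print(synsets_for("automobile"))   # a word with one meaning sits in one synset
print(sorted(synonyms("railcar")))
```

The point of the sketch is that a word is not the unit of meaning: "car" appears in two synsets, one per sense.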

wordNET
 Synsets are connected to one another through semantic and lexical relations
 Types of relations (depending on POS)
▪ hypernyms (kind-of): ‘vehicle’ is a hypernym of ‘car’
▪ hyponyms (kind-of): ‘car’ is a hyponym of ‘vehicle’
▪ holonyms (part-of): ‘building’ is a holonym of ‘window’
▪ meronyms (part-of): ‘window’ is a meronym of ‘building’
▪ similar to: ‘smart’ is similar to ‘intelligent’
▪ antonyms: ‘smart’ is an antonym of ‘unintelligent’
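
The relations above come in inverse pairs: hypernym/hyponym and holonym/meronym. A minimal sketch of that pairing, using only the examples from this slide as toy data (the dictionaries and the helper `invert` are hypothetical names, not part of WordNet or its APIs):

```python
# Toy relation tables built from the slide's examples; not real WordNet data.
HYPERNYM = {"car": "vehicle"}        # word -> its hypernym (kind-of)
HOLONYM = {"window": "building"}     # part -> its holonym (part-of)

def invert(relation):
    """Build the inverse relation: target -> list of sources."""
    inverse = {}
    for source, target in relation.items():
        inverse.setdefault(target, []).append(source)
    return inverse

# Hyponyms are the inverse of hypernyms; meronyms are the inverse of holonyms.
HYPONYM = invert(HYPERNYM)
MERONYM = invert(HOLONYM)

print(HYPERNYM["car"])       # the hypernym of 'car'
print(HYPONYM["vehicle"])    # the hyponyms of 'vehicle'
print(MERONYM["building"])   # the meronyms of 'building'
```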

Unix-style manual
Web Interfaces
Local Interfaces/APIs
 Java
 Perl
 C#
http://wordnet.princeton.edu/wordnet/relatedprojects/#web

Stemming

Definition:
 the process of removing suffixes from words to obtain their base or root form

Example:
 ‘fishing’, ‘fished’, ‘fish’, ‘fisher’ → ‘fish’
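
As a rough illustration of suffix removal, the sketch below strips one suffix from a tiny hand-picked list. It is a naive toy, not the Porter, Krovetz, or WordNet stemmer; the suffix list, the minimum stem length, and the name `naive_stem` are all assumptions made for this example.

```python
# Hypothetical, deliberately tiny suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no suffix matched: the word is returned unchanged

words = ["fishing", "fished", "fish", "fisher"]
stems = [naive_stem(w) for w in words]
print(stems)
```

The minimum stem length is there so that short words such as ‘red’ are not cut down to a single letter.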

Porter Stemmer
 http://tartarus.org/~martin/PorterStemmer/

Krovetz Stemmer (in Lemur package)
 http://www.lemurproject.org/phorum/read.php?11,1394

WordNet Stemmer
 http://tipsandtricks.runicsoft.com/Other/JavaStemmer.html

Tokenization
 The process of breaking a stream of text up into “words” and punctuation marks.
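
A minimal sketch of tokenization with a regular expression, assuming that a “word” is a run of letters, digits, or underscores and that every other non-space character is its own punctuation token. Real tokenizers handle abbreviations, clitics, and hyphens more carefully; `TOKEN_RE` and `tokenize` are names made up for this example.

```python
import re

# A run of word characters, or any single character that is neither
# a word character nor whitespace (i.e. a punctuation mark).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenize("Jim bought 300 shares of Acme Corp. in 2006.")
print(tokens)
```

Note that this simple rule splits the period off ‘Corp.’, which shows why abbreviations make tokenization and sentence splitting harder than they first appear.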


Sentence Splitting
Part of Speech Tagging
 Example:
He/PRP 's/VBZ at/IN peace/NN with/IN the/DT
house/NN and/CC could/MD stay/VB there/RB
indefinitely/RB ./.
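
The word/TAG output format in the example is easy to read back into a program. The sketch below only parses an already tagged string; it does not perform tagging itself, and the variable names are illustrative.

```python
# The tagged sentence from the example above, joined onto one line.
tagged = ("He/PRP 's/VBZ at/IN peace/NN with/IN the/DT house/NN "
          "and/CC could/MD stay/VB there/RB indefinitely/RB ./.")

# Split on the LAST slash so that a token containing '/' keeps its text intact.
pairs = [tuple(item.rsplit("/", 1)) for item in tagged.split()]

# Example query: collect every token whose tag starts with NN (nouns).
nouns = [word for word, tag in pairs if tag.startswith("NN")]

print(pairs[:3])
print(nouns)
```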

Named Entity Recognition
 The process of labeling sequences of words that are the names of things, such as person, company, and location names.
 Example:
Jim bought 300 shares of Acme Corp. in 2006.
<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought 300
shares of <ENAMEX TYPE="ORGANIZATION">Acme
Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.
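
Markup like the output above can be turned back into a list of (text, type) pairs with a regular expression. This is a sketch for reading such output only, assuming flat, non-nested ENAMEX and TIMEX tags with a single TYPE attribute; it does not recognize entities itself, and `TAG_RE` is a name made up for this example.

```python
import re

# Matches <ENAMEX TYPE="...">text</ENAMEX> or <TIMEX TYPE="...">text</TIMEX>.
# \1 forces the closing tag to match the opening tag name.
TAG_RE = re.compile(r'<(ENAMEX|TIMEX) TYPE="([^"]+)">(.*?)</\1>', re.DOTALL)

marked = ('<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought 300 shares of '
          '<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in '
          '<TIMEX TYPE="DATE">2006</TIMEX>.')

# Collect (entity text, entity type) pairs in order of appearance.
entities = [(m.group(3), m.group(2)) for m in TAG_RE.finditer(marked)]
print(entities)

# Removing the tags recovers the original sentence.
plain = TAG_RE.sub(lambda m: m.group(3), marked)
print(plain)
```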

Stanford POS tagger
 http://nlp.stanford.edu/software/tagger.shtml

Stanford NER
 http://nlp.stanford.edu/software/CRF-NER.shtml

GATE
 http://gate.ac.uk/

JET
 http://cs.nyu.edu/grishman/jet/license.html
 http://www.cs.nyu.edu/courses/spring10/G22.2590-001/schedule.html