Natural Language Toolkit (NLTK)

April Corbet
Overview
1. What is NLTK?
2. NLTK Basic Functionalities
3. Part of Speech Tagging
4. Chunking and Trees
5. Example: Calculating WordNet Synset Similarity
6. Other Functionalities
What is NLTK?
• A collection of Python libraries and programs that
allows for customization and optimization of NLP processes
• Downloading
What is NLTK?
• NLP tools typically use other NLP tools
• Other tools include
• WordNet
• Stanford Dependency Parser
• ConceptNet
• DBpedia
• Google Mate-Tools
Overview
1. What is NLTK?
2. NLTK Basic Functionalities
3. Part of Speech Tagging
4. Chunking and Trees
5. Other Functionalities
6. Works Cited
NLTK Basic Functionalities
1. Sentence Tokenization
2. Word Tokenization
3. WordNet, Synsets, and Synonyms
4. Stemming Words and Lemmas
Sentence Tokenization
• Basic Tokenization
• Statistically Based Training Methodology
• Tokenizing for Multiple Sentences
• Pickle File
• Tokenizing with Other Languages
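As a rough illustration of the statistically trained approach, the sketch below trains NLTK's Punkt sentence tokenizer directly on a few lines of text. The training text and sample paragraph are made up for this example. In practice, the pre-trained model (the pickle file mentioned above) is loaded instead, which requires a separate data download.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical training text (an assumption for this sketch); a real project
# would train on a large corpus or use NLTK's pre-trained model instead.
training_text = (
    "NLTK is a toolkit for natural language processing. "
    "It ships with tokenizers, taggers, and parsers. "
    "Many of its models are trained on annotated corpora."
)

# Passing text to the constructor runs Punkt's unsupervised training on it.
tokenizer = PunktSentenceTokenizer(training_text)

text = "Hello world. How are you today? NLTK can split this paragraph."
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```

Training on the target domain's own text is the reason to do this by hand: Punkt learns abbreviations and sentence starters from whatever it is given.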
Word Tokenization
• Basic Word Tokenizer
• Penn Treebank Project
• Other Types of Word Tokenizers:
• PunktWordTokenizer: splits on punctuation but keeps it
with the associated word token
• WordPunctTokenizer: splits all punctuation into
separate tokens
• Word Tokenizers and Regular Expressions
• Match on tokens, or on separators (gaps)
• Stopwords and Filtering
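The tokenizers above can be compared side by side. This is a minimal sketch using classes that need no extra data downloads. The sample sentence and the tiny stopword set are assumptions for illustration; NLTK's own stopword lists live in a corpus that must be downloaded separately.

```python
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer, WordPunctTokenizer

text = "Can't is a contraction, isn't it?"

# Penn Treebank conventions: contractions are split into separate tokens.
treebank_tokens = TreebankWordTokenizer().tokenize(text)

# Every run of punctuation becomes its own token.
wordpunct_tokens = WordPunctTokenizer().tokenize(text)

# gaps=True means the pattern describes the separators, not the tokens.
gap_tokens = RegexpTokenizer(r"\s+", gaps=True).tokenize(text)

# Illustrative stopword set (an assumption, not NLTK's stopwords corpus).
stopwords = {"is", "a", "it"}
filtered = [token for token in gap_tokens if token.lower() not in stopwords]

print(treebank_tokens)
print(wordpunct_tokens)
print(gap_tokens)
print(filtered)
```

Note that whitespace splitting leaves punctuation attached ("it?"), so that token is not caught by the stopword filter; the choice of tokenizer affects every later step.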
WordNet, Synsets, and Synonyms
• WordNet is a tool integrated into NLTK that
contains listings of word relations (i.e., a lexical
database)
• Groupings of synonymous words that express
the same concept are synsets (Synset instances)
• Expressed in a tree
• Hypernyms and Hyponyms
• Synonyms and Antonyms
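The relations listed above can be explored through NLTK's WordNet interface. The sketch below assumes the WordNet corpus has been downloaded; the helper name `describe` is hypothetical, and it returns an empty list when the data is missing rather than failing.

```python
from nltk.corpus import wordnet


def describe(word):
    """Return one summary dict per synset of `word` (hypothetical helper).

    Returns [] if the WordNet data is not installed.
    """
    try:
        synsets = wordnet.synsets(word)
    except LookupError:
        # The WordNet corpus is a separate download: nltk.download("wordnet")
        return []
    summaries = []
    for synset in synsets:
        lemmas = synset.lemmas()
        summaries.append({
            "name": synset.name(),
            "definition": synset.definition(),
            # Lemma names within one synset are synonyms of each other.
            "synonyms": [lemma.name() for lemma in lemmas],
            # Hypernyms are more general concepts, hyponyms more specific ones.
            "hypernyms": [h.name() for h in synset.hypernyms()],
            "hyponyms": [h.name() for h in synset.hyponyms()],
            # Antonyms are defined on lemmas, not on synsets.
            "antonyms": [a.name() for lemma in lemmas for a in lemma.antonyms()],
        })
    return summaries


summaries = describe("good")
for summary in summaries[:3]:
    print(summary["name"], summary["synonyms"], summary["antonyms"])
```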
POS Tagging
• String Representation for Tagged Tokens (tuples)
• Default Tagging
• Tagging Based on a Trained Corpus (Brown)
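A short sketch of the first two points: tagged tokens are plain (word, tag) tuples with a "word/TAG" string form, and a default tagger assigns the same tag to every token. The sample words are made up for this example.

```python
from nltk.tag import DefaultTagger, str2tuple, tuple2str

# Convert between the "word/TAG" string form and the tuple form.
tagged_token = str2tuple("fly/NN")
as_string = tuple2str(tagged_token)

# A default tagger labels every token with one fixed tag.
default_tagger = DefaultTagger("NN")
tagged = default_tagger.tag(["Hello", "World"])

print(tagged_token)  # ('fly', 'NN')
print(as_string)     # fly/NN
print(tagged)        # [('Hello', 'NN'), ('World', 'NN')]
```

A default tagger is rarely useful alone, but it is a common last-resort fallback behind taggers trained on a corpus.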
POS Tagging
• Types of Tagging
• Unigram/Bigram Tagger
• Regexp Tagging
• Brill: uses an initial tagger, then applies
transformation rules learned from the training corpus
using “rule templates”
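The unigram, bigram, and regexp taggers can be sketched without downloading a corpus. The three hand-tagged training sentences and the regexp patterns below are assumptions for illustration; real training data would come from a corpus such as Brown. Brill tagging is left out because its training API differs between NLTK versions.

```python
from nltk.tag import BigramTagger, DefaultTagger, RegexpTagger, UnigramTagger

# Tiny hand-tagged training set (an assumption for this sketch).
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("runs", "VBZ")],
]

# Backoff chain: bigram context first, then single words, then a default tag.
default = DefaultTagger("NN")
unigram = UnigramTagger(train_sents, backoff=default)
bigram = BigramTagger(train_sents, backoff=unigram)

chain_tags = bigram.tag(["the", "dog", "sleeps", "loudly"])

# Regexp tagging: the first pattern that matches a token decides its tag.
patterns = [
    (r"^\d+$", "CD"),    # whole numbers
    (r".*ing$", "VBG"),  # gerunds
    (r".*", "NN"),       # everything else
]
regexp_tags = RegexpTagger(patterns).tag(["running", "42", "dogs"])

print(chain_tags)
print(regexp_tags)
```

The unseen word "loudly" falls through the whole chain and receives the default tag, which shows why the order of the backoff chain matters.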
Chunking and Trees
• Default Chunking
• Trees and Parsing
• Drawing Trees
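As one possible illustration of chunking and trees, the sketch below groups a tagged sentence into noun-phrase chunks with a regular-expression grammar and builds a tree from a bracketed string. The grammar and the sentence are assumptions for this example, not the chunker used in the original talk.

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

tagged_sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]

# NP chunk: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = RegexpParser(grammar)
chunk_tree = chunker.parse(tagged_sentence)

# Collect the words of every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in chunk_tree.subtrees(filter=lambda t: t.label() == "NP")
]

# Trees can also be built from bracketed strings.
parsed = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
words = parsed.leaves()

print(chunk_tree)
print(noun_phrases)
print(words)
# parsed.draw() would open a window showing the tree (requires Tk).
```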
Other Functionalities
• Replacing and Correcting Words
• Calculating WordNet Synset Similarity
• Word Collections
• Text Classification
• Transforming Chunks and Trees
• Processes for Distributed Processing and
Handling Large Datasets
• Parsing for Specific Data (Locations, Dates, and
Times)
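For the synset similarity item, here is a minimal sketch using Wu-Palmer similarity, one of several measures WordNet supports. It assumes the WordNet corpus is installed; the helper name `similarity` and the two synset names are chosen for illustration, and the helper returns None when the data is missing.

```python
from nltk.corpus import wordnet


def similarity(name_a, name_b):
    """Wu-Palmer similarity between two synsets named like 'dog.n.01'.

    Hypothetical helper; returns None if the WordNet data is not installed.
    """
    try:
        synset_a = wordnet.synset(name_a)
        synset_b = wordnet.synset(name_b)
    except LookupError:
        # The WordNet corpus is a separate download: nltk.download("wordnet")
        return None
    # The score is based on how deep the two synsets and their closest
    # shared ancestor sit in the hypernym tree.
    return synset_a.wup_similarity(synset_b)


score = similarity("cookbook.n.01", "instruction_book.n.01")
print(score)
```

Scores closer to 1 indicate synsets that sit near each other in the hypernym tree.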
Works Cited
• Perkins, Jacob. Python Text Processing with NLTK 2.0 Cookbook.
• http://wordnet.princeton.edu/
• http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
• http://nltk.org