NATURAL LANGUAGE TOOLKIT (NLTK)
April Corbet

Overview
1. What is NLTK?
2. NLTK Basic Functionalities
3. Part of Speech Tagging
4. Chunking and Trees
5. Other Functionalities
6. Works Cited

What is NLTK?
• A collection of Python libraries and programs that allows for customization and optimization of NLP processes
• Downloading
• NLP tools typically use other NLP tools
• Other tools include
  • WordNet
  • Stanford Dependency Parser
  • ConceptNet
  • DBpedia
  • Google Mate-Tools

NLTK Basic Functionalities
1. Sentence Tokenization
2. Word Tokenization
3. WordNet, Synsets, and Synonyms
4. Stemming Words and Lemmas

Sentence Tokenization
• Basic Tokenization
• Statistically Based Training Methodology
• Tokenizing for Multiple Sentences
• Pickle File
• Tokenizing with Other Languages

Word Tokenization
• Basic Word Tokenizer
• Penn Treebank Project
• Other Types of Word Tokenizers:
  • PunctWordTokenizer: splits on punctuation but keeps the punctuation with the associated word token
  • WordPunctTokenizer: splits all punctuation onto separate tokens
• Word Tokenizers and Regular Expressions
  • Match on token separators, or gaps
• Stopwords and Filtering

WordNet, Synsets, and Synonyms
• WordNet is a tool integrated into NLTK that contains listings of word relations (i.e., a lexical database)
• Groupings of synonymous words that express the same concept are Synset instances
• Expressed in a tree
• Hypernyms and Hyponyms
• Synonyms and Antonyms

POS Tagging
• String Representation for Tagged Tokens (tuples)
• Default Tagging
• Tagging Based on a Trained Corpus (Brown)
• Types of Tagging
  • Unigram/Bigram Tagger
  • Regexp Tagging
  • Brill: uses an initial tagger, then applies transformation rules learned from the training corpus using “rule templates”

Chunking and Trees
• Default Chunking
• Trees and Parsing
• Drawing Trees

Other Functionalities
• Replacing and Correcting Words
• Calculating WordNet Synset Similarity
• Word Collections
• Text Classification
• Transforming Chunks and Trees
• Processes for Distributed Processing and Handling Large Datasets
• Parsing for Specific Data (Locations, Dates, and Times)

Works Cited
• Perkins, Jacob. Python Text Processing with NLTK 2.0 Cookbook.
• http://wordnet.princeton.edu/
• http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
• http://nltk.org
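
Example sketch: Word Tokenization
The Word Tokenization points (punctuation splitting, matching on gaps, stopword filtering) can be illustrated roughly as follows. This is a minimal sketch, not the presenter's own code: the sample sentence and the tiny stopword set are assumptions made for illustration, and only tokenizers that need no extra data download are used.

```python
from nltk.tokenize import RegexpTokenizer, WordPunctTokenizer

text = "Can't is a contraction."

# WordPunctTokenizer puts every run of punctuation into its own token.
punct_tokens = WordPunctTokenizer().tokenize(text)

# With gaps=True the pattern describes the separators between tokens,
# so here the text is split on runs of whitespace.
gap_tokens = RegexpTokenizer(r"\s+", gaps=True).tokenize(text)

# Hypothetical stopword set for illustration only; NLTK ships fuller
# lists in its stopwords corpus, which needs a separate data download.
stopwords = {"is", "a", "the"}
content_tokens = [t for t in gap_tokens if t.lower() not in stopwords]

print(punct_tokens)
print(gap_tokens)
print(content_tokens)
```

The two tokenizers disagree on the contraction: the punctuation-based one breaks "Can't" into three tokens, while the gap-based one leaves it whole and keeps the final period attached to the last word.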
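
Example sketch: POS Tagging
The tagging ideas on the POS Tagging slides (tagged tokens as tuples, default tagging, a trained unigram tagger, regexp tagging) might look like this in practice. The one-sentence training set and the regular expressions are hypothetical stand-ins; a real tagger would be trained on a tagged corpus such as Brown.

```python
from nltk.tag import DefaultTagger, RegexpTagger, UnigramTagger, str2tuple

# Tagged tokens are (word, tag) tuples; str2tuple parses the "word/TAG" form.
tagged_token = str2tuple("fly/NN")

# Default tagging: every token receives the same tag.
default_tagger = DefaultTagger("NN")

# Hypothetical training data for illustration only.
train_sents = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]

# The unigram tagger looks up each word's most frequent training tag and
# falls back to the default tagger for words it has not seen.
unigram_tagger = UnigramTagger(train_sents, backoff=default_tagger)
unigram_tags = unigram_tagger.tag(["the", "cat", "barks"])

# Regexp tagging: the first pattern that matches a word decides its tag.
regexp_tagger = RegexpTagger([(r".*ing$", "VBG"), (r".*", "NN")])
regexp_tags = regexp_tagger.tag(["running", "dog"])

print(tagged_token)
print(unigram_tags)
print(regexp_tags)
```

Chaining taggers through `backoff` is a common way to combine them: the more specific tagger answers when it can, and the simpler one covers the rest.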
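
Example sketch: Chunking and Trees
A rough illustration of the Chunking and Trees slide, assuming NLTK 3 (where tree nodes expose `label()`; the NLTK 2.0 cookbook in Works Cited predates this). The noun-phrase grammar and the hand-tagged sentence are assumptions chosen to keep the sketch small.

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

# Hand-tagged sentence, used here instead of running a tagger.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]

# Illustrative grammar: a noun phrase is an optional determiner,
# any number of adjectives, then a noun.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged)

# Collect the words of every NP chunk found in the sentence.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]

# Trees can also be built directly from bracketed strings.
parsed = Tree.fromstring("(S (NP the dog) (VP barked))")

print(tree)
print(noun_phrases)
print(parsed.leaves())
```

Calling `tree.draw()` opens a graphical window showing the tree, which requires a display and Tk, so it is left out of the sketch.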