NLP versus IR
  • Covered predominantly IR up until now
    - processing, stemming, indexing, querying, etc.
    - mostly bag of words and vector space models
    - word order unimportant*
    - word inflections unimportant*
  • What do we mean by "natural language processing"?
    - and how does this differ from / overlap with IR?

COMP90042 Trevor Cohn

Tags 1: ambiguity
  • time flies like an arrow
  • fruit flies like a banana
  • ambiguous headlines
    - British Left Waffles on Falkland Islands
    - Juvenile Court to Try Shooting Defendant

Tags 2: Representations to resolve ambiguity

Exercise: tag some headlines
  • British Left Waffles on Falkland Islands
  • Juvenile Court to Try Shooting Defendant

Tags 3: Tagged Corpora
  • The/DT limits/NNS to/TO legal/JJ absurdity/NN stretched/VBD another/DT notch/NN this/DT week/NN when/WRB the/DT Supreme/NNP Court/NNP refused/VBD to/TO hear/VB an/DT appeal/NN from/IN a/DT case/NN that/WDT says/VBZ corporate/JJ defendants/NNS must/MD pay/VB damages/NNS even/RB after/IN proving/VBG that/IN they/PRP could/MD not/RB possibly/RB have/VB caused/VBN the/DT harm/NN ./.
  • Source: Penn Treebank Corpus (nltk/data/treebank/wsj_0130)

Another kind of tagging: Sense Tagging
  • The Pantheon's interior/a , still in its original/a form/a ,
  • interior: (a) inside a space; (b) inside a country and at a distance from the coast or border; (c) domestic; (d) private.
  • original: (a) relating to the beginning of something; (b) novel; (c) that from which a copy is made; (d) mentally ill or eccentric.
  • form: (a) definite shape or appearance; (b) body; (c) mould; (d) particular structural character exhibited by something; (e) a style as in music, art or literature; (f) homogeneous polynomial in two or more variables; ...
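
The word/TAG notation used in the tagged corpus excerpt and in the sense-tagged sentence above is easy to read programmatically. The following is a minimal sketch in plain Python, not the NLTK implementation: the helper names str_to_tuple and parse_tagged are made up for this illustration (NLTK ships a similar helper, nltk.tag.str2tuple), and the sample strings are shortened copies of the two examples.

```python
def str_to_tuple(token, sep="/"):
    """Split 'word/TAG' at the LAST separator; untagged tokens get tag None."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator in the token, e.g. a bare "," in the sense-tagged text.
        return (token, None)
    return (word, tag)


def parse_tagged(text):
    """Turn a whitespace-separated tagged string into (word, tag) pairs."""
    return [str_to_tuple(tok) for tok in text.split()]


# First few tokens of the Penn Treebank excerpt, plus the final "./." token.
pos_sample = "The/DT limits/NNS to/TO legal/JJ absurdity/NN stretched/VBD ./."
tagged = parse_tagged(pos_sample)

# Select the nouns: every Penn Treebank noun tag starts with "NN".
nouns = [word for word, tag in tagged if tag and tag.startswith("NN")]
print(nouns)  # ['limits', 'absurdity']

# The same parser handles sense tags; untagged words come back with None.
sense_sample = "The Pantheon's interior/a , still in its original/a form/a ,"
senses = [(w, t) for w, t in parse_tagged(sense_sample) if t is not None]
print(senses)  # [('interior', 'a'), ('original', 'a'), ('form', 'a')]
```

Splitting at the last separator keeps punctuation tokens such as ./. intact: the word is "." and the tag is ".".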

Significance of Parts of Speech
  • a word's POS tells us a lot about the word and its neighbors
    - limits the range of meanings (deal), pronunciations (the noun OBject vs the verb obJECT), or both (wind)
    - helps in stemming
    - limits the range of following words for automatic speech recognition (ASR)
    - helps select nouns from a document for IR
  • More advanced uses (these won't make sense yet):
    - basis for chunk parsing
    - parsers can build trees directly on the POS tags instead of maintaining a lexicon
    - first step for many different NLP tasks

What does Tagging do?
  1. Collapses Distinctions
     - Lexical identity may be discarded
     - e.g. all personal pronouns tagged with PRP
  2. Introduces Distinctions
     - Ambiguities may be removed
     - e.g. deal tagged with NN or VB; deal tagged with DEAL1 or DEAL2
  3. Helps classification and prediction
  • There are many tagsets. This is due to:
    - the different ways to define a tag
    - the need to balance classification and prediction (a harder/easier classification task vs more/less information about context)

Tagged Corpora
  • Brown Corpus:
    - The first digital corpus (1961), Francis and Kucera, Brown U
    - Contents: 500 texts, each 2000 words long
    - from American books, newspapers, magazines
    - representing 15 genres, including science fiction, romance fiction, press reportage, scientific writing, popular lore
    - See nltk/data/brown/
    - See reading for definition of Brown tags
  • Penn Treebank:
    - First syntactically annotated corpus
    - Contents: 1 million words from WSJ; POS tags, syntax trees
    - See nltk/data/treebank/ (5% sample)

Tagged Corpora in other languages
  • Parsed treebanks in many other languages
    - Basque, Bulgarian, Chinese, Czech, Finnish, French,
      German, Greek, Hebrew, Hungarian, Irish, Italian,
      Japanese, Korean, Persian, Romanian, Spanish,
      Swedish … and many more!
  • All with part-of-speech annotation
    - language specific tag sets
    - recent work on mapping to common tag set

Application of tagged corpora: genre classification

Important Treebank Tags
  • NN   noun                       JJ   adjective
  • NNP  proper noun                CC   coord conjunc (and/or/..)
  • DT   determiner (the/a/..)      CD   cardinal number
  • IN   preposition (in/of/..)     PRP  personal pronoun (I/you/..)
  • VB   verb                       RB   adverb (gently, now)
  • -R   comparative (better)
  • -S   superlative (bravest) or plural
  • -$   possessive (my)

Verb Tags
  • VBP  base present         take
  • VB   infinitive           take
  • VBD  past                 took
  • VBG  present participle   taking
  • VBN  past participle      taken
  • VBZ  present 3sg          takes
  • MD   modal                can, would

Simple Tagging in NLTK
  • Reading Tagged Corpora:
      >>> from nltk.corpus import treebank
      >>> treebank.fileids()
      >>> treebank.tagged_sents('wsj_0001.mrg')[0]
      [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ...]
    - see also Brown corpus, CoNLL 2000, Alpino and more
  • Tagging a string
      >>> import nltk
      >>> nltk.tag.pos_tag('Fruit flies like a banana'.split())
      [('Fruit', 'NN'), ('flies', 'NNS'), ('like', 'IN'), ('a', 'DT'), ('banana', 'NN')]
    - (N.b. older NLTK releases use a maximum entropy tagger here; newer ones default to an averaged perceptron tagger, so the output may differ)

Tagging Algorithms
  • rule based taggers
    - original methods, based on layers of rules about how to tag words based on their context (e.g., Brill tagger)
  • unigram tagger
    - assign the tag which is the most probable for the word in question, based on frequency in a training corpus
  • bigram tagger, n-gram tagger
    - inspect one or more tags in the context (usually, immediate left context)
  • Maximum entropy and HMM taggers (next lecture)

Unigram Tagging
  • Unigram = table of tag frequencies for each word
    - e.g.
      in tagged WSJ sample (from Penn Treebank): deal: NN (11); VB (1); VBP (1)
  • Training
    - load a corpus
    - count the occurrences of each (word, tag) in the corpus
  • Tagging
    - look up the most common tag for each word to tag
  • Gets 90% accuracy!
  • See the code in nltk.tag.UnigramTagger

The problem with unigram taggers
  • what evidence do they consider when assigning a tag?
  • when does this method fail?

Fixing the problem using a bigram tagger
  • construct sentences involving a word which can have two different parts of speech
    - e.g. wind: noun, verb
    - The wind blew forcefully
    - I wind up the clock
  • gather statistics for current tag, based on:
    - (i) current word; (ii) previous tag
    - result: a 2-D array of frequency distributions
    - what does this look like?

Generalizing the context

Bigram & n-gram taggers
  • n-gram tagger: consider n-1 previous tags
    - how big does the model get?
    - how much data do we need to train it?
  • Sparse-data problem:
    - As n gets large, the chances of having seen all possible patterns of tags during training diminish (large: >3)
  • Approaches:
    - Combine taggers (backoff, weighted average)
    - statistical estimation of the probability of unseen events
  • See nltk.tag.sequential.NgramTagger and various others in nltk.tag package

Markov Model Taggers
  • Recall n-gram language model
    - similar problem of modelling next word given previous words, similar issues with sparsity and estimation
    - here we focus on generating tag sequences rather than words
    - both are instances of a Markov model
  • tag sequence modelled as a Markov chain
    - each tag is linked to the word sequence
  • Can we just predict each tag in sequence?
    - need to know the preceding tag(s)
    - but these are unknown…
  • Next lecture, we’ll explore this further using Hidden Markov Models

The Brill Rule-Based Tagger
  • The Linguistic Complaint:
    - where is the linguistic knowledge of a tagger?
    - just a massive table of numbers
    - aren't there any linguistic insights that could emerge from the data?
  • Transformation-Based Tagging / Brill Tagging:
    - Tag each word with its most likely tag
    - Repeatedly correct tags based on context
    - Example rule: NN VB PREVTAG TO (change NN to VB when the previous tag is TO)
      to/TO race/NN -> to/TO race/VB
    - Other contexts: PREV1OR2TAG, PREV1OR2WD, WDNEXTTAG, ...
  • See nltk.tag.brill.BrillTagger

Evaluating Tagger Performance
  • Need an objective measure of performance
  • Commonly use per-token accuracy
    - measured against heldout ‘gold standard’ data
    - fraction of words tagged correctly
  • Simple methods get ~90% performance
    - 1 and 2-gram
    - Brill tagger
  • HMMs get ~95% and CRFs get ~97% performance
    - see nltk.tag.{hmm,tnt,crf,stanford,senna,…}
  • Why can't we get 100%?

Tagging: broader lessons
  • Tagging has several properties that are typical of NLP
    - classification (words have properties)
    - disambiguation through representation
    - sequence learning from annotated corpora
    - simple, general methods: conditional frequency distributions
  • Cool things you can do now: elementary NLU, NLG
  • Review:
    - tokenization + tagging = segmentation and annotation of words
    - chunking = segmentation and annotation of word sequences

Readings
  • One of:
    - Jurafsky & Martin, chapter 5
    - Manning & Schütze, chapter 10
    - NLTK tagging tutorial
  • Next lecture
    - tagging with (hidden) Markov models
    - other sequence tagging tasks: named entity tagging, shallow parsing
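
The two Brill tagging steps (tag each word with its most likely tag, then correct tags based on context) and the per-token accuracy measure from the evaluation slide can be sketched together in plain Python. This is an illustrative toy under stated assumptions, not nltk.tag.brill.BrillTagger: the training sentences are invented, the function names are hypothetical, and the single rule NN VB PREVTAG TO is written by hand, whereas a real Brill tagger learns its rules from annotated data.

```python
from collections import Counter, defaultdict

# Invented toy training data (Penn Treebank tags); "race" is mostly a noun.
TRAIN = [
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("a", "DT"), ("race", "NN"), ("began", "VBD")],
    [("they", "PRP"), ("want", "VBP"), ("to", "TO"), ("race", "VB")],
]


def train_most_likely_tag(sents):
    """Step 1 model: map each word to its most frequent tag (a unigram table)."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def initial_tags(words, lexicon, default="NN"):
    """Tag each word with its most likely tag; unknown words get the default."""
    return [lexicon.get(word, default) for word in words]


# (from_tag, to_tag, prev_tag): change from_tag to to_tag if the previous
# tag is prev_tag. This encodes the slide's rule "NN VB PREVTAG TO".
RULES = [("NN", "VB", "TO")]


def apply_rules(tags, rules):
    """Step 2: apply each transformation rule in order, left to right."""
    tags = list(tags)  # work on a copy
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


def accuracy(predicted, gold):
    """Per-token accuracy: fraction of tags that match the gold standard."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


lexicon = train_most_likely_tag(TRAIN)
words = ["they", "want", "to", "race"]
gold = ["PRP", "VBP", "TO", "VB"]

before = initial_tags(words, lexicon)
after = apply_rules(before, RULES)
print(before, accuracy(before, gold))  # ['PRP', 'VBP', 'TO', 'NN'] 0.75
print(after, accuracy(after, gold))    # ['PRP', 'VBP', 'TO', 'VB'] 1.0
```

The test sentence here also appears in the training data, which is only acceptable for a demonstration; a real evaluation measures accuracy on held-out gold standard data, as the evaluation slide says.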