Part-of-Speech Tagging
CSCI-GA.2590 – Lecture 4
Ralph Grishman, NYU

Parts of Speech
• Grammar is stated in terms of parts of speech ('preterminals'):
  – classes of words sharing syntactic properties: noun, verb, adjective, …

POS Tag Sets
• The most influential tag sets were those defined for projects to produce large POS-annotated corpora:
• Brown corpus
  – 1 million words from a variety of genres
  – 87 tags
• UPenn Tree Bank
  – initially 1 million words of Wall Street Journal; later retagged Brown
  – first POS tags, then full parses
  – 45 tags (some distinctions captured in the parses)

The Penn POS Tag Set
• Noun categories
  – NN (common singular)
  – NNS (common plural)
  – NNP (proper singular)
  – NNPS (proper plural)
• Verb categories
  – VB (base form)
  – VBZ (3rd person singular present tense)
  – VBP (present tense, other than 3rd person singular)
  – VBD (past tense)
  – VBG (present participle)
  – VBN (past participle)

Some Tricky Cases
• present participles which act as prepositions:
  – according/JJ to
• nationalities:
  – English/JJ cuisine
  – an English/NNP sentence
• adjective vs. participle:
  – the striking/VBG teachers
  – a striking/JJ hat
  – he was very surprised/JJ
  – he was surprised/VBN by his wife

Tokenization
• any annotated corpus assumes some tokenization
• relatively straightforward for English
  – generally defined by whitespace and punctuation
  – treat the negative contraction as a separate token: do | n't
  – treat the possessive as a separate token: cat | 's
  – do not split hyphenated terms: Chicago-based

The Tagging Task
• Task: assigning a POS to each word
• not trivial: many words have several tags
• a dictionary only lists the possible POS, independent of context
• how about using a parser to determine tags?
  – some analyzers (e.g., partial parsers) assume the input is already tagged

Why Tag?
• POS tagging can help parsing by reducing ambiguity
• can resolve some pronunciation ambiguities for text-to-speech ("desert")
• can resolve some semantic ambiguities

Simple Models
• Natural language is very complex
  – we don't know how to model it fully, so we build simplified models which provide some approximation to natural language

Corpus-Based Methods
• How can we measure 'how good' these models are?
  – assemble a text corpus
  – annotate it by hand with respect to the phenomenon we are interested in
  – compare it with the predictions of our model
    • for example, how well the model predicts part-of-speech or syntactic structure

Preparing a Good Corpus
• To build a good corpus
  – we must define a task people can do reliably (choose a suitable POS set, for example)
  – we must provide good documentation for the task, so annotation can be done consistently
  – we must measure human performance (through dual annotation and inter-annotator agreement; one common agreement measure is sketched below)
• Often requires several iterations of refinement
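As an illustration of measuring inter-annotator agreement, here is a minimal sketch that computes Cohen's kappa from two annotators' tag sequences over the same tokens. The choice of kappa, the function name, and the toy data are illustrative assumptions; the lecture does not specify which agreement statistic to use.

```python
from collections import Counter

def cohens_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two annotations of the same tokens."""
    assert len(tags_a) == len(tags_b) and tags_a
    n = len(tags_a)
    # observed agreement: fraction of tokens where the two annotators agree
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # expected (chance) agreement, from each annotator's overall tag distribution
    dist_a, dist_b = Counter(tags_a), Counter(tags_b)
    expected = sum((dist_a[t] / n) * (dist_b[t] / n) for t in dist_a)
    return (observed - expected) / (1 - expected)

# hypothetical dual annotation of the same five tokens
print(cohens_kappa(["DT", "NN", "VBZ", "JJ", "NN"],
                   ["DT", "NN", "VBZ", "JJ", "NNS"]))   # 0.75
```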
Training the Model
• How to build a model?
  – need a goodness metric
  – train by hand, by adjusting rules and analyzing errors (ex: Constraint Grammar)
  – train automatically
    • develop new rules
    • build a probabilistic model (generally very hard to do by hand)
  – choice of model affected by ability to train it (NN)

The Simplest Model
• The simplest POS model considers each word separately:
  P(T | W) ≈ ∏_i P(t_i | w_i)
• We tag each word with its most likely part-of-speech
  – this works quite well: about 90% accuracy when trained and tested on similar texts
  – although many words have multiple parts of speech, one POS typically dominates within a single text type
• How can we take advantage of context to do better?

A Language Model
• To see how we might do better, let us consider a related problem: building a language model
  – a language model can generate sentences following some probability distribution

Markov Model
• In principle each word we select depends on all the decisions which came before (all preceding words in the sentence)
• But we'll make life simple by assuming that each decision depends only on the immediately preceding decision
  – a [first-order] Markov Model
• representable by a finite state transition network
• T_ij = probability of a transition from state i to state j

Finite State Network
[diagram: a transition network with states start, "dog: woof", "cat: meow", and end, with transition probabilities (0.30–0.50) on the arcs]

Our Bilingual Pets
• Suppose our cat learned to say "woof" and our dog "meow"
• … they started chatting in the next room
• … and we wanted to know who said what

Hidden State Network
[diagram: hidden states start, dog, cat, and end; the dog and cat states can each emit "woof" or "meow"]

How Do We Predict?
• When the cat is talking: t_i = cat
• When the dog is talking: t_i = dog
• We construct a probabilistic model of the phenomenon
• And then seek the most likely state sequence:
  S = argmax_{t_1…t_n} P(t_1 … t_n | w_1 … w_n)

Hidden Markov Model
• Assume the current word depends only on the current tag:
  argmax_{t_1…t_n} P(t_1 … t_n | w_1 … w_n)
    = argmax_{t_1…t_n} P(w_1, …, w_n | t_1, …, t_n) P(t_1, …, t_n)
    ≈ argmax_{t_1…t_n} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

HMM for POS Tagging
• We can use the same formulas for POS tagging
  – the hidden states are the POS tags

Training an HMM
• Training an HMM is simple if we have a completely labeled corpus:
  – the POS of each word has been marked
  – we can directly estimate both P(t_i | t_{i-1}) and P(w_i | t_i) from corpus counts
    • using the Maximum Likelihood Estimator

Greedy Decoder
• the simplest decoder (tagger) assigns tags deterministically from left to right
• selects t_i to maximize P(w_i | t_i) * P(t_i | t_{i-1})
• does not take advantage of right context
• can we do better?
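To make the greedy decoder concrete, here is a minimal sketch assuming the emission probabilities P(w | t) and transition probabilities P(t | t_prev) have already been estimated from a tagged corpus and stored in dictionaries. The function name, the start symbol, the smoothing floor, and the toy tables are illustrative assumptions, not the Jet tagger's implementation.

```python
def greedy_tag(words, tagset, emit_p, trans_p):
    """Greedy left-to-right HMM decoding: for each word, pick the tag t that
    maximizes P(word | t) * P(t | previous tag), ignoring right context."""
    tags = []
    prev = "<s>"  # assumed sentence-start symbol
    for w in words:
        best = max(tagset,
                   key=lambda t: emit_p.get((w, t), 1e-6) * trans_p.get((prev, t), 1e-6))
        tags.append(best)
        prev = best
    return tags

# hypothetical toy tables (in practice, maximum likelihood estimates from a corpus)
tagset = ["DT", "NN", "VBZ"]
emit_p = {("the", "DT"): 0.9, ("dog", "NN"): 0.4, ("barks", "VBZ"): 0.3}
trans_p = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("NN", "VBZ"): 0.4}
print(greedy_tag(["the", "dog", "barks"], tagset, emit_p, trans_p))  # ['DT', 'NN', 'VBZ']
```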
Viterbi Decoder
[slides presenting the Viterbi decoding algorithm]

Performance
• Accuracy with a good unknown-word model, trained and tested on WSJ, is 96.5% to 96.8%

Unknown Words
• Problem (as with Naive Bayes) of zero counts: words not in the training corpus
  – simplest: assume all POS are equally likely for unknown words
  – can make a better estimate by observing that unknown words are very likely open-class words, and most likely nouns
    • base P(t | w) for an unknown word on the probability distribution of words which occur once in the corpus

Unknown Words, cont'd
• can do even better by taking into account the form of the word
  – whether it is capitalized
  – whether it is hyphenated
  – its last few letters

Trigram Models
• in some cases we need to look two tags back to find an informative context
  – e.g., conjunction (N and N, V and V, …)
• but there's not enough data for a pure trigram model
• so combine unigram, bigram, and trigram estimates:
  – linear interpolation
  – backoff

Domain Adaptation
• Substantial loss in shifting to a new domain
  – 8–10% loss in the shift from WSJ to the biology domain
  – adding a small annotated sample (200–500 sentences) in the new domain greatly reduces the error
  – some reduction is possible without annotated target data (Blitzer, Structural Correspondence Learning)

Jet Tagger
• HMM-based
• trained on WSJ
• file pos_hmm.txt

Transformation-Based Learning
• TBL provides a very different corpus-based approach to part-of-speech tagging
• It learns a set of rules for tagging
  – the result is inspectable

TBL Model
• TBL starts by assigning each word its most likely part of speech
• Then it applies a series of transformations to the corpus
  – each transformation states some condition and some change to be made to the assigned POS if the condition is met
  – for example:
    • Change NN to VB if the preceding tag is TO.
    • Change VBP to VB if one of the previous 3 tags is MD.

Transformation Templates
• Each transformation is based on one of a small number of templates, such as:
  – Change tag x to y if the preceding tag is z.
  – Change tag x to y if one of the previous 2 tags is z.
  – Change tag x to y if one of the previous 3 tags is z.
  – Change tag x to y if the next tag is z.
  – Change tag x to y if one of the next 2 tags is z.
  – Change tag x to y if one of the next 3 tags is z.

Training the TBL Model
• To train the tagger, using a hand-tagged corpus, we begin by assigning each word its most common POS.
• We then try all possible rules (all instantiations of one of the templates) and keep the best rule: the one which corrects the most errors.
• We do this repeatedly until we can no longer find a rule which corrects some minimum number of errors.
• (A minimal sketch of this training loop appears at the end of these notes.)

Some Transformations
• the first 9 transformations found for the WSJ corpus:

  Change   to    if
  NN       VB    previous tag is TO
  VBP      VB    one of previous 3 tags is MD
  NN       VB    one of previous 2 tags is MD
  VB       NN    one of previous 2 tags is DT
  VBD      VBN   one of previous 3 tags is VBZ
  VBN      VBD   previous tag is PRP
  VBN      VBD   previous tag is NNP
  VBD      VBN   previous tag is VBD
  VBP      VB    previous tag is TO

TBL Performance
• Performance competitive with a good HMM
  – accuracy 96.6% on WSJ
• Compared to an HMM, much slower to train, but faster to apply
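As referenced above, here is a minimal sketch of the greedy TBL training loop, restricted to the single "change x to y if the preceding tag is z" template. The function name, the min_gain parameter, and the simultaneous rule application are illustrative assumptions; a full trainer would search all six templates listed on the templates slide.

```python
def train_tbl(words, gold_tags, most_common_tag, min_gain=2):
    """Greedy TBL training with one template:
    'change tag x to y if the preceding tag is z'."""
    current = [most_common_tag[w] for w in words]  # start: each word's most common POS
    rules = []
    while True:
        # candidate rules are read off the remaining tagging errors
        candidates = {(current[i], gold_tags[i], current[i - 1])
                      for i in range(1, len(words)) if current[i] != gold_tags[i]}
        best_rule, best_gain = None, 0
        for x, y, z in candidates:
            gain = 0
            for i in range(1, len(words)):
                if current[i] == x and current[i - 1] == z:
                    if gold_tags[i] == y:
                        gain += 1      # an error this rule would correct
                    elif gold_tags[i] == x:
                        gain -= 1      # a correct tag this rule would break
            if gain > best_gain:
                best_rule, best_gain = (x, y, z), gain
        if best_rule is None or best_gain < min_gain:
            break  # no rule corrects the minimum number of errors
        x, y, z = best_rule
        # apply the chosen rule to the whole corpus, then look for the next rule
        current = [y if i > 0 and current[i] == x and current[i - 1] == z else current[i]
                   for i in range(len(current))]
        rules.append(best_rule)
    return rules
```

At tagging time the learned rules are simply replayed, in order, over a most-common-tag baseline for the new text, which is why the resulting tagger is fast to apply and easy to inspect.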