POS tagging in R

This post shows how to tag a corpus for parts of speech with R. Part-of-Speech tagging, or POS tagging, is a form
of text annotation in which POS tags are assigned to lexical items.
<!--more-->
Parts of speech, or word categories, describe the grammatical nature of a lexical item. In the
sentence “John likes the girl”, for example, each lexical item can be classified as a determiner,
a verb, or a noun. Tagged manually, the example sentence could look like this:
“John/NNP likes/VBZ the/DT girl/NN”, where NNP stands for proper noun, VBZ stands
for 3rd person singular present tense verb, DT for determiner, and NN for singular common noun.
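As a minimal sketch of what such an annotation looks like as data, the manually tagged sentence can be split into tokens and tags with base R. The slash separator and the variable names are choices made for this illustration only.

<pre lang="rsplus">
## A manually tagged sentence; "/" separates token and tag (illustrative choice)
tagged <- "John/NNP likes/VBZ the/DT girl/NN"
## Split the sentence into token/tag pairs on whitespace
pairs <- strsplit(tagged, " ", fixed = TRUE)[[1]]
## Separate each pair into its token and its tag
parts <- strsplit(pairs, "/", fixed = TRUE)
tokens <- sapply(parts, `[`, 1)
tags <- sapply(parts, `[`, 2)
## One row per lexical item
data.frame(token = tokens, tag = tags)
</pre>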
The POS tags used by the openNLP package are the Penn English Treebank POS tags – here is a list of
these tags and what they stand for:
CC = Coordinating conjunction
CD = Cardinal number
DT = Determiner
EX = Existential there
FW = Foreign word
IN = Preposition or subordinating conjunction
JJ = Adjective
JJR = Adjective, comparative
JJS = Adjective, superlative
LS = List item marker
MD = Modal
NN = Noun, singular or mass
NNS = Noun, plural
NNP = Proper noun, singular
NNPS = Proper noun, plural
PDT = Predeterminer
POS = Possessive ending
PRP = Personal pronoun
PRP$ = Possessive pronoun
RB = Adverb
RBR = Adverb, comparative
RBS = Adverb, superlative
RP = Particle
SYM = Symbol
TO = to
UH = Interjection
VB = Verb, base form
VBD = Verb, past tense
VBG = Verb, gerund or present participle
VBN = Verb, past participle
VBP = Verb, non-3rd person singular present
VBZ = Verb, 3rd person singular present
WDT = Wh-determiner
WP = Wh-pronoun
WP$ = Possessive wh-pronoun
WRB = Wh-adverb
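If you want to translate tags back into readable labels, one option is a named character vector used as a lookup table. The sketch below covers only a handful of the tags listed above, and the object name penn_tags is made up for this example.

<pre lang="rsplus">
## A small excerpt of the tag list as a named character vector
penn_tags <- c(NNP = "Proper noun, singular",
               VBZ = "Verb, 3rd person singular present",
               DT  = "Determiner",
               NN  = "Noun, singular or mass",
               RB  = "Adverb")
## Tags of the example sentence "John likes the girl"
tags <- c("NNP", "VBZ", "DT", "NN")
## Look up the description of each tag; unname() drops the tag names
unname(penn_tags[tags])
</pre>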
In R we can POS tag large amounts of text using the openNLP package, which also requires the NLP
package and the models on which the openNLP package works. You can find more
information on the openNLP package and how it works <a href="http://cran.r-project.org/web/packages/openNLP/openNLP.pdf">here</a>. The openNLP package uses the
Apache OpenNLP Maxent Part of Speech tagger, a trained POS tagger that assigns POS tags
based on the probability of what the correct POS tag is: the POS tag with the highest probability is
selected.
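The selection step can be pictured with a few lines of base R. The probabilities below are made-up numbers for the token "likes", not output of the tagger, and the sketch only shows how the most probable tag is picked, not how the model estimates the probabilities.

<pre lang="rsplus">
## Hypothetical tag probabilities for the token "likes" (made-up numbers)
probs <- c(VBZ = 0.92, NNS = 0.07, NN = 0.01)
## Select the tag with the highest probability
best_tag <- names(probs)[which.max(probs)]
best_tag
</pre>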
Below is an example of how you can implement the POS tagging in R.
<pre lang="rsplus">
##########################################################
### --- Part-of-Speech tagging and syntactic parsing with R
##########################################################
### --- install required models for pos tagging
#install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/",
#                 type = "source")
#install.packages("openNLP")
#install.packages("NLP")
# load packages
library("NLP")
library("openNLP")
library("openNLPmodels.en")
# to install openNLPmodels, please download and install the packages/models
# directly from http://datacube.wu.ac.at/. To install these packages/models,
# simply enter
#install.packages("foo", repos = "http://datacube.wu.ac.at/", type = "source")
# into your R console. E.g. enter:
#install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/",
#                 type = "source")
# to install the file "openNLPmodels.en_1.5-1.tar.gz"
##########################################################
## Some text.
s <- paste(c("John likes the girl.\n",
"The girl also likes John."),
collapse = "")
s <- as.String(s)
## Need sentence and word token annotations.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
pos_tag_annotator
#>An annotator inheriting from classes
#> Simple_POS_Tag_Annotator Annotator
#>with description
#> Computes POS tag annotations using the Apache OpenNLP Maxent Part of
#> Speech tagger employing the default model for language 'en'
a3 <- annotate(s, pos_tag_annotator, a2)
a3
#>  id type     start end features
#>   1 sentence     1  20 constituents=<<integer,5>>
#>   2 sentence    22  46 constituents=<<integer,6>>
#>   3 word         1   4 POS=NNP
#>   4 word         6  10 POS=VBZ
#>   5 word        12  14 POS=DT
#>   6 word        16  19 POS=NN
#>   7 word        20  20 POS=.
#>   8 word        22  24 POS=DT
#>   9 word        26  29 POS=NN
#>  10 word        31  34 POS=RB
#>  11 word        36  40 POS=VBZ
#>  12 word        42  45 POS=NNP
#>  13 word        46  46 POS=.
## Variant with POS tag probabilities as (additional) features.
head(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2))
#>  id type     start end features
#>   1 sentence     1  20 constituents=<<integer,5>>
#>   2 sentence    22  46 constituents=<<integer,6>>
#>   3 word         1   4 POS=NNP POS_prob=0.9664531
#>   4 word         6  10 POS=VBZ POS_prob=0.9183389
#>   5 word        12  14 POS=DT POS_prob=0.9814714
#>   6 word        16  19 POS=NN POS_prob=0.997068
## Determine the distribution of POS tags for word tokens.
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, '[[', "POS")
tags
#>  [1] "NNP" "VBZ" "DT"  "NN"  "."   "DT"  "NN"  "RB"  "VBZ" "NNP" "."
table(tags)
#>tags
#>   .  DT  NN NNP  RB VBZ
#>   2   2   2   2   1   2
## Extract token/POS pairs (all of them): easy.
sprintf("%s/%s", s[a3w], tags)
#>  [1] "John/NNP"  "likes/VBZ" "the/DT"    "girl/NN"   "./."       "The/DT"
#>  [7] "girl/NN"   "also/RB"   "likes/VBZ" "John/NNP"  "./."
## Extract pairs of word tokens and POS tags for second sentence:
a3ws2 <- annotations_in_spans(subset(a3, type == "word"),
subset(a3, type == "sentence")[2L])[[1L]]
sprintf("%s/%s", s[a3ws2], sapply(a3ws2$features, '[[', "POS"))
#> [1] "The/DT"    "girl/NN"   "also/RB"   "likes/VBZ" "John/NNP"  "./."
##########################################################
</pre>
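Once the tags are extracted, they can be processed like any other character vector. As a rough sketch, the snippet below takes the tags from the output above, drops the punctuation tags, and collapses the remaining tags into coarse classes. The mapping into "noun", "verb" and "other" is an illustrative choice, not part of openNLP.

<pre lang="rsplus">
## POS tags of the word tokens, copied from the output above
tags <- c("NNP", "VBZ", "DT", "NN", ".", "DT", "NN", "RB", "VBZ", "NNP", ".")
## Drop punctuation tags (those without any letters)
word_tags <- tags[grepl("[A-Z]", tags)]
## Collapse fine-grained tags into coarse classes (illustrative mapping)
coarse <- ifelse(grepl("^NN", word_tags), "noun",
          ifelse(grepl("^VB", word_tags), "verb", "other"))
## Relative frequency of each coarse class
round(prop.table(table(coarse)), 2)
</pre>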
I hope this helps and I will also be posting some updates to include more useful examples.