This post shows how to part-of-speech tag a corpus with R. Part-of-speech tagging, or POS tagging, is a form of text annotation in which POS tags are assigned to lexical items. <!--more--> Parts of speech, or word categories, refer to the grammatical nature of a lexical item. For example, in the sentence “John likes the girl”, each lexical item can be classified according to whether it belongs to the group of determiners, verbs, or nouns. Tagged manually, the example sentence could look like this: “John/NNP likes/VBZ the/DT girl/NN”, where NNP stands for proper noun, VBZ for 3rd person singular present tense verb, DT for determiner, and NN for singular common noun.

The POS tags used by the openNLP package are the Penn English Treebank POS tags. Here is a list of these tags and what they stand for:

CC = Coordinating conjunction
CD = Cardinal number
DT = Determiner
EX = Existential there
FW = Foreign word
IN = Preposition or subordinating conjunction
JJ = Adjective
JJR = Adjective, comparative
JJS = Adjective, superlative
LS = List item marker
MD = Modal
NN = Noun, singular or mass
NNS = Noun, plural
NNP = Proper noun, singular
NNPS = Proper noun, plural
PDT = Predeterminer
POS = Possessive ending
PRP = Personal pronoun
PRP$ = Possessive pronoun
RB = Adverb
RBR = Adverb, comparative
RBS = Adverb, superlative
RP = Particle
SYM = Symbol
TO = to
UH = Interjection
VB = Verb, base form
VBD = Verb, past tense
VBG = Verb, gerund or present participle
VBN = Verb, past participle
VBP = Verb, non-3rd person singular present
VBZ = Verb, 3rd person singular present
WDT = Wh-determiner
WP = Wh-pronoun
WP$ = Possessive wh-pronoun
WRB = Wh-adverb

In R we can POS tag large amounts of text using the openNLP package, which also requires the NLP package and the language models on which openNLP works. You can find more information on the openNLP package and how it works <a href="http://cran.r-project.org/web/packages/openNLP/openNLP.pdf">here</a>.
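To make the tag notation concrete before turning to the tagger itself, here is a minimal base R sketch that takes the hand-tagged example sentence (written with a slash between token and tag), splits it into tokens and tags, and looks up what each tag stands for. The object names (`tagged`, `penn`, `tagged_df`) are made up for this illustration, and the lookup table only covers the four tags that occur in the sentence.

```r
# Hand-tagged example sentence, one token/TAG pair per word.
tagged <- "John/NNP likes/VBZ the/DT girl/NN"

# Split the sentence into pairs, then each pair into token and tag.
pairs  <- strsplit(tagged, " ", fixed = TRUE)[[1]]
parts  <- strsplit(pairs, "/", fixed = TRUE)
tokens <- vapply(parts, `[`, character(1), 1)
tags   <- vapply(parts, `[`, character(1), 2)

# Lookup table for the tags used here (descriptions taken from the tag list).
penn <- c(NNP = "Proper noun, singular",
          VBZ = "Verb, 3rd person singular present",
          DT  = "Determiner",
          NN  = "Noun, singular or mass")

# One row per token with its tag and the tag's description.
tagged_df <- data.frame(token = tokens,
                        tag = tags,
                        description = unname(penn[tags]),
                        stringsAsFactors = FALSE)
print(tagged_df)
```

This does no tagging of its own. It only unpacks tags that are already there, which is roughly the shape of data you end up with after running an automatic tagger.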
The openNLP package uses the Apache OpenNLP Maxent Part of Speech tagger, a trained POS tagger that assigns tags based on the probability of each tag being the correct one: the POS tag with the highest probability is selected. Below is an example of how you can implement POS tagging in R.

<pre lang="rsplus">
##########################################################
### --- Part-of-Speech tagging and syntactic parsing with R
##########################################################
### --- install required packages and models for POS tagging
# To install the openNLP models, please download and install the
# packages/models directly from http://datacube.wu.ac.at/.
# To install these packages/models, simply enter
#install.packages("foo", repos = "http://datacube.wu.ac.at/", type = "source")
# into your R console. E.g. enter the following line
# to install the file "openNLPmodels.en_1.5-1.tar.gz":
#install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/", type = "source")
#install.packages("openNLP")
#install.packages("NLP")
# load packages
library("NLP")
library("openNLP")
library("openNLPmodels.en")
##########################################################
## Some text.
s <- paste(c("John likes the girl.\n",
             "The girl also likes John."),
           collapse = "")
s <- as.String(s)
## Need sentence and word token annotations.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
pos_tag_annotator
#> An annotator inheriting from classes
#>   Simple_POS_Tag_Annotator Annotator
#> with description
#>   Computes POS tag annotations using the Apache OpenNLP Maxent Part of
#>   Speech tagger employing the default model for language 'en'
a3 <- annotate(s, pos_tag_annotator, a2)
a3
#>  id type     start end features
#>   1 sentence     1  20 constituents=<<integer,5>>
#>   2 sentence    22  46 constituents=<<integer,6>>
#>   3 word         1   4 POS=NNP
#>   4 word         6  10 POS=VBZ
#>   5 word        12  14 POS=DT
#>   6 word        16  19 POS=NN
#>   7 word        20  20 POS=.
#>   8 word        22  24 POS=DT
#>   9 word        26  29 POS=NN
#>  10 word        31  34 POS=RB
#>  11 word        36  40 POS=VBZ
#>  12 word        42  45 POS=NNP
#>  13 word        46  46 POS=.
## Variant with POS tag probabilities as (additional) features.
head(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2))
#>  id type     start end features
#>   1 sentence     1  20 constituents=<<integer,5>>
#>   2 sentence    22  46 constituents=<<integer,6>>
#>   3 word         1   4 POS=NNP POS_prob=0.9664531
#>   4 word         6  10 POS=VBZ POS_prob=0.9183389
#>   5 word        12  14 POS=DT POS_prob=0.9814714
#>   6 word        16  19 POS=NN POS_prob=0.997068
## Determine the distribution of POS tags for word tokens.
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, '[[', "POS")
tags
#>  [1] "NNP" "VBZ" "DT"  "NN"  "."   "DT"  "NN"  "RB"  "VBZ" "NNP" "."
table(tags)
#> tags
#>   .  DT  NN NNP  RB VBZ
#>   2   2   2   2   1   2
## Extract token/POS pairs (all of them): easy.
sprintf("%s/%s", s[a3w], tags)
#>  [1] "John/NNP"  "likes/VBZ" "the/DT"    "girl/NN"   "./."       "The/DT"
#>  [7] "girl/NN"   "also/RB"   "likes/VBZ" "John/NNP"  "./."
## Extract pairs of word tokens and POS tags for second sentence:
a3ws2 <- annotations_in_spans(subset(a3, type == "word"),
                              subset(a3, type == "sentence")[2L])[[1L]]
sprintf("%s/%s", s[a3ws2], sapply(a3ws2$features, '[[', "POS"))
#> [1] "The/DT"    "girl/NN"   "also/RB"   "likes/VBZ" "John/NNP"  "./."
##########################################################
</pre>

I hope this helps. I will also be posting some updates to include more useful examples.
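As one possible next step, token/POS strings like those produced by `sprintf()` in the example can be split back into a table and filtered by tag. The sketch below uses base R only. The pairs are typed in by hand (they correspond to the two example sentences) so that the snippet runs without openNLP installed, and the names `pos_df`, `nouns`, and `tag_counts` are made up for this illustration.

```r
# Token/POS pairs for the two example sentences, entered by hand.
pairs <- c("John/NNP", "likes/VBZ", "the/DT", "girl/NN", "./.",
           "The/DT", "girl/NN", "also/RB", "likes/VBZ", "John/NNP", "./.")

# Split on the LAST slash so that the punctuation pair "./." is handled correctly.
pos_df <- data.frame(
  token = sub("/[^/]*$", "", pairs),  # everything before the last slash
  tag   = sub("^.*/", "", pairs),     # everything after the last slash
  stringsAsFactors = FALSE
)

# Keep only nouns: the Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with "NN".
nouns <- pos_df[startsWith(pos_df$tag, "NN"), ]
print(nouns)

# Frequency of each tag, most frequent first.
tag_counts <- sort(table(pos_df$tag), decreasing = TRUE)
print(tag_counts)
```

When working with real tagger output, it is usually simpler to build the data frame straight from the tokens and the `tags` vector rather than from pasted strings. The string-splitting route is shown here only because it works on already exported "token/TAG" text.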