
COMP 791A: Statistical Language
Corpus-Based Work
Chap. 4
Using a Corpus
To approximate the probability distribution of
language events, we use a training corpus
Statistical NLP seeks to automatically learn
lexical and structural preferences from corpora.
Large database of text & speech
Many types of text corpora exist:
plain text, domain specific, tagged, parsed, parallel
bilingual, …
Major suppliers:
Linguistic Data Consortium (LDC) -- www.ldc.upenn.edu
European Language Resources Association (ELRA) -- www.icp.grnet.fr/ELRA
To derive the needed probabilities, a corpus
needs to be:
a representative sample of the population of interest
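As a minimal sketch of how probabilities are derived from a training corpus, the snippet below computes relative-frequency estimates of word probabilities. The function name `unigram_mle` and the toy corpus are assumptions made for illustration; they are not part of the course material.

```python
from collections import Counter

def unigram_mle(tokens):
    """Relative-frequency (maximum-likelihood) estimate of P(word)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Toy "training corpus" (an assumption, far too small to be representative).
corpus = "the dog saw the cat and the cat saw the dog".split()
probs = unigram_mle(corpus)
```

With a corpus this small the estimates only reflect the sample itself, which is why the corpus must be representative of the population of interest.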
Low-Level Formatting Issues
Junk formatting & content
Removal of typesetter codes (ex. HTML tags),
diagrams, tables, foreign words etc.
Also other problems if data was retrieved through
OCR (unrecognized words)
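One possible way to remove HTML tags is sketched below using Python's standard `html.parser` module. The class name `TextExtractor` and the helper `strip_markup` are hypothetical names, and skipping `script` and `style` content is an assumption about what counts as junk.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content; drop tags and the bodies of script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_markup(html):
    """Return the visible text of an HTML fragment, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Joining the text pieces with a space is a simplification: a word split by inline markup (such as `data<b>base</b>`) would come out as two tokens.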
Uppercase and Lowercase
should we keep the case or not?
“the”, “The” and “THE” should all be treated the same?
but in “George Brown” and “brown dog”, “brown” should
be treated separately…
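One simple heuristic for the case question is to lowercase only sentence-initial and all-caps tokens, and to leave other capitalized words alone. The sketch below assumes the text is already tokenized with punctuation as separate tokens; `fold_case` is a hypothetical name.

```python
def fold_case(tokens):
    """Lowercase sentence-initial and all-caps tokens; keep other capitals."""
    out = []
    sentence_start = True
    for tok in tokens:
        if sentence_start or (tok.isupper() and len(tok) > 1):
            out.append(tok.lower())
        else:
            out.append(tok)
        # The next token starts a sentence if this one is end punctuation.
        sentence_start = tok in {".", "?", "!"}
    return out

tokens = "The brown dog saw George Brown . THE END".split()
folded = fold_case(tokens)
```

The heuristic still makes mistakes: a proper name at the start of a sentence is lowercased, and so is an all-caps acronym.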
Finding Tokens and Sentences
divide the input text into units (called tokens)
each token is either a word or something else (ex. a
number or a punctuation mark)
Mark sentence boundaries
Most sentences end with ‘.’, ‘?’ or ‘!’
Can be confused by abbreviations
Tokenization -- What is a word?
Graphic word (Kučera and Francis, 1967):
“A string of contiguous alphanumeric characters with white
spaces on either side; may include hyphens and apostrophes, but
no other punctuation marks”
But what about: “$22.50” “C++”
“ :-)”
Main problems:
Abbreviation or end of sentence?
“etc.” “Calif.” “Wash.”
Is the period part of the word or not?
“Wash.” (Washington): need to keep the period to distinguish it from
“wash” (the verb)
Single apostrophes
Part of the word or not?
“Peter’s sick” --> 1 word? or 2 words?
If 1 word, then problems in parsing… S--> NP VP
If 2 words, then should “Peter’s house” be considered 2 words?
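The graphic-word definition can be approximated with a regular expression, which makes the problem cases easy to see. This is only a sketch: the pattern is restricted to ASCII letters and digits and does not check for whitespace on either side, both of which are simplifying assumptions.

```python
import re

# Runs of alphanumerics, optionally joined by hyphens or apostrophes.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+(?:[-'’][A-Za-z0-9]+)*")

def graphic_words(text):
    """Return the graphic words found in text, left to right."""
    return GRAPHIC_WORD.findall(text)

words = graphic_words("Peter's e-mail cost $22.50 in C++ :-)")
```

On this input the price is split into two numbers, “C++” is reduced to “C”, and the emoticon disappears, while “Peter's” stays a single token.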
Tokenization -- What is a word? (cont’d)
Hyphens: inserted at line breaks to improve justification of text, or part of the word?
Ex: “e-mail” “pro-life” “data-base”/“database”/“data base”
Remove them?
Should we distinguish 2 words that have the same spelling but
unrelated senses?
“Bow”: part of a ship / a knot of ribbon
“Saw”: instrument / past tense of “to see”
Word Segmentation in other languages:
Some languages do not put whitespace between words!
Ex: East-Asian languages
In German: “life insurance company employee” =
Tokenization -- What is a word? (cont’d)
Whitespace does not always indicate a word break
Ex: Do we really want to separate the phrases
“in spite of”
“as a matter of fact”
“work out”
If no, then what do we do with non-adjacent phrasal verbs?
“I could not work the answer out”
Variant forms of some semantic types
Ex. Telephone numbers
(514) 848-3074
+1 514 848 3074
+1 (514) 848 3074
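Variant forms such as these telephone numbers can be mapped to one canonical token before counting. The sketch below assumes North American numbers only; the canonical format and the default country code are arbitrary choices made for illustration.

```python
import re

def normalize_phone(text, default_country="1"):
    """Reduce a North American phone number to +C-AAA-EEE-NNNN form."""
    digits = re.sub(r"\D", "", text)  # keep digits only
    if len(digits) == 10:             # no country code given
        digits = default_country + digits
    if len(digits) != 11:
        raise ValueError(f"unexpected number of digits in {text!r}")
    return f"+{digits[0]}-{digits[1:4]}-{digits[4:7]}-{digits[7:]}"

variants = ["(514) 848-3074", "+1 514 848 3074", "+1 (514) 848 3074"]
normalized = {normalize_phone(v) for v in variants}
```

All three variants collapse to the same string, so they count as one event in the corpus statistics.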
Speech corpora
More contractions, fillers (ex. “Um” “well” “euh”)
Tokenization -- Lemmatizer
What about morphological variants?
Should “give”, “gives”, “given”, “giver”… be considered
different words?
Goal: “normalize” similar words
Two main approaches: stemming and morphological analysis
Stemming
Very “dumb” rules work well (for English and Romance languages)
Ex: the Porter stemmer
Strips off affixes and leaves the stem
give --> give, gives --> give + s, given --> give + en, …
uses simple rules (first applicable rule is applied):
IF word ends with “ies” but not with “eies” or “aies” THEN replace “ies” by “y”
IF word ends with “es” but not “aes”, “ees” or “oes” THEN replace “es” by “e”
IF word ends with “s” but not “us” or “ss” THEN remove “s”
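The three suffix rules listed above can be written directly as a function. This sketch covers only those three rules, not the full Porter stemmer, and it assumes the input word is already lowercased; `strip_plural` is a hypothetical name.

```python
def strip_plural(word):
    """Apply the first applicable of the three suffix rules."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # replace "ies" by "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"   # replace "es" by "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # remove "s"
    return word

stems = {w: strip_plural(w) for w in ["ponies", "files", "gives", "dogs", "class", "corpus", "boxes"]}
```

Most of the sample words reduce to a real word, but “boxes” becomes “boxe”, which shows how crude such rules are.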
Advantage: Fast
Disadvantages:
Rules depend on the language
Unreadable results:
EX: “computers”, “computation”, “computational” --> “comput”
May reduce different words to the same stem although they are actually different:
stocks --> stock
stockings --> stock
arm --> arm
army --> arm
organization --> organ
university --> universe
Morphological Analyzer
Apply morphological rules
(XXXes,V) --> (XXXe,V)
(XXXes,N) --> (XXXe,N)
files --> (file,N) (file,V)
Check that (file,N) (file,V) is in dictionary
Identifies the root which is an actual word
Fewer errors than stemming
More complex, too slow
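The rule-plus-dictionary idea can be sketched as follows. The tiny lexicon and the rule list are assumptions made for illustration; a real analyzer would need a full dictionary and many more rules.

```python
# Toy dictionary of (lemma, part of speech) pairs (an assumption).
LEXICON = {("file", "N"), ("file", "V"), ("give", "V")}

# Each rule is (surface suffix, replacement, POS), e.g. (XXXes,V) --> (XXXe,V).
RULES = [("es", "e", "V"), ("es", "e", "N")]

def analyze(word):
    """Return every (lemma, POS) analysis whose lemma is in the lexicon."""
    analyses = []
    for suffix, replacement, pos in RULES:
        if word.endswith(suffix):
            lemma = word[: len(word) - len(suffix)] + replacement
            if (lemma, pos) in LEXICON:  # keep only actual words
                analyses.append((lemma, pos))
    return analyses

files_analyses = analyze("files")
```

Because of the dictionary check, “boxes” gets no analysis here instead of the non-word “boxe”. The cost is that every candidate must be looked up.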
Sentences: What is a sentence?
Something ending with a ‘.’, ‘?’ or ‘!’
True in 90% of the cases
But sentences can also be split up by other punctuation marks
Ex: nested phrases:
“You remind me,” she remarked, “of your mother.”
We usually use heuristic methods
But hand-coded heuristics…
Some effort to use statistical methods for sentence-boundary detection
Typical classification problem…
Classify a period as an end-of-sentence marker or not
Use features such as case, length, POS tag of words preceding the period
Use decision trees, neural networks…
Some techniques reach 98-99% correct classification
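A hand-coded heuristic of the kind mentioned above might look like the sketch below. It is not a statistical classifier: it splits after ‘.’, ‘?’ or ‘!’ unless the token is in a small abbreviation list, and that list is an assumption made for illustration.

```python
# Small hand-made abbreviation list (an assumption; real lists are much larger).
ABBREVIATIONS = {"etc.", "calif.", "wash.", "dr.", "mr.", "mrs."}

def split_sentences(text):
    """Split after '.', '?' or '!' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # Ignore closing quotes and brackets when looking for end punctuation.
        ends = token.rstrip("\"'”’)").endswith((".", "?", "!"))
        if ends and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("He lives in Calif. near the coast. Is it far? No!")
```

The heuristic fails when an abbreviation also ends the sentence, which is one reason to treat the period as a classification problem instead.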
Marked-Up Data: Mark-up Schemes
Schemes developed to mark up the structure of text
Different Mark-up schemes:
COCOA format
older, and rather ad-hoc
And other related encodings: HTML, XML
Input text:
Bogota, 9 jan 90 (EFE) -- [text] Ricardo Alfonso Castellar, mayor of Achi,
in the northern department of Bolivar, who was kidnapped on 5 January,
apparently by army of national liberation (ELN) guerrillas, was found dead
today, according to authorities.
Castellar was kidnapped on 5 january on the outskirts of Achi, about 850
km north of Bogota, by a group of armed men, who forced him to
accompany them to an undisclosed location.
Police sources in Cartagena reported that Castellar's body showed signs
of torture and several bullet wounds.
Castellar was attacked by ELN guerrillas while he was traveling in a boat
down the Cauca river to the tenche area, a region within his jurisdiction.
In Cartagena it was reported that Castellar faced a “revolutionary trial”
by the ELN and that he was found guilty and executed.
Example (cont’d)
Text with named entity tags:
</ENAMEX>) -- [text] <ENAMEX TYPE= PERSON> Ricardo Alfonso
Castellar </ENAMEX>, mayor of <ENAMEX TYPE= LOCATION> Achi
</ENAMEX>, in the northern department of <ENAMEX TYPE=
LOCATION> Bolivar </ENAMEX>, who was kidnapped on <TIMEX TYPE=
DATE>5 January</TIMEX>, apparently by <ENAMEX TYPE=
ORGANIZATION> army of national liberation (ELN) </ENAMEX>
guerrillas, was found dead today, according to authorities.
<ENAMEX TYPE= PERSON> Castellar </ENAMEX> was kidnapped on
<TIMEX TYPE= DATE> 5 january </TIMEX>on the outskirts of <ENAMEX
TYPE= LOCATION> Achi </ENAMEX>, about 850 km north of <ENAMEX
TYPE= LOCATION> Bogota </ENAMEX>, by a group of armed men, who
forced him to accompany them to an undisclosed location.
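Named entity tags of this kind can be read back with a regular expression. The sketch below follows the attribute spelling shown in the example (`TYPE= PERSON`, optionally quoted). It assumes tags are not nested, and `extract_entities` is a hypothetical name.

```python
import re

# Matches <ENAMEX TYPE= X> ... </ENAMEX> and <TIMEX TYPE= X> ... </TIMEX>.
ENTITY = re.compile(
    r'<(ENAMEX|TIMEX)\s+TYPE=\s*"?(\w+)"?\s*>(.*?)</\1>', re.DOTALL
)

def extract_entities(tagged):
    """Return (entity type, text) pairs with whitespace normalized."""
    return [
        (m.group(2), " ".join(m.group(3).split()))
        for m in ENTITY.finditer(tagged)
    ]

sample = (
    "<ENAMEX TYPE= PERSON> Castellar </ENAMEX> was kidnapped on\n"
    "<TIMEX TYPE= DATE> 5 january </TIMEX>on the outskirts of <ENAMEX\n"
    "TYPE= LOCATION> Achi </ENAMEX>"
)
entities = extract_entities(sample)
```

The pattern tolerates a line break inside a tag, as in the wrapped example text.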
Example (cont’d)
Text with coreference tags:
Bogota, 9 jan 90 (EFE) -- [text] <COREF ID="1" MIN="Ricardo Alfonso
Castellar "> Ricardo Alfonso Castellar </COREF>, <COREF ID= "2" MIN="
mayor" REF="1" TYPE="IDENT"> mayor of Achi </COREF>, in the northern
department of Bolivar, who was kidnapped on 5 January, apparently by
<COREF ID= "3" MIN="army "> army of national liberation (ELN) guerrillas
</COREF>, was found dead today, according to authorities.
<COREF ID="4" MIN="Castellar" REF="1" TYPE="IDENT"> Castellar
</COREF> was kidnapped on 5 january on the outskirts of Achi, about 850
km north of Bogota, by <COREF ID="5" MIN= "group" REF="3"
TYPE="IDENT"> a group of armed men </COREF>, who forced <COREF
ID="6" MIN="him" REF="1" TYPE="IDENT"> him </COREF>...
Example (cont’d)
Interpretation of coreference tags
Each REF attribute points to the ID of an earlier mention of the same entity. In the tagged excerpt:
Chain 1: “Ricardo Alfonso Castellar” (ID 1), “mayor of Achi” (ID 2), “Castellar” (ID 4), “him” (ID 6)
Chain 3: “army of national liberation (ELN) guerrillas” (ID 3), “a group of armed men” (ID 5)
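The REF links in the coreference tags can be followed mechanically to group mentions into chains. The sketch below assumes quoted attribute values, no nested COREF tags, and no cycles among the REF links; `coref_chains` is a hypothetical name.

```python
import re

COREF = re.compile(r"<COREF\s+([^>]*)>(.*?)</COREF>", re.DOTALL)
ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def coref_chains(tagged):
    """Group mention texts under the ID of the first mention of their chain."""
    mentions, ref = {}, {}
    for m in COREF.finditer(tagged):
        attrs = dict(ATTR.findall(m.group(1)))
        mentions[attrs["ID"]] = " ".join(m.group(2).split())
        if "REF" in attrs:
            ref[attrs["ID"]] = attrs["REF"]
    chains = {}
    for mention_id, text in mentions.items():
        root = mention_id
        while root in ref:  # follow REF links back to the chain head
            root = ref[root]
        chains.setdefault(root, []).append(text)
    return chains

sample = (
    '<COREF ID="1" MIN="Ricardo Alfonso Castellar "> Ricardo Alfonso Castellar </COREF>, '
    '<COREF ID= "2" MIN=" mayor" REF="1" TYPE="IDENT"> mayor of Achi </COREF>, kidnapped by '
    '<COREF ID= "3" MIN="army "> army of national liberation (ELN) guerrillas </COREF>, by '
    '<COREF ID="5" MIN= "group" REF="3" TYPE="IDENT"> a group of armed men </COREF>'
)
chains = coref_chains(sample)
```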
Marked-Up Data: Grammatical Coding
to indicate the various parts of speech of words
Different Tag Sets have been used
Brown Tag Set: 87/179 tags
Penn Treebank (most used): 45 tags
CLAWS1: 132 tags
CLAWS2: 166 tags
CLAWS c5: 62 tags
London-Lund: 197 tags
The design of a Tag Set
Target feature (classification):
Tags are used to tell (the user) useful information about the
grammatical class of a word
Predictive feature:
Tags are used (by the system) to predict the behavior of
other words in the context
Example, in the Brown tag set:
VBG: verb, present participle
But such words can be used as a gerund or as a noun
Gerund: “While purchasing/VBG a gift, I noticed that I was out of money.”
Noun: “Concordia’s purchasing/VBG? Department is closed.”
2 conflicting goals:
splitting a tag improves prediction but makes classification harder