School of Computing
Tokenization and Morphology
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst,
and other contributors)
The main areas of linguistics
Rationalism: language models based on expert introspection
Empiricism: models via machine-learning from a corpus
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Lexeme/lemma is the “root form”, as opposed to its inflections (be vs. am/is/was…)
What’s a word?
How many words do you find in the following short text?
What is the biggest/smallest plausible answer to this question?
What problems do you encounter?
It’s a shame that our data-base is not up-to-date. It is a
shame that um, data base A costs $2300.50 and that
database B costs $5000. All databases cost far too much.
Time: 3 minutes
Counting words: tokenization
Tokenisation is a processing step where the input text is
automatically divided into units called tokens where each is either a word
or a number or a punctuation mark…
So, word count can ignore numbers, punctuation marks (?)
Word: Continuous alphanumeric characters delineated by whitespace.
Whitespace: space, tab, newline.
BUT dividing at spaces is too simple: It’s, data base
Another approach is to use regular expressions to specify which substrings
are valid words.
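The whitespace definition above can be tried on the exercise text. This is only a minimal sketch: the variable names are illustrative, and the curly apostrophe of the slide is typed as a plain ' here.

```python
from collections import Counter

# The short exercise text from the "What's a word?" slide.
text = ("It's a shame that our data-base is not up-to-date. It is a "
        "shame that um, data base A costs $2300.50 and that "
        "database B costs $5000. All databases cost far too much.")

tokens = text.split()      # split on runs of space, tab, newline
types = Counter(tokens)    # distinct token strings with their frequencies

print(len(tokens), "tokens;", len(types), "types")   # 32 tokens; 26 types
print([t for t in tokens if not t.isalnum()])        # tokens that are not purely alphanumeric
```

Punctuation stays glued to words (up-to-date., um,), It and It's count as different types, and data base is split in two while data-base and database are not.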
Regular expressions for tokenization
• wordr = r'(\w+)'
• hyphen = r'(\w+\-\s?\w+)'
• E.g. data-base; allows for a space after the hyphen
• apostrophe = r'(\w+\'\w+)'
• E.g. isn’t
• numbers = r'((\$|#)?\d+(\.)?\d+%?)'
• Needs to handle large numbers with commas
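One way to combine the patterns above into a single tokenizer is sketched below. The patterns are variants of the ones on the slide, not the originals: groups are non-capturing so that re.findall returns whole matches, the hyphen pattern is extended to allow several hyphens (up-to-date), and the number pattern is one possible extension for commas and single digits. The names tokenize and TOKEN_RE are illustrative.

```python
import re

# Variants of the slide's patterns (assumptions, see the note above).
number = r"(?:\$|#)?\d+(?:,\d{3})*(?:\.\d+)?%?"   # $2300.50, 1,000,000, 50%
hyphen = r"\w+(?:-\s?\w+)+"                        # data-base, up-to-date
apostrophe = r"\w+'\w+"                            # It's, isn't
word = r"\w+"
punct = r"[^\w\s]"                                 # any other single symbol

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_RE = re.compile("|".join([number, hyphen, apostrophe, word, punct]))

def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_RE.findall(text)

text = ("It's a shame that our data-base is not up-to-date. It is a "
        "shame that um, data base A costs $2300.50 and that "
        "database B costs $5000. All databases cost far too much.")
tokens = tokenize(text)
print(len(tokens))    # 36
print(tokens[:10])
```

Here punctuation marks become separate tokens, and $2300.50 stays in one piece. Whether data base should be one token or two is still not decided by the regular expressions.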
Some Tokenization Issues
Sentence Boundaries
• Punctuation, eg quotation marks around sentences?
• Periods – end of line or not?
Proper Names
• What to do about
• “New York-New Jersey train”?
• “California Governor Arnold Schwarzenegger”?
• That’s Fred’s jacket’s pocket.
• I’m doing what you’re saying “Don’t do!”.
Jabberwocky Analysis
This is nonsense … or is it?
This is not English … but it’s much more like English than it is
like French or German or Chinese or …
Why do we pretty much understand the words?
We recognize combinations of morphemes.
• Chortled - Laugh in a breathy, gleeful way. A combination of
"chuckle" and "snort." (Definition from Oxford American Dictionary)
• Galumphing - Moving in a clumsy, ponderous, or noisy manner.
Perhaps a blend of "gallop" and "triumph." (Definition from Oxford
American Dictionary)
• Make up a word whose meaning can be inferred from the morphemes
that you used.
• Surrounding English words strongly indicate the parts-of-speech of
the nonsense words.
• toves: probably can perform an action
(because they did gyre and gimble)
• wabe: is probably a place.
(they did … in the wabe)
• It’s similar in the French Translation:
Example from
• Morphology: the study of the way words are built up from smaller meaning units.
• Morpheme: the smallest meaningful unit in the grammar of a language.
• Derivational vs. Inflectional
• Regular vs. Irregular
• Concatenative vs. Templatic (root-and-pattern)
A useful resource:
• Glossary of linguistic terms by Eugene Loos
Examples (English)
• unladylike: 3 morphemes, 4 syllables
un- ‘not’
lady ‘(well behaved) female adult human’
-like ‘having the characteristics of’
• Can’t break any of these down further without distorting the
meaning of the units
• 1 morpheme, 2 syllables: e.g. technique
• 2 morphemes, 1 syllable: e.g. dogs
dog, plus -s, a plural marker on nouns
Morpheme Definitions
• Root: the portion of the word that:
• is common to a set of derived or inflected forms, if any, when all affixes
are removed
• is not further analyzable into meaningful elements
• carries the principal portion of meaning of the words
• Stem: the root or roots of a word, together with any derivational affixes, to which
inflectional affixes are added.
• Affix: a bound morpheme that is joined before, after, or within a root or stem.
• Clitic: a morpheme that functions syntactically like a word, but does not appear
as an independent phonological word
• Arabic: al in Al-Qaeda (definite particle)
• English: ’s in Hal’s (genitive particle)
Inflectional vs. Derivational
Word Classes
• Parts of speech: noun, verb, adjective, etc.
• Word class dictates how a word combines with morphemes to form new words
Inflection
• Variation in the form of a word, typically by means of an affix, that expresses
a grammatical contrast.
• Doesn’t change the word class
• Usually produces a predictable, nonidiosyncratic change of meaning.
• run -> runs | running | ran
Derivation
• The formation of a new word or inflectable stem from another word or stem.
• compute -> computer -> computerization
Inflectional Morphology
• tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
• come is inflected for person and number:
The pizza guy comes at noon.
• las and rojas are inflected for agreement with manzanas in grammatical
gender by -a and in number by –s
las manzanas rojas (‘the red apples’)
Derivational Morphology
Word class changes: verb → noun, noun → adjective, etc.
Nominalization (formation of nouns from other parts of speech,
primarily verbs in English):
• computerization
• appointee
• killer
• fuzziness
Formation of adjectives (primarily from nouns)
• computational
• clueless
• embraceable
Difficult cases:
• building: derived from which word class and sense of “build”?
Concatenative Morphology
Stems: also called lemma, base form, root, lexeme
hope+ing → hoping
hop+ing → hopping
• Prefixes: anti-, dis- in antidisestablishmentarianism
• Suffixes: -ment, -arian, -ism in antidisestablishmentarianism
• Infixes: hingi (borrow) – humingi (borrower) in Tagalog, with infix -um-
• Circumfixes: sagen (say) – gesagt (said) in German, with ge-…-t
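The hope/hoping and hop/hopping pair above shows that concatenation also needs spelling rules. A minimal sketch of two such rules for English -ing follows. The function name add_ing and the exact conditions are assumptions made for illustration, and many exceptions (be, die, visit-type stress patterns, …) are only partly handled.

```python
VOWELS = set("aeiou")

def add_ing(stem):
    """Attach -ing to an English verb stem using two toy spelling rules."""
    # Rule 1, e-deletion: hope + ing -> hoping (but see -> seeing).
    if len(stem) > 2 and stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"
    # Rule 2, consonant doubling: hop + ing -> hopping.
    # Applied only to stems with a single vowel letter that end in
    # consonant-vowel-consonant, where the last consonant is not w, x or y.
    one_vowel = sum(ch in VOWELS for ch in stem) == 1
    if (len(stem) >= 3 and one_vowel
            and stem[-1] not in VOWELS and stem[-1] not in "wxy"
            and stem[-2] in VOWELS and stem[-3] not in VOWELS):
        return stem + stem[-1] + "ing"
    # Default: plain concatenation.
    return stem + "ing"

for stem in ["hope", "hop", "walk", "see"]:
    print(stem, "->", add_ing(stem))
```

A morphological analyser has to run such rules in reverse: given hoping and hopping, it must recover hope and hop.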
Agglutinative Languages
• Turkish: uygarlaştıramadıklarımızdanmışsınızcasına
• uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
• “Behaving as if you are among those whom we could not cause to become civilized”
Templatic Morphology
Roots and Patterns
• Example: Hebrew or Arabic or Amharic (spoken in Ethiopia)
• Root:
• Consists of 3 consonants CCC
• Carries basic meaning
• Template:
• Gives the ordering of consonants and vowels
• Specifies semantic information about the verb
• Active, passive, middle voice
• Example (Hebrew):
• lmd (to learn or study)
• CaCaC -> lamad (he studied)
• CiCeC -> limed (he taught)
• CuCaC -> lumad (he was taught)
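The root-and-pattern idea can be sketched in a few lines: each C in the template is filled with the next root consonant. The function name apply_template is illustrative, and the transliteration is the simplified one used in the example above.

```python
def apply_template(root, template):
    """Interleave the consonants of root into the C slots of template."""
    if template.count("C") != len(root):
        raise ValueError("template needs one C slot per root consonant")
    consonants = iter(root)
    # Every 'C' takes the next root consonant; vowels are copied as they are.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

for template in ["CaCaC", "CiCeC", "CuCaC"]:
    print(template, "->", apply_template("lmd", template))
# CaCaC -> lamad
# CiCeC -> limed
# CuCaC -> lumad
```

Unlike concatenative morphology, the root is not a contiguous substring of the word, so "hack off the end" approaches cannot find it.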
Morphological Analysis Tools
Porter stemmer:
• A simple approach: just hack off the end of the word!
• Frequently used, especially for Information Retrieval, but results are
pretty ugly!
Original *****************************
Pierre Vinken , 61 years old , will join the board as a nonexecutive
director Nov. 29 . Mr. Vinken is chairman of Elsevier N.V. , the Dutch
publishing group . Rudolph Agnew , 55 years old and former chairman of
Consolidated Gold Fields PLC , was named a nonexecutive director of
this British industrial conglomerate . A form of asbestos once used to
make Kent cigarette filters has caused a high percentage of cancer
deaths among a group of workers exposed to it more than 30 years ago ,
researchers reported .
Results *******************************
Pierr Vinken , 61 year old , will join the board as a nonexecut
director Nov. 29 . Mr. Vinken is chairman of Elsevi N.V. , the Dutch
publish group . Rudolph Agnew , 55 year old and former chairman of
Consolid Gold Field PLC , wa name a nonexecut director of thi British
industri conglomer . A form of asbesto onc use to make Kent cigarett
filter ha caus a high percentag of cancer death among a group of
worker expos to it more than 30 year ago , research report .
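Some of the ugly results above (wa, thi, ha, asbesto) come from the very first rule group of Porter's original algorithm, which strips plural-like endings without looking at the word itself. The sketch below implements only that group (step 1a); the full stemmer has several further steps, and the function name step1a is illustrative. It does not reproduce every detail of the output above: some implementations leave very short words such as "is" untouched.

```python
# Step 1a of the original Porter stemmer: the first matching suffix wins.
STEP_1A = [
    ("sses", "ss"),   # caresses -> caress
    ("ies", "i"),     # ponies   -> poni
    ("ss", "ss"),     # caress   -> caress
    ("s", ""),        # cats     -> cat
]

def step1a(word):
    """Apply Porter step 1a to a single word."""
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

for w in ["years", "Fields", "was", "this", "has", "asbestos"]:
    print(w, "->", step1a(w))
# years -> year, Fields -> Field, was -> wa, this -> thi, has -> ha, asbestos -> asbesto
```

The rule cannot tell a plural -s from an s that belongs to the stem, which is why function words and singular nouns get damaged.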
Morphological Analysis Tools
WordNet’s morphy()
• A slightly more sophisticated approach
• Uses an understanding of inflectional morphology
• Uses a set of Rules of Detachment
• Uses an Exception List for irregulars
• Handles collocations in a special way
• Does the transformation, then compares the result to the WordNet dictionary
• If the transformation produces a real word, then keep it, else use
the original word.
• For more details, see
Some morphy() output
>>> wntools.morphy('dogs')
>>> wntools.morphy('running', pos='verb')
>>> wntools.morphy('corpora')
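The steps listed for morphy() (exception list, rules of detachment, dictionary check, fall back to the original word) can be sketched without WordNet itself. Everything below is a toy stand-in: the rule lists are written in the style of WordNet's rules of detachment, and the exception list and lexicon are tiny hypothetical samples, not WordNet data. Collocations are not handled.

```python
# Suffix rules in the style of the Rules of Detachment (illustrative lists).
RULES = {
    "noun": [("s", ""), ("ses", "s"), ("xes", "x"), ("zes", "z"),
             ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y")],
    "verb": [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""),
             ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")],
}
# Hypothetical exception list for irregular forms.
EXCEPTIONS = {
    "noun": {"corpora": "corpus"},
    "verb": {"running": "run", "ran": "run"},
}
# Hypothetical mini-dictionary standing in for the WordNet word list.
LEXICON = {
    "noun": {"dog", "corpus", "box"},
    "verb": {"run", "hope", "walk"},
}

def morphy_sketch(word, pos):
    """Return a base form of word for the given pos ('noun' or 'verb')."""
    if word in EXCEPTIONS[pos]:            # 1. irregular forms first
        return EXCEPTIONS[pos][word]
    if word in LEXICON[pos]:               # 2. already a dictionary word
        return word
    for suffix, replacement in RULES[pos]: # 3. try each detachment rule
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in LEXICON[pos]:  # keep only real words
                return candidate
    return word                            # 4. else use the original word

for word, pos in [("dogs", "noun"), ("running", "verb"),
                  ("corpora", "noun"), ("hoping", "verb"), ("toves", "noun")]:
    print(word, pos, "->", morphy_sketch(word, pos))
```

Because every candidate is checked against the dictionary, the output is always a real word or the unchanged input, unlike the Porter stemmer's truncated stems.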
Morphological Analysis Tools
Very sophisticated programs have been developed
They use a technique called Two-Level Phonology
• Has been applied to numerous languages
Best known: PC-KIMMO
• After Kimmo Koskenniemi, based in part on work by Lauri Karttunen in 1983
• Uses:
• A rules file which specifies the alphabet and the phonological (or spelling) rules,
• A lexicon file which lists lexical items and encodes morphotactic constraints.
Commercial versions are available
• inXight’s LinguistX version based on technology developed by Kaplan and others
from Xerox PARC (or at least used to be)
Morphological Analysis Tools
“cheat”: store all variants in a dictionary database, eg
• Categorial Variation Database
• “A database of clusters of uninflected words (lexemes) and their
categorial (i.e. part-of-speech) variants.”
• Example: the developing cluster: (develop(V), developer(N),
developed(AJ), developing(N), developing(AJ), development(N)).
• Based on published dictionaries: LDOCE, CELEX, OALD++, …
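The lookup approach needs no rules at all, only an index from each word to its cluster. A minimal sketch using the single cluster quoted above follows. The data structure and the function name variants are assumptions for illustration and do not reflect the actual CatVar file format.

```python
# One cluster, copied from the example above; a real database has many.
CLUSTERS = [
    [("develop", "V"), ("developer", "N"), ("developed", "AJ"),
     ("developing", "N"), ("developing", "AJ"), ("development", "N")],
]

# Index every word form to the cluster it belongs to
# (assumes, for simplicity, that a word occurs in only one cluster).
INDEX = {word: cluster for cluster in CLUSTERS for word, _ in cluster}

def variants(word, pos=None):
    """Return the words in word's cluster, optionally filtered by part of speech."""
    return [w for w, p in INDEX.get(word, []) if pos is None or p == pos]

print(variants("developed", pos="N"))   # ['developer', 'developing', 'development']
print(variants("developed", pos="V"))   # ['develop']
print(variants("toves"))                # [] (unknown word: lookup has nothing to say)
```

The last call shows the weakness of pure lookup: words missing from the database get no analysis at all.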
One problem with rule-based systems (PC-KIMMO) or dictionary-lookup
systems: porting to new languages
In principle, Unsupervised Machine Learning could learn from
any language data-set, by finding recurring patterns which
correspond to roots, prefixes, suffixes
MorphoChallenge is a contest to find the best UML
morphological analyser
Atwell, Roberts: Combinatory Hybrid Elementary Analysis of Text
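To give a feel for "finding recurring patterns", the toy heuristic below counts word endings that remain when a shorter vocabulary word is removed from the front of a longer one. It is only an illustration of the idea, not the method of any MorphoChallenge entry; the function name, the min_stem threshold, and the word list are all made up for the example.

```python
from collections import Counter

def candidate_suffixes(vocab, min_stem=3):
    """Count endings x such that both stem and stem+x occur in vocab."""
    vocab = set(vocab)
    counts = Counter()
    for word in vocab:
        # Try every split point that leaves a stem of at least min_stem letters.
        for i in range(min_stem, len(word)):
            stem, suffix = word[:i], word[i:]
            if stem in vocab:
                counts[suffix] += 1
    return counts

words = ["walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking",
         "jump", "jumps", "jumped"]
counts = candidate_suffixes(words)
print(sorted(counts.items()))   # [('ed', 3), ('ing', 2), ('s', 3)]
```

No English-specific knowledge is used, so the same code can be run on a word list from any language, although this simple version only finds suffixes and would miss infixes or templatic patterns.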
Arabic morphological analysis
Arabic is particularly challenging - different script, infixes,
vowels may be left out in written Arabic …
Leeds researcher Majdi Sawalha: online analysis tool
Sawalha, Majdi; Atwell, Eric (2010). Fine-Grain Morphological
Analyzer and Part-of-Speech Tagger for Arabic Text. In:
Proceedings of the Language Resources and Evaluation
Conference LREC 2010, 17-23 May 2010, Valletta, Malta.
Summary
Tokenization - by whitespace, regular expressions
Problems: It’s, data-base, New York, …
Jabberwocky shows we can break words into morphemes
Morpheme types: root/stem, affix, clitic
Derivational vs. Inflectional
Regular vs. Irregular
Concatenative vs. Templatic (root-and-pattern)
Morphological analysers: Porter stemmer, Morphy, PC-KIMMO
Morphology by lookup: CatVar, CELEX, OALD++
Unsupervised Machine Learning: MorphoChallenge