Morphology

Fondements du TAL
Delphine Bernhard
Morphology
Contains slides adapted from Pierre Zweigenbaum
Outline
1. Word segmentation
2. Linguistic morphology
a) Morphemes
b) Morphological processes
3. Computational morphology
a) Normalisation: stemming, lemmatisation
b) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
Levels of linguistic structure
Our focus today: morphology [figure not reproduced; (c) David Groome, 2006]
Outline
1. Word segmentation
2. Linguistic morphology
a) Morphemes
b) Morphological processes
3. Computational morphology
a) Normalisation: stemming, lemmatisation
b) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
Can you read this (fast!)?
wikipédiaestunprojetdencyclopédiecollectiveétabli
esurinternetuniversellemultilingueetfonctionnants
urleprincipeduwikiwikipédiaapourobjectifdoffrirun
contenulibrementréutilisableneutreetvérifiableque
chacunpeutéditeretaméliorer
Wikipédia est un projet d’encyclopédie collective établie sur Internet,
universelle, multilingue et fonctionnant sur le principe du wiki.
Wikipédia a pour objectif d’offrir un contenu librement réutilisable,
neutre et vérifiable, que chacun peut éditer et améliorer.
It's only words ...
 ... but what are they exactly and
how can we automatically
recognise them?
 In speech, there are no obvious
breaks
 So how do babies learn words?
 According to (Saffran et al.,
1996) they use distributional
cues and statistical regularities in
speech
How do we recognise words in speech?
(Bauer, 1988)
 There are no gaps between words in speech:
 Menbecomeoldbuttheyneverbecomegood
 Thanks to our knowledge of language, we recognise
certain strings of sounds/letters:
 e.g. we can recognise men in the previous sequence because
it also comes up in sequences like:
 Menareconservativeafterdinner
 Menlosetheirtempersindefendingtheirtaste.
 Afterfortymenhavemarriedtheirhabits.
Learning to read is difficult for humans
 Reading disabilities:
 Dyslexia: inability to decode, or break down, words into
phonemes
 Comprehension difficulties
 The invention of writing and reading is
recent
 Unlike speech or vision, it is an
unnatural process that has to be
learned: brains are not wired to read!
For computers: characters and strings
 Control characters:
 End of line: \n
 Tabulation: \t
 Encodings:
 ASCII: English alphabet
 Latin 1, ISO-8859-1: Western European Languages
 ISO-8859-15: Similar to ISO-8859-1, but replaces some less
common symbols with €, Œ or œ
 Windows-1252, Cp1252: superset of ISO 8859-1 (includes €,
Œ and œ)
 UTF-8: can represent every character in the Unicode
character set, backward-compatible with ASCII
Practical definition of words
and sentences
 Bauer (1988):
 A word is a unit which, in print, is bounded by spaces on both
sides. We will call this an orthographic word.
 Kučera and Francis (1967):
 A graphic word is a string of contiguous alphanumeric
characters with space on either side; may include hyphens
and apostrophes, but no other punctuation marks
 Grefenstette and Tapanainen (1994):
 Sentences end with punctuation.
What are the "words" and sentences here?
Pacific Lumber Co. was trying to figure out the safest
way to bring the activists down.
He doesn't need us.
For additional information see also
http://www.limsi.fr
New York is situated on the east coast of the United
States.
c’est-à-dire les pommes de terre des U.S.A.
Tokenisation
 Tokenisation: process which divides the input text into
word tokens: punctuation marks, word-like units,
numbers, etc.
 A system which splits texts into word tokens is called a
tokeniser
 A very simple example:
 Input text:
John likes Mary and Mary likes John.
 Tokens:
{"John", "likes", "Mary", "and", "Mary", "likes", "John", "."}
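A minimal sketch of such a tokeniser in Python, using a single regular expression that keeps word-like units (with internal hyphens or apostrophes) together and isolates punctuation marks; the pattern is an illustrative assumption and does not handle the harder cases discussed on the next slides.

import re

# Illustrative pattern: word-like units (with internal hyphens/apostrophes),
# otherwise any single non-space character (punctuation, symbols).
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenise(text):
    """Split a text into word-like tokens and punctuation marks."""
    return TOKEN_RE.findall(text)

print(tokenise("John likes Mary and Mary likes John."))
# ['John', 'likes', 'Mary', 'and', 'Mary', 'likes', 'John', '.']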
Problems of tokenisation
 Numeric expressions:
The corresponding free cortisol fractions in these sera
were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.
 How many words are there in 4.53 +/- 0.15%?
 1
 3
 9 (“four point five three, plus or minus fifteen percent”)
 not a word
 The answer depends on the application at hand
Problems of tokenisation
 Boundaries
 For "simple" words:
 Spaces
 Punctuation
 Multiword expressions: several units, one word
 pomme de terre
 parce que
 Contracted forms: one unit, several words
 aux (à les), des (de les)
Outline
1. Word segmentation
2. Linguistic morphology
a) What is morphology?
b) Morphological processes
3. Computational morphology
a) Normalisation: stemming, lemmatisation
b) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
Morphology
Words can be further decomposed into smaller units:
pneumonoultramicroscopicsilicovolcanoconiosis
pneumono (lung) + ultra (extreme) + microscopic + silico (silicium) + volcano + coni (dust) + osis (disease)
= lung disease caused by the inhalation of very fine
silica dust found in volcanoes
What is morphology?
 Morphology is the branch of linguistics which studies
word forms and word formation
 Word formation processes
 Inflection
 Derivation
 Composition / Compounding
Words vs. lexemes vs. lemmas
 A lexeme refers to the set of word forms which
correspond to the same dictionary entry
small, smaller, smallest → SMALL
knife, knives → KNIFE
 A lemma is the canonical form of a lexeme
SMALL
 In the following, capital letters are used to indicate lemmas
Inflection
 Inflection is the process of forming different
grammatical forms of a single lexeme
montrer → montrera
cheval → chevaux
 The grammatical category of the word form remains the
same
Word formation
 Word formation is the process of creating new lexemes
from existing ones:
 Derivation: combines bases and affixes
 Compounding: combines lexemes
Derivation
 Derivation involves the creation of one lexeme from
another
re- + create → RECREATE
re- is a derivational prefix
recreate + s → recreates
-s is an inflectional suffix, it provides another word-form of
the lexeme RECREATE!

Derivation might induce a change of the grammatical
category

be- + witch → BEWITCH: changes a noun into a verb
Compounding
 A compound involves the creation of one lexeme from
two or more other lexemes
popcorn = a kind of corn which pops
hot dog = a kind of food (opaque compound)
 Compounding is particularly frequent in French medical
language
 appendice + ectomie → appendicectomie
Non-concatenative phenomena
 Root-and-pattern morphology (e.g. Arabic, Hebrew)
 the root consists of consonants only (3 by default)
ktb = to write
 the pattern is a combination of vowels (possibly consonants
too) with slots for the root consonants
kaatab = he corresponded
 Apophony: vowel changes within a root
 Ablaut: sing, sang, sung
 Umlaut: Buch, Bücher
Outline
1. Word segmentation
2. Linguistic morphology
a) Morphemes
b) Morphological processes
3. Computational morphology
a) Normalisation: stemming, lemmatisation
b) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
Morphological normalisation
 Morphological normalisation consists in identifying a single
canonical representative for morphologically related word forms
 Methods:
 Stemming
 Lemmatisation
Stemming
 Stemming is an algorithmic approach to strip off the
endings of words
 Objective: group words belonging to the same
morphological family by transforming them into a
similar stemmed representation
 Stemming does not distinguish between inflection and
derivation
 The stems obtained do not necessarily correspond to a
genuine word form
 The best known stemming algorithms have been
developed by Lovins (1968) and Porter (1980)
Algorithmic stemming method
1) Desuffixing: removal of predefined word endings
sitting → sitt
2) Recoding: transform the endings of the previously
obtained stems using transformation rules
sitt → sit
These 2 phases can be performed successively (Lovins)
or simultaneously (Porter)
Porter's stemmer
 Based on a limited set of general cascaded
transformational rules:
-ational → -ate : relational → relate
 Variants exist for many languages: English, French,
Spanish, Portuguese, Italian, Romanian, German,
Dutch, Swedish, Norwegian, Danish, Russian, Finnish,
Hungarian, Turkish
 Fast
 Accurate enough for some applications, e.g.
Information Retrieval
 Available at http://snowball.tartarus.org/
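As a quick illustration of the stemmers mentioned above, here is a small sketch using NLTK's wrappers around the Porter and Snowball algorithms (assuming NLTK is installed; exact stems may differ slightly between implementations).

from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
for word in ["relational", "caresses", "happy", "vision", "visibility"]:
    print(word, "->", porter.stem(word))   # e.g. relational -> relat

# Snowball variants cover many languages, e.g. French:
french = SnowballStemmer("french")
print(french.stem("chevaux"))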
Steps in Porter stemming (excerpts)
 Step 1a
 SSES → SS
caresses → caress
 Step 1b
 (m>0) EED → EE
feed → feed, agreed → agree
 Step 1c
 (*v*) Y → I
happy → happi, sky → sky
 Step 2
 (m>0) ATIONAL → ATE
relational → relate
Porter's stemmer
 Step 3
 (m>0) ICATE → IC
triplicate → triplic
 Step 4
 (m>1) AL →
revival → reviv
 Step 5a
 (m>1) E →
probate → probat
 Step 5b
 (m>1 and *d and *L) → single letter
controll → control
Examples (original word → stemmed word):
vision → vision
visible → visibl
visibility → visibl
visionary → visionari
visioner → vision
visual → visual
Comparison of three stemmers
[Figure not reproduced] © 2008 Cambridge University Press, Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze
Stemming errors
 Under-stemming:
adhere → adher
adhesion → adhes
 Over-stemming:
 appendicitis → append
 append → append
Ambiguity
Homographs: words which have the same spelling but different meanings
I saw the saw: the first saw is the preterite form of the verb SEE,
the second saw is the singular form of the noun SAW
Such cases cannot be properly dealt with by stemming alone:
the word's grammatical category has to be identified
Lemmatisation
 Lemmatisation consists in mapping word forms to
their lemma (base form):
sing, sang, sung → sing
 Lemmatisation only handles inflection, not derivation
 In order to disambiguate ambiguous cases,
lemmatisation is usually combined with part-of-speech
tagging
 Additional morphological information is usually provided
with the lemma (more about this later in the
presentation)
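A small sketch of lemmatisation with NLTK's WordNet lemmatiser (assuming NLTK and its WordNet data are available); note that the part of speech has to be supplied to resolve ambiguous forms such as saw.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)           # one-off download of the WordNet data
lemmatiser = WordNetLemmatizer()

print(lemmatiser.lemmatize("sung", pos="v"))   # sing
print(lemmatiser.lemmatize("sang", pos="v"))   # sing
print(lemmatiser.lemmatize("saw", pos="v"))    # see (verb reading)
print(lemmatiser.lemmatize("saw", pos="n"))    # saw (noun reading)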
Outline
1. Word segmentation
2. Linguistic morphology
a) Morphemes
b) Morphological processes
3. Computational morphology
a) Normalisation: stemming, lemmatisation
b) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
Morphological analysis
 Aim:
 split a word into its constituent morphemes : foxes → fox + es
 get morpho-syntactic information : part-of-speech (POS),
tense, number, person, voice, gender, etc.
 Morphological analysis can be performed:
 manually, the analyses are then stored in lexical databases
 automatically:
 based on some manually-written rules and lexicons
 in an unsupervised manner, using no external resources
Lexical databases: contents
 word entries + information
 surface form, lemma
 syntactic properties
 category, POS (Part Of Speech)
 features: masculine, feminine, etc.
 semantic properties
 semantic relations: synonym, antonym, hypernym
 semantic type: person, event, object
CELEX
 CELEX is a lexical database which is available for English,
Dutch and German
Morphalou
http://www.cnrtl.fr/lexiques/morphalou/
Prolex
http://www.cnrtl.fr/lexiques/prolex/
French Verbs (Dubois & Dubois-Charlier)
Unsupervised Segmentation
 Unsupervised morphological segmentation consists in
automatically breaking down words into their constituent
morphemes
 Only input dataset: list of words (no language-specific rules
or lexicons)
 Scientific goals:
 Learn about the phenomena underlying word construction in natural
languages
 Discover approaches suitable for a wide range of languages
 Advance machine learning methodology
 See the Morpho Challenge website
http://www.cis.hut.fi/morphochallenge2009/
Segmentation by analogy
(Lepage, 1998)
Application of the analogy principle:
fahre : fahren :: schlafe : X → X = schlafen
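A toy Python sketch of solving such proportional analogies, under the simplifying assumption that the four forms share a prefix and differ only in their endings (Lepage's algorithm is far more general):

import os

def solve_analogy(a, b, c):
    """Solve a : b :: c : X for simple suffix analogies."""
    prefix = os.path.commonprefix([a, b])      # shared beginning of a and b
    suffix_a, suffix_b = a[len(prefix):], b[len(prefix):]
    if not c.endswith(suffix_a):
        return None                            # the analogy does not apply
    return c[:len(c) - len(suffix_a)] + suffix_b

print(solve_analogy("fahre", "fahren", "schlafe"))   # schlafen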
Segmentation by compression
 Minimum Description Length and Bayesian inference
(Goldsmith, 2001; Creutz & Lagus, 2005)
Harris (1955):
Segmentation by successor counts
 At the end of a morpheme (or word) almost any sound
can follow:
design + #, design + ation, design + ing, design + ed, ...
 However, within morphemes, the choice is more
restricted:
desig + n
 Basic algorithm:
 At each position in an utterance, count the number of
different sounds which can possibly follow
 Peaks in this count indicate morpheme boundaries
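A minimal Python sketch of Harris-style successor counts over a toy vocabulary (the word list is an illustrative assumption); the count peaks after design, which is exactly where the morpheme boundary lies:

WORDS = ["design", "designer", "designed", "designing", "designation",
         "desire", "desired", "desk"]

def successor_count(prefix, vocabulary):
    """Number of distinct letters that can follow `prefix` in the vocabulary."""
    return len({w[len(prefix)] for w in vocabulary
                if w.startswith(prefix) and len(w) > len(prefix)})

word = "designer"
for i in range(1, len(word)):
    print(word[:i], successor_count(word[:i], WORDS))
# The count peaks at the prefix "design" (e, i or a can follow), suggesting a boundary.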
Segmentation of He's quicker
 Utterance = He's quicker (hiyzkwikәr)
 Successors of h:
 His ship's in?
 Humans act like simians.
 ...
 Successors of hi:
 Hip-high in water.
 Hidden meanings were discovered.
 ...
[Chart not reproduced: successor counts (0-35) at each position in the sound sequence h i y z k w i k ә r; x-axis: sounds, y-axis: successor counts]
Morphological parsing
 Aim: break down a word into component morphemes and
build a structured representation of the analysis
 Example:
cats → cat +N +PL
(lemma: cat; features: +N +PL)
 Our focus: finite-state morphological parsing
Finite state automata
 A finite state automaton (FSA) recognises a set of
strings
 An FSA is represented as a directed graph:
 vertices (nodes) represent states
 directed links between nodes represent transitions
Sheep Talk FSA
 The language of sheep includes the following
utterances: baa!, baaa!, baaaa!, baaaaa!, etc.
 Regular expression for this language: baa+!
 FSA that can accept this language [diagram not reproduced]:
q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4, with an a-loop on q3 (final state: q4)
Formal definition of an FSA
 Q = {q0, q1, q2, ..., qN-1}: a finite set of N states
 Σ: a finite input alphabet of symbols
 q0: the start state
 F: the set of final states
 δ(q, i): the transition function
 For the sheep talk automaton: Q = {q0, q1, q2, q3, q4},
Σ = {a, b, !}, F = {q4}
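A direct Python encoding of this definition for the sheep-talk automaton, with the transition function δ stored as a dictionary (a minimal sketch):

# Sheep-talk FSA: accepts the language baa+!
START = "q0"
FINAL = {"q4"}
DELTA = {                       # transition function delta(q, i)
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",          # loop: any number of additional a's
    ("q3", "!"): "q4",
}

def accepts(string):
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False        # no transition defined: reject
        state = DELTA[(state, symbol)]
    return state in FINAL

print(accepts("baa!"), accepts("baaaaa!"), accepts("ba!"))   # True True False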
Deterministic vs. non-deterministic FSA
 Deterministic FSA for sheep talk: q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4,
with an a-loop on q3
 Non-deterministic FSA for sheep talk: same states and labels, but one state
has two outgoing transitions on the same symbol a, so the automaton has to
choose between them
[Diagrams not reproduced]
Morphological parsers
 Components:
 lexicon: list of lemmas and affixes
 morphotactics: word grammar which accounts for
morpheme ordering
 orthographic rules: model the changes that occur when two
morphemes combine
city + s → cities
 Morphological parsers can be implemented as finite-state
transducers
Finite State Transducers
 Finite-state transducers map between one representation
and another
 Example transducer [diagram not reproduced]: arcs labelled cat:N (state 0 → 1),
s:PL (state 1 → 3) and catch:V (state 0 → 2)
 State 0: start state
 State 1: cat has been recognised as a +N (possible end state)
 State 2: catch has been recognised as a +V (possible end state)
 State 3: cats has been recognised as +N +PL (possible end state)
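A toy sketch of that transducer as a dictionary of labelled arcs, mapping surface material to its lexical-level analysis; the states and labels follow the figure, while traversing pre-split chunks is an illustrative simplification:

# (state, surface input) -> (next state, lexical output)
ARCS = {
    (0, "cat"):   (1, "cat +N"),
    (0, "catch"): (2, "catch +V"),
    (1, "s"):     (3, "+PL"),
}
FINAL_STATES = {1, 2, 3}

def transduce(chunks):
    """Map a sequence of surface chunks, e.g. ["cat", "s"], to a lexical string."""
    state, output = 0, []
    for chunk in chunks:
        if (state, chunk) not in ARCS:
            return None
        state, out = ARCS[(state, chunk)]
        output.append(out)
    return " ".join(output) if state in FINAL_STATES else None

print(transduce(["cat", "s"]))   # cat +N +PL
print(transduce(["catch"]))      # catch +V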
Two-level morphology
 (Koskenniemi, 1984)
 Surface level: words as they are pronounced or written
 Lexical level: concatenation of morphemes
Lexical level: c a t +N +PL
Surface level: c a t s
 The mapping between the surface and the lexical level
is constrained by rules
Two-level rules
 Example rule (Trost, 2003), the e-insertion spelling rule:
+:e ⇐ { s x z [ {s c} h ] } : _s
(the lexical-level + is realised as a surface-level e when the left context is
s, x, z, sh or ch, and the right context is s)
 Application of the rule:
Lexical: # d i s h + s #
Surface: 0 d i s h e s 0
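The effect of this e-insertion rule can be approximated with a single regular-expression rewrite over lexical forms in which + marks the morpheme boundary (a rough sketch, not a real two-level implementation):

import re

def e_insertion(lexical):
    """dish+s -> dishes, fox+s -> foxes, but cat+s -> cats (no insertion)."""
    surface = re.sub(r"(s|x|z|ch|sh)\+(?=s\b)", r"\1e", lexical)
    return surface.replace("+", "")            # drop remaining boundary markers

for form in ["dish+s", "fox+s", "buzz+s", "cat+s"]:
    print(form, "->", e_insertion(form))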
PC-KIMMO
 Demo: http://languagelink.let.uu.nl/~lion
PC-Kimmo: POS Ambiguity
1: Word:
   [ cat:  Word
     head: [ pos: V
             vform: BASE ]
     root: `walk
     root_pos: V
     clitic: -
     drvstem: - ]
2: Word:
   [ cat:  Word
     head: [ agr: [ 3sg: + ]
             number: SG
             pos: N
             proper: -
             verbal: - ]
     root: `walk
     root_pos: N
     clitic: -
     drvstem: - ]
Inflectional Analysis for French: Flemm
 Developed by F. Namer
http://www.cnrtl.fr/outils/flemm/
 Input: word + POS (as provided by the TreeTagger or the
Brill tagger)
renouent VER:pres renouer
 Output: lemma + morpho-syntactic features
renouent VER(pres):Vmip3p--1 renouer || renouent VER(pres):Vmsp3p--1 renouer
(verb in the present indicative or subjunctive, third person plural, 1st group)
Inflectional Analysis for French: Flemm
 Linguistic analysis: the case of -èrent
 in general, -èrent marks the simple past (third person
plural): céd-èrent
 sometimes the actual ending is shorter and -èrent appears
in a present-tense form: légifèr-ent
 very rarely, the ending is ambiguous: lac-èrent and lacèrent
 Rules and exceptions: the case of -èrent
 ambiguous segmentations are lexicalised, as they are rare
 since the rule is to strip the longest suffix, verbs matching
the suffix -ent, such as légifèr-, are lexicalised
 other cases (e.g. céd-): regular desuffixation on -èrent
Derivational Analysis: DériF
 Developed by F. Namer
http://www.cnrtl.fr/outils/DeriF/
 Input: form/POS
 sympathique/ADJ
 Output: analysis
 [ [ sympathie NOM] ique ADJ] (sympathique/ADJ,
sympathie/NOM) " En rapport avec le(s) sympathie"
Derivational Analysis: DériF
 Word formation rules
 déXiser V → [dé [X N] +iser V]
 Xable A → [[X (er) V] +able A]
 inX A → [in [X A] A]
 Sequence of decompositions
 impensable/ADJ → in + pensable/ADJ
 décomposable/ADJ → décomposer/VERBE + able/ADJ
 Ambiguous analyses
 implantable/ADJ → implanter/VERBE + able/ADJ
 implantable/ADJ → im + plantable/ADJ
 Produces a gloss:
" ( lequel - Que l') on peut implanter" // " Non plantable"
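Not DériF itself, but a toy Python sketch in the same spirit: a few hand-written prefix/suffix rules produce candidate decompositions, including the kind of ambiguity shown above for implantable (the rules and outputs are illustrative assumptions):

# Toy derivational rules: (kind, affix, base POS, derived POS)
RULES = [
    ("prefix", "in",   "ADJ", "ADJ"),    # in + X/ADJ -> ADJ
    ("prefix", "im",   "ADJ", "ADJ"),    # allomorph of in- before p, b, m
    ("suffix", "able", "VER", "ADJ"),    # X(er)/VER + able -> ADJ
]

def decompose(word, pos):
    analyses = []
    for kind, affix, base_pos, derived_pos in RULES:
        if pos != derived_pos:
            continue
        if kind == "prefix" and word.startswith(affix):
            analyses.append((affix + "-", word[len(affix):] + "/" + base_pos))
        elif kind == "suffix" and word.endswith(affix):
            analyses.append(("-" + affix, word[:-len(affix)] + "er/" + base_pos))
    return analyses

print(decompose("impensable", "ADJ"))     # [('im-', 'pensable/ADJ'), ('-able', 'impenser/VER')]
print(decompose("décomposable", "ADJ"))   # [('-able', 'décomposer/VER')]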
Analysis of neoclassical compounds: DériF
 acrodynie/N
 Hierarchical decomposition:
 [ [ acr N* ] [ odyn N* ] ie NOM ]
 Definition (gloss):
 "douleur (du -- liée au) extrémité "
 Semantic type:
 Type = maladie
 Lexical and semantic relations with other lexemes:
 eql:acr/algie, eql:acr/algo, eql:acr/algés, eql:apex/algie,
eql:apex/algo, eql:apex/algés, eql:apex/odyn
 see:acr/ite, see:apex/ite
Outline
1. Word segmentation
2. Linguistic morphology
a) Morphemes
b) Morphological processes
3. Computational morphology
a) Normalisation: stemming, lemmatisation
b) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
Information Retrieval
Stemming
 Stemming is frequently used in Information Retrieval:
 Stemming is applied at indexing time
 User queries are analysed likewise
 Stems in the user query are matched against stems in
documents
 It reduces the number of terms to index
 It improves recall (the proportion of relevant documents
that are retrieved)
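A tiny sketch of the idea: both the index and the query are reduced to stems, so that different surface forms of the same family match (using NLTK's Porter stemmer again; the two-document collection is an illustrative assumption):

from collections import defaultdict
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem
documents = {1: "the embargo was judged ineffective",
             2: "sanctions and embargoes remain in place"}

index = defaultdict(set)                  # inverted index over stems
for doc_id, text in documents.items():
    for token in text.split():
        index[stem(token)].add(doc_id)

query = "embargoes"
print(sorted(index[stem(query)]))         # both documents match via the shared stem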
Information Retrieval
Morphological Query Expansion
 Morphological variants of a word can be used to
perform query expansion
 The original word forms are indexed
 Query terms are expanded with their morphological
variants at retrieval time (Moreau et al., 2007)
Original query: Ineffectiveness of U.S. embargoes or
sanctions
Expanded query: ineffectiveness ineffective effectiveness
effective ineffectively embargoes embargo embargoed
embargoing sanctioning sanction sanctioned sanctions
sanctionable
Text-To-Speech Systems
 Aim: take text, in standard spelling, and synthesise a
spoken version of the text
 Problems
 Proper names (places, persons)
 Out of vocabulary words (words unknown to the system)
 Solutions from morphology
 hothouse = hot + house and not hoth + ouse
Machine Translation
 Aim: translate a text from one language into another
language
 Problems:
 A word in one language may correspond to two or more
words in another language
 Out of vocabulary words
 How can morphological analysis help?
 compounds: Aktionsplan (de) → action plan (en)
 inflection: va, aller (fr) → go (en)
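A naive Python sketch of the compound-splitting step an MT system might apply before translation, checking candidate splits against a vocabulary and allowing a linking -s- (the vocabulary and the handling of linking elements are toy assumptions):

VOCAB = {"aktion", "plan", "haus"}        # toy German vocabulary (lowercased)

def split_compound(word):
    """Try to split a compound into two known parts, allowing a linking -s-."""
    w = word.lower()
    for i in range(2, len(w) - 1):
        left, right = w[:i], w[i:]
        if right not in VOCAB:
            continue
        if left in VOCAB:
            return left, right
        if left.endswith("s") and left[:-1] in VOCAB:   # drop a linking s (Fugen-s)
            return left[:-1], right
    return None

print(split_compound("Aktionsplan"))      # ('aktion', 'plan')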
Meditate on this...
"Maybe in order to understand mankind, we have to
look at the word itself. Mankind. Basically, it's made up
of two separate words – 'mank' and 'ind'. What do
these words mean? It's a mystery, and that's why so is
mankind."
Jack Handey (Deep Thoughts)
Relevant Literature
 Creutz, M. & Lagus, K. (2005), Inducing the Morphological Lexicon of a
Natural Language from Unannotated Text, in 'Proceedings of the
International and Interdisciplinary Conference on Adaptive Knowledge
Representation and Reasoning (AKRR'05)', pp. 106-113.
 Goldsmith, J. (2001), 'Unsupervised Learning of the Morphology of a
Natural Language', Computational Linguistics 27(2), 153-198.
 Harris, Z. (1955), 'From phoneme to morpheme', Language 31(2), 190-222.
 Koskenniemi, K. (1984), A general computational model for word-form
recognition and production, in 'Proceedings of the 22nd annual
meeting on Association for Computational Linguistics', Association for
Computational Linguistics, Morristown, NJ, USA, pp. 178--181.
 Lepage, Y. (1998), Solving analogies on words: an algorithm, in
'Proceedings of the 17th international conference on Computational
Linguistics', Association for Computational Linguistics, Morristown, NJ,
USA, pp. 728-734.
Relevant Literature
 Moreau, F.; Claveau, V. & Sébillot, P. (2007), 'Automatic morphological
query expansion using analogy-based machine learning.', in
Proceedings of the 29th European Conference on Information
Retrieval (ECIR 2007), Roma, Italy, April 2007.
 Trost, H. (2003), The Oxford Handbook of Computational Linguistics,
Oxford University Press, chapter Morphology, pp. 25--47.
 Saffran, J. R.; Newport, E. L. & Aslin, R. N. (1996), 'Word
Segmentation: The Role of Distributional Cues', Journal of Memory
and Language 35(4), 606-621.