Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien

advertisement
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
IMS
Computational
Approaches to Lexical
Collocations
Stefan Evert
IMS
Stuttgart
Brigitte Krenn
ÖFAI
Vienna
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Collocations
IMS
•
•
•
•
What kind of animal are they?
What are they good for?
How can we find them?
How likely are we to find them?
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Schedule
IMS
• Monday: Introduction to Collocations
• Tuesday: Extraction of Coocurrence
Data
• Wednesday: Association Measures
(AMs)
• Thursday: Evaluation of AMs
• Friday: Significance of Result
Differences
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Collocations
Terminology & Definitions
IMS
• Firth's Notion of Collocation
``Meaning by collocation is an
abstraction at the syntagmatic level
and is not directly concerned with
the conceptual or idea approach to
the meaning of words.''
``One of the meanings of night is its
collocability with dark, and of dark,
of course, its collocation with
night.'‘ (Firth 1957)
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Collocations
Terminology & Definitions
IMS
• Choueka's Notion of Collocation
"A collocation is defined as a
sequence of two or more consecutive
words, that has characteristics of a
syntactic and semantic unit, and
whose exact and unambiguous
meaning cannot be derived directly
from the meaning or connotation of its
components."
(Choueka 1988)
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Collocations
Terminology & Definitions
IMS
Summing up, we have 2 different views
1. Firth: collocations as lexical
proximities in text
2. Choueka: collocations as syntactic
and semantic units, semantic
irregularity
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Terminology
IMS
• Idioms
preferably used in the English
literature, e.g. Bar-Hillel:55,
Hockett:58, Katz;Postal:63,
Healey:68,Makkai:72.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Terminology
IMS
• Phraseological Units, (Ge.:
Phraseologismen)
a widely used generic term in the
German literature, e.g.
BurgerEA:82, Fleischer:82.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Terminology
IMS
• Light-verb Constructions,
Support-verb Constructions
refer to very particular
phenomena,
cross-categorisation with idioms
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Terminology
IMS
Terms stemming from Computational
Linguistics, e.g.
• multi-word lexemes, e.g.
Tschichold:97, BreidtEA:96.
• multi-word expressions, e.g.
Segond;Tapanainen:95
• non-compositional compounds, e.g.
Melamed:97
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Terminology
IMS
• Collocation
e.g, red herring, kick the bucket,
dark-night, dog-bark
• Base, Collocate
elements if a collocation,
the base selects its collocates,
it is not always clear to define
what the base is,
e.g. red-herring, kick-bucket
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Terminology
IMS
• Collocation Phrase
grammatical unit which is
constituted by a base and its
collocates or by two or more
collocates,
e.g. a red herring, kick the
bucket, unexpectedly kick the
bucket
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Summing up
IMS
• Terminology and definitions are
influenced by
• different linguistic traditions
• computational linguistic
applications
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Summing up
IMS
• The phenomena covered are
manifold:
•
•
•
•
lexical proximities in texts
syntactic and semantic units
semantic irregularity
syntactic rigidity
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Observations: Generativity
IMS
• Collocations range from
completely fixed to syntactically
flexible constructions.
• Syntactic restrictions usually
coincide with semantic
restrictions and thus are
indicators for the degree of
lexicalization of a particular word
combination.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Observations: Generativity
IMS
• Particular word combinations are
associated with specific
restrictions that cannot be
inferred from standard rules of
grammar and thus need to be
stored together with the
collocation.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Observations: Recurrence
IMS
• Within corpora, the proportion of
collocations is larger among
highly recurrent word
combination than among
infrequent ones.
• Recurrence is an effect of
lexicalization.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Observations: Idiomaticity
IMS
• Semantic opacity is not sufficient for
the definition of collocations as there
exists a variety of conventionalized
word combinations that range from
• fully compositional ones like Hut
aufsetzen (`put on a hat'), Jacke
anziehen (`put on a jacket')
to
• semantically opaque ones like ins Gras
beissen (`bite into the grass' literal
meaning, `die' idiomatic meaning).
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Observations:
Words, Multi-words or Phrases
IMS
• Collocations can be
• word level phenomena
• phrase level phenomena
(collocation phrase)
• Collocation phrases consist of
the lexically determined words
(collocates) only or contain
additional lexically underspecified
material.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Word-level Collocations
IMS
• Adjective- and Adverb-Like
Collocations
• nichts desto trotz (`nonetheless') adverb
• fix und fertig (`exhausted') adjective
• Preposition-Like Collocations
• im Lauf(e), im Zuge (`during')
• an Hand (`with the help of')
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Word-level Collocations
IMS
• Noun-Like Collocations
• Rotes Kreuz (Red Cross)
• Wiener Sängerknaben (Vienna choir
boys)
• Hinz und Kunz (`every Tom, Dick and
Harry')
• Sequences where the nouns are
duplicated
• Schulter an Schulter (shoulder to
shoulder),
• Kopf an Kopf (neck and neck)
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Phrase-level Collocations
IMS
• Modal constructions
• sich (nicht) lumpen lassen (`to splash
out')
• Verb-object combinations
• übers Ohr hauen (`take somebody for a
ride')
• unter die Lupe nehmen (`take a close
look at')
• zum Vorschein bringen (`bring something
to the light')
• des Weges kommen (`to approach')
• Lügen strafen (`prove somebody a liar')
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Phrase-level Collocations
IMS
• Copula constructions
• guten Glaubens sein (`be in good faith')
• auf Draht sein (`be on the ball')
• Proverbs
• Morgenstund hat Gold im Mund
(morning hour has gold in the mouth, ‘the
early bid catches the worm’)
• wissen, wo der Barthel den Most holt
(know where the Barthel the cider
fetches, `know every trick in the book')
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Summing up
IMS
• Structural dependency
•
the collocates of a collocation are
syntactic dependents, thus knowledge of
syntactic structure is a precondition for
accurate collocation identification.
• Syntactic context
•
may help to discriminate literal and
collocational readings, see for instance
im Lauf, im Zug where a genitive to the
right is a strong indicator for collocational
reading.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Summing up
IMS
• Markedness
•
morphologically or syntactically
marked constructions like
seemingly incomplete syntactic
structure or archaic e-suffix are
suitable indicators for collocations,
see im Laufe, im Zuge for e-suffix
and zu Recht, an Hand for
incomplete syntactic structures.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Summing up
IMS
• Single-word versus multi-word units
•
single-word occurrences of word
combinations indicate word-level
collocations, see for instance zu Recht,
zurecht.
• Syntactic rigidity
•
is an important indicator for
collocations see for instance Hinz und
Kunz, an und für sich, fix und fertig, Kopf
an Kopf.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Identification of Collocations:
Operationalizable Criteria
IMS
• over proportionally high recurrence of
collocational word combinations
compared to noncollocational word
combinations in corpora
• lexical determination of the collocates
of a collocation
• collocations constitute grammatical
units
• grammatical restrictions in the
collocation phrases
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Positional N-Grams
IMS
• numerical span
• typically AMs work on bi-grams
, i.e., all <wi, wj> pairs within a
certain span are considered
• <wi,wi> pairs are not considered!
• e.g.: wi-3 ... wi-1 wi wi+1 ... wi+3
(span size is 6)
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Positional N-Grams
IMS
• grammatical span
• e.g. we assume that collocational
relations do not exceed sentence
boundaries
 i.e., span size = sentence size
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Positional N-Grams:
Problems
IMS
• huge amounts of cooccurrence
data need to be managed
this requires special algorithms
(e.g. Yamamoto,Church 2001)
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Positional N-Grams:
Problems
IMS
• definition of span size is crucial
If the span size is kept small, it is
unlikely to properly cover nonadjacent
collocates of structurally flexible
collocations.
Enlarging the span size leads to an
increase of candidate collocations
including an increase of noisy data
which need to be discarded in a
further processing step.
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Positional N-Grams:
Problems
IMS
High amount of noise in the
collocation data due to
• inappropriate span size
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Positional N-Grams:
Problems
IMS
High amount of noise in the
collocation data due to
• over-proportional frequency of
function words within texts
use stop word lists
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Positional N-Grams:
Problems
IMS
High amount of noise in the
collocation data due to
• insensitivity to punctuation
use a sentence as the largest unit within
which the collocates of a collocation may
occur
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Positional N-Grams:
Problems
IMS
High amount of noise in the
collocation data due to
• insensitivity to parts-of-speech
knowing parts-of-speech allows a large
number of syntactically invalid n-grams to
be excluded beforehand
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Positional N-Grams:
Problems
IMS
High amount of noise in the
collocation data due to
• insensitivity to syntactic structure
further improvement of the
appropriateness of the collocation
candidates selected is achieved by
the availability of structural and/or
dependency information
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Proposal
IMS
• (Partial) replacement of
• positional n-grams
by
• relational n-grams
• What does it imply?
• In which cases do we want/need it?
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
From Positional to Relational
N-Grams
IMS
• positional with part-of-speech,
eg. (Evert,Kermes 2003)
• <Adj,N> pairs within certain span
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
From Positional to Relational
N-Grams
IMS
• positional with part-of-speech
and lexical knowledge, e.g.
Breidt 93
• sentence final <noun,past
participle> pairs
• past participles must be part of a
list of 16 German support verbs
e.g. kommen, bringen, stehen,
stellen, ...
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
From Positional to Relational
N-Grams
IMS
• partially relational, e.g. Krenn
2000
• extraction of PN-combinations from
PPs
• extraction of main verbs
• combination of PN-pairs and verbs
co-occurring in a sentence
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
From Positional to Relational
N-Grams
IMS
• Result
• a theoretical maximum of PNV
combinations, i.e.,
• verbs are duplicated in sentences
that contain more than one PP,
• PPs are duplicated in sentences
where more than one main verb is
found.
This has effects on counting!
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Relational N-Grams
IMS
Assumption
The collocates of a collocation are
syntactically related!
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Relational N-Grams
IMS
Requirements
• part-of-speech tagging
• partial parsing (phrase chunking)
• full parsing
Evert,Kermes 2003
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Relational N-Grams
IMS
Problems
• full parsing of free text is error
prone
• there are cases where
collocation extraction from fully
parsed text is less accurate
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Summary
IMS
• Use application dependent notion of
collocation as opposed to
Firth/Choueka
• Extract collocations from relational
data (relational n-grams)
• Interested in collocation phrases and
their collocates
• Consider grammatically homogenous
data (e.g. Adj-N, PP-V)
• Recurrence/cooccurrence frequency
is main criterion for collocation
extraction (statistical approaches)
Download