
What computers can and cannot do for lexicography
or
Us precision, them recall
Adam Kilgarriff
Lexicography Masterclass Ltd
and
University of Brighton, UK
Outline
Precision and recall
 History of corpus lexicography
 Natural Language Processing
 Cyborgs

27-29 Aug 2003
Adam Kilgarriff: Us precision them recall
Find me all the fat cats

a request for information
High recall


Lots of responses
Maybe not all good
High precision
Fewer hits
 Higher confidence
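
The two measures can be made concrete with a small sketch. It assumes the standard definitions (precision: the share of returned items that are correct; recall: the share of correct items that are returned). The cat names and the two searches are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four genuinely fat cats exist.
fat_cats = {"Tom", "Garfield", "Felix", "Misty"}

# A high-recall search: lots of responses, maybe not all good.
broad = {"Tom", "Garfield", "Felix", "Misty", "Rex", "Polly", "Nemo", "Dumbo"}
# A high-precision search: fewer hits, higher confidence.
narrow = {"Tom", "Garfield"}

print(precision_recall(broad, fat_cats))   # (0.5, 1.0)
print(precision_recall(narrow, fat_cats))  # (1.0, 0.5)
```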

Us precision, them recall
            Recall   Precision
Computers   good     bad
People      bad      good
Us precision, them recall

True in many areas
– web searching, Google
– finding an image to illustrate a talk

Nowhere more so than lexicography
Lexicography: finding facts about words
collocations
grammatical patterns
idioms
synonyms
antonyms
meanings
translations

Outline
Precision and recall
 History of corpus lexicography
 Natural Language Processing
 Cyborgs

Four ages of corpus lexicography
Age 1: Pre-computer
Oxford English Dictionary:
• 5 million index cards
Age 2: KWIC Concordances
From 1980
Computerised
COBUILD project was innovator
asian-kwic.html
the coloured-pens method
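
A KWIC (keyword in context) concordance can be sketched in a few lines. This is a minimal illustration, not the COBUILD software: the function name `kwic`, the context width and the sample text are all assumptions made for the example.

```python
def kwic(tokens, keyword, width=4):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines


# Hypothetical mini-corpus.
text = ("we must save money now . they tried to save the forests . "
        "doctors save lives every day")
for line in kwic(text.split(), "save"):
    print(line)
```

Each hit becomes one line with the keyword in a fixed column, which is what the lexicographer then reads (and marks up with coloured pens).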

Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but slow
• 5000 lines: no
Age 3: Collocation statistics
Problem: too much data - how to summarise?
Solution: list of words occurring in neighbourhood of headword, with frequencies
Sorted by salience

Collocation listing
For right collocates of save (>5 hits)
word       fr(x+y)   fr(y)     word        fr(x+y)   fr(y)
forests    6         170       life        36        4875
$1.2       6         180       dollars     8         1668
lives      37        1697      costs       7         1719
enormous   6         301       thousands   6         1481
annually   7         447       face        9         2590
jobs       20        2001      estimated   6         2387
money      64        6776      your        7         3141
Collocation statistics

Which words?
– next word
– previous word
– window, +1 to +5; window, -5 to -1
How sorted?
– most common collocates: but for most nouns the top collocate is "the"
– most salient collocates: how to measure salience?
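
The window choices can be sketched as a single counting function with a start and end offset. The function name `window_collocates` and the sample sentence are assumptions for illustration. Sorting by raw count shows why "most common" is a poor ordering: a function word comes out on top.

```python
from collections import Counter


def window_collocates(tokens, headword, lo, hi):
    """Count words at offsets lo..hi (inclusive) from each hit of headword.

    (1, 1) is the next word, (-1, -1) the previous word,
    (1, 5) the right window and (-5, -1) the left window.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        for offset in range(lo, hi + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts


# Hypothetical mini-corpus.
tokens = ("we save money and they save the forests while "
          "doctors save the lives of many").split()
right = window_collocates(tokens, "save", 1, 5)
print(right.most_common(1))  # [('the', 3)]
```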

Mutual Information
Church and Hanks 1989
How much more often does a word pair occur than one might expect by chance?
“Chance” of x and y occurring together: p(x) * p(y)
Probabilities approximated by frequencies: p(x) ≈ f(x)/N
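
Written out, the measure compares the observed probability of the pair with the chance value p(x) * p(y). The symbol f(x, y) for the number of times the pair occurs together is introduced here for illustration; it does not appear on the slide.

```latex
\[
  I(x, y) \;=\; \log_2 \frac{p(x, y)}{p(x)\,p(y)}
  \;\approx\; \log_2 \frac{f(x, y)/N}{\bigl(f(x)/N\bigr)\bigl(f(y)/N\bigr)}
  \;=\; \log_2 \frac{N \, f(x, y)}{f(x)\, f(y)}
\]
```

Since N is the same for every pair in a corpus, ranking by f(x, y) / (f(x) f(y)) gives the same order as ranking by I(x, y).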
Mutual Information
X       fr eat   fr X      fr eat X   MI*      rank
it      1000     400,000   404        1/1M     3
meat    1000     6000      136        23/1M    2
sushi   1000     100       5          50/1M    1
* numbers are log-proportional to MI
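
The MI* column can be reproduced from the three frequency columns. This sketch assumes MI* is the pair frequency divided by the product of the two word frequencies, expressed per million; the function name is hypothetical.

```python
def mi_ratio_per_million(f_x, f_y, f_xy):
    """Observed pair count over the product of the two word counts, per million.

    With p(w) approximated by f(w)/N this is proportional to
    p(x, y) / (p(x) * p(y)); the corpus size N is a constant factor
    that does not affect the ranking, so it is left out here.
    """
    return f_xy / (f_x * f_y) * 1_000_000


FR_EAT = 1000
# collocate: (fr X, fr eat X), taken from the table
rows = {"it": (400_000, 404), "meat": (6000, 136), "sushi": (100, 5)}

scores = {w: mi_ratio_per_million(FR_EAT, fx, fxy) for w, (fx, fxy) in rows.items()}
for rank, w in enumerate(sorted(scores, key=scores.get, reverse=True), start=1):
    print(rank, w, round(scores[w]))
# 1 sushi 50
# 2 meat 23
# 3 it 1
```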
Problem
Is mathematical salience = lexicographic salience?
No! Higher-frequency items are lexicographically more salient
Solution: multiply MI by raw frequency

Mutual Information
X       fr eat   fr X      fr eat X   MI       rank   MI x fr   new rank
it      1000     400,000   404        1/1M     3      400/M     3
meat    1000     6000      136        23/1M    2      3128/M    1
sushi   1000     100       5          50/1M    1      2500/M    2
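
A sketch of the adjusted score, assuming "MI x fr" means the per-million ratio multiplied by the raw pair frequency (the function name `salience` is hypothetical). Using unrounded ratios it gives roughly 408 for it, 3083 for meat and 250 for sushi. The table's 400 and 3128 follow from rounding MI first. Its 2500/M for sushi does not equal 50 x 5, so this sketch places it above sushi, whereas the table's new-rank column places sushi second. In both cases meat moves to the top.

```python
FR_EAT = 1000
# collocate: (fr X, fr eat X), taken from the table
rows = {"it": (400_000, 404), "meat": (6000, 136), "sushi": (100, 5)}


def salience(f_x, f_y, f_xy):
    """MI-style ratio (per million) multiplied by the raw pair frequency."""
    ratio = f_xy / (f_x * f_y) * 1_000_000
    return ratio * f_xy


scores = {w: salience(FR_EAT, fy, fxy) for w, (fy, fxy) in rows.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['meat', 'it', 'sushi']
```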
Age-3 collocation statistics: limitations
Lists contain
– junk
– unsorted for type: MI lists mix adverbs, subjects, objects, prepositions
What we really want:
– noise-free lists
– one list for each grammatical relation
Age 4: The word sketch
Large well-balanced corpus
Parse to find
– subjects, objects, heads, modifiers etc
One list for each grammatical relation
Statistics to sort each list, as before
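
The last two steps can be sketched as a grouping operation over parser output. The tuple layout, the relation names and the scores below are hypothetical; in practice the score would be a salience statistic such as the one discussed earlier.

```python
from collections import defaultdict


def word_sketch(triples, headword):
    """Group (relation, head, collocate, score) tuples into one sorted list per relation."""
    sketch = defaultdict(list)
    for relation, head, collocate, score in triples:
        if head == headword:
            sketch[relation].append((collocate, score))
    # Sort each relation's list by score, highest first.
    return {rel: sorted(items, key=lambda cs: cs[1], reverse=True)
            for rel, items in sketch.items()}


# Hypothetical parser output with made-up salience scores.
triples = [
    ("object_of", "coffee", "drink", 9.1),
    ("object_of", "coffee", "spill", 6.4),
    ("modifier", "coffee", "black", 7.8),
    ("modifier", "coffee", "instant", 8.3),
    ("object_of", "tea", "drink", 8.7),
]
for relation, items in word_sketch(triples, "coffee").items():
    print(relation, items)
```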

Can we do it?




High-accuracy parsing is hard
Lots of NLP work, many parsing frameworks exist
If any parser can handle a large corpus, it's probably good enough: sorting and statistics make us error-tolerant
Poor man’s parsing:
– object (of active verb) = last noun in any sequence of nouns, adjectives, determiners, numbers and adverbs following the verb
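
The object rule can be sketched directly over part-of-speech tagged tokens. The coarse tag names (V, N, ADJ, DET, NUM, ADV) and the function name are assumptions for illustration, and the sketch does not check that the verb is active.

```python
# Coarse tags assumed for this sketch: V, N, ADJ, DET, NUM, ADV.
NP_TAGS = {"N", "ADJ", "DET", "NUM", "ADV"}


def find_objects(tagged):
    """Return (verb, object) pairs from a list of (word, tag) tuples.

    For each verb, scan the unbroken run of NP_TAGS tokens that follows it
    and take the last noun in that run as the object.
    """
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "V":
            continue
        last_noun = None
        for next_word, next_tag in tagged[i + 1:]:
            if next_tag not in NP_TAGS:
                break
            if next_tag == "N":
                last_noun = next_word
        if last_noun is not None:
            pairs.append((word, last_noun))
    return pairs


sentence = [("she", "PRON"), ("drank", "V"), ("two", "NUM"), ("very", "ADV"),
            ("strong", "ADJ"), ("coffee", "N"), ("cups", "N"), ("quickly", "ADV"),
            ("and", "CONJ"), ("left", "V")]
print(find_objects(sentence))  # [('drank', 'cups')]
```

The rule will sometimes pick the wrong noun, but the errors tend to be scattered while the real objects recur, which is why sorting and statistics make the approach error-tolerant.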
The word sketch

coffee_n.html
Macmillan English Dictionary for Advanced Learners, 2002; editor: Rundell. Work done 1999.

Word sketches produced for the 6000 most common nouns, verbs and adjectives of English
using the British National Corpus (100 M words, already POS-tagged)
lemmatized using John Carroll's lemmatizer
parsed using regular expressions over POS-tags
HTML files with hyperlinked corpus examples
lexicographers used them extensively, instead of going direct to the corpus
positive feedback
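
"Regular expressions over POS-tags" can be illustrated with one relation. This is a sketch under assumptions, not the patterns used in the project: the word_TAG text format and the single modifier pattern are invented for the example, and the tags follow the BNC's C5 style (AJ0 adjective, NN1/NN2 common noun).

```python
import re

# Text in word_TAG form; tags follow the BNC's C5 style
# (AJ0 adjective, NN1/NN2 common noun, AT0 article, VVD past-tense verb).
tagged = ("she_PNP drank_VVD the_AT0 strong_AJ0 coffee_NN1 "
          "and_CJC bought_VVD fresh_AJ0 beans_NN2")

# Modifier relation: an adjective immediately followed by a common noun.
MODIFIER = re.compile(r"([^\s_]+)_AJ0 ([^\s_]+)_NN[12]")

pairs = MODIFIER.findall(tagged)
print(pairs)  # [('strong', 'coffee'), ('fresh', 'beans')]
```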
Outline
Precision and recall
 History of corpus lexicography
 Natural Language Processing
 Cyborgs

Natural Language Processing

The academic discipline which provides the tools
– Also known as Computational Linguistics, Human Language Technology (HLT), Language Engineering

Good at evaluation of its tools
Good news for lexicography:
– identify the best tools, apply them to our corpora
An Anglophone Apology



Technology, tools, resources most often available for English
This talk centres on English
Other languages often present new problems
– Finding word delimiters for Chinese is hard
– Finding bunsetsu for Japanese is hard

Fewer resources available, less work done
Recommendation:
– find the local experts for your language
Recap: Lexicography: finding facts about words
collocations - sketches
grammatical patterns - sketches
idioms
synonyms
antonyms
meanings
translations

Idioms
Extreme case of collocation/multi-word expressions
Sequence of workshops on collocations, MWE
Technical terms (of great interest to technologists, technical): TERMIGHT

Antonyms


Essential semantic relation
but
Justeson and Katz 1995: distributional evidence for typical antonym pairs
– rich men and poor men
– the big ones and the small ones
– black and white issues

Perhaps antonyms are ‘really’ distributional
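
One kind of distributional evidence, the parallel frame seen in "rich men and poor men", can be counted with a simple pattern. This is a rough sketch and not Justeson and Katz's method: it matches plain words rather than tagged adjectives, and the pattern, function name and sentences are assumptions.

```python
import re
from collections import Counter

# Parallel frame "<word> <noun> and [the] <word> <noun>" with the same noun
# repeated, as in "rich men and poor men". Plain word patterns stand in for
# POS tags here, so a real version would restrict the slots to adjectives.
FRAME = re.compile(r"\b(\w+) (\w+) and (?:the )?(\w+) \2\b")


def coordinated_pairs(sentences):
    """Count unordered word pairs that occur in the parallel frame."""
    counts = Counter()
    for s in sentences:
        for a, _noun, b in FRAME.findall(s.lower()):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical sentences.
sentences = [
    "Rich men and poor men alike pay taxes",
    "She sorted the big ones and the small ones",
    "The gap between rich countries and poor countries is wide",
]
counts = coordinated_pairs(sentences)
print(counts.most_common())
```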
Thesauruses

Also near-synonyms
– are there any true synonyms?

Distributional: which words share the same distributions
– if corpus contains object(drink, wine), object(drink, beer)
– 1 pt similarity between wine and beer
– gather all points; find nearest neighbours

Sparck Jones, Lin, Grefenstette
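
The point-counting scheme can be sketched as follows. The triples and function names are hypothetical, and published measures (for example Lin's) weight the shared contexts rather than simply counting them.

```python
from collections import defaultdict
from itertools import combinations


def similarity_points(triples):
    """Score word pairs: one point per shared (relation, head) context."""
    contexts = defaultdict(set)
    for relation, head, word in triples:
        contexts[(relation, head)].add(word)
    points = defaultdict(int)
    for words in contexts.values():
        for a, b in combinations(sorted(words), 2):
            points[(a, b)] += 1
    return points


def nearest_neighbours(word, points):
    """List the other words sharing contexts with `word`, most points first."""
    scored = [(b if a == word else a, n)
              for (a, b), n in points.items() if word in (a, b)]
    return sorted(scored, key=lambda wn: (-wn[1], wn[0]))


# Hypothetical parsed-corpus triples: (relation, head, word).
triples = [
    ("object", "drink", "wine"), ("object", "drink", "beer"),
    ("object", "pour", "wine"), ("object", "pour", "beer"),
    ("object", "drink", "tea"), ("object", "read", "book"),
]
points = similarity_points(triples)
print(nearest_neighbours("wine", points))  # [('beer', 2), ('tea', 1)]
```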
Nearest neighbours
Translation

Parallel corpora
– Texts and their translations
or
Comparable corpora
– Matched for source and target (genre and subject matter), not translations

Which L1 words occur in equivalent L1 settings to L2 words in L2 settings?
– They are candidate translation pairs

Very hard problem
Lots of high quality research
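
For the simpler parallel-corpus case, candidate pairs can be sketched by scoring how often two words turn up in aligned sentences. The Dice coefficient is used here as a stand-in association measure and is an assumption of this example, as are the function name and the tiny corpus. The comparable-corpus case is considerably harder and is not covered.

```python
from collections import Counter
from itertools import product


def dice_scores(aligned_pairs):
    """Dice coefficient for every (L1 word, L2 word) pair over aligned sentences."""
    f1, f2, both = Counter(), Counter(), Counter()
    for s1, s2 in aligned_pairs:
        w1, w2 = set(s1.split()), set(s2.split())
        f1.update(w1)
        f2.update(w2)
        both.update(product(w1, w2))
    return {(a, b): 2 * n / (f1[a] + f2[b]) for (a, b), n in both.items()}


# Tiny hypothetical English-French parallel corpus.
pairs = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]
scores = dice_scores(pairs)
best = max((b for (a, b) in scores if a == "cat"),
           key=lambda b: scores[("cat", b)])
print(best)  # chat
```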
The WASPbench






with David Tugwell, supported by UK EPSRC, grant M54971
A lexicographer's workbench
runtime creation of word sketches
integration with Word Sense Disambiguation technology
output is a "disambiguating dictionary": analysis of a word's meaning into senses, plus a computer program for disambiguating contextualised instances of the word
First release now available.
http://wasps.itri.brighton.ac.uk/
Sketches at
http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/wordsketches.html
The Sketch Engine

Input:
– any corpus, any language: lemmatised, part-of-speech tagged
– specification of grammatical relations
Word sketches integrated with corpus query system
– Supports complex searching, sorting etc

First release early 2004
Outline
Precision and recall
 History of corpus lexicography
 Natural Language Processing
 Cyborgs

Cyborgs
Robots: will they take over?
Rod Brooks’s answer:
– Wrong question: greatest advances are in what the human+computer ensemble can do
Cyborgs

A creature that is partly human and partly machine
– Macmillan English Dictionary
Cyborgs and the Information Society
The dictionary-making agent is part human (for precision), part computer (for recall).
Treat your computer with respect. You and it can do great things together.
Lexicographers of the future?