What computers can and cannot do for lexicography
or
Us precision, them recall

Adam Kilgarriff
Lexicography Masterclass Ltd and University of Brighton, UK

Outline
Precision and recall
History of corpus lexicography
Natural Language Processing
Cyborgs

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall

Find me all the fat cats
a request for information

High recall
Lots of responses
Maybe not all good

High precision
Fewer hits
Higher confidence

Us precision, them recall
            Recall   Precision
Computers   good     bad
People      bad      good

Us precision, them recall
True in many areas
– web searching, Google
– finding an image to illustrate a talk
Nowhere more so than lexicography

Lexicography: finding facts about words
collocations
grammatical patterns
idioms
synonyms
antonyms
meanings
translations

Outline
Precision and recall
History of corpus lexicography
Natural Language Processing
Cyborgs

Four ages of corpus lexicography

Age 1: Pre-computer
Oxford English Dictionary:
• 5 million index cards

Age 2: KWIC Concordances
From 1980
Computerised
COBUILD project was innovator
asian-kwic.html
the coloured-pens method

Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no

Age 3: Collocation statistics
Problem: too much data - how to
summarise?
Solution: list of words occurring in the neighbourhood of the headword, with frequencies
Sorted by salience

Collocation listing
For right collocates of save (>5 hits)
word       fr(x+y)  fr(y)    word       fr(x+y)  fr(y)
forests          6    170    life            36   4875
$1.2             6    180    dollars          8   1668
lives           37   1697    costs            7   1719
enormous         6    301    thousands        6   1481
annually         7    447    face             9   2590
jobs            20   2001    estimated        6   2387
money           64   6776    your             7   3141

Collocation statistics
Which words?
– next word
– last word
– window, +1 to +5; window, -5 to -1
How sorted?
– most common collocates -- but for most nouns it's "the"
– most salient collocates -- how to measure salience?

Mutual Information
Church and Hanks 1989
How much more often does a word pair occur than one might expect by chance?
“Chance” of x and y occurring together: p(x) * p(y)
Probabilities approximated by frequencies: p(x) ≈ f(x)/N

Mutual Information
X       fr eat   fr X      fr eat X   MI*     rank
it      1000     400,000   404        1/1M    3
meat    1000     6000      136        23/1M   2
sushi   1000     100       5          50/1M   1
* numbers are log-proportional to MI

Problem: mathematical salience = lexicographic salience? No!
Higher-frequency items are lexicographically more salient
Solution: multiply MI by raw frequency

Mutual Information
X       fr eat   fr X      fr eat X   MI      rank   MI x fr   new rank
it      1000     400,000   404        1/1M    3      400/M     2
meat    1000     6000      136        23/1M   2      3128/M    1
sushi   1000     100       5          50/1M   1      250/M     3

Collocation listing
For right collocates of save (>5 hits)
word       fr(x+y)  fr(y)    word       fr(x+y)  fr(y)
forests          6    170    life            36   4875
$1.2             6    180    dollars          8   1668
lives           37   1697    costs            7   1719
enormous         6    301    thousands        6   1481
annually         7    447    face             9   2590
jobs            20   2001    estimated        6   2387
money           64   6776    your             7   3141

Age-3 collocation statistics: limitations
Lists contain junk
Unsorted for type
– MI lists mix adverbs, subjects, objects, prepositions
What we really want:
– noise-free lists
– one list for each grammatical relation

Age 4: The word sketch
Large well-balanced corpus
Parse to find
– subjects, objects, heads, modifiers etc
One list for each grammatical relation
Statistics to sort each list, as before

Can we do it?
High-accuracy parsing is hard
Lots of NLP work, many parsing frameworks exist
If any parser can handle a large corpus, it's probably good enough
-- sorting and statistics make us error-tolerant
Poor man’s parsing:
– object (of active verb) = last noun in any sequence of nouns, adjectives, determiners, numbers and adverbs following the verb

The word sketch
coffee_n.html

Macmillan Dictionary of English for Advanced Learners, 2002
Editor: Rundell. Work done 1999.
Word sketches produced for the 6000 most common nouns, verbs, adjectives of English
– using British National Corpus (100 M words, already POS-tagged)
– lemmatized using John Carroll's lemmatizer
– parsed using regular expressions over POS-tags
HTML files with hyperlinked corpus examples
Lexicographers used them extensively, instead of going direct to the corpus
Positive feedback

Outline
Precision and recall
History of corpus lexicography
Natural Language Processing
Cyborgs

Natural Language Processing
The academic discipline which provides the tools
– Also known as Computational Linguistics, Human Language Technology (HLT), Language Engineering
Good at evaluation of its tools
Good news for lexicography:
– identify the best tools, apply them to our corpora

An Anglophone Apology
Technology, tools, resources most often available for English
This talk centres on English
Other languages often present new problems
– Finding word delimiters for Chinese is hard
– Finding bunsetsu for Japanese is hard
Fewer resources available, less work done
Recommendation:
– find the local experts for your language

Recap: Lexicography: finding facts about words
collocations, grammatical
patterns, idioms, synonyms, antonyms, meanings, translations

Recap: Lexicography: finding facts about words
collocations - sketches
grammatical patterns - sketches
idioms
synonyms
antonyms
meanings
translations

Idioms
Extreme case of collocation/multi word expressions
Sequence of workshops on collocations, MWE
Technical terms (of great interest to technologists, technical): TERMIGHT

Antonyms
Essential semantic relation
but
Justeson and Katz 1995: distributional evidence for typical antonym pairs
– rich men and poor men
– the big ones and the small ones
– black and white issues
Perhaps antonyms are ‘really’ distributional

Thesauruses
Also near-synonyms
– are there any true synonyms?
Distributional: which words share same distributions
– if corpus contains object(drink, wine), object(drink, beer)
– 1 pt similarity between wine and beer
– gather all points; find nearest neighbours
Sparck Jones, Lin, Grefenstette

Nearest neighbours

Translation
Parallel corpora
– Texts and their translations
or
Comparable corpora
– Matched for source and target (genre and subject matter), not translations
Which L1 words occur in equivalent L1 settings to L2 words in L2 settings?
– They are candidate translation pairs
Very hard problem
Lots of high quality research

The WASPbench
with David Tugwell, supported by UK EPSRC, grant M54971
A lexicographer's workbench
– runtime creation of word sketches
– integration with Word Sense Disambiguation technology
– output is "disambiguating dictionary": analysis of word's meaning into senses, plus computer program for disambiguating contextualised instances of the word
First release now available.
http://wasps.itri.brighton.ac.uk/
Sketches at http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/wordsketches.html

The Sketch Engine
Input:
– any corpus, any language
  Lemmatised, part-of-speech tagged
– specification of grammatical relations
Word sketches integrated with corpus query system
– Supports complex searching, sorting etc
First release early 2004

Outline
Precision and recall
History of corpus lexicography
Natural Language Processing
Cyborgs

Cyborgs
Robots: will they take over?
Rod Brooks’s answer:
– Wrong question: greatest advances are in what the human+computer ensemble can do

Cyborgs
A creature that is partly human and partly machine
– Macmillan English Dictionary

Cyborgs and the Information Society
The dictionary-making agent is part human (for precision), part computer (for recall).

Treat your computer with respect.
You and it can do great things together.

Lexicographers of the future?
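
Worked example: Mutual Information for the collocates of eat

The Mutual Information slides give the recipe (chance co-occurrence is p(x) * p(y), with p(x) approximated by f(x)/N) and a small table for eat. What follows is a minimal sketch of that calculation using the counts from the table. The function names and the corpus size N are assumptions made for illustration; the slides do not state N, and it does not change the ranking.

```python
import math

# Counts from the "eat" table: collocate X -> (f(eat), f(X), f(eat X)).
COUNTS = {
    "it": (1000, 400_000, 404),
    "meat": (1000, 6000, 136),
    "sushi": (1000, 100, 5),
}

N = 100_000_000  # assumed corpus size in words; not stated on the slides


def association_ratio(f_x, f_y, f_xy):
    """f(xy) / (f(x) * f(y)): the part of MI that does not depend on N."""
    return f_xy / (f_x * f_y)


def mutual_information(f_x, f_y, f_xy, n=N):
    """log2 of the observed pair probability over the chance value p(x) * p(y)."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))


def ranked(scores):
    """Collocates ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


ratio = {w: association_ratio(*c) for w, c in COUNTS.items()}
mi = {w: mutual_information(*c) for w, c in COUNTS.items()}
# Frequency weighting: multiply the ratio by the raw pair frequency f(eat X).
ratio_times_freq = {w: ratio[w] * COUNTS[w][2] for w in COUNTS}

for w in COUNTS:
    print(f"{w:6} ratio per million = {ratio[w] * 1e6:6.2f}   MI = {mi[w]:5.2f} bits")
print("by MI:", ranked(mi))
print("by ratio x frequency:", ranked(ratio_times_freq))
```

With these counts, plain MI puts sushi first, while multiplying by the pair frequency moves meat to the top, which is the effect the frequency weighting is meant to have.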