Towards a corpus-based online dictionary of Italian Word

TOWARDS A CORPUS-BASED

ONLINE DICTIONARY OF

ITALIAN WORD COMBINATIONS

The CombiNet project

SARA CASTAGNOLI

FRANCESCA MASINI

(UNIVERSITY OF BOLOGNA)

MALVINA NISSIM

(UNIVERSITY OF GRONINGEN)

ENeL meeting @ Herstmonceux Castle, 13 August 2015

GIANLUCA E. LEBANI

ALESSANDRO LENCI

(UNIVERSITY OF PISA)

VALENTINA PIUNNO

(UNIVERSITY OF ROMA TRE)

THIS PRESENTATION

• INTRODUCING CombiNet, an ongoing project aimed at building a corpus-based lexicographic resource for Italian Word Combinations (Universities of Roma Tre, Pisa, Bologna)

• an innovative resource for the Italian language

• relevance for ENeL-WG3:

• an electronic resource

• an integrated computational-lexicographic approach:

1) automatic extraction of candidate WoCs from corpora

2) manual evaluation and compilation

• OUTLINE:

• our view of Word Combinations (WoCs)

• AKA: extracting WoCs from corpora – methods

• evaluation of AKA: automatic and manual

WORD COMBINATIONS (WoCs)

The whole range of combinatory possibilities associated with a word, including:

• Multiword Expressions (MWEs), i.e. a variety of WoCs characterised by different degrees of fixedness and idiomaticity that act as a single unit at some level of linguistic analysis, e.g.:

• collocations

• idioms

• phrasal lexemes

• preferred combinations

• More abstract combinations, i.e. the distributional properties of a word at the level of e.g.:

• argument structure

• subcategorization frames

• selectional preferences

EXTRACTING WoCs - METHODS

Using POS PATTERNS

(P-BASED methods)

POS-tagged corpus

list of POS patterns

NOUN PREP NOUN: punto di vista ‘point of view’

NOUN ADJ: anno accademico ‘academic year’

VER DET (ADJ) NOUN: costruire un piccolo impero ‘build a small empire’
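P-based extraction can be illustrated with a minimal sketch that slides a POS pattern over a tagged sentence and collects the matching lemma sequences. The tagset labels, the `match_pos_pattern` helper and the toy sentence are assumptions made for illustration; this is not the code of the extraction tool used in the project.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma, POS tag); the tag labels are illustrative


def match_pos_pattern(sentence: List[Token], pattern: List[str]) -> List[Tuple[str, ...]]:
    """Return every contiguous lemma sequence whose POS tags equal `pattern`."""
    n = len(pattern)
    hits = []
    for i in range(len(sentence) - n + 1):
        window = sentence[i:i + n]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            hits.append(tuple(lemma for lemma, _ in window))
    return hits


# Toy tagged input (hypothetical, not taken from the corpus)
sentence = [
    ("il", "DET"), ("punto", "NOUN"), ("di", "PREP"), ("vista", "NOUN"),
    ("di", "PREP"), ("un", "DET"), ("anno", "NOUN"), ("accademico", "ADJ"),
]

print(match_pos_pattern(sentence, ["NOUN", "PREP", "NOUN"]))
print(match_pos_pattern(sentence, ["NOUN", "ADJ"]))
```

Because the match is strictly contiguous, any intervening word (an adverb, an extra modifier) makes the candidate invisible to the pattern.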

Using SYNTACTIC INFO

(S-BASED methods)

parsed corpus

list of syntactic relations

SUBJ – VERB: guerra – scoppiare ‘war – burst’

VERB – OBJ: perdere – vista ‘lose – (one’s) sight’

VERB – COMP_DI: parlare – di sport ‘talk – about sport’
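S-based extraction can be sketched in the same spirit: given a dependency-parsed sentence, collect head-dependent lemma pairs for the relations of interest. The `Tok` structure, the relation labels (`subj`, `obj`) and the toy parse are assumptions for illustration, not the project's actual parser output.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Tok(NamedTuple):
    idx: int     # 1-based position in the sentence
    lemma: str
    head: int    # idx of the syntactic head, 0 for the root
    deprel: str  # relation to the head (labels are hypothetical)


def extract_relations(sentence: List[Tok], wanted: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (relation, head lemma, dependent lemma) for the wanted relations."""
    wanted = set(wanted)
    by_idx = {t.idx: t for t in sentence}
    pairs = []
    for t in sentence:
        if t.deprel in wanted and t.head in by_idx:
            pairs.append((t.deprel, by_idx[t.head].lemma, t.lemma))
    return pairs


# Toy parse of "la guerra è scoppiata": the auxiliary separates subject and verb
parsed = [
    Tok(1, "il", 2, "det"),
    Tok(2, "guerra", 4, "subj"),
    Tok(3, "essere", 4, "aux"),
    Tok(4, "scoppiare", 0, "root"),
]

print(extract_relations(parsed, ["subj", "obj"]))
```

The subject and the verb are paired even though they are not adjacent, which is what lets this family of methods reach discontinuous combinations.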

COMPARING EXTRACTION METHODS

Using POS PATTERNS

(P-BASED methods)

- satisfactory results for relatively fixed, adjacent, short WoCs

- patterns need to be specified a priori

- noise, even after applying association measures (AMs)

- cannot capture complex and flexible WoCs

- dismiss abstract combinatory information (e.g. argument structure)

Using SYNTACTIC INFO

(S-BASED methods)

- also target discontinuous and syntactically flexible WoCs

- abstract away from information such as linear order, morphosyntactic features, etc.

- no information about how exactly words combine

- cannot distinguish frequent but productive combinations from idiomatic ones with the very same syntactic structure

Castagnoli et al. 2015; Lenci et al. 2014, 2015

AUTOMATIC EXTRACTION OF CANDIDATE WoCs - DATA

La Repubblica corpus (Baroni et al. 2004)

• approx. 380M tokens, POS-tagged and dependency parsed

• “clean” corpus, but only newspaper language

• POS-based extraction:

• 122 POS sequences deemed representative of Italian WoCs, in 3 subsets (nominal, verbal, prepositional WoCs)

• Independent extraction rounds, using the EXTra tool

• contiguous sequences, no optional slots, LL ranking, freq>5

• Syntax-based extraction:

• distributional profiles, containing the syntactic slots (subject, complements, modifiers, etc.) and the combinations of slots (frames) with which words co-occur, abstracted away from their surface morphosyntactic patterns

• each slot is associated with lexical sets formed by its most prototypical fillers

• LexIt tool

• contiguous and discontinuous sequences, LL ranking, freq>5
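The ranking step can be illustrated with a minimal sketch, assuming that LL stands for the log-likelihood ratio computed on a 2×2 contingency table for a word pair. The `rank_pairs` helper and all counts below are invented for illustration, and the tools named above may compute scores for longer sequences differently.

```python
import math
from typing import Dict, List, Tuple


def log_likelihood(o11: int, o12: int, o21: int, o22: int) -> float:
    """Log-likelihood ratio (G2) for a 2x2 table of observed counts."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def rank_pairs(pair_freq: Dict[Tuple[str, str], int],
               left_freq: Dict[str, int],
               right_freq: Dict[str, int],
               n: int,
               min_freq: int = 6) -> List[Tuple[Tuple[str, str], int, float]]:
    """Drop pairs below `min_freq` (i.e. keep freq > 5) and sort the rest by LL."""
    scored = []
    for (w1, w2), f12 in pair_freq.items():
        if f12 < min_freq:
            continue
        o12 = left_freq[w1] - f12    # w1 without w2
        o21 = right_freq[w2] - f12   # w2 without w1
        o22 = n - f12 - o12 - o21    # neither word
        scored.append(((w1, w2), f12, log_likelihood(f12, o12, o21, o22)))
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Invented toy counts, not figures from the La Repubblica corpus
pair_freq = {("anno", "accademico"): 120, ("anno", "lungo"): 8, ("anno", "verde"): 3}
left_freq = {"anno": 2000}
right_freq = {"accademico": 150, "lungo": 900, "verde": 400}

ranked = rank_pairs(pair_freq, left_freq, right_freq, n=100_000)
for pair, freq, score in ranked:
    print(pair, freq, round(score, 2))
```

The frequency filter is applied before scoring, so rare pairs never enter the ranking regardless of how strongly associated they are.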

DATA FOR LEXICOGRAPHERS

1) All sequences corresponding to the mentioned patterns are extracted from the corpus.

2) Lists of candidate WoCs are filtered to extract lines containing specific Target Lemmas (TLs), i.e. future headwords

• Headwords: the 2,100 “fundamental” words from the Senso Comune lexicon ( http://www.sensocomune.it/ )

• Nouns, Verbs, Adjectives

3) Lexicographers are provided with structured lists:

• lemmatised candidate WoCs for a given TL

• ranked according to their LL score

• raw frequency of each combination in the corpus

• underlying POS pattern or syntactic relation
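Steps 2 and 3 can be sketched as a filter followed by a sort. The `Candidate` record, the `candidates_for` helper, and every frequency and score below are hypothetical and only mirror the fields listed above; they do not reproduce the project's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    lemmas: Tuple[str, ...]  # lemmatised candidate WoC
    pattern: str             # underlying POS pattern or syntactic relation
    freq: int                # raw corpus frequency
    ll: float                # LL score


def candidates_for(target_lemma: str, candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates containing the target lemma, highest LL score first."""
    hits = [c for c in candidates if target_lemma in c.lemmas]
    return sorted(hits, key=lambda c: c.ll, reverse=True)


# Invented values for illustration only
extracted = [
    Candidate(("anno", "accademico"), "NOUN ADJ", 120, 733.4),
    Candidate(("punto", "di", "vista"), "NOUN PREP NOUN", 950, 5120.8),
    Candidate(("perdere", "vista"), "VERB-OBJ", 64, 310.2),
    Candidate(("punto", "fermo"), "NOUN ADJ", 41, 188.9),
]

for c in candidates_for("vista", extracted):
    print(" ".join(c.lemmas), c.pattern, c.freq, c.ll, sep="\t")
```

Tab-separated rows of this kind can be opened directly in a spreadsheet and filtered by the pattern column.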

POS-BASED DATA

SYNTAX-BASED DATA

LEXICOGRAPHERS’ USE OF DATA

• Candidate lists for each TL are imported into a spreadsheet.

• As our current lexicographic layout groups WoCs on the basis of their function and syntactic configuration, lexicographers can scroll candidate lists or filter them to observe and evaluate only candidate WoCs corresponding to specific POS patterns and/or syntactic relations.

• Candidates considered valid WoCs are manually selected and edited before being recorded in the relevant part of the lexicographic record.

LEXICOGRAPHERS’ EVALUATION - 1

(“highly impressionistic feedback from our lexicographers”)

• LL ranking is generally helpful, as most higher-ranking candidates represent (or contain, or suggest) proper WoCs which deserve inclusion in the dictionary.

• However, it is difficult to set thresholds, since WoCs which lexicographers would intuitively include in the entry also appear in the middle and lower part of the ranking.

• POS-based data are more useful to compile the entries for nominal and adjectival TLs, whereas SYNTAX-based data would be more helpful for verbal TLs.

• No systematic evidence provided.

AUTOMATIC EVALUATION - 1

• We tested and compared the performance of the two extraction methods using an existing Italian combinatory dictionary as a benchmark (25 TLs).

• Recall, (R-)precision, thresholds, systems’ overlap

Castagnoli et al. 2015

• Interesting findings supporting the lexicographers’ intuition:

• Recall is rather high for both systems

• Recall of P-based method is higher for N and A, while S-based method has higher recall for V

• Recall for P-based method appears to plateau at 2,000 hits (*)

• P-based and S-based methods often extract/don’t extract the same WoCs (performance is identical for 76% of gold standard combinations) (*)

• But they also extract different gold standard combinations, with a complementary distribution (P-based: N+A, S-based: V) (*)

• R-precision is higher for S-based method

• Crowdsourcing evaluation: nearly 25% of candidates are valid WoCs even if they are not included in the benchmark dictionary (*)
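The measures named above can be made concrete with a minimal sketch, assuming the standard definitions: recall is the share of gold standard combinations found among the candidates, and R-precision is the precision over the top R candidates, where R is the size of the gold standard. The function names and the toy data are illustrative and do not reproduce the benchmark evaluation.

```python
from typing import List, Set


def recall(ranked: List[str], gold: Set[str]) -> float:
    """Share of gold combinations that appear anywhere in the candidate list."""
    return len(gold.intersection(ranked)) / len(gold)


def recall_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Recall restricted to the top-k candidates, e.g. to look for a plateau."""
    return len(gold.intersection(ranked[:k])) / len(gold)


def r_precision(ranked: List[str], gold: Set[str]) -> float:
    """Precision over the top R candidates, with R = size of the gold standard."""
    r = len(gold)
    return len(gold.intersection(ranked[:r])) / r


# Toy ranking and gold standard for one target lemma (invented)
ranked = ["punto di vista", "punto di partenza", "punto di ieri", "punto fermo"]
gold = {"punto di vista", "punto fermo", "punto debole"}

print(recall(ranked, gold))
print(r_precision(ranked, gold))
print([recall_at_k(ranked, gold, k) for k in range(1, len(ranked) + 1)])
```

A candidate absent from the gold standard counts as an error here even when it is a valid WoC, which is the gap the crowdsourcing evaluation addresses.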

LEXICOGRAPHERS’ EVALUATION - 2

• Lexicographers report adding WoCs that “should intuitively be there” but are not extracted from the corpus.

• More research is needed to:

a) analyse the nature of these WoCs

• Patterns we haven’t thought of? (Long) idioms?

b) assess the impact of extraction techniques and settings

• Min. frequency?

c) assess the impact of corpus type and size

• Limited to a single newspaper corpus

• Virtually no difference with the PAISÀ corpus (250M words, copyright-free web content)

• Maybe a huge web corpus?

OTHER LIMITATIONS

• Still a lot of manual work for lexicographers

• No automatic import / conversion of acquired data into an editing database / interface

• We are not using a proper Dictionary Writing System

• Many other ideas that came up listening to some eLex presentations …

THANK YOU!
