SARA CASTAGNOLI
FRANCESCA MASINI
(UNIVERSITY OF BOLOGNA)
MALVINA NISSIM
(UNIVERSITY OF GRONINGEN)
GIANLUCA E. LEBANI
ALESSANDRO LENCI
(UNIVERSITY OF PISA)
VALENTINA PIUNNO
(UNIVERSITY OF ROMA TRE)
ENeL meeting @ Herstmonceux Castle, 13 August 2015
• INTRODUCING CombiNet, an ongoing project aimed at building a corpus-based lexicographic resource for Italian Word Combinations (Universities of Roma Tre, Pisa, Bologna)
• an innovative resource for the Italian language
• relevance for ENeL-WG3:
• an electronic resource
• an integrated computational-lexicographic approach:
1) automatic extraction of candidate WoCs from corpora
2) manual evaluation and compilation
• OUTLINE :
• our view of Word Combinations (WoCs)
• AKA: extracting WoCs from corpora – methods
• evaluation of AKA: automatic and manual
The whole range of combinatory possibilities associated with a word, including:
• Multiword Expressions (MWEs), i.e. a variety of WoCs characterised by different degrees of fixedness and idiomaticity that act as a single unit at some level of linguistic analysis, e.g.:
• collocations
• idioms
• phrasal lexemes
• preferred combinations
• More abstract combinations , i.e. the distributional properties of a word at the level of e.g.:
• argument structure
• subcategorization frames
• selectional preferences
Using POS PATTERNS (P-BASED methods)
• POS-tagged corpus
• list of POS patterns
• NOUN PREP NOUN: punto di vista ‘point of view’
• NOUN ADJ: anno accademico ‘academic year’
• VER DET (ADJ) NOUN: costruire un piccolo impero ‘build a small empire’
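The P-based idea above can be sketched in a few lines: scan a POS-tagged sentence for contiguous spans whose tag sequence matches one of the patterns. This is a minimal illustration, not the project's actual EXTra implementation; the tagset labels and the two-pattern inventory are stand-ins for the real 122 patterns.

```python
# Hypothetical P-based matcher: a tagged sentence is a list of (token, tag)
# pairs; a pattern is a tuple of POS tags to be matched contiguously.
PATTERNS = [
    ("NOUN", "PREP", "NOUN"),   # e.g. punto di vista 'point of view'
    ("NOUN", "ADJ"),            # e.g. anno accademico 'academic year'
]

def match_patterns(tagged_sentence, patterns=PATTERNS):
    """Return every contiguous token span whose tag sequence equals a pattern."""
    matches = []
    tags = [tag for _, tag in tagged_sentence]
    for pat in patterns:
        n = len(pat)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pat:
                tokens = [tok for tok, _ in tagged_sentence[i:i + n]]
                matches.append((" ".join(tokens), pat))
    return matches

sent = [("punto", "NOUN"), ("di", "PREP"), ("vista", "NOUN")]
print(match_patterns(sent))  # [('punto di vista', ('NOUN', 'PREP', 'NOUN'))]
```

Because matching is strictly contiguous, this sketch also makes the method's main limitation visible: a discontinuous combination (e.g. with an intervening adverb) is simply missed unless a longer pattern with an optional slot is specified a priori.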
Using SYNTACTIC INFO (S-BASED methods)
• parsed corpus
• list of syntactic relations
• SUBJ – VERB: guerra – scoppiare ‘war – break out’
• VERB – OBJ: perdere – vista ‘lose – (one’s) sight’
• VERB – COMP_DI: parlare – di sport ‘talk – about sport’
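The S-based approach works on dependency triples rather than surface strings, so it can be sketched as a counter over (relation, head, dependent) tuples. This is an illustrative sketch, not the LexIt pipeline; the relation labels follow the slide's examples.

```python
# Hypothetical S-based collector: each parsed sentence is a list of
# (head_lemma, relation, dependent_lemma) triples from a dependency parser.
from collections import Counter

def collect_pairs(parsed_sentences, relations=("SUBJ", "OBJ", "COMP_DI")):
    """Count lemma pairs per relation, regardless of word order or distance."""
    counts = Counter()
    for triples in parsed_sentences:
        for head, rel, dep in triples:
            if rel in relations:
                counts[(rel, head, dep)] += 1
    return counts

corpus = [
    [("scoppiare", "SUBJ", "guerra")],   # la guerra scoppia 'war breaks out'
    [("perdere", "OBJ", "vista")],       # perdere la vista 'lose one's sight'
]
print(collect_pairs(corpus))
```

Note that linear order and morphosyntactic features are discarded here by construction, which is exactly the trade-off discussed next: flexibility and discontinuity are handled for free, but information about how the words surface together is lost.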
Using POS PATTERNS (P-BASED methods)
+ satisfactory results for relatively fixed / adjacent / short WoCs
- patterns need to be specified a priori
- noise, even after applying AMs
- cannot capture complex and flexible WoCs
- dismisses abstract combinatory information (e.g. argument structure)

Using SYNTACTIC INFO (S-BASED methods)
+ also targets discontinuous and syntactically flexible WoCs
- abstracts away from information such as linear order, morphosyntactic features etc.
- no information about how exactly words combine
- cannot distinguish frequent but productive combinations from idiomatic ones with the very same syntactic structure
Castagnoli et al. 2015; Lenci et al. 2014, 2015
• La Repubblica corpus (Baroni et al. 2004)
• approx. 380M tokens, POS-tagged and dependency parsed
• “clean” corpus, but only newspaper language
• POS-based extraction:
• 122 POS sequences deemed representative of Italian WoCs, in 3 subsets (nominal, verbal, prepositional WoCs)
• independent extraction rounds, using the EXTra tool
• contiguous sequences, no optional slots, LL ranking, freq > 5
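The "LL ranking, freq > 5" setting refers to ranking candidates by the log-likelihood association measure (Dunning's G²). A minimal sketch of that scoring, assuming standard bigram contingency counts (the candidate frequencies and corpus size below are invented for illustration):

```python
# Dunning's log-likelihood ratio (G^2) for a word pair, computed from the
# 2x2 contingency table of observed vs. expected co-occurrence counts.
import math

def log_likelihood(c12, c1, c2, n):
    """c12: pair frequency; c1, c2: marginal frequencies; n: corpus size."""
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected = [c1 * c2 / n, c1 * (n - c2) / n,
                (n - c1) * c2 / n, (n - c1) * (n - c2) / n]
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# Illustrative candidates: (combination, pair freq, freq of w1, freq of w2)
candidates = [
    ("punto di vista", 1200, 5000, 3000),
    ("anno nuovo", 6, 9000, 4000),
]
N = 380_000_000  # approximate size of the La Repubblica corpus

# keep candidates with freq > 5, rank by descending LL score
scored = [(log_likelihood(c12, c1, c2, N), woc)
          for woc, c12, c1, c2 in candidates if c12 > 5]
ranked = sorted(scored, reverse=True)
print(ranked[0][1])  # 'punto di vista'
```

LL is well suited to corpus data with many low-frequency events, but as the lexicographers' feedback below notes, a single score threshold is hard to set: valid WoCs keep appearing well down the ranking.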
• Syntax-based extraction:
• distributional profiles, containing the syntactic slots (subject, complements, modifiers, etc.) and the combinations of slots (frames) with which words co-occur, abstracted away from their surface morphosyntactic patterns
• each slot is associated with lexical sets formed by its most prototypical fillers
• LexIt tool
• contiguous and discontinuous sequences, LL ranking, freq>5
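A LexIt-style distributional profile, as described above, can be pictured as a simple data structure: per-slot filler counts plus counts of observed slot combinations (frames). This is a hypothetical sketch of the shape of the data, not LexIt's actual format; the class and field names are invented.

```python
# Hypothetical distributional profile: for one target lemma, map each
# syntactic slot to its lexical fillers and count observed frames.
from collections import Counter, defaultdict

class Profile:
    def __init__(self, lemma):
        self.lemma = lemma
        self.slots = defaultdict(Counter)   # slot -> Counter of filler lemmas
        self.frames = Counter()             # tuple of slots -> frequency

    def add_occurrence(self, slot_fillers):
        """slot_fillers: dict like {'SUBJ': 'guerra'} for one corpus hit."""
        for slot, filler in slot_fillers.items():
            self.slots[slot][filler] += 1
        self.frames[tuple(sorted(slot_fillers))] += 1

p = Profile("scoppiare")
for subj in ["guerra", "guerra", "incendio"]:
    p.add_occurrence({"SUBJ": subj})
print(p.slots["SUBJ"].most_common(1))  # [('guerra', 2)]
```

The most frequent fillers of a slot (here, the prototypical subjects of scoppiare) are exactly the "lexical sets" mentioned above, and the frame counts give the subcategorisation information that P-based extraction discards.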
• 1) All sequences corresponding to the patterns above are extracted from the corpus.
• 2) Lists of candidate WoCs are filtered to retain lines containing specific Target Lemmas (TLs, i.e. future headwords)
• Headwords : “fundamental” 2,100 words from the Senso Comune lexicon ( http://www.sensocomune.it/ )
• Nouns, Verbs, Adjectives
• 3) Lexicographers are provided with structured lists:
• lemmatised candidate WoCs for a given TL
• ranked according to their LL score
• raw frequency of each combination in the corpus
• underlying POS pattern or syntactic relation
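Steps 2–3 above amount to filtering the extracted candidates by target lemma and presenting them as structured, LL-ranked rows. A minimal sketch, assuming the four fields listed above (the record type and helper names are invented for illustration):

```python
# Hypothetical candidate row: the four fields shown to lexicographers.
from typing import NamedTuple

class Candidate(NamedTuple):
    woc: str        # lemmatised candidate combination
    ll: float       # log-likelihood score
    freq: int       # raw corpus frequency
    pattern: str    # underlying POS pattern or syntactic relation

def rows_for(target_lemma, candidates):
    """Candidates containing the target lemma, ranked by descending LL."""
    hits = [c for c in candidates if target_lemma in c.woc.split()]
    return sorted(hits, key=lambda c: c.ll, reverse=True)

cands = [
    Candidate("punto di vista", 5421.3, 1200, "NOUN PREP NOUN"),
    Candidate("punto debole", 1980.7, 430, "NOUN ADJ"),
    Candidate("anno accademico", 3321.9, 800, "NOUN ADJ"),
]
for row in rows_for("punto", cands):
    print(row.woc, row.ll, row.freq, row.pattern)
```

Such rows map directly onto the spreadsheet columns described next, where lexicographers can additionally filter by the pattern field.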
• Candidate lists for each TL are imported into a spreadsheet .
• As our current lexicographic layout groups WoCs on the basis of their function and syntactic configuration, lexicographers can scroll candidate lists or filter them to observe and evaluate only candidate WoCs corresponding to specific POS patterns and/or syntactic relations.
• Candidates considered as valid WoCs are manually selected and edited before being recorded in the relevant part of the lexicographic record.
• LL ranking is generally helpful, as most higher-ranking candidates represent (or contain, or suggest) proper WoCs which deserve inclusion in the dictionary.
• However, it is difficult to set thresholds, since WoCs which lexicographers would intuitively include in the entry also appear in the middle and lower parts of the ranking.
• POS-based data are more useful for compiling the entries for nominal and adjectival TLs, whereas SYNTAX-based data are more helpful for verbal TLs.
• No systematic evidence provided .
• We tested and compared the performance of the two extraction methods using an existing Italian combinatory dictionary as a benchmark (25 TLs).
• Recall, (R)precision, thresholds, systems’ overlap
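The benchmark evaluation compares each system's ranked output against the set of WoCs listed in the combinatory dictionary. A sketch of the two headline metrics, with invented toy data (recall rewards finding a gold WoC anywhere in the list; R-precision looks only at the top R candidates, R being the gold-standard size):

```python
# Recall and R-precision of a ranked candidate list against a gold standard.
def recall(extracted, gold):
    """Fraction of gold-standard WoCs found anywhere in the extracted list."""
    return len(set(extracted) & gold) / len(gold)

def r_precision(ranked, gold):
    """Precision among the top-R candidates, where R = size of the gold set."""
    top_r = ranked[: len(gold)]
    return len(set(top_r) & gold) / len(gold)

gold = {"punto di vista", "anno accademico", "perdere la vista"}
ranked = ["punto di vista", "fare il punto", "anno accademico", "perdere la vista"]
print(recall(ranked, gold))       # 1.0
print(r_precision(ranked, gold))  # 0.666...: 2 of the top 3 are in the gold set
```

This pairing explains the findings below: both systems can have high recall (gold WoCs do get extracted eventually) while R-precision still differs, because it is sensitive to how the valid candidates are ranked.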
Castagnoli et al. 2015
• Interesting findings supporting the lexicographers’ intuition:
• Recall is rather high for both systems
• Recall of P-based method is higher for N and A, while S-based method has higher recall for V
• Recall for P-based method appears to plateau at 2,000 hits (*)
• P-based and S-based methods often extract / fail to extract the same WoCs (performance is identical for 76% of gold standard combinations) (*)
• But they also extract different gold standard combinations, with a complementary distribution (P-based: N+A, S-based: V) (*)
• R-precision is higher for S-based method
• Crowdsourcing evaluation: nearly 25% of candidates are valid WoCs even if they are not included in the benchmark dictionary (*)
• Lexicographers report adding WoCs that “should intuitively be there” but are not extracted from the corpus.
• More research is needed to:
• a) analyse the nature of these WoCs
• Patterns we haven’t thought of? (Long) idioms?
• b) assess the impact of extraction techniques and settings
• Min. frequency?
• c) assess the impact of corpus type and size
• Limited to a single newspaper corpus
• Virtually no difference with the PAISÀ corpus (250M words, copyright-free web content)
• Maybe a huge web corpus?
• Still a lot of manual work for lexicographers
• No automatic import / conversion of acquired data into an editing database / interface
• We are not using a proper Dictionary Writing System
• Many other ideas that came up listening to some eLex presentations …