Finding your way through the woods with GrETEL Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde TABU-dag - June 14, 2013 GrETEL • Greedy Extraction of Trees for Empirical Linguistics • Query engine for treebanks • Nederbooms project Exploitation of Dutch treebanks for research in linguistics GrETEL • Greedy Extraction of Trees for Empirical Linguistics • Query engine for treebanks • Nederbooms project Exploitation of Dutch treebanks for research in linguistics • Goals o o o User-friendly tools Access to large data files Fast and accurate GrETEL • Greedy Extraction of Trees for Empirical Linguistics • Query engine for treebanks • Treebank = syntactically annotated corpus e.g. Penn Treebank (English), TüBa (German), LASSY, CGN (Dutch) TREEBANKS CGN core corpus LASSY small Spoken Dutch Written Dutch Stylistic & regional differences Stylistic differences conversations vs read texts NL vs VL Wikipedia vs legal texts ± 1M tokens ± 1M tokens 130k sentences 65k sentences Manually corrected Manually corrected GrETEL • Greedy Extraction of Trees for Empirical Linguistics • Query engine for treebanks • Treebank = syntactically annotated corpus e.g. Penn Treebank (English), TüBa (German), LASSY, CGN (Dutch) • Parser e.g. Alpino (Van Noord 2006) ALPINO PARSER Dit is een zin. >> ALPINO parser >> “This is a sentence.” ALPINO PARSER Dit is een zin. >> ALPINO parser >> “This is a sentence.” XML trees Query language: XPath XPATH //node[@cat="smain" and node[@rel="su" and @pt="vnw" and @lemma="dit"] and node[@rel="hd" and @pt="ww" and @lemma="zijn"] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid" and @lemma="een"] and node[@rel="hd" and @pt="n" and @lemma="zin"]]] XPATH //node[@cat="smain" and node[@rel="su" and @pt="vnw" and @lemma="dit"] and node[@rel="hd" and @pt="ww" and @lemma="zijn"] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid" and @lemma="een"] and node[@rel="hd" and @pt="n" and @lemma="zin"]]] XPATH //node[@cat="smain" and node[@rel="su" and @pt="vnw" and @lemma="dit"] and node[@rel="hd" and @pt="ww" and @lemma="zijn"] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid" and @lemma="een"] and node[@rel="hd" and @pt="n" and @lemma="zin"]]] XPATH GrETEL • Greedy Extraction of Trees for Empirical Linguistics • Query treebanks by example GrETEL • Greedy Extraction of Trees for Empirical Linguistics • Query treebanks by example • First version => only for LASSY treebank • New release => GrETEL for CGN treebank => update based on user reviews the user • Example sentence GrETEL • Parser (Alpino) • Indicate relevant items of the sentence • (Adapt XPath) • Select treebank • Inspect results • Automatically generate XPath expression • Present results OUTLINE • GrETEL in a nutshell • GrETEL demo o o Case study Search options • Conclusions and future work CASE STUDY • Verbs with fixed preposition o E.g. Hij keek met een bang hartje naar de heks. ‘he was looking at the witch with a heavy heart .’ o VERB + (…+) PREP LASSY: • Xpath query //node[@cat="smain" and node[@rel="hd" and @pos="verb" and @root="kijk"] and node[@rel="ld" and @cat="pp" and node[@rel="hd" and @pos="prep" and @root="naar"]]] CASE STUDY • Verbs with fixed preposition o E.g. Hij keek naar de heks. ‘he was looking at the witch .’ • Discontinuous constructions! o o E.g. Hij keek met een bang hartje naar de heks. ‘he was looking at the witch with a heavy heart .’ VERB + (…+) PREP GrETEL ONLINE INPUT ANNOTATION MATRIX ANNOTATION GUIDELINES XPATH GENERATOR Other treebank, other format … Hij keek met een bank hartje naar de heks • CGN /node[@cat="smain" and node[@rel="hd" and @pt="ww" and @lemma="kijken"] and node[@rel="ld" and @cat="pp" and node[@rel="hd" and @pt="vz" and @lemma="naar"]]] • LASSY //node[@cat="smain" and node[@rel="hd" and @pos="verb" and @root="kijk"] and node[@rel="ld" and @cat="pp" and node[@rel="hd" and @pos="prep" and @root="naar"]]] Other treebank, other format … Hij keek met een bang hartje naar de heks • CGN /node[@cat="smain" and node[@rel="hd" and @pt="ww" and @lemma="kijken"] and node[@rel="ld" and @cat="pp" and node[@rel="hd" and @pt="vz" and @lemma="naar"]]] • LASSY //node[@cat="smain" and node[@rel="hd" and @pos="verb" and @root="kijk"] and node[@rel="ld" and @cat="pp" and node[@rel="hd" and @pos="prep" and @root="naar"]]] TREEBANK SELECTION RESULTS Verb plus fixed preposition o E.g. Hij keek naar de heks. ‘A number of trees fell down.’ o VERB + (…+) PREP 4004 matches in 3881 sentences RESULTS: table RESULTS: data RESULTS: trees OUTLINE • GrETEL in a nutshell • GrETEL demo o o Case study Search options • Conclusions and future work SEARCH OPTIONS Below annotation matrix SEARCH OPTIONS Green versus red word order in Dutch o green: past participle – auxiliary De NAVO stelt dat ze er alles aan gedaan heeft o red: auxiliary – past participle De NAVO stelt dat ze er alles aan heeft gedaan “The NATO claim that they have done everything in their power” (deredactie.be) SEARCH OPTIONS SEARCH OPTIONS SEARCH OPTIONS SEARCH OPTIONS SEARCH OPTIONS OUTLINE • GrETEL in a nutshell • GrETEL demo o o Case study Search options • Conclusions and future work CONCLUSIONS • GrETEL: search engine for Dutch treebanks • Input = natural language example • Output = sample of similar sentences • Syntactic concordancer • Available online (via Mozilla Firefox) • No installation required FUTURE WORK • GrETEL 2.0 o o Include SoNaR corpus (ca 500M tokens) More generic • AfriBooms o o GrETEL for Afrikaans Include other treebank formats CASE STUDY • Collective noun constructions o E.g. Een aantal bomen zijn omgevallen. ‘A number of trees fell down.’ o DET + NOUN + PLURAL NOUN • Discontinuous constructions! o E.g. Een groot aantal oude bomen zijn omgevallen. ‘A large number of old trees fell down.’ Try it yourself at http://nederbooms.ccl.kuleuven.be/eng/gretel Thanks for your attention! Waaraan vs Waar … aan Waar denk je aan ? //node[@cat="top" and node[@rel="--" and @cat="whq" and node[@rel="whd" and @pos="adv"] and node[@rel="body" and @cat="sv1" and node[@rel="pc" and @cat="pp" and node[@rel="hd" and @pos="prep"]]]] and node[@rel="--" and @pos="punct"]] (4 results) • Waar bemoei je je mee? • Wanneer gaat een koortsstuip over in epilepsie ? Waaraan denk je ? //node[@cat="top" and node[@rel="--" and @cat="whq" and node[@rel="whd" and @pos="pp"]] and node[@rel="--" and @pos="punct"]] (38 results) • Waarom werken we ? • Waartoe verbind ik mij als ouder door dit formulier in te vullen ? • Vanwaar die gulle hand van een Turkse overheid die in de schulden zwemt ? Hij klom de boom in //node[@cat="top" and node[@rel="--" and @cat="smain" and node[@rel="hd" and @pos="verb"] and node[@rel="ld" and @cat="np" and node[@rel="det" and @pos="det"] and node[@rel="hd" and @pos="noun"]] and node[@rel="svp" and @pos="part"]] and node[@rel="--" and @pos="punct"]] (37 results) • Door haar winst komt Clijsters de top-20 binnen . • In feite ging minder dan de helft van Dorsets de rivier over . • Nederland gaat de bezettingstijd in .