TABU Dag, Groningen

advertisement
Finding your way through
the woods with GrETEL
Liesbeth Augustinus
Vincent Vandeghinste
Ineke Schuurman
Frank Van Eynde
TABU-dag - June 14, 2013
GrETEL
• Greedy Extraction of Trees for Empirical Linguistics
• Query engine for treebanks
• Nederbooms project
Exploitation of Dutch treebanks
for research in linguistics
GrETEL
• Greedy Extraction of Trees for Empirical Linguistics
• Query engine for treebanks
• Nederbooms project
Exploitation of Dutch treebanks
for research in linguistics
• Goals
o
o
o
User-friendly tools
Access to large data files
Fast and accurate
GrETEL
• Greedy Extraction of Trees for Empirical Linguistics
• Query engine for treebanks
• Treebank = syntactically annotated corpus
e.g. Penn Treebank (English), TüBa (German),
LASSY, CGN (Dutch)
TREEBANKS
CGN core corpus
LASSY small
Spoken Dutch
Written Dutch
Stylistic & regional
differences
Stylistic differences
conversations vs read texts
NL vs VL
Wikipedia vs legal texts
± 1M tokens
± 1M tokens
130k sentences
65k sentences
Manually corrected
Manually corrected
GrETEL
• Greedy Extraction of Trees for Empirical Linguistics
• Query engine for treebanks
• Treebank = syntactically annotated corpus
e.g. Penn Treebank (English), TüBa (German),
LASSY, CGN (Dutch)
• Parser
e.g. Alpino (Van Noord 2006)
ALPINO PARSER
Dit is een zin. >> ALPINO parser >>
“This is a sentence.”
ALPINO PARSER
Dit is een zin. >> ALPINO parser >>
“This is a sentence.”
XML trees
Query language: XPath
XPATH
//node[@cat="smain" and
node[@rel="su" and
@pt="vnw" and @lemma="dit"]
and node[@rel="hd" and
@pt="ww" and @lemma="zijn"]
and node[@rel="predc" and
@cat="np" and
node[@rel="det" and
@pt="lid" and @lemma="een"]
and node[@rel="hd" and
@pt="n" and @lemma="zin"]]]
XPATH
//node[@cat="smain" and
node[@rel="su" and
@pt="vnw" and @lemma="dit"]
and node[@rel="hd" and
@pt="ww" and @lemma="zijn"]
and node[@rel="predc" and
@cat="np" and
node[@rel="det" and
@pt="lid" and @lemma="een"]
and node[@rel="hd" and
@pt="n" and @lemma="zin"]]]
XPATH
//node[@cat="smain" and
node[@rel="su" and
@pt="vnw" and @lemma="dit"]
and node[@rel="hd" and
@pt="ww" and @lemma="zijn"]
and node[@rel="predc" and
@cat="np" and
node[@rel="det" and
@pt="lid" and @lemma="een"]
and node[@rel="hd" and
@pt="n" and @lemma="zin"]]]
XPATH
GrETEL
• Greedy Extraction of Trees for Empirical Linguistics
• Query treebanks by example
GrETEL
• Greedy Extraction of Trees for Empirical Linguistics
• Query treebanks by example
• First version
=> only for LASSY treebank
• New release
=> GrETEL for CGN treebank
=> update based on user reviews
the user
• Example sentence
GrETEL
• Parser (Alpino)
• Indicate relevant items
of the sentence
• (Adapt XPath)
• Select treebank
• Inspect results
• Automatically generate
XPath expression
• Present results
OUTLINE
• GrETEL in a nutshell
• GrETEL demo
o
o
Case study
Search options
• Conclusions and future work
CASE STUDY
• Verbs with fixed preposition
o
E.g. Hij keek met een bang hartje naar de heks.
‘he was looking at the witch with a heavy heart .’
o
VERB + (…+) PREP
LASSY:
• Xpath query
//node[@cat="smain" and node[@rel="hd" and @pos="verb" and
@root="kijk"] and node[@rel="ld" and @cat="pp" and
node[@rel="hd" and @pos="prep" and @root="naar"]]]
CASE STUDY
• Verbs with fixed preposition
o
E.g. Hij keek naar de heks.
‘he was looking at the witch .’
• Discontinuous constructions!
o
o
E.g. Hij keek met een bang hartje naar de heks.
‘he was looking at the witch with a heavy heart
.’
VERB + (…+) PREP
GrETEL ONLINE
INPUT
ANNOTATION MATRIX
ANNOTATION GUIDELINES
XPATH GENERATOR
Other treebank, other format …
Hij keek met een bank hartje naar de heks
• CGN
/node[@cat="smain" and node[@rel="hd" and @pt="ww" and
@lemma="kijken"] and node[@rel="ld" and @cat="pp" and
node[@rel="hd" and @pt="vz" and @lemma="naar"]]]
• LASSY
//node[@cat="smain" and node[@rel="hd" and @pos="verb"
and @root="kijk"] and node[@rel="ld" and @cat="pp" and
node[@rel="hd" and @pos="prep" and @root="naar"]]]
Other treebank, other format …
Hij keek met een bang hartje naar de heks
• CGN
/node[@cat="smain" and node[@rel="hd" and @pt="ww" and
@lemma="kijken"] and node[@rel="ld" and @cat="pp" and
node[@rel="hd" and @pt="vz" and @lemma="naar"]]]
• LASSY
//node[@cat="smain" and node[@rel="hd" and @pos="verb"
and @root="kijk"] and node[@rel="ld" and @cat="pp" and
node[@rel="hd" and @pos="prep" and @root="naar"]]]
TREEBANK SELECTION
RESULTS
Verb plus fixed preposition
o E.g. Hij keek naar de heks.
‘A number of trees fell down.’
o
VERB + (…+) PREP
 4004 matches in 3881 sentences
RESULTS: table
RESULTS: data
RESULTS: trees
OUTLINE
• GrETEL in a nutshell
• GrETEL demo
o
o
Case study
Search options
• Conclusions and future work
SEARCH OPTIONS
 Below annotation matrix
SEARCH OPTIONS
Green versus red word order in Dutch
o
green: past participle – auxiliary
De NAVO stelt dat ze er alles aan gedaan heeft
o
red: auxiliary – past participle
De NAVO stelt dat ze er alles aan heeft gedaan
“The NATO claim that they have done everything in their power”
(deredactie.be)
SEARCH OPTIONS
SEARCH OPTIONS
SEARCH OPTIONS
SEARCH OPTIONS
SEARCH OPTIONS
OUTLINE
• GrETEL in a nutshell
• GrETEL demo
o
o
Case study
Search options
• Conclusions and future work
CONCLUSIONS
• GrETEL: search engine for Dutch treebanks
• Input = natural language example
• Output = sample of similar sentences
• Syntactic concordancer
• Available online (via Mozilla Firefox)
• No installation required
FUTURE WORK
• GrETEL 2.0
o
o
Include SoNaR corpus (ca 500M tokens)
More generic
• AfriBooms
o
o
GrETEL for Afrikaans
Include other treebank formats
CASE STUDY
• Collective noun constructions
o
E.g. Een aantal bomen zijn omgevallen.
‘A number of trees fell down.’
o
DET + NOUN + PLURAL NOUN
• Discontinuous constructions!
o
E.g. Een groot aantal oude bomen zijn omgevallen.
‘A large number of old trees fell down.’
Try it yourself at
http://nederbooms.ccl.kuleuven.be/eng/gretel
Thanks for your attention!
Waaraan vs Waar … aan
Waar denk je aan ?
//node[@cat="top" and node[@rel="--" and @cat="whq" and
node[@rel="whd" and @pos="adv"] and node[@rel="body"
and @cat="sv1" and node[@rel="pc" and @cat="pp" and
node[@rel="hd" and @pos="prep"]]]] and node[@rel="--"
and @pos="punct"]]
(4 results)
• Waar bemoei je je mee?
• Wanneer gaat een koortsstuip over in epilepsie ?
Waaraan denk je ?
//node[@cat="top" and node[@rel="--" and @cat="whq" and
node[@rel="whd" and @pos="pp"]] and node[@rel="--"
and @pos="punct"]]
(38 results)
• Waarom werken we ?
• Waartoe verbind ik mij als ouder door dit formulier in te
vullen ?
• Vanwaar die gulle hand van een Turkse overheid die in de
schulden zwemt ?
Hij klom de boom in
//node[@cat="top" and node[@rel="--" and @cat="smain"
and node[@rel="hd" and @pos="verb"] and
node[@rel="ld" and @cat="np" and node[@rel="det" and
@pos="det"] and node[@rel="hd" and @pos="noun"]] and
node[@rel="svp" and @pos="part"]] and node[@rel="--"
and @pos="punct"]]
(37 results)
• Door haar winst komt Clijsters de top-20 binnen .
• In feite ging minder dan de helft van Dorsets de rivier
over .
• Nederland gaat de bezettingstijd in .
Related documents
Download