
PDTSE Corpus Enrichment
Angelina Ivanova (angelii@ifi.uio.no)
Language Technology Group, Department of Informatics, University of Oslo
Prague Dependency Treebank of Spoken English
188 946 tokens, 15 853 sentences
The Prague Dependency Treebank of Spoken English is a collection of English spoken dialogs
about personal photograph collections.
The original corpus consisted of three interlinked representations:
◮ ASR output aligned to audio;
◮ manual transcription;
◮ reconstructed text.
The data is annotated in formats based on the Prague Markup Language (PML), which is the
backbone of a family of XML schemas for rich linguistic annotation of texts, such as
morphological tagging and dependency trees.
The corpus can serve as training and testing material for machine learning experiments in both
intelligent editing and dialog language understanding.
Morphological layer enhancement
We converted the morphological layer of the corpus into a treebank in the standard Penn
Treebank bracketing style and enhanced it with:
◮ part of speech tags;
◮ named entity labels;
◮ WordNet hypernyms;
◮ links to the lower layers of annotation.
The pre-processed corpus data was passed as input to state-of-the-art NLP tools, such as the
Stanford parser and named entity recognizer and the WordNet API, to obtain the additional
analyses. These annotations were added in a way that preserves the original PML format.
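The enrichment step can be illustrated with a minimal Python sketch. It reads a sentence in Penn Treebank bracketing, collects the terminals with their part-of-speech tags, and attaches hypernyms to each token. This is not the pipeline used for the corpus: the example sentence, the function names, and the `TOY_HYPERNYMS` table (a hand-written stand-in for WordNet lookups) are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for WordNet hypernym lookups (hand-written, not WordNet output).
TOY_HYPERNYMS = {
    "birthday": ["anniversary", "day"],
}

def tokenize(bracketed):
    """Split a Penn Treebank style string into parentheses and symbols."""
    return re.findall(r"\(|\)|[^\s()]+", bracketed)

def parse(tokens, pos=0):
    """Parse the constituent that opens at tokens[pos].

    Returns (node, next_position). A preterminal such as (NN birthday)
    becomes a terminal node with its form, POS tag and hypernyms.
    """
    assert tokens[pos] == "("
    label = tokens[pos + 1]
    pos += 2
    if tokens[pos] not in ("(", ")"):
        # Preterminal: the label is the POS tag, the next token is the word form.
        form = tokens[pos]
        assert tokens[pos + 1] == ")"
        node = {
            "form": form,
            "pos": label,
            "hypernyms": TOY_HYPERNYMS.get(form.lower(), []),
        }
        return node, pos + 2
    children = []
    while tokens[pos] == "(":
        child, pos = parse(tokens, pos)
        children.append(child)
    assert tokens[pos] == ")"
    return {"cat": label, "children": children}, pos + 1

def terminals(node):
    """Yield terminal nodes in sentence order."""
    if "children" not in node:
        yield node
    else:
        for child in node["children"]:
            yield from terminals(child)

sentence = "(S (NP (PRP$ my) (NN birthday)) (VP (VBD was) (ADJP (JJ fun))))"
tree, _ = parse(tokenize(sentence))
enriched = [(t["form"], t["pos"], t["hypernyms"]) for t in terminals(tree)]
```

In this sketch non-terminal nodes carry a `cat` label and terminal nodes carry `form`, `pos` and `hypernyms`, mirroring the member names of the modified schema; named entity labels would be attached in the same way.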
We designed a new XML schema for the modified topmost layer of the corpus so that it can be
displayed properly in editors for linguistic corpus processing, in particular the toolkit
TrEd, a programmable graphical tree editor and browser for PML-compliant corpora.
http://ufal.mff.cuni.cz/pdtsl/
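As a rough illustration of what an enriched terminal node carries, the Python sketch below writes one node as XML with the standard library. The member names are taken from the schema excerpt in this poster; the element layout, the function name `node_to_element`, and every value (including the key formats) are assumptions, so this is not the actual PML instance serialization.

```python
import xml.etree.ElementTree as ET

# Member names copied from the schema excerpt; the output order is an assumption.
TERMINAL_MEMBERS = ("form", "pos", "ne", "hypernyms", "wkey", "mkey", "order")

def node_to_element(node):
    """Build a <node> element with one child element per annotated member."""
    element = ET.Element("node")
    for name in TERMINAL_MEMBERS:
        if name in node:
            ET.SubElement(element, name).text = str(node[name])
    return element

# All values are invented for illustration, including the key formats.
example = {
    "form": "birthday",
    "pos": "NN",
    "ne": "O",
    "hypernyms": "anniversary",
    "wkey": "w-example-1",
    "mkey": "m-example-1",
    "order": 2,
}
element = node_to_element(example)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```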
Browsing, editing and querying possibilities
The main motivation for enriching the corpus is to prepare it for information extraction tasks and linguistic research.
The corpus is in PML format, which allows it to be browsed and edited in the PML tree editor TrEd and queried with the powerful search engine
PML Tree Query (PML-TQ).
The query below means "find the tokens that are nouns and have the hypernym 'anniversary'" in the PML-TQ environment. The system
outputs all the sentences that contain tokens matching the query (such as "birthday").
terminal [ pos ~ "NN*", hypernyms ~ "anniversary" ];
Excerpt from modified PML schema
<structure name="node">
  <member name="cat" required="0">
    <!-- syntactic category -->
    <alt type="cat.type"/>
  </member>
  <member name="form">
    <cdata format="any"/>
  </member>
  <member name="pos" type="postag.type"/>
  <member name="ne" type="ne.type"/>
  <member name="hypernyms">
    <cdata format="any"/>
  </member>
  <member name="wkey">
    <cdata format="any"/>
  </member>
  <member name="mkey">
    <cdata format="any"/>
  </member>
  <member name="order" role="#ORDER">
    <cdata format="nonNegativeInteger"/>
  </member>
  <member name="children" required="0" role="#CHILDNODES" type="node children.type"/>
</structure>
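The intent of the "nouns with the hypernym anniversary" query can be mimicked in plain Python, as sketched below. This is only an analogue and does not reproduce PML-TQ or its matching semantics: the node values are invented, the function `query_terminals` is hypothetical, and storing hypernyms as a whitespace-separated string is an assumption.

```python
import re

# Toy terminal nodes; member names follow the schema excerpt, values are invented.
nodes = [
    {"form": "birthday", "pos": "NN", "ne": "O", "hypernyms": "anniversary day"},
    {"form": "photos", "pos": "NNS", "ne": "O", "hypernyms": "representation"},
    {"form": "celebrated", "pos": "VBD", "ne": "O", "hypernyms": ""},
]

def query_terminals(candidates, pos_pattern, hypernym_pattern):
    """Return the nodes whose POS tag and hypernym string both match."""
    pos_re = re.compile(pos_pattern)
    hypernym_re = re.compile(hypernym_pattern)
    return [
        node
        for node in candidates
        if pos_re.search(node["pos"]) and hypernym_re.search(node["hypernyms"])
    ]

# Noun tags start with "NN"; the hypernym must occur as a whole word.
hits = query_terminals(nodes, r"^NN", r"\banniversary\b")
print([node["form"] for node in hits])
```

The plural noun is excluded because its hypernym string does not contain the target word, and the verb is excluded by the tag filter.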
Concluding remark
A new layer of annotation has been added to multi-layered corpus data in a complex format by combining several tools and merging their
partial outputs. The augmented corpus contains interesting strata of linguistic knowledge, is compatible with a specialized open-source
query engine and is suitable for extensive information extraction.
Acknowledgment
I would like to thank Dr. Silvie Cinková (Institute of Formal and Applied Linguistics, Charles University in Prague) for supervising the project and for providing me with access to the data.
References
Hajič Jan, Cinková Silvie, Mičková Petra, Pajas Petr, Peterek Nino, Spousta Miroslav. Prague Dependency Treebank of Spoken Language - English. Software or data, Institute of Formal and Applied
Linguistics, Charles University in Prague, Malostranské nám. 25, 118 00 Praha 1, Jan 2009.
Hajič Jan, Cinková Silvie, Mikulová Marie, Pajas Petr, Ptáček Jan, Toman Josef, Urešová Zdeňka. PDTSL: An Annotated Resource For Speech Reconstruction. In Proceedings of the 2008 IEEE
Workshop on Spoken Language Technology, Copyright © IEEE, Goa, India, ISBN 978-1-4244-3472-5, 2008.
Pajas Petr, Štěpánek Jan. System for querying syntactically annotated corpora. In Proceedings of the ACL-IJCNLP 2009 Software Demonstrations (ACLDemos '09). Association for Computational
Linguistics, Stroudsburg, PA, USA, 33-36, 2009.