PDTSE Corpus Enrichment

Angelina Ivanova (angelii@ifi.uio.no)
Language Technology Group, Department of Informatics, University of Oslo

Prague Dependency Treebank of Spoken English

188 946 tokens, 15 853 sentences

The Prague Dependency Treebank of Spoken English is a collection of spoken English dialogs about personal photograph collections. The original corpus consisted of three interlinked representations:
◮ ASR output aligned to audio;
◮ manual transcription;
◮ reconstructed text.

The data is annotated in formats based on the Prague Markup Language (PML), the backbone of a family of XML schemas for rich linguistic annotation of texts, such as morphological tagging and dependency trees. The corpus can serve as training and testing material for machine learning experiments in both intelligent editing and dialog language understanding.

Morphological layer enhancement

We converted the morphological layer of the corpus into a treebank in the standard Penn Treebank bracketing style and enhanced it with:
◮ part of speech tags;
◮ named entity labels;
◮ WordNet hypernyms;
◮ links to the lower layers of annotation.

The pre-processed corpus data was fed into state-of-the-art NLP tools, such as the Stanford parser and named entity recognizer and the WordNet API, to obtain the additional analyses. These annotations were added in such a way as to preserve the original PML format. We designed a new XML schema for the modified topmost layer of the corpus (an excerpt from the modified PML schema is reproduced further below) so that the layer can be displayed appropriately in editors for linguistic corpus processing, in particular the toolkit TrEd, a programmable graphical tree editor and browser for PML-compliant corpora.

http://ufal.mff.cuni.cz/pdtsl/

Browsing, editing and querying possibilities

The main motivation for corpus enrichment is to prepare the corpus for information extraction tasks and linguistic research.
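As a rough illustration of the kind of extraction this enrichment is meant to support, the sketch below filters tokens by part-of-speech tag and WordNet hypernym. The token records, their dict layout, the example values and the helper name find_tokens are hypothetical and are not taken from the corpus or its tools; only the annotation types (form, part-of-speech tag, named entity label, hypernyms) come from the description above.

```python
from fnmatch import fnmatchcase

# Hypothetical enriched tokens. The field names mirror the annotation types
# listed on the poster; the values and the dict layout are assumptions made
# purely for illustration.
TOKENS = [
    {"form": "birthday", "pos": "NN", "ne": "O", "hypernyms": ["anniversary"]},
    {"form": "Prague", "pos": "NNP", "ne": "LOCATION", "hypernyms": ["city"]},
    {"form": "took", "pos": "VBD", "ne": "O", "hypernyms": []},
]


def find_tokens(tokens, pos_pattern, hypernym):
    """Return the forms of tokens whose POS tag matches a glob pattern
    (for example "NN*") and whose hypernym list contains `hypernym`."""
    return [
        token["form"]
        for token in tokens
        if fnmatchcase(token["pos"], pos_pattern)
        and hypernym in token["hypernyms"]
    ]


# Nouns that have "anniversary" among their hypernyms.
print(find_tokens(TOKENS, "NN*", "anniversary"))  # prints ['birthday']
```

A flat list of records is enough to show the idea; in the corpus itself these annotations sit on tree nodes, so a real query has to traverse the tree structure.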
The corpus is in PML format, which enables browsing and editing in the PML tree editor TrEd and querying with the powerful search engine PML Tree Query (PML-TQ). The figure below shows the query "find the tokens that are nouns and have the hypernym 'anniversary'" in the PML-TQ environment. The system outputs all the sentences that contain tokens consistent with the query (such as "birthday").

terminal [ pos ~ "NN*", hypernyms ~ "anniversary" ];

<structure name="node">
  <member name="cat" required="0">
    <!-- syntactic category -->
    <alt type="cat.type"/>
  </member>
  <member name="form"><cdata format="any"/></member>
  <member name="pos" type="postag.type"/>
  <member name="ne" type="ne.type"/>
  <member name="hypernyms"><cdata format="any"/></member>
  <member name="wkey"><cdata format="any"/></member>
  <member name="mkey"><cdata format="any"/></member>
  <member name="order" role="#ORDER"><cdata format="nonNegativeInteger"/></member>
  <member name="children" required="0" role="#CHILDNODES" type="node children.type"/>
</structure>

Concluding remark

A new layer of annotation has been added to multi-layered corpus data in a complex format by combining several tools and merging their partial outputs. The augmented corpus contains interesting strata of linguistic knowledge, is compatible with a specialized open-source query engine and is suitable for extensive information extraction.

Acknowledgment

I would like to thank Dr. Silvie Cinková (Institute of Formal and Applied Linguistics, Charles University in Prague) for supervising the project and providing me with access to the data.

References

Hajič Jan, Cinková Silvie, Mičková Petra, Pajas Petr, Peterek Nino, Spousta Miroslav. Prague Dependency Treebank of Spoken Language - English. Software or data, Institute of Formal and Applied Linguistics, Charles University in Prague, Malostranské nám. 25, 118 00 Praha 1, Jan 2009.

Hajič Jan, Cinková Silvie, Mikulová Marie, Pajas Petr, Ptáček Jan, Toman Josef, Urešová Zdeňka.
PDTSL: An Annotated Resource For Speech Reconstruction. In Proceedings of the 2008 IEEE Workshop on Spoken Language Technology, Copyright © IEEE, Goa, India, ISBN 978-1-4244-3472-5, 2008.

Petr Pajas, Jan Štěpánek. System for querying syntactically annotated corpora. In Proceedings of the ACL-IJCNLP 2009 Software Demonstrations (ACLDemos '09). Association for Computational Linguistics, Stroudsburg, PA, USA, 33-36, 2009.