UAM CorpusTool: An Overview
Debopam Das
Discourse Research Group
Department of Linguistics
Simon Fraser University
Feb 5, 2014

Outline
- UAM CorpusTool (O’Donnell, 2008)
  - Tool description
  - A short tutorial
- Annotating signals of coherence relations by UAM CorpusTool

UAM CorpusTool
- Created by Mick O’Donnell in 2008
- Replaces the earlier software Systemic Coder, which allowed coding of single documents at a single layer
- Available at http://www.wagsoft.com/CorpusTool/
- Runs on Windows and Mac OS
- “… primarily aimed at the linguist or computational linguist who does not program, and would rather spend their time annotating text than learning how to use the system.” (O’Donnell, 2008: 13)

UAM CorpusTool
- Annotate documents
  - Text type, writer characteristics, register, etc.
- Annotate segments
  - Tagging sections of a text by function (abstract, introduction, body, conclusion)
  - Tagging sentences (active/passive; simple/complex) or clauses (relative/imperative/non-finite)
  - Semantic or pragmatic annotation (synonymy/antonymy; speech acts)
  - Tagging POS (noun, verb, adjective)
- Automatic grammar analysis (English only) using the Stanford parser
- Rhetorical structure annotation

Annotation in UAM CorpusTool: Main Steps
- Start a new project
- Add (an) annotation layer(s)
  - You can use some pre-built annotation schemes or design your own
- Add file
  - Import .txt files and incorporate them
- Annotate

Annotation in UAM CorpusTool: Main Window
[Screenshot]

Annotation in UAM CorpusTool: Annotation Scheme
[Screenshots]

Annotation in UAM CorpusTool: Document Coding
[Screenshot]

Annotation in UAM CorpusTool: Segment Coding
[Screenshot]

Other Components
- Search
- Autocode
- Statistics
- Explore
- Options
- Help

Annotating Signals of Coherence Relations
- Goal
  - Annotate signals of coherence relations
- Signals of coherence relations
  - E.g., John is tall, but Mary is short.
  - One straightforward signal: the discourse marker ‘but’
  - Also, there are two more signals
    - Antonyms (tall ~ short)
    - Parallel syntactic constructions (subj – copula – adj)

Annotating Signals of Coherence Relations
- Annotate the RST Discourse Treebank (Carlson et al., 2002)
  - Contains 385 documents from Wall Street Journal articles
  - Texts in those articles are already annotated for rhetorical (coherence) relations
  - Approx. 22,000 discourse units and 17,000 relations in total

Annotating Signals of Coherence Relations
- Requirements from an annotation tool
  - Importability: relevant data to be imported into the tool
  - Annotation Scheme: support for three-level hierarchical taxonomy
  - Multiple Annotations: two or more tags for a single element
  - Convertibility: XML output
  - Customizability: easy access to the annotation scheme for editing
  - Simplicity: no advanced computational knowledge; graphical interface

Signalling Annotation by UAM CorpusTool
- Problem with importing data
  - UAM CorpusTool supports RST annotation and can directly import RST files
  - However, it cannot provide layered annotation on top of the RST-level structure
- Solution to the problem
  - Convert RST base files from LISP to text format
  - Import the converted files
  - This retains discourse structures and all relational information

Signalling Annotation by UAM CorpusTool
- How did we do the rest?

Signalling Annotation by UAM CorpusTool: Annotation Scheme
[Screenshot]

Signalling Annotation by UAM CorpusTool: Annotation Window
[Screenshot]

References
- Carlson, L., Marcu, D., & Okurowski, M. E. (2002). RST Discourse Treebank, LDC2002T07 [Corpus]. Philadelphia, PA: Linguistic Data Consortium.
- O'Donnell, M. (2008). The UAM CorpusTool: Software for corpus annotation and exploration. Paper presented at the XXVI Congreso de AESLA, Almeria, Spain.

Thank You!
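
Backup: sketch of the LISP-to-text conversion

The slides state that the RST base files were converted from LISP to text format but do not show how. The Python sketch below is one way such a step might look, not the procedure actually used. It assumes a bracketed input of the form `( Role (span 1 2) ... )` with leaf text wrapped in `_!` delimiters; the function names (`tokenize`, `parse`, `to_lines`, `convert`), the sample input, and the indented output layout are all hypothetical and are not part of UAM CorpusTool.

```python
import re

# Assumption: leaf text is delimited by "_!" on both sides, e.g. (text _!John is tall,_!)
TOKEN = re.compile(r"_!(.*?)_!|([()])|([^\s()]+)", re.DOTALL)

# Assumption: these bracket groups are attributes of a node, not child nodes.
ATTRS = {"span", "leaf", "rel2par", "text"}


def tokenize(src):
    """Yield (kind, value) pairs: parentheses, atoms, and delimited text."""
    for match in TOKEN.finditer(src):
        text, paren, atom = match.groups()
        if text is not None:
            yield ("text", text)
        elif paren:
            yield (paren, paren)
        else:
            yield ("atom", atom)


def parse(tokens):
    """Turn the token stream into nested lists, one list per bracket group."""
    stack = [[]]
    for kind, value in tokens:
        if kind == "(":
            stack.append([])
        elif kind == ")":
            group = stack.pop()
            stack[-1].append(group)
        else:
            stack[-1].append(value)
    return stack[0]


def is_attr(item):
    """True for groups such as ['rel2par', 'Contrast']."""
    return bool(item) and isinstance(item[0], str) and item[0] in ATTRS


def to_lines(node, depth=0):
    """Render one node and its children as indented plain-text lines."""
    groups = [c for c in node[1:] if isinstance(c, list)]
    attrs = {g[0]: g[1:] for g in groups if is_attr(g)}
    children = [g for g in groups if g and not is_attr(g)]
    parts = [node[0]]
    if "span" in attrs:
        parts.append("span=" + "-".join(attrs["span"]))
    if "leaf" in attrs:
        parts.append("unit=" + attrs["leaf"][0])
    if "rel2par" in attrs:
        parts.append("relation=" + attrs["rel2par"][0])
    if "text" in attrs:
        parts.append("text=" + " ".join(attrs["text"]))
    lines = ["  " * depth + " | ".join(parts)]
    for child in children:
        lines.extend(to_lines(child, depth + 1))
    return lines


def convert(src):
    """Convert one bracketed tree into an indented plain-text outline."""
    tree = parse(tokenize(src))
    return "\n".join(to_lines(tree[0]))


# Hypothetical sample built from the example sentence in the slides.
SAMPLE = """
( Root (span 1 2)
  ( Nucleus (leaf 1) (rel2par Contrast) (text _!John is tall,_!) )
  ( Nucleus (leaf 2) (rel2par Contrast) (text _!but Mary is short._!) )
)
"""

if __name__ == "__main__":
    print(convert(SAMPLE))
```

Each output line keeps the node role, its span or unit number, and its relation label next to the text, so the discourse structure and relational information stay readable after a plain .txt import. Indentation depth stands in for the tree nesting.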