The UAM CorpusTool - SFU Blogs

UAM CorpusTool: An Overview
Debopam Das
Discourse Research Group
Department of Linguistics
Simon Fraser University
Feb 5, 2014
Outline

UAM CorpusTool (O’Donnell, 2008)
  Tool description
  A short tutorial
Annotating signals of coherence relations with UAM CorpusTool

UAM CorpusTool

Created by Mick O’Donnell in 2008
Replaces the earlier Systemic Coder, which allowed coding of single documents at a single layer
Available at http://www.wagsoft.com/CorpusTool/
Runs on Windows and Mac OS
“… primarily aimed at the linguist or computational linguist who does not program, and would rather spend their time annotating text than learning how to use the system.” (O’Donnell, 2008: 13)

UAM CorpusTool

Annotate documents
  Text type, writer characteristics, register, etc.
Annotate segments
  Tagging sections of a text by function (abstract, introduction, body, conclusion)
  Tagging sentences (active/passive; simple/complex) or clauses (relative/imperative/non-finite)
  Semantic or pragmatic annotation (synonymy/antonymy; speech acts)
  Tagging POS (noun, verb, adjective)
Automatic grammar analysis (English only) using the Stanford parser
Rhetorical structure annotation

Annotation in UAM CorpusTool

Main Steps
  Start a new project
  Add one or more annotation layers
    You can use pre-built annotation schemes or design your own
  Add files
    Import .txt files and incorporate them
  Annotate

Annotation in UAM CorpusTool

[Screenshot: Main Window]
[Screenshot: Annotation Scheme]
[Screenshot: Document Coding]
[Screenshot: Segment Coding]

Other Components






Search
Autocode
Statistics
Explore
Options
Help
Annotating Signals of Coherence Relations

Goal
  Annotate signals of coherence relations
Signals of coherence relations
  E.g., John is tall, but Mary is short.
  One straightforward signal: the discourse marker ‘but’
  Also, there are two more signals
    Antonyms (tall ~ short)
    Parallel syntactic constructions (subj – copula – adj)
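
The example above can be made concrete with a small sketch. It is only an illustration, not the procedure used in the project: the two lexicons, the `Signal` class and the `find_signals` function are all hypothetical names invented here. It finds the discourse marker and the antonym pair by lookup; detecting the parallel syntactic construction would need a parser and is left out.

```python
import re
from dataclasses import dataclass

# Toy lexicons, invented for this illustration only.
DISCOURSE_MARKERS = {"but", "because", "although", "however"}
ANTONYM_PAIRS = {frozenset({"tall", "short"}), frozenset({"hot", "cold"})}


@dataclass(frozen=True)
class Signal:
    kind: str        # e.g. "discourse marker" or "antonymy"
    evidence: tuple  # the word(s) that realise the signal


def find_signals(sentence: str) -> list:
    """Return lexicon-based signals found in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    # Every word listed as a discourse marker counts as one signal.
    signals = [Signal("discourse marker", (w,)) for w in words
               if w in DISCOURSE_MARKERS]
    # Every ordered pair of words listed as antonyms counts as one signal.
    for i, first in enumerate(words):
        for second in words[i + 1:]:
            if frozenset({first, second}) in ANTONYM_PAIRS:
                signals.append(Signal("antonymy", (first, second)))
    return signals


signals = find_signals("John is tall, but Mary is short.")
for signal in signals:
    print(signal.kind, signal.evidence)
```

A lookup like this only covers signals that are visible in the words themselves, which is one reason manual annotation in a tool remains necessary for syntactic signals.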
Annotating Signals of Coherence Relations

Annotate the RST Discourse Treebank (Carlson et al., 2002)
  Contains 385 Wall Street Journal articles
  The texts are already annotated for rhetorical (coherence) relations
  Approx. 22,000 discourse units and 17,000 relations in total

Annotating Signals of Coherence Relations

Requirements from an annotation tool
  Importability
    Relevant data to be imported into the tool
  Annotation Scheme
    Support for three-level hierarchical taxonomy
  Customizability
    Easy access to the annotation scheme for editing
  Multiple Annotations
    Two or more tags for a single element
  Convertibility
    XML output
  Simplicity
    No advanced computational knowledge
    Graphical interface
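
Three of these requirements (a three-level taxonomy, several tags on one element, XML output) can be sketched together as follows. This is a minimal illustration under stated assumptions: the category names in `TAXONOMY`, the helper functions and the XML element and attribute names are placeholders invented here, and the output is not UAM CorpusTool's own XML format.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-level taxonomy: class > type > specific signal.
TAXONOMY = {
    "single": {
        "discourse marker": {"but"},
        "semantic": {"antonymy"},
        "syntactic": {"parallel syntactic construction"},
    },
}


def is_valid(tag):
    """Check that a (class, type, specific) triple exists in the taxonomy."""
    signal_class, signal_type, specific = tag
    return specific in TAXONOMY.get(signal_class, {}).get(signal_type, set())


def to_xml(relation_id, tags):
    """Serialise one relation carrying any number of signal tags."""
    element = ET.Element("relation", id=relation_id)
    for tag in tags:
        if not is_valid(tag):
            raise ValueError(f"tag not in taxonomy: {tag}")
        signal_class, signal_type, specific = tag
        ET.SubElement(element, "signal", {
            "class": signal_class,
            "type": signal_type,
            "specific": specific,
        })
    return ET.tostring(element, encoding="unicode")


# Two tags on the same relation (the "multiple annotations" requirement).
tags = [
    ("single", "discourse marker", "but"),
    ("single", "semantic", "antonymy"),
]
xml_text = to_xml("r1", tags)
print(xml_text)
```

Keeping the taxonomy in one plain data structure mirrors the customizability requirement: editing the scheme means editing one table, not the code that uses it.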
Signalling Annotation with UAM CorpusTool

Problem with importing data
  UAM CorpusTool supports RST annotation and can directly import RST files
  However, it cannot provide layered annotation on top of the RST-level structure
Solution to the problem
  Convert the RST base files from LISP to text format
  Import the converted files
  This retains the discourse structures and all relational information
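
The conversion step can be pictured as flattening a bracketed (LISP-style) tree into indented plain text that keeps the nesting and the relation labels. The sketch below is an assumption-laden illustration, not the conversion actually used in the project: the `sample` string is an invented, simplified notation and does not reproduce the real RST Discourse Treebank file layout, and `parse` and `to_text` are hypothetical helpers.

```python
import re

# Tokens: quoted strings, parentheses, or bare atoms.
TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


def parse(source):
    """Parse one bracketed expression into nested Python lists."""
    tokens = TOKEN.findall(source)
    position = 0

    def read():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            items = []
            while tokens[position] != ")":
                items.append(read())
            position += 1  # skip the closing parenthesis
            return items
        return token.strip('"')

    return read()


def to_text(node, depth=0):
    """Render a nested list as indented plain-text lines."""
    atoms = [item for item in node if isinstance(item, str)]
    lines = ["  " * depth + " ".join(atoms)]
    for item in node:
        if isinstance(item, list):
            lines.extend(to_text(item, depth + 1))
    return lines


# Invented sample in a simplified bracketed notation.
sample = '(contrast (nucleus "John is tall,") (nucleus "but Mary is short."))'
text = "\n".join(to_text(parse(sample)))
print(text)
```

Because each level of bracketing becomes one level of indentation, the relation label and the role of each unit survive in the text file that is then imported.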
Signalling Annotation with UAM CorpusTool

How did we do the rest?
[Screenshot: Annotation Scheme]
[Screenshot: Annotation Window]

References


Carlson, L., Marcu, D., & Okurowski, M. E. (2002). RST Discourse Treebank,
LDC2002T07 [Corpus]. Philadelphia, PA: Linguistic Data Consortium.
O'Donnell, M. (2008). The UAM CorpusTool: Software for corpus annotation and
exploration. Paper presented at the XXVI Congreso de AESLA, Almeria, Spain.
Thank You!