Corpus annotation

Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com
Outline of the session
• Lecture
– Rationale for corpus annotation
– Leech’s maxims of corpus annotation
– Types of annotation
• Lab
– CLAWS POS tagger (online and Windows-based)
– Introducing Wmatrix
– ICTCLAS
Corpora and annotation
• Unannotated corpus
– simple plain text or raw text
– the linguistic information is implicit
• e.g. no explicit representation of present as a noun
• Annotated corpus
– no longer just text
– real repository of linguistic information
• the relevant linguistic information is now explicit
(e.g. present as a noun, adjective, or verb)
Corpus annotation
• What is annotation?
– “The process of adding […] interpretive, linguistic
information to an electronic corpus of spoken and/or
written language data” (Leech 1997)
– Broadly, also refers to the results of the annotation
process
• In a strict sense, different from corpus markup
– Markup provides objective, verifiable information
• e.g. author, paragraph boundary
– Annotation is concerned with interpretive linguistic
information
• e.g. part-of-speech
Why annotate a corpus?
• It makes information retrieval and extraction easier and
faster, and enables human analysts to exploit and
retrieve analyses of which they are not themselves
capable
• Annotated corpora are reusable resources
• Annotated corpora are multifunctional: they can be
annotated for one purpose and reused for another
• Corpus annotation records a linguistic analysis
explicitly
• Corpus annotation provides a standard reference
resource, a stable base of linguistic analyses, so that
successive studies can be compared and contrasted
on a common basis
How are corpora annotated?
• Automatic annotation
– Can be automated reliably for some types (POS, lemmatization)
– Can annotate large amounts of data quickly at low cost
– Post-editing or human correction may be necessary to improve
accuracy
• Computer-assisted annotation
– The semi-automatic annotation process (human-machine
interface) may produce more reliable results than fully
automated annotation, but it is also slower and more costly
• Manual annotation
– Occurs where no annotation tool is available or where the
accuracy of available systems is not high enough to be useful
– Expensive and time-consuming, typically only feasible for small
corpora
Leech’s 7 maxims of annotation
1. It should be possible to remove the annotation from an
annotated corpus in order to revert to the raw corpus.
2. It should be possible to extract the annotations by
themselves from the text.
3. The annotation scheme should be based on guidelines
which are available to the end user.
4. It should be made clear how and by whom the
annotation was carried out.
5. The end user should be made aware that the corpus
annotation is not error-free or infallible, but simply a
potentially useful tool.
6. Annotation schemes should be based as far as possible
on widely agreed and theory-neutral principles.
7. No annotation scheme has the a priori right to be
considered as a standard. Standards emerge through
practical consensus.
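Maxims 1 and 2 can be illustrated with a minimal sketch. It assumes an embedded LOB-style annotation in which each token is written as word_TAG and tokens are separated by spaces; the function name split_annotation is hypothetical and not part of any tagger.

```python
def split_annotation(tagged_text, sep="_"):
    """Separate an embedded word_TAG annotation into raw words and tags."""
    words, tags = [], []
    for token in tagged_text.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains the separator keeps its internal characters.
        word, found, tag = token.rpartition(sep)
        if not found:              # the token carries no tag at all
            word, tag = token, ""
        words.append(word)
        tags.append(tag)
    return words, tags


words, tags = split_annotation("He_PPHS1 was_VBDZ going_VVGK to_TO die_VVI ._.")
raw_text = " ".join(words)   # maxim 1: revert to the (tokenised) raw text
tag_only = " ".join(tags)    # maxim 2: extract the annotations by themselves
print(raw_text)
print(tag_only)
```

Note that this recovers the tokenised text only; the original spacing around punctuation is lost unless the annotation scheme records it.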
Types of corpus annotation
• Phonological level
– Syllable boundaries (phonetic/phonemic annotation)
– Prosodic or suprasegmental features (prosodic
annotation, e.g. pitch, loudness, intonation)
• Morphological level
– Prefixes, suffixes, stems (morphological annotation)
• Lexical level
– Tokenisation (essential for Chinese)
– Parts of speech (POS tagging)
• e.g. present: NN1, VVB, JJ
– Lemmas (lemmatization)
• stop, stopped, stops, stopping → stop
– Semantic fields (semantic annotation)
• cricket: sport, insect
Tokenisation
• The one-to-one correspondence between
orthographic and morpho-syntactic word tokens can
be considered as a default in English with three
main exceptions
– Multiword units (e.g. so that and in spite of)
– Mergers (e.g. can’t and gonna)
– Variably spelt compounds (e.g. noticeboard, notice-board, notice board)
• CLAWS examples (“ditto tags”)
– so that: so_CS21 that_CS22
– in spite of: in_II31 spite_II32 of_II33
– can’t: ca_VM n’t_XX
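The ditto tags above can be read mechanically. The sketch below assumes that a ditto tag is a base tag followed by two digits (the number of words in the unit, then the position of the current word), as in II31, II32, II33, and that ordinary tags never end in two digits. The function name merge_ditto is hypothetical.

```python
import re

# Assumption: base tag in capitals, then unit size, then position in unit.
DITTO = re.compile(r"^(?P<base>[A-Z]+)(?P<size>\d)(?P<pos>\d)$")


def merge_ditto(tagged_tokens):
    """Merge word_TAG tokens carrying ditto tags into multiword units."""
    merged = []
    for token in tagged_tokens:
        word, _, tag = token.rpartition("_")
        match = DITTO.match(tag)
        if match and match.group("pos") != "1" and merged:
            # Continuation of a multiword unit: glue onto the previous word.
            prev_word, prev_tag = merged[-1]
            merged[-1] = (prev_word + " " + word, prev_tag)
        elif match:
            # First word of a unit: keep it under the base tag.
            merged.append((word, match.group("base")))
        else:
            # Ordinary one-word token.
            merged.append((word, tag))
    return merged


print(merge_ditto(["in_II31", "spite_II32", "of_II33", "so_CS21", "that_CS22"]))
# [('in spite of', 'II'), ('so that', 'CS')]
```

Mergers work the other way round: can’t is split into two tokens (ca_VM n’t_XX), each with an ordinary tag, so they pass through this function unchanged.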
BNC-style POS tagging
• <s> (new sentence)
• <w NN2>Explosives (plural noun)
• <w VVD>found (past tense verb)
• <w PRP>on (preposition)
• <w NP0>Hampstead (proper noun)
• <w NP0>Heath (proper noun)
• <PUN> (punctuation)
• </s> (end of sentence)
Explosives found on Hampstead Heath.
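As a rough sketch of how this simplified SGML can be read by a program, the snippet below pulls (word, tag) pairs out of the word elements with a regular expression. It handles only the <w TAG>word pattern shown in the example; punctuation elements and the function name parse_bnc_words are left out or hypothetical.

```python
import re

# Matches one simplified-SGML word element such as "<w NN2>Explosives".
WORD = re.compile(r"<w\s+(?P<tag>[A-Z0-9-]+)>(?P<word>[^<\s]+)")


def parse_bnc_words(sentence):
    """Return (word, tag) pairs for every <w TAG>word element."""
    return [(m.group("word"), m.group("tag")) for m in WORD.finditer(sentence)]


line = "<s><w NN2>Explosives <w VVD>found <w PRP>on <w NP0>Hampstead <w NP0>Heath</s>"
for word, tag in parse_bnc_words(line):
    print(f"{word}\t{tag}")
```

Printing one word and tag per line gives something close to the vertical output format that taggers often offer.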
Example of semantic tagging
See http://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf for the tagset.
Types of corpus annotation
• Syntactic level
– Parsing / treebanking / bracketing
(S (NP Mary)
(VP visited
(NP a
(ADJP very nice)
boy)))
• Stanford Parser
– http://nlp.stanford.edu:8080/parser/
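A bracketed parse like the one above is easy to load into a nested data structure. The following is a minimal sketch, assuming well-formed brackets and labels and words without internal spaces; the function names parse_brackets and leaves are hypothetical and unrelated to the Stanford Parser.

```python
def parse_brackets(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token              # a word (leaf)
        node = [tokens[pos]]          # first item after "(" is the label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                      # skip the closing ")"
        return node

    return read()


def leaves(node):
    """Collect the words of a tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:            # node[0] is the category label
        words.extend(leaves(child))
    return words


tree = parse_brackets("(S (NP Mary) (VP visited (NP a (ADJP very nice) boy)))")
print(tree)
print(" ".join(leaves(tree)))         # Mary visited a very nice boy
```

Reading off the leaves recovers the sentence, which is the syntactic counterpart of removing the annotation to get back to the raw text.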
Types of corpus annotation
• Discourse level
– Anaphoric relations (coreference annotation)
(6 the married couple 6) said that <REF=6 they were happy
with <REF=6 their lot.
– Speech acts (pragmatic annotation)
• 3 layers of coding
– Segmentation (dividing dialogue in textual units, i.e. utterances)
– Functional annotation (dialogue act annotation)
– Utterance tags (applying utterance tags that characterize the
role of the utterance as a dialogue act)
– Stylistic features such as speech and thought
presentation (stylistic annotation)
• The representation of people’s speech and thoughts, known
as speech and thought presentation (S&TP)
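Returning to the coreference example earlier on this slide, the notation can be unpacked with two regular expressions. This is only a sketch under stated assumptions: an antecedent is written (N phrase N), a referring expression is written <REF=N followed by a single word, and the function name coreference_chains is hypothetical.

```python
import re

ANTECEDENT = re.compile(r"\((\d+) (.+?) \1\)")   # (6 the married couple 6)
REFERENCE = re.compile(r"<REF=(\d+) (\S+)")      # <REF=6 they


def coreference_chains(text):
    """Group antecedents and referring words by their shared index."""
    chains = {}
    for index, phrase in ANTECEDENT.findall(text):
        chains.setdefault(index, {"antecedent": phrase, "references": []})
    for index, word in REFERENCE.findall(text):
        chains.setdefault(index, {"antecedent": None, "references": []})
        chains[index]["references"].append(word)
    return chains


text = ("(6 the married couple 6) said that <REF=6 they were happy "
        "with <REF=6 their lot.")
print(coreference_chains(text))
```

Here index 6 links the antecedent "the married couple" to the pronouns "they" and "their".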
Types of corpus annotation
• Other types
– Error tagging
• Applied to learner corpus data
• The CLEC error tagging scheme consists of 61 error
types clustered in 11 categories
– Problem-specific annotation
• Not exhaustive: only the phenomena directly relevant
to a particular research question are annotated
• Developed for its relevance to the specific research
question, but not for its broad coverage and
consensus-based theory-neutrality
– E.g. Hunston (1993) studies how people talk about
sameness and difference (“local grammar”)
Annotation styles
• Embedded style
– LOB style
going_VVGK
– TEI entity references
going&VVGK;
– WSJ style
going/VVGK
– SGML
<w POS=VVGK>going</w>
– BNC style (simplified SGML)
<w VVGK>going
– XML
<w POS="VVGK">going</w>
• Standalone style
– <s>
<w id="1">He</w>
<w id="2">was</w>
<w id="3">going</w>
<w id="4">to</w>
<w id="5">die</w>
<w id="6">.</w>
</s>
– <s>
<word id="1">PPHS1</word>
<word id="2">VBDZ</word>
<word id="3">VVGK</word>
<word id="4">TO</word>
<word id="5">VVI</word>
<word id="6">.</word>
</s>
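To make the link between the two styles concrete, the sketch below converts an embedded LOB-style string into the two standalone layers shown above, joined by shared id values. It uses Python's standard xml.etree.ElementTree module; the function name to_standalone and the choice of element names are assumptions taken from the example, not a fixed standard.

```python
import xml.etree.ElementTree as ET


def to_standalone(tagged_text):
    """Build a text layer and a tag layer linked by shared id values."""
    text_layer = ET.Element("s")
    tag_layer = ET.Element("s")
    for i, token in enumerate(tagged_text.split(), start=1):
        word, _, tag = token.rpartition("_")
        # The same id appears in both layers, so they can be re-aligned.
        ET.SubElement(text_layer, "w", id=str(i)).text = word
        ET.SubElement(tag_layer, "word", id=str(i)).text = tag
    return text_layer, tag_layer


text_layer, tag_layer = to_standalone(
    "He_PPHS1 was_VBDZ going_VVGK to_TO die_VVI ._."
)
print(ET.tostring(text_layer, encoding="unicode"))
print(ET.tostring(tag_layer, encoding="unicode"))
```

Keeping the layers separate means that several annotations (POS, semantic, syntactic) can point at the same text without touching it.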
Introducing CLAWS
• CLAWS: some basic facts
– The Constituent Likelihood Automatic Word-tagging
System
– Best known POS tagger for general English
– Has been used to tag a number of large corpora,
including the 100M word BNC
– Has consistently achieved 96-97% accuracy
– Free online tagging service allows academic users to
tag 100,000 words at a time (from an academic
website)
• http://ucrel.lancs.ac.uk/claws/trial.html
CLAWS tagsets
• C7 tagset
– A detailed tagset of 146 tags
– http://ucrel.lancs.ac.uk/claws7tags.html
• C5 tagset
– Less refined, 61 tags (BNC tagset)
– http://ucrel.lancs.ac.uk/claws5tags.html
• The mapping between C7 and C5 is a many-to-one
conversion, and is available in a tab-delimited text file
• C8 tagset is an extension of C7 tagset that makes further
distinctions in the determiner and pronoun categories as
well as for auxiliary verbs
– http://ucrel.lancs.ac.uk/claws8tags.pdf
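Because the C7 to C5 mapping is many-to-one, converting a C7-tagged text to C5 is a simple table lookup, while the reverse is not possible without re-tagging. The sketch below assumes a two-column layout (C7 tag, tab, C5 tag) and uses a handful of illustrative rows; the layout of the real mapping file and the function names are assumptions.

```python
import csv
import io

# Illustrative rows only; the real file's layout is assumed to be
# "C7 tag <TAB> C5 tag", one pair per line.
SAMPLE_MAPPING = "PPHS1\tPNP\nPPHS2\tPNP\nVBDZ\tVBD\nVBDR\tVBD\nVVGK\tVVG\n"


def load_mapping(handle):
    """Read a two-column tab-delimited file into a C7 -> C5 dictionary."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def convert(tags, mapping):
    """Map C7 tags to C5; tags missing from the table pass through unchanged."""
    return [mapping.get(tag, tag) for tag in tags]


mapping = load_mapping(io.StringIO(SAMPLE_MAPPING))
print(convert(["PPHS1", "VBDZ", "VVGK"], mapping))
# ['PNP', 'VBD', 'VVG']
```

In this sample both PPHS1 and PPHS2 collapse to PNP, which shows why information is lost in the conversion.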
Free CLAWS trial service
CLAWS output formats
Vertical output format
Horizontal output format (Use copy & paste and save as a plain text file)
Pseudo-XML output format
Windows-based CLAWS
D:\ZJU CL\tools\Jclaws\lib\run_jclaws.bat (or antclawsgui)
…tagging text in a file
Wmatrix
• An online corpus analysis and comparison system
• A web interface that gives you access to the CLAWS
part-of-speech tagger and the USAS semantic tagger
– CLAWS
– USAS: UCREL Semantic Analysis System
• Including standard corpus research tools
– Frequency, KWIC concordance, wordlist, keyword list, word
clusters (n-grams), collocation
– Built-in log-likelihood statistic for corpus comparison
• Integrating POS tagging and semantic field annotation
into a single profiling tool
• Introduction to Wmatrix
– http://ucrel.lancs.ac.uk/wmatrix/
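The log-likelihood statistic mentioned above compares the frequency of an item in two corpora against the frequencies expected if the item were spread evenly. The sketch below uses the commonly cited two-corpus form; it is an illustration of the statistic, not Wmatrix's own code, and the function name log_likelihood is hypothetical.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Two-corpus log-likelihood for one word or tag.

    freq1, freq2: observed frequencies in corpus 1 and corpus 2
    size1, size2: total number of tokens in each corpus
    """
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:              # 0 * ln(0) is treated as 0
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Same relative frequency in both corpora: no difference at all.
print(log_likelihood(10, 1000, 10, 1000))
# Three times as frequent in corpus 1 as in corpus 2.
print(log_likelihood(30, 1000, 10, 1000))
```

A value above 3.84 corresponds to p < 0.05 with one degree of freedom, so the second comparison would count as a significant difference at that level.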
Your Wmatrix account
• You will need a username and password
to use Wmatrix
• Write down your username and password
– Tag and download your text as soon as
possible if you wish to use Wmatrix to tag
your data (POS / semantic) on your project
• …and now log in with your account
– http://ucrel.lancs.ac.uk/wmatrix3.html
Click here to run “tag wizard”
Click here to see your work area (for data you have already processed)
Click here to find out more about the
UCREL Semantic Annotation System
Amongst other things, the link explains
the categorisation scheme utilised …
Hierarchy of 21 major discourse fields (or domains),
which expand into 232 semantic field tags (see the web link)
semantic field (or domain) = “A named area of meaning in which lexemes
interrelate and define each other in specific ways” (Crystal 1995: 157)
Note --- the USAS scheme is derived from McArthur (1981)
The USAS system
• Designed to undertake the automatic semantic analysis of present-day English texts (spoken and written)
• Involving two stages
(i) POS tagging by CLAWS
A POS tag is assigned to every lexical item or multi-word expression
(MWE), using probabilistic Markov models of likely part-of-speech
sequences (accuracy of 97%+)
(ii) Output fed into SEMTAG for semantic annotation
Semantic tags are assigned automatically on the basis of pattern
matching between the target text and two computer dictionaries
developed for use with the program (accuracy of 92%+)
• Present applications: market research, content analysis, information
extraction, assistance for translation, linguistic analysis, etc.
Let’s do some tagging
Once you have logged in:
• From the Wmatrix home page, click on Tag wizard
• This will bring up the following page
…
Let’s do some tagging
Tag the following two texts:
– Tip: it is good practice to create one folder for
each file
• Conservative MP Michael Howard’s farewell
speech to his party (2005)
– D:\ZJU CL\texts\Howard_speech.txt
• New Labour MP Tony Blair’s farewell speech
to his party (2006)
– D:\ZJU CL\texts\texts\Blair_speech.txt
A quick “how to”!
• Enter new work area name (Blair / Howard)
• Click the browse button to select the right file
• Click the “upload now” button …
• A new screen will provide you with an update report … e.g.
– part-of-speech tagging
– semantic tagging
– frequency lists
You will then be taken to your work area
[My folders]
What you’ll see in the Simple “VIEW of folder”
Click on Frequency to see the most frequent words
You can also do concordance searches of words/phrases
Advanced View of Howard Folder
Click on Frequency to see the most frequent words (as before)
--- and investigate key parts of speech (POS)
and key concepts / domains
How might we discover the most ‘frequent’ POS? Jot them down
--- and the most ‘frequent’ semantic fields? Make a note of them
We can also see all of the keywords using this VIEW
Frequency of words in Howard and Blair
(using advanced view)
Make a note of the similarities and differences …
Download the tagged text
Remember to change filename and file type
Tagging Chinese text
• ICTCLAS – Institute of Computing
Technology, Chinese Lexical Analysis
System
– Widely regarded as the best Chinese tagger
• Fast and reliable (reported accuracy of 98.45%)
– Online demonstration
– Free download of shareware version
– http://ictclas.org/
Online demo
Standalone ICTCLAS
D:\ZJU CL\tools\ICTCLAS\ICTCLAS_Win.exe
Tagset - http://www.lancs.ac.uk/fass/projects/corpus/LCMC/lcmc/lcmc_tagset.htm
Download