Annotation and SGML

LELA 30922
Lecture 5
Corpus annotation and SGML
See esp.
R. Garside, G. Leech & A. McEnery (eds), Corpus Annotation (London:
Longman, 1997), ch. 1 “Introduction” by G. Leech; something similar is
available at http://llc.oxfordjournals.org/cgi/reprint/8/4/275.pdf
C.M. Sperberg-McQueen and L. Burnard (eds), Guidelines for Electronic
Text Encoding and Interchange, ch. 2 “A Gentle Introduction to
SGML”, available at
http://www-sul.stanford.edu/tools/tutorials/html2.0/gentle.html
1/25
Annotation
• Difference between a corpus and a “mere
collection of texts” is mainly due to the
value added by annotation
• Includes generic information about the text,
usually stored in a “header”
• But more significantly, annotations within
the text itself
2/25
Why annotate?
• Adds information
• Reflects some analysis of text
– Inasmuch as this reflects commitment to a particular
theoretical approach, it can sometimes be a barrier
(but see later)
• Increases usefulness/reusability of text
• Multi-functionality
– May make corpus usable for something not originally
foreseen by its compilers
3/25
Golden rules of annotation
• Recoverability
– It should always be possible to ignore the annotation and reconstruct the
corpus in its raw form
• Extricability
– Correspondingly, annotations should be easily accessible so they can be
stored separately if necessary (“Before and after” versions)
• Transparency: documentation
– Purpose and meaning of annotations
– How (eg manually or automatically), where and by whom
annotations were done
• If automatic, information about the programs used
– Quality indication
• Annotations almost inevitably include some errors or inconsistencies
• To what extent have annotations been checked?
• What is the measured accuracy rate, and against what benchmark?
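The recoverability and extricability rules can be sketched in a few lines: stripping the markup recovers the raw text, while pulling the tags out lets the annotation be stored separately. A minimal sketch (Python; the `<w>`/`<c>` style tags follow the examples later in these slides, and the regex assumes no `>` characters inside attribute values):

```python
import re

def strip_annotation(annotated):
    """Recoverability: remove every tag to reconstruct the raw text."""
    return re.sub(r"<[^>]+>", "", annotated)

def extract_tags(annotated):
    """Extricability: pull the annotations out for separate storage."""
    return re.findall(r"<[^>]+>", annotated)

sample = "<w NN2>guests</w> <w VHB>have</w>"
print(strip_annotation(sample))  # guests have
print(extract_tags(sample))      # ['<w NN2>', '</w>', '<w VHB>', '</w>']
```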
4/25
Theory-neutrality
• Schools of thought
– Annotations may reflect a particular theoretical approach, and this
should be acknowledged
• Consensus
– corpus annotations which are more (rather than less) theory-neutral
will be more widely used
– given the amount of work involved, it pays to be aware of the
descriptive traditions of the relevant field
• Standards
– There are very few absolute standards, but some schemes can
become de facto standards through widespread use
– For example, BNC designers were aware of the likely side effects
of any decisions (regarding annotation) that they took
5/25
Types of annotation
• Plain corpus: the text appears in its raw state, as
plain text
• Corpus marked up for formatting attributes e.g.
page breaks, paragraphs, font sizes
• Corpus annotated with identifying information,
such as title, author, genre, register, edition date
• Corpus annotated with linguistic information
• Corpus annotated with additional interpretive
information, eg error analysis in learner corpus
6/25
Levels of linguistic annotation
• Paragraph and sentence-boundary disambiguation
– The naive rule (full stop + space + capital letter) is
unreliable for genuine texts
– May also involve distinguishing titles/headings from
running text
• Tokenization: identification of lexical units
– multi-word units, cliticised words (eg can’t)
• Lemmatisation: identification of lemmas (or
lexemes)
– Makes accessible variants of lexemes for more generic
searches
– May involve some disambiguation (eg rose)
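Tokenization of cliticised words can be sketched with a small lookup table (a toy example; the clitic list and split points below are illustrative, not a standard scheme):

```python
# Toy tokenizer: splits a few common English clitics.
CLITICS = {
    "can't": ["ca", "n't"],
    "won't": ["wo", "n't"],
    "they're": ["they", "'re"],
    "it's": ["it", "'s"],
}

def tokenize(text):
    tokens = []
    for word in text.split():
        # Look the word up in the clitic table; otherwise keep it whole.
        tokens.extend(CLITICS.get(word.lower(), [word]))
    return tokens

print(tokenize("it can't rain"))  # ['it', 'ca', "n't", 'rain']
```

A real tokenizer would also have to handle punctuation, capitalisation, and multi-word units, which this sketch ignores.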
7/25
Levels of linguistic annotation
• POS tagging (grammatical tagging)
– assigning to each lexical unit a code indicating its part
of speech
– most basic type of linguistic corpus annotation and
forms an essential foundation for further forms of
analysis
• Parsing (treebanking)
– Identification of syntactic relationships between words
• Semantic tagging
– Marking of word senses (sense resolution)
– Marking of semantic relationships eg agent, patient
– Marking with semantic categories eg human, animate
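The word_TAG convention used in the LOB example later in these slides is easy to read back into (word, tag) pairs — a sketch assuming the tag is whatever follows the last underscore:

```python
def read_lob(tagged):
    """Split LOB-style 'word_TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in tagged.split()]

pairs = read_lob("hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN")
print(pairs[:2])  # [('hospitality', 'NN'), ('is', 'BEZ')]
```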
8/25
Levels of linguistic annotation
• Discourse annotation
– especially for transcribed speech
– Identifying discourse function of text eg apology, greeting
– or other pragmatic aspects, eg politeness level
• Anaphoric annotation
– Identification of pronoun reference
– and other anaphoric links (eg different references to the same
entity)
• Phonetic transcription (only in spoken language corpora)
– Indication of details of pronunciation not otherwise reflected in
transcription, eg weak forms
– Explicit indication of accent/dialect features eg vowel qualities,
allophonic variation
• Prosodic annotation (only in spoken language corpora)
– Suprasegmental information, eg stress, intonation, rhythm
9/25
Some examples
PROSODIC ANNOTATION, LONDON-LUND CORPUS:
well ^very nice of you to ((come and)) _spare the !t\/ime and #
^come and !t\alk # ^tell me a'bout the - !pr\oblems#
And ^incidentally# .
^I [@:] ^do ^do t\ell me#
^anything you 'want about the :college in "!g\eneral
Source: Leech chapter in Garside et al. 1997
10/25
EXAMPLE OF PART-OF-SPEECH TAGGING, LOB CORPUS:
hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC
not_XNOT when_WRB the_ATI guests_NNS have_HV to_TO sleep_VB in_IN
rows_NNS in_IN the_ATI cellar_NN !_! the_ATI lovers_NNS ,_,
whose_WP$ chief_JJB scene_NN was_BEDZ cut_VBN at_IN the_ATI last_AP
moment_NN ,_, had_HVD comparatively_RB little_AP to_TO sing_VB '_'
he_PP3A stole_VBD my_PP$ wallet_NN !_! '_' roared_VBD Rollinson_NP
._.
._.
EXAMPLE OF SKELETON PARSING, FROM THE SPOKEN ENGLISH
CORPUS:
[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N]
,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG
big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N
Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V
has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$
new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1
park_NNL1 ]N]P]N]P]V] ._. S]
Source: http://ucrel.lancs.ac.uk/annotation.html
11/25
ANAPHORIC ANNOTATION OF AP NEWSWIRE
S.1 The state Supreme Court has refused to release Rahway State Prison
inmate James Scott on bail.
S.2 The fighter is serving 30-40 years for a 1975 armed robbery conviction.
S.3 Scott had asked for freedom while he waits for an appeal decision.
S.4 Meanwhile, his promoter, Murad Muhammed, said Wednesday he netted only
$15,250 for Scott's nationally televised light heavyweight fight against
ranking contender Yaqui Lopez last Saturday.
S.5 The fight, in which Scott won a unanimous decision over Lopez, grossed
$135,000 for Muhammed's firm, Triangle Productions of Newark, he said.
S.1 (0) The state Supreme Court has refused to release
{1 [2 Rahway State Prison 2] inmate 1}} (1 James Scott 1) on bail
.
S.2 (1 The fighter 1) is serving 30-40 years for a 1975 armed
robbery conviction .
S.3 (1 Scott 1) had asked for freedom while <1 he waits for an
appeal decision .
S.4 Meanwhile , [3 <1 his promoter 3] , {{3 Murad Muhammed 3} ,
said Wednesday <3 he netted only $15,250 for (4 [1 Scott 1] 's
nationally televised light heavyweight fight against {5 ranking
contender 5}} (5 Yaqui Lopez 5) last Saturday 4) .
S.5 (4 The fight , in which [1 Scott 1] won a unanimous decision
over (5 Lopez 5) 4) , grossed $135,000 for [6 [3 Muhammed 3] 's
firm 6], {{6 Triangle Productions of Newark 6} , <3 he said .
Source: http://ucrel.lancs.ac.uk/annotation.html
12/25
SGML
• Although none of the examples just shown use it,
for all but the simplest of mark-up schemes,
SGML is widely recommended and used
• SGML = standard generalized mark-up language
• Actually suitable for all sorts of things, including
web pages (HTML is an application of SGML)
13/25
What is a mark-up language?
• Mark-up historically referred to printer’s marks on a manuscript to
indicate typesetting requirements.
• Now covers all sorts of codes inserted into electronic texts to govern
formatting, printing, or other information.
• Mark-up, or (synonymously) encoding, is defined as any means of
making explicit an interpretation of a text.
• By “mark-up language” we mean a set of mark-up conventions used
together for encoding texts. A mark-up language must specify
– what mark-up is allowed
– what mark-up is required
– how mark-up is to be distinguished from text
– what the mark-up means
• SGML provides the means for doing the first three
• Separate documentation/software is required for the last
– eg (1) the difference between identifying something as <emph> and
how that appears in print; (2) why something may or may not be
tagged as a “relative clause”
14/25
Rules of SGML
• SGML allows us to define
– Elements
– Specific features of elements
– Hierarchical/structural relations between elements
• These specified in a “document type definition”
(DTD)
• DTD allows software to be written to
– Help annotators annotate consistently
– Explore marked-up documents
15/25
Elements in SGML
• Have a (unique) name
• Semantics of name are application dependent
– up to designer to choose appropriate name, but nothing
automatically follows from the choice of any particular name
• Each element must be explicitly marked or tagged in some
way
– Most usual is with <element> and </element> pairs, called
start- and end-tags
– Much SGML-compliant software seems to allow start-only tags
– &element; (esp. useful for single words or characters)
– _tag suffix
16/25
Attributes
• Elements can have named attributes with
associated values
• When defined, values can be identified as
– #REQUIRED: must be specified
– #IMPLIED: optional
– #CURRENT: inferred to be the same as the last
specified value for that attribute
• Values can be from a predefined list, or can be of a
general type (string, integer, etc)
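The #CURRENT rule can be sketched as a small helper (a hypothetical function, not part of any SGML toolkit): an unspecified value is inferred to be the last one explicitly given.

```python
def fill_current(values):
    """Apply SGML's #CURRENT rule: an unspecified attribute value
    (None here) is inferred to be the last explicitly specified one."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

print(fill_current(["draft", None, "revised", None]))
# ['draft', 'draft', 'revised', 'revised']
```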
17/25
DTD (Document type definition)
• Helps to impose uniformity over the corpus
• Defines the (expected or to-be-imposed)
structure of the document
• For each element, defines
– How it appears (whether end tags are required)
– What its substructure is, ie what elements, how
many of them, whether compulsory or not
18/25
Example of DTD
<!ELEMENT anthology      - - (poem+)>
<!ELEMENT poem           - - (title?, (stanza+ | couplet+))>
<!ELEMENT title          - O (#PCDATA)>
<!ELEMENT stanza         - O (line+)>
<!ELEMENT couplet        - O (cline, cline)>
<!ELEMENT (line | cline) O O (#PCDATA)>
• Start and end tags necessary (-) or optional (O)
• Anthology consists of 1 or more poems
• Poem has an optional title, then 1 or more stanzas or 1 or more
couplets
• Title consists of “parsed character data”, ie normal text
• Stanza has one or more lines, couplet has two lines
• Both lines and clines have the same definition: normal text
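The kind of structural constraint a DTD expresses can be checked in code. A sketch using Python's XML parser on an XML-flavoured version of the anthology (all end-tags written out, since SGML tag omission is not handled by XML tools):

```python
import xml.etree.ElementTree as ET

doc = """<anthology>
  <poem>
    <title>Example</title>
    <stanza><line>one</line><line>two</line></stanza>
  </poem>
</anthology>"""

root = ET.fromstring(doc)
# anthology consists of one or more poems
poems = root.findall("poem")
assert len(poems) >= 1
# each stanza has one or more lines
for stanza in root.iter("stanza"):
    assert len(stanza.findall("line")) >= 1
print("structure OK")
```

A DTD-aware validator would enforce all such constraints automatically; the point here is only what the content models mean.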
19/25
Attributes
<!ATTLIST poem
    id     ID                             #IMPLIED
    status (draft | revised | published)  draft >
• DTD defines the attributes expected/required
for each element
• A poem has an id and a status
• Value of id is any identifier, and is optional
• Status is one of three values, default draft
20/25
<anthology>
<poem id=12 status=revised>
<title>It’s a grand old team</title>
<stanza>
<line>It’s a grand old team to play for
<line>It’s a grand old team to support
<line>And if you know your history
<line>It’s enough to make your heart go
Whoooooah
</stanza>
</poem>
<poem id=13>
...
</poem>
</anthology>
21/25
Mark-up exemplified
RAW TEXT:
Two men retained their marbles, and as luck would
have it they're both roughie-toughie types as well
as military scientists - a cross between Albert
Einstein and Action Man!
TOKENIZED TEXT:
<w orth=CAP>Two</w> <w>men</w> <w>retained</w>
<w>their</w> <w>marbles</w><c PUN>,</c> <w>and</w>
<w>as</w> <w>luck</w> <w>would</w> <w>have</w>
<w>it</w> <w>they</w><w>'re</w> <w>both</w>
<w>roughie-toughie</w> <w>types</w> <w>as</w>
<w>well</w> <w>as</w> <w>military</w>
<w>scientists</w> <c PUN>—</c> <w>a</w>
<w>cross</w> <w>between</w> <w orth=CAP>Albert</w>
<w orth=CAP>Einstein</w> <w>and</w>
<w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>
22/25
LEMMATIZED TEXT:
<w orth=CAP>Two</w> <w lem=man>men</w>
<w lem=retain>retained</w> <w>their</w>
<w lem=marble>marbles</w><c PUN>,</c> <w>and</w>
<w>as</w> <w>luck</w> <w>would</w> <w>have</w>
<w>it</w> <w>they</w><w lem=be>'re</w> <w>both</w>
<w>roughie-toughie</w> <w lem=type>types</w>
<w>as</w> <w>well</w> <w>as</w> <w>military</w>
<w lem=scientist>scientists</w> <c PUN>—</c>
<w>a</w> <w>cross</w> <w>between</w>
<w orth=CAP>Albert</w> <w orth=CAP>Einstein</w>
<w>and</w> <w orth=CAP>Action</w>
<w orth=CAP>Man</w><c PUN>!</c>
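One payoff of lemma annotation is extricability: the lem= values can be pulled back out for more generic searching. A regex sketch over markup like the example above (unquoted attribute values assumed, as in these slides):

```python
import re

marked = ("<w lem=man>men</w> <w lem=retain>retained</w> "
          "<w lem=marble>marbles</w>")
# Find the lem= attribute inside each <w ...> start-tag.
lemmas = re.findall(r"<w[^>]*\blem=(\w+)[^>]*>", marked)
print(lemmas)  # ['man', 'retain', 'marble']
```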
23/25
POS TAGGED TEXT:
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w>
<w VVD lem=retain>retained</w> <w DPS>their</w>
<w NN2 lem=marble>marbles</w><c PUN>,</c>
<w CJC>and</w> <w CJS>as</w> <w NN1-VVB>luck</w>
<w VM0>would</w> <w VHI>have</w> <w PNP>it</w>
<w PNP>they</w><w VBB lem=be>'re</w>
<w AV0>both</w> <w AJ0>roughie-toughie</w>
<w NN2>types</w> <w AV0>as</w> <w AV0>well</w>
<w CJS>as</w> <w AJ0>military</w>
<w NN2>scientists</w> <c PUN>&mdash;</c>
<w AT0>a</w> <w NN1>cross</w> <w PRP>between</w>
<w NP0>Albert</w> <w NP0>Einstein</w>
<w CJC>and</w> <w NN1>Action</w>
<w NN1-NP0>Man</w><c PUN>!</c>
24/25
POS TAGGED TEXT with idioms and named entities:
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w>
<phrase type=idiom><w VVD lem=retain>retained</w>
<w DPS>their</w>
<w NN2 lem=marble>marbles</w></phrase><c PUN>,</c>
<w CJC>and</w> <phrase type=idiom><w CJS>as</w>
<w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w>
<w PNP>it</w></phrase>
<w PNP>they</w><w VBB lem=be>'re</w>
<w AV0>both</w> <w AJ0>roughie-toughie</w>
<w NN2>types</w>
<phrase type=compound pos=CJS><w AV0>as</w>
<w AV0>well</w> <w CJS>as</w></phrase>
<phrase type=compound pos=NN2><w AJ0>military</w>
<w NN2>scientists</w></phrase> <c PUN>&mdash;</c>
<w AT0>a</w> <w NN1>cross</w> <w PRP>between</w>
<phrase type=compound pos=NP0><w NP0>Albert</w>
<w NP0>Einstein</w></phrase>
<w CJC>and</w>
<phrase type=compound pos=NP0><w NN1>Action</w>
<w NN1-NP0>Man</w></phrase><c PUN>!</c>
25/25