LELA 30922 Lecture 5
Corpus annotation and SGML

See esp.:
R Garside, G Leech & A McEnery (eds) (1997) Corpus Annotation, London: Longman, ch. 1 "Introduction" by G Leech; something similar is available at http://llc.oxfordjournals.org/cgi/reprint/8/4/275.pdf
CM Sperberg-McQueen and L Burnard (eds) Guidelines for Electronic Text Encoding and Interchange, ch. 2 "A Gentle Introduction to SGML", available at http://www-sul.stanford.edu/tools/tutorials/html2.0/gentle.html

Annotation
• The difference between a corpus and a "mere collection of texts" is mainly due to the value added by annotation
• This includes generic information about the text, usually stored in a "header"
• But, more significantly, it includes annotations within the text itself

Why annotate?
• Adds information
• Reflects some analysis of the text
  – Inasmuch as this may reflect commitment to some theoretical approach, it can sometimes be a barrier (but see later)
• Increases the usefulness/reusability of the text
• Multi-functionality
  – May make the corpus usable for purposes not originally foreseen by its compilers

Golden rules of annotation
• Recoverability
  – It should always be possible to ignore the annotation and reconstruct the corpus in its raw form (see the sketch after these slides)
• Extricability
  – Correspondingly, annotations should be easily accessible so that they can be stored separately if necessary ("before and after" versions)
• Transparency: documentation
  – Purpose and meaning of annotations
  – How (eg manually or automatically), where and by whom the annotations were done
    • If automatic, information about the programs used
  – Quality indication
    • Annotations almost inevitably include some errors or inconsistencies
    • To what extent have the annotations been checked?
    • What is the measured accuracy rate, and against what benchmark?

Theory-neutrality
• Schools of thought
  – Annotations may reflect a particular theoretical approach, and this should be acknowledged
• Consensus
  – Corpus annotations which are more (rather than less) theory-neutral will be more widely used
  – Given the amount of work involved, it pays to be aware of the descriptive traditions of the relevant field
• Standards
  – There are very few absolute standards, but some schemes can become de facto standards through widespread use
  – For example, the BNC designers were aware of the likely side effects of any annotation decisions they took

Types of annotation
• Plain corpus: the text appears in its existing raw state
• Corpus marked up for formatting attributes, eg page breaks, paragraphs, font sizes
• Corpus annotated with identifying information, such as title, author, genre, register, edition date
• Corpus annotated with linguistic information
• Corpus annotated with additional interpretive information, eg error analysis in a learner corpus
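
A MINIMAL SKETCH OF RECOVERABILITY (PYTHON):
The recoverability rule above can be made concrete with a short sketch. This is an illustration only: it assumes every annotation is a <...> tag and that < never occurs in the raw text itself.

    import re

    def strip_annotation(annotated):
        # Recoverability: delete all <...> tags to reconstruct the raw text.
        # Extricability would be served by saving the removed tags separately,
        # so that the "before and after" versions can both be kept.
        return re.sub(r"<[^>]+>", "", annotated)

    print(strip_annotation("<w>men</w> <w>retained</w>"))   # -> men retained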
Levels of linguistic annotation
• Paragraph and sentence-boundary disambiguation
  – The naive full stop + space + capital heuristic is unreliable for genuine texts
  – May also involve distinguishing titles/headings from running text
• Tokenization: identification of lexical units
  – multi-word units, cliticised words (eg can't)
• Lemmatisation: identification of lemmas (or lexemes)
  – Makes variants of a lexeme accessible for more generic searches
  – May involve some disambiguation (eg rose)

Levels of linguistic annotation (cont.)
• POS tagging (grammatical tagging)
  – Assigning to each lexical unit a code indicating its part of speech
  – The most basic type of linguistic corpus annotation, and an essential foundation for further forms of analysis
• Parsing (treebanking)
  – Identification of syntactic relationships between words
• Semantic tagging
  – Marking of word senses (sense resolution)
  – Marking of semantic relationships, eg agent, patient
  – Marking with semantic categories, eg human, animate

Levels of linguistic annotation (cont.)
• Discourse annotation – especially for transcribed speech
  – Identifying the discourse function of text, eg apology, greeting
  – or other pragmatic aspects, eg politeness level
• Anaphoric annotation
  – Identification of pronoun reference
  – and other anaphoric links (eg different references to the same entity)
• Phonetic transcription (only in spoken language corpora)
  – Indication of details of pronunciation not otherwise reflected in the transcription, eg weak forms
  – Explicit indication of accent/dialect features, eg vowel qualities, allophonic variation
• Prosodic annotation (only in spoken language corpora)
  – Suprasegmental information, eg stress, intonation, rhythm

Some examples

PROSODIC ANNOTATION, LONDON-LUND CORPUS:
well ^very nice of you to ((come and)) _spare the !t\/ime and #
^come and !t\alk #
^tell me a’bout the - !pr\oblems#
And ^incidentally# .
^I [@:] ^do ^do t\ell me#
^anything you ‘want about the :college in ”!g\eneral
Source: Leech chapter in Garside et al. 1997

EXAMPLE OF PART-OF-SPEECH TAGGING, LOB CORPUS:
hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC not_XNOT when_WRB the_ATI guests_NNS have_HV to_TO sleep_VB in_IN rows_NNS in_IN the_ATI cellar_NN !_!
the_ATI lovers_NNS ,_, whose_WP$ chief_JJB scene_NN was_BEDZ cut_VBN at_IN the_ATI last_AP moment_NN ,_, had_HVD comparatively_RB little_AP to_TO sing_VB
'_' he_PP3A stole_VBD my_PP$ wallet_NN !_! '_' roared_VBD Rollinson_NP ._.

EXAMPLE OF SKELETON PARSING, FROM THE SPOKEN ENGLISH CORPUS:
[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S]
Source: http://ucrel.lancs.ac.uk/annotation.html
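
UNPACKING THE word_TAG CONVENTION (PYTHON):
The LOB-style word_TAG format is easy to process programmatically. A minimal sketch, assuming the tag is whatever follows the token's last underscore:

    def read_lob(tagged):
        # Split LOB-style "word_TAG" tokens into (word, tag) pairs.
        pairs = []
        for token in tagged.split():
            word, _, tag = token.rpartition("_")
            pairs.append((word, tag))
        return pairs

    print(read_lob("hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_,"))
    # [('hospitality', 'NN'), ('is', 'BEZ'), ('an', 'AT'),
    #  ('excellent', 'JJ'), ('virtue', 'NN'), (',', ',')]

Note that rpartition also handles punctuation tokens such as ,_, and !_!, where the word and the tag are both the punctuation mark itself.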
ANAPHORIC ANNOTATION OF AP NEWSWIRE:
S.1 The state Supreme Court has refused to release Rahway State Prison inmate James Scott on bail.
S.2 The fighter is serving 30-40 years for a 1975 armed robbery conviction.
S.3 Scott had asked for freedom while he waits for an appeal decision.
S.4 Meanwhile, his promoter, Murad Muhammed, said Wednesday he netted only $15,250 for Scott's nationally televised light heavyweight fight against ranking contender Yaqui Lopez last Saturday.
S.5 The fight, in which Scott won a unanimous decision over Lopez, grossed $135,000 for Muhammed's firm, Triangle Productions of Newark, he said.

S.1 (0) The state Supreme Court has refused to release {1 [2 Rahway State Prison 2] inmate 1} (1 James Scott 1) on bail .
S.2 (1 The fighter 1) is serving 30-40 years for a 1975 armed robbery conviction .
S.3 (1 Scott 1) had asked for freedom while <1 he waits for an appeal decision .
S.4 Meanwhile , [3 <1 his promoter 3] , {3 Murad Muhammed 3} , said Wednesday <3 he netted only $15,250 for (4 [1 Scott 1] 's nationally televised light heavyweight fight against {5 ranking contender 5} (5 Yaqui Lopez 5) last Saturday 4) .
S.5 (4 The fight , in which [1 Scott 1] won a unanimous decision over (5 Lopez 5) 4) , grossed $135,000 for [6 [3 Muhammed 3] 's firm 6] , {6 Triangle Productions of Newark 6} , <3 he said .
Source: http://ucrel.lancs.ac.uk/annotation.html

SGML
• Although none of the examples just shown uses it, SGML is widely recommended and used for all but the simplest of mark-up schemes
• SGML = Standard Generalized Mark-up Language
• Actually suitable for all sorts of things, including web pages (HTML is SGML-conformant)

What is a mark-up language?
• Mark-up historically referred to printer's marks on a manuscript indicating typesetting requirements
• It now covers all sorts of codes inserted into electronic texts to govern formatting, printing, or other information
• Mark-up, or (synonymously) encoding, is defined as any means of making explicit an interpretation of a text
• By "mark-up language" we mean a set of mark-up conventions used together for encoding texts. The language must specify:
  – what mark-up is allowed
  – what mark-up is required
  – how mark-up is to be distinguished from text
  – what the mark-up means
• SGML provides the means for doing the first three
• Separate documentation/software is required for the last, eg (1) the difference between identifying something as <emph> and how that appears in print; (2) why something may or may not be tagged as a "relative clause"

Rules of SGML
• SGML allows us to define
  – Elements
  – Specific features (attributes) of elements
  – Hierarchical/structural relations between elements
• These are specified in a "document type definition" (DTD)
• The DTD allows software to be written to
  – Help annotators annotate consistently
  – Explore marked-up documents

Elements in SGML
• Have a (unique) name
• The semantics of the name are application dependent
  – It is up to the designer to choose an appropriate name, but nothing automatically follows from the choice of any particular name
• Each element must be explicitly marked or tagged in some way
  – Most usual is with <element> and </element> pairs, called start- and end-tags
  – Much SGML-compliant software seems to allow start-only tags
  – &element; (esp. useful for single words or characters)
  – _tag suffix
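
READING TAGGED ELEMENTS GENERICALLY (PYTHON):
Because elements are explicitly tagged, generic software can walk a document without knowing what any particular element means. A sketch using Python's html.parser as a stand-in for a real SGML parser: like SGML software it tolerates omitted end-tags and unquoted attribute values, but it reads no DTD, so this is illustrative only.

    from html.parser import HTMLParser

    class ElementLister(HTMLParser):
        # Report every start- and end-tag together with its attributes.
        def handle_starttag(self, tag, attrs):
            print("start:", tag, dict(attrs))
        def handle_endtag(self, tag):
            print("end:  ", tag)

    ElementLister().feed("<poem id=12 status=revised><title>Song</title></poem>")
    # start: poem {'id': '12', 'status': 'revised'}
    # start: title {}
    # end:   title
    # end:   poem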
Attributes
• Elements can have named attributes with associated values
• When defined, values can be identified as
  – #REQUIRED: must be specified
  – #IMPLIED: optional
  – #CURRENT: inferred to be the same as the last specified value for that attribute
• Values can be from a predefined list, or can be of a general type (string, integer, etc)

DTD (document type definition)
• Helps to impose uniformity over the corpus
• Defines the (expected or to-be-imposed) structure of the document
• For each element, defines
  – How it appears (whether end-tags are required)
  – What its substructure is, ie what elements, how many of them, whether compulsory or not

Example of DTD
<!ELEMENT anthology      - - (poem+) >
<!ELEMENT poem           - - (title?, (stanza+ | couplet+)) >
<!ELEMENT title          - O (#PCDATA) >
<!ELEMENT stanza         - O (line+) >
<!ELEMENT couplet        - O (cline, cline) >
<!ELEMENT (line | cline) O O (#PCDATA) >
• Start- and end-tags necessary (-) or optional (O)
• An anthology consists of 1 or more poems
• A poem has an optional title, then 1 or more stanzas or 1 or more couplets
• A title consists of "parsed character data", ie normal text
• A stanza has one or more lines; a couplet has two clines
• Both line and cline have the same definition: normal text

Attributes (cont.)
<!ATTLIST poem id     ID                            #IMPLIED
               status (draft | revised | published) draft >
• The DTD defines the attributes expected/required for each element
• A poem has an id and a status
• The value of id is any identifier, and is optional
• status is one of three values, with default draft

<anthology>
<poem id=12 status=revised>
<title>It’s a grand old team</title>
<stanza>
<line>It’s a grand old team to play for
<line>It’s a grand old team to support
<line>And if you know your history
<line>It’s enough to make your heart go Whoooooah
</stanza>
</poem>
<poem id=13>
...
</poem>
</anthology>
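
USING THE DTD'S DEFAULTS AND STRUCTURE (PYTHON):
Given the DTD above, software can fill in defaulted attribute values and check substructure. A hedged sketch: html.parser again stands in for an SGML parser, and the default table is copied by hand from the <!ATTLIST poem ...> declaration rather than read from the DTD itself.

    from html.parser import HTMLParser

    class PoemChecker(HTMLParser):
        DEFAULTS = {"status": "draft"}   # from <!ATTLIST poem ... status ... draft>

        def handle_starttag(self, tag, attrs):
            if tag == "poem":
                # #IMPLIED id may be absent; status falls back to the default
                print("poem:", {**self.DEFAULTS, **dict(attrs)})
            elif tag == "stanza":
                self.count = 0
            elif tag == "line":          # <line> has no end-tag, as the DTD allows
                self.count += 1

        def handle_endtag(self, tag):
            if tag == "stanza":
                print("stanza of", self.count, "lines")

    PoemChecker().feed("<poem id=13><stanza><line>a<line>b</stanza></poem>")
    # poem: {'status': 'draft', 'id': '13'}
    # stanza of 2 lines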
Mark-up exemplified

RAW TEXT:
Two men retained their marbles, and as luck would have it they're both roughie-toughie types as well as military scientists - a cross between Albert Einstein and Action Man!

TOKENIZED TEXT:
<w orth=CAP>Two</w> <w>men</w> <w>retained</w> <w>their</w> <w>marbles</w><c PUN>,</c> <w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w> <w>they</w><w>'re</w> <w>both</w> <w>roughie-toughie</w> <w>types</w> <w>as</w> <w>well</w> <w>as</w> <w>military</w> <w>scientists</w> <c PUN>&mdash;</c> <w>a</w> <w>cross</w> <w>between</w> <w orth=CAP>Albert</w> <w orth=CAP>Einstein</w> <w>and</w> <w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>

LEMMATIZED TEXT:
<w orth=CAP>Two</w> <w lem=man>men</w> <w lem=retain>retained</w> <w>their</w> <w lem=marble>marbles</w><c PUN>,</c> <w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w> <w>they</w><w lem=be>'re</w> <w>both</w> <w>roughie-toughie</w> <w lem=type>types</w> <w>as</w> <w>well</w> <w>as</w> <w>military</w> <w lem=scientist>scientists</w> <c PUN>&mdash;</c> <w>a</w> <w>cross</w> <w>between</w> <w orth=CAP>Albert</w> <w orth=CAP>Einstein</w> <w>and</w> <w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>

POS TAGGED TEXT:
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> <w VVD lem=retain>retained</w> <w DPS>their</w> <w NN2 lem=marble>marbles</w><c PUN>,</c> <w CJC>and</w> <w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w> <w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w> <w AJ0>roughie-toughie</w> <w NN2>types</w> <w AV0>as</w> <w AV0>well</w> <w CJS>as</w> <w AJ0>military</w> <w NN2>scientists</w> <c PUN>&mdash;</c> <w AT0>a</w> <w NN1>cross</w> <w PRP>between</w> <w NP0>Albert</w> <w NP0>Einstein</w> <w CJC>and</w> <w NN1>Action</w> <w NN1-NP0>Man</w><c PUN>!</c>

POS TAGGED TEXT with idioms and named entities:
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> <phrase type=idiom><w VVD lem=retain>retained</w> <w DPS>their</w> <w NN2 lem=marble>marbles</w></phrase><c PUN>,</c> <w CJC>and</w> <phrase type=idiom><w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w></phrase> <w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w> <w AJ0>roughie-toughie</w> <w NN2>types</w> <phrase type=compound pos=CJS><w AV0>as</w> <w AV0>well</w> <w CJS>as</w></phrase> <phrase type=compound pos=NN2><w AJ0>military</w> <w NN2>scientists</w></phrase> <c PUN>&mdash;</c> <w AT0>a</w> <w NN1>cross</w> <w PRP>between</w> <phrase type=compound pos=NP0><w NP0>Albert</w> <w NP0>Einstein</w></phrase> <w CJC>and</w> <phrase type=compound pos=NP0><w NN1>Action</w> <w NN1-NP0>Man</w></phrase><c PUN>!</c>
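
EXTRICATING THE ANNOTATION (PYTHON):
Once the text is tagged as above, the annotation is extricable: the form, POS code and lemma of each word can be pulled back out. A rough sketch; it assumes every <w> element has an end-tag and treats the first bare attribute as the POS code, which holds for the example above but not necessarily for all BNC-style mark-up.

    import re

    W_ELEMENT = re.compile(r"<w\b([^>]*)>([^<]*)</w>")

    def words(marked_up):
        # Return (form, pos, lemma) triples from <w ...>...</w> elements.
        out = []
        for attrs, form in W_ELEMENT.findall(marked_up):
            pos, lemma = "", form            # lemma defaults to the surface form
            for a in attrs.split():
                if a.startswith("lem="):
                    lemma = a[4:]
                elif "=" not in a:           # bare attribute, eg NN2, VVD
                    pos = a
            out.append((form, pos, lemma))
        return out

    print(words("<w NN2 lem=man>men</w> <w VVD lem=retain>retained</w> <w DPS>their</w>"))
    # [('men', 'NN2', 'man'), ('retained', 'VVD', 'retain'), ('their', 'DPS', 'their')]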