THE SUSANNE CORPUS

Release 1, 6th September 1992

Geoffrey Sampson
School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, England
geoffs@uk.ac.susx.cogs

INTRODUCTION

The SUSANNE Corpus has been created, with the sponsorship of the Economic and Social Research Council (UK), as part of the process of developing a comprehensive NLP-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English.[1] The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.

The SUSANNE scheme may be likened to a "Linnaean taxonomy" of the grammatical domain: its aim (comparable to that of Linnaeus's eighteenth-century taxonomy for the domain of botany) is not to identify categories which are theoretically optimal or which necessarily reflect the psychological organization of speakers' linguistic competence, but simply to offer a scheme of categories and ways of applying them that make it practical for NLP researchers to register everything that occurs in real-life usage systematically and unambiguously, and for researchers at different sites to exchange empirical grammatical data without misunderstandings over local uses of analytic terminology. On reasons why such a scheme is needed at the present juncture in NLP research, see e.g. Sampson (1992, forthcoming).

Note that a sharp distinction is drawn here between the terms "scheme" and "system". A "parsing scheme", or "analytic scheme", refers to a range of notations and guidelines for using them which prescribe to a human analyst what the appropriate grammatical annotation for a language example should be.
A parsing "system" on the other hand refers to a software system which automatically produces analyses (according to some parsing scheme) of input language examples. A parsing scheme defines the target which a parsing system hits (or misses). The SUSANNE Corpus represents part of the definition of a parsing scheme. It has been produced largely manually, not as the output of an automatic parsing system. The SUSANNE analytic scheme is defined in detail in a book by myself, ENGLISH FOR THE COMPUTER, forthcoming from Oxford University Press. The Chairman of the Analysis and Interpretation Working Group of the US/EC-sponsored Text Encoding Initiative has proposed its adoption as a recognised TEI standard. The SUSANNE scheme aims to specify annotation norms for the modern English language; it does not cover other languages, although it is hoped that the general principles of the SUSANNE scheme may prove helpful in developing comparable taxonomies for these. Regrettably, Release 1 of the SUSANNE Corpus is not a "TEI-conformant" resource, though aspects of the annotation scheme have been decided in such a way as to facilitate a move to TEI conformance in later releases. The working timetable of the Initiative meant that relevant aspects of the TEI Guidelines were not yet complete at the point when the SUSANNE Corpus was ready for initial release; delaying this release would have been unfortunate. The brief description of the SUSANNE Corpus which follows cannot replace the very detailed statements to be found in ENGLISH FOR THE COMPUTER, and any user aiming to do serious work with the Corpus or its annotation scheme would need to consult the book. Nevertheless, it may be useful to have a summary statement included with the electronic Corpus. 
The present SUSANNE annotation scheme originated in work carried out by myself in collaboration with Professor Geoffrey Leech FBA and others in the years 1983-85 to produce a database of manually analysed sentences from the LOB Corpus of written British English, as a source of statistics for probabilistic automatic-parsing techniques; this database, which has not been (and will not now be) published, is described in Garside et al. (1987: ch. 7). The annotation scheme of this "Lancaster-Leeds Treebank" represented surface grammar only, without indications of logical form. It subsequently seemed desirable to extend this scheme to include methods for representing logical grammar, and to refine both surface and logical aspects of the annotation scheme by applying it to a larger body of texts. The only way that a parsing scheme can in practice be made increasingly adequate is in the way that the English Common Law develops, by collecting and systematizing the body of precedents generated through detailed consideration of more and more individual cases that arise in real life. Accordingly, Project SUSANNE took a subset of the Brown Corpus of written American English which had been manually analysed by Alvar Elleg<aring>rd's group at Gothenburg (Elleg<aring>rd 1978), and reworked the annotations in this under-used resource in order to turn them into a scheme consistent with that used in the Lancaster-Leeds Treebank but including specifications of logical as well as surface structure: several categories of information not indicated in either Lancaster-Leeds or Gothenburg schemes were also added.[2] (On Brown and LOB Corpora, see e.g. Garside et al. (1987: 4-5).) The finished SUSANNE parsing scheme has thus been developed on the basis of samples of both British and American English. 
It is oriented chiefly towards written language; however, on another project sponsored by the Royal Signals and Radar Establishment[3] my team produced extensions to the SUSANNE scheme for annotating the distinctive grammatical phenomena of spoken English, and these extensions are specified in ENGLISH FOR THE COMPUTER (though they are not used in the SUSANNE Corpus and are not discussed further here). It should be noted also that the scheme has emerged through a process of detailed critical discussion of analytic standards by some ten people over a decade; apart from myself, the leading role in the early years of these discussions was taken by Geoffrey Leech, whose standing as an English grammarian needs no emphasis.

The SUSANNE Corpus itself comprises an approximately 128,000-word subset of the Brown Corpus of American English, annotated in accordance with the SUSANNE scheme. The original motives for producing this database included that of providing better statistics for probabilistic parsing; but in this respect Project SUSANNE was overtaken after its inception by projects (notably Mitchell Marcus's Pennsylvania Treebank project, cf. Marcus & Santorini (forthcoming)) which have used quasi-industrial methods to generate far larger bodies of grammatically-analysed material. However, the SUSANNE scheme may be unparalleled in the extent to which its categories have been refined and tested through detailed consideration of the almost endless small quirks of the texts to which they have been applied, and in the degree of precision to which the resulting guidelines for using the categories have been documented -- thus defining analytic standards which permit annotation of future material to be extremely self-consistent.
Accordingly the SUSANNE Corpus is offered to the research community primarily as a demonstration of the application of the parsing scheme, evidencing the fact that the scheme has survived the test of experience rather than being a merely aprioristic system. The SUSANNE Corpus functions, as it were, like a collection of type specimens appended to a botanical taxonomy.

Although the accompanying first release of the SUSANNE Corpus has undergone considerable proof-checking, it unquestionably still contains many errors. I intend to correct these in future releases; I shall be extremely grateful if users discovering errors will log these and send details to me, preferably by post rather than e-mail.

STRUCTURE OF THE CORPUS

The SUSANNE Corpus consists of 64 files (apart from this documentation file), each containing an annotated version of one 2000+ word text from the Brown Corpus. Files average about 83 kilobytes in size, so the entire Corpus totals about 5.3 megabytes. The file names are those of the respective Brown texts, e.g. A01, N18. Sixteen texts are drawn from each of the following Brown genre categories:

    A    press reportage
    G    belles lettres, biography, memoirs
    J    learned (mainly scientific and technical) writing
    N    adventure and Western fiction

The Corpus thus samples each of the four broad genre groups established on the basis of word-frequency data by Hofland & Johansson (1982: 27).[4]

Each file has a line (terminating in a newline character) for each "word" of the original text; but "words" for SUSANNE purposes are often smaller than words in the ordinary orthographic sense: for instance, punctuation marks and the apostrophe-s suffix are treated as separate words and assigned lines of their own. (For details on the rules by which orthographic words are segmented, as well as on all other analytic matters mentioned below, see ENGLISH FOR THE COMPUTER.)
Each line of a SUSANNE file has six fields separated by tabs (that is, there is one tab after each of fields 1 to 5, but a newline after field 6). Each field on every line contains at least one character. The six fields on each line are:

    1    reference
    2    status
    3    wordtag
    4    word
    5    lemma
    6    parse

Apart from the tab and newline characters used to structure fields and records, all bytes in each of the 64 SUSANNE files are drawn from a subset of the 94 graphic character allocations of the International Reference Version ("IRV") of ISO 646:1983 "Information Processing -- ISO 7-bit coded character set for information interchange", from hexadecimal 21 (exclamation mark) to hex 7E (tilde). These codes are assumed for SUSANNE purposes to represent the graphic symbols assigned by the IRV system. Twelve members of the IRV character set are not used in the Corpus, namely (all codes hexadecimal):

    23    gate
    24    generalized currency unit
    27    prime
    2F    solidus
    5C    reverse solidus
    5E    circumflex
    5F    underline
    60    grave
    7B    opening curly bracket
    7C    vertical bar
    7D    closing curly bracket
    7E    tilde

The space character, hex 20, which is classified by ISO 646 as a control code, also does not occur in the SUSANNE Corpus.

Where text characters cannot be adequately represented directly within the resulting 82-member character set, they are represented by entity names within angle brackets. Where possible these are drawn from Appendix D to ISO 8879:1986, "Information Processing -- Text & Office Systems -- Standard Generalized Markup Language (SGML)". For instance, "<eacute>" stands for lower-case "e" with acute accent. Symbols in angle brackets are used also to represent such things as typographical shifts, which for purposes of grammatical analysis are conveniently represented as items within the word-sequence: e.g. "<bital>" stands for "begin italics".

REFERENCE FIELD

The reference field contains nine bytes which give each line a reference number that is unique across the SUSANNE Corpus, e.g. "N06:1530t".
The first three bytes (here N06) are the file name; the fourth byte is always a colon; bytes 5 to 8 (here 1530) are the number of the line in the "Bergen I" version of the Brown Corpus on which the relevant word appears (Brown line numbers normally increment in tens, with occasional odd numbers interpolated); and the ninth byte is a lower-case letter differentiating successive words that appear on the same Brown line. (SUSANNE lines are lettered continuously from "a", omitting "l" and "o".)

STATUS FIELD

The status field contains one byte. The letters "A" and "S" show that the word is an "abbreviation" or "symbol", respectively, as defined by Brown Corpus codes (Francis & Ku<ccaron>era 1989: 12). The letter "E" shows that the word is (or is part of) a misprint or solecism in the original text (details are logged in ENGLISH FOR THE COMPUTER). On the great majority of lines, to which none of these three categories apply, the status field contains a hyphen character.

WORDTAG FIELD

The SUSANNE wordtag set is based on the "Lancaster" tagset listed in Garside et al. (1987: Appendix B); additional grammatical distinctions have been drawn in this set, and these are indicated by suffixing lower-case letters to the Lancaster tags. For instance, "revealing" is tagged "VVG" (present participle of verb) in the Lancaster scheme, but as "VVGt" (present participle of transitive verb) in the SUSANNE scheme. Apart from the lower-case extensions, the wordtags are normally identical to the Lancaster tags: punctuation marks are assigned alphabetical tags beginning Y... (e.g. YC for comma), and the dollar sign which appears in some Lancaster tags for genitive words is replaced by G (e.g. GG for the apostrophe-s suffix), so that the modified Lancaster tags always consist wholly of alphanumeric characters, beginning with two capital letters. (In a few cases, tags from the Lancaster set have been merged or eliminated from the SUSANNE scheme in the light of experience.)
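The record layout, character set, reference field and status field described above lend themselves to mechanical checking. The following Python fragment is a minimal sketch of how a user might read one Corpus line; it is not part of the SUSANNE distribution. The names SusanneLine, parse_line and ALLOWED_CHARS are hypothetical, and the sample line is invented for illustration (only the reference "N06:1530t" and the tag "VVGt" for "revealing" are taken from this document).

```python
from dataclasses import dataclass

# The 94 IRV graphic characters (hex 21 to 7E) minus the twelve unused codes.
UNUSED_CODES = {0x23, 0x24, 0x27, 0x2F, 0x5C, 0x5E, 0x5F, 0x60,
                0x7B, 0x7C, 0x7D, 0x7E}
ALLOWED_CHARS = {chr(c) for c in range(0x21, 0x7F)} - {chr(c) for c in UNUSED_CODES}


@dataclass
class SusanneLine:
    file_name: str    # bytes 1-3 of the reference field, e.g. "N06"
    brown_line: int   # bytes 5-8: line number in the Brown Corpus
    word_letter: str  # byte 9: distinguishes words on the same Brown line
    status: str       # "A", "S", "E" or "-"
    wordtag: str
    word: str
    lemma: str
    parse: str


def parse_line(line: str) -> SusanneLine:
    """Split one Corpus line into its six tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6 or any(field == "" for field in fields):
        raise ValueError("expected six non-empty tab-separated fields")
    for field in fields:
        if not set(field) <= ALLOWED_CHARS:
            raise ValueError(f"character outside the 82-member set in {field!r}")
    reference, status, wordtag, word, lemma, parse = fields
    if len(reference) != 9 or reference[3] != ":":
        raise ValueError(f"malformed reference field: {reference!r}")
    if status not in ("A", "S", "E", "-"):
        raise ValueError(f"unknown status byte: {status!r}")
    return SusanneLine(
        file_name=reference[:3],
        brown_line=int(reference[4:8]),
        word_letter=reference[8],
        status=status,
        wordtag=wordtag,
        word=word,
        lemma=lemma,
        parse=parse,
    )


# Invented sample line, for illustration only.
sample = "N06:1530t\t-\tVVGt\trevealing\treveal\t.\n"
record = parse_line(sample)
print(record.file_name, record.brown_line, record.word_letter, record.wordtag)
```

Rejecting any character outside the 82-member set at read time is a design choice of this sketch; a more tolerant reader might log such lines instead.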
The tag YG appears in the wordtag field to represent a "trace" -- the logical position of a constituent which has been shifted elsewhere, or deleted, in the surface grammatical structure.

The SUSANNE tagset comprises 352 distinct wordtags, not counting tags for elements of "grammatical idioms" (see below); a few of these wordtags never occur in the SUSANNE Corpus. The wordtags are listed, and their application rigorously defined, in ENGLISH FOR THE COMPUTER -- in the case of closed wordclasses, by enumeration of their members, and in the case of open classes by rules for choice between alternative tags. These rules refer to information in a specified published dictionary (the OXFORD ADVANCED LEARNER'S DICTIONARY OF CURRENT ENGLISH, 3rd edition).

WORD FIELD

The word field contains a segment of the text, often coinciding with a word in the orthographic sense but sometimes, as noted above, including only part of an orthographic word. In general the word field represents all and only those typographical distinctions in the original documents which were recorded in the Brown Corpus (Francis & Ku<ccaron>era 1989: 10-15), though in certain cases the SUSANNE Corpus has gone behind the Brown Corpus to reconstruct typographical details omitted from Brown. Certain characters have special meanings in the word field, as follows:

    +        (occurs only as first byte of the word field) shows that the contents of the field were not separated in the original text from the immediately-preceding text segment by whitespace (e.g. in the case of a punctuation mark, or part of a hyphenated sequence split over successive SUSANNE lines);

    -        the line corresponds to no text material (it represents the "trace" for a grammatically-moved element);

    <...>    enclose entity names for special typographical features, as discussed above, either taken from ISO 8879:1986 Appendix D or created for the SUSANNE Corpus -- for instance "<pand>" stands for "either plus sign or ampersand", since the Brown Corpus makes no distinction between these characters.

LEMMA FIELD

The lemma field shows the dictionary headword of which the text word is a form: the field shows base forms for words which are inflected in the text, and eliminates typographical variations (such as sentence-initial capitalization) which are not inherent to the word but relate to its use in context. (In the case of "words" to which the dictionary-form concept is inappropriate, e.g. numerals and punctuation marks, the lemma field contains a hyphen.) Orthographic forms in the lemma field are those of a specified dictionary (the OXFORD ADVANCED LEARNER'S DICTIONARY OF CURRENT ENGLISH, 3rd edition).

Project SUSANNE aimed also to indicate the senses which polysemous words bear in context, via codes relating word-tokens to numbered subsenses in a specified dictionary. The book ENGLISH FOR THE COMPUTER provides a detailed coding scheme for representing this information. Unfortunately, this aspect of the project's output proved to contain a number of inadequacies, and the information does not appear in Release 1 of the Corpus. It is hoped to include it in later releases.

PARSE FIELD

The contents of the sixth field represent the central raison d'<ecirc>tre of the SUSANNE Corpus. They code the grammatical structure of texts as a sequence of labelled trees, having a leaf node for each Corpus line. Each text is treated as a sequence of "paragraphs" separated by "headings".
A "paragraph" normally coincides with an ordinary orthographic paragraph; a "heading" may consist of actual verbal material, or may be merely a typographical paragraph division, symbolized "<minbrk>" in the word field. Conceptually, the structure of each paragraph or heading is a labelled tree with root node labelled "O" ("Oh" for a heading), and with a leaf node labelled with a wordtag for each SUSANNE word or trace, i.e. each line of the Corpus. There will commonly be many intermediate labelled nodes.

Such a tree is represented as a bracketed string in the ordinary way, with the labels of nonterminal nodes written "inside" both opening and closing brackets (that is, to the right of opening brackets and to the left of closing brackets). This bracketed string is then adapted as follows for inclusion in successive SUSANNE parse fields. Wherever an opening bracket immediately follows a closing bracket, the string is segmented, yielding one segment per leaf node; and within each such segment, the sequence opening-bracket + wordtag + closing-bracket, representing the leaf node, is replaced by a full stop. Thus each parse field contains exactly one full stop, corresponding to a terminal node labelled with the contents of the wordtag field, sometimes preceded by labelled opening bracket(s) and sometimes followed by labelled closing bracket(s), corresponding to higher tagmas which begin or end with the word on the line in question. Brackets are square except in the case of nodes immediately dominating the "trace" wordtag YG, which are represented with angle brackets.

Nonterminal node labels in the SUSANNE scheme contain up to three types of information: a FORMTAG, a FUNCTIONTAG, and an INDEX, in that order. In a label containing a formtag and one or both of the other two elements, a colon separates the formtag from the other elements.
A functiontag is always a single alphabetic character, and an index is a sequence of three digits; restrictions on valid combinations of elements within a node label mean that complex labels can always be unambiguously decomposed into their elements.

RANKS OF CONSTITUENT

Apart from nodes immediately dominating traces, all nodes have labels including formtags, which identify the internal properties of the word or word-sequence dominated by the node. The shape of a parse-tree is defined in terms of a hierarchy of formtag ranks:

    1    wordlevel formtags (begin with two capital letters; formtags of all other ranks begin with one capital and contain no further capitals)
    2    phraselevel formtags (begin with one of: N V J R P D M G)
    3    clauselevel formtags (begin with one of: S F T Z L A W)
    4    rootlevel formtags (begin with one of: O Q I)

Each grammatical clause, whether consisting of one or more words, is given a node labelled with a clauselevel formtag. Each immediate constituent of a clause, whether there are one or more such constituents and whether the constituent consists of one or more words, is given a node labelled with a phraselevel formtag, unless the constituent belongs to a wordlevel category that has no corresponding phraselevel category (e.g. punctuation marks, conjunctions), or to a rootlevel category (e.g. a direct quotation, formtagged Q). Thus a clause consisting of one verb will be assigned a clauselevel formtag (e.g. Tg for present-participle clause) which singulary dominates a phraselevel formtag (e.g. Vg for "verb group beginning with present participle") which in turn singulary dominates a wordlevel formtag (e.g. VVGi for "present participle of intransitive verb"). Other than by these rules, and in certain other limited circumstances specified in ENGLISH FOR THE COMPUTER, singulary branching does not occur.
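The parse-field encoding, the node-label syntax and the rank hierarchy described above can be illustrated with a short Python sketch. This is an illustration under stated assumptions rather than a definitive reader: the function names are hypothetical, the sample parse fields in the usage lines are invented, and the caller is assumed to know (from the angle brackets) whether a label belongs to a node immediately dominating a trace, since such labels carry no formtag.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class NodeLabel(NamedTuple):
    formtag: Optional[str]
    functiontag: Optional[str]
    index: Optional[str]


# Optional single-letter functiontag followed by an optional three-digit index.
_TAIL = re.compile(r"^([A-Za-z]?)(\d{3})?$")


def _split_tail(tail: str) -> Tuple[Optional[str], Optional[str]]:
    match = _TAIL.match(tail)
    if match is None:
        raise ValueError(f"unrecognised functiontag/index part: {tail!r}")
    return (match.group(1) or None, match.group(2))


def decompose_label(label: str, dominates_trace: bool = False) -> NodeLabel:
    """Split a nonterminal label into formtag, functiontag and index."""
    if ":" in label:
        formtag, tail = label.split(":", 1)
        functiontag, index = _split_tail(tail)
        return NodeLabel(formtag, functiontag, index)
    if dominates_trace:
        # Assumption: trace-dominating nodes have no formtag at all.
        functiontag, index = _split_tail(label)
        return NodeLabel(None, functiontag, index)
    return NodeLabel(label, None, None)


def formtag_rank(formtag: str) -> str:
    """Classify a formtag by the rank rules listed above."""
    if len(formtag) >= 2 and formtag[0].isupper() and formtag[1].isupper():
        return "word"
    if formtag[0] in "NVJRPDMG":
        return "phrase"
    if formtag[0] in "SFTZLAW":
        return "clause"
    if formtag[0] in "OQI":
        return "root"
    raise ValueError(f"not a recognised formtag: {formtag!r}")


def split_parse_field(parse: str) -> Tuple[List[str], List[str]]:
    """Return (labels opened before the full stop, labels closed after it)."""
    before, stop, after = parse.partition(".")
    if stop != "." or "." in after:
        raise ValueError("a parse field must contain exactly one full stop")
    opening = [label for label in re.split(r"[\[<]", before) if label]
    closing = [label for label in re.split(r"[\]>]", after) if label]
    return opening, closing


# Invented parse fields, for illustration only.
print(split_parse_field("[O[S[Nns:s."))
print(split_parse_field(".Nns:s]S]"))
print(decompose_label("Nns:O999"), formtag_rank("Nns"))
```

Labels are kept as plain strings here; a fuller reader would presumably build a tree by pushing each opened label on a stack and popping it when the matching closing label is met.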
An intermediate phraselevel node is inserted between a higher phraselevel node and a sequence of words dominated by it only if two or more of those words form a coherent constituent within the higher phrase. A clause which fills a slot standardly filled by a phrase (e.g. a nominal clause as subject or object) will not have a phrase node above the clause node unless the clause proper is preceded and/or followed by modifying elements that are not part of the clause. Detailed rules for deciding constituency in various debatable cases, for placing items such as punctuation marks within parse trees, etc. are laid down in ENGLISH FOR THE COMPUTER.

FUNCTIONTAGS AND INDICES

Functiontags, identifying roles such as surface subject, logical object, time adjunct, are assigned to all immediate constituents of clauses, except for their verb-group heads and certain other constituents for which function labelling is inappropriate.

Indices are assigned to pairs of nodes to show referential identity between items which are in certain defined grammatical relationships to one another. For instance, a phrase raised out of a lower clause to act as object in a higher clause, as in "John expected Mary to admit it", will be assigned an index identical to that assigned to the trace showing the logical position of the item in the lower clause. The (artificial) example quoted would be represented as:

    [Nns:s John] expected [Nns:O999 Mary] [Ti:o <s999 TRACE> to admit [Ni:o it]]

-- where the index 999 shows that the trace acting as logical subject (symbolized s) of the "admit" clause is coreferential with "Mary" which acts as surface object (O) of the "expected" clause; the logical object (o) of the "expected" clause being the infinitival subordinate clause (Ti).
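As a rough illustration of how indices tie nodes together, the following Python sketch collects the labels in the artificial example above and groups those that share a three-digit index. The function name index_groups and the regular expressions are assumptions made for this sketch; they handle only the simplified notation of the example, in which labels follow opening brackets.

```python
import re
from collections import defaultdict
from typing import Dict, List

# A label is the run of non-space, non-bracket characters after "[" or "<".
LABEL = re.compile(r"[\[<]([^\s\[\]<>]+)")
# An index is a sequence of three digits at the end of a label.
INDEX = re.compile(r"(\d{3})$")


def index_groups(bracketed: str) -> Dict[str, List[str]]:
    """Map each index to the labels that carry it, in order of appearance."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for label in LABEL.findall(bracketed):
        match = INDEX.search(label)
        if match:
            groups[match.group(1)].append(label)
    return dict(groups)


example = ("[Nns:s John] expected [Nns:O999 Mary] "
           "[Ti:o <s999 TRACE> to admit [Ni:o it]]")
print(index_groups(example))
# {'999': ['Nns:O999', 's999']}
```

The single group returned pairs the surface object "Mary" with the trace that marks the logical subject of the "admit" clause.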
In some cases, movement rules displace a constituent into a tagma within which it has no grammatical role (for instance, an adverb which is logically a clause constituent may interrupt the verb group -- sequence of auxiliary verbs and main verb -- of the clause): in such cases the functiontag is G ("guest"). Constituents which do not logically belong below the node which immediately dominates them in surface structure are always given G functiontags and indices linking them to their logical position. With that exception (and with one other exception not discussed here relating to co-ordination), functiontagging is used only for immediate constituents of clauses.

ENGLISH FOR THE COMPUTER lists the categories of surface/logical-grammar discordance which are represented by the SUSANNE scheme, and the approved methods of representing them. The SUSANNE analysis is always chosen so as to be as far as possible neutral as between alternative linguistic theories.

THE FORMTAGS

The SUSANNE formtags are as follows:

Rootlevel Formtags

    O     paragraph
    Oh    heading
    Ot    title (e.g. of book)
    Q     quotation
    I     interpolation
    Iq    tag question
    Iu    scientific citation

Clauselevel Formtags

    S     main clause
    Ss    quoting clause embedded within quotation
    Fa    adverbial clause
    Fn    nominal clause
    Fr    relative clause
    Ff    "fused" relative
    Fc    comparative clause
    Tg    present participle clause
    Ti    infinitival clause
    Tn    past participle clause
    Tf    "for-to" clause
    Tb    "bare" nonfinite clause
    Tq    infinitival relative clause
    Z     reduced ("whiz-deleted") relative clause
    L     other verbless clause
    A     special "as" clause
    W     "with" clause

Phraselevel Formtags

    N     noun phrase
    V     verb group
    J     adjective phrase
    R     adverb phrase
    P     prepositional phrase
    D     determiner phrase
    M     numeral phrase
    G     genitive phrase

The various phrase categories take lower-case subcategory symbols which can be combined in any meaningful combination (e.g. the verb group "must have been noticed" would be formtagged "Vcfp").
The phrase subcategories are:

    Vo    operator section of verb group, when separated from remainder of V e.g. by subject-auxiliary inversion
    Vr    remainder of V from which Vo has been separated
    Vm    V beginning with "am"
    Va    V beginning with "are"
    Vs    V beginning with "was"
    Vz    V beginning with other 3rd-singular verb
    Vw    V beginning with "were"
    Vj    V beginning with "be"
    Vd    V beginning with past tense
    Vi    infinitival V
    Vg    V beginning with present participle
    Vn    V beginning with past participle
    Vc    V beginning with modal
    Vk    V containing emphatic DO
    Ve    negative V
    Vf    perfective V
    Vu    progressive V
    Vp    passive V
    Vb    V ending with BE
    Vx    V lacking main verb
    Vt    catenative V

    Nq    "wh-" N
    Nv    "wh...ever" N
    Ne    "I/me" head
    Ny    "you" head
    Ni    "it" head
    Nj    adjective head
    Nn    proper name
    Nu    unit noun head
    Na    marked as subject
    No    marked as nonsubject
    Ns    singular N
    Np    plural N

    Jq    "wh-" J
    Jv    "wh...ever" J
    Jx    measured absolute J
    Jr    measured comparative J
    Jh    postmodified J

    Rq    "wh-" R
    Rv    "wh...ever" R
    Rx    measured absolute R
    Rr    measured comparative R
    Rs    adverb conducive to asyndeton
    Rw    quasi-nominal adverb

    Po    "of" phrase
    Pb    "by" phrase
    Pq    "wh-" P
    Pv    "wh...ever" P

    Dq    "wh-" D
    Dv    "wh...ever" D
    Ds    singular D
    Dp    plural D

    Ms    M headed by "one"

NON-ALPHANUMERIC FORMTAG SUFFIXES

Formtags may also contain non-alphanumeric symbols, including:

    ?    interrogative clause
    *    imperative clause
    %    subjunctive clause
    !    exclamatory clause or other item
    "    vocative item

Other non-alphanumeric symbols represent co-ordination structure. Under the SUSANNE scheme, second and subsequent conjuncts in a co-ordination are analysed as subordinate to the first conjunct; thus a co-ordination of the form:

    chi, psi, and omega

(whatever the grammatical rank of the word-sequences chi, psi, etc.) would be assigned a structure of the form:

    [chi, [psi], [and omega]]

The formtag of the entire co-ordination is determined by the properties of the first conjunct (except for singular/plural subcategories in the case of phrase categories to which these apply); the later conjuncts (which will often be transformationally reduced) have nodes of their own whose formtags mark them as "subordinate conjuncts". The following symbols relate to co-ordination (and apposition) structure:

    +    subordinate conjunct introduced by conjunction
    -    subordinate conjunct not introduced by conjunction
    @    appositional element
    &    co-ordinate structure acting as first conjunct within a higher co-ordination (marked in certain cases only)

Co-ordination is recognised as occurring between words as well as between higher-rank tagmas. Therefore nonterminal nodes may have formtags consisting of wordtags followed by co-ordination symbols, thus (using "WT" to stand for an arbitrary wordtag):

    WT&    co-ordination of words
    WT+    conjunct within wordlevel co-ordination that is introduced by a conjunction
    WT-    conjunct within wordlevel co-ordination not introduced by a conjunction

(A wordlevel co-ordination always takes an ampersand on its formtag; phrase or clause co-ordinations do so only in very restricted circumstances.)

Also, certain sequences of orthographic words, in certain uses, are regarded as functioning grammatically as single words ("grammatical idioms"). For instance, "none the less" would normally be treated as a grammatical idiom, equivalent to an adverb (for which the wordtag is RR). In such cases, the nonterminal node dominating the sequence has a formtag consisting of an equals sign suffixed to the corresponding wordtag; and the individual words composing the grammatical idiom are not wordtagged in their own right, but receive tags with numerical suffixes reflecting their membership of an idiom.
(The sequence "none the less" would be formtagged RR=, and the words "none", "the", and "less" in this context would be wordtagged RR31 RR32 RR33.) ENGLISH FOR THE COMPUTER includes exhaustive listings of closed-class grammatical idioms.

Note that formtags of the forms WT& WT+ WT- WT= rank as wordlevel formtags for the purposes of determining tree structure as discussed above.

THE FUNCTIONTAGS

Functiontags divide into COMPLEMENT and ADJUNCT tags: broadly, a given complement tag can occur at most once in any clause, but a clause may contain multiple adjuncts of the same type. The scheme of adjunct categories has been developed from the classification of Quirk et al. (1985).

Complement Functiontags

    s    logical subject
    o    logical direct object
    S    surface (and not logical) subject
    O    surface (and not logical) direct object
    i    indirect object
    u    prepositional object
    e    predicate complement of subject
    j    predicate complement of object
    a    agent of passive
    n    particle of phrasal verb
    z    complement of catenative
    x    relative clause having higher clause as antecedent
    G    "guest" having no grammatical role within its tagma

Adjunct Functiontags

    p    place
    q    direction
    t    time
    h    manner or degree
    m    modality
    c    contingency
    r    respect
    w    comitative
    k    benefactive
    b    absolute

Detailed guidelines for the application of these functional categories are included in ENGLISH FOR THE COMPUTER.

NOTES

[1] The support of the Economic and Social Research Council (ESRC) is gratefully acknowledged. Project SUSANNE, "Construction of an Analysed Corpus of English", was funded by ESRC award no. R00023 1142, over the period 1988 to 1992. "SUSANNE" stands for "Surface and underlying structural analyses of naturalistic English". I should like to express my warmest thanks to the team who worked on Project SUSANNE, namely Robin Haigh, H<eacute>l<egrave>ne Knight, Tim Willis, and Nancy Glaister.

[2] I thank Alvar Elleg<aring>rd for permission to circulate a research resource derived from the work of his group.
[3] Ministry of Defence research contract no. D/ER/1/9/4/2062/151(RSRE), "A Speech-Oriented Stochastic Parser".

[4] The Brown texts included in the Gothenburg and hence the SUSANNE Corpora are as follows:

    A01  A09  G01  G09  J01  J09  N01  N09
    A02  A10  G02  G10  J02  J10  N02  N10
    A03  A11  G03  G11  J03  J12  N03  N11
    A04  A12  G04  G12  J04  J17  N04  N12
    A05  A13  G05  G13  J05  J21  N05  N13
    A06  A14  G06  G17  J06  J22  N06  N14
    A07  A19  G07  G18  J07  J23  N07  N15
    A08  A20  G08  G22  J08  J24  N08  N18

For details on the source publications from which these texts were sampled, see Francis & Ku<ccaron>era (1989).

REFERENCES

A. Elleg<aring>rd (1978) The Syntactic Structure of English Texts. Gothenburg Studies in English, 43.

W.N. Francis & H. Ku<ccaron>era (1989) Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for use with Digital Computers (corrected and revised edition). Department of Linguistics, Brown University, Providence, Rhode Island.

R.G. Garside, G.N. Leech, & G.R. Sampson, eds. (1987) The Computational Analysis of English. Longman.

K. Hofland & S. Johansson (1982) Word Frequencies in British and American English. Longman.

M.P. Marcus & Beatrice Santorini (forthcoming) "Building very large natural language corpora: the Penn Treebank". To appear in N. Ostler, ed., Proceedings of the 1992 Pisa Symposium on European Textual Corpora.

R. Quirk, S. Greenbaum, G. Leech, & J. Svartvik (1985) A Comprehensive Grammar of the English Language. Longman.

G.R. Sampson (1992) "Probabilistic parsing". In J. Svartvik, ed., Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Mouton de Gruyter.

G.R. Sampson (forthcoming) "The need for grammatical stocktaking". To appear in N. Ostler, ed., Proceedings of the 1992 Pisa Symposium on European Textual Corpora.