From: AAAI Technical Report WS-92-01. Compilation copyright © 1992, AAAI (www.aaai.org). All rights reserved.

A Richly Annotated Corpus for Probabilistic Parsing

Clive Souter and Eric Atwell
Centre for Computer Analysis of Language and Speech
Division of Artificial Intelligence
School of Computer Studies
University of Leeds
Leeds LS2 9JT
United Kingdom
Tel: +44 532 335460 and 335761
Email: cs@ai.leeds.ac.uk and eric@ai.leeds.ac.uk

Abstract

This paper describes the use of a small but syntactically rich parsed corpus of English in probabilistic parsing. Software has been developed to extract probabilistic systemic-functional grammars (SFGs) from the Polytechnic of Wales Corpus in several formalisms, and could equally well be applied to other parsed corpora. To complement the large probabilistic grammar, we discuss progress in the provision of lexical resources, which range from corpus wordlists to a large lexical database supplemented with word frequencies and SFG categories. The lexicon and grammar resources may be used in a variety of probabilistic parsing programs, one of which is presented in some detail: the Realistic Annealing Parser. Compared to traditional rule-based methods, such parsers usually implement complex algorithms, and are relatively slow, but are more robust in providing analyses for unrestricted and even semi-grammatical English.

1. Introduction

1.1 Aim

The aim of this paper is to present resources and techniques for statistical natural language parsing that have been developed at Leeds over the last 5-6 years, focusing on the exploitation of the richly annotated Polytechnic of Wales (POW) Corpus to produce large-scale probabilistic grammars for the robust parsing of unrestricted English.

1.2 Background: Competence versus performance

The dominant paradigm in natural language processing over the last three decades has been the use of competence grammars. Relatively small sets of rules have been created intuitively by researchers whose primary interest was to demonstrate the possibility of integrating a lexicon and grammar into an efficient parsing program. A fairly recent example of the competence approach for English is the UK Government-sponsored Alvey Natural Language Toolkit, which contains a GPSG-like grammar which expands to an object grammar of over 1,000 phrase-structure rules (Grover et al 1989). The toolkit also includes an efficient chart parser, a morphological analyser, and a 35,000-word lexicon.

However, over the last five years or so, interest has grown in the development of robust NLP systems with much larger lexical and grammatical coverage, and with the ability to provide a best-fit analysis for a sentence when, despite its extended coverage, the grammar does not describe the required structure. Some researchers have attempted to provide ad hoc rule-based solutions to handle such lexical and grammatical shortfalls (see, for example, ACL 1983, Charniak 1983, Cliff and Atwell 1987, Kwasny and Sondheimer 1981, Heidorn et al 1982, Weischedel and Black 1980). Others, including some at Leeds, have investigated the use of probabilistic parsing techniques in the quest for more reliable systems (e.g. Atwell and Elliott 1987, Atwell 1988, 1990). One of the central principles behind the adoption of probabilities (unless this is done simply to order the solutions produced by a conventional parser) is the forfeiture of a strict grammatical/ungrammatical distinction in favour of a sliding scale of likelihood. Structures which occur very frequently are those which one might call grammatical, and those which occur infrequently or do not occur at all are either genuinely rare, semi-grammatical or ungrammatical.

The source of grammatical frequency information is a parsed corpus: a machine-readable collection of utterances which have been analysed by hand according to some grammatical description and formalism. Corpora may be collected for both spoken and written text, transcribed (in the case of spoken texts), and then (laboriously) grammatically annotated. Such a grammar may be called a performance grammar.

Other ongoing research at Leeds related to corpus-based probabilistic parsing is surveyed in (Atwell 1992), including projects sponsored by the Defence Research Agency and British Telecom, and half a dozen PhD student projects.

1.3 Background at Leeds

Research on corpus-based parsing at Leeds has revolved around more than one project and corpus. In 1986 the RSRE Speech Research Unit (now part of the UK Defence Research Agency) funded a three-year project called APRIL (Annealing Parser for Realistic Input Language) (see Haigh et al 1988, Sampson et al 1989). In this project, Haigh, Sampson and Atwell developed a stochastic parser based on the Leeds-Lancaster Treebank (Sampson 1987a), a 45,000-word subset of the Lancaster-Oslo/Bergen (LOB) Corpus (Johansson et al 1986) annotated with parse trees. The grammar was extracted from the corpus in the form of a probabilistic RTN, and used to evaluate the solution trees found by a simulated annealing search for some new sentence. At the end of the 'annealing run', provided the annealing schedule has been properly tuned, a single best-fit parse tree is found. (To simplify matters, in the early stages of the APRIL project, an ordered list of word tags was used as input, rather than a list of words.) The parsing technique is quite complex compared to rule-based alternatives, and is much slower, but has the benefit of always producing a solution tree.

In 1987, the RSRE Speech Research Unit, ICL, and Longman sponsored the COMMUNAL project (COnvivial Man-Machine Understanding through NAtural Language) at University of Wales College at Cardiff (UWCC) and the University of Leeds.
The COMMUNAL project aimed to develop a natural language interface to knowledge-based systems. The Cardiff team, led by Robin Fawcett, were responsible for knowledge representation, sentence generation and the development of a large systemic functional grammar. At Leeds, Atwell and Souter concentrated on developing a parser for the grammar used in the generator. A number of parsing techniques were investigated. One of these involved modifying the simulated annealing technique to work with words, rather than tags, as input, parsing from left to right, and constraining the annealing search space with judicious use of probability density functions. In the COMMUNAL project, two corpus sources were used for the grammatical frequency data: firstly, the Polytechnic of Wales (POW) Corpus, 65,000 words of hand-analysed children's spoken English (Fawcett and Perkins 1980, Souter 1989); secondly, the potentially infinitely large Ark Corpus, which consists of randomly generated tree structures for English, produced by the COMMUNAL NL generator, GENESYS (Wright 1988, Fawcett and Tucker 1989). The Realistic Annealing Parser (RAP) produces a solution tree more quickly than APRIL, by focusing the annealing on areas of the tree which have low probabilities (Atwell et al 1988, Souter 1990, Souter and O'Donoghue 1991).

2. Parsed Corpora of English

A corpus is a body of texts of one or more languages which have been collected in some principled way, perhaps to attempt to be generally representative of the language, or perhaps for some more restricted purpose, such as the study of a particular linguistic genre, of a geographical or historical variety, or even of child language acquisition. Corpora may be raw (containing no additional annotation to the original text), tagged (parts of speech, also called word-tags, are added to each word in the corpus) or parsed (full syntactic analyses are given for each utterance in the corpus). A range of probabilistic grammatical models may be induced or extracted from these three types of corpus.
Such models may generally be termed constituent likelihood grammars (Atwell 1983, 1988). This paper will restrict itself to work on fully parsed corpora. Because of the phenomenal effort involved in hand-analysing raw, or even tagged, text, parsed corpora tend to be small and few (between only 50,000 and 150,000 words). This is not very large compared to raw corpora, but still represents several person-years of work. Included in this bracket are the Leeds-Lancaster Treebank, the Polytechnic of Wales Corpus, the Gothenburg Corpus (Ellegard 1978), the related Susanne Corpus (Sampson 1992), and the Nijmegen Corpus (Keulen 1986). Each of these corpora contains relatively detailed grammatical analyses. Two much larger parsed corpora which have been hand-analysed in less detail are the IBM-Lancaster Associated Press Corpus (1 million words; not generally available) and the ACL/DCI Penn Treebank (which is intended to consist of several million words, part of which has recently been released on CD-ROM). The Ark corpus of NL generator output contains 100,000 sentences, or approximately 0.75 million words. For a more detailed survey of parsed corpora, see (Sampson 1992). Each of these corpora contains analyses according to a different grammar, and uses a different notation for representing the tree structure. Any one of the corpora may be available in a variety of formats, such as verticalized (one word per line), 80 characters per line, or one tree per line (and hence, often very long lines).
Figure 1 contains example trees from a few of these corpora:

Nijmegen Corpus (numerical LDB form):

0800131 AT 9102 THIS 2103 MOMENT, 3101 WE 5301 'VE F201 BEEN F801 JOINED A801
0800131 BY 9102 MILLIONS 7803 OF 9104 PEOPLE 3103 ACROSS 9104
0800201 EUROPE, 3501 THIS 2104 ER 1104 COVERAGE 3103 BEING F903 TAKEN A803
0800201 BY 9104 QUITE 2805 A 2505 NUMBER 3105 OF 9106 EUROPEAN 4107
0800202 COUNTRIES 3202 AND 6102 ALSO 8103 BEING F903 TAKEN A803 IN 9104 THE 2105
0800202 UNITED 9906 STATES. 3600 [[ 9400

Leeds-Lancaster Treebank:

A0168 001 [S[Nns[NPT[ Mr ]NPT][NP[ James ]NP][NP[ Callaghan ]NP][,[ , ],][Ns[NN$[ Labour's ]NN$][JJ[ colonial ]JJ][NN[ spokesman ]NN]Ns][,[ , ],]Nns][V[VBD[ said ]VBD]V][Fn[Nns[NPT[ Sir ]NPT][NP[ Roy ]NP]Nns][V[HVD[ had ]HVD]V][Ns[ATI[ no ]ATI][NN[ right ]NN][Ti[Vi[TO[ to ]TO][VB[ delay ]VB]Vi][Ns[NN[ progress ]NN]Ns][P[IN[ in ]IN][Np[ATI[ the ]ATI][NNS[ talks ]NNS]Np]P][P[IN[ by ]IN][Tg[Vg[VBG[ refusing ]VBG]Vg][Ti[Vi[TO[ to ]TO][VB[ sit ]VB]Vi][P[IN[ round ]IN][Ns[ATI[ the ]ATI][NN[ conference ]NN][NN[ table ]NN]Ns]P]Ti]Tg]P]Ti]Ns]Fn][.[ . ].]S]

IBM-Lancaster Associated Press Corpus (Spoken English Corpus Treebank):

SK01 3 v [Nr Every_AT1 three_MC months_NNT2 Nr] ,_, here_RL [P on_II [N Radio_NN1 4_MC N]P] ,_, [N I_PPIS1 N][V present_VV0 [N a_AT1 programme_NN1 [Fn called_VVN [N Workforce_NP1 N]Fn]N]V] ._.

Polytechnic of Wales (POW) Corpus:

189 Z 1 CL FR RIGHT 1 CL 2 C PGP 3 P IN-THE-MIDDLE-OF 3 CV 4 NGP 5 DD THE 5 H TOWN 4 NGP 6 & OR 6 DD THE 6 MO NGP H COUNCIL 6 H ESTATE 2 S NGP HP WE 2 OM 'LL 2 M PUT 2 C NGP DD THAT 2 C QQGP AX THERE 1 CL 7 & AND 7 S NGP HP WE 7 OM 'LL 7 M PUT 7 C NGP 8 DQ SOME 8 H TREES 7 C QQGP AX THERE

Ark Corpus:

[Z [Cl [M close] [C2/Af [ngp [ds [qqgp [dds the] [a worst]]] [h boat+s]]] [e !]]]
[Z [Cl [S/Af [ngp [dq one] [vq of] [h [genclr [g mine]]]]] [O/Xf is] [G going_to] [Xpf have] [Xpb been] [M cook+ed] [e .]]]
[Z [Cl [O/Xp isn't] [S/Af [ngp [h what]]] [M unlock+ed] [e ?]]]
[Z [Cl [S it] [O/Xpd is] [M snow+ing] [e .]]]

Figure 1: Examples of trees from different parsed corpora.
The most obvious distinction is the contrast between the use of brackets and numbers to represent tree structure. Numerical trees allow the representation of non-context-free relations such as discontinuities within a constituent. In the bracketed notation this problem is normally avoided by coding the two parts of a discontinuity as separate constituents. A less obvious distinction, but perhaps more important, is the divergence between the use of a coarse- and fine-grained grammatical description. Contrast, for example, the Associated Press (AP) Corpus and the Polytechnic of Wales Corpus. The AP Corpus contains a skeletal parse using a fairly basic set of formal grammatical labels, while the POW Corpus contains a detailed set of formal and functional labels. Furthermore, there may not be a straightforward mapping between different grammatical formalisms, because they may assign different structures to the same unambiguous sentence.

2.1 The Polytechnic of Wales Corpus: Its origins and format

In the rest of the paper, the POW corpus will be used as an example, although each of the corpora shown in Figure 1 has been or is being used for probabilistic work at Leeds. Many of the techniques we have applied to the POW corpus could equally well be applied to other corpora, but some work is necessarily corpus-specific, in that it relates to the particular grammar contained in POW, namely Systemic Functional Grammar (SFG).

The corpus was originally collected for a child language development project to study the use of various syntactico-semantic English constructs between the ages of six and twelve. A sample of approximately 120 children in this age range from the Pontypridd area of South Wales was selected, and divided into four cohorts of 30, each within three months of the ages 6, 8, 10, and 12. These cohorts were subdivided by sex (B,G) and socio-economic class (A,B,C,D).
The latter was achieved using details of the 'highest' occupation of either of the parents of the child and the educational level of the parent or parents. The children were selected in order to minimise any Welsh or other second language influence. The above subdivision resulted in small homogeneous cells of three children. Recordings were made of a play session with a Lego brick-building task for each cell, and of an individual interview with the same adult for each child, in which the child's favourite games or TV programmes were discussed.

2.1.1 Transcription and syntactic analysis

The first 10 minutes of each play session, commencing at a point where normal peer group interaction began (the microphone was ignored), were transcribed by trained transcribers. Likewise for the interviews. Intonation contours were added by a phonetician, and the resulting transcripts published in four volumes (Fawcett and Perkins 1980). For the syntactic analysis, ten trained analysts were employed to manually parse the transcribed texts, using Fawcett's systemic-functional grammar. Despite thorough checking, some inconsistencies remain in the text, owing to several people working on different parts of the corpus, and no mechanism being available to ensure the well-formedness of such detailed parse trees. An edited version of the corpus (EPOW), with many of these inconsistencies corrected, has been created (O'Donoghue 1990).

The resulting machine-readable fully parsed corpus consists of approximately 65,000 words in 11,396 (sometimes very long) lines, each containing a parse tree. The corpus of parse trees fills 1.1 Mb and contains 184 files, each with a reference header which identifies the age, sex and social class of the child, and whether the text is from a play session or an interview. The corpus is also available in wrap-round form with a maximum line length of 80 characters, where one parse tree may take up several lines.
The four-volume transcripts can be supplied by the British Library Inter-Library Loans System, and the machine-readable versions of both POW and EPOW are distributed by ICAME [1] and the Oxford Text Archive [2].

2.1.2 Systemic-Functional Grammar

The grammatical theory on which the manual parsing is based is Robin Fawcett's development of a Hallidayan Systemic-Functional Grammar, described in (Fawcett 1981). Functional elements of structure, such as subject (S), complement (C), modifier (MO), qualifier (Q) and adjunct (A), are filled by formal categories called units (cf. phrases in TG or GPSG), such as nominal group (ngp), prepositional group (pgp) and quantity-quality group (qqgp), or clusters such as genitive cluster (gc). The top-level symbol is Z (sigma) and is invariably filled by one or more clauses (cl). Some areas have very elaborate description, e.g. adjuncts, modifiers, determiners and auxiliaries, while others are relatively simple, e.g. main verb (M) and head (H).

2.1.3 Notation

The tree notation employs numbers rather than the more traditional bracketed form to define mother-daughter relationships, in order to capture discontinuous units. The number directly preceding a group of symbols refers to their mother. The mother is itself found immediately preceding the first occurrence of that number in the tree. In the example section of a corpus file given in Figure 1, the tree consists of a sentence (Z) containing three daughter clauses (CL), as each clause is preceded by the number one. In the POW corpus, when the correct analysis for a structure is uncertain, the one given is followed by a question mark. Likewise for cases where unclear recordings have made word identification difficult. Apart from the numerical structure, the grammatical categories and the words themselves, the only other symbols which may occur in the trees are three types of bracketing:

i) square [NV...], [UN...], [RP...] for non-verbal, repetition, etc.
ii) angle <...> for ellipsis in rapid speech.
iii) round (...)
for ellipsis of items recoverable from previous text.

3. Extracting a lexicon and grammar from the POW corpus

A probabilistic context-free phrase-structure grammar can be straightforwardly extracted from the corpus, by taking the corpus one tree at a time, rewriting all the mother-daughter relationships as phrase-structure rules, and deleting all the duplicates after the whole corpus has been processed. A count is kept of how many times each rule occurred. Over 4,500 unique rules have been extracted from POW, and 2,820 from EPOW. The 20 most frequent rules are given in Figure 2.

8882 S --> NGP
8792 NGP --> HP
8251 Z --> CL
6698 C --> NGP
A.A.A.3 QQGP --> AX
2491 PGP --> P CV
2487 CV --> NGP
2283 NGP --> DD H
2272 CL --> F
1910 Z --> CL CL
1738 NGP --> DQ H
1526 NGP --> H
1496 C --> PGP
1272 C --> QQGP
1234 CM --> QQGP
1221 NGP --> DD
1215 NGP --> HN
1182 C --> CL
1011 MO --> QQGP
1004 CL --> C

Figure 2. The 20 most frequent rules in the POW corpus

Similarly, each lexical item and its grammatical tag can be extracted, with a count kept for any duplicates. The extracted wordlist can be used as a prototype probabilistic lexicon for parsing. The POW wordlist contains 4,421 unique words, and EPOW 4,618. This growth in the lexicon is the result of identifying ill-formatted structures in the POW corpus in which a word appears in the place of a word-tag, and hence (wrongly) contributes to the grammar instead of the lexicon. This normally occurs when the syntactic analyst has omitted the word-tag for a word. The 20 most frequent words are given in Figure 3. For once, the word 'the' comes second in the list, as this is a spoken corpus. The list is disambiguated, in that it is the frequency of a word paired with a specific tag which is being counted, not the global frequency of a word irrespective of its tag.

2679 I HP
2250 THE DD
1901 A DQ
1550 AND &
1525 IT HP
1298 YOU HP
1173 'S OM
1117 WE HP
1020 THAT DD
897 YEAH F
691 GOT M
610 THEY HP
585 NO F
554 IN P
523 TO I
482 PUT M
417 HE HP
411 DON'T ON
401 ONE HP
400 OF VO

Figure 3. The 20 most frequent word-wordtag pairs in the POW corpus

Of course, context-free phrase-structure rules are not the only representation which could be used for the grammatical relationships which are in evidence in the corpus. The formalism in which the grammar is extracted depends very much on its intended use in parsing. Indeed, the rules given above already represent two alternatives: a context-free grammar (ignoring the frequencies) and a probabilistic context-free grammar (including the frequencies). We are grateful to Tim O'Donoghue [3] for providing two other examples which we have used in different probabilistic parsers: finite state automata (or probabilistic recursive transition networks), which were used in the Realistic Annealing Parser (RAP), and a vertical strip grammar for use in a vertical strip parser (O'Donoghue 1991).

The RTN fragment (Figure 4) contains 4 columns. The first is the mother in the tree, in this example A (= Adjunct). The second and third are possible ordered daughters of the mother (including the start symbol (#) and the end symbol ($)). The fourth column contains the frequency of the combination of the pair of daughters for a particular mother.

A # CL 250
A # NGP 156
A # PGP 970
A # QQGP 869
A # TEXT 1
A CL $ 250
A CL CL 6
A NGP $ 156
A PGP $ 970
A PGP PGP 2
A QQGP $ 869
A QQGP QQGP 4
A TEXT $ 1

Figure 4. A fragment of probabilistic RTN from EPOW

The vertical strip grammar (Figure 5) contains a list of all observed paths to the root symbol (Z) from all the possible leaves (word-tags) in the trees in the corpus, along with their frequencies (contained in another file). The fragment shows some of the vertical paths from the tag B (Binder).
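The basic extraction step described at the start of this section can be sketched in a few lines. The sketch below is a simplified illustration, not the project's actual software: it decodes the numeric notation of section 2.1.3 (a number introduces a group of daughters whose mother is the symbol immediately preceding the first occurrence of that number), emits one rule per group, and merges duplicates with a count; the toy tree uses category labels only, whereas real POW trees interleave function labels, word-tags and words.

```python
from collections import Counter

def pow_groups(tokens):
    """Decode a simplified POW numeric tree into {group: (mother, daughters)}."""
    daughters, mother_of = {}, {}
    prev, current = None, None
    for tok in tokens:
        if tok.isdigit():
            if tok not in mother_of:
                # mother is the symbol immediately preceding the first
                # occurrence of this group number
                mother_of[tok] = prev
                daughters[tok] = []
            current = tok
        else:
            if current is not None:
                daughters[current].append(tok)
            prev = tok
    return {n: (mother_of[n], ds) for n, ds in daughters.items()}

def rule_counts(trees):
    """One phrase-structure rule per group, duplicates merged with a count."""
    counts = Counter()
    for tokens in trees:
        for mother, ds in pow_groups(tokens).values():
            counts[(mother, tuple(ds))] += 1
    return counts

# Toy tree: Z -> CL CL, each CL -> S M (discontinuity handled, since a
# group number may recur later in the token stream)
toy = "Z 1 CL 2 S 2 M 1 CL 3 S 3 M".split()
```

Running `rule_counts([toy])` yields the rule Z --> CL CL once and CL --> S M twice; over a whole corpus, the counts give the rule frequencies shown in Figure 2.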
B CL A CL A CL Z
B CL A CL AL CL C CL Z
B CL A CL AL CL Z
B CL A CL C CL Z
B CL A CL Z
B CL A CL Z TEXT C CL Z
B CL AF CL Z
B CL AL CL AL CL C CL AL CL Z
B CL AL CL AL CL Z
B CL AL CL C CL AL CL Z
B CL AL CL C CL Z
B CL AL CL C REPL CL Z
B CL AL CL CV PGP C CL Z
B CL AL CL Z
B CL AL CL Z TEXT C CL Z
B CL AM CL Z
B CL AML CL Z

Figure 5. A fragment of a vertical strip grammar (EPOW)

3.1 The non-circularity of extracting a grammar model from a parsed corpus

A natural question at this stage is "why extract a grammar from a fully parsed corpus, when a grammar must have existed to parse the corpus by hand in the first place?" Such a venture is not circular, since, in this case, the grammar had not yet been formalised computationally for parsing. SFG is heavily focused on semantic choices (called systems) made in generation, rather than the surface syntactic representations needed for parsing. So the grammar used for the hand parsing of the corpus was informal, and naturally had to be supplemented with analyses to handle the phenomena common to language performance (ellipsis, repetition etc). As we have hopefully shown, extracting a model from a parsed corpus also allows a degree of flexibility in the choice of formalism. Most importantly, though, it is also possible to extract frequency data for each observation, for use in probabilistic parsing.

3.2 Frequency distribution of words and grammar 'rules'

The frequency distribution of the words in the corpus matches what is commonly known as Zipf's law. That is, very few words occur very frequently, and very many words occur infrequently, or only once in the corpus. The frequency distribution of phrase-structure rules closely matches that of words, and has led at least one researcher to conclude that both the grammar and the lexicon are open-ended (Sampson 1987b), which would provide strong support for the use of probabilistic techniques for robust NL parsing. If the size of the corpus is increased, new singleton rules are likely to be discovered, which makes the idea of a watertight grammar ridiculous. Conflicting evidence has been provided by Taylor et al (1989), who argue that it is the nature of the grammatical formalism (simple phrase-structure rules) which prompts this conclusion. If a more complex formalism is adopted, such as that used in GPSG, including categories as sets of features, ID-LP format, metarules and unification, then Taylor et al demonstrate that a finite set of rules can be used to describe the same data which led Sampson to believe that grammar is essentially open-ended.

However interesting this dispute might be, one cannot avoid the fact that, if one's aim is to build a robust parser, a very large grammar is needed. The Alvey NL tools grammar represents one of the largest competence grammars in the GPSG formalism, but the authors quite sensibly do not claim it to be exhaustive (Grover et al 1989: 42-43). To our knowledge, no corpus exists which has been fully annotated using a GPSG-like formalism, so it is necessary to resort to the grammars that have been used to parse corpora to obtain realistic frequency data.

The coverage of the extracted lexicons and grammars obviously varies depending on the original corpus. In the POW corpus, the size and nature of the extracted wordlist is a more obvious cause for concern. Even with much of the 'noise' removed by automatic editing using a spelling checker, and by manual inspection, there are obvious gaps. The corpus wordlist can be used for developing prototype parsers, but to support a robust probabilistic parser, a large-scale probabilistic lexicon is required.

4. Transforming a lexical database into a probabilistic lexicon for a corpus-based grammar

The original POW wordlist is woefully inadequate in terms of size (4,421 unique words), and also contains a small number of errors in the spelling and syntactic labelling of words (see Figure 6 for a fragment of the unedited POW wordlist). Although the latter can be edited out semi-automatically (O'Donoghue 1990), one can only really address the lexical coverage problem by incorporating some large machine-readable lexicon into the parsing program.

1 ABROAD AX
1 ABROARD CM
2 ACCIDENT H
1 ACCOUNTANT H
1 ACHING M
5 ACROSS AX
1 ACROSS CM
14 ACROSS P
3 ACTING M
4 ACTION-MAN HN
1 ACTUALLY AL
7 ADD M
1 ADD? M
1 ADDED M
1 ADDING M
1 ADJUST M
3 ADN &
1 ADN-THEN &
1 ADRIAN HN
1 ADVENTRUE H

Figure 6. A fragment of the unedited POW wordlist

Our aims in providing an improved lexical facility for our parsing programs are as follows: the lexicon should be large-scale, that is, contain several tens of thousands of words, and these should be supplemented with corpus-based frequency counts. The lexicon should also employ systemic functional grammar for its syntax entries, in order to be compatible with the grammar extracted from the POW corpus.

The source of lexical information we decided to tap into was the CELEX database. This is a lexical database for Dutch, English and German, developed at the University of Nijmegen, Holland. The English section of the CELEX database comprises the intersection of the word stems in LDOCE (Procter 1978, ASCOT version, see Akkerman et al 1988) and OALD (Hornby 1974), expanded to include all the morphological variants; a total of 80,429 wordforms. Moreover, the wordform entries include frequency counts from the COBUILD (Birmingham) 18 million word corpus of British English (Sinclair 1987), normalised as frequencies per million words. These are sadly not disambiguated for wordforms with more than one syntactic reading. A word such as "cut", which might be labelled as an adjective, noun and verb, would only have one gross frequency figure (177) in the database. There are separate entries for "cuts", "cutting" and other morphological variants. CELEX offers a fairly traditional set of syntax categories, augmented with some secondary stem and morphological information derived mainly from the LDOCE source dictionary. The format of the lexicon we exported to Leeds is shown in Figure 7.

abaci abacus N Y N N N N N N N N Y 0 N irr
aback aback ADV Y N N N N N 3 N
abacus abacus N Y N N N N N N N N N 0 N
abacuses abacus N Y N N N N N N N N Y 0 N +es
abandon abandon N N Y N N N N N N N N 16 Y
abandon abandon V Y N N N N N N N N N N N 16 Y
abandoned abandon V Y N N N N N Y N N Y N N +ed 36 Y
abandoned abandoned A Y N N N N 36 Y N N

KEY: wordform stem category ...stem info... frequency ambiguity ...morphol. info...

Figure 7. A sample of the condensed CELEX lexicon.

4.1 Manipulating the CELEX lexicon

Various transformations were performed on the lexicon to make it more suitable for use in parsing using systemic functional grammar. Using an AWK program and some UNIX tools, we reformatted the lexicon along the lines of the bracketed LDOCE dictionary, which is perhaps the nearest to an established norm. A substantial number of the verb entries were removed, which would be duplicates for our purposes, reducing the lexicon to only 59,322 wordforms. Columns of the lexicon were reordered to bring the frequency information into the same column for all entries, irrespective of syntactic category.

The most difficult change to perform was the transforming of the CELEX syntax codes into SFG. Some mappings were achieved automatically because they were one-to-one, e.g. prepositions, main verbs and head nouns. Others resorted to stem information, to distinguish subordinating and co-ordinating conjunctions, and various pronouns, for example. Elsewhere, the grammars diverge so substantially (or CELEX did not contain the relevant information) that manual intervention is required (determiners, auxiliaries, temperers, proper nouns). The development of the lexicon is not yet complete, with the 300 or so manual additions requiring frequency data.
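The relabelling step just described can be sketched as follows. This is an illustrative sketch only: the real transformation was done with AWK and UNIX tools, and the CELEX category codes shown here ('PREP', 'V', 'N', 'DET') are assumed for illustration, though the SFG targets P, M and H are the POW tags for preposition, main verb and head noun mentioned earlier.

```python
# Hypothetical one-to-one portion of the CELEX-to-SFG mapping table;
# categories absent from it stand in for the cases that needed stem
# information or manual intervention.
CELEX_TO_SFG = {
    'PREP': 'P',   # prepositions map one-to-one
    'V':    'M',   # main verbs
    'N':    'H',   # head nouns
}

def to_sfg_entry(wordform, stem, celex_cat, freq):
    """Relabel one lexicon entry; entries with no automatic mapping are
    flagged (here with a leading '?') for the manual queue."""
    sfg_cat = CELEX_TO_SFG.get(celex_cat)
    if sfg_cat is None:
        return (wordform, stem, '?' + celex_cat, freq)
    return (wordform, stem, sfg_cat, freq)
```

For example, `to_sfg_entry('abandon', 'abandon', 'V', 16)` yields an M-tagged entry, while a determiner entry falls through to the manual queue, mirroring the division of labour described above.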
A mechanism for handling proper nouns, compounds and idioms has yet to be included, and the coverage of the lexicon has yet to be tested against a corpus. A portion of the resulting lexicon is illustrated in Figure 8.

((abaci)(abacus)(H)(0)(N)(Y)(N)(N)(N)(N)(N)(N)(N)(N)(Y)(irr))
((aback)(aback)(AX)(3)(N)(Y)(N)(N)(N)(N)(N))
((abacus)(abacus)(H)(0)(N)(Y)(N)(N)(N)(N)(N)(N)(N)(N)(N)())
((abacuses)(abacus)(H)(0)(N)(Y)(N)(N)(N)(N)(N)(N)(N)(N)(Y)(+es))
((abandon)(abandon)(H)(16)(Y)(N)(Y)(N)(N)(N)(N)(N)(N)(N)(N)())
((abandon)(abandon)(M)(16)(Y)(Y)(N)(N)(N)(N)(N)(N)(N)(N)(N)(N)(N)())
((abandoned)(abandon)(M)(36)(Y)(Y)(N)(N)(N)(N)(N)(Y)(N)(N)(Y)(N)(N)(+ed))
((abandoned)(abandoned)(AX)(36)(Y)(Y)(N)(N)(N)(N)(N)(N))

KEY: ((wordform)(stem)(category)(frequency)(ambiguity)(...stem info...)(...morphol. info...))

Figure 8. A sample of the CELEX lexicon reformatted to SFG syntax.

When the development and testing work is complete, the lexicon will be compatible with the grammar extracted from the POW corpus and can be integrated into the parsing programs we have been experimenting with.

5. Exploiting the probabilistic lexicon and grammar in parsing

Perhaps the most obvious way to introduce probability into the parsing process is to adapt a traditional chart parser to include the probability of each edge as it is built into the chart. The search strategy can then take into account the likelihood of combinations of edges, and find the most likely parse tree first (if the sentence is ambiguous). A parser may thereby be constrained to produce only one optimal tree, or be allowed to continue its search for all possible trees, in order of decreasing likelihood. For an example of this approach,
see (Briscoe and Carroll 1991). In the projects described here, we have abandoned the use of rule-based grammars and parsers altogether, on the grounds that they will still fail occasionally due to lack of coverage, even with a very large grammar. The main alternative we have experimented with, as discussed briefly in section 1, is the use of simulated annealing as a search strategy.

5.1 The APRIL Parser

In the earlier APRIL project, the aim was to use simulated annealing to find a best-fit parse tree for any sentence. The approach adopted by the APRIL project involves parsing a sentence by the following steps:

[1] Initially, generate a "first-guess" parse tree at random, with any structure (as long as it is a well-formed phrase marker), and any legal symbols at the nodes.

[2] Then, make a series of random localised changes to the tree, by randomly deleting nodes or inserting new (randomly-labelled) nodes. At each stage, the likelihood of the resulting tree is measured using a constituent-likelihood function; if the change would cause the tree likelihood to fall below a threshold, then it is not allowed (but alterations which increase the likelihood, or decrease it only by a small amount, are accepted).

Initially, the likelihood threshold below which tree plausibility may not fall is very low, so almost all changes are accepted; however, as the program run proceeds, the threshold is gradually raised, so that fewer and fewer 'worsening' changes are allowed; and eventually, successive modifications should converge on an optimal parse tree, where any proposed change will worsen the likelihood. This is a very simplified statement of this approach to probabilistic parsing; for a fuller description, see (Haigh et al 1988, Sampson et al 1989, Atwell et al 1988).
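The control loop of steps [1] and [2] can be sketched schematically. The callables `random_tree`, `propose_change` and `likelihood` below are abstract stand-ins for APRIL's real tree operations and constituent-likelihood function, and the delta-based threshold schedule is one reading of the informal description above; this is a sketch of the annealing schedule, not APRIL itself.

```python
def anneal_parse(random_tree, propose_change, likelihood,
                 steps=10000, start_threshold=-1e9):
    """Schematic annealing loop: accept a change unless it worsens the
    likelihood by more than the current threshold allows. The threshold
    rises towards 0 over the run, so late in the run only non-worsening
    changes get through, and the search converges on a local optimum."""
    tree = random_tree()                  # step [1]: random first guess
    score = likelihood(tree)
    for i in range(steps):                # step [2]: local random changes
        threshold = start_threshold * (1 - i / steps)  # rises towards 0
        candidate = propose_change(tree)
        cand_score = likelihood(candidate)
        if cand_score - score >= threshold:
            tree, score = candidate, cand_score
    return tree
```

With a toy 'tree' that is just an integer, `anneal_parse(lambda: 5, lambda t: t + 1, lambda t: t, steps=100)` accepts every (improving) change and returns 105; real runs differ only in that the proposals are random tree edits and the score is a tree likelihood.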
5.2 The Realistic Annealing Parser

The parse trees built by APRIL were not Systemic Functional Grammar analyses, but rather followed a surface-structure scheme and set of labelling conventions designed by Geoffrey Leech and Geoffrey Sampson specifically for the syntactic analysis of LOB Corpus texts. The parse tree conventions deliberately avoided specific schools of linguistic theory such as Generalised Phrase Structure Grammar, Lexical Functional Grammar (and SFG), aiming to be a 'theory-neutral greatest common denominator' and to avoid theory-specific issues such as functional and semantic labelling. The labels on nodes are atomic, indivisible symbols, whereas most other linguistic theories (including SFG) assume nodes have compound labels including both category and functional information. To adapt APRIL to produce parse trees in other formalisms, such as the SFG used in the POW corpus and the COMMUNAL project, the primitive moves in the annealing parser would have to be modified to allow for compound node labels.

Several other aspects of the APRIL system were modified during the COMMUNAL project to produce the Realistic Annealing Parser. APRIL attempts to parse a whole sentence as a unit, whereas many researchers believe it is more psychologically plausible to parse from left to right, building up the parse tree incrementally as words are 'consumed'. When proposing a move or tree-modification, APRIL chooses a node from the current tree at random, whereas it would seem more efficient to somehow keep track of how 'good' nodes are and concentrate changes on 'bad' nodes. The above two factors can be combined by including a left-to-right bias in the 'badness' measure, biasing the node-chooser against changing left-most (older) nodes in the tree. APRIL (at least the early prototype) assumes the input to be parsed is not a sequence of words but of word-tags, the output of the CLAWS system; whereas the COMMUNAL parser would deal with 'raw word' input.
The Realistic Annealing Parser was developed by Tim O'Donoghue [3], and employed the wordlist and probabilistic RTN extracted from either the POW or the Ark corpora, but the same method could equally have been used for a number of other parsed corpora. The way the RAP deals with all the above problems is explained in detail in COMMUNAL Report No. 17 (Atwell et al 88), so here we will just list briefly the main innovations:

5.2.1 Modified primitive moves

The set of primitive moves has been modified to ensure 'well-formed' Systemic Functional trees are built, with alternating functional and category labels. The set of primitive moves was also expanded to allow for unfinished constituents only partially built as the parse proceeds from left to right through the sentence.

5.2.2 Pruning the search space by left-to-right incremental parsing

The RAP uses several annealing runs rather than just one to parse a complete sentence: annealing is used to find the best partial parse up to word N, then word N+1 is added as a new leaf node, and annealing is restarted to find a new, larger partial parse tree. This should effectively prune the search space, since (Marcus 1980) and others have shown that humans parse efficiently from left to right most of the time. However, it is not the case that the parse tree up to word N must be kept unchanged when merging in word N+1; in certain cases (e.g. 'garden path' sentences) the new word is incompatible with the existing partial parse, which must be radically altered before a new optimal partial parse is arrived at.

5.2.3 Optimising the choice of moves by Probability Density Functions (PDFs)

The rate of convergence on an optimal solution can be improved by modifying the random elements inherent in the Simulated Annealing algorithm. When proposing a move or tree-change, instead of choosing a node purely at random, all nodes in the tree are evaluated to see how 'bad' they are. This 'badness' depends upon how a node fits in with its sisters and what kind of mother it is to any daughters it has. These 'badness' values are then used as a PDF in choosing a node, so that a 'bad' node is more likely to be modified. Superimposed on this 'badness' measure is a weight corresponding to the node's position from left to right in the tree, so that relatively new right-most nodes are more likely to be altered. This left-to-right bias tends to inhibit backtracking except when new words added to the right of the partial parse tree are completely incompatible with the parse so far.

5.2.4 Recursive First-order Markov model

To evaluate a partial tree at any point during the parse, the partial tree is mapped onto the probabilistic recursive transition network (RTN) which has been extracted from the corpus. The probability of the path through the RTN is calculated by multiplying together all the arc probabilities en route. Each network in the RTN is used to evaluate a mother and its immediate daughters. The shape of each network is a straightforward first-order Markov model, with the further constraint that each category is represented in only one node. There is no need to 'hand-craft' the shape of each transition network, since this constraint applies to all recursive networks. This simplifies statistics extraction from the parsed corpus, and we have found empirically that it appears to be sufficient to evaluate partial trees. This empirical finding is theoretically significant, as it would seem to be at odds with established ideas founded by Chomsky (1957). However, although Chomsky maintained that a simple Markov model was insufficient to describe Natural Language syntax, he did not consider Recursive Markov models, and did not contemplate their use in conjunction with a stochastic optimisation algorithm such as Simulated Annealing.

6. Conclusions

The Polytechnic of Wales Corpus is one of a few parsed corpora already in existence which offer a rich source of grammatical information for use in probabilistic parsing. We have developed tools for the extraction of constituent likelihood grammars in a number of formalisms, which could, with relatively minor adjustment, be used on other parsed corpora. The provision of compatible large-scale probabilistic lexical resources is achieved by semi-automatically manipulating a lexical database or machine-tractable dictionary to conform with the corpus-based grammar, and supplementing it with corpus word-frequencies. Constituent likelihood grammars are particularly useful for the syntactic description of unrestricted natural language, including syntactically 'noisy' or ill-formed text.

The Realistic Annealing Parser is one of a line of probabilistic grammatical analysis systems built along similar general principles. Other projects are ongoing which utilise corpus-based probabilistic grammars, supplemented with suitably large probabilistic lexicons. They include the use of neural networks, vertical strip parsers, and hybrid parsers which combine the efficiency of chart parsers with the robustness of probabilistic alternatives. Whilst we would not yet claim that these have progressed beyond the workbench and onto the production line for general use, they offer promising competition to the established norms of rule-based parsers. We believe that corpus-based probabilistic parsing is a new area of Computational Linguistics which will thrive as parsed corpora become more widely available.

Atwell, Eric Steven, 1992 "Overview of Grammar Acquisition Research" in Henry Thompson (ed), "Workshop on sublanguage grammar and lexicon acquisition for speech and language: proceedings", pp 65-70, Human Communication Research Centre, Edinburgh University.

Atwell, Eric Steven and Stephen Elliott, 1987 "Dealing with ill-formed English text" in Garside et al 1987, pp 120-138.

Atwell, Eric Steven, D. Clive Souter and Tim F. O'Donoghue 1988 "Prototype Parser 1" COMMUNAL Report No. 17, CCALAS, Leeds University.
7. Notes

[1] ICAME (International Computer Archive of Modern English), Norwegian Computing Centre for the Humanities, P.O. Box 53, Universitetet, N-5027 Bergen, Norway.

[2] The Oxford Text Archive (OTA), Oxford University Computing Service, 13 Banbury Road, Oxford OX2 6NN, UK.

[3] Tim O'Donoghue was a SERC-funded PhD student in the School of Computer Studies, who contributed to the Leeds research on the COMMUNAL project.

8. References

ACL 1983 "Computational Linguistics: Special issue on dealing with ill-formed text." Journal of Computational Linguistics volume 9 (3-4).

Akkerman, Eric, Hetty Voogt-van Zutphen and Willem Meijs. 1988 "A computerized lexicon for word-level tagging". ASCOT Report No 2, Rodopi Press, Amsterdam.

Atwell, Eric Steven, 1983 "Constituent-likelihood grammar". ICAME Journal no. 7, pp 34-66.

Atwell, Eric Steven, 1988 "Grammatical analysis of English by statistical pattern recognition" in Josef Kittler (ed), Pattern Recognition: Proceedings of the 4th International Conference, Cambridge, pp 626-635, Berlin, Springer-Verlag.

Briscoe, Ted and John Carroll, 1991 "Generalised Probabilistic LR Parsing of Natural Language (Corpora) with Unification-based Grammars" University of Cambridge Computer Laboratory Technical Report No. 224, June 1991.

Charniak, Eugene, 1983 "A parser with something for everyone" in Margaret King (ed.) Parsing Natural Language, Academic Press, London.

Chomsky, Noam 1957 "Syntactic Structures" Mouton, The Hague.

Ellegard, A. 1978 "The Syntactic Structure of English Texts". Gothenburg Studies in English 43. Gothenburg: Acta Universitatis Gothoburgensis.

Fawcett, Robin P. 1981 "Some Proposals for Systemic Syntax", Journal of the Midlands Association for Linguistic Studies (MALS) 1.2, 2.1, 2.2 (1974-76). Re-issued with light amendments, 1981, Department of Behavioural and Communication Studies, Polytechnic of Wales.

Fawcett, Robin P. and Michael R. Perkins. 1980 "Child Language Transcripts 6-12" Department of Business and Communication Studies, Polytechnic of Wales.
Fawcett, Robin P. and Gordon Tucker. 1989. "Prototype Generators 1 and 2". COMMUNAL Report No. 10, Computational Linguistics Unit, University of Wales College of Cardiff.

Garside, Roger, Geoffrey Sampson and Geoffrey Leech (eds.) 1987. "The computational analysis of English: a corpus-based approach". London, Longman.

Atwell, Eric Steven, 1990 "Measuring Grammaticality of machine-readable text" in Werner Bahner, Joachim Schildt and Dieter Viehweger (eds.), Proceedings of the XIV International Congress of Linguists, Volume 3, pp 2275-2277, Berlin.

Grover, C., E. J. Briscoe, J. Carroll and B. Boguraev. 1989. "The Alvey natural language tools grammar". University of Cambridge Computer Laboratory Technical Report No. 162, April 1989.

Sampson, Geoffrey. 1992. "Analysed corpora of English: a consumer guide". In Martha Pennington and Vance Stevens (eds.) Computers in Applied Linguistics. Multilingual Matters.

Haigh, Robin, Geoffrey Sampson and Eric Steven Atwell. 1988. "Project APRIL - a progress report" Proceedings of the 26th ACL Conference, Buffalo, USA.

Sampson, Geoffrey R., Robin Haigh and Eric S. Atwell. 1989 "Natural language analysis by stochastic optimization: a progress report on Project APRIL", Journal of Experimental and Theoretical Artificial Intelligence 1: 271-287.

Heidorn, G. E., K. Jensen, L. A. Miller, R. J. Byrd and M. S. Chodorow. 1982. "The EPISTLE text-critiquing system" in IBM Systems Journal 21(3): 305-326.

Sinclair, John McH. (ed.) 1987 "Looking Up: an account of the COBUILD project in lexical computing". London: Collins.

Hornby, A. S. (ed.) 1974 "Oxford Advanced Learner's Dictionary of Contemporary English". Oxford, OUP.

Johansson, Stig, Eric Atwell, Roger Garside and Geoffrey Leech. 1986. "The Tagged LOB Corpus". Norwegian Computing Centre for the Humanities, University of Bergen, Norway.

Souter, Clive. 1989. "A Short Handbook to the Polytechnic of Wales Corpus". ICAME, Norwegian Computing Centre for the Humanities, Bergen University, Norway.

Souter, Clive. 1990. "Systemic Functional Grammars and Corpora".
In Jan Aarts and Willem Meijs (eds.) Theory and Practice in Corpus Linguistics, pp 179-211. Amsterdam: Rodopi Press.

Keulen, Francoise. 1986. "The Dutch Computer Corpus Pilot Project". In Jan Aarts and Willem Meijs (eds.) Corpus Linguistics II, pp 127-161. Amsterdam: Rodopi Press.

Kwasny, S. and Norman Sondheimer. 1981. "Relaxation techniques for parsing grammatically ill-formed input in natural language understanding systems" in American Journal of Computational Linguistics 7(2): 99-108.

Souter, Clive and Tim O'Donoghue. 1991. "Probabilistic Parsing in the COMMUNAL project". In Stig Johansson and Anna-Brita Stenström (eds.) English Computer Corpora: Selected Papers and Research Guide, pp 33-48. Berlin: Mouton de Gruyter.

Marcus, Mitchell. 1980. "A theory of syntactic recognition for natural language". MIT Press, Cambridge, Massachusetts.

Taylor, Lita, Claire Grover and Ted Briscoe. 1989. "The syntactic regularity of English noun phrases". Proceedings of the 4th European ACL Conference, Manchester, pp 256-263.

O'Donoghue, Tim F. 1990. "Taking a parsed corpus to the cleaners: the EPOW corpus". ICAME Journal no. 15, pp 55-62.

Weischedel, Ralph and John Black. 1980. "Responding intelligently to unparsable inputs". In American Journal of Computational Linguistics 6(2): 97-109.

O'Donoghue, Tim F. 1991. "The Vertical Strip Parser: A lazy approach to parsing". Research Report 91.15, School of Computer Studies, University of Leeds.

Wright, Joan. 1988. "The development of tools for writing and testing systemic functional grammars". COMMUNAL Report No. 3, Computational Linguistics Unit, University of Wales College of Cardiff.

Procter, Paul. 1978. "Longman Dictionary of Contemporary English" London, Longman.

Sampson, Geoffrey. 1987a. "The Grammatical Database and Parsing Scheme". In Garside et al 1987, pp 82-96.

Sampson, Geoffrey. 1987b. "Evidence against the "grammatical"/"ungrammatical" distinction". In Willem Meijs (ed.) Corpus Linguistics and Beyond. Amsterdam: Rodopi.