From: AAAI Technical Report WS-92-01. Compilation copyright © 1992, AAAI (www.aaai.org). All rights reserved.
A Richly Annotated Corpus for Probabilistic Parsing

Clive Souter and Eric Atwell
Centre for Computer Analysis of Language and Speech
Division of Artificial Intelligence
School of Computer Studies
University of Leeds
Leeds LS2 9JT
United Kingdom
Tel: +44 532 335460 and 335761
Email: cs@ai.leeds.ac.uk and eric@ai.leeds.ac.uk
Abstract

This paper describes the use of a small but syntactically rich parsed corpus of English in probabilistic parsing. Software has been developed to extract probabilistic systemic-functional grammars (SFGs) from the Polytechnic of Wales Corpus in several formalisms, and could equally well be applied to other parsed corpora. To complement the large probabilistic grammar, we discuss progress in the provision of lexical resources, which range from corpus wordlists to a large lexical database supplemented with word frequencies and SFG categories. The lexicon and grammar resources may be used in a variety of probabilistic parsing programs, one of which is presented in some detail: the Realistic Annealing Parser. Compared to traditional rule-based methods, such parsers usually implement complex algorithms, and are relatively slow, but are more robust in providing analyses of unrestricted and even semi-grammatical English.

1. Introduction

1.1 Aim

The aim of this paper is to present resources and techniques for statistical natural language parsing that have been developed at Leeds over the last 5-6 years, focusing on the exploitation of the richly annotated Polytechnic of Wales (POW) Corpus to produce large-scale probabilistic grammars for the robust parsing of unrestricted English.

1.2 Background: Competence versus performance

The dominant paradigm in natural language processing over the last three decades has been the use of competence grammars. Relatively small sets of rules have been created intuitively by researchers whose primary interest was to demonstrate the possibility of integrating a lexicon and grammar into an efficient parsing program. A fairly recent example of the competence approach for English is the UK Government-sponsored Alvey Natural Language Toolkit, which contains a GPSG-like grammar which expands to an object grammar of over 1,000 phrase-structure rules (Grover et al 1989). The toolkit also includes an efficient chart parser, a morphological analyser, and a 35,000-word lexicon.

However, over the last five years or so, interest has grown in the development of robust NLP systems with much larger lexical and grammatical coverage, and with the ability to provide a best-fit analysis for a sentence when, despite its extended coverage, the grammar does not describe the required structure. Some researchers have attempted to provide ad hoc rule-based solutions to handle such lexical and grammatical shortfalls (see, for example, ACL 1983, Charniak 1983, Cliff and Atwell 1987, Kwasny and Sondheimer 1981, Heidorn et al 1982, Weischedel and Black 1980). Others, including some at Leeds, have investigated the use of probabilistic parsing techniques in the quest for more reliable systems (e.g. Atwell and Elliott 1987, Atwell 1988, 1990). One of the central principles behind the adoption of probabilities (unless this is done simply to order the solutions produced by a conventional parser) is the forfeiture of a strict grammatical/ungrammatical distinction for a sliding scale of likelihood. Structures which occur very frequently are those which one might call grammatical, and those which occur infrequently or do not occur at all are either genuinely rare, semi-grammatical or ungrammatical. The source of grammatical frequency information is a parsed corpus: a machine-readable collection of utterances which have been analysed by hand according to some grammatical description and formalism. Corpora may be collected for both spoken and written text, transcribed (in the case of spoken texts), and then (laboriously) grammatically annotated. Such a grammar may be called a performance grammar.
1.3 Background at Leeds

Research on corpus-based parsing at Leeds has revolved around more than one project and corpus. In 1986 the RSRE Speech Research Unit (now part of the UK Defence Research Agency) funded a three-year project called APRIL (Annealing Parser for Realistic Input Language) (see Haigh et al 1988, Sampson et al 1989). In this project, Haigh, Sampson and Atwell developed a stochastic parser based on the Leeds-Lancaster Treebank (Sampson 1987a), a 45,000-word subset of the Lancaster-Oslo/Bergen (LOB) Corpus (Johansson et al 1986) annotated with parse trees. The grammar was extracted from the corpus in the form of a probabilistic RTN, and used to evaluate the solution trees found by a simulated annealing search for some new sentence. At the end of the 'annealing run', provided the annealing schedule has been properly tuned, a single best-fit parse tree is found. (To simplify matters, in the early stages of the APRIL project, an ordered list of word tags was used as input, rather than a list of words.) The parsing technique is quite complex compared to rule-based alternatives, and is much slower, but has the benefit of always producing a solution tree.

In 1987, the RSRE Speech Research Unit, ICL, and Longman sponsored the COMMUNAL project (COnvivial Man-Machine Understanding through NAtural Language) at University of Wales College at Cardiff (UWCC) and the University of Leeds. The COMMUNAL project aimed to develop a natural language interface to knowledge-based systems. The Cardiff team, led by Robin Fawcett, were responsible for knowledge representation, sentence generation and the development of a large systemic functional grammar. At Leeds, Atwell and Souter concentrated on developing a parser for the grammar used in the generator. A number of parsing techniques were investigated. One of these involved modifying the simulated annealing technique to work with words, rather than tags, as input, parsing from left to right, and constraining the annealing search space with judicious use of probability density functions. In the COMMUNAL project, two corpus sources were used for the grammatical frequency data: firstly, the Polytechnic of Wales (POW) Corpus, 65,000 words of hand-analysed children's spoken English (Fawcett and Perkins 1980, Souter 1989); secondly, the potentially infinitely large Ark Corpus, which consists of randomly generated tree structures for English, produced by the COMMUNAL NL generator, GENESYS (Wright 1988, Fawcett and Tucker 1989). The Realistic Annealing Parser (RAP) produces a solution tree more quickly than APRIL, by focusing the annealing on areas of the tree which have low probabilities (Atwell et al 1988, Souter 1990, Souter and O'Donoghue 1991).

Other ongoing research at Leeds related to corpus-based probabilistic parsing is surveyed in (Atwell 1992), including projects sponsored by the Defence Research Agency and British Telecom, and half a dozen PhD student projects.
2. Parsed Corpora of English

A corpus is a body of texts of one or more languages which have been collected in some principled way, perhaps to attempt to be generally representative of the language, or perhaps for some more restricted purpose, such as the study of a particular linguistic genre, of a geographical or historical variety, or even of child language acquisition. Corpora may be raw (containing no additional annotation to the original text), tagged (parts of speech, also called word-tags, are added to each word in the corpus) or parsed (full syntactic analyses are given for each utterance in the corpus). A range of probabilistic grammatical models may be induced or extracted from these three types of corpus. Such models may generally be termed constituent likelihood grammars (Atwell 1983, 1988). This paper will restrict itself to work on fully parsed corpora.

Because of the phenomenal effort involved in hand-analysing raw, or even tagged, text, parsed corpora tend to be small and few (between only 50,000 and 150,000 words). This is not very large, compared to raw corpora, but still represents several person-years of work. Included in this bracket are the Leeds-Lancaster Treebank, the Polytechnic of Wales Corpus, the Gothenburg Corpus (Ellegard 1978), the related Susanne Corpus (Sampson 1992), and the Nijmegen Corpus (Keulen 1986). Each of these corpora contains relatively detailed grammatical analyses. Two much larger parsed corpora which have been hand-analysed in less detail are the IBM-Lancaster Associated Press Corpus (1 million words; not generally available) and the ACL/DCI Penn Treebank (which is intended to consist of several million words, part of which has recently been released on CD-ROM). The Ark corpus of NL generator output contains 100,000 sentences, or approximately 0.75 million words. For a more detailed survey of parsed corpora, see (Sampson 1992). Each of these corpora contains analyses according to different grammars, and uses different notations for representing the tree structure. Any one of the corpora may be available in a variety of formats, such as verticalized (one word per line), 80 characters per line, or one tree per line (and hence, often very long lines). Figure 1 contains example trees from a few of these corpora.
Nijmegen Corpus (numerical LDB form):
0800131 AT 9102 THIS 2103 MOMENT, 3101 WE 5301 'VE F201 BEEN F801 JOINED A801
0800131 BY 9102 MILLIONS 7803 OF 9104 PEOPLE 3103 ACROSS 9104
0800201 EUROPE, 3501 THIS 2104 ER 1104 COVERAGE 3103 BEING F903 TAKEN A803
0800201 BY 9104 QUITE 2805 A 2505 NUMBER 3105 OF 9106 EUROPEAN 4107
0800202 COUNTRIES 3202 AND 6102 ALSO 8103 BEING F903 TAKEN A803 IN 9104 THE 2105
0800202 UNITED 9906 STATES. 3600 [[ 9400
Leeds-Lancaster Treebank:
A0168 001
[S[Nns[NPT[ Mr ]NPT][NP[ James ]NP][NP[ Callaghan ]NP][,[ , ],][Ns[NN$[ Labour's ]NN$][JJ[ colonial ]JJ][NN[ spokesman ]NN]Ns][,[ , ],]Nns][V[VBD[ said ]VBD]V][Fn[Nns[NPT[ Sir ]NPT][NP[ Roy ]NP]Nns][V[HVD[ had ]HVD]V][Ns[ATI[ no ]ATI][NN[ right ]NN][Ti[Vi[TO[ to ]TO][VB[ delay ]VB]Vi][Ns[NN[ progress ]NN]Ns][P[IN[ in ]IN][Np[ATI[ the ]ATI][NNS[ talks ]NNS]Np]P][P[IN[ by ]IN][Tg[Vg[VBG[ refusing ]VBG]Vg][Ti[Vi[TO[ to ]TO][VB[ sit ]VB]Vi][P[IN[ round ]IN][Ns[ATI[ the ]ATI][NN[ conference ]NN][NN[ table ]NN]Ns]P]Ti]Tg]P]Ti]Ns]Fn][.[ . ].]S]
IBM-Lancaster Associated Press Corpus (Spoken English Corpus Treebank):
SK01 3 v
[Nr Every_AT1 three_MC months_NNT2 Nr] ,_, [ here_RL [P on_II [N Radio_NN1 4_MC N]P]] ,_, [N I_PPIS1 N][V present_VV0 [N a_AT1 programme_NN1 [Fn called_VVN [N Workforce_NP1 N]Fn]N]V] ._.
Polytechnic of Wales (POW) Corpus:
189
Z 1 CL FR RIGHT 1 CL 2 C PGP 3 P IN-THE-MIDDLE-OF 3 CV 4 NGP 5 DD THE 5 H TOWN 4 NGP 6 & OR 6 DD THE 6 MOTHNGPH COUNCIL 6 H ESTATE 2 S NGP HP WE 2 OM 'LL 2 M PUT 2 C NGP DD THAT 2 C QQGP AX THERE 1 CL 7 & AND 7 S NGP HP WE 7 OM 'LL 7 M PUT 7 C NGP 8 DQ SOME 8 H TREES 7 C QQGP AX THERE
Ark Corpus:
[Z [Cl [M close] [C2/Af [ngp [ds [qqgp [dds the] [a worst]]] [h boat+s]]] [e !]]]
[Z [Cl [S/Af [ngp [dq one] [vq of] [h [genclr [g mine]]]]] [O/Xf is] [G going_to] [Xpf have] [Xp been] [M cook+ed] [e .]]]
[Z [Cl [O/Xp isn't] [S/Af [ngp [h what]]] [M unlock+ed] [e ?]]]
[Z [Cl [S it] [O/Xpd is] [M snow+ing] [e .]]]

Figure 1: Examples of trees from different parsed corpora.
The most obvious distinction is the contrast between the use of brackets and numbers to represent tree structure. Numerical trees allow the representation of non-context-free relations such as discontinuities within a constituent. In the bracketed notation this problem is normally avoided by coding the two parts of a discontinuity as separate constituents. A less obvious distinction, but perhaps more important, is the divergence between the use of a coarse- and fine-grained grammatical description. Contrast, for example, the Associated Press (AP) Corpus and the Polytechnic of Wales Corpus. The AP Corpus contains a skeletal parse using a fairly basic set of formal grammatical labels, while the POW Corpus contains a detailed set of formal and functional labels. Furthermore, there may not be a straightforward mapping between different grammatical formalisms, because they may assign different structures to the same unambiguous sentence.
2.1 The Polytechnic of Wales Corpus: its origins and format

In the rest of the paper, the POW corpus will be used as an example, although each of the corpora shown in Figure 1 has been or is being used for probabilistic work at Leeds. Many of the techniques we have applied to the POW corpus could equally well be applied to other corpora, but some work is necessarily corpus-specific, in that it relates to the particular grammar contained in POW, namely Systemic Functional Grammar (SFG).
The corpus was originally collected for a child language development project to study the use of various syntactico-semantic English constructs between the ages of six and twelve. A sample of approximately 120 children in this age range from the Pontypridd area of South Wales was selected, and divided into four cohorts of 30, each within three months of the ages 6, 8, 10, and 12. These cohorts were subdivided by sex (B,G) and socio-economic class (A,B,C,D). The latter was achieved using details of the 'highest' occupation of either of the parents of the child and the educational level of the parent or parents.

The children were selected in order to minimise any Welsh or other second language influence. The above subdivision resulted in small homogeneous cells of three children. Recordings were made of a play session with a Lego brick building task for each cell, and of an individual interview with the same adult for each child, in which the child's favourite games or TV programmes were discussed.
2.1.1 Transcription and syntactic analysis

The first 10 minutes of each play session, commencing at a point where normal peer group interaction began (the microphone was ignored), were transcribed by trained transcribers. Likewise for the interviews. Intonation contours were added by a phonetician, and the resulting transcripts published in four volumes (Fawcett and Perkins 1980).
For the syntactic analysis, ten trained analysts were employed to manually parse the transcribed texts, using Fawcett's systemic-functional grammar. Despite thorough checking, some inconsistencies remain in the text, owing to several people working on different parts of the corpus, and no mechanism being available to ensure the well-formedness of such detailed parse trees. An edited version of the corpus (EPOW), with many of these inconsistencies corrected, has been created (O'Donoghue 1990).
The resulting machine-readable fully parsed corpus consists of approximately 65,000 words in 11,396 (sometimes very long) lines, each containing a parse tree. The corpus of parse trees fills 1.1 Mb and contains 184 files, each with a reference header which identifies the age, sex and social class of the child, and whether the text is from a play session or an interview. The corpus is also available in wrap-round form with a maximum line length of 80 characters, where one parse tree may take up several lines. The four-volume transcripts can be supplied by the British Library Inter-Library Loans System, and the machine-readable versions of both POW and EPOW are distributed by ICAME [1] and the Oxford Text Archive [2].
2.1.2 Systemic-Functional Grammar

The grammatical theory on which the manual parsing is based is Robin Fawcett's development of a Hallidayan Systemic-Functional Grammar, described in (Fawcett 1981). Functional elements of structure, such as subject (S), complement (C), modifier (MO), qualifier (Q) and adjunct (A), are filled by formal categories called units (cf. phrases in TG or GPSG), such as nominal group (ngp), prepositional group (pgp) and quantity-quality group (qqgp), or clusters such as genitive cluster (gc). The top-level symbol is Z (sigma) and is invariably filled by one or more clauses (cl). Some areas have very elaborate description, e.g. adjuncts, modifiers, determiners and auxiliaries, while others are relatively simple, e.g. main verb (M) and head (H).
2.1.3 Notation

The tree notation employs numbers rather than the more traditional bracketed form to define mother-daughter relationships, in order to capture discontinuous units. The number directly preceding a group of symbols refers to their mother. The mother is itself found immediately preceding the first occurrence of that number in the tree. In the example section of a corpus file given in Figure 1, the tree consists of a sentence (Z) containing three daughter clauses (CL), as each clause is preceded by the number one.

In the POW corpus, when the correct analysis for a structure is uncertain, the one given is followed by a question mark. Likewise for cases where unclear recordings have made word identification difficult. Apart from the numerical structure, the grammatical categories and the words themselves, the only other symbols which may occur in the trees are three types of bracketing:

i) square [NV...], [UN...], [RP...] for non-verbal, repetition, etc.
ii) angle <...> for ellipsis in rapid speech.
iii) round (...) for ellipsis of items recoverable from previous text.
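As a concrete illustration, the mother-daughter convention just described can be decoded with a short routine. This is our own simplified sketch, not software from the project: real POW lines also interleave words, function/form label pairs and the bracketings listed above, all of which are ignored here.

```python
def decode_pow(tokens):
    """Decode the POW numeric notation into nested (label, daughters) nodes.

    A number n marks the symbols that follow as daughters of the node
    which immediately preceded the first occurrence of n."""
    mothers = {}        # number -> mother node
    root = None
    last_node = None
    daughters = None    # daughter list currently being filled
    for tok in tokens:
        if tok.isdigit():
            if tok not in mothers:
                mothers[tok] = last_node   # first occurrence fixes the mother
            daughters = mothers[tok][1]
        else:
            node = (tok, [])
            if daughters is None:
                root = node                # the top-level symbol (Z)
            else:
                daughters.append(node)
            last_node = node
    return root

# A sentence Z with two daughter clauses, the first containing C and M:
tree = decode_pow("Z 1 CL 2 C 2 M 1 CL".split())
# tree == ('Z', [('CL', [('C', []), ('M', [])]), ('CL', [])])
```

Discontinuous units fall out of this scheme naturally: a later re-use of an earlier number reattaches the following symbols to the same mother, wherever they occur in the line.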
3. Extracting a lexicon and grammar from the POW corpus

A probabilistic context-free phrase-structure grammar can be straightforwardly extracted from the corpus, by taking the corpus one tree at a time, rewriting all the mother-daughter relationships as phrase-structure rules, and deleting all the duplicates after the whole corpus has been processed. A count is kept of how many times each rule occurred. Over 4,500 unique rules have been extracted from POW, and 2,820 from EPOW. The 20 most frequent rules are given in Figure 2.
8882 S --> NGP
8792 NGP --> HP
8251 Z --> CL
6698 C --> NGP
4443 QQGP --> AX
2491 PGP --> P CV
2487 CV --> NGP
2283 NGP --> DD H
2272 CL --> F
1910 Z --> CL CL
1738 NGP --> DQ H
1526 NGP --> H
1496 C --> PGP
1272 C --> QQGP
1234 CM --> QQGP
1221 NGP --> DD
1215 NGP --> HN
1182 C --> CL
1011 MO --> QQGP
1004 CL --> C

Figure 2. The 20 most frequent rules in the POW corpus
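The extraction procedure described above can be sketched in a few lines. The tree encoding and the one-tree toy 'corpus' here are invented for illustration; the real software worked over the full POW files. Counts are then normalised into probabilities P(daughters | mother).

```python
from collections import Counter, defaultdict

# Sketch: rewrite every mother-daughter relationship as a
# phrase-structure rule, merge duplicates into counts, and
# normalise the counts into conditional probabilities.

def count_rules(tree, counts):
    label, daughters = tree
    if daughters:                     # leaves contribute no rule
        counts[(label, tuple(d[0] for d in daughters))] += 1
        for d in daughters:
            count_rules(d, counts)

corpus = [
    ('Z', [('CL', [('S', [('NGP', [])]), ('M', [])]),
           ('CL', [('S', [('NGP', [])])])]),
]
counts = Counter()
for t in corpus:
    count_rules(t, counts)

totals = defaultdict(int)
for (mother, _), n in counts.items():
    totals[mother] += n
probs = {rule: n / totals[rule[0]] for rule, n in counts.items()}
# e.g. probs[('CL', ('S', 'M'))] == 0.5 : half the observed CL expansions
```

The same pass, applied to word/word-tag pairs at the leaves instead of mother-daughter pairs, yields the wordlist counts discussed next.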
Similarly, each lexical item and its grammatical tag can be extracted, with a count kept for any duplicates. The extracted wordlist can be used as a prototype probabilistic lexicon for parsing. The POW wordlist contains 4,421 unique words, and EPOW 4,618. This growth in the lexicon is the result of identifying ill-formatted structures in the POW corpus in which a word appears in the place of a word-tag, and hence (wrongly) contributes to the grammar instead of the lexicon. This normally occurs when the syntactic analyst has omitted the word-tag for a word. The 20 most frequent words are given in Figure 3. For once, the word 'the' comes second in the list, as this is a spoken corpus. The list is disambiguated, in that it is the frequency of a word paired with a specific tag which is being counted, not the global frequency of a word irrespective of its tag.

2679 I HP
2250 THE DD
1901 A DQ
1550 AND &
1525 IT HP
1298 YOU HP
1173 'S OM
1117 WE HP
1020 THAT DD
897 YEAH F
691 GOT M
610 THEY HP
585 NO F
554 IN P
523 TO I
482 PUT M
417 HE HP
411 DON'T OM
401 ONE HP
400 OF VO

Figure 3. The 20 most frequent word-wordtag pairs in the POW corpus

Of course, context-free phrase-structure rules are not the only representation which could be used for the grammatical relationships which are in evidence in the corpus. The formalism in which the grammar is extracted depends very much on its intended use in parsing. Indeed, the rules given above already represent two alternatives: a context-free grammar (ignoring the frequencies) and a probabilistic context-free grammar (including the frequencies).

We are grateful to Tim O'Donoghue [3] for providing two other examples which we have used in different probabilistic parsers: finite state automata (or probabilistic recursive transition networks), which were used in the Realistic Annealing Parser (RAP), and a vertical strip grammar for use in a vertical strip parser (O'Donoghue 1991). The RTN fragment (Figure 4) contains 4 columns. The first is the mother in the tree, in this example A (= Adjunct). The second and third are possible ordered daughters of the mother (including the start symbol (#) and the end symbol ($)). The fourth column contains the frequency of the combination of the pair of daughters for a particular mother.

A # CL 250
A # NGP 156
A # PGP 970
A # QQGP 869
A # TEXT 1
A CL $ 250
A CL CL 6
A NGP $ 156
A PGP $ 970
A PGP PGP 2
A QQGP $ 869
A QQGP QQGP 4
A TEXT $ 1

Figure 4. A fragment of probabilistic RTN from EPOW

The vertical strip grammar (Figure 5) contains a list of all observed paths to the root symbol (Z) from all the possible leaves (word-tags) in the trees in the corpus, along with their frequencies (contained in another file). The fragment shows some of the vertical paths from the tag B (Binder).
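Collecting such leaf-to-root paths from a tree is straightforward. The sketch below uses our own nested-tuple tree encoding, not the corpus file format:

```python
from collections import Counter

# Sketch: collect 'vertical strips' -- the path of labels from each
# leaf up to the root Z -- and count how often each strip occurs.

def strips(tree, path=()):
    label, daughters = tree
    path = (label,) + path          # build the path leaf-first
    if not daughters:
        yield path
    for d in daughters:
        yield from strips(d, path)

tree = ('Z', [('CL', [('B', []), ('S', [('NGP', [('H', [])])])])])
strip_counts = Counter(strips(tree))
# strip_counts[('B', 'CL', 'Z')] == 1, matching the 'B CL ... Z' strips in Figure 5
```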
B CL A CL A CL Z
B CL A CL AL CL C CL Z
B CL A CL AL CL Z
B CL A CL C CL Z
B CL A CL Z
B CL A CL Z TEXT C CL Z
B CL AF CL Z
B CL AL CL AL CL C CL AL CL Z
B CL AL CL AL CL Z
B CL AL CL C CL AL CL Z
B CL AL CL C CL Z
B CL AL CL C REPL CL Z
B CL AL CL CV PGP C CL Z
B CL AL CL Z
B CL AL CL Z TEXT C CL Z
B CL AM CL Z
B CL AML CL Z

Figure 5. A fragment of a vertical strip grammar (EPOW)

3.1 The non-circularity of extracting a grammar model from a parsed corpus

A natural question at this stage is "why extract a grammar from a fully parsed corpus, when a grammar must have existed to parse the corpus by hand in the first place?" Such a venture is not circular, since, in this case, the grammar had not yet been formalised computationally for parsing. SFG is heavily focused on the semantic choices (called systems) made in generation, rather than the surface syntactic representations needed for parsing. So the grammar used for the hand parsing of the corpus was informal, and naturally had to be supplemented with analyses to handle the phenomena common to language performance (ellipsis, repetition etc). As we have hopefully shown, extracting a model from a parsed corpus also allows a degree of flexibility in the choice of formalism. Most importantly, though, it is also possible to extract frequency data for each observation, for use in probabilistic parsing.

3.2 Frequency distribution of words and grammar 'rules'

The frequency distribution of the words in the corpus matches what is commonly known as Zipf's law. That is, very few words occur very frequently, and very many words occur infrequently, or only once in the corpus. The frequency distribution of phrase-structure rules closely matches that of words, and has led at least one researcher to conclude that both the grammar and the lexicon are open-ended (Sampson 1987b), which would provide strong support for the use of probabilistic techniques for robust NL parsing. If the size of the corpus is increased, new singleton rules are likely to be discovered, which makes the idea of a watertight grammar ridiculous.

Conflicting evidence has been provided by Taylor et al (1989), who argue that it is the nature of the grammatical formalism (simple phrase-structure rules) which prompts this conclusion. If a more complex formalism is adopted, such as that used in GPSG, including categories as sets of features, ID/LP format, metarules and unification, then Taylor et al demonstrate that a finite set of rules can be used to describe the same data which led Sampson to believe that grammar is essentially open-ended.

However interesting this dispute might be, one cannot avoid the fact that, if one's aim is to build a robust parser, a very large grammar is needed. The Alvey NL tools grammar represents one of the largest competence grammars in the GPSG formalism, but the authors quite sensibly do not claim it to be exhaustive (Grover et al 1989: 42-43). To our knowledge, no corpus exists which has been fully annotated using a GPSG-like formalism, so it is a necessity to resort to the grammars that have been used to parse corpora to obtain realistic frequency data.

The coverage of the extracted lexicons and grammars obviously varies depending on the original corpus. In the POW corpus, the size and nature of the extracted wordlist is a more obvious cause for concern. Even with much of the 'noise' removed by automatic editing using a spelling checker, and by manual inspection, there are obvious gaps. The corpus wordlist can be used for developing prototype parsers, but to support a robust probabilistic parser, a large-scale probabilistic lexicon is required.

4. Transforming a lexical database into a probabilistic lexicon for a corpus-based grammar

The original POW wordlist is woefully inadequate in terms of size (4,421 unique words), and also contains a small number of errors in the spelling and syntactic labelling of words (see Figure 6 for a fragment of the unedited POW wordlist). Although the latter can be edited out semi-automatically (O'Donoghue 1990), we can only really address the lexical coverage problem by incorporating some large machine-readable lexicon into the parsing program.

Our aims in providing an improved lexical facility for our parsing programs are as follows: the lexicon should be large-scale, that is, contain several tens of thousands of words, and these should be supplemented with corpus-based frequency counts. The lexicon should also employ systemic functional grammar for its syntax entries, in order to be compatible with the grammar extracted from the POW corpus.
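The kind of merge implied by these aims can be sketched as follows. The entries, frequencies and field layout below are invented for illustration only; they are not the real POW or external lexicon data:

```python
# Sketch: supplement a small corpus wordlist (word, tag) -> count
# with entries from a larger external lexicon, word -> frequency.

corpus_wordlist = {('ACROSS', 'P'): 14, ('PUT', 'M'): 482}
external_lexicon = {'ABANDON': 16, 'ACROSS': 57}

merged = dict(corpus_wordlist)
known_words = {word for word, tag in merged}
for word, freq in external_lexicon.items():
    if word not in known_words:
        # tag left open until the entry is mapped to an SFG category
        merged[(word, None)] = freq
# merged now also covers words unseen in the corpus wordlist
```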
1 ABROAD AX
1 ABROARD CM
2 ACCIDENT H
1 ACCOUNTANT H
1 ACHING M
5 ACROSS AX
1 ACROSS CM
14 ACROSS P
3 ACTING M
4 ACTION-MAN HN
1 ACTUALLY AL
7 ADD M
1 ADD? M
1 ADDED M
1 ADDING M
1 ADJUST M
3 ADN &
1 ADN-THEN &
1 ADRIAN HN
1 ADVENTRUE H

Figure 6. A fragment of the unedited POW wordlist

The source of lexical information we decided to tap into was the CELEX database. This is a lexical database for Dutch, English and German, developed at the University of Nijmegen, Holland. The English section of the CELEX database comprises the intersection of the word stems in LDOCE (Procter 1978, ASCOT version, see Akkerman et al 1988) and OALD (Hornby 1974), expanded to include all the morphological variants; a total of 80,429 wordforms. Moreover, the wordform entries include frequency counts from the COBUILD (Birmingham) 18 million word corpus of British English (Sinclair 1987), normalised to frequencies per million words. These are sadly not disambiguated for wordforms with more than one syntactic reading. A word such as "cut", which might be labelled as an adjective, noun and verb, would only have one gross frequency figure (177) in the database. There are separate entries for "cuts", "cutting" and other morphological variants. CELEX offers a fairly traditional set of syntax categories, augmented with some secondary stem and morphological information derived mainly from the LDOCE source dictionary. The format of the lexicon we exported to Leeds is shown in Figure 7.

abaci abacus N Y N N N N N N N N Y 0 N irr
aback aback ADV Y N N N N N 3 N
abacus abacus N Y N N N N N N N N N 0 N
abacuses abacus N Y N N N N N N N N Y 0 N +es
abandon abandon N N Y N N N N N N N N 16 Y
abandon abandon V Y N N N N N N N N N N N 16 Y
abandoned abandon V Y N N N N N Y N N Y N N +ed 36 Y
abandoned abandoned A Y N N N N 36 Y N N
KEY: wordform stem category ...stem info... frequency ambiguity ...morphol. info...

Figure 7. A sample of the condensed CELEX lexicon.

4.1 Manipulating the CELEX lexicon

Various transformations were performed on the lexicon to make it more suitable for use in parsing using systemic functional grammar. Using an AWK program and some UNIX tools, we reformatted the lexicon along the lines of the bracketed LDOCE dictionary, which is perhaps the nearest to an established norm. A substantial number of the verb entries, which would be duplicates for our purposes, were removed, reducing the lexicon to only 59,322 wordforms. Columns of the lexicon were reordered to bring the frequency information into the same column for all entries, irrespective of syntactic category.
The most difficult change to perform was the transforming of the CELEX syntax codes into SFG. Some mappings were achieved automatically because they were one-to-one, e.g. prepositions, main verbs and head nouns. Others resorted to stem information, to distinguish subordinating and co-ordinating conjunctions, and various pronouns, for example. Elsewhere, the grammars diverge so substantially (or CELEX did not contain the relevant information) that manual intervention is required (determiners, auxiliaries, temperers, proper nouns). The development of the lexicon is not yet complete, with the 300 or so manual additions requiring frequency data. A mechanism for handling proper nouns, compounds and idioms has yet to be included, and the coverage of the lexicon has yet to be tested against a corpus. A portion of the resulting lexicon is illustrated in Figure 8.

When the development and testing work is complete, the lexicon will be compatible with the grammar extracted from the POW corpus and be able to be integrated into the parsing programs we have been experimenting with.
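The automatic part of such a category mapping can be sketched as below. The CELEX-side names here are invented placeholders, not real CELEX codes, and the fall-through marker is our own device for flagging entries that need hand mapping:

```python
# Sketch: one-to-one categories convert directly to SFG tags;
# anything else is flagged for manual intervention.

ONE_TO_ONE = {'Prep': 'P', 'MainVerb': 'M', 'HeadNoun': 'H'}

def to_sfg(word, category):
    if category in ONE_TO_ONE:
        return (word, ONE_TO_ONE[category])
    return (word, '?' + category)      # needs hand mapping

entries = [('across', 'Prep'), ('the', 'Det')]
mapped = [to_sfg(w, c) for w, c in entries]
# mapped == [('across', 'P'), ('the', '?Det')]
```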
((abaci)(abacus)(H)(0)(N)(Y)(N)(N)(N)(N)(N)(N)(N)(N)(Y)(irr))
((aback)(aback)(AX)(3)(N)(Y)(N)(N)(N)(N)(N))
((abacus)(abacus)(H)(0)(N)(Y)(N)(N)(N)(N)(N)(N)(N)(N)(N)())
((abacuses)(abacus)(H)(0)(N)(Y)(N)(N)(N)(N)(N)(N)(N)(N)(Y)(+es))
((abandon)(abandon)(H)(16)(Y)(N)(Y)(N)(N)(N)(N)(N)(N)(N)(N)())
((abandon)(abandon)(M)(16)(Y)(Y)(N)(N)(N)(N)(N)(N)(N)(N)(N)(N)(N)())
((abandoned)(abandon)(M)(36)(Y)(Y)(N)(N)(N)(N)(N)(Y)(N)(N)(Y)(N)(N)(+ed))
((abandoned)(abandoned)(AX)(36)(Y)(Y)(N)(N)(N)(N)(N)(N))
KEY: ((wordform)(stem)(category)(frequency)(ambiguity)(...stem info...)(...morphol. info...))

Figure 8. A sample of the CELEX lexicon reformatted to SFG syntax.

5. Exploiting the probabilistic lexicon and grammar in parsing

Perhaps the most obvious way to introduce probability into the parsing process is to adapt a traditional chart parser to include the probability of each edge as it is built into the chart. The search strategy can then take into account the likelihood of combinations of edges, and find the most likely parse tree first (if the sentence is ambiguous). A parser may thereby be constrained to produce only one optimal tree, or be allowed to continue its search for all possible trees, in order of decreasing likelihood. For an example of this approach, see (Briscoe and Carroll 1991). In the projects described here, we have abandoned the use of rule-based grammars and parsers altogether, on the grounds that they will still fail occasionally due to lack of coverage, even with a very large grammar. The main alternative we have experimented with, as discussed briefly in section 1, is the use of simulated annealing as a search strategy.
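For concreteness, the edge-probability idea mentioned above can be illustrated with a toy Viterbi-style chart. The grammar, probabilities and category names below are invented; this is not the Alvey, POW or COMMUNAL grammar, and it is exactly the kind of rule-based parser the projects here moved away from:

```python
rules = {                      # (mother, daughters) -> probability
    ('Z', ('CL',)): 0.8,
    ('CL', ('S', 'M')): 1.0,
    ('S', ('ngp',)): 1.0,
    ('M', ('v',)): 1.0,
}
lexicon = {'we': ('ngp', 1.0), 'run': ('v', 1.0)}

def parse(words):
    """Each chart cell keeps the best probability per category over a span."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        tag, p = lexicon[w]
        chart[i][i + 1][tag] = p
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span
            # binary rules over every split point
            for k in range(i + 1, j):
                for (m, ds), rp in rules.items():
                    if len(ds) == 2 and ds[0] in chart[i][k] and ds[1] in chart[k][j]:
                        p = rp * chart[i][k][ds[0]] * chart[k][j][ds[1]]
                        chart[i][j][m] = max(p, chart[i][j].get(m, 0.0))
            # unary closure: keep applying unary rules while scores improve
            changed = True
            while changed:
                changed = False
                for (m, ds), rp in rules.items():
                    if len(ds) == 1 and ds[0] in chart[i][j]:
                        p = rp * chart[i][j][ds[0]]
                        if p > chart[i][j].get(m, 0.0):
                            chart[i][j][m] = p
                            changed = True
    return chart[0][n]

best = parse(['we', 'run'])
# best['Z'] == 0.8 : the best probability for a Z spanning the whole input
```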
5.1 The APRIL Parser

In the earlier APRIL project, the aim was to use simulated annealing to find a best-fit parse tree for any sentence. The approach adopted by the APRIL project involves parsing a sentence by the following steps:

[1] Initially, generate a "first-guess" parse tree at random, with any structure (as long as it is a well-formed phrase marker), and any legal symbols at the nodes.

[2] Then, make a series of random localised changes to the tree, by randomly deleting nodes or inserting new (randomly-labelled) nodes. At each stage, the likelihood of the resulting tree is measured using a constituent-likelihood function; if the change would cause the tree likelihood to fall below a threshold, then it is not allowed (but alterations which increase the likelihood, or decrease it only by a small amount, are accepted).

Initially, the likelihood threshold below which tree plausibility may not fall is very low, so almost all changes are accepted; however, as the program run proceeds, the threshold is gradually raised, so that fewer and fewer 'worsening' changes are allowed; and eventually, successive modifications should converge on an optimal parse tree, where any proposed change will worsen the likelihood. This is a very simplified statement of this approach to probabilistic parsing; for a fuller description, see (Haigh et al 1988, Sampson et al 1989, Atwell et al 1988).
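The annealing loop in steps [1]-[2] can be illustrated with a deliberately tiny stand-in problem. None of this is the actual APRIL code: the 'state' is a bit-string standing in for a candidate parse tree, and 'score' stands in for the constituent-likelihood function.

```python
import math
import random

def anneal(score, state, steps=2000, t0=2.0):
    """Random local changes; the acceptance threshold rises over time."""
    t = t0
    for step in range(steps):
        i = random.randrange(len(state))
        neighbour = state[:i] + [1 - state[i]] + state[i + 1:]
        delta = score(neighbour) - score(state)
        # improving moves are always accepted; worsening moves are
        # accepted with a probability that shrinks as the run proceeds
        if delta >= 0 or random.random() < math.exp(delta / t):
            state = neighbour
        t = t0 * (1 - (step + 1) / steps) + 1e-6
    return state

random.seed(0)
best = anneal(lambda s: sum(s), [0] * 10)   # optimum is all ones
```

As in APRIL, almost any change is accepted early on, while late in the run only improvements survive, so the search settles on (or very near) the optimum.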
5.2 TheRealistic AnnealingParser
The parse trees built by APRIL were not Systemic Functional Grammar analyses, but rather followed a surface-structure scheme and set of labelling conventions designed by Geoffrey Leech and Geoffrey Sampson specifically for the syntactic analysis of LOB Corpus texts. The parse tree conventions deliberately avoided specific schools of linguistic theory such as Generalised Phrase Structure Grammar, Lexical Functional Grammar (and SFG), aiming to be a 'theory-neutral greatest common denominator' and avoid theory-specific issues such as functional and semantic labelling. The labels on nodes are atomic, indivisible symbols, whereas most other linguistic theories (including SFG) assume nodes have compound labels including both category and functional information. To adapt APRIL to produce parse trees in other formalisms, such as the SFG used in the POW corpus and the COMMUNAL project, the primitive moves in the annealing parser would have to be modified to allow for compound node labels.
Several other aspects of the APRIL system were modified during the COMMUNAL project to produce the Realistic Annealing Parser. APRIL attempts to parse a whole sentence as a unit, whereas many researchers believe it is more psychologically plausible to parse from left to right, building up the parse tree incrementally as words are 'consumed'. When proposing a move or tree-modification, APRIL chooses a node from the current tree at random, whereas it would seem more efficient to somehow keep track of how 'good' nodes are and concentrate changes on 'bad' nodes. The above two factors can be combined by including a left-to-right bias in the 'badness' measure, biasing the node-chooser against changing left-most (older) nodes in the tree. APRIL (at least the early prototype) assumes the input to be parsed is not a sequence of words but of word-tags, the output of the CLAWS system, whereas the COMMUNAL parser would deal with 'raw word' input.
The Realistic Annealing Parser was developed by Tim O'Donoghue [3], and employed the wordlist and probabilistic RTN extracted from either the POW or the Ark corpora, but the same method could equally have been used for a number of other parsed corpora. The way the RAP deals with all the above problems is explained in detail in COMMUNAL Report No. 17 (Atwell et al 1988), so here we will just list briefly the main innovations:
5.2.1 Modified primitive moves
The set of primitive moves has been modified to ensure 'well-formed' Systemic Functional trees are built, with alternating functional and category labels. The set of primitive moves was also expanded to allow for unfinished constituents only partially built as the parse proceeds from left to right through the sentence.
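The alternation constraint on 'well-formed' trees can be made concrete with a small sketch. The label inventories and tree encoding below are invented for illustration; they are not the COMMUNAL inventories.

```python
# Hypothetical sketch of the 'well-formed' constraint on Systemic
# Functional trees: function and category labels must alternate down
# the tree. The label sets here are invented examples.
FUNCTIONS = {"Subject", "Finite", "Predicator", "Complement", "head", "modifier"}
CATEGORIES = {"Cl", "ngp", "vgp", "N", "V"}   # clause, nominal group, ...

def well_formed(node, expect_function=False):
    """A node is (label, children). Starting from a category root,
    each layer must switch between category and function labels."""
    label, children = node
    expected = FUNCTIONS if expect_function else CATEGORIES
    if label not in expected:
        return False
    return all(well_formed(child, not expect_function) for child in children)

tree = ("Cl", [("Subject", [("ngp", [("head", [("N", [])])])]),
               ("Predicator", [("vgp", [("head", [("V", [])])])])])
print(well_formed(tree))   # True: labels alternate at every level
```

A move that inserted a category node directly under another category node (e.g. an ngp immediately under Cl) would fail this check, which is why the primitive move set has to be constrained.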
5.2.2 Pruning the search space by left-to-right
incremental parsing
TheRAPuses several annealingruns rather than just
one to parse a completesentence: annealingis used to
find the best partial parse up to wordN, then wordN+I
is addedas a newleaf node,and annealingrestarted to
find a new, larger partial parse tree. This should
effectively prunethe search space, since (Marcus1980)
and others have shownthat humansparse efficiently
fromleft to right mostof the lime. However,
it is not
the case that the parse tree up to wordNmustbe kept
unchangedwhenmergingin wordN+I; in certain cases
(eg ’garden path sentences’) the new word
incompatiblewith the existing partial parse, whichmust
be radically altered beforea newoptimalpartial parse is
arrivedat.
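The word-by-word regime can be outlined as follows. Here `anneal_partial` is only a stand-in for the annealing search (a single deterministic toy move, invented for illustration); the point is the outer loop, which re-anneals the whole partial tree after each word is consumed, so earlier structure may be revised.

```python
# Outline of the RAP's left-to-right control loop: one annealing run per
# word, each seeded with the previous partial parse. `anneal_partial` is
# a stand-in for the real annealing search -- here it just wraps a
# trailing AT NN tag pair as an NP, to show that each run may
# restructure the growing tree.

def anneal_partial(tree):
    mother, kids = tree
    if len(kids) >= 2 and kids[-2:] == ("AT", "NN"):
        kids = kids[:-2] + (("NP", ("AT", "NN")),)
    return (mother, kids)

def parse_incrementally(tags):
    tree = ("S", ())                      # empty partial parse
    for tag in tags:
        mother, kids = tree
        tree = (mother, kids + (tag,))    # add word N+1 as a new leaf
        tree = anneal_partial(tree)       # re-anneal the enlarged tree;
                                          # earlier structure may change
                                          # (garden-path recovery)
    return tree

print(parse_incrementally(["AT", "NN", "VB", "AT", "NN"]))
# ("S", (("NP", ("AT", "NN")), "VB", ("NP", ("AT", "NN"))))
```

In the real parser each call would be a full annealing run, and a garden-path input could force it to undo structure built for earlier words.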
5.2.3 Optimising the choice of moves by Probability Density Functions (PDFs)
The rate of convergence on an optimal solution can be improved by modifying the random elements inherent in the Simulated Annealing algorithm. When proposing a move or tree-change, instead of choosing a node purely at random, all nodes in the tree are evaluated to see how 'bad' they are. This 'badness' depends upon how a node fits in with its sisters and what kind of mother it is to any daughters it has. These 'badness' values are then used as a PDF in choosing a node, so that a 'bad' node is more likely to be modified. Superimposed on this 'badness' measure is a weight corresponding to the node's position from left to right in the tree, so that relatively new right-most nodes are more likely to be altered. This left-to-right bias tends to inhibit backtracking except when new words added to the right of the partial parse tree are completely incompatible with the parse so far.
5.2.4 Recursive First-order Markov model
To evaluate a partial tree at any point during the parse, the partial tree is mapped onto the probabilistic recursive transition network (RTN) which has been extracted from the corpus. The probability of the path through the RTN is calculated by multiplying together all the arc probabilities en route. Each network in the RTN is used to evaluate a mother and its immediate daughters. The shape of each network is a straightforward first-order Markov model, with the further constraint that each category is represented in only one node. There is no need to 'hand-craft' the shape of each transition network, since this constraint applies to all recursive networks. This simplifies statistics extraction from the parsed corpus, and we have found empirically that it appears to be sufficient to evaluate partial trees. This empirical finding is theoretically significant, as it would seem to be at odds with established ideas founded by Chomsky (1957). However, although Chomsky maintained that a simple Markov model was insufficient to describe Natural Language syntax, he did not consider Recursive Markov models, and did not contemplate their use in conjunction with a stochastic optimisation algorithm such as Simulated Annealing.
6. Conclusions
The Polytechnic of Wales Corpus is one of a few parsed corpora already in existence which offer a rich source of grammatical information for use in probabilistic parsing. We have developed tools for the extraction of constituent likelihood grammars in a number of formalisms, which could, with relatively minor adjustment, be used on other parsed corpora. The provision of compatible large-scale probabilistic lexical resources is achieved by semi-automatically manipulating a lexical database or machine-tractable dictionary to conform with the corpus-based grammar, and supplementing corpus word-frequencies. Constituent likelihood grammars are particularly useful for the syntactic description of unrestricted natural language, including syntactically 'noisy' or ill-formed text.
The Realistic Annealing Parser is one of a line of probabilistic grammatical analysis systems built along similar general principles. Other projects are ongoing which utilise corpus-based probabilistic grammars, supplemented with suitably large probabilistic lexicons. They include the use of neural networks, vertical strip parsers, and hybrid parsers which combine the efficiency of chart parsers with the robustness of probabilistic alternatives. Whilst we would not yet claim that these have progressed beyond the workbench and onto the production line for general use, they offer promising competition to the established norms of rule-based parsers. We believe that corpus-based probabilistic parsing is a new area of Computational Linguistics which will thrive as parsed corpora become more widely available.
7. Notes
[1] ICAME (International Computer Archive of Modern English), Norwegian Computing Centre for the Humanities, P.O. Box 53, Universitetet, N-5027 Bergen, Norway.
[2] The Oxford Text Archive (OTA), Oxford University Computing Service, 13 Banbury Road, Oxford OX2 6NN, UK.
[3] Tim O'Donoghue was a SERC-funded PhD student in the School of Computer Studies, who contributed to the Leeds research on the COMMUNAL project.
8. References
ACL 1983. "Computational Linguistics: Special issue on dealing with ill-formed text." Journal of Computational Linguistics volume 9 (3-4).
Akkerman, Eric, Hetty Voogt-van Zulphen and Willem Meijs. 1988. "A computerized lexicon for word-level tagging". ASCOT Report No 2, Rodopi Press, Amsterdam.
Atwell, Eric Steven. 1983. "Constituent-likelihood grammar". ICAME Journal no. 7, pp. 34-66.
Atwell, Eric Steven. 1988. "Grammatical analysis of English by statistical pattern recognition" in Josef Kittler (ed), Pattern Recognition: Proceedings of the 4th International Conference, Cambridge, pp 626-635, Berlin, Springer-Verlag.
Atwell, Eric Steven. 1990. "Measuring Grammaticality of machine-readable text" in Werner Bahner, Joachim Schildt and Dieter Viehweger (eds.), Proceedings of the XIV International Congress of Linguists, Volume 3, pp 2275-2277, Berlin.
Atwell, Eric Steven. 1992. "Overview of Grammar Acquisition Research" in Henry Thompson (ed), "Workshop on sublanguage grammar and lexicon acquisition for speech and language: proceedings", pp. 65-70, Human Communication Research Centre, Edinburgh University.
Atwell, Eric Steven and Stephen Elliot. 1987. "Dealing with ill-formed English text" in Garside et al 1987, pp. 120-138.
Atwell, Eric Steven, D. Clive Souter and Tim F. O'Donoghue. 1988. "Prototype Parser 1". COMMUNAL Report No. 17, CCALAS, Leeds University.
Briscoe, Ted and John Carroll. 1991. "Generalised Probabilistic LR Parsing of Natural Language (Corpora) with Unification-based Grammars". University of Cambridge Computer Laboratory Technical Report No. 224, June 1991.
Charniak, Eugene. 1983. "A parser with something for everyone" in Margaret King (ed.) Parsing Natural Language, Academic Press, London.
Chomsky, Noam. 1957. "Syntactic Structures". Mouton, The Hague.
Ellegard, A. 1978. "The Syntactic Structure of English Texts". Gothenburg Studies in English 43. Gothenburg: Acta Universitatis Gothoburgensis.
Fawcett, Robin P. 1981. "Some Proposals for Systemic Syntax", Journal of the Midlands Association for Linguistic Studies (MALS) 1.2, 2.1, 2.2 (1974-76). Re-issued with light amendments, 1981, Department of Behavioural and Communication Studies, Polytechnic of Wales.
Fawcett, Robin P. and Michael R. Perkins. 1980. "Child Language Transcripts 6-12". Department of Business and Communication Studies, Polytechnic of Wales.
Fawcett, Robin P. and Gordon Tucker. 1989. "Prototype Generators 1 and 2". COMMUNAL Report No. 10, Computational Linguistics Unit, University of Wales College of Cardiff.
Garside, Roger, Geoffrey Sampson and Geoffrey Leech (eds.). 1987. "The computational analysis of English: a corpus-based approach". London, Longman.
Grover, C., E. J. Briscoe, J. Carroll and B. Boguraev. 1989. "The Alvey natural language tools grammar". University of Cambridge Computer Laboratory Technical Report No. 162, April 1989.
Haigh, Robin, Geoffrey Sampson and Eric Steven Atwell. 1988. "Project APRIL - a progress report". Proceedings of the 26th ACL Conference, Buffalo, USA.
Heidorn, G. E., K. Jensen, L. A. Miller, R. J. Byrd and M. S. Chodorow. 1982. "The EPISTLE text-critiquing system" in IBM Systems Journal 21(3): 305-326.
Hornby, A. S. (ed.) 1974. "Oxford Advanced Learner's Dictionary of Contemporary English". Oxford, OUP.
Johannson, Stig, Eric Atwell, Roger Garside and Geoffrey Leech. 1986. "The Tagged LOB Corpus". Norwegian Computing Centre for the Humanities, University of Bergen, Norway.
Keulen, Francoise. 1986. "The Dutch Computer Corpus Pilot Project". In Jan Aarts and Willem Meijs (eds.) Corpus Linguistics II, pp 127-161. Amsterdam: Rodopi Press.
Kwasny, S. and Norman Sondheimer. 1981. "Relaxation techniques for parsing grammatically ill-formed input in natural language understanding systems" in American Journal of Computational Linguistics 7(2): 99-108.
Marcus, Mitchell. 1980. "A theory of syntactic recognition for natural language". MIT Press, Cambridge, Massachusetts.
O'Donoghue, Tim F. 1990. "Taking a parsed corpus to the cleaners: the EPOW corpus". ICAME Journal no. 15, pp 55-62.
O'Donoghue, Tim F. 1991. "The Vertical Strip Parser: A lazy approach to parsing". Research Report 91.15, School of Computer Studies, University of Leeds.
Procter, Paul. 1978. "Longman Dictionary of Contemporary English". London, Longman.
Sampson, Geoffrey. 1987a. "The Grammatical Database and Parsing Scheme". In Garside et al 1987, pp 82-96.
Sampson, Geoffrey. 1987b. "Evidence against the "grammatical"/"ungrammatical" distinction". In Willem Meijs (ed.) Corpus Linguistics and Beyond. Amsterdam: Rodopi.
Sampson, Geoffrey. 1992. "Analysed corpora of English: a consumer guide". In Martha Pennington and Vance Stevens (eds.) Computers in Applied Linguistics. Multilingual Matters.
Sampson, Geoffrey R., Robin Haigh and Eric S. Atwell. 1989. "Natural language analysis by stochastic optimization: a progress report on Project APRIL". Journal of Experimental and Theoretical Artificial Intelligence 1: 271-287.
Sinclair, John McH. (ed.) 1987. "Looking Up: an account of the COBUILD project in lexical computing". London: Collins.
Souter, Clive. 1989. "A Short Handbook to the Polytechnic of Wales Corpus". ICAME, Norwegian Computing Centre for the Humanities, Bergen University, Norway.
Souter, Clive. 1990. "Systemic Functional Grammars and Corpora". In Jan Aarts and Willem Meijs (eds.) Theory and Practice in Corpus Linguistics, pp 179-211. Amsterdam: Rodopi Press.
Souter, Clive and Tim O'Donoghue. 1991. "Probabilistic Parsing in the COMMUNAL project". In Stig Johannson and Anna-Brita Stenström (eds.) English Computer Corpora: Selected Papers and Research Guide, pp 33-48. Berlin: Mouton de Gruyter.
Taylor, Lita, Claire Grover and Ted Briscoe. 1989. "The syntactic regularity of English noun phrases". Proceedings of the 4th European ACL Conference, Manchester, pp 256-263.
Weischedel, Ralph and John Black. 1980. "Responding intelligently to unparsable inputs". In American Journal of Computational Linguistics 6(2): 97-109.
Wright, Joan. 1988. "The development of tools for writing and testing systemic functional grammars". COMMUNAL Report No. 3, Computational Linguistics Unit, University of Wales College of Cardiff.