an experiment with heuristic parsing of swedish

AN EXPERIMENT WITH HEURISTIC PARSING OF SWEDISH Benny Brodda Inst. of Linguistics University of Stockholm S-I06 91 Stockholm SWEDEN carriers of semantic Information - were substituted for by a mere "-", whereas other formatives were left in their original shape and place. These transformed texts were presented to subjects who w e r e a s k e d to fill in the gaps in such a w a y that the texts thus obtained were to be both syntactically correct and reasonably coherent. ABSTRACT Heuristic parsing is the art of doing parsing in a haphazard and seemingly careless manner but in such a way that the outcome is still "good", at least from a statistical point of view, or, hopefully, even f r o m a m o r e a b s o l u t e point of view. The idea is to find s t r a t e g i c s h o r t c u t s d e r i v e d f r o m g u e s s e s about the s t r u c t u r e of a s e n t e n c e b a s e d on s c a n t y o b s e r v a t i o n s of linguistic units In the sentence. If the guess comes out right much p a r s i n g t i m e can be saved, and if it does not, m a n y s u b o b s e r v a t i o n s m a y still be valid for rev i s e d guesses. In the (very p r e l i m i n a r y ) e x p e r i ment reported here the main idea is to make use of ( c o m b i n a t i o n s of) s u r f a c e p h e n o m e n a as m u c h as p o s s i b l e as the base for the p r e d i c t i o n of the s t r u c t u r e as a whole. In the p a r s e r to be developed along the lines sketched in this report main stress is p u t on a r r i v i n g at i n d e p e n d e n t l y working, parallel recognition procedures. The result of the e x p e r i m e n t was r a t h e r astonishing. It turned out that not only were the syntactic structures mainly restored, in some few cases also the original content was reestablished, a l m o s t w o r d by word. (It was b e y o n d any p o s s i bility that the subjects could have had access to the original text.) Even in those cases when the text i t s e l f w a s not r e s t o r e d to this r e m a r k a b l e extent, the stylistic v a l u e of the v a r i o u s texts was almost invariably reestablished; an originally lively, n a r r a t i v e story c a m e out as a lively, n a r r a t i v e story , and a p i e c e of r a t h e r dull, f a c t u a l text (from a s c h o o l text book on s o c i o logy) invariably came out as dull, factual prose. The work reported here Is both aimed at simul a t l n g c e r t a i n a s p e c t s of h u m a n l a n g u a g e perc e p t i o n and at a r r i v i n g at e f f e c t i v e a l g o r i t h m s for a c t u a l p a r s i n g of r u n n i n g text. T h e r e is, indeed, a great need for fast such a l g o r i t h m s , e.g. for the analysis of the literally millions of words of running text that already today comprise the data b a s e s in v a r i o u s large i n f o r m a t i o n ret r i e v a l s y s t e m s , and w h i c h can be e x p e c t e d to e x p a n d s e v e r a l o r d e r s of m a g n i t u d e b o t h in importance and In size In the foreseeable future. I This experiment showed quite clearly that at least for Swedish the information contained in the c o m b i n a t i o n s of s u r f a c e m a r k e r s to a r e m a r k a b l y h i g h d e g r e e r e f l e c t s the s y n t a c t i c s t r u c t u r e of the o r i g i n a l text; in a l m o s t all cases also the s t y l i s t i c v a l u e and in s o m e f e w cases even the s e m a n t i c c o n t e n t was kept. (The extent to w h i c h this is true is probably language dependent; Swed i s h is r a t h e r rich in m o r p h o l o g y , and this property is certainly a contributing factor for an experiment of this type to come out successful to the extent it actually did.) BACKGROUND T h i s type of e x p e r i m e n t has since then b e e n repeated many times by many scholars; in fact, it ls one of the s t a n d a r d w a y s to d e m o n s t r a t e the c o n c e p t of r e d u n d a n c y in texts. But there are several other important conclusions one could draw f r o m this type of e x p e r i m e n t s . F i r s t of all, of course, the o b v i o u s c o n c l u s i o n that s u r f a c e s i g n a l s do carry a lot of i n f o r m a t i o n about the s t r u c t u r e of s e n t e n c e s , p r o b a b l y m u c h m o r e than one has been inclined to think, and, consequently, It could be w o r t h w h i l e to try to c a p t u r e that I n f o r m a t i o n in s o m e kind of a u t o m a t i c a n a l y s i s system. This is the p r a c t i c a l side of it. But there is more to it. One must ask the question why a language llke Swedish is llke this. What are the theoretical implications? The g e n e r a ! idea b e h i n d the s y s t e m for heuristic parsing now being developed at our group in S t o c k h o l m dates m o r e than 15 y e a r s back, w h e n I was m a k i n g an i n v e s t i g a t i o n ( t o g e t h e r w i t h Hans K a r l g r e n , S t o c k h o l m ) of the p o s s i b i l i t i e s of using computers for information retrieval purposes for the Swedish Governmental Board for Rationaliz a t i o n (Statskontoret). In the c o u r s e of this i n v e s t i g a t i o n we performed some psycholingulstic e x p e r i m e n t s a i m e d at f i n d i n g out to w h a t extent s u r f a c e m a r k e r s , such as endings, p r e p o s i t i o n s , conjunctions and other (bound) e l e m e n t s from t y p i c a l l y c l o s e d c a t e g o r i e s of linguistic units, could serve as a base for a syntactic analysis of sentences. We s a m p l e d a c o u p l e of texts m o r e or less at r a n d o m and p r e p a r e d t h e m in such a way that stems of nouns, adjectives and (main) verbs these c a t e g o r i e s b e i n g thought of as the m a i n Much Interest has been devoted in later years to t h e o r i e s (and s p e c u l a t i o n s ) about h u m a n per- 66 ception o f linguistic stimuli, and I do not think that one s p e c u l a t e s too m u c h if one a s s u m e s that s u r f a c e m a r k e r s of the type that a p p e a r e d in the described experiment t o g e t h e r c o n s t i t u t e imp o r t a n t clues c o n c e r n i n g the gross s y n t a c t i c structure of sentences (or utterances), clues that are probably much less consiously perceived than, e.g., the a c t u a l w o r d s in the s e n t e n c e s / u t t e r a n ces. To the extent that such clues are a c t u a l l y p e r c e i v e d they are o b v i o u s l y p e r c e i v e d s i m u l t a n e o u s l y with, i.e. in p a r a l l e l with, other units (words, for instance). linguistic unit that can be identified at whatever stage of the analysis, according to whatever means there are, i_~s identified, and the significance of the fact that the unit in q u e s t i o n has been identified is made use of in all subsequent stages of the analysis. At any time one must.be prepared to reconsider an already established analysis of a unit on the g r o u n d that e v i d e n c e a ~ a l n s t the analysis may successively accumulate due to what analyses other units arrive at. In next s e c t i o n we give a brief d e s c r i p t i o n of t h e a n a l y s i s s y s t e m for S w e d i s h that is n o w under d e v e l o p m e n t at our g r o u p in S t o c k h o l m . As has been said, m u c h effort is spent on t r y i n g to m a k e use of s u r f a c e s i g n a l s as m u c h as possible. Not that we b e l i e v e that s u r f a c e s i g n a l s play a more important r o l e t h a n a n y o t h e r t y p e of linguistic signals, but rather that we think it is i m p o r t a n t to try to o p t i m i z e each single subprocess (in a p a r a l l e l system) as m u c h as ~osslble, and, as said, it m i g h t be w o r t h w h i l e to look careful into this level, b e c a u s e the importance of surface signals might have been underestimated in previous research. Our exneriments so far s e e m to i n d i c a t e that they c o n s t i t u t e exc e l l e n t units to base h e u r i s t i c g u e s s e s on. Another reason for concentrating our efforts on this level is that it takes time and requires much hard computational w o r k to get such an a n a r c h i s t i c system to really work, and this surface level is reasonably simple to handle. The above way of looking upon perception as a set of i n d e p e n d e n t l y o p e r a t i n g p r o c e s s e s is, of course, more or less generally accepted nowadays (cf., e.g., L i n d s a y - N o r m a n 1977), and it is also g e n e r a l l y a c c e p t e d in c o m p u t a t i o n a l linguistics that any p r o g r a m that a i m s at s i m u l a t i n g perc e p t i o n in one w a y or other m u s t have features that s i m u l a t e s (or, even better, a c t u a l l y performs) p a r a l l e l processing, and the a n a l y s i s system to be described below has much emphasis on exactly this feature. A n o t h e r c o m m o n saying n o w a d a y s w h e n discussing parsing techniques is that one should try to i n c o r p o r a t e " h e u r i s t i c devices" (cf., e.g., the m a n y s u b r e p o r t s related to the big ARPAproject c o n c e r n i n g S p e e c h Recognition and Understanding 1970-76), a l t h o u g h there does not seem to exist a very precise consensus of what exactly that w o u l d mean. (In m a t h e m a t i c s the term has been t r a d i t i o n a l l y used to refer to i n f o r m a l reasoning, e s p e c i a l l y w h e n used in c l a s s r o o m situations. In a f a m o u s study the h u n g a r i a n m a t h e m a t i c i a n Polya, 1945 put forth the thesis that h e u r i s t i c s is one of the most i m p o r t a n t p s y c h o l o g i c a l d r i v i n g m e c h a n i s m s behind m a t h e matical - or s c i e n t i f i c - progress. In AIl i t e r a t u r e it is often used to refer to shortcut search m e t h o d s in s e m a n t i c networks/spaces; c.f. Lenat, 1982). II AN OUTLINE OF AN ANALYZER BASED ON THE HEURISTIC PRINCIPLE F i g u r e 1 b e l o w s h o w s the g e n e r a l o u t l i n e of the system. E a c h of the v a r i o u s boxes (or subboxes) represents one specific process, usually a complete computer program in itself, or, in some cases, independent processes within a program. The big "container", l a b e l l e d "The Pool", c o n t a i n s both the l i n g u i s t i c m a t e r i a l as w e l l as the c u r r e n t a n a l y s i s of it. E a c h p r o g r a m or process looks into the Pool for things "it" can recognize, and when the process finds anything it is trained to recognize, it adds its o b s e r v a t i o n to the m a terial in the Pool. This added material may (hopefully) help o t h e r p r o c e s s e s in r e c o g n i z i n g w h a t they are trained to recognize, w h i c h in its turn may again help the first process to recognize more of "its" units. And so on. One reason for trying to adopt s o m e kind of h e u r i s t i c device in the a n a l y s i s p r o c e d u r e s is that one for m a t h e m a t i c a l reasons k n o w s that ordinary, "careful", parsing algorithms inherently s e e m to refuse to w o r k in real t i m e (i.e. in linear time), whereas human beings, on the whole, s e e m to be able to do e x a c t l y that (i.e. p e r c e i v e sentences or utterances simultaneously with their production). Parallel processing may partly be an a n s w e r to that d i l e m m a , but still, any process that c l a i m s to a c t u a l l y s i m u l a t e s o m e part of human perception must in s o m e w a y or other s i m u l a t e the r e m a r k a b l e a b i l i t i e s h u m a n beings have in g r a s p i n g c o m p l e x p a t t e r n s ("gestalts") seemingly in one single operation. The s y s t e m is n o w under d e v e l o p m e n t and during this build-up phase each process is, as was said above, e s s e n t i a l l y a c o m p l e t e , s t a n d - a l o n e module, and the Pool exists simply as successively u p d a t e d text files on a disc storage. At the moment some programs presuppose that other programs have a l r e a d y been run, but this state of a f f a i r s w i l l be valid Just d u r i n g this b u i l d ~ u p phase. At the end of the b u i l d - u p phase each p r o g r a m shall be able to run c o m p l e t e l y independent of any other program in the system and in a r b i t r a r y o r d e r r e l a t i v e to the others (but, of course, usually perform better if more information is available in the Pool). Ordinary, careful, p a r s i n g a l g o r i t h m s are often organized according to s o m e general p r i n c i p l e such as "top-down", "bottom-to-top", "breadth f i r s t " , " d e p t h f i r s t " , etc., t h e s e headings r e f e r r i n g to s o m e s p e c i f i e d type of "strategy". The h e u r i s t i c m o d e l we are trying to work out has no such preconceived strategy built i n t o it. O u r p h i l o s o p h y is i n s t e a d rather a n a r c h i s t i c (The H e u r i s t i c Principle): W h a t e v e r 67 In the ~ e c o n d phase s u p e r o r d i n a t e d c o n t r o l p r o g r a m s are to be i m p l e m e n t e d . T h e s e p r o g r a m s w i l l f u n c t i o n as "traffic rules" and via these systems one shall be able to test various strategies, i.e. to test w h i c h r e l a t i v e order b e t w e e n the different subsystems that yields optimal resuit in s o m e k i n d of " p e r f o r m a n c e metric", s o m e e v a l u a t i o n p r o c e d u r e that takes both speed and quality into account. tires; the word minus this formative must still be pronounceable, otherwise it cannot be a formative. S M U R F w o r k s e n t i r e l y w i t h o u t s t e m lexicon; it a d h e r e s c o m p l e t e l y to the "philosophy" of u s i n g surface signals as far as possible. NOMFRAS, VERBAL, IFIGEN, CLAUS and PREPPS are other "demons" that recognize different phrases or w o r d g r o u p s w i t h i n s e n t e n c e s , viz. noun phrases, verbal complexes, infinitival constructions, c l a u s e s and p r e p o s i t i o n a l phrases, respectively. N-lex, V - l e x and A - l e x r e p r e s e n t v a r i o u s (sub)lexicons; so far we have tried to do without them as far as possible. One s h o u l d o b s e r v e that s t e m l e x i c o n s are no p r e r e q u i s i t e s for the s y s t e m to work, adding them only enhances its performance. The programs/processes shown in Figure i all represent rather straightforward F i n i t e State Pattern Matching (FS/PM) procedures. It is rather t r i v i a l to s h o w m a t h e m a t i c a l l y that a set of i n t e r a c t i n g FS/PM procedures of the type used in our s y s t e m t o g e t h e r w i l l y i e l d a s y s t e m that formally has the power of a CF-parser; in practice it w i l l yield a s y s t e m that in s o m e sense is stronger, at least f r o m the point of v i e w of convenience. Congruence and similar phenomena will be r e d u c e d to s i m p l e local o b s e r v a t i o n s . T r a n s formational v a r i a n t s of s e n t e n c e s w i l l be rec o g n i z e d d i r e c t l y - there w i l l be no need for performing some kind of backward transformational operations. (In this respect a s y s t e m llke this w i l l r e s e m b l e Gazdar's g r a m m a r concept; G a z d a r 1980. ) The format of the material inside the Pool is the o r i g i n a l text, plus a p p r o p r i a t e "labelled b r a c k e t s " e n c l o s i n g words, w o r d groups, p h r a s e s and so on. In this way, the form of representation is c o n s i s t e n t throughout, no m a t t e r h o w m a n y d i f f e r e n t types of a n a l y s e s have b e e n a p p l i e d to it. Thus, v a r i o u s p e o p l e can join our g r o u p and write their own "demons" in whatever language they prefer, as long as they can take sentences in text format, be r e a s o n a b l y t o l e r a n t to w h a t types of '~rackets" they find in there, do their analysis, add their own brackets (in the specified format), and put the result back into the Pool. The c o n t r o l s t r u c t u r e s later to be s u p e r i m posed on the interacting FS/PM systems will also be of a F i n i t e S t a t e type. A s y s t e m of the type then o b t a i n e d - a s y s t e m of i n d e p e n d e n t F i n i t e State A u t o m a t o n s c o n t r o l l e d by a n o t h e r F i n i t e State Automaton - will in principle have rather complex mathematical properties. It is, e.g., rather easy to see that such a system has stronger c a p a c i t y than a Type 2 device, but it w i l l not have the power of a full Type I system. Now a few comments to Figure i The "balloons" in the figure represent independent programs (later to be developed into indep e n d e n t p r o c e s s e s inside one "big" program). The figure displays those programs t h a t so far (January 1983) h a v e b e e n i m p l e m e n t e d and t e s t e d (to some extent). Other programs will successively be entered into the system. T h e big b a l l o o n l a b e l l e d "The C l o s e d Cat" represents a program that recognizes closed word c l a s s e s suc h as p r e p o s i t i o n s , conjunctions, pronouns, auxiliaries, and so on. The C l o s e d C a t r e c o g n i z e s full w o r d f o r m s directly. The S M U R F b a l l o o n r e p r e s e n t s the m o r p h o l o g i c a l c o m p o n e n t (SMURF = " S w e d i s h Murphology"). S M U R F itself is organized internally as a complex system of independently operating "demons" - S M U R F s - e a c h knowing "its' little corner of Swedish word formation. (The n a m e of the p r o g r a m is an a l l u s i o n to the p o p u l a r comic strip leprechauns "les Schtroumpfs", which in S w e d i s h are called "smurfar".) Thus there is one little smurf recogn i z i n g d e r i v a t [ o n a l m o r p h e m e s , one r e c o g n i z i n g flectional endings, and so on. One special smurf, Phonotax, has an important controlling function every other s m u r f m u s t a l w a y s c o n s u l t P h o n o t a x before identifying one of "its" (potential) forma- 68 Of the v a r i o u s p r o g r a m s SMURF, N O M F R A S and IFIGEN are extensively tested (and, of course, The Closed Cat, w h i c h is a s i m p l e lexical lookup system), and various examples of analyses of these programs will be demonstrated in the next section. We hope to a r r i v e at a crucial s t a t i o n in this p r o j e c t d u r i n g 1983, w h e n CLAUS has been m o r e t h o r o u g h l y tested. If CLAUS p e r f o r m s the w a y we hope (and p r e l i m i n a r y tests indicate that it will), we will have means to identify very quickly the c l a u s a l s t r u c t u r e s of the s e n t e n c e s in an a r b i t r a r y r u n n i n g text, thus h a v i n g a f i r m base for entering higher h i e r a r c h i e s in the s y n t a c t i c domains. the p r o b l e m of h o w to m a k e it p o s s i b l e for each process to work independently of all other in the system. The C l o s e d Cat (CC) has the i m p o r t a n t role to mark words belonging to some well defined closed c a t e g o r i e s of words. This p r o g r a m m a k e s no internal analysis of the words, and only takes full words into account. CC makes use of simple rewrite rules of the type ' ~ => eP~e / (blank)__(blank)", w h e r e the i n s e r t e d e's r e p r e s e n t the "analysis" ("e" s t a n d s for " p r e p o s i t i o n " ; P~ = "on"). A s a m p l e o u t p u t f r o m The C l o s e d Cat is s h o w n in i l l u s t r a t i o n 2, w h e r e the v a r i o u s m e t a - s y m b o l s also are explained. The programs are written in the Beta language d e v e l o p e d by the p r e s e n t author; c.f. B r o d d a Karlsson, 1980, and Brodda, 1983, forthcoming. Of the a c t u a l p r o g r a m s in the system, S M U R F was d e v e l o p e d and e x t e n s i v e l y t e s t e d by B.B. d u r i n g 1977-79 (Brodda, 1979), w h e r e a s the others are (being) developed by B.B. and/or Gunnel KEllgren, Stockholm (mostly "and"). III EXPLODING The s i m p l e e x a m p l e above also s h o w s the format of inserted meta-lnformatlon. Each Identified c o n s t i t u e n t is "tagged" w i t h s u r r o u n d i n g lower case letters, which then can be conceived of as l a b e l l e d brackets. This format is u s e d throughout the system, also for complex constituents. Thus the nominal phrase 'DEN LILLA FLICKAN" ("the little girl") will be tagged as "'nDEN+LILLA+FLICKANn" by N O M F R A S (cf. below; the p l u s e s are i n s e r t e d to make the c o n s t i t u e n t one c o n t i n u o u s string). We have r e s e r v e d the letters n, v and s for the m a j o r c a t e g o r i e s nouns or noun phrases, verbs or v e r b a l groups, and sentences, respectively, whereas other more or less transparent letters are used for other categories. (A list of used c a t e g o r y s y m b o l s is p r e s e n t e d in the Appendix: Printout Illustrations.) SOME OF THE BALLOONS W h e n a "fresh" text is e n t e r e d into The Pool it first passes through a p r e l i m i n a r y o n e - p a s s program, INIT, (not shown in Fig. i) that "normalizes" the text. The o r i g i n a l text m a y be of any type as long as it Is r e g u l a r l y typed Swedish. INIT t r a n s f o r m s the text so that each g r a p h i c sentence will make up exactly one physical record. (Except in poetry, p h y s i c a l records, i.e. lines, u s u a l l y are of m a r g i n a l l i n g u i s t i c interest.) P a r a g r a p h ends w i l l be r e p r e s e n t e d by e m p t y records. Periods used to indicate abbreviations are Just taken a w a y and the a b b r e v i a t i o n itself is contracted to one graphic word, if necessary; thus "t.ex." ("e.g.") is t r a n s f o r m e d into "rex", and so on. Otherwise, periods, commas, question marks and other t y p o g r a p h i c c h a r a c t e r s are p r o v i d e d w i t h preceding blanks. Through this each w o r d is g u a r a n t e e d to be s u r r o u n d e d by blanks, and delimiters llke c o m m a s , p e r i o d s and so on are guaranteed to signal their "normal" textual functions. E a c h record is also ended by a s e n t e n c e delimiter (preceded by a blank). Some manual poste d i t i n g is s o m e t i m e s n e e d e d in order to get the text n o r m a l i z e d a c c o r d i n g to the above. In the I N I T - p h a s e no l i n g u i s t i c analysis w h a t s o e v e r is i n t r o d u c e d (other than into what a p p e a r s to be orthographic sentences). The program S W E M R F (or sMuRF, as it is c a l l e d here) has been e x t e n s i v e l y d e s c r i b e d e l s e w h e r e (Brodda, 1979). It m a k e s a rather i n t r i c a t e m o r p h o l o g i c a l a n a l y s i s w o r d - b y - w o r d In r u n n i n g text (i.e. S M U R F a n a l y z e s each w o r d in itself, disregarding the context it appears in). SMURF can be run in t w o modes, in " s e g m e n t a t i o n " m o d e and "analysis" mode. In its s e g m e n t a t i o n m o d e S M U R F simply strips off the possible affixes from each word; it m a k e s n o use of any s t e m lexicon. (The a f f i x e s it r e c o g n i z e s are c o m m o n prefixes, suffixes - i.e. d e r l v a t l o n a l morphemes - and flexlonal endings.) In analysis mode it also tries to m a k e an o p t i m a l guess of the w o r d class of.the word under inspection, based on what (combinations of) word formation elements it finds in the word. SMURF in itself is organized entirely according to the h e u r i s t i c p r i n c i p l e s as they are c o n c e i v e d here, i.e. as a set of i n d e p e n d e n t l y o p e r a t i n g processes that interactively work on each others output. The S M U R F s y s t e m has been the test bench for t e s t i n g out the m e t h o d s n o w being used throughout the entire Heuristic Parsing Project. INIT also changes all letters in the original text to their c o r r e s p o n d i n g upper case variants. ( O r i g i n a l l y c a p i t a l letters are o p t i o n a l l y provided w i t h a p r e f i x e d "=".) All s u b s e q u e n t analysis p r o g r a m s add their a n a l y s e s In the form of l o w e r case letters or letter c o m b i n a t i o n s . Thus u p p e r case letters or words will belong to the object language, and lower case letters or letter c o m b i n a t i o n s w i l l signal meta-language information. In this way, s t r i c t l y text (ASCII) format can be kept for the text as well as for the various stages of its analysis; the "philosophy" to use text Input and text output for all p r o g r a m s involved represents the computational solution to In its s e g m e n t a t i o n mode SMURF functions formally as a set of interactive transformations, w h e r e the s t r u c t u r a l c h a n g e s h a p p e n to be extremely simple, viz. simple segmentation rules of the type 'T=>P-", "Sffi> -S" and "Effi>-E'' for an arbitrary Prefix, Suffix and Ending, respectively, but w h e r e the "Job" e s s e n t i a l l y consists of e s t a b l i s h i n g the c o r r e s p o n d i n g s t r u c t u r a l descriptions. T h e s e are s h o w n in III. I, below, together with sample analyses. It should be noted that p h o n o t a c t l c c o n s t r a i n t s play a central role 69 in the S M U R F system; i n fact, one of the m a i n o b j e c t i v e s in d e s i g n i n g the S M U R F s y s t e m was to find out how much information actually was carried by the p h o n n t a c t l c component in Swedish. (It turned out to be quite much; cf. Brodda 1979. This p r o b a b l y holds for other G e r m a n i c l a n g u a g e s as well, which all have a rather elaborated phonotaxis.) IFIGEN parsing diagram Aux (simplified): n>Adv)o ATT - - -A # (C)CV -(A/I)T # I where '~ux" and "Adv" are categories recognized by The Closed Cat (tagged "g" and "a", respectively), and "nXn" are s t r u c t u r e s r e c o g n i z e d by e i t h e r N O M F R A S or, in the case of p e r s o n a l pronouns, by C C (It should he worth mentioning that the class of a u x i l i a r i e s in S w e d i s h is m o r e open than the corresponding word class in English; besides the " o r d i n a r y " V A R A ("to be"), HA ("to have") and the modalsy, there is a fuzzy class of seml-auxillarles llke BORJA ("begin") and others; IFIGEN makes use of about 20 of these in the p r e s e n t version.) T h e supine cadence -(A/I)'T is supposed to appear only once in an i n f i n i t i v a l group. A s a m p l e output of IFIGEN is given in Iii. 3. Also for IFIGEN we have r e a c h e d a r e c o g n i t i o n level a r o u n d 95%, which, again, is r a t h e r a s t o n i s h i n g , considering how little information actually is made use of in the system. NOMFRAS is the next program to be commented on. The p r e s e n t v e r s i o n r e c o g n i z e s s t r u c t u r e s of the type det/quant + (adJ)~ + noun; where the "det/quant" categories (i.e. determiners or q u a n t l f l e r s ) are d e f i n e d e x p l i c i t l y t h r o u g h enumeration - they are supposed to belong to the class of "surface markers" and are as such identified by The C l o s e d Cat. A d j e c t i v e s and nouns on the other hand are identified solely on the ground of their "cadences", i.e. w h a t kind of ( f o r m a l l y ) e n d l n g - l l k e s t r i n g s they h a p p e n to end with. The n u m b e r of a d j e c t i v e s that are a c c e p t e d (n in the formula above) varies depending on what (probable) type of construction is under inspection. In indefinite noun phrases the substantial content of the expected endings is, to say the least, meager, as both nouns and adjectives in many situations only have O-endings. In definite noun phrases the noun mostly - but not always - has a more substantial and r e c o g n i z a b l e e n d i n g and all i n t e r v e n i n g adJectives have either the cadence -A or a cadence f r o m a s m a l l but c h a r a c t e r i s t i c set. In a (supposed) d e f i n i t e n o u n p h r a s e all w o r d s e n d i n g in a n y of the m e n t i o n e d c a d e n c e s are a s s u m e d to be adjectives, but in (supposed) i n d e f i n i t e noun p h r a s e s not m o r e than one a d j e c t i v e is a s s u m e d u n l e s s o t h e r types of m o r p h o l o g i c a l s u p p o r t are present. The IFIGEN case illustrates very clearly one of the c e n t r a l points in our h e u r i s t i c a p p r o a c h , namely the following: The information that a word has a specific cadence, in this case the cadence -A, is u s u a l l y of very llttle s i g n i f i c a n c e in itself in Swedish. Certainly it is a typical infin l t l v a l c a d e n c e (at least 90% of all i n f i n i t i v e s in S w e d i s h have it), but on the other hand, it is c e r t a i n l y a very t y p i c a l c a d e n c e for other types of words as well: FLICKA (noun), HELA (adjective), DENNA/DETTA/DESSA (determiners or pronouns) and so on, and these other types are by no comparison the d o m i n a n t g r o u p h a v i n g this s p e c i f i c c a d e n c e in running text. But, in connection with an "infinitive warner" - an auxiliary, or the word ATT - the situation changes dramatically. This can be demonstrated by the following figures: In running text words having the cadance -A represents infinitives in a b o u t 30% of the cases. A T T is an i n f i n i t i v e m a r k e r ( e q u i v a l e n t to "to") in q u i t e e x a c t l y 50% of its o c c u r e n c e s (the o t h e r 50% it is a s u b o r d i n a t i n g conjunction). The conditional probability that the c o n f i g u r a t i o n ATT ..-A r e p r e s e n t s an i n f l n l t v e is, h o w e v e r , g r e a t e r than 99%, prov i d e d that c h a r a c t e r i s t i c c a d e n c e s like - A R N A / O R N A and q u a n t i f l e r s / d e t e r m i n e r s llke A L L A and D E S S A are d i s r e g a r d e d (In our s y s t e m they are marked by SMURF and The Closed Cat, respectively, and thereby "saved" from being classified as infinitives.) G i v e n this, there is a l m o s t no overgeneration in IFIGEN, but Swedish allows for split i n f i n i t i v e s to s o m e extent. Quite m u c h m a t e r i a l can be put in b e t w e e n the i n f i n i t i v e w a r n e r and the infinitive, and this gives rise to some undergeneration (presengly). (Similar o b s e r v a t i o n s reg a r d i n g c o n d i t i o n a l p r o b a b i l i t i e s in c o n f i g u r a tions of l i n g u i s t i c units has been m a d e by M a t s Eeg-Olofson, Lund, 1982). T h e F i n i t e State S c h e m e b e h i n d N O M F R A S is presented in Ill. 2, together with sample outputs; in this case the text has been preprocessed by The Closed Cat, and it appears that these two programs in cooperation are able to recognize noun phrases of the discussed type correctly to well over 95% in r u n n i n g text (at a speed of about 5 s e n t e n c e s per second, CPU-tlme); the e r r o r s w e r e s h a r e d about 50% each between over- and undergenerations. Preliminary experiments aiming at including also SMURF and FREPPS (Prepositional Phrases) s e e m to indicate that about the same recall and precision rate could be kept for a r b i t r a r y types of (nons e n t e n t l a l ) noun p h r a s e s (cf. Iii. 6). (The syst e m s are not yet t r i m m e d to the extent that they can be operatively run together.) I F I G E N (Infinitive Generator) is a n o t h e r rather straightforward F i n i t e State P a t t e r n Matcher (developed by Gunnel K~llgren). It recognizes (groups of) n n n f l n l t e verbs. S o m e w h a t simplified it can be represented by the following d i a g r a m ( r e m e m b e r the c o n v e n t i o n s for upper and lower case): 70 IV REFERENCES Brodda, B. "N~got om de svenska ordens fonotax och morfotax", P a p e r s from the I n s t i t u t e Of Linguistics (PILUS) No. 38, University of Stockholm, 1979. versity of Helsinki, Gazdar, G. "Phrase Structure" i_~nJacobson, P. and Pullam G. (eds.), Nature of Syntactic Representation, Reidel, 1982 Brodda, B. '~ttre kriterler f~r igenkEnnlng av sammans~ttningar" in Saari, M. and Tandefelt, M. (eds.) F6rhandllngar r~rande svenskans beskrivning - Hanaholmen 1981, Meddelanden fr~n Institutionen f~r Nordiska Spr~k, Helsingfors Universitet, 1981 Brodda, B. "The BETA System, tions", Data Linguistics, (forthcoming). Lenat, D.P. "The Nature of Heuristics", Artificial Intelligence, Vol 19(2), 1982. Eeg-Olofsson, M. '~n spr~kstatlstlsk modell f~r ordklassm~rknlng i l~pande text" in K~llgren, G. (ed.) TAGGNING, Fgredrag fr~n 3:e svenska kollokviet i spr~kllg databehandling i maJ 1982, FILUS 47, Stockholm 1982. and some ApplicaGothenburg, 1983 Polya, G. "How to Solve it", Princeton University Press, 1945. Also Doubleday Anchor Press, New York, N.Y. (several editions)• Brodda, B. and Karlsson, F. "An experiment with Automatic Morphological Analysis of Finnish", Publications No. 7, Dept. of Linguistics, Unl- APPENDIX: Some 1981. computer illustrations The following three pages illustrate some of the parsing diagrams used in the system: Iii. I, SMURF, and Iii. 2, NOMFRAS, together with sample analyses. IFIGEN is represented by sample analyses (III. 3; the diagram is given in the text)• The samples are all taken from running text analysis (from a novel by Ivar Lo-Johansson), and "pruned" only in the way that trivial, recurrent examples are omitted. Some typical erroneous analyses are also shown (prefixed by **). In III. I SMURF is run in segmentation mode only, and the existing tags are inserted by the Closed Cat. "A and "E in word final position indicates the corresponding cadences (fullfilling the pattern ?..V~M'A/E '', where M denotes a set of admissible medial clusters)• The tags inserted by CC are: aft(sentence) adverbials, b=particles, dfdeterminers, efprepositions, g=auxiliaries, h=(forms of) HA(VA), iffiinfinitives, j=adjectives, n=nouns, Kfconjunctions, q=quantifiers, r=pronouns, ufsupine verb form, v=verbal (group)• (For space reasons, III. 3 is given first, then I and II.) Iii. 3: PATTERN: aux/ATT^(pron)'(adv)A(adv)'inf^inf A. .. : ..FLOCKNINGEN eEFTER..IkATTk+iHAi+uG~TTui .. rDETr v V A R v O R I M L I G T i k A T T k + i F I N N A I rJAGr gSKAg a B A R A a I H J A L P A i - rDETr gKANg I L I G G A I g S K A g rVlr iV~GAi - rVlr gKANg a l N T E a iG~i . ..ORNA v H O L L v SIG F A R D I G A i k A T T k + i K A S T A i rDEr g V ~ G A D E g a A N T L I G E N a i L Y F T A i gSKAg rNlr a N O D V A N D I G T V I S a iGORAi ..rVlr h H A D E h a A N N U a a l N T E a u H U N N I T u iF~i . . B E C K M O R K R E T eMEDe i k A T T k + I F O R S O K A i + i F ~ I eMEDe V A T G A S eFORe i k A T T k + i K U N N A i + I H ~ L L A i SKOGEN, L A N D E N g T Y C K T E S g iST~i rDENr h H A D E h M I S S L Y C K A T S ele i k A T T k + i N A i *** qENq kS g V ~ G A D E g I K V l N N O R N A + S T A N N A i 71 F R A M A T B O J D H E L A DAGEN.. qETTq K A D S T R E C K ele .. eTILLe ikATTk+iSEi .. qENq KARL INUTI? VIPPEN? HEM eMEDe S K A M M E N ... eOMe N A R S O M H E L S T . ePAe rDETr. N~T eMEDe rDENr, k S ~ k eUPPe P O T A T I S E N . B A L L O N G E N FYLLD. SEJ OPPE. S T I L L A e U N D E R e OSS. SITT M~L. IIi. i: SMURF - PARSING DIAGRAM FOR SWEDISH MORPHOLOGY Structural changes PATTERNS " S t r u c t u r a l D e s c r i p t i o n s " ) : I ) E_NOINGS (E): 2) PREFIXES (P): X " 1/VS.Me "E#; I' #p> X " V " F (s) 3) SUFFIXES (S): I l E :> =E - p - V " X ; -- X " v " F "_S - (s) E# # I " V " x1 P => (-)P> S :> /S(-) where I : ( a d m i s s i b l e ) i n i t i a l c l u s t e r , F = f i n a l c l u s t e r , M = mor-he-m-eTnternal c l u s t e r , V = v o w e l , (s) t h e " g l u o n " S ( c f . TID~INGSMA~), # = word boundary, ( = , > , / , - ) = e a r l i e r accepted a f f i x segmentations, and , finallay, d e n o t e s o r d i n a r y c o n c a t e n a t i o n . ( I t i s t h e enhanced e l e ment in each p a t t e r n t h a t is t e s t e d f o r i t s s e g m e n t a b i l i t y ) . B A G G ' E = v D R O G v . R E P = E T S L I N G R = A D E MELLAN STEN=AR , F O R > B I TALLSTAMM AR , MELLAN ROD*A LINGONTUV=OR e l e GRON I N > F A T T / N I N G . qETTq STORT FORE>M~L hHADEh uRORTu eP~e SIG BORT'A eIe SLANT=EN • FORE>M~L=ET NARM=ADE SIG HOTFULL'T dDETd KNASTR= = A D E eIe S K O G = E N . - SPRING BAGG'E SLAPP=TE kOCHk vSPRANGv . rDEr L~NG'A KJOL=ARNA VIRVI=ADE eOVERe 0<PLOCK=ADE LINGONTUV=OR , BAGG'E KVINNO=RNA hHADEh STRUMPEBAND FOR>FARDIG=ADE eAVe SOCKERTOPPSSNOR=EN , KNUT=NA NEDAN>FOR KNAN'A a F O R S T a b U P P E b e P ~ e q E N q kS V ~ G = A D E K V I N N O = R N A STANN'A . rDEr vSTODv kOCHk STRACK=TE eP~e HALS=ARNA . qENq FRAN UT>DUNST/NING eAVe SKRACK SIPPR=ADE bFRAMb . rDEr vHOLLv BE>SVARJ/ANDE HAND=ERNA FRAM>FOR SIN'A SKOT=EN - dDETd vSERv STORT kOCHk eRUNTe bUTb , vSAv dDENd KORT~A eOMe FORE>MAL=ET dDETd vARy aVALa alNTEa qN~GOTq IN>UT>I ? - dDETd gKANg LIGG'A qENq KARL IN>UT>I ? dDETd vVETv rMANr aVALa kVADk rHANr vGORv eMEDe OSS - r J A G r T Y C K = T E d D E T d R O R = D E e P ~ e SEJ gSKAg rVlr iV~GAI VIPP=EN ? - JA ? E S K A g rVlr i V ~ G A I V I P P ~ E N ? BAGGE vSMOGv SIG eP~e GLAPP'A KNAN UT>F~R BRANT=EN • kNARk rDEr NARM=ADE SIG rDEr FLAT=ADE POTATISKORG=ARNA eMEDe LINGON kSOMk vSTODv eP~e LUT eVIDe VARSIN TUVA , vVARv rDEr aREDANa UT>OM SIG eAVe SKRACK . oDERASo SANS vVARv BORT'A . PASS eP~e . rVlr KANHAND'A alNTEa vTORSv NARM=ARE ? vSAv dDENd MAGR'A RUSTRUN rVlr E K A N g a l N T E a G~ H E M e M E D e S K A M M = E N aHELLERa • rVlr g M ~ S T E E a J U a iHAi B A R K O R G = A R N A eMEDe . - JAVISST , BARKORG=ARNA kMENk kNARk rDEr uKOMMITu bNERb eTILLe STALL=ET I<GEN uVARTu rDEr NYFIK=NA rDEr vDROGSv eTILLe FORE>M~L=ET ele - - 72 Iii. 2: NOMFRAS - FS-DIAGRAM FOR SWEDISH NOUN PHRASE PARSING quant + dec + "OWN" + adJ + noun IDETTA~ OENNAL__ -ERI-NI-~I /j MI-T A LL"A~ ~ B~DA DEN ER) "NAI-EN] - PYTT , vSAv nDEN+L~NGAn k V A D k v V A R v NU n D E T + D A R n nDET+OMF~NGSRIKA+,+SIDENLATTA+TYGETn nDEn GJORDE nEN+STOR+PACKEn eMEDe SIG SJALVA eOMe kATTk nDET+HELAn .. n D E T + N E L A n alNTEa uVARITu nETT+DUGGn nDET+FORMENTA+KLADSTRECKETn .. G R O N e M E D e H A N G B J O R K A R kSOMk nALLAn .. M O D E R N , nDEN+L~NGA+EGNAHEMSHUSTRUNn STORA BOKSTAVER nETT+SVENSKT+FIRMANAMNn eP~e nDEN+ANDRA+,+FR~NVANDAn nDETn vVARv nEN+LUFTENS+SPILLFRUKTn kOCHk nDEN+ANDRA+EGNAHEMSHUSTRUNS+OGONn nETT+STORT+MOSSIGT+BERGn . • SIG eMOTe SKYN eMEDe nEN+DISIG+M~NEn eVIDe nDET+STALLEn SAGA HONOM kATTk nALLA+DESSA+FOREMALn ..ARNA kSOMk nEN+AVIGT+SKRUBBANDE+HANDn kSOMk nEN+OFORMLIG+MASSAn - nEN÷RIKTIG+BALLONGn • *nDETn alNTEa vL~Gv nN~GON+KROPP+GOMDn • ** T V ~ k S O M k B A R G A D E ~ D E N + T I L L S A M M A N S n 73 kATTk VARA RADD eFORe ? eAVe dDETd . alNTEa uVARITu qETTq .. FARLIGT . vVARv kD~k SNOTT FLE.. FYLLDE FUNKTIONER . kSOMk uVARITu ele SKO.. , vSTODv ORDEN .. kSOMk hHADEh uRAMLAT.. VATTNADES eAVe OMSOM HOJDE SIG eMOTe SKYN.. kSOMk qENq RUND LYKTA .. kDARk LANDNINGSLINAN .. aAND~a alNTEa FORMED.. . VALTRADE SIG BALLONG.. gSKAg VARA FYLLD eMEDe.. INUNDER .

an experiment with heuristic parsing of swedish

Related documents

Products

Support

an experiment with heuristic parsing of swedish

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib