From: AAAI Technical Report FS-92-04. Copyright © 1992, AAAI (www.aaai.org). All rights reserved.

A nonstationary hidden Markov model with a hard capture of observations: application to the problem of morphological ambiguities*

Djamel Bouchaffra and Jacques Rouault
Centre de Recherche en Informatique appliquée aux Sciences Sociales
B.P. 47, 38040 Grenoble Cedex 9, FRANCE.
E-mail: djamel@criss.fr
Phone: (+33) 76.82.56.94. Fax: (+33) 76.82.56.75.

* Much of the research on which this paper is based was carried out in order to be used in the framework of MMI2 (A Multi-Modal Interface for Man-Machine Interaction). This interface is part of a project partially funded by the Commission of the European Communities ESPRIT programme.

Abstract

This correspondence is concerned with the problem of morphological ambiguities using a Markov process. The problem here is to eliminate interferent solutions that might be derived from a morphological analysis. We start by using a Markov chain with one long sequence of transitions. In this model the states are the morphological features, and a sequence corresponds to the transition of a word form from one feature to another. After having observed an inadequacy of this model, we explore a nonstationary hidden Markov process. Among the main advantages of this latter model is the possibility to assign a type to a text given some training samples. Therefore, a recognition of "style", or the creation of a new one, might be developed.

1. Introduction

1.1. Automatic analysis of natural language

This work lies within a textual analysis system for natural language discourse (French in our case). In most systems used today, the analysis process is divided into levels, starting from morphology (first level) through syntax and semantics to pragmatics. These levels are sequentially activated, without backtracking, originating in the morphological phase and ending in the pragmatic phase. Therefore, the i-th level knows only the results of the preceding levels; in particular, the morphological analysis works without any reference to the other levels. This means that each word in the text (a form) is analyzed autonomously, out of context. Hence, for each form, one is obliged to consider all possible analyses.

Example: let us consider the sequence of the two forms "cut" and "down":
- "cut" can be given 3 analyses: verb, noun, adjective;
- "down" can be a verb, an adverb or a noun.
The number of possible combinations, based upon the independence of the analysis of one form in relation to the others, implies that the phrase "cut down" is liable to nine interpretations, whatever its context.

These multiple solutions are transmitted to syntactic parsing, which does not eliminate them either. In fact, as a syntactic parser generates its own interferent analyses, often from interferent morphological analyses, the problems with which we are confronted are far from being solved. In order to provide a solution to these problems, we have recourse to statistical methods. Thus, the result of the morphological analysis is filtered using a Markov model.

1.2. Morphological analysis

A morphological analyser must be able to cut up a word form into smaller components and to interpret this action. The easiest segmentation of a word form consists in separating word terminations (flexional endings) from the rest of the word form, called the basis. We then have a flexional morphology. A more accurate cutting up consists in splitting up the basis into affixes (prefixes, suffixes) and root.
This is then called derivational morphology. The interpretation consists in associating the segmentation of a word form with a set of information, particularly including:
- the general morphological class: verb, noun-adjective, preposition, ...
- the values of relevant morphological variables: number, gender, tense, ...
Therefore, an interpretation is a class plus values of variables; such a combination is called a feature. Note that a word form is associated with several features in case there are multiple solutions.

1.3. Why statistical procedures?

Because of the independence of the analysis levels, it is difficult to provide contextual linguistic rules. This is one of the reasons why we fall back on statistical methods. These latter methods possess another advantage: they simultaneously reflect language properties, e.g., the impossibility of obtaining a determiner followed directly by a verb, and properties of the analysed corpus, e.g., a large number of nominal sentences.

Some researchers have used Bayesian approaches to solve the problem of morphological ambiguities. These methods have a clear conceptual framework and powerful representations, but must still be knowledge-engineered rather than trained. Very often, in the application of those methods, researchers do not have a good observation of the individuals of the population, because observation is a relative notion. Therefore, we have difficulty in observing the possible transitions of these individuals. The way of "capturing" the individuals depends on the environment encountered.

2. A morphological features Markov chain

2.1. The semantics of the model

Let (f_i), i = 1 to m, be the states, or morphological features; we have only one individual (n = 1) for each transition time t ∈ {1,2,...,T}. A first-order m-state Markov chain is defined by an m×m state transition probability matrix P and an m×1 initial probability vector Π, where

    P = \{p_{f_i f_j}\}, \qquad p_{f_i f_j} = \Pr[e_{t+1} = f_j \mid e_t = f_i], \qquad i,j = 1,2,\dots,m,
    \Pi = \{\Pi_{f_i}\}, \qquad \Pi_{f_i} = \Pr[e_1 = f_i], \qquad i = 1,2,\dots,m,

and E = {e_t}, t = 1,2,...,T, is a morphological features sequence. The parameters m and T are respectively the number of states and the length of the state sequence. By definition, we have

    \sum_{j=1}^{m} p_{f_i f_j} = 1 \quad \text{for } i = 1,2,\dots,m, \qquad \sum_{k=1}^{m} \Pi_{f_k} = 1.

The probability associated with a realization E of this Markov chain is

    \Pr[E \mid P, \Pi] = \Pi_{e_1} \prod_{t=2}^{T} p_{e_{t-1} e_t}.

2.2. Estimation of transition probabilities

As pointed out by Bartlett in Anderson and Goodman [1], the asymptotic theory must be considered with respect to the variable "number of times the word form is observed in a single sequence of transitions" instead of the variable "number of individuals in a state when T is fixed". This asymptotic theory applies because the number of times the word form is observed increases (T → ∞). Furthermore, we cannot investigate the stationarity properties of the Markov process, since we only have one word form (one individual) at each transition time. Therefore, we assumed stationarity. Thus, if N_{f_i f_j} is the number of times that the observed word form was in feature f_i at time (t−1) and in feature f_j at time t, for t ∈ {1,...,T}, then the estimates of the transition probabilities are

    \hat{p}_{f_i f_j} = \frac{N_{f_i f_j}}{N_{f_i +}},

where N_{f_i +} is the number of times that the word form was in state f_i. The estimated transition probabilities are evaluated on one training sample.
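As an illustration of this estimator, the following minimal Python sketch counts feature-to-feature transitions in a single training sequence and then scores candidate feature sequences; the feature labels and toy sequences are hypothetical and are not taken from the corpus used in the paper.

```python
from collections import defaultdict

def estimate_transitions(sequence):
    """Maximum likelihood estimates p_{fi fj} = N_{fi fj} / N_{fi +}
    from one long sequence of morphological features."""
    pair_counts = defaultdict(lambda: defaultdict(int))
    state_counts = defaultdict(int)
    for prev, cur in zip(sequence, sequence[1:]):
        pair_counts[prev][cur] += 1
        state_counts[prev] += 1
    return {fi: {fj: n / state_counts[fi] for fj, n in nxt.items()}
            for fi, nxt in pair_counts.items()}

def sequence_probability(sequence, initial, transitions):
    """Prob[E | P, Pi] = Pi_{e1} * prod_{t>=2} p_{e_{t-1} e_t};
    transitions never observed in training get probability 0."""
    prob = initial.get(sequence[0], 0.0)
    for prev, cur in zip(sequence, sequence[1:]):
        prob *= transitions.get(prev, {}).get(cur, 0.0)
    return prob

# Hypothetical training sequence of morphological features.
training = ["determiner", "noun", "verb", "determiner", "noun", "adverb",
            "determiner", "adjective", "noun", "verb"]
P = estimate_transitions(training)
Pi = {"determiner": 1.0}  # toy initial distribution

# Compare two candidate analyses of the same forms and keep the likelier one.
candidates = [["determiner", "noun", "verb"], ["determiner", "verb", "noun"]]
best = max(candidates, key=lambda e: sequence_probability(e, Pi, P))
print(best)
```

The selection of the most probable candidate in the last lines corresponds to the disambiguation rule stated next.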
We removed the morphological ambiguities by choosing the sequence E of highest probability.

3. A Markov model with hidden states and observations

The inadequacy of the previous model for removing certain morphological ambiguities has led us to believe that some unknown hidden states govern the distribution of the morphological features. Instead of passing from one morphological feature to another, we might transit from one hidden state to another. Besides, in section 2 we focused only on the surface of one random sample, i.e., an observation was a morphological feature. As pointed out in [4], this latter entity cannot be extracted without a context effect in a sample. In order to take this context effect into account, we have chosen criteria such as "the nature of the feature", "its successor feature", "its predecessor feature", "its position in a sentence" and "the position of this sentence in the text". An observation o_t is then a "known hidden vector" whose components are values of the criteria presented here. However, one can explore other criteria.

Definition 3.1. A hidden Markov model (HMM) is a Markov chain whose states cannot be observed directly but only through a sequence of observation vectors. An HMM is represented by the state transition probability matrix P, the initial state probability vector Π, and a (T×K) matrix V whose elements are the conditional densities v_i(o_t) = density of observation o_t given e_t = i, where K is the number of states.

Our aim is the determination of the optimal model estimate Θ* = [Π*, P*, V*] given a certain number of samples; this is the training problem.

Theorem 3.2. The probability of a sample S = (o_1, o_2, ..., o_T) given a model Θ can be written as:

    \Pr(S \mid \Theta) = \sum_{E} \Pi_{e_1} v_{e_1}(o_1) \prod_{t=2}^{T} p_{e_{t-1} e_t}\, v_{e_t}(o_t).

Proof. For a fixed state sequence E = e_1, e_2, ..., e_T, the probability of the observation sequence S = o_1, o_2, ..., o_T is Prob(S | E, Θ) = v_{e_1}(o_1) · v_{e_2}(o_2) · ... · v_{e_T}(o_T). The probability of a state sequence is Prob(E | Θ) = Π_{e_1} p_{e_1 e_2} p_{e_2 e_3} ... p_{e_{T-1} e_T}. Using the formula Prob(S, E | Θ) = Prob(S | E, Θ) · Prob(E | Θ) and summing this joint probability over all possible state sequences E, one demonstrates the theorem.

The interpretation of the previous equation is: initially, at time t = 1, the system is in state e_1 with probability Π_{e_1} and we observe o_1 with probability v_{e_1}(o_1). The system then makes a transition to state e_2 with probability p_{e_1 e_2} and we observe o_2 with probability v_{e_2}(o_2). This process continues until the last transition, from state e_{T-1} to state e_T with probability p_{e_{T-1} e_T}, after which we observe o_T with probability v_{e_T}(o_T).

In order to determine an estimate of the model Θ = [Π, P, V], one can use the maximum likelihood criterion for a certain family S_i (i ∈ {1,2,...,L}) of training samples. Some methods of choosing representative samples of fixed length are presented in [2]. The problem can be expressed mathematically as:

    \Theta_i^{*} = \arg\max_{\Theta_i} \sum_{E} \Pi_{e_1} v_{e_1}(o_1) \prod_{t=2}^{T} p_{e_{t-1} e_t}\, v_{e_t}(o_t).

There is no known method to solve this problem analytically; that is the reason why we use iterative procedures. We start by first determining the optimal path for each sample. An optimal path E* is the one associated with the highest probability of the sample. Using the well-known Viterbi algorithm, one can determine this optimal path. The different steps for finding the single best state sequence in the Viterbi algorithm are:

Step 1 (initialization): for 1 ≤ i ≤ K,
    \delta_1(i) = \Pi_i\, v_i(o_1), \qquad \psi_1(i) = 0.

Step 2 (recursion): for 2 ≤ t ≤ T and 1 ≤ j ≤ K,
    \delta_t(j) = \max_{1 \le i \le K} [\delta_{t-1}(i)\, p_{ij}]\; v_j(o_t), \qquad \psi_t(j) = \arg\max_{1 \le i \le K} [\delta_{t-1}(i)\, p_{ij}].

Step 3 (termination):
    P^{*} = \max_{1 \le i \le K} [\delta_T(i)], \qquad e_T^{*} = \arg\max_{1 \le i \le K} [\delta_T(i)].

Step 4 (state sequence backtracking): for t = T−1, T−2, ..., 1,
    e_t^{*} = \psi_{t+1}(e_{t+1}^{*}).

P* is the state-optimized likelihood function and E* = {e_1*, e_2*, ..., e_T*} is the optimal state sequence.
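The recursion above is the standard Viterbi algorithm. The following minimal Python/NumPy sketch (not the authors' implementation) assumes the emission terms v_i(o_t) have already been evaluated for the sample at hand and are supplied as a T×K matrix, as in Definition 3.1; the toy numbers are hypothetical.

```python
import numpy as np

def viterbi(pi, P, V):
    """Single best state sequence of an HMM.

    pi : (K,)   initial state probabilities Pi_i
    P  : (K, K) transition probabilities p_ij
    V  : (T, K) emission terms, V[t, i] = v_i(o_t) for the observed sample
    Returns (P*, E*): the state-optimized likelihood and the optimal path.
    """
    T, K = V.shape
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = pi * V[0]                        # step 1: initialization
    for t in range(1, T):                       # step 2: recursion
        scores = delta[t - 1][:, None] * P      # scores[i, j] = delta_{t-1}(i) * p_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * V[t]
    path = np.zeros(T, dtype=int)
    p_star = delta[-1].max()                    # step 3: termination
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):              # step 4: backtracking
        path[t] = psi[t + 1, path[t + 1]]
    return p_star, path

# Toy example with K = 4 hidden states and T = 7 observations (random emissions).
rng = np.random.default_rng(0)
pi = np.full(4, 0.25)
P = rng.dirichlet(np.ones(4), size=4)           # each row sums to 1
V = rng.random((7, 4))
print(viterbi(pi, P, V))
```

In practice one would work with log probabilities to avoid numerical underflow on long samples.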
Instead of tracking all possible paths, one successively picks only the optimal paths E_i* of all samples. Thus, the quantity evaluated along an optimal path can be written as:

    f(o_1, o_2, \dots, o_T;\, E_i^{*}, \Theta_i) = \Pi_{e_1^{*}}\, v_{e_1^{*}}(o_1) \prod_{t=2}^{T} p_{e_{t-1}^{*} e_t^{*}}\, v_{e_t^{*}}(o_t);

this computation has to be done for all the samples. Among all the Θ_i, i ∈ {1,...,L}, associated with optimal paths, we decide to choose as best model estimate the one which maximizes the probability associated with a sample. It can be written as:

    \Theta^{*} = \arg\max_{\Theta_i} f(o_1, o_2, \dots, o_T;\, E_i^{*}, \Theta_i), \qquad i \in \{1, 2, 3, \dots, L\}.

4. The different steps of the method

We present an iterative method which enables us to obtain an estimator of the model Θ. This method is suitable for direct computation.

First step: one has to cluster the sample with respect to the chosen criteria; two possibilities are offered: a classification or a segmentation. In the latter procedure, the user may structure the states; operating in this way, the states appear as known hidden states. In a classification, however, the system structures its own states according to a suitable norm; thus, the states appear as unknown hidden ones. The clusters formed by one of the two procedures represent the first states of the model; they form the first training path.

Second step: one estimates the transition probabilities and the probability of each training vector for each state, i.e., v_i(o_t), using the following equations; this gives the first model Θ_1:

    v_i(o_t) = \frac{\text{number of times the observation } o_t \text{ belongs to state } i}{\text{number of training paths}},

    p_{ij}(t) = \frac{\text{number of times } \{(o_{t-1} \text{ belongs to } i) \text{ and } (o_t \text{ belongs to } j)\}}{\text{number of times the observation } o_{t-1} \text{ belongs to } i}.

The previous estimation formulas can be written as

    p_{ij}(t) = \frac{N_{ij}(t)}{N_{i+}(t)}, \qquad v_i(o_t) = \frac{\text{expected number of times of being in state } i \text{ and observing } o_t}{\text{expected number of times of being in state } i},

for i, j = 1,2,...,K and t = 1,2,...,T, where N_ij(t) is the number of transitions from state i at time (t−1) to state j at time t, and N_{i+}(t) is the number of times state i is visited at time (t−1).

Third step: one computes f(o_1, o_2, ..., o_T; Θ_1) and determines the next training path, or clustering, capable of increasing f(o_1, o_2, ..., o_T; Θ_1). We apply the second step to this training path. This procedure is repeated until we reach the maximum value of the previous function. At this optimal value, we have E_1* and Θ_1* for the first sample. This step uses the Viterbi algorithm. The procedure is applied to a family of samples of the same text, so we obtain a family of E_i* and Θ_i*. As mentioned previously, one reasonably decides to choose the model Θ* whose probability associated with a sample is maximum. This last model makes the sample the most representative, i.e., we have a good observation in some sense. This optimal model estimate is considered as a type of the text processed.
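To make the second step concrete, the sketch below re-estimates the nonstationary transition probabilities p_ij(t) and the emission frequencies v_i(o_t) from a family of training paths by counting, as in the formulas above. It is only an illustration under assumptions of this sketch: each observation vector is paired with its cluster (state) label, and observation vectors are represented as hashable tuples; neither detail is prescribed by the paper.

```python
from collections import Counter, defaultdict

def estimate_model(training_paths):
    """Second-step estimates from a family of training paths.

    Each training path is a list of (state, observation) pairs of equal
    length T, where `observation` is a hashable tuple of criteria values
    (feature, successor, predecessor, position in sentence, ...).
    Returns p[t][i][j] = N_ij(t) / N_i+(t) and v[i][obs] as relative counts.
    """
    n_paths = len(training_paths)
    T = len(training_paths[0])
    trans = defaultdict(lambda: defaultdict(Counter))   # t -> i -> Counter of j
    emit = defaultdict(Counter)                         # i -> Counter of observations
    for path in training_paths:
        for t in range(1, T):
            (i, _), (j, _) = path[t - 1], path[t]
            trans[t][i][j] += 1
        for i, obs in path:
            emit[i][obs] += 1
    p = {t: {i: {j: n / sum(cnt.values()) for j, n in cnt.items()}
             for i, cnt in by_state.items()}
         for t, by_state in trans.items()}
    v = {i: {obs: n / n_paths for obs, n in cnt.items()}
         for i, cnt in emit.items()}
    return p, v

# Two hypothetical training paths over T = 3 positions and states {0, 1}.
paths = [
    [(0, ("noun", "verb")), (1, ("verb", "det")), (0, ("det", "noun"))],
    [(0, ("noun", "verb")), (0, ("verb", "det")), (1, ("det", "noun"))],
]
p, v = estimate_model(paths)
print(p[1])   # p_ij(t) for the transition from position t-1 = 0 to position t = 1
```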
5. Test for first-order stationarity

As outlined by Anderson and Goodman [1], the following test can be used to determine whether the Markov chain is first-order stationary or not. Thus, we have to test the null hypothesis

    H: p_{ij}(t) = p_{ij} \qquad (t = 1, 2, \dots, T).

The likelihood ratio with respect to the null and alternate hypotheses is

    \lambda = \prod_{t=1}^{T} \prod_{i=1}^{K} \prod_{j=1}^{K} \left[ \frac{\hat{p}_{ij}}{\hat{p}_{ij}(t)} \right]^{N_{ij}(t)}.

We now determine the confidence region of the test. In fact, the expression −2 log λ is distributed as a Chi-square distribution with (T−1)·K·(K−1) degrees of freedom when the null hypothesis is true. As the distribution of the statistic S = −2 log λ is χ², one can compute a β point (β = 95%, 99.95%, etc.) as the threshold S_β. The test is formulated as follows: if S < S_β, the null hypothesis is accepted, i.e., the Markov chain is first-order stationary. Otherwise, the null hypothesis is rejected at the (100 − β)% level of significance, i.e., the chain is not first-order stationary, and one decides in favour of the nonstationary model.

6. How to solve the morphological ambiguities

This is the most important phase of our application. Let us consider an example of nine possible paths encountered in a text. Among these paths, the system has to choose the most likely according to the probability measure; see Figure 1. Our decision to choose the most likely path comes from the optimal model Θ* obtained in the training phase. We show in this example how to remove the morphological ambiguities.

If the optimal state sequence obtained in the training phase is the one which corresponds to Figure 2, then one can, for example, choose between the two following paths of Figure 1:

path 1: s_1 s_2 s_3 s_4 s_5 s_6 s_7
path 2: s_1 s_2 s'_3 s'_4 s_5 s'_6 s_7

One computes the probabilities of these two realizations of the observations o_i (i = 1,...,7) using the formula:

    \Pr(o_1, o_2, \dots, o_7 \mid \Theta^{*}) = \Pi_{e_1} v_{e_1}(o_1) \prod_{t=2}^{7} p_{e_{t-1} e_t}\, v_{e_t}(o_t).

Figure 2 shows that each s_i belongs to a state e_i, and using the optimal model Θ* = [Π*, P*, V*] one can compute the probability of a path. Our decision for removing the morphological ambiguities is to choose the path with the highest probability.

[Figure 1: the candidate paths of observation vectors over T = 7 positions; an observation vector records the feature, its successor feature, its predecessor feature, the position of the feature in the sentence and the position of the sentence in the text.]

[Figure 2: the optimal state sequence over the K = 4 hidden states: 1 3 3 4 2 2 3.]
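The path-selection step of Section 6 can be sketched as follows: given the optimal model Θ*, each candidate path of observation vectors is scored with the formula above, and the highest-scoring path is kept. This is a minimal Python sketch; the state assignment (taken here from the Figure 2 sequence, shifted to 0-based indices) and all numerical values are hypothetical.

```python
import numpy as np

def path_probability(states, emissions, pi, P):
    """Pr(o_1..o_T | Theta*) along one candidate path.

    states    : state index e_t of each position (from the optimal state sequence)
    emissions : emissions[t] = v_{e_t}(o_t) for the candidate path's observations
    """
    prob = pi[states[0]] * emissions[0]
    for t in range(1, len(states)):
        prob *= P[states[t - 1], states[t]] * emissions[t]
    return prob

# Optimal model Theta* (toy values) and the Figure-2 state sequence (0-based).
pi = np.array([0.4, 0.2, 0.3, 0.1])
P = np.array([[0.1, 0.2, 0.5, 0.2],
              [0.3, 0.4, 0.2, 0.1],
              [0.2, 0.2, 0.3, 0.3],
              [0.1, 0.3, 0.4, 0.2]])
states = [0, 2, 2, 3, 1, 1, 2]           # "1 3 3 4 2 2 3" as 0-based indices

# Hypothetical emission terms v_{e_t}(o_t) for the two competing paths.
path1 = [0.6, 0.5, 0.7, 0.4, 0.6, 0.5, 0.8]
path2 = [0.6, 0.5, 0.2, 0.3, 0.6, 0.1, 0.8]
scores = {name: path_probability(states, em, pi, P)
          for name, em in [("path 1", path1), ("path 2", path2)]}
print(max(scores, key=scores.get))       # keep the most probable path
```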
7. Conclusion

We have presented a new approach for solving the morphological ambiguities using a hidden Markov model. This method may also be applied to other analysis levels, such as syntax. The main advantage of the method is the possibility of assigning many different classes of criteria (fuzzy or completely known) to the training vectors and of investigating many samples. Furthermore, we can define a "distance" between any sample and a family of types of texts called models. One may choose the model which gives the highest probability of this sample and conclude that the sample belongs to this specific type of texts. We can also develop a proximity measure between two models Θ*_1 and Θ*_2 through representative samples. However, some precautions must be taken in the choice of the distance used between the training vectors in the clustering process. In fact, the value of the probability associated with a sample may depend on this norm and, therefore, the choice of the best model estimate can be affected.

So far, we have supposed that the criteria describing the observations and the states are completely known (hard observation). Very often, when we want to make deep investigations, fuzziness or uncertainty due to some criteria or states is encountered; what is to be done in this case? How can we cluster the observations according to any uncertainty measure? What is the optimal path and the best model estimate according to this uncertainty measure? We are working on proposing solutions to those questions in the case of a probabilistic [3] and a fuzzy logic.

References

[1] T.W. Anderson and L.A. Goodman, "Statistical inference about Markov chains", Ann. Math. Statist. 28, 89-110 (1957).

[2] D. Bouchaffra, Echantillonnage et analyse multivariée de paramètres linguistiques, Thèse de Doctorat (Ph.D.), Université Pierre Mendès France, Grenoble, to appear (1992).

[3] D. Bouchaffra, "A relation between isometrics and the relative consistency concept in probabilistic logic", 13th IMACS World Congress on Computation and Applied Mathematics, Dublin, Ireland, July 22-26 (1991); selected papers in AI, Expert Systems and Symbolic Computing for Scientific Computation, eds. John Rice and Elias N. Houstis, Elsevier, North-Holland.

[4] J. Rouault, Linguistique automatique: applications documentaires, Editions Peter Lang SA, Berne (1987).