From: AAAI Technical Report FS-92-04. Copyright © 1992, AAAI (www.aaai.org). All rights reserved.
A nonstationary hidden Markov model with a hard capture of observations: application to the problem of morphological ambiguities*

Djamel Bouchaffra and Jacques Rouault
Centre de Recherche en Informatique appliquée aux Sciences Sociales, B.P. 47, 38040 Grenoble Cedex 9, FRANCE.
E-mail: djamel@criss.fr
Phone: (+33) 76.82.56.94.
Fax: (+33) 76.82.56.75.
Abstract

This correspondence is concerned with the problem of morphological ambiguities using a Markov process. The problem here is to eliminate interferent solutions that might be derived from a morphological analysis. We start by using a Markov chain with one long sequence of transitions. In this model the states are the morphological features and a sequence corresponds to the transition of a word form from one feature to another. After having observed an inadequacy of this model, one will explore a nonstationary hidden Markov process. Among the main advantages of this latter model we have the possibility to assign a type to a text given some training samples. Therefore, a recognition of "style", or a creation of a new one, might be developed.
1. Introduction

1.1. Automatic analysis of natural language

This work lies within a textual analysis system in natural language discourse (French in our case). In most systems used today, the analysis process is divided into levels, starting from morphology (first level) through syntax and semantics to pragmatics. These levels are sequentially activated, without backtracking, originating in the morphological phase and ending in the pragmatic phase. Therefore, the i-th level knows only the results of the preceding levels; in particular, the morphological analysis works without any reference to the other levels. This means that each word in the text (a form) is analyzed autonomously, out of context. Hence, for each form, one is obliged to consider all possible analyses.
* Much of the research on which this paper is based was carried out in order to be used in the framework of MMI (A Multi-Modal Interface for Man-Machine Interaction). This interface is part of a project partially funded by the Commission of the European Communities ESPRIT program.
Example: let's consider the sequence of the two forms "cut" and "down":
- "cut" can be given 3 analyses: verb, noun, adjective;
- "down" can be a verb, an adverb or a noun.
The number of possible combinations, based upon the independence of the analysis of one form in relation with the others, implies that the phrase "cut down" is liable to nine interpretations, whatever its context.
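For illustration only, a minimal sketch (Python, not part of the system described in this paper) of how independent per-form analyses multiply into interpretations of a phrase:

```python
from itertools import product

# Each form keeps all of its morphological analyses; a phrase then receives
# the cross-product of these analyses when forms are analyzed independently.
analyses = {
    "cut": ["verb", "noun", "adjective"],
    "down": ["verb", "adverb", "noun"],
}

phrase = ["cut", "down"]
readings = list(product(*(analyses[form] for form in phrase)))
print(len(readings))   # 9 interpretations, whatever the context
```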
These multiple solutions are transmitted to syntactic parsing, which doesn't eliminate them either. In fact, as a syntactic parser generates its own interferent analyses, often from interferent morphological analyses, the problems with which we are confronted are far from being solved.
In order to provide a solution to these problems, we have recourse to statistical methods. Thus, the result of the morphological analysis is filtered using a Markov model.
1.2. Morphological analysis

A morphological analyser must be able to cut up a word form into smaller components and to interpret this action. The easiest segmentation of a word form consists in separating word terminations (flexional endings) from the rest of the word form, called the basis. We have then got a flexional morphology. A more accurate cutting up consists in splitting up the basis into affixes (prefixes, suffixes) and root. This is then called derivational morphology.
The interpretation consists in associating the segmentation of a word form with a set of information particularly including:
- the general morphological class: verb, noun-adjective, preposition, ...
- the values of relevant morphological variables: number, gender, tense, ...
Therefore, an interpretation is a class plus values of variables; such a combination is called a feature. Note that a word form is associated with several features in case there are multiple solutions.
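As an illustration, one possible way to represent a feature (class plus variable values) and an ambiguous word form in code; this is only a sketch, and the names and chosen attributes are ours, not the paper's:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class Feature:
    morph_class: str                                  # e.g. "verb", "noun-adjective", "preposition"
    variables: Tuple[Tuple[str, str], ...] = ()       # e.g. (("number", "plural"), ("gender", "fem"))

@dataclass
class WordForm:
    surface: str                                      # the form as it appears in the text
    features: List[Feature] = field(default_factory=list)   # several features when ambiguous

cut = WordForm("cut", [Feature("verb", (("tense", "past"),)),
                       Feature("noun", (("number", "singular"),)),
                       Feature("noun-adjective", ())])
```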
1.3. Why statistical procedures?

Because of the independence of the analysis levels, it is difficult to provide contextual linguistic rules. This is one of the reasons why we fall back on statistical methods. These latter methods possess another advantage: they reflect simultaneously language properties, e.g., the impossibility to obtain a determinant followed directly by a verb, and properties of the analysed corpus, e.g., a large number of nominal sentences.
Some researchers used Bayesian approaches to solve the problem of morphological ambiguities. However, these methods have a clear conceptual framework and powerful representations, but must still be knowledge-engineered rather than trained. Very often in the application of those methods, researchers do not have a good observation of the individuals of the population, because the observation is a relative notion. Therefore, we have difficulty in observing possible transitions of these individuals. The way of "capturing" the individuals depends on the environment encountered.
2. A morphological features Markov chain

2.1. The semantics of the model
Let (f_i), i = 1, ..., m, be the states or morphological features; we have only one individual (n = 1) for each transition time t ∈ {1, 2, ..., T}. A first-order m-state Markov chain is defined by an m×m state transition probability matrix P and an m×1 initial probability vector Π, where

$$P = \{p_{f_i f_j}\}, \qquad p_{f_i f_j} = \mathrm{Prob}[e_{t+1} = f_j \mid e_t = f_i], \qquad i, j = 1, 2, \ldots, m,$$

$$\Pi = \{\Pi_{f_i}\}, \qquad \Pi_{f_i} = \mathrm{Prob}[e_1 = f_i], \qquad i = 1, 2, \ldots, m,$$

and E = {e_t}, t = 1, 2, ..., T, is a morphological feature sequence. The parameters m and T are respectively the number of states and the length of the state sequence. By definition, we have

$$\sum_{j=1}^{m} p_{f_i f_j} = 1 \quad \text{for } i = 1, 2, \ldots, m, \qquad \sum_{k=1}^{m} \Pi_{f_k} = 1.$$

The probability associated to a realization E of this Markov chain is

$$\mathrm{Prob}[E \mid P, \Pi] = \Pi_{e_1} \prod_{t=2}^{T} p_{e_{t-1} e_t}.$$
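A small numerical sketch of this chain; the toy probabilities below are ours, chosen only to make the formula concrete:

```python
import numpy as np

# States are morphological features f_0 .. f_{m-1}; the probability of a
# feature sequence E is Pi[e_1] * prod over t of P[e_{t-1}, e_t].
m = 3
Pi = np.array([0.5, 0.3, 0.2])          # initial probabilities (sum to 1)
P = np.array([[0.1, 0.6, 0.3],          # row i sums to 1: transitions out of f_i
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])

def sequence_probability(E, Pi, P):
    """Probability of one realization E = [e_1, ..., e_T] of the Markov chain."""
    prob = Pi[E[0]]
    for prev, cur in zip(E[:-1], E[1:]):
        prob *= P[prev, cur]
    return prob

print(sequence_probability([0, 1, 1, 2], Pi, P))   # 0.5 * 0.6 * 0.2 * 0.4 = 0.024
```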
2.2. Estimation of transition probabilities

As pointed out by Bartlett in Anderson and Goodman [1], the asymptotic theory must be considered with respect to the variable "number of times of observing the word form in a single sequence of transitions" instead of the variable "number of individuals in a state when T is fixed". However, this asymptotic theory was considered because the number of times of observing the word form increases (T → ∞). Furthermore, we cannot investigate the stationarity properties of the Markov process since we only have one word form (one individual) at each transition time. Therefore, we assumed stationarity. Thus, if N_{f_i f_j} is the number of times that the observed word form was in the feature f_i at time (t - 1) and in the feature f_j at time t, for all t ∈ {1, ..., T}, then the estimates of the transition probabilities are

$$\hat{p}_{f_i f_j} = \frac{N_{f_i f_j}}{N_{f_i +}},$$

where N_{f_i +} is the number of times that the word form was in state f_i. The estimated transition probabilities are evaluated on one training sample. We removed the morphological ambiguities by choosing the sequence E of highest probability.
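A sketch of this count-based estimator; the feature labels in the toy training sequence are ours, for illustration only:

```python
from collections import Counter

# p_hat[(f_i, f_j)] = N_{f_i f_j} / N_{f_i +}, counted over one long
# training sequence of morphological features.
def estimate_transitions(feature_sequence):
    pair_counts = Counter(zip(feature_sequence[:-1], feature_sequence[1:]))
    out_counts = Counter(feature_sequence[:-1])        # N_{f_i +}
    return {(fi, fj): n / out_counts[fi] for (fi, fj), n in pair_counts.items()}

training = ["noun", "verb", "det", "noun", "adjective", "verb", "det", "noun"]
p_hat = estimate_transitions(training)
print(p_hat[("noun", "verb")])   # 1 transition to "verb" out of 2 origins in "noun" -> 0.5
```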
3. A Markov model with hidden states and observations

The inadequacy of the previous model to remove certain morphological ambiguities has led us to believe that some unknown hidden states govern the distribution of the morphological features. Instead of passing from one morphological feature to another, we might transit from one hidden state to another. Besides, in section 2 we focused only on the surface of one random sample, i.e., an observation was a morphological feature. As pointed out in [4], this latter entity cannot be extracted without a context effect in a sample. In order to consider this context effect, we have chosen criteria like "the nature of the feature", "its successor feature", "its predecessor feature", "its position in a sentence" and "the position of this sentence in the text". An observation o_t is then a "known hidden vector" whose components are values of the criteria presented here. However, one can explore other criteria.
Definition 3.1. A hidden Markov model (HMM) is a Markov chain whose states cannot be observed directly but only through a sequence of observation vectors.

An HMM is represented by the state transition probability matrix P, the initial state probability vector Π and a (T×K) matrix V whose elements are the conditional densities v_i(o_t) = density of observation o_t given e_t = i, where K is the number of states. Our aim is the determination of the optimal model estimate Θ* = [Π*, P*, V*] given a certain number of samples; this is the training problem.

Theorem 3.2. The probability of a sample S = (o_1, o_2, ..., o_T) given a model Θ can be written as:

$$\mathrm{Prob}(S \mid \Theta) = \sum_{E} \Pi_{e_1} v_{e_1}(o_1) \prod_{t=2}^{T} p_{e_{t-1} e_t} \, v_{e_t}(o_t).$$
Proof. For a fixed state sequence E = e_1, e_2, ..., e_T, the probability of the observation sequence S = o_1, o_2, ..., o_T is Prob(S | E, Θ) = v_{e_1}(o_1) · v_{e_2}(o_2) · ... · v_{e_T}(o_T). The probability of a state sequence is Prob(E | Θ) = Π_{e_1} · p_{e_1 e_2} · p_{e_2 e_3} · ... · p_{e_{T-1} e_T}. Using the formula Prob(S, E | Θ) = Prob(S | E, Θ) · Prob(E | Θ) and summing this joint probability over all possible state sequences E, one demonstrates the theorem.
The interpretation of the previous equation is: initially, at time t = 1, the system is in state e_1 with probability Π_{e_1} and we observe o_1 with probability v_{e_1}(o_1). The system then makes a transition to state e_2 with probability p_{e_1 e_2} and we observe o_2 with probability v_{e_2}(o_2). This process continues until the last transition from state e_{T-1} to state e_T with probability p_{e_{T-1} e_T}, and then we observe o_T with probability v_{e_T}(o_T).
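To make Theorem 3.2 concrete, here is a brute-force sketch that sums the joint probability over every state sequence of a toy model; all numbers are invented for illustration, and V[t, i] plays the role of v_i(o_t) in the text:

```python
import numpy as np
from itertools import product

K, T = 2, 3
Pi = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
V = np.array([[0.9, 0.2],      # v_i(o_1) for i = 0, 1
              [0.3, 0.8],      # v_i(o_2)
              [0.5, 0.5]])     # v_i(o_3)

prob_S = 0.0
for E in product(range(K), repeat=T):            # sum over all state sequences
    p = Pi[E[0]] * V[0, E[0]]
    for t in range(1, T):
        p *= P[E[t - 1], E[t]] * V[t, E[t]]
    prob_S += p
print(prob_S)
```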
In order to determine one of the estimates of the model Θ = [Π, P, V], one can use the maximum likelihood criterion for a certain family S_i, i ∈ {1, 2, ..., L}, of training samples. Some methods of choosing representative samples of fixed length are presented in [2]. The problem can be expressed mathematically as:

$$\Theta_i^* = \arg\max_{\Theta} \; \mathrm{Prob}(S_i \mid \Theta) = \arg\max_{\Theta} \sum_{E} \Pi_{e_1} v_{e_1}(o_1) \prod_{t=2}^{T} p_{e_{t-1} e_t} \, v_{e_t}(o_t).$$

There is no known method to solve this problem analytically; that is the reason why we use iterative procedures. We start by determining first the optimal path for each sample. An optimal path E* is the one which is associated to the highest probability of the sample. Using the well-known Viterbi algorithm, one can determine this optimal path. The different steps for finding the single best state sequence in the Viterbi algorithm are:
step 1: initialization

$$\delta_1(i) = \Pi_i \, v_i(o_1), \qquad \psi_1(i) = 0, \qquad 1 \le i \le K,$$

step 2: recursion

for 2 ≤ t ≤ T and 1 ≤ j ≤ K,

$$\delta_t(j) = \max_{1 \le i \le K} \left[ \delta_{t-1}(i) \, p_{ij} \right] v_j(o_t), \qquad \psi_t(j) = \arg\max_{1 \le i \le K} \left[ \delta_{t-1}(i) \, p_{ij} \right],$$

step 3: termination

$$P^* = \max_{1 \le i \le K} \left[ \delta_T(i) \right], \qquad e_T^* = \arg\max_{1 \le i \le K} \left[ \delta_T(i) \right],$$

step 4: state sequence backtracking

for t = T-1, T-2, ..., 1,

$$e_t^* = \psi_{t+1}(e_{t+1}^*).$$

P* is the state-optimized likelihood function and E* = {e_1*, e_2*, ..., e_T*} is the optimal state sequence. Instead of tracking all possible paths, one successively tracks only the optimal paths E_i* of all the samples. Thus, this can be written as:

$$g(o_1, o_2, \ldots, o_T ; E^*, \Theta) = \Pi_{e_1^*} \, v_{e_1^*}(o_1) \prod_{t=2}^{T} p_{e_{t-1}^* e_t^*} \, v_{e_t^*}(o_t);$$
this computation has to be done for all the samples. Among all the Θ_i, i ∈ {1, ..., L}, associated to optimal paths, we decide to choose as best model estimate the one which maximizes the probability associated to a sample. It can be written as:

$$\Theta^* = \arg\max_{\Theta_i} \; g(o_1, o_2, \ldots, o_T ; E^*, \Theta_i), \qquad i \in \{1, 2, 3, \ldots, L\}.$$
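A runnable sketch of the Viterbi recursion written out above, in log-space for numerical stability; Π, P and V follow the notation of this section, while the implementation details are ours:

```python
import numpy as np

def viterbi(Pi, P, V):
    """Return the single best state sequence and its log-probability."""
    T, K = V.shape                      # V[t, i] = v_i(o_t)
    delta = np.zeros((T, K))            # best log-probability of a path ending in state j at time t
    psi = np.zeros((T, K), dtype=int)   # argmax backpointers
    delta[0] = np.log(Pi) + np.log(V[0])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(P)   # scores[i, j] for transition i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(V[t])
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):      # state sequence backtracking
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()
```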
4. The different steps of the method

We present an iterative method which enables us to obtain an estimator of the model Θ. This method is suitable for direct computation.

First step: one has to cluster the sample with respect to the chosen criteria; two possibilities are offered: a classification or a segmentation. In this latter procedure, the user may structure the states; operating in this way, the states appear like known hidden states. However, in a classification the system structures its own states according to a suitable norm. Thus, the states appear like unknown hidden ones. The clusters formed by one of the two procedures represent the first states of the model; they form the first training path.
Second step: one estimates the transition probabilities using the following equations, and the probability of each training vector for each state, i.e., v_i(o_t); this is the first model Θ_1.

$$\Pi_i = \frac{\text{number of times the observation } o_1 \text{ belongs to the state } i}{\text{number of training paths}},$$

$$p_{ij}(t) = \frac{\text{number of times } \{(o_{t-1} \text{ belongs to } i) \text{ and } (o_t \text{ belongs to } j)\}}{\text{number of times the observation } o_{t-1} \text{ belongs to } i}.$$

The previous estimation formula can be written as

$$p_{ij}(t) = \frac{N_{ij}(t)}{N_i(t-1)} = \frac{N_{ij}(t)}{N_{i+}(t)},$$

$$v_i(o_t) = \frac{\text{expected number of times of being in state } i \text{ and observing } o_t}{\text{expected number of times of being in state } i},$$

for i, j = 1, 2, ..., K and t = 1, 2, ..., T, where N_ij(t) is the number of transitions from state i at time (t - 1) to state j at time t, and N_i(t - 1) is the number of times the state i is visited at time (t - 1).
Third step: one computes f(o_1, o_2, ..., o_T ; Θ_1) and determines the next training path or clustering capable of increasing f(o_1, o_2, ..., o_T ; Θ_1). We apply the second step to this training path. This procedure is repeated until we reach the maximum value of the previous function. At this optimal value, we have E_1* and Θ_1 of the first sample. This step uses the Viterbi algorithm.

This algorithm is applied to a family of samples of the same text, so we obtain a family of E_i* and Θ_i. As mentioned previously, one decides reasonably to choose the model Θ* whose probability associated to a sample is maximum. This last model makes the sample the most representative, i.e., we have a good observation in some sense. This optimal model estimate is considered as a type of the text processed.
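Read as code, the three steps form the following loop; this is only a sketch of the procedure as we understand it, where cluster and estimate_model are hypothetical placeholders for the clustering and counting operations described in the text, and viterbi is the decoding routine sketched in section 3:

```python
def train_one_sample(observations, cluster, estimate_model, viterbi):
    """Iterate: cluster -> estimate (Pi, P, V) -> re-decode, until the score stops improving."""
    states = cluster(observations)                           # first step: the first training path
    best_model, best_states, best_score = None, states, float("-inf")
    while True:
        model = estimate_model(observations, states)         # second step: Pi, P, V from counts
        states, score = viterbi(*model)                      # third step: next training path
        if score <= best_score:                              # maximum of f(o_1, ..., o_T; theta) reached
            return best_model, best_states, best_score
        best_model, best_states, best_score = model, states, score
```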
5. Test for first-order stationarity

As outlined by Anderson and Goodman [1], the following test can be used to determine whether the Markov chain is first-order stationary or not. Thus, we have to test the null hypothesis

$$H_0 : p_{ij}(t) = p_{ij} \qquad (t = 1, 2, \ldots, T).$$

The likelihood ratio with respect to the null and alternate hypotheses is

$$\lambda = \prod_{t=1}^{T} \prod_{i=1}^{K} \prod_{j=1}^{K} \left( \frac{\hat{p}_{ij}}{\hat{p}_{ij}(t)} \right)^{N_{ij}(t)}.$$

We now determine the confidence region of the test. In fact, the expression -2 log λ is distributed as a Chi-square distribution with (T - 1)·K·(K - 1) degrees of freedom when the null hypothesis is true. As the distribution of the statistic S = -2 log λ is χ², one can compute a δ point (δ = 95%, 99.95%, etc.) as the threshold S_δ. The test is formulated as: if S < S_δ, the null hypothesis is accepted, i.e., the Markov chain is first-order stationary. Otherwise, the null hypothesis is rejected at a 100% - β level of significance, i.e., the chain is not first-order stationary, and one decides in favour of the nonstationary model.
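A sketch of this decision rule using scipy's chi-square quantile; computing log λ from the N_ij(t) counts is assumed to have been done separately, and the function name is ours:

```python
from scipy.stats import chi2

def is_first_order_stationary(log_lambda, T, K, delta=0.95):
    """Accept H0 (first-order stationarity) when S = -2 log(lambda) is below the chi-square threshold."""
    S = -2.0 * log_lambda
    threshold = chi2.ppf(delta, df=(T - 1) * K * (K - 1))
    return S < threshold
```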
6. Howto solve the morphologicalambiguities
Thisisthemostimportantphase
ofourapplication, Let’s
consider
anexample
ofnine
possible
paths
encountered
in a text.
Amongthese
paths,
thesystem
hasto choose
the
mostlikely
according
to theprobability
measure,
seefigure
I. Ourdecision
of choosing
themostlikely
pathcomes
fromtheoptimal
model
O*obtained
inthetraining
phase.
Wc
showin thisexample
howto remove
themorphological
ambiguities.
If theoptimal
statesequence
obtained
in thetraining
phaseis theonewhich
corresponds
tothefigure
2,thenoneforexample
canchoose
between
thetwofollowing
paths of thefigure 1:
PathhsI s 2 s 3 s 4 s~ s 6 s~
path2: s I s 2 s" 3 s,# s 5 s" 6 s7.
OnecomputesOie probabilities of these two realizations of the observationsoi (i =1..... 7)
using the formula:
Pr
(ol,o
2.....07 / 0")= l’I
el * vel(ol)* (°t)"
= P%l"t*
v~
Figure 2 shows that each s_i belongs to a state e_i, and using the optimal model Θ* = [Π*, P*, V*] one can compute the probability of a path. Our decision to remove the morphological ambiguities is to choose the path with the highest probability.
[Fig. 1. The nine possible paths of observation vectors (T = 7). An observation vector includes the feature, its successor feature, its predecessor feature, the position of the feature in a sentence, and the position of the sentence in the text.]

[Fig. 2. The optimal state sequence (K = 4) is: 1 3 3 4 2 2 3.]
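A sketch of this selection step: each candidate reading of the text supplies a different observation sequence, which is scored under Θ* along the optimal state sequence, and the highest-probability reading is kept. Here emission is a hypothetical stand-in for the densities v_i(o_t), and the function name is ours:

```python
def best_reading(candidate_paths, states, Pi, P, emission):
    """Pick the observation path with the highest probability along the fixed optimal state sequence."""
    def score(path):
        p = Pi[states[0]] * emission(states[0], path[0])
        for t in range(1, len(path)):
            p *= P[states[t - 1], states[t]] * emission(states[t], path[t])
        return p
    return max(candidate_paths, key=score)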
7. Conclusion

We have presented a new approach for solving the morphological ambiguities using a hidden Markov model. This method may also be applied to other analysis levels such as syntax. The main advantage of the method is the possibility to assign many different classes of criteria (fuzzy or completely known) to the training vectors and to investigate many samples. Furthermore, we can define a "distance" between any sample and a family of types of texts called models. One may choose the model which gives the highest probability of this sample and conclude that the sample belongs to this specific type of texts.

We can also develop a proximity measure between two models Θ*_1 and Θ*_2 through representative samples. However, some precautions must be taken in the choice of the distance used between the training vectors in the clustering process. In fact, the value of the probability associated to a sample may depend on this norm and, therefore, the choice of the best model estimate can be affected.

So far, we supposed that the criteria describing the observations and the states are completely known (hard observation). Very often, when we want to make deep investigations, fuzziness or uncertainty due to some criteria or states are encountered; what to do in this case? How can we cluster the observations according to any uncertainty measure? What is the optimal path and the best estimate model according to this uncertainty measure? We are working in order to propose solutions to those questions in the case of a probabilistic [3] and a fuzzy logic.
References

[1] T.W. Anderson and L.A. Goodman, "Statistical inference about Markov chains", Ann. Math. Statist. 28, 89-110 (1957).

[2] D. Bouchaffra, Echantillonnage et analyse multivariée de paramètres linguistiques, Thèse de Doctorat (Ph.D.), Université Pierre Mendès France, to appear in (1992).

[3] D. Bouchaffra, "A relation between isometrics and the relative consistency concept in probabilistic logic", 13th IMACS World Congress on Computation and Applied Mathematics, Dublin, Ireland, July 22-26 (1991); selected papers in AI, Expert Systems and Symbolic Computing for Scientific Computation, eds. John Rice and Elias N. Houstis, Elsevier, North-Holland.

[4] J. Rouault, Linguistique automatique : Applications documentaires, Editions Peter Lang SA, Berne (1987).