Uploaded by GLENN RODRIGUES_192252

60 nlp exp5

Experiment No 05
Part of Speech Tagging
Aim : To implement part-of-speech tagging with Hidden Markov Models
Theory :
Using a Hidden Markov Model (HMM) for part-of-speech tagging is a special case of Bayesian inference.
In a classification task, we are given some observation(s) and our job is to determine which of a set
of classes each one belongs to. Part-of-speech tagging is generally treated as a sequence classification
task: the observation is a sequence of words (a sentence), and our job is to assign it a sequence of
part-of-speech tags. For example, say we are given the sentence "Secretariat is expected to race
tomorrow". What is the best sequence of tags corresponding to this sequence of words?
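The question above can be written as a worked equation. This is a sketch of the standard bigram-HMM derivation; the symbols w_1..w_n (words) and t_1..t_n (tags) are notation introduced here for illustration and do not appear in the original write-up.

```latex
\begin{aligned}
\hat{t}_{1:n} &= \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n}) \\
              &= \arg\max_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
                 && \text{(Bayes' rule)} \\
              &= \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})
                 && \text{(the denominator does not depend on the tags)} \\
              &\approx \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}),
                 \qquad t_0 = \texttt{<s>}
\end{aligned}
```

The last line uses the two HMM assumptions: each word depends only on its own tag, and each tag depends only on the previous tag.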
An HMM is specified by the following components : the transition probabilities between hidden
states (i.e. part-of-speech tags) and the observation likelihoods of words given tags. Both are given
below:
Figure : Tag transition probabilities
Figure : Observation Likelihoods
The symbol <s> is the start-of-sentence symbol.
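To make the two components concrete, here is a minimal sketch that scores one candidate tag sequence by multiplying transition and observation probabilities. The values are taken from the tables in Programming Exercise 2 of this write-up; the function name sequence_score and its parameters are hypothetical names chosen for illustration.

```python
# Transition table: tp[previous_tag][tag] (values from Programming Exercise 2).
tp = {'<s>': {'M': 0.25, 'N': 0.75},
      'M': {'N': 0.25, 'V': 0.75},
      'N': {'<\\s>': 4 / 9, 'M': 3 / 9, 'N': 1 / 9, 'V': 1 / 9},
      'V': {'N': 1.0}}

# Observation table: em[word][tag] (values from Programming Exercise 2).
em = {'can': {'M': 1.0},
      'jane': {'N': 1.0},
      'mary': {'N': 1.0},
      'see': {'V': 1.0},
      'spot': {'N': 2 / 3, 'V': 1 / 3},
      'will': {'M': 0.75, 'N': 0.25}}


def sequence_score(words, tag_seq, tp, em, start='<s>'):
    """Multiply transition and observation terms along one tag sequence."""
    score = 1.0
    prev = start
    for w, t in zip(words, tag_seq):
        # Missing entries count as probability zero.
        score *= tp[prev].get(t, 0.0) * em.get(w, {}).get(t, 0.0)
        prev = t
    return score


words = 'can mary see jane'.split()
print(sequence_score(words, ['M', 'N', 'V', 'N'], tp, em))
print(sequence_score(words, ['N', 'N', 'V', 'N'], tp, em))  # 'can' is never tagged N
```

Scoring every possible tag sequence this way grows exponentially with sentence length, which is why a dynamic-programming method is used instead.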
Viterbi algorithm
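The task of the Viterbi algorithm here is to find the best hidden state sequence given the
observations and the model.
The Viterbi algorithm can be represented by the following 3 steps:
1. Initialization : This step involves obtaining the required transition and emission probabilities
from the trained HMM and scoring each tag for the first word, starting from <s>. The
training can be done using any tagged corpus.
2. Recursion : For every following word and every tag, keep the probability of the most probable
tag path ending in that tag, together with a pointer to the previous tag on that path. This has the
same shape as the forward algorithm, which computes a 'belief state' (the probability of a state at
a certain time, given the history of evidence), except that Viterbi takes the maximum over the
previous states where the forward algorithm sums over them.
3. Termination : It involves finding the best hidden state sequence by picking the most probable
final tag and following the stored pointers back to the start.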
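The three steps can be summarised as equations. This is a sketch in assumed notation that the original does not define: a_{ij} is the transition probability from tag i to tag j, b_j(w) is the observation likelihood of word w under tag j, v_k(j) is the probability of the best path ending in tag j at word k, and bp_k(j) is the back-pointer.

```latex
\begin{aligned}
\text{Initialization:}\quad & v_1(j) = a_{\texttt{<s>},\,j}\; b_j(w_1) \\
\text{Recursion:}\quad      & v_k(j) = \max_{i}\; v_{k-1}(i)\, a_{ij}\, b_j(w_k),
                              \qquad bp_k(j) = \arg\max_{i}\; v_{k-1}(i)\, a_{ij} \\
\text{Termination:}\quad    & \hat{t}_n = \arg\max_{j}\; v_n(j),
                              \qquad \hat{t}_{k-1} = bp_k(\hat{t}_k) \quad \text{for } k = n, \dots, 2
\end{aligned}
```

Replacing the max in the recursion with a sum over i gives the forward algorithm, which computes the total probability of the observations rather than the single best path.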
Programming Exercises
1. Write a Python program to build a bigram HMM, obtaining the transition and emission
probabilities from a tagged dataset. (Simple sentences of maximum 5 tags to be used.)
lines = ['<s> Mary/N Jane/N can/M see/V Will/N <\\s>',
         '<s> Spot/N Will/M see/V Mary/N <\\s>',
         '<s> Will/M Jane/N Spot/V Mary/N <\\s>',
         '<s> Mary/N Will/M see/V Spot/N <\\s>']

# Save the tagged corpus, one sentence per line.
with open("hmm.txt", "w", encoding="utf-8") as outF:
    for i in lines:
        outF.write(i + "\n")
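ct = {}      # number of occurrences of each word
tag_ct = {}  # number of times each tag (or <s>) starts a bigram
bigram = {}  # number of occurrences of each tag bigram "previous | next"
cth = {}     # number of occurrences of each distinct word/tag pair

with open("hmm.txt", encoding="utf-8") as f:
    for line in f:
        x = []  # tag sequence of this sentence, including <s> and <\s>
        for tok in line.split():
            if '/' not in tok:
                # <s> and <\s> are boundary symbols and act as their own tags
                x.append(tok)
                continue
            word, tag = tok.split('/')
            word = word.lower()  # treat "Will" and "will" as the same word
            x.append(tag)
            ct[word] = ct.get(word, 0) + 1
            key = word + '/' + tag
            cth[key] = cth.get(key, 0) + 1
        # Transitions are counted over adjacent tags, not adjacent words.
        for prev, nxt in zip(x, x[1:]):
            tag_ct[prev] = tag_ct.get(prev, 0) + 1
            bi = prev + ' | ' + nxt
            bigram[bi] = bigram.get(bi, 0) + 1

print("Transition probabilities are:")
print("Bigram \t\tProbability(t(n)|t(n-1))")
for key, value in bigram.items():
    prev = key.split(" | ")[0]
    prob = value / tag_ct[prev]
    print(key + "\t" + str(prob))

print("Emission probabilities are:")
print("Word/Tag \tProbability")
for key, value in cth.items():
    word = key.split("/")[0]
    # The count is divided by the word count, as in the table used in
    # exercise 2. This gives P(tag | word); the textbook emission
    # probability P(word | tag) would divide by the tag count instead.
    prob = value / ct[word]
    print(key + "\t" + str(prob))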
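The program above normalises each word/tag count by the word count, which yields P(tag | word). For comparison, here is a minimal sketch of the textbook emission estimate P(word | tag) = count(word, tag) / count(tag) on the same four sentences. The names tag_count, pair_count and emission are hypothetical, and lowercasing the words is an assumption made so that "Will" and "will" are counted together.

```python
from collections import Counter

lines = ['<s> Mary/N Jane/N can/M see/V Will/N <\\s>',
         '<s> Spot/N Will/M see/V Mary/N <\\s>',
         '<s> Will/M Jane/N Spot/V Mary/N <\\s>',
         '<s> Mary/N Will/M see/V Spot/N <\\s>']

tag_count = Counter()   # how often each tag occurs
pair_count = Counter()  # how often each (word, tag) pair occurs
for line in lines:
    for tok in line.split():
        if '/' not in tok:
            continue  # skip the boundary symbols <s> and <\s>
        word, tag = tok.split('/')
        tag_count[tag] += 1
        pair_count[(word.lower(), tag)] += 1

# P(word | tag): normalise each pair count by the count of its tag.
emission = {(w, t): c / tag_count[t] for (w, t), c in pair_count.items()}

for (w, t), p in sorted(emission.items()):
    print(f"P({w} | {t}) = {p:.3f}")
```

With this normalisation the probabilities for a fixed tag sum to 1 over words, which is what the HMM factorisation expects.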
2. Use the HMM model created above to tag a given sentence using the Viterbi algorithm.
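em = {'can': {'M': 1.0},
      'jane': {'N': 1.0},
      'mary': {'N': 1.0},
      'see': {'V': 1.0},
      # one entry per word: repeating a key would overwrite the earlier tag
      'spot': {'N': 0.6666666666666666, 'V': 0.3333333333333333},
      'will': {'M': 0.75, 'N': 0.25}}
tp = {'<s>': {'M': 0.25, 'N': 0.75},
      'M': {'N': 0.25, 'V': 0.75},
      'N': {'<\\s>': 0.4444444444444444, 'M': 0.3333333333333333,
            'N': 0.1111111111111111, 'V': 0.1111111111111111},
      'V': {'N': 1.0}}
tags = ['M', 'N', 'V']
line = 'can mary see jane'
words = line.split()  # every word is assumed to appear in em

res_mat = []  # res_mat[k][t]: probability of the best tag path ending in t at word k
back = []     # back[k][t]: previous tag on that best path
for k, w in enumerate(words):
    res_mat.append({})
    back.append({})
    for t in tags:
        e = em.get(w, {}).get(t, 0)
        if e == 0:
            continue  # tag t never emits word w
        if k == 0:
            # Initialization: transition from the start symbol
            res_mat[k][t] = tp['<s>'].get(t, 0) * e
            back[k][t] = '<s>'
        else:
            # Recursion: keep only the most probable previous tag
            best_prev = max(res_mat[k - 1],
                            key=lambda p: res_mat[k - 1][p] * tp[p].get(t, 0))
            res_mat[k][t] = res_mat[k - 1][best_prev] * tp[best_prev].get(t, 0) * e
            back[k][t] = best_prev
print(res_mat)

# Termination: pick the best final tag and follow the back-pointers
best = max(res_mat[-1], key=res_mat[-1].get)
path = [best]
for k in range(len(words) - 1, 0, -1):
    path.insert(0, back[k][path[0]])
print(list(zip(words, path)))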
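The sentence used above has only one possible tag per word, so it does not exercise the max over previous tags. The sketch below runs the same tables on a sentence with ambiguous words ("will", "spot") and works in log space, a common variant that avoids underflow on long sentences. The function name viterbi_log and the helper lg are hypothetical names introduced here, not part of the original exercise.

```python
import math

em = {'can': {'M': 1.0}, 'jane': {'N': 1.0}, 'mary': {'N': 1.0},
      'see': {'V': 1.0},
      'spot': {'N': 0.6666666666666666, 'V': 0.3333333333333333},
      'will': {'M': 0.75, 'N': 0.25}}
tp = {'<s>': {'M': 0.25, 'N': 0.75},
      'M': {'N': 0.25, 'V': 0.75},
      'N': {'<\\s>': 0.4444444444444444, 'M': 0.3333333333333333,
            'N': 0.1111111111111111, 'V': 0.1111111111111111},
      'V': {'N': 1.0}}
tags = ['M', 'N', 'V']


def lg(p):
    """Log of a probability, with log(0) mapped to minus infinity."""
    return math.log(p) if p > 0 else float('-inf')


def viterbi_log(words, tags, tp, em, start='<s>'):
    # Initialization: score every tag for the first word.
    score = [{t: lg(tp[start].get(t, 0)) + lg(em.get(words[0], {}).get(t, 0))
              for t in tags}]
    back = [{}]
    # Recursion: best previous tag for every (word, tag) cell.
    for k in range(1, len(words)):
        score.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: score[k - 1][p] + lg(tp[p].get(t, 0)))
            score[k][t] = (score[k - 1][prev] + lg(tp[prev].get(t, 0))
                           + lg(em.get(words[k], {}).get(t, 0)))
            back[k][t] = prev
    # Termination: best final tag, then follow the back-pointers.
    last = max(tags, key=lambda t: score[-1][t])
    path = [last]
    for k in range(len(words) - 1, 0, -1):
        path.insert(0, back[k][path[0]])
    return path, score[-1][last]


path, log_prob = viterbi_log('mary will spot jane'.split(), tags, tp, em)
print(path, math.exp(log_prob))
```

Here "will" and "spot" each have two candidate tags, so the back-pointers decide between competing paths rather than simply following the only non-zero cell.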