Experiment No 05: Part-of-Speech Tagging

Aim : To implement Part-of-Speech tagging with Hidden Markov Models.

Theory :

Using a Hidden Markov Model (HMM) for part-of-speech tagging is a special case of Bayesian inference. In a classification task, we are given some observation(s) and our job is to determine which of a set of classes it belongs to. Part-of-speech tagging is generally treated as a sequence classification task: the observation is a sequence of words (a sentence), and the job is to assign it a sequence of part-of-speech tags. For example, given a sentence like "Secretariat is expected to race tomorrow", what is the best sequence of tags corresponding to this sequence of words?

An HMM is specified by two sets of parameters: the transition probabilities between hidden states (the part-of-speech tags) and the observation likelihoods of words given tags, illustrated in the figures below.

Figure : Tag transition probabilities
Figure : Observation likelihoods

The symbol <s> is the start-of-sentence marker and <\s> is the end-of-sentence marker.

Viterbi algorithm

The task of the Viterbi algorithm is to find the best hidden state (tag) sequence given the observed words and the model. It can be described in three steps:

1. Initialization : Obtain the required transition and emission probabilities from the trained HMM (training is done by counting over a tagged corpus), and score each possible tag of the first word by the transition probability out of <s> times its emission score.

2. Recursion : Where the forward algorithm sums over previous states to compute a belief state, Viterbi takes a maximum: for each word position t and tag j, v_t(j) = max_i [ v_{t-1}(i) * a(i, j) * b_j(o_t) ], with a(i, j) the tag transition probability and b_j(o_t) the emission score of the observed word; a backpointer records the maximising previous tag i.

3. Termination : Select the highest-scoring state at the final position and follow the backpointers to recover the best hidden state sequence.

Programming Exercises

1. Write a Python program to build a bigram HMM, computing transition and emission probabilities from a tagged dataset (simple sentences using at most 5 tags).

lines = ['<s> Mary/N Jane/N can/M see/V Will/N <\\s>',
         '<s> Spot/N Will/M see/V Mary/N <\\s>',
         '<s> Will/M Jane/N Spot/V Mary/N <\\s>',
         '<s> Mary/N Will/M see/V Spot/N <\\s>']

# Write the tagged sentences to a file that serves as the training corpus.
with open("hmm.txt", "w") as outF:
    for i in lines:
        outF.write(i + "\n")

word_ct = {}   # occurrences of each word, tags stripped
tag_ct = {}    # occurrences of each tag, counting <s> and <\s> as tags
bigram = {}    # occurrences of each adjacent (previous tag, current tag) pair
cth = {}       # occurrences of each distinct word/tag pair

with open("hmm.txt", encoding='utf-8') as f:
    for line in f:
        tags = []
        for token in line.strip().split(' '):
            sp = token.split('/')
            if len(sp) == 2:                  # an ordinary word/tag token
                word, tag = sp
                cth[token] = cth.get(token, 0) + 1
                word_ct[word] = word_ct.get(word, 0) + 1
            else:                             # a <s> or <\s> marker
                tag = sp[0]
            tag_ct[tag] = tag_ct.get(tag, 0) + 1
            tags.append(tag)
        # Count tag-to-tag transitions within the sentence.
        for prev, cur in zip(tags, tags[1:]):
            bi = prev + ' | ' + cur
            bigram[bi] = bigram.get(bi, 0) + 1

print("Transition probabilities:")
print("Tag bigram\tP(tag(n) | tag(n-1))")
for key, value in bigram.items():
    prev = key.split(' | ')[0]
    print(key + "\t" + str(value / tag_ct[prev]))

# Note: this exercise scores emissions as P(tag | word), the form
# consumed by the Viterbi program in Exercise 2.
print("Emission scores:")
print("Word/Tag\tP(tag | word)")
for key, value in cth.items():
    word = key.split('/')[0]
    print(key + "\t" + str(value / word_ct[word]))

2. Use the HMM model created above to tag a given sentence using the Viterbi algorithm.
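The transition and emission tables below are exactly the probabilities printed by the Exercise 1 program, gathered into nested dictionaries (with words lower-cased). They could also be built directly from the count dictionaries instead of being typed in by hand; the following is a minimal sketch, assuming the bigram, tag_ct, cth and word_ct dictionaries from Exercise 1 are still in scope (build_tables is a hypothetical helper name, not part of the original exercise):

def build_tables(bigram, tag_ct, cth, word_ct):
    # Hypothetical helper: packs the Exercise 1 counts into the nested
    # tp/em dictionaries consumed by the Viterbi program below.
    tp, em = {}, {}
    for key, value in bigram.items():
        prev, cur = key.split(' | ')
        tp.setdefault(prev, {})[cur] = value / tag_ct[prev]
    for key, value in cth.items():
        word, tag = key.split('/')
        em.setdefault(word.lower(), {})[tag] = value / word_ct[word]
    return tp, em

tp, em = build_tables(bigram, tag_ct, cth, word_ct)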
# Emission scores P(tag | word) and tag transition probabilities
# P(tag(n) | tag(n-1)), as produced by the Exercise 1 program (words lower-cased).
em = {'can':  {'M': 1.0},
      'jane': {'N': 1.0},
      'mary': {'N': 1.0},
      'see':  {'V': 1.0},
      'spot': {'N': 2/3, 'V': 1/3},     # one dict per word: a key may not repeat
      'will': {'M': 0.75, 'N': 0.25}}
tp = {'<s>': {'M': 0.25, 'N': 0.75},
      'M':   {'N': 0.25, 'V': 0.75},
      'N':   {'<\\s>': 4/9, 'M': 1/3, 'N': 1/9, 'V': 1/9},
      'V':   {'N': 1.0}}
tags = ['M', 'N', 'V']

line = 'can mary see jane'
words = line.split()

# Viterbi lattice: res_mat[i][t] is the probability of the best tag path for
# words[0..i] ending in tag t; back[i][t] stores the maximising previous tag.
res_mat = [{} for _ in words]
back = [{} for _ in words]
for i, w in enumerate(words):
    for t in tags:
        e = em.get(w, {}).get(t, 0)
        if e == 0:
            continue                     # this word was never seen with tag t
        if i == 0:
            # Initialization: transition out of the start symbol <s>.
            res_mat[i][t] = tp['<s>'].get(t, 0) * e
            back[i][t] = '<s>'
        else:
            # Recursion: maximise over the tags of the previous word.
            best_prev = max(res_mat[i - 1],
                            key=lambda p: res_mat[i - 1][p] * tp.get(p, {}).get(t, 0))
            res_mat[i][t] = (res_mat[i - 1][best_prev]
                             * tp.get(best_prev, {}).get(t, 0) * e)
            back[i][t] = best_prev

# Termination: take the best final tag and follow the backpointers
# (the <\s> end-of-sentence transition is not applied here).
best = max(res_mat[-1], key=res_mat[-1].get)
path = [best]
for i in range(len(words) - 1, 0, -1):
    path.append(back[i][path[-1]])
path.reverse()

print(res_mat)
print(list(zip(words, path)))    # [('can', 'M'), ('mary', 'N'), ('see', 'V'), ('jane', 'N')]
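Because every word in the test sentence admits exactly one tag under em, the answer can be verified by hand: the only possible path is M N V N, and its score is the product of the transition and emission terms along that path. A quick check using the tables above:

# Hand check of the Viterbi score for 'can mary see jane' (single possible path).
p = (tp['<s>']['M'] * em['can']['M']     # <s> -> M, emit 'can'
     * tp['M']['N'] * em['mary']['N']    # M -> N, emit 'mary'
     * tp['N']['V'] * em['see']['V']     # N -> V, emit 'see'
     * tp['V']['N'] * em['jane']['N'])   # V -> N, emit 'jane'
print(p)                                 # about 0.00694, equal to res_mat[-1]['N'] above

For a sentence containing ambiguous words, e.g. 'will spot see mary' (where 'will' and 'spot' each carry two possible tags in em), the maximisation over previous tags in the recursion step is what selects the best path.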