
NLP Lab Manual: Computational Lab-II Guide

Lab Manual
Subject: Computational Lab-II/ CSL803
Natural Language Processing (NLP)
SEM./YEAR:8TH/4TH
RIZVI COLLEGE OF ENGINEERING
New Rizvi Educational Complex, Off Carter Rd, Bandra West, Mumbai-50.
INDEX

S.NO.   Name of Experiment                                             Date of         Date of      Grade   Remark
                                                                       Commencement    Completion
01      Word Analysis
02      Pre-processing of text (Tokenization, Filtration, Script
        Validation, Stop Word Removal, Stemming)
03      Morphological Analysis
04      N-gram model
05      POS tagging
06      Chunking
07      Named Entity Recognition
08      Case Study/ Mini Project based on Application mentioned in
        Module 6.

Name of student & Signature
Subject Teacher
-------------------------------------
Prof. Vikas R Dubey
RIZVI COLLEGE OF ENGINEERING, MUMBAI
DEPARTMENT OF COMPUTER ENGINEERING
New Rizvi Educational Complex, Off Carter Rd, Bandra West, Mumbai-50.
CERTIFICATE OF SUBMISSION
This is to certify that Mr./Mrs.-----------------------------------------------(Roll no.), student of the Department of Computer Engineering, studying in semester
8 (4th year), has submitted the term work/oral/practical/report for the subject
Computational Lab-II (Natural Language Processing) for the academic session
2021-22 (even) in partial fulfilment of the Bachelor of Engineering (B.E.).
Subject In charge
Prof. Vikas R Dubey
Head of Department
Prof. Shiburaj Pappu
Principal
Dr.Varsha Shah
Syllabus
CSL803: Computational Lab-II
Laboratory Work/Case Study/Experiments: The laboratory work (experiments) for this course is required to be performed and evaluated.
The objective of the Natural Language Processing lab is to introduce students to the basics of NLP, which will empower them to develop advanced NLP tools and solve practical problems in this
field.
Reference for Experiments: http://cse24-iiith.virtual-labs.ac.in/#
Reference for NPTEL: http://www.cse.iitb.ac.in/~cs626-449
Sample Experiments (possible tools/languages: R / Python programming language).
Note: Although it is not mandatory, the experiments can be conducted with reference to any Indian
regional language.
1. Word Analysis
2. Pre-processing of text (Tokenization, Filtration, Script Validation, Stop Word Removal, Stemming)
3. Morphological Analysis
4. N-gram model
5. POS tagging
6. Chunking
7. Named Entity Recognition
8. Case Study/ Mini Project based on Application mentioned in Module 6.
EXPT NO.01
AIM: Word Analysis
Theory:
Analysis of a word into its root and affix(es) is called morphological analysis of
a word. Identifying the root of a word is essential for any natural language
processing task. A root word can have various forms. For example, the word
'play' in English has the following forms: 'play', 'plays', 'played' and 'playing'.
Hindi has a larger number of forms for the word 'खेल' (khela), which is
equivalent to 'play'. The forms of 'खेल' (khela) are the following:
खेल(khela), खेला(khelaa), खेली(khelii), खेलूंगा(kheluungaa), खेलूंगी(kheluungii), खेलेगा(khelegaa),
खेलेगी(khelegii), खेलते(khelate), खेलती(khelatii), खेलने(khelane), खेलकर(khelakar)
For the Telugu root ఆడడం (Adadam), the forms are the following:
Adutaanu, AdutunnAnu, Adenu, Ademu, AdevA, AdutAru, Adutunnaru, AdadAniki, Adesariki,
AdanA, Adinxi, Adutunxi, AdinxA, AdeserA, Adestunnaru, ...
Thus, morphological richness varies from one language to another. Indian languages are generally
morphologically rich, and therefore morphological analysis of
words is a very significant task for Indian languages.
Types of Morphology
Morphology is of two types,
1. Inflectional morphology
Deals with word forms of a root, where there is no change in lexical category. For example,
'played' is an inflection of the root word 'play'. Here, both 'played' and 'play' are verbs.
2. Derivational morphology
Deals with word forms of a root, where there is a change in the lexical category. For example,
the word form 'happiness' is a derivation of the word 'happy'. Here, 'happiness' is a derived
noun form of the adjective 'happy'.
Morphological Features:
All words will have their lexical category attested during morphological analysis.
A noun and pronoun can take suffixes of the following features: gender, number, person,
case
For example, morphological analysis of a few words is given below:
Language   input: word         output: analysis
Hindi      लडके (ladake)        rt=लड़का(ladakaa), cat=n, gen=m, num=sg, case=obl
Hindi      लडके (ladake)        rt=लड़का(ladakaa), cat=n, gen=m, num=pl, case=dir
Hindi      लड़कों (ladakoM)      rt=लड़का(ladakaa), cat=n, gen=m, num=pl, case=obl
English    boy                 rt=boy, cat=n, gen=m, num=sg
English    boys                rt=boy, cat=n, gen=m, num=pl
A verb can take suffixes of the following features: tense, aspect, modality, gender, number,
person
Language   input: word         output: analysis
Hindi      हँसी (hansii)         rt=हँस(hans), cat=v, gen=fem, num=sg/pl, per=1/2/3, tense=past, aspect=pft
English    toys                rt=toy, cat=n, num=pl, per=3
'rt' stands for root. 'cat' stands for lexical category. The value of lexical category can be
noun, verb, adjective, pronoun, adverb or preposition. 'gen' stands for gender. The value of
gender can be masculine or feminine.
'num' stands for number. The value of number can be singular (sg) or plural (pl).
'per' stands for person. The value of person can be 1, 2 or 3.
The value of tense can be present, past or future. This feature is applicable for verbs.
The value of aspect can be perfect (pft), continuous (cont) or habitual (hab). This feature is
also applicable for verbs.
'case' can be direct or oblique. This feature is applicable for nouns. A case is an oblique case
when a postposition occurs after the noun. If no postposition can occur after the noun, then the case
is a direct case. This is applicable for Hindi but not for English, as English does not have
postpositions. Some of the postpositions in Hindi are: का(kaa), की(kii), के(ke), को(ko), में(meM)
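As a small illustration of how such feature bundles can be handled programmatically, here is a minimal Python sketch (the entries are hand-built from the tables above; this is not the virtual-lab backend):

# Hand-built morphological analyses using the rt/cat/gen/num/per notation above
analyses = {
    "boy":  {"rt": "boy", "cat": "n", "gen": "m", "num": "sg"},
    "boys": {"rt": "boy", "cat": "n", "gen": "m", "num": "pl"},
    "toys": {"rt": "toy", "cat": "n", "num": "pl", "per": 3},
}

def analyze(word):
    """Look up the stored analysis of a word, if any."""
    return analyses.get(word, {"rt": word, "cat": "unknown"})

print(analyze("boys"))   # {'rt': 'boy', 'cat': 'n', 'gen': 'm', 'num': 'pl'}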
Procedure:
STEP1: Select the language.
OUTPUT: Drop down for selecting words will appear.
STEP2: Select the word.
OUTPUT: Drop down for selecting features will appear.
STEP3: Select the features.
STEP4: Click "Check" button to check your answer.
OUTPUT: Right features are marked by tick and wrong features are marked by cross.
Execution and output:
Word Analysis
Select a language which you know well. Select a word from the dropdown box and do a morphological analysis on that
word. Then select the correct morphological analysis for the word using the dropdown boxes
(NOTE: na = not applicable) and click the "Check" button to check your answer.

Selections made in the tool:
WORD: watching
ROOT: watch
CATEGORY: verb
GENDER: male
NUMBER: plural
PERSON: second
CASE: na
TENSE: past-perfect
Check
Conclusion: Thus word analysis is done.
EXPT NO.02
AIM: Pre-processing of text (Tokenization, Filtration, Script Validation, Stop Word Removal,
Stemming)
Theory: Preprocessing
a) Tokenizer:
In simple words, a tokenizer is a utility function to split a sentence into words.
keras.preprocessing.text.Tokenizer tokenizes (splits) the text into tokens (words) while
keeping only the most frequently occurring words in the text corpus.
#Signature:
Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
lower=True, split=' ', char_level=False, oov_token=None, document_count=0, **kwargs)
The num_words parameter keeps only a pre-specified number of words in the text. This is helpful because
we don't want our models to pick up a lot of noise from words that occur very infrequently; in
real-world data, most of the words we drop using the num_words parameter are misspellings. The
tokenizer also filters some unwanted tokens by default and converts the text to lowercase.
Once fitted to the data, the tokenizer also keeps an index of words (a dictionary mapping each word
to a unique number) which can be accessed via tokenizer.word_index. The
words in the index are ranked in order of frequency.
CODING/PROGRAM:
So the whole code to use tokenizer is as follows:
from keras.preprocessing.text import Tokenizer
## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X)+list(test_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)
where train_X and test_X are lists of documents in the corpus.
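To make the snippet above self-contained, here is a small runnable sketch with toy sentences (the data and num_words value are illustrative assumptions, not part of the original notebook); it also shows the word_index mentioned earlier:

from keras.preprocessing.text import Tokenizer

# Toy corpus standing in for train_X / test_X
train_X = ["the cat sat on the mat", "the dog sat on the log"]
test_X = ["the cat saw the dog"]

tokenizer = Tokenizer(num_words=50)           # keep at most the 50 most frequent words
tokenizer.fit_on_texts(train_X + test_X)      # build the word index over all documents

print(tokenizer.word_index)                   # words ranked by frequency, e.g. {'the': 1, ...}
print(tokenizer.texts_to_sequences(test_X))   # each document as a list of word indices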
Filtration:
NLP is short for Natural Language Processing. As you probably know, computers are not as great at
understanding words as they are at understanding numbers. This is all changing, though, as advances in NLP are
happening every day. The fact that devices like Apple's Siri and Amazon's Alexa can (usually)
comprehend when we ask for the weather, for directions, or to play a certain genre of music are all
examples of NLP. The spam filter in your email and the spellcheck you've used since you learned to
type in elementary school are some other basic examples of your computer understanding
language.
As data scientists, we may use NLP for sentiment analysis (classifying words as having positive or
negative connotation) or to make predictions in classification models, among other things. Typically,
whether we’re given the data or have to scrape it, the text will be in its natural human format of
sentences, paragraphs, tweets, etc. From there, before we can dig into analyzing, we will have to do
some cleaning to break the text down into a format the computer can easily understand.
For this example, we’re examining a dataset of Amazon products/reviews which can be found and
downloaded for free on data.world. I’ll be using Python in Jupyter notebook.
Here are the imports used:
(You may need to run nltk.download() in a cell if you’ve never previously used it.)
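The imports were shown only as an image in the source. Based on the steps used later in this experiment (pandas for the DataFrame, string for punctuation, and NLTK for tokenization, stop words, stemming and lemmatization), a plausible set would be:

import pandas as pd
import string                                      # string.punctuation for the punctuation step
import nltk
from nltk.corpus import stopwords                  # stop word lists
from nltk.tokenize import RegexpTokenizer          # regex-based tokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Run once if the resources are missing:
# nltk.download('stopwords'); nltk.download('wordnet')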
Read in csv file, create DataFrame & check shape. We are starting out with 10,000 rows and 17
columns. Each row is a different product on Amazon.
I conducted some basic data cleaning that I won’t go into detail about now, but you can read my
post about EDA here if you want some tips.
In order to make the dataset more manageable for this example, I first dropped columns with too
many nulls and then dropped any remaining rows with null values. I changed
the number_of_reviews column type from object to integer and then created a new DataFrame
using only the rows with no more than 1 review. My new shape is 3,705 rows and 10 columns
and I renamed it reviews_df.
NOTE: If we were actually going to use this dataset for analysis or modeling or anything besides a
text preprocessing demo, I would not recommend eliminating such a large percent of the rows.
The following workflow is what I was taught to use and like using, but the steps are just general
suggestions to get you started. Usually I have to modify and/or expand depending on the text
format.
1. Remove HTML
2. Tokenization + Remove punctuation
3. Remove stop words
4. Lemmatization or Stemming
While cleaning this data I ran into a problem I had not encountered before, and learned a cool new
trick from geeksforgeeks.org to split a string from one column into multiple columns either on
spaces or specified characters.
The column I am most interested in is customer_reviews, however, upon
taking a closer look, it currently has the review title, rating, review date, customer name, and review
all in one cell separated by //.
Pandas' .str.split method can be applied to a Series. The first parameter is the repeated part of the string
you want to split on, n is the maximum number of separations, and expand=True will split up the sections
into new columns. I set the new columns equal to a new variable called reviews.
Then you can rename the new 0, 1, 2, 3, 4 columns in the original reviews_df and drop the original
messy column.
I ran the same method over the new customer_name column to split on the \n \n and then dropped
the first and last columns to leave just the actual customer name. There is a lot more we could do
here if this were a longer article! Right off the bat, I can see the names and dates could still use some
cleaning to put them in a uniform format.
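A hedged sketch of the splitting and renaming steps described above (the column names review_title, rating, review_date, customer_name and review_text are assumptions used for illustration):

# Split the combined cell on "//" into at most 5 parts (columns 0-4)
reviews = reviews_df['customer_reviews'].str.split("//", n=4, expand=True)

# Give the new columns readable names and drop the original messy column
reviews.columns = ['review_title', 'rating', 'review_date', 'customer_name', 'review_text']
reviews_df = reviews_df.join(reviews).drop(columns=['customer_reviews'])

# Split customer_name on the "\n \n" padding and keep only the middle piece (the actual name)
name_parts = reviews_df['customer_name'].str.split("\n \n", expand=True)
reviews_df['customer_name'] = name_parts[1]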
Removing HTML is a step I did not do this time, however, if data is coming from a web scrape, it is a
good idea to start with that. This is the function I would have used.
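The function itself appeared only as an image in the source; a minimal equivalent using BeautifulSoup (an assumption about the original, not a copy of it) could be:

from bs4 import BeautifulSoup

def remove_html(text):
    """Strip HTML tags and return only the visible text."""
    return BeautifulSoup(str(text), "html.parser").get_text()

# Example usage on the (hypothetical) review_text column:
# reviews_df['review_text'] = reviews_df['review_text'].apply(remove_html)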
Pretty much every step going forward includes creating a function and then applying it to a series. Be
prepared, lambda functions will very shortly be your new best friend! You could also build a function
to do all of these in one go, but I wanted to show the break down and make them easier to
customize.
Remove punctuation:
One way of doing this is by looping through the Series with list comprehension and keeping
everything that is not in string.punctuation, a list of all punctuation we imported at the beginning
with import string.
"".join will join the list of characters back together into words where there are no spaces.
If you scroll up you can see where this text previously had commas, periods, etc.
However, as you can see in the second line of output above, this method does not account for user
typos. The customer had typed "grandson,am", which then became one word, "grandsonam", once the
comma was removed. I still think this is handy to know in case you ever need it though.
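A hedged sketch of this punctuation-removal step (the review_text column name is carried over from the assumptions above):

import string

def remove_punctuation(text):
    """Keep every character that is not punctuation and rejoin into one string."""
    return "".join(ch for ch in str(text) if ch not in string.punctuation)

reviews_df['review_no_punct'] = reviews_df['review_text'].apply(lambda x: remove_punctuation(x))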
Tokenize:
This breaks up the strings into a list of words or pieces based on a specified pattern, using Regular
Expressions (RegEx). The pattern I chose to use this time (r'\w+') also removes punctuation and is a
better option for this data in particular. We can also add .lower() in the lambda function to make
everything lowercase.
As you can see in line 2: "grandson" and "am" are now separate.
Some other examples of RegEx are:
‘\w+|\$[\d\.]+|\S+’ = splits up by spaces or by periods that are not attached to a digit
‘\s+’, gaps=True = grabs everything except spaces as a token
‘[A-Z]\w+’ = only words that begin with a capital letter.
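A hedged sketch of this tokenization step with NLTK's RegexpTokenizer (again using the assumed review_text column):

from nltk.tokenize import RegexpTokenizer

regexp_tokenizer = RegexpTokenizer(r'\w+')    # keep runs of word characters; punctuation is dropped

reviews_df['tokens'] = reviews_df['review_text'].apply(
    lambda x: regexp_tokenizer.tokenize(str(x).lower()))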
Stop Word Removal:
We imported a list of the most frequently used words from the NLTK at the beginning with from
nltk.corpus import stopwords. You can run stopwords.words('english') (or substitute another supported
language) to get the full list for that language. There are 179 English stop words, including 'i', 'me', 'my', 'myself', 'we', 'you', 'he', 'his',
for example. We usually want to remove these because they have low predictive power. There are
occasions when you may want to keep them though, such as when your corpus is very small and
removing stop words would decrease the total number of words by a large percentage.
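A hedged sketch of stop word removal, applied to the tokens column created in the previous step:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))   # the 179 common English words mentioned above

reviews_df['tokens_no_stop'] = reviews_df['tokens'].apply(
    lambda words: [w for w in words if w not in stop_words])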
Stemming & Lemmatizing:
Both tools shorten words back to their root form. Stemming is a little more aggressive: it cuts off
prefixes and/or endings of words based on common ones. It can sometimes be helpful, but not
always, because often the new word is cut back so far that it loses its actual meaning.
Lemmatizing, on the other hand, maps common words onto one base form. Unlike stemming, it
always returns a proper word that can be found in the dictionary. I like to compare the two to
see which one works better for what I need. I usually prefer the lemmatizer, but surprisingly, this time,
stemming seemed to have more of an effect.
Lemmatizer: you can barely even see a difference.
You see more of a difference with the stemmer, so I will keep that one in place. Since this is the final
step, I added " ".join() to the function to join the lists of words back together.
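A hedged sketch comparing the two, applied to the stop-word-filtered tokens from the previous step; " ".join() rejoins the words as described above:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem_words(words):
    """Cut each token back to its stem and rejoin into a single string."""
    return " ".join(stemmer.stem(w) for w in words)

def lemmatize_words(words):
    """Map each token to its dictionary base form and rejoin."""
    return " ".join(lemmatizer.lemmatize(w) for w in words)

reviews_df['stemmed'] = reviews_df['tokens_no_stop'].apply(stem_words)
reviews_df['lemmatized'] = reviews_df['tokens_no_stop'].apply(lemmatize_words)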
Now your text is ready to be analyzed! You could go on to use this data for sentiment analysis, or
use the ratings or manufacturer columns as a target variable based on word correlations. Maybe build
a recommender system based on user purchases or item reviews, or do customer segmentation with
clustering. The possibilities are endless!
CONCLUSION: Thus pre-processing of text (Tokenization, Filtration, Script Validation, Stop Word
Removal, Stemming) is performed.
EXPT. NO. 03
Aim: Morphological Analysis
Theory: In linguistics, morphology is the study of words, how they are formed, and their
relationship to other words in the same language. It analyzes the structure of words and parts
of words, such as stems, root words, prefixes, and suffixes.
Morphological Analysis
Polyglot offers trained Morfessor models to generate morphemes from words. The goal of the
Morpho project is to develop unsupervised data-driven methods that discover the regularities
behind word forming in natural languages. In particular, the Morpho project focuses on the
discovery of morphemes, which are the primitive units of syntax, the smallest individually
meaningful elements in the utterances of a language. Morphemes are important in automatic
generation and recognition of a language, especially in languages in which words may have many
different inflected forms.
Morphology
Morphology is the part of linguistics that deals with the study of words, their internal structure and,
partially, their meanings. It refers to the identification of a word stem from a full word form. A
morpheme in morphology is the smallest unit that carries meaning and fulfils some grammatical
function.
Morphological analysis
Morphological Analysis is the process of providing grammatical information of a word given its suffix.
Models
There are three principal approaches to morphology, which each try to capture the distinctions
above in different ways. These are,
• Morpheme-based morphology also known as Item-and-Arrangement approach.
• Lexeme-based morphology also known as Item-and-Process approach.
• Word-based morphology also known as Word-and-Paradigm approach.
Morphological Analyzer
A morphological analyzer is a program for analyzing the morphology of an input word, it detects
morphemes of any text. Presently we are referring to two types of morph analyzers for Indian
languages: 1. Phrase level Morph Analyzer 2. Word level Morph Analyzer
Role of TDIL
Morphological analyzer is developed for some Indian languages under Machine Translation project
of TDIL.
CODING:
Languages Coverage
Using Polyglot vocabulary dictionaries, we trained Morfessor models on the 50,000 most frequent
words of each language.
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))
1. Piedmontese language
2. Lombard language
3. Gan Chinese
4. Sicilian
5. Scots
6. Kirghiz, Kyrgyz
7. Pashto, Pushto
8. Kurdish
9. Portuguese
10. Kannada
11. Korean
12. Khmer
13. Kazakh
14. Ilokano
15. Polish
16. Panjabi, Punjabi
17. Georgian
18. Chuvash
19. Alemannic
20. Czech
21. Welsh
22. Chechen
23. Catalan; Valencian
24. Northern Sami
25. Sanskrit (Saṁskṛta)
26. Slovene
27. Javanese
28. Slovak
29. Bosnian-Croatian-Serbian
30. Bavarian
31. Swedish
32. Swahili
33. Sundanese
34. Serbian
35. Albanian
36. Japanese
37. Western Frisian
38. French
39. Finnish
40. Upper Sorbian
41. Faroese
42. Persian
43. Sinhala, Sinhalese
44. Italian
45. Amharic
46. Aragonese
47. Volapük
48. Icelandic
49. Sakha
50. Afrikaans
51. Indonesian
52. Interlingua
53. Azerbaijani
54. Ido
55. Arabic
56. Assamese
57. Yoruba
58. Yiddish
59. Waray-Waray
60. Croatian
61. Hungarian
62. Haitian; Haitian Creole
63. Quechua
64. Armenian
65. Hebrew (modern)
66. Silesian
67. Hindi
68. Divehi; Dhivehi; Mald...
69. German
70. Danish
71. Occitan
72. Tagalog
73. Turkmen
74. Thai
75. Tajik
76. Greek, Modern
77. Telugu
78. Tamil
79. Oriya
80. Ossetian, Ossetic
81. Tatar
82. Turkish
83. Kapampangan
84. Venetian
85. Manx
86. Gujarati
87. Galician
88. Irish
89. Scottish Gaelic; Gaelic
90. Nepali
91. Cebuano
92. Zazaki
93. Walloon
94. Dutch
95. Norwegian
96. Norwegian Nynorsk
97. West Flemish
98. Chinese
99. Bosnian
100. Breton
101. Belarusian
102. Bulgarian
103. Bashkir
104. Egyptian Arabic
105. Tibetan Standard, Tib...
106. Bengali
107. Burmese
108. Romansh
109. Marathi (Marāṭhī)
110. Malay
111. Maltese
112. Russian
113. Macedonian
114. Malayalam
115. Mongolian
116. Malagasy
117. Vietnamese
118. Spanish; Castilian
119. Estonian
120. Basque
121. Bishnupriya Manipuri
122. Asturian
123. English
124. Esperanto
125. Luxembourgish, Letzeb...
126. Latin
127. Uighur, Uyghur
128. Ukrainian
129. Limburgish, Limburgan...
130. Latvian
131. Urdu
132. Lithuanian
133. Fiji Hindi
134. Uzbek
135. Romanian, Moldavian, ...
Download Necessary Models
%%bash
polyglot download morph2.en morph2.ar
[polyglot_data] Downloading package morph2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.ar is already up-to-date!
Example
from polyglot.text import Text, Word

words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
    w = Word(w, language="en")
    print("{:<20}{}".format(w, w.morphemes))

preprocessing       ['pre', 'process', 'ing']
processor           ['process', 'or']
invaluable          ['in', 'valuable']
thankful            ['thank', 'ful']
crossed             ['cross', 'ed']
If the text is not tokenized properly, morphological analysis could offer a smart way of splitting
the text into its original units. Here is an example:
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"
text.morphemes
WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en morph | tail -n 30
which           which
India           In_dia
beat            beat
Bermuda         Ber_mud_a
in              in
Port            Port
of              of
Spain           Spa_in
in              in
2007            2007
,               ,
which           which
was             wa_s
equalled        equal_led
five            five
days            day_s
ago             ago
by              by
South           South
Africa          Africa
in              in
their           t_heir
victory         victor_y
over            over
West            West
Indies          In_dies
in              in
Sydney          Syd_ney
.               .
Demo
This demo does not reflect the models supplied by Polyglot; however, we think it is indicative of
what you should expect from Morfessor.
This is an interface to the implementation described in the "Morfessor 2.0: Python
Implementation and Extensions for Morfessor Baseline" technical report.
@InProceedings{morfessor2,
  title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
  author = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},
  year = {2013},
  publisher = {Department of Signal Processing and Acoustics, Aalto University},
  booktitle = {Aalto University publication series}
}
Reference tools:
Morphological Analyzer (Hindi - हिंदी), CFILT, IIT Bombay: www.cfilt.iitb.ac.in/~ankitb
Morphological Analyzer, TDIL-DC

OUTPUT: Morphological Analyzer Tool (Beta)
Input Word to be Analyzed: साथ
Morphological Analysis:
Queried Word: साथ
------------------ Set of Roots and Features ------------------
Token: साथ, Total Output: 3
[ Root : साथ, Class : , Category : nst, Suffix : Null ]
[ Gender : , Number : , Person : , Case : , Tense : , Aspect : , Mood : ]
[ Root : साथ, Class : , Category : adverb, Suffix : Null ]
[ Gender : , Number : , Person : , Case : , Tense : , Aspect : , Mood : ]
[ Root : साथ, Class : A, Category : noun, Suffix : Null ]
[ Gender : +masc, Number : +-pl, Person : x, Case : -oblique, Tense : x, Aspect : x, Mood : x ]
Instructions for usage:
1. Type or paste a Hindi word in the text area under "Input Word" (to analyze words in a batch, use the batch option of the tool).
2. Click on Submit and wait.
3. View the results in the area under "Morph Output".
TRANSLITERATION ENGINE (for easy typing): type in Hindi (press Ctrl+g to toggle between English and Hindi).
Conclusion: Thus morphological analysis has been done.
Expt.No.4
Aim: N-gram model
Theory:
Introduction
Statistical language models are, in essence, models that assign probabilities to sequences of words. In
this article, we'll understand the simplest model that assigns probabilities to sentences and sequences
of words: the n-gram.
You can think of an N-gram as a sequence of N words. By that notion, a 2-gram (or bigram) is a
two-word sequence of words like "please turn", "turn your", or "your homework", and a 3-gram (or
trigram) is a three-word sequence of words like "please turn your" or "turn your homework".
Intuitive Formulation
Let's start with the quantity P(w|h), the probability of a word w given some history h. For example:
w = the
h = its water is so transparent that
One way to estimate this probability is through the relative frequency count approach: take a
substantially large corpus, count the number of times you see "its water is so transparent that", and
then count the number of times it is followed by "the". In other words, you are answering the
question: out of the times you saw the history h, how many times did the word w follow it?
Now, you can imagine it is not feasible to perform this over an entire corpus, especially if it is of a
significant size.
This shortcoming, and ways to decompose the probability function using the chain rule, serve as the
base intuition of the N-gram model. Here, instead of computing the probability using the entire
corpus, you approximate it by using just a few preceding (history) words.
The Bigram Model
As the name suggests, the bigram model approximates the probability of a word given all the
previous words by using only the conditional probability of the one preceding word. In other words, you
approximate it with the probability P(the | that).
And so, when you use a bigram model to predict the conditional probability of the next word, you
are making the following approximation:
P(w_n | w_1, w_2, ..., w_{n-1}) ≈ P(w_n | w_{n-1})
This assumption that the probability of a word depends only on the previous word is also known
as the Markov assumption.
Markov models are the class of probabilistic models that assume we can predict the probability
of some future unit without looking too far into the past.
You can further generalize the bigram model to the trigram model, which looks two words into the
past, and this can in turn be generalized to the N-gram model.
Probability Estimation
Now that we understand the underlying basis of the N-gram model, you might ask how we can estimate
the probability function. One of the most straightforward and intuitive ways to do so is Maximum
Likelihood Estimation (MLE).
For example, to compute a particular bigram probability of a word y given a previous word x, you
determine the count of the bigram C(xy) and normalize it by the sum of all the bigrams that
share the same first word x, which is just the count C(x):
P(y | x) = C(xy) / C(x)
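A minimal sketch of this MLE computation in Python (toy corpus and plain dictionaries, chosen here for illustration; whitespace tokenization is assumed):

from collections import defaultdict

corpus = [
    "its water is so transparent that the water is clear",
    "the water is so transparent that you can see the bottom",
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

for sentence in corpus:
    tokens = sentence.lower().split()
    for w in tokens:
        unigram_counts[w] += 1
    for x, y in zip(tokens, tokens[1:]):
        bigram_counts[(x, y)] += 1

def bigram_mle(x, y):
    """P(y | x) = C(xy) / C(x), the maximum likelihood estimate."""
    if unigram_counts[x] == 0:
        return 0.0
    return bigram_counts[(x, y)] / unigram_counts[x]

print(bigram_mle("that", "the"))   # relative frequency of 'the' after 'that' in the toy corpus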
Challenges
There are, of course, challenges, as with every modeling approach, and estimation method. Let’s
look at the key ones affecting the N-gram model, as well as the use of MLE
Sensitivity to the training corpus
The N-gram model, like many statistical models, is significantly dependent on the training corpus. As
a result, the probabilities often encode particular facts about a given training corpus. Besides, the
performance of the N-gram model varies with the change in the value of N.
Moreover, you may have a language task in which you know all the words that can occur, and hence
we know the vocabulary size V in advance. The closed vocabulary assumption assumes there are no
unknown words, which is unlikely in practical scenarios.
Smoothing
A notable problem with the MLE approach is sparse data. That is, any N-gram that appeared a
sufficient number of times might have a reasonable estimate for its probability, but because any
corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it.
As a result, the N-gram matrix for any training corpus is bound to contain a substantial number of
cases of putative "zero probability N-grams".
CODING: Generating N-grams from Sentences Python
N-grams are contiguous sequences of n items in a sentence. N can be 1, 2 or any other positive
integer, although usually we do not consider very large N because those n-grams rarely appear in
many different places.
When performing machine learning tasks related to natural language processing, we usually need to
generate n-grams from input sentences. For example, in text classification tasks, in addition to using
each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to
represent our documents. This post describes several different ways to generate n-grams quickly
from input sentences in Python.
The Pure Python Way
In general, an input sentence is just a string of characters in Python. We can use built-in functions in
Python to generate n-grams quickly. Let's take the following sentence as a sample input:
s = "Natural-language processing (NLP) is an area of computer science " \
"and artificial intelligence concerned with the interactions " \
"between computers and human (natural) languages."
If we want to generate a list of bi-grams from the above sentence, the expected output would be
something like below (depending on how we want to treat the punctuation, the desired output
can be different):
[
"natural language",
"language processing",
"processing nlp",
"nlp is",
"is an",
"an area",
...
]
The following function can be used to achieve this:
import re

def generate_ngrams(s, n):
    # Convert to lowercase
    s = s.lower()
    # Replace all non-alphanumeric characters with spaces
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
    # Break the sentence into tokens, removing empty tokens
    tokens = [token for token in s.split(" ") if token != ""]
    # Use the zip function to generate the n-grams,
    # then concatenate the tokens of each n-gram and return
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]
Applying the above function to the sentence, with n=5, gives the following output:
>>> generate_ngrams(s, n=5)
['natural language processing nlp is',
'language processing nlp is an',
'processing nlp is an area',
'nlp is an area of',
'is an area of computer',
'an area of computer science',
'area of computer science and',
'of computer science and artificial',
'computer science and artificial intelligence',
'science and artificial intelligence concerned',
'and artificial intelligence concerned with',
'artificial intelligence concerned with the',
'intelligence concerned with the interactions',
'concerned with the interactions between',
'with the interactions between computers',
'the interactions between computers and',
'interactions between computers and human',
'between computers and human natural',
'computers and human natural languages']
The above function makes use of the zip function, which creates a generator that aggregates
elements from multiple lists (or iterables in general). The blocks of code and comments below offer
some more explanation of the usage:
# Sample sentence
s = "one two three four five"
tokens = s.split(" ")
# tokens = ["one", "two", "three", "four", "five"]
sequences = [tokens[i:] for i in range(3)]
# The above will generate sequences of tokens starting from different
# elements of the list of tokens.
# The parameter in the range() function controls how many sequences
# to generate.
#
# sequences = [
# ['one', 'two', 'three', 'four', 'five'],
# ['two', 'three', 'four', 'five'],
# ['three', 'four', 'five']]
bigrams = zip(*sequences)
# The zip function takes the sequences as a list of inputs (using the * operator,
# this is equivalent to zip(sequences[0], sequences[1], sequences[2]).
# Each tuple it returns will contain one element from each of the sequences.
#
# To inspect the content of bigrams, try:
# print(list(bigrams))
# which will give the following:
#
# [('one', 'two', 'three'), ('two', 'three', 'four'), ('three', 'four', 'five')]
#
# Note: even though the first sequence has 5 elements, zip will stop after returning
# 3 tuples, because the last sequence only has 3 elements. In other words, the zip
# function automatically handles the ending of the n-gram generation.
Using NLTK
Instead of using pure Python functions, we can also get help from some natural language processing
libraries such as the Natural Language Toolkit (NLTK). In particular, nltk has the ngrams function that
returns a generator of n-grams given a tokenized sentence. (See the documentation of the
function here.)
import re
from nltk.util import ngrams
s = s.lower()
s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
tokens = [token for token in s.split(" ") if token != ""]
output = list(ngrams(tokens, 5))
The above block of code will generate the same output as the function generate_ngrams() as shown
above.
Lesson Goals
Like in Output Data as HTML File, this lesson takes the frequency pairs collected in Counting
Frequencies and outputs them in HTML. This time the focus is on keywords in context (KWIC) which
creates n-grams from the original document content – in this case a trial transcript from the Old
Bailey Online. You can use your program to select a keyword and the computer will output all
instances of that keyword, along with the words to the left and right of it, making it easy to see at a
glance how the keyword is used.
Once the KWICs have been created, they are then wrapped in HTML and sent to the browser where
they can be viewed. This reinforces what was learned in Output Data as HTML File, opting for a
slightly different output.
At the end of this lesson, you will be able to extract all possible n-grams from the text. In the next
lesson, you will learn how to output all of the n-grams of a given keyword in a document
downloaded from the Internet, and display them clearly in your browser window.
Files Needed For This Lesson
• obo.py
If you do not have these files from the previous lesson, you can download programming-historian-7,
a zip file from the previous lesson
From Text to N-Grams to KWIC
Now that you know how to harvest the textual content of a web page automatically with Python,
and have begun to use strings, lists and dictionaries for text processing, there are many other things
that you can do with the text besides counting frequencies. People who study the statistical
properties of language have found that studying linear sequences of linguistic units can tell us a lot
about a text. These linear sequences are known as bigrams (2 units), trigrams (3 units), or more
generally as n-grams.
You have probably seen n-grams many times before. They are commonly used on search results
pages to give you a preview of where your keyword appears in a document and what the
surrounding context of the keyword is. This application of n-grams is known as keywords in context
(often abbreviated as KWIC). For example, if the string in question were “it was the best of times it
was the worst of times it was the age of wisdom it was the age of foolishness” then a 7-gram for the
keyword “wisdom” would be:
the age of wisdom it was the
An n-gram could contain any type of linguistic unit you like. For historians you are most likely to use
characters as in the bigram “qu” or words as in the trigram “the dog barked”; however, you could
also use phonemes, syllables, or any number of other units depending on your research question.
What we’re going to do next is develop the ability to display KWIC for any keyword in a body of text,
showing it in the context of a fixed number of words on either side. As before, we will wrap the
output so that it can be viewed in Firefox and added easily to Zotero.
From Text to N-grams
Since we want to work with words as opposed to characters or phonemes, it will be much easier to
create n-grams using a list of words rather than strings. As you already know, Python can easily turn
a string into a list using the split operation. Once split it becomes simple to retrieve a subsequence of
adjacent words in the list by using a slice, represented as two indexes separated by a colon. This was
introduced when working with strings in Manipulating Strings in Python.
message9 = "Hello World"
message9a = message9[1:8]
print(message9a)
-> ello Wo
However, we can also use this technique to take a predetermined number of neighbouring words
from the list with very little effort. Study the following examples, which you can try out in a Python
Shell.
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
wordlist = wordstring.split()
print(wordlist[0:4])
-> ['it', 'was', 'the', 'best']
print(wordlist[0:6])
-> ['it', 'was', 'the', 'best', 'of', 'times']
print(wordlist[6:10])
-> ['it', 'was', 'the', 'worst']
print(wordlist[0:12])
-> ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']
print(wordlist[:12])
-> ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']
print(wordlist[12:])
-> ['it', 'was', 'the', 'age', 'of', 'wisdom', 'it', 'was', 'the', 'age', 'of', 'foolishness']
In these examples we have used the slice method to return parts of our list. Note that there are two
sides to the colon in a slice. If the right of the colon is left blank as in the last example above, the
program knows to automatically continue to the end – in this case, to the end of the list. The second
last example above shows that we can start at the beginning by leaving the space before the colon
empty. This is a handy shortcut available to keep your code shorter.
You can also use variables to represent the index positions. Used in conjunction with a for loop, you
could easily create every possible n-gram of your list. The following example returns all 5-grams of
our string from the example above.
i = 0
for items in wordlist:
    print(wordlist[i: i+5])
    i += 1
Keeping with our modular approach, we will create a function and save it to the obo.py module that
can create n-grams for us. Study and type or copy the following code:
# Given a list of words and a number n, return a list
# of n-grams.
def getNGrams(wordlist, n):
    return [wordlist[i:i+n] for i in range(len(wordlist)-(n-1))]
This function may look a little confusing as there is a lot going on here in not very much code. It uses
a list comprehension to keep the code compact. The following example does exactly the same thing:
def getNGrams(wordlist, n):
    ngrams = []
    for i in range(len(wordlist)-(n-1)):
        ngrams.append(wordlist[i:i+n])
    return ngrams
Use whichever makes most sense to you.
A concept that may still be confusing to you is the two function arguments. Notice that our
function has two variable names in the parentheses after its name when we declared it: wordlist, n.
These two variables are the function arguments. When you call (run) this function, these variables
will be used by the function for its solution. Without these arguments there is not enough
information to do the calculations. In this case, the two pieces of information are the list of words
you want to turn into n-grams (wordlist), and the number of words you want in each n-gram (n). For
the function to work it needs both, so you call it like this (save the following
as useGetNGrams.py and run):
#useGetNGrams.py
import obo
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
allMyWords = wordstring.split()
print(obo.getNGrams(allMyWords, 5))
Notice that the arguments you enter do not have to have the same names as the arguments named
in the function declaration. Python knows to use allMyWords everywhere in the function
that wordlist appears, since this is given as the first argument. Likewise, all instances of n will be
replaced by the integer 5 in this case. Try changing the 5 to a string, such as “elephants” and see
what happens when you run your program. Note that because n is being used as an integer, you
have to ensure the argument sent is also an integer. The same is true for strings, floats or any other
variable type sent as an argument.
You can also use a Python shell to play around with the code to get a better understanding of how it
works. Paste the function declaration for getNGrams (either of the two functions above) into your
Python shell.
test1 = 'here are four words'
test2 = 'this test sentence has eight words in it'
getNGrams(test1.split(), 5)
-> []
getNGrams(test2.split(), 5)
-> [['this', 'test', 'sentence', 'has', 'eight'],
['test', 'sentence', 'has', 'eight', 'words'],
['sentence', 'has', 'eight', 'words', 'in'],
['has', 'eight', 'words', 'in', 'it']]
There are two concepts that we see in this example of which you need to be aware. Firstly, because
our function expects a list of words rather than a string, we have to convert the strings into lists
before our function can handle them. We could have done this by adding another line of code above
the function call, but instead we used the split method directly in the function argument as a bit of a
shortcut.
Secondly, why did the first example return an empty list rather than the n-grams we were after?
In test1, we have tried to ask for an n-gram that is longer than the number of words in our list. This
has resulted in a blank list. In test2 we have no such problem and get all possible 5-grams for the
longer list of words. If you wanted to you could adapt your function to print a warning message or to
return the entire string instead of an empty list.
We now have a way to extract all possible n-grams from a body of text. In the next lesson, we can
focus our attention on isolating those n-grams that are of interest to us.
Code Syncing
To follow along with future lessons it is important that you have the right files and programs in your
“programming-historian” directory. At the end of each chapter you can download the
“programming-historian” zip file to make sure you have the correct code. If you are following along
with the Mac / Linux version you may have to open the obo.py file and change
“file:///Users/username/Desktop/programming-historian/” to the path to the directory on your own
computer.
• python-lessons8.py (zip sync)
TOOLS USED:
Ngram Analyzer - A Programmer's Guide to Data Mining
Conclusion
The N-gram model is one of the most widely used statistical language models, since it captures the
context between the N words in a sequence. In this experiment, the theory behind the N-gram model
was studied, and word N-grams were generated from sentences using pure Python and NLTK.
Expt.No.5
Aim: POS tagging
Theory: POS Tagging - Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical Markov model in which the system being
modeled is assumed to be a Markov process with unobserved (hidden) states. In a regular
Markov model (see http://en.wikipedia.org/wiki/Markov_model), the state
is directly visible to the observer, and therefore the state transition probabilities are the only
parameters. In a hidden Markov model, the state is not directly visible, but the output, which depends
on the state, is visible.
A Hidden Markov Model has two important components:
1) Transition probabilities: the one-step transition probability is the probability of transitioning
from one state to another in a single step.
2) Emission probabilities: the output probabilities for an observation from a state. Emission
probabilities B = { bi,k = bi(ok) = P(ok | qi) }, where ok is an observation. Informally, B is the
probability that the output is ok given that the current state is qi.
For POS tagging, it is assumed that POS tags are generated as a random process, and each process
randomly generates a word. Hence, the transition matrix denotes the transition probability from
one POS to another and the emission matrix denotes the probability that a given word can have
a particular POS. Words act as the observations. Some of the basic assumptions are:
1. First-order (bigram) Markov assumptions:
   a. Limited horizon: a tag depends only on the previous tag
      P(ti+1 = tk | t1=tj1, ..., ti=tji) = P(ti+1 = tk | ti = tj)
   b. Time invariance: no change over time
      P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj -> tk)
2. Output probabilities:
   The probability of getting word wk for tag tj, P(wk | tj), is independent of other tags or words.
Calculating the Probabilities
Consider the given toy corpus:
EOS/eos They/pronoun cut/verb the/determiner paper/noun
EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun
EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun
EOS/eos
Calculating the Emission Probability Matrix
Count the number of times a specific word occurs with a specific POS tag in the corpus.
Here, say for "cut":
count(cut,verb)=1
count(cut,noun)=2
count(cut,determiner)=0
... and so on, zero for the other tags too.
count(cut) = total count of cut = 3
Now, calculating the probability:
Probability to be filled in the matrix cell at the intersection of cut and verb:
P(cut/verb) = count(cut,verb)/count(cut) = 1/3 = 0.33
Similarly, probability to be filled in the cell at the intersection of cut and determiner:
P(cut/determiner) = count(cut,determiner)/count(cut) = 0/3 = 0
Repeat the same for all word-tag combinations and fill the emission matrix.
Calculating the Transition Probability Matrix
Count the number of times a specific tag comes after other POS tags in the corpus.
Here, say for "determiner":
count(verb,determiner)=2
count(preposition,determiner)=1
count(determiner,determiner)=0
count(eos,determiner)=0
count(noun,determiner)=0
... and so on, zero for the other tags too.
count(determiner) = total count of tag 'determiner' = 3
Now, calculating the probability:
Probability to be filled in the cell at the intersection of determiner (in the column) and verb (in
the row):
P(determiner/verb) = count(verb,determiner)/count(determiner) = 2/3 = 0.66
Similarly, probability to be filled in the cell at the intersection of determiner (in the column) and noun (in
the row):
P(determiner/noun) = count(noun,determiner)/count(determiner) = 0/3 = 0
Repeat the same for all the tags.
Note: EOS/eos is a special marker which represents End Of Sentence.
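A minimal Python sketch of these counts for the toy corpus above (it follows the counting scheme described in the text; it is not the virtual-lab implementation):

from collections import defaultdict

# (word, tag) pairs of the toy corpus, including the EOS markers
corpus = [
    ("EOS", "eos"), ("They", "pronoun"), ("cut", "verb"), ("the", "determiner"),
    ("paper", "noun"), ("EOS", "eos"), ("He", "pronoun"), ("asked", "verb"),
    ("for", "preposition"), ("his", "pronoun"), ("cut", "noun"), ("EOS", "eos"),
    ("Put", "verb"), ("the", "determiner"), ("paper", "noun"), ("in", "preposition"),
    ("the", "determiner"), ("cut", "noun"), ("EOS", "eos"),
]

word_tag = defaultdict(int)    # count(word, tag)
word_count = defaultdict(int)  # count(word)
tag_pair = defaultdict(int)    # count(previous tag, tag)
tag_count = defaultdict(int)   # count(tag)

for i, (word, tag) in enumerate(corpus):
    w = word.lower()
    word_tag[(w, tag)] += 1
    word_count[w] += 1
    tag_count[tag] += 1
    if i > 0:
        tag_pair[(corpus[i - 1][1], tag)] += 1

# Emission probability as defined above: P(cut/verb) = count(cut, verb) / count(cut)
print(word_tag[("cut", "verb")] / word_count["cut"])                 # 1/3

# Transition probability as defined above:
# P(determiner/verb) = count(verb, determiner) / count(determiner)
print(tag_pair[("verb", "determiner")] / tag_count["determiner"])    # 2/3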
Procedure:
STEP1: Select the corpus.
STEP2: For the given corpus fill the emission and transition matrix. Answers are rounded to 2
decimal digits.
STEP3: Press Check to check your answer.
Wrong answers are indicated by the red cell.
Execution and outputs:
POS Tagging - Hidden Markov Model
---Select Corpus---
EOS/eos Book/verb a/determiner car/noun EOS/eos Park/verb the/determiner car/noun
EOS/eos The/determiner book/noun is/verb in/preposition the/determiner car/noun
EOS/eos The/determiner car/noun is/verb in/preposition a/determiner park/noun
EOS/eos

Emission Matrix (template; all cells initially 0, to be filled as per the procedure)

              book  park  car  is  in  a  the
determiner     0     0    0    0   0  0   0
noun           0     0    0    0   0  0   0
verb           0     0    0    0   0  0   0
preposition    0     0    0    0   0  0   0

Transition Matrix (template; all cells initially 0, to be filled as per the procedure)

              eos  determiner  noun  verb  preposition
eos            0       0        0     0       0
determiner     0       0        0     0       0
noun           0       0        0     0       0
verb           0       0        0     0       0
preposition    0       0        0     0       0

Check
Result: Thus POS (part-of-speech) tagging using the Hidden Markov Model is studied and the results are obtained.
Expt.No.6
Aim: Chunking
Theory:
Chunking
Chunking of text involves dividing a text into syntactically correlated groups of words.
Eg: He ate an apple to satiate his hunger.
[NP He ] [VP ate ] [NP an apple] [VP to satiate] [NP his hunger]
Eg: दरवाज़ा खुल गया
[NP दरवाज़ा] [VP खुल गया]
Chunk Types
The chunk types are based on the syntactic category of the chunk head. Besides the head, a chunk also
contains modifiers (like determiners, adjectives, and postpositions in NPs).
The basic types of chunks in English are:

     Chunk Type      Tag Name
1.   Noun            NP
2.   Verb            VP
3.   Adverb          ADVP
4.   Adjectival      ADJP
5.   Prepositional   PP
The basic Chunk Tag Set for Indian Languages:

Sl. No   Chunk Type               Tag Name
1        Noun Chunk               NP
2.1      Finite Verb Chunk        VGF
2.2      Non-finite Verb Chunk    VGNF
2.3      Verb Chunk (Gerund)      VGNN
3        Adjectival Chunk         JJP
4        Adverb Chunk             RBP

NP Noun Chunks
Noun chunks will be given the tag NP and include non-recursive noun phrases and postpositions
for Indian languages and prepositions for English. Determiners, adjectives and other modifiers
will be part of the noun chunk.
Eg:
(इस/DEM किताब/NN में/PSP)NP
'this' 'book' 'in'
((in/IN the/DT big/ADJ room/NN))NP
Verb Chunks
The verb chunks are marked as VP for English; however, they would be of several types for
Indian languages. A verb group will include the main verb and its auxiliaries, if any.
For English:
I (will/MD be/VB loved/VBD)VP
The types of verb chunks and their tags are described below.
1. VGF Finite Verb Chunk
The auxiliaries in the verb group mark the finiteness of the verb at the chunk level. Thus, any
verb group which is finite will be tagged as VGF. For example,
Eg: मैंने घर पर (खाया/VM)VGF
'I erg' 'home' 'at' 'meal' 'ate'
2. VGNF Non-finite Verb Chunk
A non-finite verb chunk will be tagged as VGNF.
Eg: सेब (खाता/VM हुआ/VAUX)VGNF लड़का जा रहा है
'apple' 'eating' 'PROG' 'boy' 'go' 'PROG' 'is'
3. VGNN Gerunds
A verb chunk having a gerund will be annotated as VGNN.
Eg: शराब (पीना/VM)VGNN सेहत के लिए हानिकारक है
'liquor' 'drinking' 'health' 'for' 'harmful' 'is'
JJP/ADJP Adjectival Chunk
An adjectival chunk will be tagged as ADJP for English and
JJP for Indian languages. This chunk will consist of all
adjectival chunks including the predicative adjectives.
Eg:
वह लड़की है (सुन्दर/JJ)JJP
The fruit is (ripe/JJ)ADJP
Note: Adjectives appearing before a noun will be grouped
together within the noun chunk.
RBP/ADVP Adverb Chunk
This chunk will include all pure adverbial phrases.
Eg:
वह (धीरे-धीरे/RB)RBP चल रहा था
'he' 'slowly' 'walk' 'PROG' 'was'
He walks (slowly/ADV)ADVP
PP Prepositional Chunk
This chunk type is present only for English and not for Indian languages. It consists
of only the preposition and not the NP argument.
Eg:
(with/IN)PP a pen
IOB prefixes
Each chunk has an open boundary and a close boundary that delimit the word groups as a
minimal non-recursive unit. This can be formally expressed by using IOB prefixes:
B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk.
Here is an example of the file format:
Tokens    POS    Chunk-Tags
He        PRP    B-NP
ate       VBD    B-VP
an        DT     B-NP
apple     NN     I-NP
to        TO     B-VP
satiate   VB     I-VP
his       PRP$   B-NP
hunger    NN     I-NP
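As a code illustration of chunking (a hedged sketch using NLTK's RegexpParser with a simple noun-phrase grammar written for this example; it is not the virtual-lab tool used in the execution below):

import nltk

# Run once if the resources are missing:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "He ate an apple to satiate his hunger"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Toy grammar: a noun phrase is an optional determiner or possessive pronoun,
# any number of adjectives, and then a noun or pronoun
grammar = r"NP: {<DT|PRP\$>?<JJ>*<NN.*|PRP>}"
parser = nltk.RegexpParser(grammar)
print(parser.parse(tagged))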
Execution and outputs: Chunking

Lexicon   POS    Chunk
John      NNP
gave      VBD
Mary      NNP
a         DT
book      NN

Submit
Conclusion: Thus chunking of English text is studied.
Expt.No.7
Aim: Named Entity Recognition
Theory:
Named entity recognition (NER) is probably the first step towards information extraction; it seeks
to locate and classify named entities in text into pre-defined categories such as the names of
persons, organizations, locations, expressions of time, quantities, monetary values, percentages,
etc. NER is used in many fields of Natural Language Processing (NLP), and it can help answer many
real-world questions, such as:
• Which companies were mentioned in the news article?
• Were specified products mentioned in complaints or reviews?
• Does the tweet contain the name of a person? Does the tweet contain this person's location?
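Before the web-scraping example below, a minimal NER sketch with spaCy (assuming the small English pipeline en_core_web_sm is installed; the sample sentence is illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline that includes an NER component
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion in London.")

for ent in doc.ents:
    # Each entity carries its text span and a label such as ORG, GPE or MONEY
    print(ent.text, ent.label_)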
Execution and outputs:
from bs4 import BeautifulSoup
import requests
import re
import spacy
from collections import Counter
from spacy import displacy

nlp = spacy.load('en_core_web_sm')   # spaCy pipeline used below (model name assumed)

def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

ny_bb = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-firedfbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-columnregion&region=top-news&WT.nav=top-news')
article = nlp(ny_bb)
len(article.ents)
188
There are 188 entities in the article and they are represented as 10 unique labels:
labels = [x.label_ for x in article.ents]
Counter(labels)
The following are the three most frequent tokens:
items = [x.text for x in article.ents]
Counter(items).most_common(3)
Let’s randomly select one sentence to learn more.
sentences = [x for x in article.sents]
print(sentences[20])
Let’s run displacy.render to generate the raw markup.
displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')
Next, let's see how the sentence and its dependencies look:
displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})
Next, we extract the part-of-speech tags and lemmatize this sentence.
[(x.orth_, x.pos_, x.lemma_) for x in
    [y for y in nlp(str(sentences[20])) if not y.is_stop and y.pos_ != 'PUNCT']]
dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])
The extracted named entities are correct, except "F.B.I".
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[20]])
Finally, we visualize the entities of the entire article.
Conclusion: Thus Named Entity Recognition is done.
Expt.No.8.
Aim: Case Study/ Mini Project based on Application mentioned in Module 6.
Theory: The mini project report should contain:
An abstract of the project (half to one page)
Aim, objectives, methodology (proposed and existing), implementation plan, results and outputs.