
Lecture 4 - Bag of Words

Bag of Words
• Bag-of-Words (BoW) is a statistical language model used to analyze text and documents based on word counts. The model does not account for word order within a document. It is a simple way to convert words to a numerical representation in natural language processing, and a simple document embedding technique based on word frequency.
• The Bag of Words model is a common technique in natural language processing (NLP) that represents a document as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency.
Workflow of Bag of Words
Tokenization:
• Break down a text into individual words (tokens).
Building a Vocabulary:
• Create a unique set of all words (the vocabulary) in the entire corpus.
Vectorization:
• Represent each document as a vector, where each element corresponds to the frequency of a word from the vocabulary in the document.
Tokenization
The process of breaking down a text into individual words or tokens.
Example:
“The quick brown fox jumps over the lazy dog” becomes:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
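A minimal sketch of this step in Python; the regex split below is an assumption (real tokenizers such as NLTK's or spaCy's handle punctuation and contractions more carefully):

```python
import re

def tokenize(text):
    # Split on runs of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)

print(tokenize("The quick brown fox jumps over the lazy dog"))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```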
Building a Vocabulary
Creating a unique set of all words present in the entire corpus.
Example:
If the corpus has three documents with the words {apple, banana, orange}, {banana, grapefruit}, and {apple, orange, grapefruit}, the vocabulary is {apple, banana, orange, grapefruit}.
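A sketch of vocabulary construction for the example corpus; sorting the set is an assumption made here to fix a deterministic word order:

```python
docs = [["apple", "banana", "orange"],
        ["banana", "grapefruit"],
        ["apple", "orange", "grapefruit"]]

# Union of all tokens across the corpus.
vocabulary = sorted(set(word for doc in docs for word in doc))
print(vocabulary)  # ['apple', 'banana', 'grapefruit', 'orange']
```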
Vectorization
Representing each document as a vector, with each element corresponding to the frequency of a word from the vocabulary in the document.
Example:
For the document “The quick brown fox jumps over the lazy dog” and vocabulary {quick, brown, jumps, lazy}, the vector is [1, 1, 1, 1], since each vocabulary word occurs exactly once in the document.
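Putting the three steps together, a minimal end-to-end sketch (a plain whitespace split stands in for the tokenizer here):

```python
from collections import Counter

def vectorize(tokens, vocabulary):
    # Count token frequencies, then read off counts in vocabulary order.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tokens = "The quick brown fox jumps over the lazy dog".lower().split()
vocabulary = ["quick", "brown", "jumps", "lazy"]
print(vectorize(tokens, vocabulary))  # [1, 1, 1, 1]
```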
Applications
• Text Classification: Sentiment analysis, spam detection, topic categorization.
• Information Retrieval: Document matching and retrieval based on content.
• Machine Learning Models: Input representation for models like Naive Bayes and SVM.
Advantages and Limitations
• Advantages
• Simple and easy to implement.
• Captures the basic information about the document.
• Works well with large datasets.
• Suitable for various NLP tasks such as text classification and sentiment analysis.
• Limitations
• Ignores word order and semantics.
• Treats all words equally, disregarding semantic meaning.
• Doesn't consider relationships between words.
• Doesn't consider the context of words.
Understanding Bag of Words with an example
Example (1) without preprocessing:
• Sentence 1: “Welcome to Great Learning, Now start learning”
• Sentence 2: “Learning is a good practice”
Vocabulary List:
• Welcome
• To
• Great
• Learning
• ,
• Now
• start
• learning
• is
• a
• good
• practice
The scoring of sentence 1 would look as follows:
Welcome: 1, To: 1, Great: 1, Learning: 1, ",": 1, Now: 1, start: 1, learning: 1, is: 0, a: 0, good: 0, practice: 0
Writing the above frequencies in the vector:
Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
The scoring of sentence 2 would look as follows:
Welcome: 0, To: 0, Great: 0, Learning: 0, ",": 0, Now: 0, start: 0, learning: 1, is: 1, a: 1, good: 1, practice: 1
Writing the above frequencies in the vector:
Sentence 2 ➝ [ 0,0,0,0,0,0,0,1,1,1,1,1 ]
• The above example shows why BoW usually needs preprocessing: the words “Learning” and “learning”, although they have the same meaning, are counted as two separate vocabulary entries, and the comma “,”, which conveys no information, is also included in the vocabulary.
Understanding Bag of Words with an example
Example (2) with preprocessing:
• Sentence 1: “Welcome to Great Learning, Now start learning”
• Sentence 2: “Learning is a good practice”
Step 1: Convert the above sentences to lower case, as the case of a word does not hold any information.
Step 2: Remove special characters and stopwords from the text. Stopwords are words that do not carry much information about the text, such as ‘is’, ‘a’, ‘the’, and many more.
After applying the above steps, the sentences become:
Sentence 1: “welcome great learning now start learning”
Sentence 2: “learning good practice”
Vocabulary List:
• welcome
• great
• learning
• now
• start
• good
• practice
The scoring of sentence 1 would look as follows:
welcome: 1, great: 1, learning: 2, now: 1, start: 1, good: 0, practice: 0
Writing the above frequencies in the vector:
Sentence 1 ➝ [ 1,1,2,1,1,0,0 ]
The scoring of sentence 2 would look as follows:
welcome: 0, great: 0, learning: 1, now: 0, start: 0, good: 1, practice: 1
Writing the above frequencies in the vector:
Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]
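A sketch of the preprocessed pipeline; the tiny stopword set below is hand-picked to match the example, whereas a real pipeline would use a standard list (e.g. NLTK's):

```python
import re
from collections import Counter

STOPWORDS = {"to", "is", "a"}  # minimal set, chosen to match the example

def preprocess(text):
    # Lowercase, strip special characters, drop stopwords.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

sentences = ["Welcome to Great Learning, Now start learning",
             "Learning is a good practice"]
docs = [preprocess(s) for s in sentences]

# Vocabulary in first-seen order: welcome, great, learning, now, start, good, practice
vocabulary = list(dict.fromkeys(t for doc in docs for t in doc))

for doc in docs:
    counts = Counter(doc)
    print([counts[w] for w in vocabulary])
# [1, 1, 2, 1, 1, 0, 0]
# [0, 0, 1, 0, 0, 1, 1]
```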
Document Similarity
• The similarity between two documents can be quantified by converting the words or phrases within each document or sentence into a vectorised representation.
• The vector representations of the documents can then be used in the cosine similarity formula to obtain a quantification of similarity.
• A cosine similarity of 1 implies that the two documents are exactly alike, while a cosine similarity of 0 indicates that there are no similarities between the two documents.
Example of Document Vectors
• Doc 1 = “information retrieval”
• Doc 2 = “computer information retrieval”
• Doc 3 = “computer retrieval”
• Vocabulary: information, retrieval, computer
• Doc 1 = <1, 1, 0>
• Doc 2 = <1, 1, 1>
• Doc 3 = <0, 1, 1>
Question: Doc 4 = “retrieval information retrieval” ?
The document-term matrix D (columns: information, retrieval, computer):
D = | 1 1 0 |
    | 1 1 1 |
    | 0 1 1 |
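A sketch that rebuilds the matrix D row by row; applying the same procedure to Doc 4 from the question gives <1, 2, 0>:

```python
from collections import Counter

vocabulary = ["information", "retrieval", "computer"]
docs = ["information retrieval",
        "computer information retrieval",
        "computer retrieval",
        "retrieval information retrieval"]  # Doc 4 from the question

for doc in docs:
    counts = Counter(doc.split())
    print([counts[w] for w in vocabulary])
# [1, 1, 0]
# [1, 1, 1]
# [0, 1, 1]
# [1, 2, 0]
```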
Documents in a Vector Space
• Doc 1 = “information retrieval”
• Doc 2 = “computer information retrieval”
• Doc 3 = “computer retrieval”
• Vocabulary: information, retrieval, computer
• Doc 1 = <1, 1, 0>
[Figure: Doc 1 plotted in a 3-D vector space with axes Term 1: information, Term 2: retrieval, Term 3: computer]
Relevance as Vector Similarities
• Doc 1 = “information retrieval”
• Doc 2 = “computer information retrieval”
• Doc 3 = “computer retrieval”
Which document is closer to Doc 1? Doc 2 or Doc 3?
What if we have a query “retrieval”?
[Figure: Doc 1, Doc 2, and Doc 3 plotted in the same 3-D vector space with axes Term 1: information, Term 2: retrieval, Term 3: computer]
Document Similarity
• Used in information retrieval to determine which document (d1 or d2) is more similar to a given query q.
• Documents and queries are represented in the same space.
• Angle (or cosine) is a proxy for similarity between two vectors.
Distance/Similarity Calculation
• The similarity/relevance of two vectors can be calculated based on distance/similarity measures
• S: (X, Y) → [0, 1]
• X: <x1, x2, …, xn>
• Y: <y1, y2, …, yn>
• S(X, Y) = ?
• The more dimensions in common, the larger the similarity
• What about real values?
• Normalization needed.
Similarity Measures
• The Jaccard similarity (Similarity of Two Sets)
Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|
• D1 = “information retrieval class”
• D2 = “information retrieval algorithm”
• D3 = “processing information”
• What's the Jaccard similarity S(D1, D2)? S(D1, D3)?
• What about D3 = “information of information retrieval”
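A sketch of the set-based computation, which also answers the questions above; note that under set semantics the repeated “information” in the last variant of D3 counts only once:

```python
def jaccard(x, y):
    # Similarity of two sets: |X ∩ Y| / |X ∪ Y|
    x, y = set(x.split()), set(y.split())
    return len(x & y) / len(x | y)

d1 = "information retrieval class"
d2 = "information retrieval algorithm"
d3 = "processing information"

print(jaccard(d1, d2))  # 2/4 = 0.5
print(jaccard(d1, d3))  # 1/4 = 0.25
print(jaccard(d1, "information of information retrieval"))  # 2/4 = 0.5
```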
Similarity Measures
• Euclidean Distance – distance of two points
D(X, Y) = √[(x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²] = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )

[Figure: points X, Y, Z in space, with Euclidean distance as the straight-line distance between points]
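A sketch of the same computation in plain Python (the standard library's math.dist does the equivalent):

```python
import math

def euclidean(x, y):
    # Straight-line distance between points x and y.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean([1, 1, 0], [1, 1, 1]))  # Doc 1 vs Doc 2 -> 1.0
print(euclidean([1, 1, 0], [0, 1, 1]))  # Doc 1 vs Doc 3 -> ~1.414
```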
Similarity Measures (Cont.)
• Cosine similarity: similarity of two vectors, normalized
cos(X, Y) = (x₁y₁ + x₂y₂ + … + xₙyₙ) / ( √(x₁² + … + xₙ²) × √(y₁² + … + yₙ²) ) = ( Σᵢ₌₁ⁿ xᵢyᵢ ) / ( √(Σᵢ₌₁ⁿ xᵢ²) × √(Σᵢ₌₁ⁿ yᵢ²) )

[Figure: vectors X and Y with the angle between them]
Which one do you think is suitable for retrieval?
Jaccard? Euclidean? Cosine?
Example
• What is the cosine similarity between:
• D = “cat, dog, dog” = <1, 2, 0>
• Q = “cat, dog, mouse, mouse” = <1, 1, 2>
(vocabulary: cat, dog, mouse)
• Answer:
σ(D, Q) = (1×1 + 2×1 + 0×2) / ( √(1² + 2² + 0²) × √(1² + 1² + 2²) ) = 3 / ( √5 × √6 ) ≈ 0.55
• In comparison:
σ(D, D) = (1×1 + 2×2 + 0×0) / ( √(1² + 2² + 0²) × √(1² + 2² + 0²) ) = 5 / ( √5 × √5 ) = 1
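A sketch verifying the arithmetic above, with the vectors written as plain Python lists:

```python
import math

def cosine(x, y):
    # Dot product divided by the product of the vector norms.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

D = [1, 2, 0]  # "cat, dog, dog"
Q = [1, 1, 2]  # "cat, dog, mouse, mouse"

print(round(cosine(D, Q), 2))  # 0.55
print(round(cosine(D, D), 2))  # 1.0
```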