Bag of Words

• Bag-of-words (BoW) is a statistical language model used to analyze text and documents based on word counts. The model does not account for word order within a document: it is a simple document-embedding technique, based purely on word frequency, for converting words into a numerical representation in natural language processing.
• In other words, the Bag of Words model is a common technique in natural language processing (NLP) that represents a document as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency.

Workflow of Bag of Words

Tokenization
• Break a text down into individual words (tokens).
• Example: "The quick brown fox jumps over the lazy dog" becomes ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].

Building a Vocabulary
• Create the set of all unique words (the vocabulary) in the entire corpus.
• Example: if the corpus has three documents with the words {apple, banana, orange}, {banana, grapefruit}, and {apple, orange, grapefruit}, the vocabulary is {apple, banana, orange, grapefruit}.

Vectorization
• Represent each document as a vector, where each element corresponds to the frequency of one vocabulary word in the document.
• Example: for the document "The quick brown fox jumps over the lazy dog" and the vocabulary {quick, brown, jumps, lazy}, the vector is [1, 1, 1, 1], the frequency of each vocabulary word in the document.

Applications
• Text classification: sentiment analysis, spam detection, topic categorization.
• Information retrieval: document matching and retrieval based on content.
• Machine learning models: input representation for models such as Naive Bayes and SVM.

Advantages and Limitations
• Advantages
  • Simple and easy to implement.
  • Captures basic information about the document.
  • Works well with large datasets.
  • Suitable for various NLP tasks such as text classification and sentiment analysis.
• Limitations
  • Ignores word order.
  • Treats all words equally, disregarding semantic meaning.
  • Does not consider relationships between words or the context in which they appear.

Understanding Bag of Words with an Example

Example (1), without preprocessing:
• Sentence 1: "Welcome to Great Learning, Now start learning"
• Sentence 2: "Learning is a good practice"

Vocabulary list (12 tokens): Welcome, to, Great, Learning, ",", Now, start, learning, is, a, good, practice

Scoring Sentence 1 against this vocabulary and writing the frequencies as a vector:
Sentence 1 ➝ [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Scoring Sentence 2 the same way:
Sentence 2 ➝ [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]

• This was not the best use of bag of words. The words "Learning" and "learning", although they have the same meaning, are counted as two separate vocabulary entries, and the comma ",", which conveys no information, is also included in the vocabulary.
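The naive pipeline just described can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the regex tokenizer and the first-seen vocabulary ordering are choices made for this sketch.

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokenization: keep words as-is (case-sensitive) and treat
    # punctuation marks such as ',' as their own tokens, as in the slide.
    return re.findall(r"\w+|[^\w\s]", text)

sentences = [
    "Welcome to Great Learning, Now start learning",
    "Learning is a good practice",
]

# Build the vocabulary in first-seen order across the corpus.
vocab = []
for s in sentences:
    for tok in tokenize(s):
        if tok not in vocab:
            vocab.append(tok)

# Vectorize: one frequency per vocabulary token, per sentence.
for s in sentences:
    counts = Counter(tokenize(s))
    print([counts[tok] for tok in vocab])
```

Running this prints [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0] and [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]. Note how "Learning", "learning", and "," each claim their own dimension, which is exactly the problem that the preprocessing in the next example fixes.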
Understanding Bag of Words with an Example

Example (2), with preprocessing:
• Sentence 1: "Welcome to Great Learning, Now start learning"
• Sentence 2: "Learning is a good practice"

Step 1: Convert the sentences to lower case, since the case of a word holds no information here.
Step 2: Remove special characters and stopwords from the text. Stopwords are words that carry little information about the text, such as "is", "a", "the", and many more.

After applying these steps, the sentences become:
• Sentence 1: "welcome great learning now start learning"
• Sentence 2: "learning good practice"

Vocabulary list: welcome, great, learning, now, start, good, practice

Scoring Sentence 1 and writing the frequencies as a vector:
Sentence 1 ➝ [1, 1, 2, 1, 1, 0, 0]

Scoring Sentence 2 the same way:
Sentence 2 ➝ [0, 0, 1, 0, 0, 1, 1]
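The same pipeline with both preprocessing steps applied might be sketched as follows. The stopword list here is a tiny illustrative subset chosen to match the example; real pipelines use a much fuller list (e.g., NLTK's English stopwords).

```python
import re
from collections import Counter

# Tiny illustrative stopword set, just enough for this example.
STOPWORDS = {"to", "is", "a", "the"}

def preprocess(text):
    text = text.lower()                   # Step 1: lowercase
    tokens = re.findall(r"[a-z]+", text)  # Step 2a: drop special characters
    return [t for t in tokens if t not in STOPWORDS]  # Step 2b: drop stopwords

sentences = [
    "Welcome to Great Learning, Now start learning",
    "Learning is a good practice",
]

vocab = []
for s in sentences:
    for tok in preprocess(s):
        if tok not in vocab:
            vocab.append(tok)

print(vocab)  # ['welcome', 'great', 'learning', 'now', 'start', 'good', 'practice']
for s in sentences:
    counts = Counter(preprocess(s))
    print([counts[tok] for tok in vocab])  # [1,1,2,1,1,0,0] and [0,0,1,0,0,1,1]
```

This reproduces the vocabulary and vectors above. In practice, scikit-learn's CountVectorizer performs essentially the same lowercasing, tokenization, and stopword removal (e.g., CountVectorizer(stop_words='english')) and returns the count matrix directly.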
Document Similarity

• A quantitative measure of the similarity between two documents can be obtained by converting the words or phrases within each document or sentence into a vectorised representation.
• The vector representations of the documents can then be used within the cosine similarity formula to quantify their similarity.
• In this scheme, a cosine similarity of 1 implies that the two documents are exactly alike, while a cosine similarity of 0 implies that the two documents have no terms in common.

Example of Document Vectors
• Doc 1 = "information retrieval"
• Doc 2 = "computer information retrieval"
• Doc 3 = "computer retrieval"

Vocabulary: information, retrieval, computer
• Doc 1 = <1, 1, 0>
• Doc 2 = <1, 1, 1>
• Doc 3 = <0, 1, 1>

Stacking the document vectors gives the term-document matrix (rows: Doc 1-3; columns: information, retrieval, computer):

$$D = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix}$$

Question: what is the vector for Doc 4 = "retrieval information retrieval"?

Documents in a Vector Space
• Each vocabulary term defines one axis, so Doc 1 = <1, 1, 0> is a point in the three-dimensional space spanned by Term 1 (information), Term 2 (retrieval), and Term 3 (computer).
[Figure: Doc 1 plotted in the 3-term vector space with axes information, retrieval, computer.]

Relevance as Vector Similarities
• Which document is closer to Doc 1: Doc 2 or Doc 3? What if we have a query "retrieval"? (Both questions are answered in the ranking sketch at the end of this section.)
[Figure: Doc 1, Doc 2, and Doc 3 plotted in the same 3-term vector space.]

Document Similarity in Retrieval
• Used in information retrieval to determine which document (d1 or d2) is more similar to a given query q.
• Documents and queries are represented in the same vector space.
• The angle between two vectors (or its cosine) is a proxy for their similarity.

Distance/Similarity Calculation
• The similarity or relevance of two vectors X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn> is calculated with a distance or similarity measure S(X, Y), typically normalized to the range (0, 1).
• Intuitively, the more dimensions the two vectors have in common, the larger the similarity.
• What about real-valued (rather than binary) vectors? Normalization is needed.

Similarity Measures
• Jaccard similarity (similarity of two sets):

$$\mathrm{Jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

• D1 = "information retrieval class"
• D2 = "information retrieval algorithm"
• D3 = "processing information"
• What is the Jaccard similarity S(D1, D2)? S(D1, D3)?
• What about D3 = "information of information retrieval"? (As a set, the repeated "information" counts only once.)

• Euclidean distance (distance between two points):

$$D(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

• Cosine similarity (similarity of two vectors, normalized by their lengths):

$$\cos(X, Y) = \frac{x_1 y_1 + x_2 y_2 + \dots + x_n y_n}{\sqrt{x_1^2 + \dots + x_n^2} \cdot \sqrt{y_1^2 + \dots + y_n^2}} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \cdot \sqrt{\sum_{i=1}^{n} y_i^2}}$$

Which one do you think is suitable for retrieval? Jaccard? Euclidean? Cosine?

Example
• What is the cosine similarity between D = "cat dog dog" = <1, 2, 0> and Q = "cat dog mouse mouse" = <1, 1, 2> (vocabulary: cat, dog, mouse)?
• Answer:

$$\sigma(D, Q) = \frac{1 \times 1 + 2 \times 1 + 0 \times 2}{\sqrt{1^2 + 2^2 + 0^2} \cdot \sqrt{1^2 + 1^2 + 2^2}} = \frac{3}{\sqrt{5}\sqrt{6}} \approx 0.55$$

• In comparison:

$$\sigma(D, D) = \frac{1 \times 1 + 2 \times 2 + 0 \times 0}{\sqrt{1^2 + 2^2 + 0^2} \cdot \sqrt{1^2 + 2^2 + 0^2}} = \frac{5}{\sqrt{5}\sqrt{5}} = \frac{5}{5} = 1$$
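A self-contained sketch of the three measures, checked against the worked example and the Jaccard questions above (the document strings and vectors are the same toy data):

```python
import math

def jaccard(doc_a, doc_b):
    # Set-based: word order and multiplicity are ignored.
    x, y = set(doc_a.split()), set(doc_b.split())
    return len(x & y) / len(x | y)

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Worked example above, vocabulary (cat, dog, mouse):
D = [1, 2, 0]  # "cat dog dog"
Q = [1, 1, 2]  # "cat dog mouse mouse"
print(round(cosine(D, Q), 2))  # 0.55
print(round(cosine(D, D), 2))  # 1.0

# The Jaccard questions from the slides:
print(jaccard("information retrieval class", "information retrieval algorithm"))       # 0.5
print(jaccard("information retrieval class", "processing information"))                # 0.25
print(jaccard("information retrieval class", "information of information retrieval"))  # 0.5
```

The last line shows why Jaccard is insensitive to term frequency: the repeated "information" counts only once as a set element, so the score is the same as for D2.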
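Finally, to answer the two questions posed under "Relevance as Vector Similarities" (which document is closer to Doc 1, and how the documents rank against the query "retrieval"), a hypothetical ranking over the term-document matrix might look like this:

```python
import math

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) *
                  math.sqrt(sum(yi * yi for yi in y)))

# Term-document matrix from the example (columns: information, retrieval, computer).
docs = {
    "Doc 1": [1, 1, 0],  # "information retrieval"
    "Doc 2": [1, 1, 1],  # "computer information retrieval"
    "Doc 3": [0, 1, 1],  # "computer retrieval"
}

# Which document is closer to Doc 1?
print(round(cosine(docs["Doc 1"], docs["Doc 2"]), 3))  # 0.816 -> Doc 2 is closer
print(round(cosine(docs["Doc 1"], docs["Doc 3"]), 3))  # 0.5

# Rank all documents against the one-word query "retrieval" = <0, 1, 0>.
query = [0, 1, 0]
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
for name in ranked:
    print(name, round(cosine(query, docs[name]), 3))  # Doc 1 and Doc 3 tie at 0.707
```

Note the tie at the end: the query matches each of Doc 1 and Doc 3 on one of their two terms, so cosine similarity ranks them equally, ahead of the longer Doc 2.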