Document Summarization Using Gensim and OpenAI

MEMBERS
  ANIRUDH SHARMA
  SIDDHARTH SAXENA
  SURYADEEP DAS
  NAMIT GUPTA

Objective
  To summarize a document using the Gensim library and OpenAI, and to compare the results.

Let's get started
  Gensim is an open-source Python library.
  Version used: 3.6.0

What do you need to know?
  Gensim provides a method for summarizing text documents.
  The method lives in the gensim.summarization module (removed in Gensim 4.0, so a 3.x release is required).
  The function used from this module is summarize().

Workflow Diagram [figure]

Preprocessing in Gensim
  Text cleaning: removing unwanted characters, such as special characters and punctuation marks, as well as stop words.
  Tokenization: breaking the text down into smaller units called tokens.
  Part-of-speech tagging: assigning a part-of-speech tag to each word in the text. This can help the summarization algorithm identify important entities and prioritize information related to them.

TextRank Algorithm
  The algorithm is based on graph theory.
  Sentences are treated as nodes (words, when extracting keywords).
  Weighted edges represent the relationships between nodes.
  The score of each node is computed with the PageRank algorithm.

Example - TextRank Algorithm Graph

Input Text
  S1: The quick brown fox jumps over the lazy dog.
  S2: The quick brown fox is a clever animal.
  S3: The lazy dog is not so clever.
  S4: The quick brown fox likes to play.
  S5: The lazy dog likes to sleep.

  (S1) ---------- (S2)
    |               |
    |               |
  (S3)              |
    |               |
    |               |
  (S4) ---------- (S5)

  Edges between the sentences indicate their similarity.

Initialize Scores
  Initialize the score of each sentence node.
  S1 = 1.0
  S2 = 1.0
  S3 = 1.0
  S4 = 1.0
  S5 = 1.0

Calculating Scores Using PageRank Algorithm
  The damping factor is 0.85.
  S1: (1 - 0.85) + 0.85 * ((1.0 / 2) + (1.0 / 1)) = 1.425
  S2: (1 - 0.85) + 0.85 * ((1.0 / 2) + (1.0 / 1)) = 1.425
  S3: (1 - 0.85) + 0.85 * ((1.0 / 2) + (1.0 / 1)) = 1.425
  S4: (1 - 0.85) + 0.85 * ((1.0 / 1) + (1.0 / 1)) = 1.85
  S5: (1 - 0.85) + 0.85 * ((1.0 / 1) + (1.0 / 1)) = 1.85
  Repeat this process until the scores converge.
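The score update used in the calculation above can be sketched in a few lines of Python. This is a minimal illustration, not Gensim's implementation: the function name textrank_scores, the node labels A to D, and the unit edge weights are all hypothetical, and the graph is deliberately not the one from the example.

```python
def textrank_scores(edges, damping=0.85, tol=1e-6, max_iter=200):
    """Score the nodes of an undirected weighted graph with the PageRank update.

    `edges` maps a pair of node names to a positive similarity weight.
    """
    # Build a symmetric adjacency map: node -> {neighbour: weight}.
    graph = {}
    for (a, b), weight in edges.items():
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight

    # Total edge weight at each node, used to split its score among neighbours.
    out_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}

    scores = {node: 1.0 for node in graph}  # every node starts at 1.0
    for _ in range(max_iter):
        new_scores = {
            node: (1 - damping) + damping * sum(
                scores[nbr] * w / out_weight[nbr] for nbr, w in nbrs.items()
            )
            for node, nbrs in graph.items()
        }
        # Stop once no score moved by more than the tolerance.
        converged = all(abs(new_scores[n] - scores[n]) < tol for n in graph)
        scores = new_scores
        if converged:
            break
    return scores


# Hypothetical graph: B is similar to A, C and D; the others are similar only to B.
example_edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("B", "D"): 1.0}
ranked = sorted(textrank_scores(example_edges).items(),
                key=lambda item: item[1], reverse=True)
for node, score in ranked:
    print(f"{node}: {score:.3f}")
```

In this toy graph, B is connected to every other node and ends up with the highest score. In a summarizer, the top-ranked nodes would be the sentences kept for the summary.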
Scores after convergence
  S1 = 0.946
  S2 = 0.946
  S3 = 0.946
  S4 = 1.078
  S5 = 1.078

Extracting Top Sentences
  The two sentences with the highest scores are S4 and S5, so the summary is:
  Summary: The quick brown fox likes to play. The lazy dog likes to sleep.

Selecting Top Sentences
  The top-scoring sentences are selected for the summary.
  Sentences are ranked by their TextRank score.
  The higher a sentence's score, the higher its priority.

Advantages
  TextRank is a well-established algorithm that has been shown to produce high-quality summaries.
  Ease of use: gensim.summarization is easy to use and requires minimal configuration.

Limitations
  The module provides limited options to customize the summarization process.
  Difficulty in handling long documents.
  Limited support for non-English texts.

Advantages and Disadvantages
  Preprocessing
  Memory
  Cost
  Language Processing
  Speed
  Speed compared to Google
  Creativity
  Consistency

Is OpenAI better than other libraries?
  State-of-the-art language models: OpenAI is known for developing some of the largest and most powerful language models in the world, including GPT-3. Libraries such as spaCy and NLTK do not have language models as advanced as OpenAI's.
  Cost: Using OpenAI's language models can be expensive, especially for larger models and higher volumes of usage. Libraries such as NLTK and spaCy are open source and free to use.
  Integration: Libraries such as spaCy may offer better integration with other technologies and frameworks, such as the deep learning libraries TensorFlow and PyTorch.

COSINE SIMILARITY
  Cosine similarity is a metric used to determine the similarity between two documents. It measures the cosine of the angle between two vectors, where each vector holds the term frequencies of the words in one document. A value of 1 means the two vectors point in the same direction, and a value of 0 means the documents share no terms.
  COSINE_SIMILARITY(D1, D2) = DOT_PRODUCT(D1, D2) / (MAGNITUDE(D1) * MAGNITUDE(D2))

DEMONSTRATION
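As a small illustration of the cosine similarity formula, the sketch below compares two short texts using raw term-frequency vectors. The helper names term_frequencies and cosine_similarity, the regex tokenizer, and the second example string are assumptions made for illustration; they are not taken from the project's code.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(doc1, doc2):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    tf1, tf2 = term_frequencies(doc1), term_frequencies(doc2)
    # Only words present in both documents contribute to the dot product.
    dot = sum(tf1[word] * tf2[word] for word in tf1.keys() & tf2.keys())
    magnitude1 = math.sqrt(sum(count * count for count in tf1.values()))
    magnitude2 = math.sqrt(sum(count * count for count in tf2.values()))
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (magnitude1 * magnitude2)


# First text: the extractive summary from the TextRank example.
# Second text: a made-up alternative summary, used only for comparison.
summary_a = "The quick brown fox likes to play. The lazy dog likes to sleep."
summary_b = "The fox likes to play and the dog likes to sleep."
print(f"similarity = {cosine_similarity(summary_a, summary_b):.3f}")
```

Raw term counts are the simplest choice here; weighting schemes such as TF-IDF would reduce the influence of very common words but are outside what the formula above specifies.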