Uploaded by Siddharth Saxena

Document Summarization

Document Summarization
Using Gensim and OpenAI
MEMBERS
ANIRUDH SHARMA
SIDDHARTH SAXENA
SURYADEEP DAS
NAMIT GUPTA
Objective
To summarize a document using the Gensim library and OpenAI, and to compare the results
Let's get started
Gensim: open-source Python library
Version used: 3.6.0
What you need to know?
Provides a method for summarizing text documents
Module: gensim.summarization
Function used from this module: summarize()
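A minimal sketch of calling the summarizer, assuming a Gensim 3.x install such as the 3.6.0 release named above. The `gensim.summarization` module was removed in Gensim 4.0, so the import is guarded; the sample text and the helper name `gensim_summary` are illustrative assumptions, not part of the deck.

```python
# Sketch only: gensim.summarization ships with Gensim 3.x and was removed in 4.0.
try:
    from gensim.summarization import summarize
except ImportError:
    summarize = None  # Gensim is missing, or a 4.x release is installed

# Hypothetical sample text; any multi-sentence string can be used instead.
text = (
    "Automatic summarization shortens a document while keeping its key points. "
    "Extractive methods pick existing sentences from the document. "
    "Abstractive methods generate new sentences instead. "
    "TextRank is a graph-based extractive method. "
    "It scores sentences by how similar they are to other sentences. "
    "The highest scoring sentences form the summary. "
    "Short inputs may produce an empty summary."
)


def gensim_summary(document, ratio=0.3):
    """Return an extractive summary, or None when the module is unavailable."""
    if summarize is None:
        return None
    # ratio is the fraction of the original sentences to keep
    return summarize(document, ratio=ratio)


summary = gensim_summary(text)
print(summary)
```

On very short inputs Gensim logs a warning and may return an empty string, so a longer document gives a more meaningful result.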
Workflow Diagram
Preprocessing in Gensim
Cleaning: removing unwanted characters, such as special characters, punctuation marks, and stop words.
Tokenization: breaking down the text into smaller units called tokens.
Part-of-speech tagging: assigning a part-of-speech tag to each word in the text.
This can help the summarization algorithm identify important entities and prioritize information related to them.
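The cleaning and tokenization steps can be sketched in plain Python. This is an illustration under stated assumptions, not Gensim's internal pipeline: the stop-word list is a tiny hypothetical sample, and part-of-speech tagging is left out because it needs an external tagger.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "so", "not", "over"}


def preprocess(sentence):
    """Lowercase, strip non-letter characters, tokenize, and drop stop words."""
    cleaned = re.sub(r"[^a-z\s]", " ", sentence.lower())  # remove punctuation and special characters
    tokens = cleaned.split()                               # whitespace tokenization
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```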
TextRank
Algorithm
Graph-based ranking algorithm
Text units are treated as nodes (sentences for summarization, words for keyword extraction)
Weighted edges represent relationships (similarity) between nodes
The score of each node is computed with the PageRank algorithm
Example - TextRank Algorithm
Input Text
The quick brown fox jumps over the lazy dog. - S1
The quick brown fox is a clever animal. - S2
The lazy dog is not so clever. - S3
The quick brown fox likes to play. - S4
The lazy dog likes to sleep. - S5
Graph
(S1) ---------- (S2)
 |
(S3)
 |
(S4) ---------- (S5)
Edges between the sentences indicate their similarity
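One way to put a number on the similarity behind an edge is the word-overlap measure from the original TextRank formulation: shared words divided by the sum of the logarithms of the two sentence lengths. The sketch below assumes that measure; Gensim's own edge weighting may differ, and the function names are illustrative.

```python
import math
import re


def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def overlap_similarity(s1, s2):
    """Shared words, normalized by log sentence lengths (TextRank-style overlap)."""
    t1, t2 = tokenize(s1), tokenize(s2)
    if not t1 or not t2:
        return 0.0
    denominator = math.log(len(t1)) + math.log(len(t2))
    if denominator == 0:  # both sentences contain a single word
        return 0.0
    return len(set(t1) & set(t2)) / denominator


s1 = "The quick brown fox jumps over the lazy dog."
s2 = "The quick brown fox is a clever animal."
print(round(overlap_similarity(s1, s2), 3))
```

A pair of sentences with no shared words gets a similarity of 0, which corresponds to having no edge between them in the graph.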
Initialize Scores
Initialize the scores for each sentence node.
S1 = 1.0
S2 = 1.0
S3 = 1.0
S4 = 1.0
S5 = 1.0
Calculating Scores Using the PageRank Algorithm (damping factor 0.85)
S1: (1 - 0.85) + 0.85 * ((1.0 / 2) + (1.0 / 1)) = 1.425
S2: (1 - 0.85) + 0.85 * ((1.0 / 2) + (1.0 / 1)) = 1.425
S3: (1 - 0.85) + 0.85 * ((1.0 / 2) + (1.0 / 1)) = 1.425
S4: (1 - 0.85) + 0.85 * ((1.0 / 1) + (1.0 / 1)) = 1.85
S5: (1 - 0.85) + 0.85 * ((1.0 / 1) + (1.0 / 1)) = 1.85
Repeat this process until the scores converge.
S1 = 0.946
S2 = 0.946
S3 = 0.946
S4 = 1.078
S5 = 1.078
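The iterate-until-convergence step can be sketched as below. The edge list is an assumption read from the diagram (S1-S2, S1-S3, S3-S4, S4-S5) and the edges are treated as unweighted, so the numbers this sketch produces need not match the figures on the slide. The update rule is the usual PageRank form: each node receives (1 - d) plus d times the sum of each neighbor's score divided by that neighbor's degree.

```python
DAMPING = 0.85

# Assumed undirected edges, read from the diagram (hypothetical reading).
EDGES = [("S1", "S2"), ("S1", "S3"), ("S3", "S4"), ("S4", "S5")]


def textrank_scores(edges, damping=DAMPING, tol=1e-9, max_iter=1000):
    """Iterate unweighted PageRank updates on an undirected graph until convergence."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    scores = {node: 1.0 for node in neighbors}  # every node starts at 1.0
    for _ in range(max_iter):
        new_scores = {
            node: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in adjacent)
            for node, adjacent in neighbors.items()
        }
        converged = max(abs(new_scores[n] - scores[n]) for n in scores) < tol
        scores = new_scores
        if converged:
            break
    return scores


scores = textrank_scores(EDGES)
for node in sorted(scores):
    print(node, round(scores[node], 3))
```

With this update rule the scores always sum to the number of nodes, which is a quick sanity check on any hand calculation.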
Extracting Top Sentences
The two sentences with the highest scores are S4 and S5.
Hence, the summary will be as follows:
The quick brown fox likes to play.
The lazy dog likes to sleep.
Selecting Top Sentences
The top-scoring sentences are selected for the summary.
Sentences are ranked by their TextRank scores.
The higher a sentence's score, the higher its priority.
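The selection step can be sketched as follows, using the converged scores from the worked example as input. The function name `top_sentences` and the choice to output the selected sentences in their original document order are assumptions made for illustration.

```python
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox is a clever animal.",
    "The lazy dog is not so clever.",
    "The quick brown fox likes to play.",
    "The lazy dog likes to sleep.",
]
# Converged scores for S1..S5, taken from the worked example.
scores = [0.946, 0.946, 0.946, 1.078, 1.078]


def top_sentences(sentences, scores, k=2):
    """Pick the k highest-scoring sentences and return them in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore original ordering
    return [sentences[i] for i in chosen]


summary = top_sentences(sentences, scores)
print("\n".join(summary))
# The quick brown fox likes to play.
# The lazy dog likes to sleep.
```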
Advantages
TextRank is a well-established algorithm that has been shown to produce high-quality summaries.
Ease of use: gensim.summarization is easy to use and requires minimal configuration.
Limitations
The module provides limited options to customize the summarization process.
Difficulty in handling long documents
Limited support for non-English texts
Preprocessing
Advantages
Memory
Disadvantages
Cost
Language Processing
Speed
Speed in comparison to Google
Creativity
Consistency
Is OpenAI better than other libraries?
State-of-the-art language models: OpenAI is known for developing some of the largest and most powerful language models in the world, including GPT-3. Other libraries, such as spaCy and NLTK, do not have language models that are as advanced as OpenAI's.
Cost: Using OpenAI's language models can be expensive, especially for larger models and higher volumes of usage. Other libraries, such as NLTK and spaCy, are open-source and free to use.
Integration: Other libraries, such as spaCy, may offer better integration with other technologies and frameworks, such as deep learning libraries like TensorFlow and PyTorch.
COSINE SIMILARITY
Cosine similarity is a metric used to determine the similarity between two documents. It measures the cosine of the angle between two vectors, where the vectors represent the term frequencies of the words in each document.
COSINE_SIMILARITY(D1, D2) =
DOT_PRODUCT(D1, D2) / (MAGNITUDE(D1) * MAGNITUDE(D2))
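The formula above can be sketched with term-frequency vectors built from `collections.Counter`. The tokenization (lowercase alphabetic words) and the function names are assumptions made for illustration; a comparison of two summaries would pass the two summary strings as `d1` and `d2`.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphabetic word tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(d1, d2):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    tf1, tf2 = term_frequencies(d1), term_frequencies(d2)
    dot_product = sum(tf1[word] * tf2[word] for word in tf1.keys() & tf2.keys())
    magnitude1 = math.sqrt(sum(count * count for count in tf1.values()))
    magnitude2 = math.sqrt(sum(count * count for count in tf2.values()))
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0  # an empty document has no direction to compare
    return dot_product / (magnitude1 * magnitude2)


# Hypothetical pair of summaries to compare.
similarity = cosine_similarity(
    "The quick brown fox likes to play.",
    "The lazy dog likes to sleep.",
)
print(round(similarity, 3))
```

A value of 1.0 means the two texts use the same words in the same proportions, and 0.0 means they share no words.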
References
DEMONSTRATION