Single Document Summarization using Information Retrieval Approach
Priyanka Sarraf [1]
Yogesh Kumar Meena [2]
M.Tech Scholar, Banasthali University, Jaipur [1]
Assistant Professor, MNIT, Jaipur [2]
Sarrafpriyanka60@gmail.com[1]
Abstract
Automatic summarization is the process by which a computer program reduces a document to create a summary that retains the important aspects of the original document. Since reading an entire document is a tedious task, a summary helps the reader quickly identify the general content of the document and its relevance. Generally, there are two approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. Abstractive methods, in contrast, build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate; such a summary may contain words not explicitly present in the original. State-of-the-art abstractive methods are still quite weak, so most research has focused on extractive methods.
Keywords: Document summarization, Extraction, Preprocessing, Sentence position value, Key words.

1. Introduction
With the increasing amount of online information, it has become extremely difficult for users to find relevant information. Information retrieval systems usually return a large number of documents listed in order of estimated relevance, and it is not possible for users to read each document in order to find the useful ones. Automatic text summarization systems help with this task by providing a quick summary of the information contained in each document. An ideal summary in this situation is one that does not contain repeated information and includes the unique information from multiple documents on that topic.

The single document summarizer is an application proposed to extract the most important information of a document. In automatic summarization there are two distinct techniques: text extraction and text abstraction. An extraction-based summary consists of a number of sentences selected from the input document, whereas an abstraction-based summary may contain text units that are not present in the input document. In the extraction-based technique used here, further features based on information retrieval are added. The overall system is divided into three segments: pre-processing the test document, sentence scoring based on text extraction, and summarization based on sentence ranking.

This paper presents a basic model for extraction-based generic text summarization and discusses evaluation methods for the extracted summary. In this approach, a generic single document summarization system is proposed using extraction-based summary techniques.

2. Related Work

Many research studies have been conducted on automatic text summarization. In this section, related studies are mentioned briefly.
Fang Chen et al. considered three features at the sentence level. According to the sentence location feature, sentences at the beginning and at the end of a paragraph are more likely to contain material useful for a summary. The second feature, paragraph location, is analogous: paragraphs are usually hierarchically organized as well, with crucial information in the opening and concluding paragraphs. The third feature is sentence length, since sentences that are too short or too long tend not to be included in summaries; a fixed threshold of 5 words for short sentences and 15 words for long sentences is used to decide whether a sentence should be included in the summary. [8]
Paice in his work examines the significance of each sentence by looking at clues that identify its importance, and provides scores for the sentences based on these clues. Sentences are then retrieved using a fixed threshold or by selecting the sentences with the highest scores. He experimented with sentence scoring using the following seven types of clues: the frequency-keyword approach, the title-keyword method, the location method, syntactic criteria, the indicator phrase method, the cue method, and relational criteria. [9]
Radev et al. have presented an extraction-based summarization scheme called centroid-based summarization. The main contributions of their paper are the use of cluster-based relative utility (CBRU) and cross-sentence informational subsumption (CSIS) for the evaluation of single- and multi-document summaries. MEAD is an implementation of the centroid-based method that assigns scores to sentences based on sentence-level and inter-sentence features, including cluster centroids, position, and TF*IDF. [10]
Mihalcea and Tarau in their work ranked the sentences in a given text with respect to their importance for the overall understanding of the text. They constructed a graph by adding a vertex for each sentence in the text, and established edges between vertices using sentence inter-connections, measured as a function of content overlap. They view such a relation between two sentences as a process of "recommendation" of other sentences in the text that address the same concepts. [11]
Svore et al. presented a new approach based on neural networks called NetSum. The authors extracted the three sentences from a single news document that best match various characteristics of the document's three highlights, where each highlight sentence is human-generated. The output of the system consists of purely extracted sentences, with no sentence compression or sentence generation. The authors studied two separate problems based on sentence order in matching the highlights: first, they extracted the three sentences that best match the highlights as a whole, disregarding sentence ordering; the second task considers ordering and compares sentences at an individual level. [12]
Luhn's work on single-document summarization proposed the frequency of a particular word in a document as a useful measure of its significance. Though Luhn's methodology was only a preliminary step towards summarization, many of his ideas are still found to be effective for text. In the first step, all stop words were removed and the remaining words were stemmed to their root forms. A list of content words was then compiled and sorted by decreasing frequency, the index providing a significance measure for each word. For every sentence, a significance factor was computed that reflects the number of occurrences of significant words within the sentence and the correlation between them, as measured by the intervention of non-significant words. All sentences are then ordered by their significance factor, and the top-ranked sentences are finally selected to form the automatically generated abstract. [1]
Jing presented a sentence reduction system for removing irrelevant phrases, such as prepositional phrases, clauses, to-infinitives, or gerunds, from sentences. The main contribution is determining the less important phrases in a sentence using reduction decisions, which are based on syntactic knowledge, context, and probabilities computed from corpus analysis. [13]
Edmundson proposed a typical structure for a text summarization methodology, incorporating both the word frequency and positional value ideas developed in two of his previous works. The first of the two additional features used was the presence of cue words (the occurrence of words like "significant" or "hardly"), and the second was the skeleton of the document (whether the sentence is a title or heading). Sentences were scored based on these features to extract sentences for the summary. [14]
Baxendale's work, also done at IBM and published in the same journal, provides early insight into a particular feature helpful for finding the salient parts of documents: sentence position. The author examined 200 paragraphs and found that in 85% of them the topic sentence came first, and in 7% it was the last sentence. Thus, a naive but fairly accurate way to select a topic sentence is to choose one of these two. This positional feature has since been used in many complex machine-learning-based systems. [15]
3. System Description
The goal of extractive text summarization is to select the most relevant sentences of the text. The proposed method uses a statistical and linguistic approach to find the most relevant sentences. The summarization system consists of three major steps: pre-processing, extraction of feature terms, and an algorithm for ranking the sentences based on the optimized feature weights (see Figure 1).
3.1 Pre-processing
This step involves sentence segmentation, tokenization, stop word removal, and stemming.
3.1.1 Sentence Segmentation

This is the process of decomposing the given text document into its constituent sentences, along with their word counts. In English, sentences are segmented by identifying sentence boundaries, which end with a full stop (.), question mark (?), or exclamation mark (!).
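As a minimal illustration of this step, the following Java sketch splits text on sentence-final punctuation and reports each sentence's word count. The class name and regular expression are illustrative, not from the paper, and the simplification would treat abbreviations such as "Dr." as boundaries:

import java.util.Arrays;
import java.util.List;

public class SentenceSegmenter {

    // Split on a full stop, question mark, or exclamation mark followed by
    // whitespace. Simplification: abbreviations like "e.g." also match.
    public static List<String> segment(String text) {
        return Arrays.asList(text.trim().split("(?<=[.?!])\\s+"));
    }

    public static void main(String[] args) {
        for (String s : segment("Text summarization is useful. Is it hard? It can be!")) {
            System.out.println(s + "  [" + s.split("\\s+").length + " words]");
        }
    }
}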
3.1.2 Tokenization

This is the process of splitting sentences into words by identifying the spaces, commas, and special symbols between the words. A list of sentences and their words is maintained for further processing.
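A minimal tokenizer sketch in the same spirit; the delimiter pattern is an assumption (any non-alphanumeric character is treated as a separator):

import java.util.ArrayList;
import java.util.List;

public class Tokenizer {

    // Split a sentence into lowercase word tokens, treating spaces, commas,
    // and any other non-alphanumeric symbols as delimiters.
    public static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Summaries, ideally, are short!"));
        // prints: [summaries, ideally, are, short]
    }
}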
3.1.3 Stop Word Removal

Stop words are common words that carry less important meaning than keywords. These words should be eliminated; otherwise, sentences containing them can unduly influence the generated summary.
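A sketch of this filtering step, assuming a tiny illustrative stop word list (a real system would load a much larger list from a file):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {

    // Tiny illustrative stop word list; not the list used by the paper.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "are", "of", "to", "and", "in", "that"));

    public static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(
                Arrays.asList("the", "summary", "of", "a", "document")));
        // prints: [summary, document]
    }
}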
3.1.4 Stemming

A word can appear in different forms in the same document, and these forms have to be converted to a single original form for simplicity. A stemming algorithm is used to transform words to their canonical forms; in this work, Porter's stemmer is used, which reduces a word to its root form using a predefined suffix list.
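The full Porter algorithm is too long to reproduce here (a reference implementation is linked in [5]); the following toy stemmer only illustrates the general suffix-stripping idea and is not the Porter stemmer:

import java.util.Arrays;
import java.util.List;

public class ToySuffixStemmer {

    // Ordered suffix list; the first matching suffix is stripped. The real
    // Porter stemmer applies several rule-driven passes with conditions on
    // the remaining stem; this toy version only demonstrates the idea.
    private static final List<String> SUFFIXES =
            Arrays.asList("ization", "fulness", "ing", "ies", "ed", "es", "s");

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : Arrays.asList("summaries", "ranked", "stemming", "cat")) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}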
3.2 Feature Extraction

After the input document is tokenized and stemmed, it is split into a collection of sentences. The sentences are ranked based on four important features: frequency, sentence position value, key words, and similarity with the title.
3.2.1 Frequency

Frequency is the number of times a word occurs in a document. If a word's frequency in a document is high, the word can be said to have a significant effect on the content of the document. The total frequency value of a sentence is calculated by summing up the document frequency of every word in the sentence.
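A sketch of this computation over tokenized, stemmed sentences; the class and method names are illustrative:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencyFeature {

    // Count how often each (stop-word-filtered, stemmed) word occurs in the
    // whole document, given the document as a list of tokenized sentences.
    public static Map<String, Integer> wordFrequencies(List<List<String>> sentences) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (String word : sentence) {
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }

    // The frequency score of a sentence is the sum of the document
    // frequencies of its words.
    public static int frequencyScore(List<String> sentence, Map<String, Integer> freq) {
        int score = 0;
        for (String word : sentence) {
            score += freq.getOrDefault(word, 0);
        }
        return score;
    }
}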
3.2.2 Sentence Position Value

The position of a sentence in the text decides its importance: sentences at the beginning define the theme of the document, whereas sentences at the end conclude or summarize it. The positional value of a sentence is computed by assigning the highest score to the first and last sentences of the document, the second highest score to the second sentence from the start and the second-to-last sentence, and a score of zero to all remaining sentences.
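A sketch of this positional scheme. The concrete score values 2 and 1 are assumptions for illustration; the paper fixes only the relative ordering:

public class PositionFeature {

    // First and last sentences receive the highest score, the second and
    // second-to-last the next highest, and every other sentence scores zero.
    public static double positionScore(int index, int sentenceCount) {
        if (index == 0 || index == sentenceCount - 1) {
            return 2.0;
        }
        if (index == 1 || index == sentenceCount - 2) {
            return 1.0;
        }
        return 0.0;
    }
}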
3.2.3 Key Words

Key words are the important words in the document and are supplied as input by the user. If a sentence contains a keyword, it is assigned a score of one; otherwise its score is zero.
3.2.4 Similarity with the Title

This feature is based on the words appearing in the title and headers, which are given extra weight in sentence scoring. If a sentence contains a word from the title or a header, it is assigned a score of one; otherwise its score is zero.
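A minimal sketch of the two binary features from Sections 3.2.3 and 3.2.4, assuming the keyword and title/header word sets have been tokenized and stemmed in the same way as the sentences:

import java.util.List;
import java.util.Set;

public class BinaryFeatures {

    // Key word feature: 1 if the sentence contains at least one of the
    // user-supplied key words, otherwise 0.
    public static int keywordScore(List<String> sentence, Set<String> keywords) {
        for (String word : sentence) {
            if (keywords.contains(word)) {
                return 1;
            }
        }
        return 0;
    }

    // Title feature: 1 if the sentence shares at least one word with the
    // title or headers, otherwise 0.
    public static int titleScore(List<String> sentence, Set<String> titleWords) {
        for (String word : sentence) {
            if (titleWords.contains(word)) {
                return 1;
            }
        }
        return 0;
    }
}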
3.3 Sentence Scoring

The final score of a sentence is a linear combination of its frequency, sentence position value, key word, and similarity-with-the-title scores.
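A sketch of the linear combination. The paper speaks of optimized feature weights without giving their values, so the weights are left as constructor parameters here:

public class SentenceScorer {

    // Feature weights; values are not specified by the paper.
    private final double wFreq, wPos, wKey, wTitle;

    public SentenceScorer(double wFreq, double wPos, double wKey, double wTitle) {
        this.wFreq = wFreq;
        this.wPos = wPos;
        this.wKey = wKey;
        this.wTitle = wTitle;
    }

    // Final score = w1*frequency + w2*position + w3*keyword + w4*title.
    public double score(double freq, double pos, double key, double title) {
        return wFreq * freq + wPos * pos + wKey * key + wTitle * title;
    }
}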
3.4 Sentence Ranking

After each sentence is scored, the sentences are arranged in descending order of their score values, i.e., the sentence with the highest score is at the top and the sentence with the lowest score is at the bottom.
3.5 Summary Extraction

After the sentences are ranked by their total scores, the summary is produced by selecting the top X ranked sentences, where the value of X is provided by the user. For the reader's convenience, the sentences selected for the summary are reordered according to their original positions in the document.
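A sketch of ranking and extraction, including the re-ordering of the selected sentences by original position; the helper class is illustrative:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SummaryExtractor {

    public static class ScoredSentence {
        final int originalIndex;
        final String text;
        final double score;

        public ScoredSentence(int originalIndex, String text, double score) {
            this.originalIndex = originalIndex;
            this.text = text;
            this.score = score;
        }
    }

    // Rank sentences by descending score, keep the top x, then restore the
    // original document order for readability.
    public static List<String> extract(List<ScoredSentence> sentences, int x) {
        List<ScoredSentence> ranked = new ArrayList<>(sentences);
        ranked.sort(Comparator.comparingDouble((ScoredSentence s) -> s.score).reversed());
        List<ScoredSentence> top =
                new ArrayList<>(ranked.subList(0, Math.min(x, ranked.size())));
        top.sort(Comparator.comparingInt(s -> s.originalIndex));
        List<String> summary = new ArrayList<>();
        for (ScoredSentence s : top) {
            summary.add(s.text);
        }
        return summary;
    }
}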
4. Summary Evaluation

Evaluating a summary is a difficult task because there is no ideal summary for a given document or set of documents. The absence of a standard human or automatic evaluation metric makes it very hard to compare different systems and establish a baseline. Besides this, manual evaluation is too expensive: as stated by Lin (2004), a large-scale manual evaluation of summaries as in the DUC conferences would require over 3,000 hours of human effort.

Two main criteria for evaluating the proficiency of a system are precision and recall, which are used to specify the similarity between the summary generated by the system and the one generated by a human. These terms are defined by the following equations:
Precision = Correct / (Correct + Wrong)

Recall = Correct / (Correct + Missed)

Figure 1. Steps of the proposed text summarization technique: the text document is pre-processed (sentence segmentation, tokenization, stop word removal, and stemming with the Porter stemming algorithm), the four features (frequency, sentence position value, key words, and similarity with the title) are extracted, and the sentences are scored and ranked to produce the summary.
where Correct is the number of sentences that appear in both the human-generated and the system-generated summaries; Wrong is the number of sentences present in the system-generated summary but not in the human-generated one; and Missed is the number of sentences that appear in the human-generated summary but not in the system-generated one. Therefore, precision specifies the fraction of the sentences extracted by the system that are suitable, and recall specifies the fraction of the suitable sentences that the system captured rather than missed.
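Under these definitions, precision and recall can be computed directly from the two summaries viewed as sets of sentences; a minimal sketch (the class and method names are illustrative):

import java.util.HashSet;
import java.util.Set;

public class SummaryEvaluator {

    // Compare a system-generated summary against a human-generated one,
    // both represented as sets of sentences. Returns {precision, recall}.
    public static double[] precisionRecall(Set<String> system, Set<String> human) {
        Set<String> correct = new HashSet<>(system);
        correct.retainAll(human);                   // sentences in both summaries
        int wrong = system.size() - correct.size(); // system-only sentences
        int missed = human.size() - correct.size(); // human-only sentences
        double precision = (double) correct.size() / (correct.size() + wrong);
        double recall = (double) correct.size() / (correct.size() + missed);
        return new double[] { precision, recall };
    }
}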
5. Conclusion and Future Scope

This paper discusses single document summarization using the extraction method, presenting the basic steps of document summarization as well as methods for evaluating summaries. The proposed method is implemented in Java and is under development. In the future, we want to extract summaries using an abstraction-based summarization approach and compare the results of the two approaches.
6. References
[1] H. P. Luhn, "The Automatic Creation of Literature Abstracts", IBM Journal of Research and Development, 2(2):159-165, 1958.
[2] I. Mani, Automatic Summarization, John Benjamins Publishing Co., Amsterdam, 2001.
[3] Wei Liu and Wang, "The Document Summary Method Based on Statements Weight and Genetic Algorithm", International Conference on Computer Science and Network Technology, 2011.
[4] Md. Iftekharul Alam Efat, "Automated Bangla Text Summarization by Sentence Scoring and Ranking", IEEE, 2013.
[5] The Porter Stemming Algorithm [Online]. Available: http://tartarus.org/~martin/PorterStemmer/
[6] Vikrant Gupta, Priya Chauhan, Sohan Garg, Anita Borude and Shobha Krishnan, "A Statistical Tool for Multi-Document Summarization", International Journal of Scientific and Research Publications, 2(5), May 2012, ISSN 2250-3153.
[7] Chetana Thaokar and Latesh Malik, "Test Model for Summarizing Hindi Text using Extraction Method", IEEE, 2013.
[8] Fang Chen, Kesong Han and Guilin Chen, "An Approach to Sentence Selection Based Text Summarization", Proceedings of the IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering, Vol. 1, pp. 489-493, 2002.
[9] C. D. Paice, "Constructing Literature Abstracts by Computer: Techniques and Prospects", Information Processing and Management, 26:171-186, 1990.
[10] Dragomir R. Radev, Hongyan Jing, Malgorzata Stys and Daniel Tam, "Centroid-based Summarization of Multiple Documents", Information Processing and Management, 40:919-938, 2004.
[11] Rada Mihalcea and Paul Tarau, "An Algorithm for
Language Independent Single and Multiple Document
Summarization", Proceedings of the International Joint
Conference on Natural Language Processing (IJCNLP),
Korea, 2005.
[12] Krysta M. Svore, Lucy Vanderwende and Christopher J. C. Burges, "Enhancing Single-document Summarization by Combining RankNet and Third-party Sources", Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, pp. 448-457, 2007.
[13] Hongyan Jing, "Sentence Reduction for Automatic Text Summarization", Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 310-315, 2000.
[14] H. P. Edmundson, "New Methods in Automatic Extracting", Journal of the ACM, 16(2):264-285, 1969.
[15] P. B. Baxendale, "Machine-Made Index for Technical Literature: An Experiment", IBM Journal of Research and Development, pp. 354-361, 1958.