Text Summarization Based on Genetic Fuzzy Techniques
1 Sravani Chinthaluri, 2 Pallavi Reddy, 3 Kalyani Nara
1 M.Tech Student, GNITS, Hyderabad, sravanich.srvn@gmail.com
2 Assistant Professor, CSE Dept, GNITS, Hyderabad, reddygaripallavi@gmail.com
3 Professor, CSE Dept, GNITS, Hyderabad, kalyani_nara@rediffmail.com
ABSTRACT: The tremendous amount of information available on the internet makes it difficult to identify the important information. In order to make sense of this huge amount of data, a summarization technique is needed. This paper presents a comparative study of three text summarization techniques developed using the extraction approach: the General Statistical Method (GSM), a Fuzzy system and a Genetic Fuzzy system. In the General Statistical Method, the summary is generated by ranking each sentence and listing the important ones. Apart from the statistical approach, a Fuzzy system is developed, which applies membership functions and a number of fuzzy rules to the feature values to produce a definite summary. The Genetic Fuzzy system optimizes the input features that are applied to the fuzzy rules. The analysis is performed on documents from the Earth, Nature, Forest and Metadata domains, and the corresponding results are presented in this paper.
KEYWORDS: Text Summarization, Preprocessing,
Feature Extraction, General Statistical Method (GSM),
Fuzzy-logic, Genetic Algorithm
I INTRODUCTION
Natural Language Processing (NLP) is a discipline of computer science and artificial intelligence concerned with the interactions between computers and human languages. NLP is used to design and build software that examines, interprets and generates the languages that humans use naturally, helping systems understand natural language. Applications such as machine
translation on the web, spelling and grammar
correction in word processors, automatic question
answering, email spam detection, extracting
appointments from email, and detecting people's
opinions about products or services use natural
language processing. Electronic documents from these applications are the principal media for academic and business information. To utilize these documents efficiently it is important to obtain the gist of a document, which can be achieved through summarization. The process of keeping the main content while obtaining the gist is called text summarization.
Text Summarization mainly follows two
approaches: Abstraction and Extraction. Abstraction
paraphrases sections in the source document.
Extractive methods work by selecting a subset of
existing phrases, sentences or words in the original
text to form the summary.
Extractive methods commonly use a sentence extraction technique to generate the summary: each sentence is assigned a numerical value (sentence scoring), and the best sentences are selected to form the summary based on the compression rate [4][11][12][13]. The General Statistical Method proposed in this paper uses this approach to generate the summary.
Early summaries were generated based on the most frequently used words. Luhn created the first summarization system [1] in 1958: he proposed to derive statistical information from word frequency, from which the machine computes a relative measure of significance based on the distribution of words, first for individual words and then for sentences; the highest-scoring sentences become the auto-abstract. Word frequencies and the distribution of words were also used by Rath et al. [2] in 1961 to generate summaries.
In later years various other approaches were developed to generate summaries. H. P. Edmundson [3] in 1969 used cue words and phrases in generating the summary. J. Kupiec, J. Pedersen and F. Chen (1995) [4] described a statistical method using a Bayesian classifier to estimate the probability that a sentence in a source document should be included in the summary.
Fuzzy sets were proposed by Zadeh [5]. A fuzzy neural network model for pattern recognition was proposed by A. D. Kulkarni and D. C. Cavanaugh [6]. Witte and Bergler [7] proposed a fuzzy-theory based approach to coreference resolution and its application to text summarization. L. Suanmali, N. Salim and M.S. Binwahlan [8] in 2009 proposed a fuzzy approach on the extracted features.
A fuzzy logic method is also proposed in this paper to extract the summary. Fuzzy logic provides approximate reasoning, which helps in determining the important sentences.
Apart from the statistical approach, a genetic algorithm is proposed as an attractive paradigm to improve the quality of text summarization systems. A technique for text summarization using a combination of fuzzy logic, a genetic algorithm (GA) and genetic programming (GP) was proposed by A. Kiani and M.R. Akbarzadeh [9]. Genetic semantic role labelling was proposed by L. Suanmali, N. Salim and M.S. Binwahlan [10] in 2011.
Genetic algorithms are frequently used for function optimization. A hybrid approach is therefore also proposed in this paper, which optimizes the input given to the fuzzy system.
A comparative study is performed on the three text summarization techniques to determine which technique provides a better summary. The paper is organized as follows: all three techniques share Preprocessing and Feature Extraction as their initial stage; these steps are explained in section II.
Section III explains the architecture of the General Statistical Method, in which a score is computed for each sentence from the features extracted in section II and the summary is generated depending upon that score. Section IV elaborates the architecture of the Fuzzy system, which is based on the rules designed for the eight features.
Section V concentrates on the genetic approach. The Genetic-Fuzzy approach discussed in this paper is a hybrid of the Fuzzy system and the Genetic algorithm, where the genetic algorithm helps in identifying the most important features. Section VI presents the evaluations and results: input documents from different domains are given as test data, and the results obtained from all three systems are used to calculate precision and recall values. The conclusions in section VII are drawn based upon the experimental evaluations of section VI.
II PREPROCESSING AND FEATURE EXTRACTION
A. Preprocessing:
The input document is a text file. Preprocessing consists of four main steps: segmentation, stop word removal, tokenization and stemming.
The algorithm for preprocessing is given as:
Procedure Preprocessing
1. Read the input source file
2. For each sentence:
   - Perform stop word removal
   - Perform tokenization
   - Perform stemming
3. Display the preprocessed sentences
Sentence segmentation divides the input file into a number of lines. The most commonly used words, such as "I", "a", "the", etc., are removed from the segmented lines. After stop word removal each line is divided into tokens, and prefixes and suffixes are removed from each token to obtain the base word.
Preprocessing is important as it provides the summarization system with a clean and adequate representation of the source document.
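The preprocessing steps above can be sketched in Python. The stop-word list and the suffix-stripping stemmer below are simplified stand-ins (a real system would use a fuller stop-word list and a Porter-style stemmer):

```python
# Simplified preprocessing sketch: segmentation, stop-word removal,
# tokenization, and naive suffix-stripping stemming.
import re

STOP_WORDS = {"i", "a", "an", "the", "is", "are", "of", "in", "to", "and"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer

def segment(text):
    """Split the input text into sentences on ., ! or ?."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def preprocess(text):
    sentences = []
    for sentence in segment(text):
        tokens = re.findall(r"[a-z']+", sentence.lower())    # tokenization
        tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
        stems = []
        for t in tokens:                                     # stemming
            for suf in SUFFIXES:
                if t.endswith(suf) and len(t) > len(suf) + 2:
                    t = t[: -len(suf)]
                    break
            stems.append(t)
        sentences.append(stems)
    return sentences

print(preprocess("The forests are growing. Nature is a treasure."))
```

Each sentence comes out as a list of stems, which is the representation the feature-extraction stage works on.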
B. Feature Extraction:
After preprocessing, eight features are computed for each sentence; each feature yields a value between 0 and 1. The eight features are described below:
1. Title Feature: Title words in a sentence are important as they are closely related to the theme, so such sentences have a greater chance of being included in the summary. The title feature is calculated as below:
Title Feature = Number of title words in the sentence / Number of words in the title
2. Sentence Length: Sentence length is important in generating the summary; short sentences such as names, datelines, etc. are not included in the summary, and this feature is used to filter them out.
Sentence Length = Number of words occurring in the sentence / Number of words occurring in the longest sentence
3. Term Weight: The frequency of term occurrences within a document has often been used to calculate the weight of each sentence. The score of a sentence is the sum of the scores of its words, and the weight of each word is based on its term frequency. The weight of term i is given by:

Wi = tfi * isfi = tfi * log(N / ni)

tfi: term frequency of word i
N: number of sentences in the document
ni: number of sentences in which word i occurs

The total term weight of a sentence S is then:

Term Weight = Σ(i=1..k) Wi(S) / MAX(Σ(i=1..k) Wi(Si))

k: number of words in the sentence
Si: sentence i

4. Sentence Position: The position of a sentence also plays an important role in determining whether the sentence is relevant. If there are 5 lines in the document, the sentence positions are scored 5/5 for the 1st, 4/5 for the 2nd, 3/5 for the 3rd, 2/5 for the 4th and 1/5 for the 5th.

5. Sentence to Sentence Similarity: Similarity between sentences is important in generating the summary, since similar sentences should not be repeated in the generated summary.

Sentence to Sentence Similarity = Σ Sim(Si, Sj) / MAX(Σ Sim(Si, Sj))

Si: sentence i
Sj: sentence j
Sim(Si, Sj): the similarity of the 1 to n terms in sentences Si and Sj

6. Proper Noun: Sentences that contain more proper nouns are more likely to be included in the summary. The proper noun feature is calculated as below:

Proper Noun = Number of proper nouns in the sentence / Sentence length

7. Thematic Word: Terms that occur more frequently are more related to the topic; the top 10 most frequent words are considered thematic words. Thematic words are scored as below:

Thematic Word = Number of thematic words in the sentence / MAX(number of thematic words)

8. Numerical Data: This feature is used to identify statistical data in a sentence. Numerical data is calculated as follows:

Numerical Data = Number of numerical data items in the sentence / Sentence length

III GENERAL STATISTICAL METHOD

The General Statistical Method (GSM) uses the extraction approach: the most important sentences are selected based upon their scores. The technique consists of the following main steps:
1. An input text file is read as the source document.
2. Preprocessing (segmentation, stop word removal, tokenization and stemming) is performed on the source document.
3. The eight features described in section II are obtained for every sentence.
4. For a sentence S, a weighted score function is established by integrating all eight features:

Score(S) = Σ(k=1..8) S_Fk(S)

S_Fk(S): score of feature k for sentence S
Score(S): total score of the sentence

5. The sentences are arranged in decreasing order of their scores, and the summary is generated from the arranged sentences depending upon the compression rate.
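The five GSM steps above can be sketched end to end. The eight per-sentence feature scores are assumed to be precomputed (the values below are made up for illustration), so the sketch only shows the scoring, ranking and compression-rate selection:

```python
def gsm_summary(sentences, feature_scores, compression_rate=0.2):
    """Rank sentences by the sum of their eight feature scores and keep
    the top fraction given by the compression rate."""
    # Score(S) = sum over the eight features S_Fk(S)
    totals = [sum(fs) for fs in feature_scores]
    ranked = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)
    keep = max(1, round(len(sentences) * compression_rate))
    chosen = sorted(ranked[:keep])  # restore original document order
    return [sentences[i] for i in chosen]

docs = ["S1 ...", "S2 ...", "S3 ...", "S4 ...", "S5 ..."]
scores = [  # hypothetical 8 feature values per sentence, each in [0, 1]
    [0.1] * 8, [0.9] * 8, [0.3] * 8, [0.7] * 8, [0.2] * 8,
]
print(gsm_summary(docs, scores, compression_rate=0.2))  # ['S2 ...']
```

Re-sorting the selected indices keeps the summary in document order, which extractive summarizers typically do for readability.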
Fig 1: General Statistical Method

The compression rate plays an important role in generating the summary for the General Statistical Method: the precision and recall values alter with changes in the compression rate. The compression rate should be between 5-30% for the summary to be acceptable.

IV FUZZY SYSTEM

The Fuzzy system is used to make the ambiguous, imprecise values of the text features admissible. The system is based upon fuzzy rules and membership functions. A membership function operates on the domain of all possible values, which lie in the range (0, 1), and generates the fuzzy sets; the fuzzy rules use the membership functions and feature values to select the important sentences for the summary. The selection of fuzzy rules and membership functions directly affects the performance. The Fuzzy system consists of four components: the Fuzzifier, the Inference Engine, the Rule Base and the Defuzzifier. The eight features obtained in section II are given as input to the first component, the Fuzzifier. The overall architecture of the Fuzzy system is given as follows:

Fig 2: Fuzzy System

The fuzzifier implements a triangular membership function to determine the fuzzy sets. The input membership function is divided into five fuzzy sets, created by taking the minimum and maximum feature values into consideration. The five levels are determined by calculating the mean frequency term:

Mean Frequency Term (Mft) = (max - min) / 5.0

The five fuzzy sets are very low (VL), low (L), medium (M), high (H) and very high (VH).

Fig 3: Triangular membership function of the term weight feature for the first three sentences
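A minimal sketch of the fuzzification step follows. The paper gives only the Mft formula, so the exact triangle layout here is an assumption: five evenly spaced peaks over [lo, hi], each triangle overlapping its neighbours by half:

```python
def triangular(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(x, lo=0.0, hi=1.0):
    """Degree of membership of x in the five sets VL, L, M, H, VH.
    Peak spacing (hi - lo)/4 is an assumed layout; the paper defines
    Mft = (max - min)/5 but not the exact triangle positions."""
    step = (hi - lo) / 4.0
    out = {}
    for k, label in enumerate(["VL", "L", "M", "H", "VH"]):
        peak = lo + k * step
        out[label] = triangular(x, peak - step, peak, peak + step)
    return out

print(fuzzify(0.375))  # partially L, partially M
```

A feature value near a peak belongs fully to one set; values between peaks belong partially to two adjacent sets, which is what lets the rules reason with "high" versus "very high".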
Each feature obtained is given to the inference engine along with its fuzzy set, and the inference engine generates the output depending upon the Fuzzy Rule Base. The Fuzzy Rule Base consists of a number of rules, each combining the extracted features with the fuzzy sets. The core of the rule base is IF-THEN rules, which compare the feature values with the fuzzy sets to classify them as high, very high, low, etc. A rule can be given as below:

IF (TitleWord is VH) AND (SentenceLength is H) AND (TermFreq is VH) AND (SentencePosition is H) AND (SentenceSimilarity is VH) AND (ProperNoun is H) AND (ThematicWord is VH) AND (NumericalData is H) THEN (Sentence is important)
More rules can be written using the AND and OR operators: AND takes the minimum weight of all the antecedents, while OR takes the maximum. The defuzzifier aggregates the outputs of the rules; since the rules are executed sequentially, the defuzzifier generates the summary.
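The AND/OR semantics just described (minimum for AND, maximum for OR) can be sketched as follows; the antecedent names and membership degrees are illustrative:

```python
# Mamdani-style rule firing: AND = min of antecedent memberships, OR = max.
# Feature values are assumed already fuzzified to degrees in [0, 1].
def rule_and(*degrees):
    return min(degrees)

def rule_or(*degrees):
    return max(degrees)

# Degree to which each antecedent of an example rule holds (hypothetical):
memberships = {
    "TitleWord is VH": 0.8,
    "SentenceLength is H": 0.6,
    "TermFreq is VH": 0.9,
    "SentencePosition is H": 0.7,
}
firing_strength = rule_and(*memberships.values())
print(firing_strength)  # 0.6 -- the weakest antecedent limits the rule
```

With min for AND, a sentence is only as "important" as its weakest matching feature, which is what makes the rule conjunction strict.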
V GENETIC FUZZY SYSTEM

Genetic algorithms are used for search and optimization problems. Genetic fuzzy systems are fuzzy systems whose structure and parameters are identified using genetic algorithms or genetic programming, which mirror the process of natural evolution. The Genetic Fuzzy technique proposed in this paper uses a genetic algorithm to optimize the input features. The overall structure of the Genetic Fuzzy system is given by the following diagram:

Fig 4: Genetic Fuzzy System

The basic idea of the genetic approach is to maintain a population of candidate solutions to the problem being solved. The genetic algorithm mutates and alters the candidate solutions to arrive at a better solution. Each candidate solution is represented by a chromosome.

Chromosome:
A chromosome represents a solution in the genetic algorithm and can be encoded in two ways: floating point representation or bit representation. Each chromosome is a vector of dimension F (F being the total number of features). The floating point representation is an array of floating point numbers, but for the genetic operations it is converted into bit representation (values < 0.5 become 0 and values >= 0.5 become 1). We use the bit representation as it is widely used for genetic operations. A bit chromosome is represented as below:

Feature:  1  2  3  4  5  6  7  8
Bit:      0  1  0  0  0  1  1  0

Fig 5: Chromosome representation

The genetic algorithm is implemented by the following steps. The evolution process begins with the initially generated population and is iterative; the population generated in each evolution is called a generation, and in this paper we use 100 evolutions to generate the intermediate chromosomes. For each iteration the following steps are performed until a termination condition is met:
1. Initial Population: Each chromosome is represented by the binary bits 0 and 1. Let N be the population size; we generate 10 chromosomes of 8-bit length.
2. Crossover and Mutation: On the generated initial population a one-point crossover is performed: two parent chromosomes are selected, a point is chosen on them, and the bits beyond that point are exchanged between the two parents. This results in two child chromosomes, which are added to the result set.
3. Selection: Each chromosome encodes the eight features. After crossover and mutation, the chromosomes from the resultant set with a minimum of three selected features are passed to the fitness function.
4. Fitness function: The fitness function in a genetic algorithm is problem dependent. Since a chromosome represents the eight features with 8 bits, the fitness function treats features with bit value 0 as not fit and those with bit value 1 as fit. For example, in the chromosome representation of Fig 5, features 2, 6 and 7 are selected.

While genetic algorithms are powerful tools to identify the fuzzy membership functions of a predefined rule base, they have limitations, especially when it comes to identifying the input and output variables of a fuzzy system from a given set of data. The genetic approach in this paper is used to identify the input variables: since the optimization is performed on the input variables, the selection criterion above is applied to the chromosomes before they are passed to the fitness function.
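The steps above can be sketched with one-point crossover and the minimum-three-features selection criterion. The population size (10) and chromosome length (8) come from the text; the random seed and everything else is illustrative:

```python
import random

random.seed(0)
F = 8  # chromosome length: one bit per feature

def initial_population(n=10):
    """Generate n random 8-bit chromosomes."""
    return [[random.randint(0, 1) for _ in range(F)] for _ in range(n)]

def one_point_crossover(p1, p2):
    """Exchange the tails of two parents at a random point, giving two children."""
    point = random.randint(1, F - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def selection(chromosomes, min_features=3):
    """Pass only chromosomes with at least three selected features onward."""
    return [c for c in chromosomes if sum(c) >= min_features]

pop = initial_population()
children = []
for _ in range(5):
    a, b = random.sample(pop, 2)
    children.extend(one_point_crossover(a, b))
survivors = selection(pop + children)
print(len(survivors), survivors[0])
```

The surviving chromosomes are the ones whose set bits name the input features forwarded to the fuzzy system.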
The list of features obtained as the output of the genetic algorithm is passed to the fuzzy system, as shown in Fig 4. The fuzzy system, comprising fuzzy rules and fuzzy sets, takes the fuzzy sets of the optimized features as input. The inputs are applied to the fuzzy rules, which specify the conditions for important sentences, and the summary is generated based upon the selected fuzzy sets and fuzzy rules.
VI EVALUATIONS AND RESULTS

The evaluation of the systems is done on documents from the Earth, Nature, Forest and Metadata domains. For every document a manually generated reference summary is compared against the system output to obtain the precision and recall values. Summaries are generated using the GSM, Fuzzy and Genetic-Fuzzy approaches, and precision, recall and F-measure are calculated for each document.

The General Statistical Method generates the summary based on the total feature score, while the Fuzzy and Genetic Fuzzy approaches consider the individual features to improve the quality of the summary. In the Fuzzy system there are five fuzzy set values per feature that can be selected dynamically, whereas in the Genetic Fuzzy approach the number of input features is reduced, which, due to the optimization process, considerably reduces the number of fuzzy sets to be selected.

Graphs of precision, recall and F-measure are drawn to determine the performance of the developed systems.

Fig 6: Precision and recall of the GSM, Fuzzy and Fuzzy-Genetic systems on the Forest, Nature, Earth and MetaData documents

Recall, also known as sensitivity, gradually increases as shown in the figure; this increase in recall suggests that the system performs better than the other systems.

Fig 7: F-measure of the GSM, Fuzzy and Fuzzy-Genetic systems on the Forest, Nature, Earth and MetaData documents

The F-measure likewise shows that the Fuzzy and Genetic Fuzzy systems score higher than the statistical approach.
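For extractive summaries, precision, recall and F-measure reduce to the overlap between the set of sentences selected by the system and those in the manually generated reference summary; a minimal sketch:

```python
def prf(system_sentences, reference_sentences):
    """Precision, recall and F-measure over selected sentence sets."""
    system, reference = set(system_sentences), set(reference_sentences)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

# System picked sentences 1, 3, 5; the reference summary holds 1, 2, 3, 4.
p, r, f = prf({1, 3, 5}, {1, 2, 3, 4})
print(p, r, f)  # precision 2/3, recall 1/2
```

Precision penalizes selecting irrelevant sentences, recall penalizes missing reference sentences, and the F-measure is their harmonic mean.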
VII CONCLUSIONS
In this paper, a comparative study of the General Statistical Method, Fuzzy and Genetic Fuzzy summarization techniques is presented. These techniques are evaluated by giving text documents from different domains as input. From the evaluations and results, the Fuzzy system provides a better summary than the statistical method, and the hybrid approach of the genetic algorithm and the fuzzy system provides a better summary than the other techniques. It is observed that the addition of the genetic algorithm to the fuzzy system improves recall over the Fuzzy system.
REFERENCES
[1] H. P. Luhn, "The Automatic Creation of Literature Abstracts," IBM Journal of Research and Development, vol. 2, pp. 159-165, 1958.
[2] G. J. Rath, A. Resnick, and T. R. Savage, "The formation of abstracts by the selection of sentences," American Documentation, vol. 12, pp. 139-143, 1961.
[3] H. P. Edmundson, "New methods in automatic extracting," Journal of the Association for Computing Machinery, 16(2), pp. 264-285, 1969.
[4] J. Kupiec, J. Pedersen, and F. Chen, "A Trainable Document Summarizer," in Proceedings of the Eighteenth Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), Seattle, WA, pp. 68-73, 1995.
[5] L. Zadeh, "Fuzzy sets," Information and Control, vol. 8, pp. 338-353, 1965.
[6] A. D. Kulkarni and D. C. Cavanaugh, "Fuzzy Neural Network Models for Classification," Applied Intelligence, 12, pp. 207-215, 2000.
[7] R. Witte and S. Bergler, "Fuzzy coreference resolution for summarization," in Proceedings of the 2003 International Symposium on Reference Resolution and Its Applications to Question Answering and Summarization (ARQAS), Venice, Italy: Università Ca' Foscari, pp. 43-50, 2003.
[8] L. Suanmali, N. Salim, and M.S. Binwahlan, "Fuzzy Logic Method for Improving Text Summarization," International Journal of Computer Science and Information Security (IJCSIS), vol. 2(1), pp. 65-70, http://arxiv.org/abs/0906.4690, 2009.
[9] A. Kiani and M.R. Akbarzadeh, "Automatic Text Summarization Using Hybrid Fuzzy GA-GP," in Proceedings of the 2006 IEEE International Conference on Fuzzy Systems, Vancouver, BC, Canada, pp. 977-983, 2006.
[10] L. Suanmali, N. Salim, and M.S. Binwahlan, "Fuzzy Genetic Semantic Based Text Summarization," Ninth IEEE International Conference on Dependable, Autonomic and Secure Computing, pp. 1185-1191, 2011.
[11] I. Mani and M. T. Maybury, editors, Advances in Automatic Text Summarization, MIT Press, 1999.
[12] M.A. Fattah and Fuji Ren, "Automatic Text Summarization," in Proceedings of World Academy of Science, Engineering and Technology, vol. 27, pp. 192-195, February 2008.
[13] J.Y. Yeh, H.R. Ke, W.P. Yang, and I.H. Meng, "Text summarization using a trainable summarizer and latent semantic analysis," Information Processing and Management, special issue on An Asian digital libraries perspective, 41(1), pp. 75-95, 2005.
[14] G. Morris, G.M. Kasper, and D.A. Adam, "The effect and limitation of automated text condensing on reading comprehension performance," Information Systems Research, 3(1), pp. 17-35, 1992.
[15] M. Wasson, "Using leading text for news summaries: Evaluation results and implications for commercial summarization applications," in Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the ACL, pp. 1364-1368, 1998.
[16] L. Yulia, G. Alexander, and René Arnulfo García-Hernández, "Terms Derived from Frequent Sequences for Extractive Text Summarization," in