Text Summarization Based on Genetic Fuzzy Techniques

1 Sravani Chinthaluri, 2 Pallavi Reddy, 3 Kalyani Nara
1 M.Tech Student, GNITS, Hyderabad, sravanich.srvn@gmail.com
2 Assistant Professor, CSE Dept, GNITS, Hyderabad, reddygaripallavi@gmail.com
3 Professor, CSE Dept, GNITS, Hyderabad, kalyani_nara@rediffmail.com

ABSTRACT: The tremendous amount of information available on the internet makes it difficult to identify the important information. In order to grasp the significance of this huge amount of data, a summarization technique is needed. This paper presents a comparative study of three text summarization techniques developed using the extraction approach: the General Statistical Method (GSM), a Fuzzy system and a Genetic Fuzzy system. In the General Statistical Method, the summary is generated by ranking each sentence and listing the important ones. Apart from the statistical approach, a Fuzzy system is developed to generate the summary; it applies membership functions and a set of fuzzy rules to the feature values to produce a definite summary. The Genetic Fuzzy system optimizes the input features that are applied to the fuzzy rules. The analysis is performed on documents from the Earth, Nature, Forest and Metadata domains, and the corresponding results are presented in this paper.

KEYWORDS: Text Summarization, Preprocessing, Feature Extraction, General Statistical Method (GSM), Fuzzy Logic, Genetic Algorithm

I INTRODUCTION

Natural Language Processing (NLP) is a discipline of computer science and artificial intelligence concerned with the interactions between computers and human languages. NLP is used to design and build software that examines, interprets and generates the languages that humans use naturally, helping systems understand natural language. Applications such as machine translation on the web, spelling and grammar correction in word processors, automatic question answering, email spam detection, extracting appointments from email, and detecting people's opinions about products or services use natural language processing. Electronic documents from these applications are a principal medium for academic and business information. To utilize these documents efficiently it is important to obtain the gist of each document, which can be achieved through summarization. The process of keeping the main content while obtaining the gist is called text summarization.

Text summarization mainly follows two approaches: abstraction and extraction. Abstraction paraphrases sections of the source document, whereas extractive methods select a subset of existing phrases, sentences or words from the original text to form the summary. Extractive methods commonly use sentence extraction: each sentence is assigned a numerical value (sentence scoring), and the best sentences are selected to form the summary according to the compression rate [4][11][12][13]. The General Statistical Method proposed in this paper uses this approach to generate the summary.

Early summaries were generated based on the most frequently used words. Luhn created the first summarization system [1] in 1958; he proposed deriving statistical information from word frequency, with a relative measure of significance computed by the machine from word distribution. The significance is calculated for individual words and then for sentences, and the highest-scoring sentences become the auto-abstract.
Word frequencies and the distribution of words were used by Rath et al. [2] in 1961 to generate summaries. In later years various other approaches were developed. H. P. Edmundson [3] in 1969 used cue words and phrases to generate the summary. J. Kupiec, J. Pedersen and F. Chen [4] in 1995 described a statistical method using a Bayesian classifier to estimate the probability that a sentence in a source document should be included in the summary. Fuzzy sets were proposed by Zadeh [5]. A fuzzy-neural network model for pattern recognition was proposed by A. D. Kulkarni and D. C. Cavanaugh [6]. Witte and Bergler [7] proposed a fuzzy-theory based approach to coreference resolution and its application to text summarization. L. Suanmali, N. Salim and M. S. Binwahlan [8] in 2009 proposed a fuzzy approach applied to the extracted features. A fuzzy logic method is also used in this paper to extract the summary; fuzzy logic provides approximate reasoning, which helps in determining the important sentences.

Apart from the statistical approach, the genetic algorithm is an attractive paradigm for improving the quality of text summarization systems. A technique for text summarization using a combination of fuzzy logic, genetic algorithm (GA) and genetic programming (GP) was proposed by A. Kiani and M. R. Akbarzadeh [9], and a fuzzy genetic semantic based text summarization approach was proposed by L. Suanmali, N. Salim and M. S. Binwahlan [10] in 2011. Genetic algorithms are frequently used for function optimization. A hybrid approach is also presented in this paper; it introduces optimization of the input given to the fuzzy system. A comparative study is performed on the three text summarization techniques to determine which provides a better summary.

The paper is organized as follows. The three text summarization techniques share preprocessing and feature extraction as their initial stage; these steps are briefly explained in section II. Section III explains the architecture of the General Statistical Method, in which a score is determined for each sentence from the features extracted in section II and the summary is generated depending upon the score. The architecture of the Fuzzy system, which is based on rules designed over eight different features, is elaborated in section IV. The genetic approach is covered in section V; the Genetic-Fuzzy approach discussed in this paper is a hybrid of the Fuzzy system and the genetic algorithm, where the genetic algorithm helps in identifying the most important features. The results obtained from all three systems are used to calculate precision and recall values, with input documents from different domains given as test data. The evaluations and results of the developed systems are given in section VI, and the conclusions in section VII are drawn based upon the experimental evaluations from section VI.

II PREPROCESSING AND FEATURE EXTRACTION

A. Preprocessing: The input document is a text file, and this stage consists of four main steps: segmentation, stop word removal, tokenization and stemming. The algorithm for preprocessing is given as:

Procedure Preprocessing
1. Read the input source file
2. For each sentence {
       Perform stop word removal
       Perform tokenization
       Perform stemming
   }
3. Display the preprocessed sentences

The input file is divided into a number of lines using sentence segmentation. The most commonly used words such as "I", "a", "the", etc. are removed from the segmented lines. After stop word removal each word is divided into tokens, and from each token prefixes and suffixes are removed to obtain the base word. Preprocessing is important as it provides the summarization systems with a clean and adequate representation of the source document.
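To make this stage concrete, the following minimal Python sketch runs the four preprocessing steps on a small document. The tiny stop-word list, the regex-based sentence splitter and the crude suffix-stripping stemmer are simplifications introduced here for illustration only; a full system would use a complete stop-word list and a proper stemmer such as Porter's.

```python
import re

# Minimal stand-ins; a real system would use a full stop-word list and a proper stemmer.
STOP_WORDS = {"i", "a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "for"}
SUFFIXES = ("ing", "edly", "ed", "es", "s", "ly")

def segment(text):
    """Split the document into sentences on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Split a sentence into lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping used only for illustration."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token

def preprocess(text):
    """Return, for each sentence, its list of stemmed content tokens."""
    sentences = segment(text)
    return [(s, [stem(t) for t in remove_stop_words(tokenize(s))]) for s in sentences]

if __name__ == "__main__":
    doc = "The forests of the Earth are shrinking. Deforestation affects nature and climate."
    for sentence, tokens in preprocess(doc):
        print(sentence, "->", tokens)
```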
B. Feature Extraction: After preprocessing, eight features are obtained for each sentence, and each feature takes a value between 0 and 1. The eight features are described below:

1. Title Feature: Title words occurring in a sentence are important as they are more related to the theme, so such sentences have a greater chance of being included in the summary. The title feature is calculated as:

Title Feature = (Number of title words in the sentence) / (Number of words in the title)

2. Sentence Length: Sentence length is important in generating the summary. Short sentences such as names, date lines, etc. are not included in the summary, and this feature is used to remove them.

Sentence Length = (Number of words occurring in the sentence) / (Number of words occurring in the longest sentence)

3. Term Weight: The frequency of term occurrences within a document has often been used for calculating the weight of each sentence. The score of a sentence is calculated as the sum of the scores of the words in the sentence, where the weight of each word is given by its term frequency. The weight of a term is given by:

Wt = tfi * isfi = tfi * log(N / ni)

where tfi is the term frequency of word i, N is the number of sentences in the document and ni is the number of sentences in which word i occurs. The total term weight of a sentence is then:

Term Weight = ∑i=1..k Wi(S) / MAX(∑i=1..k Wi(Sj))

where k is the number of words in the sentence.

4. Sentence Position: The position of a sentence also plays an important role in determining whether the sentence is relevant. If there are 5 lines in the document, the sentence positions are given by 5/5 for the 1st, 4/5 for the 2nd, 3/5 for the 3rd, 2/5 for the 4th and 1/5 for the 5th.

5. Sentence to Sentence Similarity: Similarity between sentences is important in generating the summary, since similar sentences should not be repeated in the generated summary.

Sentence to Sentence Similarity = ∑ Sim(Si, Sj) / MAX(∑ Sim(Si, Sj))

where Sim(Si, Sj) is the similarity between the terms of sentence Si and sentence Sj.

6. Proper Noun: Sentences which contain more proper nouns are more likely to be included in the summary. The proper noun feature is calculated as:

Proper Noun = (Number of proper nouns in the sentence) / (Sentence length)

7. Thematic Word: Terms that occur more frequently are more related to the topic; the top 10 most frequent words are considered thematic words. Thematic words are calculated as:

Thematic Word = (Number of thematic words in the sentence) / MAX(Number of thematic words)

8. Numerical Data: This feature is used to identify statistical data in every sentence. Numerical data is calculated as:

Numerical Data = (Number of numerical data in the sentence) / (Sentence length)

III GENERAL STATISTICAL METHOD

The General Statistical Method (GSM) uses the extraction approach. The basic idea of GSM is to select the most important sentences based upon their scores. The technique consists of the following main steps:

1. An input text file is read as the source document.
2. Preprocessing (segmentation, stop word removal, tokenization and stemming) is performed on the source document.
3. The eight features described in section II are obtained for every sentence.
4. For each sentence S a weighted score is established by integrating all eight features:

   Score(S) = ∑k=1..8 S_Fk(S)

   where S_Fk(S) is the score of feature k for sentence S and Score(S) is the total score of the sentence.
5. The sentences are arranged in decreasing order of their scores, and the summary is generated from this ranking depending upon the compression rate.

Fig 1: General Statistical Method

The compression rate plays an important role in generating the summary for the General Statistical Method: the precision and recall values change with the compression rate, and the compression rate should be between 5-30% for the summary to be acceptable.
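As an illustration of how the eight features of section II can be computed, the sketch below derives them for every sentence of a document. The paper does not fix a proper-noun detector or a sentence-similarity measure, so capitalised non-initial words and raw token overlap are used here as stand-in approximations; the normalisations follow the formulas given above.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lower-cased word tokens; stop-word removal and stemming may be applied beforehand."""
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_features(title, sentences):
    """Compute the eight sentence features of section II, each normalised to [0, 1]."""
    title_words = set(tokens(title))
    sent_tokens = [tokens(s) for s in sentences]
    n = len(sentences)
    longest = max(len(t) for t in sent_tokens) or 1

    doc_tf = Counter(w for t in sent_tokens for w in t)                   # tf_i
    n_i = {w: sum(1 for t in sent_tokens if w in t) for w in doc_tf}      # n_i
    thematic = {w for w, _ in doc_tf.most_common(10)}                     # top 10 words

    def term_weight(t):  # sum of tf_i * log(N / n_i) over the words of the sentence
        return sum(doc_tf[w] * math.log(n / n_i[w]) for w in t)

    raw_tw = [term_weight(t) for t in sent_tokens]
    max_tw = max(raw_tw) or 1.0
    raw_sim = [sum(len(set(t) & set(u)) for j, u in enumerate(sent_tokens) if j != i)
               for i, t in enumerate(sent_tokens)]
    max_sim = max(raw_sim) or 1.0
    max_thematic = max(sum(1 for w in t if w in thematic) for t in sent_tokens) or 1

    features = []
    for i, (raw, t) in enumerate(zip(sentences, sent_tokens)):
        length = len(t) or 1
        proper = sum(1 for w in raw.split()[1:] if w[:1].isupper())   # rough proper-noun count
        features.append({
            "title":       sum(1 for w in t if w in title_words) / (len(title_words) or 1),
            "length":      len(t) / longest,
            "term_weight": raw_tw[i] / max_tw,
            "position":    (n - i) / n,
            "similarity":  raw_sim[i] / max_sim,
            "proper_noun": proper / length,
            "thematic":    sum(1 for w in t if w in thematic) / max_thematic,
            "numerical":   sum(1 for w in t if w.isdigit()) / length,
        })
    return features
```

A call such as extract_features("Text Summarization", sentences) returns one dictionary of the eight values, each in [0, 1], per sentence, ready to be scored.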
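Given those per-sentence feature values, the GSM ranking of section III reduces to a sum, a sort and a cut at the compression rate. The sketch below assumes the feature dictionaries are already available (for instance from the previous sketch); the 34% rate in the toy example is chosen only so that one of the three sentences survives and is not meant to reflect the 5-30% range recommended above.

```python
def gsm_summary(sentences, features, compression_rate=0.20):
    """Rank sentences by Score(S) = sum of the eight feature scores and keep
    the top fraction of sentences given by the compression rate."""
    scores = [sum(f.values()) for f in features]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = max(1, round(len(sentences) * compression_rate))
    chosen = sorted(ranked[:keep])            # restore original document order
    return [sentences[i] for i in chosen]

# Example with hand-made feature values for three sentences.
sents = ["Forests cover a third of the Earth.", "They are shrinking.", "Data from 2020 confirms this."]
feats = [{"title": 1.0, "length": 1.0, "term_weight": 0.9, "position": 1.0,
          "similarity": 0.8, "proper_noun": 0.2, "thematic": 1.0, "numerical": 0.0},
         {"title": 0.0, "length": 0.4, "term_weight": 0.3, "position": 0.66,
          "similarity": 0.5, "proper_noun": 0.0, "thematic": 0.2, "numerical": 0.0},
         {"title": 0.0, "length": 0.7, "term_weight": 0.5, "position": 0.33,
          "similarity": 0.4, "proper_noun": 0.0, "thematic": 0.3, "numerical": 0.2}]
print(gsm_summary(sents, feats, compression_rate=0.34))
```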
IV FUZZY SYSTEM

The fuzzy system is used to formulate the ambiguous, imprecise values of the text features so that they become admissible. The system is based upon fuzzy rules and membership functions. The membership functions operate on the domain of all possible feature values, which lie in the range (0, 1), and generate the fuzzy sets; the fuzzy rules use the membership functions and feature values to generate the summary of important sentences. The selection of the fuzzy rules and membership functions directly affects the performance. The fuzzy system consists of four components: the fuzzifier, the inference engine, the rule base and the defuzzifier. The eight features obtained in section II are given as input to the first component, the fuzzifier. The overall architecture of the fuzzy system is given as follows:

Fig 2: Fuzzy System

The fuzzifier implements a triangular membership function to determine the fuzzy sets. The input membership function is divided into five fuzzy sets. A fuzzy set is created by taking the minimum and maximum feature values into consideration, and the five-level fuzzy sets are determined by calculating the mean frequency term:

Mean Frequency Term (Mft) = (max - min) / 5.0

The five fuzzy sets are very low (VL), low (L), medium (M), high (H) and very high (VH).

Fig 3: Triangular membership function of the term weight feature for the first three sentences

Each feature value is given to the inference engine along with its fuzzy set. The inference engine generates the output depending upon the fuzzy rule base. The rule base consists of a number of rules, all of which combine the extracted features with the fuzzy sets. The main concept of the rule base is the IF-THEN rule, which compares the fuzzy sets with the feature values to determine them as high, very high, low, etc. A rule is given as:

IF (TitleWord is VH) and (SentenceLength is H) and (TermFreq is VH) and (SentencePosition is H) and (SentenceSimilarity is VH) and (ProperNoun is H) and (ThematicWord is VH) and (NumericalData is H) THEN (Sentence is important)

More rules can be written using the AND and OR operators: AND takes the minimum weight of all the antecedents, while OR uses the maximum value. The defuzzifier aggregates the output of each rule; since the rules are executed sequentially, the defuzzifier generates the summary.
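A minimal sketch of the fuzzifier and rule evaluation described above follows. The paper specifies the triangular shape, the five sets VL-VH and Mft = (max - min)/5, but not the exact placement of the triangles, so spacing the peaks one Mft apart is an assumption made here; AND is realised as the minimum of the antecedent memberships, as stated above.

```python
def triangle(x, a, b, c):
    """Standard triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def five_fuzzy_sets(lo, hi):
    """Build the five sets VL, L, M, H, VH from the min/max of a feature.
    Peaks spaced one Mft = (max - min)/5 apart; this spacing is an assumption."""
    mft = (hi - lo) / 5.0
    labels = ["VL", "L", "M", "H", "VH"]
    peaks = [lo + (k + 0.5) * mft for k in range(5)]
    return {lab: (p - mft, p, p + mft) for lab, p in zip(labels, peaks)}

def membership(x, sets):
    """Degree of membership of x in each of the five fuzzy sets."""
    return {lab: triangle(x, *abc) for lab, abc in sets.items()}

def rule_and(*degrees):
    """AND over rule antecedents: the minimum of their membership degrees."""
    return min(degrees)

# Shortened example rule: "IF TitleWord is VH AND SentenceLength is H THEN sentence is important"
title_sets = five_fuzzy_sets(0.0, 1.0)
length_sets = five_fuzzy_sets(0.0, 1.0)
title_value, length_value = 0.92, 0.71
importance = rule_and(membership(title_value, title_sets)["VH"],
                      membership(length_value, length_sets)["H"])
print("degree to which the rule fires:", importance)
```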
V GENETIC FUZZY SYSTEM

The genetic algorithm is used for search and optimization problems. Genetic fuzzy systems are fuzzy systems whose structure and parameters are identified using genetic algorithms or genetic programming, which mirror the process of natural evolution. The Genetic Fuzzy technique proposed in this paper uses the genetic algorithm for the optimization of the input features. The overall structure of the Genetic Fuzzy system is given by the following diagram:

Fig 4: Genetic Fuzzy System

The basic idea of the genetic approach is to maintain a population of candidate solutions to the concrete problem being solved. The genetic algorithm mutates and alters the candidate solutions to provide a better solution. Each candidate solution is represented by a chromosome.

Chromosome: A chromosome represents a solution to the genetic algorithm and can be represented in two ways: floating point representation or bit representation. Each chromosome is a vector of dimension F (F being the total number of features). The floating point representation is an array of floating point numbers, but for further genetic operations it is converted into the bit representation (<0.5 means 0 and >=0.5 means 1). We use the bit representation as it is widely used for genetic operations. A bit chromosome is represented as below:

Fig 5: Chromosome representation (an 8-bit chromosome; in this example bits 2, 6 and 7 are set to 1)

While genetic algorithms are very powerful tools for identifying the fuzzy membership functions of a predefined rule base, they have certain limitations, especially when it comes to identifying the input and output variables of a fuzzy system from a given set of data. The genetic approach in this paper is used to identify the input variables. Since the optimization is performed on the input variables, a selection criterion is applied to the chromosomes before they are passed to the fitness function.

The genetic algorithm is implemented by the following steps:

1. The evolution process begins with an initially generated population and proceeds iteratively; the population generated in each iteration is called a generation. In this paper 100 evolutions are used for generating the intermediate chromosomes. For each iteration the following steps are performed until a termination condition is met.
2. Initial Population: Each chromosome is represented by binary bits (0 and 1). Let N be the population of chromosomes; we generate 10 chromosomes of 8-bit length.
3. Crossover and Mutation: On the generated initial population, a one-point crossover is performed. The one-point crossover starts by selecting two parent chromosomes; a point is then selected and the mutation operation exchanges bits between the two parent chromosomes at that point. This results in two child chromosomes, which are added to the result set.
4. Selection: Each chromosome carries the eight features. After crossover and mutation, only the chromosomes from the result set with a minimum of three features are passed to the fitness function.
5. Fitness function: The fitness function in a genetic algorithm is problem dependent. Since the chromosomes represent the eight features with 8 bits, the fitness function is devised such that bits with value zero are not fit and bits with value 1 are fit. For example, in the chromosome representation of Fig 5, features 2, 6 and 7 are selected.

The list of features obtained as the output of the genetic algorithm is passed to the fuzzy system as shown in Fig 4. The fuzzy system, comprising fuzzy rules and fuzzy sets, takes the fuzzy sets of the optimized features as input; the inputs are allocated to fuzzy rules that specify the conditions for important sentences. The summary is generated based upon the selected fuzzy sets and fuzzy rules.
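The sketch below illustrates the genetic machinery described above: the 8-bit chromosome, the population of 10, one-point crossover between two parents, and the selection filter that passes only chromosomes with at least three features to the fitness stage. Since the paper does not give a numerical fitness formula (it only states that 1-bits are fit), the sketch stops at listing the features a surviving chromosome would hand to the fuzzy system.

```python
import random

random.seed(0)            # reproducible illustration
N_FEATURES = 8            # the eight features of section II
POP_SIZE = 10             # the paper uses 10 chromosomes of 8 bits

def random_chromosome():
    """Bit representation: bit k = 1 means feature k+1 is kept for the fuzzy system."""
    return [random.randint(0, 1) for _ in range(N_FEATURES)]

def one_point_crossover(p1, p2):
    """Pick a point and exchange the bits of the two parents beyond it,
    producing the two child chromosomes described in the paper."""
    point = random.randint(1, N_FEATURES - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def passes_selection(chromosome):
    """Only chromosomes with at least three features reach the fitness function."""
    return sum(chromosome) >= 3

def selected_features(chromosome):
    """The 1-bits name the features handed on to the fuzzy system."""
    return [i + 1 for i, bit in enumerate(chromosome) if bit == 1]

population = [random_chromosome() for _ in range(POP_SIZE)]
parent1, parent2 = random.sample(population, 2)
child1, child2 = one_point_crossover(parent1, parent2)
for child in (child1, child2):
    status = "passed to fitness" if passes_selection(child) else "discarded (< 3 features)"
    print(child, "->", selected_features(child), "-", status)
```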
VI EVALUATIONS AND RESULTS

The evaluation of the systems is done on documents from the Earth, Nature, Forest and Metadata domains. For every document a manually generated relevant summary is used for comparison to obtain the precision and recall values. The summary is generated using the GSM, Fuzzy and Genetic-Fuzzy approaches, and precision, recall and F-measure are calculated for each document. The General Statistical Method generates the summary based on the score of the features, whereas the Fuzzy and Genetic Fuzzy approaches consider the individual features to improve the quality of the summary. In the Fuzzy system, for each feature there are five fuzzy set values that can be selected dynamically, whereas in the Genetic Fuzzy approach the number of input features is reduced, which considerably reduces the number of fuzzy sets to be selected due to the optimization process. Graphs of precision, recall and F-measure are drawn to determine the performance of the developed systems. The precision and recall graph is given below:

Fig 6: Precision and recall of the GSM, Fuzzy and Fuzzy-Genetic systems on the Forest, Nature, Earth and MetaData documents

Recall is also known as sensitivity. Recall gradually increases across the systems, as shown in the figure; the increase in recall suggests that the corresponding system performs better than the others. The F-measure for the documents is shown below:

Fig 7: F-measure of the GSM, Fuzzy and Fuzzy-Genetic systems on the Forest, Nature, Earth and MetaData documents

The F-measure also shows that the Fuzzy and Genetic Fuzzy systems score higher than the statistical approach.
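The paper does not spell out the precision, recall and F-measure formulas; the sketch below uses the standard sentence-level definitions, comparing the set of sentences chosen by a system against the manually generated reference summary.

```python
def precision_recall_f(system_summary, reference_summary):
    """Sentence-level precision, recall and F-measure of a system summary against a
    manually generated reference summary (standard definitions, assumed here)."""
    system, reference = set(system_summary), set(reference_summary)
    common = len(system & reference)
    precision = common / len(system) if system else 0.0
    recall = common / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f_measure

# Toy example: two of the three system sentences also appear in the reference summary.
system = ["S1", "S3", "S5"]
reference = ["S1", "S2", "S3", "S7"]
print(precision_recall_f(system, reference))   # (0.666..., 0.5, 0.571...)
```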
VII CONCLUSIONS

In this paper, a comparative study of the General Statistical Method, Fuzzy and Genetic Fuzzy summarization techniques is presented. The techniques are evaluated by giving text documents from different domains as input. From the evaluations and results, the Fuzzy system provides a better summary than the statistical method, and the hybrid approach of genetic algorithm and fuzzy system provides a better summary than the other techniques. It is observed that the addition of the genetic algorithm to the fuzzy system improves the recall over the Fuzzy system.

REFERENCES

[1] H. P. Luhn, "The Automatic Creation of Literature Abstracts," IBM Journal of Research and Development, vol. 2, pp. 159-165, 1958.
[2] G. J. Rath, A. Resnick, and T. R. Savage, "The formation of abstracts by the selection of sentences," American Documentation, vol. 12, pp. 139-143, 1961.
[3] H. P. Edmundson, "New methods in automatic extracting," Journal of the Association for Computing Machinery, vol. 16, no. 2, pp. 264-285, 1969.
[4] J. Kupiec, J. Pedersen, and F. Chen, "A Trainable Document Summarizer," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, pp. 68-73, 1995.
[5] L. Zadeh, "Fuzzy sets," Information and Control, vol. 8, pp. 338-353, 1965.
[6] A. D. Kulkarni and D. C. Cavanaugh, "Fuzzy Neural Network Models for Classification," Applied Intelligence, vol. 12, pp. 207-215, 2000.
[7] R. Witte and S. Bergler, "Fuzzy coreference resolution for summarization," in Proceedings of the 2003 International Symposium on Reference Resolution and Its Applications to Question Answering and Summarization (ARQAS), Venice, Italy: Università Ca' Foscari, pp. 43-50, 2003.
[8] L. Suanmali, N. Salim, and M. S. Binwahlan, "Fuzzy Logic Method for Improving Text Summarization," International Journal of Computer Science and Information Security (IJCSIS), vol. 2, no. 1, pp. 65-70, http://arxiv.org/abs/0906.4690, 2009.
[9] A. Kiani and M. R. Akbarzadeh, "Automatic Text Summarization Using Hybrid Fuzzy GA-GP," in Proceedings of the 2006 IEEE International Conference on Fuzzy Systems, Vancouver, BC, Canada, pp. 977-983, 2006.
[10] L. Suanmali, N. Salim, and M. S. Binwahlan, "Fuzzy Genetic Semantic Based Text Summarization," in Ninth IEEE International Conference on Dependable, Autonomic and Secure Computing, pp. 1185-1191, 2011.
[11] I. Mani and M. T. Maybury, editors, Advances in Automatic Text Summarization, MIT Press, 1999.
[12] M. A. Fattah and F. Ren, "Automatic Text Summarization," in Proceedings of World Academy of Science, Engineering and Technology, vol. 27, pp. 192-195, February 2008.
[13] J. Y. Yeh, H. R. Ke, W. P. Yang, and I. H. Meng, "Text summarization using a trainable summarizer and latent semantic analysis," Information Processing and Management, special issue on An Asian digital libraries perspective, vol. 41, no. 1, pp. 75-95, 2005.
[14] G. Morris, G. M. Kasper, and D. A. Adam, "The effect and limitation of automated text condensing on reading comprehension performance," Information Systems Research, vol. 3, no. 1, pp. 17-35, 1992.
[15] M. Wasson, "Using leading text for news summaries: Evaluation results and implications for commercial summarization applications," in Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the ACL, pp. 1364-1368, 1998.
[16] L. Yulia, G. Alexander, and R. A. García-Hernández, "Terms Derived from Frequent Sequences for Extractive Text Summarization," in