International Journal of Engineering Trends and Technology (IJETT), Volume 6, Number 7, Dec 2013

Processing Phase of Summarizer for Multiple News Single Punjabi Documents

Vishal Gupta
Assistant Professor, UIET, Panjab University, Sector-25, Chandigarh, India

Abstract: This paper describes the processing phase of a Punjabi text summarizer that works on a single document containing multiple news articles. This is the first time such a summarizer has been developed and implemented for Punjabi. For this project we developed several Punjabi language resources, including a Punjabi noun morph, a Punjabi stemmer, a key-term extractor and a named entity recognizer. A single newspaper page usually carries several news items of different lengths. Given a compression ratio (C.R.) chosen by the user, the system retrieves the mainline (headline) of every news item, the sentence just after each mainline, and other sentences selected by relevance. The relevance of a line is computed from text features that are either statistical or language oriented. The approach has two steps: a) pre-processing and b) processing. In pre-processing, the input text is put into a structured representation. The processing step computes the features that decide which lines are important. The statistical features are Punjabi key-term detection, the length of a line, and the presence of numeric data.
The language-oriented features used to extract relevant lines are: extraction of mainlines from Punjabi newspaper text, extraction of the sentences next to mainlines, Punjabi nouns, Punjabi names, English terms written as they are in Punjabi, Punjabi cue terms, and the presence of title terms in a line. Lines are then ranked by the value of the line-feature equation, with all features assumed to be equally important. The highest-ranked lines are included in the summary at the chosen C.R., and coherence is maintained by presenting them in the order in which they appear in the input text.

Keywords: Text summarizer, detection of Punjabi mainlines, detection of names in Punjabi, detection of key terms in Punjabi.

ISSN: 2231-5381

I. INTRODUCTION

Text summarization [1] [2] is the technique of shortening an input text while preserving its information content and theme. It has two phases [3]: i) pre-processing, which puts the input text into a structured representation, and ii) processing, which ranks every line by computing the line-feature equation and retrieves the higher-scoring lines into the summary in the order in which they appear in the input.

This paper describes the processing phase of a Punjabi summarizer that works on a single document containing multiple news articles. This is the first time such a Punjabi summarizer has been developed and implemented. For this project we developed several Punjabi language resources, including a Punjabi noun morph, a Punjabi stemmer, a key-term extractor and a named entity recognizer. A single newspaper page usually carries several news items of different lengths. Given a C.R. chosen by the user, the system retrieves the mainline (headline) of every news item, the sentence just after each mainline, and other sentences selected by relevance. The relevance of a line is computed from text features that are either statistical or language oriented. The approach has two steps: a) pre-processing and b) processing. In pre-processing [8], the input text is put into a structured representation. The processing step computes the features that decide which lines are important. The statistical features are Punjabi key-term detection, the length of a line, and the presence of numeric data. The language-oriented features are: extraction of mainlines from Punjabi newspaper text, extraction of the sentences next to mainlines, Punjabi nouns, Punjabi names, English terms written as they are in Punjabi, Punjabi cue terms, and the presence of title terms in a line. Lines are then ranked by the value of the line-feature equation, with all features assumed to be equally important. The highest-ranked lines are included in the summary at the chosen C.R., and coherence is maintained by presenting them in the order in which they appear in the input text.

II. PROCESSING PHASE

In this phase [4] [12], the features that decide which lines are important are applied and their values are calculated. Lines are then ranked by applying the line-feature equation.
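The ranking and selection described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the feature names in the dictionaries, and the reading of the C.R. as the fraction of lines to keep are all assumptions made for the example.

```python
from typing import Dict, List


def score_line(features: Dict[str, float]) -> float:
    """Equal-weight line-feature equation: the plain sum of the feature values."""
    return sum(features.values())


def summarize(lines: List[str],
              feature_rows: List[Dict[str, float]],
              compression_ratio: float) -> List[str]:
    """Keep the top-scoring lines, then restore their order in the input.

    Assumption: compression_ratio is the fraction of input lines to keep.
    """
    if len(lines) != len(feature_rows):
        raise ValueError("one feature row is required per line")
    if not lines:
        return []
    keep = max(1, round(len(lines) * compression_ratio))
    scores = [score_line(row) for row in feature_rows]
    # Rank line indices by score, highest first (ties keep input order).
    ranked = sorted(range(len(lines)), key=lambda i: scores[i], reverse=True)
    # Re-sort the selected indices so the summary follows the input order.
    chosen = sorted(ranked[:keep])
    return [lines[i] for i in chosen]


if __name__ == "__main__":
    # Hypothetical feature values for a four-line document.
    demo_lines = ["headline", "sentence 1", "sentence 2", "sentence 3"]
    demo_rows = [
        {"mainline": 1.0, "title_terms": 1.0},
        {"next_to_mainline": 1.0, "nouns": 0.5},
        {"nouns": 0.1},
        {"nouns": 0.4, "cue_terms": 1.0},
    ]
    print(summarize(demo_lines, demo_rows, 0.5))
```

Sorting the selected indices a second time is what keeps the summary lines in input order, which is how the text describes coherence being preserved.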
Line scores are calculated using the line-feature equation, given as

Score(line) = feature_1 + feature_2 + feature_3 + ... + feature_n

Here feature_1, feature_2, feature_3, ..., feature_n are the values of the features of a line, calculated in the various steps of this summarizer. Higher-scoring lines are retrieved into the summary in the order in which they appear in the input text, which maintains coherence among the lines.

A. Detection of Mainlines and Lines Next to Main Lines

This is the first time a technique for automatically detecting mainlines (headlines), and the sentences next to them, in single Punjabi newspaper documents has been devised and implemented. Mainlines are always relevant in newspapers because they convey much of the information of the whole news item, so they are always included in the summary. The sentence that follows a mainline is also likely to carry relevant information, so these next lines normally become part of the summary as well. The Punjabi news corpus contains 65,722 mainlines, which make up around seven percent of the corpus.

A Punjabi sentence usually ends with a question mark, an exclamation mark or the vertical line character. If a line in Punjabi text does not end with one of these characters but ends with a newline (Enter key) character, it is treated as a mainline. Moreover, if the sentence immediately after a mainline ends with a question mark, a vertical line character or an exclamation mark, it is treated as the next line to the mainline.

B. Detection of Cue Terms in Punjabi

English has various cue terms such as "conclude", "at last", "ultimately" and "summary". A line containing a cue term at any position (start, middle or end) is relevant because it tends to highlight important information. We created a list of Punjabi cue terms from the Punjabi news corpus. Lines containing these cue terms are marked as relevant and included in the summary. In the Punjabi news corpus, cue terms occur 58,708 times and cover 0.52% of the corpus.

C. Name Entity Detection in Punjabi

Rule-based name entity detection for Punjabi was first developed by Gupta & Lehal (2011) [10], and we use that approach here. It relies on several gazetteer lists: a list of name prefixes, a list of name suffixes, a list of name middle parts, a list of name last parts and a list of Punjabi proper nouns. These lists are used to decide whether a Punjabi term is a name entity, and they were developed from the Punjabi news corpus.

The prefix list contains prefixes of Punjabi names, for example ਸੀਮਤੀ, ਿਪ. and ਡਾ:. It has 14 prefixes found in the Punjabi news corpus. These prefixes occur 17,127 times and cover 0.15% of the corpus.

The suffix list is used to determine whether a term is a Punjabi name, for example ਜੀਤ, ਪੁਰ, ਪੁਰਾ and ਪੁਰੀ. It has around 50 suffixes. These suffixes occur 225,306 times in the Punjabi news corpus and cover around two percent of it.

The middle-part list contains middle parts of Punjabi names used to identify whether a term is a name entity, for example ਕੁਮਾਰ, ਕੌਰ and ਕੁਮਾਰੀ.
From the Punjabi news corpus we found 8 middle parts of Punjabi names. They occur 97,907 times in the corpus and cover around one percent of it.

The last-part list contains last parts of Punjabi names, used to detect whether a term belongs to a Punjabi name. It has around 310 entries. In the Punjabi corpus, 69,268 terms were found to be last parts of names, covering 0.6135 percent of the corpus.

Punjabi names are essential for retrieving relevant lines into the summary. The count of Punjabi names in the news corpus is 17,598, and they cover around fourteen percent of the corpus. The value of this feature is the ratio of the number of Punjabi names in a line to the length of the line, so it varies from 0 to 1. On fifty Punjabi news articles, this sub-phase achieved a Precision of 89.32 percent, a Recall of 83.4 percent and an F-measure of 86.25 percent, with errors of 13.75 percent.

D. Detection of Nouns & Common Nouns in English and Punjabi

Lines containing nouns [6] are always relevant. The Punjabi noun morph contains around 37,297 nouns. Input terms are looked up in the noun morph, or the noun stemmer is applied, to check whether they are Punjabi nouns. The value of this feature is the ratio of the number of nouns in a line to the length of the line, so it varies from 0 to 1. Nouns make up around seventeen percent of the terms in the Punjabi news corpus.
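The name and noun features above are both ratios of matching terms to line length, and the reported F-measure follows from Precision and Recall. A minimal sketch of both calculations is given below. It assumes a plain set lookup in place of the gazetteer rules, the noun morph and the stemmer, and the names `ratio_feature` and `f_measure` are hypothetical.

```python
from typing import Iterable, Set


def ratio_feature(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of a line's tokens found in a lexicon; always between 0 and 1.

    Assumption: membership in a set stands in for the rule-based name
    detection or the noun morph and stemmer lookup described in the text.
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical romanized tokens; two of the four are in the lexicon.
    print(ratio_feature(["w1", "kumar", "w2", "kaur"], {"kumar", "kaur"}))
    # Precision and Recall reported for name detection in the text.
    print(round(f_measure(89.32, 83.4), 2))
```

With the reported Precision of 89.32 and Recall of 83.4, the harmonic mean comes to roughly 86.26, in line with the F-measure of 86.25 percent stated above.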
The accuracy of this phase, tested on the Punjabi news corpus, is around 98 percent, with 1.57% errors caused by nouns missing from the Punjabi noun morph and by stemmer errors.

Many English terms are written in Punjabi in the same way as in English. For example, the Punjabi terms ਮੋਬਾਈਲ and ਟੈਕਨਾਲੋਜੀ are written as they are in English. These terms are usually absent from the Punjabi noun morph and the Punjabi dictionary because they are not Punjabi terms. However, they are very relevant and are called common nouns in English and Punjabi, and they can affect the relevance of lines. A separate list containing only these common English-Punjabi terms was created, and input terms are looked up in it. The score of this feature is the ratio of the number of common English-Punjabi noun terms in a line to the length of the line, so it varies from zero to one. We found 18,245 common English-Punjabi noun terms, which make up around six percent of the Punjabi news corpus. The accuracy of this sub-phase, tested on fifty documents of the Punjabi news corpus, is around 95 percent. The five percent errors are due to common English-Punjabi terms missing from the list.

E. Detection of Key terms and Title Terms in Punjabi Text

Key terms [7] are useful for extracting important lines. This system for detecting Punjabi key terms and title terms was first implemented by Gupta & Lehal (2011) [11]. Punjabi key terms are those noun terms that have a high value of Punjabi term frequency-inverse lines frequency.
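A term frequency-inverse lines frequency weighting of this kind can be sketched as follows. The sketch assumes the usual product form, tf(t, line) × log(|L| / LF(t)), with a natural logarithm and with each line already reduced to its noun terms; the function name `tf_ilf` and these choices are assumptions for illustration, not details confirmed by the paper.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_ilf(lines: List[List[str]]) -> List[Dict[str, float]]:
    """Return, for each line, a mapping from term to its TF-ILF weight.

    Assumptions: each line is a list of noun terms, the log is natural,
    and the weight is term frequency times inverse lines frequency.
    """
    total_lines = len(lines)
    # LF(t): number of lines in which term t occurs at least once.
    line_freq: Counter = Counter()
    for tokens in lines:
        line_freq.update(set(tokens))

    weights: List[Dict[str, float]] = []
    for tokens in lines:
        term_freq = Counter(tokens)  # count of each term within this line
        weights.append({
            term: count * math.log(total_lines / line_freq[term])
            for term, count in term_freq.items()
        })
    return weights


if __name__ == "__main__":
    # Placeholder tokens: "b" occurs in every line, so its weight is zero.
    for row in tf_ilf([["a", "b", "a"], ["b", "c"], ["b"]]):
        print(row)
```

A term that appears in every line gets a weight of zero under this form, so only terms concentrated in a few lines stand out as key terms.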
Here, the Punjabi noun term frequency is the count of a particular Punjabi noun term in a given line. The inverse lines frequency uses the number of lines that contain that noun term. It is calculated as log(|L| / LF(t)), where |L| is the total number of lines in the document and LF(t) is the number of lines containing the noun term t. The F-measure, Recall and Precision of key-term extraction are 85.2 percent, 90.6 percent and 80.4 percent respectively. These measures were determined by studying the key-term extraction output on 50 documents of the Punjabi news corpus. The errors of around fifteen percent are due to Punjabi noun terms missing from the Punjabi noun morph, mistakes in the Punjabi dictionary, typing (syntax) errors in the input, and violations of the noun stemming rules.

Title sentences are the main sentences of a document containing one or more news articles. Punjabi lines containing title key terms are relevant [5]. Title key terms are obtained by removing Punjabi stop terms from the title sentences. The score of this feature is the ratio of the number of unique title key terms in a line to the total number of title key terms. Its accuracy, measured on 50 documents of the Punjabi news corpus, is around 97 percent. The three percent errors arise because some stop terms remain in the title sentences, since the Punjabi stop-term list contains only 615 stop terms.

F. Calculation of Line Relative Length

Short lines are not preferred for the summary [5] because they usually carry very little information, whereas long Punjabi sentences can carry a lot of information. The value of this feature is the ratio of the number of terms in a line to the number of terms in the longest line, so it varies from zero to one.

Score(Relative Length) = number of terms in line / number of terms in longest line

G. Font Feature of Punjabi Terms

Punjabi lines containing terms in bold, underlined or italic font, in quotation marks, or in a larger font size are important and should be included in the summary. If this feature holds for a line, its font flag is set to 1; otherwise it is 0.

H. Finding Scores of Lines for Final Summary

Lines are ranked by applying the line-feature equation. Line scores are calculated as

Score(line) = feature_1 + feature_2 + feature_3 + ... + feature_n

Here feature_1, feature_2, feature_3, ..., feature_n are the values of the features of a line, calculated in the various steps of this summarizer. Higher-scoring lines are retrieved into the summary in the order in which they appear in the input text, which maintains coherence among the lines.

III. RESULTS AND DISCUSSIONS

The system was tested on 50 news articles from the Punjabi news corpus. These documents were of mixed type, containing single or multiple news items in the same document. The data set has 6,185 lines and 72,689 terms. The system was evaluated with both intrinsic and extrinsic measures. Four intrinsic measures were applied [9] to evaluate the Punjabi summaries: a) F-measure, b) cosine similarity, c) Jaccard coefficient and d) Euclidean distance. Two extrinsic evaluation tasks were applied: i) question answering and ii) key-term association. The intrinsic results are given in Table I.

TABLE I
INTRINSIC SUMMARY EVALUATION

Compression Ratio   F-measure   Cosine Similarity   Jaccard Coeff.   Euclidean Distance
10%                 98.45       98.89               97.30            0.10
30%                 96.53       97.56               95.99            0.29
50%                 95.11       96.23               95.12            0.48

The results of the extrinsic evaluation at different C.R. are given in Table II.

TABLE II
RESULTS OF EXTRINSIC SUMMARY EVALUATION

Compression Ratio   Efficiency of Question Answering   Efficiency of Key Terms Association
10%                 80.67                              81.88
30%                 85.58                              94.46
50%                 90.45                              96.92

IV. CONCLUSIONS

This is the first time a Punjabi summarizer has been developed with such high accuracy for documents containing single and multiple news items. The language components used in this system, such as the Punjabi stemmer, standardization of Punjabi nouns, detection of Punjabi names, detection of Punjabi key terms, the list of Punjabi names, the list of common English-Punjabi noun terms, the list of Punjabi stop terms, and the lists of prefix and suffix parts of Punjabi names, were created from scratch because these resources did not exist. Moreover, this is the first time these Punjabi language components have been developed, and they may be useful in implementing other NLP applications for Punjabi.

REFERENCES

[1] F. Kyoomarsi, H. Khosravi, E. Eslami, P.K. Dehkordy, “Optimizing Text Summarization Based on Fuzzy Logic”, IEEE International Conference on Computer and Information Science, University of Shahid Kerman, UK, 2008, pp. 347-352.
[2] V. Gupta, G.S. Lehal, “A Survey of Text Summarization Extractive Techniques”, International Journal of Emerging Technologies in Web Intelligence, vol. 2, 2010, pp. 258-268.
[3] J. Lin, “Summarization”, In Encyclopedia of Database Systems, Springer-Verlag Heidelberg, Germany, 2009.
[4] V. Gupta and G.S. Lehal, “Automatic Punjabi Text Extractive Summarization System”, In International Conference on Computational Linguistics COLING-2012, IIT Bombay, India, 2012, pp. 191-198.
[5] M.A. Fattah, F. Ren, “Automatic Text Summarization”, In World Academy of Science Engineering and Technology, vol. 27, 2008, pp. 192-195.
[6] K. Kaikhah, “Automatic Text Summarization with Neural Networks”, In IEEE International Conference on Intelligent Systems, Texas, USA, 2004, pp. 40-44.
[7] J. L. Neto, A.D. Santos, C.A.A. Kaestner, A.A. Freitas, “Document Clustering and Text Summarization”, Int. Conference on Practical Application of Knowledge Discovery & Data Mining, London, 2000, pp. 41-55.
[8] V. Gupta and G.S. Lehal, “Complete Pre processing Phase of Punjabi Language Text Summarization”, In International Conference on Computational Linguistics COLING-2012, IIT Bombay, India, 2012, pp. 199-205.
[9] M. Hassel, “Evaluation of Automatic Text Summarization”, Licentiate Thesis, Stockholm, Sweden, 2004, pp. 1-75.
[10] V. Gupta and G.S. Lehal, “Named Entity Recognition for Punjabi Language Text Summarization”, International Journal of Computer Applications, vol. 33, 2011, pp. 28-32.
[11] V. Gupta and G.S. Lehal, “Automatic Keywords Extraction for Punjabi Language”, International Journal of Computer Science Issues, vol. 8, 2011, pp. 327-331.
[12] V. Gupta and G.S. Lehal, “Automatic Text Summarization System for Punjabi Language”, Journal of Emerging Technologies in Web Intelligence, vol. 5, 2013, pp. 257-271.

http://www.ijettjournal.org