Chapter 1
Introduction

Text mining is a text analysis technique and a part of data mining. Its basic aim is to extract useful information from the large amount of available data. Text summarization is the technique of creating an abstract or summary of one or more texts automatically by a computer system. Research in this field started in the sixties in American research libraries. A huge number of documents, books and scientific papers needed to be available electronically, and they had to be easily reachable and searchable. American libraries took an early interest in this field because they needed to store all of their books, research papers and so on in a limited database. Since their storage capacity was limited, they had to summarize the documents, index them and make them searchable. Some books and documents already contained a summary, but where no ready-made summary existed one had to be prepared. Luhn (1958), Edmundson (1969) and Salton (1988) started research in this direction many years ago. With the increasing use of the internet, interest in text summarization techniques has been awakened again. The scenario has changed with time: earlier, storage space was limited but people had time to extract and study the text as per their needs; today storage is practically limitless and cheap, but the information available over the internet or on a system is so abundant that searching it manually is impossible. It is very difficult to search the data and to select which information should be unified, so filtration of the information is also required.

Natural Language Processing (NLP) makes a computer system capable of understanding human language. It is a unit of artificial intelligence. Implementing NLP applications is very challenging because the computer system requires humans to speak unambiguously, precisely and in a highly structured way, which is not always possible. Modern NLP algorithms are based on machine learning and statistical machine learning. For automated summarization, the concept of automated learning is required, and automated learning procedures can make use of statistical inference algorithms. Major application areas of NLP are automatic summarization, coreference resolution, discourse analysis, machine translation, morphological segmentation, named entity recognition, natural language generation, natural language understanding, optical character recognition, part-of-speech tagging, parsing, relationship extraction, sentence boundary disambiguation, sentiment analysis, speech recognition, speech segmentation, word sense disambiguation, information retrieval, information extraction, speech processing, stemming and so on. NLP is the base domain for automated text summarization, and different techniques for solving this information overload problem are under research [1-3].

Automated text summarization is a hard problem of NLP: one needs to understand the key points of the available text in order to analyze and summarize the data. Summarization is investigated as a subfield of NLP. The term automated text summarization means summarizing the text automatically rather than manually; to understand it we first need to understand text summarization itself. In this work, text summarization is preceded by text categorization. The main idea behind automatic summarization is to present the main information of a document in less space. Summaries differ along several dimensions. Indicative summaries contain information about the relevance of the text, i.e. what the text is about, rather than displaying any of its content, whereas an informative summary provides a shorter version of the detailed document.
Topic-oriented summaries are built around a topic of interest to the reader, whereas generic summaries depict the information according to the author's mindset. In text categorization the rudimentary idea is to consider the various structural and statistical features present in the text, such as character n-grams, word frequency, lemmas, stems and POS tags, together with the target categories [4]. To categorize data automatically, the items to be classified should contain features useful for distinguishing and differentiating the data among the different categories. To summarize data automatically and as accurately as possible, categorization plays a wide and important role: it lets us analyze which document is most relevant to which category, and then provide a summary of multiple documents related to a single domain. In this work we maintain a ratio over the data, based on categorization and summarization, through which we can find a document's closeness to a specific cluster. The summarization task needs to identify informative evidence in the given data that is most pertinent to its content and to create a synoptic version of the original document. The informative evidence, analogous to that used in summarization, may also provide clues for text categorization. To achieve this objective we propose a criterion function which decides the relevance of a document. This work shows an improvement in the efficiency and accuracy of categorization and summarization of text documents after removing ambiguity. To remove the ambiguity, the correlation among the text data is found; the discovered pattern helps to classify the data, and we then apply a discrete differential approach to categorize the data and a sequential similarity approach to summarize it.

1.1 Problem Statement
The World Wide Web demands text summarization because it contains a huge amount of data that is not easily accessible by users. It takes a great deal of time to find, understand and analyze the data, and after spending that time people often find the data or document irrelevant to them. Therefore we require a technique that helps determine the relevant text as well as the relevant documents.

1.2 Aim/Objective
Perform centroid-based multi-document summarization on text data. The main aim of this project is to build an application which automates both text categorization and summarization by measuring and considering the features of the original text. To optimize the text categorization and summarization technique we study the different available approaches and the research done in this area. For this project our priority is to give experimental results which can be used by any organization and which perform better.

1.3 Motivation
Text categorization and summarization come into the picture when organizing national conferences and during research work. The objective of optimizing automatic summarization is to cut the cost and increase the benefits. Day by day the WWW is being overloaded with information and data in huge amounts. All of this data is available for users to access, but they do not have enough time to spend on such big data to search for and analyze the data relevant to their needs. They require the most relevant documents, in a shortened format, so that they can get the information easily.
If the summary is related only to an abstract, however, it does not give them the actual idea behind the document. Similarly, research organizations need to classify multiple documents belonging to different domains and to understand each author's scenario, which requires studying them; but they do not have time to read and understand huge amounts of data, and efficiency and accuracy also decrease. Therefore the need for automated text summarization arises, and some research work has already been carried out in past years. Business, the pharmaceutical industry and academic research are some of the domains in which data is found in vast amounts. Businesses apply such techniques to competitor and customer data to evaluate their value in the market. Biomedical and pharmaceutical organizations need this kind of application to analyze disease- and patient-related data; they need to mine patents and research articles related to different kinds of medicines, or to medications already introduced, in order to improve them. The amount of textual data and information available online in these discrete fields and domains is significant. This area requires large improvements and its scope is vast. Such an application can help users understand and get clarity about the prototype developed in their mind. Speedy and easy access to data and information is required, and different domains need automated text summarization for efficient work.

1.4 Background
Text summarization is a technique [5] to shorten text data as compared to the given data. We need this concept to use storage space optimally. As the available information increases day by day with new technologies and experiments, it becomes very difficult to summarize the data manually; that is why the requirement for automatic summarization has grown. Automatic summarization is when a computer system shortens the data from the actual data automatically. In recent times there has been an explosive growth of written information available on the World Wide Web, and hence a growing requirement for rearranging and accessing this data in a pliable way for easy access. The solution available for this problem is text summarization, and categorizing the data first is an add-on to the summarization solution. If the data is classified into its relevant categories before summarizing, the result is much better for the user: he does not need to read unnecessary data, even when it is available in summarized form.

Automated text summarization works on natural language. Natural Language Processing (NLP) is a part of artificial intelligence and includes two approaches, Natural Language Generation and Natural Language Understanding. By Natural Language Understanding we mean taking a spoken or written sentence and working out what it means; Natural Language Generation means taking some formal structure or representation of what we want to say and working out a way to express it in natural (human) language. Applications of NLP are machine translation, database access, information retrieval, text categorization, extraction of data from text and so on. Processing text in natural language includes the following steps:
Speech recognition
Syntactic analysis
Semantic analysis
Pragmatic analysis
In speech recognition, analogue signals are converted into a frequency spectrum and the basic sounds of the signal, and phonemes are converted into words. The problem that arises is that there is no simple mapping between sounds and words.
In syntactic analysis, rules of syntax represent the correct organization of words. Syntactic analysis provides a structure for grouping words to create a meaningful sentence. It also includes parsing, in which the grammar rules and a sentence are provided and the task is to check whether the sentence is grammatically correct or not. Example grammatical rules are:
o Sentence → noun phrase, verb phrase
o Noun phrase → proper noun
o Noun phrase → determiner, noun
o Verb phrase → verb, noun phrase
The complication with syntactic analysis is ambiguity and incorrect parsing of sentences; therefore the next step, semantic analysis, is required. Semantic analysis helps in analyzing the meaning of sentences together with their syntactic structure, i.e. grammar plus semantics, but the problem of ambiguity is not fully resolved here either. Pragmatic analysis uses the context of the utterance, handles pronouns and handles the remaining ambiguity.

Text summarization is poised to become a widely accepted solution to the larger problem of content analysis. Initially the summarization task was considered only as an NLP problem. Text summarization consists of the following steps:
Database
Tokenization
Stop word removal
Weighted term
POS tagging
Word sense disambiguation
Pragmatic analysis
Extracted summary
Figure 1.1: Summarization steps
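As a concrete illustration of the steps in Figure 1.1, the sketch below shows a minimal frequency-based extractive summarizer in the spirit of Luhn's approach. It is only an illustrative sketch: the tiny stop-word list, the regular expressions and the 30% extraction ratio are assumptions made for this example, not parameters of the system described in this report.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (assumed for this sketch only).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "for", "on"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def summarize(text, ratio=0.3):
    """Return the top `ratio` fraction of sentences, in original order,
    scored by the frequency of their non-stop-word terms."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    term_freq = Counter(t for t in tokenize(text) if t not in STOP_WORDS)
    scores = []
    for idx, sent in enumerate(sentences):
        terms = [t for t in tokenize(sent) if t not in STOP_WORDS]
        score = sum(term_freq[t] for t in terms) / (len(terms) or 1)
        scores.append((score, idx, sent))
    keep = max(1, int(len(sentences) * ratio))
    top = sorted(scores, reverse=True)[:keep]
    # Keep the original sentence order, as an extract should.
    return " ".join(s for _, _, s in sorted(top, key=lambda x: x[1]))
```

Real systems replace the plain frequency weighting with the richer features discussed later, such as sentence position, cue words and title overlap.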
1.4.1 Types of summary
Summaries are classified into different categories, each with its own difficulty level for automatic summarization. They are as follows:
Classification on the basis of text origin:
Extractive summarization: the summary consists of text already present in the actual text.
Abstractive summarization: some new text is generated by the summarizer.
Classification on the basis of purpose:
Indicative summarization: this classification is based on the reader's perspective; this type of summary gives the reader an idea of whether it would be worthwhile to read the entire document or not.
Informative summarization: this kind of summary contains the main idea of the original text from the author's point of view; the summary delivers the main idea the author wants to convey to the reader.
Critical summarization: this sort of summary criticizes the original document; for a scientific paper it would contain the methods and experimental results.
With respect to the level of automation, indicative summaries have the highest probability of being produced automatically, whereas critical summarization has the least chance.

1.4.2 Types of evaluation
Evaluation is quite difficult for summarization techniques and approaches. It can be divided into two phases:
Intrinsic: the summary is tested in and of itself; these tests measure the summary's quality and informativeness.
Extrinsic: the summarization system is evaluated relative to a real-world task.

1.4.3 Current state of the art
NIST (the National Institute of Standards and Technology) runs the Document Understanding Conference (DUC), in which summarization techniques are compared. Examples of recent techniques are described in the proceedings of DUC 2001 [9].

1.4.4 Multi-document summarization
By multi-document summarization we mean compressing or shortening the data from various documents and generating a single summary for them. One issue is that of novelty detection: given an ordered set of documents, only the first document is summarized in full. The documents can be ordered, for example, by search-engine hits with respect to relevance or, for news articles, by date. After the first document is summarized, only previously unseen information is summarized from the remaining documents. Multi-document summarization has not yet been implemented completely successfully and still has some issues, and these issues differ from those of single-document summarization. Higher compression is needed: if a 10 percent summary is generated for each single document and a number of documents are available, then simply combining the individual summaries is not acceptable, so basic extractive summarization is called into question. Also, the data in multiple documents can be redundant, and we need to find out which data is important and needs to be included in the summary; intersecting sentences between duplicated or repeated data need to be extracted. Rhetorical relations define the structure of similarity between two sentences, i.e. they identify whether two sentences contradict each other or deliver the same information, and also which sentence delivers the fresher information.

1.4.5 Use of Discourse Structure
By discourse structure we mean establishing the relationship between sentences. The order of the sentences is not random; they have some correlation with each other. One sentence may deliver the basics of a concept or give its history while the adjacent sentence elaborates on the previous information. These are just a few of the relations existing in Rhetorical Structure Theory (RST) [12]. A summarizer using RST would perform well, although automatically parsing RST relations is difficult.

1.4.6 Extractive Indicative Summaries
Various methods exist for selecting which sentences to use in a summary. Luhn [11] did early work on this, followed later by Edmundson [10]. Luhn proposed that sentences could be selected if they contained content words of the document. Content words are defined as words whose frequency lies between given limits: the most frequent words are excluded, and the frequency limits are determined from a corpus.

1.5 Major Approaches
Single-document extractive summarization: Early techniques for sentence extraction compute the score of every sentence based on features such as sentence position. The position feature was described by Baxendale in 1958 [5] and Edmundson in 1969 [6]. Along with these features, Luhn in 1958 introduced word and phrase frequency as a feature, and Edmundson also introduced key-phrase features. In recent times more sophisticated techniques have been introduced to extract sentences from the given text. These techniques are based on machine learning and natural language analysis: machine learning helps in identifying the important features, whereas natural language analysis extracts key passages and relations among words rather than a bag of words. Kupiec, Pedersen and Chen [1] in 1995 used a Bayesian classifier to develop a summarizer; they combined features and characteristics from a corpus of scientific articles and abstracts. Lin and Hovy (1997) [7] used machine learning to study individual features; they worked on the sentence position feature and analyzed how the position of a sentence affects sentence selection. Mittal [2] instead looked at phrases and crucial words and their syntactic structure, using a statistical approach.
Ono, Sumita and Miike analyzed the position of the text according to a predefined structure; this methodology needs a system that can reliably work out the discourse structure.

Single-document summarization through abstraction: Abstractive approaches are those which do not follow extraction. These approaches emphasize information extraction, ontological information, information fusion and compression. This method gives rise to more accurate, high-quality summaries in their restricted domain. The reduction process was also a good research area for other researchers. Knight and Marcu (2000) [3] worked on rule-based reduction: they tried to compress the parse tree to generate a shorter but still maximally grammatical version. This approach summarizes sentences accordingly (for example, two sentences are summarized into one, or three sentences into two or one, and so on). Abstractive summarization has not been developed beyond the proof-of-concept stage.

Chapter 2
Literature Survey

A large amount of work has been done on automatic text summarization in recent years, and most of the researchers have been concerned with the extractive summarization technique. Extractive summarization and abstractive summarization are the two approaches in this area. In extractive summarization, some important sentences are selected and combined to make the summary of the original text; the summary can be called a subset of the original text which contains its important information. In abstractive summarization, the summary contains an abstraction of the original text and conveys its underlying meaning.

Baxendale (1958) analyzed that the topic sentence comes mostly at the start of a paragraph in a document, while a paragraph's last line can also contain the topic line [6]. He found that 85% of paragraphs have the first line as the topic line and 7% of paragraphs have it as the last line, so the topic sentence can be chosen from among them. This positional approach has been used in many complex machine-learning-based systems. Edmundson (1969) describes a system that produces extracts of documents [7]. The author developed a set of rules for extracting data manually from the original data, which was applied to a set of four hundred technical documents. Two new features, the presence of cue words and the skeleton of the document, were added, while the two characteristics of word frequency and positional importance were carried over from previous works. McKeown and Radev (1995) developed template-driven message understanding systems, reusing already existing technology [8]. McKeown (1999), Radev et al. (2000) and others generate a composite sentence from each cluster (Barzilay et al., 1999), while some approaches work dynamically by including each candidate passage only if it is considered novel with respect to the previously included passages. Summarization techniques are divided into two categories: supervised and unsupervised. Supervised techniques rely on machine learning algorithms trained on pre-existing document-summary pairs, whereas unsupervised techniques are based on properties and heuristic searches derived from the text. The summarization task is then divided into two classes, positive and negative samples: positive samples are the sentences that are included in the summary, whereas negative samples are those that are not part of the summary [8].
In [11], the use of a genetic algorithm (GA), a feed-forward neural network (FFNN), a probabilistic neural network (PNN), mathematical regression and a Gaussian mixture model (GMM) for automatic summarization has been studied. The article [12] presents a multi-document, multi-lingual, theme-based summarization system based on modeling text cohesion.

Figure 2.1: Clustering techniques

In recent times, graph-based methods have been used for sentence ranking. For computing sentence value, LexRank [10] and [11] are two available systems that use the PageRank and HITS algorithms. Eigenvector centrality in a graph representation is the concept used by LexRank. Graph-based methods are completely unsupervised and derive an extractive summary from the given text alone. The summarization task is further divided into query-based and generic approaches. In a query-based approach summarization is done on the basis of a query [12,13,14,15,16], while the generic approach gives an overall sense of the document's content [7-14,15,2,13]. The QCS system follows a three-step procedure: query, cluster and then summarize [16]. In [12], a generic, a query-based and a hybrid summarizer are developed. The generic summarizer uses a combination of discourse information and information obtained via traditional surface-level analysis; the query-based summarizer uses query-term information, and the hybrid summarizer uses some discourse information together with query-term information. Automatic document summarization is an extremely interdisciplinary research area related to artificial intelligence, computer science, multimedia and statistics, as well as cognitive psychology. Event Indexing and Summarization (EIS) was introduced as an intelligent system; it was based on a cognitive psychology model (the event-indexing model) and on the roles and importance of sentences and their syntax in document understanding. Rokaya [16] uses an abstractive summarization technique for a custom summary based on the field of interest and keywords for the user's needs; these keywords and fields of interest are determined and selected by the user. Rokaya used the power link algorithm to obtain a much shorter summary. This algorithm works with the concept of co-word analysis research and follows a template-based approach in which the topic sentence plays an important role; extraction is divided into two phases, i.e. extract the passage and then delete the non-effective sentences.

Most early experiments on single-document summarization started with technical articles. In 1958 Luhn described the research performed at IBM in the 1950s. Luhn explained that, to know the importance of any word, one needs to calculate the frequency of the words; some key concepts included in this paper still have a vital scope for the future. In this system, the first step is stemming, where words are reduced to their basic form, followed by stop word removal. Luhn then assembled the words in order of decreasing frequency, after which indexing is performed, which is very important for summarization. At the sentence level, a significance factor was computed that reflects the occurrence of important terms in the sentence and the distance between them caused by intervening insignificant words. Sentence ranking is also important to get a more specific summary: sentences should be indexed on the basis of their importance level, and their significance factors play an important role.
Sentences with higher significance are placed at the top of the stack, whereas those with low significance come last, and the top-level sentences are selected to create the summary. Related work was performed by Baxendale (1958) at IBM. He looked for additional characteristics that help identify the more relevant components of an article; this characteristic is sentence position. To complete his study the author examined two hundred paragraphs and observed that in 85% of them the topic sentence appeared at the beginning. Thus, for selecting a correct topic sentence, one of these positions should be picked. Various machine-learning-based systems have used this sentence-position approach over the years. In 1969 Edmundson introduced a system that produces article extracts. His main contribution was the elaboration of a typical structure for research on extractive summarization. Initially the author set up a process of generating extracts manually, and these manual extracts were applied to the set of 400 documents. The two characteristics of positional significance and word frequency were carried over from the two earlier mechanisms, and two further word characteristics were used: the existence of cue words and document framing. Weights were attached to each of these characteristics and features, and the weighted features were combined manually to score every sentence. The study found that 40% of the automatic extracts agreed with the manually produced extracts.

In another paper, the authors presented a new schema called H-Base, which is used to process transaction data for a market basket analysis algorithm. The market basket analysis algorithm runs on Apache MapReduce and reads data from H-Base; the transaction data is converted and sorted into a data set of key-value pairs, and after the whole process completes the data is stored in the distributed file system. The size of data is growing day by day in the field of bio-informatics, so it is not easy to process and find the important sequences present in bio-informatics data using existing techniques. The authors of another paper discussed new technologies for storing and processing large amounts of data, such as Greenplum. Greenplum is a massively parallel processing technique used to store data; it is also used to process huge amounts of data because, being based on parallel processing, it generates results in much less time than existing technologies for processing huge amounts of data. Morphological variants and synonyms were also combined when considering semantic terms, the former being identified by using WordNet in 1995; the corpus used in the trials was from newswire, a few of which belonged to the evaluations.

Another paper focuses on frequent pattern mining of gene-expression data. Frequent pattern mining has become a much-debated and focused area in the last few years. A number of algorithms exist which can be used to mine frequent patterns from a data set, but in this paper the authors applied a fuzzification technique to the data set and then applied a number of techniques to find more meaningful frequent patterns from the data set.
In another paper, the authors described that classification and patterns in stock market or inventory data are really significant for business support and decision-making. They also proposed a new algorithm for mining patterns from large amounts of stock market data in order to guess the factors that are affecting or decreasing a product.

A set of metrics was presented by Lin (2004) for the automatic assessment of summaries. Lin in 2006 suggested the use of an information-theoretic technique for the automatic assessment of summaries. The vital idea is to use a divergence measure between a pair of probability distributions, in this case the Jensen-Shannon divergence, where the first distribution is derived from an automatic summary and the second is derived from a set of reference summaries. This method has the benefit of covering both the single-document and the multi-document summarization circumstances.

Chapter 3
Project Design and Implementation

3.1 Automatic Summarization Methods
3.1.1 Automatic Extraction
3.1.1.1 The principle of automatic extraction
The fundamental idea of automatic extraction is that an article consists of a group of sentences, and sentences are groups of words; these sentences and words are arranged in a specific order to make meaningful paragraphs and sentences respectively. Automated extraction includes four steps:
Measuring the weight of each term.
Measuring the weight of each sentence.
Arranging the sentences in decreasing order according to their weight.
Deciding a threshold value; sentences whose weight is better than the threshold value are nominated as the summarized text.
The order of the summary sentences remains the same as in the original text.
3.1.1.2 Sentence Weighting
In automated extraction, measuring the weight of sentences usually uses a model which depends on term enumeration. In the vector space model, a sentence Si is defined as Si = <W1i, W2i, W3i, ..., Wni>, where each dimension of this vector signifies the weight of a term wm:
Wm = <F(wm), T(wm), L(wm), S(wm), C(wm), I(wm)>
The weight of the term wm is calculated as follows:
Score(wm) = x1*F(wm) + x2*T(wm) + x3*L(wm) + x4*S(wm) + x5*C(wm) + x6*I(wm)
where x1, x2, x3, x4, x5 and x6 are adjustment coefficients; different documents containing textual data are assigned different coefficient values. F represents the frequency of the term, T indicates its presence in the title of the document, L represents the position of the word in the document, S stands for syntactic structure, C stands for cue terms, and I stands for indicative phrases [9].
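The weighting scheme of Section 3.1.1.2 can be sketched as a simple linear combination. The feature names follow the definitions above, but the concrete coefficient values and the normalization of the features are assumptions made for illustration only.

```python
# Illustrative adjustment coefficients x1..x6 (assumed values, tuned per document type).
COEFFS = {"F": 0.3, "T": 0.2, "L": 0.15, "S": 0.1, "C": 0.15, "I": 0.1}

def term_score(features):
    """Score(w) = x1*F(w) + x2*T(w) + x3*L(w) + x4*S(w) + x5*C(w) + x6*I(w).

    `features` maps each feature name (F: frequency, T: in title, L: position,
    S: syntactic structure, C: cue term, I: indicative phrase) to a numeric
    value, assumed here to be normalized to [0, 1]."""
    return sum(COEFFS[name] * features.get(name, 0.0) for name in COEFFS)

def sentence_weight(term_feature_vectors):
    """Weight a sentence as the sum of its term scores, following the
    vector-space representation Si = <W1i, ..., Wni>."""
    return sum(term_score(fv) for fv in term_feature_vectors)

# Example: a term that is fairly frequent, occurs in the title,
# and appears near the start of the document.
print(term_score({"F": 0.6, "T": 1.0, "L": 0.8}))
```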
3.1.2 Understanding-based Automatic Summarization
To obtain the language structure, this kind of summarization uses semantic information. Domain information is also needed to presume and judge, in order to fetch the meaning of an expression; ultimately the summary is produced from the meaning expression. It consists of four steps:
Parsing: to parse the sentences, linguistic information from a dictionary is used and a syntax tree is created.
Semantic analysis: the syntax tree is transformed into semantic expressions, which are based on logic and meaning; this step uses the semantic information of the knowledge base.
Pragmatic analysis and information extraction: the information pre-stored in the database or system is the basis for analyzing the context, after which the central information is extracted.
Summary generation: the information centralized in the system is transformed into an information table; this centralized information is stored in the information table in precise and unabridged form [9].

3.1.3 Information Extraction
Information extraction is the foundation of the structure of the summary. The summarization process is divided into two phases: the first concerns the selection of the data and the second concerns the generation of the summary.

3.1.4 Discourse-based Automated Text Summarization
Discourse is a basic framework. Different parts of a discourse perform different functions, and a complex relationship exists between them. Automatic summarization based on discourse tries to examine the fundamental characteristics of the discourse in order to find the article's central, information-providing data. In today's scenario, automated summarization relies on several significant research topics: structure analysis based on the rhetorical approach, pragmatic analysis and latent semantic analysis [9]. In this project the summarization process is applied to the categorized documents. The approach is based on centroid-based multi-document summarization, with the addition that it follows semantic-based categorization.

3.2 Approach to Design
Original text → Tagger tool → Document set → Training model → Algorithm → Summarized text
Figure 3.1: Design approach

3.2.1 Database
In this project we first created a repository of the documents. These documents are the source of the data to be summarized, and either one or multiple papers can be fetched for processing. The database also contains the output of the process: we store the categorization output for further reference, and it also contains the summarized data of previous work. To implement the categorization we need to perform the steps mentioned below.

3.2.2 Tokenization
Tokens are individual units having individual meanings. A sentence contains a number of tokens, or we can say that a group of tokens forms a sentence, and these tokens are separated by a delimiter. To obtain the tokens, a simple program divides the text into words, identifying the separation between two words by the delimiter between them. Such a program also separates the punctuation, dividing easily at whitespace and punctuation marks; however, most languages are not completely consistently punctuated, which creates some confusion. Text by itself is just data, unaware of the information it contains: it is simply a combination of words and sentences. Tokenization is the process of breaking it up into phrases and words, which are series of characters, without specific knowledge about sentence boundaries. These tokens also include numbers, parentheses, punctuation marks and question marks. Alphabetic languages generally write sentences as words separated by blanks; removing the white space is performed by the tokenizer, and these spaces are replaced by word boundaries. Different marks such as punctuation, question marks and parentheses need to be removed. The main and most important difficulty is term ambiguity, i.e. characters that are ambiguous with sentence markers.

The tokenization instructions can be stated as: first, split the series of characters at locations of white space, then strip the quotes, punctuation marks and parentheses at both ends to get the series of tokens. This instruction is only partially right, because punctuation and white space are not always valid indicators of word boundaries. In linguistic processing, the procedure of token making, tokenization, divides the sequence of characters into words, punctuation, numerals and other symbols. These terms and series of characters are known as tokens, and the systems and tools performing a tokenization process are called tokenizers. To understand the basic behaviour of a tokenizer, consider the statement given below:
Input data: I am preparing my thesis report
Output tokens: I | am | preparing | my | thesis | report
According to the above example, each word of the sentence separated by a space is called a token. Hence the system breaks the text data into tokens as follows:
Split words at punctuation marks and remove the punctuation.
A dot (.) that is not followed by white space is considered part of a token.
Split words at hyphens, unless the token contains a number, in which case the whole token is treated as one unit and is not separated.
Tokenization is a word segmentation process; the crucial problem is how to describe what is meant by a single "term". Generally a tokenizer relies on simple heuristics, for instance: all contiguous sequences of alphabetic characters (and likewise of digits) belong to one token, and the resulting tokens are separated by punctuation or whitespace.
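A minimal tokenizer that approximates the rules above (split at whitespace and punctuation, keep an internal dot inside a token, keep hyphenated or dotted compounds together) could be sketched as follows; the regular expression is an assumption for illustration and is not the tokenizer used in this project.

```python
import re

# One token is either a word that may contain internal dots/hyphens
# (e.g. "3.14", "COVID-19"), or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[.-]\w+)*|[^\w\s]")

def tokenize(text):
    """Return the list of tokens, splitting at whitespace and punctuation."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("I am preparing my thesis report."))
# ['I', 'am', 'preparing', 'my', 'thesis', 'report', '.']
```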
3.2.3 Stop Word Removal
Stop words are ordinary words that carry less substantial meaning than the keywords, and search engines generally discard stop words from a sentence. Removing stop words is a highly significant step in obtaining the key phrases of a sentence. Approximately half of the words in a sentence are stop words, which means about fifty percent of the text is not relevant for the search engine.

3.2.4 Weighted Term
Weighted terms are counted based on the frequency of their co-occurrence in the document. This process follows stop word removal. The frequency of each individual word is computed by counting its repetitions: the higher the frequency of the word, the higher the weight of the term. For example, a word repeated 200 times in a document is a more heavily weighted term than a word repeated fewer than 200 times.

POS tagging comprises multiple approaches, each with its own significance for tagging. The list below gives an idea of the different types of POS tagging models:
Supervised POS tagging models
o Rule-based POS tagging (Brill POS tagging)
o Stochastic POS tagging (N-gram POS tagging, Maximum Likelihood, Hidden Markov Model with Viterbi decoding)
o Neural POS tagging
Unsupervised POS tagging models
o Rule-based POS tagging
o Stochastic POS tagging (Baum-Welch algorithm)
o Neural POS tagging
Figure 3.2: Different POS tagging models

3.3 Methods for Automatic Text Summarization
1) Naive Bayes methods
Kupiec et al. (1995) describe a method derived from Edmundson (1969) that is able to learn from data. The classification function categorizes each sentence as worthy of extraction or not using a naive Bayes classifier. Let s be a sentence, S the set of sentences that make up the summary, and F1, ..., Fk the features. The features are similar to those of (Edmundson, 1969), but also include the length of the sentence and the presence of uppercase words. To check and analyze the system, a corpus of technical documents with manual abstracts was used in the following way: for each sentence present in a manual abstract, the authors manually analyzed its match with the actual document sentences and created a mapping (e.g. exact match with a sentence, match with a join of two sentences, not matched, etc.). (A small sketch of such a classifier is given below.)
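The sketch below is a minimal illustration of such a naive Bayes extraction classifier: it estimates P(s ∈ S | F1, ..., Fk) from labelled training sentences with binary features, assuming the features are independent. The feature names and the add-one smoothing are assumptions chosen for this example and are not taken from Kupiec et al.

```python
from collections import defaultdict

class NaiveBayesExtractor:
    """Kupiec-style classifier: is a sentence worthy of extraction?"""

    def __init__(self):
        self.class_counts = defaultdict(int)                      # in-summary / not
        self.feature_counts = defaultdict(lambda: defaultdict(int))

    def train(self, examples):
        """examples: list of (feature_dict, in_summary) with binary feature values."""
        for features, label in examples:
            self.class_counts[label] += 1
            for name, value in features.items():
                self.feature_counts[label][(name, value)] += 1

    def prob_in_summary(self, features):
        """Posterior P(in_summary | features), with add-one smoothing."""
        total = sum(self.class_counts.values())
        scores = {}
        for label in (True, False):
            prior = self.class_counts[label] / total
            likelihood = 1.0
            for name, value in features.items():
                count = self.feature_counts[label][(name, value)]
                likelihood *= (count + 1) / (self.class_counts[label] + 2)
            scores[label] = prior * likelihood
        return scores[True] / (scores[True] + scores[False])

# Toy usage with assumed binary features: sentence length, position, cue phrase.
clf = NaiveBayesExtractor()
clf.train([
    ({"long": 1, "lead_position": 1, "cue": 1}, True),
    ({"long": 1, "lead_position": 0, "cue": 1}, True),
    ({"long": 0, "lead_position": 0, "cue": 0}, False),
    ({"long": 0, "lead_position": 1, "cue": 0}, False),
])
print(clf.prob_in_summary({"long": 1, "lead_position": 1, "cue": 1}))
```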
2) Rich features and decision trees
Sentence position was the single feature studied by Lin and Hovy in 1997. The authors weighed sentences based on their position in the document, terming this the "position method". The idea is based on the discourse structure of the document and on the observation that sentences of greater topic centrality tend to occur in certain specifiable locations (e.g. the title, abstracts, etc.). The authors select sentences by the keywords present in the topic of the document, and the sentences containing those keywords are considered for the summary. They then ranked the sentence positions by their average yield to produce the Optimal Position Policy (OPP) for topic positions for the genre. Further studies considered baseline features such as sentence position or used a simple combination of multiple features. Machine-extracted sentences and manually extracted sentences were then matched, and the result was in favor of the decision tree classifier, whereas for three topics the naive combination won. Lin concluded that this happens because some features are independent of each other.

3) Hidden Markov models
Addressing the problem of inter-dependence of sentences, Conroy and O'Leary (2001) modeled the problem of extracting a sentence from a document using a hidden Markov model (HMM). This is a sequential model, and the authors used it to capture local dependencies between sentences. Only three features were used: the position of the sentence in the document (built into the state structure of the HMM), the number of terms in the sentence, and the likeliness of the sentence terms given the document terms.

4) Log-linear models
Osborne (2002) claims that existing approaches to summarization have always assumed feature independence. The author used log-linear models to relax this assumption and showed experimentally that the system produced better extracts than a naive Bayes model and a hidden Markov model. The log-linear model outperformed the naive Bayes classifier described earlier, demonstrating its effectiveness and strength.

5) Neural networks and third-party features
Svore et al. (2007) introduced an algorithm based on neural nets, evaluated on third-party datasets. The algorithm was proposed to handle the problem of extractive summarization with statistical significance. The labels of the sentences and their features, such as position in the article, were used by the authors to train the model so that a proper ranking of sentences could be made on a test document. Some of the features, based on position or n-gram frequencies, had already been observed in previous work.

6) Maximal marginal relevance
Carbonell and Goldstein (1998) made a major contribution to topic-driven summarization by introducing the maximal marginal relevance (MMR) measure. They merged two concepts: query relevance and freshness (novelty) of information. Based on the topic, the algorithm finds the relevant sentences and then penalizes redundant data, using a linear combination of two measures. The two inputs are the user's query data set, or user profile, which contains the user's relevant searches according to his need, and the data or documents returned by the search engine. The best feature of this algorithm is its topic orientation: it works with the topic keywords, and based on these keywords the system can find documents relevant to the user's need as identified by his query profile. (A small sketch of the MMR selection rule is given below.)
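The MMR selection rule can be sketched as follows: at each step the candidate sentence with the highest combination of query relevance and novelty with respect to the already selected sentences is picked. The cosine similarity over raw word counts and the value λ = 0.7 are assumptions chosen for this illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query, sentences, k=3, lam=0.7):
    """Greedy MMR: pick the candidate maximizing
    lam * Sim(s, query) - (1 - lam) * max Sim(s, already selected)."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    qvec = Counter(query.lower().split())
    selected = []
    while len(selected) < min(k, len(sentences)):
        best, best_score = None, float("-inf")
        for i, v in enumerate(vectors):
            if i in selected:
                continue
            relevance = cosine(v, qvec)
            redundancy = max((cosine(v, vectors[j]) for j in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [sentences[i] for i in selected]
```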
7) Graph spreading activation
This methodology follows the concept of graph-based summarization, in which pairs of similarities and dissimilarities are built. In 1997 Mani and Bloedorn proposed a framework for information extraction whose output was not exactly text: the summary produced by this approach is based on a graph representation, with entities and the relationships among them shown as nodes and edges. Instead of sentence extraction, they use a spreading activation technique to detect prominent regions of the graph. This approach also shares the trait of being topic driven. Summarization is done by employing a pair of graphs; nodes common to both are identified through their synonyms or because they share the same stem, and correspondingly the nodes that are not shared do not have the same or similar meaning, i.e. they are not common. Two different scores are computed for each sentence of the two documents: the first score represents the common nodes present in the document graph and is the average weight of these common nodes, while the second score computes instead the average weight of the difference nodes.

8) Centroid-based summarization
A method called the centroid-based method was introduced to extend the concept of extractive summarization; the MEAD methodology implements this centroid-based approach. It uses three main features to select sentences from the original text: the centroid score, the position score and the overlap with the first sentence. The TF-IDF algorithm is used to calculate the centroid, and the centroid score is then the measure used for sentence clustering: on the basis of the centroid score, clusters of sentences are built up by an indexing technique, for which the position value of the text is also required. The length of the summary is also a constraint: sentence selection is limited by the summary length, and tautological sentences are avoided by checking cosine similarity against previously chosen ones [5]. In previous research, 0 and 1 are the two values used when summarizing the text by extracting sentences from the original data: sentences with value 0 are not part of the summary whereas sentences with value 1 are [6]. A conditional random field has been applied to attain this task [7], and a new extractive approach based on manifold ranking for query-based document summarization is proposed in [8]. (A small sketch of centroid-based scoring is given after this list.)

9) Multilingual multi-document summarization
Summarization is highly required for multiple documents written in various languages. Evans (2005) worked on multi-lingual summarization across multiple languages; previously Hovy and Lin had already worked on this in 1999, but research on multi-lingual summarization is still at an early stage. The framework developed by Evans appears quite useful for newswire applications, which require combining information from foreign news agencies. Evans (2005) emphasized the resulting language, i.e. the language into which all the documents from other languages must be summarized; the resulting language is decided either by the user's requirement or by the documents available in the different languages. The final summary must be grammatically correct because machine translation is known to be far from perfect, and the result must have higher coverage of the different-language documents than of just the English documents.
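Before moving to the implementation, the centroid-based scoring idea of method 8 can be sketched as follows: a TF-IDF centroid is built for the document cluster and each sentence is scored by how much of the centroid it covers, plus a simple position bonus. The weights, the position bonus and the regular expressions are illustrative assumptions, not the exact MEAD parameters.

```python
import math
import re
from collections import Counter

def sentences_of(doc):
    return [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]

def tf_idf_centroid(docs):
    """Centroid: average TF-IDF weight of each term over the document cluster."""
    doc_terms = [Counter(re.findall(r"\w+", d.lower())) for d in docs]
    n = len(docs)
    centroid = {}
    for term in set().union(*doc_terms):
        df = sum(1 for tf in doc_terms if term in tf)
        idf = math.log(n / df) + 1.0
        centroid[term] = sum(tf[term] for tf in doc_terms) / n * idf
    return centroid

def centroid_summary(docs, k=3, position_weight=0.5):
    """Pick the k sentences that best cover the centroid, with a bonus
    for sentences appearing early in their document."""
    centroid = tf_idf_centroid(docs)
    scored = []
    for doc in docs:
        sents = sentences_of(doc)
        for pos, sent in enumerate(sents):
            terms = re.findall(r"\w+", sent.lower())
            centroid_score = sum(centroid.get(t, 0.0) for t in set(terms))
            position_score = position_weight * (len(sents) - pos) / len(sents)
            scored.append((centroid_score + position_score, sent))
    return [s for _, s in sorted(scored, reverse=True)[:k]]
```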
3.4 Implementation
In this work we propose merging two discrete approaches for text summarization: the technique is a compound of centroid-based and multi-document summarization, an idea that arises from the MEAD approach. A Discrete Differential Semantic based (DDS) algorithm is proposed and implemented to employ these approaches. We manually created a corpus containing different files in different formats and related to different domains and categories. As per the main aim of summarizing text from multiple documents, we applied categorization followed by summarization using the discrete differential semantic algorithm.

First we find sentence similarity on the basis of term co-occurrence. Let a document D = {d1, d2, d3, ..., dn}, where n is the number of sentences, and let W = {w1, w2, w3, ..., wm} represent all the distinct words present in document D, where m is the number of words. Most existing document clustering algorithms represent documents using the vector space model (VSM) [17]: each paper is constituted in an m-dimensional space using these words. The feature space in these algorithms is highly dimensional, which can create resistance in the clustering algorithm and become a big challenge. In our method, a sentence Su is represented as the group of distinct terms appearing in it, Su = {t1, t2, ..., tmu}, where mu is the number of different terms appearing in the sentence. To measure the likeness of different sentences we use the normalized Google distance (NGD) [18], but to optimize the summarization result this process is preceded by categorization. In this work we implement abstractive summarization, and to optimize automated summarization an unsupervised clustering approach is adopted.

Before calculating similarity between sentences we need to compute similarities between words. First we categorize the multiple documents into different categories optimally. For that we work on the semantic expression of the words, since each word has a different meaning in different places. In previous work these words were categorized on the basis of a pre-trained wordnet, and those methods find the weighted terms using the TF-IDF algorithm: words are weighed by their occurrences in the document and categorized on the basis of the wordnet available for the different domains, and the higher the number of occurrences, the higher the weight. In the proposed methodology, by contrast, words are analyzed by their semantic meaning: each keyword is correlated with the adjacent words occurring in the sentence and the meaning of the word is then judged by the system. This approach improves the quality of clustering the documents into clusters and works with semantic categorization. A criterion function is considered for clustering the documents: it judges the distance of any document from the centroid of a specific cluster, and it also reflects the similarity measure of the document to the other domains, so that we can determine the highest relevance of the document to a particular domain. The multiple documents related to a field can then be summarized using NGD [18].

After categorizing the documents into their most relevant clusters, we proceed to the summarization part. For that we need to calculate the Normalized Google Distance between two terms; the similarity measure between terms tx and ty is calculated as follows:
1. The number of pages that contain an occurrence of term tx is denoted by f(tx).
2. The number of pages containing occurrences of both terms tx and ty is denoted by f(tx, ty).
3. N denotes the total number of pages.
4. We use the binary logarithm to calculate the similarity measure.
5. The maximum of log f(tx) and log f(ty) is calculated and stored in f(max).
6. The value of log f(tx, ty) is subtracted from f(max).
7. Calculate log N.
8. Find the minimum of log f(tx) and log f(ty) and store the result in f(min).
9. Subtract the result of step 8 from the result of step 7.
10. Divide the value obtained in step 6 by the value obtained in step 9.
11. Let NGD(tx, ty) denote the result of step 10.
SimNGD(tx, ty) = exp(-NGD(tx, ty))     (1)
If the value of NGD(tx, ty) is 0, tx and ty are similar terms; if the value of NGD(tx, ty) is 1, the two terms are different. Expression (1) represents the similarity measure between the terms. Using expression (1), the similarity between sentences Su and Sv is obtained as
SimNGD(Su, Sv) = ( Σ(tx ∈ Su) Σ(ty ∈ Sv) SimNGD(tx, ty) ) / (mu * mv)     (2)
where mu and mv are the numbers of terms in sentences Su and Sv respectively.
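The NGD computation in steps 1-11 and equations (1) and (2) can be written compactly as in the sketch below. It assumes the page counts f(tx), f(ty) and f(tx, ty) and the total page count N are already available (for example from a search-engine index or a document corpus); the function and variable names are illustrative only.

```python
import math

def ngd(f_x, f_y, f_xy, n_pages):
    """Normalized Google Distance between two terms, given their page counts."""
    if f_x == 0 or f_y == 0 or f_xy == 0:
        return float("inf")          # terms never (co-)occur
    log_fx, log_fy, log_fxy = math.log2(f_x), math.log2(f_y), math.log2(f_xy)
    numerator = max(log_fx, log_fy) - log_fxy          # steps 5 and 6
    denominator = math.log2(n_pages) - min(log_fx, log_fy)   # steps 7-9
    return numerator / denominator                     # step 10

def term_similarity(f_x, f_y, f_xy, n_pages):
    """Equation (1): SimNGD(tx, ty) = exp(-NGD(tx, ty))."""
    return math.exp(-ngd(f_x, f_y, f_xy, n_pages))

def sentence_similarity(sent_u, sent_v, counts, n_pages):
    """Equation (2): average pairwise term similarity over the two sentences.
    `counts` maps a term, or a frozenset of two terms, to its page count
    (assumed to be precomputed)."""
    total = 0.0
    for tx in sent_u:
        for ty in sent_v:
            f_xy = counts.get(frozenset((tx, ty)), 0)
            total += term_similarity(counts.get(tx, 0), counts.get(ty, 0),
                                     f_xy, n_pages)
    return total / (len(sent_u) * len(sent_v))
```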
After calculating the similarity between the sentences in a cluster C = {S1, S2, ..., Sn}, collections of similar sentences, i.e. those nearest to a centroid, are clustered together, and by using this technique we summarize the data given in the original documents. For the experiments we prepared our own corpus containing a number of documents, which can be of any length. The CIDR algorithm is used to create the clusters. The system also adds the feature of semantic categorization: it is able to read multiple documents concurrently and analyze their category based on semantic meaning. The meaning of each word is analyzed through its adjacent words. A dictionary stored in the system contains all possible meanings of any word; the candidate meanings are compared with the actual word in the document, and after that an individual virtual file is created containing the words grouped according to their meaning after stop word removal. Once the final words are found, they are related to the available wordnet and the most appropriate category is derived. For example, consider the sentence: "This phone belongs to Apple." The token apple in isolation could have different meanings, either apple the fruit or Apple the company, but within the sentence its meaning is clearer: the sentence shows that the word refers to the company and not to the fruit. This is the significance of the meaning of a word, and it is why semantic categorization is important for higher accuracy of document categorization; it provides the discrete differential categorization base for the summarization. For the summarization itself, NGD is implemented as discussed above.

Chapter 4
Experiment Results

The design and implementation screens are shown below.

Figure 4.1: Document browse window

The above screen has a text field for browsing one or more documents. These documents can be stored at any location in the system, in a single folder or multiple folders, and the files should be in .doc or .docx format. Once the documents are browsed, the user needs to select the next operation he or she wants to perform:
1) Literal categorization: literal categorization is based on the wordnet only; the meanings of the words have no significance here.
2) Semantic categorization: this kind of categorization is completely based on the semantic meaning of the words, which gives more accuracy to the categorization process.
3) View history of the categorization: this button is used to study the categories of previously processed documents.
4) Show summarizer details: this is used to view the previous summaries stored in the system for previously processed documents.
Figure 4.2: Window for categorization and summarization

This window appears after selecting one of the categorization techniques, literal or semantic. It has two functionalities: both the categorization and summarization processes are included. If we want to perform summarization directly, as the old tools did, we can do so; otherwise summarization is performed after categorization. Which categorization technique is performed depends on the selection made on the previous screen: for instance, if lexical categorization was selected, the system performs only lexical categorization here. The grid shown on the left side is used to select multiple files to categorize into their related categories. Initially all check boxes are ticked; we need to deselect the unwanted files and then proceed to categorization.

Figure 4.3: Categorization details

The files selected on the previous screens are now categorized into their related categories, and the file names and categories are shown in the grid presented below. The categorization process is done on the back end of the system, but if the user wants to justify the results he can inspect the detailed categorization process. For that the user needs to check the box of the file whose details he or she wants to read; the point to keep in mind is that only one file can be selected at a time. The "fill dictionary" button is used to add a dictionary to the system: if we want to use this system for another language, we can insert and attach a new dictionary using this button.

Figure 4.4: Lexical categorization

The details of the files selected for lexical categorization are shown in the above screen. This screen contains the final category of each file. This category is computed on the basis of the matching ratio with respect to the different categories, i.e. how much the file is related to each category is identified using the expression:
Matched ratio = total number of matched words / total number of words in the document
The list of matched words is also shown on the screen. The higher the ratio, the higher the relevance to the category. From here we can proceed to summarization.

Figure 4.5: Semantic categorization

If semantic categorization was selected on the initial screen, then we proceed to this window instead. As on the previous screen, documents are shown with their names and related categories, and to justify the result of these categories there is an option to analyze the result in detail. To observe the details we need to select one file at a time, and the details are shown in the upper grid. This grid contains the category of highest relevance to the document in the very first line, followed by the number of matched words and the calculated ratio; the details of each category and its similarity level are shown in the next part of the grid. As with lexical categorization, the higher the ratio, the higher the relevance.

Figure 4.6: Summary of multiple documents

This screen contains the final summary calculated on the basis of NGD, and the stored location of the files is shown in the field above. As per our aim, summarization is done for multiple documents, although it can also be performed for an individual document as per the user's need. The summaries of the different documents are separated by a line so that the user can easily differentiate between documents.
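For completeness, the matched-ratio computation used by the lexical categorization screen (Figure 4.4) can be sketched as below; the toy category word lists stand in for the dictionary and wordnet that the real system loads, and are assumptions made for this illustration.

```python
import re

# Assumed toy category word lists; the real system reads these from its dictionary.
CATEGORY_WORDS = {
    "technology": {"phone", "software", "network", "data", "algorithm"},
    "fruit": {"apple", "mango", "juice", "ripe", "orchard"},
}

def matched_ratio(document, category_words):
    """Matched ratio = number of matched words / total words in the document."""
    words = re.findall(r"\w+", document.lower())
    if not words:
        return 0.0
    matched = sum(1 for w in words if w in category_words)
    return matched / len(words)

def categorize(document):
    """Assign the category with the highest matched ratio."""
    ratios = {c: matched_ratio(document, words) for c, words in CATEGORY_WORDS.items()}
    return max(ratios, key=ratios.get), ratios

print(categorize("This phone runs new software over the data network."))
```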
Chapter 6 Conclusion

In this project, the proposed optimization of the automated text summarization methodology is implemented by a discrete differential algorithm based on NGD, the Normalized Google Distance. We proposed an automatic text summarization method based on NGD that is applied after categorization, built on a centroid-based multi-document summarization approach. We introduced an unsupervised approach to optimize automated document summarization consisting of two steps. First, sentences are categorized both semantically and syntactically, the correctness of the two is compared, and the proposed algorithm gives the improved results. Secondly, once the accuracy of the documents in the related category has been established, summarization is performed. In our study we developed a discrete differential, semantics-based technique to optimize the objective function for summarization, which gives better results to the user. Compared with previous methods that do not include ambiguity removal, we found better results.

Chapter 7 Future Research

We performed automatic text summarization by designing a discrete differential, centroid-based multi-document semantic summarization approach, on the basis of which a summary can be extracted. The approach focuses on the semantic categorization of the documents to divide them into discrete categories; sentence position, frequent patterns, and the meaning of the sentence are the main features considered. The scope for future work is as follows. The project could be made multilingual, reducing the language gap not addressed in this project, so that the system could summarize documents written in different languages. Another enhancement concerns file formats: the existing system does not support all formats, and an updated system would support all kinds of documents. An updated online system could access the World Wide Web directly, so that the storage space currently used for downloaded files could be saved. Multimedia files could also become part of the system in the future.

References:

[1] Kupiec, Julian, Jan O. Pedersen, and Francine Chen, "A trainable document summarizer," in Research and Development in Information Retrieval, pp. 68-73, 1995.
[2] Witbrock, Michael, and Vibhu Mittal, "Ultra-summarization: A statistical approach to generating highly condensed non-extractive summaries," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, pp. 315-316, 1999.
[3] Knight, Kevin, and Daniel Marcu, "Statistics-based summarization - step one: Sentence compression," in Proceedings of the 17th National Conference of the American Association for Artificial Intelligence (AAAI-2000), pp. 703-710, 2000.
[4] M. A. Fattah and F. Ren, "GA, MR, FFNN, PNN and GMM based models for automated text summarization," Computer Speech and Language, Vol. 23, No. 1, pp. 126-144, 2009.
[5] P. B. Baxendale, "Machine-made index for technical literature: An experiment," IBM Journal of Research and Development, Vol. 2, No. 4, pp. 354-361, 1958.
[6] Edmundson, H. P., "New methods in automatic extracting," Journal of the ACM, 16(2): 264-285, 1969.
[7] Lin, C. and E. Hovy, "Identifying topics by position," in Fifth Conference on Applied Natural Language Processing, Association for Computational Linguistics, 31 March-3 April, pp. 283-290, 1997.
[8] Radev, D. R., Hovy, E., and McKeown, K., "Introduction to the special issue on summarization," Computational Linguistics, 28(4): 399-408, 2002.
[9] M.-R. Akbarzadeh-T., I. Mosavat, and S. Abbasi, "Friendship modeling for cooperative co-evolutionary fuzzy systems: A hybrid GA-GP algorithm," Proceedings of the 22nd International Conference of the North American Fuzzy Information Processing Society, Chicago, Illinois, pp. 61-66, 2003.
[10] G. Erkan and D. R. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization," Journal of Artificial Intelligence Research, Vol. 22, pp. 457-479, 2004.
[11] D. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, Vol. 28, No. 4, pp. 399-408, 2002.
[12] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, "Document summarization using conditional random fields," Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), Hyderabad, India, pp. 2862-2867, January 6-12, 2007.
[13] D. M. Dunlavy, D. P. O'Leary, J. M. Conroy, and J. D. Schlesinger, "QCS: A system for querying, clustering and summarizing documents," Information Processing and Management, Vol. 43, No. 6, pp. 1588-1605, 2007.
[14] X. Wan, "A novel document similarity measure based on earth mover's distance," Information Sciences, Vol. 177, No. 18, pp. 3718-3730, 2007.
[15] S. Fisher and B. Roark, "Query-focused summarization by supervised sentence ranking and skewed word distributions," in Proceedings of the Document Understanding Workshop (DUC'06), New York, USA, 8 p., 8-9 June 2006.
[16] J. Li, L. Sun, C. Kit, and J. Webster, "A query-focused multi-document summarizer based on lexical chains," Proceedings of the Document Understanding Conference (DUC'07), New York, USA, 4 p., 26-27 April 2007.
[17] X. Wan, "Using only cross-document relationships for both generic and topic-focused multi-document summarizations," Information Retrieval, Vol. 11, No. 1, pp. 25-49, 2008.
[18] Mahmoud Rokaya, "Automatic summarization based on field coherent passages," International Journal of Computer Applications, Vol. 79, No. 9, 2013.
[19] J. Han and M. Kamber, "Data Mining: Concepts and Techniques (2nd edition)," Morgan Kaufmann, San Francisco, 800 p., 2006.
[20] M. A. Fattah and F. Ren, "GA, MR, FFNN, PNN and GMM based models for automated text summarization," Computer Speech and Language, Vol. 23, No. 1, pp. 126-144, 2009.
[21] U. Hahn and I. Mani, "The challenges of automatic summarization," IEEE Computer, Vol. 33, No. 11, pp. 29-36, 2000.
[22] I. Mani and M. T. Maybury, "Advances in Automatic Text Summarization," MIT Press, Cambridge, 442 p., 1999.