Multi-Document Summarization: Two Methods

Pieter Buzing

May 2001

In this paper we discuss and compare two methods for multi-document summarization. One approach is concerned with the explicit meaning of words, viewing them as concepts whose (semantic) relations explicitly represent context. The second approach is more top-down in its attempt to identify the salient parts of the source documents: it starts at the paragraph level, which forms the base for further processing. After comparison we conclude that the second approach is slightly favored.

1 Introduction

For many organizations it is of major importance to archive documents (which are of relevance in a certain domain or research field) in a database. This database can be used to search for articles that cover a certain topic. There are three main problems here. First, it is often hard to judge whether a document is relevant or not. Second, it is not easy to see the relationships between the documents. Third, these databases can become very large, resulting in a large number of articles returned for a search query. Even the summaries of those single documents can be too much to make sense of. Due to the extreme increase in available information in the form of news articles and scientific documents, more and more attention is given to multi-document summarization, i.e. systems that create a (relatively short) text that gives a good reflection of the main topics in a set of documents. I will discuss and compare two techniques that aim to construct good multi-document summaries. The first method (which we shall address as Conceptual Graph Matching) is proposed in [Mani & Bloedorn 1997]. It starts at the word level, constructing a semantic graph of each document. Then it moves up to the sentence level, selecting salient phrases (i.e. phrases that contain commonly used concepts) for the summary. Another approach (Theme Recognition) can be found in [McKeown et al. 1999].
It starts at the paragraph level, considering the complete set of documents, and groups paragraphs that are alike into so-called themes. Then it examines the grammatical structure of phrases in each theme and selects the most important phrases from each theme for the summary.

In sections 2 and 3 I will give a more detailed description of the two methods. In section 4 the two approaches will be compared with respect to content representation, information fusion, semantics preservation, scalability and domain independence. These are the factors that we suspect will show the greatest differences between the two. Section 5 will discuss alternative techniques that have been developed, as well as a few suggestions for improving the two methods. Section 6 will give the conclusions of the comparison.

2 The Graph Matching Method

The method takes as input a pair of documents. Although Mani and Bloedorn never implemented a real multi-document version, expanding their method from two to n documents should not pose many problems. I will go into this in section 4.4. A topic description also has to be supplied, which can be a (short) text or just a (well-formulated) question. It is used to determine the subjects that are of interest to the user. Section 2.1 will give information on the structure of the graphs. Section 2.2 will describe the spreading activation algorithm. In section 2.3 the information fusion algorithm can be found.

2.1 Data structure

Each text is represented as a graph. Each word occurrence is mapped to a node (see figure 1). The connections between different nodes represent specific relations between the words. The ADJ link is the edge between two adjacent words. The PHRASE link ties together (adjacent) words that together form a phrase. For the subtask of identifying different phrases an extensive regular-expression-based sentence boundary disambiguator developed by John Aberdeen (see [Aberdeen 1995]) was used.
The SAME link connects two occurrences of the same concept. These words can be identical or the singular/plural forms of the same concept. The NAME link connects adjacent nodes that together represent a name. To extract the names SRA's NetOwl (see [Krupka 1995]) was used. The COREF link connects name occurrences that are co-referential. Alpha links represent any semantic association between two concepts; for example "president" is an alpha link between "Bill Gates" and "Microsoft". Mani and Bloedorn used a corpus-sensitive approach proposed by Philip Resnik in [Resnik 1993] that uses a TREC reference corpus.

Figure 1: An example graph representation.

2.2 Spreading activation

A node is given an initial weight value, roughly calculated from the number of occurrences in the document. Then a spreading activation technique (based on [Chen et al. 1994]) is used to determine the salient nodes. Nodes that are identical to words in the topic description are considered input nodes. All the nodes connected to an input node are then taken as output nodes, and the activation of the input node is propagated (with an exponentially decaying function of the activation weight and the distance between the nodes) to the connected output nodes. Then the nodes connected to the current output nodes are considered the new output nodes, and the activation of the input nodes is propagated to these leaves. This is done iteratively until all the nodes (or an arbitrary maximum number of nodes) have been reached. Apart from some details (like a different propagation function for different types of links and a stronger activation propagation for words within the same sentence) this covers the whole activation algorithm. The idea is of course that words mentioned in the topic description should be considered important and that nodes connected to these topic nodes are also relevant.
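As a rough illustration, the spreading activation described above can be sketched as follows. This is a simplified sketch, not the authors' implementation: the breadth-first propagation order, the decay constant, the flat treatment of all link types, and the choice to propagate each expanded node's initial weight are all assumptions made here for brevity.

```python
from collections import deque

def spread_activation(edges, initial_weights, topic_words, decay=0.5, max_nodes=1000):
    """Propagate activation outward from topic (input) nodes along graph edges.

    edges: dict mapping a node to the nodes it is linked to (ADJ, SAME, ... links).
    initial_weights: per-node weight, e.g. based on occurrence counts.
    The contribution decays exponentially with distance from the input nodes.
    """
    activation = dict(initial_weights)
    visited = set(topic_words)
    queue = deque((w, 1) for w in topic_words if w in edges)
    reached = 0
    while queue and reached < max_nodes:
        node, dist = queue.popleft()
        for neighbor in edges.get(node, ()):
            # Exponentially decaying contribution, attenuated by distance.
            activation[neighbor] = activation.get(neighbor, 0.0) + \
                initial_weights.get(node, 0.0) * decay ** dist
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
                reached += 1
    return activation
```

A node two hops away from a topic node thus receives a smaller boost than a direct neighbor, which matches the intuition that connectivity to topic nodes indicates salience.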
Concepts (nodes) that are connected to many topic nodes (keep in mind that there can be some nodes in between) should be considered more salient than words that have little connectivity with input nodes.

2.3 Information fusion

Given the two graphs (G1 and G2) we set out to find the similarities and differences between the documents. The FSD (Find Similarities and Differences) algorithm works as follows. We create two sets of nodes: one with the common nodes and one with the differences. The sets are given by:

Common = {c | concept_match(c, G1) and concept_match(c, G2)}
Differences = (G1 ∪ G2) − Common

where concept_match(c, G) is true iff there exists a c1 in G such that c1 is a topic term or c and c1 are synonyms. Now we can select the phrases for the summary. We want to include "common" sentences (sentences that have a lot of nodes in both graphs) and "different" sentences (sentences that have a lot of unique words). The ratio of common sentences to different sentences is a parameter that the user can specify. The "common" score of a sentence s is calculated by:

score(s) = (1 / |c(s)|) * sum_{i=1..|c(s)|} weight(w_i), where c(s) = {w | w ∈ Common and w ∈ s}

That is, the score of a sentence is the average weight of the words in s that are common. For example, a sentence containing three common words with weights 0.4, 0.6 and 0.5 gets a common score of (0.4 + 0.6 + 0.5)/3 = 0.5. The "difference" score of a sentence is calculated the same way. The best "common" sentences (i.e., the sentences that score high on the common word set) and the best "different" sentences are selected for the summary.

3 The Theme Recognition Method

As input we consider a set of documents. As opposed to the previous system the number of articles is not limited and a topic description is not required. The algorithm consists of three steps: first (section 3.1) it identifies themes in a given document collection. Then (section 3.2) the important phrases are selected from each theme. This is done by analysis of the grammatical structures of the phrases in each theme: the phrases with the most common structures are chosen for the summary.
In the final step (section 3.3) these selected phrases are properly placed in the multi-document summary. See also figure 2.

3.1 Theme identification

First the texts are unraveled into paragraph units, which are then thrown on one big pile: the document level is behind us. A theme is a set of paragraphs that are similar with respect to nouns and meaning. So the focus is not only on word-level similarity, but also on the semantic relations. Now all the paragraphs must be judged on their pairwise similarity. McKeown used four basic features to decide how similar two text units (paragraphs) are.
- Word co-occurrence: if the text units have many (stemmed) words in common they have a high similarity.
- Matching noun phrases: the LinkIt tool proposed in [Wacholder 1998] is used to match noun phrases. This algorithm recognizes noun phrase similarities.
- WordNet synonyms: by making use of the WordNet semantic database (see [Miller et al. 1990]) we can match words with the same meaning.
- Common semantic classes for verbs: a classification of verbs is used to match verbs that belong to the same semantic class.

McKeown also uses composite features (pairwise combinations of the primitive features) to match two text units. These composite features are a powerful means to express similarity constraints, like "two text units must have matching noun phrases that appear in the same order and with a relative difference in position of no more than five". Matches on composite features indicate that two text units are related both semantically and syntactically.

Having found the pairwise similarity rates between paragraphs, they have to be divided into groups. This is done by a machine learning algorithm, which classifies each pair of paragraphs as either similar or dissimilar. This information is then used by a clustering algorithm, which places the most related paragraphs in the same groups, and thus identifies themes. Mind that text units can be placed in more than one theme.

Figure 2: theme identification, information fusion and reformulation (articles are broken into paragraphs, similar paragraphs are clustered into themes; theme sentences are broken into phrases, grammar trees are built and matched; the resulting phrases are preprocessed and assembled into summary sentences).

3.2 Information fusion

Next, phrases are selected that correctly represent the content of the themes. The problem is that it is not feasible to simply select the sentences that best cover the text units, because unimportant embedded phrases would be included and important additional information might be excluded. Sentences are too rigid, so McKeown developed a method to identify salient phrases instead of complete sentences. Theme sentences are first run through a statistical phrase parser to identify functional roles like subject or object. This information is then passed on to a rule-based component, which constructs a dependency grammar tree. These grammar representations are compared with each other, and if the intersection of two (sub)trees yields a full phrase it is proposed for the summary. This means that phrases that occur more than once are considered to bear important information.

3.3 Text reformulation

The phrases selected in the previous step must be transformed into decent English sentences. First the phrases are ordered in such a way that things mentioned early in the documents also appear early in the summary. Then some additional information like entity descriptions or temporal references may be added. Finally the summary sentences are constructed with the FUF/SURGE sentence generator. This is a Functional Unification Formalism that uses an English grammar rule set called SURGE to build syntactic trees, which are then transformed into sentences.
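Stepping back to the theme identification of section 3.1, the grouping step can be made concrete with a minimal sketch. It uses only the word co-occurrence feature (computed here as Jaccard overlap) with a fixed similarity threshold and a greedy grouping; the real system additionally uses noun-phrase, WordNet and verb-class features plus a trained classifier and a dedicated clustering algorithm, so the threshold value and grouping strategy below are illustrative assumptions.

```python
def similarity(para_a, para_b):
    """Word co-occurrence feature: Jaccard overlap of (lowercased) word sets."""
    words_a, words_b = set(para_a.lower().split()), set(para_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def identify_themes(paragraphs, threshold=0.3):
    """Greedily group paragraph indices into themes by pairwise similarity.

    A paragraph joins every existing theme it is sufficiently similar to
    (so, as in the paper, a text unit may appear in more than one theme),
    and starts a new theme if it matches none.
    """
    themes = []
    for i, para in enumerate(paragraphs):
        placed = False
        for theme in themes:
            if any(similarity(para, paragraphs[j]) >= threshold for j in theme):
                theme.append(i)
                placed = True
        if not placed:
            themes.append([i])
    return themes
```

For example, the paraphrase pair from section 4.2 ("the students walked away" / "the group of students walked away") shares enough words to land in one theme, while an unrelated paragraph forms its own.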
FUF/SURGE (like many other language generators of this type) performs better if a domain description is given, but McKeown claims that the lack of this semantic background knowledge need not be a great drawback: during the process a lot of work is already done on time sequencing, synonyms, etc.

4 Comparison

In this section I will compare these approaches on the basis of content representation, information fusion, semantics preservation, scalability and domain independence. Unfortunately the two systems have not yet been evaluated head-to-head, and an objective test of the accuracy of a summary is not available at this time.

4.1 Content representation

The Graph Matching method represents the document content at its smallest scale: each word is considered a node in the graph. Throughout the whole process this word level is maintained. The content of the different documents is kept apart until the actual information fusion takes place. The Theme Recognition method starts at the paragraph level, immediately dropping the distinction between the source documents. It then looks at the meaning of the words, searching for concepts that two paragraphs have in common, and then moves back up to consider the grammatical tree structure of phrases.

4.2 Information fusion

In the Graph Matching method we simply look at words that are strongly connected (in some sense) to the topic items. The method then calculates the common score and the difference score for each sentence. The sentences with the highest scores are selected for the summary. The Theme Recognition method selects phrases that frequently occur in a theme. Of course there is the problem of paraphrasing, i.e. phrases that say the same thing with (slightly) different words, like "the group of students walked away" versus "the students walked away" and "John warned the girl" versus "the girl was warned by John".
The algorithm checks for this with simple lexical rules, making sure that in the first example "group of" is ignored and in the second example the active/passive switch is detected. The algorithm thus correctly concludes that both example pairs match. This only works under the assumption that the two sentences carry roughly the same meaning, because such a superficial detection of paraphrasing (on a high syntactic level) does not guarantee that this is actually the case.

4.3 Semantics preservation

The Graph Matching method is very concerned with the meaning of words. It is interested in the relations between concepts: think of the SAME links, the COREF links and the alpha links. The Theme Recognition method does not represent these meanings explicitly, but they are captured in two steps. First, paragraphs that have many words in common are considered related to each other and thus clustered in the same group (theme), assuming that paragraphs that share the same words tend to share the same meaning. Second, phrases that occur several times in a theme are considered salient.

4.4 Scalability

The Graph Matching method described by [Mani & Bloedorn 1997] only allows a two-document input. Adapting the system to a greater number of input documents should not be considered a major problem, though. The number of graphs that the FSD algorithm can handle is not limited, but there would be a large conjunction of concept_matches over G1, G2, ..., Gn. This could result in a very slim set of common words compared to the set of "different" words. On the other hand, the number of common words could only get larger, and as the desired length of the summary will probably not change dramatically, the quality of the result should not decay. The McKeown system was tested with a set of 30 documents, which gave good results according to [McKeown et al. 1999].
Scaling up would not pose any difficulty (the number of generated themes would probably go slightly up due to the increased diversity of the source documents), assuming that the size of the summary also grows accordingly. This is because the number of phrases selected from a theme is constant (with respect to the length of the theme), while the number of themes increases. Results could get awkward when the number of selected phrases in each theme is forced down.

4.5 Domain independence

The Graph Matching approach is claimed to be very domain independent by Mani and Bloedorn, but some arguments against this are also easy to find. The main drawback is that it makes use of a (training) corpus. When the input document set is significantly different from the corpus, the system could well get into trouble: the construction of the graph could suffer as SAME links or alpha links may not be recognized. Also, the topic description has to be supplied, and a topic text that states too little information could result in a set of common words that is too small. But these effects will not be decisive, as (i) the corpus gives a good general foundation, (ii) the WordNet synonyms list is quite extensive and (iii) the algorithm that selects the common words depends only on the supplied topic text. The method advocated by [McKeown et al. 1999] is also domain independent, maybe even more so than the other method, because the Theme Recognition system does not need any topic description. There is only one negative point concerning domain independence (besides perhaps the synonyms list and the verb classification): the matching algorithm could get hampered when certain domain-specific paraphrases are not known.

5 Discussion

First I will give some alternative approaches that have been tried in this field. After that, a few suggestions for improvement of the two methods discussed in this paper will be presented.
[Radev and McKeown 1998] present a system called SUMMONS (SUMMarizing Online NewS articles), which creates briefings for the user on information that the user is interested in. It was developed for (large data sets of) short news articles, and produces a briefing rather than a summary. A briefing differs from a summary in that it focuses on the subject that the user has indicated interest in, sometimes ignoring the real subject of the article. So if the user indicates an interest in, say, Canadian beer, the system will run through a large number of news articles, returning the information it has found on Canadian beer, without giving any attention to the full content of the source articles. The system uses the output produced by a MUC-4 system. This Message Understanding System is very domain-specific (MUC-4 specializes in terrorism) and returns a template with fields like "source", "incident", "victim", etc. The fact that it is very domain-specific is a great drawback of the system. I think this is why the research group turned to the approach described in this paper. [Yang et al. 1998] propose a statistical method which is highly domain independent, but it is restricted to news stories about one event, and the temporal order is crucial. Yang uses the well-known term frequency and inverse document frequency measure (abbreviated TF*IDF; IDF = log(N/nt), where N is the total number of training documents and nt the number of training documents that contain term t) to cluster terms and phrases. The "strongest" terms are placed in the summary, in order of appearance in the input stream of documents (this is why the temporal order is important). Two clustering methods were tried, both producing reasonable results according to [Yang et al. 1998]. The method proposed in [Mani & Bloedorn 1997] of course lacks a sentence construction component.
This is a crucial step, because without it (i) its value (i.e. usefulness for the user) is very limited, (ii) it cannot be tested, and (iii) expanding the input to more than two documents is not useful, since interpretation of the common and different word sets becomes very difficult (the order of the highlighted phrases is unknown, which makes reading rather awkward). [McKeown et al. 1999] proposed a system that performs well. The greatest point of improvement would be to evaluate the created summary from a global perspective (i.e. as a whole) and not locally. The algorithm currently checks global cohesion only in the theme construction phase, where the topics are spread evenly over a number of clusters. This guarantees that the final summary will cover all the important issues, but the construction of sentences and paragraphs does not always provide good cohesion. This is a difficult task, but we think that when you want to build reliable summaries, issues like context and (subtle) interpretation must be addressed.

6 Conclusion

We presented two multi-document summarization techniques. One was developed by Mani and Bloedorn. It mainly works at the word level, focusing on the semantic relations between concepts (words or groups of words that express an entity). The second approach (described in [McKeown et al. 1999]) relies on the fact that paragraphs in general consist of coherent sentences (i.e. sentences semantically in accordance with each other) and that paragraphs that have a lot of words in common usually express the same meaning. We compared the two methods on five points. Content representation: the first approach represents each word in a document as a node in a graph and draws the semantic relations between words; the second approach makes use of the paragraph structure and selects phrases that are common to multiple paragraphs. Information fusion: in the first approach, sentences that appear to contain important words are selected for the summary.
The second approach selects the phrases that occur most often in a theme. Semantics preservation: the first approach makes a lot of effort to represent the meaning of concepts, and succeeds well in maintaining the context; the second approach tries to maintain phrases, which seems a very natural solution. Scalability: the first approach is restricted to two-document input, but the idea in itself is scalable; the second approach is not limited in the number of input documents, and a sizable document set should not be a problem, provided that the size of the summary grows accordingly (otherwise problems may occur). Domain independence: both approaches were tested with newspaper articles, but both research groups claim that their system is applicable to other kinds of documents. Both systems should be considered domain independent, though we suspect that source documents that are very different from the corpus documents would give a slight decline in performance for both approaches. The easiest way to conclude this comparison paper would be to say that the Theme Recognition method is the best, because it is a working application and the Graph Matching method never got that far. On the other hand, when we look at the results of the comparison, we can express only a slight preference for the second approach (McKeown), because the scalability of the graph method is still not proven.

References

[Aberdeen 1995] John Aberdeen, John Burger, David Day, Lynette Hirschman, Patricia Robinson, and Marc Vilain. MITRE: Description of the Alembic System Used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, Maryland, November 1995.

[Chen et al. 1994] C.H. Chen, K. Basu, and T. Ng. An Algorithmic Approach to Concept Exploration in a Large Knowledge Network. Technical Report, MIS Department, University of Arizona, Tucson, AZ, 1994.

[Krupka 1995] George Krupka. SRA: Description of the SRA System as Used for MUC-6.
In Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, Maryland, November 1995.

[Mani & Bloedorn 1997] Inderjeet Mani and Eric Bloedorn. Multi-document Summarization by Graph Search and Matching. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), 1997.

[McKeown et al. 1999] Kathleen R. McKeown, Judith L. Klavans, Vasileios Hatzivassiloglou, Regina Barzilay, and Eleazar Eskin. Towards Multidocument Summarization by Reformulation: Progress and Prospects. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), pp. 453-460, 1999.

[Miller et al. 1990] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An On-Line Lexical Database. International Journal of Lexicography, 3(4): 235-312, 1990.

[Radev and McKeown 1998] Dragomir R. Radev and Kathleen R. McKeown. Generating Natural Language Summaries from Multiple On-Line Sources. Computational Linguistics, 24(3): 469-500, 1998.

[Resnik 1993] Philip Resnik. Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. Dissertation, University of Pennsylvania, Philadelphia, PA, 1993.

[Wacholder 1998] Nina Wacholder. Simplex NPs Clustered by Head: A Method for Identifying Significant Topics within a Document. In Proceedings of the COLING-ACL Workshop on the Computational Treatment of Nominals, 1998.

[Yang et al. 1998] Yiming Yang, Tom Pierce, and Jaime Carbonell. A Study on Retrospective and On-Line Event Detection. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998.