Summarization of Web Pages by Keyword Extraction and Sentence Vector

Md. Shafayat Rahman, Ashequl Qadir, Md. Mohsin Ali Khan, Abdullah Azfar
Department of Computer Science and Information Technology (CIT), Islamic University of Technology (IUT), Board Bazar, Gazipur 1704, Bangladesh.
fahim_shafayat@yahoo.com, ashequlqadir@yahoo.com, robin_pcc@yahoo.com, abdullah_azfar@yahoo.com

Abstract

In this paper we propose a system that can run in parallel with a conventional search engine to provide the user with unified, summarized information. Our system relieves the user of manually accessing each of the web links returned by a search engine. To add this feature to the search process, we propose a procedure that identifies the significant words in the web pages found by submitting a search string to any popular search engine. These keywords are then used to extract the significant information from the web pages and to eliminate extraneous information by computing sentence vectors.

Keywords: Angle between sentences, frequent words, keywords, sentence vector, summarization, web pages.

I. INTRODUCTION

Effective searching of the web is becoming a very important issue in everyday life. It is very easy to lose track in the vast cyberspace while looking for specific information. There are always many possible paths to follow when seeking information, and not all of them lead to the desired destination. So, instead of making the user access all the sources of information manually, providing all the collected information to the user at once and in an ordered fashion can be a remarkable improvement. Most web pages on the Internet contain several types and categories of information, of which only a portion is significant to the user. Moreover, sentences in different pages may be phrased differently yet carry the same information.
Our procedure tries to identify the important sentences and to remove the duplicate ones. In section II, we discuss related work and fundamental ideas in the field of information extraction and summarization. Section III presents our approach to identifying the significant words of a web page, denoted as "keywords" from now on, and to removing redundant information, followed by the experiment results discussed in section IV. In section V, we conclude and outline our future research direction.

II. RELATED WORKS IN THE FIELD

Much work has been done in the field of information extraction from web pages, but most of it serves specific purposes. Nevertheless, analyzing research in similar fields helped us build our concept on different issues. Orkut Buyukkokten, Hector Garcia-Molina and Andreas Paepcke worked on how information can be summarized [1] before being displayed on a handheld device such as a PDA or palmtop computer with a small screen. Adam Jatowt, Khoo Khyou Bun and Mitsuru Ishizuka worked on the changes of dynamic web pages [2], [3], [4], [5], [6]. When the content of a web page changes partially or completely, their research summarizes the changes and reports them to the user, giving the user the flexibility to keep track of them. Inderjeet Mani, Eric Bloedorn, and Barbara Gates proposed an interesting approach to information extraction from web pages by constructing a graph [7], [8] in which the nodes are the words of sentences and the links between the nodes are the different types of relations between the words. Yutaka Matsuo, Yukio Ohsawa and Mitsuru Ishizuka researched the relationships behind the links [9] that exist in web pages. As already stated, identifying the keywords of a web page is of supreme importance in the context of our information retrieval and redundancy removal operations.
A popular algorithm for keyword extraction is the tf-idf measure, which extracts keywords that appear frequently in a particular document but do not occur so frequently in the remainder of the corpus. In the case of web pages, however, the corpus would have to contain statistics over millions of pages, which is almost impossible to obtain and very difficult to work with. That is why we prefer a domain-independent keyword extraction method that does not require a large corpus. We have adopted a keyword extraction algorithm based solely on a single document, developed by Matsuo & Ishizuka [10]. After extracting the keywords, to identify the important sentences and to remove the duplicates, we use the sentence vector and sentence clustering concepts introduced by Khoo Khyou Bun and Mitsuru Ishizuka [11] as part of their research on "Topic Extraction from News Archive Using TF*PDF Algorithm".

III. NECESSARY OPERATIONS AND ALGORITHMS

A. Removing Noises and Extracting Important Information from the Web Pages

Web pages typically contain noisy content such as advertisement banners and navigation bars. If a pure-text classification method is applied directly to these pages, it introduces considerable bias into the classification algorithm, which may lose focus on the main topics and important content. Our system uses the different HTML tags (<p>, <br>, <b>, <i>, <table>, <tr> etc.) as delimiters to identify the information units and distinguish them from navigation units, interaction units, decoration units and special function units. For example, if we find a body of text between the tags <p> and </p>, we can consider it a paragraph containing information.
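As an illustration, a minimal Python sketch of this tag-based filtering using only the standard library's HTML parser. The class name and the 20-character cutoff for table-cell text are our own assumptions, not values fixed by the paper:

```python
# Sketch: keep text inside <p>; drop short text inside <td>, which is likely
# layout, menu items, or isolated statistics. The 20-character cutoff
# (min_td_len) is an assumed parameter for illustration only.
from html.parser import HTMLParser

class InfoUnitExtractor(HTMLParser):
    def __init__(self, min_td_len=20):
        super().__init__()
        self.min_td_len = min_td_len
        self.stack = []          # currently open tags
        self.paragraphs = []     # extracted information units

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # pop up to and including the matching open tag
        while self.stack and self.stack[-1] != tag:
            self.stack.pop()
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if 'p' in self.stack:
            self.paragraphs.append(text)
        elif 'td' in self.stack and len(text) >= self.min_td_len:
            self.paragraphs.append(text)  # long cells may still carry prose

page = ("<p>The Big Bang model describes the expansion of the universe.</p>"
        "<table><tr><td>Home</td><td>Links</td></tr></table>")
ex = InfoUnitExtractor()
ex.feed(page)
print(ex.paragraphs)   # the short <td> menu items are filtered out
```

Real pages would need more care (void tags such as <br>, malformed markup), but the filtering heuristic itself is as described above.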
Again, if we find a short piece of text between the <td> and </td> tags, this text can be excluded from summarization, as it is very likely to be some kind of statistical information (not applicable to summarization) or part of the web page layout, such as a menu or hyperlink.

B. Identifying Keywords from Each Document

As stated in the previous section, we implement the "Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information" algorithm proposed by Matsuo & Ishizuka [10], with a little modification, to extract the keywords. In this algorithm, each sentence is considered a basket of words, ignoring their order and other grammatical relations. First, the frequent terms in the documents are identified. Then the frequency of co-occurrence of each non-frequent term (denoted the "general term" from now on) with each frequent term is determined. Here co-occurrence only means that the two terms appear in the same sentence; they need not be adjacent. The main idea behind their proposal is that a general term distributed evenly throughout the document is less significant than a term showing a significant bias of co-occurrence with a particular subset of the frequent terms. To measure this degree of bias, they proposed the equation:

X2(w) = sum over g in G of [ (freq(w, g) - nw * pg)^2 / (nw * pg) ]    (1)

Here,
• X2(w) is the measure of the degree of biased co-occurrence of a general term w with the frequent terms
• G is the set of frequent terms
• nw is the total number of terms in the sentences where w appears
• pg is the sum of the total number of terms in sentences where g appears, divided by the total number of terms in the document
• freq(w, g) is the actual number of co-occurrences between w and g.

The details of the process can be found in the paper by the developers of the algorithm [10]. By evaluating this equation for all the general terms, we obtain their X2 values.
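A compact sketch of this chi-squared scoring, treating each sentence as a bag of preprocessed terms. The frequent-term cutoff `top_fraction` and the toy sentences are our own illustrative assumptions; the experimental values appear in section IV:

```python
# Sketch of Eq. (1): score each general term w by the bias of its
# co-occurrence with the frequent-term set G. Here freq(w, g) is counted as
# the number of sentences containing both w and g.
from collections import Counter
from itertools import chain

def chi2_scores(sentences, top_fraction=0.3):
    """sentences: list of lists of (preprocessed) terms."""
    freq = Counter(chain.from_iterable(sentences))
    total_terms = sum(freq.values())
    cutoff = max(1, int(len(freq) * top_fraction))
    frequent = {t for t, _ in freq.most_common(cutoff)}      # the set G

    # p_g: share of all terms lying in sentences that contain g
    p = {g: sum(len(s) for s in sentences if g in s) / total_terms
         for g in frequent}

    scores = {}
    for w in freq:
        if w in frequent:
            continue                      # score general terms only
        n_w = sum(len(s) for s in sentences if w in s)
        co = Counter()                    # freq(w, g)
        for s in sentences:
            if w in s:
                for g in frequent & set(s):
                    co[g] += 1
        scores[w] = sum((co[g] - n_w * p[g]) ** 2 / (n_w * p[g])
                        for g in frequent)
    return scores

docs = [["big", "bang", "theory", "describes", "expansion"],
        ["big", "bang", "model", "predicts", "radiation"],
        ["expansion", "big", "bang", "observed", "hubble"]]
top = sorted(chi2_scores(docs).items(), key=lambda kv: -kv[1])
print(top[:3])
```

Terms with the highest X2 values, i.e. the most biased co-occurrence with G, become keyword candidates, as described next.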
Among them, we take a certain number of terms with the highest X2 values. Note that when implementing this algorithm, we must ensure that the terms of the search string are included in the list of keywords, as these words must be important from the user's perspective. For example, if the user searches for "the Big Bang theory", the terms "Big", "Bang" and "theory" should all be identified as keywords. Although, given the way current search engines work, these words are very likely to be frequent in the retrieved web pages, we still manually ensure that they are included in the keyword list. In a nutshell, the steps of the keyword identification process are:

• Preprocessing: Remove all articles (a, an, the), prepositions (on, of, at etc.), conjunctions (and, but etc.) and auxiliary verbs (is, was, were etc.) from the document. Then remove the stop words (for example, punctuation marks).
• Selection of frequent terms: Select the top frequent terms up to N% of the number of running terms. Here N is an integer whose value comes from the experiment.
• Calculation of the X2 values: For each general term, find the X2 value from equation (1). List the general terms with their X2 values in descending order.
• Selection of the keywords: From the list of general terms, select the top N%. Here N is an integer whose value comes from the experiment.
• Addition of the terms of the search string: Take the search string given by the user and add its terms (except articles, prepositions, conjunctions, auxiliary verbs etc.) to the set of keywords. If a term is already there, it is overwritten.

C.
Identifying Important Sentences

Once we have identified the keywords of a document, or rather of a set of documents, we have to extract the important sentences. To do so, we weight each sentence based on its keywords: the weight of a sentence is the sum of the X2 values of the keywords it contains. But here is a problem. In the keyword extraction algorithm of the previous section, the frequent words are identified first, and the X2 values of the general terms are then calculated with respect to those frequent words, so we obtain no X2 value for the frequent terms themselves. Nor do we obtain a weight for any term added manually to the keyword list, that is, a term present in the search string but not identified as a keyword. To overcome this problem, we assign weights to these terms manually, using the following method. First, the frequent terms are sorted by decreasing frequency and the selected general terms are sorted by decreasing weight. Then the difference in weight between each pair of consecutive general terms is computed, and we calculate the mean difference D. Now let the weight of the top-ranked keyword be W. Then the weight of the bottom-most (least frequent) frequent word is W1 = W + D. In the same way, the frequent word immediately above it in the list has a weight of W2 = W + 2D. So if there are N frequent words, the weight of the top-most frequent word is WN = W + ND. All the words present in the search string, considered the most important ones from the user's point of view, are assigned a weight of W + (N+1)D.
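A sketch of this weight-assignment step. The function and variable names are our own; the X2 values fed in are illustrative (they match the magnitudes of Table III in section IV):

```python
# Sketch: extend the X2 keyword weights to frequent words and search-string
# terms, as described above. D is the mean gap between consecutive
# general-term weights; W is the weight of the top-ranked keyword.
def assign_all_weights(keyword_weights, frequent_by_freq, search_terms):
    """keyword_weights: {general term: X2 weight}
    frequent_by_freq: frequent words sorted by decreasing frequency
    search_terms: content words of the user's search string"""
    ranked = sorted(keyword_weights.values(), reverse=True)
    diffs = [a - b for a, b in zip(ranked, ranked[1:])]
    D = sum(diffs) / len(diffs) if diffs else 1.0
    W = ranked[0]

    weights = dict(keyword_weights)
    N = len(frequent_by_freq)
    # rank 0 = most frequent word, which receives W + N*D;
    # the least frequent word receives W + D
    for rank, term in enumerate(frequent_by_freq):
        weights[term] = W + (N - rank) * D
    for term in search_terms:            # search terms outrank everything
        weights[term] = W + (N + 1) * D
    return weights

kw = {"proof": 152.7, "evidence": 699.7, "made": 1525.0}
freq_words = ["universe", "big", "bang"]
w = assign_all_weights(kw, freq_words, ["theory"])
print(w["theory"] > w["universe"] > w["bang"] > w["made"])   # → True
```

This preserves the intended ordering: search-string terms above all frequent words, frequent words above all general terms.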
Now that all the keywords have weights, we can easily compute the weight of each sentence in the documents and take a particular percentage of the total number of sentences as the most significant ones to include in the summarized report.

D. Removing Duplicate Sentences

If more than one sentence conveys the same information (possibly phrased differently), we want to place them in the same cluster and replace them with the most informative one among them. To achieve this, we use the concept of sentence vectors [11]. According to this concept, each sentence is associated with a sentence vector whose unit vectors correspond to the keywords occurring in the sentence. We then find the angle between each pair of sentence vectors. If the vectors are A and B and the angle between them is Theta, then

Cos(Theta) = A.B / (|A| |B|)    (2)

where

A.B = A1B1 + A2B2 + ... + AnBn    (3)

Here Ai and Bi are the weights of keyword i in A and B respectively. If the keyword is present in both sentences, its value in both vectors equals its weight; if it is absent from one sentence, its value there is zero. Although in the previous steps we assigned a distinct weight to each keyword, in this step we take each keyword's weight (the value of each unit vector) to be 1. This ensures that the presence of one highly-weighted keyword in a sentence does not mask the significance of one or more lower-weighted keywords in the same sentence. Now if the value of Theta is within a certain threshold, we can say that both sentences belong to the same cluster.
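With unit keyword weights, Eqs. (2)-(3) reduce to the cosine of the number of shared keywords over the geometric mean of the sentence sizes. A minimal sketch (keyword sets taken from the worked example of section IV):

```python
# Sketch of Eqs. (2)-(3) with unit weights: each sentence becomes a 0/1
# vector over its keywords, so A.B is the number of shared keywords and
# |A|, |B| are the square roots of the keyword counts.
import math

def angle_deg(kw_a, kw_b):
    """kw_a, kw_b: sets of keywords occurring in each sentence."""
    if not kw_a or not kw_b:
        return 90.0                       # no overlap possible
    cos_theta = len(kw_a & kw_b) / math.sqrt(len(kw_a) * len(kw_b))
    return math.degrees(math.acos(cos_theta))

s1 = {"hubble", "made", "observations", "universe", "expanding"}
s2 = {"hubble", "proof", "universe", "expanding"}
print(round(angle_deg(s1, s2), 2))   # → 47.87
```

Small angles mean heavily overlapping keyword sets; disjoint sets give 90 degrees.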
To get the threshold value for Theta, we can examine the following table:

Table I: angle between some pairs of sentences

Sentence 1   Sentence 2   Same cluster?   Theta (degrees)

(Number of common keywords = 2; the second sentence may contain the information carried by the first, plus some additional information)
AB           ABc          Yes             35.26
AB           ABcd         Ambiguous       45
AB           ABcde        Ambiguous       50.77

(Number of common keywords = 3; the second sentence may contain the information carried by the first, plus some additional information)
ABC          ABCd         Yes             30
ABC          ABCde        Yes             39.23
ABC          ABCdef       Ambiguous       45

(Number of common keywords = 4; the second sentence may contain the information carried by the first, plus some additional information)
ABCD         ABCDe        Yes             26.57
ABCD         ABCDef       Yes             35.27
ABCD         ABCDefg      Ambiguous       40.89

(Number of common keywords = 1; each sentence has keywords that the other does not have; the meaning may be different)
Ab           Ac           No              60
Ab           Acd          No              65.91
Abcd         Aefg         No              75.5

(Number of common keywords = 2; each sentence has keywords that the other does not have; the meaning may be different)
ABc          ABde         Ambiguous       54.74
ABc          ABdef        Ambiguous       58.9
ABcd         ABefg        No              63.43
ABcde        ABfgh        No              66.42

(Number of common keywords = 3; each sentence has keywords that the other does not have; the meaning may be different)
ABCd         ABCe         Yes             41.41
ABCd         ABCef        Ambiguous       47.87
ABCd         ABCefg       Ambiguous       52.24
ABCde        ABCfgh       Ambiguous       56.79
ABCdef       ABCghi       No              60

(The sentences share no common keyword; the meaning is completely different)
ab           cd           No              90
abc          def          No              90
abc          defg         No              90

In Table I, we consider each sentence as a basket of keywords; their position or sequence in the sentence is ignored. Capital letters denote the keywords common to both sentences; small letters denote the unique ones.
By carefully examining the table, we can say that a threshold below 60° may give satisfactory results.

IV. TEST DATA

At first, we retrieved the top 5 web pages whose URLs were returned by the search engine and removed the unnecessary terms (articles, prepositions etc.). Then we counted the frequency of each word and prepared a table of terms ordered by decreasing frequency. The top ten entries were:

Table II: frequent words in the web pages

Frequent Word   Frequency   Number of terms in the sentences where the frequent term exists
universe        315         6468
big             180         4005
bang            163         3754
theory          107         2054
matter           88         1775
galaxies         66         1588
energy           44          908
expansion        43         1112
Hubble           41          957
radiation        41          808

We took the topmost 40 words from the list (approximately 5% of the total number of terms) as frequent words and, based on these words, computed the weight of each general term. Then we took the next 200 words as the keywords (approximately 25% of the total number of terms). Now let us take three sentences that we want to test for membership in the same cluster:

1. Hubble made the observations that the universe is continuously expanding. – The Big Bang (www.umich.edu/~gs265/bigbang.htm)
2. Hubble had irrefutable proof that the Universe was expanding. – Creation of a cosmology: the big bang theory (liftoff.msfc.nasa.gov/academy/universe/b_bang.html)
3. Evidence suggests that the Universe is expanding. – Big Bang Cosmology Primer (cosmology.berkeley.edu/Education/IUP/Big_Bang_Primer.html)

Here, words in block letters are frequent words, underlined words are general terms and the rest are unnecessary terms. For the general terms, the calculated weights are given here:

Table III: general terms with weights

Term           Weight
Made           1524.98
Continuously     60.72
Irrefutable      19.66
Proof           152.72
Evidence        699.73
Suggests         76.87

In our experiment, Made, Proof and Evidence were considered keywords, as they fall within the top 25% of words.
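The cluster test for these three sentences can be sketched as follows (keyword sets as identified above, unit weights, 60° threshold; the representative-selection rule of taking the sentence with the most keywords is from the example below):

```python
# Sketch: pairwise angles between the three example sentences, clustered
# under a 60-degree threshold; the sentence with the most keywords
# represents the cluster.
import math

def angle_deg(a, b):
    return math.degrees(math.acos(len(a & b) / math.sqrt(len(a) * len(b))))

sents = {
    1: {"hubble", "made", "observations", "universe", "expanding"},
    2: {"hubble", "proof", "universe", "expanding"},
    3: {"evidence", "universe", "expanding"},
}
THRESHOLD = 60.0
pairs = [(i, j) for i in sents for j in sents
         if i < j and angle_deg(sents[i], sents[j]) < THRESHOLD]
print(pairs)                 # every pair falls under the threshold
rep = max(sents, key=lambda k: len(sents[k]))
print(rep)                   # sentence 1 has the most keywords
```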
Now let us calculate the angles between the sentences (only the keywords are included):

Sentence 1 & 2: (hubble, made, observations, universe, expanding) vs. (hubble, proof, universe, expanding): 47.87°
Sentence 1 & 3: (hubble, made, observations, universe, expanding) vs. (evidence, universe, expanding): 58.9°
Sentence 2 & 3: (hubble, proof, universe, expanding) vs. (evidence, universe, expanding): 54.74°

So if we take 60° as the threshold value of Theta, these three sentences fall into the same cluster and can be replaced by the first sentence, as it contains the maximum number of keywords. This supports our practical observation.

V. CONCLUSION

Our proposed system will considerably ease the process of reviewing information on the Internet. It eliminates the manual accessing of numerous web links when a user searches for anything. By removing identical or duplicated information present in more than one page, the system saves the user's time and produces a convenient presentation of information. However, some specific types of information, for example pictures and hyperlinks, may be lost in the process, so the procedure is currently specific to text-based information retrieval. In future work we hope to generalize the approach to include all sorts of information along with the text.

REFERENCES

[1] Orkut Buyukkokten, Hector Garcia-Molina and Andreas Paepcke, Digital Libraries Lab (InfoLab), Stanford University, Stanford, CA 94305, USA, "Seeing the Whole in Parts: Text Summarization for Web Browsing on Handheld Devices", Proc. 10th International WWW Conference, 2001.
[2] Khoo Khyou Bun and Mitsuru Ishizuka, Dept.
of Information and Communication Engineering, The University of Tokyo, "Information Area Tracking and Changes Summarizing System in WWW", Proc. WebNet 2001, World Conf. on WWW and Internet, Orlando, Florida, USA (2001.10), pp.680-685.
[3] Khoo Khyou Bun and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo, "Emerging Topic Tracking System", Proc. Web Intelligence (WI2001), LNAI 2198 (Springer), Maebashi, Japan (2001.10), pp.125-130.
[4] Adam Jatowt, Khoo Khyou Bun and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo, "Change Summarization in Web Collections", Proc. 17th Int'l Conf. on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE 2004), Ottawa, Canada, Lecture Notes in Computer Science, LNCS 3029, Springer (2004.5), pp.653-662.
[5] Adam Jatowt and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo, "Web Page Summarization Using Dynamic Content", Poster, Proc. 13th Int'l World Wide Web Conf. (WWW04), New York, USA (2004.5), pp.344-345.
[6] Adam Jatowt, Khoo Khyou Bun and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo, "Query-Based Discovering of Popular Changes in WWW", Proc. IADIS Int'l Conf. on WWW/Internet (IADIS 2003), Algarve, Portugal, Vol.1 (2003.11), pp.477-484.
[7] Inderjeet Mani, Eric Bloedorn, and Barbara Gates, The MITRE Corporation, "Using Cohesion and Coherence Models for Text Summarization", In Proceedings of the AAAI Spring Symposium on Intelligent Text Summarization, Stanford, March 1998.
[8] Inderjeet Mani and Eric Bloedorn, The MITRE Corporation, "Multi-document Summarization by Graph Search and Matching", In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI'97), pp.622-628.
[9] Yutaka Matsuo, Yukio Ohsawa and Mitsuru Ishizuka, "Discovering Hidden Relation Behind a Link", Knowledge-based Intelligent Information Engineering Systems & Applied Technologies (KES2001), Osaka, Japan (2001.8), pp.1183-1187.
[10] Yutaka Matsuo and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo, "Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information", Proc. 16th Int'l FLAIRS Conf., Florida (2003.5), pp.392-396.
[11] Khoo Khyou Bun and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo, "Topic Extraction from News Archive Using TF*PDF Algorithm", Proc. 3rd Int'l Conference on Web Information Systems Engineering (WISE 2002) (IEEE Computer Soc.), Singapore (2002.12), pp.73-82.

9th International Conference on Computer & Information Technology, 2006. Organized by: Independent University, Bangladesh.