Summarization of Web Pages by Keyword Extraction and Sentence Vector
Md. Shafayat Rahman, Ashequl Qadir, Md. Mohsin Ali Khan, Abdullah Azfar
Department of Computer Science and Information Technology (CIT),
Islamic University of Technology (IUT), Board Bazar, Gazipur 1704, Bangladesh.
fahim_shafayat@yahoo.com, ashequlqadir@yahoo.com, robin_pcc@yahoo.com, abdullah_azfar@yahoo.com
Abstract
In this paper we propose a system that can run in parallel with a conventional search engine to provide the user with unified, summarized information. Our system relieves the user of manually accessing each of the web links returned by a search engine. To support such a feature in the search process, we propose a procedure that identifies the significant words in the web pages found by submitting a search string to any popular search engine. These keywords are then used to extract the significant information from the web pages and to eliminate extraneous information by determining sentence vectors.
Keywords: Angle between sentences, frequent words,
keywords, sentence vector, summarization, web pages.
I. INTRODUCTION
Effective searching on the web is becoming an important part of everyday life. It is easy to lose track in the vast cyberspace while looking for specific information: there are always many possible paths to follow, and not all of them lead to the desired destination. So, instead of making the user access every source of information manually, providing all the collected information at once, in an ordered fashion, can be a remarkable improvement.
Most web pages on the Internet contain information of different types and categories, of which only a portion is significant to the user. Moreover, sentences on different pages may be phrased differently yet carry the same information. Our procedure tries to identify the important sentences as well as to remove such duplicate sentences.
In section II, we discuss related work and fundamental ideas in the field of information extraction and summarization. Section III presents our approach for identifying the significant words of a web page, denoted the "keywords" from now on, and for removing redundant information, followed by the experimental results discussed in section IV. In section V, we conclude and outline our research direction.
II. RELATED WORKS IN THE FIELD
Much work has been done in the field of information extraction from web pages, but most of it serves specific purposes. Nevertheless, analyzing the research in related fields helped us build our concept on different issues.
Orkut Buyukkokten, Hector Garcia-Molina and Andreas Paepcke worked on how information can be summarized [1] before being displayed on a handheld device such as a PDA or palmtop computer, where only a small screen is available.
Adam Jatowt, Khoo Khyou Bun and Mitsuru Ishizuka worked on the changes of dynamic web pages [2], [3], [4], [5], [6]. When the content of a web page changes partially or completely, to give the user the flexibility to keep track of the changes, they researched summarizing the changes and reporting them to the user.
Inderjeet Mani, Eric Bloedorn, and Barbara Gates proposed an interesting approach to information extraction from web pages by constructing a graph [7], [8] in which the nodes are the words of sentences and the links between nodes represent different types of relations between the words. Yutaka Matsuo, Yukio Ohsawa and Mitsuru Ishizuka researched the relationships behind the links [9] that exist in web pages.
As we have already said, identifying the keywords of a web page is of supreme importance for our information retrieval and redundancy removal operations. A popular algorithm for keyword extraction is the tfidf measure, which extracts keywords that appear frequently in a particular document but not so frequently in the remainder of the corpus. For web pages, however, the corpus would need statistics over millions of pages, which are almost impossible to obtain and very difficult to work with. That is why we prefer a domain-independent keyword extraction method that does not require a large corpus.
We have adopted a keyword extraction algorithm based solely on a single document, developed by Matsuo & Ishizuka [10].
After extracting the keywords, to identify the important sentences and to remove the duplicate ones, we use the sentence vector and sentence clustering concepts employed by Khoo Khyou Bun and Mitsuru Ishizuka [11] as part of their research on "Topic Extraction from News Archive Using TF*PDF Algorithm".
III. NECESSARY OPERATIONS AND ALGORITHMS
A. Removing Noise and Extracting Important Information from the Web Pages
Web pages typically contain noisy content such as advertisement banners and navigation bars. If a pure-text classification method is applied directly to these pages, it introduces considerable bias into the classification algorithm, which may lose focus on the main topics and important content. Our system uses the different HTML tags (<p>, <br>, <b>, <i>, <table>, <tr> etc.) as delimiters to identify the information units and distinguish them from navigation units, interaction units, decoration units and special function units. For example, if we find a run of text between the tags <p> and </p>, we can consider it a paragraph carrying information. Conversely, if we find a short piece of text between the <td> and </td> tags, it can be excluded from summarization, since such text is quite likely to be some kind of statistical information (not suitable for summarization) or part of the page layout, such as a menu or hyperlink.
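To make the tag-based filtering concrete, here is a minimal Python sketch using only the standard library; the tag list and the length threshold for <td> cells are illustrative assumptions, not values fixed by our system:

```python
from html.parser import HTMLParser

MIN_TD_LENGTH = 80  # assumed cutoff: shorter <td> text is treated as layout/statistics

class InformationUnitExtractor(HTMLParser):
    """Collects text inside <p> tags; keeps <td> text only if it is long
    enough to look like running prose rather than a menu or a statistic."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags of interest
        self._buffer = []  # text collected for the innermost unit
        self.units = []    # extracted information units

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "td"):
            self._stack.append(tag)
            self._buffer = []

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
            text = " ".join(self._buffer).strip()
            if tag == "p" and text:
                self.units.append(text)               # paragraphs are kept
            elif tag == "td" and len(text) >= MIN_TD_LENGTH:
                self.units.append(text)               # long table cells may be prose
            self._buffer = []

    def handle_data(self, data):
        if self._stack:
            self._buffer.append(data)

parser = InformationUnitExtractor()
parser.feed("<p>The Big Bang theory describes the expansion of the universe.</p>"
            "<td>Home</td>")
print(parser.units)  # the short <td> cell ('Home') is discarded
```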
B. Identifying Keywords from Each Document
As stated in the previous section, we implement the "Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information" algorithm proposed by Matsuo & Ishizuka [10], with a small modification, to extract the keywords. In this algorithm, each sentence is treated as a basket containing words, ignoring their order and other grammatical relations. First, the frequent terms in the documents are identified. Then the frequency of co-occurrence of each non-frequent term (denoted the "general term" from now on) with each frequent term is determined. Here co-occurrence only means that the two terms appear in the same sentence; they need not be adjacent. The main idea behind their proposal is that a general term whose co-occurrence is distributed evenly across the frequent terms is less significant than a term that shows a pronounced bias of co-occurrence toward a particular subset of frequent terms. To measure this degree of bias, they proposed the equation:
\chi^2(w) = \sum_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w\, p_g)^2}{n_w\, p_g} \qquad (1)
Here,
• χ²(w) is the measure of the degree of biased co-occurrence of a general term w with the frequent terms
• G is the set of frequent terms
• n_w is the total number of terms in the sentences where w appears
• p_g is the sum of the total number of terms in sentences where g appears, divided by the total number of terms in the document
• freq(w, g) is the actual number of co-occurrences of w and g.
The details of the process can be found in the paper by the developers of the algorithm [10].
By applying this equation to all the general terms, we obtain their χ² values. Among them, we take a certain number of terms with the highest χ² values.
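A minimal Python sketch of equation (1) follows; the function and variable names are our own, and sentence-level counting of co-occurrences (both terms appearing in the same sentence) is assumed, as described above:

```python
def chi_squared_scores(sentences, frequent_terms):
    """Compute the chi-squared bias score of equation (1) for every
    general (non-frequent) term. sentences: list of term lists."""
    total_terms = sum(len(s) for s in sentences)
    # p_g: total number of terms in sentences containing g, divided by
    # the total number of terms in the document
    p = {g: sum(len(s) for s in sentences if g in s) / total_terms
         for g in frequent_terms}
    vocab = {w for s in sentences for w in s} - set(frequent_terms)
    scores = {}
    for w in vocab:
        w_sents = [s for s in sentences if w in s]
        n_w = sum(len(s) for s in w_sents)  # terms in sentences containing w
        score = 0.0
        for g in frequent_terms:
            # co-occurrence = both terms appear in the same sentence
            freq_wg = sum(1 for s in w_sents if g in s)
            expected = n_w * p[g]
            if expected > 0:
                score += (freq_wg - expected) ** 2 / expected
        scores[w] = score
    return scores
```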
Note that when implementing this algorithm, we must ensure that the terms of the search string are included in the list of keywords, as these words must be important from the user's perspective. For example, if the user searches for "the Big Bang theory", it is logical that the terms "Big", "Bang" and "theory" should all be identified as keywords. Although, given the way current search engines work, there is a high probability that these words will be frequent in the retrieved web pages, we still explicitly ensure that they are included in the list of keywords.
So, in a nutshell, the steps of the keyword identification process are as follows (a code sketch follows the list):
• Preprocessing: Remove all articles (a, an, the), prepositions (on, of, at etc.), conjunctions (and, but etc.) and auxiliary verbs (is, was, were etc.) from the document. Then remove the remaining stop tokens (for example, punctuation marks).
• Selection of frequent terms: Select the top frequent terms, up to N% of the number of running terms. Here N is an integer whose value is determined experimentally.
• Calculation of the χ² values: For each general term, compute the χ² value from equation (1). List the general terms along with their χ² values in descending order.
• Selection of the keywords: From the list of general terms, select the top N%. Here N is again an integer determined experimentally; its value need not be the same as in the frequent-term step.
• Addition of the terms of the search string: Take the search string given by the user and add its terms (except articles, prepositions, conjunctions, auxiliary verbs etc.) to the set of keywords. If a term is already there, it is overwritten.
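The following sketch ties the listed steps together, reusing chi_squared_scores from the earlier sketch; the stop-word list, the tokenizer and the reading of the N% cutoffs (as percentages of the vocabulary) are illustrative assumptions:

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "on", "of", "at", "and", "but",
             "is", "was", "were"}  # illustrative subset only

def tokenize(sentence_text):
    # crude word tokenizer; punctuation is dropped along the way
    return [t.lower() for t in re.findall(r"[A-Za-z]+", sentence_text)
            if t.lower() not in STOPWORDS]

def extract_keywords(sentences, search_terms, freq_pct=5, key_pct=25):
    """sentences: list of term lists. Returns (frequent terms, keywords,
    chi-squared scores). Uses chi_squared_scores from the earlier sketch."""
    counts = Counter(w for s in sentences for w in s)
    # top freq_pct% of the vocabulary, by frequency (our reading of "N%")
    n_freq = max(1, len(counts) * freq_pct // 100)
    frequent = {w for w, _ in counts.most_common(n_freq)}
    scores = chi_squared_scores(sentences, frequent)
    ranked = sorted(scores, key=scores.get, reverse=True)
    keywords = set(ranked[:max(1, len(ranked) * key_pct // 100)])
    # force-include the meaningful terms of the user's search string
    keywords |= {t.lower() for t in search_terms if t.lower() not in STOPWORDS}
    return frequent, keywords, scores
```

In the experiment of section IV, this corresponds to taking the top 40 words as frequent terms and the next 200 as keywords.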
C. Identifying Important Sentences
Once we have identified the keywords from a document, or rather from a set of documents, we have to extract the important sentences. To do so, we weight each sentence based on its keywords: the weight of a sentence is the sum of the χ² values of the keywords it contains. Here a problem arises. In the keyword extraction algorithm of the previous section, the frequent words are identified first, and the χ² values of the general terms are then calculated with respect to them; hence we obtain no value for the frequent terms themselves. Likewise, we have no weight for any term added manually to the keyword list, that is, a term present in the search string but not identified as a keyword. To overcome this problem, we assign weights to these terms manually, by the following method.
First, all the frequent terms are sorted by decreasing frequency and the selected general terms by decreasing weight. Then the difference between the weights of every two consecutive general terms is computed. From these differences we calculate the mean difference, D.
Our weight assignment process is as follows. Let the weight of the top-ranked keyword be W. Then the bottom-most frequent word receives the weight W1 = W + D. In the same way, the frequent word immediately above it on the list receives W2 = W + 2D, so that if there are N frequent words, the top-most frequent word receives WN = W + ND. All words present in the search string, considered the most important from the user's point of view, are assigned the weight W + (N+1)D.
Now that all the keywords have weights, we can easily compute the weight of each sentence in the documents and take a particular percentage of the total number of sentences, the highest-weighted ones, as the most significant sentences to include in the summarized report.
D. Removing Duplicate Sentences
If more than one sentence conveys the same information (their representations may differ), we want to place them in the same cluster and replace them with the most informative one among them. To achieve this, we use the concept of sentence vectors [11]. Under this concept, each sentence is associated with a sentence vector composed of unit vectors, one for each keyword occurring in the sentence. We then find the angle between each pair of sentence vectors. If the vectors are A and B and the angle between them is θ, then
cos θ = A · B / (|A| |B|)        (2)
We know that
A · B = A1B1 + A2B2 + ... + AnBn        (3)
Here Ai and Bi are the weights of keyword i in A and B respectively. If the keyword is present in both sentences, its value in both vectors equals its weight; if it is absent from a sentence, its value there is zero.
Although in the previous steps we assigned a distinct weight to each keyword, in this step we set every keyword's weight (the value of each unit vector) to 1. This ensures that the presence of one highly weighted keyword in a sentence does not mask the significance of one or more less weighted keywords in the same sentence.
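With unit weights, equations (2) and (3) reduce to a simple set computation, sketched below with each sentence represented as its set of keywords:

```python
import math

def angle_degrees(keywords_a, keywords_b):
    """Angle between two sentence vectors built from keyword sets,
    with every keyword contributing a unit vector."""
    common = len(keywords_a & keywords_b)                # A.B with unit weights
    norm = math.sqrt(len(keywords_a) * len(keywords_b))  # |A| * |B|
    if norm == 0:
        return 90.0  # sentences without keywords are treated as unrelated
    return math.degrees(math.acos(common / norm))
```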
If the value of θ lies within a certain threshold, we can say that the two sentences belong to the same cluster. To choose a threshold value for θ, we can examine the following table:
Table I: Angle between some pairs of sentences

Sentence 1 | Sentence 2 | Should they be in the same cluster? | θ (degrees)

Common keywords = 2; the second sentence may contain the information carried by the first, plus some additional:
AB         | ABc        | Yes        | 35.26
AB         | ABcd       | Ambiguous  | 45
AB         | ABcde      | Ambiguous  | 50.77

Common keywords = 3; the second sentence may contain the information carried by the first, plus some additional:
ABC        | ABCd       | Yes        | 30
ABC        | ABCde      | Yes        | 39.23
ABC        | ABCdef     | Ambiguous  | 45

Common keywords = 4; the second sentence may contain the information carried by the first, plus some additional:
ABCD       | ABCDe      | Yes        | 26.57
ABCD       | ABCDef     | Yes        | 35.27
ABCD       | ABCDefg    | Ambiguous  | 40.89

Common keywords = 1; each sentence has keywords that the other does not have; the meaning may be different:
Ab         | Ac         | No         | 60
Ab         | Acd        | No         | 65.91
Abcd       | Aefg       | No         | 75.5

Common keywords = 2; each sentence has keywords that the other does not have; the meaning may be different:
ABc        | ABde       | Ambiguous  | 54.74
ABc        | ABdef      | Ambiguous  | 58.9
ABcd       | ABefg      | No         | 63.43
ABcde      | ABfgh      | No         | 66.42

Common keywords = 3; each sentence has keywords that the other does not have; the meaning may be different:
ABCd       | ABCe       | Yes        | 41.41
ABCd       | ABCef      | Ambiguous  | 47.87
ABCd       | ABCefg     | Ambiguous  | 52.24
ABCde      | ABCfgh     | Ambiguous  | 56.79
ABCdef     | ABCghi     | No         | 60

The sentences share no common keyword; the meanings are completely different:
ab         | cd         | No         | 90
abc        | def        | No         | 90
abc        | defg       | No         | 90
In Table I, we consider each sentence as a basket of keywords; their position and sequence in the sentence are ignored. Capital letters denote keywords common to both sentences, while small letters denote the unique ones. By carefully examining the table, we can say that a threshold value less than 60° may give a satisfactory result.
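As a quick check, the angle_degrees sketch from section III.D reproduces the table's values:

```python
# e.g. the (AB, ABc) and (ABc, ABde) rows of Table I:
print(round(angle_degrees({"A", "B"}, {"A", "B", "c"}), 2))            # 35.26
print(round(angle_degrees({"A", "B", "c"}, {"A", "B", "d", "e"}), 2))  # 54.74
```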
IV. TEST DATA

At first, we retrieved the top 5 web pages whose URLs were returned by the search engine and removed the unnecessary terms (articles, prepositions etc.). Then we counted the frequency of each word and prepared a table of terms ordered by decreasing frequency. The top ten entries were:

Table II: frequent words in the web pages

Frequent word | Frequency | Number of terms in the sentences where the frequent term exists
Universe      | 315       | 6468
big           | 180       | 4005
bang          | 163       | 3754
theory        | 107       | 2054
matter        | 88        | 1775
galaxies      | 66        | 1588
energy        | 44        | 908
expansion     | 43        | 1112
Hubble        | 41        | 957
radiation     | 41        | 808

We took the topmost 40 words from the list (approximately 5% of the total number of terms) as frequent words and, based on these words, computed the weight of each general term. Then we took the next 200 words (approximately 25% of the total number of terms) as the keywords.

Now let us take three sentences that we want to test for membership in the same cluster:

1. Hubble made the observations that the universe is continuously expanding. – The Big Bang (www.umich.edu/~gs265/bigbang.htm)
2. Hubble had irrefutable proof that the Universe was expanding. – Creation of a Cosmology: the Big Bang Theory (liftoff.msfc.nasa.gov/academy/universe/b_bang.html)
3. Evidence suggests that the Universe is expanding. – Big Bang Cosmology Primer (cosmology.berkeley.edu/Education/IUP/Big_Bang_Primer.html)

Here, words in block letters are frequent words, underlined words are general terms and the rest are unnecessary terms.

For the general terms, the calculated weights are given here:

Table III: general terms with weights

Term         | Weight
Made         | 1524.98
Continuously | 60.72
Irrefutable  | 19.66
Proof        | 152.72
Evidence     | 699.73
Suggests     | 76.87

In our experiment, Made, Proof and Evidence were considered keywords, as they fall within the top 25% of words.

Now let us calculate the angles between the sentences (only the keywords are included; block letters indicate terms common to each sentence):
Sentence 1 & 2: (hubble, made, observations, universe, expanding) vs. (hubble, proof, universe, expanding): 47.87°
Sentence 1 & 3: (hubble, made, observations, universe, expanding) vs. (evidence, universe, expanding): 58.9°
Sentence 2 & 3: (hubble, proof, universe, expanding) vs. (evidence, universe, expanding): 54.5°
So if we take 60° as the threshold value of θ, these three sentences are placed in the same cluster and can be replaced by the first sentence, as it contains the largest number of keywords. This agrees with our practical observation.
V. CONCLUSION
Our proposed system will ease the process of reviewing information on the Internet. It eliminates the manual accessing of numerous web links when a user searches for something, and by removing identical or duplicated information present in more than one page, it saves the user's time and produces a convenient presentation of information. However, some specific types of content may be lost in the process, for example pictures and hyperlinks, so the process is currently limited to text-based information retrieval. In the future we hope to generalize the approach to include all sorts of content along with the text.
REFERENCES
[1] Orkut Buyukkokten, Hector Garcia-Molina and Andreas Paepcke, Digital Libraries Lab (InfoLab), Stanford University, Stanford, CA 94305, USA – "Seeing the Whole in Parts: Text Summarization for Web Browsing on Handheld Devices", Proc. 10th International WWW Conference, 2001.
[2] Khoo Khyou Bun and Mitsuru Ishizuka, Dept. of
Information and Communication Engineering, The
University of Tokyo – “Information Area Tracking
and Changes Summarizing System in WWW”, Proc.
WebNet 2001 -- World Conf. on WWW and Internet, Orlando, Florida, USA. (2001.10), pp.680-685.
[3] Khoo Khyou Bun and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo – "Emerging Topic Tracking System", Proc. Web Intelligence (WI 2001), LNAI 2198 (Springer), Maebashi, Japan, (2001.10), pp.125-130.
[4] Adam Jatowt, Khoo Khyou Bun and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo – "Change Summarization in Web Collections", Proc. 17th Int'l Conf. on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE 2004), Ottawa, Canada, Lecture Notes in Computer Science, LNCS 3029, Springer, (2004.5), pp.653-662.
[5] Adam Jatowt and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo – "Web Page Summarization Using Dynamic Content", Poster, Proc. 13th Int'l World Wide Web Conf. (WWW04), New York, USA, (2004.5), pp.344-345.
[6] Adam Jatowt, Khoo Khyou Bun and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo – "Query-Based Discovering of Popular Changes in WWW", Proc. IADIS Int'l Conf. on WWW/Internet (IADIS 2003), Algarve, Portugal, Vol.1, (2003.11), pp.477-484.
[7] Inderjeet Mani, Eric Bloedorn, and Barbara Gates,
The MITRE Corporation – “Using Cohesion and
Coherence Models for Text Summarization”, In
Proceedings of AAAI Spring Symposium on Intelligent Text Summarization, Stanford, March '98.
[8] Inderjeet Mani and Eric Bloedorn, The MITRE
Corporation – “Multi-document Summarization by
Graph Search and Matching”, In Proceedings of
the Fourteenth National Conference on Artificial
Intelligence (AAAI'97), pp.622-628.
[9] Yutaka Matsuo, Yukio Ohsawa and Mitsuru Ishizuka – “Discovering Hidden Relation Behind a
Link”, Knowledge-based Intelligent Information
Engineering Systems & Applied Technologies
(KES2001), Osaka, Japan (2001.8), pp.1183-1187.
[10] Yutaka Matsuo and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo – "Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information", Proc. 16th Int'l FLAIRS Conf., Florida, (2003.5), pp.392-396.
[11] Khoo Khyou Bun and Mitsuru Ishizuka, Dept. of Information and Communication Engineering, The University of Tokyo – "Topic Extraction from News Archive Using TF*PDF Algorithm", Proc. 3rd Int'l Conference on Web Information Systems Engineering (WISE 2002) (IEEE Computer Soc.), Singapore, (2002.12), pp.73-82.