Boosting Bookmark Category Web Page Classification Accuracy using Multiple Clustering Approaches
Chris Staff
Department of Artificial Intelligence
University of Malta
Email: chris.staff@um.edu.mt
Abstract
Web browser bookmark files are used to store links
to web sites that the user would like to revisit. However,
bookmark files tend to be under-utilised, as time and effort are needed to keep them organised. We use four
methods to index and automatically classify documents
referred to in 80 bookmark files, based on document
title-only and full-text indexing and two clustering
approaches. We evaluate the approaches by selecting
a bookmark entry to classify from a bookmark file and
re-creating a snapshot of the bookmark file to contain
only entries created before the selected bookmark
entry. Individually, the algorithms have an accuracy
at rank 1 similar to the title-only baseline approach,
but the different approaches tend to make different
recommendations. We improve accuracy by combining
the recommendations at rank 1 in each algorithm. The
baseline algorithm is 39% accurate at rank 1 when
the target category contains 7 entries. By merging the
recommendations of the 4 approaches, we reach 78.7%
accuracy on average, recommending a maximum of 3
categories. 30.6% of the time we need to make only one recommendation, which is correct 81.4% of the time.
1. Motivation
Web browsing software such as Safari, Internet Explorer, and Mozilla Firefox includes a ‘bookmark’ or
‘favorites’ facility, so that a user can keep an electronic
record of web sites or web pages that they have visited
and are likely to want to revisit. It is usually possible to
manually organise these bookmark entries into related
collections called ‘folders’ or categories. If bookmark
files are kept organised and up-to-date, they could be
a good indication of a user’s long-term and short-term
interests which could be used to automatically identify
and retrieve related information. However, bookmark
files require user effort to keep organised, so a collection of bookmarks tends to become disorganised over
time [1], [2].
We describe HyperBK2, which can assist users in keeping
a collection of bookmarks organised by recommending
the category in which to store the entry for a web page
in the process of being bookmarked. We examine some
different approaches to indexing web pages, deriving
category representations, and classifying web pages
into categories. Ideally, a user would be recommended
a single category which would always be the correct
one (i.e., the user would never opt to save the entry to
some other category). The approaches that we have
compared do not meet this ideal, but we can offer
the user a selection of categories that may include
the correct one. Of course, we can do this trivially by
offering the user all the categories that exist (just as
in information retrieval we can guarantee a recall of
100% by retrieving all documents in the collection),
so we want to show the user as small a selection of
recommendations as possible, while maximising the
chances that the small selection contains the correct,
target, category.
We experimented with taking the top-5 recommendations from two different approaches and fusing them,
which resulted in an average of 7 recommendations in
a results set and an accuracy of 80%, and with fusing
the results of the four different approaches at rank 1,
which gives comparable accuracy while requiring us to offer the user a maximum of only 3 recommendations.
In section 2 we discuss similar systems. HyperBK2’s
indexing and classification approach is discussed in
section 3, the evaluation approach in section 4,
and the results are presented and discussed in section
5. Section 6 outlines our future work and conclusions.
2. Background and Similar Systems
Web pages are frequently classified to automatically
create web directories [3], [4], [5], [6] with predefined
categories [7] or dynamic or ad hoc categories [6],
[3], or to assist users in bookmarking favourite web
pages [8]. Bookmarking is a popular way to store
and organise information for later use [9]. However,
drawbacks exist, especially when the number of bookmarks increases over time [10]. Bookmark managers
support users in creating and maintaining reusable bookmark lists. These may store information locally, such as HyperBK [11], Conceptual Navigator (http://www.abscindere.com/) and Check&Get (http://activeurls.com), or centrally, such as CariBo [8], Delicious (http://del.icio.us) and BookmarkTracker (http://www.bookmarktracker.com).
Web pages may be classified using contextual information, such as neighbourhood information and link
graphs [4], [5], using supervised [12] or partially supervised learning techniques [6]. The approach in [7] summarizes a
web page before classifying it, to eliminate noise in
the form of navigational links and adverts.
Delicious, which is an online service, allows users to
share bookmarks. Categorisation is aided by the use
of tags, which users associate with their bookmarks.
However, there are no explicit category recommendations when a new bookmark is being stored. InLinx [7]
provides for both recommendations and classification
of bookmarks into “globally predefined categories”.
Classification is based on the user’s profile and the
web-page content. CariBo [8] classifies a bookmark
by first establishing a similarity in the interests of two
users and then finding a mapping between the folder
location of a bookmark in the collaborators’ bookmark
files and that of the target user’s bookmark hierarchy.
In previous work [11], we built a bookmark management system, HyperBK, that can recommend a
destination bookmark category (folder). However, only
a small number of bookmark files had been used in
the evaluation. In HyperBK2, we have modified our
approach to indexing and classification, and we have
evaluated the new approach using 80 bookmark files.
3. HyperBK2’s Indexing and Classification
Approach
The literature suggests that approaches to web page classification frequently use a global classification taxonomy [7] or make use of
a web page’s neighbourhood information [4], [5].
We want to take a partially supervised approach to
clustering [6]: the only sources of information are the
web page to be bookmarked and web page entries
in the user’s existing bookmark categories (positive
examples). We avoid using a global classification
taxonomy, instead using the categories that a user
has created in his or her own bookmark file. This
allows our recommendations to be personalised, and
bookmark entries will be grouped according to an
individual user’s needs and preferences. We examine
four approaches to indexing bookmark files and
classifying web pages into an existing bookmark
category. We use one, called TITLE-ONLY, as a
baseline. As the name suggests, only the document title is indexed, using TFIDF [13]. The indexed
titles are combined to build a centroid representation
of a category. An incoming document is classified
according to the similarity of its title to each of
the category centroids. The other three approaches,
FULL-TEXT, CLUSTER, and SINGLETON all build
their indices, and classify web pages based on a
document’s full-text. They vary according to what
bookmarked web pages (bookmark entries) in a
category are used to derive centroids for categories
to compare to the incoming web page. TITLE-ONLY
and FULL-TEXT build one centroid per category,
using all the entries in the category. SINGLETON
treats each entry as a cluster centroid, so a category
containing n entries will have n centroids. CLUSTER
clusters the n entries in a category using a thresholded similarity measure, deriving m centroids (where 1 ≤ m ≤ n). In both SINGLETON and CLUSTER, the
category recommended for an incoming web page is
the category containing the centroid most similar to
the incoming web page. The user can always override
HyperBK2’s recommendation and store the entry in
another category or create a new category.
Regardless of the approach used, we first create a
forward index for each document referred to in the
bookmark file. We remove stop words, HTML and
script tags, stem the remaining terms using Gupta’s Python implementation of the Porter Stemmer (http://tartarus.org/martin/PorterStemmer/python.txt), and
calculate the term frequency for each stem. Once the
forward indexing of bookmark entries is complete, we
identify the documents that are to be used to create
the centroid or centroids for each category. For the
TITLE-ONLY and FULL-TEXT approaches, we take
the appropriate forward index of each document d1 to
dN in a category and we merge them, calculating a
term weight by summing the term frequencies (TF) of
each term $j_1$ to $j_m$ in each document in the category, and multiplying it by the normalised document frequency ($NDF_j = DF_j/N$, where $N$ is the total number of documents in the category): $w_j = \sum_{d=1}^{N} TF_{j,d} \times NDF_j$. This has the effect of reducing the weight of terms that occur in few documents in the category.
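As an illustration, the sketch below shows how such a forward index and merged category centroid might be computed. It is a minimal reconstruction from the description above, not HyperBK2’s actual code: the stop list is a stand-in, and the stemming step is noted but omitted.

    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}  # stand-in list

    def forward_index(html):
        """Forward index for one document: term (stem) -> term frequency."""
        text = re.sub(r"<script.*?</script>|<[^>]+>", " ", html, flags=re.S)  # drop script/HTML tags
        terms = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
        # A full implementation would stem each term here with the Porter stemmer.
        return Counter(terms)

    def category_centroid(doc_indices):
        """Merge a category's forward indices into one centroid:
        w_j = (sum of TF_j over documents) * NDF_j, where NDF_j = DF_j / N."""
        n = len(doc_indices)
        summed_tf, df = Counter(), Counter()
        for index in doc_indices:
            summed_tf.update(index)        # accumulate term frequencies
            df.update(index.keys())        # count documents containing each term
        return {term: tf * (df[term] / n) for term, tf in summed_tf.items()}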
For the SINGLETON approach, each selected document becomes a centroid in its own right. For the CLUSTER approach, we
pick a document representation from a category, and
compare it to each of the remaining (unclustered)
documents in the same category, merging it with
those representations that are similar above a certain
threshold (arbitrarily set to 0.2). We then create
another centroid by picking one of the currently
unclustered documents in the category, and merging
it with other, similar, currently unclustered document
representations. If a document is not sufficiently
similar to any other document in the category, it
becomes a centroid in its own right. This iterative
process continues until each document in a category
is allocated to a cluster, or is turned into a centroid.
Cluster membership is influenced by the order in
which bookmark entries are created in a category,
because we select entries (to form the first centroid of
a category, and to compare entries to the centroid) in
the order that entries are added by the user.
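The following sketch illustrates this greedy, order-dependent clustering under the same assumptions as the previous sketch; cosine() and a merge() that sums term weights are defined here for completeness, and the 0.2 threshold matches the text.

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse term-weight dictionaries."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def merge(vectors):
        """Sum term weights to form a single cluster centroid."""
        merged = {}
        for vec in vectors:
            for term, w in vec.items():
                merged[term] = merged.get(term, 0.0) + w
        return merged

    def cluster_category(doc_vectors, threshold=0.2):
        """Greedily cluster a category's document vectors in bookmark-creation
        order: each seed absorbs the remaining documents whose similarity to
        it is above the threshold; dissimilar documents later seed centroids
        of their own."""
        centroids = []
        unclustered = list(doc_vectors)
        while unclustered:
            seed = unclustered.pop(0)
            members, rest = [seed], []
            for doc in unclustered:
                (members if cosine(seed, doc) > threshold else rest).append(doc)
            centroids.append(merge(members))
            unclustered = rest
        return centroids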
We create a TITLE-ONLY or FULL-TEXT
representation of the web page to classify, using
the same formula (with NDF = 1). We then use
the Cosine Similarity Measure [13] to measure the
similarity between a web page and each category
centroid in the bookmark file. The category containing
the highest ranking category centroid is recommended.
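A classification step consistent with this description might look as follows, reusing cosine() from the sketch above. The dictionary mapping category names to centroid lists is an assumed representation, chosen so that the single-centroid and multi-centroid approaches are handled uniformly.

    def rank_categories(page_vector, categories):
        """Score each category by its best-matching centroid and return
        (category, similarity) pairs, highest first. `categories` maps a
        category name to its list of centroids (one for TITLE-ONLY and
        FULL-TEXT; one per cluster or entry for CLUSTER and SINGLETON)."""
        scores = {name: max(cosine(page_vector, c) for c in centroids)
                  for name, centroids in categories.items()}
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)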
4. Evaluation Approach
We collected 80 bookmark files from anonymous
users. Each bookmark file is in the Netscape bookmark
file format (http://msdn2.microsoft.com/en-us/library/Aa753582.aspx), and stores the date that each bookmark
entry was created. We use this date to re-create a
snapshot of the bookmark file’s state just prior to the
addition of the bookmark to be classified.
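A snapshot can be re-created by filtering on the creation timestamps stored in the Netscape format, roughly as sketched below; the entry representation (a dictionary per entry with an "add_date" field) is an assumption.

    def snapshot_before(bookmark_file, cutoff_date):
        """Return the bookmark file's state just before `cutoff_date`:
        every category keeps only the entries created earlier."""
        return {category: [e for e in entries if e["add_date"] < cutoff_date]
                for category, entries in bookmark_file.items()}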
The method of evaluation is to select bookmark entries
from a number of bookmark files, according to some
criteria (see below), and to measure the ability of the
classification methods to recommend their original category. We measure the presence of the target category
in ranks 1 to 5 as the accuracy at each rank.
The criteria we use to select bookmark entries for classification from a bookmark file, and to determine the eligibility of the bookmark file snapshot to participate in a particular run, are ENTRY-TO-TAKE and NO-OF-CATEGORIES. ENTRY-TO-TAKE is the nth entry in
a category that is selected for classification. If there is
a problem with the bookmark entry selected (e.g., the web page it represents no longer exists), then we
take the next entry in the category, if possible. We ran
our system with values for ENTRY-TO-TAKE of 2, 4,
6, 7, 8, 9, and 11. For example, in the simplest case
(ENTRY-TO-TAKE = 2), the second entry created in
each category would be selected for classification, and
a snapshot of that category would contain only one entry. The snapshots of other categories in the bookmark
file contain entries created before the selected entry.
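Selecting the entry to classify might be sketched as below; page_still_exists() is a hypothetical liveness check standing in for whatever validation the evaluation used, and the entry representation matches the earlier snapshot sketch.

    def select_entry_to_take(entries, n):
        """Pick the nth entry created in a category; if it is problematic
        (e.g. the page no longer exists), fall back to the next entry."""
        ordered = sorted(entries, key=lambda e: e["add_date"])
        for entry in ordered[n - 1:]:
            if page_still_exists(entry["url"]):   # hypothetical check
                return entry
        return None                               # no eligible entry in this category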
NO-OF-CATEGORIES is the number of categories
that must exist in a snapshot of a bookmark file for it to
participate in the evaluation. We imposed a minimum
of 5 categories, which would give a random classifier at most a 20% chance of correctly assigning a selected bookmark entry to its original category
(at rank 1). We wanted to see if we would bias
results in our system’s favour if we did not impose
this minimum, so we removed this constraint for the
evaluation platform with the best performing criteria
(table 3, runs 5 and 6).
We first ran the baseline TITLE-ONLY evaluation platform, using only a web page’s title to construct the document index and to perform the classification. We then compared this approach to the approach that indexed and classified documents using FULL-TEXT (table 3). Each category had just one centroid
representation created by merging the descriptions of
all the documents in a category snapshot. The results
are presented in subsection 5.2. Next, categories were
divided into clusters, using the SINGLETON approach
(in which each document in a category snapshot is
considered to be a centroid) and CLUSTER (built by
merging representations of sufficiently similar documents using the cosine similarity measure). The results
are presented in subsection 5.3 and compared to the
FULL-TEXT and TITLE-ONLY baseline results in
subsection 5.4. We notice that although the full-text
and baseline approaches appear to give similar levels
of accuracy (table 3), the different algorithms tend to
make different recommendations 33% of the time, and
by merging the results, we obtain better results (table
4). However, the disadvantage of merging the results
is that on average, a user would be recommended 7
categories. Given that 61% of the bookmark files used
in the evaluation contain up to 10 categories (table 1),
there is no advantage over allowing users to choose a
destination category, though it would be advantageous
to the 26.25% of users with 21 or more categories.
Table 1. Submitted bookmark files and their numbers of categories

No. of categories    No. of bookmark files
1                    8
2-5                  28
6-10                 13
11-20                10
21-50                9
51-100               8
101+                 4
We ran the CLUSTER and SINGLETON experiments
to see if they were more accurate than the unclustered
approaches, and also to see if we could achieve even
higher accuracy by recommending a maximum of four
categories: the categories recommended at rank 1 by
each approach. A user would then be presented with
between 1 and 4 recommendations: 1 in the event that
each approach made the same recommendation, and 4
in the event that each made a different recommendation. These results are presented in subsection 5.4.
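The fused recommendation list can be derived from the four rank-1 picks by counting agreement, for example as in this sketch (the function and example category names are ours, not HyperBK2’s).

    from collections import Counter

    def merge_rank1(picks):
        """Fuse the rank-1 category from each approach (TITLE-ONLY,
        FULL-TEXT, SINGLETON, CLUSTER) into 1-4 distinct recommendations,
        the most-agreed-upon category first."""
        return [category for category, _ in Counter(picks).most_common()]

    # Four-of-a-kind yields one recommendation; total disagreement yields four:
    # merge_rank1(["News", "News", "News", "News"])  -> ["News"]
    # merge_rank1(["News", "Sport", "News", "Tech"]) -> ["News", "Sport", "Tech"]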
5. Results
5.1. Bookmark File Properties and Bookmark
Entries Selected for Classification
In this section, we describe the general properties
of the bookmark files that we collected, in terms
of the number of categories that they contain (table
1), and provide information about the number of
bookmark file entries selected for classification for
each run (Total Eligible Entries in table 2). On
average, bookmark files used in the evaluation have
23 categories, with a minimum of 1 and a maximum
of 229. Table 1 gives the number of categories in each
bookmark file used in the evaluation. 8 files (10%)
contain only one category. 51 files (63.75%) contain
between 2 and 20 categories, and 21 (26.25%) contain
more than 20 categories. Table 2 gives the parameters
for each run, and the total number of bookmark
entries selected for classification in each run.
5.2. TITLE-ONLY and FULL-TEXT
In table 3, we present the results of the runs,
highlighting the best performances. We conducted the
evaluation as follows. From each bookmark file, all
the bookmark entries that satisfied the criteria were
extracted, and a snapshot of the bookmark file was
created per eligible bookmark entry. In table 3, we
see that from rank 2 onwards, there is an advantage of
the FULL-TEXT approach over TITLE-ONLY. Ideally,
the top ranking recommendation is the target category.
The best performance was 44% and worst was 24%
(both FT rank 1). However, it turns out that the TITLE-ONLY and FULL-TEXT approaches frequently recommend different categories (on average, the two approaches recommend 3 identical categories
and 4 different categories in all). When we merge
the results (table 4) we see an increase in accuracy
– although users would need to be shown 5 to 10
recommended categories, and 7 categories on average.
5.3. CLUSTER and SINGLETON
We ran CLUSTER and SINGLETON on categories
from which the 8th entry was taken (Run 5). SINGLETON gives 39.2% accuracy at rank 1, and 60.8%
accuracy at rank 5. CLUSTER yields an accuracy of
37.7% at rank 1, and 65.3% accuracy at rank 5. These
are similar to the accuracy of the baseline classification
at rank 1 (39%) and the FULL-TEXT results at rank
5 (64%), respectively, so are not, individually, an
improvement over the simpler approaches. However,
when we compare the recommendations made by each
of the four approaches, we are able to improve our
recommendation accuracy, and we can also reduce the
numbers of categories recommended to the user.
5.4. Merging Recommendations at Rank 1
We want a mechanism that has a good chance
of recommending the correct category, without overloading the user with too many choices of category.
We can merge the FULL-TEXT and TITLE-ONLY
recommendations to increase the recommendation accuracy, at a cost of giving the user a choice of
7 candidate categories on average (subsection 5.3).
We measure the frequency of agreement between the
different approaches at rank 1, and the accuracy of the
recommendation. We compare the results obtained for
run 5.
We can predict the quality of recommendation based
on the degree of agreement on the recommended
category between the different approaches (table 5).
The following arrangements of agreement between the
different approaches are possible: all four approaches
can give the same result (4-of-a-kind), which may
be correct or incorrect; three of the approaches may
make the same recommendation (3-of-a-kind), with
the fourth giving a different one, and either one is
correct or both are incorrect; two of the approaches may recommend the same category, with the other two agreeing on another category or each recommending a different one (2-of-a-kind), with any recommendation correct or all incorrect; or each approach may make a different recommendation (1-of-a-kind) and one of them may be correct, or they may all be incorrect.
Table 2. No. of bookmark entries classified

Run   ENTRY-TO-TAKE   NO-OF-CATEGORIES   Total Eligible Entries
1     2               5                  1064
2     4               5                  567
3     6               5                  372
4     7               5                  310
5     8               5                  253
6     8               1                  255
7     9               5                  204
8     11              5                  144
Table 3. Comparing FULL-TEXT (FT) and TITLE-ONLY (TO) classification accuracy (percent).

       rank 1    rank 2    rank 3    rank 4    rank 5
Run    TO  FT    TO  FT    TO  FT    TO  FT    TO  FT
1      26  24    31  31    33  36    33  40    34  43
2      32  27    39  36    42  44    44  48    44  52
3      33  38    39  48    44  56    46  61    47  65
4      41  38    50  50    53  56    54  62    55  63
5      39  43    46  50    49  57    52  59    52  64
6      38  44    45  53    48  60    51  62    51  66
7      29  44    41  54    44  60    45  64    45  70
8      40  36    48  48    51  55    53  59    55  63
Table 4. Merging the FULL-TEXT and TITLE-ONLY recommendations improves accuracy (percent).

Run   Rank 1   Rank 2   Rank 3   Rank 4   Rank 5
1     37       46       51       55       58
2     43       52       59       64       66
3     52       61       68       73       75
4     56       67       74       78       79
5     59       67       74       77       80
6     59       69       76       78       80
7     53       63       69       73       76
8     53       64       68       72       76
Table 5. Comparing merged recommendations at rank 1 with the baseline and merged TITLE-ONLY and FULL-TEXT approaches at rank 5

                                          4-of-a-kind  3-of-a-kind  2-of-a-kind  1-of-a-kind
Probability of observation:               30.6%        38.4%        21.4%        9.6%
Accuracy:                                 81.4%        64.2%        95.8%        40.6%
No. of recommended categories:            1            2            2 or 3       4
% improvement over baseline
  (52% accuracy @ rank 5):                +56.7%       +23.5%       +84.2%       -21.9%
No. of recommended categories
  using baseline only:                    5            5            5            5
% improvement over merged
  TITLE-ONLY + FULL-TEXT
  (80% accuracy @ rank 5):                +1.8%        -19.8%       +19.8%       -49.5%
Average no. of recommended categories
  using merged TITLE-ONLY and
  FULL-TEXT results:                      7            7            7            7
Table 5 gives the different combinations,
the frequency of observing them, and their accuracy,
the number of categories that would need to be shown
to users, and the percentage improvement over the
accuracy of the merged results of TITLE-ONLY and
FULL-TEXT (at rank 5), and the percentage improvement in accuracy over the TITLE-ONLY baseline
(at rank 5). When the approaches agree on 2, 3, or 4 of
the recommendations, the recommendation is correct
on average 78.7% of the time. An added benefit is
that HyperBK2 needs to make only 1, 2, or 3 recommendations. The approaches disagree totally only
9.6% of the time, and accuracy is only 40.6%, despite
needing to make 4 recommendations. Our results are an
improvement on CariBo’s: a collaborative bookmark category recommendation system, evaluated on the bookmark files of 15 users, that achieves 60% accuracy at rank 5 [8].
6. Future Work and Conclusions
When we merge the recommendations of the four
different indexing and clustering approaches at rank 1,
90.4% of the time we can recommend 1 to 3 categories,
and the target category will be among them on average 78.7% of the time. This gives a 51.3% improvement over the TITLE-ONLY baseline (which would need
to show users 5 categories), and a slight decrease of
just 1.7% compared to the merged recommendations
of TITLE-ONLY and FULL-TEXT (but we would
need to show users an average of 7 categories and
a maximum of 10). We have extended our work previously conducted in the area of automatic bookmark
classification by comparing indexing and classification
methods based on vector-based full-text and title-only
representations of documents in a bookmark category.
We also built a full-text representation based on category entry cluster centroids, where a cluster is either a singleton entry or its membership is based upon entry similarity. We conducted several runs in which
the bookmark entry to be selected for classification
was the 2nd, 4th, 6th, 7th, 8th, 9th, or 11th entry
created in a category. The FULL-TEXT and TITLE-ONLY approaches worked best for categories already
containing seven entries. However, SINGLETON treats
each entry as a centroid, so it can work for an arbitrary
number of entries in a category, and CLUSTER creates
an arbitrary number of centroids in a category by
merging representations of an arbitrary number of
similar entries, treating non-similar entries as singleton
centroids, so it too is unaffected by the actual number
of entries in a category. There is a greater likelihood
of making an accurate recommendation if at least two
of the approaches make the same recommendation.
We intend to automatically generate a query from
the category centroids, and evaluate the ability to
automatically find previously unseen documents that
users consider relevant and worth bookmarking.
References
[1] D. Abrams and R. Baecker, “How people use WWW
bookmarks,” in CHI ’97: CHI ’97 extended abstracts
on Human factors in computing systems. New York,
NY, USA: ACM, 1997, pp. 341–342.
[2] D. Abrams, R. Baecker, and M. Chignell, “Information
archiving with bookmarks: personal web space construction and organization,” in CHI ’98: Proceedings of
the SIGCHI conference on Human factors in computing
systems. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1998, pp. 41–48.
[3] X. Peng and B. Choi, “Automatic web page classification in a dynamic and hierarchical way,” in ICDM
’02: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02). Washington, DC,
USA: IEEE Computer Society, 2002, p. 386.
[4] X. Qi and B. D. Davison, “Knowing a web page by
the company it keeps,” in CIKM ’06: Proceedings of
the 15th ACM international conference on Information
and knowledge management. New York, NY, USA:
ACM, 2006, pp. 228–237.
[5] D. Shen, J.-T. Sun, Q. Yang, and Z. Chen, “A comparison of implicit and explicit links for web page
classification,” in WWW ’06: Proceedings of the 15th
international conference on World Wide Web. New
York, NY, USA: ACM, 2006, pp. 643–650.
[6] H. Yu, J. Han, and K. C.-C. Chang, “PEBL: Web page
classification without negative examples,” IEEE Transactions on Knowledge and Data Engineering, vol. 16,
no. 1, pp. 70–81, 2004.
[7] C. Bighini, A. Carbonaro, and G. Casadei, “Inlinx for
document classification, sharing and recommendation,”
ICALT, vol. 00, p. 91, 2003.
[8] D. Benz, K. H. L. Tso, and L. Schmidt-Thieme,
“Automatic bookmark classification - a collaborative
approach,” in Proceedings of the 2nd Workshop in
Innovations in Web Infrastructure (IWI2) at WWW2006,
Edinburgh, Scotland, May 2006.
[9] H. Bruce, W. Jones, and S. Dumais, “Keeping and re-finding information on the web: What do people do and
what do they need?” in ASIST 2004: Proceedings of the
67th ASIST annual meeting. Chicago, IL.: Information
Today, Inc., 2004.
[10] W. Jones, H. Bruce, and S. Dumais, “Keeping found
things found on the web,” in CIKM ’01: Proceedings of
the tenth international conference on Information and
knowledge management. New York, NY, USA: ACM,
2001, pp. 119–126.
[11] C. Staff and I. Bugeja, “Automatic classification of
web pages into bookmark categories,” in SIGIR ’07:
Proceedings of the 30th annual international ACM
SIGIR conference on Research and development in
information retrieval. New York, NY, USA: ACM,
2007, pp. 731–732.
[12] M. Tsukada, T. Washio, and H. Motoda, “Automatic
web-page classification by using machine learning
methods,” in WI ’01: Proceedings of the First Asia-Pacific Conference on Web Intelligence: Research and
Development. London, UK: Springer-Verlag, 2001, pp.
303–313.
[13] G. Salton and C. Buckley, “Term weighting approaches
in automatic text retrieval,” Cornell University, Ithaca, NY, USA, Tech. Rep., 1987.