A Multi-Criteria Content-based Filtering System

Gabriella Pasi
Dipartimento di Informatica
Sistemistica e Comunicazione
Università degli Studi di Milano
+39-02-6448-7847
gabriella.pasi@unimib.it
Gloria Bordogna
IDPA
Consiglio Nazionale delle Ricerche
+39-035-622-4262
gloria.bordogna@idpa.cnr.it
Robert Villa
Department of Computer Science
University of Glasgow
+44-(0)141-330-2998
villar@dcs.gla.ac.uk
ABSTRACT
In this paper we present a novel filtering system, based on a new
model which reshapes the aims of content-based filtering. The
filtering system has been developed within the EC project PENG
[3], aimed at providing news professionals, such as journalists,
with a system supporting both filtering and retrieval capabilities.
In particular, we suggest that in tackling the problem of
information overload, it is necessary for filtering systems to take
into account multiple aspects of incoming documents in order to
estimate their relevance to a user's profile, and in order to help
users better understand documents, as distinct from solely
attempting to either select relevant material from a stream, or
block inappropriate material. To do this, a filtering model
based on multiple criteria has been defined, drawing on the ideas
gleaned in the project requirements stage. The filtering model is
briefly described in this paper.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval - Information Filtering.
General Terms
Design, Algorithms, Human Factors
Keywords
Content-based filtering, models, frameworks, requirements
gathering
1. INTRODUCTION
Content-based information filtering systems aim to select relevant
information from a continuous and large data stream which is
being pushed towards a user, based on knowledge of the user's
interests encoded into a user profile. Such systems are typically
designed to manage large volumes of dynamically generated
information, such as news streams, or more recently, RSS feeds
and blogs. The information stored in a user’s profile represents a
“stable” and structured information need, typically static over a
relatively long period of time compared to user queries addressed to
an Information Retrieval system [1]. Content-based filtering
systems include [2,5], and previous entries into the TREC
Filtering Track [4].
Copyright is held by the author/owner(s).
SIGIR’07, July 23-27, 2007, Amsterdam, The Netherlands.
ACM 978-1-59593-597-7/07/0007.
Systems such as those above consider the filtering task as a hard
classification task: the aim is to determine, for each new
document presented to the system, whether that document is
relevant or not based on the user's profile, and any other
information the system can gather based on user feedback. The
results of a filtering system are the selected documents, presented
to the user, who may then provide relevance feedback to the
system, which may be used to aid future relevance judgments.
The intuition behind such systems is that in order to support the
user when confronted with information overload, the aim is the
accurate identification of relevant information. While a laudable
goal, in practice we conjecture that such a filtering framework
does not satisfactorily model the requirements typical of many
filtering scenarios. We base this conjecture on the experiences in
the PENG project, which we outline in the next section.
2. A FILTERING FRAMEWORK
PENG [3], an EU funded project, aimed to provide an integrated
environment for news professionals (journalists and editors)
including a filtering component for news streams. Requirements
gathering was carried out principally through interviews with
fourteen journalists and other news professionals. Through these
interviews, it was found that some of the assumptions underlying
the classical filtering framework do not appear to hold, at least in
the domain of news gathering. Findings included:
● many journalists feared that filtering systems would block
access to potentially useful information
● relevant material is selected based on many different
criteria, which may be personal or related to the working
environment. A particularly important criterion is the
reliability of the source of the information
● the assumption of filtering as selection was found to be
lacking. Rather than tackling the information overload
problem through aiding a “blind” selection of material, it
was found to be more important to support users
in understanding the relevance estimate of incoming
data.
The first and last points may be considered as related: fear of
missing important information may be considered as a lack of
knowledge about the data arriving. This initial requirements
gathering led us to define a new model for filtering incoming
news, based on the consideration of multiple criteria. The output
of the filtering system is a ranked list of items, which can be
dynamically re-organized by users based on their multi-faceted
needs. In doing this we simultaneously make the filtering task
more difficult and yet easier. The filtering system must now
provide customizable ranked lists, placing a document in a
position relative to other documents, attempting to both provide a
degree of explanation – structure – and allowing flexible methods
of structuring the results. Yet this also makes the filtering easier
and more controllable, since the system does not need to make binary
relevance assessments, and through extra structure, the user is
better able to explore results, lessening the risk of missing
important data. Core to this is the user as an active participant.
3. THE FILTERING MODEL
The PENG filtering model has two main components
corresponding to the framework introduced above.
The first is a profile matching component, which compares the
incoming documents to a user profile. This corresponds most
closely to conventional filtering, but has two important
differences. First, the output of the matching is a ranked list,
maintained over time, the length of the list being maximized and
dependent on processing resources available. Secondly, the aim of
the matching is not to determine if a document is strictly relevant
to a profile, but rather to disregard the documents which are very
unlikely to be relevant, keeping as many potentially relevant
documents in the system as possible (it is of course impractical to
keep a complete ranked list for all received documents). The second
component is the re-ranking and structuring component, which
enables the user to visualize the matching results in different
ways, with the aim of providing a greater indication of how a new
document relates to other documents.
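As an illustration of the first component, the following is a minimal sketch of a ranked list that is maintained over time with a bounded length. It is not the PENG implementation: the class name `BoundedRankedList`, the `capacity` limit (standing in for the available processing resources) and the `min_score` cut-off (standing in for "very unlikely to be relevant") are all assumptions made for this example.

```python
import heapq


class BoundedRankedList:
    """Keeps the highest-scoring documents seen so far, up to `capacity`."""

    def __init__(self, capacity, min_score=0.0):
        self.capacity = capacity    # assumed resource limit on list length
        self.min_score = min_score  # assumed cut-off for clearly irrelevant items
        self._heap = []             # min-heap of (score, arrival_index, doc_id)
        self._arrivals = 0

    def add(self, doc_id, score):
        """Offer a new document; return True if it is kept in the list."""
        if score < self.min_score:
            return False            # disregard very unlikely candidates
        entry = (score, self._arrivals, doc_id)
        self._arrivals += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if entry > self._heap[0]:
            # List is full: evict the current lowest-ranked document.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def ranked(self):
        """Return (doc_id, score) pairs, best first."""
        return [(doc_id, score)
                for score, _, doc_id in sorted(self._heap, reverse=True)]
```

Note that no document is labelled relevant or non-relevant here; a document is only dropped when its score is below the cut-off or when the list is full and it ranks below every retained item.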
Both of the above stages operate based on a number of different
criteria. Currently the matching stage computes an overall
Retrieval Status Value (RSV) for each incoming document based
on the consideration of the following criteria:
● Aboutness, which corresponds to the conventional cosine
similarity between profile and document.
● Coverage, which measures how much of the user profile
is entailed by the contents of the latest document. This
has been modelled by means of fuzzy inclusion [6].
● Reliability, which refers to the source of the document and
is aimed at filtering out documents which do not come from
sources of the required reliability specified by the user in
her/his profile.
Aboutness and coverage scores are combined to compute the
RSV. In the filtering system, documents and information content
specified in the user’s profile are represented using the classical
vector space model.
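The matching criteria above can be sketched as follows. This is a hedged illustration only: the paper does not give the formulas, so the fuzzy inclusion degree used for coverage (sum of term-wise minima divided by the sum of profile weights), the reliability threshold test, and the weighted average controlled by the hypothetical parameter `alpha` are assumptions. Profiles and documents are assumed to be sparse term-weight dictionaries with weights in [0, 1].

```python
import math


def aboutness(profile, doc):
    """Cosine similarity between profile and document term vectors."""
    dot = sum(w * doc.get(term, 0.0) for term, w in profile.items())
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_p == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_p * norm_d)


def coverage(profile, doc):
    """Assumed fuzzy inclusion degree of the profile in the document."""
    total = sum(profile.values())
    if total == 0.0:
        return 0.0
    shared = sum(min(w, doc.get(term, 0.0)) for term, w in profile.items())
    return shared / total


def rsv(profile, doc, source_reliability, required_reliability, alpha=0.5):
    """Return a combined score, or None if the source is not reliable enough.

    The linear combination weighted by `alpha` is an assumption; the paper
    only states that aboutness and coverage are combined.
    """
    if source_reliability < required_reliability:
        return None
    return alpha * aboutness(profile, doc) + (1.0 - alpha) * coverage(profile, doc)
```

The two scores answer different questions: aboutness is penalized by off-topic terms in the document, whereas coverage only asks how much of the profile the document accounts for.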
The re-ranking stage utilizes the following criteria:
● Novelty: how much new information does the document
provide compared to the existing documents? An existing
example of the use of novelty is [2].
● Timeliness: does the document reflect the most up-to-date
aspects of the user profile? Timeliness is evaluated as the
conformity of the considered document to a time-window
useful to the user, stored in her/his profile.
The criteria in the re-ranking stage are used in the following way:
each filtered document has an associated RSV score (as explained
above) and two additional scores: one indicating its novelty with
respect to the user profile and the other one indicating its
fulfillment of the temporal constraints specified in the user
profile. In the current version of the filtering system, novelty and
timeliness are evaluated for each filtered document, and the user
can re-rank the filtered documents by either one of these two
scores. We are also studying the possibility of combining these
parameters to produce an overall ranking score.
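The re-ranking step described above can be sketched as follows, under stated assumptions. The `FilteredDoc` record, the linear decay used for timeliness inside the user's time window, and the use of the RSV to break ties are all hypothetical choices for illustration; the novelty score is taken as already computed, since the paper does not specify how it is derived.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FilteredDoc:
    doc_id: str
    rsv: float            # score from the matching stage
    novelty: float        # assumed to be precomputed, in [0, 1]
    published: datetime


def timeliness(doc, now, window):
    """Assumed score: 1.0 if published now, falling linearly to 0.0 at the
    edge of the user's time window, and 0.0 outside it."""
    age = now - doc.published
    if age < timedelta(0) or age > window:
        return 0.0
    return 1.0 - age / window


def rerank(docs, criterion, now, window):
    """Re-rank filtered documents by one criterion, ties broken by RSV."""
    if criterion == "novelty":
        score = lambda d: d.novelty
    elif criterion == "timeliness":
        score = lambda d: timeliness(d, now, window)
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return sorted(docs, key=lambda d: (score(d), d.rsv), reverse=True)
```

Because each criterion yields its own ordering of the same filtered set, the user can switch between views without the matching stage being re-run.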
4. ONGOING WORK
Work is ongoing to develop the techniques required
to evaluate the proposed filtering framework and model, including
adapting conventional filtering evaluation as used in [4] to take
account of the new aims of this style of filtering.
5. ACKNOWLEDGMENTS
This work has been carried out in the PENG Specific Targeted
Research Project (IST-004597), funded within the Sixth Framework
Programme of the European Research.
6. REFERENCES
[1] Belkin, N. J. and Croft, W. B. (1992) Information filtering
and Information Retrieval: Two sides of the same Coin? In
Communications of the ACM, 35, 12.
[2] Gabrilovich, E., Dumais, S. and Horvitz, E. (2004) Newsjunkie:
Providing Personalized Newsfeeds via Analysis of
Information Novelty. In WWW2004, May 2004, New York.
[3] Pasi, G. and Villa, R. (2005) The PENG Project overview. In
IDDI-05-DEXA Workshop, Copenhagen.
[4] Robertson, S. and Soboroff, I. (2002) The TREC 2002 Filtering
Track Report. NIST Special Publication 500-251: The
Eleventh Text REtrieval Conference.
[5] Bell, T. A. H. and Moffat, A. (1996) The Design
of a High Performance Information Filtering System. In
SIGIR’96, Zurich, Switzerland.
[6] Miyamoto, S. (1990) Fuzzy IR and clustering techniques.
Kluwer.