Emergent Document Semantics

Deriving Emergent Web Page Semantics
D.V. Sreenath*, W.I. Grosky**, and F. Fotouhi*
*Department of Computer Science, Wayne State University, Detroit, MI
**Department of Computer and Information, University of Michigan-Dearborn, Dearborn, MI
{sdv, fotouhi}@cs.wayne.edu, wgrosky@umich.edu
It is well known that interpretation depends on
context, whether for a work of art, a piece of literature,
or a natural language utterance. This research addresses
the dynamic context of a collection of linked multimedia
documents, of which the web is a perfect example.
Contextual document semantics emerge through
identification of various users' browsing paths through
this multimedia collection. We present techniques that
use multimedia information as part of this
determination. Some implications of our approach are
that the author of a web page cannot completely define
that document's semantics and that semantics emerge
through use.
One of the goals of this research is to implement the
reciprocal of a search engine: given a sequence of
documents comprising a user’s browsing path, generate
a query that summarizes what the user was looking for.
One specific application of this research is the
detection of terrorist trends.
Privacy laws ensure that a user’s browsing history is
not sold to advertising and marketing agencies. This
restricts the availability of such browsing history for
research purposes like ours. Most of the earlier work on
relevance-feedback based approaches to gathering and
analyzing users’ preferences has failed primarily due
to a lack of user confidence in the agencies that
collect such information and the ways in which that
information will be used. Secondly, users do not trust
any application that is downloaded and installed on their
system that captures and profiles their browsing
patterns. Thirdly, most agent-based feedback systems do
not focus on capturing the underlying purpose for which
the user is browsing through collections of documents.
We, on the other hand, have developed and tested
several approaches to derive the emergent semantics of
web documents using user browsing paths without any
explicit feedback. The information used for our analysis
is the list of URLs visited by users, which typically can
be obtained from an Internet service provider or from a
reasonably large organization like a university or a
corporation with diverse users. We have tested
approaches to capture, filter and analyze the users’
browsing paths from such a large organization.
We believe that the author of a web page contributes
to the initial semantics of that page, but that the
semantics of that page varies over a period of time
based on the users who browse through the collection of
web pages. The actual semantics of a web page is the
emergent semantics that evolves over a period of time,
which depends on the browsing paths of all the users
who visit that page. This dynamic semantics approach is
different from earlier works on deriving static semantics
based solely on link analysis. Static analysis only
captures the intended semantics of the authors of the
linked web pages. Our dynamic analysis derives the
semantics of web pages by deriving the semantics of
user browsing paths. This analysis helps capture the
semantic profile of the user, which enables trend
detection.
In this presentation, we develop a vector-space-based
approach to represent browsing paths. Each path is
represented by a vector, which captures both textual and
visual keywords from pages occurring along the path.
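
The path representation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the page dictionaries, the `img:` prefix used to keep visual keywords apart from textual ones, and the function names `path_vector` and `build_term_path_matrix` are all assumptions made for the example, and raw counts stand in for whatever term weighting the authors actually use.

```python
from collections import Counter

def path_vector(path):
    """Count textual and visual keywords over all pages of one browsing path."""
    counts = Counter()
    for page in path:
        counts.update(page.get("text", []))
        # Assumption: visual keywords are prefixed so they never collide with text terms.
        counts.update("img:" + v for v in page.get("visual", []))
    return counts

def build_term_path_matrix(paths):
    """Return (terms, matrix) where matrix[i][j] is the count of terms[i] in path j."""
    vectors = [path_vector(p) for p in paths]
    terms = sorted(set().union(*vectors)) if vectors else []
    matrix = [[vec[t] for vec in vectors] for t in terms]
    return terms, matrix

# Two hypothetical browsing paths; each page lists its extracted keywords.
paths = [
    [{"text": ["jazz", "concert"], "visual": ["stage"]},
     {"text": ["jazz", "tickets"], "visual": []}],
    [{"text": ["tickets", "flight"], "visual": ["airplane"]}],
]
terms, matrix = build_term_path_matrix(paths)
```

Each column of `matrix` is one path vector, so the matrix has the term-path shape that latent semantic indexing takes as input.
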
We compare various approaches, each based on
latent semantic indexing, to find what we call semantic
breakpoints. These are used to decompose our original
path set into a set of subpaths, which are semantically
coherent. Using these semantically coherent browsing
paths will then enable us to track the emerging
semantics of individual web pages, as well as to
characterize the browsing behavior of individual users.
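
To make the idea of a semantic breakpoint concrete, here is a simplified sketch. It does not reproduce the latent-semantic-indexing approaches compared in this work, whose details are not given here. As a plainly named stand-in, it measures cosine similarity between the raw keyword counts of consecutive pages and starts a new subpath wherever the similarity drops below a threshold. The function names and the threshold value of 0.2 are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two keyword-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def split_at_breakpoints(page_vectors, threshold=0.2):
    """Split a path into subpaths (lists of page indices).

    A breakpoint is declared between two consecutive pages whose
    similarity falls below the (hypothetical) threshold.
    """
    if not page_vectors:
        return []
    subpaths = [[0]]
    for i in range(1, len(page_vectors)):
        if cosine(page_vectors[i - 1], page_vectors[i]) < threshold:
            subpaths.append([i])      # breakpoint: start a new subpath
        else:
            subpaths[-1].append(i)    # same topic: extend the current subpath
    return subpaths

# A hypothetical four-page path that drifts from music to travel.
path = [
    Counter(jazz=2, concert=1),
    Counter(jazz=1, tickets=1),
    Counter(flight=2, hotel=1),
    Counter(flight=1, airport=1),
]
subpaths = split_at_breakpoints(path)
```

In an LSI-based variant, the same comparison would be made between page vectors projected into the reduced-dimensional space rather than between raw counts.
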
Each subpath represented as a vector in the term-path
matrix can be visualized as a point in a reduced
dimensional space. The semantics of a web page, w, can
then be defined as the subset of the points in the reduced
dimension space corresponding to the sub-paths that are
within a threshold distance from page w. The semantics
of a user browsing path is then the collection of
concepts represented by the semantics of the pages
traversed by the user. Without getting into the world of
linguistics and natural language processing, we
represent semantics by similarity. As part of the training
data, we include paths with known semantics. One can
find reasonably good categorizations at
http://www.dmoz.org, which we use as the training set
for various topics. We formulate a query page
comprising terms that best represent a topic. Using this
query, one can determine if there are any matches to the
query in the browsing history. We accomplish this by
placing the query (also represented as a point) in the
vector space and computing the distance between the
query point and the set of all the points in the vector
space.
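
The matching step can be illustrated with a short sketch. It assumes the subpaths and the query have already been projected into the reduced-dimensional space. The two-dimensional coordinates, the subpath names, the threshold, and the choice of Euclidean distance are all assumptions for the example, since the text above specifies only "distance".

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def points_within(query_point, subpath_points, threshold):
    """Return (name, distance) pairs for subpaths within threshold, nearest first."""
    hits = [(name, euclidean(query_point, pt))
            for name, pt in subpath_points.items()]
    return sorted((h for h in hits if h[1] <= threshold), key=lambda h: h[1])

# Hypothetical subpath coordinates in a 2-dimensional reduced space.
subpath_points = {
    "subpath_a": (0.9, 0.1),
    "subpath_b": (0.8, 0.3),
    "subpath_c": (0.1, 0.9),
}
query_point = (1.0, 0.0)
matches = points_within(query_point, subpath_points, threshold=0.5)
```

Here `subpath_a` and `subpath_b` fall within the threshold and `subpath_c` does not. The same function could serve the definition of page semantics given earlier: passing the point for a page w in place of the query returns the subpaths within a threshold distance of w.
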