Deriving Emergent Web Page Semantics

D.V. Sreenath*, W.I. Grosky**, and F. Fotouhi*
*Wayne State University
**University of Michigan-Dearborn
Semantics

The semantics of a web page is potentially
richer than can be defined by the page’s
author(s)
– Some semantics emerge through context
– A multimedia document has multiple semantics
through being placed in multiple contexts
Content-Based Retrieval


Development of feature-based techniques for
content-based retrieval is a mature area, at least for
images
CBR researchers should now concentrate on
extracting semantics from multimedia documents
so that retrievals using concept-based queries can
be tailored to individual users
– The semantic gap

(Semi)-automated multimedia annotation
Multimedia Annotation(s)

Multimedia annotations should be
semantically rich
– Multiple semantics

These semantics can be discovered by placing
multimedia information in a natural,
context-rich environment
– A social theory based on how multimedia
information is used
Context-Rich Environments

Structural context – Author’s contribution
– Document’s author places semantically similar pieces
of information close to each other

Dynamic context – User’s contribution
– Short browsing sub-paths are semantically coherent
Context-Rich Environments
The Web is a perfect example of a context-rich environment
Develop multimedia annotations through
cross-modal techniques
– Audio
– Images
– Text
– Video
Goal

Derive document semantics based on user
browsing behavior
– The same document has multiple semantics
» Different people see different meanings in the same
document
– Over short browsing paths, an individual user’s
wants and needs are uniform
» The pages visited over these short paths exhibit
semantics in congruence with these wants and needs
Questions




How can the semantics of a web page be derived
given a set of user browsing paths that end at that
page?
How can we characterize the semantics of a user
browsing path?
How can web page semantics help us in
navigating the web more efficiently?
How can our approach actually be implemented in
the real web world?
Our Approach

We use actual browsing paths to find the
latent semantics of web pages
– Textual features
– Image features
– Structural features

We hope to find general concepts
comprising various textual and image
features which frequently co-occur
Semantic Coherence

We believe that a user’s browsing path
exhibits semantic coherence
– While the user’s entire path exhibits multiple
semantics, especially pages far from each other
on the path, neighboring pages, especially the
portions close to the links taken, are
semantically close to each other
Semantic Break Points

We would like to characterize the
contiguous sub-paths of a user’s browsing
path that exhibit similar semantics and
detect the semantic break points along the
path where the semantics appreciably
change
– Collect these sub-paths into a multiset
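
The slides do not say how semantic coherence is measured, so the following is only a minimal sketch under stated assumptions: each page is a sparse keyword-count vector, adjacent pages are compared with cosine similarity, and a break point is declared when the similarity drops below a hypothetical `threshold`. The page ids and keyword counts are made-up toy data.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse keyword-count dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def split_at_break_points(path, page_vectors, threshold=0.3):
    """Split a browsing path (list of page ids) into contiguous sub-paths,
    cutting wherever two adjacent pages are less similar than `threshold`.
    The threshold value is an assumption, not taken from the slides."""
    if not path:
        return []
    sub_paths, current = [], [path[0]]
    for prev, nxt in zip(path, path[1:]):
        if cosine(page_vectors[prev], page_vectors[nxt]) < threshold:
            sub_paths.append(tuple(current))  # semantic break point found
            current = []
        current.append(nxt)
    sub_paths.append(tuple(current))
    return sub_paths

# Toy data (hypothetical): two pages about the stage, two about golf.
page_vectors = {
    "p1": {"vaudeville": 3, "stage": 1},
    "p2": {"vaudeville": 1, "stage": 2, "broadway": 1},
    "p3": {"golf": 4, "tournament": 2},
    "p4": {"golf": 2, "tournament": 1, "charity": 1},
}
paths = [["p1", "p2", "p3", "p4"], ["p3", "p4"]]

# Collect the coherent sub-paths of all users into a multiset.
multiset = Counter()
for p in paths:
    multiset.update(split_at_break_points(p, page_vectors))
```

A `Counter` keyed by sub-path tuples serves as the multiset, so the number of times each sub-path occurs is available later when weighting matrix entries.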
Web Page Semantics



We categorize the semantics of each web page
based on a history of the semantically-coherent
browsing paths of all users which end at that page
A browsing path will be represented by a high-dimensional vector
The various positions of the vector correspond to
the presence of
– textual keywords
– image features (visual keywords)
– structural features (structural keywords)
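
As an illustration of the vector layout only, here is a small sketch that marks the presence of each keyword along a path. The function name `path_vector`, the `(kind, keyword)` tagging, and the sample keywords are assumptions made for this example; the slides do not prescribe an encoding.

```python
def path_vector(pages, text_vocab, visual_vocab, struct_vocab):
    """Return a 0/1 vector with one position per keyword, ordered as
    textual, then visual, then structural keywords.
    `pages` is a list of sets of (kind, keyword) pairs, one set per page."""
    vocab = ([("text", k) for k in text_vocab]
             + [("visual", k) for k in visual_vocab]
             + [("struct", k) for k in struct_vocab])
    present = set().union(*pages) if pages else set()
    return [1 if key in present else 0 for key in vocab]

# Hypothetical two-page path.
pages = [
    {("text", "golf"), ("visual", "green_texture")},
    {("text", "golf"), ("struct", "table_layout")},
]
vector = path_vector(pages,
                     text_vocab=["golf", "radio"],
                     visual_vocab=["green_texture"],
                     struct_vocab=["table_layout"])
```

Tagging each keyword with its kind keeps a textual keyword and a visual keyword with the same name in separate positions.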
Deriving Emergent Web Page Semantics
From the complete set of web pages under
consideration, we extract a set of textual,
visual, and structural keywords
For each multiset, M, of sub-paths that we
are to analyze, we form three matrices

– term-path matrix
– image-path matrix
– structure-path matrix
Deriving Emergent Web Page Semantics

The (i,j)th element of each matrix is
determined by
– The strength of the presence of the ith keyword along the jth
browsing path, which depends on
» How many times the keyword occurs on the pages along the path
» How much time the user spends examining these pages
» How close each occurrence of the ith keyword is to both the
outgoing and incoming anchor positions
– How many times this browsing path occurs in M
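
The slides list the factors but not how they are combined. The sketch below is one possible combination, offered purely as an assumption: each occurrence contributes its page's dwell time, discounted by distance to the nearest anchor, and the sum is multiplied by the path's count in M. The names `Occurrence`, `keyword_strength`, `matrix_entry`, and the `decay` parameter are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Occurrence:
    keyword: str
    dwell_seconds: float  # time the user spent on the page holding this occurrence
    anchor_distance: int  # distance (e.g. in tokens) to the nearest incoming/outgoing anchor

def keyword_strength(occurrences, keyword, decay=0.01):
    """Sum over occurrences of dwell time times an anchor-proximity factor.
    Summing per occurrence also accounts for how often the keyword appears."""
    total = 0.0
    for occ in occurrences:
        if occ.keyword == keyword:
            proximity = 1.0 / (1.0 + decay * occ.anchor_distance)
            total += occ.dwell_seconds * proximity
    return total

def matrix_entry(occurrences, keyword, path_count):
    """Entry (i, j): keyword strength along path j, scaled by how many
    times path j occurs in the multiset M."""
    return path_count * keyword_strength(occurrences, keyword)

# Hypothetical occurrences along one browsing path that appears twice in M.
occurrences = [
    Occurrence("golf", dwell_seconds=10.0, anchor_distance=0),
    Occurrence("golf", dwell_seconds=10.0, anchor_distance=100),
    Occurrence("radio", dwell_seconds=5.0, anchor_distance=0),
]
golf_entry = matrix_entry(occurrences, "golf", path_count=2)
radio_entry = matrix_entry(occurrences, "radio", path_count=2)
```

A multiplicative form was chosen here so that a keyword far from every anchor, or on a page the user barely looked at, contributes little; an additive or log-scaled weighting would be an equally plausible reading of the slide.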
Deriving Emergent Web Page Semantics
These matrices may be concatenated
together in various ways to produce an
overall keyword-path matrix
Perform latent semantic analysis to get
concepts
A page is then represented by a set of
concept classes
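
Latent semantic analysis is normally carried out with a singular value decomposition of the keyword-path matrix. To keep the sketch dependency-free, the code below approximates the leading singular triplets with power iteration and deflation, which is a stand-in for a library SVD and not the authors' implementation. The matrix values and keyword labels are toy data, and `top_concepts` is a hypothetical name.

```python
from math import sqrt

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_concepts(A, k=2, iters=200):
    """Approximate the k leading singular triplets (sigma, u, v) of A.
    u weights keywords (rows), v weights paths (columns)."""
    A = [row[:] for row in A]
    n_cols = len(A[0])
    concepts = []
    for _ in range(k):
        At = transpose(A)
        # Fixed, slightly non-uniform start vector (a simplification).
        v = [1.0 + 0.01 * j for j in range(n_cols)]
        for _ in range(iters):
            w = matvec(At, matvec(A, v))  # apply A^T A
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        Av = matvec(A, v)
        sigma = sqrt(sum(x * x for x in Av))
        if sigma == 0.0:
            break
        u = [x / sigma for x in Av]
        concepts.append((sigma, u, v))
        # Deflate: subtract the rank-1 component just found.
        A = [[A[i][j] - sigma * u[i] * v[j] for j in range(n_cols)]
             for i in range(len(u))]
    return concepts

# Toy keyword-path matrix: rows are keywords, columns are browsing paths.
keywords = ["vaudeville", "broadway", "golf", "troops"]
A = [
    [3.0, 2.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
]
concepts = top_concepts(A, k=2)
```

In this toy matrix the two groups of paths share no keywords, so each recovered concept puts its weight on one group of keywords only. With real data, a library SVD routine would be the usual choice.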

Architecture
Vantage Points
[Figure: diagram labeled with path1 to path6 and url1 to url4]
Local Iterative Technique
Bob Hope Path – Pages 1 to 9
[Figures: one slide per page of the path; labels shown: Vaudeville, Broadway, Radio, Troops]
Experiment 1 – Paths/Paths
[Figure: labels Movies, Bob Hope, Radio, Golf, Troops, Broadway, Vaudeville]
Experiment 2 – Paths/URLs
[Figure: labels Movies, Bob Hope, Radio, Golf, Troops, Broadway, Vaudeville]
Experiment 3 – URLs/URLs
[Figure: labels Golf, Movies, Radio, Bob Hope, Troops, Vaudeville, Broadway]
Issues
Data capture – privacy issues
Compute-intensive
SVD updating
Dynamic content
Constantly evolving websites
