Deriving Emergent Web Page Semantics
D.V. Sreenath*, W.I. Grosky**, and F. Fotouhi*
*Wayne State University
**University of Michigan-Dearborn

Semantics
The semantics of a web page is potentially richer than can be defined by the page’s author(s)
  – Some semantics emerge through context
  – A multimedia document has multiple semantics through being placed in multiple contexts

Content-Based Retrieval
Development of feature-based techniques for content-based retrieval is a mature area, at least for images
CBR researchers should now concentrate on extracting semantics from multimedia documents so that retrievals using concept-based queries can be tailored to individual users
  – The semantic gap
(Semi)-automated multimedia annotation

Multimedia Annotation(s)
Multimedia annotations should be semantically rich
  – Multiple semantics
This can be discovered by placing multimedia information in a natural, context-rich environment
  – A social theory based on how multimedia information is used

Context-Rich Environments
Structural context
  – Author’s contribution
  – Document’s author places semantically similar pieces of information close to each other
Dynamic context
  – User’s contribution
  – Short browsing sub-paths are semantically coherent

Context-Rich Environments
The Web is a perfect example of a context-rich environment
Develop multimedia annotations through cross-modal techniques
  – Audio
  – Images
  – Text
  – Video

Goal
Derive document semantics based on user browsing behavior
  – The same document has multiple semantics
      » Different people see different meanings in the same document
  – Over short browsing paths, an individual user’s wants and needs are uniform
      » The pages visited over these short paths exhibit semantics in congruence with these wants and needs

Questions
How can the semantics of a web page be derived given a set of user browsing paths that end at that page?
How can we characterize the semantics of a user browsing path?
How can web page semantics help us in navigating the web more efficiently?
How can our approach actually be implemented in the real web world?

Our Approach
We use actual browsing paths to find the latent semantics of web pages
  – Textual features
  – Image features
  – Structural features
We hope to find general concepts comprising various textual and image features which frequently co-occur

Semantic Coherence
We believe that a user’s browsing path exhibits semantic coherence
  – The user’s entire path exhibits multiple semantics, especially across pages far from each other on the path
  – Neighboring pages, especially the portions close to the links taken, are semantically close to each other

Semantic Break Points
We would like to characterize the contiguous sub-paths of a user’s browsing path that exhibit similar semantics and detect the semantic break points along the path where the semantics appreciably change
  – Collect these sub-paths into a multiset

Web Page Semantics
We categorize the semantics of each web page based on a history of the semantically coherent browsing paths of all users which end at that page
A browsing path will be represented by a high-dimensional vector
The various positions of the vector correspond to the presence of
  – textual keywords
  – image features (visual keywords)
  – structural features (structural keywords)

Deriving Emergent Web Page Semantics
From the complete set of web pages under consideration, we extract a set of textual, visual, and structural keywords
For each multiset, M, of sub-paths that we are to analyze, we form three matrices
  – term-path matrix
  – image-path matrix
  – structure-path matrix

Deriving Emergent Web Page Semantics
The (i,j)th element of each of these matrices is determined by
  – Strength of the presence of the ith keyword along the jth browsing path
      » Determined by
          How many times this keyword occurs on the pages along the path
          How much time the user spends examining these pages
          How close each occurrence of the ith keyword is to both the outgoing and incoming anchor positions
  – How many times this browsing path occurs in M

Deriving Emergent Web Page Semantics
These matrices may be concatenated together in various ways to produce an overall keyword-path matrix
Perform latent-semantic analysis to get concepts
A page is then represented by a set of concept classes

Architecture
[Figure]

Vantage Points
[Figure: labels path1 to path6 and url1 to url4]

Local Iterative Technique
[Figure]

Bob Hope Path – Pages 1 to 9
[Figures: one slide for each of the nine pages on the path]
[Figure labels: Vaudeville, Broadway, Radio, Troops]

Experiment 1 – Paths/Paths
[Figure labels: Movies, Bob Hope, Radio, Golf, Troops, Broadway, Vaudeville]

Experiment 2 – Paths/URLs
[Figure labels: Movies, Bob Hope, Radio, Golf, Troops, Broadway, Vaudeville]

Experiment 3 – URLs/URLs
[Figure labels: Golf, Movies, Radio, Bob Hope, Troops, Vaudeville, Broadway]

Issues
Data capture – privacy issues
Compute intensive SVD updating
Dynamic content
Constantly evolving websites
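
Illustrative sketch
The keyword-path matrix and latent-semantic analysis steps described in the slides can be sketched in Python with NumPy. The slides name the factors that determine each matrix element (occurrence count, viewing time, closeness to anchors, path frequency in M) but not how they are combined, so the weighting formula below is an assumption, as are the function names, the input layout, and the sample data. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def keyword_path_matrix(paths, vocabulary):
    """Build a keyword-by-path matrix (rows: keywords, columns: paths).

    Hypothetical input layout: each path is a dict with
      'count': how many times the path occurs in the multiset M
      'hits' : list of (keyword, dwell_seconds, anchor_distance), one
               tuple per keyword occurrence on the pages along the path
    """
    index = {kw: i for i, kw in enumerate(vocabulary)}
    A = np.zeros((len(vocabulary), len(paths)))
    for j, path in enumerate(paths):
        for keyword, dwell_seconds, anchor_distance in path["hits"]:
            # Assumed weighting: an occurrence counts more when the user
            # dwells longer and when it lies closer to a link anchor.
            A[index[keyword], j] += np.log1p(dwell_seconds) / (1.0 + anchor_distance)
        # Scale the column by how often this path occurs in M.
        A[:, j] *= path["count"]
    return A

def latent_concepts(A, k):
    """Truncated SVD: keep the k strongest latent concepts."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Made-up sample data, for illustration only.
vocabulary = ["vaudeville", "broadway", "radio", "golf"]
paths = [
    {"count": 2, "hits": [("vaudeville", 30.0, 0), ("broadway", 30.0, 4)]},
    {"count": 1, "hits": [("radio", 10.0, 1), ("broadway", 5.0, 0)]},
    {"count": 1, "hits": [("golf", 60.0, 2)]},
]

A = keyword_path_matrix(paths, vocabulary)
U, s, Vt = latent_concepts(A, k=2)

# Each column of Vt places one browsing path in the 2-dimensional concept
# space; each row of U does the same for one keyword.
path_coordinates = (np.diag(s) @ Vt).T
```

Separate term-path, image-path, and structure-path matrices built this way could be stacked with `np.vstack` before the SVD, which is one possible reading of "concatenated together in various ways".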