Deriving Emergent Web Page Semantics

D.V. Sreenath*, W.I. Grosky**, and F. Fotouhi*
*Department of Computer Science, Wayne State University, Detroit, MI
**Department of Computer and Information, University of Michigan-Dearborn, Dearborn, MI
{sdv, fotouhi}@cs.wayne.edu, wgrosky@umich.edu

It is well known that interpretation depends on context, whether for a work of art, a piece of literature, or a natural language utterance. This research addresses the dynamic context of a collection of linked multimedia documents, of which the web is a perfect example. Contextual document semantics emerge through identification of various users' browsing paths through this multimedia collection. We present techniques that use multimedia information as part of this determination. Two implications of our approach are that the author of a web page cannot completely define that document's semantics and that semantics emerge through use. One of the goals of this research is to implement the reciprocal of a search engine: given a sequence of documents comprising a user’s browsing path, generate a query that summarizes what the user was looking for. One specific application of this research is terrorist trend detection.

Privacy laws ensure that a user’s browsing history is not sold to advertising and marketing agencies. This restricts the availability of such browsing history for research purposes like ours. Most of the earlier work on relevance-feedback-based approaches to gathering and analyzing users’ preferences has failed for three reasons. First, users lack confidence in the agencies that collect such information and in the ways in which that information will be used. Second, users do not trust any application that is downloaded and installed on their system to capture and profile their browsing patterns. Third, most agent-based feedback systems do not focus on capturing the underlying purpose for which the user is browsing through collections of documents.
We, on the other hand, have developed and tested several approaches to derive the emergent semantics of web documents from user browsing paths, without any explicit feedback. The information used for our analysis is the list of URLs visited by users, which typically can be obtained from an Internet service provider or from a reasonably large organization, such as a university or a corporation, with diverse users. We have tested approaches to capture, filter, and analyze users’ browsing paths from such a large organization.

We believe that the author of a web page contributes to the initial semantics of that page, but that the semantics of the page changes over time according to the users who browse through the collection of web pages. The actual semantics of a web page is the emergent semantics that evolves over a period of time and depends on the browsing paths of all the users who visit that page. This dynamic approach differs from earlier work on deriving static semantics based solely on link analysis. Static analysis captures only the intended semantics of the authors of the linked web pages. Our dynamic analysis derives the semantics of web pages by deriving the semantics of user browsing paths. This analysis also helps capture the semantic profile of the user, which enables trend detection.

In this presentation, we develop a vector-space-based approach to represent browsing paths. Each path is represented by a vector that captures both textual and visual keywords from the pages occurring along the path. We compare various approaches, each based on latent semantic indexing, to find what we call semantic breakpoints. These breakpoints are used to decompose our original path set into a set of semantically coherent subpaths. Using these semantically coherent browsing paths then enables us to track the emerging semantics of individual web pages, as well as to characterize the browsing behavior of individual users.
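As a rough illustration of the vector-space representation and of breakpoint detection, the following sketch builds a term-by-page count matrix for one browsing path, projects the pages into a low-dimensional latent space with a truncated SVD (the usual core of latent semantic indexing), and splits the path wherever consecutive pages become dissimilar. The splitting rule (cosine similarity below a fixed threshold), the function names, the toy keywords, and the `img_` tokens standing in for visual keywords are all assumptions made for this example; the text above does not specify how semantic breakpoints are actually computed.

```python
import numpy as np

def term_matrix(docs, vocab):
    """Build a term-by-document count matrix (rows: terms, columns: documents)."""
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc:
            if term in index:
                matrix[index[term], j] += 1.0
    return matrix

def lsi_coordinates(matrix, k):
    """Project each column into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T  # one row of coordinates per document

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def split_at_breakpoints(path, coords, threshold):
    """Assumed rule: start a new subpath when consecutive pages are dissimilar."""
    subpaths, current = [], [path[0]]
    for i in range(1, len(path)):
        if cosine(coords[i - 1], coords[i]) < threshold:
            subpaths.append(current)
            current = []
        current.append(path[i])
    subpaths.append(current)
    return subpaths

# Hypothetical pages: textual keywords plus "img_" tokens as visual keywords.
pages = {
    "a": ["car", "engine", "img_wheel"],
    "b": ["car", "engine", "tire"],
    "c": ["recipe", "pasta", "img_plate"],
    "d": ["recipe", "pasta", "sauce"],
}
path = ["a", "b", "c", "d"]
vocab = sorted({term for terms in pages.values() for term in terms})

matrix = term_matrix([pages[p] for p in path], vocab)
coords = lsi_coordinates(matrix, k=2)
subpaths = split_at_breakpoints(path, coords, threshold=0.5)
print(subpaths)  # [['a', 'b'], ['c', 'd']]
```

In this toy path the user moves from automotive pages to cooking pages, so the single breakpoint falls between pages b and c. A realistic setting would use weighted term frequencies and a much larger vocabulary, and the threshold would need tuning.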
Each subpath, represented as a vector in the term-path matrix, can be visualized as a point in a reduced-dimensional space. The semantics of a web page, w, can then be defined as the subset of the points in the reduced-dimensional space corresponding to the subpaths that are within a threshold distance from page w. The semantics of a user browsing path is then the collection of concepts represented by the semantics of the pages traversed by the user.

Without getting into the world of linguistics and natural language processing, we represent semantics by similarity. As part of the training data, we include paths with known semantics. One can find reasonably good categorizations at http://www.dmoz.org, which we use as the training set for various topics. We formulate a query page comprising the terms that best represent a topic. Using this query, one can determine whether there are any matches to the query in the browsing history. We accomplish this by placing the query (also represented as a point) in the vector space and computing the distance between the query point and each of the points in the vector space.
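The query-matching step can be sketched as follows, under stated assumptions. The term-path matrix is reduced with a truncated SVD, the query is folded into the same space by projecting it onto the leading left singular vectors, and subpaths whose cosine distance to the query falls within a threshold are reported as matches. The choice of cosine distance, the threshold value, the function names, and the toy matrix are illustrative assumptions; the text above says only that a distance is computed between the query point and the subpath points.

```python
import numpy as np

def fit_lsi(term_path, k):
    """Return the top-k left singular vectors and the reduced subpath points."""
    u, _, _ = np.linalg.svd(term_path, full_matrices=False)
    basis = u[:, :k]                 # terms x k
    points = term_path.T @ basis     # one k-dimensional point per subpath
    return basis, points

def fold_in(query_vec, basis):
    """Place a query (a term-count vector) in the reduced space."""
    return query_vec @ basis

def cosine_distances(query_point, points):
    """Cosine distance from the query to every point (1.0 for zero vectors)."""
    norms = np.linalg.norm(points, axis=1) * np.linalg.norm(query_point)
    sims = np.divide(points @ query_point, norms,
                     out=np.zeros(len(points)), where=norms > 0)
    return 1.0 - sims

def matching_subpaths(term_path, query_vec, k, threshold):
    """Indices of subpaths within the threshold distance, plus all distances."""
    basis, points = fit_lsi(term_path, k)
    dists = cosine_distances(fold_in(query_vec, basis), points)
    return [int(i) for i in np.flatnonzero(dists <= threshold)], dists

# Hypothetical term-path matrix. Rows: engine, car, pasta, recipe.
# Columns: three subpaths (two automotive, one cooking).
term_path = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 1.0, 0.0, 0.0])  # query page containing "engine" and "car"

matches, distances = matching_subpaths(term_path, query, k=2, threshold=0.2)
print(matches)  # [0, 1]
```

Projecting both the subpaths and the query onto the same basis keeps their coordinates on a common scale, so the distances are directly comparable. The same routine could be applied to the points near a page w to approximate that page's semantics as defined above.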