Augmenting Information Seeking on the World Wide Web Using Collaborative Filtering Techniques

Don Turnbull

1. Introduction

The Internet has opened a channel of access to an interwoven labyrinth of information over an almost ubiquitous platform - the World Wide Web (the Web). Graphical Web browsers have enabled all types of users to access and share information with one another. However, once the initial thrill of Web access is over, most users don't surf the Web; they use it as an information source. This paper seeks to take Information Seeking research and apply it as a framework for understanding the World Wide Web environment and to identify opportunities for augmenting information seeking by applying Bibliometric analysis, filtering techniques, and collaborative technologies to Web usage data that can, in turn, leverage a Web user's Information Seeking behavior.

1.1 Overview

This paper reviews several areas of study in order to form an extensive view of the issues involved in understanding and improving how a World Wide Web browser user (Web user) can discover new information on the World Wide Web. There are seven main sections to this paper:

Section 1: Introduction
This Introduction and Overview, intended to explain and lay out the overall topics of this paper.

Section 2: Applying Information Seeking to Electronic Environments
This section reviews the important models and studies in Information Seeking and Bibliometrics to understand and analyze Information Systems use.

Section 3: The Internet and the World Wide Web
This section introduces the Internet and World Wide Web and some of their basic standards and functionalities. Also included are descriptions and reviews of the data sources and measurement methods currently available to understand Web usage activity.
Section 4: Collaborative Filtering
This section provides a general introduction to Collaborative Filtering and presents recent significant studies and systems for Information Filtering using both the Internet and the World Wide Web. Also included are studies that illustrate general Collaborative Filtering techniques and a review of current Collaborative Filtering systems for both the Internet and the World Wide Web.

Section 5: Conclusion
This section concludes the research overview and summarizes the general ideas in the paper.

Section 6: Suggested Research Projects
This section proposes three research projects, each designed to answer questions about improving Information Seeking on the World Wide Web.

Section 7: Bibliography and Appendix A
A list of the works cited in this paper and explanatory information presented in the appendices.

2. Applying Information Seeking to Electronic Environments

This section reviews the important models and studies in Information Seeking and Bibliometrics (which can be seen as another way to model Information Seeking patterns) to understand and analyze Information Systems use.

2.1 Information Seeking Overview

This section focuses on Information Seeking in electronic environments, namely the World Wide Web. My goal is to explore an Information Seeking model that shows elements that can be augmented with Collaborative Filtering techniques developed through data collection and analysis. The Web environment, with its masses of unstructured and inconsistently coordinated information, is more suited to being interpreted by people than by machines. Collaborative Filtering is a quantitative way to develop qualitative data about information on the Web, thus maximizing both people and computer resources.
Because Web information is judged subjectively and is seemingly endless, it is more useful to focus on perceptual and cognitive recognition via browsing the Web than on determining the precision of Web searches via Information Retrieval techniques. However, this is not simple: Information Seeking on the Web is difficult to measure because a user can never know he is finished. There is no definite ending point. Information Seeking as a problem seems natural to augment with Information Retrieval ideas, but it should additionally be leveraged with other users' Information Seeking behavior. At worst, Information Seeking and Information Retrieval can be scaffolded over each other to gradually build to a refinement of a user's information need.

Marchionini gives us an appropriate definition of Information Seeking: "a process in which humans purposefully engage in order to change their state of knowledge" (Marchionini 1995).

2.1.1 Information Seeking and Information Retrieval

Many studies point out the close relation between Information Seeking and Information Retrieval. Most notably, the comprehensive analysis of Information Seeking and Retrieval by Saracevic et al. provides excellent starting points for ideas about observation and collection of data that help establish a sense of context and classification of user questions; cognitive characteristics and decision making of users; and comparisons of different searches for the same question. The measures and methods of user effectiveness and searching provide a rich framework for further studies. (Saracevic and Kantor 1988a; Saracevic and Kantor 1988b; Saracevic et al.
1988)

The following general differences distinguish Information Retrieval research from Information Seeking research:

Information Retrieval:
- historically, concentrated on the system
- focuses on planning the use of information sources and systems
- implies that the information must have been already known
- relies on the concrete definition of query terms
- involves subsequent query reformulations
- centers on the examination of results and their accuracy.

Information Seeking:
- historically, concentrated on the user
- focuses on understanding the heuristic and dynamic nature of browsing through information resources
- implies that the information is sought to increase knowledge
- follows a more opportunistic, unplanned search strategy
- involves recognizing relevant information
- centers on an interactive approach to make browsing easy.

From a behavioral perspective, the primary difference between Information Retrieval and Information Seeking is searching vs. browsing. The focus of each domain is on the actions studied. As computer technology matures, Information Retrieval and Information Seeking studies are moving closer. In 1996 Saracevic states that "interaction became THE most important feature of information retrieval" as the access to Information Retrieval systems has become more dynamic (Saracevic 1996). Essentially, the interactivity provides the ability to support more browsing-like approaches for finding information.

Therefore, to design a system for augmenting Information Seeking, a more robust understanding of the user and his interactions is in order. Measuring successful Information Seeking requires analyzing these subtler, interaction-based measures. Again, this makes augmenting Information Seeking via collaboration more likely to succeed. Instead of relying wholly on Information Retrieval metrics, recording and comparing a user's interactions with a system can be used to enhance the information seeker's success.
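As a rough illustration of what "recording and comparing a user's interactions" could look like, the sketch below reduces each user's recorded interactions to the set of URLs visited and compares users by set overlap. The Jaccard measure, the function names, and the sample histories are all assumptions made for illustration; none of them come from the studies cited above.

```python
def jaccard(a, b):
    """Overlap between two sets of visited URLs: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar_user(target, histories):
    """Return (user, score) for the user whose history best overlaps the target's."""
    best_user, best_score = None, -1.0
    for user in sorted(histories):  # sorted so ties resolve predictably
        if user == target:
            continue
        score = jaccard(histories[target], histories[user])
        if score > best_score:
            best_user, best_score = user, score
    return best_user, best_score


def suggestions(target, histories):
    """Pages the most similar user visited that the target has not yet seen."""
    peer, _ = most_similar_user(target, histories)
    if peer is None:
        return []
    return sorted(histories[peer] - histories[target])


# Hypothetical recorded histories (user -> set of visited URLs).
histories = {
    "ann": {"http://a.example/faq", "http://a.example/toc", "http://b.example/"},
    "bob": {"http://a.example/faq", "http://a.example/toc", "http://c.example/"},
    "cal": {"http://d.example/"},
}
```

Here `suggestions("ann", histories)` would offer ann the one page her closest peer visited that she has not. A real system would likely weight interactions (time spent, bookmarking, printing) rather than treat every visit equally.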
New technologies, such as the easy-to-use World Wide Web browser, will promote more Information Seeking use (and attract new users). However, new interfaces alone will not help us find everything we seek, though we might believe so, as we often think electronic information is more accurate or complete (Liebscher and Marchionini 1988). In a way, utilizing more collaboration between users can make up for some of the shortcomings of technical systems. Blending the different perspectives and experience levels of a pool of users can result in a larger body of resources discovered. Fidel points out two styles of expert searchers: the operationalists, who understand the system and use high-precision searches, and the conceptualists, who focus on concepts and terminology and then combine results to form more complete searches (Fidel 1984). This combination of users cooperating can form a powerful team to enhance each other's Information Seeking.

2.1.1 Information Seeking Models

The influence of new technology on Information Seeking is also providing a new set of alternative models that more accurately describe the Information Seeking process as a dynamic activity. Models of Information Seeking attempt to describe the process a user follows to satisfy an information need. The Information Seeking models in this section focus on the behavior of Information Seeking activities.

2.1.1.1 Ellis' Model of Information Seeking

The primary model used in this research will be based on Ellis' work - initially, his model with six categories (Ellis 1989). Since Ellis has stated that these activities are applicable to hypertext environments (of which the World Wide Web is one), I will use examples from Web browsing to illustrate each category:

Starting is identifying the initial materials to search through and selecting starting points for the search. Starting, as its name implies, is usually undertaken at the beginning of the Information Seeking process to learn about a new field.
Starting could also include locating key people in the field or obtaining a literature review of the field. It is also common to rely on personal contacts for informal starting information. For example, in the Web environment, the activity of starting could involve going to the Yahoo! site to find the general category listing of links related to the field of inquiry and looking for overviews, FAQs (Frequently Asked Questions files - a commonly used informal document describing a particular subject), or reputable reference sites. Another possibility is going to a bookmarked page that has proved useful in previously looking for similar information, or consulting a colleague's own Web page or one he might have recommended.

Chaining is following leads from the starting source through referential connections to other sources that contribute new information. A common chaining technique is following references from a particular article, obtained by recommendation or a literature search, to other articles referred to in that first article. It's also quite natural to pursue the works of a particular author when following these chains. There are two kinds of chaining:

1. backward chaining is following a pointer or reference from the initial source. For example, going to an article mentioned in the initial source's bibliography.
2. forward chaining is looking for new sources that refer to the initial source. For example, using a citation index to find other sources that reference the initial source.

The only real constraints to chaining are time available and confidence in pursuing a line of research further. For example, using a Web browser, backward chaining would be following links on the starting page (be it an online document or a collection of links, which we can assume are related in some way) to other sites.
Forward chaining could involve using a search engine to look for other Web pages that link to the initial Web page.[1]

Browsing is casually looking for information in areas of interest. This activity is made easy because documents tend to have tables of contents, lists of titles, topic headings, and names of persons or organizations. Browsing is being open to serendipitous findings; finding new connections or paths to information; and learning, which can cause information needs to change. On the Web, browsing is particularly unconstrained, as the most common way to follow a link is simply clicking the mouse. With link availability and adequate access speed, pursuing a new connection is quite simple. Only the worry of getting lost in an ocean of links might constrain browsing through the Web. A common example of browsing on the Web would be finding an online journal article and following its link back to the overall journal table of contents and then to an entirely different article. This might in turn lead to a page linking to the home pages of the journal's various contributing authors, its editorial board, or its supporting organization.

Differentiating is selecting among the known sources by noting the distinctions of characteristics and value of the information. This activity could be ranking and organizing sources by topic, perspective, or level of detail. Differentiating is heavily dependent on the individual's previous or initial experiences with the source or on recommendations from colleagues or reviews. A Web-oriented example would be organizing bookmarks into topic categories and then prioritizing them by the depth of information they present.

Monitoring is keeping up-to-date on a topic by regularly following specific sources. Using a small set of core sources including key personal contacts and publications, developments can be tracked for a particular topic.
A Web browser monitoring activity could be returning to a bookmarked source to see if the page has been updated or regularly visiting a journal's Web site when it is scheduled to publish its new Web edition.

Extracting is methodically analyzing sources to identify materials of interest. This systematic re-evaluation of sources is used to build a historical survey or comprehensive reference on a topic. With a Web browser, extracting might be saving the Web page as a file or printing the Web page for use in an archive or for a segment of an overview document.

In follow-up studies, Ellis adds two more features to his model: verifying, where the accuracy of the information is checked, and ending, which typifies the conclusion of the Information Seeking process, such as building final summaries and organizing notes (Ellis 1991). These changes reflect further studies, but I also believe that as Information Seeking has become more mechanical, its processes are easier to note. However, despite refining the processes and the relationships between features of his model, Ellis also agrees that the boundaries between the features are very soft (Ellis 1996). In using the Web, verifying might involve extracting keywords from a source and searching for corroborating information on another Web page. Admittedly, the Web's newness and large percentage of un-branded information make verification of information difficult. I suspect that information is often verified by checking traditional sources, not other Web pages.

Currently (Ellis 1997), Ellis has modified his model's features somewhat, refining starting into surveying. Surveying further stresses the activity of obtaining an overview of the research terrain or locating key people operating in the field. Differentiating has been refined to distinguishing, where information sources are ranked. Distinguishing also includes noting the channel where information comes from.
Ellis points out that informal channels, such as discussions or conversations, and secondary sources, such as tables of contents or abstracts, are normally ranked higher than full-text articles. This is most likely due to the increased use of electronic resources and their capacity to overload a user. For the user, some kind of hierarchy of results must be formed to place order on the Information Seeking process. With the Web, it is either easy to discover the channel of information (a Web site owned by an organization) or quite difficult to confirm (a resource included on a personal Web page) due to the ease of moving and presenting information on the World Wide Web.

Another new feature has also been added to the model: filtering, which capitalizes on personal criteria or mechanisms to increase information precision and relevancy. Typical examples of filtering are restricting a search by time or keyword. This idea of filtering, in more than name, points out that Information Filtering is a crucial element of study in Information Seeking. In a Web browser session, filtering would likely involve restricting a search for information (using a Web search engine or even on a particular Web site) by the date published or carefully noting the URL[2] of the Web page. When combined with distinguishing, where resources are actually ranked and sorted, we also begin to see how Information Retrieval is alluded to in Ellis' model of Information Seeking.

Figure 1 illustrates Ellis' current Information Seeking model. Note that the overall structure of the process could be contained inside each activity, implying the fractal-like nature of the processes.

Figure 1. Ellis' Information Seeking Model

2.1.1.2 Applying Ellis' Model

I propose that the Information Seeking process is fractal-like in nature. Each feature follows the overall feature set within itself.
For example, within surveying, there surely must be chaining, browsing, and differentiating, not to mention ending, which formalizes the completion of the step. Like a fractal, even the smallest change in a sub-feature (as I shall now call them) can affect not only its parent feature but also the entire Information Seeking process. This is more than just refinement of a search: the very features of Information Seeking can take on a different mapping as the seeker, the sources, and technology change. Because of these variations, collaboration in all three of these domains is where Information Seeking can be substantially improved.

Collaboration among seekers is the most obvious area of improvement of the process and the focus of this paper. Different users can share previous findings or cooperate to minimize future work. Sources can be more easily linked and shared as more become available digitally. Improved technology can enable more automation of monitoring; combining and comparing results; and distribution of user profiles or programs that can provide starting points for Information Seeking.

Ironically, as resources become more plentiful due to technology, they are also being more loosely classified, if at all. Publishing information demands far fewer resources than direct expert classification, and publishing often excludes indexing. Without common organization among electronic resources, more individual work will be needed to build maps of a research terrain. Again, Collaborative Filtering can help in an ad hoc way by at least establishing operational classifications of information by communities of users who pool their resources. Their resources cannot help but become classified in some form: by user, by implicitly or explicitly agreed-upon language, or by usage ranking as resources fall prey to limited attention.
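The idea that pooled resources become classified "by user, by agreed-upon language, or by usage ranking" can be sketched in a few lines. The example below assumes each user contributes bookmarks as (label, URL) pairs; it groups them by the label users chose and ranks URLs under each label by how many users filed them there. The data layout, function names, and sample bookmarks are hypothetical and only illustrate the idea, not any system described in this paper.

```python
from collections import Counter, defaultdict


def pool_bookmarks(bookmarks_by_user):
    """Group pooled bookmarks by label and count distinct users per URL."""
    pooled = defaultdict(Counter)
    for user, bookmarks in bookmarks_by_user.items():
        # Normalize labels, then deduplicate so a user counts once per (label, url).
        unique = {(label.lower(), url) for label, url in bookmarks}
        for label, url in unique:
            pooled[label][url] += 1
    return pooled


def ranked(pooled, label):
    """URLs filed under a label, most widely bookmarked first (ties by URL)."""
    counts = pooled.get(label.lower(), Counter())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical pooled bookmarks: user -> list of (label, url).
bookmarks_by_user = {
    "ann": [("Filtering", "http://a.example/cf"), ("filtering", "http://b.example/")],
    "bob": [("filtering", "http://a.example/cf")],
    "cal": [("hypertext", "http://c.example/"), ("filtering", "http://a.example/cf")],
}
pooled = pool_bookmarks(bookmarks_by_user)
```

Calling `ranked(pooled, "filtering")` lists the page all three users bookmarked ahead of the page only one did, which is the "usage ranking" classification in its simplest form. Labels that differ only in case are merged here; reconciling genuinely different vocabularies is a much harder problem that this sketch ignores.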
2.1.3 Kuhlthau's Model of the Information Search Process

Kuhlthau provides an additional model which focuses on the information search process from the user's perspective. Her six stages in the Information Search Process (ISP) Model are:

1. initiation - beginning the process, characterized by feelings of uncertainty and more general ideas with a need to recognize or connect new ideas to existing knowledge.
2. selection - choosing the initial general topic with general feelings of optimism by using selection to identify the most useful areas of inquiry.
3. exploration - investigating to extend personal understanding and reduce the feelings of uncertainty and confusion about the topic and the process.
4. formulation - focusing the process with the information encountered, accompanied by feelings of increased confidence.
5. collection - interacting smoothly with the information system with feelings of confidence as the topic is defined and extended by selecting and reviewing information.[3]
6. presentation - completing the process with a feeling of confidence or failure depending on how useful the findings are (Kuhlthau 1991).

2.1.4 Belkin's Information Seeking Process Model

Belkin provides another view of the Information Seeking process, described as Information Seeking Strategies (ISS). This view can be seen as a more task-oriented overlay of either Kuhlthau's or Ellis' model. The set of tasks is:

browsing - scanning or searching a resource.
learning - expanding knowledge of the goal, problem, system or available resources through selection.
recognition - identifying relevant items (via system or cognitive association).
metainformation - interacting with the items that map the boundaries of the task (Belkin, Marchetti, and Cool 1993).

Again, this model is not linear, unlike a typical waterfall process flow. Belkin even stresses this non-linearity in that he suggests that the model should support "graceful movements" among the tasks.
2.1.5 Belkin's Anomalous States of Knowledge

Belkin also provides some useful perspectives with the Anomalous State of Knowledge (ASK) theory, which concerns "the cognitive and situational aspects that were the reason for seeking information and approaching an IR system" (Saracevic 1996). Belkin proposes that a search begins with a problem and a need to solve it - the gap between these is defined as the information need. The user gradually builds a bridge of levels of information, which may change the question or the desired solution as the process continues (Belkin, Oddy, and Brooks 1982). In other words, this view treats information seeking as a dynamic process in which the user's expertise grows, both in knowledge about the solution and in using the capabilities of the particular information system itself.

Taking these ideas, Belkin advocates a systems design using a network of associations between items as a means of filling the knowledge gap. By establishing relationships between individual pieces of knowledge, a bridge of supporting information can be used to cross the knowledge gap. Using a collection of associations in this manner provides a framework that can be applied to designing Collaborative Filtering mechanisms, which work by building associations between users.

The full article is located at http://www.gslis.utexas.edu/~donturn/research/augmentis.html#Heading4