Bandits and Browsing: Effective Collection Size as a Way of Quantifying Search Efficiency

Harriett E. Green, Kirk Hess, and Richard D. Hislop
University of Illinois at Urbana-Champaign
green19@illinois.edu, kirkhess@illinois.edu, rhislop2@illinois.edu

2011 DLF Forum Poster Presentation
October 31–November 1, 2011, Baltimore, Maryland

ABSTRACT

This poster presents our preliminary research on how information can be extracted from user browsing behavior to identify understudied works that are relevant but have too few viewers. We investigate how to apply two types of analysis, a formula called Effective Collection Size and ‘multi-armed bandit’ analysis, to extracted user data in order to develop alternative methods of retrieving materials from a collection, collated by richer factors of relevancy. We anticipate that these analyses will enable the development of an information retrieval system that presents a broad range of content in a user’s search results.

INTRODUCTION

Scholars do not simply conduct linear searches for specific items in library collections. Rather, they pursue networked searches that combine random exploration, mediated browsing, and topical searches. Yet despite these non-linear search processes, keyword frequency dominates the algorithms of many data retrieval systems in digital libraries. This presentation proposes a method for identifying relevancy among items in a collection through two types of analysis: the proposed Effective Collection Size formula and multi-armed bandit analysis.

DATA ANALYSIS AND INITIAL RESULTS

Circulation statistics from the University of Illinois Library’s Voyager catalog database are being analyzed as the prototype data. The initial analysis took the English Library’s collection of approximately 35,000 items as the first sample set. We ran multiple regressions that calculated, at various thresholds between 1 and 20, the probability that books would be checked out.
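To make the threshold idea concrete, the sketch below computes, for each threshold from 1 to 20, the empirical share of items checked out at least that many times. This is a simpler stand-in and not the regression analysis described above. It assumes that a threshold means a minimum number of checkouts, which the text does not state. The function name `checkout_probability_by_threshold` and the sample counts are hypothetical and are not drawn from the Voyager data.

```python
from typing import Dict, List


def checkout_probability_by_threshold(
    checkout_counts: List[int], thresholds: range = range(1, 21)
) -> Dict[int, float]:
    """Return the share of items checked out at least t times, for each t."""
    if not checkout_counts:
        raise ValueError("checkout_counts must not be empty")
    total = len(checkout_counts)
    return {
        t: sum(1 for count in checkout_counts if count >= t) / total
        for t in thresholds
    }


# Hypothetical checkout counts for eight items (illustrative only).
sample_counts = [0, 0, 1, 3, 0, 12, 25, 2]
probabilities = checkout_probability_by_threshold(sample_counts)

for threshold in (1, 5, 20):
    print(threshold, probabilities[threshold])
```

A regression would go further by relating these outcomes to item attributes such as subject headings; the empirical shares only describe the usage distribution.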
We also calculated the correlations of Library of Congress subject headings in the collection for relevancy between titles and their usage. With these data, we have begun to develop methods of analyzing user data for incorporation into the information retrieval systems of library catalogs and digital library collections. The circulation statistics analyzed here would be equivalent to drill-down views of an object in a digital library. This collection analysis reveals the Effective Collection Size of a collection, which is the number of items actually viewed and/or borrowed by users, contrasted with the total number of items in the collection. The multi-armed bandit analysis of user data can subsequently be used to implement a self-optimizing search algorithm for the catalog.

CONCLUSION

The aim of this analysis is to employ user data to improve the search and data retrieval system. We are embarking on a series of projects to create a self-optimizing catalog, as the circulation and subject data from the physical catalog enable us to create collection analysis and search algorithm tools that can be applied to digital library development. We anticipate that these tools could address questions such as: How do you know which titles to digitize first? Which titles have been overlooked in the intellectual and use bias of our physical collections and should be included in the digital collection? For library subject specialists, these tools will make it possible to refer, evaluate, and manage their collections with a broad and multi-faceted perspective on the constituency and use of those collections. The management and use of physical and digital library collections can be significantly enhanced with a data retrieval system that promotes maximum exposure of the collection through richer relevancy correlations.
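As a rough illustration of Effective Collection Size as defined in the data analysis section, the sketch below counts the items that were used at least once and sets that number against the total number of items. The function name `effective_collection_size`, the `min_uses` parameter, and the sample counts are assumptions made for this example; the poster does not give an exact formula.

```python
from typing import Iterable, Tuple


def effective_collection_size(
    use_counts: Iterable[int], min_uses: int = 1
) -> Tuple[int, int, float]:
    """Return (items used, total items, share of the collection used).

    An item counts as used when it was viewed or borrowed at least
    `min_uses` times. The default of 1 is an assumption.
    """
    counts = list(use_counts)
    used = sum(1 for count in counts if count >= min_uses)
    total = len(counts)
    share = used / total if total else 0.0
    return used, total, share


# Hypothetical per-item checkout counts (illustrative only).
sample_counts = [0, 0, 1, 3, 0, 12, 25, 2]
used, total, share = effective_collection_size(sample_counts)
print(f"{used} of {total} items used ({share:.1%})")
```

Raising `min_uses` gives a stricter reading of "actually used", which may be useful when a single checkout is too weak a signal.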
With access to such knowledge and data, managers of digital libraries, subject specialists, and library administrators could identify the under-used items in their library collections, quantify the usage of their collections in monetized terms, and ultimately improve the efficiency of their collections.
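The self-optimizing search idea described in the data analysis section can be illustrated with a small multi-armed bandit. The poster does not say which bandit strategy is intended, so this sketch uses epsilon-greedy, one of the simplest options, purely as an example. The class `EpsilonGreedyBandit`, the item identifiers, and the simulated probabilities of use are all hypothetical.

```python
import random
from typing import Dict, List


class EpsilonGreedyBandit:
    """Chooses which item to surface, balancing exploration and exploitation."""

    def __init__(self, item_ids: List[str], epsilon: float = 0.1, seed: int = 0) -> None:
        if not item_ids:
            raise ValueError("item_ids must not be empty")
        if not 0.0 <= epsilon <= 1.0:
            raise ValueError("epsilon must be between 0 and 1")
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows: Dict[str, int] = {item: 0 for item in item_ids}
        self.uses: Dict[str, int] = {item: 0 for item in item_ids}

    def use_rate(self, item_id: str) -> float:
        """Observed share of showings that led to a view or checkout."""
        shown = self.shows[item_id]
        return self.uses[item_id] / shown if shown else 0.0

    def select(self) -> str:
        """Pick a random item with probability epsilon, else the best so far."""
        items = list(self.shows)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(items)  # explore: expose a possibly overlooked item
        return max(items, key=self.use_rate)  # exploit: highest observed use rate

    def update(self, item_id: str, used: bool) -> None:
        """Record one showing of the item and whether the user acted on it."""
        self.shows[item_id] += 1
        if used:
            self.uses[item_id] += 1


# Hypothetical probabilities that a user opens each item when it is shown.
true_use_probability = {"item_a": 0.05, "item_b": 0.20, "item_c": 0.60}

bandit = EpsilonGreedyBandit(list(true_use_probability), epsilon=0.1, seed=42)
users = random.Random(7)  # simulated user behavior, separate from the bandit's RNG
rounds = 2000

for _ in range(rounds):
    shown = bandit.select()
    bandit.update(shown, users.random() < true_use_probability[shown])

for item in true_use_probability:
    print(item, bandit.shows[item], round(bandit.use_rate(item), 3))
```

The `epsilon` value controls how often items with little or no usage history are surfaced anyway, which is one possible way to give under-viewed works exposure. A contextual bandit that also uses subject headings would be a natural extension, but it is beyond this sketch.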