Bandits and Browsing: Effective Collection Size as a Way of Quantifying Search Efficiency
Harriett E. Green, Kirk Hess, and Richard D. Hislop
University of Illinois at Urbana-Champaign
green19@illinois.edu  kirkhess@illinois.edu  rhislop2@illinois.edu
2011 DLF Forum Poster Presentation
October 31–November 1, 2011, Baltimore, Maryland
ABSTRACT
This poster presents our preliminary research on how information can be extracted
from user browsing behavior to identify understudied works that are relevant but have too few viewers.
We investigate how to apply two types of analysis to the extracted user data, a formula called
Effective Collection Size and ‘multi-armed bandit’ analysis, in order to develop alternative methods of
retrieving materials from a collection that are ordered by richer factors of relevance. We anticipate that
these analyses will enable the development of an information retrieval system that presents a broad range of
content in a user’s search results.
INTRODUCTION
Scholars do not simply conduct linear searches for specific items in library collections. Rather, they
pursue networked searches that involve random exploration, mediated browsing, and topical searches. But
despite these non-linear search processes, keyword frequency dominates the algorithms of many data
retrieval systems in digital libraries. This presentation proposes a method for identifying relevance
among items in a collection through two types of analysis: the proposed formula of Effective Collection
Size and multi-armed bandit analysis.
DATA ANALYSIS AND INITIAL RESULTS
Circulation statistics from the University of Illinois Library’s Voyager catalog database serve as
the prototype data. The initial analysis took the English Library’s collection of
approximately 35,000 items as the first sample set. We ran multiple regressions that estimated, at
various thresholds between 1 and 20, the probability that books would be checked out. We also
calculated the correlations of Library of Congress subject headings in the collection to gauge relevance
between titles and their usage.
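As a rough illustration of the threshold idea, the sketch below computes, for each threshold k from 1 to 20, the share of items checked out at least k times. It is a plain empirical proportion and not the regression models used in this study; the function name and the checkout counts are hypothetical.

```python
from typing import Dict, List


def checkout_share_by_threshold(
    checkout_counts: List[int], max_threshold: int = 20
) -> Dict[int, float]:
    """For each threshold k in 1..max_threshold, return the fraction of
    items whose checkout count is at least k."""
    if not checkout_counts:
        raise ValueError("checkout_counts must not be empty")
    total = len(checkout_counts)
    return {
        k: sum(1 for c in checkout_counts if c >= k) / total
        for k in range(1, max_threshold + 1)
    }


# Hypothetical checkout counts for ten items (not data from the study).
sample_counts = [0, 0, 0, 1, 1, 2, 3, 5, 12, 25]
shares = checkout_share_by_threshold(sample_counts)
for k in (1, 5, 20):
    print(f"threshold {k:2d}: {shares[k]:.2f}")
```

With real circulation data, a steep drop in these shares as k grows would indicate that use is concentrated in a small part of the collection.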
With this data, we have begun to develop methods of analyzing user data for incorporation
into information retrieval systems of library catalogs and digital library collections. The circulation
statistics analyzed here would be equivalent to drill-down views of an object in a digital library. This
collection analysis reveals the Effective Collection Size of a collection, which is the number of items
actually viewed and/or borrowed by users, contrasted with the total number of items in the collection.
The multi-armed bandit analysis of user data can subsequently be used to implement a self-optimizing
search algorithm for the catalog.
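The two ideas described above can be sketched as follows. The first function counts items with at least one recorded use and reports that count alongside the collection total, which is one reading of Effective Collection Size as defined here. The class is an epsilon-greedy selector, one simple multi-armed bandit strategy chosen only for illustration; this handout does not specify which bandit algorithm is used. All names, the epsilon value, and the sample data are assumptions.

```python
import random
from typing import Dict, List, Tuple


def effective_collection_size(use_counts: Dict[str, int]) -> Tuple[int, int]:
    """Return (items used at least once, total items in the collection)."""
    used = sum(1 for n in use_counts.values() if n > 0)
    return used, len(use_counts)


class EpsilonGreedyBandit:
    """Illustrative bandit: each arm is an item, and a reward of 1 means
    the user viewed or borrowed the item after it was shown."""

    def __init__(self, arms: List[str], epsilon: float = 0.1, seed: int = 0) -> None:
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {a: 0 for a in self.arms}
        self.mean_reward = {a: 0.0 for a in self.arms}

    def select(self) -> str:
        # Explore a random item with probability epsilon; otherwise
        # exploit the item with the highest observed mean reward.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.mean_reward[a])

    def update(self, arm: str, reward: float) -> None:
        # Incremental update of the running mean reward for this arm.
        self.pulls[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.pulls[arm]


# Hypothetical use counts (not data from the study).
use_counts = {"item_a": 4, "item_b": 0, "item_c": 1, "item_d": 0}
used, total = effective_collection_size(use_counts)
print(f"Effective Collection Size: {used} of {total} items")

bandit = EpsilonGreedyBandit(list(use_counts), epsilon=0.2, seed=42)
bandit.update("item_c", 1.0)  # shown and then borrowed
bandit.update("item_a", 0.0)  # shown but ignored
print("next item to surface:", bandit.select())
```

The exploration step is what gives rarely used items a chance to appear in results, which is the link between the bandit approach and raising the Effective Collection Size.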
CONCLUSION
The aim of this analysis is to employ user data to improve the search and data retrieval system. We are
embarking on a series of projects to create a self-optimizing catalog, as the circulation and subject data
from the physical catalog enable us to create collection analysis and search algorithm tools that can be
applied to digital library development. We anticipate that these tools could address questions such as:
How do you know which titles to digitize first? Which titles have been overlooked because of the intellectual
and use biases in our physical collections and should be included in the digital collection? For library
subject specialists, these tools will make it possible to refer to, evaluate, and manage their collections
with a broad and multi-faceted perspective on the constituency and use of those collections.
The management and use of physical and digital library collections can be significantly
enhanced with a data retrieval system that promotes maximum exposure of the collection through
richer relevancy correlations. With access to such knowledge and data, managers of digital libraries,
subject specialists, and library administrators could identify the under-used items in their library
collections, quantify the usage of their collections in monetized terms, and ultimately improve the
efficiency of their collections.