Come, and Take Choice of All My Library: Mass Digitization Examined

Jonathan Bengtson, Associate University Librarian for Scholarly Resources
Sian Meikle, Digital Services Librarian
University of Toronto Libraries
Access Conference, September 2008

Part One (Jonathan Bengtson):
• Overview of University of Toronto / Internet Archive collaboration
• What should a Digital Library be? Models to date.

Part Two (Sian Meikle):
• Building the Digital Library: what goes in, what comes out, how to join in
• Using the Digital Library: how we’re using it, how users use it, how you can use it

Questions encouraged!

The University of Toronto’s Internet Archive Scanning Centre

“Scribe” scanning station capacity
• 500 pages per hour
• 14 hours per day, 5 days per week
• 7,000 pages each day per scribe

“Scribe” Centre capacity
• 161,000 pages per day (23 scribes)
• 805,000 pages per week (23 scribes)
• 2,683 books per week (23 scribes), if an average book is 300 pages
• 100,000+ books per year

Mass Digitization at the University of Toronto
• Phase one (Pilot): Autumn 2004-Autumn 2005
• Phase two (MSN/OCA): Autumn 2005-May 22, 2008
• Phase three (OCA+?): May 23, 2008-

Partnering with the Internet Archive
The University of Toronto is one of the five largest academic libraries in North America.
The Internet Archive is a non-profit organization, based in San Francisco, that was founded in 1996 to build an ‘Internet library’ offering researchers, historians, scholars and the general public permanent access to historical collections that exist in digital format.
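The Scribe capacity figures quoted for the scanning centre follow from simple multiplication. The sketch below reproduces them from the per-station numbers on the slide. The function name and the 52-week year are assumptions made for illustration, and the yearly figure ignores downtime, which is presumably why the slide rounds to "100,000+".

```python
# Inputs taken from the slide; WEEKS_PER_YEAR is an assumption (no downtime).
PAGES_PER_HOUR = 500
HOURS_PER_DAY = 14
DAYS_PER_WEEK = 5
SCRIBES = 23
PAGES_PER_BOOK = 300
WEEKS_PER_YEAR = 52


def centre_capacity() -> dict:
    """Return the scanning-centre capacity derived from per-station figures."""
    pages_per_day_per_scribe = PAGES_PER_HOUR * HOURS_PER_DAY
    pages_per_day = pages_per_day_per_scribe * SCRIBES
    pages_per_week = pages_per_day * DAYS_PER_WEEK
    return {
        "pages_per_day_per_scribe": pages_per_day_per_scribe,
        "pages_per_day": pages_per_day,
        "pages_per_week": pages_per_week,
        # Whole books only, so integer division is used.
        "books_per_week": pages_per_week // PAGES_PER_BOOK,
        "books_per_year": (pages_per_week * WEEKS_PER_YEAR) // PAGES_PER_BOOK,
    }


if __name__ == "__main__":
    for name, value in centre_capacity().items():
        print(f"{name}: {value:,}")
```

With these inputs the weekly figures match the slide (7,000, 161,000, 805,000 and 2,683), and the theoretical yearly ceiling comes out at roughly 139,500 books.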
www.archive.org

Internet Archive: Preservation and Access
• Over 2.5 petabytes of storage, and growing
• To put that in perspective: an mp3 is usually 3-4 megabytes; 2.5 petabytes = 2,684,354,560 megabytes
• 1.5 million downloads per day (one of the top 350 global sites)
• 3 storage facilities: San Francisco, Amsterdam, and Alexandria, Egypt
• Experience with multiple formats

Audio: 282,000+ items in over 100 collections
• Live Music Archive (2,300 bands & 40,000 performances)
• Netlabels (600 labels)
• Mother Jones Radio
• LibriVox Audio Books
• Afropop Worldwide
• Old Time Radio
• Tse Chen Ling Buddhist Lectures
• 78 RPM Records
• Free Speech Radio News
• Presidential Recordings

Moving Images: 128,000+ items in 100 collections
• Democracy Now
• SIGGRAPH Computer Animation
• Film Chest Vintage Cartoons
• Prelinger Archives
• Drive-In Movie Ads
• UCSF Tobacco Industry Videos
• Universal Newsreels
• Mosaic Middle East News
• Kino French Films

Phase One
University of Toronto Collections
Evaluate technology, workflow, etc.
September 2004-September 2005
University of Toronto Collections: selections from various collections, including:
• c. 1,000 volumes from the Centre for Renaissance and Reformation Studies
• materials from the Centre for 19th century French Studies
• the Pontifical Institute of Mediaeval Studies
• circulating collection
• Records of Early English Drama (by permission of University of Toronto Press)

Phase Two
U of T Collections
• Most ranges of LC
• Focus on religion, history, Canadiana (when possible), (some) literature, science
• Mostly English language
• Mostly pre-1923
• Multiple libraries
• Some special collections
• Circulating pre-1923 materials

Phase Two Partners
• Memorial University: Newfoundland Quarterly, materials relating to Newfoundland
• McMaster University: 100+ items from the First World War Collection
• Ryerson University: various items including the Yellow Book: an illustrated quarterly
• University of Ottawa: 500 18th & 19th century works chosen by faculty, including history, French, music, history of medicine, jurisprudence and nursing
• Library and Archives Canada: c. 450,000 pages from Canadian governmental publications
• Legislative Assembly of Ontario Library
• Toronto Public Library: local history and genealogy
• University of Alberta: Canadiana
• Tufts University, Boston, USA (Mellon and other grant funds)
• Other: Havergal College, U of T Faculty, federally funded publisher, test scans for other OCA partners, individual researchers, National Institute of Newman Studies

Internet Archive Book Scanning experience
• 1996: registered as a non-profit
• 2003: (India) Million books project
• 2004: Sloan grant, equipment evaluation, trial scanning
• 2006: Production scanning, 3 sites
• 2007: 8 sites; 5 million pages or 12-15,000 books each month
• 2008: 18 sites; 10 million pages or 25,000 books each month

Google Books
Microsoft Live Search Books
Open Content Alliance

The Open Content Alliance (OCA) represents the collaborative efforts of a group of cultural, technology, non-profit, and governmental organizations from around the world that will help build a permanent archive of multilingual digitized text and multimedia content.

The Open Library (“Wikipedia”)

Why do we need a Digital Library?
• Things are changing rapidly
• Variety of experience and preferences

Some user themes from recent research at the University of Toronto
• Researchers & students are hurried, clever, determined, & inattentive.
  “When it comes to web resources, if it doesn't give me what I want in 5-10 minutes, I'm gone. I try to be more patient with UT, because it's slower.”
• They understand copyright, but download what they need. Their end goals take priority.
  Journal articles are saved … - So they think, why not e-books, which are too long to be read online?
• E-books are sought for convenience, but access is NOT necessarily convenient.
  “Hoping to find an e-book, so I wouldn’t have to go to the stacks.”

Part Two (Sian Meikle):
• Building the Digital Library: what goes in, what comes out, how to join in
• Using the Digital Library: how we’re using it, how users use it, how you can use it

What goes in?
Books:
• not too big and not too small: 3”x3” to 14.5”x9.5”
• not too old and not too new: 6% get rejected for hard living; 1922 cut-off
MARC metadata:
• Z39.50 is used to fetch MARC data, and so…
• An identifier to tie book to its metadata

Constructing the online book
Internet Archive Scan Center:
1. Assign unique id
2. Get metadata via Z39.50
3. Approve book
4. Scan book
5. Perform QA
6. Upload scans
7. Create derivatives
8. Make book available online

Binding books to metadata
Ideal book identifiers are:
• easy to enter and unique (and so, not title or call number)
• capable of retrieving a MARC record
• Z39.50-accessible (and so, probably not barcode)
• on the book (and so, not OCLC#, LCCN, ISBN, DBCN...)
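To make the identifier criteria above concrete, here is a minimal sketch of a lookup that accepts something easy to enter from the book in hand (a barcode) and returns an identifier that a Z39.50 search could use to fetch the MARC record. Everything in it is hypothetical: the class name, the sample barcodes and record numbers, and the in-memory dictionary that stands in for the library system. No real Z39.50 call is made.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class IdentifierRegistry:
    """Hypothetical barcode -> catalogue record number lookup."""

    # Stand-in for a query against the library system (an assumption).
    barcode_to_record: Dict[str, str]
    # Log of (barcode, approved, reason) entries, one per approval decision.
    decisions: List[Tuple[str, bool, str]] = field(default_factory=list)

    def resolve(self, barcode: str) -> Optional[str]:
        """Return the searchable record number, or None if the barcode is unknown."""
        return self.barcode_to_record.get(barcode.strip())

    def record_decision(self, barcode: str, approved: bool, reason: str = "") -> None:
        """Record whether a book was approved for scanning."""
        if self.resolve(barcode) is None:
            raise KeyError(f"no catalogue record for barcode {barcode!r}")
        self.decisions.append((barcode.strip(), approved, reason))


# Invented sample data, for illustration only.
registry = IdentifierRegistry(
    {"31761000000001": "b1000001", "31761000000002": "b1000002"}
)
registry.record_decision("31761000000001", approved=True)
registry.record_decision("31761000000002", approved=False, reason="fold-outs")
```

Keeping the mapping in one place means the scanning operator only ever keys in or scans the barcode, while the metadata request uses whatever identifier the catalogue actually exposes.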
Some possible solutions
1. Your ILS thinks barcodes are MARC data
2. You put a flier with identifier in each book
3. You provide an intermediary script
... we chose option #3

Constructing the online book
Internet Archive Scan Center:
1. Assign unique id
2. Get metadata via Z39.50
   UTL script:
   • barcode in
   • identifier out
   • tracks scan decision
3. Approve book
4. Scan book
5. Perform QA
6. Upload scans
7. Create derivatives
8. Make book available online

Some books aren’t mass-digitized
• books scanned: 143,380
• books rejected: 12,424
• rejection rate: 9%

Rejections by reason:
Fold-outs                           30%
Print runs into gutter              30%
Rebound, too stiff                  17%
Poor condition                      11%
Other reasons                        6%
More than 5 bolts (uncut pages)      4%
Incomplete or incorrect MARC record  1%
Book too large for cradle            1%
Print too close to outside edge      0%

What comes out?
• JPEG 2000s: raw (~900KB); cropped, deskewed, and light-compensated (~800KB); (optionally) watermarked (~800KB)
• PDF: page images with embedded OCR; colour (~100 KB), black and white (~60KB)
• MARC metadata, xml
• operational metadata, xml: IA identifier; bib identifier; contributor; title; volume; creator; publisher; scan data
• structural metadata, xml: pagination, covers, title page, copyright page
• OCR (UTF-8): ABBYY, DjVu
• Flip book (~35KB)

[Image: Constructive Anatomy]

How is it used? Our top 10 titles:

Downloads  Author                       Title                                                                    Year
64241      St. Augustine                De civitate Dei                                                          1475
13214      Bridgman, George Brant       Constructive anatomy                                                     1920
10064      Colonna, Francesco, d. 1527  Hypnerotomachia                                                          1592
7496       Gallonio, Antonio, d. 1605   Traitee des instruments de martyre[…]tortures et tourments des martyrs chretiens.  1904
6546       Descartes, Rene, et al.      French and English philosophers: Descartes, Rousseau, Voltaire, Hobbes   1910
5484       Schopenhauer, Arthur         The world as will and idea                                               1910
5098       Knutson, Bengt, fl. 1461     A litil boke the whiche traytied and reherced many gode thinges necessaries for the pestilence ... made by the ... Bisshop of Arusiens  1910
5090       Davenport, Cyril             English embroidered bookbindings                                         1899
4496       Abbott, Edwin Abbott         Flatland : a romance of many dimensions                                  1884
4313       Nightingale, Florence        Notes on nursing : what it is, and what it is not                        1860

How is it used? Our general statistics:

Scanned books    Avg use   Min use
Top 100          3206      1531
Top 1,000        973       450
Top 10,000       283       137
Top 50,000       125       61
[…..]
Bottom 10,000    6         11
Bottom 1,000     1         2
Bottom 100       1         1
Bottom 38        0         0

Print            Avg use   Min use
Top 100          89        65
Top 1,000        47        32
Top 10,000       -         14

What are end-users using?
[Chart: percent of scans (% scans), titles (% titles) and use (% use) by LC class, on a 0-40 percent scale. Classes shown: A: General Works; B: Philosophy, Psychology, Religion; C: Auxiliary Sciences of History; D-F: History; G: Geography, Anthropology; H: Social Sciences; J: Political Science; K: Law; L: Education; M: Music; N: Fine Arts; P: Language and Literature; Q: Science; R: Medicine; T: Technology; Z: Bibliography, Library Science]
• Higher use: A, B, C, D-F, G, N, Z
• Expected use: J, K, M, Q, T
• Lower use: H, L, P, R

IA scanning for other institutions
Ship books
• Send MARC record file with books
• Request MARC records from another source
  LAC, LC, McMaster, UofT…
• Arrange Z39.50 access for IA
  OCAD
Sponsor books
• Select area of interest for scanning
• Sponsor scanning
  Tufts Perseus collection

How can libraries use it?
• link to Internet Archive: repository of 1 million online books
• add MARC records to your catalogue: metadata integrated with local collection
• add full text books to your collection: full text search

How are we using it?
Scholar's Portal E-book platform
• integrates licensed and free content
• pdf-like reader
• open access to IA content

Discovery layer
• Faceted search using Endeca
• stretch “catalogue” to include:
  • metadata for all books, not just our books
  • web site
  • A&Is
  • Full text journals
  • Full text books

Next steps
• Print on demand
• Scan on demand
• Enriched structural metadata to improve discovery
  Current structural metadata: pagination, covers, title page, copyright page
  Desired structural metadata: table of contents, index, images, maps