Understanding & Assessing The Million Book Project

Denise Troll Covey
Associate Dean, University Libraries, Carnegie Mellon
Pennsylvania Library Association Conference
Pittsburgh, PA – October 5, 2003
Million Book Project Vision
“Attempt to understand
& solve the technical,
economic, & social policy
issues of providing online
access to all creative works
of the human race.”
– Dr. Raj Reddy
What is the Million Book Project?
Effort to digitize & provide full-text searching
& free-to-read access to a million books by 2007
Collection development
Copyright permissions
Acquisitions & shipping
Scanning operations
Proposal writing
Why is the Million Book Project?
Democratize knowledge & empower citizenry
Address disparity in library size & accessibility
Facilitate new knowledge
Combining old & new, east & west,
technical & humanistic
Enhance student learning
& success of faculty research
Address copyright absurdities
Why is the Million Book Project?
Support digital library research
Information distribution, management, & sustainability
Security, copyright, & digital rights management
Accuracy of optical character recognition (OCR)
OCR of non-Roman languages & scripts
Automatic creation of structural metadata
Automatic summarization
Intelligent indexing
Machine translation
Storage formats
Search engines
Who is involved?
Carnegie Mellon University Libraries
& School of Computer Science
Other U.S. libraries
Internet Archive
India & China
OCLC, DLF, & CRL
Archival Resource Company Inc.
Funding
Collection development
NSF – $35,000 for initial
planning meeting 2001
Funded by partners
Copyright permission
UC Merced – $35,000
Carnegie Mellon – ?
Project administration
Carnegie Mellon – ?
Equipment & travel
NSF – $3.6 million
(discounts from Minolta)
Labor for scanning
India – $1.5 million
China – ?
Acquisitions & shipping
Internet Archive – ?
NSF – pilot shipment
Collection Development Strategies
Librarians as selectors
Best books – cited in bibliographies
Technical reports & government documents
Priorities of participants & funding agencies
Topics to get full collections
What can we acquire
Bulk, cheap, fast
Libraries weeding, closing, renovating
Initial Collection of Collections
In Copyright – 100,000
• Books for College Libraries & other selected bibliographies
Indigenous Indian & Chinese materials – 200,000
• Multi-lingual & multi-script language processing
Public Domain – 700,000
November 2001 planning meeting funded by NSF
Current Collection of Collections
In Copyright – 500,000
• Books for College Libraries & other selected bibliographies
• New copyright permission strategy
Indigenous Indian & Chinese materials – 200,000
• Multi-lingual & multi-script language processing
Public Domain – 300,000
Scanning Underway in India
Multiple centers – each with terabyte storage
Indigenous materials
Shipments from U.S.
100,000 books by 2004
Above average wages
6000 Book Pilot Shipment to India
20 ft ocean container – 25 days NY to Chennai
243 boxes – 9 pallets – 11,298 lbs – duty free
Mostly public domain – government documents,
social science, biography, history, & literature
Approximate cost $2 per book round trip
4000 books did not have to be returned
2000 books were returned in good condition
August 2002 to August 2003
Lessons Learned from Pilot
Reduce shipping cost per book
Change packing method = cost $1 per book
or 50 cents per book shipped one way
Reduce turn-around time
Learned procedures for clearing customs
Establish 5 international centers for U.S. shipments
• Funded by Indian government
• To be inaugurated January 2004
• Receive books, scan, check quality, return
Distribution in India
[Map of India: sites at Bangalore, Allahabad, Calcutta, Hyderabad, Delhi, & Deemed University, Thanjavur; legend: Central Distribution Site, Pilot, Current]
Scanning Underway in China
Customs & content issues prohibit
shipping books to China
Scanning indigenous materials
& U.S. copyrighted works
already in their libraries
(with permission granted)
Above average wages
Standards & Workflow
National standards for digital preservation
Developed by IMLS 2001 & endorsed by DLF 2002
http://www.imls.gov/pubs/forumframework.htm
National standards for cataloging
Carnegie Mellon University Libraries
Developed & documented workflow
Provided training
Digitization Workflow
Operators scan, post-process, & OCR
600 DPI TIFF (v5) images
ScanFix post-processing
ABBYY FineReader OCR
• 98% accuracy with English
• Some foreign languages
OCR being developed for
other languages & scripts
Optimum Scanner Throughput
4000 books per year per Minolta scanner
One scanner, two shifts daily = 16 books per day
250 work days per year = 4000 books per year
72,000 books per year with current 18 scanners
400,000 books per year with 100 scanners
Allowing 50% deterioration in throughput,
100 scanners can complete the project in 5 years
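The throughput arithmetic on this slide can be checked with a short sketch. The figures come from the slide; the function and constant names are illustrative assumptions, not part of the project's tooling.

```python
# Figures taken from the slide; all names here are illustrative.
BOOKS_PER_SCANNER_PER_DAY = 16   # one scanner, two shifts daily
WORK_DAYS_PER_YEAR = 250
TARGET_BOOKS = 1_000_000


def books_per_year(scanners: int, efficiency: float = 1.0) -> float:
    """Yearly throughput; efficiency 1.0 is the optimum rate on the slide."""
    return scanners * BOOKS_PER_SCANNER_PER_DAY * WORK_DAYS_PER_YEAR * efficiency


def years_to_target(scanners: int, efficiency: float = 1.0) -> float:
    """Years needed to scan TARGET_BOOKS at the given efficiency."""
    return TARGET_BOOKS / books_per_year(scanners, efficiency)


print(books_per_year(1))           # 4000.0
print(books_per_year(18))          # 72000.0
print(books_per_year(100))         # 400000.0
print(years_to_target(100, 0.5))   # 5.0
```

At the optimum rate 100 scanners would need 2.5 years; halving the throughput gives the 5-year estimate.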
Metadata Workflow
Librarians capture
Bibliographic metadata - for delivery system
• MARC from OCLC or create Dublin Core
• Guest IDs provided by OCLC
Administrative metadata - for reporting system
• Bibliographic metadata
• Source library
• Return requested
• Copyright status – check renewal records
• Permission status – used by delivery system
Collaborative development by India & Carnegie Mellon
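As a rough illustration, the administrative metadata listed on this slide could be modeled as a small record that a delivery system consults. This is a hypothetical sketch: the class names, field names, and the display rule (show full text only for public domain books or when permission is granted) are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class CopyrightStatus(Enum):
    PUBLIC_DOMAIN = "public domain"
    IN_COPYRIGHT = "in copyright"
    UNKNOWN = "unknown"


class PermissionStatus(Enum):
    GRANTED = "granted"
    DENIED = "denied"
    UNKNOWN = "unknown"


@dataclass
class AdministrativeRecord:
    """Hypothetical administrative metadata for one scanned book."""
    title: str                      # stands in for the bibliographic metadata
    source_library: str
    return_requested: bool
    copyright_status: CopyrightStatus
    permission_status: PermissionStatus


def may_display_full_text(record: AdministrativeRecord) -> bool:
    """Assumed rule: display public domain books and books with permission."""
    if record.copyright_status is CopyrightStatus.PUBLIC_DOMAIN:
        return True
    return record.permission_status is PermissionStatus.GRANTED


book = AdministrativeRecord(
    title="Example title",
    source_library="Example library",
    return_requested=True,
    copyright_status=CopyrightStatus.IN_COPYRIGHT,
    permission_status=PermissionStatus.UNKNOWN,
)
print(may_display_full_text(book))  # False
```

Treating unknown permission as "do not display" is the conservative choice here: a book can be scanned and indexed first and opened up later if the publisher agrees.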
Copyright Workflow #1
Carnegie Mellon
• Identify & contact publishers
• Negotiate permission
• No – update publisher database
• Yes – update publisher database
Locate, acquire, & ship to India
India
• Scan & OCR
• Capture metadata
• Return books as needed
• Send copies to Carnegie Mellon
Copyright Workflow #2
Internet Archive & Archival Resource Company, Inc.
• Locate, acquire, & ship to India
India
• Scan & OCR
• Capture metadata
• Return books as needed
• Send copies to Carnegie Mellon
Carnegie Mellon – Copyright Yes & Permission Unknown
• Identify & contact publishers
• Update publisher database as needed
• Negotiate permissions as needed
• No – delivery system won't display
• Yes – update publisher database & administrative metadata
Contextual Searching
Legal to scan & create index without permission
When no permission granted
to display copyrighted book,
search returns query terms
in OCR context
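One way to picture "query terms in OCR context" is a keyword-in-context snippet: the search shows a few characters on either side of each match rather than the page itself. The sketch below is an assumption about how such a display could work, not a description of the project's search engine; the function name and the `width` parameter are hypothetical.

```python
import re
from typing import List


def keyword_in_context(ocr_text: str, query: str, width: int = 30) -> List[str]:
    """Return short snippets of OCR text around each occurrence of each query term."""
    snippets = []
    for term in query.split():
        # Whole-word, case-insensitive match on the literal term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(ocr_text):
            start = max(0, match.start() - width)
            end = min(len(ocr_text), match.end() + width)
            snippets.append("..." + ocr_text[start:end] + "...")
    return snippets


page = "The library scanned a million books for the digital library."
for snippet in keyword_in_context(page, "million", width=10):
    print(snippet)
```

Keeping `width` small means only a fragment of the copyrighted page is ever shown, which is the point of the approach.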
Acquiring the Collection
Archival Resource Company Inc.
Packing, shipping, & tracking
Help locate & acquire books
• Weeded collections
• Closing libraries
Acquisitions web site
• Materials wanted
• Loaning & donating
• Insurance
Integrating the Collection
Transporting files to Carnegie Mellon
Inadequate Internet bandwidth
Expense of copying to & from gold CD or DVD
Physically transport files on disks
Sustaining the Collection
Goal is for 10 organizations to host the Collection
India – Digital Library of India multiple locations
China – site(s) not yet known
U.S. – Carnegie Mellon, Internet Archive,
& University of California Merced
Discussions with OCLC, Library of Congress,
& Digital Library of Alexandria
Estimated cost is $1 million per host site
Estimated size is 20 terabytes
http://www.ulib.org/html/index.html
www.dli.gov.in
Next Steps
Ship 12,500 books one-way to Hyderabad, India
from University of Washington @ $6000
Negotiations
UMI/Proquest – print-on-demand service
OCLC – digital registry & identifying source libraries
CRL – supply or help acquire books
November 2003 – collection meeting
NSF proposal to create database
of copyright renewal records
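For comparison with the pilot, the per-book cost of the planned one-way shipment (12,500 books for $6000) can be worked out directly. The helper name below is illustrative.

```python
def cost_per_book(total_cost_usd: float, books: int) -> float:
    """Average shipping cost per book in US dollars."""
    return total_cost_usd / books


planned = cost_per_book(6000, 12_500)
print(f"${planned:.2f} per book one way")  # $0.48 per book one way
```

That is in line with the 50 cents per book one-way figure from the pilot's lessons learned, and well under the pilot's roughly $2 per book round trip.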
Print on Demand Service
UMI/ProQuest
Handle financial transactions
Print, bind, & send books to customers
Collaboration with Carnegie Mellon
Negotiate royalties with publishers
Develop suitable business model
Digital Rights Management – Lite
Free-to-read by any Internet user
Difficult to save or print books
One page at a time using browser
Secure servers restrict access
Discourage hacking by offering
affordable printed, bound books
Global Business Model
Hardback book = $30.00
Paperback book = $15.00
Digitalback book = cost of cup of coffee
Internet Archive POD = $1.00 paperback
India POD = $0.80 paperback; $2.00 hardback
Open Access Feasibility Study
Couldn’t locate publisher for 11% of books
If located publisher, half didn’t respond
Even to second letter
If got response
Fewer than half gave permission
Often permission was restricted
22% permission granted
1999-2000 – statistically valid random sample
Permission by Publisher Type
Success rate overall – 22%
Scholarly associations – 45%
University presses – 37%
Museums & galleries – 31%
Commercial publishers – 12%
[Chart: no-response rate & success rate by publisher type, axis 0% to 75%; label: Library Content]
Open Access Fine & Rare Books
367 titles in copyright (34% of collection)
Couldn’t locate copyright holder for 13% of titles
127 letters & 44 follow-up calls to date
56% of titles – permission granted
6% with restrictions
3% of titles – permission denied
Assumed if 3 contacts get no response
2002-2003
Transaction Costs
$37.00 per title
FTE labor – $6,550
Phone calls – $225
Paper & postage – $65
TOTAL – $6,840
May 2003 through August 2003
Does not include legal fees,
cost of Internet connectivity
or administrator time
Million Book Project
Copyright Negotiations
Educate
Find online, but use print
Online access increases use
Open access doesn’t decrease,
& can increase sales
Copyright absurdity
Ask
Non-exclusive permission
to scan & provide open access
Minimal system functionality
Give
Preservation-quality copies
Metadata & OCR
Motivate
$$ Use in added-value,
fee-based services
$$ Print on demand
for out-of-print titles
$$ Buy button
for in-print titles
Initial Copyright Approach
Do not pay permission cost
Focus on out-of-print, in-copyright titles
Books for College Libraries has 50,000 titles
Begin with scholarly associations & university presses
Transaction cost per title is prohibitive
Identifying & inserting titles in letters
Negotiating & tracking permission per title
Epiphany & New Approach
Focus on publishers of quality books
Treat bibliographies as approval plan of publishers
Books for College Libraries has 5600 publishers
Ask for permission to digitize
All out-of-print, in-copyright titles
All titles published prior to a date of their choosing
All titles published # or more years ago
List of titles they provide
Follow-up phone call or visit
Current Statistics
5600 publishers in Books for College Libraries
Using intermittent labor
Couldn’t locate 30 publishers (so far)
184 letters & 24 follow-up calls to date
4% permission granted
5% permission denied
Full-time staff October 2003
Results of New Approach
Estimate transaction costs remain the same
But acquire more books for $$ spent
National Academy Press – 99% increase
• 26 titles in Books for College Libraries
• Permission for 3,046 titles
Brookings Institution – 96% increase
Rand McNally – 60% increase
“More Bang for the Buck”
[Charts: Initial vs. Current collection – In Copyright, Public Domain, Indigenous Materials]
Projections
Success rate (# BCL publishers)    # of books per publisher    Million Book Collection
4% (224)                           1500                        336,000
6% (336)                           1500                        504,000
22% (1,232)                        1500                        1,848,000
We may need to negotiate
with India for more labor
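The projections follow from two numbers on the slides: 5600 publishers in Books for College Libraries and a working figure of 1500 books per publisher. A minimal sketch of the calculation, with illustrative names:

```python
from typing import Tuple

BCL_PUBLISHERS = 5600          # publishers in Books for College Libraries
BOOKS_PER_PUBLISHER = 1500     # working figure used on the slide


def projected_collection(success_rate_percent: int) -> Tuple[int, int]:
    """Return (publishers granting permission, projected number of books)."""
    publishers = BCL_PUBLISHERS * success_rate_percent // 100
    return publishers, publishers * BOOKS_PER_PUBLISHER


for rate in (4, 6, 22):
    publishers, books = projected_collection(rate)
    print(f"{rate}% -> {publishers} publishers, {books:,} books")
# 4% -> 224 publishers, 336,000 books
# 6% -> 336 publishers, 504,000 books
# 22% -> 1232 publishers, 1,848,000 books
```

At the 22% rate seen in the earlier feasibility study, the in-copyright material alone would exceed the million-book target.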
Usage Assessments
Transaction log analysis – beginning 2004
Number of searches, browses, pages displayed
Use per title online & print-on-demand
Use by different geographic demographics
Use per time of day, day of week, month of the year
Outcomes assessment – 2006-2007
User demographics – age, gender, location
How users found the Collection
Why they used it & what they did with it
What difference the Collection made
Their assessment of the quality of the Collection
& the usability & functionality of the system
Their view of the significance of the project
Copyright Assessments – 2006
Number of copyrighted books in the collection
Success rate of permission requests
Survey of participating publishers
Overall satisfaction
Quality of the copies
What they did or plan to do with the copies
Impact on revenue & view of open access
Dissemination
Million Book Collection
Books accessible via Google search
Libraries can link to collection from web site
Libraries can link books to catalog records
Publisher database
Successful negotiation strategies
Research test bed
Thank you!