- McMaster University

Come, and Take Choice of All My Library:
Mass Digitization Examined
Jonathan Bengtson
Associate University Librarian
for Scholarly Resources
Sian Meikle
Digital Services Librarian
University of Toronto Libraries
Access Conference, September 2008
Part One (Jonathan Bengtson):


Overview of University of Toronto / Internet Archive
collaboration
What should a Digital Library be? Models to date.
Part Two (Sian Meikle):


Building the Digital Library:
what goes in, what comes out, how to join in
Using the Digital Library:
how we’re using it, how users use it, how you can use it
Questions encouraged!
The University of Toronto’s
Internet Archive Scanning Centre
“Scribe” scanning station capacity
• 500 pages per hour
• 14 hours per day, 5 days per week
• 7,000 pages each day per scribe
“Scribe” Centre capacity
• 161,000 pages per day (23 scribes)
• 805,000 pages per week (23 scribes)
• 2,683 books per week (23 scribes), if an average book is 300 pages
• 100,000+ books per year
Mass Digitization at the
University of Toronto
Phase one (Pilot):
Autumn 2004-Autumn 2005
Phase two (MSN/OCA):
Autumn 2005-May 22, 2008
Phase three (OCA+?):
May 23, 2008-
Partnering with the
Internet Archive
The University of Toronto is one of the five
largest academic libraries in North America.
The Internet Archive is a non-profit
organization, based in San Francisco, that was
founded in 1996 to build an ‘Internet library,’
with the purpose of offering permanent access
for researchers, historians, scholars and the
general public to historical collections that exist
in digital format.
www.archive.org
Internet Archive:
Preservation and Access
Over 2.5 petabytes of storage, and growing
To put that in perspective:
An mp3 is usually 3-4 megabytes
2.5 petabytes = 2,684,354,560 megabytes
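The megabyte figure on this slide corresponds to 2.5 petabytes when binary units are used (1 petabyte = 1024³ megabytes). A small sketch of the conversion, with the mp3 comparison added as a rough illustration; the constant and variable names are made up for this example:

```python
# Binary units: 1 PB = 1024 TB = 1024**2 GB = 1024**3 MB.
MB_PER_PB = 1024 ** 3

storage_pb = 2.5                               # "over 2.5 petabytes" from the slide
storage_mb = int(storage_pb * MB_PER_PB)

# The slide says an mp3 is usually 3-4 MB; estimate how many would fit.
mp3s_at_4_mb = storage_mb // 4
mp3s_at_3_mb = storage_mb // 3

print(f"{storage_mb:,} MB")
print(f"roughly {mp3s_at_4_mb:,} to {mp3s_at_3_mb:,} mp3 files")
```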
1.5 million downloads per day (one
of the top 350 global sites)
3 storage facilities in San Francisco,
Amsterdam & Alexandria, Egypt
Experience with multiple formats
Audio
282,000+ items in over 100 collections
• Live Music Archive (2,300 bands & 40,000 performances)
• Netlabels (600 labels)
• Mother Jones Radio
• LibriVox Audio Books
• Afropop Worldwide
• Old Time Radio
• Tse Chen Ling Buddhist Lectures
• 78 RPM Records
• Free Speech Radio News
• Presidential Recordings
Moving Images
128,000+ items in 100 collections
• Democracy Now
• SIGGRAPH Computer Animation
• Film Chest Vintage Cartoons
• Prelinger Archives
• Drive-In Movie Ads
• UCSF Tobacco Industry Videos
• Universal Newsreels
• Mosaic Middle East News
• Kino French Films
Phase One
University of Toronto Collections
Evaluate technology, workflow, etc.
September 2004-September 2005
University of Toronto Collections:
• Selections from various collections, including c. 1,000 volumes from the Centre for Renaissance and Reformation Studies; materials from the Centre for 19th century French Studies; the Pontifical Institute of Mediaeval Studies; circulating collection
• Records of Early English Drama: by permission of University of Toronto Press
Phase Two U of T Collections
• Most ranges of LC
• Focus on religion, history, Canadiana (when possible), (some) literature, science
• Mostly English language
• Mostly pre-1923
• Multiple libraries
• Some special collections
• Circulating pre-1923 materials
Phase Two Partners
Memorial University: Newfoundland Quarterly, materials relating to Newfoundland
McMaster University: 100+ items from the First World War Collection
Ryerson University: various items including the Yellow Book: an illustrated quarterly
University of Ottawa: 500 18th & 19th century works chosen by faculty, including history, French, music, history of medicine, jurisprudence and nursing
Library and Archives Canada: c. 450,000 pages from Canadian governmental publications
Legislative Assembly of Ontario Library
Toronto Public Library: local history and genealogy
University of Alberta: Canadiana
Tufts University, Boston, USA (Mellon and other grant funds)
Other: Havergal College, U of T Faculty, federally funded publisher, test scans for other OCA partners, individual researchers, National Institute of Newman Studies
Internet Archive
Book Scanning experience
• 1996 registered as a non-profit
• 2003 (India) Million books project
• 2004 Sloan grant, equipment evaluation, trial scanning
• 2006 Production scanning, 3 sites
• 2007 8 sites; 5 million pages or 12-15,000 books each month
• 2008 18 sites; 10 million pages or 25,000 books each month
Google Books
Microsoft Live Search Books
Open Content Alliance
The Open Content Alliance
(OCA) represents the
collaborative efforts of a
group of cultural,
technology, non-profit, and
governmental organizations
from around the world that
will help build a permanent
archive of multilingual
digitized text and multimedia
content.
The Open Library (“Wikipedia”)
Why do we need a Digital Library?
Things are changing rapidly
Variety of experience and
preferences
Some user themes from recent research at
the University of Toronto
Researchers & students are hurried, clever, determined, &
inattentive.
“When it comes to web resources, if it doesn't give me what I
want in 5-10 minutes, I'm gone. I try to be more patient with
UT, because it's slower.”
They understand copyright, but download what they need.
Their end goals take priority.
Journal articles are saved …
- So they think, why not e-books, which are too long to be read
online?
E-books are sought for convenience, but access is NOT necessarily
convenient.
“Hoping to find an e-book, so I wouldn’t have to go to the
stacks.”
Part Two (Sian Meikle):
Building the Digital Library:
what goes in, what comes out, how to join in
Using the Digital Library:
how we’re using it, how users use it, how you can use it
What goes in?
Books:
• not too big and not too small: 3”x3” to 14.5”x9.5”
• not too old and not too new: 6% get rejected for hard living; 1922 cut-off
MARC metadata:
• Z39.50 is used to fetch MARC data, and so…
An identifier to tie book to its metadata
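The size and date rules on this slide can be read as a simple eligibility check. A rough sketch, assuming the two dimensions are height by width in inches and that the 1922 cut-off means a publication year of 1922 or earlier; the `Book` type and the function names are hypothetical, and rejection for physical condition is not modelled:

```python
from dataclasses import dataclass

# Limits from the slide; the height-by-width reading is an assumption.
MIN_HEIGHT_IN, MIN_WIDTH_IN = 3.0, 3.0
MAX_HEIGHT_IN, MAX_WIDTH_IN = 14.5, 9.5
CUTOFF_YEAR = 1922  # assumed to mean "published in 1922 or earlier"


@dataclass
class Book:
    height_in: float
    width_in: float
    year: int


def fits_cradle(book: Book) -> bool:
    """True if the book is neither too small nor too large."""
    return (MIN_HEIGHT_IN <= book.height_in <= MAX_HEIGHT_IN
            and MIN_WIDTH_IN <= book.width_in <= MAX_WIDTH_IN)


def is_eligible(book: Book) -> bool:
    """True if the book passes both the size and the date rule."""
    return fits_cradle(book) and book.year <= CUTOFF_YEAR


print(is_eligible(Book(height_in=9.0, width_in=6.0, year=1910)))
```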
Constructing the online book
Scan Center: Assign unique id → Get metadata via Z39.50 → Approve book → Scan book → Perform QA → Upload scans
Internet Archive: Create derivatives → Make book available online
Binding books to metadata
Ideal book identifiers are:
• easy to enter and unique
  and so, not title or call number
• capable of retrieving a MARC record: Z39.50-accessible
  and so, probably not barcode
• on the book
  and so, not OCLC#, LCCN, ISBN, DBCN...
Some possible solutions
1. Your ILS thinks barcodes are MARC data
2. You put a flier with identifier in each book
3. You provide an intermediary script
... we chose option #3
Constructing the online book
UTL script:
• barcode in
• identifier out
• tracks scan decision
Scan Center: Assign unique id → Get metadata via Z39.50 → Approve book → Scan book → Perform QA → Upload scans
Internet Archive: Create derivatives → Make book available online
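The slides describe the intermediary script only as "barcode in, identifier out, tracks scan decision". The sketch below illustrates that idea and is not the actual UTL script: the `ScanTracker` class, the in-memory lookup table, and the sample barcode and record id are all invented for this example, and no real ILS or Z39.50 call is made.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ScanTracker:
    """Hypothetical stand-in for an intermediary script."""
    # Assumed lookup exported from the ILS: item barcode -> searchable record id.
    barcode_to_record_id: Dict[str, str]
    # Scan decision recorded per barcode, e.g. "approved" or "rejected: fold-outs".
    decisions: Dict[str, str] = field(default_factory=dict)

    def identifier_for(self, barcode: str) -> Optional[str]:
        """Barcode in, identifier out; None if the barcode is unknown."""
        return self.barcode_to_record_id.get(barcode.strip())

    def record_decision(self, barcode: str, decision: str) -> None:
        """Remember what happened to the book at the scanning station."""
        self.decisions[barcode.strip()] = decision


# Made-up sample data for illustration only.
tracker = ScanTracker({"31761000000001": "record-0001"})
record_id = tracker.identifier_for(" 31761000000001 ")
tracker.record_decision("31761000000001", "approved")
print(record_id, tracker.decisions)
```

Keeping the barcode-to-identifier mapping outside the ILS means the scanning station only ever needs an identifier that a metadata search can resolve, which is the property the "ideal identifier" slide asks for.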
Some books aren’t mass-digitized
Books scanned: 143,380; books rejected: 12,424; rejection rate: 9%
Reasons for rejection:
• Fold-outs: 30%
• Print runs into gutter: 30%
• Rebound, too stiff: 17%
• Poor condition: 11%
• Other reasons: 6%
• More than 5 bolts (uncut pages): 4%
• Incomplete or incorrect MARC record: 1%
• Book too large for cradle: 1%
• Print too close to outside edge: 0%
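The 9% rejection rate can be checked against the two counts on this slide. A short sketch, assuming the rate is rejected books divided by scanned books (dividing by all books handled gives about 8% instead); the reason shares are as read from the chart labels, and the variable names are illustrative:

```python
scanned = 143_380
rejected = 12_424

# Two plausible definitions of "rejection rate".
rate_vs_scanned = rejected / scanned                 # rejected relative to scanned
rate_vs_handled = rejected / (scanned + rejected)    # rejected relative to all handled

# Share of rejections by reason, in percent, as read from the chart labels.
reasons_pct = {
    "Fold-outs": 30,
    "Print runs into gutter": 30,
    "Rebound, too stiff": 17,
    "Poor condition": 11,
    "Other reasons": 6,
    "More than 5 bolts (uncut pages)": 4,
    "Incomplete or incorrect MARC record": 1,
    "Book too large for cradle": 1,
    "Print too close to outside edge": 0,
}

print(f"{rate_vs_scanned:.1%} of scanned, {rate_vs_handled:.1%} of handled")
print("reason shares sum to", sum(reasons_pct.values()))
```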
What comes out?
JPEG 2000s:
• Raw (~900KB)
• cropped, deskewed, and light-compensated (~800KB)
• (optionally) watermarked (~800KB)
PDF:
• page images with embedded OCR
• colour (~100 KB)
• black and white (~60KB)
MARC metadata, xml
operational metadata, xml:
• IA identifier; bib identifier; contributor; title; volume; creator; publisher; scan data
structural metadata, xml:
• Pagination, covers, title page, copyright page
OCR (UTF-8):
• ABBYY, DjVu
Flip book (~35KB)
Constructive
Anatomy
How is it used? Our top 10 titles:
Downloads | Author | Title | Year
64241 | St. Augustine | De civitate Dei | 1475
13214 | Bridgman, George Brant | Constructive anatomy | 1920
10064 | Colonna, Francesco, d. 1527 | Hypnerotomachia | 1592
7496 | Gallonio, Antonio, d. 1605 | Traitee des instruments de martyre[…]tortures et tourments des martyrs chretiens. | 1904
6546 | Descartes, Rene, et al. | French and English philosophers: Descartes, Rousseau, Voltaire, Hobbes | 1910
5484 | Schopenhauer, Arthur | The world as will and idea | 1910
5098 | Knutson, Bengt, fl. 1461 | A litil boke the whiche traytied and reherced many gode thinges necessaries for the pestilence ... made by the ... Bisshop of Arusiens | 1910
5090 | Davenport, Cyril | English embroidered bookbindings | 1899
4496 | Abbott, Edwin Abbott | Flatland : a romance of many dimensions | 1884
4313 | Nightingale, Florence | Notes on nursing : what it is, and what it is not | 1860
How is it used? Our general statistics:
Scanned books:
Rank | Avg use | Min use
Top 100 | 3206 | 1531
Top 1,000 | 973 | 450
Top 10,000 | 283 | 137
Top 50,000 | 125 | 61
Bottom 10,000 | 6 | 11
Bottom 1,000 | 1 | 2
Bottom 100 | 1 | 1
Bottom 38 | 0 | 0
Print:
Rank | Avg use | Min use
Top 100 | 89 | 65
Top 1,000 | 47 | 32
Top 10,000 | - | 14
What are end-users using?
[Chart: percent (0-40) of scans, titles, and use by LC class. Series: % scans, % titles, % use. Classes: A: General Works; B: Philosophy, Psychology, Religion; C: Auxiliary Sciences of History; D-F: History; G: Geography, Anthropology; H: Social Sciences; J: Political Science; K: Law; L: Education; M: Music; N: Fine Arts; P: Language and Literature; Q: Science; R: Medicine; T: Technology; Z: Bibliography, Library Science]
Higher use: A, B, C, D-F, G, N, Z
Expected use: J, K, M, Q, T
Lower use: H, L, P, R
IA scanning for other institutions
Ship books
• Send MARC record file with books
• Request MARC records from another source
  LAC, LC, McMaster, UofT…
• Arrange Z39.50 access for IA
  OCAD
Sponsor books
• Select area of interest for scanning
• Sponsor scanning
  Tufts Perseus collection
How can libraries use it?
• link to Internet Archive
  repository of 1 million online books
• add MARC records to your catalogue
  metadata integrated with local collection
• add full text books to your collection
  full text search
How are we using it?
Scholar's Portal E-book platform
• integrates licensed and free content
• pdf-like reader
• open access to IA content
Discovery layer
• Faceted search using Endeca
• stretch “catalogue” to include:
  metadata for all books, not just our books
  web site
  A&Is
  Full text journals
  Full text books
Next steps
• Print on demand
• Scan on demand
• Enriched structural metadata to improve discovery
  Current structural metadata: pagination, covers, title page, copyright page
  Desired structural metadata: table of contents, index, images, maps