Project Gutenberg: To 10,000 then 1,000,000 Free eBooks

Dr. Gregory B. Newby, CEO
Project Gutenberg Literary Archive
Foundation
What’s an eBook?




An Electronic Book
A written work, with or without images, which
may or may not have been published on paper
Some eBooks are born digital, others are
converted from print
Modern publishers are concerned about losing
control of their books if they are distributed
electronically. However, there are many authors
who feel differently, and millions of items in the
public domain
Where do eBooks come from?
Project Gutenberg (PG) is the world’s
oldest producer of eBooks
 PG was started in 1971 when Michael S.
Hart typed in the text of the US
Declaration of Independence
 From 1971 to 1990 he developed the idea
of Project Gutenberg, and released a
dozen or so major works

1971 to 1990
Aug 1989 The Bible, Both Testaments, King James Version   [kjvxxxxx.xxx] 10
Dec 1979 Abraham Lincoln's First Inaugural Address        [linc1xxx.xxx]  9
Dec 1978 Abraham Lincoln's Second Inaugural Address       [linc2xxx.xxx]  8
Dec 1977 The Mayflower Compact                            [mayflxxx.xxx]  7
Dec 1976 Give Me Liberty Or Give Me Death, Patrick Henry  [liberxxx.xxx]  6
Dec 1975 The United States' Constitution                  [constxxx.xxx]  5
Nov 1973 Gettysburg Address, Abraham Lincoln              [gettyxxx.xxx]  4
Nov 1973 John F. Kennedy's Inaugural Address              [jfkxxxxx.xxx]  3
Dec 1972 The United States' Bill of Rights                [billxxxx.xxx]  2
Dec 1971 Declaration of Independence                      [whenxxxx.xxx]  1
Your humble narrator got involved with PG
while a beginning assistant professor in
the Graduate School of Library and
Information Science at UIUC:
Oct 1992 The Legend of Sleepy Hollow, Washington Irving   [sleepxxx.xxx] 41
The Rest is History…




Project Gutenberg attracted a variety of
volunteers
From its visibility and Michael’s tenacity,
awareness of the potential of eBooks emerged
The sole source of funding has been donations
(mostly quite small) from individuals and
organizations
These days, we’re also seeking funding from
grants etc.
PG Structure
In 2001, the Project Gutenberg Literary
Archive Foundation (PGLAF) was formed
 As a 501(c)(3) corporation, PGLAF makes
fundraising and legal matters easier, and
lets us hire a few part-time personnel
(Michael is the only full-time employee;
gbn is a volunteer)
 There are many thousands of volunteers,
with a core group of about 20

Where is PG?









Main server is ibiblio.org, at UNC-CH
Backup server is archive.org, in S.F.
Dozens of mirrors around the world
Web pages are on promo.net (gutenberg.net);
the webmaster lives in Rome
gbn is in Chapel Hill, soon to be in Fairbanks
Michael Hart is in Urbana
Cataloger, programmer are in California
DP is run by Charles Franks in Las Vegas
We’re highly automated, all electronic and distributed
Goal: Give Away eBooks



Project Gutenberg seeks
eBooks: All languages &
topics; contemporary and
historical; different
formats
We seek to preserve
cultural heritage by
digitizing and distributing
these eBooks
To ensure longevity in
access, we prefer to
provide plain ASCII in
addition to any other
format (HTML, PDF, etc.)
Goal: Enhance Literacy



People need to be literate – by reading – to be
empowered and effective citizens
Project Gutenberg wants as many people as
possible to have ready and free access to
eBooks on all possible topics
At current and historical prices for computer
disk drives, CDs and DVDs, individuals who have
computers can hold the entire Project Gutenberg
collection on just a few dollars' worth of
storage
To 10,000 eBooks

PG has tracked Moore’s law for over 10 years. If the historical trend
continues, we will post #10000 later in 2003
Our progress since December 10, 1990 (as of about noon, January 31, 2003),
shown as eBooks posted and the date each milestone was reached:
 7,000   2/03
 6,500  12/02
 6,000   9/02
 5,500   7/02
 5,000   4/02
 4,500   2/02
 4,000  10/01
 3,500   5/01
 3,000  12/00
 2,500   8/00
 2,000  12/99
 1,500  10/98
 1,000   8/97
   500   4/96
   100  12/93
    10  12/90
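As a rough check on the "later in 2003" projection, the doubling time implied by the milestone dates above can be estimated in a few lines of Python. This is only a sketch: it assumes steady exponential growth between milestones, uses four of the dates from the graph, and the function names are made up for illustration.

```python
from math import log2

# (eBooks posted, (year, month)) pairs taken from the progress graph above.
MILESTONES = [
    (1000, (1997, 8)),
    (2000, (1999, 12)),
    (4000, (2001, 10)),
    (7000, (2003, 2)),
]

def months_between(start, end):
    """Whole months from one (year, month) pair to another."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])

def doubling_time_months(first, last):
    """Average months per doubling between two (count, date) milestones."""
    doublings = log2(last[0] / first[0])
    return months_between(first[1], last[1]) / doublings

def months_until(target, current_count, doubling_months):
    """Months needed to grow from current_count to target at a fixed doubling time."""
    return log2(target / current_count) * doubling_months

recent = doubling_time_months(MILESTONES[2], MILESTONES[3])
remaining = months_until(10000, 7000, recent)
print(f"doubling time since 10/01: {recent:.1f} months")
print(f"months from 2/03 to #10,000 at that rate: {remaining:.1f}")
```

Using the 10/01 and 2/03 milestones, this gives a doubling time of roughly 20 months and about ten more months to reach #10,000, which is consistent with the projection on the slide.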
Getting to #10,000: Historical

Historically, individual volunteers would handle
the entire digitization process:




Find a book, submit photocopies of title & verso page
to Michael Hart
Scan & OCR or type the book
Submit the eBook to Michael, who would attach a
header & footer, check proofreading, and announce
by email and in the newsletter
Finding aids include an online catalog, a text file
listing all books, direct FTP/HTTP access, a
browsing page, and independent catalogs (IPL &
OnlineBooks)
Getting to #10,000: Current





Online copyright clearance
(http://beryl.ils.unc.edu/copy.html)
Online eBook submission
(http://beryl.ils.unc.edu/upload.html)
A PG “whitewashers” team (remember Tom Sawyer?) to
work on formatting, uploading and announcing eBooks
Many, many automated programs for different parts of
the process, from checking for data integrity on the
servers to writing the newsletter
Some tools for eBook producers, including Gutcheck
(http://sourceforge.net/projects/gutcheck)
Getting to #10,000:
Distributed Proofreading





Distributed Proofreading (DP) is an innovation by key
volunteer Charles Franks, with help from Charles
Aldarondo and others
The concept: page images and OCR output are
compared, a page at a time, using a simple Web-based
interface
By distributing and making asynchronous the process of
proofreading, we have greatly increased production
By having at least two proofreaders per page, plus
oversight to assemble the final eBook, quality is quite
high
Page images are archived; we are cooperating with a
project of The Internet Archive
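To make the comparison step concrete, here is a minimal sketch of how the text of one page before and after a proofreading pass might be compared. It is not DP's actual code (the site itself runs on PHP); it uses Python's standard difflib module, and the sample page text and function names are made up for illustration.

```python
import difflib

# Hypothetical OCR output for one page, and the text after a proofreader
# compared it against the page image.
ocr_text = "It was the hest of times,\nit was the worst of tirnes,"
proofed_text = "It was the best of times,\nit was the worst of times,"

def page_diff(before, after):
    """Return unified-diff lines describing what the proofreader changed."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="ocr", tofile="proofed", lineterm=""))

def similarity(before, after):
    """Ratio in [0, 1]; 1.0 means the two versions are identical."""
    return difflib.SequenceMatcher(None, before, after).ratio()

for line in page_diff(ocr_text, proofed_text):
    print(line)
print(f"similarity: {similarity(ocr_text, proofed_text):.3f}")
```

A second proofreader's pass could be compared with the first in the same way; an empty diff suggests the page has settled.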
DP Infrastructure





Currently based at http://texts01.archive.org/dp ,
but we envision sets of servers for replication
and data integrity
Moderately large disk space needs for active
projects (up to 30MB per eBook)
Based on Linux + MySQL + PHP
Over 6,000 people have prepared at least one
page
Hundreds of very active volunteers
DP Infrastructure, Continued
Dedicated book buyers (up to $1/book at
library sales etc.), but individual book
donations are accepted
 Scanning & OCR are centralized (2 Fujitsu
page-fed scanners; ABBYY FineReader)
 Most books go to plain text only
 In the near future, books will go to XML,
with other formats (text, HTML) derived
from XML

Beyond 10,000
More automation
 More outreach to contemporary authors
for copyrighted works
 More digitization of historical literature
(pre-1923)
 Identification of “unknown” public domain
works
 Better finding aids
 Cooperation with other projects

More: Beyond 10,000
All eBooks in XML format
 Conversion on the fly to different formats:

HTML
 Text; Unicode, etc.
 Braille
 PDF, eBook, etc….


Auto-creation of custom CD/DVD ISO
images
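The idea of deriving every other format from one XML master can be sketched as follows. The XML vocabulary here (`book`, `title`, `p`, `i`) is invented for the example, since the slides do not say which schema PG will use, and the two converter functions are hypothetical; only Python's standard library is used.

```python
import xml.etree.ElementTree as ET
from html import escape
from textwrap import fill

# A made-up, minimal XML vocabulary; the real master format is not specified here.
SOURCE = """<book>
  <title>The Legend of Sleepy Hollow</title>
  <p>First paragraph of the <i>master</i> text.</p>
  <p>Second paragraph.</p>
</book>"""

def to_text(root, width=70):
    """Derive wrapped plain text from the master XML (inline markup is flattened)."""
    blocks = [root.findtext("title").upper()]
    blocks += [fill("".join(p.itertext()), width) for p in root.iter("p")]
    return "\n\n".join(blocks) + "\n"

def to_html(root):
    """Derive a simple HTML page from the same master XML (inline markup is flattened)."""
    title = escape(root.findtext("title"))
    paras = "\n".join(
        f"<p>{escape(''.join(p.itertext()))}</p>" for p in root.iter("p"))
    return (f"<html><head><title>{title}</title></head>\n"
            f"<body>\n<h1>{title}</h1>\n{paras}\n</body></html>")

root = ET.fromstring(SOURCE)
print(to_text(root))
print(to_html(root))
```

Because both outputs are generated from the same parsed tree, a correction made once in the XML shows up in every derived format.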
Copyright Procedures





PG follows US laws. We are very diligent about
copyright, since the penalties for copyright infringement
are extreme
We primarily work with public domain works, but also
have procedures for accepting donations of works in
copyright (currently about 2% of the collection)
Rule 1: If the source book was published pre-1923, it’s
public domain in the US
Items from 1923-1989 published in the US without a
copyright notice are public domain
Items which match pre-1923 works are public domain
Lesser Used Copyright Procedures




Items pre-1964 that were not renewed are public
domain (this can be hard to prove)
Items not currently available are exempted from
copyright infringement under Title 17 Section
108(h). We’re starting to work with this rule
Items published outside of the US from 1923 to the
present follow the laws of that country (under
GATT and the Berne Convention). But it’s
tough to be expert in non-US copyright
Core concept: Due Diligence. PG must
demonstrate due diligence that copyright
procedures are followed. We do this!
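The US rules on these two slides can be read as a short decision procedure. The sketch below encodes only what the slides state, in the order they state it; the class, field and function names are hypothetical, and the Section 108(h) and non-US cases are left as "needs research" because they call for case-by-case work.

```python
from dataclasses import dataclass

@dataclass
class Source:
    """Facts about a source book; the field names are illustrative only."""
    year_published: int
    published_in_us: bool
    has_copyright_notice: bool = True
    copyright_renewed: bool = True  # assume renewed unless research shows otherwise

def us_public_domain_reason(src):
    """Return the slide rule that applies, or None if further research is needed."""
    if src.year_published < 1923:
        return "Rule 1: published pre-1923"
    if not src.published_in_us:
        return None  # 1923-present non-US works follow that country's laws
    if src.year_published <= 1989 and not src.has_copyright_notice:
        return "published in the US 1923-1989 without a copyright notice"
    if src.year_published < 1964 and not src.copyright_renewed:
        return "pre-1964 and not renewed"
    return None

print(us_public_domain_reason(Source(1905, published_in_us=False)))
print(us_public_domain_reason(Source(1950, published_in_us=True)))
```

A `None` result does not mean the work is in copyright, only that none of the simple rules settles the question.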
Why 1,000,000?


If we could get 1 million eBooks to 1 million
readers each, that would be 1,000,000,000,000
(1 trillion) eBooks given away. This is a
modest goal: we’d really like to reach a far
greater portion of the world’s population
PG is on track to:




Continue to increase production per Moore’s law
Continue to digitize historical works
Obtain copyrighted works by donation
Distribute these eBooks freely through many mirrors,
CD/DVD, etc.
What can Google do?




Google now harvests all of the PG files from the
ibiblio server. This offers full-text searching
capability to the collection!
Google has a catalog digitization project that
performs similar tasks to our general eBook
production process
Google has topic-based navigation systems
which are suitable for eBooks
There are interesting and challenging issues
involved in making our (current) over 16GB and
16K files available and usable to the populace
More: What can Google do?







Help readers to find
eBooks
Distribute eBooks
Scan & OCR
Support acquisition of
books
Support software
development
Support copyright
research
Help to make more stuff
digital, because that’s
what we’re all about!