Project Gutenberg: To 10,000 then 1,000,000 Free eBooks
Dr. Gregory B. Newby, CEO
Project Gutenberg Literary Archive Foundation

What’s an eBook?
- An Electronic Book
- A written work, with or without images, which may or may not have been published on paper
- Some eBooks are born digital, others are converted from print
- Modern publishers are concerned about losing control of their books if they are distributed electronically. However, there are many authors who feel differently, and millions of items in the public domain

Where do eBooks come from?
- Project Gutenberg (PG) is the world’s oldest producer of eBooks
- PG was started in 1971 when Michael S. Hart typed in the text of the US Declaration of Independence
- From 1971 to 1990 he developed the idea of Project Gutenberg, and released a dozen or so major works

1971 to 1990
Date      Title                                            File            No.
Aug 1989  The Bible, Both Testaments, King James Version   [kjvxxxxx.xxx]  10
Dec 1979  Abraham Lincoln's First Inaugural Address        [linc1xxx.xxx]  9
Dec 1978  Abraham Lincoln's Second Inaugural Address       [linc2xxx.xxx]  8
Dec 1977  The Mayflower Compact                            [mayflxxx.xxx]  7
Dec 1976  Give Me Liberty Or Give Me Death, Patrick Henry  [liberxxx.xxx]  6
Dec 1975  The United States' Constitution                  [constxxx.xxx]  5
Nov 1973  Gettysburg Address, Abraham Lincoln              [gettyxxx.xxx]  4
Nov 1973  John F. Kennedy's Inaugural Address              [jfkxxxxx.xxx]  3
Dec 1972  The United States' Bill of Rights                [billxxxx.xxx]  2
Dec 1971  Declaration of Independence                      [whenxxxx.xxx]  1

Your humble narrator got involved with PG while a beginning assistant professor in the Graduate School of Information and Library Science at UIUC:
Oct 1992  The Legend of Sleepy Hollow, Washington Irving   [sleepxxx.xxx]  41

The Rest is History…
- Project Gutenberg attracted a variety of volunteers
- From its visibility and Michael’s tenacity, awareness of the potential of eBooks emerged
- The sole source of funding has been donations (mostly quite small) from individuals and organizations
- These days, we’re also seeking funding from grants etc.
PG Structure
- In 2001, the Project Gutenberg Literary Archive Foundation (PGLAF) was formed
- Being a 501(c)(3) corporation has made fundraising and legal matters easier, and lets us hire a few part-time personnel (Michael is the only full-time employee; gbn is a volunteer)
- There are many thousands of volunteers, with a core group of about 20

Where is PG?
- Main server is ibiblio.org, at UNC-CH
- Backup server is archive.org, in S.F.
- Dozens of mirrors around the world
- Web pages are on promo.net (gutenberg.net); the Webmaster lives in Rome
- gbn is in Chapel Hill, soon to be in Fairbanks
- Michael Hart is in Urbana
- The cataloger and programmer are in California
- Distributed Proofreading (DP) is run by Charles Franks in Las Vegas
- We’re highly automated, all electronic and distributed

Goal: Give Away eBooks
- Project Gutenberg seeks eBooks: all languages & topics; contemporary and historical; different formats
- We seek to preserve cultural heritage by digitizing and distributing these eBooks
- To ensure longevity in access, we prefer to provide plain ASCII in addition to any other format (HTML, PDF, etc.)

Goal: Enhance Literacy
- People need to be literate – by reading – to be empowered and effective citizens
- Project Gutenberg wants as many people as possible to have ready and free access to eBooks on all possible topics
- With the current and historic cost of computer disk drives, CDs and DVDs, it’s cost-effective for individuals who have computers to possess the entire Project Gutenberg collection for just a few dollars’ worth of storage

To 10,000 eBooks
- PG has tracked Moore’s law for over 10 years.
- If the historical trend continues, we will post #10000 later in 2003
- Here's the current graph of our progress since December 10, 1990 (as of ~Noon January 31, 2003):

eBooks  Reached
 7,000  2/03
 6,500  12/02
 6,000  9/02
 5,500  7/02
 5,000  4/02
 4,500  2/02
 4,000  10/01
 3,500  5/01
 3,000  12/00
 2,500  8/00
 2,000  12/99
 1,500  10/98
 1,000  8/97
   500  4/96
   100  12/93
    10  12/90

Getting to #10,000: Historical
- Historically, individual volunteers would handle the entire digitization process:
  - Find a book, submit photocopies of title & verso page to Michael Hart
  - Scan & OCR or type the book
  - Submit the eBook to Michael, who would attach a header & footer, check proofreading, and announce by email and in the newsletter
- Finding aids include an online catalog, a text file listing all books, direct FTP/HTTP access, a browsing page, and independent catalogs (IPL & OnlineBooks)

Getting to #10,000: Current
- Online copyright clearance (http://beryl.ils.unc.edu/copy.html)
- Online eBook submission (http://beryl.ils.unc.edu/upload.html)
- A PG “whitewashers” team (remember Tom Sawyer?)
  to work on formatting, uploading and announcing eBooks
- Many, many automated programs for different parts of the process, from checking for data integrity on the servers to writing the newsletter
- Some tools for eBook producers, including Gutcheck (http://sourceforge.net/projects/gutcheck)

Getting to #10,000: Distributed Proofreading
- Distributed Proofreading (DP) is an innovation by key volunteer Charles Franks, with help from Charles Aldarondo and others
- The concept: page images and OCR output are compared, a page at a time, using a simple Web-based interface
- By distributing the proofreading process and making it asynchronous, we have greatly increased production
- By having at least two proofreaders per page, plus oversight to assemble the final eBook, quality is quite high
- Page images are archived; we are cooperating with a project of The Internet Archive

DP Infrastructure
- Currently based at http://texts01.archive.org/dp, but we envision sets of servers for replication and data integrity
- Moderately large disk space needs for active projects (up to 30MB per eBook)
- Based on Linux + MySQL + PHP
- Over 6,000 people have prepared at least one page
- Hundreds of very active volunteers

DP Infrastructure, Continued
- Dedicated book buyers (up to $1/book at library sales etc.), but individual book donations are accepted
- Scanning & OCR are centralized (2 Fujitsu page-fed scanners; ABBYY FineReader)
- Most books go to plain text only
- In the near future, books will go to XML, with other formats (text, HTML) derived from XML

Beyond 10,000
- More automation
- More outreach to contemporary authors for copyrighted works
- More digitization of historical literature (pre-1923)
- Identification of “unknown” public domain works
- Better finding aids
- Cooperation with other projects

More: Beyond 10,000
- All eBooks in XML format
- Conversion on the fly to different formats:
  - HTML
  - Text; Unicode, etc.
  - Braille
  - PDF, eBook, etc.
- Auto-creation of custom CD/DVD ISO images

Copyright Procedures
- PG follows US laws.
- We are very diligent about copyright, since the penalties for copyright infringement are extreme
- We primarily work with public domain works, but also have procedures for accepting donations of works in copyright (currently about 2% of the collection)
- Rule 1: If the source book was published pre-1923, it’s public domain in the US
- Items from 1923-1989 published in the US without a copyright notice are public domain
- Items which match pre-1923 works are public domain

Lesser Used Copyright Procedures
- Items pre-1964 that were not renewed are public domain (this can be hard to prove)
- Items not currently available are exempted from copyright infringement under Title 17 Section 108(h). We’re starting to work with this rule
- Items published outside of the US from 1923 to the present follow the laws of that country (under GATT and the Berne Convention). But it’s tough to be expert in non-US copyright
- Core concept: Due Diligence. PG must demonstrate due diligence that copyright procedures are followed. We do this!

Why 1,000,000?
- If we could get 1 million eBooks to 1 million readers each, that would be 1,000,000,000,000 (1 trillion) eBooks given away. This is a modest goal: we’d really like to reach a far greater portion of the world’s population
- PG is on track to:
  - Continue to increase production per Moore’s law
  - Continue to digitize historical works
  - Obtain copyrighted works by donation
  - Distribute these eBooks freely through many mirrors, CD/DVD, etc.

What can Google do?
- Google now harvests all of the PG files from the ibiblio server. This offers full-text searching capability to the collection!
- Google has a catalog digitization project that performs similar tasks to our general eBook production process
- Google has topic-based navigation systems which are suitable for eBooks
- There are interesting and challenging issues involved in making our collection (currently over 16GB and 16K files) available and usable to the populace

More: What can Google do?
- Help readers to find eBooks
- Distribute eBooks
- Scan & OCR
- Support acquisition of books
- Support software development
- Support copyright research
- Help to make more stuff digital, because that’s what we’re all about!
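Supplement to "To 10,000 eBooks": the Moore's law claim can be checked against the milestone figures in the progress chart. The following is only a rough sketch, assuming steady exponential growth between two chosen milestones; the function names and the choice of the 3,500 and 7,000 milestones are illustrative assumptions, not part of the original talk.

```python
from math import log2

# (year, month, cumulative eBooks), transcribed from the progress chart
# in the "To 10,000 eBooks" slide.
MILESTONES = [
    (1990, 12, 10), (1993, 12, 100), (1996, 4, 500), (1997, 8, 1000),
    (1998, 10, 1500), (1999, 12, 2000), (2000, 8, 2500), (2000, 12, 3000),
    (2001, 5, 3500), (2001, 10, 4000), (2002, 2, 4500), (2002, 4, 5000),
    (2002, 7, 5500), (2002, 9, 6000), (2002, 12, 6500), (2003, 2, 7000),
]


def month_index(year, month):
    """Put dates on a single month axis so they can be subtracted."""
    return year * 12 + (month - 1)


def doubling_time_months(start, end):
    """Months per doubling, assuming exponential growth between two milestones."""
    elapsed = month_index(end[0], end[1]) - month_index(start[0], start[1])
    return elapsed / log2(end[2] / start[2])


def months_until_target(target, latest, doubling_months):
    """Months from the latest milestone until `target` at that doubling time."""
    return doubling_months * log2(target / latest[2])


if __name__ == "__main__":
    start = MILESTONES[8]    # 3,500 eBooks, 5/01 (illustrative choice)
    latest = MILESTONES[-1]  # 7,000 eBooks, 2/03
    doubling = doubling_time_months(start, latest)
    remaining = months_until_target(10_000, latest, doubling)
    print(f"Doubling time: {doubling:.1f} months")
    print(f"Months from 2/03 to #10,000: {remaining:.1f}")
```

With these two milestones the collection doubled in 21 months, which puts #10,000 roughly 10 to 11 months after February 2003, in the same ballpark as the "later in 2003" estimate. Picking a different pair of milestones changes the answer, so treat this as a sanity check rather than a forecast.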