Dissertation 2.0: Remixing a Dissertation on American Literature as a Work of Digital Scholarship
Lisa Spiro
Rice University
December 2008

“The Google Books Exchange”
• Prompted by Paul Duguid’s 2007 article using Tristram Shandy to examine quality problems with Google Books, particularly scanning & metadata
• Patrick Leary’s reply: “Google Books is a tool for extensive research across a populous universe of corrupt texts, not a tool for intensive study of one typographically complex literary classic”
• Duguid: GB is “pushing quantity over quality”
• At issue:
  – How do we view Google Books: a research tool or a library? (Kevin Kelly)
  – How do we measure the credibility & usefulness of scholarly sources?
  – To what extent does GB enable new approaches to scholarship?

Bachelor 2.0: Digital Scholarship Project
• Bachelors of Arts: 2002 dissertation on bachelorhood and 19th C American culture
• Exploring digital scholarship by remixing my diss:
  – Use mainly digital sources
  – Experiment with tools for:
    • Analyzing texts
    • Visualizing texts
    • Organizing information
  – Explore non-traditional means of dissemination, e.g. video, collaborative wikis, etc.
  – Blog the process & share work openly

How many of my 296 original research sources are digitized & available in full text? (May 2008)

Type                    % Full Text    % Digitized
secondary monograph*    23.5%          98.3%
secondary periodical    93.1%          93.1%
primary monograph       75.8%          97%
primary periodical      88.6%          91.1%
archival                0%             0%
Total Primary           82.8%          91.9%
Total Secondary         37.2%          97.3%
Grand Total             59.1%          94.6%

* Approx. 75% of books are out-of-print but in copyright (?) & will thus be available through the GB settlement.
http://digitalscholarship.wordpress.com/2008/05/05/how-many-texts-have-been-digitized/

Researching the History of Reveries of a Bachelor Using Google Books
• Reveries = collection of sentimental essays by Donald Grant Mitchell in which the narrator, Ik Marvel, imagines what marriage is like
• First published in 1850, one of the biggest US bestsellers of the 19th century. Sold into the 20th C.
• Beloved by readers including Emily Dickinson, who wrote enthusiastic letters, made annotations, etc.
• Why was this book so popular? Did responses to it change over time? What evidence could I find in GB?
• Searched GB for “reveries of a bachelor”: examined over 300 results, tagged them (still sifting through 370 results for “Ike Marvel,” 337 for “Ik Marvel”)

What I Found: Publishing History of Reveries
• Evidence of the different choices consumers had at different price points (8 cents to $2.50): binding, paper quality, etc.
• The intense competition Scribner’s faced after its copyright expired, and how it responded (ads asserting copyright over sections, new cheap edition, etc.)
http://digitalscholarship.wordpress.com/2008/12/19/using-google-books-to-research-publishing-history/

History of Reading & Reception of Reveries
• Secondary studies of 19th-century reading suggesting that men gave Reveries to women they were wooing
• Passages in memoirs suggesting that Reveries was read (by men) to induce particular moods: melancholy, emotional relief
• Reveries was embraced by educational authorities--included on Regents Exam, in anthologies & readers, etc.
• Included in many library catalogs
• Reveries was performed as well as read in private: included in guidebooks to recitation and staging tableaux
• Reviews from 1850 to 1908, many of which associated the book with “youthfulness,” a time past
http://digitalscholarship.wordpress.com/2008/12/24/studying-the-history-of-reading-using-google-books-and-other-sources/

Textual Analysis of Different Versions of Reveries
• Open Content Alliance (OCA) better source for bibliographic analysis: full-color images, downloadable
• Downloaded 1850, 1863, 1883, 1893, and 1907 editions of Reveries from OCA (1883 was 1st in GB, 1893 is an unauthorized edition)
• Used Juxta to collate different editions
  – View two texts side by side
  – Search for keywords in context
  – Automatically create critical apparatus
• OCR quality not sufficient to produce an authoritative critical edition, but Juxta can be used as an analytical tool to detect errors & variants
[Figure labels: GB, OCA, Juxta]

Tracking Literary Influence
• Found reviews & ads comparing new books to Reveries, e.g. In Maiden Meditation, Reveries of a Bachelor Girl, Reveries of an Undertaker, & The Reflections of a Married Man
• GB’s “Popular Passages” includes top 10 passages in the book that appear most frequently in other books
• Use computational methods to examine “double-helix” (McGann) of “literary DNA”--production & reception
http://digitalscholarship.wordpress.com/2007/12/08/literary-dna-and-google-books/

Impact of Using Google Books
• Discovered many sources I probably would not have found otherwise, yet most significant research remains archival work at Yale’s Beinecke Library
• Filled in details rather than changed previous view of Reveries: enlarging the sample rather than achieving completeness
• Main difference is the methods used rather than the conclusions reached: searching, tagging, manipulating
• Yet much of my work was manual. I glimpse more profound possibilities, a sort of digital research assistant:
  – Extract prices of Reveries automatically
  – Visualize reader responses, play with variables such as time, gender, position (“ordinary reader,” reviewer), etc.
  – “Literary DNA”: what resembles this book? Why?
• We would need to make how these tools work transparent: what are we seeing & not seeing?

Problems with Google Books

Poor Scans
• Occasionally you’ll find a page that is skewed, distorted, or includes fingers
• Now you can easily report poorly scanned pages
• But what happens to those reports?

Optical Character Recognition (OCR) Is Not Perfect
• OCR errors for Ik Marvel, Reveries of a Bachelor, a Book of the Heart:
  – Heveries of a Bachelor (10 hits in GB: 4 found in Ik, Ike, or Reveries searches, 4 not otherwise found, 2 found in different editions)
  – REVERIES OF A BACHELOR; or, a Rook of the Heart
  – REVERIES OF A BACHELOR; or, a Bonk of the Heart.
  – Reveries of a Bad elor.
  – REVERIES OF A BACHELOR, a Boob of the Heart. By IK. MAETEL

Poor Metadata
• Wrong date is sometimes given, particularly with journals (actual date in example above is 1857)
• Author is occasionally conflated with editor or publisher
• Publication place typically not captured
• No linkages between volumes in multivolume works

Search & Retrieval Are Mystifying
• A title search for “reveries of a bachelor” yields 22 results; “more editions” yields 125
• One result screen says: “101 - 150 of 690,” but then the very next one says “Books 151-159 of 159,” limiting number of results

Rights Issues
• Works that are in the public domain aren’t always fully available
• Even with public domain materials, you can only download PDF, vs. variety of formats in OCA
[Figure labels: GB, OCA, OCA HTTP Download]

Reflections on Quality
• How good is good enough? Are the tradeoffs (quality vs quantity) worth it?
• It’s important for researchers to be aware of Google’s limitations, but also to set their own tolerance for errors, depending on what they are trying to accomplish.
• OCR errors may not be serious impediments to findability, but they would lead to inaccurate word counts / textual analysis
• Researchers can work out methods for dealing with “good enough” texts (e.g. Stanford Beyond Search group)

Tips and Tricks for Working with Google Books

Tip 1: Be a Resourceful Searcher
• Use advanced search
  – Restrict by date, title, author, subject, etc.
• Select “more editions” to see other versions of the texts
• Search for unique phrases / names within a book (“Ike Marvel”)
• If you want to do a quick search of both Google Books and OCA, try PublicDomainReprints.org
• Don’t limit yourself to Google--also search Open Content Alliance, thematic digital research collections, etc.

Tip 2: Capture & Organize Your Stuff with a Bibliographic Tool (e.g. Zotero)
• Capture search results using a bibliographic tool such as Zotero
  – Automatically grab bibliographic info (you may need to add URL, publisher & publication place manually)
  – Copy chunks of text into notes field
  – Add tags as you go, then sort based on those tags
  – Visualize your sources on a timeline

Tip 3: Create a Visual Scrapbook Using Google Notebook

Tip 4: Collect & Share Items via “My Library”
• Collect Google books into “My Library”
• Search within that collection
• Share it with others
• Using the Google Book Search Gadget, get recommendations for similar works

Wish List for Google Books: Improved Search & Discovery
• Rich, accurate catalog information (authoritative dates and author names, etc.)
  – Include WorldCat info (e.g. subject headings)
• Ability to do collaborative work--co-searching, search tags & annotations, work together on complex projects
• Browse to find similar works (library shelf)
• Ability to sort search results by date, title, relevance weighting, etc.
• Different search interfaces: faceted, timeline, geographic, visual, etc.
• Better OCR (perhaps by combining different versions of same text)

Wish List for Google Books: More Openness & Flexibility in Use
• More transparency about how GB works
• For public domain works, easily download plain text, images, etc.
• Extract information and remix it, e.g.
  – Image gallery
  – Create anthology of bachelor literature
• Google Books on mobile device
• As Dan Cohen suggests, we need an open API to enable text mining, visualization, etc.
  – Test theories across a wider array of texts (beyond what one could reasonably read)

Wishes Granted? GB & Non-Consumptive Research
• Google Books settlement allows for “non-consumptive research”: “research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book”
• Includes:
  – (a) Image Analysis and Text Extraction
  – (b) Textual Analysis and Information Extraction
  – (c) Linguistic Analysis
  – (d) Automated Translation
  – (e) Indexing and Search
• Seems to be focused more on tech development than literary research; could others use the tools developed?
Google Books Settlement

The Digital Library?
• If you were establishing a brand new library today, how would you do it? What percentage of resources would be digital?
• What would be required for:
  – Scholarly trust
  – Usability
  – Preservation & long-term access
  – Technical infrastructure
• What would be the impact on scholarship?
• I’m working with Geneva Henry on a study sponsored by the Council on Library and Information Resources (CLIR) to investigate the feasibility of an all-digital research library

Resources
• Lisa Spiro: lspiro@rice.edu
• Digital Scholarship in the Humanities blog: http://digitalscholarship.wordpress.com/
• Collection of bookmarks on Google Books: http://www.diigo.com/user/lspiro/googlebooks
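
A note on the OCR slide: variants such as “Heveries of a Bachelor” and “Reveries of a Bad elor.” escape an exact title search, so one way a “digital research assistant” might catch them is approximate string matching. The following is only a minimal sketch in Python using the standard library’s difflib; it is not how Google Books search works. The function names and the 0.85 threshold are assumptions chosen for illustration, and a real workflow would need to tune the threshold against hand-checked results.

```python
import difflib
import re


def letters_only(text: str) -> str:
    """Lowercase the text and drop everything except a-z, so stray spaces
    and punctuation introduced by OCR do not affect the comparison."""
    return re.sub(r"[^a-z]", "", text.lower())


def similarity(candidate: str, target: str) -> float:
    """Return a similarity score between 0 and 1 for two titles."""
    matcher = difflib.SequenceMatcher(
        None, letters_only(candidate), letters_only(target)
    )
    return matcher.ratio()


def likely_variants(candidates, target, threshold=0.85):
    """Keep candidates scoring at or above the threshold.
    The 0.85 default is an illustrative assumption, not a tested value."""
    return [c for c in candidates if similarity(c, target) >= threshold]


# Two OCR variants from the slides plus one genuinely different title.
hits = [
    "Heveries of a Bachelor",
    "Reveries of a Bad elor.",
    "Reveries of an Undertaker",
]

for title in likely_variants(hits, "Reveries of a Bachelor"):
    print(title)
```

With these inputs the two OCR variants pass the threshold and “Reveries of an Undertaker,” a different book, does not. The sketch compares short title strings only; errors in the subtitle (“a Rook of the Heart”) or the author name would need the same treatment applied to those fields.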