Dissertation 2.0:
Remixing a Dissertation on
American Literature as a Work
of Digital Scholarship
Lisa Spiro
Rice University
December 2008
“The Google Books Exchange”
• Prompted by Paul Duguid’s 2007 article using Tristram Shandy
to examine quality problems with Google Books, particularly
scanning & metadata
• Patrick Leary’s reply: “Google Books is a tool for extensive
research across a populous universe of corrupt texts, not a tool
for intensive study of one typographically complex literary
classic”
• Duguid: GB is “pushing quantity over quality”
• At issue:
– How do we view Google Books: a research tool or a library? (Kevin
Kelly)
– How do we measure the credibility & usefulness of scholarly
sources?
– To what extent does GB enable new approaches to scholarship?
Bachelor 2.0: Digital Scholarship
Project
• Bachelors of Arts: 2002 dissertation on
bachelorhood and 19th C American culture
• Exploring digital scholarship by remixing my diss:
– Use mainly digital sources
– Experiment with tools for:
• Analyzing texts
• Visualizing texts
• Organizing information
– Explore non-traditional means of dissemination, e.g.
video, collaborative wikis, etc.
– Blog the process & share work openly
How many of my 296 original
research sources are digitized &
available in full text? (May 2008)
Type                   % Full Text   % Digitized
secondary monograph*   23.5%         98.3%
secondary periodical   93.1%         93.1%
primary monograph      75.8%         97%
primary periodical     88.6%         91.1%
archival               0%            0%
Total Primary          82.8%         91.9%
Total Secondary        37.2%         97.3%
Grand Total            59.1%         94.6%
* Approx. 75% of books are out-of-print but in copyright (?) & will thus be
available through the GB settlement.
http://digitalscholarship.wordpress.com/2008/05/05/how-many-texts-have-been-digitized/
Researching the History of Reveries
of a Bachelor Using Google Books
• Reveries = collection of sentimental essays by Donald
Grant Mitchell in which the narrator, Ik Marvel, imagines
what marriage is like
• First published in 1850, one of the biggest US
bestsellers of the 19th century. Sold into the 20th C.
• Beloved by readers including Emily Dickinson, who
wrote enthusiastic letters, made annotations, etc.
• Why was this book so popular? Did responses to it
change over time? What evidence could I find in GB?
• Searched GB for “reveries of a bachelor”: examined over
300 results, tagged them (still sifting through 370 results
for “Ike Marvel,” 337 for “Ik Marvel”)
What I Found: Publishing History of
Reveries
• Evidence of the different choices consumers
had at different price points (8 cents to $2.50):
binding, paper quality, etc
• The intense competition Scribner’s faced after
its copyright expired, and how it responded
(ads asserting copyright over sections, new
cheap edition, etc)
http://digitalscholarship.wordpress.com/2008/12/19/using-google-books-to-research-publishing-history/
History of Reading & Reception of
Reveries
• Secondary studies of 19th-century reading suggesting that men gave
Reveries to women they were wooing
• Passages in memoirs suggesting that Reveries was read (by
men) to induce particular moods: melancholy, emotional relief
• Reveries was embraced by educational authorities--included
on Regents Exam, in anthologies & readers, etc.
• Included in many library catalogs
• Reveries was performed as well as read in private: included
in guidebooks to recitation & staging tableaux
• Reviews from 1850 to 1908, many of which associated the book
with “youthfulness,” a time past
http://digitalscholarship.wordpress.com/2008/12/24/studying-the-history-of-reading-using-google-books-and-other-sources/
Textual Analysis of Different
Versions of Reveries
• Open Content Alliance (OCA) better source for bibliographic
analysis: full-color images, downloadable
• Downloaded 1850, 1863, 1883, 1893, and 1907 editions of
Reveries from OCA (1883 was 1st in GB, 1893 is
unauthorized edition)
• Used Juxta to collate different editions
– View two texts side by side
– Search for keywords in context
– Automatically create critical apparatus
• OCR quality not sufficient to produce authoritative critical
edition, but Juxta can be used as an analytical tool to detect
errors & variants
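As a rough illustration of the kind of comparison a collation tool performs (this is not Juxta's own algorithm), the Python sketch below uses the standard difflib module to list word-level differences between two versions of a passage. The function name word_variants and the sample sentences are invented for illustration; they are not quoted from any edition of Reveries.

```python
import difflib

def word_variants(text_a, text_b):
    """Return (tag, words_a, words_b) for each word-level difference."""
    a, b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants

# Hypothetical sample text: the second version contains an OCR-style error.
base = "a quiet evening by the fire"
witness = "a quiet evening by the flre"
for tag, old, new in word_variants(base, witness):
    print(tag, repr(old), "->", repr(new))
```

A listing like this cannot tell an OCR error from a genuine variant; it only points to places worth checking against the page images.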
Tracking Literary Influence
• Found reviews & ads comparing new books to
Reveries, e.g. In Maiden Meditation, Reveries of a
Bachelor Girl, Reveries of an Undertaker, & The
Reflections of a Married Man
• GB’s “Popular Passages” includes top 10 passages in
the book that appear most frequently in other books
• Use computational methods to examine “double-helix”
(McGann) of “literary DNA”--production & reception
http://digitalscholarship.wordpress.com/2007/12/08/literary-dna-and-google-books/
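One simple way to look for passages shared between books is to compare overlapping runs of words. The sketch below is only an assumption about how such a comparison might be done, not a description of how GB's “Popular Passages” works; the function names shingles and shared_passages and the sample sentences are hypothetical.

```python
import re

def shingles(text, n=5):
    """Return the set of n-word sequences (lowercased) in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_passages(source, other, n=5):
    """Return the n-word sequences that occur in both texts, sorted."""
    return sorted(shingles(source, n) & shingles(other, n))

# Hypothetical snippets for illustration only.
source = "The quick brown fox jumps over the lazy dog."
later = "She wrote that the quick brown fox jumps again."
print(shared_passages(source, later, n=4))
```

Exact matching like this would miss quotations damaged by OCR errors, so a real study would probably need some tolerance for near matches.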
Impact of Using Google Books
• Discovered many sources I probably would not have found
otherwise, yet most significant research remains archival work
at Yale’s Beinecke Library
• Filled in details rather than changed previous view of Reveries:
enlarging the sample rather than achieving completeness
• Main difference is the methods used rather than the
conclusions reached: searching, tagging, manipulating
• Yet much of my work was manual. I glimpse more profound
possibilities, a sort of digital research assistant:
– Extract prices of Reveries automatically
– Visualize reader responses, play with variables such as time,
gender, position (“ordinary reader,” reviewer), etc.
– “Literary DNA”: what resembles this book? Why?
• We would need transparency about how these tools work:
what are we seeing & not seeing?
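A first step toward extracting prices automatically could be a pattern match over the OCR text of advertisements. The sketch below is a minimal example under stated assumptions: prices appear as dollar amounts (“$2.50”) or as cent amounts (“8 cents”, “25 cts”), and the advertisement text is made up for illustration. The names PRICE_RE and extract_prices are hypothetical.

```python
import re

# Matches dollar amounts such as "$2.50" or "$1",
# and cent amounts such as "8 cents" or "25 cts".
PRICE_RE = re.compile(
    r"\$\s?(\d+)(?:\.(\d{2}))?|(\d+)\s?(?:cents?|cts?)\b",
    re.IGNORECASE,
)

def extract_prices(text):
    """Return the prices found in text, in cents, in order of appearance."""
    prices = []
    for match in PRICE_RE.finditer(text):
        dollars, fraction, cents = match.groups()
        if cents is not None:
            prices.append(int(cents))
        else:
            prices.append(int(dollars) * 100 + int(fraction or 0))
    return prices

# Hypothetical advertisement text, not taken from a real catalog.
ad = "Cloth, gilt top, $2.50; paper covers, 25 cents; pocket edition, 8 cts."
print(extract_prices(ad))
```

OCR errors in digits or currency signs would still have to be checked by hand, which is one reason transparency about such tools matters.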
Problems with Google Books
Poor Scans
• Occasionally you’ll find a page that is skewed,
distorted, or includes fingers
Now you can easily report poorly scanned pages
But what happens to those reports?
Optical Character Recognition
(OCR) Is Not Perfect
• OCR errors for Ik Marvel, Reveries of a Bachelor, a
Book of the Heart
• Heveries of a Bachelor (10 hits in GB: 4 found in Ik, Ike, or
Reveries searches, 4 not otherwise found, 2 found in different
editions)
• REVERIES OF A BACHELOR; or, a Rook of the Heart
• REVERIES OF A BACHELOR; or, a Bonk of the Heart.
• Reveries of a Bad elor.
• REVERIES OF A BACHELOR, a Boob of the Heart. By IK.
MAETEL
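Variants like these can sometimes be caught with approximate string matching. The sketch below uses Python's standard difflib module to score candidate titles against the expected title; the 0.85 threshold and the names TARGET, similarity, and likely_variants are assumptions chosen for illustration, and nothing here reflects how Google Books search actually works.

```python
import difflib

TARGET = "reveries of a bachelor"

def similarity(candidate, target=TARGET):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    matcher = difflib.SequenceMatcher(
        None, candidate.lower(), target.lower(), autojunk=False
    )
    return matcher.ratio()

def likely_variants(candidates, threshold=0.85):
    """Keep the candidates whose similarity to the target meets the threshold."""
    return [c for c in candidates if similarity(c) >= threshold]

# Two OCR variants from the list above, plus an unrelated title.
titles = ["Heveries of a Bachelor", "Reveries of a Bad elor", "Dream Life"]
print(likely_variants(titles))
```

A lower threshold catches more damaged titles but also lets in more false matches, so the cutoff would need tuning against a hand-checked sample.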
Poor Metadata
• Wrong date is sometimes given, particularly
with journals (actual date in example above
is 1857)
• Author is occasionally conflated with editor
or publisher
• Publication place typically not captured
• No linkages between volumes in multivolume works
Search & Retrieval Are
Mystifying
A title search for “reveries of a bachelor” yields 22
results; “more editions” yields 125
One result screen says: “101 - 150 of 690,” but
then the very next one says “Books 151-159 of
159,” limiting number of results
Rights Issues
• Works that are in the public domain aren’t always
fully available
• Even with public domain materials, you can only
download PDF, vs. variety of formats in OCA
[Screenshots: GB; OCA; OCA HTTP Download]
Reflections on Quality
• How good is good enough? Are the tradeoffs (quality
vs quantity) worth it?
• It’s important for researchers to be aware of Google’s
limitations, but also to set their own tolerance for
errors, depending on what they are trying to
accomplish.
• OCR errors may not be serious impediments to
findability, but they would lead to inaccurate word
counts/ textual analysis
• Researchers can work out methods for dealing with
“good enough” texts (e.g. Stanford Beyond Search
group)
Tips and Tricks for Working with
Google Books
Tip 1: Be a Resourceful Searcher
• Use advanced search
– Restrict by date, title, author, subject, etc.
• Select “more editions” to see other versions of the texts
• Search for unique phrases/ names within a book (“Ike
Marvel”)
• If you want to do a quick search of both Google Books and
OCA, try PublicDomainReprints.org
• Don’t limit yourself to Google--also search Open Content
Alliance, thematic digital research collections, etc
Tip 2: Capture & Organize Your Stuff
with a Bibliographic Tool (e.g. Zotero)
• Capture search results using a bibliographic tool
such as Zotero
– Automatically grab bibliographic info (you may need to
add URL, publisher & publication place manually)
– Copy chunks of text into notes field
– Add tags as you go, then sort based on those tags
– Visualize your sources on a timeline
Tip 3: Create a Visual Scrapbook
Using Google Notebook
Tip 4: Collect & Share Items via “My
Library”
• Collect Google books into “My Library”
• Search within that collection
• Share it with others
• Using the Google Book Search Gadget, get
recommendations for similar works
Wish List for Google Books:
Improved Search & Discovery
• Rich, accurate catalog information (authoritative dates
and author names, etc.)
– Include WorldCat info (e.g. subject headings)
• Ability to do collaborative work--co-searching, search
tags & annotations, work together on complex projects
• Browse to find similar works (library shelf)
• Ability to sort search results by date, title, relevance
weighting, etc
• Different search interfaces: faceted, timeline,
geographic, visual, etc
• Better OCR (perhaps by combining different versions of
same text)
Wish List for Google Books: More
Openness & Flexibility in Use
• More transparency about how GB works
• For public domain works, easily download plain
text, images, etc.
• Extract information and remix it, e.g.
– Image gallery
– Create anthology of bachelor literature
• Google Books on mobile device
• As Dan Cohen suggests, we need an open API to
enable text mining, visualization, etc.
– Test theories across a wider array of texts
(beyond what one could reasonably read)
Wishes Granted? GB & Non-Consumptive Research
• Google Books settlement allows for “non-consumptive
research”: “research in which computational analysis is
performed on one or more Books, but not research in which a
researcher reads or displays substantial portions of a Book to
understand the intellectual content presented within the
Book”
• Includes:
– (a) Image Analysis and Text Extraction
– (b) Textual Analysis and Information Extraction
– (c) Linguistic Analysis
– (d) Automated Translation
– (e) Indexing and Search
• Seems to be focused more on tech development than on literary
research. Could others use the tools developed?
Google Books Settlement
The Digital Library?
• If you were establishing a brand new library today, how would
you do it? What percentage of resources would be digital?
• What would be required for:
– Scholarly trust
– Usability
– Preservation & long-term access
– Technical infrastructure
• What would be the impact on scholarship?
• I’m working with Geneva Henry on a study sponsored by the
Council on Library and Information Resources (CLIR) to
investigate the feasibility of an all-digital research library
Resources
• Lisa Spiro: lspiro@rice.edu
• Digital Scholarship in the Humanities blog:
http://digitalscholarship.wordpress.com/
• Collection of bookmarks on Google Books:
http://www.diigo.com/user/lspiro/googlebooks