Copyright and Mass-Digitization: The strategic importance of data-mining
Matthew Sag
Professor of Law
Loyola University of Chicago
matthewsag@gmail.com
www.matthewsag.com
Abbreviated Time Line
2004 Google library project begins
2005 Class action suit filed by Authors Guild (among others)
2008 & 2009
• Settlement proposed, objections follow, settlement revised
2011
• (March) Settlement rejected
• (September) Authors Guild v. HathiTrust filed
2012
• (August) Oral argument in Authors Guild v. HathiTrust
• (October) Judge Baer rules against the plaintiffs in Authors Guild v. HathiTrust: library digitization for disability access (ADA) and data mining is fair use.
2013
• (July) Second Cir. tells Judge Chin: no class certification without addressing the fair use issue
• (September) Oral argument on fair use in Authors Guild v. Google



The strategic importance of text-mining
• Different kinds of digitization program raise different legal issues and bring in different stakeholders.
The Many Faces of Library/Archive Digitization
• Preservation
• Data production and analysis*
– Searching books, testing search algorithms, computational linguistics, automated translation, natural language processing, macro-analysis of text
• A platform for display and distribution of individual works
• Disabled access*
• Scholarly access
• General access
Strategic Considerations
• Library digitization for data production and analysis
• Significant academic and commercial constituency (not just Google!)
• Strong normative appeal
• Obvious orphan works problem
• Justifies digitizing entire collections
• Even if some other uses are ‘too much’, no all-copyright owner class action possible
The Legal Argument #1
• Metadata (facts about the work) does not infringe the rights of the copyright owner.
– This is not usually contested, but it’s important to make sure everyone understands the reasons why metadata can’t infringe. Those reasons are …
• Idea-expression distinction
• Merger doctrine
• Metadata is not substantially similar to the underlying text
• Facts about the work don’t originate with the author
[Figure: word cloud of frequent words, apparently from Moby-Dick: whale(s), Ahab, old, man, boat(s), ship, sea, down, such, time, hand(s), long, head, stubb, men, Queequeg, Captain, never, good, go, might, Sperm, Starbuck, deck, water, day, far, eyes, cried, white, world, moby, crew, life, air, Sir, night, feet]
[Figure: “Whale v. Dinosaur” chart; axis scale from 0 to 1200]
Legal Argument #2
• A copying process that only produces metadata does not infringe.
• Intermediate non-expressive use is either (a) not copying in the relevant sense or (b) fair use.
• The distinction between expressive and nonexpressive parts of works is well recognized (no copyright in a phone book, etc.).
• The same distinction should be made in relation to potential acts of infringement.
• Intermediate non-expressive uses don’t communicate the author’s original expression to the public.
• No expressive substitution, no infringement.
Application to Fair Use
Sect. 107 Factors
(1) Purpose and character: Like transformative uses, a nonexpressive use poses no risk of expressive substitution.
(2) Nature of the work … “not much use”
(3) Amount and substantiality: Like transformative uses, because there is no expressive substitution in a nonexpressive use, the amount of copying is qualitatively insignificant.
(4) Market effect: Like transformative uses, a nonexpressive use poses no risk of expressive substitution, thus no cognizable market effect.
Legal Argument #3
• Non-expressive use does not harm copyright owners and has great social value
[Figure: “The United States is” versus “The United States are”, 1780–1900]
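A comparison like “The United States is” versus “The United States are” only needs counts of two phrases per time period, not the text itself. The following is a minimal sketch of that kind of count. The corpus, its sentences, and the function name `phrase_counts_by_decade` are all invented for illustration; it is not the method behind the chart shown in the talk.

```python
import re
from collections import Counter

# Toy corpus invented for illustration: (publication year, text) pairs.
# These sentences are placeholders, not quotations from real books.
TOY_CORPUS = [
    (1795, "The United States are resolved to remain at peace."),
    (1820, "The United States are a confederation. The United States are young."),
    (1875, "The United States is a nation."),
    (1890, "The United States is growing, and the United States is at peace."),
]

def phrase_counts_by_decade(corpus, phrase):
    """Count case-insensitive occurrences of `phrase`, grouped by decade."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for year, text in corpus:
        decade = (year // 10) * 10
        counts[decade] += len(pattern.findall(text))
    return counts

singular = phrase_counts_by_decade(TOY_CORPUS, "The United States is")
plural = phrase_counts_by_decade(TOY_CORPUS, "The United States are")
for decade in sorted(set(singular) | set(plural)):
    print(decade, "is:", singular[decade], "are:", plural[decade])
```

The output is a small table of numbers per decade: facts about the corpus, with none of the authors' expression reproduced.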
[Figure: American Slavery in American, English, and Irish Literature, 1800–1899. Source: Matthew Jockers, Macroanalysis: Digital Methods for Literary History (2013)]
Proportion of Irish literature with a topic of ‘slavery’ spikes ~1860–65
Importance of the Digital Humanities Brief
• Focused attention on digitization for the sake of data
• Demonstrated importance
• Disentangled it from other issues
– Not just a Google issue,
– Not just an internet issue,
– Not just a research/scholarship issue
• Powerful examples tied directly to the understanding of literature
» In case making the Internet work through caching and search was not enough for you!
Quotes from HathiTrust judgment …
• “I cannot imagine a definition of fair use that would not encompass the transformative uses made by Defendants' MDP and would require that I terminate this invaluable contribution to the progress of science and cultivation of the arts that at the same time effectuates the ideals espoused by the ADA.”
– “The search capabilities of the HDL have already given rise to new methods of academic inquiry such as text mining.” (brief cited)
– … metadata and text mining, which “could actually enhance the market for the underlying work, by causing researchers to revisit the original work and reexamine it in more detail” (brief quoted)
Impact of the Digital Humanities Amicus Brief
• Three for the price of one
– Authors Guild v. HathiTrust (district court)
– Authors Guild v. Google (district court)
– Authors Guild v. HathiTrust (court of appeals)
• Over 100 signatories!
• Discussed with approval in HathiTrust
• The “United States is/are” example made its way into the judgment in HathiTrust last year and into oral argument in Google Books this week!
Some Concluding Thoughts
• Specific legal issues vary by jurisdiction
– Fair use, fair dealing, legislative reform
• Underlying policy questions are global
– Idea-expression distinction
– The promise of big data and problem of orphan works
• Challenge for libraries and archives is making courts/decision makers understand the broader consequences
Action Items
• Commercial and non-commercial digitizers need to work together and defend everyone’s right to nonexpressive use
– Digital Humanities, Linguistics, Comp. Sci., Libraries
– Search providers, plagiarism and copyright infringement detection tools, music identification tools, reverse engineering
• Advantage of flexible limitations and exceptions
• Without reform, other nations cede ground to the U.S. as the data engine of the world.
Abbreviated Issues Summary
| Issue | Status | Case | Notes |
|---|---|---|---|
| Preservation | Still open, but court unconvinced | v. HathiTrust | |
| Orphan works display | Still open, not ripe | v. HathiTrust | Trove (Australia); best practices |
| Disability access | Digitization ok | v. HathiTrust | On appeal |
| Data mining | Digitization ok | v. HathiTrust | All but given up in v. Google |
| Library copies as quid pro quo | Still open | v. Google | Easier now underlying use is fair use |
| Making/retaining excessive copies | Still open | v. Google | |
| Snippet display | Still open | v. Google | |
| Standing, remedies, class action … | Mixed | v. HathiTrust; v. Google | |
Further Reading
Matthew Jockers, Matthew Sag & Jason Schultz, Digital Archives: Don’t Let Copyright Block Data Mining, 490 NATURE 29–30 (October 4, 2012)
Google agreed to share the advertising revenue from Google Books with authors and publishers, and to make one-off payments to copyright owners amounting to a minimum of US$125 million.
The settlement was strongly opposed by foreign governments, the US Department of Justice, the US Copyright Office, authors, academics and rival technology companies for various reasons. Many feared that it would create an unfair monopoly, with Google having the sole right to publish millions of ‘orphan’ works: books whose copyright owners cannot easily be located. In 2009, the settlement was revised to try to address these concerns. But the court rejected the revised settlement in 2011, and the legal controversy continues.
In September last year, in a separate case, the Authors Guild sued several universities for participating in Google’s book-scanning project. As part of this case, known as Authors Guild v. HathiTrust, it is also pursuing legal action against the HathiTrust Digital Library, a service that enables a large consortium of universities and research libraries to store, secure and search their digital collections using a shared infrastructure.
Among the issues at the heart of this dispute is what researchers in the emerging field of digital humanities will be allowed to analyse: only public-domain books (mostly those published before 1923 in the United States), or all known literary works. The answer may define the future of the field.
TO THE BARRICADES
On 3 August, the Association for Computers and the Humanities and a group of 64 scholars (that includes us), from disciplines ranging from law and computer science to linguistics, history and literature, filed an amicus curiae brief on behalf of the digital humanities. We are urging the court in Authors Guild v. Google to grant a summary judgment in favour of Google, a step that will effectively end the litigation¹. We filed a similar brief in the HathiTrust case on 7 July. The judge in the HathiTrust case is currently considering our submission, and a decision is expected imminently. The court in Authors Guild v. Google will consider our argument as soon as the appeals court deals with certain procedural issues.
We feel that if the Authors Guild wins the cases against Google and the HathiTrust, the ruling could set a dangerous precedent: that copyright gives authors and publishers the right to control all, even ‘non-expressive’, uses of their works that involve copying. Copyright law has long recognized the distinction between protecting an author’s original expression and the public’s right to access the facts and ideas contained within that expression. According to the US Constitution, the purpose of copyright is “To promote the Progress of Science and useful Arts”. Preventing authors from monopolizing facts and ideas allows others to explore their own creativity and ‘stand on the shoulders of giants’.
We believe that copyright law is not (and should not be) an obstacle to statistical and computational analysis of the millions of books owned by university libraries. We are not talking about republishing them or even quoting from them. We simply want to extract information from and about them to sift out trends and patterns.
As an example, clustering more than 3,000 nineteenth-century novels according to how much they share certain stylistic properties (specific words and punctuation marks) and thematic features (such as groups of commonly co-occurring words) has thrown up findings that would be hard to glean from reading a handful of books individually. One is that books authored by men tend to cluster quite distinctly from books authored by women (see ‘Knowing your subject’). This illustrates the degree to which gender determines the choices made by writers, but also flags up outliers. For instance, within this clustering, the works of George Eliot (real name Mary Anne Evans) sit firmly among those of male writers. In other words, such ‘macroanalytic’ methodology gives researchers a way to see individual authors and publications within the context of a much larger system.
[Figure: KNOWING YOUR SUBJECT. In a network of more than 3,000 nineteenth-century novels, arranged according to how much they share certain stylistic and thematic properties, books authored by men (blue) tend to cluster separately from those authored by women (white). George Eliot’s works (yellow) are an exception.]
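The idea of grouping novels by shared stylistic properties can be illustrated with a rough sketch. This is not the method used in the study described in this passage: the feature list, the cosine-similarity measure, the threshold, and the placeholder texts are all assumptions chosen to keep the example short. Each text is reduced to relative frequencies of a few words and punctuation marks, and two texts are linked when their frequency profiles are similar enough.

```python
import math
import re
from collections import Counter

# Hypothetical feature list: a few function words and punctuation marks.
FEATURES = ["the", "and", "of", "she", "he", ",", ";", "!"]

def feature_vector(text):
    """Relative frequency of each feature among all tokens in `text`."""
    tokens = re.findall(r"[a-z]+|[,;!]", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[f] / total for f in FEATURES]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_edges(texts, threshold):
    """Return pairs of titles whose feature vectors are at least `threshold` similar."""
    vectors = {title: feature_vector(text) for title, text in texts.items()}
    titles = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(titles)
        for b in titles[i + 1:]
        if cosine_similarity(vectors[a], vectors[b]) >= threshold
    ]

# Placeholder texts invented for illustration, not excerpts from real novels.
toy_texts = {
    "x": "the cat, and the dog",
    "y": "the cat, and the dog",
    "z": "she ran; she fell!",
}
edges = similarity_edges(toy_texts, threshold=0.9)
print(edges)
```

Only the numeric profiles are compared; the resulting edge list is what a network diagram of the corpus would be drawn from.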
Authors’ rights deserve protection. And governments and the various stakeholders involved may eventually work out how to achieve the full potential of digital libraries in a way that is fair to writers, readers and providers. But digitizing books for ‘non-expressive’ uses, such as basic searching and text mining, is a separate issue and should not be barred on the basis of concerns over copyright. An independent review last year of intellectual property and growth commissioned by the British government came to a similar conclusion². Unauthorized music-file sharing can infringe copyright because humans ultimately experience those files as musical works. Scanning words from library books to make a search index, or to compile a list of word frequencies, does not interfere with the rights of the author. These uses simply convert masses of text into metadata.
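To make the point about metadata concrete, here is a minimal sketch of the two outputs mentioned: a word-frequency list and a search index. The function names and the two-document "library" are hypothetical, and real systems are far more elaborate. What matters is that the outputs record which words occur, how often, and where, without reproducing the prose.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(tokenize(text))

def build_search_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical two-document "library"; the sentences are placeholders.
documents = {
    "book_a": "The whale swam far from the ship.",
    "book_b": "The ship sailed at night.",
}
frequencies = word_frequencies(documents["book_a"])
index = build_search_index(documents)
print(frequencies.most_common(3))
print(index["ship"])
```

A search for "ship" returns the ids of both documents, which tells a reader where to look but conveys nothing of how either author wrote.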
It is time for the US courts to recognize explicitly that, in the digital age, copying books for non-expressive purposes is not infringement. Courts have already applied this logic in analogous cases: Google, Microsoft and others copy web pages to feed into their Internet search engines; the online service Turnitin copies exam papers and other sources so that plagiarism can be detected. These practices have been challenged and found to be legal under copyright law.
It is crucial for future research that the right precedent be set. We hope that the judges decide that digitization for text mining and other forms of computational analysis is, unequivocally, fair use. ■
Matthew L. Jockers is assistant professor of English at the University of Nebraska, Lincoln, USA. Matthew Sag is associate professor of law at Loyola University, Chicago, Illinois, USA. Jason Schultz is assistant clinical professor of law at the University of California, Berkeley, USA.
e-mail: mjockers@unl.edu
1. Jockers, M. L., Sag, M. & Schultz, J. preprint at Social Science Research Network (2012); available at http://ssrn.com/abstract=2102542.
2. Hargreaves, I. Digital Opportunity: A Review of Intellectual Property and Growth (Intellectual Property Office, 2011).
30 | NATURE | VOL 490 | 4 OCTOBER 2012
© 2012 Macmillan Publishers Limited. All rights reserved
SOURCE: M. L. JOCKERS
Further reading
• Matthew Sag, Orphan Works as Grist for the Data Mill, 27 BERKELEY TECHNOLOGY LAW JOURNAL 1503–1550 (2012)
• Matthew Sag, Copyright and Copy-Reliant Technology, 103 NORTHWESTERN UNIVERSITY LAW REVIEW 1607–1682 (2009)