Copyright and Mass-Digitization: The strategic importance of data-mining
Matthew Sag
Professor of Law, Loyola University of Chicago
matthewsag@gmail.com
www.matthewsag.com

Abbreviated Time Line
2004
  Google library project begins
2005
  Class action suit filed by Authors Guild (among others)
2008 & 2009
  Settlement proposed, objections follow, settlement revised
2011
  (March) Settlement rejected
  (September) Authors Guild v. HathiTrust filed
2012
  (August) Oral argument in Authors Guild v. HathiTrust
  (October) Judge Baer ruled against the plaintiffs in Authors Guild v. HathiTrust. Library digitization (ADA + Data) is fair use.
2013
  (July) Second Cir. tells Judge Chin, no class certification without addressing the fair use issue
  (September) Oral argument on fair use in Authors Guild v. Google

The strategic importance of text-mining
Different kinds of digitization program raise different legal issues and bring in different stakeholders.

The Many Faces of Library/Archive Digitization
- Preservation
- Data production and analysis*
  - Searching books, testing search algorithms, computational linguistics, automated translation, natural language processing, macro-analysis of text
- A platform for display and distribution of individual works
  - Disabled access*
  - Scholarly access
  - General access

Strategic Considerations
Library digitization for data production and analysis
- Significant academic and commercial constituency (not just Google!)
- Strong normative appeal
- Obvious orphan works problem
- Justifies digitizing entire collections
- Even if some other uses are ‘too much’, no all-copyright owner class action possible

The Legal Argument #1
Metadata (facts about the work) does not infringe the rights of the copyright owner.
- This is not usually contested, but it’s important to make sure everyone understands the reasons why metadata can’t infringe.
- Those reasons are …
  - Idea-expression distinction
  - Merger doctrine
  - Metadata is not substantially similar to the underlying text
  - Facts about the work don’t originate with the author

[Figure: word cloud. Words shown: whale(s), Ahab, old, man, boat(s), ship, sea, down, such, time, hand(s), long, head, stubb, men, Queequeg, Captain, never, good, go, might, Sperm, Starbuck, deck, water, day, far, eyes, cried, white, world, moby, crew, life, air, Sir, night, feet]

[Figure: chart titled "Whale v. Dinosaur"; vertical axis from 0 to 1200]

Legal Argument #2
A copying process that only produces metadata does not infringe.
- Intermediate non-expressive use is either (a) not copying in the relevant sense or (b) fair use
- The distinction between expressive and nonexpressive parts of works is well recognized (no copyright in a phone book, etc.). The same distinction should be made in relation to potential acts of infringement.
- Intermediate non-expressive uses don’t communicate the author’s original expression to the public.
- No expressive substitution, no infringement

Application to Fair Use
Sect. 107 Factors
(1) Purpose and character: Like transformative uses, a nonexpressive use poses no risk of expressive substitution.
(2) Nature of the work … “not much use”
(3) Amount and substantiality: Like transformative uses, because there is no expressive substitution in a nonexpressive use, the amount of copying is qualitatively insignificant.
(4) Market effect: Like transformative uses, a nonexpressive use poses no risk of expressive substitution, thus no cognizable market effect.

Legal Argument #3
Non-expressive use does not harm copyright owners and has great social value
- “The United States is” versus “The United States are”, 1780–1900
- American Slavery in American, English, and Irish Literature, 1800-1899.
Matthew Jockers, Macroanalysis: Digital Methods for Literary History (2013)
Proportion of Irish Literature with a topic of ‘slavery’ spikes ~ 1860-65

Importance of the Digital Humanities Brief
- Focused attention on digitization for the sake of data
- Demonstrated importance
- Disentangled it from other issues
  - Not just a Google issue, not just an internet issue, not just a research/scholarship issue
- Powerful examples tied directly to the understanding of literature
  » In case making the Internet work through caching and search was not enough for you!

Quotes from HathiTrust judgment
- … "I cannot imagine a definition of fair use that would not encompass the transformative uses made by Defendants' MDP and would require that I terminate this invaluable contribution to the progress of science and cultivation of the arts that at the same time effectuates the ideals espoused by the ADA."
- “The search capabilities of the HDL have already given rise to new methods of academic inquiry such as text mining.” (brief cited)
- … metadata and text mining, which "could actually enhance the market for the underlying work, by causing researchers to revisit the original work and reexamine it in more detail” (brief quoted)

Impact of the Digital Humanities Amicus Brief
- Three for the price of one
  - Authors Guild v. HathiTrust (district court)
  - Authors Guild v. Google (district court)
  - Authors Guild v. HathiTrust (court of appeals)
- Over 100 signatories!
- Discussed with approval in HathiTrust
- The United States is/are example made its way into the judgment in HathiTrust last year and into oral argument in Google Books this week!
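
The “United States is” versus “United States are” comparison is the kind of output that a purely nonexpressive process produces: counts per year, not readable text. The following is a minimal sketch of that idea in Python, not the method used in the brief or in any cited study. The three-entry corpus, the function name count_phrase_by_year, and the years are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (publication year, text) pairs invented for illustration.
CORPUS = [
    (1790, "The United States are bound by treaty. The United States are at peace."),
    (1850, "The United States are divided, yet the United States is one nation."),
    (1890, "The United States is a great power. The United States is growing."),
]


def count_phrase_by_year(corpus, phrase):
    """Count case-insensitive occurrences of `phrase` for each year in `corpus`."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for year, text in corpus:
        counts[year] += len(pattern.findall(text))
    return dict(counts)


is_counts = count_phrase_by_year(CORPUS, "United States is")
are_counts = count_phrase_by_year(CORPUS, "United States are")

# The only thing that leaves the process is a table of numbers (metadata).
for year in sorted(is_counts):
    print(year, "is:", is_counts[year], "are:", are_counts[year])
```

The texts are read in full, but nothing of the authors' expression is passed on: the result is a small table of counts by year.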
Some Concluding Thoughts
- Specific legal issues vary by jurisdiction
  - fair use, fair dealing, legislative reform
- Underlying policy questions are global
  - Idea-expression distinction
  - The promise of big data and problem of orphan works
- Challenge for libraries and archives is making courts/decision makers understand the broader consequences

Action Items
- Commercial and non-commercial digitizers need to work together and defend everyone’s right to nonexpressive use
  - Digital Humanities, Linguistics, Comp. Sci., Libraries
  - Search providers, plagiarism and copyright infringement detection tools, music identification tools, reverse engineering
- Advantage of flexible limitations and exceptions
  - Without reform, other nations cede ground to the U.S. as the data engine of the world.

Abbreviated Issues Summary
Issue | Status | Case | Notes
Preservation | Still open, but court unconvinced | v. HathiTrust |
Orphan works display | Still open, not ripe | v. HathiTrust | Trove (Australia); Best practices
Disability access | Digitization ok | v. HathiTrust | On appeal
Data mining | Digitization ok | v. HathiTrust | All but given up in v. Google
Library copies as quid pro quo | Still open | v. Google | Easier now underlying use is fair use
Making/retaining excessive copies | Still open | v. Google |
Snippet display | Still open | v. Google |
Standing, remedies, class action … | Mixed | v. HathiTrust; v. Google |

Further Reading
Matthew Jockers, Matthew Sag & Jason Schultz, Digital Archives: Don’t Let Copyright Block Data Mining, 490 NATURE 29-30 (October 4, 2012)

Google agreed to share the advertising revenue from Google Books with authors and publishers, and to make one-off payments to copyright owners amounting to a minimum of US$125 million. The settlement was strongly opposed by foreign governments, the US Department of Justice, the US Copyright Office, authors, academics and rival technology companies for various reasons.
Many feared that it would create an unfair monopoly, with Google having the sole right to publish millions of ‘orphan’ works: books whose copyright owners cannot easily be located. In 2009, the settlement was revised to try to address these concerns. But the court rejected the revised settlement in 2011, and the legal controversy continues.

In September last year, in a separate case, the Authors Guild sued several universities for participating in Google’s book-scanning project. As part of this case, known as Authors Guild v. HathiTrust, it is also pursuing legal action against the HathiTrust Digital Library, a service that enables a large consortium of universities and research libraries to store, secure and search their digital collections using a shared infrastructure.

Among the issues at the heart of this dispute is what researchers in the emerging field of digital humanities will be allowed to analyse: only public-domain books (mostly those published before 1923 in the United States), or all known literary works. The answer may define the future of the field.

TO THE BARRICADES
On 3 August, the Association for Computers and the Humanities and a group of 64 scholars (that includes us), from disciplines ranging from law and computer science to linguistics, history and literature, filed an amicus curiae brief on behalf of the digital humanities. We are urging the court in Authors Guild v. Google to grant a summary judgment in favour of Google, a step that will effectively end the litigation [1]. We filed a similar brief in the HathiTrust case on 7 July. The judge in the HathiTrust case is currently considering our submission, and a decision is expected imminently. The court in Authors Guild v. Google will consider our argument as soon as the appeals court deals with certain procedural issues.

We feel that if the Authors Guild wins the cases against Google and the HathiTrust, the ruling could set a dangerous precedent: that copyright gives authors and publishers the right to control all, even ‘non-expressive’, uses of their works that involve copying.
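
To make the idea of a ‘non-expressive’ use concrete, here is a minimal Python sketch of a process that copies a text into memory and emits only a short list of word frequencies. It is an illustration under stated assumptions, not the method of any project named in this article: the function name word_frequencies and the sample sentence are invented for the example.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the `top_n` most common lowercase words in `text` with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# Hypothetical sample text, invented for illustration only.
sample = "the whale the sea the ship and the whale"
freqs = word_frequencies(sample, top_n=2)
print(freqs)
```

The output is a list of (word, count) pairs. The original sentence cannot be read back from it, which is the sense in which such a use does not communicate the author's expression.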
Copyright law has long recognized the distinction between protecting an author’s original expression and the public’s right to access the facts and ideas contained within that expression. According to the US Constitution, the purpose of copyright is “To promote the Progress of Science and useful Arts”. Preventing authors from monopolizing facts and ideas allows others to explore their own creativity and ‘stand on the shoulders of giants’.

We believe that copyright law is not (and should not be) an obstacle to statistical and computational analysis of the millions of books owned by university libraries. We are not talking about republishing them or even quoting from them. We simply want to extract information from and about them to sift out trends and patterns.

As an example, clustering more than 3,000 nineteenth-century novels according to how much they share certain stylistic properties (specific words and punctuation marks) and thematic features (such as groups of commonly co-occurring words) has thrown up findings that would be hard to glean from reading a handful of books individually. One is that books authored by men tend to cluster quite distinctly from books authored by women (see ‘Knowing your subject’). This illustrates the degree to which gender determines the choices made by writers, but also flags up outliers. For instance, within this clustering, the works of George Eliot (real name Mary Anne Evans) sit firmly among those of male writers. In other words, such ‘macroanalytic’ methodology gives researchers a way to see individual authors and publications within the context of a much larger system.

[Figure: KNOWING YOUR SUBJECT. In a network of more than 3,000 nineteenth-century novels, arranged according to how much they share certain stylistic and thematic properties, books authored by men (blue) tend to cluster separately from those authored by women (white). George Eliot’s works (yellow) are an exception.]

Authors’ rights deserve protection.
And governments and the various stakeholders involved may eventually work out how to achieve the full potential of digital libraries in a way that is fair to writers, readers and providers. But digitizing books for ‘nonexpressive’ uses, such as basic searching and text mining, is a separate issue and should not be barred on the basis of concerns over copyright. An independent review last year of intellectual property and growth commissioned by the British government came to a similar conclusion [2].

Unauthorized music-file sharing can infringe copyright because humans ultimately experience those files as musical works. Scanning words from library books to make a search index, or to compile a list of word frequencies, does not interfere with the rights of the author. These uses simply convert masses of text into metadata.

It is time for the US courts to recognize explicitly that, in the digital age, copying books for non-expressive purposes is not infringement. Courts have already applied this logic in analogous cases: Google, Microsoft and others copy web pages to feed into their Internet search engines; the online service Turnitin copies exam papers and other sources so that plagiarism can be detected. These practices have been challenged and found to be legal under copyright law.

It is crucial for future research that the right precedent be set. We hope that the judges decide that digitization for text mining and other forms of computational analysis is, unequivocally, fair use. ■

Matthew L. Jockers is assistant professor of English at the University of Nebraska, Lincoln, USA. Matthew Sag is associate professor of law at Loyola University, Chicago, Illinois, USA. Jason Schultz is assistant clinical professor of law at the University of California, Berkeley, USA.
e-mail: mjockers@unl.edu

1. Jockers, M. L., Sag, M. & Schultz, J. preprint at Social Science Research Network (2012); available at http://ssrn.com/abstract=2102542.
2. Hargreaves, I.
Digital Opportunity: A Review of Intellectual Property and Growth (Intellectual Property Office, 2011).

30 | NATURE | VOL 490 | 4 OCTOBER 2012
© 2012 Macmillan Publishers Limited. All rights reserved
SOURCE: M. L. JOCKERS

Further reading
Matthew Sag, Orphan Works as Grist for the Data Mill, 27 BERKELEY TECHNOLOGY LAW JOURNAL 1503–1550 (2012)
Matthew Sag, Copyright and Copy-Reliant Technology, 103 NORTHWESTERN UNIVERSITY LAW REVIEW 1607–1682 (2009)