Creative Commons for Corpus Construction

Björn Lindström
bkhl@stp.ling.uu.se
Uppsala universitet
Department of Linguistics and Philology

[Version 1]
March 29, 2005

Abstract

An obstacle to corpus linguistics is the fact that all or most large corpora are distributed under restrictive licenses. This article describes the development of a computer program – ccCorpus – which collects texts distributed under non-restrictive Creative Commons licenses from the web and processes them for incorporation into a corpus. The motivations behind the project and its juridical implications are also discussed.

Copyright © 2005 Björn Lindström

This work is licensed under the Creative Commons Attribution License. To view a copy of this license, visit http://creativecommons.org/licenses/by/2.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

1 Introduction

An obstacle to corpus linguistics is the fact that all or most large corpora are distributed under restrictive licenses. The background section (section 2) describes this issue, as well as the non-restrictive Creative Commons licenses. Section 3 goes on to discuss the possibility of including works under these licenses in a linguistic corpus. In section 4, a program – ccCorpus – for collecting such texts and processing them for inclusion in a corpus is described. As a conclusion (section 5), the article discusses the possibilities of using an improved version of this program as a tool for constructing usable linguistic corpora.

2 Background

2.1 Motivations

Large linguistic corpora are in general distributed under restrictive licenses. They may or may not cost money, but they typically restrict the licensee in several ways. I will exemplify some common restrictions, but not analyse the license of any particular corpus.

The licenses themselves create a lot of administrative work. In some cases it's enough for a department to sign the license once, but in others each and every student has to be registered.

Also, the license often restricts the possibilities of redistributing derivatives of the original corpus. This limits the possibilities of making versions of a well-known corpus available, enriched with syntactic or semantic information, stratified differently, or otherwise improved.

It may be impossible to use the corpus for commercial purposes. This will make it harder to utilise in language technological applications meant for use outside of the academic world.

This is not all because of the greed of the corpus constructors, of course. Common text sources for corpora are for instance books and newspaper archives. It would be economically, if not practically, impossible to obtain permissions from all the authors represented in a typical large corpus to distribute it under a non-restrictive license.

A common way to get around this is to use public records as text sources. However, the texts that may be collected that way will not be representative of a language. This article will explore the possibilities of another approach to collecting a corpus that may be redistributed with no or few restrictions on the licensee.

2.2 Creative Commons

Creative Commons (Creative Commons n.d.) co-founder Lawrence Lessig has written pessimistically on how copyright, patent and trademark laws increasingly restrict the freedom of scholars as well as artists and small businesses (Lessig 2004), threatening both independent research and noncommercial culture.
However, like a juridical Ballard or Baudrillard, he recognises that the technology – and the laws – that have had a part in creating this situation also offer possible solutions.

Creative Commons was founded in 2001 with the goal of promoting the sharing of creative works under flexible and non-restrictive licenses. In this they were in part inspired by the Free Software Foundation (Free Software Foundation n.d.), which issued a copyright license for the promotion of Free Software back in the 1980s. Creative Commons have also issued a line of copyright licenses, adapted for various kinds of content, and with several options in regard to the rights and obligations of the licensee.

In addition to their licenses, Creative Commons have also developed a standard for encoding information on the licensing of a particular work in the Resource Description Framework (RDF), so that it can be used by, for instance, search and inference engines. This metadata also makes it possible to collect a large corpus that may be redistributed under a non-restrictive license.

3 Licensing Issues

3.1 Selecting texts based on license

The previously mentioned RDF metadata can be used to decide if a particular text comes with a license that makes it possible to include it in the planned corpus.

First, all Creative Commons licenses require that any distributed copies of the work include an intact original copyright notice. To this end, I'll later in this article propose some extensions to the TIGER-XML treebank format (König, Lezius and Voormann 2003) to make that possible. All Creative Commons licenses also allow redistribution of the licensed work, which is obviously required to be able to distribute the resulting corpus.

There are a couple of things that may vary between licenses and require some consideration.

Is redistribution of derivative works allowed? While all Creative Commons licenses allow unlimited redistribution, not all allow derivative works to be distributed. However, they still allow the work to be converted into another format. It might be that this makes it possible to include even those in a corpus. On the other hand, the act of incorporating the work into a corpus could itself possibly be construed as creating a derivative work, especially if the text is also enriched with linguistic information. So, to err on the side of safety, the ccCorpus program only collects texts under licenses that allow derivative works to be created.

Are derivative works required to be distributed under the same license? Some Creative Commons licenses require that all derivative works are distributed under the same license as the original work. This requirement might create issues for a corpus under some circumstances, which will be discussed in section 3.2.

Is commercial use of the work allowed? Texts for which commercial use is not allowed may or may not have to be excluded, depending on the purposes of the corpus. If it's only intended for academic use, including texts with this restriction is fine, as long as the license for the corpus as a whole also carries it; this is what the ccCorpus program currently does.
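To make the selection concrete, here is a minimal sketch of how a collector could decide whether a license is acceptable, judging only by the module string in a Creative Commons license URL ('by', 'by-sa', 'by-nc-nd' and so on). The function names and the exact policy are illustrative assumptions, not the actual ccCorpus code, which reads the full RDF metadata:

# Illustrative sketch, not the actual ccCorpus code: decide whether a
# Creative Commons license, identified by its URL, is acceptable for the
# corpus. Licenses that forbid derivative works ("nd") are always
# rejected; licenses that forbid commercial use ("nc") are kept only if
# the corpus itself is meant for non-commercial (academic) use.

def license_modules(license_url):
    # A license URL looks like http://creativecommons.org/licenses/by-nc-sa/2.0/
    # where the part after "licenses/" names the license modules.
    parts = license_url.rstrip("/").split("/")
    if "licenses" in parts and parts.index("licenses") + 1 < len(parts):
        return set(parts[parts.index("licenses") + 1].split("-"))
    return set()

def acceptable(license_url, allow_noncommercial=True):
    modules = license_modules(license_url)
    if not modules:
        return False    # not a recognised Creative Commons license URL
    if "nd" in modules:
        return False    # derivative works not allowed
    if "nc" in modules and not allow_noncommercial:
        return False    # commercial use not allowed, but required
    return True

if __name__ == "__main__":
    print(acceptable("http://creativecommons.org/licenses/by/2.0/"))        # True
    print(acceptable("http://creativecommons.org/licenses/by-nd/2.0/"))     # False
    print(acceptable("http://creativecommons.org/licenses/by-nc-sa/2.0/"))  # True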
3.2 Licensing a Creative Commons based corpus

The ccCorpus intermediary format allows a separate license to be given for each included work. If this is acceptable in a particular case, one can include works licensed under the previously mentioned clause that only allows redistribution under the same license. In most cases, however, the users of the corpus will probably prefer to receive the corpus as a whole under a single license agreement, in which case works with this license clause must be excluded.

Since all Creative Commons licenses require that the original copyright notice is preserved in all redistributed copies, it is still advisable to keep information about the creator and source of each work in the finished corpus.

4 The Automatic Corpus Collector

The program, called ccCorpus, is a set of Python modules which collects a number of web pages, parses their RDF license metadata and, if the license is found acceptable, produces an XML structure that may be further processed for inclusion in a corpus.

4.1 Output format

The current output of the ccCorpus program is considered an intermediary format, meant to be handed off to a tokeniser, so that an annotated corpus can be created. The ccCorpus intermediary format is based on the TIGER-XML treebank format (König et al. 2003), with some additions that, with one exception, are necessary for legal reasons.

4.1.1 Paragraphs

I will first mention the exception – the <p> element type. TIGER-XML does not allow elements to have text content. Instead, the text of each token is kept in the word attribute of a <t> element. So, in order to be able to output an intermediary format, the <p> element type has been introduced to enclose paragraphs from the source text. This is so that information about paragraph segmentation in the source can be given as hints to a tokeniser, segmenter, parser or tagger.

In addition, the <p> element type can have the attribute type, which, if present, may inform programs later in the processing chain about what kind of text can be expected in the paragraph. Currently, the ccCorpus program sets this attribute to 'p' for paragraphs, and to 'h' for headers, list entries and table cells, on the assumption that the former are more likely to include full sentences, while the latter are more likely to consist of single noun phrases.

As an example, here's a block of text that will be fairly common in the ccCorpus output:

<p type="h">This work is licensed under a Creative Commons License.</p>

It also demonstrates that the previously mentioned assumptions about the contents of different element types in the source text may not be completely sound.
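As a rough sketch of how such <p> elements could be produced, a parsed page could be traversed as below, here using the standard-library port of ElementTree (Lundh 2005). The tag sets and function names are illustrative assumptions, not a description of the actual ccCorpus code:

# Illustrative sketch: map elements of a parsed (X)HTML page to <p>
# elements of the intermediary format, using the 'p'/'h' distinction
# described above. The tag sets below are assumptions for illustration.

import xml.etree.ElementTree as ET

SENTENCE_LIKE = {"p"}                               # full sentences expected
HEADER_LIKE = {"h1", "h2", "h3", "h4", "h5", "h6",  # headers
               "li", "td", "th"}                    # list entries, table cells

def extract_paragraphs(page_root):
    # Yield <p> elements of the intermediary format for each relevant
    # source element, with the type attribute set to 'p' or 'h'.
    for element in page_root.iter():
        tag = str(element.tag).lower()
        if tag in SENTENCE_LIKE:
            kind = "p"
        elif tag in HEADER_LIKE:
            kind = "h"
        else:
            continue
        text = " ".join(element.itertext()).strip()
        if text:
            paragraph = ET.Element("p", type=kind)
            paragraph.text = text
            yield paragraph

if __name__ == "__main__":
    page = ET.fromstring("<html><body><h1>Thursday</h1>"
                         "<p>Asparagus for dinner again.</p></body></html>")
    body = ET.Element("body")
    body.extend(extract_paragraphs(page))
    print(ET.tostring(body, encoding="unicode"))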
4.1.2 Documents

Since most of the Creative Commons licenses require the credits for a work to be preserved when redistributing it, and some also require that it is redistributed under the same license, it becomes necessary to introduce a way to separate source texts, and to describe copyright information for each text.

To this end, I've added the element type <document>, which, like the main corpus, is divided into <head> and <body>. The body can consist of <p> or, after further processing, <s> elements. The <head> element will, like the head of the main corpus, contain a <meta> element, which in turn will contain elements describing the properties of the text. Here's an example of how this could look:

<document>
  <head>
    <meta>
      <creator>Laura Palmer</creator>
      <date>February 23 1989</date>
      <description>Laura Palmer's Diary</description>
      <keywords>sex, drugs, rock'n'roll</keywords>
      <license>http://creativecommons.org/licenses/by/2.0</license>
      <rights>The estate of Laura Palmer</rights>
      <source>http://joe.the-randoms.com/recipes/</source>
      <title>Antipastos and Gazpachos</title>
    </meta>
  </head>
  <body>
    <p type="h">Thursday</p>
    <p type="p">Asparagus for dinner again. I hate asparagus. Does this mean I'll never grow up?</p>
  </body>
</document>

The elements within the <meta> element get their values as follows:

<creator> and <title> are based on the same properties of the RDF license metadata or, as a fall-back, on the author metadata given in the header of the web page.

<rights> is similar to <creator>, but holds the name of a person or entity, other than the creator, holding rights to the work.

<source> is also based on the same property in the metadata, but will be set to the URL the page was retrieved from if that property is not given.

<date> will contain the date given in the RDF metadata, if there is one; otherwise the program will use the current time.

<description> and <keywords> are collected from the corresponding headers of the web page.

<license> is set to the URL of the available license that is found most applicable.

Some of these elements may occur several times within a <meta> block, if for instance more than one creator is given in the metadata.

4.2 Results

As the ccCorpus program by itself is far from producing a finished linguistic corpus, it's hard to gauge the quality of the results. Quantity is no problem. A typical run of the program, sampling 1 000 pages in Creative Commons' web spider index, will result in a collection of about 200 texts with licenses that allow redistribution of derivative works. These texts in turn have on average 200 words – counted naïvely and estimated low, since a lot of tokens without linguistic content are probably counted. The index has about 3 000 000 pages, which means, the estimates now being very rough, that about 120 000 000 words could possibly be collected. It is possible that extending the program so that it spiders the web looking for more content itself would raise that number.

It should be mentioned that sampling even a small number of pages with the program is rather slow. A thousand pages were downloaded, checked and converted in about 40 minutes on a computer with fairly high network bandwidth. A simple benchmark revealed that the program spends most of this time waiting for pages to be downloaded.

5 Conclusions

Even though the current output from the ccCorpus program is far from a usable linguistic corpus, the results are encouraging. It has turned out that the amount of text on the web licensed under Creative Commons licenses is more than enough to build a large corpus. Also, a large enough part of these works has the metadata required to collect them automatically.

5.1 The future of the ccCorpus program

Since the major bottleneck in the program is the downloading of pages, that part of it should utilise threads to download several pages at a time. Ideally, the converter should never have to wait for more pages to be downloaded.
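As a minimal sketch of what that could look like, the download stage could hand its URLs to a pool of worker threads and let the converter consume each page as soon as it arrives. The pool size and the names below are my assumptions, not the current ccCorpus code:

# Illustrative sketch: download several pages at a time in worker
# threads, so that the converter rarely has to wait for the network.
# The number of threads and the names below are assumptions.

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def fetch(url):
    # Download one page and return it together with its URL.
    with urlopen(url) as response:
        return url, response.read()

def fetch_all(urls, workers=10):
    # Yield (url, page) pairs as soon as each download finishes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, url) for url in urls]
        for future in as_completed(futures):
            try:
                yield future.result()
            except OSError:
                continue    # skip pages that could not be downloaded

if __name__ == "__main__":
    sample_urls = ["http://creativecommons.org/"]    # placeholder input
    for url, page in fetch_all(sample_urls):
        print(url, len(page), "bytes")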
Other than this, the problem of extracting text and licensing information from web pages has largely been solved.

The next step is to find or develop a system for tokenising the collected text and segmenting it into sentences. If an existing tokeniser is found to work satisfactorily, it might be possible to have it take advantage of the type attribute of <p> elements as currently output by the ccCorpus program. If one is developed specifically for this project, it could possibly be made more accurate by working directly with element trees representing the source web pages, rather than with the extracted text.

5.2 More licensing issues

This project has so far been focused on the technical issues of collecting pages, checking metadata and extracting text. In consequence, this article makes assumptions about the legal ground it covers that should be checked by someone with experience in copyright law.

For instance, since it cannot be guaranteed that the text-collecting program will not destroy any copyright information in the text, it needs to be decided whether the preservation of licensing data in the metadata of the corpus is enough to satisfy the clause of the Creative Commons licenses that requires the license notice to be kept intact. Also, it will have to be decided whether works with incompatible licenses can be incorporated into the same corpus, as long as the license information for each original work is preserved.

6 Acknowledgements

Thanks to Beáta Megyesi for inspiration, encouragement and guidance, and to Filip Salomonsson for help and advice on the Python programming language. Fredrik Lundh inadvertently helped by writing the ElementTree XML library for Python (Lundh 2005). Mike Linksvayer of Creative Commons simplified my work by giving me access to their web spider index. Daniel Luna gave valuable feedback on the article. Henrik Nyh found a bug.

References

Creative Commons (n.d.). http://creativecommons.org/.

Free Software Foundation (n.d.). http://www.fsf.org/.

König, E., Lezius, W. and Voormann, H. (2003). TIGERSearch 2.1 User's Manual.

Lessig, L. (2004). Free Culture, The Penguin Press.

Lundh, F. (2005). ElementTree, http://effbot.org/zone/element-index.htm.