Creative Commons for Corpus Construction
Björn Lindström
bkhl@stp.ling.uu.se
Uppsala universitet
Department of Linguistics and Philology
[Version 1]
March 29, 2005
Abstract
An obstacle to corpus linguistics is the fact that all or most large corpora are distributed under restrictive licenses. This article describes the development of a computer program, ccCorpus, which collects texts distributed under non-restrictive Creative Commons licenses from the web and processes them for incorporation into a corpus. The motivations behind the project and its juridical implications are also discussed.
Copyright © 2005 Björn Lindström. This work is licensed under the Creative Commons Attribution License. To view a copy of this license, visit http://creativecommons.org/licenses/by/2.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
1 Introduction
An obstacle to corpus linguistics is the fact that all or most large corpora are
distributed under restrictive licenses. The background section (section 2) describes this issue, as well as the non-restrictive Creative Commons licenses. Section 3 goes on to discuss the possibility of including works under these licenses in a linguistic corpus.
In section 4, a program – ccCorpus – for collecting such texts and processing them for inclusion in a corpus is described.
As a conclusion (section 5), the article discusses the possibilities of using an improved version of this program as a tool for constructing usable linguistic corpora.
2 Background
2.1 Motivations
Large linguistic corpora are in general distributed under restrictive licenses.
They may or may not cost money, but they typically restrict the licensee in
several ways. I will exemplify some common restrictions, but not analyse the license of any particular corpus.
The licenses themselves create a lot of administrative work. In some cases, it is enough for a department to sign the license once, but in others, each and every student has to be registered.
Also, the license often restricts redistribution of derivatives of the original corpus. This limits the possibility of making versions of a well-known corpus available that are enriched with syntactic or semantic information, stratified differently, or otherwise improved.
It may also be impossible to use the corpus for commercial purposes, which makes it harder to utilise in language technology applications meant for use outside of the academic world.
This is not all because of the greed of the corpus constructors, of course.
Common text sources for corpora are, for instance, books and newspaper archives. It would be economically, if not practically, impossible to obtain permission from all the authors represented in a typical large corpus to distribute it under a non-restrictive license.
A common way to get around this is to use public records as text sources.
However, the texts that may be collected that way will not be representative
of a language. This article will explore the possibilities of another approach
to collecting a corpus that may be redistributed with no or few restrictions on
the licensee.
2.2 Creative Commons
Creative Commons (Creative Commons n.d.) co-founder Lawrence Lessig
has written pessimistically on how copyright, patent and trademark laws
increasingly restrict the freedom of scholars as well as artists and small
businesses (Lessig 2004), threatening both independent research and non-commercial culture. However, like a juridical Ballard or Baudrillard, he recognises that the technology – and the laws – that have had a part in creating this situation also offer possible solutions.
Creative Commons was founded in 2001 with a goal of promoting sharing of creative works under flexible and non-restrictive licenses. In this they
were in part inspired by the Free Software Foundation (Free Software Foundation
n.d.), which issued a copyright license for the promotion of Free Software
back in 1984.
Creative Commons have also issued a line of copyright licenses, adapted
for various kinds of content, and with several options in regard to the rights
and obligations of the licensee.
In addition to their licenses, Creative Commons have also developed a
standard for encoding information on the licensing of a particular work in
the Resource Description Framework (RDF), so that it can be used by, for instance, search and inference engines. This also makes it possible to collect a large corpus that may be redistributed under a non-restrictive license.
3 Licensing Issues
3.1 Selecting texts based on license
The previously mentioned RDF metadata can be used to decide whether a particular text comes with a license that makes it possible to include it in the planned corpus.
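To give an idea of what reading this metadata might involve, here is a small sketch – not the actual ccCorpus code – that pulls the license URL of a work out of the rdf:RDF block which Creative Commons pages of this era typically embed in the page, often inside an HTML comment. The rdf: prefix, the cc namespace http://web.resource.org/cc/ and the helper name are assumptions made for illustration.

import re
import xml.etree.ElementTree as ET

RDF_NS = '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}'
CC_NS = '{http://web.resource.org/cc/}'

def extract_license_url(html):
    """Return the cc:license URL from RDF embedded in the page, or None."""
    # Find the embedded rdf:RDF block in the raw HTML (it usually sits
    # inside an HTML comment, so a pattern search is enough here).
    match = re.search(r'<rdf:RDF.*?</rdf:RDF>', html, re.DOTALL)
    if match is None:
        return None
    rdf = ET.fromstring(match.group(0))
    work = rdf.find(CC_NS + 'Work')
    if work is None:
        return None
    license_el = work.find(CC_NS + 'license')
    if license_el is None:
        return None
    return license_el.get(RDF_NS + 'resource')

# A made-up example of the kind of RDF block this is meant to handle:
sample = '''<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="">
    <license rdf:resource="http://creativecommons.org/licenses/by/2.0/"/>
  </Work>
</rdf:RDF>'''
print(extract_license_url(sample))
# http://creativecommons.org/licenses/by/2.0/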
First, all Creative Commons licenses require that any distributed copies of the work include an intact original copyright notice. To this end, I will later in this article propose some extensions to the TIGER-XML treebank format (König, Lezius and Voormann 2003) that make this possible.
All Creative Commons licenses also allow redistribution of the licensed
work, which is obviously required to be able to distribute the resulting corpus.
There’s a couple of things that may vary between licenses, and requires
some consideration.
Is redistribution of derivative works allowed? While all Creative Commons licenses allow unlimited redistribution, not all allow derivative works to be distributed. However, they do still allow the work to be converted into another format, and it may be that this makes it possible to include even those texts in a corpus. On the other hand, the act of incorporating the work in a corpus could itself possibly be construed as creating a derivative work, especially if the text is also enriched with linguistic information. So, to err on the side of safety, the ccCorpus program only collects texts under licenses that allow derivative works to be created.
Are derivative works required to be distributed under the same license? Some Creative Commons licenses require that all derivative works are distributed under the same license as the original work. This requirement might create issues for a corpus under some circumstances, which will be discussed in section 3.2.
Is commercial use of the work allowed? Texts for which commercial use is not allowed may or may not have to be excluded, depending on the purposes of the corpus. If the corpus is only intended for academic use, including texts with this restriction is fine, as long as the license for the corpus as a whole carries the same restriction, and that is what the ccCorpus program currently does.
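As a rough sketch of this selection step – again an illustration under assumptions rather than the actual ccCorpus logic – one can exploit the fact that Creative Commons license URLs encode the chosen options in their path: an nd component means no derivative works, an nc component means no commercial use.

def acceptable_license(license_url, allow_noncommercial=True):
    """Decide whether a text with this license may go into the corpus.

    A sketch only: derivative works must be allowed, and the non-commercial
    restriction is tolerated by default, mirroring the policy described above."""
    # License URLs look like http://creativecommons.org/licenses/by-nc-sa/2.0/
    parts = license_url.rstrip('/').split('/')
    if 'licenses' not in parts:
        return False  # not a Creative Commons license URL at all
    options = parts[parts.index('licenses') + 1].split('-')
    if 'nd' in options:
        return False  # derivative works are not allowed
    if 'nc' in options and not allow_noncommercial:
        return False  # commercial use is prohibited
    return True

print(acceptable_license('http://creativecommons.org/licenses/by-nc-sa/2.0/'))  # True
print(acceptable_license('http://creativecommons.org/licenses/by-nd/2.0/'))     # False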
3.2 Licensing a Creative Commons based corpus
The ccCorpus intermediary format allows a separate license to be given for each included work. If this is acceptable in a particular case, one can include works under the previously mentioned clause that only allows redistribution of the work under the same license.
In most cases, however, the users of the corpus will probably prefer to receive the corpus as a whole under a single license agreement, in which case works with this license clause must be excluded.
Since all Creative Commons licenses require that the original copyright notice is preserved in all redistributed copies, it is still advisable to keep information about the creator and source of each work in the finished corpus.
4 The Automatic Corpus Collector
This program, called ccCorpus, is a set of Python modules which collects a number of web pages, parses the RDF license metadata and, if the license is found acceptable, produces an XML structure that may be further processed for inclusion in a corpus.
4.1 Output format
The current output of the ccCorpus program is considered an intermediary
format, meant to be handed off to a tokeniser, so that an annotated corpus
can be created.
The ccCorpus intermediary format is based on the TIGER-XML treebank
format (König et al. 2003), with some additions that, with one exception, are
necessary for legal reasons.
4.1.1 Paragraphs
I will first mention the exception – the <p> element type. TIGER-XML does not allow elements to contain running text; instead, the text of each token is held in the word attribute of a <t> element.
So, in order to be able to output an intermediary format, the <p> element type has been introduced to enclose paragraphs from the source text. This is so that information about paragraph segmentation in the source can be given as hints to a tokeniser, segmenter, parser or tagger.
In addition, the <p> element type can have the attribute type, which,
if present, may inform programs later in the processing chain on what kind
of text can be expected in the paragraph. Currently, the ccCorpus program
sets this attribute to ‘p’ for paragraphs, and to ‘h’ for headers, list entries and
table cells, on the assumption that the former are more likely to include full
sentences, while the latter are more likely to consist of single noun phrases.
As an example, here is a block of text that will be fairly common in the ccCorpus output:
<p type="h">This work is licensed under a
Creative Commons License.</p>
It also demonstrates that the previously mentioned assumptions on the
contents of different element types in the source text may not be completely
sound.
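As a sketch of how such <p> elements could be produced from a parsed page – the tag-to-type mapping and the assumption that the page is available as an ElementTree-style tree are mine, and this is not the actual ccCorpus code – block-level HTML elements can be mapped onto the intermediary element type like this:

import xml.etree.ElementTree as ET

# 'p' for ordinary paragraphs; 'h' for headers, list entries and table cells.
TYPE_BY_TAG = {'p': 'p',
               'h1': 'h', 'h2': 'h', 'h3': 'h',
               'h4': 'h', 'h5': 'h', 'h6': 'h',
               'li': 'h', 'td': 'h', 'th': 'h'}

def paragraphs(page_root):
    """Yield <p type="..."> elements for recognised blocks in the page.

    page_root is assumed to be an ElementTree-style tree of the web page.
    Nested blocks may be emitted more than once; a real implementation
    would need to handle that, as well as namespaced tag names."""
    for element in page_root.iter():
        p_type = TYPE_BY_TAG.get(element.tag)
        if p_type is None:
            continue
        text = ' '.join(part.strip() for part in element.itertext() if part.strip())
        if text:
            p = ET.Element('p', {'type': p_type})
            p.text = text
            yield p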
4.1.2 Documents
Since most of the Creative Commons licenses require the credits for a work to be preserved when it is redistributed, and some also require that it is redistributed under the same license, it becomes necessary to introduce a way to separate the source texts and to describe copyright information for each text.
To this end, I've added the element type <document>, which, like the main corpus, is divided into <head> and <body>. The body can consist of <p> or, after further processing, <s> elements. The <head> element will, like the head of the main corpus, contain a <meta> element, which in turn will contain elements describing the properties of the text. Here's an example of how this could look:
<document>
<head>
<meta>
<creator>Laura Palmer</creator>
<date>February 23 1989</date>
<description>Laura Palmer’s Diary</description>
<keywords>sex, drugs, rock’n’roll</keywords>
<license>http://creativecommons.org/licenses/by/2.0</license>
<rights>The estate of Laura Palmer</rights>
<source>http://joe.the-randoms.com/recipes/</source>
<title>Antipastos and Gazpachos</title>
</meta>
</head>
<body>
<p type="h">Thursday</p>
<p type="p">Asparagus for dinner again. I hate
asparagus. Does this mean I’ll never grow up?</p>
</body>
</document>
The elements within the meta element get their values as follows.
<creator> and <title> are based on the same properties of the RDF
license metadata or, as a fall-back, the author metadata given in the header
of the web page.
<rights> is similar to <creator> but holds the name of a person or
entity holding rights to the work, other than the creator.
<source> is also based on the same property in the metadata, but will
be set to the URL the page was retrieved from, if that property is not given.
If a date is given in the RDF metadata, <date> will contain that value,
otherwise the program will use the current time.
<description> and <keywords> will be collected from the corresponding headers of the web page.
<license> is set to the URL of the available license that is found most applicable.
Some of these elements may occur several times within a <meta> block, if, for instance, more than one creator is given in the metadata.
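A minimal sketch of how the <meta> block could be assembled with these fall-backs – the rdf_meta and page_meta dictionaries, their keys and the date format are assumptions made for illustration, not the actual ccCorpus data structures:

import time
import xml.etree.ElementTree as ET

def build_head(rdf_meta, page_meta, url):
    """Build the <head><meta>...</meta></head> subtree for one document."""
    head = ET.Element('head')
    meta = ET.SubElement(head, 'meta')

    def add(tag, value):
        # Only emit an element when there is something to put in it.
        if value:
            ET.SubElement(meta, tag).text = value

    # <creator> (possibly repeated): RDF metadata first, then the author
    # header of the web page as a fall-back.
    for creator in rdf_meta.get('creators') or [page_meta.get('author')]:
        add('creator', creator)
    # <date>: the RDF date if given, otherwise the current time.
    add('date', rdf_meta.get('date') or time.strftime('%Y-%m-%d'))
    # <description> and <keywords> come from the page headers.
    add('description', page_meta.get('description'))
    add('keywords', page_meta.get('keywords'))
    # <license> is the URL of the license found most applicable.
    add('license', rdf_meta.get('license'))
    # <rights>: rights holders other than the creator.
    add('rights', rdf_meta.get('rights'))
    # <source>: the RDF source property, else the URL the page came from.
    add('source', rdf_meta.get('source') or url)
    # <title>: the RDF title, else the title of the web page.
    add('title', rdf_meta.get('title') or page_meta.get('title'))
    return head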
4.2 Results
As the ccCorpus program by itself is far from producing a finished linguistic corpus, it is hard to gauge the quality of the results.
Quantity is no problem. A typical run of the program, sampling 1 000 pages in Creative Commons' web spider index, will result in a collection of about 200 texts with licenses that allow redistribution of derivative works. These texts in turn average about 200 words each – a naïve count, deliberately estimated on the low side, since many tokens without linguistic content are probably included in it.
The index has about 3 000 000 pages, which means – the estimates now being very rough – that with roughly a fifth of the pages yielding usable texts of about 200 words each, about 120 000 000 words could possibly be collected.
It is possible that extending the program so that it spiders the web looking for more content itself would raise that number.
It should be mentioned that sampling even a small number of pages with the program is rather slow. 1 000 pages were downloaded, checked and converted in about 40 minutes on a computer with fairly large network bandwidth. A simple benchmark revealed that the program spends most of this time waiting for pages to be downloaded.
5 Conclusions
Even though the current output from the ccCorpus program is far from a
usable linguistic corpus, the results are encouraging. It has turned out that
the amount of text on the web licensed under Creative Commons licenses is
more than enough to build a large corpus. Also, a large enough part of these
works has the metadata required to collect them automatically.
5.1 The future of the ccCorpus program
Since the major bottleneck in the program is the downloading of pages, that
part of it should utilise threads to download several pages at a time. Ideally,
the converter should never have to wait for more pages to be downloaded.
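A minimal sketch of what such a threaded download stage might look like – the function name, the number of workers and the sentinel convention are assumptions, not existing ccCorpus code:

import queue
import threading
import urllib.request

def download_all(urls, results, workers=8):
    """Fetch every URL with a small pool of worker threads, putting
    (url, page_bytes) pairs on the results queue as they complete."""
    todo = queue.Queue()
    for url in urls:
        todo.put(url)

    def work():
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return
            try:
                with urllib.request.urlopen(url, timeout=30) as page:
                    results.put((url, page.read()))
            except OSError:
                results.put((url, None))  # record the failure, keep going
            finally:
                todo.task_done()

    for _ in range(workers):
        threading.Thread(target=work, daemon=True).start()
    todo.join()        # wait until every page has been handled
    results.put(None)  # sentinel: no more pages are coming

Run in a thread of its own, this would let the converter read pairs from the results queue until the sentinel appears, so that parsing and conversion overlap with the network waits.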
Other than this, the problem of extracting text and licensing information from web pages has largely been solved.
The next step is to find or develop a system for tokenising the collected text and segmenting it into sentences. If an extant tokeniser is found to work satisfactorily, it might be possible to have it take advantage of the type attribute of <p> elements as currently output by the ccCorpus program.
If one is developed specially for this project, it could possibly be made more accurate by working with element trees representing the source web pages directly, rather than the extracted text.
5.2 More licensing issues
This project has so far been focused on the technical issues of collecting pages, checking metadata, and extracting text. In consequence, this article makes assumptions about the legal ground it covers that should be checked by someone with experience in copyright law.
For instance, since it cannot be guaranteed that the text-collecting program will not destroy copyright information in the text itself, it needs to be decided whether the preservation of licensing data in the metadata of the corpus is enough to satisfy the clause of the Creative Commons licenses that requires the license notice to be kept intact.
Also, it will have to be decided whether works with incompatible licenses can be incorporated in the same corpus, as long as the license information for each original work is preserved.
6 Acknowledgements
Thanks to Beáta Megyesi for inspiration, encouragement and guidance, and
to Filip Salomonsson for help and advice on the Python programming language.
Fredrik Lundh inadvertently helped by writing the ElementTree XML
library for Python (Lundh 2005).
Mike Linksvayer of Creative Commons simplified my work by giving me
access to their web spider index.
Daniel Luna gave valuable feedback on the article.
Henrik Nyh found a bug.
References
Creative Commons (n.d.). http://creativecommons.org/.
Free Software Foundation (n.d.). http://www.fsf.org/.
König, E., Lezius, W. and Voormann, H. (2003). TIGERSearch 2.1 User's Manual.
Lessig, L. (2004). Free Culture, The Penguin Press.
Lundh, F. (2005). ElementTree, http://effbot.org/zone/element-index.htm.