Data Distribution Report - Go-Geo! The Geo-Data Portal Project

Go-Geo!
Title: Data Distribution Study
Synopsis: This document reports on an investigation into data distribution systems and the requirements of data creators for a system within Go-Geo!
Author: Julie Missen, UK Data Archive
Date: 05 August 2004
Version: 1.c.a
Status: Final
Authorised: Dr David Medyckyj-Scott, EDINA
Contents
1.0 Background
2.0 Introduction
3.0 Conducting the Data Distribution Study
4.0 Requirements Study
    4.1 Results
    4.2 Summary
5.0 Literature Review
    5.1 Self Archiving
        5.1.1 Copyright/IPR
        5.1.2 Preservation
        5.1.3 Peer review
        5.1.4 Cost
        5.1.5 Software
        5.1.6 Case Studies
    5.2 Peer to Peer Archiving (P2P)
        5.2.1 How Does P2P Work?
        5.2.2 The Centralised Model of P2P File-Sharing
        5.2.3 The Decentralised Model of P2P File-Sharing
        5.2.4 Advantages and Limitations of using the Peer-to-Peer Network
        5.2.5 Copyright/IPR
        5.2.6 Preservation
        5.2.7 Cost
        5.2.8 Security
        5.2.9 Peer-to-Peer Systems
        5.2.10 Case Studies
    5.3 Distribution from a Traditional (Centralised) Archive
        5.3.1 Copyright/IPR
        5.3.2 Preservation
        5.3.3 Cost
        5.3.4 Case Studies
    5.4 Data Distribution Systems Summary
6.0 Conclusions
7.0 Recommendations for Data Distribution for the Go-Geo! Portal
Bibliography
Glossary
Appendix A – Data Distribution Survey
1.0 Background
The creation and submission of geo-spatial metadata records to the Go-Geo! portal is not the
end of the data collection process. Datasets are at risk of becoming forgotten once a project
ends and staff move on. Unless preserved for further use, data which have been collected at
significant expense and with considerable expertise and effort may later survive only in a small
number of reports which analyse just a fraction of the data's research potential. Within a very short space of time,
the data files can become lost or obsolete as the technology of the collecting institution
changes. The metadata records in the Go-Geo! portal could rapidly become useless as they will
describe datasets to which users have no access. There is a need to ensure that data are
preserved against technological obsolescence and physical damage and that means are
provided to supply them in an appropriate form to users.
Individuals, research centres and departments are generally not organised in such a way as to
be able to administer distribution of data to people who approach them. While a researcher
might be willing to burn data onto a CD as a one-off request, they would be less willing to do
this on a regular basis. One of the hopes of the Go-Geo! project is that researchers and
research centres will provide metadata to Go-Geo! about smaller, more specialist datasets they
have created. One of the benefits to individuals and centres of doing this is that demonstrating
continued usage of data after the original research is completed can influence funders to
provide further research money. However, feedback has indicated that provision of metadata
from these sorts of data providers might be limited because of concerns about how they would
distribute and store their data.
For larger datasets hosted and made available through national data centres and other large
service providers, the technology already exists through which the Go-Geo! portal could
support online data mining, visualisation, exploitation and analysis of geospatial data. However,
issues of licence to use data in this way and funding to establish the technical infrastructure
need to be resolved before further developments can take place. More investigation is required
into the best means by which individuals and smaller organisations such as research
groups/centres can share data with others.
The combination of compulsion and reward proposed to encourage metadata creation could
also be applied to data archiving. The Research Assessment Exercise could reward institutions
for depositing high-quality geo-spatial data sets with a suitable archival body, and funding
councils and research councils could follow the example of the NERC, the ESRC and the
AHRB who make it a condition of funding that all geo-spatial datasets should be archived.
It seems clear that what is required is a cost-effective and easy way for those holding geospatial data to share their data with others within UK tertiary education. A data sharing
mechanism is seen as critical to the project. Without it, the amount of geo-spatial data available
for re-use may be limited.
The need for long-term preservation, along with cataloguing the existence of data, was
identified during the phase one feasibility study and a recommendation was made to the JISC
to consider establishing a repository for geospatial datasets which falls outside the collecting
scope of the UK Data Archive and Arts and Humanities Data Service. This may be something
one or more existing data centres could take on or it could become an activity within the
operation of the Go-Geo! portal.
Before this investigation was undertaken, three scenarios of potential data distribution systems
for the Go-Geo! project were identified:
• self-archiving service. It was envisaged that one or more self-archiving services could be established where data producers/holders could publish data for use by others. The service would need to provide mechanisms for users to submit data, metadata and accompanying documentation (PDF, Word files etc.). Metadata would also be published, possibly using OAI, and therefore harvested and stored in the Go-Geo! catalogue;
• peer-to-peer (P2P) application. Data holders/custodians would set up a P2P server on institutional machines and store data on them (probably at an institutional or departmental level). Metadata would be published announcing the existence of servers and geospatial data, and could also be published to the Go-Geo! catalogue. Users would search for data using either a P2P client or the Go-Geo! portal and then, having located a copy of the data, download it to their machine;
• depositing copies of data with archive organisations, such as the UK Data Archive. The archive would maintain controls over the data on behalf of the owner and ensure the long-term safekeeping of the data. The archive would take over the administrative tasks associated with external users and their queries. Potential users of the data would typically find data through an online catalogue provided by the archive. Popular or large datasets may be available online; otherwise, ordering systems are provided through which copies of datasets can be requested. If researchers and research centres did deposit their data with an archive, it would be important that the metadata records displayed by Go-Geo! recorded this fact and explained how to contact the archive.
2.0 Introduction
The aim of this study was to investigate a cost-effective way of distributing geo-spatial data held
by individuals, research teams and departments. Three approaches were considered in this
investigation, which were felt to reflect the resources available to the data creator/custodian for
data distribution: P2P, self-archiving and traditional (centralised) archiving.
The first part of this report concentrates on a survey of data creators' and depositors'
requirements for a data distribution mechanism. Key researchers and faculty
within the geographic information community in UK academia were contacted to assess their
requirements and their constraints for data distribution. The survey was also posted on mailing
lists and on both the portal and project web sites.
Technical options for both P2P and self-archiving were investigated through a literature review
and by contacting experts in their respective fields. Existing software solutions were identified
and evaluated; of particular importance was determining how well they could either meet, or be
modified to meet, the particular requirements of geo-spatial data
distribution. The two approaches were compared against traditional (centralised) data archiving
services.
3.0 Conducting the Data Distribution Study
The data distribution study began later than anticipated because of an overrun from another
work package, and there was a further delay due to communication difficulties with City
University. Towards the end of the project this relationship was terminated and their allocated
work undertaken by staff at the UK Data Archive.
At the start of the study, a meeting was held at the UK Data Archive to initiate ideas on the
subject of data distribution and to draft a list of stakeholders. A list of stakeholders was drawn
up by UKDA and EDINA, including key researchers from the geographic information (GI)
community.
A requirements survey was then developed and distributed to stakeholders to assess their
needs for data distribution issues. The survey (see Appendix A) attempted to discover how
organisations and individuals would like to see data distributed in the future.
The survey looked at:
 access conditions;
 technical issues;
 copyright/IPR;
 funding;
 licences.
The questionnaire was posted on the Go-Geo! web site and the project web site, and was
distributed at GISRUK and at two workshops undertaken by the Go-Geo! metadata project.
An investigation was then undertaken to compare self-archiving and peer-to-peer with
traditional (centralised) archiving. The use of OAI (Open Archives Initiative) was also
considered within this study.
The peer-to-peer study was to have been undertaken by City University but, as this relationship
was later terminated, their allocated work was carried out by the UK Data Archive, with some
consultation with an expert in the field. As this change in workload occurred at a very late stage
of the project, it left less time than anticipated to complete the study, so a slightly scaled-down
version of the literature review was decided upon.
Areas of investigation included:
• IPR/copyright. Copyright is an intellectual property right (IPR) protecting an output of human intellect. Copyright protects the labour, skill and judgement that someone has expended in the creation of an original work. Usually copyright is retained by the author, which may be defined as the individual, organisation or institution. If a piece of work is completed as part of employment, the employer will retain copyright in the work. If the author is commissioned to create a piece of work on behalf of someone else, the author will retain copyright in that work;
• preservation. The saving and storing of data (in digital and/or paper format) for future use, either in a short-term repository or as long-term preservation of material in a format which is transferable and will not become obsolete. The cost and effort of longer-term preservation may outweigh the benefits;
• cost. This should be considered in terms of finances, expertise and resources;
• software. This should include software for both storage and distribution;
• data formats/standards. The data formats supported by software;
• service level definition. The definition of what a service will provide, for example, user support;
• access control/security. Access and security controls for the data, metadata and repository/storage facility;
• user support/training. Support and training should be considered for data creators, depositors and users;
• Open Archives Initiative. The OAI, based at Cornell University, provides the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which runs over web servers and clients to connect data providers to service providers. A data provider is somebody who archives information on their site, while a service provider runs the OAI protocol to access the metadata. The OAI provides web pages where data providers and service providers can register so that they know of each other's existence, thereby enabling interoperable access. The OAI protocol was originally designed for e-prints, although the OAI acknowledges that it needs to be extended to cover other forms of digital information. The protocol demands that archives, at a minimum, use the Dublin Core metadata format, although parallel sets of metadata in other formats are not prohibited. The main commercial and public-domain alternative to the OAI protocol is the ANSI/NISO Z39.50 standard, which is widely used by large archives. The contrast between the two is that Z39.50 supports greater functionality than OAI and is therefore more complex to implement in the HTTP server and the client. OAI ensures access to a freely available repository of research information, but does not address long-term archiving issues. A minimal example of an OAI-PMH request is sketched after this list.
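As an illustration of the harvesting protocol described in the last bullet above, the sketch below issues a single OAI-PMH ListRecords request for Dublin Core metadata and extracts record identifiers and titles from the response. The repository URL is hypothetical; the verb, parameter and namespaces are those defined by the OAI-PMH specification, and a full harvester would also follow resumptionTokens.

    # A minimal OAI-PMH harvest request (Python). The base URL is a
    # hypothetical data provider; any OAI-compliant archive answers the
    # same protocol verbs over plain HTTP.
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE_URL = "http://archive.example.ac.uk/oai"   # hypothetical repository

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def list_records(metadata_prefix="oai_dc"):
        """Issue ListRecords and return (identifier, title) pairs."""
        url = f"{BASE_URL}?verb=ListRecords&metadataPrefix={metadata_prefix}"
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        results = []
        for record in tree.iter(OAI + "record"):
            identifier = record.findtext(f"{OAI}header/{OAI}identifier")
            title = record.findtext(f".//{DC}title")
            results.append((identifier, title))
        return results

    if __name__ == "__main__":
        for identifier, title in list_records():
            print(identifier, "-", title)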
A report was produced, setting out user requirements, a comparison of technical options and
recommendations including costs. The report was then peer reviewed by subject experts and
by a sample of those individuals who had been asked to provide input during the requirements
analysis stage.
4.0 Requirements Study
A requirements study was undertaken to investigate how services and organisations currently
store and distribute data and what they would consider using in the future. The
investigation was survey-based and focused on access, licensing, funding and policies.
The requirements survey was carried out electronically due to time constraints for completing
this work package. Ideally, if time had allowed, face-to-face interviews or focus group
discussion sessions would have been conducted, as these could have produced a greater
response rate.
The survey was conducted via the web, workshops, mailing lists and by emailing UK GIS
lecturers and other key academics.
The questionnaire was sent to:
 gogeogeoxwalk mailing list (http://www.jiscmail.ac.uk./lists/gogeogeoxwalk.html);
 ESDS site reps and website (http://www.esds.ac.uk);
 IBG Quantitative Methods Research Group
(http://www.ncl.ac.uk/geps/research/geography/sarg/qmrg.htm);
 Geographical Information Science Research Group (GIScRG) (http://www.giscience.info/);
 GIS-UK (http://www.jiscmail.ac.uk/lists/GIS-UK.html).
The questionnaire was posted on both the project web pages and on the Go-Geo! portal. The
first 100 respondents who fully completed the survey received a book token, and were entered
into the evaluation prize draw to win an Amazon voucher.
A copy of the survey questionnaire can be found in Appendix A.
4.1 Results
Fewer responses were gained than was initially hoped (ten responses). This was thought to be
due to the time of year (exam pressure, marking etc) as well as survey fatigue amongst
respondents, particularly as many of the contacts had also been invited to evaluate the
Go-Geo! portal earlier in the year. Those responses we did receive, however, were from experts
and were therefore regarded as being of high quality. Percentages have predominantly been used
within the following section as some questions were not answered by respondents, or in some
cases multiple answers were given; the number of responses was therefore not always ten.
From the questionnaires submitted, it is clear that the majority of respondents (63%) would like
to see a centralised archive and distribution service put into place for the portal. A number of
respondents (27%) would like to see a distributed service based on a number of self-archives
located around the country. None of the respondents wanted to see a peer-to-peer network set
up. The reasons behind these choices (see Table 1) included ease of use, cost, need to ensure
long-term availability (preservation) and the provision of a user support system.
Reasons for Choice of Data Distribution System           Number of Responses
ease of use                                                       8
cost                                                              8
need to ensure long-term availability (preservation)              8
copyright issues                                                  4
data security                                                     3
user support                                                      5
depositor support                                                 5
other (please specify)                                            0
Table 1. Factors affecting choice of data distribution system
Two-thirds of respondents (66%) wanted direct contact with depositors to be allowed,
although they would want trivial questions to be dealt with through good
documentation, help facilities and the service provider. Half of respondents felt that access to
the system should be restricted to gatekeepers e.g. data librarians or service providers.
Most of the respondents (85%) would like to see original and new datasets in the archive and
for them to be made available online (66%). All respondents would like datasets stored in
formats such as CSV, shape files and GML and 62% of respondents would like to see the
provision of data with complete supporting material, such as guidance notes and copyright/IPR
statements. The majority of respondents (60%) thought that usage of the data should be
tracked. Most of the respondents (80%) also thought that data should persist indefinitely,
according to the perceived utility of the data by others.
The respondents felt strongly that data quality and the accuracy of data were very important, as was
the implementation of standards for spatial data, data transfer and data documentation. As far
as depositing and using data are concerned, the respondents' largest concerns were with
data quality and liability. Data quality was considered to be a key issue: if the data are of
poor quality, any analysis performed on them will also be less robust.
In terms of data deposition, the respondents were least concerned about confidentiality, IPR
and the need for specialist support. In the context of using data, they were least concerned
about confidentiality, data security and protecting the integrity of the data creator. These scores
were slightly contradicted by some comments which showed that data creators do have
concerns about IPR: if the rights of the depositor are not protected, they are less likely to wish
to deposit their data. In terms of submission of data, the most important factors were
considered to be ease of use and speed of the deposition process.
The table below shows the scores gained for each issue. The results were ranked from 1 to 7,
with 1 being the most important; a lower total score therefore indicates that an issue was
considered more important by the respondents than one with a higher score.
Issue of Concern                                 Depositing Data    Using Data
IPR                                                     43              35
Confidentiality                                         53              55
Liability                                               33              29
Data quality/provenance                                 23              12
Need for specialist support                             47              34
Data security                                           33              45
Protecting the integrity of the data creator            33              44
Table 2. The respondents' concerns for data deposition and use
Responsibility for any conversion or other work required to meet data transfer and preservation
standards and for the active curation of the dataset, was thought to lie with data depositors,
archivists and service providers. When considering who should be responsible for maintaining
the dataset once a team disbands, there was a mixed response, with the answers including
archivists, service providers, data creators, data librarians and a designated team member.
The majority of the respondents (58%) held their own data, but also jointly held data with
another organisation and held Crown data. Of those that held their own data 81% of
respondents would like to make their data freely accessible to academics. Those who do not
hold their own data expressed a preference (60%) to have copyright managed by a central
specialist organisation.
The majority (82%) of the respondents felt that data creators should not incur royalties, but
should also not be charged by service providers for depositing data. Almost half of the
respondents (44%) felt that users should be charged to cover service costs and 56% of the
respondents felt that the service should be free. Most of the respondents (80%) felt that funders
should cover researchers' costs of preparing the data for sharing. There was a mixed response
as to who or which organisation should be responsible for promoting, facilitating and funding
data sharing (see Table 3).
Organisation                                      Number of Responses
                                                  Promoting      Facilitating   Funding a service which
                                                  data sharing   data sharing   facilitates data sharing
Funders of data creation awards                        7              3                    6
The creators themselves                                5              6                    0
The institution within which the creator works         7              5                    0
National repositories (where they exist)               4              0                    7
The JISC                                               3              3                    6
Other (please state who)                               0              7                    0
Table 3. Who should be responsible for promoting, facilitating and funding data sharing
4.2 Summary
The results from the survey provide guidance on the direction in which data creators would
like to move with regard to a data distribution system. The majority of the respondents
would like to see the provision of:
 a centralised archive;
 online access to data;
 copyright agreements and data integrity provided for;
 a service where user and depositor support is provided;
 preservation and archiving facilities provided by the archive;
 a system which is both easy and fast to use;
 supporting material made available;
 a service which is free to deposit and possibly free to use.
From this list, it can be seen that the most obvious recommendation would be to use or set up a
traditional type of archive. Second to this choice would be the use of a number of distributed
archives. This could include well-established archives such as the UKDA and Archaeology Data
Service (ADS), or could see the setting up of institutional archives.
5.0 Literature Review
A literature review was conducted to investigate three methods of storing and distributing data:
self-archiving, peer-to-peer, and traditional archiving.
The study considered issues such as copyright, cost and software as well as the advantages
and disadvantages of each type of data distribution system. Case studies have also been
included with examples of both software and service usage.
5.1 Self Archiving
Self-archiving allows for the free distribution of data across the web, a medium which provides
wide and rapid dissemination of information. The purpose of self-archiving is to make full text
documents visible, accessible, harvestable, searchable and useable by any potential user with
access to the internet.
To self-archive is to deposit a digital document in a publicly accessible website, for example an
OAI-compliant eprint archive (an eprint archive is a collection of digital documents of peer-reviewed research articles, before and after refereeing. Before refereeing and publication, the
draft is called a "preprint." The refereed, published final draft is called a "postprint" or “eprint”).
Depositing involves a simple web interface where the depositor copies or pastes in the
metadata (date, author-name, title, journal-name, etc.) and then attaches the full-text
document. Software is also being developed to allow documents to be self-archived in bulk,
rather than just one by one.
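To illustrate what bulk deposit might look like in practice, the sketch below loops over a batch of items carrying the metadata fields mentioned above and sends each record to a deposit endpoint. The endpoint URL, field names and JSON encoding are assumptions made purely for illustration; they do not describe the interface of any particular archive.

    # A sketch of bulk self-archiving against a hypothetical deposit
    # endpoint. Only the metadata is sent here; a real service would also
    # accept the attached full-text file.
    import json
    import urllib.request

    DEPOSIT_URL = "http://eprints.example.ac.uk/deposit"   # hypothetical

    items = [
        {"title": "Example working paper", "author": "A. Researcher",
         "date": "2004-07-01", "journal": "n/a", "file": "paper1.pdf"},
        # ... further items read from a local directory or spreadsheet
    ]

    def deposit(item):
        """POST one metadata record and return the HTTP status code."""
        body = json.dumps({k: v for k, v in item.items() if k != "file"}).encode()
        request = urllib.request.Request(
            DEPOSIT_URL, data=body, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            return response.status

    for item in items:
        print(item["title"], "->", deposit(item))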
Self-archiving systems can be either centralised or distributed. There is little difference to users
between self-archiving documents in one central archive or many distributed archives, as users
need not know where documents are located in order to find, browse and retrieve them and the
full texts are all retrievable. Standards used for metadata may also not be made apparent to
users.
There are two types of self-archive: subject-based and institution-based. Several subject-based
archives are in use at present, including ArXiv and CogPrints. There are also institution-based
archives in existence, such as the ePrints archive based at Southampton University.
Distributed, institution-based self-archiving benefits research institutions by:
• maximising the visibility and impact of their own refereed research output;
• maximising researchers' access to the full refereed research output of all other institutions;
• potentially reducing the library's annual serials expenditure to around 10% of current levels (in the form of fees paid to journal publishers for the quality control of their own research output, instead of tolls for accessing other researchers' output).
An institutional library can help researchers to do self-archiving and can maintain the
institution's own refereed eprint archives as an outgoing collection for external use, in place of
the old incoming collection, acquired at journal subscription cost for internal use. Institutional library consortial
power can also be used to provide leveraged support for journal publishers who commit
themselves to a timetable of downsizing to becoming pure quality-control service providers
(Harnad 2001).
The advantages of subject-based archives are that they are specific to the needs and
requirements of a discipline. Many of the repositories use OAI to facilitate interoperability
between repository servers (Pinfield Sept 2003).
Potential problems of self-archiving include quality control, copyright issues and potential lack
of preservation and/or access control. The principal potential problem with self-archiving is
actually getting the content for the repository, which requires a cultural change through
persuading researchers of the benefits of self-archiving and of data sharing.
To date, self-archiving has been about depositing a digital document, typically a full-text
document (frequently an eprint), in a publicly accessible web site. Consideration of the use of
self-archiving for depositing copies of datasets seems to have been limited. Further issues, which
must be considered, will arise if self-archiving is used for the distribution of datasets. One such
issue is that whilst electronic copies of printed material will have been peer reviewed, this is not
the case with datasets, and there is therefore a lack of quality control. Another area of concern is
the format of datasets, which may be more complex than simple text documents, especially
geospatial datasets, which could be in formats such as SHP files and will require more support
for depositors. IPR will also be of greater concern with datasets, particularly geospatial ones, as
they may contain more than one source of material, e.g. OS data, government data and
primary sources.
5.1.1 Copyright/IPR
There are clear guidelines for copyright and IPR in relation to self-archiving and eprints. It
seems that the author retains the copyright for the pre-refereeing preprint (it can therefore be self-archived without seeking anyone else's permission), but not for postprints. For the refereed
postprint, the author can try to modify the copyright transfer agreement to allow self-archiving.
In those cases where the publisher does not agree to modify the copyright transfer agreement
so as to allow the self-archiving of the refereed final draft (postprint), a corrigenda file can
instead be self-archived, alongside the already archived preprint, listing the changes that need
to be made to make it into a postprint.
Self-archiving of one's own non-plagiarised texts is in general legal in all cases except:
 where exclusive copyright in a "work for hire" has been assigned by the author to a
publisher (i.e. the author has been paid (or will be paid royalties) in exchange for the text)
the author may not self-archive it. The text is still the author's intellectual property, in the
sense that authorship is retained by the author, and the text may not be plagiarised by
anyone, but the exclusive right to sell or give away copies of it has been transferred to the
publisher;
 where exclusive copyright has been assigned by the author to a journal publisher for a
peer-reviewed draft, refereed and accepted for publication by that journal, then that draft
may not be self-archived by the author (without the publisher's permission).
Questions which should be asked when considering self-archiving as an option include:
 are there any rights in an individual metadata record? If so, who owns them?
 do data providers wish to assert any rights over either individual metadata records, or data
collections? If so, what do they want to protect, and how might this be done?
 do data providers disclose any rights information relating to the documents themselves i.e
the metadata?
 how do service providers ascertain the rights status of the metadata they’re harvesting?
 do service providers enhance harvested metadata records, creating new IPR?
 do service providers want to protect their enhanced records? If so, how?
 how do service providers make use of any rights information relating to the documents
themselves?
Project RoMEO based at Loughborough University is investigating copyright issues. More
detailed information, including survey results, can be found from their web site
(http://www.lboro.ac.uk/departments/ls/disresearch/romeo/).
5.1.2 Preservation
There is a concern that archived eprints may not be accessible online in the future. This worry
is not really about self-archiving, but about the online medium itself. However, this concern may
be unnecessary as it seems that many of the self-archives use OAI and Dublin Core, which
ensures they are interoperable. For example, the repository arXiv.org was set up in 1991 and
all of its contents are still accessible today. In any case, if Harnad's definition of eprints as
duplicates of conventionally published material is used, then preservation could be seen as
needless in the short term (Pinfield 2003).
5.1.3 Peer review
Refereeing (peer review) is the system of evaluation and feedback by which expert researchers
quality-control each other's research findings. The work of specialists is submitted to a qualified
adjudicator, an editor, who in turn sends it to experts (referees) to seek their advice about
whether the paper is potentially publishable, and if so, what further work is required to make it
acceptable. The paper is not published until and unless the requisite revision can be and is
done to the satisfaction of the editor and referees (Harnad 2000).
Neither the editor nor the referees are infallible. Editors can err in the choice of specialists, or
can misinterpret or misapply referees' advice. The referees themselves can fail to be
sufficiently expert, informed, conscientious or fair (Harnad 2000). Nor are authors always
conscientious in accepting the dictates of peer review. It is known among editors that virtually
every paper is eventually published somewhere: there is a quality hierarchy among journals,
based on the rigour of their peer review, all the way down to an unrefereed vanity press at the
bottom. Persistent authors can work their way down until their paper finds its own level, not
without considerable wasting of time and resources along the way, including the editorial office
budgets of the journals and the freely given time of the referees, who might find themselves
called upon more than once to review the same paper, sometimes unchanged, for several
different journals (Harnad 2000).
The system is not perfect, but no one has demonstrated any viable alternative to having experts
judge the work of their peers, let alone one that is at least as effective in maintaining the quality
of the literature as the present one is (Harnad 2000). Improving peer review first requires
careful testing of alternative systems, and demonstrating empirically that these alternatives are
at least as effective as classical peer review in maintaining the quality of the refereed literature.
The self-archiving initiative is directed at freeing the current peer-reviewed literature; it is not
directed at freeing the literature from peer review (Harnad 2001).
5.1.4 Cost
Referees' services are donated free to virtually all scientific journals, but there is a real cost to
implementing the refereeing procedures, which include archiving submitted papers onto a
website; selecting appropriate referees; tracking submissions through rounds of review and
author revision; making editorial judgments, and so on (Harnad 2001).
The minimum cost of implementing refereeing has been estimated as $500 per accepted article
but even that figure almost certainly has inessential costs wrapped into it (for example, the
creation of the publisher's PDF). The true figure for peer-review implementation alone across all
refereed journals probably averages much closer to $200 per article or even lower. Hence,
quality control costs account for only about 10% of the collective tolls actually being paid per
article. The optimal solution would be free online data for everyone. The 10% or so quality-control cost could be paid in the form of quality-control service costs, per paper published, by
authors' institutions, out of their savings on subscription costs (Harnad 2001).
5.1.5 Software
There are a number of self-archiving systems currently in use; the software for some of these is
described below.
ePrints (http://www.eprints.org)
The freely available ePrints software has been designed so institutions or even individuals can
create their own OAI-compliant ePrint archive (ePrints include both preprints and postprints, as
well as any significant drafts in between, and any post publication updates). Setting up the
archive only requires some space on a web server and is relatively easy to install.
The ePrints and self-archiving initiative does not undertake the filtering function of existing
libraries and archives, nor their indexing or preservation function. The document archiving
facilities of the ePrints software, developed by Southampton University, can now be extended
to provide storage for raw scientific data as well as the capability of interoperable processing.
There is already the potential for widespread adoption of the ePrints software by universities
and research institutions worldwide for research report archiving.
The ePrints software is one implementation of an OAI-protocol-conforming server on a UNIX
operating system. It draws on a set of freely and publicly available tools (MySQL, Apache etc.). All
OAI-compliant ePrint archives share the same metadata, making their contents interoperable
with one another. This means their contents are harvestable by cross-archive search engines
like Arc or Citebase.
A limitation of OAI is that the metadata are available only in Dublin Core. The ePrints
software uses a more complex nested metadata structure to represent archived items
internally. This is accessible for search from a web page on the archive machine itself.
However, the interoperable service provided to OAI data services only supports the Dublin Core
subset of this.
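The sketch below shows the kind of flattening this implies: a richer, nested record is reduced to the unqualified Dublin Core elements exposed over OAI. The internal record structure and field names shown are hypothetical and are not the representation actually used by ePrints.

    # A sketch of flattening a nested internal record into the unqualified
    # Dublin Core subset exposed over OAI. The input structure is invented
    # for illustration.
    def to_dublin_core(record):
        """Map a nested internal record to flat Dublin Core elements."""
        return {
            "dc:title": record["title"],
            "dc:creator": [c["name"] for c in record["creators"]],
            "dc:date": record["dates"]["published"],
            "dc:type": record["type"],
            # per-file detail collapses into repeated dc:format values
            "dc:format": sorted({f["mime"] for doc in record["documents"]
                                 for f in doc["files"]}),
        }

    example = {
        "title": "River catchment survey data",
        "creators": [{"name": "Missen, J."}],
        "dates": {"published": "2004"},
        "type": "dataset",
        "documents": [{"files": [{"mime": "text/csv"},
                                 {"mime": "application/pdf"}]}],
    }
    print(to_dublin_core(example))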
The main alterations expected to the ePrints system in the future could include:
• improvements to the documentation, which at present is minimal;
• improvements to the installation procedures, which at present require a skilled UNIX administrator to install the software; this should be done using one of the commercial installation packages, such as InstallAnywhere;
• changing the structure of the metadata representation used, by mapping the existing document metadata structure to the hierarchical structure used for data metadata;
• a new metadata format to be defined within the ePrints system to support this mapping;
• further interoperability for information that does not consist only of text objects or multimedia sound/images, but arbitrary raw data. The existing metadata structure supports ePrints that consist of a set of documents, which in turn consist of sets of e-print files (e.g. an HTML document can consist of a set of files).
To implement this new metadata structure it will be necessary to apply:
 new strings bundle files for the new metadata input and browse facilities on the server;
 a new mapping from the data metadata to Dublin Core for OAI access;
 new metadata fields for the representation.
Dspace (http://dspace.org/)
DSpace is an open-source, web-based system produced by Massachusetts Institute of
Technology Libraries. DSpace is a digital asset management software platform that enables
institutions to:
 capture and describe digital works using a submission workflow module;
 distribute an institution's digital works over the web through a search and retrieval system;
 store and preserve digital works over the long term.
DSpace institutional repository software is available for download and runs on a variety of
hardware platforms. It functions as a repository for digital research and educational material
and can be both modified and extended to meet specific needs.
DSpace is Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) compliant and
uses Dublin Core, with optional additional fields, including abstract, keywords, technical
metadata and rights metadata. The metadata is indexed for browsing and can be exported,
along with digital data, using an XML schema. OAI support was implemented using OCLC’s
OAICat open-source software. The OAICat Open Source project is a Java servlet web
application providing an OAI-PMH v2.0 repository framework. This framework can be
customised to work with arbitrary data repositories by implementing some Java interfaces.
DSpace@MIT is registered as a data provider with the Open Archives Initiative. Other
institutions running DSpace may choose to turn on OAI or not, and to register as a data
provider or not.
The DSpace system is freely available as open-source software and as such, users are allowed
to modify DSpace to meet an organisation’s specific needs. Open-source tools are freely
available with the DSpace application under an open-source license (not all the same license
as the one for DSpace itself). The BSD distribution license
(http://www.opensource.org/licenses/bsd-license.php) describes the specific terms of use.
DSpace accepts a variety of digital formats; some examples (taken from http://dspace.org/faqs/index.html#standards) are:
• documents, such as articles, preprints, working papers, technical reports, conference papers and books;
• datasets;
• computer programs;
• visualisations, simulations, and other models;
• multimedia publications;
• administrative records;
• bibliographic datasets;
• image, audio and video files;
• learning objects;
• web pages.
Currently, DSpace supports exporting digital content, along with its metadata, in a simple XML-encoded file format. The DSpace developers are working on migrating this export capability to
use the Metadata Encoding and Transmission standard (METS), but are waiting for some
necessary extension schemas to emerge (i.e. one for qualified Dublin Core metadata, and one
for minimal technical/preservation metadata for arbitrary digital objects). DSpace has
documented Java Application Programming Interfaces (APIs) which can be customised to allow
interoperation with other systems an institution might be running.
DSpace identifies two levels of digital preservation: bit preservation, and functional
preservation:
 bit preservation ensures that a file remains exactly the same over time (not a single bit is
changed) while the physical media evolve around it;
 functional preservation allows the file to change over time so that the material continues to
be immediately usable in the same way it was originally while the digital formats (and
physical media) evolve over time.
Some file formats can be functionally preserved using straightforward format migration, such as
TIFF images or XML documents. Other formats are proprietary, or for other reasons are much
harder to preserve functionally.
There are three levels of preservation defined for a given format: supported, known, or
unsupported:
• supported formats will be functionally preserved using either format migration or emulation techniques, for example TIFF, SGML, XML, AIFF and PDF;
• known formats are those which cannot be guaranteed to be preserved, such as proprietary or binary formats, but which are so popular that third-party migration tools are likely to emerge to help with format migration. Examples include Microsoft Word and PowerPoint, Lotus 1-2-3, and WordPerfect;
• unsupported formats are those about which not enough is known to attempt any sort of functional preservation. These would include some proprietary formats or a one-of-a-kind software program.
For all three levels, DSpace does bit-level preservation to provide raw material to work with if
the material proves to be worth that effort. DSpace developers are working in conjunction with
partner institutions (particularly Cambridge University) to develop new upload procedures for
converting unsupported or known formats to supported ones where advisable, and to enhance
DSpace’s ability to capture preservation metadata and to perform periodic format migrations.
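Bit-level preservation of this kind depends on being able to show that a stored file has not changed by a single bit. One common way of doing this, sketched below on the assumption of simple checksum auditing, is to record a digest when a file is ingested and to re-verify it periodically afterwards; the sketch illustrates the idea only and is not DSpace's own implementation.

    # A sketch of bit-level preservation checking: record a digest at
    # ingest, then re-verify it later to confirm the file is unchanged.
    import hashlib
    import os
    import tempfile

    def digest(path, chunk_size=1 << 20):
        """Return the SHA-256 digest of a file, read in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify(path, recorded_digest):
        """True if the stored file is bit-for-bit identical to ingest."""
        return digest(path) == recorded_digest

    if __name__ == "__main__":
        # Stand-in for a stored archive item.
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(b"example dataset bytes")
            stored = tmp.name
        recorded = digest(stored)            # recorded at ingest time
        print("intact" if verify(stored, recorded) else "corrupted")
        os.unlink(stored)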
Kepler (http://kepler.cs.odu.edu:8080/kepler/index.html)
The original Kepler concept is based on the Open Archives Initiative. Kepler gives users the
ability to self-archive publications by means of an "archivelet": a self-contained, self-installing
software system that functions as an Open Archives Initiative data provider. An archivelet has
the tools to let the user publish a report as it is, or post it to a web site, yet have a fully
OAI-compliant digital library that can be harvested by a service provider. Kepler archivelets are
designed to be easy to install, use and maintain. Kepler can provide tailored publication and search
services with broad and fast dissemination; it is also interoperable with other communities.
The publication tools to create an archivelet form a downloadable, platform-independent software
package that can be installed on individual workstations and PCs. This is different from, for
example, the eprints.org OAI-compliant software package, which is intended for institutional-level
service. The archivelet needs an extremely easy-to-use interface for publishing and needs to be
an OAI-compliant data provider.
The archivelet is expected to store relatively few objects and, to retain its independence, a
native file system will be used rather than, for example, a database system. In supporting
archivelets, the registration service takes on a bigger role than the registration server plays in
regular OAI. The number of archivelets is expected to be on the order of tens of thousands, and
their state, in terms of availability, will show great variation.
Currently, the OAI registration service keeps track of OAI-compliant archives and the current
registration process is mostly manual. In contrast to data providers at an organisational level,
archivelets will switch more frequently between active and non-active states. It will be
necessary for the registration service to keep track of the state of the registered archivelets in
support of higher-level services. For this, the concept is borrowed from Napster and the instant-messenger model, where the central server keeps track of active clients.
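A sketch of that presence-tracking idea is given below: archivelets report in periodically, and the registration service treats any archivelet it has not heard from recently as inactive. The class, method names and timeout are invented for illustration and are not the actual Kepler implementation.

    # A sketch of Napster/instant-messenger style presence tracking for
    # archivelets: the central registry records the time of each ping and
    # reports which archivelets are currently active.
    import time

    class RegistrationService:
        def __init__(self, timeout_seconds=300):
            self.timeout = timeout_seconds
            self.last_seen = {}          # archivelet base URL -> last ping

        def ping(self, archivelet_url):
            """Called by an archivelet to announce it is still online."""
            self.last_seen[archivelet_url] = time.time()

        def active_archivelets(self):
            """Archivelets heard from within the timeout window."""
            now = time.time()
            return [url for url, seen in self.last_seen.items()
                    if now - seen <= self.timeout]

    registry = RegistrationService()
    registry.ping("http://archivelet.example.ac.uk/oai")   # hypothetical
    print(registry.active_archivelets())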
The OAI presents a technical and organisational metadata-harvesting framework designed to
facilitate the discovery of content stored in distributed archives. The framework consists of two
parts: a set of simple metadata elements (for which OAI uses Dublin Core), and a common
protocol to enable extraction of document metadata and archive-specific metadata from
participating archives. The OAI also defines two distinct participants: data provider and service
provider.
The current OAI framework is targeted at large data providers (at the organisational level). The
Kepler framework, based on the OAI to support archivelets, is meant for many "little" publishers.
The Kepler framework promotes fast dissemination of technical articles by individual publishers.
Moreover, it is based on interoperability standards that make it flexible enough to build higher-level services for communities sharing specific interests.
Figure 1 shows the four components of the Kepler framework: OAI-compliant repository,
publishing tool, registration service, and service provider. The OAI-compliant repository, along
with the publishing tool, is also referred to as the archivelet and is targeted at individual publishers.
Figure 1. Kepler framework.
The registration service keeps track of registered archivelets including their state of availability.
The service provider offers high-level services, such as a discovery service, that allows users to
search for a published document among all registered archivelets.
The Kepler framework supports two types of users: individual publishers using the archivelet
publishing tool, and general users interested in retrieving published documents. The individual
publishers interact with the publishing tool and the general users interact with a service provider
and an OAI-compliant repository using a browser. In a way, the Kepler framework looks very
similar to a broker-based peer-to-peer (P2P) network model (Figure 2). Typically, a user is both
a data provider and a customer who accesses a service provider, thus the primary mode of
operation might be construed as one of exchanging documents.
Figure 2. Kepler Framework and Peer-to-Peer Network Model.
The archivelet combines the OAI-compliant repository and the publication tool in a
downloadable and self-installable component. Only OAI requests are supported, not any other
http actions. The basic service part of Kepler is the discovery service Arc.
Figure 3. Kepler architecture.
5.1.6 Case Studies
The following case studies illustrate how the systems discussed above have been used.
ArXiv (http://uk.arxiv.org/)
ArXiv is an e-print service in the fields of physics, mathematics, non-linear science, computer
science, and quantitative biology. The contents of arXiv conform to Cornell University academic
standards. ArXiv is owned, operated and funded by Cornell University, a private not-for-profit
educational institution. ArXiv is also partially funded by the National Science Foundation.
Started in August 1991, arXiv.org (formerly xxx.lanl.gov) is a fully automated electronic archive
and distribution server for research papers. Areas covered include physics and related
disciplines, mathematics, nonlinear sciences, computational linguistics, and neuroscience.
Users can retrieve papers from the archive either through an online world wide web interface,
or by sending commands to the system via email. Similarly, authors can submit their papers to
the archive either using the online world wide web interface, using ftp, or using email. Authors
can update their submissions if they choose, though previous versions remain available. Users
can also register to automatically receive an email listing of newly submitted papers in areas of
interest to them.
In addition, the archive provides for distribution list maintenance and archiving of TeX macro
packages and related tools. Mechanisms for searching through the collection of papers are also
provided.
CogPrints (http://cogprints.ecs.soton.ac.uk/)
CogPrints is an electronic archive for papers in any area of Psychology, Neuroscience,
Linguistics, Computer Science, Biology, Medicine and Anthropology. CogPrints is running on
eprints.org open archive software.
DSpace@Cambridge (http://www.lib.cam.ac.uk/dspace/index.htm)
Cambridge University is undertaking a project to create DSpace@Cambridge.
The project is a collaboration between Cambridge University Library, Cambridge University
Computing Service, and the MIT Libraries, and is funded by a grant from the Cambridge-MIT
Institute.
DSpace will be developed further as a means of digital preservation and will include the ability
to support learning management systems.
The project aims to:
 provide a home for digitised material from the University library's printed and manuscript
collections;
 capture, index, store, disseminate, and preserve digital materials created in any part of the
University;
 contribute to the development of the open source DSpace system, working with other
members of the DSpace Federation of academic research institutions;
 act as an exemplar site for UK higher and further education institutions.
Theses Alive! (http://www.thesesalive.ac.uk/index.shtml)
Theses Alive! is based at the University of Edinburgh and is funded under the JISC Focus on
Access to Institutional Resources (FAIR) Programme. The Theses Alive! project is seeking to
promote the adoption of a management system for electronic theses and dissertations (ETDs)
in the UK, primarily by creating an online ETD submission system that mirrors the current
submission process, together with an online repository of
digital PhD theses.
Including a thesis in the Edinburgh University repository gives a global audience access to the
research findings, allowing for wide exposure and recognition. The University will provide
metadata from theses held in the Edinburgh University repository to known service providers.
By using interoperability standards (e.g. OAI-PMH) service providers can allow researchers
from institutes anywhere in the world to easily search and find relevant material. An additional
benefit is that once in a repository, the work is protected from physical damage and loss.
Theses Alive! uses DSpace for the archive as all items have a kind of wrapper in which the
parts of the relevant data are stored. This includes all the individual files and the copyright
licence. The metadata is maintained in Dublin Core format in the database for as long as the
item remains in the repository. Security settings for the repository are handled via the
authorisation policy tool and the security of the archive depends upon the way that the DSpace
administrator configures the policies for each community, collection, and item.
The DSpace archive is perhaps more geared toward digital preservation, although this issue is
still very much in debate. It may be that digital preservation is an issue which is never 'solved'
but which requires constant attention by those wishing to preserve and may not necessarily
have anything to do with the software package in question. For this reason it is hard for us to be
sure which package is going down the correct route, or even whether such a route exists.
It was envisaged that the university library would have a role as the key university agent in the
thesis publishing process. In this role, it would provide supportive documentation to
postgraduate students, via their departments, at the commencement of their dissertations and
theses, drawing on the support of the national pilot service. It would ensure that thesis authors
are given training in the use of thesis submission software several months before they were
due to submit. It would receive submissions once they had been signed off by the relevant
registrar, whether at departmental, faculty or central university level. The signing off is of
course the last stage in the academic validation process, and follows on from the successful
defence of the thesis by the student and the award of the postgraduate degree. The library
therefore takes the role of trusted intermediary in what is essentially a triangular relationship,
thus:
1. during the course of their research, the thesis author sends the library the basic metadata
for their thesis;
2. the thesis author then submits the full thesis to the university;
3. the university submits the successfully defended thesis to the library;
4. the library matches thesis to metadata, and ensures that metadata are complete and that
the university validation has taken place;
5. the library finally releases metadata and, if appropriate, the full text of the thesis, including
supplementary digital material (to publishers, lenders or sellers).
Once the library is satisfied with the metadata, they are released by the system, and the same
set of metadata is used by the various agencies providing publication, loan or sale services. It
is not likely that the identical set of metadata will be required by each of these agencies, so the
system operated by the library should accommodate a superset, from which appropriate
subsets can be generated for the requirements of agencies.
It is recommended that the metadata employ an appropriate Dublin Core-based metadata set,
using a Document Type Definition suitable for theses and dissertations, and marked up in XML
to allow ease of repurposing.
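As an illustration, the sketch below builds a minimal Dublin Core thesis record as XML using only core DC elements. A metadata set defined by a thesis-specific Document Type Definition would add further fields (degree, institution, supervisor and so on), so the element choice here is an assumption for illustration only.

    # A sketch of a Dublin Core-based thesis record marked up in XML.
    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC)

    record = ET.Element("record")
    for element, value in [
        ("title", "An Example PhD Thesis"),
        ("creator", "Student, A."),
        ("date", "2004"),
        ("type", "Electronic thesis or dissertation"),
        ("publisher", "University of Edinburgh"),
        ("format", "application/pdf"),
    ]:
        ET.SubElement(record, f"{{{DC}}}{element}").text = value

    print(ET.tostring(record, encoding="unicode"))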
SHERPA
SHERPA (Securing a Hybrid Environment for Research Access and Preservation) is a FAIR
project. The main purpose of the SHERPA Project, led by the University of Nottingham, is the
creation, population and management of several e-print repositories based at several partner
institutions. These projects can step in to help the process by providing a more stable platform
for effective collation and dissemination of research.
The SHERPA project aims to:
 set up thirteen institutional open access e-print repositories which comply with the Open
Archives Initiative Protocol for Metadata Harvesting (OAI PMH) using eprints.org software;
 investigate key issues in creating, populating and maintaining e-print collections, including:
Intellectual Property Rights (IPR), quality control, collection development policies, business
models, scholarly communication cultures, and institutional strategies;
 work with OAI Service Providers to achieve acceptable (technical, metadata and collection
management) standards for the effective dissemination of the content;
 investigate digital preservation of e-prints using the Open Archival Information System
(OAIS) Reference Model;
 disseminate lessons learned and provide advice to others wishing to set up similar
services.
SHERPA will work with ePrints UK, which will provide search interfaces and will allow for
searching of metadata harvested from hubs using web services. OCLC software is used, as well
as the University of Southampton's OpenURL citation service, which enables citation analysis
(Pinfield March 2003).
5.2 Peer to Peer Archiving (P2P)
Peer-to-peer (P2P) networking is a technological communications method where all parties are
equal and any node can operate as either a server or a client (Krishnan 2001). Each party has
the same capabilities and either party can initiate a communication session. This differs from
client/server architectures, in which some computers are dedicated to serving the others. In
some cases, peer-to-peer communication is implemented by giving each communication node
both server and client capabilities (known as servents). Peer-to-peer is both flexible and
scalable and could become an invaluable tool to aid collaboration and data management within
a common organisational structure.
On the web, P2P refers specifically to a network established by a group of users sharing a
networking program, to connect with each other and exchange and access files directly or
through a mediating server.
Currently, the most common distributed computing model is the client/server model. In the
client/server architecture (Figure 4), clients request services and servers provide those
services. A variety of servers exist in today's internet, for example, web servers, mail servers,
FTP servers, and so on. The client/server architecture is an example of a centralised
architecture, where the whole network depends on central points, namely servers, to provide
services. Without the servers, the network would make no sense and the web browsers would
not work. Regardless of the number of browsers or clients, the network can exist only if a server
exists (Krishnan 2001).
Figure 4. The typical client/server architecture.
Like the client/server architecture, P2P is also a distributed computing model, but there is an
important difference. The P2P architecture is decentralised (see Figure 5), where neither client
nor server status exists in a network. Every entity in the network, referred to as a peer, has
equal status, meaning that an entity can either request a service, a client trait, or provide a
service, a server trait (Krishnan 2001).
Figure 5. The peer-to-peer model
A P2P network also differs from the client/server model in that the P2P network can be
considered alive even if only one peer is active. The P2P network is unavailable only when no
peers are active (Krishnan 2001).
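The servent idea described above can be illustrated with a minimal Python sketch in which a single node runs a listening (server) socket while also initiating outgoing (client) connections. The port number and the trivial acknowledgement exchange are assumptions made for illustration; no real P2P protocol is implied.

```python
import socket
import threading
import time

def serve(port):
    """Server role: accept incoming connections and acknowledge messages."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", port))
    listener.listen()
    while True:
        conn, _addr = listener.accept()
        with conn:
            data = conn.recv(1024)
            conn.sendall(b"ACK " + data)

def request(port, message):
    """Client role: connect to another peer and send a message."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        conn.sendall(message)
        return conn.recv(1024)

# Each servent starts its server side in the background and can then act as
# a client towards other peers; here it simply talks to itself for brevity.
threading.Thread(target=serve, args=(9001,), daemon=True).start()
time.sleep(0.2)   # give the listener a moment to bind
print(request(9001, b"hello from another peer"))
```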
5.2.1 How Does P2P Work?
The user must first download and execute a peer-to-peer networking program. After launching
the program, the user enters the IP address of another computer belonging to the network
(typically, the web page where the user got the download will list several IP addresses as
places to begin). Once the computer finds another network member online, it will connect to
that user's computer, which in turn is connected to others, and so on.
Users can choose how many member connections to seek at one time and determine which
files they wish to share or password protect. The extent of this peer-to-peer sharing is limited to
the circle of computer users an individual knows and has agreed to share files with. Users who
want to communicate with new or unknown users can transfer files using IRC (Internet Relay
Chat) or other similar bulletin boards dedicated to specific subjects.
There are now a number of advanced P2P file-sharing applications, and the reach and scope
of peer networks have increased dramatically. The two main models that have evolved are the
centralised model and the decentralised model, the latter used, for example, by Gnutella.
5.2.2 The Centralised Model of P2P File-Sharing
One model of P2P file sharing is based around the use of a central server system (the server-client structure), which directs traffic between individual registered users. The central servers
maintain directories of the shared files stored on the respective PCs of registered users of the
network. These directories are updated every time a user logs on or off the server network.
Each time a user of a centralised P2P file sharing system submits a request or search for a
particular file, the central server creates a list of files matching the search request, by cross-checking the request with the server's database of files belonging to users who are currently
connected to the network. The central server then displays that list to the requesting user, who
can then select the desired file from the list and open a direct HTTP link with the individual
computer which currently possesses that file. The download of the actual file takes place directly
from one network user to the other. The actual file is never stored on the central server or on
any intermediate point on the network.
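The central directory behaviour described above might be sketched, under simplifying assumptions, as follows. The class name, data structures and example peer addresses are illustrative only; the point is that the index records who shares what, while the file transfer itself bypasses the server.

```python
class CentralIndex:
    """Directory of which registered peer currently shares which files."""

    def __init__(self):
        self.shared = {}                       # peer address -> set of file names

    def log_on(self, peer, files):
        """Record a peer's shared files when it logs on."""
        self.shared[peer] = set(files)

    def log_off(self, peer):
        """Drop a peer's entries from the index when it logs off."""
        self.shared.pop(peer, None)

    def search(self, term):
        """Return (peer, file) pairs whose file names match the search term."""
        return [(peer, name)
                for peer, files in self.shared.items()
                for name in files
                if term.lower() in name.lower()]

index = CentralIndex()
index.log_on("10.0.0.5:6346", ["soils_survey.csv", "readme.txt"])
index.log_on("10.0.0.9:6346", ["soils_map.tif"])
print(index.search("soils"))
# The transfer itself would then be a direct HTTP request to the listed peer;
# the file never passes through the central server.
```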
Advantages of the Server-Client Structure
One of the server-client model's main advantages is its central index which locates files quickly
and efficiently. As the central directory constantly updates the index, files that users find
through their searches are immediately available for download.
Another advantage lies in the fact that all individual users, or clients, must be registered to be
on the server's network. As a result, search requests reach all logged-on users, which ensures
that all searches are as comprehensive as possible.
Problems with the Server-Client Model
While a centralised architecture allows the most efficient, comprehensive search possible, the
system also has only a single point of entry. As a result, the network could completely collapse
if one or several of the servers were to be incapacitated. Furthermore, the server-client model
may provide out-of-date information or broken links, as the central server's database is
refreshed only periodically.
5.2.3 The Decentralised Model of P2P File-Sharing
Unlike a centralised server network, the P2P network does not use a central server to keep
track of all user files. To share files using this model, a user starts with a networked computer
equipped with P2P software, which connects to another P2P-networked computer. The computer
then announces that it is alive, and the message is passed on from one computer to the next.
Once the computer has announced that it is alive to the various members of the peer network,
it can then search the contents of the shared directories of the peer network members. The
search sends the request to all members of the network. If one of the computers in the peer
network has a file that matches the request, it transmits the file information (name, size,
etc.) back through all the computers in the pathway, and a list of files matching the search
request then appears on the requesting computer. The file can then be downloaded directly.
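A minimal sketch of this flooded search, assuming a toy in-memory topology and a simple time-to-live (TTL) counter of the kind discussed later in this section, is given below. The peer names, file names and TTL values are illustrative assumptions.

```python
class Peer:
    """A toy peer holding some files and a list of neighbouring peers."""

    def __init__(self, name, files):
        self.name = name
        self.files = files
        self.neighbours = []

    def search(self, term, ttl, seen=None):
        """Flood a query through the network, returning (peer, file) hits."""
        seen = set() if seen is None else seen
        if ttl == 0 or self.name in seen:
            return []
        seen.add(self.name)
        hits = [(self.name, f) for f in self.files if term in f]
        for neighbour in self.neighbours:
            hits.extend(neighbour.search(term, ttl - 1, seen))
        return hits

a = Peer("A", [])
b = Peer("B", ["census.zip"])
c = Peer("C", ["census_notes.txt"])
a.neighbours, b.neighbours, c.neighbours = [b], [a, c], [b]

print(a.search("census", ttl=3))   # reaches B and C within the TTL
print(a.search("census", ttl=1))   # TTL exhausted after A itself: no hits
```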
Advantages of the Decentralised Model
The P2P network has a number of distinct advantages over other methods of file sharing. The
network is more robust than a centralised model because it eliminates reliance on centralised
servers that are potential critical points of failure.
The P2P network is designed to search for any type of digital file (from recipes to pictures to
java libraries). The network also has the potential to reach every computer on the internet, while
even the most comprehensive search engines can only cover 20% of websites available.
Messages are also transmitted over P2P network in a decentralised manner: one user sends a
search request to his "friends," who in turn pass that request along to their "friends," and so on.
If one user, or even several users, in the network stop working, search requests would still get
passed along.
Problems with the Decentralised Model
Although the reach of this network is potentially infinite, in reality it is limited by "time-to-live"
(TTL) constraints; that is, the number of layers of computers that the request will reach. Most
network messages with excessively high TTLs will be rejected.
5.2.4 Advantages and Limitations of using the Peer-to-Peer Network
Peer-to-peer networks are generally simpler than client/server networks, but they usually
do not offer the same performance under heavy loads. There is also concern about illegal
sharing of copyrighted content, for example music files, by some P2P users.
Though peers all have equal status in the network, they do not all necessarily have equal
physical capabilities. A P2P network might consist of peers with varying capabilities, from
mobile devices to mainframes. A mobile peer might not be able to act as a server due to its
intrinsic limitations, even though the network does not restrict it in any way (Krishnan 2001).
A P2P network delivers a quite different scenario to a client/server network. Since every entity
(or peer) in the network is an active participant, each peer contributes certain resources to the
network, such as storage space and CPU cycles. As more and more peers join the network, the
network's capability increases, hence, as the network grows, it strengthens. This kind of
scalability is not found in client/server architectures (Krishnan 2001).
The advantages a P2P network offers do, however, create problems. Firstly, managing such a
network can be difficult compared to managing a client/server network, where administration
is only needed at the central points. The enforcement of security policies, backup policies, and
so on, has proven to be complicated in a P2P network. Secondly, P2P protocols are much
more "talkative" than typical client/server protocols, as peers join and exit the network at will.
This transient nature can trigger performance concerns (Krishnan 2001).
Both of the networking models discussed above feature advantages and disadvantages. One
can visualise from Figure 4 that as a client/server network grows (that is, as more and more
clients are added), the pressure on the central point, the server, increases. As each client is
added, the central point weakens; its failure can destroy the whole network (Krishnan 2001).
Although file sharing is widespread, it presents a number of challenges to organisations
through degraded network availability, reduced bandwidth, lost productivity and the threat
posed by having copyrighted material on the organisation’s network. For example, P2P
downloading can easily consume 30 percent of network bandwidth. P2P can also open the way
for spyware and viruses to enter the system.
P2P provides certain interesting capabilities not possible in traditional client/server networks,
which have predefined client or server roles for their nodes (Krishnan 2001). Corporations are
looking at the advantages of using P2P as a way for employees to share files without the
expense involved in maintaining a centralised server and as a way for businesses to exchange
information with each other directly. However, there can be problems with scalability, security
and performance. It is also impossible to keep track of who is using the files and for what
purpose. The peer-to-peer network may also not be suitable for files which are updated
frequently, as updates make already downloaded copies obsolete.
5.2.5 Copyright/IPR
A P2P network can be closed, so that files are only open to those that are connected to the
network. Individuals can set passwords and only allow access to those they wish. The network
can also be open to all, which has resulted in many reported cases of copyright infringement in
association with using P2P systems. The most infamous of these is the Napster case, where
the service was eventually shut down.
There can be serious copyright issues when using P2P. In the case of organisations such as
universities, the data holder will probably rely on their institution's copyright agreements.
In the US, there is now a bill against copyright infringement to limit the liability of copyright
owners for protecting their works on peer-to-peer networks (Sinrod 2002). A more recent bill
enables the imprisonment of people who illegally trade large amounts of copyrighted music online. A
House Judiciary subcommittee unanimously approved the "Piracy Deterrence and Education Act of
2004," which will be the first law to punish internet music pirates with prison if it were signed into law.
The bill targets people who trade more than 1,000 songs on peer-to-peer networks like Kazaa and
Morpheus, as well as people who make and sell bootlegged copies of films still in cinematic release
(McGuire 2004).
5.2.6 Preservation
Responsibility for preservation must lie with the data holders as data is stored on individual
computers, or peers. Preservation may be possible through the individuals’ institutional archive
or library, if available for deposition, or responsibility may lie with the data creators themselves
to ensure that the data is available in an interoperable format. Problems occur if the data
creator moves institution as they may retain the IPR but not copyright of the data. The data may
then become obsolete as responsibility of preserving the data is not handed on to someone
else. This, coupled with the potential existence of a number of versions and out-of-date data
across the network, means that preservation is a major concern with the use of P2P networking.
5.2.7 Cost
Broadband ISPs all around the world, especially in Europe, are complaining that P2P traffic is
costing them too much and claim that almost 60% of all bandwidth is used for file-swapping.
According to British CacheLogic, the global cost of P2P networks for ISPs will top £828M
(€1148M, $1356M) in 2003 and will triple in 2004. Various ISPs have considered taking
measures for restricting users' download habits. One UK example is the cable company ntl,
which imposed a 1GB/day limit for its cable modem connections and discovered that thousands
of users left the service immediately, taking their digital cable TV accounts to competitors as
well. Now some tech companies are trying to invent ways to prioritise the traffic: if the file-trading is done within the ISP's network, the cost for the ISP is minimal compared to
intercontinental network connection costs (AfterDawn 2003).
Peer-to-peer also raises maintenance costs and the need for high spec computers. J.P.
Morgan Chase found, when evaluating grid computing, that there was an up-front development
cost of about $2,000 per desktop, along with an annual maintenance cost of about $600 per
desktop (Breidenbach 2001).
The costs will also involve the time and expertise needed to set up the network, as well as any
legal fees involved with copyright infringement should the need arise.
5.2.8 Security
Security has been a major barrier to peer-to-peer adoption. The different platforms being used
within a company and across an extranet all have different security systems, and it is hard to
get them to interoperate. People end up using the lowest-common-denominator features. The
peer-to-peer community is trying to adapt existing security standards such as Kerberos and
X.509 certificates (Breidenbach 2001).
5.2.9 Peer-to-Peer Systems
In 1999 Napster went into service. Napster was a breakthrough in creating P2P networking.
With Napster, files were not stored on a central server, instead the client software for each user
also acts as a server for files shared from that computer. Connection to the central Napster
server is made only to find where the files are located at any particular point in time. The
advantage of this is that it distributes vast numbers of files over a vast number of "mini servers".
After a series of legal battles over copyright infringement, Napster was all but shut down by
mid-2001; however, several other P2P services rose up to fill the niche (Somers).
Examples of other peer-to-peer systems are described in the sections below.
Edutella (http://edutella.jxta.org/)
Edutella is a P2P system which aims to connect heterogeneous educational peers with different
types of repositories, query languages and different kinds of metadata schemata (Hatala 2003).
The overall goal of Edutella is to facilitate the reuse of globally distributed learning resources by
creating an open-source, peer-to-peer system for the exchange of RDF-based metadata
(Wilson 2001). The project also aims to provide the metadata services needed to enable
interoperability between heterogeneous JXTA applications.
Initial aims are to create:
 a replication service – to provide data persistence/availability and workload balancing while
maintaining data integrity and consistency;
 a mapping service – to translate between different metadata vocabularies to enable
interoperability between different peers;
 an annotation service – to annotate materials stored anywhere in the Edutella network.
The providers of metadata for the system will be anyone with content they want to make
available. This includes anything from individual teachers and students to universities and other
educational institutions (Wilson 2001).
Edutella is a metadata based peer-to-peer system, able to integrate heterogeneous peers
(using different repositories, query languages and functionalities), as well as different kinds of
metadata schemas. Finding common ground is essential, and it is assumed that all resources
maintained in the Edutella network can be described in RDF, and that all functionality in the Edutella
network is mediated through RDF statements and queries on them. For the local user, the
Edutella network transparently provides access to distributed information resources, and
different clients/peers can be used to access these resources. Each peer will be required to
offer a number of basic services and may offer additional advanced services (Nejdl 2002).
Edutella Architecture
Edutella is based on JXTA, an Open Source project supported and managed by Sun
Microsystems. JXTA is a set of XML based protocols to cover typical P2P functionality and
provides a Java binding offering a layered approach for creating P2P applications (core,
services, applications, see Figure 6). In addition to remote service access (such as offered by
SOAP), JXTA provides additional P2P protocols and services, including peer discovery, peer
groups and peer monitors. Therefore JXTA is a very useful framework for prototyping and
developing P2P applications (Nejdl 2002).
Figure 6. JXTA Layers
Edutella services complement the JXTA service layer, building upon the JXTA core layer, with
Edutella peers on the application layer. Edutella peers use the functionality provided by these
Edutella services as well as possibly other JXTA services (Nejdl 2002).
On the Edutella service layer, data exchange formats and protocols are defined (how to
exchange queries, query results and other metadata between Edutella peers), as well as APIs
for advanced functionality in a library-like manner. Applications like repositories, annotation
tools or user interfaces connected to and accessing the Edutella network, are implemented on
the application layer (Nejdl 2002).
The Edutella query service is intended to be a standardised query exchange mechanism for
RDF metadata stored in distributed RDF repositories. It is meant to serve as both the query
interface for individual RDF repositories located at single Edutella peers, as well as the query
interface for distributed queries spanning multiple RDF repositories. An RDF repository
consists of RDF statements (or facts) and describes metadata according to arbitrary RDFS
schemas (Nejdl 2002). One of the main purposes is to abstract from various possible RDF
storage layer query languages (e.g., SQL) and from different user level query languages (e.g.,
RQL, TRIPLE). The Edutella query exchange language and the Edutella common data model
provide the syntax and semantics for an overall standard query interface across heterogeneous
peer repositories for any kind of RDF metadata. The Edutella network uses the query exchange
language family RDF-QEL-i (based on Datalog semantics and subsets) as standardised query
exchange language format which is transmitted in an RDF/XML-format (Nejdl 2002).
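By way of illustration only, the following sketch queries RDF metadata held in a local repository. Edutella's own exchange languages (RDF-QEL-i) are Datalog-based; rdflib and SPARQL are used here purely as a stand-in, and the sample resource and its title are assumptions.

```python
from rdflib import Graph

# A tiny RDF/XML repository with one resource described by a Dublin Core title.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/resource/1">
    <dc:title>Introductory GIS course</dc:title>
  </rdf:Description>
</rdf:RDF>"""

graph = Graph()
graph.parse(data=RDF_XML, format="xml")

# Ask the local repository for all resources and their titles.
QUERY = """
SELECT ?resource ?title
WHERE { ?resource <http://purl.org/dc/elements/1.1/title> ?title }
"""
for resource, title in graph.query(QUERY):
    print(resource, title)
```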
Edutella peers are highly heterogeneous in terms of the functionality they offer. A simple peer
has RDF storage capability only where the peer has some kind of local storage for RDF triples,
e.g. a relational database, as well as some kind of local query language, e.g. SQL. In addition,
the peer might offer more complex services such as annotation, mediation or mapping (Nejdl
2002). A wrapper handles the exchange between the peer's local format and the Edutella
format, and connects the peer to the Edutella network via a JXTA-based P2P library. To handle queries,
the wrapper uses the common Edutella query exchange format and data model for query and
result representation. For communication with the Edutella network, the wrapper translates the
local data model into the Edutella common data model ECDM and vice versa, and connects to
the Edutella network using the JXTA P2P primitives, transmitting the queries based on the
common data model ECDM in RDF/XML form (Nejdl 2002).
In order to handle different query capabilities, several RDF-QEL-i exchange language levels are
defined, describing which kind of queries a peer can handle (conjunctive queries, relational
algebra, transitive closure, etc.) The same internal data model is used for all levels. To enable
the peer to participate in the Edutella network, Edutella wrappers are used to translate queries
and results between the Edutella query exchange format and the peer's local format (Nejdl 2002).
Freenet
Freenet is a completely distributed decentralised peer-to-peer system. Communication is
handled entirely by peers operating at a global level (Watson 2003). Freenet is free software
that lets users publish and obtain information on the internet without fear of censorship. To
achieve this freedom, the publishers and consumers of information are anonymous.
The system operates as a location-independent distributed file system across many individual
computers that allows files to be inserted, stored, and requested anonymously. A node is
simply a computer that is running the Freenet software, and all nodes are treated as equals by
the network (Watson 2003).
Users contribute to the network by giving bandwidth and a portion of their hard drive (called the
data store) for storing files. Each node maintains its own local data store which it makes
available to the network for reading and writing, as well as a dynamic routing table containing
addresses of other nodes and the keys that they are thought to hold (Watson 2003). Unlike
other peer-to-peer file sharing networks, Freenet does not let the user control what is stored in
the data store. Instead, files are kept or deleted depending on how popular they are, with the
least popular being discarded to make way for newer or more popular content.
Communications by Freenet nodes are encrypted and are routed through other nodes to make
it extremely difficult to determine who is requesting the information and what its content is. This
removes any single point of failure or control. By following the Freenet protocol, many such
nodes spontaneously organise themselves into an efficient network (Watson 2003). Files in the
data store are encrypted to reduce the likelihood of prosecution by persons wishing to censor
Freenet content.
The network can be used in a number of different ways and is not restricted to just sharing files
like other peer-to-peer networks. It acts more like an internet within an internet. For example,
Freenet can be used for:
 publishing websites or 'free sites';
 communicating via message boards;
 content distribution.
The system is designed to respond adaptively to usage patterns, transparently moving, replicating,
and deleting files as necessary to provide efficient service without resorting to broadcast searches or
centralised location indexes (Watson 2003). Freenet enables users to share unused disk space
(Watson 2003).
It is intended that most users of the system will run nodes to:
 provide security guarantees against inadvertently using a hostile node;
 provide resistance to attempts by third parties to deny access to information and prevent
censorship of documents;
 provide efficient dynamic storage and routing of information;
 increase the storage capacity available to the network as a whole;
 provide anonymity for both producers and consumers of information;
 decentralise all network functions, removing any single point of failure or control.
The system operates at the application layer and assumes the existence of a secure transport layer,
although it is transport-independent. It does not seek to provide anonymity for general network
usage, only for Freenet file transactions (Watson 2003).
Unlike many cutting edge projects, Freenet is well established and has been downloaded by
over two million users since the project started. It is used for the distribution of censored
information all over the world, including China and countries in the Middle East.
Freenet Architecture
Freenet is implemented as an adaptive peer-to-peer network of nodes that query one another
to store and retrieve data files, which are named by location-independent keys:
 Keyword-Signed Key (KSK) is based on a short descriptive string chosen by the user when
inserting a file. This string is basically hashed to yield the KSK. To allow others to retrieve a
document, publishers only have to publish this string;
 Signed-Subspace Key (SSK) is used to identify a personal subspace. An SSK allows a
user to build a reputation by publishing documents while remaining anonymous but still
identifiable;
 Content-Hash Keys (CHKs) allow a node to check that the document it just received is a
genuine copy and hasn’t been tampered with. They also allow an author to update a
document, if this author uses a private subspace (with SSKs). A minimal sketch of key
derivation follows this list.
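The sketch below illustrates the two simplest key ideas, assuming SHA-1 hashing purely for illustration; Freenet's actual key construction involves further steps (for example, the key pairs behind SSKs) that are not modelled here.

```python
import hashlib

def keyword_signed_key(description):
    """Derive a KSK-style key by hashing a short descriptive string."""
    return hashlib.sha1(description.encode("utf-8")).hexdigest()

def content_hash_key(document):
    """Derive a CHK-style key from the document contents themselves."""
    return hashlib.sha1(document).hexdigest()

print(keyword_signed_key("go-geo/data-distribution-report"))

# A node can verify a retrieved copy simply by re-hashing it and comparing
# the result with the content-hash key that was requested.
original = b"example report contents"
retrieved = b"example report contents"
print(content_hash_key(retrieved) == content_hash_key(original))
```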
The basic model is:
 keys are passed along from node to node through a chain of requests in which each node
makes a local decision about where to send the request next, in the style of internet
protocol routing;
 depending on the key requested, the routes will vary. The routing algorithms adaptively
adjust routes over time to provide efficient performance while using only local, rather than
global, knowledge. To maintain privacy, each node only has knowledge of its immediate
upstream and downstream neighbours;
 each request is given a hops-to-live limit, which is decremented at each node to prevent
infinite chains;
 each request is also assigned a pseudo-unique random identifier, so that nodes can
prevent loops by rejecting requests they have seen before;
 this process continues until the request is either satisfied or has exceeded its hops-to-live
limit. Then the success or failure is passed back up the chain to the sending node.
(Watson 2003).
Requests
In order to make use of Freenet’s distributed resources, a user must initiate a request.
Requests are messages that can be forwarded through many different nodes. Initially the user
forwards the request to a node that he or she knows about and trusts. If a node does not have
the document that the requester is looking for, it forwards the request to another node that,
according to its information, is more likely to have the document (Watson 2003).
The reply is passed back through each node that forwarded the request, back to the original
node that started the chain. Each node in the chain may cache the reply locally, so that it can
reply immediately to any further requests for that particular document. This means that
commonly requested documents are cached on more nodes, and thus there is no overloading
effect on one node (Watson 2003).
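The forwarding, loop-prevention and caching behaviour described above might be sketched as follows. The "closeness" metric, the hash-based keys and the example topology are illustrative assumptions rather than Freenet's actual routing algorithm.

```python
import hashlib
import uuid

def key_of(name):
    """Map a name to a numeric key (an illustrative stand-in for real keys)."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}            # local data store: key -> document
        self.neighbours = []
        self.seen = set()          # request identifiers already handled

    def request(self, key, hops_to_live, request_id):
        """Return the document for `key`, or None if it cannot be found."""
        if request_id in self.seen or hops_to_live == 0:
            return None            # reject loops and exhausted requests
        self.seen.add(request_id)
        if key in self.store:
            return self.store[key]
        # Forward to the neighbour judged "closest" to the key first.
        for nxt in sorted(self.neighbours, key=lambda n: abs(key_of(n.name) - key)):
            document = nxt.request(key, hops_to_live - 1, request_id)
            if document is not None:
                self.store[key] = document    # cache the reply on the way back
                return document
        return None

a, b, c = Node("a"), Node("b"), Node("c")
a.neighbours, b.neighbours, c.neighbours = [b], [a, c], [b]
c.store[key_of("report.pdf")] = b"...file bytes..."

print(a.request(key_of("report.pdf"), hops_to_live=5, request_id=uuid.uuid4()))
print(key_of("report.pdf") in b.store)   # the intermediate node cached a copy
```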
Performance Analysis
User benchmarks include how long it takes to retrieve a file and how much bandwidth a
query consumes. Both have a direct impact on the usability and success of the system.
Network connections such as ADSL and cable modems favour client over server usage. This
has resulted in the network infrastructure being optimised for computers that are only clients,
not servers. The problem is that peer-to-peer applications are changing the assumption that
end users only want to download from the internet, never upload to it. Peer-to-peer technology
generally makes every host act as both a client and a server, so the asymmetric assumption is
incorrect. The network architecture is going to have to change to handle this new traffic pattern
(Watson 2003).
Problems which affect the performance of the decentralised peer-to-peer network of Freenet
are:
 in network communication, connection speed dominates processor and I/O speed as the
bottleneck. This problem is emphasised by the highly parallel nature of Freenet;
 as there is no central master index maintained, messages must be passed over many
hops, in order to search through the system to find the data. Each hop not only adds to the
total bandwidth load but also increases the time needed to perform a query. If a peer is
unreachable it can take several minutes to time out the connection;
 peer-to-peer communities depend on the presence of a sufficient base of communal
participation and cooperation in order to function successfully.
(Watson 2003).
Small World Effect
The small world effect is fundamental to Freenet’s operation. It is important because it defines
the file location problem in a decentralised, self-configuring P2P network like Freenet. In
Freenet, queries are forwarded from one peer to the next according to local decisions about
which potential recipient might make the most progress towards the target. Freenet messages
are not targeted to a specific named peer but towards any peer having a desired file in its data
store (Watson 2003).
Two characteristics distinguish small-world networks:
 a small average path length, typical of random graphs;
 a large clustering coefficient that is independent of network size. The clustering coefficient
captures how many of a node’s neighbours are connected to each other.
Despite the size of the network, short routes must exist. In a simulation of a Freenet network
with 1,000 identical, initially empty nodes, half of all requests in the mature network
succeeded within six hops, and a quarter within just three hops or fewer. This
compares with the internet, itself a small-world network, which has a characteristic path length of 19
hops (Watson 2003).
Freenet has good average performance but poor worst case performance, because a few bad
routing choices can throw a request completely off track. Robustness is a concern for
Freenet, as for all peer-to-peer systems coping with the unreliability of peers. Since peers tend to
be PCs rather than dedicated servers, they are often turned off or disconnected from the
network at random (Watson 2003).
Trust and Accountability
By signing software they make available for download, authors can provide some assurance
that their code has not been tampered with and facilitate the building of a reputation associated
with their name key (Watson 2003).
The lesson for peer-to-peer designers is that without accountability in a network, it is difficult to
enforce rules of social responsibility. Just like email, today’s peer-to-peer systems run the risk
of being overrun by unsolicited advertisements (Watson 2003).
Security
Firewalls and dynamic IP addressing grew out of the clear need in internet architecture to
make scalable, secure systems. Firewalls stand at the gateway between the internal network
and the internet outside and are a very useful security tool, but they pose a serious obstacle to
peer-to-peer communication models. New peer-to-peer applications challenge this architecture,
demanding that participants serve resources as well as use them (Watson 2003).
Port 80 is conventionally used by HTTP traffic when people browse the web. Firewalls typically
filter traffic based on the direction and the destination port of the traffic. Most current peer-to-peer applications have some way to use port 80 in order to circumvent network security
policies. The problem is that there is no good way to identify which applications are
running through it. Also, even if the application has a legitimate reason to go through the
firewall, there is no simple way to request permission (Watson 2003).
As no node can tell where a request came from beyond the node that forwarded the request to
it, it is very difficult to find the person who started the request. Freenet does not provide perfect
anonymity because it balances paranoia against efficiency and usability. If someone wants to
find out exactly what you are doing, then given the resources, they will. Freenet does, however,
seek to stop mass, indiscriminate surveillance of people (Watson 2003).
Legal Issues
As Freenet can potentially contain illegal information, the encryption Freenet provides gives
the owner of the computer (the node) deniability, as they know nothing of what is stored on
their computer (Watson 2003).
Advantages and Disadvantages of Freenet
Some of the advantages and disadvantages of Freenet are described in the sections below.
Advantages:
 Freenet is solving many of the problems seen in centralised networks. Popular data, far
from being less available as requests increase, becomes more available as nodes cache it.
This is the correct reaction of a network storage system to popular data;
 Freenet also removes the single point of attack for censors, the single point of technical
failure, and the ability for people to gather large amounts of personal information about a
reader;
 Freenet’s niche is in the efficient and anonymous distribution of files. It is designed to find a
file in the minimum number of node-to-node transactions. Additionally, it is designed to
protect the privacy of the publisher of the information, and all intervening nodes through
which the information passes.
Disadvantages:
 it is designed for file distribution and not fixed storage. It is not intended to guarantee
permanent file storage, although it is hoped that a sufficient number of nodes will join with
enough storage capacity that most files will be able to remain indefinitely;
 Freenet does not yet have a search system, because designing a search system which is
sufficiently efficient and anonymous can be difficult;
 node operators cannot be held responsible for what is being stored on their hard drives.
Freenet is constantly criticised because users have to donate personal hard drive space
to a group of strangers who may very well use it to host content that the user disapproves of;
 Freenet is designed so that if the file is in the network, the path to the file is usually short.
Consequently, Freenet is not optimised for long paths, which can therefore be very slow;
 self-organising file sharing systems like Freenet are affected by the popularity of files, and
hence may be susceptible to the tyranny of the majority.
(Watson 2003).
Gnutella
Gnutella was originally designed by Nullsoft, a subsidiary of America Online (AOL) but
development of the Gnutella protocol was halted by AOL management shortly after being made
available to the public. During the few hours it was available to the public several thousand
downloads occurred. Using those downloads, programmers created their own Gnutella
software packages.
Gnutella is a networking protocol, which defines a manner in which computers can speak
directly to one another in a completely decentralised fashion. The content that is available on
the Gnutella network does not come from web sites or from the publishers of Gnutella-compatible software; it comes from other users running Gnutella-compatible software on their
own computers. Software publishers such as Lime Wire LLC have written and distributed
programs which are compatible with the Gnutella protocol, and which therefore allow users to
participate in the Gnutella network. Gnutellanet is currently one of the most popular of these
decentralised P2P programs because it allows users to exchange all types of files.
The Gnutella Structure
Unlike a centralised server network, the Gnutella network (gNet) does not use a central server
to keep track of all user files. To share files using the Gnutella model (Figure 7), a user starts
with a networked computer, equipped with a Gnutella servent. Computer "A" will connect to
another Gnutella-networked computer, "B." A will then announce that it is "alive" to B, which will
in turn announce to all the computers that it is connected to, "C," "D," "E," and "F," that A is
alive. The computers C, D, E, and F will then announce to all computers to which they are
connected that A is alive; those computers will continue the pattern and announce to the
computers they are connected to that computer A is alive.
Once "A" has announced that it is "alive" to the various members of the peer network, it can
then search the contents of the shared directories of the peer network members. The search
request will send the request to all members of the network, starting with, B, then to C, D, E, F,
who will in turn send the request to the computers to which they are connected, and so forth. If
one of the computers in the peer network, say for example, computer D, has a file that
matches the request, it transmits the file information (name, size, etc.) back through all the
computers in the pathway towards A, where a list of files matching the search request will then
appear on computer A's Gnutella servent display. A will then be able to open a direct
connection with computer D and will be able to download that file directly from computer D. The
Gnutella model thus enables file sharing without using central servers, which in the centralised
model do not actually serve content themselves.
Figure 7. The Gnutella Structure
Technical Overview
The Gnutella protocol is run over TCP/IP, a connection-oriented network protocol. A typical
session comprises a client connecting to a server. The client then sends a Gnutella packet
advertising its presence. This advertisement is propagated by the servers through the network
by recursively forwarding it to other connected servers. All servers that receive the packet reply
with a similar packet about themselves (Creedon 2003).
Queries are propagated in the same manner, with positive responses being routed back the
same path. When a resource is found and selected for downloading, a direct point-to-point
connection is made between the client and the host of the resource, and the file downloaded
directly using HTTP. The server in this case will act as a web server capable of responding to
HTTP GET requests (Creedon 2003).
Gnutella packets are of the form:
Message ID (16 bytes) | Function ID (1 byte) | TTL (1 byte) | Hops (1 byte) | Payload length (4 bytes)
Table 4. The form of Gnutella Packets
Where:
 message ID in conjunction with a given TCP/IP connection is used to uniquely identify a
transaction;
 function ID is one of: Advertisement[response], Query[response] or Push-Request;
 TTL is the time-to-live of the packet, i.e. how many more times the packet will be
forwarded;
 hops counts the number of times a given packet is forwarded;
 payload length is the length in bytes of the body of the packet.
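A minimal sketch of packing and unpacking the 23-byte header laid out in Table 4 is given below. The little-endian byte order and the function ID value used in the example are assumptions for illustration, not a statement of the full protocol.

```python
import struct
import uuid

# message ID (16 bytes), function ID, TTL, hops (1 byte each), payload length (4 bytes)
HEADER = struct.Struct("<16sBBBI")

def pack_header(message_id, function_id, ttl, hops, payload_length):
    """Serialise the header fields into their 23-byte wire form."""
    return HEADER.pack(message_id, function_id, ttl, hops, payload_length)

def unpack_header(raw):
    """Recover the header fields from 23 raw bytes."""
    message_id, function_id, ttl, hops, payload_length = HEADER.unpack(raw)
    return {"message_id": message_id, "function_id": function_id,
            "ttl": ttl, "hops": hops, "payload_length": payload_length}

raw = pack_header(uuid.uuid4().bytes, 0x80, ttl=5, hops=0, payload_length=18)
print(len(raw))            # 23 bytes: 16 + 1 + 1 + 1 + 4
print(unpack_header(raw))
```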
Connecting
A client finds a server by trying to connect to any of a local list of known servers that are likely
to be available. This list can be downloaded from the internet, or be compiled by the end user,
comprising, for example, servers run by friends, etc. The Advertisement packets comprise the
number of files the client is sharing, and the size in Kilobytes of the shared data. The server
replies comprise the same information. Thus, once connected, a client knows how much data is
available on the network.
Queries
As mentioned above, queries are propagated the same way as Advertisements. To save
bandwidth, servers that cannot match the search parameters need not send a reply. The
semantics of matching search parameters are not defined in the current published protocol.
The details are server dependent. For example, a search for ".mp3" could be interpreted as all
files with the same file extension, or any file with "mp3" in its name, etc.
Downloading
A client wishing to make a download opens an HTTP (hyper-text transfer protocol) connection to
the host and requests the resource by sending a "GET <URL>" type HTTP command, where
the URL (Uniform Resource Locator) is returned by a query request. Hence, a client sharing
resources has to implement a basic HTTP (aka web) server.
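As an illustration, the download step might look like the following sketch; the peer address, port and resource path are hypothetical examples rather than values defined by the protocol.

```python
import http.client

PEER_HOST, PEER_PORT = "192.0.2.10", 6346     # address returned by a query hit
RESOURCE_PATH = "/get/3/soils_survey.csv"     # hypothetical resource path

conn = http.client.HTTPConnection(PEER_HOST, PEER_PORT, timeout=10)
conn.request("GET", RESOURCE_PATH)            # the "GET <URL>" command from the text
response = conn.getresponse()
with open("soils_survey.csv", "wb") as outfile:
    outfile.write(response.read())            # the file arrives directly from the peer
conn.close()
```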
Firewalls
A client residing behind a firewall trying to connect to a Gnutella network will have to connect to
a server running on a "firewall-friendly" port. Typically this will be port 80, as this is the reserved
port number for HTTP, which is generally considered secure and non-malicious.
When a machine hosting a resource cannot accept HTTP connections because it is behind a
firewall, it is possible for the client to send a "Push-Request" packet to the host, instructing it to
make an outbound connection to the client on a firewall-friendly port, and "upload" the
requested resource, as opposed to the more usual client "download" method. The other
permutation, where both client and server reside behind firewalls, renders the protocol non-functional (Creedon 2003).
Advantages and Limitations of Gnutella
The Gnutella network has a number of distinct advantages over other methods of file sharing.
Namely, the Gnutella network is decentralised and hence more robust than a centralised model
because it eliminates reliance on centralised servers that are potential critical points of failure.
Messages are also transmitted over the Gnutella network in a decentralised manner: one user
sends a search request to his "friends," who in turn pass that request along to their "friends,"
and so on. If one user, or even several users, in the network stop working, search requests
would still get passed along.
The Gnutella network is designed to search for any type of digital file (from recipes to pictures
to java libraries). The Gnutella network also has the potential to reach every computer on the
internet, while even the most comprehensive search engines can only cover 20% of websites
available. Improvements have enabled the resulting GnutellaNet to overcome major obstacles
and experience substantial growth. It is estimated that the network currently encompasses
about 25,000 simultaneous peers with approximately a quarter million unique peers active on
any given day (Truelove 2001).
The crux of the Gnutella protocol is its focus on decentralised P2P file-sharing and distributed
searching. Such peer-to-peer content searching is not core to other services such as JXTA.
Instead, this is precisely the sort of functionality well-suited for the service level between the
core and applications.
In Gnutella, search and file transfers are cleanly separated, and an established protocol, HTTP,
is used for the latter. Standard web browsers can access Gnutella peers (provided they are not
blocked for social-engineering reasons), which are in essence transient web sites.
As it is an open protocol, the end-user interface and functionality are separable from the
underlying network. The result is a sizable number of interoperable applications that interact
with the common GnutellaNet. Gnutella's query-and-response messages could be put to use as
a bid-response mechanism for peer-to-peer auctioning (Truelove 2001).
The principal shortcomings in the protocol are:
 scalability - the system was designed in a laboratory and set up to run with a few hundred
users. When it became available on the internet, it quickly grew to having a user base of
tens of thousands. Unfortunately at that stage the system became overloaded and was
unable to handle the amount of traffic and nodes that were present in the system. Concepts
that had looked good in a laboratory were showing signs of stress right from the start;
 packet life - to find other users, a packet has to be sent out into the network. It became
apparent early on that the packet life on some packets had not been set right and a build
up of these packets started circulating around the network indefinitely. This resulted in less
bandwidth being available on the network for users;
 connection speeds of users - users on the system act as gateways to other users to find
the data they need. However, not every user had the same connection speed. This has led
to problems as users on slower bandwidth machines were acting as connections to people
on higher bandwidth. This resulted in connection speeds being dictated by the user with the
slowest connection speed on the link to the data, thereby leading to bottlenecks.
Furthermore, the entire network is not visible to any one client. Using the standard time-to-live,
during advertisement and search, only about 4,000 peers are reachable. This arises from the
fact that each client only holds connections to four other clients and a search packet is only
forwarded five times. In practical terms this means that even though a certain resource is
available on the network, it may not be visible to the seeker because it is too many nodes
away. To increase the number of reachable peers in the Gnutella network we would need to
increase the time-to-live for packets and the number of connections kept open. Unfortunately
this gives rise to other problems: if we were to increase both the number of connections and
the number of hops to eight, 1.2 gigabytes of aggregate data could be potentially crossing the
network just to perform an 18 byte search query (Creedon 2003).
Another significant issue that has been identified is Gnutella's susceptibility to denial of service
attacks. For example a burst of search requests can easily saturate all the available bandwidth
in the attacker's neighbourhood, as there is no easy way for peers to discriminate between
malicious and genuine requests. Some workarounds to the problem have been presented, but
in each case there are significant compromises to be made. All in all the overall quality of
service of the Gnutella network is very poor. This is due to a combination of factors, some of
them deriving directly from the characteristics of the protocol, others induced by the users of the
network themselves. For example, users that are reluctant to concede any outgoing bandwidth
will go to great lengths to prevent others from downloading files that they are 'sharing'. Similarly, the
ability to find a certain file will largely depend on the naming scheme of the user that makes
certain files available. A conspiracy theorist will of course argue that the above tactics are being
used by record companies to undermine the peer-to-peer revolution (Creedon 2003).
JXTA
JXTA is a set of protocols that can be implemented in any language and will allow distributed
client interoperability. It provides a platform to perform the most basic functionality required by
any P2P application: peer discovery and peer communication (Krikorian 2001). At the same
time, these protocols are flexible enough to be easily adapted to application-specific
requirements. While JXTA does not dictate any particular programming language or
environment, Java could potentially become the language of choice for P2P application
development due to its portability, ease of development, and a rich set of class libraries
(Krishnan 2001).
At its core, JXTA is simply a protocol for inter-peer communication. Each peer is assigned a
unique identifier and belongs to one or more groups in which the peers cooperate and function
similarly under a unified set of capabilities and restrictions. JXTA provides protocols for basic
functions, for example, creating groups, finding groups, joining and leaving groups, monitoring
groups, talking to other groups and peers, sharing content and services. All of these functions
are performed by publishing and exchanging XML advertisements and messages between
peers. JXTA includes peer-monitoring hooks that will enable the management of a peer node.
People can build visualisation tools that show how much traffic a particular node is getting. With
such information, a network manager can decide to increase or throttle bandwidth on various
nodes, or implement a different level of security (Breidenbach 2001).
The JXTA Structure
Conceptually, each peer in JXTA abstracts three layers: the core layer, the services layer, and
the applications layer. The core layer is responsible for managing the JXTA protocol and it
should encapsulate the knowledge of all basic P2P operations. The core creates an addressing
space separate from IP addresses by providing each peer its own unique peer ID, whilst also
boot-strapping each peer into a peer group. Through protocols which the core knows how to
implement, a JXTA peer can locate other peers and peer groups to join (in a secure,
authenticated manner, if desired). The core layer can also open a pipe, a very simple one-way
message queue, to another peer or group of peers. By using pipes, distinct parties can
communicate with each other.
The services layer serves as a place for common functionality which more than one, but not
necessarily every, P2P program might use. For example, Sun has released a Content
Management System (CMS) service that has been implemented as a JXTA service. CMS
provides a generic way for different applications to share files on JXTA peers so that
decentralised Gnutella-like searches may be performed against all of them. Once specific
content has been located, the CMS provides ways for peers to download content. The
services layer thus provides library-like functionality, which JXTA applications can control via logic
located in the application layer (Krikorian 2001).
The third layer is where the P2P application truly lives. The upper layer might host a user
interface, so that the user can control different services, or it might be where the logic of an
autonomous application operates. For example, a simple chat program can be built on this
layer, making use of both the service and the core layer to allow people to send messages
back and forth to each other. P2P applications should be fairly easy to build once a developer
is familiar with JXTA, as the platform provides the basic peer-to-peer framework (Krikorian
2001).
JXTA Developments
As part of its open source release, Sun is distributing a preliminary Java binding for JXTA with
the goal of having early-adopter engineers create simple P2P applications in Java. Sun's
binding is not complete, however, as interfaces, implementations, and protocols are likely to
change (Krikorian 2001).
Project JXTA is building core network computing technology to provide a set of simple, small,
and flexible mechanisms that can support P2P computing on any platform, anywhere, at any
time. The project is first generalising P2P functionality and then building core technology that
addresses today's limitations on P2P computing. The focus is on creating basic mechanisms
and leaving policy choices to application developers (Krishnan 2001).
With the unveiling of Project JXTA, not only is Sun introducing new building blocks for P2P
development, but it is launching an open and decentralised peer-to-peer network. While JXTA
can be used for P2P applications that operate in a closed environment such as an intranet and
its success may ultimately be measured by its utility in that domain, a wider public JXTA
network, the JuxtaNet, will be formed. The core JXTA protocols are the foundation for Sun's
initial reference implementation, which in turn is the basis for Sun's example applications,
including the Shell and InstantP2P. These applications give life to the JuxtaNet as they are run
and instantiate peers that intercommunicate (Truelove 2001).
XML in JXTA
Undoubtedly, the first step towards providing a universal base protocol layer is to adopt a
suitable representation that a majority of the platforms currently available can understand. XML
is the ideal candidate for such a representation. The JXTA developers recognise that XML is
fast becoming the default standard for data exchange, as it provides a universal, language-independent, and platform-independent form of data representation. XML can also be easily
transformed into other encodings; hence, all JXTA protocols are defined in XML (Krishnan
2001).
Although JXTA messages are defined in XML, JXTA does not depend on XML encoding. In
fact, a JXTA entity does not require an XML parser; it is an optional component. XML is a
convenient form of data representation used by JXTA. Smaller entities like a mobile phone
might use precompiled XML messages (Krishnan 2001).
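Purely as an illustration of exchanging peer information as XML, the following sketch builds a simple advertisement document; the element names are invented for this example and are not the actual JXTA advertisement schema.

```python
import xml.etree.ElementTree as ET

def make_peer_advertisement(peer_id, name, services):
    """Build a simple XML advertisement describing a peer and its services."""
    advert = ET.Element("PeerAdvertisement")
    ET.SubElement(advert, "PeerID").text = peer_id
    ET.SubElement(advert, "Name").text = name
    for service in services:
        ET.SubElement(advert, "Service").text = service
    return ET.tostring(advert, encoding="unicode")

print(make_peer_advertisement("urn:example:peer:1234", "go-geo-node",
                              ["discovery", "content-sharing"]))
```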
JXTA holds promise as a low-level platform for P2P application development. While the
technology is in its early stages, it is expected to mature over time to provide a robust, reliable
framework for P2P computing. As Java is the preferred language for applications designed for
heterogeneous environments, it is the natural choice for P2P applications (Krishnan 2001).
Advantages and Limitations
The JuxtaNet is significant in that it is an open, general-purpose P2P network. JXTA is
abstracted into multiple layers, core, service and application, with the intention that multiple
services will be built on the core, and that the core and services will support multiple
applications. There is no constraint against the simultaneous existence on the JuxtaNet of
multiple services or applications designed for a similar purpose. As an example, just as a PC's
operating system can simultaneously support multiple word processors, the JuxtaNet can
simultaneously support multiple file-sharing systems (Truelove 2001). In contrast to Gnutella,
JXTA was designed for a multiplicity of purposes. Gnutella is one protocol, while the JXTA core
consists of several protocols.
Based on the experience of the GnutellaNet, there are several reasons to expect developer
enthusiasm for JuxtaNet from similar quarters: versatile core protocols for peer discovery, peer
group membership, pipes and peer monitoring form a rich foundation on which a wide variety of
higher-level services and applications can be built. Developers, eager to develop new
decentralised applications, have found that the path to build them in Gnutella involves
overloading existing constructs or carefully grafting on new ones without breaking the installed
base. The alternative, individually developing a proprietary vertically integrated application from
the P2P networking layer up to the application layer, is unattractive in many cases. This high
friction has arguably inhibited development, but JXTA lowers it. JXTA's groups, security, pipes,
advertisements and other aspects should be welcome building blocks.
In terms of attracting developers, the open nature of the JXTA protocols is an advantage to the
JuxtaNet, just as the open nature of the Gnutella protocol is an advantage to GnutellaNet. The
higher complexity of JXTA relative to Gnutella gives it a steeper learning curve, but the
release of open-source reference code educates by example. Just as with
the GnutellaNet, the JuxtaNet is an open network whose applications can be expected to be
interoperable at lower levels. This can potentially give developers the "instant user base"
phenomenon familiar from the GnutellaNet.
JXTA strives to provide a base P2P infrastructure over which other P2P applications can be
built. This base consists of a set of protocols that are language independent, platform
independent, and network agnostic (that is, they do not assume anything about the underlying
network). These protocols address the bare necessities for building generic P2P applications.
Designed to be simple with low overheads, the protocols target, to quote the JXTA vision
statement, "every device with a digital heartbeat." (Krishnan 2001).
JXTA currently defines six protocols, but not all JXTA peers are required to implement all six of
them. The number of protocols that a peer implements depends on that peer's capabilities;
conceivably, a peer could use just one protocol. Peers can also extend or replace any protocol,
depending on their particular requirements (Krishnan 2001).
It is important to note that JXTA protocols by themselves do not promise interoperability. Here,
you can draw parallels between JXTA and TCP/IP. Though both FTP and HTTP are built over
TCP/IP, you cannot use an FTP client to access web pages. The same is the case with JXTA.
Just because two applications are built on top of JXTA it does not mean that they can magically
interoperate. Developers must design applications to be interoperable; however, developers
can use JXTA, which provides an interoperable base layer, to further reduce interoperability
concerns (Krishnan 2001).
5.2.10 Case Studies
The following case studies illustrate the use of the software discussed in the previous section.
EduSource (http://edusource.licef.teluq.uquebec.ca/ese/en/overview.htm )
The vision of the eduSource project is focused on the creation of a network of linked and
interoperable learning object repositories across Canada. The initial part of this project will be
an inventory of ongoing development of the tools, systems, protocols and practices.
Following this initial exercise, the project will look at defining the components of an
interoperable framework, the web services that will tie them together and the protocols
necessary to allow other institutions to enter into that framework.
The eduSource project goals are to:
 support a true network model;
 use standards and protocols that are royalty-free;
 implement and support emerging specifications, such as CanCore;
 enable anybody to become a member;
 provide royalty-free open source infrastructure layer and services;
 provide a distributed architecture;
 facilitate an open marketplace for quality learning resources;
 provide multiple metadata descriptions of a given learning resource;
 provide a semantic web within the eduSource community;
 encourage open rights management.
LimeWire (http://www.limewire.com/english/content/home.shtml)
LimeWire is a file-sharing program running on the Gnutella network. It is open standard
software running on an open protocol, free for the public to use. LimeWire allows you to share
any type of file, such as mp3s, jpgs and tiffs. LimeWire is written in Java, and will run on Windows,
Macintosh, Linux, Sun, and other computing platforms.
The features of LimeWire are:
 easy to use - just install, run, and search;
 ability to search by artist, title, genre, or other metainformation;
 elegant multiple search tabbed interface;
 "swarm" downloads from multiple hosts help you get files faster;
 iTunes integration for Mac users;
 unique "ultra peer" technology reduces bandwidth requirements for most users;
 integrated chat;
 browse host feature which even works through firewalls;
 added Bitzi metadata lookup;
 international versions: now available in many new languages;
 connects to the network using GWebCache, a distributed connection system;
 automatic local network searches for lightning-fast downloads. If you are on a corporate or
university network, you can download files from other users on the same network almost
instantaneously;
 support for MAGNET links that allow you to click on web page links that access Gnutella.
LimeWire is capable of multiple searches, available in several different languages, and easy to
use with cross-platform compatibility. LimeWire offers two versions of the software: LimeWire
Basic, which is ad-supported and free of charge, and LimeWire PRO, which costs a
small fee, contains no ads or flashing banners, and includes six months of free updates and
customer support via email. Several developments are being finalised, such as the ability for
companies to offload bandwidth costs by serving files on LimeWire, as well as allowing personal
users to send very large files over email.
LimeWire is a very fast P2P file-sharing application which enables the sharing, searching, and
downloading of MP3 files. LimeWire also lets users share and search for all types of computer
files, including movies, pictures, games and text documents. Other features include dynamic
querying, push-proxy support for connecting through firewalls, the ability to preview files while
downloading, advanced techniques for locating rare files, and an extremely intuitive user
interface (http://download.com.com/3000-2166-10051892.html).
LimeWire supports Gnutella’s open-protocol, prejudice-free development environment. Since
nobody owns the Gnutella protocol, any company or person can use it to send or respond to
queries, and no entity will have a hold over the network or over the information flowing through
it. This free market environment promotes competition among entities choosing to respond to
the same queries. The model for Gnutella’s growth and development is the world wide web:
nobody owned the hypertext transfer protocol (HTTP) on which the web was built, nor did
anybody own the web itself, and this openness is what allowed the web's growth to be so
explosive and the spectrum of its applications so broad.
All computers running a program utilising the Gnutella protocol are said to be on the Gnutella
network (gNet). On the world wide web, a user's computer is typically connected to only one
site at a time: when a user visits Amazon.com, she is not at Yahoo.com. The two sites
are mutually exclusive. On the Gnutella network, a user is connected to several other
computers at once. Information can be received from many sources simultaneously.
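The contrast with the web's one-connection-at-a-time model can be illustrated with a simplified sketch. The following Java fragment is not the real Gnutella wire protocol; the peer names, file names and TTL value are invented for illustration. It shows a peer holding several neighbour connections at once and flooding a keyword query so that hits can be returned from many peers simultaneously.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Simplified illustration of Gnutella-style query flooding: each peer keeps
 * several open neighbour connections and forwards a query to all of them,
 * decrementing a time-to-live, so a single search can receive hits from many
 * peers at once. This is a sketch, not the real wire protocol.
 */
public class FloodSketch {

    static class Peer {
        final String name;
        final Set<String> sharedFiles = new HashSet<>();
        final List<Peer> neighbours = new ArrayList<>();

        Peer(String name, String... files) {
            this.name = name;
            for (String f : files) sharedFiles.add(f);
        }

        /** Answer locally, then forward the query to every neighbour until the TTL runs out. */
        void query(String keyword, int ttl, Set<Peer> visited, List<String> hits) {
            if (ttl < 0 || !visited.add(this)) return;
            for (String f : sharedFiles) {
                if (f.contains(keyword)) hits.add(f + " @ " + name);
            }
            for (Peer n : neighbours) n.query(keyword, ttl - 1, visited, hits);
        }
    }

    public static void main(String[] args) {
        Peer a = new Peer("peerA", "census1991.csv");
        Peer b = new Peer("peerB", "landuse_map.tif");
        Peer c = new Peer("peerC", "census2001.csv");
        Peer d = new Peer("peerD", "river_network.shp");

        // peerA is connected to several other computers at once.
        a.neighbours.add(b);
        a.neighbours.add(c);
        b.neighbours.add(d);

        List<String> hits = new ArrayList<>();
        a.query("census", 3, new HashSet<>(), hits);
        System.out.println("results from many sources: " + hits);
    }
}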
LimeWire Peer Server
This product allows Gnutella to service queries beyond file sharing. Utilising XML and the Java
platform, the LimeWire peer server allows various data sources to be easily integrated into the
Gnutella network, and allows these data sources to be conveniently queried to satisfy customer
needs.
The LimeWire peer server can integrate data into the Gnutella network in a few simple steps,
whether the data reside in a database, are streamed as a feed, or are exposed through ODBC. Firstly,
mappings between the data entity and popularly distributed XML schemas are created. Using
LimeWire 1.8, Gnutella users can meta-search the network based on these XML schemas.
Through this mechanism, Gnutella becomes a dynamic information platform that allows users
to specify and retrieve the information they seek.
For example, a company that maintains a web site for finding rentable apartments can integrate
its data store into Gnutella using the LimeWire peer server. Gnutella users can then search for
apartments as necessary; a matching "meta-search" would bring up a link to the company's
website. At that point, the traditional business cycle would resume. It is clear that the LimeWire
peer server can make life more convenient for users and more profitable for businesses.
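A hedged sketch of the kind of mapping described above is given below. It is not the LimeWire peer server API; the XML schema, field names and matching rule are invented for illustration. A record from a local data store is turned into a small XML metadata document, and an incoming rich query is matched against a named field of that document.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch (not the real LimeWire peer server API): a local data
 * record is mapped to a simple XML metadata document, and a "meta-search"
 * matches queries against a named field of that document.
 */
public class MetaSearchSketch {

    /** Map one record from a local data store to an XML document (invented schema). */
    static String toXml(Map<String, String> record) {
        StringBuilder xml = new StringBuilder("<apartment>");
        for (Map.Entry<String, String> field : record.entrySet()) {
            xml.append("<").append(field.getKey()).append(">")
               .append(field.getValue())
               .append("</").append(field.getKey()).append(">");
        }
        return xml.append("</apartment>").toString();
    }

    /** Crude field match: does the document contain <field>value</field>? */
    static boolean matches(String xml, String field, String value) {
        return xml.contains("<" + field + ">" + value + "</" + field + ">");
    }

    public static void main(String[] args) {
        // Two records from a hypothetical data store.
        Map<String, String> flat1 = new LinkedHashMap<>();
        flat1.put("city", "Edinburgh");
        flat1.put("bedrooms", "2");
        Map<String, String> flat2 = new LinkedHashMap<>();
        flat2.put("city", "Colchester");
        flat2.put("bedrooms", "3");

        List<String> published = new ArrayList<>();
        published.add(toXml(flat1));
        published.add(toXml(flat2));

        // A rich query arriving over the network: city = Edinburgh.
        for (String doc : published) {
            if (matches(doc, "city", "Edinburgh")) {
                System.out.println("hit: " + doc);
            }
        }
    }
}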
The LimeWire peer server can leverage any existing technology platform. The foundation of
Java and XML ensures that the peer server is completely cross-platform; reliance on JDBC
ensures integration of many data source variants. Moreover, the peer server supports popular
web interfaces, such as CGI, JSP, and ASP. The LimeWire peer server unites two revolutionary
aspects of the World Wide Web: a powerful search mechanism and e-commerce.
The Windows, OS X, and general Unix versions of the LimeWire peer server are now available
for download.
LionShare (http://lionshare.its.psu.edu/main/)
The LionShare project began as an experimental software development project at Penn State
University to assist faculty with digital file management. The project has now grown to be a
collaborative effort between Penn State University, Massachusetts Institute of Technology
Open Knowledge Initiative, researchers at Simon Fraser University, and the Internet2 P2P
Working Group.
Penn State researchers identified key deficiencies in the ability of existing technologies to
provide Higher Education users with the necessary tools for digital resource sharing and group
collaboration. The LionShare project was initiated to meet these needs.
The LionShare P2P project is an innovative effort to facilitate legitimate file sharing among
individuals and educational institutions around the world. By using peer-to-peer technology and
incorporating features such as authentication, directory servers, and owner controlled sharing
of files, LionShare provides secure file-sharing capabilities for the easy exchange of image
collections, video archives, large data collections, and other types of academic information. In
addition to authenticated file-sharing capabilities, the developing LionShare technology will also
provide users with resources for organising, storing, and retrieving digital files.
Currently, many academic digital collections remain "hidden" or difficult to access on the
Internet. Through the use of LionShare technology, users will be able to find and access these
information reservoirs in a more timely and direct manner, employing one rather than multiple
searches. LionShare will also provide users with tools to catalogue and organise personal files
for easier retrievals and enhanced sharing capabilities.
The LionShare development team anticipates a beta launch in Autumn of 2004 with the
software's final release in 2005. The software will be made available to the general public under
an open source license agreement. The initial LionShare prototype would not have been
possible without the source code and assistance of the LimeWire open source project.
LionShare Architecture
LionShare is based on LimeWire, and has both decentralised and centralised topology with a
P2P and client/server architecture.
Figure 8. Client/Server Architecture
LionShare consists of the LionShare peer, a peer server, security and repositories. The peer server is
based on the Gnutella protocol and the security model is similar to Shibboleth.
Figure 9. Peer Server Architecture
Shibboleth Development
The Joint Information Systems Committee (JISC) aims to adopt Shibboleth technology as
the principal framework for authentication and authorisation within the JISC Information
Environment. In support of this, EDINA is leading a group of partners within the University of
Edinburgh to advance this work under the Shibboleth Development and Support Services
project (SDSS).
This project is funded by the Core Middleware Technology Development Programme, a JISC
initiative which supports a total of 15 projects to develop and explore this technology. SDSS will
act in an enabling role for these projects by providing prototype elements of the infrastructure
necessary for this national activity.
These elements include:
 a national certificate-issuing service for SSL-based secure communication between
institutional and national servers;
 a "where are you from" (WAYF) service required to direct users to the appropriate
authentication point;
 development of Shibboleth-enabled target services, by adding this capability to a number of
existing live services operated by EDINA;
 auditing, monitoring and support for use of the initial Shibboleth infrastructure.
It is intended that these prototype services will eventually be replaced by industry-strength
solutions before the end of the project in 2007. Meantime, they will provide a live test bed which
will enable interworking between other Core Middleware projects within a service environment.
Main Features and Benefits of LionShare
LionShare enjoys the benefits of P2P network sharing but with expanded capabilities, and the
elimination of typical P2P drawbacks, all within a trusted, academic oriented network. The
inclusion of access control, defined user groups, persistent file sharing and the ability to search
both the P2P network and centralised repositories makes this software differ from other P2P
networks.
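The owner-controlled sharing described above can be sketched as follows. The fragment below is not LionShare or Shibboleth code; the group names, user attributes and decision rule are invented for illustration. A shared file carries a list of allowed groups, and a request is served only if the authenticated user presents a matching group attribute.

import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative sketch (not LionShare or Shibboleth code): owner-controlled
 * sharing where a file is released only to users whose authenticated group
 * attributes intersect the groups the owner has allowed.
 */
public class SharingPolicySketch {

    static class SharedFile {
        final String name;
        final Set<String> allowedGroups = new HashSet<>();
        SharedFile(String name, String... groups) {
            this.name = name;
            for (String g : groups) allowedGroups.add(g);
        }
    }

    static class User {
        final String id;
        final Set<String> groups = new HashSet<>();
        User(String id, String... groups) {
            this.id = id;
            for (String g : groups) this.groups.add(g);
        }
    }

    /** Serve the file only if the user belongs to at least one allowed group. */
    static boolean mayDownload(User user, SharedFile file) {
        for (String g : user.groups) {
            if (file.allowedGroups.contains(g)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SharedFile fieldPhotos = new SharedFile("fieldwork-photos.zip",
                "geography-dept", "gis-class-2004");
        User student = new User("student42", "gis-class-2004");
        User outsider = new User("anon", "public");

        System.out.println(mayDownload(student, fieldPhotos));  // true
        System.out.println(mayDownload(outsider, fieldPhotos)); // false
    }
}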
Example Uses of LionShare
Listed below are some examples of how LionShare can be used in the HE environment for
teaching, learning, research and collaborative efforts:
 improved peer-to-peer networking to provide an information gathering tool - all the personal
file sharing capabilities of Kazaa and Gnutella plus expanded search capabilities to include
special academic databases with only one search query. A permanent storage space is
included for selected personal files to be shared even when the user is not logged on;
 controlled access - a seamless trust fabric could be established between institutions. If a
user is logged on at one institution, they will be able to access the resources from all
partner institutions in a secure, safe manner. Users can specify who to share each file with.
This can range from institutions, disciplinary areas, departments, specific classes,
individuals or even custom defined groups. This is useful when users wish to limit access
due to copyright considerations;
 tools to help organise and use the files - many easy-to-use options are available for users
to organise their own collections which in turn makes sharing with others easier. Users can
easily organise and refine the descriptions of files to be shared, so that keywords, headings
and collection information will identify and organise shared files. There is built in support for
different file types and slideshows;
 tools for collaboration - collaborative tools can facilitate the joint efforts of user defined
groups. Chat rooms and bulletin boards can be created for user-defined groups.
POOL and SPLASH (http://www.edusplash.net/)
The Portal for Online Objects in Learning (POOL) Project is a consortium of several
educational, private and public sector organisations formed to develop an infrastructure for
learning object repositories. The POOL project ran until 2002 and aimed to build an
infrastructure for connecting heterogeneous repositories into one network. The infrastructure
used P2P in which nodes could be individual repositories (called SPLASH), or community
repositories (PONDS). The POOL network used JXTA and followed the CanCore/IMS metadata
profile (Hatala 2003).
SPLASH is a P2P repository developed by Griff Richards and Marek Hatala at Simon Fraser
University in Canada, as part of the wider Portal for Online Objects in Learning (POOL)
consortium (Kraan 2004). SPLASH is a freely available desktop program that provides storage for
learning objects used or collected by individuals. SPLASH enables users to create metadata on
the individual's file system or on the web. The peer-to-peer protocol is used to search for
learning objects on other peers and allows for file transfer between peers (Hatala 2003).
As SPLASH is built on peer-to-peer technology, SPLASH programs can talk directly to each other
over the network, without the need for a server. Each group of SPLASHes (i.e. a POND) has a
'head' SPLASH which acts as a gateway to the major internet backbones (POOLs). This means that
objects can still be found on PONDs, even if the machine a SPLASH is installed on is off the
network, and searching is faster. Existing, conventional repositories can be elevated to POND status
either by adding a SPLASH interface to the repository, or by having SPLASHes talk to a gateway
that speaks the eduSource Communication Layer (ECL, a national implementation of the IMS
Digital Repository Interoperability specification) to other repositories at the other end (Kraan 2004).
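The POND arrangement can be pictured with the following sketch, which is not SPLASH source code; the class names and registration scheme are invented for illustration. Member peers register the metadata of their objects with a 'head' peer, so a search sent to the head can still be answered even while the machine holding an object is off the network.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch (not SPLASH source code): member peers register the
 * metadata of their learning objects with a "head" peer for the group (a
 * POND), so objects can still be discovered through the head even while the
 * machine holding the object itself is off the network.
 */
public class PondSketch {

    /** The head peer keeps a registry of object title -> owning peer. */
    static class HeadPeer {
        private final Map<String, String> registry = new HashMap<>();

        void register(String peerName, String objectTitle) {
            registry.put(objectTitle, peerName);
        }

        List<String> search(String keyword) {
            List<String> found = new ArrayList<>();
            for (Map.Entry<String, String> e : registry.entrySet()) {
                if (e.getKey().contains(keyword)) {
                    found.add(e.getKey() + " (held by " + e.getValue() + ")");
                }
            }
            return found;
        }
    }

    public static void main(String[] args) {
        HeadPeer pond = new HeadPeer();

        // Two member SPLASHes register their holdings with the head peer.
        pond.register("splash-laptop", "Introductory GIS exercises");
        pond.register("splash-office", "Map projection tutorial");

        // splash-laptop may now be switched off; its object is still findable.
        System.out.println(pond.search("GIS"));
    }
}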
5.3 Distribution from a Traditional (Centralised) Archive
Traditional archiving is considered to be a central site where data are deposited. The norm is
for the organisation to provide quality control and support for both users and depositors. The
level of support required differs according to the type of material being archived, e.g. text
documents will need less support than more complex materials such as geospatial datasets.
The advantages of using an archive are that the data are quality checked, preserved and
disseminated by the organisation. The limitations are that both the documents and the data
must be archived at a central site where metadata or bibliographic indexes have to be created
by skilled specialists, consuming human and financial resources at the central site.
5.3.1 Copyright/IPR
Usually copyright is retained by the author, which may be defined as the individual, the
organisation or the institution. For example, the UK Data Archive licence agreement states that it is:
“a non-exclusive licence which ensures that copyright in the original data is not transferred by
this Agreement and provides other safeguards for the depositor, such as, requesting
acknowledgement in any publications arising from future research using the data.”
If a piece of work is completed as part of employment, the employer will retain copyright in the
work, but the creator will retain IPR. If, however, a creator is commissioned to produce a piece of
work on behalf of someone else, the creator will retain copyright in that work. Copyright is owned
by the author of a letter or other written communication, i.e. a fax or email.
Centralised archives which use registration, such as Athens, also have the advantage of being
able to track who has used data and for what declared purposes, thus safeguarding IP.
5.3.2 Preservation
Centralised archives preserve materials deposited with them to ensure continued access
wherever possible. Each archive has its own individual preservation policies.
5.3.3 Cost
Costs to users from a non-commercial community, such as academia, are covered, so the data are
free at the point of use. The cost to the data creator is dependent on grant applications and time
spent on issues such as copyright, changing data formats where necessary and anonymising
the data if required. A centralised archive can be a cost effective way for a data creator to store
and preserve their data. Concentrating specialist support into a central organisation may also
be cost effective when compared to a distributed system which might see a duplication of effort
in each host institution.
On the internet, the more traffic a site receives, the more it costs, because the bandwidth usage
is concentrated on one server and one network. In this particular respect, a centralised archive
is therefore more expensive than a decentralised network.
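A rough worked example makes this cost difference clearer. All of the figures in the fragment below (file size, number of downloads, price per gigabyte, number of peers) are invented purely for illustration: with a single central server every download crosses the archive's own connection and is paid for centrally, whereas in a peer-to-peer network the same traffic is spread across many peers.

/**
 * Back-of-envelope illustration only: all figures are invented. Compares the
 * bandwidth a central archive must pay for with the per-peer load when the
 * same downloads are spread across a P2P network.
 */
public class BandwidthSketch {
    public static void main(String[] args) {
        double datasetGb = 2.0;        // size of one dataset (GB), assumed
        int downloads = 10_000;        // downloads per month, assumed
        double costPerGb = 0.10;       // hosting cost per GB transferred, assumed
        int peersSharing = 500;        // peers also serving the file, assumed

        double centralTrafficGb = datasetGb * downloads;            // 20,000 GB
        double centralCost = centralTrafficGb * costPerGb;          // all paid centrally

        double perPeerTrafficGb = centralTrafficGb / peersSharing;  // 40 GB each

        System.out.printf("Central server carries %.0f GB, costing about £%.0f%n",
                centralTrafficGb, centralCost);
        System.out.printf("Spread over %d peers, each carries only about %.0f GB%n",
                peersSharing, perPeerTrafficGb);
    }
}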
5.3.4 Case Studies
The following case studies illustrate the advantages and disadvantages of centralised
archiving.
The National Archives
The UK National Digital Archive of Datasets (NDAD) is operated by the University of London
Computer Centre (ULCC) on behalf of The National Archives (formerly the Public Record Office
(PRO)). Its aim is to conserve and, where possible, provide access to many computer datasets
from Central Government departments and agencies, which have been selected for
preservation by the PRO. Information about the data and their accompanying documentation
can be browsed free of charge through the NDAD website.
NDAD is one of very few digital archives in the UK which not only preserves but also provides
access to electronic records. It stores and catalogues its holdings according to archival
practice, and makes them accessible to the public via its website. Paper documentation
which accompanies the electronic records and provides a context for them, is digitised and
made available on the site too, both as a scanned image and as a plain text file. This national
data repository service offers fast network access to extremely large amounts of data (up to
300 terabytes).
UK Data Archive
The UK Data Archive (UKDA), established in 1967, is an internationally-renowned centre of
expertise in data acquisition, preservation, dissemination and promotion; and is curator of the
largest collection of digital data in the social sciences and humanities in the UK. It houses
datasets of interest to researchers in all sectors and from many different disciplines. It also
houses the AHDS History service and the Census Registration Service (CRS). Funding comes
from the Economic and Social Research Council (ESRC), the JISC and the University of Essex.
The UKDA provides resource discovery and support for secondary use of quantitative and
qualitative data in research, teaching and learning. It is the lead partner of the Economic and
Social Data Service (ESDS) and hosts AHDS History, which provides preservation services for
other data organisations and facilitates international data exchange.
The UKDA acquires data from public sector (especially central government), academic, and
commercial sources within the United Kingdom. The Archive does not own data but holds and
distributes them under licences signed by data owners. Data received by the Archive go
through a variety of checks to ensure their integrity. The management of data is an essential
part of long-term preservation and fundamentally the processing will strip out any software or
system dependency so that the data can be read at any time in the future. To ensure longevity,
clearly understood structures and comprehensive documentation are essential.
The UKDA catalogue system is well known for its comprehensiveness and the employment of a
controlled vocabulary based upon the use of a thesaurus (known as HASSET, Humanities and
Social Science Electronic Thesaurus). Currently a multilingual version of the thesaurus is being
developed under an EU funded project. The search fields used within the catalogue are part of
a standard study description agreed in the early 1980s by an international committee of data
archivists. This has subsequently been developed and adopted as the international Data
Documentation Initiative (DDI).
Once data are checked, stored and catalogued, they are made available to the user
community. Data are available in a number of formats (typically the main statistical packages)
and on a number of media, for example, CD ROMs, although the main medium used is online
dissemination.
In addition to the delivery process, the UK Data Archive provides a one-stop-shop support
function (by phone or email) to users of the ESDS, CRS and AHDS History services. This
service also extends to data depositors. In addition, a network of organisational representatives
has been established and regular data workshops are arranged.
In order to obtain free access to a wide range of data sources, licences are negotiated resulting
in consistent access conditions being agreed and managed centrally for the academic
community. Appropriate registration and authentication processes are implemented to ensure
that data are not made available outside the constraints of the licence. For these reasons an
authentication system is necessary in order to ensure that all users are aware of the conditions
under which they can access the data.
The UK Data Archive is also a member of CESSDA, the consortium of European Social
Science Data Archives, each of which serves a similar role to the UKDA in its individual
country. Since the late 1990s the UKDA has collaborated with its CESSDA partners to
develop software systems to permit cross national searching for data and online data browsing.
The result of this programme of work is the Nesstar software architecture on which the
CESSDA Integrated Data Catalogue (IDC) is built. An enhanced version of the IDC will be
released during the coming year. The UKDA, therefore, is an example of a traditional archive (it
serves this function, for the social sciences, within the UK) which has established mechanisms
for managing data exchange across similar organisations. Authentication is Athens based.
5.4 Data Distribution Systems Summary
The two main models under which a repository could operate are a centralised approach and
some form of distributed model.
The main characteristics of the centralised archive are:
 storage and distribution of data from a single location;
 centralised access control over the supply and re-use of data;
 checking, cleaning and processing of data according to standard criteria;
 centralised support service, describing the contents of the data, the principles and practices
governing the collection of data and other relevant properties of data;
 cataloguing technical and substantive properties of data for information and retrieval and
offering user support following the supply of data.
The main characteristics of the typical distributed model include:
 data holdings distributed over various sites;
 data disseminated to users from each of the different sites, according to where the data is
held;
 the various suppliers of data ideally networked in such a way that common standards and
administrative procedure can be maintained, including agreements on the supply and use
of data;
 a single point of entry into the network for users, together with some form of integrated
cataloguing and ordering service.
Until now the centralised approach has dominated the European scene although this is
beginning to be challenged. The distributed model has been used in the US successfully for
many years and has more recently been operated by the Arts and Humanities Data Service
(AHDS) in the UK.
There are advantages to both the centralised and distributed approaches. There are
advantages to be had from a single institution liaising with all depositors and users. Some
economies of scale are likely with a centralised service. Centralised, possibly national, services
may also be better able to promote national standards for the documentation of research data,
new developments and innovation including cross-national linkage, to promote research and
teaching which values the use of secondary data and to police usage of the data supplied.
These advantages might also, however, be reproduced to varying degrees through a hub and
spoke model with the hub generating much of the benefit provided by a centralised model.
Technological developments make this increasingly possible. The distributed model can allow
for the development and support of pools of knowledge and expertise and for higher volume
usage in clusters related to particular datasets.
Under a hub and spoke model, a centralised facility might be responsible for the most heavily
used datasets. Other datasets could then be held and provided by a network of distributed
centres. This would allow for specialisation within particular satellite centres. Terms and
conditions relating to deposit, access and standards could then be co-ordinated centrally.
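One way to picture the hub and spoke arrangement is sketched below. The dataset names, centre names and routing rule are invented for illustration: requests for the most heavily used datasets are answered by the central hub, while other requests are routed to the subject-specialist satellite that holds the data.

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of a hub-and-spoke routing rule: heavily used datasets
 * are served by the central hub, other datasets by a subject satellite.
 * All names are invented for illustration.
 */
public class HubAndSpokeSketch {

    private static final Map<String, String> SATELLITE_FOR_SUBJECT = new HashMap<>();
    static {
        SATELLITE_FOR_SUBJECT.put("archaeology", "archaeology satellite");
        SATELLITE_FOR_SUBJECT.put("social science", "social science satellite");
    }

    /** Route a request: popular datasets go to the hub, the rest to a satellite. */
    static String route(String dataset, String subject, boolean heavilyUsed) {
        if (heavilyUsed) return "central hub serves " + dataset;
        String satellite = SATELLITE_FOR_SUBJECT.getOrDefault(subject, "central hub");
        return satellite + " serves " + dataset;
    }

    public static void main(String[] args) {
        System.out.println(route("Census boundary data", "social science", true));
        System.out.println(route("Roman settlement survey", "archaeology", false));
    }
}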
Self Archiving
Advantages:
 rapid, wide and free dissemination of data across the web;
 documents are stored electronically on a publicly accessible website;
 documents can be deposited in either a centralised archive or a distributed system;
 easy access to software;
 can be published by the author at any stage;
 subject or organisation based, therefore specific to a given field;
 cheaper than centralised archiving as quality checking is done by peer review;
 extendible;
 most self archives use Dublin Core or OAI and are therefore interoperable;
 some self archives are duplicates of printed material, therefore preservation is not an issue.
Disadvantages:
 wide dispersal can lead to problems finding the data;
 copyright - the author retains pre-print but not post-print copyright. The author cannot self-archive if paid royalties by the publisher or if the publisher holds exclusive copyright. It can be difficult to enforce copyright;
 ownership can be disputed if the data creator moves organisations;
 rights over metadata are not clear;
 preservation concerns over interoperability of the online medium and maintenance of the archive;
 lack of access control;
 no quality control of metadata;
 no or little reputation means lack of confidence in the method;
 concern over the long term prospects of the archive;
 lack of licensing;
 cost of the data creator's time and resources;
 relies on peer review for quality control;
 user support reliant on the goodwill of the publisher or data creator;
 requires cultural change to ensure researchers submit content to repositories.
Software:
 EPrints, DSpace, Kepler, CERN.

Peer-2-Peer
Advantages:
 all parties are equal;
 data stored locally, therefore no central repository/server needed;
 simpler than client/server architectures;
 more robust than centralised servers as it eliminates reliance on centralised servers that are potential critical points of failure;
 can search both the P2P network and centralised repositories;
 control over a closed network where individuals can set passwords;
 scalable.
Disadvantages:
 no overall control over copyright and licences;
 ownership can be disputed if the data creator moves organisations;
 no quality control;
 technical infrastructure does not offer good performance under heavy loads;
 use of hard drives as nodes could lead to locally slow performance;
 must have a sufficient number of nodes to work successfully;
 unreliability of peers, which may be switched off;
 no central support system;
 difficult to enforce rules, such as copyright;
 network favours client rather than server usage, therefore creating bandwidth problems;
 relies on metadata for data retrieval;
 difficult to manage as there is no central control;
 user support reliant on the goodwill of the publisher or data creator;
 cost of the data creator's time and resources;
 concern over the long term prospects of the archive.
Software:
 JXTA, Gnutella, FreeNet, LimeWire, LionShare.

Centralised Archiving
Advantages:
 both documents and metadata are held and managed by the archive, therefore preservation and copyright are not an issue as they are negotiated and managed centrally;
 quality control of metadata and data;
 user support available;
 resource discovery supported and provided centrally;
 reputation of the organisation promotes trust and confidence in data creators;
 required by grant award providers;
 IPR remains with the data creator;
 existence published by the archive;
 administrative duties carried out by the archive;
 use of data can be monitored;
 access can be managed centrally, e.g. using Athens;
 can support a large number of data users;
 can create own search facilities.
Disadvantages:
 both documents and metadata are held by the archive, which requires specialised staff skills and therefore creates human and financial constraints;
 dependency on a single network may lead to bandwidth issues, especially with large volumes of data, although the archive could just hold the metadata.
Software:
 archive specific.
Table 5. Comparison of the Three Data Distribution Systems
6.0 Conclusions
Whichever method for data distribution is adopted, there are several issues which need to be
addressed and these are discussed below.
User and Depositor Support
Centralised archives would provide the best user and depositor support as resources are
focused in one location. These types of archive are often specialised in an academic discipline
and can therefore provide the most efficient service. There are examples of traditional archives
which work together as one organisation, while still providing specialist support at the individual
level. One example of this is ESDS (Economic and Social Data Service), where specialist skills
remain at the UK Data Archive, ISER, MIMAS and the Cathie Marsh Centre for Census and
Survey Research.
Self-archiving and P2P networks do not provide user and depositor support. Instead they rely
on in-house help at the data holder's organisation, such as a library support system based at a
university.
Metadata Standards
The use of the Go-Geo! HE/FE metadata application profile, which was based on FGDC and
updated to include ISO 19115 in 2004, should be considered as a template for the distribution
or storage of metadata records for Go-Geo! If a simple standard is used, such as Dublin Core,
which only has a few elements, then not all of the mandatory elements will be displayed on the
Go-Geo! portal and information about the datasets will be lost. From the case studies discussed
in this report it seems at present that self-archiving and centralised archiving mainly rely on
Dublin Core as it promotes interoperability and preservation. However, the use of Dublin Core
is too simplistic for geospatial metadata records. In contrast, P2P can be more flexible,
although difficulties lie in the lack of control and enforcement of standards and protocols. There
is some evidence that centralised archives will incorporate geospatial elements into the
standards they use, for example the DDI has already begun on this path by adopting the FGDC
bounding rectangle and polygon elements. However, it is unlikely that they would
retrospectively create records using the Go-Geo! HE/FE metadata application profile.
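The loss of geospatial detail can be illustrated with a simplified example. The records below are invented and only a handful of elements from each standard are shown: Dublin Core can carry spatial extent only as free text in its coverage element, whereas an ISO 19115-style record can hold an explicit bounding box that a portal can search spatially.

/**
 * Simplified illustration only. The record content is invented and only a few
 * elements of each standard are shown. Dublin Core carries spatial extent as
 * free-text coverage, while an ISO 19115-style record can hold an explicit
 * geographic bounding box that a portal can search spatially.
 */
public class MetadataComparisonSketch {
    public static void main(String[] args) {
        String dublinCore =
            "<record>\n" +
            "  <dc:title>Land cover survey of the Scottish Borders</dc:title>\n" +
            "  <dc:creator>Example Research Group</dc:creator>\n" +
            "  <dc:coverage>Scottish Borders, UK</dc:coverage>\n" +
            "</record>";

        String iso19115Style =
            "<EX_GeographicBoundingBox>\n" +
            "  <westBoundLongitude>-3.55</westBoundLongitude>\n" +
            "  <eastBoundLongitude>-2.05</eastBoundLongitude>\n" +
            "  <southBoundLatitude>55.10</southBoundLatitude>\n" +
            "  <northBoundLatitude>55.95</northBoundLatitude>\n" +
            "</EX_GeographicBoundingBox>";

        System.out.println("Dublin Core (free-text extent):\n" + dublinCore);
        System.out.println();
        System.out.println("ISO 19115 style (searchable extent):\n" + iso19115Style);
    }
}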
Ease of Use
All of the data distribution systems described in this report have their own advantages and
disadvantages when it comes to ease of use. The centralised archive has the support to assist
data creators in the submission of metadata. Self-archiving and P2P archiving are more flexible
as the data creator can deposit when they wish to and update the datasets when they like. This
point is particularly relevant to P2P which, in theory, should be available all of the time.
Quality Control
Distributed networks, such as self-archives and P2P networks, do not have any central control
over data. This means that there is a lack of quality control protocols in place. Self-archiving
has tried to solve this problem by using the peer review system. This system assumes that
suitable peers can be found in the first place and that they are willing to undertake this task
long-term to ensure continuity in quality standards.
Copyright and Licensing
For distributed archives, copyright and licensing must lie with the data creator's host institution
as this is where the data are stored. At a more traditional archive, licences are drawn up and
signed by depositors and also by users, who must sign an agreement as to the intended use of the
data. This point is very important when the usage of data is restricted to one community, such
as academia.
Licences are also a means of mediating IP where the data producer has used material for
which they do not hold copyright. This issue probably affects the majority of geospatial datasets
in the UK as materials such as Ordnance Survey data are incorporated into the geospatial data.
Without centralised control over licences, issues over copyright will occur.
Cost
The cost of self-archiving and P2P lies mainly with the data creator, as it is they who will spend
the time setting up the system (in the case of P2P) and submitting data (in the case of
self-archiving). The cost of centralised archiving lies with the archive, which must pay for overheads,
staff expertise and time, resources etc. The cost to the user in most cases will be nothing
unless the data is used commercially.
Security
A distributed system provides greater security against denial of service attacks and network
crashes as the system does not rely on one data source or network. A centralised archive could
provide greater data security and integrity as access is controlled and usage of the data is
monitored.
Preservation
The traditional way to preserve material would be to deposit data in a traditional archive.
Although this is still the most effective, reassuring way of preserving data, there are other
issues to consider. Firstly, as self-archiving deals with copies of prints, preservation of
electronic copies may not be an issue. If the data creator is based at a university they may be
able to rely on the university library for data preservation, depending on in-house policies.
Another option would be to use another service such as demonstrated in the Thesis Alive
project. The preservation of metadata records is an issue which needs further consideration.
Solutions
One solution could be to combine the advantages of using centralised archiving with a
distributed system. The creation of a distributed archive could be subject based and would
involve using several archives such as ADS and UKDA to hold the datasets and metadata
which they would normally hold, such as archaeology and social science respectively. In this
way, the advantages of using a well established, reputable archive which offers quality control
and user support can be combined with the advantage of not holding data in one place,
therefore not depending on one single network. This ensures that, whilst only good quality
metadata is published, the system will not be slowed down. It also means that data creators can
store their data in the place which they would normally do so and therefore there is less likely to
be duplication of data. Creating subject repositories of this kind would also allow for the creation
of subject-based user support systems, where each repository would be able to provide specialist
user support. One major problem with this solution would be that the various archives already
use their own metadata standards and are unlikely to want to hold two different standards or
convert their standard into a geospatial one. Some confusion may also arise as data creators
may not always know which subject repository their dataset is most suited to.
Another solution would be to set up a completely distributed system where repositories are set
up at the individual organisations that have created the data. For example, AHDS Archaeology
holds archaeological data and AHDS History holds historical data. This type of system would
rely on the institutions providing quality control and user support, which could be achieved
through current archive or library procedures. It may also lead to duplication of data and
inconsistencies in style, for example metadata standards which again would probably not be
compliant with the Go-Geo! application profile.
Another solution would be to hold the datasets and associated metadata at an established
archive, such as UKDA. This is likely to be a welcome scenario as provisions will be made for
user and depositor support, preservation, licensing and data security. Again, the issue of
differences between metadata standards creates difficulties with this option. More investigation
is needed into this issue though as further changes may be made to standards such as the DDI
in the future which could make it compliant enough with the Go-Geo! application profile to
warrant this as a feasible option.
7.0 Requirements and Recommendations for Data Distribution for the Go-Geo! Portal
The following requirements were made by data creators and depositors for a data distribution
system to provide:
 access to data online;
 copyright agreements and data integrity;
 user and depositor support;
 preservation and archiving policies;
 a system which is both easy and fast to use;
 acceptance of supporting material for storage;
 a service which is free to both depositors and users.
There is also a need for the system to provide:
 large bandwidth;
 secure and stable network;
 storage facilities.
From the findings of this report and with the benefit of feedback received from the metadata
creation initiative, it is clear that the portal technology has two potential functions: as a national
service providing access to geospatially referenced data; and as a tool for institutional
management of resources which are likely to be restricted to local use.
For the portal to function most effectively and in line with user expectations as a national
resource, it seems that the only practical solution would be to consider the creation of a
national, centralised geospatial repository which would provide geospatial data users and
creators with support tailored to be compliant with the FGDC and ISO 19115
standards. Data access could be centrally controlled and usage of data monitored. Copyright,
licensing and preservation could all be dealt with in a single repository to provide an efficient
data distribution and storage system for the Go-Geo! portal.
If, on the other hand, the portal is to be used as an institutional tool, the need for a central place
of deposit becomes, in theory, less pressing as issues such as IP, authorisation and user
support become less important. Consequently, for institutional use, self-archiving may be the
solution.
However, it needs to be clearly understood that the two functions do not naturally lead one to
the other. The attraction of institutional repositories seems to be that information is managed,
made available and remains within the control of the institution. Whilst there is general
agreement that data should be shared nationally, there is a clear lack of commitment to putting
local resources into removing the barriers that would make this happen. It is difficult to see how
an automated system can resolve issues of IP, user support and access control without human
intervention in the form of a decision making body.
Our only recommendation therefore is that the enormous potential of the portal should be
recognised. The evaluation work has identified two important needs for the management of
geospatially referenced data: one is the need for a system for institutional management; the
other is the need for an enhanced national service for geospatial resources with an emphasis
on access to data. The portal has the potential to serve both these needs and we would like to
see further funding to assess the feasibility and costs of fulfilling both these needs using Go-Geo! technology.
Bibliography
AfterDawn.com. (May 2003). P2P Networks Cost too Much for ISP. Available at:
http://www.afterdawn.com/news/archive/4106.cfm.
Andrew, T. (2003). Trends in Self-Posting of Research Material Online by Academic Staff.
Ariadne Issue 37. Available at: http://www.ariadne.ac.uk/issue37/andrew/.
Breidenbach, Susan. (July 2001). Peer-to-Peer Potential. Network World. Available at:
http://www.nwfusion.com/research/2001/0730feat.html.
Creedon, Eoin et al. (2003). GNUtella. Available at: http://ntrg.cs.tcd.ie/undergrad/4ba2.0203/p5.html.
Fleishman, Glenn. (May 2003). It Doesn't Pay to be Popular. Available at:
http://www.openp2p.com/pub/a/p2p/2003/05/30/file_sharing.html.
Harnad S. Distributed Interoperable Research Archives for Both Papers and Their Data: An
Electronic Infrastructure for all Users of Scientific Research. Available at:
http://www.ecs.soton.ac.uk/~harnad/Temp/data-archiving.htm.
Harnad, Stevan. (2001). The Self-Archiving Initiative. Nature 410: 1024-1025. Available at:
http://www.ecs.soton.ac.uk/~harnad/Tp/naturenew.htm.
Harnad, Stevan. (April 2000). The Invisible Hand of Peer Review. Exploit Interactive. Issue 5,
Available at: http://www.exploit-lib.org/issue5/peer-review/.
Hatala, Marek et al. (2003). The EduSource Communication Language: Implementing Open
Network for Learning Repositories and Services. Available at:
http://www.sfu.ca/~mhatala/pubs/sap04-edusource-submit.pdf.
Hatala, Marek and Richards, Griff. (May 2004). Networking Learning Object Repositories:
Building the eduSource Communications Layer. Oxford.
ITsecurity.com. (5 March, 2004). Workplace P2P File Sharing Makes Mockery Of Internet
Usage Policies Says Blue Coat. Available at:
http://www.itsecurity.com/tecsnews/mar2004/mar59.htm.
Kraan, Wilbert. (March 2004). Splashing in Ponds and Pools. Available at:
http://www.cetis.ac.uk/content2/20040317152845.
Krikorian, Raffi. (April 2001). Hello JXTA! Available at:
http://www.onjava.com/pub/a/onjava/2001/04/25/jxta.html.
Krishnan, Navaneeth. (October 19, 2001). The JXTA Solution to P2P: Sun's New Network
Computing Platform Establishes a Base Infrastructure for Peer-to-Peer Application
Development. Available at: http://www.javaworld.com/javaworld/jw-10-2001/jw-1019-jxta.html.
Maly, K. et al. (2003). Kepler Proposal and Design Document. Available at:
http://kepler.cs.odu.edu:8080/kepler/publications/finaldes.doc.
Maly, K. et al. (2001). Kepler - An OAI Data/Service Provider for the Individual. D-Lib Magazine
7(4). Available at: http://www.dlib.org/dlib/april01/maly/04maly.html.
Maly, K. et al. (2002). Enhanced Kepler Framework for Self-Archiving. ICPP-02, pp. 455-461,
Vancouver. Available at: http://kepler.cs.odu.edu:8080/kepler/publications/kepler.pdf.
McGuire, David. (March 31, 2004). Lawmakers Push Prison For Online Pirates.
washingtonpost.com. Available at: http://www.washingtonpost.com/wp-dyn/articles/A401452004Mar31.html.
Nejdl, Wolfgang et al. (May 2002). EDUTELLA: A P2P Networking Infrastructure Based on RDF.
Available at: http://www2002.org/CDROM/refereed/597/index.html.
Olivier, Bill and Liber, Oleg. (December 2001). Lifelong Learning: The Need for Portable
Personal Learning Environments and Supporting Interoperability Standards. The JISC Centre
for Educational Technology Interoperability Standards, Bolton Institute.
Pinfield, Stephen (March 2003). Open Archives and UK Institutions. D-Lib Magazine, Volume 9
Number 3. ISSN 1082-9873. Available at:
http://www.dlib.org/dlib/march03/pinfield/03pinfield.html.
Pinfield, Stephen and Hamish James (September 2003). The Digital Preservation of e-Prints.
D-Lib Magazine, Volume 9, Number 9. ISSN 1082-9873. Available at:
http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/september03/pinfield/09pinfield.html.
Pradhan, Anup and David Medyckyj-Scott. (July 2001). EDINA Requirements Specification and
Solution Strategy.
Seybold, P.A. (2002). Web Services Guide for Customer-Centric Executives, Patricia Seybold
Group, Inc. Boston.
Sinrod, Eric J. (August 2002). E-Legal: A Bill to Combat P2P Copyright Infringement. Law.com.
Available at: http://www.duanemorris.com/articles/article927.html.
Simon Fraser University. (May 4 2004). LionShare. JISC Meeting, Oxford.
Smith, M. et al. (2003). DSpace: An Open Dynamic Digital Repository. D-Lib Magazine Volume
9, Number 1. ISSN 1082-9873. Available at: http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/january03/smith/01smith.html.
Somers, Andrew. Is Sharing Music Over A Network Any Different Than Playing It For A Friend?
Available at: http://civilliberty.about.com/library/content/blP2Prights.htm.
Truelove, Kelly. (April 2001). The JuxtaNet. Available at: http://www.openp2p.com/lpt/a/799.
O'Reilly Media, Inc.
Watson, Ronan. (2003). Freenet. Available at: http://ntrg.cs.tcd.ie/undergrad/4ba2.0203/p7.html.
Wilson, Scott. (September 2001). The Next Wave: CETIS Interviews Mikael Nilsson about the
Edutella Project. Available at: http://www.cetis.ac.uk/content/20010927163232.
Wolf, Donna. (Feb 22 2004). Peer-to-Peer. Available from:
http://searchnetworking.techtarget.com/sDefinition/0,,sid7_gci21769,00.html.
Wood, Jo (). A Peer-to-Peer Architecture for Collaborative Spatial Data Exchange.
SFU Surrey
Glossary
Access Control: Technology that selectively permits or prohibits certain types of data access.
API: Application Programming Interface.
Application Profile: Schema consisting of data elements.
Architecture: The structure or structures of a computer system or software. This structure includes software components, the externally visible properties of those components, the relationships among them and the constraints on their use. It eventually encompasses protocols, either custom-made or standard, for a given purpose or set of purposes.
Authentication: Verifying a user's claimed identity.
Bandwidth: The information carrying capacity of a communication channel.
BOAI: The Budapest Open Access Initiative - a worldwide coordinated movement to make full-text online access to all peer-reviewed research free for all. Important note: BOAI and OAI are not the same thing, but BOAI and OAI do have similar goals.
Creator: The person or organisation primarily responsible for creating the intellectual content of the resource, for example authors in the case of written documents, or artists, photographers and illustrators in the case of visual resources.
Document Delivery: The supply, for retention, of a document (journal article, book chapter etc.) to a third party by means of copying, in compliance with all copyright regulations, and delivering it to the requester (by hand, post or electronically).
EDD: Electronic document delivery - the supply, for retention, of a document (journal article, book chapter etc.) to a third party by means of scanning or, where permitted by publishers, from online versions, in compliance with all copyright regulations, and delivering it to the requester by electronic transfer.
Eprint: An electronically published research paper (or other literary item), or free software for producing an archive of eprints.
Eprint Archive: An online archive of preprints and postprints, possibly, but not necessarily, running eprints software.
FTP: File Transfer Protocol - an internet application protocol.
Harvest: To retrieve metadata from a digital repository; conversely, to take delivery of metadata from a digital repository.
HTTP: Hyper Text Transfer Protocol.
Infrastructure: The underlying mechanism or framework of a system.
Interoperability: The ability of hardware or software components to work together effectively.
Learning Object (LO): Any resource or asset that can be used to support learning. A resource typically becomes thought of as a Learning Object when it is assigned Learning Object Metadata, is discoverable through a digital repository, and can be displayed using an eLearning application.
Node: See 'Peer'.
Open Access: Something anyone can read or view.
OAI: The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.
OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting - a way for an archive to share its metadata with harvesters which will offer searches across the data of many OAI-compliant archives.
OAI Compliant: An archive which has correctly implemented the OAI protocol.
Peer: Two computers are considered peers if they are communicating with each other and playing similar roles. For example, a desktop computer in an office might communicate with the office's mail server; however, they are not peers, since the server is playing the role of server and the desktop computer is playing the role of client. Gnutella's peer-to-peer model uses no servers, so the network is composed entirely of peers. Computers connected to the Gnutella network are also referred to as "nodes" or "hosts".
P2P Network: Internet users are communicating with each other through P2P (peer-to-peer) file sharing software programs that allow a group of computer users to share text, audio and video files stored on each other's computers.
Postprint: The digital text of an article that has been peer-reviewed and accepted for publication by a journal. This includes the author's own final, revised, accepted digital draft; the publisher's edited, marked-up version, possibly in PDF; and any subsequent revised, corrected updates of the peer-reviewed final draft.
Preprint: The digital text of a paper that has not yet been peer-reviewed and accepted for publication by a journal.
Preprint Archive: An eprint archive which only contains preprints.
RDF: Resource Description Framework (http://www.w3.org/RDF/) is a foundation for processing metadata; it provides interoperability between applications that exchange machine-understandable information on the web. RDF emphasizes facilities to enable automated processing of web resources.
Reprint: A paper copy of a peer-reviewed, published article, usually printed by the publisher and given to or purchased by the author for distribution.
Servent: A node with both client and server capabilities.
XML: Extensible Markup Language - a uniform method for describing and exchanging structured data that is independent of applications or vendors.
Z39.50: Z39.50 refers to the International Standard, ISO/IEC 23950: "Information Retrieval (Z39.50): Application Service Definition and Protocol Specification", and to ANSI/NISO/IEC Z39.50. The standard specifies a client/server-based protocol for searching and retrieving information from remote databases.
Appendix A – Data Distribution Survey
The Go-Geo! project is a collaborative effort between EDINA (University of Edinburgh) and the
UK Data Archive (University of Essex) to provide a geo-spatial data portal for the UK Higher
and Further Education community. This web service will provide a tool for discovering,
exchanging, and accessing spatial data and related resources.
The Go-Geo! project is currently investigating different ways by which data could be shared with
others through a distribution system. The aim is that in the future, a data distribution
mechanism will be set up for the Go-Geo! portal. Feedback gained from this questionnaire will
go towards the development of such a data distribution system.
All answers given during this survey will be confidential.
1.0
General
1.1
Which of the following would you choose as a mechanism for data distribution?
Explanations of the mechanisms are given at the bottom of this page (see note 3)
centralised archive and distribution service
centralised repository based on a single self archiving service
distributed service based on a number of self archives located around the country
distributed service with each participating institution having its own self archive
distributed service based upon peer-2-peer (p2p) technology
none of the above
other (please describe)
1.1.1
What factors would affect your choice?
ease of use
cost
need to ensure long-term availability (preservation)
copyright issues
data security
user support
depositor support
other (please specify)
3. An archival dissemination model follows the system employed by the UK Data Archive
whereby academic researchers submit material to a dedicated repository. The materials are
quality checked, preserved and disseminated by the organisation. The archive implements
acquisitions, IP and licensing and quality assurance policies and provides a user and depositor
support facilities.
Self-archiving allows for the deposition of a digital document in a publicly accessible website
and the free distribution of data across the web. A model for GI data would be analogous to the
systems currently being employed within scientific literature but customised for spatial data.
Peer-to-peer is a communications method where all parties are equal. Users within a distributed
network share GI resources through freely available client software that allows the search and
transfer of remote datasets.
1.2
How would you like to see data provided?
CD-ROM
DVD
on-line
other (please specify)
1.3
Should users be able to contact depositors for support or should the data and related
materials be distributed ‘as is’ with no further assistance?
contact depositors directly
‘as is’: no further assistance
1.3.1
Please provide comments below
2.0
Depositing Data
2.1
How big a concern is each of the following to you when it comes to deciding whether to
either deposit data for use by others or use data provided by others? Please rank your
answers from 1-7, with 1 being the most important
Issue
Depositing Data
Using Data
IPR
Confidentiality
Liability
Data quality/provenance
Need for specialist support
Data security
Protecting the integrity of the data
creator
2.1.1
Please provide reasons for the concerns:
2.2
Assuming there are no licensing problems, should available data be:
original datasets
limited to value-added datasets
new datasets unavailable elsewhere within the HE community
new datasets unavailable elsewhere within the GI community
limited to a single version of a dataset
limited to a single format of a dataset
limited to the most recent edition of a dataset
2.3
Should the data be stored in standard and commonly used formats such as ESRI shape
files, CSV, gml?
yes
no
2.3.1
If yes, please specify which formats you would like?
2.4
In terms of submission of data, what factors are likely to be the most important e.g. speed,
ease of use?
2.5
Should anyone be permitted to put data into the system?
yes
no
2.5.1
If no, then to whom should it be restricted?
individuals within academia only
to registered users within academia only
to gatekeepers e.g. data librarians or service providers
your organisation e.g. university, department, institution
anyone
other (please specify)
2.6
Would you wish to be able to track who has taken the data?
yes
no
2.7
Given that guidance notes, quality statements, metadata etc take time and resources to
produce, should depositing of a dataset only be allowed where:
a complete set of supporting information is provided
minimal supporting information is provided
no supporting information is provided
2.7.1
If you selected answer 2.7b, which of the following would you consider to be minimal?
Please tick all that apply
discovery metadata
guidance notes/help files
code books
data dictionary/schema/data model
information on methodology/processing undertaken on data
reference to research papers based on the data
data and metadata quality statements
statement of IPR
statement of copyright & terms of use
other (please specify)
2.7.2
If you selected answer 2.7c, who do you expect would provide support and how would you
expect to monitor the quality of the dataset?
2.8
How important are data quality and accurate metadata?
very important
fairly important
neither important nor unimportant
not very important
not important at all
3.0
Long term Storage and Preservation
3.1
Once deposited, should the data persist indefinitely?
yes
no
3.1.1
If no, should the data have a lifespan that is dependent on:
usage
perceived utility to other users
other (please specify)
3.2
How important is it that standards for spatial data, data transfer and data documentation
are implemented?
very important
fairly important
not very important
not important at all
3.3
Who should be responsible for any conversion or other work required to meet data transfer
and preservation standards?
3.4
Who should be responsible for the active curation of the dataset e.g. data creator/service
provider?
3.5
Who should take responsibility for maintaining a dataset generated by a research team
when that team disbands?
4.0
Licensing and Copyright
4.1
Is copyright of your data:
your own
held jointly (e.g. if you use Ordnance Survey data made available under a separate
agreement)
held by others (please specify who)
4.1.1
Assuming you are the sole copyright owner of a dataset, would you wish to make it
freely available for:
academics only
any user (e.g. academic, commercial or other)
for use within your organisation (e.g. department or university)
none of these
4.1.2
Assuming you do not have sole copyright (e.g. you have used material from Ordnance
Survey) would you:
want to personally manage the additional licensing requirements
ask your department to manage licences
ask your central university administration to manage the licences
prefer to have it managed by a central specialist organisation
other (please specify)
4.3
Do you think that each use of a dataset should incur a royalty charge for the data
creator/s?
yes
no
4.3.1
If yes, please suggest a reasonable charge or a formula for charging (for example, tiered
charge for academic, general public, commercial use):
4.4
Do you think there should be a charge by the service providers to cover service operation
costs? (this is separate from royalty charges)
yes, for users of data
yes, for depositors of data
no
5.0
Sustainability
5.1
In your view, who or which organisation should be responsible for promoting, facilitating
and funding data sharing?
Organisation
Promoting data
sharing
Facilitating data
sharing
Funding a service
which facilitates
data sharing
Funders of data creation
awards
The creators themselves
The institution within
which the creator works
National repositories
(where they exist)
The JISC
Other (please state who)
5.2
Should funders cover researchers' costs of preparing the data for sharing?
yes
no
maybe (please elaborate)
5.2.1
If not, why not and who should fund it?
6.0
Additional Comments
Please complete the section below if you wish to be entered into a prize draw to win a £30
Amazon voucher
Name
Organisation
Position
Email
Thank you for taking part. Your views are important to us
Please return this questionnaire to:
Julie Missen
Projects Assistant
UK Data Archive,
University of Essex
Colchester,
CO4 3SQ
Email: jmissen@essex.ac.uk
Tel: 01206 872269